{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Finding editorial cartoons in the Bulletin\n", "\n", "In another notebook I showed how you could [download all the front pages of a digitised periodical](Get-page-images-from-a-Trove-journal.ipynb) as images. If you harvest all the front pages of *The Bulletin* you'll find a number of full page editorial cartoons under *The Bulletin*'s masthead. But you'll also find that many of the front pages are advertising wrap arounds. The full page editorial cartoons were a consistent feature of *The Bulletin* for many decades, but they moved around between pages one and eleven. That makes them hard to find.\n", "\n", "I wanted to try and assemble a collection of all the editorial cartoons, but how? At one stage I was thinking about training an image classifier to identify the cartoons. The problem with that was I'd have to download lots of pages (they're about 20mb each), most of which I'd just discard. That didn't seem like a very efficient way of proceeding, so I started thinking about what I could do with the available metadata.\n", "\n", "*The Bulletin* is indexed at article level in Trove, so I thought I might be able to construct a search that would find the cartoons. I'd noticed that the cartoons tended to have the title 'The Bulletin', so a search for `title:\"The Bulletin\"` should find them. The problem was, of course, that there were lots of false positives. I tried adding extra words that often appeared in the masthead to the search, but the results were unreliable. At that stage I was ready to give up.\n", "\n", "Then I realised I was going about it backwards. If my aim was to get an editorial cartoon for each issue, I should start with a list of issues and process them individually to find the cartoon. I'd already worked out how to get a complete list of issues from the web interface, and found you could [extract useful metadata](Get-text-from-a-Trove-journal.ipynb) from each issue's landing page. Putting these approaches together gave me a way forward. The basic methodology was this.\n", "\n", "* I manually selected all the cartoons from my harvest of front pages, as downloading them again would just make things slower\n", "* I harvested all the issue metadata\n", "* Looping through the list of issues I checked to see if I already had a cartoon from each issue, if not...\n", "* I grabbed the metadata from the issue landing page – for issues with OCRd text this includes a list of the articles and the pages that make up the issue\n", "* I looked through the list of articles to see if there was one with the *exact* title 'The Bulletin'\n", "* I then found the page on which the article was published\n", "* If the page was odd (almost all the cartoons were on odd pages), I downloaded the page\n", "\n", "This did result in some false positives, but they were easy enough to remove manually. At the end I was able to see when the editorial cartoons started appearing, and when they stopped having a page to themselves. This gave me a range of 1886 to 1952 to focus on. Looking within that range it seemed that there were only about 400 issues (out of 3452) that I had no cartoon for. The results were much better than I'd hoped!\n", "\n", "I then repeated this process several times, changing the title string and looping through the missing issues. 
I gradually widened the title match from exactly 'The Bulletin', to a string containing 'Bulletin', to a case-insensitive match, and so on. I also found and harvested a small group of cartoons from the 1940s that were published on an even-numbered page! After a few repeats I had about 100 issues left without cartoons. These were mostly issues that didn't have OCRd text, so there were no articles to find. I thought I might need to process these manually, but then I thought, why not just go through the odd page numbers from one to thirteen, harvesting these pages from the 'missing' issues and manually discarding the misses. This was easy and effective. Before too long I had a cartoon for every issue between 4 September 1886 and 17 September 1952. Yay!\n", "\n", "If you have a look at the results you'll see that I ended up with more cartoons than issues. This is for two reasons:\n", "\n", "* Some badly damaged issues were digitised twice, once from the hard copy and once from microfilm. Both versions are in Trove, so I thought I might as well keep both cartoons in my dataset.\n", "* The cartoons stopped appearing with the masthead in the 1920s, so they became harder to uniquely identify. In the 1940s, there were sometimes *two* full page cartoons by different artists commenting on current affairs. Where my harvesting found two such cartoons, I kept them both. However, because my aim was to find at least one cartoon from each issue, there are going to be other full page political cartoons that aren't included in my collection.\n", "\n", "Because I kept refining and adjusting the code as I went through, I'm not sure it will be very useful. But it's all here just in case.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import glob\n", "import io\n", "import json\n", "import os\n", "import re\n", "import time\n", "import zipfile\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import requests_cache\n", "from bs4 import BeautifulSoup\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from sqlite_utils import Database\n", "from tqdm.auto import tqdm\n", "\n", "s = requests_cache.CachedSession()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def harvest_metadata(obj_id):\n", " \"\"\"\n", " This calls an internal API from a journal landing page to extract a list of available issues.\n", " \"\"\"\n", " start_url = \"https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c\"\n", " # The initial startIdx value\n", " start = 0\n", " # Number of results per page\n", " n = 20\n", " issues = []\n", " with tqdm(desc=\"Issues\", leave=False) as pbar:\n", " # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n", " while n == 20:\n", " # Get the browse page\n", " response = s.get(start_url.format(obj_id, start), timeout=60)\n", " # BeautifulSoup turns the HTML into an easily navigable structure\n", " soup = BeautifulSoup(response.text, \"lxml\")\n", " # Find all the divs containing issue details and loop through them\n", " details = 
soup.find_all(class_=\"l-item-info\")\n", " for detail in details:\n", " issue = {}\n", " title = detail.find(\"h3\")\n", " if title:\n", " issue[\"title\"] = title.text\n", " issue[\"id\"] = title.parent[\"href\"].strip(\"/\")\n", " else:\n", " issue[\"title\"] = \"No title\"\n", " issue[\"id\"] = detail.find(\"a\")[\"href\"].strip(\"/\")\n", " try:\n", " # Get the issue details\n", " issue[\"details\"] = detail.find(\n", " class_=\"obj-reference content\"\n", " ).string.strip()\n", " except (AttributeError, IndexError):\n", " issue[\"details\"] = \"issue\"\n", " # Get the number of pages\n", " try:\n", " issue[\"pages\"] = int(\n", " re.search(\n", " r\"^(\\d+)\",\n", " detail.find(\"a\", attrs={\"data-pid\": issue[\"id\"]}).text,\n", " flags=re.MULTILINE,\n", " ).group(1)\n", " )\n", " except AttributeError:\n", " issue[\"pages\"] = 0\n", " issues.append(issue)\n", " # print(issue)\n", " if not response.from_cache:\n", " time.sleep(0.5)\n", " # Increment the startIdx\n", " start += n\n", " # Set n to the number of results on the current page\n", " n = len(details)\n", " pbar.update(n)\n", " return issues\n", "\n", "\n", "def get_page_number(work_data, article):\n", " page_number = 0\n", " page_id = article[\"existson\"][0][\"page\"]\n", " for index, page in enumerate(work_data[\"children\"][\"page\"]):\n", " if page[\"pid\"] == page_id:\n", " page_number = index\n", " break\n", " # print(page_number)\n", " return page_number\n", "\n", "\n", "def get_articles(work_data):\n", " articles = []\n", " for article in work_data[\"children\"][\"article\"]:\n", " # if re.search(r'^The Bulletin\\.*', article['title'], flags=re.IGNORECASE):\n", " if re.search(r\"Bulletin\", article[\"title\"], flags=re.IGNORECASE):\n", " # if re.search(r'^No title\\.*$', article['title']):\n", " articles.append(article)\n", " return articles\n", "\n", "\n", "def get_work_data(url):\n", " \"\"\"\n", " Extract work data in a JSON string from the work's HTML page.\n", " \"\"\"\n", " response = s.get(url, allow_redirects=True, timeout=60)\n", " # print(response.url)\n", " try:\n", " work_data = re.search(\n", " r\"var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})\", response.text\n", " ).group(1)\n", " except AttributeError:\n", " work_data = \"{}\"\n", " return json.loads(work_data)\n", "\n", "\n", "def download_page(issue_id, page_number, image_dir):\n", " if not os.path.exists(\"{}/{}-{}.jpg\".format(image_dir, issue_id, page_number + 1)):\n", " url = \"https://nla.gov.au/{0}/download?downloadOption=zip&firstPage={1}&lastPage={1}\".format(\n", " issue_id, page_number\n", " )\n", " # Get the file\n", " r = s.get(url, timeout=180)\n", " # The image is in a zip, so we need to extract the contents into the output directory\n", " try:\n", " z = zipfile.ZipFile(io.BytesIO(r.content))\n", " z.extractall(image_dir)\n", " except zipfile.BadZipFile:\n", " print(\"{}-{}\".format(issue_id, page_number))\n", "\n", "\n", "def download_issue_title_pages(issue_id, image_dir):\n", " url = \"https://nla.gov.au/{}\".format(issue_id)\n", " work_data = get_work_data(url)\n", " articles = get_articles(work_data)\n", " for article in articles:\n", " page_number = get_page_number(work_data, article)\n", " if page_number and (page_number % 2 == 0):\n", " download_page(issue_id, page_number, image_dir)\n", " time.sleep(1)\n", "\n", "\n", "def get_downloaded_issues(image_dir):\n", " issues = []\n", " images = [i for i in os.listdir(image_dir) if i[-4:] == \".jpg\"]\n", " for image in images:\n", " issue_id = 
re.search(r\"(nla\\.obj-\\d+)\", image).group(1)\n", " issues.append(issue_id)\n", " return issues\n", "\n", "\n", "def download_all_title_pages(issues, image_dir):\n", " dl_issues = get_downloaded_issues(image_dir)\n", " for issue in tqdm(issues):\n", " if issue[\"issue_id\"] not in dl_issues:\n", " download_issue_title_pages(issue[\"issue_id\"], image_dir)\n", " dl_issues.append(issue[\"issue_id\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Get current list of issues\n", "issues = harvest_metadata(\"nla.obj-68375465\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# First pass\n", "download_all_title_pages(\n", " issues, \"/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Change filenames of downloaded images to include date and issue number\n", "# This makes it easier to sort and group them\n", "df = pd.read_csv(\"journals/bulletin/bulletin_issues.csv\")\n", "all_issues = df.to_dict(\"records\")\n", "\n", "# Add dates / issue nums to file titles\n", "for record in all_issues:\n", " pages = glob.glob(\n", " \"/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/{}-*.jpg\".format(\n", " record[\"id\"]\n", " )\n", " )\n", " for page in pages:\n", " page_number = re.search(r\"nla\\.obj-\\d+-(\\d+)\\.jpg\", page).group(1)\n", " date = record[\"date\"].replace(\"-\", \"\")\n", " os.rename(\n", " page,\n", " \"/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/{}-{}-{}-{}.jpg\".format(\n", " date, record[\"number\"], record[\"id\"], page_number\n", " ),\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Only select issues between 1886 and 1952\n", "df = pd.read_csv(\n", " \"journals/bulletin/bulletin_issues.csv\",\n", " dtype={\"number\": \"Int64\"},\n", " keep_default_na=False,\n", ")\n", "df = df.rename(index=str, columns={\"id\": \"issue_id\"})\n", "filtered = df.loc[(df[\"number\"] >= 344) & (df[\"number\"] <= 3788)]\n", "filtered_issues = filtered.to_dict(\"records\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# After a round of harvesting, check to see what's missing\n", "missing = []\n", "multiple = []\n", "for issue in tqdm(filtered_issues):\n", " pages = glob.glob(\n", " \"/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/*{}-*.jpg\".format(\n", " issue[\"issue_id\"]\n", " )\n", " )\n", " if len(pages) == 0:\n", " missing.append(issue)\n", " elif len(pages) > 1:\n", " multiple.append(issue)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Adjust the matching settings and run again with just the missing issues\n", "# Rinse and repeat\n", "download_all_title_pages(\n", " missing, \"/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/missing\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# How many are mssing now?\n", "len(missing)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# How many images have been harvested\n", "dl = get_downloaded_issues(\"/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "len(dl)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "len(multiple)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Create links for missing issues so I can look at them on Trove\n", "for m in missing:\n", " print(\"https://nla.gov.au/{}\".format(m[\"issue_id\"]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Download odd pages (they start at zero, so 0=1 etc)\n", "for m in missing:\n", " download_page(\n", " m[\"issue_id\"], 14, \"/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/missing\"\n", " )" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Compile metadata\n", "\n", "Once all the images have been saved, we can compile information about them and create a database for researchers to explore." ] }, { "cell_type": "code", "execution_count": 600, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8b1a3da45c0b41d69ec1e29d53b9e441", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def get_metadata(id):\n", " \"\"\"\n", " Extract work data in a JSON string from the work's HTML page.\n", " \"\"\"\n", " if not id.startswith(\"http\"):\n", " id = \"https://nla.gov.au/\" + id\n", " response = s.get(id)\n", " try:\n", " work_data = re.search(\n", " r\"var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})\", response.text\n", " ).group(1)\n", " except AttributeError:\n", " work_data = \"{}\"\n", " if not response.from_cache:\n", " time.sleep(0.2)\n", " return json.loads(work_data)\n", "\n", "\n", "def download_text(issue_id, page):\n", " \"\"\"\n", " Download the OCRd text from a page.\n", " \"\"\"\n", " page_index = int(page) - 1\n", " url = f\"https://trove.nla.gov.au/{issue_id}/download?downloadOption=ocr&firstPage={page_index}&lastPage={page_index}\"\n", " response = s.get(url)\n", " response.encoding = \"utf-8\"\n", " if not response.from_cache:\n", " time.sleep(0.2)\n", " return \" \".join(response.text.splitlines())\n", "\n", "\n", "# Loop through harvested images, extracting metadata and compiling dataset\n", "images = []\n", "for image in tqdm(\n", " Path(\"/home/tim/mydata/Trove-text/Bulletin/covers/bulletin_cartoons\").glob(\"*.jpg\")\n", "):\n", " parts = image.name[0:-4].split(\"-\")\n", " date = f\"{parts[0][:4]}-{parts[0][4:6]}-{parts[0][6:8]}\"\n", " issue = parts[1]\n", " issue_id = f\"{parts[2]}-{parts[3]}\"\n", " page = parts[4]\n", " metadata = get_metadata(issue_id)\n", " pages = metadata[\"children\"][\"page\"]\n", " page_id = pages[int(page) - 1][\"pid\"]\n", " record = {\n", " \"id\": page_id,\n", " \"date\": 
date,\n", " \"issue_number\": issue,\n", " \"issue_id\": issue_id,\n", " \"page\": page,\n", " \"url\": f\"https://nla.gov.au/{page_id}\",\n", " \"download_image\": f\"https://nla.gov.au/{page_id}/image\",\n", " \"text\": download_text(issue_id, page),\n", " }\n", " images.append(record)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Save the metadata as a CSV file." ] }, { "cell_type": "code", "execution_count": 601, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df = pd.DataFrame(images)\n", "df.to_csv(\"bulletin-editorial-cartoons.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Add the metadata to a SQLite db for use with Datasette-Lite." ] }, { "cell_type": "code", "execution_count": 602, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 602, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.insert(\n", " 0,\n", " \"thumbnail\",\n", " df[\"url\"].apply(lambda x: f'{{\"img_src\": \"{x}-t\"}}' if not pd.isnull(x) else \"\"),\n", ")\n", "\n", "db = Database(\"bulletin-editorial-cartoons.db\", recreate=True)\n", "db[\"cartoons\"].insert_all(df.to_dict(orient=\"records\"), pk=\"id\")\n", "db[\"cartoons\"].enable_fts([\"text\"])" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "action": [ { "description": "This dataset includes a collection of 3,471 full-page editorial cartoons downloaded from issues of *The Bulletin* published between 1886 and 1952. In most cases there is one cartoon per issue. Metadata describing each image is available in a CSV-fromatted file, and in an SQLite database that can be explored using Datasette-Lite. The full collection of high-resolution images can be downloaded as a single 62gb zip file.", "isPartOf": "https://github.com/GLAM-Workbench/bulletin-editorial-cartoons", "local_path": "/home/tim/mydata/Trove-text/Bulletin/covers", "mainEntityOfPage": "https://glam-workbench.net/trove-journals/bulletin-cartoons-collection/", "name": "Editorial cartoons from The Bulletin, 1886 to 1952", "result": [ { "url": "https://github.com/GLAM-Workbench/bulletin-editorial-cartoons/raw/main/bulletin-editorial-cartoons.csv" }, { "description": "This is an SQLite database formatted for use with Datasette-Lite. 
As well as metadata from the `bulletin-editorial-cartoons.csv` file it includes a thumbnail column to display the cartoon.", "url": "https://github.com/GLAM-Workbench/bulletin-editorial-cartoons/raw/main/bulletin-editorial-cartoons.db" }, { "description": "This dataset contains high-resolution images of 3,471 full-page editorial cartoons harvested from *The Bulletin*, 1886 to 1952.\n\nThe names of each image file provide useful contextual metadata. For example, the file name `19330412-2774-nla.obj-606969767-7.jpg` tells you:\n\n* `19330412` – the cartoon was published on 12 April 1933\n* `2774` – it was published in issue number 2774\n* `nla.obj-606969767` – the Trove identifier for the issue, can be used to make a url eg `https://nla.gov.au/nla.obj-606969767`\n* `7` – on page 7", "url": "https://trove-journals.s3.ap-southeast-2.amazonaws.com/bulletin_cartoons.zip" } ], "workExample": [ { "name": "Explore using Datasette", "url": "https://glam-workbench.net/datasette-lite/?url=https://github.com/GLAM-Workbench/bulletin-editorial-cartoons/blob/main/bulletin-editorial-cartoons.db&install=datasette-json-html&metadata=https://raw.githubusercontent.com/GLAM-Workbench/bulletin-editorial-cartoons/main/metadata.json#/bulletin-editorial-cartoons/cartoons" }, { "name": "Download from Dropbox: 1886 to 1889 (45mb PDF)", "url": "https://www.dropbox.com/s/altjl6jixwv5pt0/bulletin-1886-1889.pdf?dl=0" }, { "name": "Download from Dropbox: 1890 to 1899 (139mb PDF)", "url": "https://www.dropbox.com/s/p15swmact2c9euf/bulletin-1890-1899.pdf?dl=0" }, { "name": "Download from Dropbox: 1900 to 1909 (147mb PDF)", "url": "https://www.dropbox.com/s/0rivg50s8qam2et/bulletin-1900-1909.pdf?dl=0" }, { "name": "Download from Dropbox: 1910 to 1919 (153mb PDF)", "url": "https://www.dropbox.com/s/pdsj6xjot0l928w/bulletin-1910-1919.pdf?dl=0" }, { "name": "Download from Dropbox: 1920 to 1929 (159mb PDF)", "url": "https://www.dropbox.com/s/64x9y5nvgez1q3o/bulletin-1920-1929.pdf?dl=0" }, { "name": "Download from Dropbox: 1930 to 1939 (151mb PDF)", "url": "https://www.dropbox.com/s/8mytp5qhqcrctt3/bulletin-1930-1939.pdf?dl=0" }, { "name": "Download from Dropbox: 1940 to 1949 (146mb PDF)", "url": "https://www.dropbox.com/s/go3vyuqq0td6oqd/bulletin-1940-1949.pdf?dl=0" }, { "name": "Download from Dropbox: 1950 to 1952 (42mb PDF)", "url": "https://www.dropbox.com/s/klcv0gyjs81c0pm/bulletin-1950-1952.pdf?dl=0" } ] } ], "author": [ { "mainEntityOfPage": "https://timsherratt.au", "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "category": "Harvesting images", "description": "This notebook describes a method for finding and saving full-page editorial cartoons from *The Bulletin*.", "mainEntityOfPage": "https://glam-workbench.net/trove-journals/finding-editorial-cartoons-in-bulletin/", "name": "Finding editorial cartoons in the Bulletin", "position": 8 } }, "nbformat": 4, "nbformat_minor": 4 }