{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvest data from Papers Past\n", "\n", "This notebooks lets you harvest large amounts of data for Papers Past (via DigitalNZ) for further analysis. It saves the results as a CSV file that you can open in any spreadsheet program. It currently includes the OCRd text of all the newspaper articles, but I might make this optional in the future — thoughts?\n", "\n", "You can edit this notebook to harvest other collections in DigitalNZ — see the notes below for pointers. However, this is currently only saving a small subset of the available metadata, so you'd probably want to adjust the fields as well. Add an [issue on GitHub](https://github.com/GLAM-Workbench/digitalnz/issues) if you need help creating a custom harvester.\n", "\n", "There's only two things you **have** to change — you need to enter your API key, and supply a search term. There are additional options for limiting your search results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!.

\n", "\n", "

\n", " Some tips:\n", "

\n", "

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add your API key\n", "\n", "Go get yourself a [DigitalNZ API key](https://digitalnz.org/developers/getting-started), then paste it between the quotes below. You need a key to make API requests, but they're free and quick to obtain." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Past your API key between the quotes\n", "# You might need to trim off any spaces at the beginning and end\n", "API_KEY = '[YOUR API KEY]'\n", "print('Your API key is: {}'.format(API_KEY))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting things up\n", "\n", "Just run these cells to set up some things that we'll need later on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This cell just sets up some stuff that we'll need later\n", "\n", "import requests\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "import pandas as pd\n", "from tqdm.auto import tqdm\n", "import time\n", "import re\n", "from slugify import slugify\n", "from time import strftime\n", "from IPython.display import display, HTML\n", "from pathlib import Path\n", "\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n", "s.mount('https://', HTTPAdapter(max_retries=retries))\n", "\n", "API_URL = 'http://api.digitalnz.org/v3/records.json'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up some code\n", "\n", "This is where all the serious harvesting work gets done. You shouldn't need to change anything unless you want to harvest something other than Papers Past. Just run the cell." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_articles(results):\n", " articles = []\n", " for result in results:\n", " # If you're harvesting something other than Papers Past, you'd probably \n", " # want to change the way results are processed.\n", " title = re.sub(r'(\\([^)]*\\))[^(]*$', '', result['title']).strip()\n", " articles.append({\n", " 'id': result['id'],\n", " 'title': title,\n", " 'newspaper': result['publisher'][0],\n", " 'date': result['date'][0][:10],\n", " 'text': result['fulltext'],\n", " 'paperspast_url': result['landing_url'],\n", " 'source_url': result['source_url']\n", " })\n", " return articles\n", "\n", "def get_total(params):\n", " np = params.copy()\n", " np['per_page'] = 0\n", " data = get_records(np)\n", " return data['search']['result_count']\n", " \n", "def get_records(params):\n", " response = requests.get(API_URL, params=params)\n", " return response.json()\n", "\n", "def harvest(params):\n", " '''\n", " Do the harvesting!\n", " '''\n", " more = True\n", " articles = []\n", " params['page'] = 1\n", " total = get_total(params)\n", " with tqdm(total=total) as pbar:\n", " while more:\n", " data = get_records(params)\n", " results = data['search']['results']\n", " if results:\n", " articles += process_articles(data['search']['results'])\n", " pbar.update(len(results))\n", " params['page'] += 1\n", " time.sleep(0.2)\n", " else:\n", " more = False \n", " return articles\n", "\n", "def start_harvest(query, start_year=None, end_year=None, **kwargs):\n", " '''\n", " Initiates a harvest.\n", " If you've specified start and end years it'll loop over them getting results for each.\n", " '''\n", " params = {\n", " 'text': query,\n", " 'and[primary_collection][]': 'Papers Past',\n", " 'per_page': '100',\n", " 'api_key': API_KEY\n", " }\n", " for key, value in kwargs.items():\n", " params[f'and[{key}][]'] = value\n", " if start_year and end_year:\n", " articles = []\n", " for year in tqdm(range(start_year, end_year+1), desc='Years'):\n", " current_year = year\n", " params['and[year][]'] = year\n", " articles += harvest(params)\n", " else:\n", " articles = harvest(params)\n", " return articles\n", "\n", "def save_as_csv(articles, query_name):\n", " '''\n", " Save the results as a CSV file.\n", " Filename is constructed from the the supplied query_name and the current date/time.\n", " Displays a download link when finished.\n", " '''\n", " Path('data').mkdir(exist_ok=True)\n", " filename = f'{slugify(query_name)}-{strftime(\"%Y%m%d%H%M%S\")}.csv'\n", " df = pd.DataFrame(articles)\n", " df.to_csv(Path('data', filename), index=False)\n", " display(HTML(f'{filename}'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvest search results\n", "\n", "At the very least, the harvesting code is expecting you to supply a search query, like 'possum'. You just feed this query to the `start_harvest()` function to kick things off." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "articles = start_harvest('possum')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can save the results as a CSV file." 
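, "\n", "\n", "If you'd like a quick look at what came back before saving, the list of article dictionaries loads straight into a pandas DataFrame. For example (assuming the 'possum' harvest above has been run):\n", "\n", "```python\n", "df = pd.DataFrame(articles)\n", "df.head()\n", "```\n", "\n", "When you're ready, run the cell below to write everything to the 'data' directory."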
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save_as_csv(articles, 'possum')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvesting articles within a range of years\n", "\n", "There's no direct way to search for a range of years, but we can get around this by issuing a request for each year separately and then combining the results. Just give `start_harvest()` a `start_year` and `end_year` value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "articles = start_harvest('possum', start_year=1905, end_year=1910)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save_as_csv(articles, 'possum-1905-1910')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvesting articles from a particular newspaper\n", "\n", "You can add additional facets to the `start_harvest()` function to limit the results further. For example, you can use the `collection` facet to specify a particular newspaper:\n", "\n", "* `collection='Evening Post'`\n", "\n", "Other facets you could use include:\n", "* `century='1800'`\n", "* `decade='1850'`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "articles = start_harvest('possum', collection='Evening Post')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save_as_csv(articles, 'possum-evening-post')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }