{ "cells": [ { "cell_type": "markdown", "id": "5790aab1", "metadata": {}, "source": [ "# Exploring Notes & Queries\n", "\n", "This chapter, and several following ones, will describe how to create various search context for 19th century issues of *Notes & Queries*. These include:\n", "\n", "- a monolithic PDF of index issues up to 1900;\n", "- a searchable database of index issues up to 1900;\n", "- a full text searchable database of non-index issues up to 1900.\n", "\n", "Original scans of the original publication, as well as automatically extracted search text, are available, for free, from the Internet Archive. " ] }, { "cell_type": "markdown", "id": "e7ca9ffc", "metadata": {}, "source": [ "## Working With Documents From the Internet Archive\n", "\n", "The Internet Archive – [`archive.org`](https://archive.org/) – is an incredible resource. Amongst other things, it is home to a large number of out-of-copyright digitised books scanned by the Google Book project as well as other book scanning initiatives.\n", "\n", "In this unbook, I will explore various ways in which can build tools around the Internet Archive and documents retrieved from it." ] }, { "cell_type": "markdown", "id": "33c89c48", "metadata": {}, "source": [ "## Searching the Internet Archive\n", "\n", "Many people will be familiar with the web interface to the [Internet Archive](https://archive.org) (and I suspect many more are not aware of the existence of the Internet Archive at all). This provides tools for discovering documents available in the archive, previewing the scanned versions of them, and even searching within them.\n", "\n", "At times, the search inside a book can be a bit hit and miss, in part depending on the quality of the scanned images and the ability of the OCR tools - where \"OCR\" stands for \"optical character recognition\" - to convert the pictures of text into actual text. Which is to say, *searchable* text.\n", "\n", "One of the advantages of creating our own database is that as well as having the corpus available locally, we can use various fuzzy search tools to find partial matches to text to supplement our full text search activities.\n", "\n", "To work with the archive, we'll use the Python programming language. This lets us write instructions for our machine helpers to follow. One of the machine helpers comes in the form of the [`internetarchive` Python package](https://archive.org/services/docs/api/internetarchive/index.html), a collection of routines that can access the Internet Archive at the programming, rather than human user interface, level.\n", "\n", "*The human level interface simply provides graphical tools that we can understand, such as menu items and toolbar buttons. Selecting or clicking these simply invokes machine level commands in a useable-for-us way. Writing program code lets us call those commands directly, in a textual way, rather than visually, by clicking menu items and buttons. Copying and pasting simple text instructions that can be used to perform a particular function is often quite straightforward. Modifying such commands may also be relatively straightforward. (For example, given a block of code that downloads a file from a web location using code of the form `download_file(\"https://example.com/this_file.pdf\")`, you could probably work out how to download a file from `http://another.example.com/myfile.pdf`.) Creating graphical user interfaces is hard. 
Graphical user interfaces also constrain users to just the functions and features that the interface designers and developers chose to support, in just the ways that the interface allows. Being able to instruct a machine using code, even copied and pasted code, gives the end-user far more power over the machine.*\n", "\n", "Within any particular programming language, *packages* are often used to bundle together various tools and functions that can be used to support particular activities or tasks, or work with particular resources or resource types.\n", "\n", "One of the most useful tools within the Internet Archive package is the `search_items()` function, which lets us search the Internet Archive." ] }, { "cell_type": "code", "execution_count": 1, "id": "12942550", "metadata": {}, "outputs": [], "source": [ "# If we haven't already installed the package into our computing environment,\n", "# we need to download it and install it.\n", "#%pip install internetarchive\n", "\n", "# Load in a function to search the archive\n", "from internetarchive import search_items\n", "\n", "# We are going to build up a list of search results\n", "items = []" ] }, { "cell_type": "markdown", "id": "afa3c6de", "metadata": {}, "source": [ "### Item Metadata\n", "\n", "At the data level, the Internet Archive has *metadata*, or \"data about data\", that provides key or summary information about each data record. For example, works can be organised as part of different collections via `collection` elements such as `collection:\"pub_notes-and-queries\"`.\n", "\n", "For periodicals, there may also be a publication identifier associated with the periodical (for example, `sim_pubid:1250`) or metadata identifying which *volume* or *issue* a particular edition of a periodical represents.\n", "\n", "In the following bit of code, we search over the *Notes & Queries* collection, retrieving data about each item in the collection.\n", "\n", "This is quite a large collection, so running a query that retrieves all the items in it may take a considerable amount of time. Instead, we can limit the search to issues published in a particular year, and further limit the query to only retrieve a certain number of records." 
] }, { "cell_type": "code", "execution_count": 2, "id": "8d4efbc6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 sim_notes-and-queries_1867-01-05_11_262 Notes and Queries 1867-01-05: Vol 11 Iss 262\n", "1 sim_notes-and-queries_1867-01-12_11_263 Notes and Queries 1867-01-12: Vol 11 Iss 263\n", "2 sim_notes-and-queries_1867-01-19_11_264 Notes and Queries 1867-01-19: Vol 11 Iss 264\n", "3 sim_notes-and-queries_1867-01-26_11_265 Notes and Queries 1867-01-26: Vol 11 Iss 265\n" ] } ], "source": [ "# We can use a programming loop to search for items, iterate through the items\n", "# and retrieve a record for each one\n", "# The enumerate() command will loop trhough all the items, returnin a running count of items\n", "# returned, as well as each separate item\n", "# The count starts at 0...\n", "for count, item in enumerate(search_items('collection:\"pub_notes-and-queries\" AND year:1867').iter_as_items()):\n", " # Display the count, the item identifier and title\n", " print(count, item.identifier, item.metadata['title'])\n", "\n", " # If we see item with count value of at least 3, which is to say, the fourth item,\n", " # (we start counting at zero, remember...)\n", " if count >= 3:\n", " # Then break out of this loop\n", " break" ] }, { "cell_type": "markdown", "id": "8663e4df", "metadata": {}, "source": [ "As well as the \"offical\" collection, some copies of *Notes and Queries* from other providers are also available in the Internet Archive. For example, there are some submissions from *Project Gutenberg*.\n", "\n", "The following retrieves an item obtained from the `gutenberg` collection, which is to say, *Project Gutenberg*, and previews its metadata:" ] }, { "cell_type": "code", "execution_count": 3, "id": "f6c113e4", "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "{'identifier': 'notesandqueriesi13536gut',\n", " 'title': 'Notes and Queries, Index of Volume 1, November, 1849-May, 1850: A Medium of Inter-Communication for Literary Men, Artists, Antiquaries, Genealogists, Etc.',\n", " 'possible-copyright-status': 'NOT_IN_COPYRIGHT',\n", " 'copyright-region': 'US',\n", " 'mediatype': 'texts',\n", " 'collection': 'gutenberg',\n", " 'creator': 'Various',\n", " 'contributor': 'Project Gutenberg',\n", " 'description': 'Book from Project Gutenberg: Notes and Queries, Index of Volume 1, November, 1849-May, 1850: A Medium of Inter-Communication for Literary Men, Artists, Antiquaries, Genealogists, Etc.',\n", " 'language': 'eng',\n", " 'call_number': 'gutenberg etext# 13536',\n", " 'addeddate': '2006-12-07',\n", " 'publicdate': '2006-12-07',\n", " 'backup_location': 'ia903600_27'}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from internetarchive import get_item\n", "\n", "# Retrieve an item from its unique identifier\n", "item = get_item('notesandqueriesi13536gut')\n", "\n", "# And display its metadata\n", "item.metadata" ] }, { "cell_type": "markdown", "id": "eafeee33", "metadata": {}, "source": [ "The items in the `pub_notes-and-queries` collection have much more metadata available, including `volume` and `issue` data, and the identifiers for the `previous` and `next` issue." ] }, { "cell_type": "markdown", "id": "acb8cb96", "metadata": {}, "source": [ "In some cases, the identifier values may be human readable, if you look closely enough. For example, *Notes and Queries* was published weekly, typically with two volumes per year, and an index for each. 
{ "cell_type": "markdown", "id": "c57e6bab", "metadata": {}, "source": [ "### Available Files" ] }, { "cell_type": "markdown", "id": "31416240", "metadata": {}, "source": [ "As well as the data record, certain other files may be associated with that item, such as PDF scans, or files containing the raw scanned text of the document.\n", "\n", "We have already seen how we can retrieve an item given its identifier, but let's see it in action again:" ] }, { "cell_type": "code", "execution_count": 4, "id": "dabbd1cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('Notes and Queries 1867: Vol 12 Index',\n", " 'sim_notes-and-queries_1867_12_index')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "item = get_item(\"sim_notes-and-queries_1867_12_index\")\n", "\n", "item.metadata['title'], item.identifier" ] }, { "cell_type": "markdown", "id": "d6faa9be", "metadata": {}, "source": [ "We can make a call from this data item to return a list of the files associated with that item, and display their file formats:" ] }, { "cell_type": "code", "execution_count": 5, "id": "69797f12", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Item Tile\n", "JPEG 2000\n", "JPEG 2000\n", "Text PDF\n", "Archive BitTorrent\n", "chOCR\n", "DjVuTXT\n", "Djvu XML\n", "Metadata\n", "JSON\n", "hOCR\n", "OCR Page Index\n", "OCR Search Text\n", "Item Image\n", "Single Page Processed JP2 ZIP\n", "Metadata\n", "Metadata\n", "Page Numbers JSON\n", "JSON\n", "Scandata\n" ] } ], "source": [ "for file_item in item.get_files():\n", " print(file_item.format)" ] }, { "cell_type": "markdown", "id": "1e65f799", "metadata": {}, "source": [ "For this item, then, we can get a PDF document, a file containing the search text, a record with information about page numbers, an XML version of the scanned text, some image scans, and various other things containing who knows what!" ] },
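{ "cell_type": "markdown", "id": "4e2bc3d0", "metadata": {}, "source": [ "If we only care about one or two of those formats, the `get_files()` method also accepts a `formats` argument that filters the returned file list. As a quick sketch (assuming, as appears to be the case, that the file objects expose `name` and `format` attributes), we can list just the files we are likely to want:" ] }, { "cell_type": "code", "execution_count": null, "id": "4e2bc3d1", "metadata": {}, "outputs": [], "source": [ "# List just the files in the formats we are interested in\n", "for file_item in item.get_files(formats=[\"Text PDF\", \"OCR Search Text\"]):\n", "    print(file_item.name, \"-\", file_item.format)" ] },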
{ "cell_type": "markdown", "id": "cc577a43", "metadata": {}, "source": [ "### A Complete List of *Notes & Queries* Issues\n", "\n", "To help us work with the *pub_notes-and-queries* collection, let's construct a local copy of the most important metadata associated with each item in the collection, specifically the item identifier, date and title, as well as the volume and issue. (*Notes and Queries* also has a higher level of organisation, a *Series*, which means that volume and issue numbers can actually recycle, so by itself, a particular `(volume, issue)` pair does not identify a unique item, but a `(series, volume, issue)` or `(year, volume, issue)` triple does.)\n", "\n", "For convenience, we might also collect the *previous* and *next* item identifiers, as well as a flag that tells us whether access is restricted or not. (For 19th century editions, there are no restrictions; but for more recent 20th century editions, access may be limited to library shelf access.)\n", "\n", "As we construct various tools for working with the Internet Archive and various files downloaded from it, it will be useful to also save those tools in a way that we can make use of them.\n", "\n", "The Python programming language supports a simple mechanism for bundling files into \"packages\", simply by including the files in a directory that is marked as a package directory. The simplest way to mark a directory as a Python package is simply to create an empty file called `__init__.py` inside it.\n", "\n", "So let's create a package called `ia_utils` by creating a directory of that name containing an empty `__init__.py` file:" ] }, { "cell_type": "code", "execution_count": 40, "id": "0591b384", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "# Create the directory if it doesn't already exist\n", "ia_utils = Path(\"ia_utils\")\n", "ia_utils.mkdir(exist_ok=True)\n", "\n", "# Create the blank file\n", "Path( ia_utils / \"__init__.py\" ).touch()" ] }, { "cell_type": "markdown", "id": "0a8ae4f2", "metadata": {}, "source": [ "```{note}\n", "The `pathlib` package contains powerful tools for working with directories, files, and file paths.\n", "```" ] }, { "cell_type": "markdown", "id": "19d424e0", "metadata": {}, "source": [ "The following cell contains a set of instructions bundled together to define a *function* under a unique function name. Functions provide us with a shorthand way of writing a set of instructions once, then calling on them repeatedly via their function name.\n", "\n", "In particular, the function takes in an item metadata record, tidies it up a little and returns just the fields we are interested in.\n", "\n", "In the following cell, we use some notebook \"magic\" (`%%writefile`) to write the contents of the cell to a package file; in the next cell after that, we import the function from the file. This provides us with a convenient way of saving code to a file that we can also reuse elsewhere." ] }, { "cell_type": "code", "execution_count": 1, "id": "317fc7ef", "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting ia_utils/out_ia_metadata.py\n" ] } ], "source": [ "%%writefile ia_utils/out_ia_metadata.py\n", "\n", "def out_ia_metadata(item):\n", "    \"\"\"Retrieve a subset of item metadata and return it as a list.\"\"\"\n", "    # This is a nested function that looks up a piece of metadata if it exists\n", "    # If it doesn't exist, we set it to ''\n", "    def _get(_item, field):\n", "        return _item[field] if field in _item else ''\n", "\n", "    identifier = item.metadata['identifier']\n", "    date = _get(item.metadata, 'date')\n", "    title = _get(item.metadata, 'title')\n", "    volume = _get(item.metadata, 'volume')\n", "    issue = _get(item.metadata, 'issue')\n", "    prev_ = _get(item.metadata, 'previous_item')\n", "    next_ = _get(item.metadata, 'next_item')\n", "    restricted = _get(item.metadata, 'access-restricted-item')\n", "    \n", "    return [identifier, date, title, volume, issue, prev_, next_, restricted]" ] }, { "cell_type": "markdown", "id": "f4167503", "metadata": {}, "source": [ "Now we can import the function from the package. And so can other notebooks." 
] }, { "cell_type": "code", "execution_count": 8, "id": "08e90fb6", "metadata": {}, "outputs": [], "source": [ "from ia_utils.out_ia_metadata import out_ia_metadata" ] }, { "cell_type": "markdown", "id": "a05aadf3", "metadata": {}, "source": [ "```{admonition} Tracking Updates to the Function\n", ":class: dropdown\n", "\n", "If we update the function and rewrite the file, the `from...import..` line will not normally reload the (updated) function if the function has already been imported.\n", "\n", "There are two ways round this:\n", "\n", "- load the file in and run it, rather than importing the package, using a magic command of the form `%run -i ia_utils/out_ia_metadata.py`\n", "- configure the notebook at the start by running `%load_ext autoreload ; %autoreload 2` (see the [documentation](https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html)).\n", "```" ] }, { "cell_type": "markdown", "id": "87f1a635", "metadata": {}, "source": [ "Here's what the data retrieved from an item record by the `out_ia_metadata` function looks like:" ] }, { "cell_type": "code", "execution_count": 9, "id": "1bef9ff3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sim_notes-and-queries_1867_12_index',\n", " '1867',\n", " 'Notes and Queries 1867: Vol 12 Index',\n", " '12',\n", " 'Index',\n", " 'sim_notes-and-queries_1867-06-29_11_287',\n", " 'sim_notes-and-queries_1867-07-06_12_288',\n", " '']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get an item record form its identifier\n", "item = get_item(\"sim_notes-and-queries_1867_12_index\")\n", "\n", "# Display the key metadata\n", "out_ia_metadata(item)" ] }, { "cell_type": "markdown", "id": "8ea46b45", "metadata": {}, "source": [ "We can now build up a list of lists containing the key metadata for all editions of *Notes of Queries* in the `pub_notes-and-queries` collection.\n", "\n", "Our recipe will proceed in the following three steps:\n", "\n", "- search for all the items in the collection;\n", "- build up a list of records where item contains the key metadata, extracted from the full record using the `out_ia_metadata()` function;\n", "- open a file (*nandq_internet_archive.txt*), give it a column header line, and write the key metadata records to it, one record per line.\n", "\n", "The file will be written in \"CSV\" format (comma separarated variable), a simple text format for describing tabular data. CSV files can be read by spreadsheet applications, as well as other tools, and use comma separators to identify \"columns\" of information in each row." 
] }, { "cell_type": "code", "execution_count": 10, "id": "05163e97", "metadata": {}, "outputs": [], "source": [ "# The name of the file we'll write our csv data to\n", "csv_fn = \"nandq_internet_archive.txt\"" ] }, { "cell_type": "markdown", "id": "a48337d1", "metadata": {}, "source": [ "The file takes quite a long time to assemble (we need to download several thousand metadata records), so we only want to do it once.\n", "\n", "So let's check to see if the file exists (if it does, we won't try to recreate it:" ] }, { "cell_type": "code", "execution_count": 11, "id": "c81ac8d9", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "csv_file_exists = Path(csv_fn).is_file()" ] }, { "cell_type": "markdown", "id": "c50857f5", "metadata": {}, "source": [ "Conveniently, identifiers for all the issues of *Notes and Queries* held by the Internet Archive can be retrieved via the `pub_notes-and-queries` collection.\n", "\n", "The return object is an iterator with individual results that take the form `{'identifier': 'sim_notes-and-queries_1849-11-03_1_1'}` and from which we can obtain unique identifiers:" ] }, { "cell_type": "code", "execution_count": 12, "id": "12b6b2c9", "metadata": {}, "outputs": [], "source": [ "# Find records for all items in the collection\n", "items = search_items('collection:\"pub_notes-and-queries\"')" ] }, { "cell_type": "markdown", "id": "9ccddb5d", "metadata": {}, "source": [ "The following incantation constructs one list from the members of another. In particular, we iterate through each item in the `pub_notes-and-queries`, extract the identifier, retrieve the corresponding metadata record (`get_item()`), create our own corresponding metadata record (`out_ia_metadata()`) and add it to a new list.\n", "\n", "In all, there are several thousand records to download, and each takes a noticeable time, so rather than just sitting watching a progress bar for an hour, go and grab a meal rather than a coffee..." ] }, { "cell_type": "code", "execution_count": 13, "id": "584544af", "metadata": {}, "outputs": [], "source": [ "# The tqdm package provides a convenient progress bar\n", "# for tracking progress through looped actions\n", "from tqdm.notebook import tqdm\n", "\n", "# If a local file containing the data doesn't already exist,\n", "# then grab the data...\n", "if not csv_file_exists:\n", " # Our list of custom metadata records\n", " csv_items = []\n", "\n", " for i in tqdm(items):\n", " id_val = i[\"identifier\"]\n", " metadata_record = get_item(id_val)\n", " custom_metadata_record = out_ia_metadata( metadata_record )\n", " csv_items.append( custom_metadata_record )\n", " \n", "# We should perhaps incrementally write the CSV file as we go along\n", "# or incrementally save the data to a simple local database\n", "# If something goes wrong during the downloads, then at least\n", "# we won;t have lost everything.." 
] }, { "cell_type": "markdown", "id": "08a001d3", "metadata": {}, "source": [ "We can now open the CSV file and write the data to it:" ] }, { "cell_type": "code", "execution_count": 14, "id": "2a61134a", "metadata": {}, "outputs": [], "source": [ "# If a local file containing the data doesn't already exist,\n", "# then grab the data...\n", "if not csv_file_exists:\n", " with open(csv_fn, 'w') as outfile:\n", " print(f\"Writing data to file {csv_fn}\")\n", "\n", " # Create a \"CSV writer\" object that can write to the file \n", " csv_write = csv.writer(outfile)\n", " # Write a header row at the top of the file\n", " csv_write.writerow(['id','date','title','vol','iss','prev_id', 'next_id','restricted'])\n", " # Then write out list of essential metadata items out, one record per row\n", " csv_write.writerows(csv_items)\n", "\n", " # Update the file exists flag\n", " csv_file_exists = Path(csv_fn).is_file()" ] }, { "cell_type": "markdown", "id": "d353c990", "metadata": {}, "source": [ "We can use a simple Linux command line tool (`head`) to show the top five lines of the file:" ] }, { "cell_type": "code", "execution_count": 15, "id": "761eb6a8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id,date,title,vol,iss,prev_id,next_id,restricted\r", "\r\n", "sim_notes-and-queries_1849-11-03_1_1,1849-11-03,Notes and Queries 1849-11-03: Vol 1 Iss 1,1,1,sim_notes-and-queries_1849-1850_1_index,sim_notes-and-queries_1849-11-10_1_2,\r", "\r\n", "sim_notes-and-queries_1849-11-10_1_2,1849-11-10,Notes and Queries 1849-11-10: Vol 1 Iss 2,1,2,sim_notes-and-queries_1849-11-03_1_1,sim_notes-and-queries_1849-11-17_1_3,\r", "\r\n", "sim_notes-and-queries_1849-11-17_1_3,1849-11-17,Notes and Queries 1849-11-17: Vol 1 Iss 3,1,3,sim_notes-and-queries_1849-11-10_1_2,sim_notes-and-queries_1849-11-24_1_4,\r", "\r\n", "sim_notes-and-queries_1849-11-24_1_4,1849-11-24,Notes and Queries 1849-11-24: Vol 1 Iss 4,1,4,sim_notes-and-queries_1849-11-17_1_3,sim_notes-and-queries_1849-12-01_1_5,\r", "\r\n" ] } ], "source": [ "!head -n 5 nandq_internet_archive.txt" ] }, { "cell_type": "markdown", "id": "bbd717db", "metadata": {}, "source": [ "So, with some idea of what's available to us, data wise, and file wise, what can we start to do with it?" ] }, { "cell_type": "markdown", "id": "2b854f3b", "metadata": {}, "source": [ "## Generating a Monolithic PDF Index for *Notes & Queries* Up To 1900" ] }, { "cell_type": "markdown", "id": "42afa8e2", "metadata": {}, "source": [ "If we want to search for items in *Notes and Queries* \"manually\", one of the most effective ways is to look up items in the volume indexes. With two volumes a year, this means checking almost 100 separate documents if we want to look up 19th century references. (That's not quite true: from the 1890s, indexes were produced that started to to aggregate indices over several years.) \n", "\n", "So how might we go about producing a single index PDF for 19th c. editions of *Notes & Queries*? 
As a conjoined set of original index PDFs, this wouldn't provide us with unified index terms – a search on an index item would return separate entries for each volume index in which the term appeared – but it would mean we only needed to search one PDF document.\n", "\n", "We'll use the Python `csv` package to simplify saving and loading the data:" ] }, { "cell_type": "code", "execution_count": 16, "id": "e4e03047", "metadata": {}, "outputs": [], "source": [ "import csv" ] }, { "cell_type": "markdown", "id": "52b63a56", "metadata": {}, "source": [ "To begin with, we can load in our list of *Notes and Queries* record data downloaded from the Internet Archive." ] }, { "cell_type": "code", "execution_count": 17, "id": "3d32706c", "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting ia_utils/open_metadata_records.py\n" ] } ], "source": [ "%%writefile ia_utils/open_metadata_records.py\n", "import csv\n", "\n", "# Specify the file name we want to read data in from\n", "def open_metadata_records(fn='nandq_internet_archive.txt'):\n", "    \"\"\"Open and read metadata records file.\"\"\"\n", "\n", "    with open(fn, 'r') as f:\n", "        # We are going to load the data into a data structure known as a dictionary, or dict\n", "        # Each item in the dictionary contains several elements as `key:value` pairs\n", "        # The key matches the column name in the CSV data file,\n", "        # along with the corresponding value in a given item row\n", "\n", "        # Read the data in\n", "        csv_data = csv.DictReader(f)\n", "\n", "        # And convert it to a list of data records\n", "        data_records = list(csv_data)\n", "    \n", "    return data_records" ] }, { "cell_type": "code", "execution_count": 18, "id": "4aeda483", "metadata": {}, "outputs": [], "source": [ "# Import that function from the package we just wrote it to\n", "from ia_utils.open_metadata_records import open_metadata_records" ] }, { "cell_type": "markdown", "id": "a7439843", "metadata": {}, "source": [ "Let's grab the metadata records from our saved file:" ] }, { "cell_type": "code", "execution_count": 19, "id": "d01f5aeb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'id': 'sim_notes-and-queries_1849-11-03_1_1',\n", " 'date': '1849-11-03',\n", " 'title': 'Notes and Queries 1849-11-03: Vol 1 Iss 1',\n", " 'vol': '1',\n", " 'iss': '1',\n", " 'prev_id': 'sim_notes-and-queries_1849-1850_1_index',\n", " 'next_id': 'sim_notes-and-queries_1849-11-10_1_2',\n", " 'restricted': ''}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_records = open_metadata_records()\n", "\n", "# Preview the first record (index count starts at 0)\n", "# The object returned is a dictionary / dict\n", "data_records[0]" ] }, { "cell_type": "markdown", "id": "a0b99671", "metadata": {}, "source": [ "## Populating a Database With Record Metadata\n", "\n", "Let's start by creating a table in the database that can store our metadata records, as loaded in from the data file." 
] }, { "cell_type": "code", "execution_count": 20, "id": "34555faa", "metadata": {}, "outputs": [], "source": [ "from sqlite_utils import Database\n", "\n", "db_name = \"nq_demo.db\"\n", "\n", "# While developing the script, recreate database each time...\n", "db = Database(db_name, recreate=True)" ] }, { "cell_type": "code", "execution_count": 21, "id": "41c1064b", "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting ia_utils/create_db_table_metadata.py\n" ] } ], "source": [ "%%writefile ia_utils/create_db_table_metadata.py\n", "import datetime\n", "\n", "def create_db_table_metadata(db, drop=True):\n", " # If we want to remove the table completely, we can drop it\n", " if drop:\n", " db[\"metadata\"].drop(ignore=True)\n", " db[\"metadata\"].create({\n", " \"id\": str,\n", " \"date\": str,\n", " \"datetime\": datetime.datetime, # Use an actual time representation\n", " \"series\": str,\n", " \"vol\": str,\n", " \"iss\": str,\n", " \"title\": str, \n", " \"next_id\": str, \n", " \"prev_id\": str,\n", " \"is_index\": bool, # Is the record an index record\n", " \"restricted\": str, # should really be boolean\n", " }, pk=(\"id\"))" ] }, { "cell_type": "markdown", "id": "2fefb266", "metadata": {}, "source": [ "Now we can load the function back in from out package and call it:" ] }, { "cell_type": "code", "execution_count": 22, "id": "95c5ddad", "metadata": {}, "outputs": [], "source": [ "from ia_utils.create_db_table_metadata import create_db_table_metadata\n", "\n", "create_db_table_metadata(db)" ] }, { "cell_type": "markdown", "id": "7d7a89be", "metadata": {}, "source": [ "We need to do a little tidying of the records, but then we can add them directly to the database:" ] }, { "cell_type": "code", "execution_count": 23, "id": "c3cc0019", "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting ia_utils/add_patched_metadata_records_to_db.py\n" ] } ], "source": [ "%%writefile ia_utils/add_patched_metadata_records_to_db.py\n", "from tqdm.notebook import tqdm\n", "import dateparser\n", "\n", "def add_patched_metadata_records_to_db(db, data_records):\n", " \"\"\"Add metadata records to database.\"\"\"\n", " # Patch records to include a parsed datetime element\n", " for record in tqdm(data_records):\n", " # Parse the raw date into a date object\n", " # Need to handle a YYYY - YYYY exception\n", " # If we detect this form, use the last year for the record\n", " if len(record['date'].split()[0]) > 1:\n", " record['datetime'] = dateparser.parse(record['date'].split()[-1])\n", " else:\n", " record['datetime'] = dateparser.parse(record['date'])\n", "\n", " record['is_index'] = 'index' in record['title'].lower() # We assign the result of a logical test\n", "\n", " # Add records to the database\n", " db[\"metadata\"].insert_all(data_records)" ] }, { "cell_type": "markdown", "id": "c98ba964", "metadata": {}, "source": [ "Let's call that function and add our metadata data records:" ] }, { "cell_type": "code", "execution_count": 24, "id": "34b9747b", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c85d638eef0f4108bac860ee4feaf8d6", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/5695 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.9/site-packages/dateparser/date_parser.py:35: PytzUsageWarning: 
The localize method is no longer necessary, as this time zone supports the fold attribute (PEP 495). For more details on migrating to a PEP 495-compliant implementation, see https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html\n", " date_obj = stz.localize(date_obj)\n" ] } ], "source": [ "from ia_utils.add_patched_metadata_records_to_db import add_patched_metadata_records_to_db\n", "\n", "add_patched_metadata_records_to_db(db, data_records)" ] }, { "cell_type": "markdown", "id": "e569c96c", "metadata": {}, "source": [ "We can then query the data, for example to return the first few rows:" ] }, { "cell_type": "code", "execution_count": 25, "id": "48548b00", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>date</th>\n", " <th>datetime</th>\n", " <th>series</th>\n", " <th>vol</th>\n", " <th>iss</th>\n", " <th>title</th>\n", " <th>next_id</th>\n", " <th>prev_id</th>\n", " <th>is_index</th>\n", " <th>restricted</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>sim_notes-and-queries_1849-11-03_1_1</td>\n", " <td>1849-11-03</td>\n", " <td>1849-11-03T00:00:00</td>\n", " <td>None</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Notes and Queries 1849-11-03: Vol 1 Iss 1</td>\n", " <td>sim_notes-and-queries_1849-11-10_1_2</td>\n", " <td>sim_notes-and-queries_1849-1850_1_index</td>\n", " <td>0</td>\n", " <td></td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>sim_notes-and-queries_1849-11-10_1_2</td>\n", " <td>1849-11-10</td>\n", " <td>1849-11-10T00:00:00</td>\n", " <td>None</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>Notes and Queries 1849-11-10: Vol 1 Iss 2</td>\n", " <td>sim_notes-and-queries_1849-11-17_1_3</td>\n", " <td>sim_notes-and-queries_1849-11-03_1_1</td>\n", " <td>0</td>\n", " <td></td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>sim_notes-and-queries_1849-11-17_1_3</td>\n", " <td>1849-11-17</td>\n", " <td>1849-11-17T00:00:00</td>\n", " <td>None</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>Notes and Queries 1849-11-17: Vol 1 Iss 3</td>\n", " <td>sim_notes-and-queries_1849-11-24_1_4</td>\n", " <td>sim_notes-and-queries_1849-11-10_1_2</td>\n", " <td>0</td>\n", " <td></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>sim_notes-and-queries_1849-11-24_1_4</td>\n", " <td>1849-11-24</td>\n", " <td>1849-11-24T00:00:00</td>\n", " <td>None</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>Notes and Queries 1849-11-24: Vol 1 Iss 4</td>\n", " <td>sim_notes-and-queries_1849-12-01_1_5</td>\n", " <td>sim_notes-and-queries_1849-11-17_1_3</td>\n", " <td>0</td>\n", " <td></td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>sim_notes-and-queries_1849-12-01_1_5</td>\n", " <td>1849-12-01</td>\n", " <td>1849-12-01T00:00:00</td>\n", " <td>None</td>\n", " <td>1</td>\n", " <td>5</td>\n", " <td>Notes and Queries 1849-12-01: Vol 1 Iss 5</td>\n", " <td>sim_notes-and-queries_1849-12-08_1_6</td>\n", " <td>sim_notes-and-queries_1849-11-24_1_4</td>\n", " <td>0</td>\n", " <td></td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id date datetime \\\n", "0 sim_notes-and-queries_1849-11-03_1_1 1849-11-03 1849-11-03T00:00:00 
\n", "1 sim_notes-and-queries_1849-11-10_1_2 1849-11-10 1849-11-10T00:00:00 \n", "2 sim_notes-and-queries_1849-11-17_1_3 1849-11-17 1849-11-17T00:00:00 \n", "3 sim_notes-and-queries_1849-11-24_1_4 1849-11-24 1849-11-24T00:00:00 \n", "4 sim_notes-and-queries_1849-12-01_1_5 1849-12-01 1849-12-01T00:00:00 \n", "\n", " series vol iss title \\\n", "0 None 1 1 Notes and Queries 1849-11-03: Vol 1 Iss 1 \n", "1 None 1 2 Notes and Queries 1849-11-10: Vol 1 Iss 2 \n", "2 None 1 3 Notes and Queries 1849-11-17: Vol 1 Iss 3 \n", "3 None 1 4 Notes and Queries 1849-11-24: Vol 1 Iss 4 \n", "4 None 1 5 Notes and Queries 1849-12-01: Vol 1 Iss 5 \n", "\n", " next_id \\\n", "0 sim_notes-and-queries_1849-11-10_1_2 \n", "1 sim_notes-and-queries_1849-11-17_1_3 \n", "2 sim_notes-and-queries_1849-11-24_1_4 \n", "3 sim_notes-and-queries_1849-12-01_1_5 \n", "4 sim_notes-and-queries_1849-12-08_1_6 \n", "\n", " prev_id is_index restricted \n", "0 sim_notes-and-queries_1849-1850_1_index 0 \n", "1 sim_notes-and-queries_1849-11-03_1_1 0 \n", "2 sim_notes-and-queries_1849-11-10_1_2 0 \n", "3 sim_notes-and-queries_1849-11-17_1_3 0 \n", "4 sim_notes-and-queries_1849-11-24_1_4 0 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pandas import read_sql\n", "\n", "q = \"SELECT * FROM metadata LIMIT 5\"\n", "\n", "read_sql(q, db.conn)" ] }, { "cell_type": "markdown", "id": "826f213f", "metadata": {}, "source": [ "Or we could return the identifiers for index issues between 1875 and 1877:" ] }, { "cell_type": "code", "execution_count": 26, "id": "3ca9a0db", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>title</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>sim_notes-and-queries_1875_3_index</td>\n", " <td>Notes and Queries 1875: Vol 3 Index</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>sim_notes-and-queries_1875_4_index</td>\n", " <td>Notes and Queries 1875: Vol 4 Index</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>sim_notes-and-queries_1876_5_index</td>\n", " <td>Notes and Queries 1876: Vol 5 Index</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>sim_notes-and-queries_1876_6_index</td>\n", " <td>Notes and Queries 1876: Vol 6 Index</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>sim_notes-and-queries_1877_7_index</td>\n", " <td>Notes and Queries 1877: Vol 7 Index</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>sim_notes-and-queries_1877_8_index</td>\n", " <td>Notes and Queries 1877: Vol 8 Index</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id title\n", "0 sim_notes-and-queries_1875_3_index Notes and Queries 1875: Vol 3 Index\n", "1 sim_notes-and-queries_1875_4_index Notes and Queries 1875: Vol 4 Index\n", "2 sim_notes-and-queries_1876_5_index Notes and Queries 1876: Vol 5 Index\n", "3 sim_notes-and-queries_1876_6_index Notes and Queries 1876: Vol 6 Index\n", "4 sim_notes-and-queries_1877_7_index Notes and Queries 1877: Vol 7 Index\n", "5 sim_notes-and-queries_1877_8_index Notes and Queries 1877: Vol 8 Index" ] }, "execution_count": 26, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "q = \"\"\"\n", "SELECT id, title\n", "FROM metadata\n", "WHERE is_index = 1\n", " -- Extract the year\n", " AND strftime('%Y', datetime) >= '1875'\n", " AND strftime('%Y', datetime) <= '1877'\n", "\"\"\"\n", "\n", "read_sql(q, db.conn)" ] }, { "cell_type": "markdown", "id": "16715d99", "metadata": {}, "source": [ "By inspection of the list of index entries, we note that at some point cumulative indexes over a set of years, as well as volume level indexes, were made available. Cumulative indexes include:\n", "\n", "- Notes and Queries 1892 - 1897: Vol 1-12 Index\n", "- Notes and Queries 1898 - 1903: Vol 1-12 Index\n", "- Notes and Queries 1904 - 1909: Vol 1-12 Index\n", "- Notes and Queries 1910 - 1915: Vol 1-12 Index\n", "\n", "In this first pass, we shall just ignore the cumulative indexes.\n", "\n", "At this point, it is not clear where we might reliably obtain the series information from." ] }, { "cell_type": "markdown", "id": "3ba87833", "metadata": {}, "source": [ "To make the data easier to work with, we can parse the date as a date thing (technical term!;-) using tools in the Python `dateparser` package:" ] }, { "cell_type": "code", "execution_count": 27, "id": "d07a59a2", "metadata": {}, "outputs": [], "source": [ "import dateparser" ] }, { "cell_type": "markdown", "id": "08a4db56", "metadata": {}, "source": [ "The parsed data provides ways of comparing dates, extracting month and year, and so on." ] }, { "cell_type": "code", "execution_count": 41, "id": "6c2bdb44", "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "PytzUsageWarning: The localize method is no longer necessary, as this time zone supports the fold attribute (PEP 495). For more details on migrating to a PEP 495-compliant implementation, see https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html [date_parser.py:35]\n" ] }, { "data": { "text/plain": [ "[{'id': 'sim_notes-and-queries_1850_2_index',\n", " 'date': '1850',\n", " 'title': 'Notes and Queries 1850: Vol 2 Index',\n", " 'vol': '2',\n", " 'iss': 'Index',\n", " 'prev_id': 'sim_notes-and-queries_1850-05-25_1_30',\n", " 'next_id': 'sim_notes-and-queries_1850-06-01_2_31',\n", " 'restricted': '',\n", " 'datetime': datetime.datetime(1850, 3, 20, 0, 0),\n", " 'is_index': True},\n", " {'id': 'sim_notes-and-queries_1851_3_index',\n", " 'date': '1851',\n", " 'title': 'Notes and Queries 1851: Vol 3 Index',\n", " 'vol': '3',\n", " 'iss': 'Index',\n", " 'prev_id': 'sim_notes-and-queries_1850-12-28_2_61',\n", " 'next_id': 'sim_notes-and-queries_1851-01-04_3_62',\n", " 'restricted': '',\n", " 'datetime': datetime.datetime(1851, 3, 20, 0, 0),\n", " 'is_index': True},\n", " {'id': 'sim_notes-and-queries_1851_4_index',\n", " 'date': '1851',\n", " 'title': 'Notes and Queries 1851: Vol 4 Index',\n", " 'vol': '4',\n", " 'iss': 'Index',\n", " 'prev_id': 'sim_notes-and-queries_1851-06-28_3_87',\n", " 'next_id': 'sim_notes-and-queries_1851-07-05_4_88',\n", " 'restricted': '',\n", " 'datetime': datetime.datetime(1851, 3, 20, 0, 0),\n", " 'is_index': True}]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "indexes = []\n", "\n", "# Get index records up to 1900\n", "max_year = 1900\n", "\n", "for record in data_records:\n", " # Only look at index records\n", " # exclude cumulative indexes\n", " if 'index' in record['id'] and \"cumulative\" not in record['id']:\n", " # Need to handle a YYYY - YYYY exception\n", " # If we detect it, ignore it\n", 
" if len(record['date'].split()) > 1:\n", " continue\n", " \n", " # Parse the year into a date object\n", " # Then filter by year\n", " if dateparser.parse(record['date'].split()[0]).year >= max_year:\n", " break\n", " indexes.append(record) \n", "\n", "# Preview the first three index records\n", "indexes[:3]" ] }, { "cell_type": "markdown", "id": "322636c2", "metadata": {}, "source": [ "To generate the complete PDF index, we need to do several things:\n", "\n", "- iterate through the list of index records;\n", "- for each one, download the associated PDF to a directory;\n", "- merge all the downloaded files into a single PDF;\n", "- optionally, delete the original PDF files." ] }, { "cell_type": "markdown", "id": "b5aa81f9", "metadata": {}, "source": [ "### Working With PDF Files Downloaded from the Internet Archive\n", "\n", "We can download files from the Internet Archive using the `internetarchive.download()` function. This takes a list of items via a `formats` parameter for the files we want to download. For example, we might want to download the \"Text PDF\" (a PDF file with full text search), or a simple text file containing just the OCR captured text (`OCR Search Text`), or both.\n", "\n", "We can also specify the directory into which the files are downloaded.\n", "\n", "Let's import the packages that help simplify this task, and create a path to our desired download directory:" ] }, { "cell_type": "code", "execution_count": 29, "id": "d6f2e272", "metadata": {}, "outputs": [], "source": [ "# Import the necessary packages\n", "from internetarchive import download" ] }, { "cell_type": "markdown", "id": "90d32fde", "metadata": {}, "source": [ "To keep our files organised, we'll create a directory into which we can download the files:" ] }, { "cell_type": "code", "execution_count": 30, "id": "6397de0e", "metadata": {}, "outputs": [], "source": [ "# Create download dir file path\n", "dirname = 'ia-downloads'\n", "\n", "p = Path(dirname)" ] }, { "cell_type": "markdown", "id": "bdfeee1c", "metadata": {}, "source": [ "One of the ways we can work with the data is to process it using Python programming code.\n", "\n", "For example, we can iterate through the index records and download the required files:" ] }, { "cell_type": "code", "execution_count": 44, "id": "5383e088", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f24eb59a82034f1c9bb0f4e0080c20f1", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/98 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Use tqdm for provide a progress bar\n", "for record in tqdm(indexes):\n", " _id = record['id']\n", " \n", " # Download PDF - this may take time to retrieve / download\n", " # This downloads to a directory with the same name as the record id\n", " # The file name is akin to ${id}.pdf\n", " download(_id, destdir=p, silent = True,\n", " formats=[\"Text PDF\", \"OCR Search Text\"])" ] }, { "cell_type": "markdown", "id": "7050f184", "metadata": {}, "source": [ "To create single monolithic PDF, we can use another fragment of code to iterate through the downloaded PDF files, adding each one to a single merged PDF file object. We can also create and insert a reference page between each of the original documents to provide provenance if the is no date on the index pages.\n", "\n", "Let's start by seeing how to create a simple PDF page. 
{ "cell_type": "markdown", "id": "7050f184", "metadata": {}, "source": [ "To create a single monolithic PDF, we can use another fragment of code to iterate through the downloaded PDF files, adding each one to a single merged PDF file object. We can also create and insert a reference page between each of the original documents to provide provenance if there is no date on the index pages.\n", "\n", "Let's start by seeing how to create a simple PDF page. The `reportlab` Python package provides various tools for creating simple PDF documents:" ] }, { "cell_type": "code", "execution_count": 45, "id": "ddb3daad", "metadata": {}, "outputs": [], "source": [ "#%pip install --upgrade reportlab\n", "from reportlab.pdfgen.canvas import Canvas" ] }, { "cell_type": "markdown", "id": "4bbeb602", "metadata": {}, "source": [ "For example, we can create a simple single page document that we can add index metadata to and then insert in between the pages of each index issue:" ] }, { "cell_type": "code", "execution_count": 46, "id": "2f766435", "metadata": {}, "outputs": [], "source": [ "# Create a page canvas\n", "test_pdf = \"test-page.pdf\"\n", "canvas = Canvas(test_pdf)\n", "\n", "# Write something on the page at a particular location\n", "# In this case, let's use the title from the first index record\n", "txt = indexes[0]['title']\n", "# Co-ordinate origin is bottom left of the page\n", "# Scale is points, where 72 points = 1 inch\n", "canvas.drawString(72, 10*72, txt)\n", "\n", "# Save the page\n", "canvas.save()" ] }, { "cell_type": "markdown", "id": "e8a9852c", "metadata": {}, "source": [ "Now we can preview the test page:" ] }, { "cell_type": "code", "execution_count": 47, "id": "89a677ab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " <iframe\n", " width=\"600\"\n", " height=\"500\"\n", " src=\"test-page.pdf\"\n", " frameborder=\"0\"\n", " allowfullscreen\n", " \n", " ></iframe>\n", " " ], "text/plain": [ "<IPython.lib.display.IFrame at 0x124137df0>" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import IFrame\n", "\n", "IFrame(test_pdf, width=600, height=500)" ] }, { "cell_type": "markdown", "id": "497edbdb", "metadata": {}, "source": [ "A simple function lets us generate a simple page rendering a short text string:" ] }, { "cell_type": "code", "execution_count": 48, "id": "c7e0acec", "metadata": {}, "outputs": [], "source": [ "def make_pdf_page(txt, fn=\"test_pdf.pdf\"):\n", " \"\"\"Generate a single page PDF document containing a short text string.\"\"\"\n", " canvas = Canvas(fn)\n", "\n", " # Write something on the page at a particular location\n", " # Co-ordinate origin is bottom left of the page\n", " # Scale is points, where 72 points = 1 inch\n", " canvas.drawString(72, 10*72, txt)\n", "\n", " # Save the page\n", " canvas.save()\n", " \n", " return fn" ] }, { "cell_type": "markdown", "id": "e49bf40c", "metadata": {}, "source": [ "Let's now create our monolithic index with metadata page inserts.\n", "\n", "The `PyPDF2` package contains various tools for splitting and combining PDF documents:" ] }, { "cell_type": "code", "execution_count": 49, "id": "72a8cbcf", "metadata": {}, "outputs": [], "source": [ "from PyPDF2 import PdfFileReader, PdfFileMerger" ] }, { "cell_type": "markdown", "id": "cbc0dd91", "metadata": {}, "source": [ "We can use it to merge our separate index cover page and index issue documents, for example:" ] }, { "cell_type": "code", "execution_count": 50, "id": "f75ce4c8", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f04d8dc670134807b3001a24b5f79eaf", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/98 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create a PDF merger object\n", "output = PdfFileMerger()\n", "\n", "# Generate a monolithic PDF index file by concatenating the pages\n", "# from each individual PDF index file\n", "# Use tqdm to provide a progress bar\n", "for record 
in tqdm(indexes):\n", " # Generate some metadata:\n", " txt = record['title']\n", " metadata_pdf = make_pdf_page(txt)\n", " # Add this to the output document\n", " output.append(metadata_pdf)\n", " # Delete the metadata file\n", " Path(metadata_pdf).unlink()\n", "\n", " # Get the record ID\n", " _id = record['id']\n", "\n", " # Locate the file and merge it into the monolithic PDF\n", " output.append((p / _id / f'{_id}.pdf').as_posix())\n", " \n", "# Write merged PDF file\n", "with open(\"notes_and_queries_big_index.pdf\", \"wb\") as output_stream:\n", " output.write(output_stream)\n", "\n", "output = None" ] }, { "cell_type": "markdown", "id": "05180cfd", "metadata": {}, "source": [ "The resulting PDF is a large document (about 100MB) that collects all the separate indexes in one place, although not as a single, *reconciled* index: if the same index term appears in multiple index documents, there will be multiple occurrences of that term in the longer document.\n", "\n", "However, if we do need a PDF reference to the index, it is useful to have one to hand." ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }