{ "cells": [ { "cell_type": "markdown", "id": "e86d9d2b", "metadata": {}, "source": [ "# Jacobs' Fairy Tales\n", "\n", "This recipe shows how to scrape Jacobs' fairy tale collections from the source OCR search text documents returned from the Internet Archive.\n", "\n", "The works include:\n", "\n", "- [*English Fairy Tales*](https://archive.org/details/englishfairytal00jacogoog/)\n", "- [*More English Fairy Tales*](https://archive.org/details/moreenglishfairy00jaco2/)\n", "- [*Celtic Fairy Tales*](https://archive.org/details/celticfairytale00conggoog)\n", "- [*More Celtic Fairy Tales*](https://archive.org/details/morecelticfairyt00jaco/)\n", "- [*Indian Fairy Tales*](https://archive.org/details/indianfairytales00jaco)\n", "- [*European Folk and Fairy Tales*](https://archive.org/details/europeanfolkfair00jaco/)\n", "\n", "Most of the texts can also be found on the [*Sacred Texts*](https://www.sacred-texts.com/) website:\n", "\n", "- https://www.sacred-texts.com/neu/eng/eft/index.htm\n", "- https://www.sacred-texts.com/neu/eng/meft/index.htm\n", "- https://sacred-texts.com/neu/celt/cft/index.htm\n", "- https://sacred-texts.com/neu/celt/mcft/index.htm\n", "- https://sacred-texts.com/hin/ift/index.htm\n", "- *European Folk and Fairy Tales* does not appear to be available.\n", "\n", "The approach explores how we can \"chunk\" the original text into separate stories, and suggests that a combined human + machine strategy may be more realistic than a purely automated one." ] }, { "cell_type": "markdown", "id": "75836861", "metadata": {}, "source": [ "```{warning}\n", "For each of the works on archive.org, several different scanned versions of the text may be available. A quick look at the full text document for each version will give a feel for how effective the OCR process was. 
Ideally, we're looking for full text that was recognised cleanly and is not full of typographical errors.\n", "```" ] }, { "cell_type": "code", "execution_count": 103, "id": "fb093b30-e088-44f2-b35c-20b1cb95b1d2", "metadata": {}, "outputs": [], "source": [ "# Support dynamic reloading if we update saved module files\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "id": "d7c52c68-0c83-4d1e-a6ea-64a4bd55eba8", "metadata": {}, "source": [ "## Simple Book Indexer\n", "\n", "We can reuse various recipes we have developed previously to create a simple, searchable database over Jacobs' fairy tale collections.\n", "\n", "The original texts are available (in various forms) via the Internet Archive. However, the text quality may be quite poor.\n", "\n", "Most of the books are also available from the *Sacred Texts* website." ] }, { "cell_type": "code", "execution_count": 71, "id": "89165fb0-230a-491a-b022-a6fdc0e6dafa", "metadata": {}, "outputs": [], "source": [ "book_ids = {\"English Fairy Tales\": {\"ia\": \"englishfairytal00jacogoog\",\n", " \"st\": \"neu/eng/eft/index.htm\" },\n", " \"More English Fairy Tales\": {\"ia\": \"moreenglishfairy00jaco2\",\n", " \"st\": \"neu/eng/meft/index.htm\"},\n", " \"Celtic Fairy Tales\": {\"ia\": \"celticfairytale00conggoog\",\n", " \"st\": \"neu/celt/cft/index.htm\"},\n", " \"More Celtic Fairy Tales\": {\"ia\": \"morecelticfairyt00jaco\",\n", " \"st\": \"neu/celt/mcft/index.htm\"},\n", " \"Indian Fairy Tales\": {\"ia\": \"indianfairytales00jaco\",\n", " \"st\": \"hin/ift/index.htm\"},\n", " \"European Fairy Tales\": {\"ia\": \"europeanfolkfair00jaco\"}\n", " }" ] }, { "cell_type": "markdown", "id": "ac646827-3c46-4abf-ae57-52fe20c26e79", "metadata": {}, "source": [ "Create a simple database." 
] }, { "cell_type": "code", "execution_count": 106, "id": "ff27b46a-8156-421b-af00-bfaac3fdd1d2", "metadata": {}, "outputs": [], "source": [ "from sqlite_utils import Database\n", "\n", "db_name = \"jacobs_fairy_tale.db\"\n", "\n", "# Uncomment the following lines to connect to a pre-existing database\n", "#db = Database(db_name)" ] }, { "cell_type": "code", "execution_count": 107, "id": "57eeaa0b-f3c4-4740-81a1-346fe09e1ca7", "metadata": {}, "outputs": [], "source": [ "# Do not run this cell if your database already exists!\n", "\n", "# While developing the script, recreate database each time...\n", "db = Database(db_name, recreate=True)" ] }, { "cell_type": "markdown", "id": "4d5885bf-e863-4956-9bbd-a103c01fb1a5", "metadata": {}, "source": [ "The following function starts to build on the schema developed to index the Lang Fairy Tales collection." ] }, { "cell_type": "code", "execution_count": 114, "id": "53ddd828-38a4-4fcf-8992-e431466b693b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting ia_utils/create_db_tables_book.py\n" ] } ], "source": [ "%%writefile ia_utils/create_db_tables_book.py\n", "def create_db_tables_book(db, drop=True):\n", " \"\"\"Create a database table and an associated full-text search table.\"\"\"\n", " # If required, drop any previously defined tables of the same name\n", " table_name = \"stories\"\n", " if drop:\n", " db[table_name].drop(ignore=True)\n", " db[f\"{table_name}_fts\"].drop(ignore=True)\n", " elif db[table_name].exists():\n", " print(f\"Table {table_name} exists...\")\n", " return\n", "\n", " # This schema has been evolved iteratively as I have identified structure\n", " # that can be usefully mined...\n", "\n", " db[table_name].create({\n", " \"book_id\": str,\n", " \"book_title\": str,\n", " \"story_id\": str,\n", " \"story_title\": str,\n", " \"story_text\": str,\n", " \"last_para\": str, # sometimes contains provenance\n", " \"first_line\": str, # maybe we want to review the 
openings, or create an index...\n", " \"provenance\": str, # attempt at provenance\n", " \"chapter_order\": int, # Sort order of stories in book\n", " }, pk=(\"story_id\"))\n", "\n", " # Enable full text search\n", " # This creates an extra virtual table (stories_fts) to support the full text search\n", " # No stemmer is applied by default; pass tokenize=\"porter\" to enable_fts() to add one\n", " db[table_name].enable_fts([\"story_title\", \"story_text\"], create_triggers=True)" ] }, { "cell_type": "markdown", "id": "6fdc2b39-9d21-4e94-aadc-868536549c93", "metadata": {}, "source": [ "Create a `stories` table in the database, along with a full-text search index for it." ] }, { "cell_type": "code", "execution_count": 115, "id": "8fabc7f2-ce0a-41e5-b5e4-bdc098d94614", "metadata": {}, "outputs": [], "source": [ "from ia_utils.create_db_tables_book import create_db_tables_book\n", "\n", "create_db_tables_book(db)" ] }, { "cell_type": "markdown", "id": "d506843f-b538-492a-a313-ca74532d5731", "metadata": {}, "source": [ "Preview the tables and their columns:" ] }, { "cell_type": "code", "execution_count": 116, "id": "7bba17ff-2d23-4f33-9365-57727f280523", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[
\\n \\n | \\n \\n English Fairy Tales\\nby Joseph Jacobs\\n[1890]\\n | \\n