# Jacobs' Fairy Tales

This recipe shows how to scrape Jacobs' fairy tale collections from source OCR search text documents returned from the Internet Archive.

The works include:

- [*English Fairy Tales*](https://archive.org/details/englishfairytal00jacogoog/)
- [*More English Fairy Tales*](https://archive.org/details/moreenglishfairy00jaco2/)
- [*Celtic Fairy Tales*](https://archive.org/details/celticfairytale00conggoog)
- [*More Celtic Fairy Tales*](https://archive.org/details/morecelticfairyt00jaco/)
- [*Indian Fairy Tales*](https://archive.org/details/indianfairytales00jaco)
- [*European Folk and Fairy Tales*](https://archive.org/details/europeanfolkfair00jaco/)

Most of the texts can also be found on the [*Sacred Texts*](https://www.sacred-texts.com/) website:

- https://www.sacred-texts.com/neu/eng/eft/index.htm
- https://www.sacred-texts.com/neu/eng/meft/index.htm
- https://sacred-texts.com/neu/celt/cft/index.htm
- https://sacred-texts.com/neu/celt/mcft/index.htm
- https://sacred-texts.com/hin/ift/index.htm
- *European Folk and Fairy Tales* does not appear to be available.

This recipe explores how we can "chunk" the original text into separate stories, and suggests that a combined human + machine strategy may be more realistic than a purely automated one.

```{warning}
For each of the works on archive.org, several different scanned versions of the text may be available. A quick look at the full text document for each version will give a feel for how effective the OCR process was. Ideally, we're looking for full text that was recognised cleanly and is not full of typographical errors.
```
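
As a complement to eyeballing the full text, a crude numeric check can help compare scanned versions. The following is a minimal sketch only: the function name `ocr_noise_ratio` and the "looks like a word" pattern are assumptions made for illustration, and legitimate tokens such as dates or page numbers will be counted as noise.

```python
import re

# A token counts as "clean" if it is made of letters (optionally with internal
# apostrophes or hyphens) and optional leading/trailing punctuation.
CLEAN_TOKEN = re.compile(r"^[\"'(\[]*[A-Za-z]+(?:['-][A-Za-z]+)*[.,;:!?\"')\]]*$")

def ocr_noise_ratio(text):
    """Return the fraction of whitespace-separated tokens that do not look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    noisy = [t for t in tokens if not CLEAN_TOKEN.match(t)]
    return len(noisy) / len(tokens)

# Cleanly recognised text scores low; text with digit/letter confusions scores high
clean_score = ocr_noise_ratio("Once upon a time there was a king.")
noisy_score = ocr_noise_ratio("0nce up0n a t1me")
```

A lower ratio suggests a cleaner scan, but the score is only a rough guide for choosing between versions, not a substitute for reading a sample of the text.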

In [103]:
# Support dynamic reloading if we update saved module files
%load_ext autoreload
%autoreload 2

## Simple Book Indexer

We can reuse various recipes we have developed previously to create a simple, searchable database over Jacobs' fairy tale collections.

The original texts are available (in various forms) via the Internet Archive. However, the text quality may be quite poor.

Most of the books are also available from the *Sacred Texts* website.

In [71]:
book_ids = {"English Fairy Tales": {"ia": "englishfairytal00jacogoog",
                                    "st": "neu/eng/eft/index.htm" },
            "More English Fairy Tales": {"ia": "moreenglishfairy00jaco2",
                                         "st": "neu/eng/meft/index.htm"},
            "Celtic Fairy Tales": {"ia": "celticfairytale00conggoog",
                                   "st": "neu/celt/cft/index.htm"},
            "More Celtic Fairy Tales": {"ia": "morecelticfairyt00jaco",
                                        "st": "neu/celt/mcft/index.htm"},
            "Indian Fairy Tales": {"ia": "indianfairytales00jaco",
                                   "st": "hin/ift/index.htm"},
            "European Fairy Tales": {"ia": "europeanfolkfair00jaco"}
           }
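
The identifiers in a dictionary like this map directly onto the URLs listed at the top of the page. As a minimal sketch (the helper name `get_source_urls` and the cut-down `sample_book_ids` dictionary are introduced here for illustration only), we can expand the identifiers into full URLs and see which books lack a *Sacred Texts* source:

```python
IA_DETAILS_URL = "https://archive.org/details"
ST_BASE_URL = "https://www.sacred-texts.com"

# A cut-down copy of the identifiers, so the example is self-contained
sample_book_ids = {
    "English Fairy Tales": {"ia": "englishfairytal00jacogoog",
                            "st": "neu/eng/eft/index.htm"},
    "European Fairy Tales": {"ia": "europeanfolkfair00jaco"},
}

def get_source_urls(book_ids):
    """Map each book title to its Internet Archive and Sacred Texts URLs (or None)."""
    urls = {}
    for title, ids in book_ids.items():
        urls[title] = {
            "ia": f"{IA_DETAILS_URL}/{ids['ia']}" if "ia" in ids else None,
            "st": f"{ST_BASE_URL}/{ids['st']}" if "st" in ids else None,
        }
    return urls

source_urls = get_source_urls(sample_book_ids)
# Books with no Sacred Texts edition are reported with a None value
missing_st = [title for title, u in source_urls.items() if u["st"] is None]
```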

Create a simple database.

In [106]:
from sqlite_utils import Database

db_name = "jacobs_fairy_tale.db"

# Uncomment the following lines to connect to a pre-existing database
#db = Database(db_name)

In [107]:
# Do not run this cell if your database already exists!

# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)

The following function starts to build on the schema developed to index the Lang Fairy Tales collection.

In [114]:
%%writefile ia_utils/create_db_tables_book.py
def create_db_tables_book(db, drop=True):
    """Create a database table and an associated full-text search table."""
    # If required, drop any previously defined tables of the same name
    table_name = "stories"
    if drop:
        db[table_name].drop(ignore=True)
        db[f"{table_name}_fts"].drop(ignore=True)
    elif db[table_name].exists():
        print(f"Table {table_name} exists...")
        return

    # This schema has been evolved iteratively as I have identified structure
    # that can be usefully mined...

    db[table_name].create({
        "book_id": str,
        "book_title": str,
        "story_id": str,
        "story_title": str,
        "story_text": str,
        "last_para": str, # sometimes contains provenance
        "first_line": str, # maybe we want to review the openings, or create an index...
        "provenance": str, # attempt at provenance
        "chapter_order": int, # Sort order of stories in book
    }, pk=("story_id"))

    # Enable full text search
    # This creates an extra virtual table (stories_fts) to support the full text search
    # No tokenizer is specified here, so the default one is used;
    # passing tokenize="porter" would add stemming to the full-text search
    db[table_name].enable_fts(["story_title", "story_text"], create_triggers=True)

Overwriting ia_utils/create_db_tables_book.py


Create a `stories` table in the database, along with a full-text search index for it.

In [115]:
from ia_utils.create_db_tables_book import create_db_tables_book

create_db_tables_book(db)

Preview the tables and their columns:

In [116]:
db.tables

[<Table stories (book_id, book_title, story_id, story_title, story_text, last_para, first_line, provenance, chapter_order)>,
 <Table stories_fts (story_title, story_text)>,
 <Table stories_fts_data (id, block)>,
 <Table stories_fts_idx (segid, term, pgno)>,
 <Table stories_fts_docsize (id, sz)>,
 <Table stories_fts_config (k, v)>]
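
The `stories_fts` table listed above is an SQLite full-text search table. As a rough illustration of how such a table behaves, and of what a stemming tokenizer adds, here is a minimal sketch using only the standard library `sqlite3` module. The table name `demo_fts` and the sample rows are made up for the example, it assumes the local SQLite build includes the FTS5 extension, and it is not the exact SQL that `sqlite_utils` generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A standalone FTS5 table that uses the Porter stemmer
conn.execute(
    "CREATE VIRTUAL TABLE demo_fts "
    "USING fts5(story_title, story_text, tokenize='porter')"
)
conn.executemany(
    "INSERT INTO demo_fts (story_title, story_text) VALUES (?, ?)",
    [
        ("The Black Horse", "Once there was a king and he had three sons."),
        ("Tom Tit Tot", "Once upon a time there was a woman, and she baked five pies."),
    ],
)

# With stemming, a search for "son" also matches the indexed word "sons"
rows = conn.execute(
    "SELECT story_title FROM demo_fts WHERE demo_fts MATCH ?", ("son",)
).fetchall()
conn.close()
```

Without the `tokenize='porter'` option, the same query would only match rows containing the exact token `son`.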

## Scrape the Sacred Texts Website

Downloadable zip files of the text from the *Sacred Texts* website only seem to be available for the *Celtic Fairy Tales* collection, so let's write a simple scraper to pull the texts, a story at a time, from each book page.

First, we need to get the links to the chapters from a book page:

In [117]:
# These packages make it easy to download web pages so that we can work with them
import requests
# "Caching" pages means grabbing a local copy of the page so we only need to download it once
import requests_cache
from datetime import timedelta

requests_cache.install_cache('web_cache',
                             backend='sqlite',
                             expire_after=timedelta(days=1000))

In [118]:
# Specify the URL of the page we want to download
BASE_URL = "https://www.sacred-texts.com"

In [119]:
def get_st_url(book, base_url=BASE_URL):
    stub = book_ids[book]["st"]
    return f'{base_url}/{stub}'

example_book_url = get_st_url("English Fairy Tales")
example_book_url

'https://www.sacred-texts.com/neu/eng/eft/index.htm'

In [120]:
# And then grab the page
html = requests.get(example_book_url)

# Preview some of the raw web page / HTML text in the page we just downloaded
html.text[:5000]

'<HTML>\n <HEAD>\n<!-- Global site tag (gtag.js) - Google Analytics -->\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-12241170-1"></script>\n<script>\n  window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(arguments);}\n  gtag(\'js\', new Date());\n\n  gtag(\'config\', \'UA-12241170-1\');\n  gtag(\'config\', \'GA_MEASUREMENT_ID\', {\n    \'linker\': {\n      \'domains\': [\'sacred-texts.com\', \'next.sacred-texts.com\', \'happyvegan.jp\', \'next.happyvegan.jp\', \'sacred-texts.online\', \'next.sacred-texts.online\']\n    }\n  });\n</script>\n<!-- End Global site tag (gtag.js) - Google Analytics -->\n<META name="description" content="English Fairy Tales, by Joseph Jacobs, at sacred-texts.com">\n <META name="keywords" content="English Fairytale Fairy Tales Folklore Mythology England">\n <TITLE>English Fairy Tales Index</TITLE>\n </HEAD>\n <BODY>\n \n \n <CENTER>\n <A HREF="../../../cdshop/index.htm"><IMG SRC="../../../cdshop/cdinfo.jpg" BORDER

The book index pages contain links to separate chapter (i.e. story) pages:

In [121]:
# The BeautifulSoup package provides a range of tools
# that help us work with the downloaded web page,
# such as extracting particular elements from it
from bs4 import BeautifulSoup

# The "soup" is a parsed and structured form of the page we downloaded
soup = BeautifulSoup(html.content, "html.parser")

# Find the anchor (<a>) elements containing the links
links_ = soup.find_all("a")

# Preview the extracted <a> elements, skipping the first five
links_[5:]

[<a href="eft01.htm">Preface</a>,
 <a href="eft02.htm">Tom Tit Tot</a>,
 <a href="eft03.htm">The Three Sillies</a>,
 <a href="eft04.htm">The Rose-Tree</a>,
 <a href="eft05.htm">The Old Woman and Her Pig</a>,
 <a href="eft06.htm">How Jack Went to Seek his Fortune</a>,
 <a href="eft07.htm">Mr Vinegar</a>,
 <a href="eft08.htm">Nix Nought Nothing</a>,
 <a href="eft09.htm">Jack Hannaford</a>,
 <a href="eft10.htm">Binnorie</a>,
 <a href="eft11.htm">Mouse and Mouser</a>,
 <a href="eft12.htm">Cap O' Rushes</a>,
 <a href="eft13.htm">Teeny-Tiny</a>,
 <a href="eft14.htm">Jack and the Beanstalk</a>,
 <a href="eft15.htm">The Story of the Three Little Pigs</a>,
 <a href="eft16.htm">The Master and His Pupil</a>,
 <a href="eft17.htm">Titty Mouse and Tatty Mouse</a>,
 <a href="eft18.htm">Jack and His Golden Snuff-Box</a>,
 <a href="eft19.htm">The Story of the Three Bears</a>,
 <a href="eft20.htm">Jack the Giant-Killer</a>,
 <a href="eft21.htm">Henny-Penny</a>,
 <a href="eft22.htm">Childe Rowland</a>,
 

We notice that the page links share a common prefix that we can also obtain from the book index URL:

In [122]:
stub = example_book_url.split("/")[-2]
stub

'eft'
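
Splitting the raw URL string works here, but it would pick up the wrong element if a query string or fragment containing a `/` were ever present. A slightly more defensive variant, sketched below, parses the URL first; the helper name `get_book_stub` is an assumption introduced for this example.

```python
from urllib.parse import urlparse

def get_book_stub(book_index_url):
    """Return the directory name that contains the book's index page."""
    # Only look at the path component, ignoring scheme, host, query and fragment
    path_parts = [p for p in urlparse(book_index_url).path.split("/") if p]
    # The last element is the index page itself; the one before it is the book directory
    return path_parts[-2]

eft_stub = get_book_stub("https://www.sacred-texts.com/neu/eng/eft/index.htm")
ift_stub = get_book_stub("https://sacred-texts.com/hin/ift/index.htm")
```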

Create a simple list of the story links:

In [123]:
story_links = [(l.text, l.get('href')) for l in links_ if l.get('href') and l.get('href').startswith(stub)]

# Tidy out links that aren't stories
story_links = [s for s in story_links if "00." not in s[1] and not any(x in s[0] for x in ["Preface", "Notes", "Title"])]
story_links

[('Tom Tit Tot', 'eft02.htm'),
 ('The Three Sillies', 'eft03.htm'),
 ('The Rose-Tree', 'eft04.htm'),
 ('The Old Woman and Her Pig', 'eft05.htm'),
 ('How Jack Went to Seek his Fortune', 'eft06.htm'),
 ('Mr Vinegar', 'eft07.htm'),
 ('Nix Nought Nothing', 'eft08.htm'),
 ('Jack Hannaford', 'eft09.htm'),
 ('Binnorie', 'eft10.htm'),
 ('Mouse and Mouser', 'eft11.htm'),
 ("Cap O' Rushes", 'eft12.htm'),
 ('Teeny-Tiny', 'eft13.htm'),
 ('Jack and the Beanstalk', 'eft14.htm'),
 ('The Story of the Three Little Pigs', 'eft15.htm'),
 ('The Master and His Pupil', 'eft16.htm'),
 ('Titty Mouse and Tatty Mouse', 'eft17.htm'),
 ('Jack and His Golden Snuff-Box', 'eft18.htm'),
 ('The Story of the Three Bears', 'eft19.htm'),
 ('Jack the Giant-Killer', 'eft20.htm'),
 ('Henny-Penny', 'eft21.htm'),
 ('Childe Rowland', 'eft22.htm'),
 ('Molly Whuppie', 'eft23.htm'),
 ('The Red Ettin', 'eft24.htm'),
 ('The Golden Arm', 'eft25.htm'),
 ('The History of Tom Thumb', 'eft26.htm'),
 ('Mr Fox', 'eft27.htm'),
 ('Lazy Jack
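
The two filtering steps can also be combined with resolving each relative link against the index page URL. The sketch below is illustrative: `select_story_links` is a hypothetical helper, the sample link list is hand-written rather than scraped (the `eft45.htm` entry in particular is invented), and `urljoin` is used in place of string replacement on `index.htm`.

```python
from urllib.parse import urljoin

EXCLUDED_TITLES = ("Preface", "Notes", "Title")

def select_story_links(links, stub, index_url):
    """Keep links that look like story pages and resolve them to absolute URLs."""
    selected = []
    for title, href in links:
        # Story pages share the book stub as a filename prefix
        if not href or not href.startswith(stub):
            continue
        # Skip the title page (…00.htm) and front/back matter
        if "00." in href or any(x in title for x in EXCLUDED_TITLES):
            continue
        selected.append((title, urljoin(index_url, href)))
    return selected

sample_links = [
    ("Sacred Texts", "../../../index.htm"),
    ("Title Page", "eft00.htm"),
    ("Preface", "eft01.htm"),
    ("Tom Tit Tot", "eft02.htm"),
    ("Notes and References", "eft45.htm"),  # invented filename, for illustration
]
selected = select_story_links(
    sample_links, "eft", "https://www.sacred-texts.com/neu/eng/eft/index.htm"
)
```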

The structure of the HTML page for each story may differ in certain respects, but in all but one case, it seems that there is *some* structure we can rely on: the title appears as the only heading, followed by the story text.

We can then parse that structure out:

In [124]:
from markdownify import markdownify

def get_stories_from_book(book):
    book_index_url = get_st_url(book)
    print(book_index_url)
    stories = []
    
    story = requests.get(book_index_url)
    # The "soup" is a parsed and structured form of the page we downloaded
    soup = BeautifulSoup(story.content, "html.parser")
    links_ = soup.find_all("a")
    stub = book_index_url.split("/")[-2]
    
    story_links = [(l.text, l.get('href')) for l in links_ if l.get('href') and l.get('href').startswith(stub)]
    # Tidy out links that aren't stories
    story_links = [s for s in story_links if "00." not in s[1] and not any(x in s[0] for x in ["Preface", "Notes", "Title"])]

    # We need a heuristic to get the text of the story and not any other text
    for story_link in story_links:
        story_url = book_index_url.replace("index.htm", story_link[1])
        #print(f"Getting {story_link[0]} from {story_url}")
        story_ = requests.get(story_url)
        try:
            story_text = [markdownify(x).strip() for x in story_.text.split("<HR>") if "===" in markdownify(x)][0]
            stories.append((book, book_index_url, f'{stub}_{story_link[1]}'.split(".")[0], story_text))
        except Exception:
            # Pages that do not match the heuristic are reported and skipped
            print("Error", story_link[0], story_url)
            # The only one we want to capture is https://sacred-texts.com/hin/ift/ift11.htm
            # which does not have the title of the story as a heading
        
    return stories
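
The heuristic in the function above is to split the page on `<HR>` rules and keep the first chunk whose markdown contains `===`. Assuming that `markdownify` renders top-level headings with a `===` underline (its default heading style), this amounts to keeping the chunk that contains the `<H1>` title. The sketch below expresses the same idea directly on the HTML using only the standard library; `pick_story_chunk` and the sample page are hypothetical, and unlike the original it matches `<hr>` tags case-insensitively.

```python
import re

def pick_story_chunk(html_text):
    """Return the first <hr>-delimited chunk containing an <h1> heading, or None."""
    chunks = re.split(r"<hr\b[^>]*>", html_text, flags=re.IGNORECASE)
    for chunk in chunks:
        if re.search(r"<h1\b", chunk, flags=re.IGNORECASE):
            return chunk
    # No heading found: the caller decides how to handle the page
    return None

# A made-up page with navigation, a titled story section and a footer
sample_page = (
    "<BODY><A HREF='index.htm'>Index</A><HR>"
    "<H1 ALIGN='CENTER'>Tom Tit Tot</H1><P>ONCE upon a time there was a woman.</P>"
    "<HR>Footer text</BODY>"
)
chunk = pick_story_chunk(sample_page)
no_heading = pick_story_chunk("<P>No heading here</P>")
```

Returning `None` rather than raising makes it easy to collect the pages that need manual attention, which fits the combined human + machine approach.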

In [125]:
# For example:
stories = get_stories_from_book("English Fairy Tales")

story = stories[-1]
story

https://www.sacred-texts.com/neu/eng/eft/index.htm


('English Fairy Tales',
 'https://www.sacred-texts.com/neu/eng/eft/index.htm',
 'eft_eft44',

Some of the story texts may contain images or web links. We can clean those out:

In [126]:
import re

def clean_story_text(txt):
    """Clean the story text."""

    # Remove images
    cleaner = re.sub(r'!\[\]\([^\)]*\)', '',
           markdownify(txt))
    # Remove links
    cleaner = re.sub(r'\[[^\]]*\]\([^\)]*\)', '', cleaner)
    # Minimise line breaks
    cleaner = re.sub(r'\n[\n]*', '\n\n', cleaner)
    
    # Remove whitespace around line breaks
    cleaner = "\n\n".join(s.strip() for s in cleaner.split("\n\n"))

    return cleaner

In [127]:
txt = clean_story_text(story[3])
txt
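
To see what each cleaning step does, it can help to run the same regular expressions over a small hand-made markdown string. This is a sketch under stated assumptions: `tidy_markdown` is a hypothetical stand-in that skips the `markdownify` call, and the sample text is invented.

```python
import re

def tidy_markdown(md):
    """Apply the same regex steps as clean_story_text to a markdown string."""
    # Remove images that have empty alt text, e.g. ![](img/001.jpg)
    md = re.sub(r'!\[\]\([^\)]*\)', '', md)
    # Remove links entirely (both the link text and the target)
    md = re.sub(r'\[[^\]]*\]\([^\)]*\)', '', md)
    # Replace every run of line breaks with a single blank line
    md = re.sub(r'\n[\n]*', '\n\n', md)
    # Strip whitespace around each paragraph
    return "\n\n".join(s.strip() for s in md.split("\n\n"))

sample = (
    "Title\n=====\n\n![](img/001.jpg)\n\n"
    "Once upon a time [p. 1](#page_1) there was a king.   \n\n\n\nThe end."
)
cleaned = tidy_markdown(sample)
```

Two side effects are worth noting: the link text is dropped along with the link target, and single line breaks inside a paragraph are also turned into paragraph breaks.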



It will also be useful to parse out separate components of each story, such as the title, the body of the text, the first sentence and the closing paragraph.

In [128]:
def get_story_components(txt):
    """Extract components of story for db table."""
    txt = txt.strip()
    _parts = re.split('=+', txt)
    title = _parts[0].strip()
    title = re.sub(r'p\.[^\n]*', '', title).strip()
    body = _parts[1].strip()
    # Use a proper sentence parser?
    first_sent = body.split(".")[0].strip()
    last_para = body.split("\n\n")[-1].strip()
    
    return title, body, first_sent, last_para
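
Splitting on the first full stop will cut a sentence short if the story opens with an abbreviation, and it ignores sentences that end with `!` or `?`. Short of using a proper sentence parser, a slightly more careful regular expression approach might look like the sketch below. The function name `first_sentence`, the abbreviation list and the sample text are assumptions made for illustration.

```python
import re

# Words that are commonly followed by a full stop without ending a sentence
ABBREVIATIONS = ("Mr", "Mrs", "Dr", "St")

def first_sentence(body):
    """Return the first sentence of body, skipping full stops after known abbreviations."""
    # A sentence end: ., ! or ?, optional closing quotes/brackets, then whitespace or end of text
    for m in re.finditer(r"[.!?][\"')]*(?=\s|$)", body):
        words = body[:m.start()].split()
        if words and words[-1] in ABBREVIATIONS:
            continue
        return body[:m.end()].strip()
    # No terminator found: fall back to the whole text
    return body.strip()

sample_body = "Mr. Vinegar lived in a vinegar bottle. Now, one day, he went out."
naive = sample_body.split(".")[0].strip()
better = first_sentence(sample_body)
```

Here `naive` stops at `Mr`, whereas `better` returns the whole opening sentence.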

In [129]:
get_story_components(txt)

('The Three Heads of the Well',
 "LONG before Arthur and the Knights of the Round Table, there reigned in the eastern part of England a king who kept his court at Colchester.\n\nIn the midst of all his glory, his queen died, leaving behind her an only daughter, about fifteen years of age, who for her beauty and kindness was the wonder of all that knew her. But the king, hearing of a lady who had likewise an only daughter, had a mind to marry her for the sake of her riches, though she was old, ugly, hook-nosed, and hump-backed. Her daughter was a yellow dowdy, full of envy and ill-nature; and, in short, was much of the same mould as her mother. But in a few weeks the king, attended by the nobility and gentry, brought his deformed bride to the palace, where the marriage rites were performed. She had not been long in the court before she set the king against his own beautiful daughter by false reports. The young princess, having lost her father's love, grew weary of the court, and one day

We can parse all the stories into components and then add them to the database.

In [130]:
items = []

for book in book_ids:
    # Only some of the books are available on the Sacred Texts site
    if "st" in book_ids[book]:
        stories = get_stories_from_book(book)
        # Use the position among the successfully scraped stories as the sort order
        for order, (book_title, book_id, _id, story) in enumerate(stories, start=1):
            (title, body, first_sent, last_para) = get_story_components(clean_story_text(story))
            items.append({"book_id": book_id,
                          "book_title": book_title,
                          "story_id": _id,
                          "story_title": title,
                          "story_text": body,
                          "last_para": last_para, # sometimes contains provenance
                          "first_line": first_sent, # maybe we want to review the openings, or create an index...
                          "provenance": "", # attempt at provenance
                          "chapter_order": order, # Sort order of stories in book
                         })

# The upsert means "insert, or update if the primary key already exists"
# Run it once, after all the books have been processed
db["stories"].upsert_all(items, pk="story_id")

https://www.sacred-texts.com/neu/eng/eft/index.htm
https://www.sacred-texts.com/neu/eng/meft/index.htm
https://www.sacred-texts.com/neu/celt/cft/index.htm
Error Text [Zipped] https://www.sacred-texts.com/neu/celt/cft/cft.txt.gz
https://www.sacred-texts.com/neu/celt/mcft/index.htm
https://www.sacred-texts.com/hin/ift/index.htm
Error The Soothsayers Son https://www.sacred-texts.com/hin/ift/ift11.htm


Run a test query on the database:

In [132]:
q = 'king "three sons" princess sword'

# The `.search()` method knows how to find the full text search table
# given the original table name
for story in db["stories"].search(db.quote_fts(q), columns=["story_title", "story_text"]):
    print(story)

{'story_title': 'The Black Horse', 'story_text': 'ONCE\n\nthere was a king and he had three sons, and when the king died, they did not\n\ngive a shade of anything to the youngest son, but an old white limping garron.\n\n"If I get but this," quoth he, "it seems that I\n\nhad best go with this same.\n\nHe was going with it right before him, sometimes walking,\n\nsometimes riding. When he had been riding a good while he thought that the\n\ngarron would need a while of eating, so he came down to earth, and what should\n\nhe see coming out of the heart of the western airt towards him but a rider\n\nriding high, well, and right well.\n\n"AllI hail, my lad," said he.\n\n"Hail, king\'s son," said the other.\n\n"What\'s your news?" said the king\'s son.\n\n"I have got that," said the lad who came. "I am\n\nafter breaking my heart riding this ass of a horse ; but will you give me the\n\nlimping white garron for him?"\n\n"No," said the prince; "it would be a bad\n\nbusiness for me*.*"\n\n"You nee

In [None]:
# We could manually close the database.
#db.conn.close()

https://huggingface.co/course/chapter5/6?fw=tf and use the doc2vec recipe?

https://github.com/neuml/txtai ?