{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# N-grams and Markov chains\n", "\n", "By [Allison Parrish](http://www.decontextualize.com/)\n", "\n", "Markov chain text generation is [one of the oldest](https://elmcip.net/creative-work/travesty) strategies for predictive text generation. This notebook takes you through the basics of implementing a simple and concise Markov chain text generation procedure in Python.\n", "\n", "If all you want is to generate text with a Markov chain and you don't care about how the functions are implemented (or if you already went through this notebook and want to use the functions without copy-and-pasting them), you can [download a Python file with all of the functions here](https://gist.github.com/aparrish/14cb94ce539a868e6b8714dd84003f06). Just download the file, put it in the same directory as your code, type `from shmarkov import *` at the top, and you're good to go.\n", "\n", "## Tuples: a quick introduction\n", "\n", "Before we get to all that, I need to review a helpful Python data structure: the tuple.\n", "\n", "Tuples (rhymes with \"supple\") are data structures very similar to lists. You can create a tuple using parentheses (instead of square brackets, as you would with a list):" ] }, { "cell_type": "code", "execution_count": 273, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('alpha', 'beta', 'gamma', 'delta')" ] }, "execution_count": 273, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t = (\"alpha\", \"beta\", \"gamma\", \"delta\")\n", "t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access the values in a tuple in the same way as you access the values in a list: using square bracket indexing syntax. 
Tuples support slice syntax and negative indexes, just like lists:" ] }, { "cell_type": "code", "execution_count": 274, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'gamma'" ] }, "execution_count": 274, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t[-2]" ] }, { "cell_type": "code", "execution_count": 275, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('beta', 'gamma')" ] }, "execution_count": 275, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t[1:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference between a list and a tuple is that the values in a tuple can't be changed after the tuple is created. This means, for example, that attempting to .append() a value to a tuple will fail:" ] }, { "cell_type": "code", "execution_count": 276, "metadata": {}, "outputs": [ { "ename": "AttributeError", "evalue": "'tuple' object has no attribute 'append'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"epsilon\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m: 'tuple' object has no attribute 'append'" ] } ], "source": [ "t.append(\"epsilon\")" ] }, { "cell_type": "code", "execution_count": 277, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "'tuple' object does not support item assignment", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in 
\u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mt\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"bravo\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: 'tuple' object does not support item assignment" ] } ], "source": [ "t[2] = \"bravo\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"So,\" you think to yourself. \"Tuples are just like... broken lists. That's strange and a little unreasonable. Why even have them in your programming language?\" That's a fair question, and answering it requires a bit of knowledge of how Python works with these two kinds of values (lists and tuples) behind the scenes.\n", "\n", "Essentially, tuples are faster and smaller than lists. Because lists can be modified, potentially becoming larger after they're initialized, Python has to allocate more memory than is strictly necessary whenever you create a list value. If your list grows beyond what Python has already allocated, Python has to allocate more memory. Allocating memory, copying values into memory, and then freeing memory when it's no longer needed, are all (perhaps surprisingly) slow processes---slower, at least, than using data already loaded into memory when your program begins.\n", "\n", "Because a tuple can't grow or shrink after it's created, Python knows exactly how much memory to allocate when you create a tuple in your program. That means: less wasted memory, and less wasted time allocating and deallocating memory. The cost of this decreased resource footprint is less versatility.\n", "\n", "Tuples are often called an immutable data type. \"Immutable\" in this context simply means that it can't be changed after it's created.\n", "\n", "For our purposes, the most important aspect of tuples is that–unlike lists–they can be *keys in dictionaries*. 
The utility of this will become apparent later in this tutorial, but to illustrate, let's start with an empty dictionary:" ] }, { "cell_type": "code", "execution_count": 278, "metadata": { "collapsed": true }, "outputs": [], "source": [ "my_stuff = {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can use a string as a key, of course, no problem:" ] }, { "cell_type": "code", "execution_count": 279, "metadata": { "collapsed": true }, "outputs": [], "source": [ "my_stuff[\"cheese\"] = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can also use an integer:" ] }, { "cell_type": "code", "execution_count": 280, "metadata": { "collapsed": true }, "outputs": [], "source": [ "my_stuff[17] = \"hello\"" ] }, { "cell_type": "code", "execution_count": 281, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'cheese': 1, 17: 'hello'}" ] }, "execution_count": 281, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_stuff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But I can't use a *list* as a key:" ] }, { "cell_type": "code", "execution_count": 282, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "unhashable type: 'list'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmy_stuff\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m\"a\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"b\"\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"asdf\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: unhashable type: 'list'" ] } ], "source": [ "my_stuff[ [\"a\", \"b\"] ] = \"asdf\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's 
because a list, as a mutable data type, is \"unhashable\": since its contents can change, it's impossible to come up with a single value to represent it, as is required of dictionary keys. However, because tuples are *immutable*, you can use them as dictionary keys:" ] }, { "cell_type": "code", "execution_count": 283, "metadata": { "collapsed": true }, "outputs": [], "source": [ "my_stuff[ (\"a\", \"b\") ] = \"asdf\"" ] }, { "cell_type": "code", "execution_count": 284, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'cheese': 1, 17: 'hello', ('a', 'b'): 'asdf'}" ] }, "execution_count": 284, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_stuff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This behavior is helpful when you want to make a data structure that maps *sequences* as keys to corresponding values. As we'll see below!\n", "\n", "It's easy to make a list that is a copy of a tuple, and a tuple that is a copy of a list, using the `list()` and `tuple()` functions, respectively:" ] }, { "cell_type": "code", "execution_count": 298, "metadata": { "collapsed": true }, "outputs": [], "source": [ "t = (1, 2, 3)" ] }, { "cell_type": "code", "execution_count": 299, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3]" ] }, "execution_count": 299, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(t)" ] }, { "cell_type": "code", "execution_count": 300, "metadata": { "collapsed": true }, "outputs": [], "source": [ "things = [4, 5, 6]" ] }, { "cell_type": "code", "execution_count": 301, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4, 5, 6)" ] }, "execution_count": 301, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tuple(things)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## N-grams\n", "\n", "The first kind of text analysis that we’ll look at today is an n-gram model. 
An n-gram is simply a sequence of units drawn from a longer sequence; in the case of text, the unit in question is usually a character or a word. For convenience, we'll call the unit of the n-gram its *level*; the length of the n-gram is called its *order*. For example, the following is a list of all unique character-level order-2 n-grams in the word `condescendences`:\n", "\n", " co\n", " on\n", " nd\n", " de\n", " es\n", " sc\n", " ce\n", " en\n", " nc\n", "\n", "And the following is an excerpt from the list of all unique word-level order-5 n-grams in *The Road Not Taken*:\n", "\n", " Two roads diverged in a\n", " roads diverged in a yellow\n", " diverged in a yellow wood,\n", " in a yellow wood, And\n", " a yellow wood, And sorry\n", " yellow wood, And sorry I\n", "\n", "N-grams are used frequently in natural language processing and are a basic tool of text analysis. Their applications range from programs that correct spelling to creative visualizations to compression algorithms to stylometrics to generative text. They can be used as the basis of a Markov chain algorithm—and, in fact, that’s one of the applications we’ll be using them for later in this lesson.\n", "\n", "### Finding and counting word pairs\n", "\n", "So how would we go about writing Python code to find n-grams? We'll start with a simple task: finding *word pairs* in a text. A word pair is essentially a word-level order-2 n-gram; once we have code to find word pairs, we’ll generalize it to handle n-grams of any order.\n", "\n", "To find word pairs, we'll first need some words!" ] }, { "cell_type": "code", "execution_count": 285, "metadata": { "collapsed": true }, "outputs": [], "source": [ "text = open(\"genesis.txt\").read()\n", "words = text.split()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data structure we want to end up with is a *list* of *tuples*, where the tuples have two elements, i.e., each successive pair of words from the text. 
There are a number of clever ways to go about creating this list. Here's one: imagine our starting list of strings, with their corresponding indices:\n", "\n", " ['a', 'b', 'c', 'd', 'e']\n", " 0 1 2 3 4\n", " \n", "The first item of the list of pairs should consist of the elements at index 0 and index 1 from this list; the second item should consist of the elements at index 1 and index 2; and so forth. We can accomplish this using a list comprehension over the range of numbers from zero up until the end of the list minus one:" ] }, { "cell_type": "code", "execution_count": 286, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pairs = [(words[i], words[i+1]) for i in range(len(words)-1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Why `len(words) - 1`? Because the final element of the list can only be the *second* element of a pair. Otherwise we'd be trying to access an element beyond the end of the list.)\n", "\n", "The corresponding way to write this with a `for` loop:" ] }, { "cell_type": "code", "execution_count": 287, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pairs = []\n", "for i in range(len(words)-1):\n", " this_pair = (words[i], words[i+1])\n", " pairs.append(this_pair)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In either case, the list of n-grams ends up looking like this. 
(I'm only showing the first 25 for the sake of brevity; remove `[:25]` to see the whole list.)" ] }, { "cell_type": "code", "execution_count": 291, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('In', 'the'),\n", " ('the', 'beginning'),\n", " ('beginning', 'God'),\n", " ('God', 'created'),\n", " ('created', 'the'),\n", " ('the', 'heaven'),\n", " ('heaven', 'and'),\n", " ('and', 'the'),\n", " ('the', 'earth.'),\n", " ('earth.', 'And'),\n", " ('And', 'the'),\n", " ('the', 'earth'),\n", " ('earth', 'was'),\n", " ('was', 'without'),\n", " ('without', 'form,'),\n", " ('form,', 'and'),\n", " ('and', 'void;'),\n", " ('void;', 'and'),\n", " ('and', 'darkness'),\n", " ('darkness', 'was'),\n", " ('was', 'upon'),\n", " ('upon', 'the'),\n", " ('the', 'face'),\n", " ('face', 'of'),\n", " ('of', 'the')]" ] }, "execution_count": 291, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pairs[:25]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a list of word pairs, we can count them using a `Counter` object." 
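Before bringing in `Counter`, it may help to see the bookkeeping it automates. Here's a minimal sketch that tallies pairs with a plain dictionary; the three pairs are made-up stand-ins rather than data from `genesis.txt`:

```python
# Tallying pairs by hand with a plain dict; Counter automates exactly this.
# The pairs below are hypothetical stand-ins for pairs found in the text.
pairs = [("And", "God"), ("God", "said,"), ("And", "God")]
counts = {}
for pair in pairs:
    # a tuple is hashable, so it can serve as a dictionary key
    counts[pair] = counts.get(pair, 0) + 1
print(counts)  # {('And', 'God'): 2, ('God', 'said,'): 1}
```

`Counter(pairs)` builds the same mapping in one step, and adds conveniences like `.most_common()`.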
] }, { "cell_type": "code", "execution_count": 292, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import Counter" ] }, { "cell_type": "code", "execution_count": 294, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pair_counts = Counter(pairs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.most_common()` method of the `Counter` shows us the items in our list that occur most frequently:" ] }, { "cell_type": "code", "execution_count": 295, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(('And', 'God'), 21),\n", " (('of', 'the'), 15),\n", " (('it', 'was'), 13),\n", " (('and', 'the'), 12),\n", " (('upon', 'the'), 10),\n", " (('And', 'the'), 9),\n", " (('God', 'said,'), 9),\n", " (('in', 'the'), 9),\n", " (('the', 'earth'), 8),\n", " (('said,', 'Let'), 8)]" ] }, "execution_count": 295, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pair_counts.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the phrase \"And God\" occurs 21 times, by far the most common word pair in the text. 
In fact, \"And God\" comprises about 2.6% of all word pairs found in the text:" ] }, { "cell_type": "code", "execution_count": 296, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.026381909547738693" ] }, "execution_count": 296, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pair_counts[(\"And\", \"God\")] / sum(pair_counts.values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can do the same calculation with character-level pairs with pretty much exactly the same code, owing to the fact that strings and lists can be indexed using the same syntax:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "char_pairs = [(text[i], text[i+1]) for i in range(len(text)-1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variable `char_pairs` now has a list of all pairs of *characters* in the text. Using `Counter` again, we can find the most common pairs of characters:" ] }, { "cell_type": "code", "execution_count": 297, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(('t', 'h'), 184),\n", " (('e', ' '), 172),\n", " ((' ', 't'), 162),\n", " (('d', ' '), 154),\n", " (('h', 'e'), 144),\n", " (('n', 'd'), 114),\n", " ((' ', 'a'), 88),\n", " (('t', ' '), 72),\n", " (('e', 'r'), 71),\n", " (('a', 'n'), 70)]" ] }, "execution_count": 297, "metadata": {}, "output_type": "execute_result" } ], "source": [ "char_pair_counts = Counter(char_pairs)\n", "char_pair_counts.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> What are the practical applications of this kind of analysis? For one, you can use n-gram counts to judge *similarity* between two texts. If two texts have the same n-grams in similar proportions, then those texts probably have similar compositions, meanings, or authorship. 
N-grams can also be a basis for fast text searching; [Google Books Ngram Viewer](https://books.google.com/ngrams) works along these lines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### N-grams of arbitrary lengths\n", "\n", "The step from pairs to n-grams of arbitrary lengths is only a matter of using slice indexes to get a slice of length `n`, where `n` is the length of the desired n-gram. For example, to get all of the word-level order-7 n-grams from the list of words in `genesis.txt`:" ] }, { "cell_type": "code", "execution_count": 307, "metadata": { "collapsed": true }, "outputs": [], "source": [ "seven_grams = [tuple(words[i:i+7]) for i in range(len(words)-6)]" ] }, { "cell_type": "code", "execution_count": 310, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('In', 'the', 'beginning', 'God', 'created', 'the', 'heaven'),\n", " ('the', 'beginning', 'God', 'created', 'the', 'heaven', 'and'),\n", " ('beginning', 'God', 'created', 'the', 'heaven', 'and', 'the'),\n", " ('God', 'created', 'the', 'heaven', 'and', 'the', 'earth.'),\n", " ('created', 'the', 'heaven', 'and', 'the', 'earth.', 'And'),\n", " ('the', 'heaven', 'and', 'the', 'earth.', 'And', 'the'),\n", " ('heaven', 'and', 'the', 'earth.', 'And', 'the', 'earth'),\n", " ('and', 'the', 'earth.', 'And', 'the', 'earth', 'was'),\n", " ('the', 'earth.', 'And', 'the', 'earth', 'was', 'without'),\n", " ('earth.', 'And', 'the', 'earth', 'was', 'without', 'form,'),\n", " ('And', 'the', 'earth', 'was', 'without', 'form,', 'and'),\n", " ('the', 'earth', 'was', 'without', 'form,', 'and', 'void;'),\n", " ('earth', 'was', 'without', 'form,', 'and', 'void;', 'and'),\n", " ('was', 'without', 'form,', 'and', 'void;', 'and', 'darkness'),\n", " ('without', 'form,', 'and', 'void;', 'and', 'darkness', 'was'),\n", " ('form,', 'and', 'void;', 'and', 'darkness', 'was', 'upon'),\n", " ('and', 'void;', 'and', 'darkness', 'was', 'upon', 'the'),\n", " ('void;', 'and', 'darkness', 'was', 'upon', 'the', 
'face'),\n", " ('and', 'darkness', 'was', 'upon', 'the', 'face', 'of'),\n", " ('darkness', 'was', 'upon', 'the', 'face', 'of', 'the')]" ] }, "execution_count": 310, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seven_grams[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two tricky things in this expression: in `tuple(words[i:i+7])`, I call `tuple()` to convert the list slice (`words[i:i+7]`) into a tuple. In `range(len(words)-6)`, the `6` is there because it's one fewer than the length of the n-gram. Just as with the pairs above, we need to stop counting before we reach the end of the list with enough room to make sure we're always grabbing slices of the desired length.\n", "\n", "For the sake of convenience, here's a function that will return n-grams of a desired length from any sequence, whether list or string:" ] }, { "cell_type": "code", "execution_count": 311, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def ngrams_for_sequence(n, seq):\n", " return [tuple(seq[i:i+n]) for i in range(len(seq)-n+1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this function, here are random character-level n-grams of order 9 from `genesis.txt`:" ] }, { "cell_type": "code", "execution_count": 314, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('l', 'e', 't', ' ', 'f', 'o', 'w', 'l', ' '),\n", " ('h', 'i', 'n', 'g', ' ', 't', 'h', 'a', 't'),\n", " ('n', 'g', ' ', 's', 'e', 'e', 'd', ';', ' '),\n", " ('n', 'g', ' ', 't', 'h', 'i', 'n', 'g', ' '),\n", " ('e', ' ', 'w', 'a', 't', 'e', 'r', 's', ' '),\n", " ('o', 'd', '.', ' ', '\\n', 'A', 'n', 'd', ' '),\n", " ('a', 'i', 'd', ',', ' ', 'L', 'e', 't', ' '),\n", " ('e', ' ', 'e', 'a', 'r', 't', 'h', ',', ' '),\n", " ('s', 't', 'a', 'r', 's', ' ', 'a', 'l', 's'),\n", " ('r', 'o', 'm', ' ', 't', 'h', 'e', ' ', 'n')]" ] }, "execution_count": 314, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import random\n", "genesis_9grams = 
ngrams_for_sequence(9, open(\"genesis.txt\").read())\n", "random.sample(genesis_9grams, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or all the word-level 5-grams from `frost.txt`:" ] }, { "cell_type": "code", "execution_count": 316, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Two', 'roads', 'diverged', 'in', 'a'),\n", " ('roads', 'diverged', 'in', 'a', 'yellow'),\n", " ('diverged', 'in', 'a', 'yellow', 'wood,'),\n", " ('in', 'a', 'yellow', 'wood,', 'And'),\n", " ('a', 'yellow', 'wood,', 'And', 'sorry'),\n", " ('yellow', 'wood,', 'And', 'sorry', 'I'),\n", " ('wood,', 'And', 'sorry', 'I', 'could'),\n", " ('And', 'sorry', 'I', 'could', 'not'),\n", " ('sorry', 'I', 'could', 'not', 'travel'),\n", " ('I', 'could', 'not', 'travel', 'both'),\n", " ('could', 'not', 'travel', 'both', 'And'),\n", " ('not', 'travel', 'both', 'And', 'be'),\n", " ('travel', 'both', 'And', 'be', 'one'),\n", " ('both', 'And', 'be', 'one', 'traveler,'),\n", " ('And', 'be', 'one', 'traveler,', 'long'),\n", " ('be', 'one', 'traveler,', 'long', 'I'),\n", " ('one', 'traveler,', 'long', 'I', 'stood'),\n", " ('traveler,', 'long', 'I', 'stood', 'And'),\n", " ('long', 'I', 'stood', 'And', 'looked'),\n", " ('I', 'stood', 'And', 'looked', 'down'),\n", " ('stood', 'And', 'looked', 'down', 'one'),\n", " ('And', 'looked', 'down', 'one', 'as'),\n", " ('looked', 'down', 'one', 'as', 'far'),\n", " ('down', 'one', 'as', 'far', 'as'),\n", " ('one', 'as', 'far', 'as', 'I'),\n", " ('as', 'far', 'as', 'I', 'could'),\n", " ('far', 'as', 'I', 'could', 'To'),\n", " ('as', 'I', 'could', 'To', 'where'),\n", " ('I', 'could', 'To', 'where', 'it'),\n", " ('could', 'To', 'where', 'it', 'bent'),\n", " ('To', 'where', 'it', 'bent', 'in'),\n", " ('where', 'it', 'bent', 'in', 'the'),\n", " ('it', 'bent', 'in', 'the', 'undergrowth;'),\n", " ('bent', 'in', 'the', 'undergrowth;', 'Then'),\n", " ('in', 'the', 'undergrowth;', 'Then', 'took'),\n", " ('the', 'undergrowth;', 'Then', 'took', 
'the'),\n", " ('undergrowth;', 'Then', 'took', 'the', 'other,'),\n", " ('Then', 'took', 'the', 'other,', 'as'),\n", " ('took', 'the', 'other,', 'as', 'just'),\n", " ('the', 'other,', 'as', 'just', 'as'),\n", " ('other,', 'as', 'just', 'as', 'fair,'),\n", " ('as', 'just', 'as', 'fair,', 'And'),\n", " ('just', 'as', 'fair,', 'And', 'having'),\n", " ('as', 'fair,', 'And', 'having', 'perhaps'),\n", " ('fair,', 'And', 'having', 'perhaps', 'the'),\n", " ('And', 'having', 'perhaps', 'the', 'better'),\n", " ('having', 'perhaps', 'the', 'better', 'claim,'),\n", " ('perhaps', 'the', 'better', 'claim,', 'Because'),\n", " ('the', 'better', 'claim,', 'Because', 'it'),\n", " ('better', 'claim,', 'Because', 'it', 'was'),\n", " ('claim,', 'Because', 'it', 'was', 'grassy'),\n", " ('Because', 'it', 'was', 'grassy', 'and'),\n", " ('it', 'was', 'grassy', 'and', 'wanted'),\n", " ('was', 'grassy', 'and', 'wanted', 'wear;'),\n", " ('grassy', 'and', 'wanted', 'wear;', 'Though'),\n", " ('and', 'wanted', 'wear;', 'Though', 'as'),\n", " ('wanted', 'wear;', 'Though', 'as', 'for'),\n", " ('wear;', 'Though', 'as', 'for', 'that'),\n", " ('Though', 'as', 'for', 'that', 'the'),\n", " ('as', 'for', 'that', 'the', 'passing'),\n", " ('for', 'that', 'the', 'passing', 'there'),\n", " ('that', 'the', 'passing', 'there', 'Had'),\n", " ('the', 'passing', 'there', 'Had', 'worn'),\n", " ('passing', 'there', 'Had', 'worn', 'them'),\n", " ('there', 'Had', 'worn', 'them', 'really'),\n", " ('Had', 'worn', 'them', 'really', 'about'),\n", " ('worn', 'them', 'really', 'about', 'the'),\n", " ('them', 'really', 'about', 'the', 'same,'),\n", " ('really', 'about', 'the', 'same,', 'And'),\n", " ('about', 'the', 'same,', 'And', 'both'),\n", " ('the', 'same,', 'And', 'both', 'that'),\n", " ('same,', 'And', 'both', 'that', 'morning'),\n", " ('And', 'both', 'that', 'morning', 'equally'),\n", " ('both', 'that', 'morning', 'equally', 'lay'),\n", " ('that', 'morning', 'equally', 'lay', 'In'),\n", " ('morning', 'equally', 
'lay', 'In', 'leaves'),\n", " ('equally', 'lay', 'In', 'leaves', 'no'),\n", " ('lay', 'In', 'leaves', 'no', 'step'),\n", " ('In', 'leaves', 'no', 'step', 'had'),\n", " ('leaves', 'no', 'step', 'had', 'trodden'),\n", " ('no', 'step', 'had', 'trodden', 'black.'),\n", " ('step', 'had', 'trodden', 'black.', 'Oh,'),\n", " ('had', 'trodden', 'black.', 'Oh,', 'I'),\n", " ('trodden', 'black.', 'Oh,', 'I', 'kept'),\n", " ('black.', 'Oh,', 'I', 'kept', 'the'),\n", " ('Oh,', 'I', 'kept', 'the', 'first'),\n", " ('I', 'kept', 'the', 'first', 'for'),\n", " ('kept', 'the', 'first', 'for', 'another'),\n", " ('the', 'first', 'for', 'another', 'day!'),\n", " ('first', 'for', 'another', 'day!', 'Yet'),\n", " ('for', 'another', 'day!', 'Yet', 'knowing'),\n", " ('another', 'day!', 'Yet', 'knowing', 'how'),\n", " ('day!', 'Yet', 'knowing', 'how', 'way'),\n", " ('Yet', 'knowing', 'how', 'way', 'leads'),\n", " ('knowing', 'how', 'way', 'leads', 'on'),\n", " ('how', 'way', 'leads', 'on', 'to'),\n", " ('way', 'leads', 'on', 'to', 'way,'),\n", " ('leads', 'on', 'to', 'way,', 'I'),\n", " ('on', 'to', 'way,', 'I', 'doubted'),\n", " ('to', 'way,', 'I', 'doubted', 'if'),\n", " ('way,', 'I', 'doubted', 'if', 'I'),\n", " ('I', 'doubted', 'if', 'I', 'should'),\n", " ('doubted', 'if', 'I', 'should', 'ever'),\n", " ('if', 'I', 'should', 'ever', 'come'),\n", " ('I', 'should', 'ever', 'come', 'back.'),\n", " ('should', 'ever', 'come', 'back.', 'I'),\n", " ('ever', 'come', 'back.', 'I', 'shall'),\n", " ('come', 'back.', 'I', 'shall', 'be'),\n", " ('back.', 'I', 'shall', 'be', 'telling'),\n", " ('I', 'shall', 'be', 'telling', 'this'),\n", " ('shall', 'be', 'telling', 'this', 'with'),\n", " ('be', 'telling', 'this', 'with', 'a'),\n", " ('telling', 'this', 'with', 'a', 'sigh'),\n", " ('this', 'with', 'a', 'sigh', 'Somewhere'),\n", " ('with', 'a', 'sigh', 'Somewhere', 'ages'),\n", " ('a', 'sigh', 'Somewhere', 'ages', 'and'),\n", " ('sigh', 'Somewhere', 'ages', 'and', 'ages'),\n", " ('Somewhere', 'ages', 
'and', 'ages', 'hence:'),\n", " ('ages', 'and', 'ages', 'hence:', 'Two'),\n", " ('and', 'ages', 'hence:', 'Two', 'roads'),\n", " ('ages', 'hence:', 'Two', 'roads', 'diverged'),\n", " ('hence:', 'Two', 'roads', 'diverged', 'in'),\n", " ('Two', 'roads', 'diverged', 'in', 'a'),\n", " ('roads', 'diverged', 'in', 'a', 'wood,'),\n", " ('diverged', 'in', 'a', 'wood,', 'and'),\n", " ('in', 'a', 'wood,', 'and', 'I—'),\n", " ('a', 'wood,', 'and', 'I—', 'I'),\n", " ('wood,', 'and', 'I—', 'I', 'took'),\n", " ('and', 'I—', 'I', 'took', 'the'),\n", " ('I—', 'I', 'took', 'the', 'one'),\n", " ('I', 'took', 'the', 'one', 'less'),\n", " ('took', 'the', 'one', 'less', 'travelled'),\n", " ('the', 'one', 'less', 'travelled', 'by,'),\n", " ('one', 'less', 'travelled', 'by,', 'And'),\n", " ('less', 'travelled', 'by,', 'And', 'that'),\n", " ('travelled', 'by,', 'And', 'that', 'has'),\n", " ('by,', 'And', 'that', 'has', 'made'),\n", " ('And', 'that', 'has', 'made', 'all'),\n", " ('that', 'has', 'made', 'all', 'the'),\n", " ('has', 'made', 'all', 'the', 'difference.')]" ] }, "execution_count": 316, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frost_word_5grams = ngrams_for_sequence(5, open(\"frost.txt\").read().split())\n", "frost_word_5grams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the bigrams (ngrams of order 2) from the string `condescendences`:" ] }, { "cell_type": "code", "execution_count": 317, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('c', 'o'),\n", " ('o', 'n'),\n", " ('n', 'd'),\n", " ('d', 'e'),\n", " ('e', 's'),\n", " ('s', 'c'),\n", " ('c', 'e'),\n", " ('e', 'n'),\n", " ('n', 'd'),\n", " ('d', 'e'),\n", " ('e', 'n'),\n", " ('n', 'c'),\n", " ('c', 'e'),\n", " ('e', 's')]" ] }, "execution_count": 317, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ngrams_for_sequence(2, \"condescendences\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function works with non-string sequences 
as well:" ] }, { "cell_type": "code", "execution_count": 319, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(5, 10, 15, 20), (10, 15, 20, 25), (15, 20, 25, 30)]" ] }, "execution_count": 319, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ngrams_for_sequence(4, [5, 10, 15, 20, 25, 30])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And of course we can use it in conjunction with a `Counter` to find the most common n-grams in a text:" ] }, { "cell_type": "code", "execution_count": 321, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[((' ', 't', 'h'), 144),\n", " (('t', 'h', 'e'), 127),\n", " (('h', 'e', ' '), 114),\n", " (('n', 'd', ' '), 99),\n", " (('a', 'n', 'd'), 66),\n", " ((' ', 'a', 'n'), 64),\n", " ((',', ' ', 'a'), 40),\n", " (('i', 'n', 'g'), 37),\n", " (('d', ' ', 't'), 36),\n", " (('n', 'g', ' '), 34),\n", " (('A', 'n', 'd'), 33),\n", " ((' ', 'G', 'o'), 32),\n", " (('G', 'o', 'd'), 32),\n", " (('o', 'd', ' '), 32),\n", " (('.', ' ', '\\n'), 29),\n", " ((' ', '\\n', 'A'), 29),\n", " (('\\n', 'A', 'n'), 29),\n", " ((' ', 'w', 'a'), 28),\n", " (('d', ' ', 'G'), 28),\n", " (('r', 't', 'h'), 27)]" ] }, "execution_count": 321, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(ngrams_for_sequence(3, open(\"genesis.txt\").read())).most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Markov models: what comes next?\n", "\n", "Now that we have the ability to find and record the n-grams in a text, it’s time to take our analysis one step further. The next question we’re going to try to answer is this: Given a particular n-gram in a text, what is most likely to come next?\n", "\n", "We can imagine the kind of algorithm we’ll need to extract this information from the text. 
It will look very similar to the code to find n-grams above, but it will need to keep track not just of the n-grams but also of a list of all units (word, character, whatever) that *follow* those n-grams.\n", "\n", "Let’s do a quick example by hand. This is the same character-level order-2 n-gram analysis of the (very brief) text “condescendences” as above, but this time keeping track of all characters that follow each n-gram:\n", "\n", "| n-grams | next? |\n", "| ------- | ----- |\n", "|co| n|\n", "|on| d|\n", "|nd| e, e|\n", "|de| s, n|\n", "|es| c, (end of text)|\n", "|sc| e|\n", "|ce| n, s|\n", "|en| d, c|\n", "|nc| e|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this table, we can determine that the n-gram `co` is followed by `n` 100% of the time and the n-gram `on` is followed by `d` 100% of the time, while the n-gram `de` is followed by `s` 50% of the time and by `n` the rest of the time. Likewise, the n-gram `es` is followed by `c` 50% of the time, and followed by the end of the text the other 50% of the time.\n", "\n", "The easiest way to represent this model is with a dictionary whose keys are the n-grams and whose values are all of the possible \"nexts.\" Here's what the Python code looks like to construct this model from a string. We'll use the special token `$` to represent the notion of the \"end of text\" in the table above."
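Before building the model from a string, here's a small sketch of how the percentages above fall out of the lists of "nexts." The `nexts` dictionary below is a hypothetical fragment covering just the `de` and `es` rows of the table, with `"$"` standing in for the end of the text:

```python
from collections import Counter

# Hypothetical fragment of the model: two "nexts" lists from the table above.
# "$" stands for the end of the text.
nexts = {("d", "e"): ["s", "n"], ("e", "s"): ["c", "$"]}
for gram, followers in nexts.items():
    tally = Counter(followers)
    for item, count in tally.items():
        share = count / len(followers)  # fraction of the time this item follows
        print("".join(gram), "->", item, share)  # e.g. de -> s 0.5
```

Each follower's share is simply its count divided by the length of the list, which is why duplicates (like the two `e`s after `nd`) matter.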
] }, { "cell_type": "code", "execution_count": 336, "metadata": { "collapsed": true }, "outputs": [], "source": [ "src = \"condescendences\"\n",
 "src += \"$\" # to indicate the end of the string\n",
 "model = {}\n",
 "for i in range(len(src)-2):\n",
 "    ngram = tuple(src[i:i+2]) # get a slice of length 2 from current position\n",
 "    next_item = src[i+2] # next item is current index plus two (i.e., right after the slice)\n",
 "    if ngram not in model: # check if we've already seen this ngram; if not...\n",
 "        model[ngram] = [] # value for this key is an empty list\n",
 "    model[ngram].append(next_item) # append this next item to the list for this ngram" ] }, { "cell_type": "code", "execution_count": 338, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('c', 'e'): ['n', 's'],\n",
 " ('c', 'o'): ['n'],\n",
 " ('d', 'e'): ['s', 'n'],\n",
 " ('e', 'n'): ['d', 'c'],\n",
 " ('e', 's'): ['c', '$'],\n",
 " ('n', 'c'): ['e'],\n",
 " ('n', 'd'): ['e', 'e'],\n",
 " ('o', 'n'): ['d'],\n",
 " ('s', 'c'): ['e']}" ] }, "execution_count": 338, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The functions in the cell below generalize this to n-grams of arbitrary length (and use the special Python value `None` to indicate the end of a sequence). The `markov_model()` function takes an n-gram length and a sequence (which can be a string or a list), creates an empty dictionary, and calls the `add_to_model()` function to populate it. The `add_to_model()` function does the same thing as the code above: it iterates over every index of the sequence, grabs an n-gram of the desired length, and adds keys and values to the dictionary as necessary."
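,
 "\n",
 "\n",
 "Note that the lists of \"nexts\" deliberately keep duplicates (in the model above, for example, `('n', 'd')` maps to `['e', 'e']`): picking uniformly at random from such a list automatically reproduces the observed frequencies. If you ever want those frequencies as explicit probabilities, a `Counter` can tally them. Here's a minimal sketch, with a hand-written `nexts` list standing in for one of the model's values:\n",
 "\n",
 "```python\n",
 "from collections import Counter\n",
 "nexts = ['s', 'n', 's', 's']  # stand-in for one n-gram's list of next items\n",
 "probs = {item: count / len(nexts) for item, count in Counter(nexts).items()}\n",
 "print(probs)  # {'s': 0.75, 'n': 0.25}\n",
 "```"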
] }, { "cell_type": "code", "execution_count": 339, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def add_to_model(model, n, seq):\n", " # make a copy of seq and append None to the end\n", " seq = list(seq[:]) + [None]\n", " for i in range(len(seq)-n):\n", " # tuple because we're using it as a dict key!\n", " gram = tuple(seq[i:i+n])\n", " next_item = seq[i+n] \n", " if gram not in model:\n", " model[gram] = []\n", " model[gram].append(next_item)\n", "\n", "def markov_model(n, seq):\n", " model = {}\n", " add_to_model(model, n, seq)\n", " return model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, e.g., an order-2 character-level Markov model of `condescendences`:" ] }, { "cell_type": "code", "execution_count": 340, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('c', 'e'): ['n', 's'],\n", " ('c', 'o'): ['n'],\n", " ('d', 'e'): ['s', 'n'],\n", " ('e', 'n'): ['d', 'c'],\n", " ('e', 's'): ['c', None],\n", " ('n', 'c'): ['e'],\n", " ('n', 'd'): ['e', 'e'],\n", " ('o', 'n'): ['d'],\n", " ('s', 'c'): ['e']}" ] }, "execution_count": 340, "metadata": {}, "output_type": "execute_result" } ], "source": [ "markov_model(2, \"condescendences\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or an order 3 word-level Markov model of `genesis.txt`:" ] }, { "cell_type": "code", "execution_count": 345, "metadata": { "collapsed": true }, "outputs": [], "source": [ "genesis_markov_model = markov_model(3, open(\"genesis.txt\").read().split())" ] }, { "cell_type": "code", "execution_count": 346, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('And', 'God', 'blessed'): ['them,', 'them,'],\n", " ('And', 'God', 'called'): ['the', 'the', 'the'],\n", " ('And', 'God', 'created'): ['great'],\n", " ('And', 'God', 'made'): ['the', 'two', 'the'],\n", " ('And', 'God', 'said,'): ['Let',\n", " 'Let',\n", " 'Let',\n", " 'Let',\n", " 'Let',\n", " 'Let',\n", " 'Let',\n", " 'Let',\n", " 'Behold,'],\n", " ('And', 'God', 'saw'): 
['the', 'every'],\n", " ('And', 'God', 'set'): ['them'],\n", " ('And', 'let', 'them'): ['be'],\n", " ('And', 'the', 'Spirit'): ['of'],\n", " ('And', 'the', 'earth'): ['was', 'brought'],\n", " ('And', 'the', 'evening'): ['and', 'and', 'and', 'and', 'and', 'and'],\n", " ('And', 'to', 'every'): ['beast'],\n", " ('And', 'to', 'rule'): ['over'],\n", " ('Be', 'fruitful,', 'and'): ['multiply,', 'multiply,'],\n", " ('Behold,', 'I', 'have'): ['given'],\n", " ('Day,', 'and', 'the'): ['darkness'],\n", " ('Earth;', 'and', 'the'): ['gathering'],\n", " ('God', 'blessed', 'them,'): ['saying,', 'and'],\n", " ('God', 'called', 'the'): ['light', 'firmament', 'dry'],\n", " ('God', 'created', 'great'): ['whales,'],\n", " ('God', 'created', 'he'): ['him;'],\n", " ('God', 'created', 'man'): ['in'],\n", " ('God', 'created', 'the'): ['heaven'],\n", " ('God', 'divided', 'the'): ['light'],\n", " ('God', 'made', 'the'): ['firmament,', 'beast'],\n", " ('God', 'made', 'two'): ['great'],\n", " ('God', 'moved', 'upon'): ['the'],\n", " ('God', 'said', 'unto'): ['them,'],\n", " ('God', 'said,', 'Behold,'): ['I'],\n", " ('God', 'said,', 'Let'): ['there',\n", " 'there',\n", " 'the',\n", " 'the',\n", " 'there',\n", " 'the',\n", " 'the',\n", " 'us'],\n", " ('God', 'saw', 'every'): ['thing'],\n", " ('God', 'saw', 'that'): ['it', 'it', 'it', 'it', 'it'],\n", " ('God', 'saw', 'the'): ['light,'],\n", " ('God', 'set', 'them'): ['in'],\n", " ('Heaven.', 'And', 'the'): ['evening'],\n", " ('I', 'have', 'given'): ['you', 'every'],\n", " ('In', 'the', 'beginning'): ['God'],\n", " ('Let', 'the', 'earth'): ['bring', 'bring'],\n", " ('Let', 'the', 'waters'): ['under', 'bring'],\n", " ('Let', 'there', 'be'): ['light:', 'a', 'lights'],\n", " ('Let', 'us', 'make'): ['man'],\n", " ('Night.', 'And', 'the'): ['evening'],\n", " ('Seas:', 'and', 'God'): ['saw'],\n", " ('So', 'God', 'created'): ['man'],\n", " ('Spirit', 'of', 'God'): ['moved'],\n", " ('a', 'firmament', 'in'): ['the'],\n", " ('a', 'tree', 'yielding'): 
['seed;'],\n", " ('above', 'the', 'earth'): ['in'],\n", " ('above', 'the', 'firmament:'): ['and'],\n", " ('abundantly', 'the', 'moving'): ['creature'],\n", " ('abundantly,', 'after', 'their'): ['kind,'],\n", " ('after', 'his', 'kind,'): ['whose', 'and', 'cattle,', 'and'],\n", " ('after', 'his', 'kind:'): ['and', 'and', 'and', 'and'],\n", " ('after', 'our', 'likeness:'): ['and'],\n", " ('after', 'their', 'kind,'): ['and', 'and'],\n", " ('air,', 'and', 'over'): ['the', 'every'],\n", " ('air,', 'and', 'to'): ['every'],\n", " ('all', 'the', 'earth,'): ['and', 'and'],\n", " ('also.', 'And', 'God'): ['set'],\n", " ('and', 'God', 'divided'): ['the'],\n", " ('and', 'God', 'said'): ['unto'],\n", " ('and', 'God', 'saw'): ['that', 'that', 'that', 'that', 'that'],\n", " ('and', 'beast', 'of'): ['the'],\n", " ('and', 'cattle', 'after'): ['their'],\n", " ('and', 'creeping', 'thing,'): ['and'],\n", " ('and', 'darkness', 'was'): ['upon'],\n", " ('and', 'divided', 'the'): ['waters'],\n", " ('and', 'every', 'living'): ['creature'],\n", " ('and', 'every', 'thing'): ['that'],\n", " ('and', 'every', 'tree,'): ['in'],\n", " ('and', 'every', 'winged'): ['fowl'],\n", " ('and', 'female', 'created'): ['he'],\n", " ('and', 'fill', 'the'): ['waters'],\n", " ('and', 'for', 'days,'): ['and'],\n", " ('and', 'for', 'seasons,'): ['and'],\n", " ('and', 'fowl', 'that'): ['may'],\n", " ('and', 'have', 'dominion'): ['over'],\n", " ('and', 'herb', 'yielding'): ['seed'],\n", " ('and', 'it', 'was'): ['so.', 'so.', 'so.', 'so.', 'so.', 'so.'],\n", " ('and', 'let', 'fowl'): ['multiply'],\n", " ('and', 'let', 'it'): ['divide'],\n", " ('and', 'let', 'the'): ['dry'],\n", " ('and', 'let', 'them'): ['be', 'have'],\n", " ('and', 'multiply,', 'and'): ['fill', 'replenish'],\n", " ('and', 'over', 'all'): ['the'],\n", " ('and', 'over', 'every'): ['creeping', 'living'],\n", " ('and', 'over', 'the'): ['night,', 'fowl', 'cattle,', 'fowl'],\n", " ('and', 'replenish', 'the'): ['earth,'],\n", " ('and', 'subdue', 'it:'): 
['and'],\n", " ('and', 'the', 'darkness'): ['he'],\n", " ('and', 'the', 'earth.'): ['And'],\n", " ('and', 'the', 'fruit'): ['tree'],\n", " ('and', 'the', 'gathering'): ['together'],\n", " ('and', 'the', 'lesser'): ['light'],\n", " ('and', 'the', 'morning'): ['were', 'were', 'were', 'were', 'were', 'were'],\n", " ('and', 'the', 'tree'): ['yielding'],\n", " ('and', 'there', 'was'): ['light.'],\n", " ('and', 'to', 'divide'): ['the'],\n", " ('and', 'to', 'every'): ['fowl', 'thing'],\n", " ('and', 'void;', 'and'): ['darkness'],\n", " ('and', 'years:', 'And'): ['let'],\n", " ('and,', 'behold,', 'it'): ['was'],\n", " ('appear:', 'and', 'it'): ['was'],\n", " ('be', 'a', 'firmament'): ['in'],\n", " ('be', 'for', 'lights'): ['in'],\n", " ('be', 'for', 'meat.'): ['And'],\n", " ('be', 'for', 'signs,'): ['and'],\n", " ('be', 'gathered', 'together'): ['unto'],\n", " ('be', 'light:', 'and'): ['there'],\n", " ('be', 'lights', 'in'): ['the'],\n", " ('bearing', 'seed,', 'which'): ['is'],\n", " ('beast', 'of', 'the'): ['earth', 'earth', 'earth,'],\n", " ('beginning', 'God', 'created'): ['the'],\n", " ('behold,', 'it', 'was'): ['very'],\n", " ('blessed', 'them,', 'and'): ['God'],\n", " ('blessed', 'them,', 'saying,'): ['Be'],\n", " ('bring', 'forth', 'abundantly'): ['the'],\n", " ('bring', 'forth', 'grass,'): ['the'],\n", " ('bring', 'forth', 'the'): ['living'],\n", " ('brought', 'forth', 'abundantly,'): ['after'],\n", " ('brought', 'forth', 'grass,'): ['and'],\n", " ('called', 'Night.', 'And'): ['the'],\n", " ('called', 'he', 'Seas:'): ['and'],\n", " ('called', 'the', 'dry'): ['land'],\n", " ('called', 'the', 'firmament'): ['Heaven.'],\n", " ('called', 'the', 'light'): ['Day,'],\n", " ('cattle', 'after', 'their'): ['kind,'],\n", " ('cattle,', 'and', 'creeping'): ['thing,'],\n", " ('cattle,', 'and', 'over'): ['all'],\n", " ('created', 'great', 'whales,'): ['and'],\n", " ('created', 'he', 'him;'): ['male'],\n", " ('created', 'he', 'them.'): ['And'],\n", " ('created', 'man', 'in'): 
['his'],\n", " ('created', 'the', 'heaven'): ['and'],\n", " ('creature', 'after', 'his'): ['kind,'],\n", " ('creature', 'that', 'hath'): ['life,'],\n", " ('creature', 'that', 'moveth,'): ['which'],\n", " ('creepeth', 'upon', 'the'): ['earth', 'earth.', 'earth,'],\n", " ('creeping', 'thing', 'that'): ['creepeth'],\n", " ('creeping', 'thing,', 'and'): ['beast'],\n", " ('darkness', 'he', 'called'): ['Night.'],\n", " ('darkness', 'was', 'upon'): ['the'],\n", " ('darkness.', 'And', 'God'): ['called'],\n", " ('darkness:', 'and', 'God'): ['saw'],\n", " ('day', 'and', 'over'): ['the'],\n", " ('day', 'from', 'the'): ['night;'],\n", " ('day,', 'and', 'the'): ['lesser'],\n", " ('day.', 'And', 'God'): ['said,', 'said,', 'said,', 'said,', 'said,'],\n", " ('days,', 'and', 'years:'): ['And'],\n", " ('deep.', 'And', 'the'): ['Spirit'],\n", " ('divide', 'the', 'day'): ['from'],\n", " ('divide', 'the', 'light'): ['from'],\n", " ('divide', 'the', 'waters'): ['from'],\n", " ('divided', 'the', 'light'): ['from'],\n", " ('divided', 'the', 'waters'): ['which'],\n", " ('dominion', 'over', 'the'): ['fish', 'fish'],\n", " ('dry', 'land', 'Earth;'): ['and'],\n", " ('dry', 'land', 'appear:'): ['and'],\n", " ('earth', 'after', 'his'): ['kind:', 'kind,', 'kind:'],\n", " ('earth', 'bring', 'forth'): ['grass,', 'the'],\n", " ('earth', 'brought', 'forth'): ['grass,'],\n", " ('earth', 'in', 'the'): ['open'],\n", " ('earth', 'was', 'without'): ['form,'],\n", " ('earth,', 'And', 'to'): ['rule'],\n", " ('earth,', 'and', 'every'): ['tree,'],\n", " ('earth,', 'and', 'over'): ['every'],\n", " ('earth,', 'and', 'subdue'): ['it:'],\n", " ('earth,', 'and', 'to'): ['every'],\n", " ('earth,', 'wherein', 'there'): ['is'],\n", " ('earth.', 'And', 'God'): ['said,'],\n", " ('earth.', 'And', 'the'): ['earth', 'evening'],\n", " ('earth.', 'So', 'God'): ['created'],\n", " ('earth:', 'and', 'it'): ['was', 'was'],\n", " ('evening', 'and', 'the'): ['morning',\n", " 'morning',\n", " 'morning',\n", " 'morning',\n", " 
'morning',\n", " 'morning'],\n", " ('every', 'beast', 'of'): ['the'],\n", " ('every', 'creeping', 'thing'): ['that'],\n", " ('every', 'fowl', 'of'): ['the'],\n", " ('every', 'green', 'herb'): ['for'],\n", " ('every', 'herb', 'bearing'): ['seed,'],\n", " ('every', 'living', 'creature'): ['that'],\n", " ('every', 'living', 'thing'): ['that'],\n", " ('every', 'thing', 'that'): ['creepeth', 'creepeth', 'he'],\n", " ('every', 'tree,', 'in'): ['the'],\n", " ('every', 'winged', 'fowl'): ['after'],\n", " ('face', 'of', 'all'): ['the'],\n", " ('face', 'of', 'the'): ['deep.', 'waters.'],\n", " ('female', 'created', 'he'): ['them.'],\n", " ('fifth', 'day.', 'And'): ['God'],\n", " ('fill', 'the', 'waters'): ['in'],\n", " ('firmament', 'Heaven.', 'And'): ['the'],\n", " ('firmament', 'from', 'the'): ['waters'],\n", " ('firmament', 'in', 'the'): ['midst'],\n", " ('firmament', 'of', 'heaven.'): ['And'],\n", " ('firmament', 'of', 'the'): ['heaven', 'heaven', 'heaven'],\n", " ('firmament,', 'and', 'divided'): ['the'],\n", " ('firmament:', 'and', 'it'): ['was'],\n", " ('first', 'day.', 'And'): ['God'],\n", " ('fish', 'of', 'the'): ['sea,', 'sea,'],\n", " ('fly', 'above', 'the'): ['earth'],\n", " ('for', 'days,', 'and'): ['years:'],\n", " ('for', 'lights', 'in'): ['the'],\n", " ('for', 'meat.', 'And'): ['to'],\n", " ('for', 'meat:', 'and'): ['it'],\n", " ('for', 'seasons,', 'and'): ['for'],\n", " ('for', 'signs,', 'and'): ['for'],\n", " ('form,', 'and', 'void;'): ['and'],\n", " ('forth', 'abundantly', 'the'): ['moving'],\n", " ('forth', 'abundantly,', 'after'): ['their'],\n", " ('forth', 'grass,', 'and'): ['herb'],\n", " ('forth', 'grass,', 'the'): ['herb'],\n", " ('forth', 'the', 'living'): ['creature'],\n", " ('fourth', 'day.', 'And'): ['God'],\n", " ('fowl', 'after', 'his'): ['kind:'],\n", " ('fowl', 'multiply', 'in'): ['the'],\n", " ('fowl', 'of', 'the'): ['air,', 'air,', 'air,'],\n", " ('fowl', 'that', 'may'): ['fly'],\n", " ('from', 'the', 'darkness.'): ['And'],\n", " ('from', 
'the', 'darkness:'): ['and'],\n", " ('from', 'the', 'night;'): ['and'],\n", " ('from', 'the', 'waters'): ['which'],\n", " ('from', 'the', 'waters.'): ['And'],\n", " ('fruit', 'after', 'his'): ['kind,'],\n", " ('fruit', 'of', 'a'): ['tree'],\n", " ('fruit', 'tree', 'yielding'): ['fruit'],\n", " ('fruit,', 'whose', 'seed'): ['was'],\n", " ('fruitful,', 'and', 'multiply,'): ['and', 'and'],\n", " ('gathered', 'together', 'unto'): ['one'],\n", " ('gathering', 'together', 'of'): ['the'],\n", " ('give', 'light', 'upon'): ['the', 'the'],\n", " ('given', 'every', 'green'): ['herb'],\n", " ('given', 'you', 'every'): ['herb'],\n", " ('good.', 'And', 'God'): ['said,', 'blessed', 'said,'],\n", " ('good.', 'And', 'the'): ['evening', 'evening', 'evening'],\n", " ('good:', 'and', 'God'): ['divided'],\n", " ('grass,', 'and', 'herb'): ['yielding'],\n", " ('grass,', 'the', 'herb'): ['yielding'],\n", " ('great', 'lights;', 'the'): ['greater'],\n", " ('great', 'whales,', 'and'): ['every'],\n", " ('greater', 'light', 'to'): ['rule'],\n", " ('green', 'herb', 'for'): ['meat:'],\n", " ('had', 'made,', 'and,'): ['behold,'],\n", " ('hath', 'life,', 'and'): ['fowl'],\n", " ('have', 'dominion', 'over'): ['the', 'the'],\n", " ('have', 'given', 'every'): ['green'],\n", " ('have', 'given', 'you'): ['every'],\n", " ('he', 'Seas:', 'and'): ['God'],\n", " ('he', 'called', 'Night.'): ['And'],\n", " ('he', 'had', 'made,'): ['and,'],\n", " ('he', 'him;', 'male'): ['and'],\n", " ('he', 'made', 'the'): ['stars'],\n", " ('he', 'them.', 'And'): ['God'],\n", " ('heaven', 'and', 'the'): ['earth.'],\n", " ('heaven', 'be', 'gathered'): ['together'],\n", " ('heaven', 'to', 'divide'): ['the'],\n", " ('heaven', 'to', 'give'): ['light', 'light'],\n", " ('heaven.', 'And', 'God'): ['created'],\n", " ('herb', 'bearing', 'seed,'): ['which'],\n", " ('herb', 'for', 'meat:'): ['and'],\n", " ('herb', 'yielding', 'seed'): ['after'],\n", " ('herb', 'yielding', 'seed,'): ['and'],\n", " ('him;', 'male', 'and'): 
['female'],\n", " ('his', 'kind,', 'and'): ['the', 'cattle'],\n", " ('his', 'kind,', 'cattle,'): ['and'],\n", " ('his', 'kind,', 'whose'): ['seed'],\n", " ('his', 'kind:', 'and'): ['God', 'God', 'it', 'God'],\n", " ('his', 'own', 'image,'): ['in'],\n", " ('image', 'of', 'God'): ['created'],\n", " ('image,', 'after', 'our'): ['likeness:'],\n", " ('image,', 'in', 'the'): ['image'],\n", " ('in', 'his', 'own'): ['image,'],\n", " ('in', 'itself,', 'after'): ['his'],\n", " ('in', 'itself,', 'upon'): ['the'],\n", " ('in', 'our', 'image,'): ['after'],\n", " ('in', 'the', 'earth.'): ['And'],\n", " ('in', 'the', 'firmament'): ['of', 'of', 'of'],\n", " ('in', 'the', 'image'): ['of'],\n", " ('in', 'the', 'midst'): ['of'],\n", " ('in', 'the', 'open'): ['firmament'],\n", " ('in', 'the', 'seas,'): ['and'],\n", " ('in', 'the', 'which'): ['is'],\n", " ('is', 'in', 'itself,'): ['upon'],\n", " ('is', 'life,', 'I'): ['have'],\n", " ('is', 'the', 'fruit'): ['of'],\n", " ('is', 'upon', 'the'): ['face'],\n", " ('it', 'divide', 'the'): ['waters'],\n", " ('it', 'shall', 'be'): ['for'],\n", " ('it', 'was', 'good.'): ['And', 'And', 'And', 'And', 'And'],\n", " ('it', 'was', 'good:'): ['and'],\n", " ('it', 'was', 'so.'): ['And', 'And', 'And', 'And', 'And', 'And'],\n", " ('it', 'was', 'very'): ['good.'],\n", " ('it:', 'and', 'have'): ['dominion'],\n", " ('itself,', 'after', 'his'): ['kind:'],\n", " ('itself,', 'upon', 'the'): ['earth:'],\n", " ('kind,', 'and', 'cattle'): ['after'],\n", " ('kind,', 'and', 'every'): ['winged', 'thing'],\n", " ('kind,', 'and', 'the'): ['tree'],\n", " ('kind,', 'cattle,', 'and'): ['creeping'],\n", " ('kind,', 'whose', 'seed'): ['is'],\n", " ('kind:', 'and', 'God'): ['saw', 'saw', 'saw'],\n", " ('kind:', 'and', 'it'): ['was'],\n", " ('land', 'Earth;', 'and'): ['the'],\n", " ('land', 'appear:', 'and'): ['it'],\n", " ('lesser', 'light', 'to'): ['rule'],\n", " ('let', 'fowl', 'multiply'): ['in'],\n", " ('let', 'it', 'divide'): ['the'],\n", " ('let', 'the', 'dry'): 
['land'],\n", " ('let', 'them', 'be'): ['for', 'for'],\n", " ('let', 'them', 'have'): ['dominion'],\n", " ('life,', 'I', 'have'): ['given'],\n", " ('life,', 'and', 'fowl'): ['that'],\n", " ('light', 'Day,', 'and'): ['the'],\n", " ('light', 'from', 'the'): ['darkness.', 'darkness:'],\n", " ('light', 'to', 'rule'): ['the', 'the'],\n", " ('light', 'upon', 'the'): ['earth:', 'earth,'],\n", " ('light,', 'that', 'it'): ['was'],\n", " ('light.', 'And', 'God'): ['saw'],\n", " ('light:', 'and', 'there'): ['was'],\n", " ('lights', 'in', 'the'): ['firmament', 'firmament'],\n", " ('lights;', 'the', 'greater'): ['light'],\n", " ('likeness:', 'and', 'let'): ['them'],\n", " ('living', 'creature', 'after'): ['his'],\n", " ('living', 'creature', 'that'): ['moveth,'],\n", " ('living', 'thing', 'that'): ['moveth'],\n", " ('made', 'the', 'beast'): ['of'],\n", " ('made', 'the', 'firmament,'): ['and'],\n", " ('made', 'the', 'stars'): ['also.'],\n", " ('made', 'two', 'great'): ['lights;'],\n", " ('made,', 'and,', 'behold,'): ['it'],\n", " ('make', 'man', 'in'): ['our'],\n", " ('male', 'and', 'female'): ['created'],\n", " ('man', 'in', 'his'): ['own'],\n", " ('man', 'in', 'our'): ['image,'],\n", " ('may', 'fly', 'above'): ['the'],\n", " ('meat.', 'And', 'to'): ['every'],\n", " ('meat:', 'and', 'it'): ['was'],\n", " ('midst', 'of', 'the'): ['waters,'],\n", " ('morning', 'were', 'the'): ['first',\n", " 'second',\n", " 'third',\n", " 'fourth',\n", " 'fifth',\n", " 'sixth'],\n", " ('moved', 'upon', 'the'): ['face'],\n", " ('moveth', 'upon', 'the'): ['earth.'],\n", " ('moveth,', 'which', 'the'): ['waters'],\n", " ('moving', 'creature', 'that'): ['hath'],\n", " ('multiply', 'in', 'the'): ['earth.'],\n", " ('multiply,', 'and', 'fill'): ['the'],\n", " ('multiply,', 'and', 'replenish'): ['the'],\n", " ('night,', 'and', 'to'): ['divide'],\n", " ('night:', 'he', 'made'): ['the'],\n", " ('night;', 'and', 'let'): ['them'],\n", " ('of', 'God', 'created'): ['he'],\n", " ('of', 'God', 'moved'): 
['upon'],\n", " ('of', 'a', 'tree'): ['yielding'],\n", " ('of', 'all', 'the'): ['earth,'],\n", " ('of', 'heaven.', 'And'): ['God'],\n", " ('of', 'the', 'air,'): ['and', 'and', 'and'],\n", " ('of', 'the', 'deep.'): ['And'],\n", " ('of', 'the', 'earth'): ['after', 'after'],\n", " ('of', 'the', 'earth,'): ['and'],\n", " ('of', 'the', 'heaven'): ['to', 'to', 'to'],\n", " ('of', 'the', 'sea,'): ['and', 'and'],\n", " ('of', 'the', 'waters'): ['called'],\n", " ('of', 'the', 'waters,'): ['and'],\n", " ('of', 'the', 'waters.'): ['And'],\n", " ('one', 'place,', 'and'): ['let'],\n", " ('open', 'firmament', 'of'): ['heaven.'],\n", " ('our', 'image,', 'after'): ['our'],\n", " ('our', 'likeness:', 'and'): ['let'],\n", " ('over', 'all', 'the'): ['earth,'],\n", " ('over', 'every', 'creeping'): ['thing'],\n", " ('over', 'every', 'living'): ['thing'],\n", " ('over', 'the', 'cattle,'): ['and'],\n", " ('over', 'the', 'day'): ['and'],\n", " ('over', 'the', 'fish'): ['of', 'of'],\n", " ('over', 'the', 'fowl'): ['of', 'of'],\n", " ('over', 'the', 'night,'): ['and'],\n", " ('own', 'image,', 'in'): ['the'],\n", " ('place,', 'and', 'let'): ['the'],\n", " ('replenish', 'the', 'earth,'): ['and'],\n", " ('rule', 'over', 'the'): ['day'],\n", " ('rule', 'the', 'day,'): ['and'],\n", " ('rule', 'the', 'night:'): ['he'],\n", " ('said', 'unto', 'them,'): ['Be'],\n", " ('said,', 'Behold,', 'I'): ['have'],\n", " ('said,', 'Let', 'the'): ['waters', 'earth', 'waters', 'earth'],\n", " ('said,', 'Let', 'there'): ['be', 'be', 'be'],\n", " ('said,', 'Let', 'us'): ['make'],\n", " ('saw', 'every', 'thing'): ['that'],\n", " ('saw', 'that', 'it'): ['was', 'was', 'was', 'was', 'was'],\n", " ('saw', 'the', 'light,'): ['that'],\n", " ('saying,', 'Be', 'fruitful,'): ['and'],\n", " ('sea,', 'and', 'over'): ['the', 'the'],\n", " ('seas,', 'and', 'let'): ['fowl'],\n", " ('seasons,', 'and', 'for'): ['days,'],\n", " ('second', 'day.', 'And'): ['God'],\n", " ('seed', 'after', 'his'): ['kind,'],\n", " ('seed', 'is', 
'in'): ['itself,'],\n", " ('seed', 'was', 'in'): ['itself,'],\n", " ('seed,', 'and', 'the'): ['fruit'],\n", " ('seed,', 'which', 'is'): ['upon'],\n", " ('seed;', 'to', 'you'): ['it'],\n", " ('set', 'them', 'in'): ['the'],\n", " ('shall', 'be', 'for'): ['meat.'],\n", " ('signs,', 'and', 'for'): ['seasons,'],\n", " ('so.', 'And', 'God'): ['called', 'called', 'made', 'made', 'saw'],\n", " ('so.', 'And', 'the'): ['earth'],\n", " ('stars', 'also.', 'And'): ['God'],\n", " ('subdue', 'it:', 'and'): ['have'],\n", " ('that', 'creepeth', 'upon'): ['the', 'the', 'the'],\n", " ('that', 'hath', 'life,'): ['and'],\n", " ('that', 'he', 'had'): ['made,'],\n", " ('that', 'it', 'was'): ['good:', 'good.', 'good.', 'good.', 'good.', 'good.'],\n", " ('that', 'may', 'fly'): ['above'],\n", " ('that', 'moveth', 'upon'): ['the'],\n", " ('that', 'moveth,', 'which'): ['the'],\n", " ('the', 'Spirit', 'of'): ['God'],\n", " ('the', 'air,', 'and'): ['over', 'over', 'to'],\n", " ('the', 'beast', 'of'): ['the'],\n", " ('the', 'beginning', 'God'): ['created'],\n", " ('the', 'cattle,', 'and'): ['over'],\n", " ('the', 'darkness', 'he'): ['called'],\n", " ('the', 'darkness.', 'And'): ['God'],\n", " ('the', 'darkness:', 'and'): ['God'],\n", " ('the', 'day', 'and'): ['over'],\n", " ('the', 'day', 'from'): ['the'],\n", " ('the', 'day,', 'and'): ['the'],\n", " ('the', 'deep.', 'And'): ['the'],\n", " ('the', 'dry', 'land'): ['appear:', 'Earth;'],\n", " ('the', 'earth', 'after'): ['his', 'his', 'his'],\n", " ('the', 'earth', 'bring'): ['forth', 'forth'],\n", " ('the', 'earth', 'brought'): ['forth'],\n", " ('the', 'earth', 'in'): ['the'],\n", " ('the', 'earth', 'was'): ['without'],\n", " ('the', 'earth,', 'And'): ['to'],\n", " ('the', 'earth,', 'and'): ['over', 'subdue', 'every', 'to'],\n", " ('the', 'earth,', 'wherein'): ['there'],\n", " ('the', 'earth.', 'And'): ['the', 'the', 'God'],\n", " ('the', 'earth.', 'So'): ['God'],\n", " ('the', 'earth:', 'and'): ['it', 'it'],\n", " ('the', 'evening', 'and'): 
['the', 'the', 'the', 'the', 'the', 'the'],\n", " ('the', 'face', 'of'): ['the', 'the', 'all'],\n", " ('the', 'fifth', 'day.'): ['And'],\n", " ('the', 'firmament', 'Heaven.'): ['And'],\n", " ('the', 'firmament', 'from'): ['the'],\n", " ('the', 'firmament', 'of'): ['the', 'the', 'the'],\n", " ('the', 'firmament,', 'and'): ['divided'],\n", " ('the', 'firmament:', 'and'): ['it'],\n", " ('the', 'first', 'day.'): ['And'],\n", " ('the', 'fish', 'of'): ['the', 'the'],\n", " ('the', 'fourth', 'day.'): ['And'],\n", " ('the', 'fowl', 'of'): ['the', 'the'],\n", " ('the', 'fruit', 'of'): ['a'],\n", " ('the', 'fruit', 'tree'): ['yielding'],\n", " ('the', 'gathering', 'together'): ['of'],\n", " ('the', 'greater', 'light'): ['to'],\n", " ('the', 'heaven', 'and'): ['the'],\n", " ('the', 'heaven', 'be'): ['gathered'],\n", " ('the', 'heaven', 'to'): ['divide', 'give', 'give'],\n", " ('the', 'herb', 'yielding'): ['seed,'],\n", " ('the', 'image', 'of'): ['God'],\n", " ('the', 'lesser', 'light'): ['to'],\n", " ('the', 'light', 'Day,'): ['and'],\n", " ('the', 'light', 'from'): ['the', 'the'],\n", " ('the', 'light,', 'that'): ['it'],\n", " ('the', 'living', 'creature'): ['after'],\n", " ('the', 'midst', 'of'): ['the'],\n", " ('the', 'morning', 'were'): ['the', 'the', 'the', 'the', 'the', 'the'],\n", " ('the', 'moving', 'creature'): ['that'],\n", " ('the', 'night,', 'and'): ['to'],\n", " ('the', 'night:', 'he'): ['made'],\n", " ('the', 'night;', 'and'): ['let'],\n", " ('the', 'open', 'firmament'): ['of'],\n", " ('the', 'sea,', 'and'): ['over', 'over'],\n", " ('the', 'seas,', 'and'): ['let'],\n", " ('the', 'second', 'day.'): ['And'],\n", " ('the', 'sixth', 'day.'): [None],\n", " ('the', 'stars', 'also.'): ['And'],\n", " ('the', 'third', 'day.'): ['And'],\n", " ('the', 'tree', 'yielding'): ['fruit,'],\n", " ('the', 'waters', 'bring'): ['forth'],\n", " ('the', 'waters', 'brought'): ['forth'],\n", " ('the', 'waters', 'called'): ['he'],\n", " ('the', 'waters', 'from'): ['the'],\n", " ('the', 
'waters', 'in'): ['the'],\n", " ('the', 'waters', 'under'): ['the'],\n", " ('the', 'waters', 'which'): ['were', 'were'],\n", " ('the', 'waters,', 'and'): ['let'],\n", " ('the', 'waters.', 'And'): ['God', 'God'],\n", " ('the', 'which', 'is'): ['the'],\n", " ('their', 'kind,', 'and'): ['every', 'every'],\n", " ('them', 'be', 'for'): ['signs,', 'lights'],\n", " ('them', 'have', 'dominion'): ['over'],\n", " ('them', 'in', 'the'): ['firmament'],\n", " ('them,', 'Be', 'fruitful,'): ['and'],\n", " ('them,', 'and', 'God'): ['said'],\n", " ('them,', 'saying,', 'Be'): ['fruitful,'],\n", " ('them.', 'And', 'God'): ['blessed'],\n", " ('there', 'be', 'a'): ['firmament'],\n", " ('there', 'be', 'light:'): ['and'],\n", " ('there', 'be', 'lights'): ['in'],\n", " ('there', 'is', 'life,'): ['I'],\n", " ('there', 'was', 'light.'): ['And'],\n", " ('thing', 'that', 'creepeth'): ['upon', 'upon', 'upon'],\n", " ('thing', 'that', 'he'): ['had'],\n", " ('thing', 'that', 'moveth'): ['upon'],\n", " ('thing,', 'and', 'beast'): ['of'],\n", " ('third', 'day.', 'And'): ['God'],\n", " ('to', 'divide', 'the'): ['day', 'light'],\n", " ('to', 'every', 'beast'): ['of'],\n", " ('to', 'every', 'fowl'): ['of'],\n", " ('to', 'every', 'thing'): ['that'],\n", " ('to', 'give', 'light'): ['upon', 'upon'],\n", " ('to', 'rule', 'over'): ['the'],\n", " ('to', 'rule', 'the'): ['day,', 'night:'],\n", " ('to', 'you', 'it'): ['shall'],\n", " ('together', 'of', 'the'): ['waters'],\n", " ('together', 'unto', 'one'): ['place,'],\n", " ('tree', 'yielding', 'fruit'): ['after'],\n", " ('tree', 'yielding', 'fruit,'): ['whose'],\n", " ('tree', 'yielding', 'seed;'): ['to'],\n", " ('tree,', 'in', 'the'): ['which'],\n", " ('two', 'great', 'lights;'): ['the'],\n", " ('under', 'the', 'firmament'): ['from'],\n", " ('under', 'the', 'heaven'): ['be'],\n", " ('unto', 'one', 'place,'): ['and'],\n", " ('unto', 'them,', 'Be'): ['fruitful,'],\n", " ('upon', 'the', 'earth'): ['after'],\n", " ('upon', 'the', 'earth,'): ['And', 
'wherein'],\n", " ('upon', 'the', 'earth.'): ['So', 'And'],\n", " ('upon', 'the', 'earth:'): ['and', 'and'],\n", " ('upon', 'the', 'face'): ['of', 'of', 'of'],\n", " ('us', 'make', 'man'): ['in'],\n", " ('very', 'good.', 'And'): ['the'],\n", " ('void;', 'and', 'darkness'): ['was'],\n", " ('was', 'good.', 'And'): ['God', 'the', 'the', 'God', 'God'],\n", " ('was', 'good:', 'and'): ['God'],\n", " ('was', 'in', 'itself,'): ['after'],\n", " ('was', 'light.', 'And'): ['God'],\n", " ('was', 'so.', 'And'): ['God', 'God', 'the', 'God', 'God', 'God'],\n", " ('was', 'upon', 'the'): ['face'],\n", " ('was', 'very', 'good.'): ['And'],\n", " ('was', 'without', 'form,'): ['and'],\n", " ('waters', 'bring', 'forth'): ['abundantly'],\n", " ('waters', 'brought', 'forth'): ['abundantly,'],\n", " ('waters', 'called', 'he'): ['Seas:'],\n", " ('waters', 'from', 'the'): ['waters.'],\n", " ('waters', 'in', 'the'): ['seas,'],\n", " ('waters', 'under', 'the'): ['heaven'],\n", " ('waters', 'which', 'were'): ['under', 'above'],\n", " ('waters,', 'and', 'let'): ['it'],\n", " ('waters.', 'And', 'God'): ['said,', 'made'],\n", " ('were', 'above', 'the'): ['firmament:'],\n", " ('were', 'the', 'fifth'): ['day.'],\n", " ('were', 'the', 'first'): ['day.'],\n", " ('were', 'the', 'fourth'): ['day.'],\n", " ('were', 'the', 'second'): ['day.'],\n", " ('were', 'the', 'sixth'): ['day.'],\n", " ('were', 'the', 'third'): ['day.'],\n", " ('were', 'under', 'the'): ['firmament'],\n", " ('whales,', 'and', 'every'): ['living'],\n", " ('wherein', 'there', 'is'): ['life,'],\n", " ('which', 'is', 'the'): ['fruit'],\n", " ('which', 'is', 'upon'): ['the'],\n", " ('which', 'the', 'waters'): ['brought'],\n", " ('which', 'were', 'above'): ['the'],\n", " ('which', 'were', 'under'): ['the'],\n", " ('whose', 'seed', 'is'): ['in'],\n", " ('whose', 'seed', 'was'): ['in'],\n", " ('winged', 'fowl', 'after'): ['his'],\n", " ('without', 'form,', 'and'): ['void;'],\n", " ('years:', 'And', 'let'): ['them'],\n", " ('yielding', 
'fruit', 'after'): ['his'],\n",
 " ('yielding', 'fruit,', 'whose'): ['seed'],\n",
 " ('yielding', 'seed', 'after'): ['his'],\n",
 " ('yielding', 'seed,', 'and'): ['the'],\n",
 " ('yielding', 'seed;', 'to'): ['you'],\n",
 " ('you', 'every', 'herb'): ['bearing'],\n",
 " ('you', 'it', 'shall'): ['be']}" ] }, "execution_count": 346, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genesis_markov_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use the Markov model to make *predictions*. Given the information in the Markov model of `genesis.txt`, what words are likely to follow the sequence of words `and over the`? We can find out simply by getting the value for the key for that sequence:" ] }, { "cell_type": "code", "execution_count": 347, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['night,', 'fowl', 'cattle,', 'fowl']" ] }, "execution_count": 347, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genesis_markov_model[('and', 'over', 'the')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tells us that the sequence `and over the` is followed by `fowl` 50% of the time, `night,` 25% of the time and `cattle,` 25% of the time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Markov chains: Generating text from a Markov model\n",
 "\n",
 "The Markov models we created above don't just give us interesting statistical probabilities. They also allow us to generate a *new* text with those probabilities by *chaining together predictions*. Here’s how we’ll do it, starting with the order-2 character-level Markov model of `condescendences`: (1) start with the initial n-gram (`co`)—those are the first two characters of our output. (2) Now, look at the last *n* characters of output, where *n* is the order of the n-grams in our table, and find those characters in the “n-grams” column. 
(3) Choose randomly among the possibilities in the corresponding “next” column, and append that letter to the output. (Sometimes, as with `co`, there’s only one possibility). (4) If you chose “end of text,” then the algorithm is over. Otherwise, repeat the process starting with (2). Here’s a record of the algorithm in action:\n",
 "\n",
 "    co\n",
 "    con\n",
 "    cond\n",
 "    conde\n",
 "    conden\n",
 "    condend\n",
 "    condende\n",
 "    condendes\n",
 "    condendesc\n",
 "    condendesce\n",
 "    condendesces\n",
 "    \n",
 "As you can see, we’ve come up with a word that looks like the original word, and could even be passed off as a genuine English word (if you squint at it). From a statistical standpoint, the output of our algorithm is nearly indistinguishable from the input. This kind of algorithm—moving from one state to the next, according to a list of probabilities—is known as a Markov chain.\n",
 "\n",
 "Implementing this procedure in code is a little bit tricky, but it looks something like this. First, we'll make a Markov model of `condescendences`:" ] }, { "cell_type": "code", "execution_count": 348, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cmodel = markov_model(2, \"condescendences\")" ] }, { "cell_type": "code", "execution_count": 349, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('c', 'e'): ['n', 's'],\n",
 " ('c', 'o'): ['n'],\n",
 " ('d', 'e'): ['s', 'n'],\n",
 " ('e', 'n'): ['d', 'c'],\n",
 " ('e', 's'): ['c', None],\n",
 " ('n', 'c'): ['e'],\n",
 " ('n', 'd'): ['e', 'e'],\n",
 " ('o', 'n'): ['d'],\n",
 " ('s', 'c'): ['e']}" ] }, "execution_count": 349, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cmodel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to generate output as we go. 
We'll initialize the output to the characters we want to start with, i.e., `co`:" ] }, { "cell_type": "code", "execution_count": 379, "metadata": { "collapsed": true }, "outputs": [], "source": [ "output = \"co\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now what we have to do is get the last two characters of the output, look them up in the model, and select randomly among the characters in the value for that key (which should be a list). Finally, we'll append that randomly-selected value to the end of the string:" ] }, { "cell_type": "code", "execution_count": 380, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "con\n" ] } ], "source": [ "ngram = tuple(output[-2:])\n", "next_item = random.choice(cmodel[ngram])\n", "output += next_item\n", "print(output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try running the cell above multiple times: the `output` variable will get longer and longer—until you get an error. You can also put it into a `for` loop:" ] }, { "cell_type": "code", "execution_count": 391, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "con\n", "cond\n", "conde\n", "conden\n", "condend\n", "condende\n", "condenden\n", "condendend\n", "condendende\n", "condendenden\n", "condendendend\n", "condendendende\n", "condendendendes\n" ] }, { "ename": "TypeError", "evalue": "must be str, not NoneType", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mngram\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mtuple\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mnext_item\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrandom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mchoice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcmodel\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mngram\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0moutput\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0mnext_item\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mTypeError\u001b[0m: must be str, not NoneType" ] } ], "source": [ "output = \"co\"\n", "for i in range(100):\n", " ngram = tuple(output[-2:])\n", " next_item = random.choice(cmodel[ngram])\n", " output += next_item\n", " print(output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `TypeError` you see above is what happens when we stumble upon the \"end of text\" condition, which we'd chosen above to represent with the special Python value `None`. When this value comes up, it means that statistically speaking, we've reached the end of the text, and so can stop generating. 
We'll obey this directive by breaking out of the loop early with the `break` keyword:" ] }, { "cell_type": "code", "execution_count": 401, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "con\n", "cond\n", "conde\n", "condes\n", "condesc\n", "condesce\n", "condescen\n", "condescenc\n", "condescence\n", "condescencen\n", "condescencend\n", "condescencende\n", "condescencendes\n" ] } ], "source": [ "output = \"co\"\n", "for i in range(100):\n", " ngram = tuple(output[-2:])\n", " next_item = random.choice(cmodel[ngram])\n", " if next_item is None:\n", " break # \"break\" tells Python to immediately exit the loop, skipping any remaining values\n", " else:\n", " output += next_item\n", " print(output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why `range(100)`? No reason, really—I just picked 100 as a reasonable number for the maximum number of times the Markov chain should attempt to append to the output. Because there's a loop in this particular model (`nd` -> `e`, `de` -> `n`, `en` -> `d`), any time you generate text from this Markov chain, it could potentially go on infinitely. Limiting the number to `100` makes sure that it doesn't ever actually do that.
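To see how much room that loop leaves, here's a self-contained sketch that runs the generator many times and records the lengths it produces. The model dictionary is written out by hand, copied from the `cmodel` output above, so this cell doesn't depend on any earlier ones:

```python
import random

# Order-2 character model of "condescendences", copied from the cmodel
# output above; None is the end-of-text token.
cmodel = {('c', 'e'): ['n', 's'], ('c', 'o'): ['n'], ('d', 'e'): ['s', 'n'],
          ('e', 'n'): ['d', 'c'], ('e', 's'): ['c', None], ('n', 'c'): ['e'],
          ('n', 'd'): ['e', 'e'], ('o', 'n'): ['d'], ('s', 'c'): ['e']}

def generate(max_gen=100):
    output = "co"
    for i in range(max_gen):  # the cap keeps loops in the model from running forever
        next_item = random.choice(cmodel[tuple(output[-2:])])
        if next_item is None:  # end-of-text token: stop early
            break
        output += next_item
    return output

lengths = [len(generate()) for i in range(1000)]
print(min(lengths), max(lengths))  # lengths vary quite a bit from run to run
```

Every run is forced to begin with `conde`, but after that the `nd`/`de`/`en` cycle can keep the chain alive for a while, so some runs come out much longer than others.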
You should adjust the number based on what you need the Markov chain to do.\n", "\n", "### A function to generate from a Markov model\n", "\n", "The `gen_from_model()` function below is a more general version of the code that we just wrote that works with lists and strings and n-grams of any length:" ] }, { "cell_type": "code", "execution_count": 407, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import random\n", "def gen_from_model(n, model, start=None, max_gen=100):\n", " if start is None:\n", " start = random.choice(list(model.keys()))\n", " output = list(start)\n", " for i in range(max_gen):\n", " start = tuple(output[-n:])\n", " next_item = random.choice(model[start])\n", " if next_item is None:\n", " break\n", " else:\n", " output.append(next_item)\n", " return output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `gen_from_model()` function's first parameter is the n-gram length; the second parameter is a Markov model, as returned from `markov_model()` defined above; and the third parameter is the \"seed\" n-gram to start the generation from. (The optional fourth parameter, `max_gen`, caps the number of items the function will append to the output.)
The `gen_from_model()` function always returns a list:" ] }, { "cell_type": "code", "execution_count": 408, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['c', 'o', 'n', 'd', 'e', 'n', 'c', 'e', 'n', 'd', 'e', 's', 'c', 'e', 's']" ] }, "execution_count": 408, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen_from_model(2, cmodel, ('c', 'o'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So if you're working with a character-level Markov chain, you'll want to glue the list back together into a string:" ] }, { "cell_type": "code", "execution_count": 467, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'condendescendendencendencesces'" ] }, "execution_count": 467, "metadata": {}, "output_type": "execute_result" } ], "source": [ "''.join(gen_from_model(2, cmodel, ('c', 'o')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you leave out the \"seed,\" this function will just pick a random n-gram to start with:" ] }, { "cell_type": "code", "execution_count": 498, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ashells seashore\n", "s seashore\n", "hells seashe seashe seashore\n", "y the sells by the seashe sells by the sells seashells by the seashore\n", "s seashe seashe sells seashore\n", "she seashore\n", "ells by the seashore\n", "ore\n", "y the sells sells by the seashore\n", "she seashore\n", "ore\n", "sells by the seashe sells sells sells by the sells seashe seashe seashore\n" ] } ], "source": [ "sea_model = markov_model(3, \"she sells seashells by the seashore\")\n", "for i in range(12):\n", " print(''.join(gen_from_model(3, sea_model)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Advanced Markov style: Generating lines\n", "\n", "You can use the `gen_from_model()` function to generate word-level Markov chains as well:" ] }, { "cell_type": "code", "execution_count": 499, "metadata": { "collapsed": true }, "outputs": [], "source": [ 
"genesis_word_model = markov_model(2, open(\"genesis.txt\").read().split())" ] }, { "cell_type": "code", "execution_count": 503, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "In the beginning God created great whales, and every living creature after his kind: and God said unto them, Be fruitful, and multiply, and fill the waters in the firmament of the waters, and let it divide the waters which were under the firmament Heaven. And the evening and the morning were the fourth day. And God said, Let the earth after his kind, and cattle after their kind, and cattle after their kind, and the gathering together of the earth, and to every fowl of the air, and over the night, and to divide the day and over all the earth,\n" ] } ], "source": [ "generated_words = gen_from_model(2, genesis_word_model, ('In', 'the'))\n", "print(' '.join(generated_words))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks good! But there's a problem: the generation of the text just sorta... keeps going. Actually it goes on for exactly 100 words, which is also the maximum number of iterations specified in the function. We can make it go even longer by supplying a fourth parameter to the function:" ] }, { "cell_type": "code", "execution_count": 504, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "In the beginning God created he him; male and female created he him; male and female created he him; male and female created he him; male and female created he him; male and female created he him; male and female created he them. And God said, Let there be light: and there was light. And God said, Let the waters which were above the firmament: and it was so. And the earth bring forth the living creature after his kind, cattle, and creeping thing, and beast of the air, and over the cattle, and creeping thing, and beast of the sea, and over the fish of the waters. 
And God made two great lights; the greater light to rule the day, and the morning were the third day. And God said, Let the earth was without form, and void; and darkness was upon the earth brought forth grass, the herb yielding seed after his kind, and every tree, in the which is upon the earth. And the earth brought forth abundantly, after their kind, and cattle after their kind, and cattle after their kind, and every living thing that moveth upon the earth. So God created the heaven and the morning were the sixth day.\n" ] } ], "source": [ "generated_words = gen_from_model(2, genesis_word_model, ('In', 'the'), 500)\n", "print(' '.join(generated_words))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason for this is that unless the Markov chain generator reaches the \"end of text\" token, it'll just keep going until it hits the maximum number of iterations (`max_gen`). And the longer the text, the less likely it is that the \"end of text\" token will be reached.\n", "\n", "Maybe this is okay, but the underlying text actually has some structure in it: each line of the file is actually a verse. If you want to generate individual *verses*, you need to treat each line separately, producing an end-of-text token for each line. The following function does just this by creating a model, adding each sequence from a list to the model as a separate item (so each one gets its own end-of-text token), and returning the combined model:" ] }, { "cell_type": "code", "execution_count": 505, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def markov_model_from_sequences(n, sequences):\n", " model = {}\n", " for item in sequences:\n", " add_to_model(model, n, item)\n", " return model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function expects to receive a list of sequences (the sequences can be either lists or strings, depending on if you want a word-level model or a character-level model).
So, for example:" ] }, { "cell_type": "code", "execution_count": 515, "metadata": { "collapsed": true }, "outputs": [], "source": [ "genesis_lines = open(\"genesis.txt\").readlines() # all of the lines from the file\n", "# genesis_lines_words will be a list of lists of words in each line\n", "genesis_lines_words = [line.strip().split() for line in genesis_lines] # strip whitespace and split into words\n", "genesis_lines_model = markov_model_from_sequences(2, genesis_lines_words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `genesis_lines_model` variable now contains a Markov model with end-of-text tokens where they should be, at the end of each line. Generating from this model, we get:" ] }, { "cell_type": "code", "execution_count": 516, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "verse 0 - firmament from the waters called he Seas: and God saw that it was so.\n", "verse 1 - creature that moveth, which the waters bring forth the living creature after his kind, whose seed was in itself, upon the face of the air, and over all the earth, and to divide the day from the darkness: and God saw the light, that it was good.\n", "verse 2 - cattle, and creeping thing, and beast of the heaven be gathered together unto one place, and let it divide the light from the night; and let them have dominion over the cattle, and over the fowl of the heaven and the morning were the fourth day.\n", "verse 3 - seed after his kind, whose seed is in itself, after his kind: and God saw the light, that it was good.\n", "verse 4 - forth grass, the herb yielding seed, and the morning were the fourth day.\n", "verse 5 - for signs, and for days, and years:\n", "verse 6 - whose seed is in itself, after his kind: and it was good.\n", "verse 7 - were the third day.\n", "verse 8 - over every creeping thing that creepeth upon the earth: and it was good.\n", "verse 9 - seed, and the darkness he called Night. 
And the evening and the lesser light to rule over the fish of the heaven be gathered together unto one place, and let them have dominion over the fowl of the earth after his kind, whose seed was in itself, after his kind, and every tree, in the firmament of the deep. And the earth was without form, and void; and darkness was upon the earth.\n" ] } ], "source": [ "for i in range(10):\n", " print(\"verse\", i, \"-\", ' '.join(gen_from_model(2, genesis_lines_model)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Better—the verses are ending at appropriate places—but still not quite right, since we're generating from random keys in the Markov model! To make this absolutely correct, we'd want to *start* each line with an n-gram that also occurred at the start of each line in the original text file. To do this, we'll work in two passes. First, get the list of lists of words:" ] }, { "cell_type": "code", "execution_count": 517, "metadata": { "collapsed": true }, "outputs": [], "source": [ "genesis_lines = open(\"genesis.txt\").readlines() # all of the lines from the file\n", "# genesis_lines_words will be a list of lists of words in each line\n", "genesis_lines_words = [line.strip().split() for line in genesis_lines] # strip whitespace and split into words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, get the n-grams at the start of each line:" ] }, { "cell_type": "code", "execution_count": 518, "metadata": { "collapsed": true }, "outputs": [], "source": [ "genesis_starts = [item[:2] for item in genesis_lines_words if len(item) >= 2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now create the Markov model:" ] }, { "cell_type": "code", "execution_count": 519, "metadata": { "collapsed": true }, "outputs": [], "source": [ "genesis_lines_model = markov_model_from_sequences(2, genesis_lines_words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And generate from it, picking a random \"start\" for each line:" ] 
}, { "cell_type": "code", "execution_count": 520, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "verse 0 - And the evening and the gathering together of the earth was without form, and void; and darkness was upon the earth, and to every beast of the sea, and over all the earth,\n", "verse 1 - And God made the stars also.\n", "verse 2 - And the evening and the lesser light to rule over the fowl of the air, and over the fish of the heaven be gathered together unto one place, and let them be for lights in the earth.\n", "verse 3 - And the Spirit of God created the heaven to give light upon the face of the heaven to give light upon the earth, and every tree, in the firmament of the heaven to divide the day from the waters.\n", "verse 4 - And God said, Let us make man in his own image, in the image of God moved upon the face of the heaven to give light upon the face of all the earth, and to every beast of the air, and over all the earth,\n", "verse 5 - And God called the dry land appear: and it was good.\n", "verse 6 - And the evening and the morning were the second day.\n", "verse 7 - And the Spirit of God created he him; male and female created he them.\n", "verse 8 - And God set them in the midst of the earth, and to every fowl of the air, and over the day from the waters which were above the firmament: and it was good.\n", "verse 9 - And God said, Let there be light: and there was light.\n" ] } ], "source": [ "for i in range(10):\n", " start = random.choice(genesis_starts)\n", " generated = gen_from_model(2, genesis_lines_model, start)\n", " print(\"verse\", i, \"-\", ' '.join(generated))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Putting it together\n", "\n", "The `markov_generate_from_sequences()` function below wraps up everything above into one function that takes an n-gram length, a list of sequences (e.g., a list of lists of words for a word-level Markov model, or a list of
strings for a character-level Markov model), and a number of lines to generate, and returns that many generated lines, starting the generation only with n-grams that begin lines in the source file:" ] }, { "cell_type": "code", "execution_count": 530, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def markov_generate_from_sequences(n, sequences, count, max_gen=100):\n", " starts = [item[:n] for item in sequences if len(item) >= n]\n", " model = markov_model_from_sequences(n, sequences)\n", " return [gen_from_model(n, model, random.choice(starts), max_gen)\n", " for i in range(count)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how to use this function to generate from a character-level Markov model of `frost.txt`:" ] }, { "cell_type": "code", "execution_count": 531, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I doubted if I should not travel both that the one as fair,\n", "And sorry I could ever come back.\n", "Then took the difference.\n", "I took the other, as just as for another, as just as for that morning equally about the other day!\n", "Then took the one less travel both that has made all the undergrowth;\n", "Had worn them really lay\n", "And looked down one less traveler, long I stood\n", "And that has made all the difference.\n", "Because it was grassy and wanted wear;\n", "Somewhere it was grassy and I—\n", "And that has made all the better claim,\n", "And be one as for another day!\n", "In leaves no step had trodden black.\n", "I doubted if I should ever come back.\n", "And looked down one as fair,\n", "In leaves no step had trodden black.\n", "And be one travelled by,\n", "To where ages and I—\n", "And that has made all the better claim,\n", "I shall be telling there\n" ] } ], "source": [ "frost_lines = [line.strip() for line in open(\"frost.txt\").readlines()]\n", "for item in markov_generate_from_sequences(5, frost_lines, 20):\n", " print(''.join(item))" ] }, { 
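One thing worth noticing here: the order you choose controls how closely character-level output tracks its source. The sketch below is self-contained; `build_model()` and `generate()` are miniature stand-ins for this notebook's `markov_model()` and `gen_from_model()`, and an inline tongue-twister stands in for `frost.txt`. A low order wanders freely, while a high order mostly replays runs of the original:

```python
import random

def build_model(n, seq):
    # Map each n-gram tuple to the list of items that follow it (None = end of text).
    model = {}
    for i in range(len(seq) - n + 1):
        gram = tuple(seq[i:i + n])
        model.setdefault(gram, []).append(seq[i + n] if i + n < len(seq) else None)
    return model

def generate(n, model, start, max_gen=100):
    output = list(start)
    for i in range(max_gen):
        next_item = random.choice(model[tuple(output[-n:])])
        if next_item is None:
            break
        output.append(next_item)
    return ''.join(output)

text = "she sells seashells by the seashore"
for n in (2, 6):
    print(n, '->', generate(n, build_model(n, text), text[:n]))
```

By construction, every (n+1)-character window of the output occurs somewhere in the source, which is why higher orders sound more and more like verbatim quotation.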
"cell_type": "markdown", "metadata": {}, "source": [ "And from a word-level Markov model of Shakespeare's sonnets:" ] }, { "cell_type": "code", "execution_count": 532, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The beast that bears the strong offence's cross.\n", "Like stones of worth they thinly placed are,\n", "Come in the breath that from my face she turns my foes,\n", "The sun itself sees not, till heaven clears.\n", "But rising at thy name doth point out thee,\n", "And captive good attending captain ill:\n", "The more I hear and see just cause of this excess,\n", "And life no longer than thy sins are;\n", "And thither hied, a sad slave, stay and think of nought\n", "Shall reasons find of settled gravity;\n", "How can I then did feel,\n", "Crooked eclipses 'gainst his glory fight,\n", "And you in me can nothing worthy prove;\n", "Perforce am thine, and born of thee:\n" ] } ], "source": [ "sonnets_words = [line.strip().split() for line in open(\"sonnets.txt\").readlines()]\n", "for item in markov_generate_from_sequences(2, sonnets_words, 14):\n", " print(' '.join(item))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A fun thing to do is combine *two* source texts and make a Markov model from the combination. 
So for example, read in the lines of both *The Road Not Taken* and `genesis.txt` and put them into the same list:" ] }, { "cell_type": "code", "execution_count": 556, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "And God said unto one travel both that may fly above the morning and for that moveth upon the firmament of the waters brought forth grassy and for meat.\n", "And both\n", "And God said, Let the fish of the better claim,\n", "Because it bent in the waters brought from the evening were the heaven be gathere be a firmament in the difference.\n", "And the earth bring forth grass, the heaven and the second day.\n", "And God said, Let there is life, I have dominion over all the air, and the second day.\n", "And looked down one lesser lights in the living created man in his kind, cattle, and over every good.\n", "In the first day.\n", "And sorry I could every good.\n", "And have given every fowl that hath life, and them.\n", "And God saw that the earth,\n", "Had worn the waters, and fill the darkness he called the heaven be gathering fruitful, and over every thing, Be fruitful, and replenish the day, and every \n", "And be one place, and over the darkness: and the undergrowth;\n", "Because it bent in the passing that creeping thing thing that it was in there be light from the earth,\n" ] } ], "source": [ "frost_lines = [line.strip() for line in open(\"frost.txt\").readlines()]\n", "genesis_lines = [line.strip() for line in open(\"genesis.txt\").readlines()]\n", "both_lines = frost_lines + genesis_lines\n", "for item in markov_generate_from_sequences(5, both_lines, 14, max_gen=150):\n", " print(''.join(item))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting text has properties of both of the underlying source texts! Whoa." 
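Why does combining the lists work? `markov_model_from_sequences()` simply folds every line into a single dictionary, so any n-gram that occurs in both texts ends up with continuations from both. Here's a hand-rolled miniature of that idea (order 1, word-level, with two inline one-line stand-ins for the files, and no end-of-text handling):

```python
# Two tiny one-line "corpora" standing in for frost.txt and genesis.txt.
frost = "two roads diverged in a yellow wood".split()
genesis = "in the beginning god created the heaven".split()

# Fold both word lists into one order-1 model, the same way
# markov_model_from_sequences() folds every line into one dictionary.
model = {}
for words in (frost, genesis):
    for i in range(len(words) - 1):
        model.setdefault((words[i],), []).append(words[i + 1])

print(model[('in',)])  # one continuation from each source text
```

A chain sitting at `in` can now continue with `a` (from the Frost line) or `the` (from the Genesis line), which is exactly why the blended output drifts between the two voices.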
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Putting it all *even more together*\n", "\n", "If you're really super lazy, the `markov_generate_from_lines_in_file()` function below does allll the work for you. It takes an n-gram length, an open filehandle to read from, the number of lines to generate, and the string `char` for a character-level Markov model and `word` for a word-level model. It returns the requested number of lines generated from a Markov model of the desired order and level." ] }, { "cell_type": "code", "execution_count": 543, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def markov_generate_from_lines_in_file(n, filehandle, count, level='char', max_gen=100):\n", " if level == 'char':\n", " glue = ''\n", " sequences = [item.strip() for item in filehandle.readlines()]\n", " elif level == 'word':\n", " glue = ' '\n", " sequences = [item.strip().split() for item in filehandle.readlines()]\n", " generated = markov_generate_from_sequences(n, sequences, count, max_gen)\n", " return [glue.join(item) for item in generated]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, for example, to generate twenty lines from an order-3 model of H.D.'s *Sea Rose*:" ] }, { "cell_type": "code", "execution_count": 544, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "single of leaf?\n", "single of petals,\n", "hardened and\n", "you are flung on a wet rose, harsh rose,\n", "you are flower, the crisp such acrid fragrance\n", "single on the sand,\n", "drip such acrisp sand,\n", "drip such acrisp such acrisp such acrisp sand,\n", "that drives in the drip sand\n", "meagrance\n", "more of petals,\n", "Stunted, with stem --\n", "you are on the caught in the sand\n", "spare precious\n", "meagrance\n", "more flung on a wet rose,\n", "you are flower, the with small leaf,\n", "meagrance\n", "marred in the caught in the spice-rose\n", "spare precious\n" ] } ], "source": [ "for item in 
markov_generate_from_lines_in_file(3, open(\"sea_rose.txt\"), 20, 'char'):\n", " print(item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or an order-3 word-level model of `genesis.txt`:" ] }, { "cell_type": "code", "execution_count": 545, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "And God created great whales, and every living creature that moveth, which the waters brought forth abundantly, after their kind, and every winged fowl after his kind: and God saw that it was good: and God divided the light from the darkness.\n", "\n", "And to rule over the day and over the fowl of the air, and to every fowl of the air, and over the cattle, and over all the earth, and to every fowl of the air, and to every thing that he had made, and, behold, it was very good. And the evening and the morning were the fifth day.\n", "\n", "In the beginning God created the heaven and the earth.\n", "\n", "And God made the firmament, and divided the waters which were above the firmament: and it was so.\n", "\n", "And God said, Behold, I have given every green herb for meat: and it was so.\n", "\n" ] } ], "source": [ "for item in markov_generate_from_lines_in_file(3, open(\"genesis.txt\"), 5, 'word'):\n", " print(item)\n", " print(\"\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }