{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Poetics of Grouping: Dictionaries of lists\n", "\n", "By [Allison Parrish](http://www.decontextualize.com/)\n", "\n", "Dictionaries are a powerful data structure used for a number of purposes in Python. In these notes, I show the basics of how dictionaries work and how they can be used to *group* words in a text. Then I show how the resulting dictionary data structure can be used to perform interesting \"cut-ups\" through replacement.\n", "\n", "## The procedural data-driven acrostic\n", "\n", "Let's start with a simple example: make a list of all words in a text file that begin with the letter `a`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a',\n", " 'as',\n", " 'as',\n", " 'as',\n", " 'as',\n", " 'and',\n", " 'as',\n", " 'about',\n", " 'another',\n", " 'a',\n", " 'ages',\n", " 'and',\n", " 'ages',\n", " 'a',\n", " 'and',\n", " 'all']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = open(\"frost.txt\").read().split()\n", "a_words = [item for item in words if item.startswith(\"a\")]\n", "a_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This might be a poetic artifact in itself, or it might be the start of a more elaborate project. You might, for example, use this information as a basis for computational stylistics (e.g., do 20th-century American poets use words beginning with the letter `a` more frequently than 19th-century British poets?). Or you might be attempting to create a computational model for poetic composition that uses this information, such as a program to generate [acrostics](https://en.wikipedia.org/wiki/Acrostic).\n", "\n", "Actually, let's continue with the goal of making a program to generate acrostics from a given text. 
Our acrostic procedure will start with a \"seed\" string, e.g., the word whose letters to use as the initial letter of each line, and a source text, which will be parsed into words. The poem will consist of a list of lines, one for each letter in the seed string. Each line will have one word, randomly chosen from a list of words beginning with the corresponding letter. So, for example, using *The Road Not Taken* as the source text, and `robertfrost` as the seed string, we might end up with output that looks like this:\n", "\n", " Really\n", " One\n", " Both\n", " Equally\n", " Roads\n", " Took\n", " For\n", " Really\n", " On\n", " Shall\n", " The\n", "\n", "(Most acrostics have more than one word on each line, which is fine. The procedure I'm proposing is a first step toward being able to make a more robust acrostic-generation algorithm.)\n", "\n", "### An unfortunate implementation\n", "\n", "How to implement this? Seems simple enough. We already know how to get a list of all words beginning with a particular letter. So hey, let's just copy-and-paste that code for each letter that we want! Something like this:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "words = open(\"frost.txt\").read().split()\n", "r_words = [item for item in words if item.startswith(\"r\")]\n", "o_words = [item for item in words if item.startswith(\"o\")]\n", "b_words = [item for item in words if item.startswith(\"b\")]\n", "e_words = [item for item in words if item.startswith(\"e\")]\n", "t_words = [item for item in words if item.startswith(\"t\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm. Just starting with the first few letters of `robertfrost`, it already feels like this is not quite the right way to do this. 
But let's see it through:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Roads\n", "Other,\n", "Both\n", "Ever\n", "Roads\n", "To\n" ] } ], "source": [ "import random\n", "print(random.choice(r_words).capitalize())\n", "print(random.choice(o_words).capitalize())\n", "print(random.choice(b_words).capitalize())\n", "print(random.choice(e_words).capitalize())\n", "print(random.choice(r_words).capitalize())\n", "print(random.choice(t_words).capitalize())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This works, and we could easily complete the program using this technique. There are a few problems with engineering the program this way, however. Let's imagine some scenarios:\n", "\n", "* I change my mind about the seed string and want to use, say, `difference` instead. Or `antidisestablishmentarianism`. To make the change, I essentially have to start from scratch and copy/paste all new lines of code.\n", "* \"Or,\" I hear you say, \"I could just make a list for *every letter in the alphabet*! That way I would just have the one chunk of code for making the lists that I could use for any text. CHECK MATE.\" True enough! But there are some drawbacks. You'd still have to copy/paste the lines to *print out* the acrostic (e.g. `print(random.choice(r_words))`). This is fine for short seed texts, but what if you wanted to generate an acrostic with a seed string of hundreds of characters? Thousands? Millions? That's a lot of copying and pasting.\n", "* \"Oh huh, good point,\" you say, and before you can pause, I continue! Which \"alphabet\" are you talking about? What if you want to write an acrostic in French or Hungarian or Turkish? (Or hiragana or devanagari for that matter!) You'd have to add in new lines of code for *every letter that you wanted to include in your data*. 
Fine for four or five letters, okay for twenty-six letters, extraordinarily inconvenient for thousands of letters.\n", "\n", "\"Okay,\" you say. \"I understand that you think this implementation is not optimal. But you want to be able to write one chunk of code that can be used to extract words that start with *any* arbitrary character, even when you don't know which characters will be needed when you're writing the program. And then you want some kind of magical... code... thing... that lets you get *back* all of the words that start with an arbitrary character. Surely what you propose is science fiction; surely this is impossible.\"\n", "\n", "> Note: Another way of thinking about this problem is *how do you get Python to have variables for all of the data that you want to store, when you're not quite sure what data will be used as input when you write the code?*\n", "\n", "### The dictionary\n", "\n", "In fact, it is possible! But to implement such a chunk of code, we need a new data structure. The appropriate data structure in this instance is called the *dictionary*. Dictionaries are also known as [maps, hashes or associative arrays in other programming languages](https://en.wikipedia.org/wiki/Associative_array).\n", "\n", "Before we get into the *why* of dictionaries, let's briefly look at the *how*. A dictionary in Python looks like this:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'Earth': 1.0, 'Mars': 1.523, 'Mercury': 0.387, 'Venus': 0.723}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{'Mercury': 0.387, 'Venus': 0.723, 'Earth': 1.0, 'Mars': 1.523}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is: a sequence of *key/value pairs*, with the key and value of each pair separated by a colon (`:`) and the pairs themselves separated by commas (`,`). 
All of the pairs are themselves surrounded by a pair of curly brackets (`{` and `}`). (In this case, the keys are the names of the planets, and the values are the planets' mean distances from the Sun as measured in [astronomical units](https://en.wikipedia.org/wiki/Astronomical_unit).)\n", "\n", "In other words, we might say that the key `Mercury` has the value `0.387`. The verb *map* is also sometimes used to refer to this relationship: e.g., the key `Mars` *maps* to the value `1.523`.\n", "\n", "A dictionary is just like any other Python value. You can assign it to a variable:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "planet_dist = {'Mercury': 0.387, 'Venus': 0.723, 'Earth': 1.0, 'Mars': 1.523}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that variable has a type:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(planet_dist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At its most basic level, a dictionary is sort of like a two-column spreadsheet, where the key is one column and the value is another column. If you were to represent the dictionary above as a spreadsheet, it might look like this:\n", "\n", "| key | value |\n", "|-----|-------|\n", "| Mercury | 0.387 |\n", "| Venus | 0.723 |\n", "| Earth | 1.0 |\n", "| Mars | 1.523 |\n", "\n", "The main difference between a spreadsheet and a dictionary is that dictionaries are unordered. (For an explanation of this, see below.) As with a spreadsheet, you can put different types of data into a dictionary.\n", "\n", "### Working with keys and values: the essentials\n", "\n", "The primary operation that we'll perform on dictionaries is writing an expression that evaluates to the value for a particular key. 
We do that with the same syntax we used to get a value at a particular index from a list, with a twist: when using a dictionary, instead of using a number, we use one of the keys that we specified when making the dictionary. For example, if you want to know how far Venus is from the sun (or, more precisely, the value for the key `Venus`), write the following expression:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.723" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planet_dist[\"Venus\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Going back to our spreadsheet analogy, this is like looking for the row whose first column is \"Venus\" and getting the value from the corresponding second column.\n", "\n", "If we put a key in those brackets that does not exist in the dictionary, we get an error similar to the one we get when trying to access an element beyond the end of a list:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "'Planet X'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mplanet_dist\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"Planet X\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mKeyError\u001b[0m: 'Planet X'" ] } ], "source": [ "planet_dist[\"Planet X\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `in` operator lets you check to see if a key is in a dictionary before attempting to retrieve its value:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"False" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Pluto\" in planet_dist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `not in` operator is the opposite of `in`: it returns `True` if the given key is *not* in the dictionary." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Pluto\" not in planet_dist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you might suspect, the thing you put inside the brackets doesn't have to be a string; it can be any Python expression, as long as it evaluates to something that is a key in the dictionary:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.387" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planet = 'Mercury'\n", "planet_dist[planet]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After a dictionary has been created, you might want to add new key/value pairs to it. You can do so with an assignment statement, putting the desired value to the right of the `=` and square bracket notation with the desired key to the left:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": true }, "outputs": [], "source": [ "planet_dist[\"Jupiter\"] = 5.2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common pattern when working with dictionaries is to use the current value as the basis for the new value that you want to store. 
For example, you could wreak havoc with the solar system by changing how far Jupiter is from the sun by running this code:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": true }, "outputs": [], "source": [ "planet_dist[\"Jupiter\"] = planet_dist[\"Jupiter\"] + 0.5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, more compactly:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": true }, "outputs": [], "source": [ "planet_dist[\"Jupiter\"] += 0.5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with any other kind of value, you can evaluate a dictionary to see its contents in Jupyter Notebook. In the cell below, I do just this so we can see the new key:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Earth': 1.0, 'Jupiter': 6.7, 'Mars': 1.523, 'Mercury': 0.387, 'Venus': 0.723}" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planet_dist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dictionaries with lists as values\n", "\n", "Keys in a dictionary don't have to be strings, and the values don't have to be floating-point numbers. 
For example, the key might be an integer and the value a string, as in the following example:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": true }, "outputs": [], "source": [ "number_words = {0: \"zero\", 1: \"one\", 2: \"two\", 3: \"three\", 4: \"four\", 5: \"five\", 10: \"ten\", 100: \"one hundred\"}" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{0: 'zero',\n", " 1: 'one',\n", " 2: 'two',\n", " 3: 'three',\n", " 4: 'four',\n", " 5: 'five',\n", " 10: 'ten',\n", " 100: 'one hundred'}" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "number_words" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I have five years left until retirement.\n" ] } ], "source": [ "print(\"I have \" + number_words[5] + \" years left until retirement.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, the values of a dictionary can be of any Python type. (The keys can be of any Python type, except [mutable data structures like dictionaries and lists](https://docs.python.org/3/glossary.html#term-hashable).) Often when talking about dictionaries, you'll hear them described in terms of what types their keys and values are.\n", "\n", "Going back to our acrostic example, the kind of dictionary that interests us here is a dictionary whose keys are strings and whose values are *lists* of strings. 
If you were to construct one of these by hand, you might write something like this:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": true }, "outputs": [], "source": [ "planet_moons = {\n", " 'Mercury': [],\n", " 'Venus': [],\n", " 'Earth': ['Moon'],\n", " 'Mars': ['Phobos', 'Deimos'],\n", " 'Jupiter': [\"Io\", \"Europa\", \"Ganymede\", \"Callisto\"]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a dictionary whose keys are the names of the five innermost planets and whose values are lists of the names of the moons of those planets. (Including only the four largest moons of Jupiter for brevity.) Retrieving the value for a key in this data structure will yield a value of type `list`. For example, to count the number of moons orbiting Mars:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(planet_moons['Mars'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the name of the second moon (index 1) listed as orbiting Mars:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Deimos'" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planet_moons[\"Mars\"][1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And to get a random moon of Jupiter:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Callisto'" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import random\n", "random.choice(planet_moons[\"Jupiter\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say that I change careers and become an astronomer. 
After several years of diligent search, using tiny variations in the rotation of Venus and pressure variations in the atmosphere that are most easily explained by tidal mechanics, I discover a new moon of Venus and name it after myself. My crowning moment of glory would be to add this newly discovered moon to the data structure that I made years before in my previous life as a lowly computer poet. I would do this like so:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": true }, "outputs": [], "source": [ "planet_moons[\"Venus\"].append(\"Allison\")" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Earth': ['Moon'],\n", " 'Jupiter': ['Io', 'Europa', 'Ganymede', 'Callisto'],\n", " 'Mars': ['Phobos', 'Deimos'],\n", " 'Mercury': [],\n", " 'Venus': ['Allison']}" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planet_moons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Glorious! It may not feel like it, but I've now shown you *everything you need to know* in order to create the computational acrostic code described above. Let's make it happen." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building a dictionary based on text\n", "\n", "With the dictionary as a starting point, think about how we need the data organized in order to produce the acrostic. We need to be able to store every word that begins with a particular letter, and we need to be able to get back a list of words that begin with any letter. The data structure we want to end up with might look something like this:\n", "\n", " {'a': ['as', 'above', 'about'],\n", " 'b': ['both', 'bent'],\n", " 'c': ['could', 'could', 'claim', 'come'],\n", " ...\n", " }\n", " \n", "(Using an ellipsis there to indicate that there would be more key/value pairs.)\n", "\n", "The task at hand consists of two parts: *analyzing* the text and *generating* the text. 
The goal of the analysis step is to create a dictionary whose keys are initial letters and whose values are *all of the words in the text that start with that letter*.\n", "\n", "To make this data structure, we're going to build the dictionary gradually, word by word, by looping over a list of words in the text. Here's what the code looks like. I've left comments in-line." ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "words = open(\"frost.txt\").read().split() # read in a text file, split into words\n", "initials = {} # create an empty dictionary\n", "\n", "for item in words: # run this code for every word in the text\n", " \n", " first_let = item[0] # first_let now has the first character of the string\n", " \n", " # check to see if the letter is already a key in the dictionary.\n", " # if not, add a new key/value pair with an empty list as the value.\n", " if first_let not in initials:\n", " initials[first_let] = []\n", " \n", " # append the current word to the list that is the value for this key\n", " initials[first_let].append(item)\n", " \n", " # uncomment line below to see debug output\n", " #print(item, first_let, initials[first_let])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what the data structure looks like when everything's done:" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'A': ['And', 'And', 'And', 'And', 'And', 'And'],\n", " 'B': ['Because'],\n", " 'H': ['Had'],\n", " 'I': ['I', 'I', 'I', 'In', 'I', 'I', 'I', 'I', 'I—', 'I'],\n", " 'O': ['Oh,'],\n", " 'S': ['Somewhere'],\n", " 'T': ['Two', 'To', 'Then', 'Though', 'Two'],\n", " 'Y': ['Yet'],\n", " 'a': ['a',\n", " 'as',\n", " 'as',\n", " 'as',\n", " 'as',\n", " 'and',\n", " 'as',\n", " 'about',\n", " 'another',\n", " 'a',\n", " 'ages',\n", " 'and',\n", " 'ages',\n", " 'a',\n", " 'and',\n", " 'all'],\n", " 'b': ['both', 'be', 'bent', 'better', 
'both', 'black.', 'back.', 'be', 'by,'],\n", " 'c': ['could', 'could', 'claim,', 'come'],\n", " 'd': ['diverged', 'down', 'day!', 'doubted', 'diverged', 'difference.'],\n", " 'e': ['equally', 'ever'],\n", " 'f': ['far', 'fair,', 'for', 'first', 'for'],\n", " 'g': ['grassy'],\n", " 'h': ['having', 'had', 'how', 'hence:', 'has'],\n", " 'i': ['in', 'it', 'in', 'it', 'if', 'in'],\n", " 'j': ['just'],\n", " 'k': ['kept', 'knowing'],\n", " 'l': ['long', 'looked', 'lay', 'leaves', 'leads', 'less'],\n", " 'm': ['morning', 'made'],\n", " 'n': ['not', 'no'],\n", " 'o': ['one', 'one', 'other,', 'on', 'one'],\n", " 'p': ['perhaps', 'passing'],\n", " 'r': ['roads', 'really', 'roads'],\n", " 's': ['sorry', 'stood', 'same,', 'step', 'should', 'shall', 'sigh'],\n", " 't': ['travel',\n", " 'traveler,',\n", " 'the',\n", " 'took',\n", " 'the',\n", " 'the',\n", " 'that',\n", " 'the',\n", " 'there',\n", " 'them',\n", " 'the',\n", " 'that',\n", " 'trodden',\n", " 'the',\n", " 'to',\n", " 'telling',\n", " 'this',\n", " 'took',\n", " 'the',\n", " 'travelled',\n", " 'that',\n", " 'the'],\n", " 'u': ['undergrowth;'],\n", " 'w': ['wood,',\n", " 'where',\n", " 'was',\n", " 'wanted',\n", " 'wear;',\n", " 'worn',\n", " 'way',\n", " 'way,',\n", " 'with',\n", " 'wood,'],\n", " 'y': ['yellow']}" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initials" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Challenge: Modify the code above so that it's case-insensitive (i.e., words starting with `I` are stored in the same list as words starting with `i`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating the acrostic\n", "\n", "We have the data structure that maps letters to lists of words that start with those letters." 
] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a',\n", " 'as',\n", " 'as',\n", " 'as',\n", " 'as',\n", " 'and',\n", " 'as',\n", " 'about',\n", " 'another',\n", " 'a',\n", " 'ages',\n", " 'and',\n", " 'ages',\n", " 'a',\n", " 'and',\n", " 'all']" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initials[\"a\"]" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['both', 'be', 'bent', 'better', 'both', 'black.', 'back.', 'be', 'by,']" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initials[\"b\"]" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['could', 'could', 'claim,', 'come']" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initials[\"c\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Picking a random item from one of these lists using `random.choice()`:" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'doubted'" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random.choice(initials[\"d\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing the acrostic text is now just a matter of a list comprehension:" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Roads\n", "One\n", "Better\n", "Equally\n", "Roads\n", "That\n", "For\n", "Roads\n", "One\n", "Step\n", "Travel\n" ] } ], "source": [ "acrostic = [random.choice(initials[let]).capitalize() for let in \"robertfrost\"]\n", "print(\"\\n\".join(acrostic))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, as a `for` loop:" ] }, { "cell_type": "code", 
"execution_count": 119, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Really\n", "One\n", "Be\n", "Equally\n", "Roads\n", "The\n", "For\n", "Roads\n", "One\n", "Sigh\n", "Took\n" ] } ], "source": [ "for let in \"robertfrost\":\n", " word = random.choice(initials[let])\n", " print(word.capitalize())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dealing with missing keys\n", "\n", "There are, of course, a number of letters in the alphabet not represented in our data. For example, there are no words starting with `z`. So if we tried to make an acrostic with the word `pizza`:" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Passing\n", "It\n" ] }, { "ename": "KeyError", "evalue": "'z'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mlet\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m\"pizza\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mword\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrandom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mchoice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minitials\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mlet\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcapitalize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 'z'" ] } ], "source": [ "for let in \"pizza\":\n", " word = random.choice(initials[let])\n", " 
print(word.capitalize())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get a `KeyError` because the dictionary does not contain the key `z`. Whoops! There's no real way to fix this problem without expanding the data that we're using, but we can at least write the code so that it's robust against these kinds of errors. There are two ways of doing this. The first is to check if the key is present, using the `in` operator:" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Passing\n", "It\n", ">>> WARNING! ACROSTIC FAILURE! WARNING\n", ">>> WARNING! ACROSTIC FAILURE! WARNING\n", "Another\n" ] } ], "source": [ "for let in \"pizza\":\n", " if let in initials:\n", " word = random.choice(initials[let])\n", " print(word.capitalize())\n", " else:\n", " print(\">>> WARNING! ACROSTIC FAILURE! WARNING\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second is to use the dictionary value's `.get()` method, which attempts to retrieve the value for the key given as the first parameter, and if that key is not found, returns the value given as the second parameter:" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Passing\n", "In\n", "???\n", "???\n", "A\n" ] } ], "source": [ "for let in \"pizza\":\n", " word = random.choice(initials.get(let, [\"???\"]))\n", " print(word.capitalize())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Default dictionaries\n", "\n", "In the analysis code above, we did a little dance: checking whether a key was present, and adding an empty list for it only if it wasn't. That dance is so common that the makers of Python have invented a way around it: the `defaultdict`. 
A `defaultdict` is like a regular dictionary, except that if you attempt to access a non-existent key (for example, in order to append to its value), it will automatically create that key with a default value. To use the `defaultdict` data structure, you first need to include and run the following line in your code:" ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import defaultdict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can create a `defaultdict` value by calling the `defaultdict()` function, with the default data type inside the parentheses. For example, to create a dictionary whose values default to lists:" ] }, { "cell_type": "code", "execution_count": 170, "metadata": { "collapsed": true }, "outputs": [], "source": [ "initials = defaultdict(list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the code for writing the acrostic text analyzer is a little bit simpler. You don't need to check for the presence of the key/value pair; you can just append to it *as though it already exists*:" ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "collapsed": true }, "outputs": [], "source": [ "words = open(\"frost.txt\").read().split()\n", "for item in words:\n", " first_let = item[0]\n", " initials[first_let].append(item)" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "defaultdict(list,\n", " {'A': ['And', 'And', 'And', 'And', 'And', 'And'],\n", " 'B': ['Because'],\n", " 'H': ['Had'],\n", " 'I': ['I', 'I', 'I', 'In', 'I', 'I', 'I', 'I', 'I—', 'I'],\n", " 'O': ['Oh,'],\n", " 'S': ['Somewhere'],\n", " 'T': ['Two', 'To', 'Then', 'Though', 'Two'],\n", " 'Y': ['Yet'],\n", " 'a': ['a',\n", " 'as',\n", " 'as',\n", " 'as',\n", " 'as',\n", " 'and',\n", " 'as',\n", " 'about',\n", " 'another',\n", " 'a',\n", " 'ages',\n", " 'and',\n", " 'ages',\n", " 'a',\n", " 'and',\n", " 'all'],\n", " 'b': ['both',\n", " 
'be',\n", " 'bent',\n", " 'better',\n", " 'both',\n", " 'black.',\n", " 'back.',\n", " 'be',\n", " 'by,'],\n", " 'c': ['could', 'could', 'claim,', 'come'],\n", " 'd': ['diverged',\n", " 'down',\n", " 'day!',\n", " 'doubted',\n", " 'diverged',\n", " 'difference.'],\n", " 'e': ['equally', 'ever'],\n", " 'f': ['far', 'fair,', 'for', 'first', 'for'],\n", " 'g': ['grassy'],\n", " 'h': ['having', 'had', 'how', 'hence:', 'has'],\n", " 'i': ['in', 'it', 'in', 'it', 'if', 'in'],\n", " 'j': ['just'],\n", " 'k': ['kept', 'knowing'],\n", " 'l': ['long', 'looked', 'lay', 'leaves', 'leads', 'less'],\n", " 'm': ['morning', 'made'],\n", " 'n': ['not', 'no'],\n", " 'o': ['one', 'one', 'other,', 'on', 'one'],\n", " 'p': ['perhaps', 'passing'],\n", " 'r': ['roads', 'really', 'roads'],\n", " 's': ['sorry',\n", " 'stood',\n", " 'same,',\n", " 'step',\n", " 'should',\n", " 'shall',\n", " 'sigh'],\n", " 't': ['travel',\n", " 'traveler,',\n", " 'the',\n", " 'took',\n", " 'the',\n", " 'the',\n", " 'that',\n", " 'the',\n", " 'there',\n", " 'them',\n", " 'the',\n", " 'that',\n", " 'trodden',\n", " 'the',\n", " 'to',\n", " 'telling',\n", " 'this',\n", " 'took',\n", " 'the',\n", " 'travelled',\n", " 'that',\n", " 'the'],\n", " 'u': ['undergrowth;'],\n", " 'w': ['wood,',\n", " 'where',\n", " 'was',\n", " 'wanted',\n", " 'wear;',\n", " 'worn',\n", " 'way',\n", " 'way,',\n", " 'with',\n", " 'wood,'],\n", " 'y': ['yellow']})" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initials" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Exercise: Use `defaultdict` to make a dictionary that has integer keys for word length, whose values are lists of words with that length. 
The dictionary that you end up with should look something like: `{1: ['a', 'I', ...], 2: ['in', 'be', 'as'...], 3: ['Two', 'And', 'not']...}`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lexical swaps: another application\n", "\n", "This particular data structure is good for more than just acrostics. Here's another possible use: take a text and replace each of its words with a different word that begins with the same letter. You can do this with the entire poem by reading it in like this:" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Though roads difference. in as yellow with And shall I come no there by, And be one the looked I sigh And looked doubted other, as fair, and I claim, Two where in be if there undergrowth; Two trodden that one and just and far And how perhaps that by, come Because it wanted grassy all was where Two a far the the perhaps telling Had wear; the really ages travel should And be the morning equally long I— leads not shall how the bent Oh, In kept the fair, for ages day! Yet kept having way lay one there wanted I down if I step equally could back. I— sorry black. this that wood, as stood Somewhere as ages another hence: Then roads diverged in a where another I I took took one leaves the better And the how made as travelled day!\n" ] } ], "source": [ "words = open(\"frost.txt\").read().split()\n", "print(' '.join([random.choice(initials[item[0]]) for item in words]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That expression—`[random.choice(initials[item[0]]) for item in words]`—is pretty complex. 
Let's break it down.\n", "\n", "* `[expr for item in words]`: a list comprehension that evaluates `expr` (a stand-in for any expression) for every element in list `words`, with the temporary variable `item`\n", "* `item[0]`: the first letter of the string in `item`\n", "* `initials[item[0]]`: the list of words beginning with that letter\n", "* `random.choice(initials[item[0]])`: a randomly-chosen element from that list\n", "* `[random.choice(initials[item[0]]) for item in words]`: a list of randomly-chosen words, each beginning with the same letter as the corresponding word in list `words`\n", "\n", "If you want to preserve the line breaks in the original poem, the easiest way is to write a `for` loop that reads each line of the text file as a string, like so:" ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Two roads difference. if all yellow wood,\n", "And sigh I come not took both\n", "And both other, trodden long I stood\n", "And less day! one as for and I come\n", "Though was in back. if the undergrowth;\n", "\n", "To to took one and just a far\n", "And having perhaps the be come\n", "Because in wood, grassy as worn wanted\n", "To all fair, took travelled passing took\n", "Had with took really a this should\n", "\n", "And by, the made ever looked\n", "I long no step had traveler, by,\n", "Oh, I kept that fair, far as difference.\n", "Yet kept how worn looked on the wood,\n", "I— difference. 
if I should ever claim, be\n", "\n", "I stood bent traveler, to with as same,\n", "Somewhere and as as having\n", "Then really diverged in a wanted and I\n", "I there the one leaves that both\n", "And traveler, hence: morning as the doubted\n" ] } ], "source": [ "for line in open(\"frost.txt\"):\n", "    words = line.split()\n", "    print(' '.join([random.choice(initials[item[0]]) for item in words]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lexical swaps across texts\n", "\n", "Of course, there's no reason that we have to do lexical replacement on the same text that we got the original words from! In the cell below, I make a separate model of words from *Sea Rose*:" ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [], "source": [ "sea_rose_init = defaultdict(list)\n", "words = open(\"sea_rose.txt\").read().split()\n", "for item in words:\n", "    first_let = item[0]\n", "    sea_rose_init[first_let].append(item)" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "defaultdict(list,\n", " {'-': ['--'],\n", " 'C': ['Can'],\n", " 'R': ['Rose,'],\n", " 'S': ['Stunted,'],\n", " 'a': ['and', 'a', 'a', 'are', 'are', 'are', 'acrid', 'a'],\n", " 'c': ['caught', 'crisp'],\n", " 'd': ['drift.', 'drives', 'drip'],\n", " 'f': ['flower,', 'flung', 'fragrance'],\n", " 'h': ['harsh', 'hardened'],\n", " 'i': ['in', 'in', 'in', 'in'],\n", " 'l': ['leaf,', 'leaf,', 'lifted', 'leaf?'],\n", " 'm': ['marred', 'meagre', 'more'],\n", " 'o': ['of', 'of', 'on', 'on'],\n", " 'p': ['petals,', 'precious'],\n", " 'r': ['rose,', 'rose'],\n", " 's': ['stint',\n", " 'spare',\n", " 'single',\n", " 'stem',\n", " 'small',\n", " 'sand,',\n", " 'sand',\n", " 'spice-rose',\n", " 'such'],\n", " 't': ['thin,', 'than', 'the', 'the', 'the', 'that', 'the', 'the'],\n", " 'w': ['with', 'wet', 'with', 'wind.'],\n", " 'y': ['you', 'you', 'you']})" ] }, "execution_count": 182, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "sea_rose_init" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of the code below is to rewrite each line of *The Road Not Taken*, replacing each word with a word that begins with the same letter from *Sea Rose*. The problem is that *The Road Not Taken* has words that begin with letters that aren't found as word-initial letters in *Sea Rose*! So we need a backup strategy. The strategy I chose below was to check to see if the first letter of each word was present in the dictionary. If it is, then get a random word starting with that letter. If it isn't, then just use the word from the source text." ] }, { "cell_type": "code", "execution_count": 186, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Two rose, drives in are you with\n", "And such I caught not the both\n", "And be of the leaf, I such\n", "And leaf, drives of are flung a I crisp\n", "To with in bent in the undergrowth;\n", "\n", "Then the the on and just and flung\n", "And harsh petals, that better crisp\n", "Because in wet grassy and with with\n", "Though are flung than than petals, that\n", "Had wet the rose, a than stem\n", "\n", "And both the meagre equally leaf?\n", "In leaf, no stint hardened thin, black.\n", "Oh, I kept thin, flower, flung a drip\n", "Yet knowing harsh wet leaf, of the wet\n", "I drives in I single ever caught back.\n", "\n", "I sand, be the the with and sand\n", "Stunted, a and are harsh\n", "Two rose, drift. in are with a I—\n", "I the than of leaf? 
the by,\n", "And the harsh more acrid the drip\n" ] } ], "source": [ "# for each line in the text file...\n", "for line in open(\"frost.txt\"):\n", "    words = line.split() # split the line into words\n", "    output = [] # the output for each line starts with an empty list\n", "    for item in words: # for each word in the line...\n", "        first_let = item[0]\n", "        # check if the first letter is in the dictionary\n", "        if first_let in sea_rose_init: # if we have alternatives for that letter...\n", "            # add a randomly-chosen word that starts with the same letter to the output\n", "            output.append(random.choice(sea_rose_init[first_let]))\n", "        else:\n", "            # otherwise, just use the word from the source text\n", "            output.append(item)\n", "        # uncomment line below to see how the list gets built, item by item\n", "        #print(line, item, output)\n", "    print(' '.join(output))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }