{ "metadata": { "name": "", "signature": "sha256:18bae689c1c86fbf3b7001b771cc4ffa62f85daa2d6909de697961a91a37b64f" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#String operations and regular expressions\n", "\n", "Today we talk about strings. When we have a string, we might want to ask whether it has particular characteristics---does it start with a particular character? Does it contain within it another string?---or try to extract smaller parts of the string, like the first fifteen characters, or say, the part of the string inside parentheses. Or we may want to transform the string into another string altogether, by (for example) converting its characters to upper case, or replacing substrings within it with other substrings. Today we discuss how to do these things in Python.\n", "\n", "##Simple string checks\n", "\n", "There are a number of functions, methods and operators that can tell us whether or not a Python string matches certain characteristics. 
Let's talk about the `in` operator first:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"foo\" in \"buffoon\"" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "True" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "\"foo\" in \"reginald\"" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "False" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `in` operator takes one expression evaluating to a string on the left and another on the right, and returns `True` if the string on the left occurs somewhere inside of the string on the right.\n", "\n", "We can check to see if a string begins with or ends with another string using that string's `.startswith()` and `.endswith()` methods, respectively:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"foodie\".startswith(\"foo\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "True" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "\"foodie\".endswith(\"foo\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "False" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.isdigit()` method returns `True` if the string is non-empty and every character in it is a digit, and `False` otherwise (so strings like `-5` and `4.5` give `False`, even though they look like numbers):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"foodie\".isdigit()\n", "print \"4567\".isdigit()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "False\n", "True\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the `.islower()` and `.isupper()` 
methods return `True` if the string is in all lower case or all upper case, respectively (and `False` otherwise)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"foodie\".islower()\n", "print \"foodie\".isupper()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "True\n", "False\n" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "print \"YELLING ON THE INTERNET\".islower()\n", "print \"YELLING ON THE INTERNET\".isupper()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "False\n", "True\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Finding substrings\n", "\n", "The `in` operator discussed above will tell us if a substring occurs in some other string. If we want to know *where* that substring occurs, we can use the `.find()` method. The `.find()` method takes a single parameter between its parentheses: an expression evaluating to a string, which will be searched for within the string whose `.find()` method was called. If the substring is found, the entire expression will evaluate to the index at which the substring is found. If the substring is not found, the expression evaluates to `-1`. 
To demonstrate:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"Now is the winter of our discontent\".find(\"win\")\n", "print \"Now is the winter of our discontent\".find(\"lose\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "11\n", "-1\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.count()` method will return the number of times a particular substring is found within the larger string:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"I got rhythm, I got music, I got my man, who could ask for anything more\".count(\"I got\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3\n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##String slices\n", "\n", "As has been alluded to previously, string slices work exactly like list slices---except you're getting characters from the string, instead of elements from a list. 
Observe:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "message = \"bungalow\"\n", "message[3]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "'g'" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "message[1:6]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "'ungal'" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "message[:3]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "'bun'" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "message[2:]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "'ngalow'" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "message[-2]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "'o'" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine this with the `find()` method and you can do things like write expressions that evaluate to everything from where a substring matches, up to the end of the string:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "shakespeare = \"Now is the winter of our discontent\"\n", "substr_index = shakespeare.find(\"win\")\n", "print shakespeare[substr_index:]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "winter of our discontent\n" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Simple string transformations\n", "\n", "Python strings have a number of different methods which, when called on a string, return a 
copy of that string with a simple transformation applied to it. These are helpful for normalizing and cleaning up data, or preparing it to be displayed.\n", "\n", "Let's start with `.lower()`, which evaluates to a copy of the string in all lower case:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"ARGUMENTATION! DISAGREEMENT! STRIFE!\".lower()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 28, "text": [ "'argumentation! disagreement! strife!'" ] } ], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The converse of `.lower()` is `.upper()`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"e.e. cummings is. not. happy about this.\".upper()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 32, "text": [ "'E.E. CUMMINGS IS. NOT. HAPPY ABOUT THIS.'" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The method `.title()` evaluates to a copy of the string it's called on, replacing every letter at the beginning of a word in the string with a capital letter:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"dr. strangelove, or, how I learned to love the bomb\".title()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 33, "text": [ "'Dr. 
Strangelove, Or, How I Learned To Love The Bomb'" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.strip()` method removes any whitespace from the beginning or end of the string (but not between characters later in the string):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\" got some random whitespace in some places here \".strip()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 30, "text": [ "'got some random whitespace in some places here'" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the `.replace()` method takes two parameters: a string to find, and a string to replace that string with whenever it's found. You can use this to make sad stories." ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"I got rhythm, I got music, I got my man, who could ask for anything more\".replace(\"I got\", \"I used to have\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 44, "text": [ "'I used to have rhythm, I used to have music, I used to have my man, who could ask for anything more'" ] } ], "prompt_number": 44 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###\"Escape\" sequences in strings\n", "\n", "Inside of strings that you type into your Python code, there are certain sequences of characters that have a special meaning. These sequences start with a backslash character (`\\`) and allow you to insert into your string characters that would otherwise be difficult to type, or that would go against Python syntax. 
Here's some code illustrating a few common sequences:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"include \\\"double quotes\\\" (inside of a double-quoted string)\"\n", "print 'include \\'single quotes\\' (inside of a single-quoted string)'\n", "print \"one\\ttab, two\\ttabs\"\n", "print \"new\\nline\"\n", "print \"include an actual backslash \\\\ (two backslashes in the string)\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "include \"double quotes\" (inside of a double-quoted string)\n", "include 'single quotes' (inside of a single-quoted string)\n", "one\ttab, two\ttabs\n", "new\n", "line\n", "include an actual backslash \\ (two backslashes in the string)\n" ] } ], "prompt_number": 113 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Regular expressions\n", "\n", "So far, we've discussed how to write programs and expressions that are able to check whether strings meet very simple criteria, such as \u201cdoes this string begin with a particular character\u201d or \u201cdoes this string contain another string\u201d? But imagine writing a program that performs the following task: find and print all ZIP codes in a string (i.e., a five-character sequence of digits). Give up? Here\u2019s my attempt, using only the tools we\u2019ve discussed so far:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "input_str = \"here's a zip code: 12345. 567 isn't a zip code, but 45678 is. 23456? 
yet another zip code.\"\n", "current = \"\"\n", "zips = []\n", "for ch in input_str:\n", " if ch in '0123456789':\n", " current += ch\n", " else:\n", " current = \"\"\n", " if len(current) == 5:\n", " zips.append(current)\n", " current = \"\"\n", "print zips" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['12345', '45678', '23456']\n" ] } ], "prompt_number": 38 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically, we have to iterate over each character in the string, check to see if that character is a digit, append to a string variable if so, continue reading characters until we reach a non-digit character, check to see if we found exactly five digit characters, and add it to a list if so. At the end, we print out the list that has all of our results. Problems with this code: it\u2019s messy; it doesn\u2019t overtly communicate what it\u2019s doing; it\u2019s not easily generalized to other, similar tasks (e.g., if we wanted to write a program that printed out phone numbers from a string, the code would likely look completely different).\n", "\n", "Our ancient UNIX pioneers had this problem, and in pursuit of a solution, thought to themselves, \"Let\u2019s make a tiny language that allows us to write specifications for textual patterns, and match those patterns against strings. 
No one will ever have to write fiddly code that checks strings character-by-character ever again.\" And thus regular expressions were born.\n", "\n", "Here's the code for accomplishing the same task with regular expressions, by the way:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "zips = re.findall(r\"\\d{5}\", input_str)\n", "print zips" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['12345', '45678', '23456']\n" ] } ], "prompt_number": 40 }, { "cell_type": "markdown", "metadata": {}, "source": [ "I\u2019ll allow that the `r\"\\d{5}\"` in there is mighty cryptic (though hopefully it won\u2019t be when you\u2019re done reading this page and/or participating in the associated lecture). But the overall structure of the program is much simpler.\n", "\n", "###Fetching our corpus\n", "\n", "For this section of class, we'll be using the subject lines of all e-mails in the [EnronSent corpus](http://verbs.colorado.edu/enronsent/), kindly put into the public domain by the United States Federal Energy Regulatory Commission. Download a copy into your notebook directory like so:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "urllib.urlretrieve(\"https://raw.githubusercontent.com/ledeprogram/courses/master/databases/data/enronsubjects.txt\", \"enronsubjects.txt\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "('enronsubjects.txt', )" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Matching strings with regular expressions\n", "\n", "The most basic operation that regular expressions perform is matching strings: you\u2019re asking the computer whether a particular string matches some description. 
We're going to be using regular expressions to print only those lines from our `enronsubjects.txt` corpus that match particular sequences. Let's load our corpus into a list of lines first:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "subjects = [x.strip() for x in open(\"enronsubjects.txt\").readlines()]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check whether or not a pattern matches a given string in Python with the `re.search()` function. The first parameter to `re.search()` is the regular expression you're trying to match; the second parameter is the string you're matching against.\n", "\n", "Here's an example, using a very simple regular expression. The following code prints out only those lines in our Enron corpus that match the (very simple) regular expression `shipping`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "[line for line in subjects if re.search(\"shipping\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 187, "text": [ "['FW: How to use UPS for shipping on the internet',\n", " 'FW: How to use UPS for shipping on the internet',\n", " 'How to use UPS for shipping on the internet',\n", " 'FW: How to use UPS for shipping on the internet',\n", " 'FW: How to use UPS for shipping on the internet',\n", " 'How to use UPS for shipping on the internet',\n", " 'lng shipping/mosk meeting in tokyo 2nd of feb',\n", " 'lng shipping/mosk meeting in tokyo 2nd of feb',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'lng shipping',\n", " 'lng shipping',\n", " 'lng shipping',\n", " 'Re: lng shipping',\n", " 'lng shipping']" ] } ], "prompt_number": 187 }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "At its simplest, a regular expression matches a string if that string contains exactly the characters you've specified in the regular expression. So the expression `shipping` matches strings that contain exactly the sequence `s`, `h`, `i`, `p`, `p`, `i`, `n`, `g` in a row. If the regular expression matches, `re.search()` evaluates to a value that counts as `True` and the matching line is included in the evaluation of the list comprehension.\n", "\n", "> BONUS TECH TIP: `re.search()` doesn't actually evaluate to `True` or `False`---it evaluates to either a `Match` object if a match is found, or `None` if no match was found. Those two count as `True` and `False` for the purposes of an `if` statement, though." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Metacharacters: character classes\n", "\n", "The \"shipping\" example is pretty boring. (There was hardly any fan fiction in there at all.) Let's go a bit deeper into detail with what you can do with regular expressions. There are certain characters or strings of characters that we can insert into a regular expression that have special meaning. 
For example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(\"sh.pping\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 101, "text": [ "['FW: How to use UPS for shipping on the internet',\n", " 'FW: How to use UPS for shipping on the internet',\n", " 'How to use UPS for shipping on the internet',\n", " 'FW: How to use UPS for shipping on the internet',\n", " 'FW: How to use UPS for shipping on the internet',\n", " 'How to use UPS for shipping on the internet',\n", " \"FW: We've been shopping!\",\n", " 'Re: Start shopping...',\n", " 'Start shopping...',\n", " 'lng shipping/mosk meeting in tokyo 2nd of feb',\n", " 'lng shipping/mosk meeting in tokyo 2nd of feb',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'Re: lng shipping',\n", " 'lng shipping',\n", " 'lng shipping',\n", " 'lng shipping',\n", " 'Re: lng shipping',\n", " 'lng shipping',\n", " 'FW: Online shopping',\n", " 'Online shopping']" ] } ], "prompt_number": 101 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a regular expression, the character `.` means \"match any character here.\" So, using the regular expression `sh.pping`, we get lines that match `shipping` but also `shopping`. The `.` is an example of a regular expression *metacharacter*---a character (or string of characters) that has a special meaning.\n", "\n", "Here are a few more metacharacters. 
These metacharacters allow you to say that a character belonging to a particular *class* of characters should be matched in a particular position:\n", "\n", "| metacharacter | meaning |\n", "|---------------|---------|\n", "| `.` | match any character (except a newline, by default) |\n", "| `\\w` | match any alphanumeric (\"*w*ord\") character (lowercase and capital letters, 0 through 9, underscore) |\n", "| `\\s` | match any whitespace character (e.g., space, tab, newline) |\n", "| `\\S` | match any non-whitespace character (the inverse of `\\s`) |\n", "| `\\d` | match any digit (0 through 9) |\n", "| `\\.` | match a literal `.` |\n", "\n", "Here, for example, is a (clearly imperfect) regular expression to search for all subject lines containing a time of day:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"\\d:\\d\\d\\wm\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 111, "text": [ "['RE: 3:17pm',\n", " '3:17pm',\n", " \"RE: It's On!!! - 2:00pm Today\",\n", " \"FW: It's On!!! - 2:00pm Today\",\n", " \"It's On!!! 
- 2:00pm Today\",\n", " 'Re: Registration Confirmation: Larry Summers on 12/6 at 1:45pm (was',\n", " 'Re: Conference Call today 2/9/01 at 11:15am PST',\n", " 'Conference Call today 2/9/01 at 11:15am PST',\n", " '5/24 1:00pm conference call.',\n", " '5/24 1:00pm conference call.',\n", " 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',\n", " 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',\n", " '07:33am EDT 15-Aug-01 Prudential Securities (C',\n", " \"Re: Updated Mar'00 Requirements Received at 11:25am from CES\",\n", " \"Re: Updated Mar'00 Requirements Received at 11:25am from CES\",\n", " \"Re: Updated Mar'00 Requirements Received at 11:25am from CES\",\n", " \"Updated Mar'00 Requirements Received at 11:25am from CES\",\n", " 'Reminder: Legal Team Meeting -- Friday, 9:00am Houston time',\n", " 'Thursday, March 7th 1:30-3:00pm: REORIENTATION',\n", " 'Meeting at 2:00pm Friday',\n", " 'Meeting at 2:00pm Friday',\n", " 'Fw: 12:30pm Deadline for changes to letters or contracts today',\n", " '12:30pm Deadline for changes to letters or contracts today',\n", " 'Johnathan actually resigned at 9:00am this morning',\n", " 'FW: Enron Conference Call Today, 11:00am CST',\n", " 'Enron Conference Call Today, 11:00am CST',\n", " 'Meeting, Wednesday, January 23 at 10:00am at the Houstonian',\n", " 'RE: TVA Meeting, Wednesday June13, 1:15pm, EB3125b',\n", " 'TVA Meeting, Wednesday June13, 1:15pm, EB3125b',\n", " 'Re: Dabhol Update: Conference Call Thursday, Dec. 28, 8:00am',\n", " 'Dabhol Update: Conference Call Thursday, Dec. 
28, 8:00am Houston time',\n", " 'FW: Victoria Ashley Jones Born 5/25/01 7:31am.',\n", " 'Fw: Victoria Ashley Jones Born 5/25/01 7:31am.',\n", " 'Victoria Ashley Jones Born 5/25/01 7:31am.',\n", " 'RE: Victoria Ashley Jones Born 5/25/01 7:31am.',\n", " 'Fw: Victoria Ashley Jones Born 5/25/01 7:31am.',\n", " 'Victoria Ashley Jones Born 5/25/01 7:31am.',\n", " 'RE: UCSF Cogen Calculation Conf Call, 10/12/01 at 8:00am PST',\n", " 'UCSF Cogen Calculation Conf Call, 10/12/01 at 8:00am PST',\n", " 'FW: Confirmation: UCSF Cogen Conf Call. 10/22/02 at 8:00am',\n", " '=09RE: Confirmation: UCSF Cogen Conf Call. 10/22/02 at 8:00am PST/=',\n", " '=09Confirmation: UCSF Cogen Conf Call. 10/22/02 at 8:00am PST/10:0=',\n", " 'RE: Confirmation: UCSF Cogen Conf Call. 10/22/02 at 8:00am',\n", " '=09Confirmation: UCSF Cogen Conf Call. 10/22/02 at 8:00am PST/10:0=',\n", " 'Re: March expenses - deadline 04-04-01 2:00pm',\n", " 'Cirque - Jan 24 5:00pm show']" ] } ], "prompt_number": 111 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's that regular expression again: `r\"\\d:\\d\\d\\wm\"`. I'm going to show you how to read this, one unit at a time.\n", "\n", "\"Hey, regular expression engine. Tell me if you can find this pattern in the current string. First of all, look for any digit (`\\d`). If you find that, look for a colon right after it (`:`). If you find that, look for another digit right after it (`\\d`), and then one more digit after that (`\\d`). If you find *those*, look for any alphanumeric character (`\\w`)---you know, a letter, a digit, an underscore. If you find that, then look for an `m`. Good? If you found all of those things in a row, then the pattern matched.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####But what about that weirdo `r\"\"`?\n", "\n", "Python provides another way to include string literals in your program, in addition to the single- and double-quoted strings we've already discussed. 
The r\"\" string literal, or \"raw\" string, includes all characters inside the quotes literally, without interpreting backslash escape sequences. Here's an example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"this is\\na test\"\n", "print r\"this is\\na test\"\n", "print \"I love \\\\ backslashes!\"\n", "print r\"I love \\ backslashes!\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "this is\n", "a test\n", "this is\\na test\n", "I love \\ backslashes!\n", "I love \\ backslashes!\n" ] } ], "prompt_number": 114 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, whereas a double- or single-quoted string literal interprets `\\n` as a newline character, the raw quoted string includes those characters as they were literally written. More important, for our purposes at least, is the fact that, in the raw quoted string, we only need to write one backslash in order to get a literal backslash in our string.\n", "\n", "Why is this important? Because regular expressions use backslashes all the time, and we don't want Python to try to interpret those backslashes as special characters. (Inside a regular string, we'd have to write a simple regular expression like `\\b\\w+\\b` as `\\\\b\\\\w+\\\\b`---yecch.)\n", "\n", "So the basic rule of thumb is this: use r\"\" to quote any regular expression in your program. All of the examples you'll see below will use this convention." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Character classes in-depth\n", "\n", "You can define your own character classes by enclosing a list of characters, or range of characters, inside square brackets:\n", "\n", "| regex | explanation |\n", "|-------|-------------|\n", "| `[aeiou]` | matches any vowel |\n", "| `[02468]` | matches any even digit |\n", "| `[a-z]` | matches any lower-case letter |\n", "| `[A-Z]` | matches any upper-case character |\n", "| `[^0-9]` | matches any non-digit (the ^ inverts the class, matches anything not in the list) |\n", "| `[Ee]` | matches either `E` or `e` |\n", "\n", "Let's find every subject line where we have four or more vowels in a row:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"[aeiou][aeiou][aeiou][aeiou]\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 121, "text": [ "['Re: Natural gas quote for Louiisiana-Pacific (L-P)',\n", " 'WooooooHoooooo more Vacation',\n", " 'Re: Clickpaper Counterparties waiting to clear the work queue',\n", " 'Gooooooooooood Bye!',\n", " 'Gooooooooooood Bye!',\n", " 'RE: Hello Sweeeeetie',\n", " 'Hello Sweeeeetie',\n", " 'FW: Waaasssaaaaabi !',\n", " 'FW: Waaasssaaaaabi !',\n", " 'FW: Waaasssaaaaabi !',\n", " 'FW: Waaasssaaaaabi !',\n", " 'Re: FW: Wasss Uuuuuup STG?',\n", " 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',\n", " 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',\n", " 'FW: The Osama Bin Laden Song ( Soooo Funny !! )',\n", " 'Fw: The Osama Bin Laden Song ( Soooo Funny !! )',\n", " 'The Osama Bin Laden Song ( Soooo Funny !! 
)',\n", " 'RE: duuuuhhhhh',\n", " 'RE: duuuuhhhhh',\n", " 'RE: duuuuhhhhh',\n", " 'duuuuhhhhh',\n", " 'RE: duuuuhhhhh',\n", " 'duuuuhhhhh',\n", " 'RE: FPL Queue positions 1-15',\n", " 'Re: FPL Queue positions 1-15',\n", " 'Re: Helloooooo!!!',\n", " 'Re: Helloooooo!!!',\n", " 'Fw: FW: OOOooooops',\n", " 'FW: FW: OOOooooops',\n", " 'Re: yeeeeha',\n", " 'yeeeeha',\n", " 'yahoooooooooooooooooooo',\n", " 'RE: yahoooooooooooooooooooo',\n", " 'RE: yahoooooooooooooooooooo',\n", " 'yahoooooooooooooooooooo',\n", " 'RE: I hate yahooooooooooooooo',\n", " 'I hate yahooooooooooooooo',\n", " 'RE: I hate yahooooooooooooooo',\n", " 'I hate yahooooooooooooooo',\n", " 'RE: I hate yahooooooooooooooo',\n", " 'I hate yahooooooooooooooo',\n", " 'RE: I hate yahooooooooooooooo',\n", " 'I hate yahooooooooooooooo',\n", " \"FW: duuuuuuuuuuuuuuuuude...........what's up?\",\n", " \"RE: duuuuuuuuuuuuuuuuude...........what's up?\",\n", " \"RE: duuuuuuuuuuuuuuuuude...........what's up?\",\n", " 'Re: skiiiiiiiiing',\n", " 'skiiiiiiiiing',\n", " 'scuba dooooooooooooo',\n", " 'RE: scuba dooooooooooooo',\n", " 'RE: scuba dooooooooooooo',\n", " 'scuba dooooooooooooo',\n", " 'Re: skiiiiiiiing',\n", " 'skiiiiiiiing',\n", " 'Re: skiiiiiiiing',\n", " 'Re: skiiiiiiiiing',\n", " \"RE: Clickpaper CP's awaiting migration in work queue's 06/27/01\",\n", " \"FW: Clickpaper CP's awaiting migration in work queue's 06/27/01\",\n", " \"Clickpaper CP's awaiting migration in work queue's 06/27/01\",\n", " 'RE: Sequoia Adv. Pro.: Draft Stipulation and Order',\n", " 'FW: Sequoia Adv. Pro.: Draft Stipulation and Order',\n", " 'Sequoia Adv. Pro.: Draft Stipulation and Order',\n", " 'Re: FW: Sequoia Adv. Pro.: Draft Stipulation and Order',\n", " 'FW: Sequoia Adv. Pro.: Draft Stipulation and Order',\n", " 'FW: Sequoia Adv. Pro.: Draft Stipulation and Order',\n", " 'Fw: Sequoia Adv. Pro.: Draft Stipulation and Order',\n", " 'Sequoia Adv. Pro.: Draft Stipulation and Order',\n", " 'Sequoia Adv. 
Pro.: Draft Stipulation and Order',\n", " 'i would have done this but i was toooo busy.....']" ] } ], "prompt_number": 121 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Metacharacters: anchors\n", "\n", "The next important kind of metacharacter is the *anchor*. An anchor doesn't match a character, but matches a particular place in a string.\n", "\n", "| anchor | meaning |\n", "|--------|---------|\n", "| `^` | match at beginning of string |\n", "| `$` | match at end of string |\n", "| `\\b` | match at word boundary |\n", "\n", "> Note: `^` in a character class has a different meaning from `^` outside a character class!\n", "\n", "> Note #2: If you want to search for a literal dollar sign (`$`), you need to put a backslash in front of it, like so: `\\$`\n", "\n", "Now we have enough regular expression knowledge to do some fairly sophisticated matching. As an example, all the subject lines that begin with the string `New York`, regardless of whether or not the initial letters were capitalized:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"^[Nn]ew [Yy]ork\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 127, "text": [ "['New York Details',\n", " 'New York Power Authority',\n", " 'New York Power Authority',\n", " 'New York Power Authority',\n", " 'New York Power Authority',\n", " 'New York',\n", " 'New York',\n", " 'New York',\n", " 'New York, etc.',\n", " 'New York, etc.',\n", " 'New York sites',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York',\n", " 'New York',\n", " 'New York City Marathon Guaranteed Entry',\n", " 'new york rest reviews',\n", " 'New York State Electric & Gas Corporation (\"NYSEG\")',\n", " 'New York State Electric & Gas Corporation (\"NYSEG\")',\n", " 'New York State Electric & Gas Corporation (\"NYSEG\")',\n", " 'New York State Electric & 
Gas (\"NYSEG\")',\n", " 'New York regulatory restriccions',\n", " 'New York regulatory restriccions',\n", " 'New York Bar Numbers']" ] } ], "prompt_number": 127 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every subject line that ends with an ellipsis (there are a lot of these, so I'm only displaying the first 30):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"\\.\\.\\.$\", line)][:30]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 69, "text": [ "['Re: Inquiry....',\n", " 'Re: Inquiry....',\n", " 'RE: the candidate we spoke about this morning...',\n", " 'the candidate we spoke about this morning...',\n", " 'RE: the candidate we spoke about this morning...',\n", " 'RE: the candidate we spoke about this morning...',\n", " 'RE: the candidate we spoke about this morning...',\n", " 'the candidate we spoke about this morning...',\n", " 'RE: the candidate we spoke about this morning...',\n", " 'RE: the candidate we spoke about this morning...',\n", " 'RE: the candidate we spoke about this morning...',\n", " 'the candidate we spoke about this morning...',\n", " 'Re: Hmmmmm........',\n", " 'Hmmmmm........',\n", " 'FW: Bumping into the husband....',\n", " 'FW: Bumping into the husband....',\n", " 'RE: try this one...',\n", " 'RE: try this one...',\n", " 'Re: try this one...',\n", " 'try this one...',\n", " 'RE: try this one...',\n", " 'RE: try this one...',\n", " 'Re: try this one...',\n", " 'try this one...',\n", " 'RE: try this one...',\n", " 'RE: try this one...',\n", " 'Re: try this one...',\n", " 'try this one...',\n", " 'RE: try this one...',\n", " 'RE: try this one...']" ] } ], "prompt_number": 69 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thirty subject lines containing the word \"oil\":" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"\\b[Oo]il\\b\", 
line)][:30]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 70, "text": [ "['Re: PIRA Global Oil and Natural Outlooks- Save these dates.',\n", " 'PIRA Global Oil and Natural Outlooks- Save these dates.',\n", " 'Re: PIRA Global Oil and Natural Outlooks- Save these dates.',\n", " '=09PIRA Global Oil and Natural Outlooks- Save these dates.',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',\n", " 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',\n", " 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',\n", " 'Cabot Oil & Gas Marketing Corp. 
- Amendment and Confirmations to',\n", " 'EOTT Crude Oil Tanks',\n", " 'Re: Oil Skim + \"Bugs\"',\n", " 'Oil Skim + \"Bugs\"',\n", " 'Oil Release Incident',\n", " 'Oil Release Incident',\n", " 'Oil Release Incident',\n", " 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation --',\n", " 'Location of the 2002 Institute on Oil & Gas Law & Taxation -- February, 2002',\n", " 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation --',\n", " 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation -- February, 2002',\n", " 'RE: Location of the 2002 Institute on Oil & Gas Law & Taxation',\n", " 'B & J Gas and Oil']" ] } ], "prompt_number": 70 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Metacharacters: quantifiers\n", "\n", "Above we had a regular expression that looked like this:\n", "\n", " [aeiou][aeiou][aeiou][aeiou]\n", " \n", "Typing out all of those things is kind of a pain. Fortunately, there\u2019s a way to specify how many times to match a particular character, using quantifiers. 
These affect the character (or character class) that immediately precedes them:\n", "\n", "| quantifier | meaning |\n", "|------------|---------|\n", "| `{n}` | match exactly n times |\n", "| `{n,m}` | match at least n times, but no more than m times |\n", "| `{n,}` | match at least n times |\n", "| `+` | match at least once (same as `{1,}`) |\n", "| `*` | match zero or more times |\n", "| `?` | match one time or zero times |\n", "\n", "For example, here's a regular expression that finds subjects containing at least fifteen capital letters in a row:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"[A-Z]{15,}\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 136, "text": [ "['CONGRATULATIONS!',\n", " 'CONGRATULATIONS!',\n", " 'Re: FW: Fw: Fw: Fw: Fw: Fw: Fw: PLEEEEEEEEEEEEEEEASE READ!',\n", " 'ACCOMPLISHMENTS',\n", " 'ACCOMPLISHMENTS',\n", " 'Re: FW: FORM: BILATERAL CONFIDENTIALITY AGREEMENT',\n", " 'FORM: BILATERAL CONFIDENTIALITY AGREEMENT',\n", " 'Re: CONGRATULATIONS!',\n", " 'CONGRATULATIONS!',\n", " 'Re: ORDER ACKNOWLEDGEMENT',\n", " 'ORDER ACKNOWLEDGEMENT',\n", " 'RE: CONGRATULATIONS',\n", " 'RE: CONGRATULATIONS',\n", " 'Re: CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'RE: CONGRATULATIONS',\n", " 'RE: CONGRATULATIONS',\n", " 'RE: CONGRATULATIONS',\n", " 'RE: CONGRATULATIONS',\n", " 'Re: CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'Re: VEPCO INTERCONNECTION AGREEMENT',\n", " 'VEPCO INTERCONNECTION AGREEMENT',\n", " 'Re: VEPCO INTERCONNECTION AGREEMENT',\n", " 'Re: VEPCO INTERCONNECTION AGREEMENT',\n", " 'VEPCO INTERCONNECTION AGREEMENT',\n", " 'Re: CONGRATULATIONS !',\n", " 'FW: WASSSAAAAAAAAAAAAAABI!',\n", " 'FW: WASSSAAAAAAAAAAAAAABI!',\n", " 'FW: WASSSAAAAAAAAAAAAAABI!',\n", " 'FW: WASSSAAAAAAAAAAAAAABI!',\n", " 'Re: FW: WASSSAAAAAAAAAAAAAABI!',\n", " 'FW: WASSSAAAAAAAAAAAAAABI!',\n", " 'FW: WASSSAAAAAAAAAAAAAABI!',\n", " 'RE: 
NOOOOOOOOOOOOOOOO',\n", " 'NOOOOOOOOOOOOOOOO',\n", " 'RE: NOOOOOOOOOOOOOOOO',\n", " 'CONGRATULATIONS!!!!!!!!!!!!!',\n", " 'RE: CONGRATULATIONS!!!!!!!!!!!!!',\n", " 'Re: CONGRATULATIONS!!!!!!!!!!!!!',\n", " 'CONGRATULATIONS',\n", " 'Re: CONFIDENTIALITY/CONFLICTS ISSUES MEETING',\n", " 'CONFIDENTIALITY/CONFLICTS ISSUES MEETING',\n", " 'GOALS AND ACCOMPLISHMENTS',\n", " 'ACCOMPLISHMENTS',\n", " 'Re: CONGRATULATIONS!',\n", " 'RE: STANDARDIZATION OF TANKER FREIGHT WORDING',\n", " 'RE: STANDARDIZATION OF TANKER FREIGHT WORDING',\n", " 'Re: STANDARDIZATION OF TANKER FREIGHT WORDING',\n", " 'STANDARDIZATION OF TANKER FREIGHT WORDING',\n", " 'BRRRRRRRRRRRRRRRRRRRRR',\n", " 'Re: CONGRATULATIONS !!!',\n", " 'CONGRATULATIONS !!!',\n", " 'RE: Mtg. to discuss assignment of customers. Transmission list: P/LEGAL/PROJECTNETCO/NETCOTRANSMISSION.XLS',\n", " 'RE: Mtg. to discuss assignment of customers. Transmission list: P/LEGAL/PROJECTNETCO/NETCOTRANSMISSION.XLS',\n", " 'Mtg. to discuss assignment of customers. 
Transmission list: P/LEGAL/PROJECTNETCO/NETCOTRANSMISSION.XLS',\n", " 'FW: NEW WEATHER SWAPS ON THE INTERCONTINENTAL EXCHANGE',\n", " 'NEW WEATHER SWAPS ON THE INTERCONTINENTAL EXCHANGE']" ] } ], "prompt_number": 136 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lines that contain five consecutive vowels:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"[aeiou]{5}\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 137, "text": [ "['WooooooHoooooo more Vacation',\n", " 'Gooooooooooood Bye!',\n", " 'Gooooooooooood Bye!',\n", " 'RE: Hello Sweeeeetie',\n", " 'Hello Sweeeeetie',\n", " 'FW: Waaasssaaaaabi !',\n", " 'FW: Waaasssaaaaabi !',\n", " 'FW: Waaasssaaaaabi !',\n", " 'FW: Waaasssaaaaabi !',\n", " 'Re: FW: Wasss Uuuuuup STG?',\n", " 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',\n", " 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',\n", " 'Re: Helloooooo!!!',\n", " 'Re: Helloooooo!!!',\n", " 'Fw: FW: OOOooooops',\n", " 'FW: FW: OOOooooops',\n", " 'yahoooooooooooooooooooo',\n", " 'RE: yahoooooooooooooooooooo',\n", " 'RE: yahoooooooooooooooooooo',\n", " 'yahoooooooooooooooooooo',\n", " 'RE: I hate yahooooooooooooooo',\n", " 'I hate yahooooooooooooooo',\n", " 'RE: I hate yahooooooooooooooo',\n", " 'I hate yahooooooooooooooo',\n", " 'RE: I hate yahooooooooooooooo',\n", " 'I hate yahooooooooooooooo',\n", " 'RE: I hate yahooooooooooooooo',\n", " 'I hate yahooooooooooooooo',\n", " \"FW: duuuuuuuuuuuuuuuuude...........what's up?\",\n", " \"RE: duuuuuuuuuuuuuuuuude...........what's up?\",\n", " \"RE: duuuuuuuuuuuuuuuuude...........what's up?\",\n", " 'Re: skiiiiiiiiing',\n", " 'skiiiiiiiiing',\n", " 'scuba dooooooooooooo',\n", " 'RE: scuba dooooooooooooo',\n", " 'RE: scuba dooooooooooooo',\n", " 'scuba dooooooooooooo',\n", " 'Re: skiiiiiiiing',\n", " 'skiiiiiiiing',\n", " 'Re: skiiiiiiiing',\n", " 'Re: skiiiiiiiiing']" ] } ], "prompt_number": 
137 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Count the number of lines that are e-mail forwards, regardless of whether the subject line begins with `Fw:`, `FW:` or `Fwd:` (the `d?` in the pattern matches only a lowercase `d`, so an all-caps `FWD:` prefix is not counted):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len([line for line in subjects if re.search(r\"^F[Ww]d?:\", line)])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 140, "text": [ "20159" ] } ], "prompt_number": 140 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lines that have the word `news` in them and end in an exclamation point:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"\\b[Nn]ews\\b.*!$\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 144, "text": [ "['RE: Christmas Party News!',\n", " 'FW: Christmas Party News!',\n", " 'Christmas Party News!',\n", " 'Good News!',\n", " 'Good News--Twice!',\n", " 'Re: VERY Interesting News!',\n", " 'Great News!',\n", " 'Re: Great News!',\n", " 'News Flash!',\n", " 'RE: News Flash!',\n", " 'RE: News Flash!',\n", " 'News Flash!',\n", " 'RE: Good News!',\n", " 'RE: Good News!',\n", " 'RE: Good News!',\n", " 'RE: Good News!',\n", " 'Good News!',\n", " 'RE: Good News!!!',\n", " 'Good News!!!',\n", " 'RE: Big News!',\n", " 'Big News!',\n", " 'Individual.com - News From a Friend!',\n", " 'Individual.com - News From a Friend!',\n", " 'Re: Individual.com - News From a Friend!',\n", " 'RE: We need news!',\n", " '=09We need news!',\n", " 'RE: Big News!',\n", " 'FW: Big News!',\n", " 'RE: Big News!',\n", " 'FW: Big News!',\n", " 'Big News!',\n", " 'FW: NW Wine News- Eroica, Sineann, Bergstrom, Hamacher, And more!',\n", " '=09NW Wine News- Eroica, Sineann, Bergstrom, Hamacher, And more!',\n", " 'RE: Good News!!!',\n", " 'Good News!!!',\n", " 'Re: Big News!',\n", " 'Big News!',\n", " 'RE: Good News!',\n", " 'Good News!']" ] 
} ], "prompt_number": 144 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Metacharacters: alternation\n", "\n", "One final bit of regular expression syntax: alternation.\n", "\n", "* `(?:x|y)`: match either x or y\n", "* `(?:x|y|z)`: match x, y or z\n", "* etc.\n", "\n", "So for example, if you wanted to count every subject line that begins with either `Re:` or `Fwd:`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len([line for line in subjects if re.search(r\"^(?:Re|Fwd):\", line)])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 174, "text": [ "39901" ] } ], "prompt_number": 174 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every subject line that mentions kinds of cats:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"\\b(?:[Cc]at|[Kk]itten|[Kk]itty)\\b\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 34, "text": [ "['Re: FW: cat attack',\n", " 'Re: FW: cat attack',\n", " 'Re: FW: cat attack',\n", " 'Re: FW: cat attack',\n", " 'Fw: Cat clip',\n", " 'Fw: Cat clip',\n", " 'FW: Cat clip',\n", " 'Re: Amazing Kitten',\n", " 'RE: How To Tell Which Cat Ate Your Drugs',\n", " 'FW: How To Tell Which Cat Ate Your Drugs',\n", " 'FW: How To Tell Which Cat Ate Your Drugs',\n", " \"FW: Fw: A cat's tale\",\n", " \"Fwd: Fw: A cat's tale\",\n", " 'Kim lost her cat this morning',\n", " 'Fw: cat clip............',\n", " 'Fw: cat clip............',\n", " 'Fw: cat clip............',\n", " 'cat clip............',\n", " 'Fw: cat clip............',\n", " 'Fw: cat clip............',\n", " 'Fw: cat clip............',\n", " 'cat clip............',\n", " 'Fw: cat clip............',\n", " 'Fw: cat clip............',\n", " 'Fw: cat clip............',\n", " 'cat clip............',\n", " 'Fw: cat clip............',\n", " 'Fw: cat clip............',\n", " 'Fw: 
cat clip............',\n", " 'cat clip............',\n", " 'Fw: cat clip............',\n", " 'Fw: cat clip............',\n", " 'Fw: cat clip............',\n", " 'cat clip............',\n", " 'kitty',\n", " 'Diary of a Cat',\n", " 'Diary of a Cat',\n", " 'Diary of a Cat',\n", " 'Diary of a Cat',\n", " 'Diary of a Cat',\n", " 'RE: Cat show?',\n", " 'Cat show?',\n", " 'RE: Cat show?',\n", " 'RE: Cat show?',\n", " 'RE: Cat show?',\n", " 'Cat show?']" ] } ], "prompt_number": 34 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Capturing what matches\n", "\n", "The `re.search()` function allows us to check *whether or not* a string matches a regular expression. Sometimes we want to find out not just whether the string matches, but also what, exactly, in the string matched. In other words, we want to *capture* whatever it was that matched.\n", "\n", "The easiest way to do this is with the `re.findall()` function, which takes a regular expression and a string to match it against, and returns a list of all parts of the string that the regular expression matched. Here's an example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "print re.findall(r\"\\b\\w{5}\\b\", \"alpha beta gamma delta epsilon zeta eta theta\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 154, "text": [ "['alpha', 'gamma', 'delta', 'theta']" ] } ], "prompt_number": 154 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The regular expression above, `\\b\\w{5}\\b`, means \"find me runs of exactly five word characters (letters, digits or underscores) between word boundaries\"---in other words, find me five-letter words. 
The `re.findall()` function returns a list of strings---telling us not just whether or not the string matched, but which parts of the string matched.\n", "\n", "For the following `re.findall()` examples, we'll be operating on the entire file of subject lines as a single string, instead of using a list comprehension for individual subject lines. Here's how to read in the entire file as one string, instead of as a list of strings:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_subjects = open(\"enronsubjects.txt\").read()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 72 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having done that, let's write a regular expression that finds all domain names ending in `.com`, `.net` or `.org` in the subject lines (displaying just the first thirty because the list is long):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "re.findall(r\"\\b\\w+\\.(?:com|net|org)\", all_subjects)[:30]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 73, "text": [ "['enron.com',\n", " 'enron.com',\n", " 'enron.com',\n", " 'enron.com',\n", " 'enron.com',\n", " 'enron.com',\n", " 'enron.com',\n", " 'enron.com',\n", " 'Forbes.com',\n", " 'Cortlandtwines.com',\n", " 'Cortlandtwines.com',\n", " 'Match.com',\n", " 'Amazon.com',\n", " 'Amazon.com',\n", " 'Ticketmaster.com',\n", " 'Ticketmaster.com',\n", " 'Concierge.com',\n", " 'Concierge.com',\n", " 'har.com',\n", " 'har.com',\n", " 'HoustonChronicle.com',\n", " 'HoustonChronicle.com',\n", " 'har.com',\n", " 'har.com',\n", " 'har.com',\n", " 'har.com',\n", " 'har.com',\n", " 'har.com',\n", " 'Concierge.com',\n", " 'Concierge.com']" ] } ], "prompt_number": 73 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every time the string `New York` is found, along with the word that comes directly afterward:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "re.findall(r\"New York \\b\\w+\\b\", 
all_subjects)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 161, "text": [ "['New York Details',\n", " 'New York Details',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York Times',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York on',\n", " 'New York Times',\n", " 'New York Times',\n", " 'New York Times',\n", " 'New York Times',\n", " 'New York Times',\n", " 'New York Times',\n", " 'New York Times',\n", " 'New York City',\n", " 'New York City',\n", " 'New York City',\n", " 'New York Power',\n", " 'New York Power',\n", " 'New York Power',\n", " 'New York Power',\n", " 'New York Power',\n", " 'New York Power',\n", " 'New York Power',\n", " 'New York Power',\n", " 'New York Mercantile',\n", " 'New York Mercantile',\n", " 'New York Branch',\n", " 'New York City',\n", " 'New York Energy',\n", " 'New York Energy',\n", " 'New York Energy',\n", " 'New York Energy',\n", " 'New York Energy',\n", " 'New York sites',\n", " 'New York sites',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York Hotel',\n", " 'New York City',\n", " 'New York City',\n", " 'New York City',\n", " 'New York City',\n", " 'New York voice',\n", " 'New York State',\n", " 'New York State',\n", " 'New York State',\n", " 'New York State',\n", " 'New York State',\n", " 'New York State',\n", " 'New York Inc',\n", " 'New York Office',\n", " 'New York Office',\n", " 'New York regulatory',\n", " 'New York regulatory',\n", " 'New York regulatory',\n", " 'New York 
regulatory',\n", " 'New York Bar',\n", " 'New York Bar']" ] } ], "prompt_number": 161 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And just to bring things full-circle, everything that looks like a zip code, sorted:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(re.findall(r\"\\b\\d{5}\\b\", all_subjects))[:30]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 74, "text": [ "['00003',\n", " '00003',\n", " '00003',\n", " '00003',\n", " '00003',\n", " '00003',\n", " '00003',\n", " '00003',\n", " '00003',\n", " '00010',\n", " '00010',\n", " '00458',\n", " '01003',\n", " '02177',\n", " '06716',\n", " '06736',\n", " '06736',\n", " '06752',\n", " '06752',\n", " '06752',\n", " '06752',\n", " '06752',\n", " '06980',\n", " '06980',\n", " '10000',\n", " '10000',\n", " '11111',\n", " '11111',\n", " '11111',\n", " '11111']" ] } ], "prompt_number": 74 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Full example: finding the dollar value of the Enron e-mail subject corpus\n", "\n", "Here's an example that combines our regular expression prowess with our ability to do smaller manipulations on strings. 
We want to find all dollar amounts in the subject lines, and then figure out what their sum is.\n", "\n", "To understand what we're working with, let's start by writing a list comprehension that finds every subject line containing a dollar sign (`$`):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[line for line in subjects if re.search(r\"\\$\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 164, "text": [ "['Re: APEA - $228,204 hit',\n", " 'Re: APEA - $228,204 hit',\n", " 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',\n", " 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',\n", " 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',\n", " 'DJ Cal-ISO Pays $10M To Avoid Rolling Blackouts Wed -Sources, DJ',\n", " 'Goldman Comment re: Enron issued this morning - Revised Price Target of $68/share',\n", " 'RE: Goldman Sachs $2.19 Natural GAs',\n", " 'Goldman Sachs $2.19 Natural GAs',\n", " 'RE: $25 million',\n", " '$25 million',\n", " 'RE: $25 million loan from EDf',\n", " '$25 million loan from EDf',\n", " 'RE: $25 million loan from EDf',\n", " 'RE: $25 million loan from EDf',\n", " 'RE: $25 million loan from EDf',\n", " '$25 million loan from EDf',\n", " 'RE: $25 million loan from EDf',\n", " 'RE: $25 million loan from EDf',\n", " 'RE: $25 million loan from EDf',\n", " 'RE: $25 million loan from EDf',\n", " 'RE: $25 million loan from EDf',\n", " '$25 million loan from EDf',\n", " 'A$M and its \"second tier\" status',\n", " 'A$M and its \"second tier\" status',\n", " 'A$M and its \"second tier\" status',\n", " 'UT/a$m business school and engineering school comparisons',\n", " 'Re: $',\n", " '$',\n", " 'Re: $',\n", " '$',\n", " '$$$$',\n", " 'FFL $$',\n", " 'RE: shipper imbal $$ collected',\n", " 'shipper imbal $$ collected',\n", " \"Oneok's Strangers Gas Payment $820,000\",\n", " \"Oneok's Strangers Gas Payment $820,000\",\n", 
" 'Another $40 Million?',\n", " 'FW: Entergy and FPL Group Agree to a $27 Billion Merger Of Equals',\n", " 'FW: Entergy and FPL Group Agree to a $27 Billion Merger Of Equals',\n", " 'Over $50 -- You made it happen!',\n", " 'Over $50 -- You made it happen!',\n", " 'FW: Co 0530 CINY 40781075 $5,356.46 FX Funding',\n", " 'Co 0530 CINY 40781075 $5,356.46 FX Funding',\n", " 'FW: Outstanding Young Alumni Travel Value to Amsterdam from $895',\n", " 'Outstanding Young Alumni Travel Value to Amsterdam from $895',\n", " 'RE: Modesto 7 MW COB deal @$19.3.',\n", " 'RE: Modesto 7 MW COB deal @$19.3.',\n", " 'Modesto 7 MW COB deal @$19.3.',\n", " 'Modesto 7 MW COB deal @$19.3.',\n", " 'RE: -$870K prior month adjustments',\n", " '-$870K prior month adjustments',\n", " 'RE: -$141,000 P&L hit on 8/13/01',\n", " '-$141,000 P&L hit on 8/13/01',\n", " '$$$',\n", " 'Re: DWR Stranded costs: $21 billion',\n", " 'CAISO cuts refund estimate to $6.1B from $8.9B',\n", " \"State's Power Purchases Costlier Than Projected Tab is $6 million a\",\n", " 'Fwd: Edison gets more time; Calif. may sell $14 bln bonds',\n", " 'Edison gets more time; Calif. 
may sell $14 bln bonds',\n", " 'Re: IDEA RE ISSUE OF UTILS IN CALIF WANTING $100 PRICE CAP',\n", " 'Back to $250 Cap in California',\n", " 'Energy Secretary Announces $350MM to Upgrade Path 15',\n", " 'RE: $.01 surcharge as \"tax\"',\n", " 'FW: $.01 surcharge as \"tax\"',\n", " 'FW: $.01 surcharge as \"tax\"',\n", " '$.01 surcharge as \"tax\"',\n", " \"California's $12.5 Bln Bond Sale May Be Salvaged, Official Says;\",\n", " \"RE: California's $12.5 Bln Bond Sale May Be Salvaged, Official\",\n", " \"RE: California's $12.5 Bln Bond Sale May Be Salvaged, Official Says; DWR Contract Renegotiation Is Key\",\n", " \"California's $12.5 Bln Bond Sale May Be Salvaged, Official Says; DWR Contract Renegotiation Is Key\",\n", " 'Re: Royal Bank of Canada - Wire ($2,529,352.58)',\n", " 'Free $10 Three Team Parlay',\n", " 'Blue Girl - $1.2MM option expires today - need to know whether to',\n", " 'Blue Girl - $1.2MM option expires today - need to know whether to',\n", " 'Blue Girl - $1.2MM option expires today - need to know whether to',\n", " 'Blue Girl - $1.2MM option expires today - need to know whether to',\n", " 'Blue Girl - $1.2MM option expires today - need to know whether to',\n", " 'Blue Girl - $1.2MM option expires today - need to know whether to',\n", " 'Blue Girl - $1.2MM option expires today - need to know whether to',\n", " 'FW: Economic Times article: FIs may take over Enron for $700-800m',\n", " 'FW: Economic Times article: FIs may take over Enron for $700-800m',\n", " 'FW: Economic Times article: FIs may take over Enron for $700-800m',\n", " 'Red Rock Delay $$ Impact',\n", " 'HandsFree Kits - $2',\n", " 'HandsFree Kits - $2',\n", " 'Re: The $10 you owe me',\n", " 'The $10 you owe me',\n", " 'RE: Enron files for Chapter 11 owing US$13B',\n", " 'Enron files for Chapter 11 owing US$13B',\n", " 'RE: $ allocation',\n", " '$ allocation',\n", " 'Re: Last chance: Save $100 on a future airline ticket',\n", " 'Re: ECS and the $500k reduction',\n", " 'Re: ECS and the $500k 
reduction',\n", " 'Re: ECS and the $500k reduction',\n", " 'Re: ECS and the $500k reduction',\n", " 'ECS and the $500k reduction',\n", " 'ECS and the $500k reduction',\n", " 'ECS and the $500k reduction',\n", " 'ECS and the $500k reduction',\n", " 'ECS and the $500k reduction',\n", " 'FW: Free Shipping & $1,300 in Savings',\n", " 'Free Shipping & $1,300 in Savings',\n", " 'RE: Free Shipping & $1,300 in Savings',\n", " 'RE: Free Shipping & $1,300 in Savings',\n", " 'FW: Free Shipping & $1,300 in Savings',\n", " 'Free Shipping & $1,300 in Savings',\n", " 'RE: Dynegy Is Mulling $2 Billion Investment In Enron in Possible',\n", " 'FW: Dynegy Is Mulling $2 Billion Investment In Enron in Possible \\tStep Toward Merger',\n", " 'FW: Dynegy Is Mulling $2 Billion Investment In Enron in Possible Step Toward Merger',\n", " 'Dynegy Is Mulling $2 Billion Investment In Enron in Possible Step Toward Merger',\n", " 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',\n", " 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',\n", " 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',\n", " 'Peoples Gas --> $5,000 Invoice for Summer-Winter Exchange 6-1-00 to',\n", " 'Re: short fall $971,443.11 for Wis Elect Power',\n", " 'Re: short fall $971,443.11 for Wis Elect Power',\n", " 'Re: short fall $971,443.11 for Wis Elect Power',\n", " 'Re: short fall $971,443.11 for Wis Elect Power',\n", " 'Re: short fall $971,443.11 for Wis Elect Power',\n", " 'short fall $971,443.11 for Wis Elect Power',\n", " 'RE: Q&A for NNG/TW Supported $1Billion Line of Credit',\n", " 'Q&A for NNG/TW Supported $1Billion Line of Credit',\n", " 'FW: Deals from $39 in our Las Vegas store!',\n", " '=09Deals from $39 in our Las Vegas store!',\n", " 'A trip worth $10,000 could be yours',\n", " 'A trip worth $10,000 could be yours',\n", " '142,000,000 Email Addresses for ONLY $149!!!!',\n", " \"Lou's $50,000\",\n", " \"Lou's $50,000\",\n", " \"Lou's $50,000\",\n", " 
'Summary of $ at Risk for Customs',\n", " 'Summary of $ at Risk for Customs',\n", " 'Summary of $ at Risk for Customs',\n", " \"Calling All Investors: The New Power Company's IPO Priced at $21\",\n", " \"Calling All Investors: The New Power Company's IPO Priced at $21 P=\",\n", " 'Fenosa and Enron to Invest $550 Million in Dominican Republic',\n", " \"Enron Brazil To Invest $455 Million In Gas Distribution '01-'04\",\n", " 'RE: $5 million for 90 days?- how quaint!',\n", " 'FW: $5 million for 90 days?- how quaint!',\n", " '$5 million for 90 days?- how quaint!',\n", " 'RE: Wind $7MM',\n", " 'RE: Wind $7MM',\n", " 'RE: Wind $7MM',\n", " 'Wind $7MM',\n", " 'RE: Wind $7MM',\n", " 'Wind $7MM',\n", " 'Re: Counting the Cal ISO Votes for a $100 Price Cap',\n", " 'RE: C$ swap between EIM/ENA',\n", " 'C$ swap between EIM/ENA',\n", " \"Re: Where's My $20\",\n", " \"Re: Where's My $20\",\n", " \"Re: Where's My $20\",\n", " \"Re: Where's My $20\",\n", " 'Re: $100',\n", " 'Re: $100',\n", " 'Re: $100',\n", " \"Re: Where's My $20\",\n", " \"Re: Where's My $20\",\n", " 'RE: Eric Schroeder has just sent you $29.75 with PayPal',\n", " 'Fw: Eric Schroeder has just sent you $29.75 with PayPal',\n", " 'Eric Schroeder has just sent you $29.75 with PayPal',\n", " 'RE: Eric Schroeder has just sent you $29.75 with PayPal',\n", " 'Fw: Eric Schroeder has just sent you $29.75 with PayPal',\n", " 'Eric Schroeder has just sent you $29.75 with PayPal',\n", " 'RE: What are you talking about $1600?',\n", " 'Re: What are you talking about $1600?',\n", " 'RE: What are you talking about $1600?',\n", " 'RE: What are you talking about $1600?',\n", " '=09Re: What are you talking about $1600?',\n", " 'What are you talking about $1600?',\n", " 'What are you talking about $1600?',\n", " 'FW: Enron Seeks $2 Billion Cash Infusion As It Faces an Escalating',\n", " 'FW: Enron Seeks $2 Billion Cash Infusion As It Faces an Escalating Fiscal Crisis',\n", " 'Enron Seeks $2 Billion Cash Infusion As It Faces an 
Escalating Fiscal Crisis',\n", " 'The new, correct price is $67,776,700',\n", " 'Re: Demar request for $2.7 mm to pay out the Skandinavian now',\n", " 'Re: Demar request for $2.7 mm to pay out the Skandinavian now',\n", " 'RE: Transactions exceeding $100mil',\n", " 'Our benefits are about $50 per month higher with UBS',\n", " 'RE: $9.6MM EOL Gas Daily Issue',\n", " '$9.6MM EOL Gas Daily Issue',\n", " 'FW: NEAL - ITIN ONLY/$212.50',\n", " 'FW: NEAL - ITIN ONLY/$212.50',\n", " 'NEAL - ITIN ONLY/$212.50',\n", " 'FW: NEAL - ITIN ONLY/$212.50',\n", " 'FW: NEAL - ITIN ONLY/$212.50',\n", " 'NEAL - ITIN ONLY/$212.50',\n", " 'FW: Duke $',\n", " 'Duke $',\n", " 'RE: Duke $',\n", " 'FW: Duke $',\n", " 'FW: Duke $',\n", " 'Duke $',\n", " '$$$$',\n", " '$$$$',\n", " 'RE: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'FW: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'RE: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'RE: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'FW: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'RE: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'RE: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'FW: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'RE: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'RE: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'FW: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'RE: Wire Detail for 10/25/01 wire for $195,209.95',\n", " 'FW: DYN($42/sh)/ENE($7/sh) Merger At Risk. - Simmons and Company',\n", " 'FW: DYN($42/sh)/ENE($7/sh) Merger At Risk. - Simmons and Company latest thoughts',\n", " 'FW: DYN($42/sh)/ENE($7/sh) Merger At Risk. 
- Simmons and Company latest thoughts',\n", " \"FW: Re-Allocaton of $'s\",\n", " \"RE: Re-Allocaton of $'s\",\n", " \"Re-Allocaton of $'s\",\n", " \"Re-Allocaton of $'s\",\n", " 'RE: Wind $7MM',\n", " 'FW: Wind $7MM',\n", " 'Wind $7MM',\n", " 'RE: $9.92????????????',\n", " '$9.92????????????',\n", " 'RE: Below $10',\n", " 'Below $10',\n", " 'FW: Comments on the Status of ENE ($16/sh).',\n", " 'FW: Comments on the Status of ENE ($16/sh).',\n", " 'FW: Comments on the Status of ENE ($16/sh).',\n", " 'Breaking News : Williams Ordered to Pay $8 Million Refund to',\n", " 'Breaking News : Williams Ordered to Pay $8 Million Refund to Cal-ISO',\n", " 'Coho $500mm lawsuit against Hicks Muse',\n", " 'Coho $500mm lawsuit against Hicks Muse',\n", " 'Coho $500mm lawsuit against Hicks Muse',\n", " 'Re: $$$$',\n", " '$$$$',\n", " 'Perd $',\n", " 'Re: $80 million',\n", " 'Re: $80 million',\n", " '$80 million',\n", " '$80 million',\n", " 'Re: $80 million',\n", " '$80 million',\n", " '$80 million',\n", " 'Re: Calif Atty Gen Offers $50M Reward In Pwr Supplier',\n", " 'Financial Disclosure of $1.2 Billion Equity Adjustment',\n", " 'ENE: Despite Bounce It Appears Cheap; Yet $102 Target Likely a Late',\n", " 'ENE: Despite Bounce It Appears Cheap; Yet $102 Target Likely a Late 2002 Event:',\n", " 'Is it worth $200?',\n", " 'RE: #@$ !!!!!!!!',\n", " '$#%:#@$ !!!!!!!!',\n", " 'RE: @%$$@!!!',\n", " '=09@%$$@!!!',\n", " 'Special Offer: Switch to ShareBuilder and Get $50!',\n", " 'Amendment to Enron Corp. $25 Million guaranty of Enron Credit Inc.',\n", " 'RE: Amendment to Enron Corp. $25 Million guaranty of Enron Credit',\n", " 'Goldman Sach $ repo docs',\n", " 'Re: Goldman Sach $ repo docs',\n", " 'RE: Amendment to Enron Corp. 
$25 Million guaranty of Enron Credit',\n", " 'FW: Goldmans $1.5m',\n", " 'Goldmans $1.5m',\n", " 'FW: $1.5 Check',\n", " '$1.5 Check',\n", " 'RE: Goldman Sachs $',\n", " 'Goldman Sachs $',\n", " 'RE: TODAY ONLY - SAVE UP TO $120 EXTRA ON AIRLINE TICKETS!',\n", " 'RE: TODAY ONLY - SAVE UP TO $120 EXTRA ON AIRLINE TICKETS!',\n", " 'RE: $.01 surcharge as \"tax\"',\n", " 'RE: $.01 surcharge as \"tax\"',\n", " 'RE: $.01 surcharge as \"tax\"',\n", " 'FW: $.01 surcharge as \"tax\"',\n", " 'FW: $.01 surcharge as \"tax\"',\n", " '$.01 surcharge as \"tax\"',\n", " \"FW: PennFuture's E-Cubed - The $45 Million Rip Off\",\n", " \"=09PennFuture's E-Cubed - The $45 Million Rip Off\",\n", " 'RE: PaPUC assessment of $147,000 to Enron',\n", " 'Re: PaPUC assessment of $147,000 to Enron',\n", " 'PaPUC assessment of $147,000 to Enron',\n", " \"RE: ASAP!! EES' objections to PaPUC assessment of $147,000\",\n", " \"ASAP!! EES' objections to PaPUC assessment of $147,000\",\n", " 'RE: Pennsylvania $147,000 EES Assessment',\n", " '=09Pennsylvania $147,000 EES Assessment',\n", " 'FW: CAEM Study: Gas Dereg Has Saved Consumers $600B',\n", " 'CAEM Study: Gas Dereg Has Saved Consumers $600B',\n", " 'PaPUC assessment of $147,000 to Enron',\n", " \"RE: ASAP!! EES' objections to PaPUC assessment of $147,000\",\n", " \"ASAP!! 
EES' objections to PaPUC assessment of $147,000\",\n", " 'FW: Energy Novice to Be Paid $240,000',\n", " 'Energy Novice to Be Paid $240,000',\n", " 'RE: $22.8 schedule C for BPA deal',\n", " '$22.8 schedule C for BPA deal',\n", " '$22.8 schedule C for BPA deal',\n", " 'origination $100k to Laird Dyer',\n", " 'Cd$ CME letter',\n", " 'Cd$ CME letter',\n", " '$',\n", " 'RE: $',\n", " 'RE: $',\n", " 'Re: $',\n", " 'RE: $',\n", " 'GET RICH ON $6.00 !!!',\n", " 'RE: Thoughts on the world of energy (OSX $77, XNG $183, XOI 496)',\n", " 'FW: Letter of Credit $ 5,500,000 in support of Transwestern',\n", " 'Letter of Credit $ 5,500,000 in support of Transwestern Pipeline Red Rock Expansion',\n", " 'Letter of Credit $ 5,500,000 in support of Transwestern Pipeline Red Rock Expansion',\n", " 'FW: shipper imbal $$ collected',\n", " 'shipper imbal $$ collected',\n", " 'FW: shipper imbal $$ collected',\n", " 'RE: shipper imbal $$ collected',\n", " 'shipper imbal $$ collected',\n", " 'FW: shipper imbal $$ collected',\n", " 'RE: shipper imbal $$ collected',\n", " 'shipper imbal $$ collected',\n", " \"FW: $$'s allocated to TW\",\n", " \"$$'s allocated to TW\",\n", " 'RE: email to USG confirming our decision not to require more LOC $',\n", " 'email to USG confirming our decision not to require more LOC $',\n", " '$',\n", " 'Re: Calpine Confirms $4.6B, 10-Yr Calif. 
Power Sales',\n", " 'RE: $2.15 bn Enron Metals Inventory Financings Closed',\n", " 'RE: $2.15 bn Enron Metals Inventory Financings Closed',\n", " 'FW: Thayer Aerospace Awarded $130 Million Vought Aircraft Contract',\n", " 'FW: Thayer Aerospace Awarded $130 Million Vought Aircraft Contract',\n", " 'Thayer Aerospace Awarded $130 Million Vought Aircraft Contract to',\n", " 're: mid-columbia $1 mm Schedule E difference',\n", " '$0.25 scheduling fee.',\n", " 'MPC $',\n", " 'RE: $10',\n", " 'RE: $10',\n", " 'RE: $10',\n", " '$10',\n", " 'RE: $10',\n", " '$10',\n", " 'FW: *** Gold/TSE GL/$US/CPI/TSE MM/CRB Bloomberg charts ***',\n", " 'FW: *** Gold/TSE GL/$US/CPI/TSE MM/CRB Bloomberg charts ***',\n", " 'FW: *** Gold/TSE GL/$US/CPI/TSE MM/CRB Bloomberg charts ***',\n", " 'FW: Summer Fare Sale From $128 Return!',\n", " 'Summer Fare Sale From $128 Return!']" ] } ], "prompt_number": 164 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on this data, we can work out the steps we'd need to take in order to extract these values. We're going to ignore as chump change anything that doesn't have \"k\", \"million\" or \"billion\" after it. So what we need to find is: a dollar sign, followed by one or more digits or periods, optionally followed by a space, followed by a \"k\", \"m\" or \"b\" (which will sometimes start the word \"million\" or \"billion\" and sometimes not, so we won't bother checking).\n", "\n", "Here's how I would translate that into a regular expression:\n", "\n", "    \\$[0-9.]+ ?(?:[Kk]|[Mm]|[Bb])\n", "\n", "The `(?:...)` is a non-capturing group: it lets us write the alternation without changing what `re.findall()` returns. We can use `re.findall()` to collect every place where this regular expression matches in the text. 
Here's what that would look like:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "re.findall(r\"\\$[0-9.]+ ?(?:[Kk]|[Mm]|[Bb])\", all_subjects)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 182, "text": [ "['$10M',\n", " '$10M',\n", " '$10M',\n", " '$10M',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$25 m',\n", " '$40 M',\n", " '$27 B',\n", " '$27 B',\n", " '$870K',\n", " '$870K',\n", " '$21 b',\n", " '$6.1B',\n", " '$8.9B',\n", " '$6 m',\n", " '$14 b',\n", " '$14 b',\n", " '$350M',\n", " '$12.5 B',\n", " '$12.5 B',\n", " '$12.5 B',\n", " '$12.5 B',\n", " '$1.2M',\n", " '$1.2M',\n", " '$1.2M',\n", " '$1.2M',\n", " '$1.2M',\n", " '$1.2M',\n", " '$1.2M',\n", " '$13B',\n", " '$13B',\n", " '$500k',\n", " '$500k',\n", " '$500k',\n", " '$500k',\n", " '$500k',\n", " '$500k',\n", " '$500k',\n", " '$500k',\n", " '$500k',\n", " '$2 B',\n", " '$2 B',\n", " '$2 B',\n", " '$2 B',\n", " '$1B',\n", " '$1B',\n", " '$550 M',\n", " '$455 M',\n", " '$5 m',\n", " '$5 m',\n", " '$5 m',\n", " '$7M',\n", " '$7M',\n", " '$7M',\n", " '$7M',\n", " '$7M',\n", " '$7M',\n", " '$2 B',\n", " '$2 B',\n", " '$2 B',\n", " '$2.7 m',\n", " '$2.7 m',\n", " '$100m',\n", " '$9.6M',\n", " '$9.6M',\n", " '$7M',\n", " '$7M',\n", " '$7M',\n", " '$8 M',\n", " '$8 M',\n", " '$500m',\n", " '$500m',\n", " '$500m',\n", " '$80 m',\n", " '$80 m',\n", " '$80 m',\n", " '$80 m',\n", " '$80 m',\n", " '$80 m',\n", " '$80 m',\n", " '$50M',\n", " '$1.2 B',\n", " '$25 M',\n", " '$25 M',\n", " '$25 M',\n", " '$1.5m',\n", " '$1.5m',\n", " '$45 M',\n", " '$45 M',\n", " '$600B',\n", " '$600B',\n", " '$100k',\n", " '$4.6B',\n", " '$2.15 b',\n", " '$2.15 b',\n", " '$130 M',\n", " '$130 M',\n", " '$130 M',\n", " '$1 m']" ] } ], "prompt_number": 182 }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "If we want to actually compute a sum, though, we're going to need to do a little massaging: strip off the dollar sign, read the multiplier from the last character, and convert what's left to a number. Note that the pattern in the cell below uses `\\d+` rather than `[0-9.]+`, so amounts containing a decimal point (such as `$6.1B`) are left out of the total." ] }, { "cell_type": "code", "collapsed": false, "input": [ "total_value = 0\n", "dollar_amounts = re.findall(r\"\\$\\d+ ?(?:[Kk]|[Mm]|[Bb])\", all_subjects)\n", "for amount in dollar_amounts:\n", "    # the last character will be 'k', 'm', or 'b'; \"normalize\" by making lowercase.\n", "    multiplier = amount[-1].lower()\n", "    # trim off the beginning $ and ending multiplier value\n", "    amount = amount[1:-1]\n", "    # remove any remaining whitespace\n", "    amount = amount.strip()\n", "    # convert to a floating-point number\n", "    float_amount = float(amount)\n", "    # multiply by an amount, based on what the last character was\n", "    if multiplier == 'k':\n", "        float_amount = float_amount * 1000\n", "    elif multiplier == 'm':\n", "        float_amount = float_amount * 1000000\n", "    elif multiplier == 'b':\n", "        float_amount = float_amount * 1000000000\n", "    # add to total value\n", "    total_value = total_value + float_amount\n", "\n", "print total_value" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1.34965734e+12\n" ] } ], "prompt_number": 183 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number is so big that Python prints it in scientific notation! If we convert it to an integer, we get around that problem:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print int(total_value)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1349657340000\n" ] } ], "prompt_number": 184 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's over one trillion dollars! Nice work, guys." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Finer-grained matches with grouping\n", "\n", "We used `re.search()` above to check whether or not a string matches a particular regular expression, in a context like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "dickens = [\n", "    \"it was the best of times\",\n", "    \"it was the worst of times\"]\n", "[line for line in dickens if re.search(r\"best\", line)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "['it was the best of times']" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But `re.search()` doesn't actually return `True` or `False`. If the search succeeds, the function returns something called a \"match object\"; if it fails, it returns `None`. (A match object counts as true and `None` counts as false, which is why the `if` test above works.) Let's assign the result of `re.search()` to a variable and see what we can do with it." ] }, { "cell_type": "code", "collapsed": false, "input": [ "source_string = \"this example has been used 423 times\"\n", "match = re.search(r\"\\d\\d\\d\", source_string)\n", "type(match)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "_sre.SRE_Match" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's a value of type `_sre.SRE_Match`. This value has several methods that we can use to access helpful and interesting information about the way the regular expression matched the string. [Read more about the methods of the match object here](https://docs.python.org/2/library/re.html#match-objects)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, we can see both where the match *started* in the string and where it *ended*, using the `.start()` and `.end()` methods. These methods return the indexes in the string where the match begins and where it ends (the end index is one past the last matched character)." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "print match.start()\n", "print match.end()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "27\n", "30\n" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Together, we can use these methods to grab exactly the part of the string that matched the regular expression, by using the start/end values to get a slice:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "source_string[match.start():match.end()]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "'423'" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because it's so common, there's a shortcut for this operation, which is the match object's `.group()` method:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "match.group()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "'423'" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.group()` method of a match object, in other words, returns exactly the part of the string that matched the regular expression.\n", "\n", "As an example of how to use the match object and its `.group()` method in context, let's revisit the example from above which found every subject line in the Enron corpus that had fifteen or more consecutive capital letters. In that example, we could only display the *entire subject line*. 
If we wanted to show just the part of the string that matched (i.e., the sequence of fifteen or more capital letters), we could use `.group()`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for line in subjects:\n", "    match = re.search(r\"[A-Z]{15,}\", line)\n", "    if match:\n", "        print match.group()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CONGRATULATIONS\n", "CONGRATULATIONS\n", "PLEEEEEEEEEEEEEEEASE\n", "ACCOMPLISHMENTS\n", "ACCOMPLISHMENTS\n", "CONFIDENTIALITY\n", "CONFIDENTIALITY\n", "CONGRATULATIONS" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "CONGRATULATIONS\n", "ACKNOWLEDGEMENT\n", "ACKNOWLEDGEMENT\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "INTERCONNECTION\n", "INTERCONNECTION\n", "INTERCONNECTION\n", "INTERCONNECTION\n", "INTERCONNECTION\n", "CONGRATULATIONS" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "WASSSAAAAAAAAAAAAAABI\n", "WASSSAAAAAAAAAAAAAABI\n", "WASSSAAAAAAAAAAAAAABI\n", "WASSSAAAAAAAAAAAAAABI\n", "WASSSAAAAAAAAAAAAAABI\n", "WASSSAAAAAAAAAAAAAABI\n", "WASSSAAAAAAAAAAAAAABI\n", "NOOOOOOOOOOOOOOOO\n", "NOOOOOOOOOOOOOOOO\n", "NOOOOOOOOOOOOOOOO\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS\n", "CONGRATULATIONS" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "CONFIDENTIALITY\n", "CONFIDENTIALITY\n", "ACCOMPLISHMENTS\n", "ACCOMPLISHMENTS\n", "CONGRATULATIONS\n", "STANDARDIZATION\n", "STANDARDIZATION\n", "STANDARDIZATION\n", "STANDARDIZATION\n", "BRRRRRRRRRRRRRRRRRRRRR\n", "CONGRATULATIONS" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "CONGRATULATIONS\n", "NETCOTRANSMISSION\n", "NETCOTRANSMISSION\n", "NETCOTRANSMISSION\n", "INTERCONTINENTAL\n", "INTERCONTINENTAL\n" ] } ], "prompt_number": 25 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "An important thing to remember about `re.search()` is that it returns `None` if there is no match. For this reason, you always need to check to make sure the object is *not* `None` before you attempt to call the value's `.group()` method. This is the reason that it's difficult to write the above example as a list comprehension---you need to check the result of `re.search()` before you can use it. An attempt to do something like this, for example, will fail:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[re.search(r\"[A-Z]{15,}\", line).group() for line in subjects]" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "AttributeError", "evalue": "'NoneType' object has no attribute 'group'", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;34m[\u001b[0m\u001b[0mre\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msearch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mr\"[A-Z]{15,}\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mline\u001b[0m \u001b[0;32min\u001b[0m \u001b[0msubjects\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'group'" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python complains that `NoneType` has no `group()` method. 
This happens because `re.search()` returns `None` for every subject line that doesn't contain a match.\n", "\n", "We could, of course, write a little function to get around this limitation:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# make a function\n", "def filter_and_group(source, regex):\n", "    # keep only the items that match, then take the matching part of each\n", "    # (this runs the search twice per item: once to filter, once to extract)\n", "    return [re.search(regex, item).group() for item in source if re.search(regex, item)]\n", "\n", "# now call it\n", "filter_and_group(subjects, r\"[A-Z]{15,}\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 33, "text": [ "['CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'PLEEEEEEEEEEEEEEEASE',\n", " 'ACCOMPLISHMENTS',\n", " 'ACCOMPLISHMENTS',\n", " 'CONFIDENTIALITY',\n", " 'CONFIDENTIALITY',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'ACKNOWLEDGEMENT',\n", " 'ACKNOWLEDGEMENT',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'INTERCONNECTION',\n", " 'INTERCONNECTION',\n", " 'INTERCONNECTION',\n", " 'INTERCONNECTION',\n", " 'INTERCONNECTION',\n", " 'CONGRATULATIONS',\n", " 'WASSSAAAAAAAAAAAAAABI',\n", " 'WASSSAAAAAAAAAAAAAABI',\n", " 'WASSSAAAAAAAAAAAAAABI',\n", " 'WASSSAAAAAAAAAAAAAABI',\n", " 'WASSSAAAAAAAAAAAAAABI',\n", " 'WASSSAAAAAAAAAAAAAABI',\n", " 'WASSSAAAAAAAAAAAAAABI',\n", " 'NOOOOOOOOOOOOOOOO',\n", " 'NOOOOOOOOOOOOOOOO',\n", " 'NOOOOOOOOOOOOOOOO',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'CONFIDENTIALITY',\n", " 'CONFIDENTIALITY',\n", " 'ACCOMPLISHMENTS',\n", " 'ACCOMPLISHMENTS',\n", " 'CONGRATULATIONS',\n", " 'STANDARDIZATION',\n", " 'STANDARDIZATION',\n", " 'STANDARDIZATION',\n", " 'STANDARDIZATION',\n", " 'BRRRRRRRRRRRRRRRRRRRRR',\n", " 'CONGRATULATIONS',\n", " 'CONGRATULATIONS',\n", " 'NETCOTRANSMISSION',\n", " 
'NETCOTRANSMISSION',\n", " 'NETCOTRANSMISSION',\n", " 'INTERCONTINENTAL',\n", " 'INTERCONTINENTAL']" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Multiple groups in one regular expression\n", "\n", "So `re.search()` lets us get the part of a string that matches a regular expression, using the `.group()` method of the match object it returns. You can get even finer-grained matches using a feature of regular expressions called *grouping*.\n", "\n", "Let's start with a toy example. Say you have a list of university courses in the following format:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "courses = [\n", "    \"CSCI 105: Introductory Programming for Cat-Lovers\",\n", "    \"LING 214: Pronouncing Things Backwards\",\n", "    \"ANTHRO 342: Theory and Practice of Cheesemongery (Graduate Seminar)\",\n", "    \"CSCI 205: Advanced Programming for Cat-Lovers\",\n", "    \"ENGL 112: Speculative Travel Writing\"\n", "]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 52 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say you want to extract the following items from this data:\n", "\n", "* A unique list of all departments (e.g., CSCI, LING, ANTHRO, etc.)\n", "* A list of all course names\n", "* A dictionary that maps each course level (100, 200, 300) to the classes at that level\n", "\n", "Somehow we need to get *three* items from each line of data: the department, the number, and the course name. You can do this easily with regular expressions using *grouping*. To use grouping, put parentheses (`()`) around the portions of the regular expression that are of interest to you. You can then call the match object's `.groups()` method (note the `s`!) to get, as separate items, the portion of the string matched by each parenthesized part of the regular expression. 
Here's what it looks like, just operating on the first item of the list:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "first_course = courses[0]\n", "match = re.search(r\"(\\w+) (\\d+): (.+)$\", first_course)\n", "match.groups()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 53, "text": [ "('CSCI', '105', 'Introductory Programming for Cat-Lovers')" ] } ], "prompt_number": 53 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The regular expression in `re.search()` above roughly translates as the following:\n", "\n", "* Find me a sequence of one or more alphanumeric characters. Save this sequence as the first group.\n", "* Find a space.\n", "* Find me a sequence of one or more digits. Save this as the second group.\n", "* Find a colon followed by a space.\n", "* Find me one or more characters---I don't care which characters---and save the sequence as the third group.\n", "* Match the end of the line.\n", "\n", "Calling the `.groups()` method returns a tuple containing each of the saved items from the grouping. 
You can use it like so:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "groups = match.groups()\n", "print \"Department:\", groups[0] # department\n", "print \"Course number:\", groups[1] # course number\n", "print \"Course name:\", groups[2] # course name" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Department: CSCI\n", "Course number: 105\n", "Course name: Introductory Programming for Cat-Lovers\n" ] } ], "prompt_number": 54 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's iterate over the entire list of courses and put each piece into the appropriate data structure:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "departments = set()\n", "course_names = []\n", "course_levels = {}\n", "for item in courses:\n", "    # search and create match object\n", "    match = re.search(r\"(\\w+) (\\d+): (.+)$\", item)\n", "    if match: # if there's a match...\n", "        groups = match.groups() # get the groups: 0 is department, 1 is course number, 2 is name\n", "        departments.add(groups[0]) # add to department set (we wanted a list of *unique* departments)\n", "        course_names.append(groups[2]) # add to list of courses\n", "        # get the course \"level\" by rounding the number down to the nearest hundred\n", "        level = (int(groups[1]) // 100) * 100\n", "        # add the level/course key-value pair to course_levels\n", "        if level not in course_levels:\n", "            course_levels[level] = []\n", "        course_levels[level].append(groups[2])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 63 }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you run this cell, you can check out the unique list of departments:\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "departments" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 58, "text": [ "{'ANTHRO', 'CSCI', 'ENGL', 'LING'}" ] } ], "prompt_number": 58 }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "... the list of course names:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "course_names" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 60, "text": [ "['Introductory Programming for Cat-Lovers',\n", " 'Pronouncing Things Backwards',\n", " 'Theory and Practice of Cheesemongery (Graduate Seminar)',\n", " 'Advanced Programming for Cat-Lovers',\n", " 'Speculative Travel Writing']" ] } ], "prompt_number": 60 }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and the dictionary that maps course \"levels\" to a list of courses at that level:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "course_levels" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 65, "text": [ "{100: ['Introductory Programming for Cat-Lovers',\n", " 'Speculative Travel Writing'],\n", " 200: ['Pronouncing Things Backwards', 'Advanced Programming for Cat-Lovers'],\n", " 300: ['Theory and Practice of Cheesemongery (Graduate Seminar)']}" ] } ], "prompt_number": 65 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Grouping with multiple matches in the same string\n", "\n", "A problem with `re.search()` is that it only returns the *first* match in a string. What if we want to find *all* of the matches? It turns out that `re.findall()` *also* supports the regular expression grouping syntax. 
If the regular expression you pass to `re.findall()` includes any grouping parentheses, then the function returns not a list of strings, but a list of tuples, where each tuple has elements corresponding in order to the groups in the regular expression.\n", "\n", "As a quick example, here's a test string with number names and digits, and a regular expression to extract all instances of a series of alphanumeric characters, followed by a space, followed by a single digit:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "test = \"one 1 two 2 three 3 four 4 five 5\"\n", "re.findall(r\"(\\w+) (\\d)\", test)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 76, "text": [ "[('one', '1'), ('two', '2'), ('three', '3'), ('four', '4'), ('five', '5')]" ] } ], "prompt_number": 76 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this to extract every phone number from the Enron subjects corpus, separating out the components of the numbers by group:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "re.findall(r\"(\\d\\d\\d)-(\\d\\d\\d)-(\\d\\d\\d\\d)\", all_subjects)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 80, "text": [ "[('713', '853', '4743'),\n", " ('713', '222', '7667'),\n", " ('713', '222', '7667'),\n", " ('713', '222', '7667'),\n", " ('713', '222', '7667'),\n", " ('713', '222', '7667'),\n", " ('713', '222', '7667'),\n", " ('713', '222', '7667'),\n", " ('713', '222', '7667'),\n", " ('713', '222', '7667'),\n", " ('713', '222', '7667'),\n", " ('281', '296', '0573'),\n", " ('713', '851', '2499'),\n", " ('713', '345', '7896'),\n", " ('713', '345', '7896'),\n", " ('713', '345', '7896'),\n", " ('713', '345', '7896'),\n", " ('713', '345', '7896'),\n", " ('281', '367', '8953'),\n", " ('713', '528', '0759'),\n", " ('713', '850', '9002'),\n", " ('713', '703', '8294'),\n", " ('614', '888', '9588'),\n", " 
('713', '767', '8686'),\n", " ('303', '571', '6135'),\n", " ('281', '537', '9334'),\n", " ('800', '937', '6563'),\n", " ('800', '937', '6563'),\n", " ('888', '296', '1938')]" ] } ], "prompt_number": 80 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then we can do a quick analysis of how often each area code appears in these numbers, using the [Counter](https://docs.python.org/2/library/collections.html#counter-objects) class from the `collections` module. Its `.most_common(1)` method returns the single most frequent item along with its count:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from collections import Counter\n", "area_codes = [item[0] for item in re.findall(r\"(\\d\\d\\d)-(\\d\\d\\d)-(\\d\\d\\d\\d)\", all_subjects)]\n", "count = Counter(area_codes)\n", "count.most_common(1)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 105, "text": [ "[('713', 21)]" ] } ], "prompt_number": 105 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Multiple match objects with `re.finditer()`\n", "\n", "The `re` module also has a `re.finditer()` function, which returns not a list of strings or tuples (like `re.findall()`), but an iterator of *match objects*. This is useful if you need to know not just which text matched, but *where* in the text the match occurs. 
So, for example, to find the positions in the `all_subjects` corpus where the word \"Oregon\" occurs, whether or not the first letter is capitalized:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "[(match.start(), match.end(), match.group()) for match in re.finditer(r\"[Oo]regon\", all_subjects)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 109, "text": [ "[(410338, 410344, 'Oregon'),\n", " (410353, 410359, 'Oregon'),\n", " (608654, 608660, 'Oregon'),\n", " (831605, 831611, 'Oregon'),\n", " (3059955, 3059961, 'Oregon'),\n", " (3640267, 3640273, 'Oregon'),\n", " (3640292, 3640298, 'Oregon'),\n", " (3640317, 3640323, 'Oregon'),\n", " (3640610, 3640616, 'Oregon'),\n", " (3640635, 3640641, 'Oregon'),\n", " (3640660, 3640666, 'Oregon'),\n", " (4385798, 4385804, 'oregon')]" ] } ], "prompt_number": 109 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Conclusion\n", "\n", "Regular expressions are a great way to take some raw text and find the parts that are of interest to you. Python's string methods and string slicing syntax are a great way to massage and clean up data. You know them both now, which makes you powerful. But as powerful as you are, you have only scratched the surface of your potential! We've also only scratched the surface of what's possible with regular expressions. Here's some further reading:\n", "\n", "* [egrep for Linguists](http://stts.se/egrep_for_linguists/egrep_for_linguists.html) explains regular expressions by way of the command-line tool `egrep` (which I recommend becoming familiar with!)\n", "* Once you've mastered the basics, check out the official [Python regular expressions HOWTO](https://docs.python.org/2.7/howto/regex.html). The official [Python documentation on regular expressions](https://docs.python.org/2/library/re.html) is a deep dive on the subject.\n" ] } ], "metadata": {} } ] }