{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "3f4ae24c", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import codecs\n", "import unicodedata\n", "with codecs.open(\"faust.txt\",\"r\",\"utf-8\") as stream: text = stream.read()" ] }, { "cell_type": "code", "execution_count": 2, "id": "93814096", "metadata": { "collapsed": false }, "outputs": [], "source": [ "# !sudo locale-gen de_DE.UTF-8" ] }, { "cell_type": "code", "execution_count": 3, "id": "ddaa6b07", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'de_DE.utf8'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import locale\n", "locale.setlocale(locale.LC_ALL,'de_DE.utf8')\n", "# C, en_US.utf8, ..." ] }, { "cell_type": "markdown", "id": "aea096f2", "metadata": {}, "source": [ "# Basic Searching and Matching" ] }, { "cell_type": "markdown", "id": "17da0e3f", "metadata": {}, "source": [ "The `re` (regular expression) module contains all the functions we are talking about here.\n", "\n", "Regular expressions are powerful tools for searching for strings and patterns.\n", "\n", "They are the basis of the command line `fgrep`, `grep`, and `egrep` tools (the `re` in those names stands for \"regular expression\").\n", "\n", "Internally, the query is converted into a finite state automaton, and that automaton is then matched." ] }, { "cell_type": "code", "execution_count": 4, "id": "0612018b", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "id": "41293536", "metadata": {}, "source": [ "There are two basic operations, `search` and `match`.\n", "The first searches for a regular expression anywhere in the string,\n", "the second requires the match to start at the beginning of the string.\n", "\n", "A *match* is indicated by returning a match object\n", "(this behaves like a boolean `True`), and a failed match\n", "is indicated by returning `None`." 
] }, { "cell_type": "code", "execution_count": 5, "id": "76c95f11", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match at 0x40ab988>" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search('cheese','the cheese and the bread')" ] }, { "cell_type": "code", "execution_count": 6, "id": "78aefc6e", "metadata": { "collapsed": false }, "outputs": [], "source": [ "re.search('butter','the cheese and the bread')" ] }, { "cell_type": "code", "execution_count": 7, "id": "3f5feb30", "metadata": { "collapsed": false }, "outputs": [], "source": [ "re.match('cheese','the cheese and the bread')" ] }, { "cell_type": "code", "execution_count": 8, "id": "2bb562ce", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match at 0x40aba58>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.match('the','the cheese and the bread')" ] }, { "cell_type": "markdown", "id": "b91ba662", "metadata": {}, "source": [ "Matches are case-sensitive by default." ] }, { "cell_type": "code", "execution_count": 9, "id": "3afb8d01", "metadata": { "collapsed": false }, "outputs": [], "source": [ "re.search('THE','the cheese and the bread')" ] }, { "cell_type": "markdown", "id": "32a9a7b9", "metadata": {}, "source": [ "But we can make matches case insensitive with the `re.I` flag." ] }, { "cell_type": "code", "execution_count": 15, "id": "3e2463b5", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match at 0x40abd30>" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search('THE','the cheese and the bread',re.I)" ] }, { "cell_type": "markdown", "id": "c0fdf514", "metadata": {}, "source": [ "We can also incorporate this flag directly into the query." 
] }, { "cell_type": "code", "execution_count": 16, "id": "06b883d8", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match at 0x40abd98>" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search('THE(?i)','the cheese and the bread')" ] }, { "cell_type": "markdown", "id": "28a0487b", "metadata": {}, "source": [ "A third important operation is `sub` and its variant `subn`." ] }, { "cell_type": "code", "execution_count": 10, "id": "f68e2dde", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'bread and butter'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.sub('cheese','butter','bread and cheese')" ] }, { "cell_type": "code", "execution_count": 11, "id": "4406d797", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('bread and butter', 1)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.subn('cheese','butter','bread and cheese')" ] }, { "cell_type": "markdown", "id": "caaf374d", "metadata": {}, "source": [ "Also, we can find multiple matches with `findall`." ] }, { "cell_type": "code", "execution_count": 12, "id": "99d2730e", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['spam', 'spam', 'spam']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('spam','spam, spam, ham, and spam')" ] }, { "cell_type": "markdown", "id": "4ac11245", "metadata": {}, "source": [ "Finally, we can also split." 
] }, { "cell_type": "code", "execution_count": 13, "id": "d573ada5", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['the', 'quick', 'brown', 'fox']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split(' ','the quick brown fox')" ] }, { "cell_type": "markdown", "id": "bed3becf", "metadata": {}, "source": [ "# Flags" ] }, { "cell_type": "markdown", "id": "7586149f", "metadata": {}, "source": [ "Regular expression operations also take a number of flags that affect the operation:\n", "\n", "- `re.I` - ignore case\n", "- `re.L` - locale-dependent matches\n", "- `re.M` - multiline (changes meaning of `$` and `^`)\n", "- `re.S` - dot matches all characters (usually doesn't match `\\n`)\n", "- `re.X` - verbose regular expressions (whitespace is ignored and allows comments)\n", "- `re.U` - unicode-dependent matches (changes interpretation of digits etc)\n", "\n", "You can also specify these with syntax like `(?iu)` inside the expression." ] }, { "cell_type": "code", "execution_count": 19, "id": "9f0d25c1", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['the', 'the']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r'THE','the cat in the hat',re.I)" ] }, { "cell_type": "code", "execution_count": 20, "id": "4101cd97", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['the', 'the']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r'THE(?i)','the cat in the hat')" ] }, { "cell_type": "markdown", "id": "85eb7b5f", "metadata": {}, "source": [ "# Match Objects" ] }, { "cell_type": "markdown", "id": "eb6c702a", "metadata": {}, "source": [ "The match object gives additional information about the match.\n", "It contains \"groups\"; group 0 refers to the entire match\n", "(we'll see how to define other groups later)." 
] }, { "cell_type": "code", "execution_count": 21, "id": "8123844b", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match at 0x3b8e098>" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = re.search('cheese','the cheese and the bread')\n", "g" ] }, { "cell_type": "code", "execution_count": 22, "id": "4ee94a04", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'cheese'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.group(0)" ] }, { "cell_type": "code", "execution_count": 23, "id": "bb9bb13d", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(4, 10)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.start(0),g.end(0)" ] }, { "cell_type": "markdown", "id": "e4dfba72", "metadata": {}, "source": [ "# Precompiled Regular Expressions" ] }, { "cell_type": "markdown", "id": "e295cb7a", "metadata": {}, "source": [ "Regular expression matching is a two step process:\n", "\n", "- the expression string is compiled (into a finite automaton)\n", "- the automaton is executed\n", "\n", "Compilation can be costly, so you can separate it from matching\n", "and substitution." 
] }, { "cell_type": "code", "execution_count": 24, "id": "6bd3da13", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "re.compile(r'cheese')" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = re.compile('cheese')\n", "obj" ] }, { "cell_type": "code", "execution_count": 25, "id": "3cbff365", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match at 0x3b8e100>" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.search('bread and cheese')" ] }, { "cell_type": "code", "execution_count": 26, "id": "9dad00fb", "metadata": { "collapsed": false }, "outputs": [], "source": [ "obj.match('bread and cheese')" ] }, { "cell_type": "code", "execution_count": 27, "id": "5423117e", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'bread and butter'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.sub('butter','bread and cheese')" ] }, { "cell_type": "markdown", "id": "b806b6b1", "metadata": {}, "source": [ "# Raw Strings" ] }, { "cell_type": "markdown", "id": "892ed7a9", "metadata": {}, "source": [ "Regular expressions frequently involve backslash characters (`\\`),\n", "and sometimes also single or double quotes.\n", "For this, there are several convenient quoting conventions:\n", "\n", "- `r\"abc\"` - raw string\n", "- `\"\"\"a\"bc\"\"\"` - triple quoted\n", "- `r\"\"\"a\"bc\"\"\"` - triple quoted raw\n", "- `ur\"\"\"a\"bc\"\"\"` - triple quoted raw unicode string" ] }, { "cell_type": "code", "execution_count": 28, "id": "300b5e18", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a\bc\n", "a\\bc\n", "a\"b\"c\n", "a\\\"b\\\"c\n", "a\\\"b\\\"c\n" ] } ], "source": [ "print 'a\\bc'\n", "print r'a\\bc'\n", "print \"a\\\"b\\\"c\"\n", "print r\"\"\"a\\\"b\\\"c\"\"\"\n", "print 
ur\"\"\"a\\\"b\\\"c\"\"\"" ] }, { "cell_type": "code", "execution_count": 29, "id": "d2ce89dd", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'the'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search(r'\\w+','the bread and the cheese').group(0)" ] }, { "cell_type": "code", "execution_count": 30, "id": "ee2e6ead", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'Brot'" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search(ur'\\w+',u'Brot und Käse').group(0)" ] }, { "cell_type": "markdown", "id": "026555d6", "metadata": {}, "source": [ "# Unicode Matching" ] }, { "cell_type": "markdown", "id": "1aa7421a", "metadata": {}, "source": [ "Be careful when matching Unicode in Python 2.x, since you can write\n", "either or both the regular expression and the target as `str` or `unicode`.\n", "If you aren't consistent, the matches will just fail.\n", "\n", "Furthermore, matching UTF-8 encodings stored in `str` won't work right." 
] }, { "cell_type": "code", "execution_count": 31, "id": "a298febb", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match at 0x3b8e2a0>" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search(ur'Käse',u'Der Käse und das Brot.')" ] }, { "cell_type": "code", "execution_count": 32, "id": "cdbeb2e0", "metadata": { "collapsed": false }, "outputs": [], "source": [ "re.search('Käse',u'Der Käse und das Brot.')" ] }, { "cell_type": "code", "execution_count": 33, "id": "67f2bfd4", "metadata": { "collapsed": false }, "outputs": [], "source": [ "re.search(ur'Käse','Der Käse und das Brot.')" ] }, { "cell_type": "code", "execution_count": 34, "id": "0682ab17", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match at 0x3b8e308>" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search('Käse','Der Käse und das Brot.')" ] }, { "cell_type": "markdown", "id": "e2ad5c22", "metadata": {}, "source": [ "Even if both strings are Unicode, you still have to worry\n", "about normalization." 
] }, { "cell_type": "code", "execution_count": 35, "id": "3bdea9ae", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(Käse)\n" ] } ], "source": [ "s = unicodedata.normalize('NFD',u'Käse')\n", "print \"(%s)\"%s\n", "re.search(s,'Der Käse und das Brot')" ] }, { "cell_type": "code", "execution_count": 36, "id": "9ffc9edb", "metadata": { "collapsed": false }, "outputs": [], "source": [ "def normalizing_search(regex,s):\n", " regex = unicodedata.normalize('NFC',regex)\n", " s = unicodedata.normalize('NFC',s)\n", " return re.search(regex,s)" ] }, { "cell_type": "code", "execution_count": 37, "id": "af58bf7c", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match at 0x3b8e370>" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizing_search(s,u'Der Käse und das Brot')" ] }, { "cell_type": "markdown", "id": "27088c97", "metadata": {}, "source": [ "# Basic Regular Expression Syntax" ] }, { "cell_type": "markdown", "id": "e6b628c2", "metadata": {}, "source": [ "There are a number of standard syntactic elements:\n", "\n", "- `.` matches a single character (any character)\n", "- `x*` matches 0 or more `x`\n", "- `x+` matches 1 or more `x`\n", "- `x?` matches 0 or 1 `x`\n", "- `^` and `$` match at the beginning and end of a line, respectively\n", "- `\\x` suppresses the special meaning of character `x`\n", "- `(xyz)` matches `xyz` and treats it as a unit for the purpose of operators (it also defines a group)\n", "- `x|y` matches `x` or `y`\n", "- `[abcA-Z]` matches any one character in the set `a`, `b`, `c`, or in the range `A` through `Z`\n", "- `[^abc]` matches any character other than `a`, `b`, or `c`" ] }, { "cell_type": "code", "execution_count": 38, "id": "991ae0fd", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['cat', 'cot']" ] }, "execution_count": 38, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "re.findall('c.t','the cat on the cot')" ] }, { "cell_type": "code", "execution_count": 39, "id": "00b260c8", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['wet', 'wt', 'weet']" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('we*t','wet cowtippers tweet frequently')" ] }, { "cell_type": "code", "execution_count": 40, "id": "7c365ba0", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['wet', 'weet']" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('we+t','wet cowtippers tweet frequently')" ] }, { "cell_type": "code", "execution_count": 41, "id": "58efb9a5", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['wet', 'wt']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('we?t','wet cowtippers tweet frequently')" ] }, { "cell_type": "markdown", "id": "9b3125d2", "metadata": {}, "source": [ "There is actually a generalization of the `*`-like operators, where you can\n", "specify the exact number of repetitions with syntax like `{3,7}`." 
] }, { "cell_type": "code", "execution_count": 42, "id": "f6904c39", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['et', 'wt', 'et']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('[ew]t','wet cowtippers tweet frequently')" ] }, { "cell_type": "code", "execution_count": 43, "id": "c6f51ba9", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['^.^']\n", "['^.^', '^_^']\n" ] } ], "source": [ "print re.findall(r'\\^\\.\\^','this ^.^ is a Japanese smiley, ^_^')\n", "print re.findall(r'\\^.\\^','this ^.^ is a Japanese smiley, ^_^')" ] }, { "cell_type": "code", "execution_count": 44, "id": "8ab8d5ed", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['w', 'w', 'w']\n", "['w']\n" ] } ], "source": [ "print re.findall(r'w','wet cowtippers tweet frequently')\n", "print re.findall(r'^w','wet cowtippers tweet frequently')" ] }, { "cell_type": "code", "execution_count": 45, "id": "5f0f1c58", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['tweet', 'twit']\n" ] } ], "source": [ "print re.findall(r'(tweet|twit)','wet cowtippers tweet frequently, but are twits')" ] }, { "cell_type": "markdown", "id": "49b48491", "metadata": {}, "source": [ "# Longest vs Shortest Matches" ] }, { "cell_type": "markdown", "id": "8a4ed3ef", "metadata": {}, "source": [ "By default, repeat operators are greedy: they match as many repetitions as possible, so you get the longest match at a given starting position." 
] }, { "cell_type": "code", "execution_count": 19, "id": "5a965b3d", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['abbbbbb']\n" ] } ], "source": [ "print re.findall(r'ab+','xyz abbbbbbc def')" ] }, { "cell_type": "markdown", "id": "064774db", "metadata": {}, "source": [ "Sometimes, you want the shortest possible match.\n", "You get that by putting a `?` after a repeat operator like `*`, `+`, or `?`." ] }, { "cell_type": "code", "execution_count": 25, "id": "75934314", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['ab']\n" ] } ], "source": [ "print re.findall(r'ab+?','xyz abbbbbbc def')" ] }, { "cell_type": "markdown", "id": "0240ae2d", "metadata": {}, "source": [ "Note that this does not \"search for\" the shortest match, it is just that when it\n", "matches, it picks up the shortest string." ] }, { "cell_type": "code", "execution_count": 28, "id": "a208b729", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "print re.search(r'ab+?','xyz abbbbbbc abc def').start(0)" ] }, { "cell_type": "markdown", "id": "2fd61278", "metadata": {}, "source": [ "# Grouping" ] }, { "cell_type": "code", "execution_count": 46, "id": "e4613c2b", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['cat', 'hat']\n" ] } ], "source": [ "print re.findall(r'the ([^ ]*)','the cat in the hat')" ] }, { "cell_type": "code", "execution_count": 47, "id": "4f5b8041", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('a', 'cat'), ('the', 'hat')]\n" ] } ], "source": [ "print re.findall(r'(a|the) ([^ ]*)','a cat in the hat')" ] }, { "cell_type": "code", "execution_count": 48, "id": "7c6efde8", "metadata": { "collapsed": false }, "outputs": [], "source": [ "g = re.search(r'(a|the) ([^ ]*)','a cat 
in the hat')" ] }, { "cell_type": "code", "execution_count": 49, "id": "94660518", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'a cat'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.group(0)" ] }, { "cell_type": "code", "execution_count": 50, "id": "c2b6d03f", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'a'" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.group(1)" ] }, { "cell_type": "code", "execution_count": 51, "id": "133b81e4", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'cat'" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.group(2)" ] }, { "cell_type": "code", "execution_count": 52, "id": "fad83958", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 5 (2, 5)\n" ] } ], "source": [ "print g.start(2),g.end(2),g.span(2)" ] }, { "cell_type": "code", "execution_count": 53, "id": "c8fa1f25", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['cat', 'hat']\n" ] } ], "source": [ "print re.findall(r'(?:a|the) ([^ ]*)','a cat in the hat')" ] }, { "cell_type": "code", "execution_count": 54, "id": "0d3bbfb4", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<_sre.SRE_Match object at 0x3b6ff30>\n", "<_sre.SRE_Match object at 0x3b6ff30>\n", "None\n" ] } ], "source": [ "print re.search(r'(the|a) [^ ]+ near \\1 [^ ]+','the cat near the cat')\n", "print re.search(r'(the|a) [^ ]+ near \\1 [^ ]+','a cat near a cat')\n", "print re.search(r'(the|a) [^ ]+ near \\1 [^ ]+','the cat near a cat')" ] }, { "cell_type": "markdown", "id": "fdc2fcd1", "metadata": {}, "source": [ "Grouping also takes on special meaning with `split`, alternating between separators and words." 
] }, { "cell_type": "code", "execution_count": 55, "id": "00df658a", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['The', ' ', 'quick', ', ', 'brown', ' ', 'fox', ' ', 'jumps', '; ', 'over', ' ', 'lazy', ' ', 'dogs', '!', '']\n" ] } ], "source": [ "print re.split(r'([,;]?\\s+|\\W+$)','The quick, brown fox jumps; over lazy dogs!')" ] }, { "cell_type": "markdown", "id": "6be7d81d", "metadata": {}, "source": [ "# Named Groups" ] }, { "cell_type": "markdown", "id": "46bd6690", "metadata": {}, "source": [ "Grouping can get more complex with naming and conditionals." ] }, { "cell_type": "code", "execution_count": 42, "id": "9f495ec7", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['a', 'd']\n", "['a', 'd']\n" ] } ], "source": [ "print re.findall(r'(.)\\1','aa bc dd ef')\n", "print re.findall(r'(?P<id>.)(?P=id)','aa bc dd ef')" ] }, { "cell_type": "markdown", "id": "3d370ee2", "metadata": {}, "source": [ "Named groups can also be used to refer to parts of patterns." ] }, { "cell_type": "code", "execution_count": 43, "id": "718f00ef", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bc\n" ] } ], "source": [ "print re.search(r'(?P<id>b.)','aa bc dd ef').group(\"id\")" ] }, { "cell_type": "markdown", "id": "82f6f6e2", "metadata": {}, "source": [ "There are even conditionals based on whether a group (referenced by number or name) has matched." 
] }, { "cell_type": "code", "execution_count": 47, "id": "828722ec", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<_sre.SRE_Match object at 0x42404e0>\n", "<_sre.SRE_Match object at 0x42404e0>\n", "None\n" ] } ], "source": [ "q = r'^(<)?[^<>]+(?(1)>|)$'\n", "print re.search(q,'abc')\n", "print re.search(q,'<abc>')\n", "print re.search(q,'<abc')" ] }, { "cell_type": "markdown", "id": "7c09aaff", "metadata": {}, "source": [ "With the `re.X` flag (or `(?x)`), you can insert whitespace and comments." ] }, { "cell_type": "code", "execution_count": 49, "id": "29fec3f6", "metadata": { "collapsed": false }, "outputs": [], "source": [ "qx = r\"\"\"(?x)\n", "\n", " ^(<)? # match optional beginning \"<\"\n", "\n", " [^<>]* # match any run of non-bracket characters\n", "\n", " (?(1)>|)$ # match a \">\" at the end if we did so at the beginning\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 50, "id": "14f46cbc", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<_sre.SRE_Match object at 0x42405d0>\n", "<_sre.SRE_Match object at 0x42405d0>\n" ] } ], "source": [ "print re.search(q,'<abc>')\n", "print re.search(qx,'<abc>')" ] }, { "cell_type": "markdown", "id": "5715aa08", "metadata": {}, "source": [ "# Character Classes" ] }, { "cell_type": "markdown", "id": "e4eec7d9", "metadata": {}, "source": [ "There are a number of common special character classes:\n", "\n", "- `\\A` - empty string at the beginning of the string\n", "- `\\Z` - empty string at end of string\n", "- `\\b` - empty string at the beginning or end of a word\n", "- `\\B` - empty string not at the beginning or end of a word (upper case is often inverse of lower case)\n", "- `\\d` - digit (usually `[0-9]`, or digit class in Unicode)\n", "- `\\D` - not a digit\n", "- `\\s` - white space\n", "- `\\S` - not white space\n", "- `\\w` - word character\n", "- `\\W` - not a word character" ] }, { "cell_type": "code", 
"execution_count": 56, "id": "468622fa", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'la', 'y', 'dogz']" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r'\\w+',\"The quick brown fox... jumped over the la$y dogz.\")" ] }, { "cell_type": "code", "execution_count": 57, "id": "94715658", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['7.2973525698e-3', '3.14159']" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numbers = re.compile(r'((?:\\d+\\.\\d*|\\d*\\.\\d+)(?:e[+-]\\d+)?)',re.I)\n", "numbers.findall(\"The fine structure constant is 7.2973525698e-3, and pi is about 3.14159.\")" ] }, { "cell_type": "markdown", "id": "3e80f8a0", "metadata": {}, "source": [ "# Lookahead and Lookbehind" ] }, { "cell_type": "markdown", "id": "98a130e3", "metadata": {}, "source": [ "Sometimes you want to match something \"in context\" without actually considering the\n", "context part of the match.\n", "For this, you can use lookahead and lookbehind assertions." 
] }, { "cell_type": "code", "execution_count": 29, "id": "a10cea36", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['c']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r\"[abc](?=z)\",\"ax by cz\")" ] }, { "cell_type": "code", "execution_count": 30, "id": "c0c3d7d6", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['a', 'b']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r\"[abc](?!z)\",\"ax by cz\")" ] }, { "cell_type": "code", "execution_count": 31, "id": "8d8192f7", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['x']" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r\"(?<=a)[xyz]\",\"ax by cz\")" ] }, { "cell_type": "code", "execution_count": 32, "id": "14a23c35", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['y', 'z']" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r\"(?<!a)[xyz]\",\"ax by cz\")" 
] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r.match(\"x\")" ] }, { "cell_type": "code", "execution_count": 61, "id": "5a46b547", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_regex.Match at 0x3ce6e00>" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r.match(\"(x+y)\")" ] }, { "cell_type": "code", "execution_count": 62, "id": "215e0bba", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_regex.Match at 0x3ce6e68>" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r.match(\"(x*(y+z))\")" ] }, { "cell_type": "code", "execution_count": 63, "id": "4785fad6", "metadata": { "collapsed": false }, "outputs": [], "source": [ "r.match(\"(x*y+z))\")" ] }, { "cell_type": "markdown", "id": "7cdfe27e", "metadata": {}, "source": [ "### Fuzzy Matching" ] }, { "cell_type": "markdown", "id": "b890c154", "metadata": {}, "source": [ "Fuzzy matching allows edit distance information to be taken into account during matching.\n", "That is, a group does not need to match precisely." ] }, { "cell_type": "code", "execution_count": 147, "id": "f7f40496", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['quick', 'quack']" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(r\"(?=\\w)(quick){e<=1}\",\"the quick brown fox quacks loudly\")" ] }, { "cell_type": "markdown", "id": "690be2cd", "metadata": {}, "source": [ "You can separately limit the number of insertions (`i`), deletions (`d`), substitutions (`s`), and total errors (`e`), with syntax like `{i<=1,s<=2}`." ] }, { "cell_type": "markdown", "id": "a63ff5a9", "metadata": {}, "source": [ "### Named Lists" ] }, { "cell_type": "markdown", "id": "5eab2d36", "metadata": {}, "source": [ "Often, it is useful to compile large lists of words into a regular expression (cf. `fgrep`)." 
] }, { "cell_type": "code", "execution_count": 65, "id": "fcaab056", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "851" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open(\"basic-english.txt\") as stream: words = stream.read().split()\n", "len(words)" ] }, { "cell_type": "code", "execution_count": 82, "id": "28cc621e", "metadata": { "collapsed": false }, "outputs": [], "source": [ "allwords = regex.compile(r\"\\b(\\L<words>)(?:s|es|ed|ing)?\\b(?i)\",words=words)" ] }, { "cell_type": "code", "execution_count": 83, "id": "fd44a653", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog']" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "allwords.findall(\"The quick brown fox jumps over the lazy dogs.\")" ] }, { "cell_type": "code", "execution_count": 84, "id": "2d4af40b", "metadata": { "collapsed": false }, "outputs": [], "source": [ "fuzzywords = regex.compile(r\"\\b(\\L<words>){e<=2}(?:s|es|ed|ing)?\\b(?i)\",words=words)" ] }, { "cell_type": "code", "execution_count": 85, "id": "1a7a945e", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['The ', 'quock', ' ', 'briwn', ' fox', ' ', 'jxmp', ' over', ' the ', 'lazy', ' dog', '']\n" ] } ], "source": [ "print fuzzywords.findall(\"The quock briwn fox jxmps over the lazy dogs.\")" ] }, { "cell_type": "code", "execution_count": 87, "id": "dad2fd6c", "metadata": { "collapsed": false }, "outputs": [], "source": [ "fuzzywords = regex.compile(r\"\\b(?=\\w)(\\L<words>){e<=2}(?:s|es|ed|ing)?\\b(?i)\",words=words)" ] }, { "cell_type": "code", "execution_count": 88, "id": "7f129ab1", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['The ', 'quock', 'briwn', 'fox ', 'jxmp', 'over', 'the ', 'lazy', 'dogs']\n" ] } ], 
"source": [ "print fuzzywords.findall(\"The quock briwn fox jxmps over the lazy dogs.\")" ] }, { "cell_type": "markdown", "id": "68c46b08", "metadata": {}, "source": [ "### Better Text and Unicode Support" ] }, { "cell_type": "markdown", "id": "52886a9e", "metadata": {}, "source": [ "There is generally better Unicode support in `regex`:\n", "\n", "- word characters (`\\w` etc.) refer to Unicode by default\n", "- line separators refer to Unicode line separators\n", "- whitespace recognizes Unicode whitespace\n", "- `\\m` and `\\M` match at the beginning/end of a word respectively\n", "- there are set operators\n", "- POSIX character classes are recognized\n", "- you can access Unicode properties with `\\p` and `\\P`\n", "- you can match graphemes with `\\X`" ] }, { "cell_type": "code", "execution_count": 99, "id": "9deaf5f1", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'the',\n", " u'quick',\n", " u'\\u0440\\u044b\\u0436\\u0430\\u044f',\n", " u'\\u043b\\u0438\\u0441\\u0430']" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(ur'\\S+',u'the quick рыжая лиса')" ] }, { "cell_type": "code", "execution_count": 98, "id": "375a39de", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'the',\n", " u'quick',\n", " u'\\u0440\\u044b\\u0436\\u0430\\u044f',\n", " u'\\u043b\\u0438\\u0441\\u0430']" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(ur'\\w+',u'the quick рыжая лиса')" ] }, { "cell_type": "code", "execution_count": 101, "id": "690565bf", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'the', u'quick']" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(ur'\\p{Script=Latin}+',u'the quick рыжая лиса')" ] }, { "cell_type": "code", "execution_count": 100, "id": "a619d716", "metadata": { "collapsed": false 
}, "outputs": [ { "data": { "text/plain": [ "[u'\\u0440\\u044b\\u0436\\u0430\\u044f', u'\\u043b\\u0438\\u0441\\u0430']" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(ur'\\p{Script=Cyrillic}+',u'the quick рыжая лиса')" ] }, { "cell_type": "code", "execution_count": 104, "id": "0efc52e1", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "u'K\\xe4se'\n", "u'Ka\\u0308se'\n" ] } ], "source": [ "s = u\"Käse\"\n", "t = unicodedata.normalize('NFD',s)\n", "print repr(s)\n", "print repr(t)" ] }, { "cell_type": "markdown", "id": "948131c2", "metadata": {}, "source": [ "By default, `re` doesn't treat non-ASCII characters as word characters at all." ] }, { "cell_type": "code", "execution_count": 109, "id": "30bc180c", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "([u'K', u's', u'e'], [u'K', u'a', u's', u'e'])" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(ur\"\\w\",s),re.findall(ur\"\\w\",t)" ] }, { "cell_type": "markdown", "id": "11dd4325", "metadata": {}, "source": [ "With the Unicode flag (`(?u)`), it does, but it still doesn't handle decomposed characters: the combining mark is not matched." ] }, { "cell_type": "code", "execution_count": 111, "id": "3c397b69", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "([u'K', u'\\xe4', u's', u'e'], [u'K', u'a', u's', u'e'])" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(ur\"\\w(?u)\",s),re.findall(ur\"\\w(?u)\",t)" ] }, { "cell_type": "markdown", "id": "ab75171b", "metadata": {}, "source": [ "The `regex` package handles word characters correctly by default,\n", "but still doesn't treat a decomposed character as a single unit with either `\\w` or `.`." 
] }, { "cell_type": "code", "execution_count": 110, "id": "f7a14f22", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "([u'K', u'\\xe4', u's', u'e'], [u'K', u'a', u'\\u0308', u's', u'e'])" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(ur\"\\w\",s),regex.findall(ur\"\\w\",t)" ] }, { "cell_type": "code", "execution_count": 107, "id": "b27b9cd0", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "([u'K', u'\\xe4', u's', u'e'], [u'K', u'a', u'\\u0308', u's', u'e'])" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(ur\".\",s),regex.findall(ur\".\",t)" ] }, { "cell_type": "markdown", "id": "a66b7de7", "metadata": {}, "source": [ "However, the grapheme matcher `\\X` recognizes that the decomposed\n", "umlaut is, in fact, a single grapheme, even though it consists\n", "of several codepoints." ] }, { "cell_type": "code", "execution_count": 108, "id": "b6f52e90", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "([u'K', u'\\xe4', u's', u'e'], [u'K', u'a\\u0308', u's', u'e'])" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(ur\"\\X\",s),regex.findall(ur\"\\X\",t)" ] }, { "cell_type": "markdown", "id": "d0bdbed7", "metadata": {}, "source": [ "# Parsing" ] }, { "cell_type": "markdown", "id": "ae35ba1a", "metadata": {}, "source": [ "Regular expressions are best for fairly simple tasks.\n", "For more complex parsing tasks, you may want to use an actual parsing tool,\n", "like pyparsing." 
] }, { "cell_type": "code", "execution_count": 113, "id": "ba11f475", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pyparsing" ] }, { "cell_type": "code", "execution_count": 115, "id": "cdd7a4e9", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['a', ['b', 'c'], 'd']]" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyparsing.nestedExpr().parseString(\"(a (b c) d)\").asList()" ] }, { "cell_type": "code", "execution_count": 136, "id": "06cbcb8b", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import string\n", "from pyparsing import oneOf,Literal,Word,Optional,StringEnd\n", "greeting = oneOf(\"Hi Yo\") + Optional(Literal(\",\")) + Word(string.uppercase,string.lowercase) + Optional(oneOf(\". !\")) + StringEnd()" ] }, { "cell_type": "code", "execution_count": 137, "id": "a8a6c3f5", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(['Hi', ',', 'Peter', '!'], {})" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "greeting.parseString(\"Hi, Peter!\")" ] }, { "cell_type": "code", "execution_count": 140, "id": "42eef18a", "metadata": { "collapsed": false }, "outputs": [ { "ename": "ParseException", "evalue": "Expected end of text (at char 7), (line:1, col:8)", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mParseException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mgreeting\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mparseString\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Yo, DogZ.\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/lib/python2.7/dist-packages/pyparsing.pyc\u001b[0m in \u001b[0;36mparseString\u001b[0;34m(self, instring, 
parseAll)\u001b[0m\n\u001b[1;32m 1030\u001b[0m \u001b[0;31m# catch and re-raise exception from here, clears out pyparsing internal stack trace\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1031\u001b[0m \u001b[0mexc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexc_info\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1032\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1033\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1034\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mtokens\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mParseException\u001b[0m: Expected end of text (at char 7), (line:1, col:8)" ] } ], "source": [ "greeting.parseString(\"Yo, DogZ.\")" ] }, { "cell_type": "code", "execution_count": null, "id": "aabf119a", "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }