{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook intends to summarize key usages about regular expressions in python.\n", "\n", "References:\n", "- [Regular Expressions: Regexes in Python (Part 1)](https://realpython.com/regex-python/)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## re.search\n", "\n", "`re.search(, )` scans `` looking for the first location where the pattern `` matches.\n", "\n", "- If a match is found, then `re.search()` returns a **match object**. Otherwise, it returns `None`.\n", "- A match object is truthy, so you can use it in a Boolean context like a conditional statement.\n", "- In a regex, a set of characters specified in square brackets ([]) makes up a **character class**. This metacharacter sequence matches any single character that is in the class." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "123\n", "\n" ] } ], "source": [ "s = 'foo123bar'\n", "print(re.search('123', s))\n", "print(s[3:6])\n", "print(re.search('[1-9][1-9][1-9]', s))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found a match.\n" ] } ], "source": [ "if re.search('123', s):\n", " print('Found a match.')\n", "else:\n", " print('No match.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [ ]\n", "\n", "- `[0-9a-fA-F]` matches any hexadecimal digit character.\n", "- `[^0-9]` matches any character that isn’t a digit.\n", "- to match a literal `^`, put it not in the first position.\n", "- to match a literal hyphen `-`, put it in the first or last or use a backslash.\n", "- to match a literal `]`, put it in the first or use a backslash.\n", "- all other regex metacharacters lose their special meaning inside a character class." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(re.search('[#:^]', 'foo^bar:baz#qux'))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] } ], "source": [ "print(re.search('[-abc]', '123-456'))\n", "print(re.search('[abc-]', '123-456'))\n", "print(re.search('[ab\\-c]', '123-456'))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "print(re.search('[]abc]', '12[3]456'))\n", "print(re.search('[a\\]bc]', '12[3]456'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\w \\W\n", "- `\\w` matches any alphanumeric word character. Word characters are uppercase and lowercase letters, digits, and the underscore (_) character, so `\\w` is essentially shorthand for [a-zA-Z0-9_].\n", "- `\\W` is the opposite. It matches any non-word character and is equivalent to [^a-zA-Z0-9_].\n", "\n", "### \\d \\D\n", "`\\d` matches any decimal digit character. `\\D` is the opposite. It matches any character that isn’t a decimal digit. `\\d` is essentially equivalent to [0-9], and `\\D` is equivalent to [^0-9].\n", "\n", "### \\s \\S\n", "\\s matches any whitespace character, including a newline charactor `\\n`. `\\S` is the opposite of \\s. It matches any character that isn’t whitespace.\n", "\n", "`[\\d\\w\\s]` matches any digit, word, or whitespace character." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "print(re.search('\\s', 'foo\\nbar baz'))\n", "print(re.search('\\S', ' \\n foo \\n baz'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Escaping Metacharacters\n", "\n", "### backslash `\\`\n", "\n", "- `\\\\` represents literal backslash.\n", "- `r' '`: raw string, which suppress the interpreter's process of literal strings. Always use raw strings when dealing with backslash matches." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "foo\bar\n", "foo\\bar\n" ] } ], "source": [ "s = 'foo\\bar'\n", "print(s)\n", "s = r'foo\\bar'\n", "print(s)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "print(re.search('\\\\\\\\', s)) # deal with interpreter's process first, then pass to reg process\n", "print(re.search(r'\\\\', s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Anchors\n", "\n", "\n", "- `^`, `\\A`: start of a string.\n", "- `$`, `\\Z`: end of a string.\n", "- `\\b`: boundary of a word. A word means `[\\w]*`. Use raw string here.\n", "- `\\B`: not a boundary." ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "None\n", "\n", "\n", "\n" ] } ], "source": [ "print(re.search('^foo', 'foobar'))\n", "print(re.search('^foo', 'barfoo'))\n", "print(re.search('foo$', 'barfoo'))\n", "\n", "print(re.search(r'\\bfoo\\b', '#foo.bar')) # do remember to use raw string\n", "print(re.search(r'foo\\b', 'foo.bar'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quantifiers\n", "A quantifier metacharacter immediately follows a portion of a `` and indicates how many times that portion must occur for the match to succeed.\n", "\n", "Greedy: produce the longest possible match.\n", "- `*`: zero or more\n", "- `+`: one or more\n", "- `?`: zero or one\n", "\n", "Non-greedy versions of the above respectively: the shortest possible match.\n", "- `*?`\n", "- `+?`\n", "- `??`\n", "\n", "range\n", "Note that don't put a space inside the `{}`.\n", "- `{m}`: exactly m\n", "- `{m,n}`: m - n, greedy version.\n", "- `{m,}`: m - inf\n", "- `{,n}`: 0 - n\n", "- `{,}`: 0 - inf\n", "- `{}`: literal `{}`\n", "- `{m,n}?`: non-greedy version." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "print(re.search('<.*>', '% %'))\n", "print(re.search('<.*?>', '% %'))\n", "print(re.search('<[^>]*>', '% %'))\n", "\n", "print(re.search('<.+>', '% %'))\n", "print(re.search('<.+?>', '% %'))\n", "\n", "print(re.search('ba?', 'baaaa'))\n", "print(re.search('ba??', 'baaaa'))" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "print(re.search('b[ac]{2,7}', 'baacaaac'))\n", "print(re.search('b[ac]{2,7}?', 'baacaaac'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping Constructs and Backreferences\n", "\n", "- `()`: defines a group\n", "- capture groups\n", "- backreferences `\\`: treat the captured groups as variables and use them in the ``. **Use raw string.**\n", "- named groups: `(?P)`. Refer to it using `(?P=name)`, extract it using `m.group('name')`.\n", "- non-capturing group: `(?:)`. Used when we need the grouping feature, but don't need the retrieval information later.\n", "- conditional match:\n", " - `(?()|)`: use numbered reference\n", " - `(?()|)`: use named reference" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "('foobar', 'bar', None)\n", "foobar\n", "bar\n", "None\n" ] } ], "source": [ "m = re.search('(foo(bar)?)+(\\d\\d\\d)?', 'foofoobar')\n", "print(m)\n", "print(m.groups())\n", "print(m.group(1))\n", "print(m.group(2))\n", "print(m.group(3))" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "('foo', 'quux', 'baz')\n", "foo\n", "quux\n", "baz\n", "foo,quux,baz\n" ] } ], "source": [ "m = re.search('(\\w+),(\\w+),(\\w+)', 'foo,quux,baz')\n", "print(m)\n", "print(m.groups())\n", "print(m.group(1))\n", "print(m.group(2))\n", "print(m.group(3))\n", "print(m.group(0)) # the matched string" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "foo\n", "None\n" ] } ], "source": [ "regex = r'(\\w+), \\1'\n", "m = re.search(regex, 'foo, foo')\n", "print(m)\n", "print(m.group(1))\n", "\n", "m = re.search(regex, 'foo, bar')\n", "print(m)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "('foo', 'bar')\n", "bar\n" ] } ], "source": [ "m = re.search(r'(?P\\w+), (?:\\w+), (?P\\w+), (?P=w1), (?P=w2)', 'foo, test, bar, foo, bar, remaining')\n", "print(m)\n", "print(m.groups())\n", "print(m.group('w2'))" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "None\n", "None\n" ] } ], "source": [ "regex = r'^(###)?foo(?(1)bar|baz)'\n", "print(re.search(regex, '###foobar'))\n", "print(re.search(regex, 'foobaz'))\n", "print(re.search(regex, '#foobaz'))\n", "print(re.search(regex, '#foobar'))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "None\n", "None\n" ] } ], "source": [ "regex = r'^(?P\\W+)?foo(?(ch)(?P=ch)|)$'\n", "print(re.search(regex, '##foo##'))\n", "print(re.search(regex, '#foo#'))\n", "print(re.search(regex, 'foo'))\n", "print(re.search(regex, 'foo#'))\n", "print(re.search(regex, '##foo%'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lookahead and lookbehind assertions\n", "\n", "Similar to anchors, these assertions are of zero width.\n", "\n", "- `(?=)`: assert positive the next regex parser position\n", "- `(?!)`: assert positive the next regex parser position\n", "- `(?<=)`: assert positive the previous regex parser position, must be of fixed length.\n", "- `(?)`: assert positive the previous regex parser position" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n" ] } ], "source": [ "print(re.search('foo(?=\\w)', 'foob1z'))\n", "print(re.search('foo(?!\\w)', 'foo@23'))\n", "print(re.search('(?<=\\W)foo', '#foob1z'))\n", "print(re.search('(?||`: alternation" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] } ], "source": [ "print(re.search('foo(?#this is a comment)bar', 'foobar123'))\n", "print(re.search('[0-9]+|(foo|bar|baz)*', '9032'))\n", "print(re.search('[0-9]+|(foo|bar|baz)*', 'foobarfoo'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Flags\n", "\n", "- `re.I`: `re.IGNORECASE`, case-insensitive.\n", "- `re.M`: `re.MULTILINE`, enable anchors to work with embedded newlines.\n", "- `re.S`: `re.DOTALL`, enable `.` to match a newline.\n", "- `re.X`: `re.VERBOSE`, ignore whitespace and comment, to make the regex more human-friendly. Use `r''' '''`.\n", "- `re.DEBUG`: show the debug information.\n", "- encoding specification\n", " - `re.A`: `re.ASCII`, ASCII encoding\n", " - `re.U`: `re.UNICODE`, UNICODE encoding\n", " - `re.L`: `re.LOCALE`, according to your current locale\n", "- `|`: combine multiple flags.\n", "- `(?)`, `imsxauL`: set flag for the whole regex, at the beginning\n", "- `(?-:)`: set and remove flag for ``." ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AT AT_BEGINNING\n", "LITERAL 102\n", "LITERAL 111\n", "LITERAL 111\n", "\n", " 0. INFO 4 0b0 3 3 (to 5)\n", " 5: AT BEGINNING\n", " 7. LITERAL_UNI_IGNORE 0x66 ('f')\n", " 9. LITERAL_UNI_IGNORE 0x6f ('o')\n", "11. LITERAL_UNI_IGNORE 0x6f ('o')\n", "13. SUCCESS\n", "\n", "\n" ] } ], "source": [ "print(re.search('^foo', 'FoObar', re.I|re.DEBUG))\n", "print(re.search(r'''\n", " ^ # start of the regex\n", " (\\(\\d{3}\\))? # optional three-digit area code\n", " (\\s)* # optional whitespace\n", " \\d{3} # three-digit prefix\n", " [-.] # seperator\n", " \\d{4} # four-digit line number\n", " $ # end of the regex\n", "''', '(123) 234-3427', re.X))" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "None\n" ] } ], "source": [ "print(re.search('^bar.baz', 'FoO\\nbAr\\nbaZ', re.I|re.M|re.S))\n", "print(re.search('(?ims)^bar.baz', 'FoO\\nbAr\\nbaZ'))\n", "print(re.search('(?ims)^bar.(?-i:baz)', 'FoO\\nbAr\\nbaZ'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary of `?`\n", "\n", "- outside `()`\n", " - following `*`, `+`, `?`, `{m,n}`: non-greedy version\n", " - following ``: zero or one repetition\n", "- inside `()`: serves as a magic prefix\n", " - `(?P)`: named group, `(?P)` to create, `(?P=name)` to reference\n", " - `(?:)`: non-capturing group, `(?:)` to create a non-capturing group\n", " - `(?#)`: comment\n", " - `(?())`: conditional match, `?()|` for numbered groups, `(?()|)` for named groups\n", " - `(?=)`, `(?!)`, `(?<=)`, `(?)`: flag can be `imsxauL`, set flags for the entire regex\n", " - `(?-:)`: set and remove flag for the regex portion" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }