{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook intends to summarize key usages about regular expressions in python.\n",
    "\n",
    "References:\n",
    "- [Regular Expressions: Regexes in Python (Part 1)](https://realpython.com/regex-python/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## re.search\n",
    "\n",
    "`re.search(<regex>, <string>)` scans `<string>` looking for the first location where the pattern `<regex>` matches.\n",
    "\n",
    "- If a match is found, then `re.search()` returns a **match object**. Otherwise, it returns `None`.\n",
    "- A match object is truthy, so you can use it in a Boolean context like a conditional statement.\n",
    "- In a regex, a set of characters specified in square brackets ([]) makes up a **character class**. This metacharacter sequence matches any single character that is in the class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<function re.search(pattern, string, flags=0)>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(3, 6), match='123'>\n",
      "123\n",
      "<re.Match object; span=(3, 6), match='123'>\n"
     ]
    }
   ],
   "source": [
    "s = 'foo123bar'\n",
    "print(re.search('123', s))\n",
    "print(s[3:6])\n",
    "print(re.search('[1-9][1-9][1-9]', s))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found a match.\n"
     ]
    }
   ],
   "source": [
    "if re.search('123', s):\n",
    "    print('Found a match.')\n",
    "else:\n",
    "    print('No match.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### [ ]\n",
    "\n",
    "- `[0-9a-fA-F]` matches any hexadecimal digit character.\n",
    "- `[^0-9]` matches any character that isn’t a digit.\n",
    "- to match a literal `^`, put it not in the first position.\n",
    "- to match a literal hyphen `-`, put it in the first or last or use a backslash.\n",
    "- to match a literal `]`, put it in the first or use a backslash.\n",
    "- all other regex metacharacters lose their special meaning inside a character class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(3, 4), match='^'>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(re.search('[#:^]', 'foo^bar:baz#qux'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(3, 4), match='-'>\n",
      "<re.Match object; span=(3, 4), match='-'>\n",
      "<re.Match object; span=(3, 4), match='-'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('[-abc]', '123-456'))\n",
    "print(re.search('[abc-]', '123-456'))\n",
    "print(re.search('[ab\\-c]', '123-456'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(4, 5), match=']'>\n",
      "<re.Match object; span=(4, 5), match=']'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('[]abc]', '12[3]456'))\n",
    "print(re.search('[a\\]bc]', '12[3]456'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### \\w \\W\n",
    "- `\\w` matches any alphanumeric word character. Word characters are uppercase and lowercase letters, digits, and the underscore (_) character, so `\\w` is essentially shorthand for [a-zA-Z0-9_].\n",
    "- `\\W` is the opposite. It matches any non-word character and is equivalent to [^a-zA-Z0-9_].\n",
    "\n",
    "### \\d \\D\n",
    "`\\d` matches any decimal digit character. `\\D` is the opposite. It matches any character that isn’t a decimal digit. `\\d` is essentially equivalent to [0-9], and `\\D` is equivalent to [^0-9].\n",
    "\n",
    "### \\s \\S\n",
    "\\s matches any whitespace character, including a newline charactor `\\n`. `\\S` is the opposite of \\s. It matches any character that isn’t whitespace.\n",
    "\n",
    "`[\\d\\w\\s]` matches any digit, word, or whitespace character."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(3, 4), match='\\n'>\n",
      "<re.Match object; span=(4, 5), match='f'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('\\s', 'foo\\nbar baz'))\n",
    "print(re.search('\\S', '  \\n foo \\n baz'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Escaping Metacharacters\n",
    "\n",
    "### backslash `\\`\n",
    "\n",
    "- `\\\\` represents literal backslash.\n",
    "- `r' '`: raw string, which suppress the interpreter's process of literal strings. Always use raw strings when dealing with backslash matches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "foo\bar\n",
      "foo\\bar\n"
     ]
    }
   ],
   "source": [
    "s = 'foo\\bar'\n",
    "print(s)\n",
    "s = r'foo\\bar'\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(3, 4), match='\\\\'>\n",
      "<re.Match object; span=(3, 4), match='\\\\'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('\\\\\\\\', s)) # deal with interpreter's process first, then pass to reg process\n",
    "print(re.search(r'\\\\', s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Anchors\n",
    "\n",
    "\n",
    "- `^`, `\\A`: start of a string.\n",
    "- `$`, `\\Z`: end of a string.\n",
    "- `\\b`: boundary of a word. A word means `[\\w]*`. Use raw string here.\n",
    "- `\\B`: not a boundary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 3), match='foo'>\n",
      "None\n",
      "<re.Match object; span=(3, 6), match='foo'>\n",
      "<re.Match object; span=(1, 4), match='foo'>\n",
      "<re.Match object; span=(0, 3), match='foo'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('^foo', 'foobar'))\n",
    "print(re.search('^foo', 'barfoo'))\n",
    "print(re.search('foo$', 'barfoo'))\n",
    "\n",
    "print(re.search(r'\\bfoo\\b', '#foo.bar'))  # do remember to use raw string\n",
    "print(re.search(r'foo\\b', 'foo.bar'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quantifiers\n",
    "A quantifier metacharacter immediately follows a portion of a `<regex>` and indicates how many times that portion must occur for the match to succeed.\n",
    "\n",
    "Greedy: produce the longest possible match.\n",
    "- `*`: zero or more\n",
    "- `+`: one or more\n",
    "- `?`: zero or one\n",
    "\n",
    "Non-greedy versions of the above respectively: the shortest possible match.\n",
    "- `*?`\n",
    "- `+?`\n",
    "- `??`\n",
    "\n",
    "range\n",
    "Note that don't put a space inside the `{}`.\n",
    "- `{m}`: exactly m\n",
    "- `{m,n}`: m - n, greedy version.\n",
    "- `{m,}`: m - inf\n",
    "- `{,n}`: 0 - n\n",
    "- `{,}`: 0 -  inf\n",
    "- `{}`: literal `{}`\n",
    "- `{m,n}?`: non-greedy version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>\n",
      "<re.Match object; span=(1, 6), match='<foo>'>\n",
      "<re.Match object; span=(1, 6), match='<foo>'>\n",
      "<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>\n",
      "<re.Match object; span=(1, 6), match='<foo>'>\n",
      "<re.Match object; span=(0, 2), match='ba'>\n",
      "<re.Match object; span=(0, 1), match='b'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('<.*>', '%<foo> <bar> <baz>%'))\n",
    "print(re.search('<.*?>', '%<foo> <bar> <baz>%'))\n",
    "print(re.search('<[^>]*>', '%<foo> <bar> <baz>%'))\n",
    "\n",
    "print(re.search('<.+>', '%<foo> <bar> <baz>%'))\n",
    "print(re.search('<.+?>', '%<foo> <bar> <baz>%'))\n",
    "\n",
    "print(re.search('ba?', 'baaaa'))\n",
    "print(re.search('ba??', 'baaaa'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 8), match='baacaaac'>\n",
      "<re.Match object; span=(0, 3), match='baa'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('b[ac]{2,7}', 'baacaaac'))\n",
    "print(re.search('b[ac]{2,7}?', 'baacaaac'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Grouping Constructs and Backreferences\n",
    "\n",
    "- `(<regex>)`: defines a group\n",
    "- capture groups\n",
    "- backreferences `\\<n>`: treat the captured groups as variables and use them in the `<regex>`. **Use raw string.**\n",
    "- named groups: `(?P<name><regex>)`. Refer to it using `(?P=name)`, extract it using `m.group('name')`.\n",
    "- non-capturing group: `(?:<regex>)`. Used when we need the grouping feature, but don't need the retrieval information later.\n",
    "- conditional match:\n",
    "    - `(?(<n>)<yes-regex>|<no-regex>)`: use numbered reference\n",
    "    - `(?(<name>)<yes-regex>|<no-regex>)`: use named reference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 9), match='foofoobar'>\n",
      "('foobar', 'bar', None)\n",
      "foobar\n",
      "bar\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "m = re.search('(foo(bar)?)+(\\d\\d\\d)?', 'foofoobar')\n",
    "print(m)\n",
    "print(m.groups())\n",
    "print(m.group(1))\n",
    "print(m.group(2))\n",
    "print(m.group(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 12), match='foo,quux,baz'>\n",
      "('foo', 'quux', 'baz')\n",
      "foo\n",
      "quux\n",
      "baz\n",
      "foo,quux,baz\n"
     ]
    }
   ],
   "source": [
    "m = re.search('(\\w+),(\\w+),(\\w+)', 'foo,quux,baz')\n",
    "print(m)\n",
    "print(m.groups())\n",
    "print(m.group(1))\n",
    "print(m.group(2))\n",
    "print(m.group(3))\n",
    "print(m.group(0)) # the matched string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 8), match='foo, foo'>\n",
      "foo\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = r'(\\w+), \\1'\n",
    "m = re.search(regex, 'foo, foo')\n",
    "print(m)\n",
    "print(m.group(1))\n",
    "\n",
    "m = re.search(regex, 'foo, bar')\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 24), match='foo, test, bar, foo, bar'>\n",
      "('foo', 'bar')\n",
      "bar\n"
     ]
    }
   ],
   "source": [
    "m = re.search(r'(?P<w1>\\w+), (?:\\w+), (?P<w2>\\w+), (?P=w1), (?P=w2)', 'foo, test, bar, foo, bar, remaining')\n",
    "print(m)\n",
    "print(m.groups())\n",
    "print(m.group('w2'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 9), match='###foobar'>\n",
      "<re.Match object; span=(0, 6), match='foobaz'>\n",
      "None\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = r'^(###)?foo(?(1)bar|baz)'\n",
    "print(re.search(regex, '###foobar'))\n",
    "print(re.search(regex, 'foobaz'))\n",
    "print(re.search(regex, '#foobaz'))\n",
    "print(re.search(regex, '#foobar'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 7), match='##foo##'>\n",
      "<re.Match object; span=(0, 5), match='#foo#'>\n",
      "<re.Match object; span=(0, 3), match='foo'>\n",
      "None\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "regex = r'^(?P<ch>\\W+)?foo(?(ch)(?P=ch)|)$'\n",
    "print(re.search(regex, '##foo##'))\n",
    "print(re.search(regex, '#foo#'))\n",
    "print(re.search(regex, 'foo'))\n",
    "print(re.search(regex, 'foo#'))\n",
    "print(re.search(regex, '##foo%'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lookahead and lookbehind assertions\n",
    "\n",
    "Similar to anchors, these assertions are of zero width.\n",
    "\n",
    "- `(?=<lookahead_regex>)`: assert positive the next regex parser position\n",
    "- `(?!<lookahead_regex>)`: assert positive the next regex parser position\n",
    "- `(?<=<lookbehind_regex>)`: assert positive the previous regex parser position, must be of fixed length.\n",
    "- `(?<!<lookbehind_regex>)`: assert positive the previous regex parser position"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 3), match='foo'>\n",
      "<re.Match object; span=(0, 3), match='foo'>\n",
      "<re.Match object; span=(1, 4), match='foo'>\n",
      "<re.Match object; span=(1, 4), match='foo'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('foo(?=\\w)', 'foob1z'))\n",
    "print(re.search('foo(?!\\w)', 'foo@23'))\n",
    "print(re.search('(?<=\\W)foo', '#foob1z'))\n",
    "print(re.search('(?<!\\W)foo', 'afoob1z'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Misselaneous Metacharacters\n",
    "\n",
    "- `(?#...)`: comment, regex parser will ignore the content inside.\n",
    "- `<regex1>|<regex2>|<regex3>`: alternation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 6), match='foobar'>\n",
      "<re.Match object; span=(0, 4), match='9032'>\n",
      "<re.Match object; span=(0, 9), match='foobarfoo'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('foo(?#this is a comment)bar', 'foobar123'))\n",
    "print(re.search('[0-9]+|(foo|bar|baz)*', '9032'))\n",
    "print(re.search('[0-9]+|(foo|bar|baz)*', 'foobarfoo'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Flags\n",
    "\n",
    "- `re.I`: `re.IGNORECASE`, case-insensitive.\n",
    "- `re.M`: `re.MULTILINE`, enable anchors to work with embedded newlines.\n",
    "- `re.S`: `re.DOTALL`, enable `.` to match a newline.\n",
    "- `re.X`: `re.VERBOSE`, ignore whitespace and comment, to make the regex more human-friendly. Use `r''' '''`.\n",
    "- `re.DEBUG`: show the debug information.\n",
    "- encoding specification\n",
    "    - `re.A`: `re.ASCII`, ASCII encoding\n",
    "    - `re.U`: `re.UNICODE`, UNICODE encoding\n",
    "    - `re.L`: `re.LOCALE`, according to your current locale\n",
    "- `|`: combine multiple flags.\n",
    "- `(?<flag>)`, `imsxauL`: set flag for the whole regex, at the beginning\n",
    "- `(?<set_flag>-<remove_flag>:<regex>)`: set and remove flag for `<regex>`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AT AT_BEGINNING\n",
      "LITERAL 102\n",
      "LITERAL 111\n",
      "LITERAL 111\n",
      "\n",
      " 0. INFO 4 0b0 3 3 (to 5)\n",
      " 5: AT BEGINNING\n",
      " 7. LITERAL_UNI_IGNORE 0x66 ('f')\n",
      " 9. LITERAL_UNI_IGNORE 0x6f ('o')\n",
      "11. LITERAL_UNI_IGNORE 0x6f ('o')\n",
      "13. SUCCESS\n",
      "<re.Match object; span=(0, 3), match='FoO'>\n",
      "<re.Match object; span=(0, 15), match='(123)  234-3427'>\n"
     ]
    }
   ],
   "source": [
    "print(re.search('^foo', 'FoObar', re.I|re.DEBUG))\n",
    "print(re.search(r'''\n",
    "    ^               # start of the regex\n",
    "    (\\(\\d{3}\\))?    # optional three-digit area code\n",
    "    (\\s)*           # optional whitespace\n",
    "    \\d{3}           # three-digit prefix\n",
    "    [-.]            # seperator\n",
    "    \\d{4}           # four-digit line number\n",
    "    $               # end of the regex\n",
    "''', '(123)  234-3427', re.X))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(4, 11), match='bAr\\nbaZ'>\n",
      "<re.Match object; span=(4, 11), match='bAr\\nbaZ'>\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "print(re.search('^bar.baz', 'FoO\\nbAr\\nbaZ', re.I|re.M|re.S))\n",
    "print(re.search('(?ims)^bar.baz', 'FoO\\nbAr\\nbaZ'))\n",
    "print(re.search('(?ims)^bar.(?-i:baz)', 'FoO\\nbAr\\nbaZ'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary of `?`\n",
    "\n",
    "- outside `()`\n",
    "    - following `*`, `+`, `?`, `{m,n}`: non-greedy version\n",
    "    - following `<regex>`: zero or one repetition\n",
    "- inside `()`: serves as a magic prefix\n",
    "    - `(?P)`: named group, `(?P<name><regex>)` to create, `(?P=name)` to reference\n",
    "    - `(?:)`: non-capturing group, `(?:<regex>)` to create a non-capturing group\n",
    "    - `(?#)`: comment\n",
    "    - `(?())`: conditional match, `?(<n>)<yes_regex>|<no_regex>` for numbered groups, `(?(<name>)<yes_regex>|<no_regex>)` for named groups\n",
    "    - `(?=)`, `(?!)`, `(?<=)`, `(?<!)`: lookahead and lookbehind assertions\n",
    "    - `(?<flag>)`: flag can be `imsxauL`, set flags for the entire regex\n",
    "    - `(?<set_flag>-<remove_flag>:<regex>)`: set and remove flag for the regex portion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}