{
 "metadata": {
  "name": "",
  "signature": "sha256:7e13e72294f7b6b1a7bac2433aa77980275fd156ab80724a2887b97f9389105b"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Regular Expression Basics\n",
      "\n",
      "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n",
      "- **Date:** -\n",
      "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n",
      "- **Note:** This snippit is based on: [http://www.tutorialspoint.com/python/python_reg_expressions.htm](http://www.tutorialspoint.com/python/python_reg_expressions.htm)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Import the regex (re) package"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Import sys"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import sys"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Create a simple text string."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = 'The quick brown fox jumped over the lazy black bear.'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Create a pattern to match"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "three_letter_word = '...'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Convert the string into a regex object"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern_re = re.compile(three_letter_word); pattern_re"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "re.compile(r'...', re.UNICODE)"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Does a three letter word appear in text?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "re_search = re.search('..own', text)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### If the search query is at all true,"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re_search:\n",
      "    # Print the search results\n",
      "    print(re_search.group())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "brown\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## re.match\n",
      "\n",
      "re.match() is for matching ONLY the beginning of a string or the whole string\n",
      "For anything else, use re.search"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Match all three letter words in text"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "re_match = re.match('..own', text)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### If re_match is true, print the match, else print \"No Matches\""
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re_match:\n",
      "    # Print all the matches\n",
      "    print(re_match.group())\n",
      "else:\n",
      "    # Print this\n",
      "    print('No matches')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "No matches\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## re.split"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Split up the string using \"e\" as the seperator."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "re_split = re.split('e', text); re_split"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": [
        "['Th', ' quick brown fox jump', 'd ov', 'r th', ' lazy black b', 'ar.']"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## re.sub\n",
      "\n",
      "Replaces occurrences of the regex pattern with something else\n",
      "\n",
      "The \"3\" references to the maximum number of substitutions to make."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Substitute the first three instances of \"e\" with \"E\", then print it"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "re_sub = re.sub('e', 'E', text, 3); print(re_sub)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "ThE quick brown fox jumpEd ovEr the lazy black bear.\n"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Patterns\n",
      "\n",
      "- ^   Matches beginning of line.\n",
      "- $   Matches end of line.\n",
      "- .   Matches any single character except newline.\n",
      "- [...]   Matches any single character in brackets.\n",
      "- [# ^...]  Matches any single character not in brackets\n",
      "- re* Matches 0 or more occurrences of preceding expression.\n",
      "- re+ Matches 1 or more occurrence of preceding expression.\n",
      "- re? Matches 0 or 1 occurrence of preceding expression.\n",
      "- re{ n}  Matches exactly n number of occurrences of preceding expression.\n",
      "- re{ n,} Matches n or more occurrences of preceding expression.\n",
      "- re{ n, m}   Matches at least n and at most m occurrences of preceding expression.\n",
      "- a | b    Matches either a or b.\n",
      "- (re)    Groups regular expressions and remembers matched text.\n",
      "- (?imx)  Temporarily toggles on i, m, or x options within a regular expression. If in parentheses, only that area is affected.\n",
      "- (?-imx) Temporarily toggles off i, m, or x options within a regular expression. If in parentheses, only that area is affected.\n",
      "- (?: re) Groups regular expressions without remembering matched text.\n",
      "- (?imx: re)  Temporarily toggles on i, m, or x options within parentheses.\n",
      "- (?-imx: re) Temporarily toggles off i, m, or x options within parentheses.\n",
      "- (?#...) Comment.\n",
      "- (?= re) Specifies position using a pattern. Doesn't have a range.\n",
      "- (?! re) Specifies position using pattern negation. Doesn't have a range.\n",
      "- (?> re) Matches independent pattern without backtracking.\n",
      "- \\w  Matches word characters.\n",
      "- \\W  Matches nonword characters.\n",
      "- \\s  Matches whitespace. Equivalent to [\\t\\n\\r\\f].\n",
      "- \\S  Matches nonwhitespace.\n",
      "- \\d  Matches digits. Equivalent to [0-9].\n",
      "- \\D  Matches nondigits.\n",
      "- \\A  Matches beginning of string.\n",
      "- \\Z  Matches end of string. If a newline exists, it matches just before newline.\n",
      "- \\z  Matches end of string.\n",
      "- \\G  Matches point where last match finished.\n",
      "- \\b  Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.\n",
      "- \\B  Matches nonword boundaries.\n",
      "- \\n, \\t, etc.    Matches newlines, carriage returns, tabs, etc.\n",
      "- \\1...\\9 Matches nth grouped subexpression.\n",
      "- \\10 Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.\n",
      "\n",
      "### Examples\n",
      "\n",
      "- [Pp]ython   Match \"Python\" or \"python\"\n",
      "- rub[ye] Match \"ruby\" or \"rube\"\n",
      "- [aeiou] Match any one lowercase vowel\n",
      "- [0-9]   Match any digit; same as [0123456789]\n",
      "- [a-z]   Match any lowercase ASCII letter\n",
      "- [A-Z]   Match any uppercase ASCII letter\n",
      "- [a-zA-Z0-9] Match any of the above\n",
      "- [^aeiou]    Match anything other than a lowercase vowel\n",
      "- [^0-9]  Match anything other than a digit\n",
      "\n",
      "- ruby?   Match \"rub\" or \"ruby\": the y is optional\n",
      "- ruby*   Match \"rub\" plus 0 or more ys\n",
      "- ruby+   Match \"rub\" plus 1 or more ys\n",
      "- \\d{3}   Match exactly 3 digits\n",
      "- \\d{3,}  Match 3 or more digits\n",
      "- \\d{3,5} Match 3, 4, or 5 digits\n",
      "\n",
      "- ^Python Match \"Python\" at the start of a string or internal line\n",
      "- Python$     Match \"Python\" at the end of a string or line\n",
      "- \\APython    Match \"Python\" at the start of a string\n",
      "- Python\\Z    Match \"Python\" at the end of a string\n",
      "- \\bPython\\b  Match \"Python\" at a word boundary\n",
      "- \\brub\\B \\B is nonword boundary: match \"rub\" in \"rube\" and \"ruby\" but not alone\n",
      "- Python(?=!) Match \"Python\", if followed by an exclamation point\n",
      "- Python(?!!) Match \"Python\", if not followed by an exclamation point\n",
      "\n",
      "- python|perl Match \"python\" or \"perl\"\n",
      "- rub(y|le))  Match \"ruby\" or \"ruble\"\n",
      "- Python(!+|\\?)   \"Python\" followed by one or more ! or one ?"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}