{ "metadata": { "name": "", "signature": "sha256:7e13e72294f7b6b1a7bac2433aa77980275fd156ab80724a2887b97f9389105b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regular Expression Basics\n", "\n", "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n", "- **Date:** -\n", "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n", "- **Note:** This snippet is based on: [http://www.tutorialspoint.com/python/python_reg_expressions.htm](http://www.tutorialspoint.com/python/python_reg_expressions.htm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import the regex (re) module" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import sys" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a simple text string." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "text = 'The quick brown fox jumped over the lazy black bear.'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a pattern that matches any three characters" ] }, { "cell_type": "code", "collapsed": false, "input": [ "three_letter_word = '...'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compile the pattern string into a regex object" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pattern_re = re.compile(three_letter_word); pattern_re" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "re.compile(r'...', re.UNICODE)" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Search the text for any two characters followed by \"own\"" ] }, { "cell_type": "code", "collapsed": false, "input": [ "re_search = re.search('..own', text)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### If the search found a match, print the matched text" ] }, { "cell_type": "code", "collapsed": false, "input": [ "if re_search:\n", "    # Print the matched text\n", "    print(re_search.group())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "brown\n" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## re.match\n", "\n", "re.match() succeeds ONLY if the pattern matches at the beginning of the string (the match may or may not extend to the end).\n", "To find a pattern anywhere else in the string, use re.search()." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Try to match the same pattern at the beginning of the text" ] }, { "cell_type": "code", "collapsed": false, "input": [ "re_match = re.match('..own', text)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "### If re_match found a match, print it; otherwise print \"No matches\"" ] }, { "cell_type": "code", "collapsed": false, "input": [ "if re_match:\n", "    # Print the matched text\n", "    print(re_match.group())\n", "else:\n", "    # Otherwise, report that nothing matched\n", "    print('No matches')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "No matches\n" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## re.split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split up the string using \"e\" as the separator." ] }, { "cell_type": "code", "collapsed": false, "input": [ "re_split = re.split('e', text); re_split" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "['Th', ' quick brown fox jump', 'd ov', 'r th', ' lazy black b', 'ar.']" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## re.sub\n", "\n", "Replaces occurrences of the regex pattern with a replacement string.\n", "\n", "The count argument (3 in the example) is the maximum number of substitutions to make." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Substitute the first three instances of \"e\" with \"E\", then print the result" ] }, { "cell_type": "code", "collapsed": false, "input": [ "re_sub = re.sub('e', 'E', text, count=3); print(re_sub)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ThE quick brown fox jumpEd ovEr the lazy black bear.\n" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Patterns\n", "\n", "- ^ Matches beginning of line.\n", "- $ Matches end of line.\n", "- . Matches any single character except newline.\n", "- [...] Matches any single character in brackets.\n", "- [^...] 
Matches any single character not in brackets.\n", "- re* Matches 0 or more occurrences of preceding expression.\n", "- re+ Matches 1 or more occurrences of preceding expression.\n", "- re? Matches 0 or 1 occurrence of preceding expression.\n", "- re{n} Matches exactly n occurrences of preceding expression.\n", "- re{n,} Matches n or more occurrences of preceding expression.\n", "- re{n,m} Matches at least n and at most m occurrences of preceding expression.\n", "- a|b Matches either a or b.\n", "- (re) Groups regular expressions and remembers matched text.\n", "- (?imx) Turns on the i, m, or x options. In Python's re module this applies to the whole pattern and must appear at the start of it.\n", "- (?-imx) Turns off the i, m, or x options. Python's re module does not accept this standalone form; use (?-imx:re) instead.\n", "- (?:re) Groups regular expressions without remembering matched text.\n", "- (?imx:re) Turns on the i, m, or x options within the parentheses only.\n", "- (?-imx:re) Turns off the i, m, or x options within the parentheses only.\n", "- (?#...) Comment.\n", "- (?=re) Positive lookahead: matches at a position followed by re, without consuming any characters.\n", "- (?!re) Negative lookahead: matches at a position not followed by re, without consuming any characters.\n", "- (?>re) Matches independent pattern without backtracking (atomic group; Python 3.11 and later).\n", "- \\w Matches word characters.\n", "- \\W Matches nonword characters.\n", "- \\s Matches whitespace. For ASCII text, equivalent to [ \\t\\n\\r\\f\\v].\n", "- \\S Matches nonwhitespace.\n", "- \\d Matches digits. For ASCII text, equivalent to [0-9].\n", "- \\D Matches nondigits.\n", "- \\A Matches beginning of string.\n", "- \\Z Matches only at the end of the string.\n", "- \\z Matches end of string. Not supported by older versions of Python's re module; use \\Z instead.\n", "- \\G Matches point where last match finished. Not supported by Python's re module.\n", "- \\b Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.\n", "- \\B Matches nonword boundaries.\n", "- \\n, \\t, etc. 
Matches newlines, carriage returns, tabs, etc.\n", "- \\1...\\9 Matches nth grouped subexpression.\n", "- \\10 Matches 10th grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.\n", "\n", "### Examples\n", "\n", "- [Pp]ython Match \"Python\" or \"python\"\n", "- rub[ye] Match \"ruby\" or \"rube\"\n", "- [aeiou] Match any one lowercase vowel\n", "- [0-9] Match any digit; same as [0123456789]\n", "- [a-z] Match any lowercase ASCII letter\n", "- [A-Z] Match any uppercase ASCII letter\n", "- [a-zA-Z0-9] Match any of the above\n", "- [^aeiou] Match anything other than a lowercase vowel\n", "- [^0-9] Match anything other than a digit\n", "\n", "- ruby? Match \"rub\" or \"ruby\": the y is optional\n", "- ruby* Match \"rub\" plus 0 or more ys\n", "- ruby+ Match \"rub\" plus 1 or more ys\n", "- \\d{3} Match exactly 3 digits\n", "- \\d{3,} Match 3 or more digits\n", "- \\d{3,5} Match 3, 4, or 5 digits\n", "\n", "- ^Python Match \"Python\" at the start of a string, or of an internal line with re.MULTILINE\n", "- Python$ Match \"Python\" at the end of a string, or of a line with re.MULTILINE\n", "- \\APython Match \"Python\" at the start of a string\n", "- Python\\Z Match \"Python\" at the end of a string\n", "- \\bPython\\b Match \"Python\" at a word boundary\n", "- \\brub\\B \\B is nonword boundary: match \"rub\" in \"rube\" and \"ruby\" but not alone\n", "- Python(?=!) Match \"Python\", if followed by an exclamation point\n", "- Python(?!!) Match \"Python\", if not followed by an exclamation point\n", "\n", "- python|perl Match \"python\" or \"perl\"\n", "- rub(y|le) Match \"ruby\" or \"ruble\"\n", "- Python(!+|\\?) \"Python\" followed by one or more ! or one ?" ] } ], "metadata": {} } ] }