{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regular Expression By Example\n", "\n", "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n", "- **Date:** -\n", "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n", "- **Note:** This snippet is based on: [http://www.tutorialspoint.com/python/python_reg_expressions.htm](http://www.tutorialspoint.com/python/python_reg_expressions.htm)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import the regular expression module\n", "import re" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create some data\n", "text = 'A flock of 120 quick brown foxes jumped over 30 lazy brown, bears.'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ^ Matches beginning of line." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['A']" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('^A', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $ Matches end of line." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['bears.']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('bears.$', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### . Matches any single character except newline." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['foxes']" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('f..es', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [...] Matches any single character in brackets." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['o', 'o', 'u', 'i', 'o', 'o', 'e', 'u', 'e', 'o', 'e', 'a', 'o', 'e', 'a']" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find all lower-case vowels\n", "re.findall('[aeiou]', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [^...] Matches any single character not in brackets." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['A',\n", " ' ',\n", " 'f',\n", " 'l',\n", " 'c',\n", " 'k',\n", " ' ',\n", " 'f',\n", " ' ',\n", " '1',\n", " '2',\n", " '0',\n", " ' ',\n", " 'q',\n", " 'c',\n", " 'k',\n", " ' ',\n", " 'b',\n", " 'r',\n", " 'w',\n", " 'n',\n", " ' ',\n", " 'f',\n", " 'x',\n", " 's',\n", " ' ',\n", " 'j',\n", " 'm',\n", " 'p',\n", " 'd',\n", " ' ',\n", " 'v',\n", " 'r',\n", " ' ',\n", " '3',\n", " '0',\n", " ' ',\n", " 'l',\n", " 'z',\n", " 'y',\n", " ' ',\n", " 'b',\n", " 'r',\n", " 'w',\n", " 'n',\n", " ',',\n", " ' ',\n", " 'b',\n", " 'r',\n", " 's',\n", " '.']" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find all characters that are not lower-case vowels\n", "re.findall('[^aeiou]', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### a | b Matches either a or b." 
] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['A', 'a', 'a']" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('a|A', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (re) Groups regular expressions and remembers matched text." ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['foxes']" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find any instance of 'foxes'; findall returns the text captured by the group\n", "re.findall('(foxes)', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\w Matches word characters." ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['flock', 'quick', 'brown', 'foxes', 'jumpe', 'brown', 'bears']" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find non-overlapping runs of five consecutive word characters\n", "re.findall('\\w\\w\\w\\w\\w', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\W Matches nonword characters." ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[', ']" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('\\W\\W', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\s Matches whitespace. For ASCII text, equivalent to [ \\t\\n\\r\\f\\v]." 
] }, { "cell_type": "code", "execution_count": 120, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('\\s', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\S Matches nonwhitespace." ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['fl',\n", " 'oc',\n", " 'of',\n", " '12',\n", " 'qu',\n", " 'ic',\n", " 'br',\n", " 'ow',\n", " 'fo',\n", " 'xe',\n", " 'ju',\n", " 'mp',\n", " 'ed',\n", " 'ov',\n", " 'er',\n", " '30',\n", " 'la',\n", " 'zy',\n", " 'br',\n", " 'ow',\n", " 'n,',\n", " 'be',\n", " 'ar',\n", " 's.']" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find non-overlapping pairs of non-whitespace characters\n", "re.findall('\\S\\S', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\d Matches digits. Equivalent to [0-9]." ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['120']" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('\\d\\d\\d', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\D Matches nondigits." ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['A flo',\n", " 'ck of',\n", " ' quic',\n", " 'k bro',\n", " 'wn fo',\n", " 'xes j',\n", " 'umped',\n", " ' over',\n", " ' lazy',\n", " ' brow',\n", " 'n, be']" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find non-overlapping runs of five non-digit characters\n", "re.findall('\\D\\D\\D\\D\\D', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\A Matches beginning of string." 
] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['A']" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('\\AA', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\Z Matches only at the end of the string." ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['bears.']" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('bears.\\Z', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\z Matches end of string in some other regex flavors. Python's `re` has traditionally used \\Z for this." ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['.']" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# \\z is not available in every Python version, so use \\Z here\n", "re.findall('\\.\\Z', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\b Matches a word boundary." ] }, { "cell_type": "code", "execution_count": 153, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['f', 'o', 'f', 'o']" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use a raw string: in a plain string literal, '\\b' is a backspace character\n", "re.findall(r'\\b[foxes]', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\n, \\t, etc. Matches newlines, carriage returns, tabs, etc." 
] }, { "cell_type": "code", "execution_count": 155, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The sample text contains no newline, so nothing matches\n", "re.findall('\\n', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Pp]ython Match \"Python\" or \"python\"" ] }, { "cell_type": "code", "execution_count": 170, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['foxes', 'Foxes']" ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('[Ff]oxes', 'foxes Foxes Doxes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [0-9] Match any digit; same as [0123456789]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "re.findall('[0-9]', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [a-z] Match any lowercase ASCII letter" ] }, { "cell_type": "code", "execution_count": 172, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['f', 'o', 'x', 'e', 's', 'o', 'x', 'e', 's']" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('[a-z]', 'foxes Foxes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [A-Z] Match any uppercase ASCII letter" ] }, { "cell_type": "code", "execution_count": 173, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['F']" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('[A-Z]', 'foxes Foxes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [a-zA-Z0-9] Match any of the above" ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['f', 'o', 'x', 'e', 's', 'F', 'o', 'x', 'e', 's']" ] }, 
"execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('[a-zA-Z0-9]', 'foxes Foxes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [^aeiou] Match anything other than a lowercase vowel" ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['f', 'x', 's', ' ', 'F', 'x', 's']" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('[^aeiou]', 'foxes Foxes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [^0-9] Match anything other than a digit" ] }, { "cell_type": "code", "execution_count": 177, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['f', 'o', 'x', 'e', 's', ' ', 'F', 'o', 'x', 'e', 's']" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('[^0-9]', 'foxes Foxes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ruby? 
Match \"rub\" or \"ruby\": the y is optional" ] }, { "cell_type": "code", "execution_count": 180, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['foxes']" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('foxes?', 'foxes Foxes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ruby* Match \"rub\" plus 0 or more ys" ] }, { "cell_type": "code", "execution_count": 183, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['ox', 'ox']" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('ox*', 'foxes Foxes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ruby+ Match \"rub\" plus 1 or more ys" ] }, { "cell_type": "code", "execution_count": 184, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['ox', 'ox']" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('ox+', 'foxes Foxes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\d{3} Match exactly 3 digits" ] }, { "cell_type": "code", "execution_count": 188, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['120']" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('\\d{3}', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\d{2,} Match 2 or more digits" ] }, { "cell_type": "code", "execution_count": 189, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['120', '30']" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('\\d{2,}', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\d{2,3} Match 2 or 3 digits" ] }, { "cell_type": "code", "execution_count": 190, "metadata": { "collapsed": false }, "outputs": [ { 
"data": { "text/plain": [ "['120', '30']" ] }, "execution_count": 190, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('\\d{2,3}', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ^Python Match \"Python\" at the start of a string or internal line" ] }, { "cell_type": "code", "execution_count": 191, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['A']" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('^A', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python$ Match \"Python\" at the end of a string or line" ] }, { "cell_type": "code", "execution_count": 192, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['bears.']" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('bears.$', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\APython Match \"Python\" at the start of a string" ] }, { "cell_type": "code", "execution_count": 196, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['A']" ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('\\AA', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python\\Z Match \"Python\" at the end of a string" ] }, { "cell_type": "code", "execution_count": 198, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['bears.']" ] }, "execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('bears.\\Z', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python(?=!) 
Match \"Python\", if followed by an exclamation point" ] }, { "cell_type": "code", "execution_count": 204, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['bears']" ] }, "execution_count": 204, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Match 'bears' only if it is followed by a literal period\n", "re.findall('bears(?=\\.)', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python(?!!) Match \"Python\", if not followed by an exclamation point" ] }, { "cell_type": "code", "execution_count": 209, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['foxes']" ] }, "execution_count": 209, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('foxes(?!!)', 'foxes foxes!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### python|perl Match \"python\" or \"perl\"" ] }, { "cell_type": "code", "execution_count": 211, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['foxes', 'foxes']" ] }, "execution_count": 211, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Alternatives are tried left to right, so 'foxes' wins and 'foxes!' is never returned\n", "re.findall('foxes|foxes!', 'foxes foxes!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### rub(y|le) Match \"ruby\" or \"ruble\"" ] }, { "cell_type": "code", "execution_count": 212, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['es!']" ] }, "execution_count": 212, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# With one capturing group, findall returns only the group's text\n", "re.findall('fox(es!)', 'foxes foxes!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python(!+|\\?) \"Python\" followed by one or more ! or one ?" 
] }, { "cell_type": "code", "execution_count": 213, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['!']" ] }, "execution_count": 213, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall('foxes(!)', 'foxes foxes!')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }