{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regular Expressions and the re Module" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular Expressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A [regular expression](http://baike.baidu.com/view/94238.htm) is a pattern used to match strings or substrings; the pattern can be very specific or very general.\n", "\n", "The `Python` standard library provides the `re` module for working with regular expressions." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## re.match & re.search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`re.match` and `re.search` are two commonly used functions in the `re` module:\n", "\n", "    re.match(pattern, string[, flags])\n", "    re.search(pattern, string[, flags])\n", "\n", "Both look for the first match; on success they return a `match` object, otherwise `None`. The difference is that `re.match` only matches at the beginning of the string, whereas `re.search` scans the whole string for a matching substring." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## re.findall & re.finditer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`re.findall(pattern, string)` returns a list of all non-overlapping matches (as strings, or as tuples when the pattern contains multiple groups), while `re.finditer` returns an iterator over `match` objects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## re.split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`re.split(pattern, string[, maxsplit])` splits the string at each place where `pattern` matches." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## re.sub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`re.sub(pattern, repl, string[, count])` replaces the parts of `string` matched by `pattern` with `repl`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## re.compile" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`re.compile(pattern)` returns a compiled `pattern` object, which provides methods for matching, substituting, and splitting strings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular Expression Syntax" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A regular expression consists of ordinary characters and metacharacters. Ordinary characters, such as upper- and lowercase letters and digits, match themselves, whereas metacharacters have special meanings:\n", "\n", "Subexpression|Matches\n", "---|---\n", "`.`| Any character except a newline\n", "`\\w` | Any letter, digit, or underscore\n", "`\\d` | Any digit; equivalent to 
`[0-9]`\n", "`\\s` | Any whitespace character; equivalent to `[ \\t\\n\\r\\f\\v]`\n", "`\\W`, `\\D`, `\\S` | The complement of the corresponding lowercase class\n", "`[...]` | A set of characters to match; ranges such as `a-z` and `0-9` are supported\n", "`(...)` | Groups the enclosed subexpression so that it is matched as a unit\n", "`\\|` | Alternation (logical OR): matches either the expression on its left or the one on its right\n", "`^` | At the start of a `[...]` set, matches the complement of the set; elsewhere, matches the start of the string\n", "`*` | Matches the preceding subexpression 0 or more times\n", "`+` | Matches the preceding subexpression 1 or more times\n", "`?` | Matches the preceding subexpression 0 or 1 time\n", "`{m}` | Matches the preceding subexpression exactly m times\n", "`{m,}` | Matches the preceding subexpression at least m times\n", "`{m,n}` | Matches the preceding subexpression at least m and at most n times\n", "\n", "For example:\n", "\n", "- `ca*t` matches `ct`, `cat`, `caaaat`, ...\n", "- `ab\\d|ac\\d` matches `ab1`, `ac9`, ...\n", "- `([^a-q]bd)` matches `rbd`, `5bd`, ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we want to match a string like this:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<_sre.SRE_Match object at 0x0000000003A5DA80>\n" ] } ], "source": [ "string = 'hello world'\n", "pattern = 'hello (\\w+)'\n", "\n", "match = re.match(pattern, string)\n", "print match" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once a match is found, we can use the `group` method to inspect the matched text:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello world\n" ] } ], "source": [ "if match is not None:\n", "    print match.group(0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "world\n" ] } ], "source": [ "if match is not None:\n", "    print match.group(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can change the contents of `string`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello there\n", "there\n" ] } ], "source": [ "string = 'hello there'\n", "pattern = 'hello (\\w+)'\n", "\n", "match = re.match(pattern, string)\n", "if match is not 
None:\n", "    print match.group(0)\n", "    print match.group(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, `match.group(0)` returns the entire match, while `match.group(1)`, `match.group(2)`, ... return the text matched by each parenthesized group in the pattern, numbered by the position of the opening parenthesis.\n", "\n", "If a `pattern` is going to be used repeatedly, we can compile it in advance:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "there\n" ] } ], "source": [ "pattern1 = re.compile('hello (\\w+)')\n", "\n", "match = pattern1.match(string)\n", "if match is not None:\n", "    print match.group(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because of metacharacters, some special characters must be escaped with `'\\'` to be matched literally; for example, the expression `'\\\\'` matches a single `'\\'`.\n", "\n", "However, `Python` itself also treats the backslash as an escape character in ordinary string literals:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\\\n" ] } ], "source": [ "pattern = '\\\\'\n", "print pattern" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because both the string literal and the regular expression process escapes, we need four backslashes, `'\\\\\\\\'`, in an ordinary string literal to match a single `'\\'`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['C:', 'foo', 'bar', 'baz.txt']\n" ] } ], "source": [ "pattern = '\\\\\\\\'\n", "path = \"C:\\\\foo\\\\bar\\\\baz.txt\"\n", "print re.split(pattern, path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is cumbersome. Fortunately, `Python` provides raw strings (`raw string`), which leave backslashes unprocessed, so the match can be written like this:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['C:', 'foo', 'bar', 'baz.txt']\n" ] } ], "source": [ "pattern = r'\\\\'\n", "path = r\"C:\\foo\\bar\\baz.txt\"\n", "print re.split(pattern, path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the matching rules become too complex, regular expressions are not necessarily a good choice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NumPy's fromregex()" ] }, 
{ "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing test.dat\n" ] } ], "source": [ "%%file test.dat \n", "1312 foo\n", "1534 bar\n", "444 qux" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`fromregex` parses a text file with a regular expression and returns the result as a structured array:\n", "\n", "    fromregex(file, pattern, dtype)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fields in `dtype` correspond one-to-one to the parenthesized groups in `pattern`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(1312L, 'foo') (1534L, 'bar') (444L, 'qux')]\n" ] } ], "source": [ "pattern = r\"(\\d+)\\s+(...)\"\n", "dt = [('num', 'int64'), ('key', 'S3')]\n", "\n", "from numpy import fromregex\n", "output = fromregex('test.dat', pattern, dt)\n", "print output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the `num` field:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1312 1534 444]\n" ] } ], "source": [ "print output['num']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "os.remove('test.dat')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }