{
 "cells": [
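  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regular Expressions and the re Module"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regular Expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A [regular expression](http://baike.baidu.com/view/94238.htm) is a pattern used to match strings or substrings. The pattern can be very specific or very general.\n",
    "\n",
    "The `Python` standard library provides the `re` module for working with regular expressions."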
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## re.match & re.search"
   ]
  },
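  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`re.match` and `re.search` are two of the most commonly used functions in the `re` module:\n",
    "\n",
    "    re.match(pattern, string[, flags])\n",
    "    re.search(pattern, string[, flags])\n",
    "\n",
    "Both look for the first successful match, returning a `match` object on success and `None` on failure. The difference is that `re.match` only matches at the beginning of the string, while `re.search` scans the whole string for a matching substring."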
   ]
  },
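  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between the two functions can be sketched as follows. The sample text and the variable names are made up for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'say hello world'  # hypothetical sample text\n",
    "\n",
    "# re.match only tries to match at the start of the string\n",
    "at_start = re.match(r'hello', text)\n",
    "# re.search scans forward until it finds a match\n",
    "anywhere = re.search(r'hello', text)\n",
    "\n",
    "print(at_start)           # None: the text does not begin with 'hello'\n",
    "print(anywhere.group(0))  # hello\n",
    "print(anywhere.start())   # 4"
   ]
  },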
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## re.findall & re.finditer"
   ]
  },
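  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`re.findall(pattern, string)` returns a list of all non-overlapping matches (strings, or tuples of strings if the pattern contains more than one group), while `re.finditer` returns an iterator over `match` objects."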
   ]
  },
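  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of both functions, assuming a made-up sample string. `re.finditer` is the one to reach for when the positions of the matches matter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'a1 b22 c333'  # hypothetical sample text\n",
    "\n",
    "# findall returns a list of the matched substrings\n",
    "numbers = re.findall(r'\\d+', text)\n",
    "print(numbers)  # ['1', '22', '333']\n",
    "\n",
    "# finditer yields match objects, which also carry positions\n",
    "spans = [(m.group(0), m.start()) for m in re.finditer(r'\\d+', text)]\n",
    "print(spans)  # [('1', 1), ('22', 4), ('333', 8)]"
   ]
  },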
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## re.split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`re.split(pattern, string[, maxsplit])` splits the string at every place where `pattern` matches."
   ]
  },
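  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration, the sketch below splits a made-up line on mixed delimiters and then repeats the call with `maxsplit`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "line = 'a, b;c   d'  # hypothetical sample text\n",
    "\n",
    "# split on any run of commas, semicolons, or whitespace\n",
    "parts = re.split(r'[,;\\s]+', line)\n",
    "print(parts)  # ['a', 'b', 'c', 'd']\n",
    "\n",
    "# maxsplit caps the number of splits; the rest of the string is left intact\n",
    "first = re.split(r'[,;\\s]+', line, maxsplit=1)\n",
    "print(first)  # ['a', 'b;c   d']"
   ]
  },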
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## re.sub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`re.sub(pattern, repl, string[, count])` replaces the parts of `string` matched by `pattern` with `repl`."
   ]
  },
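  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short sketch of three common uses, with made-up sample text: a plain replacement, a replacement that refers back to groups in the pattern, and a replacement limited by `count`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = '2015-07-04 and 2016-01-31'  # hypothetical sample text\n",
    "\n",
    "# replace every run of digits with '#'\n",
    "masked = re.sub(r'\\d+', '#', text)\n",
    "print(masked)  # #-#-# and #-#-#\n",
    "\n",
    "# \\1, \\2, \\3 in repl refer to the groups captured by the pattern\n",
    "swapped = re.sub(r'(\\d+)-(\\d+)-(\\d+)', r'\\3/\\2/\\1', text)\n",
    "print(swapped)  # 04/07/2015 and 31/01/2016\n",
    "\n",
    "# count limits how many matches are replaced\n",
    "once = re.sub(r'\\d+', '#', text, count=1)\n",
    "print(once)  # #-07-04 and 2016-01-31"
   ]
  },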
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## re.compile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`re.compile(pattern)` returns a compiled `pattern` object, which provides methods for matching, substituting, and splitting strings."
   ]
  },
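  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sketch below compiles a pattern once and then calls several of the compiled object's methods. The pattern and the sample text are made up for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# compile once, then reuse the methods of the pattern object\n",
    "word = re.compile(r'[a-z]+')\n",
    "text = 'abc 123 def'  # hypothetical sample text\n",
    "\n",
    "print(word.match(text).group(0))  # abc\n",
    "print(word.findall(text))         # ['abc', 'def']\n",
    "print(word.sub('X', text))        # X 123 X\n",
    "print(word.split(text))           # ['', ' 123 ', '']"
   ]
  },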
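  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regular Expression Syntax"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A regular expression is made up of ordinary characters and metacharacters. Ordinary characters, such as letters and digits, match themselves, while metacharacters have special meanings:\n",
    "\n",
    "Subexpression|Matches\n",
    "---|---\n",
    "`.`| any character except a newline\n",
    "`\\w` | any letter, digit, or underscore\n",
    "`\\d` | any digit, equivalent to `[0-9]` for ASCII text\n",
    "`\\s` | any whitespace character, equivalent to `[ \\t\\n\\r\\f\\v]` for ASCII text\n",
    "`\\W,\\D,\\S`| the complement of the corresponding lowercase class\n",
    "`[...]` | any one character in the set; ranges such as `a-z` and `0-9` are supported\n",
    "`(...)` | the enclosed subexpression, treated as a single group\n",
    "&#124; | logical or (alternation)\n",
    "`^` | the start of the string; as the first character inside `[...]`, the complement of the set\n",
    "`*` | the preceding subexpression 0 or more times\n",
    "`+` | the preceding subexpression 1 or more times\n",
    "`?` | the preceding subexpression 0 or 1 times\n",
    "`{m}` | the preceding subexpression exactly m times\n",
    "`{m,}` | the preceding subexpression at least m times\n",
    "`{m,n}` | the preceding subexpression at least m and at most n times\n",
    "\n",
    "For example:\n",
    "\n",
    "- `ca*t       matches: ct, cat, caaaat, ...`\n",
    "- `ab\\d|ac\\d  matches: ab1, ac9, ...`\n",
    "- `([^a-q]bd) matches: rbd, 5bd, ...`"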
   ]
  },
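  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These rules can be checked directly. The sketch below uses a hypothetical helper, `matches_whole`, which is not part of `re`; it simply anchors the pattern so that it has to match the entire string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def matches_whole(pattern, text):\n",
    "    # hypothetical helper: True if the pattern matches the entire text\n",
    "    return re.match('(?:' + pattern + r')\\Z', text) is not None\n",
    "\n",
    "print(matches_whole(r'ca*t', 'ct'))        # True: * allows zero 'a'\n",
    "print(matches_whole(r'ca*t', 'caaaat'))    # True\n",
    "print(matches_whole(r'ca+t', 'ct'))        # False: + needs at least one 'a'\n",
    "print(matches_whole(r'ab\\d|ac\\d', 'ac9'))  # True: second alternative\n",
    "print(matches_whole(r'[^a-q]bd', 'rbd'))   # True: 'r' is outside a-q\n",
    "print(matches_whole(r'[^a-q]bd', 'abd'))   # False: 'a' is excluded\n",
    "print(matches_whole(r'\\d{2,3}', '1234'))   # False: at most 3 digits"
   ]
  },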
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Suppose we want to match a string like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object at 0x0000000003A5DA80>\n"
     ]
    }
   ],
   "source": [
    "string = 'hello world'\n",
    "pattern = r'hello (\\w+)'  # raw string: the backslash reaches re unchanged\n",
    "\n",
    "match = re.match(pattern, string)\n",
    "print(match)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once a match is found, we can use the `group` method to see what was matched:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hello world\n"
     ]
    }
   ],
   "source": [
    "if match is not None:\n",
    "    print(match.group(0))  # the whole match"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "world\n"
     ]
    }
   ],
   "source": [
    "if match is not None:\n",
    "    print(match.group(1))  # the first parenthesized group"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can change the contents of `string`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hello there\n",
      "there\n"
     ]
    }
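   ],
   "source": [
    "string = 'hello there'\n",
    "pattern = r'hello (\\w+)'\n",
    "\n",
    "match = re.match(pattern, string)\n",
    "if match is not None:\n",
    "    print(match.group(0))\n",
    "    print(match.group(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In general, `match.group(0)` returns the entire match, and `match.group(1)`, `match.group(2)`, ... return the parts matched by each pair of parentheses in the pattern, numbered by the position of the opening parenthesis.\n",
    "\n",
    "If a `pattern` is going to be used repeatedly, we can compile it in advance:"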
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "there\n"
     ]
    }
   ],
   "source": [
    "pattern1 = re.compile(r'hello (\\w+)')\n",
    "\n",
    "match = pattern1.match(string)\n",
    "if match is not None:\n",
    "    print(match.group(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because metacharacters have special meanings, we need to escape them with `'\\'` to match them literally; the expression `'\\\\'` matches a single `'\\'`.\n",
    "\n",
    "However, `Python` string literals handle escape characters in the same way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\\\n"
     ]
    }
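   ],
   "source": [
    "pattern = '\\\\'\n",
    "print(pattern)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because both the string literal and the regular expression process escapes, an ordinary string literal needs four backslashes, `'\\\\\\\\'`, to match a single `'\\'`:"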
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['C:', 'foo', 'bar', 'baz.txt']\n"
     ]
    }
   ],
   "source": [
    "pattern = '\\\\\\\\'\n",
    "path = \"C:\\\\foo\\\\bar\\\\baz.txt\"\n",
    "print(re.split(pattern, path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is quite cumbersome. Fortunately, `Python` provides `raw string` literals, which leave backslashes unprocessed, so the match can be written like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['C:', 'foo', 'bar', 'baz.txt']\n"
     ]
    }
   ],
   "source": [
    "pattern = r'\\\\'\n",
    "path = r\"C:\\foo\\bar\\baz.txt\"\n",
    "print(re.split(pattern, path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the rules are too complicated, a regular expression is not necessarily a good choice."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NumPy's fromregex()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing test.dat\n"
     ]
    }
   ],
   "source": [
    "%%file test.dat \n",
    "1312 foo\n",
    "1534    bar\n",
    "444  qux"
   ]
  },
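  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`fromregex` constructs an array from a text file by parsing it with a regular expression:\n",
    "\n",
    "    fromregex(file, pattern, dtype)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The fields in `dtype` correspond one-to-one to the parenthesized groups in `pattern`:"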
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(1312L, 'foo') (1534L, 'bar') (444L, 'qux')]\n"
     ]
    }
   ],
   "source": [
    "from numpy import fromregex\n",
    "\n",
    "# one group for the number, one for the three-character key\n",
    "pattern = r\"(\\d+)\\s+(...)\"\n",
    "dt = [('num', 'int64'), ('key', 'S3')]\n",
    "\n",
    "output = fromregex('test.dat', pattern, dt)\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show the `num` field:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1312 1534  444]\n"
     ]
    }
   ],
   "source": [
    "print(output['num'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "os.remove('test.dat')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}