{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 7.3 String Manipulation(字符串处理)\n", "\n", "python很多内建方法很适合处理string。而且对于更复杂的模式,可以配合使用正则表达式。而pandas则混合了两种方式。\n", "\n", "# 1 String Object Methods(字符串对象方法)\n", "\n", "大部分string处理,使用内建的一些方法就足够了。比如,可以用split来分割用逗号区分的字符串:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "val = 'a,b, guido'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['a', 'b', ' guido']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val.split(',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "split经常和strip一起搭配使用来去除空格(包括换行符):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['a', 'b', 'guido']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pieces = [x.strip() for x in val.split(',')]\n", "pieces" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以使用+号把::和字符串连起来:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "first, second, third = pieces" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'a::b::guido'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first + '::' + second + '::' + third" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "但这种方法并不python,更快的方法是直接用join方法:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'a::b::guido'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'::'.join(pieces)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"其他一些方法适合锁定子字符串位置相关的。用in关键字是检测substring最好的方法,当然,index和find也能完成任务:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'guido' in val" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val.index(',')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val.find(':')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注意index和find的区别。如果要找的string不存在的话,index会报错。而find会返回-1:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "ename": "ValueError", "evalue": "substring not found", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mval\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m':'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mValueError\u001b[0m: substring not found" ] } ], "source": [ "val.index(':')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "count会返回一个substring出现的次数:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val.count(',')" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "replace会取代一种出现方式(pattern)。也通常用于删除pattern,传入一个空字符串即可:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'a::b:: guido'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val.replace(',', '::')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'ab guido'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val.replace(',', '')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里一些内建的string方法:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/m643y.png)\n", "\n", "# 2 Regular Expressions(正则表达式)\n", "\n", "正则表达式能让我们寻找更复杂的pattern。通常称一个表达式为regex,由正则表达语言来代表一个字符串模式。可以使用python内建的re模块来使用。\n", "\n", "> 关于正则表达式,有很多教学资源,可以自己找几篇来学一些,这里不会介绍太多。\n", "\n", "re模块有以下三个类别:patther matching(模式匹配), substitution(替换), splitting(分割)。通常这三种都是相关的,一个regex用来描述一种pattern,这样会有很多种用法。这里举个例子,假设我们想要根据空格(tabs,spaces,newlines)来分割一个字符串。用于描述一个或多个空格的regex是`\\s+`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "text = \"foo bar\\t baz \\tqux\"" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['foo', 'bar', 'baz', 'qux']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split('\\s+', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "当调用`re.split('\\s+', text)`的时候,正则表达式第一次被compile编译,并且split方法会被调用搜索text。我们可以自己编译regex,用re.compile,可以生成一个可以多次使用的regex object:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "regex = 
re.compile(r'\\s+')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['foo', 'bar', 'baz', 'qux']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.split(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a list of all substrings matching the regex, use the `findall` method:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[' ', '\\t ', ' \\t']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> To avoid unwanted escaping with `\\` in a regular expression, use raw string literals such as `r'C:\\x'` instead of the equivalent `'C:\\\\x'`.\n", "\n", "Creating a regex object with `re.compile` is highly recommended if you intend to apply the same expression to many strings; doing so saves CPU cycles.\n", "\n", "`match` and `search` are closely related to `findall`. While `findall` returns all matches in a string, `search` returns only the first match. More rigidly, `match` only matches at the beginning of the string. As an example, consider a block of text and a regular expression capable of identifying most email addresses:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "text = \"\"\"Dave dave@google.com \n", " Steve steve@gmail.com \n", " Rob rob@gmail.com \n", " Ryan ryan@yahoo.com \"\"\"\n", "\n", "pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# re.IGNORECASE makes the regex case-insensitive \n", "regex = re.compile(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `findall` on the text produces a list of the email addresses:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"search返回text中的第一个匹配结果。match object能告诉我们找到的结果在text中开始和结束的位置:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "m = regex.search(text)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'dave@google.com'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[m.start():m.end()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "regex.match返回None,因为它只会在pattern存在于stirng开头的情况下才会返回匹配结果:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "print(regex.match(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "而sub返回一个新的string,把pattern出现的地方替换为我们指定的string:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dave REDACTED \n", " Steve REDACTED \n", " Rob REDACTED \n", " Ryan REDACTED \n" ] } ], "source": [ "print(regex.sub('REDACTED', text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "假设你想要找到邮件地址,同时,想要把邮件地址分为三个部分,username, domain name, and domain suffix.(用户名,域名,域名后缀)。需要给每一个pattern加一个括号:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "regex = re.compile(pattern, flags=re.IGNORECASE)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "match object会返回一个tuple,包含多个pattern组份,通过groups方法:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "m = regex.match('wesm@bright.net')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('wesm', 'bright', 'net')" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.groups()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "findall会返回a list of tuples:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('dave', 'google', 'com'),\n", " ('steve', 'gmail', 'com'),\n", " ('rob', 'gmail', 'com'),\n", " ('ryan', 'yahoo', 'com')]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "sub也能访问groups的结果,不过要使用特殊符号 \\1, \\2。\\1表示第一个匹配的group,\\2表示第二个匹配的group,以此类推:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dave Username: dave, Domain: google, Suffix: com \n", " Steve Username: steve, Domain: gmail, Suffix: com \n", " Rob Username: rob, Domain: gmail, Suffix: com \n", " Ryan Username: ryan, Domain: yahoo, Suffix: com \n" ] } ], "source": [ "print(regex.sub(r'Username: \\1, Domain: \\2, Suffix: \\3', text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里给一些正则表达式的方法:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/mj4vc.png)\n", "\n", "# 3 Vectorized String Functions in pandas(pandas中的字符串向量化函数)\n", "\n", "一些复杂的数据清理中,string会有缺失值:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { 
"cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', \n", " 'Rob': 'rob@gmail.com', 'Wes': np.nan}" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Dave dave@google.com\n", "Rob rob@gmail.com\n", "Steve steve@gmail.com\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series(data)\n", "data" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Dave False\n", "Rob False\n", "Steve False\n", "Wes True\n", "dtype: bool" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以把一些字符串方法和正则表达式(用lambda或其他函数)用于每一个value上,通过data.map,但是这样会得到NA(null)值。为了解决这个问题,series有一些数组导向的方法可以用于字符串操作,来跳过NA值。这些方法可以通过series的str属性;比如,我们想检查每个电子邮箱地址是否有'gmail' with str.contains:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Dave False\n", "Rob True\n", "Steve True\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str.contains('gmail')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "正则表达式也可以用,配合任意的re选项,比如IGNORECASE:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\\\.([A-Z]{2,4})'" ] }, "execution_count": 43, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Dave [(dave, google, com)]\n", "Rob [(rob, gmail, com)]\n", "Steve [(steve, gmail, com)]\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str.findall(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "有很多方法用于向量化。比如str.get或index索引到str属性:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.\n", " if __name__ == '__main__':\n" ] }, { "data": { "text/plain": [ "Dave (dave, google, com)\n", "Rob (rob, gmail, com)\n", "Steve (steve, gmail, com)\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matches = data.str.match(pattern, flags=re.IGNORECASE)\n", "matches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "为了访问嵌套list里的元素,我们可以传入一个index给函数:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Dave google\n", "Rob gmail\n", "Steve gmail\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matches.str.get(1)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Dave dave\n", "Rob rob\n", "Steve steve\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matches.str.get(0)" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "也可以使用这个语法进行切片:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Dave dave@\n", "Rob rob@g\n", "Steve steve\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里有一些字符串向量化的方法:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/owc7z.png)\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/cn2y0.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [py35]", "language": "python", "name": "Python [py35]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }