{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 7.2 Data Transformation(数据变换)\n", "\n", "# 1 删除重复值\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
6two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4\n", "6 two 4" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],\n", " 'k2': [1, 1, 2, 3, 3, 4, 4]})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DataFrame方法duplicated返回的是一个boolean Series,表示一个row是否是重复的(根据前一行来判断):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 True\n", "dtype: bool" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.duplicated()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "drop_duplicateds返回一个DataFrame,会删除重复的部分:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "上面两种方法都默认考虑所有列;另外,我们可以指定一部分来检测重复值。假设我们只想检测'k1'列的重复值:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2v1
0one10
1two11
2one22
3two33
4one34
5two45
6two46
\n", "
" ], "text/plain": [ " k1 k2 v1\n", "0 one 1 0\n", "1 two 1 1\n", "2 one 2 2\n", "3 two 3 3\n", "4 one 3 4\n", "5 two 4 5\n", "6 two 4 6" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['v1'] = range(7)\n", "data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2v1
0one10
1two11
\n", "
" ], "text/plain": [ " k1 k2 v1\n", "0 one 1 0\n", "1 two 1 1" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop_duplicates(['k1'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "duplicated和drop_duplicated默认保留第一次观测到的数值组合。设置`keep='last'`能返回最后一个:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2v1
0one10
1two11
2one22
3two33
4one34
6two46
\n", "
" ], "text/plain": [ " k1 k2 v1\n", "0 one 1 0\n", "1 two 1 1\n", "2 one 2 2\n", "3 two 3 3\n", "4 one 3 4\n", "6 two 4 6" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop_duplicates(['k1', 'k2'], keep='last')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 Transforming Data Using a Function or Mapping(用函数和映射来转换数据)\n", "\n", "有时候我们可能希望做一些数据转换。比如下面一个例子,有不同种类的肉:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodounces
0bacon4.0
1pulled pork3.0
2bacon12.0
3Pastrami6.0
4corned beef7.5
5Bacon8.0
6pastrami3.0
7honey ham5.0
8nova lox6.0
\n", "
" ], "text/plain": [ " food ounces\n", "0 bacon 4.0\n", "1 pulled pork 3.0\n", "2 bacon 12.0\n", "3 Pastrami 6.0\n", "4 corned beef 7.5\n", "5 Bacon 8.0\n", "6 pastrami 3.0\n", "7 honey ham 5.0\n", "8 nova lox 6.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',\n", " 'Pastrami', 'corned beef', 'Bacon',\n", " 'pastrami', 'honey ham', 'nova lox'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "假设你想加一列,表明每种肉来源的动物是什么。我们可以写一个映射:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "meat_to_animal = {\n", " 'bacon': 'pig',\n", " 'pulled pork': 'pig',\n", " 'pastrami': 'cow',\n", " 'corned beef': 'cow',\n", " 'honey ham': 'pig',\n", " 'nova lox': 'salmon'\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "用于series的map方法接受一个函数,或是一个字典,包含着映射关系,但这里有一个小问题,有些肉是大写,有些是小写。因此,我们先用str.lower把所有的值变为小写:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 bacon\n", "1 pulled pork\n", "2 bacon\n", "3 pastrami\n", "4 corned beef\n", "5 bacon\n", "6 pastrami\n", "7 honey ham\n", "8 nova lox\n", "Name: food, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lowercased = data['food'].str.lower()\n", "lowercased" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodouncesanimal
0bacon4.0pig
1pulled pork3.0pig
2bacon12.0pig
3Pastrami6.0cow
4corned beef7.5cow
5Bacon8.0pig
6pastrami3.0cow
7honey ham5.0pig
8nova lox6.0salmon
\n", "
" ], "text/plain": [ " food ounces animal\n", "0 bacon 4.0 pig\n", "1 pulled pork 3.0 pig\n", "2 bacon 12.0 pig\n", "3 Pastrami 6.0 cow\n", "4 corned beef 7.5 cow\n", "5 Bacon 8.0 pig\n", "6 pastrami 3.0 cow\n", "7 honey ham 5.0 pig\n", "8 nova lox 6.0 salmon" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['animal'] = lowercased.map(meat_to_animal)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们也可以用一个函数解决上面的问题:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 pig\n", "1 pig\n", "2 pig\n", "3 cow\n", "4 cow\n", "5 pig\n", "6 cow\n", "7 pig\n", "8 salmon\n", "Name: food, dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['food'].map(lambda x: meat_to_animal[x.lower()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用map是一个很简便的方法,用于element-wise转换和其他一些数据清洗操作。\n", "\n", "# 3 Replacing Values(替换值)\n", "\n", "其实fillna是一个特殊换的替换操作。map可以用于修改一个object里的部分值,但是replace能提供一个更简单和更灵活的方法做到这点。下面是一个series:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 -999.0\n", "2 2.0\n", "3 -999.0\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series([1., -999., 2., -999., -1000., 3.])\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里-999可能是用来表示缺失值的标识符。用NA来替代的话,用replace,会产生一个新series(除非使用inplace=True):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.replace(-999, np.nan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果想要一次替换多个值,直接用一个list即可:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 NaN\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.replace([-999, -1000], np.nan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于不同的值用不同的替换值,也是导入一个list:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 0.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.replace([-999, -1000], [np.nan, 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "参数也可以是一个dict:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 0.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.replace({-999: np.nan, -1000: 0})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注意:data.replace方法和data.str.replace方法是不同的,后者会对string进行element-wise替换。\n", "\n", "# 4 Renaming Axis Indexes(重命名Axis Indexes)\n", "\n", "\n", "像是series里的value一样,axis label也能类似地是函数或映射来转换,产生一个新的object。当然也可以设置in-place不产生新的数据:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0123
Colorado4567
New York891011
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 1 2 3\n", "Colorado 4 5 6 7\n", "New York 8 9 10 11" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame(np.arange(12).reshape((3, 4)),\n", " index=['Ohio', 'Colorado', 'New York'],\n", " columns=['one', 'two', 'three', 'four'])\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "与series相同,axis index有一个map方法:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ ">" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transform = lambda x: x[:4].upper()\n", "transform" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['Ohio', 'Colorado', 'New York'], dtype='object')" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['OHIO', 'COLO', 'NEW '], dtype=object)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index.map(transform)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以赋值给index,以in-place的方式修改DataFrame:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
OHIO0123
COLO4567
NEW891011
\n", "
" ], "text/plain": [ " one two three four\n", "OHIO 0 1 2 3\n", "COLO 4 5 6 7\n", "NEW 8 9 10 11" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index = data.index.map(transform)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果你想要创建一个转换后的版本,而且不用修改原始的数据,可以用rename:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ONETWOTHREEFOUR
Ohio0123
Colo4567
New891011
\n", "
" ], "text/plain": [ " ONE TWO THREE FOUR\n", "Ohio 0 1 2 3\n", "Colo 4 5 6 7\n", "New 8 9 10 11" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.rename(index=str.title, columns=str.upper)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注意,rename能用于dict一样的oject," ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwopekaboofour
INDIANA0123
COLO4567
NEW891011
\n", "
" ], "text/plain": [ " one two pekaboo four\n", "INDIANA 0 1 2 3\n", "COLO 4 5 6 7\n", "NEW 8 9 10 11" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.rename(index={'OHIO': 'INDIANA'},\n", " columns={'three': 'pekaboo'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "rename能让你避免陷入手动赋值给index和columns的杂务中。可以用inplace直接修改原始数据:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
INDIANA0123
COLO4567
NEW891011
\n", "
" ], "text/plain": [ " one two three four\n", "INDIANA 0 1 2 3\n", "COLO 4 5 6 7\n", "NEW 8 9 10 11" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.rename(index={'OHIO': 'INDIANA'}, inplace=True)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5 Discretization and Binning(离散化和装箱)\n", "\n", "\n", "连续型数据经常被离散化或分散成bins(分箱)来分析。假设你有一组数据,你想把人分到不同的年龄组里:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们把这些分到四个bin里,19~25, 26~35, 36~60, >60。可以用pandas里的cut:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]\n", "Length: 12\n", "Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bins = [18, 25, 35, 60, 100]\n", "\n", "cats = pd.cut(ages, bins)\n", "\n", "cats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "返回的是一个特殊的Categorical object。我们看到的结果描述了pandas.cut如何得到bins。可以看作是一个string数组用来表示bin的名字,它内部包含了一个categories数组,用来记录不同类别的名字,并伴有表示ages的label(可以通过codes属性查看):" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.codes" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype='object')" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.categories" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(18, 25] 5\n", "(35, 60] 3\n", "(25, 35] 3\n", "(60, 100] 1\n", "dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.value_counts(cats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里pd.value_counts(cats)是pandas.cut后bin的数量。\n", "\n", "这里我们注意一下区间。括号表示不包含,方括号表示包含。你可以自己设定哪一边关闭(right=False):" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]\n", "Length: 12\n", "Categories (4, object): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(ages, [18, 26, 36, 61, 100], right=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "你也可以用一个list或数组给labels选项来设定bin的名字:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]\n", "Length: 12\n", "Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(ages, bins, labels=group_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果你只是给一个bins的数量来cut,而不是自己设定每个bind的范围,cut会根据最大值和最小值来计算等长的bins。比如下面我们想要做一个均匀分布的四个bins:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = np.random.rand(20)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(0.77, 0.98], (0.33, 0.55], (0.77, 0.98], (0.55, 0.77], (0.55, 0.77], ..., (0.77, 0.98], (0.11, 0.33], (0.11, 0.33], (0.33, 0.55], (0.11, 0.33]]\n", "Length: 20\n", "Categories (4, object): [(0.11, 0.33] < (0.33, 0.55] < (0.55, 0.77] < (0.77, 0.98]]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(data, 4, precision=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "precision=2选项表示精确到小数点后两位。\n", "\n", "一个近似的函数,qcut,会按照数据的分位数来分箱。取决于数据的分布,用cut通常不能保证每一个bin有一个相同数量的数据点。而qcut是按百分比来切的,所以可以得到等数量的bins:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = np.random.randn(1000) # Normally distributed" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(-0.717, -0.0981], (-0.717, -0.0981], (-0.0981, 0.639], (0.639, 3.434], [-2.86, -0.717], ..., (-0.0981, 0.639], (-0.717, -0.0981], (-0.0981, 0.639], (0.639, 3.434], (-0.0981, 0.639]]\n", "Length: 1000\n", "Categories (4, object): [[-2.86, -0.717] < (-0.717, -0.0981] < (-0.0981, 0.639] < (0.639, 3.434]]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats = pd.qcut(data, 4) # Cut into quartiles\n", "cats" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(0.639, 3.434] 250\n", "(-0.0981, 0.639] 250\n", "(-0.717, -0.0981] 250\n", "[-2.86, -0.717] 250\n", "dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.value_counts(cats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "类似的,在cut中我们可以自己指定百分比:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[NaN, NaN, (0.1, 0.5], NaN, NaN, ..., (0.1, 0.5], NaN, (0.5, 0.9], NaN, (0.5, 0.9]]\n", "Length: 1000\n", "Categories (4, object): [(0, 0.1] < (0.1, 0.5] < (0.5, 0.9] < (0.9, 1]]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats2 = pd.cut(data, [0, 0.1, 0.5, 0.9, 1.]) # 累进的百分比\n", "cats2" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(0.1, 0.5] 135\n", "(0.5, 0.9] 124\n", "(0, 0.1] 40\n", "(0.9, 1] 21\n", "dtype: int64" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.value_counts(cats2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在之后的章节我们还会用到cut和qcut,这些离散函数对于量化和群聚分析很有用。\n", "\n", "# 6 Detecting and Filtering Outliers(检测和过滤异常值)\n", "\n", "过滤或转换异常值是数组操作的一个重头戏。下面的DataFrame有正态分布的数据:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
count1000.0000001000.0000001000.0000001000.000000
mean0.0109530.0129280.033165-0.031257
std1.0116211.0133411.0043560.996333
min-2.994342-4.328036-3.303616-3.133495
25%-0.654483-0.662177-0.644982-0.670813
50%-0.000637-0.0332410.050481-0.074641
75%0.7231000.7258390.7084520.643418
max3.3184993.3530013.0028533.002868
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "count 1000.000000 1000.000000 1000.000000 1000.000000\n", "mean 0.010953 0.012928 0.033165 -0.031257\n", "std 1.011621 1.013341 1.004356 0.996333\n", "min -2.994342 -4.328036 -3.303616 -3.133495\n", "25% -0.654483 -0.662177 -0.644982 -0.670813\n", "50% -0.000637 -0.033241 0.050481 -0.074641\n", "75% 0.723100 0.725839 0.708452 0.643418\n", "max 3.318499 3.353001 3.002853 3.002868" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame(np.random.randn(1000, 4))\n", "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "假设我们想要找一个列中,绝对值大于3的数字:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.1237660.9339200.4947550.690507
12.5136360.575393-0.3235900.586833
2-0.335958-0.8437350.302201-0.490675
3-1.307658-0.4856701.6127870.210169
4-0.793757-0.693757-1.7183670.515088
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.123766 0.933920 0.494755 0.690507\n", "1 2.513636 0.575393 -0.323590 0.586833\n", "2 -0.335958 -0.843735 0.302201 -0.490675\n", "3 -1.307658 -0.485670 1.612787 0.210169\n", "4 -0.793757 -0.693757 -1.718367 0.515088" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.494755\n", "1 -0.323590\n", "2 0.302201\n", "3 1.612787\n", "4 -1.718367\n", "Name: 2, dtype: float64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col = data[2]\n", "col.head()" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "339 -3.303616\n", "932 3.002853\n", "Name: 2, dtype: float64" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col[np.abs(col) > 3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "选中所有绝对值大于3的行,可以用any方法在一个boolean DataFrame上:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
0NaNNaNNaNNaN
1NaNNaNNaNNaN
2NaNNaNNaNNaN
3NaNNaNNaNNaN
4NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 NaN NaN NaN NaN\n", "1 NaN NaN NaN NaN\n", "2 NaN NaN NaN NaN\n", "3 NaN NaN NaN NaN\n", "4 NaN NaN NaN NaN" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(np.abs(data) > 3)].head()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
221.0757280.2500000.951303-3.133495
1550.064837-4.3280361.121061-0.574203
224-0.289148-2.9121160.332218-3.129604
339-0.098352-0.610929-3.303616-2.072304
6090.9832401.3726330.0181723.002868
7353.318499-2.573122-1.515901-1.204596
8221.8103963.353001-1.283856-1.166749
856-0.7950703.204789-0.2116421.278828
9320.898147-0.9618503.002853-0.128494
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "22 1.075728 0.250000 0.951303 -3.133495\n", "155 0.064837 -4.328036 1.121061 -0.574203\n", "224 -0.289148 -2.912116 0.332218 -3.129604\n", "339 -0.098352 -0.610929 -3.303616 -2.072304\n", "609 0.983240 1.372633 0.018172 3.002868\n", "735 3.318499 -2.573122 -1.515901 -1.204596\n", "822 1.810396 3.353001 -1.283856 -1.166749\n", "856 -0.795070 3.204789 -0.211642 1.278828\n", "932 0.898147 -0.961850 3.002853 -0.128494" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(np.abs(data) > 3).any(1)] # any中axis=1表示column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下面是把绝对值大于3的数字直接变成-3或3:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data[np.abs(data) > 3] = np.sign(data) * 3" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
21-0.066111-1.1590640.518720-0.284596
221.0757280.2500000.951303-3.000000
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "21 -0.066111 -1.159064 0.518720 -0.284596\n", "22 1.075728 0.250000 0.951303 -3.000000" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[21:23]" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
count1000.0000001000.0000001000.0000001000.000000
mean0.0106340.0136980.033466-0.030997
std1.0106291.0067681.0033830.995521
min-2.994342-3.000000-3.000000-3.000000
25%-0.654483-0.662177-0.644982-0.670813
50%-0.000637-0.0332410.050481-0.074641
75%0.7231000.7258390.7084520.643418
max3.0000003.0000003.0000003.000000
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "count 1000.000000 1000.000000 1000.000000 1000.000000\n", "mean 0.010634 0.013698 0.033466 -0.030997\n", "std 1.010629 1.006768 1.003383 0.995521\n", "min -2.994342 -3.000000 -3.000000 -3.000000\n", "25% -0.654483 -0.662177 -0.644982 -0.670813\n", "50% -0.000637 -0.033241 0.050481 -0.074641\n", "75% 0.723100 0.725839 0.708452 0.643418\n", "max 3.000000 3.000000 3.000000 3.000000" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "np.sign(data)会根据值的正负号来得到1或-1:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.01.01.01.0
11.01.0-1.01.0
2-1.0-1.01.0-1.0
3-1.0-1.01.01.0
4-1.0-1.0-1.01.0
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 1.0 1.0 1.0\n", "1 1.0 1.0 -1.0 1.0\n", "2 -1.0 -1.0 1.0 -1.0\n", "3 -1.0 -1.0 1.0 1.0\n", "4 -1.0 -1.0 -1.0 1.0" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sign(data).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7 Permutation and Random Sampling(排列和随机采样)\n", "\n", "排列(随机排序)一个series或DataFrame中的row,用numpy.random.permutation函数很容易就能做到。调用permutation的时候设定好你想要进行排列的axis,会产生一个整数数组表示新的顺序:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
00123
14567
2891011
312131415
416171819
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 0 1 2 3\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "3 12 13 14 15\n", "4 16 17 18 19" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n", "df" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([4, 3, 2, 1, 0])" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sampler = np.random.permutation(5)\n", "sampler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这个数组能被用在基于iloc上的indexing或take函数:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
416171819
312131415
2891011
14567
00123
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "4 16 17 18 19\n", "3 12 13 14 15\n", "2 8 9 10 11\n", "1 4 5 6 7\n", "0 0 1 2 3" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.take(sampler)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "为了选中一个随机的子集,而且没有代替功能(既不影响原来的值,返回一个新的series或DataFrame),可以用sample方法:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
00123
312131415
416171819
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 0 1 2 3\n", "3 12 13 14 15\n", "4 16 17 18 19" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(n=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果想要生成的样本带有替代功能(即允许重复),给sample中设定replace=True:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4 4\n", "4 4\n", "1 7\n", "1 7\n", "1 7\n", "1 7\n", "4 4\n", "3 6\n", "1 7\n", "1 7\n", "dtype: int64" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "choices = pd.Series([5, 7, -1, 6, 4])\n", "\n", "draws = choices.sample(n=10, replace=True)\n", "\n", "draws" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8 Computing Indicator/Dummy Variables(计算指示器/虚拟变量)\n", "\n", "> Dummy Variables:虚拟变量,又称虚设变量、名义变量或哑变量,用以反映质的属性的一个人工变量,是量化了的自变量,通常取值为0或1。\n", "\n", "另一种在统计模型上的转换或机器学习应用是把一个categorical variable(类别变量)变为一个dummy or indicator matrix(虚拟或指示器矩阵)。如果DataFrame中的一列有k个不同的值,我们可以用一个矩阵或DataFrame用k列来表示,1或0。pandas有一个get_dummies函数实现这个工作,当然,你自己设计一个其实也不难。这里举个例子:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1key
00b
11b
22a
33c
44a
55b
\n", "
" ], "text/plain": [ " data1 key\n", "0 0 b\n", "1 1 b\n", "2 2 a\n", "3 3 c\n", "4 4 a\n", "5 5 b" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],\n", " 'data1': range(6)})\n", "df" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
00.01.00.0
10.01.00.0
21.00.00.0
30.00.01.0
41.00.00.0
50.01.00.0
\n", "
" ], "text/plain": [ " a b c\n", "0 0.0 1.0 0.0\n", "1 0.0 1.0 0.0\n", "2 1.0 0.0 0.0\n", "3 0.0 0.0 1.0\n", "4 1.0 0.0 0.0\n", "5 0.0 1.0 0.0" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(df['key'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在一些情况里,如果我们想要给column加一个prefix, 可以用data.get_dummies里的prefix参数来实现:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dummies = pd.get_dummies(df['key'], prefix='key')" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1key_akey_bkey_c
000.01.00.0
110.01.00.0
221.00.00.0
330.00.01.0
441.00.00.0
550.01.00.0
\n", "
" ], "text/plain": [ " data1 key_a key_b key_c\n", "0 0 0.0 1.0 0.0\n", "1 1 0.0 1.0 0.0\n", "2 2 1.0 0.0 0.0\n", "3 3 0.0 0.0 1.0\n", "4 4 1.0 0.0 0.0\n", "5 5 0.0 1.0 0.0" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_with_dummy = df[['data1']].join(dummies)\n", "df_with_dummy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果DataFrame中的a row属于多个类别,事情会变得复杂一些。我们来看一下MoviesLens 1M 数据集:" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mnames = ['movie_id', 'title', 'genres']" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movie_idtitlegenres
01Toy Story (1995)Animation|Children's|Comedy
12Jumanji (1995)Adventure|Children's|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama
45Father of the Bride Part II (1995)Comedy
56Heat (1995)Action|Crime|Thriller
67Sabrina (1995)Comedy|Romance
78Tom and Huck (1995)Adventure|Children's
89Sudden Death (1995)Action
910GoldenEye (1995)Action|Adventure|Thriller
\n", "
" ], "text/plain": [ " movie_id title genres\n", "0 1 Toy Story (1995) Animation|Children's|Comedy\n", "1 2 Jumanji (1995) Adventure|Children's|Fantasy\n", "2 3 Grumpier Old Men (1995) Comedy|Romance\n", "3 4 Waiting to Exhale (1995) Comedy|Drama\n", "4 5 Father of the Bride Part II (1995) Comedy\n", "5 6 Heat (1995) Action|Crime|Thriller\n", "6 7 Sabrina (1995) Comedy|Romance\n", "7 8 Tom and Huck (1995) Adventure|Children's\n", "8 9 Sudden Death (1995) Action\n", "9 10 GoldenEye (1995) Action|Adventure|Thriller" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies = pd.read_table('../datasets/movielens/movies.dat', sep='::',\n", " header=None, names=mnames, engine='python')\n", "movies[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "给每个genre添加一个指示变量比较麻烦。首先我们先取出所有不同的类别:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['Animation', \"Children's\", 'Comedy', 'Adventure', 'Fantasy',\n", " 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',\n", " 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',\n", " 'Western'], dtype=object)" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_genres = []\n", "\n", "for x in movies.genres:\n", " all_genres.extend(x.split('|'))\n", " \n", "genres = pd.unique(all_genres)\n", "genres" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "一种构建indicator dataframe的方法是先构建一个全是0的DataFrame:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3883, 18)" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zero_matrix = np.zeros((len(movies), len(genres)))\n", "\n", "zero_matrix.shape" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AnimationChildren'sComedyAdventureFantasyRomanceDramaActionCrimeThrillerHorrorSci-FiDocumentaryWarMusicalMysteryFilm-NoirWestern
00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
10.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
20.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
30.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
40.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
\n", "
" ], "text/plain": [ " Animation Children's Comedy Adventure Fantasy Romance Drama Action \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " Crime Thriller Horror Sci-Fi Documentary War Musical Mystery \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " Film-Noir Western \n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 " ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummies = pd.DataFrame(zero_matrix, columns=genres)\n", "dummies.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "然后迭代每一部movie,并设置每一行中的dummies为1。使用dummies.columns来计算每一列的genre的指示器:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Animation', \"Children's\", 'Comedy']" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen = movies.genres[0]\n", "gen.split('|')" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2])" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummies.columns.get_indexer(gen.split('|'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "然后,使用.iloc,根据索引来设定值:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for i, gen in enumerate(movies.genres):\n", " indices = dummies.columns.get_indexer(gen.split('|'))\n", " dummies.iloc[i, indices] = 1" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AnimationChildren'sComedyAdventureFantasyRomanceDramaActionCrimeThrillerHorrorSci-FiDocumentaryWarMusicalMysteryFilm-NoirWestern
01.01.01.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
10.01.00.01.01.00.00.00.00.00.00.00.00.00.00.00.00.00.0
20.00.01.00.00.01.00.00.00.00.00.00.00.00.00.00.00.00.0
30.00.01.00.00.00.01.00.00.00.00.00.00.00.00.00.00.00.0
40.00.01.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
\n", "
" ], "text/plain": [ " Animation Children's Comedy Adventure Fantasy Romance Drama Action \\\n", "0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 \n", "2 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 \n", "3 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 \n", "4 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " Crime Thriller Horror Sci-Fi Documentary War Musical Mystery \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " Film-Noir Western \n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 " ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummies.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "然后,我们可以结合这个和movies:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "movie_id 1\n", "title Toy Story (1995)\n", "genres Animation|Children's|Comedy\n", "Genre_Animation 1\n", "Genre_Children's 1\n", "Genre_Comedy 1\n", "Genre_Adventure 0\n", "Genre_Fantasy 0\n", "Genre_Romance 0\n", "Genre_Drama 0\n", "Genre_Action 0\n", "Genre_Crime 0\n", "Genre_Thriller 0\n", "Genre_Horror 0\n", "Genre_Sci-Fi 0\n", "Genre_Documentary 0\n", "Genre_War 0\n", "Genre_Musical 0\n", "Genre_Mystery 0\n", "Genre_Film-Noir 0\n", "Genre_Western 0\n", "Name: 0, dtype: object" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_windic = movies.join(dummies.add_prefix('Genre_'))\n", "movies_windic.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于一个很大的数据集,这种构建多个成员指示变量的方法并不会加快速度。写一个低层级的函数来直接写一个numpy array,并把写过整合到DataFrame会更快一些。\n", "\n", "一个有用的recipe诀窍是把get_dummies和离散函数(比如cut)结合起来:" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.random.seed(12345)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,\n", " 0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "values = np.random.rand(10)\n", "values" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": true }, "outputs": [], "source": [ "bins = [0, 0.2, 0.4, 0.6, 0.8, 1.]" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(0.8, 1], (0.2, 0.4], (0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.4, 0.6], (0.8, 1], (0.6, 0.8], (0.6, 0.8], (0.6, 0.8]]\n", "Categories (5, object): [(0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(values, bins)" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
(0, 0.2](0.2, 0.4](0.4, 0.6](0.6, 0.8](0.8, 1]
00.00.00.00.01.0
10.01.00.00.00.0
21.00.00.00.00.0
30.01.00.00.00.0
40.00.01.00.00.0
50.00.01.00.00.0
60.00.00.00.01.0
70.00.00.01.00.0
80.00.00.01.00.0
90.00.00.01.00.0
\n", "
" ], "text/plain": [ " (0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1]\n", "0 0.0 0.0 0.0 0.0 1.0\n", "1 0.0 1.0 0.0 0.0 0.0\n", "2 1.0 0.0 0.0 0.0 0.0\n", "3 0.0 1.0 0.0 0.0 0.0\n", "4 0.0 0.0 1.0 0.0 0.0\n", "5 0.0 0.0 1.0 0.0 0.0\n", "6 0.0 0.0 0.0 0.0 1.0\n", "7 0.0 0.0 0.0 1.0 0.0\n", "8 0.0 0.0 0.0 1.0 0.0\n", "9 0.0 0.0 0.0 1.0 0.0" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(pd.cut(values, bins))" ] } ], "metadata": { "kernelspec": { "display_name": "Python [py35]", "language": "python", "name": "Python [py35]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }