{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CHAPTER 7 Data Cleaning and Preparation\n", "\n", "其实数据分析中80%的时间都是在数据清理部分,loading, clearning, transforming, rearranging。而pandas非常适合用来执行这些任务。\n", "\n", "# 7.1 Handling Missing Data\n", "\n", "在pandas中,missing data呈现的方式有些缺点的,但对大部分用户能起到足够的效果。对于数值型数据,pandas用浮点值Nan(Not a Number)来表示缺失值。我们称之为识别符(sentinel value),这种值能被轻易检测到:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 aardvark\n", "1 artichoke\n", "2 NaN\n", "3 avocado\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])\n", "string_data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在pandas中,我们使用了R语言中的一些传统,把缺失值表示为NA(not available)。在统计应用里,NA数据别是要么是数据不存在,要么是存在但不能被检测到。做数据清理的时候,对缺失值做分析是很重要的,我们要确定是否是数据收集的问题,或者缺失值是否会带来潜在的偏见。\n", "\n", "内建的Python None值也被当做NA:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "string_data[0] = None" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里有一些用来处理缺失值的函数:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/zq0q8.png)\n", "\n", "# 1 Filtering Out Missing Data(过滤缺失值)\n", "\n", "有一些方法来过滤缺失值。可以使用pandas.isnull和boolean indexing, 配合使用dropna。对于series,只会返回non-null数据和index values:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from numpy import nan as NA" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = pd.Series([1, NA, 3.5, NA, 7])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "2 3.5\n", "4 7.0\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "上面的等同于:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "2 3.5\n", "4 7.0\n", "dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data.notnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于DataFrame,会复杂一些。你可能想要删除包含有NA的row和column。dropna默认会删除包含有缺失值的row:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],\n", " [NA, NA, NA], [NA, 6.5, 3.]])\n", "data" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cleaned = data.dropna()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cleaned" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "设定`how=all`只会删除那些全是NA的行:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna(how='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "删除列也一样,设置axis=1:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0124
01.06.53.0NaN
11.0NaNNaNNaN
2NaNNaNNaNNaN
3NaN6.53.0NaN
\n", "
" ], "text/plain": [ " 0 1 2 4\n", "0 1.0 6.5 3.0 NaN\n", "1 1.0 NaN NaN NaN\n", "2 NaN NaN NaN NaN\n", "3 NaN 6.5 3.0 NaN" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[4] = NA\n", "data" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna(axis=1, how='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "一种删除DataFrame row的相关应用是是time series data。假设你想要保留有特定数字的观测结果,可以使用thresh参数:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.DataFrame(np.random.randn(7, 3))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.9865750.487466-0.251823
12.008704-0.1771331.827761
22.240856-0.5878650.273062
30.777182-0.629568-0.220044
40.3275220.781662-0.651949
51.454611-0.170581-1.740959
6-0.7118970.0749831.343807
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.986575 0.487466 -0.251823\n", "1 2.008704 -0.177133 1.827761\n", "2 2.240856 -0.587865 0.273062\n", "3 0.777182 -0.629568 -0.220044\n", "4 0.327522 0.781662 -0.651949\n", "5 1.454611 -0.170581 -1.740959\n", "6 -0.711897 0.074983 1.343807" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.986575NaN-0.251823
12.008704NaN1.827761
22.240856NaN0.273062
30.777182NaN-0.220044
40.3275220.781662-0.651949
51.454611-0.170581-1.740959
6-0.7118970.0749831.343807
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.986575 NaN -0.251823\n", "1 2.008704 NaN 1.827761\n", "2 2.240856 NaN 0.273062\n", "3 0.777182 NaN -0.220044\n", "4 0.327522 0.781662 -0.651949\n", "5 1.454611 -0.170581 -1.740959\n", "6 -0.711897 0.074983 1.343807" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[:4, 1] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.986575NaNNaN
12.008704NaNNaN
22.240856NaN0.273062
30.777182NaN-0.220044
40.3275220.781662-0.651949
51.454611-0.170581-1.740959
6-0.7118970.0749831.343807
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.986575 NaN NaN\n", "1 2.008704 NaN NaN\n", "2 2.240856 NaN 0.273062\n", "3 0.777182 NaN -0.220044\n", "4 0.327522 0.781662 -0.651949\n", "5 1.454611 -0.170581 -1.740959\n", "6 -0.711897 0.074983 1.343807" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[:2, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
40.3275220.781662-0.651949
51.454611-0.170581-1.740959
6-0.7118970.0749831.343807
\n", "
" ], "text/plain": [ " 0 1 2\n", "4 0.327522 0.781662 -0.651949\n", "5 1.454611 -0.170581 -1.740959\n", "6 -0.711897 0.074983 1.343807" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
22.240856NaN0.273062
30.777182NaN-0.220044
40.3275220.781662-0.651949
51.454611-0.170581-1.740959
6-0.7118970.0749831.343807
\n", "
" ], "text/plain": [ " 0 1 2\n", "2 2.240856 NaN 0.273062\n", "3 0.777182 NaN -0.220044\n", "4 0.327522 0.781662 -0.651949\n", "5 1.454611 -0.170581 -1.740959\n", "6 -0.711897 0.074983 1.343807" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(thresh=2) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 Filling In Missing Data(填补缺失值)\n", "\n", "不是删除缺失值,而是用一些数字填补。对于大部分目的,fillna是可以用的。调用fillna的时候设置好一个常用用来替换缺失值:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.9865750.0000000.000000
12.0087040.0000000.000000
22.2408560.0000000.273062
30.7771820.000000-0.220044
40.3275220.781662-0.651949
51.454611-0.170581-1.740959
6-0.7118970.0749831.343807
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.986575 0.000000 0.000000\n", "1 2.008704 0.000000 0.000000\n", "2 2.240856 0.000000 0.273062\n", "3 0.777182 0.000000 -0.220044\n", "4 0.327522 0.781662 -0.651949\n", "5 1.454611 -0.170581 -1.740959\n", "6 -0.711897 0.074983 1.343807" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "给fillna传入一个dict,可以给不同列替换不同的值:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.9865750.5000000.000000
12.0087040.5000000.000000
22.2408560.5000000.273062
30.7771820.500000-0.220044
40.3275220.781662-0.651949
51.454611-0.170581-1.740959
6-0.7118970.0749831.343807
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.986575 0.500000 0.000000\n", "1 2.008704 0.500000 0.000000\n", "2 2.240856 0.500000 0.273062\n", "3 0.777182 0.500000 -0.220044\n", "4 0.327522 0.781662 -0.651949\n", "5 1.454611 -0.170581 -1.740959\n", "6 -0.711897 0.074983 1.343807" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna({1: 0.5, 2: 0})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "fillna返回一个新对象,但你可以使用in-place来直接更改原有的数据:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.9865750.0000000.000000
12.0087040.0000000.000000
22.2408560.0000000.273062
30.7771820.000000-0.220044
40.3275220.781662-0.651949
51.454611-0.170581-1.740959
6-0.7118970.0749831.343807
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.986575 0.000000 0.000000\n", "1 2.008704 0.000000 0.000000\n", "2 2.240856 0.000000 0.273062\n", "3 0.777182 0.000000 -0.220044\n", "4 0.327522 0.781662 -0.651949\n", "5 1.454611 -0.170581 -1.740959\n", "6 -0.711897 0.074983 1.343807" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_ = df.fillna(0, inplace=True)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在使用fillna的时候,这种插入法同样能用于reindexing:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.1515081.185176-1.766933
10.544729-0.8078140.696087
2-1.4619500.4488520.189045
30.5597660.3413351.469807
4-0.3627891.117338-0.383870
5-0.452329-0.282040-0.541759
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.151508 1.185176 -1.766933\n", "1 0.544729 -0.807814 0.696087\n", "2 -1.461950 0.448852 0.189045\n", "3 0.559766 0.341335 1.469807\n", "4 -0.362789 1.117338 -0.383870\n", "5 -0.452329 -0.282040 -0.541759" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.randn(6, 3))\n", "df" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.1515081.185176-1.766933
10.544729-0.8078140.696087
2-1.461950NaN0.189045
30.559766NaN1.469807
4-0.362789NaN-0.383870
5-0.452329NaN-0.541759
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.151508 1.185176 -1.766933\n", "1 0.544729 -0.807814 0.696087\n", "2 -1.461950 NaN 0.189045\n", "3 0.559766 NaN 1.469807\n", "4 -0.362789 NaN -0.383870\n", "5 -0.452329 NaN -0.541759" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2:, 1] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.1515081.185176-1.766933
10.544729-0.8078140.696087
2-1.461950NaN0.189045
30.559766NaN1.469807
4-0.362789NaNNaN
5-0.452329NaNNaN
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.151508 1.185176 -1.766933\n", "1 0.544729 -0.807814 0.696087\n", "2 -1.461950 NaN 0.189045\n", "3 0.559766 NaN 1.469807\n", "4 -0.362789 NaN NaN\n", "5 -0.452329 NaN NaN" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[4:, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.1515081.185176-1.766933
10.544729-0.8078140.696087
2-1.461950-0.8078140.189045
30.559766-0.8078141.469807
4-0.362789-0.8078141.469807
5-0.452329-0.8078141.469807
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.151508 1.185176 -1.766933\n", "1 0.544729 -0.807814 0.696087\n", "2 -1.461950 -0.807814 0.189045\n", "3 0.559766 -0.807814 1.469807\n", "4 -0.362789 -0.807814 1.469807\n", "5 -0.452329 -0.807814 1.469807" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.1515081.185176-1.766933
10.544729-0.8078140.696087
2-1.461950-0.8078140.189045
30.559766-0.8078141.469807
4-0.362789NaN1.469807
5-0.452329NaN1.469807
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.151508 1.185176 -1.766933\n", "1 0.544729 -0.807814 0.696087\n", "2 -1.461950 -0.807814 0.189045\n", "3 0.559766 -0.807814 1.469807\n", "4 -0.362789 NaN 1.469807\n", "5 -0.452329 NaN 1.469807" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill', limit=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用fillna可以我们做一些颇有创造力的事情。比如,可以传入一个series的平均值或中位数:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.000000\n", "1 3.833333\n", "2 3.500000\n", "3 3.833333\n", "4 7.000000\n", "dtype: float64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series([1., NA, 3.5, NA, 7])\n", "data.fillna(data.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下面是fillna的一些参数:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/vtzrf.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [py35]", "language": "python", "name": "Python [py35]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }