{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CHAPTER 7 Data Cleaning and Preparation\n",
"\n",
"其实数据分析中80%的时间都是在数据清理部分,loading, clearning, transforming, rearranging。而pandas非常适合用来执行这些任务。\n",
"\n",
"# 7.1 Handling Missing Data\n",
"\n",
"在pandas中,missing data呈现的方式有些缺点的,但对大部分用户能起到足够的效果。对于数值型数据,pandas用浮点值Nan(Not a Number)来表示缺失值。我们称之为识别符(sentinel value),这种值能被轻易检测到:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 aardvark\n",
"1 artichoke\n",
"2 NaN\n",
"3 avocado\n",
"dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])\n",
"string_data"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"dtype: bool"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在pandas中,我们使用了R语言中的一些传统,把缺失值表示为NA(not available)。在统计应用里,NA数据别是要么是数据不存在,要么是存在但不能被检测到。做数据清理的时候,对缺失值做分析是很重要的,我们要确定是否是数据收集的问题,或者缺失值是否会带来潜在的偏见。\n",
"\n",
"内建的Python None值也被当做NA:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"string_data[0] = None"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 False\n",
"2 True\n",
"3 False\n",
"dtype: bool"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这里有一些用来处理缺失值的函数:\n",
"\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/zq0q8.png)\n",
"\n",
"# 1 Filtering Out Missing Data(过滤缺失值)\n",
"\n",
"有一些方法来过滤缺失值。可以使用pandas.isnull和boolean indexing, 配合使用dropna。对于series,只会返回non-null数据和index values:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from numpy import nan as NA"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = pd.Series([1, NA, 3.5, NA, 7])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 1.0\n",
"2 3.5\n",
"4 7.0\n",
"dtype: float64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"上面的等同于:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 1.0\n",
"2 3.5\n",
"4 7.0\n",
"dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[data.notnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对于DataFrame,会复杂一些。你可能想要删除包含有NA的row和column。dropna默认会删除包含有缺失值的row:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 6.5 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" 6.5 | \n",
" 3.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0\n",
"1 1.0 NaN NaN\n",
"2 NaN NaN NaN\n",
"3 NaN 6.5 3.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],\n",
" [NA, NA, NA], [NA, 6.5, 3.]])\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cleaned = data.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 6.5 | \n",
" 3.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cleaned"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"设定`how=all`只会删除那些全是NA的行:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 6.5 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" 6.5 | \n",
" 3.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0\n",
"1 1.0 NaN NaN\n",
"3 NaN 6.5 3.0"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna(how='all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"删除列也一样,设置axis=1:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 6.5 | \n",
" 3.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" 6.5 | \n",
" 3.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 4\n",
"0 1.0 6.5 3.0 NaN\n",
"1 1.0 NaN NaN NaN\n",
"2 NaN NaN NaN NaN\n",
"3 NaN 6.5 3.0 NaN"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[4] = NA\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 6.5 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" 6.5 | \n",
" 3.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0\n",
"1 1.0 NaN NaN\n",
"2 NaN NaN NaN\n",
"3 NaN 6.5 3.0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna(axis=1, how='all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"一种删除DataFrame row的相关应用是是time series data。假设你想要保留有特定数字的观测结果,可以使用thresh参数:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.DataFrame(np.random.randn(7, 3))"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -0.986575 | \n",
" 0.487466 | \n",
" -0.251823 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.008704 | \n",
" -0.177133 | \n",
" 1.827761 | \n",
"
\n",
" \n",
" 2 | \n",
" 2.240856 | \n",
" -0.587865 | \n",
" 0.273062 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.777182 | \n",
" -0.629568 | \n",
" -0.220044 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.327522 | \n",
" 0.781662 | \n",
" -0.651949 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.454611 | \n",
" -0.170581 | \n",
" -1.740959 | \n",
"
\n",
" \n",
" 6 | \n",
" -0.711897 | \n",
" 0.074983 | \n",
" 1.343807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -0.986575 0.487466 -0.251823\n",
"1 2.008704 -0.177133 1.827761\n",
"2 2.240856 -0.587865 0.273062\n",
"3 0.777182 -0.629568 -0.220044\n",
"4 0.327522 0.781662 -0.651949\n",
"5 1.454611 -0.170581 -1.740959\n",
"6 -0.711897 0.074983 1.343807"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -0.986575 | \n",
" NaN | \n",
" -0.251823 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.008704 | \n",
" NaN | \n",
" 1.827761 | \n",
"
\n",
" \n",
" 2 | \n",
" 2.240856 | \n",
" NaN | \n",
" 0.273062 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.777182 | \n",
" NaN | \n",
" -0.220044 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.327522 | \n",
" 0.781662 | \n",
" -0.651949 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.454611 | \n",
" -0.170581 | \n",
" -1.740959 | \n",
"
\n",
" \n",
" 6 | \n",
" -0.711897 | \n",
" 0.074983 | \n",
" 1.343807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -0.986575 NaN -0.251823\n",
"1 2.008704 NaN 1.827761\n",
"2 2.240856 NaN 0.273062\n",
"3 0.777182 NaN -0.220044\n",
"4 0.327522 0.781662 -0.651949\n",
"5 1.454611 -0.170581 -1.740959\n",
"6 -0.711897 0.074983 1.343807"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.iloc[:4, 1] = NA\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -0.986575 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 2.008704 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 2.240856 | \n",
" NaN | \n",
" 0.273062 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.777182 | \n",
" NaN | \n",
" -0.220044 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.327522 | \n",
" 0.781662 | \n",
" -0.651949 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.454611 | \n",
" -0.170581 | \n",
" -1.740959 | \n",
"
\n",
" \n",
" 6 | \n",
" -0.711897 | \n",
" 0.074983 | \n",
" 1.343807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -0.986575 NaN NaN\n",
"1 2.008704 NaN NaN\n",
"2 2.240856 NaN 0.273062\n",
"3 0.777182 NaN -0.220044\n",
"4 0.327522 0.781662 -0.651949\n",
"5 1.454611 -0.170581 -1.740959\n",
"6 -0.711897 0.074983 1.343807"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.iloc[:2, 2] = NA\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 4 | \n",
" 0.327522 | \n",
" 0.781662 | \n",
" -0.651949 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.454611 | \n",
" -0.170581 | \n",
" -1.740959 | \n",
"
\n",
" \n",
" 6 | \n",
" -0.711897 | \n",
" 0.074983 | \n",
" 1.343807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"4 0.327522 0.781662 -0.651949\n",
"5 1.454611 -0.170581 -1.740959\n",
"6 -0.711897 0.074983 1.343807"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 2.240856 | \n",
" NaN | \n",
" 0.273062 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.777182 | \n",
" NaN | \n",
" -0.220044 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.327522 | \n",
" 0.781662 | \n",
" -0.651949 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.454611 | \n",
" -0.170581 | \n",
" -1.740959 | \n",
"
\n",
" \n",
" 6 | \n",
" -0.711897 | \n",
" 0.074983 | \n",
" 1.343807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"2 2.240856 NaN 0.273062\n",
"3 0.777182 NaN -0.220044\n",
"4 0.327522 0.781662 -0.651949\n",
"5 1.454611 -0.170581 -1.740959\n",
"6 -0.711897 0.074983 1.343807"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dropna(thresh=2) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2 Filling In Missing Data(填补缺失值)\n",
"\n",
"不是删除缺失值,而是用一些数字填补。对于大部分目的,fillna是可以用的。调用fillna的时候设置好一个常用用来替换缺失值:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -0.986575 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.008704 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 2 | \n",
" 2.240856 | \n",
" 0.000000 | \n",
" 0.273062 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.777182 | \n",
" 0.000000 | \n",
" -0.220044 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.327522 | \n",
" 0.781662 | \n",
" -0.651949 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.454611 | \n",
" -0.170581 | \n",
" -1.740959 | \n",
"
\n",
" \n",
" 6 | \n",
" -0.711897 | \n",
" 0.074983 | \n",
" 1.343807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -0.986575 0.000000 0.000000\n",
"1 2.008704 0.000000 0.000000\n",
"2 2.240856 0.000000 0.273062\n",
"3 0.777182 0.000000 -0.220044\n",
"4 0.327522 0.781662 -0.651949\n",
"5 1.454611 -0.170581 -1.740959\n",
"6 -0.711897 0.074983 1.343807"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"给fillna传入一个dict,可以给不同列替换不同的值:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -0.986575 | \n",
" 0.500000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.008704 | \n",
" 0.500000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 2 | \n",
" 2.240856 | \n",
" 0.500000 | \n",
" 0.273062 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.777182 | \n",
" 0.500000 | \n",
" -0.220044 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.327522 | \n",
" 0.781662 | \n",
" -0.651949 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.454611 | \n",
" -0.170581 | \n",
" -1.740959 | \n",
"
\n",
" \n",
" 6 | \n",
" -0.711897 | \n",
" 0.074983 | \n",
" 1.343807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -0.986575 0.500000 0.000000\n",
"1 2.008704 0.500000 0.000000\n",
"2 2.240856 0.500000 0.273062\n",
"3 0.777182 0.500000 -0.220044\n",
"4 0.327522 0.781662 -0.651949\n",
"5 1.454611 -0.170581 -1.740959\n",
"6 -0.711897 0.074983 1.343807"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna({1: 0.5, 2: 0})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"fillna返回一个新对象,但你可以使用in-place来直接更改原有的数据:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -0.986575 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.008704 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 2 | \n",
" 2.240856 | \n",
" 0.000000 | \n",
" 0.273062 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.777182 | \n",
" 0.000000 | \n",
" -0.220044 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.327522 | \n",
" 0.781662 | \n",
" -0.651949 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.454611 | \n",
" -0.170581 | \n",
" -1.740959 | \n",
"
\n",
" \n",
" 6 | \n",
" -0.711897 | \n",
" 0.074983 | \n",
" 1.343807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -0.986575 0.000000 0.000000\n",
"1 2.008704 0.000000 0.000000\n",
"2 2.240856 0.000000 0.273062\n",
"3 0.777182 0.000000 -0.220044\n",
"4 0.327522 0.781662 -0.651949\n",
"5 1.454611 -0.170581 -1.740959\n",
"6 -0.711897 0.074983 1.343807"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"_ = df.fillna(0, inplace=True)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在使用fillna的时候,这种插入法同样能用于reindexing:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -1.151508 | \n",
" 1.185176 | \n",
" -1.766933 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.544729 | \n",
" -0.807814 | \n",
" 0.696087 | \n",
"
\n",
" \n",
" 2 | \n",
" -1.461950 | \n",
" 0.448852 | \n",
" 0.189045 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.559766 | \n",
" 0.341335 | \n",
" 1.469807 | \n",
"
\n",
" \n",
" 4 | \n",
" -0.362789 | \n",
" 1.117338 | \n",
" -0.383870 | \n",
"
\n",
" \n",
" 5 | \n",
" -0.452329 | \n",
" -0.282040 | \n",
" -0.541759 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -1.151508 1.185176 -1.766933\n",
"1 0.544729 -0.807814 0.696087\n",
"2 -1.461950 0.448852 0.189045\n",
"3 0.559766 0.341335 1.469807\n",
"4 -0.362789 1.117338 -0.383870\n",
"5 -0.452329 -0.282040 -0.541759"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.randn(6, 3))\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -1.151508 | \n",
" 1.185176 | \n",
" -1.766933 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.544729 | \n",
" -0.807814 | \n",
" 0.696087 | \n",
"
\n",
" \n",
" 2 | \n",
" -1.461950 | \n",
" NaN | \n",
" 0.189045 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.559766 | \n",
" NaN | \n",
" 1.469807 | \n",
"
\n",
" \n",
" 4 | \n",
" -0.362789 | \n",
" NaN | \n",
" -0.383870 | \n",
"
\n",
" \n",
" 5 | \n",
" -0.452329 | \n",
" NaN | \n",
" -0.541759 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -1.151508 1.185176 -1.766933\n",
"1 0.544729 -0.807814 0.696087\n",
"2 -1.461950 NaN 0.189045\n",
"3 0.559766 NaN 1.469807\n",
"4 -0.362789 NaN -0.383870\n",
"5 -0.452329 NaN -0.541759"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.iloc[2:, 1] = NA\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -1.151508 | \n",
" 1.185176 | \n",
" -1.766933 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.544729 | \n",
" -0.807814 | \n",
" 0.696087 | \n",
"
\n",
" \n",
" 2 | \n",
" -1.461950 | \n",
" NaN | \n",
" 0.189045 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.559766 | \n",
" NaN | \n",
" 1.469807 | \n",
"
\n",
" \n",
" 4 | \n",
" -0.362789 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" -0.452329 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -1.151508 1.185176 -1.766933\n",
"1 0.544729 -0.807814 0.696087\n",
"2 -1.461950 NaN 0.189045\n",
"3 0.559766 NaN 1.469807\n",
"4 -0.362789 NaN NaN\n",
"5 -0.452329 NaN NaN"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.iloc[4:, 2] = NA\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -1.151508 | \n",
" 1.185176 | \n",
" -1.766933 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.544729 | \n",
" -0.807814 | \n",
" 0.696087 | \n",
"
\n",
" \n",
" 2 | \n",
" -1.461950 | \n",
" -0.807814 | \n",
" 0.189045 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.559766 | \n",
" -0.807814 | \n",
" 1.469807 | \n",
"
\n",
" \n",
" 4 | \n",
" -0.362789 | \n",
" -0.807814 | \n",
" 1.469807 | \n",
"
\n",
" \n",
" 5 | \n",
" -0.452329 | \n",
" -0.807814 | \n",
" 1.469807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -1.151508 1.185176 -1.766933\n",
"1 0.544729 -0.807814 0.696087\n",
"2 -1.461950 -0.807814 0.189045\n",
"3 0.559766 -0.807814 1.469807\n",
"4 -0.362789 -0.807814 1.469807\n",
"5 -0.452329 -0.807814 1.469807"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna(method='ffill')"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -1.151508 | \n",
" 1.185176 | \n",
" -1.766933 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.544729 | \n",
" -0.807814 | \n",
" 0.696087 | \n",
"
\n",
" \n",
" 2 | \n",
" -1.461950 | \n",
" -0.807814 | \n",
" 0.189045 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.559766 | \n",
" -0.807814 | \n",
" 1.469807 | \n",
"
\n",
" \n",
" 4 | \n",
" -0.362789 | \n",
" NaN | \n",
" 1.469807 | \n",
"
\n",
" \n",
" 5 | \n",
" -0.452329 | \n",
" NaN | \n",
" 1.469807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -1.151508 1.185176 -1.766933\n",
"1 0.544729 -0.807814 0.696087\n",
"2 -1.461950 -0.807814 0.189045\n",
"3 0.559766 -0.807814 1.469807\n",
"4 -0.362789 NaN 1.469807\n",
"5 -0.452329 NaN 1.469807"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna(method='ffill', limit=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用fillna可以我们做一些颇有创造力的事情。比如,可以传入一个series的平均值或中位数:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 1.000000\n",
"1 3.833333\n",
"2 3.500000\n",
"3 3.833333\n",
"4 7.000000\n",
"dtype: float64"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.Series([1., NA, 3.5, NA, 7])\n",
"data.fillna(data.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面是fillna的一些参数:\n",
"\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/vtzrf.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [py35]",
"language": "python",
"name": "Python [py35]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}