{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"toc": "true"
},
"source": [
"
Table of Contents
\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas建造在NumPy之上,它使得以NumPy为中心的应用很容易使用。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import Series, DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas 数据结构入门"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Series\n",
"\n",
"Series 是一个一维的类似的数组对象,包含一个数组的数据(任何 NumPy 的数据类型)和一个与数组关联的数据标签,被叫做 索引 。最简单的 Series 是由一个数组的数据构成:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 4\n",
"1 7\n",
"2 -5\n",
"3 3\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = Series([4, 7, -5, 3])\n",
"obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Seriers 的交互式显示的字符窜表示形式是索引在左边,值在右边。 \n",
"因为我们没有给数据指定索引,一个包含整数 0 到 N-1 (这里 N 是数据的长度)的默认索引被创建。 可以分别的通过它的 values 和 index 属性来获取 Series 的数组表示和索引对象:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 4, 7, -5, 3])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.values"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=4, step=1)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"通常,需要创建一个带有索引来确定没一个数据点的Series:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"d 4\n",
"b 7\n",
"a -5\n",
"c 3\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])\n",
"obj2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"与正规的 NumPy 数组相比,你可以使用索引里的值来选择一个单一值或一个值集:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-5"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2['a']"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"c 3\n",
"a -5\n",
"d 6\n",
"dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2['d'] = 6\n",
"obj2[['c', 'a', 'd']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NumPy 数组操作,例如通过一个布尔数组过滤,纯量乘法,或使用数学函数,将会保持索引和值间的关联:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"d 6\n",
"b 7\n",
"a -5\n",
"c 3\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"d 6\n",
"b 7\n",
"c 3\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2[obj2 > 0]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"d 12\n",
"b 14\n",
"a -10\n",
"c 6\n",
"dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2 * 2"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"d 403.428793\n",
"b 1096.633158\n",
"a 0.006738\n",
"c 20.085537\n",
"dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.exp(obj2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**另一种思考的方式**是,Series 是一个定长的,有序的字典,因为它把索引和值映射起来了。它可以适用于许多期望一个字典的函数:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'b' in obj2"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'e' in obj2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果你有一些数据在一个 Python 字典中,你可以通过传递字典来从这些数据创建一个 Series:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 35000\n",
"Oregon 16000\n",
"Texas 71000\n",
"Utah 5000\n",
"dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj3 = Series(sdata)\n",
"obj3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"只传递一个字典的时候,结果 Series 中的索引将是排序后的字典的键。"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"California NaN\n",
"Ohio 35000.0\n",
"Oregon 16000.0\n",
"Texas 71000.0\n",
"dtype: float64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states = ['California', 'Ohio', 'Oregon', 'Texas']\n",
"obj4 = Series(sdata, index=states)\n",
"obj4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在这种情况下, sdata 中的3个值被放在了合适的位置,但因为没有发现对应于 ‘California’ 的值,就出现了 NaN (不是一个数),这在 pandas 中被用来标记数据缺失或 NA 值。使用 “missing” 或 “NA” 来表示数度丢失。 \n",
"**在 pandas 中用函数 isnull 和 notnull 来检测数据丢失:**"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"California True\n",
"Ohio False\n",
"Oregon False\n",
"Texas False\n",
"dtype: bool"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.isnull(obj4)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"California False\n",
"Ohio True\n",
"Oregon True\n",
"Texas True\n",
"dtype: bool"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.notnull(obj4)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"California True\n",
"Ohio False\n",
"Oregon False\n",
"Texas False\n",
"dtype: bool"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj4.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在许多应用中 Series 的一个重要功能是在算术预算中它会自动对齐不同索引的数据:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 35000\n",
"Oregon 16000\n",
"Texas 71000\n",
"Utah 5000\n",
"dtype: int64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj3"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"California NaN\n",
"Ohio 35000.0\n",
"Oregon 16000.0\n",
"Texas 71000.0\n",
"dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj4"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"California NaN\n",
"Ohio 70000.0\n",
"Oregon 32000.0\n",
"Texas 142000.0\n",
"Utah NaN\n",
"dtype: float64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj3 + obj4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Series 对象本身和它的索引都有一个 name 属性,它和 pandas 的其它一些关键功能整合在一起:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"obj4.name = 'population'"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"state\n",
"California NaN\n",
"Ohio 35000.0\n",
"Oregon 16000.0\n",
"Texas 71000.0\n",
"Name: population, dtype: float64"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj4.index.name = 'state'\n",
"obj4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Series 的索引可以通过赋值就地更改:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Bob 4\n",
"Steve 7\n",
"Jeff -5\n",
"Ryan 3\n",
"dtype: int64"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"一个 Datarame 表示一个表格,类似电子表格的数据结构,包含一个经过排序的列表集,它们每一个都可以有不同的类型值(数字,字符串,布尔等等)。 \n",
"Datarame 有行和列的索引;它可以被看作是一个 Series 的字典(每个 Series 共享一个索引)。 \n",
"与其它你以前使用过的(如 R 的 data.frame )类似 Datarame 的结构相比,在 DataFrame 里的面向行和面向列的操作大致是对称的。 \n",
"在底层,数据是作为一个或多个**二维数组**存储的,而不是列表,字典,或其它一维的数组集合。 \n",
"因为 DataFrame 在内部把数据存储为一个二维数组的格式,因此你可以采用分层索引以表格格式来表示高维的数据。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"有很多方法来构建一个 DataFrame,但最常用的一个是用一个相等长度列表的字典或 NumPy 数组:"
]
},
{
"cell_type": "code",
"execution_count": 229,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],\n",
" 'year': [2000, 2001, 2002, 2001, 2002],\n",
" 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}\n",
"frame = DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"由此产生的 DataFrame 和 Series 一样,它的索引会自动分配,并且对列进行了排序:"
]
},
{
"cell_type": "code",
"execution_count": 230,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pop | \n",
" state | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.5 | \n",
" Ohio | \n",
" 2000 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.7 | \n",
" Ohio | \n",
" 2001 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.6 | \n",
" Ohio | \n",
" 2002 | \n",
"
\n",
" \n",
" 3 | \n",
" 2.4 | \n",
" Nevada | \n",
" 2001 | \n",
"
\n",
" \n",
" 4 | \n",
" 2.9 | \n",
" Nevada | \n",
" 2002 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pop state year\n",
"0 1.5 Ohio 2000\n",
"1 1.7 Ohio 2001\n",
"2 3.6 Ohio 2002\n",
"3 2.4 Nevada 2001\n",
"4 2.9 Nevada 2002"
]
},
"execution_count": 230,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果你设定了一个列的顺序,DataFrame 的列将会精确的按照你所传递的顺序排列:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" state | \n",
" pop | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" Ohio | \n",
" 1.5 | \n",
"
\n",
" \n",
" 1 | \n",
" 2001 | \n",
" Ohio | \n",
" 1.7 | \n",
"
\n",
" \n",
" 2 | \n",
" 2002 | \n",
" Ohio | \n",
" 3.6 | \n",
"
\n",
" \n",
" 3 | \n",
" 2001 | \n",
" Nevada | \n",
" 2.4 | \n",
"
\n",
" \n",
" 4 | \n",
" 2002 | \n",
" Nevada | \n",
" 2.9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year state pop\n",
"0 2000 Ohio 1.5\n",
"1 2001 Ohio 1.7\n",
"2 2002 Ohio 3.6\n",
"3 2001 Nevada 2.4\n",
"4 2002 Nevada 2.9"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame(data, columns=['year', 'state', 'pop'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"和 Series 一样,如果你传递了一个行,但不包括在 data 中,在结果中它会表示为 NA 值:"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" state | \n",
" pop | \n",
" debt | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 2000 | \n",
" Ohio | \n",
" 1.5 | \n",
" NaN | \n",
"
\n",
" \n",
" two | \n",
" 2001 | \n",
" Ohio | \n",
" 1.7 | \n",
" NaN | \n",
"
\n",
" \n",
" three | \n",
" 2002 | \n",
" Ohio | \n",
" 3.6 | \n",
" NaN | \n",
"
\n",
" \n",
" four | \n",
" 2001 | \n",
" Nevada | \n",
" 2.4 | \n",
" NaN | \n",
"
\n",
" \n",
" five | \n",
" 2002 | \n",
" Nevada | \n",
" 2.9 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 NaN\n",
"two 2001 Ohio 1.7 NaN\n",
"three 2002 Ohio 3.6 NaN\n",
"four 2001 Nevada 2.4 NaN\n",
"five 2002 Nevada 2.9 NaN"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four', 'five'])\n",
"frame2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"和 Series 一样,在 DataFrame 中的一列可以通过字典记法或属性来检索:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"one Ohio\n",
"two Ohio\n",
"three Ohio\n",
"four Nevada\n",
"five Nevada\n",
"Name: state, dtype: object"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2['state'] "
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"one 2000\n",
"two 2001\n",
"three 2002\n",
"four 2001\n",
"five 2002\n",
"Name: year, dtype: int64"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2.year"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**注意**,返回的 Series 包含和 DataFrame 相同的索引,并它们的 name 属性也被正确的设置了。\n",
"\n",
"行也可以使用一些方法通过位置或名字来检索,例如 ix 索引成员(field)(更多的将在后面介绍):"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"year 2002\n",
"state Ohio\n",
"pop 3.6\n",
"debt NaN\n",
"Name: three, dtype: object"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2.ix['three']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"列可以通过赋值来修改。例如,空的 ‘debt’ 列可以通过一个纯量或一个数组来赋值:"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" state | \n",
" pop | \n",
" debt | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 2000 | \n",
" Ohio | \n",
" 1.5 | \n",
" 16.5 | \n",
"
\n",
" \n",
" two | \n",
" 2001 | \n",
" Ohio | \n",
" 1.7 | \n",
" 16.5 | \n",
"
\n",
" \n",
" three | \n",
" 2002 | \n",
" Ohio | \n",
" 3.6 | \n",
" 16.5 | \n",
"
\n",
" \n",
" four | \n",
" 2001 | \n",
" Nevada | \n",
" 2.4 | \n",
" 16.5 | \n",
"
\n",
" \n",
" five | \n",
" 2002 | \n",
" Nevada | \n",
" 2.9 | \n",
" 16.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 16.5\n",
"two 2001 Ohio 1.7 16.5\n",
"three 2002 Ohio 3.6 16.5\n",
"four 2001 Nevada 2.4 16.5\n",
"five 2002 Nevada 2.9 16.5"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2['debt'] = 16.5\n",
"frame2"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" state | \n",
" pop | \n",
" debt | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 2000 | \n",
" Ohio | \n",
" 1.5 | \n",
" 0.0 | \n",
"
\n",
" \n",
" two | \n",
" 2001 | \n",
" Ohio | \n",
" 1.7 | \n",
" 1.0 | \n",
"
\n",
" \n",
" three | \n",
" 2002 | \n",
" Ohio | \n",
" 3.6 | \n",
" 2.0 | \n",
"
\n",
" \n",
" four | \n",
" 2001 | \n",
" Nevada | \n",
" 2.4 | \n",
" 3.0 | \n",
"
\n",
" \n",
" five | \n",
" 2002 | \n",
" Nevada | \n",
" 2.9 | \n",
" 4.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 0.0\n",
"two 2001 Ohio 1.7 1.0\n",
"three 2002 Ohio 3.6 2.0\n",
"four 2001 Nevada 2.4 3.0\n",
"five 2002 Nevada 2.9 4.0"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2['debt'] = np.arange(5.)\n",
"frame2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"通过列表或数组给一列赋值时,所赋的值的长度必须和 DataFrame 的长度相匹配。**否则全部为 NA**。\n",
"\n",
"如果你使用 Series 来赋值,它会代替在 DataFrame 中精确匹配的索引的值,并在说有的空洞插入丢失数据:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"val2 = Series([-1.2, -1.5, -1.7])"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" state | \n",
" pop | \n",
" debt | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 2000 | \n",
" Ohio | \n",
" 1.5 | \n",
" NaN | \n",
"
\n",
" \n",
" two | \n",
" 2001 | \n",
" Ohio | \n",
" 1.7 | \n",
" NaN | \n",
"
\n",
" \n",
" three | \n",
" 2002 | \n",
" Ohio | \n",
" 3.6 | \n",
" NaN | \n",
"
\n",
" \n",
" four | \n",
" 2001 | \n",
" Nevada | \n",
" 2.4 | \n",
" NaN | \n",
"
\n",
" \n",
" five | \n",
" 2002 | \n",
" Nevada | \n",
" 2.9 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 NaN\n",
"two 2001 Ohio 1.7 NaN\n",
"three 2002 Ohio 3.6 NaN\n",
"four 2001 Nevada 2.4 NaN\n",
"five 2002 Nevada 2.9 NaN"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2['debt'] = val2\n",
"frame2"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" state | \n",
" pop | \n",
" debt | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 2000 | \n",
" Ohio | \n",
" 1.5 | \n",
" NaN | \n",
"
\n",
" \n",
" two | \n",
" 2001 | \n",
" Ohio | \n",
" 1.7 | \n",
" -1.2 | \n",
"
\n",
" \n",
" three | \n",
" 2002 | \n",
" Ohio | \n",
" 3.6 | \n",
" NaN | \n",
"
\n",
" \n",
" four | \n",
" 2001 | \n",
" Nevada | \n",
" 2.4 | \n",
" -1.5 | \n",
"
\n",
" \n",
" five | \n",
" 2002 | \n",
" Nevada | \n",
" 2.9 | \n",
" -1.7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 NaN\n",
"two 2001 Ohio 1.7 -1.2\n",
"three 2002 Ohio 3.6 NaN\n",
"four 2001 Nevada 2.4 -1.5\n",
"five 2002 Nevada 2.9 -1.7"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2['debt'] = val\n",
"frame2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"给一个不存在的列赋值,将会创建一个新的列。 像字典一样 del 关键字将会删除列:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" state | \n",
" pop | \n",
" debt | \n",
" eastern | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 2000 | \n",
" Ohio | \n",
" 1.5 | \n",
" NaN | \n",
" True | \n",
"
\n",
" \n",
" two | \n",
" 2001 | \n",
" Ohio | \n",
" 1.7 | \n",
" -1.2 | \n",
" True | \n",
"
\n",
" \n",
" three | \n",
" 2002 | \n",
" Ohio | \n",
" 3.6 | \n",
" NaN | \n",
" True | \n",
"
\n",
" \n",
" four | \n",
" 2001 | \n",
" Nevada | \n",
" 2.4 | \n",
" -1.5 | \n",
" False | \n",
"
\n",
" \n",
" five | \n",
" 2002 | \n",
" Nevada | \n",
" 2.9 | \n",
" -1.7 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year state pop debt eastern\n",
"one 2000 Ohio 1.5 NaN True\n",
"two 2001 Ohio 1.7 -1.2 True\n",
"three 2002 Ohio 3.6 NaN True\n",
"four 2001 Nevada 2.4 -1.5 False\n",
"five 2002 Nevada 2.9 -1.7 False"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2['eastern'] = frame2.state == 'Ohio'\n",
"frame2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**注意:**索引 DataFrame 时返回的列是底层数据的一个视窗,而不是一个拷贝。因此,任何在 Series 上的就地修改都会影响 DataFrame。列可以使用 Series 的 copy 函数来显式的拷贝。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**另一种通用的数据形式是一个嵌套的字典的字典格式:**"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pop = {'Nevada': {2001: 2.4, 2002: 2.9},\n",
" ....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pop"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果被传递到 DataFrame,它的外部键会被解释为列索引,内部键会被解释为行索引:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Nevada | \n",
" Ohio | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000 | \n",
" NaN | \n",
" 1.5 | \n",
"
\n",
" \n",
" 2001 | \n",
" 2.4 | \n",
" 1.7 | \n",
"
\n",
" \n",
" 2002 | \n",
" 2.9 | \n",
" 3.6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Nevada Ohio\n",
"2000 NaN 1.5\n",
"2001 2.4 1.7\n",
"2002 2.9 3.6"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame3 = DataFrame(pop)\n",
"frame3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"当然,总是可以对结果转置:"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 2000 | \n",
" 2001 | \n",
" 2002 | \n",
"
\n",
" \n",
" \n",
" \n",
" Nevada | \n",
" NaN | \n",
" 2.4 | \n",
" 2.9 | \n",
"
\n",
" \n",
" Ohio | \n",
" 1.5 | \n",
" 1.7 | \n",
" 3.6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 2000 2001 2002\n",
"Nevada NaN 2.4 2.9\n",
"Ohio 1.5 1.7 3.6"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame3.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"内部字典的键被结合并排序来形成结果的索引。如果指定了一个特定的索引,就不是这样的了:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Nevada | \n",
" Ohio | \n",
"
\n",
" \n",
" \n",
" \n",
" 2001 | \n",
" 2.4 | \n",
" 1.7 | \n",
"
\n",
" \n",
" 2002 | \n",
" 2.9 | \n",
" 3.6 | \n",
"
\n",
" \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Nevada Ohio\n",
"2001 2.4 1.7\n",
"2002 2.9 3.6\n",
"2003 NaN NaN"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame(pop, index=[2001, 2002, 2003])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Series 的字典也以相同的方式来处理:"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pdata = {'Ohio': frame3['Ohio'][:-1],\n",
" ....: 'Nevada': frame3['Nevada'][:2]}"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Nevada': 2000 NaN\n",
" 2001 2.4\n",
" Name: Nevada, dtype: float64, 'Ohio': 2000 1.5\n",
" 2001 1.7\n",
" Name: Ohio, dtype: float64}"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pdata"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Nevada | \n",
" Ohio | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000 | \n",
" NaN | \n",
" 1.5 | \n",
"
\n",
" \n",
" 2001 | \n",
" 2.4 | \n",
" 1.7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Nevada Ohio\n",
"2000 NaN 1.5\n",
"2001 2.4 1.7"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame(pdata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果一个 DataFrame 的 index 和 columns 有它们的 name ,也会被显示出来:"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" state | \n",
" Nevada | \n",
" Ohio | \n",
"
\n",
" \n",
" year | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000 | \n",
" NaN | \n",
" 1.5 | \n",
"
\n",
" \n",
" 2001 | \n",
" 2.4 | \n",
" 1.7 | \n",
"
\n",
" \n",
" 2002 | \n",
" 2.9 | \n",
" 3.6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"state Nevada Ohio\n",
"year \n",
"2000 NaN 1.5\n",
"2001 2.4 1.7\n",
"2002 2.9 3.6"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame3.index.name = 'year'; frame3.columns.name = 'state'\n",
"frame3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"像 Series 一样, values 属性返回一个包含在 DataFrame 中的数据的二维 ndarray:"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ nan, 1.5],\n",
" [ 2.4, 1.7],\n",
" [ 2.9, 3.6]])"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame3.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果 DataFrame 的列有不同的 dtypes,返回值数组将会给所有的列选择一个合适的 dtyps:"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" state | \n",
" pop | \n",
" debt | \n",
" eastern | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 2000 | \n",
" Ohio | \n",
" 1.5 | \n",
" NaN | \n",
" True | \n",
"
\n",
" \n",
" two | \n",
" 2001 | \n",
" Ohio | \n",
" 1.7 | \n",
" -1.2 | \n",
" True | \n",
"
\n",
" \n",
" three | \n",
" 2002 | \n",
" Ohio | \n",
" 3.6 | \n",
" NaN | \n",
" True | \n",
"
\n",
" \n",
" four | \n",
" 2001 | \n",
" Nevada | \n",
" 2.4 | \n",
" -1.5 | \n",
" False | \n",
"
\n",
" \n",
" five | \n",
" 2002 | \n",
" Nevada | \n",
" 2.9 | \n",
" -1.7 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year state pop debt eastern\n",
"one 2000 Ohio 1.5 NaN True\n",
"two 2001 Ohio 1.7 -1.2 True\n",
"three 2002 Ohio 3.6 NaN True\n",
"four 2001 Nevada 2.4 -1.5 False\n",
"five 2002 Nevada 2.9 -1.7 False"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[2000, 'Ohio', 1.5, nan, True],\n",
" [2001, 'Ohio', 1.7, -1.2, True],\n",
" [2002, 'Ohio', 3.6, nan, True],\n",
" [2001, 'Nevada', 2.4, -1.5, False],\n",
" [2002, 'Nevada', 2.9, -1.7, False]], dtype=object)"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**可能的传递到DataFrame的构造器**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|二维ndarray|\t一个数据矩阵,有可选的行标和列标|\n",
"|----------|----------------------------|\n",
"|数组,列表或元组的字典|每一个序列成为DataFrame中的一列。所有的序列必须有相同的长度。|\n",
"|NumPy的结构/记录数组|\t和“数组字典”一样处理|\n",
"|Series的字典|\t每一个值成为一列。如果没有明显的传递索引,将结合每一个Series的索引来形成结果的行索引。|\n",
"|字典的字典|\t每一个内部的字典成为一列。和“Series的字典”一样,结合键值来形成行索引。|\n",
"|字典或Series的列表|\t每一项成为DataFrame中的一列。结合字典键或Series索引形成DataFrame的列标。|\n",
"|列表或元组的列表|\t和“二维ndarray”一样处理|\n",
"|另一个DataFrame|\tDataFrame的索引将被使用,除非传递另外一个|\n",
"|NumPy伪装数组(MaskedArray)|\t除了蒙蔽值在DataFrame中成为NA/丢失数据之外,其它的和“二维ndarray”一样|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 索引对象"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas 的索引对象用来保存坐标轴标签和其它元数据(如坐标轴名或名称)。构建一个 Series 或 DataFrame 时任何数组或其它序列标签在内部转化为索引:"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'a', u'b', u'c'], dtype='object')"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = Series(range(3), index=['a', 'b', 'c'])\n",
"index = obj.index\n",
"index"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'b', u'c'], dtype='object')"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index[1:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"索引对象是不可变的,因此不能由用户改变:"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "Index does not support mutable operations",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mindex\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'd'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m/home/evil_rabbit/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc\u001b[0m in \u001b[0;36m__setitem__\u001b[1;34m(self, key, value)\u001b[0m\n\u001b[0;32m 1235\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1236\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m__setitem__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1237\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Index does not support mutable operations\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1238\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1239\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mTypeError\u001b[0m: Index does not support mutable operations"
]
}
],
"source": [
"index[1] = 'd'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**索引对象的不可变性非常重要,这样它可以在数据结构中结构中安全的共享:**"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index = pd.Index(np.arange(3))\n",
"obj2 = Series([1.5, -2.5, 0], index=index)\n",
"obj2.index is index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下表是库中内建的索引类清单。通过一些开发努力,索引可以被子类化,来实现特定坐标轴索引功能。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|索引类型|说明|\n",
"|-------|---|\n",
"|Index|最通用的索引对象,使用Python对象的NumPy数组来表示坐标轴标签。|\n",
"|Int64Index|对整形值的特化索引。|\n",
"|MultiIndex|“分层”索引对象,表示单个轴的多层次的索引。可以被认为是类似的元组的数组。|\n",
"|DatetimeIndex|存储纳秒时间戳(使用NumPy的datetime64 dtyppe来表示)。|\n",
"|PeriodIndex|对周期数据(时间间隔的)的特化索引。|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"除了类似于阵列,索引也有类似固定大小集合一样的功能:"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" state | \n",
" Nevada | \n",
" Ohio | \n",
"
\n",
" \n",
" year | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000 | \n",
" NaN | \n",
" 1.5 | \n",
"
\n",
" \n",
" 2001 | \n",
" 2.4 | \n",
" 1.7 | \n",
"
\n",
" \n",
" 2002 | \n",
" 2.9 | \n",
" 3.6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"state Nevada Ohio\n",
"year \n",
"2000 NaN 1.5\n",
"2001 2.4 1.7\n",
"2002 2.9 3.6"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame3"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'Ohio' in frame3.columns"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"2003 in frame3.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"每个索引都有许多关于集合逻辑的方法和属性,且能够解决它所包含的数据的常见问题。这些都总结在下表中。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|方法 | 属性|\n",
"|----|-----|\n",
"|append\t|链接额外的索引对象,产生一个新的索引|\n",
"|diff\t|计算索引的差集|\n",
"|intersection\t|计算交集|\n",
"|union\t|计算并集|\n",
"|isin\t|计算出一个布尔数组表示每一个值是否包含在所传递的集合里|\n",
"|delete\t|计算删除位置i的元素的索引|\n",
"|drop\t|计算删除所传递的值后的索引|\n",
"|insert\t|计算在位置i插入元素后的索引|\n",
"|is_monotonic\t|返回True,如果每一个元素都比它前面的元素大或相等|\n",
"|is_unique\t|返回True,如果索引没有重复的值|\n",
"|unique\t|计算索引的唯一值数组|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 重新索引:reindex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas 对象的一个关键的方法是 reindex ,意味着使数据符合一个新的索引来构造一个新的对象。来看一下下面一个简单的例子:"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"d 4.5\n",
"b 7.2\n",
"a -5.3\n",
"c 3.6\n",
"dtype: float64"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])\n",
"obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在 Series 上调用 reindex 重排数据,使得它符合新的索引,如果那个索引的值不存在就引入缺失数据值: \n",
"**注意:**reindex 并不改变 index 的数值,只是重排数据。(索引对象的不可变性)"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a -5.3\n",
"b 7.2\n",
"c 3.6\n",
"d 4.5\n",
"e NaN\n",
"dtype: float64"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])\n",
"obj2"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a -5.3\n",
"b 7.2\n",
"c 3.6\n",
"d 4.5\n",
"e 0.0\n",
"dtype: float64"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"为了对**时间序列**这样的数据排序,**当重建索引的时候可能想要对值进行内插或填充**。 method 选项可以是你做到这一点,使用一个如 ffill 的方法来向前填充值:"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 blue\n",
"2 purple\n",
"4 yellow\n",
"dtype: object"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj3"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 blue\n",
"1 blue\n",
"2 purple\n",
"3 purple\n",
"4 yellow\n",
"5 yellow\n",
"dtype: object"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj3.reindex(range(6), method='ffill')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下是可用的 method 选项的清单。在此,内差比正向和反向填充更复杂。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- reindex 的 method(内插)选项\n",
"\n",
"|参数|描述|\n",
"|---|---|\n",
"|ffill或pad\t|前向(或进位)填充|\n",
"|bfill或backfill\t|后向(或进位)填充|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对于 DataFrame, reindex 可以改变(行)索引,列或两者。当只传入一个序列时,结果中的行被重新索引了:"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],\n",
"....: columns=['Ohio', 'Texas', 'California'])"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 0 1 2\n",
"c 3 4 5\n",
"d 6 7 8"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"frame2 = frame.reindex(['a', 'b', 'c', 'd'])"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" b | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" c | \n",
" 3.0 | \n",
" 4.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" d | \n",
" 6.0 | \n",
" 7.0 | \n",
" 8.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 0.0 1.0 2.0\n",
"b NaN NaN NaN\n",
"c 3.0 4.0 5.0\n",
"d 6.0 7.0 8.0"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用 **columns** 关键字可以使列重新索引:"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"states = ['Texas', 'Utah', 'California']"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Texas | \n",
" Utah | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" NaN | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 4 | \n",
" NaN | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 7 | \n",
" NaN | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Texas Utah California\n",
"a 1 NaN 2\n",
"c 4 NaN 5\n",
"d 7 NaN 8"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.reindex(columns=states)\n",
"# 再次注意索引对象的不可变性"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"一次可以对两个重新索引,可是插值只在**行侧(0坐标轴)**进行:"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 0 1 2\n",
"c 3 4 5\n",
"d 6 7 8"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Texas | \n",
" Utah | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" NaN | \n",
" 2 | \n",
"
\n",
" \n",
" b | \n",
" 1 | \n",
" NaN | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 4 | \n",
" NaN | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 7 | \n",
" NaN | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Texas Utah California\n",
"a 1 NaN 2\n",
"b 1 NaN 2\n",
"c 4 NaN 5\n",
"d 7 NaN 8"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',\n",
" ....: columns=states)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"正如你将看到的,使用带标签索引的 **ix** 可以把重新索引做的更简单:"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Texas | \n",
" Utah | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1.0 | \n",
" NaN | \n",
" 2.0 | \n",
"
\n",
" \n",
" b | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" c | \n",
" 4.0 | \n",
" NaN | \n",
" 5.0 | \n",
"
\n",
" \n",
" d | \n",
" 7.0 | \n",
" NaN | \n",
" 8.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Texas Utah California\n",
"a 1.0 NaN 2.0\n",
"b NaN NaN NaN\n",
"c 4.0 NaN 5.0\n",
"d 7.0 NaN 8.0"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.ix[['a', 'b', 'c', 'd'], states]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **reindex** 函数的参数\n",
"\n",
"|参数|说明|\n",
"|---|---|\n",
"|index\t|作为索引的新序列。可以是索引实例或任何类似序列的Python数据结构。一个索引被完全使用,没有任何拷贝。|\n",
"|method\t|插值(填充)方法,见表格5-4的选项|\n",
"|fill_value\t|代替重新索引时引入的缺失数据值|\n",
"|limit\t|当前向或后向填充时,最大的填充间隙|\n",
"|level\t|在多层索引上匹配简单索引,否则选择一个子集|\n",
"|copy\t|如果新索引与就的相等则底层数据不会拷贝。默认为True(即始终拷贝)|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 从一个坐标轴删除条目"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"从坐标轴删除一个多或多个条目是很容易的,如果你有一个索引数组或列表且没有这些条目,但是这可能需要一点修改和集合逻辑。 drop 方法将会返回一个新的对象并从坐标轴中删除指定的一个或多个值:"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 0.0\n",
"b 1.0\n",
"d 3.0\n",
"e 4.0\n",
"dtype: float64"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_obj = obj.drop('c')\n",
"new_obj"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 0.0\n",
"b 1.0\n",
"e 4.0\n",
"dtype: float64"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.drop(['d', 'c'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对于 DataFrame,可以从任何坐标轴删除索引值:"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = DataFrame(np.arange(16).reshape((4, 4)),\n",
" ....: index=['Ohio', 'Colorado', 'Utah', 'New York'],\n",
" ....: columns=['one', 'two', 'three', 'four'])"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.drop(['Colorado', 'Ohio'])"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one three four\n",
"Ohio 0 2 3\n",
"Colorado 4 6 7\n",
"Utah 8 10 11\n",
"New York 12 14 15"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.drop('two', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" three | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 2 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 10 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one three\n",
"Ohio 0 2\n",
"Colorado 4 6\n",
"Utah 8 10\n",
"New York 12 14"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.drop(['two', 'four'], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 索引,挑选和过滤"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Series 索引 ( obj[...] )的工作原理类似与 NumPy 索引,除了可以使用 Series 的索引值,也可以仅使用整数来索引。下面是关于这一点的一些例子:"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 2\n",
"b 3\n",
"c 1\n",
"d 5\n",
"dtype: int64"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = Series([2,3,1,5], index=['a', 'b', 'c', 'd'])\n",
"obj"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3, 3)"
]
},
"execution_count": 139,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['b'], obj[1]"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b 3\n",
"a 2\n",
"d 5\n",
"dtype: int64"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj[['b', 'a', 'd']]"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b 3\n",
"d 5\n",
"dtype: int64"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj[[1, 3]] # 索引位置"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"c 1\n",
"dtype: int64"
]
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj[obj < 2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用标签来切片和正常的 Python 切片并不一样,**它会把结束点也包括在内**:"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b 3\n",
"c 1\n",
"dtype: int64"
]
},
"execution_count": 144,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['b':'c']"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 2\n",
"b 5\n",
"c 5\n",
"d 5\n",
"dtype: int64"
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['b':'c'] = 5\n",
"obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"正如上面你所见到的,索引 DataFrame 来检索一个或多个列,可以使用一个单一值或一个序列:"
]
},
{
"cell_type": "code",
"execution_count": 256,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = DataFrame(np.arange(16).reshape((4, 4)),\n",
" .....: index=['Ohio', 'Colorado', 'Utah', 'New York'],\n",
" .....: columns=['one', 'two', 'three', 'four'])"
]
},
{
"cell_type": "code",
"execution_count": 257,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 257,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 258,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 1\n",
"Colorado 5\n",
"Utah 9\n",
"New York 13\n",
"Name: two, dtype: int64"
]
},
"execution_count": 258,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['two']"
]
},
{
"cell_type": "code",
"execution_count": 259,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" three | \n",
" one | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 2 | \n",
" 0 | \n",
"
\n",
" \n",
" Colorado | \n",
" 6 | \n",
" 4 | \n",
"
\n",
" \n",
" Utah | \n",
" 10 | \n",
" 8 | \n",
"
\n",
" \n",
" New York | \n",
" 14 | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" three one\n",
"Ohio 2 0\n",
"Colorado 6 4\n",
"Utah 10 8\n",
"New York 14 12"
]
},
"execution_count": 259,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[['three', 'one']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"像这样的索引有一些特殊的情况。首先,可以通过切片或一个布尔数组来选择行:"
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7"
]
},
"execution_count": 151,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[:2] "
]
},
{
"cell_type": "code",
"execution_count": 260,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Colorado 4 5 6 7"
]
},
"execution_count": 260,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[(data.one >=1 ) & (data.one < 7) ] # 多个逻辑条件组合"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 152,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[data['three'] > 5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"另一种用法是在索引中使用一个布尔 DataFrame,例如通过纯量比较产生的:"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" Colorado | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Utah | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" New York | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio True True True True\n",
"Colorado True False False False\n",
"Utah False False False False\n",
"New York False False False False"
]
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data < 5"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [],
"source": [
"data[data <5] = 0"
]
},
{
"cell_type": "code",
"execution_count": 157,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Colorado | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 0 0 0\n",
"Colorado 0 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 157,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在这种情况下使得 DataFrame 的语法更像一个 ndarry。为了使 DataFrame 可以在行上进行标签索引,我将介绍特殊的索引字段 ix 。这使你可以从 DataFrame 选择一个行和列的子集,使用像 NumPy 的记法再加上轴标签。正如我早先提到的,这也是一种不是很冗长的重新索引的方法:"
]
},
{
"cell_type": "code",
"execution_count": 158,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"two 5\n",
"three 6\n",
"Name: Colorado, dtype: int64"
]
},
"execution_count": 158,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.ix['Colorado', ['two', 'three']]"
]
},
{
"cell_type": "code",
"execution_count": 161,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" four | \n",
" one | \n",
" two | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" 7 | \n",
" 0 | \n",
" 5 | \n",
"
\n",
" \n",
" Utah | \n",
" 11 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" four one two\n",
"Colorado 7 0 5\n",
"Utah 11 8 9"
]
},
"execution_count": 161,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.ix[['Colorado', 'Utah'], [3, 0, 1]] # 3,0,1 表示位置顺序"
]
},
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"one 8\n",
"two 9\n",
"three 10\n",
"four 11\n",
"Name: Utah, dtype: int64"
]
},
"execution_count": 163,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.ix[2] # 第 2 行"
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 0\n",
"Colorado 6\n",
"Utah 10\n",
"New York 14\n",
"Name: three, dtype: int64"
]
},
"execution_count": 166,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.ix[:,2] # 第 2 列"
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 0\n",
"Colorado 5\n",
"Utah 9\n",
"Name: two, dtype: int64"
]
},
"execution_count": 167,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.ix[:'Utah', 'two']"
]
},
{
"cell_type": "code",
"execution_count": 173,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Colorado 0 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 173,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.ix[data.three > 5, :]"
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three\n",
"Colorado 0 5 6\n",
"Utah 8 9 10\n",
"New York 12 13 14"
]
},
"execution_count": 174,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.ix[data.three > 5, :3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"因此,有很多方法来选择和重排包含在 pandas 对象中的数据。对于 DataFrame , 下表是这些方法的简短概要。稍后你将接触到分层索引,那时你会有一些额外的选项。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**注意**:在设计 pandas 时,我觉得不得不敲下 frame[:, col] 来选择一列,是非常冗余的(且易出错的),因此列选择是最常见的操作之一。 \n",
"因此,我做了这个设计权衡,把所有的富标签索引引入到 ix 。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|切片命令|说明|\n",
"|------|-----|\n",
"|obj[val]\t|从DataFrame选择单一列或连续列。特殊情况下的便利:布尔数组(过滤行),切片(行切片),或布尔DataFrame(根据一些标准来设置值)。|\n",
"|obj.ix[val]\t|从DataFrame的行集选择单行|\n",
"|obj.ix[:, val]\t|从列集选择单列|\n",
"|obj.ix[val1, val2]\t|选择行和列|\n",
"|reindex 方法\t|转换一个或多个轴到新的索引|\n",
"|xs 方法\t|通过标签选择单行或单列到一个Series|\n",
"|icol, irow 方法\t|通过整数位置,分别的选择单行或单列到一个Series|\n",
"|get_value, set_value 方法\t|通过行和列标选择一个单值|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**其他切片方式:**"
]
},
{
"cell_type": "code",
"execution_count": 185,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Colorado | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 0 0 0\n",
"Colorado 0 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 185,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 使用标签选取数据"
]
},
{
"cell_type": "code",
"execution_count": 183,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Colorado 0 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 183,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# df.loc[行标签,列标签]\n",
"data.loc['Colorado':'New York'] # 选取 a:b 行数据"
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 0\n",
"Colorado 0\n",
"Utah 8\n",
"New York 12\n",
"Name: one, dtype: int64"
]
},
"execution_count": 182,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.loc[:,'one'] # 选取 one 列的数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 使用位置选取数据"
]
},
{
"cell_type": "code",
"execution_count": 184,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 184,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# df.iloc[行位置,列位置]\n",
"data.iloc[1,1] # 选取第二行,第二列的值,返回的为单个值"
]
},
{
"cell_type": "code",
"execution_count": 189,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 0 0 0\n",
"Utah 8 9 10 11"
]
},
"execution_count": 189,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[[0,2], :] #选取第一行及第三行的数据"
]
},
{
"cell_type": "code",
"execution_count": 190,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Colorado | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 0 0 0\n",
"Colorado 0 5 6 7"
]
},
"execution_count": 190,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[0:2,:] # 选取第一行到第三行(不包含)的数据,就是前两行的数据"
]
},
{
"cell_type": "code",
"execution_count": 191,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 0\n",
"Colorado 5\n",
"Utah 9\n",
"New York 13\n",
"Name: two, dtype: int64"
]
},
"execution_count": 191,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[:,1] # 选取所有记录的第一列的值,返回的为一个 Series"
]
},
{
"cell_type": "code",
"execution_count": 192,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"one 0\n",
"two 5\n",
"three 6\n",
"four 7\n",
"Name: Colorado, dtype: int64"
]
},
"execution_count": 192,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[1,:] # 选取第二行数据,返回的为一个 Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 算术运算"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"当把对象加起来时,如果有任何的索引对不相同的话,在结果中将会把各自的索引联合起来。 \n",
"在索引不重合的地方引入了 NA 值。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 广播"
]
},
{
"cell_type": "code",
"execution_count": 197,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"arr = np.arange(12.).reshape(3,4)"
]
},
{
"cell_type": "code",
"execution_count": 198,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0., 1., 2., 3.])"
]
},
"execution_count": 198,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr[0]"
]
},
{
"cell_type": "code",
"execution_count": 199,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0., 0., 0., 0.],\n",
" [ 4., 4., 4., 4.],\n",
" [ 8., 8., 8., 8.]])"
]
},
"execution_count": 199,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr-arr[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**以下内容参考至**:[Python中的结构化数据分析利器-Pandas简介](http://cloga.info/python/%E6%95%B0%E6%8D%AE%E7%A7%91%E5%AD%A6/2013/09/17/pandasintro)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 文件操作"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 从 CSV 中读取数据\n",
"df = pd.read_csv('foo.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#将 DataFrame 写入 CSV\n",
"df.to_csv('foo.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 从 Excel 中读取数据:先定义一个 Excel 文件,\n",
"# 用 xls.parse 解析 sheet1 的内容,index_col 用于指定 index 列,na_values 定义缺失值的标识。\n",
"xls = ExcelFile('foo.xlsx')\n",
"xls.parse('sheet1', index_col=None, na_values=['NA'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
" # 将DataFrame 写入 Excel 文件 \n",
"df.to_excel('foo.xlsx', sheet_name='sheet1')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 基本运算"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [],
"source": [
"data = DataFrame(np.arange(16).reshape((4, 4)),\n",
" ....: index=['Ohio', 'Colorado', 'Utah', 'New York'],\n",
" ....: columns=['one', 'two', 'three', 'four'])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"one 6.0\n",
"two 7.0\n",
"three 8.0\n",
"four 9.0\n",
"dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.mean() # 计算列的平均值,参数为轴,可选值为 0 或 1.默认为 0,即按照列运算"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 6\n",
"Colorado 22\n",
"Utah 38\n",
"New York 54\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.sum(1) # 计算行的和"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"one 12\n",
"two 12\n",
"three 12\n",
"four 12\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 将一个函数应用到 DataFrame 的每一列,这里使用的是匿名 lambda 函数,与 R 中 apply 函数类似\n",
"data.apply(lambda x: x.max() - x.min())"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 5 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 9 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 13 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 1 1 2 3\n",
"Colorado 5 5 6 7\n",
"Utah 9 9 10 11\n",
"New York 13 13 14 15"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 应用到某一列\n",
"# 推荐\n",
"data['one'] = data['one'].apply(lambda x: x+1)\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 5 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 9 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 13 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 2 1 2 3\n",
"Colorado 5 5 6 7\n",
"Utah 9 9 10 11\n",
"New York 13 13 14 15"
]
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 应用到某一列\n",
"# 首元素会多执行一遍\n",
"def addone(v):\n",
" v[0] += 1\n",
" return v\n",
"data.apply(addone, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 5 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 9 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 13 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 1 1 2 3\n",
"Colorado 5 5 6 7\n",
"Utah 9 9 10 11\n",
"New York 13 13 14 15"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def addone(v):\n",
" v += 1\n",
" return v\n",
"data['one'] = data['one'].map(addone)\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 常用操作"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 排序"
]
},
{
"cell_type": "code",
"execution_count": 234,
"metadata": {},
"outputs": [],
"source": [
"df = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],\n",
"....: columns=['Ohio', 'Texas', 'California'])"
]
},
{
"cell_type": "code",
"execution_count": 235,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 0 1 2\n",
"c 3 4 5\n",
"d 6 7 8"
]
},
"execution_count": 235,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 238,
"metadata": {},
"outputs": [],
"source": [
"df.ix[\"a\"] = [10,11,12]"
]
},
{
"cell_type": "code",
"execution_count": 251,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
"
\n",
" \n",
" c | \n",
" 3 | \n",
" 9 | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 10 11 12\n",
"c 3 9 5\n",
"d 6 7 8"
]
},
"execution_count": 251,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"sort_index 可以以轴的标签进行排序。axis 是指用于排序的轴,可选的值有 0 和 1,默认为 0 即行标签(Y 轴),1 为按照列标签排序。 ascending 是排序方式,默认为 True 即降序排列。"
]
},
{
"cell_type": "code",
"execution_count": 252,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" d | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
" c | \n",
" 3 | \n",
" 9 | \n",
" 5 | \n",
"
\n",
" \n",
" a | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"d 6 7 8\n",
"c 3 9 5\n",
"a 10 11 12"
]
},
"execution_count": 252,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_index(axis=0, ascending=False) # 对行进行排序 d,c,a"
]
},
{
"cell_type": "code",
"execution_count": 212,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 10\n",
"Texas 11\n",
"California 12\n",
"Name: a, dtype: int64"
]
},
"execution_count": 212,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 254,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Texas | \n",
" Ohio | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 11 | \n",
" 10 | \n",
" 12 | \n",
"
\n",
" \n",
" c | \n",
" 9 | \n",
" 3 | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 7 | \n",
" 6 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Texas Ohio California\n",
"a 11 10 12\n",
"c 9 3 5\n",
"d 7 6 8"
]
},
"execution_count": 254,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_index(axis=1, ascending=False) # 对列进行排序 Texas, Ohio, California"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DataFrame 也提供按照指定列进行排序,可以仅指定一个列作为排序标准(以单独列名作为 columns 的参数),也可以进行多重排序(columns 的参数为一个列名的 List,列名的出现顺序决定排序中的优先级),在多重排序中 ascending 参数也为一个 List,分别与 columns 中的 List 元素对应。"
]
},
{
"cell_type": "code",
"execution_count": 243,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
"
\n",
" \n",
" d | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
" c | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 10 11 12\n",
"d 6 7 8\n",
"c 3 4 5"
]
},
"execution_count": 243,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values(by='Ohio', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {},
"outputs": [],
"source": [
"df.ix[1,1]=9"
]
},
{
"cell_type": "code",
"execution_count": 247,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
"
\n",
" \n",
" c | \n",
" 3 | \n",
" 9 | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 10 11 12\n",
"c 3 9 5\n",
"d 6 7 8"
]
},
"execution_count": 247,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 250,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
"
\n",
" \n",
" d | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
" c | \n",
" 3 | \n",
" 9 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 10 11 12\n",
"d 6 7 8\n",
"c 3 9 5"
]
},
"execution_count": 250,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values(by=['Ohio','Texas'],ascending=[0,1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 重命名列"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.rename(columns={u'one':'1'}, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 设置索引"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.set_index('one')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 重设索引"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.reset_index(inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 查看最大最小值"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pd.Series.max()\n",
"pd.Series.idxmax()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 改变数据类型"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df['A'].astype(float)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 计算 Series 每个值的频率"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df['A'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DataFrame 的合并"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Concat"
]
},
{
"cell_type": "code",
"execution_count": 267,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ds = [{'one' : 4,'two':2},{'one' : 5,'two' : 3},{'one' : 6,'two' : 4},{'two' : 7,'three':10}]\n",
"dfs = pd.DataFrame(ds,index=['e','f','g','h'])"
]
},
{
"cell_type": "code",
"execution_count": 269,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" three | \n",
" two | \n",
"
\n",
" \n",
" \n",
" \n",
" e | \n",
" 4.0 | \n",
" NaN | \n",
" 2 | \n",
"
\n",
" \n",
" f | \n",
" 5.0 | \n",
" NaN | \n",
" 3 | \n",
"
\n",
" \n",
" g | \n",
" 6.0 | \n",
" NaN | \n",
" 4 | \n",
"
\n",
" \n",
" h | \n",
" NaN | \n",
" 10.0 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one three two\n",
"e 4.0 NaN 2\n",
"f 5.0 NaN 3\n",
"g 6.0 NaN 4\n",
"h NaN 10.0 7"
]
},
"execution_count": 269,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs"
]
},
{
"cell_type": "code",
"execution_count": 292,
"metadata": {},
"outputs": [],
"source": [
"##构建一个新的 DataFrame,dfs\n",
"df_t=pd.concat([dfs,dfs],axis=1) # 合并两个 DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 293,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" three | \n",
" two | \n",
" one | \n",
" three | \n",
" two | \n",
"
\n",
" \n",
" \n",
" \n",
" e | \n",
" 4.0 | \n",
" NaN | \n",
" 2 | \n",
" 4.0 | \n",
" NaN | \n",
" 2 | \n",
"
\n",
" \n",
" f | \n",
" 5.0 | \n",
" NaN | \n",
" 3 | \n",
" 5.0 | \n",
" NaN | \n",
" 3 | \n",
"
\n",
" \n",
" g | \n",
" 6.0 | \n",
" NaN | \n",
" 4 | \n",
" 6.0 | \n",
" NaN | \n",
" 4 | \n",
"
\n",
" \n",
" h | \n",
" NaN | \n",
" 10.0 | \n",
" 7 | \n",
" NaN | \n",
" 10.0 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one three two one three two\n",
"e 4.0 NaN 2 4.0 NaN 2\n",
"f 5.0 NaN 3 5.0 NaN 3\n",
"g 6.0 NaN 4 6.0 NaN 4\n",
"h NaN 10.0 7 NaN 10.0 7"
]
},
"execution_count": 293,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_t"
]
},
{
"cell_type": "code",
"execution_count": 294,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"##构建一个新的 DataFrame,dfs\n",
"df_t=pd.concat([dfs,dfs],axis=0) # 合并两个 DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 295,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" three | \n",
" two | \n",
"
\n",
" \n",
" \n",
" \n",
" e | \n",
" 4.0 | \n",
" NaN | \n",
" 2 | \n",
"
\n",
" \n",
" f | \n",
" 5.0 | \n",
" NaN | \n",
" 3 | \n",
"
\n",
" \n",
" g | \n",
" 6.0 | \n",
" NaN | \n",
" 4 | \n",
"
\n",
" \n",
" h | \n",
" NaN | \n",
" 10.0 | \n",
" 7 | \n",
"
\n",
" \n",
" e | \n",
" 4.0 | \n",
" NaN | \n",
" 2 | \n",
"
\n",
" \n",
" f | \n",
" 5.0 | \n",
" NaN | \n",
" 3 | \n",
"
\n",
" \n",
" g | \n",
" 6.0 | \n",
" NaN | \n",
" 4 | \n",
"
\n",
" \n",
" h | \n",
" NaN | \n",
" 10.0 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one three two\n",
"e 4.0 NaN 2\n",
"f 5.0 NaN 3\n",
"g 6.0 NaN 4\n",
"h NaN 10.0 7\n",
"e 4.0 NaN 2\n",
"f 5.0 NaN 3\n",
"g 6.0 NaN 4\n",
"h NaN 10.0 7"
]
},
"execution_count": 295,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_t"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Merge"
]
},
{
"cell_type": "code",
"execution_count": 296,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" lval | \n",
" rval | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" foo1 | \n",
" 1 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" foo2 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key lval rval\n",
"0 foo1 1 4\n",
"1 foo2 2 5"
]
},
"execution_count": 296,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left = pd.DataFrame({'key': ['foo1', 'foo2'], 'lval': [1, 2]})\n",
"right = pd.DataFrame({'key': ['foo1', 'foo2'], 'rval': [4, 5]})\n",
"#构建了两个DataFrame\n",
"pd.merge(left, right, on='key')#按照key列将两个DataFrame join在一起"
]
},
{
"cell_type": "code",
"execution_count": 297,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" lval | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" foo1 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" foo2 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key lval\n",
"0 foo1 1\n",
"1 foo2 2"
]
},
"execution_count": 297,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 298,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" rval | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" foo1 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" foo2 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key rval\n",
"0 foo1 4\n",
"1 foo2 5"
]
},
"execution_count": 298,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Groupby"
]
},
{
"cell_type": "code",
"execution_count": 300,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import random"
]
},
{
"cell_type": "code",
"execution_count": 305,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],\n",
" 'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],\n",
" 'C' :np.random.randn(8), 'D' : np.random.randn(8)});"
]
},
{
"cell_type": "code",
"execution_count": 306,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" foo | \n",
" one | \n",
" 0.584419 | \n",
" -0.594567 | \n",
"
\n",
" \n",
" 1 | \n",
" bar | \n",
" one | \n",
" -1.331418 | \n",
" -0.167330 | \n",
"
\n",
" \n",
" 2 | \n",
" foo | \n",
" two | \n",
" 0.125558 | \n",
" 0.744053 | \n",
"
\n",
" \n",
" 3 | \n",
" bar | \n",
" three | \n",
" 0.850646 | \n",
" -0.741091 | \n",
"
\n",
" \n",
" 4 | \n",
" foo | \n",
" two | \n",
" -0.033590 | \n",
" -0.239762 | \n",
"
\n",
" \n",
" 5 | \n",
" bar | \n",
" two | \n",
" 0.332681 | \n",
" 0.567786 | \n",
"
\n",
" \n",
" 6 | \n",
" foo | \n",
" one | \n",
" 1.051195 | \n",
" 1.509953 | \n",
"
\n",
" \n",
" 7 | \n",
" foo | \n",
" three | \n",
" 0.374087 | \n",
" -0.318311 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"0 foo one 0.584419 -0.594567\n",
"1 bar one -1.331418 -0.167330\n",
"2 foo two 0.125558 0.744053\n",
"3 bar three 0.850646 -0.741091\n",
"4 foo two -0.033590 -0.239762\n",
"5 bar two 0.332681 0.567786\n",
"6 foo one 1.051195 1.509953\n",
"7 foo three 0.374087 -0.318311"
]
},
"execution_count": 306,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 307,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" A | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" bar | \n",
" -0.148091 | \n",
" -0.340636 | \n",
"
\n",
" \n",
" foo | \n",
" 2.101669 | \n",
" 1.101366 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D\n",
"A \n",
"bar -0.148091 -0.340636\n",
"foo 2.101669 1.101366"
]
},
"execution_count": 307,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('A').sum() # 按照 A 列的值分组求和"
]
},
{
"cell_type": "code",
"execution_count": 308,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" A | \n",
" B | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" bar | \n",
" one | \n",
" -1.331418 | \n",
" -0.167330 | \n",
"
\n",
" \n",
" three | \n",
" 0.850646 | \n",
" -0.741091 | \n",
"
\n",
" \n",
" two | \n",
" 0.332681 | \n",
" 0.567786 | \n",
"
\n",
" \n",
" foo | \n",
" one | \n",
" 1.635614 | \n",
" 0.915386 | \n",
"
\n",
" \n",
" three | \n",
" 0.374087 | \n",
" -0.318311 | \n",
"
\n",
" \n",
" two | \n",
" 0.091968 | \n",
" 0.504291 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D\n",
"A B \n",
"bar one -1.331418 -0.167330\n",
" three 0.850646 -0.741091\n",
" two 0.332681 0.567786\n",
"foo one 1.635614 0.915386\n",
" three 0.374087 -0.318311\n",
" two 0.091968 0.504291"
]
},
"execution_count": 308,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(['A','B']).sum() # 按照 A、B 两列的值分组求和"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**在实际应用中,先定义 groups,然后再对不同的指标指定不同计算方式。**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"groups = df.groupby('A') # 按照 A 列的值分组求和\n",
"groups['B'].sum() # 按照 A 列的值分组求 B 组和\n",
"groups['B'].count() # 按照 A 列的值分组 B 组计数"
]
},
{
"cell_type": "code",
"execution_count": 309,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"groups = df.groupby('A') # 按照 A 列的值分组求和"
]
},
{
"cell_type": "code",
"execution_count": 312,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A\n",
"bar -0.148091\n",
"foo 2.101669\n",
"Name: C, dtype: float64"
]
},
"execution_count": 312,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"groups['C'].sum() # 按照 A 列的值分组求 B 组和"
]
},
{
"cell_type": "code",
"execution_count": 314,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A\n",
"bar 3\n",
"foo 5\n",
"Name: B, dtype: int64"
]
},
"execution_count": 314,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"groups['B'].count() # 按照 A 列的值分组 B 组计数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**默认会以 groupby 的值作为索引,如果不将这些值作为索引,则需要使用 as_index=False**"
]
},
{
"cell_type": "code",
"execution_count": 316,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" foo | \n",
" one | \n",
" 0.584419 | \n",
" -0.594567 | \n",
"
\n",
" \n",
" 1 | \n",
" bar | \n",
" one | \n",
" -1.331418 | \n",
" -0.167330 | \n",
"
\n",
" \n",
" 2 | \n",
" foo | \n",
" two | \n",
" 0.125558 | \n",
" 0.744053 | \n",
"
\n",
" \n",
" 3 | \n",
" bar | \n",
" three | \n",
" 0.850646 | \n",
" -0.741091 | \n",
"
\n",
" \n",
" 4 | \n",
" foo | \n",
" two | \n",
" -0.033590 | \n",
" -0.239762 | \n",
"
\n",
" \n",
" 5 | \n",
" bar | \n",
" two | \n",
" 0.332681 | \n",
" 0.567786 | \n",
"
\n",
" \n",
" 6 | \n",
" foo | \n",
" one | \n",
" 1.051195 | \n",
" 1.509953 | \n",
"
\n",
" \n",
" 7 | \n",
" foo | \n",
" three | \n",
" 0.374087 | \n",
" -0.318311 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"0 foo one 0.584419 -0.594567\n",
"1 bar one -1.331418 -0.167330\n",
"2 foo two 0.125558 0.744053\n",
"3 bar three 0.850646 -0.741091\n",
"4 foo two -0.033590 -0.239762\n",
"5 bar two 0.332681 0.567786\n",
"6 foo one 1.051195 1.509953\n",
"7 foo three 0.374087 -0.318311"
]
},
"execution_count": 316,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 317,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" bar | \n",
" one | \n",
" -1.331418 | \n",
" -0.167330 | \n",
"
\n",
" \n",
" 1 | \n",
" bar | \n",
" three | \n",
" 0.850646 | \n",
" -0.741091 | \n",
"
\n",
" \n",
" 2 | \n",
" bar | \n",
" two | \n",
" 0.332681 | \n",
" 0.567786 | \n",
"
\n",
" \n",
" 3 | \n",
" foo | \n",
" one | \n",
" 1.635614 | \n",
" 0.915386 | \n",
"
\n",
" \n",
" 4 | \n",
" foo | \n",
" three | \n",
" 0.374087 | \n",
" -0.318311 | \n",
"
\n",
" \n",
" 5 | \n",
" foo | \n",
" two | \n",
" 0.091968 | \n",
" 0.504291 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"0 bar one -1.331418 -0.167330\n",
"1 bar three 0.850646 -0.741091\n",
"2 bar two 0.332681 0.567786\n",
"3 foo one 1.635614 0.915386\n",
"4 foo three 0.374087 -0.318311\n",
"5 foo two 0.091968 0.504291"
]
},
"execution_count": 317,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(['A','B'], as_index=False).sum()\n",
"# 注意与 In [308] 对比"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 透视表"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用 pivot_table 和 crosstab 都可以创建数据透视表"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## pivot_table"
]
},
{
"cell_type": "code",
"execution_count": 318,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,'B' : ['A', 'B', 'C'] * 4, \n",
" 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, \n",
" 'D' : np.random.randn(12), 'E' : np.random.randn(12)})"
]
},
{
"cell_type": "code",
"execution_count": 319,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
" E | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" one | \n",
" A | \n",
" foo | \n",
" 0.081074 | \n",
" 0.358511 | \n",
"
\n",
" \n",
" 1 | \n",
" one | \n",
" B | \n",
" foo | \n",
" -1.741607 | \n",
" -1.105890 | \n",
"
\n",
" \n",
" 2 | \n",
" two | \n",
" C | \n",
" foo | \n",
" -1.220509 | \n",
" -0.241541 | \n",
"
\n",
" \n",
" 3 | \n",
" three | \n",
" A | \n",
" bar | \n",
" -0.663041 | \n",
" 0.087185 | \n",
"
\n",
" \n",
" 4 | \n",
" one | \n",
" B | \n",
" bar | \n",
" -0.259216 | \n",
" -1.891769 | \n",
"
\n",
" \n",
" 5 | \n",
" one | \n",
" C | \n",
" bar | \n",
" -0.810547 | \n",
" 1.456797 | \n",
"
\n",
" \n",
" 6 | \n",
" two | \n",
" A | \n",
" foo | \n",
" -0.507784 | \n",
" 1.801202 | \n",
"
\n",
" \n",
" 7 | \n",
" three | \n",
" B | \n",
" foo | \n",
" 0.602052 | \n",
" 1.740630 | \n",
"
\n",
" \n",
" 8 | \n",
" one | \n",
" C | \n",
" foo | \n",
" 0.671181 | \n",
" -0.210367 | \n",
"
\n",
" \n",
" 9 | \n",
" one | \n",
" A | \n",
" bar | \n",
" 0.615852 | \n",
" -0.947002 | \n",
"
\n",
" \n",
" 10 | \n",
" two | \n",
" B | \n",
" bar | \n",
" -0.209238 | \n",
" -1.500853 | \n",
"
\n",
" \n",
" 11 | \n",
" three | \n",
" C | \n",
" bar | \n",
" 1.050343 | \n",
" 0.837500 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D E\n",
"0 one A foo 0.081074 0.358511\n",
"1 one B foo -1.741607 -1.105890\n",
"2 two C foo -1.220509 -0.241541\n",
"3 three A bar -0.663041 0.087185\n",
"4 one B bar -0.259216 -1.891769\n",
"5 one C bar -0.810547 1.456797\n",
"6 two A foo -0.507784 1.801202\n",
"7 three B foo 0.602052 1.740630\n",
"8 one C foo 0.671181 -0.210367\n",
"9 one A bar 0.615852 -0.947002\n",
"10 two B bar -0.209238 -1.500853\n",
"11 three C bar 1.050343 0.837500"
]
},
"execution_count": 319,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 324,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" bar | \n",
" foo | \n",
"
\n",
" \n",
" A | \n",
" B | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" A | \n",
" 0.615852 | \n",
" 0.081074 | \n",
"
\n",
" \n",
" B | \n",
" -0.259216 | \n",
" -1.741607 | \n",
"
\n",
" \n",
" C | \n",
" -0.810547 | \n",
" 0.671181 | \n",
"
\n",
" \n",
" three | \n",
" A | \n",
" -0.663041 | \n",
" NaN | \n",
"
\n",
" \n",
" B | \n",
" NaN | \n",
" 0.602052 | \n",
"
\n",
" \n",
" C | \n",
" 1.050343 | \n",
" NaN | \n",
"
\n",
" \n",
" two | \n",
" A | \n",
" NaN | \n",
" -0.507784 | \n",
"
\n",
" \n",
" B | \n",
" -0.209238 | \n",
" NaN | \n",
"
\n",
" \n",
" C | \n",
" NaN | \n",
" -1.220509 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"C bar foo\n",
"A B \n",
"one A 0.615852 0.081074\n",
" B -0.259216 -1.741607\n",
" C -0.810547 0.671181\n",
"three A -0.663041 NaN\n",
" B NaN 0.602052\n",
" C 1.050343 NaN\n",
"two A NaN -0.507784\n",
" B -0.209238 NaN\n",
" C NaN -1.220509"
]
},
"execution_count": 324,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.pivot_table(df, values = 'D', index = ['A', 'B'], columns = ['C']) #以 A、B 为行标签,以 C 为列标签将 D 列的值汇总求和"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## crosstab"
]
},
{
"cell_type": "code",
"execution_count": 336,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
" E | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" one | \n",
" A | \n",
" foo | \n",
" 0.081074 | \n",
" 0.358511 | \n",
"
\n",
" \n",
" 1 | \n",
" one | \n",
" B | \n",
" foo | \n",
" -1.741607 | \n",
" -1.105890 | \n",
"
\n",
" \n",
" 2 | \n",
" two | \n",
" C | \n",
" foo | \n",
" -1.220509 | \n",
" -0.241541 | \n",
"
\n",
" \n",
" 3 | \n",
" three | \n",
" A | \n",
" bar | \n",
" -0.663041 | \n",
" 0.087185 | \n",
"
\n",
" \n",
" 4 | \n",
" one | \n",
" B | \n",
" bar | \n",
" -0.259216 | \n",
" -1.891769 | \n",
"
\n",
" \n",
" 5 | \n",
" one | \n",
" C | \n",
" bar | \n",
" -0.810547 | \n",
" 1.456797 | \n",
"
\n",
" \n",
" 6 | \n",
" two | \n",
" A | \n",
" foo | \n",
" -0.507784 | \n",
" 1.801202 | \n",
"
\n",
" \n",
" 7 | \n",
" three | \n",
" B | \n",
" foo | \n",
" 0.602052 | \n",
" 1.740630 | \n",
"
\n",
" \n",
" 8 | \n",
" one | \n",
" C | \n",
" foo | \n",
" 0.671181 | \n",
" -0.210367 | \n",
"
\n",
" \n",
" 9 | \n",
" one | \n",
" A | \n",
" bar | \n",
" 0.615852 | \n",
" -0.947002 | \n",
"
\n",
" \n",
" 10 | \n",
" two | \n",
" B | \n",
" bar | \n",
" -0.209238 | \n",
" -1.500853 | \n",
"
\n",
" \n",
" 11 | \n",
" three | \n",
" C | \n",
" bar | \n",
" 1.050343 | \n",
" 0.837500 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D E\n",
"0 one A foo 0.081074 0.358511\n",
"1 one B foo -1.741607 -1.105890\n",
"2 two C foo -1.220509 -0.241541\n",
"3 three A bar -0.663041 0.087185\n",
"4 one B bar -0.259216 -1.891769\n",
"5 one C bar -0.810547 1.456797\n",
"6 two A foo -0.507784 1.801202\n",
"7 three B foo 0.602052 1.740630\n",
"8 one C foo 0.671181 -0.210367\n",
"9 one A bar 0.615852 -0.947002\n",
"10 two B bar -0.209238 -1.500853\n",
"11 three C bar 1.050343 0.837500"
]
},
"execution_count": 336,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 335,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" bar | \n",
" foo | \n",
" All | \n",
"
\n",
" \n",
" A | \n",
" B | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" A | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" B | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" C | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" three | \n",
" A | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" B | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" C | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" two | \n",
" A | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" B | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" C | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" All | \n",
" | \n",
" 6 | \n",
" 6 | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"C bar foo All\n",
"A B \n",
"one A 1 1 2\n",
" B 1 1 2\n",
" C 1 1 2\n",
"three A 1 0 1\n",
" B 0 1 1\n",
" C 1 0 1\n",
"two A 0 1 1\n",
" B 1 0 1\n",
" C 0 1 1\n",
"All 6 6 12"
]
},
"execution_count": 335,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab([df.A, df.B], df.C, margins=True) # 以 A、B 为行标签,以 C 为列标签将 D 列的值汇总求个数"
]
},
{
"cell_type": "code",
"execution_count": 338,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" bar | \n",
" foo | \n",
"
\n",
" \n",
" A | \n",
" B | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" A | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" B | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" C | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" three | \n",
" A | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" B | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" C | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" two | \n",
" A | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" B | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" C | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"C bar foo\n",
"A B \n",
"one A 1 1\n",
" B 1 1\n",
" C 1 1\n",
"three A 1 0\n",
" B 0 1\n",
" C 1 0\n",
"two A 0 1\n",
" B 1 0\n",
" C 0 1"
]
},
"execution_count": 338,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab([df.A, df.B], df.C, margins=False) # 以 A、B 为行标签,以 C 为列标签将 D 列的值汇总求个数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 时间序列分析"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas 提供 to_datetime 方法将代表时间的字符转化为 Timestamp 对象:"
]
},
{
"cell_type": "code",
"execution_count": 339,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"s = '2013-09-16 21:00:00'\n",
"ts = pd.to_datetime(s)"
]
},
{
"cell_type": "code",
"execution_count": 340,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Timestamp('2013-09-16 21:00:00')"
]
},
"execution_count": 340,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"有时我们需要处理时区问题:"
]
},
{
"cell_type": "code",
"execution_count": 342,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ts=pd.to_datetime(s,utc=True).tz_convert('Asia/Shanghai')"
]
},
{
"cell_type": "code",
"execution_count": 343,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Timestamp('2013-09-17 05:00:00+0800', tz='Asia/Shanghai')"
]
},
"execution_count": 343,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"构建时间序列:"
]
},
{
"cell_type": "code",
"execution_count": 345,
"metadata": {},
"outputs": [],
"source": [
"rng = pd.date_range('1/1/2012', periods=5, freq='M')\n",
"ts = pd.Series(np.random.randn(len(rng)), index=rng)"
]
},
{
"cell_type": "code",
"execution_count": 346,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',\n",
" '2012-05-31'],\n",
" dtype='datetime64[ns]', freq='M')"
]
},
"execution_count": 346,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rng"
]
},
{
"cell_type": "code",
"execution_count": 347,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2012-01-31 0.839176\n",
"2012-02-29 -0.631665\n",
"2012-03-31 -0.941336\n",
"2012-04-30 0.754626\n",
"2012-05-31 -2.149063\n",
"Freq: M, dtype: float64"
]
},
"execution_count": 347,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas 提供 resample 方法对时间序列的时间粒度进行调整:"
]
},
{
"cell_type": "code",
"execution_count": 349,
"metadata": {},
"outputs": [],
"source": [
"ts_h=ts.resample('H').count() # M,5Min,1s\n",
"# 以上是将时间序列调整为小时,还可以支持月(M),分钟(Min)甚至秒(s)等。"
]
},
{
"cell_type": "code",
"execution_count": 351,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2012-01-31 00:00:00 1\n",
"2012-01-31 01:00:00 0\n",
"2012-01-31 02:00:00 0\n",
"2012-01-31 03:00:00 0\n",
"2012-01-31 04:00:00 0\n",
"Freq: H, dtype: int64"
]
},
"execution_count": 351,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts_h.head()"
]
},
{
"cell_type": "code",
"execution_count": 353,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 353,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(ts_h)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 参考网站\n",
"\n",
"- [Cloga 的互联网笔记 ](http://cloga.info/python/%E6%95%B0%E6%8D%AE%E7%A7%91%E5%AD%A6/2013/09/17/pandasintro)大神~\n",
"- [Python For Data Analysis 0.1](http://pda.readthedocs.io/en/latest/chp5.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"toc": {
"base_numbering": 1,
"nav_menu": {
"height": "512px",
"width": "252px"
},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": true,
"toc_position": {},
"toc_section_display": "block",
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 1
}