{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5.2 Essential Functionality(主要功能)\n",
"\n",
"接下来介绍pandas中的一些主要功能,这里只介绍一些经常用到的。\n",
"\n",
"# 1 Reindexing(重新索引)\n",
"\n",
"pandas中一个重要的方法是reindex,已实施在创建object的时候遵照一个新的index。如下例:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"d 4.5\n",
"b 7.2\n",
"a -5.3\n",
"c 3.6\n",
"dtype: float64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])\n",
"obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在series上调用reindex能更改index,如果没有对应index的话会引入缺失数据:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a -5.3\n",
"b 7.2\n",
"c 3.6\n",
"d 4.5\n",
"e NaN\n",
"dtype: float64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])\n",
"obj2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在处理时间序列这样的数据时,我们可能需要在reindexing的时候需要修改值。method选项能做到这一点,比如设定method为ffill:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 bule\n",
"2 purple\n",
"4 yellow\n",
"dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj3 = pd.Series(['bule', 'purple', 'yellow'], index=[0, 2, 4])\n",
"obj3"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 bule\n",
"1 bule\n",
"2 purple\n",
"3 purple\n",
"4 yellow\n",
"5 yellow\n",
"dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj3.reindex(range(6), method='ffill')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对于DataFrame,reindex能更改row index,或column index。reindex the rows:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"frame = pd.DataFrame(np.arange(9).reshape(3, 3),\n",
" index=['a', 'c', 'd'],\n",
" columns=['Ohio', 'Texas', 'California'])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 0 1 2\n",
"c 3 4 5\n",
"d 6 7 8"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ohio | \n",
" Texas | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" b | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" c | \n",
" 3.0 | \n",
" 4.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" d | \n",
" 6.0 | \n",
" 7.0 | \n",
" 8.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ohio Texas California\n",
"a 0.0 1.0 2.0\n",
"b NaN NaN NaN\n",
"c 3.0 4.0 5.0\n",
"d 6.0 7.0 8.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2 = frame.reindex(['a', 'b', 'c', 'd'])\n",
"frame2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"更改columns index:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"states = ['Texas', 'Utah', 'California']"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Texas | \n",
" Utah | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" NaN | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 4 | \n",
" NaN | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 7 | \n",
" NaN | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Texas Utah California\n",
"a 1 NaN 2\n",
"c 4 NaN 5\n",
"d 7 NaN 8"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.reindex(columns=states)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"reindex的参数:\n",
"\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/x0pq4.png)\n",
"\n",
"还可以使用loc更简洁的reindex:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Texas | \n",
" Utah | \n",
" California | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1.0 | \n",
" NaN | \n",
" 2.0 | \n",
"
\n",
" \n",
" b | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" c | \n",
" 4.0 | \n",
" NaN | \n",
" 5.0 | \n",
"
\n",
" \n",
" d | \n",
" 7.0 | \n",
" NaN | \n",
" 8.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Texas Utah California\n",
"a 1.0 NaN 2.0\n",
"b NaN NaN NaN\n",
"c 4.0 NaN 5.0\n",
"d 7.0 NaN 8.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.loc[['a', 'b', 'c', 'd'], states]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2 Dropping Entries from an Axis (按轴删除记录) \n",
"\n",
"对于series,drop回返回一个新的object,并删去你制定的axis的值:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.0\n",
"b 1.0\n",
"c 2.0\n",
"d 3.0\n",
"e 4.0\n",
"dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])\n",
"obj"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.0\n",
"b 1.0\n",
"d 3.0\n",
"e 4.0\n",
"dtype: float64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_obj = obj.drop('c')\n",
"new_obj"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.0\n",
"b 1.0\n",
"e 4.0\n",
"dtype: float64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.drop(['d', 'c'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对于DataFrame,index能按行或列的axis来删除:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame(np.arange(16).reshape(4, 4),\n",
" index=['Ohio', 'Colorado', 'Utah', 'New York'],\n",
" columns=['one', 'two', 'three', 'four'])\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"行处理:如果a sequence of labels(一个标签序列)来调用drop,会删去row labels(axis 0):"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.drop(['Colorado', 'Ohio'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"列处理:drop列的话,设定axis=1或axis='columns':"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one three four\n",
"Ohio 0 2 3\n",
"Colorado 4 6 7\n",
"Utah 8 10 11\n",
"New York 12 14 15"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.drop('two', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" three | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 2 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 10 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one three\n",
"Ohio 0 2\n",
"Colorado 4 6\n",
"Utah 8 10\n",
"New York 12 14"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.drop(['two', 'four'], axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"drop也可以不返回一个新的object,而是直接更改series or dataframe in-place:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.0\n",
"b 1.0\n",
"d 3.0\n",
"e 4.0\n",
"dtype: float64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.drop('c', inplace=True)\n",
"obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3 Indexing, Selection, and Filtering(索引,选择,过滤)\n",
"\n",
"series indexing(obj[...]) 相当于numpy的array indexing, 而且除了整数,还可以使用series的index:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.0\n",
"b 1.0\n",
"c 2.0\n",
"d 3.0\n",
"dtype: float64"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])\n",
"obj"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['b']"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj[1]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"c 2.0\n",
"d 3.0\n",
"dtype: float64"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj[2:4]"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"b 1.0\n",
"a 0.0\n",
"d 3.0\n",
"dtype: float64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 选中行\n",
"obj[['b', 'a', 'd']]"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"b 1.0\n",
"d 3.0\n",
"dtype: float64"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj[[1, 3]]"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.0\n",
"b 1.0\n",
"dtype: float64"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj[obj < 2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"用label来slicing(切片)的时候,和python的切片不一样的在于,会包括尾节点:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"b 1.0\n",
"c 2.0\n",
"dtype: float64"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['b':'c']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以直接给选中的label更改值:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.0\n",
"b 5.0\n",
"c 5.0\n",
"d 3.0\n",
"dtype: float64"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['b':'c'] = 5\n",
"obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"而对于DataFrame,indexing可以通过一个值或序列,选中一个以上的列:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame(np.arange(16).reshape((4, 4)),\n",
" index=['Ohio', 'Colorado', 'Utah', 'New York'],\n",
" columns=['one', 'two', 'three', 'four'])\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 1\n",
"Colorado 5\n",
"Utah 9\n",
"New York 13\n",
"Name: two, dtype: int64"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['two']"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" three | \n",
" one | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 2 | \n",
" 0 | \n",
"
\n",
" \n",
" Colorado | \n",
" 6 | \n",
" 4 | \n",
"
\n",
" \n",
" Utah | \n",
" 10 | \n",
" 8 | \n",
"
\n",
" \n",
" New York | \n",
" 14 | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" three one\n",
"Ohio 2 0\n",
"Colorado 6 4\n",
"Utah 10 8\n",
"New York 14 12"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[['three', 'one']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"dataframe的indexing有一些比较特别的方式。比如通过布尔数组:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Colorado | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[:2]"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[data['three'] > 5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"行选择的语法格式`data[:2]`是很方便的。给`[]`里传入一个list的话,可以选择列。\n",
"\n",
"另一种方法是用boolean dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" Colorado | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Utah | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" New York | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio True True True True\n",
"Colorado True False False False\n",
"Utah False False False False\n",
"New York False False False False"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data < 5"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Colorado | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 0 0 0\n",
"Colorado 0 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[data < 5] = 0\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selection with loc and iloc(用loc和iloc来选择)\n",
"\n",
"对于label-indexing on rows, 我们介绍特别的索引符,loc and iloc. 这两个方法能通过axis labels(loc)或integer(iloc),来选择行或列。\n",
"\n",
"一个列子,选中一行多列by label:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Colorado | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three four\n",
"Ohio 0 0 0 0\n",
"Colorado 0 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"two 5\n",
"three 6\n",
"Name: Colorado, dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.loc['Colorado', ['two', 'three']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"同iloc实现相同的效果:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"four 11\n",
"one 8\n",
"two 9\n",
"Name: Utah, dtype: int64"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[2, [3, 0, 1]]"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one 8\n",
"two 9\n",
"three 10\n",
"four 11\n",
"Name: Utah, dtype: int64"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[2] # 一行"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" four | \n",
" one | \n",
" two | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" 7 | \n",
" 0 | \n",
" 5 | \n",
"
\n",
" \n",
" Utah | \n",
" 11 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" four one two\n",
"Colorado 7 0 5\n",
"Utah 11 8 9"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[[1, 2], [3, 0, 1]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"indexing函数也能用于切片,不论是single labels或lists of labels:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 0\n",
"Colorado 5\n",
"Utah 9\n",
"Name: two, dtype: int64"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.loc[:'Utah', 'two']"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" Utah | \n",
" 8 | \n",
" 9 | \n",
" 10 | \n",
"
\n",
" \n",
" New York | \n",
" 12 | \n",
" 13 | \n",
" 14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two three\n",
"Colorado 0 5 6\n",
"Utah 8 9 10\n",
"New York 12 13 14"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[:, :3][data.three > 5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas中有很多用于选择和重新选择数据的方法:\n",
"\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/bwadf.png)\n",
"\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/lc2uc.png)\n",
"\n",
"注意:当设计padnas的时候,作者发现frame[:, col]这样的语法是比较冗长的,因为这是会被经常用到的一个功能。作者把一些indexing的功能(lable or integer)集成在了ix这个方法上。实际中,因为这种label和integer都可以用的方式很方便,于是pandas team设计了loc和ilco来实现label-based和integer-based indexing.\n",
"\n",
"虽然ix indexing依然错在,但是已经过时,不推荐使用。\n",
"\n",
"# 4 Integer Indexes(整数索引)\n",
"\n",
"一些新手再用integer来index的时候,总是会被绊倒。因为这种方法和python用于list和tuple的indexing方法不同。\n",
"\n",
"比如,你不希望下面的代码出现error:\n"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ser = pd.Series(np.arange(3.))"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0.0\n",
"1 1.0\n",
"2 2.0\n",
"dtype: float64"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ser"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "KeyError",
"evalue": "-1",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mser\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/series.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 581\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_apply_if_callable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 582\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 583\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_value\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 584\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 585\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mlib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0misscalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/pandas/indexes/base.py\u001b[0m in \u001b[0;36mget_value\u001b[0;34m(self, series, key)\u001b[0m\n\u001b[1;32m 1978\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1979\u001b[0m return self._engine.get_value(s, k,\n\u001b[0;32m-> 1980\u001b[0;31m tz=getattr(series.dtype, 'tz', None))\n\u001b[0m\u001b[1;32m 1981\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1982\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minferred_type\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'integer'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'boolean'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_value (pandas/index.c:3332)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_value (pandas/index.c:3035)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas/index.c:4018)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6610)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6554)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: -1"
]
}
],
"source": [
"ser[-1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"看到了,pandas在整数索引上可能会出错。这里我们有一个index包括0,1,2,但是猜测用户想要什么是很困难的:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0.0\n",
"1 1.0\n",
"2 2.0\n",
"dtype: float64"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ser"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"另一方面,如果用非整数来做index,就没有歧义了:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2.0"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])\n",
"ser2[-1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"为了保持连贯性,如果axis index里包含integer,那么选择数据的时候,就会是label-orented. 为了更精确地选择,使用`loc`(for label)或`ilco`(for integers):"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0.0\n",
"dtype: float64"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ser[:1]"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0.0\n",
"1 1.0\n",
"dtype: float64"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ser.loc[:1]"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0.0\n",
"dtype: float64"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ser.iloc[:1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5 Arithmetic and Data Alignment (算数和数据对齐)\n",
"\n",
"pandas一个有用的feature就是,不同index的obejct之间的算数计算。如果两个object相加,但他们各自的index并不相同,最后结果得到的index是这两个index的合集:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"s2 = pd.Series([2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 7.3\n",
"c -2.5\n",
"d 3.4\n",
"e 1.5\n",
"dtype: float64"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 2.1\n",
"c 3.6\n",
"e -1.5\n",
"f 4.0\n",
"g 3.1\n",
"dtype: float64"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s2"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 9.4\n",
"c 1.1\n",
"d NaN\n",
"e 0.0\n",
"f NaN\n",
"g NaN\n",
"dtype: float64"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 + s2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这种数据对齐的方式(internal data alignment)引入了很多缺失值在没有陈赫的位置上。这些缺失值会被用在之后的算数计算中。\n",
"\n",
"在DataFrame中,数据对齐同时发生在行和列上:"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),\n",
" index=['Ohio', 'Texas', 'Colorado'])"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),\n",
" index=['Utah', 'Ohio', 'Texas', 'Oregon'])"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" c | \n",
" d | \n",
"
\n",
" \n",
" \n",
" \n",
" Ohio | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" Texas | \n",
" 3.0 | \n",
" 4.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" Colorado | \n",
" 6.0 | \n",
" 7.0 | \n",
" 8.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b c d\n",
"Ohio 0.0 1.0 2.0\n",
"Texas 3.0 4.0 5.0\n",
"Colorado 6.0 7.0 8.0"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" Ohio | \n",
" 3.0 | \n",
" 4.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" Texas | \n",
" 6.0 | \n",
" 7.0 | \n",
" 8.0 | \n",
"
\n",
" \n",
" Oregon | \n",
" 9.0 | \n",
" 10.0 | \n",
" 11.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"Utah 0.0 1.0 2.0\n",
"Ohio 3.0 4.0 5.0\n",
"Texas 6.0 7.0 8.0\n",
"Oregon 9.0 10.0 11.0"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"相加的结果就是两个DataFrame,行和列的合集:"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" c | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Colorado | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" Ohio | \n",
" 3.0 | \n",
" NaN | \n",
" 6.0 | \n",
" NaN | \n",
"
\n",
" \n",
" Oregon | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" Texas | \n",
" 9.0 | \n",
" NaN | \n",
" 12.0 | \n",
" NaN | \n",
"
\n",
" \n",
" Utah | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b c d e\n",
"Colorado NaN NaN NaN NaN\n",
"Ohio 3.0 NaN 6.0 NaN\n",
"Oregon NaN NaN NaN NaN\n",
"Texas 9.0 NaN 12.0 NaN\n",
"Utah NaN NaN NaN NaN"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 + df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"因为'c'和'e'列都不在两个DataFrame里,所有全是缺失值。对于行,即使有相同的,但列不一样的话也会是缺失值。\n",
"\n",
"如果两个DataFrame相加,而且没有column和row,结果会全是null:"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1 = pd.DataFrame({'A': [1, 2]})\n",
"df2 = pd.DataFrame({'B': [3, 4]})"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A\n",
"0 1\n",
"1 2"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" B\n",
"0 3\n",
"1 4"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"0 NaN NaN\n",
"1 NaN NaN"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 - df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Arithmetic methods with fill values (带填充值的算数方法)\n",
"\n",
"对于上面那些缺失值,我们想要填上0:"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), \n",
" columns=list('abcd'))\n",
"\n",
"df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), \n",
" columns=list('abcde'))\n",
"\n",
"df2.loc[1, 'b'] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 4.0 | \n",
" 5.0 | \n",
" 6.0 | \n",
" 7.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 8.0 | \n",
" 9.0 | \n",
" 10.0 | \n",
" 11.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d\n",
"0 0.0 1.0 2.0 3.0\n",
"1 4.0 5.0 6.0 7.0\n",
"2 8.0 9.0 10.0 11.0"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 3.0 | \n",
" 4.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 5.0 | \n",
" NaN | \n",
" 7.0 | \n",
" 8.0 | \n",
" 9.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 10.0 | \n",
" 11.0 | \n",
" 12.0 | \n",
" 13.0 | \n",
" 14.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 15.0 | \n",
" 16.0 | \n",
" 17.0 | \n",
" 18.0 | \n",
" 19.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d e\n",
"0 0.0 1.0 2.0 3.0 4.0\n",
"1 5.0 NaN 7.0 8.0 9.0\n",
"2 10.0 11.0 12.0 13.0 14.0\n",
"3 15.0 16.0 17.0 18.0 19.0"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"不使用添加方法的结果:"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" 2.0 | \n",
" 4.0 | \n",
" 6.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 9.0 | \n",
" NaN | \n",
" 13.0 | \n",
" 15.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 18.0 | \n",
" 20.0 | \n",
" 22.0 | \n",
" 24.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d e\n",
"0 0.0 2.0 4.0 6.0 NaN\n",
"1 9.0 NaN 13.0 15.0 NaN\n",
"2 18.0 20.0 22.0 24.0 NaN\n",
"3 NaN NaN NaN NaN NaN"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 + df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用fill_value:"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" 2.0 | \n",
" 4.0 | \n",
" 6.0 | \n",
" 4.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 9.0 | \n",
" 5.0 | \n",
" 13.0 | \n",
" 15.0 | \n",
" 9.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 18.0 | \n",
" 20.0 | \n",
" 22.0 | \n",
" 24.0 | \n",
" 14.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 15.0 | \n",
" 16.0 | \n",
" 17.0 | \n",
" 18.0 | \n",
" 19.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d e\n",
"0 0.0 2.0 4.0 6.0 4.0\n",
"1 9.0 5.0 13.0 15.0 9.0\n",
"2 18.0 20.0 22.0 24.0 14.0\n",
"3 15.0 16.0 17.0 18.0 19.0"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.add(df2, fill_value=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下表中就有很多这样灵活的算数方法:\n",
"\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/y0rr4.png)\n",
"\n",
"每一个都有一个配对的,以 r 开头,意思是反转:"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" inf | \n",
" 1.000000 | \n",
" 0.500000 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.250000 | \n",
" 0.200000 | \n",
" 0.166667 | \n",
" 0.142857 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.125000 | \n",
" 0.111111 | \n",
" 0.100000 | \n",
" 0.090909 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d\n",
"0 inf 1.000000 0.500000 0.333333\n",
"1 0.250000 0.200000 0.166667 0.142857\n",
"2 0.125000 0.111111 0.100000 0.090909"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1 / df1"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" inf | \n",
" 1.000000 | \n",
" 0.500000 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.250000 | \n",
" 0.200000 | \n",
" 0.166667 | \n",
" 0.142857 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.125000 | \n",
" 0.111111 | \n",
" 0.100000 | \n",
" 0.090909 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d\n",
"0 inf 1.000000 0.500000 0.333333\n",
"1 0.250000 0.200000 0.166667 0.142857\n",
"2 0.125000 0.111111 0.100000 0.090909"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.rdiv(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在reindex(重建索引)的时候,也可以使用fill_value:"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 3.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 4.0 | \n",
" 5.0 | \n",
" 6.0 | \n",
" 7.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 8.0 | \n",
" 9.0 | \n",
" 10.0 | \n",
" 11.0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d e\n",
"0 0.0 1.0 2.0 3.0 0\n",
"1 4.0 5.0 6.0 7.0 0\n",
"2 8.0 9.0 10.0 11.0 0"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.reindex(columns=df2.columns, fill_value=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Operations between DataFrame and Series (DataFrame和Series之间的操作)\n",
"\n",
"先举个numpy的例子帮助理解,可以考虑成一个二维数组和它的一行:"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0., 1., 2., 3.],\n",
" [ 4., 5., 6., 7.],\n",
" [ 8., 9., 10., 11.]])"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr = np.arange(12.).reshape((3, 4))\n",
"arr"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0., 1., 2., 3.])"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr[0]"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0., 0., 0., 0.],\n",
" [ 4., 4., 4., 4.],\n",
" [ 8., 8., 8., 8.]])"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr - arr[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以看到,这个减法是用在了每一行上。这种操作叫broadcasting,在Appendix A有更详细的解释。DataFrame和Series的操作也类似:"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),\n",
" columns=list('bde'),\n",
" index=['Utah', 'Ohio', 'Texas', 'Oregon'])\n",
"series = frame.iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" Ohio | \n",
" 3.0 | \n",
" 4.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" Texas | \n",
" 6.0 | \n",
" 7.0 | \n",
" 8.0 | \n",
"
\n",
" \n",
" Oregon | \n",
" 9.0 | \n",
" 10.0 | \n",
" 11.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"Utah 0.0 1.0 2.0\n",
"Ohio 3.0 4.0 5.0\n",
"Texas 6.0 7.0 8.0\n",
"Oregon 9.0 10.0 11.0"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"b 0.0\n",
"d 1.0\n",
"e 2.0\n",
"Name: Utah, dtype: float64"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以理解为series的index与dataframe的列匹配,broadcasting down the rows(向下按行广播):"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" Ohio | \n",
" 3.0 | \n",
" 3.0 | \n",
" 3.0 | \n",
"
\n",
" \n",
" Texas | \n",
" 6.0 | \n",
" 6.0 | \n",
" 6.0 | \n",
"
\n",
" \n",
" Oregon | \n",
" 9.0 | \n",
" 9.0 | \n",
" 9.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"Utah 0.0 0.0 0.0\n",
"Ohio 3.0 3.0 3.0\n",
"Texas 6.0 6.0 6.0\n",
"Oregon 9.0 9.0 9.0"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame - series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果一个index既不在DataFrame的column中,也不再series里的index中,那么结果也是合集:"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
" f | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" 0.0 | \n",
" NaN | \n",
" 3.0 | \n",
" NaN | \n",
"
\n",
" \n",
" Ohio | \n",
" 3.0 | \n",
" NaN | \n",
" 6.0 | \n",
" NaN | \n",
"
\n",
" \n",
" Texas | \n",
" 6.0 | \n",
" NaN | \n",
" 9.0 | \n",
" NaN | \n",
"
\n",
" \n",
" Oregon | \n",
" 9.0 | \n",
" NaN | \n",
" 12.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e f\n",
"Utah 0.0 NaN 3.0 NaN\n",
"Ohio 3.0 NaN 6.0 NaN\n",
"Texas 6.0 NaN 9.0 NaN\n",
"Oregon 9.0 NaN 12.0 NaN"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"series2 = pd.Series(range(3), index=['b', 'e', 'f'])\n",
"frame + series2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果想要广播列,去匹配行,必须要用到算数方法:"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" Ohio | \n",
" 3.0 | \n",
" 4.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" Texas | \n",
" 6.0 | \n",
" 7.0 | \n",
" 8.0 | \n",
"
\n",
" \n",
" Oregon | \n",
" 9.0 | \n",
" 10.0 | \n",
" 11.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"Utah 0.0 1.0 2.0\n",
"Ohio 3.0 4.0 5.0\n",
"Texas 6.0 7.0 8.0\n",
"Oregon 9.0 10.0 11.0"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"series3 = frame['d']\n",
"frame"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Utah 1.0\n",
"Ohio 4.0\n",
"Texas 7.0\n",
"Oregon 10.0\n",
"Name: d, dtype: float64"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"series3"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" -1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" Ohio | \n",
" -1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" Texas | \n",
" -1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" Oregon | \n",
" -1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"Utah -1.0 0.0 1.0\n",
"Ohio -1.0 0.0 1.0\n",
"Texas -1.0 0.0 1.0\n",
"Oregon -1.0 0.0 1.0"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.sub(series3, axis='index')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"axis参数就是用来匹配轴的。在这个例子里是匹配dataframe的row index(`axis='index` or `axis=0`),然后再广播。\n",
"\n",
"# 6 Function Application and Mapping (函数应用和映射)\n",
"\n",
"numpy的ufuncs(element-wise数组方法)也能用在pandas的object上:"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" -0.810435 | \n",
" 0.194448 | \n",
" -0.705901 | \n",
"
\n",
" \n",
" Ohio | \n",
" -0.886275 | \n",
" 0.553640 | \n",
" 1.066754 | \n",
"
\n",
" \n",
" Texas | \n",
" 0.189898 | \n",
" -0.056108 | \n",
" -0.159926 | \n",
"
\n",
" \n",
" Oregon | \n",
" 0.448303 | \n",
" 0.439650 | \n",
" -1.351029 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"Utah -0.810435 0.194448 -0.705901\n",
"Ohio -0.886275 0.553640 1.066754\n",
"Texas 0.189898 -0.056108 -0.159926\n",
"Oregon 0.448303 0.439650 -1.351029"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), \n",
" index=['Utah', 'Ohio', 'Texas', 'Oregon'])\n",
"frame"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" 0.810435 | \n",
" 0.194448 | \n",
" 0.705901 | \n",
"
\n",
" \n",
" Ohio | \n",
" 0.886275 | \n",
" 0.553640 | \n",
" 1.066754 | \n",
"
\n",
" \n",
" Texas | \n",
" 0.189898 | \n",
" 0.056108 | \n",
" 0.159926 | \n",
"
\n",
" \n",
" Oregon | \n",
" 0.448303 | \n",
" 0.439650 | \n",
" 1.351029 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"Utah 0.810435 0.194448 0.705901\n",
"Ohio 0.886275 0.553640 1.066754\n",
"Texas 0.189898 0.056108 0.159926\n",
"Oregon 0.448303 0.439650 1.351029"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.abs(frame)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"另一个常用的操作是把一个用在一维数组上的函数,应用在一行或一列上。要用到DataFrame中的apply函数:"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"b 1.334579\n",
"d 0.609748\n",
"e 2.417783\n",
"dtype: float64"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"f = lambda x: x.max() - x.min()\n",
"frame.apply(f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这里函数f,计算的是一个series中最大值和最小值的差,在frame中的每一列,这个函数被调用一次。作为结果的series,它的index就是frame的column。\n",
"\n",
"如果你传入`axis='column'`用于apply,那么函数会被用在每一行:"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Utah 1.004883\n",
"Ohio 1.953030\n",
"Texas 0.349825\n",
"Oregon 1.799333\n",
"dtype: float64"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.apply(f, axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"像是sum, mean这样的数组统计方法,DataFrame中已经集成了,所以没必要用apply。\n",
"\n",
"apply不会返回标量,只会返回一个含有多个值的series:"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def f(x): \n",
" return pd.Series([x.min(), x.max()], index=['min', 'max'])"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" -0.810435 | \n",
" 0.194448 | \n",
" -0.705901 | \n",
"
\n",
" \n",
" Ohio | \n",
" -0.886275 | \n",
" 0.553640 | \n",
" 1.066754 | \n",
"
\n",
" \n",
" Texas | \n",
" 0.189898 | \n",
" -0.056108 | \n",
" -0.159926 | \n",
"
\n",
" \n",
" Oregon | \n",
" 0.448303 | \n",
" 0.439650 | \n",
" -1.351029 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"Utah -0.810435 0.194448 -0.705901\n",
"Ohio -0.886275 0.553640 1.066754\n",
"Texas 0.189898 -0.056108 -0.159926\n",
"Oregon 0.448303 0.439650 -1.351029"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" min | \n",
" -0.886275 | \n",
" -0.056108 | \n",
" -1.351029 | \n",
"
\n",
" \n",
" max | \n",
" 0.448303 | \n",
" 0.553640 | \n",
" 1.066754 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"min -0.886275 -0.056108 -1.351029\n",
"max 0.448303 0.553640 1.066754"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.apply(f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"element-wise的python函数也能用。假设想要格式化frame中的浮点数,变为string。可以用apply map:"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"format = lambda x: '%.2f' % x"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" b | \n",
" d | \n",
" e | \n",
"
\n",
" \n",
" \n",
" \n",
" Utah | \n",
" -0.81 | \n",
" 0.19 | \n",
" -0.71 | \n",
"
\n",
" \n",
" Ohio | \n",
" -0.89 | \n",
" 0.55 | \n",
" 1.07 | \n",
"
\n",
" \n",
" Texas | \n",
" 0.19 | \n",
" -0.06 | \n",
" -0.16 | \n",
"
\n",
" \n",
" Oregon | \n",
" 0.45 | \n",
" 0.44 | \n",
" -1.35 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" b d e\n",
"Utah -0.81 0.19 -0.71\n",
"Ohio -0.89 0.55 1.07\n",
"Texas 0.19 -0.06 -0.16\n",
"Oregon 0.45 0.44 -1.35"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.applymap(format)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"applymap的做法是,series有一个map函数,能用来实现element-wise函数:"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Utah -0.71\n",
"Ohio 1.07\n",
"Texas -0.16\n",
"Oregon -1.35\n",
"Name: e, dtype: object"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame['e'].map(format)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 7 Sorting and Ranking (排序)\n",
"\n",
"按row或column index来排序的话,可以用sort_index方法,会返回一个新的object:"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 1\n",
"b 2\n",
"c 3\n",
"d 0\n",
"dtype: int64"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])\n",
"obj.sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在DataFrame,可以用index或其他axis来排序:"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" d | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" three | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" one | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" d a b c\n",
"three 0 1 2 3\n",
"one 4 5 6 7"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame = pd.DataFrame(np.arange(8).reshape((2, 4)),\n",
" index=['three', 'one'],\n",
" columns=['d', 'a', 'b', 'c'])\n",
"frame"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" d | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
"
\n",
" \n",
" three | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" d a b c\n",
"one 4 5 6 7\n",
"three 0 1 2 3"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
"
\n",
" \n",
" \n",
" \n",
" three | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 0 | \n",
"
\n",
" \n",
" one | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d\n",
"three 1 2 3 0\n",
"one 5 6 7 4"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.sort_index(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"默认是升序,可以设置降序:"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" d | \n",
" c | \n",
" b | \n",
" a | \n",
"
\n",
" \n",
" \n",
" \n",
" three | \n",
" 0 | \n",
" 3 | \n",
" 2 | \n",
" 1 | \n",
"
\n",
" \n",
" one | \n",
" 4 | \n",
" 7 | \n",
" 6 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" d c b a\n",
"three 0 3 2 1\n",
"one 4 7 6 5"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.sort_index(axis=1, ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"通过值来排序,用sort_values方法:"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2 -3\n",
"3 2\n",
"0 4\n",
"1 7\n",
"dtype: int64"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = pd.Series([4, 7, -3, 2])\n",
"obj.sort_values()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"缺失值会被排在最后:"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4 -3.0\n",
"5 2.0\n",
"0 4.0\n",
"2 7.0\n",
"1 NaN\n",
"3 NaN\n",
"dtype: float64"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])\n",
"obj.sort_values()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对于一个DataFrame,可以用一列或多列作为sort keys。这样的话,只需要把一列多多列的名字导入到sort_values即可:"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 7 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" -3 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b\n",
"0 0 4\n",
"1 1 7\n",
"2 0 -3\n",
"3 1 2"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})\n",
"frame"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 0 | \n",
" -3 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b\n",
"2 0 -3\n",
"3 1 2\n",
"0 0 4\n",
"1 1 7"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.sort_values(by='b')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"多列排序的话,传入一个list of names:"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 0 | \n",
" -3 | \n",
"
\n",
" \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b\n",
"2 0 -3\n",
"0 0 4\n",
"3 1 2\n",
"1 1 7"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.sort_values(by=['a', 'b'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ranking(排名)是给有效的数据分配数字。rank方法能用于series和DataFrame,rank方法默认会给每个group一个mean rank(平均排名)。rank 表示在这个数在原来的Series中排第几名,有相同的数,取其排名平均(默认)作为值:"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 7\n",
"1 -5\n",
"2 7\n",
"3 4\n",
"4 2\n",
"5 0\n",
"6 4\n",
"dtype: int64"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = pd.Series([7, -5, 7, 4, 2, 0, 4])\n",
"obj"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1 -5\n",
"5 0\n",
"4 2\n",
"3 4\n",
"6 4\n",
"0 7\n",
"2 7\n",
"dtype: int64"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.sort_values()"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 6.5\n",
"1 1.0\n",
"2 6.5\n",
"3 4.5\n",
"4 3.0\n",
"5 2.0\n",
"6 4.5\n",
"dtype: float64"
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.rank()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在obj中,4和4的排名是第4名和第五名,取平均得4.5。7和7的排名分别是第六名和第七名,则其排名取平均得6.5。\n",
"\n",
"rank也可以根据数据被观测到的顺序来设定:"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 7\n",
"1 -5\n",
"2 7\n",
"3 4\n",
"4 2\n",
"5 0\n",
"6 4\n",
"dtype: int64"
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 6.0\n",
"1 1.0\n",
"2 7.0\n",
"3 4.0\n",
"4 3.0\n",
"5 2.0\n",
"6 5.0\n",
"dtype: float64"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.rank(method='first')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这里没有给0和2(指两个数字7)赋予average rank 6.5,而是给第一个看到的7(label 0)设置rank为6,第二个看到的7(label 2)设置rank为7。\n",
"\n",
"也可以设置降序:"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 2.0\n",
"1 7.0\n",
"2 2.0\n",
"3 4.0\n",
"4 5.0\n",
"5 6.0\n",
"6 4.0\n",
"dtype: float64"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Assign tie values the maximum rank in the group\n",
"obj.rank(ascending=False, method='max')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"dataframe 可以根据行或列来计算rank:"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 4.3 | \n",
" -2.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 7.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" -3.0 | \n",
" 8.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 2.0 | \n",
" -2.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c\n",
"0 0 4.3 -2.0\n",
"1 1 7.0 5.0\n",
"2 0 -3.0 8.0\n",
"3 1 2.0 -2.5"
]
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame = pd.DataFrame({'b': [4.3, 7, -3, 2],\n",
" 'a': [0, 1, 0, 1],\n",
" 'c': [-2, 5, 8, -2.5]})\n",
"frame"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2.0 | \n",
" 3.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 3.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 2.0 | \n",
" 1.0 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2.0 | \n",
" 3.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c\n",
"0 2.0 3.0 1.0\n",
"1 1.0 3.0 2.0\n",
"2 2.0 1.0 3.0\n",
"3 2.0 3.0 1.0"
]
},
"execution_count": 144,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.rank(axis='columns') # columns表示列与列之间的排序(即每一行里数据间的排序)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/6xv9c.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 8 Axis Indexes with Duplicate Labels (有重复label的轴索引)\n",
"\n",
"我们看到的所有例子都有unique axis labels(index values),唯一的轴标签(索引值)。一些pandas函数(reindex),需要label是唯一的,但这并是不强制的。比如下面有一个重复的索引:"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0\n",
"a 1\n",
"b 2\n",
"b 3\n",
"c 4\n",
"dtype: int64"
]
},
"execution_count": 145,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])\n",
"obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"index的is_unique特性能告诉我们label是否是唯一的:"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.index.is_unique"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数据选择对于重复label则表现有点不同。如果一个label有多个值,那么就会返回一个series, 如果是label只对应一个值的话,会返回一个标量:"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0\n",
"a 1\n",
"dtype: int64"
]
},
"execution_count": 147,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['a']"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 148,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['c']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这个选择的逻辑也应用于DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" -0.314526 | \n",
" -1.313861 | \n",
" 0.823529 | \n",
"
\n",
" \n",
" a | \n",
" 0.994028 | \n",
" -0.442338 | \n",
" -0.846985 | \n",
"
\n",
" \n",
" b | \n",
" -1.340453 | \n",
" -0.031612 | \n",
" 0.044791 | \n",
"
\n",
" \n",
" b | \n",
" -0.919341 | \n",
" -0.409164 | \n",
" -1.297257 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"a -0.314526 -1.313861 0.823529\n",
"a 0.994028 -0.442338 -0.846985\n",
"b -1.340453 -0.031612 0.044791\n",
"b -0.919341 -0.409164 -1.297257"
]
},
"execution_count": 149,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" -1.340453 | \n",
" -0.031612 | \n",
" 0.044791 | \n",
"
\n",
" \n",
" b | \n",
" -0.919341 | \n",
" -0.409164 | \n",
" -1.297257 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"b -1.340453 -0.031612 0.044791\n",
"b -0.919341 -0.409164 -1.297257"
]
},
"execution_count": 150,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['b']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [py35]",
"language": "python",
"name": "Python [py35]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}