{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.2 Essential Functionality(主要功能)\n", "\n", "接下来介绍pandas中的一些主要功能,这里只介绍一些经常用到的。\n", "\n", "# 1 Reindexing(重新索引)\n", "\n", "pandas中一个重要的方法是reindex,已实施在创建object的时候遵照一个新的index。如下例:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 4.5\n", "b 7.2\n", "a -5.3\n", "c 3.6\n", "dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])\n", "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在series上调用reindex能更改index,如果没有对应index的话会引入缺失数据:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a -5.3\n", "b 7.2\n", "c 3.6\n", "d 4.5\n", "e NaN\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])\n", "obj2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在处理时间序列这样的数据时,我们可能需要在reindexing的时候需要修改值。method选项能做到这一点,比如设定method为ffill:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 bule\n", "2 purple\n", "4 yellow\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3 = pd.Series(['bule', 'purple', 'yellow'], index=[0, 2, 4])\n", "obj3" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 bule\n", "1 bule\n", "2 purple\n", "3 purple\n", "4 yellow\n", "5 yellow\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3.reindex(range(6), method='ffill')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于DataFrame,reindex能更改row index,或column index。reindex the rows:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "frame = pd.DataFrame(np.arange(9).reshape(3, 3),\n", " index=['a', 'c', 'd'],\n", " columns=['Ohio', 'Texas', 'California'])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a012
c345
d678
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 0 1 2\n", "c 3 4 5\n", "d 6 7 8" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a0.01.02.0
bNaNNaNNaN
c3.04.05.0
d6.07.08.0
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 0.0 1.0 2.0\n", "b NaN NaN NaN\n", "c 3.0 4.0 5.0\n", "d 6.0 7.0 8.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2 = frame.reindex(['a', 'b', 'c', 'd'])\n", "frame2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "更改columns index:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "states = ['Texas', 'Utah', 'California']" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TexasUtahCalifornia
a1NaN2
c4NaN5
d7NaN8
\n", "
" ], "text/plain": [ " Texas Utah California\n", "a 1 NaN 2\n", "c 4 NaN 5\n", "d 7 NaN 8" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.reindex(columns=states)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "reindex的参数:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/x0pq4.png)\n", "\n", "还可以使用loc更简洁的reindex:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TexasUtahCalifornia
a1.0NaN2.0
bNaNNaNNaN
c4.0NaN5.0
d7.0NaN8.0
\n", "
" ], "text/plain": [ " Texas Utah California\n", "a 1.0 NaN 2.0\n", "b NaN NaN NaN\n", "c 4.0 NaN 5.0\n", "d 7.0 NaN 8.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.loc[['a', 'b', 'c', 'd'], states]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 Dropping Entries from an Axis (按轴删除记录) \n", "\n", "对于series,drop回返回一个新的object,并删去你制定的axis的值:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0.0\n", "b 1.0\n", "c 2.0\n", "d 3.0\n", "e 4.0\n", "dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])\n", "obj" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0.0\n", "b 1.0\n", "d 3.0\n", "e 4.0\n", "dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_obj = obj.drop('c')\n", "new_obj" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0.0\n", "b 1.0\n", "e 4.0\n", "dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.drop(['d', 'c'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于DataFrame,index能按行或列的axis来删除:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0123
Colorado4567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 1 2 3\n", "Colorado 4 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame(np.arange(16).reshape(4, 4),\n", " index=['Ohio', 'Colorado', 'Utah', 'New York'],\n", " columns=['one', 'two', 'three', 'four'])\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "行处理:如果a sequence of labels(一个标签序列)来调用drop,会删去row labels(axis 0):" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop(['Colorado', 'Ohio'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "列处理:drop列的话,设定axis=1或axis='columns':" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onethreefour
Ohio023
Colorado467
Utah81011
New York121415
\n", "
" ], "text/plain": [ " one three four\n", "Ohio 0 2 3\n", "Colorado 4 6 7\n", "Utah 8 10 11\n", "New York 12 14 15" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop('two', axis=1)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onethree
Ohio02
Colorado46
Utah810
New York1214
\n", "
" ], "text/plain": [ " one three\n", "Ohio 0 2\n", "Colorado 4 6\n", "Utah 8 10\n", "New York 12 14" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop(['two', 'four'], axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "drop也可以不返回一个新的object,而是直接更改series or dataframe in-place:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0.0\n", "b 1.0\n", "d 3.0\n", "e 4.0\n", "dtype: float64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.drop('c', inplace=True)\n", "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3 Indexing, Selection, and Filtering(索引,选择,过滤)\n", "\n", "series indexing(obj[...]) 相当于numpy的array indexing, 而且除了整数,还可以使用series的index:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0.0\n", "b 1.0\n", "c 2.0\n", "d 3.0\n", "dtype: float64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])\n", "obj" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj['b']" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj[1]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "c 2.0\n", "d 3.0\n", "dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj[2:4]" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 1.0\n", "a 0.0\n", "d 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 选中行\n", "obj[['b', 'a', 'd']]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 1.0\n", "d 3.0\n", "dtype: float64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj[[1, 3]]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0.0\n", "b 1.0\n", "dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj[obj < 2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "用label来slicing(切片)的时候,和python的切片不一样的在于,会包括尾节点:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 1.0\n", "c 2.0\n", "dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj['b':'c']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以直接给选中的label更改值:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0.0\n", "b 5.0\n", "c 5.0\n", "d 3.0\n", "dtype: float64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj['b':'c'] = 5\n", "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "而对于DataFrame,indexing可以通过一个值或序列,选中一个以上的列:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0123
Colorado4567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 1 2 3\n", "Colorado 4 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame(np.arange(16).reshape((4, 4)),\n", " index=['Ohio', 'Colorado', 'Utah', 'New York'],\n", " columns=['one', 'two', 'three', 'four'])\n", "data" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ohio 1\n", "Colorado 5\n", "Utah 9\n", "New York 13\n", "Name: two, dtype: int64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['two']" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
threeone
Ohio20
Colorado64
Utah108
New York1412
\n", "
" ], "text/plain": [ " three one\n", "Ohio 2 0\n", "Colorado 6 4\n", "Utah 10 8\n", "New York 14 12" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['three', 'one']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dataframe的indexing有一些比较特别的方式。比如通过布尔数组:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0123
Colorado4567
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 1 2 3\n", "Colorado 4 5 6 7" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[:2]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Colorado4567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Colorado 4 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data['three'] > 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "行选择的语法格式`data[:2]`是很方便的。给`[]`里传入一个list的话,可以选择列。\n", "\n", "另一种方法是用boolean dataframe:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
OhioTrueTrueTrueTrue
ColoradoTrueFalseFalseFalse
UtahFalseFalseFalseFalse
New YorkFalseFalseFalseFalse
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio True True True True\n", "Colorado True False False False\n", "Utah False False False False\n", "New York False False False False" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data < 5" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0000
Colorado0567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 0 0 0\n", "Colorado 0 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data < 5] = 0\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection with loc and iloc(用loc和iloc来选择)\n", "\n", "对于label-indexing on rows, 我们介绍特别的索引符,loc and iloc. 这两个方法能通过axis labels(loc)或integer(iloc),来选择行或列。\n", "\n", "一个列子,选中一行多列by label:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0000
Colorado0567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 0 0 0\n", "Colorado 0 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "two 5\n", "three 6\n", "Name: Colorado, dtype: int64" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc['Colorado', ['two', 'three']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "同iloc实现相同的效果:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "four 11\n", "one 8\n", "two 9\n", "Name: Utah, dtype: int64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[2, [3, 0, 1]]" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one 8\n", "two 9\n", "three 10\n", "four 11\n", "Name: Utah, dtype: int64" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[2] # 一行" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fouronetwo
Colorado705
Utah1189
\n", "
" ], "text/plain": [ " four one two\n", "Colorado 7 0 5\n", "Utah 11 8 9" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[[1, 2], [3, 0, 1]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "indexing函数也能用于切片,不论是single labels或lists of labels:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ohio 0\n", "Colorado 5\n", "Utah 9\n", "Name: two, dtype: int64" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[:'Utah', 'two']" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothree
Colorado056
Utah8910
New York121314
\n", "
" ], "text/plain": [ " one two three\n", "Colorado 0 5 6\n", "Utah 8 9 10\n", "New York 12 13 14" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[:, :3][data.three > 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas中有很多用于选择和重新选择数据的方法:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/bwadf.png)\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/lc2uc.png)\n", "\n", "注意:当设计padnas的时候,作者发现frame[:, col]这样的语法是比较冗长的,因为这是会被经常用到的一个功能。作者把一些indexing的功能(lable or integer)集成在了ix这个方法上。实际中,因为这种label和integer都可以用的方式很方便,于是pandas team设计了loc和ilco来实现label-based和integer-based indexing.\n", "\n", "虽然ix indexing依然错在,但是已经过时,不推荐使用。\n", "\n", "# 4 Integer Indexes(整数索引)\n", "\n", "一些新手再用integer来index的时候,总是会被绊倒。因为这种方法和python用于list和tuple的indexing方法不同。\n", "\n", "比如,你不希望下面的代码出现error:\n" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ser = pd.Series(np.arange(3.))" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "dtype: float64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "ename": "KeyError", "evalue": "-1", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mser\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/pandas/core/series.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 581\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_apply_if_callable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 582\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 583\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_value\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 584\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 585\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mlib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0misscalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/pandas/indexes/base.py\u001b[0m in \u001b[0;36mget_value\u001b[0;34m(self, series, key)\u001b[0m\n\u001b[1;32m 1978\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1979\u001b[0m return self._engine.get_value(s, k,\n\u001b[0;32m-> 1980\u001b[0;31m tz=getattr(series.dtype, 'tz', None))\n\u001b[0m\u001b[1;32m 1981\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1982\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minferred_type\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'integer'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'boolean'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_value (pandas/index.c:3332)\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_value (pandas/index.c:3035)\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas/index.c:4018)\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6610)\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6554)\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: -1" ] } ], "source": [ "ser[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "看到了,pandas在整数索引上可能会出错。这里我们有一个index包括0,1,2,但是猜测用户想要什么是很困难的:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "dtype: float64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另一方面,如果用非整数来做index,就没有歧义了:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2.0" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])\n", "ser2[-1]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "为了保持连贯性,如果axis index里包含integer,那么选择数据的时候,就会是label-orented. 为了更精确地选择,使用`loc`(for label)或`ilco`(for integers):" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "dtype: float64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser[:1]" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "dtype: float64" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser.loc[:1]" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "dtype: float64" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser.iloc[:1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5 Arithmetic and Data Alignment (算数和数据对齐)\n", "\n", "pandas一个有用的feature就是,不同index的obejct之间的算数计算。如果两个object相加,但他们各自的index并不相同,最后结果得到的index是这两个index的合集:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s2 = pd.Series([2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 7.3\n", "c -2.5\n", "d 3.4\n", "e 1.5\n", "dtype: float64" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 2.1\n", "c 3.6\n", "e -1.5\n", "f 4.0\n", "g 3.1\n", "dtype: float64" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 9.4\n", "c 1.1\n", "d NaN\n", "e 0.0\n", "f NaN\n", "g NaN\n", "dtype: float64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 + s2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这种数据对齐的方式(internal data alignment)引入了很多缺失值在没有陈赫的位置上。这些缺失值会被用在之后的算数计算中。\n", "\n", "在DataFrame中,数据对齐同时发生在行和列上:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),\n", " index=['Ohio', 'Texas', 'Colorado'])" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),\n", " index=['Utah', 'Ohio', 'Texas', 'Oregon'])" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bcd
Ohio0.01.02.0
Texas3.04.05.0
Colorado6.07.08.0
\n", "
" ], "text/plain": [ " b c d\n", "Ohio 0.0 1.0 2.0\n", "Texas 3.0 4.0 5.0\n", "Colorado 6.0 7.0 8.0" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah0.01.02.0
Ohio3.04.05.0
Texas6.07.08.0
Oregon9.010.011.0
\n", "
" ], "text/plain": [ " b d e\n", "Utah 0.0 1.0 2.0\n", "Ohio 3.0 4.0 5.0\n", "Texas 6.0 7.0 8.0\n", "Oregon 9.0 10.0 11.0" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "相加的结果就是两个DataFrame,行和列的合集:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bcde
ColoradoNaNNaNNaNNaN
Ohio3.0NaN6.0NaN
OregonNaNNaNNaNNaN
Texas9.0NaN12.0NaN
UtahNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " b c d e\n", "Colorado NaN NaN NaN NaN\n", "Ohio 3.0 NaN 6.0 NaN\n", "Oregon NaN NaN NaN NaN\n", "Texas 9.0 NaN 12.0 NaN\n", "Utah NaN NaN NaN NaN" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 + df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "因为'c'和'e'列都不在两个DataFrame里,所有全是缺失值。对于行,即使有相同的,但列不一样的话也会是缺失值。\n", "\n", "如果两个DataFrame相加,而且没有column和row,结果会全是null:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df1 = pd.DataFrame({'A': [1, 2]})\n", "df2 = pd.DataFrame({'B': [3, 4]})" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
01
12
\n", "
" ], "text/plain": [ " A\n", "0 1\n", "1 2" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
B
03
14
\n", "
" ], "text/plain": [ " B\n", "0 3\n", "1 4" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0NaNNaN
1NaNNaN
\n", "
" ], "text/plain": [ " A B\n", "0 NaN NaN\n", "1 NaN NaN" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 - df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Arithmetic methods with fill values (带填充值的算数方法)\n", "\n", "对于上面那些缺失值,我们想要填上0:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), \n", " columns=list('abcd'))\n", "\n", "df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), \n", " columns=list('abcde'))\n", "\n", "df2.loc[1, 'b'] = np.nan" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
00.01.02.03.0
14.05.06.07.0
28.09.010.011.0
\n", "
" ], "text/plain": [ " a b c d\n", "0 0.0 1.0 2.0 3.0\n", "1 4.0 5.0 6.0 7.0\n", "2 8.0 9.0 10.0 11.0" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
00.01.02.03.04.0
15.0NaN7.08.09.0
210.011.012.013.014.0
315.016.017.018.019.0
\n", "
" ], "text/plain": [ " a b c d e\n", "0 0.0 1.0 2.0 3.0 4.0\n", "1 5.0 NaN 7.0 8.0 9.0\n", "2 10.0 11.0 12.0 13.0 14.0\n", "3 15.0 16.0 17.0 18.0 19.0" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "不使用添加方法的结果:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
00.02.04.06.0NaN
19.0NaN13.015.0NaN
218.020.022.024.0NaN
3NaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " a b c d e\n", "0 0.0 2.0 4.0 6.0 NaN\n", "1 9.0 NaN 13.0 15.0 NaN\n", "2 18.0 20.0 22.0 24.0 NaN\n", "3 NaN NaN NaN NaN NaN" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 + df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用fill_value:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
00.02.04.06.04.0
19.05.013.015.09.0
218.020.022.024.014.0
315.016.017.018.019.0
\n", "
" ], "text/plain": [ " a b c d e\n", "0 0.0 2.0 4.0 6.0 4.0\n", "1 9.0 5.0 13.0 15.0 9.0\n", "2 18.0 20.0 22.0 24.0 14.0\n", "3 15.0 16.0 17.0 18.0 19.0" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.add(df2, fill_value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下表中就有很多这样灵活的算数方法:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/y0rr4.png)\n", "\n", "每一个都有一个配对的,以 r 开头,意思是反转:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
0inf1.0000000.5000000.333333
10.2500000.2000000.1666670.142857
20.1250000.1111110.1000000.090909
\n", "
" ], "text/plain": [ " a b c d\n", "0 inf 1.000000 0.500000 0.333333\n", "1 0.250000 0.200000 0.166667 0.142857\n", "2 0.125000 0.111111 0.100000 0.090909" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 / df1" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
0inf1.0000000.5000000.333333
10.2500000.2000000.1666670.142857
20.1250000.1111110.1000000.090909
\n", "
" ], "text/plain": [ " a b c d\n", "0 inf 1.000000 0.500000 0.333333\n", "1 0.250000 0.200000 0.166667 0.142857\n", "2 0.125000 0.111111 0.100000 0.090909" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.rdiv(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在reindex(重建索引)的时候,也可以使用fill_value:" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
00.01.02.03.00
14.05.06.07.00
28.09.010.011.00
\n", "
" ], "text/plain": [ " a b c d e\n", "0 0.0 1.0 2.0 3.0 0\n", "1 4.0 5.0 6.0 7.0 0\n", "2 8.0 9.0 10.0 11.0 0" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.reindex(columns=df2.columns, fill_value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Operations between DataFrame and Series (DataFrame和Series之间的操作)\n", "\n", "先举个numpy的例子帮助理解,可以考虑成一个二维数组和它的一行:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 1., 2., 3.],\n", " [ 4., 5., 6., 7.],\n", " [ 8., 9., 10., 11.]])" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arr = np.arange(12.).reshape((3, 4))\n", "arr" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0., 1., 2., 3.])" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arr[0]" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 0., 0.],\n", " [ 4., 4., 4., 4.],\n", " [ 8., 8., 8., 8.]])" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arr - arr[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以看到,这个减法是用在了每一行上。这种操作叫broadcasting,在Appendix A有更详细的解释。DataFrame和Series的操作也类似:" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": true }, "outputs": [], "source": [ "frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),\n", " columns=list('bde'),\n", " index=['Utah', 'Ohio', 'Texas', 'Oregon'])\n", "series = frame.iloc[0]" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah0.01.02.0
Ohio3.04.05.0
Texas6.07.08.0
Oregon9.010.011.0
\n", "
" ], "text/plain": [ " b d e\n", "Utah 0.0 1.0 2.0\n", "Ohio 3.0 4.0 5.0\n", "Texas 6.0 7.0 8.0\n", "Oregon 9.0 10.0 11.0" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 0.0\n", "d 1.0\n", "e 2.0\n", "Name: Utah, dtype: float64" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以理解为series的index与dataframe的列匹配,broadcasting down the rows(向下按行广播):" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah0.00.00.0
Ohio3.03.03.0
Texas6.06.06.0
Oregon9.09.09.0
\n", "
" ], "text/plain": [ " b d e\n", "Utah 0.0 0.0 0.0\n", "Ohio 3.0 3.0 3.0\n", "Texas 6.0 6.0 6.0\n", "Oregon 9.0 9.0 9.0" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame - series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果一个index既不在DataFrame的column中,也不再series里的index中,那么结果也是合集:" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bdef
Utah0.0NaN3.0NaN
Ohio3.0NaN6.0NaN
Texas6.0NaN9.0NaN
Oregon9.0NaN12.0NaN
\n", "
" ], "text/plain": [ " b d e f\n", "Utah 0.0 NaN 3.0 NaN\n", "Ohio 3.0 NaN 6.0 NaN\n", "Texas 6.0 NaN 9.0 NaN\n", "Oregon 9.0 NaN 12.0 NaN" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series2 = pd.Series(range(3), index=['b', 'e', 'f'])\n", "frame + series2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果想要广播列,去匹配行,必须要用到算数方法:" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah0.01.02.0
Ohio3.04.05.0
Texas6.07.08.0
Oregon9.010.011.0
\n", "
" ], "text/plain": [ " b d e\n", "Utah 0.0 1.0 2.0\n", "Ohio 3.0 4.0 5.0\n", "Texas 6.0 7.0 8.0\n", "Oregon 9.0 10.0 11.0" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series3 = frame['d']\n", "frame" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Utah 1.0\n", "Ohio 4.0\n", "Texas 7.0\n", "Oregon 10.0\n", "Name: d, dtype: float64" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series3" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah-1.00.01.0
Ohio-1.00.01.0
Texas-1.00.01.0
Oregon-1.00.01.0
\n", "
" ], "text/plain": [ " b d e\n", "Utah -1.0 0.0 1.0\n", "Ohio -1.0 0.0 1.0\n", "Texas -1.0 0.0 1.0\n", "Oregon -1.0 0.0 1.0" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sub(series3, axis='index')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "axis参数就是用来匹配轴的。在这个例子里是匹配dataframe的row index(`axis='index` or `axis=0`),然后再广播。\n", "\n", "# 6 Function Application and Mapping (函数应用和映射)\n", "\n", "numpy的ufuncs(element-wise数组方法)也能用在pandas的object上:" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah-0.8104350.194448-0.705901
Ohio-0.8862750.5536401.066754
Texas0.189898-0.056108-0.159926
Oregon0.4483030.439650-1.351029
\n", "
" ], "text/plain": [ " b d e\n", "Utah -0.810435 0.194448 -0.705901\n", "Ohio -0.886275 0.553640 1.066754\n", "Texas 0.189898 -0.056108 -0.159926\n", "Oregon 0.448303 0.439650 -1.351029" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), \n", " index=['Utah', 'Ohio', 'Texas', 'Oregon'])\n", "frame" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah0.8104350.1944480.705901
Ohio0.8862750.5536401.066754
Texas0.1898980.0561080.159926
Oregon0.4483030.4396501.351029
\n", "
" ], "text/plain": [ " b d e\n", "Utah 0.810435 0.194448 0.705901\n", "Ohio 0.886275 0.553640 1.066754\n", "Texas 0.189898 0.056108 0.159926\n", "Oregon 0.448303 0.439650 1.351029" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(frame)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另一个常用的操作是把一个用在一维数组上的函数,应用在一行或一列上。要用到DataFrame中的apply函数:" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 1.334579\n", "d 0.609748\n", "e 2.417783\n", "dtype: float64" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = lambda x: x.max() - x.min()\n", "frame.apply(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里函数f,计算的是一个series中最大值和最小值的差,在frame中的每一列,这个函数被调用一次。作为结果的series,它的index就是frame的column。\n", "\n", "如果你传入`axis='column'`用于apply,那么函数会被用在每一行:" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Utah 1.004883\n", "Ohio 1.953030\n", "Texas 0.349825\n", "Oregon 1.799333\n", "dtype: float64" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.apply(f, axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "像是sum, mean这样的数组统计方法,DataFrame中已经集成了,所以没必要用apply。\n", "\n", "apply不会返回标量,只会返回一个含有多个值的series:" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def f(x): \n", " return pd.Series([x.min(), x.max()], index=['min', 'max'])" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah-0.8104350.194448-0.705901
Ohio-0.8862750.5536401.066754
Texas0.189898-0.056108-0.159926
Oregon0.4483030.439650-1.351029
\n", "
" ], "text/plain": [ " b d e\n", "Utah -0.810435 0.194448 -0.705901\n", "Ohio -0.886275 0.553640 1.066754\n", "Texas 0.189898 -0.056108 -0.159926\n", "Oregon 0.448303 0.439650 -1.351029" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
min-0.886275-0.056108-1.351029
max0.4483030.5536401.066754
\n", "
" ], "text/plain": [ " b d e\n", "min -0.886275 -0.056108 -1.351029\n", "max 0.448303 0.553640 1.066754" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.apply(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "element-wise的python函数也能用。假设想要格式化frame中的浮点数,变为string。可以用apply map:" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": true }, "outputs": [], "source": [ "format = lambda x: '%.2f' % x" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah-0.810.19-0.71
Ohio-0.890.551.07
Texas0.19-0.06-0.16
Oregon0.450.44-1.35
\n", "
" ], "text/plain": [ " b d e\n", "Utah -0.81 0.19 -0.71\n", "Ohio -0.89 0.55 1.07\n", "Texas 0.19 -0.06 -0.16\n", "Oregon 0.45 0.44 -1.35" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.applymap(format)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "applymap的做法是,series有一个map函数,能用来实现element-wise函数:" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Utah -0.71\n", "Ohio 1.07\n", "Texas -0.16\n", "Oregon -1.35\n", "Name: e, dtype: object" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame['e'].map(format)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7 Sorting and Ranking (排序)\n", "\n", "按row或column index来排序的话,可以用sort_index方法,会返回一个新的object:" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 2\n", "c 3\n", "d 0\n", "dtype: int64" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])\n", "obj.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在DataFrame,可以用index或其他axis来排序:" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dabc
three0123
one4567
\n", "
" ], "text/plain": [ " d a b c\n", "three 0 1 2 3\n", "one 4 5 6 7" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = pd.DataFrame(np.arange(8).reshape((2, 4)),\n", " index=['three', 'one'],\n", " columns=['d', 'a', 'b', 'c'])\n", "frame" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dabc
one4567
three0123
\n", "
" ], "text/plain": [ " d a b c\n", "one 4 5 6 7\n", "three 0 1 2 3" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sort_index()" ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
three1230
one5674
\n", "
" ], "text/plain": [ " a b c d\n", "three 1 2 3 0\n", "one 5 6 7 4" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sort_index(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "默认是升序,可以设置降序:" ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dcba
three0321
one4765
\n", "
" ], "text/plain": [ " d c b a\n", "three 0 3 2 1\n", "one 4 7 6 5" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sort_index(axis=1, ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "通过值来排序,用sort_values方法:" ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2 -3\n", "3 2\n", "0 4\n", "1 7\n", "dtype: int64" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series([4, 7, -3, 2])\n", "obj.sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "缺失值会被排在最后:" ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4 -3.0\n", "5 2.0\n", "0 4.0\n", "2 7.0\n", "1 NaN\n", "3 NaN\n", "dtype: float64" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])\n", "obj.sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于一个DataFrame,可以用一列或多列作为sort keys。这样的话,只需要把一列多多列的名字导入到sort_values即可:" ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
004
117
20-3
312
\n", "
" ], "text/plain": [ " a b\n", "0 0 4\n", "1 1 7\n", "2 0 -3\n", "3 1 2" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})\n", "frame" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
20-3
312
004
117
\n", "
" ], "text/plain": [ " a b\n", "2 0 -3\n", "3 1 2\n", "0 0 4\n", "1 1 7" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sort_values(by='b')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "多列排序的话,传入一个list of names:" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
20-3
004
312
117
\n", "
" ], "text/plain": [ " a b\n", "2 0 -3\n", "0 0 4\n", "3 1 2\n", "1 1 7" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sort_values(by=['a', 'b'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ranking(排名)是给有效的数据分配数字。rank方法能用于series和DataFrame,rank方法默认会给每个group一个mean rank(平均排名)。rank 表示在这个数在原来的Series中排第几名,有相同的数,取其排名平均(默认)作为值:" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 7\n", "1 -5\n", "2 7\n", "3 4\n", "4 2\n", "5 0\n", "6 4\n", "dtype: int64" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series([7, -5, 7, 4, 2, 0, 4])\n", "obj" ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 -5\n", "5 0\n", "4 2\n", "3 4\n", "6 4\n", "0 7\n", "2 7\n", "dtype: int64" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.sort_values()" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 6.5\n", "1 1.0\n", "2 6.5\n", "3 4.5\n", "4 3.0\n", "5 2.0\n", "6 4.5\n", "dtype: float64" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在obj中,4和4的排名是第4名和第五名,取平均得4.5。7和7的排名分别是第六名和第七名,则其排名取平均得6.5。\n", "\n", "rank也可以根据数据被观测到的顺序来设定:" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 7\n", "1 -5\n", "2 7\n", "3 4\n", "4 2\n", "5 0\n", "6 4\n", "dtype: int64" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj" ] }, { "cell_type": "code", "execution_count": 137, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 6.0\n", "1 1.0\n", "2 7.0\n", "3 4.0\n", "4 3.0\n", "5 2.0\n", "6 5.0\n", "dtype: float64" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.rank(method='first')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里没有给0和2(指两个数字7)赋予average rank 6.5,而是给第一个看到的7(label 0)设置rank为6,第二个看到的7(label 2)设置rank为7。\n", "\n", "也可以设置降序:" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 2.0\n", "1 7.0\n", "2 2.0\n", "3 4.0\n", "4 5.0\n", "5 6.0\n", "6 4.0\n", "dtype: float64" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Assign tie values the maximum rank in the group\n", "obj.rank(ascending=False, method='max')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dataframe 可以根据行或列来计算rank:" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
004.3-2.0
117.05.0
20-3.08.0
312.0-2.5
\n", "
" ], "text/plain": [ " a b c\n", "0 0 4.3 -2.0\n", "1 1 7.0 5.0\n", "2 0 -3.0 8.0\n", "3 1 2.0 -2.5" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = pd.DataFrame({'b': [4.3, 7, -3, 2],\n", " 'a': [0, 1, 0, 1],\n", " 'c': [-2, 5, 8, -2.5]})\n", "frame" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
02.03.01.0
11.03.02.0
22.01.03.0
32.03.01.0
\n", "
" ], "text/plain": [ " a b c\n", "0 2.0 3.0 1.0\n", "1 1.0 3.0 2.0\n", "2 2.0 1.0 3.0\n", "3 2.0 3.0 1.0" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.rank(axis='columns') # columns表示列与列之间的排序(即每一行里数据间的排序)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/6xv9c.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8 Axis Indexes with Duplicate Labels (有重复label的轴索引)\n", "\n", "我们看到的所有例子都有unique axis labels(index values),唯一的轴标签(索引值)。一些pandas函数(reindex),需要label是唯一的,但这并是不强制的。比如下面有一个重复的索引:" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "a 1\n", "b 2\n", "b 3\n", "c 4\n", "dtype: int64" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])\n", "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "index的is_unique特性能告诉我们label是否是唯一的:" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.index.is_unique" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "数据选择对于重复label则表现有点不同。如果一个label有多个值,那么就会返回一个series, 如果是label只对应一个值的话,会返回一个标量:" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "a 1\n", "dtype: int64" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj['a']" ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj['c']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这个选择的逻辑也应用于DataFrame:" ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
a-0.314526-1.3138610.823529
a0.994028-0.442338-0.846985
b-1.340453-0.0316120.044791
b-0.919341-0.409164-1.297257
\n", "
" ], "text/plain": [ " 0 1 2\n", "a -0.314526 -1.313861 0.823529\n", "a 0.994028 -0.442338 -0.846985\n", "b -1.340453 -0.031612 0.044791\n", "b -0.919341 -0.409164 -1.297257" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])\n", "df" ] }, { "cell_type": "code", "execution_count": 150, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
b-1.340453-0.0316120.044791
b-0.919341-0.409164-1.297257
\n", "
" ], "text/plain": [ " 0 1 2\n", "b -1.340453 -0.031612 0.044791\n", "b -0.919341 -0.409164 -1.297257" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['b']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [py35]", "language": "python", "name": "Python [py35]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }