{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CHAPTER 5 Getting Started with pandas\n", "\n", "这一节终于要开始讲pandas了。闲话不说,直接开始正题。之后的笔记里,这样导入pandas:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另外可以导入Series和DataFrame,因为这两个经常被用到:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pandas import Series, DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.1 Introduction to pandas Data Structures\n", "\n", "数据结构其实就是Series和DataFrame。\n", "\n", "# 1 Series\n", "\n", "这里series我就不翻译成序列了,因为之前的所有笔记里,我都是把sequence翻译成序列的。\n", "\n", "series是一个像数组一样的一维序列,并伴有一个数组表示label,叫做index。创建一个series的方法也很简单:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 4\n", "1 7\n", "2 -5\n", "3 3\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series([4, 7, -5, 3])\n", "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以看到,左边表示index,右边表示对应的value。可以通过value和index属性查看:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 4, 7, -5, 3])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.values" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=4, step=1)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.index # like range(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "当然我们也可以自己指定index的label:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": 
[], "source": [ "obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 4\n", "b 7\n", "a -5\n", "c 3\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['d', 'b', 'a', 'c'], dtype='object')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Index labels can be used to select single values or a set of values, and to assign new values:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-5" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2['a']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj2['d'] = 6" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "c 3\n", "a -5\n", "d 6\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2[['c', 'a', 'd']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here `['c', 'a', 'd']` is interpreted as a list of index labels, even though it contains strings rather than integers.\n", "\n", "NumPy functions and NumPy-like operations, such as boolean filtering, scalar multiplication, or applying math functions, preserve the index-value link:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 6\n", "b 7\n", "c 3\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2[obj2 > 0]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 12\n", "b 14\n", "a -10\n", "c 6\n", "dtype: int64" ] }, 
"execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2 * 2" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 403.428793\n", "b 1096.633158\n", "a 0.006738\n", "c 20.085537\n", "dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.exp(obj2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另一种看待series的方法,它是一个长度固定,有顺序的dict,从index映射到value。在很多场景下,可以当做dict来用:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'b' in obj2" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'e' in obj2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "还可以直接用现有的dict来创建series:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon':16000, 'Utah': 5000}" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ohio 35000\n", "Oregon 16000\n", "Texas 71000\n", "Utah 5000\n", "dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3 = pd.Series(sdata)\n", "obj3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "series中的index其实就是dict中排好序的keys。我们也可以传入一个自己想要的顺序:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "states = ['California', 'Ohio', 'Oregon', 'Texas']" ] }, { "cell_type": "code", "execution_count": 23, 
"metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "California NaN\n", "Ohio 35000.0\n", "Oregon 16000.0\n", "Texas 71000.0\n", "dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj4 = pd.Series(sdata, index=states)\n", "obj4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "顺序是按states里来的,但因为没有找到california,所以是NaN。NaN表示缺失数据,用之后我们提到的话就用missing或NA来指代。pandas中的isnull和notnull函数可以用来检测缺失数据:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "California True\n", "Ohio False\n", "Oregon False\n", "Texas False\n", "dtype: bool" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.isnull(obj4)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "California False\n", "Ohio True\n", "Oregon True\n", "Texas True\n", "dtype: bool" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.notnull(obj4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "series也有对应的方法:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "California True\n", "Ohio False\n", "Oregon False\n", "Texas False\n", "dtype: bool" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj4.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "关于缺失数据,在第七章还会讲得更详细一些。\n", "\n", "series中一个有用的特色自动按index label来排序(Data alignment features):" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ohio 35000\n", "Oregon 16000\n", "Texas 71000\n", "Utah 5000\n", "dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3" ] 
}, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "California NaN\n", "Ohio 35000.0\n", "Oregon 16000.0\n", "Texas 71000.0\n", "dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj4" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "California NaN\n", "Ohio 70000.0\n", "Oregon 32000.0\n", "Texas 142000.0\n", "Utah NaN\n", "dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3 + obj4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data alignment behavior is similar to a join operation in a database.\n", "\n", "Both the Series itself and its index have a `name` attribute, which integrates with other areas of pandas functionality:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj4.name = 'population'" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj4.index.name = 'state'" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "state\n", "California NaN\n", "Ohio 35000.0\n", "Oregon 16000.0\n", "Texas 71000.0\n", "Name: population, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Series's index can be altered in place by assignment:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 4\n", "1 7\n", "2 -5\n", "3 3\n", "dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Bob 4\n", "Steve 7\n", "Jeff -5\n", "Ryan 
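3\n", "dtype: int64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']\n", "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 DataFrame\n", "\n", "A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than as a collection of one-dimensional arrays.\n", "\n", "(Personally, I think of a DataFrame as an Excel-style spreadsheet, which feels more intuitive.)\n", "\n", "One common way to construct a DataFrame is from a dict of equal-length lists:\n", "\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "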
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pop</th><th>state</th><th>year</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>1.5</td><td>Ohio</td><td>2000</td></tr>\n",
" <tr><th>1</th><td>1.7</td><td>Ohio</td><td>2001</td></tr>\n",
" <tr><th>2</th><td>3.6</td><td>Ohio</td><td>2002</td></tr>\n",
" <tr><th>3</th><td>2.4</td><td>Nevada</td><td>2001</td></tr>\n",
" <tr><th>4</th><td>2.9</td><td>Nevada</td><td>2002</td></tr>\n",
" <tr><th>5</th><td>3.2</td><td>Nevada</td><td>2003</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " pop state year\n", "0 1.5 Ohio 2000\n", "1 1.7 Ohio 2001\n", "2 3.6 Ohio 2002\n", "3 2.4 Nevada 2001\n", "4 2.9 Nevada 2002\n", "5 3.2 Nevada 2003" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], \n", " 'year': [2000, 2001, 2002, 2001, 2002, 2003], \n", " 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}\n", "\n", "frame = pd.DataFrame(data)\n", "\n", "frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dataframe也会像series一样,自动给数据赋index, 而列则会按顺序排好。\n", "\n", "对于一个较大的DataFrame,用head方法会返回前5行(注:这个函数在数据分析中经常使用,用来查看表格里有什么东西):" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pop</th><th>state</th><th>year</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>1.5</td><td>Ohio</td><td>2000</td></tr>\n",
" <tr><th>1</th><td>1.7</td><td>Ohio</td><td>2001</td></tr>\n",
" <tr><th>2</th><td>3.6</td><td>Ohio</td><td>2002</td></tr>\n",
" <tr><th>3</th><td>2.4</td><td>Nevada</td><td>2001</td></tr>\n",
" <tr><th>4</th><td>2.9</td><td>Nevada</td><td>2002</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " pop state year\n", "0 1.5 Ohio 2000\n", "1 1.7 Ohio 2001\n", "2 3.6 Ohio 2002\n", "3 2.4 Nevada 2001\n", "4 2.9 Nevada 2002" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果指定一列的话,会自动按列排序:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>year</th><th>state</th><th>pop</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>2000</td><td>Ohio</td><td>1.5</td></tr>\n",
" <tr><th>1</th><td>2001</td><td>Ohio</td><td>1.7</td></tr>\n",
" <tr><th>2</th><td>2002</td><td>Ohio</td><td>3.6</td></tr>\n",
" <tr><th>3</th><td>2001</td><td>Nevada</td><td>2.4</td></tr>\n",
" <tr><th>4</th><td>2002</td><td>Nevada</td><td>2.9</td></tr>\n",
" <tr><th>5</th><td>2003</td><td>Nevada</td><td>3.2</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " year state pop\n", "0 2000 Ohio 1.5\n", "1 2001 Ohio 1.7\n", "2 2002 Ohio 3.6\n", "3 2001 Nevada 2.4\n", "4 2002 Nevada 2.9\n", "5 2003 Nevada 3.2" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(data, columns=['year', 'state', 'pop'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果你导入一个不存在的列名,那么会显示为缺失数据:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], \n", " index=['one', 'two', 'three', 'four', 'five', 'six'])" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>year</th><th>state</th><th>pop</th><th>debt</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>one</th><td>2000</td><td>Ohio</td><td>1.5</td><td>NaN</td></tr>\n",
" <tr><th>two</th><td>2001</td><td>Ohio</td><td>1.7</td><td>NaN</td></tr>\n",
" <tr><th>three</th><td>2002</td><td>Ohio</td><td>3.6</td><td>NaN</td></tr>\n",
" <tr><th>four</th><td>2001</td><td>Nevada</td><td>2.4</td><td>NaN</td></tr>\n",
" <tr><th>five</th><td>2002</td><td>Nevada</td><td>2.9</td><td>NaN</td></tr>\n",
" <tr><th>six</th><td>2003</td><td>Nevada</td><td>3.2</td><td>NaN</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " year state pop debt\n", "one 2000 Ohio 1.5 NaN\n", "two 2001 Ohio 1.7 NaN\n", "three 2002 Ohio 3.6 NaN\n", "four 2001 Nevada 2.4 NaN\n", "five 2002 Nevada 2.9 NaN\n", "six 2003 Nevada 3.2 NaN" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['year', 'state', 'pop', 'debt'], dtype='object')" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从DataFrame里提取一列的话会返回series格式,可以以属性或是dict一样的形式来提取:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one Ohio\n", "two Ohio\n", "three Ohio\n", "four Nevada\n", "five Nevada\n", "six Nevada\n", "Name: state, dtype: object" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['state']" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one 2000\n", "two 2001\n", "three 2002\n", "four 2001\n", "five 2002\n", "six 2003\n", "Name: year, dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2.year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注意:frame2[column]能应对任何列名,但frame2.column的情况下,列名必须是有效的python变量名才行。\n", "\n", "返回的series有DataFrame种同样的index,而且name属性也是对应的。\n", "\n", "对于行,要用在loc属性里用 位置或名字:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "year 2002\n", "state Ohio\n", "pop 3.6\n", "debt NaN\n", "Name: three, dtype: object" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2.loc['three']" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "列值也能通过赋值改变。比如给debt赋值:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>year</th><th>state</th><th>pop</th><th>debt</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>one</th><td>2000</td><td>Ohio</td><td>1.5</td><td>16.5</td></tr>\n",
" <tr><th>two</th><td>2001</td><td>Ohio</td><td>1.7</td><td>16.5</td></tr>\n",
" <tr><th>three</th><td>2002</td><td>Ohio</td><td>3.6</td><td>16.5</td></tr>\n",
" <tr><th>four</th><td>2001</td><td>Nevada</td><td>2.4</td><td>16.5</td></tr>\n",
" <tr><th>five</th><td>2002</td><td>Nevada</td><td>2.9</td><td>16.5</td></tr>\n",
" <tr><th>six</th><td>2003</td><td>Nevada</td><td>3.2</td><td>16.5</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " year state pop debt\n", "one 2000 Ohio 1.5 16.5\n", "two 2001 Ohio 1.7 16.5\n", "three 2002 Ohio 3.6 16.5\n", "four 2001 Nevada 2.4 16.5\n", "five 2002 Nevada 2.9 16.5\n", "six 2003 Nevada 3.2 16.5" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['debt'] = 16.5\n", "frame2" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>year</th><th>state</th><th>pop</th><th>debt</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>one</th><td>2000</td><td>Ohio</td><td>1.5</td><td>0.0</td></tr>\n",
" <tr><th>two</th><td>2001</td><td>Ohio</td><td>1.7</td><td>1.0</td></tr>\n",
" <tr><th>three</th><td>2002</td><td>Ohio</td><td>3.6</td><td>2.0</td></tr>\n",
" <tr><th>four</th><td>2001</td><td>Nevada</td><td>2.4</td><td>3.0</td></tr>\n",
" <tr><th>five</th><td>2002</td><td>Nevada</td><td>2.9</td><td>4.0</td></tr>\n",
" <tr><th>six</th><td>2003</td><td>Nevada</td><td>3.2</td><td>5.0</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " year state pop debt\n", "one 2000 Ohio 1.5 0.0\n", "two 2001 Ohio 1.7 1.0\n", "three 2002 Ohio 3.6 2.0\n", "four 2001 Nevada 2.4 3.0\n", "five 2002 Nevada 2.9 4.0\n", "six 2003 Nevada 3.2 5.0" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['debt'] = np.arange(6.)\n", "frame2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果把list或array赋给column的话,长度必须符合DataFrame的长度。如果把一二series赋给DataFrame,会按DataFrame的index来赋值,不够的地方用缺失数据来表示:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>year</th><th>state</th><th>pop</th><th>debt</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>one</th><td>2000</td><td>Ohio</td><td>1.5</td><td>NaN</td></tr>\n",
" <tr><th>two</th><td>2001</td><td>Ohio</td><td>1.7</td><td>-1.2</td></tr>\n",
" <tr><th>three</th><td>2002</td><td>Ohio</td><td>3.6</td><td>NaN</td></tr>\n",
" <tr><th>four</th><td>2001</td><td>Nevada</td><td>2.4</td><td>-1.5</td></tr>\n",
" <tr><th>five</th><td>2002</td><td>Nevada</td><td>2.9</td><td>-1.7</td></tr>\n",
" <tr><th>six</th><td>2003</td><td>Nevada</td><td>3.2</td><td>NaN</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " year state pop debt\n", "one 2000 Ohio 1.5 NaN\n", "two 2001 Ohio 1.7 -1.2\n", "three 2002 Ohio 3.6 NaN\n", "four 2001 Nevada 2.4 -1.5\n", "five 2002 Nevada 2.9 -1.7\n", "six 2003 Nevada 3.2 NaN" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])\n", "frame2['debt'] = val\n", "frame2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果列不存在,赋值会创建一个新列。而del也能像删除字典关键字一样,删除列:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>year</th><th>state</th><th>pop</th><th>debt</th><th>eastern</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>one</th><td>2000</td><td>Ohio</td><td>1.5</td><td>NaN</td><td>True</td></tr>\n",
" <tr><th>two</th><td>2001</td><td>Ohio</td><td>1.7</td><td>-1.2</td><td>True</td></tr>\n",
" <tr><th>three</th><td>2002</td><td>Ohio</td><td>3.6</td><td>NaN</td><td>True</td></tr>\n",
" <tr><th>four</th><td>2001</td><td>Nevada</td><td>2.4</td><td>-1.5</td><td>False</td></tr>\n",
" <tr><th>five</th><td>2002</td><td>Nevada</td><td>2.9</td><td>-1.7</td><td>False</td></tr>\n",
" <tr><th>six</th><td>2003</td><td>Nevada</td><td>3.2</td><td>NaN</td><td>False</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " year state pop debt eastern\n", "one 2000 Ohio 1.5 NaN True\n", "two 2001 Ohio 1.7 -1.2 True\n", "three 2002 Ohio 3.6 NaN True\n", "four 2001 Nevada 2.4 -1.5 False\n", "five 2002 Nevada 2.9 -1.7 False\n", "six 2003 Nevada 3.2 NaN False" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['eastern'] = frame2.state == 'Ohio'\n", "frame2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "然后用del删除这一列:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [], "source": [ "del frame2['eastern']" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['year', 'state', 'pop', 'debt'], dtype='object')" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注意:columns返回的是一个view,而不是新建了一个copy。因此,任何对series的改变,会反映在DataFrame上。除非我们用copy方法来新建一个。\n", "\n", "另一种常见的格式是dict中的dict:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pop = {'Nevada': {2001: 2.4, 2002: 2.9},\n", " 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "把上面这种嵌套dcit传给DataFrame,pandas会把外层dcit的key当做列,内层key当做行索引:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
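<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Nevada</th><th>Ohio</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>2000</th><td>NaN</td><td>1.5</td></tr>\n",
" <tr><th>2001</th><td>2.4</td><td>1.7</td></tr>\n",
" <tr><th>2002</th><td>2.9</td><td>3.6</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>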
" ], "text/plain": [ " Nevada Ohio\n", "2000 NaN 1.5\n", "2001 2.4 1.7\n", "2002 2.9 3.6" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3 = pd.DataFrame(pop)\n", "frame3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另外DataFrame也可以向numpy数组一样做转置:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
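<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>2000</th><th>2001</th><th>2002</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>Nevada</th><td>NaN</td><td>2.4</td><td>2.9</td></tr>\n",
" <tr><th>Ohio</th><td>1.5</td><td>1.7</td><td>3.6</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>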
" ], "text/plain": [ " 2000 2001 2002\n", "Nevada NaN 2.4 2.9\n", "Ohio 1.5 1.7 3.6" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "指定index:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
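<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Nevada</th><th>Ohio</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>2001</th><td>2.4</td><td>1.7</td></tr>\n",
" <tr><th>2002</th><td>2.9</td><td>3.6</td></tr>\n",
" <tr><th>2003</th><td>NaN</td><td>NaN</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>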
" ], "text/plain": [ " Nevada Ohio\n", "2001 2.4 1.7\n", "2002 2.9 3.6\n", "2003 NaN NaN" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(pop, index=[2001, 2002, 2003])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "series组成的dict:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pdata = {'Ohio': frame3['Ohio'][:-1],\n", " 'Nevada': frame3['Nevada'][:2]}" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
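<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Nevada</th><th>Ohio</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>2000</th><td>NaN</td><td>1.5</td></tr>\n",
" <tr><th>2001</th><td>2.4</td><td>1.7</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>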
" ], "text/plain": [ " Nevada Ohio\n", "2000 NaN 1.5\n", "2001 2.4 1.7" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(pdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "其他一些可以传递给DataFrame的构造器:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/yv7rc.png)\n", "\n", "如果DataFrame的index和column有自己的name属性,也会被显示:\n", "\n" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "frame3.index.name = 'year'; frame3.columns.name = 'state'" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
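<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th>state</th><th>Nevada</th><th>Ohio</th></tr>\n",
" <tr><th>year</th><th></th><th></th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>2000</th><td>NaN</td><td>1.5</td></tr>\n",
" <tr><th>2001</th><td>2.4</td><td>1.7</td></tr>\n",
" <tr><th>2002</th><td>2.9</td><td>3.6</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>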
" ], "text/plain": [ "state Nevada Ohio\n", "year \n", "2000 NaN 1.5\n", "2001 2.4 1.7\n", "2002 2.9 3.6" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "values属性会返回二维数组:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ nan, 1.5],\n", " [ 2.4, 1.7],\n", " [ 2.9, 3.6]])" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果column有不同的类型,dtype会适应所有的列:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[2000, 'Ohio', 1.5, nan],\n", " [2001, 'Ohio', 1.7, -1.2],\n", " [2002, 'Ohio', 3.6, nan],\n", " [2001, 'Nevada', 2.4, -1.5],\n", " [2002, 'Nevada', 2.9, -1.7],\n", " [2003, 'Nevada', 3.2, nan]], dtype=object)" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3 Index Objects (索引对象)\n", "\n", "pandas的Index Objects (索引对象)负责保存axis labels和其他一些数据(比如axis name或names)。一个数组或其他一个序列标签,只要被用来做构建series或DataFrame,就会被自动转变为index:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj = pd.Series(range(3), index=['a', 'b', 'c'])" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['a', 'b', 'c'], dtype='object')" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = obj.index\n", "index" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['b', 'c'], dtype='object')" ] }, "execution_count": 
65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index[1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Index objects are immutable and therefore cannot be modified by the user:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "ename": "TypeError", "evalue": "Index does not support mutable operations", "output_type": "error", "traceback": [ "---------------------------------------------------------------------------", "TypeError Traceback (most recent call last)", " in ()\n----> 1 index[1] = 'd'\n", "/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/pandas/indexes/base.py in __setitem__(self, key, value)\n 1243 \n 1244 def __setitem__(self, key, value):\n-> 1245 raise TypeError(\"Index does not support mutable operations\")\n 1246 \n 1247 def __getitem__(self, key):\n", "TypeError: Index does not support mutable operations" ] } ], "source": [ "index[1] = 'd'" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Immutability makes it safe to share Index objects among data structures:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Int64Index([0, 1, 2], dtype='int64')" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = pd.Index(np.arange(3))\n", "labels" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.5\n", "1 -2.5\n", "2 0.0\n", "dtype: float64" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2 = pd.Series([1.5, -2.5, 0], index=labels)\n", "obj2" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2.index is labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to being array-like, an Index also behaves like a fixed-size set:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
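<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th>state</th><th>Nevada</th><th>Ohio</th></tr>\n",
" <tr><th>year</th><th></th><th></th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>2000</th><td>NaN</td><td>1.5</td></tr>\n",
" <tr><th>2001</th><td>2.4</td><td>1.7</td></tr>\n",
" <tr><th>2002</th><td>2.9</td><td>3.6</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>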
" ], "text/plain": [ "state Nevada Ohio\n", "year \n", "2000 NaN 1.5\n", "2001 2.4 1.7\n", "2002 2.9 3.6" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['Nevada', 'Ohio'], dtype='object', name='state')" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3.columns" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'Ohio' in frame3.columns" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "2003 in frame3.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "与python里的set不同,pandas的index可以有重复的labels:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['foo', 'foo', 'bar', 'bar'], dtype='object')" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])\n", "dup_labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在这种重复的标签中选择的话,会选中所有相同的标签。\n", "\n", "Index还有一些方法和属性:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/14j6g.png)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [py35]", "language": "python", "name": "Python [py35]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": 
".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }