{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas建造在NumPy之上,它使得以NumPy为中心的应用很容易使用。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from pandas import Series, DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas 数据结构入门" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series\n", "\n", "Series 是一个一维的类似的数组对象,包含一个数组的数据(任何 NumPy 的数据类型)和一个与数组关联的数据标签,被叫做 索引 。最简单的 Series 是由一个数组的数据构成:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 4\n", "1 7\n", "2 -5\n", "3 3\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = Series([4, 7, -5, 3])\n", "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Seriers 的交互式显示的字符窜表示形式是索引在左边,值在右边。 \n", "因为我们没有给数据指定索引,一个包含整数 0 到 N-1 (这里 N 是数据的长度)的默认索引被创建。 可以分别的通过它的 values 和 index 属性来获取 Series 的数组表示和索引对象:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 4, 7, -5, 3])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.values" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=4, step=1)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "通常,需要创建一个带有索引来确定没一个数据点的Series:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "d 4\n", "b 7\n", "a -5\n", "c 3\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])\n", "obj2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "与正规的 NumPy 数组相比,你可以使用索引里的值来选择一个单一值或一个值集:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-5" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2['a']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "c 3\n", "a -5\n", "d 6\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2['d'] = 6\n", "obj2[['c', 'a', 'd']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy 数组操作,例如通过一个布尔数组过滤,纯量乘法,或使用数学函数,将会保持索引和值间的关联:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "d 6\n", "b 7\n", "a -5\n", "c 3\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "d 6\n", "b 7\n", "c 3\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2[obj2 > 0]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "d 12\n", "b 14\n", "a -10\n", "c 6\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2 * 2" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "d 403.428793\n", "b 1096.633158\n", "a 0.006738\n", "c 20.085537\n", "dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.exp(obj2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**另一种思考的方式**是,Series 是一个定长的,有序的字典,因为它把索引和值映射起来了。它可以适用于许多期望一个字典的函数:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'b' in obj2" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'e' in obj2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果你有一些数据在一个 Python 字典中,你可以通过传递字典来从这些数据创建一个 Series:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ohio 35000\n", "Oregon 16000\n", "Texas 71000\n", "Utah 5000\n", "dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3 = Series(sdata)\n", "obj3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "只传递一个字典的时候,结果 Series 中的索引将是排序后的字典的键。" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "California NaN\n", "Ohio 35000.0\n", "Oregon 16000.0\n", "Texas 71000.0\n", "dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states = ['California', 'Ohio', 'Oregon', 'Texas']\n", "obj4 = Series(sdata, index=states)\n", "obj4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在这种情况下, sdata 中的3个值被放在了合适的位置,但因为没有发现对应于 ‘California’ 的值,就出现了 NaN (不是一个数),这在 pandas 中被用来标记数据缺失或 NA 值。使用 “missing” 或 “NA” 来表示数度丢失。 \n", "**在 pandas 中用函数 isnull 和 notnull 来检测数据丢失:**" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "California True\n", "Ohio False\n", "Oregon False\n", "Texas False\n", "dtype: bool" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.isnull(obj4)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "California False\n", "Ohio True\n", "Oregon True\n", "Texas True\n", "dtype: bool" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.notnull(obj4)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "California True\n", "Ohio False\n", "Oregon False\n", "Texas False\n", "dtype: bool" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj4.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在许多应用中 Series 的一个重要功能是在算术预算中它会自动对齐不同索引的数据:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ohio 35000\n", "Oregon 16000\n", "Texas 71000\n", "Utah 5000\n", "dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "California NaN\n", "Ohio 35000.0\n", "Oregon 16000.0\n", "Texas 71000.0\n", "dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj4" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "California NaN\n", "Ohio 70000.0\n", "Oregon 32000.0\n", "Texas 142000.0\n", "Utah NaN\n", "dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3 + obj4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series 对象本身和它的索引都有一个 name 属性,它和 pandas 的其它一些关键功能整合在一起:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj4.name = 'population'" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state\n", "California NaN\n", "Ohio 35000.0\n", "Oregon 16000.0\n", "Texas 71000.0\n", "Name: population, dtype: float64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj4.index.name = 'state'\n", "obj4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series 的索引可以通过赋值就地更改:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Bob 4\n", "Steve 7\n", "Jeff -5\n", "Ryan 3\n", "dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "一个 Datarame 表示一个表格,类似电子表格的数据结构,包含一个经过排序的列表集,它们每一个都可以有不同的类型值(数字,字符串,布尔等等)。 \n", "Datarame 有行和列的索引;它可以被看作是一个 Series 的字典(每个 Series 共享一个索引)。 \n", "与其它你以前使用过的(如 R 的 data.frame )类似 Datarame 的结构相比,在 DataFrame 里的面向行和面向列的操作大致是对称的。 \n", "在底层,数据是作为一个或多个**二维数组**存储的,而不是列表,字典,或其它一维的数组集合。 \n", "因为 DataFrame 在内部把数据存储为一个二维数组的格式,因此你可以采用分层索引以表格格式来表示高维的数据。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "有很多方法来构建一个 DataFrame,但最常用的一个是用一个相等长度列表的字典或 NumPy 数组:" ] }, { "cell_type": "code", "execution_count": 229, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],\n", " 'year': [2000, 2001, 2002, 2001, 2002],\n", " 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}\n", "frame = DataFrame(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "由此产生的 DataFrame 和 Series 一样,它的索引会自动分配,并且对列进行了排序:" ] }, { "cell_type": "code", "execution_count": 230, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
popstateyear
01.5Ohio2000
11.7Ohio2001
23.6Ohio2002
32.4Nevada2001
42.9Nevada2002
\n", "
" ], "text/plain": [ " pop state year\n", "0 1.5 Ohio 2000\n", "1 1.7 Ohio 2001\n", "2 3.6 Ohio 2002\n", "3 2.4 Nevada 2001\n", "4 2.9 Nevada 2002" ] }, "execution_count": 230, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果你设定了一个列的顺序,DataFrame 的列将会精确的按照你所传递的顺序排列:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearstatepop
02000Ohio1.5
12001Ohio1.7
22002Ohio3.6
32001Nevada2.4
42002Nevada2.9
\n", "
" ], "text/plain": [ " year state pop\n", "0 2000 Ohio 1.5\n", "1 2001 Ohio 1.7\n", "2 2002 Ohio 3.6\n", "3 2001 Nevada 2.4\n", "4 2002 Nevada 2.9" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame(data, columns=['year', 'state', 'pop'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "和 Series 一样,如果你传递了一个行,但不包括在 data 中,在结果中它会表示为 NA 值:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearstatepopdebt
one2000Ohio1.5NaN
two2001Ohio1.7NaN
three2002Ohio3.6NaN
four2001Nevada2.4NaN
five2002Nevada2.9NaN
\n", "
" ], "text/plain": [ " year state pop debt\n", "one 2000 Ohio 1.5 NaN\n", "two 2001 Ohio 1.7 NaN\n", "three 2002 Ohio 3.6 NaN\n", "four 2001 Nevada 2.4 NaN\n", "five 2002 Nevada 2.9 NaN" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four', 'five'])\n", "frame2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "和 Series 一样,在 DataFrame 中的一列可以通过字典记法或属性来检索:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one Ohio\n", "two Ohio\n", "three Ohio\n", "four Nevada\n", "five Nevada\n", "Name: state, dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['state'] " ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one 2000\n", "two 2001\n", "three 2002\n", "four 2001\n", "five 2002\n", "Name: year, dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2.year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**注意**,返回的 Series 包含和 DataFrame 相同的索引,并它们的 name 属性也被正确的设置了。\n", "\n", "行也可以使用一些方法通过位置或名字来检索,例如 ix 索引成员(field)(更多的将在后面介绍):" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "year 2002\n", "state Ohio\n", "pop 3.6\n", "debt NaN\n", "Name: three, dtype: object" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2.ix['three']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "列可以通过赋值来修改。例如,空的 ‘debt’ 列可以通过一个纯量或一个数组来赋值:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearstatepopdebt
one2000Ohio1.516.5
two2001Ohio1.716.5
three2002Ohio3.616.5
four2001Nevada2.416.5
five2002Nevada2.916.5
\n", "
" ], "text/plain": [ " year state pop debt\n", "one 2000 Ohio 1.5 16.5\n", "two 2001 Ohio 1.7 16.5\n", "three 2002 Ohio 3.6 16.5\n", "four 2001 Nevada 2.4 16.5\n", "five 2002 Nevada 2.9 16.5" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['debt'] = 16.5\n", "frame2" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearstatepopdebt
one2000Ohio1.50.0
two2001Ohio1.71.0
three2002Ohio3.62.0
four2001Nevada2.43.0
five2002Nevada2.94.0
\n", "
" ], "text/plain": [ " year state pop debt\n", "one 2000 Ohio 1.5 0.0\n", "two 2001 Ohio 1.7 1.0\n", "three 2002 Ohio 3.6 2.0\n", "four 2001 Nevada 2.4 3.0\n", "five 2002 Nevada 2.9 4.0" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['debt'] = np.arange(5.)\n", "frame2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "通过列表或数组给一列赋值时,所赋的值的长度必须和 DataFrame 的长度相匹配。**否则全部为 NA**。\n", "\n", "如果你使用 Series 来赋值,它会代替在 DataFrame 中精确匹配的索引的值,并在说有的空洞插入丢失数据:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": true }, "outputs": [], "source": [ "val2 = Series([-1.2, -1.5, -1.7])" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearstatepopdebt
one2000Ohio1.5NaN
two2001Ohio1.7NaN
three2002Ohio3.6NaN
four2001Nevada2.4NaN
five2002Nevada2.9NaN
\n", "
" ], "text/plain": [ " year state pop debt\n", "one 2000 Ohio 1.5 NaN\n", "two 2001 Ohio 1.7 NaN\n", "three 2002 Ohio 3.6 NaN\n", "four 2001 Nevada 2.4 NaN\n", "five 2002 Nevada 2.9 NaN" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['debt'] = val2\n", "frame2" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": true }, "outputs": [], "source": [ "val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearstatepopdebt
one2000Ohio1.5NaN
two2001Ohio1.7-1.2
three2002Ohio3.6NaN
four2001Nevada2.4-1.5
five2002Nevada2.9-1.7
\n", "
" ], "text/plain": [ " year state pop debt\n", "one 2000 Ohio 1.5 NaN\n", "two 2001 Ohio 1.7 -1.2\n", "three 2002 Ohio 3.6 NaN\n", "four 2001 Nevada 2.4 -1.5\n", "five 2002 Nevada 2.9 -1.7" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['debt'] = val\n", "frame2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "给一个不存在的列赋值,将会创建一个新的列。 像字典一样 del 关键字将会删除列:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearstatepopdebteastern
one2000Ohio1.5NaNTrue
two2001Ohio1.7-1.2True
three2002Ohio3.6NaNTrue
four2001Nevada2.4-1.5False
five2002Nevada2.9-1.7False
\n", "
" ], "text/plain": [ " year state pop debt eastern\n", "one 2000 Ohio 1.5 NaN True\n", "two 2001 Ohio 1.7 -1.2 True\n", "three 2002 Ohio 3.6 NaN True\n", "four 2001 Nevada 2.4 -1.5 False\n", "five 2002 Nevada 2.9 -1.7 False" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2['eastern'] = frame2.state == 'Ohio'\n", "frame2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**注意:**索引 DataFrame 时返回的列是底层数据的一个视窗,而不是一个拷贝。因此,任何在 Series 上的就地修改都会影响 DataFrame。列可以使用 Series 的 copy 函数来显式的拷贝。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**另一种通用的数据形式是一个嵌套的字典的字典格式:**" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pop = {'Nevada': {2001: 2.4, 2002: 2.9},\n", " ....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果被传递到 DataFrame,它的外部键会被解释为列索引,内部键会被解释为行索引:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NevadaOhio
2000NaN1.5
20012.41.7
20022.93.6
\n", "
" ], "text/plain": [ " Nevada Ohio\n", "2000 NaN 1.5\n", "2001 2.4 1.7\n", "2002 2.9 3.6" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3 = DataFrame(pop)\n", "frame3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "当然,总是可以对结果转置:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
200020012002
NevadaNaN2.42.9
Ohio1.51.73.6
\n", "
" ], "text/plain": [ " 2000 2001 2002\n", "Nevada NaN 2.4 2.9\n", "Ohio 1.5 1.7 3.6" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "内部字典的键被结合并排序来形成结果的索引。如果指定了一个特定的索引,就不是这样的了:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NevadaOhio
20012.41.7
20022.93.6
2003NaNNaN
\n", "
" ], "text/plain": [ " Nevada Ohio\n", "2001 2.4 1.7\n", "2002 2.9 3.6\n", "2003 NaN NaN" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame(pop, index=[2001, 2002, 2003])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series 的字典也以相同的方式来处理:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pdata = {'Ohio': frame3['Ohio'][:-1],\n", " ....: 'Nevada': frame3['Nevada'][:2]}" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Nevada': 2000 NaN\n", " 2001 2.4\n", " Name: Nevada, dtype: float64, 'Ohio': 2000 1.5\n", " 2001 1.7\n", " Name: Ohio, dtype: float64}" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pdata" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NevadaOhio
2000NaN1.5
20012.41.7
\n", "
" ], "text/plain": [ " Nevada Ohio\n", "2000 NaN 1.5\n", "2001 2.4 1.7" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame(pdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果一个 DataFrame 的 index 和 columns 有它们的 name ,也会被显示出来:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stateNevadaOhio
year
2000NaN1.5
20012.41.7
20022.93.6
\n", "
" ], "text/plain": [ "state Nevada Ohio\n", "year \n", "2000 NaN 1.5\n", "2001 2.4 1.7\n", "2002 2.9 3.6" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3.index.name = 'year'; frame3.columns.name = 'state'\n", "frame3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "像 Series 一样, values 属性返回一个包含在 DataFrame 中的数据的二维 ndarray:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ nan, 1.5],\n", " [ 2.4, 1.7],\n", " [ 2.9, 3.6]])" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果 DataFrame 的列有不同的 dtypes,返回值数组将会给所有的列选择一个合适的 dtyps:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearstatepopdebteastern
one2000Ohio1.5NaNTrue
two2001Ohio1.7-1.2True
three2002Ohio3.6NaNTrue
four2001Nevada2.4-1.5False
five2002Nevada2.9-1.7False
\n", "
" ], "text/plain": [ " year state pop debt eastern\n", "one 2000 Ohio 1.5 NaN True\n", "two 2001 Ohio 1.7 -1.2 True\n", "three 2002 Ohio 3.6 NaN True\n", "four 2001 Nevada 2.4 -1.5 False\n", "five 2002 Nevada 2.9 -1.7 False" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[2000, 'Ohio', 1.5, nan, True],\n", " [2001, 'Ohio', 1.7, -1.2, True],\n", " [2002, 'Ohio', 3.6, nan, True],\n", " [2001, 'Nevada', 2.4, -1.5, False],\n", " [2002, 'Nevada', 2.9, -1.7, False]], dtype=object)" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**可能的传递到DataFrame的构造器**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|二维ndarray|\t一个数据矩阵,有可选的行标和列标|\n", "|----------|----------------------------|\n", "|数组,列表或元组的字典|每一个序列成为DataFrame中的一列。所有的序列必须有相同的长度。|\n", "|NumPy的结构/记录数组|\t和“数组字典”一样处理|\n", "|Series的字典|\t每一个值成为一列。如果没有明显的传递索引,将结合每一个Series的索引来形成结果的行索引。|\n", "|字典的字典|\t每一个内部的字典成为一列。和“Series的字典”一样,结合键值来形成行索引。|\n", "|字典或Series的列表|\t每一项成为DataFrame中的一列。结合字典键或Series索引形成DataFrame的列标。|\n", "|列表或元组的列表|\t和“二维ndarray”一样处理|\n", "|另一个DataFrame|\tDataFrame的索引将被使用,除非传递另外一个|\n", "|NumPy伪装数组(MaskedArray)|\t除了蒙蔽值在DataFrame中成为NA/丢失数据之外,其它的和“二维ndarray”一样|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 索引对象" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas 的索引对象用来保存坐标轴标签和其它元数据(如坐标轴名或名称)。构建一个 Series 或 DataFrame 时任何数组或其它序列标签在内部转化为索引:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index([u'a', u'b', u'c'], dtype='object')" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = Series(range(3), index=['a', 'b', 'c'])\n", "index = obj.index\n", "index" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index([u'b', u'c'], dtype='object')" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index[1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "索引对象是不可变的,因此不能由用户改变:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "Index does not support mutable operations", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mindex\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'd'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m/home/evil_rabbit/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc\u001b[0m in \u001b[0;36m__setitem__\u001b[1;34m(self, key, value)\u001b[0m\n\u001b[0;32m 1235\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1236\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m__setitem__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1237\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Index does not support mutable operations\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1238\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1239\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mTypeError\u001b[0m: Index does not support mutable operations" ] } ], "source": [ "index[1] = 'd'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**索引对象的不可变性非常重要,这样它可以在数据结构中结构中安全的共享:**" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = pd.Index(np.arange(3))\n", "obj2 = Series([1.5, -2.5, 0], index=index)\n", "obj2.index is index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下表是库中内建的索引类清单。通过一些开发努力,索引可以被子类化,来实现特定坐标轴索引功能。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|索引类型|说明|\n", "|-------|---|\n", "|Index|最通用的索引对象,使用Python对象的NumPy数组来表示坐标轴标签。|\n", "|Int64Index|对整形值的特化索引。|\n", "|MultiIndex|“分层”索引对象,表示单个轴的多层次的索引。可以被认为是类似的元组的数组。|\n", "|DatetimeIndex|存储纳秒时间戳(使用NumPy的datetime64 dtyppe来表示)。|\n", "|PeriodIndex|对周期数据(时间间隔的)的特化索引。|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "除了类似于阵列,索引也有类似固定大小集合一样的功能:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stateNevadaOhio
year
2000NaN1.5
20012.41.7
20022.93.6
\n", "
" ], "text/plain": [ "state Nevada Ohio\n", "year \n", "2000 NaN 1.5\n", "2001 2.4 1.7\n", "2002 2.9 3.6" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame3" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'Ohio' in frame3.columns" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "2003 in frame3.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "每个索引都有许多关于集合逻辑的方法和属性,且能够解决它所包含的数据的常见问题。这些都总结在下表中。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|方法 | 属性|\n", "|----|-----|\n", "|append\t|链接额外的索引对象,产生一个新的索引|\n", "|diff\t|计算索引的差集|\n", "|intersection\t|计算交集|\n", "|union\t|计算并集|\n", "|isin\t|计算出一个布尔数组表示每一个值是否包含在所传递的集合里|\n", "|delete\t|计算删除位置i的元素的索引|\n", "|drop\t|计算删除所传递的值后的索引|\n", "|insert\t|计算在位置i插入元素后的索引|\n", "|is_monotonic\t|返回True,如果每一个元素都比它前面的元素大或相等|\n", "|is_unique\t|返回True,如果索引没有重复的值|\n", "|unique\t|计算索引的唯一值数组|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 重新索引:reindex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas 对象的一个关键的方法是 reindex ,意味着使数据符合一个新的索引来构造一个新的对象。来看一下下面一个简单的例子:" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "d 4.5\n", "b 7.2\n", "a -5.3\n", "c 3.6\n", "dtype: float64" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])\n", "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在 Series 上调用 reindex 重排数据,使得它符合新的索引,如果那个索引的值不存在就引入缺失数据值: \n", "**注意:**reindex 并不改变 index 的数值,只是重排数据。(索引对象的不可变性)" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a -5.3\n", "b 7.2\n", "c 3.6\n", "d 4.5\n", "e NaN\n", "dtype: float64" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])\n", "obj2" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a -5.3\n", "b 7.2\n", "c 3.6\n", "d 4.5\n", "e 0.0\n", "dtype: float64" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "为了对**时间序列**这样的数据排序,**当重建索引的时候可能想要对值进行内插或填充**。 method 选项可以是你做到这一点,使用一个如 ffill 的方法来向前填充值:" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 blue\n", "2 purple\n", "4 yellow\n", "dtype: object" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 blue\n", "1 blue\n", "2 purple\n", "3 purple\n", "4 yellow\n", "5 yellow\n", "dtype: object" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj3.reindex(range(6), method='ffill')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下是可用的 method 选项的清单。在此,内差比正向和反向填充更复杂。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- reindex 的 method(内插)选项\n", "\n", "|参数|描述|\n", "|---|---|\n", "|ffill或pad\t|前向(或进位)填充|\n", "|bfill或backfill\t|后向(或进位)填充|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于 DataFrame, reindex 可以改变(行)索引,列或两者。当只传入一个序列时,结果中的行被重新索引了:" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": true }, "outputs": [], "source": [ "frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],\n", "....: columns=['Ohio', 'Texas', 'California'])" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a012
c345
d678
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 0 1 2\n", "c 3 4 5\n", "d 6 7 8" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": true }, "outputs": [], "source": [ "frame2 = frame.reindex(['a', 'b', 'c', 'd'])" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a0.01.02.0
bNaNNaNNaN
c3.04.05.0
d6.07.08.0
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 0.0 1.0 2.0\n", "b NaN NaN NaN\n", "c 3.0 4.0 5.0\n", "d 6.0 7.0 8.0" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用 **columns** 关键字可以使列重新索引:" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": true }, "outputs": [], "source": [ "states = ['Texas', 'Utah', 'California']" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TexasUtahCalifornia
a1NaN2
c4NaN5
d7NaN8
\n", "
" ], "text/plain": [ " Texas Utah California\n", "a 1 NaN 2\n", "c 4 NaN 5\n", "d 7 NaN 8" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.reindex(columns=states)\n", "# 再次注意索引对象的不可变性" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "一次可以对两个重新索引,可是插值只在**行侧(0坐标轴)**进行:" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a012
c345
d678
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 0 1 2\n", "c 3 4 5\n", "d 6 7 8" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TexasUtahCalifornia
a1NaN2
b1NaN2
c4NaN5
d7NaN8
\n", "
" ], "text/plain": [ " Texas Utah California\n", "a 1 NaN 2\n", "b 1 NaN 2\n", "c 4 NaN 5\n", "d 7 NaN 8" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',\n", " ....: columns=states)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "正如你将看到的,使用带标签索引的 **ix** 可以把重新索引做的更简单:" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TexasUtahCalifornia
a1.0NaN2.0
bNaNNaNNaN
c4.0NaN5.0
d7.0NaN8.0
\n", "
" ], "text/plain": [ " Texas Utah California\n", "a 1.0 NaN 2.0\n", "b NaN NaN NaN\n", "c 4.0 NaN 5.0\n", "d 7.0 NaN 8.0" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.ix[['a', 'b', 'c', 'd'], states]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **reindex** 函数的参数\n", "\n", "|参数|说明|\n", "|---|---|\n", "|index\t|作为索引的新序列。可以是索引实例或任何类似序列的Python数据结构。一个索引被完全使用,没有任何拷贝。|\n", "|method\t|插值(填充)方法,见表格5-4的选项|\n", "|fill_value\t|代替重新索引时引入的缺失数据值|\n", "|limit\t|当前向或后向填充时,最大的填充间隙|\n", "|level\t|在多层索引上匹配简单索引,否则选择一个子集|\n", "|copy\t|如果新索引与就的相等则底层数据不会拷贝。默认为True(即始终拷贝)|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 从一个坐标轴删除条目" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从坐标轴删除一个多或多个条目是很容易的,如果你有一个索引数组或列表且没有这些条目,但是这可能需要一点修改和集合逻辑。 drop 方法将会返回一个新的对象并从坐标轴中删除指定的一个或多个值:" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0.0\n", "b 1.0\n", "d 3.0\n", "e 4.0\n", "dtype: float64" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_obj = obj.drop('c')\n", "new_obj" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0.0\n", "b 1.0\n", "e 4.0\n", "dtype: float64" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.drop(['d', 'c'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于 DataFrame,可以从任何坐标轴删除索引值:" ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = DataFrame(np.arange(16).reshape((4, 4)),\n", " ....: index=['Ohio', 'Colorado', 'Utah', 'New York'],\n", " ....: columns=['one', 'two', 'three', 'four'])" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop(['Colorado', 'Ohio'])" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onethreefour
Ohio023
Colorado467
Utah81011
New York121415
\n", "
" ], "text/plain": [ " one three four\n", "Ohio 0 2 3\n", "Colorado 4 6 7\n", "Utah 8 10 11\n", "New York 12 14 15" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop('two', axis=1)" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onethree
Ohio02
Colorado46
Utah810
New York1214
\n", "
" ], "text/plain": [ " one three\n", "Ohio 0 2\n", "Colorado 4 6\n", "Utah 8 10\n", "New York 12 14" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop(['two', 'four'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 索引,挑选和过滤" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series 索引 ( obj[...] )的工作原理类似与 NumPy 索引,除了可以使用 Series 的索引值,也可以仅使用整数来索引。下面是关于这一点的一些例子:" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 2\n", "b 3\n", "c 1\n", "d 5\n", "dtype: int64" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = Series([2,3,1,5], index=['a', 'b', 'c', 'd'])\n", "obj" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3, 3)" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj['b'], obj[1]" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b 3\n", "a 2\n", "d 5\n", "dtype: int64" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj[['b', 'a', 'd']]" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b 3\n", "d 5\n", "dtype: int64" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj[[1, 3]] # 索引位置" ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "c 1\n", "dtype: int64" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj[obj < 2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用标签来切片和正常的 Python 切片并不一样,**它会把结束点也包括在内**:" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b 3\n", "c 1\n", "dtype: int64" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj['b':'c']" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 2\n", "b 5\n", "c 5\n", "d 5\n", "dtype: int64" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj['b':'c'] = 5\n", "obj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "正如上面你所见到的,索引 DataFrame 来检索一个或多个列,可以使用一个单一值或一个序列:" ] }, { "cell_type": "code", "execution_count": 256, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = DataFrame(np.arange(16).reshape((4, 4)),\n", " .....: index=['Ohio', 'Colorado', 'Utah', 'New York'],\n", " .....: columns=['one', 'two', 'three', 'four'])" ] }, { "cell_type": "code", "execution_count": 257, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0123
Colorado4567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 1 2 3\n", "Colorado 4 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 258, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ohio 1\n", "Colorado 5\n", "Utah 9\n", "New York 13\n", "Name: two, dtype: int64" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['two']" ] }, { "cell_type": "code", "execution_count": 259, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
threeone
Ohio20
Colorado64
Utah108
New York1412
\n", "
" ], "text/plain": [ " three one\n", "Ohio 2 0\n", "Colorado 6 4\n", "Utah 10 8\n", "New York 14 12" ] }, "execution_count": 259, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['three', 'one']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "像这样的索引有一些特殊的情况。首先,可以通过切片或一个布尔数组来选择行:" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0123
Colorado4567
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 1 2 3\n", "Colorado 4 5 6 7" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[:2] " ] }, { "cell_type": "code", "execution_count": 260, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Colorado4567
\n", "
" ], "text/plain": [ " one two three four\n", "Colorado 4 5 6 7" ] }, "execution_count": 260, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(data.one >=1 ) & (data.one < 7) ] # 多个逻辑条件组合" ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Colorado4567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Colorado 4 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data['three'] > 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另一种用法是在索引中使用一个布尔 DataFrame,例如通过纯量比较产生的:" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
OhioTrueTrueTrueTrue
ColoradoTrueFalseFalseFalse
UtahFalseFalseFalseFalse
New YorkFalseFalseFalseFalse
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio True True True True\n", "Colorado True False False False\n", "Utah False False False False\n", "New York False False False False" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data < 5" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [], "source": [ "data[data <5] = 0" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0000
Colorado0567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 0 0 0\n", "Colorado 0 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在这种情况下使得 DataFrame 的语法更像一个 ndarry。为了使 DataFrame 可以在行上进行标签索引,我将介绍特殊的索引字段 ix 。这使你可以从 DataFrame 选择一个行和列的子集,使用像 NumPy 的记法再加上轴标签。正如我早先提到的,这也是一种不是很冗长的重新索引的方法:" ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "two 5\n", "three 6\n", "Name: Colorado, dtype: int64" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix['Colorado', ['two', 'three']]" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fouronetwo
Colorado705
Utah1189
\n", "
" ], "text/plain": [ " four one two\n", "Colorado 7 0 5\n", "Utah 11 8 9" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[['Colorado', 'Utah'], [3, 0, 1]] # 3,0,1 表示位置顺序" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one 8\n", "two 9\n", "three 10\n", "four 11\n", "Name: Utah, dtype: int64" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[2] # 第 2 行" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ohio 0\n", "Colorado 6\n", "Utah 10\n", "New York 14\n", "Name: three, dtype: int64" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[:,2] # 第 2 列" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ohio 0\n", "Colorado 5\n", "Utah 9\n", "Name: two, dtype: int64" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[:'Utah', 'two']" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Colorado0567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Colorado 0 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[data.three > 5, :]" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothree
Colorado056
Utah8910
New York121314
\n", "
" ], "text/plain": [ " one two three\n", "Colorado 0 5 6\n", "Utah 8 9 10\n", "New York 12 13 14" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ix[data.three > 5, :3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "因此,有很多方法来选择和重排包含在 pandas 对象中的数据。对于 DataFrame , 下表是这些方法的简短概要。稍后你将接触到分层索引,那时你会有一些额外的选项。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**注意**:在设计 pandas 时,我觉得不得不敲下 frame[:, col] 来选择一列,是非常冗余的(且易出错的),因此列选择是最常见的操作之一。 \n", "因此,我做了这个设计权衡,把所有的富标签索引引入到 ix 。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|切片命令|说明|\n", "|------|-----|\n", "|obj[val]\t|从DataFrame选择单一列或连续列。特殊情况下的便利:布尔数组(过滤行),切片(行切片),或布尔DataFrame(根据一些标准来设置值)。|\n", "|obj.ix[val]\t|从DataFrame的行集选择单行|\n", "|obj.ix[:, val]\t|从列集选择单列|\n", "|obj.ix[val1, val2]\t|选择行和列|\n", "|reindex 方法\t|转换一个或多个轴到新的索引|\n", "|xs 方法\t|通过标签选择单行或单列到一个Series|\n", "|icol, irow 方法\t|通过整数位置,分别的选择单行或单列到一个Series|\n", "|get_value, set_value 方法\t|通过行和列标选择一个单值|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**其他切片方式:**" ] }, { "cell_type": "code", "execution_count": 185, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0000
Colorado0567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 0 0 0\n", "Colorado 0 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 使用标签选取数据" ] }, { "cell_type": "code", "execution_count": 183, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Colorado0567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Colorado 0 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# df.loc[行标签,列标签]\n", "data.loc['Colorado':'New York'] # 选取 a:b 行数据" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ohio 0\n", "Colorado 0\n", "Utah 8\n", "New York 12\n", "Name: one, dtype: int64" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[:,'one'] # 选取 one 列的数据" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 使用位置选取数据" ] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# df.iloc[行位置,列位置]\n", "data.iloc[1,1] # 选取第二行,第二列的值,返回的为单个值" ] }, { "cell_type": "code", "execution_count": 189, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0000
Utah891011
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 0 0 0\n", "Utah 8 9 10 11" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[[0,2], :] #选取第一行及第三行的数据" ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0000
Colorado0567
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 0 0 0\n", "Colorado 0 5 6 7" ] }, "execution_count": 190, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[0:2,:] # 选取第一行到第三行(不包含)的数据,就是前两行的数据" ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ohio 0\n", "Colorado 5\n", "Utah 9\n", "New York 13\n", "Name: two, dtype: int64" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[:,1] # 选取所有记录的第一列的值,返回的为一个 Series" ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one 0\n", "two 5\n", "three 6\n", "four 7\n", "Name: Colorado, dtype: int64" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[1,:] # 选取第二行数据,返回的为一个 Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 算术运算" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "当把对象加起来时,如果有任何的索引对不相同的话,在结果中将会把各自的索引联合起来。 \n", "在索引不重合的地方引入了 NA 值。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 广播" ] }, { "cell_type": "code", "execution_count": 197, "metadata": { "collapsed": true }, "outputs": [], "source": [ "arr = np.arange(12.).reshape(3,4)" ] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0., 1., 2., 3.])" ] }, "execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arr[0]" ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 0., 0.],\n", " [ 4., 4., 4., 4.],\n", " [ 8., 8., 8., 8.]])" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arr-arr[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**以下内容参考至**:[Python中的结构化数据分析利器-Pandas简介](http://cloga.info/python/%E6%95%B0%E6%8D%AE%E7%A7%91%E5%AD%A6/2013/09/17/pandasintro)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 文件操作" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 从 CSV 中读取数据\n", "df = pd.read_csv('foo.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#将 DataFrame 写入 CSV\n", "df.to_csv('foo.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 从 Excel 中读取数据:先定义一个 Excel 文件,\n", "# 用 xls.parse 解析 sheet1 的内容,index_col 用于指定 index 列,na_values 定义缺失值的标识。\n", "xls = ExcelFile('foo.xlsx')\n", "xls.parse('sheet1', index_col=None, na_values=['NA'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ " # 将DataFrame 写入 Excel 文件 \n", "df.to_excel('foo.xlsx', sheet_name='sheet1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 基本运算" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [], "source": [ "data = DataFrame(np.arange(16).reshape((4, 4)),\n", " ....: index=['Ohio', 'Colorado', 'Utah', 'New York'],\n", " ....: columns=['one', 'two', 'three', 'four'])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0123
Colorado4567
Utah891011
New York12131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 1 2 3\n", "Colorado 4 5 6 7\n", "Utah 8 9 10 11\n", "New York 12 13 14 15" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one 6.0\n", "two 7.0\n", "three 8.0\n", "four 9.0\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.mean() # 计算列的平均值,参数为轴,可选值为 0 或 1.默认为 0,即按照列运算" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ohio 6\n", "Colorado 22\n", "Utah 38\n", "New York 54\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sum(1) # 计算行的和" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "one 12\n", "two 12\n", "three 12\n", "four 12\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 将一个函数应用到 DataFrame 的每一列,这里使用的是匿名 lambda 函数,与 R 中 apply 函数类似\n", "data.apply(lambda x: x.max() - x.min())" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio1123
Colorado5567
Utah991011
New York13131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 1 1 2 3\n", "Colorado 5 5 6 7\n", "Utah 9 9 10 11\n", "New York 13 13 14 15" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 应用到某一列\n", "# 推荐\n", "data['one'] = data['one'].apply(lambda x: x+1)\n", "data" ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio2123
Colorado5567
Utah991011
New York13131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 2 1 2 3\n", "Colorado 5 5 6 7\n", "Utah 9 9 10 11\n", "New York 13 13 14 15" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 应用到某一列\n", "# 首元素会多执行一遍\n", "def addone(v):\n", " v[0] += 1\n", " return v\n", "data.apply(addone, axis=1)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio1123
Colorado5567
Utah991011
New York13131415
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 1 1 2 3\n", "Colorado 5 5 6 7\n", "Utah 9 9 10 11\n", "New York 13 13 14 15" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def addone(v):\n", " v += 1\n", " return v\n", "data['one'] = data['one'].map(addone)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 常用操作" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 排序" ] }, { "cell_type": "code", "execution_count": 234, "metadata": {}, "outputs": [], "source": [ "df = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],\n", "....: columns=['Ohio', 'Texas', 'California'])" ] }, { "cell_type": "code", "execution_count": 235, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a012
c345
d678
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 0 1 2\n", "c 3 4 5\n", "d 6 7 8" ] }, "execution_count": 235, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 238, "metadata": {}, "outputs": [], "source": [ "df.ix[\"a\"] = [10,11,12]" ] }, { "cell_type": "code", "execution_count": 251, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a101112
c395
d678
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 10 11 12\n", "c 3 9 5\n", "d 6 7 8" ] }, "execution_count": 251, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "sort_index 可以以轴的标签进行排序。axis 是指用于排序的轴,可选的值有 0 和 1,默认为 0 即行标签(Y 轴),1 为按照列标签排序。 ascending 是排序方式,默认为 True 即降序排列。" ] }, { "cell_type": "code", "execution_count": 252, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
d678
c395
a101112
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "d 6 7 8\n", "c 3 9 5\n", "a 10 11 12" ] }, "execution_count": 252, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_index(axis=0, ascending=False) # 对行进行排序 d,c,a" ] }, { "cell_type": "code", "execution_count": 212, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ohio 10\n", "Texas 11\n", "California 12\n", "Name: a, dtype: int64" ] }, "execution_count": 212, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 254, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TexasOhioCalifornia
a111012
c935
d768
\n", "
" ], "text/plain": [ " Texas Ohio California\n", "a 11 10 12\n", "c 9 3 5\n", "d 7 6 8" ] }, "execution_count": 254, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_index(axis=1, ascending=False) # 对列进行排序 Texas, Ohio, California" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DataFrame 也提供按照指定列进行排序,可以仅指定一个列作为排序标准(以单独列名作为 columns 的参数),也可以进行多重排序(columns 的参数为一个列名的 List,列名的出现顺序决定排序中的优先级),在多重排序中 ascending 参数也为一个 List,分别与 columns 中的 List 元素对应。" ] }, { "cell_type": "code", "execution_count": 243, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a101112
d678
c345
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 10 11 12\n", "d 6 7 8\n", "c 3 4 5" ] }, "execution_count": 243, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by='Ohio', ascending=False)" ] }, { "cell_type": "code", "execution_count": 246, "metadata": {}, "outputs": [], "source": [ "df.ix[1,1]=9" ] }, { "cell_type": "code", "execution_count": 247, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a101112
c395
d678
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 10 11 12\n", "c 3 9 5\n", "d 6 7 8" ] }, "execution_count": 247, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 250, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioTexasCalifornia
a101112
d678
c395
\n", "
" ], "text/plain": [ " Ohio Texas California\n", "a 10 11 12\n", "d 6 7 8\n", "c 3 9 5" ] }, "execution_count": 250, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by=['Ohio','Texas'],ascending=[0,1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 重命名列" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.rename(columns={u'one':'1'}, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 设置索引" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.set_index('one')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 重设索引" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.reset_index(inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 查看最大最小值" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pd.Series.max()\n", "pd.Series.idxmax()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 改变数据类型" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['A'].astype(float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 计算 Series 每个值的频率" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['A'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# DataFrame 的合并" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concat" ] }, { "cell_type": "code", "execution_count": 267, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ds = [{'one' : 4,'two':2},{'one' : 5,'two' : 3},{'one' : 6,'two' : 4},{'two' : 7,'three':10}]\n", "dfs = pd.DataFrame(ds,index=['e','f','g','h'])" ] }, { "cell_type": "code", "execution_count": 269, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onethreetwo
e4.0NaN2
f5.0NaN3
g6.0NaN4
hNaN10.07
\n", "
" ], "text/plain": [ " one three two\n", "e 4.0 NaN 2\n", "f 5.0 NaN 3\n", "g 6.0 NaN 4\n", "h NaN 10.0 7" ] }, "execution_count": 269, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfs" ] }, { "cell_type": "code", "execution_count": 292, "metadata": {}, "outputs": [], "source": [ "##构建一个新的 DataFrame,dfs\n", "df_t=pd.concat([dfs,dfs],axis=1) # 合并两个 DataFrame" ] }, { "cell_type": "code", "execution_count": 293, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onethreetwoonethreetwo
e4.0NaN24.0NaN2
f5.0NaN35.0NaN3
g6.0NaN46.0NaN4
hNaN10.07NaN10.07
\n", "
" ], "text/plain": [ " one three two one three two\n", "e 4.0 NaN 2 4.0 NaN 2\n", "f 5.0 NaN 3 5.0 NaN 3\n", "g 6.0 NaN 4 6.0 NaN 4\n", "h NaN 10.0 7 NaN 10.0 7" ] }, "execution_count": 293, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_t" ] }, { "cell_type": "code", "execution_count": 294, "metadata": { "collapsed": true }, "outputs": [], "source": [ "##构建一个新的 DataFrame,dfs\n", "df_t=pd.concat([dfs,dfs],axis=0) # 合并两个 DataFrame" ] }, { "cell_type": "code", "execution_count": 295, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onethreetwo
e4.0NaN2
f5.0NaN3
g6.0NaN4
hNaN10.07
e4.0NaN2
f5.0NaN3
g6.0NaN4
hNaN10.07
\n", "
" ], "text/plain": [ " one three two\n", "e 4.0 NaN 2\n", "f 5.0 NaN 3\n", "g 6.0 NaN 4\n", "h NaN 10.0 7\n", "e 4.0 NaN 2\n", "f 5.0 NaN 3\n", "g 6.0 NaN 4\n", "h NaN 10.0 7" ] }, "execution_count": 295, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merge" ] }, { "cell_type": "code", "execution_count": 296, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keylvalrval
0foo114
1foo225
\n", "
" ], "text/plain": [ " key lval rval\n", "0 foo1 1 4\n", "1 foo2 2 5" ] }, "execution_count": 296, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left = pd.DataFrame({'key': ['foo1', 'foo2'], 'lval': [1, 2]})\n", "right = pd.DataFrame({'key': ['foo1', 'foo2'], 'rval': [4, 5]})\n", "#构建了两个DataFrame\n", "pd.merge(left, right, on='key')#按照key列将两个DataFrame join在一起" ] }, { "cell_type": "code", "execution_count": 297, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keylval
0foo11
1foo22
\n", "
" ], "text/plain": [ " key lval\n", "0 foo1 1\n", "1 foo2 2" ] }, "execution_count": 297, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left" ] }, { "cell_type": "code", "execution_count": 298, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyrval
0foo14
1foo25
\n", "
" ], "text/plain": [ " key rval\n", "0 foo1 4\n", "1 foo2 5" ] }, "execution_count": 298, "metadata": {}, "output_type": "execute_result" } ], "source": [ "right" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Groupby" ] }, { "cell_type": "code", "execution_count": 300, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import random" ] }, { "cell_type": "code", "execution_count": 305, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],\n", " 'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],\n", " 'C' :np.random.randn(8), 'D' : np.random.randn(8)});" ] }, { "cell_type": "code", "execution_count": 306, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0fooone0.584419-0.594567
1barone-1.331418-0.167330
2footwo0.1255580.744053
3barthree0.850646-0.741091
4footwo-0.033590-0.239762
5bartwo0.3326810.567786
6fooone1.0511951.509953
7foothree0.374087-0.318311
\n", "
" ], "text/plain": [ " A B C D\n", "0 foo one 0.584419 -0.594567\n", "1 bar one -1.331418 -0.167330\n", "2 foo two 0.125558 0.744053\n", "3 bar three 0.850646 -0.741091\n", "4 foo two -0.033590 -0.239762\n", "5 bar two 0.332681 0.567786\n", "6 foo one 1.051195 1.509953\n", "7 foo three 0.374087 -0.318311" ] }, "execution_count": 306, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 307, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CD
A
bar-0.148091-0.340636
foo2.1016691.101366
\n", "
" ], "text/plain": [ " C D\n", "A \n", "bar -0.148091 -0.340636\n", "foo 2.101669 1.101366" ] }, "execution_count": 307, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('A').sum() # 按照 A 列的值分组求和" ] }, { "cell_type": "code", "execution_count": 308, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CD
AB
barone-1.331418-0.167330
three0.850646-0.741091
two0.3326810.567786
fooone1.6356140.915386
three0.374087-0.318311
two0.0919680.504291
\n", "
" ], "text/plain": [ " C D\n", "A B \n", "bar one -1.331418 -0.167330\n", " three 0.850646 -0.741091\n", " two 0.332681 0.567786\n", "foo one 1.635614 0.915386\n", " three 0.374087 -0.318311\n", " two 0.091968 0.504291" ] }, "execution_count": 308, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['A','B']).sum() # 按照 A、B 两列的值分组求和" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**在实际应用中,先定义 groups,然后再对不同的指标指定不同计算方式。**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "groups = df.groupby('A') # 按照 A 列的值分组求和\n", "groups['B'].sum() # 按照 A 列的值分组求 B 组和\n", "groups['B'].count() # 按照 A 列的值分组 B 组计数" ] }, { "cell_type": "code", "execution_count": 309, "metadata": { "collapsed": true }, "outputs": [], "source": [ "groups = df.groupby('A') # 按照 A 列的值分组求和" ] }, { "cell_type": "code", "execution_count": 312, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A\n", "bar -0.148091\n", "foo 2.101669\n", "Name: C, dtype: float64" ] }, "execution_count": 312, "metadata": {}, "output_type": "execute_result" } ], "source": [ "groups['C'].sum() # 按照 A 列的值分组求 B 组和" ] }, { "cell_type": "code", "execution_count": 314, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A\n", "bar 3\n", "foo 5\n", "Name: B, dtype: int64" ] }, "execution_count": 314, "metadata": {}, "output_type": "execute_result" } ], "source": [ "groups['B'].count() # 按照 A 列的值分组 B 组计数" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**默认会以 groupby 的值作为索引,如果不将这些值作为索引,则需要使用 as_index=False**" ] }, { "cell_type": "code", "execution_count": 316, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0fooone0.584419-0.594567
1barone-1.331418-0.167330
2footwo0.1255580.744053
3barthree0.850646-0.741091
4footwo-0.033590-0.239762
5bartwo0.3326810.567786
6fooone1.0511951.509953
7foothree0.374087-0.318311
\n", "
" ], "text/plain": [ " A B C D\n", "0 foo one 0.584419 -0.594567\n", "1 bar one -1.331418 -0.167330\n", "2 foo two 0.125558 0.744053\n", "3 bar three 0.850646 -0.741091\n", "4 foo two -0.033590 -0.239762\n", "5 bar two 0.332681 0.567786\n", "6 foo one 1.051195 1.509953\n", "7 foo three 0.374087 -0.318311" ] }, "execution_count": 316, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 317, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0barone-1.331418-0.167330
1barthree0.850646-0.741091
2bartwo0.3326810.567786
3fooone1.6356140.915386
4foothree0.374087-0.318311
5footwo0.0919680.504291
\n", "
" ], "text/plain": [ " A B C D\n", "0 bar one -1.331418 -0.167330\n", "1 bar three 0.850646 -0.741091\n", "2 bar two 0.332681 0.567786\n", "3 foo one 1.635614 0.915386\n", "4 foo three 0.374087 -0.318311\n", "5 foo two 0.091968 0.504291" ] }, "execution_count": 317, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['A','B'], as_index=False).sum()\n", "# 注意与 In [308] 对比" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 透视表" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用 pivot_table 和 crosstab 都可以创建数据透视表" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pivot_table" ] }, { "cell_type": "code", "execution_count": 318, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,'B' : ['A', 'B', 'C'] * 4, \n", " 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, \n", " 'D' : np.random.randn(12), 'E' : np.random.randn(12)})" ] }, { "cell_type": "code", "execution_count": 319, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDE
0oneAfoo0.0810740.358511
1oneBfoo-1.741607-1.105890
2twoCfoo-1.220509-0.241541
3threeAbar-0.6630410.087185
4oneBbar-0.259216-1.891769
5oneCbar-0.8105471.456797
6twoAfoo-0.5077841.801202
7threeBfoo0.6020521.740630
8oneCfoo0.671181-0.210367
9oneAbar0.615852-0.947002
10twoBbar-0.209238-1.500853
11threeCbar1.0503430.837500
\n", "
" ], "text/plain": [ " A B C D E\n", "0 one A foo 0.081074 0.358511\n", "1 one B foo -1.741607 -1.105890\n", "2 two C foo -1.220509 -0.241541\n", "3 three A bar -0.663041 0.087185\n", "4 one B bar -0.259216 -1.891769\n", "5 one C bar -0.810547 1.456797\n", "6 two A foo -0.507784 1.801202\n", "7 three B foo 0.602052 1.740630\n", "8 one C foo 0.671181 -0.210367\n", "9 one A bar 0.615852 -0.947002\n", "10 two B bar -0.209238 -1.500853\n", "11 three C bar 1.050343 0.837500" ] }, "execution_count": 319, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 324, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Cbarfoo
AB
oneA0.6158520.081074
B-0.259216-1.741607
C-0.8105470.671181
threeA-0.663041NaN
BNaN0.602052
C1.050343NaN
twoANaN-0.507784
B-0.209238NaN
CNaN-1.220509
\n", "
" ], "text/plain": [ "C bar foo\n", "A B \n", "one A 0.615852 0.081074\n", " B -0.259216 -1.741607\n", " C -0.810547 0.671181\n", "three A -0.663041 NaN\n", " B NaN 0.602052\n", " C 1.050343 NaN\n", "two A NaN -0.507784\n", " B -0.209238 NaN\n", " C NaN -1.220509" ] }, "execution_count": 324, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(df, values = 'D', index = ['A', 'B'], columns = ['C']) #以 A、B 为行标签,以 C 为列标签将 D 列的值汇总求和" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## crosstab" ] }, { "cell_type": "code", "execution_count": 336, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDE
0oneAfoo0.0810740.358511
1oneBfoo-1.741607-1.105890
2twoCfoo-1.220509-0.241541
3threeAbar-0.6630410.087185
4oneBbar-0.259216-1.891769
5oneCbar-0.8105471.456797
6twoAfoo-0.5077841.801202
7threeBfoo0.6020521.740630
8oneCfoo0.671181-0.210367
9oneAbar0.615852-0.947002
10twoBbar-0.209238-1.500853
11threeCbar1.0503430.837500
\n", "
" ], "text/plain": [ " A B C D E\n", "0 one A foo 0.081074 0.358511\n", "1 one B foo -1.741607 -1.105890\n", "2 two C foo -1.220509 -0.241541\n", "3 three A bar -0.663041 0.087185\n", "4 one B bar -0.259216 -1.891769\n", "5 one C bar -0.810547 1.456797\n", "6 two A foo -0.507784 1.801202\n", "7 three B foo 0.602052 1.740630\n", "8 one C foo 0.671181 -0.210367\n", "9 one A bar 0.615852 -0.947002\n", "10 two B bar -0.209238 -1.500853\n", "11 three C bar 1.050343 0.837500" ] }, "execution_count": 336, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 335, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CbarfooAll
AB
oneA112
B112
C112
threeA101
B011
C101
twoA011
B101
C011
All6612
\n", "
" ], "text/plain": [ "C bar foo All\n", "A B \n", "one A 1 1 2\n", " B 1 1 2\n", " C 1 1 2\n", "three A 1 0 1\n", " B 0 1 1\n", " C 1 0 1\n", "two A 0 1 1\n", " B 1 0 1\n", " C 0 1 1\n", "All 6 6 12" ] }, "execution_count": 335, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab([df.A, df.B], df.C, margins=True) # 以 A、B 为行标签,以 C 为列标签将 D 列的值汇总求个数" ] }, { "cell_type": "code", "execution_count": 338, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Cbarfoo
AB
oneA11
B11
C11
threeA10
B01
C10
twoA01
B10
C01
\n", "
" ], "text/plain": [ "C bar foo\n", "A B \n", "one A 1 1\n", " B 1 1\n", " C 1 1\n", "three A 1 0\n", " B 0 1\n", " C 1 0\n", "two A 0 1\n", " B 1 0\n", " C 0 1" ] }, "execution_count": 338, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab([df.A, df.B], df.C, margins=False) # 以 A、B 为行标签,以 C 为列标签将 D 列的值汇总求个数" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 时间序列分析" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas 提供 to_datetime 方法将代表时间的字符转化为 Timestamp 对象:" ] }, { "cell_type": "code", "execution_count": 339, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s = '2013-09-16 21:00:00'\n", "ts = pd.to_datetime(s)" ] }, { "cell_type": "code", "execution_count": 340, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2013-09-16 21:00:00')" ] }, "execution_count": 340, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "有时我们需要处理时区问题:" ] }, { "cell_type": "code", "execution_count": 342, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ts=pd.to_datetime(s,utc=True).tz_convert('Asia/Shanghai')" ] }, { "cell_type": "code", "execution_count": 343, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2013-09-17 05:00:00+0800', tz='Asia/Shanghai')" ] }, "execution_count": 343, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "构建时间序列:" ] }, { "cell_type": "code", "execution_count": 345, "metadata": {}, "outputs": [], "source": [ "rng = pd.date_range('1/1/2012', periods=5, freq='M')\n", "ts = pd.Series(np.random.randn(len(rng)), index=rng)" ] }, { "cell_type": "code", "execution_count": 346, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',\n", " '2012-05-31'],\n", " dtype='datetime64[ns]', freq='M')" ] }, "execution_count": 346, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng" ] }, { "cell_type": "code", "execution_count": 347, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2012-01-31 0.839176\n", "2012-02-29 -0.631665\n", "2012-03-31 -0.941336\n", "2012-04-30 0.754626\n", "2012-05-31 -2.149063\n", "Freq: M, dtype: float64" ] }, "execution_count": 347, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas 提供 resample 方法对时间序列的时间粒度进行调整:" ] }, { "cell_type": "code", "execution_count": 349, "metadata": {}, "outputs": [], "source": [ "ts_h=ts.resample('H').count() # M,5Min,1s\n", "# 以上是将时间序列调整为小时,还可以支持月(M),分钟(Min)甚至秒(s)等。" ] }, { "cell_type": "code", "execution_count": 351, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2012-01-31 00:00:00 1\n", "2012-01-31 01:00:00 0\n", "2012-01-31 02:00:00 0\n", "2012-01-31 03:00:00 0\n", "2012-01-31 04:00:00 0\n", "Freq: H, dtype: int64" ] }, "execution_count": 351, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ts_h.head()" ] }, { "cell_type": "code", "execution_count": 353, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 353, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(ts_h)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 参考网站\n", "\n", "- [Cloga 的互联网笔记 ](http://cloga.info/python/%E6%95%B0%E6%8D%AE%E7%A7%91%E5%AD%A6/2013/09/17/pandasintro)大神~\n", "- [Python For Data Analysis 0.1](http://pda.readthedocs.io/en/latest/chp5.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": { "height": "512px", "width": "252px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": "block", "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }