{
"cells": [
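{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas provides high-level data structures and manipulation tools that make data analysis faster and easier. It is built on top of NumPy, which makes NumPy-centric applications simpler to write."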
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from pandas import Series, DataFrame\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. pandas data structures\n",
"The two main pandas data structures are Series and DataFrame."
]
},
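{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Series\n",
"A Series is a one-dimensional, array-like object made up of an array of **data** (of any NumPy data type) and an associated array of data labels, called its **index**."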
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Constructing a Series"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 4\n",
"1 7\n",
"2 -5\n",
"3 3\n",
"dtype: int64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
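],
"source": [
"obj = Series([4, 7, -5, 3])\n",
"obj\n",
"# The string representation of a Series shows the index on the left and the values on the right.\n",
"# No index was specified for the data, so a default integer index from 0 to N-1 is created."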
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 4, 7, -5, 3], dtype=int64)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Access the underlying array through the values attribute (the index attribute is shown in the next cell)\n",
"obj.values"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([0, 1, 2, 3], dtype='int64')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.index"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"d 4\n",
"b 7\n",
"a -5\n",
"c 3\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a Series with an index that labels each data point\n",
"obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])\n",
"obj2"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'd', u'b', u'a', u'c'], dtype='object')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2.index"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-5"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2['a']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. NumPy array operations"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"d 4\n",
"b 7\n",
"a -5\n",
"c 3\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj2"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"d 4\n",
"b 7\n",
"c 3\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Filter with a boolean expression\n",
"obj2[obj2 > 0]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"d 8\n",
"b 14\n",
"a -10\n",
"c 6\n",
"dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Scalar multiplication\n",
"obj2 * 2"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"d 54.598150\n",
"b 1096.633158\n",
"a 0.006738\n",
"c 20.085537\n",
"dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Apply a math function\n",
"np.exp(obj2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Series can also be thought of as a fixed-length, ordered dict, because it maps index values to data values."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'b' in obj2"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'e' in obj2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Creating a Series from a Python dict"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 35000\n",
"Oregon 16000\n",
"Texas 71000\n",
"Utah 5000\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}\n",
"# Pass a Python dict; its keys become the index of the Series\n",
"obj3 = Series(sdata)\n",
"obj3"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"California NaN\n",
"Ohio 35000\n",
"Oregon 16000\n",
"Texas 71000\n",
"dtype: float64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
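],
"source": [
"sindex = ['California', 'Ohio', 'Oregon', 'Texas']\n",
"obj4 = Series(sdata, index=sindex)\n",
"# Values in sdata whose keys match the labels in sindex are placed at the corresponding positions.\n",
"# 'California' has no entry in sdata, so it appears as NaN; 'Utah' is not in sindex, so it is dropped.\n",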
"obj4"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"California True\n",
"Ohio False\n",
"Oregon False\n",
"Texas False\n",
"dtype: bool"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj4.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4. Automatic alignment of Series"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"California NaN\n",
"Ohio 70000\n",
"Oregon 32000\n",
"Texas 142000\n",
"Utah NaN\n",
"dtype: float64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj3 + obj4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5. The name attribute of a Series"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"state\n",
"California NaN\n",
"Ohio 35000\n",
"Oregon 16000\n",
"Texas 71000\n",
"Name: population, dtype: float64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj4.name = 'population'\n",
"obj4.index.name = 'state'\n",
"obj4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 6. Modifying the index of a Series"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Bob 4\n",
"Steve 7\n",
"Jeff -5\n",
"Ryan 3\n",
"dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']\n",
"obj"
]
},
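{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 DataFrame\n",
"A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index."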
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Constructing a DataFrame\n",
"The most common way is to pass a dict of equal-length lists or NumPy arrays."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" pop state year\n",
"0 1.5 Ohio 2000\n",
"1 1.7 Ohio 2001\n",
"2 3.6 Ohio 2002\n",
"3 2.4 Nevada 2001\n",
"4 2.9 Nevada 2002"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],\n",
"        'year': [2000, 2001, 2002, 2001, 2002],\n",
"        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}\n",
"frame = DataFrame(data)\n",
"frame\n",
"# The DataFrame gets an index automatically, and in this pandas version the columns come out in sorted order."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" year state pop\n",
"0 2000 Ohio 1.5\n",
"1 2001 Ohio 1.7\n",
"2 2002 Ohio 3.6\n",
"3 2001 Nevada 2.4\n",
"4 2002 Nevada 2.9"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# If a sequence of columns is specified, the DataFrame's columns are arranged in that order\n",
"DataFrame(data, columns=['year', 'state', 'pop'])"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 NaN\n",
"two 2001 Ohio 1.7 NaN\n",
"three 2002 Ohio 3.6 NaN\n",
"four 2001 Nevada 2.4 NaN\n",
"five 2002 Nevada 2.9 NaN"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# If a requested column is not found in the data, it is filled with NA values\n",
"frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],\n",
" index=['one', 'two', 'three', 'four', 'five'])\n",
"frame2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Working with DataFrame rows and columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A column of a DataFrame can be retrieved as a Series, either with dict-like notation or by attribute access."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one Ohio\n",
"two Ohio\n",
"three Ohio\n",
"four Nevada\n",
"five Nevada\n",
"Name: state, dtype: object"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2['state']"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one 2000\n",
"two 2001\n",
"three 2002\n",
"four 2001\n",
"five 2002\n",
"Name: year, dtype: int64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2.year"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The returned Series has the same index as the original DataFrame, and its name attribute is already set."
]
},
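{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ix indexing field retrieves a row of the DataFrame by label. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0; on newer versions, loc performs the same label-based lookup."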
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"year 2002\n",
"state Ohio\n",
"pop 3.6\n",
"debt NaN\n",
"Name: three, dtype: object"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2.ix['three']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Columns can be modified by assignment."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"frame2['debt'] = 16.5"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 16.5\n",
"two 2001 Ohio 1.7 16.5\n",
"three 2002 Ohio 3.6 16.5\n",
"four 2001 Nevada 2.4 16.5\n",
"five 2002 Nevada 2.9 16.5"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2"
]
},
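{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a list or array is assigned to a column, its length must match the length of the DataFrame. If a Series is assigned, it is aligned exactly to the DataFrame's index, and any index labels missing from the Series are filled with NaN."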
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 1\n",
"two 2001 Ohio 1.7 2\n",
"three 2002 Ohio 3.6 3\n",
"four 2001 Nevada 2.4 4\n",
"five 2002 Nevada 2.9 5"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2['debt'] = [1, 2, 3, 4, 5]\n",
"frame2"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 NaN\n",
"two 2001 Ohio 1.7 -1.2\n",
"three 2002 Ohio 3.6 NaN\n",
"four 2001 Nevada 2.4 -1.5\n",
"five 2002 Nevada 2.9 -1.7"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])\n",
"frame2['debt'] = val\n",
"frame2"
]
},
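{
"cell_type": "markdown",
"metadata": {},
"source": [
"The length rule described above can be checked directly. The cell below is a minimal sketch that assumes a small hypothetical frame named demo (it is not part of the original example): assigning a list of the wrong length is expected to raise a ValueError, while assigning a shorter Series succeeds because it is aligned on the index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical three-row frame used only for this illustration\n",
"demo = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])\n",
"\n",
"# A list with the wrong number of elements is rejected\n",
"try:\n",
"    demo['debt'] = [1, 2]\n",
"    length_error = None\n",
"except ValueError as exc:\n",
"    length_error = exc\n",
"\n",
"# A Series is aligned on the index instead; unmatched labels become NaN\n",
"demo['debt'] = pd.Series([-1.2], index=['two'])\n",
"demo"
]
},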
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assigning to a column that does not exist creates a new column."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"frame2['eastern'] = frame2.state == 'Ohio'"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" year state pop debt eastern\n",
"one 2000 Ohio 1.5 NaN True\n",
"two 2001 Ohio 1.7 -1.2 True\n",
"three 2002 Ohio 3.6 NaN True\n",
"four 2001 Nevada 2.4 -1.5 False\n",
"five 2002 Nevada 2.9 -1.7 False"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The del keyword deletes a column."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"del frame2['eastern']"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'year', u'state', u'pop', u'debt'], dtype='object')"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2.columns"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" year state pop debt\n",
"one 2000 Ohio 1.5 NaN\n",
"two 2001 Ohio 1.7 -1.2\n",
"three 2002 Ohio 3.6 NaN\n",
"four 2001 Nevada 2.4 -1.5\n",
"five 2002 Nevada 2.9 -1.7"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame2"
]
},
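{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the pandas version used here, a column returned by indexing is a view on the underlying data, not a copy, so any in-place modification of the returned Series is reflected in the original DataFrame. Use the copy method of the Series to copy the column explicitly."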
]
},
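{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged illustration of the copy method mentioned above, the sketch below uses a hypothetical frame named demo. It only demonstrates the copy side, which should behave the same across pandas versions; whether a column selected without copy acts as a view depends on the pandas version and its copy-on-write settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical frame used only for this illustration\n",
"demo = pd.DataFrame({'year': [2000, 2001, 2002]})\n",
"\n",
"# Explicitly copy the column, then modify the copy\n",
"year_copy = demo['year'].copy()\n",
"year_copy[0] = 1999\n",
"\n",
"# The original DataFrame is unaffected by the change to the copy\n",
"demo"
]
},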
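{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"#### 3. Passing a nested dict to DataFrame\n",
"If the data is a nested dict (a dict of dicts) and is passed to DataFrame, the outer keys are interpreted as the columns and the inner keys as the row index."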
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" Nevada Ohio\n",
"2000 NaN 1.5\n",
"2001 2.4 1.7\n",
"2002 2.9 3.6"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pop = {'Nevada': {2001: 2.4, 2002: 2.9},\n",
" 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}\n",
"frame3 = DataFrame(pop)\n",
"frame3"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" 2000 2001 2002\n",
"Nevada NaN 2.4 2.9\n",
"Ohio 1.5 1.7 3.6"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Transpose the result\n",
"frame3.T"
]
},
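{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data that can be passed to the DataFrame constructor:\n",
"- 2D ndarray: a matrix of data\n",
"- Dict of arrays, lists, or tuples: each sequence becomes a column of the DataFrame; all sequences must have the same length\n",
"- NumPy structured/record array: treated like a dict of arrays\n",
"- Dict of Series\n",
"- Dict of dicts\n",
"- List of dicts or Series\n",
"- List of lists or tuples\n",
"- Another DataFrame\n",
"- NumPy MaskedArray"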
]
},
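{
"cell_type": "markdown",
"metadata": {},
"source": [
"A few of the input types listed above can be sketched as follows. The variable names (from_array, from_records, from_series) and the sample values are made up for illustration and do not appear in the original example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# 2D ndarray: a data matrix, with optional row and column labels\n",
"from_array = pd.DataFrame(np.arange(6).reshape(3, 2),\n",
"                          index=['a', 'b', 'c'], columns=['x', 'y'])\n",
"\n",
"# List of dicts: each dict becomes one row, and missing keys become NaN\n",
"from_records = pd.DataFrame([{'x': 1, 'y': 2}, {'x': 3}])\n",
"\n",
"# Dict of Series: the indexes are unioned to form the row index\n",
"from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['a', 'b']),\n",
"                            'y': pd.Series([3, 4], index=['b', 'c'])})\n",
"from_series"
]
},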
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4. DataFrame attributes"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"state Nevada Ohio\n",
"year \n",
"2000 NaN 1.5\n",
"2001 2.4 1.7\n",
"2002 2.9 3.6"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Set the name attributes of the DataFrame's index and columns, then display the frame\n",
"frame3.index.name = 'year'\n",
"frame3.columns.name = 'state'\n",
"frame3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The values attribute returns the data in the DataFrame as a two-dimensional ndarray."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ nan, 1.5],\n",
" [ 2.4, 1.7],\n",
" [ 2.9, 3.6]])"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame3.values"
]
},
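{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Index objects\n",
"pandas Index objects hold the axis labels and other metadata (such as the axis names). When a Series or DataFrame is constructed, any array or other sequence of labels is converted to an Index. Index objects are immutable, so they cannot be modified by the user."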
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'a', u'b', u'c'], dtype='object')"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = Series(range(3), index=['a','b','c'])\n",
"index = obj.index\n",
"index"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'b', u'c'], dtype='object')"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index[1:]"
]
},
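{
"cell_type": "markdown",
"metadata": {},
"source": [
"Index objects are immutable, which is what makes it safe to share one Index among several data structures. The cell below is a minimal sketch of that behavior; the names labels and immutable are hypothetical and used only here. Item assignment on an Index is expected to raise a TypeError."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical Index built the same way as in the cells above\n",
"labels = pd.Series(range(3), index=['a', 'b', 'c']).index\n",
"\n",
"# Item assignment on an Index is rejected\n",
"try:\n",
"    labels[1] = 'd'\n",
"    immutable = False\n",
"except TypeError:\n",
"    immutable = True\n",
"\n",
"immutable"
]
},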
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}