{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 二维数据结构:DataFrame" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "`DataFrame` 是 `pandas` 中的二维数据结构,可以看成一个 `Excel` 中的工作表,或者一个 `SQL` 表,或者一个存储 `Series` 对象的字典。\n", "\n", "`DataFrame(data, index, columns)` 中的 `data` 可以接受很多数据类型:\n", "\n", "- 一个存储一维数组,字典,列表或者 `Series` 的字典\n", "- 2-D 数组\n", "- 结构或者记录数组\n", "- 一个 `Series`\n", "- 另一个 `DataFrame`\n", "\n", "`index` 用于指定行的 `label`,`columns` 用于指定列的 `label`,如果参数不传入,那么会按照传入的内容进行设定。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 从 Series 字典中构造" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以使用值为 `Series` 的字典进行构造:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),\n", " 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果没有传入 `columns` 的值,那么 `columns` 的值默认为字典 `key`,`index` 默认为所有 `value` 中 `index` 的并集。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
a11
b22
c33
dNaN4
\n", "
" ], "text/plain": [ " one two\n", "a 1 1\n", "b 2 2\n", "c 3 3\n", "d NaN 4" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(d)\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果指定了 `index` 值,`index` 为指定的 `index` 值:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
dNaN4
b22
a11
\n", "
" ], "text/plain": [ " one two\n", "d NaN 4\n", "b 2 2\n", "a 1 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(d, index=['d', 'b', 'a'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果指定了 `columns` 值,会去字典中寻找,找不到的值为 `NaN`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
twothree
d4NaN
b2NaN
a1NaN
\n", "
" ], "text/plain": [ " two three\n", "d 4 NaN\n", "b 2 NaN\n", "a 1 NaN" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "查看 `index` 和 `columns`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'a', u'b', u'c', u'd'], dtype='object')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'one', u'two'], dtype='object')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 从 ndarray 或者 list 字典中构造" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果字典是 `ndarray` 或者 `list`,那么它们的长度要严格保持一致:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "d = {'one' : [1., 2., 3., 4.],\n", " 'two' : [4., 3., 2., 1.]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`index` 默认为 `range(n)`,其中 `n` 为数组长度: " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
014
123
232
341
\n", "
" ], "text/plain": [ " one two\n", "0 1 4\n", "1 2 3\n", "2 3 2\n", "3 4 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果传入 `index` 参数,那么它必须与数组等长:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
a14
b23
c32
d41
\n", "
" ], "text/plain": [ " one two\n", "a 1 4\n", "b 2 3\n", "c 3 2\n", "d 4 1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(d, index=['a', 'b', 'c', 'd'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 从结构数组中构造" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`numpy` 支持结构数组的构造:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([(1, 2.0, 'Hello'), (2, 3.0, 'World')], \n", " dtype=[('A', '\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
012Hello
123World
\n", "" ], "text/plain": [ " A B C\n", "0 1 2 Hello\n", "1 2 3 World" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(data)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
first12Hello
second23World
\n", "
" ], "text/plain": [ " A B C\n", "first 1 2 Hello\n", "second 2 3 World" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(data, index=['first', 'second'])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CAB
0Hello12
1World23
\n", "
" ], "text/plain": [ " C A B\n", "0 Hello 1 2\n", "1 World 2 3" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(data, columns=['C', 'A', 'B'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 从字典列表中构造" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "字典中同一个键的值会被合并到同一列:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
012NaN
151020
\n", "
" ], "text/plain": [ " a b c\n", "0 1 2 NaN\n", "1 5 10 20" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]\n", "\n", "pd.DataFrame(data2)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
first12NaN
second51020
\n", "
" ], "text/plain": [ " a b c\n", "first 1 2 NaN\n", "second 5 10 20" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(data2, index=['first', 'second'])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012
1510
\n", "
" ], "text/plain": [ " a b\n", "0 1 2\n", "1 5 10" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(data2, columns=['a', 'b'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 从 Series 中构造" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "相当于将 Series 二维化。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 其他构造方法" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrame.from_dict` 从现有的一个字典中构造,`DataFrame.from_records` 从现有的一个记录数组中构造:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
C
Hello12
World23
\n", "
" ], "text/plain": [ " A B\n", "C \n", "Hello 1 2\n", "World 2 3" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame.from_records(data, index='C')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrame.from_items` 从字典的 `item` 对构造:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
014
125
236
\n", "
" ], "text/plain": [ " A B\n", "0 1 4\n", "1 2 5\n", "2 3 6" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 列操作" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrame` 可以类似于字典一样对列进行操作:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 2\n", "c 3\n", "d NaN\n", "Name: one, dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"one\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "添加新列:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreeflag
a111False
b224False
c339True
dNaN4NaNFalse
\n", "
" ], "text/plain": [ " one two three flag\n", "a 1 1 1 False\n", "b 2 2 4 False\n", "c 3 3 9 True\n", "d NaN 4 NaN False" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['three'] = df['one'] * df['two']\n", "\n", "df['flag'] = df['one'] > 2\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以像字典一样删除:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
oneflag
a1False
b2False
c3True
dNaNFalse
\n", "
" ], "text/plain": [ " one flag\n", "a 1 False\n", "b 2 False\n", "c 3 True\n", "d NaN False" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del df[\"two\"]\n", "\n", "three = df.pop(\"three\")\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "给一行赋单一值:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
oneflagfoo
a1Falsebar
b2Falsebar
c3Truebar
dNaNFalsebar
\n", "
" ], "text/plain": [ " one flag foo\n", "a 1 False bar\n", "b 2 False bar\n", "c 3 True bar\n", "d NaN False bar" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['foo'] = 'bar'\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果 `index` 不一致,那么会只保留公共的部分:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
oneflagfooone_trunc
a1Falsebar1
b2Falsebar2
c3TruebarNaN
dNaNFalsebarNaN
\n", "
" ], "text/plain": [ " one flag foo one_trunc\n", "a 1 False bar 1\n", "b 2 False bar 2\n", "c 3 True bar NaN\n", "d NaN False bar NaN" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['one_trunc'] = df['one'][:2]\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "也可以直接插入一维数组,但是数组的长度必须与 `index` 一致。\n", "\n", "默认新列插入位置在最后,也可以指定位置插入:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onebarflagfooone_trunc
a11Falsebar1
b22Falsebar2
c33TruebarNaN
dNaNNaNFalsebarNaN
\n", "
" ], "text/plain": [ " one bar flag foo one_trunc\n", "a 1 1 False bar 1\n", "b 2 2 False bar 2\n", "c 3 3 True bar NaN\n", "d NaN NaN False bar NaN" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.insert(1, 'bar', df['one'])\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "添加一个 `test` 新列:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onebarflagfooone_trunctest
a11Falsebar12
b22Falsebar24
c33TruebarNaN6
dNaNNaNFalsebarNaNNaN
\n", "
" ], "text/plain": [ " one bar flag foo one_trunc test\n", "a 1 1 False bar 1 2\n", "b 2 2 False bar 2 4\n", "c 3 3 True bar NaN 6\n", "d NaN NaN False bar NaN NaN" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign(test=df[\"one\"] + df[\"bar\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 索引和选择" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "基本操作:\n", "\n", "| Operation\t| Syntax | Result |\n", "| ---- | ----- | ---- |\n", "| Select column\t| df[col] | Series |\n", "| Select row by label | df.loc[label] | Series |\n", "| Select row by integer location | df.iloc[loc] | Series |\n", "| Slice rows | df[5:10]\t| DataFrame |\n", "| Select rows by boolean vector\t| df[bool_vec] | DataFrame |" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }