{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# pandas\n", "\n", "## pandas features and imports\n", "\n", "1. Provides high-level data structures and sophisticated tools\n", "2. pandas is built on top of NumPy\n", "3. Imports:\n", "```\n", "import numpy as np\n", "from pandas import Series, DataFrame\n", "import pandas as pd\n", "```\n", "\n", "## pandas data structures\n", "\n", "1. SERIES\n", "\n", "A one-dimensional array-like object\n", "\n", "It holds an array of data (of any NumPy dtype) and an associated index\n", "\n", "Without an explicit index: `a = Series([1,2,3])`, which prints as\n", "```\n", "0    1\n", "1    2\n", "2    3\n", "```\n", "The attributes `a.index` and `a.values` hold the index and the values\n", "\n", "With an explicit index: `a = Series([1,2,3],index=['a','b','c'])`\n", "\n", "Values can be accessed by index label: `a['b']`\n", "\n", "Check whether an index label exists: `'b' in a`\n", "\n", "Build a Series from a dict\n", "```\n", "data = {'china':10,'america':30,'indian':20}\n", "print(Series(data))\n", "```\n", "Output (older pandas versions sort the keys as shown; newer versions keep the dict's insertion order):\n", "```\n", "america    30\n", "china      10\n", "indian     20\n", "dtype: int64\n", "```\n", "Find which index labels have no value:\n", "```\n", "data = {'china':10,'america':30,'indian':20}\n", "state = ['china','america','test']\n", "a = Series(data, index=state)\n", "print(a.isnull())\n", "```\n", "Output (the label test has no corresponding value):\n", "```\n", "china      False\n", "america    False\n", "test        True\n", "dtype: bool\n", "```\n", "Arithmetic operations automatically align data by index label\n", "```\n", "a = Series([10,20],['china','test'])\n", "b = Series([10,20],['test','china'])\n", "print(a+b)\n", "```\n", "Output:\n", "```\n", "china    30\n", "test     30\n", "dtype: int64\n", "```\n", "Set the `name` attribute of the Series and of its index\n", "```\n", "a = Series([10,20],['china','test'])\n", "a.index.name = 'state'\n", "a.name = 'number'\n", "print(a)\n", "```\n", "Output:\n", "```\n", "state\n", "china    10\n", "test     20\n", "Name: number, dtype: int64\n", "```\n", "2. 
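DATAFRAME\n", "\n", "A DataFrame represents a tabular, spreadsheet-like data structure\n", "\n", "It contains an ordered collection of columns\n", "\n", "Each column can hold a different value type (numeric, string, boolean, and so on)\n", "\n", "Internally a DataFrame stores its data in a two-dimensional format, so hierarchical indexing can be used to represent higher-dimensional data in tabular form\n", "\n", "Creation:\n", "\n", "From a dict\n", "```\n", "data = {'state': ['a', 'b', 'c', 'd', 'd'],\n", "        'year': [2000, 2001, 2002, 2001, 2002],\n", "        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}\n", "frame = DataFrame(data)\n", "print(frame)\n", "```\n", "Output (older pandas versions sort the columns by name as shown, while newer versions keep the dict's key order; if you pass the column names explicitly, your order is used; the index is assigned automatically):\n", "```\n", "   pop state  year\n", "0  1.5     a  2000\n", "1  1.7     b  2001\n", "2  3.6     c  2002\n", "3  2.4     d  2001\n", "4  2.9     d  2002\n", "```\n", "Access\n", "\n", "Columns: as with `Series`, access by column name: `frame['state']` or `frame.state`\n", "\n", "Rows: use an indexer; `frame.loc[2]` selects by label and `frame.iloc[2]` by position, and here both return the third row across all columns (the older `frame.ix[2]` only works in pandas before 1.0, where `ix` was removed)\n", "\n", "Assignment: `frame['debt'] = np.arange(5.)`; if there is no `debt` column, a new one is added\n", "\n", "Delete a column: `del frame['debt']`\n", "\n", "As with `Series`, the `values` attribute returns the data in the `DataFrame` as a two-dimensional `ndarray`\n", "\n", "List all columns: `frame.columns`\n", "\n", "Transpose: `frame.T`\n", "\n", "3. Index objects\n", "\n", "pandas Index objects hold the axis labels and other metadata (such as the axis name or names)\n", "\n", "Index objects are immutable, so they cannot be modified by the user\n", "\n", "Create: `index = pd.Index([1,2,3])`\n", "\n", "Common operations\n", "\n", "append --> concatenate additional Index objects, producing a new Index\n", "\n", "difference --> compute the set difference (called `diff` in very old pandas versions)\n", "\n", "intersection --> compute the intersection\n", "\n", "union --> compute the union\n", "\n", "isin --> compute a boolean array indicating whether each value is contained in the passed collection\n", "\n", "delete --> compute a new Index with the element at position i removed\n", "\n", "drop --> compute a new Index with the passed values removed\n", "\n", "insert --> compute a new Index with an element inserted at position i\n", "\n", "is_monotonic_increasing --> True if each element is greater than or equal to the previous one (older pandas versions also offer the alias `is_monotonic`)\n", "\n", "is_unique --> True if the Index has no duplicate values\n", "\n", "unique --> compute the array of unique values in the Index\n", "\n", "## Reindexing with reindex\n", "\n", "1. SERIES\n", "\n", "    Reorder\n", "    ```\n", "    a = Series([2,3,1],index=['b','a','c'])\n", "    b = a.reindex(['a','b','c'])\n", "    print(b)\n", "    ```\n", "2. Reorder and fill labels that have no value with 0: `b=a.reindex(['a','b','c','d'],fill_value=0)`\n", "\n", "3. 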
Interpolate or fill values while reindexing\n", "```\n", "a = Series(['a','b','c'],index=[0,2,4])\n", "b = a.reindex(range(6),method='ffill')\n", "print(b)\n", "```\n", "Output:\n", "```\n", "0    a\n", "1    a\n", "2    b\n", "3    b\n", "4    c\n", "5    c\n", "dtype: object\n", "```\n", "Options for `method`\n", "\n", "ffill or pad --> forward fill (carry values forward)\n", "\n", "bfill or backfill --> backward fill (carry values backward)\n", "\n", "4. DATAFRAME\n", "\n", "As with Series, `reindex` works on the index;\n", "it can also reindex the columns: `frame.reindex(columns=['a','b'])`\n", "\n", "## Dropping entries from an axis\n", "\n", "1. SERIES\n", "\n", "`a.drop(['a','b'])` drops the entries labeled a and b\n", "\n", "2. DATAFRAME\n", "\n", "Index entries are dropped the same way as for `Series`\n", "\n", "Dropping a column: `a.drop(['one'], axis=1)` drops the column named one\n", "\n", "## Indexing, selection, and filtering\n", "\n", "1. SERIES\n", "\n", "Data can be accessed by index label or by integer position, e.g. for `a = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])`, `a['b']` and `a.iloc[1]` return the same value (older pandas versions also accept `a[1]`)\n", "\n", "Slicing with labels differs from normal Python slicing: the endpoint is included\n", "```\n", "a = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])\n", "print(a['b':'c'])\n", "```\n", "The output includes the value at index c\n", "\n", "2. DATAFRAME\n", "\n", "Show the first two rows: `a[:2]`\n", "\n", "Boolean selection: `a[a['two']>5]`\n", "\n", "Using the `loc` indexer (the notes originally used `ix`, which was removed in pandas 1.0)\n", "\n", "Row with index 2, columns `'one'` and `'two'`: `a.loc[[2],['one','two']]`\n", "\n", "The row with index 2: `a.loc[2]`\n", "\n", "## Operations between DataFrame and Series\n", "\n", "1. Subtract a Series from every row of a DataFrame\n", "```\n", "a = pd.DataFrame(np.arange(16).reshape(4,4),index=[0,1,2,3],columns=['one', 'two','three','four'])\n", "print(a)\n", "b = Series([0,1,2,3],index=['one','two','three','four'])\n", "print(b)\n", "print(a-b)\n", "```\n", "Output:\n", "```\n", "   one  two  three  four\n", "0    0    1      2     3\n", "1    4    5      6     7\n", "2    8    9     10    11\n", "3   12   13     14    15\n", "one      0\n", "two      1\n", "three    2\n", "four     3\n", "dtype: int64\n", "   one  two  three  four\n", "0    0    0      0     0\n", "1    4    4      4     4\n", "2    8    8      8     8\n", "3   12   12     12    12\n", "```\n", "\n", "## Reading files\n", "\n", "1. CSV files\n", "`pd.read_csv(r\"data/train.csv\")` returns a DataFrame\n", "\n", "## Inspecting a DataFrame\n", "\n", "1. 
`train_data.describe()`\n", "```\n", "       PassengerId    Survived      Pclass         Age       SibSp  \\\n", "count   891.000000  891.000000  891.000000  714.000000  891.000000\n", "mean    446.000000    0.383838    2.308642   29.699118    0.523008\n", "std     257.353842    0.486592    0.836071   14.526497    1.102743\n", "min       1.000000    0.000000    1.000000    0.420000    0.000000\n", "25%     223.500000    0.000000    2.000000   20.125000    0.000000\n", "50%     446.000000    0.000000    3.000000   28.000000    0.000000\n", "75%     668.500000    1.000000    3.000000   38.000000    1.000000\n", "max     891.000000    1.000000    3.000000   80.000000    8.000000\n", "```\n", "\n", "## Locating a column and replacing values\n", "\n", "`df.loc[df.Age.isnull(),'Age'] = 23 # fill the missing entries of the 'Age' column with 23`\n", "\n", "## Converting categorical variables to indicator variables with `get_dummies()`\n", "\n", "```\n", "s = pd.Series(list('abca'))\n", "pd.get_dummies(s)\n", "```\n", "Output (pandas 2.0 and later return `True`/`False` instead of 1/0 unless `dtype=int` is passed):\n", "```\n", "   a  b  c\n", "0  1  0  0\n", "1  0  1  0\n", "2  0  0  1\n", "3  1  0  0\n", "```\n", "\n", "## Converting between list and string\n", "\n", "string to list\n", "```\n", ">>> s = 'abcde'\n", ">>> chars = list(s)\n", ">>> chars\n", "['a', 'b', 'c', 'd', 'e']\n", "```\n", "list to string\n", "```\n", ">>> str_convert = ','.join(chars)\n", ">>> str_convert\n", "'a,b,c,d,e'\n", "```\n", "\n", "## Dropping the old index and renumbering from 0 to n\n", "\n", "`x = x.reset_index(drop=True)`\n", "\n", "## The apply function\n", "\n", "`DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, …)` (signature as in older pandas versions)\n", "\n", "`df.apply(numpy.sqrt) # returns DataFrame`\n", "\n", "This is equivalent to `df.apply(lambda x : numpy.sqrt(x))`; the lambda form is more flexible\n", "\n", "`df.apply(numpy.sum, axis=0) # equiv to df.sum(0)`\n", "\n", "`df.apply(numpy.sum, axis=1) # equiv to df.sum(1)`\n", "\n", "## The `re.search().group()` function\n", "\n", "`re.search(pattern, string, flags=0)`\n", "\n", "`group(num=0)` returns the matched text; the default 0 is the whole match, and passing several group numbers, e.g. `group(0,1)`, returns a tuple\n", "\n", "## The pandas.cut() function\n", "\n", "`pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)`\n", "\n", "- x is a one-dimensional array\n", "\n", "- bins can be an int or a sequence\n", "\n", "  - if an int, the range of x is split into that many equal-width bins\n", "\n", "  - if a sequence, it gives the bin edges you specify\n", "\n", "- right: whether each bin includes its right edge; defaults to True\n", 
"\n", "- labels: an array or a boolean\n", "\n", "  - if an array, its length must match the number of resulting bins\n", "  - if the boolean False, only the integer codes of the bins are returned\n", "\n", "e.g. `pd.cut(full['FamilySize'], bins=[0,1,4,20], labels=[0,1,2])`\n", "\n", "## Appending a row\n", "\n", "Define an empty DataFrame: `data_process = pd.DataFrame(columns=['route','date','1','2','3','4','5','6','7','8','9','10','11','12'])`\n", "\n", "Define a new row: `new = pd.DataFrame(columns=['route','date','1','2','3','4','5','6','7','8','9','10','11','12'],index=[j])`\n", "\n", "The `index` here can be anything; set it explicitly if you want a specific one\n", "\n", "Append: `data_process = pd.concat([data_process, new], ignore_index=True)` (in pandas before 2.0, `data_process.append(new, ignore_index=True)` also works); note that the result is a new object and must be assigned back to `data_process`" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "336px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }