{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 强大的pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. pandas简介"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#公认的pandas导包方式\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. pandas数据结构介绍"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas有两个常用的数据结构：Series和DataFrame，他们为很多问题提供了解决模板"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.1 Series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Series是一种一维的数组型对象，它包含了一个值序列，并且包含了数据标签，称为索引，最简单的序列可以仅仅由一个数组形成"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    4\n",
      "1    7\n",
      "2   -5\n",
      "3    3\n",
      "dtype: int64\n",
      "a    4\n",
      "b    7\n",
      "c   -5\n",
      "d    3\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "obj = pd.Series([4,7,-5,3])\n",
    "print(obj)#输出的数据中左边为索引，右边为数据\n",
    "obj1 = pd.Series([4,7,-5,3], index=['a', 'b', 'c', 'd'])#指定索引\n",
    "print(obj1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在Series中你可以使用索引进行取值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7\n",
      "4\n",
      "a    4\n",
      "b    7\n",
      "d    3\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(obj[1])\n",
    "print(obj1['a'])\n",
    "print(obj1[['a', 'b','d']])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在Series中也可以进行数学操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a    4\n",
      "b    7\n",
      "d    3\n",
      "dtype: int64\n",
      "a     8\n",
      "b    14\n",
      "c   -10\n",
      "d     6\n",
      "dtype: int64\n",
      "a      54.598150\n",
      "b    1096.633158\n",
      "c       0.006738\n",
      "d      20.085537\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "print(obj1[obj1>0])\n",
    "print(obj1*2)\n",
    "print(np.exp(obj1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果你已经有数据包含在python的字典中，可以使用字典直接生成Series："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ohio      35000\n",
      "Texas     71000\n",
      "Oregon    16000\n",
      "Utah       5000\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}\n",
    "obj2 = pd.Series(sdata)\n",
    "print(obj2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.2 DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DataFrame表示的是矩阵的数组表，它包含以排序的列集合，每一列可以是不同的值类型(数组/字符串/布尔值等)，它可以被视为一个共享相同索引的Series的字典。\n",
    "\n",
    "有多种方式可以构建DataFrame，其中最常用的方式是利用包含等长度列表或Numpy数组的字典来形成DataFrame。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    state  year  pop\n",
      "0    Ohio  2000  1.5\n",
      "1    Ohio  2001  1.7\n",
      "2    Ohio  2002  3.6\n",
      "3  Nevada  2001  2.4\n",
      "4  Nevada  2002  2.9\n",
      "5  Nevada  2003  3.2\n"
     ]
    }
   ],
   "source": [
    "data={'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],\n",
    "     'year':[2000,2001,2002,2001,2002,2003],\n",
    "     'pop':[1.5,1.7,3.6,2.4,2.9,3.2]}\n",
    "frame = pd.DataFrame(data)\n",
    "print(frame)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    state  year  pop\n",
      "0    Ohio  2000  1.5\n",
      "1    Ohio  2001  1.7\n",
      "2    Ohio  2002  3.6\n",
      "3  Nevada  2001  2.4\n",
      "4  Nevada  2002  2.9\n"
     ]
    }
   ],
   "source": [
    "print(frame.head())#展示数据的前五列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   year   state  pop\n",
      "0  2000    Ohio  1.5\n",
      "1  2001    Ohio  1.7\n",
      "2  2002    Ohio  3.6\n",
      "3  2001  Nevada  2.4\n",
      "4  2002  Nevada  2.9\n",
      "5  2003  Nevada  3.2\n"
     ]
    }
   ],
   "source": [
    "#指定列的顺序\n",
    "print(pd.DataFrame(data, columns=['year', 'state', 'pop']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       year   state  pop debt\n",
      "one    2000    Ohio  1.5  NaN\n",
      "two    2001    Ohio  1.7  NaN\n",
      "three  2002    Ohio  3.6  NaN\n",
      "four   2001  Nevada  2.4  NaN\n",
      "five   2002  Nevada  2.9  NaN\n",
      "six    2003  Nevada  3.2  NaN\n"
     ]
    }
   ],
   "source": [
    "#如果指定的类不包含在字典当中，那么会出现缺失值\n",
    "frame1 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],\n",
    "                  index=['one', 'two', 'three', 'four', 'five', 'six'])\n",
    "print(frame1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['year', 'state', 'pop', 'debt'], dtype='object')\n"
     ]
    }
   ],
   "source": [
    "print(frame1.columns)#显示数据的列索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "one        Ohio\n",
      "two        Ohio\n",
      "three      Ohio\n",
      "four     Nevada\n",
      "five     Nevada\n",
      "six      Nevada\n",
      "Name: state, dtype: object\n"
     ]
    }
   ],
   "source": [
    "#也可以像字典类型通过索引取某一列的值\n",
    "print(frame1['state'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       year   state  pop  debt\n",
      "one    2000    Ohio  1.5  16.5\n",
      "two    2001    Ohio  1.7  16.5\n",
      "three  2002    Ohio  3.6  16.5\n",
      "four   2001  Nevada  2.4  16.5\n",
      "five   2002  Nevada  2.9  16.5\n",
      "six    2003  Nevada  3.2  16.5\n"
     ]
    }
   ],
   "source": [
    "#列的引用是可以修改的，可以将空的debt的值进行修改\n",
    "frame1['debt']=16.5\n",
    "print(frame1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       year   state  pop  debt\n",
      "one    2000    Ohio  1.5   0.0\n",
      "two    2001    Ohio  1.7   1.0\n",
      "three  2002    Ohio  3.6   2.0\n",
      "four   2001  Nevada  2.4   3.0\n",
      "five   2002  Nevada  2.9   4.0\n",
      "six    2003  Nevada  3.2   5.0\n"
     ]
    }
   ],
   "source": [
    "#还可以用np中的arange进行赋值\n",
    "frame1['debt']=np.arange(6.)\n",
    "print(frame1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        one   two three    four    five     six\n",
      "year   2000  2001  2002    2001    2002    2003\n",
      "state  Ohio  Ohio  Ohio  Nevada  Nevada  Nevada\n",
      "pop     1.5   1.7   3.6     2.4     2.9     3.2\n",
      "debt      0     1     2       3       4       5\n"
     ]
    }
   ],
   "source": [
    "#还可以对DataFrame数据进行转置操作\n",
    "print(frame1.T)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3 索引对象"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas中的索引对象是用于存储轴标签和其他元数据的。在构造Series或DataFrame时，你所使用的任意数组或标签序列都可以在内部转换为索引对象。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['a', 'b', 'c'], dtype='object')\n"
     ]
    }
   ],
   "source": [
    "obj = pd.Series(range(3), index=['a','b','c'])\n",
    "index = obj.index\n",
    "print(index)#索引对象是不可变的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Int64Index([0, 1, 2], dtype='int64')\n",
      "0    1.5\n",
      "1   -2.5\n",
      "2    0.0\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "labels = pd.Index(np.arange(3))\n",
    "print(labels)\n",
    "obj2 = pd.Series([1.5,-2.5,0], index=labels)\n",
    "print(obj2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. 基本功能"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3.1 重建索引"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "reindex是pandas对象的重要方法，该方法用于创建一个符合新索引的新对象。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "d    4.5\n",
      "b    7.2\n",
      "a   -5.3\n",
      "c    3.6\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "obj = pd.Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])\n",
    "print(obj)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a   -5.3\n",
      "b    7.2\n",
      "c    3.6\n",
      "d    4.5\n",
      "e    NaN\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "obj2 = obj.reindex(['a','b','c','d','e'])#如果原本索引值不存在会引入缺失值\n",
    "print(obj2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于一些顺序数据，比如时间序列，在重建索引时可能需要进行插值或填值，我们可以使用`ffill`方法进行插值，该方法会将上面的值传递到下面"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0      blue\n",
      "2    purple\n",
      "4    yellow\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])\n",
    "print(obj3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0      blue\n",
      "1      blue\n",
      "2    purple\n",
      "3    purple\n",
      "4    yellow\n",
      "5    yellow\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "obj4 = obj3.reindex(range(6), method='ffill')\n",
    "print(obj4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   A\n",
      "1  1\n",
      "2  2\n",
      "3  3\n",
      "   A\n",
      "3  1\n",
      "1  2\n",
      "2  3\n"
     ]
    }
   ],
   "source": [
    "df1 = pd.DataFrame({'A':[1,2,3]}, index=[1,2,3])\n",
    "df2 = pd.DataFrame({'A':[1,2,3]}, index=[3,1,2])\n",
    "print(df1)\n",
    "print(df2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   A\n",
      "1 -1\n",
      "2 -1\n",
      "3  2\n"
     ]
    }
   ],
   "source": [
    "print(df1-df2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}