{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CHAPTER 12 Advanced pandas（高级pandas用法）\n",
    "\n",
    "# 12.1 Categorical Data（类别数据）\n",
    "\n",
    "这一届会介绍pandas的Categorical类型。\n",
    "\n",
    "# 1 Background and Motivation（背景和动力）\n",
    "\n",
    "表格中的列克可能会有重复的部分。我们可以用unique和value_counts，从一个数组从提取不同的值，并计算频度："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     apple\n",
       "1    orange\n",
       "2     apple\n",
       "3     apple\n",
       "4     apple\n",
       "5    orange\n",
       "6     apple\n",
       "7     apple\n",
       "dtype: object"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)\n",
    "values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['apple', 'orange'], dtype=object)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.unique(values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "apple     6\n",
       "orange    2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.value_counts(values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于不同的类型数据值，一个更好的方法是用维度表（dimension table）来表示，然后用整数键（integer keys）来指代维度表："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0\n",
       "1    1\n",
       "2    0\n",
       "3    0\n",
       "4    0\n",
       "5    1\n",
       "6    0\n",
       "7    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "values = pd.Series([0, 1, 0, 0] * 2)\n",
    "values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     apple\n",
       "1    orange\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dim = pd.Series(['apple', 'orange'])\n",
    "dim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用take方法来重新存储原始的，由字符串构成的Series："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     apple\n",
       "1    orange\n",
       "0     apple\n",
       "0     apple\n",
       "0     apple\n",
       "1    orange\n",
       "0     apple\n",
       "0     apple\n",
       "dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dim.take(values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种用整数表示的方法叫做类别（categorical）或字典编码（dictionary-encoded）表示法。表示不同类别值的数组，被称作类别，字典，或层级。本书中我们将使用类别（categorical and categories）来称呼。表示类别的整数值被叫做，类别编码（category code），或编码（code）。\n",
    "\n",
    "# 2 Categorical Type in pandas（pandas中的Categorical类型）\n",
    "\n",
    "pandas中有一个Categorical类型，是用来保存那些基于整数的类别型数据。考虑下面的例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "fruits = ['apple', 'orange', 'apple', 'apple'] * 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "N = len(fruits)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>basket_id</th>\n",
       "      <th>fruit</th>\n",
       "      <th>count</th>\n",
       "      <th>weight</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>apple</td>\n",
       "      <td>5</td>\n",
       "      <td>2.255245</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>orange</td>\n",
       "      <td>8</td>\n",
       "      <td>1.309949</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>apple</td>\n",
       "      <td>6</td>\n",
       "      <td>2.330312</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>apple</td>\n",
       "      <td>3</td>\n",
       "      <td>2.927920</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>apple</td>\n",
       "      <td>13</td>\n",
       "      <td>1.322311</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>orange</td>\n",
       "      <td>10</td>\n",
       "      <td>0.474809</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>apple</td>\n",
       "      <td>4</td>\n",
       "      <td>0.827271</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>apple</td>\n",
       "      <td>8</td>\n",
       "      <td>2.480494</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   basket_id   fruit  count    weight\n",
       "0          0   apple      5  2.255245\n",
       "1          1  orange      8  1.309949\n",
       "2          2   apple      6  2.330312\n",
       "3          3   apple      3  2.927920\n",
       "4          4   apple     13  1.322311\n",
       "5          5  orange     10  0.474809\n",
       "6          6   apple      4  0.827271\n",
       "7          7   apple      8  2.480494"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({'fruit': fruits,\n",
    "                   'basket_id': np.arange(N),\n",
    "                   'count': np.random.randint(3, 15, size=N),\n",
    "                   'weight': np.random.uniform(0, 4, size=N)},\n",
    "                  columns=['basket_id', 'fruit', 'count', 'weight'])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里，df['fruit']是一个python的字符串对象。我们将其转换为类型对象："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     apple\n",
       "1    orange\n",
       "2     apple\n",
       "3     apple\n",
       "4     apple\n",
       "5    orange\n",
       "6     apple\n",
       "7     apple\n",
       "Name: fruit, dtype: category\n",
       "Categories (2, object): [apple, orange]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fruits_cat = df['fruit'].astype('category')\n",
    "fruits_cat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "fruits_cat的值并不是一个numpy数组，而是一个pandas.Categorical实例："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.categorical.Categorical"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c = fruits_cat.values\n",
    "type(c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个Categorical对象有categories和codes属性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['apple', 'orange'], dtype='object')"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.codes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以把转换的结果变为DataFrame列："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     apple\n",
       "1    orange\n",
       "2     apple\n",
       "3     apple\n",
       "4     apple\n",
       "5    orange\n",
       "6     apple\n",
       "7     apple\n",
       "Name: fruit, dtype: category\n",
       "Categories (2, object): [apple, orange]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['fruit'] = df['fruit'].astype('category')\n",
    "df.fruit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "也可以直接把其他的python序列变为pandas.Categorical类型："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[foo, bar, baz, foo, bar]\n",
       "Categories (3, object): [bar, baz, foo]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])\n",
    "my_categories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果已经得到了分类编码数据（categorical encoded data），我们可以使用from_codes构造器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "categories = ['foo', 'bar', 'baz']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "codes = [0, 1, 2, 0, 0, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[foo, bar, baz, foo, foo, bar]\n",
       "Categories (3, object): [foo, bar, baz]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_cats_2 = pd.Categorical.from_codes(codes, categories)\n",
    "my_cats_2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "除非明确指定，非常默认类别没有特定的顺序。所以，取决于输入的数据，categories数组可能有不同的顺序。当使用from_codes或其他一些构造器的时候，我们可以指定类别的顺序："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[foo, bar, baz, foo, foo, bar]\n",
       "Categories (3, object): [foo < bar < baz]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ordered_cat = pd.Categorical.from_codes(codes, categories, \n",
    "                                        ordered=True)\n",
    "ordered_cat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输出的结果中，`[foo < bar < baz]`表示foo在bar之间，以此类推。一个没有顺序的类型实例（unordered categorical instance）可以通过as_ordered来排序："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[foo, bar, baz, foo, foo, bar]\n",
       "Categories (3, object): [foo < bar < baz]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_cats_2.as_ordered()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后一点需要注意的，类型数据没必要一定是字符串，它可以是任何不可变的值类型\n",
    "（any immutable value types）。\n",
    "\n",
    "# 3 Computations with Categoricals（类型计算）\n",
    "\n",
    "Categorical类型和其他类型差不多，不过对于某些函数，比如groupby函数，在Categorical数据上会有更好的效果。很多函数可以利用ordered标记。\n",
    "\n",
    "假设有一些随机的数字，用pandas.quct进行分箱（binning）。得到的类型是pandas.Categorical；虽然之前用到过pandas.cut，但是没有具体介绍里面的细节："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.random.seed(12345)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "draws = np.random.randn(1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "draws[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "计算分箱后的分位数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.95, -0.684], (-0.0101, 0.63], (0.63, 3.928]]\n",
       "Length: 1000\n",
       "Categories (4, interval[float64]): [(-2.95, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins = pd.qcut(draws, 4)\n",
    "bins"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "具体分位数并不如季度的名字直观，我们直接在qcut中设定labels："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]\n",
       "Length: 1000\n",
       "Categories (4, object): [Q1 < Q2 < Q3 < Q4]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])\n",
    "bins"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins.codes[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "bins caetegorical并没有包含边界星系，我们可以用groupby来提取："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bins = pd.Series(bins, name='quartile')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>quartile</th>\n",
       "      <th>count</th>\n",
       "      <th>min</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q1</td>\n",
       "      <td>250</td>\n",
       "      <td>-2.949343</td>\n",
       "      <td>-0.685484</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q2</td>\n",
       "      <td>250</td>\n",
       "      <td>-0.683066</td>\n",
       "      <td>-0.010115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q3</td>\n",
       "      <td>250</td>\n",
       "      <td>-0.010032</td>\n",
       "      <td>0.628894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q4</td>\n",
       "      <td>250</td>\n",
       "      <td>0.634238</td>\n",
       "      <td>3.927528</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  quartile  count       min       max\n",
       "0       Q1    250 -2.949343 -0.685484\n",
       "1       Q2    250 -0.683066 -0.010115\n",
       "2       Q3    250 -0.010032  0.628894\n",
       "3       Q4    250  0.634238  3.927528"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = (pd.Series(draws)\n",
    "           .groupby(bins)\n",
    "           .agg(['count', 'min', 'max'])\n",
    "           .reset_index())\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "quartile列包含了原始的类别信息，包含bins中的顺序："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    Q1\n",
       "1    Q2\n",
       "2    Q3\n",
       "3    Q4\n",
       "Name: quartile, dtype: category\n",
       "Categories (4, object): [Q1 < Q2 < Q3 < Q4]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results['quartile']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Better performance with categoricals （使用categoricals得到更好的效果）\n",
    "\n",
    "使用categorical能让效果提高。如果一个DataFrame的列是categorical类型，使用的时候会减少很多内存的使用。假设我们有一个一千万的元素和一个类别："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "N = 10000000\n",
    "draws = pd.Series(np.random.randn(N))\n",
    "labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "把labels变为categorical："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "categories = labels.astype('category')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到labels会比categories使用更多的内存："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "80000080"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "labels.memory_usage()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10000272"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "categories.memory_usage()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然，转换成category也是要消耗计算的，不过这种消耗是一次性的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 303 ms, sys: 70.1 ms, total: 373 ms\n",
      "Wall time: 385 ms\n"
     ]
    }
   ],
   "source": [
    "%time _ = labels.astype('category')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在categories上使用groupby会非常快，因为用的是基于整数的编码，而不是由字符串组成的数组。\n",
    "\n",
    "# 4 Categorical Methods（类别方法）\n",
    "\n",
    "如果是包含categorical数据的Series数据，有和Series.str类似的一些比较特殊的方法。对于访问categories和code很方便："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "s = pd.Series(['a', 'b', 'c', 'd'] * 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    a\n",
       "1    b\n",
       "2    c\n",
       "3    d\n",
       "4    a\n",
       "5    b\n",
       "6    c\n",
       "7    d\n",
       "dtype: category\n",
       "Categories (4, object): [a, b, c, d]"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_s = s.astype('category')\n",
    "cat_s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "属性cat可以访问categorical方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0\n",
       "1    1\n",
       "2    2\n",
       "3    3\n",
       "4    0\n",
       "5    1\n",
       "6    2\n",
       "7    3\n",
       "dtype: int8"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_s.cat.codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['a', 'b', 'c', 'd'], dtype='object')"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_s.cat.categories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设我们知道实际的类别超过了当前观测到的四个类别，那么我们可以使用set_categories方法来扩展："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    a\n",
       "1    b\n",
       "2    c\n",
       "3    d\n",
       "4    a\n",
       "5    b\n",
       "6    c\n",
       "7    d\n",
       "dtype: category\n",
       "Categories (5, object): [a, b, c, d, e]"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "actual_categories = ['a', 'b', 'c', 'd', 'e']\n",
    "cat_s2 = cat_s.cat.set_categories(actual_categories)\n",
    "cat_s2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据本身似乎没有改变，不过在对其进行操作的时候会反应出来。例如，value_counts："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "d    2\n",
       "c    2\n",
       "b    2\n",
       "a    2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_s.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "d    2\n",
       "c    2\n",
       "b    2\n",
       "a    2\n",
       "e    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_s2.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在大型数据集，categoricals经常用来作为省内存和提高效果的工具。在对一个很大的DataFrame或Series进行过滤后，很多类型可能不会出现在数据中。我们用remove_unused_categories方法来除去没有观测到的类别："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    a\n",
       "1    b\n",
       "4    a\n",
       "5    b\n",
       "dtype: category\n",
       "Categories (4, object): [a, b, c, d]"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_s3 = cat_s[cat_s.isin(['a', 'b'])]\n",
    "cat_s3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    a\n",
       "1    b\n",
       "4    a\n",
       "5    b\n",
       "dtype: category\n",
       "Categories (2, object): [a, b]"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_s3.cat.remove_unused_categories()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面是一些categorical的方法：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/kbedp.png)\n",
    "\n",
    "### Creating dummy variables for modeling（为建模创建哑变量）\n",
    "\n",
    "在使用机器学习的一些工具时，经常要转变类型数据为哑变量（dummy variables ），也被称作是独热编码（one-hot encoding）。即在DataFrame中，给一列中不同的类别创建不同的列，用1表示出现，用0表示未出现。\n",
    "\n",
    "例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在第七章也介绍过，pandas.get_dummies函数会把一维的类型数据变为包含哑变量的DataFrame："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a  b  c  d\n",
       "0  1  0  0  0\n",
       "1  0  1  0  0\n",
       "2  0  0  1  0\n",
       "3  0  0  0  1\n",
       "4  1  0  0  0\n",
       "5  0  1  0  0\n",
       "6  0  0  1  0\n",
       "7  0  0  0  1"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.get_dummies(cat_s)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [py35]",
   "language": "python",
   "name": "Python [py35]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}