{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CHAPTER 12 Advanced pandas(高级pandas用法)\n",
"\n",
"# 12.1 Categorical Data(类别数据)\n",
"\n",
"这一届会介绍pandas的Categorical类型。\n",
"\n",
"# 1 Background and Motivation(背景和动力)\n",
"\n",
"表格中的列克可能会有重复的部分。我们可以用unique和value_counts,从一个数组从提取不同的值,并计算频度:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 apple\n",
"1 orange\n",
"2 apple\n",
"3 apple\n",
"4 apple\n",
"5 orange\n",
"6 apple\n",
"7 apple\n",
"dtype: object"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)\n",
"values"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['apple', 'orange'], dtype=object)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.unique(values)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"apple 6\n",
"orange 2\n",
"dtype: int64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.value_counts(values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对于不同的类型数据值,一个更好的方法是用维度表(dimension table)来表示,然后用整数键(integer keys)来指代维度表:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 1\n",
"2 0\n",
"3 0\n",
"4 0\n",
"5 1\n",
"6 0\n",
"7 0\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"values = pd.Series([0, 1, 0, 0] * 2)\n",
"values"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 apple\n",
"1 orange\n",
"dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dim = pd.Series(['apple', 'orange'])\n",
"dim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"用take方法来重新存储原始的,由字符串构成的Series:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 apple\n",
"1 orange\n",
"0 apple\n",
"0 apple\n",
"0 apple\n",
"1 orange\n",
"0 apple\n",
"0 apple\n",
"dtype: object"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dim.take(values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这种用整数表示的方法叫做类别(categorical)或字典编码(dictionary-encoded)表示法。表示不同类别值的数组,被称作类别,字典,或层级。本书中我们将使用类别(categorical and categories)来称呼。表示类别的整数值被叫做,类别编码(category code),或编码(code)。\n",
"\n",
"# 2 Categorical Type in pandas(pandas中的Categorical类型)\n",
"\n",
"pandas中有一个Categorical类型,是用来保存那些基于整数的类别型数据。考虑下面的例子:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"fruits = ['apple', 'orange', 'apple', 'apple'] * 2"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"N = len(fruits)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" basket_id | \n",
" fruit | \n",
" count | \n",
" weight | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" apple | \n",
" 5 | \n",
" 2.255245 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" orange | \n",
" 8 | \n",
" 1.309949 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" apple | \n",
" 6 | \n",
" 2.330312 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" apple | \n",
" 3 | \n",
" 2.927920 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" apple | \n",
" 13 | \n",
" 1.322311 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" orange | \n",
" 10 | \n",
" 0.474809 | \n",
"
\n",
" \n",
" 6 | \n",
" 6 | \n",
" apple | \n",
" 4 | \n",
" 0.827271 | \n",
"
\n",
" \n",
" 7 | \n",
" 7 | \n",
" apple | \n",
" 8 | \n",
" 2.480494 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" basket_id fruit count weight\n",
"0 0 apple 5 2.255245\n",
"1 1 orange 8 1.309949\n",
"2 2 apple 6 2.330312\n",
"3 3 apple 3 2.927920\n",
"4 4 apple 13 1.322311\n",
"5 5 orange 10 0.474809\n",
"6 6 apple 4 0.827271\n",
"7 7 apple 8 2.480494"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'fruit': fruits,\n",
" 'basket_id': np.arange(N),\n",
" 'count': np.random.randint(3, 15, size=N),\n",
" 'weight': np.random.uniform(0, 4, size=N)},\n",
" columns=['basket_id', 'fruit', 'count', 'weight'])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这里,df['fruit']是一个python的字符串对象。我们将其转换为类型对象:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 apple\n",
"1 orange\n",
"2 apple\n",
"3 apple\n",
"4 apple\n",
"5 orange\n",
"6 apple\n",
"7 apple\n",
"Name: fruit, dtype: category\n",
"Categories (2, object): [apple, orange]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fruits_cat = df['fruit'].astype('category')\n",
"fruits_cat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"fruits_cat的值并不是一个numpy数组,而是一个pandas.Categorical实例:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.categorical.Categorical"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c = fruits_cat.values\n",
"type(c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这个Categorical对象有categories和codes属性:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['apple', 'orange'], dtype='object')"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c.categories"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c.codes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以把转换的结果变为DataFrame列:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 apple\n",
"1 orange\n",
"2 apple\n",
"3 apple\n",
"4 apple\n",
"5 orange\n",
"6 apple\n",
"7 apple\n",
"Name: fruit, dtype: category\n",
"Categories (2, object): [apple, orange]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['fruit'] = df['fruit'].astype('category')\n",
"df.fruit"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"也可以直接把其他的python序列变为pandas.Categorical类型:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[foo, bar, baz, foo, bar]\n",
"Categories (3, object): [bar, baz, foo]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])\n",
"my_categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果已经得到了分类编码数据(categorical encoded data),我们可以使用from_codes构造器:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"categories = ['foo', 'bar', 'baz']"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"codes = [0, 1, 2, 0, 0, 1]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[foo, bar, baz, foo, foo, bar]\n",
"Categories (3, object): [foo, bar, baz]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_cats_2 = pd.Categorical.from_codes(codes, categories)\n",
"my_cats_2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"除非明确指定,非常默认类别没有特定的顺序。所以,取决于输入的数据,categories数组可能有不同的顺序。当使用from_codes或其他一些构造器的时候,我们可以指定类别的顺序:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[foo, bar, baz, foo, foo, bar]\n",
"Categories (3, object): [foo < bar < baz]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ordered_cat = pd.Categorical.from_codes(codes, categories, \n",
" ordered=True)\n",
"ordered_cat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"输出的结果中,`[foo < bar < baz]`表示foo在bar之间,以此类推。一个没有顺序的类型实例(unordered categorical instance)可以通过as_ordered来排序:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[foo, bar, baz, foo, foo, bar]\n",
"Categories (3, object): [foo < bar < baz]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_cats_2.as_ordered()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"最后一点需要注意的,类型数据没必要一定是字符串,它可以是任何不可变的值类型\n",
"(any immutable value types)。\n",
"\n",
"# 3 Computations with Categoricals(类型计算)\n",
"\n",
"Categorical类型和其他类型差不多,不过对于某些函数,比如groupby函数,在Categorical数据上会有更好的效果。很多函数可以利用ordered标记。\n",
"\n",
"假设有一些随机的数字,用pandas.quct进行分箱(binning)。得到的类型是pandas.Categorical;虽然之前用到过pandas.cut,但是没有具体介绍里面的细节:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"np.random.seed(12345)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"draws = np.random.randn(1000)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([-0.20470766, 0.47894334, -0.51943872, -0.5557303 , 1.96578057])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"draws[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"计算分箱后的分位数:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.95, -0.684], (-0.0101, 0.63], (0.63, 3.928]]\n",
"Length: 1000\n",
"Categories (4, interval[float64]): [(-2.95, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bins = pd.qcut(draws, 4)\n",
"bins"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"具体分位数并不如季度的名字直观,我们直接在qcut中设定labels:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]\n",
"Length: 1000\n",
"Categories (4, object): [Q1 < Q2 < Q3 < Q4]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])\n",
"bins"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bins.codes[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"bins caetegorical并没有包含边界星系,我们可以用groupby来提取:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"bins = pd.Series(bins, name='quartile')"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quartile | \n",
" count | \n",
" min | \n",
" max | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Q1 | \n",
" 250 | \n",
" -2.949343 | \n",
" -0.685484 | \n",
"
\n",
" \n",
" 1 | \n",
" Q2 | \n",
" 250 | \n",
" -0.683066 | \n",
" -0.010115 | \n",
"
\n",
" \n",
" 2 | \n",
" Q3 | \n",
" 250 | \n",
" -0.010032 | \n",
" 0.628894 | \n",
"
\n",
" \n",
" 3 | \n",
" Q4 | \n",
" 250 | \n",
" 0.634238 | \n",
" 3.927528 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quartile count min max\n",
"0 Q1 250 -2.949343 -0.685484\n",
"1 Q2 250 -0.683066 -0.010115\n",
"2 Q3 250 -0.010032 0.628894\n",
"3 Q4 250 0.634238 3.927528"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = (pd.Series(draws)\n",
" .groupby(bins)\n",
" .agg(['count', 'min', 'max'])\n",
" .reset_index())\n",
"results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"quartile列包含了原始的类别信息,包含bins中的顺序:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 Q1\n",
"1 Q2\n",
"2 Q3\n",
"3 Q4\n",
"Name: quartile, dtype: category\n",
"Categories (4, object): [Q1 < Q2 < Q3 < Q4]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results['quartile']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Better performance with categoricals (使用categoricals得到更好的效果)\n",
"\n",
"使用categorical能让效果提高。如果一个DataFrame的列是categorical类型,使用的时候会减少很多内存的使用。假设我们有一个一千万的元素和一个类别:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"N = 10000000\n",
"draws = pd.Series(np.random.randn(N))\n",
"labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"把labels变为categorical:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"categories = labels.astype('category')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以看到labels会比categories使用更多的内存:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"80000080"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels.memory_usage()"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10000272"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"categories.memory_usage()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"当然,转换成category也是要消耗计算的,不过这种消耗是一次性的:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 303 ms, sys: 70.1 ms, total: 373 ms\n",
"Wall time: 385 ms\n"
]
}
],
"source": [
"%time _ = labels.astype('category')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在categories上使用groupby会非常快,因为用的是基于整数的编码,而不是由字符串组成的数组。\n",
"\n",
"# 4 Categorical Methods(类别方法)\n",
"\n",
"如果是包含categorical数据的Series数据,有和Series.str类似的一些比较特殊的方法。对于访问categories和code很方便:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"s = pd.Series(['a', 'b', 'c', 'd'] * 2)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 a\n",
"1 b\n",
"2 c\n",
"3 d\n",
"4 a\n",
"5 b\n",
"6 c\n",
"7 d\n",
"dtype: category\n",
"Categories (4, object): [a, b, c, d]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_s = s.astype('category')\n",
"cat_s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"属性cat可以访问categorical方法:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 1\n",
"2 2\n",
"3 3\n",
"4 0\n",
"5 1\n",
"6 2\n",
"7 3\n",
"dtype: int8"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_s.cat.codes"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['a', 'b', 'c', 'd'], dtype='object')"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_s.cat.categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"假设我们知道实际的类别超过了当前观测到的四个类别,那么我们可以使用set_categories方法来扩展:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 a\n",
"1 b\n",
"2 c\n",
"3 d\n",
"4 a\n",
"5 b\n",
"6 c\n",
"7 d\n",
"dtype: category\n",
"Categories (5, object): [a, b, c, d, e]"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"actual_categories = ['a', 'b', 'c', 'd', 'e']\n",
"cat_s2 = cat_s.cat.set_categories(actual_categories)\n",
"cat_s2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数据本身似乎没有改变,不过在对其进行操作的时候会反应出来。例如,value_counts:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"d 2\n",
"c 2\n",
"b 2\n",
"a 2\n",
"dtype: int64"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_s.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"d 2\n",
"c 2\n",
"b 2\n",
"a 2\n",
"e 0\n",
"dtype: int64"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_s2.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在大型数据集,categoricals经常用来作为省内存和提高效果的工具。在对一个很大的DataFrame或Series进行过滤后,很多类型可能不会出现在数据中。我们用remove_unused_categories方法来除去没有观测到的类别:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 a\n",
"1 b\n",
"4 a\n",
"5 b\n",
"dtype: category\n",
"Categories (4, object): [a, b, c, d]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_s3 = cat_s[cat_s.isin(['a', 'b'])]\n",
"cat_s3"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 a\n",
"1 b\n",
"4 a\n",
"5 b\n",
"dtype: category\n",
"Categories (2, object): [a, b]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_s3.cat.remove_unused_categories()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面是一些categorical的方法:\n",
"\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/kbedp.png)\n",
"\n",
"### Creating dummy variables for modeling(为建模创建哑变量)\n",
"\n",
"在使用机器学习的一些工具时,经常要转变类型数据为哑变量(dummy variables ),也被称作是独热编码(one-hot encoding)。即在DataFrame中,给一列中不同的类别创建不同的列,用1表示出现,用0表示未出现。\n",
"\n",
"例子:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在第七章也介绍过,pandas.get_dummies函数会把一维的类型数据变为包含哑变量的DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 6 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 7 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c d\n",
"0 1 0 0 0\n",
"1 0 1 0 0\n",
"2 0 0 1 0\n",
"3 0 0 0 1\n",
"4 1 0 0 0\n",
"5 0 1 0 0\n",
"6 0 0 1 0\n",
"7 0 0 0 1"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.get_dummies(cat_s)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [py35]",
"language": "python",
"name": "Python [py35]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}