{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CHAPTER 12 Advanced pandas(高级pandas用法)\n", "\n", "# 12.1 Categorical Data(类别数据)\n", "\n", "这一届会介绍pandas的Categorical类型。\n", "\n", "# 1 Background and Motivation(背景和动力)\n", "\n", "表格中的列克可能会有重复的部分。我们可以用unique和value_counts,从一个数组从提取不同的值,并计算频度:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 apple\n", "1 orange\n", "2 apple\n", "3 apple\n", "4 apple\n", "5 orange\n", "6 apple\n", "7 apple\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)\n", "values" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['apple', 'orange'], dtype=object)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.unique(values)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "apple 6\n", "orange 2\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.value_counts(values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于不同的类型数据值,一个更好的方法是用维度表(dimension table)来表示,然后用整数键(integer keys)来指代维度表:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 0\n", "3 0\n", "4 0\n", "5 1\n", "6 0\n", "7 0\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "values = pd.Series([0, 1, 0, 0] * 2)\n", "values" ] }, { "cell_type": "code", "execution_count": 7, 
"metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 apple\n", "1 orange\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim = pd.Series(['apple', 'orange'])\n", "dim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "用take方法来重新存储原始的,由字符串构成的Series:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 apple\n", "1 orange\n", "0 apple\n", "0 apple\n", "0 apple\n", "1 orange\n", "0 apple\n", "0 apple\n", "dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim.take(values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这种用整数表示的方法叫做类别(categorical)或字典编码(dictionary-encoded)表示法。表示不同类别值的数组,被称作类别,字典,或层级。本书中我们将使用类别(categorical and categories)来称呼。表示类别的整数值被叫做,类别编码(category code),或编码(code)。\n", "\n", "# 2 Categorical Type in pandas(pandas中的Categorical类型)\n", "\n", "pandas中有一个Categorical类型,是用来保存那些基于整数的类别型数据。考虑下面的例子:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fruits = ['apple', 'orange', 'apple', 'apple'] * 2" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "N = len(fruits)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
basket_idfruitcountweight
00apple52.255245
11orange81.309949
22apple62.330312
33apple32.927920
44apple131.322311
55orange100.474809
66apple40.827271
77apple82.480494
\n", "
" ], "text/plain": [ " basket_id fruit count weight\n", "0 0 apple 5 2.255245\n", "1 1 orange 8 1.309949\n", "2 2 apple 6 2.330312\n", "3 3 apple 3 2.927920\n", "4 4 apple 13 1.322311\n", "5 5 orange 10 0.474809\n", "6 6 apple 4 0.827271\n", "7 7 apple 8 2.480494" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'fruit': fruits,\n", " 'basket_id': np.arange(N),\n", " 'count': np.random.randint(3, 15, size=N),\n", " 'weight': np.random.uniform(0, 4, size=N)},\n", " columns=['basket_id', 'fruit', 'count', 'weight'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里,df['fruit']是一个python的字符串对象。我们将其转换为类型对象:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 apple\n", "1 orange\n", "2 apple\n", "3 apple\n", "4 apple\n", "5 orange\n", "6 apple\n", "7 apple\n", "Name: fruit, dtype: category\n", "Categories (2, object): [apple, orange]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fruits_cat = df['fruit'].astype('category')\n", "fruits_cat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "fruits_cat的值并不是一个numpy数组,而是一个pandas.Categorical实例:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pandas.core.categorical.Categorical" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = fruits_cat.values\n", "type(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这个Categorical对象有categories和codes属性:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['apple', 'orange'], dtype='object')" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c.categories" ] }, { "cell_type": "code", "execution_count": 
18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can convert a DataFrame column to categorical by assigning the converted result back to it:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 apple\n", "1 orange\n", "2 apple\n", "3 apple\n", "4 apple\n", "5 orange\n", "6 apple\n", "7 apple\n", "Name: fruit, dtype: category\n", "Categories (2, object): [apple, orange]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['fruit'] = df['fruit'].astype('category')\n", "df.fruit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also create a pandas.Categorical directly from other types of Python sequences:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[foo, bar, baz, foo, bar]\n", "Categories (3, object): [bar, baz, foo]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])\n", "my_categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you already have categorical encoded data, you can use the from_codes constructor:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "categories = ['foo', 'bar', 'baz']" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "codes = [0, 1, 2, 0, 0, 1]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[foo, bar, baz, foo, foo, bar]\n", "Categories (3, object): [foo, bar, baz]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_cats_2 = 
pd.Categorical.from_codes(codes, categories)\n", "my_cats_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unless explicitly specified, categorical conversions assume no specific ordering of the categories. So the categories array may be in a different order depending on the ordering of the input data. When using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[foo, bar, baz, foo, foo, bar]\n", "Categories (3, object): [foo < bar < baz]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ordered_cat = pd.Categorical.from_codes(codes, categories, \n", " ordered=True)\n", "ordered_cat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the output, `[foo < bar < baz]` indicates that foo precedes bar in the ordering, and so on. An unordered categorical instance can be made ordered with as_ordered:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[foo, bar, baz, foo, foo, bar]\n", "Categories (3, object): [foo < bar < baz]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_cats_2.as_ordered()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a last note, categorical data need not be strings; the categories can consist of any immutable value types.\n", "\n", "# 3 Computations with Categoricals\n", "\n", "A Categorical generally behaves the same way as the non-encoded version (such as an array of strings). Some parts of pandas, like the groupby function, perform better when working with categoricals. There are also some functions that can use the ordered flag.\n", "\n", "Consider some random numeric data binned with the pandas.qcut function. The result is a pandas.Categorical; pandas.cut was used earlier in the book, but the details of how categoricals work were glossed over:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.random.seed(12345)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "draws = np.random.randn(1000)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([-0.20470766, 0.47894334, -0.51943872, -0.5557303 , 1.96578057])" ] }, 
"execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "draws[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "计算分箱后的分位数:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.95, -0.684], (-0.0101, 0.63], (0.63, 3.928]]\n", "Length: 1000\n", "Categories (4, interval[float64]): [(-2.95, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bins = pd.qcut(draws, 4)\n", "bins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "具体分位数并不如季度的名字直观,我们直接在qcut中设定labels:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]\n", "Length: 1000\n", "Categories (4, object): [Q1 < Q2 < Q3 < Q4]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])\n", "bins" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bins.codes[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "bins caetegorical并没有包含边界星系,我们可以用groupby来提取:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "bins = pd.Series(bins, name='quartile')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
quartilecountminmax
0Q1250-2.949343-0.685484
1Q2250-0.683066-0.010115
2Q3250-0.0100320.628894
3Q42500.6342383.927528
\n", "
" ], "text/plain": [ " quartile count min max\n", "0 Q1 250 -2.949343 -0.685484\n", "1 Q2 250 -0.683066 -0.010115\n", "2 Q3 250 -0.010032 0.628894\n", "3 Q4 250 0.634238 3.927528" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = (pd.Series(draws)\n", " .groupby(bins)\n", " .agg(['count', 'min', 'max'])\n", " .reset_index())\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "quartile列包含了原始的类别信息,包含bins中的顺序:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Q1\n", "1 Q2\n", "2 Q3\n", "3 Q4\n", "Name: quartile, dtype: category\n", "Categories (4, object): [Q1 < Q2 < Q3 < Q4]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results['quartile']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Better performance with categoricals (使用categoricals得到更好的效果)\n", "\n", "使用categorical能让效果提高。如果一个DataFrame的列是categorical类型,使用的时候会减少很多内存的使用。假设我们有一个一千万的元素和一个类别:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "N = 10000000\n", "draws = pd.Series(np.random.randn(N))\n", "labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "把labels变为categorical:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "categories = labels.astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以看到labels会比categories使用更多的内存:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "80000080" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels.memory_usage()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": 
[ { "data": { "text/plain": [ "10000272" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categories.memory_usage()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "当然,转换成category也是要消耗计算的,不过这种消耗是一次性的:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 303 ms, sys: 70.1 ms, total: 373 ms\n", "Wall time: 385 ms\n" ] } ], "source": [ "%time _ = labels.astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在categories上使用groupby会非常快,因为用的是基于整数的编码,而不是由字符串组成的数组。\n", "\n", "# 4 Categorical Methods(类别方法)\n", "\n", "如果是包含categorical数据的Series数据,有和Series.str类似的一些比较特殊的方法。对于访问categories和code很方便:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s = pd.Series(['a', 'b', 'c', 'd'] * 2)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 d\n", "4 a\n", "5 b\n", "6 c\n", "7 d\n", "dtype: category\n", "Categories (4, object): [a, b, c, d]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_s = s.astype('category')\n", "cat_s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "属性cat可以访问categorical方法:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 0\n", "5 1\n", "6 2\n", "7 3\n", "dtype: int8" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_s.cat.codes" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['a', 'b', 'c', 'd'], dtype='object')" ] }, "execution_count": 44, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "cat_s.cat.categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "假设我们知道实际的类别超过了当前观测到的四个类别,那么我们可以使用set_categories方法来扩展:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 d\n", "4 a\n", "5 b\n", "6 c\n", "7 d\n", "dtype: category\n", "Categories (5, object): [a, b, c, d, e]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "actual_categories = ['a', 'b', 'c', 'd', 'e']\n", "cat_s2 = cat_s.cat.set_categories(actual_categories)\n", "cat_s2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "数据本身似乎没有改变,不过在对其进行操作的时候会反应出来。例如,value_counts:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 2\n", "c 2\n", "b 2\n", "a 2\n", "dtype: int64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_s.value_counts()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 2\n", "c 2\n", "b 2\n", "a 2\n", "e 0\n", "dtype: int64" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_s2.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在大型数据集,categoricals经常用来作为省内存和提高效果的工具。在对一个很大的DataFrame或Series进行过滤后,很多类型可能不会出现在数据中。我们用remove_unused_categories方法来除去没有观测到的类别:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "4 a\n", "5 b\n", "dtype: category\n", "Categories (4, object): [a, b, c, d]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_s3 = cat_s[cat_s.isin(['a', 'b'])]\n", "cat_s3" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, 
"outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "4 a\n", "5 b\n", "dtype: category\n", "Categories (2, object): [a, b]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_s3.cat.remove_unused_categories()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下面是一些categorical的方法:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/kbedp.png)\n", "\n", "### Creating dummy variables for modeling(为建模创建哑变量)\n", "\n", "在使用机器学习的一些工具时,经常要转变类型数据为哑变量(dummy variables ),也被称作是独热编码(one-hot encoding)。即在DataFrame中,给一列中不同的类别创建不同的列,用1表示出现,用0表示未出现。\n", "\n", "例子:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在第七章也介绍过,pandas.get_dummies函数会把一维的类型数据变为包含哑变量的DataFrame:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
01000
10100
20010
30001
41000
50100
60010
70001
\n", "
" ], "text/plain": [ " a b c d\n", "0 1 0 0 0\n", "1 0 1 0 0\n", "2 0 0 1 0\n", "3 0 0 0 1\n", "4 1 0 0 0\n", "5 0 1 0 0\n", "6 0 0 1 0\n", "7 0 0 0 1" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(cat_s)" ] } ], "metadata": { "kernelspec": { "display_name": "Python [py35]", "language": "python", "name": "Python [py35]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }