{ "metadata": { "name": "", "signature": "sha256:01b67015e46902e5ca06db235b12506a30c8388110dbd7ae0c296bd45973e351" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[[<- back to the GitHub Repo](https://github.com/rasbt/pattern_classification)]" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -v -p pandas,numpy,scikit-learn -d" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Sebastian Raschka 08/02/2015 \n", "\n", "CPython 3.4.2\n", "IPython 2.3.1\n", "\n", "pandas 0.15.2\n", "numpy 1.9.1\n", "scikit-learn 0.15.2\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Tips and Tricks for Encoding Categorical Features in Classification Tasks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Features can come in various different flavors. Typically we distinguish between\n", "\n", "- continuous and\n", "- categorical (discrete)\n", "features.\n", "\n", "\n", "And the categorical features can be categorized further into:\n", "\n", "- ordinal and\n", "- nominal (= no order implied) features.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, most implementations of machine learning algorithms require numerical data as input, and we have to prepare our data accordingly. This notebook contains some useful tips for how to encode categorical features using Python pandas and scikit-learn." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Sections" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Example Data](#Example-Data)\n", "- [Class Labels](#Class-Labels)\n", "- [Ordinal Features](#Ordinal-Features)\n", "- [Nominal Features](#Nominal-Features)\n", "- [Inverse Mapping](#Inverse-Mapping)\n", "- [Using scikit-learn and pandas features](#Using-scikit-learn-and-pandas-features)\n", " - [scikit LabelEncoder](#scikit-LabelEncoder)\n", " - [scikit DictVectorizer](#scikit-DictVectorizer)\n", " - [scikit OneHotEncoder](#scikit-OneHotEncoder)\n", " - [pandas get_dummies](#pandas-get_dummies)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let us create a simple example dataset with 3 different kinds of features:\n", "- color: nominal\n", "- size: ordinal\n", "- prize: continuous" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "df = pd.DataFrame([\n", " ['green', 'M', 10.1, 'class1'], \n", " ['red', 'L', 13.5, 'class2'], \n", " ['blue', 'XL', 15.3, 'class1']])\n", "\n", "df.columns = ['color', 'size', 'prize', 'class label']\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorsizeprizeclass label
0 green M 10.1 class1
1 red L 13.5 class2
2 blue XL 15.3 class1
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ " color size prize class label\n", "0 green M 10.1 class1\n", "1 red L 13.5 class2\n", "2 blue XL 15.3 class1" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Class Labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Typical\" machine learning algorithms handle class labels with \"no order implied\" - unless we use a ranking classifier (e.g., SVM-rank). Thus, it is save to use a simple set-item-enumeration to convert the class labels from a string representation into integers." ] }, { "cell_type": "code", "collapsed": false, "input": [ "class_mapping = {label:idx for idx,label in enumerate(set(df['class label']))}\n", "\n", "df['class label'] = df['class label'].map(class_mapping)\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorsizeprizeclass label
0 green M 10.1 0
1 red L 13.5 1
2 blue XL 15.3 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ " color size prize class label\n", "0 green M 10.1 0\n", "1 red L 13.5 1\n", "2 blue XL 15.3 0" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Ordinal Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ordinal features need special attention: We have to make sure that the correct values are associated with the corresponding strings. Thus, we need to set-up an explicit mapping dictionary:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "size_mapping = {\n", " 'XL': 3,\n", " 'L': 2,\n", " 'M': 1}\n", "\n", "df['size'] = df['size'].map(size_mapping)\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorsizeprizeclass label
0 green 1 10.1 0
1 red 2 13.5 1
2 blue 3 15.3 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ " color size prize class label\n", "0 green 1 10.1 0\n", "1 red 2 13.5 1\n", "2 blue 3 15.3 0" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Nominal Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, we can't simply apply the same mapping scheme to the `color` column that we used for the `size`-mapping above. However, we can use another simply trick and convert the \"colors\" into binary features: Each possible color value becomes a feature column itself (with values 1 or 0)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "color_mapping = {\n", " 'green': (0,0,1),\n", " 'red': (0,1,0),\n", " 'blue': (1,0,0)}\n", "\n", "df['color'] = df['color'].map(color_mapping)\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorsizeprizeclass label
0 (0, 0, 1) 1 10.1 0
1 (0, 1, 0) 2 13.5 1
2 (1, 0, 0) 3 15.3 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ " color size prize class label\n", "0 (0, 0, 1) 1 10.1 0\n", "1 (0, 1, 0) 2 13.5 1\n", "2 (1, 0, 0) 3 15.3 0" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "y = df['class label'].values\n", "X = df.iloc[:, :-1].values\n", "X = np.apply_along_axis(func1d= lambda x: np.array(list(x[0]) + list(x[1:])), axis=1, arr=X)\n", "\n", "print('Class labels:', y)\n", "print('\\nFeatures:\\n', X)\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Class labels: [0 1 0]\n", "\n", "Features:\n", " [[ 0. 0. 1. 1. 10.1]\n", " [ 0. 1. 0. 2. 13.5]\n", " [ 1. 0. 0. 3. 15.3]]\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Inverse Mapping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to convert the features back into its original representation, we can simply do so my using inverted mapping dictionaries:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "inv_color_mapping = {v: k for k, v in color_mapping.items()}\n", "inv_size_mapping = {v: k for k, v in size_mapping.items()}\n", "inv_class_mapping = {v: k for k, v in class_mapping.items()}\n", "\n", "df['color'] = df['color'].map(inv_color_mapping)\n", "df['size'] = df['size'].map(inv_size_mapping)\n", "df['class label'] = df['class label'].map(inv_class_mapping)\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorsizeprizeclass label
0 green M 10.1 class1
1 red L 13.5 class2
2 blue XL 15.3 class1
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ " color size prize class label\n", "0 green M 10.1 class1\n", "1 red L 13.5 class2\n", "2 blue XL 15.3 class1" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Using scikit-learn and pandas features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The scikit-learn maching library comes with many useful preprocessing functions that we can use for our convenience." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "scikit LabelEncoder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "class_le = LabelEncoder()\n", "df['class label'] = class_le.fit_transform(df['class label'])\n", "\n", "size_mapping = {\n", " 'XL': 3,\n", " 'L': 2,\n", " 'M': 1}\n", "\n", "df['size'] = df['size'].map(size_mapping)\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorsizeprizeclass label
0 green 1 10.1 0
1 red 2 13.5 1
2 blue 3 15.3 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ " color size prize class label\n", "0 green 1 10.1 0\n", "1 red 2 13.5 1\n", "2 blue 3 15.3 0" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class labels can be converted back from integer to string via the `inverse_transform` method:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "class_le.inverse_transform(df['class label'])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "array(['class1', 'class2', 'class1'], dtype=object)" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "scikit DictVectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [`DictVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) is another handy tool for feature extraction. The `DictVectorizer` takes a list of dictionary entries (feature-value mappings) and transforms it to vectors. The expected input looks like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.transpose().to_dict().values()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "dict_values([{'class label': 0, 'color': 'green', 'size': 1, 'prize': 10.1}, {'class label': 1, 'color': 'red', 'size': 2, 'prize': 13.5}, {'class label': 0, 'color': 'blue', 'size': 3, 'prize': 15.3}])" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the dictionary keys in each row represent the feature column labels. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can use the `DictVectorizer` to turn this\n", "mapping into a matrix:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.feature_extraction import DictVectorizer\n", "dvec = DictVectorizer(sparse=False)\n", "\n", "X = dvec.fit_transform(df.transpose().to_dict().values())\n", "X" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "array([[ 0. , 0. , 1. , 0. , 10.1, 1. ],\n", " [ 1. , 0. , 0. , 1. , 13.5, 2. ],\n", " [ 0. , 1. , 0. , 0. , 15.3, 3. ]])" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see in the array above, the columns were reordered during the conversion (due to the hash mapping when we used the dictionary). However, we can simply add back the column names via the `get_feature_names` function." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.DataFrame(X, columns=dvec.get_feature_names())" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
class labelcolor=bluecolor=greencolor=redprizesize
0 0 0 1 0 10.1 1
1 1 0 0 1 13.5 2
2 0 1 0 0 15.3 3
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ " class label color=blue color=green color=red prize size\n", "0 0 0 1 0 10.1 1\n", "1 1 0 0 1 13.5 2\n", "2 0 1 0 0 15.3 3" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "scikit OneHotEncoder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another useful tool in scikit-learn is the [`OneHotEncoder`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). The idea is the same as in the `DictVectorizer` example above; the only difference is that the `OneHotEncoder` takes integer columns as input. Here we are an `LabelEncoder`, we use the `LabelEncoder` first, to prepare the `color` column before we use the `OneHotEncoder`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "color_le = LabelEncoder()\n", "df['color'] = color_le.fit_transform(df['color'])\n", "\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorsizeprizeclass label
0 1 1 10.1 0
1 2 2 13.5 1
2 0 3 15.3 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " color size prize class label\n", "0 1 1 10.1 0\n", "1 2 2 13.5 1\n", "2 0 3 15.3 0" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.preprocessing import OneHotEncoder\n", "ohe = OneHotEncoder(sparse=False)\n", "\n", "X = ohe.fit_transform(df[['color']].values)\n", "X" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "array([[ 0., 1., 0.],\n", " [ 0., 0., 1.],\n", " [ 1., 0., 0.]])" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "pandas get_dummies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, pandas comes with a convenience function to create new categories for nominal features, namely: [`get_dummies`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html).\n", "But first, let us quickly regenerate a fresh example `DataFrame` where the size and class label columns are already taken care of." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "df = pd.DataFrame([\n", " ['green', 'M', 10.1, 'class1'], \n", " ['red', 'L', 13.5, 'class2'], \n", " ['blue', 'XL', 15.3, 'class1']])\n", "\n", "df.columns = ['color', 'size', 'prize', 'class label']\n", "\n", "size_mapping = {\n", " 'XL': 3,\n", " 'L': 2,\n", " 'M': 1}\n", "df['size'] = df['size'].map(size_mapping)\n", "\n", "class_mapping = {label:idx for idx,label in enumerate(set(df['class label']))}\n", "df['class label'] = df['class label'].map(class_mapping)\n", "\n", "\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorsizeprizeclass label
0 green 1 10.1 0
1 red 2 13.5 1
2 blue 3 15.3 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ " color size prize class label\n", "0 green 1 10.1 0\n", "1 red 2 13.5 1\n", "2 blue 3 15.3 0" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `get_dummies` will create a new column for every unique string in a certain column:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.get_dummies(df)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sizeprizeclass labelcolor_bluecolor_greencolor_red
0 1 10.1 0 0 1 0
1 2 13.5 1 0 0 1
2 3 15.3 0 1 0 0
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ " size prize class label color_blue color_green color_red\n", "0 1 10.1 0 0 1 0\n", "1 2 13.5 1 0 0 1\n", "2 3 15.3 0 1 0 0" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the `get_dummies` function leaves the numeric columns untouched, how convenient!" ] } ], "metadata": {} } ] }