{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OrdinalEncoder\n", "The OrdinalEncoder() replaces the labels of a categorical variable with integers, from 0 to n-1, where n is the number of distinct labels. \n", "\n", "If we select \"arbitrary\", the encoder assigns numbers in the order in which the labels appear in the variable (first come, first served).\n", "\n", "If we select \"ordered\", the encoder assigns numbers following the mean of the target for each label: the label with the smallest target mean gets 0, and the label with the highest target mean gets n-1." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from feature_engine.encoding import OrdinalEncoder" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load titanic dataset from OpenML\n", "\n", "def load_titanic():\n", "    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "    # missing values are stored as '?' in this file\n", "    data = data.replace('?', np.nan)\n", "    # keep only the first letter of the cabin; missing cabins become 'n' (from 'nan')\n", "    data['cabin'] = data['cabin'].astype(str).str[0]\n", "    data['pclass'] = data['pclass'].astype('O')\n", "    data['age'] = data['age'].astype('float')\n", "    data['fare'] = data['fare'].astype('float')\n", "    # assign the result instead of using inplace=True on a column selection\n", "    data['embarked'] = data['embarked'].fillna('C')\n", "    data = data.drop(labels=['boat', 'body', 'home.dest'], axis=1)\n", "    return data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>pclass</th><th>survived</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>ticket</th><th>fare</th><th>cabin</th><th>embarked</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>1</td><td>Allen, Miss. Elisabeth Walton</td><td>female</td><td>29.0000</td><td>0</td><td>0</td><td>24160</td><td>211.3375</td><td>B</td><td>S</td></tr>\n", "<tr><th>1</th><td>1</td><td>1</td><td>Allison, Master. Hudson Trevor</td><td>male</td><td>0.9167</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n", "<tr><th>2</th><td>1</td><td>0</td><td>Allison, Miss. Helen Loraine</td><td>female</td><td>2.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n", "<tr><th>3</th><td>1</td><td>0</td><td>Allison, Mr. Hudson Joshua Creighton</td><td>male</td><td>30.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n", "<tr><th>4</th><td>1</td><td>0</td><td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td><td>female</td><td>25.0000</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C</td><td>S</td></tr>\n", "</tbody>\n", "</table>"
 ], "text/plain": [ " pclass survived name sex \\\n", "0 1 1 Allen, Miss. Elisabeth Walton female \n", "1 1 1 Allison, Master. Hudson Trevor male \n", "2 1 0 Allison, Miss. Helen Loraine female \n", "3 1 0 Allison, Mr. Hudson Joshua Creighton male \n", "4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female \n", "\n", " age sibsp parch ticket fare cabin embarked \n", "0 29.0000 0 0 24160 211.3375 B S \n", "1 0.9167 1 2 113781 151.5500 C S \n", "2 2.0000 1 2 113781 151.5500 C S \n", "3 30.0000 1 2 113781 151.5500 C S \n", "4 25.0000 1 2 113781 151.5500 C S " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = load_titanic()\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "X = data.drop(['survived', 'name', 'ticket'], axis=1)\n", "y = data.survived" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cabin 0\n", "pclass 0\n", "embarked 0\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we will encode the variables below; they have no missing values\n", "X[['cabin', 'pclass', 'embarked']].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cabin object\n", "pclass object\n", "embarked object\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "''' Make sure that the variables are of type object.\n", "If not, cast them to object; otherwise the transformer will either raise an error (if we pass them in the variables argument)\n", "or not pick them up (if we leave variables=None). 
'''\n", "\n", "X[['cabin', 'pclass', 'embarked']].dtypes" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((916, 8), (393, 8))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's separate into training and testing set\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The OrdinalEncoder() replaces categories by ordinal numbers \n", "(0, 1, 2, 3, etc). The numbers can be ordered based on the mean of the target\n", "per category, or assigned arbitrarily.\n", "\n", "Ordered ordinal encoding: for the variable colour, if the mean of the target\n", "for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 1,\n", "red by 2 and grey by 0.\n", "\n", "Arbitrary ordinal encoding: the numbers will be assigned arbitrarily to the\n", "categories, on a first seen first served basis.\n", "\n", "The encoder will encode only categorical variables (type 'object'). A list\n", "of variables can be passed as an argument. 
If no variables are passed, the\n", "encoder will find and encode all categorical variables (type 'object').\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ordered" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OrdinalEncoder(variables=['pclass', 'cabin', 'embarked'])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we will encode 3 variables:\n", "'''\n", "Parameters\n", "----------\n", "\n", "encoding_method : str, default='ordered' \n", " Desired method of encoding.\n", "\n", " 'ordered': the categories are numbered in ascending order according to\n", " the target mean value per category.\n", "\n", " 'arbitrary' : categories are numbered arbitrarily.\n", " \n", "variables : list, default=None\n", " The list of categorical variables that will be encoded. If None, the \n", " encoder will find and select all object type variables.\n", "'''\n", "ordinal_enc = OrdinalEncoder(encoding_method='ordered',\n", " variables=['pclass', 'cabin', 'embarked'])\n", "\n", "# for this encoder, we need to pass the target as argument\n", "# if encoding_method='ordered'\n", "ordinal_enc.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'pclass': {3: 0, 2: 1, 1: 2},\n", " 'cabin': {'T': 0,\n", " 'n': 1,\n", " 'G': 2,\n", " 'A': 3,\n", " 'C': 4,\n", " 'F': 5,\n", " 'D': 6,\n", " 'E': 7,\n", " 'B': 8},\n", " 'embarked': {'S': 0, 'Q': 1, 'C': 2}}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ordinal_enc.encoder_dict_" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr></thead>\n", "<tbody>\n", "<tr><th>271</th><td>2</td><td>male</td><td>24.0</td><td>1</td><td>0</td><td>82.2667</td><td>8</td><td>0</td></tr>\n", "<tr><th>61</th><td>2</td><td>female</td><td>76.0</td><td>1</td><td>0</td><td>78.8500</td><td>4</td><td>0</td></tr>\n", "<tr><th>1280</th><td>0</td><td>male</td><td>22.0</td><td>0</td><td>0</td><td>7.8958</td><td>1</td><td>0</td></tr>\n", "<tr><th>247</th><td>2</td><td>female</td><td>54.0</td><td>1</td><td>0</td><td>59.4000</td><td>1</td><td>2</td></tr>\n", "<tr><th>361</th><td>1</td><td>female</td><td>22.0</td><td>1</td><td>1</td><td>29.0000</td><td>1</td><td>0</td></tr>\n", "</tbody>\n", "</table>"
 ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "271 2 male 24.0 1 0 82.2667 8 0\n", "61 2 female 76.0 1 0 78.8500 4 0\n", "1280 0 male 22.0 0 0 7.8958 1 0\n", "247 2 female 54.0 1 0 59.4000 1 2\n", "361 1 female 22.0 1 1 29.0000 1 0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# transform and visualise the data\n", "\n", "train_t = ordinal_enc.transform(X_train)\n", "test_t = ordinal_enc.transform(X_test)\n", "\n", "test_t.sample(5)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "[Figure: Relationship between pclass and target; x-axis: Pclass, y-axis: Mean of target]" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "''' With encoding_method='ordered', the OrdinalEncoder returns monotonic\n", "variables, that is, encoded variables whose values increase as the mean of the target increases'''\n", "\n", "# let's explore the monotonic relationship\n", "plt.figure(figsize=(7,5))\n", "pd.concat([test_t,y_test], axis=1).groupby(\"pclass\")[\"survived\"].mean().plot()\n", "plt.xticks([0,1,2])\n", "plt.yticks(np.arange(0,1.1,0.1))\n", "plt.title(\"Relationship between pclass and target\")\n", "plt.xlabel(\"Pclass\")\n", "plt.ylabel(\"Mean of target\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Arbitrary" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OrdinalEncoder(encoding_method='arbitrary',\n", " variables=['pclass', 'cabin', 'embarked'])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ordinal_enc = OrdinalEncoder(encoding_method='arbitrary',\n", " variables=['pclass', 'cabin', 'embarked'])\n", "\n", "# with encoding_method='arbitrary' the target is not needed; passing y to fit() is optional\n", "ordinal_enc.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'pclass': {2: 0, 3: 1, 1: 2},\n", " 'cabin': {'n': 0,\n", " 'E': 1,\n", " 'C': 2,\n", " 'D': 3,\n", " 'B': 4,\n", " 'A': 5,\n", " 'F': 6,\n", " 'T': 7,\n", " 'G': 8},\n", " 'embarked': {'S': 0, 'C': 1, 'Q': 2}}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ordinal_enc.encoder_dict_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the numbers assigned to the labels differ between \"arbitrary\" and \"ordered\": here they follow the order in which the labels first appear, not the mean of the target." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr></thead>\n", "<tbody>\n", "<tr><th>1122</th><td>1</td><td>female</td><td>NaN</td><td>1</td><td>1</td><td>22.3583</td><td>6</td><td>1</td></tr>\n", "<tr><th>934</th><td>1</td><td>female</td><td>4.0</td><td>0</td><td>2</td><td>22.0250</td><td>0</td><td>0</td></tr>\n", "<tr><th>815</th><td>1</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>14.5000</td><td>0</td><td>0</td></tr>\n", "<tr><th>124</th><td>2</td><td>female</td><td>48.0</td><td>1</td><td>1</td><td>79.2000</td><td>4</td><td>1</td></tr>\n", "<tr><th>1125</th><td>1</td><td>male</td><td>24.0</td><td>0</td><td>0</td><td>8.0500</td><td>0</td><td>0</td></tr>\n", "</tbody>\n", "</table>"
 ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1122 1 female NaN 1 1 22.3583 6 1\n", "934 1 female 4.0 0 2 22.0250 0 0\n", "815 1 male NaN 0 0 14.5000 0 0\n", "124 2 female 48.0 1 1 79.2000 4 1\n", "1125 1 male 24.0 0 0 8.0500 0 0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# transform: the former categorical variables now hold numerical values\n", "\n", "train_t = ordinal_enc.transform(X_train)\n", "test_t = ordinal_enc.transform(X_test)\n", "\n", "test_t.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatically select categorical variables\n", "\n", "If the variables argument is left as None when the encoder is created, it finds and encodes all the categorical variables (type 'object')." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OrdinalEncoder(encoding_method='arbitrary',\n", " variables=['pclass', 'sex', 'cabin', 'embarked'])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ordinal_enc = OrdinalEncoder(encoding_method='arbitrary')\n", "\n", "# with encoding_method='arbitrary' the target is not needed; passing y to fit() is optional\n", "ordinal_enc.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['pclass', 'sex', 'cabin', 'embarked']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ordinal_enc.variables" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr></thead>\n", "<tbody>\n", "<tr><th>1135</th><td>1</td><td>1</td><td>NaN</td><td>0</td><td>0</td><td>7.8958</td><td>0</td><td>0</td></tr>\n", "<tr><th>328</th><td>0</td><td>1</td><td>34.0</td><td>1</td><td>0</td><td>26.0000</td><td>0</td><td>0</td></tr>\n", "<tr><th>785</th><td>1</td><td>0</td><td>22.0</td><td>1</td><td>0</td><td>13.9000</td><td>0</td><td>0</td></tr>\n", "<tr><th>708</th><td>1</td><td>1</td><td>24.0</td><td>0</td><td>0</td><td>7.8542</td><td>0</td><td>0</td></tr>\n", "<tr><th>486</th><td>0</td><td>1</td><td>24.0</td><td>0</td><td>0</td><td>10.5000</td><td>0</td><td>0</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " pclass sex age sibsp parch fare cabin embarked\n", "1135 1 1 NaN 0 0 7.8958 0 0\n", "328 0 1 34.0 1 0 26.0000 0 0\n", "785 1 0 22.0 1 0 13.9000 0 0\n", "708 1 1 24.0 0 0 7.8542 0 0\n", "486 0 1 24.0 0 0 10.5000 0 0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_t = ordinal_enc.transform(X_train)\n", "test_t = ordinal_enc.transform(X_test)\n", "\n", "test_t.sample(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }