{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Select with Target Mean as Performance Proxy\n", "\n", "**Method used in a KDD 2009 competition**\n", "\n", "This feature selection approach was used by data scientists at the University of Melbourne in the [KDD 2009](http://www.kdd.org/kdd-cup/view/kdd-cup-2009) data science competition. The task was to predict churn from a dataset with a very large number of features.\n", "\n", "The authors describe it as an aggressive non-parametric feature selection procedure based on examining the relationship between each feature and the target.\n", "\n", "\n", "**The procedure consists of the following steps**:\n", "\n", "For each categorical variable:\n", "\n", " 1) Separate the data into train and test sets\n", "\n", " 2) Determine the mean value of the target within each label of the categorical variable using the train set\n", "\n", " 3) Use that mean target value per label as the prediction for the test set and calculate the roc-auc\n", "\n", "For each numerical variable:\n", "\n", " 1) Separate the data into train and test sets\n", " \n", " 2) Divide the variable into intervals\n", "\n", " 3) Calculate the mean target within each interval using the train set\n", "\n", " 4) Use that mean target value per interval as the prediction for the test set and calculate the roc-auc\n", "\n", "\n", "The authors cite the following advantages of the method:\n", "\n", "- Speed: computing means and quantiles is direct and efficient\n", "- Stability with respect to scale: extreme values of continuous variables do not skew the predictions\n", "- Comparability between categorical and numerical variables\n", "- Accommodation of non-linearities\n", "\n", "**Important**\n", "\n", "The authors use the roc-auc, but in principle we could use any metric, including those valid for regression.\n", "\n", "The authors sort continuous variables into percentiles, but Feature-engine gives the option to sort into equal-frequency or 
equal-width intervals.\n", "\n", "**Reference**:\n", "[Predicting customer behaviour: The University of Melbourne's KDD Cup Report. Miller et al. JMLR Workshop and Conference Proceedings 7:45-55](http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import roc_auc_score\n", "\n", "from feature_engine.selection import SelectByTargetMeanPerformance" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# load the titanic dataset\n", "data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "\n", "# remove unwanted variables\n", "data.drop(labels=['name', 'boat', 'ticket', 'body', 'home.dest'], axis=1, inplace=True)\n", "\n", "# replace '?' with NaN\n", "data = data.replace('?', np.nan)\n", "\n", "# drop rows with missing values in embarked or fare\n", "data.dropna(subset=['embarked', 'fare'], inplace=True)\n", "\n", "# cast age to float and impute missing values with the mean\n", "data['age'] = data['age'].astype('float')\n", "data['age'] = data['age'].fillna(data['age'].mean())\n", "\n", "data['fare'] = data['fare'].astype('float')\n", "\n", "def get_first_cabin(row):\n", "    # keep the first cabin when several are listed;\n", "    # missing cabins (NaN) have no .split() and are labelled 'N'\n", "    try:\n", "        return row.split()[0]\n", "    except AttributeError:\n", "        return 'N'\n", "\n", "data['cabin'] = data['cabin'].apply(get_first_cabin)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>pclass</th><th>survived</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1</td><td>female</td><td>29.0000</td><td>0</td><td>0</td><td>211.3375</td><td>B5</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>male</td><td>0.9167</td><td>1</td><td>2</td><td>151.5500</td><td>C22</td><td>S</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>0</td><td>female</td><td>2.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C22</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>0</td><td>male</td><td>30.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C22</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>0</td><td>female</td><td>25.0000</td><td>1</td><td>2</td><td>151.5500</td><td>C22</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass survived sex age sibsp parch fare cabin embarked\n", "0 1 1 female 29.0000 0 0 211.3375 B5 S\n", "1 1 1 male 0.9167 1 2 151.5500 C22 S\n", "2 1 0 female 2.0000 1 2 151.5500 C22 S\n", "3 1 0 male 30.0000 1 2 151.5500 C22 S\n", "4 1 0 female 25.0000 1 2 151.5500 C22 S" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['B', 'C', 'E', 'D', 'A', 'N', 'F'], dtype=object)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Variable preprocessing:\n", "\n", "# then I will narrow down the different cabins by selecting only the\n", "# first letter, which represents the deck in which the cabin was located\n", "\n", "# captures first letter of string (the letter of the cabin)\n", "data['cabin'] = data['cabin'].str[0]\n", "\n", "# now we will rename those cabin letters that appear only 1 or 2 in the\n", "# dataset by N\n", "\n", "# replace rare cabins by N\n", "data['cabin'] = np.where(data['cabin'].isin(['T', 'G']), 'N', data['cabin'])\n", "\n", "data['cabin'].unique()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pclass int64\n", "survived int64\n", "sex object\n", "age float64\n", "sibsp int64\n", "parch int64\n", "fare float64\n", "cabin object\n", "embarked object\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dtypes" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 999\n", "1 170\n", "2 113\n", "3 8\n", "5 6\n", "4 6\n", "9 2\n", "6 2\n", "Name: parch, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# number of passengers per value\n", "data['parch'].value_counts()" ] }, { 
"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# cap variable at 3, the rest of the values are\n", "# shown by too few observations\n", "\n", "data['parch'] = np.where(data['parch']>3,3,data['parch'])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 888\n", "1 319\n", "2 42\n", "4 22\n", "3 20\n", "8 9\n", "5 6\n", "Name: sibsp, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['sibsp'].value_counts()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# cap variable at 3, the rest of the values are\n", "# shown by too few observations\n", "\n", "data['sibsp'] = np.where(data['sibsp']>3,3,data['sibsp'])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# cast discrete variables as categorical\n", "\n", "# feature-engine considers categorical variables all those of type\n", "# object. So in order to work with numerical variables as if they\n", "# were categorical, we need to cast them as object\n", "\n", "data[['pclass','sibsp','parch']] = data[['pclass','sibsp','parch']].astype('O')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pclass 0\n", "survived 0\n", "sex 0\n", "age 0\n", "sibsp 0\n", "parch 0\n", "fare 0\n", "cabin 0\n", "embarked 0\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check absence of missing data\n", "\n", "data.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Important**\n", "\n", "In all feature selection procedures, it is good practice to select the features by examining only the training set. And this is to avoid overfit." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((914, 8), (392, 8))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# separate train and test sets\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    data.drop(['survived'], axis=1),\n", "    data['survived'],\n", "    test_size=0.3,\n", "    random_state=0)\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SelectByTargetMeanPerformance(bins=3, cv=2, random_state=1,\n", " scoring='roc_auc_score',\n", " strategy='equal_frequency', threshold=0.6,\n", " variables=['pclass', 'sex', 'age', 'sibsp',\n", " 'parch', 'fare', 'cabin', 'embarked'])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Feature-engine automates the selection for both\n", "# categorical and numerical variables\n", "\n", "sel = SelectByTargetMeanPerformance(\n", "    variables=None,  # automatically finds categorical and numerical variables\n", "    scoring=\"roc_auc_score\",  # the metric to evaluate performance\n", "    threshold=0.6,  # features scoring below this value are dropped\n", "    bins=3,  # the number of intervals to discretise the numerical variables\n", "    strategy=\"equal_frequency\",  # equal-frequency or equal-width intervals\n", "    cv=2,  # cross-validation folds\n", "    random_state=1,  # seed for reproducibility\n", ")\n", "\n", "sel.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['pclass', 'sex', 'sibsp', 'parch', 'cabin', 'embarked']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# after fitting, we can find the categorical variables\n", "# using this attribute\n", "\n", "sel.variables_categorical_" ] }, { "cell_type": "code", 
"execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['age', 'fare']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# and here we find the numerical variables\n", "\n", "sel.variables_numerical_" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'pclass': 0.6802934787230475,\n", " 'sex': 0.7491365252482871,\n", " 'age': 0.5345141148737766,\n", " 'sibsp': 0.5720480307315783,\n", " 'parch': 0.5243557188989476,\n", " 'fare': 0.6600883312700917,\n", " 'cabin': 0.6379782658154696,\n", " 'embarked': 0.5672382248783936}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# here the selector stores the roc-auc per feature\n", "\n", "sel.feature_performance_" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['age', 'sibsp', 'parch', 'embarked']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# and these are the features that will be dropped\n", "\n", "sel.features_to_drop_" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((914, 4), (392, 4))" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train = sel.transform(X_train)\n", "X_test = sel.transform(X_test)\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is all for this lecture, I hope you enjoyed it and see you in the next one!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fengine", "language": "python", "name": "fengine" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }