{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Select with Target Mean as Performance Proxy\n",
    "\n",
    "**Method used in a KDD 2009 competition**\n",
    "\n",
    "This feature selection approach was used by data scientists at the University of Melbourne in the [KDD 2009](http://www.kdd.org/kdd-cup/view/kdd-cup-2009) data science competition. The task consisted in predicting churn based on a dataset with a huge number of features.\n",
    "\n",
    "The authors describe this procedure as an aggressive non-parametric feature selection procedure that is based in contemplating the relationship between the feature and the target.\n",
    "\n",
    "\n",
    "**The procedure consists in the following steps**:\n",
    "\n",
    "For each categorical variable:\n",
    "\n",
    "    1) Separate into train and test\n",
    "\n",
    "    2) Determine the mean value of the target within each label of the categorical variable using the train set\n",
    "\n",
    "    3) Use that mean target value per label as the prediction (using the test set) and calculate the roc-auc.\n",
    "\n",
    "For each numerical variable:\n",
    "\n",
    "    1) Separate into train and test\n",
    "    \n",
    "    2) Divide the variable intervals\n",
    "\n",
    "    3) Calculate the mean target within each interval using the training set \n",
    "\n",
    "    4) Use that mean target value / bin as the prediction (using the test set) and calculate the roc-auc\n",
    "\n",
    "\n",
    "The authors quote the following advantages of the method:\n",
    "\n",
    "- Speed: computing mean and quantiles is direct and efficient\n",
    "- Stability respect to scale: extreme values for continuous variables do not skew the predictions\n",
    "- Comparable between categorical and numerical variables\n",
    "- Accommodation of non-linearities\n",
    "\n",
    "**Important**\n",
    "The authors here use the roc-auc, but in principle, we could use any metric, including those valid for regression.\n",
    "\n",
    "The authors sort continuous variables into percentiles, but Feature-engine gives the option to sort into equal-frequency or equal-width intervals.\n",
    "\n",
    "**Reference**:\n",
    "[Predicting customer behaviour: The University of Melbourne's KDD Cup Report. Miller et al. JMLR Workshop and Conference Proceedings 7:45-55](http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "from feature_engine.selection import SelectByTargetMeanPerformance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the titanic dataset\n",
    "data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n",
    "\n",
    "# remove unwanted variables\n",
    "data.drop(labels = ['name','boat', 'ticket','body', 'home.dest'], axis=1, inplace=True)\n",
    "\n",
    "# replace ? by Nan\n",
    "data = data.replace('?', np.nan)\n",
    "\n",
    "# missing values\n",
    "data.dropna(subset=['embarked', 'fare'], inplace=True)\n",
    "\n",
    "data['age'] = data['age'].astype('float')\n",
    "data['age'] = data['age'].fillna(data['age'].mean())\n",
    "\n",
    "data['fare'] = data['fare'].astype('float')\n",
    "\n",
    "def get_first_cabin(row):\n",
    "    try:\n",
    "        return row.split()[0]\n",
    "    except:\n",
    "        return 'N' \n",
    "    \n",
    "data['cabin'] = data['cabin'].apply(get_first_cabin)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pclass</th>\n",
       "      <th>survived</th>\n",
       "      <th>sex</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>fare</th>\n",
       "      <th>cabin</th>\n",
       "      <th>embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>29.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>211.3375</td>\n",
       "      <td>B5</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>0.9167</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>female</td>\n",
       "      <td>2.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>male</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>female</td>\n",
       "      <td>25.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pclass  survived     sex      age  sibsp  parch      fare cabin embarked\n",
       "0       1         1  female  29.0000      0      0  211.3375    B5        S\n",
       "1       1         1    male   0.9167      1      2  151.5500   C22        S\n",
       "2       1         0  female   2.0000      1      2  151.5500   C22        S\n",
       "3       1         0    male  30.0000      1      2  151.5500   C22        S\n",
       "4       1         0  female  25.0000      1      2  151.5500   C22        S"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['B', 'C', 'E', 'D', 'A', 'N', 'F'], dtype=object)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Variable preprocessing:\n",
    "\n",
    "# then I will narrow down the different cabins by selecting only the\n",
    "# first letter, which represents the deck in which the cabin was located\n",
    "\n",
    "# captures first letter of string (the letter of the cabin)\n",
    "data['cabin'] = data['cabin'].str[0]\n",
    "\n",
    "# now we will rename those cabin letters that appear only 1 or 2 in the\n",
    "# dataset by N\n",
    "\n",
    "# replace rare cabins by N\n",
    "data['cabin'] = np.where(data['cabin'].isin(['T', 'G']), 'N', data['cabin'])\n",
    "\n",
    "data['cabin'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pclass        int64\n",
       "survived      int64\n",
       "sex          object\n",
       "age         float64\n",
       "sibsp         int64\n",
       "parch         int64\n",
       "fare        float64\n",
       "cabin        object\n",
       "embarked     object\n",
       "dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    999\n",
       "1    170\n",
       "2    113\n",
       "3      8\n",
       "5      6\n",
       "4      6\n",
       "9      2\n",
       "6      2\n",
       "Name: parch, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# number of passengers per value\n",
    "data['parch'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cap variable at 3, the rest of the values are\n",
    "# shown by too few observations\n",
    "\n",
    "data['parch'] = np.where(data['parch']>3,3,data['parch'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    888\n",
       "1    319\n",
       "2     42\n",
       "4     22\n",
       "3     20\n",
       "8      9\n",
       "5      6\n",
       "Name: sibsp, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['sibsp'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cap variable at 3, the rest of the values are\n",
    "# shown by too few observations\n",
    "\n",
    "data['sibsp'] = np.where(data['sibsp']>3,3,data['sibsp'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cast discrete variables as categorical\n",
    "\n",
    "# feature-engine considers categorical variables all those of type\n",
    "# object. So in order to work with numerical variables as if they\n",
    "# were categorical, we  need to cast them as object\n",
    "\n",
    "data[['pclass','sibsp','parch']] = data[['pclass','sibsp','parch']].astype('O')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pclass      0\n",
       "survived    0\n",
       "sex         0\n",
       "age         0\n",
       "sibsp       0\n",
       "parch       0\n",
       "fare        0\n",
       "cabin       0\n",
       "embarked    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check absence of missing data\n",
    "\n",
    "data.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Important**\n",
    "\n",
    "In all feature selection procedures, it is good practice to select the features by examining only the training set. And this is to avoid overfit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((914, 8), (392, 8))"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# separate train and test sets\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    data.drop(['survived'], axis=1),\n",
    "    data['survived'],\n",
    "    test_size=0.3,\n",
    "    random_state=0)\n",
    "\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SelectByTargetMeanPerformance(bins=3, cv=2, random_state=1,\n",
       "                              scoring='roc_auc_score',\n",
       "                              strategy='equal_frequency', threshold=0.6,\n",
       "                              variables=['pclass', 'sex', 'age', 'sibsp',\n",
       "                                         'parch', 'fare', 'cabin', 'embarked'])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# feautre engine automates the selection for both\n",
    "# categorical and numerical variables\n",
    "\n",
    "sel = SelectByTargetMeanPerformance(\n",
    "    variables=None, # automatically finds categorical and numerical variables\n",
    "    scoring=\"roc_auc_score\", # the metric to evaluate performance\n",
    "    threshold=0.6, # the threshold for feature selection, \n",
    "    bins=3, # the number of intervals to discretise the numerical variables\n",
    "    strategy=\"equal_frequency\", # whether the intervals should be of equal size or equal number of observations\n",
    "    cv=2,# cross validation\n",
    "    random_state=1, #seed for reproducibility\n",
    ")\n",
    "\n",
    "sel.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['pclass', 'sex', 'sibsp', 'parch', 'cabin', 'embarked']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# after fitting, we can find the categorical variables\n",
    "# using this attribute\n",
    "\n",
    "sel.variables_categorical_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['age', 'fare']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# and here we find the numerical variables\n",
    "\n",
    "sel.variables_numerical_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'pclass': 0.6802934787230475,\n",
       " 'sex': 0.7491365252482871,\n",
       " 'age': 0.5345141148737766,\n",
       " 'sibsp': 0.5720480307315783,\n",
       " 'parch': 0.5243557188989476,\n",
       " 'fare': 0.6600883312700917,\n",
       " 'cabin': 0.6379782658154696,\n",
       " 'embarked': 0.5672382248783936}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# here the selector stores the roc-auc per feature\n",
    "\n",
    "sel.feature_performance_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['age', 'sibsp', 'parch', 'embarked']"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# and these are the features that will be dropped\n",
    "\n",
    "sel.features_to_drop_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((914, 4), (392, 4))"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train = sel.transform(X_train)\n",
    "X_test = sel.transform(X_test)\n",
    "\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That is all for this lecture, I hope you enjoyed it and see you in the next one!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "fengine",
   "language": "python",
   "name": "fengine"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}