{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Select with Target Mean as Performance Proxy\n", "\n", "**Method used in a KDD 2009 competition**\n", "\n", "This feature selection approach was used by data scientists at the University of Melbourne in the [KDD 2009](http://www.kdd.org/kdd-cup/view/kdd-cup-2009) data science competition. The task consisted of predicting churn from a dataset with a very large number of features.\n", "\n", "The authors describe this procedure as an aggressive non-parametric feature selection procedure that is based on the relationship between each feature and the target.\n", "\n", "\n", "**The procedure consists of the following steps**:\n", "\n", "For each categorical variable:\n", "\n", " 1) Separate the data into train and test sets\n", "\n", " 2) Determine the mean value of the target within each label of the categorical variable using the train set\n", "\n", " 3) Use that mean target value per label as the prediction for the test set and calculate the ROC-AUC\n", "\n", "For each numerical variable:\n", "\n", " 1) Separate the data into train and test sets\n", " \n", " 2) Divide the variable into intervals\n", "\n", " 3) Calculate the mean value of the target within each interval using the train set\n", "\n", " 4) Use that mean target value per interval as the prediction for the test set and calculate the ROC-AUC\n", "\n", "\n", "The authors cite the following advantages of the method:\n", "\n", "- Speed: computing means and quantiles is direct and efficient\n", "- Stability with respect to scale: extreme values of continuous variables do not skew the predictions\n", "- Comparability between categorical and numerical variables\n", "- Accommodation of non-linearities\n", "\n", "**Important**\n", "\n", "The authors use the ROC-AUC, but in principle we could use any metric, including those valid for regression.\n", "\n", "The authors sort continuous variables into percentiles, but Feature-engine gives the option to sort into equal-frequency or
equal-width intervals.\n", "\n", "**Reference**:\n", "[Predicting customer behaviour: The University of Melbourne's KDD Cup Report. Miller et al. JMLR Workshop and Conference Proceedings 7:45-55](http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import roc_auc_score\n", "\n", "from feature_engine.selection import SelectByTargetMeanPerformance" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# load the titanic dataset\n", "data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n", "\n", "# remove unwanted variables\n", "data.drop(labels=['name', 'boat', 'ticket', 'body', 'home.dest'], axis=1, inplace=True)\n", "\n", "# replace '?' with NaN\n", "data = data.replace('?', np.nan)\n", "\n", "# drop rows with missing values in embarked or fare\n", "data.dropna(subset=['embarked', 'fare'], inplace=True)\n", "\n", "# cast age to float and fill missing values with the mean age\n", "data['age'] = data['age'].astype('float')\n", "data['age'] = data['age'].fillna(data['age'].mean())\n", "\n", "data['fare'] = data['fare'].astype('float')\n", "\n", "# keep the first cabin when several are listed; use 'N' when the cabin is missing\n", "def get_first_cabin(row):\n", "    try:\n", "        return row.split()[0]\n", "    except (AttributeError, IndexError):\n", "        return 'N'\n", "\n", "data['cabin'] = data['cabin'].apply(get_first_cabin)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|   | pclass | survived | sex | age | sibsp | parch | fare | cabin | embarked |\n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 1 | 1 | female | 29.0000 | 0 | 0 | 211.3375 | B5 | S |\n",
"| 1 | 1 | 1 | male | 0.9167 | 1 | 2 | 151.5500 | C22 | S |\n",
"| 2 | 1 | 0 | female | 2.0000 | 1 | 2 | 151.5500 | C22 | S |\n",
"| 3 | 1 | 0 | male | 30.0000 | 1 | 2 | 151.5500 | C22 | S |\n",
"| 4 | 1 | 0 | female | 25.0000 | 1 | 2 | 151.5500 | C22 | S |\n", "