{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Selection\n", "\n", "In the last few lectures, we learned how to use hold-out \"test\" sets and cross-validation to obtain reliable estimates of a model's performance on unseen data. There, the focus was on choosing a good \"complexity\" parameter, such as the depth of a decision tree. In this lecture, we'll instead show how to use cross-validation to estimate which columns in the data should or should not be included in a model. It's very common in practice that not all columns will be used in the best model, and many machine learning researchers devote their careers to studying the problem of how to intelligently and automatically choose only the most relevant columns for models. In the literature, this problem is usually called *feature selection*. In this lecture, we'll take a quick look at how feature selection can improve model performance. \n", "\n", "For this demonstration, we'll switch from decision trees to logistic regression. Logistic regression is a form of regression modeling well-suited for predicting probabilities and class labels. \n", "\n", "Let's begin by running some familiar blocks of code, in which we load our core libraries, read in the data, split the data, and clean the data. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>Siblings/Spouses Aboard</th><th>Parents/Children Aboard</th><th>Fare</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>3</td><td>Mr. Owen Harris Braund</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>Mrs. John Bradley (Florence Briggs Thayer) Cum...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>3</td><td>Miss. Laina Heikkinen</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>1</td><td>Mrs. Jacques Heath (Lily May Peel) Futrelle</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>3</td><td>Mr. William Henry Allen</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>882</th><td>0</td><td>2</td><td>Rev. Juozas Montvila</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>13.0000</td></tr>\n",
"    <tr><th>883</th><td>1</td><td>1</td><td>Miss. Margaret Edith Graham</td><td>female</td><td>19.0</td><td>0</td><td>0</td><td>30.0000</td></tr>\n",
"    <tr><th>884</th><td>0</td><td>3</td><td>Miss. Catherine Helen Johnston</td><td>female</td><td>7.0</td><td>1</td><td>2</td><td>23.4500</td></tr>\n",
"    <tr><th>885</th><td>1</td><td>1</td><td>Mr. Karl Howell Behr</td><td>male</td><td>26.0</td><td>0</td><td>0</td><td>30.0000</td></tr>\n",
"    <tr><th>886</th><td>0</td><td>3</td><td>Mr. Patrick Dooley</td><td>male</td><td>32.0</td><td>0</td><td>0</td><td>7.7500</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>887 rows × 8 columns</p>\n",
"</div>
" ], "text/plain": [ " Survived Pclass Name \\\n", "0 0 3 Mr. Owen Harris Braund \n", "1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... \n", "2 1 3 Miss. Laina Heikkinen \n", "3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle \n", "4 0 3 Mr. William Henry Allen \n", ".. ... ... ... \n", "882 0 2 Rev. Juozas Montvila \n", "883 1 1 Miss. Margaret Edith Graham \n", "884 0 3 Miss. Catherine Helen Johnston \n", "885 1 1 Mr. Karl Howell Behr \n", "886 0 3 Mr. Patrick Dooley \n", "\n", " Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare \n", "0 male 22.0 1 0 7.2500 \n", "1 female 38.0 1 0 71.2833 \n", "2 female 26.0 0 0 7.9250 \n", "3 female 35.0 1 0 53.1000 \n", "4 male 35.0 0 0 8.0500 \n", ".. ... ... ... ... ... \n", "882 male 27.0 0 0 13.0000 \n", "883 female 19.0 0 0 30.0000 \n", "884 female 7.0 1 2 23.4500 \n", "885 male 26.0 0 0 30.0000 \n", "886 male 32.0 0 0 7.7500 \n", "\n", "[887 rows x 8 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# assumes that you have run the function retrieve_data() \n", "# from \"Introduction to ML in Practice\" in ML_3.ipynb\n", "titanic = pd.read_csv(\"data.csv\")\n", "titanic" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "np.random.seed(1111)\n", "train, test = train_test_split(titanic, test_size = 0.2)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing\n", "def prep_titanic_data(data_df):\n", " df = data_df.copy()\n", " le = preprocessing.LabelEncoder()\n", " df['Sex'] = le.fit_transform(df['Sex'])\n", " df = df.drop(['Name'], axis = 1)\n", " \n", " X = df.drop(['Survived'], axis = 1)\n", " y = df['Survived']\n", " \n", " return(X, y)\n", "\n", "X_train, y_train = prep_titanic_data(train)\n", "X_test, y_test = prep_titanic_data(test)" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "Deploying logistic regression is easy and uses exactly the same API as the decision tree classifier. Let's go ahead and use cross-validation to estimate the predictive performance of the model. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7940365597842375" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import cross_val_score\n", "\n", "LR = LogisticRegression()\n", "cross_val_score(LR, X_train, y_train, cv = 5).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is this the best we can do? If you've studied logistic regression before, you may know that using lots of columns doesn't always help: due to *multicollinearity*, the model's predictive performance can actually suffer. This is another aspect of *overfitting*. Adding more columns makes the model more flexible, and we've seen that this is not always beneficial. So, a natural question is whether we can achieve the same (or better) model performance by using only a subset of the columns. \n", "\n", "\n", "It's easy to train a model on a subset of the columns. For example: " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "training with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']\n" ] }, { "data": { "text/plain": [ "0.8025072420337629" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']\n", "print(\"training with columns \" + str(cols))\n", "\n", "LR = LogisticRegression()\n", "cross_val_score(LR, X_train[cols], y_train, cv = 5).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interesting! 
Excluding the last column (Fare) actually improved our CV score slightly, from about 0.794 to 0.803. \n", "\n", "## Systematic Feature Selection\n", "\n", "Now, let's write a function that will let us do this systematically. Our function will use cross-validation to avoid \"peeking\" at the test set. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def check_column_score(cols):\n", " \"\"\"\n", " Trains and evaluates a model via cross-validation on the training data,\n", " using only the columns named in cols\n", " \"\"\"\n", " print(\"training with columns \" + str(cols))\n", "\n", " LR = LogisticRegression()\n", " return cross_val_score(LR, X_train[cols], y_train, cv = 5).mean() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now check several combinations of columns, one after another. In a real problem, you might check all possible combinations, and in the Penguins data set, for example, this would be possible. In this lecture, however, we'll just compare a few. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "training with columns ['Sex', 'Age', 'Fare']\n", "CV score is 0.773\n", "training with columns ['Pclass', 'Sex', 'Age']\n", "CV score is 0.795\n", "training with columns ['Pclass', 'Parents/Children Aboard']\n", "CV score is 0.671\n", "training with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']\n", "CV score is 0.803\n", "training with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']\n", "CV score is 0.794\n" ] } ], "source": [ "combos = [['Sex', 'Age', 'Fare'],\n", " ['Pclass', 'Sex', 'Age'],\n", " ['Pclass', 'Parents/Children Aboard'],\n", " ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'],\n", " ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']]\n", "\n", "for cols in combos: \n", " x = 
check_column_score(cols)\n", " print(\"CV score is \" + str(np.round(x, 3)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the model that uses all the available columns achieves only the third-highest CV score. The model with the highest CV score uses all columns except for \"Fare.\"\n", "\n", "Now let's see how each of these models performs on the test set. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def test_column_score(cols):\n", " \"\"\"\n", " Trains a model on the training set and evaluates it on the test set,\n", " using only the columns named in cols\n", " \"\"\"\n", " print(\"testing with columns \" + str(cols))\n", " LR = LogisticRegression()\n", " LR.fit(X_train[cols], y_train)\n", " return LR.score(X_test[cols], y_test)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "testing with columns ['Sex', 'Age', 'Fare']\n", "test score is 0.82\n", "testing with columns ['Pclass', 'Sex', 'Age']\n", "test score is 0.803\n", "testing with columns ['Pclass', 'Parents/Children Aboard']\n", "test score is 0.742\n", "testing with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']\n", "test score is 0.831\n", "testing with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']\n", "test score is 0.82\n" ] } ], "source": [ "for cols in combos: \n", " x = test_column_score(cols)\n", " print(\"test score is \" + str(np.round(x, 3)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, we achieved a higher prediction score on the test set (0.831 versus 0.82) by ignoring the \"Fare\" column completely. \n", "\n", "There are a number of sophisticated algorithms for automated feature selection, but we won't go further into this topic in this course. 
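
For readers who want a concrete picture, the exhaustive check of all possible combinations mentioned earlier can be sketched in a few lines. This is only a minimal sketch under stated assumptions, not a method used in this course: it runs on synthetic data from `make_classification` rather than the Titanic data, and the names `best_subset`, `X_demo`, and `y_demo` are hypothetical.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (an assumption; not the Titanic set).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=1, random_state=0
)


def best_subset(X, y, cv=5):
    '''Return (score, column_indices) for the subset with the highest mean CV accuracy.'''
    n_cols = X.shape[1]
    best_score, best_cols = -1.0, None
    # Try every non-empty subset of column indices: 2**n_cols - 1 candidates.
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            model = LogisticRegression(max_iter=1000)
            score = cross_val_score(model, X[:, list(cols)], y, cv=cv).mean()
            if score > best_score:
                best_score, best_cols = score, cols
    return best_score, best_cols


score, cols = best_subset(X_demo, y_demo)
print('best columns:', cols, 'CV score:', round(score, 3))
```

Because the number of subsets doubles with every added column, this brute-force search is only practical for a handful of columns, and the winning subset should still be confirmed once on the test set.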
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }