{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Selection\n", "\n", "In the last few lectures, we learned how to use hold-out \"test\" sets and cross-validation to obtain realistic estimates of a model's performance on unseen data. There, the focus was on choosing a good \"complexity\" parameter, such as the depth of a decision tree. In this lecture, we'll instead show how to use cross-validation to estimate which columns in the data should or should not be included in a model. It's very common in practice that the best model does not use every column, and many, many machine learning researchers devote their careers to studying the problem of how to intelligently and automatically choose only the most relevant columns for models. In the literature, this problem is usually called *feature selection*. In this lecture, we'll take a quick look at how feature selection can improve model performance. \n", "\n", "For this demonstration, we'll switch from decision trees to logistic regression. Logistic regression is a form of regression modeling well-suited for predicting probabilities and class labels. \n", "\n", "Let's begin by running some familiar blocks of code, in which we load our core libraries, read in the data, split the data, and clean the data. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|  | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |\n",
"|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |\n",
"| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |\n",
"| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |\n",
"| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |\n",
"| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 |\n",
"| 883 | 1 | 1 | Miss. Margaret Edith Graham | female | 19.0 | 0 | 0 | 30.0000 |\n",
"| 884 | 0 | 3 | Miss. Catherine Helen Johnston | female | 7.0 | 1 | 2 | 23.4500 |\n",
"| 885 | 1 | 1 | Mr. Karl Howell Behr | male | 26.0 | 0 | 0 | 30.0000 |\n",
"| 886 | 0 | 3 | Mr. Patrick Dooley | male | 32.0 | 0 | 0 | 7.7500 |\n",
"\n",
"887 rows × 8 columns
\n", "