{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Benefits of using feature selection\n", "\n", "In this notebook, we introduce the main benefits that can be gained\n", "when using feature selection.\n", "\n", "The principal advantage of selecting features within a machine\n", "learning pipeline is to reduce the time to train this pipeline and its time to\n", "predict. We will give an example to highlight these advantages. First, we\n", "generate a synthetic dataset to control the number of features that will be\n", "informative, redundant, repeated, and random." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_classification\n", "\n", "data, target = make_classification(\n", " n_samples=5000,\n", " n_features=100,\n", " n_informative=2,\n", " n_redundant=0,\n", " n_repeated=0,\n", " random_state=0,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We choose to create a dataset with two informative features among a hundred.\n", "To simplify our example, we do not include either redundant or repeated\n", "features.\n", "\n", "We will create two machine learning pipelines. The former will be a random\n", "forest that will use all available features. The latter will also be a random\n", "forest, but we will add a feature selection step to train this classifier. The\n", "feature selection is based on a univariate test (ANOVA F-value) between each\n", "feature and the target that we want to predict. 
The features with the two most\n", "significant scores are selected.\n", "\n", "Let's create the model without any feature selection:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "model_without_selection = RandomForestClassifier(n_jobs=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, let's create a pipeline where the first stage performs the feature\n", "selection." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectKBest\n", "from sklearn.feature_selection import f_classif\n", "from sklearn.pipeline import make_pipeline\n", "\n", "\n", "model_with_selection = make_pipeline(\n", " SelectKBest(score_func=f_classif, k=2),\n", " RandomForestClassifier(n_jobs=2),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will measure the average time spent to train each pipeline and to make\n", "predictions. Besides, we will compute the testing score of the model. We will\n", "collect these results via cross-validation.\n", "\n", "Let's start with the random forest without feature selection. We will store\n", "the results into a dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import cross_validate\n", "\n", "cv_results_without_selection = cross_validate(\n", " model_without_selection, data, target\n", ")\n", "cv_results_without_selection = pd.DataFrame(cv_results_without_selection)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we will repeat the process for the pipeline incorporating the feature\n", "selection." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv_results_with_selection = cross_validate(\n", " model_with_selection, data, target, return_estimator=True\n", ")\n", "cv_results_with_selection = pd.DataFrame(cv_results_with_selection)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To analyze the results, we will merge the results from the two pipelines into a\n", "single pandas dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv_results = pd.concat(\n", " [cv_results_without_selection, cv_results_with_selection],\n", " axis=1,\n", " keys=[\"Without feature selection\", \"With feature selection\"],\n", ")\n", "# swap the level of the multi-index of the columns\n", "cv_results = cv_results.swaplevel(axis=\"columns\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first analyze the train and score time for each pipeline." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n", "cv_results[\"fit_time\"].plot.box(color=color, vert=False)\n", "plt.xlabel(\"Elapsed time (s)\")\n", "_ = plt.title(\"Time to fit the model\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv_results[\"score_time\"].plot.box(color=color, vert=False)\n", "plt.xlabel(\"Elapsed time (s)\")\n", "_ = plt.title(\"Time to make prediction\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can draw the same conclusions for both training and scoring elapsed time:\n", "selecting the most informative features speeds up our pipeline.\n", "\n", "Of course, such a speed-up is beneficial only if the generalization performance\n", "in terms of metrics remains the same. Let's check the testing score." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv_results[\"test_score\"].plot.box(color=color, vert=False)\n", "plt.xlabel(\"Accuracy score\")\n", "_ = plt.title(\"Test score via cross-validation\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can observe that the generalization performance of the model selecting a\n", "subset of features decreases compared with the model using all available\n", "features. Since we generated the dataset, we can infer that the decrease is\n", "because of the selection. The feature selection algorithm did not choose the\n", "two informative features.\n", "\n", "We can investigate which features have been selected during the\n", "cross-validation. We will print the indices of the two selected features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "for idx, pipeline in enumerate(cv_results_with_selection[\"estimator\"]):\n", " print(\n", " f\"Fold #{idx} - features selected are: \"\n", " f\"{np.argsort(pipeline[0].scores_)[-2:]}\"\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the feature `53` is always selected while the other feature varies\n", "depending on the cross-validation fold.\n", "\n", "If we would like to keep a similar generalization performance, we\n", "could choose another metric to perform the test or select more features. For\n", "instance, we could select the number of features based on a specific\n", "percentile of the highest scores. 
Besides, we should keep in mind that we\n", "simplified our problem by having only informative and non-informative features.\n", "Correlation between features makes the problem of feature selection even\n", "harder.\n", "\n", "Therefore, we could come up with a much more complicated procedure that could\n", "tune (via cross-validation) the number of selected features and change the way\n", "features are selected (e.g. using a machine-learning model). However, going\n", "towards these solutions defeats the primary purpose of feature selection: to\n", "get a significant train/test speed-up. Also, if the primary goal was to get a\n", "more performant model, performant models exclude non-informative features\n", "natively." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }