{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Model-based and sequential feature selection\n\nThis example illustrates and compares two approaches for feature selection:\n:class:`~sklearn.feature_selection.SelectFromModel` which is based on feature\nimportance, and\n:class:`~sklearn.feature_selection.SequentialFeatureSelector` which relies\non a greedy approach.\n\nWe use the Diabetes dataset, which consists of 10 features collected from 442\ndiabetes patients.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data\n\nWe first load the diabetes dataset which is available from within\nscikit-learn, and print its description:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n\ndiabetes = load_diabetes()\nX, y = diabetes.data, diabetes.target\nprint(diabetes.DESCR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature importance from coefficients\n\nTo get an idea of the importance of the features, we are going to use the\n:class:`~sklearn.linear_model.RidgeCV` estimator. The features with the\nhighest absolute `coef_` value are considered the most important.\nWe can observe the coefficients directly without needing to scale them (or\nscale the data) because from the description above, we know that the features\nwere already standardized.\nFor a more complete example on the interpretations of the coefficients of\nlinear models, you may refer to\n`sphx_glr_auto_examples_inspection_plot_linear_model_coefficient_interpretation.py`. # noqa: E501\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.linear_model import RidgeCV\n\nridge = RidgeCV(alphas=np.logspace(-6, 6, num=5)).fit(X, y)\nimportance = np.abs(ridge.coef_)\nfeature_names = np.array(diabetes.feature_names)\nplt.bar(height=importance, x=feature_names)\nplt.title(\"Feature importances via coefficients\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting features based on importance\n\nNow we want to select the two features which are the most important according\nto the coefficients. The :class:`~sklearn.feature_selection.SelectFromModel`\nis meant just for that. 
It accepts a `threshold` parameter and will select the features whose\nimportance (defined by the coefficients) is above this threshold.\n\nSince we want to select only 2 features, we will set this threshold slightly\nabove the coefficient of the third most important feature.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import time\n\nfrom sklearn.feature_selection import SelectFromModel\n\nthreshold = np.sort(importance)[-3] + 0.01\n\ntic = time()\nsfm = SelectFromModel(ridge, threshold=threshold).fit(X, y)\ntoc = time()\nprint(f\"Features selected by SelectFromModel: {feature_names[sfm.get_support()]}\")\nprint(f\"Done in {toc - tic:.3f}s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting features with Sequential Feature Selection\n\nAnother way of selecting features is to use\n:class:`~sklearn.feature_selection.SequentialFeatureSelector`\n(SFS). SFS is a greedy procedure where, at each iteration, we choose the best\nnew feature to add to our selected features based on a cross-validation score.\nThat is, we start with 0 features and choose the best single feature with the\nhighest score. The procedure is repeated until we reach the desired number of\nselected features.\n\nWe can also go in the reverse direction (backward SFS), *i.e.* start with all\nthe features and greedily choose features to remove one by one. We illustrate\nboth approaches here.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_selection import SequentialFeatureSelector\n\ntic_fwd = time()\nsfs_forward = SequentialFeatureSelector(\n ridge, n_features_to_select=2, direction=\"forward\"\n).fit(X, y)\ntoc_fwd = time()\n\ntic_bwd = time()\nsfs_backward = SequentialFeatureSelector(\n ridge, n_features_to_select=2, direction=\"backward\"\n).fit(X, y)\ntoc_bwd = time()\n\nprint(\n \"Features selected by forward sequential selection: \"\n f\"{feature_names[sfs_forward.get_support()]}\"\n)\nprint(f\"Done in {toc_fwd - tic_fwd:.3f}s\")\nprint(\n \"Features selected by backward sequential selection: \"\n f\"{feature_names[sfs_backward.get_support()]}\"\n)\nprint(f\"Done in {toc_bwd - tic_bwd:.3f}s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, forward and backward selection have selected the same set of\nfeatures. In general, this isn't the case, and the two methods would lead to\ndifferent results.\n\nWe also note that the features selected by SFS differ from those selected by\nfeature importance: SFS selects `bmi` instead of `s1`. This does sound\nreasonable though, since `bmi` corresponds to the third most important\nfeature according to the coefficients. The closeness of the two selections is\nquite remarkable considering that SFS makes no use of the coefficients at all.\n\nTo conclude, we should note that\n:class:`~sklearn.feature_selection.SelectFromModel` is significantly faster\nthan SFS. Indeed, :class:`~sklearn.feature_selection.SelectFromModel` only\nneeds to fit a model once, while SFS needs to cross-validate many different\nmodels for each of its iterations. SFS, however, works with any model, while\n:class:`~sklearn.feature_selection.SelectFromModel` requires the underlying\nestimator to expose a `coef_` attribute or a `feature_importances_`\nattribute. 
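For instance, below is a minimal\nsketch of forward SFS with a :class:`~sklearn.neighbors.KNeighborsRegressor`,\nan estimator that exposes neither of these attributes (the choice of\nestimator here is purely illustrative):\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative sketch: SFS only needs an estimator that can be fitted and\n# scored via cross-validation, so it also works with models that expose\n# neither `coef_` nor `feature_importances_`, such as k-nearest neighbors.\nfrom sklearn.neighbors import KNeighborsRegressor\n\nsfs_knn = SequentialFeatureSelector(\n KNeighborsRegressor(), n_features_to_select=2, direction=\"forward\"\n).fit(X, y)\nprint(\n \"Features selected by forward SFS with KNeighborsRegressor: \"\n f\"{feature_names[sfs_knn.get_support()]}\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "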
The forward SFS is faster than the backward SFS because it only\nneeds to perform `n_features_to_select = 2` iterations, while the backward\nSFS needs to perform `n_features - n_features_to_select = 8` iterations.\n\n## Using negative tolerance values\n\n:class:`~sklearn.feature_selection.SequentialFeatureSelector` can also be used\nto remove features from the dataset and return a\nsmaller subset of the original features, by setting `direction=\"backward\"`\ntogether with a negative value of `tol`.\n\nWe begin by loading the Breast Cancer dataset, consisting of 30 different\nfeatures and 569 samples.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.datasets import load_breast_cancer\n\nbreast_cancer_data = load_breast_cancer()\nX, y = breast_cancer_data.data, breast_cancer_data.target\nfeature_names = np.array(breast_cancer_data.feature_names)\nprint(breast_cancer_data.DESCR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will make use of the :class:`~sklearn.linear_model.LogisticRegression`\nestimator with :class:`~sklearn.feature_selection.SequentialFeatureSelector`\nto perform the feature selection.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import roc_auc_score\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nfor tol in [-1e-2, -1e-3, -1e-4]:\n start = time()\n feature_selector = SequentialFeatureSelector(\n LogisticRegression(),\n n_features_to_select=\"auto\",\n direction=\"backward\",\n scoring=\"roc_auc\",\n tol=tol,\n n_jobs=2,\n )\n model = make_pipeline(StandardScaler(), feature_selector, LogisticRegression())\n model.fit(X, y)\n end = time()\n print(f\"\\ntol: {tol}\")\n print(f\"Features selected: {feature_names[model[1].get_support()]}\")\n print(f\"ROC AUC score: {roc_auc_score(y, model.predict_proba(X)[:, 1]):.3f}\")\n print(f\"Done in {end - start:.3f}s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the number of features selected tends to increase as negative\nvalues of `tol` approach zero. The time taken for feature selection also\ndecreases as the values of `tol` come closer to zero.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }