{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Recursive feature elimination with cross-validation\n\nA Recursive Feature Elimination (RFE) example with automatic tuning of the\nnumber of features selected with cross-validation.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data generation\n\nWe build a classification task using 3 informative features. The introduction\nof 2 additional redundant (i.e. correlated) features has the effect that the\nselected features vary depending on the cross-validation fold. The remaining\nfeatures are non-informative as they are drawn at random.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import make_classification\n\nX, y = make_classification(\n n_samples=500,\n n_features=15,\n n_informative=3,\n n_redundant=2,\n n_repeated=0,\n n_classes=8,\n n_clusters_per_class=1,\n class_sep=0.8,\n random_state=0,\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model training and selection\n\nWe create the RFE object and compute the cross-validated scores. The scoring\nstrategy \"accuracy\" optimizes the proportion of correctly classified samples.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_selection import RFECV\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import StratifiedKFold\n\nmin_features_to_select = 1 # Minimum number of features to consider\nclf = LogisticRegression()\ncv = StratifiedKFold(5)\n\nrfecv = RFECV(\n estimator=clf,\n step=1,\n cv=cv,\n scoring=\"accuracy\",\n min_features_to_select=min_features_to_select,\n n_jobs=2,\n)\nrfecv.fit(X, y)\n\nprint(f\"Optimal number of features: {rfecv.n_features_}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the present case, the model with 3 features (which corresponds to the true\ngenerative model) is found to be the most optimal.\n\n## Plot number of features VS. cross-validation scores\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport pandas as pd\n\ncv_results = pd.DataFrame(rfecv.cv_results_)\nplt.figure()\nplt.xlabel(\"Number of features selected\")\nplt.ylabel(\"Mean test accuracy\")\nplt.errorbar(\n x=cv_results[\"n_features\"],\n y=cv_results[\"mean_test_score\"],\n yerr=cv_results[\"std_test_score\"],\n)\nplt.title(\"Recursive Feature Elimination \\nwith correlated features\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the plot above one can further notice a plateau of equivalent scores\n(similar mean value and overlapping errorbars) for 3 to 5 selected features.\nThis is the result of introducing correlated features. Indeed, the optimal\nmodel selected by the RFE can lie within this range, depending on the\ncross-validation technique. 
 ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }