{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Sebastian Raschka, 2015-2022 \n", "`mlxtend`, a library of extension and helper modules for Python's data analysis and machine learning libraries\n", "\n", "- GitHub repository: https://github.com/rasbt/mlxtend\n", "- Documentation: https://rasbt.github.io/mlxtend/" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Author: Sebastian Raschka\n", "\n", "Last updated: 2022-09-13\n", "\n", "Python implementation: CPython\n", "Python version : 3.9.7\n", "IPython version : 8.0.1\n", "\n", "matplotlib: 3.5.2\n", "numpy : 1.22.1\n", "scipy : 1.9.1\n", "mlxtend : 0.21.0.dev0\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -u -d -v -p matplotlib,numpy,scipy,mlxtend" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ExhaustiveFeatureSelector: Optimal feature sets by considering all possible feature combinations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Implementation of an *exhaustive feature selector* for sampling and evaluating all possible feature combinations in a specified range." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> from mlxtend.feature_selection import ExhaustiveFeatureSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This exhaustive feature selection algorithm is a wrapper approach for brute-force evaluation of feature subsets; the best subset is selected by optimizing a specified performance metric given an arbitrary regressor or classifier. 
For instance, if the classifier is a logistic regression and the dataset consists of 4 features, the algorithm will evaluate all 15 feature combinations (if `min_features=1` and `max_features=4`)\n", "\n", "- {0}\n", "- {1}\n", "- {2}\n", "- {3}\n", "- {0, 1}\n", "- {0, 2}\n", "- {0, 3}\n", "- {1, 2}\n", "- {1, 3}\n", "- {2, 3}\n", "- {0, 1, 2}\n", "- {0, 1, 3}\n", "- {0, 2, 3}\n", "- {1, 2, 3}\n", "- {0, 1, 2, 3}\n", "\n", "and select the one that results in the best performance (e.g., classification accuracy) of the logistic regression classifier.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1 - A simple Iris example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initializing a simple classifier from scikit-learn:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Features: 15/15" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best accuracy score: 0.97\n", "Best subset (indices): (0, 2, 3)\n", "Best subset (corresponding names): ('0', '2', '3')\n" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.datasets import load_iris\n", "from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "\n", "knn = KNeighborsClassifier(n_neighbors=3)\n", "\n", "efs1 = EFS(knn, \n", " min_features=1,\n", " max_features=4,\n", " scoring='accuracy',\n", " print_progress=True,\n", " cv=5)\n", "\n", "efs1 = efs1.fit(X, y)\n", "\n", "print('Best accuracy score: %.2f' % efs1.best_score_)\n", "print('Best subset (indices):', efs1.best_idx_)\n", "print('Best subset (corresponding names):', efs1.best_feature_names_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "When working with large datasets, the feature indices might be hard to interpret. In this case, we recommend using pandas DataFrames with distinct column names as input:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>Sepal length</th><th>Sepal width</th><th>Petal length</th><th>Petal width</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td></tr>\n",
"<tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td></tr>\n",
"<tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td></tr>\n",
"<tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td></tr>\n",
"<tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td></tr>\n",
"</tbody>\n", "</table>
" ], "text/plain": [ " Sepal length Sepal width Petal length Petal width\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df_X = pd.DataFrame(X, columns=[\"Sepal length\", \"Sepal width\", \"Petal length\", \"Petal width\"])\n", "df_X.head()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Features: 15/15" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best accuracy score: 0.97\n", "Best subset (indices): (0, 2, 3)\n", "Best subset (corresponding names): ('Sepal length', 'Petal length', 'Petal width')\n" ] } ], "source": [ "efs1 = efs1.fit(df_X, y)\n", "\n", "print('Best accuracy score: %.2f' % efs1.best_score_)\n", "print('Best subset (indices):', efs1.best_idx_)\n", "print('Best subset (corresponding names):', efs1.best_feature_names_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detailed Outputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Via the `subsets_` attribute, we can take a look at the selected feature indices at each step:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{0: {'feature_idx': (0,),\n", " 'cv_scores': array([0.53333333, 0.63333333, 0.7 , 0.8 , 0.56666667]),\n", " 'avg_score': 0.6466666666666667,\n", " 'feature_names': ('Sepal length',)},\n", " 1: {'feature_idx': (1,),\n", " 'cv_scores': array([0.43333333, 0.63333333, 0.53333333, 0.43333333, 0.5 ]),\n", " 'avg_score': 0.5066666666666666,\n", " 'feature_names': ('Sepal width',)},\n", " 2: {'feature_idx': (2,),\n", " 'cv_scores': array([0.93333333, 0.93333333, 0.9 , 0.93333333, 1. 
]),\n", " 'avg_score': 0.9400000000000001,\n", " 'feature_names': ('Petal length',)},\n", " 3: {'feature_idx': (3,),\n", " 'cv_scores': array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1. ]),\n", " 'avg_score': 0.96,\n", " 'feature_names': ('Petal width',)},\n", " 4: {'feature_idx': (0, 1),\n", " 'cv_scores': array([0.66666667, 0.8 , 0.7 , 0.86666667, 0.66666667]),\n", " 'avg_score': 0.74,\n", " 'feature_names': ('Sepal length', 'Sepal width')},\n", " 5: {'feature_idx': (0, 2),\n", " 'cv_scores': array([0.96666667, 1. , 0.86666667, 0.93333333, 0.96666667]),\n", " 'avg_score': 0.9466666666666667,\n", " 'feature_names': ('Sepal length', 'Petal length')},\n", " 6: {'feature_idx': (0, 3),\n", " 'cv_scores': array([0.96666667, 0.96666667, 0.9 , 0.93333333, 1. ]),\n", " 'avg_score': 0.9533333333333334,\n", " 'feature_names': ('Sepal length', 'Petal width')},\n", " 7: {'feature_idx': (1, 2),\n", " 'cv_scores': array([0.93333333, 0.93333333, 0.9 , 0.93333333, 0.93333333]),\n", " 'avg_score': 0.9266666666666667,\n", " 'feature_names': ('Sepal width', 'Petal length')},\n", " 8: {'feature_idx': (1, 3),\n", " 'cv_scores': array([0.96666667, 0.96666667, 0.86666667, 0.93333333, 0.96666667]),\n", " 'avg_score': 0.9400000000000001,\n", " 'feature_names': ('Sepal width', 'Petal width')},\n", " 9: {'feature_idx': (2, 3),\n", " 'cv_scores': array([0.96666667, 0.96666667, 0.9 , 0.93333333, 1. ]),\n", " 'avg_score': 0.9533333333333334,\n", " 'feature_names': ('Petal length', 'Petal width')},\n", " 10: {'feature_idx': (0, 1, 2),\n", " 'cv_scores': array([0.96666667, 0.96666667, 0.86666667, 0.93333333, 0.96666667]),\n", " 'avg_score': 0.9400000000000001,\n", " 'feature_names': ('Sepal length', 'Sepal width', 'Petal length')},\n", " 11: {'feature_idx': (0, 1, 3),\n", " 'cv_scores': array([0.93333333, 0.96666667, 0.9 , 0.93333333, 1. 
]),\n", " 'avg_score': 0.9466666666666667,\n", " 'feature_names': ('Sepal length', 'Sepal width', 'Petal width')},\n", " 12: {'feature_idx': (0, 2, 3),\n", " 'cv_scores': array([0.96666667, 0.96666667, 0.96666667, 0.96666667, 1. ]),\n", " 'avg_score': 0.9733333333333334,\n", " 'feature_names': ('Sepal length', 'Petal length', 'Petal width')},\n", " 13: {'feature_idx': (1, 2, 3),\n", " 'cv_scores': array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1. ]),\n", " 'avg_score': 0.96,\n", " 'feature_names': ('Sepal width', 'Petal length', 'Petal width')},\n", " 14: {'feature_idx': (0, 1, 2, 3),\n", " 'cv_scores': array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1. ]),\n", " 'avg_score': 0.9666666666666668,\n", " 'feature_names': ('Sepal length',\n", " 'Sepal width',\n", " 'Petal length',\n", " 'Petal width')}}" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "efs1.subsets_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2 - Visualizing the feature selection results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we can display the output from the feature selection as a pandas DataFrame using the `get_metric_dict` method of the `ExhaustiveFeatureSelector` object. The columns `std_dev` and `std_err` represent the standard deviation and standard error of the cross-validation scores, respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we see the DataFrame of the Exhaustive Feature Selector from Example 1:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Features: 15/15" ] }, { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>feature_idx</th><th>cv_scores</th><th>avg_score</th><th>feature_names</th><th>ci_bound</th><th>std_dev</th><th>std_err</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>12</th><td>(0, 2, 3)</td><td>[0.9666666666666667, 0.9666666666666667, 0.966...</td><td>0.973333</td><td>(Sepal length, Petal length, Petal width)</td><td>0.017137</td><td>0.013333</td><td>0.006667</td></tr>\n",
"<tr><th>14</th><td>(0, 1, 2, 3)</td><td>[0.9666666666666667, 0.9666666666666667, 0.933...</td><td>0.966667</td><td>(Sepal length, Sepal width, Petal length, Peta...</td><td>0.027096</td><td>0.021082</td><td>0.010541</td></tr>\n",
"<tr><th>3</th><td>(3,)</td><td>[0.9666666666666667, 0.9666666666666667, 0.933...</td><td>0.96</td><td>(Petal width,)</td><td>0.032061</td><td>0.024944</td><td>0.012472</td></tr>\n",
"<tr><th>13</th><td>(1, 2, 3)</td><td>[0.9666666666666667, 0.9666666666666667, 0.933...</td><td>0.96</td><td>(Sepal width, Petal length, Petal width)</td><td>0.032061</td><td>0.024944</td><td>0.012472</td></tr>\n",
"<tr><th>6</th><td>(0, 3)</td><td>[0.9666666666666667, 0.9666666666666667, 0.9, ...</td><td>0.953333</td><td>(Sepal length, Petal width)</td><td>0.043691</td><td>0.033993</td><td>0.016997</td></tr>\n",
"<tr><th>9</th><td>(2, 3)</td><td>[0.9666666666666667, 0.9666666666666667, 0.9, ...</td><td>0.953333</td><td>(Petal length, Petal width)</td><td>0.043691</td><td>0.033993</td><td>0.016997</td></tr>\n",
"<tr><th>5</th><td>(0, 2)</td><td>[0.9666666666666667, 1.0, 0.8666666666666667, ...</td><td>0.946667</td><td>(Sepal length, Petal length)</td><td>0.058115</td><td>0.045216</td><td>0.022608</td></tr>\n",
"<tr><th>11</th><td>(0, 1, 3)</td><td>[0.9333333333333333, 0.9666666666666667, 0.9, ...</td><td>0.946667</td><td>(Sepal length, Sepal width, Petal width)</td><td>0.043691</td><td>0.033993</td><td>0.016997</td></tr>\n",
"<tr><th>2</th><td>(2,)</td><td>[0.9333333333333333, 0.9333333333333333, 0.9, ...</td><td>0.94</td><td>(Petal length,)</td><td>0.041977</td><td>0.03266</td><td>0.01633</td></tr>\n",
"<tr><th>8</th><td>(1, 3)</td><td>[0.9666666666666667, 0.9666666666666667, 0.866...</td><td>0.94</td><td>(Sepal width, Petal width)</td><td>0.049963</td><td>0.038873</td><td>0.019437</td></tr>\n",
"<tr><th>10</th><td>(0, 1, 2)</td><td>[0.9666666666666667, 0.9666666666666667, 0.866...</td><td>0.94</td><td>(Sepal length, Sepal width, Petal length)</td><td>0.049963</td><td>0.038873</td><td>0.019437</td></tr>\n",
"<tr><th>7</th><td>(1, 2)</td><td>[0.9333333333333333, 0.9333333333333333, 0.9, ...</td><td>0.926667</td><td>(Sepal width, Petal length)</td><td>0.017137</td><td>0.013333</td><td>0.006667</td></tr>\n",
"<tr><th>4</th><td>(0, 1)</td><td>[0.6666666666666666, 0.8, 0.7, 0.8666666666666...</td><td>0.74</td><td>(Sepal length, Sepal width)</td><td>0.102823</td><td>0.08</td><td>0.04</td></tr>\n",
"<tr><th>0</th><td>(0,)</td><td>[0.5333333333333333, 0.6333333333333333, 0.7, ...</td><td>0.646667</td><td>(Sepal length,)</td><td>0.122983</td><td>0.095685</td><td>0.047842</td></tr>\n",
"<tr><th>1</th><td>(1,)</td><td>[0.43333333333333335, 0.6333333333333333, 0.53...</td><td>0.506667</td><td>(Sepal width,)</td><td>0.095416</td><td>0.074237</td><td>0.037118</td></tr>\n",
"</tbody>\n", "</table>
" ], "text/plain": [ " feature_idx cv_scores avg_score \\\n", "12 (0, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.966... 0.973333 \n", "14 (0, 1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.966667 \n", "3 (3,) [0.9666666666666667, 0.9666666666666667, 0.933... 0.96 \n", "13 (1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.96 \n", "6 (0, 3) [0.9666666666666667, 0.9666666666666667, 0.9, ... 0.953333 \n", "9 (2, 3) [0.9666666666666667, 0.9666666666666667, 0.9, ... 0.953333 \n", "5 (0, 2) [0.9666666666666667, 1.0, 0.8666666666666667, ... 0.946667 \n", "11 (0, 1, 3) [0.9333333333333333, 0.9666666666666667, 0.9, ... 0.946667 \n", "2 (2,) [0.9333333333333333, 0.9333333333333333, 0.9, ... 0.94 \n", "8 (1, 3) [0.9666666666666667, 0.9666666666666667, 0.866... 0.94 \n", "10 (0, 1, 2) [0.9666666666666667, 0.9666666666666667, 0.866... 0.94 \n", "7 (1, 2) [0.9333333333333333, 0.9333333333333333, 0.9, ... 0.926667 \n", "4 (0, 1) [0.6666666666666666, 0.8, 0.7, 0.8666666666666... 0.74 \n", "0 (0,) [0.5333333333333333, 0.6333333333333333, 0.7, ... 0.646667 \n", "1 (1,) [0.43333333333333335, 0.6333333333333333, 0.53... 0.506667 \n", "\n", " feature_names ci_bound std_dev \\\n", "12 (Sepal length, Petal length, Petal width) 0.017137 0.013333 \n", "14 (Sepal length, Sepal width, Petal length, Peta... 
0.027096 0.021082 \n", "3 (Petal width,) 0.032061 0.024944 \n", "13 (Sepal width, Petal length, Petal width) 0.032061 0.024944 \n", "6 (Sepal length, Petal width) 0.043691 0.033993 \n", "9 (Petal length, Petal width) 0.043691 0.033993 \n", "5 (Sepal length, Petal length) 0.058115 0.045216 \n", "11 (Sepal length, Sepal width, Petal width) 0.043691 0.033993 \n", "2 (Petal length,) 0.041977 0.03266 \n", "8 (Sepal width, Petal width) 0.049963 0.038873 \n", "10 (Sepal length, Sepal width, Petal length) 0.049963 0.038873 \n", "7 (Sepal width, Petal length) 0.017137 0.013333 \n", "4 (Sepal length, Sepal width) 0.102823 0.08 \n", "0 (Sepal length,) 0.122983 0.095685 \n", "1 (Sepal width,) 0.095416 0.074237 \n", "\n", " std_err \n", "12 0.006667 \n", "14 0.010541 \n", "3 0.012472 \n", "13 0.012472 \n", "6 0.016997 \n", "9 0.016997 \n", "5 0.022608 \n", "11 0.016997 \n", "2 0.01633 \n", "8 0.019437 \n", "10 0.019437 \n", "7 0.006667 \n", "4 0.04 \n", "0 0.047842 \n", "1 0.037118 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "\n", "knn = KNeighborsClassifier(n_neighbors=3)\n", "\n", "efs1 = EFS(knn, \n", " min_features=1,\n", " max_features=4,\n", " scoring='accuracy',\n", " print_progress=True,\n", " cv=5)\n", "\n", "feature_names = ('sepal length', 'sepal width',\n", " 'petal length', 'petal width')\n", "\n", "df_X = pd.DataFrame(\n", " X, columns=[\"Sepal length\", \"Sepal width\", \"Petal length\", \"Petal width\"])\n", "efs1 = efs1.fit(df_X, y)\n", "\n", "df = pd.DataFrame.from_dict(efs1.get_metric_dict()).T\n", "df.sort_values('avg_score', inplace=True, ascending=False)\n", "df" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "metric_dict = efs1.get_metric_dict()\n", "\n", "fig = plt.figure()\n", "k_feat = sorted(metric_dict.keys())\n", "avg = [metric_dict[k]['avg_score'] for k in k_feat]\n", "\n", "upper, lower = [], []\n", "for k in k_feat:\n", " upper.append(metric_dict[k]['avg_score'] +\n", " metric_dict[k]['std_dev'])\n", " lower.append(metric_dict[k]['avg_score'] -\n", " metric_dict[k]['std_dev'])\n", " \n", "plt.fill_between(k_feat,\n", " upper,\n", " lower,\n", " alpha=0.2,\n", " color='blue',\n", " lw=1)\n", "\n", "plt.plot(k_feat, avg, color='blue', marker='o')\n", "plt.ylabel('Accuracy +/- Standard Deviation')\n", "plt.xlabel('Feature Subset')\n", "feature_min = len(metric_dict[k_feat[0]]['feature_idx'])\n", "feature_max = len(metric_dict[k_feat[-1]]['feature_idx'])\n", "plt.xticks(k_feat, \n", " [str(metric_dict[k]['feature_names']) for k in k_feat], \n", " rotation=90)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 3 - Exhaustive feature selection for regression analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the classification examples above, the `ExhaustiveFeatureSelector` also supports scikit-learn's estimators\n", "for regression." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/sebastianraschka/miniforge3/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.\n", "\n", " The Boston housing prices dataset has an ethical problem. 
You can refer to\n", " the documentation of this function for further details.\n", "\n", " The scikit-learn maintainers therefore strongly discourage the use of this\n", " dataset unless the purpose of the code is to study and educate about\n", " ethical issues in data science and machine learning.\n", "\n", " In this special case, you can fetch the dataset from the original\n", " source::\n", "\n", " import pandas as pd\n", " import numpy as np\n", "\n", "\n", " data_url = \"https://lib.stat.cmu.edu/datasets/boston\"\n", " raw_df = pd.read_csv(data_url, sep=\"\\s+\", skiprows=22, header=None)\n", " data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])\n", " target = raw_df.values[1::2, 2]\n", "\n", " Alternative datasets include the California housing dataset (i.e.\n", " :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing\n", " dataset. You can load the datasets as follows::\n", "\n", " from sklearn.datasets import fetch_california_housing\n", " housing = fetch_california_housing()\n", "\n", " for the California housing dataset and::\n", "\n", " from sklearn.datasets import fetch_openml\n", " housing = fetch_openml(name=\"house_prices\", as_frame=True)\n", "\n", " for the Ames housing dataset.\n", " \n", " warnings.warn(msg, category=FutureWarning)\n", "Features: 377/377" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Best subset: (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.datasets import load_boston\n", "\n", "boston = load_boston()\n", "X, y = boston.data, boston.target\n", "\n", "lr = LinearRegression()\n", "\n", "efs = EFS(lr, \n", " min_features=10,\n", " max_features=12,\n", " scoring='neg_mean_squared_error',\n", " cv=10)\n", "\n", "efs.fit(X, y)\n", "\n", "# Parenthesize the product so the score, not the formatted string, is negated\n", "print('Best MSE score: %.2f' % (efs.best_score_ * (-1)))\n", "print('Best subset:', efs.best_idx_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 
4 - Regression and adjusted R2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown in Example 3, the exhaustive feature selector can be used for selecting features via a regression model. In regression analysis, the $R^2$ score commonly becomes spuriously inflated as more features are added. Hence, and this is especially true for feature selection, it is useful to make model comparisons based on the adjusted $R^2$ value rather than the regular $R^2$. The adjusted $R^2$, $\\bar{R}^{2}$, accounts for the number of features and examples as follows:\n", "\n", "$$\\bar{R}^{2}=1-\\left(1-R^{2}\\right) \\frac{n-1}{n-p-1},$$\n", "\n", "where $n$ is the number of examples and $p$ is the number of features.\n", "\n", "One of the advantages of scikit-learn's API is that it's consistent, intuitive, and simple to use. However, one downside of this API design is that it can be a bit restrictive for certain scenarios. For instance, scikit-learn scoring functions take only two inputs, the predicted and the true target values. Hence, we cannot use scikit-learn's scoring API to compute the adjusted $R^2$, which also requires the number of features.\n", "\n", "However, as a workaround, we can compute the $R^2$ for the different feature subsets and then do a post hoc computation to obtain the adjusted $R^2$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 1: Compute $R^2$:**" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/sebastianraschka/miniforge3/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.\n", "\n", " The Boston housing prices dataset has an ethical problem. 
You can refer to\n", " the documentation of this function for further details.\n", "\n", " The scikit-learn maintainers therefore strongly discourage the use of this\n", " dataset unless the purpose of the code is to study and educate about\n", " ethical issues in data science and machine learning.\n", "\n", " In this special case, you can fetch the dataset from the original\n", " source::\n", "\n", " import pandas as pd\n", " import numpy as np\n", "\n", "\n", " data_url = \"https://lib.stat.cmu.edu/datasets/boston\"\n", " raw_df = pd.read_csv(data_url, sep=\"\\s+\", skiprows=22, header=None)\n", " data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])\n", " target = raw_df.values[1::2, 2]\n", "\n", " Alternative datasets include the California housing dataset (i.e.\n", " :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing\n", " dataset. You can load the datasets as follows::\n", "\n", " from sklearn.datasets import fetch_california_housing\n", " housing = fetch_california_housing()\n", "\n", " for the California housing dataset and::\n", "\n", " from sklearn.datasets import fetch_openml\n", " housing = fetch_openml(name=\"house_prices\", as_frame=True)\n", "\n", " for the Ames housing dataset.\n", " \n", " warnings.warn(msg, category=FutureWarning)\n", "Features: 377/377" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Best subset: (1, 3, 5, 6, 7, 8, 9, 10, 11, 12)\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.datasets import load_boston\n", "\n", "boston = load_boston()\n", "X, y = boston.data, boston.target\n", "\n", "lr = LinearRegression()\n", "\n", "efs = EFS(lr, \n", " min_features=10,\n", " max_features=12,\n", " scoring='r2',\n", " cv=10)\n", "\n", "efs.fit(X, y)\n", "\n", "# R2 is already a \"higher is better\" score, so no sign flip is needed\n", "print('Best R2 score: %.2f' % efs.best_score_)\n", "print('Best subset:', efs.best_idx_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 2: Compute adjusted 
$R^2$:**" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def adjust_r2(r2, num_examples, num_features):\n", "    coef = (num_examples - 1) / (num_examples - num_features - 1)\n", "    return 1 - (1 - r2) * coef" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "for i in efs.subsets_:\n", "    efs.subsets_[i]['adjusted_avg_score'] = (\n", "        adjust_r2(r2=efs.subsets_[i]['avg_score'],\n", "                  num_examples=X.shape[0]/10,\n", "                  num_features=len(efs.subsets_[i]['feature_idx']))\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 3: Select best subset based on adjusted $R^2$:**" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "score = -99e10\n", "\n", "for i in efs.subsets_:\n", "    adj_score = efs.subsets_[i]['adjusted_avg_score']\n", "    n_feat = len(efs.subsets_[i]['feature_idx'])\n", "    # A higher adjusted R2 wins; ties go to the smaller subset\n", "    if adj_score > score or (\n", "            adj_score == score and n_feat < len(efs.best_idx_)):\n", "        score = adj_score\n", "        efs.best_idx_ = efs.subsets_[i]['feature_idx']" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Best subset: (1, 3, 5, 6, 7, 8, 9, 10, 11, 12)\n" ] } ], "source": [ "print('Best adjusted R2 score: %.2f' % score)\n", "print('Best subset:', efs.best_idx_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 5 - Using the selected feature subset for making new predictions" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# Initialize the dataset\n", "\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, 
test_size=0.33, random_state=1)\n", "\n", "knn = KNeighborsClassifier(n_neighbors=3)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Features: 15/15" ] } ], "source": [ "# Select the \"best\" feature subset (1 to 4 features) via\n", "# 5-fold cross-validation on the training set.\n", "\n", "from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS\n", "\n", "efs1 = EFS(knn, \n", " min_features=1,\n", " max_features=4,\n", " scoring='accuracy',\n", " cv=5)\n", "efs1 = efs1.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected features: (2, 3)\n" ] } ], "source": [ "print('Selected features:', efs1.best_idx_)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy: 96.00 %\n" ] } ], "source": [ "# Generate the new subsets based on the selected features\n", "# Note that the transform call is equivalent to\n", "# X_train[:, efs1.best_idx_]\n", "\n", "X_train_efs = efs1.transform(X_train)\n", "X_test_efs = efs1.transform(X_test)\n", "\n", "# Fit the estimator using the new feature subset\n", "# and make a prediction on the test data\n", "knn.fit(X_train_efs, y_train)\n", "y_pred = knn.predict(X_test_efs)\n", "\n", "# Compute the accuracy of the prediction\n", "acc = float((y_test == y_pred).sum()) / y_pred.shape[0]\n", "print('Test set accuracy: %.2f %%' % (acc*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 6 - Exhaustive feature selection and GridSearch" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "# Initialize the dataset\n", "\n", "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "\n", "iris = load_iris()\n", "X, y = iris.data, 
iris.target\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.33, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use scikit-learn's `GridSearchCV` to tune the hyperparameters of the `LogisticRegression` estimator inside the `ExhaustiveFeatureSelector` and use it for prediction in the pipeline. **Note that the `clone_estimator` parameter needs to be set to `False`.**" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 2 folds for each of 3 candidates, totalling 6 fits\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.linear_model import LogisticRegression\n", "from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS\n", "\n", "lr = LogisticRegression(multi_class='multinomial', \n", " solver='newton-cg', \n", " random_state=123)\n", "\n", "efs1 = EFS(estimator=lr, \n", " min_features=2,\n", " max_features=3,\n", " scoring='accuracy',\n", " print_progress=False,\n", " clone_estimator=False,\n", " cv=5,\n", " n_jobs=1)\n", "\n", "pipe = make_pipeline(efs1, lr)\n", "\n", "param_grid = {'exhaustivefeatureselector__estimator__C': [0.1, 1.0, 10.0]}\n", " \n", "gs = GridSearchCV(estimator=pipe, \n", " param_grid=param_grid, \n", " scoring='accuracy', \n", " n_jobs=1, \n", " cv=2, \n", " verbose=1, \n", " refit=False)\n", "\n", "# run grid search\n", "gs = gs.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and the \"best\" parameters determined by GridSearch are ..." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best parameters via GridSearch {'exhaustivefeatureselector__estimator__C': 0.1}\n" ] } ], "source": [ "print(\"Best parameters via GridSearch\", gs.best_params_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Obtaining the best feature indices after GridSearch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we are interested in the best feature indices via `ExhaustiveFeatureSelector.best_idx_`, we have to initialize a `GridSearchCV` object with `refit=True`. Now, the grid search object will take the complete training dataset and the best parameters, which it found via cross-validation, to train the estimator pipeline." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "gs = GridSearchCV(estimator=pipe, \n", " param_grid=param_grid, \n", " scoring='accuracy', \n", " n_jobs=1, \n", " cv=2, \n", " verbose=1, \n", " refit=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After running the grid search, we can access the individual pipeline objects of the `best_estimator_` via the `steps` attribute." 
] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 2 folds for each of 3 candidates, totalling 6 fits\n" ] }, { "data": { "text/plain": [ "[('exhaustivefeatureselector',\n", " ExhaustiveFeatureSelector(clone_estimator=False,\n", " estimator=LogisticRegression(C=0.1,\n", " multi_class='multinomial',\n", " random_state=123,\n", " solver='newton-cg'),\n", " feature_groups=[[0], [1], [2], [3]], max_features=3,\n", " min_features=2, print_progress=False)),\n", " ('logisticregression',\n", " LogisticRegression(multi_class='multinomial', random_state=123,\n", " solver='newton-cg'))]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gs = gs.fit(X_train, y_train)\n", "gs.best_estimator_.steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Via sub-indexing, we can then obtain the best feature subset:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best features: (2, 3)\n" ] } ], "source": [ "print('Best features:', gs.best_estimator_.steps[0][1].best_idx_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During cross-validation, this feature combination had a CV accuracy of:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best score: 0.96\n" ] } ], "source": [ "print('Best score:', gs.best_score_)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'exhaustivefeatureselector__estimator__C': 0.1}" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gs.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Alternatively**, we can set the \"best grid search parameters\" in our pipeline manually if we ran 
`GridSearchCV` with `refit=False`. It should yield the same results:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best features: (2, 3)\n" ] } ], "source": [ "pipe.set_params(**gs.best_params_).fit(X_train, y_train)\n", "print('Best features:', pipe.steps[0][1].best_idx_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 7 - Exhaustive Feature Selection with LOOCV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ExhaustiveFeatureSelector` is not restricted to k-fold cross-validation. You can use any type of cross-validation method that supports the general scikit-learn cross-validation API. \n", "\n", "The following example illustrates the use of scikit-learn's `LeaveOneOut` cross-validation method in combination with the exhaustive feature selector." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Features: 15/15" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best accuracy score: 0.96\n", "Best subset (indices): (3,)\n", "Best subset (corresponding names): ('3',)\n" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.datasets import load_iris\n", "from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS\n", "from sklearn.model_selection import LeaveOneOut\n", "\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "\n", "knn = KNeighborsClassifier(n_neighbors=3)\n", "\n", "efs1 = EFS(knn, \n", " min_features=1,\n", " max_features=4,\n", " scoring='accuracy',\n", " print_progress=True,\n", " cv=LeaveOneOut()) ### Use cross-validation generator here\n", "\n", "efs1 = efs1.fit(X, y)\n", "\n", "print('Best accuracy score: %.2f' % efs1.best_score_)\n", "print('Best subset (indices):', efs1.best_idx_)\n", "print('Best subset (corresponding names):', 
efs1.best_feature_names_)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Example 8 - Interrupting Long Runs for Intermediate Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If your run is taking too long, it is possible to trigger a `KeyboardInterrupt` (e.g., ctrl+c on a Mac, or interrupting the cell in a Jupyter notebook) to obtain temporary results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Toy dataset**" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_classification\n", "from sklearn.model_selection import train_test_split\n", "\n", "\n", "X, y = make_classification(\n", " n_samples=200000,\n", " n_features=6,\n", " n_informative=2,\n", " n_redundant=1,\n", " n_repeated=1,\n", " n_clusters_per_class=2,\n", " flip_y=0.05,\n", " class_sep=0.5,\n", " random_state=123,\n", ")\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=123\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Long run with interruption**" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Features: 56/56" ] } ], "source": [ "from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "model = LogisticRegression(max_iter=10000)\n", "\n", "efs1 = EFS(model, \n", " min_features=1, \n", " max_features=4,\n", " print_progress=True,\n", " scoring='accuracy')\n", "\n", "efs1 = efs1.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finalizing the fit**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the feature selection run hasn't finished, so certain attributes may not be available. 
In order to use the EFS instance, it is recommended to call `finalize_fit`, which will make the EFS estimator appear as \"fitted\" and process the temporary results:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "efs1.finalize_fit()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best accuracy score: 0.73\n", "Best subset (indices): (1, 2)\n" ] } ], "source": [ "print('Best accuracy score: %.2f' % efs1.best_score_)\n", "print('Best subset (indices):', efs1.best_idx_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 9 - Working with Feature Groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since mlxtend v0.21.0, it is possible to specify feature groups. Feature groups allow you to group certain features together, such that they are always selected as a group. This can be very useful in contexts similar to one-hot encoding, where you want to treat all columns of a one-hot encoded feature as a single feature:\n", "\n", "![](SequentialFeatureSelector_files/feature_groups.jpeg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following example, we specify sepal length and sepal width as a feature group so that they are always selected together:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
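<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
 "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal len</th>\n", "      <th>petal len</th>\n", "      <th>sepal wid</th>\n", "      <th>petal wid</th>\n", "    </tr>\n",
 "  </thead>\n", "  <tbody>\n",
 "    <tr>\n", "      <th>0</th>\n", "      <td>5.1</td>\n", "      <td>3.5</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "    </tr>\n",
 "    <tr>\n", "      <th>1</th>\n", "      <td>4.9</td>\n", "      <td>3.0</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "    </tr>\n",
 "    <tr>\n", "      <th>2</th>\n", "      <td>4.7</td>\n", "      <td>3.2</td>\n", "      <td>1.3</td>\n", "      <td>0.2</td>\n", "    </tr>\n",
 "    <tr>\n", "      <th>3</th>\n", "      <td>4.6</td>\n", "      <td>3.1</td>\n", "      <td>1.5</td>\n", "      <td>0.2</td>\n", "    </tr>\n",
 "    <tr>\n", "      <th>4</th>\n", "      <td>5.0</td>\n", "      <td>3.6</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "    </tr>\n",
 "  </tbody>\n", "</table>\n", "</div>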
" ], "text/plain": [ " sepal len petal len sepal wid petal wid\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import load_iris\n", "import pandas as pd\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "\n", "X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',\n", " 'sepal wid', 'petal wid'])\n", "X_df.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Features: 3/3" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best accuracy score: 0.97\n", "Best subset (indices): (0, 2, 3)\n", "Best subset (corresponding names): ('sepal len', 'sepal wid', 'petal wid')\n" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS\n", "\n", "knn = KNeighborsClassifier(n_neighbors=3)\n", "\n", "efs1 = EFS(knn, \n", " min_features=2,\n", " max_features=2,\n", " scoring='accuracy',\n", " feature_groups=[['sepal len', 'sepal wid'], ['petal len'], ['petal wid']],\n", " cv=3)\n", "\n", "efs1 = efs1.fit(X_df, y)\n", "\n", "print('Best accuracy score: %.2f' % efs1.best_score_)\n", "print('Best subset (indices):', efs1.best_idx_)\n", "print('Best subset (corresponding names):', efs1.best_feature_names_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the returned number of features is 3, since the number of `min_features` and `max_features` corresponds to the number of feature groups. I.e., we have 2 feature groups in `['sepal len', 'sepal wid'], ['petal wid']`, but it expands to 3 features." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "## ExhaustiveFeatureSelector\n", "\n", "*ExhaustiveFeatureSelector(estimator, min_features=1, max_features=1, print_progress=True, scoring='accuracy', cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)*\n", "\n", "Exhaustive Feature Selection for Classification and Regression.\n", " (new in v0.4.3)\n", "\n", "**Parameters**\n", "\n", "- `estimator` : scikit-learn classifier or regressor\n", "\n", "\n", "\n", "- `min_features` : int (default: 1)\n", "\n", " Minimum number of features to select\n", "\n", "\n", "- `max_features` : int (default: 1)\n", "\n", " Maximum number of features to select. If parameter `feature_groups` is not\n", " None, the number of features is equal to the number of feature groups, i.e.\n", " `len(feature_groups)`. 
For example, if `feature_groups = [[0], [1], [2, 3],\n", " [4]]`, then the `max_features` value cannot exceed 4.\n", "\n", "\n", "- `print_progress` : bool (default: True)\n", "\n", " Prints progress as the number of epochs\n", " to stderr.\n", "\n", "\n", "- `scoring` : str, (default='accuracy')\n", "\n", " Scoring metric in {accuracy, f1, precision, recall, roc_auc}\n", " for classifiers,\n", " {'mean_absolute_error', 'mean_squared_error',\n", " 'median_absolute_error', 'r2'} for regressors,\n", " or a callable object or function with\n", " signature ``scorer(estimator, X, y)``.\n", "\n", "\n", "- `cv` : int (default: 5)\n", "\n", " Scikit-learn cross-validation generator or `int`.\n", " If estimator is a classifier (or y consists of integer class labels),\n", " stratified k-fold is performed, and regular k-fold cross-validation\n", " otherwise.\n", " No cross-validation if cv is None, False, or 0.\n", "\n", "\n", "- `n_jobs` : int (default: 1)\n", "\n", " The number of CPUs to use for evaluating different feature subsets\n", " in parallel. -1 means 'all CPUs'.\n", "\n", "\n", "- `pre_dispatch` : int, or string (default: '2*n_jobs')\n", "\n", " Controls the number of jobs that get dispatched\n", " during parallel execution if `n_jobs > 1` or `n_jobs=-1`.\n", " Reducing this number can be useful to avoid an explosion of\n", " memory consumption when more jobs get dispatched than CPUs can process.\n", " This parameter can be:\n", " None, in which case all the jobs are immediately created and spawned.\n", " Use this for lightweight and fast-running jobs,\n", " to avoid delays due to on-demand spawning of the jobs\n", " An int, giving the exact number of total jobs that are spawned\n", " A string, giving an expression as a function\n", " of n_jobs, as in `2*n_jobs`\n", "\n", "\n", "- `clone_estimator` : bool (default: True)\n", "\n", " Clones estimator if True; works with the original estimator instance\n", " if False. 
Set to False if the estimator doesn't\n", " implement scikit-learn's set_params and get_params methods.\n", " In addition, it is required to set cv=0, and n_jobs=1.\n", "\n", "\n", "- `fixed_features` : tuple (default: None)\n", "\n", " If not `None`, the feature indices provided as a tuple will be\n", " regarded as fixed by the feature selector. For example, if\n", " `fixed_features=(1, 3, 7)`, the 2nd, 4th, and 8th features are\n", " guaranteed to be present in the solution. Note that if\n", " `fixed_features` is not `None`, make sure that the number of\n", " features to be selected is greater than `len(fixed_features)`.\n", " In other words, ensure that `k_features > len(fixed_features)`.\n", "\n", "\n", "- `feature_groups` : list or None (default: None)\n", "\n", " Optional argument for treating certain features as a group.\n", " This means that the features within a group are always selected together,\n", " never split.\n", " For example, `feature_groups=[[1], [2], [3, 4, 5]]`\n", " specifies 3 feature groups. In this case,\n", " possible feature selection results with `k_features=2`\n", " are `[[1], [2]]`, `[[1], [3, 4, 5]]`, or `[[2], [3, 4, 5]]`.\n", " Feature groups can be useful for\n", " interpretability, for example, if features 3, 4, 5 are one-hot\n", " encoded features. (For more details, please read the notes at the\n", " bottom of this docstring). New in mlxtend v. 0.21.0.\n", "\n", "**Attributes**\n", "\n", "- `best_idx_` : array-like, shape = [n_predictions]\n", "\n", " Feature indices of the selected feature subsets.\n", "\n", "\n", "- `best_feature_names_` : array-like, shape = [n_predictions]\n", "\n", " Feature names of the selected feature subsets. If pandas\n", " DataFrames are used in the `fit` method, the feature\n", " names correspond to the column names. Otherwise, the\n", " feature names are string representations of the feature\n", " array indices. 
New in v 0.13.0.\n", "\n", "\n", "- `best_score_` : float\n", "\n", " Cross-validation average score of the selected subset.\n", "\n", "\n", "- `subsets_` : dict\n", "\n", " A dictionary of selected feature subsets during the\n", " exhaustive selection, where the dictionary keys are\n", " the lengths k of these feature subsets. The dictionary\n", " values are dictionaries themselves with the following\n", " keys: 'feature_idx' (tuple of indices of the feature subset)\n", " 'feature_names' (tuple of feature names of the feat. subset)\n", " 'cv_scores' (list of individual cross-validation scores)\n", " 'avg_score' (average cross-validation score)\n", " Note that if pandas\n", " DataFrames are used in the `fit` method, the 'feature_names'\n", " correspond to the column names. Otherwise, the\n", " feature names are string representations of the feature\n", " array indices. The 'feature_names' is new in v. 0.13.0.\n", "\n", "**Notes**\n", "\n", "(1) If parameter `feature_groups` is not None, the\n", " number of features is equal to the number of feature groups, i.e.\n", " `len(feature_groups)`. For example, if `feature_groups = [[0], [1], [2, 3],\n", " [4]]`, then the `max_features` value cannot exceed 4.\n", "\n", " (2) Although two or more individual features may be considered as one group\n", " throughout the feature-selection process, it does not mean the individual\n", " features of that group have the same impact on the outcome. For instance, in\n", " linear regression, the coefficients of features 2 and 3 can be different\n", " even if they are considered as one group in feature_groups.\n", "\n", " (3) If both fixed_features and feature_groups are specified, ensure that each\n", " feature group contains the fixed_features selection. 
E.g., for a 3-feature set\n", " fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid;\n", " fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.\n", "\n", "**Examples**\n", "\n", "For usage examples, please see\n", " https://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/\n", "\n", "### Methods\n", "\n", "
\n", "\n", "*fit(X, y, groups=None, **fit_params)*\n", "\n", "Perform feature selection and learn model from training data.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", " New in v 0.13.0: pandas DataFrames are now also accepted as\n", " argument for X.\n", "\n", "\n", "- `y` : array-like, shape = [n_samples]\n", "\n", " Target values.\n", "\n", "\n", "- `groups` : array-like, with shape (n_samples,), optional\n", "\n", " Group labels for the samples used while splitting the dataset into\n", " train/test set. Passed to the fit method of the cross-validator.\n", "\n", "\n", "- `fit_params` : dict of string -> object, optional\n", "\n", " Parameters to pass to the fit method of the classifier.\n", "\n", "**Returns**\n", "\n", "- `self` : object\n", "\n", "\n", "
\n", "\n", "*fit_transform(X, y, groups=None, **fit_params)*\n", "\n", "Fit to training data and return the best selected features from X.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", " New in v 0.13.0: pandas DataFrames are now also accepted as\n", " argument for X.\n", "\n", "- `y` : array-like, shape = [n_samples]\n", "\n", " Target values.\n", "\n", "- `groups` : array-like, with shape (n_samples,), optional\n", "\n", " Group labels for the samples used while splitting the dataset into\n", " train/test set. Passed to the fit method of the cross-validator.\n", "\n", "- `fit_params` : dict of string -> object, optional\n", "\n", " Parameters to pass to the fit method of the classifier.\n", "\n", "**Returns**\n", "\n", "Feature subset of X, shape={n_samples, k_features}\n", "\n", "
\n", "\n", "*get_metric_dict(confidence_interval=0.95)*\n", "\n", "Return metric dictionary\n", "\n", "**Parameters**\n", "\n", "- `confidence_interval` : float (default: 0.95)\n", "\n", " A positive float between 0.0 and 1.0 to compute the confidence\n", " interval bounds of the CV score averages.\n", "\n", "**Returns**\n", "\n", "Dictionary with items where each dictionary value is a list\n", " with the number of iterations (number of feature subsets) as\n", " its length. The dictionary keys corresponding to these lists\n", " are as follows:\n", " 'feature_idx': tuple of the indices of the feature subset\n", " 'cv_scores': list with individual CV scores\n", " 'avg_score': of CV average scores\n", " 'std_dev': standard deviation of the CV score average\n", " 'std_err': standard error of the CV score average\n", " 'ci_bound': confidence interval bound of the CV score average\n", "\n", "
\n", "\n", "*get_params(deep=True)*\n", "\n", "Get parameters for this estimator.\n", "\n", "**Parameters**\n", "\n", "- `deep` : bool, default=True\n", "\n", " If True, will return the parameters for this estimator and\n", " contained subobjects that are estimators.\n", "\n", "**Returns**\n", "\n", "- `params` : dict\n", "\n", " Parameter names mapped to their values.\n", "\n", "
\n", "\n", "*set_params(**params)*\n", "\n", "Set the parameters of this estimator.\n", "\n", " The method works on simple estimators as well as on nested objects\n", " (such as :class:`~sklearn.pipeline.Pipeline`). The latter have\n", " parameters of the form ``<component>__<parameter>`` so that it's\n", " possible to update each component of a nested object.\n", "\n", "**Parameters**\n", "\n", "- `**params` : dict\n", "\n", " Estimator parameters.\n", "\n", "**Returns**\n", "\n", "- `self` : estimator instance\n", "\n", " Estimator instance.\n", "\n", "
\n", "\n", "*transform(X)*\n", "\n", "Return the best selected features from X.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", " New in v 0.13.0: pandas DataFrames are now also accepted as\n", " argument for X.\n", "\n", "**Returns**\n", "\n", "Feature subset of X, shape={n_samples, k_features}\n", "\n", "\n" ] } ], "source": [ "with open('../../api_modules/mlxtend.feature_selection/ExhaustiveFeatureSelector.md', 'r') as f:\n", " print(f.read())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }