{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Sebastian Raschka, 2015-2022 \n", "`mlxtend`, a library of extension and helper modules for Python's data analysis and machine learning libraries\n", "\n", "- GitHub repository: https://github.com/rasbt/mlxtend\n", "- Documentation: https://rasbt.github.io/mlxtend/" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Author: Sebastian Raschka\n", "\n", "Last updated: 2023-05-17\n", "\n", "Python implementation: CPython\n", "Python version : 3.9.16\n", "IPython version : 8.13.2\n", "\n", "matplotlib: 3.7.1\n", "numpy : 1.24.3\n", "scipy : 1.10.1\n", "mlxtend : 0.23.0.dev0\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -u -d -v -p matplotlib,numpy,scipy,mlxtend" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# SequentialFeatureSelector: The popular forward and backward feature selection approaches (including floating variants)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Implementation of *sequential feature algorithms* (SFAs) -- greedy search algorithms -- that have been developed as a suboptimal solution to the computationally often not feasible exhaustive search." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> from mlxtend.feature_selection import SequentialFeatureSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial *d*-dimensional feature space to a *k*-dimensional feature subspace where *k < d*. 
The motivation behind feature selection algorithms is to automatically select a subset of features most relevant to the problem. The goal of feature selection is two-fold: We want to improve the computational efficiency and reduce the model's generalization error by removing irrelevant features or noise. In addition, a wrapper approach such as sequential feature selection is advantageous if embedded feature selection -- for example, a regularization penalty like LASSO -- is not applicable.\n", "\n", "In a nutshell, SFAs remove or add one feature at a time based on the classifier performance until a feature subset of the desired size *k* is reached. There are four different flavors of SFAs available via the `SequentialFeatureSelector`:\n", "\n", "1. Sequential Forward Selection (SFS)\n", "2. Sequential Backward Selection (SBS)\n", "3. Sequential Forward Floating Selection (SFFS)\n", "4. Sequential Backward Floating Selection (SBFS)\n", "\n", "The ***floating*** variants, SFFS and SBFS, can be considered extensions to the simpler SFS and SBS algorithms. The floating algorithms have an additional exclusion or inclusion step to remove features once they were included (or excluded) so that a larger number of feature subset combinations can be sampled. It is important to emphasize that this step is conditional and only occurs if the resulting feature subset is assessed as \"better\" by the criterion function after the removal (or addition) of a particular feature. Furthermore, I added an optional check to skip the conditional exclusion steps if the algorithm gets stuck in cycles. \n", "\n", "\n", "---\n", "\n", "How is this different from *Recursive Feature Elimination* (RFE) -- e.g., as implemented in `sklearn.feature_selection.RFE`? 
RFE is computationally less complex using the feature weight coefficients (e.g., linear models) or feature importance (tree-based algorithms) to eliminate features recursively, whereas SFSs eliminate (or add) features based on a user-defined classifier/regression performance metric.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial Videos" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Visual Illustration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A visual illustration of the sequential backward selection process is provided below, from the paper\n", "\n", "- Joe Bemister-Buffington, Alex J. Wolf, Sebastian Raschka, and Leslie A. Kuhn (2020)\n", "Machine Learning to Identify Flexibility Signatures of Class A GPCR Inhibition\n", "Biomolecules 2020, 10, 454. https://www.mdpi.com/2218-273X/10/3/454#\n", "\n", "![](SequentialFeatureSelector_files/sbs-gpcr2020.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Algorithmic Details" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sequential Forward Selection (SFS)\n", "\n", "\n", "**Input:** $Y = \\{y_1, y_2, ..., y_d\\}$ \n", "\n", "- The ***SFS*** algorithm takes the whole $d$-dimensional feature set as input.\n", "\n", "\n", "**Output:** $X_k = \\{x_j \\; | \\;j = 1, 2, ..., k; \\; x_j \\in Y\\}$, where $k = (0, 1, 2, ..., d)$\n", "\n", "- SFS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified *a priori*.\n", "\n", "**Initialization:** $X_0 = \\emptyset$, $k = 0$\n", "\n", "- We initialize the algorithm with an empty set $\\emptyset$ (\"null set\") so that $k = 0$ (where $k$ is the size of the subset).\n", "\n", "**Step 1 (Inclusion):** \n", "\n", " $x^+ = \\text{ arg max } J(X_k + x), \\text{ where } x \\in Y - X_k$ \n", " $X_{k+1} = X_k 
+ x^+$ \n", " $k = k + 1$ \n", " *Go to Step 1* \n", "\n", "- In this step, we add an additional feature, $x^+$, to our feature subset $X_k$.\n", "- $x^+$ is the feature that maximizes our criterion function, that is, the feature that is associated with the best classifier performance if it is added to $X_k$.\n", "- We repeat this procedure until the termination criterion is satisfied.\n", "\n", "**Termination:** $k = p$\n", "\n", "- We add features to the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified *a priori*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sequential Backward Selection (SBS)\n", "\n", "**Input:** the set of all features, $Y = \\{y_1, y_2, ..., y_d\\}$ \n", "\n", "- The SBS algorithm takes the whole feature set as input.\n", "\n", "**Output:** $X_k = \\{x_j \\; | \\;j = 1, 2, ..., k; \\; x_j \\in Y\\}$, where $k = (0, 1, 2, ..., d)$\n", "\n", "- SBS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified *a priori*.\n", "\n", "**Initialization:** $X_0 = Y$, $k = d$\n", "\n", "- We initialize the algorithm with the given feature set so that $k = d$.\n", "\n", "\n", "**Step 1 (Exclusion):** \n", "\n", "$x^- = \\text{ arg max } J(X_k - x), \\text{ where } x \\in X_k$ \n", "$X_{k-1} = X_k - x^-$ \n", "$k = k - 1$ \n", "*Go to Step 1* \n", "\n", "- In this step, we remove a feature, $x^-$, from our feature subset $X_k$.\n", "- $x^-$ is the feature that maximizes our criterion function upon removal, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.\n", "- We repeat this procedure until the termination criterion is satisfied.\n", "\n", "\n", "**Termination:** $k = p$\n", "\n", "- We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified *a priori*.\n", "\n" ] }, { 
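"cell_type": "markdown", "metadata": {}, "source": [ "
The inclusion loop of SFS and the exclusion loop of SBS described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the `mlxtend` implementation: `criterion` is a hypothetical stand-in for the criterion function J (in practice, for example, a cross-validated score), and the function names and the toy `weights` are made up for this sketch.

```python
# Illustrative sketch only; this is not mlxtend's implementation.
# `criterion` is a hypothetical stand-in for J, e.g. a cross-validated score.

def sequential_forward_selection(all_features, criterion, p):
    '''Greedy SFS: start from the empty set and add one feature at a time.'''
    selected = []
    while len(selected) < p:
        candidates = [f for f in all_features if f not in selected]
        # x+ = arg max J(X_k + x) over the features not yet selected
        best = max(candidates, key=lambda f: criterion(selected + [f]))
        selected.append(best)
    return sorted(selected)


def sequential_backward_selection(all_features, criterion, p):
    '''Greedy SBS: start from the full set and remove one feature at a time.'''
    selected = list(all_features)
    while len(selected) > p:
        # x- = arg max J(X_k - x) over the currently selected features
        worst = max(selected,
                    key=lambda f: criterion([g for g in selected if g != f]))
        selected.remove(worst)
    return sorted(selected)


# Toy criterion: each feature has a fixed, made-up usefulness weight,
# and the score of a subset is the sum of the weights of its members.
weights = {0: 0.1, 1: 0.4, 2: 0.9, 3: 0.7}


def toy_criterion(subset):
    return sum(weights[f] for f in subset)


sfs_result = sequential_forward_selection(list(weights), toy_criterion, p=3)
sbs_result = sequential_backward_selection(list(weights), toy_criterion, p=3)
print(sfs_result, sbs_result)
```

With an additive toy criterion like this one, both directions end at the same subset. With a real estimator, the score of a feature depends on which other features are present, so SFS and SBS can follow different paths and may end at different subsets.
" ] }, { 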
"cell_type": "markdown", "metadata": {}, "source": [ "### Sequential Backward Floating Selection (SBFS)\n", "\n", "**Input:** the set of all features, $Y = \\{y_1, y_2, ..., y_d\\}$ \n", "\n", "- The SBFS algorithm takes the whole feature set as input.\n", "\n", "**Output:** $X_k = \\{x_j \\; | \\;j = 1, 2, ..., k; \\; x_j \\in Y\\}$, where $k = (0, 1, 2, ..., d)$\n", "\n", "- SBFS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified *a priori*.\n", "\n", "**Initialization:** $X_0 = Y$, $k = d$\n", "\n", "- We initialize the algorithm with the given feature set so that the $k = d$.\n", "\n", "**Step 1 (Exclusion):** \n", "\n", "$x^- = \\text{ arg max } J(X_k - x), \\text{ where } x \\in X_k$ \n", "$X_{k-1} = X_k - x^-$ \n", "$k = k - 1$ \n", "*Go to Step 2* \n", "\n", "- In this step, we remove a feature, $x^-$ from our feature subset $X_k$.\n", "- $x^-$ is the feature that maximizes our criterion function upon removal, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.\n", "\n", "\n", "**Step 2 (Conditional Inclusion):** \n", "
\n", "$x^+ = \\text{ arg max } J(X_k + x), \\text{ where } x \\in Y - X_k$ \n", "*if J(X_k + x^+) > J(X_k)*: \n", "     $X_{k+1} = X_k + x^+$ \n", "     $k = k + 1$ \n", "*Go to Step 1* \n", "\n", "- In Step 2, we search for features that improve the classifier performance if they are added back to the feature subset. If such features exist, we add the feature $x^+$ for which the performance improvement is maximized. If $k = 2$ or an improvement cannot be made (i.e., such feature $x^+$ cannot be found), go back to step 1; else, repeat this step.\n", "\n", "\n", "**Termination:** $k = p$\n", "\n", "- We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified *a priori*.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sequential Forward Floating Selection (SFFS)\n", "\n", "**Input:** the set of all features, $Y = \\{y_1, y_2, ..., y_d\\}$ \n", "\n", "- The ***SFFS*** algorithm takes the whole feature set as input; e.g., if our feature space consists of 10 dimensions, ***d = 10***.\n", "

\n", "\n", "**Output:** a subset of features, $X_k = \\{x_j \\; | \\;j = 1, 2, ..., k; \\; x_j \\in Y\\}$, where $k = (0, 1, 2, ..., d)$\n", "\n", "- The returned output of the algorithm is a subset of the feature space of a specified size. E.g., a subset of 5 features from a 10-dimensional feature space (***k = 5, d = 10***).\n", "

\n", "\n", "**Initialization:** $X_0 = \\emptyset$, $k = 0$\n", "\n", "- We initialize the algorithm with an empty set (\"null set\") so that ***k = 0*** (where ***k*** is the size of the subset).\n", "

\n", "\n", "**Step 1 (Inclusion):** \n", "
\n", "     $x^+ = \\text{ arg max } J(X_k + x), \\text{ where } x \\in Y - X_k$ \n", "     $X_{k+1} = X_k + x^+$ \n", "     $k = k + 1$ \n", "    *Go to Step 2* \n", "

\n", "**Step 2 (Conditional Exclusion):** \n", "
\n", "     $x^- = \\text{ arg max } J(X_k - x), \\text{ where } x \\in X_k$ \n", "    $if \\; J(X_k - x^-) > J(X_k)$: \n", "         $X_{k-1} = X_k - x^- $ \n", "         $k = k - 1$ \n", "    *Go to Step 1* \n", "\n", "- In step 1, we include the feature from the ***feature space*** that leads to the best performance increase for our ***feature subset*** (assessed by the ***criterion function***). Then, we go over to step 2.\n", "- In step 2, we only remove a feature if the resulting subset would gain an increase in performance. If $k = 2$ or an improvement cannot be made (i.e., such feature $x^-$ cannot be found), go back to step 1; else, repeat this step.\n", "\n", "\n", "- Steps 1 and 2 are repeated until the **Termination** criterion is reached.\n", "

\n", "\n", "**Termination:** stop when ***k*** equals the number of desired features\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n", "\n", "- Ferri, F. J., Pudil P., Hatef, M., Kittler, J. (1994). [*\"Comparative study of techniques for large-scale feature selection.\"*](https://books.google.com/books?hl=en&lr=&id=sbajBQAAQBAJ&oi=fnd&pg=PA403&dq=comparative+study+of+techniques+for+large+scale&ots=KdIOYpA8wj&sig=hdOsBP1HX4hcDjx4RLg_chheojc#v=onepage&q=comparative%20study%20of%20techniques%20for%20large%20scale&f=false) Pattern Recognition in Practice IV : 403-413.\n", "\n", "- Pudil, P., Novovičová, J., & Kittler, J. (1994). [*\"Floating search methods in feature selection.\"*](https://www.sciencedirect.com/science/article/pii/0167865594901279) Pattern recognition letters 15.11 (1994): 1119-1125." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1 - A simple Sequential Forward Selection example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initializing a simple classifier from scikit-learn:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "knn = KNeighborsClassifier(n_neighbors=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by selecting the \"best\" 3 features from the Iris dataset via Sequential Forward Selection (SFS). Here, we set `forward=True` and `floating=False`. By choosing `cv=0`, we don't perform any cross-validation; therefore, the performance (here: `'accuracy'`) is computed entirely on the training set. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334" ] } ], "source": [ "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "\n", "sfs1 = SFS(knn, \n", " k_features=3, \n", " forward=True, \n", " floating=False, \n", " verbose=2,\n", " scoring='accuracy',\n", " cv=0)\n", "\n", "sfs1 = sfs1.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Via the `subsets_` attribute, we can take a look at the selected feature indices at each step:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{1: {'feature_idx': (3,),\n", " 'cv_scores': array([0.96]),\n", " 'avg_score': 0.96,\n", " 'feature_names': ('3',)},\n", " 2: {'feature_idx': (2, 3),\n", " 'cv_scores': array([0.97333333]),\n", " 'avg_score': 0.9733333333333334,\n", " 'feature_names': ('2', '3')},\n", " 3: {'feature_idx': (1, 2, 3),\n", " 'cv_scores': array([0.97333333]),\n", " 'avg_score': 0.9733333333333334,\n", " 'feature_names': ('1', '2', '3')}}" ] }, 
"execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sfs1.subsets_" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334" ] }, { "data": { "text/plain": [ "{1: {'feature_idx': (3,),\n", " 'cv_scores': array([0.96]),\n", " 'avg_score': 0.96,\n", " 'feature_names': ('3',)},\n", " 2: {'feature_idx': (2, 3),\n", " 'cv_scores': array([0.97333333]),\n", " 'avg_score': 0.9733333333333334,\n", " 'feature_names': ('2', '3')},\n", " 3: {'feature_idx': (1, 2, 3),\n", " 'cv_scores': array([0.97333333]),\n", " 'avg_score': 0.9733333333333334,\n", " 'feature_names': ('1', '2', '3')}}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sfs1 = sfs1.fit(X, y)\n", "sfs1.subsets_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Furthermore, we can access the indices of the 3 best features directly via the `k_feature_idx_` attribute:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 2, 
3)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sfs1.k_feature_idx_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the prediction score for these 3 features can be accessed via `k_score_`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9733333333333334" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sfs1.k_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Feature Names**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When working with large datasets, the feature indices might be hard to interpret. In this case, we recommend using pandas DataFrames with distinct column names as input:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", 
"<thead><tr><th></th><th>Sepal length</th><th>Sepal width</th><th>Petal length</th><th>Petal width</th></tr></thead>\n", 
"<tbody>\n", 
"<tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td></tr>\n", 
"<tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td></tr>\n", 
"<tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td></tr>\n", 
"<tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td></tr>\n", 
"<tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td></tr>\n", 
"</tbody>\n", 
"</table>
" ], "text/plain": [ " Sepal length Sepal width Petal length Petal width\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df_X = pd.DataFrame(X, columns=[\"Sepal length\", \"Sepal width\", \"Petal length\", \"Petal width\"])\n", "df_X.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best accuracy score: 0.97\n", "Best subset (indices): (1, 2, 3)\n", "Best subset (corresponding names): ('Sepal width', 'Petal length', 'Petal width')\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334" ] } ], "source": [ "sfs1 = sfs1.fit(df_X, y)\n", "\n", "print('Best accuracy score: %.2f' % sfs1.k_score_)\n", "print('Best subset (indices):', sfs1.k_feature_idx_)\n", "print('Best subset (corresponding names):', sfs1.k_feature_names_)" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "## Example 2 - Toggling between SFS, SBS, SFFS, and SBFS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `forward` and `floating` parameters, we can toggle between SFS, SBS, SFFS, and SBFS as shown below. Note that we are performing (stratified) 4-fold cross-validation for more robust estimates in contrast to Example 1. Via `n_jobs=-1`, we choose to run the cross-validation on all our available CPU cores." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Sequential Forward Selection (k=3):\n", "(1, 2, 3)\n", "CV Score:\n", "0.9731507823613088\n", "\n", "Sequential Backward Selection (k=3):\n", "(1, 2, 3)\n", "CV Score:\n", "0.9731507823613088\n", "\n", "Sequential Forward Floating Selection (k=3):\n", "(1, 2, 3)\n", "CV Score:\n", "0.9731507823613088\n", "\n", "Sequential Backward Floating Selection (k=3):\n", "(1, 2, 3)\n", "CV Score:\n", "0.9731507823613088\n" ] } ], "source": [ "# Sequential Forward Selection\n", "sfs = SFS(knn, \n", " k_features=3, \n", " forward=True, \n", " floating=False, \n", " scoring='accuracy',\n", " cv=4,\n", " n_jobs=-1)\n", "sfs = sfs.fit(X, y)\n", "\n", "print('\\nSequential Forward Selection (k=3):')\n", "print(sfs.k_feature_idx_)\n", "print('CV Score:')\n", "print(sfs.k_score_)\n", "\n", "###################################################\n", "\n", "# Sequential Backward Selection\n", "sbs = SFS(knn, \n", " k_features=3, \n", " forward=False, \n", " floating=False, \n", " scoring='accuracy',\n", " cv=4,\n", " n_jobs=-1)\n", "sbs = sbs.fit(X, y)\n", "\n", "print('\\nSequential Backward Selection (k=3):')\n", "print(sbs.k_feature_idx_)\n", "print('CV Score:')\n", "print(sbs.k_score_)\n", "\n", "###################################################\n", "\n", "# Sequential Forward Floating Selection\n", "sffs = SFS(knn, \n", " k_features=3, \n", " forward=True, \n", " floating=True, \n", " 
scoring='accuracy',\n", " cv=4,\n", " n_jobs=-1)\n", "sffs = sffs.fit(X, y)\n", "\n", "print('\\nSequential Forward Floating Selection (k=3):')\n", "print(sffs.k_feature_idx_)\n", "print('CV Score:')\n", "print(sffs.k_score_)\n", "\n", "###################################################\n", "\n", "# Sequential Backward Floating Selection\n", "sbfs = SFS(knn, \n", " k_features=3, \n", " forward=False, \n", " floating=True, \n", " scoring='accuracy',\n", " cv=4,\n", " n_jobs=-1)\n", "sbfs = sbfs.fit(X, y)\n", "\n", "print('\\nSequential Backward Floating Selection (k=3):')\n", "print(sbfs.k_feature_idx_)\n", "print('CV Score:')\n", "print(sbfs.k_score_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this simple scenario, selecting the best 3 features out of the 4 available features in the Iris set, we end up with similar results regardless of which sequential selection algorithms we used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 3 - Visualizing the results in DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " For our convenience, we can visualize the output from the feature selection in a pandas DataFrame format using the `get_metric_dict` method of the SequentialFeatureSelector object. The columns `std_dev` and `std_err` represent the standard deviation and standard errors of the cross-validation scores, respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we see the DataFrame of the Sequential Forward Selector from Example 2:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", 
"<thead><tr><th></th><th>feature_idx</th><th>cv_scores</th><th>avg_score</th><th>feature_names</th><th>ci_bound</th><th>std_dev</th><th>std_err</th></tr></thead>\n", 
"<tbody>\n", 
"<tr><th>1</th><td>(3,)</td><td>[0.9736842105263158, 0.9473684210526315, 0.918...</td><td>0.959993</td><td>(3,)</td><td>0.048319</td><td>0.030143</td><td>0.017403</td></tr>\n", 
"<tr><th>2</th><td>(2, 3)</td><td>[0.9736842105263158, 0.9473684210526315, 0.918...</td><td>0.959993</td><td>(2, 3)</td><td>0.048319</td><td>0.030143</td><td>0.017403</td></tr>\n", 
"<tr><th>3</th><td>(1, 2, 3)</td><td>[0.9736842105263158, 1.0, 0.9459459459459459, ...</td><td>0.973151</td><td>(1, 2, 3)</td><td>0.030639</td><td>0.019113</td><td>0.011035</td></tr>\n", 
"</tbody>\n", 
"</table>
" ], "text/plain": [ " feature_idx cv_scores avg_score \\\n", "1 (3,) [0.9736842105263158, 0.9473684210526315, 0.918... 0.959993 \n", "2 (2, 3) [0.9736842105263158, 0.9473684210526315, 0.918... 0.959993 \n", "3 (1, 2, 3) [0.9736842105263158, 1.0, 0.9459459459459459, ... 0.973151 \n", "\n", " feature_names ci_bound std_dev std_err \n", "1 (3,) 0.048319 0.030143 0.017403 \n", "2 (2, 3) 0.048319 0.030143 0.017403 \n", "3 (1, 2, 3) 0.030639 0.019113 0.011035 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.DataFrame.from_dict(sfs.get_metric_dict()).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's compare it to the Sequential Backward Selector:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<table>\n", 
"<thead><tr><th></th><th>feature_idx</th><th>cv_scores</th><th>avg_score</th><th>feature_names</th><th>ci_bound</th><th>std_dev</th><th>std_err</th></tr></thead>\n", 
"<tbody>\n", 
"<tr><th>4</th><td>(0, 1, 2, 3)</td><td>[0.9736842105263158, 0.9473684210526315, 0.918...</td><td>0.953236</td><td>(0, 1, 2, 3)</td><td>0.03602</td><td>0.022471</td><td>0.012974</td></tr>\n", 
"<tr><th>3</th><td>(1, 2, 3)</td><td>[0.9736842105263158, 1.0, 0.9459459459459459, ...</td><td>0.973151</td><td>(1, 2, 3)</td><td>0.030639</td><td>0.019113</td><td>0.011035</td></tr>\n", 
"</tbody>\n", 
"</table>
" ], "text/plain": [ " feature_idx cv_scores avg_score \\\n", "4 (0, 1, 2, 3) [0.9736842105263158, 0.9473684210526315, 0.918... 0.953236 \n", "3 (1, 2, 3) [0.9736842105263158, 1.0, 0.9459459459459459, ... 0.973151 \n", "\n", " feature_names ci_bound std_dev std_err \n", "4 (0, 1, 2, 3) 0.03602 0.022471 0.012974 \n", "3 (1, 2, 3) 0.030639 0.019113 0.011035 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame.from_dict(sbs.get_metric_dict()).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that both SFS and SBS found the same \"best\" 3 features; however, the intermediate steps were different." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ci_bound` column in the DataFrames above represents the confidence interval around the computed cross-validation scores. By default, a confidence interval of 95% is used, but we can use different confidence bounds via the `confidence_interval` parameter. E.g., the confidence bounds for a 90% confidence interval can be obtained as follows:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", 
"<thead><tr><th></th><th>feature_idx</th><th>cv_scores</th><th>avg_score</th><th>feature_names</th><th>ci_bound</th><th>std_dev</th><th>std_err</th></tr></thead>\n", 
"<tbody>\n", 
"<tr><th>4</th><td>(0, 1, 2, 3)</td><td>[0.9736842105263158, 0.9473684210526315, 0.918...</td><td>0.953236</td><td>(0, 1, 2, 3)</td><td>0.027658</td><td>0.022471</td><td>0.012974</td></tr>\n", 
"<tr><th>3</th><td>(1, 2, 3)</td><td>[0.9736842105263158, 1.0, 0.9459459459459459, ...</td><td>0.973151</td><td>(1, 2, 3)</td><td>0.023525</td><td>0.019113</td><td>0.011035</td></tr>\n", 
"</tbody>\n", 
"</table>
" ], "text/plain": [ " feature_idx cv_scores avg_score \\\n", "4 (0, 1, 2, 3) [0.9736842105263158, 0.9473684210526315, 0.918... 0.953236 \n", "3 (1, 2, 3) [0.9736842105263158, 1.0, 0.9459459459459459, ... 0.973151 \n", "\n", " feature_names ci_bound std_dev std_err \n", "4 (0, 1, 2, 3) 0.027658 0.022471 0.012974 \n", "3 (1, 2, 3) 0.023525 0.019113 0.011035 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame.from_dict(sbs.get_metric_dict(confidence_interval=0.90)).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 4 - Plotting the results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After importing the little helper function [`plotting.plot_sequential_feature_selection`](../plotting/plot_sequential_feature_selection.md), we can also visualize the results using matplotlib figures." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:18] Features: 1/4 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:18] Features: 2/4 -- score: 0.9666666666666668[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:18] Features: 3/4 -- score: 0.9533333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", 
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:18] Features: 4/4 -- score: 0.9733333333333334" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs\n", "import matplotlib.pyplot as plt\n", "\n", "sfs = SFS(knn, \n", " k_features=4, \n", " forward=True, \n", " floating=False, \n", " scoring='accuracy',\n", " verbose=2,\n", " cv=5)\n", "\n", "sfs = sfs.fit(X, y)\n", "\n", "fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')\n", "\n", "plt.ylim([0.8, 1])\n", "plt.title('Sequential Forward Selection (w. StdDev)')\n", "plt.grid()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 5 - Sequential Feature Selection for Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the classification examples above, the `SequentialFeatureSelector` also supports scikit-learn's estimators\n", "for regression." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.datasets import fetch_california_housing\n", "\n", "data = fetch_california_housing()\n", "X, y = data.data, data.target\n", "\n", "lr = LinearRegression()\n", "\n", "sfs = SFS(lr, \n", " k_features=8, \n", " forward=True, \n", " floating=False, \n", " scoring='neg_mean_squared_error',\n", " cv=10)\n", "\n", "sfs = sfs.fit(X, y)\n", "fig = plot_sfs(sfs.get_metric_dict(), kind='std_err')\n", "\n", "plt.title('Sequential Forward Selection (w. StdErr)')\n", "plt.grid()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 6 -- Feature Selection with Fixed Train/Validation Splits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you do not wish to use cross-validation (here: k-fold cross-validation, i.e., rotating training and validation folds), you can use the `PredefinedHoldoutSplit` class to specify your own, fixed training and validation split." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 72 112 132 88 37 138 87 42 8 90 141 33 59 116 135 104 36 13\n", " 63 45 28 133 24 127 46 20 31 121 117 4]\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "from mlxtend.evaluate import PredefinedHoldoutSplit\n", "import numpy as np\n", "\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "\n", "rng = np.random.RandomState(123)\n", "my_validation_indices = rng.permutation(np.arange(150))[:30]\n", "print(my_validation_indices)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:19] Features: 1/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:19] Features: 2/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:36:19] Features: 3/3 -- score: 0.9666666666666667" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "\n", "\n", "\n", "knn = KNeighborsClassifier(n_neighbors=4)\n", "piter = PredefinedHoldoutSplit(my_validation_indices)\n", "\n", "sfs1 = SFS(knn, \n", " k_features=3, \n", " forward=True, \n", " 
floating=False, \n", " verbose=2,\n", " scoring='accuracy',\n", " cv=piter)\n", "\n", "sfs1 = sfs1.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 7 -- Using the Selected Feature Subset For Making New Predictions" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Initialize the dataset\n", "\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.33, random_state=1)\n", "\n", "knn = KNeighborsClassifier(n_neighbors=4)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Select the \"best\" three features via\n", "# 5-fold cross-validation on the training set.\n", "\n", "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "\n", "sfs1 = SFS(knn, \n", " k_features=3, \n", " forward=True, \n", " floating=False, \n", " scoring='accuracy',\n", " cv=5)\n", "sfs1 = sfs1.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected features: (1, 2, 3)\n" ] } ], "source": [ "print('Selected features:', sfs1.k_feature_idx_)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy: 96.00 %\n" ] } ], "source": [ "# Generate the new subsets based on the selected features\n", "# Note that the transform call is equivalent to\n", "# X_train[:, sfs1.k_feature_idx_]\n", "\n", "X_train_sfs = sfs1.transform(X_train)\n", "X_test_sfs = sfs1.transform(X_test)\n", "\n", "# Fit the estimator using the new feature subset\n", "# and make a prediction on the test data\n", "knn.fit(X_train_sfs, 
y_train)\n", "y_pred = knn.predict(X_test_sfs)\n", "\n", "# Compute the accuracy of the prediction\n", "acc = float((y_test == y_pred).sum()) / y_pred.shape[0]\n", "print('Test set accuracy: %.2f %%' % (acc * 100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 8 -- Sequential Feature Selection and GridSearch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following example, we are tuning the SFS's estimator using GridSearch. To avoid unwanted behavior or side-effects, it's recommended to use the estimator inside and outside of SFS as separate instances." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Initialize the dataset\n", "\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=123)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.pipeline import Pipeline\n", "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "import mlxtend\n", "\n", "knn1 = KNeighborsClassifier()\n", "knn2 = KNeighborsClassifier()\n", "\n", "sfs1 = SFS(estimator=knn1, \n", " k_features=3,\n", " forward=True, \n", " floating=False, \n", " scoring='accuracy',\n", " cv=5)\n", "\n", "pipe = Pipeline([('sfs', sfs1), \n", " ('knn2', knn2)])\n", "\n", "param_grid = {\n", " 'sfs__k_features': [1, 2, 3],\n", " 'sfs__estimator__n_neighbors': [3, 4, 7], # inner knn\n", " 'knn2__n_neighbors': [3, 4, 7] # outer knn\n", " }\n", " \n", "gs = GridSearchCV(estimator=pipe, \n", " param_grid=param_grid, \n", " scoring='accuracy', \n", " n_jobs=1, \n", " cv=5,\n", " refit=False)\n", "\n", "# run grid search\n", "gs = 
gs.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the suggested hyperparameters below:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "for i in range(len(gs.cv_results_['params'])):\n", " print(gs.cv_results_['params'][i], 'test acc.:', gs.cv_results_['mean_test_score'][i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"best\" parameters determined by GridSearch are ..." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best parameters via GridSearch {'knn2__n_neighbors': 7, 'sfs__estimator__n_neighbors': 3, 'sfs__k_features': 3}\n" ] } ], "source": [ "print(\"Best parameters via GridSearch\", gs.best_params_)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('sfs',\n",
       "                 SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3),\n",
       "                                           k_features=(3, 3),\n",
       "                                           scoring='accuracy')),\n",
       "                ('knn2', KNeighborsClassifier(n_neighbors=7))])
" ], "text/plain": [ "Pipeline(steps=[('sfs',\n", " SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3),\n", " k_features=(3, 3),\n", " scoring='accuracy')),\n", " ('knn2', KNeighborsClassifier(n_neighbors=7))])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.set_params(**gs.best_params_).fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 9 -- Selecting the \"best\" feature combination in a k-range" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If `k_features` is set to a tuple `(min_k, max_k)` (new in 0.4.2), the SFS will select the best feature combination that it discovered by iterating from `k=1` to `max_k` (forward), or `max_k` to `min_k` (backward). The size of the returned feature subset is then between `min_k` and `max_k`, depending on which combination scored best during cross-validation.\n", "\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 4)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "best combination (ACC: 0.992): (0, 1, 2, 3, 6, 8, 9, 10, 11, 12)\n", "\n", "all subsets:\n", " {1: {'feature_idx': (6,), 'cv_scores': array([0.84 , 0.64 , 0.84 , 0.8 , 0.875]), 'avg_score': 0.799, 'feature_names': ('6',)}, 2: {'feature_idx': (6, 9), 'cv_scores': array([0.92 , 0.88 , 1. , 0.96 , 0.91666667]), 'avg_score': 0.9353333333333333, 'feature_names': ('6', '9')}, 3: {'feature_idx': (6, 9, 12), 'cv_scores': array([0.92 , 0.92 , 0.96 , 1. , 0.95833333]), 'avg_score': 0.9516666666666665, 'feature_names': ('6', '9', '12')}, 4: {'feature_idx': (3, 6, 9, 12), 'cv_scores': array([0.96 , 0.96 , 0.96 , 1. 
, 0.95833333]), 'avg_score': 0.9676666666666666, 'feature_names': ('3', '6', '9', '12')}, 5: {'feature_idx': (3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1. , 1. , 1. ]), 'avg_score': 0.976, 'feature_names': ('3', '6', '9', '10', '12')}, 6: {'feature_idx': (2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1. , 0.96, 1. ]), 'avg_score': 0.968, 'feature_names': ('2', '3', '6', '9', '10', '12')}, 7: {'feature_idx': (0, 2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.92, 1. , 1. , 1. ]), 'avg_score': 0.968, 'feature_names': ('0', '2', '3', '6', '9', '10', '12')}, 8: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 12), 'cv_scores': array([1. , 0.92, 1. , 1. , 1. ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '12')}, 9: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1. , 0.92, 1. , 1. , 1. ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '11', '12')}, 10: {'feature_idx': (0, 1, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1. , 0.96, 1. , 1. , 1. ]), 'avg_score': 0.992, 'feature_names': ('0', '1', '2', '3', '6', '8', '9', '10', '11', '12')}}\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from mlxtend.data import wine_data\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.pipeline import make_pipeline\n", "\n", "X, y = wine_data()\n", "X_train, X_test, y_train, y_test= train_test_split(X, y, \n", " stratify=y,\n", " test_size=0.3,\n", " random_state=1)\n", "\n", "knn = KNeighborsClassifier(n_neighbors=2)\n", "\n", "sfs1 = SFS(estimator=knn, \n", " k_features=(3, 10),\n", " forward=True, \n", " floating=False, \n", " scoring='accuracy',\n", " cv=5)\n", "\n", "pipe = make_pipeline(StandardScaler(), sfs1)\n", "\n", "pipe.fit(X_train, y_train)\n", "\n", "print('best combination (ACC: %.3f): %s\\n' % (sfs1.k_score_, sfs1.k_feature_idx_))\n", "print('all subsets:\\n', sfs1.subsets_)\n", "plot_sfs(sfs1.get_metric_dict(), kind='std_err');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 10 -- Using other cross-validation schemes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to standard k-fold and stratified k-fold, other cross validation schemes can be used with `SequentialFeatureSelector`. For example, `GroupKFold` or `LeaveOneOut` cross-validation from scikit-learn. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using GroupKFold with SequentialFeatureSelector" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "groups: [ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2\n", " 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4\n", " 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7\n", " 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9\n", " 9 9 9 9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11\n", " 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14\n", " 14 14 14 14 14 14]\n" ] } ], "source": [ "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from mlxtend.data import iris_data\n", "from sklearn.model_selection import GroupKFold\n", "import numpy as np\n", "\n", "X, y = iris_data()\n", "groups = np.arange(len(y)) // 10\n", "print('groups: {}'.format(groups))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling the `split()` method of a scikit-learn cross-validator object will return a generator that yields train, test splits." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_gen = GroupKFold(4).split(X, y, groups)\n", "cv_gen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `cv` parameter of `SequentialFeatureSelector` must be either an `int` or an iterable yielding train, test splits. This iterable can be constructed by passing the train, test split generator to the built-in `list()` function. 
" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "cv = list(cv_gen)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "best combination (ACC: 0.940): (2, 3)\n", "\n" ] } ], "source": [ "knn = KNeighborsClassifier(n_neighbors=2)\n", "sfs = SFS(estimator=knn, \n", " k_features=2,\n", " scoring='accuracy',\n", " cv=cv)\n", "\n", "sfs.fit(X, y)\n", "\n", "print('best combination (ACC: %.3f): %s\\n' % (sfs.k_score_, sfs.k_feature_idx_))" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Example 11 - Interrupting Long Runs for Intermediate Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If your run is taking too long, it is possible to trigger a `KeyboardInterrupt` (e.g., ctrl+c on a Mac, or interrupting the cell in a Jupyter notebook) to obtain temporary results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Toy dataset**" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_classification\n", "from sklearn.model_selection import train_test_split\n", "\n", "\n", "X, y = make_classification(\n", " n_samples=20000,\n", " n_features=500,\n", " n_informative=10,\n", " n_redundant=40,\n", " n_repeated=25,\n", " n_clusters_per_class=5,\n", " flip_y=0.05,\n", " class_sep=0.5,\n", " random_state=123,\n", ")\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=123\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Long run with interruption**" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", 
"[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed: 8.3s finished\n", "\n", "[2023-05-17 08:36:32] Features: 1/10 -- score: 0.5965[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 499 out of 499 | elapsed: 13.8s finished\n", "\n", "[2023-05-17 08:36:45] Features: 2/10 -- score: 0.6256875000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 498 out of 498 | elapsed: 18.1s finished\n", "\n", "[2023-05-17 08:37:03] Features: 3/10 -- score: 0.642[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 497 out of 497 | elapsed: 20.4s finished\n", "\n", "[2023-05-17 08:37:24] Features: 4/10 -- score: 0.6463125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 496 out of 496 | elapsed: 22.2s finished\n", "\n", "[2023-05-17 08:37:46] Features: 5/10 -- score: 0.6495000000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 495 out of 495 | elapsed: 26.1s finished\n", "\n", "[2023-05-17 08:38:12] Features: 6/10 -- score: 0.6514374999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 494 out of 494 | elapsed: 26.1s finished\n", "\n", "[2023-05-17 08:38:38] Features: 7/10 -- score: 0.6533749999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 
1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 493 out of 493 | elapsed: 25.3s finished\n", "\n", "[2023-05-17 08:39:04] Features: 8/10 -- score: 0.6545624999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 492 out of 492 | elapsed: 26.3s finished\n", "\n", "[2023-05-17 08:39:30] Features: 9/10 -- score: 0.6549375[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 491 out of 491 | elapsed: 27.0s finished\n", "\n", "[2023-05-17 08:39:57] Features: 10/10 -- score: 0.6554374999999999" ] } ], "source": [ "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "model = LogisticRegression()\n", "\n", "sfs1 = SFS(model, \n", " k_features=10, \n", " forward=True, \n", " floating=False, \n", " verbose=2,\n", " scoring='accuracy',\n", " cv=5)\n", "\n", "sfs1 = sfs1.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finalizing the fit**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the feature selection run hasn't finished, so certain attributes may not be available. 
In order to use the SFS instance, it is recommended to call `finalize_fit`, which will make the SFS estimator appear as \"fitted\" and process the temporary results:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "sfs1.finalize_fit()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(30, 128, 144, 160, 184, 229, 256, 356, 439, 458)\n", "0.6554374999999999\n" ] } ], "source": [ "print(sfs1.k_feature_idx_)\n", "print(sfs1.k_score_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 12 - Using Pandas DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, we can also use pandas DataFrames and pandas Series as input to the `fit` function. In this case, the column names of the pandas DataFrame will be used as feature names. However, note that if `custom_feature_names` are provided in the fit function, these `custom_feature_names` take precedence over the DataFrame column-based feature names." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.datasets import load_iris\n", "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "knn = KNeighborsClassifier(n_neighbors=4)\n", "\n", "sfs1 = SFS(knn, \n", " k_features=3, \n", " forward=True, \n", " floating=False, \n", " scoring='accuracy',\n", " cv=0)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal lenpetal lensepal widthpetal width
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", "
" ], "text/plain": [ " sepal len petal len sepal width petal width\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',\n", " 'sepal width', 'petal width'])\n", "X_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, the target array, `y`, can optionally be cast as a Series:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", "dtype: int64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_series = pd.Series(y)\n", "y_series.head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "sfs1 = sfs1.fit(X_df, y_series)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the only difference when passing a pandas DataFrame as input is that the `feature_names` entries in `sfs1.subsets_` now contain the DataFrame column names:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{1: {'feature_idx': (3,),\n", " 'cv_scores': array([0.96]),\n", " 'avg_score': 0.96,\n", " 'feature_names': ('petal width',)},\n", " 2: {'feature_idx': (2, 3),\n", " 'cv_scores': array([0.97333333]),\n", " 'avg_score': 0.9733333333333334,\n", " 'feature_names': ('sepal width', 'petal width')},\n", " 3: {'feature_idx': (1, 2, 3),\n", " 'cv_scores': array([0.97333333]),\n", " 'avg_score': 0.9733333333333334,\n", " 'feature_names': ('petal len', 'sepal width', 'petal width')}}" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sfs1.subsets_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In mlxtend version >= 0.13, pandas DataFrames are supported as feature inputs to the 
`SequentialFeatureSelector` instead of NumPy arrays or other NumPy-like array types." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 13 - Specifying Fixed Feature Sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, it may be useful to specify a fixed set of features we want to use for a given model (e.g., determined by prior knowledge or domain knowledge). Since mlxtend v0.18.0, it is possible to specify such features via the `fixed_features` parameter. These features are then guaranteed to be included in the selected subsets.\n", "\n", "Note that this option works for both forward and backward selection, with or without floating selection.\n", "\n", "The example below illustrates how we can set the features 0 and 2 in the dataset as fixed:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9733333333333333[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "knn = KNeighborsClassifier(n_neighbors=3)\n", "\n", "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "\n", "sfs1 = SFS(knn, \n", " k_features=4, \n", " forward=True, \n", " floating=False, \n", " 
verbose=2,\n", " scoring='accuracy',\n", " fixed_features=(0, 2),\n", " cv=3)\n", "\n", "sfs1 = sfs1.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{2: {'feature_idx': (0, 2),\n", " 'cv_scores': array([0.98, 0.92, 0.94]),\n", " 'avg_score': 0.9466666666666667,\n", " 'feature_names': ('0', '2')},\n", " 3: {'feature_idx': (0, 2, 3),\n", " 'cv_scores': array([0.98, 0.96, 0.98]),\n", " 'avg_score': 0.9733333333333333,\n", " 'feature_names': ('0', '2', '3')},\n", " 4: {'feature_idx': (0, 1, 2, 3),\n", " 'cv_scores': array([0.98, 0.96, 0.98]),\n", " 'avg_score': 0.9733333333333333,\n", " 'feature_names': ('0', '1', '2', '3')}}" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sfs1.subsets_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the input dataset is a pandas DataFrame, we can also use the column names directly:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal lenpetal lensepal widthpetal width
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", "
" ], "text/plain": [ " sepal len petal len sepal width petal width\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',\n", " 'sepal width', 'petal width'])\n", "X_df.head()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9466666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333" ] } ], "source": [ "sfs2 = SFS(knn, \n", " k_features=4, \n", " forward=True, \n", " floating=False, \n", " verbose=2,\n", " scoring='accuracy',\n", " fixed_features=('sepal len', 'petal len'),\n", " cv=3)\n", "\n", "sfs2 = sfs2.fit(X_df, y_series)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{2: {'feature_idx': (0, 1),\n", " 'cv_scores': array([0.72, 0.74, 0.78]),\n", " 'avg_score': 0.7466666666666667,\n", " 'feature_names': ('sepal len', 'petal len')},\n", " 3: {'feature_idx': (0, 1, 2),\n", " 'cv_scores': array([0.98, 0.92, 0.94]),\n", " 'avg_score': 0.9466666666666667,\n", " 'feature_names': ('sepal len', 'petal len', 'sepal width')},\n", " 4: {'feature_idx': (0, 1, 2, 3),\n", " 'cv_scores': array([0.98, 0.96, 0.98]),\n", " 'avg_score': 0.9733333333333333,\n", " 'feature_names': ('sepal len', 
'petal len', 'sepal width', 'petal width')}}" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sfs2.subsets_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 13 - Working with Feature Groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since mlxtend v0.21.0, it is possible to specify feature groups. Feature groups allow you to group certain features together, such that they are always selected as a group. This can be very useful in contexts similar to one-hot encoding -- if you want to treat the one-hot encoded feature as a single feature:\n", "\n", "![](SequentialFeatureSelector_files/feature_groups.jpeg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following example, we specify sepal length and sepal width as a feature group so that they are always selected together:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal lenpetal lensepal widpetal wid
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", "
" ], "text/plain": [ " sepal len petal len sepal wid petal wid\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import load_iris\n", "import pandas as pd\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "\n", "X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',\n", " 'sepal wid', 'petal wid'])\n", "X_df.head()\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "\n", "knn = KNeighborsClassifier(n_neighbors=3)\n", "\n", "sfs1 = SFS(knn, \n", " k_features=2, \n", " scoring='accuracy',\n", " feature_groups=(['sepal len', 'sepal wid'], ['petal len'], ['petal wid']),\n", " cv=3)\n", "\n", "sfs1 = sfs1.fit(X_df, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "sfs1 = SFS(knn, \n", " k_features=2, \n", " scoring='accuracy',\n", " feature_groups=[[0, 2], [1], [3]],\n", " cv=3)\n", "\n", "sfs1 = sfs1.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 14 - Multiclass Metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Certain scoring metrics like ROC AUC were originally designed for binary classification. However, they can also be used in multiclass settings. To see which variants are available, consult [this scikit-learn metrics table](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values).\n", "\n", "For example, we can use a ROC AUC One-Vs-Rest score via `\"roc_auc_ovr\"` as shown below." 
] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "X, y = make_blobs(n_samples=10, centers=4, n_features=5, random_state=0)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:39:57] Features: 1/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:39:57] Features: 2/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2023-05-17 08:39:57] Features: 3/3 -- score: 1.0" ] } ], "source": [ "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "\n", "sfs1 = SFS(knn, \n", " k_features=3, \n", " forward=True, \n", " floating=False, \n", " verbose=2,\n", " scoring='roc_auc_ovr',\n", " cv=0)\n", "\n", "sfs1 = sfs1.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# API" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "## SequentialFeatureSelector\n", "\n", "*SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)*\n", "\n", "Sequential Feature Selection for Classification and 
Regression.\n", "\n", "**Parameters**\n", "\n", "- `estimator` : scikit-learn classifier or regressor\n", "\n", "\n", "- `k_features` : int or tuple or str (default: 1)\n", "\n", " Number of features to select,\n", " where k_features < the full feature set.\n", " New in 0.4.2: A tuple containing a min and max value can be provided,\n", " and the SFS will return the feature combination between\n", " min and max that scored highest in cross-validation. For example,\n", " the tuple (1, 4) will return any combination from\n", " 1 up to 4 features instead of a fixed number of features k.\n", " New in 0.8.0: A string argument \"best\" or \"parsimonious\".\n", " If \"best\" is provided, the feature selector will return the\n", " feature subset with the best cross-validation performance.\n", " If \"parsimonious\" is provided as an argument, the smallest\n", " feature subset that is within one standard error of the\n", " cross-validation performance will be selected.\n", "\n", "\n", "- `forward` : bool (default: True)\n", "\n", " Forward selection if True,\n", " backward selection otherwise\n", "\n", "\n", "- `floating` : bool (default: False)\n", "\n", " Adds a conditional exclusion/inclusion if True.\n", "\n", "\n", "- `verbose` : int (default: 0), level of verbosity to use in logging.\n", "\n", " If 0, no output,\n", " if 1 number of features in current set, if 2 detailed logging\n", " including timestamp and cv scores at step.\n", "\n", "\n", "- `scoring` : str, callable, or None (default: None)\n", "\n", " If None (default), uses 'accuracy' for sklearn classifiers\n", " and 'r2' for sklearn regressors.\n", " If str, uses a sklearn scoring metric string identifier, for example\n", " {accuracy, f1, precision, recall, roc_auc} for classifiers,\n", " {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error',\n", " 'median_absolute_error', 'r2'} for regressors.\n", " If a callable object or function is provided, it has to conform to\n", " 
sklearn's signature ``scorer(estimator, X, y)``; see\n", " https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html\n", " for more information.\n", "\n", "\n", "- `cv` : int (default: 5)\n", "\n", " Integer or iterable yielding train, test splits. If cv is an integer\n", " and `estimator` is a classifier (or y consists of integer class\n", " labels) stratified k-fold. Otherwise regular k-fold cross-validation\n", " is performed. No cross-validation if cv is None, False, or 0.\n", "\n", "\n", "- `n_jobs` : int (default: 1)\n", "\n", " The number of CPUs to use for evaluating different feature subsets\n", " in parallel. -1 means 'all CPUs'.\n", "\n", "\n", "- `pre_dispatch` : int, or string (default: '2*n_jobs')\n", "\n", " Controls the number of jobs that get dispatched\n", " during parallel execution if `n_jobs > 1` or `n_jobs=-1`.\n", " Reducing this number can be useful to avoid an explosion of\n", " memory consumption when more jobs get dispatched than CPUs can process.\n", " This parameter can be:\n", " None, in which case all the jobs are immediately created and spawned.\n", " Use this for lightweight and fast-running jobs,\n", " to avoid delays due to on-demand spawning of the jobs\n", " An int, giving the exact number of total jobs that are spawned\n", " A string, giving an expression as a function\n", " of n_jobs, as in `2*n_jobs`\n", "\n", "\n", "- `clone_estimator` : bool (default: True)\n", "\n", " Clones estimator if True; works with the original estimator instance\n", " if False. Set to False if the estimator doesn't\n", " implement scikit-learn's set_params and get_params methods.\n", " In addition, it is required to set cv=0, and n_jobs=1.\n", "\n", "\n", "- `fixed_features` : tuple (default: None)\n", "\n", " If not `None`, the feature indices provided as a tuple will be\n", " regarded as fixed by the feature selector. 
For example, if\n", " `fixed_features=(1, 3, 7)`, the 2nd, 4th, and 8th features are\n", " guaranteed to be present in the solution. Note that if\n", " `fixed_features` is not `None`, make sure that the number of\n", " features to be selected is greater than `len(fixed_features)`.\n", " In other words, ensure that `k_features > len(fixed_features)`.\n", " New in mlxtend v. 0.18.0.\n", "\n", "\n", "- `feature_groups` : list or None (default: None)\n", "\n", " Optional argument for treating certain features as a group.\n", " This means, the features within a group are always selected together,\n", " never split.\n", " For example, `feature_groups=[[1], [2], [3, 4, 5]]`\n", " specifies 3 feature groups. In this case,\n", " possible feature selection results with `k_features=2`\n", " are `[[1], [2]]`, `[[1], [3, 4, 5]]`, or `[[2], [3, 4, 5]]`.\n", " Feature groups can be useful for\n", " interpretability, for example, if features 3, 4, 5 are one-hot\n", " encoded features. (For more details, please read the notes at the\n", " bottom of this docstring). New in mlxtend v. 0.21.0.\n", "\n", "**Attributes**\n", "\n", "- `k_feature_idx_` : array-like, shape = [n_predictions]\n", "\n", " Feature Indices of the selected feature subsets.\n", "\n", "\n", "- `k_feature_names_` : array-like, shape = [n_predictions]\n", "\n", " Feature names of the selected feature subsets. If pandas\n", " DataFrames are used in the `fit` method, the feature\n", " names correspond to the column names. Otherwise, the\n", " feature names are string representations of the feature\n", " array indices. New in v 0.13.0.\n", "\n", "\n", "- `k_score_` : float\n", "\n", " Cross validation average score of the selected subset.\n", "\n", "\n", "- `subsets_` : dict\n", "\n", " A dictionary of selected feature subsets during the\n", " sequential selection, where the dictionary keys are\n", " the lengths k of these feature subsets. 
If the parameter\n", " `feature_groups` is not None, the key indicates\n", " the number of groups that are selected together. The dictionary\n", " values are dictionaries themselves with the following\n", " keys: 'feature_idx' (tuple of indices of the feature subset)\n", " 'feature_names' (tuple of feature names of the feat. subset)\n", " 'cv_scores' (list of individual cross-validation scores)\n", " 'avg_score' (average cross-validation score)\n", " Note that if pandas\n", " DataFrames are used in the `fit` method, the 'feature_names'\n", " correspond to the column names. Otherwise, the\n", " feature names are string representations of the feature\n", " array indices. The 'feature_names' is new in v 0.13.0.\n", "\n", "**Notes**\n", "\n", "(1) If parameter `feature_groups` is not None, the\n", " number of features is equal to the number of feature groups, i.e.\n", " `len(feature_groups)`. For example, if `feature_groups = [[0], [1], [2, 3],\n", " [4]]`, then the `k_features` value cannot exceed 4.\n", "\n", " (2) Although two or more individual features may be considered as one group\n", " throughout the feature-selection process, it does not mean the individual\n", " features of that group have the same impact on the outcome. For instance, in\n", " linear regression, the coefficients of features 2 and 3 can be different\n", " even if they are considered as one group in feature_groups.\n", "\n", " (3) If both fixed_features and feature_groups are specified, ensure that each\n", " feature group contains the fixed_features selection. 
E.g., for a 3-feature set\n", " fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid;\n", " fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.\n", "\n", " (4) In case of KeyboardInterrupt, the dictionary subsets may not be completed.\n", " If user is still interested in getting the best score, they can use method\n", " `finalize_fit`.\n", "\n", "**Examples**\n", "\n", "For usage examples, please see\n", " https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/\n", "\n", "### Methods\n", "\n", "
\n", "\n", "*finalize_fit()*\n", "\n", "None\n", "\n", "
\n", "\n", "*fit(X, y, groups=None, **fit_params)*\n", "\n", "Perform feature selection and learn model from training data.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", " New in v 0.13.0: pandas DataFrames are now also accepted as\n", " argument for X.\n", "\n", "- `y` : array-like, shape = [n_samples]\n", "\n", " Target values.\n", " New in v 0.13.0: pandas DataFrames are now also accepted as\n", " argument for y.\n", "\n", "- `groups` : array-like, with shape (n_samples,), optional\n", "\n", " Group labels for the samples used while splitting the dataset into\n", " train/test set. Passed to the fit method of the cross-validator.\n", "\n", "- `fit_params` : various, optional\n", "\n", " Additional parameters that are being passed to the estimator.\n", " For example, `sample_weights=weights`.\n", "\n", "**Returns**\n", "\n", "- `self` : object\n", "\n", "\n", "
\n", "\n", "*fit_transform(X, y, groups=None, **fit_params)*\n", "\n", "Fit to training data then reduce X to its most important features.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", " New in v 0.13.0: pandas DataFrames are now also accepted as\n", " argument for X.\n", "\n", "- `y` : array-like, shape = [n_samples]\n", "\n", " Target values.\n", " New in v 0.13.0: pandas Series are now also accepted as\n", " argument for y.\n", "\n", "- `groups` : array-like, with shape (n_samples,), optional\n", "\n", " Group labels for the samples used while splitting the dataset into\n", " train/test set. Passed to the fit method of the cross-validator.\n", "\n", "- `fit_params` : various, optional\n", "\n", " Additional parameters that are being passed to the estimator.\n", " For example, `sample_weights=weights`.\n", "\n", "**Returns**\n", "\n", "Reduced feature subset of X, shape={n_samples, k_features}\n", "\n", "
\n", "\n", "*generate_error_message_k_features(name)*\n", "\n", "None\n", "\n", "
\n", "\n", "*get_metric_dict(confidence_interval=0.95)*\n", "\n", "Return metric dictionary\n", "\n", "**Parameters**\n", "\n", "- `confidence_interval` : float (default: 0.95)\n", "\n", " A positive float between 0.0 and 1.0 to compute the confidence\n", " interval bounds of the CV score averages.\n", "\n", "**Returns**\n", "\n", "Dictionary with items where each dictionary value is a list\n", " with the number of iterations (number of feature subsets) as\n", " its length. The dictionary keys corresponding to these lists\n", " are as follows:\n", " 'feature_idx': tuple of the indices of the feature subset\n", " 'cv_scores': list with individual CV scores\n", " 'avg_score': average of the CV scores\n", " 'std_dev': standard deviation of the CV score average\n", " 'std_err': standard error of the CV score average\n", " 'ci_bound': confidence interval bound of the CV score average\n", "\n", "
\n", "\n", "*get_params(deep=True)*\n", "\n", "Get parameters for this estimator.\n", "\n", "**Parameters**\n", "\n", "- `deep` : bool, default=True\n", "\n", " If True, will return the parameters for this estimator and\n", " contained subobjects that are estimators.\n", "\n", "**Returns**\n", "\n", "- `params` : dict\n", "\n", " Parameter names mapped to their values.\n", "\n", "
\n", "\n", "*set_params(**params)*\n", "\n", "Set the parameters of this estimator.\n", " Valid parameter keys can be listed with ``get_params()``.\n", "\n", "**Returns**\n", "\n", "self\n", "\n", "
\n", "\n", "*transform(X)*\n", "\n", "Reduce X to its most important features.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", " New in v 0.13.0: pandas DataFrames are now also accepted as\n", " argument for X.\n", "\n", "**Returns**\n", "\n", "Reduced feature subset of X, shape={n_samples, k_features}\n", "\n", "### Properties\n", "\n", "
\n", "\n", "*named_estimators*\n", "\n", "**Returns**\n", "\n", "List of named estimator tuples, like [('svc', SVC(...))]\n", "\n", "\n" ] } ], "source": [ "with open('../../api_modules/mlxtend.feature_selection/SequentialFeatureSelector.md', 'r') as f:\n", " s = f.read()\n", "print(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }