{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "##
[mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "#
Tutorial. Mlxtend.SFS: an easy way to select features\n", "###
Author: Anton Gilmanov, @wicker\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Intro" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Feature engineering and feature selection are among the most important elements of data analysis and machine learning\".
\n", "
\n", "You can read a phrase like this in many articles and books, and it is true. But why do we need to select features?\n", "
\n", "
\n", "\n", "### 1. \"Noisy\" features\n", "\n", "However good your data is, it always contains some useful features that help you solve the problem and some noisy features that contribute nothing to your prediction model. Noisy features are dangerous because they can lead to overfitting. Conversely, removing them from the dataset can improve the quality of your model on hold-out data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Computation problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a dataset has hundreds or thousands of features, fitting an estimator, running cross-validation, or any other\n", "computation can take a lot of time. Of course, we can use PCA to reduce the dimensionality of the data, but sometimes that is not an option for the business task at hand. It is hard to explain how a business could change a \"new PCA feature\" to reach its goals; this is called the \"interpretation problem\". Feature selection is useful in this case because it keeps the original features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Feature engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, we have a good dataset. But we added many custom features ~~to beat Kaggle's baseline~~ and now we want to find out which of them improve the quality metric and which do not. Hm... We could use L1 regularization, which drives some weights to 0. But what if we could fit the estimator on different subsets of the new features, adding a feature when the quality metric increases or removing one when it decreases, and do all of that in only a couple of lines of code?

**Mlxtend SequentialFeatureSelector is what we need!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

**Every time we try to select the best features :)**

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Image on web](https://i.giphy.com/media/5yLgoceFO3BdJW1zvFu/giphy.webp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, the informal intro is coming to an end. It's time to understand some formalized theory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Introduce to SequentialFeatureSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please, install some libraries, if you haven't it in your system" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#!pip install pandas\n", "#!pip install mlxtend\n", "#!pip install scikit-learn\n", "#!pip install matplotlib" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import train_test_split, KFold, cross_val_score\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.feature_selection import RFE\n", "from mlxtend.feature_selection import SequentialFeatureSelector\n", "from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs\n", "from sklearn.decomposition import PCA\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mlxtend SequentialFeatureSelector is greedy search algorithm that is used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 4 different flavors of sequential feature algorithms (SFAs) available via the *SequentialFeatureSelector*:\n", "\n", "* Sequential Forward Selection (SFS)\n", "* Sequential Backward Selection (SBS)\n", "* Sequential Forward Floating Selection (SFFS)\n", "* Sequential Backward Floating Selection (SBFS)\n", "\n", "In the \"forward\" algorithm we start with no features in our subset and on each iteration add the one feature that maximizes the quality metric. In contrast, the \"backward\" algorithm starts with the full set of features and on each iteration removes the one feature whose removal maximizes the quality of our model.\n", "\n", "The floating variants, SFFS and SBFS, can be considered extensions of the simpler SFS and SBS algorithms. The floating algorithms have an additional exclusion or inclusion step that removes features after they were included (or re-adds features after they were excluded), so that a larger number of feature subset combinations can be sampled.\n", "\n", "Let's take a look at each of them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sequential Forward Selection (SFS)\n", "\n", "\n", "### Input: \n", "\n", "The set of all features: $Y = \\{y_1, y_2, \\ldots, y_d\\}$\n", "\n", "* The SFS algorithm takes the whole d-dimensional feature set as input.\n", "\n", "### Output: \n", "\n", "$X_k = \\{x_j \\mid j = 1, 2, \\ldots, k;\\ x_j \\in Y\\}$, where $k = (0, 1, 2, \\ldots, d)$\n", "\n", "* SFS returns a subset of $k$ features.\n", "\n", "### Initialization: \n", "\n", "$X_0 = \\emptyset$, $k = 0$\n", "\n", "* We initialize the algorithm with an empty set $\\emptyset$ (\"null set\") so that $k = 0$ (where $k$ is the size of the subset).\n", "\n", "\n", "### Step 1 (Inclusion):\n", "\n", "$x^+ = \\arg\\max J(X_k + x)$, where $x \\in Y - X_k$\n", "\n", "$X_{k+1} = X_k + x^+$\n", "\n", "$k = k + 1$\n", "\n", "#### Go to Step 1\n", "\n", "* In this step, we add an additional feature, $x^+$, to our feature subset $X_k$.\n", "* $x^+$ is the feature that maximizes our criterion function, that is, the feature that is associated with the best classifier performance if it is added to $X_k$.\n", "* We repeat this procedure until the 
termination criterion is satisfied.\n", "\n", "### Termination: \n", "\n", "$k = p$\n", "\n", "We add features to the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sequential Backward Selection (SBS)\n", "\n", "### Input: \n", "\n", "The set of all features: $Y = \\{y_1, y_2, \\ldots, y_d\\}$\n", "\n", "* The SBS algorithm takes the whole feature set as input.\n", "\n", "### Output: \n", "\n", "$X_k = \\{x_j \\mid j = 1, 2, \\ldots, k;\\ x_j \\in Y\\}$, where $k = (0, 1, 2, \\ldots, d)$\n", "\n", "* SBS returns a subset of $k$ features.\n", "\n", "### Initialization: \n", "\n", "$X_0 = Y$, $k = d$\n", "\n", "* We initialize the algorithm with the given feature set so that $k = d$.\n", "\n", "\n", "### Step 1 (Exclusion):\n", "\n", "$x^- = \\arg\\max J(X_k - x)$, where $x \\in X_k$\n", "\n", "$X_{k-1} = X_k - x^-$\n", "\n", "$k = k - 1$\n", "\n", "#### Go to Step 1\n", "\n", "* In this step, we remove a feature, $x^-$, from our feature subset $X_k$.\n", "* $x^-$ is the feature that maximizes our criterion function upon removal, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.\n", "* We repeat this procedure until the termination criterion is satisfied.\n", "\n", "### Termination: \n", "\n", "$k = p$\n", "\n", "We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sequential Forward Floating Selection (SFFS)\n", "\n", "### Input: \n", "\n", "The set of all features: $Y = \\{y_1, y_2, \\ldots, y_d\\}$\n", "\n", "* The SFFS algorithm takes the whole feature set as input.\n", "\n", "### Output: \n", "\n", "$X_k = \\{x_j \\mid j = 1, 2, \\ldots, k;\\ x_j \\in Y\\}$, where $k = (0, 1, 2, \\ldots, d)$\n", "\n", "* The returned output of the algorithm is a subset of the feature space of a 
specified size.\n", "\n", "### Initialization: \n", "\n", "$X_0 = \\emptyset$, $k = 0$\n", "\n", "* We initialize the algorithm with an empty set (\"null set\") so that $k = 0$ (where $k$ is the size of the subset).\n", "\n", "\n", "### Step 1 (Inclusion):\n", "\n", "$x^+ = \\arg\\max J(X_k + x)$, where $x \\in Y - X_k$\n", "\n", "$X_{k+1} = X_k + x^+$\n", "\n", "$k = k + 1$\n", "\n", "#### Go to Step 2\n", "\n", "In step 1, we include the feature from the feature space that leads to the best performance increase for our feature subset (assessed by the criterion function). Then we go over to step 2.\n", "\n", "### Step 2 (Conditional Exclusion):\n", "\n", "$x^- = \\arg\\max J(X_k - x)$, where $x \\in X_k$\n", "\n", "if $J(X_k - x^-) > J(X_k)$:\n", "\n", "$\\quad X_{k-1} = X_k - x^-$\n", "\n", "$\\quad k = k - 1$\n", "\n", "#### Go to Step 1\n", "\n", "In step 2, we only remove a feature if the resulting subset would gain an increase in performance. If $k = 2$ or an improvement cannot be made (i.e., such a feature $x^-$ cannot be found), go back to step 1; else, repeat this step.\n", "\n", "Steps 1 and 2 are repeated until the termination criterion is reached.\n", "\n", "### Termination: \n", "\n", "$k = p$\n", "\n", "We add features to the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori." 
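, "

The inclusion and conditional exclusion steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not mlxtend's implementation: `toy_criterion`, `WEIGHTS`, `PENALTY`, `sfs` and `sffs` are hypothetical names, the criterion values are made up, and the sketch adds two safeguards that the description above does not spell out (the feature that was just added is never the one removed, and a removal is accepted only if it beats the best score recorded so far for that subset size), so that the loop cannot cycle.

```python
# Toy criterion J: every feature has an individual value, and some pairs
# are redundant, so holding both costs a penalty. All numbers are made up.
WEIGHTS = {'a': 30, 'b': 25, 'c': 20, 'd': -5}
PENALTY = {frozenset(['a', 'b']): 20, frozenset(['a', 'c']): 20}


def toy_criterion(subset):
    total = sum(WEIGHTS[f] for f in subset)
    for pair, cost in PENALTY.items():
        if pair <= set(subset):
            total -= cost
    return total


def sfs(features, criterion, p):
    # Plain forward selection: only Step 1 (inclusion).
    selected = []
    while len(selected) < p:
        candidates = [f for f in features if f not in selected]
        best = max(candidates, key=lambda f: criterion(selected + [f]))
        selected = selected + [best]
    return selected


def sffs(features, criterion, p):
    selected = []
    best_by_size = {}  # best score seen so far for each subset size
    while len(selected) < p:
        # Step 1 (inclusion): add the feature that maximizes the criterion.
        candidates = [f for f in features if f not in selected]
        added = max(candidates, key=lambda f: criterion(selected + [f]))
        selected = selected + [added]
        k = len(selected)
        best_by_size[k] = max(criterion(selected),
                              best_by_size.get(k, float('-inf')))
        # Step 2 (conditional exclusion): drop features while that helps.
        while len(selected) > 2:
            removable = [f for f in selected if f != added]
            dropped = max(
                removable,
                key=lambda f: criterion([g for g in selected if g != f]))
            reduced = [g for g in selected if g != dropped]
            score = criterion(reduced)
            if score <= best_by_size.get(len(reduced), float('-inf')):
                break
            selected = reduced
            best_by_size[len(reduced)] = score
    return selected


features = ['a', 'b', 'c', 'd']
print(sfs(features, toy_criterion, 3))   # ['a', 'b', 'c']
print(sffs(features, toy_criterion, 3))  # ['b', 'c', 'd']
```

In this toy setup plain forward selection is stuck with feature `a`, which looks best on its own but is redundant with `b` and `c`; the floating variant drops it again and ends with a subset that scores 40 instead of 35.
"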
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sequential Backward Floating Selection (SBFS)\n", "\n", "### Input: \n", "\n", "The set of all features: $Y = \\{y_1, y_2, \\ldots, y_d\\}$\n", "\n", "* The SBFS algorithm takes the whole feature set as input.\n", "\n", "### Output: \n", "\n", "$X_k = \\{x_j \\mid j = 1, 2, \\ldots, k;\\ x_j \\in Y\\}$, where $k = (0, 1, 2, \\ldots, d)$\n", "\n", "* SBFS returns a subset of $k$ features.\n", "\n", "### Initialization: \n", "\n", "$X_0 = Y$, $k = d$\n", "\n", "* We initialize the algorithm with the given feature set so that $k = d$.\n", "\n", "\n", "### Step 1 (Exclusion):\n", "\n", "$x^- = \\arg\\max J(X_k - x)$, where $x \\in X_k$\n", "\n", "$X_{k-1} = X_k - x^-$\n", "\n", "$k = k - 1$\n", "\n", "#### Go to Step 2\n", "\n", "* In this step, we remove a feature, $x^-$, from our feature subset $X_k$.\n", "* $x^-$ is the feature that maximizes our criterion function upon removal, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.\n", "\n", "### Step 2 (Conditional Inclusion):\n", "\n", "$x^+ = \\arg\\max J(X_k + x)$, where $x \\in Y - X_k$\n", "\n", "if $J(X_k + x^+) > J(X_k)$:\n", "\n", "$\\quad X_{k+1} = X_k + x^+$\n", "\n", "$\\quad k = k + 1$\n", "\n", "#### Go to Step 1\n", "\n", "In step 2, we search for features that improve the classifier performance if they are added back to the feature subset. If such features exist, we add the feature $x^+$ for which the performance improvement is maximized. If $k = 2$ or an improvement cannot be made (i.e., such a feature $x^+$ cannot be found), go back to step 1; else, repeat this step.\n", "\n", "### Termination: \n", "\n", "$k = p$\n", "\n", "We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. 
SequentialFeatureSelector object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the documentation and parameters of the SFS object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**SequentialFeatureSelector**(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2\\*n_jobs', clone_estimator=True)\n", "\n", "Sequential Feature Selection for Classification and Regression.\n", "\n", "_**Parameters**_\n", "\n", "**estimator** : scikit-learn classifier or regressor\n", "\n", "**k_features** : int or tuple or str (default: 1)\n", "\n", "Number of features to select, where k_features < the full feature set. A tuple containing a min and max value can be provided, and the SFS will return the feature combination between min and max that scored highest in cross-validation. For example, the tuple (1, 4) will return the best combination of 1 up to 4 features instead of a fixed number of features k. Alternatively, the string argument \"best\" or \"parsimonious\" can be provided. If \"best\" is provided, the feature selector will return the feature subset with the best cross-validation performance. If \"parsimonious\" is provided as an argument, the smallest feature subset that is within one standard error of the cross-validation performance will be selected.\n", "\n", "**forward** : bool (default: True)\n", "\n", "Forward selection if True, backward selection otherwise.\n", "\n", "**floating** : bool (default: False)\n", "\n", "Adds a conditional exclusion/inclusion if True.\n", "\n", "**verbose** : int (default: 0), level of verbosity to use in logging.\n", "\n", "If 0, no output; if 1, the number of features in the current set; if 2, detailed logging including timestamp and cv scores at each step.\n", "\n", "**scoring** : str, callable, or None (default: None)\n", "\n", "If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. 
If str, uses a sklearn scoring metric string identifier, for example {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'} for regressors.\n", "\n", "**cv** : int (default: 5)\n", "\n", "Integer or iterable yielding train, test splits. If cv is an integer and the estimator is a classifier (or y consists of integer class labels), stratified k-fold is performed. Otherwise, regular k-fold cross-validation is performed. No cross-validation if cv is None, False, or 0.\n", "\n", "**n_jobs** : int (default: 1)\n", "\n", "The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.\n", "\n", "**pre_dispatch** : int, or string (default: '2\\*n_jobs')\n", "\n", "Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:\n", "\n", "* None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs\n", "* An int, giving the exact number of total jobs that are spawned\n", "* A string, giving an expression as a function of n_jobs, as in '2\\*n_jobs'\n", "\n", "**clone_estimator** : bool (default: True)\n", "\n", "Clones estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn't implement scikit-learn's set_params and get_params methods. In that case, it is also required to set cv=0 and n_jobs=1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Logistic Regression with feature selection by mlxtend.sfs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this article we will use the toy sklearn dataset **\"breast_cancer\"** (a binary classification task). Let's load the dataset and take a look at the data." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.datasets import load_breast_cancer\n", "data = load_breast_cancer()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df, y = pd.DataFrame(data=data.data, columns = data.feature_names), data.target " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 569 entries, 0 to 568\n", "Data columns (total 30 columns):\n", "mean radius 569 non-null float64\n", "mean texture 569 non-null float64\n", "mean perimeter 569 non-null float64\n", "mean area 569 non-null float64\n", "mean smoothness 569 non-null float64\n", "mean compactness 569 non-null float64\n", "mean concavity 569 non-null float64\n", "mean concave points 569 non-null float64\n", "mean symmetry 569 non-null float64\n", "mean fractal dimension 569 non-null float64\n", "radius error 569 non-null float64\n", "texture error 569 non-null float64\n", "perimeter error 569 non-null float64\n", "area error 569 non-null float64\n", "smoothness error 569 non-null float64\n", "compactness error 569 non-null float64\n", "concavity error 569 non-null float64\n", "concave points error 569 non-null float64\n", "symmetry error 569 non-null float64\n", "fractal dimension error 569 non-null float64\n", "worst radius 569 non-null float64\n", "worst texture 569 non-null float64\n", "worst perimeter 569 non-null float64\n", "worst area 569 non-null float64\n", "worst smoothness 569 non-null float64\n", "worst compactness 569 non-null float64\n", "worst concavity 569 non-null float64\n", "worst concave points 569 non-null float64\n", "worst symmetry 569 non-null float64\n", "worst fractal dimension 569 non-null float64\n", "dtypes: float64(30)\n", "memory usage: 133.4 KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", 
"execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "columns = df.columns" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean radius mean texture mean perimeter mean area mean smoothness \\\n", "0 17.99 10.38 122.80 1001.0 0.11840 \n", "1 20.57 17.77 132.90 1326.0 0.08474 \n", "2 19.69 21.25 130.00 1203.0 0.10960 \n", "3 11.42 20.38 77.58 386.1 0.14250 \n", "4 20.29 14.34 135.10 1297.0 0.10030 \n", "\n", " mean compactness mean concavity mean concave points mean symmetry \\\n", "0 0.27760 0.3001 0.14710 0.2419 \n", "1 0.07864 0.0869 0.07017 0.1812 \n", "2 0.15990 0.1974 0.12790 0.2069 \n", "3 0.28390 0.2414 0.10520 0.2597 \n", "4 0.13280 0.1980 0.10430 0.1809 \n", "\n", " mean fractal dimension ... worst radius \\\n", "0 0.07871 ... 25.38 \n", "1 0.05667 ... 24.99 \n", "2 0.05999 ... 23.57 \n", "3 0.09744 ... 14.91 \n", "4 0.05883 ... 22.54 \n", "\n", " worst texture worst perimeter worst area worst smoothness \\\n", "0 17.33 184.60 2019.0 0.1622 \n", "1 23.41 158.80 1956.0 0.1238 \n", "2 25.53 152.50 1709.0 0.1444 \n", "3 26.50 98.87 567.7 0.2098 \n", "4 16.67 152.20 1575.0 0.1374 \n", "\n", " worst compactness worst concavity worst concave points worst symmetry \\\n", "0 0.6656 0.7119 0.2654 0.4601 \n", "1 0.1866 0.2416 0.1860 0.2750 \n", "2 0.4245 0.4504 0.2430 0.3613 \n", "3 0.8663 0.6869 0.2575 0.6638 \n", "4 0.2050 0.4000 0.1625 0.2364 \n", "\n", " worst fractal dimension \n", "0 0.11890 \n", "1 0.08902 \n", "2 0.08758 \n", "3 0.17300 \n", "4 0.07678 \n", "\n", "[5 rows x 30 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.62741652021089633" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(y) / len(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 30 float not-null features and 569 examples. 62,7% of examples have class 1 and 37,3% of examples have class 0. 
Classes are not heavily imbalanced, so accuracy is a suitable metric for us." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use LogisticRegression as our base algorithm. First, we scale the data." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "scaler = StandardScaler()\n", "df_scaled = scaler.fit_transform(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we create a cross-validation object with 5 folds and a fixed random_state." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cv = KFold(n_splits=5, shuffle=True, random_state=17)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the CV result without any parameter tuning." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.98417947523676441" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logit = LogisticRegression()\n", "cross_val_score(logit, df_scaled, y, cv=cv).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The CV result is **0.984**.\n", "It will be our baseline. Let's try to beat it with feature selection.\n", "\n", "We will use Sequential Backward Selection, so we set the **forward** and **floating** parameters to **False**.\n", "\n", "Sequential Backward Selection means that:\n", "\n", "* We start with all K features (in our dataset K=30)\n", "* On iteration n we fit the estimator on every candidate subset of K-n features and keep the subset with the best score\n", "\n", "We set the k_features parameter to the tuple (1, K), so the fit_transform method returns the subset with between 1 and 30 features that has the best CV score." 
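, "

The backward procedure combined with a (min, max) range for the subset size can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `toy_score`, `WEIGHTS` and `backward_selection` are hypothetical names, the additive score is a made-up stand-in for a cross-validated accuracy, and the sketch also scores the full feature set so that it can be returned if it happens to be the best; it does not claim to mirror mlxtend's internals.

```python
WEIGHTS = {'x1': 5, 'x2': 3, 'noise1': -1, 'noise2': -2}


def toy_score(subset):
    # Stand-in for a cross-validated metric: useful features add to the
    # score and noisy ones subtract from it. The numbers are made up.
    return sum(WEIGHTS[f] for f in subset)


def backward_selection(features, score, k_min, k_max):
    current = list(features)
    # Remember the subset and score obtained for every size we visit.
    history = {len(current): (list(current), score(current))}
    while len(current) > k_min:
        # Try dropping each feature and keep the best reduced subset.
        dropped = max(
            current,
            key=lambda f: score([g for g in current if g != f]))
        current = [g for g in current if g != dropped]
        history[len(current)] = (list(current), score(current))
    # Among the sizes allowed by (k_min, k_max), return the best subset.
    allowed = [k for k in history if k_min <= k <= k_max]
    best_size = max(allowed, key=lambda k: history[k][1])
    return history[best_size]


subset, value = backward_selection(list(WEIGHTS), toy_score, 1, 4)
print(subset, value)  # ['x1', 'x2'] 8
```

The search walks all the way down to one feature, but the returned subset is the one with the best score among all visited sizes, which is the behaviour the tuple form of k_features is meant to give.
"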
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logit = LogisticRegression()\n", "sbs = SequentialFeatureSelector(logit,\n", "                                k_features=(1, 30),\n", "                                forward=False,\n", "                                floating=False,\n", "                                verbose=2,\n", "                                scoring='accuracy',\n", "                                cv=cv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The log reports the CV score at each iteration. The best quality is reached with the subset of 15 features and with the subsets of 17 to 24 features." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 1.3s finished\n", "\n", "[2018-12-12 00:08:17] Features: 29/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 29 out of 29 | elapsed: 1.2s finished\n", "\n", "[2018-12-12 00:08:19] Features: 28/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 1.1s finished\n", "\n", "[2018-12-12 00:08:20] Features: 27/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 1.1s finished\n", "\n", "[2018-12-12 00:08:21] Features: 26/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 1.0s finished\n", "\n", "[2018-12-12 00:08:22] Features: 25/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 0.9s finished\n", "\n", "[2018-12-12 00:08:23] Features: 24/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", 
"[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.8s finished\n", "\n", "[2018-12-12 00:08:24] Features: 23/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.8s finished\n", "\n", "[2018-12-12 00:08:25] Features: 22/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 22 out of 22 | elapsed: 0.7s finished\n", "\n", "[2018-12-12 00:08:26] Features: 21/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 21 out of 21 | elapsed: 0.7s finished\n", "\n", "[2018-12-12 00:08:26] Features: 20/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.6s finished\n", "\n", "[2018-12-12 00:08:27] Features: 19/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:28] Features: 18/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:28] Features: 17/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:29] Features: 16/1 -- score: 0.985949386741[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:08:29] Features: 15/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 
00:08:30] Features: 14/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 14 out of 14 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:08:30] Features: 13/1 -- score: 0.982425089272[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:08:31] Features: 12/1 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:08:31] Features: 11/1 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 11 out of 11 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:08:31] Features: 10/1 -- score: 0.982440614811[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:08:32] Features: 9/1 -- score: 0.978916317342[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:08:32] Features: 8/1 -- score: 0.978900791803[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:08:32] Features: 7/1 -- score: 0.980639652228[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:08:32] Features: 6/1 -- score: 0.978885266263[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:32] Features: 5/1 -- score: 0.977146405838[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s 
remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:32] Features: 4/1 -- score: 0.975376494333[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:32] Features: 3/1 -- score: 0.963080267039[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:33] Features: 2/1 -- score: 0.954261760596[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:33] Features: 1/1 -- score: 0.920835274026" ] } ], "source": [ "X_sbs = sbs.fit_transform(df_scaled, y, custom_feature_names=columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting results:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_sfs(sbs.get_metric_dict(), kind='std_dev');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SBS returns subset of dataframe with optimal K features" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(569, 24)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_sbs.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is the subset of selected feature names:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('mean radius',\n", " 'mean texture',\n", " 'mean area',\n", " 'mean smoothness',\n", " 'mean concavity',\n", " 'mean concave points',\n", " 'mean symmetry',\n", " 'mean fractal dimension',\n", " 'radius error',\n", " 'texture error',\n", " 'area error',\n", " 'smoothness error',\n", " 'compactness error',\n", " 'concave points error',\n", " 'symmetry error',\n", " 'fractal dimension error',\n", " 'worst radius',\n", " 'worst texture',\n", " 'worst area',\n", " 'worst smoothness',\n", " 'worst concavity',\n", " 'worst concave points',\n", " 'worst symmetry',\n", " 'worst fractal dimension')" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sbs.k_feature_names_" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The best quality is 0.9877037727061015 with 24 features in dataset'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'The best quality is {} with {} features in dataset'.format(sbs.k_score_, len(sbs.k_feature_idx_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Quality is increased! 
***0.984 -> 0.988***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save the scores to a dict and try the other SFS algorithms." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sbs_dict = dict()\n", "for i in sbs.subsets_.values():\n", "    sbs_dict[len(i['feature_names'])] = i['avg_score']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's try Sequential Forward Selection, so we set the forward parameter to **True**.

\n", "Sequential Forward Selection means that:\n", "\n", "* We will **start with 0 features**\n", "* On iteration N we fit the estimator on each candidate subset of N features (the current subset plus one new feature) and keep the subset with the best score" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logit = LogisticRegression()\n", "sfs = SequentialFeatureSelector(logit, \n", " k_features=(1, 30), \n", " forward=True, \n", " floating=False, \n", " verbose=2,\n", " scoring='accuracy',\n", " cv=cv)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:34] Features: 1/30 -- score: 0.920835274026[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 29 out of 29 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:35] Features: 2/30 -- score: 0.954261760596[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:36] Features: 3/30 -- score: 0.966589038969[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:36] Features: 4/30 -- score: 0.971867722403[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:37] Features: 5/30 -- score: 0.975376494333[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:37] Features: 6/30 -- score: 0.977115354759[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 
0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:38] Features: 7/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:39] Features: 8/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 22 out of 22 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:39] Features: 9/30 -- score: 0.978900791803[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 21 out of 21 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:40] Features: 10/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:08:40] Features: 11/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:08:41] Features: 12/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:08:41] Features: 13/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:08:42] Features: 14/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:08:42] Features: 15/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 0.4s finished\n", "\n", 
"[2018-12-12 00:08:43] Features: 16/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 14 out of 14 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:08:43] Features: 17/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:08:44] Features: 18/30 -- score: 0.978900791803[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:08:44] Features: 19/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 11 out of 11 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:08:45] Features: 20/30 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:08:45] Features: 21/30 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:08:45] Features: 22/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:08:46] Features: 23/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:08:46] Features: 24/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:08:46] Features: 25/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 
1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:08:46] Features: 26/30 -- score: 0.984163949697[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:08:47] Features: 27/30 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:47] Features: 28/30 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:47] Features: 29/30 -- score: 0.984163949697[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:47] Features: 30/30 -- score: 0.984179475237" ] } ], "source": [ "X_sfs = sfs.fit_transform(df_scaled, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting results:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_sfs(sfs.get_metric_dict(), kind='std_dev');" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The best quality is 0.9841794752367644 with 30 features in dataset'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'The best quality is {} with {} features in dataset'.format(sfs.k_score_, len(sfs.k_feature_idx_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the quality is equal to our baseline. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why quality of SFS is worse, than SBS? \n", "We use \"Forward\" algorithm, so on first iteration we select one feature and fit estimator with it. It's obviously, that finding \"the best\" feature fitting one dimensional dataset is not very effective. More than that, in Sequential Forward Selection we can't remove feature once added.

Let's find the \"bad feature\" that was added to our subset and the iteration at which this happened." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "We add \"bad feature\" # {5} on 8 iteration stage\n" ] } ], "source": [ "sbs_feat = set(sbs.subsets_[24]['feature_idx'])  # best feature set found by SBS\n", "for i in range(1, 30):\n", "    sfs_feat = set(sfs.subsets_[i]['feature_idx'])  # feature set at iteration i of SFS\n", "    if sfs_feat - sbs_feat:  # SFS picked a feature that the best SBS subset does not contain\n", "        print('We add \"bad feature\" # {} on {} iteration stage'.format(sfs_feat - sbs_feat, i))\n", "        break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the results of each iteration too" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sfs_dict = dict()\n", "for i in sfs.subsets_.values():\n", "    sfs_dict[len(i['feature_names'])] = i['avg_score']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will try Sequential Forward Floating Selection by switching the `floating` parameter to True. 
After each forward step it adds a conditional exclusion step that can remove a previously added feature if doing so improves the score" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logit = LogisticRegression()\n", "sffs = SequentialFeatureSelector(logit, \n", " k_features=(1, 30), \n", " forward=True, \n", " floating=True, \n", " verbose=2,\n", " scoring='accuracy',\n", " cv=cv)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:08:49] Features: 1/30 -- score: 0.920835274026[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 29 out of 29 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:49] Features: 2/30 -- score: 0.954261760596[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:50] Features: 3/30 -- score: 0.966589038969[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:50] Features: 4/30 -- score: 0.971867722403[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 
0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:51] Features: 5/30 -- score: 0.975376494333[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:52] Features: 6/30 -- score: 0.977115354759[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:08:53] Features: 7/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:08:53] Features: 8/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 22 out of 22 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:08:54] Features: 9/30 -- score: 0.978900791803[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 21 out of 21 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:08:55] Features: 10/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s 
remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:08:56] Features: 11/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 11 out of 11 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:08:57] Features: 12/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:08:57] Features: 13/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:08:58] Features: 14/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 14 out of 14 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:08:59] Features: 15/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 15 out of 15 | 
elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:00] Features: 16/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 14 out of 14 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:01] Features: 17/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:03] Features: 17/30 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:09:04] Features: 18/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.3s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:09:05] Features: 19/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 11 out of 11 | elapsed: 0.3s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.6s finished\n", "\n", "[2018-12-12 00:09:06] 
Features: 20/30 -- score: 0.978900791803[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.3s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.6s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.6s finished\n", "\n", "[2018-12-12 00:09:08] Features: 20/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.3s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.6s finished\n", "\n", "[2018-12-12 00:09:09] Features: 21/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 21 out of 21 | elapsed: 0.7s finished\n", "\n", "[2018-12-12 00:09:10] Features: 22/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 22 out of 22 | elapsed: 0.8s finished\n", "\n", "[2018-12-12 00:09:11] Features: 23/30 -- score: 0.978885266263[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.8s finished\n", "\n", "[2018-12-12 00:09:12] Features: 
24/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.9s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.8s finished\n", "\n", "[2018-12-12 00:09:14] Features: 24/30 -- score: 0.982425089272[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.9s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.8s finished\n", "\n", "[2018-12-12 00:09:17] Features: 24/30 -- score: 0.984179475237[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.9s finished\n", "\n", "[2018-12-12 00:09:18] Features: 25/30 -- score: 0.984179475237[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 1.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.9s finished\n", "\n", "[2018-12-12 00:09:20] Features: 25/30 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 
5 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 1.0s finished\n", "\n", "[2018-12-12 00:09:21] Features: 26/30 -- score: 0.984179475237[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 1.0s finished\n", "\n", "[2018-12-12 00:09:23] Features: 27/30 -- score: 0.984179475237[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 1.1s finished\n", "\n", "[2018-12-12 00:09:24] Features: 28/30 -- score: 0.984179475237[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 1.2s finished\n", "\n", "[2018-12-12 00:09:25] Features: 29/30 -- score: 0.984179475237[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 29 out of 29 | elapsed: 1.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 1.2s finished\n", "\n", "[2018-12-12 00:09:28] Features: 29/30 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | 
elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 29 out of 29 | elapsed: 1.2s finished\n", "\n", "[2018-12-12 00:09:29] Features: 30/30 -- score: 0.984179475237" ] } ], "source": [ "X_sffs = sffs.fit_transform(df_scaled, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting results:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_sfs(sffs.get_metric_dict(), kind='std_dev');" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The best quality is 0.9859338612016767 with 25 features in dataset'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'The best quality is {} with {} features in dataset'.format(sffs.k_score_, len(sffs.k_feature_idx_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The quality is little higher, that SFS one, but SBS is the best algortihm today. Saving results to dict and lets try the last implementation - Sequential Backward Floating Selection" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sffs_dict = dict()\n", "for i in sffs.subsets_.values():\n", " sffs_dict[len(i['feature_names'])] = i['avg_score']" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logit = LogisticRegression()\n", "sbfs = SequentialFeatureSelector(logit, \n", " k_features=(1, 30), \n", " forward=False, \n", " floating=True, \n", " verbose=2,\n", " scoring='accuracy',\n", " cv=cv)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 1.3s finished\n", "\n", "[2018-12-12 00:09:32] Features: 29/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 29 out of 29 | elapsed: 1.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:09:33] Features: 28/1 -- score: 
0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 1.1s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:09:35] Features: 27/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 1.1s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "\n", "[2018-12-12 00:09:36] Features: 26/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 1.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:09:37] Features: 25/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 0.9s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:09:38] Features: 24/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.9s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.1s finished\n", "\n", "[2018-12-12 00:09:40] Features: 23/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.8s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 
0.0s\n", "[Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:09:41] Features: 22/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 22 out of 22 | elapsed: 0.7s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:09:42] Features: 21/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 21 out of 21 | elapsed: 0.7s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.2s finished\n", "\n", "[2018-12-12 00:09:43] Features: 20/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.6s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:09:44] Features: 19/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.6s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 11 out of 11 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:09:45] Features: 18/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:09:46] Features: 17/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", 
"[Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.3s finished\n", "\n", "[2018-12-12 00:09:47] Features: 16/1 -- score: 0.985949386741[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 14 out of 14 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:48] Features: 15/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:49] Features: 14/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 14 out of 14 | elapsed: 0.3s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:51] Features: 14/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 14 out of 14 | elapsed: 0.3s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:51] Features: 13/1 -- score: 0.982425089272[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.3s finished\n", 
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:53] Features: 13/1 -- score: 0.984179475237[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.3s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:54] Features: 12/1 -- score: 0.982425089272[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:54] Features: 11/1 -- score: 0.984179475237[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 11 out of 11 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:55] Features: 10/1 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:09:57] Features: 10/1 -- score: 
0.984163949697[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.4s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:09:58] Features: 10/1 -- score: 0.984163949697[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:09:59] Features: 9/1 -- score: 0.984163949697[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 21 out of 21 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:00] Features: 8/1 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 22 out of 22 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:01] Features: 7/1 -- score: 0.978885266263[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:01] Features: 6/1 -- score: 0.977130880298[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 
0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:10:03] Features: 6/1 -- score: 0.977130880298[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.4s finished\n", "\n", "[2018-12-12 00:10:03] Features: 5/1 -- score: 0.975376494333[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 0.6s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:05] Features: 6/1 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:06] Features: 5/1 -- score: 0.977146405838[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 
0.0s\n", "[Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:07] Features: 4/1 -- score: 0.975376494333[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:07] Features: 3/1 -- score: 0.964834653004[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 0.5s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:09] Features: 3/1 -- score: 0.968358950474[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:09] Features: 2/1 -- score: 0.954261760596[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 0.5s finished\n", "\n", "[2018-12-12 00:10:10] Features: 1/1 -- score: 0.915572116131" ] } ], "source": [ "X_sbfs = sbfs.fit_transform(df_scaled, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting results:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_sfs(sbfs.get_metric_dict(), kind='std_dev');" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The best quality is 0.9877037727061015 with 24 features in dataset'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'The best quality is {} with {} features in dataset'.format(sbfs.k_score_, len(sbfs.k_feature_idx_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The quality of SBS and SBFS algorithms is equal in our example. But sometimes it increased." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sbfs_dict = dict()\n", "for i in sbfs.subsets_.values():\n", " sbfs_dict[len(i['feature_names'])] = i['avg_score']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Comparing results with RFE and PCA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trying another feature selection and dimensional reducing algorithms" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dict_pca = dict()\n", "for i in range(1, 31):\n", " pca = PCA(n_components = i)\n", " df_pca = pca.fit_transform(df_scaled, y)\n", " logit = LogisticRegression()\n", " score = cross_val_score(logit, df_pca, y, cv = cv).mean()\n", " dict_pca[i] = score" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The best quality is 0.9841794752367644 with 18 features in dataset'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'The best quality is {} with {} features in dataset'.format(max(dict_pca.values()), max(dict_pca, key=dict_pca.get))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accuracy metric is lower on PCA dataset" ] }, { "cell_type": 
"code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dict_rfe = dict()\n", "for i in range(1, 31):\n", " rfe = RFE(logit, n_features_to_select=i)\n", " df_rfe = rfe.fit_transform(df_scaled, y)\n", " logit = LogisticRegression()\n", " score = cross_val_score(logit, df_rfe, y, cv = cv).mean()\n", " dict_rfe[i] = score" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The best quality is 0.9841794752367644 with 24 features in dataset'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'The best quality is {} with {} features in dataset'.format(max(dict_rfe.values()), max(dict_rfe, key=dict_rfe.get))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "RFE quality is lower too. RFE is computationally less complex using the feature weight coefficients (e.g., linear models) or feature importance (tree-based algorithms) to eliminate features recursively, whereas SFSs eliminate (or add) features based on a user-defined classifier/regression performance metric." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing CV scores of all algorithms" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", 
"  <thead>\n", 
"    <tr><th></th><th>RFE</th><th>PCA</th><th>SBS</th></tr>\n", 
"  </thead>\n", 
"  <tbody>\n", 
"    <tr><th>1</th><td>0.920835</td><td>0.913818</td><td>0.920835</td></tr>\n", 
"    <tr><th>2</th><td>0.950815</td><td>0.950753</td><td>0.954262</td></tr>\n", 
"    <tr><th>3</th><td>0.949045</td><td>0.947291</td><td>0.963080</td></tr>\n", 
"    <tr><th>4</th><td>0.945521</td><td>0.968374</td><td>0.975376</td></tr>\n", 
"    <tr><th>5</th><td>0.970144</td><td>0.973591</td><td>0.977146</td></tr>\n", 
"    <tr><th>6</th><td>0.968374</td><td>0.975392</td><td>0.978885</td></tr>\n", 
"    <tr><th>7</th><td>0.968359</td><td>0.975392</td><td>0.980640</td></tr>\n", 
"    <tr><th>8</th><td>0.970129</td><td>0.978885</td><td>0.978901</td></tr>\n", 
"    <tr><th>9</th><td>0.966605</td><td>0.982425</td><td>0.978916</td></tr>\n", 
"    <tr><th>10</th><td>0.968390</td><td>0.980655</td><td>0.982441</td></tr>\n", 
"    <tr><th>11</th><td>0.975408</td><td>0.978901</td><td>0.980655</td></tr>\n", 
"    <tr><th>12</th><td>0.973653</td><td>0.975392</td><td>0.980655</td></tr>\n", 
"    <tr><th>13</th><td>0.971883</td><td>0.975376</td><td>0.982425</td></tr>\n", 
"    <tr><th>14</th><td>0.971883</td><td>0.982410</td><td>0.985934</td></tr>\n", 
"    <tr><th>15</th><td>0.975376</td><td>0.982410</td><td>0.987704</td></tr>\n", 
"    <tr><th>16</th><td>0.977146</td><td>0.982410</td><td>0.985949</td></tr>\n", 
"    <tr><th>17</th><td>0.980655</td><td>0.980655</td><td>0.987704</td></tr>\n", 
"    <tr><th>18</th><td>0.980655</td><td>0.984179</td><td>0.987704</td></tr>\n", 
"    <tr><th>19</th><td>0.982410</td><td>0.982425</td><td>0.987704</td></tr>\n", 
"    <tr><th>20</th><td>0.982410</td><td>0.982425</td><td>0.987704</td></tr>\n", 
"    <tr><th>21</th><td>0.982410</td><td>0.984179</td><td>0.987704</td></tr>\n", 
"    <tr><th>22</th><td>0.982410</td><td>0.982425</td><td>0.987704</td></tr>\n", 
"    <tr><th>23</th><td>0.982425</td><td>0.982425</td><td>0.987704</td></tr>\n", 
"    <tr><th>24</th><td>0.984179</td><td>0.984179</td><td>0.987704</td></tr>\n", 
"    <tr><th>25</th><td>0.984179</td><td>0.984179</td><td>0.985934</td></tr>\n", 
"    <tr><th>26</th><td>0.984179</td><td>0.984179</td><td>0.985934</td></tr>\n", 
"    <tr><th>27</th><td>0.984179</td><td>0.984179</td><td>0.985934</td></tr>\n", 
"    <tr><th>28</th><td>0.984179</td><td>0.984179</td><td>0.985934</td></tr>\n", 
"    <tr><th>29</th><td>0.984179</td><td>0.984179</td><td>0.985934</td></tr>\n", 
"    <tr><th>30</th><td>0.984179</td><td>0.984179</td><td>0.984179</td></tr>\n", 
"  </tbody>\n", 
"</table>\n", 
"</div>
" ], "text/plain": [ " RFE PCA SBS\n", "1 0.920835 0.913818 0.920835\n", "2 0.950815 0.950753 0.954262\n", "3 0.949045 0.947291 0.963080\n", "4 0.945521 0.968374 0.975376\n", "5 0.970144 0.973591 0.977146\n", "6 0.968374 0.975392 0.978885\n", "7 0.968359 0.975392 0.980640\n", "8 0.970129 0.978885 0.978901\n", "9 0.966605 0.982425 0.978916\n", "10 0.968390 0.980655 0.982441\n", "11 0.975408 0.978901 0.980655\n", "12 0.973653 0.975392 0.980655\n", "13 0.971883 0.975376 0.982425\n", "14 0.971883 0.982410 0.985934\n", "15 0.975376 0.982410 0.987704\n", "16 0.977146 0.982410 0.985949\n", "17 0.980655 0.980655 0.987704\n", "18 0.980655 0.984179 0.987704\n", "19 0.982410 0.982425 0.987704\n", "20 0.982410 0.982425 0.987704\n", "21 0.982410 0.984179 0.987704\n", "22 0.982410 0.982425 0.987704\n", "23 0.982425 0.982425 0.987704\n", "24 0.984179 0.984179 0.987704\n", "25 0.984179 0.984179 0.985934\n", "26 0.984179 0.984179 0.985934\n", "27 0.984179 0.984179 0.985934\n", "28 0.984179 0.984179 0.985934\n", "29 0.984179 0.984179 0.985934\n", "30 0.984179 0.984179 0.984179" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(data = [pd.Series(dict_rfe),pd.Series(dict_pca), pd.Series(sbs_dict)], index = ['RFE', 'PCA', 'SBS']).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maximum score we got with SBS and minimum 15 features in subset. RFE is worse with any number of features. PCA is better only with 9 features in subset. RFE and PCA could not find subset of features with score more than full dataset's score. SBS del with it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

**How we will choose features after this tutorial :)**

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Of course, RFE, PCA and SBS solve slightly different tasks. It's important to know how and when we should implement one or another instrument. And more important is to have an inquiring mind \n", "and test the craziest hypotheses :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we studied something new about feature selection, understood how SequentialFeatureSelector from Mlxtend library works - it allows very easy selection from new generated features and boost model's quality. Then we compared it with another feature selection and dimension reducing algorithms.

\n", "Beginner data scientists often pay little attention to feature selection and try many different models instead. But feature selection can boost the model score considerably. It's nearly impossible to reach the top on Kaggle without ~~stacking xgboost~~ careful feature engineering and selecting the best features.\n", "\n", "Save the best, delete the rest! That's all, folks!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Featured links\n", "Official Mlxtend Docs https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }