{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ColumnSelector: Scikit-learn utility function to select specific columns in a pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Implementation of a column selector class for scikit-learn pipelines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> from mlxtend.feature_selection import ColumnSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ColumnSelector` can be used for \"manual\" feature selection, e.g., as part of a grid search via a scikit-learn pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n", "\n", "-" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1 - Fitting an Estimator on a Feature Subset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load a simple benchmark dataset:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ColumnSelector` is a simple transformer class that selects specific columns (features) from a datast. For instance, using the `transform` method returns a reduced dataset that only contains two features (here: the first two features via the indices 0 and 1, respectively):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 2)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlxtend.feature_selection import ColumnSelector\n", "\n", "col_selector = ColumnSelector(cols=(0, 1))\n", "# col_selector.fit(X) # optional, does not do anything\n", "col_selector.transform(X).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`ColumnSelector` works both with numpy arrays and pandas dataframes:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)\n", "iris_df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 2)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col_selector = ColumnSelector(cols=(\"sepal length (cm)\", \"sepal width (cm)\"))\n", "col_selector.transform(iris_df).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can use the `ColumnSelector` as part of a scikit-learn `Pipeline`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.84" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.pipeline import make_pipeline\n", "\n", "\n", "pipe = make_pipeline(StandardScaler(),\n", " ColumnSelector(cols=(0, 1)),\n", " KNeighborsClassifier())\n", "\n", "pipe.fit(X, y)\n", "pipe.score(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2 - Feature Selection via GridSearch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 1 showed a simple useage example of the `ColumnSelector`; however, selecting columns from a dataset is trivial and does not require a specific transformer class since we could have achieved the same results via\n", "\n", "```python\n", "classifier.fit(X[:, :2], y)\n", "classifier.score(X[:, :2], y)\n", "```\n", "\n", "However, the `ColumnSelector` becomes really useful for feature selection as part of a grid search as shown in this example." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load a simple benchmark dataset:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create all possible combinations of feature-column indices:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0,), (1,), (2,), (3,), (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (0, 1, 2, 3)]\n" ] } ], "source": [ "from itertools import combinations\n", "\n", "all_comb = []\n", "for size in range(1, 5):\n", " all_comb += list(combinations(range(X.shape[1]), r=size))\n", "print(all_comb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feature and model selection via grid search:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best parameters: {'columnselector__cols': (2, 3), 'kneighborsclassifier__n_neighbors': 1}\n", "Best performance: 0.98\n" ] } ], "source": [ "from mlxtend.feature_selection import ColumnSelector\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.pipeline import make_pipeline\n", "\n", "pipe = make_pipeline(StandardScaler(),\n", " ColumnSelector(),\n", " KNeighborsClassifier())\n", "\n", "param_grid = {'columnselector__cols': all_comb,\n", " 'kneighborsclassifier__n_neighbors': list(range(1, 11))}\n", "\n", "grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)\n", "grid.fit(X, y)\n", "print('Best parameters:', grid.best_params_)\n", "print('Best performance:', grid.best_score_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 3 - Scaling of a Subset of Features in a scikit-learn Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following example illustrates how we could use the `ColumnSelector` in tandem with scikit-learn's `FeatureUnion` to scale only certain features (in this toy example: the first and second features) of a dataset within a `Pipeline`."
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('feats', FeatureUnion(n_jobs=None,\n", " transformer_list=[('col_1-2', Pipeline(memory=None,\n", " steps=[('columnselector', ColumnSelector(cols=(0, 1), drop_axis=False)), ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])), ('col_3-4', ColumnSelector(cols=(2, 3), drop_axis=Fa...ki',\n", " metric_params=None, n_jobs=None, n_neighbors=5, p=2,\n", " weights='uniform'))])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlxtend.feature_selection import ColumnSelector\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.pipeline import FeatureUnion\n", "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from mlxtend.data import iris_data\n", "\n", "\n", "X, y = iris_data()\n", "\n", "scale_pipe = make_pipeline(ColumnSelector(cols=(0, 1)),\n", " MinMaxScaler())\n", "\n", "pipeline = Pipeline([\n", " ('feats', FeatureUnion([\n", " ('col_1-2', scale_pipe),\n", " ('col_3-4', ColumnSelector(cols=(2, 3)))\n", " ])),\n", " ('clf', KNeighborsClassifier())\n", "])\n", "\n", "\n", "pipeline.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "## ColumnSelector\n", "\n", "*ColumnSelector(cols=None, drop_axis=False)*\n", "\n", "Object for selecting specific columns from a data set.\n", "\n", "**Parameters**\n", "\n", "- `cols` : array-like (default: None)\n", "\n", " A list specifying the feature indices to be selected. For example,\n", " [1, 4, 5] to select the 2nd, 5th, and 6th feature columns.\n", " If None, returns all columns in the array.\n", "\n", "\n", "- `drop_axis` : bool (default=False)\n", "\n", " Drops last axis if True and the only one column is selected. This\n", " is useful, e.g., when the ColumnSelector is used for selecting\n", " only one column and the resulting array should be fed to e.g.,\n", " a scikit-learn column selector. E.g., instead of returning an\n", " array with shape (n_samples, 1), drop_axis=True will return an\n", " aray with shape (n_samples,).\n", "\n", "**Examples**\n", "\n", "For usage examples, please see\n", " [https://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/](https://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/)\n", "\n", "### Methods\n", "\n", "
\n", "\n", "*fit(X, y=None)*\n", "\n", "Mock method. Does nothing.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", "\n", "- `y` : array-like, shape = [n_samples] (default: None)\n", "\n", "\n", "**Returns**\n", "\n", "self\n", "\n", "
\n", "\n", "*fit_transform(X, y=None)*\n", "\n", "Return a slice of the input array.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", "\n", "- `y` : array-like, shape = [n_samples] (default: None)\n", "\n", "\n", "**Returns**\n", "\n", "- `X_slice` : shape = [n_samples, k_features]\n", "\n", " Subset of the feature space where k_features <= n_features\n", "\n", "
\n", "\n", "*get_params(deep=True)*\n", "\n", "Get parameters for this estimator.\n", "\n", "**Parameters**\n", "\n", "- `deep` : boolean, optional\n", "\n", " If True, will return the parameters for this estimator and\n", " contained subobjects that are estimators.\n", "\n", "**Returns**\n", "\n", "- `params` : mapping of string to any\n", "\n", " Parameter names mapped to their values.\n", "\n", "
\n", "\n", "*set_params(**params)*\n", "\n", "Set the parameters of this estimator.\n", "\n", "The method works on simple estimators as well as on nested objects\n", "(such as pipelines). The latter have parameters of the form\n", "``__`` so that it's possible to update each\n", "component of a nested object.\n", "\n", "**Returns**\n", "\n", "self\n", "\n", "
\n", "\n", "*transform(X, y=None)*\n", "\n", "Return a slice of the input array.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", "\n", "- `y` : array-like, shape = [n_samples] (default: None)\n", "\n", "\n", "**Returns**\n", "\n", "- `X_slice` : shape = [n_samples, k_features]\n", "\n", " Subset of the feature space where k_features <= n_features\n", "\n", "

\n" ] } ], "source": [ "with open('../../api_modules/mlxtend.feature_selection/ColumnSelector.md', 'r') as f:\n", " s = f.read() + '

'\n", "print(s)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }