{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ColumnSelector: Scikit-learn utility function to select specific columns in a pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Implementation of a column selector class for scikit-learn pipelines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> from mlxtend.feature_selection import ColumnSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ColumnSelector` can be used for \"manual\" feature selection, e.g., as part of a grid search via a scikit-learn pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n", "\n", "-" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1 - Fitting an Estimator on a Feature Subset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load a simple benchmark dataset:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ColumnSelector` is a simple transformer class that selects specific columns (features) from a dataset. 
For instance, calling the `transform` method returns a reduced dataset that contains only two features (here, the first two, selected via the indices 0 and 1):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 2)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlxtend.feature_selection import ColumnSelector\n", "\n", "col_selector = ColumnSelector(cols=(0, 1))\n", "# col_selector.fit(X) # optional, does not do anything\n", "col_selector.transform(X).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`ColumnSelector` works with both NumPy arrays and pandas DataFrames:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |\n",
"|---|---|---|---|---|\n",
"| 0 | 5.1 | 3.5 | 1.4 | 0.2 |\n",
"| 1 | 4.9 | 3.0 | 1.4 | 0.2 |\n",
"| 2 | 4.7 | 3.2 | 1.3 | 0.2 |\n",
"| 3 | 4.6 | 3.1 | 1.5 | 0.2 |\n",
"| 4 | 5.0 | 3.6 | 1.4 | 0.2 |\n", "