{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# SVM-Anova: SVM with univariate feature selection\n\nThis example shows how to perform univariate feature selection before running a\nSVC (support vector classifier) to improve the classification scores. We use\nthe iris dataset (4 features) and add 36 non-informative features. We can find\nthat our model achieves best performance when we select around 10% of features.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load some data to play with\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.datasets import load_iris\n\nX, y = load_iris(return_X_y=True)\n\n# Add non-informative features\nrng = np.random.RandomState(0)\nX = np.hstack((X, 2 * rng.random((X.shape[0], 36))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the pipeline\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_selection import SelectPercentile, f_classif\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\n# Create a feature-selection transform, a scaler and an instance of SVM that we\n# combine together to have a full-blown estimator\n\nclf = Pipeline(\n [\n (\"anova\", SelectPercentile(f_classif)),\n (\"scaler\", StandardScaler()),\n (\"svc\", SVC(gamma=\"auto\")),\n ]\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot the cross-validation score as a function of percentile of features\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfrom sklearn.model_selection import cross_val_score\n\nscore_means = list()\nscore_stds = list()\npercentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)\n\nfor percentile in percentiles:\n clf.set_params(anova__percentile=percentile)\n this_scores = cross_val_score(clf, X, y)\n score_means.append(this_scores.mean())\n score_stds.append(this_scores.std())\n\nplt.errorbar(percentiles, score_means, np.array(score_stds))\nplt.title(\"Performance of the SVM-Anova varying the percentile of features selected\")\nplt.xticks(np.linspace(0, 100, 11, endpoint=True))\nplt.xlabel(\"Percentile\")\nplt.ylabel(\"Accuracy Score\")\nplt.axis(\"tight\")\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }