{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Test with permutations the significance of a classification score\n\nThis example demonstrates the use of\n:func:`~sklearn.model_selection.permutation_test_score` to evaluate the\nsignificance of a cross-validated score using permutations.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n\nWe will use the `iris_dataset`, which consists of measurements taken\nfrom 3 Iris species. Our model will use the measurements to predict\nthe iris species.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n\niris = load_iris()\nX = iris.data\ny = iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, we also generate some random feature data (i.e., 20 features),\nuncorrelated with the class labels in the iris dataset.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nn_uncorrelated_features = 20\nrng = np.random.RandomState(seed=0)\n# Use same number of samples as in iris and 20 features\nX_rand = rng.normal(size=(X.shape[0], n_uncorrelated_features))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Permutation test score\n\nNext, we calculate the\n:func:`~sklearn.model_selection.permutation_test_score` for both, the original\niris dataset (where there's a strong relationship between features and labels) and\nthe randomly generated features with iris labels (where no dependency between features\nand labels is expected). We use the\n:class:`~sklearn.svm.SVC` classifier and `accuracy_score` to evaluate\nthe model at each round.\n\n:func:`~sklearn.model_selection.permutation_test_score` generates a null\ndistribution by calculating the accuracy of the classifier\non 1000 different permutations of the dataset, where features\nremain the same but labels undergo different random permutations. This is the\ndistribution for the null hypothesis which states there is no dependency\nbetween the features and labels. An empirical p-value is then calculated as\nthe proportion of permutations, for which the score obtained by the model trained on\nthe permutation, is greater than or equal to the score obtained using the original\ndata.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedKFold, permutation_test_score\nfrom sklearn.svm import SVC\n\nclf = SVC(kernel=\"linear\", random_state=7)\ncv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)\n\nscore_iris, perm_scores_iris, pvalue_iris = permutation_test_score(\n clf, X, y, scoring=\"accuracy\", cv=cv, n_permutations=1000\n)\n\nscore_rand, perm_scores_rand, pvalue_rand = permutation_test_score(\n clf, X_rand, y, scoring=\"accuracy\", cv=cv, n_permutations=1000\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Original data\n\nBelow we plot a histogram of the permutation scores (the null\ndistribution). The red line indicates the score obtained by the classifier\non the original data (without permuted labels). 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Original data\n\nBelow we plot a histogram of the permutation scores (the null\ndistribution). The red line indicates the score obtained by the classifier\non the original data (without permuted labels). The score on the original data is\nmuch better than the scores obtained with permuted data, and the p-value is thus\nvery low. This indicates that it is very unlikely that such a good score would be\nobtained by chance alone. It provides evidence that the iris dataset contains a\nreal dependency between features and labels, which the classifier was able to\nexploit to obtain good results. The low p-value therefore leads us to reject the\nnull hypothesis.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfig, ax = plt.subplots()\n\n# Null distribution of the permutation scores\nax.hist(perm_scores_iris, bins=20, density=True)\n# Score obtained on the original (unpermuted) data\nax.axvline(score_iris, ls=\"--\", color=\"r\")\nscore_label = (\n    f\"Score on original\\niris data: {score_iris:.2f}\\n(p-value: {pvalue_iris:.3f})\"\n)\nax.text(0.7, 10, score_label, fontsize=12)\nax.set_xlabel(\"Accuracy score\")\n_ = ax.set_ylabel(\"Probability density\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random data\n\nBelow we plot the null distribution for the randomized data. The permutation\nscores are similar to those obtained using the original iris dataset\nbecause permuting the labels always destroys any feature-label dependency present.\nIn this case, though, the score obtained on the (unpermuted) randomized data is\nvery poor. This results in a large p-value, confirming that there was no\nfeature-label dependency in the randomized data.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n\nax.hist(perm_scores_rand, bins=20, density=True)\n# Start the x-axis just below the smallest permutation scores for readability\nax.set_xlim(0.13)\nax.axvline(score_rand, ls=\"--\", color=\"r\")\nscore_label = (\n    f\"Score on original\\nrandom data: {score_rand:.2f}\\n(p-value: {pvalue_rand:.3f})\"\n)\nax.text(0.14, 7.5, score_label, fontsize=12)\nax.set_xlabel(\"Accuracy score\")\nax.set_ylabel(\"Probability density\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another possible reason for obtaining a high p-value is that the classifier\nwas not able to use the structure in the data. In that case, the p-value\nwould only be low for classifiers that are able to utilize the dependency\npresent. In our case above, where the data is random, all classifiers would\nhave a high p-value, as there is no structure present in the data. Before failing\nto reject the null hypothesis, we should therefore check whether the p-value\nremains high for a more appropriate estimator as well.\n\nFinally, note that this test has been shown to produce low p-values even\nif there is only weak structure in the data [1]_.\n\n.. rubric:: References\n\n.. [1] Ojala and Garriga. [Permutation Tests for Studying Classifier\n   Performance](http://www.jmlr.org/papers/volume11/ojala10a/ojala10a.pdf). The\n   Journal of Machine Learning Research (2010) vol. 11\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 0 }