{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cochran's Q Test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cochran's Q test for comparing the performance of multiple classifiers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> `from mlxtend.evaluate import cochrans_q` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cochran's Q test can be regarded as a generalized version of McNemar's test that can be applied to evaluate multiple classifiers. In a sense, Cochran's Q test is analogous to ANOVA for binary outcomes. \n", "\n", "To compare more than two classifiers, we can use Cochran's Q test, which has a test statistic $Q$ that is approximately, (similar to McNemar's test), distributed as chi-squared with $L-1$ degrees of freedom, where L is the number of models we evaluate (since $L=2$ for McNemar's test, McNemars test statistic approximates a chi-squared distribution with one degree of freedom). \n", "\n", "More formally, Cochran's Q test tests the hypothesis that there is no difference between the classification accuracies [1]: \n", "\n", "$$p_i: H_0 = p_1 = p_2 = \\cdots = p_L.$$\n", "\n", "Let $\\{D_1, \\dots , D_L\\}$ be a set of classifiers who have all been tested on the same dataset. If the L classifiers don't perform differently, then the following Q statistic is distributed approximately as\n", "\"chi-squared\" with $L-1$ degrees of freedom:\n", "\n", "$$Q_C = (L-1) \\frac{L \\sum^{L}_{i=1}G_{i}^{2} - T^2}{LT - \\sum^{N_{ts}}_{j=1} (L_j)^2}.$$\n", "\n", "Here, $G_i$ is the number of objects out of $N_{ts}$ correctly classified by $D_i= 1, \\dots L$; $L_j$ is the number of classifiers out of $L$ that correctly classified object $\\mathbf{z}_j \\in \\mathbf{Z}_{ts}$, where $\\mathbf{Z}_{ts} = \\{\\mathbf{z}_1, ... \\mathbf{z}_{N_{ts}}\\}$ is the test dataset on which the classifers are tested on; and $T$ is the total number of correct number of votes among the $L$ classifiers [2]:\n", "\n", "$$ T = \\sum_{i=1}^{L} G_i = \\sum^{N_{ts}}_{j=1} L_j.$$\n", "\n", "\n", "To perform Cochran's Q test, we typically organize the classificier predictions in a binary $N_{ts} \\times L$ matrix. The $ij\\text{th}$ entry of such matrix is 0 if a classifier $D_j$ has misclassified a data example (vector) $\\mathbf{z}_i$ and 1 otherwise (if the classifier predicted the class label $l(\\mathbf{z}_i)$ correctly) [2].\n", "\n", "The following example taken from [2] illustrates how the classification results may be organized. For instance, assume we have the ground truth labels of the test dataset `y_true` and the following predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):\n", "\n", " y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0])\n", "\n", " y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0])\n", "\n", " y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0])\n", "\n", " y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 1, 1])\n", "\n", "The table of correct (1) and incorrect (0) classifications may then look as follows:\n", "\n", "| | $D_1$ (model 1) | $D_2$ (model 2) | $D_3$ (model 3) | Occurrences |\n", "|----------|-------------------|-------------------|-------------------|-------------|\n", "| | 1 | 1 | 1 | 80 |\n", "| | 1 | 1 | 0 | 2 |\n", "| | 1 | 0 | 1 | 0 |\n", "| | 1 | 0 | 0 | 2 |\n", "| | 0 | 1 | 1 | 9 |\n", "| | 0 | 1 | 0 | 1 |\n", "| | 0 | 0 | 1 | 3 |\n", "| | 0 | 0 | 0 | 3 |\n", "| Accuracy | 84/100*100% = 84% | 92/100*100% = 92% | 92/100*100% = 92% | |\n", "\n", "By plugging in the respective value into the previous equation, we obtain the following $Q$ value [2]:\n", "\n", "$$Q_c = 2 \\times \\frac{3 \\times (84^2 + 92^2 + 92^2) - 268^2}{3\\times 268-(80 \\times 9 + 11 \\times 4 + 6 \\times 1)} \\approx 7.5294.$$\n", "\n", "(Note that the $Q$ value in [2] is listed as 3.7647 due to a typo as discussed with the author, the value 7.5294 is the correct one.)\n", "\n", "Now, the Q value (approximating $\\chi^2$) corresponds to a p-value of approx. 0.023 assuming a $\\chi^2$ distribution with $L-1 = 2$ degrees of freedom. Assuming that we chose a significance level of $\\alpha=0.05$, we would reject the null hypothesis that all classifiers perform equally well, since $0.023 < \\alpha$.\n", "\n", "In practice, if we successfully rejected the null hypothesis, we could perform multiple post hoc pair-wise tests -- for example, McNemar tests with a Bonferroni correction -- to determine which pairs have different population proportions.\n", "\n", "\n", "### References\n", "\n", "- [1] Fleiss, Joseph L., Bruce Levin, and Myunghee Cho Paik. Statistical methods for rates and proportions. John Wiley & Sons, 2013.\n", "- [2] Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1 - Cochran's Q test" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from mlxtend.evaluate import cochrans_q\n", "from mlxtend.evaluate import mcnemar_table\n", "from mlxtend.evaluate import mcnemar\n", "\n", "## Dataset:\n", "\n", "# ground truth labels of the test dataset:\n", "\n", "y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0])\n", "\n", "\n", "# predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):\n", "\n", "y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0])\n", "\n", "y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0])\n", "\n", "y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 1, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming a significance level $\\alpha=0.05$, we can conduct Cochran's Q test as follows, to test the null hypothesis there is no difference between the classification accuracies, $p_i: H_0 = p_1 = p_2 = \\cdots = p_L$:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Q: 7.529\n", "p-value: 0.023\n" ] } ], "source": [ "q, p_value = cochrans_q(y_true, \n", " y_model_1, \n", " y_model_2, \n", " y_model_3)\n", "\n", "print('Q: %.3f' % q)\n", "print('p-value: %.3f' % p_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the p-value is smaller than $\\alpha$, we can reject the null hypothesis and conclude that there is a difference between the classification accuracies. As mentioned in the introduction earlier, we could now perform multiple post hoc pair-wise tests -- for example, McNemar tests with a Bonferroni correction -- to determine which pairs have different population proportions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, let's illustrate that Cochran's Q test is indeed just a generalized version of McNemar's test:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cochran's Q Chi^2: 5.333\n", "Cochran's Q p-value: 0.021\n" ] } ], "source": [ "chi2, p_value = cochrans_q(y_true, \n", " y_model_1, \n", " y_model_2)\n", "\n", "print('Cochran\\'s Q Chi^2: %.3f' % chi2)\n", "print('Cochran\\'s Q p-value: %.3f' % p_value)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "McNemar's Chi^2: 5.333\n", "McNemar's p-value: 0.021\n" ] } ], "source": [ "chi2, p_value = mcnemar(mcnemar_table(y_true, \n", " y_model_1, \n", " y_model_2),\n", " corrected=False)\n", "\n", "print('McNemar\\'s Chi^2: %.3f' % chi2)\n", "print('McNemar\\'s p-value: %.3f' % p_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "## cochrans_q\n", "\n", "*cochrans_q(y_target, *y_model_predictions)*\n", "\n", "Cochran's Q test to compare 2 or more models.\n", "\n", "**Parameters**\n", "\n", "- `y_target` : array-like, shape=[n_samples]\n", "\n", " True class labels as 1D NumPy array.\n", "\n", "\n", "- `*y_model_predictions` : array-likes, shape=[n_samples]\n", "\n", " Variable number of 2 or more arrays that\n", " contain the predicted class labels\n", " from models as 1D NumPy array.\n", "\n", "**Returns**\n", "\n", "\n", "- `q, p` : float or None, float\n", "\n", " Returns the Q (chi-squared) value and the p-value\n", "\n", "**Examples**\n", "\n", "For usage examples, please see\n", " [http://rasbt.github.io/mlxtend/user_guide/evaluate/cochrans_q/](http://rasbt.github.io/mlxtend/user_guide/evaluate/cochrans_q/)\n", "\n", "\n" ] } ], "source": [ "with open('../../api_modules/mlxtend.evaluate/cochrans_q.md', 'r') as f:\n", " s = f.read() \n", "print(s)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }