{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# t-test for comparison of accuracies of two algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.1, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U scipy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial discusses the evaluation of machine learning algorithms and classifiers using simple significance tests." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the t-test on two distributions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The task is to compare two distributions of accuracy counts over some experimental results. Imagine that we test two algorithms $a$ and $b$ on the same training and test data sets. We will apply the paired t-test as provided in the $stats$ module of $scipy$. 
We will need to import this module first:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from scipy import stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine that these are our results from two independent algorithms trained and tested on the same pairs of training and test data sets:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = [23, 43, 12, 10]\n", "b = [23, 42, 13, 10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data sets are the same in both experiments. We could treat the algorithms as two different tests on the same population (of data). The t-test measures whether the average scores differ significantly. We apply the t-test for two related samples of scores as provided in the $stats$ module:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ttest_relResult(statistic=0.0, pvalue=1.0)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.ttest_rel(a,b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned result contains a $pvalue$ (p-value) of 1.0 in this case. The Null Hypothesis of the paired t-test is that the two related samples have the same mean, that is, that the two algorithms perform equally well on average. The p-value is the probability of observing a difference at least as extreme as the one measured, assuming the Null Hypothesis is true. A p-value of 1.0 means the data provide no evidence against the Null Hypothesis, so we cannot conclude that the two algorithms differ."
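] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check by hand what the paired t-test computes: the t-statistic is the mean of the pairwise differences divided by their standard error. The following cell is a small sketch of this computation; the variable names are our own and not part of $scipy$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "# the two result lists from above\n", "a = [23, 43, 12, 10]\n", "b = [23, 42, 13, 10]\n", "# pairwise differences between the paired results\n", "d = [x - y for x, y in zip(a, b)]\n", "n = len(d)\n", "mean_d = sum(d) / n\n", "# sample standard deviation of the differences (ddof = 1)\n", "sd_d = math.sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))\n", "# paired t-statistic: mean difference over its standard error\n", "t = mean_d / (sd_d / math.sqrt(n))\n", "print(t)"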
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine now that the experimental results are:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = [23, 43, 12, 10]\n", "b = [4, 15, 3, 9]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying the t-test again will give us a different result:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ttest_relResult(statistic=-2.4238865567066052, pvalue=0.093841791155051452)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.ttest_rel(b,a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the p-value is approximately 0.094: if the Null Hypothesis were true, we would observe a difference at least this large with a probability of about 9%. Remember, the Null Hypothesis of the paired t-test is that the two samples have the same mean. At the common significance threshold of 5% we cannot reject the Null Hypothesis; with a more liberal threshold of 10% we would reject it and conclude that the two algorithms differ significantly."
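] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Such a threshold decision can be expressed directly in code. The following cell is a small sketch; the threshold variable $alpha$ is our own choice and not part of $scipy$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import stats\n", "\n", "# the two result lists from above\n", "a = [23, 43, 12, 10]\n", "b = [4, 15, 3, 9]\n", "result = stats.ttest_rel(b, a)\n", "alpha = 0.10  # our chosen significance threshold\n", "if result.pvalue < alpha:\n", "    print(\"reject the Null Hypothesis\")\n", "else:\n", "    print(\"cannot reject the Null Hypothesis\")"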
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, imagine that the experimental results differ substantially:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = [74, 89, 88, 78]\n", "b = [24, 2, 3, 9]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ttest_relResult(statistic=8.4725342868682922, pvalue=0.0034519815681217179)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.ttest_rel(a,b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the p-value is approximately 0.0035, so even at a strict threshold of 1% we can reject the Null Hypothesis and conclude that the average scores of the two algorithms differ significantly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] } ], "metadata": { "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "latex_metadata": { "affiliation": "Indiana University, Department of Linguistics, Bloomington, IN, USA", "author": "Damir Cavar", "title": "t-test for evaluation of accuracies for Machine Learning for Computational Linguistics" } }, "nbformat": 4, "nbformat_minor": 2 }