{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# t-test for comparison of accuracies of two algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.1, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U scipy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial discusses the evaluation of machine learning algorithms and classifiers using simple significance tests." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the t-test on two distributions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The task is to compare two distributions of accuracy counts over some experimental results. Imagine that we test two algorithms $a$ and $b$ on the same training and test data sets. We will apply the paired t-test as provided in the $stats$ module of $scipy$. 
We will need to import this module first:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from scipy import stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine that these are our results from two independent algorithms trained and tested on the same pairs of training and test data sets:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = [23, 43, 12, 10]\n", "b = [23, 42, 13, 10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data sets are the same in both experiments. We could treat the algorithms as two different tests on the same population (of data). The t-test measures whether the average scores differ significantly. We apply the t-test for two related samples of scores as provided in the $stats$ module:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ttest_relResult(statistic=0.0, pvalue=1.0)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.ttest_rel(a,b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned result contains a $pvalue$ (p-value) of 1.0 in this case. The Null Hypothesis of the paired t-test is that the two related samples have the same mean, that is, that the two algorithms perform equally well on average. The p-value is the probability of observing a difference at least as extreme as the one measured, assuming the Null Hypothesis is true. A p-value of 1.0 means the data provide no evidence against the Null Hypothesis, so we cannot conclude that the two algorithms differ."
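] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check by hand what the paired t-test computes: the t-statistic is the mean of the pairwise differences divided by their standard error. The following cell is a small sketch of this computation; the variable names are our own and not part of $scipy$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "# the two result lists from above\n", "a = [23, 43, 12, 10]\n", "b = [23, 42, 13, 10]\n", "# pairwise differences between the paired results\n", "d = [x - y for x, y in zip(a, b)]\n", "n = len(d)\n", "mean_d = sum(d) / n\n", "# sample standard deviation of the differences (ddof = 1)\n", "sd_d = math.sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))\n", "# paired t-statistic: mean difference over its standard error\n", "t = mean_d / (sd_d / math.sqrt(n))\n", "print(t)"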
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine now that the experimental results are:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = [23, 43, 12, 10]\n", "b = [4, 15, 3, 9]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying the t-test again will give us a different result:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ttest_relResult(statistic=-2.4238865567066052, pvalue=0.093841791155051452)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.ttest_rel(b,a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the p-value is approximately 0.094: if the Null Hypothesis were true, we would observe a difference at least this large with a probability of about 9%. Remember, the Null Hypothesis of the paired t-test is that the two samples have the same mean. At the common significance threshold of 5% we cannot reject the Null Hypothesis; with a more liberal threshold of 10% we would reject it and conclude that the two algorithms differ significantly."
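] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Such a threshold decision can be expressed directly in code. The following cell is a small sketch; the threshold variable $alpha$ is our own choice and not part of $scipy$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import stats\n", "\n", "# the two result lists from above\n", "a = [23, 43, 12, 10]\n", "b = [4, 15, 3, 9]\n", "result = stats.ttest_rel(b, a)\n", "alpha = 0.10  # our chosen significance threshold\n", "if result.pvalue < alpha:\n", "    print(\"reject the Null Hypothesis\")\n", "else:\n", "    print(\"cannot reject the Null Hypothesis\")"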
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, imagine that the experimental results differ substantially:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = [74, 89, 88, 78]\n", "b = [24, 2, 3, 9]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Ttest_relResult(statistic=8.4725342868682922, pvalue=0.0034519815681217179)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.ttest_rel(a,b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the p-value is approximately 0.0035, so even at a strict threshold of 1% we can reject the Null Hypothesis and conclude that the average scores of the two algorithms differ significantly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] } ], "metadata": { "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "latex_metadata": { "affiliation": "Indiana University, Department of Linguistics, Bloomington, IN, USA", "author": "Damir Cavar", "title": "t-test for evaluation of accuracies for Machine Learning for Computational Linguistics" } }, "nbformat": 4, "nbformat_minor": 2 }