{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Permutation Tests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic idea: \n",
"\n",
"If, under the null hypothesis, the probability distribution of the data is invariant under the action of some group $\\mathcal{G}$, then once we observe the actual data, we know other possible data sets that are equally likely.\n",
"\n",
"In particular, every element of the _orbit_ of the observed data under $\\mathcal{G}$ was just as probable as the\n",
"observed data.\n",
"\n",
"Test _conditionally_ on the orbit in which the observed data lie."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Effect of treatment in a randomized controlled experiment\n",
"\n",
"11 pairs of rats, each pair from the same litter. \n",
"\n",
"Randomly—by coin tosses—put one of each pair into\n",
"\"enriched\" environment; other sib gets \"normal\" environment.\n",
"\n",
"After 65 days, measure cortical mass (mg).\n",
"\n",
"
\n",
"\n",
" enriched | 689 | 656 | 668 | 660 | 679 | 663 | 664 | 647 | 694 | 633 | 653 | \n",
"
\n",
"\n",
" impoverished | 657 | 623 | 652 | 654 | 658 | 646 | 600 | 640 | 605 | 635 | 642 | \n",
"
\n",
"\n",
" difference | 32 | 33 | 16 | 6 | 21 | 17 | 64 | 7 | 89 | -2 | 11\n",
" |
\n",
"
\n",
"\n",
"**How should we analyze the data?**\n",
"\n",
"Cartoon of Rosenzweig, M.R., E.L. Bennet, and M.C. Diamond, 1972. Brain changes in response to experience, _Scientific American_, _226_, 22–29 report an experiment in which 11 triples of male rats, each triple from the same litter, were\n",
"assigned at random to three different environments, \"enriched\" (E), standard, and \"impoverished.\"\n",
"See also Bennett et al., 1969. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Informal Hypotheses\n",
"\n",
"Null hypothesis: treatment has \"no effect.\"\n",
"\n",
"Alternative hypothesis: treatment increases cortical mass.\n",
"\n",
"Suggests 1-sided test for an increase."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test contenders\n",
"\n",
"+ 2-sample Student $t$-test: \n",
"$$ \n",
" \\frac{\\mbox{mean(treatment) - mean(control)}}\n",
" {\\mbox{pooled estimate of SD of difference of means}}\n",
"$$\n",
"+ 1-sample Student $t$-test on the differences: \n",
"$$\n",
"\t \\frac{\\mbox{mean(differences)}}{\\mbox{SD(differences)}/\\sqrt{11}}\n",
"$$ \n",
"Better, since littermates are presumably more homogeneous.\n",
" \n",
"+ Permutation test using $t$-statistic of differences:\n",
"same statistic, different way to calculate $P$-value.\n",
"Even better?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Strong null hypothesis\n",
"\n",
"**Treatment has no effect whatsoever---as if cortical mass were \n",
"assigned to each rat before the randomization.**\n",
"\n",
"Then equally likely that the rat with the heavier cortex will be assigned\n",
"to treatment or to control, independently across littermate pairs.\n",
"\n",
"Gives $2^{11} = 2,048$ equally likely possibilities:\n",
"\n",
"\n",
"\n",
"difference | $\\pm$32 | $\\pm$33 | $\\pm$16 | $\\pm$6 | $\\pm$21 | $\\pm$17 | $\\pm$64 | $\\pm$7 | $\\pm$89 | $\\pm$2 | $\\pm$11 | \n",
"
\n",
"
\n",
"\n",
"For example, just as likely to observe original differences as\n",
"\n",
"\n",
"\n",
"\tdifference | -32 | -33 | -16 | -6 | -21 | -17 | \n",
" -64 | -7 | -89 | -2 | -11 | \n",
"
\n",
"
\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Weak null hypothesis\n",
"\n",
"On average across pairs, treatment makes no difference."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Alternatives\n",
"\n",
"1. Individual's response depends only on that individual's assignment\n",
" + Special cases: shift, scale, etc.\n",
"\n",
"1. Interactions/Interference: my response could depend on whether you are assigned to treatment or control."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assumptions of the tests\n",
"\n",
"1. 2-sample $t$-test: \n",
" + masses are iid sample from normal distribution, same unknown variance, same unknown mean.\n",
"\t+ Tests weak null hypothesis (plus normality, independence, non-interference, etc.).\n",
"1. 1-sample $t$-test on the differences: \n",
"\t+ mass differences are iid sample from normal distribution, unknown variance, zero mean.\n",
"\t+ Tests weak null hypothesis (plus normality, independence, non-interference, etc.)\n",
"1. Permutation test: \n",
"\t+ Randomization fair, independent across pairs.\n",
"\t+ Tests strong null hypothesis.\n",
"\n",
"Assumptions of the permutation test are true by design: That's how treatment\n",
"was assigned."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What's the group?\n",
"\n",
"The data are elements of $\\Re^{11}$. The group operations reflect coordinates\n",
"about the axes—they flip the signs of elements of the vectors."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ttest_1sampResult(statistic=3.2437214037805222, pvalue=0.0088136932072599358)"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# boilerplate\n",
"%matplotlib inline\n",
"from __future__ import division\n",
"import math\n",
"import numpy as np\n",
"import scipy as sp\n",
"from scipy import stats # distributions\n",
"from scipy import special # special functions\n",
"from scipy import random # random variables, distributions, etc.\n",
"from scipy.optimize import brentq\n",
"from scipy.stats import (binom, hypergeom)\n",
"import matplotlib.pyplot as plt\n",
"from ipywidgets import widgets\n",
"\n",
"treat = np.array([689, 656, 668, 660, 679, 663, 664, 647, 694, 633, 653])\n",
"control = np.array([657, 623, 652, 654, 658, 646, 600, 640, 605, 635, 642])\n",
"diffr = treat - control\n",
"\n",
"sp.stats.ttest_1samp(diffr, 0.0) # two-sided one-sample t-test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Student $t$-test calculations\n",
"\n",
"Mean of differences: 26.73mg \n",
"Sample SD of differences: 27.33mg \n",
"$t$-statistic: $26.73/(27.33/\\sqrt{11}) = 3.244$. \n",
"\n",
"$P$-value for 2-sided $t$-test: 0.0088\n",
"\n",
"+ Why do cortical weights have normal distribution?\n",
"+ Why is variance of the difference between treatment and control the same for different litters?\n",
"+ Treatment and control are _dependent_ because assigning a rat to treatment excludes it from the control group, and vice versa.\n",
"+ $P$-value depends on assuming differences are iid sample from a normal distribution.\n",
"+ If we reject the null, is that because there is a treatment effect, or because the other assumptions are wrong?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Permutation $t$-test calculations\n",
"\n",
"Could enumerate all $2^{11} = 2,048$ equally likely possibilities.\n",
"Calculate $t$-statistic for each.\n",
"\n",
"$P$-value is \n",
"$$\n",
"\tP = \\frac{\\mbox{number of possibilities with $t \\ge 3.244$}}{\\mbox{2,048}}\n",
"$$\n",
"(Using the mean as the test statistic instead of $t$-statistic\n",
"would yield $P = 2/2048 = 0.00098$.)\n",
"\n",
"If there were many more pairs, it would be impractical to enumerate, but \n",
"we can simulate:\n",
"\n",
"Assign a random sign to each difference. \\\\\n",
"Compute $t$-statistic \\\\\n",
"Repeat $N$ times\n",
"$$\n",
"\tP \\approx \\frac{\\mbox{number of simulations with $t \\ge 3.244$}}{N}\n",
"$$\n",
"Can invert Binomial tests to make an exact confidence interval for $P$ from the simulation results."
]
},
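{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small sketch (not needed for the simulation below): enumerate all\n",
"# 2**11 = 2,048 equally likely sign assignments and compute the exact\n",
"# permutation P-value for the mean of the differences.\n",
"import itertools\n",
"import numpy as np\n",
"\n",
"diffr = np.array([32, 33, 16, 6, 21, 17, 64, 7, 89, -2, 11])\n",
"signs = np.array(list(itertools.product([-1, 1], repeat=len(diffr))))  # 2048 x 11\n",
"means = (signs * diffr).mean(axis=1)\n",
"p_exact_mean = np.mean(means >= diffr.mean())\n",
"print(p_exact_mean, p_exact_mean * len(signs))  # P = 2/2048\n"
]
},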
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.00199 199.0\n"
]
}
],
"source": [
"np.random.RandomState(seed=20151013) # set seed for reproducibility\n",
"\n",
"iter = 100000 # N\n",
"def simPermuTP(z, iter):\n",
"# P.B. Stark, statistics.berkeley.edu/~stark \n",
"# simulated P-value for 2-sided 1-sample t-test under the \n",
"# randomization model.\n",
" tsf = lambda x: np.abs(np.mean(x)/(np.std(x, ddof=1)/math.sqrt(len(x))))\n",
" ts = tsf(z) \n",
" return np.mean([ (tsf(z*(2*np.random.random_integers(0, 1, size=len(z))-1)) >= ts) for i in range(iter)])\n",
" \n",
"estP = simPermuTP(diffr, iter)\n",
"print(estP, estP*iter)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0017233243956620496, 0.0022861882693160558)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# confidence interval for the p-value\n",
"# This function is in the _permute_ library; see http://statlab.github.io/permute/\n",
"\n",
"def binom_conf_interval(n, x, cl=0.975, alternative=\"two-sided\", p=None,\n",
" **kwargs):\n",
" \"\"\"\n",
" Compute a confidence interval for a binomial p, the probability of success in each trial.\n",
"\n",
" Parameters\n",
" ----------\n",
" n : int\n",
" The number of Bernoulli trials.\n",
" x : int\n",
" The number of successes.\n",
" cl : float in (0, 1)\n",
" The desired confidence level.\n",
" alternative : {\"two-sided\", \"lower\", \"upper\"}\n",
" Indicates the alternative hypothesis.\n",
" p : float in (0, 1)\n",
" Starting point in search for confidence bounds for probability of success in each trial.\n",
" kwargs : dict\n",
" Key word arguments\n",
"\n",
" Returns\n",
" -------\n",
" tuple\n",
" lower and upper confidence level with coverage (approximately)\n",
" 1-alpha.\n",
"\n",
" Notes\n",
" -----\n",
" xtol : float\n",
" Tolerance\n",
" rtol : float\n",
" Tolerance\n",
" maxiter : int\n",
" Maximum number of iterations.\n",
" \"\"\"\n",
" assert alternative in (\"two-sided\", \"lower\", \"upper\")\n",
"\n",
" if p is None:\n",
" p = x / n\n",
" ci_low = 0.0\n",
" ci_upp = 1.0\n",
"\n",
" if alternative == 'two-sided':\n",
" cl = 1 - (1-cl)/2\n",
"\n",
" if alternative != \"upper\" and x > 0:\n",
" f = lambda q: cl - binom.cdf(x - 1, n, q)\n",
" ci_low = brentq(f, 0.0, p, *kwargs)\n",
" if alternative != \"lower\" and x < n:\n",
" f = lambda q: binom.cdf(x, n, q) - (1 - cl)\n",
" ci_upp = brentq(f, 1.0, p, *kwargs)\n",
"\n",
" return ci_low, ci_upp\n",
"\n",
"binom_conf_interval(iter, math.ceil(estP*iter), cl=0.95, alternative=\"two-sided\", p=estP)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, 0.009792108717906225)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"binom_conf_interval(10**5, 907, cl=0.99, alternative=\"upper\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other tests: sign test, Wilcoxon signed-rank test\n",
"\n",
"#### Sign test:\n",
"Count pairs where treated rat has heavier cortex, i.e., where\n",
"difference is positive.\n",
"\n",
"Under strong null, distribution of the number of positive differences\n",
"is Binomial(11, 1/2). Like number of heads in 11 independent tosses\n",
"of a fair coin. (Assumes no ties w/i pairs.)\n",
"\n",
"$P$-value is chance of 10 or more heads in 11 tosses of a fair coin: 0.0059.\n",
"\n",
"Only uses signs of differences, not information that only the smallest absolute \n",
"difference was negative.\n",
"\n",
"#### Wilcoxon signed-rank test\n",
"uses information about the ordering of the\n",
"differences: rank the absolute values of the differences, then give them\n",
"the observed signs and sum them. Null distribution: assign signs at random\n",
"and sum."
]
},
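{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small sketch: the sign-test P-value is the chance of 10 or more heads\n",
"# in 11 tosses of a fair coin, from the Binomial(11, 1/2) distribution.\n",
"from scipy.stats import binom\n",
"\n",
"p_sign = 1 - binom.cdf(9, 11, 0.5)  # P(10 or more positive differences)\n",
"print(round(p_sign, 4))  # 0.0059\n"
]
},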
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Still more tests, for other alternatives\n",
"\n",
"All the tests we've seen here are sensitive to _shifts_--the alternative\n",
"hypothesis is that treatment increases response (cortical mass).\n",
"\n",
"There are also nonparametric tests that are sensitive to other\n",
"treatment effects, e.g., treatment increases the variability of the \n",
"response.\n",
"\n",
"And there are tests for whether treatment has any effect at all on\n",
"the distribution of the responses.\n",
"\n",
"You can design a test statistic to be sensitive to any change that\n",
"interests you, then use the permutation distribution to get a $P$-value\n",
"(and simulation to approximate that $P$-value)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Silliness\n",
"\n",
"Treat ordinal data (e.g., Likert scale) as if measured on a linear scale; \n",
"use Student $t$-test.\n",
"\n",
"Maybe not so silly for large samples$\\ldots$\n",
"\n",
"$t$-test asymptotically distribution-free.\n",
"\n",
"How big is big?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Back to Rosenzweig et al.\n",
"\n",
"Actually had 3 treatments: enriched, standard, deprived.\n",
"\n",
"Randomized 3 rats per litter into the 3 treatments, independently across\n",
"$n$ litters.\n",
"\n",
"**How should we analyze these data?**\n",
"\n",
"\n",
"\n",
"### Test contenders\n",
"\n",
"$n$ litters, $s$ treatments (sibs per litter).\n",
"+ ANOVA--the $F$-test:\n",
"$$ \n",
"F = \\frac{\\mbox{BSS}/(s-1)}{\\mbox{WSS}/(n-s)}\n",
"$$\n",
"+ Permutation $F$-test: use permutation distribution instead of $F$ distribution to get $P$-value.\n",
"+ Friedman test: Rank within litters. Mean rank for treatment $i$ is $\\bar{R}_i$.\n",
"$$ \n",
"Q = \\frac{12n}{s(s+1)} \\sum_{i=1}^s \\left ( \\bar{R}_i - \\frac{s+1}{2} \\right )^2.\n",
"$$\n",
"$P$-value from permutation distribution. \n",
"\n",
"### Strong null hypothesis\n",
"\n",
"**Treatment has no effect whatsoever—as if cortical mass were \n",
"assigned to each rat before the randomization.**\n",
"\n",
"Then it's equally likely that each littermate is assigned to each treatment, \n",
"independently across litters.\n",
"\n",
"There are $3! = 6$ assignments of each triple to treatments. Thus, there $6^n$ equally likely assignments across all litters.\n",
"\n",
"For 11 litters, that's 362,797,056 possibilities.\n",
"\n",
"(The group under which the distribution of the data is invariant is a subgroup of the permutation group: if you think of the data as an element of $\\Re^{33}$ (33 rats), the group consists of all permutations of the 33 elements that shuffle only adjacent sets of 3 components: components 1–3, components 4–6, … components 31–33.)\n",
"\n",
"\n",
"### Weak null hypothesis\n",
"\n",
"**The average cortical weight for all three treatment groups are equal.\n",
"On average across triples, treatment makes no difference.**\n",
"\n",
"### Assumptions of the tests\n",
"\n",
"+ $F$-test: masses are iid sample from normal distribution, same unknown variance, same unknown mean for all litters and treatments. \n",
"Tests weak null hypothesis.\n",
"+ Permutation $F$-test: Randomization was as advertised: fair, independent across triples. \n",
"Tests strong null hypothesis.\n",
"+ Friedman test: Ditto.\n",
"\n",
"Assumptions of the permutation test and Friedman test are true by design: \n",
"that's how treatment was assigned.\n",
"\n",
"Friedman test statistic has $\\chi^2$ distribution asymptotically. Ties are a complication."
]
},
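{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small sketch with made-up masses for 3 hypothetical litters, showing\n",
"# the Friedman statistic $Q$ via scipy.stats.friedmanchisquare (asymptotic\n",
"# chi-square P-value, not the permutation P-value).\n",
"from scipy.stats import friedmanchisquare\n",
"\n",
"# each list holds one treatment's masses across the 3 hypothetical litters\n",
"enriched_m = [689, 656, 668]\n",
"standard_m = [668, 645, 660]\n",
"deprived_m = [657, 623, 652]\n",
"stat, pval = friedmanchisquare(enriched_m, standard_m, deprived_m)\n",
"print(stat, pval)  # Q = 6.0: the ranks are perfectly ordered in every litter\n"
]
},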
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### $F$-test assumptions—reasonable?\n",
"\n",
"+ Why do cortical weights have normal distribution for each litter and for each treatment?\n",
"+ Why is the variance of cortical weights the same for different litters?\n",
"+ Why is the variance of cortical weights the same for different treatments?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Is $F$ a good statistic for this alternative?\n",
"\n",
"$F$-test (and Friedman statistic) sensitive to differences among the \n",
"mean responses for each treatment group, no matter what pattern the differences \n",
"have.\n",
"\n",
"But the treatments and the responses can be ordered: we hypothesize that\n",
"more stimulation produces greater cortical mass.\n",
"\n",
"deprived $\\Longrightarrow$ normal $\\Longrightarrow$ enriched \n",
"low mass $\\Longrightarrow$ medium mass $\\Longrightarrow$ high mass\n",
"\n",
"_Can we use that to make a more sensitive test?_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A test against an ordered alternative\n",
"\n",
"Within each litter triple, count pairs of responses\n",
"that are \"in order.\" Sum across litters.\n",
"\n",
"E.g., if one triple had cortical masses\n",
"\n",
"\n",
"\n",
"deprived | 640 | \n",
"
\n",
"\n",
"normal | 660 | \n",
"
\n",
"\n",
"enriched | 650 | \n",
"
\n",
"\n",
"that would contribute 2 to the sum: $660 \\ge 640$, $650 \\ge 640$, but $640 < 650$.\n",
"\n",
"Each litter triple contributes between 0 and 3 to the overall sum.\n",
"\n",
"\n",
"Null distribution for the test based on the permutation distribution: 6\n",
"equally likely assignments per litter, independent assignments across litters."
]
},
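{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small sketch: the 'pairs in order' statistic for one litter triple,\n",
"# and its null distribution over the 3! = 6 equally likely assignments\n",
"# of that triple to (deprived, normal, enriched).\n",
"from itertools import permutations\n",
"\n",
"def pairs_in_order(triple):\n",
"    d, n, e = triple  # masses for deprived, normal, enriched\n",
"    return int(n >= d) + int(e >= d) + int(e >= n)\n",
"\n",
"print(pairs_in_order((640, 660, 650)))  # 2, as in the example above\n",
"null_dist = [pairs_in_order(p) for p in permutations((640, 660, 650))]\n",
"print(sorted(null_dist))\n"
]
},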
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A different test against an ordered alternative\n",
"\n",
"Within each litter triple, add differences\n",
"that are \"in order.\" Sum across litters.\n",
"\n",
"E.g., if one triple had cortical masses\n",
"\n",
"\n",
"\n",
"deprived | 640 | \n",
"
\n",
"\n",
"normal | 660 | \n",
"
\n",
"\n",
"enriched | 650 | \n",
"
\n",
"\n",
"that would contribute 30 to the sum: $660 - 640 = 20$ and $650 - 640 = 10$, but $640 < 650$,\n",
"so that pair contributes zero.\n",
"\n",
"Each litter triple contributes between 0 and $2\\times{\\mbox{ range }}$ to the sum.\n",
"\n",
"Null distribution for the test based on the permutation distribution: 6\n",
"equally likely assignments per litter, independent across litters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The 2-sample problem \n",
"\n",
"Suppose we have a group of $N$ individuals who are randomized into two groups, a _treatment_ group of size $N_t$ and a _control_ group of size $N_c = N - N_t$.\n",
"Label the individuals from $1$ to $N$.\n",
"Let ${\\mathcal T}$ denote the labels of individuals assigned to treatment and ${\\mathcal C}$ denote \n",
"the labels of those assigned to control.\n",
"\n",
"For each of the $N$ individuals, we measure a quantitative (real-valued) response.\n",
"Each individual $i$ has two _potential responses_: the response $c_i $individual would have if assigned to \n",
"the control group, and the response $t_i$ the individual would have if assigned to the treatment group.\n",
"\n",
"We assume that individual $i$'s response depends _only_ on that individual's assigment, and not on anyone else's assignment.\n",
"This is the assumption of _non-interference_. \n",
"In some cases, this assumption is reasonable; in others, it is not.\n",
"\n",
"For instance, imagine testing a vaccine for a communicable disease.\n",
"If you and I have contact, whether you get the disease might depend on whether I am vaccinated—and _vice versa_—since if the vaccine protects me from illness, I won't infect you.\n",
"Similarly, suppose we are testing the effectiveness of an advertisement for a product.\n",
"If you and I are connected and you buy the product, I might be more likely to buy it, even if I don't\n",
"see the advertisement.\n",
"\n",
"Conversely, suppose that \"treatment\" is exposure to a carcinogen, and the response whether the\n",
"subject contracts cancer. \n",
"On the assumption that cancer is not communicable, my exposure and your disease\n",
"status have no connection.\n",
"\n",
"The _strong null hypothesis_ is that individual by individual, treatment makes no difference whatsoever: $c_i = t_i$ for all $i$.\n",
"\n",
"If so, any differences between statistics computed for the treatment and control groups are entirely due to the luck of the draw: which individuals happened to be assigned to treatment and which to control.\n",
"\n",
"We can find the _null distribution_ of any statistic computed from the responses of the two groups: if the strong null hypothesis is true, we know what individual $i$'s response would have been whether assigned to treatment or to control—namely, the same.\n",
"\n",
"For instance, suppose we suspect that treatment tends to increase response: in general, $t_i \\ge c_i$.\n",
"Then we might expect $\\bar{c} = \\frac{1}{N_c} \\sum_{i \\in {\\mathcal C}} c_i$ to be less than\n",
"$\\bar{t} = \\frac{1}{N_t} \\sum_{i \\in {\\mathcal T}} t_i$.\n",
"How large a difference between $\\bar{c}$ and $\\bar{t}$ would be evidence that treatment increases the response,\n",
"beyond what might happen by chance through the luck of the draw?\n",
"\n",
"This amounts to asking whether the observed difference in means between the two groups is a high percentile\n",
"of the distribution of that difference in means, calculated on the assumption that the null hypothesis is true.\n",
"\n",
"Because of how subjects are assigned to treatment or to control, all allocations of $N_t$ subjects to\n",
"treatment are equally likely.\n",
"\n",
"One way to partition the $N$ subjects randomly into a group of size $N_c$ and a group of size $N_t$ is\n",
"to permute the $N$ subjects at random, then take the first $N_c$ in the permuted list to be the control\n",
"group, and the remaining $N_t$ to be the treatment group."
]
},
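{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small sketch of the two-sample permutation test just described:\n",
"# shuffle the pooled responses, take the first Nc as the control group and\n",
"# the rest as treatment, and compare the difference in means to the observed one.\n",
"import numpy as np\n",
"\n",
"def two_sample_perm_p(t_vals, c_vals, reps=10000, seed=12345):\n",
"    rng = np.random.RandomState(seed)\n",
"    pooled = np.concatenate([t_vals, c_vals])\n",
"    n_c = len(c_vals)\n",
"    obs = np.mean(t_vals) - np.mean(c_vals)\n",
"    hits = 0\n",
"    for _ in range(reps):\n",
"        rng.shuffle(pooled)\n",
"        hits += (np.mean(pooled[n_c:]) - np.mean(pooled[:n_c])) >= obs\n",
"    return hits / reps\n",
"\n",
"# e.g., treating the rat data as if it were an unpaired 2-sample design:\n",
"treat = np.array([689, 656, 668, 660, 679, 663, 664, 647, 694, 633, 653])\n",
"control = np.array([657, 623, 652, 654, 658, 646, 600, 640, 605, 635, 642])\n",
"print(two_sample_perm_p(treat, control))\n"
]
},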
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gender Bias in Teaching Evaluations\n",
"MacNell, Driscoll, and Hunt (2014. [What's in a Name: Exposing Gender Bias in Student Ratings of Teaching](http://link.springer.com/article/10.1007%2Fs10755-014-9313-4), _Innovative Higher Education_) conducted a controlled, randomized experiment on\n",
"the effect of students' perception of instructors' gender on teaching evaluations\n",
"in an online course.\n",
"Students in the class did not know the instructors' true genders.\n",
"\n",
"MacNell et al. randomized 47 students in an online course into four groups: two taught by a female\n",
"instructor and two by a male instructor.\n",
"One of the groups taught by each instructor was led to believe the instructor was male;\n",
"the other was led to believe the instructor was female.\n",
"Comparable instructor biographies were given to all students.\n",
"Instructors treated the groups identically, including returning assignments at the same time.\n",
"\n",
"When students thought the instructor was female, the responding students rated the instructor lower, on average,\n",
"in every regard.\n",
"\n",
"\n",
"Characteristic | F - M |
\n",
"Caring | -0.52 |
\n",
" Consistent | -0.47 |
\n",
" Enthusiastic | -0.57 |
\n",
" Fair | -0.76 |
\n",
" Feedback | -0.47 |
\n",
" Helpful | -0.46 |
\n",
" Knowledgeable | -0.35 |
\n",
" Praise | -0.67 |
\n",
" Professional | -0.61 |
\n",
" Prompt | -0.80 |
\n",
" Respectful | -0.61 |
\n",
" Responsive | -0.22 |
\n",
"
\n",
"\n",
"Those results are for a 5-point scale, so a difference of 0.8 (Promptness) is 16% of the entire scale."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MacNell et al. graciously shared their data.\n",
"The evaluation data are coded as follows:\n",
"\n",
" Group\n",
" 3 (12 students) - TA identified as male, true TA gender female \n",
"\t 4 (11 students) - TA identified as male, true TA gender male\n",
"\t 5 (11 students, 8 responders) - TA identified as female, true TA gender female\n",
"\t 6 (13 students, 12 responders) - TA identified as female, true TA gender male\n",
" tagender - 1 if TA is actually male, 0 if actually female \n",
" taidgender - 1 if TA is identified as male, 0 if identified as female \n",
" gender - 1 if student is male, 0 if student is female\n",
"\n",
"There are grades for 47 students but evaluations for only 43 (4 did not respond: three in group 5 and one in group 6). \n",
"The grades are not linked to the evaluations, per the IRB protocol.\n",
"\n",
"Our null hypothesis will include the assumption that whether a student responds does not depend on the identified gender of the TA: the 4 nonresponders would have remained nonresponders had they been in that instructor's other section, and no additional students would become nonresponders if they had been assigned to the same instructor's other section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interlude: partitioning sets into more than two subsets\n",
"\n",
"Recall that there are $n \\choose k$ ways of picking a subset of size $k$ from $n$ items;\n",
"hence there are $n \\choose k$ ways of dividing a set into a subset of size $k$ and one of size $n-k$\n",
"(once you select those that belong to the subset of size $k$, the rest must be in the complementary\n",
"subset of size $n-k$.\n",
"\n",
"In this problem, we are partitioning 47 things into 4 subsets, two of size 11, one of size 12, and\n",
"one of size 13.\n",
"How many ways are there of doing that?\n",
"\n",
"\n",
"Recall the [Fundamental Rule of Counting](http://www.stat.berkeley.edu/~stark/SticiGui/SticiGui/Text/counting.htm#fundamental_rule): \n",
"If a set of choices, $T_1, T_2, \\ldots, T_k$, could result, respectively, \n",
"in $n_1, n_2, \\ldots, n_k$ possible outcomes, the entire set of $k$ choices has\n",
"$\\prod_{i=1}^k n_k$ possible outcomes.\n",
"\n",
"We can think of the allocation of students to the four groups as choosing 11 of the 47\n",
"students for the first group, then 11 of the remaining 36 for the second, \n",
"then 12 of the remaining 25 for the third.\n",
"The fourth group must contain the remaining 13.\n",
"\n",
"The number of ways of doing that is\n",
"\n",
"$$ \n",
" {47 \\choose 11}{36 \\choose 11}{25 \\choose 12} =\n",
" \\frac{47!}{11! 36!} \\frac{36!}{11! 25!} \\frac{25!}{12! 13!} = \\frac{47!}{11! 11! 12! 13!}.\n",
"$$\n",
"\n",
"Does the number depend on the order in which we made the choices?\n",
"Suppose we made the choices in a different order: first 12 students for one group, then\n",
"11 for the second, then 13 for the third (the fourth gets the remaining 11 students).\n",
"The number of ways of doing that is\n",
"$$ \n",
" {47 \\choose 12}{35 \\choose 11}{24 \\choose 13} =\n",
" \\frac{47!}{12! 35!} \\frac{35!}{11! 24!} \\frac{24!}{13! 11!} = \\frac{47!}{12! 11! 13! 11!} = .\n",
"$$\n",
"The number does not depend on the order in which we make the choices.\n",
"\n",
"By the same reasoning, the number of ways of dividing a set of $n$ objects into\n",
"$m$ subsets of sizes $k_1, \\ldots k_m$ is given by the _multinomial coefficient_\n",
"$$\n",
" {n \\choose k_1, k_2, \\ldots, k_m} =\n",
" {n \\choose k_1}{n-k_1 \\choose k_2} {n-k_1-k_2 \\choose k_3} \\cdots {n - \\sum_{i=1}^{m-1} k_i \\choose k_{m-1}}\n",
"$$\n",
"\n",
"$$ = \\frac{n! (n-k_1)! (n-k_1 - k_2)! \\cdots \n",
" (n - k_1 - \\cdots - k_{m-1}!}{k_1! (n-k_1)! k_2! (n-k_1-k_2)! \\cdots k_m!}\n",
" = \\frac{n!}{\\prod_{i=1}^m k_i!}.\n",
"$$"
]
},
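{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small sketch: compute 47 choose (11, 11, 12, 13) directly from the\n",
"# multinomial formula and via successive binomial choices, and check that\n",
"# the two agree (the order of the choices does not matter).\n",
"import math\n",
"\n",
"f = math.factorial\n",
"direct = f(47) // (f(11) * f(11) * f(12) * f(13))\n",
"stepwise = ((f(47) // (f(11) * f(36)))\n",
"            * (f(36) // (f(11) * f(25)))\n",
"            * (f(25) // (f(12) * f(13))))\n",
"print(direct == stepwise, direct)  # about 5.44e+25\n"
]
},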
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will check how surprising it would be for the means to be so much lower when the TA is identified as female, if in fact there is \"no real difference\" in how they were rated, and the apparent difference is just due to the luck of the draw: which students happened to end up in which section.\n",
"\n",
"In the actual randomization, all $47 \\choose 11, 11, 12, 13$ allocations\n",
"were equally likely.\n",
"But there might be real differences between the two instructors.\n",
"Hence, we'd like to use each of them as his or her own \"control.\"\n",
"\n",
"Each student's potential responses are represented by a ticket with 4 numbers:\n",
"\n",
"+ the rating that the student would assign to instructor A if instructor A is identified as male\n",
"+ the rating that the student would assign to instructor A if instructor A is identified as female\n",
"+ the rating that the student would assign to instructor B if instructor B is identified as male\n",
"+ the rating that the student would assign to instructor B if instructor B is identified as female\n",
"\n",
"The null hypothesis is that the first two numbers are equal and the second two numbers are equal,\n",
"but the first two numbers might be different from the second two numbers.\n",
"This corresponds to the hypohtesis that \n",
"students assigned to a given TA would rate him or her the same, whether that TA seemed to be male or female.\n",
"For all students assigned instructor A, we know both of the first two numbers if the hull hypothesis\n",
"is true; but we know neither of the second two numbers.\n",
"Similarly, if the null hypothesis is true, we know both of the second two numbers for all students\n",
"assigned to instructor B, but we know neither of the first two numbers for those students.\n",
"\n",
"Because of how the randomization was performed, all allocations \n",
"of students to sections that keep the number of students in each section the same are equally likely, so\n",
"in particular all allocations that keep the same students assigned to each actual instructor\n",
"the same are equally likely.\n",
"This is a subgroup of the full permutation group.\n",
"\n",
"Hence, all ${23 \\choose 12}$ ways of splitting the 23 students assigned to the female instructor into two groups, one with 11 students and one with 12, are equally likely. Similarly, all\n",
"${24 \\choose 11}$ ways of splitting the 24 students assigned to the male instructor into two groups, one with 13 students and one with 11, are equally likely.\n",
"We can thus imagine shuffling the female TA's students between her sections, and the male TA's students\n",
"between his sections, and examine the distribution of the difference between the mean score for the sections where the\n",
"TA was identified as male is larger than the mean score for the sections where the TA was identified as\n",
"female.\n",
"(The 4 nonresponders are also allocated at random between sections of the same \n",
"instructor; they are not included in the averages.)\n",
"\n",
"If the difference is rarely as large as the observed mean difference, the observed mean difference gives\n",
"evidence that being identified as female really does lower the scores."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.44175311066e+25\n"
]
}
],
"source": [
"# Number of allocations to 11, 11, 12, 13\n",
"print(sp.special.binom(47,11)*sp.special.binom(36,11)*sp.special.binom(25,12)) # big number!"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" group professional respect caring enthusiastic communicate helpful \\\n",
"0 3 5.0 5.0 4.0 4.0 4.0 3.0 \n",
"1 3 4.0 4.0 4.0 4.0 5.0 5.0 \n",
"2 3 5.0 5.0 5.0 5.0 5.0 5.0 \n",
"3 3 5.0 5.0 5.0 5.0 5.0 3.0 \n",
"4 3 5.0 5.0 5.0 5.0 5.0 5.0 \n",
"\n",
" feedback prompt consistent fair responsive praised knowledgeable \\\n",
"0 4.0 4.0 4.0 4.0 4.0 4.0 3.0 \n",
"1 5.0 5.0 3.0 4.0 5.0 5.0 5.0 \n",
"2 5.0 5.0 5.0 5.0 5.0 5.0 5.0 \n",
"3 5.0 5.0 5.0 5.0 3.0 5.0 5.0 \n",
"4 5.0 3.0 4.0 5.0 5.0 5.0 5.0 \n",
"\n",
" clear overall gender age tagender taidgender \n",
"0 5.0 4.0 2.0 1990.0 0 1 \n",
"1 5.0 4.0 1.0 1992.0 0 1 \n",
"2 5.0 5.0 2.0 1991.0 0 1 \n",
"3 5.0 5.0 2.0 1991.0 0 1 \n",
"4 5.0 5.0 2.0 1992.0 0 1 \n",
" group professional respect caring enthusiastic \\\n",
"count 47.000000 43.000000 43.000000 43.000000 43.000000 \n",
"mean 4.531915 4.325581 4.325581 3.930233 3.906977 \n",
"std 1.158167 1.062811 1.062811 1.055492 1.019196 \n",
"min 3.000000 1.000000 1.000000 1.000000 1.000000 \n",
"25% 3.500000 4.000000 4.000000 3.500000 4.000000 \n",
"50% 5.000000 5.000000 5.000000 4.000000 4.000000 \n",
"75% 6.000000 5.000000 5.000000 5.000000 4.500000 \n",
"max 6.000000 5.000000 5.000000 5.000000 5.000000 \n",
"\n",
" communicate helpful feedback prompt consistent fair \\\n",
"count 43.000000 43.000000 43.000000 43.000000 43.000000 43.000000 \n",
"mean 3.953488 3.744186 3.953488 3.976744 3.744186 3.906977 \n",
"std 1.068009 1.071115 1.132916 1.079867 1.156617 0.995560 \n",
"min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n",
"25% 4.000000 3.000000 4.000000 4.000000 3.000000 3.500000 \n",
"50% 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 \n",
"75% 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 \n",
"max 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 \n",
"\n",
" responsive praised knowledgeable clear overall gender \\\n",
"count 43.000000 43.000000 43.000000 43.000000 43.000000 43.000000 \n",
"mean 3.767442 4.209302 4.139535 3.720930 3.953488 1.534884 \n",
"std 0.996116 0.940064 0.989983 1.259735 1.022450 0.504685 \n",
"min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n",
"25% 3.000000 4.000000 4.000000 3.000000 4.000000 1.000000 \n",
"50% 4.000000 4.000000 4.000000 4.000000 4.000000 2.000000 \n",
"75% 4.500000 5.000000 5.000000 5.000000 5.000000 2.000000 \n",
"max 5.000000 5.000000 5.000000 5.000000 5.000000 2.000000 \n",
"\n",
" age tagender taidgender \n",
"count 43.000000 47.000000 47.000000 \n",
"mean 1990.325581 0.510638 0.489362 \n",
"std 4.115713 0.505291 0.505291 \n",
"min 1982.000000 0.000000 0.000000 \n",
"25% 1990.000000 0.000000 0.000000 \n",
"50% 1990.000000 1.000000 0.000000 \n",
"75% 1991.000000 1.000000 1.000000 \n",
"max 2012.000000 1.000000 1.000000 \n"
]
}
],
"source": [
"# the data are in a .csv file called \"Macnell-RatingsData.csv\" in the directory Data\n",
"import pandas as pd\n",
"\n",
"ratings = pd.read_csv(\"./Data/Macnell-RatingsData.csv\")\n",
"categories = ratings.columns.values.tolist()[1:15]\n",
"print(ratings.head())\n",
"print(ratings.describe())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"import numpy as np\n",
"from scipy.stats import ttest_ind\n",
"from permute.core import corr, two_sample\n",
"from permute.utils import get_prng, permute_within_groups\n",
"\n",
"rs = np.random.RandomState(seed=123456789) # set the seed of the PRNG, for reproducibility"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def stratified_two_sample(x, y, group_x, group_y, stat='mean', alternative=\"greater\", reps=10**4, \n",
" keep_dist=False, seed=None):\n",
" \"\"\"\n",
" One-sided or two-sided, two-sample permutation test for equality of\n",
" two means, with p-value estimated by simulated random sampling with\n",
" reps replications.\n",
"\n",
"    Tests the hypothesis that x and y are a random partition of the pooled\n",
"    data against the alternative that x comes from a population with mean\n",
"\n",
"    (a) greater than that of the population from which y comes,\n",
"        if ``alternative`` is 'greater'\n",
"    (b) less than that of the population from which y comes,\n",
"        if ``alternative`` is 'less'\n",
"    (c) different from that of the population from which y comes,\n",
"        if ``alternative`` is 'two-sided'\n",
"\n",
" Permutations are carried out within the given groups. Under the null hypothesis,\n",
" observations within each group are exchangeable.\n",
"\n",
"    If ``keep_dist``, return the distribution of values of the test statistic\n",
"    as well; otherwise, return only the test statistic and the estimated p-value.\n",
"\n",
" Parameters\n",
" ----------\n",
" x : array-like\n",
" Sample 1\n",
" y : array-like\n",
" Sample 2\n",
" group_x : array-like\n",
" Group assignments for sample 1\n",
" group_y : array-like\n",
" Group assignments for sample 2\n",
"    stat : {'mean', 't'} or callable\n",
" The test statistic.\n",
"\n",
" (a) If stat == 'mean', the test statistic is (mean(x) - mean(y))\n",
" (equivalently, sum(x), since those are monotonically related), omitting\n",
" NaNs, which therefore can be used to code non-responders\n",
" (b) If stat == 't', the test statistic is the two-sample t-statistic--\n",
" but the p-value is still estimated by the randomization,\n",
" approximating the permutation distribution.\n",
" The t-statistic is computed using scipy.stats.ttest_ind\n",
" (c) If stat is a function (a callable object), the test statistic is\n",
" that function. The function should take a permutation of the pooled\n",
" data and compute the test function from it. For instance, if the\n",
" test statistic is the Kolmogorov-Smirnov distance between the\n",
" empirical distributions of the two samples, max_t |F_x(t) - F_y(t)|,\n",
" the test statistic could be written:\n",
"\n",
" f = lambda u: np.max( \\\n",
" [abs(sum(u[:len(x)]<=v)/len(x)-sum(u[len(x):]<=v)/len(y)) for v in u]\\\n",
" ) \n",
" alternative : {'greater', 'less', 'two-sided'}\n",
" The alternative hypothesis to test\n",
" reps : int\n",
" Number of permutations\n",
" keep_dist : bool\n",
" flag for whether to store and return the array of values\n",
"        of the test statistic\n",
"    seed : {None, int, RandomState instance}\n",
" If None, the pseudorandom number generator is the RandomState\n",
" instance used by `np.random`;\n",
" If int, seed is the seed used by the random number generator;\n",
" If RandomState instance, seed is the pseudorandom number generator.\n",
"\n",
" Returns\n",
" -------\n",
" float\n",
" the estimated p-value\n",
" float\n",
" the test statistic\n",
" list\n",
" The distribution of test statistics.\n",
" These values are only returned if `keep_dist` == True \n",
" \"\"\"\n",
" \n",
" prng = get_prng(seed)\n",
" \n",
" z = np.concatenate([x, y]) # pooled responses\n",
" groups = np.concatenate([group_x, group_y]) # pooled group assignments\n",
" \n",
" # If stat is callable, use it as the test function. Otherwise, look in the dictionary\n",
" stats = {\n",
" 'mean': lambda u: np.nanmean(u[:len(x)]) - np.nanmean(u[len(x):]),\n",
" 't': lambda u: ttest_ind(\n",
"            u[:len(x)][~np.isnan(u[:len(x)])],\n",
"            u[len(x):][~np.isnan(u[len(x):])],\n",
" equal_var=True)[0]\n",
" }\n",
" if callable(stat):\n",
" tst_fun = stat\n",
" else:\n",
" tst_fun = stats[stat]\n",
"\n",
" theStat = {\n",
" 'greater': tst_fun,\n",
" 'less': lambda u: -tst_fun(u),\n",
" 'two-sided': lambda u: math.fabs(tst_fun(u))\n",
" }\n",
" tst = theStat[alternative](z)\n",
" observed_tst = tst_fun(z)\n",
" \n",
" if keep_dist:\n",
" dist = np.empty(reps)\n",
" for i in range(reps):\n",
" dist[i] = theStat[alternative](permute_within_groups(z, groups, seed=prng))\n",
" hits = np.sum(dist >= tst)\n",
" return hits/reps, tst, dist\n",
" else:\n",
" hits = np.sum([(theStat[alternative](permute_within_groups(z, groups, seed=prng)) >= tst)\n",
" for i in range(reps)])\n",
" return hits/reps, tst"
]
},
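{
"cell_type": "markdown",
"metadata": {},
"source": [
"The key step above is permuting responses only _within_ the strata (here, the TA's actual gender), so values never move between groups. A minimal numpy sketch of within-group shuffling on toy data (an illustrative stand-in for `permute_within_groups`, not its actual implementation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def shuffle_within_groups(z, groups, prng):\n",
"    # shuffle the values among the positions sharing each group label\n",
"    z = np.asarray(z, dtype=float).copy()\n",
"    for g in np.unique(groups):\n",
"        idx = np.flatnonzero(groups == g)\n",
"        z[idx] = prng.permutation(z[idx])\n",
"    return z\n",
"\n",
"prng = np.random.RandomState(0)\n",
"toy = np.array([1., 2., 3., 10., 20., 30.])\n",
"labels = np.array([0, 0, 0, 1, 1, 1])\n",
"shuffle_within_groups(toy, labels, prng)  # small values stay in group 0, large in group 1"
]
},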
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overall rating:\n",
"Difference in means: 0.4739130434782606\n",
"P-value (two-sided): 0.1149 \n",
"\n",
"\n",
"\n",
"Category Diff means P-value \n",
"professional 0.61 0.06\n",
"respect 0.61 0.07\n",
"caring 0.52 0.10\n",
"enthusiastic 0.57 0.06\n",
"communicate 0.57 0.07\n",
"helpful 0.46 0.17\n",
"feedback 0.47 0.15\n",
"prompt 0.80 0.01\n",
"consistent 0.46 0.21\n",
"fair 0.76 0.01\n",
"responsive 0.22 0.48\n",
"praised 0.67 0.02\n",
"knowledgeable 0.35 0.29\n",
"clear 0.41 0.30\n"
]
}
],
"source": [
"(p, t) = stratified_two_sample(ratings['overall'][ratings.taidgender==1], ratings['overall'][ratings.taidgender==0], \n",
" ratings['tagender'][ratings.taidgender == 1], \n",
" ratings['tagender'][ratings.taidgender == 0],\n",
" alternative = \"two-sided\", seed = rs)\n",
"print('Overall rating:')\n",
"print('Difference in means:', t)\n",
"print('P-value (two-sided):', np.round(p, 5), \"\\n\")\n",
"\n",
"print('\\n\\n{0:20} {1:8} {2:8}'.format('Category', 'Diff means', 'P-value'))\n",
"for col in categories:\n",
" (p, t) = stratified_two_sample(ratings[col][ratings.taidgender==1], ratings[col][ratings.taidgender==0], \n",
" ratings['tagender'][ratings.taidgender == 1],\n",
" ratings['tagender'][ratings.taidgender == 0],\n",
" alternative = \"two-sided\", seed = rs)\n",
" print ('{0:20} {1:8.2f} {2:8.2f}'.format(col, t, p))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next: [General permutation tests](./permute-distance.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}