{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Launch this notebook on Binder](https://mybinder.org/v2/gh/kasparvonbeelen/ghi_python/main?labpath=10_-_Hypothesis_Testing.ipynb)\n",
"\n",
"\n",
"# Lecture 10: Hypothesis Testing\n",
"## Data Science for Historians (with Python)\n",
"### A very gentle and practical introduction\n",
"### Created by Kaspar Beelen and Luke Blaxill\n",
"\n",
"### For the German Historical Institute, London\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've covered how to describe, summarize, and compare variables. However, we lack a formal procedure to assess the importance of the differences we observe. For example, we established that men are, on average, one year younger than women. But how can we establish the value or 'significance' of this gap?\n",
"\n",
"In this notebook, we move from descriptive to inferential statistics and assess whether the differences between the means of subgroups in our data are statistically significant."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more background on the concepts and terminology used in this notebook, please consult the lecture by Luke Blaxill."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We repeat some of the code from the previous notebook:\n",
"- load the required libraries\n",
"- load the synthetic census data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns\n",
"import random\n",
"from scipy import stats\n",
"from tqdm.notebook import tqdm\n",
"sns.set()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('data/democraphic_data/london_subsample.csv',index_col=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make our analysis more interesting and complex, we add a dimension: place. We study how age differences between men and women vary depending on the registration district. In Pandas, adding subgroups is convenient: simply pass a list of column names, `['district', 'gender']`, instead of the single column we used previously. `.groupby()` reads this list from left to right, i.e. it groups the data first by `district` and then by `gender` within each district."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"district gender\n",
"Bethnal Green F 25.920963\n",
" M 25.613501\n",
" U 20.766603\n",
"Camberwell F 27.986109\n",
" M 26.708931\n",
" ... \n",
"Whitechapel M 26.254569\n",
" U 24.051316\n",
"Woolwich F 26.653639\n",
" M 26.311769\n",
" U 21.427165\n",
"Name: age, Length: 90, dtype: float64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"by_reg_gen = df.groupby(['district','gender'])['age'].mean()\n",
"by_reg_gen"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Working with the output of this operation requires a bit more thought. The `.groupby()` arranges data slightly differently depending on whether you group on one column or more. This becomes apparent when printing the `type()` of the `.index` attribute."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.indexes.base.Index"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df.groupby('gender')['age'].median().index)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.indexes.multi.MultiIndex"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(by_reg_gen.index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`by_reg_gen` orders the data using a `MultiIndex`, which means that the index contains multiple levels (place and gender) via which we can access our data.\n",
"\n",
"Place sits at the highest level of our grouped data. We can access the separate districts using `.loc[]`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"gender\n",
"F 25.920963\n",
"M 25.613501\n",
"U 20.766603\n",
"Name: age, dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"by_reg_gen.loc['Bethnal Green']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we can slice the data by place."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"district gender\n",
"Bethnal Green F 25.920963\n",
" M 25.613501\n",
" U 20.766603\n",
"Camberwell F 27.986109\n",
" M 26.708931\n",
" U 23.410050\n",
"Chelsea F 31.318046\n",
" M 30.430615\n",
" U 28.800000\n",
"Name: age, dtype: float64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"by_reg_gen.loc['Bethnal Green':'Chelsea']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But our index has two levels, place **and** gender. We can obtain the means for women for all districts using the following syntax."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"district\n",
"Bethnal Green 25.920963\n",
"Camberwell 27.986109\n",
"Chelsea 31.318046\n",
"Fulham 27.920645\n",
"Greenwich 27.593689\n",
"Hackney 28.796820\n",
"Hampstead 29.742615\n",
"Holborn 27.719117\n",
"Islington 29.047747\n",
"Kensington 30.823465\n",
"Lambeth 29.019583\n",
"Lewisham 28.879865\n",
"London City 31.323944\n",
"Marylebone 30.465945\n",
"Mile End Old Town 26.298023\n",
"Paddington 30.450756\n",
"Pancras 29.136537\n",
"Poplar 26.384635\n",
"Shoreditch 27.056387\n",
"Southwark 27.144232\n",
"St George Hanover Square 30.821499\n",
"St George In The East 24.545266\n",
"St Giles 30.129131\n",
"St Olave Southwark 26.486323\n",
"Stepney 25.985841\n",
"Strand 29.288732\n",
"Wandsworth 28.028524\n",
"Westminster 28.990447\n",
"Whitechapel 24.649451\n",
"Woolwich 26.653639\n",
"Name: age, dtype: float64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"by_reg_gen.loc[:,'F']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the comma in `.loc[]` (`by_reg_gen.loc[:, 'F']`):\n",
"- the part before the comma indicates the items we want to access from the first level (place). In this case, we entered a colon, meaning all rows from the first to the last.\n",
"- the part after the comma indicates the items we want to select from the second level. In this case, we want to retrieve all elements with value 'F' for the second level of the MultiIndex.\n",
"\n",
"Computing the age differences by place is then fairly straightforward."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"district\n",
"Bethnal Green 0.307461\n",
"Camberwell 1.277178\n",
"Chelsea 0.887431\n",
"Fulham 1.295697\n",
"Greenwich 0.971892\n",
"Hackney 1.357030\n",
"Hampstead 0.645606\n",
"Holborn 0.341140\n",
"Islington 1.430837\n",
"Kensington 1.410332\n",
"Lambeth 1.411260\n",
"Lewisham 1.420661\n",
"London City 3.262573\n",
"Marylebone 0.921905\n",
"Mile End Old Town 0.599087\n",
"Paddington 1.664367\n",
"Pancras 1.031302\n",
"Poplar -0.218534\n",
"Shoreditch 0.130194\n",
"Southwark 0.101532\n",
"St George Hanover Square 1.412338\n",
"St George In The East -0.011402\n",
"St Giles -0.879827\n",
"St Olave Southwark 0.088463\n",
"Stepney -0.334267\n",
"Strand -1.759558\n",
"Wandsworth 1.004169\n",
"Westminster -0.329065\n",
"Whitechapel -1.605118\n",
"Woolwich 0.341871\n",
"Name: age, dtype: float64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"f_m_diff = by_reg_gen.loc[:,'F'] - by_reg_gen.loc[:,'M']\n",
"f_m_diff"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, you can be more restrictive and slice the data by place and obtain only means for women."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"district gender\n",
"Bethnal Green F 25.920963\n",
"Camberwell F 27.986109\n",
"Chelsea F 31.318046\n",
"Name: age, dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"by_reg_gen.loc['Bethnal Green':'Chelsea','F']"
]
},
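{
"cell_type": "markdown",
"metadata": {},
"source": [
"To recap the selection patterns above, here is a minimal sketch on a small, made-up table. The `toy` data frame and its values are hypothetical and not part of the census sample; only the `.groupby()` and `.loc[]` calls mirror what we did with `by_reg_gen`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# hypothetical toy data, not taken from the census sample\n",
"toy = pd.DataFrame({\n",
"    'district': ['A', 'A', 'A', 'B', 'B', 'B'],\n",
"    'gender': ['F', 'F', 'M', 'F', 'M', 'M'],\n",
"    'age': [30, 40, 20, 50, 10, 30],\n",
"})\n",
"\n",
"# grouping on two columns produces a Series with a two-level MultiIndex\n",
"toy_means = toy.groupby(['district', 'gender'])['age'].mean()\n",
"\n",
"# first level only: all genders in district 'A'\n",
"means_a = toy_means.loc['A']\n",
"\n",
"# second level only: women in every district\n",
"means_f = toy_means.loc[:, 'F']\n",
"\n",
"# both levels at once: a single value\n",
"mean_a_f = toy_means.loc[('A', 'F')]\n",
"\n",
"# difference between the means of women and men per district\n",
"toy_diff = toy_means.loc[:, 'F'] - toy_means.loc[:, 'M']\n",
"print(toy_diff)\n",
"```\n",
"\n",
"Because both selections share the same `district` index, the subtraction aligns the values district by district."
]
},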
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hypothesis testing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this stage, we can compute and compare the distribution of variables and calculate their means or other relevant statistics. But a question then immediately arises: are the differences we observe \"significant\", and what do we mean by \"significance\" anyway? In this section, we take a closer look at hypothesis testing from a data-driven perspective.\n",
"\n",
"Traditional statistical methods, such as Student's t-test, arose in times of limited computing power and rely on equations and assumptions about the distribution of the data. Explaining this requires many detours and implies a steep learning curve for the statistically uninitiated. \n",
"\n",
"In this lecture, we explore a data-driven and hopefully intuitive procedure for assessing the significance of observed statistics (such as the mean). First, we discuss a procedure called **bootstrapping** and then explain how to use **permutation** for significance testing. \n",
"\n",
"The question we first address concerns the relation between `gender` and `age`: we want to assess whether the variation we observe across the districts is significant. We focus on Whitechapel and Poplar, as we noticed in the previous section that both districts deviate from the general pattern and contain a slightly younger female population.\n",
"\n",
"The age gap is bigger in Whitechapel than in Poplar, but are the differences between the means significant?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((18772, 4), (40838, 4))"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_whitechapel = df[df.district=='Whitechapel']\n",
"df_poplar = df[df.district=='Poplar']\n",
"df_whitechapel.shape, df_poplar.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" mean std\n",
"gender \n",
"F 26.384635 19.047276\n",
"M 26.603169 18.596804\n",
"U 22.963240 18.230697"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_poplar.groupby('gender')['age'].agg(['mean', 'std'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bootstrap: Compute the Distribution of a Sample Statistic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The means give us an idea of the expected age in different registration districts. However, our statistic is derived from partial data, a subsection of Londoners at a particular point in time, and is therefore an uncertain measure: different samples may produce different means, and we'd like to know how much variation we'd expect if we were to take more samples.\n",
"\n",
"Of course, we cannot do this, as we only have this particular data set. \n",
"But a single statistic remains a weak empirical basis. To estimate the variation and produce confidence intervals for the mean, we will follow a statistical procedure called bootstrapping, which is simple and intuitive and requires fewer assumptions about the distribution of the data. It also avoids equations and statistical theory and is therefore easier to grasp. \n",
"\n",
"We will use the bootstrap method to compute the variation we can expect around the mean for women in Whitechapel.\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"regdist = 'Whitechapel' # change to Poplar or any other registration district\n",
"df_sub_F = df[(df.gender=='F') & (df.district==regdist)]\n",
"df_sub_M = df[(df.gender=='M') & (df.district==regdist)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we start, observe the distribution of the `age` variable for each value of `gender` (blue='F', orange='M'). Actually, they look pretty similar, no? "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_sub_F.age.plot(kind='density')\n",
"df_sub_M.age.plot(kind='density')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The density plot, however, doesn't give us precise information about the gender difference; for this we need the bootstrap procedure."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bootstrapping works as follows:\n",
"\n",
"1. We draw a sample of size n from our data, putting each observation back after drawing it (this is called sampling *with replacement*). Effectively, we treat our data as the population from which we take repeated samples.\n",
"2. For each sample, we compute and record the statistic of interest (the mean).\n",
"3. We repeat steps 1-2 R times. \n",
"\n",
"This procedure will generate a distribution of the sample statistic: it conveys which values are consistent with the data, and which ones are unlikely to occur. Our estimate of the mean may contain error, and we'd like to compute how much variation we could expect due to random chance.\n",
"\n",
"\n",
"In code, it is easy to implement this procedure. In the `for` loop, we randomly sample 100 observations, compute the mean, and store this statistic in `mean_sampled`. We repeat this procedure 1000 times. "
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"mean_sampled = []\n",
"for _ in range(1000): # repeat the code in the block below 1000 times\n",
"    sample = df_sub_F['age'].sample(100, replace=True) # randomly sample 100 observations with replacement\n",
"    mean_sampled.append(sample.mean()) # append the mean of the sample\n",
"pd.Series(mean_sampled).plot(kind='density') # plot the sampling distribution of the statistic\n",
"df_sub_F['age'].plot(kind='density') # plot the original data distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this figure, we plotted the distribution of the sample statistic (blue) and the distribution of the original data (orange). Notice how different these distributions look: the sampling distribution of a statistic is narrower and centred around the mean. The phenomenon is often referred to as the **central limit theorem**: the sampling distribution of the mean will converge to a normal distribution, which is a bell-shaped curve, with most of the values within two standard deviations from the mean. \n",
"\n",
"We will not provide a formal proof, but a simulation using the bootstrap technique should suffice here. Below we take 1000 samples of size 1000 and plot the mean (black), one standard deviation (blue), and two standard deviations (red) to demonstrate that the sampling distribution of the mean is close to a normal distribution."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"mean_sampled = []\n",
"for _ in range(1000): # repeat the code in the block below 1000 times\n",
"    sample = df_sub_F['age'].sample(1000, replace=True) # randomly sample 1000 observations with replacement\n",
"    mean_sampled.append(sample.mean()) # append the mean of the sample\n",
"ax = pd.Series(mean_sampled).plot(kind='hist',bins=100) # plot the sampling distribution of the statistic\n",
"\n",
"sdist_mean = pd.Series(mean_sampled).mean()\n",
"sdist_std = pd.Series(mean_sampled).std()\n",
"\n",
"ax.axvline(x = sdist_mean, color='black', lw=2) # plot mean of distribution of the sampling statistic\n",
"ax.axvline(x = sdist_mean + sdist_std, color='blue', lw=1) # plot line one standard deviation from the mean\n",
"ax.axvline(x = sdist_mean - sdist_std, color='blue', lw=1) # plot line one standard deviation from the mean\n",
"ax.axvline(x = sdist_mean + (2*sdist_std), color='red', lw=1) # plot line two standard deviations from the mean\n",
"ax.axvline(x = sdist_mean - (2*sdist_std), color='red', lw=1) # plot line two standard deviations from the mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After computing the sampling distribution of the mean for women in Whitechapel, we can easily obtain the 95% confidence interval, i.e. the range that contains the central 95% of the sampled means. We can apply `.quantile()` after converting `mean_sampled` to a `pd.Series` object (you cannot apply `.quantile()` to a normal Python list). \n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.025 23.610825\n",
"0.975 25.713175\n",
"dtype: float64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Series(mean_sampled).quantile([0.025,0.975])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mean age of women is likely to vary between 23.6 and 25.7. We interpret this as the range of values that random chance is likely to produce. The probability of observing more extreme values is lower than 0.05. "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"ax = pd.Series(mean_sampled).plot(kind='hist',bins=100) \n",
"ax.axvline(x = df_sub_M['age'].mean(), color='blue', lw=2)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.001"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len([i for i in mean_sampled if i >= df_sub_M['age'].mean()]) / len(mean_sampled)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please repeat this procedure for Poplar. In this case, the results should suggest that the mean age of men is well within the range of the sampling distribution of the mean computed for women."
]
},
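{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make it easier to repeat the bootstrap for another district, the steps of this section could be wrapped in a small function. The sketch below is one possible way to do this, not the only one: the function name `bootstrap_means`, its parameters, and the synthetic `toy_ages` are assumptions made for illustration, and it uses NumPy's random generator instead of the `.sample()` method used above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def bootstrap_means(values, n_samples=1000, sample_size=None, seed=0):\n",
"    # return the means of n_samples resamples drawn with replacement\n",
"    values = np.asarray(values, dtype=float)\n",
"    if sample_size is None:\n",
"        sample_size = len(values)  # classic bootstrap: resample as many rows as we have\n",
"    rng = np.random.default_rng(seed)\n",
"    means = []\n",
"    for _ in range(n_samples):\n",
"        sample = rng.choice(values, size=sample_size, replace=True)\n",
"        means.append(sample.mean())\n",
"    return pd.Series(means)\n",
"\n",
"# synthetic ages standing in for a real column such as df_sub_F['age']\n",
"toy_rng = np.random.default_rng(42)\n",
"toy_ages = toy_rng.integers(0, 80, size=2000)\n",
"\n",
"boot = bootstrap_means(toy_ages, n_samples=1000, sample_size=1000)\n",
"ci_low, ci_high = boot.quantile([0.025, 0.975])  # 95% interval of the sampled means\n",
"print(ci_low, toy_ages.mean(), ci_high)\n",
"```\n",
"\n",
"In this notebook you could call it as `bootstrap_means(df_sub_F['age'], sample_size=1000)` after setting `regdist = 'Poplar'` and recreating `df_sub_F`."
]
},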
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Permutation and Hypothesis Testing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bootstrap is valuable for understanding the distribution of a sample statistic.\n",
"\n",
"The permutation test is valuable for hypothesis testing. Central to the hypothesis test is the question of whether the differences we observe are the product of random chance. The latter will be our **null hypothesis**, namely that the ages of men and women are essentially the same and that random chance explains the difference between the means. \n",
"\n",
"The alternative hypothesis serves as the \"counterpoint\" to the null hypothesis: the difference between the means is **not** the result of random chance. It is by rejecting the null hypothesis that we can accept the alternative hypothesis. \n",
"\n",
"In this sense, we never prove the alternative hypothesis directly but assume it is true because the null hypothesis is inconsistent with the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Permutation Procedure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The permutation test follows a slightly different procedure from the bootstrap but also involves resampling from our data. \n",
"\n",
"1: We note the number of observations for each of our categories of interest (i.e. the number of women and men present in the data frame). Let's call these two groups `M` and `F` for convenience; `len(F)` is the number of observations in group `F`.\n",
"\n",
"2: We combine both groups into one data frame. This embodies the null hypothesis, namely that the mean ages of women and men are essentially the same, i.e. we can ignore the group labels.\n",
"\n",
"3: We shuffle the combined dataset (which means randomizing the order of the rows) and take a sample of size `len(F)` (notice that this sample will contain data from both `M` and `F`).\n",
"\n",
"4: The rows that remain form our second sample, of size `len(M)`.\n",
"\n",
"5: Now, we compute the mean for each sample, take their difference, and record this number.\n",
"\n",
"6: We repeat steps 2 to 5 a number of times, let's say 1000, which generates a distribution of permutation statistics. These statistics tell us what differences a random permutation of the data, one that ignores the categories of interest, will generate. In other words, these are the values that are likely to occur if the null hypothesis were true."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's implement this procedure in code. First, we create three data frames: one for 'M', one for 'F', and one combining 'F' and 'M'."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"regdist = 'Poplar'\n",
"df_sub_F = df[(df.gender=='F') & (df.district==regdist)]\n",
"df_sub_M = df[(df.gender=='M') & (df.district==regdist)]\n",
"df_sub_C = df[df.gender.isin(['F','M']) & (df.district==regdist)] "
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((19993, 4), (19376, 4), (39369, 4))"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_sub_F.shape,df_sub_M.shape, df_sub_C.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For Poplar, the difference between the means is approximately 0.2 years (women being younger than men). Is this the result of random chance? What is the probability of observing this difference if we assume that no difference exists in this district?"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(26.38463462211774, 26.603168868703552)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_sub_F['age'].mean(),df_sub_M['age'].mean()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-0.21853424658581133"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_sub_F['age'].mean() - df_sub_M['age'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To answer this question, we first record the number of observations for 'F' and then implement steps 2-5 in code."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"num_F = df_sub_F.shape[0] # the number of observations for women\n",
"all_idx = set(df_sub_C.index) # get all indices"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "184e3b0f8f8a405bb95f0020cb6a2a66",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=5000.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"permutations = [] # initiate empty list to collect permutation statistics\n",
"\n",
"for _ in tqdm(range(5000)): # we use tqdm to show the progress; running the permutation test can take a while\n",
"    f_idx = set(df_sub_C.sample(num_F).index) # take a sample of size num_F and get its index\n",
"    m_idx = all_idx - f_idx # get the remaining indices (rows not in f_idx)\n",
"    # recent pandas versions do not accept a set as an indexer, so we convert each set to a list\n",
"    diff_f_m_perm = df_sub_C.loc[list(f_idx)]['age'].mean() - df_sub_C.loc[list(m_idx)]['age'].mean() # difference between permuted means\n",
"    permutations.append(diff_f_m_perm) # append permutation statistic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can plot the distribution of the permutation statistics. It indicates that, if we assume no age difference between men and women in Poplar, the difference between the means is likely to fall between -0.6 and 0.6."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pd.Series(permutations).plot(kind='hist',bins=50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The observed difference was -0.2, which is well within the range of what we'd expect if the null hypothesis were true. In this case we don't have enough evidence to reject the null hypothesis. "
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'Frequency')"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots(figsize=(5, 5))\n",
"ax.hist(permutations, bins=11, rwidth=0.9)\n",
"ax.axvline(x = df_sub_F['age'].mean() - df_sub_M['age'].mean(), color='black', lw=2)\n",
"ax.set_xlabel('Mean age differences')\n",
"ax.set_ylabel('Frequency')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this figure we compare the observed difference with the distribution of the permuted differences. The black line shows the difference between the means we computed earlier. You'll notice that the observed value is well within the set of permuted values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oftentimes, statistical analysis requires you to set a threshold value for rejecting the null hypothesis. As we're interested in the deviation from the general pattern (here, women being younger than men), we calculate how likely we are to encounter values equal to or smaller than the observed -0.2 under the null hypothesis. \n",
"\n",
"\n",
"We can compute the probability of this scenario with the values generated by the permutation test. Below we ask what share of the permuted differences is equal to or smaller than -0.2."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.1282"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len([i for i in permutations if i <= (df_sub_F['age'].mean() - df_sub_M['age'].mean())]) / len(permutations)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A typical threshold value for rejecting the null hypothesis is 0.05, but this is purely a convention: nothing magical happens around 0.05. In the social sciences, you'll often encounter 0.1 as the threshold, while in physics thresholds smaller than 0.00001 are common. Again, this is a convention and changes by discipline or study.\n",
"\n",
"In any case, you'll notice that the p-value, i.e. the probability of observing a difference at least this extreme if the null hypothesis were true, is around 0.13 (the value may differ slightly because of randomization). This is substantially above 0.05, and therefore we cannot accept the difference as statistically significant."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result of the permutation test is largely similar to that of Student's t-test. The p-value in this case is the probability of observing a value equal to or lower than the observed difference, assuming there is no difference between men and women in the sample (a one-sided t-test)."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ttest_indResult(statistic=-1.151420997318941, pvalue=0.12478303628250077)"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy.stats import ttest_ind\n",
"ttest_ind(df_sub_F['age'],df_sub_M['age'],alternative='less')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can repeat the permutation for Whitechapel. In this case, the age difference should be highly significant."
]
},
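{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to rerun the test for several districts, the permutation steps could be collected in a helper function. The following is a minimal sketch under stated assumptions: the name `permutation_diffs`, its parameters, and the synthetic `toy_f` and `toy_m` values are hypothetical, and it shuffles NumPy arrays rather than sampling index labels as we did above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def permutation_diffs(group_a, group_b, n_permutations=1000, seed=0):\n",
"    # return the differences between means obtained after shuffling the group labels\n",
"    a = np.asarray(group_a, dtype=float)\n",
"    b = np.asarray(group_b, dtype=float)\n",
"    combined = np.concatenate([a, b])  # pooling the groups embodies the null hypothesis\n",
"    rng = np.random.default_rng(seed)\n",
"    diffs = np.empty(n_permutations)\n",
"    for i in range(n_permutations):\n",
"        shuffled = rng.permutation(combined)  # shuffled copy of the pooled values\n",
"        # the first len(a) values form one sample, the remainder the other\n",
"        diffs[i] = shuffled[:len(a)].mean() - shuffled[len(a):].mean()\n",
"    return diffs\n",
"\n",
"# synthetic values drawn from one and the same distribution, so the null hypothesis holds by construction\n",
"toy_rng = np.random.default_rng(1)\n",
"toy_f = toy_rng.normal(26.0, 19.0, size=2000)\n",
"toy_m = toy_rng.normal(26.0, 19.0, size=2000)\n",
"\n",
"observed = toy_f.mean() - toy_m.mean()\n",
"diffs = permutation_diffs(toy_f, toy_m)\n",
"p_less = (diffs <= observed).mean()  # share of permuted differences at or below the observed one\n",
"print(observed, p_less)\n",
"```\n",
"\n",
"With the census data, the call would look like `permutation_diffs(df_sub_F['age'], df_sub_M['age'], n_permutations=5000)`, and `p_less` plays the same role as the one-sided probability we computed for Poplar."
]
},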
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gender and Disability\n",
"### An additional example\n",
"\n",
"We can also use permutation to test the relation between categorical variables: in this case we study the relation between gender and disability as reported in our census sample."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"p = df.sample(frac=.1).groupby('gender')['disability'].agg([np.sum, len])"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"p['no_disability'] = p['len'] - p['sum']"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"p.rename({'sum':'disability','len':'all'},axis=1,inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"