# Exploring Click Logs and Testing Statistical Hypotheses

For class today, you read a chapter on Exploratory Data Analysis from the book "Doing Data Science." Today, we'll walk through an analysis of the same data presented in that chapter and extend it with multiple hypothesis testing.

## Statistical Hypothesis Testing
Hypothesis testing is a powerful tool in statistical analysis, used to support claims like "Drug X is better than Drug Y" or "Safety courses led by a talking bear are effective at preventing forest fires." Questions about whether an experiment was effective, or whether the results of two processes differ, will come up constantly in your data science career.

Hypothesis testing allows us to answer these questions with statistical rigor. Generally, we establish a "null hypothesis" and then conduct a test which tells us, given the data, whether or not we can *reject* that null hypothesis. Usually, the null hypothesis is something like "These two drugs are the *same*" or "This measure's mean is no different from zero."

We're going to work with one such tool, the Student's two-sample t-test.

### Historical Anecdote
This was not a test designed for students; rather, it was designed by the statistician William Gosset, who [published](http://www.york.ac.uk/depts/maths/histstat/student.pdf) under the pseudonym "Student" while working for the Guinness brewing company.

### Back to Statistics

Student's t-test is used when comparing samples of normally distributed variables. This assumption of normality is important, but not strict.
If the data is very much not normally distributed, a non-parametric method (that is, one agnostic to the underlying distribution) like the Wilcoxon signed-rank test can be used instead.

We'll use Welch's t-test, a variant of the classic two-sample independent Student's test that allows for two samples of different sizes and possibly different variances (which, as we'll see in our data set, may well be true!).

To perform a t-test, we compute a "t statistic" or "t score" from the two samples. The formula is:

$t = \frac{mean(X_1) - mean(X_2)}{s_{X_1-X_2}}$

where $s_{X_1-X_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}$, $s_i$ is the sample standard deviation of sample $i$, and $n_i$ is the size of sample $i$. (Note the $+$ under the square root: the sampling variances of the two means add.)

This score comes from a t-distribution, which looks like a normal distribution but with fatter tails. If the two samples are similar, the t-statistic will be close to 0. If they're not, the t-statistic will be large in absolute value.

Take a moment and think about the math. Under what conditions is the statistic largest in magnitude? When the means are very different and their respective sample standard deviations are very small.

Since the t-statistic comes from a statistical distribution, we can map its value to the probability of sampling a value at least that extreme under the null hypothesis. The probability of realizing a large t-statistic when the two samples come from the same distribution is very small, so its p-value is also small.

![T-stat](http://upload.wikimedia.org/wikipedia/commons/4/41/Student_t_pdf.svg)

Courtesy of Wikipedia, the probability density function for the Student's t-distribution.
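As a quick numerical sanity check (not part of the original chapter), the statistic defined above can be computed by hand and compared against SciPy's implementation: `scipy.stats.ttest_ind` with `equal_var=False` performs Welch's test. The synthetic samples below are illustrative only.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# Two samples with different means, variances, and sizes -- exactly the
# situation Welch's test is built for.
x1 = np.random.normal(loc=0.0, scale=1.0, size=100)
x2 = np.random.normal(loc=0.5, scale=2.0, size=150)

# Hand-computed Welch t-statistic: (mean(X1) - mean(X2)) / sqrt(s1^2/n1 + s2^2/n2)
s = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t_by_hand = (x1.mean() - x2.mean()) / s

# SciPy's Welch's t-test (equal_var=False selects the Welch variant)
t_scipy, p_value = stats.ttest_ind(x1, x2, equal_var=False)

print(t_by_hand, t_scipy, p_value)
```

The two t values should agree to floating-point precision, since SciPy computes the same statistic.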
$\nu$ indicates the distribution's degrees of freedom, which grows with the sample size.

So, the output of a t-test is usually a pair of statistics: the "t score" and a "p value." The p value has a natural interpretation as a probability. We can say: "With $100 \times (1-p)$ percent confidence, I reject the null hypothesis that these two samples are the same." In other words, when p is very small, the two samples are statistically likely to be different.

## Lab Work

## Setup - Install Scipy

You'll need to install a Python package on your VM using the following command at the shell: `sudo apt-get install python-scipy`

## Setup - Data Loading

We'll be working with a single day's worth of session log data from the New York Times website. The data is available [here](http://stat.columbia.edu/~rachel/datasets/nyt1.csv) as well as in the class github repo. The data has been nicely cleaned and aggregated for us (no janitor work!). Load up a single day's data into a `DataFrame`, and summarize it.

```python
import pandas as pd

data = pd.read_csv("nyt1.csv")
data.describe()
```
|       | Age | Gender | Impressions | Clicks | Signed_In |
|-------|-----|--------|-------------|--------|-----------|
| count | 458441.000000 | 458441.000000 | 458441.000000 | 458441.000000 | 458441.000000 |
| mean  | 29.482551 | 0.367037 | 5.007316 | 0.092594 | 0.700930 |
| std   | 23.607034 | 0.481997 | 2.239349 | 0.309973 | 0.457851 |
| min   | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25%   | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 |
| 50%   | 31.000000 | 0.000000 | 5.000000 | 0.000000 | 1.000000 |
| 75%   | 48.000000 | 1.000000 | 6.000000 | 0.000000 | 1.000000 |
| max   | 108.000000 | 1.000000 | 20.000000 | 4.000000 | 1.000000 |

8 rows × 5 columns
|   | Age | Gender | Impressions | Clicks | Signed_In |
|---|-----|--------|-------------|--------|-----------|
| 0 | 36 | 0 | 3  | 0 | 1 |
| 1 | 73 | 1 | 3  | 0 | 1 |
| 2 | 30 | 0 | 3  | 0 | 1 |
| 3 | 49 | 1 | 3  | 0 | 1 |
| 4 | 47 | 1 | 11 | 0 | 1 |

5 rows × 5 columns
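The tables that follow show two derived columns, `CTR` and `AgeGroup`, but the notebook cells that created them are not preserved in this copy. Below is a plausible reconstruction, offered as a sketch only: the bin edges are read off the `AgeGroup` values in the output, and the `Impressions > 0` filter is inferred from the second summary (the count drops from 458441 to 455375 and the minimum impression count becomes 1).

```python
import pandas as pd

def add_ctr_and_agegroup(df):
    """Filter to rows with at least one impression, then add CTR and AgeGroup.

    CTR (click-through rate) is clicks per impression, so rows with zero
    impressions must be dropped first to avoid dividing by zero.
    """
    df = df[df.Impressions > 0].copy()
    df["CTR"] = df.Clicks / df.Impressions
    # Bucket ages into the intervals seen in the AgeGroup column of the output.
    df["AgeGroup"] = pd.cut(df.Age, [0, 18, 24, 34, 44, 54, 64, 1000])
    return df
```

Usage would look like `data = add_ctr_and_agegroup(pd.read_csv("nyt1.csv"))` followed by `data.head()` and `data.describe()`, matching the outputs shown here. Note that signed-out users have `Age` 0, which falls outside every bin and so gets a missing `AgeGroup`.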
|   | Age | Gender | Impressions | Clicks | Signed_In | CTR | AgeGroup |
|---|-----|--------|-------------|--------|-----------|-----|----------|
| 0 | 36 | 0 | 3  | 0 | 1 | 0 | (34, 44] |
| 1 | 73 | 1 | 3  | 0 | 1 | 0 | (64, 1000] |
| 2 | 30 | 0 | 3  | 0 | 1 | 0 | (24, 34] |
| 3 | 49 | 1 | 3  | 0 | 1 | 0 | (44, 54] |
| 4 | 47 | 1 | 11 | 0 | 1 | 0 | (44, 54] |

5 rows × 7 columns
|       | Age | Gender | Impressions | Clicks | Signed_In | CTR |
|-------|-----|--------|-------------|--------|-----------|-----|
| count | 455375.000000 | 455375.000000 | 455375.000000 | 455375.000000 | 455375.000000 | 455375.000000 |
| mean  | 29.484010 | 0.367051 | 5.041030 | 0.093218 | 0.700956 | 0.018471 |
| std   | 23.606697 | 0.482001 | 2.208731 | 0.310922 | 0.457839 | 0.069034 |
| min   | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25%   | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50%   | 31.000000 | 0.000000 | 5.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75%   | 48.000000 | 1.000000 | 6.000000 | 0.000000 | 1.000000 | 0.000000 |
| max   | 108.000000 | 1.000000 | 20.000000 | 4.000000 | 1.000000 | 1.000000 |

8 rows × 6 columns
|    | Group 1  | Group 2    | p-value  |
|----|----------|------------|----------|
| 2  | (0, 18]  | (34, 44]   | 0.000028 |
| 3  | (0, 18]  | (44, 54]   | 0.000121 |
| 1  | (0, 18]  | (24, 34]   | 0.001439 |
| 0  | (0, 18]  | (18, 24]   | 0.004526 |
| 17 | (34, 44] | (64, 1000] | 0.007703 |
| 16 | (34, 44] | (54, 64]   | 0.010931 |
| 19 | (44, 54] | (64, 1000] | 0.024254 |
| 18 | (44, 54] | (54, 64]   | 0.032186 |
| 4  | (0, 18]  | (54, 64]   | 0.033328 |
| 5  | (0, 18]  | (64, 1000] | 0.037107 |

10 rows × 3 columns