2.5 Hypothesis testing (general)

In hypothesis testing (sometimes more explicitly called “Null Hypothesis Significance Testing” or NHST), hypotheses are formulated to answer a specific question about a population or true parameter(s) using a statistic based on a data set. In your previous statistics course, you (hopefully) considered one-sample hypotheses about population means and proportions and the two-sample mean situation we are focused on here. Here the hypotheses address whether the true mean overtake distances differ between the two groups, with an initial assumption of no difference.

NHST is much like a criminal trial with a jury where you are in the role of a jury member. Initially, the defendant is assumed innocent. In our situation, the true means are assumed to be equal between the groups. Then evidence is presented and, as a juror, you analyze it. In statistical hypothesis testing, data are collected and analyzed. Then you have to decide if there was “enough” evidence to reject the initial assumption (the “innocence” that is initially assumed). To make this decision, you want to have thought about and decided on the standard of evidence required to reject the initial assumption. In criminal cases, “beyond a reasonable doubt” is used. Wikipedia’s definition (https://en.wikipedia.org/wiki/Reasonable_doubt) suggests that this standard is that “there can still be a doubt, but only to the extent that it would not affect a reasonable person’s belief regarding whether or not the defendant is guilty”. In civil trials, a lower standard called a “preponderance of evidence” is used. Based on that defined and pre-decided (a priori) standard, you decide whether the defendant is guilty or not guilty. In statistics, the standard is set by choosing a significance level, \(\alpha\), and then comparing the p-value to it: if the p-value is less than \(\alpha\), we reject the null hypothesis. The choice of the significance level is like the variation in standards of evidence between criminal and civil trials – and in all situations everyone should know the standard required for rejecting the initial assumption before any information is “analyzed”.

Once someone is found guilty, there is the matter of sentencing, which is related to the impacts (“size”) of the crime. In statistics, this is similar to estimating the size of differences and the related judgments about whether the differences are practically important or not. If the crime is proven beyond a reasonable doubt but it is a minor crime, then the sentence will be small. With the same level of evidence and a more serious crime, the sentence will be more dramatic. This latter step is more critical than the p-value as it directly relates to actions to be taken based on the research, but unfortunately p-values and the related decisions get most of the attention.

There are some important aspects of the testing process to note that inform how we interpret statistical hypothesis test results. When someone is found “not guilty”, it does not mean “innocent”; it just means that there was not enough evidence to find the person guilty “beyond a reasonable doubt”. Similarly, not finding enough evidence to reject the null hypothesis does not imply that the true means are equal, just that there was not enough evidence to conclude that they were different. There are many potential reasons why we might fail to reject the null, but the most common one is that our sample size was too small (which is related to having too little evidence). Other reasons include simply the chance variation involved in taking a random sample from the population(s). This randomness in samples and in the resulting sample means also implies that p-values are random and can easily vary if the data set had been slightly different. This also relates to the suggestion of using a graded interpretation of p-values instead of the fixed \(\alpha\) usage – if the p-value is an estimated quantity, is there really any difference between p-values of 0.049 and 0.051? We probably shouldn’t think there is a big difference in results for these two p-values, even though the standard NHST reject/fail-to-reject approach treats them as completely different results. So where does that leave us? Interpret p-values as the strength of evidence against the null hypothesis, remembering that p-values that are somewhat small, even if not tiny, can still be interesting. And if you think the p-value is small enough, then you can reject the null hypothesis and conclude that the alternative hypothesis is a better characterization of the truth – and then estimate the size of the differences.

Throughout this material, we will continue to reiterate the distinctions between parameters and statistics and want you to be clear about the distinction between estimates based on the sample and inferences for the population or true values of the parameters of interest. Remember that statistics are summaries of the sample information and parameters are characteristics of populations (which we rarely know). In the two-sample mean situation, the sample means are always at least a little different – that is not an interesting conclusion. What is interesting is whether we have enough evidence to feel like we have proven that the population or true means differ “beyond a reasonable doubt”.

The scope of any inferences is constrained based on whether there is a random sample (RS) and/or random assignment (RA). Table 2.1 contains the four possible combinations of these two characteristics of a given study. Random assignment of treatment levels to subjects allows for causal inferences for differences that are observed – the difference in treatment levels is said to cause differences in the mean responses. Random sampling (or at least some sort of representative sample) allows inferences to be made to the population of interest. If we do not have RA, then causal inferences cannot be made. If we do not have a representative sample, then our inferences are limited to the sampled subjects.

Table 2.1: Scope of inference summary.
| Random Sampling / Random Assignment | Random Assignment (RA) – Yes (controlled experiment) | Random Assignment (RA) – No (observational study) |
|---|---|---|
| Random Sampling (RS) – Yes (or some method that results in a representative sample of the population of interest) | Because we have RS, we can generalize inferences to the population the RS was taken from. Because we have RA, we can assume the groups were equivalent on all aspects except for the treatment and can establish causal inference. | Can generalize inference to the population the RS was taken from but cannot establish causal inference (no RA – cannot isolate treatment variable as only difference among groups, could be confounding variables). |
| Random Sampling (RS) – No (usually a convenience sample) | Cannot generalize inference to the population of interest because the sample was not random and could be biased – may not be “representative” of the population of interest. Can establish causal inference due to RA \(\rightarrow\) the inference from this type of study applies only to the sample. | Cannot generalize inference to the population of interest because the sample was not random and could be biased – may not be “representative” of the population of interest. Cannot establish causal inference due to lack of RA of the treatment. |

A simple example helps to clarify how the scope of inference can change based on the study design. Suppose we are interested in studying the GPA of students. If we had taken a random sample from, say, Intermediate Statistics students in a given semester at a university, our scope of inference would be the population of students in that semester taking that course. If we had taken a random sample from the entire population of students at that school, then the inferences would be to the entire population of students in that semester. These are similar types of problems but the two populations are very different and the group you are trying to make conclusions about should be noted carefully in your results – it does matter! If we did not have a representative sample, say the students could choose to provide this information or not and some chose not to, then we can only make inferences to volunteers. These volunteers might differ in systematic ways from the entire population of Intermediate Statistics students (for example, they are proud of their GPA) so we cannot safely extend our inferences beyond the group that volunteered.

To consider the impacts of RA versus results from purely observational studies, we need to be comparing groups. Suppose that we are interested in differences in the mean GPAs for different sections of Intermediate Statistics and that we take a random sample of students from each section and compare the results and find evidence of some difference. In this scenario, we can conclude that there is some difference in the population of these statistics students but we can’t say that being in different sections caused the differences in the mean GPAs. Now suppose that we randomly assigned every student to get extra training in one of three different study techniques and found evidence of differences among the training methods. We could conclude that the training methods caused the differences in these students. These conclusions would only apply to Intermediate Statistics students at this university in this semester and could not be generalized to a larger population of students. If we took a random sample of Intermediate Statistics students (say only 10 from each section) and then randomly assigned them to one of three training programs and found evidence of differences, then we can say that the training programs caused the differences. But we can also say that we have evidence that those differences pertain to the population of Intermediate Statistics students in that semester at this university. This seems similar to the scenario where all the students participated in the training programs except that by using random sampling, only a fraction of the population needs to actually be studied to make inferences to the entire population of interest – saving time and money.

A quick summary of the terminology of hypothesis testing is useful at this point. The null hypothesis (\(H_0\)) states that there is no difference or no relationship in the population. This is the statement of no effect or no difference and the claim that we are trying to find evidence against in NHST. In this chapter, \(H_0\): \(\mu_1=\mu_2\). When doing two-group problems, you always need to specify which group is 1 and which one is 2 because the order does matter. The alternative hypothesis (\(H_1\) or \(H_A\)) states a specific difference between parameters. This is the research hypothesis and the claim about the population that we often hope to demonstrate is more reasonable to conclude than the null hypothesis. In the two-group situation, we can have one-sided alternatives \(H_A: \mu_1 > \mu_2\) (greater than) or \(H_A: \mu_1 < \mu_2\) (less than) or, the more common, two-sided alternative \(H_A: \mu_1 \ne \mu_2\) (not equal to). We usually default to using two-sided tests because we often do not know enough to know the direction of a difference a priori, especially in more complicated situations. The sampling distribution under the null is the distribution of all possible values of a statistic under the assumption that \(H_0\) is true. It is used to calculate the p-value, the probability of obtaining a result as extreme as or more extreme (defined by the alternative) than what we observed, given that the null hypothesis is true. We will find sampling distributions using nonparametric approaches (like the permutation approach used previously) and parametric methods (using “named” distributions like the \(t\), \(F\), and \(\chi^2\)).
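To make the permutation idea concrete, the following is a minimal R sketch (not the book’s exact code) of building the sampling distribution of the difference in sample means under \(H_0\) and finding a two-sided p-value. The data frame `dd` and the variables `Distance` and `Condition` are hypothetical stand-ins for a two-group data set.

```r
# Minimal permutation sketch; `dd`, `Distance`, and `Condition` are hypothetical names
Tobs <- diff(tapply(dd$Distance, dd$Condition, mean))    # observed difference in sample means

B <- 1000                       # number of label permutations
Tstar <- numeric(B)
for (b in 1:B) {
  shuffled <- sample(dd$Condition)                       # shuffle group labels, as if H0 were true
  Tstar[b] <- diff(tapply(dd$Distance, shuffled, mean))  # statistic for the permuted labels
}

# Two-sided p-value: proportion of permuted statistics at least as extreme as the observed one
pvalue <- mean(abs(Tstar) >= abs(Tobs))
pvalue
```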

Small p-values are evidence against the null hypothesis because the observed result is unlikely due to chance if \(H_0\) is true. Large p-values provide little to no evidence against \(H_0\) but do not allow us to conclude that the null hypothesis is correct – just that we didn’t find enough evidence to think it was wrong. The level of significance is an a priori definition of how small the p-value needs to be to provide “enough” (sufficient) evidence against \(H_0\). This is most useful to prevent sliding the standards after the results are found but you can interpret p-values as strength of evidence against the null hypothesis without employing the fixed significance level. If using a fixed significance level, we can compare the p-value to the level of significance to decide if the p-value is small enough to constitute sufficient evidence to reject the null hypothesis. We use \(\alpha\) to denote the level of significance and most typically use 0.05 which we refer to as the 5% significance level. We can compare the p-value to this level and make a decision, focusing our interpretation on the strength of evidence we found based on the p-value from very strong to little to none. If we are using the strict version of NHST, the two options for decisions are to either reject the null hypothesis if the p-value \(\le \alpha\) or fail to reject the null hypothesis if the p-value \(> \alpha\). When interpreting hypothesis testing results, remember that the p-value is a measure of how unlikely the observed outcome was, assuming that the null hypothesis is true. It is NOT the probability of the data or the probability of either hypothesis being true. The p-value, simply, is a measure of evidence against the null hypothesis.

Although we want to use graded evidence to interpret p-values, there is one situation where thinking about comparisons to fixed \(\alpha\) levels is useful for understanding and studying statistical hypothesis testing. The specific definition of \(\alpha\) is that it is the probability of rejecting \(H_0\) when \(H_0\) is true, the probability of what is called a Type I error. Type I errors are also called false rejections or false detections. In the two-group mean situation, a Type I error would be concluding that there is a difference in the true means between the groups when none really exists in the population. In the courtroom setting, this is like falsely finding someone guilty. We don’t want to do this very often, so we use small values of the significance level, allowing us to control the rate of Type I errors at \(\alpha\). We also have to worry about Type II errors, which are failing to reject the null hypothesis when it’s false. In a courtroom, this is the same as failing to convict a truly guilty person. This most often occurs due to a lack of evidence that could be due to a small sample size or merely just an unusual sample from the population. You can use Table 2.2 to help you remember all the possibilities.

(ref:tab2-3) Table of decisions and truth scenarios in a hypothesis testing situation. But we never know the truth in a real situation.

Table 2.2: (ref:tab2-3)
|                          | \(\mathbf{H_0}\) True | \(\mathbf{H_0}\) False |
|--------------------------|-----------------------|------------------------|
| FTR \(\mathbf{H_0}\)     | Correct decision      | Type II error          |
| Reject \(\mathbf{H_0}\)  | Type I error          | Correct decision       |
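As a quick illustration of what it means to “control the rate of Type I errors at \(\alpha\)”, here is a small simulation sketch (not from the text; it uses made-up normal populations and t-tests) in which the null hypothesis is true by construction, so every rejection is a false rejection:

```r
# Simulation sketch: both groups share the same true mean, so H0 is true
# and any rejection is a Type I error.
set.seed(406)                                   # arbitrary seed for reproducibility
reps <- 10000
rejects <- logical(reps)
for (r in 1:reps) {
  g1 <- rnorm(15, mean = 0, sd = 1)
  g2 <- rnorm(15, mean = 0, sd = 1)
  rejects[r] <- t.test(g1, g2)$p.value <= 0.05  # reject at the 5% significance level?
}
mean(rejects)                                   # long-run rejection rate, close to 0.05
```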

In comparing different procedures or in planning studies, there is an interest in studying the rate or probability of Type I and II errors. The probability of a Type I error was defined previously as \(\alpha\), the significance level. The power of a procedure is the probability of rejecting the null hypothesis when it is false. Power is defined as

\[\text{Power} = 1 - \text{Probability(Type II error)} = \text{Probability(Reject } H_0 \mid H_0 \text{ is false)},\]

or, in words, the probability of detecting a difference when it actually exists. We want to use a statistical procedure that controls the Type I error rate at the pre-specified level and has high power to detect false null hypotheses. Increasing the sample size is one of the most commonly used methods for increasing the power in a given situation. Sometimes we can choose among different procedures and use the power of the procedures to help us make that selection. Note that there are many ways \(H_0\) can be false and the power changes based on how false the null hypothesis actually is. To make this concrete, suppose that the true mean overtake distances differed by either 1 or 30 cm in the previous example. The chances of rejecting the null hypothesis are much larger when the groups actually differ by 30 cm than if they differ by just 1 cm, given the same sample size. The null hypothesis is false in both cases. Similarly, for a given difference in the true means, the larger the sample, the higher the power of the study to actually find evidence of a difference in the groups. We will see this difference when we return to using the entire overtake data set instead of the sample of \(n=30\) used to illustrate the permutation procedures.
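To put rough numbers on this, `power.t.test()` in base R can approximate the power of a two-sample t-test. In the sketch below, the standard deviation (sd = 30 cm) and the per-group sample sizes are assumed values chosen only for illustration, not estimates from the overtake data:

```r
# power.t.test() sketch; sd and n values are assumptions for illustration only
power.t.test(n = 30,  delta = 1,  sd = 30, sig.level = 0.05)$power  # tiny power for a 1 cm true difference
power.t.test(n = 30,  delta = 30, sd = 30, sig.level = 0.05)$power  # much higher power for a 30 cm difference
power.t.test(n = 300, delta = 30, sd = 30, sig.level = 0.05)$power  # and larger samples increase power further
```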

After making a decision (was there enough evidence to reject the null or not), we want to make the conclusions specific to the problem of interest. If we reject \(H_0\), then we can conclude that there was sufficient evidence at the \(\alpha\)-level that the null hypothesis is wrong (and the results point in the direction of the alternative). If we fail to reject \(H_0\) (FTR \(H_0\)), then we can conclude that there was insufficient evidence at the \(\alpha\)-level to say that the null hypothesis is wrong. We are NOT saying that the null is correct and we NEVER accept the null hypothesis; we just failed to find enough evidence to say it’s wrong. Either way, we then need to revisit the method of data collection and the design of the study to discuss the scope of inference: can we discuss causality (due to RA) and/or make inferences to a larger group than those in the sample (due to RS)?

To perform a hypothesis test, there are several steps to complete to make sure you have thought through and reported all aspects of the results; a brief code sketch following these steps is given after the outline.

Outline of 6+ steps to perform a Hypothesis Test
Preliminary steps:
* Define RQ and consider study design - what question can the data collected address?
* What graphs are appropriate to visualize the data?
* What model/statistic (T) is needed to address RQ?
1. Write the null and alternative hypotheses.
2. Plot the data and assess the “Validity Conditions” for the procedure being used (discussed below).
3. Find the value of the appropriate test statistic and p-value for your hypotheses.
4. Write a conclusion specific to the problem based on the p-value, reporting the strength of evidence against the null hypothesis (include test statistic, its distribution under the null hypothesis, and p-value).
5. Report and discuss an estimate of the size of the differences, with confidence interval(s) if appropriate.
6. Scope of inference discussion for results.
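As a rough illustration of steps 2 through 5, a minimal R sketch is given below, again using the hypothetical data frame `dd` with response `Distance` and two-level group `Condition`; the equal-variance two-sample t-test here stands in for whichever procedure (permutation or parametric) is chosen in step 3:

```r
# Step 2: plot the data to help assess the validity conditions
boxplot(Distance ~ Condition, data = dd)

# Step 3: test statistic and p-value (equal-variance two-sample t-test as one option)
res <- t.test(Distance ~ Condition, data = dd, var.equal = TRUE)
res$statistic   # observed test statistic
res$p.value     # p-value, reported as strength of evidence against H0 (step 4)

# Step 5: estimated size of the difference with a confidence interval
res$estimate    # the two sample means
res$conf.int    # 95% confidence interval for the difference in the means
```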