We will begin by reiterating a point made earlier in the semester about testing multiple hypotheses. Often, scientists in the field are looking for associations that seem unlikely to occur by chance. The issue is that if one looks at enough associations, then an association is bound to occur by chance rather than through genuine causality. Recall the metaphor about a needle in a haystack: finding a needle in a haystack might be amazing, but if you had checked 1000 haystacks, perhaps this is less amazing…
To get an idea of what we are about to simulate, consider the following comic. Consider the setup of this experiment: you have a group of people with or without acne, and you further know whether or not each person has eaten a jelly bean of a particular color. We can easily simulate this in R:
# DON'T EDIT THIS CODE
set.seed(13)
# Acne status and red-jelly-bean eating are drawn independently of one
# another: each person has acne with probability 0.4 and ate a red
# jelly bean with probability 0.6
has_acne = rbinom(100, 1, 0.4)
ate_red_jb = rbinom(100, 1, 0.6)
our_data = data.frame(HasAcne = has_acne, AteRedJB = ate_red_jb)
head(our_data)
## HasAcne AteRedJB
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 1
## 6 0 1
summary(our_data)
## HasAcne AteRedJB
## Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :1.00
## Mean :0.37 Mean :0.58
## 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :1.00 Max. :1.00
Here is our simple random sample of 100 people for whom we know both acne status and red-jelly-bean-eating status. It is quite important to note that each of these variables was generated randomly and independently of the other. Therefore, we know something that researchers in the field do not normally know: in this case, red-jelly-bean-eating should have no effect on whether an individual has acne. Let’s show this using one of our logistic regression models.
m1 <- glm(HasAcne ~ AteRedJB, data = our_data, family = binomial('logit'))
summary(m1)
##
## Call:
## glm(formula = HasAcne ~ AteRedJB, family = binomial("logit"),
## data = our_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0620 -1.0620 -0.8203 1.2974 1.5829
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9163 0.3416 -2.683 0.0073 **
## AteRedJB 0.6387 0.4324 1.477 0.1397
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 131.79 on 99 degrees of freedom
## Residual deviance: 129.55 on 98 degrees of freedom
## AIC: 133.55
##
## Number of Fisher Scoring iterations: 4
Examine the Coefficients section, specifically the column Pr(>|z|), which is the column of p-values. Here we see the p-values derived from the tests \(H_0: \beta_i = 0\), where \(i \in \{0,1\}\) and \(\beta_0\), \(\beta_1\) are the coefficients for the Intercept and our AteRedJB variable, respectively. Notice that we got what we expected! The p-value for the test \(H_0: \beta_1 = 0\) is above our usual 0.05 threshold, so we fail to reject the null hypothesis that red-jelly-bean-eating has no effect on acne.
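As an aside, if you ever want to pull this p-value out of the fitted model programmatically rather than reading it off the printed summary, the coefficient table can be indexed directly. A minimal sketch (the row and column names match the output above):

# Extract the p-value for the AteRedJB coefficient from m1;
# this should match the 0.1397 shown in the printed table
coef(summary(m1))["AteRedJB", "Pr(>|z|)"]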
This is all well and good! Statistics works!
But wait. What if we repeated this experiment over and over again? During each experiment we will take a new random sample of people and examine a new color of jelly bean, and we will record whether or not the jelly-bean-eating coefficient for that experiment is significant. Indeed,
# DO NOT EDIT THIS CODE
set.seed(10)
colors = c("Purple", "Brown", "Green", "Pink", "Blue", "Teal", "Salmon", "Red", "Turquoise",
           "Magenta", "Yellow", "Grey", "Tan", "Cyan", "Mauve", "Beige",
           "Lilac", "Black", "Peach", "Orange")
for(index in 1:20){
  # Draw a fresh, independent sample of 100 people for each color
  has_acne = rbinom(100, 1, 0.4)
  ate_color_jb = rbinom(100, 1, 0.6)
  our_data = data.frame(HasAcne = has_acne, AteJB = ate_color_jb)
  fit <- glm(HasAcne ~ AteJB, data = our_data, family = binomial('logit'))
  # Pull the p-value for the AteJB coefficient (row 2 of the coefficient table)
  p_value = coef(summary(fit))[,"Pr(>|z|)"][2]
  print(paste(colors[index], ": ", p_value < 0.05, sep = ""))
}
## [1] "Purple: FALSE"
## [1] "Brown: FALSE"
## [1] "Green: TRUE"
## [1] "Pink: FALSE"
## [1] "Blue: FALSE"
## [1] "Teal: FALSE"
## [1] "Salmon: FALSE"
## [1] "Red: FALSE"
## [1] "Turquoise: FALSE"
## [1] "Magenta: FALSE"
## [1] "Yellow: FALSE"
## [1] "Grey: FALSE"
## [1] "Tan: FALSE"
## [1] "Cyan: FALSE"
## [1] "Mauve: FALSE"
## [1] "Beige: FALSE"
## [1] "Lilac: FALSE"
## [1] "Black: FALSE"
## [1] "Peach: FALSE"
## [1] "Orange: FALSE"
So, were they right? Does eating green jelly beans cause acne?
The purpose of this exercise is to introduce you to the multiple comparisons problem and thereby hopefully serve as a warning to those who want to use statistics in practice in the future. Note that since this is primarily a written exercise, complete sentences and proper grammar are expected.
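Before answering, it may help to see why one “significant” color out of 20 is unsurprising. If each test has a false-positive rate of 0.05 and the tests are independent, the probability of at least one false positive across 20 tests can be computed directly:

# Probability that at least one of 20 independent tests at the
# 0.05 level rejects a true null hypothesis
1 - (1 - 0.05)^20  # about 0.64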
YOUR ANSWER HERE
YOUR ANSWER HERE
YOUR ANSWER HERE
YOUR ANSWER HERE
set.seed()). Do not use a built-in function in R to make this correction. Discuss the results.
YOUR CODE, AND DISCUSSION, HERE
We have spent a lot of time this semester looking at different supervised and unsupervised methods for analyzing data and performing classification. There is a saying in statistics: every dog has its day. This is to say that there is always a situation where a certain method will outperform all the others and will be, in a sense, the “right” method for a given problem. There is an art to picking the right tool in your toolbox for a given application, so the next few problems (on this and the next computer assignment) will try to build your intuition on when to use certain methods. Some questions in this next exercise are similar to previous exercises. This is to give you extra practice and to help motivate what we will try to do next week.
Recall that in the case of LDA we assume that the class-conditional densities are both normal and share the same covariance matrix. That is, for the binary classification problem, we have that:
\[ \text{Data Classified as 0} \sim \mathcal{N}(\mu_0, \Sigma),\\ \text{Data Classified as 1} \sim \mathcal{N}(\mu_1, \Sigma). \]
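Under these assumptions, the log-odds of class 1 versus class 0 is linear in \(x\), which is what makes the LDA decision boundary linear. A short derivation sketch, writing \(\pi_0\) and \(\pi_1\) for the class priors (symbols introduced here just for this sketch, not defined above):

\[ \log \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = \log\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0) + x^T \Sigma^{-1} (\mu_1 - \mu_0). \]

The quadratic terms \(x^T \Sigma^{-1} x\) cancel precisely because both classes share the same \(\Sigma\).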
# Make sure to install this package if you do not have it:
library(mvtnorm)
# Build a random symmetric, positive semi-definite covariance matrix
# by multiplying a random 5x5 matrix with its transpose
sig_sqrt = matrix( sample(seq(from=-2, to = 2, length.out = 25), size = 25), ncol = 5)
my_sigma = t(sig_sqrt) %*% sig_sqrt
# Draw 250 observations from each class: both classes are multivariate
# normal with the SAME covariance matrix, differing only in their means
data_0 = rmvnorm(250, mean = rep(1, times = 5), sigma = my_sigma)
data_1 = rmvnorm(250, mean = rep(-1, times = 5), sigma = my_sigma)
labels = c(rep(0, times = 250), rep(1, times = 250))
# Stack the two classes into one data frame and attach the labels
names = c("variable1", "variable2", "variable3", "variable4", "variable5", "labels")
full_data = rbind(data_0, data_1)
full_data = cbind(full_data, labels)
full_data = as.data.frame(full_data)
names(full_data) = names
A crucial thing to note here is that this data fits the assumptions of an LDA model exactly. The class-conditional densities are, indeed, normal, and they do, indeed, share the same covariance matrix. Let us now examine how well an LDA model does on data like this, and compare it to another one of our models.
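As a quick, optional sanity check on that claim (illustrative only, not part of the graded exercise), one could compare the sample covariance matrices of the two classes; since both were drawn with covariance my_sigma, the entrywise differences should be small relative to sampling noise:

# The two classes share the same population covariance, so their
# sample covariance matrices should agree up to sampling error
round(cov(data_0) - cov(data_1), 2)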
library(MASS)
YOUR CODE, AND ANALYSIS HERE
YOUR CODE, AND ANALYSIS HERE
YOUR CODE, AND ANALYSIS HERE
YOUR CODE, AND ANALYSIS HERE
Note that this is a case where LDA is supposed to perform especially well. The punchline here is that if you encounter data where the class-conditional densities seem normal with the same covariance, then LDA is a very good choice! In our next CA we will examine what happens when the assumptions inherent to LDA are broken, that is, when LDA is perhaps not the best tool for the job.