---
title: "Exam II"
output: html_document
---

Read in the following R packages that we will need for the exam:

```{r, message=FALSE}
library(dplyr)
library(ggplot2)
library(tmodels)
library(readr)
```

Write code as needed below and fill in your written answers in full sentences
wherever you see the prompt "Answer:". Knit the file, print out the HTML file,
and bring the answers to class.

## Cancer incidence dataset

Run the following code to read in a dataset about cancer incidence rates
(per 100,000 people) at the county level in the U.S.; the code also creates two
categorical variables that will be needed for the questions that follow.

```{r, message=FALSE}
cancer <- read_csv("https://raw.githubusercontent.com/statsmaths/stat_data/gh-pages/cancer_inc_data.csv")
cancer <- mutate(cancer, melanoma_group = if_else(melanoma < 25, "Low", "High"))
cancer <- mutate(cancer, breast_group = if_else(breast < 110, "Low", "High"))
cancer
```

Poverty gives the percentage of households that live below the poverty line,
and income gives the median household income in the county. The racial
demographic variables give the percentage of the population that identifies
with a particular category.

1. Run a hypothesis test to determine whether breast cancer incidence rates are
correlated with median income. Use any appropriate robust non-parametric test.

```{r}

```

Is the test statistically significant at the 0.01 level?

Answer:

2. Repeat the hypothesis test from question 1, but now use a parametric test.

```{r}

```

Is the test statistically significant at the 0.01 level? Explain the point
estimate and what its sign (positive or negative) means in the context of the
dataset.

Answer:

3. Run a hypothesis test to see if lung cancer incidence is affected by the
region that the county is in. Use a parametric test.

```{r}

```

Is the test statistically significant at the 0.01 level?

Answer:

4. Repeat the previous question with a non-parametric test.

```{r}

```

Is the test statistically significant at the 0.01 level?

Answer:

5. Run a one-sample test with the null hypothesis that the mean of the breast
cancer incidence is equal to 117 people per 100k. Use a parametric test.

```{r}

```

6. Run a one-sample test with the null hypothesis that the mean of the breast
cancer incidence is equal to 117 people per 100k. Use a non-parametric test.

```{r}

```

Is the p-value larger than, smaller than, or the same as the one from the
parametric test?

Answer:

7. Test the hypothesis that the categorical variable melanoma group
(`melanoma_group`) is affected by the latitude of the county:

```{r}

```

Are northern counties more or less likely to have high rates of melanoma?

Answer:

8. Run a hypothesis test that predicts the rate of lung cancer as a function of
the poverty rate, controlling for region.

```{r}

```

Within the same region, are areas with a higher poverty rate more or less
likely to have higher rates of lung cancer?

Answer:

9. Add median income to the model from the previous question as an additional
confounding variable.

```{r}

```

This should make the test no longer significant. Explain why this seems
reasonable given how this hypothesis test works.

Answer:

10. Test whether there is a connection between the categorical variables
`breast_group` and `melanoma_group`. Use any statistical test that is
appropriate.

```{r}

```

## Multiple hypothesis testing

11. In a completely different study from the cancer incidence rates, a
researcher collected 6 p-values: 0.00001, 0.005, 0.01, 0.04, 0.05, and 0.3.
Adjust the p-values using the Holm-Bonferroni method:

```{r}

```

How many of the results remain significant at the 0.05 level after the
multiple hypothesis correction?

Answer: