--- title: "Lab 3" author: "STAT 302" date: "Due Date Here" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` *If you collaborated with anyone, you must include "Collaborated with: FIRSTNAME LASTNAME" at the top of your lab!* ## Part 1. t-test (10 points) Write a function `my_t.test()` that performs a one sample t-test in R. Your code should not use the `t.test` function in it. Your function should have the following parameters: * `x` a numeric vector of data. * `alternative` a character string specifying the alternative hypothesis. This should only accept `"two.sided"`, `"less"`, or `"greater"`. Otherwise, your function should throw an informative error. * `mu` a number indicating the null hypothesis value of the mean. Your function should return a `list` with elements: * `test_stat`: the numeric test statistic. * `df`: the degrees of freedom. * `alternative`: the value of the parameter `alternative`. * `p_val`: the numeric p-value. You should use the following information: * To get the standard error in a one sample t-test, take the standard deviation (`sd()`) of your input and divide it by the square root of the sample size. * Use `pt()` to get the area under the curve for a t-distribution. Be sure to use the parameter `lower.tail`! * The degrees of freedom for this test (`df` within `pt()`) is equal to the sample size - 1. * The p-value should never be less than 0 or greater than 1. (*Hint: Be careful about whether you use `lower.tail = TRUE` or `lower.tail = FALSE` in the two-sided test! One safe option is to use `lower.tail = FALSE` with the absolute value (`abs()`) of your test statistic.*) To prove it works, load the data below (description at https://www.openintro.org/data/index.php?data=helium). THe `air` column represents the distance traveled by an air-filled ball whereas the `helium` column is the same for a helium-filled ball. Use this data for a two-sided t-test to test the hypothesis that the population mean of `helium` is different than 20 using both `my_t.test()` and `t.test()`. The results should match. Do the same for a one-sided t-test testing that the population mean of `helium` is greater than 20. ```{r} helium_data <- read.csv("https://www.openintro.org/data/csv/helium.csv") ``` ## Part 2. Linear model (10 points) Write a function `my_lm()` that fits a linear model in R. Your function should have the following parameters: * `formula`: a `formula` class object, similar to `lm()`. * `data`: input data frame. Your function should return a `table` similar to the coefficent table from `summary()` with rows for each coefficient (including the `(Intercept)`!) and columns for the `Estimate`, `Std. Error`, `t value`, and `Pr(>|t|)`. There should be row and column names. You may find the following information helpful: * Use `model.matrix()` to extract the model matrix $\mathbf{X}$. It takes as input parameters a formula and data. * Use `model.response()` to extract the model response $\mathbf{Y}$. It takes as input a model frame object. * Use `model.frame()` to extract a model frame object. It takes as input parameters a formula and data. * You can solve for linear regression coefficients using the formula $$ \hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}. $$ * Recall matrix operations `solve()`, `t()`, and `%*%`. * The degrees of freedom (df) you should use for your tests are equal to your sample size minus the number of covariates (including intercept!). * You can estimate $\sigma^2$ using the formula $$ \hat{\sigma}^2 = \sum_i \dfrac{(Y_i - \mathbf{X}_i \beta)^2}{\text{df}}. $$ * You can estimate the standard error for your coefficients using the formula $$ se(\hat{\beta}_j) = \sqrt{\left[\hat{\sigma}^2(\mathbf{X}^T\mathbf{X})^{-1}\right]_{[j,j]}}. $$ Note that within the brackets of the right-hand side is a matrix and $[j,j]$ represents the element of the matrix located at the $j$th row and $j$th column indices. * Use `diag()` to extract diagonal components from a matrix. * `Pr(>|t|)` comes from the two-sided t test. $$ \begin{align} H_0: \beta_j &= 0\\ H_a: \beta_j &\neq 0 \end{align} $$ * Use `pt()` to get the area under the curve for a t-distribution. Because the distribution is symmetric, you can multiply this value by $2$ to get the two-sided test output. (*Hint: As before, be careful about whether you use `lower.tail = TRUE` or `lower.tail = FALSE`! One safe option is to use `lower.tail = FALSE` with the absolute value (`abs()`) of your test statistic. Make sure you never end up with a p-value greater than 1!*) To prove it works, Use the code below to read in data from a survey of 55 Duke University students about their study habits and grades. You can read more about this data at https://www.openintro.org/data/index.php?data=gpa. ```{r} grades_data <- read.csv("https://www.openintro.org/data/csv/gpa.csv") ``` Use this data to regress `gpa` upon `studyweek` using both `my_lm()` and `lm()`. The results of `my_lm()` should match the coefficient table from the `summary` of your `lm()` output.