---
title: "Lab 3"
author: "STAT 302"
date: "Due Date Here"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
*If you collaborated with anyone, you must include "Collaborated with: FIRSTNAME LASTNAME" at the top of your lab!*
## Part 1. t-test (10 points)
Write a function `my_t.test()` that performs a one sample t-test in R. Your code should not use the `t.test` function in it.
Your function should have the following parameters:
* `x` a numeric vector of data.
* `alternative` a character string specifying the alternative hypothesis. This should only accept `"two.sided"`, `"less"`, or `"greater"`. Otherwise, your function should throw an informative error.
* `mu` a number indicating the null hypothesis value of the mean.
Your function should return a `list` with elements:
* `test_stat`: the numeric test statistic.
* `df`: the degrees of freedom.
* `alternative`: the value of the parameter `alternative`.
* `p_val`: the numeric p-value.
You should use the following information:
* To get the standard error in a one sample t-test, take the standard deviation (`sd()`) of your input and divide it by the square root of the sample size.
* Use `pt()` to get the area under the curve for a t-distribution. Be sure to use the parameter `lower.tail`!
* The degrees of freedom for this test (`df` within `pt()`) is equal to the sample size - 1.
* The p-value should never be less than 0 or greater than 1.
(*Hint: Be careful about whether you use `lower.tail = TRUE` or `lower.tail = FALSE` in the two-sided test! One safe option is to use `lower.tail = FALSE` with the absolute value (`abs()`) of your test statistic.*)
To prove it works, load the data below (description at https://www.openintro.org/data/index.php?data=helium). THe `air` column represents the distance traveled by an air-filled ball whereas the `helium` column is the same for a helium-filled ball. Use this data for a two-sided t-test to test the hypothesis that the population mean of `helium` is different than 20 using both `my_t.test()` and `t.test()`. The results should match. Do the same for a one-sided t-test testing that the population mean of `helium` is greater than 20.
```{r}
helium_data <- read.csv("https://www.openintro.org/data/csv/helium.csv")
```
## Part 2. Linear model (10 points)
Write a function `my_lm()` that fits a linear model in R.
Your function should have the following parameters:
* `formula`: a `formula` class object, similar to `lm()`.
* `data`: input data frame.
Your function should return a `table` similar to the coefficent table from `summary()` with rows for each coefficient (including the `(Intercept)`!) and columns for the `Estimate`, `Std. Error`, `t value`, and `Pr(>|t|)`. There should be row and column names.
You may find the following information helpful:
* Use `model.matrix()` to extract the model matrix $\mathbf{X}$. It takes as input parameters a formula and data.
* Use `model.response()` to extract the model response $\mathbf{Y}$. It takes as input a model frame object.
* Use `model.frame()` to extract a model frame object. It takes as input parameters a formula and data.
* You can solve for linear regression coefficients using the formula
$$
\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}.
$$
* Recall matrix operations `solve()`, `t()`, and `%*%`.
* The degrees of freedom (df) you should use for your tests are equal to your sample size minus the number of covariates (including intercept!).
* You can estimate $\sigma^2$ using the formula
$$
\hat{\sigma}^2 = \sum_i \dfrac{(Y_i - \mathbf{X}_i \beta)^2}{\text{df}}.
$$
* You can estimate the standard error for your coefficients using the formula
$$
se(\hat{\beta}_j) = \sqrt{\left[\hat{\sigma}^2(\mathbf{X}^T\mathbf{X})^{-1}\right]_{[j,j]}}.
$$
Note that within the brackets of the right-hand side is a matrix and $[j,j]$ represents the element of the matrix located at the $j$th row and $j$th column indices.
* Use `diag()` to extract diagonal components from a matrix.
* `Pr(>|t|)` comes from the two-sided t test.
$$
\begin{align}
H_0: \beta_j &= 0\\
H_a: \beta_j &\neq 0
\end{align}
$$
* Use `pt()` to get the area under the curve for a t-distribution. Because the distribution is symmetric, you can multiply this value by $2$ to get the two-sided test output.
(*Hint: As before, be careful about whether you use `lower.tail = TRUE` or `lower.tail = FALSE`! One safe option is to use `lower.tail = FALSE` with the absolute value (`abs()`) of your test statistic. Make sure you never end up with a p-value greater than 1!*)
To prove it works, Use the code below to read in data from a survey of 55 Duke University students about their study habits and grades. You can read more about this data at https://www.openintro.org/data/index.php?data=gpa.
```{r}
grades_data <- read.csv("https://www.openintro.org/data/csv/gpa.csv")
```
Use this data to regress `gpa` upon `studyweek` using both `my_lm()` and `lm()`. The results of `my_lm()` should match the coefficient table from the `summary` of your `lm()` output.