---
title: "Lab 2: Data Types"
author: "INSERT YOUR NAME HERE (INSERT YOUR UW NETID HERE)"
date: "Due by 23:59pm on Jan 23, 2024"
output:
  pdf_document: default
  html_document: default
---

### Total Points: 40

## Part 1. Review Questions (2+5+3 pts)

1. Multiply the inverse of a matrix $\begin{bmatrix}
3 & 2 & 1\\
4 & 8 & 1\\
5 & 9 & 16
\end{bmatrix}$ with itself. 

    Also, return those entries that are bigger than $10^{-9}$.

```{r}
# Your code here
```

2. Make a list `lst1` with components
  - `1:15` under the name `num_vec`;
  - `matrix(15:1, ncol = 3)` under the name `mat`;
  - `rep(c("a", "x"), each = 3)` under the name `char_vec`;
  - `list(x = c(1,2), y="STAT 302")` under the name `sublst`.

Answer the following questions using R:

  - Compute the sum of the component `num_vec`;
  - What is the element in position [2,3] in the component `mat`?
  - What is the third element in the component `char_vec`?
  - Use the function `strsplit()` with argument `split = ""` to split the subcomponent `y` in the component `sublst`. What is the data type for the result  `strsplit()`?
  - Subset the result after `strsplit()` via `[[1]]`. What is the fifth element of this character vector?
  
```{r}
# Your code here
```


3. Download the `family.txt` shown in Lecture 2 to your laptop. Then, read the file into R using the function `read.delim()`. Then, compute the following statistics in R:
  - What is the standard deviation of ages in `family.txt`?
  - What is the percentage of males in `family.txt`?
  - What is the maximum BMI within all the female individuals?
  
```{r}
# Your code here
```

## Part 2: Normal Distribution (2+5+3+4 pts)

R provides several functions for the normal/Gaussian distribution:

  - `dnorm()` computes the density function of a normal distribution;
  - `pnorm()` calculates the percentiles (or equivalently, the cumulative distribution function) of a normal distribution;
  - `qnorm()` returns the quantiles of a normal distribution;
  - `rnorm()` generates the normally distributed random variables.
  
Use R to answer the following questions:

1. Create and store a vector `norm_vec` with $100,000$ random variables from a Normal distribution with mean 6 and standard deviation 2. Print out the first 7 elements of `norm_vec` using the function `head()`.

```{r}
set.seed(123)  ## Don't change this line. It makes the result reproducible.
# Your code starts from here
```

2. Plot two histograms, one with the first 100 elements of `norm_vec`, and the other with all the elements of `norm_vec`. Set the argument `freq = FALSE` for both histograms for better comparisons. 
  - Change the x axis labels for both histograms to "Observations". 
  - Set their titles as "Histogram of N(6,2) distributed random sample with n=THE CORRECT NUMBER OF SAMPLE POINTS". Remember to change "THE CORRECT NUMBER OF SAMPLE POINTS".
  - Answer it by words: Which one looks more symmetric?

```{r}
# Your code starts from here
```

3. Standardize the vector `norm_vec` to $N(0,1)$ by subtracting its mean and then dividing it by its standard deviation. Name it as `norm_vec_std`. Compute the standard deviation of `norm_vec_std`. Also, what is the percentage of observations in `norm_vec_std` that are greater than 1.644854?

```{r}
# Your code here
```

4. Apply the function `pnorm()` (without specifying any other arguments) to the vector `norm_vec_std`. Then, compute its mean and variance after applying the function `pnorm()`. Finally, plot its histogram after applying the function `pnorm()` with the argument `freq = FALSE`.

  - Describe in words what do you see from the histogram. (Hint: How is the height of each bin compared with others?)

```{r}
# Your code here
```


## Part 3: Binomial Distribution (4pts per question)

The binomial distribution $\mathrm{Bin}(m,p)$ is defined by the number of successes in $m$ independent trials, each have probability $p$ of success. Think of flipping an (unfair) coin $m$ times, where the coin could be biased and has probability $p$ of landing on heads.

Similar to the above normal distribution, R also provides several functions for the binomial distribution:

  - `dbinom()` computes the probability mass function of a binomial distribution;
  - `pbinom()` calculates the percentiles (or equivalently, the cumulative distribution function) of a binomial distribution;
  - `qbinom()` returns the quantiles of a binomial distribution;
  - `rbinom()` generates the random variables from a binomial distribution.

1. Initialize a matrix `binom_mat` with 3 columns and 100 rows, whose entries are all `NA`. 

  - Then, fill in each column with random samples from binomial distributions with $m=300, p=0.25$ (first column), $m=300, p=0.5$ (second column), and $m=300, p=0.75$ (third column), respectively.
  - Compute the column means of `binom_mat`.

```{r}
set.seed(1234)  ## Don't change this line. It makes the result reproducible.
# Your code starts from here
```

2. Compute the means of every 10 elements in the first column of `binom_mat`. There should be 10 mean values in total. Then, output the median of these 10 mean values. Assign it to a variable `MoM`.
  - Compared with the mean of the first column of `binom_mat`, is `MoM` closer to the expected mean `75`? (Output a logical TRUE/FALSE using R!)

```{r}
# Your code here
```


3. Now, change the first element in the first column of `binom_mat` to -100. Then, repeat what we did in Question 2 (i.e., compute the means of every 10 elements in the first column of `binom_mat` and then calculate the median as `MoM2`.)
  - Now, compared with the mean of the first column of `binom_mat`, is `MoM2` closer to the expected mean `m*p = 75`? (Output a logical TRUE/FALSE using R!)

```{r}
# Your code here
```

4. Create a list `binom_lst` with 3 components:

  - A vector with 500 elements from a $\mathrm{Bin}(300,0.75)$ and name it as `binom500`;
  - A vector with 1000 elements from a $\mathrm{Bin}(300,0.75)$ and name it as `binom1000`;
  - A vector with 20000 elements from a $\mathrm{Bin}(300,0.75)$ and name it as `binom20000`.
  
* Compute the mean of each component of `binom_lst`. Which one is closest to the expected mean `m*p = 225`? Can you explain why?

* Look at the documentation of the functions `qqnorm()` and `qqline()`. Make QQ-plots with diagonal lines for each component of `binom_lst`. Which QQ-plot is most aligned with the diagonal line? Can you explain why?
  
```{r}
set.seed(1234)  ## Don't change this line. It makes the result reproducible.
# Your code starts from here
```