--- title: "Lab 6: Simulations" author: "INSERT YOUR NAME HERE (INSERT YOUR UW NETID HERE)" date: "Due by 23:59pm on Feb 29, 2024" output: pdf_document: default html_document: default --- ### Total Points: 45 ## Part 1. Random Number Generation and ECDFs (4+5+5 pts) 1. Generate the following objects, save them to variables (with names of your choosing), and call `head()` on those variables. * A vector with 60 draws from $\mathrm{Beta}(0.1, 0.1)$. * A vector of 200 characters sampled with replacement uniformly from "A", "G", "C", and "T". * A data frame with a column `x` that contains 100 draws from $\mathrm{Unif}(0, 3)$, and a column `y` that contains 100 draws of the form $y_i \sim \mathrm{Unif}(0, x_i)$, where $x_i$ is the i-th element of column `x`. Do this without using explicit iteration. ```{r} set.seed(123) # Don't change this line # Your code goes here ``` 2. You are given a function called `plotCumMeans()` to plot the cumulative sample mean as the sample size increases. - The first argument `rfun` stands for a function which takes one argument `n` and generates this many random numbers when called as `rfun(n)`. - The second argument `n.max` is an integer which tells the number samples to draw. As a side effect, the function plots the cumulative mean against the number of samples. ```{r} # plotCumMeans: plot cumulative sample mean as a function of sample size # Inputs: # - rfun: function which generates random draws # - n.max: number of samples to draw # Ouptut: none plotCumMeans = function(rfun, n.max) { samples = rfun(n.max) plot(1:n.max, cumsum(samples) / 1:n.max, type = "l", xlab = "Cumulative sample size", ylab = "Sample mean") } ``` Use this function to make plots for the following distributions, with `n.max = 20000`. Then answer: do the sample means start concentrating around the appropriate value as the sample size increases? * $N(-3, 10)$. * $\mathrm{Exp}(\mathrm{mean}=5)$. * $\mathrm{Cauchy}(\mathrm{location}=-1, \mathrm{scale}=2)$. Hint: for each case, you should construct a new anonymous function to pass as the `rfun` argument to `plotCumMeans()`. For instance, you should pass `function(n) rnorm(n, mean = -3, sd = sqrt(10))` as the `rfun` argument to `plotCumMeans()` for the first case. ```{r 1-2} set.seed(123) # Don't change this line # Your code goes here ``` 3. For the same distributions in Part 1-2 above, we will do the following. * Generate 10, 100, 1000 random samples from each of the following distributions. - $N(-3, 10)$, where 10 is the variance. - $\mathrm{Exp}(\mathrm{mean}=5)$. * Generate 10, 50, 100 random samples from the following distribution. - $\mathrm{Cauchy}(\mathrm{location}=-1, \mathrm{scale}=2)$. * On a single plot, display the ECDFs (empirical cumulative distribution functions) from each set of samples, and the true CDF, with each curve being displayed in a different color. In order to do this, we will write a function `plotECDF(rfun, pfun, sizes)` which takes as its arguments the single-argument random number generating function `rfun`, the corresponding single-argument cumulative distribution function `pfun`, and a vector of sample sizes `sizes` for which to plot the ecdf. We've already started to define `plotECDF()` below, but we've left it incomplete. Fill in the definition by editing the lines with "##" and "??", and then run it on the same distributions as in Part 1-2. * Examine the plots and discuss how the ECDFs converge as the sample size increases. **Note: make sure to remove `eval=FALSE`, after you've edited the function, to see the results.** ```{r eval=FALSE} # plotECDF: plots ECDFs along with the true CDF, for varying sample sizes # Inputs: # - rfun: function which generates n random draws, when called as rfun(n) # - pfun: function which calculates the true CDF at x, when called as pfun(x) # - sizes: a vector of sample sizes # Output: none plotECDF = function(rfun, pfun, sizes) { # Draw the random numbers ## samples = lapply(sizes, ??) # Calculate the grid for the CDF grid.min = min(sapply(samples, min)) grid.max = max(sapply(samples, max)) grid = seq(grid.min, grid.max, length=1000) # Calculate the ECDFs ## ecdfs = lapply(samples, ??) evals = lapply(ecdfs, function(f) f(grid)) # Plot the true CDF ## plot(grid, ??, type="l", col="black", xlab="x", ylab = "P(X <= x)", lwd=3) # Plot the ECDFs on top n.sizes = length(sizes) cols = rainbow(n.sizes) for (i in 1:n.sizes) { lines(grid, evals[[i]], col=cols[i], lwd=1.5) } legend("topleft", legend=sizes, col=cols, lwd=3) } ``` ```{r} set.seed(123) # Don't change this line # Your code for plotting goes here ``` ## Part 2. Drug Effect Simulation (2+2+3+4+2+3 pts) We continue studying the drug effect model that was discussed in Lecture 6 Slides. Recall that we believe those who are not given the drug experience a reduction in tumor size of percentage as: $$X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1),$$ whereas those who were given the drug experience a reduction in tumor size of percentage as: $$X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2).$$ 1. Look the code chunk in the lecture that generated data according to the above model. Write a function around this code called `simulateData()` that takes two arguments: * `n`: the sample size (number of subjects in each group) with a default value of 60; * `mu_drug`: the mean for the exponential distribution that defines the drug tumor reduction measurements with a default value of 2. Your function should return a list with two vectors called `no_drug` and `drug`. Each of these two vectors should have length `n`, containing the percentage reduction in tumor size under the appropriate condition (not taking the drug or taking the drug). ```{r} # Your code goes here ``` 2. Run your function `simulateData()` without any arguments (hence, relying on the default values of `n` and `mu_drug`), and store the output in `results1`. Print out the first 6 values in both the `results1$no_drug` and `results1$drug` vectors. Now, run `simulateData()` again, and store its output in `results2`. Again, print out the first 6 values in both the `results2$no_drug` and `results2$drug` vectors. We have effectively simulated two hypothetical datasets. Note that we shouldn't expect the values from `results1` and `results2` to be the same. ```{r} set.seed(123) # Don't change this line # Your code goes here ``` 3. Compute the following three numbers: the absolute difference in the mean values of `no_drug` between `results1` and `results2`, the absolute difference in the mean values of `drug` between `results1` and `results2`, and the absolute difference in mean values of `no_drug` and `drug` in `results1`. Answer in words: Of these three numbers, which one is the largest, and does this make sense? ```{r} # Your code goes here ``` 4. Now, we want to visualize the simulated data. Fortunately, the code to visualize the data is already provided for you in our Lecture 6 slides. Write a function around this code, called `plotData()` that takes just one argument `data`, which is a list with components `drug` and `no_drug`. To be clear, this function should create a single plot, with two overlaid histograms, one for`data$no_drug` (in gray) and one for `data$drug` (in red), with the same 20 bins. It should also overlay a density curve for each histogram in the appropriate colors, and produce a legend. Once written, call `plotData()` on both `results1` and `results2`. ```{r} # Your code goes here ``` 5. Generate a new simulated dataset using `simulateData()` where `n=1000` and `mu_drug=1.1`, and plot the results using `plotData()`. In one or two sentences, explain the differences that you see between this plot and the two you produced in the last problem. ```{r} # Your code goes here ``` 6. Combine the `simulateData()` function, where `n=2000` and `mu_drug=1.6`, with the `replicate()` function that we learned in Lecture 6 slides to generate 3000 simulated datasets and compute the absolute difference in mean values of `no_drug` and `drug` for each of the simulated dataset. Output both the mean and variance of all these absolute differences. ```{r} set.seed(123) # Don't change this line # Your code goes here ``` ## Part 3. Bootstrap (3+7+5 pts) In this problem, you will use the nonparametric/empirical and parametric bootstrap to estimate the mean of an exponential distribution $\text{Exponential}(\lambda)$ with density $$f(x)=\lambda \exp(-\lambda x) \quad \text{ for } \quad x\geq 0.$$ Suppose that you are given some data `X_dat` from an exponential distribution with rate $\lambda=\frac{1}{3}$ as follows. You may need to read about the relation between the rate $\lambda$ of the exponential distribution $\text{Exponential}(\lambda)$ and its mean by typing `?rexp`. ```{r} # Don't change this chunk set.seed(123) X_dat = rexp(100, rate = 1/3) ``` 1. Compute and output the empirical mean $T=\bar{X}_n$ and its standard error of `X_dat`. In addition, plot the histogram of `X_dat` with `freq=FALSE`. ```{r} # Your code goes here ``` 2. Use the nonparametric bootstrap with the re-sampling times as $B=3000$ to estimate and output the standard error for $T$. In addition, output the normal, pivotal, and percentile confidence intervals with the nominal level **90%**. - Answer in words: Do all the confidence intervals cover the empirical mean $T=\bar{X}_n$ of `X_dat`? - Plot the histogram of these $B=3000$ bootstrap statistics with `freq=FALSE`. Describe in words about its shape. ```{r} set.seed(123) # Don't change this line # Your code goes here ``` 3. Assume that you know the data `X_dat` come from an exponential distribution $\text{Exponential}(\lambda)$, but its rate $\lambda$ is unknown. Estimate the rate $\lambda$ by the empirical mean $T=\bar{X}_n$. Then, use the parametric bootstrap with the re-sampling times as $B=3000$ to estimate and output the standard error for $T$. In addition, output the normal, pivotal, and percentile confidence intervals with the nominal level **90%**. ```{r} set.seed(123) # Don't change this line # Your code goes here ``` ## Part 4. Validity of the Bootstrap Confidence Intervals (Extra Credit: 5 pts) Note: The extra credit in the problem can only be applied to any of your lost points in this Lab 6 assignment. The total points that you can earn for Lab 6 are still capped at 45. We again use the exponential distribution with rate $\lambda=\frac{1}{3}$ to check the validity of normal, pivotal, and percentile confidence intervals based on the nonparametric bootstrap. Taking the normal confidence interval as an example, we do the following procedure. 1. Generate `X_dat` from an exponential distribution with $n=100$ and rate $\lambda=\frac{1}{3}$ and compute the mean statistic $T=\bar{X}_n$ of `X_dat`. 2. Use the nonparametric bootstrap with the re-sampling times as $B=3000$ to construct the normal confidence interval with nominal level 90%. 3. Repeat Steps 1 and 2 for $N=600$ times and compute the proportion of normal confidence intervals covering the true mean $\frac{1}{\lambda}=3$. Is this proportion close to 90%? (Hint: You may need two nested `replicate()` functions or use a `replicate()` function within the `for` loop.) Do the same procedure for the pivotal and percentile confidence intervals with nominal level 90%. Which type of the confidence interval is closest to 90%? ```{r} set.seed(123) # Don't change this line # Your code goes here ```