---
title: "Worksheet for Lab 8"
author: "NAME"
date: "`r Sys.Date()`"
output: html_document
---

## Introduction

With this worksheet, you can work on and run the chunks of code you need for different exercises.  Chunks of code appear between the three backquote marks and have a different color, like this:

```{r chunk_name}
2 + 3
```

## Preliminaries

Run this chunk first!  This chunk of code loads the packages and datasets you need for this activity.

```{r setup}
library(tidyverse)
library(infer)

lineup <- read_csv("https://raw.githubusercontent.com/gregcox7/StatLabs/main/data/lineup.csv")
fairness <- read_csv("https://raw.githubusercontent.com/gregcox7/StatLabs/main/data/lineup_fairness.csv")
```

Within the braces at the beginning of each chunk, the letter `r` says that the code is written in the "R" language and the text after the little `r` is a helpful label for the chunk.

To run a chunk of code, press the green arrow button in the upper right of the chunk.  The result of running the chunk will appear immediately below it.  You can run a chunk more than once.  If you need to change anything or get an error, you can always edit your chunk and try again.

## Exercise 8.1

Our first task in analyzing these data is to estimate the proportion of suspect ID's from these lineups.  For the moment, we will ignore whether the lineup was presented simultaneously or sequentially.  Our aim here is to build a 95% confidence interval that describes how often, on average, witnesses identify the suspect from the lineup.

a. Use the following chunk of code to calculate and store the proportion of suspect ID's in the lineup data.  **Hint:** look back at how we wrote the `specify` and `calculate` lines for Kobe's Field Goal Percentage, and keep in mind that we are focused on the proportion of *suspect* ID's, not the proportion of *filler* ID's.

```{r susp_id_prop}
lineup %>%
    specify(___) %>%
    calculate(___)
```

What is the observed proportion of suspect ID's?

b. Use the following chunk of code to use bootstrapping to build a sampling distribution for the proportion of suspect ID's.  We will use this distribution to find and visualize a 95% confidence interval.  Be sure to generate at least 1000 simulated datasets.

```{r susp_id_ci}
boot_prop_id_dist <- lineup %>%
    specify(___) %>%
    generate(___) %>%
    calculate(___)

boot_prop_id_ci <- boot_prop_id_dist %>%
    get_confidence_interval(level = 0.95)

boot_prop_id_dist %>%
    visualize() +
    shade_confidence_interval(boot_prop_id_ci)
```

Report and interpret the 95% confidence interval you found (stored under `boot_prop_id_ci` in the Environment pane).  What do you think this interval tells us about the accuracy of eyewitnesses in general?

## Exercise 8.2

Our research question is, "is the proportion of suspect ID's greater than would be expected if eyewitnesses were picking randomly?"

a. If eyewitnesses were picking randomly, what proportion of the time would we expect them to pick the suspect from the lineup?  (Recall that each lineup has six photographs.)
b. What are the null and alternative hypotheses corresponding to our research question?
c. Use the chunk of code below to help you use the *normal distribution* to model the sampling distribution assuming the null hypothesis is true.  For the first blank, this should be your answer to part [a].  For the second blank, see how many rows the `lineup` dataset has---this is our sample size.  For the third blank, this should be your answer to part [a] of the previous exercise.  The final result is a "Z score".

```{r prop_z_score}
null_prop_id <- ___
sample_size <- ___
obs_prop_id <- ___

null_se <- sqrt(null_prop_id * (1 - null_prop_id) / sample_size)

z_score <- (obs_prop_id - null_prop_id) / null_se
```

What is the Z score you found?  In your own words, describe what this Z score tells us about how "extreme" our observed proportion is relative to what we would have expected if the null hypothesis were true.
d. Run the following chunk to find the p value (note that you will need to have completed part [c] for this to work!):

```{r prop_p_val}
pnorm(z_score, lower.tail = FALSE)
```

Based on the p value, would you reject the null hypothesis?  What does this say about whether eyewitnesses might be picking the suspect just at random?

## Exercise 8.3

In this exercise, our research question is, "is there a difference in the proportion of suspect ID's between sequential and simultaneous lineups?"  We will address this question by using random permutation to conduct a hypothesis test.

a. What are the null and alternative hypotheses corresponding to our research question?
b. Use the following chunk of code to calculate the observed difference in the proportion of suspect ID's between sequential and simultaneous lineups:

```{r obs_pres_diff}
obs_diff <- lineup %>%
    specify(___) %>%
    calculate(___, order = c("Sequential", "Simultaneous"))
```

What is the observed difference in proportions (this will be stored under `obs_diff` after you run your code)?  Based on this result, which type of presentation might produce a higher proportion of suspect ID's?

c. Use the following chunk of code to use random permutation to simulate at least 1000 datasets assuming the null hypothesis is true, then visualize where our observed difference falls relative to that distribution.  Remember to put one of "greater", "less" or "two-sided" for the `direction` setting in the last line; the appropriate setting depends on the alternative hypothesis (i.e., your answer to part [a]).  Also, make sure you've completed part [b], since this code needs `obs_diff` to run!

```{r null_pres_diff}
null_diff_dist <- lineup %>%
    specify(___) %>%
    hypothesize(___) %>%
    generate(___) %>%
    calculate(___, order = c("Sequential", "Simultaneous"))

null_diff_dist %>%
    visualize() +
    shade_p_value(obs_stat = obs_diff, direction = ___)
```

d. Use the following chunk to find the p value for this hypothesis test (make sure you have completed parts [b] and [c], since this code depends on both of them having been completed successfully):

```{r pres_p_val}
null_diff_dist %>%
    get_p_value(obs_stat = obs_diff, direction = ___)
```

What was the p value you got?  Based on that p value, would you reject the null hypothesis?  What does your conclusion tell us about whether simultaneous or sequential presentation differ in how much they produce suspect ID's?

## Exercise 8.4

Our research question is, "do mock witnesses make a higher proportion of suspect ID's for not-blind lineups as opposed to blind lineups?"  We will address this question by using random permutation to conduct a hypothesis test.

a. What are the null and alternative hypotheses corresponding to our research question?
b. Use the following chunk of code to calculate the observed difference in the proportion of suspect ID's between sequential and simultaneous lineups:

```{r obs_fair_diff}
obs_fairness_diff <- fairness %>%
    specify(___) %>%
    calculate(___, order = c("Not blind", "Blind"))
```

What is the observed difference in proportions (this will be stored under `obs_fairness_diff` after you run your code)?  Is this result more consistent with the null hypothesis or with the alternative hypothesis?

c. Use the following chunk of code to use random permutation to simulate at least 1000 datasets assuming the null hypothesis is true, then visualize where our observed difference falls relative to that distribution.  Remember to put one of "greater", "less" or "two-sided" for the `direction` setting in the last line; the appropriate setting depends on the alternative hypothesis (i.e., your answer to part [a]).  Also, make sure you've completed part [b], since this code needs `obs_fairness_diff` to run!

```{r null_fair_diff}
null_fairness_diff_dist <- fairness %>%
    specify(___) %>%
    hypothesize(___) %>%
    generate(___) %>%
    calculate(___, order = c("Not blind", "Blind"))

null_fairness_diff_dist %>%
    visualize() +
    shade_p_value(obs_stat = obs_fairness_diff, direction = ___)
```

d. Use the following chunk to find the p value for this hypothesis test (make sure you have completed parts [b] and [c], since this code depends on both of them having been completed successfully):

```{r fair_p_val}
null_fairness_diff_dist %>%
    get_p_value(obs_stat = obs_fairness_diff, direction = ___)
```

What was the p value you got?  Based on that p value, would you reject the null hypothesis?  What does your conclusion tell us about whether non-blind police lineups are "fair" in the sense that they do not bias a witness toward a particular suspect?