---
title: "Problem set 2"
date: ""
format:
  html:
    toc: true
    number-sections: false
---
[Download source](https://raw.githubusercontent.com/ywanglab/STAT3000/refs/heads/main/psets/hw2_medium_hint.qmd)

For these exercises, do **not** load any packages other than **dslabs**.

Make sure to use **vectorization** whenever possible (avoid loops unless explicitly allowed).
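
As a rough illustration of what vectorization means here, the sketch below computes the same result with a loop and with a single vectorized call. The vector `x` and its values are made up for this example and are not part of any exercise.

```r
# Toy vector (hypothetical values, not from the problem set)
x <- c(1, 4, 9, 16)

# Loop version: fills the result one element at a time
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# Vectorized version: sqrt() operates on the whole vector at once
roots_vec <- sqrt(x)

# Both approaches give the same values
identical(roots_loop, roots_vec)
```

The vectorized form is shorter and is the style expected in the exercises below.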

```{r}
# Load only the allowed package
library(dslabs)
```

## 1

What is the sum of the first 150 positive integers? Use the functions `seq` and `sum` to compute it in R, writing the code so that it works for any `n`.

```{r}
# Step 1: define n
n <- 150

# Step 2: create the sequence 1, 2, ..., n
# x <- 

# Step 3: compute the sum of x
# ans <- 

# Step 4: print ans
# ans
```

## 2

Load the `murders` dataset from **dslabs**. Use the function `str` to examine the structure of the `murders` object.

* What are the column names used by the data frame for these five variables: state name, abbreviation, region, population, total murders?
* Show the subset of `murders` containing the states with a murder rate of **less than 1.2 per 100,000** people.
* Show **all** variables for these states.

```{r}
# Examine structure
str(murders)

# Step: print the column names
# names(murders)
```

```{r}
# Step 1: compute murder rate per 100,000 into a vector called rate
# rate <- 

# Step 2: create a logical vector called idx for states with rate < 1.2
# idx <- 

# Step 3: subset murders using idx
# murders[idx, ]
```
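
The pattern in the steps above (compute a rate, build a logical vector, subset the rows) can be sketched on a small made-up data frame. The data frame `toy`, its columns, and the threshold of 3 are assumptions chosen only for illustration.

```r
# Hypothetical toy data, not the murders dataset
toy <- data.frame(
  name  = c("a", "b", "c", "d"),
  count = c(5, 20, 1, 8),
  size  = c(100, 200, 50, 400)
)

# Rate per 100 units, computed for every row at once
toy_rate <- toy$count / toy$size * 100

# Logical vector: TRUE where the rate is below 3
toy_idx <- toy_rate < 3

# Keep the rows where toy_idx is TRUE; the empty column index keeps all columns
toy[toy_idx, ]
```

Leaving the column position empty inside `[ , ]` is what keeps every variable in the result.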

## 3

Show the subset of `murders` containing the states that have a murder rate of **less than 1.2 per 100,000** people and are in the **Northeast** region of the US. Do **not** show the `region` variable.

```{r}
# Step 1: compute rate (or reuse from Q2)
# rate <- 

# Step 2: create logical vectors low and ne
# low <- 
# ne <- 

# Step 3: combine conditions into keep
# keep <- 

# Step 4: subset and remove region column
# out <- 
# out_no_region <- 

# out_no_region
```
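
One possible way to combine two conditions and drop a column, shown on made-up data. The data frame `toy2` and its columns are hypothetical; the same idea can be adapted to the exercise.

```r
# Hypothetical toy data, not the murders dataset
toy2 <- data.frame(
  name  = c("a", "b", "c", "d"),
  group = c("x", "y", "x", "y"),
  value = c(1, 2, 3, 4)
)

# & combines two logical vectors element by element
keep <- toy2$value > 1 & toy2$group == "x"

# Select rows with keep, and every column except "group"
toy2_out <- toy2[keep, names(toy2) != "group"]
toy2_out
```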

## 4

Among states with a murder rate of less than 1.2 per 100,000, show the state with the **smallest population** (show the state name, population, and rate).

```{r}
# Step 1: compute rate
# rate <- 

# Step 2: restrict to low-rate states
# low <- 

# Step 3: find the row index of the smallest population among those
# Hint: use which(low) and which.min(...)
# i <- 

# Step 4: report as a small data frame with columns state, population, rate
# data.frame(...)
```
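
The hint about `which(low)` and `which.min(...)` is easy to get wrong, because `which.min` returns a position within the filtered values rather than within the original vector. A minimal sketch with made-up vectors `score` and `eligible`:

```r
# Hypothetical toy vectors
score    <- c(7, 2, 9, 4, 1)
eligible <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Positions of the eligible entries in the full vector: 1, 3, 4
pos <- which(eligible)

# which.min() looks only at the eligible scores (7, 9, 4) and returns
# a position within that subset; pos[...] maps it back to the full vector
i <- pos[which.min(score[eligible])]
i
```

Here `i` indexes the full vector, so it can be used directly to pull matching entries from other vectors of the same length.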

## 5

Among states with a population of **more than 8 million**, show the one with the **lowest** murder rate (show the state name, population, and rate).

```{r}
# Step 1: compute rate
# rate <- 

# Step 2: create logical vector big for population > 8 million
# big <- 

# Step 3: find the row index i of the smallest rate among big-pop states
# i <- 

# Step 4: report state, population, rate
# data.frame(...)
```

## 6

Compute the murder rate for each **region** of the US (total murders divided by total population times 100,000). Return a data frame with one row per region and columns `region` and `rate`.

```{r}
# Step 1: compute total murders by region
# murders_by_region <- tapply( , , sum)

# Step 2: compute total population by region
# pop_by_region <- tapply( , , sum)

# Step 3: compute rates
# rate_by_region <- 

# Step 4: make a data frame with region names and rates
# region_rates <- data.frame(...)
# region_rates
```
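
For readers unfamiliar with `tapply`, here is a small sketch of grouped sums followed by a ratio of those sums. The vectors `sales`, `units`, and `shop` are invented for this example.

```r
# Hypothetical toy data
sales <- c(10, 20, 30, 40, 50)
units <- c(1, 1, 2, 2, 3)
shop  <- c("north", "south", "north", "south", "north")

# tapply(values, groups, function) applies the function within each group
sales_by_shop <- tapply(sales, shop, sum)
units_by_shop <- tapply(units, shop, sum)

# Ratio of the group totals (element-wise; names are preserved)
ratio <- sales_by_shop / units_by_shop

# Collect the group names and ratios in a data frame
shop_ratio <- data.frame(shop = names(ratio), ratio = as.numeric(ratio))
shop_ratio
```

Note that this is a ratio of totals, which in general differs from the average of the individual ratios.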

## 7

Create a vector of numbers that starts at 5, does not pass 60, and adds numbers in increments of 3/8.

How many numbers does the vector have?

```{r}
# Step 1: define start, end, step
start <- 5
end <- 60
step <- 3/8

# Step 2: create the sequence v
# v <- 

# Step 3: show the first 6 values
# head(v)

# Step 4: show the length
# length(v)
```
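
A brief sketch of how `seq` behaves when the upper bound is not reached exactly, using different made-up numbers from the exercise:

```r
# Start at 2, step by 2.5, and never go past 11
w <- seq(2, 11, by = 2.5)
w

# Number of elements in the sequence
length(w)
```

The sequence stops at the last value that does not exceed the upper bound, so the bound itself need not appear in the result.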

## 8

Make this data frame:

```{r}
temp_f <- c(72, 95, 41, 86, 78, 33)
city <- c("Chicago", "Lagos", "Oslo", "Rio de Janeiro", 
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature_f = temp_f)
city_temps
```

Add a new column called `temperature_c` containing the temperatures in Celsius. Keep all existing columns.

```{r}
# Step 1: compute Celsius using (F - 32) * 5/9
# city_temps$temperature_c <- 

# Step 2: print city_temps
# city_temps
```

## 9

Write a function `euler2` that computes:

$$
S_n = 1 + \frac{1}{2^2} + \frac{1}{3^2} + \dots + \frac{1}{n^2}.
$$

Test your function at `n = 10` and `n = 100`.

```{r}
# Step 1: write the function
euler2 <- function(n) {
  # Step: create k <- 1:n
  # Step: compute terms <- 1/(k^2)
  # Step: return sum(terms)
}

# Step 2: test
# euler2(10)
# euler2(100)
```

## 10

Plot $S_n$ versus $n$ for $n = 1,2,\dots,2000$ with a horizontal **dashed** line at $\pi^2/6$.

```{r}
# Step 1: create n_vals <- 1:2000
# n_vals <- 

# Step 2: compute S for each n using sapply
# S <- 

# Step 3: plot
# plot(n_vals, S, type="l", xlab="n", ylab="S_n")

# Step 4: add dashed line at pi^2/6
# abline(h = , lty = 2)
```
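
As a hedged illustration of Step 2, the sketch below uses `sapply` to evaluate a scalar function at each element of a vector. The function `geom_partial` and the vector `m_vals` are hypothetical and use a different series from the one in the exercise.

```r
# Hypothetical helper: partial sum 1/2 + 1/4 + ... + 1/2^m
geom_partial <- function(m) {
  sum(1 / 2^(1:m))
}

# Evaluate the function at m = 1, 2, ..., 5
m_vals <- 1:5
partial_sums <- sapply(m_vals, geom_partial)
partial_sums
```

`sapply` calls the function once per element and collects the results into a numeric vector of the same length as `m_vals`.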

## 11

Use `%in%` and `state.abb` to create a logical vector indicating which of AL, AK, AZ, AR, AA are actual state abbreviations.

```{r}
test <- c("AL", "AK", "AZ", "AR", "AA")

# Step 1: create is_real
# is_real <- 

# Step 2: print is_real
# is_real
```

## 12

Report the one entry that is **not** an actual abbreviation.

```{r}
test <- c("AL", "AK", "AZ", "AR", "AA")

# Step 1: create not_real using !
# not_real <- 

# Step 2: find index with which()
# idx <- 

# Step 3: print the entry
# test[idx]
```

## 13

Using `%in%`, show all variables for Florida, California, and New York, **in that order**.

```{r}
targets <- c("Florida", "California", "New York")

# Step 1: subset murders to those states
# sub <- 

# Step 2: reorder rows using match()
# sub_ordered <- 

# Step 3: print sub_ordered
# sub_ordered
```
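
The difference between `%in%` and `match()` matters for ordering. A minimal sketch on made-up vectors `fruits` and `wanted`:

```r
# Hypothetical toy vectors
fruits <- c("apple", "banana", "cherry", "date")
wanted <- c("date", "apple")

# %in% answers "is each fruit among the wanted ones?"
# Subsetting with it keeps the original order of fruits
in_order_of_fruits <- fruits[fruits %in% wanted]
in_order_of_fruits

# match() returns, for each wanted value, its position in fruits,
# so subsetting with it follows the order of wanted
in_order_of_wanted <- fruits[match(wanted, fruits)]
in_order_of_wanted
```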

## 14

Write a function `vander_helper(x, n)` that returns $(1, x, x^2, \dots, x^n)$. Show results for `x=2`, `n=6`.

Restrictions: no loop.

```{r}
vander_helper <- function(x, n) {
  # Step: create exponents 0:n
  # Step: return x^(0:n)
}

# Test:
# vander_helper(2, 6)
```

## 15

Create a vector using:

```{r}
n <- 20000
p <- 0.35
set.seed(2025-9-18)
x <- sample(c(0,1), n, prob = c(1 - p, p), replace = TRUE)
```

Compute the length of each **stretch of consecutive 1s** (run lengths of 1s) and plot the distribution.

* Do **not** use a loop.
* Hint: use `rle(x)`.

Then compare empirical proportions to the geometric prediction for run lengths 1 through 8.

```{r}
# Step 1: compute r <- rle(x)
# r <- 

# Step 2: extract ones_lengths (lengths where values == 1)
# ones_lengths <- 

# Step 3: plot distribution (hist or barplot)
# hist(ones_lengths, breaks = 30)

# Step 4: empirical proportions for k=1:8
# tab <- table(ones_lengths)
# emp_counts <- as.numeric(tab[as.character(1:8)])
# emp_counts[is.na(emp_counts)] <- 0
# emp_probs <- emp_counts / length(ones_lengths)

# Step 5: theoretical probabilities (1-p)*p^(k-1)
# k <- 1:8
# theory_probs <- 

# Step 6: make a comparison data frame
# comparison <- data.frame(run_length = k, empirical_prob = emp_probs, theory_prob = theory_probs)
# comparison
```
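
To see what `rle` returns before applying it to the long vector, it may help to try it on a short made-up vector `z`:

```r
# Hypothetical short 0/1 vector
z <- c(1, 1, 0, 1, 0, 0, 1, 1, 1)

# rle() splits z into runs of identical consecutive values
r_toy <- rle(z)

# Length of each run, in order of appearance
r_toy$lengths

# The value that each run repeats
r_toy$values
```

The two components line up position by position, so a logical condition on `values` can be used to pick out entries of `lengths`.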

## 16

In the `murders` dataset:

1. Compute the national average murder rate.
2. Create labels using `ifelse`:

   * `"High Crime, High Pop"` if rate > national average and pop > 6 million
   * `"High Crime, Low Pop"` if rate > national average and pop ≤ 6 million
   * `"Lower Crime"` otherwise

Then show a `table()` of the labels.

```{r}
# Step 1: state-level rate
# rate <- 

# Step 2: national average rate
# national_rate <- 

# Step 3: logical vectors high_crime and high_pop
# high_crime <- 
# high_pop <- 

# Step 4: labels using nested ifelse
# labels <- 

# Step 5: table(labels)
# table(labels)
```
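
A nested `ifelse` handles three categories by placing a second `ifelse` in the "no" branch of the first. The sketch below uses made-up vectors `temp` and `humid` and invented labels:

```r
# Hypothetical toy vectors
temp  <- c(35, 18, 5, 22)
humid <- c(TRUE, FALSE, TRUE, FALSE)

# First test: warm and humid; otherwise test warm only; otherwise "cool"
label <- ifelse(temp > 20 & humid, "warm, humid",
                ifelse(temp > 20, "warm, dry", "cool"))
label

# Count how many entries fall under each label
table(label)
```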

## 17

What is the murder rate of the state that ranks **12th** in terms of murder rate (from highest to lowest)?

Show your work using `order` (and optionally check with `sort` or `rank`).

```{r}
# Step 1: rate vector
# rate <- 

# Step 2: ord <- order(rate, decreasing = TRUE)
# ord <- 

# Step 3: i <- ord[12]
# i <- 

# Step 4: report state and rate
# data.frame(state = murders$state[i], rate = rate[i])
```
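
A short sketch of how `order` can be used to find the entry at a given rank, with made-up vectors `h` and `nm`:

```r
# Hypothetical toy vectors
h  <- c(170, 182, 165, 190)
nm <- c("Ann", "Bo", "Cy", "Di")

# order() returns the indices that would sort h from largest to smallest
ord_toy <- order(h, decreasing = TRUE)
ord_toy

# Index of the second-largest value, then the matching name
j <- ord_toy[2]
nm[j]

# Optional check: sort() returns the values themselves
sort(h, decreasing = TRUE)[2]
```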

## 18

Write a function `compute_harmonic_mean` that returns the harmonic mean of a numeric vector, but returns `NA` if any values are zero or negative. Test on `c(1,2,4,8)` and show it is about `2.133333`.

```{r}
compute_harmonic_mean <- function(x) {
  # Step 1: if any x <= 0, return NA
  # Step 2: compute n <- length(x)
  # Step 3: return n / sum(1/x)
}

# Test:
# compute_harmonic_mean(c(1, 2, 4, 8))
```

## 19

Create a function `safe_divide(x, y)` that returns `x/y` but returns `"Cannot divide by zero"` when `y` is zero. Make it work element-wise on vectors (vectorized). Test it on:

```r
x <- c(10, 20, 30)
y <- c(2, 0, 5)
```

```{r}
safe_divide <- function(x, y) {
  # Step 1: compute out <- x/y
  # Step 2: convert to character so you can store the message
  # Step 3: replace entries where y == 0
}

# Test:
# x <- c(10, 20, 30)
# y <- c(2, 0, 5)
# safe_divide(x, y)
```
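
Step 2 is needed because an atomic vector in R holds a single type, so a text message can only be stored once the numbers are converted to character. A minimal sketch on a made-up vector `v`:

```r
# Hypothetical numeric vector
v <- c(1.5, 2, 3)

# Convert to character so that text can be stored alongside the numbers
v_chr <- as.character(v)

# Replace one entry with a message
v_chr[2] <- "missing"
v_chr
```

The result is a character vector, so arithmetic on it is no longer possible without converting back.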

## 20

Write a function `classify_state_safety(state_name)` that returns:

* `"Very Safe"` if rate < 1
* `"Safe"` if 1 ≤ rate < 3
* `"Moderate"` if 3 ≤ rate < 5
* `"High Risk"` if rate ≥ 5
* `"State not found"` if the state is not in the dataset

Test on `"Vermont"`, `"Texas"`, `"California"`, `"NotAState"`.

Then use `sapply` to classify all states and use `table()` to count how many fall into each category.

```{r}
# Step 1: compute a named vector of rates
# rate <- murders$total / murders$population * 100000
# rate_named <- setNames(rate, murders$state)

classify_state_safety <- function(state_name) {
  # Step 2: check if state_name is in names(rate_named)
  # Step 3: pull out r <- rate_named[state_name]
  # Step 4: return the correct label using if/else
}

# Tests:
# classify_state_safety("Vermont")
# classify_state_safety("Texas")
# classify_state_safety("California")
# classify_state_safety("NotAState")
```

```{r}
# Step 5: classify all states with sapply
# cats <- sapply(murders$state, classify_state_safety)

# Step 6: count categories
# table(cats)
```
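
The lookup-then-classify pattern in the steps above can be sketched on unrelated made-up data. The named vector `price`, the function `tier`, and the cutoffs are all hypothetical.

```r
# Hypothetical named vector: names act as lookup keys
price <- setNames(c(2, 15, 40), c("pen", "book", "lamp"))

tier <- function(item) {
  # Handle unknown keys first
  if (!(item %in% names(price))) {
    return("unknown item")
  }
  # [[ ]] extracts the single value stored under that name
  p <- price[[item]]
  # Conditions are checked in order, so each branch implies the earlier ones failed
  if (p < 10) {
    "cheap"
  } else if (p < 30) {
    "mid"
  } else {
    "pricey"
  }
}

# Apply the scalar function to several keys at once
tiers <- sapply(c("pen", "lamp", "sofa"), tier)
tiers
```

Because `if`/`else` works on a single value, `sapply` is what extends the function to a whole vector of names.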

## Convert to a PDF file

In a Linux terminal, run the following command, installing any missing packages as they are reported:

```
# change path_to_hw2.qmd to something like hw/hw2.qmd if the current directory is the parent directory of hw/

quarto render path_to_hw2.qmd --to pdf
```