---
title: "Problem set 2"
date: ""
format:
  html:
    toc: true
    number-sections: false
---

[Download source](https://raw.githubusercontent.com/ywanglab/STAT3000/refs/heads/main/psets/hw2_medium_hint.qmd)

For these exercises, do **not** load any packages other than **dslabs**.

Make sure to use **vectorization** whenever possible (avoid loops unless explicitly allowed).

```{r}
# Load only the allowed package
library(dslabs)
```

## 1

What is the sum of the first 150 positive integers? Use the functions `seq` and `sum` to compute the sum with R for any `n`.

```{r}
# Step 1: define n
n <- 150

# Step 2: create the sequence 1, 2, ..., n
# x <-

# Step 3: compute the sum of x
# ans <-

# Step 4: print ans
# ans
```

## 2

Load the `murders` dataset from **dslabs**. Use the function `str` to examine the structure of the `murders` object.

* What are the column names used by the data frame for these five variables: state name, abbreviation, region, population, total murders?
* Show the subset of `murders` containing states with a murder rate of **less than 1.2 per 100,000**.
* Show **all** variables.

```{r}
# Examine structure
str(murders)

# Step: print the column names
# names(murders)
```

```{r}
# Step 1: compute murder rate per 100,000 into a vector called rate
# rate <-

# Step 2: create a logical vector called idx for states with rate < 1.2
# idx <-

# Step 3: subset murders using idx
# murders[idx, ]
```

## 3

Show the subset of `murders` containing states in the **Northeast** of the US with a murder rate of **less than 1.2 per 100,000**. Do **not** show the `region` variable.

```{r}
# Step 1: compute rate (or reuse from Q2)
# rate <-

# Step 2: create logical vectors low and ne
# low <-
# ne <-

# Step 3: combine conditions into keep
# keep <-

# Step 4: subset and remove region column
# out <-
# out_no_region <-
# out_no_region
```

## 4

Among states with a murder rate less than 1.2 per 100,000, show the **smallest population** state (show the state name, population, and rate).
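
The task above needs the position, in the full data frame, of a minimum taken over a subset. The sketch below shows one way `which()` and `which.min()` can be composed for that purpose. It uses made-up toy vectors (`score` and `keep` are hypothetical names, not columns of `murders`) and is an illustration of the indexing pattern only, not a solution.

```r
# Toy data (hypothetical; not taken from the murders dataset)
score <- c(8, 2, 9, 1, 4)
keep  <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# Positions in the full vector where keep is TRUE
candidates <- which(keep)

# which.min() returns a position *within the subset*, not the full vector
j <- which.min(score[candidates])

# Map the subset position back to a position in the full vector
i <- candidates[j]

score[i]  # smallest score among the kept entries
```

The key point is that `which.min(score[keep])` alone gives a position relative to the subset, so it has to be mapped back through `which(keep)` before it can index the original vector or data frame.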
```{r}
# Step 1: compute rate
# rate <-

# Step 2: restrict to low-rate states
# low <-

# Step 3: find the row index of the smallest population among those
# Hint: use which(low) and which.min(...)
# i <-

# Step 4: report as a small data frame with columns state, population, rate
# data.frame(...)
```

## 5

Show the state with a population of **more than 8 million** with the **lowest** murder rate (show the state name, population, and rate).

```{r}
# Step 1: compute rate
# rate <-

# Step 2: create logical vector big for population > 8 million
# big <-

# Step 3: find the row index i of the smallest rate among big-pop states
# i <-

# Step 4: report state, population, rate
# data.frame(...)
```

## 6

Compute the murder rate for each **region** of the US (total murders divided by total population times 100,000). Return a data frame with one row per region and columns `region` and `rate`.

```{r}
# Step 1: compute total murders by region
# murders_by_region <- tapply( , , sum)

# Step 2: compute total population by region
# pop_by_region <- tapply( , , sum)

# Step 3: compute rates
# rate_by_region <-

# Step 4: make a data frame with region names and rates
# region_rates <- data.frame(...)
# region_rates
```

## 7

Create a vector of numbers that starts at 5, does not pass 60, and adds numbers in increments of 3/8. How many numbers does the vector have?

```{r}
# Step 1: define start, end, step
start <- 5
end <- 60
step <- 3/8

# Step 2: create the sequence v
# v <-

# Step 3: show the first 6 values
# head(v)

# Step 4: show the length
# length(v)
```

## 8

Make this data frame:

```{r}
temp_f <- c(72, 95, 41, 86, 78, 33)
city <- c("Chicago", "Lagos", "Oslo", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature_f = temp_f)
city_temps
```

Add a new column called `temperature_c` containing the temperatures in Celsius. Keep all existing columns.
```{r}
# Step 1: compute Celsius using (F - 32) * 5/9
# city_temps$temperature_c <-

# Step 2: print city_temps
# city_temps
```

## 9

Write a function `euler2` that computes:

$$
S_n = 1 + \frac{1}{2^2} + \frac{1}{3^2} + \dots + \frac{1}{n^2}.
$$

Test your function at `n = 10` and `n = 100`.

```{r}
# Step 1: write the function
euler2 <- function(n) {
  # Step: create k <- 1:n
  # Step: compute terms <- 1/(k^2)
  # Step: return sum(terms)
}

# Step 2: test
# euler2(10)
# euler2(100)
```

## 10

Plot $S_n$ versus $n$ for $n = 1,2,\dots,2000$ with a horizontal **dashed** line at $\pi^2/6$.

```{r}
# Step 1: create n_vals <- 1:2000
# n_vals <-

# Step 2: compute S for each n using sapply
# S <-

# Step 3: plot
# plot(n_vals, S, type="l", xlab="n", ylab="S_n")

# Step 4: add dashed line at pi^2/6
# abline(h = , lty = 2)
```

## 11

Use `%in%` and `state.abb` to create a logical vector for: AL, AK, AZ, AR, AA.

```{r}
test <- c("AL", "AK", "AZ", "AR", "AA")

# Step 1: create is_real
# is_real <-

# Step 2: print is_real
# is_real
```

## 12

Report the one entry that is **not** an actual abbreviation.

```{r}
test <- c("AL", "AK", "AZ", "AR", "AA")

# Step 1: create not_real using !
# not_real <-

# Step 2: find index with which()
# idx <-

# Step 3: print the entry
# test[idx]
```

## 13

Using `%in%`, show all variables for Florida, California, and New York, **in that order**.

```{r}
targets <- c("Florida", "California", "New York")

# Step 1: subset murders to those states
# sub <-

# Step 2: reorder rows using match()
# sub_ordered <-

# Step 3: print sub_ordered
# sub_ordered
```

## 14

Write a function `vander_helper(x, n)` that returns $(1, x, x^2, \dots, x^n)$. Show results for `x=2`, `n=6`.

Restrictions: no loop.
```{r}
vander_helper <- function(x, n) {
  # Step: create exponents 0:n
  # Step: return x^(0:n)
}

# Test:
# vander_helper(2, 6)
```

## 15

Create a vector using:

```{r}
n <- 20000
p <- 0.35
set.seed(2025-9-18)
x <- sample(c(0,1), n, prob = c(1 - p, p), replace = TRUE)
```

Compute the length of each **stretch of consecutive 1s** (run lengths of 1s) and plot the distribution.

* Do **not** use a loop.
* Hint: use `rle(x)`.

Then compare empirical proportions to the geometric prediction for run lengths 1 through 8.

```{r}
# Step 1: compute r <- rle(x)
# r <-

# Step 2: extract ones_lengths (lengths where values == 1)
# ones_lengths <-

# Step 3: plot distribution (hist or barplot)
# hist(ones_lengths, breaks = 30)

# Step 4: empirical proportions for k=1:8
# tab <- table(ones_lengths)
# emp_counts <- as.numeric(tab[as.character(1:8)])
# emp_counts[is.na(emp_counts)] <- 0
# emp_probs <- emp_counts / length(ones_lengths)

# Step 5: theoretical probabilities (1-p)*p^(k-1)
# k <- 1:8
# theory_probs <-

# Step 6: make a comparison data frame
# comparison <- data.frame(run_length = k, empirical_prob = emp_probs, theory_prob = theory_probs)
# comparison
```

## 16

In the `murders` dataset:

1. Compute the national average murder rate.
2. Create labels using `ifelse`:

   * `"High Crime, High Pop"` if rate > national average and pop > 6 million
   * `"High Crime, Low Pop"` if rate > national average and pop ≤ 6 million
   * `"Lower Crime"` otherwise

Then show a `table()` of the labels.

```{r}
# Step 1: state-level rate
# rate <-

# Step 2: national average rate
# national_rate <-

# Step 3: logical vectors high_crime and high_pop
# high_crime <-
# high_pop <-

# Step 4: labels using nested ifelse
# labels <-

# Step 5: table(labels)
# table(labels)
```

## 17

What is the murder rate of the state that ranks **12th** in terms of murder rate (from highest to lowest)? Show your work using `order` (and optionally check with `sort` or `rank`).
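
To see how `order`, `sort`, and `rank` relate before applying them to the murder rates, here is a small sketch on a made-up named vector (`vals` is hypothetical and unrelated to `murders`). It picks the 2nd-largest element three ways, assuming there are no ties.

```r
# Toy named vector (hypothetical values)
vals <- c(a = 3.2, b = 7.5, c = 1.1, d = 5.0)

# order() returns the positions that would sort the vector
ord <- order(vals, decreasing = TRUE)

# Position (in the original vector) of the 2nd-largest value
i <- ord[2]
names(vals)[i]
vals[i]

# Cross-check 1: sort() returns the sorted values themselves
sort(vals, decreasing = TRUE)[2]

# Cross-check 2: rank() of the negated vector gives each element's
# place from highest to lowest; element i should have rank 2
rank(-vals)[i]
```

`order` is the one to reach for when the position is needed to look up other columns (such as a state name), since `sort` discards the link to the original positions of an unnamed vector.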
```{r}
# Step 1: rate vector
# rate <-

# Step 2: ord <- order(rate, decreasing = TRUE)
# ord <-

# Step 3: i <- ord[12]
# i <-

# Step 4: report state and rate
# data.frame(state = murders$state[i], rate = rate[i])
```

## 18

Write a function `compute_harmonic_mean` that returns the harmonic mean of a numeric vector, but returns `NA` if any values are zero or negative. Test on `c(1,2,4,8)` and show it is about `2.133333`.

```{r}
compute_harmonic_mean <- function(x) {
  # Step 1: if any x <= 0, return NA
  # Step 2: compute n <- length(x)
  # Step 3: return n / sum(1/x)
}

# Test:
# compute_harmonic_mean(c(1, 2, 4, 8))
```

## 19

Create a function `safe_divide(x, y)` that returns `x/y` but returns `"Cannot divide by zero"` when `y` is zero. Make it work element-wise on vectors (vectorized).

Test it on:

```r
x <- c(10, 20, 30)
y <- c(2, 0, 5)
```

```{r}
safe_divide <- function(x, y) {
  # Step 1: compute out <- x/y
  # Step 2: convert to character so you can store the message
  # Step 3: replace entries where y == 0
}

# Test:
# x <- c(10, 20, 30)
# y <- c(2, 0, 5)
# safe_divide(x, y)
```

## 20

Write a function `classify_state_safety(state_name)` that returns:

* `"Very Safe"` if rate < 1
* `"Safe"` if 1 ≤ rate < 3
* `"Moderate"` if 3 ≤ rate < 5
* `"High Risk"` if rate ≥ 5
* `"State not found"` if the state is not in the dataset

Test on `"Vermont"`, `"Texas"`, `"California"`, `"NotAState"`.

Then use `sapply` to classify all states and use `table()` to count how many fall into each category.
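
The general pattern behind this question (look a name up in a named vector, handle the missing case, then pick a label with an `if`/`else` chain) can be sketched on unrelated toy data. Everything in the block below is hypothetical: the `heights` vector, the `size_label` function, and its thresholds are invented for illustration and are not part of the assignment.

```r
# Toy named vector (hypothetical values)
heights <- c(oak = 20, birch = 12, shrub = 2)

size_label <- function(name) {
  # Handle names that are not present in the lookup vector
  if (!(name %in% names(heights))) {
    return("Not found")
  }

  # [[ ]] extracts the bare number and drops the name
  h <- heights[[name]]

  # Conditions are checked in order, so each branch only needs
  # the upper bound of its interval
  if (h < 5) {
    "Short"
  } else if (h < 15) {
    "Medium"
  } else {
    "Tall"
  }
}

size_label("birch")
size_label("fern")

# Apply the scalar function to every name, then count the labels
labels <- sapply(names(heights), size_label)
table(labels)
```

Because `if` works on a single value, the function handles one name at a time and `sapply` does the iteration, which is the same division of labor the question asks for.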
```{r}
# Step 1: compute a named vector of rates
# rate <- murders$total / murders$population * 100000
# rate_named <- setNames(rate, murders$state)

classify_state_safety <- function(state_name) {
  # Step 2: check if state_name is in names(rate_named)
  # Step 3: pull out r <- rate_named[state_name]
  # Step 4: return the correct label using if/else
}

# Tests:
# classify_state_safety("Vermont")
# classify_state_safety("Texas")
# classify_state_safety("California")
# classify_state_safety("NotAState")
```

```{r}
# Step 5: classify all states with sapply
# cats <- sapply(murders$state, classify_state_safety)

# Step 6: count categories
# table(cats)
```

## Convert to a PDF file

In a Linux terminal, run the following command, installing any missing packages as needed:

```
# change path_to_hw2.qmd to something like hw/hw2.qmd if the current directory is a parent directory of hw/
quarto render path_to_hw2.qmd --to pdf
```