Overview

Due before class on April 9th.

Now that you’ve demonstrated your software is set up, the goal of this assignment is to practice transforming and exploring data.

Fork the hw02 repository

Go here to fork the repo for homework 02.

Exploring clean data

FiveThirtyEight, a data journalism site devoted to politics, sports, science, economics, and culture, recently published a series of articles on gun deaths in America. Gun violence in the United States is a significant political issue, and while reducing gun deaths is a noble goal, we must first understand the causes and patterns in gun violence in order to craft appropriate policies. As part of the project, FiveThirtyEight collected data from the Centers for Disease Control and Prevention, as well as other governmental agencies and non-profits, on all gun deaths in the United States from 2012 to 2014.

Obtain the data

I have included this dataset in the rcfss package on GitHub. To install the package, run devtools::install_github("uc-cfss/rcfss") in R. If you don’t already have the devtools package installed, you will get an error; install it first with install.packages("devtools"), then install rcfss. Once rcfss is loaded with library(rcfss), the gun deaths dataset can be loaded using data("gun_deaths"). Use the help function in R (?gun_deaths) to get detailed information on the variables and how they are coded.
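
The installation steps above can be sketched as one short script. The requireNamespace() guard is an addition for convenience, not part of the original instructions; it simply skips the devtools install if the package is already present.

```r
# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install the course package from GitHub (requires an internet connection)
devtools::install_github("uc-cfss/rcfss")

# Attach the package and load the dataset
library(rcfss)
data("gun_deaths")

# Open the documentation for variable descriptions and coding
?gun_deaths
```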

Explore the data

Using your knowledge of dplyr and ggplot2, use summary statistics and graphs to answer the following questions:

  1. In what month do the most gun deaths occur?
  2. What is the most common intent in gun deaths? Do most people killed by guns die in suicides, homicides, or accidental shootings?
  3. What is the average age of females killed by guns?
  4. How many white males with at least a high school education were killed by guns in 2012?
  5. Which season of the year has the most gun deaths? Assume that
    • Winter = January-March
    • Spring = April-June
    • Summer = July-September
    • Fall = October-December
    • Hint: you need to convert a continuous variable into a categorical variable. Find a function that does that.
  6. What is the relationship between race and intent? For example, are whites who are killed by guns more likely to die because of suicide or homicide? How does this compare to blacks and hispanics?
  7. Are police-involved gun deaths significantly different from other gun deaths? Assess the relationship between police involvement and age, police involvement and race, and the intersection of all three variables.
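
For the hint in question 5, one base R function that bins a numeric variable into labeled categories is cut(). The sketch below is only an illustration on made-up data: the toy tibble and its values are invented, and the column names (month, sex, age) are assumed to mirror gun_deaths, so check ?gun_deaths before adapting it.

```r
library(dplyr)

# Toy data standing in for gun_deaths; values are invented and the
# column names are assumptions, not taken from the real dataset
toy <- tibble(
  month = c(1, 2, 4, 7, 7, 8, 11, 12),
  sex   = c("M", "F", "M", "F", "M", "M", "F", "M"),
  age   = c(34, 51, 22, 47, 19, 63, 28, 40)
)

# cut() turns a numeric variable into a factor; with the default
# right = TRUE the bins are (0,3], (3,6], (6,9], (9,12]
toy_season <- toy %>%
  mutate(season = cut(month,
                      breaks = c(0, 3, 6, 9, 12),
                      labels = c("Winter", "Spring", "Summer", "Fall")))

# Count rows per category
season_counts <- count(toy_season, season)
season_counts

# Grouped summary statistic: mean age by sex
mean_age_by_sex <- toy %>%
  group_by(sex) %>%
  summarize(mean_age = mean(age, na.rm = TRUE))
mean_age_by_sex
```

The same count() and group_by() plus summarize() pattern applies to most of the questions above; only the grouping variables and the summary function change.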

Formatting graphs

While you are practicing exploratory data analysis, your final graphs should be appropriate for sharing with outsiders. That means your graphs should have:

  • A title
  • Labels on the axes (see ?labs for details)

This is just a starting point. Consider adopting your own color scales, taking control of your legends (if any), playing around with themes, etc.
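
As a rough sketch of these minimum requirements, the example below adds a title and axis labels with labs(). The toy_counts data frame is a hand-typed subset of the location counts shown later on this page, and the title wording and theme choice are illustrative assumptions rather than required styling.

```r
library(ggplot2)

# Hand-typed subset of the location counts, for illustration only
toy_counts <- data.frame(
  place = c("Home", "Street", "Farm"),
  n     = c(60486, 11151, 470)
)

# Bar chart with a title and labeled axes
p <- ggplot(toy_counts, aes(x = place, y = n)) +
  geom_col() +
  labs(title = "Gun deaths by location (subset of locations)",
       x = "Location",
       y = "Number of deaths") +
  theme_minimal()
p
```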

Formatting tables

When presenting tabular data (for example, the output of dplyr::summarize()), make sure you format it correctly. Use the kable() function from the knitr package to format the table for the final document. For instance, this is a poorly presented table summarizing where gun deaths occurred:

library(tidyverse)
library(knitr)
library(rcfss)
# calculate total gun deaths by location
count(gun_deaths, place)
## # A tibble: 11 x 2
##    place                       n
##    <chr>                   <int>
##  1 Farm                      470
##  2 Home                    60486
##  3 Industrial/construction   248
##  4 Other specified         13751
##  5 Other unspecified        8867
##  6 Residential institution   203
##  7 School/instiution         671
##  8 Sports                    128
##  9 Street                  11151
## 10 Trade/service area       3439
## 11 <NA>                     1384

Instead, use kable() to format the table, add a caption, and label the columns:

count(gun_deaths, place) %>%
  kable(caption = "Gun deaths in the United States (2012-2014), by location",
        col.names = c("Location", "Number of deaths"))

Gun deaths in the United States (2012-2014), by location

Location                   Number of deaths
Farm                                    470
Home                                  60486
Industrial/construction                 248
Other specified                       13751
Other unspecified                      8867
Residential institution                 203
School/instiution                       671
Sports                                  128
Street                                11151
Trade/service area                     3439
NA                                     1384

Run ?kable in the console to see additional options.
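
Two options that may be useful here are align and format.args. The sketch below uses a hand-typed subset of the counts above (toy_counts is a made-up name) and adds thousands separators; treat it as one possible configuration, not a required format.

```r
library(knitr)

# Hand-typed subset of the location counts, for illustration only
toy_counts <- data.frame(
  place = c("Home", "Street", "Farm"),
  n     = c(60486, 11151, 470)
)

tbl <- kable(
  toy_counts,
  caption     = "Gun deaths in the United States (2012-2014), selected locations",
  col.names   = c("Location", "Number of deaths"),
  align       = c("l", "r"),            # left-align text, right-align numbers
  format.args = list(big.mark = ",")    # passed to format(): adds thousands separators
)
tbl
```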

Note that when viewed on GitHub, table captions will not show up. Just a (missing) feature of Markdown on GitHub 😢

Submit the assignment

Your assignment should be submitted as an R Markdown document. Don’t know what an R Markdown document is? Read this! Or this! I have included starter files for you to modify to complete the assignment, so you are not beginning completely from scratch.

Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

Rubric

Check minus: Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed. No record of commits other than the final push to GitHub.

Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Check plus: Finished all components of the assignment correctly. Code is well-documented (both self-documented and with additional comments as necessary). Graphs and tables are properly labeled. Uses multiple commits to back up and show a progression in the work. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.