Due before class on April 16th.
The goal of this assignment is to practice wrangling and exploring data in a research context.
Go here to fork the hw03 repository for homework 03.
In the rcfss package, there is a data frame called dadmom.
## # A tibble: 3 x 5
## famid named incd namem incm
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1. Bill 30000. Bess 15000.
## 2 2. Art 22000. Amy 18000.
## 3 3. Paul 25000. Pat 50000.
Tidy this data frame so that it adheres to the tidy data principles:
NOTE: You can accomplish this task in a single piped operation using only tidyr functions. Code which does not use tidyr functions is acceptable, but will not merit a “check plus” on your evaluation.
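One possible single-pipe sketch using tidyr's pivot_longer() (the data frame is rebuilt inline from the printout above so the chunk stands alone; other tidyr approaches are equally valid):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Reconstruct dadmom inline so this chunk runs on its own
dadmom <- tribble(
  ~famid, ~named, ~incd, ~namem, ~incm,
       1, "Bill", 30000, "Bess", 15000,
       2, "Art",  22000, "Amy",  18000,
       3, "Paul", 25000, "Pat",  50000
)

# Each column name encodes a variable ("name" or "inc") plus a parent
# ("d" for dad, "m" for mom); pivot_longer() can split both at once
tidy_dadmom <- dadmom %>%
  pivot_longer(
    cols = -famid,
    names_to = c(".value", "parent"),
    names_pattern = "(name|inc)(d|m)"
  )
```

The result has one row per parent per family, with name and inc as ordinary columns.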
Recall the gapminder data frame we previously explored. That data frame contains just six columns from the larger data in Gapminder World. In this part, you will join the original gapminder data frame with a new data file containing the HIV prevalence rate for each country.1
The HIV prevalence rate is stored in the data folder as a CSV file. You need to import and merge the data with gapminder to answer these two questions:
For each question, you need to perform a specific type of join operation. Think about what type makes the most sense and explain why you chose it.
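For instance, a left join keeps every gapminder row and attaches the HIV rate where a match exists, while an inner join drops unmatched rows. A minimal sketch with toy data (the real join would use the CSV from the data folder; the column names and values below are placeholders, not the actual data):

```r
library(dplyr)
library(tibble)

# Toy stand-ins: in the assignment, one table is gapminder and the
# other is imported with read_csv() from the data folder
gap <- tibble(
  country = c("Chad", "Chad", "Norway"),
  year    = c(2002, 2007, 2007),
  lifeExp = c(50.5, 51.7, 80.2)   # illustrative values only
)
hiv <- tibble(country = "Chad", year = 2007, hiv_rate = 3.5)

# left_join() keeps all three gap rows, with NA where no HIV data exists;
# inner_join() would instead keep only the one matching row
joined <- left_join(gap, hiv, by = c("country", "year"))
```

Which join is appropriate depends on whether unmatched observations matter to the question you are answering.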
The Supreme Court Database contains detailed information on decisions of the U.S. Supreme Court. It is perhaps the most utilized database in the study of judicial politics. Until recently, the database only contained records on cases from the “modern” era (1946-present). Recently the database was extended backwards to include all decisions since the formation of the Court in 1791. While still in beta form, this extension opens the door to new studies of the Court’s pre-modern era decisions.
In the hw03 repository, you will find two data files: SCDB_Legacy_03_justiceCentered_Citation.csv and SCDB_2017_01_justiceCentered_Citation.csv. These are the exact same files you would obtain if you downloaded them from the original website; I have included them in the repository merely for your convenience. Documentation for the datasets can be found here.
The data is structured in a tidy fashion.2 That is, every row is a vote by one justice on one case for every case decided from the 1791-2016 terms.3 There are several ID variables which are useful for other types of research; for our purposes, the only ID variable you need to concern yourself with is caseIssuesId. Variables you will want to familiarize yourself with include term, justice, justiceName, decisionDirection, majVotes, minVotes, majority, and chief. Pay careful attention in the documentation to how these variables are coded.
In order to analyze the Supreme Court data, you will need to import these two files and combine them together (see bind_rows() from the dplyr package). Friendly warning: you will initially encounter an error attempting to bind the two data frames. Use your powers of deduction (and R4DS/Google/Stack Overflow/classmates/me and the TAs) to figure out how to fix this error.
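To illustrate the general failure mode with toy data (not the specific columns in the SCDB files, which you should diagnose yourself): bind_rows() refuses to stack frames whose shared columns have incompatible types.

```r
library(dplyr)
library(tibble)

# Two toy frames whose shared column `term` has different types
modern <- tibble(id = c("a", "b"), term = c(1946, 1947))
legacy <- tibble(id = c("c", "d"), term = c("1791", "1792"))

# bind_rows(modern, legacy) would error with a type-mismatch message

# One generic fix: coerce the offending column to a common type first
combined <- bind_rows(modern, mutate(legacy, term = as.numeric(term)))
```

You could equally fix the types at import time by specifying column types when reading the CSVs.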
Once joined, use your data wrangling and visualization skills to answer the following questions:
You only need to complete one of the two bolded questions. Only complete both if you are feeling particularly masochistic!
Your assignment should be submitted as three RMarkdown documents. Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.
Check minus: Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed. No record of commits other than the final push to GitHub.
Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.
Check plus: Finished all components of the assignment correctly and attempted at least one advanced challenge. Code is well-documented (both self-documented and with additional comments as necessary). Graphs and tables are properly labeled. Use multiple commits to back up and show a progression in the work. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.
More specifically, the estimated number of people living with HIV per 100 population in the 15-49 age group.
Tidy, though not necessarily the most efficient. You could definitely reorganize the datasets into multiple tables of relational data.
Also known as a panel dataset. Terms run from October through June, so the 2016 term contains cases decided from October 2016 - June 2017.
This work is licensed under the CC BY-NC 4.0 Creative Commons License.