Introduction to Computing for the Social Sciences

MACS 30500 University of Chicago

TAs

  • Nora Nickels
  • Gabriel Velez
  • Chelsea Ernhofer

Course site

http://cfss.uchicago.edu

Major topics

  • Elementary programming techniques (e.g. loops, conditional statements, functions)
  • Writing reusable, interpretable code
  • Problem-solving - debugging programs for errors
  • Obtaining, importing, and munging data from a variety of sources
  • Performing statistical analysis
  • Visualizing information
  • Creating interactive reports
  • Generating reproducible research

print("Hello world!")
## [1] "Hello world!"
# load packages
library(tidyverse)
library(broom)

# estimate and print the linear model
lm(hwy ~ displ, data = mpg) %>%
  tidy() %>%
  mutate(term = c("Intercept",
                  "Engine displacement (in liters)")) %>%
  knitr::kable(digits = 2,
               col.names = c("Variable", "Estimate",
                             "Standard Error", "T-statistic",
                             "P-Value"))
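
|Variable                        | Estimate| Standard Error| T-statistic| P-Value|
|:-------------------------------|--------:|--------------:|-----------:|-------:|
|Intercept                       |    35.70|           0.72|       49.55|       0|
|Engine displacement (in liters) |    -3.53|           0.19|      -18.15|       0|
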
# visualize the relationship
ggplot(data = mpg, aes(displ, hwy)) + 
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE,
              color = "black", alpha = .25) +
  labs(x = "Engine displacement (in liters)",
       y = "Highway miles per gallon",
       color = "Car type")

Other resources

Plagiarism

  • Collaboration is good – to a point
  • Learning from others/the internet

If you don’t understand what the program is doing and are not prepared to explain it in detail, you should not submit it.

Evaluations

  • Weekly programming assignments
  • Peer review

Program

A series of instructions that specifies how to perform a computation

  • Input
  • Output
  • Math
  • Conditional execution
  • Repetition
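
A minimal sketch of these five building blocks in a single short R script. The temperature values and the 30-degree cutoff are hypothetical, chosen only for illustration.

```r
# Input: a hypothetical vector of daily high temperatures (degrees Fahrenheit)
temperatures <- c(68, 75, 91, 84, 59)

# Math: convert each value to degrees Celsius
celsius <- (temperatures - 32) * 5 / 9

# Repetition and conditional execution: label each day, one at a time
labels <- character(length(celsius))
for (i in seq_along(celsius)) {
  if (celsius[i] >= 30) {
    labels[i] <- "hot"
  } else {
    labels[i] <- "mild"
  }
}

# Output: print the results as a table
print(data.frame(fahrenheit = temperatures,
                 celsius = round(celsius, 1),
                 label = labels))
```

In everyday R the loop would usually be replaced by a vectorized call such as `ifelse()`, but the explicit loop makes the repetition and the condition easier to see.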

Write a report analyzing the relationship between ice cream consumption and crime rates in Chicago.

Two different approaches

  • Jane: a GUI workflow
  • Sally: a programmatic workflow

Automation

  • Jane forgets how she transformed and analyzed the data
    • Extension of analysis will fall flat
  • Sally uses automation
    • Re-run programs
    • Fewer manual mistakes
    • Much easier to implement in the long run
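
A rough sketch of what Sally's approach might look like. The function name `analyze_relationship`, the column names, and the data are all assumptions made for illustration: the numbers are simulated placeholders, not real Chicago figures.

```r
# Hypothetical analysis function: every step is written down in code,
# so the analysis can be re-run without remembering any manual steps
analyze_relationship <- function(data) {
  model <- lm(crime_rate ~ ice_cream_sales, data = data)
  coef(model)
}

# Simulated placeholder data (NOT real ice cream or crime figures)
set.seed(1234)
monthly <- data.frame(ice_cream_sales = runif(12, min = 100, max = 500))
monthly$crime_rate <- 50 + 0.1 * monthly$ice_cream_sales + rnorm(12, sd = 5)

# Running the same function twice on the same data repeats the same steps
first_run <- analyze_relationship(monthly)
second_run <- analyze_relationship(monthly)
identical(first_run, second_run)
```

Because the transformation and the model live in a function, extending the analysis to new data means calling `analyze_relationship()` again rather than repeating a sequence of clicks from memory.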

Reproducibility

  • Are my results valid? Can they be replicated?
  • The idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them
  • Also allows researchers to precisely replicate their own analysis
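
One small habit that supports this, sketched here as an illustration rather than a complete recipe: fix the random seed so that any step involving random numbers gives the same result on every run, and record the software environment with the results. The seed value is arbitrary.

```r
# Fix the random seed so that random draws are identical on every run
set.seed(30500)
draw_one <- rnorm(5)

# Resetting the same seed reproduces the same draws
set.seed(30500)
draw_two <- rnorm(5)

# Check that the two sets of draws match exactly
identical(draw_one, draw_two)

# Record the R version and loaded packages alongside the results
sessionInfo()
```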

Version control

  • Revisions in research
  • Tracking revisions
  • Multiple copies
    • analysis-1.r
    • analysis-2.r
    • analysis-3.r
  • Cloud storage (e.g. Dropbox, Google Drive, Box)
  • Version control software
    • Repository

Documentation

  • Comments are the what
  • Code is the how
  • Computer code should also be self-documenting
  • Future-proofing

Badly documented code

library(tidyverse)
library(rtweet)
tmls <- get_timeline(c("MeCookieMonster", "Grover", "elmo", "CountVonCount"), 3000)
ts_plot(group_by(tmls, screen_name), "weeks")

Good code

# get_to_sesame_street.R
# Program to retrieve recent tweets from Sesame Street characters

# load packages for data management and Twitter API
library(tidyverse)
library(rtweet)

# retrieve most recent 3000 tweets of Sesame Street characters
tmls <- get_timeline(
  user = c("MeCookieMonster", "Grover", "elmo", "CountVonCount"),
  n = 3000
)

# group by character and plot weekly tweet frequency
tmls %>%
  group_by(screen_name) %>%
  ts_plot(by = "weeks")
