Introduction to Computing for the Social Sciences

MACS 30500 University of Chicago

TAs

Nora Nickels
Gabriel Velez

Chelsea Ernhofer

Course site

http://cfss.uchicago.edu

Major topics

Elementary programming techniques (e.g. loops, conditional statements, functions)
Writing reusable, interpretable code
Problem-solving: finding and fixing errors in programs
Obtaining, importing, and munging data from a variety of sources
Performing statistical analysis
Visualizing information
Creating interactive reports
Generating reproducible research

print("Hello world!")

## [1] "Hello world!"

# load packages
library(tidyverse)
library(broom)

# estimate and print the linear model
lm(hwy ~ displ, data = mpg) %>%
  tidy() %>%
  mutate(term = c("Intercept",
                  "Engine displacement (in liters)")) %>%
  knitr::kable(digits = 2,
               col.names = c("Variable", "Estimate",
                             "Standard Error", "T-statistic",
                             "P-Value"))

Variable	Estimate	Standard Error	T-statistic	P-Value
Intercept	35.70	0.72	49.55	0
Engine displacement (in liters)	-3.53	0.19	-18.15	0

# visualize the relationship
ggplot(data = mpg, aes(displ, hwy)) + 
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE,
              color = "black", alpha = .25) +
  labs(x = "Engine displacement (in liters)",
       y = "Highway miles per gallon",
       color = "Car type")

15 min rule: when stuck, you HAVE to try on your own for 15 min; after 15 min, you HAVE to ask for help. - Brain AMA
— Rachel Thomas (@math_rachel) August 14, 2016

Other resources

Google
StackOverflow
Me
TAs
Fellow students
Class discussion page
- How to properly ask for help

Plagiarism

Collaboration is good – to a point
Learning from others/the internet

If you don’t understand what the program is doing and are not prepared to explain it in detail, you should not submit it.

Evaluations

Weekly programming assignments
Peer review

Program

A series of instructions that specifies how to perform a computation

Input
Output
Math
Conditional execution
Repetition
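A minimal sketch of how these five elements can appear together in one short R script. The temperature values are made up for illustration, and the 25 degree cutoff is an arbitrary assumption.

```r
# Input: a small, made-up vector of daily temperatures in Fahrenheit
temps_f <- c(68, 75, 90, 82, 59)

# Math: convert Fahrenheit to Celsius
temps_c <- (temps_f - 32) * 5 / 9

# Repetition and conditional execution: label each day one at a time
labels <- character(length(temps_c))
for (i in seq_along(temps_c)) {
  if (temps_c[i] >= 25) {   # arbitrary cutoff chosen for this example
    labels[i] <- "warm"
  } else {
    labels[i] <- "cool"
  }
}

# Output: print the results as a table
result <- data.frame(
  fahrenheit = temps_f,
  celsius = round(temps_c, 1),
  label = labels
)
print(result)
```

The loop is written out explicitly to show repetition; in everyday R, a vectorized call such as ifelse() would usually do the same job in one line.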

Write a report analyzing the relationship between ice cream consumption and crime rates in Chicago.

Two different approaches

Jane: a GUI workflow
Sally: a programmatic workflow

Automation

Jane forgets how she transformed and analyzed the data
- Extension of analysis will fall flat
Sally uses automation
- Re-run programs
- Fewer manual mistakes
- Much easier to implement in the long run
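One way Sally's approach might look, as a rough sketch: the transformation and the model live in a single function, so the whole analysis can be re-run on new or extended data with one call. The function name analyze() is hypothetical, and the built-in mtcars data set stands in for whatever data the real project would use.

```r
# analyze() is a hypothetical helper written for this example,
# not a function from any package
analyze <- function(data) {
  # transform: convert weight from thousands of pounds to kilograms
  data$wt_kg <- data$wt * 1000 * 0.4536
  # model: fuel efficiency as a linear function of weight
  model <- lm(mpg ~ wt_kg, data = data)
  # return the estimated coefficients
  coef(model)
}

# run the analysis on the full data
all_cars <- analyze(mtcars)

# extend the analysis to a subset without repeating any manual steps
four_cyl <- analyze(mtcars[mtcars$cyl == 4, ])

print(all_cars)
print(four_cyl)
```

Because every step is recorded in the function, nothing depends on remembering which menus were clicked or in what order.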

Reproducibility

Are my results valid? Can they be replicated?
The idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them
Also allows the researcher to precisely replicate their own analysis
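A small illustration of one ingredient of reproducibility, assuming the analysis involves random numbers: fixing the seed makes the "random" draws identical on every run, and recording the session details documents which R version and packages produced the results. The seed value 1234 is arbitrary.

```r
# fix the random seed so the draws below are the same on every run
set.seed(1234)
sample_a <- rnorm(5)

# resetting the same seed reproduces exactly the same draws
set.seed(1234)
sample_b <- rnorm(5)

# TRUE: the two samples match exactly
print(identical(sample_a, sample_b))

# record the R version and loaded packages alongside the results
print(sessionInfo())
```

This covers only the software side; sharing the data and the full script is still needed for others to verify the findings.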

Version control

Revisions in research
Tracking revisions
Multiple copies
- analysis-1.r
- analysis-2.r
- analysis-3.r
Cloud storage (e.g. Dropbox, Google Drive, Box)
Version control software
- Repository

Documentation

Comments are the what
Code is the how
Computer code should also be self-documenting
Future-proofing

Badly documented code

library(tidyverse)
library(rtweet)
tmls <- get_timeline(c("MeCookieMonster", "Grover", "elmo", "CountVonCount"), 3000)
ts_plot(group_by(tmls, screen_name), "weeks")

Good code

# get_to_sesame_street.R
# Program to retrieve recent tweets from Sesame Street characters

# load packages for data management and Twitter API
library(tidyverse)
library(rtweet)

# retrieve most recent 3000 tweets of Sesame Street characters
tmls <- get_timeline(
  user = c("MeCookieMonster", "Grover", "elmo", "CountVonCount"),
  n = 3000
)

# group by character and plot weekly tweet frequency
tmls %>%
  group_by(screen_name) %>%
  ts_plot(by = "weeks")