---
title: Intro to Analyzing Cross-Sectional Time-Series Data in R (for Students of IR
  & Comparative Politics)
author: "Justin Murphy"
date: "November 21, 2014"
output: pdf_document
toc: true
---

\pagebreak

## Overview
This document demonstrates how to perform several routine tasks in the analysis of state-level economic and political data for multiple countries over time, using the programming language R. It is aimed toward first-year graduate students with little or no experience with R. [The .Rmd file which generated this document](https://raw.githubusercontent.com/jmrphy/r_intro_for_grad_students/master/r_intro.Rmd) may prove especially useful as a template from which students may wish to get started. All one needs to do is download and install R, RStudio, and LaTex, open this .Rmd file in Rstudio, and modify/extend the examples already coded here.

This document downloads from the web two widely used data sources (the World Bank's World Development Indicators and the Polity IV measure of democracy), merges them, cleans and transforms some variables, visualizes several variables in commonly recurring types of graphs, runs simple models, and generates automatically formatted tables.


## Download packages
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE, eval=FALSE}
install.packages("WDI")
install.packages("countrycode")
install.packages("plm")
install.packages("ggplot2")
install.packages("reporttools")
install.packages("stargazer")
install.packages("psData")
```

## Introducing the World Development Indicators API in R

1. Data collected by the World Bank.
2. Hundreds of state-level variables.
3. From 1960 to 2012.
4. Import to R directly.

```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
require(WDI)  # load WDI package
```
* To find the variable codes to call, first search with the ```WDIsearch()``` function.
* Then call data with the ```WDI()``` function.
* Super easy!

## Simple demonstration of WDI package in R

* Let's say we're interested in GDP growth across countries. First we'll search for "gdp growth" to see if the World Bank has a variable matching this search term.

```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
### Use this function to find indicator codes for things that interest you
WDIsearch("gdp growth")
```

* To download the data straight from the web, we use the ```WDI()``` function, specifying which countries, variables, and years we're interested in.
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
### Insert as many countries and variables as you want, within this range.
### Let's try getting GDP growth for US and France from 1960-2012
wb <- WDI(country=c("US", "FR"), indicator=c("NY.GDP.MKTP.KD.ZG"), start=1960, end=2012)
```

\pagebreak

## Make a real research dataset

- Download a set of commonly used variables for all countries between 1960 and the most recent update of the WDI. Setting ```extra=TRUE``` retrieves some extra variables such as region, geocodes, and additional country codes. The additional country code turns out to merge better with other datasets (see below).

```{r, warning=FALSE, message=FALSE, error=FALSE, cache=TRUE}
wb <- WDI(country="all", indicator=c("SE.XPD.TOTL.GB.ZS", "GC.DOD.TOTL.GD.ZS",
                                     "BN.CAB.XOKA.GD.ZS", "SP.POP.DPND", "AG.LND.TOTL.K2",
                                     "NY.GDP.MKTP.CD", "NY.GDP.MKTP.PP.CD",
                                     "NY.GDP.MKTP.KD", "NY.GDP.PCAP.CD", "SP.POP.TOTL",
                                     "IT.PRT.NEWS.P3", "IT.RAD.SETS", "NE.IMP.GNFS.ZS",
                                     "NE.EXP.GNFS.ZS", "NE.TRD.GNFS.ZS",
                                     "BX.KLT.DINV.WD.GD.ZS", "BN.KLT.PRVT.GD.ZS",
                                     "NE.CON.GOVT.ZS", "SL.IND.EMPL.ZS", "NV.IND.TOTL.ZS",
                                     "NY.GDP.MKTP.KD.ZG", "NY.GDP.PCAP.KD"),
          start=1960,
          end=2012,
          extra=TRUE)
```

\pagebreak

### Clean & transform the data
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
# Assignment operator "<-" will add a new variable to the dataframe "wb"
wb$debt<-wb$GC.DOD.TOTL.GD.ZS      # central government debt as share of gdp
wb$curracct<-wb$BN.CAB.XOKA.GD.ZS   # current account balance as share of gdp
wb$dependency<-wb$SP.POP.DPND  # dependency-age population as share of total population
wb$land<-wb$AG.LND.TOTL.K2    # total land in square kilometers
wb$pop<-wb$SP.POP.TOTL        # total population
wb$radioscap<-(wb$IT.RAD.SETS/wb$pop)*1000  # radios per 1000 people
wb$logradioscap<-log(wb$radioscap + 1)  # logarithm of radios per capita, because it's skewed
wb$newspaperscap<-wb$IT.PRT.NEWS.P3  # newspapers in circulation per 1000 people
wb$gdp<-wb$NY.GDP.MKTP.KD     # gross domestic product, constant 2000 USD
wb$gdp2<-wb$NY.GDP.MKTP.CD  # gross domestic product, current USD, millions
wb$gdp3<-wb$NY.GDP.MKTP.PP.CD # gross domestic product, purchasing-power parity, current
wb$gdpcap<-wb$NY.GDP.PCAP.KD      # gdp per capita constant 2000 USD
wb$gdpgrowth<-wb$NY.GDP.MKTP.KD.ZG   # change in gross domestic product
wb$imports<-wb$NE.IMP.GNFS.ZS # imports of goods and services as share of gdp
wb$exports<-wb$NE.EXP.GNFS.ZS # exports of goods and services as share of gdp
wb$trade<-wb$NE.TRD.GNFS.ZS   # total trade as share of gdp 
wb$fdi<-wb$BX.KLT.DINV.WD.GD.ZS  # foreign-direct investment as share of gdp
wb$privatecapital<-wb$BN.KLT.PRVT.GD.ZS   # private capital flows as share of gdp
wb$spending<-wb$NE.CON.GOVT.ZS     # government consumption expenditure as share of gdp
wb$industry<-wb$SL.IND.EMPL.ZS     # employment in industry as share of total employment
wb$industry2<-wb$NV.IND.TOTL.ZS    # value added in industry as share of gdp
```

\pagebreak

### Subset the data

Take a peak at the "head" (the first six rows) using the ```head()``` function. This is "panel" or "pooled" cross-sectional, time-series data. This is also called "country-year" format.
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
head(wb[,1:7])   # limit to columns 1 through 7
```
The World Bank keeps data not only on countries, but on certain aggregates of countries. Aggregate units are distinguished by the category "Aggregates" in the factor variable "region" in the ```wb``` object (a dataframe). Then, each of the different aggregates are specified by the "country" variable. To keep things tidy, let's break the ```wb``` dataframe into a few separate dataframes. We'll make one for all the aggregates, one for regions in particular (or any other set of aggregates you're interested in; try replacing the region categories below with other categories in the "country" variable of the ```aggregate``` subset we're about to make...), and one for the good-old fashioned states.
```{r, fig.align='center', fig.width=9, fig.height=6, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}

# make a subset of only the region/world aggregate measures
aggregates<-subset(wb, region=="Aggregates")

# make a subset of only global-level aggregate measures
world<-subset(aggregates, country=="World")

# Make a subset of only region aggregates
regions<-subset(aggregates, country=="East Asia & Pacific (all income levels)" |
                            country=="Middle East & North Africa (all income levels)" |
                            country=="Sub-Saharan Africa (all income levels)" |
                            country=="North America" |
                            country=="Latin America & Caribbean (all income levels)" |
                            country=="Europe & Central Asia (all income levels)" |
                            country=="South Asia")

wb<-subset(wb, region!="Aggregates")
```

\pagebreak

## Visualizing panel data
Below we investigate our data using one of the most powerful graphing packages in R: ggplot2. The formulas below are a quick tour of plots suited especially for the demands of panel data.

### Multiple variables (on a single scale) for one unit over time
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE, fig.show='hold'}

# load ggplot2 package, must install.packages("ggplot2") if you haven't previously.
require(ggplot2)

ggplot(world, aes(x = year)) +                     
  geom_line(aes(y = trade, colour = "Trade")) + 
  geom_line(aes(y = spending, colour = "Spending")) + 
  scale_colour_discrete(name = "Variables") +
  xlab("Year") +
  ylab("% of GDP") +
  ggtitle("Trade and Government Spending Around the World, 1960-2012")
```

\pagebreak

### One time-series plot for one variable and one line per unit
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE, fig.width=7}
ggplot(regions, aes(x = year, y = gdpcap, colour=country)) +
  geom_line() +
  scale_colour_discrete(name = "Regions") +
  xlab("Year") +
  ylab("GDP Per Capita") +
  ggtitle("GDP Per Capita by Region, 1960-2012")
```

\pagebreak

### Panel of time-series plots for one variable and one plot per unit
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE, fig.height=8}
ggplot(wb[wb$region=="Europe & Central Asia (all income levels)",], aes(x=year, y=trade)) +
  geom_line(aes(group=country)) +
  facet_wrap(~ country) +
  xlab("Year") +
  ylab("Trade (% of GDP)") +
  ggtitle("Trade Levels in Europe and Central Asia, 1960-2012") + 
  theme(axis.text.x = element_text(angle = 75, hjust = 1))
```

\pagebreak

## Import Polity IV
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
require(psData)
polity <- PolityGet(vars = "polity2")
polity<-subset(polity, year>=1960 & year<=2012)
head(polity)               # peek at the dataframe
```

## Easily deal with country codes using the ```countrycode``` package
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
# load the "countrycode" package for converting codes
require(countrycode)

# convert COW code to ISO 3-character code as in WDI
polity$cowcode<-countrycode(polity$iso2c, "iso2c", "cown")

# peek at the dataframe again
head(polity)
```

\pagebreak

## Subset variables of interest and merge with WDI
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}

# make subset of key variables
dem<-subset(polity, select=c("iso2c", "year", "polity2"))

df<-merge(wb, dem, by=c("iso2c", "year"))      # 7,634 country-years matched, yay
```

## Standardize time-series on different scales
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
require(arm)

# subtract out the mean and divide by two standard deviations
df$dem.z<-rescale(df$polity2)
df$trade.z<-rescale(df$trade)
```

\pagebreak

## Visualize standardized time-series by country
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE, fig.width=8}
ggplot(df[df$region=="Middle East & North Africa (all income levels)",], aes(x=year)) +
  geom_line(aes(y=trade.z, group=country, colour="Trade")) +
  geom_line(aes(y=dem.z, group=country, colour="Polity IV")) +
  facet_wrap(~ country) +
  scale_colour_discrete(name = "Standardized Variables") +
  xlab("Year") +
  ylab("Standardized deviations from mean") +
  ggtitle("Trade & Democracy in Middle East & North Africa, 1960-2012") +
  theme(axis.text.x = element_text(angle = 75, hjust = 1))
```

\pagebreak

## Modeling pooled data
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
require(plm)
model<-plm(spending ~ trade + polity2 + dependency + gdpgrowth + lag(spending),
           index = c("iso3c","year"),
           model="within",
           effect="twoways",
           data=df)
ls(model)    # see the contents of the dataframe
summary(model)  # see main model output
```

### Panel-corrected standard errors
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
require(lmtest)
coeftest(model, vcov=function(x) vcovBK(x, type="HC1", cluster="time"))
```

\pagebreak

## Generate summary tables
These functions will generate publication-quality Latex tables of summary statistics of all the variables in your dataframe. To use these tables in a paper, you simply have to copy and paste into a Latex document what these functions print for you in the R console. These functions generate Latex code, which you can then typset with Latex into an elegant paper.
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE, results='asis'}
require(reporttools)
summaryvars<-subset(df, select=c("lending", "income", "imports", "exports", "industry"))
tableContinuous(summaryvars[,sapply(summaryvars, is.numeric)], font.size=12,
                longtable=FALSE, comment=FALSE, timestamp=FALSE,
                cap="Example Table for Continous Variables") # numeric variables only
tableNominal(summaryvars[,!sapply(summaryvars, is.numeric)], font.size=12,
             longtable=FALSE, comment=FALSE, timestamp=FALSE,
             cap="Example Table for Categorical Variables") # only non-numeric variables
```

\pagebreak

## Generate publication-quality results tables
The ```stargazer``` function in the package of the same name will turn your model object into a nicely formatted table of results. It works just as the reporttools functions above. A nice little trick is to simply write the Latex code to a file, which will automatically save to your working directory. After running these lines, check your working directory for you Latex file called "model.tex". Typset this file in a Latex editor.
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE, results='asis'}
require(stargazer)
model<-lm(spending ~ trade + polity2 + dependency + gdpgrowth + lag(spending), data=df)
stargazer(model,
          header=FALSE,
          title="Table of Regression Results",
          digits = 2,
          style = "apsr")
```

\pagebreak

### Save your dataset to a file
Finally, this code chunk will save your dataframe "df" to a comma-separated text file, so you can easily open it in Stata, SPSS, Excel, etc.  
```{r, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE}
#set your working directory to whatever folder you wish
write.csv(df, "exported_data.csv")
```