#' ---
#' title: "Testing Expected Counts and Means"
#' layout: lab
#' permalink: /lab-scripts/chisq-t-test/
#' filename: chisq-t-test.R
#' active: lab-scripts
#' abstract: "This lab script will show you how to do some very basic tests in
#' R for observed/expected counts and differences in means. These simple tests
#' make some strong assumptions about the data, or about the ability to marshal
#' data for more sophisticated tests. However, they are nice entry-level stuff
#' students should know."
#' output:
#'   md_document:
#'     variant: gfm
#'     preserve_yaml: TRUE
#' ---

#+ setup, include=FALSE
knitr::opts_chunk$set(collapse = TRUE,
                      fig.path = "figs/chisq-t-test/",
                      cache.path = "cache/chisq-t-test/",
                      fig.width = 11,
                      comment = "#>")
#+

#            .-.
#           / /
#          / |
# |\      ._ ,-""  `.
# | |,,_/  7        ;
# `;=     ,=(     ,  /
#  |`q  q  ` |    \_,|
# .=; <> _  ;  /  ,/'/ |
# ';|\,j_ \;=\ ,/    `-'
#    `--'_|\  )
#   ,' | /  ;'
#  (,,/ (,,/      Testing Proportions and Means

#' ## If it's not installed, install it.

library(tidyverse)
library(stevedata)

#' Reminder: Please have read this:
#'
#' - https://svmiller.com/blog/2025/08/simple-tests-for-arms-races-war/
#'
#' It takes a lot of time to write things out for the sake of new material.
#' What follows here is just a reduced form of what I make available on my
#' blog.
#'
#' ## Load and prepare the data
#'
#' The data I'll be using for the bulk of this lab script is the `mmb_war` data
#' that I describe in the blog post above. You should have this in `{stevedata}`,
#' and I may or may not make a copy of it available on Athena. Here's how I'll
#' be loading it into the session for the sake of this presentation.

Data <- stevedata::mmb_war
Data

#' As I noted in the blog post accompanying this particular session, the only
#' thing that's missing from the data is a binary measure of war we need to
#' create. The industry standard in quantitative peace science is
#' operationalizing as war any confrontation where fatalities reach 1,000.
#' We'll be doing that with the minimum dyadic fatalities observed in the
#' confrontation. That's the `dyfatmin` column here.

Data %>%
  mutate(war = ifelse(dyfatmin >= 1000, 1, 0)) -> Data

Data

#' Now we're ready to go. We already have our primary arms race variable (`mmb`).
#'
#' There are a few ways in which you can create a cross-tabulation of these two
#' variables. For example, the `table()` function is a base R function that will
#' create a cross-tabulation for you without a whole lot of effort. Just supply
#' it the two vectors you want, like this.

table(Data$war, Data$mmb)

#' When you do it this way, the first argument (`Data$war`) is a column that
#' becomes the information in the rows whereas the second argument (`Data$mmb`)
#' is the information that becomes the columns. Here's how you'd read this table
#' with that in mind.
#'
#' - **Top-left (a):** There were 2,049 confrontations that *did not* become wars
#' (i.e. `war = 0`) and for which there was no mutual military build-up preceding
#' the confrontation (`mmb = 0`).
#' - **Top-right (b):** There were 75 confrontations that *did not* become wars
#' (i.e. `war = 0`), but there was a mutual military build-up preceding them
#' (`mmb = 1`).
#' - **Bottom-left (c):** There were 180 confrontations that became wars (i.e.
#' `war = 1`), but did not have a mutual military build-up preceding them
#' (`mmb = 0`).
#' - **Bottom-right (d):** There were 20 confrontations that became wars (i.e.
#' `war = 1`), and did have a mutual military build-up preceding them
#' (`mmb = 1`).
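#'
#' As a quick aside that's not in the blog post: if raw counts are hard to
#' eyeball, base R's `prop.table()` will convert a table of counts into
#' proportions. This is just a minimal sketch using the table we created above;
#' `margin = 2` asks for proportions within each column (i.e. within each level
#' of `mmb`).

tab <- table(Data$war, Data$mmb)

# Proportions within each column: compare the rate of war with and without a
# mutual military build-up preceding the confrontation.
prop.table(tab, margin = 2)

# Proportions of the table total.
prop.table(tab)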
#' I mention the row/column orientation here because I think it's important, if
#' not critical. Chi-squared tests do not care about what is the row and what is
#' the column because the test is primarily multiplying row and column totals
#' together. However, there is a (reasonable) convention in the social science
#' world that anything you understand to be a "cause" is a column in a
#' cross-tab. In our case, we 100% understand that arms races are potentially
#' causes of confrontation escalation to war and not necessarily the other way
#' around (given how we operationalize them). You are not obligated to hold to
#' that convention, but I would encourage it.
#'
#' You could also see for yourself with some slightly more convoluted code.

Data %>%
  # This just splits the data by each unique combination of war and MMB.
  # It wouldn't be elegant if you had missing data.
  split(., paste("War = ", .$war, "; MMB = ", .$mmb, sep = "")) %>%
  # map() summarizes each data frame in the list
  map(~summarize(., n = n())) %>%
  # bind_rows() flattens this list into a data frame, with a new id column,
  # called `war_mmb`, communicating each unique combination of mmb and war
  bind_rows(.id = "war_mmb")

#' If you do it this way (i.e. by manually getting counts to create your own
#' contingency table), I strongly encourage making a matrix of this information.
#' You just have to be careful how you construct it. Let's first create a vector
#' of the counts, like this:

v <- c(2049, 75, 180, 20)
v

#' Let's wrap it in a `matrix()` function now.

matrix(v)

#' Oops, that's not what we want. We want a 2x2 matrix. For a simple 2x2 matrix,
#' we just need to specify that `nrow = 2` (or `ncol = 2`). Be mindful of the
#' dimensions of your intended matrix.

matrix(v, nrow = 2)

#' Oops, that's also not what we want. By default, `matrix()` fills down the
#' columns. We wanted that 75 to be in the `b` position (top-right), but it went
#' in the `c` position (bottom-left). If this happens to you when you're
#' manually creating your matrix, just toggle `byrow = TRUE`. It's `FALSE` by
#' default.

matrix(v, nrow = 2, byrow = TRUE)

#' Now that we're happy with what we got, we'll assign it to an object to use
#' later.

mat <- matrix(v, nrow = 2, byrow = TRUE)
mat

#' The chi-squared test itself is a really simple test with a very
#' straightforward interpretation. The expected count for each cell comes from
#' multiplying its row total by its column total and dividing by the number of
#' observations. The test statistic sums the squared deviations of the observed
#' counts from those expected counts (scaled by the expected counts) and is
#' referenced against a chi-squared distribution whose degrees of freedom are
#' determined by the number of rows and the number of columns (each subtracted
#' by 1 prior to multiplication). If that statistic is sufficiently large, or
#' outside what's expected from the appropriate chi-squared distribution, it
#' produces a *p*-value that allows you to reject the null hypothesis of random
#' differences between observed values and expected values.
#'
#' That's basically what you see here, however you specify this chi-squared
#' test.

chisq.test(table(Data$war, Data$mmb))
chisq.test(mat)
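#'
#' If you want to see that machinery for yourself, here's a minimal sketch (not
#' in the blog post) that computes the expected counts and the statistic by hand
#' from `mat`. One caveat worth flagging: for 2x2 tables, `chisq.test()` applies
#' Yates' continuity correction by default, so the statistic printed above
#' includes that correction. The by-hand calculation matches `chisq.test()` with
#' `correct = FALSE`.

# Expected counts: (row total * column total)/total observations
E <- outer(rowSums(mat), colSums(mat)) / sum(mat)
E

# Sum the squared deviations of observed from expected, scaled by the expected
# counts.
sum((mat - E)^2 / E)

# Same statistic, without the continuity correction.
chisq.test(mat, correct = FALSE)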
#' Simulation is useful for illustrating what the chi-squared statistic is with
#' respect to its eponymous distribution. We have a distribution with one degree
#' of freedom, representing a singular standard normal variate to square. It's
#' conceivable, however rare, that you could draw a singular 4 from that
#' distribution to square. You'd expect about 95% of the distribution under
#' those circumstances to fall below 1.96^2 or thereabouts (about 3.84).
#' Intuitively, if the differences we observe were random fluctuations, they'd
#' be consistent with a chi-squared distribution with that singular degree of
#' freedom. Anything outside those bounds and we're left with the impression
#' that the differences we observe are not just random squared differences.
#'
#' Again, simulation is nice for this. Let's simulate 100,000 random numbers
#' from a chi-squared distribution with a singular degree of freedom parameter.

set.seed(8675309)
tibble(x = rchisq(100000, 1)) -> chisqsim
chisqsim

#' FYI: this will have a mean that approximates 1 (the degree of freedom
#' parameter).

mean(chisqsim$x)

#' Tell me if you recognize these numbers...

quantile(chisqsim$x, .90)
quantile(chisqsim$x, .95)

#' Let's plot this distribution now, and add a vertical line corresponding
#' with our test statistic.

ggplot(chisqsim, aes(x)) +
  geom_density() +
  theme_minimal() +
  geom_vline(xintercept = 17.895, linetype = 'dashed')

#' Our test statistic is almost an impossibility in the chi-squared
#' distribution.

chisqsim %>%
  filter(x >= 17.895)

#' Indeed, we only observed a value that large three times in 100,000
#' simulations of this distribution.
#'
#' ## Choose Your Own Adventure with the *t*-test
#'
#' I have three articles on the course description for which you need to do an
#' article summary. I also have three data sets in `{stevedata}` that are either
#' the data sets themselves or allow for reasonable approximations of what the
#' authors are doing.

EBJ        # reduced form of Appel and Loyle (2012)
PRDEG      # full replication of Leblang (1996)
states_war # approximation of Valentino et al. (2010)

#' I'm taking requests for something more interactive for illustrating a
#' *t*-test. Whichever data set you want, the syntax is straightforward.
#' Basically: `t.test(y ~ x, data = Data)`. Just be mindful that `x` has to be
#' binary. In the `EBJ` data, we already have that in the `pcj` column. It's the
#' democracy variables we need to dichotomize for the other two data sets. We
#' have democracy variables from the Polity project in both. The convention is
#' to classify a democracy as any instance where that variable is 6 or above.
#' We'll do that here to give us flexibility on how we want to proceed.

PRDEG %>%
  mutate(dem = ifelse(democ >= 6, 1, 0)) -> PRDEG

states_war %>%
  mutate(dem = ifelse(polity2 >= 6, 1, 0),
         lerprop = oppfatalmin/(fatalmin + oppfatalmin)) -> states_war

#' We'll unpack one, contingent on what the class wants this session.

T1 <- t.test(fdi ~ pcj, EBJ)
T2 <- t.test(rgdp ~ dem, PRDEG)
T3 <- t.test(lerprop ~ dem, states_war)
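#'
#' For whichever one we unpack, here's a minimal sketch of how you might poke
#' at the object `t.test()` returns. These component names are standard for
#' objects of class `htest`, so the same extraction works for `T2` and `T3`.

T1           # print the full output
T1$statistic # the t statistic
T1$p.value   # the p-value
T1$conf.int  # the 95% confidence interval for the difference in means
T1$estimate  # the means of the two groups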