--- title: "Missing Data" author: "Lecture 4f" format: revealjs: multiplex: false footer: "[https://jonathantemplin.com/bayesian-psychometric-modeling-fall-2022/](https://jonathantemplin.com/bayesian-psychometric-modeling-fall-2022/)" theme: ["pp.scss"] slide-number: c/t incremental: false editor: source --- ```{r, include=FALSE} load("lecture04f.RData") needed_packages = c("ggplot2", "cmdstanr", "HDInterval", "bayesplot", "loo") for(i in 1:length(needed_packages)){ haspackage = require(needed_packages[i], character.only = TRUE) if(haspackage == FALSE){ install.packages(needed_packages[i]) } library(needed_packages[i], character.only = TRUE) } conspiracyData = read.csv("conspiracies.csv") conspiracyItems = conspiracyData[,1:10] ``` ## Today's Lecture Objectives 1. Show how to estimate models where data are missing ## Example Data: Conspiracy Theories Today's example is from a bootstrap resample of 177 undergraduate students at a large state university in the Midwest. The survey was a measure of 10 questions about their beliefs in various conspiracy theories that were being passed around the internet in the early 2010s. Additionally, gender was included in the survey. All items responses were on a 5- point Likert scale with: 1. Strongly Disagree 2. Disagree 3. Neither Agree or Disagree 4. Agree 5. Strongly Agree #### Please note, the purpose of this survey was to study individual beliefs regarding conspiracies. The questions can provoke some strong emotions given the world we live in currently. All questions were approved by university IRB prior to their use. Our purpose in using this instrument is to provide a context that we all may find relevant as many of these conspiracy theories are still prevalent today. ## Conspiracy Theory Questions 1-5 Questions: 1. The U.S. invasion of Iraq was not part of a campaign to fight terrorism, but was driven by oil companies and Jews in the U.S. and Israel. 2. Certain U.S. government officials planned the attacks of September 11, 2001 because they wanted the United States to go to war in the Middle East. 3. President Barack Obama was not really born in the United States and does not have an authentic Hawaiian birth certificate. 4. The current financial crisis was secretly orchestrated by a small group of Wall Street bankers to extend the power of the Federal Reserve and further their control of the world's economy. 5. Vapor trails left by aircraft are actually chemical agents deliberately sprayed in a clandestine program directed by government officials. ## Conspiracy Theory Questions 6-10 Questions: 6. Billionaire George Soros is behind a hidden plot to destabilize the American government, take control of the media, and put the world under his control. 7. The U.S. government is mandating the switch to compact fluorescent light bulbs because such lights make people more obedient and easier to control. 8. Government officials are covertly Building a 12-lane \"NAFTA superhighway\" that runs from Mexico to Canada through America's heartland. 9. Government officials purposely developed and spread drugs like crack-cocaine and diseases like AIDS in order to destroy the African American community. 10. God sent Hurricane Katrina to punish America for its sins. ## {auto-animate=true, visibility="uncounted"} ::: {style="margin-top: 200px; font-size: 3em; color: red;"} Missing Data in Stan ::: ## Missing Data in Stan If you ever attempted to analyze missing data in Stan, you likely received an error message: ```Error: Variable 'Y' has NA values.``` That is because, as a default, Stan does not model missing data * Instead, we have to get Stan to work with the data we have (the values that are not missing) * That does not mean remove cases where any observed variables are missing ## Example Missing Data To make things a bit easier, I'm only turning one value into missing data (the first person's response to the first item) ```{r, eval=TRUE, echo=TRUE} # Import data =============================================================================== conspiracyData = read.csv("conspiracies.csv") conspiracyItems = conspiracyData[,1:10] # make some cases missing for demonstration: conspiracyItems[1,1] = NA ``` Note: All code will work with as much missing as you have * Observed variables do have to have some values that are not missing (by definition!) ## Stan Syntax: Multidimensional Model We will use the syntax from the [last lecture](https://jonathantemplin.github.io/Bayesian-Psychometric-Modeling-Course-Fall2022/lectures/lecture04e/04e_Modeling_Multidimensional_Latent_Variables#/title-slide)--for multidimensional measurement models with ordinal logit (graded response) items The Q-matrix this time, will be a single column vector (one dimension) ```{r} Qmatrix ``` ## Stan Model Block ```{r, eval=FALSE, echo=TRUE} model { lambda ~ multi_normal(meanLambda, covLambda); thetaCorrL ~ lkj_corr_cholesky(1.0); theta ~ multi_normal_cholesky(meanTheta, thetaCorrL); for (item in 1:nItems){ thr[item] ~ multi_normal(meanThr[item], covThr[item]); Y[item, observed[item, 1:nObserved[item]]] ~ ordered_logistic(thetaMatrix[observed[item, 1:nObserved[item]],]*lambdaMatrix[item,1:nFactors]', thr[item]); } } ``` Notes: * Big change is in ```Y```: * Previous: ``` Y[item]``` * Now: ```Y[item, observed[item, 1:nObserved[item]]]``` * The part after the comma is a list of who provided responses to the item (input in the data block) * Mirroring this is a change to ```thetaMatrix[observed[item, 1:nObserved[item]],]``` * Keeps only the latent variables for the persons who provided responses ## Stan Data Block ```{r, eval=FALSE, echo=TRUE} data { // data specifications ============================================================= int nObs; // number of observations int nItems; // number of items int maxCategory; // number of categories for each item array[nItems] int nObserved; array[nItems, nObs] array[nItems] int observed; // input data ============================================================= array[nItems, nObs] int Y; // item responses in an array // loading specifications ============================================================= int nFactors; // number of loadings in the model array[nItems, nFactors] int Qmatrix; // prior specifications ============================================================= array[nItems] vector[maxCategory-1] meanThr; // prior mean vector for intercept parameters array[nItems] matrix[maxCategory-1, maxCategory-1] covThr; // prior covariance matrix for intercept parameters vector[nItems] meanLambda; // prior mean vector for discrimination parameters matrix[nItems, nItems] covLambda; // prior covariance matrix for discrimination parameters vector[nFactors] meanTheta; } ``` ## Data Block Notes * Two new arrays added: * ```array[nItems] int nObserved;```: The number of observations with non-missing data for each item * ```array[nItems, nObs] array[nItems] int observed;```: A listing of which observations have non-missing data for each item * Here, the size of the array is equal to the size of the data matrix * If there were no missing data at all, the listing of observations with non-missing data would equal this size * Stan uses these arrays to only model data that are not missing * The values of observed serve to select only cases in Y that are not missing ## Building Non-Missing Data Arrays To build these arrays, we can use a loop in R: ```{r, eval=FALSE, echo=TRUE} observed = matrix(data = -1, nrow = nrow(conspiracyItems), ncol = ncol(conspiracyItems)) nObserved = NULL for (variable in 1:ncol(conspiracyItems)){ nObserved = c(nObserved, length(which(!is.na(conspiracyItems[, variable])))) observed[1:nObserved[variable], variable] = which(!is.na(conspiracyItems[, variable])) } ``` For the item that has the first case missing, this gives: ```{r, echo=TRUE} nObserved[1] ``` ```{r,echo=TRUE} observed[, 1] ``` The item has 176 observed responses and one missing * Entries 1 through 176 of ```observed[,1]``` list who has non-missing data * The 177th entry of ```observed[,1]``` is -1 (but won't be used in Stan) ## Array Indexing We can use the values of ```observed[,1]``` to have Stan only select the corresponding data points that are non-missing To demonstrate, in R, here is all of the data for the first item ```{r, echo=TRUE} conspiracyItems[,1] ``` And here, we select the non-missing using the index values in ```observed```: ```{r, echo=TRUE} conspiracyItems[observed[1:nObserved, 1], 1] ``` The values of ```observed[1:nObserved, 1]``` lead to only using the non-missing data ## Change Missing NA to Nonsense Values Finally, we must ensure all data into Stan have no NA values * My recommendation: Change all NA values to something that cannot be modeled * I am picking -1 here: It cannot be used with the ordered_logit() likelihood * This ensures that Stan won't model the data by accident * But, we must remember this if we are using the data in other steps (like PPMC) ```{r, echo=TRUE, eval=FALSE} # Fill in NA values in Y Y = conspiracyItems for (variable in 1:ncol(conspiracyItems)){ Y[which(is.na(Y[,variable])),variable] = -1 } ``` ## Running Stan With our missing values denoted, we then run Stan as we have previously ```{r, echo=TRUE, eval=FALSE} modelOrderedLogit_data = list( nObs = nObs, nItems = nItems, maxCategory = maxCategory, nObserved = nObserved, observed = t(observed), Y = t(Y), nFactors = ncol(Qmatrix), Qmatrix = Qmatrix, meanThr = thrMeanMatrix, covThr = thrCovArray, meanLambda = lambdaMeanVecHP, covLambda = lambdaCovarianceMatrixHP, meanTheta = thetaMean ) modelOrderedLogit_samples = modelOrderedLogit_stan$sample( data = modelOrderedLogit_data, seed = 191120221, chains = 4, parallel_chains = 4, iter_warmup = 2000, iter_sampling = 2000, init = function() list(lambda=rnorm(nItems, mean=5, sd=1)) ) ``` ## Stan Results ```{r, cache=TRUE} # checking convergence max(modelOrderedLogit_samples$summary()$rhat, na.rm = TRUE) # item parameter results print(modelOrderedLogit_samples$summary(variables = c("lambda", "mu")) ,n=Inf) ``` ## {auto-animate=true, visibility="uncounted"} ::: {style="margin-top: 200px; font-size: 3em; color: red;"} Likelihoods with Missing Data ::: ## Likelihoods with Missing Data The way we have coded Stan enables estimation by effectively skipping over cases that were missing * This means our likelihood functions are slightly different For the parameters of an item $i$, the previous model/data likelihood was: $$f \left(\boldsymbol{Y}_{p} \mid \theta_p \right) = \prod_{p=1}^P f \left(Y_{pi} \mid \theta_p \right)$$ Now, we must alter the PDF so that missing data do not contribute: $$f \left(Y_{pi} \mid \theta_p \right) = \left\{ \begin{array}{lr} f \left(Y_{pi} \mid \theta_p \right) & \text{if } Y_{pi} \text{ observed} \\ 1 & \text{if } Y_{pi} \text{ missing} \\ \end{array} \right.$$ This also applies to the likelihood for a person's $\theta$ as any missing items are skipped: $$f \left(\boldsymbol{Y}_{p} \mid \theta_p \right) = \prod_{i=1}^I f \left(Y_{pi} \mid \theta_p \right)$$ ## Ramifications of Skipping Missing Data It may seem somewhat basic to simply skip over missing cases (also called list-wise deletion), but: * Such methods meet the assumptions of missing at random data * Missing data are related to some of the observed data It is a stronger method for analysis than case-wise deletion (removing a person fully) * Case-wise deletion assumes the data are missing completely at random * No relation to any observed data * Less likely to hold in missing data than MAR Moreover, the methods we implemented in Stan are equivalent to those implemented in maximum likelihood algorithms * The likelihood function is the same ## {auto-animate=true, visibility="uncounted"} ::: {style="margin-top: 200px; font-size: 3em; color: red;"} Wrapping Up ::: ## Wrapping Up Today, we showed how to skip over missing data in Stan * Slight modifications needed to syntax * Assumes missing at random Of note, we could (but didn't) also build models for missing data in Stan * Using the transformed parameters block Finally, Stan's missing data methods are quite different from JAGS * JAGS imputes any missing data at each step of a Markov chain using Gibbs sampling