--- title: "MEPS Tutorial 5 - Simple Trend Analysis with Linear Models" date: "19 November 2023" output: rmdformats::downcute: self_contained: true default_style: "light" downcute_theme: "default" --- ```{r setup, include=FALSE} ## Global options knitr::opts_chunk$set(cache = TRUE) ``` # Introduction Analyzing trends can be a tricky matter. You have to consider many things such as the autoregressive correlation between values across the time interval or the seasonal effects that occur and are unrelated to the risk factor. All these issues contribute to the difficulty and challenge of trend analysis. However, there are simple ways to perform rudimentary trend analysis with linear models that might be useful for most stakeholders. Linear models are useful because they are easy to interpret. There is no re-transformation needed, and the outputs are interpreted in terms of real units. Non-linear models may require re-transformation or some kind of adjustment to get the $\beta$ coefficients to be interpretable in real units. Although there are a lot of different models that take into consideration the strength of the correlation between values across time (e.g., generalized estimating equation models) or the random slope and intercepts of subjects and groups, we will use the linear effects model to interpret the trends of healthcare expenditure of a representative sample of the US non-institutionalized patients. # Motivating example We will use data from the [Agency for Healthcare Research and Quality (AHRQ) Medical Expenditure Panel Survey (MEPS)](https://meps.ahrq.gov/mepsweb/) data from 2016 to 2021. We will use R to perform the simple trend analysis. ## Loading the libraries There are several libraries that we'll need to install and then load. ```{r, echo = TRUE, warning = FALSE, message = FALSE} ### step 1a: Load the MEPS package library("MEPS") ## You need to load the library every time you restart R ### Step 1b: Load the other libraries library("survey") library("foreign") library("dplyr") library("ggplot2") library("questionr") # remotes::install_github("juba/questionr") library("lspline") # devtools::install_github("mbojan/lspline", build_vignettes=TRUE) library("ggeffects") # remotes::install_github("strengejacke/ggeffects") library("margins") library("gtsummary") # remotes::install_github("ddsjoberg/gtsummary") library("sjPlot") # plot marginal effects ("plot_model" function) ``` ## Loading data into the R environment There are two ways to load the data onto the R environment from AHRQ MEPS. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # There are two ways to load data from AHRQ MEPS website: #### Method 1: Load data from AHRQ MEPS website hc2021 = read_MEPS(file = "h233") hc2020 = read_MEPS(file = "h224") hc2019 = read_MEPS(file = "h216") hc2018 = read_MEPS(file = "h209") hc2017 = read_MEPS(file = "h201") hc2016 = read_MEPS(file = "h192") #### Method 2: Load data from AHRQ MEPS website hc2021 = read_MEPS(year = 2021, type = "FYC") hc2020 = read_MEPS(year = 2020, type = "FYC") hc2019 = read_MEPS(year = 2019, type = "FYC") hc2018 = read_MEPS(year = 2018, type = "FYC") hc2017 = read_MEPS(year = 2017, type = "FYC") hc2016 = read_MEPS(year = 2016, type = "FYC") ``` Once the data have been loaded onto R, we can made some edits. The first edit I make is ensure that all column or variable names are in lower case. ```{r, echo = TRUE, warning = FALSE, message = FALSE} ## Change column names to lowercase names(hc2021) <- tolower(names(hc2021)) names(hc2020) <- tolower(names(hc2020)) names(hc2019) <- tolower(names(hc2019)) names(hc2018) <- tolower(names(hc2018)) names(hc2017) <- tolower(names(hc2017)) names(hc2016) <- tolower(names(hc2016)) ``` ## Download the linkage file Next, we need to download the pooled linkage file. This has the updated primary sampling unit and strata for the individual respondents. This is an important file because when we pool MEPS data from different years, the primary sampling unit and strata will change. We will need to merge this with our pooled data eventually, which will be discussed later in the tutorial. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # We need the linkage file with the appropriate stratum of the primary sampling strata (STRA9621) and primary sampling unit (PSU9621). (Note: Each year, the linkage file sampling unit name changes) linkage = read_MEPS(type = "Pooled linkage") names(linkage) <- tolower(names(linkage)) # change variable name to lower case ``` ## Creating the pooled data file Now that we have the Full-Year Consolidated files from 2016 to 2021 and the pooled linkage files loaded onto the R environment, we can being to clean the data and merge them together. We'll start by keeping only those variables that will be needed for our analysis. These include the following variables: * `dupersid`: The unique identifier of the individual respondent * `panel`: The panel when the individual entered the MEPS round * `varstr`: The individual sampling strata * `varpsu`: The individual primary sampling unit * `sex`: The sex variable (1 = MALE, 2 = FEMALE) * `totexp`: The total healthcare costs per individual per year (Note: This variable name was changed from the year-specific total expenditure (e.g., `totexp17f` to `totexp` because we are pooling from multiple years) * `perwt`: The person weight for the individual in the sample (Note: This variable name was changed from the year-specific person weight (e.g., `perwt7f`) to `perwt` because we are pooling from multiple years) We also created an indicator variable `year` to reflect the year the Full-Year Consolidated file was captured. When we merge all the data, we created a new variable `poolwt` where we divide the `perwt` by the number of years used in the pooled data (e.g., 6). This will generate a new individual person weight with the pooled data. Note: For more instruction and information about pooling multiple years with MEPS data, please see the [1996-2015 Pooled Linkage Variance Estimation File](https://meps.ahrq.gov/data_stats/download_data/pufs/h36/h36u15doc.shtml). This document provides a good summary and documentation on creating a `poolwt` variable and using the pooled strata and primary sampling unit. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Select specific variables ### 2021 hc2021p = hc2021 %>% rename( perwt = perwt21f, totexp = totexp21) %>% select( dupersid, panel, varstr, varpsu, perwt, sex, totexp) hc2021p$year <- 2021 ### 2020 hc2020p = hc2020 %>% rename( perwt = perwt20f, totexp = totexp20) %>% select( dupersid, panel, varstr, varpsu, perwt, sex, totexp) hc2020p$year <- 2020 ### 2019 hc2019p = hc2019 %>% rename( perwt = perwt19f, totexp = totexp19) %>% select( dupersid, panel, varstr, varpsu, perwt, sex, totexp) hc2019p$year <- 2019 ### 2018 hc2018p = hc2018 %>% rename( perwt = perwt18f, totexp = totexp18) %>% select( dupersid, panel, varstr, varpsu, perwt, sex, totexp) hc2018p$year <- 2018 ### 2017 hc2017p = hc2017 %>% rename( perwt = perwt17f, totexp = totexp17) %>% select( dupersid, panel, varstr, varpsu, perwt, sex, totexp) hc2017p$year <- 2017 ### 2016 hc2016p = hc2016 %>% rename( perwt = perwt16f, totexp = totexp16) %>% select( dupersid, panel, varstr, varpsu, perwt, sex, totexp) hc2016p$year <- 2016 # Merge data and adjust the person weight by 6 years pool_data = bind_rows(hc2021p, hc2020p, hc2019p, hc2018p, hc2017p, hc2016p) %>% mutate(poolwt = perwt / 6) ``` ## Merging with pooled survey primary sampling strata and unit Next, we need to merge the pooled years file with the updated primary sampling strata and unit from the linkage file. The linkage file is updated every year. Since the latest Full-Year Consolidated Data file that we are using includes 2021, we are using the 2021 linkage file (`HC-229I`). We first limit the linkage file to only include the following variables: * `dupersid`: The unique identifier for the individual respondent (Note: We will need this to merge the files later) * `panel`: The panel when the individual entered the MEPS round (Note: We will need to use this to merge the files later) * `stra9621`: The pooled sampling strata * `psu9621`: The pooled primary sampling unit ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Reduce the linkage file to only include dupersid, panel, stra9621, psu9621 linkage_file = linkage %>% select(dupersid, panel, stra9621, psu9621) # Merge link file with main data pool_data_linked = left_join(pool_data, linkage_file, by = c("dupersid", "panel")) ``` ## Create factor variables We need to create factor variables so that we can evaluate the marginal effects. Disclaimer: I'm not sure if this is always necessary. However, I run into issues with the `margins` package when I estimated the average marginal effects in R. I get errors about the size of the dimension or vector, which indicates that the dimensions of the dataframe is not correct. I'm not sure why this error occurs, but I seem to have avoided it when I convert my categorical variables into factors. I'm not sure if this is a bug in the `margins` package or something I did with the terms in the regression model, but I'll monitor and update this tutorial if there are any changes or discoveries. For now, we'll convert our categorical variables into factors. Here, I also convert the variable `year` into a factor. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Create factor variables: ### MALE pool_data_linked$male[pool_data_linked$sex == 1] = 1 pool_data_linked$male[pool_data_linked$sex == 2] = 0 table(pool_data_linked$male) pool_data_linked$male <- factor(pool_data_linked$male, levels = c(0, 1)) class(pool_data_linked$male) levels(pool_data_linked$male) ### YEAR pool_data_linked$year <- factor(pool_data_linked$year, levels = c(2016, 2017, 2018, 2019, 2020, 2021)) class(pool_data_linked$year) levels(pool_data_linked$year) ``` ## Set the survey options Next, we set the survey options so that we can estimate the survey-weighted values for our population. ```{r, echo = TRUE, warning = FALSE, message = FALSE} ################################# SURVEY ################################# # Set the survey options options(survey.lonely.psu = "adjust") options(survey.adjust.domain.lonely = TRUE) # Define the survey design survey_design = svydesign( id = ~psu9621, strata = ~stra9621, weights = ~poolwt, data = pool_data_linked, nest = TRUE) ``` ## Trend analysis - Plot the total healthcare costs by sex across time We can look at the total healthcare costs `totexp` by `sex` across time `year` with the `ggsurvey` function, which is part of the `questionr` package. Visualizing the trends, females have a higher average total healthcare costs compared to males, and these are increasing across time. (Note: This trend was generated with the survey weights.) ```{r, echo = TRUE, warning = FALSE, message = FALSE} ################################ TREND ANALYSIS ################################ # Plot total expenditures over time (Load the "questionr" package to use "ggsurvey") ggsurvey(survey_design) + aes(x = year, y = totexp, group = factor(male), color = factor(male)) + stat_summary(fun = "mean", geom = "point", size = 5) + stat_summary(fun = "mean", geom = "line", size = 1) ``` ## Regression model We construct a linear regression model using the `svyglm` function in R. We included the terms `sex` and `year`, and we included an interaction term `sex:year`. This interaction term provides the estimates that describe the differences in the healthcare total costs between males and females when the years changes. This is similar to a difference-in-differences estimation. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Survey-weight GLM model with interaction term survey_model <- svyglm(totexp ~ factor(male) + factor(year) + factor(male):factor(year), survey_design) ``` We can view the estimates with their corresponding 95% confidence intervals (CIs). ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Summary table with 95% CI round( cbind( summary(survey_model, df.resid = degf(survey_model$survey.design))$coef, confint(survey_model, df = degf(survey_model$survey.design)) ), 4) ``` A nicer way of presenting the regression model output is with the `tbl_regression` function from the `gtsummary` package. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Use gtsummary to generate a clean summary table. survey_model %>% tbl_regression(intercept = FALSE, estimate_fun = function(x) style_sigfig(x, digits = 3), pvalue_fun = function(x) style_sigfig(x, digits = 3)) %>% modify_caption("Survey-weighted linear regression") ``` From these findings, we can make several interpretations. The most important estimates are the interaction terms. These provide the difference-in-differences estimation or the change in healthcare total costs between males and females due to an increase in the year. For instance, the interaction term `factor(male)1:factor(year)2017` is interpreted as the average difference in the change in total healthcare costs between males and females when the year increased from 2016 to 2017. This difference-in-differences is \$276.58 (95% CI: -\$215.91, \$769.07), which was not statistically significant. ## Average marginal effects between males and females at each year We can generate the average marginal effects or the differences in total healthcare costs between males and females at each year using the `margins` function. For instance, the difference in total healthcare costs between males and females in 2019 was -\$990; 95% CI: -\$1,396, -\$583, which was statistically significant. This means that males had lower average total healthcare costs compared to females in 2019. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Equivalent to Stata's margin, dydx(year), at(male = 0:1) margins1 <- margins(survey_model, type = "response", design = survey_design, at = list(year = c(2016, 2017, 2018, 2019, 2020, 2021)), variables = "male") summary(margins1) ``` ## Average marginal effects between years for males and females Alternatively, we can estimate the average marginal effects or the changes in total healthcare costs between years for males and females. For instance, in 2021, males had an average total healthcare cost of \$1965; 95% CI: \$1,416, \$2,513, and females had an average total healthcare cost of \$1,899; 95% CI: \$1277, \$2521. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Equivalent to Stata's margin, dydx(year), at(male = 0:1) margins2 <- margins(survey_model, type = "response", design = survey_design, at = list(male = 0:1), variables = "year") summary(margins2) ``` ## Plot the average marginal effects We can plot the average marginal effects between males and females across time `year`. We can visualize the average total healthcare costs for males and females at each year from 2016 to 2021. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Plot the marginal effects plot_model(survey_model, type = "pred", terms = c("year", "male"), dot.size = 4., line.size = 1) ``` We could also plot the average marginal effects using `ggplot2` with the `ggpredict` package. ```{r, echo = TRUE, warning = FALSE, message = FALSE} # Plot the predicted values fit.dataframe <- ggpredict(survey_model, terms = c("year", "male"), design = survey_design) ggplot(fit.dataframe, aes(x, predicted, colour = group)) + geom_line(size = 1.5) + geom_point(size = 5) + labs( x = "Year", y = "Total healthcare costs", colour = get_legend_title(fit.dataframe) ) ``` There are a few important things to plot with these figures. In the first figure below, we see that the slopes or change in average total healthcare costs across the years are displayed for the males and females. The output from the `margins` function provides us with the within group change in total healthcare costs for males and females across time. For instance, the change from 2016 to 2017 for females (which is also called the slope between 2016 and 2017) was \$165. For males, their slope was \$441. ```{r, echo = FALSE, warning = FALSE, message = FALSE, out.width = "100%", fig.cap = "Average marginal effects or the Slopes of males and females."} knitr::include_graphics("Figure 2.png") ``` In the second figure, we can use the output from the `margins1` object to estimate the difference in total healthcare costs at each year between males and females. For instance, we can visualize the difference in total healthcare costs at 2019 between males and females is -\$990. ```{r, echo = FALSE, warning = FALSE, message = FALSE, out.width = "100%", fig.cap = "Differences in total healthcare costs between males and females in 2019."} knitr::include_graphics("Figure 3.png") ``` The last figure is the most important. This visualizes the interaction term results. We are mostly interested in the interaction terms for trend analysis. It provides us with the difference-in-differences estimator. In other words, the interaction terms tell us the differences in the total healthcare costs between males and females as the year changes. For instance, the slope change for females between 2018 and 2016 was approximately \$981. The slope change for males between 2018 and 2016 was \$1136. The difference between these two slope changes for males and females is \$1136 - \$981 = \$155. This is our differences-in-differences estimation, or we can state that this is the differences in total healthcare costs between males and females as the year changed from 2016 to 2018. But, this difference-in-differences was not statistically significant since the 95% CI includes zero (95% CI: \$391, \$701). ```{r, echo = FALSE, warning = FALSE, message = FALSE, out.width = "100%", fig.cap = "Interaction terms - differences-in-differences estimation."} knitr::include_graphics("Figure 4.png") ``` # Conclusions Trend analysis can be performed using linear models. The ease of interpretation and simple execution makes this ideal for most stakeholders. Moreover, the `svyglm` function allows us to incorporate the survey weights from the complex survey design of MEPS to generate standard errors that were representative of the general U.S. population. Although the `margins` package is helpful in estimating the average marginal effects, there are other alternatives such as the [`marginaleffects` package](https://github.com/vincentarelbundock/marginaleffects). Additionally, more complex models exist that account for autoregressive correlations and more precise estimations of the variance. However, they are harder to implement with complex survey designs. Unfortunately, I have not been able to figure out how to perform some of these more complex longitudinal data analysis with alternative models (e.g., linear mixed effects models) that incorporate the survey weights. For these scenarios, I prefer to use `Stata`, which makes it convenient to apply survey weights in more complex models. Perhaps, I'll create some tutorials using `Stata` for these more complex longitudinal models in the future. # Acknowledgements I am grateful to the author of the `margins` package, Thomas J. Leeper. You can find his `margins` package on his [`GitHub site`](https://github.com/leeper/margins). Additionally, the `marginaleffects` package by Vincent Arel-Bundock has helped me to compare some of `R`'s marginal effects commands with `Stata`. You can find his GitHub site [here](https://github.com/vincentarelbundock/marginaleffects). # Work in progress This is a work in progress so expect some updates in the future. # Disclaimers Any errors or mistakes are those of the author. This is only for educational purposes.