--- title: "Information Criteria and Model Selection" format: html: code-copy: true code-fold: false highlight-style: zenburn df-print: paged css: ["../assets/sticky-notes.css"] date: "`r format(Sys.Date(), '%B %d %Y')`" bibliography: "../assets/epsy8252.bib" csl: "../assets/apa-single-spaced.csl" --- ```{r} #| echo: false source("../assets/notes-setup.R") ``` # Preparation In this set of notes, you will use information theoretic approaches (e.g., information criteria) to select one (or more) empirically supported model from a set of candidate models. To do so, we will use the *mn-schools.csv* dataset: - [CSV File](https://raw.githubusercontent.com/zief0002/bespectacled-antelope/main/data/mn-schools.csv) - [Data Codebook](http://zief0002.github.io/bespectacled-antelope/codebooks/mn-schools.html) Our goal will be to examine if (and how) academic "quality" of the student-body (measured by SAT score) is related to institutional graduation rate. ```{r} #| label: setup # Load libraries library(AICcmodavg) library(broom) library(gt) library(tidyverse) # Read in data mn = read_csv(file = "https://raw.githubusercontent.com/zief0002/bespectacled-antelope/main/data/mn-schools.csv") ```
# Working Hypotheses and Candidate Models

Over the prior sets of notes, we have considered several models that explain variation in graduation rates.

$$
\begin{split}
\mathbf{Model~1:~} \quad \mathrm{Graduation~Rate}_i &= \beta_0 + \beta_1(\mathrm{SAT}_i) + \epsilon_i \\[1ex]
\mathbf{Model~2:~} \quad \mathrm{Graduation~Rate}_i &= \beta_0 + \beta_1(\mathrm{SAT}_i) + \beta_2(\mathrm{Public}_i) + \epsilon_i \\[1ex]
\mathbf{Model~3:~} \quad \mathrm{Graduation~Rate}_i &= \beta_0 + \beta_1(\mathrm{SAT}_i) + \beta_2(\mathrm{Public}_i) + \beta_3(\mathrm{SAT}_i)(\mathrm{Public}_i) + \epsilon_i
\end{split}
$$

where all three models have $\epsilon_i\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma^2_{\epsilon})$.

```{r}
# Fit candidate models
lm.1 = lm(grad ~ 1 + sat, data = mn)
lm.2 = lm(grad ~ 1 + sat + public, data = mn)
lm.3 = lm(grad ~ 1 + sat + public + sat:public, data = mn)
```

Each of these models corresponds to a different **scientific working hypothesis** about how graduation rates and median SAT scores are related:

- **H1:** The effect of median SAT scores on graduation rates is constant across all levels of SAT, and educational sector does not have an effect on graduation rates.
- **H2:** The effect of median SAT scores on graduation rates is constant across all levels of SAT. And, while there is also an effect of educational sector on graduation rates, the effect of median SAT scores is identical for both public and private schools.
- **H3:** The effect of median SAT scores on graduation rates is constant across all levels of SAT. Moreover, this effect varies by educational sector.

:::fyi
These working hypotheses are typically created from the theory and previous empirical work in a substantive area. They need to be translated into statistical models, which can be quite difficult, especially if there is not a lot of theory to guide this translation.
:::

One method for choosing among a set of candidate models is to pick based on the residuals. In this method, we adopt the model for which the data best meet the assumptions, employing the principle of parsimony---adopting the more parsimonious model when there are multiple models that produce equally "good" residuals. Additionally, we can use the likelihood ratio test (LRT) to help make this decision; it provides additional empirical evidence about which model is supported when the candidate models are nested (see the sketch at the end of this section). Unfortunately, we cannot always use the LRT to select models, since the candidate models are not always all nested. Information criteria give us metrics to compare the empirical support for models whether they are nested or not. In the remainder of these notes, we will examine several of the more popular information criteria metrics used for model selection.
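As a point of reference, here is a minimal sketch of how those nested-model comparisons could be carried out. For `lm()` objects, `anova()` performs the nested-model *F*-tests that play the same role as the LRT for these models; a likelihood ratio test proper could be run with `lrtest()` from the `{lmtest}` package, which is not loaded in these notes.

```{r}
# Nested-model comparisons: Model 1 vs. Model 2, and Model 2 vs. Model 3
anova(lm.1, lm.2, lm.3)

# Likelihood ratio tests, if the {lmtest} package is installed
# lmtest::lrtest(lm.1, lm.2, lm.3)
```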
# Akaike's Information Criterion (AIC)

The AIC metric is calculated by adding a penalty term to the model's deviance ($-2\ln(\mathrm{Likelihood})$).

$$
\begin{split}
AIC &= \mathrm{Deviance} + 2k \\[1ex]
&= -2\ln(\mathcal{L}) + 2k
\end{split}
$$

where *k* is the number of parameters (including the residual standard error) being estimated in the model. (Recall that the value for *k* is given as *df* in the `logLik()` output.)

Remember that deviance is similar to error (it measures model-data misfit), and so models that have lower values of deviance are more empirically supported. The problem with deviance, however, is that more complex models tend to have smaller deviances and are, therefore, more supported than their simpler counterparts. Unfortunately, these complex models tend not to generalize as well as simpler models (this is called *overfitting*). The AIC's penalty term penalizes the deviance more when a more complex model is fitted; it offsets the lower degree of model-data misfit in complex models by increasing the 'misfit' based on the number of parameters. This penalty-adjusted measure of 'misfit' is called Akaike's Information Criterion (AIC).

We compute the AIC for each of the candidate models, and the model with the lowest AIC is selected. This model, we say, is the candidate model with the most empirical support. Below we compute the AIC for the first candidate model.

```{r}
# Compute AIC for Model 1
# logLik(lm.1)
-2*-113.5472 + 2*3
```

We could also use the `AIC()` function to compute the AIC value directly. Here we compute the AIC for each of the candidate models.

```{r}
# Model 1
AIC(lm.1)

# Model 2
AIC(lm.2)

# Model 3
AIC(lm.3)
```

Based on the AIC values, the candidate model with the most empirical evidence is the model with main effects of median SAT score and educational sector (Model 2); it has the lowest AIC. Lastly, we note that the AIC value is produced as a column in the model-level output from the `glance()` function. (Note that the `df` column from `glance()` does NOT give the number of model parameters.)

```{r}
# Model-level output for H2
glance(lm.2)
```
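As a quick check on this arithmetic, the same values can be recovered for all three models directly from `logLik()`, whose `df` attribute holds *k*. This is just an illustrative sketch; in practice `AIC()` does the same thing.

```{r}
# AIC = -2 * log-likelihood + 2k, where k is the df reported by logLik()
sapply(list(lm.1, lm.2, lm.3), function(x) {
  ll = logLik(x)
  -2 * as.numeric(ll) + 2 * attr(ll, "df")
})
```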
## Empirical Support is for the Working Hypotheses

Because the models are proxies for the scientific working hypotheses, the AIC ends up being a measure of empirical support for any particular working hypothesis. Using the AIC, we can rank order the models (and subsequently the working hypotheses) based on their level of empirical support. Ranked in order of empirical support, the three scientific working hypotheses are:

- **H2:** The effect of median SAT scores on graduation rates is constant across all levels of SAT. And, while there is also an effect of educational sector on graduation rates, the effect of median SAT scores is identical for both public and private schools.
- **H3:** The effect of median SAT scores on graduation rates is constant across all levels of SAT. Moreover, this effect varies by educational sector.
- **H1:** The effect of median SAT scores on graduation rates is constant across all levels of SAT, and educational sector does not have an effect on graduation rates.

The phrase **given the data and other candidate models** is critical here. The ranking of models/working hypotheses is a *relative ranking* of the models' level of empirical support, contingent on the candidate models we included in the comparison and the data we used to compute the AIC. As such, this method is not able to rank any hypotheses that you did not consider as part of the candidate set of scientific working hypotheses. Moreover, the AIC is a direct function of the likelihood, which is based on the actual model fitted as a proxy for the scientific working hypothesis. If the predictors used in any of the models had been different, that would lead to different likelihood and AIC values, and potentially a different rank ordering of the hypotheses.

The ranking of models is also based on the data we have. Even though Model 2 may not be what we expect substantively, it is more empirically supported than the interaction model (or the simple model). Why? Well, we only have data in a certain range of median SAT scores for public and private schools. Within this range of scores, Model 2 is more empirically supported, even after accounting for the additional complexity of the interaction model. If we had a broader range of median SAT scores in these two sectors (or had differentiated between private for-profit and private not-for-profit schools), that evidence might support a different model. This is very important. Model 2 is the most empirically supported candidate model GIVEN the three candidate models we compared and the data we used to compute the AIC metric.

:::fyi
Model selection and ranking are contingent on both the set of candidate models you are evaluating and the data you have.
:::

Using the AIC values for the three candidate models, we rank-ordered the hypotheses by their amount of empirical support:

```{r}
#| label: tbl-aics
#| tbl-cap: "Models rank-ordered by the amount of empirical support as measured by the AIC."
#| code-fold: true
cand_models = data.frame(
  Model = c("Model 2", "Model 3", "Model 1"),
  k = c(4, 5, 3),
  AIC = c(AIC(lm.2), AIC(lm.3), AIC(lm.1))
)

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AIC),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AIC = md("*AIC*")
  ) |>
  tab_options(
    table.width = pct(40)
  )
```
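The same rank ordering can also be obtained programmatically by sorting the AIC values from smallest to largest; the names below are just labels for printing.

```{r}
# Rank the candidate models by AIC (lowest = most empirical support)
sort(c(Model_1 = AIC(lm.1), Model_2 = AIC(lm.2), Model_3 = AIC(lm.3)))
```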
# Corrected AIC (AICc): Adjusting for Model Complexity and Sample Size

Although AIC has a penalty correction that should account for model complexity, it turns out that when the number of parameters is large relative to the sample size, AIC is still biased in favor of models that have more parameters. This led @Hurvich:1989 to propose a second-order bias-corrected AIC measure (AICc) computed as:

$$
\mathrm{AIC_c} = \mathrm{Deviance} + 2k\left( \frac{n}{n - k - 1} \right)
$$

where *k* is, again, the number of estimated parameters (including the residual standard error), and *n* is the sample size used to fit the model. Note that when *n* is very large (especially relative to *k*), the ratio $\frac{n}{n - k - 1}$ is essentially 1 and the AICc value basically reduces to the AIC value. When *n* is small relative to *k*, this ratio adds more of a penalty to the deviance.

:::protip
The statistical recommendation is to pretty much always use AICc rather than AIC when selecting models.
:::

Below, we will compute the AICc for the first candidate model. (Note that the same $n=33$ cases are used to fit all of the models in these data.)

```{r}
n = 33
k = 3

# Compute AICc for Model 1
-2 * logLik(lm.1)[[1]] + 2 * k * n / (n - k - 1)
```

In practice, we will use the `AICc()` function from the `{AICcmodavg}` package to compute the AICc value directly. Here we compute the AICc for the three candidate models.

```{r}
# Model 1
AICc(lm.1)

# Model 2
AICc(lm.2)

# Model 3
AICc(lm.3)
```

Based on the AICc values, the model with the most empirical support given the data and the three candidate models is, again, Model 2. Using these results, we can again rank order the models (which correspond to the working hypotheses) based on the empirical support for each.

```{r}
#| label: tbl-aicc
#| tbl-cap: "Models rank-ordered by the amount of empirical support as measured by the AICc."
#| code-fold: true
cand_models = data.frame(
  Model = c("Model 2", "Model 3", "Model 1"),
  k = c(4, 5, 3),
  AICc = c(AICc(lm.2), AICc(lm.3), AICc(lm.1))
)

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*")
  ) |>
  tab_options(
    table.width = pct(40)
  )
```
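The by-hand computation can also be wrapped in a small helper function that pulls *k* and *n* from the `logLik()` output, so the same formula is applied to every model. The name `aicc_manual()` is just illustrative; `AICc()` is what we use in practice.

```{r}
# Reproduce AICc from the deviance, k, and n for each candidate model
aicc_manual = function(model) {
  ll = logLik(model)
  k = attr(ll, "df")      # number of estimated parameters (including RSE)
  n = attr(ll, "nobs")    # sample size used to fit the model
  -2 * as.numeric(ll) + 2 * k * n / (n - k - 1)
}

sapply(list(lm.1, lm.2, lm.3), aicc_manual)
```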
# Quantifying Model-Selection Uncertainty

When we adopt one model over another, we are introducing some degree of uncertainty into the scientific process, in this case model selection uncertainty. It would be nice if we could quantify and report this uncertainty, and this is the real advantage of using information criteria for model selection: they allow us to quantify the uncertainty we have when we select any particular candidate model.

The amount of model selection uncertainty we have depends on the amount of empirical support each of the candidate models has. For example, if one particular candidate model has a lot of empirical support and the rest have very little empirical support, we would have less model selection uncertainty than if all of the candidate models had about the same amount of empirical support.

Since we quantify the empirical support for each model/working hypothesis by computing the AICc, we can also quantify how much more empirical support the most supported hypothesis has relative to each of the other working hypotheses by computing the difference in AICc values between the model with the most empirical support and each of the other candidate models. This measure is referred to as $\Delta$AICc.

In our example, the hypothesis with the most empirical support was Model 2. The $\Delta$AICc values can then be computed for each candidate model by subtracting the AICc for Model 2 from the AICc for that candidate model.

```{r}
# Compute delta AICc value for Model 1
AICc(lm.1) - AICc(lm.2)

# Compute delta AICc value for Model 2
AICc(lm.2) - AICc(lm.2)

# Compute delta AICc value for Model 3
AICc(lm.3) - AICc(lm.2)
```

```{r}
#| label: tbl-delta-aicc
#| tbl-cap: "Models rank-ordered by the amount of empirical support as measured by the AICc. Other evidence includes the ΔAICc for each model."
#| code-fold: true
cand_models = cand_models |>
  mutate(
    delta_a = AICc - AICc(lm.2)
  )

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc, delta_a),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*"),
    delta_a = html("ΔAICc")
  ) |>
  tab_options(
    table.width = pct(50)
  )
```

@Burnham:2011 [p. 25] give rough guidelines for interpreting $\Delta$AICc values. They suggest that hypotheses with $\Delta$AICc values less than 2 are plausible, those in the range of 4--7 have some empirical support, those in the range of 9--11 have relatively little support, and those greater than 13 have essentially no empirical support. Using these criteria:

- Model 2 is plausible.
- Model 3 is also quite plausible.
- Model 1 is a little less plausible, although it has some empirical support.

This implies that we would have a fair amount of uncertainty actually selecting Model 2 over the other two models. The empirical evidence has a fair amount of support for all three models, albeit a little less support for Model 1. This suggests that we adopt more than one model, since our empirical evidence cannot really differentiate which is "better".
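As a compact alternative to the one-at-a-time subtractions above, the AICc values can be collected into a named vector and the $\Delta$AICc values computed in a single step. The object name `aicc_vals` is just illustrative; it is re-used in the sketches in the next two sections.

```{r}
# Collect the AICc values and compute delta-AICc relative to the minimum
aicc_vals = c(Model_1 = AICc(lm.1), Model_2 = AICc(lm.2), Model_3 = AICc(lm.3))
aicc_vals - min(aicc_vals)
```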
# Relative Likelihood and Evidence Ratios

One way we can mathematically formalize the strength of evidence for each model is to compute the relative likelihood. The relative likelihood provides the likelihood of each of the candidate models, given the set of candidate models and the data. To compute the relative likelihood, we use:

$$
\mathrm{Relative~Likelihood} = e ^ {-\frac{1}{2} (\Delta \mathrm{AICc})}
$$

```{r}
# Relative likelihood for Model 1
exp(-1/2 * 6.900556)

# Relative likelihood for Model 2
exp(-1/2 * 0)

# Relative likelihood for Model 3
exp(-1/2 * 2.20514)
```

```{r}
#| label: tbl-rel-lik
#| tbl-cap: "Models rank-ordered by the amount of empirical support as measured by the AICc. Other evidence includes the ΔAICc, and relative likelihood for each model."
#| code-fold: true
cand_models = cand_models |>
  mutate(
    rel_lik = exp(-1/2 * delta_a)
  )

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc, delta_a, rel_lik),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*"),
    delta_a = html("ΔAICc"),
    rel_lik = html("Rel(ℒ)")
  ) |>
  tab_options(
    table.width = pct(50)
  ) |>
  tab_footnote(
    footnote = html("Rel(ℒ) = Relative likelihood"),
    locations = cells_column_labels(columns = rel_lik)
  )
```

The relative likelihood values allow us to compute *evidence ratios*, which are evidentiary statements for comparing any two models or scientific hypotheses. Evidence ratios quantify how much more empirical support one model/hypothesis has versus another. To obtain an evidence ratio, we divide the relative likelihoods of any two models/hypotheses. As an example, the evidence ratio comparing Model 2 to Model 3 is computed as:

```{r}
# ER comparing Model 2 to Model 3
1 / 0.3320167
```

That is,

- The empirical support for Model 2 is 3.01 times that of the empirical support for Model 3.

We can similarly compute the evidence ratio to make an evidentiary statement comparing the empirical support for Model 2 versus Model 1.

- The empirical support for Model 2 is 31.51 times that of the empirical support for Model 1. (To obtain this we computed $1/0.0317368 = 31.51$.)
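Continuing the sketch from the previous section (this assumes the `aicc_vals` vector defined there), the relative likelihoods and both evidence ratios can also be computed programmatically rather than typed by hand:

```{r}
# Relative likelihoods from the delta-AICc values
rel_lik = exp(-1/2 * (aicc_vals - min(aicc_vals)))
rel_lik

# Evidence ratios comparing Model 2 to each of the other candidates
rel_lik["Model_2"] / rel_lik["Model_3"]
rel_lik["Model_2"] / rel_lik["Model_1"]
```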
# Model Probabilities

Also referred to as an Akaike Weight ($w_i$), a model probability provides a numerical measure of the probability of each model given the data and the candidate set of models. For Model $i$ it can be computed as:

$$
w_i = \frac{\mathrm{Relative~Likelihood~for~Model~}i}{\sum_{j=1}^{J} \mathrm{Relative~Likelihood~for~Model~}j}
$$

where the summation in the denominator is across all $J$ candidate models. Here we compute the model probability for all three candidate models.

```{r}
# Compute sum of relative likelihoods
sum_rel = 0.03173681 + 1 + 0.3320167

# Model probability for Model 1
0.03173681 / sum_rel

# Model probability for Model 2
1 / sum_rel

# Model probability for Model 3
0.3320167 / sum_rel
```

We also add these to our table of model evidence.

```{r}
#| label: tbl-model-prob
#| tbl-cap: "Models rank-ordered by the amount of empirical support as measured by the AICc. Other evidence includes the ΔAICc, relative likelihood, and model probability (AICc weight) for each model."
#| code-fold: true
cand_models = cand_models |>
  mutate(
    model_prob = rel_lik / sum(rel_lik)
  )

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc, delta_a, rel_lik, model_prob),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*"),
    delta_a = html("ΔAICc"),
    rel_lik = html("Rel(ℒ)"),
    model_prob = md("*AICc Wt.*")
  ) |>
  tab_options(
    table.width = pct(50)
  ) |>
  tab_footnote(
    footnote = html("Rel(ℒ) = Relative likelihood"),
    locations = cells_column_labels(columns = rel_lik)
  )
```

Since the models are proxies for the working hypotheses, the model probabilities can be used to provide probabilities of each working hypothesis as a function of the empirical support. Given the data and the candidate set of working hypotheses:

- The probability of the second working scientific hypothesis is 0.733.
- The probability of the third working scientific hypothesis is 0.243.
- The probability of the first working scientific hypothesis is 0.023.

This suggests that the second working scientific hypothesis is the most probable. There is also some support for the third working scientific hypothesis; it has some non-negligible probability. The first working scientific hypothesis does not have much support; its probability given the candidate models and data is close to zero.
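Continuing the earlier sketch (this assumes the `rel_lik` vector from the previous section), the Akaike weights are simply the relative likelihoods rescaled to sum to one:

```{r}
# Model probabilities (Akaike weights) for the three candidate models
rel_lik / sum(rel_lik)
```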
# Some Final Thoughts

Based on the model evidence given the data for this candidate set of models:

- The second working scientific hypothesis has the most empirical support.
- The third working scientific hypothesis also has a fair amount of empirical support.
- There is very little empirical support for the first working scientific hypothesis.

Remember, we are computing all of this evidence to select from among the candidate models. Here the empirical evidence supports both the hypothesis associated with Model 2 and the hypothesis associated with Model 3. The data we have do not make it crystal clear which of these hypotheses is really more supported. (This might be because our sample size is not large enough, or because the data are too noisy to make this differentiation.)

Recall that the information criteria are a function of the log-likelihood. Log-likelihood values, and thus information criteria, from different models can be compared, so long as:

- The exact same data are used to fit the models;
- The exact same outcome is used to fit the models; and
- The assumptions underlying the likelihood (independence, distributional assumptions) are met.

In all three models we are using the same data set and outcome. However, the assumptions only seem reasonably tenable for the second and third models (and maybe not even the third)^[Analyses not shown.]. That means that we should not really include Model 1 in our candidate set of models/working hypotheses. Changing the set of candidate models/working hypotheses, we produce the following table of model evidence.

```{r}
#| label: tbl-model-prob-2
#| tbl-cap: "Models rank-ordered by the amount of empirical support as measured by the AICc after removing Model 1 from the candidate set. Other evidence includes the ΔAICc, relative likelihood, and model probability (AICc weight) for each model."
#| code-fold: true
cand_models_2 = data.frame(
  Model = c("Model 2", "Model 3"),
  k = c(4, 5),
  AICc = c(AICc(lm.2), AICc(lm.3))
) |>
  mutate(
    delta_a = AICc - AICc(lm.2),
    rel_lik = exp(-1/2 * delta_a),
    model_prob = rel_lik / sum(rel_lik)
  )

cand_models_2 |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc, delta_a, rel_lik, model_prob),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*"),
    delta_a = html("ΔAICc"),
    rel_lik = html("Rel(ℒ)"),
    model_prob = md("*AICc Wt.*")
  ) |>
  tab_options(
    table.width = pct(50)
  ) |>
  tab_footnote(
    footnote = html("Rel(ℒ) = Relative likelihood"),
    locations = cells_column_labels(columns = rel_lik)
  )
```

Here, since we only removed a model from the candidate set, the only thing that changed is the model probabilities (the Akaike weights). Changing the predictors in the candidate models, or adding other models to the candidate set, can impact the amount of empirical support and the rank-ordering of the hypotheses. If we had a different set of data, we might also have a whole new ranking of models or interpretation of the empirical support. The empirical support is linked to the data; it is very much relative to the candidate set of models and the data we have.

Lastly, it is important to note that although information criteria can tell you about the empirical support among a candidate set of models, they cannot tell you whether a model is actually "good". For that you need to look at the assumptions and other measures (e.g., $R^2$).
You still need to do all of the work associated with model-building (e.g., selecting predictors from the substantive literature, exploring functional forms to meet the assumptions).
# Statistical Inference and Information Criteria

Finally, it is important to mention that, philosophically, information criteria and statistical inference are two very different ways of evaluating statistical evidence. When we use statistical inference for variable selection, the evidence (the *p*-value) is a measure of how rare an observed statistic (e.g., $\hat\beta_k$, $t$-value) is under the null hypothesis. The AIC, on the other hand, is a measure of model-data compatibility that accounts for the complexity of the model.

In general, the use of *p*-values is **not compatible** with the use of information criteria-based model selection methods; see @Anderson:2008 for more detail. Because of this, it is typical to not even report *p*-values when using information criteria for model selection. We should, however, report the standard errors or confidence intervals for the coefficient estimates, especially for any "best" model(s); these convey the statistical uncertainty that arises because of sampling error. (A brief sketch of this is given below.)

It is important that you decide how you will be evaluating evidence and making decisions about variable and model selection *prior to actually examining the data*. Mixing and matching is not cool!
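For example, having adopted Model 2, we might report its coefficient estimates with standard errors and confidence intervals (but no *p*-values). This is a minimal sketch using `tidy()` from `{broom}`, which is already loaded in these notes; base R's `confint()` would work as well.

```{r}
# Coefficient estimates, SEs, and 95% CIs for the adopted model (no p-values)
tidy(lm.2, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, conf.low, conf.high)
```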
# Using R to Create a Table of Model Evidence

We will use the `aictab()` function from the `{AICcmodavg}` package to compute and create a table of model evidence values directly from the `lm()` fitted models. This function takes a list of models in the candidate set (it actually has to be an R list). The optional `modnames=` argument provides model names that will be used in the output.

```{r}
# Create table of model evidence
model_evidence = aictab(
  cand.set = list(lm.1, lm.2, lm.3),
  modnames = c("Model 1", "Model 2", "Model 3")
)

# View output
model_evidence
```

The model evidence provided for each model includes the number of parameters (`K`), AICc value (`AICc`), $\Delta$AICc value (`Delta_AICc`), the Akaike weight/model probability (`AICcWt`), cumulative weight (`Cum.Wt`), and log-likelihood (`LL`). The output also rank orders the models based on the AICc criterion. The models are printed in order from the model with the most empirical evidence (Model 2) to the model with the least amount of empirical evidence (Model 1) based on the AICc.
## Pretty Printing Tables of Model Evidence for Quarto Documents

We can format the output from `aictab()` to be used in the `gt()` function. Because there are multiple classes associated with the output from the `aictab()` function, we first pipe the `model_evidence` object into the `data.frame()` function. Viewing this, we see that the data frame also includes an additional column that gives the relative likelihoods (`ModelLik`).

```{r}
# Create data frame to format into table
tab_01 = model_evidence |>
  data.frame()

# View table
tab_01
```

Then we can use the `select()` function to drop the `LL` and `Cum.Wt` columns from the data frame. The log-likelihood is redundant with the information in the `AICc` column, since AICc is a function of the log-likelihood and the other information in the table. The cumulative weight can also easily be computed from the information in the `AICcWt` column.

```{r}
# Drop columns
tab_01 = tab_01 |>
  select(-LL, -Cum.Wt)

# View table
tab_01
```

We can then pipe the `tab_01` data frame into the `gt()` function to format the table for pretty-printing in Quarto. For some column labels, I use the `html()` function in order to use [HTML symbols](https://www.htmlsymbols.xyz/) to create the Greek letter Delta and the scripted "L".

```{r}
#| label: tbl-pretty-print
#| tbl-cap: "Models rank-ordered by the amount of empirical support as measured by the AICc. Other evidence includes the ΔAICc, relative likelihood, and model probability (AICc weight) for each model."
# Create knitted table
tab_01 |>
  gt() |>
  cols_align(
    columns = c(Modnames),
    align = "left"
  ) |>
  cols_align(
    columns = c(K, AICc, Delta_AICc, ModelLik, AICcWt),
    align = "center"
  ) |>
  cols_label(
    Modnames = md("*Model*"),
    K = md("*k*"),
    AICc = md("*AICc*"),
    Delta_AICc = html("ΔAICc"),
    ModelLik = html("Rel(ℒ)"),
    AICcWt = md("*AICc Wt.*")
  ) |>
  tab_options(
    table.width = pct(50)
  ) |>
  tab_footnote(
    footnote = html("Rel(ℒ) = Relative likelihood"),
    locations = cells_column_labels(columns = ModelLik)
  )
```
# References