######################## # STAMPS 2018 - corncob Introduction # Bryan D Martin ######################## ######################## # Vignette Information ######################## # We thank Dr. Thea Whitman for kindly providing us with the # example data set we use for this vignette. You can read more # about this data in Whitman, Thea, et al. "Dynamics of microbial # community composition and soil organic carbon mineralization in # soil following addition of pyrogenic and fresh organic matter." # The ISME Journal 10.12 (2016): 2918. ######################## # Introduction ######################## # Effectively modeling microbial relative abundance poses a number # of statistical challenges, including: # * different sequencing depth, # * excessive zeros from unobserved taxa, # * high variability of empirical relative abundances (overdispersion), # * within-taxon correlation, # * hypothesis testing with categorical and continuous covariates. # Here, we introduce corncob, an individual taxon regression model # that uses abundance tables and sample data. corncob is able to model # differential abundance and differential variability, and addresses # each of the challenges presented above. # Note that in order to follow along with this tutorial (but not to use # corncob!) you will need to have phyloseq installed. # Install corncob using: devtools::install_github("bryandmartin/corncob") # To begin, we load our example data set as a phyloseq object. library(corncob) library(phyloseq) library(magrittr) data(soil_phylo) # If you are unfamiliar with phyloseq, we can view a description of the data using: soil_phylo # We now see that we have an OTU abundance table with 7770 OTUs and 119 samples. # We can extract using otu_table(). Let's examine a small subset of our data in more detail. otu_table(soil_phylo)[1:3, 1:3] # We can also see that we have 5 sample variables. # We can extract this using sample_data(). # Let's again examine a small subset in more detail. sample_data(soil_phylo)[1:3, ] # Our covariates are as follows: # * Plants: Indicator of whether plants are in the soil for this sample. # * Amdmt: Categoircal variable representing one of three soil additives; none, biochar, and freshbiomass, respectively. # * ID: Categorical variable representing different plots of soil. # * Day: Categorical variable representing one of three days of measurement; day 1, day 21, and day 81, respectively. # * DayAmdmt: Categorical variable combining the Day and Amdmt variables into a single variable. # Finally, we have a taxonomy table with 7 taxonomic ranks. tax_table(soil_phylo)[1:3, ] ######################## # Fitting a Model ######################## # Now, let's set up our model. # First, let's subset our samples to only include those with the DayAmdmt # covariate equal to 11 or 21 and then collapse the samples to the phylum level. soil <- soil_phylo %>% phyloseq::subset_samples(DayAmdmt %in% c(11,21)) %>% tax_glom("Phylum") # Let's examine the data and the taxonomy table again. soil tax_table(soil)[1:5, ] # Note that collapsing the samples is not necessary, and this model # can work at any taxonomic rank. However, we will later be fitting # a model to every taxa. We can see that by agglomerating taxa to # the phylum level, we have gone from from 7770 to 40 taxa. Thus we # collapse in order to increase the speed for the purposes of this tutorial. # Now we fit our model. We will demonstrate with Proteobacteria, or OTU.1. # For now, we will not include any covariates, so we use ~ 1 # as our model formula responses. corncob <- bbdml(formula = OTU.1 ~ 1, phi.formula = ~ 1, data = soil) ######################## # Interpreting a Model ######################## # First, let's plot the data with our model fit on the relative # abundance scale. To do this, we simply type: plot(corncob) # The points represent the relative abundances. The bars represent # the 95% prediction intervals for the observed relative abundance by sample. # Now let's look at the same plot, but on the absolute abundance scale # with 95% prediction intervals (since absolute abundance is not a parameter). # To do this, we add the option AA = TRUE to our plotting code. plot(corncob, AA = TRUE) # Finally, let's color the plot by the DayAmdmt covariate. To do this, # we add the option color = "DayAmdmt" to our plotting code. plot(corncob, AA = TRUE, color = "DayAmdmt") plot(corncob, color = "DayAmdmt") # Notice that this plot also reorders our samples so that groups # appear together so that they are easier to compare. # We can observe on this plot that it might be of interest to # distinguish between the two groups with covariates. The average # empirical relative abundance for the samples with DayAmdmt = 21 # tends to be lower and less variable than the samples with DayAmdmt = 11. ######################## # Adding covariates ######################## # Let's try modeling the expected relative abundance and the variability # of the absolute abundance with DayAmdmt as a covariate. We do this by # modifying formula and phi.formula as: corncob_da <- bbdml(formula = OTU.1 ~ DayAmdmt, phi.formula = ~ DayAmdmt, data = soil) # Let's also plot this data on both the absolute abudance and relative abundance scale. plot(corncob_da, color = "DayAmdmt", AA = TRUE) plot(corncob_da, color = "DayAmdmt") # Visually, the model with covariates seems to provide a much better fit to the data, # but how can we compare the two models statistically? ######################## # Model Selection ######################## # Let's use a likelihood ratio test to select our final model for this taxon. # We want to test the null hypothesis that the likelihood of the model with # covariates is equal to the likelihood of the model without covariates. # To do this test, we use: lrtest(corncob, corncob_da) # We obtain a p-value much smaller than a cut-off of 0.05. # Therefore we conclude that there is a statistically significant # difference in the likelihood of the two models. # Thus, we probably want to use the model with covariates for this taxon. ######################## # Parameter Interpretation ######################## # Now that we have chosen our model, let's interpret our model output. # To see a summary of the model, type: summary(corncob_da) # This output will look familiar if you have done regression analysis # in R in the past. Covariates associated with the expected relative # abundance are preceded by mu.; covariates associated with the variance # of the absolute abundances are preceded by phi.. # From this model summary, we can see that the mu.DayAmdmt21 coefficient # is negative and statistically significant. This suggests that this taxon # is differentially-abundant across DayAmdmt, and that samples with # DayAmdmt = 21 are expected to have a lower relative abundance. # This matches what we saw from the observed abundances. # We can also see that the phi.DayAmdmt21 coefficient is negative and # statistically significant. This suggests that this taxon is # differentially-variable across DayAmdmt, and that samples with DayAmdmt = 21 # are expected to have a lower variability. # This matches what we saw from the observed abundances. ######################## # Analysis for Multiple Taxa ######################## # What if we want to test all the taxa in our data to see if they are # differentially-abundant or differentially-variable? # We use the differentialTest function. It will perform the above tests on all taxa, # and it will control the false discovery rate to account for multiple comparisons. # Next, we use the differentialTest command. We specify the covariates of # our model using formula and phi.formula as before, except we no longer # include the response term because we are testing multiple taxa. # We also specify which covariates we want to test for by removing # them in the formula_null and phi.formula_null arguments. # The difference between the formulas and the null version of the formulas # will be the variables that are tested. In this case, as when we examined # the single taxon, we will be testing the coefficients of DayAmdmt for # both the expected relative abundance and the overdispersion. # We set fdr_cutoff to be our controlled false discovery rate. # In this example, we also set inits argument. This is not necessary, # but it will speed up our computation because the algorithm does not # need to identify initializations for the parameters for each taxon. set.seed(1) fullAnalysis <- differentialTest(formula = ~ DayAmdmt, phi.formula = ~ DayAmdmt, formula_null = ~ 1, phi.formula_null = ~ 1, data = soil, fdr_cutoff = 0.05, inits = rbind(rep(.01, 4))) # We can see a list of differentially-abundant taxa using: fullAnalysis$DA # In this case, we identified 14 taxa that are differentially-abundant # across DayAmdmt (out of the 40 taxa tested). # We can see a list of differentially-variable taxa using: fullAnalysis$DV # We can switch the OTU labels to taxonomic labels using otu_to_taxonomy(). # We supply our OTU labels as strings for the OTU argument. We supply the # phyloseq object for the data argument. otu_to_taxonomy(OTU = fullAnalysis$DA, data = soil) otu_to_taxonomy(OTU = fullAnalysis$DV, data = soil) # In this case, we identified 4 taxa that are differentially-variable # across DayAmdmt (out of the 40 taxa tested). # We can examine a subset of the p-values of our tests using: fullAnalysis$p[1:5,] # where DA is the p-value associated with the test for differential abundance, # DV is the p-value associated with the test for differential variance, # and warning is a flag to let us know if the test wasn't able to complete # for that OTU so that we may investigate further. # We can examine a subset of the p-values after controlling for # the false discovery rate using: fullAnalysis$p_fdr[1:5,] # where the columns are defined similarly as before, but the values # are now adjusted to control the false discovery rate at 0.05. # Finally, we can see a list of any taxa for which we were not able # to fit a model using: fullAnalysis$warning # In this case, we weren't able to fit OTU.4206 automatically. It's # worthwhile to investigate the OTU individually if this is the case. # First let's check what phylum this represents. otu_to_taxonomy(OTU = "OTU.4206", data = soil) # It may be that the model is overparameterized because there aren't # enough observations, or it may just be that the initializations were # invalid for that taxa and it needs to be re-evaluated with new initializations. # Let's first try examining the data. otu_table(soil)["OTU.4206"] # We see that the observed counts of OTU is zero in all samples except for S102, # where we observed a single count. Let's try fitting the model individually # by letting the model select the initializations automatically. check_GN04 <- bbdml(formula = OTU.4206 ~ DayAmdmt, phi.formula = ~ DayAmdmt, data = soil) # While the model fits, we should be skeptical of **any** statistical # model fit on a single observed count! # corncob is stable, but new. If you notice any issues, please # log them on Github at https://github.com/bryandmartin/CORNCOB/issues # to help us help you!