# List of packages that we need
packages <- c(
    "ANCOMBC", "ComplexHeatmap", "ggplot2", "knitr", "mia", "miaViz", "dplyr",
    "tidyr", "scater")

# Get packages that are already installed
packages_already_installed <- packages[ packages %in% rownames(installed.packages()) ]

# Get packages that need to be installed
packages_need_to_install <- setdiff( packages, packages_already_installed )

# Load BiocManager into the session. Install it if it is not already installed.
if( !require("BiocManager") ){
    install.packages("BiocManager")
    library("BiocManager")
}

# If there are packages that need to be installed, installs them with BiocManager
# Updates old packages.
if( length(packages_need_to_install) > 0 ) {
   install(packages_need_to_install, ask = FALSE)
}

# Load all packages into session. Stop if there are packages that were not
# successfully loaded
if( any(!sapply(packages, require, character.only = TRUE)) ){
    stop("Error in loading packages into the session.")
}

################################################################################
# Additional setup

# Set chunk options
opts_chunk$set(message = FALSE, warning = FALSE)

# Set black and white theme for figures, and Arial font
theme <- theme_bw() +
    theme(
        text = element_text(family = "Arial"), 
        panel.border = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.line = element_line(colour = "black")
        )
theme_set(theme)

EuroBioC2023

Presenter information

All authors are affiliated with the Turku Data Science Group at the University of Turku, Finland.


Learning goals

  1. Microbiome research studies interactions between microbes (and human, environment…)
  2. Big data requires efficient tools to manipulate the data
  3. miaverse is a (Tree)SummarizedExperiment framework for microbiome analytics
Figure source: Moreno-Indias et al. (2021) Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions. Frontiers in Microbiology 12:11.

Motivation

Microbiome research

  • A microbiome is the community of microbes in a well-defined area (gut, skin, mouth…)
  • Humans and their microbiome interact in both directions, which affects both health and disease.
  • The research is based on sequencing (characterization of genes and species).
  • Nowadays, a multiomics approach is more common (for example, integrating taxonomy information with metabolite data).
  • Computational methods are the new microscope.
  • The field has expanded rapidly in recent years.
# Plot publication graph
path <- "data/PubMed_Timeline_Results_by_Year.csv"
df <- read.csv(path, skip = 1)

x <- "Year"
y <- "Count"

plot <- ggplot(df, aes(x = .data[[x]], y = .data[[y]])) +
    geom_bar(stat="identity")
plot
PubMed publications per year with a search term 'microbiome' (fetched: Sep 5, 2023)

Big data

  • Cohort datasets are large in size
    • Data management, handling and wrangling → data structure
    • Computational power → high-performance computing (HPC) and cloud computing
  • MultiAssayExperiment (MAE) and SummarizedExperiment (SE)
    • Several R package frameworks are increasingly integrating MAE and SE
    • MAE enables linking of multiple experiments
    • SE – and especially TreeSE – is an efficient data container to store data from an experiment

miaverse (MIcrobiome Analysis)

The structure of the TreeSummarizedExperiment (TreeSE) class.

The workflow

The workflow is based on the Orchestrating Microbiome Analysis (OMA) tutorial book, which provides more information.

Importing the dataset

We fetch the data from the MGnify database, EMBL-EBI’s database for metagenomic data. This large microbiome database can be accessed with the MGnifyR package, which now supports TreeSE. The package will be submitted to Bioconductor’s next release.

We chose the dataset of study MGYS00005128, which investigated antibiotic resistance. The data were collected from Cambodia, Kenya and the UK. The dataset contains a total of 1197 samples with taxonomy and gene function prediction data.

As fetching takes some time, the dataset has been fetched in advance and is read from a file.

# library(MGnifyR)
# # Create a client object
# mg <- MgnifyClient(useCache = TRUE, cacheDir = "data/magnifyr_cache")
# # Search analysis IDs based on study ID 
# analyses <- searchAnalysis(mg, "studies", "MGYS00005128")
# # Fetch data
# mae <- getResult(mg, analyses, get.func = "go-slim")
# # Store the data
# saveRDS(mae, "data/mae.Rds")
mae <- readRDS("data/mae.Rds")

The data containers

MAE stores multiple experiments, in this case two (taxonomy and gene function prediction data).

mae
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 2:
##  [1] microbiota: TreeSummarizedExperiment with 2207 rows and 1197 columns
##  [2] go-slim: TreeSummarizedExperiment with 116 rows and 1197 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

General information on the samples of the study is available in the sample metadata (colData) of the MAE.

colData(mae)[1:5, 1:5] %>% kable()
| | analysis_accession | analysis_analysis.status | analysis_experiment.type | analysis_pipeline.version | analysis_is.private |
|---|---|---|---|---|---|
| MGYA00383606 | MGYA00383606 | completed | metagenomic | 4.1 | FALSE |
| MGYA00383607 | MGYA00383607 | completed | metagenomic | 4.1 | FALSE |
| MGYA00383608 | MGYA00383608 | completed | metagenomic | 4.1 | FALSE |
| MGYA00383609 | MGYA00383609 | completed | metagenomic | 4.1 | FALSE |
| MGYA00383610 | MGYA00383610 | completed | metagenomic | 4.1 | FALSE |

MAE and TreeSE objects have rows and columns. This means that we can subset the data in the same way as other objects that have rows and columns (like data.frame). In MAE, experiments and samples are linked together, meaning that we can subset all experiments in one go.
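The idea of linked subsetting can be sketched in base R. This is only an illustration with toy matrices standing in for the experiments; the object names (`taxa`, `go`, `experiments`) are hypothetical and not part of MultiAssayExperiment, which does the bookkeeping through its sample map.

```r
# Toy stand-ins for two experiments that share the same samples (columns).
# All names are hypothetical and only illustrate the idea.
taxa <- matrix(1:12, nrow = 3,
               dimnames = list(paste0("taxon", 1:3), paste0("sample", 1:4)))
go <- matrix(101:108, nrow = 2,
             dimnames = list(paste0("GO", 1:2), paste0("sample", 1:4)))
experiments <- list(microbiota = taxa, go = go)

# Subset every experiment with the same sample names in one step
keep <- c("sample2", "sample4")
subsetted <- lapply(experiments, function(m) m[, keep, drop = FALSE])

# Rows differ between experiments, columns (samples) stay in sync
sapply(subsetted, dim)
```

Because both matrices are indexed by sample name rather than by position, the samples stay aligned even if the experiments store them in a different order.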

For demonstration purposes and to save resources, let’s subset the data by taking 100 random samples.

set.seed(49585)
random_samples <- sample(colnames(mae[[1]]), 100)
mae <- mae[, random_samples]
mae
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 2:
##  [1] microbiota: TreeSummarizedExperiment with 2207 rows and 100 columns
##  [2] go-slim: TreeSummarizedExperiment with 116 rows and 100 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

The first experiment / TreeSE includes taxonomy information.

mae[[1]]
## class: TreeSummarizedExperiment 
## dim: 2207 100 
## metadata(0):
## assays(1): counts
## rownames(2207): 148939 125998 ... 233398 233398.1
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
##   sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

The second one includes gene function prediction data. As you can see, we can fetch an experiment by specifying either its index or its name.

mae[["go-slim"]]
## class: TreeSummarizedExperiment 
## dim: 116 100 
## metadata(0):
## assays(1): counts
## rownames(116): GO:0009317 GO:0016597 ... GO:0019012 GO:0019842
## rowData names(3): description category index_id
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
##   sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

The taxonomy experiment stores the taxonomy table in its feature metadata (rowData).

rowData(mae[[1]]) %>% head() %>% kable()
| | Kingdom | Phylum | Class | Order | Family | Genus | Species |
|---|---|---|---|---|---|---|---|
| 148939 | Eukaryota | Ascomycota | Dothideomycetes | NA | NA | NA | NA |
| 125998 | Eukaryota | Ascomycota | Eurotiomycetes | Eurotiales | NA | NA | NA |
| 114164 | Eukaryota | Ascomycota | Eurotiomycetes | Onygenales | NA | NA | NA |
| 76021 | Eukaryota | Ascomycota | Eurotiomycetes | NA | NA | NA | NA |
| 73314 | Eukaryota | Ascomycota | Saccharomycetes | Saccharomycetales | Debaryomycetaceae | Candida | NA |
| 73314.1 | Eukaryota | Ascomycota | Saccharomycetes | Saccharomycetales | Debaryomycetaceae | Candida | NA |

Compared to a phyloseq object, TreeSE can hold more data, for example, multiple assays. Let’s transform the counts to relative abundances. The transformed table is stored in the assays slot next to the original counts.

mae[[1]] <- transformAssay(mae[[1]], method = "relabundance")
mae[[1]]
## class: TreeSummarizedExperiment 
## dim: 2207 100 
## metadata(0):
## assays(2): counts relabundance
## rownames(2207): 148939 125998 ... 233398 233398.1
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
##   sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
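
What the relative abundance transformation computes can be sketched in base R. This is a minimal illustration on a made-up count matrix (features in rows, samples in columns), not the implementation used by mia.

```r
# Hypothetical toy counts: 3 taxa (rows) x 2 samples (columns)
counts <- matrix(c(10, 30, 60,
                    0,  5, 15),
                 nrow = 3,
                 dimnames = list(c("taxonA", "taxonB", "taxonC"), c("s1", "s2")))

# Divide each column (sample) by its total count
relabundance <- sweep(counts, 2, colSums(counts), "/")

relabundance
# Every sample now sums to 1
colSums(relabundance)
```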

Summarize data

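We can summarize how many unique taxa there are at each taxonomy level. For instance, there are 53 unique phyla across all kingdoms. Note that n_distinct counts NA as a value by default, so a level that contains unassigned features has one extra category.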

rowData(mae[[1]]) %>% as_tibble() %>% summarise_all(n_distinct) %>% kable()
| Kingdom | Phylum | Class | Order | Family | Genus | Species |
|---|---|---|---|---|---|---|
| 5 | 53 | 108 | 175 | 273 | 685 | 882 |
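
The effect of missing values on such counts can be checked with a small base R sketch. The taxonomy table below is made up; the first count mirrors the default behaviour of `n_distinct`, which treats NA as a distinct value unless `na.rm = TRUE` is set.

```r
# Hypothetical taxonomy table with one unassigned phylum
taxonomy <- data.frame(
    Kingdom = c("Bacteria", "Bacteria", "Eukaryota", "Eukaryota"),
    Phylum  = c("Firmicutes", "Proteobacteria", "Ascomycota", NA)
)

# Distinct values per column; NA counts as its own value here
with_na <- sapply(taxonomy, function(x) length(unique(x)))
# Distinct values per column after dropping NA
without_na <- sapply(taxonomy, function(x) length(unique(x[!is.na(x)])))

rbind(with_na, without_na)
```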

A common operation in microbiome data analysis is agglomeration. This means that we sum up the data to a certain taxonomy level. We can use the mia::mergeFeaturesByRank function to agglomerate the data to a single taxonomy level. If we want to agglomerate the data to all available taxonomy levels with one command, we can use mia::splitByRanks.
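
As a minimal sketch of what agglomeration does to the abundance table, the base R function `rowsum` sums the rows that share a group label. The counts and phylum labels below are invented for illustration; the mia functions additionally take care of the feature metadata.

```r
# Hypothetical toy counts: 4 features (rows) x 2 samples (columns)
counts <- matrix(c(5, 1, 4, 2,
                   0, 3, 7, 8),
                 nrow = 4,
                 dimnames = list(paste0("feature", 1:4), c("s1", "s2")))
# Phylum of each feature, in row order
phylum <- c("Firmicutes", "Firmicutes", "Proteobacteria", "Proteobacteria")

# Sum the rows that belong to the same phylum
agglomerated <- rowsum(counts, group = phylum)
agglomerated
```

The per-sample totals are unchanged by agglomeration; only the number of rows shrinks.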

The altExp slot is the right place to store experiments with modified features (such as agglomerated or subsetted data).

altExps(mae[[1]]) <- splitByRanks(mae[[1]])
mae[[1]]
## class: TreeSummarizedExperiment 
## dim: 2207 100 
## metadata(0):
## assays(2): counts relabundance
## rownames(2207): 148939 125998 ... 233398 233398.1
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
##   sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(7): Kingdom Phylum ... Genus Species
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

We can fetch the agglomerated data from the slot. Instead of 2207 features, there are only 5 features in the data that is summed up to the kingdom level.

altExp(mae[[1]], "Kingdom")
## class: TreeSummarizedExperiment 
## dim: 5 100 
## metadata(1): agglomerated_by_rank
## assays(2): counts relabundance
## rownames(5): Eukaryota Bacteria Archaea Chloroplast Mitochondria
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
##   sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

The assays contain the agglomerated abundance tables.

assay(altExp(mae[[1]], "Kingdom"), "counts") %>% head() %>% kable()
MGYA00393990 MGYA00384766 MGYA00384868 MGYA00393846 MGYA00393894 MGYA00393985 MGYA00384631 MGYA00394152 MGYA00383695 MGYA00384928 MGYA00383726 MGYA00393933 MGYA00385075 MGYA00384842 MGYA00384761 MGYA00384939 MGYA00393750 MGYA00384640 MGYA00384849 MGYA00385081 MGYA00394120 MGYA00384927 MGYA00384959 MGYA00384916 MGYA00384698 MGYA00384790 MGYA00393938 MGYA00385072 MGYA00394132 MGYA00394013 MGYA00384733 MGYA00393782 MGYA00383709 MGYA00385144 MGYA00393804 MGYA00383747 MGYA00385128 MGYA00394130 MGYA00393978 MGYA00385042 MGYA00384881 MGYA00393796 MGYA00384645 MGYA00384786 MGYA00384812 MGYA00384996 MGYA00383614 MGYA00393840 MGYA00383719 MGYA00393998 MGYA00393725 MGYA00394160 MGYA00384995 MGYA00383665 MGYA00383617 MGYA00384754 MGYA00394066 MGYA00393893 MGYA00393808 MGYA00394134 MGYA00385032 MGYA00386685 MGYA00393923 MGYA00384685 MGYA00384848 MGYA00385096 MGYA00393743 MGYA00393916 MGYA00384817 MGYA00393793 MGYA00394171 MGYA00384965 MGYA00384705 MGYA00393732 MGYA00384735 MGYA00384712 MGYA00384869 MGYA00383622 MGYA00384669 MGYA00384852 MGYA00384966 MGYA00394027 MGYA00384709 MGYA00393848 MGYA00385039 MGYA00385048 MGYA00383727 MGYA00394061 MGYA00393948 MGYA00384784 MGYA00393958 MGYA00383773 MGYA00384974 MGYA00393889 MGYA00384771 MGYA00383733 MGYA00384764 MGYA00394040 MGYA00384647 MGYA00394128
Eukaryota 0 11 78 17 1097 0 3 0 0 103 197 20 1515 38 5 2 18 71 9 1540 14 52 2 10062 37 9 28 1926 105 1316 3 912 2 925 1025 0 1098 252 151 1865 0 162 1874 3269 0 1812 4 14 2554 14 1164 23 1768 1776 2 0 1422 1046 1108 7 2841 157 36 1086 1 9 603 32 11 2131 2281 38 1 10 776 7 49 0 1676 5 3 0 0 3 17 0 178 33 53 3173 20 75 35 1057 0 83 0 249 2647 12
Bacteria 12972 24905 17949 22423 2542 13278 15987 11276 11418 15904 15870 13654 2925 31623 27460 26498 12378 16092 20844 2838 20757 9656 14791 95482 47183 25661 1011 4836 24770 15616 69616 21336 13801 13600 229 20324 9394 15200 16097 470 27951 12324 444 171780 24618 12197 27097 22691 1601 11288 3088 17242 12101 8357 27102 27228 457 2622 736 47864 2633 28480 29674 782 21101 16914 2601 50402 20847 6258 1805 11263 29342 17726 36623 43031 21714 28696 399 20574 14501 16719 28936 19623 32607 26047 16203 26306 58927 174135 38109 56421 10730 2759 22200 17384 25083 15233 2140 14658
Archaea 0 0 7 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 10 0 4 0 0 0 0 0 0 0 0 19 0 0 26 0 0 0 0 0 0 0 27 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 0 0 8 0 0 0 0 0 0 0 61 0 0 10 0 51 0 11 0 0 0 0 0 0 0 0
Chloroplast 76 264 102 289 3 95 0 95 37 4 105 236 7 0 14 268 182 3 71 9 57 79 165 189 34 318 14 2 101 143 490 66 63 4 0 208 25 128 118 1 151 8 1 1239 269 17 11 272 0 156 16 0 27 22 8 320 2 1 9 20 3 0 99 8 171 113 8 419 18 4 1 165 238 116 30 0 237 327 1 141 173 47 324 307 209 0 98 90 457 1119 13 177 134 3 197 215 32 124 6 13
Mitochondria 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 6 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 3 0 2 0 2 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 3 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0

To visualize relations between taxa, we can first create a tree from the taxonomy table in rowData, and then plot it. The tree is created at the family level.

altExp(mae[[1]], "Family") <- addTaxonomyTree(altExp(mae[[1]], "Family"))
plotRowTree(altExp(mae[[1]], "Family"), edge_colour_by = "Kingdom")

miaverse includes several visualization methods. For example, we can visualize the relative abundances of the 10 most abundant phyla.

plotAbundanceDensity(
    altExp(mae[[1]], "Phylum"), assay.type = "relabundance", n = 10, layout="density") 

Alpha diversity

Alpha diversity measures how diverse the microbial composition is. For example, if there are lots of different bacterial species, the alpha diversity is higher. Again, miaverse includes convenient tools to calculate alpha diversities of samples and to visualize them.
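
To make the notion concrete, here is a minimal base R sketch of the Shannon index, assuming the natural logarithm. The function `shannon` and the two count vectors are hypothetical and only illustrate that an even community scores higher than one dominated by a single taxon.

```r
# Shannon index: -sum(p * log(p)) over taxa with non-zero abundance
shannon <- function(counts) {
    p <- counts[counts > 0] / sum(counts)
    -sum(p * log(p))
}

even   <- c(25, 25, 25, 25)  # four equally abundant taxa
uneven <- c(97, 1, 1, 1)     # one dominant taxon

shannon(even)    # equals log(4) for four equally abundant taxa
shannon(uneven)  # lower, because one taxon dominates
```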

Here we analyze if alpha diversities differ between locations.

# Calculate
mae[[1]] <- estimateDiversity(mae[[1]], index = "shannon")
# Plot
plotColData(mae[[1]], y = "shannon", x = "location", colour_by = "location") +
    # These are normal ggplot objects
    theme(legend.position = "none")

Beta diversity

Beta diversity measures differences between the microbial profiles of samples. There are several ordination techniques for this, Principal Component Analysis (PCA) being the most well-known. Distance-based redundancy analysis (dbRDA) is a supervised ordination method that takes sample metadata into account: it finds the axes that maximize the variance explained by the covariates.
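
The analysis below uses the Bray-Curtis dissimilarity (`distance = "bray"`). As a minimal sketch of what that dissimilarity measures between two samples, assuming non-negative abundance vectors over the same taxa (the helper `bray_curtis` and the sample vectors are hypothetical):

```r
# Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(x, y) {
    sum(abs(x - y)) / sum(x + y)
}

s1 <- c(10, 0, 5)
s2 <- c(10, 0, 5)  # identical to s1
s3 <- c(0, 8, 0)   # shares no taxa with s1
s4 <- c(5, 0, 5)   # partly overlaps with s1

bray_curtis(s1, s2)  # 0: identical profiles
bray_curtis(s1, s3)  # 1: no shared taxa
bray_curtis(s1, s4)  # 0.2: partial overlap
```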

Here we analyze whether location can explain the differences in microbial profiles. The result is stored in the reducedDim slot.

mae[[1]] <- runRDA(
    mae[[1]],
    assay.type = "relabundance",
    formula = data ~ location,
    distance = "bray",
    name = "dbRDA"
)

reducedDim(mae[[1]], "dbRDA")
##                     dbRDA1       dbRDA2
## MGYA00393990  0.1347102942 -0.100832593
## MGYA00384766  0.1440370073 -0.133240761
## MGYA00384868  0.0752795935 -0.183802481
## MGYA00393846  0.0610872412  0.489951835
## MGYA00393894 -0.2329823176  0.069469681
## MGYA00393985  0.1341586408 -0.102957754
## MGYA00384631 -0.0316655436  0.277311863
## MGYA00394152  0.0618287158  0.493236009
## MGYA00383695 -0.0084027652 -0.155762139
## MGYA00384928 -0.1166574929 -0.068720426
## MGYA00383726  0.0736662587 -0.174236886
## MGYA00393933  0.0595845903  0.484813425
## MGYA00385075 -0.2476202356  0.089064859
## MGYA00384842  0.0560987204  0.084441075
## MGYA00384761  0.1335693958 -0.111399651
## MGYA00384939  0.1233061411 -0.107712350
## MGYA00393750  0.0613077131  0.491512394
## MGYA00384640 -0.1152849683 -0.055104759
## MGYA00384849  0.0691988641 -0.118067879
## MGYA00385081 -0.2435778106  0.081108922
## MGYA00394120  0.0727140079 -0.115199175
## MGYA00384927  0.0744038815 -0.184540075
## MGYA00384959  0.1227612375 -0.105682000
## MGYA00384916 -0.1450605401 -0.082901182
## MGYA00384698  0.0621041269 -0.046376995
## MGYA00384790  0.1440894238 -0.132580603
## MGYA00393938  0.0773613625 -0.117812352
## MGYA00385072 -0.2463605951  0.075185591
## MGYA00394132  0.0125192818 -0.092162835
## MGYA00394013  0.0949048574 -0.121352348
## MGYA00384733  0.1299577328 -0.107163273
## MGYA00393782 -0.0485745141 -0.084911171
## MGYA00383709  0.0344791254 -0.176607956
## MGYA00385144 -0.1743700351  0.025745252
## MGYA00393804 -0.0331379977  0.245074548
## MGYA00383747  0.1202622083 -0.160957854
## MGYA00385128 -0.1157261802  0.095426271
## MGYA00394130  0.1100743500 -0.151624819
## MGYA00393978  0.0749827716 -0.169128101
## MGYA00385042 -0.2386491739  0.176320933
## MGYA00384881  0.0703782568 -0.165098363
## MGYA00393796 -0.0931293914 -0.092467612
## MGYA00384645 -0.2301181982  0.164162146
## MGYA00384786  0.0502844922 -0.178037602
## MGYA00384812  0.1399074536 -0.126607636
## MGYA00384996 -0.1351452341 -0.109831900
## MGYA00383614 -0.0345784398  0.244892145
## MGYA00393840  0.0613128092  0.491673873
## MGYA00383719 -0.2079946631  0.087238702
## MGYA00393998  0.1397657271 -0.138923923
## MGYA00393725 -0.0696509924 -0.075937955
## MGYA00394160  0.0549978102  0.086559521
## MGYA00384995 -0.1318146008 -0.110735038
## MGYA00383665 -0.1308114996 -0.066265707
## MGYA00383617 -0.0342930884  0.243270312
## MGYA00384754  0.1444104462 -0.125846612
## MGYA00394066 -0.1934873125  0.101619718
## MGYA00393893 -0.2295337408  0.068759432
## MGYA00393808 -0.1755577610  0.005031351
## MGYA00394134  0.1397203992 -0.099679138
## MGYA00385032 -0.2486498398  0.124041724
## MGYA00386685 -0.1269555631  0.021395117
## MGYA00393923  0.1109984687 -0.138860763
## MGYA00384685 -0.1657696868 -0.005621145
## MGYA00384848  0.1051087071 -0.180194961
## MGYA00385096  0.0766841236 -0.038968087
## MGYA00393743 -0.1505195024 -0.077945676
## MGYA00393916  0.1487237066 -0.107600058
## MGYA00384817  0.0534617372 -0.053099266
## MGYA00393793 -0.2454584161  0.014680854
## MGYA00394171 -0.2101295794  0.092617510
## MGYA00384965  0.1375433669 -0.142346565
## MGYA00384705  0.1408101483 -0.097320538
## MGYA00393732  0.0184656582 -0.121993265
## MGYA00384735  0.0596693137  0.487015833
## MGYA00384712  0.0117835216  0.054146791
## MGYA00384869  0.1280221883 -0.153727071
## MGYA00383622  0.1403129292 -0.128121361
## MGYA00384669 -0.2029823451  0.082198544
## MGYA00384852  0.1046645459 -0.179625391
## MGYA00384966  0.1238179369 -0.108690300
## MGYA00394027  0.1098098043 -0.067429451
## MGYA00384709  0.1403300420 -0.127865221
## MGYA00393848  0.0617271155  0.491612219
## MGYA00385039  0.0195541731 -0.123350724
## MGYA00385048 -0.0313709212  0.271745946
## MGYA00383727  0.0722204721 -0.173850173
## MGYA00394061  0.0431684905 -0.156035621
## MGYA00393948  0.1136389418 -0.053411435
## MGYA00384784  0.0502472886 -0.180846880
## MGYA00393958 -0.0257144179  0.088235614
## MGYA00383773 -0.0563405106 -0.121503598
## MGYA00384974  0.1354616969 -0.143851679
## MGYA00393889 -0.2240017574  0.064431657
## MGYA00384771  0.0618245875  0.493131802
## MGYA00383733  0.1343592838 -0.124630454
## MGYA00384764  0.0606454204  0.487786298
## MGYA00394040  0.1113961726 -0.153128594
## MGYA00384647 -0.2406707164  0.050088550
## MGYA00394128 -0.0009264312 -0.084710131
## attr(,"rotation")
##      [,1] [,2]
## attr(,"eigen")
##   dbRDA1   dbRDA2 
## 4.116854 1.767239 
## attr(,"rda")
## Call: vegan::dbrda(formula = data ~ location, data = variables,
## distance = "bray")
## 
##               Inertia Proportion Rank RealDims
## Total         28.2023     1.0000              
## Constrained    5.8841     0.2086    2        2
## Unconstrained 22.3182     0.7914   97       61
## Inertia is squared Bray distance 
## Species scores projected from 'species_scores' 
## 
## Eigenvalues for constrained axes:
## dbRDA1 dbRDA2 
##  4.117  1.767 
## 
## Eigenvalues for unconstrained axes:
##  MDS1  MDS2  MDS3  MDS4  MDS5  MDS6  MDS7  MDS8 
## 6.778 3.098 2.828 1.530 1.322 0.991 0.787 0.694 
## (Showing 8 of 97 unconstrained eigenvalues)
## 
## attr(,"significance")
## attr(,"significance")$permanova
##          Df  SumOfSqs       F Pr(>F) Total variance Explained variance
## Model     2  5.884093 12.7868  0.001       28.20231          0.2086387
## location  2  5.884093 12.7868  0.001       28.20231          0.2086387
## Residual 97 22.318218      NA     NA       28.20231          0.7913613
## 
## attr(,"significance")$homogeneity
##          Df    Sum Sq   Mean Sq        F N.Perm Pr(>F) Total variance
## location  2 0.7177062 0.3588531 10.34989    999  0.001       4.080905
##          Explained variance
## location          0.1758694

As we can see, location has a significant effect on the microbial profile (PERMANOVA, p = 0.001) and explains about 21% of the total variance. However, the homogeneity test above is also significant, meaning that the groups do not have similar dispersion, which is an assumption of PERMANOVA. This has to be taken into account when drawing conclusions.
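
The explained variance and the F statistic in the PERMANOVA table follow directly from the sums of squares and degrees of freedom printed above. The arithmetic can be rechecked with a few lines of base R; the variable names are only illustrative.

```r
# Values copied from the PERMANOVA table above
ss_model <- 5.884093   # SumOfSqs for location
ss_resid <- 22.318218  # SumOfSqs for Residual
df_model <- 2
df_resid <- 97

# Total variance is the sum of the model and residual sums of squares
ss_total <- ss_model + ss_resid
# Share of the total variance explained by location
explained <- ss_model / ss_total
# Pseudo-F: ratio of the mean squares
pseudo_f <- (ss_model / df_model) / (ss_resid / df_resid)

round(c(total = ss_total, explained = explained, F = pseudo_f), 4)
```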

plotRDA(mae[[1]], "dbRDA", colour_by = "location")

Differential abundance analysis (DAA)

The idea of DAA is to analyze whether there are taxa whose abundance differs between groups. There are multiple methods to test this (such as the basic Wilcoxon test). ANCOM-BC is a method that takes into account the special characteristics of microbiome data (such as compositionality).
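
For comparison, the basic Wilcoxon approach mentioned above can be sketched in base R: one test per taxon, followed by FDR correction. This is not ANCOM-BC and ignores compositionality; the data are simulated and every name is hypothetical.

```r
set.seed(1)
# Simulated abundances: 3 taxa x 20 samples, two groups of 10
group <- rep(c("A", "B"), each = 10)
abund <- rbind(
    taxon1 = rlnorm(20, meanlog = 0),
    # taxon2 is simulated to be much more abundant in group B
    taxon2 = rlnorm(20, meanlog = ifelse(group == "B", 5, 0)),
    taxon3 = rlnorm(20, meanlog = 0)
)

# One Wilcoxon rank-sum test per taxon (row)
p_val <- apply(abund, 1, function(x) wilcox.test(x ~ group)$p.value)
# Adjust for multiple testing with the false discovery rate
q_val <- p.adjust(p_val, method = "fdr")

data.frame(p_val, q_val, diff_abn = q_val < 0.05)
```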

Here we want to test whether there are phyla whose abundance differs between locations.

# Analyze
res <- ancombc2(
    data =  altExp(mae[[1]], "Phylum"),
    fix_formula = "location",
    p_adj_method = "fdr",
    group = "location",
    global = TRUE
)
# Store results to data container
metadata( altExp(mae[[1]], "Phylum") )[["ancombc2"]] <- res
# Print
temp <- res$res_global
temp %>% kable()
| taxon | W | p_val | q_val | diff_abn | passed_ss |
|---|---|---|---|---|---|
| Ascomycota | 3.0152035 | 0.1303474 | 0.2929859 | FALSE | TRUE |
| Basidiomycota | 2.7706441 | 0.1637526 | 0.2932449 | FALSE | FALSE |
| Arthropoda | 1.7377258 | 0.3792163 | 0.4634866 | FALSE | FALSE |
| Chordata | 29.2649292 | 0.0000000 | 0.0000000 | TRUE | TRUE |
| Nematoda | 2.4336285 | 0.2096921 | 0.3075484 | FALSE | FALSE |
| Acidobacteria | 38.9192976 | 0.0000382 | 0.0002805 | TRUE | TRUE |
| Actinobacteria | 0.7400422 | 0.9602141 | 0.9602141 | FALSE | FALSE |
| Bacillariophyta | 1.4432923 | 0.5973256 | 0.6570582 | FALSE | FALSE |
| Bacteroidetes | 14.0562019 | 0.0000090 | 0.0000988 | TRUE | TRUE |
| Chloroflexi | 2.1208994 | 0.3413546 | 0.4417530 | FALSE | FALSE |
| Cyanobacteria | 19.4555055 | 0.0016916 | 0.0062024 | TRUE | FALSE |
| Deinococcus-Thermus | 2.4445394 | 0.1866104 | 0.2932449 | FALSE | FALSE |
| Euryarchaeota | 4.5780354 | 0.0625114 | 0.1719062 | FALSE | FALSE |
| Fibrobacteres | 1.6473489 | 0.4169444 | 0.4827777 | FALSE | FALSE |
| Firmicutes | 2.7863339 | 0.1331754 | 0.2929859 | FALSE | FALSE |
| Fusobacteria | 2.0794228 | 0.2643005 | 0.3634131 | FALSE | TRUE |
| Gemmatimonadetes | 6.3421884 | 0.0100882 | 0.0317058 | TRUE | FALSE |
| Lentisphaerae | 0.8346123 | 0.9195272 | 0.9602141 | FALSE | FALSE |
| Proteobacteria | 10.8483507 | 0.0001120 | 0.0006160 | TRUE | TRUE |
| Spirochaetes | 3.1806770 | 0.1803801 | 0.2932449 | FALSE | TRUE |
| Tenericutes | 2.8137936 | 0.1488332 | 0.2932449 | FALSE | FALSE |
| Verrucomicrobia | 9.6943325 | 0.0007929 | 0.0034888 | TRUE | TRUE |
# Add results to feature metadata
# Ensure that results go to right feature
rownames(temp) <- temp$taxon
temp <- temp[rownames(altExp(mae[[1]], "Phylum")), ]
rownames(temp) <- rownames( altExp(mae[[1]], "Phylum") )
# Add to rowData
rowData(altExp(mae[[1]], "Phylum")) <- cbind(rowData(altExp(mae[[1]], "Phylum")), temp)

We can visualize the statistically significant features with a boxplot.

# Get the data from assay, rowData and colData
df <- meltAssay(altExp(mae[[1]], "Phylum"), assay.type = "relabundance", add_col_data = TRUE, add_row_data = TRUE)
# Take only significant features
df <- df[ df$diff_abn, ]
# Plot
ggplot(df, aes(x = location, colour = location, y = relabundance)) +
    geom_boxplot(outlier.shape = NA) + 
    geom_jitter(width = 0.2) +
    # Own panel for each feature
    facet_grid(cols = vars(FeatureID)) +
    # Remove x axis text
    theme(axis.title.x=element_blank(), axis.text.x=element_blank()) +
    # Logarithmic scale
    scale_y_log10()

Cross-correlation

To demonstrate how we can integrate experiments, we perform a simple cross-correlation analysis. The purpose is to analyze whether there are phyla whose abundance correlates with predicted gene functions.

First we subset the gene function prediction data by taking only those features whose abundance varies the most across samples.
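
The selection step can be sketched in base R on a small made-up matrix: compute the coefficient of variation (standard deviation divided by mean) for each feature and keep the features with the largest absolute values. The matrix and its row names are hypothetical.

```r
# Hypothetical toy data: 3 features (rows) x 4 samples (columns)
mat <- rbind(
    stable   = c(10, 10, 11, 9),
    moderate = c(8, 12, 9, 11),
    variable = c(1, 20, 3, 16)
)

# Coefficient of variation per feature
cv <- apply(mat, 1, function(x) sd(x) / mean(x))

# Indices of the two most variable features
top_feat <- order(abs(cv), decreasing = TRUE)[1:2]
rownames(mat)[top_feat]
```

Taking the absolute value matters when the data have been centred or scaled, because the mean, and therefore the coefficient, can then be negative.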

# Transform assay
mae[[2]] <- transformAssay(mae[[2]], method = "log10", pseudocount = 1)
mae[[2]] <- transformAssay(mae[[2]], assay.type = "log10", method = "z", name = "log10_z")
# Get coefficients of variation
rowData(mae[[2]])[["cv"]] <- apply( assay(mae[[2]], "log10_z"), 1, function(x) sd(x)/mean(x) )
# Subset the data by taking top 40 features
top_feat <- order(abs(rowData(mae[[2]])[["cv"]]), decreasing = TRUE)[1:40]
altExp(mae[[2]], "sub") <- mae[[2]][top_feat, ]
# Replace feature names with more descriptive names
rownames(altExp(mae[[2]], "sub") ) <- rowData(altExp(mae[[2]], "sub"))[["description"]]
# Print
altExp(mae[[2]], "sub")
## class: TreeSummarizedExperiment 
## dim: 40 100 
## metadata(0):
## assays(3): counts log10 log10_z
## rownames(40): cell adhesion transcription factor binding ...
##   sporulation iron-sulfur cluster binding
## rowData names(4): description category index_id cv
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
##   sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
# Transform assay of microbial data
altExp(mae[[1]], "Phylum") <- transformAssay(altExp(mae[[1]], "Phylum"), method = "clr", pseudocount = 1)
# Perform cross-correlation analysis
res <- testExperimentCrossAssociation(
    mae,
    experiment1 = 1, experiment2 = 2,
    altexp1 = "Phylum", altexp2 = "sub",
    assay.type1 = "clr", assay.type2 = "log10_z",
    mode = "matrix"
)
# Store the result to data container
metadata(mae)[["croscor"]] <- res
# Plot
plot <- Heatmap(
    res$cor, name = "Kendall's tau",
    # Print values to cells
    cell_fun = function(j, i, x, y, width, height, fill) {
        # If the p-value is under threshold
        if( !is.na(res$p_adj[i, j]) && res$p_adj[i, j] < 0.05 ){
            # Print "X"
            grid.text(sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "black"))
            }
        },
    column_names_rot = 45
    )
# Adjust padding around plot so that names are visible
draw(plot, padding = unit(c(10, 40, 2, 2), "mm"))
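
The two ingredients of this analysis, the CLR transformation and the pairwise Kendall correlation with adjusted p-values, can be sketched in base R. This is an illustration on simulated data under stated assumptions (pseudocount of 1, FDR adjustment), not the implementation behind testExperimentCrossAssociation; the helper `clr` and all object names are hypothetical.

```r
# CLR per sample: log of each value minus the mean log of that sample
clr <- function(counts, pseudocount = 1) {
    logged <- log(counts + pseudocount)
    sweep(logged, 2, colMeans(logged), "-")
}

set.seed(3)
# Simulated data: 3 taxa and 2 functions measured in the same 12 samples
taxa <- matrix(rpois(36, lambda = 50), nrow = 3,
               dimnames = list(paste0("taxon", 1:3), paste0("s", 1:12)))
func <- matrix(rnorm(24), nrow = 2,
               dimnames = list(paste0("func", 1:2), paste0("s", 1:12)))
taxa_clr <- clr(taxa)

# Kendall's tau for every taxon-function pair (features are in rows, so transpose)
tau <- cor(t(taxa_clr), t(func), method = "kendall")

# Matching p-values for each pair
p <- outer(rownames(taxa_clr), rownames(func), Vectorize(function(i, j) {
    cor.test(taxa_clr[i, ], func[j, ], method = "kendall", exact = FALSE)$p.value
}))
# Adjust all p-values together for multiple testing
p_adj <- matrix(p.adjust(p, method = "fdr"), nrow = nrow(p), dimnames = dimnames(tau))

tau
p_adj
```

The resulting `tau` and `p_adj` matrices have the same layout (taxa in rows, functions in columns), which is what makes it possible to mark significant cells on a heatmap by indexing both with the same `[i, j]`.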

Save the results

Finally, we can save the data container which contains our analysis results.

saveRDS(mae, "data/mae_results.Rds")

Thank you for your time!

Key points

  1. Microbiome research studies interactions between microbes (and human, environment…)
  2. Big data requires efficient tools to manipulate the data
  3. miaverse is a (Tree)SummarizedExperiment framework for microbiome analytics

More information

  • Project website
  • Poster: miaverse – microbiome analytics framework in SummarizedExperiment family

Session info

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 21
## 
## Matrix products: default
## BLAS:   /opt/R/4.3.1/lib/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Helsinki
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] doRNG_1.8.6                    rngtools_1.5.2                
##  [3] foreach_1.5.2                  scater_1.29.4                 
##  [5] scuttle_1.11.2                 tidyr_1.3.0                   
##  [7] dplyr_1.1.3                    miaViz_1.9.3                  
##  [9] ggraph_2.1.0.9000              mia_1.9.16                    
## [11] MultiAssayExperiment_1.27.5    TreeSummarizedExperiment_2.9.0
## [13] Biostrings_2.69.2              XVector_0.41.1                
## [15] SingleCellExperiment_1.23.0    SummarizedExperiment_1.31.1   
## [17] Biobase_2.61.0                 GenomicRanges_1.53.1          
## [19] GenomeInfoDb_1.37.4            IRanges_2.35.2                
## [21] S4Vectors_0.39.1               BiocGenerics_0.47.0           
## [23] MatrixGenerics_1.13.1          matrixStats_1.0.0             
## [25] knitr_1.43                     ggplot2_3.4.3                 
## [27] ComplexHeatmap_2.17.0          ANCOMBC_2.3.1                 
## [29] BiocManager_1.30.22           
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.1               ggplotify_0.1.2            
##   [3] bitops_1.0-7                tibble_3.2.1               
##   [5] cellranger_1.1.0            polyclip_1.10-4            
##   [7] rpart_4.1.19                DirichletMultinomial_1.43.0
##   [9] lifecycle_1.0.3             Rdpack_2.5                 
##  [11] doParallel_1.0.17           lattice_0.21-8             
##  [13] MASS_7.3-60                 backports_1.4.1            
##  [15] magrittr_2.0.3              Hmisc_5.1-0                
##  [17] sass_0.4.7                  rmarkdown_2.24             
##  [19] jquerylib_0.1.4             yaml_2.3.7                 
##  [21] gld_2.6.6                   cowplot_1.1.1              
##  [23] RColorBrewer_1.1-3          DBI_1.1.3                  
##  [25] minqa_1.2.5                 multcomp_1.4-25            
##  [27] abind_1.4-5                 zlibbioc_1.47.0            
##  [29] expm_0.999-7                purrr_1.0.2                
##  [31] RCurl_1.98-1.12             TH.data_1.1-2              
##  [33] yulab.utils_0.0.9           nnet_7.3-19                
##  [35] tweenr_2.0.2                sandwich_3.0-2             
##  [37] circlize_0.4.15             GenomeInfoDbData_1.2.10    
##  [39] ggrepel_0.9.3               irlba_2.3.5.1              
##  [41] tidytree_0.4.5              vegan_2.6-4                
##  [43] permute_0.9-7               DelayedMatrixStats_1.23.4  
##  [45] codetools_0.2-19            DelayedArray_0.27.10       
##  [47] ggforce_0.4.1               energy_1.7-11              
##  [49] shape_1.4.6                 tidyselect_1.2.0           
##  [51] aplot_0.2.0                 farver_2.1.1               
##  [53] lme4_1.1-34                 gmp_0.7-2                  
##  [55] ScaledMatrix_1.9.1          viridis_0.6.4              
##  [57] base64enc_0.1-3             jsonlite_1.8.7             
##  [59] GetoptLong_1.0.5            BiocNeighbors_1.19.0       
##  [61] e1071_1.7-13                tidygraph_1.2.3            
##  [63] decontam_1.21.0             Formula_1.2-5              
##  [65] survival_3.5-7              iterators_1.0.14           
##  [67] ggnewscale_0.4.9            tools_4.3.1                
##  [69] treeio_1.25.4               DescTools_0.99.50          
##  [71] Rcpp_1.0.11                 glue_1.6.2                 
##  [73] BiocBaseUtils_1.3.2         gridExtra_2.3              
##  [75] SparseArray_1.1.11          xfun_0.40                  
##  [77] mgcv_1.9-0                  withr_2.5.0                
##  [79] numDeriv_2016.8-1.1         fastmap_1.1.1              
##  [81] boot_1.3-28.1               bluster_1.11.4             
##  [83] fansi_1.0.4                 digest_0.6.33              
##  [85] rsvd_1.0.5                  gridGraphics_0.5-1         
##  [87] R6_2.5.1                    colorspace_2.1-0           
##  [89] Cairo_1.6-1                 gtools_3.9.4               
##  [91] RSQLite_2.3.1               utf8_1.2.3                 
##  [93] generics_0.1.3              data.table_1.14.8          
##  [95] DECIPHER_2.29.0             class_7.3-22               
##  [97] graphlayouts_1.0.0          CVXR_1.0-11                
##  [99] httr_1.4.7                  htmlwidgets_1.6.2          
## [101] S4Arrays_1.1.6              pkgconfig_2.0.3            
## [103] gtable_0.3.4                Exact_3.2                  
## [105] Rmpfr_0.9-3                 blob_1.2.4                 
## [107] htmltools_0.5.6             clue_0.3-64                
## [109] scales_1.2.1                lmom_3.0                   
## [111] png_0.1-8                   ggfun_0.1.2                
## [113] rstudioapi_0.15.0           rjson_0.2.21               
## [115] reshape2_1.4.4              checkmate_2.2.0            
## [117] nlme_3.1-163                nloptr_2.0.3               
## [119] GlobalOptions_0.1.2         zoo_1.8-12                 
## [121] proxy_0.4-27                cachem_1.0.8               
## [123] stringr_1.5.0               rootSolve_1.8.2.3          
## [125] parallel_4.3.1              vipor_0.4.5                
## [127] foreign_0.8-84              pillar_1.9.0               
## [129] vctrs_0.6.3                 BiocSingular_1.17.1        
## [131] beachmat_2.17.15            cluster_2.1.4              
## [133] beeswarm_0.4.0              htmlTable_2.4.1            
## [135] evaluate_0.21               magick_2.7.5               
## [137] mvtnorm_1.2-3               cli_3.6.1                  
## [139] compiler_4.3.1              rlang_1.1.1                
## [141] crayon_1.5.2                labeling_0.4.3             
## [143] plyr_1.8.8                  ggbeeswarm_0.7.2           
## [145] stringi_1.7.12              viridisLite_0.4.2          
## [147] BiocParallel_1.35.4         lmerTest_3.1-3             
## [149] munsell_0.5.0               gsl_2.1-8                  
## [151] lazyeval_0.2.2              Matrix_1.6-1               
## [153] patchwork_1.1.3             sparseMatrixStats_1.13.4   
## [155] bit64_4.0.5                 highr_0.10                 
## [157] rbibutils_2.2.15            igraph_1.5.1               
## [159] memoise_2.0.1               bslib_0.5.1                
## [161] ggtree_3.9.1                bit_4.0.5                  
## [163] readxl_1.4.3                ape_5.7-1