Microbiome data integration workflow for population cohort studies
# List of packages that we need
packages <- c(
"ANCOMBC", "ComplexHeatmap", "ggplot2", "knitr", "mia", "miaViz", "dplyr",
"tidyr", "scater")
# Get packages that are already installed
packages_already_installed <- packages[ packages %in% rownames(installed.packages()) ]
# Get packages that need to be installed
packages_need_to_install <- setdiff( packages, packages_already_installed )
# Load BiocManager into the session. Install it if it is not already installed.
if( !require("BiocManager") ){
install.packages("BiocManager")
library("BiocManager")
}
# If any packages are missing, install them with BiocManager.
# With ask = FALSE, outdated packages are updated without prompting.
if( length(packages_need_to_install) > 0 ) {
install(packages_need_to_install, ask = FALSE)
}
# Load all packages into session. Stop if there are packages that were not
# successfully loaded
if( any(!sapply(packages, require, character.only = TRUE)) ){
stop("Error in loading packages into the session.")
}
################################################################################
# Additional setup
# Set chunk options
opts_chunk$set(message = FALSE, warning = FALSE)
# Set black and white theme for figures, and Arial font
theme <- theme_bw() +
theme(
text = element_text(family = "Arial"),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black")
)
theme_set(theme)
All authors are affiliated with the Turku Data Science Group at the
University of Turku, Finland.
# Plot publication graph
path <- "data/PubMed_Timeline_Results_by_Year.csv"
df <- read.csv(path, skip = 1)
x <- "Year"
y <- "Count"
plot <- ggplot(df, aes(x = .data[[x]], y = .data[[y]])) +
geom_bar(stat="identity")
plot
PubMed publications per year for the search term ‘microbiome’ (fetched: Sep 5, 2023)
The workflow is based on the Orchestrating Microbiome Analysis (OMA) tutorial book, which provides more detailed information.
We fetch the data from the MGnify database, EMBL-EBI’s database for metagenomic data. This large microbiome database can be accessed with the MGnifyR package, which now supports TreeSummarizedExperiment (TreeSE). The package will be submitted to Bioconductor’s next release.
We chose the dataset of study MGYS00005128, which investigated antibiotic resistance. The data were collected in Cambodia, Kenya and the UK. The dataset contains a total of 1197 samples with taxonomy and gene function prediction data.
Because fetching the data takes some time, the dataset was fetched in advance and is loaded here from a local file. The commented-out code shows how it was retrieved.
# library(MGnifyR)
# # Create a client object
# mg <- MgnifyClient(useCache = TRUE, cacheDir = "data/magnifyr_cache")
# # Search analysis IDs based on study ID
# analyses <- searchAnalysis(mg, "studies", "MGYS00005128")
# # Fetch data
# mae <- getResult(mg, analyses, get.func = "go-slim")
# # Store the data
# saveRDS(mae, "data/mae.Rds")
mae <- readRDS("data/mae.Rds")
A MultiAssayExperiment (MAE) stores multiple experiments. In this case there are two: taxonomic profiles and gene function predictions.
mae
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] microbiota: TreeSummarizedExperiment with 2207 rows and 1197 columns
## [2] go-slim: TreeSummarizedExperiment with 116 rows and 1197 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
General information on the samples of the study is available in the sample metadata (colData) of the MAE.
colData(mae)[1:5, 1:5] %>% kable()
| | analysis_accession | analysis_analysis.status | analysis_experiment.type | analysis_pipeline.version | analysis_is.private |
---|---|---|---|---|---|
MGYA00383606 | MGYA00383606 | completed | metagenomic | 4.1 | FALSE |
MGYA00383607 | MGYA00383607 | completed | metagenomic | 4.1 | FALSE |
MGYA00383608 | MGYA00383608 | completed | metagenomic | 4.1 | FALSE |
MGYA00383609 | MGYA00383609 | completed | metagenomic | 4.1 | FALSE |
MGYA00383610 | MGYA00383610 | completed | metagenomic | 4.1 | FALSE |
MAE and TreeSE objects have rows and columns. This means that we can subset the data in the same way as other objects that have rows and columns (such as a data.frame). In an MAE, experiments and samples are linked together, so we can subset all experiments in one go.
For demonstration purposes and to save computing resources, let’s subset the data by taking 100 random samples.
set.seed(49585)
random_samples <- sample(colnames(mae[[1]]), 100)
mae <- mae[, random_samples]
mae
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] microbiota: TreeSummarizedExperiment with 2207 rows and 100 columns
## [2] go-slim: TreeSummarizedExperiment with 116 rows and 100 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
The first experiment (a TreeSE) contains the taxonomy data.
mae[[1]]
## class: TreeSummarizedExperiment
## dim: 2207 100
## metadata(0):
## assays(1): counts
## rownames(2207): 148939 125998 ... 233398 233398.1
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
## sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
The second one contains the gene function prediction data. As you can see, we can fetch an experiment by specifying either its index or its name.
mae[["go-slim"]]
## class: TreeSummarizedExperiment
## dim: 116 100
## metadata(0):
## assays(1): counts
## rownames(116): GO:0009317 GO:0016597 ... GO:0019012 GO:0019842
## rowData names(3): description category index_id
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
## sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
The taxonomy experiment stores the taxonomy table in its feature metadata (rowData).
rowData(mae[[1]]) %>% head() %>% kable()
| | Kingdom | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|
148939 | Eukaryota | Ascomycota | Dothideomycetes | NA | NA | NA | NA |
125998 | Eukaryota | Ascomycota | Eurotiomycetes | Eurotiales | NA | NA | NA |
114164 | Eukaryota | Ascomycota | Eurotiomycetes | Onygenales | NA | NA | NA |
76021 | Eukaryota | Ascomycota | Eurotiomycetes | NA | NA | NA | NA |
73314 | Eukaryota | Ascomycota | Saccharomycetes | Saccharomycetales | Debaryomycetaceae | Candida | NA |
73314.1 | Eukaryota | Ascomycota | Saccharomycetes | Saccharomycetales | Debaryomycetaceae | Candida | NA |
Compared to a phyloseq object, a TreeSE can hold more data, for example multiple assays. Let’s transform the counts to relative abundances. The transformed table is stored in the assays slot alongside the original counts.
mae[[1]] <- transformAssay(mae[[1]], method = "relabundance")
mae[[1]]
## class: TreeSummarizedExperiment
## dim: 2207 100
## metadata(0):
## assays(2): counts relabundance
## rownames(2207): 148939 125998 ... 233398 233398.1
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
## sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
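As a rough illustration of what a relative abundance transformation does, the base R sketch below divides each sample's counts by that sample's total. The toy matrix and its names are made up for this example, and `transformAssay` may handle edge cases differently.

```r
# Toy count matrix: rows are taxa, columns are samples (values are made up)
counts <- matrix(
    c(10,  0, 30,
      90, 50, 70),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("taxonA", "taxonB"), c("s1", "s2", "s3"))
)

# Divide every column by its column sum so that each sample sums to 1
relabundance <- sweep(counts, 2, colSums(counts), "/")

relabundance
colSums(relabundance)
```

After the transformation every sample sums to one, which makes samples with different sequencing depths comparable.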
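We can summarize how many unique taxa there are at each taxonomic rank. For instance, we can see that there are 53 unique phyla. Note that these counts cover all kingdoms in the data, not only bacteria, and that n_distinct counts NA (an unassigned rank) as a value of its own.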
rowData(mae[[1]]) %>% as_tibble() %>% summarise_all(n_distinct) %>% kable()
Kingdom | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|
5 | 53 | 108 | 175 | 273 | 685 | 882 |
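To show how unassigned ranks affect such counts, here is a small base R sketch on a hypothetical taxonomy table. It compares counting with and without `NA`; the table contents are invented for illustration.

```r
# Toy taxonomy table (hypothetical values); NA marks an unassigned rank
tax <- data.frame(
    Kingdom = c("Bacteria", "Bacteria", "Eukaryota", "Bacteria"),
    Phylum  = c("Firmicutes", "Firmicutes", "Ascomycota", NA),
    Genus   = c("Bacillus", NA, "Candida", NA)
)

# Count distinct values per rank, treating NA as a value of its own
with_na <- sapply(tax, function(x) length(unique(x)))

# Count distinct values per rank after dropping NA
without_na <- sapply(tax, function(x) length(unique(x[!is.na(x)])))

rbind(with_na, without_na)
```

If only assigned taxa should be counted, the second variant is probably the one to report.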
A common operation in microbiome data analysis is agglomeration, which means summing the data up to a certain taxonomic rank. We can use the mia::mergeFeaturesByRank function to agglomerate the data to a single rank. To agglomerate the data to all available ranks with one command, we can use mia::splitByRanks.
The altExp slot is the right place to store experiments with modified features (such as agglomerated or subsetted data).
altExps(mae[[1]]) <- splitByRanks(mae[[1]])
mae[[1]]
## class: TreeSummarizedExperiment
## dim: 2207 100
## metadata(0):
## assays(2): counts relabundance
## rownames(2207): 148939 125998 ... 233398 233398.1
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
## sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(7): Kingdom Phylum ... Genus Species
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
We can fetch the agglomerated data from the slot. Instead of 2207 features, there are only 5 features in the data agglomerated to the kingdom level.
altExp(mae[[1]], "Kingdom")
## class: TreeSummarizedExperiment
## dim: 5 100
## metadata(1): agglomerated_by_rank
## assays(2): counts relabundance
## rownames(5): Eukaryota Bacteria Archaea Chloroplast Mitochondria
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
## sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
The assays contain the agglomerated abundance tables.
assay(altExp(mae[[1]], "Kingdom"), "counts") %>% head() %>% kable()
| | MGYA00393990 | MGYA00384766 | MGYA00384868 | MGYA00393846 | MGYA00393894 | MGYA00393985 | MGYA00384631 | MGYA00394152 | MGYA00383695 | MGYA00384928 | MGYA00383726 | MGYA00393933 | MGYA00385075 | MGYA00384842 | MGYA00384761 | MGYA00384939 | MGYA00393750 | MGYA00384640 | MGYA00384849 | MGYA00385081 | MGYA00394120 | MGYA00384927 | MGYA00384959 | MGYA00384916 | MGYA00384698 | MGYA00384790 | MGYA00393938 | MGYA00385072 | MGYA00394132 | MGYA00394013 | MGYA00384733 | MGYA00393782 | MGYA00383709 | MGYA00385144 | MGYA00393804 | MGYA00383747 | MGYA00385128 | MGYA00394130 | MGYA00393978 | MGYA00385042 | MGYA00384881 | MGYA00393796 | MGYA00384645 | MGYA00384786 | MGYA00384812 | MGYA00384996 | MGYA00383614 | MGYA00393840 | MGYA00383719 | MGYA00393998 | MGYA00393725 | MGYA00394160 | MGYA00384995 | MGYA00383665 | MGYA00383617 | MGYA00384754 | MGYA00394066 | MGYA00393893 | MGYA00393808 | MGYA00394134 | MGYA00385032 | MGYA00386685 | MGYA00393923 | MGYA00384685 | MGYA00384848 | MGYA00385096 | MGYA00393743 | MGYA00393916 | MGYA00384817 | MGYA00393793 | MGYA00394171 | MGYA00384965 | MGYA00384705 | MGYA00393732 | MGYA00384735 | MGYA00384712 | MGYA00384869 | MGYA00383622 | MGYA00384669 | MGYA00384852 | MGYA00384966 | MGYA00394027 | MGYA00384709 | MGYA00393848 | MGYA00385039 | MGYA00385048 | MGYA00383727 | MGYA00394061 | MGYA00393948 | MGYA00384784 | MGYA00393958 | MGYA00383773 | MGYA00384974 | MGYA00393889 | MGYA00384771 | MGYA00383733 | MGYA00384764 | MGYA00394040 | MGYA00384647 | MGYA00394128 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Eukaryota | 0 | 11 | 78 | 17 | 1097 | 0 | 3 | 0 | 0 | 103 | 197 | 20 | 1515 | 38 | 5 | 2 | 18 | 71 | 9 | 1540 | 14 | 52 | 2 | 10062 | 37 | 9 | 28 | 1926 | 105 | 1316 | 3 | 912 | 2 | 925 | 1025 | 0 | 1098 | 252 | 151 | 1865 | 0 | 162 | 1874 | 3269 | 0 | 1812 | 4 | 14 | 2554 | 14 | 1164 | 23 | 1768 | 1776 | 2 | 0 | 1422 | 1046 | 1108 | 7 | 2841 | 157 | 36 | 1086 | 1 | 9 | 603 | 32 | 11 | 2131 | 2281 | 38 | 1 | 10 | 776 | 7 | 49 | 0 | 1676 | 5 | 3 | 0 | 0 | 3 | 17 | 0 | 178 | 33 | 53 | 3173 | 20 | 75 | 35 | 1057 | 0 | 83 | 0 | 249 | 2647 | 12 |
Bacteria | 12972 | 24905 | 17949 | 22423 | 2542 | 13278 | 15987 | 11276 | 11418 | 15904 | 15870 | 13654 | 2925 | 31623 | 27460 | 26498 | 12378 | 16092 | 20844 | 2838 | 20757 | 9656 | 14791 | 95482 | 47183 | 25661 | 1011 | 4836 | 24770 | 15616 | 69616 | 21336 | 13801 | 13600 | 229 | 20324 | 9394 | 15200 | 16097 | 470 | 27951 | 12324 | 444 | 171780 | 24618 | 12197 | 27097 | 22691 | 1601 | 11288 | 3088 | 17242 | 12101 | 8357 | 27102 | 27228 | 457 | 2622 | 736 | 47864 | 2633 | 28480 | 29674 | 782 | 21101 | 16914 | 2601 | 50402 | 20847 | 6258 | 1805 | 11263 | 29342 | 17726 | 36623 | 43031 | 21714 | 28696 | 399 | 20574 | 14501 | 16719 | 28936 | 19623 | 32607 | 26047 | 16203 | 26306 | 58927 | 174135 | 38109 | 56421 | 10730 | 2759 | 22200 | 17384 | 25083 | 15233 | 2140 | 14658 |
Archaea | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | 10 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 0 | 0 | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 61 | 0 | 0 | 10 | 0 | 51 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Chloroplast | 76 | 264 | 102 | 289 | 3 | 95 | 0 | 95 | 37 | 4 | 105 | 236 | 7 | 0 | 14 | 268 | 182 | 3 | 71 | 9 | 57 | 79 | 165 | 189 | 34 | 318 | 14 | 2 | 101 | 143 | 490 | 66 | 63 | 4 | 0 | 208 | 25 | 128 | 118 | 1 | 151 | 8 | 1 | 1239 | 269 | 17 | 11 | 272 | 0 | 156 | 16 | 0 | 27 | 22 | 8 | 320 | 2 | 1 | 9 | 20 | 3 | 0 | 99 | 8 | 171 | 113 | 8 | 419 | 18 | 4 | 1 | 165 | 238 | 116 | 30 | 0 | 237 | 327 | 1 | 141 | 173 | 47 | 324 | 307 | 209 | 0 | 98 | 90 | 457 | 1119 | 13 | 177 | 134 | 3 | 197 | 215 | 32 | 124 | 6 | 13 |
Mitochondria | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
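Conceptually, agglomeration sums the rows that share the same label at the chosen rank. The base R sketch below shows this idea with `rowsum` on a toy matrix. The feature names and kingdom assignments are hypothetical, and the mia functions additionally deal with missing ranks and metadata, which this sketch ignores.

```r
# Toy counts for four features in two samples (values are made up)
counts <- matrix(
    c(5, 1,
      3, 2,
      7, 0,
      1, 4),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("feature", 1:4), c("s1", "s2"))
)

# Kingdom assignment of each feature (hypothetical)
kingdom <- c("Bacteria", "Bacteria", "Eukaryota", "Bacteria")

# Sum the rows that share the same kingdom
agglomerated <- rowsum(counts, group = kingdom)

agglomerated
```

The column totals stay the same, because agglomeration only regroups the counts within each sample.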
To visualize taxonomic relations, we can first create a tree based on the taxonomy table in rowData, and then plot it. The tree is created at the family level.
altExp(mae[[1]], "Family") <- addTaxonomyTree(altExp(mae[[1]], "Family"))
plotRowTree(altExp(mae[[1]], "Family"), edge_colour_by = "Kingdom")
The miaverse includes several visualization methods. For example, we can visualize the relative abundances of the 10 most abundant phyla.
plotAbundanceDensity(
altExp(mae[[1]], "Phylum"), assay.type = "relabundance", n = 10, layout="density")
Alpha diversity measures how diverse the microbial composition is within a sample. For example, the more different bacterial species a sample contains, the higher its alpha diversity. Again, the miaverse includes convenient tools to calculate the alpha diversity of samples and to visualize it.
Here we analyze if alpha diversities differ between locations.
# Calculate
mae[[1]] <- estimateDiversity(mae[[1]], index = "shannon")
# Plot
plotColData(mae[[1]], y = "shannon", x = "location", colour_by = "location") +
# These are normal ggplot objects
theme(legend.position = "none")
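For intuition, the Shannon index of a sample can be written as H = -sum(p * log(p)), where p are the relative abundances of the taxa present. The sketch below computes it in base R for two made-up samples. The natural logarithm is an assumption of this sketch, and the helper name `shannon` is hypothetical.

```r
# Shannon index of one sample: H = -sum(p * log(p)) over taxa with p > 0
shannon <- function(x) {
    p <- x[x > 0] / sum(x)
    -sum(p * log(p))
}

# Toy counts: rows are taxa, columns are samples (values are made up)
counts <- cbind(
    even      = c(25, 25, 25, 25),
    dominated = c(97, 1, 1, 1)
)

h <- apply(counts, 2, shannon)
h
```

Both samples contain four taxa, but the evenly distributed one gets the higher index, since Shannon diversity reflects evenness as well as richness.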
Beta diversity measures differences between the microbial profiles of samples. There are several ordination techniques for exploring it, Principal Component Analysis (PCA) being the best known. Distance-based redundancy analysis (dbRDA) is a supervised ordination method that takes sample metadata into account: it maximizes the variance with respect to the covariates.
Here we analyze whether location can explain the differences in microbial profiles. The result is stored in the reducedDim slot.
mae[[1]] <- runRDA(
mae[[1]],
assay.type = "relabundance",
formula = data ~ location,
distance = "bray",
name = "dbRDA"
)
reducedDim(mae[[1]], "dbRDA")
## dbRDA1 dbRDA2
## MGYA00393990 0.1347102942 -0.100832593
## MGYA00384766 0.1440370073 -0.133240761
## MGYA00384868 0.0752795935 -0.183802481
## MGYA00393846 0.0610872412 0.489951835
## MGYA00393894 -0.2329823176 0.069469681
## MGYA00393985 0.1341586408 -0.102957754
## MGYA00384631 -0.0316655436 0.277311863
## MGYA00394152 0.0618287158 0.493236009
## MGYA00383695 -0.0084027652 -0.155762139
## MGYA00384928 -0.1166574929 -0.068720426
## MGYA00383726 0.0736662587 -0.174236886
## MGYA00393933 0.0595845903 0.484813425
## MGYA00385075 -0.2476202356 0.089064859
## MGYA00384842 0.0560987204 0.084441075
## MGYA00384761 0.1335693958 -0.111399651
## MGYA00384939 0.1233061411 -0.107712350
## MGYA00393750 0.0613077131 0.491512394
## MGYA00384640 -0.1152849683 -0.055104759
## MGYA00384849 0.0691988641 -0.118067879
## MGYA00385081 -0.2435778106 0.081108922
## MGYA00394120 0.0727140079 -0.115199175
## MGYA00384927 0.0744038815 -0.184540075
## MGYA00384959 0.1227612375 -0.105682000
## MGYA00384916 -0.1450605401 -0.082901182
## MGYA00384698 0.0621041269 -0.046376995
## MGYA00384790 0.1440894238 -0.132580603
## MGYA00393938 0.0773613625 -0.117812352
## MGYA00385072 -0.2463605951 0.075185591
## MGYA00394132 0.0125192818 -0.092162835
## MGYA00394013 0.0949048574 -0.121352348
## MGYA00384733 0.1299577328 -0.107163273
## MGYA00393782 -0.0485745141 -0.084911171
## MGYA00383709 0.0344791254 -0.176607956
## MGYA00385144 -0.1743700351 0.025745252
## MGYA00393804 -0.0331379977 0.245074548
## MGYA00383747 0.1202622083 -0.160957854
## MGYA00385128 -0.1157261802 0.095426271
## MGYA00394130 0.1100743500 -0.151624819
## MGYA00393978 0.0749827716 -0.169128101
## MGYA00385042 -0.2386491739 0.176320933
## MGYA00384881 0.0703782568 -0.165098363
## MGYA00393796 -0.0931293914 -0.092467612
## MGYA00384645 -0.2301181982 0.164162146
## MGYA00384786 0.0502844922 -0.178037602
## MGYA00384812 0.1399074536 -0.126607636
## MGYA00384996 -0.1351452341 -0.109831900
## MGYA00383614 -0.0345784398 0.244892145
## MGYA00393840 0.0613128092 0.491673873
## MGYA00383719 -0.2079946631 0.087238702
## MGYA00393998 0.1397657271 -0.138923923
## MGYA00393725 -0.0696509924 -0.075937955
## MGYA00394160 0.0549978102 0.086559521
## MGYA00384995 -0.1318146008 -0.110735038
## MGYA00383665 -0.1308114996 -0.066265707
## MGYA00383617 -0.0342930884 0.243270312
## MGYA00384754 0.1444104462 -0.125846612
## MGYA00394066 -0.1934873125 0.101619718
## MGYA00393893 -0.2295337408 0.068759432
## MGYA00393808 -0.1755577610 0.005031351
## MGYA00394134 0.1397203992 -0.099679138
## MGYA00385032 -0.2486498398 0.124041724
## MGYA00386685 -0.1269555631 0.021395117
## MGYA00393923 0.1109984687 -0.138860763
## MGYA00384685 -0.1657696868 -0.005621145
## MGYA00384848 0.1051087071 -0.180194961
## MGYA00385096 0.0766841236 -0.038968087
## MGYA00393743 -0.1505195024 -0.077945676
## MGYA00393916 0.1487237066 -0.107600058
## MGYA00384817 0.0534617372 -0.053099266
## MGYA00393793 -0.2454584161 0.014680854
## MGYA00394171 -0.2101295794 0.092617510
## MGYA00384965 0.1375433669 -0.142346565
## MGYA00384705 0.1408101483 -0.097320538
## MGYA00393732 0.0184656582 -0.121993265
## MGYA00384735 0.0596693137 0.487015833
## MGYA00384712 0.0117835216 0.054146791
## MGYA00384869 0.1280221883 -0.153727071
## MGYA00383622 0.1403129292 -0.128121361
## MGYA00384669 -0.2029823451 0.082198544
## MGYA00384852 0.1046645459 -0.179625391
## MGYA00384966 0.1238179369 -0.108690300
## MGYA00394027 0.1098098043 -0.067429451
## MGYA00384709 0.1403300420 -0.127865221
## MGYA00393848 0.0617271155 0.491612219
## MGYA00385039 0.0195541731 -0.123350724
## MGYA00385048 -0.0313709212 0.271745946
## MGYA00383727 0.0722204721 -0.173850173
## MGYA00394061 0.0431684905 -0.156035621
## MGYA00393948 0.1136389418 -0.053411435
## MGYA00384784 0.0502472886 -0.180846880
## MGYA00393958 -0.0257144179 0.088235614
## MGYA00383773 -0.0563405106 -0.121503598
## MGYA00384974 0.1354616969 -0.143851679
## MGYA00393889 -0.2240017574 0.064431657
## MGYA00384771 0.0618245875 0.493131802
## MGYA00383733 0.1343592838 -0.124630454
## MGYA00384764 0.0606454204 0.487786298
## MGYA00394040 0.1113961726 -0.153128594
## MGYA00384647 -0.2406707164 0.050088550
## MGYA00394128 -0.0009264312 -0.084710131
## attr(,"rotation")
## [,1] [,2]
## attr(,"eigen")
## dbRDA1 dbRDA2
## 4.116854 1.767239
## attr(,"rda")
## Call: vegan::dbrda(formula = data ~ location, data = variables,
## distance = "bray")
##
## Inertia Proportion Rank RealDims
## Total 28.2023 1.0000
## Constrained 5.8841 0.2086 2 2
## Unconstrained 22.3182 0.7914 97 61
## Inertia is squared Bray distance
## Species scores projected from 'species_scores'
##
## Eigenvalues for constrained axes:
## dbRDA1 dbRDA2
## 4.117 1.767
##
## Eigenvalues for unconstrained axes:
## MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7 MDS8
## 6.778 3.098 2.828 1.530 1.322 0.991 0.787 0.694
## (Showing 8 of 97 unconstrained eigenvalues)
##
## attr(,"significance")
## attr(,"significance")$permanova
## Df SumOfSqs F Pr(>F) Total variance Explained variance
## Model 2 5.884093 12.7868 0.001 28.20231 0.2086387
## location 2 5.884093 12.7868 0.001 28.20231 0.2086387
## Residual 97 22.318218 NA NA 28.20231 0.7913613
##
## attr(,"significance")$homogeneity
## Df Sum Sq Mean Sq F N.Perm Pr(>F) Total variance
## location 2 0.7177062 0.3588531 10.34989 999 0.001 4.080905
## Explained variance
## location 0.1758694
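As we can see, location has a significant effect on the microbial profile (PERMANOVA, p = 0.001), explaining about 21% of the total variance. However, the homogeneity test above is also significant (p = 0.001), which means that the groups do not have similar dispersion. Homogeneous dispersion is an assumption of PERMANOVA, so this has to be taken into account when drawing conclusions.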
plotRDA(mae[[1]], "dbRDA", colour_by = "location")
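The ordination above is based on Bray-Curtis dissimilarity. As a minimal sketch, the dissimilarity between two samples x and y can be computed as sum(|x - y|) / sum(x + y). The helper name `bray_curtis` and the toy samples are made up for this illustration.

```r
# Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(x, y) {
    sum(abs(x - y)) / sum(x + y)
}

# Toy samples with abundances of three taxa (values are made up)
s1 <- c(10, 0, 5)
s2 <- c(10, 0, 5)
s3 <- c(0, 8, 0)
s4 <- c(5, 5, 5)

bray_curtis(s1, s2)  # identical profiles
bray_curtis(s1, s3)  # no shared taxa
bray_curtis(s1, s4)  # partly overlapping profiles
```

The value ranges from 0 for identical profiles to 1 for samples that share no taxa.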
The idea of differential abundance analysis (DAA) is to test whether there are taxa whose abundance differs between groups. There are multiple methods for this (such as the basic Wilcoxon test). ANCOM-BC is a method that takes the special characteristics of microbiome data into account.
Here we want to test whether there are phyla whose abundance differs between locations.
# Analyze
res <- ancombc2(
data = altExp(mae[[1]], "Phylum"),
fix_formula = "location",
p_adj_method = "fdr",
group = "location",
global = TRUE
)
# Store results to data container
metadata( altExp(mae[[1]], "Phylum") )[["ancombc2"]] <- res
# Print
temp <- res$res_global
temp %>% kable()
taxon | W | p_val | q_val | diff_abn | passed_ss |
---|---|---|---|---|---|
Ascomycota | 3.0152035 | 0.1303474 | 0.2929859 | FALSE | TRUE |
Basidiomycota | 2.7706441 | 0.1637526 | 0.2932449 | FALSE | FALSE |
Arthropoda | 1.7377258 | 0.3792163 | 0.4634866 | FALSE | FALSE |
Chordata | 29.2649292 | 0.0000000 | 0.0000000 | TRUE | TRUE |
Nematoda | 2.4336285 | 0.2096921 | 0.3075484 | FALSE | FALSE |
Acidobacteria | 38.9192976 | 0.0000382 | 0.0002805 | TRUE | TRUE |
Actinobacteria | 0.7400422 | 0.9602141 | 0.9602141 | FALSE | FALSE |
Bacillariophyta | 1.4432923 | 0.5973256 | 0.6570582 | FALSE | FALSE |
Bacteroidetes | 14.0562019 | 0.0000090 | 0.0000988 | TRUE | TRUE |
Chloroflexi | 2.1208994 | 0.3413546 | 0.4417530 | FALSE | FALSE |
Cyanobacteria | 19.4555055 | 0.0016916 | 0.0062024 | TRUE | FALSE |
Deinococcus-Thermus | 2.4445394 | 0.1866104 | 0.2932449 | FALSE | FALSE |
Euryarchaeota | 4.5780354 | 0.0625114 | 0.1719062 | FALSE | FALSE |
Fibrobacteres | 1.6473489 | 0.4169444 | 0.4827777 | FALSE | FALSE |
Firmicutes | 2.7863339 | 0.1331754 | 0.2929859 | FALSE | FALSE |
Fusobacteria | 2.0794228 | 0.2643005 | 0.3634131 | FALSE | TRUE |
Gemmatimonadetes | 6.3421884 | 0.0100882 | 0.0317058 | TRUE | FALSE |
Lentisphaerae | 0.8346123 | 0.9195272 | 0.9602141 | FALSE | FALSE |
Proteobacteria | 10.8483507 | 0.0001120 | 0.0006160 | TRUE | TRUE |
Spirochaetes | 3.1806770 | 0.1803801 | 0.2932449 | FALSE | TRUE |
Tenericutes | 2.8137936 | 0.1488332 | 0.2932449 | FALSE | FALSE |
Verrucomicrobia | 9.6943325 | 0.0007929 | 0.0034888 | TRUE | TRUE |
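The q-values in the table are p-values adjusted with the "fdr" method. As a sketch of how such an adjustment works, the base R function `p.adjust` applies the Benjamini-Hochberg procedure, for which "fdr" is an alias. The p-values below are hypothetical. In the table above, Acidobacteria has the third smallest p-value of the 22 phyla, and 0.0000382 × 22 / 3 ≈ 0.00028 agrees with its reported q-value, which suggests the same procedure was applied there.

```r
# Hypothetical raw p-values for five taxa
p_val <- c(taxonA = 0.001, taxonB = 0.01, taxonC = 0.02,
           taxonD = 0.03, taxonE = 0.5)

# Benjamini-Hochberg adjustment ("fdr" is an alias for "BH" in p.adjust)
q_val <- p.adjust(p_val, method = "fdr")

# Flag taxa that pass a 0.05 threshold after adjustment
data.frame(p_val, q_val, significant = q_val < 0.05)
```

Adjusted values are never smaller than the raw p-values, so fewer taxa pass the threshold than with unadjusted testing.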
# Add results to feature metadata
# Ensure that results go to right feature
rownames(temp) <- temp$taxon
temp <- temp[rownames(altExp(mae[[1]], "Phylum")), ]
rownames(temp) <- rownames( altExp(mae[[1]], "Phylum") )
# Add to rowData
rowData(altExp(mae[[1]], "Phylum")) <- cbind(rowData(altExp(mae[[1]], "Phylum")), temp)
We can visualize the statistically significant features with boxplots.
# Get the data from assay, rowData and colData
df <- meltAssay(altExp(mae[[1]], "Phylum"), assay.type = "relabundance", add_col_data = TRUE, add_row_data = TRUE)
# Take only significant features (which() drops phyla that have no test result)
df <- df[ which(df$diff_abn), ]
# Plot
ggplot(df, aes(x = location, colour = location, y = relabundance)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2) +
# Own panel for each feature
facet_grid(cols = vars(FeatureID)) +
# Remove x axis text
theme(axis.title.x=element_blank(), axis.text.x=element_blank()) +
# Logarithmic scale
scale_y_log10()
To demonstrate how we can integrate experiments, we perform a simple cross-correlation analysis. The purpose is to find out whether there are phyla whose abundance correlates with predicted gene functions.
First, we subset the gene function prediction data by taking only those features whose abundance varies the most across samples.
# Transform assay
mae[[2]] <- transformAssay(mae[[2]], method = "log10", pseudocount = 1)
mae[[2]] <- transformAssay(mae[[2]], assay.type = "log10", method = "z", name = "log10_z")
# Get coefficients of variation
rowData(mae[[2]])[["cv"]] <- apply( assay(mae[[2]], "log10_z"), 1, function(x) sd(x)/mean(x) )
# Subset the data by taking top 40 features
top_feat <- order(abs(rowData(mae[[2]])[["cv"]]), decreasing = TRUE)[1:40]
altExp(mae[[2]], "sub") <- mae[[2]][top_feat, ]
# Replace feature names with more descriptive names
rownames(altExp(mae[[2]], "sub") ) <- rowData(altExp(mae[[2]], "sub"))[["description"]]
# Print
altExp(mae[[2]], "sub")
## class: TreeSummarizedExperiment
## dim: 40 100
## metadata(0):
## assays(3): counts log10 log10_z
## rownames(40): cell adhesion transcription factor binding ...
## sporulation iron-sulfur cluster binding
## rowData names(4): description category index_id cv
## colnames(100): MGYA00393990 MGYA00384766 ... MGYA00384647 MGYA00394128
## colData names(64): analysis_accession analysis_analysis.status ...
## sample_instrument.model sample_last.update.date
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
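The selection step above ranks features by their coefficient of variation (standard deviation divided by mean) and keeps the top ones. The following base R sketch repeats the same idea on a made-up matrix, keeping two features instead of 40. Note that the coefficient of variation can become unstable when a feature's mean is close to zero.

```r
# Toy matrix: rows are features, columns are samples (values are made up)
mat <- rbind(
    stable   = c(10, 11, 9, 10),
    variable = c(1, 20, 3, 16),
    moderate = c(5, 8, 6, 9)
)

# Coefficient of variation per feature: standard deviation divided by mean
cv <- apply(mat, 1, function(x) sd(x) / mean(x))

# Indices of the two features with the largest absolute CV
top_feat <- order(abs(cv), decreasing = TRUE)[1:2]

rownames(mat)[top_feat]
```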
# Transform assay of microbial data
altExp(mae[[1]], "Phylum") <- transformAssay(altExp(mae[[1]], "Phylum"), method = "clr", pseudocount = 1)
# Perform cross-correlation analysis
res <- testExperimentCrossAssociation(
mae,
experiment1 = 1, experiment2 = 2,
altexp1 = "Phylum", altexp2 = "sub",
assay.type1 = "clr", assay.type2 = "log10_z",
mode = "matrix"
)
# Store the result to data container
metadata(mae)[["croscor"]] <- res
# Plot
plot <- Heatmap(
res$cor, name = "Kendall's tau",
# Print values to cells
cell_fun = function(j, i, x, y, width, height, fill) {
# If the p-value is under threshold
if( !is.na(res$p_adj[i, j]) && res$p_adj[i, j] < 0.05 ){
# Print "X"
grid.text(sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "black"))
}
},
column_names_rot = 45
)
# Adjust padding around plot so that names are visible
draw(plot, padding = unit(c(10, 40, 2, 2), "mm"))
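To make the idea of the cross-correlation analysis concrete, the sketch below correlates every feature of one table with every feature of another using Kendall's tau, and then adjusts the p-values for multiple testing. It uses only base R and simulated data. All object names are made up, and it is not meant to reproduce the internals of `testExperimentCrossAssociation`.

```r
set.seed(1)

# Toy data: rows are features, columns are the same 20 samples in both tables
n_samples <- 20
taxa <- matrix(rnorm(3 * n_samples), nrow = 3,
               dimnames = list(paste0("phylum", 1:3), NULL))
functions <- matrix(rnorm(2 * n_samples), nrow = 2,
                    dimnames = list(paste0("function", 1:2), NULL))
# Make one function track one phylum so that there is a signal to find
functions["function1", ] <- taxa["phylum1", ] + rnorm(n_samples, sd = 0.1)

# All phylum-function pairs
pairs <- expand.grid(taxon = rownames(taxa), func = rownames(functions),
                     stringsAsFactors = FALSE)

# Kendall's tau test for every pair
tests <- Map(
    function(t, f) cor.test(taxa[t, ], functions[f, ], method = "kendall"),
    pairs$taxon, pairs$func
)
pairs$cor <- vapply(tests, function(x) unname(x$estimate), numeric(1),
                    USE.NAMES = FALSE)
pairs$p_val <- vapply(tests, function(x) x$p.value, numeric(1),
                      USE.NAMES = FALSE)

# Correct for multiple testing across all pairs
pairs$p_adj <- p.adjust(pairs$p_val, method = "fdr")

pairs
```

The resulting long table holds one row per feature pair; reshaping the `cor` and `p_adj` columns into matrices would give the kind of input that the heatmap above visualizes.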
Finally, we can save the data container which contains our analysis results.
saveRDS(mae, "data/mae_results.Rds")
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 21
##
## Matrix products: default
## BLAS: /opt/R/4.3.1/lib/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Helsinki
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] doRNG_1.8.6 rngtools_1.5.2
## [3] foreach_1.5.2 scater_1.29.4
## [5] scuttle_1.11.2 tidyr_1.3.0
## [7] dplyr_1.1.3 miaViz_1.9.3
## [9] ggraph_2.1.0.9000 mia_1.9.16
## [11] MultiAssayExperiment_1.27.5 TreeSummarizedExperiment_2.9.0
## [13] Biostrings_2.69.2 XVector_0.41.1
## [15] SingleCellExperiment_1.23.0 SummarizedExperiment_1.31.1
## [17] Biobase_2.61.0 GenomicRanges_1.53.1
## [19] GenomeInfoDb_1.37.4 IRanges_2.35.2
## [21] S4Vectors_0.39.1 BiocGenerics_0.47.0
## [23] MatrixGenerics_1.13.1 matrixStats_1.0.0
## [25] knitr_1.43 ggplot2_3.4.3
## [27] ComplexHeatmap_2.17.0 ANCOMBC_2.3.1
## [29] BiocManager_1.30.22
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.1 ggplotify_0.1.2
## [3] bitops_1.0-7 tibble_3.2.1
## [5] cellranger_1.1.0 polyclip_1.10-4
## [7] rpart_4.1.19 DirichletMultinomial_1.43.0
## [9] lifecycle_1.0.3 Rdpack_2.5
## [11] doParallel_1.0.17 lattice_0.21-8
## [13] MASS_7.3-60 backports_1.4.1
## [15] magrittr_2.0.3 Hmisc_5.1-0
## [17] sass_0.4.7 rmarkdown_2.24
## [19] jquerylib_0.1.4 yaml_2.3.7
## [21] gld_2.6.6 cowplot_1.1.1
## [23] RColorBrewer_1.1-3 DBI_1.1.3
## [25] minqa_1.2.5 multcomp_1.4-25
## [27] abind_1.4-5 zlibbioc_1.47.0
## [29] expm_0.999-7 purrr_1.0.2
## [31] RCurl_1.98-1.12 TH.data_1.1-2
## [33] yulab.utils_0.0.9 nnet_7.3-19
## [35] tweenr_2.0.2 sandwich_3.0-2
## [37] circlize_0.4.15 GenomeInfoDbData_1.2.10
## [39] ggrepel_0.9.3 irlba_2.3.5.1
## [41] tidytree_0.4.5 vegan_2.6-4
## [43] permute_0.9-7 DelayedMatrixStats_1.23.4
## [45] codetools_0.2-19 DelayedArray_0.27.10
## [47] ggforce_0.4.1 energy_1.7-11
## [49] shape_1.4.6 tidyselect_1.2.0
## [51] aplot_0.2.0 farver_2.1.1
## [53] lme4_1.1-34 gmp_0.7-2
## [55] ScaledMatrix_1.9.1 viridis_0.6.4
## [57] base64enc_0.1-3 jsonlite_1.8.7
## [59] GetoptLong_1.0.5 BiocNeighbors_1.19.0
## [61] e1071_1.7-13 tidygraph_1.2.3
## [63] decontam_1.21.0 Formula_1.2-5
## [65] survival_3.5-7 iterators_1.0.14
## [67] ggnewscale_0.4.9 tools_4.3.1
## [69] treeio_1.25.4 DescTools_0.99.50
## [71] Rcpp_1.0.11 glue_1.6.2
## [73] BiocBaseUtils_1.3.2 gridExtra_2.3
## [75] SparseArray_1.1.11 xfun_0.40
## [77] mgcv_1.9-0 withr_2.5.0
## [79] numDeriv_2016.8-1.1 fastmap_1.1.1
## [81] boot_1.3-28.1 bluster_1.11.4
## [83] fansi_1.0.4 digest_0.6.33
## [85] rsvd_1.0.5 gridGraphics_0.5-1
## [87] R6_2.5.1 colorspace_2.1-0
## [89] Cairo_1.6-1 gtools_3.9.4
## [91] RSQLite_2.3.1 utf8_1.2.3
## [93] generics_0.1.3 data.table_1.14.8
## [95] DECIPHER_2.29.0 class_7.3-22
## [97] graphlayouts_1.0.0 CVXR_1.0-11
## [99] httr_1.4.7 htmlwidgets_1.6.2
## [101] S4Arrays_1.1.6 pkgconfig_2.0.3
## [103] gtable_0.3.4 Exact_3.2
## [105] Rmpfr_0.9-3 blob_1.2.4
## [107] htmltools_0.5.6 clue_0.3-64
## [109] scales_1.2.1 lmom_3.0
## [111] png_0.1-8 ggfun_0.1.2
## [113] rstudioapi_0.15.0 rjson_0.2.21
## [115] reshape2_1.4.4 checkmate_2.2.0
## [117] nlme_3.1-163 nloptr_2.0.3
## [119] GlobalOptions_0.1.2 zoo_1.8-12
## [121] proxy_0.4-27 cachem_1.0.8
## [123] stringr_1.5.0 rootSolve_1.8.2.3
## [125] parallel_4.3.1 vipor_0.4.5
## [127] foreign_0.8-84 pillar_1.9.0
## [129] vctrs_0.6.3 BiocSingular_1.17.1
## [131] beachmat_2.17.15 cluster_2.1.4
## [133] beeswarm_0.4.0 htmlTable_2.4.1
## [135] evaluate_0.21 magick_2.7.5
## [137] mvtnorm_1.2-3 cli_3.6.1
## [139] compiler_4.3.1 rlang_1.1.1
## [141] crayon_1.5.2 labeling_0.4.3
## [143] plyr_1.8.8 ggbeeswarm_0.7.2
## [145] stringi_1.7.12 viridisLite_0.4.2
## [147] BiocParallel_1.35.4 lmerTest_3.1-3
## [149] munsell_0.5.0 gsl_2.1-8
## [151] lazyeval_0.2.2 Matrix_1.6-1
## [153] patchwork_1.1.3 sparseMatrixStats_1.13.4
## [155] bit64_4.0.5 highr_0.10
## [157] rbibutils_2.2.15 igraph_1.5.1
## [159] memoise_2.0.1 bslib_0.5.1
## [161] ggtree_3.9.1 bit_4.0.5
## [163] readxl_1.4.3 ape_5.7-1