Chapter 2 Manipulating Tree with Data
2.1 Manipulating tree data using tidy interface
All the tree data parsed/merged
by treeio can be converted to tidy
data frame using the tidytree
package. The tidytree package
provides tidy interfaces to manipulate tree with associated data. For instances,
external data can be linked to phylogeny or evolutionary data obtained from
different sources can be merged using tidyverse verbs. After the tree data was
manipulated, it can be converted back to treedata
object and exported to a
single tree file, further analyzed in R or visualized using ggtree (Yu et al. 2017).
2.1.1 The phylo
object
The phylo
class defined in ape is
fundamental for phylogenetic analysis in R. Most of the R packages
in this field rely
extensively on phylo
object. The tidytree package provides as_tibble
method to convert the phylo
object to tidy data frame, a tbl_tree
object.
library(ape)
set.seed(2017)
rtree(4)
tree <- tree
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
##
## Rooted; includes branch lengths.
as_tibble(tree)
x <- x
## # A tibble: 7 x 4
## parent node branch.length label
## <int> <int> <dbl> <chr>
## 1 5 1 0.435 t4
## 2 7 2 0.674 t1
## 3 7 3 0.00202 t3
## 4 6 4 0.0251 t2
## 5 5 5 NA <NA>
## 6 5 6 0.472 <NA>
## 7 6 7 0.274 <NA>
The tbl_tree
object can be converted back to a phylo
object.
as.phylo(x)
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
##
## Rooted; includes branch lengths.
Using tbl_tree
object makes tree and data manipulation more effective and
easier. For example, we can link evolutionary trait to phylogeny using dplyr verbs full_join
:
tibble(label = paste0('t', 1:4),
d <-trait = rnorm(4))
full_join(x, d, by = 'label')
y <- y
## # A tibble: 7 x 5
## parent node branch.length label trait
## <int> <int> <dbl> <chr> <dbl>
## 1 5 1 0.435 t4 0.943
## 2 7 2 0.674 t1 -0.171
## 3 7 3 0.00202 t3 0.570
## 4 6 4 0.0251 t2 -0.283
## 5 5 5 NA <NA> NA
## 6 5 6 0.472 <NA> NA
## 7 6 7 0.274 <NA> NA
2.1.2 The treedata
object
The tidytree package defines treedata
class to store phylogenetic tree with
associated data. After mapping external data to the tree structure, the
tbl_tree
object can be converted to a treedata
object.
as.treedata(y)
## 'treedata' S4 object'.
##
## ...@ phylo:
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
##
## Rooted; includes branch lengths.
##
## with the following features available:
## 'trait'.
The treedata
class is also used
in treeio package to store
evolutionary evidences inferred by commonly used software (BEAST, EPA, HYPHY,
MrBayes, PAML, PHYLDOG, pplacer, r8s, RAxML and RevBayes, etc.) (see details in Chapter 1).
The tidytree package also provides as_tibble
to convert treedata
object
to a tidy data frame. The phylogenetic tree structure and the evolutionary
inferences were stored in the tbl_tree
object, making it consistent and easier
for manipulating evolutionary statistics inferred by different software as well
as linking external data to the same tree structure.
%>% as.treedata %>% as_tibble y
## # A tibble: 7 x 5
## parent node branch.length label trait
## <int> <int> <dbl> <chr> <dbl>
## 1 5 1 0.435 t4 0.943
## 2 7 2 0.674 t1 -0.171
## 3 7 3 0.00202 t3 0.570
## 4 6 4 0.0251 t2 -0.283
## 5 5 5 NA <NA> NA
## 6 5 6 0.472 <NA> NA
## 7 6 7 0.274 <NA> NA
2.1.3 Access related nodes
dplyr verbs can be applied to tbl_tree
directly to manipulate tree data. In addition, tidytree provides several verbs to filter related nodes, including
child
, parent
, offspring
, ancestor
, sibling
and MRCA
.
These verbs accept a tbl_tree
and a selected node which can be node number or label.
child(y, 5)
## # A tibble: 2 x 5
## parent node branch.length label trait
## <int> <int> <dbl> <chr> <dbl>
## 1 5 1 0.435 t4 0.943
## 2 5 6 0.472 <NA> NA
parent(y, 2)
## # A tibble: 1 x 5
## parent node branch.length label trait
## <int> <int> <dbl> <chr> <dbl>
## 1 6 7 0.274 <NA> NA
offspring(y, 5)
## # A tibble: 6 x 5
## parent node branch.length label trait
## <int> <int> <dbl> <chr> <dbl>
## 1 5 1 0.435 t4 0.943
## 2 7 2 0.674 t1 -0.171
## 3 7 3 0.00202 t3 0.570
## 4 6 4 0.0251 t2 -0.283
## 5 5 6 0.472 <NA> NA
## 6 6 7 0.274 <NA> NA
ancestor(y, 2)
## # A tibble: 3 x 5
## parent node branch.length label trait
## <int> <int> <dbl> <chr> <dbl>
## 1 5 5 NA <NA> NA
## 2 5 6 0.472 <NA> NA
## 3 6 7 0.274 <NA> NA
sibling(y, 2)
## # A tibble: 1 x 5
## parent node branch.length label trait
## <int> <int> <dbl> <chr> <dbl>
## 1 7 3 0.00202 t3 0.570
MRCA(y, 2, 3)
## # A tibble: 1 x 5
## parent node branch.length label trait
## <int> <int> <dbl> <chr> <dbl>
## 1 6 7 0.274 <NA> NA
All these methods also implemented in treeio for working with phylo
and treedata
objects. You can try accessing related nodes using the tree object. For instance, the following command will output child nodes of the selected internal node 5
:
child(tree, 5)
## [1] 1 6
Beware that the methods work for tree objects output related node numbers, while the methods implemented for tbl_tree
object output a tibble
object that contains related information.
2.2 Data Integration
2.2.1 Combining tree data
The treeio package serves as an infrastructure that enables various types of phylogenetic data inferred from common analysis programs to be imported and used in R. For instance dN/dS or ancestral sequences estimated by CODEML, and clade support values (posterior) inferred by BEAST/MrBayes. In addition, treeio package supports linking external data to phylogeny. It brings these external phylogenetic data (either from software output or exteranl sources) to the R community and make it available for further analysis in R. Furthermore, treeio can combine multiple phylogenetic trees together into one with their node/branch-specific attribute data. Essentially, as a result, one such attribute (e.g., substitution rate) can be mapped to another attribute (e.g., dN/dS) of the same node/branch for comparison and further computations (Yu et al. 2017).
A previously published data set, seventy-six H3 hemagglutinin gene sequences of
a lineage containing swine and human influenza A viruses
(Liang et al. 2014), was here to demonstrate the utilities of comparing
evolutionary statistics inferred by different software. The dataset was
re-analyzed by BEAST for timescale estimation
and CODEML for synonymous and
non-synonymous substitution estimation. In this example, we first parsed the
outputs from BEAST using read.beast
and
from CODEML using
read.codeml
into two treedata
objects. Then the two objects containing
separate sets of node/branch-specific data were merged via the merge_tree
function.
system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_file <- system.file("examples/rst", package="ggtree")
rst_file <- system.file("examples/mlc", package="ggtree")
mlc_file <- read.beast(beast_file)
beast_tree <- read.codeml(rst_file, mlc_file)
codeml_tree <-
merge_tree(beast_tree, codeml_tree)
merged_tree <- merged_tree
## 'treedata' S4 object that stored information of
## '/home/ygc/R/library/ggtree/examples/MCC_FluA_H3.tree',
## '/home/ygc/R/library/ggtree/examples/rst',
## '/home/ygc/R/library/ggtree/examples/mlc'.
##
## ...@ phylo:
## Phylogenetic tree with 76 tips and 75 internal nodes.
##
## Tip labels:
## A/Hokkaido/30-1-a/2013, A/New_York/334/2004, A/New_York/463/2005, A/New_York/452/1999, A/New_York/238/2005, A/New_York/523/1998, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
## 'height', 'height_0.95_HPD', 'height_median',
## 'height_range', 'length', 'length_0.95_HPD',
## 'length_median', 'length_range', 'posterior', 'rate',
## 'rate_0.95_HPD', 'rate_median', 'rate_range', 'subs',
## 'AA_subs', 't', 'N', 'S', 'dN_vs_dS', 'dN', 'dS', 'N_x_dN',
## 'S_x_dS'.
After merging the beast_tree
and codeml_tree
objects, all
node/branch-specific data imported from BEAST
and CODEML output files are
all available in the merged_tree
object. The tree object was converted to
tidy data frame using tidytree
package and visualized as hexbin scatterplot of dN/dS, dN and dS inferred
by CODEML versus rate
(substitution rate in unit of substitutions/site/year) inferred
by BEAST on the same branches.
library(dplyr)
merged_tree %>%
df <- as_tibble() %>%
select(dN_vs_dS, dN, dS, rate) %>%
subset(dN_vs_dS >=0 & dN_vs_dS <= 1.5) %>%
tidyr::gather(type, value, dN_vs_dS:dS)
$type[df$type == 'dN_vs_dS'] <- 'dN/dS'
df$type <- factor(df$type, levels=c("dN/dS", "dN", "dS"))
dfggplot(df, aes(rate, value)) + geom_hex() +
facet_wrap(~type, scale='free_y')
The output is illustrated in Fig. 2.1. We can then test the association of these node/branch-specific data using Pearson correlation, which in this case showed that dN and dS, but not dN/dS are significantly (p-values) associated with rate.
Using merge_tree
, we are able to compare analysis results using identical
model from different software packages or different models using different or
identical software. It also allows users to integrate different analysis finding
from different software packages. Merging tree data is not restricted to
software findings, associating external data to analysis findings is also
granted. The merge_tree
function is chainable and allows several tree objects
to be merged into one.
as.phylo(beast_tree)
phylo <- Nnode2(phylo)
N <- tibble(node = 1:N, fake_trait = rnorm(N), another_trait = runif(N))
d <- treedata(phylo = phylo, data = d)
fake_tree <- merge_tree(merged_tree, fake_tree)
triple_tree <- triple_tree
## 'treedata' S4 object that stored information of
## '/home/ygc/R/library/ggtree/examples/MCC_FluA_H3.tree',
## '/home/ygc/R/library/ggtree/examples/rst',
## '/home/ygc/R/library/ggtree/examples/mlc'.
##
## ...@ phylo:
## Phylogenetic tree with 76 tips and 75 internal nodes.
##
## Tip labels:
## A/Hokkaido/30-1-a/2013, A/New_York/334/2004, A/New_York/463/2005, A/New_York/452/1999, A/New_York/238/2005, A/New_York/523/1998, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
## 'height', 'height_0.95_HPD', 'height_median',
## 'height_range', 'length', 'length_0.95_HPD',
## 'length_median', 'length_range', 'posterior', 'rate',
## 'rate_0.95_HPD', 'rate_median', 'rate_range', 'subs',
## 'AA_subs', 't', 'N', 'S', 'dN_vs_dS', 'dN', 'dS', 'N_x_dN',
## 'S_x_dS', 'fake_trait', 'another_trait'.
The triple_tree
object showed above contains analysis results obtained from BEAST
and CODEML, and evolutionary
trait from external sources. All these information can be used to annotate the
tree using ggtree (Yu et al. 2017).
2.2.2 Linking external data to phylogeny
In addition to analysis findings that are associated with the tree as we showed
above, there is a wide range of heterogeneous data, including phenotypic data,
experimental data and clinical data etc., that need to be integrated and
linked to phylogeny. For example, in the study of viral evolution, tree nodes may
associated with epidemiological information, such as location, age and subtype.
Functional annotations may need to be mapped on gene trees for comparative
genomics studies. To facilitate data
integration, treeio provides
full_join
method to link external data to phylogeny and stored in either phylo
or treedata
object. Beware that linking external data to a phylo
object will produce a treedata
object to store the input phylo
with associated data. The full_join
methods can also be used at tidy data frame level (i.e. tbl_tree
object described previously) and at ggtree
level (described in session 7.1) (Yu et al. 2018).
The following example calculated bootstrap values and merging that values with the tree (a phylo
object) by matching their node numbers.
library(ape)
data(woodmouse)
dist.dna(woodmouse)
d <- nj(d)
tr <- boot.phylo(tr, woodmouse, function(x) nj(dist.dna(x))) bp <-
##
Running bootstraps: 100 / 100
## Calculating bootstrap values... done.
tibble(node=1:Nnode(tr) + Ntip(tr), bootstrap = bp)
bp2 <-full_join(tr, bp2, by="node")
## 'treedata' S4 object'.
##
## ...@ phylo:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## No305, No304, No306, No0906S, No0908S, No0909S, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'bootstrap'.
Another example demonstrates merging evolutionary trait with the tree (a treedata
object) by matching their tip labels.
system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
file <- read.beast(file)
beast <- tibble(label = as.phylo(beast)$tip.label, trait = rnorm(Ntip(beast)))
x <-full_join(beast, x, by="label")
## 'treedata' S4 object that stored information of
## '/home/ygc/R/library/treeio/extdata/BEAST/beast_mcc.tree'.
##
## ...@ phylo:
## Phylogenetic tree with 15 tips and 14 internal nodes.
##
## Tip labels:
## A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
## 'height', 'height_0.95_HPD', 'height_median',
## 'height_range', 'length', 'length_0.95_HPD',
## 'length_median', 'length_range', 'posterior', 'rate',
## 'rate_0.95_HPD', 'rate_median', 'rate_range', 'trait'.
Manipulating tree object is frustrated with the fragmented functions available
for working with phylo
object, not to mention linking external data to the
phylogeny structure. With treeio package, it is easy to combine tree data from various sources.
In addition, with tidytree package (session 2.1), manipulating tree is more easier using
tidy data principles and
consistent with tools already in wide use, including
dplyr,
tidyr,
ggplot2
and ggtree.
2.2.3 Grouping taxa
The groupOTU
and groupClade
methods are designed for adding taxa grouping
information to the input tree object. The methods were implemented in tidytree,
treeio and ggtree respectively to support adding grouping information at
tbl_tree
, phylo
and treedata
, and ggtree
levels. These grouping information can be
used directly in tree visualization (e.g. coloring tree based on grouping)
with ggtree (Figure 6.5).
2.2.3.1 groupClade
The groupClade
method accepts an internal node or a vector of internal nodes
to add grouping information of clade/clades.
'(((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14,H:8):13):13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);'
nwk <- read.tree(text=nwk)
tree <-
groupClade(as_tibble(tree), c(17, 21))
## # A tibble: 25 x 5
## parent node branch.length label group
## <int> <int> <dbl> <chr> <fct>
## 1 20 1 4 A 1
## 2 20 2 4 B 1
## 3 19 3 5 C 1
## 4 18 4 6 D 1
## 5 17 5 21 E 1
## 6 22 6 4 F 2
## 7 22 7 12 G 2
## 8 21 8 8 H 2
## 9 24 9 5 I 0
## 10 24 10 2 J 0
## # … with 15 more rows
2.2.3.2 groupOTU
set.seed(2017)
rtree(4)
tr <- as_tibble(tr)
x <-## the input nodes can be node ID or label
groupOTU(x, c('t1', 't4'), group_name = "fake_group")
## # A tibble: 7 x 5
## parent node branch.length label fake_group
## <int> <int> <dbl> <chr> <fct>
## 1 5 1 0.435 t4 1
## 2 7 2 0.674 t1 1
## 3 7 3 0.00202 t3 0
## 4 6 4 0.0251 t2 0
## 5 5 5 NA <NA> 1
## 6 5 6 0.472 <NA> 1
## 7 6 7 0.274 <NA> 1
Both groupClade()
and groupOTU()
work with tbl_tree
, phylo
and treedata
, and ggtree
objects. Here is an example of using groupOTU()
with phylo
tree object.
groupOTU(tr, c('t1', 't4'), group_name = "fake_group") %>%
as_tibble
## # A tibble: 7 x 5
## parent node branch.length label fake_group
## <int> <int> <dbl> <chr> <fct>
## 1 5 1 0.435 t4 1
## 2 7 2 0.674 t1 1
## 3 7 3 0.00202 t3 0
## 4 6 4 0.0251 t2 0
## 5 5 5 NA <NA> 1
## 6 5 6 0.472 <NA> 1
## 7 6 7 0.274 <NA> 1
Another example of working with ggtree
object can be found in session 6.5.
The groupOTU
will trace back from input nodes to most recent common ancestor.
In this example, nodes 2, 3, 7 and 6 (2 (t1) -> 7 -> 6
and 3 (t4) -> 6
) are
grouping together.
Related OTUs are grouping together and they are not necessarily within a clade. They can be monophyletic (clade), polyphyletic or paraphyletic.
list(c1=c("A", "B", "C", "D", "E"),
cls <-c2=c("F", "G", "H"),
c3=c("L", "K", "I", "J"),
c4="M")
as_tibble(tree) %>% groupOTU(cls)
## # A tibble: 25 x 5
## parent node branch.length label group
## <int> <int> <dbl> <chr> <fct>
## 1 20 1 4 A c1
## 2 20 2 4 B c1
## 3 19 3 5 C c1
## 4 18 4 6 D c1
## 5 17 5 21 E c1
## 6 22 6 4 F c2
## 7 22 7 12 G c2
## 8 21 8 8 H c2
## 9 24 9 5 I c3
## 10 24 10 2 J c3
## # … with 15 more rows
If there are conflicts when tracing back to mrca, user can set overlap
parameter to “origin” (the first one counts), “overwrite” (default, the last one
counts) or “abandon” (un-selected for grouping)7.
2.3 Rescaling Tree Branches
Phylogenetic data can be merged for joint analysis (Figure 2.1). They can be displayed on the same tree structure as more complex annotation to help visually inspection of their evolutionary patterns. All the numerical data stored in treedata
object can be used to re-scale tree branches. For example, CodeML infers dN/dS, dN and dS, all these statistics can be used as branch lengths. All these values can also be used to color the tree (session 4.3.4) and can be project to vertical dimension to create two-dimensional tree or phenogram (session 4.2.2 and Figure 4.5 and 4.11).
ggtree(merged_tree) + theme_tree2()
p1 <- ggtree(rescale_tree(merged_tree, 'dN')) + theme_tree2()
p2 <- ggtree(rescale_tree(merged_tree, 'rate')) + theme_tree2()
p3 <-
::plot_grid(p1, p2, p3, ncol=3, labels = LETTERS[1:3]) cowplot
2.4 Subsetting Tree with Data
2.4.1 Removing tips in a phylogenetic tree
Sometimes we want to remove selected tips from a phylogenetic tree. This is due to several reasons, including low sequence quality, errors in sequence assembly, an alignment error in part of the sequence and an error in phylogenetic inference etc.
Let’s say that we want to remove three tips (colored by red) from the tree (Figure 2.3A), the drop.tip()
method removes specified tips and update tree (Figure 2.3B). All associated data will be maintained in the updated tree.
system.file("extdata/NHX", "phyldog.nhx", package="treeio")
f <- read.nhx(f)
nhx <- c("Physonect_sp_@2066767",
to_drop <-"Lychnagalma_utricularia@2253871",
"Kephyes_ovata@2606431")
ggtree(nhx) + geom_tiplab(aes(color = label %in% to_drop)) +
p1 <- scale_color_manual(values=c("black", "red")) + xlim(0, 0.8)
drop.tip(nhx, to_drop)
nhx_reduced <- ggtree(nhx_reduced) + geom_tiplab() + xlim(0, 0.8)
p2 <-plot_grid(p1, p2, ncol=2, labels = c("A", "B"))
2.4.2 Subsetting tree by tip label
Tree can be large and difficult to look at only the portions of interest. The tree_subset()
function was created in treeio package to extract a subset of the tree portion while still maintaining the structure of the tree portion. The beast_tree
in Figure 2.4A is slightly crowded. Obviously, we can make the figure taller to allow more space for the labels (similar to use “Expansion” slider in FigTree
) or we can make the text smaller. However, these solutions are not always applicable when you have a lot of tips (e.g. hundreds or thousands of tips). In particular, when you are only interested in the portion of the tree around a particular tip, you certainly don’t want to explore a large tree to find centain species you are interested in.
Let’s say you are interested in tip A/Swine/HK/168/2012 from the tree (Figure 2.4A) and you want to look at the immediate relatives of this tip.
The tree_subset()
function allows for you to look at the portions of the tree that are of interest. By default, tree_subset()
will internally call groupOTU()
to assign group specified tip from the rest of other tips (Figure 2.4B). Additionally, the branch lengths and related associated data are maintained after subsetting (Figure 2.4C). The root of the tree is always anchored at zero for the subset tree by default and all the distances are relative to this root. If you want all the distances are relative to the original root, you can specify the root position (by root.position
parameter) to the root edge of the subset tree, which is the sum of branch lengths from the original root to the root of the subset tree (Figure 2.4D and E).
system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_file <- read.beast(beast_file)
beast_tree <-
ggtree(beast_tree) +
p1 = geom_tiplab() + xlim(0, 40) + theme_tree2()
tree_subset(beast_tree, "A/Swine/HK/168/2012", levels_back=4)
tree2 = ggtree(tree2, aes(color=group)) +
p2 <- scale_color_manual(values = c("black", "red")) +
geom_tiplab() + xlim(0, 4) + theme_tree2()
ggtree(tree2, aes(color=group)) +
p3 <- geom_tiplab(hjust = -.1) + xlim(0, 5) +
geom_point(aes(fill = rate), shape = 21, size = 4) +
scale_color_manual(values = c("black", "red"), guide = FALSE) +
scale_fill_continuous(low = 'blue', high = 'red') +
theme_tree2() + theme(legend.position = 'right')
ggtree(tree2, aes(color=group),
p4 <-root.position = as.phylo(tree2)$root.edge) +
geom_tiplab() + xlim(18, 24) +
scale_color_manual(values = c("black", "red")) +
theme_tree2()
ggtree(tree2, aes(color=group),
p5 <-root.position = as.phylo(tree2)$root.edge) +
geom_rootedge() + geom_tiplab() + xlim(0, 40) +
scale_color_manual(values = c("black", "red")) +
theme_tree2()
plot_grid(p2, p3, p4, p5, ncol=2, labels=LETTERS[2:5]) %>%
plot_grid(p1, ., ncol=2, labels=c("A", ""), rel_widths=c(.5, 1))
2.4.3 Subsetting tree by internal node number
If you are interesting at certain clade, you can specify the input node as an internal node number. The tree_subset()
function will take the clade as a whole and also trace back to particular levels to look at the immediate relatives of the clade (Figure 2.5A and B). We can use tree_subset
to zoom in selected portions and plot a whole tree with the portion of it, that is similar to the ape::zoom()
function to explore very large tree (Figure 2.5C and D).
tree_subset(beast_tree, node=121, levels_back=0)
clade <- tree_subset(beast_tree, node=121, levels_back=2)
clade2 <- ggtree(clade) + geom_tiplab() + xlim(0, 5)
p1 <- ggtree(clade2, aes(color=group)) + geom_tiplab() +
p2 <- xlim(0, 8) + scale_color_manual(values=c("black", "red"))
library(ape)
library(tidytree)
library(treeio)
data(chiroptera)
grep("Plecotus", chiroptera$tip.label)
nodes <- groupOTU(chiroptera, nodes)
chiroptera <-
MRCA(chiroptera, nodes)
clade <- tree_subset(chiroptera, clade, levels_back = 0)
x <-
ggtree(chiroptera, aes(colour = group)) +
p3 <- scale_color_manual(values=c("black", "red")) +
theme(legend.position = "none")
ggtree(x) + geom_tiplab() + xlim(0, 5)
p4 <-plot_grid(p1, p2, p3, p4,
ncol=2, labels=LETTERS[1:4])
2.5 Manipulating tree data for visualization
Tree visualization is supported by ggtree. Although ggtree implemented several methods for visual exploration of tree with data, you may want to do something that is not supported directly. In this case, you need to manipulate tree data with node coordination positions that used for visualization. This is quite easy with ggtree. User can use fortify()
method which internally call tidytree::as_tibble()
to convert the tree to tidy data frame and add columns of coordination positions (i.e. x, y, branch and angle) that are used to plot the tree. You can also access the data via ggtree(tree)$data
.
Here is an example to plot two trees face to face that is similar to a ape::cophyloplot()
.
library(dplyr)
library(ggtree)
rtree(30)
x <- rtree(30)
y <- ggtree(x)
p1 <- ggtree(y)
p2 <-
p1$data
d1 <- p2$data
d2 <-
## reverse x-axis and
## set offset to make the tree in the right hand side of the first tree
$x <- max(d2$x) - d2$x + max(d1$x) + 1
d2
p1 + geom_tiplab() + geom_tree(data=d2) + geom_tiplab(data = d2, hjust=1)
pp <-
bind_rows(d1, d2) %>%
dd <- filter(!is.na(label))
+ geom_line(aes(x, y, group=label), data=dd, color='grey') pp
It is quite easy to plot multiple trees and connect taxa in one figure. For instance, plotting trees contructed from all internal gene segments of influenza virus and connecting equivalent strans across the trees (Venkatesh et al. 2018).
rtree(30)
z <- fortify(y)
d2 <- fortify(z)
d3 <-$x <- d2$x + max(d1$x) + 1
d2$x <- d3$x + max(d2$x) + 1
d3
bind_rows(d1, d2, d3) %>%
dd = filter(!is.na(label))
+ geom_tree(data = d2) + geom_tree(data = d3) + geom_tiplab(data=d3) +
p1 geom_line(aes(x, y, group=label, color=node < 15), data=dd, alpha=.3)
2.6 Summary
The treeio package allows us to import diverse phylogeny associated data into R. However, phylogenetic tree is stored in way to facilitate computational processing which is not human fridenly and need expertise to manipulate and explore tree data. The tidytree package provides tidy interface for exploring tree data, while ggtree provides a set of utilitise to visualize and explore tree data using grammar of graphics. This full suit of packages make it easy for ordinary users to interact with tree data, and allow us to integrate phylogeny associated data from different sources (e.g. experimental result or analysis finding), which creates the possibility of comparative study.
References
Liang, Huyi, Tommy Tsan-Yuk Lam, Xiaohui Fan, Xinchun Chen, Yu Zeng, Ji Zhou, Lian Duan, et al. 2014. “Expansion of Genotypic Diversity and Establishment of 2009 H1N1 Pandemic-Origin Internal Genes in Pigs in China.” Journal of Virology, July, JVI.01327–14. https://doi.org/10.1128/JVI.01327-14.
Venkatesh, Divya, Marjolein J. Poen, Theo M. Bestebroer, Rachel D. Scheuer, Oanh Vuong, Mzia Chkhaidze, Anna Machablishvili, et al. 2018. “Avian Influenza Viruses in Wild Birds: Virus Evolution in a Multihost Ecosystem.” Journal of Virology 92 (15). https://doi.org/10.1128/JVI.00433-18.
Yu, Guangchuang, Tommy Tsan-Yuk Lam, Huachen Zhu, and Yi Guan. 2018. “Two Methods for Mapping and Visualizing Associated Data on Phylogeny Using Ggtree.” Molecular Biology and Evolution 35 (12): 3041–3. https://doi.org/10.1093/molbev/msy194.
Yu, Guangchuang, David K. Smith, Huachen Zhu, Yi Guan, and Tommy Tsan-Yuk Lam. 2017. “Ggtree: An R Package for Visualization and Annotation of Phylogenetic Trees with Their Covariates and Other Associated Data.” Methods in Ecology and Evolution 8 (1): 28–36. https://doi.org/10.1111/2041-210X.12628.