\documentclass[12pt]{article} \usepackage[utf8]{inputenc} \usepackage{graphicx} \usepackage{color} \definecolor{blue1}{RGB}{0,102,204} %% \usepackage[colorlinks=true,linkcolor=blue1,citecolor=blue1,urlcolor=blue1]{hyperref} \usepackage[colorlinks=true,linkcolor=blue,citecolor=blue,urlcolor=blue]{hyperref} \usepackage{array} \usepackage[english]{babel} \usepackage{amsfonts} \usepackage{url} \usepackage{bm} \usepackage[margin=2.5cm]{geometry} \usepackage[affil-it]{authblk} \newcommand{\R}{\mathbb{R}} \newcommand{\beq}{\begin{equation}} \newcommand{\eeq}{\end{equation}} \newcommand{\m}[1]{\mathbf{#1}} \newcommand{\rcmd}[1]{\textcolor{red}{\texttt{#1}}} \newcommand{\code}[1]{{{\tt #1}}} \newcommand{\Rlogo}{\includegraphics[width=0.05\textwidth]{Rlogo.pdf}} \title{Introduction to phylogenetics using \Rlogo} \author{\textbf{Thibaut Jombart} \thanks{\texttt{thibautjombart@gmail.com}} ~\\Licence: CC BY 4.0 } \affil{{\footnotesize Imperial College London \\MRC Centre for Outbreak Analysis and Modelling}} %% \date{\today} \date{\today} \sloppy \hyphenpenalty 10000 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} \selectlanguage{english} <>= opts_chunk$set(fig.path='figs/phylo-', fig.keep='high', dev='pdf', fig.width=7, fig.height=7, tidy=FALSE, warning=FALSE, fig.show="asis", fig.align='center', out.width=".8\\textwidth") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \maketitle \vspace{-1cm} { \begin{center} \includegraphics[width=0.5\textwidth]{ygg.jpg} \end{center} } \begin{abstract} This practical aims to illustrate the basics of phylogenetic reconstruction using \Rlogo, with an emphasis on how the methods work, how their results can be interpreted, and the relative advantages and limitations of the methods. Three main classes of phylogenetic approaches are introduced, namely \textbf{distance-based}, \textbf{maximum parsimony}, and \textbf{maximum likelihood} methods. We also illustrate how to assess the reliability of individual nodes using bootstrap, and show how a simple linear model can be used to estimate a molecular clock in rooted phylogenies. Methods are illustrated using a dataset of seasonal influenza isolates sampled in the US from 1993 to 2008. \end{abstract} \newpage \tableofcontents % Pour éviter les problèmes de couleur... \color{black} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% <>= opts_chunk$set(tidy=FALSE) @ <>= options(width=80) @ \newpage %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% \subsection{Phylogenetics in a nutshell} %%%%%%%%%%%%%%%%%%% The reconstruction of evolutionary relationships of a set of organisms can be a tricky task, and has led to the development of a variety of methods over the last decades, implemented in an even larger number of software. However, these methods can be classified into three main categories: \begin{itemize} \item \textbf{distance-based methods}: compute a matrix of pairwise genetic distances between the studied taxa, and summarize it using a hierarchical clustering algorithm such as UPGMA or Neighbour-Joining. \emph{Advantages}: fast (the fastest) and flexible (different genetic distances allow to account for different features of DNA sequence evolution). \emph{Limitations}: no model comparison (can't test for the 'best' tree, or the 'best' model of evolution); may be inaccurate and highly dependent on the distance and clustering algorithm chosen. \item \textbf{maximum parsimony}: seeks the tree with the smallest number of overall genetic changes between the taxa. This is achieved by changing randomly the topology of the tree until parsimony is no longer improved. \emph{Advantages}: intuitive interpretation (assumes that the simplest scenario is the most likely), usually accurate when the amount of genetic changes is small. \emph{Limitations}: computer-intensive, simplistic model of evolution, no model comparison, inaccurate when substantial evolution takes place, and when heterogeneous mutation rates exist in different parts of the tree. \item \textbf{likelihood-based method}: based on a model of sequence evolution which allows to compute a likelihood, that is, the probability of observing the data given the model and a set of parameters. There are two main branches of likelihood-based methods: maximum likelihood and Bayesian methods. The first seeks the 'best' tree and parameter values, i.e. the one maximizing the likelihood. The second derives samples of trees and model parameters which are the most consistent with the data and possible prior knowledge about the tree/parameters. \emph{Advantages}: flexible (any model of evolution can be used), usually accurate, model selection possible, measure of uncertainty (in Bayesian approaches). \emph{Limitations}: computer-intensive, model selection possibly cumbersome. \end{itemize} The \Rlogo~ software implements one of the largest selections of phylogenetic methods, including all of the above except for Bayesian reconstruction. %%%%%%%%%%%%%%%%%%% \subsection{Required packages} %%%%%%%%%%%%%%%%%%% This practical requires a working version of \Rlogo~ \cite{np145} greater than or equal to 2.15.2. It uses the following packages: \textit{stats} implements basic hierarchical clustering routines, \textit{ade4} \cite{tj548} and \textit{adegenet} \cite{tjart05} are here used essentially for their graphics, \textit{ape} \cite{tj527} is the core package for phylogenetics, and \textit{phangorn} \cite{tj843} implements parsimony and likelihood-based methods. Make sure that the dependencies are installed as well when installing the packages: <>= install.packages("adegenet", dep=TRUE) install.packages("phangorn", dep=TRUE) @ Then load the packages using: <<>>= library(stats) library(ade4) library(ape) library(adegenet) library(phangorn) @ %%%%%%%%%%%%%%%%%%% \subsection{The data} %%%%%%%%%%%%%%%%%%% The data used in this practical are DNA sequences of seasonal influenza (H3N2) downloaded from Genbank (\url{http://www.ncbi.nlm.nih.gov/genbank/}). Alignments have been realized beforehand using standard tools (Clustalw2 for basic alignment and Jalview for refining the results). We selected 80 isolates genotyped for the hemagglutinin (HA) segment sampled in the US from 1993 to 2008. The dataset consists of two files: i) \texttt{usflu.fasta}, a file containing aligned DNA sequences and ii) \texttt{usflu.annot.csv}, a comma-separated file containing useful annotations of the sequences. Both files are available online from the \textit{github} page: \url{https://github.com/reconhub/phylo-practical}. We first download these files and store them in a \texttt{data/} folder; note that \texttt{paste0} is only used here so that you can see the entire path name in the document: <>= if (!dir.exists("data")) dir.create("data") ## get annotations annot.url <- paste0("https://raw.githubusercontent.com/reconhub/", "phylo-practical/master/data/usflu.annot.csv") download.file(annot.url, destfile = "data/usflu.annot.csv", method = "curl") ## get DNA sequences dna.url <- paste0("https://raw.githubusercontent.com/reconhub/", "phylo-practical/master/data/usflu.fasta") download.file(dna.url, destfile = "data/usflu.fasta", method = "curl") @ To read the DNA sequences into R, we use \texttt{fasta2DNAbin} from the \textit{adegenet} package: <<>>= dna <- fasta2DNAbin(file="data/usflu.fasta") dna class(dna) @ Sequences are stored as \texttt{DNAbin} objects, an efficient representation of DNA/RNA sequences which use bytes (as opposed to character strings) to code nucleotides, resulting in considerable savings in terms of memory required to store the data. While the present dataset is very small, such compression can become essential for larger genomes (bacterial genomes are typically a few millions of nucleotides long). Note that for even larger datasets, more efficient data reduction can be achieved using the bit-level coding of polymorphic sites implemented in \textit{adegenet} \cite{tjart23}. %% For instance, the first 10 nucleotides of the first 5 isolates: %% <<>>= %% as.character(dna)[1:5,1:10] %% @ %% are actually coded as \texttt{raw} bytes: %% <<>>= %% unclass(dna)[1:5,1:10] %% typeof(unclass(dna)[1:5,1:10]) %% @ %% This results in significant savings in terms of memory required to represent the data: %% <<>>= %% object.size(as.character(dna))/object.size(dna) %% @ %% While this dataset is very small, such compression can become essential for larger genomes %% (bacterial genomes can be several millions of nucleotides long). %% Note that for even larger datasets, more efficient data reduction can be achieved using the bit-level coding of %% polymorphic sites implemented in \textit{adegenet} \cite{tjart23}. ~\\ The annotation file is read in R using the standard procedure: <<>>= annot <- read.csv("data/usflu.annot.csv", header=TRUE, row.names=1, stringsAsFactors=FALSE) head(annot) @ \texttt{accession} contains the Genbank accession numbers, which are unique sequence identifiers; \texttt{year} is the year of collection of the isolates; \texttt{misc} contains other possibly useful information. Before going further, we check that isolates are identical in both files (accession numbers are used as labels for the sequences): <<>>= dim(dna) dim(annot) all(annot$accession==rownames(dna)) table(annot$year) @ Good! The data we will analyse are 80 isolates (5 per year) typed for the same 1701 nucleotides. \newpage %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% \section{Distance-based phylogenies} %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% Distance-based phylogenetic reconstruction consists in i) computing pairwise genetic distances between individuals (here, isolates), ii) representing these distances using a tree, and iii) evaluating the relevance of this representation. %%%%%%%%%%%%%%%%%%% \subsection{Computing genetic distances} %%%%%%%%%%%%%%%%%%% We first compute genetic distances using \textit{ape}'s \texttt{dist.dna}, which proposes no less than 15 different genetic distances (see \texttt{?dist.dna} for details). Here, we use Tamura and Nei 1993's model \cite{tj841} which allows for different rates of transitions and transversions, heterogeneous base frequencies, and between-site variation of the substitution rate. <<>>= D <- dist.dna(dna, model="TN93") class(D) length(D) @ \texttt{D} is an object of class \texttt{dist} which contains the distances between every pairs of sequences. Now that genetic distances between isolates have been computed, we need to visualize this information. There are $n(n-1)/2$ distances for $n$ sequences; here, $n=80$ so that the genetic relationships between the sampled isolates are described by $80 \times 79 / 2 = 3160$ pairwise distances. Most of the time, summarising such information is not entirely trivial. The simplest approach is plotting directly the matrix of pairwise distances: <<>>= temp <- as.data.frame(as.matrix(D)) table.paint(temp, cleg=0, clabel.row=.5, clabel.col=.5) @ \noindent Darker shades of grey represent greater distances. Note that to use \texttt{image} to produce similar plots, data need to be transformed first; for instance: <<>>= temp <- t(as.matrix(D)) temp <- temp[,ncol(temp):1] @ <<>>= par(mar=c(1,5,5,1)) image(x=1:80, y=1:80, temp, col=rev(heat.colors(100)), xaxt="n", yaxt="n", xlab="",ylab="") axis(side=2, at=1:80, lab=rev(rownames(dna)), las=2, cex.axis=.5) axis(side=3, at=1:80, lab=rownames(dna), las=3, cex.axis=.5) @ \noindent (see \texttt{image.plot} in the package \textit{fields} for similar plots with a legend). Since the data are roughly ordered by year, we can already see some genetic structure appearing, but this is admittedly not the most satisfying or informative approach, and tells us little about the evolutionary relationships between our isolates. %%%%%%%%%%%%%%%%%%% \subsection{Building trees} %%%%%%%%%%%%%%%%%%% We use trees to get a better representation of the genetic distances between individuals. It is important, however, to bear in mind that the obtained trees are not necessarily efficient representations of the original distances, and information can ---and likely will--- be lost in the process. A wide array of algorithms for constructing trees from a distance matrix are available in \Rlogo, including: \begin{itemize} \item \texttt{nj} (\textit{ape} package): the classical Neighbor-Joining algorithm. \item \texttt{bionj} (\textit{ape}): an improved version of Neighbor-Joining. \item \texttt{fastme.bal} and \texttt{fastme.ols} (\textit{ape}): minimum evolution algorithms. \item \texttt{hclust} (\textit{stats}): classical hierarchical clustering algorithms including single linkage, complete linkage, UPGMA, and others. \end{itemize} ~\\ Here, we go for the standard: <>= tre <- nj(D) class(tre) tre <- ladderize(tre) tre plot(tre, cex=.6) title("A simple NJ tree") @ \noindent Trees created in the package \textit{ape} are instances of the class \texttt{phylo}. See \texttt{?read.tree} for a description of this class. %%%%%%%%%%%%%%%%%%% \subsection{Plotting trees} %%%%%%%%%%%%%%%%%%% The plotting method offers many possibilities for plotting trees; see \texttt{?plot.phylo} for more details. Functions such as \texttt{tiplabels}, \texttt{nodelabels}, \texttt{edgelabels} and \texttt{axisPhylo} can also be useful to annotate trees. For instance, we may simply represent years using different colors (red=ancient; blue=recent): <>= plot(tre, show.tip=FALSE) title("Unrooted NJ tree") myPal <- colorRampPalette(c("red","yellow","green","blue")) tiplabels(annot$year, bg=num2col(annot$year, col.pal=myPal), cex=.5) temp <- pretty(1993:2008, 5) legend("bottomleft", fill=num2col(temp, col.pal=myPal), leg=temp, ncol=2) @ \noindent This illustrates a common mistake when interpreting phylogenetic trees. In the above figures, we tend to assume that the left-side of the phylogeny is `ancestral', while the right-side is `recent'. This is wrong ---as suggested by the colors--- unless the phylogeny is actually rooted, i.e. some external taxa has been used to define what is the most `ancient' split in the tree. The present tree is not rooted, and should be better represented as such: <<>>= plot(tre, type="unrooted", show.tip=FALSE) title("Unrooted NJ tree") tiplabels(tre$tip.label, bg=num2col(annot$year, col.pal=myPal), cex=.5) @ In the present case, a sensible rooting would be any of the most ancient isolates (from 1993). We can take the first one: <<>>= head(annot) tre2 <- root(tre, out=1) tre2 <- ladderize(tre2) @ and plot the result: <>= plot(tre2, show.tip=FALSE, edge.width=2) title("Rooted NJ tree") tiplabels(tre$tip.label, bg=transp(num2col(annot$year, col.pal=myPal),.7), cex=.5, fg="transparent") axisPhylo() temp <- pretty(1993:2008, 5) legend("topright", fill=transp(num2col(temp, col.pal=myPal),.7), leg=temp, ncol=2) @ \noindent The phylogeny is now rooted. The shape of this tree is typical of influenza. What can you say about the evolution of influenza and the fitness of different viral lineages, based on this tree? What does the ``trunk'' of this tree represent? Would there be any interest in predicting the genome of the trunk? \\ %%%%%%%%%%%%%%%%%%% \subsection{Estimating a molecular clock} %%%%%%%%%%%%%%%%%%% Rooted trees are also useful for assessing the rate of evolution of a given gene. We call \emph{molecular clock} the accumulation of mutations over time. Can you visually assess if there are signs of a molecular clock in this tree? A quantitative analysis is very easy to perform, and merely relies on regressing the number of mutations from the root to the time of divergence from the root: <>= mutFromRoot <- as.matrix(dist.dna(dna, model="N"))[1,] yearFromRoot <- annot$year-annot$year[1] plot(mutFromRoot~yearFromRoot, xlab="Years from the root", ylab="Mutations from the root", main="H3N2 molecular clock") lm.clock <- lm(mutFromRoot~-1+yearFromRoot) abline(lm.clock, col="blue",lwd=2) summary(lm.clock) lm.clock$coefficients lm.clock$coefficients/ncol(dna) 365/ lm.clock$coefficients @ What is the substitution rate per year for the HA segment? What is the substitution rate per year and per site? On average, how many days would you expect to wait before observing one mutation on a transmission chain? Knowing that the generation time of influenza is roughly around 2-3 days, would you recommend using HA sequences for reconstructing transmission trees of influenza epidemics? What alternative would you suggest? %%%%%%%%%%%%%%%%%%% \subsection{Assessing the quality of a phylogeny} %%%%%%%%%%%%%%%%%%% Many genetic distances and hierarchical clustering algorithms can be used to build trees; not all of them are appropriate for a given dataset. Genetic distances rely on hypotheses about the evolution of DNA sequences which should be taken into account. For instance, the mere proportion of differing nucleotides between sequences (\texttt{model='raw'} in \texttt{dist.dna}) is easy to interprete, but only makes sense if all substitutions are equally frequent. In practice, simple yet flexible models such as that of Tamura and Nei (1993, \cite{tj841}) are probably fair choices. At the very least, the genetic distance used should allow different rates for transitions ($a \leftrightarrow g$, $c \leftrightarrow t$) and transversions (other changes). \\ Once one has chosen an appropriate genetic distance and built a tree using this distance, an essential yet most often overlooked question is whether this tree actually is a good representation of the original distance matrix. This is easily investigated using simple biplots and correlation indices. The function \texttt{cophenetic} is used to compute distances between the tips of the tree. Note that more distances are available in the \textit{adephylo} package (see \texttt{distTips} function). <<>>= x <- as.vector(D) y <- as.vector(as.dist(cophenetic(tre2))) plot(x, y, xlab="original pairwise distances", ylab="pairwise distances on the tree", main="Is NJ appropriate?", pch=20, col=transp("black",.1), cex=3) abline(lm(y~x), col="red") cor(x,y)^2 @ \noindent As it turns out, our Neighbor-Joining tree (\texttt{tre2}) is a very good representation of the chosen genetic distances. Things would have been different had we chosen, for instance, UPGMA: <<>>= tre3 <- as.phylo(hclust(D,method="average")) y <- as.vector(as.dist(cophenetic(tre3))) plot(x, y, xlab="original pairwise distances", ylab="pairwise distances on the tree", main="Is UPGMA appropriate?", pch=20, col=transp("black",.1), cex=3) abline(lm(y~x), col="red") cor(x,y)^2 @ \noindent In this case, UPGMA is a poor choice. Why is this? A first explanation is that UPGMA forces ultrametry (all the tips are equidistant to the root): <>= plot(tre3, cex=.5) title("UPGMA tree") @ \noindent The underlying assumption is that all lineages have undergone the same amount of evolution, which is obviously not the case in seasonal influenza sampled over 16 years. \\ Another validation of phylogenetic trees, much more commonly used, is bootstrap. Bootstrapping a phylogeny consists in sampling the nucleotides with replacement, rebuilding the phylogeny, and checking if the original nodes are present in the bootstrapped trees. In practice, this procedure is repeated a large number of times (e.g. 100, 1000), depending on how computer-intensive the phylogenetic reconstruction is. The underlying idea is to assess the variability in the obtained topology which results from conducting the analyses on a random sample the genome. Note that the assumption that the analysed sequences represent a random sample of the genome is often dubious. For instance, this is not the case in our toy dataset, since HA segment has a different rate of evolution and experiences different selective pressures from other segments of the influenza genome. We nonetheless illustrate the procedure, implemented by \texttt{boot.phylo}: <>= myBoots <- boot.phylo(tre2, dna, function(e) root(nj(dist.dna(e, model = "TN93")),1)) myBoots @ The output gives the number of times each node was identified in bootstrapped analyses (the order is the same as in the original object). It is easily represented using \texttt{nodelabels}: <>= plot(tre2, show.tip=FALSE, edge.width=2) title("NJ tree + bootstrap values") tiplabels(frame="none", pch=20, col=transp(num2col(annot$year, col.pal=myPal),.7), cex=3, fg="transparent") axisPhylo() temp <- pretty(1993:2008, 5) legend("topright", fill=transp(num2col(temp, col.pal=myPal),.7), leg=temp, ncol=2) nodelabels(myBoots, cex=.6) @ \noindent As we can see, some nodes are very poorly supported. One common practice is to collapse these nodes into multifurcations. There is no dedicated method for this in \textit{ape}, but one simple workaround consists in setting the corresponding edges to a length of zero (here, with bootstrap $<$ 70\%), and then collapsing the small branches: <<>>= temp <- tre2 N <- length(tre2$tip.label) toCollapse <- match(which(myBoots<70)+N, temp$edge[,2]) temp$edge.length[toCollapse] <- 0 tre3 <- di2multi(temp, tol=0.00001) @ The new tree might be slightly less informative, but more robust than the previous one: <>= plot(tre3, show.tip=FALSE, edge.width=2) title("NJ tree after collapsing weak nodes") tiplabels(tre3$tip.label, bg=transp(num2col(annot$year, col.pal=myPal),.7), cex=.5, fg="transparent") axisPhylo() temp <- pretty(1993:2008, 5) legend("topright", fill=transp(num2col(temp, col.pal=myPal),.7), leg=temp, ncol=2) @ \newpage %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% \section{Maximum parsimony phylogenies} %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% \subsection{Introduction} %%%%%%%%%%%%%%%%%%% Phylogenetic reconstruction based on parsimony seeks trees which minimize the total number of changes (substitutions) from ancestors to descendents. While a number of criticisms can be made to this approach, it is a simple way to infer phylogenies for data which display low divergence (i.e. most taxa differ from each other by only a few nucleotides, and the overall substitution rate is low). \\ In practice, there is often no way to perform an exhaustive search amongst all possible trees to find the most parsimonious one, and heuristic algorithms are used to browse the space of possible trees. The strategy is fairly simple: i) initialize the algorithm using a tree and ii) make small changes to the tree and retain those leading to better parsimony, until the parsimony score stops improving. %%%%%%%%%%%%%%%%%%% \subsection{Implementation} %%%%%%%%%%%%%%%%%%% Parsimony-based phylogenetic reconstruction is implemented in the package \textit{phangorn}. It requires a tree (in \textit{ape}'s format, i.e. a \texttt{phylo} object) and the original DNA sequences in \textit{phangorn}'s own format, \texttt{phyDat}. We convert the data and generate a tree to initialize the method: <<>>= dna2 <- as.phyDat(dna) class(dna2) dna2 tre.ini <- nj(dist.dna(dna,model="raw")) tre.ini @ The parsimony of a given tree is given by: <<>>= parsimony(tre.ini, dna2) @ Then, optimization of the parsimony is achieved by: <<>>= tre.pars <- optim.parsimony(tre.ini, dna2) tre.pars @ Here, the final result is very close to the original tree. The obtained tree is unrooted and does not have branch lengths, but it can be plotted as previously: <>= plot(tre.pars, type="unr", show.tip=FALSE, edge.width=2) title("Maximum-parsimony tree") tiplabels(tre.pars$tip.label, bg=transp(num2col(annot$year, col.pal=myPal),.7), cex=.5, fg="transparent") temp <- pretty(1993:2008, 5) legend("bottomright", fill=transp(num2col(temp, col.pal=myPal),.7), leg=temp, ncol=2, bg=transp("white")) @ In this case, parsimony gives fairly consistent results with other approaches, which is only to be expected whenever the amount of divergence between the sequences is fairly low, as is the case in our data. \newpage %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% \section{Maximum likelihood phylogenies} %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% \subsection{Introduction} %%%%%%%%%%%%%%%%%%% Maximum likelihood phylogenetic reconstruction is somehow similar to parsimony methods in that it browses a space of possible tree topologies looking for the 'best' tree. However, it offers far more flexibility in that any model of sequence evolution can be taken into account. Given one model of evolution, one can compute the likelihood of a given tree, and therefore optimization procedures can be used to infer both the most likely tree topology and model parameters. As in distance-based methods, model-based phylogenetic reconstruction requires thinking about which parameters should be included in a model. Usually, all possible substitutions are allowed to have different rates, and the substitution rate is allowed to vary across sites according to a gamma distribution. We refer to this model as $\mbox{GTR} + \Gamma(4)$ (GTR: global time reversible). More information about phylogenetic models can be found in \cite{tj842}. \\ %%%%%%%%%%%%%%%%%%% \subsection{Getting a ML tree} %%%%%%%%%%%%%%%%%%% Likelihood-based phylogenetic reconstruction is implemented in the package \textit{phangorn}. As in the previous section, we use the data \texttt{dna2}, converted into \textit{phangorn}'s format. We choose a Neighbor-Joining tree of Tamura and Nei's 1993 distance to get an initial tree. <<>>= class(dna2) dna2 tre.ini <- nj(dist.dna(dna,model="TN93")) tre.ini @ %% <<>>= %% pml(tre.ini, dna2, k=4) %% @ %% The computed likelihood is \texttt{NA}, which is obviously a bit of a problem, but a likely frequent issue. %% This issue is due to missing data (\texttt{NAs}) and ambiguous sites in the original dataset: %% <<>>= %% table(as.character(dna2)) %% @ %% We therefore need to remove missing data before going further. %% We first retrieve the position of missing data, i.e. any data differing from 'a', 'g','c' and 't'. %% <<>>= %% na.posi <- which(apply(as.character(dna),2, function(e) %% any(!e %in% c("a","t","g","c")))) %% @ %% We can easily plot the number of missing data for each site: %% <<>>= %% temp <- apply(as.character(dna),2, function(e) %% sum(!e %in% c("a","t","g","c"))) %% plot(temp, type="l", col="blue", xlab="Position in HA segment", %% ylab="Number of NAs") %% @ %% The begining of the alignment is guilty for most of the missing data, which was only to be expected %% (extremities of the sequences have variable length). %% <<>>= %% dna3 <- dna[,-na.posi] %% dna3 %% table(as.character(dna3)) %% dna4 <- as.phyDat(dna3) %% @ %% The object \texttt{dna3} is an alignment of all sequences excluding missing data; \texttt{dna4} is %% its conversion in \texttt{phyDat} format. %% %%%%%%%%%%%%%%%%%%% %% \subsection{Getting a ML tree} %% %%%%%%%%%%%%%%%%%%% This tree is most likely not the ML tree, but we need it as a 'reasonable' starting point to initialize the optimization procedure. The likelihood of this initial tree is computed using \texttt{pml}: <<>>= fit.ini <- pml(tre.ini, dna2, k=4) fit.ini @ \noindent We now have all the information needed for seeking a maximum likelihood solution using \texttt{optim.pml}; we specify that we want to optimize tree topology (\texttt{optNni=TRUE}), base frequencies (\texttt{optBf=TRUE}), the rates of all possible subtitutions (\texttt{optQ=TRUE}), and use a gamma distribution to model variation in the substitution rates across sites (\texttt{optGamma=TRUE}): <>= fit <- optim.pml(fit.ini, optNni=TRUE, optBf=TRUE, optQ=TRUE, optGamma=TRUE) @ <<>>= fit class(fit) names(fit) @ \texttt{fit} is a list with class \texttt{pml} storing various useful information about the model parameters and the optimal tree (stored in \texttt{fit\$tree}). In this example, we can see from the output that transitions ($a \leftrightarrow g$ and $c \leftrightarrow t$) are much more frequent than transversions (other changes), which is consistent with biological expectations (transversions induce more drastic changes of chemical properties of the DNA and are more prone to purifying selection). One advantage of using probabilistic models of evolution is that different models can be compared formally. For instance, here, we can verify that the optimized tree is indeed better than the original one using standard likelihood ratio tests and AIC: <<>>= anova(fit.ini, fit) AIC(fit.ini) AIC(fit) @ Both the likelihood ratio test (highly significant, function \texttt{anova}) and the AIC (lower=better) indicate that the new tree is a better model of the data than the initial one. \\ We can extract and plot the tree as we did before with other methods: <>= tre4 <- root(fit$tree,1) tre4 <- ladderize(tre4) plot(tre4, show.tip=FALSE, edge.width=2) title("Maximum-likelihood tree") tiplabels(annot$year, bg=transp(num2col(annot$year, col.pal=myPal),.7), cex=.5, fg="transparent") axisPhylo() temp <- pretty(1993:2008, 5) legend("topright", fill=transp(num2col(temp, col.pal=myPal),.7), leg=temp, ncol=2) @ This tree is statistically better than the original NJ tree based on Tamura and Nei's distance \cite{tj841}. However, we can note that it is remarkably similar to the 'robust' version of this distance-based tree (after collapsing weakly supported nodes). The structure of this dataset is fairly simple, and all methods give fairly consistent results. In practice, different methods can lead to different interpretations, and it is often worth exploring different approaches before drawing conclusions on the data. %% \newpage %% %%%%%%%%%%%%%%%%%%% %% %%%%%%%%%%%%%%%%%%% %% \section{Supplementary figures} %% %%%%%%%%%%%%%%%%%%% %% %%%%%%%%%%%%%%%%%%% %% \begin{figure}[htp] %% { %% \centering %% \includegraphics[width=.9\textwidth]{figs/clustering.pdf} %% } %% ~\\ %% ~\\ %% { %% \footnotesize \textbf{Figure S1: distance-based phylogenetic reconstruction.} Pairwise %% genetic distances between the sampled pathogens (represented as star-like bugs) are computed from %% aligned DNA sequences. The resulting distance matrix is summarized using a hierarchical clustering %% algorithm such as UPGMA or NJ, which aggregate taxa until a complete tree is reconstructed. %% } %% \end{figure} %% \newpage %% \begin{figure}[htp] %% { %% \centering %% \includegraphics[width=.9\textwidth]{figs/parsimony5.pdf} %% } %% ~\\ %% ~\\ %% { %% \footnotesize \textbf{Figure S2: maximum-parsimony phylogenetic reconstruction.} The method %% researches the tree with the smallest total number of genetic changes (thick bars). This is %% achieved by permuting nodes randomly (dotted arrows) and retaining configurations with improved parsimony; the %% algorithm stops when the parsimony score can no longer be improved. %% } %% \end{figure} %% \newpage %% \begin{figure}[htp] %% { %% \centering %% \includegraphics[width=.9\textwidth]{figs/phyloML.pdf} %% } %% ~\\ %% ~\\ %% { %% \footnotesize \textbf{Figure S3: likelihood-based phylogenetic reconstruction.} %% Maximum-likelihood (ML) and Bayesian approaches use heuristic methods to find trees and parameters which are consistent with %% the data (blue background), while overlooking less plausible topologies (grey background). In ML methods, the tree %% and parameters maximizing the likelihood function are retained, while Bayesian approaches provide %% samples of trees and parameters consistent with the posterior distribution. %% } %% \end{figure} %% \newpage %% \begin{figure}[htp] %% { %% \centering %% \includegraphics[width=.9\textwidth]{figs/bootstrap.pdf} %% } %% ~\\ %% ~\\ %% { %% \footnotesize \textbf{Figure S4: bootstrapping phylogenies.} Bootstrapping a phylogeny %% consists in comparing a reference tree to a set of trees obtained by re-sampling the data. Each %% re-sampled dataset yields a phylogeny which can be compared to the reference. If the original %% dataset is a random sample of the genome, then the variability observed across bootstrapped trees %% reflects the variability of the genome. %% } %% \end{figure} \newpage \bibliographystyle{plain} \bibliography{biblioTJ} \end{document}