--- layout: page mathjax: true permalink: /lectures/lecture3/ --- ## Lecture 3: Base Calling for Second-generation sequencing Monday 4 April 2016 _Scribed by Alborz Bejnood and revised by the course staff_ ## Topics In the previous lecture, we learned about the evolution of first and second generation sequencing efforts. In this lecture, we examine the base calling step in the process of sequencing DNA: the inference of individual bases from a series of light intensity signals. 1. Base calling for Illumina sequencing 2. Error sources 3. Modeling the error source - The error model - Inference on the model - Zero-forcing - Minimum mean square error based estimation - Maximum likelihood estimation - Epilogue to the estimation story ## Base calling for Illumina sequencing We begin with a review of the sequencing by synthesis approach to sequencing. We start with DNA and fragment it into many shorter sequences (resulting in a number of fragments on the order of 100 million to 1 billion). Then on glass plates, there are a number of dots onto which these DNA fragments will stick, with the material on the plate complementary to the basis on those dots. Not every dot will have a such a fragment, but a good fraction of them will, allowing us to move towards sequencing. We recall that the solution contains two entities: ddNTP (free-flowing) and DNA polymerase (binds the ddNTP). In addition, there is the enzyme that reverses termination in the washing cycles. We note that the above is a simplified version of base calling. In reality, the fluorescent signal from a single template is too weak to be detected. Hence, first the template is converted into many clones using a technique called [polymerase chain reaction (PCR)](https://en.wikipedia.org/wiki/Polymerase_chain_reaction) (which also won its discoverer [Kary Mullis](https://en.wikipedia.org/wiki/Kary_Mullis) a Nobel prize). Before synthesis, each strand is cloned using PCR and each dot corresponds to 1000 clones per strand. Ideally one would want all copies of fragments in a dot to be synchronized going forward. Because of chemical and mechanical imperfections, some strands are lagging behind and others are leading forward in the same dot leading to a lot of "noise" in a dot. ## Error sources for Illumina sequencing The main source of errors is the asynchronicity of the various molecules in a dot. In a cycle there are molecules where two ddNTPs are attached (due to the first one being attached being a dNTP rather than a ddNTP). This is shown in pane 3 in cycle 1 in the figure below. Here the molecule bound to a dGTP rather than a ddGTP. Thus the synthesis was not terminated, and the molecule bound to a ddATP as well. This led to it emitting light corresponding to A rather than G, leading to an inconsistency with a molecule where this defect may not have occurred. This results to the strand *leading*. Another error would occur if termination is not reversed in the washing cycle. In the figure below, this is shown in pane 2 in cycle 3. Here the termination of G is not reversed. Thus this molecule emits light corresponding to G rather than light corresponding to A. This leads to an inconsistency in the corresponding output signal (compared to a molecule where this has not occured). These strands are said to be *lagging*.
An illustration showing errors in illumina sequencing, showing different molecules in a dot.
Both leading and lagging errors are called *phasing* errors. ## Modeling the error sources and doing inference ### The error model We probabilistically model the problem as follows. Let $$L$$ be the length of the molecule being sequenced. Let the original sequence be $$\mathbf{s}=s(1),s(2),\cdots,s(L).$$ Let us define sequences $$\mathbf{s}^{(A)},\mathbf{s}^{(C)}, \mathbf{s}^{(G)}, \mathbf{s}^{(T)}$$, as $$s^{(A)}(i) = \left\{ \begin{array}[cc]\\ 1 &\text{if }s(i)\text{=A}\\ 0 & \text{otherwise.} \end{array}\right.$$ Let $$\mathbf{y}^{(A)}, \mathbf{y}^{(C)}, \mathbf{y}^{(G)}, \mathbf{y}^{(T)}$$ be the intensities of the colours corresponding to A, C, G, and T respectively. We note that we make $$L$$ measurements in total (one measurement per cycle), and hence each one of $$\mathbf{y}^{(A)}, \mathbf{y}^{(C)}, \mathbf{y}^{(G)}, \mathbf{y}^{(T)}$$ is an $$L$$ length vector. We define \begin{align*} p &:= \text{probability that a molecule does not bind to a ddNTP in a cycle},\\ q &:= \text{probability that a molecule binds to two ddNTPs in a cycle},\\ 1-p-q &:= \text{probability that a molecule binds to one ddNTP in a cycle}. \end{align*} Further we define $$Q_{ij} = \text{Probability that the }j^{th}\text{ base of a molecule emits colour in the }i^{th} \text{ cycle.}$$ Now we model the relationship between $$\mathbf{y}^{(A)}$$ and $$\mathbf{s}^{(A)}$$. Here we observe that at time $$i$$, the intensity is given by $$\mathbf{y}^{(A)}(i)= b \sum_{j=1}^{L}Q_{ij} s^{(A)}(j) + n(i),$$ where $$n_i$$ is noise that we model as zero mean Gaussians $$\mathcal{N}(0,\sigma^2)$$, independent across cycles, and $$b$$ is the total intensity expected if all molecules had the same colour emitted. We note that the signal observed in cycle $$j$$ depends not only on $$s^{(A)}(j)$$ but upon all other values of $$\mathbf{s}^{(A)}$$. In statistical signal processing, this is called [_inter-symbol interference_](https://en.wikipedia.org/wiki/Intersymbol_interference). In matrix notation, this problem reduces to $$\mathbf{y}^{(A)} = b Q \mathbf{s}^{(A)} + \mathbf{n}.$$ ### Inference on the model #### Zero-forcing The most obvious inference rule would be $$\hat{\mathbf{s}}^{(A)} = \mathbb{I}\left(Q^{-1}\mathbf{y}^{(A)} > \frac{b}{2}\right),$$ where $$\mathbb{I}$$ is the usual indicator function, that is $$\mathbb{I}(x>y) \left\{ \begin{array}[cc]\\ 1 &\text{if }x>y\\ 0 & \text{otherwise.} \end{array}\right.$$ In signal processing literature, this is referred to as [Zero-forcing](https://en.wikipedia.org/wiki/Zero_forcing_equalizer). We note that \begin{align*} Q^{-1}\mathbf{y}^{(A)} = b \mathbf{s}^{(A)} + Q^{-1}\mathbf{n} = b \mathbf{s}^{(A)} + \mathbf{n}_{ZF} \end{align*} where $$\mathbf{n}_{ZF} \sim \mathcal{N}(0, \sigma^2 Q^{-1}(Q^{-1})^{T})$$ . That is, the noise across co-ordinates are not independent. This formulation is suboptimal if $$Q$$ is an [ill-conditioned matrix](https://en.wikipedia.org/wiki/Condition_number), _i.e._ if the largest singular value of Q is much larger than the smallest singular value of Q. To see this at a high level, let $$\sigma_{\max}^2, \sigma_{\min}^2$$ be the largest and the smallest singular values of $$Q$$. Next note that the noise variance of $$n_{ZF}$$ in the dimension corresponding to the largest singular value of $$Q$$ is $$\frac{\sigma^2}{\sigma_{\max}^2}$$, which is much smaller than the $$\frac{\sigma^2}{\sigma_{\min}^2}$$, the noise variance of $$n_{ZF}$$ in the dimension corresponding to the largest singular value of $$Q$$. The probability of making an error is determined to a first order by the latter which can be quite bad. #### Minimum mean square error based estimation Another popular estimator is the [Minimum mean square error (MMSE) estimator](https://en.wikipedia.org/wiki/Minimum_mean_square_error). This is basically derived by picking $$\hat{\mathbf{s}^{(A)}}$$ such that $$\mathbb{E}[(\hat{\mathbf{s}^{(A)}}-\mathbf{s}^{(A)})^2]$$ is minimised. In the standard setting when the support of $$\mathbf{s}^{(A)}$$ is not restricted to be $$\{0,1\}$$, this leads to $$\hat{\mathbf{s}^{(A)}}= \frac{1}{b} Q^T(QQ^T+\sigma^2 I)^{-1}\mathbf{y}^{(A)}$$. The MMSE-based estimator used in this setting is $$\hat{\mathbf{s}}^{(A)} = \mathbb{I}\left( Q^T(QQ^T+\sigma^2 I)^{-1}\mathbf{y}^{(A)} > \frac{b}{2}\right).$$ #### Maximum likelihood estimation The maximum-likelihood estimator for this case would involve running the [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm). This basically involves solving $$\hat{\mathbf{s}}^{(A)}=\min_{\mathbf{s} \in \{0,1\}^{L}} ||\mathbf{y}^{(A)}-Q\mathbf{s}||_2^2.$$ The time complexity of the Viterbi algorithm is exponential in the recursion depth (which in here is the maximum |i-j| if there is non-zero probability of seeing light from base j at time i). In this setting, this is $$O(2^{L})$$. A work-around this is to approximate each column of $$Q$$ so that it is non-zero only in a $$w$$ width band. This gives the run time of the maximum-likelihood to be $$O(2^{w}L)$$. This is quite managable for $$w \le 10$$. #### Epilogue to the estimation story One could estimate $$\mathbf{y}^{(A)}, \mathbf{y}^{(C)}, \mathbf{y}^{(G)}, \mathbf{y}^{(T)}$$ separately and then combine these estimates; however, this is clearly suboptimal. A better estimation procedure is to write these as a linear estimation problem and use the straight-forward generalisations of the techniques discussed above. The linear system would be $$[\mathbf{y}^{(A)}, \mathbf{y}^{(C)}, \mathbf{y}^{(G)}, \mathbf{y}^{(T)}] = b Q [\mathbf{y}^{(A)}, \mathbf{y}^{(C)}, \mathbf{y}^{(G)}, \mathbf{y}^{(T)}] + [\mathbf{n}^{(A)}, \mathbf{n}^{(C)}, \mathbf{n}^{(G)}, \mathbf{n}^{(T)}]$$ Illumina initially used an algorithm called [Bustard](http://genome.sph.umich.edu/wiki/Base_Caller_Summaries), which essentially did zero-forcing on this system. Then, algorithm [BayesCall](http://genome.cshlp.org/content/early/2009/08/06/gr.095299.109) was proposed, which essentially does the MMSE based estimation on this system. BayesCall is currently the most popular algorithm. The authors of BayesCall also proposed another algorithm called [NaiveBayesCall](http://www.ncbi.nlm.nih.gov/pubmed/21385040) which approximates the Viterbi algorithm on the above system.