--- title: "Combinatorics" author: "Jacques van Helden" date: "`r Sys.Date()`" output: slidy_presentation: fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_md: yes self_contained: no smaller: yes theme: cerulean toc: yes widescreen: yes powerpoint_presentation: fig_caption: yes fig_height: 6 fig_width: 7 slide_level: 2 toc: no html_document: fig_caption: yes highlight: zenburn self_contained: no theme: cerulean toc: yes beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 ioslides_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango self_contained: no smaller: yes toc: yes widescreen: yes font-import: http://fonts.googleapis.com/css?family=Risque font-family: Garamond subtitle: Probabilities and statistics for bioinformatics (STAT1) editor_options: chunk_output_type: console transition: linear --- ```{r include=FALSE, echo=FALSE, eval=TRUE} library(knitr) options(width = 300) knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.path = 'figures/02_combinatorics_', fig.align = "center", size = "tiny", echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") # knitr::asis_output("\\footnotesize") ## Library used to draw flow charts library(DiagrammeR) ``` # Enumerating oligonucleotides and oligopeptides ## Problem DNA is composed of 4 nucleotides denoted by the letters $A$, $C$, $G$, $T$. Proteins are made of 20 amino acids. a. For each one of these two types of macromolecules, how many distinct oligomers can be formed by polymerizing 30 residues ("30-mers") ? **Suggested approach**: start by addressing a simpler form of the same problem, by starting with polymers of much smaller sizes: 1, then 2 residues, ... b. Generalize the formula for oligomers of an arbitrary size $k$ (so-called **k-mers** in the domain), made of $n$ distinct residues. c. What is the name of the function resulting from this analysis? d. In this process, which mode did you use to pick up the residues: **with** or **without replacement**? ## Solution: enumeration of oligomers - The underlying process is a **drawing with replacement**: at each position of the sequence, we can choose any of the $n$ residues ($n=4$ for nucleotidic sequences, $n=20$ for peptidic sequences). - Progressive approach of the solution - Trivial case: single-residue sequence $\rightarrow$ there are exactly $n$ possibilities. - Two-residue sequences: for each of the $n$ possible residues at the first position, we can select $n$ resodies for the second one $\rightarrow$ there are $n \cdot n = n^2$ possible dimers. - Trimers: for each of these dimers, there are $n$ possible residues that can be chosen for the$3^{d}$ position $\rightarrow$ there are $n^2 \cdot n = n^3$ distinct trinucleotides. - Generalization to $k$-mers: there are $n^k$ distinct sequences of size $k$. - In our case, the sequence size was $k=30$. We have thus - $N = n^k = 4^{30} = `r signif(digits=3, 4^30)`$ distinct oligonucleotides sequences - $N = n^k = 20^{20} = `r signif(digits=3, 20^30)`$ distinct oligopeptide sequences - If we consider the succession of numbers obtained for increasing oligomer sizes $k=1, 2, \cdot$ we observe a **geometric progression**. ## The geometric progression The **geometric progression** is a succession of numbers where each term can be computed by multiplying the previous one by a constant factor. $$x_i = x_{i-1} \cdot n$$ For a large size of $k$ the formula can be developed. $$\begin{aligned} x_k &= x_{k-1} \cdot n \\ &= (x_{k-2} \cdot n) \cdot n = x_{k-2} \cdot n^2 \\ &= x_{k-3} \cdot n^3 = \ldots = x_0 \cdot n^k \end{aligned}$$ In our case, the initial value is $x_0=1$; $k$ denotes the oligomer size, and $n$ is the number of distinct residues used to form the oligomer ($n=4$ for nucleic acids, $n=20$ for amino acids). ## Number of oligomers ```{r number_distinct_oligos, out.width = "60%", fig.width=8, fig.height=6, echo=FALSE, fig.cap="Number of possible oligonucleotides (top) and oligopeptides (bottom) with either a linear (left) and logarithmic (right) scale for the ordinate."} x <- 1:20 ## Oligomer size par.ori <- par(no.readonly = TRUE) par(mfrow = c(2,2)) n <- 4 ## Number of distinct residues N <- n^x plot(x, N, main = "Number of distinct oligonucleotides", xlab = "Oligomer size", ylab = "N", las = 1, cex.axis = 0.8, type = "h", lwd = 3, col = "blue" ) plot(x, N, main = "Number of distinct oligonucleotides", xlab = "Oligomer size", ylab = "N (log scale)", las = 1, cex.axis = 0.8, log = "y", type = "p", pch = 19, col = "blue" ) ## Oliopeptides n <- 20 ## Number of distinct residues N <- n^x plot(x, N, main = "Number of distinct oligopeptides", xlab = "Oligomer size", ylab = "N", las = 1, cex.axis = 0.8, type = "h", lwd = 3, col = "blue" ) plot(x, N, main = "Number of distinct oligopeptides", xlab = "Oligomer size", ylab = "N (log scale)", las = 1, cex.axis = 0.8, log = "y", type = "p", pch = 19, col = "blue" ) par(par.ori) ## restaurer les paramètres par défaut ``` ## Exercise 02.1: oligomers with no repeated residue How many oligomers can be formed (DNA or peptides) that would contain exactly once each residue. **Suggested approach**: progressively aggregate the residues whilst wondering, at each step, bow many residues have not yet been incorporated in the sequence. **Sub-questions**: - Generalize the formula for sequences of items of any type, drawn from a set of arbitrary size $n$. - What is the name of the corresponding function? - In this process, what is the mode of residue selection: **with** or **without replacement**? ## Solution: oligomers with no repeated residue - First residue: $n$ possibilities. - As soon as the first residue has been chosen, there are only $n-1$ possibilities to draw a different residue for the second position. We thus have $n \cdot (n-1)$ possible sequences for the first two residues. - For the third position, there are only $n-2$ residues left; We thus have $n \cdot (n-1) \cdot (n-2)$ possibilities for the 3 first positions of the sequence. - By extension of this reasoning, the total number of possible solutions (assuming $n$ is not too small) will be $$n! = n \cdot (n-1) \cdot \ldots \cdot 2 \cdot 1$$ - In our case: - $n! = 4! = `r signif(digits=3, factorial(4))`$ oligonuclEotides includinf each nucleotide exactly once (size 4). - $n! = 20! = `r signif(digits=3, factorial(20))`$ oligopeptides (size 20). ## The factorial function - Enumerates the number of possible permutations of a finite set of items - Drawing without replacement - Defined by a recursive formula $$N = n! = \left\{ \begin{array}{ll} 1 & \text{if } n=0 \\ n \cdot (n-1)! &\text{otherwise} \end{array} \right.$$ Note: by definition, $0! = 1$, which enables to compute $1!$ and the subsequent numbers with the recursive formula. For sufficiently large values of $n$, a clearer formulation is $$N = n \cdot (n-1) \cdot (n-2) \ldots 2 \cdot 1$$ ## Factorial ```{r factorial, out.width = "95%", echo=FALSE, fig.width=10, fig.height=4} n <- 1:20 N <- factorial(n) # plot(n,N, log = "y") par.ori <- par(no.readonly = TRUE) par(mfrow = c(1,2)) plot(n, N, main = "Factorial", xlab = "n", ylab = "N", las = 2, cex.axis = 0.8, type = "h", lwd = 3, col = "blue" ) plot(n, N, main = "Factorial\n(logarithmic Y scale)", xlab = "n", ylab = "N (log scale)", las = 2, cex.axis = 0.8, log = "y", type = "p", pch = 19, col = "blue" ) par(par.ori) ``` ## Exercise 02.2: gene lists (with order) A transcriptome experiment has been led to define the level of expression of all the yeast genes. Knowing that the genome contains $6000$ genes, how many possible ways are there to select the $15$ most expressed genes *with their relative order*? **Suggested approach**: as previously, simplify the problem by starting from the minimal selection, and progressively increase the number of selected genes (1 gene, 2 genes, ...). **Complementary questions**: - Give the example of a familiar bet game related to this enumerating process. - Generalize the formula for any selection of a list of $x$ items in a set containing $n$ elements. ## Solution 02.2: (ordered) lists of genes This is a selection **without replacement** (indeed, each gene appears at one and only one position in the list of all genes), and **ordered** (a list with the same genes taken in a different orders would be considered as a different result). - For the first gene, there are $n=6000$ possibilities. - As soon as the first gene has been defined, there are only $5999$ possibilities left for the second gene. There are thus $n \cdot (n-1) = 6000 \cdot 5999$ possibilities for the succession of the two first genes; - By extension, there are $6000 \cdot 5999 \cdot 5998 \cdot \ldots \cdot 5986 = `r signif(digits=3, prod(6000 - (0:14)))`$ possibilities for the first 15 genes. - If we generalize this reasoning to the lists of the $x$ first genes in a set of $n$, we obtain $N = n \cdot (n-1) \cdot (n-2) \cdot ... \cdot (n-x+1)$. Note that this can be represented by a more compact formula. $N = n \cdot (n-1) \cdot (n-2) \cdot ... \cdot (n-x+1) = \frac{n!}{(n-x)!}$ ## Arrangements In combinatorics, the term **arrangement** denotes an *orderless* drawing *without replacement*, i.e. random drawing where the order of the item is taken in consideration, and where each already selected item cannot be selected as next element. Number of arrangements of $x$ items drawn in a set of size $n$. $$\begin{array}{ccl} A^x_n & = & \frac{n!}{(n - x)!} \\ & = & \frac{n(n-1) \ldots (n-x +1) (n - x) (n-x-1) \ldots 2 \cdot 1}{(n - x) (n-x-1) \ldots 2 \cdot 1} \\ & = & n \cdot (n-1) \cdot \ldots \cdot (n-x+1) \end{array} $$ ## Arrangements -- Typical application - [**tricast** (also named **trifecta**](https://en.wikipedia.org/wiki/Trifecta)). - A bet where players must predict the three winner horses ($x=3$) of a race, and the exact order of their arrival. For $n=15$ horses, there are $n \cdot (n-1) \cdot (n-2) = 15 \cdot 14 \cdot 13 = `r 15*14*13`$ possibilities. ## Exercise 02.3: unordered sets of genes A transcriptomics experiment has been led to measure the level of expression of all yeast genes. Knowing that the genome contains $6000$ genes, how many possibilities are there to select the $15$ genes with the highest expression level **without** taking into account the relative order of those 15 genes? **Suggested approach**: as previously, simplify the problem by starting from minimal selections (1 gene, 2 genes, ...) and then generalize the formula. **Complementary questions**: - Find an example of familiar bet game related to this problem. - Generalize the formula for the selection of a set of $x$ genes from a genome containing $n$ genes. - Do you know the name of the formula that resulted from this reasoning? ## Solution 02.3: unordered sets of genes - For a single-gene selection, there are $n=6000$ possibilities. - For 2 genes, there are $n \cdot (n-1) = 6000 \cdot 5999$ arrangements, but these would cover each gene pair with two different orders: $(a,b)$ and $(b,a$). The number of **unordered** subsets of two genes is thus $N = n \dot (n-1)/2$. - Similarly, for 3 genes we need to divide the number of arrangements ($A^x_n = \frac{n!}{(n-x)!} = 6000 \cdot 5999 \cdot 5998$) by the number of permutations of the same gene triplet ($(a, b, c), (a, c, b), (b, a, c) \ldots$), which gives $\frac{6000!}{(6000-3)! 3!} = \frac{6000 \cdot 5999 \cdot 5888}{6} = `r signif(digits=3, choose(6000,3))`$. - For $15$ genes we obtain $\frac{n!}{(n-x)!x!} = \frac{6000!}{5985! \cdot 15!} = `r signif(digits=3, choose(6000, 15))`$ possible *combinations*. ## Combinations A **combination** is a selection *without replacement* a finite set, where the order of drawing is taken into consideration. The number of possible combinations of $x$ numbers among $n$ is provided by the **binomial coefficient**. $$\binom{n}{x} = C^x_n = \frac{n!}{x! (n-x)!}$$ **Attention: ** the relative positions of $x$ and $n$ are opposite in the two alternative notations for combinations $binom{n}{x}$ ("$x$ among $n$") and ($C^x_n$, "choose"). ## Combinations -- Typical application - [**trio**](https://en.wikipedia.org/wiki/Trifecta), a variation of the **tricast** bet, where the order of arrival of the 3 winner horses is not taken in consideration. $$\binom{n}{x} = \binom{15}{3} = C^3_{15} = \frac{15!}{3! 12!} = `r choose(n=15, k=3)`$$ - [**loto** (or lotto)](https://fr.wikipedia.org/wiki/Loto): each player checks 6 numbers within a grid containing 90 numbers. The number of possibilities is $$\binom{n}{x} = \binom{90}{6} = C^6_{90} = \frac{90!}{6! 84!} = `r choose(n=90, k=6)`$$ # Summary of the concepts and formulas ## Drawings with / without replacement There are two classical ways of drawing elements among a set: with or without replacement. 1. **Drawing without replacement**: each element can be selected at most once. Examples: - [*loto*](https://fr.wikipedia.org/wiki/Loto) game (also spelled *lotto*). - Arbitrary selection of a subset of the genes from a genome. 2. Drawing **with replacement**: each element can be drawn zero, one or several times. Examples: - Dice game. At each drawing of a dice, we have the same possible outcomes (6 sides). - Generation of a random sequence, by iteratively adding a randomly selected residue (4 nucleotides for DNA, 20 aminoacids for proteins). ## Elements of combinatorics - **Arrangements**: drawings with order, without replacement - **Combinations**: drawings without considering the order of the items, without replacement) ## Choice of the appropriate formula ```{r combinatorics_flowchart, out.width = "90%", fig.width = 8, fig.height = 5} grViz(" digraph combinatorics_flowchart { node [shape = diamond, fontname = Helvetica ] node [fontcolor = 'orange'] A [label = 'replacement ?']; B [label = 'ordered ?']; D [label = 'ordered ?'] G [label = 'exhaustive selection?'] node [shape = 'rectangle', fontcolor = 'black'] C [label = 'geometric progression']; E [label = 'arrangement'] F [label = 'combination'] H [label = 'factorial'] edge [arrowhead = 'normal', color = '#00BB00', fontcolor = '#00BB00', label = 'Yes'] A -> B B -> C D -> G G -> H edge [arrowhead = 'normal', color = '#BB0000', fontcolor = '#BB0000', label = 'No'] A -> D D -> F G -> E } ") ``` ## Formulas | Replacement | Order | Formula | Description | |--------|-------|------------|---------------------------------| | Yes | Yes | $n^x$ | **Geometric progression**: ordered drawings (sequences), with replacement, of $x$ items from a set of size $n$ | | No | Yes | $n!$ | **factorial**: permutations of all elements of a set of size $n$ | | No | Yes | $A^x_n = \frac{n!}{(n-x)!}$ | **Arrangements** : ordered drawing, without replacement, of $x$ items in a set of size $n$ | | No | No | $C^x_n = \binom{n}{x} = \frac{n!}{x! (n - x) !}$ | **Combinations** : orderless drawing, without replacement, of $x$ items in a set of size $n$ | # Supplementary exercises ## Exercise 02.5: oligopeptides $3 \times 20$ *How many distinct oligopeptides of size $k=60$ can be formed by using exactly $3$ times each amino acid?* ## Solution 02.5: oligopeptides $3 \times 20$ *How many distinct oligopeptides of size $k=60$ can be formed by using exactly $3$ times each amino acid?* Let us start by generating a particular sequence that fits these conditions, by concatenating 3 copies of each amino acid by alphabetic order. ```{r echo=FALSE} aa.table <- read.delim("amino_acid_table.tsv", sep = "\t") # kable(aa.table, caption = "Symboles des acides aminés et codons") aa <- sort(as.vector(aa.table$Symbol)) aa <- aa[aa %in% LETTERS] cat(sort(rep(aa, times = 3)), sep = "") ``` Any permutation of these 60 letters is a valid solution. Here are three examples. ```{r echo=FALSE} cat(sample(rep(aa, times = 3)), sep = "") cat(sample(rep(aa, times = 3)), sep = "") cat(sample(rep(aa, times = 3)), sep = "") ``` However, we have to take into account that any permutation between two identical amino acids will give an identical sequence. The difficulty of the exercise will thus be to enumerate the number of *distinct* permutations.