--- title: "Practical -- k-mer distributions" author: "Jacques van Helden" date: "`r Sys.Date()`" output: html_document: code_folding: hide fig_caption: yes highlight: zenburn self_contained: no theme: cerulean toc: yes toc_depth: 3 toc_float: yes beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes ioslides_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango smaller: yes toc: yes widescreen: yes word_document: toc: yes toc_depth: '3' slidy_presentation: fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango incremental: no keep_md: yes smaller: yes theme: cerulean toc: yes widescreen: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 font-import: http://fonts.googleapis.com/css?family=Risque subtitle: Statistics for Biology (STAT2) font-family: Garamond transition: linear --- ```{r include=FALSE, echo=FALSE, eval=TRUE} if (!require(knitr)) { install.packages("knitr") } library(knitr) options(width = 300) knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.path = 'figures/kmer_distrib_', fig.align = "center", size = "tiny", echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") # knitr::asis_output("\\footnotesize") ``` ## Introduction In this practical, we will count k-mer occurrences in DNA sequences of different organisms (one organism per sutdent), fit different theoretical distributions of probabilities onto the empirical distributions of counts, and test the goodness of fit for these alternative distributions. ## Organisms Each student will choose one organism of interest among the following ones. | Taxon | Organism | |--------|-----------------------------------| | Fungi | Saccharomyces_cerevisiae | | Bacteria | Escherichia coli GCF 000005845.2 ASM584v2 | | Mammalian | Homo sapiens GRCh38 | | Mammalian | Mus musculus GRCm38 | | Bird | Gallus gallus EnsEMBL | | Fish | Danio_rerio_EnsEMBL | | Insect | Drosophila melanogaster | | Worm | | | Plant | Arabidopsis thaliana.TAIR10.29 | | Plant | Zea mays.AGPv3.29 | | Apicomplexa | | ## Solutions ### Working directory On your computer, create a directory for this practical. I suggest to use a consistent naming for the different practicals of this course. We will further create a sub-folder with the name of our organism of interest. ```{r working_dir} ## Define your organism of interest org <- "Homo_sapiens" ## Define a working directory work.dir <- file.path("~", "CMB-STAT2_practicals", "kmer_distrib", org) dir.create(work.dir, recursive = TRUE, showWarnings = FALSE) ## Print a message with the result directory message("Result directory\t", work.dir) ``` ## Counting k-mer occurrences in each promoter of a model organism 1. Open a connection to the Regulatory Sequence Analysis Tools (RSAT) teaching server : 2. In the tool search box, type "retrieve sequence" and click on the corresponding tool. 3. In the [*retrieve-sequence*](http://pedagogix-tagc.univ-mrs.fr/rsat/retrieve-seq_form.cgi) form, - click *Mandatory inputs*, enter the name your organism of interest, and check the option *all genes of this organism*; - in *Mandatory options*, select *upstream*, and set the sequence limits from -1 to -500 - in *Advanced options*, *make sure that this option is **unchecked**: *Prevent overlap with neighbour genes (noorf)* ^[Note: normally it is recommended to check this option, but we intently inactivate it in order to get sequences of the same sizes. ] - click *Run analysis* and *GO*. After a few seconds (or minutes) the result is displayed. Right-click on the sequence file (extension fasta) and open it in a separate tab to check its content. 4. Come back to the result page of retrieve-sequence. In the *Next Step* box below the result, click on the link to *oligo-analysis*. This will transfer your sequences to the oligo-analysis form. - In the *Sequence* section, inactivate the option *purge sequence*. - In the *Oligmer counting mode*, **uncheck** the option *prevent overlapping matches*. - Select *Count on single strand*. - For *oligomer lengths*, select 2 and **uncheck the other lengths**. - In *Results*, check the option *Occurrence table*. - Type your email address and select the mail output. - Click *GO*. After a few seconds (minutes), the RSAT server should display the result page, with links to the k-mer count table. Copy the URL of the result file. ### Download the count table from RSAT Let us define the name we will give to the local copy of the k-mer count table generated on the RSAT server in the previous steps. ```{r kmer_table_name} ## Define the path and the name of the local file kmer.file <- file.path(work.dir, "2nt-ovlp-1str_Homo_sapiens.tab") ``` One solution is to download manually the k-mer count table generated on the RSAT server, move it to the work directory, and rename it `2nt-ovlp-1str_Homo_sapiens.tab` (to be adapted depending on your organism of interest). Another possiblity is to use R command `download.file()` download it from the URL of the result file on the RSAT server. ```{r eval=FALSE} ## Note: this will only work for a few days, because the temporary files are removed from the server temp.url <- "http://pedagogix-tagc.univ-mrs.fr/rsat/tmp/www-data/2020/02/17/oligo-analysis_2020-02-17.155035_zZ4IA4" ## Provide the arguments in the order of the function definition download.file(temp.url, kmer.file) ## Equivalent : name the arguments # download.file(url = temp.url, destfile = kmer.file) ## Note: named arguments can be provided in a different order without problem # download.file(destfile = kmer.file, url = temp.url) ``` Whichever method was chosen, check that the file is at the right place on your computed. ```{r} ## List the files in the working directory list.files(work.dir) ## Send a message with the k-mer file location message("K-mer count table file:\t", kmer.file) ``` ### Load the k-mer count table in R Use the finction `read.table()` to load the k-mer count table in a variable named `kmer.table`. ```{r} ## Call the help for read.table() # ?read.table ## Load the k-mer count table in a variable kmer.table <- read.table( file = kmer.file, comment.char = ";", ## comment lines start with ";" in RSAT header = TRUE, # the first row (after the comments) contain the column headers row.names = 1, ## the first column contains row names but there might be homonyms sep = "\t" ## column separator is the tabulation ) ## We set uppercases for the dinucleotide sequences, which were in lowercases in the original count table names(kmer.table) <- toupper(names(kmer.table)) ``` Check the dimensions of this table. ```{r} ## Check the dimensions of the k-mer count table dim(kmer.table) ## Number of k-mers m <- ncol(kmer.table) n <- nrow(kmer.table) ## Print the result print(paste0("Number of rows: ", n)) print(paste0("Number of columns: ", m)) ``` Check the column names ```{r} ## Check the column names names(kmer.table) colnames(kmer.table) # equivalent ``` Display the first and last 10 lines of the k-mer count table. ```{r} ## Show the first 10 lines of the k-mer count table head(kmer.table) tail(kmer.table) ``` ## Compute marginal statistics ### Statistics per column (k-mer) Tips: use the R function `apply()`. We will compute marginal statistics on the rows and columns of the count table. One possibility is to add columns and rows on this table, but this would not be very convenient for the subsequent computations to be performed on the counts. We will thus create two separate tables: one with the row-wise statistics, and another one with the column-wise statistics. Rather than looping over each row or column of the count table, we can use the function `apply()`in order to compute a chosen statistics on each row (margin = 1) or column (margin = 2) of the table. ```{r} ## Recall the dimensions of the kmer count table dim(kmer.table) # View(kmer.table) ## Compute means per column (on the "second" margin) col.means <- apply(kmer.table, 2, mean) print(col.means) ``` We can run the `apply()`function in the same way to compute the other descriptive statistics (median, variance, ...) and store each result in a separate vector. However, it would be more convenient to dispose of a single structure containing all the statistics for each column. For this, we use `data.frame()` to create a data frame with one column per column per computed statistics, and one row per k-mer. ```{r} ## Compute a table summarising the statistics per k-mer ## (marginal stats on the columns) stats.per.kmer <- data.frame( mean = apply(kmer.table, 2, mean), sd = apply(kmer.table, 2, sd), var = apply(kmer.table, 2, var), min = apply(kmer.table, 2, min), p05 = apply(kmer.table, 2, quantile, 0.05), Q1 = apply(kmer.table, 2, quantile, 0.25), median = apply(kmer.table, 2, median), Q3 = apply(kmer.table, 2, quantile, 0.75), p95 = apply(kmer.table, 2, quantile, 0.95), max = apply(kmer.table, 2, max), sum = apply(kmer.table, 2, sum) ) kable(stats.per.kmer, caption = "Column-wise means of the k-mer count table. ") ``` ### Warning about R sd() and var() functions **Beware:** the R functions `sd` and `var` do not compute the standard deviation and variance of the numbers provded! Instead, they consider that the data provided are a sample drawed from a population, and the return an estimate of the standard deviation or variance of this population. So, instead of the sample variance $$s^2 = \sum_{i=1}^n \frac{1}{n}(x - \bar{x})^2$$ They estimate the variance of the population corrected for the systematic bias of sample variance. $$\hat{s^2} = \sum_{i=1}^n \frac{1}{n-1}(x - \bar{x})^2$$ For this exercise, this is actually what we want, since we can consider that the `r n` genes of our organism of interest are a sampling of all the genes that might be found in organisms of the same taxon. In any case, the impact of the correction is negligible with such a large number of genes ($n = `r n`$). $$\frac{n}{n-1} = `r round(digits = 7, n/(n-1))`$$ ### Analyse the relationships between the mean and the variance Poisson: variance = mean Binomial: variance < mean XXX: variance > mean ```{r mean_vs_var, fig.width=7, fig.height=4, out.width="60%", fig.cap="Mean-variance plot for dinucleotide counts in all human promoters. The line indicates the variance/mean relationship in Poisson. distributions. "} plot(stats.per.kmer$mean, stats.per.kmer$var, xlab = "Mean of the counts per promoters", ylab = "Variance of the counts per promoters", las = 1, panel.first = grid(), main = "Dinucleotide counts in human promoters\nVariance versus mean plot", xlim = c(10,50), ylim = c(0,500), cex = 0) text(x = stats.per.kmer$mean, y = stats.per.kmer$var, labels = rownames(stats.per.kmer)) abline(a = 0, b = 1) ``` ### Statistics per gene (row) ```{r stats_per_gene} ## Compute a table summarising the gene-wise (row-wise) statistics of the k-mer count table stats.per.gene <- data.frame( mean = apply(kmer.table, 1, mean), sd = apply(kmer.table, 1, sd), var = apply(kmer.table, 1, var), min = apply(kmer.table, 1, min), Q1 = apply(kmer.table, 1, quantile, 0.25), median = apply(kmer.table, 1, median), Q3 = apply(kmer.table, 1, quantile, 0.75), max = apply(kmer.table, 1, max), sum = apply(kmer.table, 1, sum) ) ``` We can inspect the sum of counts per gene. ```{r sum_per_gene} kable(rev(table(stats.per.gene$sum))) ``` For almost all genes, the sum of k-mer counts is $n = L -1$, where $L$ is the sequence length. This is logical since there is no dinucleotide starting at the last position of a sequence. More generally, for a k-mer of size $k$, the number of possible positions is $n = L - k + 1$. The very few cases of sequences having a smaller sums of counts result from the presence of ambiguous nucleotides (represented by the symbol `N`) in the genome sequence. For this exercise, we will suppress these genes because our models rely on the hypothesis that the sum of counts per gene is constant. ```{r gene_filtering} ## Filter out the genes having an incorrect sum of counts n <- max(stats.per.gene$sum) ## Count the number of genes with the correct sum of counts sum(stats.per.gene$sum == n) ## Count the number of genes with an incorrect sum of counts sum(stats.per.gene$sum != n) ## Filter out the genes with incorrect sums of counts kmer.table <- kmer.table[stats.per.gene$sum == n, ] dim(kmer.table) ## Count the remaining number of genes nb.genes <- nrow(kmer.table) ``` ## Drawing empirical distributions of counts ### Draw an histogram with the distribution of counts for a given k-mer. The simplest way to generate a histogram is to use the `hist()` function. Here is the result with the default parameters. ```{r one_kmer_hist, fig.width=7, fig.height=4, out.width="60%", fig.cap="Counts per sequence in all human promoters. "} kmer <- "AG" # Select a k-mer counts <- kmer.table[, kmer] # select the counts for this k-mer range(counts) ## Draw an histogram with default parameters (not very beautiful) hist(x = counts) ``` We can fine-tune the parameters in order to get a more informative representation, by specifying custom breaks. ```{r one_kmer_hist_pretty, fig.width=7, fig.height=4, out.width="60%", fig.cap="Counts per sequence in all human promoters. "} ## Draw a nice and informative histogram hist(x = counts, breaks = 0:max(counts + 1), col = "#44DDFF", border = "#44DDFF", las = 1, ylab = "Number of promoters", xlab = "K-mer counts", main = paste0(kmer, " counts in promoters")) ``` ### Generate a figure with 4 x 4 panels to depict the histograms of the 16 k-mers ```{r all_kmer_hist_pretty, fig.width=12, fig.height=8, out.width="100%", fig.cap="Dinucleotide counts per sequence in all human promoters. "} par(mfrow = c(4,4)) par(mar = c(4.1, 5.1, 4.1, 1.1)) for (i in 1:16) { kmer <- names(kmer.table)[i] counts <- kmer.table[, i] # select the ith column of the k-mer table ## Draw a nice and informative histogram hist(x = counts, breaks = 0:max(counts + 1), col = "#44DDFF", border = "#44DDFF", las = 1, ylab = "Number of promoters", xlab = "K-mer counts", main = paste0(toupper(kmer), " counts in promoters")) } par(mfrow = c(1,1)) par(mar = c(4.1, 5.1, 4.1, 2.1)) ``` ### Use other graphical representations to get an insight of the k-mer count distributions (boxplots, violin plots) #### Boxplot The funciton `boxplot()` provides an informative summary of the data. ```{r kmer_count_boxplot, fig.width=7, fig.height=5, out.width="60%", fig.cap="Boxplot of dinucleotide counts in all promoters. "} boxplot(x = kmer.table) ``` We can fine-tue the parameters to enhance the readability ```{r kmer_count_boxplot_pretty, fig.height=10, fig.width=7, out.width="60%", fig.cap="Boxplot of dinucleotide counts in all promoters. "} boxplot(x = kmer.table[,order(stats.per.kmer$median)], horizontal = TRUE, las = 1, col = "#44DDFF", main = "Dinucleotide distributions", xlab = "Counts") ``` Alternative: sort by variance. ```{r kmer_count_boxplot_pretty_var, fig.height=10, fig.width=7, out.width="60%", fig.cap="Boxplot of dinucleotide counts in all promoters. "} boxplot(x = kmer.table[,order(stats.per.kmer$var)], horizontal = TRUE, las = 1, col = "#44DDFF", main = "Dinucleotide distributions", xlab = "Counts") ``` #### Violin plot We can use the R function `vioplot()` to generate a violin plot. This requires to install the R package `vioplot`, if it is not already done. ```{r violin_plot, fig.width=7, fig.height=5, out.width="60%", fig.cap="Violin plot of dinucleotide counts in promoter sequences"} ## Check if the vioplot package is already installed if (!require(vioplot)) { ## Install the vioplot package install.packages("vioplot") } library(vioplot) ## Load the vioplot package # help(package = vioplot) # Get help on the package # help(vioplot) # Get help on the vioplot function vioplot(kmer.table) ``` Let us fine-tune the parameters to get a pretty violin plot. ```{r violin_plot_pretty, fig.height=10, fig.width=7, out.width="60%", fig.cap="Violin plot of dinucleotide counts in promoter sequences"} vioplot(kmer.table[, order(stats.per.kmer$mean)], horizontal = TRUE, col = "#44DDFF", las = 1, main = "Violin plot of dinucleotide counts in promoters", xlab = "Counts", ) ``` ```{r violin_plot_pretty_var, fig.height=10, fig.width=7, out.width="60%", fig.cap="Violin plot of dinucleotide counts in promoter sequences"} vioplot(kmer.table[, order(stats.per.kmer$var)], horizontal = TRUE, col = "#44DDFF", las = 1, main = "Violin plot of dinucleotide counts in promoters", xlab = "Counts", ) ``` ## Exploring k-mer count distributions in promoter sequences 5. Compute a vector with the relative frequency of each k-mer in all the sequences. 6. Compute a table with the relative frequencies of k-mers per sequence, and compute similar summary statistics per column on this relative frequency table. 7. Write a brief interpretation of the results. ## Fitting ### Encapsulate the pretty histogram in an R function For the rest of this tutorial, we will have to draw several histograms of k-mer count distributions. Since we are satisfied of the fine-tuned histogram above, one possibility is to include it in a function, that we will be able to recall later rather than having to re-type the full code for each usage. We will properly document this function with the roxygen2 standard. This will enable us to incorporate this function in a custom R package if we wish. We can also include parameters (defined in the signature of the function) in order to make its use more flexible. Moreover, this function will return an object (of class "histogram") with the numerical values of the histogram (breaks, frequencies, counts, ...). ```{r count_histogram, collapse=FALSE} #' @title K-mer count histogram #' @author Jacques van Helden #' @description Draw an histogram with the distribution of a vector of counts for a given k-mer. #' @param counts a vector of counts #' @param kmer the sequence of the oligonucleotide (k-mer) to be displayed in the title #' @param class.interval=1 class interval (breaks are automatically computed) #' @param col="#44DDFF" fill color for the histogram rectangles #' @param border="#22BBFF" border color for the histogram rectangles #' @param main=paste0(kmer, " counts in promoters") #' @param ... all the other arguments are passed to the hist function #' @return an object of type histogram returned by the hist() function #' @export kmer.hist <- function(counts, kmer, class.interval = 1, col = "#44DDFF", border = "#44DDFF", main = paste0(kmer, " counts in promoters"), ...) { h <- hist(x = counts, breaks = seq(from = 0, to = max(counts + class.interval), by = class.interval), col = col, border = border, las = 1, ylab = "Number of promoters", xlab = "K-mer counts", main = main, ...) ## Return the histogram values return(h) } ``` We can then call this function with a given k-mer. ```{r another_kmer, fig.width=7, fig.height=4, out.width="60%", fig.cap="Distribution of k-mer counts in promoter generated with our custom funciton kmer.hist(). "} hist.values <- kmer.hist(counts = kmer.table[, "TT"], kmer = "TT") ``` We can then re-use it with wider breaks. ```{r another_kmer_ci5, fig.width=7, fig.height=4, out.width="60%", fig.cap="Distribution of k-mer counts in promoter generated with our custom funciton kmer.hist() with class intervals of size 5. "} kmer <- "TT" counts <- kmer.table[, kmer] hist.values.ci5 <- kmer.hist(counts = counts, kmer = kmer, border = "#0088FF", class.interval = 5) ``` We can now extract the values of the histogram, for example the class midpoints and the counts per class. ```{r hist_values} class(hist.values) names(hist.values) # View(hist.values) ``` We can print the midpoints of the class intervals. ```{r hist_mids} print(hist.values$mids) ``` And the corresponding counts. ```{r hist_counts} print(hist.values$counts) ``` We can also use this information to draw a custom plot of the distribution (for example a frequency polygon). ```{r frequency_polygon, fig.width=7, fig.height=4, out.width="60%", fig.cap="Distribution of counts for the TT dinucleotide in Human promoter sequences. Representation. as frequency polygon. "} plot(hist.values$mids, hist.values$counts, las = 1, xlab = "Number of sequences", ylab = "Counts per sequence", type = "l", col = "#008800", lwd = 3, main = paste0(kmer, " counts\nFrequency polygon")) grid() # Add a grid to the plot ``` ### Fit a Poisson distribution on each empirical distribution of k-mer counts. - How do you choose the parameters? - Draw the fitted Poisson distribution over the histogram of empirical distribution (observed k-mer occurences) **Tips:** - in order to add some plot over an existing plot, you can use the `lines()` function - you can also use specific options to draw histogram-like lines: `lines(x, y, type = "h"). The Poisson distribution is defined by a single parameter: the expected mean $\lambda$ (the Greek letter "lambda"). Noteworth, the variance of a Poisson distribution equals its mean. ```{r fit_poisson} ## Select a kmer and its counts kmer <- "GA" ## Select an arbitrary k-mer counts <- kmer.table[, kmer] ci <- 2 ## Estimate lambda (mean of the Poisson) as the mean of the counts lambda <- mean(counts) ## Compute the Poisson Probability Mass Function X <- 0:max(counts) ## Compute the values covered by the k-mer counts # length(X) poisson.pmf <- dpois(x = X, lambda = lambda) # sum(poisson.pmf) # Validation : this should approximte 1 ## Compute expected number of genes for each count value poisson.exp <- poisson.pmf * nb.genes # sum(poisson.exp) # Validation: this should approximate n ``` ```{r fitting_plot_fp, fig.width=7, fig.height=5, out.width="60%", fig.cap="Fitting of a Poisson distribution on dinucleotide counts"} ## plot the histogram h <- kmer.hist(counts, kmer, ylim = c(0, max(poisson.exp)), class.interval = 1, col = "#BBBBBB", border = "#BBBBBB", main = paste0("Poisson fit on ", kmer, " counts")) lines(X, poisson.exp, col = "red", lwd = 2) obs <- h$counts ``` An alternative is to draw the lines as a histogram rather than a polygon frequency. ```{r fitting_plot_h, fig.width=7, fig.height=5, out.width="60%", fig.cap="Fitting of a Poisson distribution on dinucleotide counts"} ## plot the histogram h <- kmer.hist(counts, kmer, ylim = c(0, max(poisson.exp)), class.interval = 1, col = "#BBBBBB", border = "#BBBBBB", main = paste0("Poisson fit on ", kmer, " counts")) lines(X, poisson.exp, col = "red", lwd = 1, type = "h") ``` We can now generate the same drawing for each k-mer. ```{r poisson_fit_all_kmers, fig.width=12, fig.height=8, out.width="100%", fig.cap="Poisson fit on dinucleotide counts from Human promoter sequences. "} par(mfrow = c(4, 4)) for (i in 1:m) { kmer <- names(kmer.table)[i] counts <- kmer.table[, kmer] ## plot the histogram h <- kmer.hist(counts, kmer, ylim = c(0, max(poisson.exp)), class.interval = 1, col = "#BBBBBB", border = "#BBBBBB", main = paste0("Poisson fit on ", kmer, " counts")) lambda <- mean(counts) poisson.exp <- dpois(x = X, lambda = lambda) * nb.genes lines(X, poisson.exp, col = "red", lwd = 0.5, type = "h") } par(mfrow = c(1, 1)) ``` ### Using the sample median as estimator of the population mean ```{r poisson_fit_all_kmers_median, fig.width=12, fig.height=8, out.width="100%", fig.cap="Poisson fit on dinucleotide counts from Human promoter sequences. "} par(mfrow = c(4, 4)) for (i in 1:m) { kmer <- names(kmer.table)[i] counts <- kmer.table[, kmer] ## plot the histogram h <- kmer.hist(counts, kmer, ylim = c(0, max(poisson.exp)), class.interval = 1, col = "#BBBBBB", border = "#BBBBBB", main = paste0("Poisson fit on ", kmer, " counts")) lambda <- median(counts) poisson.exp <- dpois(x = X, lambda = lambda) * nb.genes lines(X, poisson.exp, col = "red", lwd = 0.5, type = "h") lines(X, poisson.exp, col = "red", lwd = 0.5, type = "h") } par(mfrow = c(1, 1)) ``` ### Q-Q plots ```{r} ## Selevct a k-mer kmer <- "AC" counts <- kmer.table[, kmer] lambda <- mean(counts) X <- 0:(max(counts)) ## Generate random counts that follow a Poisson loaw rcounts <- rpois(n = length(counts), lambda = lambda) ## Draw a Q-Q plot with the random counts versus actual counts qqplot(x = counts, y = rcounts, main = "QQplot - counts vs random Poisson", col = "#88AABB", xlab = "Counts", ylab = "Random counts (rpois)") grid() abline(a = 0, b = 1) ``` ## Goodness of fit ### Using R built-in chisq.test() function 1. Assess the goodness of the fit using the R `chisq.test()` function. - How significant is the result? - How good is the fit? - Did the test issue some warning message? If so, how do you interpret it? - Check if the assumptions are met for the validity of the chi2 test. ```{r chisq_test} ## Choose a k-mer and select its counts kmer <- "TC" counts <- kmer.table[, kmer] max.count <- max(counts) X <- 0:max.count # all possible values of X ## Get the number of genes per k-mer counts with hist() ## but do not plot the histogram h <- hist(counts, breaks = seq(from = 0, to = max.count + 1, by = 1), plot = FALSE) obs <- h$counts ## Observed counts ## Compute the probabilities from a Poisson distribution poisson.p <- dpois(X, lambda = mean(counts)) ## Run the chi2 test chi2.result <- chisq.test(x = obs, p = poisson.p) print(chi2.result) pval <- chi2.result$p.value ``` ### With a custom chisquare.test() function 2. Implement a properly documented function that - Takes as input a vector of observed values (counts per class) + a vector of expected values (counts per class), - Checks that the expected values are compliant with the applicability conditions. - If not, groups the tails of the vectors in order to ensur this condition. - Runs the chi2 test - Returns the following values - chi2 statistics - number of initial classes - number of classes after tail grouping - degrees of freedom - p-value **Tips** - **Beware**, the input vector should not be the counts of vector per sequences, but a vector of counts per class (e.g. how many genes had between 0 and 4 k-mer occurrences, betwen 5 and 9, betwen 10 and 14, ...) returned by the histogram ```{r chi2_test_function} #' @title chi-squared test #' @author Jacques van Helden #' @description Run a chi2 test with convenient options enabling to automatically group the tails of the input vector if the expected values are too small #' @param obs a vector of observed values. These must be absolut e counts and not relative frequncies. #' @param exp a vector of expected values. Should have the same length and same sum as the observed values #' @param X X values for the histogram #' @param main main title for the histogram #'@return a vector of statistics related to the chi2 test: chi2 statistics, number of classes of the input vectors, numnber of classes after tail grouping, degrees of freedom, p-value chisq.test.grouped.tails <- function(obs, exp, X, main = "chi2 test") { ## Group the left tail first.ok <- head(which(exp >= 5), n = 1) last.ok <- tail(which(exp >= 5), n = 1) if (first.ok > 1) { exp.grouped <- c(sum(exp[1:(first.ok - 1)]), exp[first.ok:last.ok]) obs.grouped <- c(sum(obs[1:(first.ok - 1)]), obs[first.ok:last.ok]) X.grouped <- c(X[first.ok - 1], X[first.ok:last.ok]) } else { exp.grouped <- exp[first.ok:last.ok] obs.grouped <- obs[first.ok:last.ok] X.grouped <- X[first.ok:last.ok] } if (last.ok < length(exp)) { exp.grouped <- c(exp.grouped, sum(exp[(last.ok + 1):length(exp)])) obs.grouped <- c(obs.grouped, sum(obs[(last.ok + 1):length(obs)])) X.grouped <- c(X.grouped, X[last.ok + 1]) } test <- chisq.test(x = obs.grouped, p = exp.grouped, rescale.p = TRUE) ## Collect the result result <- data.frame( chi2.stat = test$statistic, nclasses.input = length(obs), nclasses.grouped = length(obs.grouped), df = test$parameter, p.value = test$p.value) ## Plot the distributions before and after grouping if (class(X) == "numeric") { par(mfrow = c(2,1)) ## Plot distrib before grouping ymax = ceiling(max(obs.grouped, exp.grouped)) ## Compute max value for Y axis plot(x = X, y = obs, type = "h", col = "grey", lwd = 2, ylim = c(0, ymax), xlim = range(X), ylab = "Number of sequences", xlab = "K-mer occurrences", las = 1, main = main) lines(x = X, y = exp, type = "l", col = "red") legend("topright", legend = paste0(sum(exp < 5), " counts with exp < 5")) ## Plot distrib after grouping plot(x = X.grouped, y = obs.grouped, type = "h", col = "grey", lwd = 2, ylim = c(0, ymax), xlim = range(X), ylab = "Number of sequences", xlab = "K-mer occurrences", las = 1, main = paste0(main, "\ngrouped tails")) lines(x = X.grouped, y = exp.grouped, type = "l", col = "darkgreen") legend("topright", cex = 0.7, legend = paste0(names(result), " = ", signif(result, digits = 3)) ) par(mfrow = c(1,1)) } return(result) } ``` ```{r TC_fitting, fig.width=7, fig.height=7, out.width="80%"} #' @title Poisson fit #' @description Compute the distribution of counts and fit a Poisson distribution #' @param counts #' @param kmer="k-mer" poisson.fit <- function(counts, main = "Poisson fit") { ## Compute the number of occurrences for each count value h <- hist(counts, breaks = seq( from = min(counts) - 0.5, to = max(counts) + 0.5, by = 1), plot = FALSE) obs <- h$counts # note: the right tail contains many zeros X <- h$mids mean.counts <- mean(counts) ## Compute expected counts exp <- dpois(x = X, lambda = mean.counts) * sum(obs) chi2.result <- chisq.test.grouped.tails(obs = obs, exp = exp, X = X, main = main) return(chi2.result) } ## Run the chi2 test with tail grouping kmer <- "CT" counts <- kmer.table[, kmer] poisson.fit(counts, main = "CT poisson fit") ``` As a matter of control we will fit a Poisson distribution on a vector of random numbers generated according to a Poisson distribution. ```{r rpois_fitting, fig.width=7, fig.height=7, out.width="80%"} ## Run the chi2 test with random numbers following a Poisson distribution poisson.fit(rpois(n = length(counts), lambda = mean(counts)), main = "Random Poisson numbers") ``` ## Fitting a binomial distribution ```{r eval=FALSE} kmer <- "TC" counts <- kmer.table[, kmer] h <- hist(counts, breaks = seq( from = min(counts - 0.5), to = max(counts + 0.5), by = 1), plot = FALSE) obs <- h$counts X <- h$mids ## Estimate prior k-mer probabilities by computing their ## frequency in the whole data set total.per.kmer <- apply(kmer.table, 2, sum) freq.per.kmer <- total.per.kmer / sum(total.per.kmer) n <- max(stats.per.gene$sum) # number of k-mer positions per sequence exp <- sum(obs) * dbinom(x = X, size = n, prob = freq.per.kmer[kmer]) length(exp) length(obs) length(X) chi2.result <- chisq.test.grouped.tails( obs = obs, exp = exp, X = X, main = paste0(kmer, "; binomial fitting")) ``` ## Fitting a normal distribution ## Fitting a negative binomial