miND - miRNA NGS Data pipeline by TAmiRNA GmbH, developed by: Andreas B. Diendorfer, PhD

1 Introduction

This miND report provides a condensed overview of the mapping and (if requested) differential expression analysis of the provided samples. Details on the analysis parameters are listed below:

  • Project ID: PRJEB27261
  • Report generated: 2021-12-09 (13:28:43 GMT +0000) by system user “ec2-user”

Comment:


Samples from ENA project PRJEB27261


Analysis parameters:

  • Species: Homo sapiens (hsa, TXID: 9606)
  • Sequencing adapter: (cutadapt -a TruSeq=TGGAATTCTC)
  • Minimum read length: 17nt
  • Reads quality cutoff: 30 (phred quality score)
  • Significance level: 0.05
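As a rough illustration of what the adapter and minimum-length parameters above imply, the following Python sketch removes the adapter by exact match and drops reads shorter than 17 nt. The function name `trim_and_filter` and the toy reads are hypothetical; cutadapt itself performs error-tolerant adapter matching and quality trimming, which this sketch does not reproduce.

```python
ADAPTER = "TGGAATTCTC"  # adapter sequence from the report's cutadapt setting
MIN_LENGTH = 17         # minimum read length (nt) from the report


def trim_and_filter(read, adapter=ADAPTER, min_length=MIN_LENGTH):
    """Return the read with the 3' adapter removed, or None if too short.

    Assumption: exact-match search only. cutadapt also tolerates mismatches
    and partial adapters at the read end, which is not modelled here.
    """
    pos = read.find(adapter)
    trimmed = read if pos == -1 else read[:pos]
    return trimmed if len(trimmed) >= min_length else None


# Hypothetical toy reads: one with a 22 nt insert, one with an 8 nt insert.
reads = [
    "TAGCTTATCAGACTGATGTTGA" + ADAPTER + "GGG",
    "ACGTACGT" + ADAPTER,
]
kept = [r for r in (trim_and_filter(x) for x in reads) if r is not None]
```

Only the first toy read survives, because the second insert is shorter than the minimum length after trimming.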

Tabular data can be filtered or sorted using the fields and options at the top of each table. To export the data for further processing, please select the desired format (Excel or CSV) at the table.

2 Data exploration

The first part of this report aims to give a general overview of the sequencing data. Please be aware that any downstream analysis depends on certain assumptions about the distribution and quality of the data. It is important to manually evaluate the data with the plots and tables provided in this section. Samples that do not pass those evaluations should be excluded from statistical analysis, as they can distort the results.

2.1 Sample table

2.2 Raw data quality control

To evaluate the quality and check the data for common sequencing problems, all processed files are also analysed with the “FastQC” tool. The results of all samples are then combined into one report together with statistics about the adapter trimming step.

The MultiQC report is provided alongside this file (multiqc_report.html).

Any files that do not pass the manual evaluation of this step should be excluded from further processing and analysis, as they could distort the results.

2.3 Reads classification

Reads classification gives insights into the type and origin (i.e. composition) of all sequences obtained for each sample. After processing of the reads (adapter trimming, quality filtering, size filtering), all reads are mapped against various databases to categorize them. This is done in a hierarchical process, where reads are first mapped against the genome. Genome mapped reads are then mapped against known miRNA sequences and only those not identified as miRNAs get mapped against other databases for further classification.

“Unclassified genomic” indicates reads that were mapped against the genome but were not found in any of the RNA specific databases, while “unmapped” are reads that could not be found in the given reference genome.
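The hierarchy described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's implementation: the function `classify_read` is hypothetical, and plain sets of sequences stand in for the real alignment steps against the genome, the miRNA database, and the other RNA databases.

```python
def classify_read(read, genome, mirnas, other_dbs):
    """Assign one category to a read, following the described hierarchy.

    genome, mirnas: sets of sequences standing in for real alignment indexes
    other_dbs: ordered list of (category, set) pairs, checked after miRNAs
    """
    if read not in genome:
        return "unmapped"
    if read in mirnas:
        return "miRNA"
    for category, sequences in other_dbs:
        if read in sequences:
            return category
    return "unclassified genomic"


# Hypothetical toy data; real sequences and databases are far larger.
genome = {"AAAA", "CCCC", "GGGG"}
mirnas = {"AAAA"}
other_dbs = [("tRNA", {"CCCC"})]

labels = [classify_read(r, genome, mirnas, other_dbs)
          for r in ["AAAA", "CCCC", "GGGG", "TTTT"]]
```

Because the checks run in a fixed order, each read ends up in exactly one category.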

The “Relative reads” tab shows the same data scaled to 100% to indicate the relative abundance of each read classification in a given sample.
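The scaling to 100% amounts to dividing each category count by the sample total. A minimal Python sketch, with the hypothetical helper `to_relative` and made-up counts:

```python
def to_relative(counts):
    """Scale absolute read counts per category to percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical counts for one sample.
relative = to_relative({"miRNA": 750, "unmapped": 250})
```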

Hint: You can double-click on any of the RNA categories in the legend to hide all others and show only this category.

2.3.1 Absolute reads

2.3.2 Read composition (relative)

2.3.3 Mapping statistics

The following histograms show the number (y-axis) of genome mapped reads against their length (x-axis) for each sample. The stacked bar charts visualize the proportions of unmapped and mapped reads and can be used to evaluate the read quality. Most microRNAs are 22 nucleotides long.
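The quantities behind these plots can be sketched in a few lines of Python. The helper names `length_histogram` and `mapped_fraction` are hypothetical, and the input is assumed to be a list of already mapped read sequences.

```python
from collections import Counter


def length_histogram(reads):
    """Count reads per read length (x-axis: length, y-axis: number of reads)."""
    return Counter(len(read) for read in reads)


def mapped_fraction(n_mapped, n_unmapped):
    """Proportion of mapped reads, as shown in the stacked bar charts."""
    total = n_mapped + n_unmapped
    return n_mapped / total if total else 0.0


# Hypothetical toy reads: two of length 22, one of length 21.
histogram = length_histogram(["A" * 22, "C" * 22, "G" * 21])
```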

2.4 Read classification table

The data in this table are equivalent to the data shown in the reads classification graph above (absolute reads). These are raw read counts (without any normalization).

2.5 miRNA mappings

2.5.1 miRNA RPM table

This table contains all identified miRNAs in each sample. Read counts are normalized as reads per million (RPM) mapped miRNA reads.
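A minimal Python sketch of this normalization, assuming the denominator is the total number of reads mapped to miRNAs in the sample. The function name `rpm` and the counts are hypothetical.

```python
def rpm(raw_counts):
    """Normalize raw miRNA read counts to reads per million mapped miRNA reads."""
    total = sum(raw_counts.values())
    if total == 0:
        raise ValueError("sample has no miRNA reads")
    return {mirna: count * 1_000_000 / total for mirna, count in raw_counts.items()}


# Hypothetical raw counts for one sample.
normalized = rpm({"hsa-miR-21-5p": 300, "hsa-miR-16-5p": 100})
```

After normalization the values of one sample sum to one million, which makes samples with different sequencing depths comparable.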

Please use the download link provided underneath the table to save the miRNA mappings data. The buttons provided at the top of the table can also be used, but won’t include detailed group information of the samples.

Download extended miRNA mapping table (RPM)

2.5.2 miRNA raw reads table

This table contains all identified miRNAs in each sample. These are raw read counts (without any normalization).

Please use the download link provided underneath the table to save the miRNA mappings data. The buttons provided at the top of the table can also be used, but won’t include detailed group information of the samples.

Download extended miRNA mapping table (raw reads)

2.5.3 Identified miRNAs comparison

This graph shows the number of distinct mature miRNAs identified in each sample.

Download identified miRNAs comparison data

2.5.4 miRNA read count distribution

This overview plots the abundance of a miRNA (collapsed read count) on the x-axis and the number of miRNAs within that abundance range on the y-axis. It illustrates the distribution of miRNA abundances in the sample.
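One way to compute such a distribution is to bin miRNAs by the order of magnitude of their read count. The Python sketch below does this; the binning by powers of ten and the function name `abundance_distribution` are assumptions for illustration and may differ from the binning used in the plot.

```python
from collections import Counter


def abundance_distribution(counts):
    """Count miRNAs per order of magnitude of their read count.

    Returns a dict mapping the lower bin edge (1, 10, 100, ...) to the
    number of miRNAs whose count falls into that decade.
    """
    bins = Counter()
    for count in counts.values():
        if count < 1:
            continue  # miRNAs without reads are not binned
        decade = len(str(int(count))) - 1
        bins[10 ** decade] += 1
    return dict(sorted(bins.items()))


# Hypothetical read counts.
distribution = abundance_distribution({"a": 3, "b": 42, "c": 57, "d": 1200, "e": 0})
```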

2.6 Heatmaps

Data is based on RPM normalized reads and scaled using the unit variance method for visualization in heatmaps. Clustering is done using the average method of pheatmap calculating the distances as correlations.
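The two steps named above can be sketched in Python as follows, assuming that unit variance scaling means centering each miRNA and dividing by its sample standard deviation, and that the correlation distance is one minus the Pearson correlation. Both function names are hypothetical.

```python
from statistics import mean, stdev


def unit_variance_scale(values):
    """Center a vector and divide by its sample standard deviation."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return [0.0] * len(values)
    return [(v - m) / s for v in values]


def correlation_distance(x, y):
    """Distance between two profiles as 1 minus their Pearson correlation."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return 1.0 - cov / (sx * sy)


scaled = unit_variance_scale([1.0, 2.0, 3.0])
```

With this distance, profiles that rise and fall together are close (distance near 0) regardless of their absolute level, while opposite profiles are far apart (distance near 2).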

2.6.1 Top 50 miRNAs

This heatmap shows only the top 50 miRNAs, ranked by coefficient of variation (CV%). An additional filter was introduced to increase robustness: only miRNAs with an RPM value above 5 in at least 1/n(groups) of the samples are included (e.g. with 4 groups, a miRNA must have an RPM value above 5 in at least 25% of the samples). This removes miRNAs that have a high CV but are expressed in too few samples to bear any statistical significance or biological relevance.
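The selection can be sketched in Python as below. This is an illustration under stated assumptions rather than the pipeline's code: the function names are hypothetical, the CV is computed with the sample standard deviation, and the expression filter uses the RPM threshold of 5 from the example above.

```python
from statistics import mean, stdev


def passes_filter(rpms, n_groups, min_rpm=5.0):
    """True if RPM exceeds min_rpm in at least 1/n_groups of the samples."""
    required = len(rpms) / n_groups
    expressed = sum(1 for value in rpms if value > min_rpm)
    return expressed >= required


def cv_percent(rpms):
    """Coefficient of variation in percent."""
    m = mean(rpms)
    return 100.0 * stdev(rpms) / m if m > 0 else 0.0


def top_by_cv(table, n_groups, top_n=50):
    """Names of the top_n miRNAs by CV% among those passing the filter."""
    kept = {name: rpms for name, rpms in table.items()
            if passes_filter(rpms, n_groups)}
    return sorted(kept, key=lambda name: cv_percent(kept[name]), reverse=True)[:top_n]


# Hypothetical RPM table: 4 samples in 2 groups.
table = {
    "a": [100.0, 10.0, 50.0, 20.0],
    "b": [10.0, 11.0, 10.0, 11.0],
    "c": [1000.0, 0.0, 0.0, 0.0],  # high CV, but expressed in one sample only
}
selected = top_by_cv(table, n_groups=2)
```

In the toy table, miRNA "c" has the highest CV but is removed by the filter, which is the situation the filter is meant to handle.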

Download data used to generate the heatmap

2.6.2 All miRNAs

434 miRNAs are shown in the following heatmap, based on the same filters described for the top 50 miRNAs.

Download data used to generate the heatmap

2.7 PCA

Principal component analysis (PCA) uses RPM normalized miRNA reads and reduces the data dimensions down to two, so that it can be plotted in a graph. A quick introduction to PCA plots and the underlying principle can be found here.

Samples are colored either by their first group or by the cluster they were assigned to. Clustering is done using the Ward (ward.D2) algorithm of hclust (split at a Euclidean cluster height of 40).
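To make the idea of dimension reduction concrete, the following Python sketch computes only the first principal component of a small sample-by-miRNA matrix using power iteration. It is not the method used for this report (which relies on R packages); the function name, the toy matrix, and the choice of power iteration are assumptions for illustration. A second component could be obtained by removing the first from the data and repeating the procedure.

```python
def first_principal_component(rows, iterations=100):
    """Return (loadings, scores) of the first principal component.

    rows: list of samples, each a list of feature values (e.g. RPM per miRNA).
    Assumption: the start vector is not orthogonal to the first component.
    """
    n_samples, n_features = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_samples for j in range(n_features)]
    centered = [[row[j] - means[j] for j in range(n_features)] for row in rows]

    def project(vector):
        # Score of each sample along the given direction.
        return [sum(c * v for c, v in zip(row, vector)) for row in centered]

    vector = [1.0 / n_features ** 0.5] * n_features
    for _ in range(iterations):
        scores = project(vector)
        # Multiply by X^T X, then renormalize (one power-iteration step).
        updated = [sum(centered[i][j] * scores[i] for i in range(n_samples))
                   for j in range(n_features)]
        norm = sum(v * v for v in updated) ** 0.5
        if norm == 0.0:
            break
        vector = [v / norm for v in updated]
    return vector, project(vector)


# Hypothetical toy matrix: the second feature is always twice the first.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
loadings, scores = first_principal_component(toy)
```

In the toy matrix all variation lies along one direction, so a single component captures the differences between the three samples.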

2.7.1 PCA cluster by sample groups

2.7.2 PCA cluster by hierarchical clustering

2.8 t-SNE

t-SNE is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space (like 2 dimensions here). It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. More details can be found in the author’s publication (Maaten and Hinton 2008).

3 Appendix

3.1 R session information

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.0.5 (2021-03-31)
##  os       Amazon Linux 2
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Etc/UCT
##  date     2021-12-09
##  pandoc   2.16.2 @ /home/ec2-user/mind/envs/tmp/79ceb962e5bc2aa81550cf556ab60c33/bin/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package       * version  date (UTC) lib source
##  annotate        1.68.0   2020-10-27 [1] Bioconductor
##  AnnotationDbi   1.52.0   2020-10-27 [1] Bioconductor
##  Biobase       * 2.50.0   2020-10-27 [1] Bioconductor
##  BiocGenerics  * 0.36.0   2020-10-27 [1] Bioconductor
##  bit             4.0.4    2020-08-04 [1] CRAN (R 4.0.3)
##  bit64           4.0.5    2020-08-30 [1] CRAN (R 4.0.3)
##  blob            1.2.2    2021-07-23 [1] CRAN (R 4.0.5)
##  cachem          1.0.6    2021-08-19 [1] CRAN (R 4.0.5)
##  callr           3.7.0    2021-04-20 [1] CRAN (R 4.0.3)
##  cellranger      1.1.0    2016-07-27 [1] CRAN (R 4.0.5)
##  cli             3.1.0    2021-10-27 [1] CRAN (R 4.0.5)
##  colorspace      2.0-2    2021-06-24 [1] CRAN (R 4.0.5)
##  crayon          1.4.2    2021-10-29 [1] CRAN (R 4.0.5)
##  crosstalk       1.2.0    2021-11-04 [1] CRAN (R 4.0.5)
##  data.table      1.14.2   2021-09-27 [1] CRAN (R 4.0.5)
##  DBI             1.1.1    2021-01-15 [1] CRAN (R 4.0.3)
##  desc            1.4.0    2021-09-28 [1] CRAN (R 4.0.5)
##  devtools        2.4.3    2021-11-30 [1] CRAN (R 4.0.5)
##  digest          0.6.29   2021-12-01 [1] CRAN (R 4.0.5)
##  dplyr         * 1.0.7    2021-06-18 [1] CRAN (R 4.0.5)
##  DT            * 0.17     2021-01-06 [1] CRAN (R 4.0.3)
##  edgeR         * 3.32.1   2021-01-14 [1] Bioconductor
##  ellipsis        0.3.2    2021-04-29 [1] CRAN (R 4.0.3)
##  evaluate        0.14     2019-05-28 [1] CRAN (R 4.0.5)
##  fansi           0.5.0    2021-05-25 [1] CRAN (R 4.0.5)
##  farver          2.1.0    2021-02-28 [1] CRAN (R 4.0.3)
##  fastmap         1.1.0    2021-01-25 [1] CRAN (R 4.0.3)
##  fs              1.5.2    2021-12-08 [1] CRAN (R 4.0.5)
##  genefilter    * 1.72.1   2021-01-21 [1] Bioconductor
##  generics        0.1.1    2021-10-25 [1] CRAN (R 4.0.5)
##  ggfortify     * 0.4.13   2021-10-25 [1] CRAN (R 4.0.5)
##  ggplot2       * 3.3.5    2021-06-25 [1] CRAN (R 4.0.5)
##  ggrepel       * 0.8.2    2020-03-08 [1] CRAN (R 4.0.0)
##  glue            1.5.1    2021-11-30 [1] CRAN (R 4.0.5)
##  gridExtra     * 2.3      2017-09-09 [1] CRAN (R 4.0.5)
##  gtable          0.3.0    2019-03-25 [1] CRAN (R 4.0.5)
##  highr           0.9      2021-04-16 [1] CRAN (R 4.0.3)
##  hms             1.1.1    2021-09-26 [1] CRAN (R 4.0.5)
##  htmltools       0.5.2    2021-08-25 [1] CRAN (R 4.0.5)
##  htmlwidgets     1.5.4    2021-09-08 [1] CRAN (R 4.0.5)
##  httr            1.4.2    2020-07-20 [1] CRAN (R 4.0.5)
##  IRanges         2.24.1   2020-12-12 [1] Bioconductor
##  jquerylib       0.1.4    2021-04-26 [1] CRAN (R 4.0.3)
##  jsonlite        1.7.2    2020-12-09 [1] CRAN (R 4.0.3)
##  kableExtra    * 1.3.4    2021-02-20 [1] CRAN (R 4.0.3)
##  knitr           1.36     2021-09-29 [1] CRAN (R 4.0.5)
##  labeling        0.4.2    2020-10-20 [1] CRAN (R 4.0.5)
##  lattice         0.20-45  2021-09-22 [1] CRAN (R 4.0.5)
##  lazyeval        0.2.2    2019-03-15 [1] CRAN (R 4.0.5)
##  lifecycle       1.0.1    2021-09-24 [1] CRAN (R 4.0.5)
##  limma         * 3.46.0   2020-10-27 [1] Bioconductor
##  locfit          1.5-9.4  2020-03-25 [1] CRAN (R 4.0.5)
##  magrittr      * 2.0.1    2020-11-17 [1] CRAN (R 4.0.3)
##  Matrix          1.3-4    2021-06-01 [1] CRAN (R 4.0.5)
##  memoise         2.0.1    2021-11-26 [1] CRAN (R 4.0.5)
##  mime            0.12     2021-09-28 [1] CRAN (R 4.0.5)
##  munsell         0.5.0    2018-06-12 [1] CRAN (R 4.0.5)
##  pcaMethods    * 1.82.0   2020-10-27 [1] Bioconductor
##  pheatmap      * 1.0.12   2019-01-04 [1] CRAN (R 4.0.5)
##  pillar          1.6.4    2021-10-18 [1] CRAN (R 4.0.5)
##  pkgbuild        1.2.1    2021-11-30 [1] CRAN (R 4.0.5)
##  pkgconfig       2.0.3    2019-09-22 [1] CRAN (R 4.0.5)
##  pkgload         1.2.4    2021-11-30 [1] CRAN (R 4.0.5)
##  plotly        * 4.9.4.1  2021-06-18 [1] CRAN (R 4.0.5)
##  prettyunits     1.1.1    2020-01-24 [1] CRAN (R 4.0.5)
##  processx        3.5.2    2021-04-30 [1] CRAN (R 4.0.3)
##  ps              1.6.0    2021-02-28 [1] CRAN (R 4.0.3)
##  purrr           0.3.4    2020-04-17 [1] CRAN (R 4.0.3)
##  R6              2.5.1    2021-08-19 [1] CRAN (R 4.0.5)
##  RColorBrewer  * 1.1-2    2014-12-07 [1] CRAN (R 4.0.5)
##  Rcpp            1.0.7    2021-07-07 [1] CRAN (R 4.0.5)
##  readr         * 1.4.0    2020-10-05 [1] CRAN (R 4.0.5)
##  readxl        * 1.3.1    2019-03-13 [1] CRAN (R 4.0.5)
##  remotes         2.4.2    2021-11-30 [1] CRAN (R 4.0.5)
##  rlang           0.4.12   2021-10-18 [1] CRAN (R 4.0.5)
##  rmarkdown       2.11     2021-09-14 [1] CRAN (R 4.0.5)
##  rprojroot       2.0.2    2020-11-15 [1] CRAN (R 4.0.3)
##  RSQLite         2.2.8    2021-08-21 [1] CRAN (R 4.0.5)
##  rstudioapi      0.13     2020-11-12 [1] CRAN (R 4.0.3)
##  Rtsne         * 0.15     2018-11-10 [1] CRAN (R 4.0.5)
##  rvest           1.0.2    2021-10-16 [1] CRAN (R 4.0.5)
##  S4Vectors       0.28.1   2020-12-09 [1] Bioconductor
##  scales          1.1.1    2020-05-11 [1] CRAN (R 4.0.5)
##  sessioninfo     1.2.2    2021-12-06 [1] CRAN (R 4.0.5)
##  stringi         1.7.6    2021-11-29 [1] CRAN (R 4.0.5)
##  stringr       * 1.4.0    2019-02-10 [1] CRAN (R 4.0.5)
##  survival        3.2-13   2021-08-24 [1] CRAN (R 4.0.5)
##  svglite         2.0.0    2021-02-20 [1] CRAN (R 4.0.3)
##  systemfonts     1.0.3    2021-10-13 [1] CRAN (R 4.0.5)
##  testthat        3.1.1    2021-12-03 [1] CRAN (R 4.0.5)
##  tibble        * 3.1.6    2021-11-07 [1] CRAN (R 4.0.5)
##  tidyr         * 1.1.4    2021-09-27 [1] CRAN (R 4.0.5)
##  tidyselect      1.1.1    2021-04-30 [1] CRAN (R 4.0.3)
##  usethis         2.1.3    2021-10-27 [1] CRAN (R 4.0.5)
##  utf8            1.2.2    2021-07-24 [1] CRAN (R 4.0.5)
##  vctrs           0.3.8    2021-04-29 [1] CRAN (R 4.0.5)
##  viridisLite     0.4.0    2021-04-13 [1] CRAN (R 4.0.3)
##  webshot         0.5.2    2019-11-22 [1] CRAN (R 4.0.5)
##  withr           2.4.3    2021-11-30 [1] CRAN (R 4.0.5)
##  WriteXLS      * 6.3.0    2021-04-01 [1] CRAN (R 4.0.3)
##  xfun            0.22     2021-03-11 [1] CRAN (R 4.0.3)
##  XML             3.99-0.8 2021-09-17 [1] CRAN (R 4.0.5)
##  xml2            1.3.3    2021-11-30 [1] CRAN (R 4.0.5)
##  xtable          1.8-4    2019-04-21 [1] CRAN (R 4.0.5)
##  yaml            2.2.1    2020-02-01 [1] CRAN (R 4.0.3)
## 
##  [1] /home/ec2-user/mind/envs/tmp/79ceb962e5bc2aa81550cf556ab60c33/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

3.2 References

The following references are provided for tools that have implications for the scientific and statistical outcome of this analysis. A multitude of other tools, many of them available as open source, helped in the preparation of this report. Please contact us for a full list of references.

Andrews, Simon. 2010. “FastQC: A quality control tool for high throughput sequence data.” https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
Bushnell, Brian. 2015. “BBMap.” https://sourceforge.net/projects/bbmap/.
Chen, Yunshun, Aaron T. L. Lun, and Gordon K. Smyth. 2016. “From reads to genes to pathways: Differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline [version 2; referees: 5 approved].” F1000Research 5: 1–49. https://doi.org/10.12688/F1000RESEARCH.8987.2.
Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016. “MultiQC: Summarize analysis results for multiple tools and samples in a single report.” Bioinformatics 32 (19): 3047–48. https://doi.org/10.1093/bioinformatics/btw354.
Friedländer, Marc R., Sebastian D. MacKowiak, Na Li, Wei Chen, and Nikolaus Rajewsky. 2012. “MiRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades.” Nucleic Acids Research 40 (1): 37–52. https://doi.org/10.1093/nar/gkr688.
Griffiths-Jones, S. 2004. “The microRNA Registry.” Nucleic Acids Research 32 (90001): D109–D111. https://doi.org/10.1093/nar/gkh023.
Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nature Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.
Köster, Johannes, and Sven Rahmann. 2012. “Snakemake-a scalable bioinformatics workflow engine.” Bioinformatics 28 (19): 2520–22. https://doi.org/10.1093/bioinformatics/bts480.
Langmead, Ben, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. 2009. “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.” Genome Biology 10 (3). https://doi.org/10.1186/gb-2009-10-3-r25.
Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. 2009. “The Sequence Alignment/Map format and SAMtools.” Bioinformatics 25 (16): 2078–79. https://doi.org/10.1093/bioinformatics/btp352.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 1–21. https://doi.org/10.1186/s13059-014-0550-8.
Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing High-Dimensional Data Using t-SNE.” Journal of Machine Learning Research 9 (August): 2579–2605.
Martin, Marcel. 2011. “Cutadapt removes adapter sequences from high-throughput sequencing reads.” EMBnet.journal 17 (1): 10. https://doi.org/10.14806/ej.17.1.200.
McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research 40 (10): 4288–97. https://doi.org/10.1093/nar/gks042.
Pantano, Lorena, Marc R. Friedländer, Georgia Escaramís, Esther Lizano, Joan Pallarès-Albanell, Isidre Ferrer, Xavier Estivill, and Eulàlia Martí. 2016. “Specific small-RNA signatures in the amygdala at premotor and motor stages of Parkinson’s disease revealed by deep sequencing analysis.” Bioinformatics 32 (5): 673–81. https://doi.org/10.1093/bioinformatics/btv632.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2009. “edgeR: A Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Stacklies, Wolfram, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. 2007. “pcaMethods - A bioconductor package providing PCA methods for incomplete data.” Bioinformatics 23 (9): 1164–67. https://doi.org/10.1093/bioinformatics/btm069.
Sweeney, Blake A., Anton I. Petrov, Boris Burkov, Robert D. Finn, Alex Bateman, Maciej Szymanski, Wojciech M. Karlowski, et al. 2019. “RNAcentral: A hub of information for non-coding RNA sequences.” Nucleic Acids Research 47 (D1): D221–29. https://doi.org/10.1093/nar/gky1034.
Zerbino, Daniel R., Premanand Achuthan, Wasiu Akanni, M. Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, et al. 2018. “Ensembl 2018.” Nucleic Acids Research 46 (D1): D754–61. https://doi.org/10.1093/nar/gkx1098.