{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# scram demonstration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Using the [scram_docker container](https://hub.docker.com/r/sfletcher/scram_docker/)\n", "- Notebook started in the project root directory via (Windows):\n", "```\n", "docker run -it --rm -v ${PWD}:/work -p 8888:8888 sfletcher/scram_docker\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Set up the Jupyter environment (optional)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "%matplotlib inline\n", "#To display pandas dataframes inline\n", "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\"\n", "import pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example file and directory structure on the host" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".\r\n", "├── license\r\n", "├── out_dir\r\n", "├── ref\r\n", "│ ├── GFP.fa\r\n", "│ ├── TAIR10_transposable_elements.fa\r\n", "│ └── ath_mir.fa\r\n", "├── scram_demonstration.ipynb\r\n", "└── seq\r\n", " ├── treatment_a_rep1.fa\r\n", " ├── treatment_a_rep2.fa\r\n", " ├── treatment_a_rep3.fa\r\n", " ├── treatment_b_rep1.fa\r\n", " ├── treatment_b_rep2.fa\r\n", " └── treatment_b_rep3.fa\r\n", "\r\n", "3 directories, 11 files\r\n" ] } ], "source": [ "! tree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### scram help\n", "- The scram aligner is in the container path\n", "- In Jupyter notebook, use the ! symbol to run a non-Python CLI app" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fast and simple small RNA read alignment v0.2.0\r\n", "\r\n", "Usage:\r\n", " scram [command]\r\n", "\r\n", "Available Commands:\r\n", " compare Compare normalised alignment counts and standard errors for 2 read file sets\r\n", " help Help about any command\r\n", " profile Align reads of length l from 1 read file set to all sequences in a reference file\r\n", "\r\n", "Flags:\r\n", " --adapter string 3' adapter sequence to trim - FASTA & FASTQ only (default \"nil\")\r\n", " -r, --alignTo string path/to/FASTA reference file\r\n", " -1, --fastxSet1 string comma-separated path/to/read file set 1. GZIPped files must have .gz file extension\r\n", " -h, --help help for scram\r\n", " -l, --length string comma-separated read (sRNA) lengths to align\r\n", " --maxLen int Maximum read length to include for RPMR normalization (default 32)\r\n", " --minCount float Minimum read count for alignment and to include for RPMR normalization (default 1)\r\n", " --minLen int Minimum read length to include for RPMR normalization (default 18)\r\n", " --noNorm Do not normalize read counts by library size (i.e. reads per million reads)\r\n", " --noSplit Do not split alignment count for each read by the number of times it aligns\r\n", " -o, --outFilePrefix string path/to/outfile prefix (len.csv will be appended)\r\n", " -t, --readFileType string Read file type: cfa (collapsed FASTA), fa (FASTA), fq (FASTQ), clean (BGI clean.fa). (default \"cfa\")\r\n", "\r\n", "Use \"scram [command] --help\" for more information about a command.\r\n" ] } ], "source": [ "!scram -h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparison alignments and scatter plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- For comparing two treatments (e.g. wild-type verses mutant) - the output is the mean and standard error of the total alignments to each reference sequence\n", "- By default, the alignment count of multi-mapping reads is split evenly between the number of loci aligned to (both within and among all reference sequences). The -noSplit flag shows the maximum possible alignment count at each loci \n", "- Alignments are carried out seperately for each small RNA size (read length) entered in the -nt filed\n", "- The read length aligned is appended to the alignment csv file name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Alignment of 2 x 3 replicates (treatment A & treatment B) x 3 read lengths (21, 22, 24 nt) to Arabidopsis transposable elements FASTA reference file\n", "- Read files are in collapsed FASTA format (generated by [FASTX-Toolkit](http://hannonlab.cshl.edu/fastx_toolkit/)), which is the default format for SCRAM \n", "- FASTQ or FASTA (non-collapsed) can also be used (with the readFileType flag). GZIP compressed input is fine. 3' Adapters can be trimmed on-the-fly from FASTA and FASTQ reads (with the adapter flag). Adapter trimming is quite stringent (12 5'adapter nucleotides must be all present, or the read is rejected), so a dedicated read QC/trimming software package may be desirable. " ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Loading reads\n", "\n", "\n", "SCRAM is attempting to load read files in the default collapsed FASTA format\n", "seq/treatment_a_rep1.fa - 7,916,958 reads processed\n", "seq/treatment_a_rep2.fa - 7,827,082 reads processed\n", "seq/treatment_a_rep3.fa - 7,897,787 reads processed\n", "\n", "SCRAM is attempting to load read files in the default collapsed FASTA format\n", "seq/treatment_b_rep3.fa - 9,185,811 reads processed\n", "seq/treatment_b_rep2.fa - 8,311,241 reads processed\n", "seq/treatment_b_rep1.fa - 8,203,718 reads processed\n", "\n", "Loading reference\n", "\n", "No. of reference sequences: 31189\n", "Combined length of reference sequences: 23,315,940 nt\n", "\n", "Aligning 21 nt reads\n", "\n", "Aligning 22 nt reads\n", "\n", "Aligning 24 nt reads\n", "\n", "Alignment complete. Total time taken = 15.746887639s\n" ] } ], "source": [ "!scram compare -r ref/TAIR10_transposable_elements.fa \\\n", " -1 seq/treatment_a_rep1.fa,seq/treatment_a_rep2.fa,seq/treatment_a_rep3.fa \\\n", " -2 seq/treatment_b_rep1.fa,seq/treatment_b_rep2.fa,seq/treatment_b_rep3.fa \\\n", " -l 21,22,24 -o out_dir/treatment_a_vs_b \n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The normalised alignment count (reads per million reads) and standard error for the two treatments (columns) aligned to each reference sequence (rows) are generated\n", "- data can be easily imported and manipulated in a Pandas dataframe\n", "- NOTE: There's no need for data manipulation in Pandas prior to plotting - it's just a demo" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "comparison_alignment = pandas.read_csv('out_dir/treatment_a_vs_b_21.csv')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | Header | \n", "Mean count 1 | \n", "Std. err 1 | \n", "Mean count 2 | \n", "Std. err 2 | \n", "
---|---|---|---|---|---|
0 | \n", "AT3TE56475|+|13785373|13785436|ATDNA12T3_2|DNA... | \n", "0.466 | \n", "0.017302 | \n", "0.427 | \n", "0.022843 | \n", "
1 | \n", "AT3TE55845|+|13718067|13718130|ATDNA12T3_2|DNA... | \n", "0.953 | \n", "0.019223 | \n", "0.920 | \n", "0.040779 | \n", "
2 | \n", "AT5TE53195|+|14754521|14755395|ATREP15|RC/Heli... | \n", "0.001 | \n", "0.000003 | \n", "0.000 | \n", "0.000017 | \n", "
3 | \n", "AT3TE45810|+|11021715|11022665|BOMZH1|DNA/MuDR... | \n", "0.034 | \n", "0.004299 | \n", "0.027 | \n", "0.004598 | \n", "
4 | \n", "AT3TE47200|-|11319279|11319496|HELITRONY3|RC/H... | \n", "0.039 | \n", "0.002143 | \n", "0.037 | \n", "0.002122 | \n", "
\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"\n", " | Header | \n", "len | \n", "sRNA | \n", "Position | \n", "Strand | \n", "Count | \n", "Std. Err | \n", "Times aligned | \n", "
---|---|---|---|---|---|---|---|---|
0 | \n", "AT1TE52125|-|15827287|15838845|ATHILA2|LTR/Gyp... | \n", "11559 | \n", "AAAAGGTCAAGAGACAAAGAT | \n", "3237 | \n", "- | \n", "0.004 | \n", "0.000621 | \n", "70 | \n", "
1 | \n", "AT1TE52125|-|15827287|15838845|ATHILA2|LTR/Gyp... | \n", "11559 | \n", "TAATCCGGATTTCTCTTTATC | \n", "4253 | \n", "+ | \n", "0.003 | \n", "0.000009 | \n", "99 | \n", "
2 | \n", "AT1TE52125|-|15827287|15838845|ATHILA2|LTR/Gyp... | \n", "11559 | \n", "AGAAAACCTACTGTAAACTGT | \n", "11514 | \n", "- | \n", "0.068 | \n", "0.008258 | \n", "5 | \n", "
3 | \n", "AT1TE53750|-|16325184|16327367|ATREP3|RC/Helit... | \n", "2184 | \n", "ACTAGATTTTAACCCGCGGTA | \n", "65 | \n", "+ | \n", "0.002 | \n", "0.000265 | \n", "164 | \n", "
4 | \n", "AT1TE53750|-|16325184|16327367|ATREP3|RC/Helit... | \n", "2184 | \n", "AAAAATAAATCGTCCCGCGGT | \n", "87 | \n", "- | \n", "0.007 | \n", "0.000024 | \n", "37 | \n", "