{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NGS Data formats and QC\n",
    "\n",
    "## Introduction\n",
    "There are several file formats for storing Next Generation Sequencing (NGS) data. In this tutorial we will look at some of the most common formats for storing NGS reads and variant data. We will cover the following formats:\n",
    "\n",
    "__FASTQ__ - This format stores unaligned read sequences with base qualities  \n",
    "__SAM/BAM__ - This format stores unaligned or aligned reads (text and binary formats)  \n",
    "__CRAM__ - This format is similar to BAM but has better compression than BAM  \n",
    "__VCF/BCF__ - Flexible variant call format for storing SNPs, indels, structural variations (text and binary formats)   \n",
    "\n",
    "Further to understanding the different file formats, it is important to remember that all sequencing platforms have technical limitations that can introduce biases in your sequencing data. Because of this it is very important to check the quality of the data before starting any analysis, whether you are planning to use something you have sequenced yourself or publicly available data. In the second part of this tutorial we will describe how to perform a QC assessment for your NGS data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir ngs\n",
    "cd ngs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can follow the instructions in the tutorial from here.\n",
    "\n",
    "## Let’s get started!\n",
    "Let's download `samtools`,`bcftools`and `htslib`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do this from commandline.\n",
    "\n",
    "```\n",
    "sudo yum install samtools bcftools htslib gnuplot\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samtools --help"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "bcftools --help"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This should return the help message for samtools and bcftools respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the required datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wget https://github.com/cb2edu/CB2-101-BioComp/raw/2020/07-NGS_Intro_QC/data/example.tgz\n",
    "tar -xvzf example.tgz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## FASTQ\n",
    "FASTQ is a data format for raw unaligned sequencing reads. It is an extension to the FASTA file format, and includes a quality score for each base. Have a look at the example below, containing two reads:\n",
    "  \n",
    "```\n",
    "@ERR007731.739 IL16_2979:6:1:9:1684/1\n",
    "CTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATG\n",
    "+\n",
    "BBCBCBBBBBBBABBABBBBBBBABBBBBBBBBBBBBBABAAAABBBBB=@>B\n",
    "@ERR007731.740 IL16_2979:6:1:9:1419/1\n",
    "AAAAAAAAAGATGTCATCAGCACATCAGAAAAGAAGGCAACTTTAAAACTTTTC\n",
    "+\n",
    "BBABBBABABAABABABBABBBAAA>@B@BBAA@4AAA>.>BAA@779:AAA@A\n",
    "```\n",
    "   \n",
    "We can see that for each read we get four lines:\n",
    "\n",
    "   1. The read metadata, such as the read ID. Starts with `@` and, for paired-end Illumina reads, is terminated with /1 or /2 to show that the read is the member of a pair. \n",
    "   2. The read\n",
    "   3. Starts with `+` and optionally contains the ID again\n",
    "   4. The per base [Phred quality score](https://en.wikipedia.org/wiki/Phred_quality_score)\n",
    "\n",
    "The quality scores range (in theory) from 1 to 94 and are encoded as [ASCII characters](https://en.wikipedia.org/wiki/ASCII). The first 32 ASCII codes are reserved for control characters which are not printable, and the 33rd is reserved for space. Neither of these can be used in the quality string, so we need to subtract 33 from whatever the value of the quality character is. For example, the ASCII code of “A” is 65, so the corresponding quality is:\n",
    "   \n",
    "```   \n",
    "Q = 65 - 33 = 32\n",
    "```\n",
    "   \n",
    "The Phred quality score `Q` relates to the base-calling error probability `P` as\n",
    "   \n",
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; P = 10<sup>-Q/10</sup>   \n",
    "   \n",
    "The Phred quality score is a measure of the quality of base calls. For example, a base assigned with a Phred quality score of 30 tells us that there is a 1 in 1000 chance that this base was called incorrectly.\n",
    "\n",
    "|Phred Quality Score|Probability of incorrect base call|Base call accuracy|\n",
    "|---|---|---|\n",
    "|10|1 in 10|90%|\n",
    "|20|1 in 100|99%|\n",
    "|30|1 in 1000|99.9%|\n",
    "|40|1 in 10,000|99.99%|\n",
    "|50|1 in 100,000|99.999%|\n",
    "|60|1 in 1,000,000|99.9999%|\n",
    "\n",
    "The following simple perl command will print the quality score value for an ASCII character. Try changing the \"A\" to another character, for example one from the quality strings above (e.g. @, = or B)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "perl -e 'printf \"%d\\n\",ord(\"@\")-33;'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "__Q2.1: How many reads are there in the file example.fastq? (Hint: remember that `@` is a possible quality score. Is there something else in the header that is unique?)__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SAM\n",
    "[SAM (Sequence Alignment/Map)](https://samtools.github.io/hts-specs/SAMv1.pdf) format was developed by the 1000 Genomes Project group in 2009 and is a unified format for storing read alignments to a reference genome. SAM format is a standard format for storing NGS sequencing reads, base qualities, associated meta-data and alignments of the data to a reference genome. If no reference genome is available, the data can also be stored unaligned.\n",
    "\n",
    "The files consist of a header section (optional) and an alignment section. The alignment section contains one record (a single DNA fragment alignment) per line describing the alignment between fragment and reference. Each record has 11 fixed columns and optional key:type:value tuples. Open the SAM/BAM file specification document [https://samtools.github.io/hts-specs/SAMv1.pdf](https://samtools.github.io/hts-specs/SAMv1.pdf) either in a web browser or you can find a copy in the QC directory as you may need to refer to it throughout this tutorial. \n",
    "\n",
    "Now let us have a closer look at the different parts of the SAM/BAM format. \n",
    "\n",
    "### Header Section\n",
    "\n",
    "The header section of a SAM file looks like:\n",
    "\n",
    "<insert header text>\n",
    "\n",
    "```\n",
    "@HD\tVN:1.0\tSO:coordinate\n",
    "@SQ\tSN:test_ref\tLN:17637\n",
    "@RG ID:ERR003612 PL:ILLUMINA LB:g1k-sc-NA20538-TOS-1 PI:2000 DS:SRP000540 SM:NA20538 CN:SC\n",
    "```\n",
    "\n",
    "Each line in the SAM header begins with an `@`, followed by a two-letter header record type code as defined in the [SAM/BAM format specification document](https://samtools.github.io/hts-specs/SAMv1.pdf). Each record type can contain meta-data captured as a series of key-value pairs in the format of ‘TAG:VALUE’.\n",
    "\n",
    "#### Read groups\n",
    "One useful record type is RG which can be used to describe each lane of sequencing. The RG code can be used to capture extra meta-data for the sequencing lane. Some common RG TAGs are:\n",
    "\n",
    "* ID: SRR/ERR number\n",
    "* PL: Sequencing platform\n",
    "* PU: Run name\n",
    "* LB: Library name\n",
    "* PI: Insert fragment size\n",
    "* SM: Individual/Sample\n",
    "* CN: Sequencing centre\n",
    "\n",
    "While most of these are self explanitory, insert fragment size may occasionally be negative. This simply indicates that the reads found are overlapping while its size is less than 2 x read length."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "Look at the following line from the header of a SAM file and answering the questions that follow: \n",
    "\n",
    "```\n",
    "@RG ID:ERR003612 PL:ILLUMINA LB:g1k-sc-NA20538-TOS-1 PI:2000 DS:SRP000540 SM:NA20538 CN:SC\n",
    "```\n",
    "   \n",
    "__Q3: What does RG stand for?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q4: What is the sequencing platform?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q5: What is the sequencing centre?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q6: What is the lane identifier?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q7: What is the expected fragment insert size?__  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Alignment Section\n",
    "The alignment section of SAM files contains one line per read alignment, an example is\n",
    "\n",
    "ERR005816.1408831\t163\tChr1\t19999970\t23\t40M5D30M2I28M\t=\t20000147\t213 GGTGGGTGGATCACCTGAGATCGGGAGTTTGAGACTAGGTGG...   \n",
    "<=@A@??@=@A@A>@BAA@ABA:>@<>=BBB9@@2B3<=@A@...   \n",
    "\n",
    "Each of the lines are composed of multiple columns listed below. The first 11 columns are mandatory.\n",
    "\n",
    "1. QNAME: Query NAME of the read or the read pair i.e. DNA sequence\n",
    "2. FLAG: Bitwise FLAG (pairing, strand, mate strand, etc.)\n",
    "3. RNAME: Reference sequence NAME\n",
    "4. POS: 1-Based leftmost POSition of clipped alignment\n",
    "5. MAPQ: MAPping Quality (Phred-scaled)\n",
    "6. CIGAR: Extended CIGAR string (operations: MIDNSHPX=)\n",
    "7. MRNM: Mate Reference NaMe (’=’ if same as RNAME)\n",
    "8. MPOS: 1-Based leftmost Mate POSition\n",
    "9. ISIZE: Inferred Insert SIZE\n",
    "10. SEQ: Query SEQuence on the same strand as the reference\n",
    "11. QUAL: Query QUALity (ASCII-33=Phred base quality)\n",
    "12. OTHER: Optional fields\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "Let's have a look at example.sam. This file contains only a subset of the alignment section of a BAM-file that we will look closer at soon. Notice that we can use the standard UNIX operations like __cat__ on this file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat example.sam"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q8: What is the mapping quality of ERR003762.5016205? (Hint: can you use grep and awk to find this?)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q9: What is the CIGAR string for ERR003814.6979522? (Hint: we will go through the meaning of CIGAR strings in the next section)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q10: What is the inferred insert size of ERR003814.1408899?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CIGAR string\n",
    "Column 6 of the alignment is the CIGAR string for that alignment. The CIGAR string provides a compact representation of sequence alignment. Have a look at the table below. It contains the meaning of all different symbols of a CIGAR string:\n",
    "\n",
    "|Symbol |Meaning                                             |\n",
    "|---    |---                                                 |\n",
    "|M      |alignment match or mismatch                         |\n",
    "|=      |sequence match                                      |\n",
    "|X      |sequence mismatch                                   |\n",
    "|I      |insertion into the read (sample sequenced)          |\n",
    "|D      |deletion from the read (sample sequenced)           |\n",
    "|S      |soft clipping (clipped sequences present in SEQ)    |\n",
    "|H      |hard clipping (clipped sequences NOT present in SEQ)|\n",
    "|N      |skipped region from the reference                   |\n",
    "|P      |padding (silent deletion from padded reference)     |\n",
    "\n",
    "Below are two examples describing the CIGAR string in more detail.\n",
    "  \n",
    "__Example 1:__  \n",
    "Ref:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACGTACGTACGTACGT  \n",
    "Read:&nbsp;&nbsp;ACGT-&nbsp;-&nbsp;-&nbsp;-&nbsp;ACGTACGA  \n",
    "Cigar: 4M 4D 8M  \n",
    "\n",
    "The first four bases in the read are the same as in the reference, so we can represent these as 4M in the CIGAR string. Next comes 4 deletions, represented by 4D, followed by 7 alignment matches and one alignment mismatch, represented by 8M. Note that the mismatch at position 16 is included in 8M. This is because it still aligns to the reference.\n",
    "\n",
    "__Example 2:__  \n",
    "Ref:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACTCAGTG-&nbsp;-&nbsp;GT  \n",
    "Read:&nbsp;&nbsp;ACGCA-&nbsp;TGCAGTtagacgt  \n",
    "Cigar: 5M 1D 2M 2I 2M 7S  \n",
    "\n",
    "Here we start off with 5 alignment matches and mismatches, followed by one deletion. Then we have two more alignment matches, two insertions and two more matches. At the end, we have seven soft clippings, 7S. These are clipped sequences that are present in the SEQ (Query SEQuence on the same strand as the reference).\n",
    "\n",
    "### Exercises\n",
    "__Q11: What does the CIGAR from Q9 mean?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q12: How would you represent the following alignment with a CIGAR string?__  \n",
    " \n",
    "Ref:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACGT-&nbsp;-&nbsp;-&nbsp;-&nbsp;ACGTACGT  \n",
    "Read:&nbsp;&nbsp;ACGTACGTACGTACGT  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Flags\n",
    "Column 2 of the alignment contains a combination of bitwise FLAGs describing the alignment. The following table contains the information you can get from the bitwise FLAGs:\n",
    "   \n",
    "|Hex  |Dec |Flag         |Description                                           |\n",
    "|---  |--- |---          |---                                                   |\n",
    "|0x1  |1   |PAIRED       |paired-end (or multiple-segment) sequencing technology|\n",
    "|0x2  |2   |PROPER_PAIR  |each segment properly aligned according to the aligner|\n",
    "|0x4  |4   |UNMAP        |segment unmapped                                      |\n",
    "|0x8  |8   |MUNMAP       |next segment in the template unmapped                 |\n",
    "|0x10 |16  |REVERSE      |SEQ is reverse complemented                           |\n",
    "|0x20 |32  |MREVERSE     |SEQ of the next segment in the template is reversed   |\n",
    "|0x40 |64  |READ1        |the first segment in the template                     |\n",
    "|0x80 |128 |READ2        |the last segment in the template                      |\n",
    "|0x100|256 |SECONDARY    |secondary alignment                                   |\n",
    "|0x200|512 |QCFAIL       |not passing quality controls                          |\n",
    "|0x400|1024|DUP          |PCR or optical duplicate                              |\n",
    "|0x800|2048|SUPPLEMENTARY|supplementary alignment                               |\n",
    "\n",
    "For example, if you have an alignment with FLAG set to 113, this can only be represented by decimal codes `64 + 32 + 16 + 1`, so we know that these four flags apply to the alignment and the alignment is paired-end, reverse complemented, sequence of the next template/mate of the read is reversed and the read aligned is the first segment in the template.\n",
    "\n",
    "#### Primary, secondary and supplementary alignments\n",
    "A read that aligns to a single reference sequence (including insertions, deletions, skips and clipping but not direction changes), is a __linear alignment__. If a read cannot be represented as a linear alignment, but instead is represented as a group of linear alignments without large overlaps, it is called a __chimeric alignment__. These can for instance be caused by structural variations. Usually, one of the linear alignments in a chimeric alignment is considered to be the __representative__ alignment, and the others are called __supplementary__.\n",
    "\n",
    "Sometimes a read maps equally well to more than one spot. In these cases, one of the possible alignments is marked as the __primary__ alignment and the rest are marked as __secondary__ alignments.\n",
    "\n",
    "## BAM\n",
    "BAM (Binary Alignment/Map) format, is a compressed binary version of SAM. This means that, while SAM is human readable, BAM is only readable for computers. BAM was developed for fast processing and random access. To achieve this, BGZF (Block GZIP) compression is used for indexing. BAM files can be viewed using samtools, and will then have the same format as a SAM file.  The key features of BAM are:\n",
    "\n",
    "* Can store alignments from most mappers\n",
    "* Supports multiple sequencing technologies\n",
    "* Supports indexing for quick retrieval/viewing\n",
    "* Compact size (e.g. 112Gbp Illumina = 116GB disk space)\n",
    "* Reads can be grouped into logical groups e.g. lanes, libraries, samples\n",
    "* Widely supported by variant calling packages and viewers\n",
    "\n",
    "Since BAM is a binary format, we can't use the standard UNIX operations directly on this file format. __Samtools__ is a set of programs for interacting with SAM and BAM files. Using the samtools view command, print the header of the BAM file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samtools view -H NA20538.bam"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "__Q13: What version of the human assembly was used to perform the alignments? (Hint: Can you spot this somewhere in the @SQ records?)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q14: How many lanes are in this BAM file? (Hint: Do you recall what RG represents?)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q15: What version of bwa was used to align the reads? (Hint: is there anything in the @PG record that looks like it could be a version tag?)__  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output from running samtools view on a BAM file without any options is a headerless SAM file. This gets printed to STDOUT in the terminal, so we will want to pipe it to something. Let's have a look at the first read of the BAM file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samtools view NA20538.bam | head -n 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q16: What is the name of the first read? (Hint: have a look at the alignment section if you can't recall the different fields)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q17: What position does the alignment of the read start at?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Indexing\n",
    "To allow for fast random access of regions in BAM and CRAM files, they can be indexed. The files must first be coordinate-sorted rather that sorted by read name. This can be done using __samtools sort__. If no options are supplied, it will by default sort by the left-most position of the reference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samtools sort -o NA20538_sorted.bam NA20538.bam"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use __samtools index__ to create an index file (.bai) for our sorted BAM file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samtools index NA20538_sorted.bam"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To look for reads mapped to a specific region, we can use __samtools view__ and specify the region we are interested in as: RNAME[:STARTPOS[-ENDPOS]]. For example, if we wanted to look at all the reads mapped to a region called chr4, we could use:\n",
    "\n",
    "``samtools view alignment.bam chr4``\n",
    "\n",
    "To look at the region on chr4 beginning at position 1,000,000 and ending at the end of the chromosome, we can do:\n",
    "\n",
    "  ``samtools view alignment.bam chr4:1000000``\n",
    "  \n",
    "And to explore the 1001bp long region on chr4 beginning at position 1,000 and ending at position 2,000, we can use:\n",
    "\n",
    "  ``samtools view alignment.bam chr4:1000-2000``"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercises \n",
    "__Q18: How many reads are mapped to region 20025000-20030000 on chromosome 1?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samtools view NA20538_sorted.bam 1:20025000-20030000 | wc -l"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## VCF\n",
    "The VCF file format was introduced to store variation data. VCF consists of tab-delimited text and is parsable by standard UNIX commands which makes it flexible and user-extensible. \n",
    "\n",
    "### VCF header\n",
    "The VCF header consists of meta-information lines (starting with `##`) and a header line (starting with `#`). All meta-information lines are optional and can be put in any order, except for _fileformat_. This holds the information about which version of VCF is used and must come first. \n",
    "\n",
    "The meta-information lines consist of key=value pairs. Examples of meta-information lines that can be included are ##INFO, ##FORMAT and ##reference. The values can consist of multiple fields enclosed by `<>`. More information about these fields is available in the VCF specification [http://samtools.github.io/hts-specs/VCFv4.3.pdf](http://samtools.github.io/hts-specs/VCFv4.3.pdf). This can be accessed using a web browser and there is a copy in the QC directory.\n",
    "\n",
    "#### Header line\n",
    "The header line starts with `#` and consists of 8 required fields:\n",
    "\n",
    "1. CHROM: an identifier from the reference genome\n",
    "2. POS: the reference position\n",
    "3. ID: a list of unique identifiers (where available)\n",
    "4. REF: the reference base(s)\n",
    "5. ALT: the alternate base(s)\n",
    "6. QUAL: a phred-scaled quality score\n",
    "7. FILTER: filter status\n",
    "8. INFO: additional information\n",
    "\n",
    "If the file contains genotype data, the required fields are also followed by a FORMAT column header, and then a number of sample IDs. The FORMAT field specifies the data types and order. Some examples of these data types are:\n",
    "\n",
    "* GT: Genotype, encoded as allele values separated by either / or |\n",
    "* DP: Read depth at this position for this sample\n",
    "* GQ: Conditional genotype quality, encoded as a phred quality\n",
    "\n",
    "### Body\n",
    "In the body of the VCF, each row contains information about a position in the genome along with genotype information on samples for each position, all according to the fields in the header line.\n",
    "\n",
    "## BCF\n",
    "\n",
    "BCF is a compressed binary representation of VCF.\n",
    "\n",
    "VCF can be compressed with BGZF (bgzip) and  indexed with TBI or CSI (tabix), but even compressed it can still be very big. For example, a compressed VCF with 3781 samples of human data will be 54 GB for chromosome 1, and 680 GB for the whole genome. VCFs can also be slow to parse, as text conversion is slow. The main bottleneck is the \"FORMAT\" fields. For this reason the BCF format was developed. \n",
    "\n",
    "In BCF files the fields are rearranged for fast access. The following images show the process of converting a VCF file into a BCF file. \n",
    "\n",
    "Bcftools comprises a set of programs for interacting with VCF and BCF files. It can be used to convert between VCF and BCF and to view or extract records from a region.\n",
    "\n",
    "### bcftools view  \n",
    "Let's have a look at the header of the file 1kg.bcf in the data directory. Note that bcftools uses __`-h`__ to print only the header, while samtools uses __`-H`__ for this. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bcftools view -h 1kg.bcf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly to BAM, BCF supports random access, that is, fast retrieval from a given region. For this, the file must be indexed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bcftools index 1kg.bcf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can extract all records from the region 20:24042765-24043073, using the __`-r`__ option. The __`-H`__ option will make sure we don't include the header in the output:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bcftools view -H -r 20:24042765-24043073 1kg.bcf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### bcftools query  \n",
    "The versatile __bcftools query__ command can be used to extract any VCF field. Combined with standard UNIX commands, this gives a powerful tool for quick querying of VCFs. Have a look at the usage options:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bcftools query -h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try out some useful options. As you can see from the usage, __`-l`__ will print a list of all the samples in the file. Give this a go:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bcftools query -l 1kg.bcf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another very useful option is __`-s`__ which allows you to extract all the data relating to a particular sample. This is a [common option](http://samtools.github.io/bcftools/bcftools.html#common_options) meaning it can be used for many bcftools commands, like `bcftools view`. Try this for sample HG00131:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bcftools view -s HG00131 1kg.bcf | head -n 50"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The format option, __`-f`__ can be used to select what gets printed from your query command. For example, the following will print the position, reference base and alternate base for sample HG00131, separated by tabs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bcftools query -f'%POS\\t%REF\\t%ALT\\n' -s HG00131 1kg.bcf | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's look at the __`-i`__ option. With this option we can select only sites for which a particular expression is true. For instance, if we only want to look at sites that have more than 2 alternate alleles, we can use the following expression (piped to `head` to only show a subset of the output):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bcftools query -f'%CHROM\\t%POS\\t%REF\\t%ALT\\n' -i 'AC[0]>2' 1kg.bcf | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use __`-i`__ with the expression `AC[0]>2`. AC is an info field that holds the __a__llele __c__ount. Some fields can hold multiple values, so we use `AC[0]>2` to indicate that we are looking for the first value (this is zero indexed, and hence starts at 0 instead of 1), and that this value should be > 2. To format our output, we use __`-f`__ to specify that we want to print the chromosome name and position.\n",
    "\n",
    "There is more information about expressions on the bcftools manual page [http://samtools.github.io/bcftools/bcftools.html#expressions](http://samtools.github.io/bcftools/bcftools.html#expressions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "Now, try and answer the following questions about the file 1kg.bcf in the data directory. For more information about the different usage options you can open the bcftools query manual page [http://samtools.github.io/bcftools/bcftools.html#query](http://samtools.github.io/bcftools/bcftools.html#query) in a web browser."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q19: What version of the human assembly do the coordinates refer to?__ "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q20: How many samples are there in the BCF?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q21: What is the genotype of the sample HG00107 at the position 20:24019472? (Hint: use the combination of -r, -s, and -f options)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q22: How many positions are there with more than 10 alternate alleles? (Hint: use the -i filtering option)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q23: In how many positions does HG00107 have a non-reference genotype and a read depth bigger than 10? (Hint: you can use pipes to combine bcftools queries)__   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bcftools view -s HG00107 -H -i 'DP>10' 1kg.bcf | bcftools view -s HG00107 -H -i 'AC[0]>0' 1kg.bcf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# QC assessment of NGS data\n",
    "QC is an important part of any analysis. In this section we are going to look at some of the metrics and graphs that can be used to assess the QC of NGS data. \n",
    "\n",
    "\n",
    "## Base quality\n",
    "[Illumina sequencing](https://en.wikipedia.org/wiki/Illumina_dye_sequencing) technology relies on sequencing by synthesis. One of the most common problems with this is __dephasing__. For each sequencing cycle, there is a possibility that the replication machinery slips and either incorporates more than one nucleotide or perhaps misses to incorporate one at all. The more cycles that are run (i.e. the longer the read length gets), the greater the accumulation of these types of errors gets. This leads to a heterogeneous population in the cluster, and a decreased signal purity, which in turn reduces the precision of the base calling. The figure below shows an example of this."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because of dephasing, it is possible to have high-quality data at the beginning of the read but really low-quality data towards the end of the read. In those cases you can decide to trim off the low-quality reads, for example using a tool called [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic).\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quality control of fastq files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download the required dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wget https://github.com/cb2edu/CB2-101-BioComp/raw/2020/07-NGS_Intro_QC/data/rnaseq_data.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unzip the tarball"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tar -xvzf rnaseq_data.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "FASTQC is software that is used to calculate the quality of the fastq data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip\n",
    "unzip fastqc_v0.11.9.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the help of fastqc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "FastQC/fastqc -h"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir -p rnaseq_quality\n",
    "FastQC/fastqc -o rnaseq_quality -f fastq rnaseq_data/adrenal_R1.fq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "  Open the analysis file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "firefox rnaseq_quality/adrenal_R1_fastqc.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A good example:\n",
    "http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html\n",
    "\n",
    "A bad example:\n",
    "http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html\n",
    "\n",
    "## Read trimming\n",
    "We will use Trimmomatic for read trimming and adapter removal. Download Trimmomatic from: http://www.usadellab.org/cms/?page=trimmomatic\n",
    "Generally, Illumina adapters are of two types: Nextera for WGS and exome sequencing and Truseq from RNAseq. We need to keep that in mind and provide the for trimming.\n",
    "\n",
    "We used TruSeq3-PE adapters for clipping. Some other options are:\n",
    "1. Adapters will have max 2 mismatches and will be clipped if a score of 30 is reached. 2. Remove leading and trailing N bases if quality is below 2.\n",
    "3. Move with a 4 bp window and cutting where the average quality falls below 15.\n",
    "4. After trmming remove all sequences whose length is below 30."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip\n",
    "unzip Trimmomatic-0.39.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "java -jar Trimmomatic-0.39/trimmomatic-0.39.jar PE \\\n",
    "rnaseq_data/adrenal_R1.fq rnaseq_data/adrenal_R2.fq \\\n",
    "adrenal_R1_trim.fq adrenal_R1_unmapped.fq \\\n",
    "adrenal_R2_trim.fq adrenal_R2_unmapped.fq \\\n",
    "ILLUMINACLIP:Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 \\\n",
    "LEADING:3 TRAILING:3 MINLEN:30"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating QC stats of bam files\n",
    "Now let's try this out! We will generate QC stats for two lanes of Illumina paired-end sequencing data from yeast. The reads have already been aligned to the [Saccromyces cerevisiae reference genome](ftp://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/dna) to produce the BAM file lane1.sorted.bam."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will use __`samtools stats`__ to generate the stats for the primary alignments. The option __`-f`__ can be used to filter reads with specific tags, while __`-F`__ can be used to _filter out_ reads with specific tags. The following command will include only primary alignments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samtools stats -F SECONDARY NA20538_sorted.bam \\\n",
    "    > NA20538.sorted.bam.bchk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Have a look at the first 47 lines of the statistics file that was generated:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "head -n 47 NA20538.sorted.bam.bchk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This file contains a number of useful stats that we can use to get a better picture of our data, and it can even be plotted with __`plot-bamstats`__, as you will see soon. First let's have a closer look at some of the different stats. Each part of the file starts with a `#` followed by a description of the section and how to extract it from the file. Let's have a look at all the sections in the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "grep ^'#' NA20538.sorted.bam.bchk | grep 'Use'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary Numbers (SN)\n",
    "This initial section contains a summary of the alignment and includes some general statistics. In particular, you can see how many bases mapped, and how much of the genome that was covered.\n",
    "\n",
    "### Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now look at the output and try to answer the questions below.\n",
    "\n",
    "__Q2: What is the total number of reads?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q3: What proportion of the reads were mapped?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q4: How many pairs were mapped to a different chromosome?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q5: What is the insert size mean and standard deviation?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q6: How many reads were paired properly?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q7: How many reads have zero mapping quality?__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Q8: Which reads (forward or reverse) have higher base quality on average?__  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating QC plots\n",
    "Finally, we will create some QC plots from the output of the stats command using the command __plot-bamstats__ which is included in the samtools package: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot-bamstats -p NA20538-plots/ NA20538.sorted.bam.bchk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now in your web browser open the file NA20538-plots/index.html to view the QC information. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Congratulations__ you have reached the end of the Data formats and QC tutorial! "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}