{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "FASTQ\n", "=====\n", "\n", "This notebook explores [FASTQ], the most common format for storing sequencing reads.\n", "\n", "FASTA and FASTQ are rather similar, but FASTQ is almost always used for storing *sequencing reads* (with associated quality values), whereas FASTA is used for storing all kinds of DNA, RNA or protein sequences (without associated quality values).\n", "\n", "Before delving into the format, I should mention that there are great tools and libraries for parsing and manipulating FASTQ, e.g. [FASTX] and [BioPython]'s [SeqIO] module. If your needs are relatively simple, you might try using these tools and libraries and skip reading this document.\n", "\n", "[FASTA]: http://en.wikipedia.org/wiki/FASTA_format\n", "[FASTQ]: http://en.wikipedia.org/wiki/FASTQ_format\n", "[BioPython]: http://biopython.org/wiki/Main_Page\n", "[SeqIO]: http://biopython.org/wiki/SeqIO\n", "[FASTX]: http://hannonlab.cshl.edu/fastx_toolkit/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic format\n", "Here's a single sequencing read in FASTQ format:\n", "\n", "    @ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1\n", "    AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATGAGAGCTCAGAAGTAACAGTTGCTTTCAGTCCCATAAAAACAGTCCTACAA\n", "    +\n", "    BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGFEGFGFFDFFHBGDGFHGEFGHFGHGFGFFFEHGGFGGDGHGFEEHFFHGE\n", "\n", "It's spread across four lines. The four lines are:\n", "\n", "1. \"`@`\" followed by a read name\n", "2. Nucleotide sequence\n", "3. \"`+`\", possibly followed by some info, but ignored by virtually all tools\n", "4. 
Quality sequence (explained below)\n", "\n", "Here is a very simple Python function for parsing a file of FASTQ records:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1',\n", " 'AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATG',\n", " 'BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGF'),\n", " ('ERR294379.136275489 HS24_09441:8:2311:1917:99340#42/1',\n", " 'CTTAAGTATTTTGAAAGTTAACATAAGTTATTCTCAGAGAGACTGCTTTT',\n", " '@@AHFF?EEDEAF?FEEGEFD?GGFEFGECGE?9H?EEABFAG9@CDGGF'),\n", " ('ERR294379.97291341 HS24_09441:8:2201:10397:52549#42/1',\n", " 'GGCTGCCATCAGTGAGCAAGTAAGAATTTGCAGAAATTTATTAGCACACT',\n", " 'CDAF@#@=44465HHHHH'))]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from io import StringIO\n", "\n", "def parse_paired_fastq(fh1, fh2):\n", "    \"\"\" Parse paired-end reads from a pair of FASTQ filehandles.\n", "    For each pair, we return two (name, nucleotide string,\n", "    quality string) tuples: one for the first end and one\n", "    for the second end. \"\"\"\n", "    reads = []\n", "    while True:\n", "        first_line_1, first_line_2 = fh1.readline(), fh2.readline()\n", "        if len(first_line_1) == 0:\n", "            break # end of file\n", "        name_1, name_2 = first_line_1[1:].rstrip(), first_line_2[1:].rstrip()\n", "        seq_1, seq_2 = fh1.readline().rstrip(), fh2.readline().rstrip()\n", "        fh1.readline() # ignore line starting with +\n", "        fh2.readline() # ignore line starting with +\n", "        qual_1, qual_2 = fh1.readline().rstrip(), fh2.readline().rstrip()\n", "        reads.append(((name_1, seq_1, qual_1), (name_2, seq_2, qual_2)))\n", "    return reads\n", "\n", "fastq_string1 = '''@509.6.64.20524.149722/1\n", "AGCTCTGGTGACCCATGGGCAGCTGCTAGGGAGCCTTCTCTCCACCCTGA\n", "+\n", "HHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIHHIHFHHF\n", "@509.4.62.19231.2763/1\n", "GTTGATAAGCAAGCATCTCATTTTGTGCATATACCTGGTCTTTCGTATTC\n", "+\n", "HHHHHHHHHHHHHHEHHHHHHHHHHHHHHHHHHHHHHHDHHHHHHGHGHH'''\n", "\n", "fastq_string2 = '''@509.6.64.20524.149722/2\n", "TAAGTCAGGATACTTTCCCATATCCCAGCCCTGCTCCNTCTTTAAATAAT\n", "+\n", "HHHHHHHHHHHHHHHHHHHH@HHFHHHEFHHHHHHFF#FFFFFFFHHHHH\n", "@509.4.62.19231.2763/2\n", "CTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAANAAGAGCTTACTA\n", "+\n", "HHHHHHHHHHHHHHHHHHEHEHHHFHGHHHHHHHH>@#@=44465HHHHH'''\n", "\n", "parse_paired_fastq(StringIO(fastq_string1), StringIO(fastq_string2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other comments\n", "\n", "In all the examples above, the reads in each FASTQ file are the same length. This is usually true for datasets generated by sequencing-by-synthesis instruments, but it is not required: FASTQ files can contain reads of various lengths.\n", "\n", "FASTQ files often have the extension `.fastq` or `.fq`." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other resources\n", "\n", "* [Wikipedia page for FASTQ format](http://en.wikipedia.org/wiki/Fastq_format)\n", "* [BioPython], which has [its own ways of parsing FASTQ](http://biopython.org/wiki/SeqIO)\n", "* [FASTX] toolkit\n", "* [seqtk]\n", "* [FastQC]\n", "\n", "[BioPython]: http://biopython.org/wiki/Main_Page\n", "[SeqIO]: http://biopython.org/wiki/SeqIO\n", "[SAMtools]: http://samtools.sourceforge.net/\n", "[FASTX]: http://hannonlab.cshl.edu/fastx_toolkit/\n", "[FASTQC]: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/\n", "[seqtk]: https://github.com/lh3/seqtk\n", "\n", "© Copyright [Ben Langmead](http://www.cs.jhu.edu/~langmea) 2014--2019" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }