{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Source of the materials**: Biopython cookbook (adapted)\n", "Status: Draft" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cookbook – Cool things to do with it {#chapter:cookbook}\n", "====================================\n", "\n", "Biopython now has two collections of “cookbook” examples – this chapter\n", "(which has been included in this tutorial for many years and has\n", "gradually grown), and \n", "which is a user contributed collection on our wiki.\n", "\n", "We’re trying to encourage Biopython users to contribute their own\n", "examples to the wiki. In addition to helping the community, one direct\n", "benefit of sharing an example like this is that you could also get some\n", "feedback on the code from other Biopython users and developers - which\n", "could help you improve all your Python code.\n", "\n", "In the long term, we may end up moving all of the examples in this\n", "chapter to the wiki, or elsewhere within the tutorial.\n", "\n", "Working with sequence files {#seq:cookbook-sequences}\n", "---------------------------\n", "\n", "This section shows some more examples of sequence input/output, using\n", "the `Bio.SeqIO` module described in Chapter \\[chapter:Bio.SeqIO\\].\n", "\n", "### Filtering a sequence file\n", "\n", "Often you’ll have a large file with many sequences in it (e.g. FASTA\n", "file or genes, or a FASTQ or SFF file of reads), a separate shorter list\n", "of the IDs for a subset of sequences of interest, and want to make a new\n", "sequence file for this subset.\n", "\n", "Let’s say the list of IDs is in a simple text file, as the first word on\n", "each line. This could be a tabular file where the first column is the\n", "ID. Try something like this:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "input_file = \"big_file.sff\"\n", "id_file = \"short_list.txt\"\n", "output_file = \"short_list.sff\"\n", "wanted = set(line.rstrip(\"\\n\").split(None,1)[0] for line in open(id_file))\n", "print(\"Found %i unique identifiers in %s\" % (len(wanted), id_file))\n", "records = (r for r in SeqIO.parse(input_file, \"sff\") if r.id in wanted)\n", "count = SeqIO.write(records, output_file, \"sff\")\n", "print(\"Saved %i records from %s to %s\" % (count, input_file, output_file))\n", "if count < len(wanted):\n", " print(\"Warning %i IDs not found in %s\" % (len(wanted)-count, input_file))\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Note that we use a Python `set` rather than a `list`, this makes testing\n", "membership faster.\n", "\n", "### Producing randomised genomes\n", "\n", "Let’s suppose you are looking at genome sequence, hunting for some\n", "sequence feature – maybe extreme local GC% bias, or possible restriction\n", "digest sites. Once you’ve got your Python code working on the real\n", "genome it may be sensible to try running the same search on randomised\n", "versions of the same genome for statistical analysis (after all, any\n", "“features” you’ve found could just be there just by chance).\n", "\n", "For this discussion, we’ll use the GenBank file for the pPCP1 plasmid\n", "from *Yersinia pestis biovar Microtus*. The file is included with the\n", "Biopython unit tests under the GenBank folder, or you can get it from\n", "our website,\n", "[`NC_005816.gb`](http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb).\n", "This file contains one and only one record, so we can read it in as a\n", "`SeqRecord` using the `Bio.SeqIO.read()` function:\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from Bio import SeqIO\n", "original_rec = SeqIO.read(\"data/NC_005816.gb\", \"genbank\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "So, how can we generate a shuffled versions of the original sequence? I\n", "would use the built in Python `random` module for this, in particular\n", "the function `random.shuffle` – but this works on a Python list. Our\n", "sequence is a `Seq` object, so in order to shuffle it we need to turn it\n", "into a list:\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import random\n", "nuc_list = list(original_rec.seq)\n", "random.shuffle(nuc_list) #acts in situ!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now, in order to use `Bio.SeqIO` to output the shuffled sequence, we\n", "need to construct a new `SeqRecord` with a new `Seq` object using this\n", "shuffled list. In order to do this, we need to turn the list of\n", "nucleotides (single letter strings) into a long string – the standard\n", "Python way to do this is with the string object’s join method.\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from Bio.Seq import Seq\n", "from Bio.SeqRecord import SeqRecord\n", "shuffled_rec = SeqRecord(Seq(\"\".join(nuc_list), original_rec.seq.alphabet),\n", " id=\"Shuffled\", description=\"Based on %s\" % original_rec.id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let’s put all these pieces together to make a complete Python script\n", "which generates a single FASTA file containing 30 randomly shuffled\n", "versions of the original sequence.\n", "\n", "This first version just uses a big for loop and writes out the records\n", "one by one (using the `SeqRecord`’s format method described in\n", "Section \\[sec:Bio.SeqIO-and-StringIO\\]):\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "import random\n", "from Bio.Seq import Seq\n", "from Bio.SeqRecord import SeqRecord\n", "from Bio import SeqIO\n", "\n", "original_rec = SeqIO.read(\"NC_005816.gb\",\"genbank\")\n", "\n", "handle = open(\"shuffled.fasta\", \"w\")\n", "for i in range(30):\n", " nuc_list = list(original_rec.seq)\n", " random.shuffle(nuc_list)\n", " shuffled_rec = SeqRecord(Seq(\"\".join(nuc_list), original_rec.seq.alphabet), \\\n", " id=\"Shuffled%i\" % (i+1), \\\n", " description=\"Based on %s\" % original_rec.id)\n", " handle.write(shuffled_rec.format(\"fasta\"))\n", "handle.close()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Personally I prefer the following version using a function to shuffle\n", "the record and a generator expression instead of the for loop:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "import random\n", "from Bio.Seq import Seq\n", "from Bio.SeqRecord import SeqRecord\n", "from Bio import SeqIO\n", "\n", "def make_shuffle_record(record, new_id):\n", " nuc_list = list(record.seq)\n", " random.shuffle(nuc_list)\n", " return SeqRecord(Seq(\"\".join(nuc_list), record.seq.alphabet), \\\n", " id=new_id, description=\"Based on %s\" % original_rec.id)\n", "\n", "original_rec = SeqIO.read(\"NC_005816.gb\",\"genbank\")\n", "shuffled_recs = (make_shuffle_record(original_rec, \"Shuffled%i\" % (i+1)) \\\n", " for i in range(30))\n", "handle = open(\"shuffled.fasta\", \"w\")\n", "SeqIO.write(shuffled_recs, handle, \"fasta\")\n", "handle.close()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Translating a FASTA file of CDS entries {#sec:SeqIO-translate}\n", "\n", "Suppose you’ve got an input file of CDS entries for some organism, and\n", "you want to generate a new FASTA file containing their protein\n", "sequences. i.e. Take each nucleotide sequence from the original file,\n", "and translate it. Back in Section \\[sec:translation\\] we saw how to use\n", "the `Seq` object’s `translate method`, and the optional `cds` argument\n", "which enables correct translation of alternative start codons.\n", "\n", "We can combine this with `Bio.SeqIO` as shown in the reverse complement\n", "example in Section \\[sec:SeqIO-reverse-complement\\]. The key point is\n", "that for each nucleotide `SeqRecord`, we need to create a protein\n", "`SeqRecord` - and take care of naming it.\n", "\n", "You can write you own function to do this, choosing suitable protein\n", "identifiers for your sequences, and the appropriate genetic code. In\n", "this example we just use the default table and add a prefix to the\n", "identifier:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio.SeqRecord import SeqRecord\n", "def make_protein_record(nuc_record):\n", " \"\"\"Returns a new SeqRecord with the translated sequence (default table).\"\"\"\n", " return SeqRecord(seq = nuc_record.seq.translate(cds=True), \\\n", " id = \"trans_\" + nuc_record.id, \\\n", " description = \"translation of CDS, using default table\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We can then use this function to turn the input nucleotide records into\n", "protein records ready for output. An elegant way and memory efficient\n", "way to do this is with a generator expression:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "proteins = (make_protein_record(nuc_rec) for nuc_rec in \\\n", " SeqIO.parse(\"coding_sequences.fasta\", \"fasta\"))\n", "SeqIO.write(proteins, \"translations.fasta\", \"fasta\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This should work on any FASTA file of complete coding sequences. If you\n", "are working on partial coding sequences, you may prefer to use\n", "`nuc_record.seq.translate(to_stop=True)` in the example above, as this\n", "wouldn’t check for a valid start codon etc.\n", "\n", "### Making the sequences in a FASTA file upper case\n", "\n", "Often you’ll get data from collaborators as FASTA files, and sometimes\n", "the sequences can be in a mixture of upper and lower case. In some cases\n", "this is deliberate (e.g. lower case for poor quality regions), but\n", "usually it is not important. You may want to edit the file to make\n", "everything consistent (e.g. all upper case), and you can do this easily\n", "using the `upper()` method of the `SeqRecord` object (added in Biopython\n", "1.55):\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "records = (rec.upper() for rec in SeqIO.parse(\"mixed.fas\", \"fasta\"))\n", "count = SeqIO.write(records, \"upper.fas\", \"fasta\")\n", "print(\"Converted %i records to upper case\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "How does this work? The first line is just importing the `Bio.SeqIO`\n", "module. The second line is the interesting bit – this is a Python\n", "generator expression which gives an upper case version of each record\n", "parsed from the input file (`mixed.fas`). In the third line we give this\n", "generator expression to the `Bio.SeqIO.write()` function and it saves\n", "the new upper cases records to our output file (`upper.fas`).\n", "\n", "The reason we use a generator expression (rather than a list or list\n", "comprehension) is this means only one record is kept in memory at a\n", "time. This can be really important if you are dealing with large files\n", "with millions of entries.\n", "\n", "### Sorting a sequence file {#sec:SeqIO-sort}\n", "\n", "Suppose you wanted to sort a sequence file by length (e.g. a set of\n", "contigs from an assembly), and you are working with a file format like\n", "FASTA or FASTQ which `Bio.SeqIO` can read, write (and index).\n", "\n", "If the file is small enough, you can load it all into memory at once as\n", "a list of `SeqRecord` objects, sort the list, and save it:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "records = list(SeqIO.parse(\"ls_orchid.fasta\",\"fasta\"))\n", "records.sort(cmp=lambda x,y: cmp(len(x),len(y)))\n", "SeqIO.write(records, \"sorted_orchids.fasta\", \"fasta\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The only clever bit is specifying a comparison function for how to sort\n", "the records (here we sort them by length). If you wanted the longest\n", "records first, you could flip the comparison or use the reverse\n", "argument:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "records = list(SeqIO.parse(\"ls_orchid.fasta\",\"fasta\"))\n", "records.sort(cmp=lambda x,y: cmp(len(y),len(x)))\n", "SeqIO.write(records, \"sorted_orchids.fasta\", \"fasta\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now that’s pretty straight forward - but what happens if you have a very\n", "large file and you can’t load it all into memory like this? For example,\n", "you might have some next-generation sequencing reads to sort by length.\n", "This can be solved using the `Bio.SeqIO.index()` function.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "#Get the lengths and ids, and sort on length\n", "len_and_ids = sorted((len(rec), rec.id) for rec in \\\n", " SeqIO.parse(\"ls_orchid.fasta\",\"fasta\"))\n", "ids = reversed([id for (length, id) in len_and_ids])\n", "del len_and_ids #free this memory\n", "record_index = SeqIO.index(\"ls_orchid.fasta\", \"fasta\")\n", "records = (record_index[id] for id in ids)\n", "SeqIO.write(records, \"sorted.fasta\", \"fasta\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "First we scan through the file once using `Bio.SeqIO.parse()`, recording\n", "the record identifiers and their lengths in a list of tuples. We then\n", "sort this list to get them in length order, and discard the lengths.\n", "Using this sorted list of identifiers `Bio.SeqIO.index()` allows us to\n", "retrieve the records one by one, and we pass them to `Bio.SeqIO.write()`\n", "for output.\n", "\n", "These examples all use `Bio.SeqIO` to parse the records into `SeqRecord`\n", "objects which are output using `Bio.SeqIO.write()`. What if you want to\n", "sort a file format which `Bio.SeqIO.write()` doesn’t support, like the\n", "plain text SwissProt format? Here is an alternative solution using the\n", "`get_raw()` method added to `Bio.SeqIO.index()` in Biopython 1.54 (see\n", "Section \\[sec:seqio-index-getraw\\]).\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "#Get the lengths and ids, and sort on length\n", "len_and_ids = sorted((len(rec), rec.id) for rec in \\\n", " SeqIO.parse(\"ls_orchid.fasta\",\"fasta\"))\n", "ids = reversed([id for (length, id) in len_and_ids])\n", "del len_and_ids #free this memory\n", "record_index = SeqIO.index(\"ls_orchid.fasta\", \"fasta\")\n", "handle = open(\"sorted.fasta\", \"w\")\n", "for id in ids:\n", " handle.write(record_index.get_raw(id))\n", "handle.close()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As a bonus, because it doesn’t parse the data into `SeqRecord` objects a\n", "second time it should be faster.\n", "\n", "### Simple quality filtering for FASTQ files {#sec:FASTQ-filtering-example}\n", "\n", "The FASTQ file format was introduced at Sanger and is now widely used\n", "for holding nucleotide sequencing reads together with their quality\n", "scores. FASTQ files (and the related QUAL files) are an excellent\n", "example of per-letter-annotation, because for each nucleotide in the\n", "sequence there is an associated quality score. Any per-letter-annotation\n", "is held in a `SeqRecord` in the `letter_annotations` dictionary as a\n", "list, tuple or string (with the same number of elements as the sequence\n", "length).\n", "\n", "One common task is taking a large set of sequencing reads and filtering\n", "them (or cropping them) based on their quality scores. The following\n", "example is very simplistic, but should illustrate the basics of working\n", "with quality data in a `SeqRecord` object. All we are going to do here\n", "is read in a file of FASTQ data, and filter it to pick out only those\n", "records whose PHRED quality scores are all above some threshold (here\n", "20).\n", "\n", "For this example we’ll use some real data downloaded from the ENA\n", "sequence read archive,\n", "\n", "(2MB) which unzips to a 19MB file `SRR020192.fastq`. This is some Roche\n", "454 GS FLX single end data from virus infected California sea lions (see\n", " for details).\n", "\n", "First, let’s count the reads:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "count = 0\n", "for rec in SeqIO.parse(\"SRR020192.fastq\", \"fastq\"):\n", " count += 1\n", "print(\"%i reads\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now let’s do a simple filtering for a minimum PHRED quality of 20:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "good_reads = (rec for rec in \\\n", " SeqIO.parse(\"SRR020192.fastq\", \"fastq\") \\\n", " if min(rec.letter_annotations[\"phred_quality\"]) >= 20)\n", "count = SeqIO.write(good_reads, \"good_quality.fastq\", \"fastq\")\n", "print(\"Saved %i reads\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This pulled out only $14580$ reads out of the $41892$ present. A more\n", "sensible thing to do would be to quality trim the reads, but this is\n", "intended as an example only.\n", "\n", "FASTQ files can contain millions of entries, so it is best to avoid\n", "loading them all into memory at once. This example uses a generator\n", "expression, which means only one `SeqRecord` is created at a time -\n", "avoiding any memory limitations.\n", "\n", "### Trimming off primer sequences {#sec:FASTQ-slicing-off-primer}\n", "\n", "For this example we’re going to pretend that `GATGACGGTGT` is a 5’\n", "primer sequence we want to look for in some FASTQ formatted read data.\n", "As in the example above, we’ll use the `SRR020192.fastq` file downloaded\n", "from the ENA\n", "().\n", "The same approach would work with any other supported file format (e.g.\n", "FASTA files).\n", "\n", "This code uses `Bio.SeqIO` with a generator expression (to avoid loading\n", "all the sequences into memory at once), and the `Seq` object’s\n", "`startswith` method to see if the read starts with the primer sequence:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "primer_reads = (rec for rec in \\\n", " SeqIO.parse(\"SRR020192.fastq\", \"fastq\") \\\n", " if rec.seq.startswith(\"GATGACGGTGT\"))\n", "count = SeqIO.write(primer_reads, \"with_primer.fastq\", \"fastq\")\n", "print(\"Saved %i reads\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "That should find $13819$ reads from `SRR014849.fastq` and save them to a\n", "new FASTQ file, `with_primer.fastq`.\n", "\n", "Now suppose that instead you wanted to make a FASTQ file containing\n", "these reads but with the primer sequence removed? That’s just a small\n", "change as we can slice the `SeqRecord` (see\n", "Section \\[sec:SeqRecord-slicing\\]) to remove the first eleven letters\n", "(the length of our primer):\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "trimmed_primer_reads = (rec[11:] for rec in \\\n", " SeqIO.parse(\"SRR020192.fastq\", \"fastq\") \\\n", " if rec.seq.startswith(\"GATGACGGTGT\"))\n", "count = SeqIO.write(trimmed_primer_reads, \"with_primer_trimmed.fastq\", \"fastq\")\n", "print(\"Saved %i reads\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Again, that should pull out the $13819$ reads from `SRR020192.fastq`,\n", "but this time strip off the first ten characters, and save them to\n", "another new FASTQ file, `with_primer_trimmed.fastq`.\n", "\n", "Finally, suppose you want to create a new FASTQ file where these reads\n", "have their primer removed, but all the other reads are kept as they\n", "were? If we want to still use a generator expression, it is probably\n", "clearest to define our own trim function:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "def trim_primer(record, primer):\n", " if record.seq.startswith(primer):\n", " return record[len(primer):]\n", " else:\n", " return record\n", "\n", "trimmed_reads = (trim_primer(record, \"GATGACGGTGT\") for record in \\\n", " SeqIO.parse(\"SRR020192.fastq\", \"fastq\"))\n", "count = SeqIO.write(trimmed_reads, \"trimmed.fastq\", \"fastq\")\n", "print(\"Saved %i reads\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This takes longer, as this time the output file contains all $41892$\n", "reads. Again, we’re used a generator expression to avoid any memory\n", "problems. You could alternatively use a generator function rather than a\n", "generator expression.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "def trim_primers(records, primer):\n", " \"\"\"Removes perfect primer sequences at start of reads.\n", "\n", " This is a generator function, the records argument should\n", " be a list or iterator returning SeqRecord objects.\n", " \"\"\"\n", " len_primer = len(primer) #cache this for later\n", " for record in records:\n", " if record.seq.startswith(primer):\n", " yield record[len_primer:]\n", " else:\n", " yield record\n", "\n", "original_reads = SeqIO.parse(\"SRR020192.fastq\", \"fastq\")\n", "trimmed_reads = trim_primers(original_reads, \"GATGACGGTGT\")\n", "count = SeqIO.write(trimmed_reads, \"trimmed.fastq\", \"fastq\")\n", "print(\"Saved %i reads\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This form is more flexible if you want to do something more complicated\n", "where only some of the records are retained – as shown in the next\n", "example.\n", "\n", "### Trimming off adaptor sequences {#sec:FASTQ-slicing-off-adaptor}\n", "\n", "This is essentially a simple extension to the previous example. We are\n", "going to going to pretend `GATGACGGTGT` is an adaptor sequence in some\n", "FASTQ formatted read data, again the `SRR020192.fastq` file from the\n", "NCBI\n", "().\n", "\n", "This time however, we will look for the sequence *anywhere* in the\n", "reads, not just at the very beginning:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "\n", "def trim_adaptors(records, adaptor):\n", " \"\"\"Trims perfect adaptor sequences.\n", "\n", " This is a generator function, the records argument should\n", " be a list or iterator returning SeqRecord objects.\n", " \"\"\"\n", " len_adaptor = len(adaptor) #cache this for later\n", " for record in records:\n", " index = record.seq.find(adaptor)\n", " if index == -1:\n", " #adaptor not found, so won't trim\n", " yield record\n", " else:\n", " #trim off the adaptor\n", " yield record[index+len_adaptor:]\n", "\n", "original_reads = SeqIO.parse(\"SRR020192.fastq\", \"fastq\")\n", "trimmed_reads = trim_adaptors(original_reads, \"GATGACGGTGT\")\n", "count = SeqIO.write(trimmed_reads, \"trimmed.fastq\", \"fastq\")\n", "print(\"Saved %i reads\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Because we are using a FASTQ input file in this example, the `SeqRecord`\n", "objects have per-letter-annotation for the quality scores. By slicing\n", "the `SeqRecord` object the appropriate scores are used on the trimmed\n", "records, so we can output them as a FASTQ file too.\n", "\n", "Compared to the output of the previous example where we only looked for\n", "a primer/adaptor at the start of each read, you may find some of the\n", "trimmed reads are quite short after trimming (e.g. if the adaptor was\n", "found in the middle rather than near the start). So, let’s add a minimum\n", "length requirement as well:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "\n", "def trim_adaptors(records, adaptor, min_len):\n", " \"\"\"Trims perfect adaptor sequences, checks read length.\n", "\n", " This is a generator function, the records argument should\n", " be a list or iterator returning SeqRecord objects.\n", " \"\"\"\n", " len_adaptor = len(adaptor) #cache this for later\n", " for record in records:\n", " len_record = len(record) #cache this for later\n", " if len(record) < min_len:\n", " #Too short to keep\n", " continue\n", " index = record.seq.find(adaptor)\n", " if index == -1:\n", " #adaptor not found, so won't trim\n", " yield record\n", " elif len_record - index - len_adaptor >= min_len:\n", " #after trimming this will still be long enough\n", " yield record[index+len_adaptor:]\n", "\n", "original_reads = SeqIO.parse(\"SRR020192.fastq\", \"fastq\")\n", "trimmed_reads = trim_adaptors(original_reads, \"GATGACGGTGT\", 100)\n", "count = SeqIO.write(trimmed_reads, \"trimmed.fastq\", \"fastq\")\n", "print(\"Saved %i reads\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "By changing the format names, you could apply this to FASTA files\n", "instead. This code also could be extended to do a fuzzy match instead of\n", "an exact match (maybe using a pairwise alignment, or taking into account\n", "the read quality scores), but that will be much slower.\n", "\n", "### Converting FASTQ files {#sec:SeqIO-fastq-conversion}\n", "\n", "Back in Section \\[sec:SeqIO-conversion\\] we showed how to use\n", "`Bio.SeqIO` to convert between two file formats. Here we’ll go into a\n", "little more detail regarding FASTQ files which are used in second\n", "generation DNA sequencing. Please refer to Cock *et al.* (2009)\n", "@cock2010 for a longer description. FASTQ files store both the DNA\n", "sequence (as a string) and the associated read qualities.\n", "\n", "PHRED scores (used in most FASTQ files, and also in QUAL files, ACE\n", "files and SFF files) have become a *de facto* standard for representing\n", "the probability of a sequencing error (here denoted by $P_e$) at a given\n", "base using a simple base ten log transformation:\n", "\n", "$$Q_{\\textrm{PHRED}} = - 10 \\times \\textrm{log}_{10} ( P_e )$$\n", "\n", "This means a wrong read ($P_e = 1$) gets a PHRED quality of $0$, while a\n", "very good read like $P_e = 0.00001$ gets a PHRED quality of $50$. While\n", "for raw sequencing data qualities higher than this are rare, with post\n", "processing such as read mapping or assembly, qualities of up to about\n", "$90$ are possible (indeed, the MAQ tool allows for PHRED scores in the\n", "range 0 to 93 inclusive).\n", "\n", "The FASTQ format has the potential to become a *de facto* standard for\n", "storing the letters and quality scores for a sequencing read in a single\n", "plain text file. The only fly in the ointment is that there are at least\n", "three versions of the FASTQ format which are incompatible and difficult\n", "to distinguish...\n", "\n", "1. The original Sanger FASTQ format uses PHRED qualities encoded with\n", " an ASCII offset of 33. The NCBI are using this format in their Short\n", " Read Archive. We call this the `fastq` (or `fastq-sanger`) format in\n", " `Bio.SeqIO`.\n", "\n", "2. Solexa (later bought by Illumina) introduced their own version using\n", " Solexa qualities encoded with an ASCII offset of 64. We call this\n", " the `fastq-solexa` format.\n", "\n", "3. Illumina pipeline 1.3 onwards produces FASTQ files with PHRED\n", " qualities (which is more consistent), but encoded with an ASCII\n", " offset of 64. We call this the `fastq-illumina` format.\n", "\n", "The Solexa quality scores are defined using a different log\n", "transformation:\n", "\n", "$$Q_{\\textrm{Solexa}} = - 10 \\times \\textrm{log}_{10} \\left( \\frac{P_e}{1-P_e} \\right)$$\n", "\n", "Given Solexa/Illumina have now moved to using PHRED scores in version\n", "1.3 of their pipeline, the Solexa quality scores will gradually fall out\n", "of use. If you equate the error estimates ($P_e$) these two equations\n", "allow conversion between the two scoring systems - and Biopython\n", "includes functions to do this in the `Bio.SeqIO.QualityIO` module, which\n", "are called if you use `Bio.SeqIO` to convert an old Solexa/Illumina file\n", "into a standard Sanger FASTQ file:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "SeqIO.convert(\"solexa.fastq\", \"fastq-solexa\", \"standard.fastq\", \"fastq\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "If you want to convert a new Illumina 1.3+ FASTQ file, all that gets\n", "changed is the ASCII offset because although encoded differently the\n", "scores are all PHRED qualities:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "SeqIO.convert(\"illumina.fastq\", \"fastq-illumina\", \"standard.fastq\", \"fastq\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Note that using `Bio.SeqIO.convert()` like this is *much* faster than\n", "combining `Bio.SeqIO.parse()` and `Bio.SeqIO.write()` because optimised\n", "code is used for converting between FASTQ variants (and also for FASTQ\n", "to FASTA conversion).\n", "\n", "For good quality reads, PHRED and Solexa scores are approximately equal,\n", "which means since both the `fasta-solexa` and `fastq-illumina` formats\n", "use an ASCII offset of 64 the files are almost the same. This was a\n", "deliberate design choice by Illumina, meaning applications expecting the\n", "old `fasta-solexa` style files will probably be OK using the newer\n", "`fastq-illumina` files (on good data). Of course, both variants are very\n", "different from the original FASTQ standard as used by Sanger, the NCBI,\n", "and elsewhere (format name `fastq` or `fastq-sanger`).\n", "\n", "For more details, see the built in help (also\n", "[online](http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html)):\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on module Bio.SeqIO.QualityIO in Bio.SeqIO:\n", "\n", "NAME\n", " Bio.SeqIO.QualityIO - Bio.SeqIO support for the FASTQ and QUAL file formats.\n", "\n", "DESCRIPTION\n", " Note that you are expected to use this code via the Bio.SeqIO interface, as\n", " shown below.\n", " \n", " The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute\n", " to bundle a FASTA sequence and its PHRED quality data (integers between 0 and\n", " 90). Rather than using a single FASTQ file, often paired FASTA and QUAL files\n", " are used containing the sequence and the quality information separately.\n", " \n", " The PHRED software reads DNA sequencing trace files, calls bases, and\n", " assigns a non-negative quality value to each called base using a logged\n", " transformation of the error probability, Q = -10 log10( Pe ), for example::\n", " \n", " Pe = 1.0, Q = 0\n", " Pe = 0.1, Q = 10\n", " Pe = 0.01, Q = 20\n", " ...\n", " Pe = 0.00000001, Q = 80\n", " Pe = 0.000000001, Q = 90\n", " \n", " In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40.\n", " In the QUAL format these quality values are held as space separated text in\n", " a FASTA like file format. In the FASTQ format, each quality values is encoded\n", " with a single ASCI character using chr(Q+33), meaning zero maps to the\n", " character \"!\" and for example 80 maps to \"q\". For the Sanger FASTQ standard\n", " the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and\n", " quality are then stored in pairs in a FASTA like format.\n", " \n", " Unfortunately there is no official document describing the FASTQ file format,\n", " and worse, several related but different variants exist. For more details,\n", " please read this open access publication::\n", " \n", " The Sanger FASTQ file format for sequences with quality scores, and the\n", " Solexa/Illumina FASTQ variants.\n", " P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby),\n", " M.L.Heuer (BioJava) and P.M. Rice (EMBOSS).\n", " Nucleic Acids Research 2010 38(6):1767-1771\n", " http://dx.doi.org/10.1093/nar/gkp1137\n", " \n", " The good news is that Roche 454 sequencers can output files in the QUAL format,\n", " and sensibly they use PHREP style scores like Sanger. Converting a pair of\n", " FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL\n", " files from a Roche 454 SFF binary file, use the Roche off instrument command\n", " line tool \"sffinfo\" with the -q or -qual argument. You can extract a matching\n", " FASTA file using the -s or -seq argument instead.\n", " \n", " The bad news is that Solexa/Illumina did things differently - they have their\n", " own scoring system AND their own incompatible versions of the FASTQ format.\n", " Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can\n", " be negative. PHRED scores and Solexa scores are NOT interchangeable (but a\n", " reasonable mapping can be achieved between them, and they are approximately\n", " equal for higher quality reads).\n", " \n", " Confusingly early Solexa pipelines produced a FASTQ like file but using their\n", " own score mapping and an ASCII offset of 64. To make things worse, for the\n", " Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the\n", " FASTQ file format, this time using PHRED scores (which is more consistent) but\n", " with an ASCII offset of 64.\n", " \n", " i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ\n", " file format: The original Sanger PHRED standard, and two from Solexa/Illumina.\n", " \n", " The good news is that as of CASAVA version 1.8, Illumina sequencers will\n", " produce FASTQ files using the standard Sanger encoding.\n", " \n", " You are expected to use this module via the Bio.SeqIO functions, with the\n", " following format names:\n", " \n", " - \"qual\" means simple quality files using PHRED scores (e.g. from Roche 454)\n", " - \"fastq\" means Sanger style FASTQ files using PHRED scores and an ASCII\n", " offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+).\n", " These can potentially hold PHRED scores from 0 to 93.\n", " - \"fastq-sanger\" is an alias for \"fastq\".\n", " - \"fastq-solexa\" means old Solexa (and also very early Illumina) style FASTQ\n", " files, using Solexa scores with an ASCII offset 64. These can hold Solexa\n", " scores from -5 to 62.\n", " - \"fastq-illumina\" means newer Illumina 1.3 to 1.7 style FASTQ files, using\n", " PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0\n", " to 62.\n", " \n", " We could potentially add support for \"qual-solexa\" meaning QUAL files which\n", " contain Solexa scores, but thus far there isn't any reason to use such files.\n", " \n", " For example, consider the following short FASTQ file::\n", " \n", " @EAS54_6_R1_2_1_413_324\n", " CCCTTCTTGTCTTCAGCGTTTCTCC\n", " +\n", " ;;3;;;;;;;;;;;;7;;;;;;;88\n", " @EAS54_6_R1_2_1_540_792\n", " TTGGCAGGCCAAGGCCGATGGATCA\n", " +\n", " ;;;;;;;;;;;7;;;;;-;;;3;83\n", " @EAS54_6_R1_2_1_443_348\n", " GTTGCTTCTGGCGTGGGTGGGGGGG\n", " +\n", " ;;;;;;;;;;;9;7;;.7;393333\n", " \n", " This contains three reads of length 25. From the read length these were\n", " probably originally from an early Solexa/Illumina sequencer but this file\n", " follows the Sanger FASTQ convention (PHRED style qualities with an ASCII\n", " offet of 33). This means we can parse this file using Bio.SeqIO using\n", " \"fastq\" as the format name:\n", " \n", " >>> from Bio import SeqIO\n", " >>> for record in SeqIO.parse(\"Quality/example.fastq\", \"fastq\"):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC\n", " EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA\n", " EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG\n", " \n", " The qualities are held as a list of integers in each record's annotation:\n", " \n", " >>> print(record)\n", " ID: EAS54_6_R1_2_1_443_348\n", " Name: EAS54_6_R1_2_1_443_348\n", " Description: EAS54_6_R1_2_1_443_348\n", " Number of features: 0\n", " Per letter annotation for: phred_quality\n", " Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())\n", " >>> print(record.letter_annotations[\"phred_quality\"])\n", " [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]\n", " \n", " You can use the SeqRecord format method to show this in the QUAL format:\n", " \n", " >>> print(record.format(\"qual\"))\n", " >EAS54_6_R1_2_1_443_348\n", " 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18\n", " 24 18 18 18 18\n", " \n", " \n", " Or go back to the FASTQ format, use \"fastq\" (or \"fastq-sanger\"):\n", " \n", " >>> print(record.format(\"fastq\"))\n", " @EAS54_6_R1_2_1_443_348\n", " GTTGCTTCTGGCGTGGGTGGGGGGG\n", " +\n", " ;;;;;;;;;;;9;7;;.7;393333\n", " \n", " \n", " Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset\n", " of 64):\n", " \n", " >>> print(record.format(\"fastq-illumina\"))\n", " @EAS54_6_R1_2_1_443_348\n", " GTTGCTTCTGGCGTGGGTGGGGGGG\n", " +\n", " ZZZZZZZZZZZXZVZZMVZRXRRRR\n", " \n", " \n", " You can also get Biopython to convert the scores and show a Solexa style\n", " FASTQ file:\n", " \n", " >>> print(record.format(\"fastq-solexa\"))\n", " @EAS54_6_R1_2_1_443_348\n", " GTTGCTTCTGGCGTGGGTGGGGGGG\n", " +\n", " ZZZZZZZZZZZXZVZZMVZRXRRRR\n", " \n", " \n", " Notice that this is actually the same output as above using \"fastq-illumina\"\n", " as the format! The reason for this is all these scores are high enough that\n", " the PHRED and Solexa scores are almost equal. The differences become apparent\n", " for poor quality reads. See the functions solexa_quality_from_phred and\n", " phred_quality_from_solexa for more details.\n", " \n", " If you wanted to trim your sequences (perhaps to remove low quality regions,\n", " or to remove a primer sequence), try slicing the SeqRecord objects. e.g.\n", " \n", " >>> sub_rec = record[5:15]\n", " >>> print(sub_rec)\n", " ID: EAS54_6_R1_2_1_443_348\n", " Name: EAS54_6_R1_2_1_443_348\n", " Description: EAS54_6_R1_2_1_443_348\n", " Number of features: 0\n", " Per letter annotation for: phred_quality\n", " Seq('TTCTGGCGTG', SingleLetterAlphabet())\n", " >>> print(sub_rec.letter_annotations[\"phred_quality\"])\n", " [26, 26, 26, 26, 26, 26, 24, 26, 22, 26]\n", " >>> print(sub_rec.format(\"fastq\"))\n", " @EAS54_6_R1_2_1_443_348\n", " TTCTGGCGTG\n", " +\n", " ;;;;;;9;7;\n", " \n", " \n", " If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:\n", " \n", " >>> from Bio import SeqIO\n", " >>> record_iterator = SeqIO.parse(\"Quality/example.fastq\", \"fastq\")\n", " >>> with open(\"Quality/temp.qual\", \"w\") as out_handle:\n", " ... SeqIO.write(record_iterator, out_handle, \"qual\")\n", " 3\n", " \n", " You can of course read in a QUAL file, such as the one we just created:\n", " \n", " >>> from Bio import SeqIO\n", " >>> for record in SeqIO.parse(\"Quality/temp.qual\", \"qual\"):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " EAS54_6_R1_2_1_413_324 ?????????????????????????\n", " EAS54_6_R1_2_1_540_792 ?????????????????????????\n", " EAS54_6_R1_2_1_443_348 ?????????????????????????\n", " \n", " Notice that QUAL files don't have a proper sequence present! But the quality\n", " information is there:\n", " \n", " >>> print(record)\n", " ID: EAS54_6_R1_2_1_443_348\n", " Name: EAS54_6_R1_2_1_443_348\n", " Description: EAS54_6_R1_2_1_443_348\n", " Number of features: 0\n", " Per letter annotation for: phred_quality\n", " UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')\n", " >>> print(record.letter_annotations[\"phred_quality\"])\n", " [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]\n", " \n", " Just to keep things tidy, if you are following this example yourself, you can\n", " delete this temporary file now:\n", " \n", " >>> import os\n", " >>> os.remove(\"Quality/temp.qual\")\n", " \n", " Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL\n", " files. Because the Bio.SeqIO system is designed for reading single files, you\n", " would have to read the two in separately and then combine the data. However,\n", " since this is such a common thing to want to do, there is a helper iterator\n", " defined in this module that does this for you - PairedFastaQualIterator.\n", " \n", " Alternatively, if you have enough RAM to hold all the records in memory at once,\n", " then a simple dictionary approach would work:\n", " \n", " >>> from Bio import SeqIO\n", " >>> reads = SeqIO.to_dict(SeqIO.parse(\"Quality/example.fasta\", \"fasta\"))\n", " >>> for rec in SeqIO.parse(\"Quality/example.qual\", \"qual\"):\n", " ... reads[rec.id].letter_annotations[\"phred_quality\"]=rec.letter_annotations[\"phred_quality\"]\n", " \n", " You can then access any record by its key, and get both the sequence and the\n", " quality scores.\n", " \n", " >>> print(reads[\"EAS54_6_R1_2_1_540_792\"].format(\"fastq\"))\n", " @EAS54_6_R1_2_1_540_792\n", " TTGGCAGGCCAAGGCCGATGGATCA\n", " +\n", " ;;;;;;;;;;;7;;;;;-;;;3;83\n", " \n", " \n", " It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are\n", " using (\"fastq\" or \"fastq-sanger\" for the Sanger standard using PHRED values,\n", " \"fastq-solexa\" for the original Solexa/Illumina variant, or \"fastq-illumina\"\n", " for the more recent variant), as this cannot be detected reliably\n", " automatically.\n", " \n", " To illustrate this problem, let's consider an artifical example:\n", " \n", " >>> from Bio.Seq import Seq\n", " >>> from Bio.Alphabet import generic_dna\n", " >>> from Bio.SeqRecord import SeqRecord\n", " >>> test = SeqRecord(Seq(\"NACGTACGTA\", generic_dna), id=\"Test\",\n", " ... description=\"Made up!\")\n", " >>> print(test.format(\"fasta\"))\n", " >Test Made up!\n", " NACGTACGTA\n", " \n", " >>> print(test.format(\"fastq\"))\n", " Traceback (most recent call last):\n", " ...\n", " ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test).\n", " \n", " We created a sample SeqRecord, and can show it in FASTA format - but for QUAL\n", " or FASTQ format we need to provide some quality scores. These are held as a\n", " list of integers (one for each base) in the letter_annotations dictionary:\n", " \n", " >>> test.letter_annotations[\"phred_quality\"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]\n", " >>> print(test.format(\"qual\"))\n", " >Test Made up!\n", " 0 1 2 3 4 5 10 20 30 40\n", " \n", " >>> print(test.format(\"fastq\"))\n", " @Test Made up!\n", " NACGTACGTA\n", " +\n", " !\"#$%&+5?I\n", " \n", " \n", " We can check this FASTQ encoding - the first PHRED quality was zero, and this\n", " mapped to a exclamation mark, while the final score was 40 and this mapped to\n", " the letter \"I\":\n", " \n", " >>> ord('!') - 33\n", " 0\n", " >>> ord('I') - 33\n", " 40\n", " >>> [ord(letter)-33 for letter in '!\"#$%&+5?I']\n", " [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]\n", " \n", " Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED\n", " scores with an offset of 64:\n", " \n", " >>> print(test.format(\"fastq-illumina\"))\n", " @Test Made up!\n", " NACGTACGTA\n", " +\n", " @ABCDEJT^h\n", " \n", " \n", " And we can check this too - the first PHRED score was zero, and this mapped to\n", " \"@\", while the final score was 40 and this mapped to \"h\":\n", " \n", " >>> ord(\"@\") - 64\n", " 0\n", " >>> ord(\"h\") - 64\n", " 40\n", " >>> [ord(letter)-64 for letter in \"@ABCDEJT^h\"]\n", " [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]\n", " \n", " Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style\n", " FASTQ files look for the same data! Then we have the older Solexa/Illumina\n", " format to consider which encodes Solexa scores instead of PHRED scores.\n", " \n", " First let's see what Biopython says if we convert the PHRED scores into Solexa\n", " scores (rounding to one decimal place):\n", " \n", " >>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]:\n", " ... print(\"PHRED %i maps to Solexa %0.1f\" % (q, solexa_quality_from_phred(q)))\n", " PHRED 0 maps to Solexa -5.0\n", " PHRED 1 maps to Solexa -5.0\n", " PHRED 2 maps to Solexa -2.3\n", " PHRED 3 maps to Solexa -0.0\n", " PHRED 4 maps to Solexa 1.8\n", " PHRED 5 maps to Solexa 3.3\n", " PHRED 10 maps to Solexa 9.5\n", " PHRED 20 maps to Solexa 20.0\n", " PHRED 30 maps to Solexa 30.0\n", " PHRED 40 maps to Solexa 40.0\n", " \n", " Now here is the record using the old Solexa style FASTQ file:\n", " \n", " >>> print(test.format(\"fastq-solexa\"))\n", " @Test Made up!\n", " NACGTACGTA\n", " +\n", " ;;>@BCJT^h\n", " \n", " \n", " Again, this is using an ASCII offset of 64, so we can check the Solexa scores:\n", " \n", " >>> [ord(letter)-64 for letter in \";;>@BCJT^h\"]\n", " [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40]\n", " \n", " This explains why the last few letters of this FASTQ output matched that using\n", " the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores\n", " are approximately equal.\n", "\n", "CLASSES\n", " Bio.SeqIO.Interfaces.SequentialSequenceWriter(Bio.SeqIO.Interfaces.SequenceWriter)\n", " FastqIlluminaWriter\n", " FastqPhredWriter\n", " FastqSolexaWriter\n", " QualPhredWriter\n", " \n", " class FastqIlluminaWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)\n", " | Write Illumina 1.3+ FASTQ format files (with PHRED quality scores).\n", " | \n", " | This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,\n", " | using PHRED scores and an ASCII offset of 64. Note these files are NOT\n", " | compatible with the standard Sanger style PHRED FASTQ files which use an\n", " | ASCII offset of 32.\n", " | \n", " | Although you can use this class directly, you are strongly encouraged to\n", " | use the Bio.SeqIO.write() function with format name \"fastq-illumina\"\n", " | instead. This code is also called if you use the .format(\"fastq-illumina\")\n", " | method of a SeqRecord. For example,\n", " | \n", " | >>> from Bio import SeqIO\n", " | >>> record = SeqIO.read(\"Quality/sanger_faked.fastq\", \"fastq-sanger\")\n", " | >>> print(record.format(\"fastq-illumina\"))\n", " | @Test PHRED qualities from 40 to 0 inclusive\n", " | ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN\n", " | +\n", " | hgfedcba`_^]\\[ZYXWVUTSRQPONMLKJIHGFEDCBA@\n", " | \n", " | \n", " | Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is\n", " | encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a\n", " | warning is issued.\n", " | \n", " | Method resolution order:\n", " | FastqIlluminaWriter\n", " | Bio.SeqIO.Interfaces.SequentialSequenceWriter\n", " | Bio.SeqIO.Interfaces.SequenceWriter\n", " | builtins.object\n", " | \n", " | Methods defined here:\n", " | \n", " | write_record(self, record)\n", " | Write a single FASTQ record to the file.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:\n", " | \n", " | __init__(self, handle)\n", " | Creates the writer object.\n", " | \n", " | Use the method write_file() to actually record your sequence records.\n", " | \n", " | write_file(self, records)\n", " | Use this to write an entire file containing the given records.\n", " | \n", " | records - A list or iterator returning SeqRecord objects\n", " | \n", " | This method can only be called once. Returns the number of records\n", " | written.\n", " | \n", " | write_footer(self)\n", " | \n", " | write_header(self)\n", " | \n", " | write_records(self, records)\n", " | Write multiple record to the output file.\n", " | \n", " | records - A list or iterator returning SeqRecord objects\n", " | \n", " | Once you have called write_header() you can call write_record()\n", " | and/or write_records() as many times as needed. Then call\n", " | write_footer() and close().\n", " | \n", " | Returns the number of records written.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | clean(self, text)\n", " | Use this to avoid getting newlines in the output.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", " \n", " class FastqPhredWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)\n", " | Class to write standard FASTQ format files (using PHRED quality scores).\n", " | \n", " | Although you can use this class directly, you are strongly encouraged\n", " | to use the Bio.SeqIO.write() function instead via the format name \"fastq\"\n", " | or the alias \"fastq-sanger\". For example, this code reads in a standard\n", " | Sanger style FASTQ file (using PHRED scores) and re-saves it as another\n", " | Sanger style FASTQ file:\n", " | \n", " | >>> from Bio import SeqIO\n", " | >>> record_iterator = SeqIO.parse(\"Quality/example.fastq\", \"fastq\")\n", " | >>> with open(\"Quality/temp.fastq\", \"w\") as out_handle:\n", " | ... SeqIO.write(record_iterator, out_handle, \"fastq\")\n", " | 3\n", " | \n", " | You might want to do this if the original file included extra line breaks,\n", " | which while valid may not be supported by all tools. The output file from\n", " | Biopython will have each sequence on a single line, and each quality\n", " | string on a single line (which is considered desirable for maximum\n", " | compatibility).\n", " | \n", " | In this next example, an old style Solexa/Illumina FASTQ file (using Solexa\n", " | quality scores) is converted into a standard Sanger style FASTQ file using\n", " | PHRED qualities:\n", " | \n", " | >>> from Bio import SeqIO\n", " | >>> record_iterator = SeqIO.parse(\"Quality/solexa_example.fastq\", \"fastq-solexa\")\n", " | >>> with open(\"Quality/temp.fastq\", \"w\") as out_handle:\n", " | ... SeqIO.write(record_iterator, out_handle, \"fastq\")\n", " | 5\n", " | \n", " | This code is also called if you use the .format(\"fastq\") method of a\n", " | SeqRecord, or .format(\"fastq-sanger\") if you prefer that alias.\n", " | \n", " | Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is\n", " | encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a\n", " | warning is issued.\n", " | \n", " | P.S. To avoid cluttering up your working directory, you can delete this\n", " | temporary file now:\n", " | \n", " | >>> import os\n", " | >>> os.remove(\"Quality/temp.fastq\")\n", " | \n", " | Method resolution order:\n", " | FastqPhredWriter\n", " | Bio.SeqIO.Interfaces.SequentialSequenceWriter\n", " | Bio.SeqIO.Interfaces.SequenceWriter\n", " | builtins.object\n", " | \n", " | Methods defined here:\n", " | \n", " | write_record(self, record)\n", " | Write a single FASTQ record to the file.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:\n", " | \n", " | __init__(self, handle)\n", " | Creates the writer object.\n", " | \n", " | Use the method write_file() to actually record your sequence records.\n", " | \n", " | write_file(self, records)\n", " | Use this to write an entire file containing the given records.\n", " | \n", " | records - A list or iterator returning SeqRecord objects\n", " | \n", " | This method can only be called once. Returns the number of records\n", " | written.\n", " | \n", " | write_footer(self)\n", " | \n", " | write_header(self)\n", " | \n", " | write_records(self, records)\n", " | Write multiple record to the output file.\n", " | \n", " | records - A list or iterator returning SeqRecord objects\n", " | \n", " | Once you have called write_header() you can call write_record()\n", " | and/or write_records() as many times as needed. Then call\n", " | write_footer() and close().\n", " | \n", " | Returns the number of records written.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | clean(self, text)\n", " | Use this to avoid getting newlines in the output.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", " \n", " class FastqSolexaWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)\n", " | Write old style Solexa/Illumina FASTQ format files (with Solexa qualities).\n", " | \n", " | This outputs FASTQ files like those from the early Solexa/Illumina\n", " | pipeline, using Solexa scores and an ASCII offset of 64. These are\n", " | NOT compatible with the standard Sanger style PHRED FASTQ files.\n", " | \n", " | If your records contain a \"solexa_quality\" entry under letter_annotations,\n", " | this is used, otherwise any \"phred_quality\" entry will be used after\n", " | conversion using the solexa_quality_from_phred function. If neither style\n", " | of quality scores are present, an exception is raised.\n", " | \n", " | Although you can use this class directly, you are strongly encouraged\n", " | to use the Bio.SeqIO.write() function instead. For example, this code\n", " | reads in a FASTQ file and re-saves it as another FASTQ file:\n", " | \n", " | >>> from Bio import SeqIO\n", " | >>> record_iterator = SeqIO.parse(\"Quality/solexa_example.fastq\", \"fastq-solexa\")\n", " | >>> with open(\"Quality/temp.fastq\", \"w\") as out_handle:\n", " | ... SeqIO.write(record_iterator, out_handle, \"fastq-solexa\")\n", " | 5\n", " | \n", " | You might want to do this if the original file included extra line breaks,\n", " | which (while valid) may not be supported by all tools. The output file\n", " | from Biopython will have each sequence on a single line, and each quality\n", " | string on a single line (which is considered desirable for maximum\n", " | compatibility).\n", " | \n", " | This code is also called if you use the .format(\"fastq-solexa\") method of\n", " | a SeqRecord. For example,\n", " | \n", " | >>> record = SeqIO.read(\"Quality/sanger_faked.fastq\", \"fastq-sanger\")\n", " | >>> print(record.format(\"fastq-solexa\"))\n", " | @Test PHRED qualities from 40 to 0 inclusive\n", " | ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN\n", " | +\n", " | hgfedcba`_^]\\[ZYXWVUTSRQPONMLKJHGFECB@>;;\n", " | \n", " | \n", " | Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is\n", " | encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit,\n", " | a warning is issued.\n", " | \n", " | P.S. Don't forget to delete the temp file if you don't need it anymore:\n", " | \n", " | >>> import os\n", " | >>> os.remove(\"Quality/temp.fastq\")\n", " | \n", " | Method resolution order:\n", " | FastqSolexaWriter\n", " | Bio.SeqIO.Interfaces.SequentialSequenceWriter\n", " | Bio.SeqIO.Interfaces.SequenceWriter\n", " | builtins.object\n", " | \n", " | Methods defined here:\n", " | \n", " | write_record(self, record)\n", " | Write a single FASTQ record to the file.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:\n", " | \n", " | __init__(self, handle)\n", " | Creates the writer object.\n", " | \n", " | Use the method write_file() to actually record your sequence records.\n", " | \n", " | write_file(self, records)\n", " | Use this to write an entire file containing the given records.\n", " | \n", " | records - A list or iterator returning SeqRecord objects\n", " | \n", " | This method can only be called once. Returns the number of records\n", " | written.\n", " | \n", " | write_footer(self)\n", " | \n", " | write_header(self)\n", " | \n", " | write_records(self, records)\n", " | Write multiple record to the output file.\n", " | \n", " | records - A list or iterator returning SeqRecord objects\n", " | \n", " | Once you have called write_header() you can call write_record()\n", " | and/or write_records() as many times as needed. Then call\n", " | write_footer() and close().\n", " | \n", " | Returns the number of records written.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | clean(self, text)\n", " | Use this to avoid getting newlines in the output.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", " \n", " class QualPhredWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)\n", " | Class to write QUAL format files (using PHRED quality scores).\n", " | \n", " | Although you can use this class directly, you are strongly encouraged\n", " | to use the Bio.SeqIO.write() function instead. For example, this code\n", " | reads in a FASTQ file and saves the quality scores into a QUAL file:\n", " | \n", " | >>> from Bio import SeqIO\n", " | >>> record_iterator = SeqIO.parse(\"Quality/example.fastq\", \"fastq\")\n", " | >>> with open(\"Quality/temp.qual\", \"w\") as out_handle:\n", " | ... SeqIO.write(record_iterator, out_handle, \"qual\")\n", " | 3\n", " | \n", " | This code is also called if you use the .format(\"qual\") method of a\n", " | SeqRecord.\n", " | \n", " | P.S. Don't forget to clean up the temp file if you don't need it anymore:\n", " | \n", " | >>> import os\n", " | >>> os.remove(\"Quality/temp.qual\")\n", " | \n", " | Method resolution order:\n", " | QualPhredWriter\n", " | Bio.SeqIO.Interfaces.SequentialSequenceWriter\n", " | Bio.SeqIO.Interfaces.SequenceWriter\n", " | builtins.object\n", " | \n", " | Methods defined here:\n", " | \n", " | __init__(self, handle, wrap=60, record2title=None)\n", " | Create a QUAL writer.\n", " | \n", " | Arguments:\n", " | - handle - Handle to an output file, e.g. as returned\n", " | by open(filename, \"w\")\n", " | - wrap - Optional line length used to wrap sequence lines.\n", " | Defaults to wrapping the sequence at 60 characters\n", " | Use zero (or None) for no wrapping, giving a single\n", " | long line for the sequence.\n", " | - record2title - Optional function to return the text to be\n", " | used for the title line of each record. By default\n", " | a combination of the record.id and record.description\n", " | is used. If the record.description starts with the\n", " | record.id, then just the record.description is used.\n", " | \n", " | The record2title argument is present for consistency with the\n", " | Bio.SeqIO.FastaIO writer class.\n", " | \n", " | write_record(self, record)\n", " | Write a single QUAL record to the file.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:\n", " | \n", " | write_file(self, records)\n", " | Use this to write an entire file containing the given records.\n", " | \n", " | records - A list or iterator returning SeqRecord objects\n", " | \n", " | This method can only be called once. Returns the number of records\n", " | written.\n", " | \n", " | write_footer(self)\n", " | \n", " | write_header(self)\n", " | \n", " | write_records(self, records)\n", " | Write multiple record to the output file.\n", " | \n", " | records - A list or iterator returning SeqRecord objects\n", " | \n", " | Once you have called write_header() you can call write_record()\n", " | and/or write_records() as many times as needed. Then call\n", " | write_footer() and close().\n", " | \n", " | Returns the number of records written.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | clean(self, text)\n", " | Use this to avoid getting newlines in the output.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", "\n", "FUNCTIONS\n", " FastqGeneralIterator(handle)\n", " Iterate over Fastq records as string tuples (not as SeqRecord objects).\n", " \n", " This code does not try to interpret the quality string numerically. It\n", " just returns tuples of the title, sequence and quality as strings. For\n", " the sequence and quality, any whitespace (such as new lines) is removed.\n", " \n", " Our SeqRecord based FASTQ iterators call this function internally, and then\n", " turn the strings into a SeqRecord objects, mapping the quality string into\n", " a list of numerical scores. If you want to do a custom quality mapping,\n", " then you might consider calling this function directly.\n", " \n", " For parsing FASTQ files, the title string from the \"@\" line at the start\n", " of each record can optionally be omitted on the \"+\" lines. If it is\n", " repeated, it must be identical.\n", " \n", " The sequence string and the quality string can optionally be split over\n", " multiple lines, although several sources discourage this. In comparison,\n", " for the FASTA file format line breaks between 60 and 80 characters are\n", " the norm.\n", " \n", " **WARNING** - Because the \"@\" character can appear in the quality string,\n", " this can cause problems as this is also the marker for the start of\n", " a new sequence. In fact, the \"+\" sign can also appear as well. Some\n", " sources recommended having no line breaks in the quality to avoid this,\n", " but even that is not enough, consider this example::\n", " \n", " @071113_EAS56_0053:1:1:998:236\n", " TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA\n", " +071113_EAS56_0053:1:1:998:236\n", " IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III\n", " @071113_EAS56_0053:1:1:182:712\n", " ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG\n", " +\n", " @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+\n", " @071113_EAS56_0053:1:1:153:10\n", " TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT\n", " +\n", " IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6\n", " @071113_EAS56_0053:1:3:990:501\n", " TGGGAGGTTTTATGTGGA\n", " AAGCAGCAATGTACAAGA\n", " +\n", " IIIIIII.IIIIII1@44\n", " @-7.%<&+/$/%4(++(%\n", " \n", " This is four PHRED encoded FASTQ entries originally from an NCBI source\n", " (given the read length of 36, these are probably Solexa Illumina reads where\n", " the quality has been mapped onto the PHRED values).\n", " \n", " This example has been edited to illustrate some of the nasty things allowed\n", " in the FASTQ format. Firstly, on the \"+\" lines most but not all of the\n", " (redundant) identifiers are omitted. In real files it is likely that all or\n", " none of these extra identifiers will be present.\n", " \n", " Secondly, while the first three sequences have been shown without line\n", " breaks, the last has been split over multiple lines. In real files any line\n", " breaks are likely to be consistent.\n", " \n", " Thirdly, some of the quality string lines start with an \"@\" character. For\n", " the second record this is unavoidable. However for the fourth sequence this\n", " only happens because its quality string is split over two lines. A naive\n", " parser could wrongly treat any line starting with an \"@\" as the beginning of\n", " a new sequence! This code copes with this possible ambiguity by keeping\n", " track of the length of the sequence which gives the expected length of the\n", " quality string.\n", " \n", " Using this tricky example file as input, this short bit of code demonstrates\n", " what this parsing function would return:\n", " \n", " >>> with open(\"Quality/tricky.fastq\", \"rU\") as handle:\n", " ... for (title, sequence, quality) in FastqGeneralIterator(handle):\n", " ... print(title)\n", " ... print(\"%s %s\" % (sequence, quality))\n", " ...\n", " 071113_EAS56_0053:1:1:998:236\n", " TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III\n", " 071113_EAS56_0053:1:1:182:712\n", " ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+\n", " 071113_EAS56_0053:1:1:153:10\n", " TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6\n", " 071113_EAS56_0053:1:3:990:501\n", " TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%\n", " \n", " Finally we note that some sources state that the quality string should\n", " start with \"!\" (which using the PHRED mapping means the first letter always\n", " has a quality score of zero). This rather restrictive rule is not widely\n", " observed, so is therefore ignored here. One plus point about this \"!\" rule\n", " is that (provided there are no line breaks in the quality sequence) it\n", " would prevent the above problem with the \"@\" character.\n", " \n", " FastqIlluminaIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)\n", " Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping).\n", " \n", " The optional arguments are the same as those for the FastqPhredIterator.\n", " \n", " For each sequence in Illumina 1.3+ FASTQ files there is a matching string\n", " encoding PHRED integer qualities using ASCII values with an offset of 64.\n", " \n", " >>> from Bio import SeqIO\n", " >>> record = SeqIO.read(\"Quality/illumina_faked.fastq\", \"fastq-illumina\")\n", " >>> print(\"%s %s\" % (record.id, record.seq))\n", " Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN\n", " >>> max(record.letter_annotations[\"phred_quality\"])\n", " 40\n", " >>> min(record.letter_annotations[\"phred_quality\"])\n", " 0\n", " \n", " NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores\n", " with an ASCII offset of 64. They are approximately equal but only for high\n", " quality reads. If you have an old Solexa/Illumina file with negative\n", " Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:\n", " \n", " >>> record2 = SeqIO.read(\"Quality/solexa_faked.fastq\", \"fastq-illumina\")\n", " Traceback (most recent call last):\n", " ...\n", " ValueError: Invalid character in quality string\n", " \n", " NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.\n", " \n", " FastqPhredIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)\n", " Generator function to iterate over FASTQ records (as SeqRecord objects).\n", " \n", " - handle - input file\n", " - alphabet - optional alphabet\n", " - title2ids - A function that, when given the title line from the FASTQ\n", " file (without the beginning >), will return the id, name and\n", " description (in that order) for the record as a tuple of\n", " strings. If this is not given, then the entire title line\n", " will be used as the description, and the first word as the\n", " id and name.\n", " \n", " Note that use of title2ids matches that of Bio.SeqIO.FastaIO.\n", " \n", " For each sequence in a (Sanger style) FASTQ file there is a matching string\n", " encoding the PHRED qualities (integers between 0 and about 90) using ASCII\n", " values with an offset of 33.\n", " \n", " For example, consider a file containing three short reads::\n", " \n", " @EAS54_6_R1_2_1_413_324\n", " CCCTTCTTGTCTTCAGCGTTTCTCC\n", " +\n", " ;;3;;;;;;;;;;;;7;;;;;;;88\n", " @EAS54_6_R1_2_1_540_792\n", " TTGGCAGGCCAAGGCCGATGGATCA\n", " +\n", " ;;;;;;;;;;;7;;;;;-;;;3;83\n", " @EAS54_6_R1_2_1_443_348\n", " GTTGCTTCTGGCGTGGGTGGGGGGG\n", " +\n", " ;;;;;;;;;;;9;7;;.7;393333\n", " \n", " For each sequence (e.g. \"CCCTTCTTGTCTTCAGCGTTTCTCC\") there is a matching\n", " string encoding the PHRED qualities using a ASCII values with an offset of\n", " 33 (e.g. \";;3;;;;;;;;;;;;7;;;;;;;88\").\n", " \n", " Using this module directly you might run:\n", " \n", " >>> with open(\"Quality/example.fastq\", \"rU\") as handle:\n", " ... for record in FastqPhredIterator(handle):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC\n", " EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA\n", " EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG\n", " \n", " Typically however, you would call this via Bio.SeqIO instead with \"fastq\"\n", " (or \"fastq-sanger\") as the format:\n", " \n", " >>> from Bio import SeqIO\n", " >>> with open(\"Quality/example.fastq\", \"rU\") as handle:\n", " ... for record in SeqIO.parse(handle, \"fastq\"):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC\n", " EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA\n", " EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG\n", " \n", " If you want to look at the qualities, they are record in each record's\n", " per-letter-annotation dictionary as a simple list of integers:\n", " \n", " >>> print(record.letter_annotations[\"phred_quality\"])\n", " [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]\n", " \n", " FastqSolexaIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)\n", " Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping).\n", " \n", " The optional arguments are the same as those for the FastqPhredIterator.\n", " \n", " For each sequence in Solexa/Illumina FASTQ files there is a matching string\n", " encoding the Solexa integer qualities using ASCII values with an offset\n", " of 64. Solexa scores are scaled differently to PHRED scores, and Biopython\n", " will NOT perform any automatic conversion when loading.\n", " \n", " NOTE - This file format is used by the OLD versions of the Solexa/Illumina\n", " pipeline. See also the FastqIlluminaIterator function for the NEW version.\n", " \n", " For example, consider a file containing these five records::\n", " \n", " @SLXA-B3_649_FC8437_R1_1_1_610_79\n", " GATGTGCAATACCTTTGTAGAGGAA\n", " +SLXA-B3_649_FC8437_R1_1_1_610_79\n", " YYYYYYYYYYYYYYYYYYWYWYYSU\n", " @SLXA-B3_649_FC8437_R1_1_1_397_389\n", " GGTTTGAGAAAGAGAAATGAGATAA\n", " +SLXA-B3_649_FC8437_R1_1_1_397_389\n", " YYYYYYYYYWYYYYWWYYYWYWYWW\n", " @SLXA-B3_649_FC8437_R1_1_1_850_123\n", " GAGGGTGTTGATCATGATGATGGCG\n", " +SLXA-B3_649_FC8437_R1_1_1_850_123\n", " YYYYYYYYYYYYYWYYWYYSYYYSY\n", " @SLXA-B3_649_FC8437_R1_1_1_362_549\n", " GGAAACAAAGTTTTTCTCAACATAG\n", " +SLXA-B3_649_FC8437_R1_1_1_362_549\n", " YYYYYYYYYYYYYYYYYYWWWWYWY\n", " @SLXA-B3_649_FC8437_R1_1_1_183_714\n", " GTATTATTTAATGGCATACACTCAA\n", " +SLXA-B3_649_FC8437_R1_1_1_183_714\n", " YYYYYYYYYYWYYYYWYWWUWWWQQ\n", " \n", " Using this module directly you might run:\n", " \n", " >>> with open(\"Quality/solexa_example.fastq\", \"rU\") as handle:\n", " ... for record in FastqSolexaIterator(handle):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA\n", " SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA\n", " SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG\n", " SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG\n", " SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA\n", " \n", " Typically however, you would call this via Bio.SeqIO instead with\n", " \"fastq-solexa\" as the format:\n", " \n", " >>> from Bio import SeqIO\n", " >>> with open(\"Quality/solexa_example.fastq\", \"rU\") as handle:\n", " ... for record in SeqIO.parse(handle, \"fastq-solexa\"):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA\n", " SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA\n", " SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG\n", " SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG\n", " SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA\n", " \n", " If you want to look at the qualities, they are recorded in each record's\n", " per-letter-annotation dictionary as a simple list of integers:\n", " \n", " >>> print(record.letter_annotations[\"solexa_quality\"])\n", " [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]\n", " \n", " These scores aren't very good, but they are high enough that they map\n", " almost exactly onto PHRED scores:\n", " \n", " >>> print(\"%0.2f\" % phred_quality_from_solexa(25))\n", " 25.01\n", " \n", " Let's look at faked example read which is even worse, where there are\n", " more noticeable differences between the Solexa and PHRED scores::\n", " \n", " @slxa_0001_1_0001_01\n", " ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n", " +slxa_0001_1_0001_01\n", " hgfedcba`_^]\\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;\n", " \n", " Again, you would typically use Bio.SeqIO to read this file in (rather than\n", " calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will\n", " contain thousands of reads, so you would normally use Bio.SeqIO.parse()\n", " as shown above. This example has only as one entry, so instead we can\n", " use the Bio.SeqIO.read() function:\n", " \n", " >>> from Bio import SeqIO\n", " >>> with open(\"Quality/solexa_faked.fastq\", \"rU\") as handle:\n", " ... record = SeqIO.read(handle, \"fastq-solexa\")\n", " >>> print(\"%s %s\" % (record.id, record.seq))\n", " slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n", " >>> print(record.letter_annotations[\"solexa_quality\"])\n", " [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]\n", " \n", " These quality scores are so low that when converted from the Solexa scheme\n", " into PHRED scores they look quite different:\n", " \n", " >>> print(\"%0.2f\" % phred_quality_from_solexa(-1))\n", " 2.54\n", " >>> print(\"%0.2f\" % phred_quality_from_solexa(-5))\n", " 1.19\n", " \n", " Note you can use the Bio.SeqIO.write() function or the SeqRecord's format\n", " method to output the record(s):\n", " \n", " >>> print(record.format(\"fastq-solexa\"))\n", " @slxa_0001_1_0001_01\n", " ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n", " +\n", " hgfedcba`_^]\\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;\n", " \n", " \n", " Note this output is slightly different from the input file as Biopython\n", " has left out the optional repetition of the sequence identifier on the \"+\"\n", " line. If you want the to use PHRED scores, use \"fastq\" or \"qual\" as the\n", " output format instead, and Biopython will do the conversion for you:\n", " \n", " >>> print(record.format(\"fastq\"))\n", " @slxa_0001_1_0001_01\n", " ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n", " +\n", " IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##\"\"\n", " \n", " \n", " >>> print(record.format(\"qual\"))\n", " >slxa_0001_1_0001_01\n", " 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21\n", " 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2\n", " 1 1\n", " \n", " \n", " As shown above, the poor quality Solexa reads have been mapped to the\n", " equivalent PHRED score (e.g. -5 to 1 as shown earlier).\n", " \n", " PairedFastaQualIterator(fasta_handle, qual_handle, alphabet=SingleLetterAlphabet(), title2ids=None)\n", " Iterate over matched FASTA and QUAL files as SeqRecord objects.\n", " \n", " For example, consider this short QUAL file with PHRED quality scores::\n", " \n", " >EAS54_6_R1_2_1_413_324\n", " 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26\n", " 26 26 26 23 23\n", " >EAS54_6_R1_2_1_540_792\n", " 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26\n", " 26 18 26 23 18\n", " >EAS54_6_R1_2_1_443_348\n", " 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18\n", " 24 18 18 18 18\n", " \n", " And a matching FASTA file::\n", " \n", " >EAS54_6_R1_2_1_413_324\n", " CCCTTCTTGTCTTCAGCGTTTCTCC\n", " >EAS54_6_R1_2_1_540_792\n", " TTGGCAGGCCAAGGCCGATGGATCA\n", " >EAS54_6_R1_2_1_443_348\n", " GTTGCTTCTGGCGTGGGTGGGGGGG\n", " \n", " You can parse these separately using Bio.SeqIO with the \"qual\" and\n", " \"fasta\" formats, but then you'll get a group of SeqRecord objects with\n", " no sequence, and a matching group with the sequence but not the\n", " qualities. Because it only deals with one input file handle, Bio.SeqIO\n", " can't be used to read the two files together - but this function can!\n", " For example,\n", " \n", " >>> with open(\"Quality/example.fasta\", \"rU\") as f:\n", " ... with open(\"Quality/example.qual\", \"rU\") as q:\n", " ... for record in PairedFastaQualIterator(f, q):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " ...\n", " EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC\n", " EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA\n", " EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG\n", " \n", " As with the FASTQ or QUAL parsers, if you want to look at the qualities,\n", " they are in each record's per-letter-annotation dictionary as a simple\n", " list of integers:\n", " \n", " >>> print(record.letter_annotations[\"phred_quality\"])\n", " [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]\n", " \n", " If you have access to data as a FASTQ format file, using that directly\n", " would be simpler and more straight forward. Note that you can easily use\n", " this function to convert paired FASTA and QUAL files into FASTQ files:\n", " \n", " >>> from Bio import SeqIO\n", " >>> with open(\"Quality/example.fasta\", \"rU\") as f:\n", " ... with open(\"Quality/example.qual\", \"rU\") as q:\n", " ... SeqIO.write(PairedFastaQualIterator(f, q), \"Quality/temp.fastq\", \"fastq\")\n", " ...\n", " 3\n", " \n", " And don't forget to clean up the temp file if you don't need it anymore:\n", " \n", " >>> import os\n", " >>> os.remove(\"Quality/temp.fastq\")\n", " \n", " QualPhredIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)\n", " For QUAL files which include PHRED quality scores, but no sequence.\n", " \n", " For example, consider this short QUAL file::\n", " \n", " >EAS54_6_R1_2_1_413_324\n", " 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26\n", " 26 26 26 23 23\n", " >EAS54_6_R1_2_1_540_792\n", " 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26\n", " 26 18 26 23 18\n", " >EAS54_6_R1_2_1_443_348\n", " 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18\n", " 24 18 18 18 18\n", " \n", " Using this module directly you might run:\n", " \n", " >>> with open(\"Quality/example.qual\", \"rU\") as handle:\n", " ... for record in QualPhredIterator(handle):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " EAS54_6_R1_2_1_413_324 ?????????????????????????\n", " EAS54_6_R1_2_1_540_792 ?????????????????????????\n", " EAS54_6_R1_2_1_443_348 ?????????????????????????\n", " \n", " Typically however, you would call this via Bio.SeqIO instead with \"qual\"\n", " as the format:\n", " \n", " >>> from Bio import SeqIO\n", " >>> with open(\"Quality/example.qual\", \"rU\") as handle:\n", " ... for record in SeqIO.parse(handle, \"qual\"):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " EAS54_6_R1_2_1_413_324 ?????????????????????????\n", " EAS54_6_R1_2_1_540_792 ?????????????????????????\n", " EAS54_6_R1_2_1_443_348 ?????????????????????????\n", " \n", " Becase QUAL files don't contain the sequence string itself, the seq\n", " property is set to an UnknownSeq object. As no alphabet was given, this\n", " has defaulted to a generic single letter alphabet and the character \"?\"\n", " used.\n", " \n", " By specifying a nucleotide alphabet, \"N\" is used instead:\n", " \n", " >>> from Bio import SeqIO\n", " >>> from Bio.Alphabet import generic_dna\n", " >>> with open(\"Quality/example.qual\", \"rU\") as handle:\n", " ... for record in SeqIO.parse(handle, \"qual\", alphabet=generic_dna):\n", " ... print(\"%s %s\" % (record.id, record.seq))\n", " EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN\n", " EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN\n", " EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN\n", " \n", " However, the quality scores themselves are available as a list of integers\n", " in each record's per-letter-annotation:\n", " \n", " >>> print(record.letter_annotations[\"phred_quality\"])\n", " [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]\n", " \n", " You can still slice one of these SeqRecord objects with an UnknownSeq:\n", " \n", " >>> sub_record = record[5:10]\n", " >>> print(\"%s %s\" % (sub_record.id, sub_record.letter_annotations[\"phred_quality\"]))\n", " EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]\n", " \n", " As of Biopython 1.59, this parser will accept files with negatives quality\n", " scores but will replace them with the lowest possible PHRED score of zero.\n", " This will trigger a warning, previously it raised a ValueError exception.\n", " \n", " log(...)\n", " log(x[, base])\n", " \n", " Return the logarithm of x to the given base.\n", " If the base not specified, returns the natural logarithm (base e) of x.\n", " \n", " phred_quality_from_solexa(solexa_quality)\n", " Convert a Solexa quality (which can be negative) to a PHRED quality.\n", " \n", " PHRED and Solexa quality scores are both log transformations of a\n", " probality of error (high score = low probability of error). This function\n", " takes a Solexa score, transforms it back to a probability of error, and\n", " then re-expresses it as a PHRED score. This assumes the error estimates\n", " are equivalent.\n", " \n", " The underlying formulas are given in the documentation for the sister\n", " function solexa_quality_from_phred, in this case the operation is::\n", " \n", " phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)\n", " \n", " This will return a floating point number, it is up to you to round this to\n", " the nearest integer if appropriate. e.g.\n", " \n", " >>> print(\"%0.2f\" % round(phred_quality_from_solexa(80), 2))\n", " 80.00\n", " >>> print(\"%0.2f\" % round(phred_quality_from_solexa(20), 2))\n", " 20.04\n", " >>> print(\"%0.2f\" % round(phred_quality_from_solexa(10), 2))\n", " 10.41\n", " >>> print(\"%0.2f\" % round(phred_quality_from_solexa(0), 2))\n", " 3.01\n", " >>> print(\"%0.2f\" % round(phred_quality_from_solexa(-5), 2))\n", " 1.19\n", " \n", " Note that a solexa_quality less then -5 is not expected, will trigger a\n", " warning, but will still be converted as per the logarithmic mapping\n", " (giving a number between 0 and 1.19 back).\n", " \n", " As a special case where None is used for a \"missing value\", None is\n", " returned:\n", " \n", " >>> print(phred_quality_from_solexa(None))\n", " None\n", " \n", " solexa_quality_from_phred(phred_quality)\n", " Covert a PHRED quality (range 0 to about 90) to a Solexa quality.\n", " \n", " PHRED and Solexa quality scores are both log transformations of a\n", " probality of error (high score = low probability of error). This function\n", " takes a PHRED score, transforms it back to a probability of error, and\n", " then re-expresses it as a Solexa score. This assumes the error estimates\n", " are equivalent.\n", " \n", " How does this work exactly? Well the PHRED quality is minus ten times the\n", " base ten logarithm of the probability of error::\n", " \n", " phred_quality = -10*log(error,10)\n", " \n", " Therefore, turning this round::\n", " \n", " error = 10 ** (- phred_quality / 10)\n", " \n", " Now, Solexa qualities use a different log transformation::\n", " \n", " solexa_quality = -10*log(error/(1-error),10)\n", " \n", " After substitution and a little manipulation we get::\n", " \n", " solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)\n", " \n", " However, real Solexa files use a minimum quality of -5. This does have a\n", " good reason - a random base call would be correct 25% of the time,\n", " and thus have a probability of error of 0.75, which gives 1.25 as the PHRED\n", " quality, or -4.77 as the Solexa quality. Thus (after rounding), a random\n", " nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.\n", " \n", " Taken literally, this logarithic formula would map a PHRED quality of zero\n", " to a Solexa quality of minus infinity. Of course, taken literally, a PHRED\n", " score of zero means a probability of error of one (i.e. the base call is\n", " definitely wrong), which is worse than random! In practice, a PHRED quality\n", " of zero usually means a default value, or perhaps random - and therefore\n", " mapping it to the minimum Solexa score of -5 is reasonable.\n", " \n", " In conclusion, we follow EMBOSS, and take this logarithmic formula but also\n", " apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED\n", " quality of zero to -5.0 as well.\n", " \n", " Note this function will return a floating point number, it is up to you to\n", " round this to the nearest integer if appropriate. e.g.\n", " \n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(80), 2))\n", " 80.00\n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(50), 2))\n", " 50.00\n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(20), 2))\n", " 19.96\n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(10), 2))\n", " 9.54\n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(5), 2))\n", " 3.35\n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(4), 2))\n", " 1.80\n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(3), 2))\n", " -0.02\n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(2), 2))\n", " -2.33\n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(1), 2))\n", " -5.00\n", " >>> print(\"%0.2f\" % round(solexa_quality_from_phred(0), 2))\n", " -5.00\n", " \n", " Notice that for high quality reads PHRED and Solexa scores are numerically\n", " equal. The differences are important for poor quality reads, where PHRED\n", " has a minimum of zero but Solexa scores can be negative.\n", " \n", " Finally, as a special case where None is used for a \"missing value\", None\n", " is returned:\n", " \n", " >>> print(solexa_quality_from_phred(None))\n", " None\n", "\n", "DATA\n", " SANGER_SCORE_OFFSET = 33\n", " SOLEXA_SCORE_OFFSET = 64\n", " __docformat__ = 'restructuredtext en'\n", " print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...\n", " single_letter_alphabet = SingleLetterAlphabet()\n", "\n", "FILE\n", " /home/tiago_antao/miniconda/lib/python3.5/site-packages/Bio/SeqIO/QualityIO.py\n", "\n", "\n" ] } ], "source": [ "from Bio.SeqIO import QualityIO\n", "help(QualityIO)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Converting FASTA and QUAL files into FASTQ files {#sec:SeqIO-fasta-qual-conversion}\n", "\n", "FASTQ files hold *both* sequences and their quality strings. FASTA files\n", "hold *just* sequences, while QUAL files hold *just* the qualities.\n", "Therefore a single FASTQ file can be converted to or from *paired* FASTA\n", "and QUAL files.\n", "\n", "Going from FASTQ to FASTA is easy:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "SeqIO.convert(\"example.fastq\", \"fastq\", \"example.fasta\", \"fasta\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Going from FASTQ to QUAL is also easy:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "SeqIO.convert(\"example.fastq\", \"fastq\", \"example.qual\", \"qual\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "However, the reverse is a little more tricky. You can use\n", "`Bio.SeqIO.parse()` to iterate over the records in a *single* file, but\n", "in this case we have two input files. There are several strategies\n", "possible, but assuming that the two files are really paired the most\n", "memory efficient way is to loop over both together. The code is a little\n", "fiddly, so we provide a function called `PairedFastaQualIterator` in the\n", "`Bio.SeqIO.QualityIO` module to do this. This takes two handles (the\n", "FASTA file and the QUAL file) and returns a `SeqRecord` iterator:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio.SeqIO.QualityIO import PairedFastaQualIterator\n", "for record in PairedFastaQualIterator(open(\"example.fasta\"), open(\"example.qual\")):\n", " print(record)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This function will check that the FASTA and QUAL files are consistent\n", "(e.g. the records are in the same order, and have the same sequence\n", "length). You can combine this with the `Bio.SeqIO.write()` function to\n", "convert a pair of FASTA and QUAL files into a single FASTQ files:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "from Bio.SeqIO.QualityIO import PairedFastaQualIterator\n", "handle = open(\"temp.fastq\", \"w\") #w=write\n", "records = PairedFastaQualIterator(open(\"example.fasta\"), open(\"example.qual\"))\n", "count = SeqIO.write(records, handle, \"fastq\")\n", "handle.close()\n", "print(\"Converted %i records\" % count)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Indexing a FASTQ file {#sec:fastq-indexing}\n", "\n", "FASTQ files are often very large, with millions of reads in them. Due to\n", "the sheer amount of data, you can’t load all the records into memory at\n", "once. This is why the examples above (filtering and trimming) iterate\n", "over the file looking at just one `SeqRecord` at a time.\n", "\n", "However, sometimes you can’t use a big loop or an iterator - you may\n", "need random access to the reads. Here the `Bio.SeqIO.index()` function\n", "may prove very helpful, as it allows you to access any read in the FASTQ\n", "file by its name (see Section \\[sec:SeqIO-index\\]).\n", "\n", "Again we’ll use the `SRR020192.fastq` file from the ENA\n", "(),\n", "although this is actually quite a small FASTQ file with less than\n", "$50,000$ reads:\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz\n", "!gzip -d SRR020192.fastq.gz" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "41892" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from Bio import SeqIO\n", "fq_dict = SeqIO.index(\"SRR020192.fastq\", \"fastq\")\n", "len(fq_dict)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Seq('GTCCCAGTATTCGGATTTGTCTGCCAAAACAATGAAATTGACACAGTTTACAAC...CCG', SingleLetterAlphabet())" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fq_dict[\"SRR020192.23186\"].seq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "When testing this on a FASTQ file with seven million reads, indexing\n", "took about a minute, but record access was almost instant.\n", "\n", "The example in Section \\[sec:SeqIO-sort\\] show how you can use the\n", "`Bio.SeqIO.index()` function to sort a large FASTA file – this could\n", "also be used on FASTQ files.\n", "\n", "### Converting SFF files {#sec:SeqIO-sff-conversion}\n", "\n", "If you work with 454 (Roche) sequence data, you will probably have\n", "access to the raw data as a Standard Flowgram Format (SFF) file. This\n", "contains the sequence reads (called bases) with quality scores and the\n", "original flow information.\n", "\n", "A common task is to convert from SFF to a pair of FASTA and QUAL files,\n", "or to a single FASTQ file. These operations are trivial using the\n", "`Bio.SeqIO.convert()` function (see Section \\[sec:SeqIO-conversion\\]):\n", "\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from Bio import SeqIO\n", "SeqIO.convert(\"data/E3MFGYR02_random_10_reads.sff\", \"sff\", \"reads.fasta\", \"fasta\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SeqIO.convert(\"data/E3MFGYR02_random_10_reads.sff\", \"sff\", \"reads.qual\", \"qual\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SeqIO.convert(\"data/E3MFGYR02_random_10_reads.sff\", \"sff\", \"reads.fastq\", \"fastq\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Remember the convert function returns the number of records, in this\n", "example just ten. This will give you the *untrimmed* reads, where the\n", "leading and trailing poor quality sequence or adaptor will be in lower\n", "case. If you want the *trimmed* reads (using the clipping information\n", "recorded within the SFF file) use this:\n", "\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from Bio import SeqIO\n", "SeqIO.convert(\"data/E3MFGYR02_random_10_reads.sff\", \"sff-trim\", \"trimmed.fasta\", \"fasta\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SeqIO.convert(\"data/E3MFGYR02_random_10_reads.sff\", \"sff-trim\", \"trimmed.qual\", \"qual\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SeqIO.convert(\"data/E3MFGYR02_random_10_reads.sff\", \"sff-trim\", \"trimmed.fastq\", \"fastq\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "If you run Linux, you could ask Roche for a copy of their “off\n", "instrument” tools (often referred to as the Newbler tools). This offers\n", "an alternative way to do SFF to FASTA or QUAL conversion at the command\n", "line (but currently FASTQ output is not supported), e.g.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "$ sffinfo -seq -notrim E3MFGYR02_random_10_reads.sff > reads.fasta\n", "$ sffinfo -qual -notrim E3MFGYR02_random_10_reads.sff > reads.qual\n", "$ sffinfo -seq -trim E3MFGYR02_random_10_reads.sff > trimmed.fasta\n", "$ sffinfo -qual -trim E3MFGYR02_random_10_reads.sff > trimmed.qual\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The way Biopython uses mixed case sequence strings to represent the\n", "trimming points deliberately mimics what the Roche tools do.\n", "\n", "For more information on the Biopython SFF support, consult the built in\n", "help:\n", "\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on module Bio.SeqIO.SffIO in Bio.SeqIO:\n", "\n", "NAME\n", " Bio.SeqIO.SffIO - Bio.SeqIO support for the binary Standard Flowgram Format (SFF) file format.\n", "\n", "DESCRIPTION\n", " SFF was designed by 454 Life Sciences (Roche), the Whitehead Institute for\n", " Biomedical Research and the Wellcome Trust Sanger Institute. SFF was also used\n", " as the native output format from early versions of Ion Torrent's PGM platform\n", " as well. You are expected to use this module via the Bio.SeqIO functions under\n", " the format name \"sff\" (or \"sff-trim\" as described below).\n", " \n", " For example, to iterate over the records in an SFF file,\n", " \n", " >>> from Bio import SeqIO\n", " >>> for record in SeqIO.parse(\"Roche/E3MFGYR02_random_10_reads.sff\", \"sff\"):\n", " ... print(\"%s %i %s...\" % (record.id, len(record), record.seq[:20]))\n", " ...\n", " E3MFGYR02JWQ7T 265 tcagGGTCTACATGTTGGTT...\n", " E3MFGYR02JA6IL 271 tcagTTTTTTTTGGAAAGGA...\n", " E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC...\n", " E3MFGYR02GFKUC 299 tcagCGGCCGGGCCTCTCAT...\n", " E3MFGYR02FTGED 281 tcagTGGTAATGGGGGGAAA...\n", " E3MFGYR02FR9G7 261 tcagCTCCGTAAGAAGGTGC...\n", " E3MFGYR02GAZMS 278 tcagAAAGAAGTAAGGTAAA...\n", " E3MFGYR02HHZ8O 221 tcagACTTTCTTCTTTACCG...\n", " E3MFGYR02GPGB1 269 tcagAAGCAGTGGTATCAAC...\n", " E3MFGYR02F7Z7G 219 tcagAATCATCCACTTTTTA...\n", " \n", " Each SeqRecord object will contain all the annotation from the SFF file,\n", " including the PHRED quality scores.\n", " \n", " >>> print(\"%s %i\" % (record.id, len(record)))\n", " E3MFGYR02F7Z7G 219\n", " >>> print(\"%s...\" % record.seq[:10])\n", " tcagAATCAT...\n", " >>> print(\"%r...\" % (record.letter_annotations[\"phred_quality\"][:10]))\n", " [22, 21, 23, 28, 26, 15, 12, 21, 28, 21]...\n", " \n", " Notice that the sequence is given in mixed case, the central upper case region\n", " corresponds to the trimmed sequence. This matches the output of the Roche\n", " tools (and the 3rd party tool sff_extract) for SFF to FASTA.\n", " \n", " >>> print(record.annotations[\"clip_qual_left\"])\n", " 4\n", " >>> print(record.annotations[\"clip_qual_right\"])\n", " 134\n", " >>> print(record.seq[:4])\n", " tcag\n", " >>> print(\"%s...%s\" % (record.seq[4:20], record.seq[120:134]))\n", " AATCATCCACTTTTTA...CAAAACACAAACAG\n", " >>> print(record.seq[134:])\n", " atcttatcaacaaaactcaaagttcctaactgagacacgcaacaggggataagacaaggcacacaggggataggnnnnnnnnnnn\n", " \n", " The annotations dictionary also contains any adapter clip positions\n", " (usually zero), and information about the flows. e.g.\n", " \n", " >>> len(record.annotations)\n", " 11\n", " >>> print(record.annotations[\"flow_key\"])\n", " TCAG\n", " >>> print(record.annotations[\"flow_values\"][:10])\n", " (83, 1, 128, 7, 4, 84, 6, 106, 3, 172)\n", " >>> print(len(record.annotations[\"flow_values\"]))\n", " 400\n", " >>> print(record.annotations[\"flow_index\"][:10])\n", " (1, 2, 3, 2, 2, 0, 3, 2, 3, 3)\n", " >>> print(len(record.annotations[\"flow_index\"]))\n", " 219\n", " \n", " Note that to convert from a raw reading in flow_values to the corresponding\n", " homopolymer stretch estimate, the value should be rounded to the nearest 100:\n", " \n", " >>> print(\"%r...\" % [int(round(value, -2)) // 100\n", " ... for value in record.annotations[\"flow_values\"][:10]])\n", " ...\n", " [1, 0, 1, 0, 0, 1, 0, 1, 0, 2]...\n", " \n", " If a read name is exactly 14 alphanumeric characters, the annotations\n", " dictionary will also contain meta-data about the read extracted by\n", " interpretting the name as a 454 Sequencing System \"Universal\" Accession\n", " Number. Note that if a read name happens to be exactly 14 alphanumeric\n", " characters but was not generated automatically, these annotation records\n", " will contain nonsense information.\n", " \n", " >>> print(record.annotations[\"region\"])\n", " 2\n", " >>> print(record.annotations[\"time\"])\n", " [2008, 1, 9, 16, 16, 0]\n", " >>> print(record.annotations[\"coords\"])\n", " (2434, 1658)\n", " \n", " As a convenience method, you can read the file with SeqIO format name \"sff-trim\"\n", " instead of \"sff\" to get just the trimmed sequences (without any annotation\n", " except for the PHRED quality scores and anything encoded in the read names):\n", " \n", " >>> from Bio import SeqIO\n", " >>> for record in SeqIO.parse(\"Roche/E3MFGYR02_random_10_reads.sff\", \"sff-trim\"):\n", " ... print(\"%s %i %s...\" % (record.id, len(record), record.seq[:20]))\n", " ...\n", " E3MFGYR02JWQ7T 260 GGTCTACATGTTGGTTAACC...\n", " E3MFGYR02JA6IL 265 TTTTTTTTGGAAAGGAAAAC...\n", " E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG...\n", " E3MFGYR02GFKUC 295 CGGCCGGGCCTCTCATCGGT...\n", " E3MFGYR02FTGED 277 TGGTAATGGGGGGAAATTTA...\n", " E3MFGYR02FR9G7 256 CTCCGTAAGAAGGTGCTGCC...\n", " E3MFGYR02GAZMS 271 AAAGAAGTAAGGTAAATAAC...\n", " E3MFGYR02HHZ8O 150 ACTTTCTTCTTTACCGTAAC...\n", " E3MFGYR02GPGB1 221 AAGCAGTGGTATCAACGCAG...\n", " E3MFGYR02F7Z7G 130 AATCATCCACTTTTTAACGT...\n", " \n", " Looking at the final record in more detail, note how this differs to the\n", " example above:\n", " \n", " >>> print(\"%s %i\" % (record.id, len(record)))\n", " E3MFGYR02F7Z7G 130\n", " >>> print(\"%s...\" % record.seq[:10])\n", " AATCATCCAC...\n", " >>> print(\"%r...\" % record.letter_annotations[\"phred_quality\"][:10])\n", " [26, 15, 12, 21, 28, 21, 36, 28, 27, 27]...\n", " >>> len(record.annotations)\n", " 3\n", " >>> print(record.annotations[\"region\"])\n", " 2\n", " >>> print(record.annotations[\"coords\"])\n", " (2434, 1658)\n", " >>> print(record.annotations[\"time\"])\n", " [2008, 1, 9, 16, 16, 0]\n", " \n", " You might use the Bio.SeqIO.convert() function to convert the (trimmed) SFF\n", " reads into a FASTQ file (or a FASTA file and a QUAL file), e.g.\n", " \n", " >>> from Bio import SeqIO\n", " >>> try:\n", " ... from StringIO import StringIO # Python 2\n", " ... except ImportError:\n", " ... from io import StringIO # Python 3\n", " ...\n", " >>> out_handle = StringIO()\n", " >>> count = SeqIO.convert(\"Roche/E3MFGYR02_random_10_reads.sff\", \"sff\",\n", " ... out_handle, \"fastq\")\n", " ...\n", " >>> print(\"Converted %i records\" % count)\n", " Converted 10 records\n", " \n", " The output FASTQ file would start like this:\n", " \n", " >>> print(\"%s...\" % out_handle.getvalue()[:50])\n", " @E3MFGYR02JWQ7T\n", " tcagGGTCTACATGTTGGTTAACCCGTACTGATT...\n", " \n", " Bio.SeqIO.index() provides memory efficient random access to the reads in an\n", " SFF file by name. SFF files can include an index within the file, which can\n", " be read in making this very fast. If the index is missing (or in a format not\n", " yet supported in Biopython) the file is indexed by scanning all the reads -\n", " which is a little slower. For example,\n", " \n", " >>> from Bio import SeqIO\n", " >>> reads = SeqIO.index(\"Roche/E3MFGYR02_random_10_reads.sff\", \"sff\")\n", " >>> record = reads[\"E3MFGYR02JHD4H\"]\n", " >>> print(\"%s %i %s...\" % (record.id, len(record), record.seq[:20]))\n", " E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC...\n", " >>> reads.close()\n", " \n", " Or, using the trimmed reads:\n", " \n", " >>> from Bio import SeqIO\n", " >>> reads = SeqIO.index(\"Roche/E3MFGYR02_random_10_reads.sff\", \"sff-trim\")\n", " >>> record = reads[\"E3MFGYR02JHD4H\"]\n", " >>> print(\"%s %i %s...\" % (record.id, len(record), record.seq[:20]))\n", " E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG...\n", " >>> reads.close()\n", " \n", " You can also use the Bio.SeqIO.write() function with the \"sff\" format. Note\n", " that this requires all the flow information etc, and thus is probably only\n", " useful for SeqRecord objects originally from reading another SFF file (and\n", " not the trimmed SeqRecord objects from parsing an SFF file as \"sff-trim\").\n", " \n", " As an example, let's pretend this example SFF file represents some DNA which\n", " was pre-amplified with a PCR primers AAAGANNNNN. The following script would\n", " produce a sub-file containing all those reads whose post-quality clipping\n", " region (i.e. the sequence after trimming) starts with AAAGA exactly (the non-\n", " degenerate bit of this pretend primer):\n", " \n", " >>> from Bio import SeqIO\n", " >>> records = (record for record in\n", " ... SeqIO.parse(\"Roche/E3MFGYR02_random_10_reads.sff\", \"sff\")\n", " ... if record.seq[record.annotations[\"clip_qual_left\"]:].startswith(\"AAAGA\"))\n", " ...\n", " >>> count = SeqIO.write(records, \"temp_filtered.sff\", \"sff\")\n", " >>> print(\"Selected %i records\" % count)\n", " Selected 2 records\n", " \n", " Of course, for an assembly you would probably want to remove these primers.\n", " If you want FASTA or FASTQ output, you could just slice the SeqRecord. However,\n", " if you want SFF output we have to preserve all the flow information - the trick\n", " is just to adjust the left clip position!\n", " \n", " >>> from Bio import SeqIO\n", " >>> def filter_and_trim(records, primer):\n", " ... for record in records:\n", " ... if record.seq[record.annotations[\"clip_qual_left\"]:].startswith(primer):\n", " ... record.annotations[\"clip_qual_left\"] += len(primer)\n", " ... yield record\n", " ...\n", " >>> records = SeqIO.parse(\"Roche/E3MFGYR02_random_10_reads.sff\", \"sff\")\n", " >>> count = SeqIO.write(filter_and_trim(records, \"AAAGA\"),\n", " ... \"temp_filtered.sff\", \"sff\")\n", " ...\n", " >>> print(\"Selected %i records\" % count)\n", " Selected 2 records\n", " \n", " We can check the results, note the lower case clipped region now includes the \"AAAGA\"\n", " sequence:\n", " \n", " >>> for record in SeqIO.parse(\"temp_filtered.sff\", \"sff\"):\n", " ... print(\"%s %i %s...\" % (record.id, len(record), record.seq[:20]))\n", " ...\n", " E3MFGYR02JHD4H 310 tcagaaagaCAAGTGGTATC...\n", " E3MFGYR02GAZMS 278 tcagaaagaAGTAAGGTAAA...\n", " >>> for record in SeqIO.parse(\"temp_filtered.sff\", \"sff-trim\"):\n", " ... print(\"%s %i %s...\" % (record.id, len(record), record.seq[:20]))\n", " ...\n", " E3MFGYR02JHD4H 287 CAAGTGGTATCAACGCAGAG...\n", " E3MFGYR02GAZMS 266 AGTAAGGTAAATAACAAACG...\n", " >>> import os\n", " >>> os.remove(\"temp_filtered.sff\")\n", " \n", " For a description of the file format, please see the Roche manuals and:\n", " http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats\n", "\n", "CLASSES\n", " Bio.SeqIO.Interfaces.SequenceWriter(builtins.object)\n", " SffWriter\n", " \n", " class SffWriter(Bio.SeqIO.Interfaces.SequenceWriter)\n", " | SFF file writer.\n", " | \n", " | Method resolution order:\n", " | SffWriter\n", " | Bio.SeqIO.Interfaces.SequenceWriter\n", " | builtins.object\n", " | \n", " | Methods defined here:\n", " | \n", " | __init__(self, handle, index=True, xml=None)\n", " | Creates the writer object.\n", " | \n", " | - handle - Output handle, ideally in binary write mode.\n", " | - index - Boolean argument, should we try and write an index?\n", " | - xml - Optional string argument, xml manifest to be recorded in the index\n", " | block (see function ReadRocheXmlManifest for reading this data).\n", " | \n", " | write_file(self, records)\n", " | Use this to write an entire file containing the given records.\n", " | \n", " | write_header(self)\n", " | \n", " | write_record(self, record)\n", " | Write a single additional record to the output file.\n", " | \n", " | This assumes the header has been done.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | clean(self, text)\n", " | Use this to avoid getting newlines in the output.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", "\n", "FUNCTIONS\n", " ReadRocheXmlManifest(handle)\n", " Reads any Roche style XML manifest data in the SFF \"index\".\n", " \n", " The SFF file format allows for multiple different index blocks, and Roche\n", " took advantage of this to define their own index block which also embeds\n", " an XML manifest string. This is not a publically documented extension to\n", " the SFF file format, this was reverse engineered.\n", " \n", " The handle should be to an SFF file opened in binary mode. This function\n", " will use the handle seek/tell functions and leave the handle in an\n", " arbitrary location.\n", " \n", " Any XML manifest found is returned as a Python string, which you can then\n", " parse as appropriate, or reuse when writing out SFF files with the\n", " SffWriter class.\n", " \n", " Returns a string, or raises a ValueError if an Roche manifest could not be\n", " found.\n", " \n", " SffIterator(handle, alphabet=DNAAlphabet(), trim=False)\n", " Iterate over Standard Flowgram Format (SFF) reads (as SeqRecord objects).\n", " \n", " - handle - input file, an SFF file, e.g. from Roche 454 sequencing.\n", " This must NOT be opened in universal read lines mode!\n", " - alphabet - optional alphabet, defaults to generic DNA.\n", " - trim - should the sequences be trimmed?\n", " \n", " The resulting SeqRecord objects should match those from a paired FASTA\n", " and QUAL file converted from the SFF file using the Roche 454 tool\n", " ssfinfo. i.e. The sequence will be mixed case, with the trim regions\n", " shown in lower case.\n", " \n", " This function is used internally via the Bio.SeqIO functions:\n", " \n", " >>> from Bio import SeqIO\n", " >>> for record in SeqIO.parse(\"Roche/E3MFGYR02_random_10_reads.sff\", \"sff\"):\n", " ... print(\"%s %i\" % (record.id, len(record)))\n", " ...\n", " E3MFGYR02JWQ7T 265\n", " E3MFGYR02JA6IL 271\n", " E3MFGYR02JHD4H 310\n", " E3MFGYR02GFKUC 299\n", " E3MFGYR02FTGED 281\n", " E3MFGYR02FR9G7 261\n", " E3MFGYR02GAZMS 278\n", " E3MFGYR02HHZ8O 221\n", " E3MFGYR02GPGB1 269\n", " E3MFGYR02F7Z7G 219\n", " \n", " You can also call it directly:\n", " \n", " >>> with open(\"Roche/E3MFGYR02_random_10_reads.sff\", \"rb\") as handle:\n", " ... for record in SffIterator(handle):\n", " ... print(\"%s %i\" % (record.id, len(record)))\n", " ...\n", " E3MFGYR02JWQ7T 265\n", " E3MFGYR02JA6IL 271\n", " E3MFGYR02JHD4H 310\n", " E3MFGYR02GFKUC 299\n", " E3MFGYR02FTGED 281\n", " E3MFGYR02FR9G7 261\n", " E3MFGYR02GAZMS 278\n", " E3MFGYR02HHZ8O 221\n", " E3MFGYR02GPGB1 269\n", " E3MFGYR02F7Z7G 219\n", " \n", " Or, with the trim option:\n", " \n", " >>> with open(\"Roche/E3MFGYR02_random_10_reads.sff\", \"rb\") as handle:\n", " ... for record in SffIterator(handle, trim=True):\n", " ... print(\"%s %i\" % (record.id, len(record)))\n", " ...\n", " E3MFGYR02JWQ7T 260\n", " E3MFGYR02JA6IL 265\n", " E3MFGYR02JHD4H 292\n", " E3MFGYR02GFKUC 295\n", " E3MFGYR02FTGED 277\n", " E3MFGYR02FR9G7 256\n", " E3MFGYR02GAZMS 271\n", " E3MFGYR02HHZ8O 150\n", " E3MFGYR02GPGB1 221\n", " E3MFGYR02F7Z7G 130\n", "\n", "DATA\n", " __docformat__ = 'restructuredtext en'\n", " print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...\n", "\n", "FILE\n", " /home/tiago_antao/miniconda/lib/python3.5/site-packages/Bio/SeqIO/SffIO.py\n", "\n", "\n" ] } ], "source": [ "from Bio.SeqIO import SffIO\n", "help(SffIO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Identifying open reading frames\n", "\n", "A very simplistic first step at identifying possible genes is to look\n", "for open reading frames (ORFs). By this we mean look in all six frames\n", "for long regions without stop codons – an ORF is just a region of\n", "nucleotides with no in frame stop codons.\n", "\n", "Of course, to find a gene you would also need to worry about locating a\n", "start codon, possible promoters – and in Eukaryotes there are introns to\n", "worry about too. However, this approach is still useful in viruses and\n", "Prokaryotes.\n", "\n", "To show how you might approach this with Biopython, we’ll need a\n", "sequence to search, and as an example we’ll again use the bacterial\n", "plasmid – although this time we’ll start with a plain FASTA file with no\n", "pre-marked genes:\n", "[`NC_005816.fna`](http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna).\n", "This is a bacterial sequence, so we’ll want to use NCBI codon table 11\n", "(see Section \\[sec:translation\\] about translation).\n", "\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from Bio import SeqIO\n", "record = SeqIO.read(\"data/NC_005816.fna\", \"fasta\")\n", "table = 11\n", "min_pro_len = 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Here is a neat trick using the `Seq` object’s `split` method to get a\n", "list of all the possible ORF translations in the six reading frames:\n", "\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0\n", "KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1\n", "GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1\n", "VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1\n", "NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2\n", "RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2\n", "TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2\n", "QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0\n", "IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0\n", "WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1\n", "RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1\n", "WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1\n", "LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2\n", "RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2\n" ] } ], "source": [ "for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:\n", " for frame in range(3):\n", " length = 3 * ((len(record)-frame) // 3) #Multiple of three\n", " for pro in nuc[frame:frame+length].translate(table).split(\"*\"):\n", " if len(pro) >= min_pro_len:\n", " print(\"%s...%s - length %i, strand %i, frame %i\"\n", " % (pro[:30], pro[-3:], len(pro), strand, frame))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Note that here we are counting the frames from the 5’ end (start) of\n", "*each* strand. It is sometimes easier to always count from the 5’ end\n", "(start) of the *forward* strand.\n", "\n", "You could easily edit the above loop based code to build up a list of\n", "the candidate proteins, or convert this to a list comprehension. Now,\n", "one thing this code doesn’t do is keep track of where the proteins are.\n", "\n", "You could tackle this in several ways. For example, the following code\n", "tracks the locations in terms of the protein counting, and converts back\n", "to the parent sequence by multiplying by three, then adjusting for the\n", "frame and strand:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "record = SeqIO.read(\"NC_005816.gb\",\"genbank\")\n", "table = 11\n", "min_pro_len = 100\n", "\n", "def find_orfs_with_trans(seq, trans_table, min_protein_length):\n", " answer = []\n", " seq_len = len(seq)\n", " for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:\n", " for frame in range(3):\n", " trans = str(nuc[frame:].translate(trans_table))\n", " trans_len = len(trans)\n", " aa_start = 0\n", " aa_end = 0\n", " while aa_start < trans_len:\n", " aa_end = trans.find(\"*\", aa_start)\n", " if aa_end == -1:\n", " aa_end = trans_len\n", " if aa_end-aa_start >= min_protein_length:\n", " if strand == 1:\n", " start = frame+aa_start*3\n", " end = min(seq_len,frame+aa_end*3+3)\n", " else:\n", " start = seq_len-frame-aa_end*3-3\n", " end = seq_len-frame-aa_start*3\n", " answer.append((start, end, strand,\n", " trans[aa_start:aa_end]))\n", " aa_start = aa_end+1\n", " answer.sort()\n", " return answer\n", "\n", "orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)\n", "for start, end, strand, pro in orf_list:\n", " print(\"%s...%s - length %i, strand %i, %i:%i\" \\\n", " % (pro[:30], pro[-3:], len(pro), strand, start, end))\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "And the output:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, 41:1109\n", "WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, 491:827\n", "KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, 1030:1888\n", "RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, 2830:3190\n", "RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, 3470:3857\n", "GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, 4249:4780\n", "RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, 4814:5900\n", "VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, 5923:6421\n", "LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, 5974:6298\n", "GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, 6654:7602\n", "IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, 7788:8124\n", "WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, 8087:8465\n", "TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, 8741:9044\n", "QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, 9264:9609\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "If you comment out the sort statement, then the protein sequences will\n", "be shown in the same order as before, so you can check this is doing the\n", "same thing. Here we have sorted them by location to make it easier to\n", "compare to the actual annotation in the GenBank file (as visualised in\n", "Section \\[sec:gd\\_nice\\_example\\]).\n", "\n", "If however all you want to find are the locations of the open reading\n", "frames, then it is a waste of time to translate every possible codon,\n", "including doing the reverse complement to search the reverse strand too.\n", "All you need to do is search for the possible stop codons (and their\n", "reverse complements). Using regular expressions is an obvious approach\n", "here (see the Python module `re`). These are an extremely powerful (but\n", "rather complex) way of describing search strings, which are supported in\n", "lots of programming languages and also command line tools like `grep` as\n", "well). You can find whole books about this topic!\n", "\n", "Sequence parsing plus simple plots {#seq:sequence-parsing-plus-pylab}\n", "----------------------------------\n", "\n", "This section shows some more examples of sequence parsing, using the\n", "`Bio.SeqIO` module described in Chapter \\[chapter:Bio.SeqIO\\], plus the\n", "Python library matplotlib’s `pylab` plotting interface (see [the\n", "matplotlib website for a tutorial](http://matplotlib.sourceforge.net/)).\n", "Note that to follow these examples you will need matplotlib installed -\n", "but without it you can still try the data parsing bits.\n", "\n", "### Histogram of sequence lengths\n", "\n", "There are lots of times when you might want to visualise the\n", "distribution of sequence lengths in a dataset – for example the range of\n", "contig sizes in a genome assembly project. In this example we’ll reuse\n", "our orchid FASTA file\n", "[ls\\_orchid.fasta](http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta)\n", "which has only 94 sequences.\n", "\n", "First of all, we will use `Bio.SeqIO` to parse the FASTA file and\n", "compile a list of all the sequence lengths. You could do this with a for\n", "loop, but I find a list comprehension more pleasing:\n", "\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(94, 572, 789)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from Bio import SeqIO\n", "sizes = [len(rec) for rec in SeqIO.parse(\"data/ls_orchid.fasta\", \"fasta\")]\n", "len(sizes), min(sizes), max(sizes)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[740,\n", " 753,\n", " 748,\n", " 744,\n", " 733,\n", " 718,\n", " 730,\n", " 704,\n", " 740,\n", " 709,\n", " 700,\n", " 726,\n", " 753,\n", " 699,\n", " 658,\n", " 752,\n", " 726,\n", " 765,\n", " 755,\n", " 742,\n", " 762,\n", " 745,\n", " 750,\n", " 731,\n", " 741,\n", " 740,\n", " 727,\n", " 711,\n", " 743,\n", " 727,\n", " 757,\n", " 770,\n", " 767,\n", " 759,\n", " 750,\n", " 788,\n", " 774,\n", " 789,\n", " 688,\n", " 719,\n", " 743,\n", " 737,\n", " 728,\n", " 740,\n", " 696,\n", " 732,\n", " 731,\n", " 735,\n", " 720,\n", " 740,\n", " 629,\n", " 572,\n", " 587,\n", " 700,\n", " 636,\n", " 716,\n", " 592,\n", " 716,\n", " 733,\n", " 626,\n", " 737,\n", " 740,\n", " 574,\n", " 594,\n", " 610,\n", " 730,\n", " 641,\n", " 702,\n", " 733,\n", " 738,\n", " 736,\n", " 732,\n", " 745,\n", " 744,\n", " 738,\n", " 739,\n", " 740,\n", " 745,\n", " 695,\n", " 745,\n", " 743,\n", " 730,\n", " 706,\n", " 744,\n", " 742,\n", " 694,\n", " 712,\n", " 715,\n", " 688,\n", " 784,\n", " 721,\n", " 703,\n", " 744,\n", " 592]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sizes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now that we have the lengths of all the genes (as a list of integers),\n", "we can use the matplotlib histogram function to display it.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "sizes = [len(rec) for rec in SeqIO.parse(\"ls_orchid.fasta\", \"fasta\")]\n", "\n", "import pylab\n", "pylab.hist(sizes, bins=20)\n", "pylab.title(\"%i orchid sequences\\nLengths %i to %i\" \\\n", " % (len(sizes),min(sizes),max(sizes)))\n", "pylab.xlabel(\"Sequence length (bp)\")\n", "pylab.ylabel(\"Count\")\n", "pylab.show()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "That should pop up a new window containing the following graph:\n", "\n", "![Histogram of orchid sequence lengths.](images/hist_plot.png)\n", "\n", "That should pop up a new window containing the graph shown in\n", "Figure \\[fig:seq-len-hist\\].\n", "\n", "Notice that most of these orchid sequences are about $740$ bp long, and\n", "there could be two distinct classes of sequence here with a subset of\n", "shorter sequences.\n", "\n", "*Tip:* Rather than using `pylab.show()` to show the plot in a window,\n", "you can also use `pylab.savefig(...)` to save the figure to a file (e.g.\n", "as a PNG or PDF).\n", "\n", "### Plot of sequence GC%\n", "\n", "Another easily calculated quantity of a nucleotide sequence is the GC%.\n", "You might want to look at the GC% of all the genes in a bacterial genome\n", "for example, and investigate any outliers which could have been recently\n", "acquired by horizontal gene transfer. Again, for this example we’ll\n", "reuse our orchid FASTA file\n", "[ls\\_orchid.fasta](http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta).\n", "\n", "First of all, we will use `Bio.SeqIO` to parse the FASTA file and\n", "compile a list of all the GC percentages. Again, you could do this with\n", "a for loop, but I prefer this:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "from Bio.SeqUtils import GC\n", "\n", "gc_values = sorted(GC(rec.seq) for rec in SeqIO.parse(\"ls_orchid.fasta\", \"fasta\"))\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Having read in each sequence and calculated the GC%, we then sorted them\n", "into ascending order. Now we’ll take this list of floating point values\n", "and plot them with matplotlib:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "import pylab\n", "pylab.plot(gc_values)\n", "pylab.title(\"%i orchid sequences\\nGC%% %0.1f to %0.1f\" \\\n", " % (len(gc_values),min(gc_values),max(gc_values)))\n", "pylab.xlabel(\"Genes\")\n", "pylab.ylabel(\"GC%\")\n", "pylab.show()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As in the previous example, that should pop up a new window containing a\n", "graph:\n", "\n", "![Histogram of orchid sequence lengths.](images/gc_plot.png)\n", "\n", "As in the previous example, that should pop up a new window with the\n", "graph shown in Figure \\[fig:seq-gc-plot\\].\n", "\n", "If you tried this on the full set of genes from one organism, you’d\n", "probably get a much smoother plot than this.\n", "\n", "### Nucleotide dot plots\n", "\n", "A dot plot is a way of visually comparing two nucleotide sequences for\n", "similarity to each other. A sliding window is used to compare short\n", "sub-sequences to each other, often with a mis-match threshold. Here for\n", "simplicity we’ll only look for perfect matches (shown in black\n", "\n", "in Figure \\[fig:nuc-dot-plot\\]).\n", "\n", "in the plot below).\n", "\n", "![Nucleotide dot plot of two orchid sequence lengths (using pylab’s\n", "imshow function).](images/dot_plot.png)\n", "\n", "To start off, we’ll need two sequences. For the sake of argument, we’ll\n", "just take the first two from our orchid FASTA file\n", "[ls\\_orchid.fasta](http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta):\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio import SeqIO\n", "handle = open(\"ls_orchid.fasta\")\n", "record_iterator = SeqIO.parse(handle, \"fasta\")\n", "rec_one = next(record_iterator)\n", "rec_two = next(record_iterator)\n", "handle.close()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We’re going to show two approaches. Firstly, a simple naive\n", "implementation which compares all the window sized sub-sequences to each\n", "other to compiles a similarity matrix. You could construct a matrix or\n", "array object, but here we just use a list of lists of booleans created\n", "with a nested list comprehension:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "window = 7\n", "seq_one = str(rec_one.seq).upper()\n", "seq_two = str(rec_two.seq).upper()\n", "data = [[(seq_one[i:i+window] <> seq_two[j:j+window]) \\\n", " for j in range(len(seq_one)-window)] \\\n", " for i in range(len(seq_two)-window)]\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Note that we have *not* checked for reverse complement matches here. Now\n", "we’ll use the matplotlib’s `pylab.imshow()` function to display this\n", "data, first requesting the gray color scheme so this is done in black\n", "and white:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "import pylab\n", "pylab.gray()\n", "pylab.imshow(data)\n", "pylab.xlabel(\"%s (length %i bp)\" % (rec_one.id, len(rec_one)))\n", "pylab.ylabel(\"%s (length %i bp)\" % (rec_two.id, len(rec_two)))\n", "pylab.title(\"Dot plot using window size %i\\n(allowing no mis-matches)\" % window)\n", "pylab.show()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "That should pop up a new window containing a graph like this:\n", "\n", "That should pop up a new window showing the graph in\n", "Figure \\[fig:nuc-dot-plot\\].\n", "\n", "As you might have expected, these two sequences are very similar with a\n", "partial line of window sized matches along the diagonal. There are no\n", "off diagonal matches which would be indicative of inversions or other\n", "interesting events.\n", "\n", "The above code works fine on small examples, but there are two problems\n", "applying this to larger sequences, which we will address below. First\n", "off all, this brute force approach to the all against all comparisons is\n", "very slow. Instead, we’ll compile dictionaries mapping the window sized\n", "sub-sequences to their locations, and then take the set intersection to\n", "find those sub-sequences found in both sequences. This uses more memory,\n", "but is *much* faster. Secondly, the `pylab.imshow()` function is limited\n", "in the size of matrix it can display. As an alternative, we’ll use the\n", "`pylab.scatter()` function.\n", "\n", "We start by creating dictionaries mapping the window-sized sub-sequences\n", "to locations:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "window = 7\n", "dict_one = {}\n", "dict_two = {}\n", "for (seq, section_dict) in [(str(rec_one.seq).upper(), dict_one),\n", " (str(rec_two.seq).upper(), dict_two)]:\n", " for i in range(len(seq)-window):\n", " section = seq[i:i+window]\n", " try:\n", " section_dict[section].append(i)\n", " except KeyError:\n", " section_dict[section] = [i]\n", "#Now find any sub-sequences found in both sequences\n", "#(Python 2.3 would require slightly different code here)\n", "matches = set(dict_one).intersection(dict_two)\n", "print(\"%i unique matches\" % len(matches))\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In order to use the `pylab.scatter()` we need separate lists for the $x$\n", "and $y$ co-ordinates:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "#Create lists of x and y co-ordinates for scatter plot\n", "x = []\n", "y = []\n", "for section in matches:\n", " for i in dict_one[section]:\n", " for j in dict_two[section]:\n", " x.append(i)\n", " y.append(j)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We are now ready to draw the revised dot plot as a scatter plot:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "import pylab\n", "pylab.cla() #clear any prior graph\n", "pylab.gray()\n", "pylab.scatter(x,y)\n", "pylab.xlim(0, len(rec_one)-window)\n", "pylab.ylim(0, len(rec_two)-window)\n", "pylab.xlabel(\"%s (length %i bp)\" % (rec_one.id, len(rec_one)))\n", "pylab.ylabel(\"%s (length %i bp)\" % (rec_two.id, len(rec_two)))\n", "pylab.title(\"Dot plot using window size %i\\n(allowing no mis-matches)\" % window)\n", "pylab.show()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "That should pop up a new window containing a graph like this:\n", "\n", "That should pop up a new window showing the graph in\n", "Figure \\[fig:nuc-dot-plot-scatter\\].\n", "\n", "![Nucleotide dot plot of two orchid sequence lengths (using pylab’s\n", "scatter function).](images/dot_plot_scatter.png)\n", "\n", "Personally I find this second plot much easier to read! Again note that\n", "we have *not* checked for reverse complement matches here – you could\n", "extend this example to do this, and perhaps plot the forward matches in\n", "one color and the reverse matches in another.\n", "\n", "### Plotting the quality scores of sequencing read data\n", "\n", "If you are working with second generation sequencing data, you may want\n", "to try plotting the quality data. Here is an example using two FASTQ\n", "files containing paired end reads, `SRR001666_1.fastq` for the forward\n", "reads, and `SRR001666_2.fastq` for the reverse reads. These were\n", "downloaded from the ENA sequence read archive FTP site\n", "(\n", "and\n", "),\n", "and are from *E. coli* – see\n", " for details. In the\n", "following code the `pylab.subplot(...)` function is used in order to\n", "show the forward and reverse qualities on two subplots, side by side.\n", "There is also a little bit of code to only plot the first fifty reads.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "import pylab\n", "from Bio import SeqIO\n", "for subfigure in [1,2]:\n", " filename = \"SRR001666_%i.fastq\" % subfigure\n", " pylab.subplot(1, 2, subfigure)\n", " for i,record in enumerate(SeqIO.parse(filename, \"fastq\")):\n", " if i >= 50 : break #trick!\n", " pylab.plot(record.letter_annotations[\"phred_quality\"])\n", " pylab.ylim(0,45)\n", " pylab.ylabel(\"PHRED quality score\")\n", " pylab.xlabel(\"Position\")\n", "pylab.savefig(\"SRR001666.png\")\n", "print(\"Done\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You should note that we are using the `Bio.SeqIO` format name `fastq`\n", "here because the NCBI has saved these reads using the standard Sanger\n", "FASTQ format with PHRED scores. However, as you might guess from the\n", "read lengths, this data was from an Illumina Genome Analyzer and was\n", "probably originally in one of the two Solexa/Illumina FASTQ variant file\n", "formats instead.\n", "\n", "This example uses the `pylab.savefig(...)` function instead of\n", "`pylab.show(...)`, but as mentioned before both are useful.\n", "\n", "![Quality plot for some paired end reads.](images/SRR001666.png)\n", "\n", "The result is shown in Figure \\[fig:paired-end-qual-plot\\].\n", "\n", "Here is the result:\n", "\n", "Dealing with alignments\n", "-----------------------\n", "\n", "This section can been seen as a follow on to\n", "Chapter \\[chapter:Bio.AlignIO\\].\n", "\n", "### Calculating summary information {#sec:summary_info}\n", "\n", "Once you have an alignment, you are very likely going to want to find\n", "out information about it. Instead of trying to have all of the functions\n", "that can generate information about an alignment in the alignment object\n", "itself, we’ve tried to separate out the functionality into separate\n", "classes, which act on the alignment.\n", "\n", "Getting ready to calculate summary information about an object is quick\n", "to do. Let’s say we’ve got an alignment object called `alignment`, for\n", "example read in using `Bio.AlignIO.read(...)` as described in\n", "Chapter \\[chapter:Bio.AlignIO\\]. All we need to do to get an object that\n", "will calculate summary information is:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio.Align import AlignInfo\n", "summary_align = AlignInfo.SummaryInfo(alignment)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The `summary_align` object is very useful, and will do the following\n", "neat things for you:\n", "\n", "1. Calculate a quick consensus sequence – see section \\[sec:consensus\\]\n", "\n", "2. Get a position specific score matrix for the alignment – see\n", " section \\[sec:pssm\\]\n", "\n", "3. Calculate the information content for the alignment – see\n", " section \\[sec:getting\\_info\\_content\\]\n", "\n", "4. Generate information on substitutions in the alignment –\n", " section \\[sec:sub\\_matrix\\] details using this to generate a\n", " substitution matrix.\n", "\n", "### Calculating a quick consensus sequence {#sec:consensus}\n", "\n", "The `SummaryInfo` object, described in section \\[sec:summary\\_info\\],\n", "provides functionality to calculate a quick consensus of an alignment.\n", "Assuming we’ve got a `SummaryInfo` object called `summary_align` we can\n", "calculate a consensus by doing:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "consensus = summary_align.dumb_consensus()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As the name suggests, this is a really simple consensus calculator, and\n", "will just add up all of the residues at each point in the consensus, and\n", "if the most common value is higher than some threshold value will add\n", "the common residue to the consensus. If it doesn’t reach the threshold,\n", "it adds an ambiguity character to the consensus. The returned consensus\n", "object is Seq object whose alphabet is inferred from the alphabets of\n", "the sequences making up the consensus. So doing a `print consensus`\n", "would give:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT\n", "...', IUPACAmbiguousDNA())\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You can adjust how `dumb_consensus` works by passing optional\n", "parameters:\n", "\n", "the threshold\n", "\n", ": This is the threshold specifying how common a particular residue has\n", " to be at a position before it is added. The default is $0.7$\n", " (meaning $70\\%$).\n", "\n", "the ambiguous character\n", "\n", ": This is the ambiguity character to use. The default is ’N’.\n", "\n", "the consensus alphabet\n", "\n", ": This is the alphabet to use for the consensus sequence. If an\n", " alphabet is not specified than we will try to guess the alphabet\n", " based on the alphabets of the sequences in the alignment.\n", "\n", "### Position Specific Score Matrices {#sec:pssm}\n", "\n", "Position specific score matrices (PSSMs) summarize the alignment\n", "information in a different way than a consensus, and may be useful for\n", "different tasks. Basically, a PSSM is a count matrix. For each column in\n", "the alignment, the number of each alphabet letters is counted and\n", "totaled. The totals are displayed relative to some representative\n", "sequence along the left axis. This sequence may be the consesus\n", "sequence, but can also be any sequence in the alignment. For instance\n", "for the alignment,\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "GTATC\n", "AT--C\n", "CTGTC\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "the PSSM is:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "G A T C\n", " G 1 1 0 1\n", " T 0 0 3 0\n", " A 1 1 0 0\n", " T 0 0 2 0\n", " C 0 0 0 3\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let’s assume we’ve got an alignment object called `c_align`. To get a\n", "PSSM with the consensus sequence along the side we first get a summary\n", "object and calculate the consensus sequence:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "summary_align = AlignInfo.SummaryInfo(c_align)\n", "consensus = summary_align.dumb_consensus()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now, we want to make the PSSM, but ignore any `N` ambiguity residues\n", "when calculating this:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "my_pssm = summary_align.pos_specific_score_matrix(consensus,\n", " chars_to_ignore = ['N'])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Two notes should be made about this:\n", "\n", "1. To maintain strictness with the alphabets, you can only include\n", " characters along the top of the PSSM that are in the alphabet of the\n", " alignment object. Gaps are not included along the top axis of\n", " the PSSM.\n", "\n", "2. The sequence passed to be displayed along the left side of the axis\n", " does not need to be the consensus. For instance, if you wanted to\n", " display the second sequence in the alignment along this axis, you\n", " would need to do:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "second_seq = alignment.get_seq_by_num(1)\n", " my_pssm = summary_align.pos_specific_score_matrix(second_seq\n", " chars_to_ignore = ['N'])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The command above returns a `PSSM` object. To print out the PSSM as\n", "shown above, we simply need to do a `print(my_pssm)`, which gives:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "A C G T\n", "T 0.0 0.0 0.0 7.0\n", "A 7.0 0.0 0.0 0.0\n", "T 0.0 0.0 0.0 7.0\n", "A 7.0 0.0 0.0 0.0\n", "C 0.0 7.0 0.0 0.0\n", "A 7.0 0.0 0.0 0.0\n", "T 0.0 0.0 0.0 7.0\n", "T 1.0 0.0 0.0 6.0\n", "...\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You can access any element of the PSSM by subscripting like\n", "`your_pssm[sequence_number][residue_count_name]`. For instance, to get\n", "the counts for the ’A’ residue in the second element of the above PSSM\n", "you would do:\n", "\n" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'my_pssm' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmy_pssm\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m\"A\"\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'my_pssm' is not defined" ] } ], "source": [ "print(my_pssm[1][\"A\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The structure of the PSSM class hopefully makes it easy both to access\n", "elements and to pretty print the matrix.\n", "\n", "### Information Content {#sec:getting_info_content}\n", "\n", "A potentially useful measure of evolutionary conservation is the\n", "information content of a sequence.\n", "\n", "A useful introduction to information theory targeted towards molecular\n", "biologists can be found at\n", ". For our purposes, we\n", "will be looking at the information content of a consesus sequence, or a\n", "portion of a consensus sequence. We calculate information content at a\n", "particular column in a multiple sequence alignment using the following\n", "formula:\n", "\n", "$$IC_{j} = \\sum_{i=1}^{N_{a}} P_{ij} \\mathrm{log}\\left(\\frac{P_{ij}}{Q_{i}}\\right)$$\n", "\n", "where:\n", "\n", "- $IC_{j}$ – The information content for the $j$-th column in\n", " an alignment.\n", "\n", "- $N_{a}$ – The number of letters in the alphabet.\n", "\n", "- $P_{ij}$ – The frequency of a particular letter $i$ in the $j$-th\n", " column (i. e. if G occurred 3 out of 6 times in an aligment column,\n", " this would be 0.5)\n", "\n", "- $Q_{i}$ – The expected frequency of a letter $i$. This is an\n", " optional argument, usage of which is left at the user’s discretion.\n", " By default, it is automatically assigned to $0.05 = 1/20$ for a\n", " protein alphabet, and $0.25 = 1/4$ for a nucleic acid alphabet. This\n", " is for geting the information content without any assumption of\n", " prior distributions. When assuming priors, or when using a\n", " non-standard alphabet, you should supply the values for $Q_{i}$.\n", "\n", "Well, now that we have an idea what information content is being\n", "calculated in Biopython, let’s look at how to get it for a particular\n", "region of the alignment.\n", "\n", "First, we need to use our alignment to get an alignment summary object,\n", "which we’ll assume is called `summary_align` (see\n", "section \\[sec:summary\\_info\\]) for instructions on how to get this. Once\n", "we’ve got this object, calculating the information content for a region\n", "is as easy as:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "info_content = summary_align.information_content(5, 30,\n", " chars_to_ignore = ['N'])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Wow, that was much easier then the formula above made it look! The\n", "variable `info_content` now contains a float value specifying the\n", "information content over the specified region (from 5 to 30 of the\n", "alignment). We specifically ignore the ambiguity residue ’N’ when\n", "calculating the information content, since this value is not included in\n", "our alphabet (so we shouldn’t be interested in looking at it!).\n", "\n", "As mentioned above, we can also calculate relative information content\n", "by supplying the expected frequencies:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "expect_freq = {\n", " 'A' : .3,\n", " 'G' : .2,\n", " 'T' : .3,\n", " 'C' : .2}\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The expected should not be passed as a raw dictionary, but instead by\n", "passed as a `SubsMat.FreqTable` object (see section \\[sec:freq\\_table\\]\n", "for more information about FreqTables). The FreqTable object provides a\n", "standard for associating the dictionary with an Alphabet, similar to how\n", "the Biopython Seq class works.\n", "\n", "To create a FreqTable object, from the frequency dictionary you just\n", "need to do:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from Bio.Alphabet import IUPAC\n", "from Bio.SubsMat import FreqTable\n", "\n", "e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ,\n", " IUPAC.unambiguous_dna)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now that we’ve got that, calculating the relative information content\n", "for our region of the alignment is as simple as:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "info_content = summary_align.information_content(5, 30,\n", " e_freq_table = e_freq_table,\n", " chars_to_ignore = ['N'])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now, `info_content` will contain the relative information content over\n", "the region in relation to the expected frequencies.\n", "\n", "The value return is calculated using base 2 as the logarithm base in the\n", "formula above. You can modify this by passing the parameter `log_base`\n", "as the base you want:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "info_content = summary_align.information_content(5, 30, log_base = 10,\n", " chars_to_ignore = ['N'])\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Well, now you are ready to calculate information content. If you want to\n", "try applying this to some real life problems, it would probably be best\n", "to dig into the literature on information content to get an idea of how\n", "it is used. Hopefully your digging won’t reveal any mistakes made in\n", "coding this function!\n", "\n", "### Substitution Matrices\n", "\n", "Substitution matrices are an extremely important part of everyday\n", "bioinformatics work. They provide the scoring terms for classifying how\n", "likely two different residues are to substitute for each other. This is\n", "essential in doing sequence comparisons. The book “Biological Sequence\n", "Analysis” by Durbin et al. provides a really nice introduction to\n", "Substitution Matrices and their uses. Some famous substitution matrices\n", "are the PAM and BLOSUM series of matrices.\n", "\n", "Biopython provides a ton of common substitution matrices, and also\n", "provides functionality for creating your own substitution matrices.\n", "\n", "### Using common substitution matrices\n", "\n", "### Creating your own substitution matrix from an alignment {#sec:subs_mat_ex}\n", "\n", "A very cool thing that you can do easily with the substitution matrix\n", "classes is to create your own substitution matrix from an alignment. In\n", "practice, this is normally done with protein alignments. In this\n", "example, we’ll first get a Biopython alignment object and then get a\n", "summary object to calculate info about the alignment. The file\n", "containing [protein.aln](examples/protein.aln) (also available online\n", "[here](http://biopython.org/DIST/docs/tutorial/examples/protein.aln))\n", "contains the Clustalw alignment output.\n", "\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from Bio import AlignIO\n", "from Bio import Alphabet\n", "from Bio.Alphabet import IUPAC\n", "from Bio.Align import AlignInfo\n", "filename = \"data/protein.aln\"\n", "alpha = Alphabet.Gapped(IUPAC.protein)\n", "c_align = AlignIO.read(filename, \"clustal\", alphabet=alpha)\n", "summary_align = AlignInfo.SummaryInfo(c_align)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Sections \\[sec:align\\_clustal\\] and \\[sec:summary\\_info\\] contain more\n", "information on doing this.\n", "\n", "Now that we’ve got our `summary_align` object, we want to use it to find\n", "out the number of times different residues substitute for each other. To\n", "make the example more readable, we’ll focus on only amino acids with\n", "polar charged side chains. Luckily, this can be done easily when\n", "generating a replacement dictionary, by passing in all of the characters\n", "that should be ignored. Thus we’ll create a dictionary of replacements\n", "for only charged polar amino acids using:\n", "\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "replace_info = summary_align.replacement_dictionary([\"G\", \"A\", \"V\", \"L\", \"I\",\n", "\"M\", \"P\", \"F\", \"W\", \"S\",\n", "\"T\", \"N\", \"Q\", \"Y\", \"C\"])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This information about amino acid replacements is represented as a\n", "python dictionary which will look something like (the order can vary):\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'): 2.0,\n", "('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0,\n", "('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'): 1235.0,\n", "('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0,\n", "('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0,\n", "('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0,\n", "('D', 'E'): 270.0, ('D', 'D'): 2360.0}\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This information gives us our accepted number of replacements, or how\n", "often we expect different things to substitute for each other. It turns\n", "out, amazingly enough, that this is all of the information we need to go\n", "ahead and create a substitution matrix. First, we use the replacement\n", "dictionary information to create an Accepted Replacement Matrix (ARM):\n", "\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from Bio import SubsMat\n", "my_arm = SubsMat.SeqMat(replace_info)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "With this accepted replacement matrix, we can go right ahead and create\n", "our log odds matrix (i. e. a standard type Substitution Matrix):\n", "\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "my_lom = SubsMat.make_log_odds_matrix(my_arm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The log odds matrix you create is customizable with the following\n", "optional arguments:\n", "\n", "- `exp_freq_table` – You can pass a table of expected frequencies for\n", " each alphabet. If supplied, this will be used instead of the passed\n", " accepted replacement matrix when calculate expected replacments.\n", "\n", "- `logbase` - The base of the logarithm taken to create the log\n", " odd matrix. Defaults to base 10.\n", "\n", "- `factor` - The factor to multiply each matrix entry by. This\n", " defaults to 10, which normally makes the matrix numbers easy to\n", " work with.\n", "\n", "- `round_digit` - The digit to round to in the matrix. This defaults\n", " to 0 (i. e. no digits).\n", "\n", "Once you’ve got your log odds matrix, you can display it prettily using\n", "the function `print_mat`. Doing this on our created matrix gives:\n", "\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "D 2\n", "E -1 1\n", "H -5 -4 3\n", "K -10 -5 -4 1\n", "R -4 -8 -4 -2 2\n", " D E H K R\n" ] } ], "source": [ "my_lom.print_mat()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }