{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "SAM\n", "===\n", "\n", "This notebook explores parsing and understanding the SAM (Sequence Alignment/Map) and related BAM format. SAM is an extremely common format for representing read alignments. Most widely-used read alignment tools output SAM alignments.\n", "\n", "SAM is a text format. There is a closely related binary format called BAM. There are two types of BAM files: unsorted or sorted. Various tools, notably [SAMtools] and [Picard], can convert back and forth between SAM and BAM, and can sort an existing BAM file. When we say a BAM file is sorted we almost always mean that the alignments are sorted left-to-right along the reference genome. (It's also possible to sort a BAM file by read name, though that's only occassionally useful.)\n", "\n", "Once you have an interesting set of read alignments that you would like to keep for a while and perhaps analyze further, it's a good idea to keep them in *sorted BAM* files. This is because:\n", "\n", "1. They will be well compressed. BAM files are smaller than corresponding SAM files, and *sorted* BAM files are smaller than corresponding *unsorted* BAM files.\n", "2. From a sorted BAM file, it's easy to extract just the alignments that overlap a specified stretch of the genome, making it easy to convert from sorted BAM to many other useful formats.\n", "\n", "That said, most alignment tools output SAM (not BAM), and the alignments come out in an arbitrary order -- not sorted.\n", "\n", "An authoritative and complete document describing the SAM and BAM formats is the [SAM specification]. This document is thorough, but, being a specification, it does not describe is the various \"dialects\" of legal SAM emitted by popular tools. I'll cover some of that here.\n", "\n", "[SAM specification]: http://samtools.sourceforge.net/SAMv1.pdf\n", "[SAMtools]: http://samtools.sourceforge.net\n", "[Picard]: http://picard.sourceforge.net" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Here's a string representing a three-line SAM file. I'm temporarily\n", "# ignoring the fact that SAM files usually have several header lines at\n", "# the beginning.\n", "samStr = '''\\\n", "r1\t0\tgi|9626243|ref|NC_001416.1|\t18401\t42\t122M\t*\t0\t0\tTGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG\t+\"@6<:27(F&5)9)\"B:%B+A-%5A?2$HCB0B+0=D<7E/<.03#!.F77@6B==?C\"7>;))%;,3-$.A06+<-1/@@?,26\">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C>\tAS:i:-5\tXN:i:0\tXM:i:3\tXO:i:0\tXG:i:0\tNM:i:3\tMD:Z:59G13G21G26\tYT:Z:UU\n", "r2\t0\tgi|9626243|ref|NC_001416.1|\t8886\t42\t275M\t*\t0\t0\tNTTNTGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGATCACCCTGTGGGTTTATAAGGGGATCGGTGACCCCTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGNCCTATGACGACAGCTATCTCGATGATGAAGATGCAGACTGGACTGC\t(#!!'+!$\"\"%+(+)'%)%!+!(&++)''\"#\"#&#\"!'!(\"%'\"\"(\"+&%$%*%%#$%#%#!)*'(#\")(($&$'&%+&#%*)*#*%*')(%+!%%*\"$%\"#+)$&&+)&)*+!\"*)!*!(\"&&\"*#+\"&\"'(%)*(\"'!$*!!%$&&&$!!&&\"(*\"$&\"#&!$%'%\"#)$#+%*+)!&*)+(\"\"#!)!%*#\"*)*')&\")($+*%%)!*)!('(%\"\"+%\"$##\"#+(('!*(($*'!\"*('\"+)&%#&$+('**$$&+*&!#%)')'(+(!%+\tAS:i:-14\tXN:i:0\tXM:i:8\tXO:i:0\tXG:i:0\tNM:i:8\tMD:Z:0A0C0G0A108C23G9T81T46\tYT:Z:UU\n", "r3\t16\tgi|9626243|ref|NC_001416.1|\t11599\t42\t338M\t*\t0\t0\tGGGCGCGTTACTGGGATGATCGTGAAAAGGCCCGTCTTGCGCTTGAAGCCGCCCGAAAGAAGGCTGAGCAGCAGACTCAAGAGGAGAAAAATGCGCAGCAGCGGAGCGATACCGAAGCGTCACGGCTGAAATATACCGAAGAGGCGCAGAAGGCTNACGAACGGCTGCAGACGCCGCTGCAGAAATATACCGCCCGTCAGGAAGAACTGANCAAGGCACNGAAAGACGGGAAAATCCTGCAGGCGGATTACAACACGCTGATGGCGGCGGCGAAAAAGGATTATGAAGCGACGCTGTAAAAGCCGAAACAGTCCAGCGTGAAGGTGTCTGCGGGCGAT\t7F$%6=$:9B@/F'>=?!D?@0(:A*)7/>9C>6#1<6:C(.CC;#.;>;2'$4D:?&B!>689?(0(G7+0=@37F)GG=>?958.D2E04CB>D-=\"C'B080E'5BH\"77':\"@70#4%A5=6.2/1>;9\"&-H6)=$/0;5E:<8G!@::1?2DC7C*;@*#.1C0.D>H/20,!\"C-#,6@%<+:?5\"2?:G,F\"D0B8D-6$65D 0:\n", " ret.append([0, run, \"\"])\n", " elif md[i].isalpha(): # stretch of mismatches\n", " mmstr = \"\"\n", " while i < len(md) and md[i].isalpha():\n", " mmstr += md[i]\n", " i += 1\n", " assert len(mmstr) > 0\n", " ret.append([1, len(mmstr), mmstr])\n", " elif md[i] == \"^\": # read gap\n", " i += 1 # skip over ^\n", " refstr = \"\"\n", " while i < len(md) and md[i].isalpha():\n", " refstr += md[i]\n", " i += 1 # skip over inserted character\n", " assert len(refstr) > 0\n", " ret.append([2, len(refstr), refstr])\n", " else:\n", " raise RuntimeError('Unexpected character in MD:Z: \"%d\"' % md[i])\n", " return ret" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[0, 10, ''], [1, 1, 'A'], [0, 5, ''], [2, 2, 'AC'], [0, 6, '']]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Each element in the list returned by this call is itself a list w/ 3\n", "# elements. Element 1 is the MD:Z operation (0=match, 1=mismatch,\n", "# 2=deletion). Element 2 is the length and element 3 is the relevant\n", "# sequence of nucleotides from the reference.\n", "mdzToList('10A5^AC6')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can write a fucntion that takes a read sequennce, a parsed CIGAR string, and a parse `MD:Z` string and combines information from all three to make what I call a \"stacked alignment.\"" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def cigarMdzToStacked(seq, cgp, mdp_orig):\n", " ''' Takes parsed CIGAR and parsed MD:Z, generates a stacked alignment:\n", " a pair of strings with gap characters inserted (possibly) and where\n", " characters at at the same offsets are opposite each other in the\n", " alignment. Only knows how to handle CIGAR ops M=XDINSH right now.\n", " '''\n", " mdp = mdp_orig[:]\n", " rds, rfs = [], []\n", " mdo, rdoff = 0, 0\n", " for c in cgp:\n", " op, run = c\n", " skipping = (op == 4 or op == 5)\n", " assert skipping or mdo < len(mdp)\n", " if op == 0: # CIGAR op M, = or X\n", " # Look for block matches and mismatches in MD:Z string\n", " mdrun = 0\n", " runleft = run\n", " while runleft > 0 and mdo < len(mdp):\n", " op_m, run_m, st_m = mdp[mdo]\n", " run_comb = min(runleft, run_m)\n", " runleft -= run_comb\n", " assert op_m == 0 or op_m == 1\n", " rds.append(seq[rdoff:rdoff + run_comb])\n", " if op_m == 0: # match from MD:Z string\n", " rfs.append(seq[rdoff:rdoff + run_comb])\n", " else: # mismatch from MD:Z string\n", " assert len(st_m) == run_comb\n", " rfs.append(st_m)\n", " mdrun += run_comb\n", " rdoff += run_comb\n", " # Stretch of matches in MD:Z could span M and I CIGAR ops\n", " if run_comb < run_m:\n", " assert op_m == 0\n", " mdp[mdo][1] -= run_comb\n", " else:\n", " mdo += 1\n", " elif op == 1: # CIGAR op I\n", " rds.append(seq[rdoff:rdoff + run])\n", " rfs.append(\"-\" * run)\n", " rdoff += run\n", " elif op == 2: # D\n", " op_m, run_m, st_m = mdp[mdo]\n", " assert op_m == 2\n", " assert run == run_m\n", " assert len(st_m) == run\n", " mdo += 1\n", " rds.append(\"-\" * run)\n", " rfs.append(st_m)\n", " elif op == 3: # N\n", " rds.append(\"-\" * run)\n", " rfs.append(\"-\" * run)\n", " elif op == 4: # S\n", " rds.append(seq[rdoff:rdoff + run].lower())\n", " rfs.append(' ' * run)\n", " rdoff += run\n", " elif op == 5: # H\n", " rds.append('!' * run)\n", " rfs.append(' ' * run)\n", " elif op == 6: # P\n", " raise RuntimeError(\"Don't know how to handle P in CIGAR\")\n", " else:\n", " raise RuntimeError('Unexpected CIGAR op: %d' % op)\n", " assert mdo == len(mdp)\n", " return ''.join(rds), ''.join(rfs)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('GGACGCTCAGTA--GTGACGATAGCTGAAAACCCTGTACGATAAACC',\n", " 'GGACGCTCAGTAATGTGACGATAGCTGAAAA--CTGTACGATAAACG')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Following example includes gaps and mismatches\n", "cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGATAAACC', cigarToList('12M2D17M2I14M'), mdzToList('12^AT30G0'))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('GGACGCTCAGTA--GTGACGATAGCTGAAAACCCTGTACGAgaagcc',\n", " 'GGACGCTCAGTAATGTGACGATAGCTGAAAA--CTGTACGA ')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Following example also includes soft clipping (CIGAR: S)\n", "# SAM spec: Soft clipping: \"clipped sequences present in SEQ\"\n", "# We print them in lowercase to emphasize their clippedness\n", "cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC', cigarToList('12M2D17M2I8M6S'), mdzToList('12^AT25'))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('GGACGCTCAGTA--GTGACGATAGCTGAAAACCCTGTACGAgaagcc!!!',\n", " 'GGACGCTCAGTAATGTGACGATAGCTGAAAA--CTGTACGA ')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Following example also includes hard clipping (CIGAR: H)\n", "# SAM spec: Hard clipping: \"clipped sequences NOT present in SEQ\"\n", "cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC', cigarToList('12M2D17M2I8M6S3H'), mdzToList('12^AT25'))\n", "# Note: don't see hard clipping in practice much" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('GGACGCTCAGTA--GTGACGATAG----------CTGAAAACCCTGTACGAgaagcc!!!',\n", " 'GGACGCTCAGTAATGTGACGATAG----------CTGAAAA--CTGTACGA ')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Following example also includes skipping (CIGAR: N), as seen in\n", "# TopHat alignments\n", "cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC',\n", " cigarToList('12M2D10M10N7M2I8M6S3H'),\n", " mdzToList('12^AT25'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the stacked alignment, it's easy to do other things. E.g. we can turn a stacked alignment into a new CIGAR string that uses the `=` and `X` operations instead of the less specific `M` operation:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def cigarize(rds, rfs):\n", " off = 0\n", " oplist = []\n", " lastc, cnt = '', 0\n", " for i in range(len(rds)):\n", " c = None\n", " if rfs[i] == ' ':\n", " c = 'S'\n", " elif rds[i] == '-' and rfs[i] == '-':\n", " c = 'N'\n", " elif rds[i] == '-':\n", " c = 'D'\n", " elif rfs[i] == '-':\n", " c = 'I'\n", " elif rds[i] != rfs[i]:\n", " c = 'X'\n", " else:\n", " c = '='\n", " if c == lastc:\n", " cnt += 1\n", " else:\n", " if len(lastc) > 0:\n", " oplist.append((lastc, cnt))\n", " lastc, cnt = c, 1\n", " if len(lastc) > 0:\n", " oplist.append((lastc, cnt))\n", " return ''.join(map(lambda x: str(x[1]) + x[0], oplist))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "x, y = cigarMdzToStacked('ACGTACGT', cigarToList('8M'), mdzToList('4G3'))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'4=1X3='" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cigarize(x, y)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "x, y = cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC',\n", " cigarToList('12M2D10M10N7M2I8M6S3H'),\n", " mdzToList('12^AT25'))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'12=2D10=10N7=2I8=9S'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cigarize(x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other resources\n", "\n", "* [SAM specification]\n", "* [SAMtools] (and associated [publication](http://bioinformatics.oxfordjournals.org/content/25/16/2078.full))\n", "* [pysam] (Python library)\n", "* [Picard] (Java library)\n", "* [Bamtools] (C++ library)\n", "* [Rsamtools] (R library)\n", "\n", "[SAM specification]: http://samtools.sourceforge.net/SAMv1.pdf\n", "[SAMtools]: http://samtools.sourceforge.net/\n", "[pysam]: http://code.google.com/p/pysam/\n", "[Picard]: http://picard.sourceforge.net/\n", "[Bamtools]: http://sourceforge.net/projects/bamtools/\n", "[Rsamtools]: http://www.bioconductor.org/packages/2.14/bioc/html/Rsamtools.html\n", "\n", "© Copyright [Ben Langmead](http://www.cs.jhu.edu/~langmea) 2014" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }