{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Started with Big Data Genomics\n", "\n", "Big Data Genomics is a collection of schemas, libraries, and command-line utilities meant to standardize and scale the processing of the massive amounts of data generated by next-generation sequencing (NGS). All Big Data Genomics tools use a common representation of NGS entities defined in an Avro schema. This allows for much greater flexibility in terms of the tools and languages which can be used to process genomic data.\n", "\n", "Avro is useful because it specifies both an [object serialization format](http://avro.apache.org/docs/current/spec.html#Data+Serialization) and an [object container file format](http://avro.apache.org/docs/current/spec.html#Object+Container+Files) for storing many objects. You can read and write Avro data from [many languages](https://cwiki.apache.org/confluence/display/AVRO/Supported+Languages), and the file format has been designed for parallel processing by [Apache Hadoop MapReduce](http://hadoop.apache.org/) or [Apache Spark](http://spark.apache.org/). In addition to Avro's row-based container file format, objects conforming to an Avro schema can be stored in a file using the columnar [Parquet container file format](https://github.com/Parquet/parquet-format). Unfortunately, you can only work with Parquet files in Java today, though there are clients under development in other languages.\n", "\n", "### Loading the Schemas\n", "The Big Data Genomics schemas are currently defined in a [single file](https://github.com/bigdatagenomics/adam/blob/master/adam-format/src/main/resources/avro/adam.avdl) written in the [Avro interface description language (IDL)](http://avro.apache.org/docs/current/idl.html). 
In order to be parsed by Python's Avro library, the IDL representation must be transformed to the standard JSON Avro schema representation using [avro-tools](http://avro.apache.org/docs/current/gettingstartedjava.html). \n", "\n", "Once this is done, the schema may be loaded and parsed by the Python library. We want to inspect the types (which will include records) used by Big Data Genomics tools.\n", "\n", "The IDL representation is much easier to read than the JSON representation, so we'll be looking at the IDL while working with the \"raw\" JSON." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import avro.protocol as avpr" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "ADAM_formats = avpr.parse(open(\"adam.avpr\", \"r\").read()) \n", "sorted([(v.type, k) for k, v in ADAM_formats.types_dict.items()])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "[('enum', u'ADAMGenotypeAllele'),\n", " ('enum', u'ADAMGenotypeType'),\n", " ('enum', u'Base'),\n", " (u'record', u'ADAMContig'),\n", " (u'record', u'ADAMDatabaseVariantAnnotation'),\n", " (u'record', u'ADAMGenotype'),\n", " (u'record', u'ADAMNestedPileup'),\n", " (u'record', u'ADAMNucleotideContigFragment'),\n", " (u'record', u'ADAMPileup'),\n", " (u'record', u'ADAMRecord'),\n", " (u'record', u'ADAMVariant'),\n", " (u'record', u'VariantCallingAnnotations'),\n", " (u'record', u'VariantEffect')]" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading and Writing Schemas\n", "\n", "We need to be able to read and write records for a given schema from and to disk. 
We will demonstrate that for an arbitrary record (a mock ADAMContig) below:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from avro.datafile import DataFileReader, DataFileWriter\n", "from avro.io import DatumReader, DatumWriter" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "ADAM_contig_schema = ADAM_formats.types_dict['ADAMContig']\n", "ADAM_contig_schema.to_json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "{'fields': [{'default': None, 'name': u'contigId', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'contigName', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'contigLength', 'type': [u'null', u'long']},\n", " {'default': None, 'name': u'contigMD5', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'referenceURL', 'type': [u'null', u'string']}],\n", " 'name': u'ADAMContig',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'}" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "# this is the Python representation of the Avro object. 
\n", "ADAM_contig = {'contigId': 9230,\n", "               'contigName':\"1Nabc\", \n", "               'contigLength':7781,\n", "               'contigMD5':\"8743b52063cd84097a65d1633f5c74f5\",\n", "               'referenceURL': 'http://data.dna/223'} # mock data" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "with DataFileWriter(open(\"contigs.avro\", \"wb\"), DatumWriter(), ADAM_contig_schema) as contig_writer:\n", "    contig_writer.append(ADAM_contig)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "with DataFileReader(open(\"contigs.avro\", \"rb\"), DatumReader()) as contigs:\n", "    for contig in contigs:\n", "        print contig" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{u'contigMD5': u'8743b52063cd84097a65d1633f5c74f5', u'contigId': 9230, u'contigLength': 7781, u'referenceURL': u'http://data.dna/223', u'contigName': u'1Nabc'}\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Schemas Themselves\n", "We can inspect the JSON schema of a record to see which fields (and their respective types) that record contains.\n", "\n", "We'll write a few records to a new file, using some arbitrary data (but of the correct type), and then load that file and print the Python representations of those records. This is about all there is to do with these records." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAMRecord\n", "\n", "An ADAMRecord stores the alignment information of sequence reads against a reference sequence. Additional metadata are stored alongside that core information, all of which can be found annotated below. It is the ADAM equivalent to a record in a SAM/BAM file (but much easier to read).\n", "\n", "Below is a real example of an ADAMRecord, exported to JSON. 
This was generated by converting an existing BAM file to ADAM format (e.g. `adam bam2adam hg37.bam hg37.bam.adam`), and then printing it (into JSON) with e.g. `adam print hg37.bam.adam`.\n", "\n", "A Record, like a BAM file, stores a read sequence with quality data (ASCII+33 + [PHRED score](http://en.wikipedia.org/wiki/Phred_quality_score)), along with alignment information stored in extended CIGAR format. More information on all fields can be found in the comments on the IDL representation below.\n", "\n", "Additionally, important metainformation is parsed out of the BAM files and stored as first-class key-vals. Nonstandard (as well as many standard tags, still) additional attributes are stored in a string under `attributes` in the same format as found in SAM files." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ADAM_record = {\"referenceName\": \"20\", \"referenceId\": 19, \"start\": 19893804, \"mapq\": 60, \n", " \"readName\": \"20GAVAAXX100126:4:64:6132:191287\", \n", " \"sequence\": \"GTTTTCTATGAAGTTATTTTCTAGGGATTCTGTTTTGTTGTCGTTGTTCACACTGTAGCTCTCAGATCTTACTGTTTTTTTTTTAATTGTGATAAAGCATA\", \n", " \"mateReference\": \"20\", \"mateAlignmentStart\": 19893476, \"cigar\": \"101M\", \n", " \"qual\": \"EHHHFGDGHFHF8EEFB=B@=GHFFBAA@8??>IHHEIHEH>EHH@HFG>GEGHEGHFAHHHHHHHHHHHHFHGHGHGHHHHHHF@HFHHHHHHHHHGGGGGGGHHHHHHHHHHHHHHHHHHHGHHH\\tMQ:i:60\\tBQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\\tXO:i:0\\tXM:i:0\\tSM:i:37\\tNM:i:0\\tAM:i:37\\tXG:i:0\\tRG:Z:20GAV.4\\tX1:i:0\\tX0:i:1\", \n", " \"recordGroupSequencingCenter\": \"BI\", \"recordGroupDescription\": None, \"recordGroupRunDateEpoch\": None, \n", " \"recordGroupFlowOrder\": None, \"recordGroupKeySequence\": None, \"recordGroupLibrary\": \"Solexa-18484\", \n", " \"recordGroupPredictedMedianInsertSize\": None, \"recordGroupPlatform\": \"illumina\", \"recordGroupPlatformUnit\": \"20GAVAAXX100126.4\", \n", " \"recordGroupSample\": \"NA12878\", \n", " 
\"mateReferenceId\": 19, \"referenceLength\": 63025520, \"referenceUrl\": None, \n", " \"mateReferenceLength\": 63025520, \"mateReferenceUrl\": None, \"origQual\": None}\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is the record from the SAM file that is represented above:" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "20GAVAAXX100126:4:64:6132:191287 83 20 19893805 60 101M = 19893477 -428 GTTTTCTATGAAGTTATTTTCTAGGGATTCTGTTTTGTTGTCGTTGTTCACACTGTAGCTCTCAGATCTTACTGTTTTTTTTTTAATTGTGATAAAGCATA \n", "EHHHFGDGHFHF8EEFB=B@=GHFFBAA@8??>IHHEIHEH>EHH@HFG>GEGHEGHFA\n", "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GENOTYPE\n", "chr17 7621777 rs1544724 T G . . . GT 0/1\n", "chr1 82154 rs4477212 a . . . . GT 1/1\n", "chr1 752566 rs3094315 g A . . . GT 0/1\n", "chr1 752721 rs3131972 A G . . . GT 0/0\n", "chr1 776546 rs12124819 A . . . . GT 0/0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```c\n", "record ADAMGenotype {\n", " // Information regarding the actual variant (c.f. ADAMVariant for more information)\n", " union { null, ADAMVariant } variant; \n", " // c.f. VariantCallingAnnocations for more information\n", " union { null, VariantCallingAnnotations } variantCallingAnnotations = null;\n", "\n", " // This is the actual name of the sample the genotype is associated with (taken from the VCF header line).\n", " union { null, string } sampleId = null;\n", " // A description, if any, associated with the sample (this would be from the VCF metainformation).\n", " union { null, string } sampleDescription = null;\n", " // Optional information regarding the processing of the VCF, taken from the file's metainformation.\n", " union { null, string } processingDescription = null;\n", "\n", " // The actual genotype of the sample at the location; this is the call itself.\n", " // This correponds with the information in the Sample column in the VCF. 
It could look like, for example, \n", " // [\"REF\", \"ALT\"] if one chromosome's allele corresponds to the reference's, and the other to the alternate's.\n", " // Note: Length is equal to the ploidy. Values: \"REF\", \"ALT\", \"NOCALL\". \n", " // Note too that, unlike in the VCF format, in the sample columns,\n", " // where one line/record can contain multiple alternate alleles, ADAMGenotype corresponds to exactly one alternate\n", " // and thus we do not need to refer to the specific alternate allele being called. \n", " array alleles = null;\n", "\n", " //////////////////////////////////////////////////////////////////\n", " // Information optionally encoded in the VCF's Samples columns //\n", " //////////////////////////////////////////////////////////////////\n", " // How many reads consider this allele to be the reference.\n", " union { null, int } referenceReadDepth = null;\n", " // How many reads consider this allele to be the alternate. \n", " union { null, int } alternateReadDepth = null;\n", " // How many total reads at this position. Corresponds to the DP tag in VCF.\n", " union { null, int } readDepth = null;\n", " // The phred-scaled probability that we're correct for this genotype call.\n", " union { null, int } genotypeQuality = null;\n", "\n", " // Phred-scaled scores for the called genotypes. (Length 3)\n", " array genotypeLikelihoods = null;\n", "\n", " // (Not sure what this is, does not appear to be used in ADAM. \n", " // Looks like it could be the expected allele counts (c.f. http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml).)\n", " union { null, float } expectedAlleleDosage = null;\n", "\n", " // Number of reads mapped at site on forward strand (Not sure what this means, where this is taken from. \n", " // Is not used in ADAM.)\n", " union { null, int } readsMappedForwardStrand = null;\n", "\n", " // In the ADAM world we split multiallelic VCF lines into multiple\n", " // single-alternate records. 
This bit is set if that happened for this record.\n", " boolean splitFromMultiAllelic = false;\n", " \n", " // Whether this is a phased genotype. A phased genotype means that we know which allele belongs to which\n", " // strand of the chromosome. (This information is encoded in the VCF's sample column.)\n", " union { null, boolean } isPhased = null;\n", " // And if so, what is the phase ID.\n", " union { null, int } phaseSetId = null;\n", " \n", " // The quality of the phasing. (This isn't precisely defined in v4.2 of the spec.)\n", " union { null, int } phaseQuality = null;\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAMVariant\n", "\n", "An ADAMVariant is used with ADAMGenotype to denote the actual variant and reference allele of the genotype, as well as the alignment information associated with it. \n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "variant = {\"contig\": {\"contigId\": 0, \"contigName\": \"chr17\", \"contigLength\": None, \n", " \"contigMD5\": None, \"referenceURL\": None}, \n", " \"position\": 7621776, \"referenceAllele\": \"T\", \"variantAllele\": \"G\"}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "ADAM_variant_schema = ADAM_formats.types_dict['ADAMVariant']\n", "ADAM_variant_schema.to_json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "{'fields': [{'default': None,\n", " 'name': u'contig',\n", " 'type': [u'null',\n", " {'fields': [{'default': None,\n", " 'name': u'contigId',\n", " 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'contigName', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'contigLength', 'type': [u'null', u'long']},\n", " {'default': None, 'name': u'contigMD5', 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'referenceURL',\n", " 'type': [u'null', 
u'string']}],\n", " 'name': u'ADAMContig',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'}]},\n", " {'default': None, 'name': u'position', 'type': [u'null', u'long']},\n", " {'name': u'referenceAllele', 'type': u'string'},\n", " {'name': u'variantAllele', 'type': u'string'}],\n", " 'name': u'ADAMVariant',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'}" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "```c\n", "record ADAMVariant {\n", " union { null, ADAMContig } contig = null;\n", " // The position in the reference sequence this variant is located.\n", " union { null, long } position = null;\n", " // The reference allele at that position.\n", " string referenceAllele;\n", " // The alternate allele being considered in this variant.\n", " string variantAllele;\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### VariantCallingAnnotations\n", "\n", "This record represents all stats that, inside a VCF, are stored outside of the sample but are computed based on the samples. For instance, MAPQ0 is an aggregate stat computed from all samples and stored inside the INFO line." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "variant_calling_annotation = {\n", " \"readDepth\": None,\n", " \"downsampled\": None,\n", " \"baseQRankSum\": None,\n", " \"clippingRankSum\": None, \n", " \"fisherStrandBiasPValue\": None,\n", " \"haplotypeScore\": None,\n", " \"inbreedingCoefficient\": None,\n", " \"alleleCountMLE\": [], \n", " \"alleleFrequencyMLE\": [],\n", " \"rmsMapQ\": None,\n", " \"mapq0Reads\": None,\n", " \"mqRankSum\": None, \n", " \"usedForNegativeTrainingSet\": None,\n", " \"usedForPositiveTrainingSet\": None,\n", " \"variantQualityByDepth\": None, \n", " \"readPositionRankSum\": None,\n", " \"vqslod\": None, \"culprit\": None, \n", " \"variantCallErrorProbability\": None,\n", " \"variantIsPassing\": True,\n", " \"variantFilters\": [],\n", "} " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "ADAM_variant_calling_annotations_schema = ADAM_formats.types_dict['VariantCallingAnnotations']\n", "ADAM_variant_calling_annotations_schema.to_json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "{'fields': [{'default': None, 'name': u'readDepth', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'downsampled', 'type': [u'null', u'boolean']},\n", " {'default': None, 'name': u'baseQRankSum', 'type': [u'null', u'float']},\n", " {'default': None, 'name': u'clippingRankSum', 'type': [u'null', u'float']},\n", " {'default': None,\n", " 'name': u'fisherStrandBiasPValue',\n", " 'type': [u'null', u'float']},\n", " {'default': None, 'name': u'haplotypeScore', 'type': [u'null', u'float']},\n", " {'default': None,\n", " 'name': u'inbreedingCoefficient',\n", " 'type': [u'null', u'float']},\n", " {'default': None,\n", " 'name': u'alleleCountMLE',\n", " 'type': {'items': u'int', 'type': 'array'}},\n", " {'default': None,\n", " 'name': u'alleleFrequencyMLE',\n", " 'type': 
{'items': u'int', 'type': 'array'}},\n", " {'default': None, 'name': u'rmsMapQ', 'type': [u'null', u'float']},\n", " {'default': None, 'name': u'mapq0Reads', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'mqRankSum', 'type': [u'null', u'float']},\n", " {'default': None,\n", " 'name': u'usedForNegativeTrainingSet',\n", " 'type': [u'null', u'boolean']},\n", " {'default': None,\n", " 'name': u'usedForPositiveTrainingSet',\n", " 'type': [u'null', u'boolean']},\n", " {'default': None,\n", " 'name': u'variantQualityByDepth',\n", " 'type': [u'null', u'float']},\n", " {'default': None,\n", " 'name': u'readPositionRankSum',\n", " 'type': [u'null', u'float']},\n", " {'default': None, 'name': u'vqslod', 'type': [u'null', u'float']},\n", " {'default': None, 'name': u'culprit', 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'variantCallErrorProbability',\n", " 'type': [u'null', u'float']},\n", " {'default': True, 'name': u'variantIsPassing', 'type': u'boolean'},\n", " {'default': None,\n", " 'name': u'variantFilters',\n", " 'type': {'items': u'string', 'type': 'array'}}],\n", " 'name': u'VariantCallingAnnotations',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'}" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "```c\n", "record VariantCallingAnnotations {\n", " union { null, int } readDepth = null;\n", " // Was this downsampled?\n", " union { null, boolean } downsampled = null;\n", "\n", " // Base quality rank sum. 
\n", " union { null, float } baseQRankSum = null;\n", " union { null, float } clippingRankSum = null;\n", " union { null, float } fisherStrandBiasPValue = null; // Phred-scaled.\n", " union { null, float } haplotypeScore = null;\n", " union { null, float } inbreedingCoefficient = null;\n", " array alleleCountMLE = null;\n", " array alleleFrequencyMLE = null;\n", " union { null, float } rmsMapQ = null;\n", " union { null, int } mapq0Reads = null;\n", " union { null, float } mqRankSum = null;\n", " union { null, boolean } usedForNegativeTrainingSet = null;\n", " union { null, boolean } usedForPositiveTrainingSet = null;\n", " union { null, float } variantQualityByDepth = null;\n", " union { null, float } readPositionRankSum = null;\n", " // Log-odds ratio of being a true vs false variant under trained\n", " // Gaussian mixture model.\n", " union { null, float } vqslod = null;\n", " union { null, string } culprit = null;\n", " // Phred-scaled probability of error for this variant call.\n", " union { null, float } variantCallErrorProbability = null;\n", " // True implies either filters were applied and the variant passed\n", " // those filters, or no filters were applied. False implies filters\n", " // were applied and the variant did not pass.\n", " boolean variantIsPassing = true;\n", " // A list of filters applied.\n", " array variantFilters = null;\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAMNucleotideContigFragment\n", "\n", "Next we'll look at an ADAMNucleotideContigFragment, which stores a [contig](http://en.wikipedia.org/wiki/Contig) of nucleotides; this may be a reference chromosome, an assembly, or a BAC. 
\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ADAM_nucleotide_contig_fragment_schema = ADAM_formats.types_dict['ADAMNucleotideContigFragment']\n", "ADAM_nucleotide_contig_fragment_schema.to_json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "{'fields': [{'default': None,\n", " 'name': u'contigName',\n", " 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'contigId', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'description', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'url', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'fragmentSequence', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'contigLength', 'type': [u'null', u'long']},\n", " {'default': None, 'name': u'fragmentNumber', 'type': [u'null', u'int']},\n", " {'default': None,\n", " 'name': u'fragmentStartPosition',\n", " 'type': [u'null', u'long']},\n", " {'default': None,\n", " 'name': u'numberOfFragmentsInContig',\n", " 'type': [u'null', u'int']}],\n", " 'name': u'ADAMNucleotideContigFragment',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'}" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, the below is a real contig from the human reference genome (with a bit of the sequence itself elided for sanity)." 
] }, { "cell_type": "raw", "metadata": {}, "source": [ "contigName = 20\n", "contigId = 0\n", "fragmentSequence = TTCG\u2026\u2026\u2026\u2026\u2026AACCGGCTCGA\n", "contigLength = 20000000\n", "fragmentNumber = 4\n", "fragmentStartPosition = 40000\n", "numberOfFragmentsInContig = 2000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```c\n", "record ADAMNucleotideContigFragment {\n", " union { null, string } contigName = null;\n", " union { null, int } contigId = null;\n", " union { null, string } description = null;\n", " union { null, string } url = null; \n", " union { null, string } fragmentSequence = null; // sequence of bases in this fragment\n", " union { null, long } contigLength = null; // length of the total contig (all fragments)\n", " union { null, int } fragmentNumber = null; // ordered number for this fragment\n", " union { null, long } fragmentStartPosition = null; // position of first base of fragment in contig\n", " union { null, int } numberOfFragmentsInContig = null; // total number of fragments in contig\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAMContig\n", "\n", "The ADAMContig record is used to describe a region of the sequence being considered." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```c\n", "record ADAMContig {\n", " union { null, int } contigId = null;\n", " union { null, string } contigName = null;\n", " union { null, long } contigLength = null;\n", " union { null, string } contigMD5 = null;\n", " union { null, string } referenceURL = null;\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAMPileup\n", "\n", "ADAMPileup summarizes the base calls of aligned reads against a reference sequence, at varying coverage. This is similar to the Pileup format used in the SAMTools suite, and is generally used in alignment and for visual inspection of alignment. 
\n", "\n", "ADAM can print alignment information in pileup format with the command `mpileup`. A very abridged sample of output is found below, and more information on the format can be found in this [SAMTools documentation page on mpileup](http://samtools.sourceforge.net/mpileup.shtml)." ] }, { "cell_type": "raw", "metadata": {}, "source": [ "20 9999944 A 35 ,,,....,.,,..,,...,.,,,,,,,.,,.,...\n", "20 9999945 C 37 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,\n", "20 9999946 T 37 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,\n", "20 9999947 C 38 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,\n", "20 9999948 T 40 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,\n", "20 9999949 T 41 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,.\n", "20 9999950 A 42 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,..\n", "20 9999951 G 43 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,...\n", "20 9999952 T 46 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,.....," ] }, { "cell_type": "code", "collapsed": false, "input": [ "ADAM_pileup_schema = ADAM_formats.types_dict['ADAMPileup']\n", "ADAM_pileup_schema.to_json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "{'fields': [{'default': None,\n", " 'name': u'referenceName',\n", " 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'referenceId', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'position', 'type': [u'null', u'long']},\n", " {'default': None, 'name': u'rangeOffset', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'rangeLength', 'type': [u'null', u'int']},\n", " {'default': None,\n", " 'name': u'referenceBase',\n", " 'type': [u'null',\n", " {'name': u'Base',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'symbols': [u'A',\n", " u'C',\n", " u'T',\n", " u'G',\n", " u'U',\n", " u'N',\n", " u'X',\n", " u'K',\n", " u'M',\n", " u'R',\n", " u'Y',\n", " u'S',\n", " u'W',\n", " u'B',\n", " u'V',\n", " u'H',\n", " u'D'],\n", " 'type': 'enum'}]},\n", " {'default': 
None,\n", " 'name': u'readBase',\n", " 'type': [u'null', u'org.bdgenomics.adam.avro.Base']},\n", " {'default': None, 'name': u'sangerQuality', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'mapQuality', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'numSoftClipped', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'numReverseStrand', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'countAtPosition', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'readName', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'readStart', 'type': [u'null', u'long']},\n", " {'default': None, 'name': u'readEnd', 'type': [u'null', u'long']},\n", " {'default': None,\n", " 'name': u'recordGroupSequencingCenter',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupDescription',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupRunDateEpoch',\n", " 'type': [u'null', u'long']},\n", " {'default': None,\n", " 'name': u'recordGroupFlowOrder',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupKeySequence',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupLibrary',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupPredictedMedianInsertSize',\n", " 'type': [u'null', u'int']},\n", " {'default': None,\n", " 'name': u'recordGroupPlatform',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupPlatformUnit',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupSample',\n", " 'type': [u'null', u'string']}],\n", " 'name': u'ADAMPileup',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'}" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "```c\n", "record ADAMPileup {\n", " union { null, string } referenceName = 
null;\n", " union { null, int } referenceId = null;\n", " union { null, long } position = null;\n", " union { null, int } rangeOffset = null;\n", " union { null, int } rangeLength = null;\n", " union { null, Base } referenceBase = null;\n", " union { null, Base } readBase = null;\n", " union { null, int } sangerQuality = null;\n", " union { null, int } mapQuality = null;\n", " union { null, int } numSoftClipped = null;\n", " union { null, int } numReverseStrand = null;\n", " union { null, int } countAtPosition = null;\n", "\n", " union { null, string } readName = null;\n", " union { null, long } readStart = null;\n", " union { null, long } readEnd = null;\n", "\n", " // record group identifier from sequencing run\n", " union { null, string } recordGroupSequencingCenter = null;\n", " union { null, string } recordGroupDescription = null;\n", " union { null, long } recordGroupRunDateEpoch = null;\n", " union { null, string } recordGroupFlowOrder = null;\n", " union { null, string } recordGroupKeySequence = null;\n", " union { null, string } recordGroupLibrary = null;\n", " union { null, int } recordGroupPredictedMedianInsertSize = null;\n", " union { null, string } recordGroupPlatform = null;\n", " union { null, string } recordGroupPlatformUnit = null;\n", " union { null, string } recordGroupSample = null;\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### VariantEffect\n", "\n", "VariantEffect denotes the effect that a variant allele has on a given gene. It notes the change, if any, in amino acid for a given codon's mutation." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "ADAM_variant_effect_schema = ADAM_formats.types_dict['VariantEffect']\n", "ADAM_variant_effect_schema.to_json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "{'fields': [{'default': None, 'name': u'hgvs', 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'referenceAminoAcid',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'alternateAminoAcid',\n", " 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'geneId', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'transcriptId', 'type': [u'null', u'string']}],\n", " 'name': u'VariantEffect',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'}" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "```c\n", "record VariantEffect {\n", " union { null, string} hgvs = null;\n", " union { null, string } referenceAminoAcid = null;\n", " union { null, string } alternateAminoAcid = null;\n", " union {null, string} geneId = null;\n", " union {null, string} transcriptId = null;\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAMDatabaseVariantAnnotation\n", "\n", "This record documents the significance of a given allele." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```c\n", "record ADAMDatabaseVariantAnnotation {\n", " union { null, ADAMVariant } variant;\n", " union { null, int } dbSnpId = null;\n", "\n", " // Domain information\n", " union {null, string} geneSymbol = null;\n", "\n", " // Clinical fields\n", " union {null, string} omimId = null;\n", " union {null, string} cosmicId = null;\n", " union {null, string} clinvarId = null;\n", " union {null, string} clinicalSignificance = null;\n", "\n", " // Conservation\n", " union { null, string } gerpNr = null;\n", " union { null, string } gerpRs = null;\n", " union { null, float } phylop = null;\n", " union { null, string } ancestralAllele = null;\n", "\n", " // Population statistics\n", " union {null, int} thousandGenomesAlleleCount = null;\n", " union {null, float} thousandGenomesAlleleFrequency = null;\n", "\n", " // Effect of the variant\n", " //array effects = null;\n", "\n", " // Predicted effects\n", " union { null, float } siftScore = null;\n", " union { null, float } siftScoreConverted = null;\n", " union { null, string } siftPred = null;\n", "\n", " union { null, float } mutationTasterScore = null;\n", " union { null, float } mutationTasterScoreConverted = null;\n", " union { null, string } mutationTasterPred = null;\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAMNestedPileup\n", "\n", "A nested pileup data type: it contains a reference to the list of overlapping records associated with an ADAMPileup." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "ADAM_nested_pileup_schema = ADAM_formats.types_dict['ADAMNestedPileup']\n", "ADAM_nested_pileup_schema.to_json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "{'fields': [{'name': u'pileup',\n", " 'type': {'fields': [{'default': None,\n", " 'name': u'referenceName',\n", " 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'referenceId', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'position', 'type': [u'null', u'long']},\n", " {'default': None, 'name': u'rangeOffset', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'rangeLength', 'type': [u'null', u'int']},\n", " {'default': None,\n", " 'name': u'referenceBase',\n", " 'type': [u'null',\n", " {'name': u'Base',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'symbols': [u'A',\n", " u'C',\n", " u'T',\n", " u'G',\n", " u'U',\n", " u'N',\n", " u'X',\n", " u'K',\n", " u'M',\n", " u'R',\n", " u'Y',\n", " u'S',\n", " u'W',\n", " u'B',\n", " u'V',\n", " u'H',\n", " u'D'],\n", " 'type': 'enum'}]},\n", " {'default': None,\n", " 'name': u'readBase',\n", " 'type': [u'null', u'org.bdgenomics.adam.avro.Base']},\n", " {'default': None, 'name': u'sangerQuality', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'mapQuality', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'numSoftClipped', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'numReverseStrand', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'countAtPosition', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'readName', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'readStart', 'type': [u'null', u'long']},\n", " {'default': None, 'name': u'readEnd', 'type': [u'null', u'long']},\n", " {'default': None,\n", " 'name': u'recordGroupSequencingCenter',\n", " 'type': [u'null', u'string']},\n", " {'default': 
None,\n", " 'name': u'recordGroupDescription',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupRunDateEpoch',\n", " 'type': [u'null', u'long']},\n", " {'default': None,\n", " 'name': u'recordGroupFlowOrder',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupKeySequence',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupLibrary',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupPredictedMedianInsertSize',\n", " 'type': [u'null', u'int']},\n", " {'default': None,\n", " 'name': u'recordGroupPlatform',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupPlatformUnit',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupSample',\n", " 'type': [u'null', u'string']}],\n", " 'name': u'ADAMPileup',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'}},\n", " {'name': u'readEvidence',\n", " 'type': {'items': {'fields': [{'default': None,\n", " 'doc': u'* These two fields, along with the two\\n * reference{Length, Url} fields at the bottom\\n * of the schema, collectively form the contents\\n * of the Sequence Dictionary embedded in the these\\n * records from the BAM / SAM itself.\\n * TODO: this should be moved to ADAMContig',\n", " 'name': u'referenceName',\n", " 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'referenceId', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'start', 'type': [u'null', u'long']},\n", " {'default': None, 'name': u'mapq', 'type': [u'null', u'int']},\n", " {'default': None, 'name': u'readName', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'sequence', 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'mateReference',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'mateAlignmentStart',\n", " 'type': [u'null', 
u'long']},\n", " {'default': None, 'name': u'cigar', 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'qual', 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupName',\n", " 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'recordGroupId', 'type': [u'null', u'int']},\n", " {'default': False, 'name': u'readPaired', 'type': [u'boolean', u'null']},\n", " {'default': False, 'name': u'properPair', 'type': [u'boolean', u'null']},\n", " {'default': False, 'name': u'readMapped', 'type': [u'boolean', u'null']},\n", " {'default': False, 'name': u'mateMapped', 'type': [u'boolean', u'null']},\n", " {'default': False,\n", " 'name': u'readNegativeStrand',\n", " 'type': [u'boolean', u'null']},\n", " {'default': False,\n", " 'name': u'mateNegativeStrand',\n", " 'type': [u'boolean', u'null']},\n", " {'default': False,\n", " 'name': u'firstOfPair',\n", " 'type': [u'boolean', u'null']},\n", " {'default': False,\n", " 'name': u'secondOfPair',\n", " 'type': [u'boolean', u'null']},\n", " {'default': False,\n", " 'name': u'primaryAlignment',\n", " 'type': [u'boolean', u'null']},\n", " {'default': False,\n", " 'name': u'failedVendorQualityChecks',\n", " 'type': [u'boolean', u'null']},\n", " {'default': False,\n", " 'name': u'duplicateRead',\n", " 'type': [u'boolean', u'null']},\n", " {'default': None,\n", " 'name': u'mismatchingPositions',\n", " 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'attributes', 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupSequencingCenter',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupDescription',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupRunDateEpoch',\n", " 'type': [u'null', u'long']},\n", " {'default': None,\n", " 'name': u'recordGroupFlowOrder',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupKeySequence',\n", " 
'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupLibrary',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupPredictedMedianInsertSize',\n", " 'type': [u'null', u'int']},\n", " {'default': None,\n", " 'name': u'recordGroupPlatform',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupPlatformUnit',\n", " 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'recordGroupSample',\n", " 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'mateReferenceId', 'type': [u'null', u'int']},\n", " {'default': None,\n", " 'name': u'referenceLength',\n", " 'type': [u'null', u'long']},\n", " {'default': None, 'name': u'referenceUrl', 'type': [u'null', u'string']},\n", " {'default': None,\n", " 'name': u'mateReferenceLength',\n", " 'type': [u'null', u'long']},\n", " {'default': None,\n", " 'name': u'mateReferenceUrl',\n", " 'type': [u'null', u'string']},\n", " {'default': None, 'name': u'origQual', 'type': [u'null', u'string']}],\n", " 'name': u'ADAMRecord',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'},\n", " 'type': 'array'}}],\n", " 'name': u'ADAMNestedPileup',\n", " 'namespace': u'org.bdgenomics.adam.avro',\n", " 'type': u'record'}" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "```c\n", "record ADAMNestedPileup {\n", " ADAMPileup pileup;\n", " array<ADAMRecord> readEvidence;\n", "}\n", "```" ] } ], "metadata": {} } ] }