{ "metadata": { "name": "", "signature": "sha256:44f4aa65db43c68a983e74e580aa58377a411be28ccd1dd835f561e38da2afe3" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Defining Sequence Ontology Mappings\n", "\n", "The goal here is to determine whether SNPeff's label means (generally) the same thing as VEP's label and as ANNOVAR's label. Unfortunately, each classifier uses slightly different terms, so we need to normalize the terms across all three tools. There is some hope that all tools will adopt the terms as defined by [*The Sequence Ontology Project*](http://www.sequenceontology.org/), but as of May 2014 we still need to create this mapping.\n", "\n", "First, let's take a look at the terms each tool uses.\n", "\n", "##SNPeff\n", "SNPeff classifies variants by region and effect.\n", "###Region\n", "* NONE\n", "* INTERGENIC\n", "* UPSTREAM\n", "* UTR_5_PRIME\n", "* SPLICE_SITE_ACCEPTOR\n", "* SPLICE_SITE_DONOR\n", "* SPLICE_SITE_REGION\n", "* EXON or NONE\n", "* EXON\n", "* INTRON\n", "* UTR_3_PRIME\n", "* DOWNSTREAM\n", "* REGULATION\n", "\n", "###Effect\n", "* NONE\n", "* CHROMOSOME\n", "* CUSTOM\n", "* CDS\n", "* INTERGENIC\n", "* INTERGENIC_CONSERVED\n", "* UPSTREAM\n", "* UTR_5_PRIME\n", "* UTR_5_DELETED\n", "* START_GAINED\n", "* SPLICE_SITE_ACCEPTOR\n", "* SPLICE_SITE_DONOR\n", "* SPLICE_SITE_REGION\n", "* INTRAGENIC\n", "* START_LOST\n", "* SYNONYMOUS_START\n", "* NON_SYNONYMOUS_START\n", "* GENE\n", "* TRANSCRIPT\n", "* EXON\n", "* EXON_DELETED\n", "* NON_SYNONYMOUS_CODING\n", "* SYNONYMOUS_CODING\n", "* FRAME_SHIFT\n", "* CODON_CHANGE\n", "* CODON_INSERTION\n", "* CODON_CHANGE_PLUS_CODON_INSERTION\n", "* CODON_DELETION\n", "* CODON_CHANGE_PLUS_CODON_DELETION\n", "* STOP_GAINED\n", "* SYNONYMOUS_STOP\n", "* STOP_LOST\n", "* RARE_AMINO_ACID\n", "* INTRON\n", "* INTRON_CONSERVED\n", "* UTR_3_PRIME\n", "* UTR_3_DELETED\n", "* DOWNSTREAM\n", "* REGULATION\n", "\n", "\n", 
"##VEP\n", "VEP uses the Sequence Ontology terms as the SO project defines them.\n", "\n", "* transcript_ablation\n", "* splice_donor_variant\n", "* splice_acceptor_variant\n", "* stop_gained\n", "* frameshift_variant\n", "* stop_lost\n", "* initiator_codon_variant\n", "* inframe_insertion\n", "* inframe_deletion\n", "* missense_variant\n", "* transcript_amplification\n", "* splice_region_variant\n", "* incomplete_terminal_codon_variant\n", "* synonymous_variant\n", "* stop_retained_variant\n", "* coding_sequence_variant\n", "* mature_miRNA_variant\n", "* 5_prime_UTR_variant\n", "* 3_prime_UTR_variant\n", "* non_coding_exon_variant\n", "* nc_transcript_variant\n", "* intron_variant\n", "* NMD_transcript_variant\n", "* upstream_gene_variant\n", "* downstream_gene_variant\n", "* TFBS_ablation\n", "* TFBS_amplification\n", "* TF_binding_site_variant\n", "* regulatory_region_variant\n", "* regulatory_region_ablation\n", "* regulatory_region_amplification\n", "* feature_elongation\n", "* feature_truncation\n", "* intergenic_variant\n", "\n", "##ANNOVAR\n", "ANNOVAR does something similar to SNPeff in that it breaks its classification into a region and an effect.\n", "###Region\n", "* exonic\n", "* splicing\n", "* ncRNA\n", "* UTR5\n", "* UTR3\n", "* intronic\n", "* upstream\n", "* downstream\n", "* intergenic\n", "\n", "###Effect\n", "* frameshift insertion\n", "* frameshift deletion\n", "* frameshift block substitution\n", "* stopgain\n", "* stoploss\n", "* nonframeshift insertion\n", "* nonframeshift deletion\n", "* nonframeshift block substitution\n", "* nonsynonymous SNV\n", "* synonymous SNV\n", "* unknown\n", "\n", "\n", "##Mapping\n", "I can't decide what resolution the buckets should be (i.e., feature type, or just variant category: LOF, Missense, Syn, etc.). Note that SNPeff categorizes variants that change the start/stop codon to another start/stop codon as both synonymous and non-synonymous. Also note that VEP doesn't differentiate between SNPs and mod 3 indels. 
Lastly, the annotation of splice site variants is tricky. ANNOVAR annotates anything within 2bp of a splice site (on either side) as splicing. This is similar to VEP's splice_donor_variant and splice_acceptor_variant, except that under those terms variants on the exonic side of the splice site are not counted. The splice_region_variant term has no analogous match in ANNOVAR and will be ignored in this analysis.\n", "
\n", 
"Normalized SO Name | ANNOVAR | VEP | SNPeff\n", 
"---|---|---|---\n", 
"frameshift_variant | frameshift_deletion, frameshift_insertion, frameshift_block_substitution | frameshift_variant | FRAME_SHIFT\n", 
"stop_gained | stopgain | stop_gained | STOP_GAINED\n", 
"stop_lost | stoploss | stop_lost | STOP_LOST\n", 
"splicing_variant | splicing | splice_donor_variant, splice_acceptor_variant | SPLICE_SITE_DONOR, SPLICE_SITE_ACCEPTOR\n", 
"inframe_variant | nonframeshift_deletion, nonframeshift_insertion | inframe_insertion, inframe_deletion | CODON_INSERTION, CODON_CHANGE_PLUS_CODON_INSERTION, CODON_DELETION, CODON_CHANGE_PLUS_CODON_DELETION\n", 
"nonsynonymous_variant | nonsynonymous_SNV, nonframeshift_block_substitution | initiator_codon_variant, missense_variant, stop_retained_variant, incomplete_terminal_codon_variant | CODON_CHANGE, NON_SYNONYMOUS_CODING, NON_SYNONYMOUS_START, NON_SYNONYMOUS_STOP, START_LOST\n", 
"synonymous_variant | synonymous_SNV | synonymous_variant | SYNONYMOUS_CODING, SYNONYMOUS_START, SYNONYMOUS_STOP\n", 
"3_prime_UTR_variant | UTR3 | 3_prime_UTR_variant | UTR_3_PRIME, UTR_3_DELETED\n", 
"5_prime_UTR_variant | UTR5 | 5_prime_UTR_variant | UTR_5_PRIME, UTR_5_DELETED, START_GAINED\n", 
"upstream_gene_variant | upstream | upstream_gene_variant | UPSTREAM\n", 
"downstream_gene_variant | downstream | downstream_gene_variant | DOWNSTREAM\n", 
"regulatory_region_variant | N/A | regulatory_region_variant, regulatory_region_ablation, regulatory_region_amplification | REGULATION\n", 
"intron_variant | intronic | intron_variant | INTRON, INTRON_CONSERVED\n", 
"intergenic_variant | intergenic | intergenic_variant | INTERGENIC, INTERGENIC_CONSERVED\n", 
"Ignored | unknown, exonic, ncRNA | transcript_ablation, coding_sequence_variant, splice_region_variant, feature_truncation, feature_elongation, TF_binding_site_variant, TFBS_amplification, TFBS_ablation, NMD_transcript_variant, nc_transcript_variant, non_coding_exon_variant, mature_miRNA_variant | EXON, GENE, EXON_DELETED, CDS, CHROMOSOME, SPLICE_SITE_REGION, SPLICE_SITE_BRANCH, SPLICE_SITE_BRANCH_U12, MICRO_RNA, INTRAGENIC\n", "