ensembl-glossary definition hasDbXref related A genomic locus that has been annotated. SO:0000001 element Feature Genome annotation Genomic locus where transcription occurs. A gene may have one or more transcripts, which may or may not encode proteins. SO:0000704 https://en.wikipedia.org/wiki/Gene Gene A transcript is the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons being separated by introns. The exons/introns are transcribed and then the introns spliced out. Transcripts may or may not encode a protein SO:0000673 isoform Splice variant Transcript Expressed Sequence Tag. Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs. Usually short, single reads from a tissue or stage in development. Expressed sequence tag EST http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html The Transcript Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users, based on the type and quality of the alignments used to annotate the transcript. TSL Transcript support level http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html A transcript where all splice junctions are supported by at least one non-suspect mRNA. 
Transcript support level 1 TSL 1 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html A transcript where the best supporting mRNA is flagged as suspect or the support is from multiple ESTs Transcript support level 2 TSL 2 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html A transcript where the only support is from a single EST Transcript support level 3 TSL 3 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html A transcript where the best supporting EST is flagged as suspect Transcript support level 4 TSL 4 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html A transcript where no single transcript supports the model structure. Transcript support level 5 TSL 5 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html A transcript that was not analysed for TSL. Transcript support level not applicable TSL NA http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods to identify the most functionally important transcript(s) of a gene. APPRIS http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html Transcript(s) expected to code for the main functional isoform based solely on the core modules in the APPRIS. APPRIS Principal 1 APPRIS P1 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. 
APPRIS Principal 2 APPRIS P2 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. APPRIS Principal 3 APPRIS P3 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. APPRIS Principal 4 APPRIS P4 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. APPRIS Principal 5 APPRIS P5 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html For genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT1 denotes the candidate transcript model(s) conserved in at least three tested species. APPRIS Alternative 1 APPRIS ALT1 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html For genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT2 denotes the candidate transcript model(s) that appear to be conserved in fewer than three tested species. APPRIS Alternative 2 APPRIS ALT2 http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html A subset of the GENCODE transcript set, containing only 5' and 3' complete transcripts. 
GENCODE Basic http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html A protein-coding transcript which is missing the start codon due to incomplete evidence. five prime incomplete 5' incomplete http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html A protein-coding transcript which is missing the stop codon due to incomplete evidence. three prime incomplete 3' incomplete http://www.ensembl.org/info/genome/genebuild/canonical.html A single transcript chosen for a gene which is the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt. This is defined in detail on http://www.ensembl.org/info/genome/genebuild/canonical.html canonical transcript Canonical Ensembl canonical http://www.ensembl.org/info/genome/genebuild/ccds.html A coding sequence in the Consensus Coding Sequence Set is consistently annotated between Ensembl, MGI, HGNC and NCBI. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. https://en.wikipedia.org/wiki/Consensus_CDS_Project Consensus Coding Sequence Consensus CDS CCDS http://www.ensembl.org/info/genome/genebuild/biotypes.html A gene or transcript classification. Gene type Transcript type Type Biotype http://www.ensembl.org/info/genome/genebuild/biotypes.html Gene/transcript that contains an open reading frame (ORF). SO:0001217 https://en.wikipedia.org/wiki/Gene Coding Protein coding http://www.ensembl.org/info/genome/genebuild/biotypes.html Gene/transcript that doesn't contain an open reading frame (ORF). 
https://en.wikipedia.org/wiki/Non-coding_RNA Processed transcript http://www.ensembl.org/info/genome/genebuild/biotypes.html A non-coding gene/transcript >200bp in length SO:0002127 https://en.wikipedia.org/wiki/Long_non-coding_RNA Long non-coding RNA lncRNA Long non-coding RNA (lncRNA) http://www.ensembl.org/info/genome/genebuild/biotypes.html Transcripts which are known from the literature to not be protein coding. SO:0001263 https://en.wikipedia.org/wiki/Non-coding_RNA Non coding http://www.ensembl.org/info/genome/genebuild/biotypes.html Transcripts where ditag and/or published experimental data strongly supports the existence of long (>200bp) non-coding transcripts that overlap the 3'UTR of a protein-coding locus on the same strand. three prime overlapping ncRNA 3' overlapping ncRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html Transcripts that overlap the genomic span (i.e. exons or introns) of a protein-coding locus on the opposite strand. https://en.wikipedia.org/wiki/Antisense_RNA Antisense http://www.ensembl.org/info/genome/genebuild/biotypes.html Transcripts from a long intergenic non-coding RNA locus, with a length >200bp. Requires lack of coding potential and may not be conserved between species. SO:0001641 https://en.wikipedia.org/wiki/Long_non-coding_RNA#Long_intergenic_non-coding_RNAs_(lincRNA) Long intergenic RNA Long interspersed RNA lincRNA lincRNA (long intergenic ncRNA) http://www.ensembl.org/info/genome/genebuild/biotypes.html An alternatively spliced transcript believed to contain intronic sequence relative to other, coding, transcripts of the same gene. Retained intron http://www.ensembl.org/info/genome/genebuild/biotypes.html A long non-coding transcript in introns of a coding gene that does not overlap any exons. SO:0002184 Sense intronic http://www.ensembl.org/info/genome/genebuild/biotypes.html A long non-coding transcript that contains a coding gene in its intron on the same strand. 
SO:0002183 Sense overlapping http://www.ensembl.org/info/genome/genebuild/biotypes.html Unspliced lncRNAs that are several kb in size. Macro lncRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html A non-coding gene. ncRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html A small RNA (~22bp) that silences the expression of target mRNA. SO:0001265 https://en.wikipedia.org/wiki/MicroRNA micro RNA miRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html An RNA that interacts with piwi proteins involved in genetic silencing. SO:0001638 https://en.wikipedia.org/wiki/Piwi-interacting_RNA piwi-interacting RNA piRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html The RNA component of a ribosome. SO:0001637 https://en.wikipedia.org/wiki/Ribosomal_RNA ribosomal RNA rRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html A small RNA (20-25bp) that silences the expression of target mRNA through the RNAi pathway. https://en.wikipedia.org/wiki/Small_interfering_RNA short interfering RNA silencing RNA small interfering RNA siRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html Small RNA molecules that are found in the cell nucleus and are involved in the processing of pre messenger RNAs https://en.wikipedia.org/wiki/Small_nuclear_RNA U-RNA small nuclear RNA snRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html Small RNA molecules that are found in the cell nucleolus and are involved in the post-transcriptional modification of other RNAs. SO:0001272 https://en.wikipedia.org/wiki/Small_nucleolar_RNA small nucleolar RNA snoRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html A transfer RNA, which acts as an adaptor molecule for translation of mRNA. SO:0001272 https://en.wikipedia.org/wiki/Transfer_RNA transfer RNA tRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html Short non coding RNA genes that form part of the vault ribonucleoprotein complex. 
SO:0000404 https://en.wikipedia.org/wiki/Vault_RNA vaultRNA Miscellaneous RNA. A non-coding RNA that cannot be classified. miscRNA http://www.ensembl.org/info/genome/genebuild/biotypes.html A gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) that disrupt the ORF. Thought to have arisen through duplication followed by loss of function. SO:0000336 https://en.wikipedia.org/wiki/Pseudogene Pseudogene http://www.ensembl.org/info/genome/genebuild/biotypes.html Pseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome. SO:0000043 https://en.wikipedia.org/wiki/Pseudogene#Processed Processed pseudogene http://www.ensembl.org/info/genome/genebuild/biotypes.html Pseudogene that can contain introns since it is produced by gene duplication. SO:0001760 https://en.wikipedia.org/wiki/Pseudogene#Non-processed Non-processed pseudogene Unprocessed pseudogene http://www.ensembl.org/info/genome/genebuild/biotypes.html Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'. SO:0002109, SO:0002107, SO:0002108 Transcribed pseudogene http://www.ensembl.org/info/genome/genebuild/biotypes.html Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed' and 'Unprocessed'. SO:0002105, SO:0002106 Translated pseudogene http://www.ensembl.org/info/genome/genebuild/biotypes.html Pseudogene owing to a SNP/indel but in other individuals/haplotypes/strains the gene is translated. Polymorphic pseudogene http://www.ensembl.org/info/genome/genebuild/biotypes.html A species specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species. 
SO:0001759 https://en.wikipedia.org/wiki/Pseudogene#Unitary_pseudogenes Unitary pseudogene http://www.ensembl.org/info/genome/genebuild/biotypes.html Inactivated immunoglobulin gene. IG pseudogene http://www.ensembl.org/info/genome/genebuild/biotypes.html, http://www.ensembl.org/info/genome/genebuild/ig_tcr.html Immunoglobulin gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/. SO:0002122 https://en.wikipedia.org/wiki/V(D)J_recombination Immunoglobulin gene IG gene http://www.ensembl.org/info/genome/genebuild/biotypes.html, http://www.ensembl.org/info/genome/genebuild/ig_tcr.html T cell receptor gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/. SO:0002133 https://en.wikipedia.org/wiki/V(D)J_recombination T cell receptor gene TR gene http://www.ensembl.org/info/genome/genebuild/biotypes.html Regions with EST clusters that have polyA features that could indicate the presence of protein coding genes. These require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies. TEC (To be Experimentally Confirmed) http://www.ensembl.org/info/genome/genebuild/biotypes.html A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs). 
read-through Readthrough http://www.ensembl.org/info/genome/genebuild/biotypes.html Variable chain immunoglobulin gene that undergoes somatic recombination before transcription SO:0002126 https://en.wikipedia.org/wiki/V(D)J_recombination IG V gene http://www.ensembl.org/info/genome/genebuild/biotypes.html Diversity chain immunoglobulin gene that undergoes somatic recombination before transcription SO:0002124 https://en.wikipedia.org/wiki/V(D)J_recombination IG D gene http://www.ensembl.org/info/genome/genebuild/biotypes.html Joining chain immunoglobulin gene that undergoes somatic recombination before transcription SO:0002125 https://en.wikipedia.org/wiki/V(D)J_recombination IG J gene http://www.ensembl.org/info/genome/genebuild/biotypes.html Constant chain immunoglobulin gene that undergoes somatic recombination before transcription SO:0002123 https://en.wikipedia.org/wiki/V(D)J_recombination IG C gene http://www.ensembl.org/info/genome/genebuild/biotypes.html Variable chain T cell receptor gene that undergoes somatic recombination before transcription SO:0002137 https://en.wikipedia.org/wiki/V(D)J_recombination TR V gene http://www.ensembl.org/info/genome/genebuild/biotypes.html Diversity chain T cell receptor gene that undergoes somatic recombination before transcription SO:0002135 https://en.wikipedia.org/wiki/V(D)J_recombination TR D gene http://www.ensembl.org/info/genome/genebuild/biotypes.html Joining chain T cell receptor gene that undergoes somatic recombination before transcription SO:0002136 https://en.wikipedia.org/wiki/V(D)J_recombination TR J gene http://www.ensembl.org/info/genome/genebuild/biotypes.html Constant chain T cell receptor gene that undergoes somatic recombination before transcription SO:0002134 https://en.wikipedia.org/wiki/V(D)J_recombination TR C gene The sequence of the spliced exons of a transcript expressed in DNA notation (T rather than U), representing the coding or sense strand. 
The cDNA contains the whole sequence of the RNA, including coding and untranslated sequence. SO:0000756 https://en.wikipedia.org/wiki/Complementary_DNA complementary DNA cDNA CoDing Sequence. The region of a cDNA which is translated. In Ensembl displays, the stop codon is included as part of the CDS sequence. SO:0000316 ORF open reading frame translatable sequence translated sequence coding sequence CDS A sequence of amino acids, translated from a CDS. https://en.wikipedia.org/wiki/Peptide Translation Protein Peptide A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics. https://en.wikipedia.org/wiki/Protein_domain Domain Protein domain Transcribed genomic region that remains in the RNA after splicing, includes both the CDS and the UTRs. SO:0000147 https://en.wikipedia.org/wiki/Exon Exon Transcribed genomic region that is removed from the RNA by splicing. SO:0000188 https://en.wikipedia.org/wiki/Intron Intron Three base pairs in either DNA or RNA that code for an amino acid (or stop translation). SO:0000360 https://en.wikipedia.org/wiki/Genetic_code Codon Exons that are not spliced out, therefore present in all transcripts of a given gene. Constitutive exon The position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base and two means between the second and third base. Exons have a start and end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1. Phase Sequence 5' or 3' to a DNA or RNA sequence of interest (for example gene, transcript, SNP or repeat). 
SO:0000239 flank flanking region Flanking sequence The region of a coding cDNA which is not translated. SO:0000203 https://en.wikipedia.org/wiki/Untranslated_region UTR Untranslated region The region of a coding cDNA upstream of the start codon which is not translated. SO:0000204 https://en.wikipedia.org/wiki/Untranslated_region 5' untranslated region five prime untranslated region five prime UTR 5' UTR The region of a coding cDNA downstream of the stop codon which is not translated. SO:0000205 https://en.wikipedia.org/wiki/Untranslated_region 3' untranslated region three prime untranslated region three prime UTR 3' UTR http://www.ensembl.org/info/genome/compara/homologue_types.html Specific genes that are descended from the same common sequence in an ancestor. SO:0000853, FHOM_0000007 https://en.wikipedia.org/wiki/Sequence_homology Homologs Homologues http://www.ensembl.org/info/genome/compara/homology_method.html A representation of the evolutionary relationship between homologues, constructed using the Ensembl gene tree pipeline. https://en.wikipedia.org/wiki/Phylogenetic_tree Protein tree Gene tree http://www.ensembl.org/info/genome/compara/homologue_types.html Orthologues are genes derived from a common ancestor through vertical descent (or speciation) and can be thought of as the direct evolutionary counterpart. FHOM_0000017 https://en.wikipedia.org/wiki/Sequence_homology#Orthology Orthologs Orthologues http://www.ensembl.org/info/genome/compara/homologue_types.html A type of orthologue assigned for a pair of species where only one copy is found in each species. FHOM_0000020 one-to-one orthologs one-to-one orthologues 1-to-1 orthologs 1-to-1 orthologues http://www.ensembl.org/info/genome/compara/homologue_types.html A type of orthologue assigned for a pair of species where one gene in one species is orthologous to multiple genes in the other species, due to (a) duplication event(s) in the second species. 
FHOM_0000034 one-to-many orthologs one-to-many orthologues 1-to-many orthologs 1-to-many orthologues http://www.ensembl.org/info/genome/compara/homologue_types.html A type of orthologue assigned for a pair of species where multiple orthologues are found in both species, where the duplication events in both species occurred after the speciation event. FHOM_0000048 many-to-many orthologs Many-to-many orthologues http://www.ensembl.org/info/genome/compara/homologue_types.html Genes (homologues) that have evolved by duplication. FHOM_0000011 https://en.wikipedia.org/wiki/Sequence_homology#Paralogy Paralogs Paralogues http://www.ensembl.org/info/genome/compara/homologue_types.html Members of the same gene family in different species that are not direct orthologues. In a gene tree, these genes are separated by a duplication node. FHOM_0000050 out paralogs out paralogues Between species paralogs Between species paralogues http://www.ensembl.org/info/genome/compara/homologue_types.html Pairs of genes in a species that occur together in the same tree, but are actually two halves of the same gene split partway along. Gene split http://www.ensembl.org/info/genome/compara/homologue_types.html Paralogues which are very far away from the other members of a paralogue family. They are part of the same super-family, but the precise taxonomic relationship to other members is undefined, as the trees are too large to compute. Other paralogs Other paralogues http://www.ensembl.org/info/genome/compara/homologue_types.html Two or more versions of a duplicated gene in a single species. In a gene tree, the genes are separated by a duplication node. FHOM_0000049 In paralogs In paralogues Within species paralogs Within species paralogues Pairs of genes in a polyploid genome that underwent (a) hybridisation event(s). The original genes were orthologues in the two (or more) species that hybridised, and now occur in the same species. 
Since they did not arise through a duplication event, they are not paralogues. FHOM_0000073 https://en.wikipedia.org/wiki/Polyploid#Homoeologous Homoeologues http://www.ensembl.org/info/genome/variation/index.html Locus where the sequence differs between individuals of the same species SO:0001060 https://en.wikipedia.org/wiki/Genetic_variation mutation variation Polymorphism Variant Genetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure). SO:0000771 https://en.wikipedia.org/wiki/Quantitative_trait_locus Quantitative trait locus QTL Genetic loci where allelic variation is associated with expression levels of other genes. https://en.wikipedia.org/wiki/Expression_quantitative_trait_loci Expression quantitative trait locus eQTL http://www.ensembl.org/info/genome/variation/data_description.html#evidence_status Codes that reflect the amount and type of evidence that supports the existence of a variant. Evidence status http://www.ensembl.org/info/genome/variation/index.html Variant that only affects a small locus Sequence variant http://www.ensembl.org/info/genome/variation/index.html Variant that affects a large locus SO:0001537 Structural variant http://www.ensembl.org/info/genome/variation/index.html Single Nucleotide Polymorphism, substitution of a single nucleotide for another nucleotide SO:0000694 https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism SNV base-pair substitution point mutation single nucleotide variation Single nucleotide polymorphism SNP http://www.ensembl.org/info/genome/variation/index.html Insertion of one or more nucleotides SO:0000667 https://en.wikipedia.org/wiki/Insertion_(genetics) Insertion http://www.ensembl.org/info/genome/variation/index.html Deletion of one or more nucleotides SO:0000159 https://en.wikipedia.org/wiki/Deletion_(genetics) Deletion http://www.ensembl.org/info/genome/variation/index.html An insertion and a deletion, affecting two or more nucleotides SO:1000032 
https://en.wikipedia.org/wiki/Indel Indel http://www.ensembl.org/info/genome/variation/index.html A sequence alteration where the length of the deleted sequence is the same as the length of the inserted sequence. SO:1000002 Substitution http://www.ensembl.org/info/genome/variation/index.html Copy Number Variation: increases or decreases the copy number of a given locus. Subcategorised into Loss and Gain compared to the reference. SO:0001019 https://en.wikipedia.org/wiki/Copy-number_variation CNP copy number gain copy number loss copy number polymorphism deletion duplication Copy number variation CNV http://www.ensembl.org/info/genome/variation/index.html A continuous nucleotide sequence is inverted in the same position SO:1000036 https://en.wikipedia.org/wiki/Chromosomal_inversion Inversion http://www.ensembl.org/info/genome/variation/index.html A region of nucleotide sequence that has translocated to a new position SO:0000199 https://en.wikipedia.org/wiki/Chromosomal_translocation chromosome rearrangement Translocation http://www.ensembl.org/info/genome/variation/index.html One of a number of alternative forms of the same genetic locus/variant. SO:0001023 Allele (variant) http://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html Different versions of a gene found between the primary assembly and a patch or genome haplotype. alternative sequence gene haplotype gene patch gene Alt gene Allele (gene) The allele of a variant found in the reference genome currently being studied. The reference allele is not necessarily the major or ancestral allele. Reference allele Any allele of a variant which is not in the reference genome currently being studied. The alternative allele is not necessarily the minor allele. Alternative allele http://www.ensembl.org/info/genome/variation/data_description.html#maf The allele which is most frequent in the global population, defined in human by the 1000 Genomes Project. 
The major allele may be the reference or the alternative allele, and may or may not be the ancestral allele. https://en.wikipedia.org/wiki/Allele_frequency Major allele http://www.ensembl.org/info/genome/variation/data_description.html#maf The allele which is the second most frequent in the global population, defined in human by the 1000 Genomes Project. The minor allele may be the reference or the alternative allele, and may or may not be the ancestral allele. https://en.wikipedia.org/wiki/Allele_frequency Minor allele An allele which has only been identified in one individual or one family. A private allele may be the reference or the alternative allele, and may or may not be the ancestral allele. Private allele The allele which occurs at this locus in closely related species and is thought to reflect the allele present at the time of speciation. The ancestral allele may be the reference or the alternative allele, and the major or minor allele. Ancestral allele http://www.ensembl.org/info/genome/variation/data_description.html#maf The frequency of the second most common allele in the specified population. https://en.wikipedia.org/wiki/Allele_frequency MAF Minor allele frequency http://www.ensembl.org/info/genome/variation/data_description.html#maf The highest minor allele frequency observed in any population typed for this variant. For human this includes the 1000 Genomes Project, gnomAD and UK10K. https://en.wikipedia.org/wiki/Allele_frequency Highest population minor allele frequency HPMAF Highest population MAF http://www.ensembl.org/info/genome/variation/data_description.html#maf The frequency of the second most common allele in the global population, defined in human by the 1000 Genomes Project phase 3. Global MAF http://www.ensembl.org/info/genome/variation/data_description.html The specific alleles that are present in an individual's genome. In diploid organisms two alleles make up the genotype (except for the sex chromosomes). 
Zygosity Genotype http://www.ensembl.org/info/genome/variation/data_description.html A measurable locus that varies within a population. SO:0001645 Genetic marker http://www.ensembl.org/info/genome/variation/data_description.html Two or more adjacent copies of a region (of length greater than 1). SO:0000705 Tandem repeat http://www.ensembl.org/info/genome/variation/data_description.html An insertion of sequence from the Alu family of mobile elements. SO:0002063 Alu insertion http://www.ensembl.org/info/genome/variation/data_description.html A structural sequence alteration or rearrangement encompassing one or more genome fragments, with four or more breakpoints. SO:0001784 Complex structural alteration http://www.ensembl.org/info/genome/variation/data_description.html When no simple or well defined DNA mutation event describes the observed DNA change, the keyword "complex" should be used. Usually there are multiple equally plausible explanations for the change. SO:1000005 Complex substitution http://www.ensembl.org/info/genome/variation/data_description.html A rearrangement breakpoint between two different chromosomes. SO:0001873 Interchromosomal breakpoint http://www.ensembl.org/info/genome/variation/data_description.html A translocation where the regions involved are from different chromosomes. SO:0002060 Interchromosomal translocation http://www.ensembl.org/info/genome/variation/data_description.html A rearrangement breakpoint within the same chromosome. SO:0001874 Intrachromosomal breakpoint http://www.ensembl.org/info/genome/variation/data_description.html A translocation where the regions involved are from the same chromosome. SO:0002061 Intrachromosomal translocation http://www.ensembl.org/info/genome/variation/data_description.html A functional variant whereby the sequence alteration causes a loss of function of one allele of a gene. 
SO:0001786 Loss of heterozygosity http://www.ensembl.org/info/genome/variation/data_description.html A deletion of a mobile element when comparing a reference sequence (has mobile element) to an individual sequence (does not have mobile element). SO:0002066 Mobile element deletion http://www.ensembl.org/info/genome/variation/data_description.html A kind of insertion where the inserted sequence is a mobile element. SO:0001837 Mobile element insertion http://www.ensembl.org/info/genome/variation/data_description.html An insertion the sequence of which cannot be mapped to the reference genome. SO:0001838 Novel sequence insertion http://www.ensembl.org/info/genome/variation/data_description.html A variation that expands or contracts a tandem repeat with regard to a reference. SO:0002096 Short tandem repeat variant http://www.ensembl.org/info/genome/variation/data_description.html A duplication consisting of 2 identical adjacent regions. SO:1000173 Tandem duplication http://www.ensembl.org/info/genome/variation/data_description.html A DNA sequence used experimentally to detect the presence or absence of a complementary nucleic acid. SO:0000051 Probe http://www.ensembl.org/info/genome/variation/predicted_data.html The effect that the variant has on each feature that it overlaps. A variant will have a consequence for each feature that it overlaps. Variant consequence http://www.ensembl.org/info/genome/variation/predicted_data.html A subjective classification of the severity of the variant consequence, based on agreement with SNPEff. Variant impact http://www.ensembl.org/info/genome/variation/predicted_data.html The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay. HIGH High impact variant consequence http://www.ensembl.org/info/genome/variation/predicted_data.html A non-disruptive variant that might change protein effectiveness. 
MODERATE Moderate impact variant consequence http://www.ensembl.org/info/genome/variation/predicted_data.html A variant that is assumed to be mostly harmless or unlikely to change protein behaviour. LOW Low impact variant consequence http://www.ensembl.org/info/genome/variation/predicted_data.html Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact. MODIFIER Modifier impact variant consequence http://www.ensembl.org/info/genome/variation/predicted_data.html A feature ablation whereby the deleted region includes a transcript feature SO:0001893 Transcript ablation http://www.ensembl.org/info/genome/variation/predicted_data.html A splice variant that changes the 2 base region at the 3' end of an intron SO:0001574 Splice acceptor variant http://www.ensembl.org/info/genome/variation/predicted_data.html A splice variant that changes the 2 base region at the 5' end of an intron SO:0001575 Splice donor variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript SO:0001587 Stop gained http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant which causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three SO:0001589 Frameshift variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript SO:0001578 Stop lost http://www.ensembl.org/info/genome/variation/predicted_data.html A codon variant that changes at least one base of the canonical start codon SO:0002012 Start lost http://www.ensembl.org/info/genome/variation/predicted_data.html A feature amplification of a region containing a transcript SO:0001889 Transcript 
amplification http://www.ensembl.org/info/genome/variation/predicted_data.html An inframe non synonymous variant that inserts bases into the coding sequence SO:0001821 Inframe insertion http://www.ensembl.org/info/genome/variation/predicted_data.html An inframe non synonymous variant that deletes bases from the coding sequence SO:0001822 Inframe deletion http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved SO:0001583 Missense variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant which is predicted to change the protein encoded in the coding sequence SO:0001818 Protein altering variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron SO:0001630 Splice region variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed SO:0001626 Incomplete terminal codon variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant where at least one base in the terminator codon is changed, but the terminator remains SO:0001567 Stop retained variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant where there is no resulting change to the encoded amino acid SO:0001819 Synonymous variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant that changes the coding sequence SO:0001580 Coding sequence variant http://www.ensembl.org/info/genome/variation/predicted_data.html A transcript variant located within the sequence of the mature miRNA SO:0001620 Mature miRNA variant 
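The frameshift and inframe definitions above depend on whether the net number of inserted or deleted bases is a multiple of three. The following is a minimal sketch of that rule only, assuming plain reference and alternate allele strings with "-" standing for an empty allele. The function name `classify_indel` is hypothetical and is not part of the Ensembl VEP, which also takes transcript context into account.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding change by its net length change (illustrative only)."""
    # Assumption: "-" denotes an empty allele, so strip it before measuring.
    ref_len = len(ref.replace("-", ""))
    alt_len = len(alt.replace("-", ""))

    # Net number of bases inserted (positive) or deleted (negative).
    delta = alt_len - ref_len

    if delta == 0:
        # Same length: not an insertion or deletion at all.
        return "substitution"
    if delta % 3 != 0:
        # Length change is not a multiple of three: the reading frame shifts.
        return "frameshift_variant"
    # Length change is a multiple of three: the reading frame is preserved.
    return "inframe_insertion" if delta > 0 else "inframe_deletion"


if __name__ == "__main__":
    for ref, alt in [("A", "ATTT"), ("ATTT", "A"), ("A", "AT"), ("-", "AC")]:
        print(ref, alt, classify_indel(ref, alt))
```

The sketch ignores where the change falls relative to codon boundaries and splice sites, so it illustrates the multiple-of-three rule and should not be read as a consequence predictor.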
http://www.ensembl.org/info/genome/variation/predicted_data.html A UTR variant of the 5' UTR SO:0001623 5 prime UTR variant http://www.ensembl.org/info/genome/variation/predicted_data.html A UTR variant of the 3' UTR SO:0001624 3 prime UTR variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant that changes non-coding exon sequence in a non-coding transcript SO:0001792 Non coding transcript exon variant http://www.ensembl.org/info/genome/variation/predicted_data.html A transcript variant occurring within an intron SO:0001627 Intron variant http://www.ensembl.org/info/genome/variation/predicted_data.html A variant in a transcript that is the target of NMD SO:0001621 NMD transcript variant http://www.ensembl.org/info/genome/variation/predicted_data.html A transcript variant of a non coding RNA gene SO:0001619 Non coding transcript variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant located 5' of a gene SO:0001631 Upstream gene variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant located 3' of a gene SO:0001632 Downstream gene variant http://www.ensembl.org/info/genome/variation/predicted_data.html A feature ablation whereby the deleted region includes a transcription factor binding site SO:0001895 TFBS ablation http://www.ensembl.org/info/genome/variation/predicted_data.html A feature amplification of a region containing a transcription factor binding site SO:0001892 TFBS amplification http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant located within a transcription factor binding site SO:0001782 TF binding site variant http://www.ensembl.org/info/genome/variation/predicted_data.html A feature ablation whereby the deleted region includes a regulatory region SO:0001894 Regulatory region ablation http://www.ensembl.org/info/genome/variation/predicted_data.html A feature amplification of a region containing a regulatory 
region SO:0001891 Regulatory region amplification http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant that causes the extension of a genomic feature, with regard to the reference sequence SO:0001907 Feature elongation http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant located within a regulatory region SO:0001566 Regulatory region variant http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant that causes the reduction of a genomic feature, with regard to the reference sequence SO:0001906 Feature truncation http://www.ensembl.org/info/genome/variation/predicted_data.html A sequence variant located in the intergenic region, between genes SO:0001628 Intergenic variant A single letter code that represents two or more possible nucleotides at a single base locus. https://en.wikipedia.org/wiki/International_Union_of_Pure_and_Applied_Chemistry#Amino_acid_and_nucleotide_base_codes Ambiguity code http://www.ensembl.org/info/genome/variation/data_description.html#quality_control Variants that failed our quality control analyses and are therefore flagged as suspicious. Failed variant Flagged variant http://www.ensembl.org/info/genome/variation/data_description.html#clin_significance A classification of a variant's impact on disease, taken from ClinVar. pathogenicity Clin sig Clinical significance https://en.wikipedia.org/wiki/Linkage_disequilibrium A measure of how often two variants or specific sequences are inherited together. Linkage LD Linkage disequilibrium https://en.wikipedia.org/wiki/Linkage_disequilibrium The correlation between a pair of loci. It varies from 0 (loci are in complete linkage equilibrium) to 1 (loci are in complete linkage disequilibrium and coinherited). r squared r2 https://en.wikipedia.org/wiki/Linkage_disequilibrium The difference between the observed and the expected frequency of a given haplotype. If two loci are independent (i.e. 
in linkage equilibrium and therefore not coinherited at all), the D' value will be 0. D prime D' https://en.wikipedia.org/wiki/Linkage_disequilibrium A set of variant alleles in a contiguous genomic region. A haplotype block describes a set of alleles which tend to be inherited together. SO:0001024 Haplotype (variation) The transcript sequence derived from one copy of a gene in an individual, based on the phased 1000 Genomes genotype data. CDS and protein sequences are derived from this. Transcript haplotype The complete set of DNA found in each cell. https://en.wikipedia.org/wiki/Genome Genome http://www.ensembl.org/info/genome/genebuild/assembly.html A computational representation of the sequence of a haploid genome, representative of a species or strain. SO:0000353 https://en.wikipedia.org/wiki/Sequence_assembly reference assembly assembly Genome assembly http://www.ensembl.org/info/genome/genebuild/assembly.html Refers to the number of overlapping sequences used to build a region of the assembly. High coverage indicates a good amount of sequence information while low coverage reflects a low amount of sequence information. Coverage https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html The underlying genome sequence, without alternative sequence included. reference sequence Primary assembly https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html Genomic sequence that differs from the genomic DNA on the primary assembly. These are represented as sequence on top of the primary assembly. Provided by the GRC for human and mouse. non-reference sequence Alternative sequence https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html New sequences that have been added to the genome assembly since its release. There are two types: fix and novel patches. Patch https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html Known variations to the primary assembly, due to variability in the human genome sequence (eg. 
the highly variable MHC locus). These were included as part of the genome assembly when it was first produced. https://en.wikipedia.org/wiki/Major_histocompatibility_complex Haplotype (genome) https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html Novel patches represent new allelic loci. They can usually be considered as similar to haplotypes and are likely to be reclassified as such in the next genome assembly, but not necessarily. Novel patch https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html Fix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Fix patch https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information. SO:0000149 https://en.wikipedia.org/wiki/Contig Contig https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html Scaffolds are sets of ordered, oriented contigs, assembled by sequence overlap. They are longer sequences than contigs, but shorter than full chromosomes. SO:0000148 Supercontig Scaffold A banding pattern on a chromosome resulting from staining and examination by microscopy. These are named in terms of the chromosome arm they are found on, and are often used as a shorthand for describing the location of genomic features. https://en.wikipedia.org/wiki/Cytogenetics cytogenetic map chromosome band Cytogenetic band A segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies. Clone https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html A vector used to clone DNA fragments (100 to 300-kb insert size; average, 150 kb) from another species so that it can be replicated in bacteria. 
Many genomes (such as human) were sequenced by cloning segments into BACs, amplifying and sequencing the clones. https://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Bacterial artificial chromosome BAC Originated from a bacterial plasmid, a YAC contains a yeast centromeric region, a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell. https://en.wikipedia.org/wiki/Yeast_artificial_chromosome Yeast artificial chromosome YAC DNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced. https://en.wikipedia.org/wiki/Cosmid Cosmid The actual number of bases of sequence we have for a full genome assembly, including alternative sequences and PARs, excluding gaps. Base pairs (genome size) The golden path is the length of the non-redundant reference assembly. It excludes alternative sequences and PARs, but includes the estimated size of the gaps. SO:0000688 Golden path (genome size) https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html Which level of the assembly we are working on. coord_system Coordinate system The number of chromosomes of a genome. Karyotype https://www.ensembl.org/info/genome/genebuild/human_PARS.html Small regions of sequence identity located at the tips of the short and the long arms of the X and Y chromosomes where recombination and genetic exchange take place. Genes within the pseudoautosomal region are not sex linked. pseudoautosomal region PAR http://www.ensembl.org/info/docs/api/core/core_tutorial.html#slices The term "slice" in Ensembl refers to a length of DNA sequence. A slice can be any length, from one base long to the entire length of a chromosome. 
Slice https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html The largest continuous sequence for an organism. The official technical definition for toplevel sequences is 'sequence regions in the genome assembly that are not a component of another sequence region'. For example, when a genome is assembled into chromosomes, toplevel sequences will be chromosomes and unplaced scaffolds. If a genome has only been assembled into scaffolds, then toplevel sequences are scaffolds and unplaced contigs. Toplevel https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html A scaffold that can be positioned on a chromosome based on genetic mapping information. Placed scaffold https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html A scaffold that cannot be positioned on a chromosome. SO:0001875 unassigned supercontig unplaced supercontig Unassigned scaffold Unplaced scaffold Publicly available database that Ensembl imports data from. Ensembl sources Database from which Ensembl imports cDNA or protein sequence for gene annotation, or gene names. Gene source database http://www.ensembl.org/info/genome/genebuild/annotation_merge.html The aim of GENCODE as a sub-project of the ENCODE scale-up project is to annotate all evidence-based gene features in the entire human and mouse genomes at a high accuracy. The GENCODE gene set is the default geneset in Ensembl and is equivalent to the Ensembl/HAVANA merged genes. https://www.gencodegenes.org/ https://en.wikipedia.org/wiki/GENCODE GENCODE http://www.ensembl.org/info/genome/genebuild/annotation_sources.html NCBI's Reference Sequences (RefSeq) database is a curated database of Genbank's genomes, mRNAs and proteins. RefSeq attempts to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, tRNA, and protein products. 
https://www.ncbi.nlm.nih.gov/refseq/ https://en.wikipedia.org/wiki/RefSeq RefSeq http://www.ensembl.org/info/genome/genebuild/annotation_sources.html Database of protein sequence and functional information, based at European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). These sequences are used as evidence for annotating Ensembl genes. http://www.uniprot.org/ https://en.wikipedia.org/wiki/UniProt UniProt Knowledgebase UniProtKB UniProt http://www.ensembl.org/info/genome/genebuild/ig_tcr.html International ImMunoGeneTics information system. Database of immunoglobulin and T-cell receptor annotation. We collaborate with IMGT on manual annotation of somatically recombined genes. http://www.imgt.org/ https://en.wikipedia.org/wiki/HLA_Informatics_Group ImMunoGeneTics International ImMunoGeneTics information system IMGT http://www.ensembl.org/info/genome/genebuild/annotation_sources.html UniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. These sequences are used as evidence for annotating Ensembl genes. https://en.wikipedia.org/wiki/UniProt#UniProtKB/Swiss-Prot UniProt/SwissProt SwissProt http://www.ensembl.org/info/genome/genebuild/annotation_sources.html A subset of TrEMBL (Translated EMBL database) containing the computer-annotated protein translations of all coding sequences (CDS) present in the ENA (formerly EMBL-bank) that are not yet incorporated into the UniProt/SwissProt database. These sequences are used as evidence for annotating Ensembl genes. https://en.wikipedia.org/wiki/UniProt#UniProtKB/TrEMBL UniProt/TrEMBL TrEMBL http://www.ensembl.org/info/genome/genebuild/annotation_sources.html An international consortium between the ENA, GenBank and DDBJ to share submissions of nucleotide sequence. 
These sequences are used as evidence for annotating Ensembl genes. http://www.insdc.org/ https://en.wikipedia.org/wiki/International_Nucleotide_Sequence_Database_Collaboration International Nucleotide Sequence Database Collaboration INSDC http://www.ensembl.org/info/genome/genebuild/annotation_sources.html Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications. https://www.ebi.ac.uk/ena https://en.wikipedia.org/wiki/European_Nucleotide_Archive European Nucleotide Archive ENA http://www.ensembl.org/info/genome/genebuild/annotation_sources.html The US branch of INSDC. https://www.ncbi.nlm.nih.gov/genbank/ https://en.wikipedia.org/wiki/GenBank GenBank (database) http://www.ensembl.org/info/genome/genebuild/annotation_sources.html The Asian branch of INSDC. http://www.ddbj.nig.ac.jp/ https://en.wikipedia.org/wiki/DNA_Data_Bank_of_Japan DNA Data Bank of Japan DDBJ An organised hierarchy of terms produced by the Gene Ontology Consortium, used to describe the function of proteins. GO terms are split into three subcategories: biological processes (what the protein does), cellular component (where in the cell the protein is found), and molecular function (how the protein acts). http://www.geneontology.org/ https://en.wikipedia.org/wiki/Gene_ontology GO terms GO Gene Ontology http://www.ensembl.org/info/genome/genebuild/gene_names.html HGNC is responsible for approving unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication. HGNC gene names are used for Ensembl human genes, where available, and for orthologous genes in other species. 
https://www.genenames.org/ https://en.wikipedia.org/wiki/HUGO_Gene_Nomenclature_Committee HUGO gene nomenclature committee HGNC http://www.ensembl.org/info/genome/genebuild/annotation_sources.html The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). These sequences are used as evidence for annotating Ensembl non-coding genes. http://rfam.xfam.org/ https://en.wikipedia.org/wiki/Rfam Rfam http://www.ensembl.org/info/genome/genebuild/annotation_sources.html The miRBase database is a searchable database of published miRNA sequences and annotation. These sequences are used as evidence for annotating Ensembl miRNA genes. http://www.mirbase.org/ https://en.wikipedia.org/wiki/MiRBase miRbase http://www.ensembl.org/info/genome/genebuild/gene_names.html MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. MGI gene names are used for Ensembl mouse genes, where available. http://www.informatics.jax.org/ https://en.wikipedia.org/wiki/Mouse_Genome_Informatics Mouse genome informatics MGI http://www.ensembl.org/info/genome/genebuild/gene_names.html An online biological database of information about the zebrafish (Danio rerio). zFIN gene names are used for Ensembl zebrafish genes, where available. https://zfin.org/ https://en.wikipedia.org/wiki/Zebrafish_Information_Network Zebrafish Information Network zFIN Canonical database for the molecular biology and genetics of Saccharomyces cerevisiae, source of the annotation seen in Ensembl. https://www.yeastgenome.org/ https://en.wikipedia.org/wiki/Saccharomyces_Genome_Database Saccharomyces Genome Database SGD A genome browser hosted at the University of California Santa Cruz. Ensembl collaborates with UCSC in projects such as GENCODE, CCDS and TSL. 
https://genome.ucsc.edu/ https://en.wikipedia.org/wiki/UCSC_Genome_Browser Genome Browser University of California Santa Cruz UCSC UCSC Genome Browser http://www.ensembl.org/info/genome/funcgen/regulation_sources.html Database from which Ensembl imports ChIP-seq, DNase-seq and other related datasets, which are used in the Ensembl regulatory build. Epigenome source database http://www.ensembl.org/info/genome/funcgen/regulation_sources.html Project aiming to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active, by large scale functional analyses of laboratory cell lines. Used as a source for the Ensembl regulatory build. https://www.encodeproject.org/ https://en.wikipedia.org/wiki/ENCODE ENCyclopedia Of DNA Elements ENCODE http://www.ensembl.org/info/genome/funcgen/regulation_sources.html Project aiming to apply functional genomics analysis on primary cells of the haematopoietic cell lineage from healthy and diseased individuals, to produce lineage-specific epigenomes. Used as a source for the Ensembl regulatory build. http://www.blueprint-epigenome.eu/ Blueprint Epigenomes http://www.ensembl.org/info/genome/funcgen/regulation_sources.html Project aiming to develop publicly available reference epigenome maps from a variety of cell types. http://www.roadmapepigenomics.org/ https://en.wikipedia.org/wiki/NCBI_Epigenomics#Roadmap_Epigenomics_Project Roadmap Epigenomics http://www.ensembl.org/info/genome/variation/sources_documentation.html Database from which Ensembl imports variation data, including loci, sample genotypes, population frequencies and phenotype associations. 
Variation source database http://www.ensembl.org/info/genome/variation/sources_documentation.html The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple (short) genetic polymorphisms in human, maintained by NCBI. https://www.ncbi.nlm.nih.gov/projects/SNP/ https://en.wikipedia.org/wiki/DbSNP Single Nucleotide Polymorphism database dbSNP http://www.ensembl.org/info/genome/variation/sources_documentation.html The European Variation Archive is an open-access database of all types of genetic variation data from all species. https://www.ebi.ac.uk/eva/ European Variation Archive EVA http://www.ensembl.org/info/genome/variation/sources_documentation.html dbVar is NCBI's database of human genomic structural variation: insertions, deletions, duplications, inversions, mobile elements, and translocations. https://www.ncbi.nlm.nih.gov/dbvar/ dbVar http://www.ensembl.org/info/genome/variation/sources_documentation.html The Database of Genomic Variants archive (DGVa) is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species. https://www.ebi.ac.uk/dgva Database of Genomic Variants archive DGVa http://www.ensembl.org/info/genome/variation/sources_documentation.html The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the human populations studied. Ensembl display sample genotypes and population frequencies from the 1000 Genomes project. http://www.internationalgenome.org/ https://en.wikipedia.org/wiki/1000_Genomes_Project 1000G 1kG 1000 Genomes project http://www.ensembl.org/info/genome/variation/sources_documentation.html An aggregation of publicly available whole genome and whole exome variant calling experiments in human. GnomAD was previously known as ExAC, when it contained only exome data. Ensembl display population frequencies from gnomAD. 
http://gnomad.broadinstitute.org/ Exome Aggregation Consortium ExAC Genome Aggregation Database gnomAD http://www.ensembl.org/info/genome/variation/sources_documentation.html Whole genome variant calling data from humans worldwide with heart, lung, blood, and sleep disorders. Ensembl display population frequencies from TOPMed. https://www.nhlbi.nih.gov/science/trans-omics-precision-medicine-topmed-program Trans-Omics for Precision Medicine TOPMed http://www.ensembl.org/info/genome/variation/sources_documentation.html Study comparing exomes of 6000 diseased individuals with 4000 healthy individuals in the UK in order to identify disease-causing variants. Ensembl display population frequencies from the control group. https://www.uk10k.org/ UK10K http://www.ensembl.org/info/genome/variation/sources_documentation.html An international collaboration formed to develop a haplotype map of the human genome and thus describe the common patterns of human DNA sequence variation using genotyping. Ensembl display sample genotypes and population frequencies from the HapMap project. https://www.genome.gov/10001688/international-hapmap-project/ https://en.wikipedia.org/wiki/International_HapMap_Project International HapMap Project HapMap http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html NCBI resource that aggregates information about genomic variation and its relationship to human health. Ensembl display clinical significance and phenotypes from ClinVar. https://www.ncbi.nlm.nih.gov/clinvar/ ClinVar Database from which Ensembl imports phenotype associations with genes and/or variants. Phenotype source database http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html An online database that describes the function and phenotypes associated with human genes. Ensembl display phenotypes from OMIM and MIM morbid. 
https://www.omim.org/ https://en.wikipedia.org/wiki/Online_Mendelian_Inheritance_in_Man MIM morbid Mendelian Inheritance in Man Online Mendelian Inheritance in Man MIM OMIM http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html An online database that describes the function and phenotypes associated with animal genes. Ensembl display phenotypes from OMIA. https://www.omia.org/ https://en.wikipedia.org/wiki/Online_Mendelian_Inheritance_in_Animals Online Mendelian Inheritance in Animals OMIA http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html A catalogue of rare disease associations. Ensembl display phenotypes from Orphanet. http://www.orpha.net/ https://en.wikipedia.org/wiki/Orphanet Orphanet http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html A curated database that extracts associations between variants and genes from published genome-wide association studies in human. Ensembl display phenotypes from the GWAS catalog. https://www.ebi.ac.uk/gwas/ NHGRI-EBI Genome-wide association study catalogue GWAS catalogue Genome-wide association study catalog Genome-wide association study catalogue NHGRI-EBI GWAS Catalogue NHGRI-EBI Genome-wide association study catalog NHGRI-EBI GWAS Catalog GWAS catalog http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html An international scientific endeavour to create and characterise the phenotype of 20,000 knockout mouse strains. Ensembl display phenotypes from the IMPC. http://www.mousephenotype.org/ https://en.wikipedia.org/wiki/International_Mouse_Phenotyping_Consortium International Mouse Phenotyping Consortium IMPC http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html Project aiming to house all publicly available QTL and association data on livestock animal species. Ensembl display phenotypes from the Animal QTLdb. 
https://www.animalgenome.org/cgi-bin/QTLdb/index Animal Quantitative Trait Loci Database Animal QTLdb http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html Project aiming to collate all known (published) gene lesions responsible for human inherited disease. Full HGMD access is restricted to license holders so Ensembl supports the minimal public data release which consists of variant/mutation names and locations. http://www.hgmd.cf.ac.uk/ac/index.php Human Gene Mutation Database HGMD http://www.ensembl.org/info/genome/variation/sources_documentation.html Database of somatic variants found in cancer. COSMIC licensing does not permit redistribution of the full dataset, but mutation identifiers, locations and tumour types are available in Ensembl. http://cancer.sanger.ac.uk/cosmic https://en.wikipedia.org/wiki/COSMIC_cancer_database Catalog Of Somatic Mutations In Cancer Catalogue Of Somatic Mutations In Cancer COSMIC Protein source database A repository for 3D biological macromolecular structure data. Ensembl provide links out to the PDB, and use structures to display the locations of variants in proteins. http://www.ebi.ac.uk/pdbe/ https://en.wikipedia.org/wiki/Protein_Data_Bank Protein Data Bank PDB A sequence of computational tasks or actions that carry out a specific function. https://en.wikipedia.org/wiki/Algorithm Algorithm http://www.ensembl.org/info/genome/genebuild/genome_annotation.html The automatic process by which Ensembl plot known RNA and protein sequence onto the genome, using sequence similarity. Ensembl Genes Ensembl annotation Genebuild Ensembl Genebuild www.ensembl.org/info/genome/genebuild/manual_havana.html Human And Vertebrate ANalysis and Annotation. The team within Ensembl who manually annotate genes and transcripts for a subset of species. 
manual annotation Havana Ensembl Havana http://www.ensembl.org/info/genome/funcgen/regulatory_build.html The process by which Ensembl predict the location of regions that regulate gene expression using epigenomic evidence. Ensembl Regulatory Build http://www.ensembl.org/info/genome/compara/homology_method.html The process by which Ensembl compare gene sequences in order to construct gene trees and predict homologues. Ensembl gene tree pipeline InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases, including PROSITE, PRINTS, Pfam, Seg, SignalP, Gene3D, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY. Ensembl run InterProScan on all protein sequences, which uses these protein signatures to identify domains. https://www.ebi.ac.uk/interpro/ https://en.wikipedia.org/wiki/InterPro InterProScan A sequence comparison algorithm optimised for speed which is used to search sequence databases for optimal local alignments to a query. https://en.wikipedia.org/wiki/BLAST Basic Local Alignment Search Tool BLAST An mRNA/DNA and cross-species protein sequence analysis tool to quickly find sequences of 95% and greater similarity of length 40 bases or more. https://en.wikipedia.org/wiki/BLAT_(bioinformatics) BLAST-Like Alignment Tool BLAT http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html A standalone application that looks for low complexity sequences. DUST http://www.ensembl.org/info/genome/genebuild/automatic_coding.html Eponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognising specific sequence motifs. Each of these is associated with a position distribution relative to the TSS. 
http://www.sanger.ac.uk/science/tools/eponine Eponine http://www.ensembl.org/info/genome/genebuild/automatic_coding.html GeneWise is a sequence analysis tool for comparing proteins to DNA sequences allowing for introns and frameshifts. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/Tools/psa/genewise/ GeneWise A fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate Exonerate http://www.ensembl.org/info/genome/genebuild/2x_genomes.html A gene build method used by Ensembl for low coverage genomes, allowing genes to be annotated that span two scaffolds by mapping to the human gene. Projection build An HMM-based ab initio gene prediction method, used to create a track of ab initio genes in Ensembl. http://genes.mit.edu/GENSCAN.html https://en.wikipedia.org/wiki/GENSCAN GENSCAN http://www.ensembl.org/info/genome/variation/predicted_data.html#sift A tool which predicts if missense variants are likely to affect protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids. http://sift.bii.a-star.edu.sg/ SIFT http://www.ensembl.org/info/genome/variation/predicted_data.html#polyphen A tool which predicts if missense variants are likely to affect protein function based on physical and comparative considerations. http://genetics.bwh.harvard.edu/pph2/ PolyPhen http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html The method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs. 
http://www.repeatmasker.org/ https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) RepeatMasker A matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties, and observed substitution frequencies. The BLOSUM 62 matrix is tailored using sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment to avoid bias from using related family members). https://en.wikipedia.org/wiki/BLOSUM Blocks Substitution Matrix BLOSUM 62 http://www.ensembl.org/info/docs/tools/vep/index.html The Variant Effect Predictor (VEP) is an Ensembl tool that predicts the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. https://en.wikipedia.org/wiki/Ensembl_Genomes#Variant_Effect_Predictor Variant Effect Predictor VEP File formats A set of recommendations for variant naming. The nomenclature describes the change a variant allele makes to a named (genomic, transcript or protein) sequence. Can be used as an input for the VEP and displayed for known variants. http://varnomen.hgvs.org/ HGVS Human Genome Variation Society Sequence Variant Nomenclature HGVS names HGVS nomenclature http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#vcf VCF is a standard format for listing genetic variation, which is the output for many variant callers. It can be used as an input for the Ensembl VEP and is used to store and download variation data in Ensembl. https://en.wikipedia.org/wiki/Variant_Call_Format Variant Call Format VCF http://www.ensembl.org/info/website/upload/bed.html BED is a simple format for listing genomic loci. It can be used to upload data to view in Ensembl, as a custom file for additional VEP annotation and is used to store and download constrained elements in Ensembl. BED FASTA is used to store finished nucleotide and peptide sequences. 
The Ensembl FTP site has genome, cDNA, CDS and peptide sequences in FASTA, and you can export FASTA from various webpages in Ensembl. https://en.wikipedia.org/wiki/FASTA_format FASTA http://www.ensembl.org/info/website/upload/large.html#bam-format BAM and CRAM store alignments of NGS data to the genome. Ensembl allow attachment of BAM and CRAM files to view against the genome, and store RNA-seq, ChIP-seq and DNase-seq in BAM. https://en.wikipedia.org/wiki/Binary_Alignment_Map BAM CRAM SAM Binary alignment map BAM/CRAM http://www.ensembl.org/info/website/upload/large.html#bb-format BigBed is an indexed form of BED, which can be used to store larger scale data. Ensembl allow attachment of BigBed files to view against the genome and store peaks of regulatory evidence as BigBed. BigBed http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#default Ensembl default is an input format for the VEP, used to describe the position and alleles of a variant. Ensembl default (VEP) http://www.ensembl.org/info/website/upload/bed.html#bedGraph BedGraph allows you to store scores for loci in BED format; the loci can be of varying size. It can be uploaded to view in Ensembl. BedGraph http://www.ensembl.org/info/website/upload/gff.html GTF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GTF, allow attachment of GTF files to view against the genome and allow custom annotation with the VEP using GTF files. https://en.wikipedia.org/wiki/Gene_transfer_format General transfer format Gene transfer format GTF http://www.ensembl.org/info/website/upload/gff3.html GFF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GFF, allow attachment of GFF files to view against the genome and allow custom annotation with the VEP using GFF files. 
https://en.wikipedia.org/wiki/General_feature_format gene-finding format generic feature format General feature format GFF http://www.ensembl.org/info/website/upload/psl.html PSL represents alignments and can be viewed in Ensembl. PSL http://www.ensembl.org/info/website/upload/wig.html Wiggle format expresses scores across genomic loci, requiring fixed size bins for the scores. It can be uploaded to view in Ensembl. WIG Wiggle http://www.ensembl.org/info/website/upload/large.html#bw-format BigWig is an indexed form of wiggle and can be used to store larger scale data. Ensembl simplify NGS data, such as ChIP-seq and RNA-seq into BigWig to view in the browser. It can also be used to attach your own data to Ensembl. BigWig http://www.ensembl.org/info/website/upload/pairwise.html Pairwise interactions, such as those derived from Hi-C, can be stored in the WashU format and viewed in Ensembl. https://en.wikipedia.org/wiki/Chromosome_conformation_capture Pairwise interactions (WashU) Chain files describe the mapping between different genome assemblies. Ensembl store these on the FTP site. chain mapping mapping assembly chain chain Newick is a tree format. Ensembl gene trees can be downloaded in Newick and it is used to store Ensembl species trees. https://en.wikipedia.org/wiki/Newick_format New Hampshire tree format Newick format Newick notation Newick tree format Newick EMBL files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome. EMBL (file format) GenBank files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome. GenBank (file format) http://www.ensembl.org/info/data/ftp/index.html Ensembl Multi Format (EMF) stores genomic alignments in Ensembl. 
Ensembl Multi Format EMF Alignment format http://www.ensembl.org/info/data/ftp/index.html Multiple alignment format (MAF) stores genomic alignments. Multiple alignment format MAF http://www.ensembl.org/info/data/mysql.html MySQL is a database. All Ensembl data is stored in MySQL relational tables, which can be found on the FTP site and accessed directly by MySQL queries. https://en.wikipedia.org/wiki/MySQL MySQL https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html A VEP cache contains all the gene and variant data needed to run a VEP query, and can be used to run large queries quickly on your own machine. These can be installed as part of your VEP installation, or downloaded from the FTP site. VEP cache Genome Variation Format (GVF) is used to store variation data. It can be found on the Ensembl FTP site. Genome Variation Format GVF PhyloXML is an XML language for the analysis, exchange, and storage of phylogenetic trees (or networks) and associated data. It is used to store Ensembl phylogenetic trees. https://en.wikipedia.org/wiki/PhyloXML PhyloXML OrthoXML is an XML format to allow the storage and comparison of orthology data. It is used to store Ensembl homologues. OrthoXML Resource Description Framework (RDF) is used as a metadata data model. Ensembl use it to describe links from Ensembl annotations to those annotations in other databases. https://en.wikipedia.org/wiki/Resource_Description_Framework Resource Description Framework RDF A golden path. A file provided to Ensembl that describes how the longer sequences in the genome assembly were assembled from shorter sequences. For example, an AGP file can describe how a chromosome is assembled from a collection of scaffolds or a collection of contigs. 
For an AGP file that describes how a scaffold is assembled from a collection of contigs, each contig will be listed on a separate line in the AGP file and the line will include information about where the contig lies within the scaffold and the orientation of the contig. A golden path AGP http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) Repeat http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html The method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs. https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) Repeat masking Hard masked sequence is repeat masked with the repeat sequences replaced by Ns. Hard masked sequence files on the Ensembl FTP site have "rm" in their file name. Hard masked Soft masked sequence is repeat masked with the repeat sequences in lower case. Soft masked sequence files on the Ensembl FTP site have "sm" in their file name. Soft masked http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html A dispersed intermediately repetitive DNA sequence found in the human genome in about one million copies. The sequence is about 300 bp long and is found commonly in introns, 3' untranslated regions of genes, and intergenic genomic regions. The name Alu comes from a recognition site for the AluI endonuclease that cleaves it. SO:0002063 https://en.wikipedia.org/wiki/Alu_element Alu insertion http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html A region in the genomic sequence containing short tandem repeats of 2-10bp. SO:0000289 https://en.wikipedia.org/wiki/Microsatellite Microsatellite http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html The region of the chromosome at which the two sister chromatids are joined during mitosis and meiosis, mostly composed of satellite DNA. 
SO:0000577 https://en.wikipedia.org/wiki/Centromere Centromere http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html Poly-purine or poly-pyrimidine stretches, or regions of extremely high AT or GC content. Low complexity regions http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase. RNA repeats http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html Multiple copies of the same base sequence on a DNA sequence. The repeated pattern can vary in length from a single base to several thousand bases long. SO:0000005 https://en.wikipedia.org/wiki/Satellite_DNA Satellite repeats http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc. Simple repeats http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html Typically found at the centromeres and telomeres of chromosomes, these are duplications of more complex 100-200 base sequences. https://en.wikipedia.org/wiki/Tandem_repeat Tandem repeats http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html Long terminal repeats. https://en.wikipedia.org/wiki/Tandem_repeat LTRs http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html Long Interspersed Elements. Retrotransposed elements in the genome containing open reading frames encoding (often inactive) reverse transcription machinery. SO:0000194 https://en.wikipedia.org/wiki/Long_interspersed_nuclear_element long interspersed nuclear element LINE Type I Transposons/LINE http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html Short Interspersed Elements. Retrotransposed elements less than 500 bp that contain tRNA, snRNA and rRNA, which require other mobile elements to be transposed. Alu elements are a type of SINE. 
SO:0000206 https://en.wikipedia.org/wiki/Short_interspersed_nuclear_element short interspersed nuclear element SINE Type I Transposons/SINE http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html Elements that have been transposed and duplicated around the genome by excision and ligation. SO:0000182 https://en.wikipedia.org/wiki/Transposable_element#Classification DNA transposon Type II Transposons Repeats that cannot be classified. Unknown repeat A comparison between two or more sequences by matching identical and/or similar residues/nucleotides and assigning a score to the match. https://en.wikipedia.org/wiki/Sequence_alignment Alignments An alignment carried out using the whole genome sequence. Whole genome alignment http://www.ensembl.org/info/genome/compara/analyses.html An alignment between two whole genomes. Pairwise sequence alignment Pairwise alignment Pairwise whole genome alignment http://www.ensembl.org/info/genome/compara/multiple_genome_alignments.html An alignment between more than two whole genomes of a selected taxon. Multiple sequence alignment Multiple alignment Multiple whole genome alignment http://www.ensembl.org/info/genome/compara/synteny.html In a genomic context we refer to syntenic regions if the sequence is globally conserved between two species. https://en.wikipedia.org/wiki/Synteny Synteny The cigar line defines the sequence of matches/mismatches and deletions (or gaps) in an alignment https://en.wikipedia.org/wiki/Sequence_alignment#Representations Compact Idiosyncratic Gapped Alignment Report CIGAR A measure of how similar two alignment sequences are, specifically, what percentage of amino acids or nucleotides are the same in type and position between the two sequences. The value is dependent on which sequence is used as the reference, since it is a percentage of that reference. 
%ID Identity An application for displaying sequence alignments with custom colour-annotation, which is used by Ensembl for displaying gene tree and family alignments. http://wasabiapp.org/ Wasabi How well one sequence matches another, as determined by an alignment program's calculation of identical and conserved residues/nucleotides. Similarity http://www.ensembl.org/info/genome/compara/multiple_genome_alignments.html Pecan is a global multiple sequence alignment program that makes practical the probabilistic consistency methodology for significant numbers of sequences of practically arbitrary length. Pecan http://www.ensembl.org/info/genome/compara/multiple_genome_alignments.html The EPO (Enredo, Pecan, Ortheus) pipeline is a three step pipeline for whole-genome multiple alignments, using Enredo segments, aligning them with Pecan and constructing ancestral sequences with Ortheus. Enredo Pecan Ortheus EPO http://www.ensembl.org/info/genome/compara/multiple_genome_alignments.html Progressive-Cactus is a next-generation aligner that stores whole-genome alignments in a graph structure. Progressive cactus http://www.ensembl.org/info/genome/compara/analyses.html LASTZ is a program for aligning DNA sequences in a pairwise manner. Its predecessor is BlastZ. LastZ http://www.ensembl.org/info/genome/compara/analyses.html BlastZ is a program for aligning DNA sequences in a pairwise manner. It has been replaced by LASTZ. BlastZ Translated Blat can be used for alignment of the coding regions of genomes only in a pairwise manner. https://en.wikipedia.org/wiki/BLAT_(bioinformatics) Translated Blat http://www.ensembl.org/info/genome/funcgen/regulatory_features.html Regions that are predicted to regulate the expression of genes, based on the Ensembl regulatory build. reg-feat Regulatory features http://www.ensembl.org/info/genome/funcgen/regulatory_features.html Regions at the 5' end of genes where transcription factors and RNA polymerase bind to initiate transcription. 
SO:0000167 https://en.wikipedia.org/wiki/Promoter_(genetics) Promoters http://www.ensembl.org/info/genome/funcgen/regulatory_features.html Transcription factor binding regions that flank promoters. SO:0001952 https://en.wikipedia.org/wiki/Promoter_(genetics) Promoter flanking regions http://www.ensembl.org/info/genome/funcgen/regulatory_features.html Regions that bind transcription factors and interact with promoters to stimulate transcription of distant genes. SO:0000165 https://en.wikipedia.org/wiki/Enhancer_(genetics) Enhancers http://www.ensembl.org/info/genome/funcgen/regulatory_features.html Regions that bind CTCF, the insulator protein that demarcates open and closed chromatin. SO:0001974 https://en.wikipedia.org/wiki/CTCF CTCF binding sites http://www.ensembl.org/info/genome/funcgen/regulatory_features.html Sites which bind transcription factors, for which no other role can be determined as yet. SO:0000235 Transcription factor binding sites http://www.ensembl.org/info/genome/funcgen/regulatory_features.html Regions of spaced out histones, making them accessible to protein interactions. SO:0001747 https://en.wikipedia.org/wiki/DNase_I_hypersensitive_site Open chromatin regions http://www.ensembl.org/info/genome/funcgen/regulatory_features.html The activity state of a regulatory feature in a specific epigenome. Regulatory activity http://www.ensembl.org/info/genome/funcgen/regulatory_features.html When a regulatory feature displays an epigenetic signature which is consistent with it carrying out its named function, for example an active Promoter has an epigenetic signature consistent with initiating transcription, while an active CTCF binding site will bind CTCF. It is analogous to a sprinter running. Active http://www.ensembl.org/info/genome/funcgen/regulatory_features.html When a regulatory feature displays an epigenetic signature with the potential to be activated. It is analogous to a sprinter in the blocks. 
Poised http://www.ensembl.org/info/genome/funcgen/regulatory_features.html When a regulatory feature is epigenetically repressed, having an epigenetic signature that prevents it from being active. Repressed http://www.ensembl.org/info/genome/funcgen/regulatory_features.html When a regulatory feature bears no epigenetic modifications from the ones included in the Regulatory Build. Inactive http://www.ensembl.org/info/genome/funcgen/regulatory_features.html When there is no available data in the cell type for this regulatory feature. NA http://www.ensembl.org/info/genome/funcgen/regulation_sources.html Experimental data that is used to construct and determine activity of regulatory features. Epigenome evidence http://www.ensembl.org/info/genome/funcgen/regulation_sources.html A method to determine the genomic regions that proteins bind to. https://en.wikipedia.org/wiki/ChIP-sequencing Chromatin Immunoprecipitation Sequencing ChIPSeq ChIP-seq http://www.ensembl.org/info/genome/funcgen/regulation_sources.html A method to determine regions of open and closed chromatin. https://en.wikipedia.org/wiki/DNase-Seq DNase hypersensitivity DNase-seq DNase sensitivity http://www.ensembl.org/info/genome/funcgen/regulation_sources.html A protein that binds to DNA and controls the rate of transcription. https://en.wikipedia.org/wiki/Transcription_factor TF Transcription factor http://www.ensembl.org/info/genome/funcgen/regulation_sources.html Covalent modifications to the histone proteins that make up the nucleosome, which are known to regulate gene expression. SO:0001700 https://en.wikipedia.org/wiki/Histone#Histone_modification histone acetylation histone methylation Histone mod Histone modification http://www.ensembl.org/info/genome/funcgen/regulation_other.html Modification of cytosines in CpGs with methyl groups, which is known to repress gene expression. 
SO:0000114 https://en.wikipedia.org/wiki/DNA_methylation CpG methylation DNA methylation http://www.ensembl.org/info/genome/funcgen/regulation_other.html A method to determine the methylation of genomic cytosines. https://en.wikipedia.org/wiki/Bisulfite_sequencing RRBS WGBS Bisulphite sequencing Bisulfite sequencing http://www.ensembl.org/info/genome/funcgen/peak_calling.html A count of the number of NGS reads from an epigenome experiment aligned to a locus, shown as a BigWig across the genome. Signal http://www.ensembl.org/info/genome/funcgen/peak_calling.html Locus identified from epigenome signal as having high signal, shown as a BigBed across the genome. Peak Short genomic sequence that is known to bind to a particular transcription factor. https://en.wikipedia.org/wiki/DNA_binding_site Motif PWM Position weight matrices TFBM Position weight matrix Transcription factor binding motif http://www.ensembl.org/info/genome/funcgen/regulation_sources.html A cell type, such as a primary tissue or lab cell line, for which we have epigenome evidence and can predict regulatory features. cell type tissue Cell line Epigenome A short sequence whose placement on the genome is known. Marker UniSTS is an NCBI resource for non-redundant Sequence Tagged Sites (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences determined by ePCR. UniSTS http://www.ensembl.org/info/genome/genebuild/xrefs.html Mapping between Ensembl genes, transcripts and proteins to the same features in other databases. general identifiers xref External reference http://www.ensembl.org/info/genome/variation/prediction/protein_function.html A tool that integrates multiple annotations into one metric for scoring the deleteriousness of single nucleotide variants. 
CADD http://www.ensembl.org/info/genome/variation/prediction/protein_function.html A tool for predicting the pathogenicity of single nucleotide variants using an ensemble method. REVEL http://www.ensembl.org/info/genome/variation/prediction/protein_function.html A tool for assessing the functional impact of single nucleotide variants based on evolutionary conservation of the affected amino acid in protein homologues. MutationAssessor http://www.ensembl.org/info/genome/variation/prediction/protein_function.html A tool for predicting the pathogenicity of single nucleotide variants using a logistic regression based ensemble method. MetaLR http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html The Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq to identify transcripts that match GRCh38 and are 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR. Matched Annotation between NCBI and EBI Matched Annotation from NCBI and EMBL-EBI Matched Annotation from NCBI and Ensembl MAIN MANE http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html The Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq. The MANE Select is a default transcript per human gene that is representative of biology, well-supported, expressed and highly-conserved. This transcript set matches GRCh38 and is 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR. Matched Annotation from NCBI and EBI Select Matched Annotation from NCBI and EMBL-EBI Select Matched Annotation from NCBI and Ensembl Select MAIN Select MANE Select http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html Long-read sequence data is computationally processed into non-redundant transcript models which are manually appraised by the Ensembl-Havana annotation team. 
tagine TAGENE http://www.ensembl.org/info/genome/genebuild/biotypes.html The coding sequence contains a stop codon that is translated (as supported by experimental evidence), and termination occurs instead at a canonical stop codon further downstream. It is currently unknown which codon is used to replace the translated stop codon, hence it is represented by 'X' in the protein sequence. https://en.wikipedia.org/wiki/Stop_codon#Translational_readthrough Stop codon readthrough http://www.ensembl.org/Help/Faq?id=367 DNA strand arbitrarily defined as the strand with its 5' end at the tip of the short chromosome arm (p). If a gene is forward-stranded, its sense (sequence matching cDNA) is on the forward strand. Forward strand is reverse complementary to the reverse strand. SO:0001030 + strand 1 strand positive strand Plus strand Forward strand http://www.ensembl.org/Help/Faq?id=367 DNA strand arbitrarily defined as the strand with its 5' end at the tip of the long chromosome arm (q). If a gene is reverse-stranded, its sense (sequence matching cDNA) is on the reverse strand. Reverse strand is reverse complementary to the forward strand. SO:0001031 - strand -1 strand negative strand Minus strand Reverse strand http://www.ensembl.org/info/genome/genebuild/mane.html RefSeq transcripts that match 100% across the sequence, exon/intron structure and UTRs as part of the MANE project. RefSeq Match The UniProt identifier that matches to the Ensembl transcript. This may be a UniProt protein isoform and will have a number suffix, or may just refer to a UniProt entry. UniProt Match A transcript with a premature stop codon considered likely to be subjected to targeted degradation. Nonsense-Mediated Decay is predicted to be triggered where the in-frame termination codon is found more than 50bp upstream of the final splice junction. 
https://en.wikipedia.org/wiki/Nonsense-mediated_decay Nonsense-Mediated Decay NMD Nonsense Mediated Decay A transcript with a non-ATG start codon but which still encodes a methionine since the ribosomal machinery allows non-AUG to translate as methionine in specific cases. https://en.wikipedia.org/wiki/Start_codon Non-ATG start http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html Transcripts in the MANE Plus Clinical set are additional transcripts per locus necessary to support clinical variant reporting, for example transcripts containing known Pathogenic or Likely Pathogenic clinical variants not reportable using the MANE Select set. Note there may be additional clinically relevant transcripts in the wider RefSeq and Ensembl/GENCODE sets but not yet in MANE. Matched Annotation from NCBI and EBI Plus Clinical Matched Annotation from NCBI and EMBL-EBI Plus Clinical Matched Annotation from NCBI and Ensembl Plus Clinical MAIN Plus Clinical MANE Plus Clinical http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html The full GENCODE transcript set, containing both complete transcripts and 5' and 3' incomplete transcripts. GENCODE Comprehensive http://www.ensembl.org/info/genome/genebuild/biotypes.html Alternatively spliced transcript of a protein coding gene for which we cannot define a CDS. Protein coding CDS not defined http://www.ensembl.org/info/genome/genebuild/biotypes.html Not translated in the reference genome owing to a SNP/DIP but in other individuals/haplotypes/strains the transcript is translated. Replaces the polymorphic_pseudogene transcript biotype. Protein coding LOF
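The 50bp rule quoted in the Nonsense-Mediated Decay entry above can be illustrated with a small sketch. It is not Ensembl's implementation. The function names, the use of 1-based spliced-transcript coordinates, and the choice to measure the distance from the last base of the termination codon are all assumptions made here for illustration.

```python
from typing import Optional, Sequence

# Threshold quoted in the glossary entry: "more than 50bp upstream".
NMD_DISTANCE_BP = 50


def final_junction_position(exon_lengths: Sequence[int]) -> Optional[int]:
    """Return the spliced-transcript coordinate of the last base before the
    final exon-exon junction, or None for a single-exon transcript.

    exon_lengths lists exon sizes in 5' to 3' transcript order (hypothetical
    input layout chosen for this sketch).
    """
    if len(exon_lengths) < 2:
        return None  # no splice junction at all
    return sum(exon_lengths[:-1])


def is_nmd_candidate(exon_lengths: Sequence[int], stop_codon_end: int) -> bool:
    """Apply the 50bp rule.

    stop_codon_end is the 1-based spliced-transcript coordinate of the last
    base of the in-frame termination codon (an assumption of this sketch).
    """
    junction = final_junction_position(exon_lengths)
    if junction is None:
        return False
    # Positive distance means the stop codon lies upstream of the junction.
    distance = junction - stop_codon_end
    return distance > NMD_DISTANCE_BP


if __name__ == "__main__":
    exons = [200, 150, 300]  # final junction follows base 350
    print(is_nmd_candidate(exons, stop_codon_end=250))
    print(is_nmd_candidate(exons, stop_codon_end=320))
```

With exons of 200, 150 and 300 bases, the final junction follows base 350, so a termination codon ending at base 250 lies 100 bases upstream and is flagged, while one ending at base 320 lies only 30 bases upstream and is not.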