ensembl-glossary
definition
hasDbXref
related
A genomic locus that has been annotated.
SO:0000001
element
Feature
Genome annotation
Genomic locus where transcription occurs. A gene may have one or more transcripts, which may or may not encode proteins.
SO:0000704
https://en.wikipedia.org/wiki/Gene
Gene
A transcript is the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons being separated by introns. Exons and introns are both transcribed, and the introns are then spliced out. Transcripts may or may not encode a protein.
SO:0000673
isoform
Splice variant
Transcript
Expressed Sequence Tag. Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs. Usually short, single reads from a tissue or stage in development.
Expressed sequence tag
EST
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
The Transcript Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users, based on the type and quality of the alignments used to annotate the transcript.
TSL
Transcript support level
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
A transcript where all splice junctions are supported by at least one non-suspect mRNA.
Transcript support level 1
TSL 1
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
A transcript where the best supporting mRNA is flagged as suspect or the support is from multiple ESTs
Transcript support level 2
TSL 2
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
A transcript where the only support is from a single EST
Transcript support level 3
TSL 3
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
A transcript where the best supporting EST is flagged as suspect
Transcript support level 4
TSL 4
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
A transcript where no single transcript supports the model structure.
Transcript support level 5
TSL 5
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
A transcript that was not analysed for TSL.
Transcript support level not applicable
TSL NA
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods to identify the most functionally important transcript(s) of a gene.
APPRIS
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
Transcript(s) expected to code for the main functional isoform based solely on the APPRIS core modules.
APPRIS Principal 1
APPRIS P1
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.
APPRIS Principal 2
APPRIS P2
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.
APPRIS Principal 3
APPRIS P3
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
Where the APPRIS core modules are unable to choose a clear principal CDS and more than one variant has a distinct (but consecutive) CCDS identifier, APPRIS selects the longest CCDS isoform as the principal variant.
APPRIS Principal 4
APPRIS P4
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
APPRIS Principal 5
APPRIS P5
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
For genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT1 marks the candidate transcript model(s) that are conserved in at least three tested species.
APPRIS Alternative 1
APPRIS ALT1
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
For genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT2 marks the candidate transcript model(s) that appear to be conserved in fewer than three tested species.
APPRIS Alternative 2
APPRIS ALT2
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
A subset of the GENCODE transcript set, containing only 5' and 3' complete transcripts.
GENCODE Basic
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
A protein-coding transcript which is missing the start codon due to incomplete evidence.
five prime incomplete
5' incomplete
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
A protein-coding transcript which is missing the stop codon due to incomplete evidence.
three prime incomplete
3' incomplete
http://www.ensembl.org/info/genome/genebuild/canonical.html
A single transcript chosen for a gene which is the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt. This is defined in detail on http://www.ensembl.org/info/genome/genebuild/canonical.html
canonical transcript
Canonical
Ensembl canonical
http://www.ensembl.org/info/genome/genebuild/ccds.html
A coding sequence in the Consensus Coding Sequence Set is consistently annotated between Ensembl, MGI, HGNC and NCBI. The long term goal is to support convergence towards a standard set of gene annotations on the human genome.
https://en.wikipedia.org/wiki/Consensus_CDS_Project
Consensus Coding Sequence
Consensus CDS
CCDS
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A gene or transcript classification.
Gene type
Transcript type
Type
Biotype
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Gene/transcript that contains an open reading frame (ORF).
SO:0001217
https://en.wikipedia.org/wiki/Gene
Coding
Protein coding
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Gene/transcript that doesn't contain an open reading frame (ORF).
https://en.wikipedia.org/wiki/Non-coding_RNA
Processed transcript
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A non-coding gene/transcript >200bp in length
SO:0002127
https://en.wikipedia.org/wiki/Long_non-coding_RNA
Long non-coding RNA
lncRNA
Long non-coding RNA (lncRNA)
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Transcripts which are known from the literature to not be protein coding.
SO:0001263
https://en.wikipedia.org/wiki/Non-coding_RNA
Non coding
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Transcripts where ditag and/or published experimental data strongly support the existence of long (>200bp) non-coding transcripts that overlap the 3'UTR of a protein-coding locus on the same strand.
three prime overlapping ncRNA
3' overlapping ncRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Transcripts that overlap the genomic span (i.e. exons or introns) of a protein-coding locus on the opposite strand.
https://en.wikipedia.org/wiki/Antisense_RNA
Antisense
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Transcripts from a long intergenic non-coding RNA locus, with a length >200bp. Requires lack of coding potential; may not be conserved between species.
SO:0001641
https://en.wikipedia.org/wiki/Long_non-coding_RNA#Long_intergenic_non-coding_RNAs_(lincRNA)
Long intergenic RNA
Long interspersed RNA
lincRNA
lincRNA (long intergenic ncRNA)
http://www.ensembl.org/info/genome/genebuild/biotypes.html
An alternatively spliced transcript believed to contain intronic sequence relative to other, coding, transcripts of the same gene.
Retained intron
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A long non-coding transcript in introns of a coding gene that does not overlap any exons.
SO:0002184
Sense intronic
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A long non-coding transcript that contains a coding gene in its intron on the same strand.
SO:0002183
Sense overlapping
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Unspliced lncRNAs that are several kb in size.
Macro lncRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A non-coding gene.
ncRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A small RNA (~22bp) that silences the expression of target mRNA.
SO:0001265
https://en.wikipedia.org/wiki/MicroRNA
micro RNA
miRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
An RNA that interacts with piwi proteins involved in genetic silencing.
SO:0001638
https://en.wikipedia.org/wiki/Piwi-interacting_RNA
piwi-interacting RNA
piRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
The RNA component of a ribosome.
SO:0001637
https://en.wikipedia.org/wiki/Ribosomal_RNA
ribosomal RNA
rRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A small RNA (20-25bp) that silences the expression of target mRNA through the RNAi pathway.
https://en.wikipedia.org/wiki/Small_interfering_RNA
short interfering RNA
silencing RNA
small interfering RNA
siRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Small RNA molecules that are found in the cell nucleus and are involved in the processing of pre-messenger RNAs.
https://en.wikipedia.org/wiki/Small_nuclear_RNA
U-RNA
small nuclear RNA
snRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Small RNA molecules that are found in the cell nucleolus and are involved in the post-transcriptional modification of other RNAs.
SO:0001267
https://en.wikipedia.org/wiki/Small_nucleolar_RNA
small nucleolar RNA
snoRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A transfer RNA, which acts as an adaptor molecule for translation of mRNA.
SO:0001272
https://en.wikipedia.org/wiki/Transfer_RNA
transfer RNA
tRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Short non coding RNA genes that form part of the vault ribonucleoprotein complex.
SO:0000404
https://en.wikipedia.org/wiki/Vault_RNA
vaultRNA
Miscellaneous RNA. A non-coding RNA that cannot be classified.
miscRNA
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) which disrupt the ORF. Thought to have arisen through duplication followed by loss of function.
SO:0000336
https://en.wikipedia.org/wiki/Pseudogene
Pseudogene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Pseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome.
SO:0000043
https://en.wikipedia.org/wiki/Pseudogene#Processed
Processed pseudogene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Pseudogene that can contain introns, since it is produced by gene duplication.
SO:0001760
https://en.wikipedia.org/wiki/Pseudogene#Non-processed
Non-processed pseudogene
Unprocessed pseudogene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'.
SO:0002109, SO:0002107, SO:0002108
Transcribed pseudogene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed' and 'Unprocessed'.
SO:0002105, SO:0002106
Translated pseudogene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Pseudogene owing to a SNP/indel but in other individuals/haplotypes/strains the gene is translated.
Polymorphic pseudogene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A species specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species.
SO:0001759
https://en.wikipedia.org/wiki/Pseudogene#Unitary_pseudogenes
Unitary pseudogene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Inactivated immunoglobulin gene.
IG pseudogene
http://www.ensembl.org/info/genome/genebuild/biotypes.html, http://www.ensembl.org/info/genome/genebuild/ig_tcr.html
Immunoglobulin gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
SO:0002122
https://en.wikipedia.org/wiki/V(D)J_recombination
Immunoglobulin gene
IG gene
http://www.ensembl.org/info/genome/genebuild/biotypes.html, http://www.ensembl.org/info/genome/genebuild/ig_tcr.html
T cell receptor gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
SO:0002133
https://en.wikipedia.org/wiki/V(D)J_recombination
T cell receptor gene
TR gene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Regions with EST clusters that have polyA features that could indicate the presence of protein coding genes. These require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.
TEC (To be Experimentally Confirmed)
http://www.ensembl.org/info/genome/genebuild/biotypes.html
A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).
read-through
Readthrough
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Variable chain immunoglobulin gene that undergoes somatic recombination before transcription
SO:0002126
https://en.wikipedia.org/wiki/V(D)J_recombination
IG V gene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Diversity chain immunoglobulin gene that undergoes somatic recombination before transcription
SO:0002124
https://en.wikipedia.org/wiki/V(D)J_recombination
IG D gene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Joining chain immunoglobulin gene that undergoes somatic recombination before transcription
SO:0002125
https://en.wikipedia.org/wiki/V(D)J_recombination
IG J gene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Constant chain immunoglobulin gene that undergoes somatic recombination before transcription
SO:0002123
https://en.wikipedia.org/wiki/V(D)J_recombination
IG C gene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Variable chain T cell receptor gene that undergoes somatic recombination before transcription
SO:0002137
https://en.wikipedia.org/wiki/V(D)J_recombination
TR V gene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Diversity chain T cell receptor gene that undergoes somatic recombination before transcription
SO:0002135
https://en.wikipedia.org/wiki/V(D)J_recombination
TR D gene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Joining chain T cell receptor gene that undergoes somatic recombination before transcription
SO:0002136
https://en.wikipedia.org/wiki/V(D)J_recombination
TR J gene
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Constant chain T cell receptor gene that undergoes somatic recombination before transcription
SO:0002134
https://en.wikipedia.org/wiki/V(D)J_recombination
TR C gene
The sequence of the spliced exons of a transcript expressed in DNA notation (T rather than U), representing the coding or sense strand. The cDNA contains the whole sequence of the RNA, including coding and untranslated sequence.
SO:0000756
https://en.wikipedia.org/wiki/Complementary_DNA
complementary DNA
cDNA
CoDing Sequence. The region of a cDNA which is translated. In Ensembl displays, the stop codon is included as part of the CDS sequence.
SO:0000316
ORF
open reading frame
translatable sequence
translated sequence
coding sequence
CDS
A sequence of amino acids, translated from a CDS.
https://en.wikipedia.org/wiki/Peptide
Translation
Protein
Peptide
A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics.
https://en.wikipedia.org/wiki/Protein_domain
Domain
Protein domain
Transcribed genomic region that remains in the RNA after splicing, includes both the CDS and the UTRs.
SO:0000147
https://en.wikipedia.org/wiki/Exon
Exon
Transcribed genomic region that is removed from the RNA by splicing.
SO:0000188
https://en.wikipedia.org/wiki/Intron
Intron
Three consecutive nucleotides in either DNA or RNA that code for an amino acid (or signal the end of translation).
SO:0000360
https://en.wikipedia.org/wiki/Genetic_code
Codon
Exons that are not spliced out, therefore present in all transcripts of a given gene.
Constitutive exon
The position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base and two means between the second and third base. Exons have a start and end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1.
Phase
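The phase rule can be sketched in a few lines. This is an illustrative sketch only: the function name `boundary_phase` and its argument are hypothetical, and it assumes the caller already knows how many coding (CDS) bases precede the boundary in the spliced transcript, passing `None` when the boundary lies in a non-coding region.

```python
from typing import Optional


def boundary_phase(coding_bases_before: Optional[int]) -> int:
    """Return the phase of an exon/intron boundary.

    coding_bases_before is the number of CDS bases upstream of the boundary
    in the spliced transcript, or None for a boundary in a non-coding region.
    (Hypothetical helper, not part of any Ensembl API.)
    """
    if coding_bases_before is None:
        return -1  # non-coding boundary
    if coding_bases_before < 0:
        raise ValueError("coding_bases_before must be non-negative")
    # 0: boundary falls between codons
    # 1: boundary falls between the first and second base of a codon
    # 2: boundary falls between the second and third base of a codon
    return coding_bases_before % 3


for n in (None, 6, 7, 8):
    print(n, boundary_phase(n))
```

Under this sketch, an exon's end phase equals the start phase of the next exon, because the intron between them contributes no coding bases.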
Sequence 5' or 3' to a DNA or RNA sequence of interest (for example gene, transcript, SNP or repeat).
SO:0000239
flank
flanking region
Flanking sequence
The region of a coding cDNA which is not translated.
SO:0000203
https://en.wikipedia.org/wiki/Untranslated_region
UTR
Untranslated region
The region of a coding cDNA upstream of the start codon which is not translated.
SO:0000204
https://en.wikipedia.org/wiki/Untranslated_region
5' untranslated region
five prime untranslated region
five prime UTR
5' UTR
The region of a coding cDNA downstream of the stop codon which is not translated.
SO:0000205
https://en.wikipedia.org/wiki/Untranslated_region
3' untranslated region
three prime untranslated region
three prime UTR
3' UTR
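As a rough illustration of how the cDNA, CDS and UTR definitions fit together, the sketch below splits a coding cDNA into its three parts. The function `split_cdna` and the 0-based, end-exclusive coordinates are assumptions made for this example; following the CDS definition above, the stop codon is kept inside the CDS.

```python
from typing import NamedTuple


class TranscriptParts(NamedTuple):
    five_prime_utr: str
    cds: str
    three_prime_utr: str


def split_cdna(cdna: str, cds_start: int, cds_end: int) -> TranscriptParts:
    """Split a coding cDNA into 5' UTR, CDS and 3' UTR.

    cds_start and cds_end are 0-based, end-exclusive offsets into the cDNA
    (an assumption of this sketch); the stop codon is part of the CDS.
    """
    if not 0 <= cds_start <= cds_end <= len(cdna):
        raise ValueError("CDS coordinates must lie within the cDNA")
    return TranscriptParts(
        five_prime_utr=cdna[:cds_start],   # upstream of the start codon
        cds=cdna[cds_start:cds_end],       # start codon through stop codon
        three_prime_utr=cdna[cds_end:],    # downstream of the stop codon
    )


parts = split_cdna("GGCATGAAATAACCTT", 3, 12)
print(parts)
```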
http://www.ensembl.org/info/genome/compara/homologue_types.html
Specific genes that are descended from the same common sequence in an ancestor.
SO:0000853, FHOM_0000007
https://en.wikipedia.org/wiki/Sequence_homology
Homologs
Homologues
http://www.ensembl.org/info/genome/compara/homology_method.html
A representation of the evolutionary relationship between homologues, constructed using the Ensembl gene tree pipeline.
https://en.wikipedia.org/wiki/Phylogenetic_tree
Protein tree
Gene tree
http://www.ensembl.org/info/genome/compara/homologue_types.html
Orthologues are genes derived from a common ancestor through vertical descent (or speciation) and can be thought of as the direct evolutionary counterpart.
FHOM_0000017
https://en.wikipedia.org/wiki/Sequence_homology#Orthology
Orthologs
Orthologues
http://www.ensembl.org/info/genome/compara/homologue_types.html
A type of orthologue assigned for a pair of species where only one copy is found in each species.
FHOM_0000020
one-to-one orthologs
one-to-one orthologues
1-to-1 orthologs
1-to-1 orthologues
http://www.ensembl.org/info/genome/compara/homologue_types.html
A type of orthologue assigned for a pair of species where one gene in one species is orthologous to multiple genes in the other species, due to (a) duplication event(s) in the second species.
FHOM_0000034
one-to-many orthologs
one-to-many orthologues
1-to-many orthologs
1-to-many orthologues
http://www.ensembl.org/info/genome/compara/homologue_types.html
A type of orthologue assigned for a pair of species where multiple orthologues are found in both species, where the duplication events in both species occurred after the speciation event.
FHOM_0000048
many-to-many orthologs
Many-to-many orthologues
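The three orthologue types differ only in how many copies each species of the pair carries, which can be sketched as below. This is a simplification for illustration: the function name `orthology_type` is hypothetical, and it takes gene counts as given rather than deriving them from a gene tree.

```python
def orthology_type(genes_in_species_a: int, genes_in_species_b: int) -> str:
    """Name the orthologue type for a pair of species from copy counts.

    Hypothetical helper: the counts are the numbers of mutually orthologous
    genes found in each of the two species.
    """
    if genes_in_species_a < 1 or genes_in_species_b < 1:
        raise ValueError("each species needs at least one gene in the group")
    if genes_in_species_a == 1 and genes_in_species_b == 1:
        return "one-to-one"
    if genes_in_species_a == 1 or genes_in_species_b == 1:
        return "one-to-many"   # duplication in only one of the two species
    return "many-to-many"      # duplications in both species after speciation


for counts in ((1, 1), (1, 3), (2, 4)):
    print(counts, orthology_type(*counts))
```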
http://www.ensembl.org/info/genome/compara/homologue_types.html
Genes (homologues) that have evolved by duplication.
FHOM_0000011
https://en.wikipedia.org/wiki/Sequence_homology#Paralogy
Paralogs
Paralogues
http://www.ensembl.org/info/genome/compara/homologue_types.html
Members of the same gene family in different species that are not direct orthologues. In a gene tree, these genes are separated by a duplication node.
FHOM_0000050
out paralogs
out paralogues
Between species paralogs
Between species paralogues
http://www.ensembl.org/info/genome/compara/homologue_types.html
Pairs of genes in a species that occur together in the same tree, but are actually two halves of the same gene split partway along.
Gene split
http://www.ensembl.org/info/genome/compara/homologue_types.html
Paralogues which are very far away from the other members of a paralogue family. They are part of the same super-family, but the precise taxonomic relationship to other members is undefined, as the trees are too large to compute.
Other paralogs
Other paralogues
http://www.ensembl.org/info/genome/compara/homologue_types.html
Two or more versions of a duplicated gene in a single species. In a gene tree, the genes are separated by a duplication node.
FHOM_0000049
In paralogs
In paralogues
Within species paralogs
Within species paralogues
Pairs of genes in a polyploid genome that underwent (a) hybridisation event(s). The original genes were orthologues in the two (or more) species that hybridised, and now occur in the same species. Since they did not arise through a duplication event, they are not paralogues.
FHOM_0000073
https://en.wikipedia.org/wiki/Polyploid#Homoeologous
Homoeologues
http://www.ensembl.org/info/genome/variation/index.html
Locus where the sequence differs between individuals of the same species
SO:0001060
https://en.wikipedia.org/wiki/Genetic_variation
mutation
variation
Polymorphism
Variant
Genetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure).
SO:0000771
https://en.wikipedia.org/wiki/Quantitative_trait_locus
Quantitative trait locus
QTL
Genetic loci where allelic variation is associated with expression levels of other genes.
https://en.wikipedia.org/wiki/Expression_quantitative_trait_loci
Expression quantitative trait locus
eQTL
http://www.ensembl.org/info/genome/variation/data_description.html#evidence_status
Codes that reflect the amount and type of evidence that supports the existence of a variant.
Evidence status
http://www.ensembl.org/info/genome/variation/index.html
Variant that only affects a small locus
Sequence variant
http://www.ensembl.org/info/genome/variation/index.html
Variant that affects a large locus
SO:0001537
Structural variant
http://www.ensembl.org/info/genome/variation/index.html
Single Nucleotide Polymorphism, substitution of a single nucleotide for another nucleotide
SO:0000694
https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism
SNV
base-pair substitution
point mutation
single nucleotide variation
Single nucleotide polymorphism
SNP
http://www.ensembl.org/info/genome/variation/index.html
Insertion of one or more nucleotides
SO:0000667
https://en.wikipedia.org/wiki/Insertion_(genetics)
Insertion
http://www.ensembl.org/info/genome/variation/index.html
Deletion of one or more nucleotides
SO:0000159
https://en.wikipedia.org/wiki/Deletion_(genetics)
Deletion
http://www.ensembl.org/info/genome/variation/index.html
An insertion and a deletion, affecting two or more nucleotides
SO:1000032
https://en.wikipedia.org/wiki/Indel
Indel
http://www.ensembl.org/info/genome/variation/index.html
A sequence alteration where the length of the deleted sequence is the same as the length of the inserted sequence.
SO:1000002
Substitution
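The sequence variant classes defined above (SNP, insertion, deletion, indel, substitution) can be told apart from the lengths of the reference and alternative alleles. The following is a minimal sketch under stated assumptions: `classify_alleles` is a hypothetical name, and "-" is assumed to denote an empty allele.

```python
def classify_alleles(ref: str, alt: str) -> str:
    """Classify a small sequence variant from its two allele strings.

    Assumes "-" stands for an empty allele; hypothetical helper for
    illustration, not an Ensembl function.
    """
    ref = "" if ref == "-" else ref.upper()
    alt = "" if alt == "-" else alt.upper()
    if ref == alt:
        raise ValueError("reference and alternative alleles are identical")
    if not ref:
        return "insertion"       # nothing deleted, bases inserted
    if not alt:
        return "deletion"        # bases deleted, nothing inserted
    if len(ref) == len(alt):
        # deleted and inserted sequences have the same length
        return "SNV" if len(ref) == 1 else "substitution"
    return "indel"               # both a deletion and an insertion


for ref, alt in (("A", "G"), ("-", "AT"), ("CT", "-"), ("AC", "GT"), ("AC", "G")):
    print(ref, alt, classify_alleles(ref, alt))
```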
http://www.ensembl.org/info/genome/variation/index.html
Copy Number Variation: increases or decreases the copy number of a given locus. Subcategorised into Loss and Gain compared to the reference.
SO:0001019
https://en.wikipedia.org/wiki/Copy-number_variation
CNP
copy number gain
copy number loss
copy number polymorphism
deletion
duplication
Copy number variation
CNV
http://www.ensembl.org/info/genome/variation/index.html
A continuous nucleotide sequence is inverted in the same position
SO:1000036
https://en.wikipedia.org/wiki/Chromosomal_inversion
Inversion
http://www.ensembl.org/info/genome/variation/index.html
A region of nucleotide sequence that has translocated to a new position
SO:0000199
https://en.wikipedia.org/wiki/Chromosomal_translocation
chromosome rearrangement
Translocation
http://www.ensembl.org/info/genome/variation/index.html
One of a number of alternative forms of the same genetic locus/variant.
SO:0001023
Allele (variant)
http://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
Different versions of a gene found between the primary assembly and a patch or genome haplotype.
alternative sequence gene
haplotype gene
patch gene
Alt gene
Allele (gene)
The allele of a variant found in the reference genome currently being studied. The reference allele is not necessarily the major or ancestral allele.
Reference allele
Any allele of a variant which is not in the reference genome currently being studied. The alternative allele is not necessarily the minor allele.
Alternative allele
http://www.ensembl.org/info/genome/variation/data_description.html#maf
The allele which is most frequent in the global population, defined in human by the 1000 Genomes Project. The major allele may be the reference or the alternative allele, and may or may not be the ancestral allele.
https://en.wikipedia.org/wiki/Allele_frequency
Major allele
http://www.ensembl.org/info/genome/variation/data_description.html#maf
The allele which is the second most frequent in the global population, defined in human by the 1000 Genomes Project. The minor allele may be the reference or the alternative allele, and may or may not be the ancestral allele.
https://en.wikipedia.org/wiki/Allele_frequency
Minor allele
An allele which has only been identified in one individual or one family. A private allele may be the reference or the alternative allele, and may or may not be the ancestral allele.
Private allele
The allele which occurs at this locus in closely related species and is thought to reflect the allele present at the time of speciation. The ancestral allele may be the reference or the alternative allele, and the major or minor allele.
Ancestral allele
http://www.ensembl.org/info/genome/variation/data_description.html#maf
The frequency of the second most common allele in the specified population.
https://en.wikipedia.org/wiki/Allele_frequency
MAF
Minor allele frequency
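Because the MAF is defined as the frequency of the second most common allele, it can be computed directly from allele counts in one population. The sketch below assumes counts are supplied as a mapping from allele to observed count; `minor_allele_frequency` is a hypothetical name.

```python
from typing import Mapping


def minor_allele_frequency(allele_counts: Mapping[str, int]) -> float:
    """Frequency of the second most common allele in one population.

    allele_counts maps each allele to the number of times it was observed
    (hypothetical input format). A monomorphic site has a MAF of 0.
    """
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles were observed")
    ranked = sorted(allele_counts.values(), reverse=True)
    if len(ranked) < 2:
        return 0.0
    return ranked[1] / total  # second most common, not the rarest


maf = minor_allele_frequency({"A": 70, "G": 25, "T": 5})
print(maf)
```

Note that for a site with three or more alleles this is not the frequency of the rarest allele: in the example, G (25 of 100 observations) is the minor allele, not T.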
http://www.ensembl.org/info/genome/variation/data_description.html#maf
The highest minor allele frequency observed in any population typed for this variant. For human this includes the 1000 Genomes Project, gnomAD and UK10K.
https://en.wikipedia.org/wiki/Allele_frequency
Highest population minor allele frequency
HPMAF
Highest population MAF
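The highest population MAF is the maximum of the per-population minor allele frequencies. A minimal sketch, assuming allele counts are available per population; the function name and the population labels `POP_A` and `POP_B` are made up for the example.

```python
from typing import Mapping, Tuple


def highest_population_maf(
    population_counts: Mapping[str, Mapping[str, int]]
) -> Tuple[str, float]:
    """Return (population, MAF) for the population with the highest MAF.

    population_counts maps a population name to its allele counts
    (hypothetical input format). Populations with no observations are skipped.
    """
    best = None
    for population, counts in population_counts.items():
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = sorted(counts.values(), reverse=True)
        # MAF: frequency of the second most common allele in this population
        maf = ranked[1] / total if len(ranked) > 1 else 0.0
        if best is None or maf > best[1]:
            best = (population, maf)
    if best is None:
        raise ValueError("no population has any observed alleles")
    return best


hpmaf = highest_population_maf(
    {"POP_A": {"C": 90, "T": 10}, "POP_B": {"C": 60, "T": 40}}
)
print(hpmaf)
```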
http://www.ensembl.org/info/genome/variation/data_description.html#maf
The frequency of the second most common allele in the global population, defined in human by the 1000 Genomes Project phase 3.
Global MAF
http://www.ensembl.org/info/genome/variation/data_description.html
The specific alleles that are present in an individual's genome. In diploid organisms two alleles make up the genotype (except for the sex chromosomes).
Zygosity
Genotype
http://www.ensembl.org/info/genome/variation/data_description.html
A measurable locus that varies within a population.
SO:0001645
Genetic marker
http://www.ensembl.org/info/genome/variation/data_description.html
Two or more adjacent copies of a region (of length greater than 1).
SO:0000705
Tandem repeat
http://www.ensembl.org/info/genome/variation/data_description.html
An insertion of sequence from the Alu family of mobile elements.
SO:0002063
Alu insertion
http://www.ensembl.org/info/genome/variation/data_description.html
A structural sequence alteration or rearrangement encompassing one or more genome fragments, with four or more breakpoints.
SO:0001784
Complex structural alteration
http://www.ensembl.org/info/genome/variation/data_description.html
When no simple or well defined DNA mutation event describes the observed DNA change, the keyword "complex" should be used. Usually there are multiple equally plausible explanations for the change.
SO:1000005
Complex substitution
http://www.ensembl.org/info/genome/variation/data_description.html
A rearrangement breakpoint between two different chromosomes.
SO:0001873
Interchromosomal breakpoint
http://www.ensembl.org/info/genome/variation/data_description.html
A translocation where the regions involved are from different chromosomes.
SO:0002060
Interchromosomal translocation
http://www.ensembl.org/info/genome/variation/data_description.html
A rearrangement breakpoint within the same chromosome.
SO:0001874
Intrachromosomal breakpoint
http://www.ensembl.org/info/genome/variation/data_description.html
A translocation where the regions involved are from the same chromosome.
SO:0002061
Intrachromosomal translocation
http://www.ensembl.org/info/genome/variation/data_description.html
A functional variant whereby the sequence alteration causes a loss of function of one allele of a gene.
SO:0001786
Loss of heterozygosity
http://www.ensembl.org/info/genome/variation/data_description.html
A deletion of a mobile element when comparing a reference sequence (has mobile element) to an individual sequence (does not have mobile element).
SO:0002066
Mobile element deletion
http://www.ensembl.org/info/genome/variation/data_description.html
A kind of insertion where the inserted sequence is a mobile element.
SO:0001837
Mobile element insertion
http://www.ensembl.org/info/genome/variation/data_description.html
An insertion the sequence of which cannot be mapped to the reference genome.
SO:0001838
Novel sequence insertion
http://www.ensembl.org/info/genome/variation/data_description.html
A variation that expands or contracts a tandem repeat with regard to a reference.
SO:0002096
Short tandem repeat variant
http://www.ensembl.org/info/genome/variation/data_description.html
A duplication consisting of 2 identical adjacent regions.
SO:1000173
Tandem duplication
http://www.ensembl.org/info/genome/variation/data_description.html
A DNA sequence used experimentally to detect the presence or absence of a complementary nucleic acid.
SO:0000051
Probe
http://www.ensembl.org/info/genome/variation/predicted_data.html
The effect that the variant has on each feature that it overlaps. A variant will have a consequence for each feature that it overlaps.
Variant consequence
http://www.ensembl.org/info/genome/variation/predicted_data.html
A subjective classification of the severity of the variant consequence, based on agreement with SNPEff.
Variant impact
http://www.ensembl.org/info/genome/variation/predicted_data.html
The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay.
HIGH
High impact variant consequence
http://www.ensembl.org/info/genome/variation/predicted_data.html
A non-disruptive variant that might change protein effectiveness.
MODERATE
Moderate impact variant consequence
http://www.ensembl.org/info/genome/variation/predicted_data.html
A variant that is assumed to be mostly harmless or unlikely to change protein behaviour.
LOW
Low impact variant consequence
http://www.ensembl.org/info/genome/variation/predicted_data.html
Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact.
MODIFIER
Modifier impact variant consequence
http://www.ensembl.org/info/genome/variation/predicted_data.html
A feature ablation whereby the deleted region includes a transcript feature
SO:0001893
Transcript ablation
http://www.ensembl.org/info/genome/variation/predicted_data.html
A splice variant that changes the 2 base region at the 3' end of an intron
SO:0001574
Splice acceptor variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A splice variant that changes the 2 base region at the 5' end of an intron
SO:0001575
Splice donor variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript
SO:0001587
Stop gained
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant which causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three
SO:0001589
Frameshift variant
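The frameshift rule above comes down to a length check. The following is a minimal sketch in Python, assuming VCF-style reference and alternate alleles that lie entirely within coding sequence; the function name `is_frameshift` is hypothetical and is not part of any Ensembl API.

```python
def is_frameshift(ref_allele: str, alt_allele: str) -> bool:
    """Return True when the net inserted or deleted length is not a multiple of three.

    Hypothetical helper: assumes both alleles fall entirely within coding sequence.
    """
    net_change = len(alt_allele) - len(ref_allele)
    # Python's % always returns a non-negative result for a positive modulus,
    # so deletions (negative net change) are handled by the same test.
    return net_change % 3 != 0


if __name__ == "__main__":
    print(is_frameshift("A", "AT"))    # one base inserted
    print(is_frameshift("A", "ATGC"))  # three bases inserted
    print(is_frameshift("ATGC", "A"))  # three bases deleted
```

An insertion or deletion whose net length is a multiple of three keeps the reading frame and would instead fall under the inframe insertion or inframe deletion terms.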
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript
SO:0001578
Stop lost
http://www.ensembl.org/info/genome/variation/predicted_data.html
A codon variant that changes at least one base of the canonical start codon
SO:0002012
Start lost
http://www.ensembl.org/info/genome/variation/predicted_data.html
A feature amplification of a region containing a transcript
SO:0001889
Transcript amplification
http://www.ensembl.org/info/genome/variation/predicted_data.html
An inframe non synonymous variant that inserts bases into the coding sequence
SO:0001821
Inframe insertion
http://www.ensembl.org/info/genome/variation/predicted_data.html
An inframe non synonymous variant that deletes bases from the coding sequence
SO:0001822
Inframe deletion
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved
SO:0001583
Missense variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence_variant which is predicted to change the protein encoded in the coding sequence
SO:0001818
Protein altering variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron
SO:0001630
Splice region variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed
SO:0001626
Incomplete terminal codon variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant where at least one base in the terminator codon is changed, but the terminator remains
SO:0001567
Stop retained variant
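The stop gained, stop lost and stop retained definitions differ only in whether the reference and the altered codon are terminators. A rough sketch of that comparison in Python follows. It assumes the standard genetic code (TAA, TAG, TGA as stop codons) and DNA codons on the coding strand; the function name `classify_stop_change` and the returned strings are illustrative, not an Ensembl interface.

```python
from typing import Optional

# Stop codons of the standard genetic code (assumption: no alternative codon tables).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_stop_change(ref_codon: str, alt_codon: str) -> Optional[str]:
    """Classify a single-codon change with respect to the terminator codon.

    Returns None when the change does not involve a stop codon at all.
    """
    ref, alt = ref_codon.upper(), alt_codon.upper()
    if len(ref) != 3 or len(alt) != 3:
        raise ValueError("both codons must be exactly three bases long")
    if ref == alt:
        return None  # nothing changed
    ref_is_stop = ref in STOP_CODONS
    alt_is_stop = alt in STOP_CODONS
    if not ref_is_stop and alt_is_stop:
        return "stop_gained"
    if ref_is_stop and not alt_is_stop:
        return "stop_lost"
    if ref_is_stop and alt_is_stop:
        return "stop_retained_variant"
    return None


if __name__ == "__main__":
    for ref, alt in [("TGG", "TGA"), ("TAA", "CAA"), ("TAA", "TAG")]:
        print(ref, alt, classify_stop_change(ref, alt))
```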
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant where there is no resulting change to the encoded amino acid
SO:0001819
Synonymous variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant that changes the coding sequence
SO:0001580
Coding sequence variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A transcript variant located within the sequence of the mature miRNA
SO:0001620
Mature miRNA variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A UTR variant of the 5' UTR
SO:0001623
5 prime UTR variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A UTR variant of the 3' UTR
SO:0001624
3 prime UTR variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant that changes non-coding exon sequence in a non-coding transcript
SO:0001792
Non coding transcript exon variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A transcript variant occurring within an intron
SO:0001627
Intron variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A variant in a transcript that is the target of NMD
SO:0001621
NMD transcript variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A transcript variant of a non coding RNA gene
SO:0001619
Non coding transcript variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant located 5' of a gene
SO:0001631
Upstream gene variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant located 3' of a gene
SO:0001632
Downstream gene variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A feature ablation whereby the deleted region includes a transcription factor binding site
SO:0001895
TFBS ablation
http://www.ensembl.org/info/genome/variation/predicted_data.html
A feature amplification of a region containing a transcription factor binding site
SO:0001892
TFBS amplification
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant located within a transcription factor binding site
SO:0001782
TF binding site variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A feature ablation whereby the deleted region includes a regulatory region
SO:0001894
Regulatory region ablation
http://www.ensembl.org/info/genome/variation/predicted_data.html
A feature amplification of a region containing a regulatory region
SO:0001891
Regulatory region amplification
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant that causes the extension of a genomic feature, with regard to the reference sequence
SO:0001907
Feature elongation
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant located within a regulatory region
SO:0001566
Regulatory region variant
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant that causes the reduction of a genomic feature, with regard to the reference sequence
SO:0001906
Feature truncation
http://www.ensembl.org/info/genome/variation/predicted_data.html
A sequence variant located in the intergenic region, between genes
SO:0001628
Intergenic variant
A single letter code that represents two or more possible nucleotides at a single base locus.
https://en.wikipedia.org/wiki/International_Union_of_Pure_and_Applied_Chemistry#Amino_acid_and_nucleotide_base_codes
Ambiguity code
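As an illustration of ambiguity codes, the sketch below lists the IUPAC nucleotide codes that stand for two or more bases and converts in both directions. The helper names `expand` and `ambiguity_code` are hypothetical and chosen only for this example.

```python
# IUPAC nucleotide codes that represent two or more possible bases.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Reverse lookup from a set of bases to its single-letter code.
_BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_AMBIGUITY.items()}
_BASES_TO_CODE.update({frozenset(base): base for base in "ACGT"})


def expand(code: str) -> set:
    """Return the set of bases that a single-letter code can stand for."""
    code = code.upper()
    return set(IUPAC_AMBIGUITY.get(code, code))


def ambiguity_code(bases) -> str:
    """Return the single-letter code covering every base in `bases`."""
    key = frozenset(base.upper() for base in bases)
    if key not in _BASES_TO_CODE:
        raise ValueError(f"no IUPAC code for bases: {sorted(key)}")
    return _BASES_TO_CODE[key]


if __name__ == "__main__":
    print(sorted(expand("R")))
    print(ambiguity_code({"C", "T"}))
```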
http://www.ensembl.org/info/genome/variation/data_description.html#quality_control
Variants that failed our quality control analyses, therefore they are flagged as suspicious.
Failed variant
Flagged variant
http://www.ensembl.org/info/genome/variation/data_description.html#clin_significance
A classification of a variant's impact on disease, taken from ClinVar.
pathogenicity
Clin sig
Clinical significance
https://en.wikipedia.org/wiki/Linkage_disequilibrium
A measure of how often two variants or specific sequences are inherited together.
Linkage
LD
Linkage disequilibrium
https://en.wikipedia.org/wiki/Linkage_disequilibrium
The correlation between a pair of loci. It varies from 0 (loci are in complete linkage equilibrium) to 1 (loci are in complete linkage disequilibrium and coinherited).
r squared
r2
https://en.wikipedia.org/wiki/Linkage_disequilibrium
A normalised measure of the difference between the observed and the expected frequency of a given haplotype. If two loci are independent (i.e. in linkage equilibrium and therefore not coinherited at all), the D' value will be 0.
D prime
D'
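The r2 and D' entries can be made concrete with a small calculation. The sketch below uses the standard textbook formulas for two biallelic loci, which this glossary does not spell out: D is the observed haplotype frequency minus the product of the allele frequencies, D' divides |D| by its maximum attainable value, and r2 divides D squared by the product of the four allele frequencies. The function name `ld_stats` and its parameters are assumptions made for this example, and Ensembl's own LD calculation may differ in detail.

```python
from typing import Tuple


def ld_stats(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float, float]:
    """Return (D, |D'|, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic (0 < frequency < 1)")

    # Observed minus expected haplotype frequency under independence.
    d = p_ab - p_a * p_b

    # Largest |D| possible given the allele frequencies; depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))

    d_prime = abs(d) / d_max
    r2 = (d * d) / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


if __name__ == "__main__":
    print(ld_stats(0.25, 0.5, 0.5))  # loci in linkage equilibrium
    print(ld_stats(0.5, 0.5, 0.5))   # alleles always inherited together
```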
https://en.wikipedia.org/wiki/Linkage_disequilibrium
A set of variant alleles in a contiguous genomic region. A haplotype block describes a set of alleles which tend to be inherited together.
SO:0001024
Haplotype (variation)
The transcript sequence derived from one copy of a gene in an individual, based on the phased 1000 Genomes genotype data. CDS and protein sequences are derived from this.
Transcript haplotype
The complete set of DNA found in each cell.
https://en.wikipedia.org/wiki/Genome
Genome
http://www.ensembl.org/info/genome/genebuild/assembly.html
A computational representation of the sequence of a haploid genome, representative of a species or strain.
SO:0000353
https://en.wikipedia.org/wiki/Sequence_assembly
reference assembly
assembly
Genome assembly
http://www.ensembl.org/info/genome/genebuild/assembly.html
Refers to the number of overlapping sequences used to build a region of the assembly. High coverage indicates a good amount of sequence information while low coverage reflects a low amount of sequence information.
Coverage
https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
The underlying genome sequence, without alternative sequence included.
reference sequence
Primary assembly
https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
Genomic sequence that differs from the genomic DNA on the primary assembly. These are represented as sequence on top of the primary assembly. Provided by the GRC for human and mouse.
non-reference sequence
Alternative sequence
https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
New sequences that have been added to the genome assembly since its release. There are two types: fix and novel patches.
Patch
https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
Known variations to the primary assembly, due to variability in the human genome sequence (e.g. the highly variable MHC locus). These were included as part of the genome assembly when it was first produced.
https://en.wikipedia.org/wiki/Major_histocompatibility_complex
Haplotype (genome)
https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
Novel patches represent new allelic loci. They can usually be considered as similar to haplotypes and are likely to be reclassified as such in the next genome assembly, but not necessarily.
Novel patch
https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
Fix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence.
Fix patch
https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html
A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information.
SO:0000149
https://en.wikipedia.org/wiki/Contig
Contig
https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html
Scaffolds are sets of ordered, oriented contigs, assembled by sequence overlap. They are longer sequences than contigs, but shorter than full chromosomes.
SO:0000148
Supercontig
Scaffold
A banding pattern on a chromosome resulting from staining and examination by microscopy. These are named in terms of the chromosome arm they are found on, and are often used as a shorthand for describing the location of genomic features.
https://en.wikipedia.org/wiki/Cytogenetics
cytogenetic map
chromosome band
Cytogenetic band
A segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies.
Clone
https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html
A vector used to clone DNA fragments (100 to 300-kb insert size; average, 150 kb) from another species so that it can be replicated in bacteria. Many genomes (such as human) were sequenced by cloning segments into BACs, amplifying and sequencing the clones.
https://en.wikipedia.org/wiki/Bacterial_artificial_chromosome
Bacterial artificial chromosome
BAC
Originated from a bacterial plasmid, a YAC contains a yeast centromeric region, a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.
https://en.wikipedia.org/wiki/Yeast_artificial_chromosome
Yeast artificial chromosome
YAC
DNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced.
https://en.wikipedia.org/wiki/Cosmid
Cosmid
The actual number of bases of sequence we have for a full genome assembly, including alternative sequences and PARs, excluding gaps.
Base pairs (genome size)
The golden path is the length of the non-redundant reference assembly. It excludes alternative sequences and PARs, but includes the estimated size of the gaps.
SO:0000688
Golden path (genome size)
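The difference between the two genome size figures can be shown with a toy calculation. This is a sketch under simplifying assumptions: each sequence region is a hypothetical record holding its total length, the number of gap bases it contains, and flags for alternative sequence and PAR copies. The `Region` class and both function names are invented for illustration and do not mirror the Ensembl schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Region:
    name: str
    length: int                   # total length, gaps included
    gap_bases: int = 0            # estimated gap size within the region
    is_alternative: bool = False  # alternative sequence (patch or haplotype)
    is_par: bool = False          # duplicated copy of a pseudoautosomal region


def base_pairs(regions: Iterable[Region]) -> int:
    """Sequenced bases: every region counts, gaps do not."""
    return sum(r.length - r.gap_bases for r in regions)


def golden_path(regions: Iterable[Region]) -> int:
    """Non-redundant reference length: no alternative sequence or PAR copies, gaps included."""
    return sum(r.length for r in regions if not (r.is_alternative or r.is_par))


if __name__ == "__main__":
    toy_assembly = [
        Region("chr1", length=1000, gap_bases=100),
        Region("alt_1", length=200, is_alternative=True),
        Region("chrY_PAR", length=50, is_par=True),
    ]
    print(base_pairs(toy_assembly), golden_path(toy_assembly))
```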
https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html
Which level of the assembly we are working on.
coord_system
Coordinate system
The number of chromosomes of a genome.
Karyotype
https://www.ensembl.org/info/genome/genebuild/human_PARS.html
Small regions of sequence identity located at the tips of the short and the long arms of the X and Y chromosomes where recombination and genetic exchange take place. Genes within the pseudoautosomal region are not sex linked.
pseudoautosomal region
PAR
http://www.ensembl.org/info/docs/api/core/core_tutorial.html#slices
The term "slice" in Ensembl refers to a length of DNA sequence. A slice can be any length, from one base long to the entire length of a chromosome.
Slice
https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html
The largest continuous sequence for an organism. The official technical definition for toplevel sequences are 'sequence regions in the genome assembly that are not a component of another sequence region'. For example, when a genome is assembled into chromosomes, toplevel sequences will be chromosomes and unplaced scaffolds. If a genome has only been assembled into scaffolds, then toplevel sequences are scaffolds and unplaced contigs.
Toplevel
https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html
A scaffold that can be positioned on a chromosome based on genetic mapping information.
Placed scaffold
https://www.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html
A scaffold that cannot be positioned on a chromosome.
SO:0001875
unassigned supercontig
unplaced supercontig
Unassigned scaffold
Unplaced scaffold
Publicly available database that Ensembl imports data from.
Ensembl sources
Database from which Ensembl imports cDNA or protein sequence for gene annotation, or gene names.
Gene source database
http://www.ensembl.org/info/genome/genebuild/annotation_merge.html
The aim of GENCODE as a sub-project of the ENCODE scale-up project is to annotate all evidence-based gene features in the entire human and mouse genomes at a high accuracy. The GENCODE gene set is the default geneset in Ensembl and is equivalent to the Ensembl/HAVANA merged genes. https://www.gencodegenes.org/
https://en.wikipedia.org/wiki/GENCODE
GENCODE
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
NCBI's Reference Sequences (RefSeq) database is a curated database of Genbank's genomes, mRNAs and proteins. RefSeq attempts to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcripts, and protein products. https://www.ncbi.nlm.nih.gov/refseq/
https://en.wikipedia.org/wiki/RefSeq
RefSeq
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
Database of protein sequence and functional information, based at European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). These sequences are used as evidence for annotating Ensembl genes. http://www.uniprot.org/
https://en.wikipedia.org/wiki/UniProt
UniProt Knowledgebase
UniProtKB
UniProt
http://www.ensembl.org/info/genome/genebuild/ig_tcr.html
International ImMunoGeneTics information system. Database of immunoglobulin and T-cell receptor annotation. We collaborate with IMGT on manual annotation of somatically recombined genes. http://www.imgt.org/
https://en.wikipedia.org/wiki/HLA_Informatics_Group
ImMunoGeneTics
International ImMunoGeneTics information system
IMGT
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
UniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. These sequences are used as evidence for annotating Ensembl genes.
https://en.wikipedia.org/wiki/UniProt#UniProtKB/Swiss-Prot
UniProt/SwissProt
SwissProt
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
A subset of TrEMBL (Translated EMBL database) containing the computer-annotated protein translations of all coding sequences (CDS) present in the ENA (formerly EMBL-bank) that are not yet incorporated into the UniProt/SwissProt database. These sequences are used as evidence for annotating Ensembl genes.
https://en.wikipedia.org/wiki/UniProt#UniProtKB/TrEMBL
UniProt/TrEMBL
TrEMBL
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
An international consortium between the ENA, GenBank and DDBJ to share submissions of nucleotide sequence. These sequences are used as evidence for annotating Ensembl genes. http://www.insdc.org/
https://en.wikipedia.org/wiki/International_Nucleotide_Sequence_Database_Collaboration
International Nucleotide Sequence Database Collaboration
INSDC
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications. https://www.ebi.ac.uk/ena
https://en.wikipedia.org/wiki/European_Nucleotide_Archive
European Nucleotide Archive
ENA
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
The US branch of INSDC. https://www.ncbi.nlm.nih.gov/genbank/
https://en.wikipedia.org/wiki/GenBank
GenBank (database)
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
The Asian branch of INSDC. http://www.ddbj.nig.ac.jp/
https://en.wikipedia.org/wiki/DNA_Data_Bank_of_Japan
DNA Data Bank of Japan
DDBJ
An organised hierarchy of terms produced by the Gene Ontology Consortium, used to describe the function of proteins. GO terms are split into three subcategories: biological processes (what the protein does), cellular component (where in the cell the protein is found), and molecular function (how the protein acts). http://www.geneontology.org/
https://en.wikipedia.org/wiki/Gene_ontology
GO terms
GO
Gene Ontology
http://www.ensembl.org/info/genome/genebuild/gene_names.html
HGNC is responsible for approving unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication. HGNC gene names are used for Ensembl human genes, where available, and for orthologous genes in other species. https://www.genenames.org/
https://en.wikipedia.org/wiki/HUGO_Gene_Nomenclature_Committee
HUGO gene nomenclature committee
HGNC
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). These sequences are used as evidence for annotating Ensembl non-coding genes. http://rfam.xfam.org/
https://en.wikipedia.org/wiki/Rfam
Rfam
http://www.ensembl.org/info/genome/genebuild/annotation_sources.html
The miRBase database is a searchable database of published miRNA sequences and annotation. These sequences are used as evidence for annotating Ensembl miRNA genes. http://www.mirbase.org/
https://en.wikipedia.org/wiki/MiRBase
miRbase
http://www.ensembl.org/info/genome/genebuild/gene_names.html
MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. MGI gene names are used for Ensembl mouse genes, where available. http://www.informatics.jax.org/
https://en.wikipedia.org/wiki/Mouse_Genome_Informatics
Mouse genome informatics
MGI
http://www.ensembl.org/info/genome/genebuild/gene_names.html
An online biological database of information about the zebrafish (Danio rerio). zFIN gene names are used for Ensembl zebrafish genes, where available. https://zfin.org/
https://en.wikipedia.org/wiki/Zebrafish_Information_Network
Zebrafish Information Network
zFIN
Canonical database for the molecular biology and genetics of Saccharomyces cerevisiae, source of the annotation seen in Ensembl. https://www.yeastgenome.org/
https://en.wikipedia.org/wiki/Saccharomyces_Genome_Database
Saccharomyces Genome Database
SGD
A genome browser hosted at the University of California Santa Cruz. Ensembl collaborates with UCSC in projects such as GENCODE, CCDS and TSL. https://genome.ucsc.edu/
https://en.wikipedia.org/wiki/UCSC_Genome_Browser
Genome Browser
University of California Santa Cruz
UCSC
UCSC Genome Browser
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
Database from which Ensembl imports ChIP-seq, DNase-seq and other related datasets, which are used in the Ensembl regulatory build.
Epigenome source database
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
Project aiming to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active, by large scale functional analyses of laboratory cell lines. Used as a source for the Ensembl regulatory build. https://www.encodeproject.org/
https://en.wikipedia.org/wiki/ENCODE
ENCyclopedia Of DNA Elements
ENCODE
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
Project aiming to apply functional genomics analysis on primary cells of the haematopoietic cell lineage from healthy and diseased individuals, to produce lineage-specific epigenomes. Used as a source for the Ensembl regulatory build. http://www.blueprint-epigenome.eu/
Blueprint Epigenomes
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
Project aiming to develop publicly available reference epigenome maps from a variety of cell types. http://www.roadmapepigenomics.org/
https://en.wikipedia.org/wiki/NCBI_Epigenomics#Roadmap_Epigenomics_Project
Roadmap Epigenomics
http://www.ensembl.org/info/genome/variation/sources_documentation.html
Database from which Ensembl imports variation data, including loci, sample genotypes, population frequencies and phenotype associations.
Variation source database
http://www.ensembl.org/info/genome/variation/sources_documentation.html
The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple (short) genetic polymorphisms in human, maintained by NCBI. https://www.ncbi.nlm.nih.gov/projects/SNP/
https://en.wikipedia.org/wiki/DbSNP
Single Nucleotide Polymorphism database
dbSNP
http://www.ensembl.org/info/genome/variation/sources_documentation.html
The European Variation Archive is an open-access database of all types of genetic variation data from all species. https://www.ebi.ac.uk/eva/
European Variation Archive
EVA
http://www.ensembl.org/info/genome/variation/sources_documentation.html
dbVar is NCBI's database of human genomic structural variation — insertions, deletions, duplications, inversions, mobile elements, and translocations. https://www.ncbi.nlm.nih.gov/dbvar/
dbVar
http://www.ensembl.org/info/genome/variation/sources_documentation.html
The Database of Genomic Variants archive (DGVa) is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species. https://www.ebi.ac.uk/dgva
Database of Genomic Variants archive
DGVa
http://www.ensembl.org/info/genome/variation/sources_documentation.html
The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the human populations studied. Ensembl display sample genotypes and population frequencies from the 1000 Genomes project. http://www.internationalgenome.org/
https://en.wikipedia.org/wiki/1000_Genomes_Project
1000G
1kG
1000 Genomes project
http://www.ensembl.org/info/genome/variation/sources_documentation.html
An aggregation of publicly available whole genome and whole exome variant calling experiments in human. GnomAD was previously known as ExAC, when it contained only exome data. Ensembl display population frequencies from gnomAD. http://gnomad.broadinstitute.org/
Exome Aggregation Consortium
ExAC
Genome Aggregation Database
gnomAD
http://www.ensembl.org/info/genome/variation/sources_documentation.html
Whole genome variant calling data from humans worldwide with heart, lung, blood, and sleep disorders. Ensembl display population frequencies from TOPMed. https://www.nhlbi.nih.gov/science/trans-omics-precision-medicine-topmed-program
Trans-Omics for Precision Medicine
TOPMed
http://www.ensembl.org/info/genome/variation/sources_documentation.html
Study comparing exomes of 6000 diseased individuals with 4000 healthy individuals in the UK in order to identify disease-causing variants. Ensembl display population frequencies from the control group. https://www.uk10k.org/
UK10K
http://www.ensembl.org/info/genome/variation/sources_documentation.html
An international collaboration formed to develop a haplotype map of the human genome and thus describe the common patterns of human DNA sequence variation using genotyping. Ensembl display sample genotypes and population frequencies from the HapMap project. https://www.genome.gov/10001688/international-hapmap-project/
https://en.wikipedia.org/wiki/International_HapMap_Project
International HapMap Project
HapMap
http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html
NCBI resource that aggregates information about genomic variation and its relationship to human health. Ensembl display clinical significance and phenotypes from ClinVar. https://www.ncbi.nlm.nih.gov/clinvar/
ClinVar
Database from which Ensembl imports phenotype associations with genes and/or variants.
Phenotype source database
http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html
An online database that describes the function and phenotypes associated with human genes. Ensembl display phenotypes from OMIM and MIM morbid. https://www.omim.org/
https://en.wikipedia.org/wiki/Online_Mendelian_Inheritance_in_Man
MIM morbid
Mendelian Inheritance in Man
Online Mendelian Inheritance in Man
MIM
OMIM
http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html
An online database that describes the function and phenotypes associated with animal genes. Ensembl display phenotypes from OMIA. https://www.omia.org/
https://en.wikipedia.org/wiki/Online_Mendelian_Inheritance_in_Animals
Online Mendelian Inheritance in Animals
OMIA
http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html
A catalogue of rare disease associations. Ensembl display phenotypes from Orphanet. http://www.orpha.net/
https://en.wikipedia.org/wiki/Orphanet
Orphanet
http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html
A curated database that extracts associations between variants and genes from published genome-wide association studies in human. Ensembl display phenotypes from the GWAS catalog. https://www.ebi.ac.uk/gwas/
NHGRI-EBI Genome-wide association study catalogue
GWAS catalogue
Genome-wide association study catalog
Genome-wide association study catalogue
NHGRI-EBI GWAS Catalogue
NHGRI-EBI Genome-wide association study catalog
NHGRI-EBI GWAS Catalog
GWAS catalog
http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html
An international scientific endeavour to create and characterise the phenotype of 20,000 knockout mouse strains. Ensembl display phenotypes from the IMPC. http://www.mousephenotype.org/
https://en.wikipedia.org/wiki/International_Mouse_Phenotyping_Consortium
International Mouse Phenotyping Consortium
IMPC
http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html
Project aiming to house all publicly available QTL and association data on livestock animal species. Ensembl display phenotypes from the Animal QTLdb. https://www.animalgenome.org/cgi-bin/QTLdb/index
Animal Quantitative Trait Loci Database
Animal QTLdb
http://www.ensembl.org/info/genome/variation/sources_phenotype_documentation.html
Project aiming to collate all known (published) gene lesions responsible for human inherited disease. Full HGMD access is restricted to license holders so Ensembl supports the minimal public data release which consists of variant/mutation names and locations. http://www.hgmd.cf.ac.uk/ac/index.php
Human Gene Mutation Database
HGMD
http://www.ensembl.org/info/genome/variation/sources_documentation.html
Database of somatic variants found in cancer. COSMIC licensing does not permit redistribution of the full dataset, but mutation identifiers, locations and tumour types are available in Ensembl. http://cancer.sanger.ac.uk/cosmic
https://en.wikipedia.org/wiki/COSMIC_cancer_database
Catalog Of Somatic Mutations In Cancer
Catalogue Of Somatic Mutations In Cancer
COSMIC
Protein source database
A repository for 3D biological macromolecular structure data. Ensembl provide links out to the PDB, and use structures to display the locations of variants in proteins. http://www.ebi.ac.uk/pdbe/
https://en.wikipedia.org/wiki/Protein_Data_Bank
Protein Data Bank
PDB
A sequence of computational tasks or actions that carry out a specific function.
https://en.wikipedia.org/wiki/Algorithm
Algorithm
http://www.ensembl.org/info/genome/genebuild/genome_annotation.html
The automatic process by which Ensembl plot known RNA and protein sequence onto the genome, using sequence similarity.
Ensembl Genes
Ensembl annotation
Genebuild
Ensembl Genebuild
http://www.ensembl.org/info/genome/genebuild/manual_havana.html
Human And Vertebrate ANalysis and Annotation. The team within Ensembl who manually annotate genes and transcripts for a subset of species.
manual annotation
Havana
Ensembl Havana
http://www.ensembl.org/info/genome/funcgen/regulatory_build.html
The process by which Ensembl predict the location of regions that regulate gene expression using epigenomic evidence.
Ensembl Regulatory Build
http://www.ensembl.org/info/genome/compara/homology_method.html
The process by which Ensembl compare gene sequences in order to construct gene trees and predict homologues.
Ensembl gene tree pipeline
InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases, including PROSITE, PRINTS, Pfam, Seg, SignalP, Gene3D, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY. Ensembl run InterProScan on all protein sequences, which uses these protein signatures to identify domains. https://www.ebi.ac.uk/interpro/
https://en.wikipedia.org/wiki/InterPro
InterProScan
A sequence comparison algorithm optimised for speed which is used to search sequence databases for optimal local alignments to a query.
https://en.wikipedia.org/wiki/BLAST
Basic Local Alignment Search Tool
BLAST
An mRNA/DNA and cross-species protein sequence analysis tool to quickly find sequences of 95% and greater similarity of length 40 bases or more.
https://en.wikipedia.org/wiki/BLAT_(bioinformatics)
BLAST-Like Alignment Tool
BLAT
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
A standalone application that looks for low complexity sequences.
DUST
http://www.ensembl.org/info/genome/genebuild/automatic_coding.html
Eponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognising specific sequence motifs. Each of these is associated with a position distribution relative to the TSS. http://www.sanger.ac.uk/science/tools/eponine
Eponine
http://www.ensembl.org/info/genome/genebuild/automatic_coding.html
GeneWise is a sequence analysis tool for comparing proteins to DNA sequences allowing for introns and frameshifts. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/Tools/psa/genewise/
GeneWise
A fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
Exonerate
http://www.ensembl.org/info/genome/genebuild/2x_genomes.html
A gene build method used by Ensembl for low coverage genomes, allowing genes to be annotated that span two scaffolds by mapping to the human gene.
Projection build
An HMM-based ab initio gene prediction method, used to create a track of ab initio genes in Ensembl. http://genes.mit.edu/GENSCAN.html
https://en.wikipedia.org/wiki/GENSCAN
GENSCAN
http://www.ensembl.org/info/genome/variation/predicted_data.html#sift
A tool which predicts if missense variants are likely to affect protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids. http://sift.bii.a-star.edu.sg/
SIFT
http://www.ensembl.org/info/genome/variation/predicted_data.html#polyphen
A tool which predicts if missense variants are likely to affect protein function based on physical and comparative considerations. http://genetics.bwh.harvard.edu/pph2/
PolyPhen
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
The method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs. http://www.repeatmasker.org/
https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)
RepeatMasker
A matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties and observed substitution frequencies. The BLOSUM 62 matrix is tailored using sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment, to avoid bias from closely related family members).
https://en.wikipedia.org/wiki/BLOSUM
Blocks Substitution Matrix
BLOSUM 62
http://www.ensembl.org/info/docs/tools/vep/index.html
The Variant Effect Predictor (VEP) is an Ensembl tool that predicts the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.
https://en.wikipedia.org/wiki/Ensembl_Genomes#Variant_Effect_Predictor
Variant Effect Predictor
VEP
File formats
A set of recommendations for variant naming. The nomenclature describes the change a variant allele has on a named (genomic, transcript or protein) sequence. Can be used as an input for the VEP and displayed for known variants. http://varnomen.hgvs.org/
HGVS
Human Genome Variation Society
Sequence Variant Nomenclature
HGVS names
HGVS nomenclature
http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#vcf
VCF is a standard format for listing genetic variation, which is the output for many variant callers. It can be used as an input for the Ensembl VEP and is used to store and download variation data in Ensembl.
https://en.wikipedia.org/wiki/Variant_Call_Format
Variant Call Format
VCF
http://www.ensembl.org/info/website/upload/bed.html
BED is a simple format for listing genomic loci. It can be used to upload data to view in Ensembl, as a custom file for additional VEP annotation and is used to store and download constrained elements in Ensembl.
BED
FASTA is used to store finished nucleotide and peptide sequences. The Ensembl FTP site has genome, cDNA, CDS and peptide sequences in FASTA, and you can export FASTA from various webpages in Ensembl.
https://en.wikipedia.org/wiki/FASTA_format
FASTA
http://www.ensembl.org/info/website/upload/large.html#bam-format
BAM and CRAM store alignments of NGS data to the genome. Ensembl allow attachment of BAM and CRAM files to view against the genome, and store RNA-seq, ChIP-seq and DNase-seq in BAM.
https://en.wikipedia.org/wiki/Binary_Alignment_Map
BAM
CRAM
SAM
Binary alignment map
BAM/CRAM
http://www.ensembl.org/info/website/upload/large.html#bb-format
BigBed is an indexed form of BED, which can be used to store larger scale data. Ensembl allow attachment of BigBed files to view against the genome and store peaks of regulatory evidence as BigBed.
BigBed
http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#default
Ensembl default is an input format for the VEP, used to describe the position and alleles of a variant.
Ensembl default (VEP)
http://www.ensembl.org/info/website/upload/bed.html#bedGraph
BedGraph allows you to store scores for loci in BED format; the loci can be of varying size. It can be uploaded to view in Ensembl.
BedGraph
http://www.ensembl.org/info/website/upload/gff.html
GTF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GTF, allow attachment of GTF files to view against the genome and allow custom annotation with the VEP using GTF files.
https://en.wikipedia.org/wiki/Gene_transfer_format
General transfer format
Gene transfer format
GTF
http://www.ensembl.org/info/website/upload/gff3.html
GFF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GFF, allow attachment of GFF files to view against the genome and allow custom annotation with the VEP using GFF files.
https://en.wikipedia.org/wiki/General_feature_format
gene-finding format
generic feature format
General feature format
GFF
http://www.ensembl.org/info/website/upload/psl.html
PSL represents alignments and can be viewed in Ensembl.
PSL
http://www.ensembl.org/info/website/upload/wig.html
Wiggle format expresses scores across genomic loci, requiring fixed size bins for the scores. It can be uploaded to view in Ensembl.
WIG
Wiggle
http://www.ensembl.org/info/website/upload/large.html#bw-format
BigWig is an indexed form of wiggle and can be used to store larger scale data. Ensembl simplify NGS data, such as ChIP-seq and RNA-seq, into BigWig to view in the browser. It can also be used to attach your own data to Ensembl.
BigWig
http://www.ensembl.org/info/website/upload/pairwise.html
Pairwise interactions, such as those derived from Hi-C, can be stored in the WashU format and viewed in Ensembl.
https://en.wikipedia.org/wiki/Chromosome_conformation_capture
Pairwise interactions (WashU)
Chain files describe the mapping between different genome assemblies. Ensembl store these on the FTP site.
chain mapping
mapping
assembly chain
chain
Newick is a tree format. Ensembl gene trees can be downloaded in Newick and it is used to store Ensembl species trees.
https://en.wikipedia.org/wiki/Newick_format
New Hampshire tree format
Newick format
Newick notation
Newick tree format
Newick
EMBL files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome.
EMBL (file format)
GenBank files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome.
GenBank (file format)
http://www.ensembl.org/info/data/ftp/index.html
Ensembl Multi Format (EMF) stores genomic alignments in Ensembl.
Ensembl Multi Format
EMF Alignment format
http://www.ensembl.org/info/data/ftp/index.html
Multiple alignment format (MAF) stores genomic alignments.
Multiple alignment format
MAF
http://www.ensembl.org/info/data/mysql.html
MySQL is a database. All Ensembl data is stored in MySQL relational tables, which can be found on the FTP site and accessed directly by MySQL queries.
https://en.wikipedia.org/wiki/MySQL
MySQL
https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html
A VEP cache contains all the gene and variant data needed to run a VEP query, and can be used to run large queries quickly on your own machine. These can be installed as part of your VEP installation, or downloaded from the FTP site.
VEP cache
Genome Variation Format (GVF) is used to store variation data. It can be found on the Ensembl FTP site.
Genome Variation Format
GVF
PhyloXML is an XML language for the analysis, exchange, and storage of phylogenetic trees (or networks) and associated data. It is used to store Ensembl phylogenetic trees.
https://en.wikipedia.org/wiki/PhyloXML
PhyloXML
OrthoXML is an XML format to allow the storage and comparison of orthology data. It is used to store Ensembl homologues.
OrthoXML
Resource Description Framework (RDF) is used as a metadata data model. Ensembl use it to describe links from Ensembl annotations to those annotations in other databases.
https://en.wikipedia.org/wiki/Resource_Description_Framework
Resource Description Framework
RDF
A golden path. A file provided to Ensembl that describes how the longer sequences in the genome assembly were assembled from shorter sequences. For example, an AGP file can describe how a chromosome is assembled from a collection of scaffolds or a collection of contigs. For an AGP file that describes how a scaffold is assembled from a collection of contigs, each contig will be listed on a separate line in the AGP file and the line will include information about where the contig lies within the scaffold and the orientation of the contig.
A golden path
AGP
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)
Repeat
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
The method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs.
https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)
Repeat masking
Hard masked sequence is repeat masked with the repeat sequences replaced by Ns. Hard masked sequence files on the Ensembl FTP site have "rm" in their file name.
Hard masked
Soft masked sequence is repeat masked with the repeat sequences in lower case. Soft masked sequence files on the Ensembl FTP site have "sm" in their file name.
Soft masked
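The relationship between the two masking styles can be illustrated with a minimal sketch. It assumes the input is a plain soft-masked sequence string; the function name `hard_mask` is hypothetical and is not part of any Ensembl tool.

```python
def hard_mask(soft_masked: str) -> str:
    """Convert a soft-masked sequence to a hard-masked one.

    Lower-case bases (the soft-masked repeats) are replaced by N;
    upper-case bases are left unchanged.
    """
    return "".join("N" if base.islower() else base for base in soft_masked)


example = "ACGTacgtACGT"      # the middle four bases are soft masked
masked = hard_mask(example)   # the same four positions become N
```

The conversion only works in this direction: once repeats are replaced by Ns, the original bases cannot be recovered from the hard-masked sequence.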
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
A dispersed intermediately repetitive DNA sequence found in the human genome in about one million copies. The sequence is about 300 bp long and is found commonly in introns, 3' untranslated regions of genes, and intergenic genomic regions. The name Alu comes from a recognition site for the AluI endonuclease that cleaves it.
SO:0002063
https://en.wikipedia.org/wiki/Alu_element
Alu insertion
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
A region in the genomic sequence containing short tandem repeats of 2-10bp.
SO:0000289
https://en.wikipedia.org/wiki/Microsatellite
Microsatellite
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
The region of the chromosome at which the two sister chromatids are joined during mitosis and meiosis, mostly composed of satellite DNA.
SO:0000577
https://en.wikipedia.org/wiki/Centromere
Centromere
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
Poly-purine or poly-pyrimidine stretches, or regions of extremely high AT or GC content.
Low complexity regions
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
RNA repeats
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
Multiple copies of the same base sequence on a DNA sequence. The repeated pattern can vary in length from a single base to several thousand bases long.
SO:0000005
https://en.wikipedia.org/wiki/Satellite_DNA
Satellite repeats
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
Simple repeats
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
Typically found at the centromeres and telomeres of chromosomes, these are duplications of more complex 100-200 base sequences.
https://en.wikipedia.org/wiki/Tandem_repeat
Tandem repeats
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
Long terminal repeats.
https://en.wikipedia.org/wiki/Tandem_repeat
LTRs
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
Long Interspersed Elements. Retrotransposed elements in the genome containing open reading frames encoding (often inactive) reverse transcription machinery.
SO:0000194
https://en.wikipedia.org/wiki/Long_interspersed_nuclear_element
long interspersed nuclear element
LINE
Type I Transposons/LINE
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
Short Interspersed Elements. Retrotransposed elements less than 500 bp that contain tRNA, snRNA and rRNA, which require other mobile elements to be transposed. Alu elements are a type of SINE.
SO:0000206
https://en.wikipedia.org/wiki/Short_interspersed_nuclear_element
short interspersed nuclear element
SINE
Type I Transposons/SINE
http://www.ensembl.org/info/genome/genebuild/assembly_repeats.html
Elements that have been transposed and duplicated around the genome by excision and ligation.
SO:0000182
https://en.wikipedia.org/wiki/Transposable_element#Classification
DNA transposon
Type II Transposons
Repeats that cannot be classified.
Unknown repeat
A comparison between two or more sequences by matching identical and/or similar residues/nucleotides and assigning a score to the match.
https://en.wikipedia.org/wiki/Sequence_alignment
Alignments
An alignment carried out using the whole genome sequence.
Whole genome alignment
http://www.ensembl.org/info/genome/compara/analyses.html
An alignment between two whole genomes.
Pairwise sequence alignment
Pairwise alignment
Pairwise whole genome alignment
http://www.ensembl.org/info/genome/compara/multiple_genome_alignments.html
An alignment between more than two whole genomes of a selected taxon.
Multiple sequence alignment
Multiple alignment
Multiple whole genome alignment
http://www.ensembl.org/info/genome/compara/synteny.html
In a genomic context we refer to syntenic regions if the sequence is globally conserved between two species.
https://en.wikipedia.org/wiki/Synteny
Synteny
The CIGAR line defines the sequence of matches/mismatches and deletions (or gaps) in an alignment.
https://en.wikipedia.org/wiki/Sequence_alignment#Representations
Compact Idiosyncratic Gapped Alignment Report
CIGAR
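As an illustration of how a CIGAR line encodes an alignment, the sketch below splits a string into (count, operation) pairs. It assumes operations are single letters such as M (match/mismatch) and D (deletion or gap), each preceded by a count. Treating a missing count as 1 is an assumption about the input, not something the definition above states. The name `parse_cigar` is hypothetical.

```python
import re

# One CIGAR element: an optional run length followed by a single operation letter.
CIGAR_ELEMENT = re.compile(r"(\d*)([A-Za-z=])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR line into (count, operation) pairs.

    A missing count is treated as 1 (an assumption about the input format).
    """
    return [
        (int(count) if count else 1, op)
        for count, op in CIGAR_ELEMENT.findall(cigar)
    ]


# 3 matches/mismatches, a 2-column gap, then 4 more matches/mismatches.
elements = parse_cigar("3M2D4M")
aligned_columns = sum(count for count, _ in elements)
```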
A measure of how similar two aligned sequences are: specifically, the percentage of amino acids or nucleotides that are the same in type and position between the two sequences. The value is dependent on which sequence is used as the reference, since it is a percentage of that reference.
%ID
Identity
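The dependence of %ID on the choice of reference can be shown with a small sketch. It assumes two already-aligned sequences of equal length with `-` as the gap character, and divides the number of identical columns by the ungapped length of the reference. This is one plausible reading of the definition above and not necessarily the exact calculation Ensembl uses. The name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_ref: str, aligned_other: str, gap: str = "-") -> float:
    """Percentage of reference residues that are identical in the other sequence."""
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have the same length")
    # Ungapped length of the reference is the denominator.
    ref_length = sum(1 for residue in aligned_ref if residue != gap)
    if ref_length == 0:
        raise ValueError("reference contains no residues")
    # Count columns where both sequences carry the same (non-gap) residue.
    matches = sum(
        1
        for ref_residue, other_residue in zip(aligned_ref, aligned_other)
        if ref_residue == other_residue and ref_residue != gap
    )
    return 100.0 * matches / ref_length


short_as_ref = percent_identity("ACGT--AC", "ACGTTTAC")  # 6 of 6 reference bases match
long_as_ref = percent_identity("ACGTTTAC", "ACGT--AC")   # 6 of 8 reference bases match
```

The same alignment gives a different value in each direction because the denominator is the length of whichever sequence is taken as the reference.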
An application for displaying sequence alignments with custom colour-annotation, which is used by Ensembl to display gene tree and family alignments. http://wasabiapp.org/
Wasabi
How well one sequence matches another, as calculated by an alignment program from the identical and conserved residues/nucleotides.
Similarity
http://www.ensembl.org/info/genome/compara/multiple_genome_alignments.html
Pecan is a global multiple sequence alignment program that makes practical the probabilistic consistency methodology for significant numbers of sequences of practically arbitrary length.
Pecan
http://www.ensembl.org/info/genome/compara/multiple_genome_alignments.html
The EPO (Enredo, Pecan, Ortheus) pipeline is a three-step pipeline for whole-genome multiple alignments, using Enredo segments, aligning them with Pecan and constructing ancestral sequences with Ortheus.
Enredo Pecan Ortheus
EPO
http://www.ensembl.org/info/genome/compara/multiple_genome_alignments.html
Progressive-Cactus is a next-generation aligner that stores whole-genome alignments in a graph structure.
Progressive cactus
http://www.ensembl.org/info/genome/compara/analyses.html
LASTZ is a program for aligning DNA sequences in a pairwise manner. Its predecessor is BlastZ.
LastZ
http://www.ensembl.org/info/genome/compara/analyses.html
BlastZ is a program for aligning DNA sequences in a pairwise manner. It has been replaced by LASTZ.
BlastZ
Translated Blat can be used for alignment of the coding regions of genomes only in a pairwise manner.
https://en.wikipedia.org/wiki/BLAT_(bioinformatics)
Translated Blat
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
Regions that are predicted to regulate the expression of genes, based on the Ensembl regulatory build.
reg-feat
Regulatory features
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
Regions at the 5' end of genes where transcription factors and RNA polymerase bind to initiate transcription.
SO:0000167
https://en.wikipedia.org/wiki/Promoter_(genetics)
Promoters
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
Transcription factor binding regions that flank promoters.
SO:0001952
https://en.wikipedia.org/wiki/Promoter_(genetics)
Promoter flanking regions
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
Regions that bind transcription factors and interact with promoters to stimulate transcription of distant genes.
SO:0000165
https://en.wikipedia.org/wiki/Enhancer_(genetics)
Enhancers
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
Regions that bind CTCF, the insulator protein that demarcates open and closed chromatin.
SO:0001974
https://en.wikipedia.org/wiki/CTCF
CTCF binding sites
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
Sites which bind transcription factors, for which no other role can be determined as yet.
SO:0000235
Transcription factor binding sites
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
Regions of spaced out histones, making them accessible to protein interactions.
SO:0001747
https://en.wikipedia.org/wiki/DNase_I_hypersensitive_site
Open chromatin regions
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
The activity state of a regulatory feature in a specific epigenome.
Regulatory activity
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
When a regulatory feature displays an epigenetic signature which is consistent with it carrying out its named function, for example an active Promoter has an epigenetic signature consistent with initiating transcription, while an active CTCF binding site will bind CTCF. It is analogous to a sprinter running.
Active
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
When a regulatory feature displays an epigenetic signature with the potential to be activated. It is analogous to a sprinter in the blocks.
Poised
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
When a regulatory feature is epigenetically repressed, having an epigenetic signature that prevents it from being active.
Repressed
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
When a regulatory feature bears none of the epigenetic modifications included in the Regulatory Build.
Inactive
http://www.ensembl.org/info/genome/funcgen/regulatory_features.html
When there is no available data in the cell type for this regulatory feature.
NA
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
Experimental data that is used to construct and determine activity of regulatory features.
Epigenome evidence
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
A method to determine the genomic regions that proteins bind to.
https://en.wikipedia.org/wiki/ChIP-sequencing
Chromatin Immunoprecipitation Sequencing
ChIPSeq
ChIP-seq
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
A method to determine regions of open and closed chromatin.
https://en.wikipedia.org/wiki/DNase-Seq
DNase hypersensitivity
DNase-seq
DNase sensitivity
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
A protein that binds to DNA and controls the rate of transcription.
https://en.wikipedia.org/wiki/Transcription_factor
TF
Transcription factor
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
Covalent modifications to the histone proteins that make up the nucleosome, which are known to regulate gene expression.
SO:0001700
https://en.wikipedia.org/wiki/Histone#Histone_modification
histone acetylation
histone methylation
Histone mod
Histone modification
http://www.ensembl.org/info/genome/funcgen/regulation_other.html
Modification of cytosines in CpGs with methyl groups, which is known to repress gene expression.
SO:0000114
https://en.wikipedia.org/wiki/DNA_methylation
CpG methylation
DNA methylation
http://www.ensembl.org/info/genome/funcgen/regulation_other.html
A method to determine the methylation of genomic cytosines.
https://en.wikipedia.org/wiki/Bisulfite_sequencing
RRBS
WGBS
Bisulphite sequencing
Bisulfite sequencing
http://www.ensembl.org/info/genome/funcgen/peak_calling.html
A count of the number of NGS reads from an epigenome experiment aligned to a locus, shown as a BigWig across the genome.
Signal
http://www.ensembl.org/info/genome/funcgen/peak_calling.html
Locus identified from epigenome signal as having high signal, shown as a BigBed across the genome.
Peak
Short genomic sequence that is known to bind to a particular transcription factor.
https://en.wikipedia.org/wiki/DNA_binding_site
Motif
PWM
Position weight matrices
TFBM
Position weight matrix
Transcription factor binding motif
http://www.ensembl.org/info/genome/funcgen/regulation_sources.html
A cell type, such as a primary tissue or lab cell line, for which we have epigenome evidence and can predict regulatory features.
cell type
tissue
Cell line
Epigenome
A short sequence whose placement on the genome is known.
Marker
UniSTS is an NCBI resource for non-redundant Sequence Tagged Sites (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences determined by ePCR.
UniSTS
http://www.ensembl.org/info/genome/genebuild/xrefs.html
Mapping from Ensembl genes, transcripts and proteins to the same features in other databases.
general identifiers
xref
External reference
http://www.ensembl.org/info/genome/variation/prediction/protein_function.html
A tool that integrates multiple annotations into one metric for scoring the deleteriousness of single nucleotide variants.
CADD
http://www.ensembl.org/info/genome/variation/prediction/protein_function.html
A tool for predicting the pathogenicity of single nucleotide variants using an ensemble method.
REVEL
http://www.ensembl.org/info/genome/variation/prediction/protein_function.html
A tool for assessing the functional impact of single nucleotide variants based on evolutionary conservation of the affected amino acid in protein homologues.
MutationAssessor
http://www.ensembl.org/info/genome/variation/prediction/protein_function.html
A tool for predicting the pathogenicity of single nucleotide variants using a logistic regression based ensemble method.
MetaLR
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
The Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq to identify transcripts that match GRCh38 and are 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR.
Matched Annotation between NCBI and EBI
Matched Annotation from NCBI and EMBL-EBI
Matched Annotation from NCBI and Ensembl
MAIN
MANE
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
The Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq. The MANE Select is a default transcript per human gene that is representative of biology, well-supported, expressed and highly-conserved. This transcript set matches GRCh38 and is 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR.
Matched Annotation from NCBI and EBI Select
Matched Annotation from NCBI and EMBL-EBI Select
Matched Annotation from NCBI and Ensembl Select
MAIN Select
MANE Select
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
Long-read sequence data is computationally processed into non-redundant transcript models which are manually appraised by the Ensembl-Havana annotation team.
tagine
TAGENE
http://www.ensembl.org/info/genome/genebuild/biotypes.html
The coding sequence contains a stop codon that is translated (as supported by experimental evidence), and termination occurs instead at a canonical stop codon further downstream. It is currently unknown which codon is used to replace the translated stop codon, hence it is represented by 'X' in the protein sequence
https://en.wikipedia.org/wiki/Stop_codon#Translational_readthrough
Stop codon readthrough
http://www.ensembl.org/Help/Faq?id=367
DNA strand arbitrarily defined as the strand with its 5' end at the tip of the short chromosome arm (p). If a gene is forward-stranded, its sense (sequence matching cDNA) is on the forward strand. The forward strand is the reverse complement of the reverse strand.
SO:0001030
+ strand
1 strand
positive strand
Plus strand
Forward strand
http://www.ensembl.org/Help/Faq?id=367
DNA strand arbitrarily defined as the strand with its 5' end at the tip of the long chromosome arm (q). If a gene is reverse-stranded, its sense (sequence matching cDNA) is on the reverse strand. The reverse strand is the reverse complement of the forward strand.
SO:0001031
- strand
-1 strand
negative strand
Minus strand
Reverse strand
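The reverse-complement relationship between the forward and reverse strands can be sketched as follows. The function name `reverse_complement` is hypothetical, and only A, C, G, T and N (in either case) are handled; other characters are an assumption left untranslated.

```python
# Translation table mapping each base to its complement, preserving case.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


forward = "ATGC"
reverse = reverse_complement(forward)
# Applying the operation twice returns the original strand.
round_trip = reverse_complement(reverse)
```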
http://www.ensembl.org/info/genome/genebuild/mane.html
RefSeq transcripts that match 100% across the sequence, exon/intron structure and UTRs as part of the MANE project
RefSeq Match
The UniProt identifier that matches to the Ensembl transcript. This may be a UniProt protein isoform and will have a number suffix, or may just refer to a UniProt entry.
UniProt Match
A transcript with a premature stop codon considered likely to be subjected to targeted degradation. Nonsense-Mediated Decay is predicted to be triggered where the in-frame termination codon is found more than 50bp upstream of the final splice junction.
https://en.wikipedia.org/wiki/Nonsense-mediated_decay
Nonsense-Mediated Decay
NMD
Nonsense Mediated Decay
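The 50bp rule in the definition above can be written as a simple check. This is a minimal sketch, assuming both positions are given in transcript (spliced) coordinates and that the distance is measured from the end of the termination codon to the final splice junction. The function name, parameter names and the exact measuring convention are hypothetical.

```python
def predicted_nmd(stop_codon_end: int, final_junction: int, threshold: int = 50) -> bool:
    """Return True if the stop codon lies more than `threshold` bp
    upstream of the final splice junction (transcript coordinates)."""
    return final_junction - stop_codon_end > threshold


far_upstream = predicted_nmd(stop_codon_end=100, final_junction=200)  # 100 bp upstream
close_to_end = predicted_nmd(stop_codon_end=180, final_junction=200)  # 20 bp upstream
```

A stop codon exactly 50 bp upstream is not flagged, since the definition requires more than 50bp.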
A transcript with a non-ATG start codon but which still encodes a methionine, since the ribosomal machinery allows non-AUG codons to be translated as methionine in specific cases.
https://en.wikipedia.org/wiki/Start_codon
Non-ATG start
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
Transcripts in the MANE Plus Clinical set are additional transcripts per locus necessary to support clinical variant reporting, for example transcripts containing known Pathogenic or Likely Pathogenic clinical variants not reportable using the MANE Select set. Note there may be additional clinically relevant transcripts in the wider RefSeq and Ensembl/GENCODE sets but not yet in MANE.
Matched Annotation from NCBI and EBI Plus Clinical
Matched Annotation from NCBI and EMBL-EBI Plus Clinical
Matched Annotation from NCBI and Ensembl Plus Clinical
MAIN Plus Clinical
MANE Plus Clinical
http://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
The full GENCODE transcript set, containing both complete transcripts and 5' and 3' incomplete transcripts.
GENCODE Comprehensive
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Alternatively spliced transcript of a protein coding gene for which we cannot define a CDS.
Protein coding CDS not defined
http://www.ensembl.org/info/genome/genebuild/biotypes.html
Not translated in the reference genome owing to a SNP/DIP but in other individuals/haplotypes/strains the transcript is translated. Replaces the polymorphic_pseudogene transcript biotype.
Protein coding LOF