### Generic Feature Format Version 3 (GFF3)
#### Summary
Author: Lincoln Stein
Date: 18 August 2020
Version: 1.26
Although there are many richer ways of representing genomic features via XML and in relational database schemas, the stubborn persistence of a variety of ad-hoc tab-delimited flat file formats declares the bioinformatics community's need for a simple format that can be modified with a text editor and processed with shell tools like grep. The GFF format, although widely used, has fragmented into multiple incompatible dialects. When asked why they have modified the published Sanger specification, bioinformaticists frequently answer that the format was insufficient for their needs, and they needed to extend it. The proposed GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats. The new format:
1. Adds a mechanism for representing more than one level of hierarchical grouping of features and subfeatures.
2. Separates the ideas of group membership and feature name/id.
3. Constrains the feature type field to be taken from a controlled vocabulary.
4. Allows a single feature, such as an exon, to belong to more than one group at a time.
5. Provides an explicit convention for pairwise alignments.
6. Provides an explicit convention for features that occupy disjunct regions.
#### GFF3 Validator
GFF3 validation tools are available at [modENCODE-DCC](https://github.com/modENCODE-DCC/validator).
#### Description of the Format
GFF3 files are nine-column, tab-delimited, plain text files. Literal use of tab, newline, carriage return, the percent (%) sign, and control characters must be encoded using [RFC 3986 Percent-Encoding](https://tools.ietf.org/html/rfc3986#section-2.1); no other characters may be encoded. Backslash and other ad-hoc escaping conventions that have been added to the GFF format are not allowed. The file contents may include any character in the set supported by the operating environment, although for portability with other systems, use of UTF-8 is recommended.
- tab (%09)
- newline (%0A)
- carriage return (%0D)
- % percent (%25)
- control characters (%00 through %1F, %7F)
In addition, the following characters have reserved meanings in column 9 and must be escaped when used in other contexts:
- ; semicolon (%3B)
- = equals (%3D)
- & ampersand (%26)
- , comma (%2C)
Note that unescaped spaces are allowed within fields, meaning that parsers must split on tabs, not spaces. Use of the "+" (plus) character to encode spaces is deprecated from early versions of the spec and is no longer allowed.
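The escaping rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `encode_field` and `decode_field` and the `column9` flag are assumptions of this sketch. Decoding relies on the standard library's `urllib.parse.unquote`, which leaves "+" untouched, in line with the rule that "+" no longer encodes a space.

```python
from urllib.parse import unquote

# Characters that must be percent-encoded in every field:
# control characters (%00 through %1F, %7F) and the percent sign itself.
_ALWAYS = {chr(code) for code in range(0x00, 0x20)} | {"\x7f", "%"}
# Characters with a reserved meaning in column 9 only.
_COLUMN9 = set(";=&,")


def encode_field(value: str, column9: bool = False) -> str:
    """Percent-encode only the characters that GFF3 requires to be encoded."""
    reserved = (_ALWAYS | _COLUMN9) if column9 else _ALWAYS
    return "".join(f"%{ord(ch):02X}" if ch in reserved else ch for ch in value)


def decode_field(value: str) -> str:
    """Reverse the percent-encoding; a literal '+' stays a '+'."""
    return unquote(value)
```

For example, `encode_field("x;y=z", column9=True)` yields `x%3By%3Dz`, while spaces pass through unchanged in either mode.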
Undefined fields are replaced with the "." character, as described in the original GFF spec.
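As a rough sketch of how a parser might apply the two rules above (split on tabs only, and treat "." as undefined), consider the following. The function name `parse_feature_line` is hypothetical, and the column labels are the customary GFF3 ones rather than anything defined in this section.

```python
# Customary labels for the nine GFF3 columns (an assumption of this sketch).
COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_feature_line(line: str) -> dict:
    """Split one feature line on tabs and map '.' to None (undefined)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    return {
        name: (None if value == "." else value)
        for name, value in zip(COLUMNS, fields)
    }
```

Because the split is on tabs, an attribute value such as `Name=EDEN gene` keeps its embedded space, and a line that is separated by spaces instead of tabs is rejected.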
The start and end coordinates of the feature are given in positive 1-based integer coordinates, relative to the landmark given in column one. Start is always less than or equal to end. For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature.
For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.
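The origin-crossing rule for circular landmarks amounts to one addition. The sketch below assumes a hypothetical helper name, `origin_spanning_end`, and takes the feature's true start and end positions on the landmark together with the landmark length.

```python
def origin_spanning_end(start: int, end: int, landmark_length: int) -> int:
    """Return the GFF3 end coordinate for a feature on a circular landmark.

    `start` and `end` are 1-based positions on the landmark. If the feature
    crosses the origin (start > end), the landmark length is added to `end`
    so that start <= end still holds.
    """
    for position in (start, end):
        if not 1 <= position <= landmark_length:
            raise ValueError(f"position {position} lies outside the landmark")
    if start <= end:
        return end
    return end + landmark_length
```

With illustrative numbers, a feature that starts at 6006 and ends at 831 on a 6407-base circular landmark is written with start 6006 and end 7238.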
For features of type "CDS", the phase indicates where the next codon begins relative to the 5' end (where the 5' end of the CDS is relative to the strand of the CDS feature) of the current CDS feature. For clarification, the 5' end for CDS features on the plus strand is the feature's start and the 5' end for CDS features on the minus strand is the feature's end. The phase is one of the integers 0, 1, or 2, indicating the number of bases forward from the start of the current CDS feature the next codon begins. A phase of "0" indicates that a codon begins on the first nucleotide of the CDS feature (i.e. 0 bases forward), a phase of "1" indicates that the codon begins at the second nucleotide of this CDS feature, and a phase of "2" indicates that the codon begins at the third nucleotide of this region. Note that ‘Phase’ in the context of a GFF3 CDS feature should not be confused with the similar concept of frame that is also a common concept in bioinformatics. Frame is generally calculated as a value for a given base relative to the start of the complete open reading frame (ORF) or the codon (e.g.
The phase is REQUIRED for all CDS features.
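The phase of each CDS segment follows from the lengths of the segments before it. The sketch below is one way to compute it, assuming the segments of a single CDS are given as 1-based inclusive `(start, end)` pairs and that the first segment in 5' to 3' order has a known phase (0 by default). The function name `cds_phases` and the `first_phase` parameter are hypothetical.

```python
def cds_phases(segments, strand="+", first_phase=0):
    """Return (start, end, phase) for each CDS segment in 5'-to-3' order.

    On the minus strand the 5' end is the feature's end, so segments are
    visited from the highest coordinate to the lowest.
    """
    ordered = sorted(segments, reverse=(strand == "-"))
    result = []
    phase = first_phase
    for start, end in ordered:
        result.append((start, end, phase))
        length = end - start + 1
        # Bases left over after the last complete codon of this segment.
        leftover = (length - phase) % 3
        # The next segment must supply the bases that finish that codon.
        phase = (3 - leftover) % 3
    return result
```

For instance, a 4-base segment with phase 0 leaves one base of an unfinished codon, so the following segment has phase 2: its first two bases complete that codon and the next codon begins at its third base.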
A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape. Attribute values do not need to be, and should not be, quoted. If quotes do appear, they are part of the value, and parsers must not strip them.
These tags have predefined meanings:
Multiple attributes of the same type are indicated by separating the values with the comma "," character, as in:
Parent=AF2312,AB2812,abc-3
In addition to Parent, the Alias, Note, Dbxref and Ontology_term attributes can have multiple values.
Note that attribute names are case sensitive. "Parent" is not the same as "parent".
All attributes that begin with an uppercase letter are reserved for later use. Attributes that begin with a lowercase letter can be used freely by applications.
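A minimal sketch of parsing column 9 under the rules above might look like this. The function name `parse_attributes` is hypothetical. As a simplification, every value is split on commas and returned as a list, which is safe only because a literal comma inside a value must be written as %2C.

```python
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict:
    """Parse 'tag=value;tag=value' into {tag: [values]}.

    Tags are case sensitive, quotes are kept as part of the value, and
    percent-escapes are decoded only after splitting on ';', '=' and ','.
    """
    attributes = {}
    if column9 in ("", "."):
        return attributes
    for pair in column9.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"attribute without '=': {pair!r}")
        attributes[unquote(tag)] = [unquote(item) for item in value.split(",")]
    return attributes
```

Applied to `Parent=AF2312,AB2812,abc-3`, this returns a single `Parent` key holding three values. A lowercase `parent` tag would be stored under a different key.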
SO or SOFA IDs: If using the SO (or SOFA) IDs rather than the short names ("mRNA" etc), use the following mappings:
| Feature type | SO ID |
| --- | --- |
| gene | SO:0000704 |
| mRNA | SO:0000234 |
| exon | SO:0000147 |
| CDS | SO:0000316 |
Other mRNA parts that you might wish to use are:
| Feature type | SO ID | Note |
| --- | --- | --- |
| intron | SO:0000188 | redundant with exon |
| polyA_sequence | SO:0000610 | part of the three_prime_UTR |
| polyA_site | SO:0000553 | part of the gene |
| five_prime_UTR | SO:0000204 | |
| three_prime_UTR | SO:0000205 | |

| Operation | Meaning |
| --- | --- |
| M451 | match 451 bases |
| D3499 | skip 3499 bases in the reference ctg123 sequence |
| M501 | match the next 501 bases |
| D1499 | skip 1499 bases in the reference ctg123 |
| M2001 | match the next 2001 bases |
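The operations in the table above can be read as a space-separated string of the kind used by the GFF3 `Gap` attribute. The sketch below handles only the two codes that appear in the table (M for a match, D for a skip in the reference). Any other code is rejected rather than guessed at. The function name `gap_spans` is hypothetical.

```python
import re


def gap_spans(gap: str) -> tuple:
    """Return (reference_bases, target_bases) covered by an alignment string.

    M<n> advances both the reference and the target by n bases.
    D<n> skips n bases in the reference only.
    """
    reference = target = 0
    for operation in gap.split():
        match = re.fullmatch(r"([MD])(\d+)", operation)
        if match is None:
            raise ValueError(f"unsupported operation: {operation!r}")
        code, count = match.group(1), int(match.group(2))
        reference += count
        if code == "M":
            target += count
    return reference, target
```

For the five operations in the table, the reference span is 7951 bases and the target span is 2953 bases.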
This directive indicates that the GFF3 file uses the ontology of feature types located at the indicated URI or URL. Multiple URIs may be added, in which case they are merged (or raise an exception if they cannot be merged). The URIs for the released sequence ontologies are:
Releases occur every two months for SO and SOFA.
The repository for SO releases is here: http://sourceforge.net/projects/song/files/Sequence%20Ontology/
The repository for SOFA releases is here: http://sourceforge.net/projects/song/files/SO_Feature_Annotation/
This directive may occur several times per file. If no feature ontology is specified, then the most recent release of the Sequence Ontology is assumed.
If multiple directives are given and a feature type is matched by multiple ontologies, the matching ontology included by the directive highest in the file wins the reference. The Sequence Ontology itself is always referenced last.
The content referenced by URI must be in OBO or DAG-Edit format.
The genome assembly build name used for the coordinates given in the file. Please specify the source of the assembly as well as its name. Examples (the parentheses are comments):

    ##genome-build NCBI B36 (human)
    ##genome-build WormBase ws110 (worm)
    ##genome-build FlyBase r4.1 (drosophila)
This notation indicates that the annotation portion of the file is at an end and that the remainder of the file contains one or more sequences (nucleotide or protein) in FASTA format. This allows features and sequences to be bundled together. All FASTA sequences included in the file must be included together at the end of the file and may not be interspersed with the features lines. Once a ##FASTA section is encountered no other content beyond valid FASTA sequence is allowed.
Example:

    ##gff-version 3.1.26
    ##sequence-region ctg123 1 1497228
    ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
    ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
    ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
    ctg123 . five_prime_UTR 1050 1200 . + . Parent=mRNA00001
    ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
    ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
    ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
    ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
    ctg123 . three_prime_UTR 7601 9000 . + . Parent=mRNA00001
    ctg123 . cDNA_match 1050 1500 5.8e-42 + . ID=match00001;Target=cdna0123 12 462
    ctg123 . cDNA_match 5000 5500 8.1e-43 + . ID=match00001;Target=cdna0123 463 963
    ctg123 . cDNA_match 7000 9000 1.4e-40 + . ID=match00001;Target=cdna0123 964 2964
    ##FASTA
    >ctg123
    cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
    tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
    tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
    aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
    aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
    cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
    gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
    ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
    aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
    aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
    ...
    >cdna0123
    ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
    agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
    aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
    tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
    gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
    tcaaacagcggctgtaaaaatttgtgattatggttaaagg
For backward-compatibility with the GFF version output by the Artemis tool, a GFF line that begins with the character > creates an implied ##FASTA directive.
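The section boundaries described above could be handled along these lines. This is a sketch under stated assumptions: the function name `split_gff3` is hypothetical, lines beginning with a single "#" are treated as comments and skipped, and blank lines in the annotation section are ignored. Both the explicit ##FASTA directive and the implied one (a line starting with ">") switch the parser into sequence mode, after which nothing is interpreted as a feature.

```python
def split_gff3(text: str):
    """Split GFF3 text into (directives, feature_rows, fasta_lines)."""
    directives, features, fasta = [], [], []
    in_fasta = False
    for line in text.splitlines():
        if in_fasta:
            fasta.append(line)
        elif line.startswith("##FASTA"):
            in_fasta = True
        elif line.startswith(">"):
            # Implied ##FASTA directive (Artemis-style output).
            in_fasta = True
            fasta.append(line)
        elif line.startswith("##"):
            directives.append(line)
        elif line.startswith("#") or not line.strip():
            continue  # comment or blank line (assumption of this sketch)
        else:
            features.append(line.split("\t"))
    return directives, features, fasta
```

The FASTA lines are returned verbatim so that any FASTA reader can take over from there.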
This is the case in which a single unspliced transcript encodes a single CDS.
----->XXXXXXX*------>
The preferred representation is to create a gene, a transcript, an exon and a CDS:

    chrX . gene XXXX YYYY . + . ID=gene01;name=resA
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    chrX . exon XXXX YYYY . + . Parent=tran01
    chrX . CDS XXXX YYYY . + . Parent=tran01
Some groups will find this redundant. A valid alternative is to omit the exon feature:

    chrX . gene XXXX YYYY . + . ID=gene01;name=resA
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    chrX . CDS XXXX YYYY . + . Parent=tran01
It is not recommended to parent the CDS directly onto the gene, because this will make it impossible to determine the UTRs (since the gene may validly include untranscribed regulatory regions).
Also note that mixing the two styles, as in the case of an organism with both spliced and unspliced transcripts, is liable to confuse people working with the GFF3 file.
This is the case in which a single (possibly spliced) transcript encodes multiple open reading frames that generate independent protein products.
----->XXXXXXX*-->BBBBBB*--->ZZZZ*-->AAAAAA*-----
Since the single transcript corresponds to multiple genes that can be identified by genetic analysis, the recommended solution here is to create four "gene" objects and make them the parent for a single transcript. The transcript will contain a single exon (in the unspliced case) and four separate CDSs:

    chrX . gene XXXX YYYY . + . ID=gene01;name=resA
    chrX . gene XXXX YYYY . + . ID=gene02;name=resB
    chrX . gene XXXX YYYY . + . ID=gene03;name=resX
    chrX . gene XXXX YYYY . + . ID=gene04;name=resZ
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02,gene03,gene04
    chrX . exon XXXX YYYY . + . ID=exon00001;Parent=tran01
    chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene01
    chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene02
    chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene03
    chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene04
To disambiguate the relationship between which genes encode which CDSs, you may use the Derives_from relationship.
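To illustrate how the Parent and Derives_from relationships in the polycistronic example might be read back, here is a small sketch. The function names and the `type#index` label given to features without an ID are assumptions made for illustration. The attribute parsing is deliberately minimal and does not decode percent-escapes.

```python
from collections import defaultdict


def parse_pairs(column9: str) -> dict:
    """Minimal tag=value parser (no percent-decoding)."""
    return dict(pair.split("=", 1) for pair in column9.split(";") if pair)


def link_features(rows):
    """Build (children, derives) from (type, column9) pairs.

    children maps each parent ID to the labels of its child features.
    derives maps a feature label to the ID named in Derives_from.
    """
    children = defaultdict(list)
    derives = {}
    for index, (feature_type, column9) in enumerate(rows):
        attributes = parse_pairs(column9)
        # Features without an ID get a positional label (sketch-only convention).
        label = attributes.get("ID", f"{feature_type}#{index}")
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                children[parent].append(label)
        if "Derives_from" in attributes:
            derives[label] = attributes["Derives_from"]
    return dict(children), derives
```

Run on the ten lines of the example, each of the four genes has `tran01` as its only child, `tran01` has the exon and four CDSs as children, and each CDS maps to exactly one gene through `derives`.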
An intein occurs when a portion of the protein is spliced out and the two polypeptide fragments are rejoined to become a functional protein. The portion that is spliced out is called the "intein," and it may itself have intrinsic molecular activity:
----->XXXXXXyyyyyyyyyyXXXXXXX*------- (yyyyyy is the intein)
The preferred representation is to create one gene, one transcript, one exon, and one CDS. The CDS produces a pre-polypeptide using the "Derives_from" tag, and this polypeptide in turn gives rise to two mature_polypeptides, one each for the intein and the flanking protein:

    chrX . gene XXXX YYYY . + . ID=gene01;name=resA
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    chrX . exon XXXX YYYY . + . Parent=tran01
    chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01
    chrX . polypeptide XXXX YYYY . + . ID=poly01;Derives_from=cds01
    chrX . mature_polypeptide XXXX YYYY . + . ID=poly02;Parent=poly01
    chrX . mature_polypeptide XXXX YYYY . + . ID=poly02;Parent=poly01
    chrX . intein XXXX YYYY . + . ID=poly03;Parent=poly01
Because the flanking mature_polypeptide has discontinuous coordinates on the genome, it appears twice with the same ID.
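A reader of the file can reassemble such a discontinuous feature by collecting every line that shares an ID. The sketch below assumes the lines have already been reduced to `(id, start, end)` tuples with integer coordinates. The function name `group_segments` and the coordinates used in the usage note are illustrative only, since the example above uses placeholders.

```python
from collections import defaultdict


def group_segments(rows):
    """Collect the segments of each feature, keyed by ID, sorted by start."""
    segments = defaultdict(list)
    for feature_id, start, end in rows:
        segments[feature_id].append((start, end))
    return {feature_id: sorted(parts) for feature_id, parts in segments.items()}
```

With made-up coordinates, `group_segments([("poly02", 100, 200), ("poly03", 201, 500), ("poly02", 501, 650)])` gives `poly02` two segments flanking the single `poly03` segment.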
If the intein is immediately degraded, you may not wish to annotate it explicitly, and its line would be deleted from the example. However, if it has molecular activity, it may correspond to a gene, in which case:

    chrX . gene XXXX YYYY . + . ID=gene01;name=resA
    chrX . gene XXXX YYYY . + . ID=gene02;name=inteinA
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
    chrX . exon XXXX YYYY . + . Parent=tran01
    chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01
    chrX . polypeptide XXXX YYYY . + . ID=poly01;Derives_from=cds01
    chrX . mature_polypeptide XXXX YYYY . + . ID=poly02;Parent=poly01;Derives_from=gene01
    chrX . mature_polypeptide XXXX YYYY . + . ID=poly02;Parent=poly01;Derives_from=gene01
    chrX . intein XXXX YYYY . + . ID=poly03;Parent=poly01;Derives_from=gene02
The term "polypeptide" is part of SO. The terms "mature_polypeptide" and "intein" are slated to be added in a pending release.
This occurs when two genes contribute to a processed transcript via a trans-splicing reaction:

    spliced leader
    =======>----->XXXXXXX*------>
The simplest way to represent this is to show the mRNA as being split across two discontinuous genomic locations:

    chrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    chrX . exon XXXX YYYY . + . Parent=tran01
    chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01
However, this does not indicate which part of the transcript comes from the spliced leader. A preferred representation explicitly adds features for the spliced leader gene, the primary_transcript and the spliced_leader_RNA:

    chrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
    chrX . gene XXXX YYYY . + . ID=gene02;name=leader_gene
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
    chrX . primary_transcript XXXX YYYY . + . ID=pt01;Parent=tran01;Derives_from=gene01
    chrX . spliced_leader_RNA XXXX YYYY . + . ID=sl01;Parent=tran01;Derives_from=gene02
    chrX . exon XXXX YYYY . + . Parent=tran01
    chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01
As shown here, the mRNA derives from two genes ("my_gene" and the leader gene) and occupies disjunct coordinates on the genome. The primary_transcript, which encodes the body of the mRNA, is part of (has as its Parent) this mRNA. The same relationship applies to the spliced leader RNA. The Derives_from relationship is used to indicate which genes produced the primary transcript and spliced leader respectively.
The exon and CDS features follow in the normal fashion.
This event occurs when the ribosome performs a programmed frameshift during translation in order to skip over an in-frame stop codon. The frameshift may occur forward or backward.

    -------------------------> mRNA
      ==========
               ============* CDS
The representation of this is to make the CDS discontinuous:

    chrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01;Ontology_term=SO:1000069
    chrX . exon XXXX YYYY . + . Parent=tran01
    chrX . CDS XXXX YYYY . + 0 ID=cds01;Parent=tran01
    chrX . CDS YYYY-1 ZZZZ . + 0 ID=cds01;Parent=tran01
The CDS segment that represents the new reading frame will always have a phase of 0, since the ribosome is moving and thus redefining the codon.
It is suggested that the mRNA be tagged with the appropriate SO transcript attributes such as "minus_1_translational_frameshift" (SO:1000069). This will allow all such programmed frameshift mRNAs to be recovered with a query. The accession for "plus_1_translational_frameshift" is SO:1001263.
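The kind of query mentioned above could be as simple as checking the Ontology_term attribute against the two accessions. This is a sketch only: the function name `is_programmed_frameshift` is hypothetical, and the attribute string is split by hand without percent-decoding.

```python
# The two accessions named in the text above.
FRAMESHIFT_TERMS = {
    "SO:1000069",  # minus_1_translational_frameshift
    "SO:1001263",  # plus_1_translational_frameshift
}


def is_programmed_frameshift(column9: str) -> bool:
    """True if Ontology_term lists one of the frameshift accessions."""
    for pair in column9.split(";"):
        tag, _, value = pair.partition("=")
        # Attribute names are case sensitive, so only 'Ontology_term' counts.
        if tag == "Ontology_term" and FRAMESHIFT_TERMS & set(value.split(",")):
            return True
    return False
```

Filtering the mRNA lines of a file with this predicate would recover the tagged transcripts, provided the annotators applied the tags consistently.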
A classic operon occurs when the genes in a polycistronic transcript are co-regulated by cis-regulatory element(s):

    regulatory element
    *
    ================================================> operon
         ----->XXXXXXX*-->BBBBBB*--->ZZZZ*-->AAAAAA*-----
It can be indicated in GFF3 in this way:

    chrX . operon XXXX YYYY . + . ID=operon01;name=my_operon
    chrX . promoter XXXX YYYY . + . Parent=operon01
    chrX . gene XXXX YYYY . + . ID=gene01;Parent=operon01;name=resA
    chrX . gene XXXX YYYY . + . ID=gene02;Parent=operon01;name=resB
    chrX . gene XXXX YYYY . + . ID=gene03;Parent=operon01;name=resX
    chrX . gene XXXX YYYY . + . ID=gene04;Parent=operon01;name=resZ
    chrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02,gene03,gene04
    chrX . exon XXXX YYYY . + . ID=exon00001;Parent=tran01
    chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene01
    chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene02
    chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene03
    chrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene04
The regulatory element ("promoter" in this example) is part of the operon via the Parent tag. The four genes are part of the operon, and the resulting mRNA is multiply-parented by the four genes, as in the earlier example.
At the time of this writing, promoters and other cis-regulatory elements cannot be part_of an operon, but this restriction is being reconsidered.
The mirGFF3 format is adapted from the GFF3 definition to hold miRNA/isomiR information from miRNA-seq data. The main difference is in the attributes column, where these fields are mandatory: Variant, Cigar, Hits, Expression and Filter. For details on each field, see the main repository: https://github.com/miRTop/mirGFF3