#### README ####

-----------------------
GFF FLATFILE DUMPS
-----------------------
This directory contains GFF flatfile dumps. All files are compressed
using GNU Zip.

Ensembl provides an automatic gene annotation for Aspergillus fumigatus Af293.
For some species (human, mouse, zebrafish, pig and rat), the
annotation provided through Ensembl also includes manual annotation
from HAVANA.
In the case of human and mouse, the GFF3 files found here are equivalent
to the GENCODE gene set.

GFF3 flat file format dumping provides all the sequence features known by
Ensembl, including protein coding genes, ncRNA, repeat features etc.
Annotation is based on alignments of biological evidence (e.g. proteins,
cDNAs, RNA-seq) to a genome assembly.
The annotation dumped here is transcribed and translated from the
genome assembly and is not the original input sequence data that
we used for alignment. Therefore, the sequences provided by Ensembl
may differ from the original input sequence data where the genome
assembly is different to the aligned sequence.
Considerably more information is stored in Ensembl: the flat file 
just gives a representation which is compatible with existing tools.

We are considering what other information should be made available in
these dumps. In general, we recommend database access over flat file
access for any substantial analysis of the data.

Note the following features of the GFF3 format provided on this site:
1) Types are described using SO terms that are as specific as possible,
e.g. protein_coding_gene is used where a gene is known to be protein coding.
2) Phase is currently set to 0; the phase used by the Ensembl system
is stored as an attribute.
3) Some validators may warn about duplicated identifiers for CDS features.
The identifiers are repeated deliberately so that split features can be grouped.

We are actively working to improve our GFF3 so some of these issues may
be addressed in future releases of Ensembl.

Additionally, we provide a GFF3 file containing the predicted gene set
as generated by Genscan and other ab initio prediction tools.
This file is identified by the abinitio extension.

-----------
FILE NAMES
-----------
The files are consistently named following this pattern:
   <species>.<assembly>.<version>.gff3.gz

<species>:       The systematic name of the species. 
<assembly>:      The assembly build name.
<version>:       The version of Ensembl from which the data was exported.
gff3:            All files in these directories are in GFF3 format.
gz:              All files are compressed with GNU Zip for storage efficiency.

e.g. 
Homo_sapiens.GRCh38.81.gff3.gz

For the predicted gene set, an additional abinitio flag is added to the file name:
<species>.<assembly>.<version>.abinitio.gff3.gz

e.g.
Homo_sapiens.GRCh38.81.abinitio.gff3.gz
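
The naming pattern above can be split back into its parts with a regular
expression. The following is a minimal sketch, not an Ensembl tool: the helper
name parse_dump_name is hypothetical, and it assumes that the version is a
plain integer and that only the assembly name may itself contain dots.

```python
import re

# Hypothetical pattern mirroring <species>.<assembly>.<version>[.abinitio].gff3.gz
# Assumption: <version> is an integer; <assembly> may contain dots.
FILENAME_RE = re.compile(
    r"^(?P<species>[^.]+)\."
    r"(?P<assembly>.+)\."
    r"(?P<version>\d+)"
    r"(?P<abinitio>\.abinitio)?"
    r"\.gff3\.gz$"
)


def parse_dump_name(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return {
        "species": match.group("species"),
        "assembly": match.group("assembly"),
        "version": int(match.group("version")),
        "abinitio": match.group("abinitio") is not None,
    }
```

Names that do not follow the pattern return None rather than raising, so the
helper can be used to filter a directory listing.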

--------------------------------
Definition and supported options
--------------------------------

GFF3 files are nine-column, tab-delimited, plain text files. Literal use of tab,
newline, carriage return, the percent (%) sign, and control characters must be
encoded using RFC 3986 Percent-Encoding; no other characters may be encoded.
Backslash and other ad-hoc escaping conventions that have been added to the GFF
format are not allowed. The file contents may include any character in the set
supported by the operating environment, although for portability with other
systems, use of Latin-1 or Unicode is recommended.
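
The encoding rule above can be illustrated with a small sketch. The function
names gff3_escape and gff3_unescape are hypothetical. The escape step covers
only the characters named in the paragraph above (tab, newline, carriage
return, percent and other control characters); the additional characters that
must be escaped inside the attribute column are not handled here. Decoding
relies on urllib.parse.unquote from the Python standard library.

```python
from urllib.parse import unquote


def gff3_escape(text):
    """Percent-encode '%' and control characters (including tab, LF and CR)."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "%" or code < 0x20 or code == 0x7F:
            out.append("%{:02X}".format(code))
        else:
            out.append(ch)
    return "".join(out)


def gff3_unescape(text):
    """Decode percent-encoded sequences back to literal characters."""
    return unquote(text)
```

Escaping the percent sign itself is what makes the round trip safe: a literal
"%" in the input can never be mistaken for the start of an escape sequence.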

Fields

Fields are tab-separated. All but the final field in each feature line
must contain a value; "empty" columns are denoted with a '.'.

   seqid     - The ID of the landmark used to establish the coordinate system for the current
               feature. IDs may contain any characters, but must escape any characters not in
               the set [a-zA-Z0-9.:^*!+_?-|]. In particular, IDs may not contain unescaped
               whitespace and must not begin with an unescaped ">".
   source    - The source is a free text qualifier intended to describe the algorithm or
               operating procedure that generated this feature. Typically this is the name of a
               piece of software, such as "Genscan", or a database name, such as "GenBank". In
               effect, the source is used to extend the feature ontology by adding a qualifier
               to the type, creating a new composite type that is a subclass of the type in the
               type column.
   type      - The type of the feature (previously called the "method"). This is constrained to
               be either: (a) a term from the "lite" version of the Sequence Ontology - SOFA,
               (b) a term from the full Sequence Ontology - it must be an is_a child of
               sequence_feature (SO:0000110) or (c) a SOFA or SO accession number. The latter
               alternative is distinguished using the syntax SO:000000.
   start     - Start position of the feature in positive 1-based integer coordinates;
               always less than or equal to end.
   end       - End position of the feature in positive 1-based integer coordinates.
   score     - The score of the feature, a floating point number. As in earlier versions of the
               format, the semantics of the score are ill-defined. It is strongly recommended
               that E-values be used for sequence similarity features, and that P-values be
               used for ab initio gene prediction features.
   strand    - The strand of the feature. + for positive strand (relative to the landmark), -
               for minus strand, and . for features that are not stranded. In addition, ? can
               be used for features whose strandedness is relevant, but unknown.
   phase     - For features of type "CDS", the phase indicates where the feature begins with
               reference to the reading frame. The phase is one of the integers 0, 1, or 2,
               indicating the number of bases that should be removed from the beginning of this
               feature to reach the first base of the next codon. In other words, a phase of
               "0" indicates that the next codon begins at the first base of the region
               described by the current line, a phase of "1" indicates that the next codon
               begins at the second base of this region, and a phase of "2" indicates that the
               codon begins at the third base of this region. This is NOT to be confused with
               the frame, which is simply start modulo 3.
   attribute - A list of feature attributes in the format tag=value. Multiple tag=value pairs
               are separated by semicolons. URL escaping rules are used for tags or values
               containing the following characters: ",=;". Spaces are allowed in this field,
               but tabs must be replaced with the %09 URL escape. Attribute values do not need
               to be and should not be quoted. If quotes are present, parsers should treat
               them as part of the value and not strip them.
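
The nine columns described above can be read with a few lines of code. This is
a minimal sketch under stated assumptions, not an Ensembl parser: the names
parse_feature_line and first_codon_base are hypothetical, the attribute column
is left as a raw string, and the minus-strand handling of phase follows the
GFF3 specification rather than anything stated in this README.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_feature_line(line):
    """Split one feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    rec = dict(zip(COLUMNS, fields))
    rec["seqid"] = unquote(rec["seqid"])
    rec["start"] = int(rec["start"])
    rec["end"] = int(rec["end"])
    if rec["start"] > rec["end"]:
        raise ValueError("start must be less than or equal to end")
    # '.' marks an empty column; represent it as None.
    rec["score"] = None if rec["score"] == "." else float(rec["score"])
    rec["phase"] = None if rec["phase"] == "." else int(rec["phase"])
    return rec


def first_codon_base(rec):
    """Coordinate of the first base of the first complete codon, or None.

    Phase counts bases to skip from the 5' end of the feature: from start on
    the plus strand, from end on the minus strand (assumption: GFF3 spec).
    """
    if rec["phase"] is None:
        return None
    if rec["strand"] == "-":
        return rec["end"] - rec["phase"]
    return rec["start"] + rec["phase"]
```

For a plus-strand CDS line with phase 2, first_codon_base returns start + 2,
which matches the description of phase "2" in the field list above.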


Attributes

The following attributes are available. Attributes are key=value pairs
separated by semicolons.

- ID:     ID of the feature. IDs for each feature must be unique within the
          scope of the GFF file. In the case of discontinuous features (i.e. a single
          feature that exists over multiple genomic locations) the same ID may appear on
          multiple lines. All lines that share an ID collectively represent a single
          feature.
- Name:   Display name for the feature. This is the name to be displayed to the user.
- Alias:  A secondary name for the feature
- Parent: Indicates the parent of the feature. A parent ID can be used to group exons into
          transcripts, transcripts into genes, and so forth
- Dbxref: A database cross reference
- Ontology_term: A cross reference to an ontology term
- Is_circular:   A flag to indicate whether a feature is circular
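
The attribute column can be turned into a dictionary, and lines that share an
ID can then be collected into one feature. The sketch below is illustrative
only: parse_attributes and group_by_id are hypothetical names, and values are
kept as single strings (comma-separated multi-value attributes are not split).
The column is split on ";" and "=" before decoding, so escaped semicolons
inside a value survive intact.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse 'tag=value;tag=value' into a dict with percent-escapes decoded."""
    attrs = {}
    if column9 in ("", "."):
        return attrs
    for pair in column9.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attrs[unquote(tag)] = unquote(value)
    return attrs


def group_by_id(attribute_columns):
    """Group parsed attribute dicts by their ID; lines without an ID are skipped."""
    groups = defaultdict(list)
    for column9 in attribute_columns:
        attrs = parse_attributes(column9)
        if "ID" in attrs:
            groups[attrs["ID"]].append(attrs)
    return dict(groups)
```

Applied to the CDS lines of a transcript, group_by_id returns a single entry
whose list holds one item per CDS segment, which is the grouping that the
repeated ID is meant to express.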

Pragmas/Metadata

GFF3 files can contain metadata. Experimental metadata lines are prefixed
with #!, and stable ones with ##. Each metadata line is a single key,
a space and then the value. Current metadata keys are:

* genome-build - Build identifier of the assembly e.g. GRCh37.p11
* genome-version - Version of this assembly e.g. GRCh37
* genome-date - The date of this assembly's release e.g. 2009-02
* genome-build-accession - The accession and source of this accession e.g. NCBI:GCA_000001405.14
* genebuild-last-updated - The date of the last genebuild update e.g. 2013-09
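
Metadata lines of this shape can be separated from feature lines by their
prefix. The following is a minimal sketch with a hypothetical function name,
parse_pragma. It assumes only the "key, whitespace, value" layout described
above and tolerates more than one space between key and value.

```python
def parse_pragma(line):
    """Return (kind, key, value) for a metadata line, or None for other lines.

    kind is "experimental" for lines starting with #! and "stable" for
    lines starting with ##.
    """
    line = line.rstrip("\n")
    if line.startswith("#!"):
        kind = "experimental"
    elif line.startswith("##"):
        kind = "stable"
    else:
        return None
    # Split the text after the two-character prefix into key and value.
    parts = line[2:].split(None, 1)
    if not parts:
        return None
    key = parts[0]
    value = parts[1].strip() if len(parts) > 1 else ""
    return kind, key, value
```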

------------------
Example GFF3 output
------------------

##gff-version 3
#!genome-build  Pmarinus_7.0
#!genome-version Pmarinus_7.0
#!genome-date 2011-01
#!genebuild-last-updated 2013-04

GL476399        Pmarinus_7.0    supercontig     1       4695893 .       .       .       ID=supercontig:GL476399;Alias=scaffold_71
GL476399        ensembl gene    2596494 2601138 .       +       .       ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1
GL476399        ensembl transcript      2596494 2601138 .       +       .       ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1
GL476399        ensembl exon    2596494 2596538 .       +       .       Name=ENSPMAE00000087923;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;rank=1;version=1
GL476399        ensembl exon    2598202 2598361 .       +       .       Name=ENSPMAE00000087929;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=2;ensembl_phase=1;rank=2;version=1
GL476399        ensembl exon    2599023 2599282 .       +       .       Name=ENSPMAE00000087937;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;rank=3;version=1
GL476399        ensembl exon    2599814 2599947 .       +       .       Name=ENSPMAE00000087952;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;rank=4;version=1
GL476399        ensembl exon    2600895 2601138 .       +       .       Name=ENSPMAE00000087966;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;rank=5;version=1
GL476399        ensembl CDS     2596499 2596538 .       +       0       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2598202 2598361 .       +       2       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2599023 2599282 .       +       1       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2599814 2599947 .       +       2       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2600895 2601044 .       +       0       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl five_prime_UTR  2596494 2596498 .       +       .       Parent=transcript:ENSPMAT00000010026
GL476399        ensembl three_prime_UTR 2601045 2601138 .       +       .       Parent=transcript:ENSPMAT00000010026
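
Because the dumps are gzip-compressed, tab-delimited text, they can be read
directly without unpacking them first. The sketch below counts features per
type (column 3) using only the Python standard library. The function name
count_feature_types is hypothetical, and UTF-8 is an assumed encoding.

```python
import gzip
from collections import Counter


def count_feature_types(path):
    """Count feature lines per type in a gzip-compressed GFF3 file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Skip blank lines, metadata lines and other comments.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9:
                counts[fields[2]] += 1
    return counts
```

The file is streamed line by line, so memory use stays small even for large
dumps.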