#### README ####

-----------------------
GFF FLATFILE DUMPS
-----------------------
Gene annotation is provided in GFF3 format. Detailed specification of
the format is maintained by the Sequence Ontology:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

GFF3 files are validated using GenomeTools: http://genometools.org

For chromosomal assemblies, in addition to a file containing all
genes, there are per-chromosome files. If a predicted geneset is
available (generated by Genscan and other ab initio tools), these
genes are in a separate 'abinitio' file.


The 'type' of gene features is:
 * "gene" for protein-coding genes
 * "ncRNA_gene" for RNA genes
 * "pseudogene" for pseudogenes
The 'type' of transcript features is:
 * "mRNA" for protein-coding transcripts
 * a specific type or RNA transcript such as "snoRNA" or "lnc_RNA"
 * "pseudogenic_transcript" for pseudogenes
All transcripts are linked to "exon" features.
Protein-coding transcripts are linked to "CDS", "five_prime_UTR", and
"three_prime_UTR" features.

Attributes for feature types:
(square brackets indicate data which is not available for all features)
 * region types:
    * ID: Unique identifier, format "<region_type>:<region_name>"
    * [Alias]: A comma-separated list of aliases, usually including the
      INSDC accession
    * [Is_circular]: Flag to indicate circular regions
 * gene types:
    * ID: Unique identifier, format "gene:<gene_stable_id>"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * gene_id: Ensembl gene stable ID
    * version: Ensembl gene version
    * [Name]: Gene name
    * [description]: Gene description
 * transcript types:
    * ID: Unique identifier, format "transcript:<transcript_stable_id>"
    * Parent: Gene identifier, format "gene:<gene_stable_id>"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * transcript_id: Ensembl transcript stable ID
    * version: Ensembl transcript version
    * [Note]: If the transcript sequence has been edited (i.e. differs
      from the genomic sequence), the edits are described in a note.
 * exon
    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
    * exon_id: Ensembl exon stable ID
    * version: Ensembl exon version
    * constitutive: Flag to indicate if exon is present in all
      transcripts
    * rank: Integer that show the 5'->3' ordering of exons
 * CDS
    * ID: Unique identifier, format "CDS:<protein_stable_id>"
    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
    * protein_id: Ensembl protein stable ID
    * version: Ensembl protein version

Metadata:
 * genome-build - Build identifier of the assembly e.g. GRCh37.p11
 * genome-version - Version of this assembly e.g. GRCh37
 * genome-date - The date of the release of this assembly e.g. 2009-02
 * genome-build-accession - Genome accession e.g. GCA_000001405.14
 * genebuild-last-updated - Date of the last genebuild update e.g. 2013-09

-----------
FILE NAMES
------------
The files are consistently named following this pattern:
   <species>.<assembly>.<_version>.gff3.gz

<species>:       The systematic name of the species. 
<assembly>:      The assembly build name.
<version>:       The version of Ensembl from which the data was exported.
gff3 : All files in these directories are in GFF3 format
gz : All files are compacted with GNU Zip for storage efficiency.

e.g. 
Homo_sapiens.GRCh38.81.gff3.gz

For the predicted gene set, an additional abinitio flag is added to the name file.
<species>.<assembly>.<version>.abinitio.gff3.gz

e.g.
Homo_sapiens.GRCh38.81.abinitio.gff3.gz

------------------
Example GFF3 output
------------------

##gff-version 3
#!genome-build  Pmarinus_7.0
#!genome-version Pmarinus_7.0
#!genome-date 2011-01
#!genebuild-last-updated 2013-04

GL476399        Pmarinus_7.0    supercontig     1       4695893 .       .       .       ID=supercontig:GL476399;Alias=scaffold_71
GL476399        ensembl gene    2596494 2601138 .       +       .       ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1
GL476399        ensembl transcript      2596494 2601138 .       +       .       ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1
GL476399        ensembl exon    2596494 2596538 .       +       .       Name=ENSPMAE00000087923;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;rank=1;version=1
GL476399        ensembl exon    2598202 2598361 .       +       .       Name=ENSPMAE00000087929;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=2;ensembl_phase=1;rank=2;version=1
GL476399        ensembl exon    2599023 2599282 .       +       .       Name=ENSPMAE00000087937;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;rank=3;version=1
GL476399        ensembl exon    2599814 2599947 .       +       .       Name=ENSPMAE00000087952;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;rank=4;version=1
GL476399        ensembl exon    2600895 2601138 .       +       .       Name=ENSPMAE00000087966;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;rank=5;version=1
GL476399        ensembl CDS     2596499 2596538 .       +       0       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2598202 2598361 .       +       2       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2599023 2599282 .       +       1       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2599814 2599947 .       +       2       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2600895 2601044 .       +       0       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl five_prime_UTR  2596494 2596498 .       +       .       Parent=transcript:ENSPMAT00000010026
GL476399        ensembl three_prime_UTR 2601045 2601138 .       +       .       Parent=transcript:ENSPMAT00000010026