#### README ####

-----------------------
GFF FLATFILE DUMPS
-----------------------

This directory contains GFF flatfile dumps. All files are compressed
using GNU Zip.

Ensembl provides an automatic gene annotation for Nematocida parisii ERTm1.
For some species (human, mouse, zebrafish, pig and rat), the annotation
provided through Ensembl also includes manual annotation from HAVANA.
In the case of human and mouse, the GFF3 files found here are equivalent
to the GENCODE gene set.

GFF3 flat file format dumping provides all the sequence features known by
Ensembl, including protein coding genes, ncRNA, repeat features etc.
Annotation is based on alignments of biological evidence (e.g. proteins,
cDNAs, RNA-seq) to a genome assembly. The annotation dumped here is
transcribed and translated from the genome assembly and is not the original
input sequence data that we used for alignment. Therefore, the sequences
provided by Ensembl may differ from the original input sequence data where
the genome assembly is different to the aligned sequence.

Considerably more information is stored in Ensembl: the flat file just
gives a representation which is compatible with existing tools.

We are considering other information that should be made dumpable. In
general, we would prefer that you use database access over flat file
access if you want to do something serious with the data.

Note the following features of the GFF3 format provided on this site:

1) Types are described using SO terms that are as specific as possible,
   e.g. protein_coding_gene is used where a gene is known to be protein
   coding.
2) Phase is currently set to 0 - the phase used by the Ensembl system is
   stored as an attribute.
3) Some validators may warn about duplicated identifiers for CDS features.
   This is to allow split features to be grouped.
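As an illustration of note 3, the sketch below reads a compressed dump and
groups CDS lines that share an ID. It is not part of the Ensembl tooling:
the function names read_gff3_lines and group_cds_by_id are hypothetical,
and the UTF-8 encoding is an assumption about the file contents.

```python
import gzip
from collections import defaultdict


def read_gff3_lines(path, encoding="utf-8"):
    """Yield text lines from a GNU Zip compressed GFF3 file.

    The encoding default is an assumption, not something the dump guarantees.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\r\n")


def group_cds_by_id(lines):
    """Map each CDS ID to the list of (start, end) segments that share it."""
    groups = defaultdict(list)
    for line in lines:
        # Skip blank lines, pragmas (## and #!) and ordinary comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Ninth column: semicolon-separated tag=value pairs.
        attrs = dict(p.partition("=")[::2] for p in cols[8].split(";") if p)
        if "ID" in attrs:
            groups[attrs["ID"]].append((int(cols[3]), int(cols[4])))
    return dict(groups)
```

A caller might run group_cds_by_id(read_gff3_lines(path)) on a downloaded
file, where path is whatever local file name was chosen. Lines that repeat
an ID end up as several segments of one feature rather than as an error.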
We are actively working to improve our GFF3 so some of these issues may
be addressed in future releases of Ensembl.

Additionally, we provide a GFF3 file containing the predicted gene set
as generated by Genscan and other ab initio prediction tools.
This file is identified by the abinitio extension.

-----------
FILE NAMES
-----------

The files are consistently named following this pattern:

   <species>.<assembly>.<version>.gff3.gz

<species>:  The systematic name of the species.
<assembly>: The assembly build name.
<version>:  The version of Ensembl from which the data was exported.
gff3 :      All files in these directories are in GFF3 format.
gz :        All files are compacted with GNU Zip for storage efficiency.

e.g.
Homo_sapiens.GRCh38.81.gff3.gz

For the predicted gene set, an additional abinitio flag is added to the
file name.

   <species>.<assembly>.<version>.abinitio.gff3.gz

e.g.
Homo_sapiens.GRCh38.81.abinitio.gff3.gz

--------------------------------
Definition and supported options
--------------------------------

GFF3 files are nine-column, tab-delimited, plain text files. Literal use
of tab, newline, carriage return, the percent (%) sign, and control
characters must be encoded using RFC 3986 Percent-Encoding; no other
characters may be encoded. Backslash and other ad-hoc escaping conventions
that have been added to the GFF format are not allowed. The file contents
may include any character in the set supported by the operating
environment, although for portability with other systems, use of Latin-1
or Unicode is recommended.

Fields

Fields are tab-separated. Also, all but the final field in each feature
line must contain a value; "empty" columns are denoted with a '.'

seqid - The ID of the landmark used to establish the coordinate system for
    the current feature. IDs may contain any characters, but must escape
    any characters not in the set [a-zA-Z0-9.:^*!+_?-|]. In particular,
    IDs may not contain unescaped whitespace and must not begin with an
    unescaped ">".
source - The source is a free text qualifier intended to describe the
    algorithm or operating procedure that generated this feature.
    Typically this is the name of a piece of software, such as "Genescan"
    or a database name, such as "Genbank." In effect, the source is used
    to extend the feature ontology by adding a qualifier to the type,
    creating a new composite type that is a subclass of the type in the
    type column.
type - The type of the feature (previously called the "method"). This is
    constrained to be either: (a) a term from the "lite" version of the
    Sequence Ontology - SOFA, (b) a term from the full Sequence Ontology -
    it must be an is_a child of sequence_feature (SO:0000110), or (c) a
    SOFA or SO accession number. The latter alternative is distinguished
    using the syntax SO:000000.
start - Start position of the feature in positive 1-based integer
    coordinates; always less than or equal to end.
end - End position of the feature in positive 1-based integer coordinates.
score - The score of the feature, a floating point number. As in earlier
    versions of the format, the semantics of the score are ill-defined.
    It is strongly recommended that E-values be used for sequence
    similarity features, and that P-values be used for ab initio gene
    prediction features.
strand - The strand of the feature. + for positive strand (relative to the
    landmark), - for minus strand, and . for features that are not
    stranded. In addition, ? can be used for features whose strandedness
    is relevant, but unknown.
phase - For features of type "CDS", the phase indicates where the feature
    begins with reference to the reading frame. The phase is one of the
    integers 0, 1, or 2, indicating the number of bases that should be
    removed from the beginning of this feature to reach the first base of
    the next codon.
    In other words, a phase of "0" indicates that the next codon begins
    at the first base of the region described by the current line, a
    phase of "1" indicates that the next codon begins at the second base
    of this region, and a phase of "2" indicates that the codon begins at
    the third base of this region. This is NOT to be confused with the
    frame, which is simply start modulo 3.
attribute - A list of feature attributes in the format tag=value. Multiple
    tag=value pairs are separated by semicolons. URL escaping rules are
    used for tags or values containing the following characters: ",=;".
    Spaces are allowed in this field, but tabs must be replaced with the
    %09 URL escape. Attribute values do not need to be and should not be
    quoted. The quotes should be included as part of the value by parsers
    and not stripped.

Attributes

The following attributes are available. All attributes are
semicolon-separated pairs of keys and values.

- ID: ID of the feature. IDs for each feature must be unique within the
  scope of the GFF file. In the case of discontinuous features (i.e. a
  single feature that exists over multiple genomic locations) the same ID
  may appear on multiple lines. All lines that share an ID collectively
  represent a single feature.
- Name: Display name for the feature. This is the name to be displayed to
  the user.
- Alias: A secondary name for the feature.
- Parent: Indicates the parent of the feature. A parent ID can be used to
  group exons into transcripts, transcripts into genes, and so forth.
- Dbxref: A database cross reference.
- Ontology_term: A cross reference to an ontology term.
- Is_circular: A flag to indicate whether a feature is circular.

Pragmas/Metadata

GFF3 files can contain metadata. Experimental metadata lines are marked
with a #!. Those which are stable are marked with a ##. Each metadata
line is a single key, a space and then the value. Current metadata keys
are:

* genome-build - Build identifier of the assembly e.g. GRCh37.p11
* genome-version - Version of this assembly e.g. GRCh37
* genome-date - The date of this assembly's release e.g. 2009-02
* genome-build-accession - The accession and source of this accession
  e.g. NCBI:GCA_000001405.14
* genebuild-last-updated - The date of the last genebuild update
  e.g. 2013-09

------------------
Example GFF3 output
------------------

##gff-version 3
#!genome-build Pmarinus_7.0
#!genome-version Pmarinus_7.0
#!genome-date 2011-01
#!genebuild-last-updated 2013-04

GL476399 Pmarinus_7.0 supercontig 1 4695893 . . . ID=supercontig:GL476399;Alias=scaffold_71
GL476399 ensembl gene 2596494 2601138 . + . ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1
GL476399 ensembl transcript 2596494 2601138 . + . ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1
GL476399 ensembl exon 2596494 2596538 . + . Name=ENSPMAE00000087923;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;rank=1;version=1
GL476399 ensembl exon 2598202 2598361 . + . Name=ENSPMAE00000087929;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=2;ensembl_phase=1;rank=2;version=1
GL476399 ensembl exon 2599023 2599282 . + . Name=ENSPMAE00000087937;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;rank=3;version=1
GL476399 ensembl exon 2599814 2599947 . + . Name=ENSPMAE00000087952;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;rank=4;version=1
GL476399 ensembl exon 2600895 2601138 . + . Name=ENSPMAE00000087966;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;rank=5;version=1
GL476399 ensembl CDS 2596499 2596538 . + 0 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2598202 2598361 . + 2 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2599023 2599282 . + 1 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2599814 2599947 . + 2 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2600895 2601044 . + 0 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl five_prime_UTR 2596494 2596498 . + . Parent=transcript:ENSPMAT00000010026
GL476399 ensembl three_prime_UTR 2601045 2601138 . + . Parent=transcript:ENSPMAT00000010026
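
The field, attribute and pragma rules described in this README can be
sketched in a few lines of Python. This is a minimal illustration under
stated assumptions, not the parser used by Ensembl: the names
parse_attributes, parse_feature and parse_gff3 are hypothetical, the input
is assumed to be tab-delimited as the format requires, and score, strand
and phase are simply kept as strings.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_attributes(field):
    """Turn the ninth column into a dict of percent-decoded tag/value pairs."""
    attrs = {}
    if field == ".":
        return attrs
    for pair in field.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Quotes are kept as part of the value; only %XX escapes are decoded.
        attrs[unquote(tag)] = unquote(value)
    return attrs


def parse_feature(line):
    """Split one feature line into its nine named columns."""
    parts = line.split("\t")
    if len(parts) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(parts))
    record = dict(zip(COLUMNS, parts))
    record["seqid"] = unquote(record["seqid"])
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = parse_attributes(record["attributes"])
    return record


def parse_gff3(lines):
    """Return (pragmas, features) from an iterable of GFF3 lines."""
    pragmas, features = {}, []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##") or line.startswith("#!"):
            # Metadata: a single key, a space, then the value.
            key, _, value = line[2:].partition(" ")
            pragmas[key] = value
        elif line.startswith("#"):
            continue  # ordinary comment
        else:
            features.append(parse_feature(line))
    return pragmas, features


sample = [
    "##gff-version 3",
    "#!genome-build Pmarinus_7.0",
    "GL476399\tensembl\tgene\t2596494\t2601138\t.\t+\t.\t"
    "ID=gene:ENSPMAG00000009070;Name=TRYPA3;"
    "description=Trypsinogen A1%3B Trypsinogen a3",
]
pragmas, features = parse_gff3(sample)
print(features[0]["attributes"]["description"])  # Trypsinogen A1; Trypsinogen a3
```

Splitting on ";" before decoding matters: an escaped %3B inside a value
must not be treated as a pair separator, which is why unquote is applied
only after each tag=value pair has been isolated.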