#### README ####

-----------------------
GFF FLATFILE DUMPS
-----------------------

Gene annotation is provided in GFF3 format. Detailed specification of
the format is maintained by the Sequence Ontology:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

GFF3 files are validated using GenomeTools: http://genometools.org

For chromosomal assemblies, in addition to a file containing all genes,
there are per-chromosome files. If a predicted geneset is available
(generated by Genscan and other ab initio tools), these genes are in a
separate 'abinitio' file.

The 'type' of gene features is:
 * "gene" for protein-coding genes
 * "ncRNA_gene" for RNA genes
 * "pseudogene" for pseudogenes
The 'type' of transcript features is:
 * "mRNA" for protein-coding transcripts
 * a specific type of RNA transcript such as "snoRNA" or "lnc_RNA"
 * "pseudogenic_transcript" for pseudogenes
All transcripts are linked to "exon" features.
Protein-coding transcripts are linked to "CDS", "five_prime_UTR", and
"three_prime_UTR" features.

Attributes for feature types:
(square brackets indicate data which is not available for all features)
 * region types:
    * ID: Unique identifier, format ":"
    * [Alias]: A comma-separated list of aliases, usually including the
      INSDC accession
    * [Is_circular]: Flag to indicate circular regions
 * gene types:
    * ID: Unique identifier, format "gene:"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * gene_id: Ensembl gene stable ID
    * version: Ensembl gene version
    * [Name]: Gene name
    * [description]: Gene description
 * transcript types:
    * ID: Unique identifier, format "transcript:"
    * Parent: Gene identifier, format "gene:"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * transcript_id: Ensembl transcript stable ID
    * version: Ensembl transcript version
    * [Note]: If the transcript sequence has been edited (i.e. differs
      from the genomic sequence), the edits are described in a note.
 * exon
    * Parent: Transcript identifier, format "transcript:"
    * exon_id: Ensembl exon stable ID
    * version: Ensembl exon version
    * constitutive: Flag to indicate if exon is present in all transcripts
    * rank: Integer that shows the 5'->3' ordering of exons
 * CDS
    * ID: Unique identifier, format "CDS:"
    * Parent: Transcript identifier, format "transcript:"
    * protein_id: Ensembl protein stable ID
    * version: Ensembl protein version

Metadata:
 * genome-build - Build identifier of the assembly e.g. GRCh37.p11
 * genome-version - Version of this assembly e.g. GRCh37
 * genome-date - The date of the release of this assembly e.g. 2009-02
 * genome-build-accession - Genome accession e.g. GCA_000001405.14
 * genebuild-last-updated - Date of the last genebuild update e.g. 2013-09

-----------
FILE NAMES
-----------

The files are consistently named following this pattern:
   ..<_version>.gff3.gz

: The systematic name of the species.
: The assembly build name.
: The version of Ensembl from which the data was exported.
gff3 : All files in these directories are in GFF3 format
gz : All files are compressed with GNU Zip for storage efficiency.

e.g.
Homo_sapiens.GRCh38.81.gff3.gz

For the predicted gene set, an additional abinitio flag is added to the
file name.
   ...abinitio.gff3.gz

e.g.
Homo_sapiens.GRCh38.81.abinitio.gff3.gz

------------------
Example GFF3 output
------------------

##gff-version 3
#!genome-build Pmarinus_7.0
#!genome-version Pmarinus_7.0
#!genome-date 2011-01
#!genebuild-last-updated 2013-04

GL476399 Pmarinus_7.0 supercontig 1 4695893 . . . ID=supercontig:GL476399;Alias=scaffold_71
GL476399 ensembl gene 2596494 2601138 . + . ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1
GL476399 ensembl transcript 2596494 2601138 . + . ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1
GL476399 ensembl exon 2596494 2596538 . + . Name=ENSPMAE00000087923;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;rank=1;version=1
GL476399 ensembl exon 2598202 2598361 . + . Name=ENSPMAE00000087929;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=2;ensembl_phase=1;rank=2;version=1
GL476399 ensembl exon 2599023 2599282 . + . Name=ENSPMAE00000087937;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;rank=3;version=1
GL476399 ensembl exon 2599814 2599947 . + . Name=ENSPMAE00000087952;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;rank=4;version=1
GL476399 ensembl exon 2600895 2601138 . + . Name=ENSPMAE00000087966;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;rank=5;version=1
GL476399 ensembl CDS 2596499 2596538 . + 0 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2598202 2598361 . + 2 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2599023 2599282 . + 1 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2599814 2599947 . + 2 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2600895 2601044 . + 0 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl five_prime_UTR 2596494 2596498 . + . Parent=transcript:ENSPMAT00000010026
GL476399 ensembl three_prime_UTR 2601045 2601138 . + . Parent=transcript:ENSPMAT00000010026

--------------------------------------
Locus Reference Genomic Sequence (LRG)
--------------------------------------

This is a manually curated project that contains stable and un-versioned
reference sequences designed specifically for reporting sequence variants
with clinical implications.
The sequences of each locus (also called LRG) are chosen in collaboration
with research and diagnostic laboratories, LSDB (locus specific database)
curators and mutation consortia with expertise in the region of interest.

LRG website: http://www.lrg-sequence.org

LRG data are freely available in several formats (FASTA, BED, XML,
Tabulated) at this address: http://www.lrg-sequence.org/downloads
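
--------------------------------
Example: reading GFF3 attributes
--------------------------------

The attribute conventions described in the GFF section (semicolon-separated
tag=value pairs, prefixed identifiers such as "gene:" and "transcript:", and
percent-encoded characters such as %3B in descriptions) could be read with
something like the Python sketch below. This is an illustration only, not
part of the Ensembl tooling: the function names parse_attributes and
strip_prefix are hypothetical, and the sketch assumes the attributes column
has already been separated from the rest of the record.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict of tag -> value.

    Pairs are separated by ';' and tag from value by '='. Percent-encoded
    characters (e.g. %3B for ';') are decoded only after splitting, so an
    encoded semicolon inside a description does not break the pair up.
    """
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        tag, _, value = pair.partition("=")
        attributes[unquote(tag)] = unquote(value)
    return attributes


def strip_prefix(identifier):
    """Return the part after the first ':' (e.g. the stable ID in
    'gene:ENSPMAG00000009070'); identifiers without ':' are unchanged."""
    return identifier.partition(":")[2] if ":" in identifier else identifier


# Attributes column of the transcript record from the example output.
example = ("ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;"
           "Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1")
attrs = parse_attributes(example)
print(strip_prefix(attrs["ID"]), strip_prefix(attrs["Parent"]), attrs["biotype"])
```

Values are kept as plain strings, so a comma-separated attribute such as
Alias would still need to be split by the caller.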