################################################################################################
This directory contains data files produced by the GENCODE project,
which is headed by Paul Flicek at the EMBL-EBI, UK.

For questions, please contact gencode-help@ebi.ac.uk
and check the website http://www.gencodegenes.org
################################################################################################

##################################
GENCODE Release Files Description
##################################

(X is the release version, e.g. 21 for human, M4 for mouse):

#################
#Annotation files
#################

1. gencode.vX.annotation.{gtf,gff3}.gz:
    Main file, gene annotation on reference chromosomes in GTF and GFF3 file formats.
    These are the main GENCODE gene annotation files. They contain annotation (genes, transcripts, exons,
    start_codon, stop_codon, UTRs, CDS) on the reference chromosomes, which are chr1-22, X, Y, M in human
    and chr1-19, X, Y, M in mouse.

2. gencode.vX.chr_patch_hapl_scaff.annotation.{gtf,gff3}.gz:
    Gene annotation on reference-chromosomes/patches/scaffolds/haplotypes in GTF and GFF3 file formats.

3. gencode.vX.primary_assembly.annotation.{gtf,gff3}.gz:
    Gene annotation on reference chromosomes and scaffolds in GTF and GFF3 file formats.

4. gencode.vX.basic.annotation.{gtf,gff3}.gz:
    Basic gene annotation on reference chromosomes in GTF and GFF3 file formats.
    This is a subset of the corresponding comprehensive annotation including only those transcripts
    tagged as 'basic' in every gene.

5. gencode.vX.chr_patch_hapl_scaff.basic.annotation.{gtf,gff3}.gz:
    Basic gene annotation on reference-chromosomes/patches/scaffolds/haplotypes in GTF and GFF3 file formats.

6. gencode.vX.primary_assembly.basic.annotation.{gtf,gff3}.gz:
    Basic gene annotation on reference chromosomes and scaffolds in GTF and GFF3 file formats.

7. gencode.vX.long_noncoding_RNAs.{gtf,gff3}.gz:
    Long non-coding RNAs on reference chromosomes in GTF and GFF3 file formats.
    These files are a subset of the main annotation files on the reference chromosomes.
    They contain only the lncRNA genes, which are those with any of these biotypes:
    "processed_transcript", "lincRNA", "3prime_overlapping_ncrna", "antisense", "non_coding",
    "sense_intronic", "sense_overlapping", "TEC", "known_ncrna", "bidirectional_promoter_lncrna",
    "macro_lncRNA", "lncRNA".

8. gencode.vX.polyAs.{gtf,gff3}.gz:
    PolyA features annotated by Havana on reference chromosomes in GTF and GFF3 file formats.
    These files contain polyA signals, polyA sites and pseudo polyAs manually annotated by HAVANA.
    They include only the reference chromosomes.
    The value of the 'gene_id', 'transcript_id', 'gene_name' and 'transcript_name' fields is a random identifier.
    The polyA features are not directly associated with any gene or transcript when annotated by Havana.

9. gencode.vX.2wayconspseudos.{gtf,gff3}.gz:
    (Retrotransposed) pseudogenes predicted by the Yale & UCSC pipelines, but not by Havana,
    on reference chromosomes in GTF and GFF3 file formats.

10. gencode.vX.tRNAs.{gtf,gff3}.gz:
    tRNA structures predicted by tRNA-Scan on reference chromosomes in GTF and GFF3 file formats.

###############
#Sequence files
###############

11. gencode.vX.transcripts.fa.gz:
    All transcript sequences on reference chromosomes in Fasta format.
    The sequence headers contain the following fields:
    transcript-id|
    gene-id|
    Havana-gene-id (if the gene contains manually annotated transcripts, '-' otherwise)|
    Havana-transcript-id (if this transcript was manually annotated, '-' otherwise)|
    transcript-name|
    gene-name|
    sequence-length|
    transcript biotype

12. gencode.vX.pc_transcripts.fa.gz:
    Protein-coding transcript sequences on reference chromosomes in Fasta format.
    The sequence headers contain the following fields:
    transcript-id|
    gene-id|
    Havana-gene-id (if the gene contains manually annotated transcripts, '-' otherwise)|
    Havana-transcript-id (if this transcript was manually annotated, '-' otherwise)|
    transcript-name|
    gene-name|
    sequence-length|
    5'-UTR (3'-UTR if reverse strand) location in the transcript|
    CDS location in the transcript|
    3'-UTR (5'-UTR if reverse strand) location in the transcript

13. gencode.vX.pc_translations.fa.gz:
    Translations of protein-coding transcripts on reference chromosomes in Fasta format.
    The sequence headers contain the following fields:
    transcript-id|
    gene-id|
    Havana-gene-id (if the gene contains manually annotated transcripts, '-' otherwise)|
    Havana-transcript-id (if this transcript was manually annotated, '-' otherwise)|
    transcript-name|
    gene-name|
    sequence-length

14. gencode.vX.lncRNA_transcripts.fa.gz:
    Long non-coding RNA transcript sequences on reference chromosomes in Fasta format.
    The sequence headers contain the following fields:
    transcript-id|
    gene-id|
    Havana-gene-id (if the gene contains manually annotated transcripts, '-' otherwise)|
    Havana-transcript-id (if this transcript was manually annotated, '-' otherwise)|
    transcript-name|
    gene-name|
    sequence-length
    Long non-coding RNA transcripts are those with any of these biotypes:
    "processed_transcript", "lincRNA", "3prime_overlapping_ncrna", "antisense", "non_coding",
    "sense_intronic", "sense_overlapping", "TEC", "known_ncrna", "bidirectional_promoter_lncrna",
    "macro_lncRNA", "lncRNA".

15. A.genome.fa.gz (where A is the current assembly name, e.g. GRCh38 for human, GRCm38 for mouse):
    Genome sequence fasta file (sequence region names are the same as in the GTF/GFF3 files).
    It includes reference chromosomes, scaffolds, assembly patches and haplotypes.

16. A.primary_assembly.genome.fa.gz (where A is the current assembly name, e.g. GRCh38 for human, GRCm38 for mouse):
    Primary assembly genome sequence fasta file (sequence region names are the same as in the GTF/GFF3 files).
    It includes reference chromosomes and scaffolds only.

###############
#Metadata files
###############

17. gencode.vX.metadata.Annotation_remark.gz:
    Remarks made during the manual annotation of the transcript.
    1 - transcript id
    2 - annotation remark

18. gencode.vX.metadata.EntrezGene.gz:
    Entrez Gene id associated to the transcript.
    1 - transcript id
    2 - Entrez Gene id

19. gencode.vX.metadata.Exon_supporting_feature.gz:
    Piece of evidence used in the annotation of an exon (usually peptides, mRNAs, ESTs).
    1 - transcript id
    2 - external id of the feature supporting the exon annotation
    3 - external source of the supporting feature
    4 - exon id
    5 - exon coordinates

20. gencode.vX.metadata.Gene_source.gz:
    Source of the gene annotation (Ensembl, Havana, Ensembl-Havana merged model or imported
    in the case of small RNA and mitochondrial genes).
    1 - gene id
    2 - gene source

21. gencode.vX.metadata.MGI.gz:
    MGI approved gene symbol.
    1 - transcript id
    2 - MGI gene symbol
    3 - MGI unique id

22. gencode.vX.metadata.PDB.gz:
    PDB entry associated to the transcript.
    1 - transcript id
    2 - PDB id

23. gencode.vX.metadata.PolyA_feature.gz:
    Manually annotated polyA feature overlapping the transcript 3'-end.
    1 - transcript id
    2 - transcript-based start coordinate of the polyA feature
    3 - transcript-based end coordinate of the polyA feature
    4 - polyA feature chromosome
    5 - polyA feature start coordinate
    6 - polyA feature end coordinate
    7 - polyA feature strand
    8 - polyA feature type ("polyA_site", "polyA_signal", "pseudo_polyA")

24. gencode.vX.metadata.Pubmed_id.gz:
    PubMed id of a publication associated to the transcript.
    1 - transcript id
    2 - PubMed id

25. gencode.vX.metadata.RefSeq.gz:
    RefSeq RNA and/or protein associated to the transcript.
    1 - transcript id
    2 - RefSeq RNA id
    3 - RefSeq protein id (optional)

26. gencode.vX.metadata.Selenocysteine.gz:
    Amino acid position of a selenocysteine residue in the transcript.
    1 - transcript id
    2 - selenocysteine position

27. gencode.vX.metadata.SwissProt.gz:
    UniProtKB/SwissProt entry associated to the transcript.
    1 - transcript id
    2 - UniProtKB/SwissProt accession number
    3 - UniProtKB/SwissProt accession number

28. gencode.vX.metadata.Transcript_source.gz:
    Source of the transcript annotation (Ensembl, Havana, etc).
    1 - transcript id
    2 - transcript source

29. gencode.vX.metadata.Transcript_supporting_feature.gz:
    Piece of evidence used in the annotation of the transcript.
    1 - transcript id
    2 - external id of the feature supporting the transcript annotation
    3 - external source of the supporting feature

30. gencode.vX.metadata.TrEMBL.gz:
    UniProtKB/TrEMBL entry associated to the transcript.
    1 - transcript id
    2 - UniProtKB/TrEMBL accession number
    3 - UniProtKB/TrEMBL accession number

######################################
General format of the annotation files
######################################

We supply genome-wide features on three different confidence levels.
Levels 1 + 2 should be used for high-quality local analysis.
Levels 1 + 2 + 3 should be used for genome-wide analysis.

* Level 1: validated
    Pseudogene loci that were predicted by the analysis pipelines from YALE and UCSC as well as by HAVANA manual annotation from WTSI.
    Other transcripts that were verified experimentally by RT-PCR and sequencing through the GENCODE experimental pipeline.

* Level 2: manual annotation
    HAVANA manual annotation from WTSI (and Ensembl annotation where it is identical to Havana).
    The following regions are considered "fully annotated" although they will still be updated:
    chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, X, Y.

* Level 3: automated annotation
    ENSEMBL loci where they are different from the HAVANA annotation or where no annotation can be found.
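
The level guidance above can be applied while reading a GTF file. The following is a minimal Python sketch, not part of the release. It assumes the level is written in the attributes column (column 9) as a 'level N;' pair, and it does not handle GFF3 files. The function names and the two sample lines are made up for illustration.

```python
import re

# Assumption: the GTF attributes column contains the level as 'level N;'.
# The optional quotes make the pattern tolerant of 'level "N";' as well.
LEVEL_RE = re.compile(r'(?:^|;)\s*level\s+"?([123])"?\s*;')


def gtf_level(line):
    """Return the confidence level (1, 2 or 3) of a GTF line, or None."""
    if line.startswith("#"):
        return None  # header or comment line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None  # not a complete GTF record
    match = LEVEL_RE.search(fields[8])
    return int(match.group(1)) if match else None


def keep_levels(lines, wanted=(1, 2)):
    """Yield only the lines whose level is in 'wanted'."""
    for line in lines:
        if gtf_level(line) in wanted:
            yield line


# Illustrative records only; identifiers and coordinates are invented.
sample = [
    'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; level 2; tag "basic";\n',
    'chr1\tENSEMBL\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; level 3;\n',
]
high_confidence = list(keep_levels(sample))
```

Passing wanted=(1, 2, 3) keeps every annotated feature, which matches the genome-wide recommendation.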

This data is supplied in GTF and GFF3 format as defined here: http://www.gencodegenes.org/data_format.html
with the following tags added to the attributes column where appropriate:

* level [1,2,3]: validation status as described.
* tag "3_nested_supported_extension": 3' end extended based on RNA-seq data.
* tag "3_standard_supported_extension": 3' end extended based on RNA-seq data.
* tag "454_RNA_Seq_supported": annotated based on RNA-seq data.
* tag "5_nested_supported_extension": 5' end extended based on RNA-seq data.
* tag "5_standard_supported_extension": 5' end extended based on RNA-seq data.
* tag "alternative_3_UTR": shares an identical CDS but has alternative 3' UTR with respect to a reference variant.
* tag "alternative_5_UTR": shares an identical CDS but has alternative 5' UTR with respect to a reference variant.
* tag "appris_principal": transcript expected to code for the main functional isoform based on a range of protein features (APPRIS pipeline, Nucleic Acids Res. 2013 Jan;41(Database issue):D110-7). (this tag is not found after Gencode 21)
* tag "appris_candidate": where there is no single 'appris_principal' variant the main functional isoform will be translated from one of the 'appris_candidate' genes. (this tag is not found after Gencode 21)
* tag "appris_candidate_ccds": the "appris_candidate" transcript that has a unique CCDS. (this tag is not found after Gencode 21)
* tag "appris_candidate_longest_ccds": the "appris_candidate" transcripts where there are several CCDS, in this case APPRIS labels the longest CCDS. (this tag is not found after Gencode 21)
* tag "appris_candidate_longest_seq": where there is no "appris_candidate_ccds" or "appris_candidate_longest_ccds" variant, the longest protein of the "appris_candidate" variants is selected as the primary variant. (this tag is not found after Gencode 21)
* tag "appris_candidate_highest_score": where there is no 'appris_principal' variant, the candidate with highest APPRIS score is selected as the primary variant. (this tag is not found after Gencode 20)
* tag "appris_candidate_longest": where there is no 'appris_principal' variant, the longest of the 'appris_candidate' variants is selected as the primary variant. (this tag is not found after Gencode 20)
* tag "appris_principal_1": (This flag corresponds to the older flag "appris_principal") where the transcript is expected to code for the main functional isoform based solely on the core modules in the APPRIS database. The APPRIS core modules map protein structural and functional information and cross-species conservation to the annotated variants.
* tag "appris_principal_2": (This flag corresponds to the older flag "appris_candidate_ccds") Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. If one (but no more than one) of these candidates has a distinct CCDS identifier it is selected as the principal variant for that gene. A CCDS identifier shows that there is consensus between RefSeq and GENCODE/Ensembl for that variant, guaranteeing that the variant has cDNA support.
* tag "appris_principal_3": Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. Consensus CDS annotated earlier are likely to have more cDNA evidence. Consecutive CCDS identifiers are not included in this flag, since they will have been annotated in the same release of CCDS. These are distinguished with the next flag.
* tag "appris_principal_4": (This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_ccds") Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
* tag "appris_principal_5": (This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_seq") Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
* tag "appris_alternative_1": Candidate transcript model(s) that are conserved in at least three tested non-primate species.
* tag "appris_alternative_2": Candidate transcript model(s) that appear to be conserved in fewer than three tested non-primate species.
* tag "artifactual_duplication": annotated on an artifactual duplicate region of the genome assembly.
* tag "basic": identifies a subset of representative transcripts for each gene; prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.
* tag "bicistronic": transcript contains two confidently annotated CDSs. Support may come from e.g. proteomic data, cross-species conservation or published experimental work.
* tag "CAGE_supported_TSS": transcript 5' end overlaps ENCODE or Fantom CAGE cluster.
* tag "CCDS": member of the consensus CDS gene set, confirming coding regions between ENSEMBL, UCSC, NCBI and HAVANA.
* tag "cds_end_NF": the coding region end could not be confirmed.
* tag "cds_start_NF": the coding region start could not be confirmed.
* tag "dotter_confirmed": transcript QC checked using dotplot to identify features e.g. splice junctions, end of homology.
* tag "downstream_ATG": downstream ATG assessed as less likely to initiate the translation of the functional protein due to eg experimental evidence, poor cross-species conservation, weak Kozak context, interference with signal peptides or targeting signals.
* tag "exp_conf": transcript was tested and confirmed experimentally.
* tag "fragmented_locus": locus consists of non-overlapping transcript fragments either because of genome assembly issues (i.e., gaps or mis-assemblies), or because supporting transcripts (e.g., from another species) cannot be completely mapped, or because the supporting transcripts are non-overlapping end pairs (i.e., 5' and 3' ESTs from a single cDNA).
* tag "inferred_exon_combination": transcript model contains all possible in-frame exons supported by homology, experimental evidence or conservation, but the exon combination is not directly supported by a single piece of evidence and may not be biological. Used for large genes with repetitive exons (e.g. titin (TTN)) to represent all the exons individual transcript variants can pool from.
* tag "inferred_transcript_model": transcript model is not supported by a single piece of transcript evidence. May be supported by multiple fragments of transcript evidence or by combining different evidence sources e.g. protein homology, RNA-seq data, published experimental data.
* tag "low_sequence_quality": transcript supported by transcript evidence that, while mapping best-in-genome, shows regions of poor sequence quality.
* tag "mRNA_end_NF": the mRNA end could not be confirmed.
* tag "mRNA_start_NF": the mRNA start could not be confirmed.
* tag "NAGNAG_splice_site": in-frame type of variation where, at the acceptor site, some variants splice after the first AG and others after the second AG.
* tag "ncRNA_host": the locus is a host for small non-coding RNAs.
* tag "nested_454_RNA_Seq_supported": annotated based on RNA-seq data.
* tag "NMD_exception": the transcript looks like it is subject to NMD but publications, experiments or conservation support the translation of the CDS.
* tag "NMD_likely_if_extended": the transcript would likely be subject to NMD if it were longer, but it cannot currently be annotated as NMD as it does not fulfil all criteria - most commonly lack of an intron downstream of the stop codon.
* tag "non_ATG_start": the CDS has a non-ATG start and its validity is supported by publication or conservation.
* tag "non_canonical_conserved": the transcript has a non-canonical splice site conserved in other species.
* tag "non_canonical_genome_sequence_error": the transcript has a non-canonical splice site explained by a genomic sequencing error.
* tag "non_canonical_other": the transcript has a non-canonical splice site explained by other reasons.
* tag "non_canonical_polymorphism": the transcript has a non-canonical splice site explained by a SNP.
* tag "non_canonical_TEC": the transcript has a non-canonical splice site that needs experimental confirmation.
* tag "non_canonical_U12": the transcript has a non-canonical splice site explained by a U12 intron (i.e. AT-AC splice site).
* tag "non_submitted_evidence": a splice variant for which supporting evidence has not been submitted to databases, i.e. the model is based on literature or collaborator evidence.
* tag "not_best_in_genome_evidence": a transcript is supported by evidence from same species paralogous loci.
* tag "not_organism_supported": evidence from other species was used to build model.
* tag "orphan": protein-coding locus with no paralogues or orthologs.
* tag "overlapping_locus": exon(s) of the locus overlap exon(s) of a readthrough transcript or a transcript belonging to another locus.
* tag "overlapping_uORF": a low confidence upstream ATG existing in other coding variant would lead to NMD in this transcript, that uses the high confidence canonical downstream ATG.
* tag "PAR": annotation in the pseudo-autosomal region, which is duplicated between X & Y.
* tag "pseudo_consens": member of the pseudogene set predicted by YALE, UCSC and HAVANA.
* tag "readthrough_gene": protein-coding gene that has a readthrough transcript.
* tag "readthrough_transcript": a transcript that overlaps two or more independent loci but is considered to belong to a 3rd, separate locus.
* tag "reference_genome_error": locus overlaps a sequence error or an assembly error in the reference genome that affects its annotation (e.g., 1 or 2bp insertion/deletion, substitution causing premature stop codon). The main effect is that affected transcripts that would have had a CDS are currently annotated without one.
* tag "retained_intron_CDS": internal intron of CDS portion of transcript is retained.
* tag "retained_intron_final": final intron of CDS portion of transcript is retained.
* tag "retained_intron_first": first intron of CDS portion of transcript is retained.
* tag "retrogene": protein-coding locus created via retrotransposition.
* tag "RNA_Seq_supported_only": transcript supported by RNA-seq data and not supported by mRNA or EST evidence.
* tag "RNA_Seq_supported_partial": transcript annotated based on mixture of RNA-seq data and EST/mRNA/protein evidence.
* tag "RP_supported_TIS": transcript that contains a CDS that has a translation initiation site supported by Ribosomal Profiling data.
* tag "seleno": contains a selenocysteine.
* tag "semi_processed": a processed pseudogene with one or more introns still present. These are likely formed through the retrotransposition of a retained intron transcript.
* tag "sequence_error": transcript contains ≥ 1 non-canonical splice junction that is associated with a known or novel genome sequence error.
* tag "stop_codon_readthrough": transcript whose coding sequence contains an internal stop codon that does not cause the translation termination.
* tag "upstream_ATG": an upstream ATG exists when a downstream ATG is better supported.
* tag "upstream_uORF": a low confidence upstream ATG existing in other coding variant would lead to NMD in this transcript, that uses the high confidence canonical downstream ATG.

Please note: if start codons are split between two exons, two start-codon features will be listed.
Please note: pre-release 4, "cds_start_NF" was listed as "cds start not found", etc.
Please note: pre-release 6, "seleno" tags included the selenocysteine position as the amino acid number within the protein; these are now given as genomic coordinates as separate GTF features.

#################################################################################################
Release M32 (February 2023)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2022.
This release corresponds to Ensembl version 109 in the GRCm39 assembly.

#################################################################################################
Release M31 (October 2022)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to April 2022.
This release corresponds to Ensembl version 108 in the GRCm39 assembly.

The following biotype has been added:
-protein_coding_CDS_not_defined: Transcript that belongs to a protein_coding gene and doesn't contain an ORF. Replaces the processed_transcript transcript biotype in protein_coding genes.

Annotation files in gtf and gff3 format having the basic set of transcripts in the primary assembly are now included in the release.

#################################################################################################
Release M30 (July 2022)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to January 2022.
This release corresponds to Ensembl version 107 in the GRCm39 assembly.

The following biotypes have been added:
-protein_coding_LoF: Not translated in the reference genome owing to a SNP/DIP but in other individuals/haplotypes/strains the transcript is translated. This biotype replaces the polymorphic_pseudogene transcript biotype.
-artifact: Annotated on artifactual regions of the genome assembly.

The following tags or attributes have been added:
-artifactual_duplication / artif_dupl
-readthrough_gene

################################################################################################
Release M29 (April 2022)
################################################################################################

This release contains updates to the Ensembl-HAVANA merged annotation up to August 2021.
This release corresponds to Ensembl version 106 in the GRCm39 assembly.

################################################################################################
Release M28 (December 2021)
################################################################################################

This release contains updates to the Ensembl-HAVANA merged annotation up to May 2021.
This release corresponds to Ensembl version 105 in the GRCm39 assembly.

Clone-based gene names have been retired from the release files.
From this release onwards, genes without a name in MGI, EntrezGene, RFAM or miRBase will get their gene id as default name.
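
As a rough illustration of the naming rule above, the Python sketch below flags genes whose name is just their gene id. It assumes GTF-style 'gene_id' and 'gene_name' attributes. Because the note above does not say whether the default name carries the version suffix, it accepts both forms. The function name and the identifiers in the example are hypothetical.

```python
import re

# Matches quoted GTF attribute pairs such as: gene_id "XYZ.1"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def has_default_name(attributes):
    """True if gene_name equals gene_id, with or without the '.N' version suffix.

    Assumption: which of the two forms is used as the default name is not
    stated in the release notes, so both are treated as placeholders.
    """
    attrs = dict(ATTR_RE.findall(attributes))
    gene_id = attrs.get("gene_id")
    gene_name = attrs.get("gene_name")
    if gene_id is None or gene_name is None:
        return False
    unversioned = gene_id.split(".", 1)[0]
    return gene_name in (gene_id, unversioned)


# Invented identifiers, for illustration only.
unnamed = 'gene_id "ENSMUSG99999999999.1"; gene_type "TEC"; gene_name "ENSMUSG99999999999"; level 2;'
named = 'gene_id "ENSMUSG99999999998.3"; gene_type "protein_coding"; gene_name "Abc1"; level 2;'
flags = [has_default_name(unnamed), has_default_name(named)]
```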

################################################################################################
Release M27 (May 2021)
################################################################################################

This release contains updates to the Ensembl-HAVANA merged annotation up to December 2020.
This release corresponds to Ensembl version 104 in the GRCm39 assembly.

#################################################################################################
Release M26 (February 2021)
#################################################################################################

This release contains updates to the Ensembl-HAVANA merged annotation up to August 2020.
This release corresponds to Ensembl version 103 in the GRCm39 assembly.

#################################################################################################
Release M25 (April 2020)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to November 2019.
This release corresponds to Ensembl version 100 in the GRCm38 assembly.

#################################################################################################
Release M24 (January 2020)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2019.
This release corresponds to Ensembl version 99 in the GRCm38 assembly.

#################################################################################################
Release M23 (September 2019)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to May 2019.
This release corresponds to Ensembl version 98 in the GRCm38 assembly.

New tag added: stop_codon_readthrough.
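
Tags such as stop_codon_readthrough can be looked up per feature. The following is a minimal Python sketch. It assumes the GTF convention of one 'tag "value";' pair per tag in the attributes column. GFF3 files encode attributes differently and are not handled here. The helper name and the sample attribute string are illustrative.

```python
import re

# Assumption: each tag is written as its own 'tag "value";' pair in GTF.
TAG_RE = re.compile(r'(?:^|;)\s*tag "([^"]+)"')


def gtf_tags(attributes):
    """Return the set of tag values found in a GTF attributes string."""
    return set(TAG_RE.findall(attributes))


# Invented attribute string, for illustration only.
attrs = 'gene_id "G1"; transcript_id "T1"; level 2; tag "basic"; tag "stop_codon_readthrough";'
is_readthrough = "stop_codon_readthrough" in gtf_tags(attrs)
```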

#################################################################################################
Release M22 (July 2019)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to February 2019.
This release corresponds to Ensembl version 97 in the GRCm38 assembly.

The following biotypes have been replaced by the "lncRNA" biotype:
-3prime_overlapping_ncRNA
-antisense
-bidirectional_promoter_lncRNA
-lincRNA
-macro_lncRNA
-non_coding
-processed_transcript
-sense_intronic
-sense_overlapping

The following attribute has been added:
-MGI_id: Unique stable id provided by the MGI database for each gene with an approved symbol.

#################################################################################################
Release M21 (April 2019)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to November 2018.
This release corresponds to Ensembl version 96 in the GRCm38 assembly.

#################################################################################################
Release M20 (January 2019)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2018.
This release corresponds to Ensembl version 95 in the GRCm38 assembly.

#################################################################################################
Release M19 (October 2018)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to May 2018.
This release corresponds to Ensembl version 94 in the GRCm38 assembly.

#################################################################################################
Release M18 (July 2018)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to March 2018.
This release corresponds to Ensembl version 93 in the GRCm38 assembly.

#################################################################################################
Release M17 (April 2018)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to November 2017.
This release corresponds to Ensembl version 92 in the GRCm38 assembly.

#################################################################################################
Release M16 (December 2017)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2017.
This release corresponds to Ensembl version 91 in the GRCm38 assembly.

#################################################################################################
Release M15 (August 2017)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to May 2017.
This release corresponds to Ensembl version 90 in the GRCm38 assembly.

The antisense biotype is now called "antisense_RNA".

#################################################################################################
Release M14 (May 2017)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to January 2017.
This release corresponds to Ensembl version 89 in the GRCm38 assembly.

#################################################################################################
Release M13 (March 2017)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to October 2016.
This release corresponds to Ensembl version 88 in the GRCm38 assembly.

New tags added:
-3_nested_supported_extension
-3_standard_supported_extension
-454_RNA_Seq_supported
-5_nested_supported_extension
-5_standard_supported_extension

#################################################################################################
Release M12 (December 2016)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2016.
This release corresponds to Ensembl version 87 in the GRCm38 assembly.

The 'gene_status' and 'transcript_status' attributes have been removed from the GENCODE GTF and GFF3 files.

#################################################################################################
Release M11 (October 2016)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to March 2016.
This release corresponds to Ensembl version 86 in the GRCm38 assembly.

#################################################################################################
Release M10 (July 2016)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to January 2016.
This release corresponds to Ensembl version 85 in the GRCm38 assembly.
#################################################################################################
Release M9 (March 2016)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to October 2015.
This release corresponds to Ensembl version 84 in the GRCm38 assembly.

Two changes have been introduced in the GFF3 files:
- the 'UTR' feature type has been replaced with 'five_prime_UTR' and 'three_prime_UTR';
- CDS, start_codon and stop_codon features that are split across different exons of the same transcript now have a shared ID to conform with the SO GFF3 specification.

#################################################################################################
Release M8 (December 2015)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2015.
This release corresponds to Ensembl version 83 in the GRCm38 assembly.

#################################################################################################
Release M7 (September 2015)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to June 2015.
This release corresponds to Ensembl version 82 in the GRCm38 assembly.

New gene and transcript biotypes introduced in this release:
-bidirectional_promoter_lncrna
-transcribed_unitary_pseudogene

New tags introduced in this release:
-overlapping_locus
-retrogene
-inferred_transcript_model

#################################################################################################
Release M6 (July 2015)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to March 2015.
This release corresponds to Ensembl version 81 in the GRCm38 assembly.

New files added in this release:
a) Basic annotation on the reference chromosomes and on all sequence regions:
   -gencode.vM6.basic.annotation.gtf.gz
   -gencode.vM6.basic.annotation.gff3.gz
   -gencode.vM6.chr_patch_hapl_scaff.basic.annotation.gtf.gz
   -gencode.vM6.chr_patch_hapl_scaff.basic.annotation.gff3.gz
b) Nucleotide sequences of all annotated transcripts on the reference chromosomes:
   -gencode.vM6.transcripts.fa.gz

#################################################################################################
Release M5 (May 2015)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to December 2014.
This release corresponds to Ensembl version 80 in the GRCm38 assembly.

New gene and transcript biotypes introduced in M5:
-macro_lncRNA
-ribozyme
-sRNA
-scaRNA

Seven new APPRIS tags have been added in this release:
-appris_principal_1: (This flag corresponds to the older flag "appris_principal") The transcript expected to code for the main functional isoform, based solely on the core modules in the APPRIS database. The APPRIS core modules map protein structural and functional information and cross-species conservation to the annotated variants.
-appris_principal_2: (This flag corresponds to the older flag "appris_candidate_ccds") Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. If one (but no more than one) of these candidates has a distinct CCDS identifier, it is selected as the principal variant for that gene. A CCDS identifier shows that there is consensus between RefSeq and GENCODE/Ensembl for that variant, guaranteeing that the variant has cDNA support.
-appris_principal_3: Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. Consensus CDS annotated earlier are likely to have more cDNA evidence. Consecutive CCDS identifiers are not included in this flag, since they will have been annotated in the same release of CCDS. These are distinguished with the next flag.
-appris_principal_4: (This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_ccds") Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
-appris_principal_5: (This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_seq") Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
-appris_alternative_1: Candidate transcript models that are conserved in at least three tested non-primate species.
-appris_alternative_2: Candidate transcript models that appear to be conserved in fewer than three tested non-primate species.

The UTR lines in the GTF/GFF3 files now include 'exon_number' and 'exon_id' attributes.
Also, in this release the transcript attributes (i.e. transcript_id, transcript_type, transcript_status, transcript_name, havana_transcript, etc.) were removed from the gene lines in all GTF and GFF3 annotation files.
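
As a rough illustration of the attribute and APPRIS tag changes described for this release, the Python sketch below parses the attribute column of a GTF line and renames the older APPRIS flags to the M5 tags they correspond to. The function and variable names are made up for this example and are not part of any GENCODE tool. The transcript attributes are copied from the Release M3 example line in this file; the gene-line attributes are a shortened illustration, not a line taken from a release file.

```python
import re

# Older APPRIS flag -> M5 tag, following the correspondences stated in these notes.
OLD_TO_NEW_APPRIS = {
    "appris_principal": "appris_principal_1",
    "appris_candidate_ccds": "appris_principal_2",
    "appris_candidate_longest_ccds": "appris_principal_4",
    "appris_candidate_longest_seq": "appris_principal_5",
}

# One attribute: a key, then either a quoted value or a bare value, then ';'.
_ATTRIBUTE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";]+))\s*;')


def parse_gtf_attributes(column):
    """Parse the GTF attribute column into a dict.

    The 'tag' key can occur several times, so its values are kept in a list.
    """
    attrs = {"tag": []}
    for match in _ATTRIBUTE.finditer(column):
        key = match.group(1)
        quoted, bare = match.group(2), match.group(3)
        value = quoted if quoted is not None else bare.strip()
        if key == "tag":
            attrs["tag"].append(value)
        else:
            attrs[key] = value
    return attrs


def rename_appris_tags(tags):
    """Replace older APPRIS flags with their M5 names; other tags pass through."""
    return [OLD_TO_NEW_APPRIS.get(tag, tag) for tag in tags]


# Attribute column of the Release M3 example transcript line in this file.
M3_TRANSCRIPT_ATTRIBUTES = (
    'gene_id "ENSMUSG00000051951.5"; transcript_id "ENSMUST00000070533.4"; '
    'gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Xkr4"; '
    'transcript_type "protein_coding"; transcript_status "KNOWN"; '
    'transcript_name "Xkr4-001"; level 2; protein_id "ENSMUSP00000070648.4"; '
    'tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS14803.1"; '
    'havana_gene "OTTMUSG00000026353.2"; havana_transcript "OTTMUST00000065166.1";'
)

# Shortened, illustrative gene-line attributes (assumption: no transcript
# attributes, as described for gene lines from M5 onwards).
GENE_LINE_ATTRIBUTES = (
    'gene_id "ENSMUSG00000051951.5"; gene_type "protein_coding"; '
    'gene_name "Xkr4"; level 2;'
)

transcript = parse_gtf_attributes(M3_TRANSCRIPT_ATTRIBUTES)
print(transcript["protein_id"])               # ENSMUSP00000070648.4
print(rename_appris_tags(transcript["tag"]))  # ['basic', 'appris_principal_1', 'CCDS']

gene = parse_gtf_attributes(GENE_LINE_ATTRIBUTES)
print(gene.get("transcript_id"))              # None
```

Looking attributes up with .get() rather than indexing keeps the same code usable on gene lines, which no longer carry transcript attributes, and on releases where 'gene_status' and 'transcript_status' are absent.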
NOTE: GTF and GFF3 files containing gene annotation on the primary assembly only (main chromosomes and unplaced/unlocalized scaffolds) were added:
- gencode.vM5.primary_assembly.annotation.gtf.gz
- gencode.vM5.primary_assembly.annotation.gff3.gz
These files are meant to be used mainly for NGS analyses in conjunction with the equivalent primary assembly genome sequence provided:
- GRCm38.primary_assembly.genome.fa.gz

#################################################################################################
Release M4 (December 2014)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2014.
This release corresponds to Ensembl version 78 in the GRCm38 assembly.

Some new gene and transcript biotypes are introduced in M4:
TEC
IG_C_pseudogene
IG_D_pseudogene
TR_C_gene
TR_D_gene
TR_J_gene
TR_J_pseudogene
processed_pseudogene
transcribed_processed_pseudogene
transcribed_unprocessed_pseudogene
unitary_pseudogene
unprocessed_pseudogene

There are some transcript tag changes in M4:
- the appris_candidate_longest_seq tag is introduced in M4, and the appris_candidate_highest_score and appris_candidate_longest tags are removed from M4;
- the exp_conf tag is introduced in M4.

NOTE: The files listed below were updated on 5-12-2014 to correct a small number of gene and transcript names containing a semicolon:
-gencode.vM4.annotation.gff3
-gencode.vM4.annotation.gtf
-gencode.vM4.chr_patch_hapl_scaff.annotation.gff3
-gencode.vM4.chr_patch_hapl_scaff.annotation.gtf
-gencode.vM4.pc_transcripts.fa
-gencode.vM4.pc_translations.fa

#################################################################################################
Release M3 (August 2014)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to April 2014.
This release corresponds to Ensembl version 76 in the GRCm38 assembly.

A new annotation file type, GFF3, is added in the GENCODE releases.
Note: In GFF3 the start and stop codon are included in the CDS. In GTF the start codon is included in the CDS but the stop codon is included in the UTR.

1 extra APPRIS tag is added in this release in the transcript lines:
 * tag "appris_candidate_highest_score": where there is no 'appris_principal' variant, the candidate with the highest APPRIS score is selected as the primary variant.

1 extra attribute is added in the annotation files in all protein coding transcripts: the protein_id
For example:
chr1 HAVANA transcript 3214482 3671498 . - . gene_id "ENSMUSG00000051951.5"; transcript_id "ENSMUST00000070533.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Xkr4"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "Xkr4-001"; level 2; protein_id "ENSMUSP00000070648.4"; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS14803.1"; havana_gene "OTTMUSG00000026353.2"; havana_transcript "OTTMUST00000065166.1";

1 extra metadata file is added:
gencode.vM3.metadata.EntrezGene: EntrezGene gene id associated to the transcript

NOTE: The files listed below were updated on 28-08-2014 to correct the following minor issues:
- some CCDS tags had been wrongly assigned to transcripts with the same start and end coordinates as a CCDS model but with a different amino acid sequence.
- absence of a semicolon after "havana_transcript" attributes in the GTF files.
gencode.vM3.annotation.gff3.gz
gencode.vM3.annotation.gtf.gz
gencode.vM3.chr_patch_hapl_scaff.annotation.gff3.gz
gencode.vM3.chr_patch_hapl_scaff.annotation.gtf.gz
gencode.vM3.long_noncoding_RNAs.gtf.gz

#################################################################################################
Release M2 (December 2013)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to July 2013.
This GENCODE release corresponds to Ensembl version 74 in the GRCm38 assembly.

3 extra tags are added in this release in the transcript lines:
 * tag "appris_principal": transcript expected to code for the main functional isoform based on a range of protein features (APPRIS pipeline).
 * tag "appris_candidate": where there is no single 'appris_principal' variant the main functional isoform will be translated from one of the 'appris_candidate' genes.
 * tag "appris_candidate_longest": where there is no 'appris_principal' variant, the longest of the 'appris_candidate' variants is selected as the primary variant.

#################################################################################################
Release M1 (October 2013)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to July 2011.
This GENCODE release corresponds to Ensembl version 65, which was released in the Ensembl browser in December 2011 in the NCBIM37 assembly.
This Ensembl release has been chosen as the first mouse GENCODE release, GENCODE M1, because this dataset was used in the mouse ENCODE analysis (http://www.ncbi.nlm.nih.gov/pubmed/22889292).
This is the last Ensembl release in assembly version NCBIM37.
Files which contain metadata associated to transcripts and genes, usually displayed in the UCSC browser, are part of the public release:
gencode.M1.metadata.Annotation_remark: remarks made during the manual annotation of the transcript
gencode.M1.metadata.Exon_supporting_feature: piece of evidence used in the annotation of an exon (usually peptides, mRNAs, ESTs)
gencode.M1.metadata.Gene_source: source of the gene annotation (Ensembl, Havana, Ensembl-Havana merged model or imported in the case of small RNA and mitochondrial genes)
gencode.M1.metadata.MGI: MGI approved gene symbol
gencode.M1.metadata.PDB: PDB entry associated to the transcript
gencode.M1.metadata.PolyA_feature: manually annotated polyA feature overlapping the transcript 3'-end (polyA_signal, polyA_site, pseudo_polyA)
    - Fields are 1) transcript_id, 2-3) polyA feature coordinates relative to the transcript, 4-7) polyA feature genomic coordinates, 8) type of polyA feature
gencode.M1.metadata.Pubmed_id: Pubmed id of a publication associated to the transcript
gencode.M1.metadata.RefSeq: RefSeq RNA and/or protein associated to the transcript
gencode.M1.metadata.Selenocysteine: amino acid position of a selenocysteine residue in the transcript
gencode.M1.metadata.SwissProt: UniProtKB/SwissProt entry associated to the transcript
gencode.M1.metadata.Transcript_source: source of the transcript annotation (Ensembl, Havana, Ensembl-Havana merged model or imported in the case of small RNA and mitochondrial genes)
gencode.M1.metadata.Transcript_supporting_feature: piece of evidence used in the annotation of the transcript (usually peptides, mRNAs, ESTs)
gencode.M1.metadata.TrEMBL: UniProtKB/TrEMBL entry associated to the transcript
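
The eight-field layout given above for the PolyA_feature metadata file could be read along the following lines. This is only a sketch under stated assumptions: the file is assumed to be tab-delimited with one feature per line, the transcript-relative coordinates are assumed to be integers, and the four genomic coordinate fields are kept as raw strings because their order is not spelled out here. The names PolyAFeature and parse_polya_line are hypothetical, and the sample line is invented for illustration.

```python
from typing import NamedTuple, Tuple

# Feature types named in the PolyA_feature description above.
POLYA_TYPES = {"polyA_signal", "polyA_site", "pseudo_polyA"}


class PolyAFeature(NamedTuple):
    transcript_id: str         # field 1
    transcript_start: int      # field 2 (assumed to be an integer)
    transcript_end: int        # field 3 (assumed to be an integer)
    genomic: Tuple[str, ...]   # fields 4-7, kept as raw strings
    feature_type: str          # field 8


def parse_polya_line(line, sep="\t"):
    """Split one metadata line into its eight fields (tab-delimited by assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, found {len(fields)}")
    if fields[7] not in POLYA_TYPES:
        raise ValueError(f"unexpected polyA feature type: {fields[7]!r}")
    return PolyAFeature(
        transcript_id=fields[0],
        transcript_start=int(fields[1]),
        transcript_end=int(fields[2]),
        genomic=tuple(fields[3:7]),
        feature_type=fields[7],
    )


# Invented sample line, for illustration only (not taken from a release file).
SAMPLE_LINE = "ENSMUST00000070533.4\t3600\t3605\tchr1\t3214500\t3214505\t-\tpolyA_signal\n"

feature = parse_polya_line(SAMPLE_LINE)
print(feature.transcript_id, feature.feature_type)
```

Rejecting lines with an unexpected field count or feature type makes a mismatch between these assumptions and the real file visible immediately, rather than producing silently misaligned columns.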