################################################################################################
This directory contains data files produced by the 

           GENCODE project

which is headed by Paul Flicek at the EMBL-EBI, UK.

For questions, please contact gencode-help@ebi.ac.uk
and check the website http://www.gencodegenes.org

################################################################################################


##################################
GENCODE Release Files Description
##################################

(X is the release version, e.g. 21 for human, M4 for mouse):

#################
#Annotation files
#################

1. gencode.vX.annotation.{gtf,gff3}.gz:
  Main file, gene annotation on reference chromosomes in GTF and GFF3 file formats.
  These are the main GENCODE gene annotation files. They contain annotation (genes, 
  transcripts, exons, start_codon, stop_codon, UTRs, CDS) on the reference chromosomes,
  which are chr1-22, X, Y, M in human and chr1-19, X, Y, M in mouse.

2. gencode.vX.chr_patch_hapl_scaff.annotation.{gtf,gff3}.gz:
  Gene annotation on reference-chromosomes/patches/scaffolds/haplotypes in GTF and GFF3 file formats.

3. gencode.vX.primary_assembly.annotation.{gtf,gff3}.gz:
  Gene annotation on reference chromosomes and scaffolds in GTF and GFF3 file formats.

4. gencode.vX.basic.annotation.{gtf,gff3}.gz:
  Basic gene annotation on reference chromosomes in GTF and GFF3 file formats.
  This is a subset of the corresponding comprehensive annotation including only 
  those transcripts tagged as 'basic' in every gene.

5. gencode.vX.chr_patch_hapl_scaff.basic.annotation.{gtf,gff3}.gz:
  Basic gene annotation on reference-chromosomes/patches/scaffolds/haplotypes in 
  GTF and GFF3 file formats.

6. gencode.vX.primary_assembly.basic.annotation.{gtf,gff3}.gz:
  Basic gene annotation on reference chromosomes and scaffolds in
  GTF and GFF3 file formats.

7. gencode.vX.long_noncoding_RNAs.{gtf,gff3}.gz: 
  Long non-coding RNAs on reference chromosomes in GTF and GFF3 file formats.
  These files are a sub-set of the main annotation files on the reference chromosomes. 
  They contain only the lncRNA genes, which are those with any of these biotypes: 
  "processed_transcript", "lincRNA", "3prime_overlapping_ncrna", "antisense", "non_coding", 
  "sense_intronic", "sense_overlapping", "TEC", "known_ncrna", "bidirectional_promoter_lncrna", 
  "macro_lncRNA", "lncRNA".

8. gencode.vX.polyAs.{gtf,gff3}.gz:
  PolyA features annotated by Havana on reference chromosomes in GTF and GFF3 file formats.
  These files contain polyA signals, polyA sites and pseudo polyAs manually 
  annotated by HAVANA. They include only the reference chromosomes.
  The value of the 'gene_id', 'transcript_id', 'gene_name' and 'transcript_name' fields
  corresponds to a random identifier. The polyA features are not directly associated with 
  any gene or transcript when annotated by Havana. 

9. gencode.vX.2wayconspseudos.{gtf,gff3}.gz:
  (Retrotransposed) pseudogenes predicted by the Yale & UCSC pipelines, but not by Havana on 
  reference chromosomes in GTF and GFF3 file formats.

10. gencode.vX.tRNAs.{gtf,gff3}.gz:
  tRNA structures predicted by tRNA-Scan on reference chromosomes in GTF and GFF3 file formats.
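
The names of annotation files 1 to 6 above follow a regular pattern. As a minimal
sketch (the function name and its parameters are hypothetical, not part of any
GENCODE tool), the pattern could be assembled like this in Python:

```python
def annotation_filename(release, regions="reference", basic=False, fmt="gtf"):
    """Build an annotation file name following the patterns of files 1 to 6.

    release: the X in the file names, e.g. 21 for human or "M4" for mouse.
    regions: hypothetical keyword selecting which sequence regions are covered.
    """
    region_part = {
        "reference": "",                          # files 1 and 4
        "all": "chr_patch_hapl_scaff.",           # files 2 and 5
        "primary_assembly": "primary_assembly.",  # files 3 and 6
    }[regions]
    if fmt not in ("gtf", "gff3"):
        raise ValueError("fmt must be 'gtf' or 'gff3'")
    basic_part = "basic." if basic else ""
    return f"gencode.v{release}.{region_part}{basic_part}annotation.{fmt}.gz"


print(annotation_filename(21))
print(annotation_filename("M4", regions="primary_assembly", basic=True, fmt="gff3"))
```

The sketch only builds names; whether a given file exists depends on the release
(see the release notes at the end of this file).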


###############
#Sequence files
###############

11. gencode.vX.transcripts.fa.gz:
  All transcript sequences on reference chromosomes in Fasta format.
  The sequence headers contain the following fields:
    transcript-id|
    gene-id|
    Havana-gene-id (if the gene contains manually annotated transcripts, '-' otherwise)|
    Havana-transcript-id (if this transcript was manually annotated, '-' otherwise)|
    transcript-name|
    gene-name|
    sequence-length|
    transcript biotype

12. gencode.vX.pc_transcripts.fa.gz:
  Protein-coding transcript sequences on reference chromosomes in Fasta format.
  The sequence headers contain the following fields:
    transcript-id|
    gene-id|
    Havana-gene-id (if the gene contains manually annotated transcripts, '-' otherwise)|
    Havana-transcript-id (if this transcript was manually annotated, '-' otherwise)|
    transcript-name|
    gene-name|
    sequence-length|
    5'-UTR (3'-UTR if reverse strand) location in the transcript|
    CDS location in the transcript|
    3'-UTR (5'-UTR if reverse strand) location in the transcript

13. gencode.vX.pc_translations.fa.gz:
  Translations of protein-coding transcripts on reference chromosomes in Fasta format.
  The sequence headers contain the following fields:
    transcript-id|
    gene-id|
    Havana-gene-id (if the gene contains manually annotated transcripts, '-' otherwise)|
    Havana-transcript-id (if this transcript was manually annotated, '-' otherwise)|
    transcript-name|
    gene-name|
    sequence-length

14. gencode.vX.lncRNA_transcripts.fa.gz:
  Long non-coding RNA transcript sequences on reference chromosomes in Fasta format.
  The sequence headers contain the following fields:
    transcript-id|
    gene-id|
    Havana-gene-id (if the gene contains manually annotated transcripts, '-' otherwise)|
    Havana-transcript-id (if this transcript was manually annotated, '-' otherwise)|
    transcript-name|
    gene-name|
    sequence-length
  
  Long non-coding RNA transcripts are those with any of these biotypes: 
  "processed_transcript", "lincRNA", "3prime_overlapping_ncrna", "antisense", "non_coding", 
  "sense_intronic", "sense_overlapping", "TEC", "known_ncrna", "bidirectional_promoter_lncrna", 
  "macro_lncRNA", "lncRNA".

15. A.genome.fa.gz (where A is the current assembly name, e.g. GRCh38 for human, GRCm38 for mouse):
  Genome sequence fasta file (sequence region names are the same as in the GTF/GFF3 files).
  It includes reference chromosomes, scaffolds, assembly patches and haplotypes. 

16. A.primary_assembly.genome.fa.gz (where A is the current assembly name, e.g. GRCh38 for human, GRCm38 for mouse):
  Primary assembly genome sequence fasta file (sequence region names are the same as in the GTF/GFF3 files).
  It includes reference chromosomes and scaffolds only.
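
The Fasta headers of files 11 to 14 share their first seven '|'-separated fields.
The following Python sketch splits a header into those fields. It assumes the header
line starts with '>' and tolerates a trailing '|'; the class and function names are
hypothetical, and the ids in the example are made up for illustration:

```python
from typing import List, NamedTuple


class FastaHeader(NamedTuple):
    transcript_id: str
    gene_id: str
    havana_gene_id: str        # '-' if the gene has no manually annotated transcripts
    havana_transcript_id: str  # '-' if the transcript was not manually annotated
    transcript_name: str
    gene_name: str
    sequence_length: int
    extra: List[str]           # file-specific fields (biotype, UTR/CDS locations)


def parse_header(line):
    """Split a '|'-separated sequence header into its documented fields."""
    fields = line.strip().lstrip(">").split("|")
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field left by a trailing '|'
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}")
    return FastaHeader(*fields[:6], int(fields[6]), fields[7:])


# Made-up header for illustration only.
header = parse_header(">ENST00000000001.1|ENSG00000000001.1|-|-|GENE1-201|GENE1|1200|lncRNA|")
print(header.gene_name, header.sequence_length, header.extra)
```

Fields after the seventh are kept as plain strings in 'extra', since their number
and meaning differ between the four files.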


###############
#Metadata files
###############

17. gencode.vX.metadata.Annotation_remark.gz:
  Remarks made during the manual annotation of the transcript.
  1 - transcript id
  2 - annotation remark

18. gencode.vX.metadata.EntrezGene.gz:
  Entrez Gene id associated to the transcript.
  1 - transcript id
  2 - Entrez Gene id

19. gencode.vX.metadata.Exon_supporting_feature.gz:
  Piece of evidence used in the annotation of an exon (usually peptides, mRNAs, ESTs).
  1 - transcript id
  2 - external id of the feature supporting the exon annotation
  3 - external source of the supporting feature
  4 - exon id
  5 - exon coordinates

20. gencode.vX.metadata.Gene_source.gz:
  Source of the gene annotation (Ensembl, Havana, Ensembl-Havana merged model or 
  imported in the case of small RNA and mitochondrial genes).
  1 - gene id
  2 - gene source

21. gencode.vX.metadata.HGNC.gz:
  HGNC approved gene symbol.
  1 - transcript id
  2 - HGNC gene symbol
  3 - HGNC unique id

22. gencode.vX.metadata.PDB.gz:
  PDB entry associated to the transcript.
  1 - transcript id
  2 - PDB id

23. gencode.vX.metadata.PolyA_feature.gz:
  Manually annotated polyA feature overlapping the transcript 3'-end.
  1 - transcript id
  2 - transcript-based start coordinate of the polyA feature
  3 - transcript-based end coordinate of the polyA feature
  4 - polyA feature chromosome
  5 - polyA feature start coordinate
  6 - polyA feature end coordinate
  7 - polyA feature strand
  8 - polyA feature type ("polyA_site", "polyA_signal", "pseudo_polyA")

24. gencode.vX.metadata.Pubmed_id.gz:
  PubMed id of a publication associated to the transcript.
  1 - transcript id
  2 - PubMed id

25. gencode.vX.metadata.RefSeq.gz:
  RefSeq RNA and/or protein associated to the transcript.
  1 - transcript id
  2 - RefSeq RNA id
  3 - RefSeq protein id (optional)

26. gencode.vX.metadata.Selenocysteine.gz:
  Amino acid position of a selenocysteine residue in the transcript.
  1 - transcript id
  2 - selenocysteine position

27. gencode.vX.metadata.SwissProt.gz:
  UniProtKB/SwissProt entry associated to the transcript.
  1 - transcript id
  2 - UniProtKB/SwissProt accession number
  3 - UniProtKB/SwissProt accession number

28. gencode.vX.metadata.Transcript_source.gz:
  Source of the transcript annotation (Ensembl, Havana, etc).
  1 - transcript id
  2 - transcript source

29. gencode.vX.metadata.Transcript_supporting_feature.gz:
  Piece of evidence used in the annotation of the transcript.
  1 - transcript id
  2 - external id of the feature supporting the transcript annotation
  3 - external source of the supporting feature
  
30. gencode.vX.metadata.TrEMBL.gz:
  UniProtKB/TrEMBL entry associated to the transcript.
  1 - transcript id
  2 - UniProtKB/TrEMBL accession number
  3 - UniProtKB/TrEMBL accession number
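
The metadata files can be read line by line with the column layouts listed above.
This Python sketch assumes the columns are tab-separated, which is not stated in
this document; the function names are hypothetical:

```python
import gzip


def read_metadata(path):
    """Yield one tuple of column values per non-empty line of a metadata file.

    Assumption: columns are separated by tab characters.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield tuple(line.split("\t"))


def index_by_first_column(rows):
    """Group the remaining columns by the id in column 1 (transcript or gene id)."""
    index = {}
    for row in rows:
        index.setdefault(row[0], []).append(row[1:])
    return index
```

For example, index_by_first_column(read_metadata("gencode.vX.metadata.RefSeq.gz"))
would map each transcript id to a list of (RefSeq RNA id, RefSeq protein id) tuples,
where the protein id may be absent. A transcript can appear on more than one line,
which is why the values are lists.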




######################################
General format of the annotation files
######################################

We supply genome-wide features at three different confidence levels.
Level 1 + 2 should be used for high-quality local analysis.
1 + 2 + 3 should be used for genome-wide analysis.

* Level 1: validated 

Pseudogene loci that were predicted by the analysis pipelines from YALE and UCSC 
as well as by HAVANA manual annotation from WTSI. 
Other transcripts that were verified experimentally by RT-PCR and sequencing
through the GENCODE experimental pipeline.

* Level 2: manual annotation 

HAVANA manual annotation from WTSI (and Ensembl annotation where it is identical to Havana).
The following regions are considered "fully annotated" although they will still be updated:
chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, 
ENCODE pilot regions.

* Level 3: automated annotation 

ENSEMBL loci where they are different from the HAVANA annotation or where no annotation 
can be found. 
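
The recommendation above (levels 1 + 2 for local analysis, 1 + 2 + 3 for genome-wide
analysis) can be applied when filtering annotation lines. The Python sketch below
assumes the level appears in the attributes column as 'level 2;' in GTF or 'level=2'
in GFF3; all names in it are hypothetical:

```python
import re

# Levels recommended above for each kind of analysis.
LEVELS_BY_ANALYSIS = {
    "local": {1, 2},
    "genome_wide": {1, 2, 3},
}

# Matches 'level 2' (GTF) or 'level=2' (GFF3). The word boundary keeps
# attributes such as transcript_support_level from matching.
_LEVEL_RE = re.compile(r"\blevel[ =]([123])\b")


def level_of(attributes):
    """Return the confidence level found in an attributes column, or None."""
    match = _LEVEL_RE.search(attributes)
    return int(match.group(1)) if match else None


def keep_feature(attributes, analysis="local"):
    """True if the feature's level is recommended for the given analysis."""
    return level_of(attributes) in LEVELS_BY_ANALYSIS[analysis]


print(keep_feature('gene_id "G1"; level 3;', "local"))
print(keep_feature('gene_id "G1"; level 3;', "genome_wide"))
```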

This data is supplied in GTF and GFF3 format as defined here:
 http://www.gencodegenes.org/data_format.html
with the following tags added to the attributes column where appropriate:

* level [1,2,3]: validation status as described.
* tag "3_nested_supported_extension": 3' end extended based on RNA-seq data.
* tag "3_standard_supported_extension": 3' end extended based on RNA-seq data.
* tag "454_RNA_Seq_supported": annotated based on RNA-seq data.
* tag "5_nested_supported_extension": 5' end extended based on RNA-seq data.
* tag "5_standard_supported_extension": 5' end extended based on RNA-seq data.
* tag "alternative_3_UTR": shares an identical CDS but has alternative 3' UTR with respect to a 
  reference variant.
* tag "alternative_5_UTR": shares an identical CDS but has alternative 5' UTR with respect to a 
  reference variant.
* tag "appris_principal": transcript expected to code for the main functional isoform based on a 
  range of protein features (APPRIS pipeline, Nucleic Acids Res. 2013 Jan;41(Database issue):D110-7). 
  (this tag is not found after Gencode 21)
* tag "appris_candidate": where there is no single 'appris_principal' variant the main functional 
  isoform will be translated from one of the 'appris_candidate' genes.  (this tag is not found after 
  Gencode 21)
* tag "appris_candidate_ccds": the "appris_candidate" transcript that has a unique CCDS.  
  (this tag is not found after Gencode 21)
* tag "appris_candidate_longest_ccds": the "appris_candidate" transcripts where there are several 
  CCDS, in this case APPRIS labels the longest CCDS.  (this tag is not found after Gencode 21)
* tag "appris_candidate_longest_seq": where there is no "appris_candidate_ccds" or 
  "appris_candidate_longest_ccds" variant, the longest protein of the "appris_candidate" variants is 
  selected as the primary variant.  (this tag is not found after Gencode 21)
* tag "appris_candidate_highest_score": where there is no 'appris_principal' variant, the candidate 
  with highest APPRIS score is selected as the primary variant. (this tag is not found after Gencode 
  20)
* tag "appris_candidate_longest": where there is no 'appris_principal' variant, the longest of the 
  'appris_candidate' variants is selected as the primary variant. (this tag is not found after 
  Gencode 20)
* tag "appris_principal_1": (This flag corresponds to the older flag "appris_principal") The 
  transcript is expected to code for the main functional isoform based solely on the core modules in 
  the APPRIS database. The APPRIS core modules map protein structural and functional information and 
  cross-species conservation to the annotated variants.
* tag "appris_principal_2": (This flag corresponds to the older flag "appris_candidate_ccds") Where 
  the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human 
  protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be 
  the principal variant.
  If one (but no more than one) of these candidates has a distinct CCDS identifier it is selected as 
  the principal variant for that gene. A CCDS identifier shows that there is consensus between RefSeq 
  and GENCODE/Ensembl for that variant, guaranteeing that the variant has cDNA support.
* tag "appris_principal_3": Where the APPRIS core modules are unable to choose a clear principal 
  variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the 
  variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, 
  the earlier it was annotated.
  Consensus CDS annotated earlier are likely to have more cDNA evidence. Consecutive CCDS identifiers 
  are not included in this flag, since they will have been annotated in the same release of CCDS. 
  These are distinguished with the next flag. 
* tag "appris_principal_4": (This flag corresponds to the Ensembl 78 flag 
  "appris_candidate_longest_ccds") Where the APPRIS core modules are unable to choose a clear 
  principal CDS and there is more than one variant with distinct (but consecutive) CCDS 
  identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
* tag "appris_principal_5": (This flag corresponds to the Ensembl 78 flag 
  "appris_candidate_longest_seq") Where the APPRIS core modules are unable to choose a clear 
  principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the 
  longest of the candidate isoforms as the principal variant.
* tag "appris_alternative_1": Candidate transcript(s) models that are conserved in at least three 
  tested non-primate species.
* tag "appris_alternative_2": Candidate transcript(s) models that appear to be conserved in fewer 
  than three tested non-primate species.
* tag "artifactual_duplication": annotated on an artifactual duplicate region of the genome assembly. 
* tag "basic": identifies a subset of representative transcripts for each gene; prioritises 
  full-length protein coding transcripts over partial or non-protein coding transcripts within the 
  same gene, and intends to highlight those transcripts that will be useful to the majority of users. 
* tag "bicistronic": transcript contains two confidently annotated CDSs. Support may come from e.g. 
  proteomic data, cross-species conservation or published experimental work.
* tag "CAGE_supported_TSS": transcript 5' end overlaps ENCODE or Fantom CAGE cluster.
* tag "CCDS": member of the consensus CDS gene set, confirming coding regions between ENSEMBL, 
  UCSC, NCBI and HAVANA.
* tag "cds_end_NF": the coding region end could not be confirmed.
* tag "cds_start_NF": the coding region start could not be confirmed.
* tag "dotter_confirmed": transcript QC checked using dotplot to identify features, e.g. splice 
  junctions, end of homology.
* tag "downstream_ATG": downstream ATG assessed as less likely to initiate the translation of the 
  functional protein due to e.g. experimental evidence, poor cross-species conservation, weak Kozak 
  context, interference with signal peptides or targeting signals.
* tag "Ensembl_canonical": most representative transcript of the gene. This will be the MANE_Select 
  transcript if there is one, or a transcript chosen by an Ensembl algorithm otherwise.
* tag "exp_conf": transcript was tested and confirmed experimentally.
* tag "fragmented_locus": locus consists of non-overlapping transcript fragments either because of 
  genome assembly issues (i.e., gaps or mis-assemblies), or because supporting transcripts (e.g., 
  from another species) cannot be completely mapped, or because the supporting transcripts are 
  non-overlapping end pairs (i.e., 5' and 3' ESTs from a single cDNA).
* tag "inferred_exon_combination": transcript model contains all possible in-frame exons supported 
  by homology, experimental evidence or conservation, but the exon combination is not directly 
  supported by a single piece of evidence and may not be biological. Used for large genes with 
  repetitive exons (e.g. titin (TTN)) to represent all the exons individual transcript variants can 
  pool from.
* tag "inferred_transcript_model": transcript model is not supported by a single piece of 
  transcript evidence. May be supported by multiple fragments of transcript evidence or by combining 
  different evidence sources e.g. protein homology, RNA-seq data, published experimental data.
* tag "low_sequence_quality": transcript supported by transcript evidence that, while mapping 
  best-in-genome, shows regions of poor sequence quality.
* tag "MANE_Select": the transcript belongs to the MANE Select data set. The Matched Annotation 
  from NCBI and EMBL-EBI project (MANE) is a collaboration between Ensembl-GENCODE and RefSeq 
  to select a default transcript per human protein coding locus that is representative of biology, 
  well-supported, expressed and conserved. This transcript set matches GRCh38 and is 100% identical
  between RefSeq and Ensembl-GENCODE for 5' UTR, CDS, splicing and 3' UTR.
* tag "MANE_Plus_Clinical": the transcript belongs to the MANE Plus Clinical data set. Within the 
  MANE project, these are additional transcripts per locus necessary to support clinical variant 
  reporting, for example transcripts containing known pathogenic or likely pathogenic clinical 
  variants not reportable using the MANE Select data set. This transcript set matches GRCh38 
  and is 100% identical between RefSeq and Ensembl-GENCODE for 5' UTR, CDS, splicing and 3' UTR.
* tag "mRNA_end_NF": the mRNA end could not be confirmed.
* tag "mRNA_start_NF": the mRNA start could not be confirmed.
* tag "NAGNAG_splice_site": in-frame type of variation where, at the acceptor site, some variants 
  splice after the first AG and others after the second AG.
* tag "ncRNA_host": the locus is a host for small non-coding RNAs. 
* tag "nested_454_RNA_Seq_supported": annotated based on RNA-seq data.
* tag "NMD_exception": the transcript looks like it is subject to NMD but publications, experiments 
  or conservation support the translation of the CDS.
* tag "NMD_likely_if_extended": the transcript would likely be subject to NMD if it were longer but 
  cannot currently be annotated as NMD as it does not fulfil all criteria - most commonly lack of an 
  intron downstream of the stop codon.
* tag "non_ATG_start": the CDS has a non-ATG start and its validity is supported by publication or 
  conservation.
* tag "non_canonical_conserved": the transcript has a non-canonical splice site conserved in other 
  species.
* tag "non_canonical_genome_sequence_error": the transcript has a non-canonical splice site 
  explained by a genomic sequencing error.
* tag "non_canonical_other": the transcript has a non-canonical splice site explained by other 
  reasons.
* tag "non_canonical_polymorphism": the transcript has a non-canonical splice site explained by a 
  SNP.
* tag "non_canonical_TEC": the transcript has a non-canonical splice site that needs experimental 
  confirmation.
* tag "non_canonical_U12": the transcript has a non-canonical splice site explained by a U12 intron 
  (i.e. AT-AC splice site).
* tag "non_submitted_evidence": a splice variant for which supporting evidence has not been 
  submitted to databases, i.e. the model is based on literature or collaborator evidence.
* tag "not_best_in_genome_evidence": a transcript is supported by evidence from same species 
  paralogous loci.
* tag "not_organism_supported": evidence from other species was used to build model.
* tag "orphan": protein-coding locus with no paralogues or orthologs.
* tag "overlapping_locus": exon(s) of the locus overlap exon(s) of a readthrough transcript or a 
  transcript belonging to another locus.
* tag "overlapping_uORF": a low confidence upstream ATG existing in another coding variant would 
  lead to NMD in this transcript, which uses the high confidence canonical downstream ATG.
* tag "PAR": annotation in the pseudo-autosomal region, which is duplicated between X & Y.
* tag "pseudo_consens": member of the pseudogene set predicted by YALE, UCSC and HAVANA.
* tag "readthrough_gene": protein-coding gene that has a readthrough transcript.
* tag "readthrough_transcript": a transcript that overlaps two or more independent loci but is 
  considered to belong to a 3rd, separate locus.
* tag "reference_genome_error": locus overlaps a sequence error or an assembly error in the 
  reference genome that affects its annotation (e.g., 1 or 2bp insertion/deletion, substitution 
  causing premature stop codon). The main effect is that affected transcripts that would have had a 
  CDS are currently annotated without one.
* tag "retained_intron_CDS": internal intron of CDS portion of transcript is retained.
* tag "retained_intron_final": final intron of CDS portion of transcript is retained.
* tag "retained_intron_first": first intron of CDS portion of transcript is retained.
* tag "retrogene": protein-coding locus created via retrotransposition.
* tag "RNA_Seq_supported_only": transcript supported by RNAseq data and not supported by mRNA or 
  EST evidence.
* tag "RNA_Seq_supported_partial": transcript annotated based on mixture of RNA-seq data and 
  EST/mRNA/protein evidence.
* tag "RP_supported_TIS": transcript that contains a CDS that has a translation initiation site 
  supported by Ribosomal Profiling data.
* tag "seleno": contains a selenocysteine.
* tag "semi_processed": a processed pseudogene with one or more introns still present. These are 
  likely formed through the retrotransposition of a retained intron transcript.
* tag "sequence_error": transcript contains ≥ 1 non-canonical splice junction that is associated 
  with a known or novel genome sequence error.
* tag "stop_codon_readthrough": transcript whose coding sequence contains an internal stop codon 
  that does not cause the translation termination.
* tag "TAGENE": a transcript created or extended using assembled RNA-seq long reads.
* tag "upstream_ATG": an upstream ATG exists when a downstream ATG is better supported.
* tag "upstream_uORF": a low confidence upstream ATG existing in another coding variant would lead 
  to NMD in this transcript, which uses the high confidence canonical downstream ATG.
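
A feature can carry several of the tags listed above, so a parser has to collect
repeated 'tag' keys rather than overwrite them. The Python sketch below assumes the
usual GTF attribute syntax (key "value"; pairs, with unquoted numbers such as
level 2;), which is defined at the data format URL above and not in this file. The
function names and the example ids are hypothetical:

```python
import re
from collections import defaultdict

# One 'key "value";' or 'key value;' pair of a GTF attributes column.
_ATTR_RE = re.compile(r'(\w+) (?:"([^"]*)"|([^;\s]+))\s*;')


def parse_gtf_attributes(column):
    """Return a dict mapping each attribute key to the list of its values."""
    attrs = defaultdict(list)
    for key, quoted, bare in _ATTR_RE.findall(column):
        attrs[key].append(quoted or bare)
    return dict(attrs)


def has_tag(column, tag):
    """True if the attributes column carries the given tag."""
    return tag in parse_gtf_attributes(column).get("tag", [])


# Made-up attributes column for illustration only.
example = 'gene_id "ENSG00000000001.1"; gene_type "protein_coding"; level 2; tag "basic"; tag "CCDS";'
print(parse_gtf_attributes(example)["tag"])
print(has_tag(example, "MANE_Select"))
```

Every value is returned as a list of strings, including single-valued keys such as
gene_id and level, so that callers treat all attributes uniformly.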


Please note: if start codons are split between two exons, two start-codon features will be listed.
Please note: before release 4, "cds_start_NF" was listed as "cds start not found", etc.
Please note: before release 6, "seleno" tags included the selenocysteine position as the amino acid 
number within the protein; these positions are now given as genomic coordinates in separate GTF 
features.
Please note: the stable ids with the ENSGR0000XXXXXX, ENSTR0000XXXXXX format (until release 24) or
the ENSG00000XXXXXX.X_PAR_Y, ENST00000XXXXXX.X_PAR_Y format (from release 25 onwards) are genes and 
transcripts in the pseudoautosomal regions (PAR regions) of human chromosome Y. These genes/transcripts 
are tagged with "PAR" and have a different stable_id than their counterpart in chromosome X to avoid 
redundancy. For example: ENST00000431238.7 in chrX and ENST00000431238.7_PAR_Y in chrY.
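
The two id formats described above can be recognised with simple string checks. In
this Python sketch the function names are hypothetical, and only the release 25+
format is mapped back to its chrX counterpart, since that mapping is the one shown
in the example above:

```python
import re

_PAR_SUFFIX = "_PAR_Y"
# Ids in the ENSGR.../ENSTR... format used until release 24.
_OLD_PAR_RE = re.compile(r"^ENS[GT]R\d+")


def is_par_y_id(stable_id):
    """True if the id denotes a gene or transcript in a chrY PAR region."""
    return stable_id.endswith(_PAR_SUFFIX) or bool(_OLD_PAR_RE.match(stable_id))


def chrx_counterpart(stable_id):
    """Return the chrX id for a release 25+ chrY PAR id, else the id unchanged."""
    if stable_id.endswith(_PAR_SUFFIX):
        return stable_id[: -len(_PAR_SUFFIX)]
    return stable_id


print(is_par_y_id("ENST00000431238.7_PAR_Y"))
print(chrx_counterpart("ENST00000431238.7_PAR_Y"))
```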




#################################################################################################
Release 43 (February 2023)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2022.
This release corresponds to Ensembl version 109.



#################################################################################################
Release 42 (October 2022)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to April 2022.
This release corresponds to Ensembl version 108.

The following biotype has been added:
-protein_coding_CDS_not_defined: Transcript that belongs to a protein_coding gene 
  and doesn't contain an ORF. Replaces the processed_transcript transcript biotype
  in protein_coding genes.

Annotation files in gtf and gff3 format having the basic set of transcripts in the 
primary assembly are now included in the release.



#################################################################################################
Release 41 (July 2022)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to January 2022.
This release corresponds to Ensembl version 107.

The following biotypes have been added:
-protein_coding_LoF: Not translated in the reference genome owing to a SNP/DIP 
  but in other individuals/haplotypes/strains the transcript is translated. 
  This biotype replaces the polymorphic_pseudogene transcript biotype.
-artifact: Annotated on artifactual regions of the genome assembly.

The following tags or attributes have been added:
-artifactual_duplication / artif_dupl
-readthrough_gene



#################################################################################################
Release 40 (April 2022)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2021.
This release corresponds to Ensembl version 106.



#################################################################################################
Release 39 (December 2021)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to May 2021.
This release corresponds to Ensembl version 105.

Clone-based gene names have been retired from the release files. From this release onwards, 
genes without a name in HGNC, EntrezGene, RFAM or miRBase will get their gene id as default name.   



#################################################################################################
Release 38 (May 2021)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to December 2020.
This release corresponds to Ensembl version 104.

New tag added: Ensembl_canonical.



#################################################################################################
Release 37 (February 2021)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2020.
This release corresponds to Ensembl version 103.

New tag added: MANE_Plus_Clinical.



#################################################################################################
Release 36 (November 2020)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to May 2020.
This release corresponds to Ensembl version 102.



#################################################################################################
Release 35 (August 2020)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to March 2020.
This release corresponds to Ensembl version 101.



#################################################################################################
Release 34 (April 2020)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to November 2019.
This release corresponds to Ensembl version 100.



#################################################################################################
Release 33 (January 2020)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2019.
This release corresponds to Ensembl version 99.

NOTE: the following genes had their biotype erroneously changed from protein_coding to polymorphic_pseudogene. The correct biotypes will be restored in our next release.

ENSG00000158887.18  MPZ
ENSG00000014641.20  MDH1
ENSG00000211456.13  SACM1L
ENSG00000109339.24  MAPK10
ENSG00000112715.23  VEGFA
ENSG00000123505.18  AMD1
ENSG00000082556.13  OPRK1
ENSG00000134575.13  ACP2
ENSG00000111716.14  LDHB
ENSG00000111424.12  VDR
ENSG00000184992.13  BRI3BP
ENSG00000171885.17  AQP4
ENSG00000125510.18  OPRL1



#################################################################################################
Release 32 (September 2019)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to May 2019.
This release corresponds to Ensembl version 98.

New tag added: stop_codon_readthrough.



#################################################################################################
Release 31 (July 2019)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to February 2019.
This release corresponds to Ensembl version 97.

The following biotypes have been replaced by the "lncRNA" biotype: 
-3prime_overlapping_ncRNA
-antisense
-bidirectional_promoter_lncRNA
-lincRNA
-macro_lncRNA
-non_coding
-processed_transcript
-sense_intronic
-sense_overlapping

The following tags or attributes have been added:
-TAGENE: Transcript created or extended using assembled RNA-seq long reads.
-HGNC_id: Unique stable id provided by the HGNC for each gene with an approved symbol.
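
When comparing annotation from before and after this release, the replaced biotypes
can be mapped onto "lncRNA". This is a minimal Python sketch using the list above;
the constant and function names are hypothetical:

```python
# Biotypes replaced by "lncRNA" in release 31, as listed above.
REPLACED_BY_LNCRNA = frozenset({
    "3prime_overlapping_ncRNA",
    "antisense",
    "bidirectional_promoter_lncRNA",
    "lincRNA",
    "macro_lncRNA",
    "non_coding",
    "processed_transcript",
    "sense_intronic",
    "sense_overlapping",
})


def normalize_biotype(biotype):
    """Map a pre-release-31 lncRNA biotype to "lncRNA"; leave others unchanged."""
    return "lncRNA" if biotype in REPLACED_BY_LNCRNA else biotype


print(normalize_biotype("lincRNA"))
print(normalize_biotype("protein_coding"))
```

The comparison is case-sensitive and uses the spellings of this release note only,
so biotype names spelled differently in older files would pass through unchanged.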



#################################################################################################
Release 30 (April 2019)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to November 2018.
This release corresponds to Ensembl version 96.



#################################################################################################
Release 29 (October 2018)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to May 2018.
This release corresponds to Ensembl version 94.



#################################################################################################
Release 28 (April 2018)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to November 2017.
This release corresponds to Ensembl version 92.



#################################################################################################
Release 27 (August 2017)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to January 2017.
This release corresponds to Ensembl version 90.

The antisense biotype is now called "antisense_RNA".



#################################################################################################
Release 26 (March 2017)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to October 2016.
This release corresponds to Ensembl version 88.

The "gene_status" and "transcript_status" attributes have been removed from the GENCODE GTF and 
GFF3 files.

New tags added:
-3_nested_supported_extension
-3_standard_supported_extension
-454_RNA_Seq_supported
-5_nested_supported_extension
-5_standard_supported_extension
-nested_454_RNA_Seq_supported



#################################################################################################
Release 25 (July 2016)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to March 2016.
This release corresponds to Ensembl version 85.

Genes and transcripts on the chrY PAR regions now have "_PAR_Y" appended to their ids. Until 
release 24 these ids had the "ENSGR00..." and "ENSTR00..." formats.
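
Scripts that join on gene or transcript ids may need to separate the stable id, the version number and the "_PAR_Y" suffix. The sketch below assumes the suffix follows the version number (stable id, dot, version, then "_PAR_Y"); the function name split_gencode_id and the example ids in the usage note are illustrative only.

```python
import re

# Assumed layout: <letters><digits>[.<version>][_PAR_Y]
_ID_PATTERN = re.compile(
    r"^(?P<stable>[A-Z]+\d+)(?:\.(?P<version>\d+))?(?P<par>_PAR_Y)?$"
)


def split_gencode_id(identifier):
    """Split an id into (stable_id, version or None, is_par_y)."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError("unrecognised id format: %r" % identifier)
    version = match.group("version")
    return (
        match.group("stable"),
        int(version) if version is not None else None,
        match.group("par") is not None,
    )
```

With this sketch, split_gencode_id("ENSG00000186092.4") returns ("ENSG00000186092", 4, False), and an id ending in "_PAR_Y" returns True as the third element.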

The "UTR" features in the GFF3 files have been replaced with "five_prime_UTR" and "three_prime_UTR".



#################################################################################################
Release 24 (December 2015)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2015.
This release corresponds to Ensembl version 83.



#################################################################################################
Release 23 (July 2015)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to March 2015.
This release corresponds to Ensembl version 81.

New files added in this release:

a)Basic annotation on the reference chromosomes and on all sequence regions:
-gencode.v23.basic.annotation.{gtf,gff3}.gz
-gencode.v23.chr_patch_hapl_scaff.basic.annotation.{gtf,gff3}.gz

b)Nucleotide sequences of all annotated transcripts on the reference chromosomes:
-gencode.v23.transcripts.fa.gz



#################################################################################################
Release 22 (March 2015)
#################################################################################################
This is a merge between a full new Ensembl gene build and updates from HAVANA up to October 2014.
This release corresponds to Ensembl version 79.

Seven new APPRIS tags have been added in this release:

-appris_principal_1: (This flag corresponds to the older flag "appris_principal") The transcript 
expected to code for the main functional isoform based solely on the core modules in the 
APPRIS database. The APPRIS core modules map protein structural and functional information and 
cross-species conservation to the annotated variants.

-appris_principal_2: (This flag corresponds to the older flag "appris_candidate_ccds") Where the 
APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human 
protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be 
the principal variant.
If one (but no more than one) of these candidates has a distinct CCDS identifier it is selected as 
the principal variant for that gene. A CCDS identifier shows that there is consensus between RefSeq 
and GENCODE/Ensembl for that variant, guaranteeing that the variant has cDNA support.

-appris_principal_3: Where the APPRIS core modules are unable to choose a clear principal variant 
and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant 
with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it 
was annotated.
Consensus CDS annotated earlier are likely to have more cDNA evidence. Consecutive CCDS identifiers 
are not included in this flag, since they will have been annotated in the same release of CCDS. 
These are distinguished with the next flag. 

-appris_principal_4: (This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_ccds") 
Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one 
variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform 
as the principal variant.

-appris_principal_5: (This flag corresponds to the Ensembl 78 flag "appris_candidate_longest_seq") 
Where the APPRIS core modules are unable to choose a clear principal variant and none of the 
candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as 
the principal variant.

-appris_alternative_1: Candidate transcript(s) models that are conserved in at least three tested 
non-primate species.

-appris_alternative_2: Candidate transcript(s) models that appear to be conserved in fewer than 
three tested non-primate species.
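
One possible use of these tags is to pick a single representative transcript per gene. The sketch below assumes that the tag numbering reflects decreasing confidence (appris_principal_1 first, appris_alternative_2 last); that ordering is an interpretation of the descriptions above, not a documented rule. APPRIS_RANK and pick_by_appris are hypothetical names.

```python
# Assumed priority: lower rank means a more confident choice.
APPRIS_RANK = {"appris_principal_%d" % i: i for i in range(1, 6)}
APPRIS_RANK["appris_alternative_1"] = 6
APPRIS_RANK["appris_alternative_2"] = 7


def pick_by_appris(transcript_tags):
    """Return the transcript id with the best APPRIS tag, or None.

    transcript_tags maps a transcript id to the list of its tags.
    When two transcripts share the best rank, the first one seen wins.
    """
    best_id, best_rank = None, None
    for transcript_id, tags in transcript_tags.items():
        ranks = [APPRIS_RANK[tag] for tag in tags if tag in APPRIS_RANK]
        if ranks and (best_rank is None or min(ranks) < best_rank):
            best_id, best_rank = transcript_id, min(ranks)
    return best_id
```

Tags that are not APPRIS tags (for example "basic" or "CCDS") are ignored, so the full tag list of each transcript can be passed in as is.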

In contrast, five APPRIS tags present in the previous release have been dropped:
appris_principal 
appris_candidate_ccds
appris_candidate_longest_ccds
appris_candidate_longest_seq
appris_candidate


Also, in this release the transcript attributes (transcript_id, transcript_type, 
transcript_status, transcript_name, havana_transcript, etc.) were removed from the gene lines in 
all GTF and GFF3 annotation files.


NOTE: The GFF3 files listed below were replaced on 7 May 2015 to fix non-unique IDs of some UTR 
features as well as to extend the CDS until the last nucleotide of the stop codon in stop codons 
split between two exons.
 - gencode.v22.chr_patch_hapl_scaff.annotation.gff3.gz
 - gencode.v22.annotation.gff3.gz

NOTE: GTF and GFF3 files containing gene annotation on the primary assembly only (main chromosomes 
and unplaced/unlocalized scaffolds) were added on 7 May 2015:
 - gencode.v22.primary_assembly.annotation.{gtf,gff3}.gz
These files are meant to be used mainly for NGS analyses in conjunction with the equivalent primary 
assembly genome sequence provided:
 - GRCh38.primary_assembly.genome.fa.gz



#################################################################################################
Release 21 (October 2014)
#################################################################################################
This is a merge between a full new Ensembl gene build and updates from HAVANA up to June 2014.
This release corresponds to Ensembl version 77.

The transcript support levels imported from UCSC have been introduced in the main annotation files. 
Transcripts are scored according to how well mRNA and EST alignments match over their full length:
- 1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA)
- 2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs)
- 3 (the only support is from a single EST)
- 4 (the best supporting EST is flagged as suspect)
- 5 (no single transcript supports the model structure)
- NA (the transcript was not analyzed) 
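
The levels above can be used to filter transcripts by evidence. This is a minimal sketch: the descriptions are copied from the list, while the helper name is_well_supported and its default threshold of 2 are arbitrary choices made for illustration. It also assumes the level has already been read from the annotation file as a string or integer.

```python
# Level descriptions as given in the list above.
TSL_MEANING = {
    "1": "all splice junctions of the transcript are supported by at least one non-suspect mRNA",
    "2": "the best supporting mRNA is flagged as suspect or the support is from multiple ESTs",
    "3": "the only support is from a single EST",
    "4": "the best supporting EST is flagged as suspect",
    "5": "no single transcript supports the model structure",
    "NA": "the transcript was not analyzed",
}


def is_well_supported(level, worst_allowed=2):
    """True if the level is numeric and no worse than worst_allowed."""
    level = str(level).strip()
    if level not in TSL_MEANING:
        raise ValueError("unknown transcript support level: %r" % level)
    # "NA" means the transcript was not analyzed, so it never passes.
    return level != "NA" and int(level) <= worst_allowed
```

Treating "NA" as not supported is a conservative choice; a pipeline that wants to keep unanalyzed transcripts would handle that case separately.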

Three new APPRIS tags have been added in this release:
-appris_candidate_ccds: the "appris_candidate" transcript that has a unique CCDS.
-appris_candidate_longest_ccds: the "appris_candidate" transcripts where there are several CCDS, in 
this case APPRIS labels the longest CCDS.
-appris_candidate_longest_seq: where there is no "appris_candidate_ccds" or 
"appris_candidate_longest_ccds" variant, the longest protein of the "appris_candidate" variants is 
selected as the primary variant.

In contrast, two APPRIS tags present in the previous release have been dropped:
-appris_candidate_highest_score
-appris_candidate_longest


NOTE: The files listed below were replaced on 12-11-2014 to fix the haplotype chromosome names, 
which should correspond to the GRC accessions as in previous GENCODE releases.
 - gencode.v21.chr_patch_hapl_scaff.annotation.{gtf,gff3}.gz
 - gencode.v21.metadata.PolyA_feature.gz
 - GRCh38.genome.fa.gz



#################################################################################################
Release 20 (August 2014)
#################################################################################################
This is a merge between a full new Ensembl gene build and updates from HAVANA up to April 2014.
This release corresponds to Ensembl version 76 and is in the new human assembly GRCh38.

A new annotation file format, GFF3, has been added to the GENCODE releases.
Note:
In GFF3 the start and stop codon are included in the CDS.
In GTF the start codon is included in the CDS but the stop codon is included in the UTR.  

1 extra appris tag is added in this release in the transcript lines:

* tag "appris_candidate_highest_score": where there is no 'appris_principal' variant, the candidate 
with highest APPRIS score is selected as the primary variant.

1 extra attribute is added in the annotation files in all protein coding transcripts: the protein_id

For example:

chr1	HAVANA	transcript	69091	70008	.	+	.	gene_id 
"ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status 
"KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; 
transcript_name "OR4F5-001"; level 2; protein_id "ENSP00000334393.3"; tag "basic"; tag 
"appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; 
havana_transcript "OTTHUMT00000003223.1";
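
The attribute column of such a line can be split into key/value pairs. The sketch below is one way to do it, not an official parser; parse_gtf_attributes is a hypothetical name. Values are collected in lists because a key such as "tag" can occur several times on one line, as in the example above.

```python
import re

# key, whitespace, then either a quoted value or a bare value, then ';'
_ATTRIBUTE = re.compile(r'\s*(\S+)\s+("([^"]*)"|[^;]+?)\s*;')


def parse_gtf_attributes(column):
    """Parse the 9th GTF column into a dict of key -> list of values."""
    attributes = {}
    for match in _ATTRIBUTE.finditer(column):
        key = match.group(1)
        quoted = match.group(3)
        # Bare values (e.g. "level 2") have no quotes to strip.
        value = quoted if quoted is not None else match.group(2)
        attributes.setdefault(key, []).append(value)
    return attributes
```

Applied to the example, the result has ["ENSP00000334393.3"] under "protein_id" and ["basic", "appris_principal", "CCDS"] under "tag". All values are returned as strings, including the level.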


1 extra metadata file is added:

gencode.v20.metadata.EntrezGene: EntrezGene gene id associated to the transcript

NOTE: The files listed below were updated on 27-08-2014 to correct the following minor issues:
- some CCDS tags had been wrongly assigned to transcripts with the same start and end coordinates 
as a CCDS model but with different amino acid sequence
- absence of semicolon after "havana_transcript" attributes in the GTF files.

gencode.v20.annotation.{gtf,gff3}.gz
gencode.v20.chr_patch_hapl_scaff.annotation.{gtf,gff3}.gz
gencode.v20.long_noncoding_RNAs.{gtf,gff3}.gz


#################################################################################################
Release 19 (December 2013)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to July 2013.
This release corresponds to Ensembl version 74.

3 extra tags are added in this release in the transcript lines:

* tag "appris_principal": transcript expected to code for the main functional isoform based on a 
range of protein features (APPRIS pipeline).
* tag "appris_candidate": where there is no single 'appris_principal' variant the main functional 
isoform will be translated from one of the 'appris_candidate' genes.
* tag "appris_candidate_longest": where there is no 'appris_principal' variant, the longest of the 
'appris_candidate' variants is selected as the primary variant.


NOTE: The tag "basic" was added to the gencode.v19.annotation.gff3 file on 2 Feb 2016.

NOTE: The protein_id attribute was added to the gencode.v19.annotation.gtf file on 1 March 2016.


#################################################################################################
Release 18 (September 2013)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to April 2013.
This release corresponds to Ensembl version 73.

NOTE: File GRCh37.p12.genome.fa.gz with the whole genome fasta sequence was replaced in the server 
on the 4th October 2013 because it was faulty.

There is an extra field in the exon lines in the GTFs: the exon_id, which is placed 
after the exon_number and before the level.


#################################################################################################
Release 17 (June 2013)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to February 2013.
This release corresponds to Ensembl version 72.

There is an extra file in this release:
GRCh37.p11.genome.fa.gz
which has the genome sequence in fasta format.
In the label there are 2 names of the sequence region:
the official GRCh37.p11 assembly names then tab and then the Ensembl name.
*The names of the official GRCh37.p11 assembly used in the GRCh37.p11.genome.fa.gz file 
are the same names used in the GTF files.
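
A FASTA label of this form can be split into its two names as sketched below. The function name split_fasta_label is hypothetical, and the sketch assumes the label line starts with ">" and carries exactly one tab between the assembly name and the Ensembl name, as described above.

```python
def split_fasta_label(header_line):
    """Return (assembly_name, ensembl_name) from a FASTA label line.

    ensembl_name is None when the label contains no tab.
    """
    label = header_line.rstrip("\r\n").lstrip(">")
    assembly_name, _, ensembl_name = label.partition("\t")
    return assembly_name, (ensembl_name or None)
```

The first element is the name to match against the sequence names in the GTF files.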

There are 2 extra transcript tags in the GTF:

RP_supported_TIS
inferred_exon_combination

Please see above in the tags description for more details.


#################################################################################################
Release 16 (April 2013)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to November 2012.
This release corresponds to Ensembl version 71.

(2 files were re-uploaded to the server on 28 May 2013 because their scaffold names had an extra 
 '.1' extension. These are:
1. gencode.v16.chr_patch_hapl_scaff.annotation.{gtf,gff3}.gz 
2. gencode.v16.patch_hapl_scaff.annotation.{gtf,gff3}.gz )

(3 files were re-uploaded to the server on 30 April 2013 because they named the mitochondrial 
chromosome 'chrMT' instead of 'chrM'; they now use 'chrM'. These are:
1. gencode.v16.annotation.gtf
2. gencode.v16.tRNAs.gtf
3. gencode.v16.chr_patch_hapl_scaff.annotation.gtf)

This release has 2 extra annotation GTFs:
1) a GTF that has not only the annotation in the main chromosomes but also the annotation in 
patches/scaffolds/haplotypes. It's called "gencode.v16.chr_patch_hapl_scaff.annotation.gtf".
2) a GTF that has only the annotation in patches/scaffolds/haplotypes. It's called 
"gencode.v16.patch_hapl_scaff.annotation.gtf".

The coordinates of the patches/scaffolds/haplotypes start from 1, i.e. they are not genomic 
coordinates.
The coordinates of the annotation (genes, transcripts, exons etc.) in patches/scaffolds/haplotypes 
are relative to the patches/scaffolds/haplotypes.
The names of the patches/scaffolds/haplotypes come from assembly GRCh37.p10 and are accession 
numbers, which are stable. For example, those for the patches can be found here:
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p10/PATCHES/alt_scaffolds/alt_scaffold_placement.txt
More info about GRCh37.p10 can be found here: 
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p10/


Files which contain metadata associated to transcripts and genes, usually displayed in the UCSC 
browser, are now part of the public release:
-gencode.v16.metadata.Annotation_remark: remarks made during the manual annotation of the transcript
-gencode.v16.metadata.Exon_supporting_feature: piece of evidence used in the annotation of an exon 
(usually peptides, mRNAs, ESTs)
-gencode.v16.metadata.Gene_source: source of the gene annotation (Ensembl, Havana, Ensembl-Havana 
merged model or imported in the case of small RNA and mitochondrial genes)
-gencode.v16.metadata.HGNC: HGNC approved gene symbol
-gencode.v16.metadata.PDB: PDB entry associated to the transcript
-gencode.v16.metadata.PolyA_feature: manually annotated polyA feature overlapping the transcript 
3'-end (polyA_signal, polyA_site, pseudo_polyA) - Fields are 1)transcript_id, 2-3)polyA feature 
coordinates relative to the transcript, 4-7)polyA feature genomic coordinates, 8)type of polyA 
feature 
-gencode.v16.metadata.Pubmed_id: Pubmed id of a publication associated to the transcript
-gencode.v16.metadata.RefSeq: RefSeq RNA and/or protein associated to the transcript
-gencode.v16.metadata.Selenocysteine: amino acid position of a selenocysteine residue in the 
transcript
-gencode.v16.metadata.SwissProt: UniProtKB/SwissProt entry associated to the transcript
-gencode.v16.metadata.Transcript_source: source of the transcript annotation (Ensembl, Havana, 
Ensembl-Havana merged model or imported in the case of small RNA and mitochondrial genes)
-gencode.v16.metadata.Transcript_supporting_feature: piece of evidence used in the annotation of the 
transcript (usually peptides, mRNAs, ESTs)
-gencode.v16.metadata.TrEMBL: UniProtKB/TrEMBL entry associated to the transcript

All the metadata files include annotation on the main chromosomes/patches/scaffolds/haplotypes.
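
As an illustration of reading one of these files, the sketch below parses a line of the PolyA_feature file using the eight fields described above. The file is assumed to be tab-separated, and fields 4-7 are assumed to be sequence name, start, end and strand, in that order; the description above only says "genomic coordinates", so check this against the actual file. PolyAFeature and parse_polya_line are hypothetical names.

```python
from collections import namedtuple

PolyAFeature = namedtuple(
    "PolyAFeature",
    "transcript_id tx_start tx_end chrom start end strand feature_type",
)


def parse_polya_line(line):
    """Parse one tab-separated line of a PolyA_feature metadata file."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 fields, found %d" % len(fields))
    return PolyAFeature(
        transcript_id=fields[0],
        tx_start=int(fields[1]),   # relative to the transcript
        tx_end=int(fields[2]),
        chrom=fields[3],           # assumed order for fields 4-7
        start=int(fields[4]),
        end=int(fields[5]),
        strand=fields[6],
        feature_type=fields[7],    # polyA_signal, polyA_site or pseudo_polyA
    )
```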

Also, there are 9 extra transcript tags in the GTF files:

NMD_likely_if_extended
RNA_Seq_supported_only
bicistronic
downstream_ATG
low_sequence_quality
retained_intron_CDS
retained_intron_first
retained_intron_final
sequence_error

Please see above in the tag specification for details about these 9 tags.


#################################################################################################
Release 15 (Jan 2013)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to August 2012.
This release corresponds to Ensembl version 70.

There is an extra attribute on the exon lines called exon_number, that indicates the order of the
exon inside the transcript, starting from the 5'end; i.e. the closest to the 5'end will have 
"exon_number 1", the second next to it will have "exon_number 2" etc.
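
The numbering rule can be sketched as follows for files that lack the attribute. This is an illustration of the rule as described above, not the code used to produce the release files; number_exons is a hypothetical name, and exons are given as (start, end) genomic coordinates.

```python
def number_exons(exons, strand):
    """Return (exon_number, start, end) tuples ordered from the 5' end.

    On the "+" strand the 5' end is the lowest coordinate; on the "-"
    strand it is the highest, so the sort order is reversed.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    ordered = sorted(exons, key=lambda exon: exon[0], reverse=(strand == "-"))
    return [(number, start, end) for number, (start, end) in enumerate(ordered, start=1)]
```

On the minus strand the exon with the highest genomic coordinates therefore receives exon_number 1.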

Also, there is an extra tag introduced in the GTF files, the tag 'readthrough_transcript'. This tag 
is found only on the transcript lines and indicates whether the transcript is read-through:
(a transcript that overlaps two or more independent loci but is considered to belong to a 3rd, 
separate locus)


#################################################################################################
Release 14 (November 2012)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to June 2012.
This release corresponds to Ensembl version 69.

There is an extra FASTA file called 'gencode.v14.lncRNA_transcripts.fa' with the long non-coding 
RNA transcripts in FASTA format.
Also, an extra tag is introduced in the GTF files, the tag 'basic'. This tag corresponds to the 
transcripts belonging to the GENCODE basic annotation set in the UCSC Genome Browser.


#################################################################################################
Release 13 (Aug 2012)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to March 2012.
This release corresponds to Ensembl version 68.

There is an extra GTF file called 'gencode13_corrected_statuses.gtf'.
The Ensembl 68 database had errors in the statuses, and the GTF was manually corrected after the 
release. The statuses were corrected using the Havana database, which includes only the manual 
annotations. Therefore some statuses are "NULL": these exist in the Havana db but not in Ensembl, 
where they become "KNOWN" or "NOVEL" according to the sources of evidence found when the merge 
happens and the Ensembl db is created.
The Ensembl 69 release will have the correct statuses.


#################################################################################################
Release 12 (May 2012)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to December 2011.
This release corresponds to Ensembl version 67.

As in the previous release the "ncrna_host" biotype was replaced with "processed_transcript" in 
21 genes and 80 transcripts in the Gencode GTF although it is still present in Ensembl 67.


#################################################################################################
Release 11 (February 2012)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to October 2011.
This release corresponds to Ensembl version 66.

The locus-level tag "ncRNA_host" has been introduced in this Gencode release. A total of 2041 genes
were automatically tagged as ncRNA_host if they encompassed a small ncRNA. This is expected to be 
substituted by Havana manual annotation in future releases. In addition the "ncrna_host" biotype 
has been replaced by "processed_transcript" in 19 genes and 80 transcripts. The "ncrna_host" biotype
has now been removed from the Havana annotation but this change will not be effective in Ensembl 
until version 68, so please be aware of this discrepancy between the Gencode GTF and the Ensembl 
database. 

The files pc_transcripts.fa and pc_translations.fa were updated with an extended range of transcript
biotypes (see above) on 2012-04-04.

The following GTF files were updated on 2012-06-01 to amend 19958 gene names and 27058 transcript 
names that had a wrongly appended version number (mostly ".1"). Please note that the original names 
in the Ensembl core database (e66) remain unchanged.
gencode.v11.annotation.level.gtf
gencode.v11.long_noncoding_RNAs.gtf
gencode.v11.annotation.level_1_2.gtf (gencode11_GRCh37.tgz)
gencode.v11.annotation.level_3.gtf (gencode11_GRCh37.tgz)


#################################################################################################
Release 10 (December 2011)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to July 2011.
This release corresponds to Ensembl version 65.

The files pc_transcripts.fa and pc_translations.fa were updated with an extended range of transcript
biotypes (see above) on 2012-04-04.


#################################################################################################
Release 9 (September 2011)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to May 2011.
Ensembl annotation: 8 genes and 27 transcripts were removed from the previous version.
Havana annotation: updated and first pass of chromosome 14 completed, including the ABO gene on a 
GRC patch. In the merge set, the GAGE cluster is imported from Havana only.
This release corresponds to Ensembl version 64.


#################################################################################################
Release 8 (June 2011)
#################################################################################################
This is a remerge between the Ensembl annotation and updates from HAVANA up to March 2011.
Poorly supported Ensembl models were removed and Ensembl annotation on the haplotypes, which were 
missing in Gencode 7, was also added. Annotation by Havana of chromosome 12 has been completed 
and that of chromosome 14 is nearly finished.
This release corresponds to Ensembl version 63.


#################################################################################################
Release 7 (March 2011)
#################################################################################################
This is a merge between a full new Ensembl gene build and updates from HAVANA up to December 2010.
The gene and transcript version numbers have been appended to their stable ids in the release files.
This release corresponds to Ensembl version 62.
This is the release used for the ENCODE integration analysis.


#################################################################################################
Release 6 (January 2011)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to November 2010.
This includes mainly data on chromosomes 8, 9. The level 1 pseudogenes, 2-way pseudogenes and 
tRNA scans are the same as in version 5.
This release corresponds to Ensembl version 61.


#################################################################################################
Release 5 (December 2010)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to April 2010.
This includes mainly data on chromosomes 6,7,8. The level 1 pseudogenes and the 2-way pseudogenes
were also updated.
This release corresponds to Ensembl version 60.


#################################################################################################
Release 4 (May 2010)
#################################################################################################

This is a remerge between the Ensembl annotation and updates from HAVANA up to February 2010.
This includes mainly data on chromosomes 4 & 5. The level 1 pseudogenes and the 2-way pseudogenes
were not updated from version 3c.
This release corresponds to Ensembl version 58 (with minor fixes).
It was submitted for the June 2010 ENCODE data freeze.

!!Please note that this file was updated with new transcript status fields on 02.07.2010, 17:00 
BST!!


#################################################################################################
Release 3d (February 2010)
#################################################################################################

This is an update of freeze 3(c), in which ENSEMBL removed a number of dubious transcripts.
Tags cds_start_NF, cds_end_NF, mRNA_start_NF, mRNA_end_NF, exp_conf, non_org_supp are introduced.
This release corresponds to Ensembl version 57.
Only the gencode.v3d.annotation.GRCh37.gtf file was updated.


#################################################################################################
Release 3c (October 2009 freeze)
#################################################################################################

This is an update of freeze 3(b), mainly for chromosomes 3 & 4 for which the latest annotation 
was held back and QC'ed again to be used in the RNASeq Genome Annotation Assessment Project.
Only gencode.v3b.annotation.GRCh37.gtf and gencode.v3b.annotation.NCBI36.gtf were updated.
This release corresponds to Ensembl version 56.
This is the release used for the ENCODE integration analysis.


#################################################################################################
Release 3 (July 2009 freeze)
#################################################################################################

New full merge between HAVANA and ENSEMBL.
Native on genome assembly GRCh37, features are projected back to NCBI36 where possible.

1. all loci on both assemblies:
   gencode.v3b.annotation.GRCh37.gtf (updated on 3.10.09)
   gencode.v3b.annotation.NCBI36.gtf (updated on 3.10.09)

2. polyA features:
   gencode.v3.polyAs.GRCh37.gtf
   gencode.v3.polyAs.NCBI36.gtf

3. tRNA predictions:
   gencode.v3.tRNAs.GRCh37.gtf
   gencode.v3.tRNAs.NCBI36.gtf

4. pseudogenes predicted by UCSC and Yale:
   gencode.v3.2wayconspseudos.GRCh37.gtf
   gencode.v3.2wayconspseudos.NCBI36.gtf


################################################################################################
Updated dump (23.06.2009)
#################################################################################################

1. Updated version of the January 2009 freeze file produced for use in the 1000 Genomes projects 
and others.
gencode_data.rel2b.{gtf,gff3}.gz

The field "CCDSID" with a valid CCDS id has been added to the 9th column.
The field "CCDSOL" has been added indicating that the transcript overlaps a CCDS transcript, but 
was not flagged as such directly.
A few genes have been removed and a few added. Some formatting issues have been resolved.

2. protein-coding transcripts, their sequences and translations
gencode_data.rel2b.pc_transcripts.fa.gz
gencode_data.rel2b.pc_translations.fa.gz


################################################################################################
Updated dumps (21.04.2009)
#################################################################################################

<directory release_2>

1. protein-coding transcripts, their sequences and translations
gencode_data.rel2.pc_transcripts.{gtf,gff3}.gz
gencode_data.rel2.pc_transcripts_cdnas.fa.gz
gencode_data.rel2.pc_transcripts_translations.fa.gz

2. re-dump (as January freeze) without external annotation 
gencode_data.rel2a.{gtf,gff3}.gz

################################################################################################
For the analysis data freeze of January 2009 there are the following files in this directory:
################################################################################################

<directory release_2>

1. gencode_data.rel2.{gtf,gff3}.gz
  Data file in GTF format, compressed with gzip, containing annotation on three levels.
  Same format as described below, with the addition of one line for every gene and transcript.
  In case you do not want these, do something like 
  awk '{if($3 !~ "gene|transcript"){print $0}}' gencode_data.rel2.gtf > gencode_data.rel2_mod.gtf 

2. gencode_tRNAscans.rel2.{gtf,gff3}.gz
   tRNAscan predictions from the ENSEMBL simpleFeature table (level 3).

3. gencode_polyAs.rel2.{gtf,gff3}.gz
   polyA signals from the loutre database (polyA_site, pseudo_polyA) (separate level).
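
For readers who prefer Python to awk, the filter shown under item 1 can be approximated as below. This is a sketch rather than an exact translation: it splits on tabs and drops only lines whose third field is exactly "gene" or "transcript", whereas the awk command splits on any whitespace and uses a regular expression match. The function name is hypothetical.

```python
def drop_gene_and_transcript_lines(lines):
    """Yield GTF lines except those describing a gene or a transcript."""
    for line in lines:
        fields = line.split("\t")
        # Header lines starting with "##" have no third field and are kept.
        if len(fields) > 2 and fields[2] in ("gene", "transcript"):
            continue
        yield line
```

It can be applied to an open file handle, writing each yielded line to a second file.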

################################################################################################
For the initial data freeze of October 1st 2008 there are the following files in this directory:
################################################################################################

<directory release_1>

1.gencode_data.rel1.v2.{gtf,gff3}.gz
  Data file in GTF format, compressed with gzip, containing annotation on three levels.
  Here is what the first 11 lines look like (containing first transcript):

  ##description: evidence-based annotation of the human genome (NCBI36)
  ##provider: GENCODE
  ##contact: fsk@sanger.ac.uk
  ##format: gtf 2.2
  ##date: 2008-10-02
  chr1	HAVANA	exon	1873	1920	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	2042	2090	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	2476	2560	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	2838	2915	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	3084	3237	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;
  chr1	HAVANA	exon	3316	3533	.	+	.	 gene_id "OTTHUMG00000000961"; transcript_id "OTTHUMT00000002844"; transcript_type "unprocessed_pseudogene"; transcript_status "UNKNOWN"; gene_type "unprocessed_pseudogene"; gene_status "NOVEL"; gene_name "RP11-34P13.1"; transcript_name "RP11-34P13.1-001"; level 2;

2.gencode_data.rel1.v2.regions.txt
  list of regions (genomic coordinates) where only HAVANA annotation (level 1 & 2) can be found.

3.gencode_data.rel1.v2.regions_with_ids.txt
  list of regions (genomic coordinates) where only HAVANA annotation (level 1 & 2) can be found, 
with all OTT-ids from the region listed.