The files in each release's GRCh37_mapping subdirectory contain GENCODE annotation originally 
built on GRCh38 that has been mapped back to the GRCh37 assembly.

They are mainly intended for users who are still working on the GRCh37/hg19 assembly and wish 
to use the most up-to-date GENCODE annotation instead of the last release (v19) built on GRCh37.


The GENCODE annotation has been mapped using gencode-backmap, developed by Mark Diekhans: 

https://github.com/diekhans/gencode-backmap


---------------
gencode-backmap
---------------

Package for mapping of GENCODE gene annotation files to older assemblies.

This provides tools to map the genome coordinates in GENCODE to previous releases of the reference 
genome. It projects annotations through genomic alignments using the TransMap algorithm to produce 
mappings. Reports are made on the status of all mappings.

This maps all GENCODE GTF and GFF3 files. It does not map the PolyA_feature file. It is recommended 
to use these only on primary chromosomes: the high similarity of alt-loci sequences to the primary 
assembly, coupled with the growth in the number of alt loci, makes correct back-mapping difficult.


---------
Algorithm
---------

The program takes the current (source) GENCODE GFF3 or GTF, cross-assembly genomic alignments, and 
the previous (target) GENCODE annotations.

Mapping is done on a per-gene basis using the following steps:

* Project transcripts of the gene through the alignments, keeping exons chained.
   o If there are multiple mappings, first look for ones that overlap the previous version 
     of the transcript, if it exists. Otherwise, if there is a previous version of the gene, select 
     mappings overlapping the gene. Otherwise, to filter out paralog mappings, pick the mapping with 
     the most similar span to the source.
   o Project features of the transcript, such as CDS and start codons, through the transcript alignment 
     between the genomes. This ensures that features stay in the same location within the transcript.
* Check all transcripts of the gene for consistency. Reject source gene mappings with transcripts on 
  different chromosomes or strands, or where the genomic length of the gene has changed more than 50%.
* If a version of the gene exists in the target and the mapped gene doesn't overlap the target gene, 
  it is also rejected.
   o If a gene did not map or was rejected and a version of the gene with the same biotype exists in 
     the target annotations, use the existing gene.
* Small, automatic-only or all automatic genes are optionally not mapped, with the target annotation 
  being passed through. This avoids complex mappings of small RNAs imported from other databases (e.g. miRNAs).
* Target genes with no corresponding mappings and that overlap patched regions or regions with GRC 
  incident reports in the target genome may optionally be passed through. This addresses a fair number 
  of problem cases. This was a common problem on GRCh37 chrX.
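
The per-gene consistency check described above can be sketched as follows. This is only an 
illustration, not the gencode-backmap code: the MappedTranscript class, the function name, the "ok" 
label and the use of the genomic span as the measure of gene length are all assumptions made for 
this example.

```python
from dataclasses import dataclass


@dataclass
class MappedTranscript:
    """Hypothetical record for one transcript after projection to the target."""
    chrom: str
    strand: str
    start: int  # 0-based, half-open coordinates (an assumption of this sketch)
    end: int


def gene_mapping_status(source_gene_length, mapped_transcripts, max_change=0.5):
    """Return a remap_status-like label for the mapped transcripts of one gene."""
    if not mapped_transcripts:
        return "deleted"
    # Reject genes whose transcripts landed on different chromosomes or strands.
    locations = {(t.chrom, t.strand) for t in mapped_transcripts}
    if len(locations) > 1:
        return "gene_conflict"
    # Reject genes whose genomic span changed by more than 50%.
    mapped_length = (max(t.end for t in mapped_transcripts)
                     - min(t.start for t in mapped_transcripts))
    change = abs(mapped_length - source_gene_length) / source_gene_length
    if change > max_change:
        return "gene_size_change"
    return "ok"  # hypothetical label: the gene passed both checks
```

For example, a gene with one transcript mapped to chr1 and another to chr2 would be reported as 
gene_conflict by this sketch.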

Pairing of source and target genes is somewhat complex due to instability of some gene identifiers 
between assemblies. If a matching base gene id (without the version) is not found, an attempt is made to match 
the genes using the symbolic name.
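
A minimal sketch of this two-step pairing, assuming genes are held as plain dictionaries with 
gene_id and gene_name keys. The function names and the identifiers in the usage note are made up 
for illustration and are not taken from gencode-backmap.

```python
def base_id(gene_id):
    """Strip the version suffix, e.g. 'ENSG00000000001.3' -> 'ENSG00000000001'."""
    return gene_id.split(".", 1)[0]


def pair_target_gene(source_gene, target_genes):
    """Find the target gene for a source gene: base id first, then gene name."""
    by_id = {base_id(g["gene_id"]): g for g in target_genes}
    match = by_id.get(base_id(source_gene["gene_id"]))
    if match is not None:
        return match
    # Fall back to the symbolic name when the identifier is not stable.
    by_name = {g["gene_name"]: g for g in target_genes}
    return by_name.get(source_gene["gene_name"])
```

With a target set containing {"gene_id": "ENSG00000000001.1", "gene_name": "GENEA"}, a source gene 
with id "ENSG00000000001.3" pairs by base id, while a source gene with a different id but the name 
"GENEA" pairs by name. A gene matching neither yields None.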


--------------
Identification
--------------
(Starting from Release 25)

Gene and transcript identifiers (*_id attribute) are based on the source identifiers. For annotations that are passed through unchanged from the target assembly, the identifiers are used as-is. For mapped annotations, a mapping version number, in the form _n, is appended to the id, where n is a one-based number indicating the version of the mapping. The mapping version starts at 1 and is added after the standard identifier version, for example "ENST00000456328.2_1".
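
Downstream scripts often need to split such an identifier back into its parts. The helper below is 
a hypothetical sketch that handles only the plain "base.version" and "base.version_n" forms 
described here; any other suffix is rejected rather than guessed at.

```python
import re

# base id, standard version, optional mapping version
_ID_RE = re.compile(r"^(?P<base>[^.]+)\.(?P<version>\d+)(?:_(?P<mapping>\d+))?$")


def split_mapped_id(identifier):
    """Split 'ENST00000456328.2_1' into ('ENST00000456328', 2, 1).

    The mapping version is None for identifiers passed through as-is.
    """
    m = _ID_RE.match(identifier)
    if m is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    mapping = m.group("mapping")
    return (m.group("base"),
            int(m.group("version")),
            int(mapping) if mapping is not None else None)
```

Comparing the base id alone (the first element) is usually what is wanted when joining lift37 
records against the GRCh38 release.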

When a previous release file is supplied, the gene and transcript annotations are checked against it and the mapping version is incremented if any of the attributes listed below change.

Genes have their mapping version numbers incremented if any of the following change from the previous release:
   * contained transcripts, including version and mapping version number,
   * coordinates of the gene change,
   * the gene_type or gene_status changes,
   * the gene_name changes.

Transcripts have their mapping version numbers incremented if any of the following change from the previous release:
   * coordinates of the transcript or its features change,
   * the transcript_type or transcript_status changes,
   * the transcript_name changes,
   * any tag changes,
   * CDS or other sub-features of the transcript change.
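
The transcript rule can be sketched as a field-by-field comparison with the previous release. The 
dictionary layout and the "coords" and "features" keys are assumptions of this example, not the 
data model of gencode-backmap, and the sketch does not cover what happens when the standard 
identifier version itself changes.

```python
# Fields whose change triggers a new mapping version (per the list above).
TRANSCRIPT_KEYS = (
    "coords",             # hypothetical key: transcript coordinates
    "transcript_type",
    "transcript_status",
    "transcript_name",
    "tags",
    "features",           # hypothetical key: CDS and other sub-features
)


def next_mapping_version(previous, current, previous_mapping_version):
    """Return the mapping version for `current` given the previous release record."""
    if previous is None:
        return 1  # mapping versions start at 1
    changed = any(previous.get(k) != current.get(k) for k in TRANSCRIPT_KEYS)
    return previous_mapping_version + 1 if changed else previous_mapping_version
```

A transcript whose only difference from the previous release is its transcript_name would move 
from _1 to _2 under this sketch, while an identical record keeps _1.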

The following ids have the mapping versions applied:
   * gene_id,
   * transcript_id,
   * exon_id,
   * havana_gene,
   * havana_transcript.


--------------------------
Categorization of mappings
--------------------------
Information is collected on mappings and saved as attributes of the GFF3/GTF records. There are also 
records at the gene and transcript level in the mapping information file. The attributes and their values are:

* remap_status - Attribute that indicates the status of the mapping. Possible values are:
   o full_contig - Gene or transcript completely mapped to the target genome with all features intact.
   o full_fragment - Gene or transcript completely mapped to the target genome with insertions in some features. 
     These are usually small insertions.
   o partial - Gene or transcript partially mapped to the target genome.
   o deleted - Gene or transcript did not map to the target genome.
   o no_seq_map - The source sequence is not in the assembly alignments. This will occur with alt loci 
     genes if the alignments only contain the primary assembly.
   o gene_conflict - Transcripts in the gene mapped to multiple locations.
   o gene_size_change - Transcripts caused gene's length to change by more than 50%. This is to detect 
     mapping to processed pseudogenes and mapping across tandem gene duplications.
   o automatic_small_ncrna_gene - Gene is from a small, automatic (ENSEMBL source) non-coding RNA. 
     These are taken from the target annotations if --useTargetForAutoGenes is specified.
   o automatic_gene - Gene is from an automatic process (ENSEMBL source). These are taken from the 
     target annotations if --useTargetForAutoGenes is specified.
   o pseudogene - Pseudogene annotations (excluding polymorphic). These are taken from the target 
     annotations if --useTargetForPseudoGenes is specified.
* remap_original_id - Original ID attribute of the feature. If a feature is split when mapped, new IDs 
  are created, otherwise the original ID is used.
* remap_original_location - Location of the feature in the source genome.
* remap_num_mappings - Number of mappings of the feature; only one of them was used.
* remap_target_status - Attribute that compares the mapping to the existing target annotations. 
  Possible values are:
   o new - Gene or transcript was not in target annotations.
   o lost - Gene or transcript exists in source and target genome, however source was not mapped.
   o overlap - Gene or transcript overlaps previous version of annotation on target genome.
   o nonOverlap - Gene or transcript exists in target, however source mapping is to a different location. 
     These are often mappings to gene family members or pseudogenes.
* remap_substituted_missing_target - Target gene annotation was substituted.
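
As an illustration of how these attributes can be used, the sketch below pulls the remap_* 
attributes out of the ninth column of GTF records and tallies remap_status per feature type. The 
function names are hypothetical, and the parser assumes the usual GTF 'key "value";' layout with 
optionally unquoted values.

```python
import re
from collections import Counter

# Matches: key "quoted value";   or   key bare_value;
_ATTR_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def remap_attributes(attr_column):
    """Return the remap_* attributes of one GTF attribute column as a dict."""
    out = {}
    for key, quoted, bare in _ATTR_RE.findall(attr_column):
        if key.startswith("remap_"):
            out[key] = quoted or bare
    return out


def count_remap_status(gtf_lines, feature="gene"):
    """Count remap_status values over records of the given feature type."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != feature:
            continue
        status = remap_attributes(fields[8]).get("remap_status")
        if status is not None:
            counts[status] += 1
    return counts
```

The function accepts any iterable of lines, so it can be fed a text-mode file handle opened with 
gzip.open(path, "rt") on one of the .gtf.gz files.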


#############
Release files
#############

* gencode.vXlift37.annotation.gtf.gz, gencode.vXlift37.annotation.gff3.gz
  Main annotation files: comprehensive gene annotation on the reference chromosomes in GTF and GFF3 
  file formats. Some genes built on a GRCh38 reference chromosome were mapped to unplaced/unlocalized
  scaffolds in the GRCh37 primary assembly. These have been included in the above files. 
  Please note that not all genes could be consistently mapped to GRCh37. For most of these genes, 
  the annotation in the last GENCODE release based on GRCh37 (v19) has been used, which may involve 
  a different gene_id with respect to the latest GENCODE annotation in GRCh38.

* gencode.vXlift37.unmapped.gtf.gz, gencode.vXlift37.unmapped.gff3.gz
  Annotation from the GRCh38-based release that could not be consistently mapped to GRCh37 and 
  thus is absent from the annotation files listed above.

* gencode.vXlift37.basic.annotation.gtf.gz, gencode.vXlift37.basic.annotation.gff3.gz
  This is a subset of the corresponding comprehensive annotation files, containing only those 
  transcripts tagged as 'basic' in every gene.

* gencode.vXlift37.long_noncoding_RNAs.gtf.gz, gencode.vXlift37.long_noncoding_RNAs.gff3.gz
  This is a subset of the corresponding comprehensive annotation files, containing only those 
  genes classified as long non-coding RNAs.

* gencode.vXlift37.transcripts.fa.gz:
  Fasta file with all transcript sequences.

* gencode.vXlift37.pc_transcripts.fa.gz:
  Fasta file with the nucleotide sequences of all transcripts with an associated CDS feature
  (protein_coding, nonsense_mediated_decay, non_stop_decay, IG_*_gene, TR_*_gene and
  polymorphic_pseudogene transcript biotypes).

* gencode.vXlift37.pc_translations.fa.gz:
  Fasta file with the translations of the coding transcripts in the above file.
 
* gencode.vXlift37.lncRNA_transcripts.fa.gz:
  Fasta file with the nucleotide sequences of all transcripts belonging to long non-coding RNA genes.

* GRCh37.primary_assembly.genome.fa.gz:
  GRCh37 primary assembly sequence (the primary assembly includes the reference chromosomes and the 
  unplaced and unlocalized scaffolds).

* gencode.vXlift37.metadata.Annotation_remark.gz:
  Remarks made during the manual annotation of the transcript.

* gencode.vXlift37.metadata.EntrezGene.gz:
  Entrez gene ids associated to GENCODE transcripts (from Ensembl xref pipeline).

* gencode.vXlift37.metadata.Gene_source.gz:
  Source of the gene annotation (Ensembl, Havana, Ensembl-Havana merged model or imported in the case 
  of small RNA and mitochondrial genes).

* gencode.vXlift37.metadata.HGNC.gz:
  HGNC approved gene symbol (from Ensembl xref pipeline).

* gencode.vXlift37.metadata.PDB.gz:
  PDB entries associated to the transcript (from Ensembl xref pipeline).

* gencode.vXlift37.metadata.Pubmed_id.gz:
  PubMed ids of publications associated to the transcript.

* gencode.vXlift37.metadata.RefSeq.gz:
  RefSeq RNA and/or protein associated to the transcript (from Ensembl xref pipeline).

* gencode.vXlift37.metadata.SwissProt.gz:
  UniProtKB/SwissProt entry associated to the transcript (from Ensembl xref pipeline).

* gencode.vXlift37.metadata.Transcript_source.gz:
  Source of the transcript annotation.

* gencode.vXlift37.metadata.Transcript_supporting_feature.gz:
  Piece of evidence used in the annotation of the transcript.

* gencode.vXlift37.metadata.TrEMBL.gz:
  UniProtKB/TrEMBL entry associated to the transcript (from Ensembl xref pipeline).