README FILE FOR THE EXPLANATION OF THE STATISTICS IN THE gencodegenes.org WEBSITE.

The statistics are calculated using the gtf file that includes only the reference chromosomes
(i.e. chr1-22, X, Y, MT for human; chr1-19, X, Y, MT for mouse).

In the general stats we have clustered some biotypes to offer an overall view of the data.
In detail:

1) GENES:

Total No of Genes -> counts total number of all genes
Protein-coding genes -> counts "protein_coding" gene biotype
    (from releases 41 and M30, protein-coding readthrough genes are counted separately)
Long non-coding RNA genes -> counts broad lncRNA gene biotype, i.e. these gene biotypes:
    "processed_transcript", "lincRNA", "3prime_overlapping_ncrna", "antisense", "non_coding",
    "sense_intronic", "sense_overlapping", "TEC", "known_ncrna", "macro_lncRNA",
    "bidirectional_promoter_lncrna", "lncRNA"
Small non-coding RNA genes -> counts broad sRNA gene biotype, i.e. these gene biotypes:
    "snRNA", "snoRNA", "rRNA", "Mt_tRNA", "Mt_rRNA", "misc_RNA", "miRNA", "ribozyme", "sRNA",
    "scaRNA", "vaultRNA"
Pseudogenes -> counts pseudogene broad gene biotype (gene biotype containing /*pseudogene/)
 - processed pseudogenes -> counts /*processed_pseudogene/ gene biotypes
 - unprocessed pseudogenes -> counts /*unprocessed_pseudogene/ gene biotypes
 - unitary pseudogenes -> counts /*unitary_pseudogene/ gene biotypes
 - polymorphic pseudogenes -> counts "polymorphic_pseudogene" gene biotype
 - pseudogenes -> counts "pseudogene" gene biotype
Immunoglobulin/T-cell receptor gene segments
 - protein coding segments -> counts IG and TR genes of protein coding broad gene biotype
 - pseudogenes -> counts IG and TR genes of pseudogene broad gene biotype

********************************

2) TRANSCRIPTS:

Total No of Transcripts -> counts total number of all transcripts
Protein-coding transcripts -> counts "protein_coding" transcript biotype
 - full length protein-coding -> counts the protein-coding transcripts that have neither a
   "cds_start_NF" nor a "cds_end_NF" tag
 - partial length protein-coding -> counts all the rest (they have a "cds_start_NF" tag, a
   "cds_end_NF" tag, or both)
Nonsense mediated decay transcripts -> counts the "nonsense_mediated_decay" transcript biotype
Long non-coding RNA transcripts -> counts the total number of transcripts in loci of the lncRNA
    broad gene biotype

********************************

3) TRANSLATIONS:

Total No of distinct translations -> obtained by adding the numbers of distinct translation
    sequences per gene. Only transcripts with a "protein_coding" biotype are taken into account.
    Translations of partial length transcripts, i.e. those with "cds_start_NF" and/or
    "cds_end_NF" tags, are excluded when their CDS sequence matches the 3' end or the 5' end,
    respectively, of a full length transcript sequence in the same gene.
Genes that have more than one distinct translation -> genes with more than one distinct
    translation (of "protein_coding" transcript biotypes), excluding the partial length
    transcripts whose sequence is a sub-sequence of a full length sequence already counted as
    a distinct translation.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NOTE: Until releases 19 and M3 the pseudogene gene stats were calculated as shown below.
 - processed pseudogenes -> counts number of genes that have at least one
   "processed_pseudogene", "transcribed_processed_pseudogene" or "retrotransposed" transcript
 - unprocessed pseudogenes -> counts number of genes that have at least one
   "unprocessed_pseudogene" or "transcribed_unprocessed_pseudogene" transcript
 - unitary pseudogenes -> counts number of genes that have at least one "unitary_pseudogene"
   transcript
 - polymorphic pseudogenes -> counts "polymorphic_pseudogene" gene biotype
 - pseudogenes -> counts unclassified "pseudogene" gene biotypes

(NOTE: this is because these HAVANA gene biotypes: "processed_pseudogene",
"transcribed_processed_pseudogene", "retrotransposed", "unprocessed_pseudogene",
"transcribed_unprocessed_pseudogene" and "unitary_pseudogene" were grouped under a generic
"pseudogene" biotype during the Ensembl-Havana annotation merge process. However, we decided to
show the original Havana gene biotype in the statistics, which was obtained from the transcript
biotypes of each gene as described above. The table with the analytical stats under the general
stats corresponds to the real database data, i.e. only the "pseudogene" biotype at the gene
level.)
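
EXAMPLE (illustrative only): the gene biotype grouping from section 1 and the full/partial length
rule from section 2 can be sketched in Python as below. This is a minimal sketch, not the script
used to produce the website statistics. All function names are hypothetical. It assumes that the
gene biotype is stored in a "gene_type" attribute in column 9 of the gtf file. It also assumes
that /*processed_pseudogene/ is not meant to match "unprocessed_pseudogene" (which contains the
same text), so the unprocessed case is tested first.

```python
import re
from collections import Counter

# Biotype lists copied from section 1 (GENES) of this README.
LNCRNA_BIOTYPES = {
    "processed_transcript", "lincRNA", "3prime_overlapping_ncrna", "antisense",
    "non_coding", "sense_intronic", "sense_overlapping", "TEC", "known_ncrna",
    "macro_lncRNA", "bidirectional_promoter_lncrna", "lncRNA",
}
SRNA_BIOTYPES = {
    "snRNA", "snoRNA", "rRNA", "Mt_tRNA", "Mt_rRNA", "misc_RNA", "miRNA",
    "ribozyme", "sRNA", "scaRNA", "vaultRNA",
}


def broad_gene_biotype(gene_type):
    """Map a gene biotype to the broad category used in the general stats."""
    if gene_type == "protein_coding":
        return "protein_coding"
    if gene_type in LNCRNA_BIOTYPES:
        return "lncRNA"
    if gene_type in SRNA_BIOTYPES:
        return "sRNA"
    if "pseudogene" in gene_type:
        return "pseudogene"
    return "other"


def pseudogene_subtype(gene_type):
    """Return the pseudogene sub-category, or None if no sub-category applies."""
    if gene_type == "pseudogene":
        return "pseudogene"
    if gene_type == "polymorphic_pseudogene":
        return "polymorphic"
    # Assumption: test "unprocessed" first, because the string
    # "unprocessed_pseudogene" also ends with "processed_pseudogene".
    if gene_type.endswith("unprocessed_pseudogene"):
        return "unprocessed"
    if gene_type.endswith("processed_pseudogene"):
        return "processed"
    if gene_type.endswith("unitary_pseudogene"):
        return "unitary"
    return None


def is_full_length(tags):
    """A protein-coding transcript is full length if it has neither NF tag."""
    return "cds_start_NF" not in tags and "cds_end_NF" not in tags


def parse_attributes(attr_field):
    """Return (key, value) pairs for the quoted attributes of gtf column 9."""
    return re.findall(r'(\w+) "([^"]*)"', attr_field)


def count_gene_categories(gtf_lines):
    """Count broad gene categories over the 'gene' rows of gtf lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene rows are counted here
        attrs = dict(parse_attributes(fields[8]))
        counts[broad_gene_biotype(attrs.get("gene_type", ""))] += 1
    return counts


# Made-up rows for demonstration; identifiers and coordinates are not real.
DEMO_LINES = [
    '##description: demo\n',
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; level 2;\n',
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n',
    'chr1\tHAVANA\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_type "lincRNA"; level 2;\n',
    'chr2\tHAVANA\tgene\t5\t50\t.\t+\t.\tgene_id "G3"; gene_type "transcribed_processed_pseudogene";\n',
]

if __name__ == "__main__":
    print(count_gene_categories(DEMO_LINES))
```

The sub-category function is kept separate from the broad category function because the
pseudogene sub-counts are a breakdown of the pseudogene total, not additional categories.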