#### README ####

IMPORTANT: Please note you can download correlation data tables,
supported by Ensembl, via the highly customisable BioMart data mining
tool. See http://protists.ensembl.org/biomart/martview or
http://www.ebi.ac.uk/biomart/ for more information. Not available for
Ensembl Bacteria.

-----------------------
GENBANK FLATFILE DUMPS
-----------------------

This directory contains GENBANK flatfile dumps. All files are
compressed using GNU Zip.

Ensembl Genomes provides an automatic reannotation of genomic data as
well as imports of existing genomic data. These data will be dumped in
a number of forms - one of them being GENBANK flat files. As the
annotation of this form comes from Ensembl Genomes, and not the
original sequence entry, the two annotations are likely to be
different.

GENBANK flat file format dumping provides all the confirmed protein
coding genes known by Ensembl Genomes. Considerably more information is
stored in Ensembl Genomes: the flat file just gives a representation
which is compatible with existing tools.

The main body of the entry gives the same information as is in the main
GENBANK flat file entry.

    * ID - the GENBANK id
    * AC - the EMBL/GenBank/DDBJ accession number (only the primary
      accession number used)
    * SV - The accession.version pair which gives the exact reference
      to a particular sequence
    * CC - comment lines to help you interpret the entry

Currently the following features are dumped into the feature table of
the Ensembl entry:

    * Transcripts as CDS entries. Each transcript has the following
      attributes attached
          o Transcript id - a stable id, which Ensembl will attempt to
            preserve as sensibly as possible during updates of the data
          o Gene id - indication of the gene that this transcript
            belongs to. Gene ids are stable and preserved as sensibly
            as possible during updates of the data
          o Translation - the peptide translation of the transcript.
    * Exons as exon entries. Each exon has the following information
          o Exon id.
            The exon id is stable and preserved as sensibly as possible
            during sequence updates
          o start_phase. The phase of the splice site at the 5' end of
            the exon. Phase 0 means between two codons, phase 1 means
            between the first and the second base of the codon (meaning
            that there are 2 bases until the reading frame of the exon)
            and phase 2 means between the second and the third base of
            the codon (one base until the reading frame starts).
          o end_phase. The phase of the splice site at the 3' end of
            the exon: same definition as above (though of course, being
            end_phase, the position relative to the exon's reading
            frame is different for phase 1 and 2).

We are considering other information that should be made dumpable. In
general we would prefer that you use database access over flat file
access if you want to do something serious with the data.

-----------
FILE NAMES
-----------

The files are consistently named following this pattern:

    <species>.<assembly>.<version>.dat.gz

<species>:  The systematic name of the species.
<assembly>: The assembly build name.
<version>:  The version of Ensembl Genomes from which the data was
            exported.
dat:        All files in these directories are in GenBank DAT format.
gz:         All files are compacted with GNU Zip for storage efficiency.

e.g. Drosophila_melanogaster.BDGP5.21.dat.gz

Where the genome has a chromosome-level assembly, individual files are
provided for each chromosome, named following this pattern:

    <species>.<assembly>.<version>.chromosome.<chromosome>.dat.gz

Where the assembly also contains additional non-chromosomal sequences
that are not present in the chromosomes, these are all available in a
file with the pattern:

    <species>.<assembly>.<version>.nonchromosomal.dat.gz
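
If you need to sort or filter downloaded files by species, assembly or
release, the naming patterns above can be split with a regular
expression. The following Python sketch is one possible way to do it; it
is not part of Ensembl Genomes. The names DumpName and parse_dump_name
are hypothetical, and the code assumes that the species name and the
version contain no dots while the assembly name may contain them. The
chromosome file name in the example is made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class DumpName(NamedTuple):
    """Fields of a dump file name (hypothetical helper, not an Ensembl API)."""
    species: str
    assembly: str
    version: int
    kind: str                  # "genome", "chromosome" or "nonchromosomal"
    chromosome: Optional[str]  # set only when kind == "chromosome"


# Assumption: species and version contain no dots; the assembly name may.
_NAME_RE = re.compile(
    r"^(?P<species>[^.]+)\."
    r"(?P<assembly>.+?)\."
    r"(?P<version>\d+)"
    r"(?:\.chromosome\.(?P<chromosome>.+)|\.(?P<nonchrom>nonchromosomal))?"
    r"\.dat\.gz$"
)


def parse_dump_name(filename: str) -> DumpName:
    """Split a file name that follows the patterns described above."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a recognised dump file name: {filename!r}")
    if match.group("chromosome") is not None:
        kind = "chromosome"
    elif match.group("nonchrom") is not None:
        kind = "nonchromosomal"
    else:
        kind = "genome"
    return DumpName(
        species=match.group("species"),
        assembly=match.group("assembly"),
        version=int(match.group("version")),
        kind=kind,
        chromosome=match.group("chromosome"),
    )


if __name__ == "__main__":
    # The first name is the example given above; the second is illustrative.
    print(parse_dump_name("Drosophila_melanogaster.BDGP5.21.dat.gz"))
    print(parse_dump_name("Drosophila_melanogaster.BDGP5.21.chromosome.2L.dat.gz"))
```

A name that does not fit any of the three patterns raises ValueError, so
unrelated files in the same directory (this README, for instance) can be
skipped explicitly rather than silently misparsed.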