#### README ####

IMPORTANT: Correlation data tables supported by Ensembl can be downloaded 
via the highly customisable BioMart data mining tool. 
See http://protists.ensembl.org/biomart/martview 
or http://www.ebi.ac.uk/biomart/ for more information. BioMart is not 
available for Ensembl Bacteria. 

-----------------------
GENBANK FLATFILE DUMPS
-----------------------
This directory contains GENBANK flatfile dumps. All files are compressed 
using GNU Zip.
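
Because the files are gzip-compressed, they can be read directly without 
unpacking them first. The following is a minimal Python sketch, not an 
official Ensembl Genomes tool. It assumes that each record ends with a line 
containing only "//", the usual GENBANK record terminator; the function 
name iter_records is hypothetical.

```python
import gzip


def iter_records(path):
    """Yield each record of a gzip-compressed flat file as a list of lines.

    Assumption: a record is terminated by a line containing only '//'.
    """
    record = []
    # "rt" opens the compressed file in text mode, decompressing on the fly.
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.strip() == "//":
                if record:
                    yield record
                record = []
            else:
                record.append(line)
    # Yield a trailing record that lacks a terminator, if there is one.
    if record:
        yield record
```

For example, sum(1 for _ in iter_records(path)) counts the records in a 
file while holding only one record in memory at a time.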

Ensembl Genomes provides an automatic reannotation of genomic data as well
as imports of existing genomic data. These data are dumped in a number 
of formats, one of which is GENBANK flat files. Because the annotation in 
these files comes from Ensembl Genomes, and not from the original sequence 
entry, the two annotations are likely to differ.

The GENBANK flat file dumps contain all the confirmed protein-coding 
genes known to Ensembl Genomes. Considerably more information is stored in 
Ensembl Genomes: the flat file just gives a representation which is 
compatible with existing tools.

The main body of the entry gives the same information as the main 
GENBANK flat file entry. GENBANK-format files label these lines with 
keywords; the equivalent two-letter EMBL-format line codes are given in 
parentheses.

    * LOCUS (ID) - the GENBANK id
    * ACCESSION (AC) - the EMBL/GenBank/DDBJ accession number (only the 
           primary accession number used)
    * VERSION (SV) - the accession.version pair which gives the exact 
           reference to a particular sequence
    * COMMENT (CC) - comment lines to help you interpret the entry 

Currently the following features are dumped into the feature table of 
the Ensembl entry:

    * Transcripts as CDS entries. Each transcript has the following 
      attributes attached:
          o Transcript id - a stable id, which Ensembl will attempt to 
            preserve as sensibly as possible during updates of the data
          o Gene id - indication of the gene that this transcript belongs 
            to. Gene ids are stable and preserved as sensibly as possible 
            during updates of the data
          o Translation - the peptide translation of the transcript. 
    * Exons as exon entries. Each exon has the following information:
          o Exon id. The exon id is stable and preserved as sensibly 
            as possible during sequence updates
          o start_phase. The phase of the splice site at the 5' end 
            of the exon. Phase 0 means between two codons, phase 1 
            means between the first and the second base of the codon 
            (meaning that there are 2 bases until the reading frame of 
            the exon) and phase 2 means between the second and the third 
            base of the codon (one base until the reading frame starts).
          o end_phase. The phase of the splice site at the 3' end of the 
            exon: same definition as above (though of course, being end_phase, 
            the position relative to the exon's reading frame is different 
            for phase 1 and 2). 
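
The phase definitions above can be expressed as simple arithmetic. The 
following Python sketch is illustrative only: the function names are 
hypothetical, only the phase values 0, 1 and 2 described above are handled, 
and expected_end_phase assumes an exon that is coding along its whole 
length.

```python
def _check_phase(phase):
    # Only the three phase values described in the text are accepted.
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2, got %r" % (phase,))


def bases_before_first_codon(start_phase):
    """Bases at the 5' end of the exon that complete a codon begun upstream.

    Phase 0 gives 0 bases, phase 1 gives 2 bases, phase 2 gives 1 base.
    """
    _check_phase(start_phase)
    return (3 - start_phase) % 3


def bases_after_last_codon(end_phase):
    """Bases at the 3' end of the exon that begin a codon completed downstream.

    Phase 1 means one base of the split codon lies in this exon, phase 2 two.
    """
    _check_phase(end_phase)
    return end_phase


def expected_end_phase(start_phase, exon_length):
    """End phase implied by the start phase and length of a fully coding exon."""
    _check_phase(start_phase)
    return (start_phase + exon_length) % 3
```

For instance, an exon with start_phase 1 begins with two bases that finish 
the previous codon, so a fully coding exon of length 2 with start_phase 1 
ends exactly on a codon boundary (end_phase 0).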

We are considering what other information should be made available in 
these dumps. In general, we recommend database access over flat file 
access for any substantial work with the data. 

-----------
FILE NAMES
-----------
The files are consistently named following this pattern:
   <species>.<assembly>.<eg_version>.dat.gz

<species>:    The systematic name of the species.
<assembly>:   The assembly build name.
<eg_version>: The version of Ensembl Genomes from which the data was exported.
dat:          All files in these directories are in GenBank DAT format.
gz:           All files are compressed with GNU Zip for storage efficiency.

e.g. 
Drosophila_melanogaster.BDGP5.21.dat.gz

Where the genome has a chromosome-level assembly, individual files are provided
for each chromosome, named following this pattern:
   <species>.<assembly>.<eg_version>.chromosome.<chromosome_name>.dat.gz
Where the assembly also contains additional non-chromosomal sequences that are
not present in the chromosomes, these are all available in a file with the pattern:
   <species>.<assembly>.<eg_version>.nonchromosomal.dat.gz
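
The naming patterns above can be split back into their parts. The following 
Python sketch is one possible approach, not an official parser: the function 
name parse_dump_name is hypothetical, and it assumes that <species> and 
<eg_version> contain no dots while <assembly> may contain them.

```python
_SUFFIX = ".dat.gz"
_NONCHROM = ".nonchromosomal"
_CHROM = ".chromosome."


def parse_dump_name(filename):
    """Split a dump file name into species, assembly, version and region.

    Assumptions: <species> and <eg_version> contain no dots; <assembly>
    may contain dots.
    """
    if not filename.endswith(_SUFFIX):
        raise ValueError("not a .dat.gz file: %r" % filename)
    stem = filename[:-len(_SUFFIX)]

    # Remove the optional region part before splitting the remaining fields.
    kind, chromosome = "genome", None
    if stem.endswith(_NONCHROM):
        kind = "nonchromosomal"
        stem = stem[:-len(_NONCHROM)]
    elif _CHROM in stem:
        kind = "chromosome"
        stem, chromosome = stem.split(_CHROM, 1)

    if stem.count(".") < 2:
        raise ValueError("unexpected file name: %r" % filename)
    species, rest = stem.split(".", 1)         # first field
    assembly, eg_version = rest.rsplit(".", 1)  # last field is the version
    return {
        "species": species,
        "assembly": assembly,
        "eg_version": eg_version,
        "kind": kind,
        "chromosome": chromosome,
    }
```

Applied to the example name Drosophila_melanogaster.BDGP5.21.dat.gz, this 
gives species "Drosophila_melanogaster", assembly "BDGP5" and eg_version 
"21". The region part is removed first so that a numeric chromosome name is 
not mistaken for the version number.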