#### README ####

Please send comments or questions to http://lists.ensembl.org/mailman/listinfo/dev.

-----------------------------------------
Ensembl Multi Format (EMF) FLATFILE DUMPS
-----------------------------------------

This directory contains Compara EMF flatfile dumps. To ease downloading of the
files, an EMF file is created for each chromosome. All files are then
compressed with GNU Zip for storage efficiency.

EMF files store:

1. Whole-genome multiple alignments created by the Ensembl Comparative
   Genomics team (compara).
2. Gene-based multiple alignments created by the Ensembl Comparative
   Genomics team (gene_alignment).
3. Alignments of resequencing data from individuals (or strains) to the
   reference genome assembly created by the Ensembl Functional Genomics
   team (resequencing).

The file format is very similar for all uses, with differences noted below.

This README represents version 1.0 of the EMF specification.

----------
FILE NAMES
----------

The files are consistently named following this pattern:
   .....emf.gz

EXAMPLE
   EMF resequencing data file names
      Mus_musculus.NCBIM36.43.resequencing.chromosome.Y.emf.gz
      Homo_sapiens.NCBI36.43.resequencing.chromosome.X.emf.gz
      Rattus_norvegicus.RGSC3.4.43.resequencing.chromosome.7.emf.gz

EXAMPLE
   EMF whole-genome multiple alignment (compara) data file names
      Compara.pecan_7_way.chr13_1.emf.gz
      Compara.pecan_gerp_10_way.chr3_5.emf.gz
      Compara.pecan_gerp_10_way.others_1.emf.gz

EXAMPLE
   EMF gene-based multiple alignment (gene_alignment) data file names
      Compara.protein_trees.43.emf.gz

-----------
FILE FORMAT
-----------

#File Header lines start with ##
##FORMAT (compara, resequencing, gene_alignment)
##DATE dump_date
##RELEASE ensembl_release_number (may contain multiple release numbers)

#Data Header
#total number of SEQ and SCORE lines must correspond to the number and order of columns in the data lines
#first line
SEQ organism individual*/translation_stable_id^ chr sequence_start sequence_stop strand gene_stable_id display_label|NULL^/(chr_length=sequence_length)&

#compara following lines
SEQ organism chr sequence_start sequence_stop strand sequence_length

#resequencing following lines
SEQ organism individual sequence_source (WGS, etc.)
SCORE score_type

#gene_alignment following lines
SEQ organism translation_stable_id chr sequence_start sequence_stop strand gene_stable_id display_label|NULL

#compara example
SEQ human 4 450000 560000 1
SEQ mouse 17 780000 790000 -1
SEQ rat 12 879999 889998 1
SCORE GERP

#resequencing example
SEQ mouse reference 17 780000 790000 1
SEQ mouse 129S1/SvJ WGS
SEQ mouse DBA WGS
SCORE aligned 129S1/SvJ reads (may also be confidence score)
SCORE aligned DBA reads

#gene_alignment example
SEQ Mus_musculus ENSMUSP00000042016 2 76987970 77045910 -1 ENSMUSG00000042272 Sestd1
SEQ Canis_familiaris ENSCAFP00000029327 5 66656398 66666500 -1 ENSCAFG00000019806 489651
SEQ Drosophila_melanogaster CG5439-PA 2L 13177251 13178813 1 CG5439 CG5439
SEQ Ornithorhynchus_anatinus ENSOANP00000021778 Ultra102 388659 395409 -1 ENSOANG00000013809 NULL

# The SEQ section may list ancestral sequences too, following an
# in-order traversal of the tree (see the TREE section below)

#In EPO_EXTENDED alignments, the SEQ header may show a chromosome name of the form Composite_123456789.
#This indicates that the sequence of the species is actually made of
#several scaffolds concatenated together.
#In those cases, the SEQ line will be preceded by a comment listing all the
#scaffolds that make up this "Composite" sequence.

# Then may follow a TREE pragma with the Newick-encoded phylogenetic tree
# linking those sequences.
TREE (Hsap_11_134959233_135017027[-]:0.00825809999999999,Ggor_11_135146106_135206808[-]:0.008760400000000002)Aseq_Ancestor_1102_1085092_1_57552[+]:0.007914300000000013;
# The node names follow the pattern: "species_chromosome_start_end[strand]"
# and match the coordinates found in the SEQ lines

# data lines: one per bp, based on the sequence coordinates in the Data Header
# gap character "-"
# no alignment character "~"
# lower case character: masked sequence
# heterozygous bases use ambiguity codes
# all coordinates are inclusive coordinates

#Data Block
#the data block starts with DATA
#the columns represent the sequences of the SEQ headers in the same order
#spaces are optional between columns representing SEQ and SCORE lines
DATA
# compara example
AAA -5
TTA +1
CGA +1
GG- +4
CG- +4

# resequencing example
A A A 2 1
T A T 2 1
C C C 2 1
G G ~ 1 0
C C ~ 1 0

# at the end of a data block // signals the beginning of the next data header
//

* resequencing only
^ gene_alignment only
& optional for resequencing

Other information:
There may be multiple SEQ and SCORE lines in data headers
Comment character is "#" and must be the first character in the line
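
EXAMPLE
   A minimal Python sketch of a reader for the format described above. It is
   not part of the Ensembl tooling. The names EmfBlock, parse_emf and
   split_row are hypothetical, chosen for illustration. The sketch assumes
   that each SEQ column is a single character and that SCORE values, when
   present, are separated from the sequence columns and from each other by
   whitespace.

```python
from dataclasses import dataclass, field


@dataclass
class EmfBlock:
    """One alignment block: its data header plus one row per position."""
    seq: list = field(default_factory=list)    # fields of each SEQ line
    score: list = field(default_factory=list)  # text of each SCORE line
    tree: str = ""                             # Newick string from TREE, if any
    rows: list = field(default_factory=list)   # (bases, scores) per data line


def split_row(line, n_seq):
    """Split a data line into n_seq one-character columns and the scores.

    Spaces between columns are optional, so the first n_seq non-blank
    characters are taken as the SEQ columns (assumption: the remaining
    text holds whitespace-separated SCORE values).
    """
    bases = []
    pos = 0
    while len(bases) < n_seq and pos < len(line):
        if not line[pos].isspace():
            bases.append(line[pos])
        pos += 1
    return "".join(bases), line[pos:].split()


def parse_emf(lines):
    """Yield (file_header, block) pairs from an iterable of EMF lines."""
    header = {}
    block = EmfBlock()
    in_data = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            # File header, e.g. "##FORMAT compara"
            key, _, value = line[2:].partition(" ")
            header[key] = value
        elif not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        elif line.startswith("//"):
            # End of the data block; the next data header follows
            yield header, block
            block, in_data = EmfBlock(), False
        elif in_data:
            block.rows.append(split_row(line, len(block.seq)))
        elif line.startswith("SEQ "):
            block.seq.append(line.split()[1:])
        elif line.startswith("SCORE "):
            block.score.append(line[6:])
        elif line.startswith("TREE "):
            block.tree = line[5:]
        elif line.strip() == "DATA":
            in_data = True
    if block.seq or block.rows:
        yield header, block  # tolerate a final block without "//"
```

   For a compressed dump, the lines can be read with gzip.open(path, "rt")
   from the Python standard library and passed straight to parse_emf. Scores
   are kept as strings because their meaning depends on the SCORE line.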