--------------------------------------------
Glossina morsitans morsitans (Tsetse) README
============================================

Table of Contents
--------------------------------------------
1. Tsetse EST sequencing projects statistics
2. Methodology
3. File format
4. Data Access
5. Contacts
============================================

--------------------------------------------------------------------------------------

1. Tsetse EST sequencing projects statistics
============================================

A total of 12,700 randomly selected clones were sequenced, most of them
from both ends. In total, 21,427 ESTs were produced (9,879,196 bp).
Clustering with Phrap (P. Green, unpublished) produced 3,220 clusters with
a median membership of 4.90 (range 2-135) (74.27% of the total ESTs). The
average size distribution of the clusters is given in Fig. 1. This left
5,656 singletons (25.73% of the total ESTs). Average EST size was 461 bp.
The ESTs generated were 73.7% redundant.

2. Methodology
==============

Each sequence was analysed using BLASTX against SWALL and FlyBase proteins.
Pfam (Bateman et al. 2002) domains were identified using ESTwise (Birney,
unpublished), and each Pfam domain was mapped to its InterPro annotation.
Contaminating T. brucei sequences were removed from the final set of
clusters by screening them against all known T. brucei DNA sequences.

GO annotation was transferred to each sequence in one of two ways:
on the basis of BLAST hits to FlyBase proteins (gene_association.fb) with a
significance better than E() = 1e-10 (that is, a lower E-value), or, where a
Pfam domain was detected, on the basis of the InterPro to GO mapping
(interpro2go) for the corresponding GO terms.

3. File format
==============

The file format is based on the guidelines provided by the Gene Ontology
Consortium (www.geneontology.org) with several mij

1.  DB
    Database from which annotated entry has been taken.
    Here: GeneDB_Gmorsitans is used to signify that the data will
    eventually be housed at GeneDB (http://www.genedb.org)

2.  DB_Object_ID
    A unique identifier in the DB is normally used for the item being
    annotated.
    Here: a contig number refers to an individual EST cluster from the
    following FASTA format file, which contains all the sequences:
    ftp://ftp.sanger.ac.uk/pub/pathogens/Glossina/GLOS.ESTs.fasta

3.  DB_Object_Symbol
    This column is temporarily 'occupied' with the same contents as
    column 2.

4.  NOT
    Here: Currently not applicable, always empty.

5.  GOid
    The GO identifier for the term attributed to the DB_Object_ID.
    Example: GO:0005625

6.  DB:Reference
    Reference cited to support the attribution.
    This is pre-publication data; after publication this column will
    contain a PubMed ID for a reference supporting the attributions.

7.  Evidence
    Here: always 'IEA'.
    Example: IEA

8.  With
    Here: Currently not applicable, always empty.

9.  Aspect
    One of the three ontologies: P (biological process), F (molecular
    function) or C (cellular component).
    Example: P

10. DB_Object_Name
    Here: Currently not applicable. Contains the same information as
    column 2. Will soon contain details of the closest Drosophila
    homologue for reference purposes.

11. Synonym
    Here: Currently not applicable, always empty.

12. DB_Object_Type
    What kind of entity is being annotated.
    Here: always 'transcript'
    Example: transcript

13. Taxon_ID
    Identifier for the species being annotated.
    Here: taxon:37546

14. Date
    Date that the analysis was performed.

4. Data Access
==============

All sequences analysed are available from
ftp://ftp.sanger.ac.uk/pub/pathogens/Glossina/GLOS.ESTs.fasta

The FASTA file header contains the contig (cluster) number to which GO
terms were assigned, as well as the identifiers for the individual reads.

EXAMPLE:

>Gmm-0873 2 members, constructed from Tse79g12.p1c, Tse79g12.q1c
TTTCATCGAACCATAAGGCAATCTCTTTTTCTGCCGATTCAACGGCATCGGAGCCGTGAA
TAATATTGCGACCGACTTGAATGCAGAAATCGCCTCGTATTGTTCCAGGTAATGAAT...

5. Contacts
===========

Neil Hall, nh1@sanger.ac.uk
Matt Berriman, mb4@sanger.ac.uk
Arnaud Kerhornou, axk@sanger.ac.uk
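
Reading the files (illustrative sketch)

The 14-column layout of section 3 and the FASTA header of section 4 can be
read with a few lines of Python. This is a minimal sketch, not a tool
shipped with the data. It assumes that the annotation columns are separated
by tabs (this README does not state the delimiter) and that every FASTA
header follows the pattern of the single example in section 4. The function
names and the DB_Reference key are hypothetical.

```python
import re

# Column names as listed in section 3 ("DB:Reference" is spelled
# DB_Reference here so that it can be used as a plain dictionary key).
COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "NOT", "GOid",
    "DB_Reference", "Evidence", "With", "Aspect", "DB_Object_Name",
    "Synonym", "DB_Object_Type", "Taxon_ID", "Date",
]

# Pattern inferred from the one example header in section 4.
# "members?" also accepts "1 member", which is an assumption.
HEADER_RE = re.compile(
    r"^>(?P<contig>\S+)\s+(?P<count>\d+) members?, "
    r"constructed from (?P<reads>.+)$"
)


def parse_annotation_line(line):
    """Split one annotation line into a dict keyed by column name.

    Assumes tab-separated fields; empty columns stay as empty strings.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(
            "expected %d columns, found %d" % (len(COLUMNS), len(fields))
        )
    return dict(zip(COLUMNS, fields))


def parse_fasta_header(header):
    """Return (contig, member_count, read_ids) from a FASTA header line."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("unrecognised FASTA header: %r" % header)
    reads = [read.strip() for read in match.group("reads").split(",")]
    return match.group("contig"), int(match.group("count")), reads


if __name__ == "__main__":
    # Header copied from the example in section 4.
    header = ">Gmm-0873 2 members, constructed from Tse79g12.p1c, Tse79g12.q1c"
    print(parse_fasta_header(header))
```

The contig name returned by parse_fasta_header is the value to look up in
column 2 (DB_Object_ID) of the annotation file, which is how a sequence can
be linked to its GO terms.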