GOA README
-----------

1. Contents
------------

1.  Contents
2.  Introduction
3.  List of files and file formats
4.  SWISS-PROT/TrEMBL/Ensembl non-redundant human proteome set
5.  IPI
6.  Ancillary mappings
7.  Assignment of GO terms to SWISS-PROT/TrEMBL/Ensembl data
8.  Addition of GO assignments from other data sources
9.  Contacts
10. Copyright Notice


2. Introduction
----------------

GOA (GO Annotation@EBI) is a project run by the European Bioinformatics
Institute that aims to provide assignments of gene products to the Gene
Ontology (GO) resource. The goal of the Gene Ontology Consortium is to
produce a dynamic controlled vocabulary that can be applied to all
eukaryotes, even while knowledge of gene and protein roles in cells is
still accumulating and changing. In the GOA project, this vocabulary
will be applied to a non-redundant set of proteins described in the
SWISS-PROT, TrEMBL and Ensembl databases that collectively provide
complete proteomes for Homo sapiens and other organisms.

In the first stage of this project, GO assignments have been applied to
a data set representing the human proteome by a combination of
electronic mappings and manual curation. Subsequently, GO assignments
for all complete proteomes will be provided at this site.

For further information please refer to our web site at:
http://www.ebi.ac.uk/GOA


3. List of files and file formats
----------------------------------

The GOA project produces two gene_association files:

i) gene_association.goa_human

   Locations:
   http://www.geneontology.org/gene-associations/gene_association.goa
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz

   This file contains the GO assignments for the proteins of the
   non-redundant human proteome set.

ii) gene_association.goa_sptr

   Locations:
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/SPTR/gene_association.goa_sptr.gz

   This file contains all GO assignments for the SWISS-PROT and TrEMBL
   databases.

We have complied with the file format described by the Gene Ontology
Consortium for annotation files
(http://www.geneontology.org/GO.annotation.html#file).
Since we deal with proteins rather than genes, the semantics of some
fields differ slightly from those of the standard gene association
files.

1.  DB
    Database from which the annotated entry has been taken.
    One of SPTR (SWISS-PROT/TrEMBL) or ENSEMBL (Ensembl).
    Example: SPTR

2.  DB_Object_ID
    A unique identifier in the DB for the item being annotated.
    Here: Accession number or identifier of the annotated protein.
    Either a SWISS-PROT/TrEMBL accession number or an Ensembl
    peptide ID.
    Example: O00165

3.  DB_Object_Symbol
    A (unique and valid) symbol to which DB_Object_ID is matched.
    Here: SWISS-PROT/TrEMBL entry name or Ensembl peptide ID.
    Example: HAX1_HUMAN

4.  NOT
    Here: Currently not applicable, always empty.

5.  GOid
    The GO identifier for the term attributed to the DB_Object_ID.
    Example: GO:0005625

6.  DB:Reference
    Reference cited to support the attribution. See section 7 for an
    explanation of the reference types used.
    Examples: PUBMED:9058808, GOA:interpro, GOA:spkw, GOA:spec

7.  Evidence
    One of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, NR, E, P or ND.
    Example: TAS

8.  With
    Example: SPTR:O00341

9.  Aspect
    One of the three ontologies: P (biological process),
    F (molecular function) or C (cellular component).
    Example: P

10. DB_Object_Name
    Name of gene or gene product.
    Here: Either empty (for Ensembl peptides) or the abbreviated
    description line (for SWISS-PROT and TrEMBL entries).
    Example: HS1-associated protein X-1

11. Synonym
    Gene_symbol [or other text]
    Here: International Protein Index identifier (see section 5).
    Example: IPI00010440

12. DB_Object_Type
    What kind of entity is being annotated.
    Here: always 'protein'.
    Example: protein

13. Taxon_ID
    Identifier for the species being annotated.
    Here: always 'taxon:9606' for human proteins.
    Example: taxon:9606

14. Date
    The date of the last annotation update, in the format 'YYYYMMDD'.
    Example: 20030228

15. Assigned_By
    Attribute describing the source of the annotation.
    One of SPTR, MGI, SGD or FB.

iii) xrefs.goa

   Location:
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/xrefs.goa

In addition to the principal file with mappings of
SWISS-PROT/TrEMBL/Ensembl to GO, a file has been prepared describing
the relationship between the entries in this set and other databases,
such as the EMBL/Genbank/DDBJ nucleotide sequence databases, HUGO, and
LocusLink and RefSeq at the NCBI.

This file is tab-delimited (multiple entries in individual fields are
separated by commas). The fields are as follows:

1.  Database from which the annotated entry has been taken.
    One of SP (SWISS-PROT), TR (TrEMBL) or ENSEMBL (Ensembl).
2.  SWISS-PROT or TrEMBL accession number or Ensembl ID.
3.  International Protein Index identifier (see section 5).
4.  Supplementary SWISS-PROT/TrEMBL entries associated with this IPI
    entry.
5.  Supplementary Ensembl entries associated with this IPI entry.
6.  RefSeq NP sequences associated with this IPI entry.
7.  RefSeq XP sequences associated with this IPI entry.
8.  Protein identifiers (cross reference to EMBL/Genbank/DDBJ
    nucleotide databases).
9.  HUGO HGNC number, HUGO official gene symbol.
10. NCBI LocusLink loci number, LocusLink Provisional Gene Symbol
    (replaced by HUGO Official gene symbol where available).


4. The SWISS-PROT/TrEMBL/Ensembl non-redundant human proteome set
------------------------------------------------------------------

SWISS-PROT is a high quality, manually curated database of protein
sequences. Its automatically produced supplement, TrEMBL, contains all
protein sequences that can be predicted from the EMBL nucleotide
sequence database and that have not yet been annotated to SWISS-PROT
standards. Together, SWISS-PROT and TrEMBL (SPTR) offer a complete
picture of the protein world. It is a priority of the SWISS-PROT team
to merge redundant entries: SPTR is completely non-redundant at the
sequence level.
However, when alternative sequences are submitted for the same gene,
there is a delay before such entries are identified and merged.

The aim of the Ensembl group is to develop a software system which
produces and maintains automatic annotation of eukaryotic genomes.
There may be protein coding sequences that have not yet been annotated
in the nucleotide sequence databases, but whose existence has
nonetheless been predicted by Ensembl.

The SWISS-PROT/TrEMBL/Ensembl non-redundant human proteome set is
obtained by combining

(i)   all SWISS-PROT entries for Homo sapiens
(ii)  all TrEMBL entries that have been mapped to a given gene locus
      (up to a maximum of 1 TrEMBL entry per locus)
(iii) all additional TrEMBL entries that are non-redundant (at a 95%
      similarity threshold) with any entries already included in the
      set and
(iv)  all additional Ensembl entries that are similarly non-redundant
      with any of the entries that are already included.

The set is updated weekly following each release of SPTR. It is this
set of non-redundant protein sequences that will be annotated with the
vocabulary defined by the GO consortium.

For more information see http://www.ebi.ac.uk/proteome/CPhelp.html.

The set can be downloaded (in FASTA format) via
http://www.ebi.ac.uk/proteome/HUMAN/download.html
or taken directly from the EBI ftp site at
ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/fasta_files/proteomes/9606.SPEns.FASTAC


5. IPI
-------

The IPI (International Protein Index) provides a top-level overview of
the main databases that describe the human proteome: SWISS-PROT,
TrEMBL, Ensembl, and NCBI's RefSeq databases. IPI assigns stable
identifiers to clusters of matching proteins from its contributing
databases. IPI is updated monthly. For more details see
http://www.ebi.ac.uk/IPI/IPIhelp.html

The SWISS-PROT/TrEMBL/Ensembl proteome set contains 1 protein for each
IPI entry associated with a SWISS-PROT, TrEMBL or Ensembl entry.
Other sequences from these databases and RefSeq that are also
associated with these IPI entries are listed in the sptrXrefs file.


6. Ancillary mappings
----------------------

Mappings between SPTR and EMBL/Genbank/DDBJ are derived from the cross
references to these databases found in SWISS-PROT and TrEMBL entries.

Mappings between SPTR and HUGO, LocusLink and RefSeq are derived from
various publicly available sources of information that allow the
electronic tracking of identifiers between databases. Contentious or
contradictory data are referred to a curator for judgement.


7. Assignment of GO terms to SWISS-PROT/TrEMBL/Ensembl data
------------------------------------------------------------

In this release, we use four data sources to assign GO terms to
proteins.

A) PUBMED:nnnnnnnn
   Curators have read the abstract or full paper with the PubMed
   identifier nnnnnnnn and assigned the GO terms manually. Where a
   journal is not indexed by PubMed, an internal identifier is provided
   instead, e.g. PBTnnnnnnnn. Please contact goa@ebi.ac.uk for details.

B) GOA:interpro
   Transitive assignment using InterPro matches. In detail, the protein
   in question has one or more InterPro matches. The InterPro domain or
   family is assigned to the corresponding GO term using the
   interpro2go file.

C) GOA:spkw
   Transitive assignment using SWISS-PROT keywords. In detail, the
   protein in question is from SWISS-PROT or TrEMBL and has a
   SWISS-PROT keyword. This keyword is assigned to the corresponding GO
   term using the spkw2go file.

D) GOA:spec
   Transitive assignment using enzyme codes. In detail, the protein in
   question is from SWISS-PROT or TrEMBL and has an Enzyme Commission
   (EC) number in its description line. This EC number is then assigned
   to the corresponding GO term using the EC cross-references in the GO
   ontology files.

The files interpro2go and spkw2go are found at
http://www.geneontology.org/index.html#classification.


8. Addition of GO assignments from other data sources
------------------------------------------------------

GO terms have already been assigned to a certain proportion of the
human proteome by Proteome Inc. These data have been incorporated into
the classifications described in the gene_association.goa files.


9. Contacts
-----------

Please direct any questions to goa@ebi.ac.uk
We welcome any feedback.


10. Copyright Notice
--------------------

GOA - GO Annotation@EBI
Copyright 2002 (C) The European Bioinformatics Institute.

This README and the accompanying databases may be copied and
redistributed freely, without advance permission, provided that this
copyright statement is reproduced with each copy.

$Date: 2003/03/03 16:51:52 $
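
As an illustration of the 15-field gene_association layout listed in
section 3, the following is a minimal Python sketch of a reader. It is
not part of the GOA distribution. It assumes that fields are separated
by tabs and that lines starting with '!' are comments, as in the GO
annotation file format; the function name and the Python field names
are hypothetical. The sample row is stitched together from the
per-field examples in section 3 and is not a real annotation record.

```python
import io
from collections import namedtuple

# Python names for the 15 columns of section 3 (names are illustrative).
FIELDS = [
    "db", "db_object_id", "db_object_symbol", "not_flag", "go_id",
    "reference", "evidence", "with_", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon_id", "date", "assigned_by",
]
Association = namedtuple("Association", FIELDS)


def parse_associations(handle):
    """Yield one Association per data line of a gene_association file."""
    for number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")  # keep trailing tabs (empty columns)
        if not line or line.startswith("!"):
            continue  # assumption: '!' marks a comment line
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(
                "line %d: expected %d columns, found %d"
                % (number, len(FIELDS), len(columns))
            )
        yield Association(*columns)


# Illustrative row only: each value is one of the examples in section 3.
sample_row = "\t".join([
    "SPTR", "O00165", "HAX1_HUMAN", "", "GO:0005625", "PUBMED:9058808",
    "TAS", "SPTR:O00341", "P", "HS1-associated protein X-1",
    "IPI00010440", "protein", "taxon:9606", "20030228", "SPTR",
])
sample = "!illustrative comment line\n" + sample_row + "\n"

records = list(parse_associations(io.StringIO(sample)))
print(records[0].db_object_id, records[0].go_id, records[0].evidence)
```

To read a downloaded file rather than the in-memory sample, the same
function can be given a handle from gzip.open(path, "rt"), where path
is the local copy of one of the .gz files named in section 3.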