Gene Ontology annotation and the human genome.
==============================================

A discussion at the Banbury Conference Center, Cold Spring Harbor Laboratory, December 10 - 12, 2000.

Background and Objectives.
==========================

The Gene Ontology database (www.geneontology.org) is now being used by several model organism databases for the annotation of genes or their products. These include not only the three founding databases (FlyBase, Mouse Genome Informatics, Saccharomyces Genome Database) but also those that have subsequently joined the Consortium (WormBase, The Arabidopsis Information Resource) and groups in both academia (e.g. TIGR, Swiss-Prot, InterPro) and industry (e.g. AstraZeneca, Celera Genomics, Proteome Inc.). Many other academic groups and companies have expressed an interest in using, or are already using, GO within their domains.

This is the year that sequencing of the human genome will be finished. A corollary to this is that a first approximation of the human proteome will be completed as well. Every group working on this effort, and those who hope to mine these data, would like to have a rich and consistent set of descriptors with which to describe the tentative functional assignments. This is the one area where there is full agreement between the separate groups.

This meeting was called to discuss how the GO Consortium, other databases, academic interests and industry can work together to ensure that GO can be effectively used to annotate human genes and to ensure that there is consistency of annotation between different groups. On this last point an anomalous situation arises that had not previously been an issue with any of the model organisms: the paradox is that there is no obvious single competent "authority" to implement GO for the human genome, in part because there is no analog of an accepted community database for human genetics, let alone biology.
It is against this background that Richard Mural and Michael Ashburner, spurred by Ken Fasman, began an email exchange earlier this year to discuss how the GO Consortium could help; this led to an al fresco meeting at ISMB in San Diego of Michael Ashburner, Suzanna Lewis, Jim Kadin, Mike Cherry (from GO) and Jennifer Wortman and Paul Thomas (from Celera Genomics) to discuss how we should proceed. Thanks to the hospitality of Dr Jan Witkowski, Director of the Banbury Centre of Cold Spring Harbor Labs, we were able to meet for one day and a half in the delightful setting of Banbury to further the discussions that had begun by email and at ISMB. Simply for logistic reasons the meeting was by invitation only and was limited to twenty participants.

Agenda.
=======

* Introduction to GO: what it can (or should be able to) do and what it cannot.
* How the GO Consortium works & its immediate plans.
* Presentations by other invitees as to their interest in using GO for human genome annotation.
* The limitations of GO at present for this task.
* How we can collaborate to improve this state of affairs.
* Consistency of annotation of human and other mammals, especially mouse and, in a somewhat longer term, rat.
* Consistency of annotation between different groups.
* How the annotation of human genes with GO can be made public.

The following is not a minute of the meeting, but an account of the main discussions and conclusions. It is followed, in an appendix, by a formal proposal from the Ensembl/EBI teams. This document was not presented to the meeting, although its content was.

Presentations from the GO Consortium.
=====================================

Each site gave an introduction to their work. On behalf of the GO Consortium Michael Ashburner and Judith Blake gave a history of the Consortium, and discussed its objectives and what had been achieved so far.
Suzanna Lewis discussed the development of software, in particular the status of the mySQL implementation of GO and of the new tools for annotation and editing being developed in her group at LBL (http://www.godatabase.org/dev/) (see ftp://genome-ftp.stanford.edu/pub/go/minutes/minutes.berkeley.20001106 for a very recent account of the progress of GO).

One recent development has been the publication on the GO site of tables showing the (approximate) correspondences between GO terms and other vocabularies. Those now public are for EC numbers, Swiss-Prot keywords, Riley's functional catalog (GenProtEC), TIGR roles and TIGR EGAD terms. We hope to add PIR keywords and, if suitable permission can be obtained, the MIPS functional catalog, in the future.

At the time of writing the master copies of the GO ontologies are the flat files. These are maintained by the GO editors at Stanford, Jackson and Cambridge, using CVS. Other groups (e.g. TAIR, WormBase) wanting changes or additions work through one or other of these sites. Other versions of the ontologies (e.g. in mySQL, in XML) are produced sporadically from the flat files and there have been problems of synchrony that have disturbed users. GO is addressing these problems in the short term.

In the medium term (e.g. first half of 2001) there will be two major changes. One is that the master will be in mySQL; other versions (flat files, XML) will be automatically derived from the database. This will solve many of the synchrony problems and also the syntax errors that have plagued users (although in the last two weeks a new syntax checker for the flat files has been introduced and has eased this problem). The other change is that a full-time GO Editor will be appointed, working at the EBI, and that all changes to GO will be channeled through that office.

Finally, the GO Consortium was pleased to announce that their application for funding for three years (from 1 Dec 2000) from the NHGRI has been approved.
This will allow a dramatic increase in the resources available to the Consortium, on both the software and biology sides.

Presentations from other public domain groups.
==============================================

Rolf Apweiler discussed the rules for the automatic annotation of protein sequences that are being developed within the TrEMBL group at the EBI. A major achievement in the last couple of years has been the development of InterPro, an integrated protein domain resource, and of InterProScan, tools to annotate proteins with InterPro groups. About 60% of TrEMBL records are now annotated using InterPro (about 3,300 different InterPro groups). Work is also proceeding to develop rule based functional annotation methods (e.g. to be annotated as a trypsin a protein must have not only the appropriate Pfam domain but also the catalytic triad present in all active trypsins). Mapping of InterPro groups to GO is being done manually (1,200/3,300 groups are mapped so far). Mapping of Swiss-Prot to GO is being done using both David Hill's table (see ftp://genome-ftp.stanford.edu/pub/go/external2go/swp2go) and manual curation. About 50% of Swiss-Prot data are now mapped to GO terms.

Michele Clamp, on behalf of the EBI/Sanger Centre Ensembl team, explained the goals of the Ensembl project (www.ensembl.org) which is annotating the Santa Cruz reference human genome sequence. About 16,000 human proteins, aligned to this reference sequence, can automatically inherit GO terms from Swiss-Prot/TrEMBL; others, which may only have support from ESTs, can be annotated with GO terms by using the GO - InterPro mapping. Michele also outlined the ideas of the Ensembl/EBI teams for annotating human genes with GO (see appendix).

At the NCBI, Donna Maglott explained, the annotation of human genes contained within RefSeq/LocusLink with GO terms will be done initially by the import of data on about 10,000 genes from Proteome Inc (see below).
In a collaboration with the Saccharomyces Database Group in Stanford, the annotation of yeast genes in LocusLink now includes GO terms, imported through an ASN.1 data dump. The longer term curation of genes with GO terms at the NCBI is an open issue; the NCBI now have the staff for the curatorial review of sequences, including the published literature; the OMIM group are revising their clinical synopsis terms. The NCBI are working with Greg Schuler's assembly of the public sequence data, rather than that from Santa Cruz (see below).

The major function of the HUGO Gene Nomenclature Committee (HGNC) in London is, as Sue Povey said, to establish unique symbols for human genes. These symbols and names should be stable, memorable and meaningful; so far, some 12,000 have been assigned of which 10,000 are associated with sequence data. This has involved close collaboration with the Mouse database (MGD) and more recently Ensembl. This group has the data structures in place to include GO terms and has a serious interest in seeing that a consistent method, such as GO, is used to describe attributes of human genes. The major constraint of this group is that of resources.

Lisa Brooks, from NHGRI, asked the question in the back of everyone's mind: Why is there no HumanBase? Ken Fasman gave some personal insight into the problems there have been in funding databases such as GDB but Lisa said that the NHGRI are now giving this matter some serious thought. Michael Ashburner said that he would hope that if such an initiative were launched it would be international and involve other potential funders and agencies.

Presentations from companies.
=============================

For Proteome Inc. Kevin Roberg-Perez said that for YPD and WPD they had been using a functional categorisation developed some years ago within Proteome. Their method of working was to annotate gene products directly from the literature.
Proteome is currently using GO terms to annotate human proteins and is committed to using GO for all species, but the roll out for other species is yet to be determined. So far some 8,100 human proteins had been annotated with GO terms with an average of 4.5 terms/protein (maximum was 45). Proteome use their own evidence codes for the annotation, rather than those used by GO. An agreement has been reached between Proteome and the NCBI for the former to provide the annotations (including GO terms) on 10,000 human proteins to the latter for incorporation into LocusLink and RefSeq. Whether or not this collaboration will be ongoing is not yet known.

At Celera Genomics, Richard Mural stated that they wish to use GO as part of their annotation of the complete human genome sequence. So far this is being done computationally, using software derived from that first developed by Mark Yandel of Celera for the annotation of the Drosophila genome. The intention is to use this system (essentially simple cutoffs for BLAST expectation values against a set of manually curated sequences) as a first pass and then use human experts.

At Celera West Paul Thomas has been grappling with the problem of how to automatically use sequence similarities to make functional assignments; the basic paradigm has been described (ISMB'98). It is:
(i) Build a phylogenetic-type tree of sequences with known functions.
(ii) Assign GO functions to each branch of the tree.
(iii) Take unknown sequences and see which branch of the tree they lie within.
Paul argues that this software (PANTHER) will allow automatic assignment when the protein families based on an evolutionary analysis of sequence similarity have been mapped to GO.

Darryl Gietzen described the Protein Function Hierarchy developed at Incyte; there are three hierarchical lists of terms (618 in total).
The A list is from the EC hierarchy, the B list is the Molecular Function Hierarchy (B7 is cell location) and the C list is the Biological Function Hierarchy. Genes (i.e. EST clusters) are assigned terms by using keywords derived from the top BLAST hits; by this method 94% of Swiss-Prot hits can be assigned one or more PFH terms. Incyte plans to include GO annotation in 2001.

DoubleTwist are not shipping any products with GO; it is only being used in a research context, said Andrew Kasarskis. They are borrowing mappings from the public domain and making de novo assignments of GO terms, either in house or collaboratively with their partners. Andrew gave three reasons for not shipping products that include GO terms now:
- no human <> GO mapping
- GO needs modification to allow full use with human
- problem of the irregularity of GO updates.

For AstraZeneca, a company that has been generous in its financial help to the GO Consortium, Ken Fasman said that there was increasing concern with different ontologies being used by different data providers and that a policy decision had been made to insist on the use of GO for any product that they would purchase after a 24 month notice period. AstraZeneca intends to seek support for this policy more broadly in the pharmaceutical industry. Ken was also concerned that there might be a number of different and non-collaborating efforts to use GO for human genes, and, even worse, modifications of the GO ontologies independent of the GO Consortium.

Outcomes.
=========

The objectives of the GO Consortium are simple: to collaborate with others (preferably one other) to develop the GO ontologies so that they can be most effectively used for the annotation of human genes and to receive from the collaborating group(s) a table of assignments of GO terms to human genes that will allow the human genome to be searched along with the genomes of others by GO terms.
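
As an illustration only, here is a minimal sketch of the kind of cross-genome query that such a table of assignments would allow. The term identifiers, gene names and both small tables are hypothetical placeholders, not real GO data, and the function names are assumptions made for this sketch. The one point it tries to show is that a query on a term should also return genes annotated to its more specific child terms.

```python
from collections import defaultdict

# Hypothetical child -> parents relationships (placeholder IDs, not real GO terms).
PARENTS = {
    "T:0002": ["T:0001"],
    "T:0003": ["T:0002"],
    "T:0004": ["T:0001"],
}

# Hypothetical table of assignments: (species, gene, term).
ASSIGNMENTS = [
    ("human", "geneA", "T:0003"),
    ("mouse", "geneB", "T:0002"),
    ("fly", "geneC", "T:0004"),
    ("yeast", "geneD", "T:0001"),
]


def descendants(term, parents=PARENTS):
    """Return every term that lies below `term` in the hierarchy."""
    # Invert the child -> parents table into parent -> children.
    children = defaultdict(set)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(child)
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def genes_for_term(term, assignments=ASSIGNMENTS):
    """Group, by species, the genes assigned to `term` or to any of its descendants."""
    wanted = {term} | descendants(term)
    hits = defaultdict(list)
    for species, gene, assigned in assignments:
        if assigned in wanted:
            hits[species].append(gene)
    return dict(hits)
```

With these placeholder tables, a call such as genes_for_term("T:0002") collects the mouse gene annotated directly to that term together with the human gene annotated to its child term, which is the behaviour a shared vocabulary is meant to make possible across genomes.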
The GO Consortium recognises that there are pre-conditions necessary for these objectives to be achieved:

A mechanism must exist for those using GO, but outwith the Consortium, to suggest changes (additions, corrections etc) to the GO ontologies. For human genes this will be through David Hill of the Jackson Labs. GO expects those who propose new terms to define these terms (see GO documentation) at the time of request. The companies now using GO for human genes in products (i.e. Celera Genomics and Proteome) both said that they will now begin to feed new terms to GO and to suggest changes required for the annotation of human genes.

The Consortium must increase the rigour of their syntactical checks on GO data and the synchrony of release of the same data in different forms. This will require a single validation script for each class of file to be run whenever data is committed. It may be necessary (as was suggested by Andrew) to go to a regular (e.g. monthly) public release at pre-determined times and dates.

We need a mechanism to ensure much better user feedback. One suggestion is to run an open User's Meeting once (or more) a year. This will not be by invitation but will require pre-registration so as to avoid a logistic catastrophe. We expect that one of these meetings will be at the time of one of the regular Consortium meetings and that maybe another could occur at the time of, e.g., ISMB or a similar meeting.

GO is in the public domain (there was some discussion as to whether protection under, e.g. a GPL, is desirable). There is an implicit contract between the GO Consortium and commercial users of GO - the commercial users get the information for free, but they have an obligation to give the Consortium useful feedback. There can, of course, be problems with public feedback to GO from commercial companies.
GO should establish mechanisms other than the public mailing list to allow people to comment on GO, both in general and in detail, in a manner that is private (although, of course, any resulting changes to GO would be public).

It is in the long term interests of the commercial users of genomic data for there to be stability and uniformity in annotation. For the consumers of data their interests are that they can use the same analytical methods on data coming from the public and commercial domains or from two or more different commercial concerns. For the providers of data their interests are not to have to spend resources re-inventing the wheel and to be able to easily QC their data by comparison with public data or data from other commercial sources.

There are two major public groups annotating the "complete" human genome sequence, the Ensembl group at Hinxton and the NCBI group. At the moment they are using different assemblies of the sequence (Santa Cruz and Schuler, respectively) but there is an agreement in principle, at least, for the two groups to share a common name space - the International Gene Index. There are clearly a number of name space/identifier issues that will make everyone's job harder - at least in the short term - but these were well beyond the remit of this group. To some extent, for the purposes of GO, there is already a common name space between the EBI and NCBI for gene products - the protein_id's of the GenBank/EMBL-Bank/DDBJ records. A maintained table of correspondence(s) between protein_id's and other name spaces (Swiss-Prot, HUGO, LocusLink, Ensembl etc) might be a good idea, at least until the IGI reaches full term.

The NCBI will be importing GO annotation for about 10,000 "known" human genes from Proteome Inc into both RefSeq and LocusLink. The NCBI will provide a general methodological statement as to how the particular gene product to GO assignments were made. All of these assignments will be attributed to Proteome Inc.
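
The maintained table of correspondences between protein_id's and other name spaces, suggested earlier in this section, could be as simple as the following sketch. Every identifier shown is an invented placeholder, and the column names and function names are assumptions made purely for illustration; no existing file format is implied. Because one gene symbol may correspond to more than one protein product, the lookup returns a list rather than a single protein_id.

```python
from collections import defaultdict

# Hypothetical rows: one protein_id per row, with its names in other name spaces.
# All identifiers are invented placeholders.
CORRESPONDENCES = [
    {"protein_id": "PID0001.1", "Swiss-Prot": "SP0001", "HUGO": "GENE1", "LocusLink": "101"},
    {"protein_id": "PID0002.1", "Swiss-Prot": "SP0002", "HUGO": "GENE1", "Ensembl": "ENS0002"},
]


def build_index(rows):
    """Map each (name space, identifier) pair to the set of protein_ids it refers to."""
    index = defaultdict(set)
    for row in rows:
        for namespace, identifier in row.items():
            if namespace != "protein_id":
                index[(namespace, identifier)].add(row["protein_id"])
    return index


def to_protein_ids(index, namespace, identifier):
    """Return the sorted protein_ids for an identifier, or an empty list if unknown."""
    return sorted(index.get((namespace, identifier), set()))
```

A group holding annotations keyed by, say, gene symbol could then translate them to protein_id's with to_protein_ids(build_index(CORRESPONDENCES), "HUGO", "GENE1") before comparing them with annotations from another source.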
The Ensembl team will work very closely with others at the EBI and with the HUGO Gene Nomenclature Committee in London, to establish a central, open repository, called GOAH, to track assignments of GO terms to human gene products which can be used by other databases worldwide (see Appendix).

Some action items.

* Kevin Roberg-Perez - send in proposal for partitioning enzymes (the issue here being that the children of "enzyme" in $molecular_function are very 'flat').
* Richard Mural - send in problems with mouse gene associations.
* GO - to arrange user's meeting at next meeting (hosted by TAIR).
* GO - establish email methods to allow companies to comment on GO privately to the GO consortium.
* Suzanna Lewis - immediate validation of gene associations on CVS commit.
* GO - consider regular dated updates, rather than updates on edit as now is the case.

Thankyous.
==========

We thank Mrs Beatrice Toliver for her help and courtesy at the Banbury Centre and Dr. Jan Witkowski for allowing us to use the Conference Centre for this meeting.

Attendees & their affiliations.
===============================

Rolf Apweiler (European Bioinformatics Institute - Swiss-Prot; InterPro)
Michael Ashburner (European Bioinformatics Institute - GO Consortium - FlyBase)
Judith Blake (Jackson Laboratory - Mouse Genome Database - GO Consortium)
Mike Cherry (Department of Genetics, Stanford - GO Consortium - SaccDB)
Jannan Eppig (Jackson Laboratory - Mouse Genome Database - GO Consortium)
David Hill (Jackson Laboratory - Gene Expression Database - GO Consortium)
Suzanna Lewis (BDGP, Berkeley - GO Consortium)
Martin Ringwald (Jackson Laboratory - GO Consortium - Gene Expression Database)
Michele Clamp (Sanger Centre - Ensembl)
Donna Maglott (NCBI - LocusLink - RefSeq)
Lincoln Stein (Cold Spring Harbor Lab. - WormBase - DAS)
Sue Povey (University College London - HUGO Nomenclature Committee)
Lisa Brooks (NHGRI)
Kevin Roberg-Perez (Proteome Inc)
Darryl Gietzen (Incyte)
Ken Fasman (AstraZeneca, Boston)
Andrew Kasarskis (DoubleTwist)
Richard Mural (Celera Genomics, East)
Paul Thomas (Celera Genomics, West)
Jennifer Wortman (Celera Genomics, East)

Appendix - The Ensembl/EBI GOAH proposal.
=========================================

GOAH - GO Annotation of Human.

The EBI proposes to provide a central, open database tracking assignments of human gene products to the Gene Ontology (GO) resource.

The Gene Ontology project (www.geneontology.org) provides a framework to assign functional information to gene products. GO was founded by three model organism databases (FlyBase, SGD and MGI) and has expanded to 5 databases, taking in WormBase and TAIR. GO has proved to be very successful in these databases, capturing functional information in a way which can be queried across species databases and providing a consistent framework for aspects such as evidence tracking.

The effective use of the human genome will require some aspect of functional tracking of gene products. The proven GO experience, in particular in Mouse, indicates that GO will work well for this task. Unlike other organisms, there is no clear central database for human genome resources, and it is likely that this will remain the case.

We propose therefore to provide a central, open repository, called GOAH, to track assignments of GO terms to human gene products which can be used by other databases worldwide. This repository would be manned with two or three editors providing overall curation and quality control. These editors would be the point of contact for individual researchers wishing to contact GOAH.
For large scale projects with a proven track record of functional assignment, such as HUGO, Proteome Inc, SWISS-PROT and MIM, direct editing of the GOAH database will be allowed, with the editors providing conflict resolution and ensuring the general consistency of the project.

The human gene products would be tracked via the internationally agreed protein identifiers (protein_id), an established identifier system for proteins shared by the International Collaboration of DNA databases. All information stored in GOAH would be placed in the public domain without restriction.

The EBI provides an ideal location for the GOAH resource, with synergies with the Ensembl team (genome annotation) and the SWISS-PROT team (protein functional assignment). In addition the EBI has strong links to the main players in this field, such as the NCBI, Proteome Inc, Celera and CSHL.

The necessary resources for GOAH have already been found at the EBI and committed to furthering functional assignment in human, either directly in this GOAH project or in some collaboration with other interested parties.