Minutes of the GO Project Meeting

The GO Meeting was held 17-19 May 1999 at the Banbury campus of CSH.

In attendance were:
  FlyBase (Michael Ashburner, Suzi Lewis)
  SGD (Mike Cherry, Midori Harris)
  MGI (Janan Eppig, Judy Blake, Joel Richardson, Martin Ringwald, Allan Davis)

A Summary of the Meeting:

The outlined agenda for the meeting was:
  1. semantics
  2. species specificity
  3. content
  4. implementation
  5. software
  6. resources

1. The GO Project is recognized as a shared, pragmatic database resource involving three separate ontologies (Gene Function, Process, Cellular Component) that represent independent structured sets of terms for performing biological queries across the genomic databases of different species. It is not a definitive phylogenetic classification system of biology. The current GO Project is composed of three Organism Databases: FlyBase (Drosophila), SGD (yeast), and MGD (mouse). It is hoped that additional organism databases may subsequently join. Each Organism Database will annotate its genes to the three ontological categories and deposit these results in a Universal GO browser (currently being constructed by Suzi). Concurrently, each Organism Database will use its annotations in whatever way it sees fit at its own web site. The three database (DB) groups are anxious to start implementing the project. It is felt that once people actually start 'getting their hands dirty' by annotating and testing out the system, we will recognize potential pitfalls and successes and will see areas of the ontologies that need to be developed further.

2. The system will be structured as a Master GO List with assigned GO# identifiers. This Master List will be made available to all DBs. Any curator can add to or modify this list (see below). All additions or modifications will pass through a Central Manager (a biology-trained individual) who will then update the Master List accordingly.
The Master List is anticipated to go through a period of flux during the initial phases and whenever a new organism joins the group. It will be the responsibility of each DB to check the Master List regularly and to read all communications between DBs so that everyone is 'on the same page'.

3. It was recognized that this project is a WORK IN PROGRESS: it is a dynamic system in which things will be added and changed as curators see problems and concerns. With subsequent biological knowledge, a re-organization of the hierarchies may even become necessary. Provision for this is made as follows: each DB group will initially be assigned a block of 'free' GO identification numbers (GO#) which can be used to add new terms to, or modify terms in, the structured GO list. (New terms should use unique words/phrases.) Any such modification will be simultaneously submitted to a Central Manager and to a representative of each DB via email/XML, working from the current GO List. We anticipate that these initial modifications will be at lower levels of the hierarchy and will not include any major re-organization that could directly and immediately affect the annotations of other DBs. If a modification would have such an effect, the curator is responsible for FIRST contacting the Central Manager for inquiry, discussion, and approval PRIOR to initiating or submitting any changes to the GO list. Since a Central Manager has yet to be hired, it is prudent for curators not to initiate any major renovations to the GO List for quite some time.

4. Common Sense & Individual Responsibility must rule here: each group is responsible for reading the emails submitted by each DB; each curator should check the modified GO list frequently; and a curator who anticipates spending a long time modifying some structure of the GO List should alert all other members and then work as quickly as possible to complete the changes.
Effective communication is key to the success of this project.

5. Each DB will annotate its genes/gene products to any 'depth' it sees fit. This should include annotating genes to complexes and not to subunits of those complexes, since cross-species homology will not necessarily hold at such detailed levels. Once the practice is put in motion, curators will develop a better feel for how 'deep' they should annotate, based upon the work of curators from other DBs (see below).

6. Currently, the GO List records some species-specific GO terms. With subsequent biological knowledge, it may be found that these are not species-specific but shared among different organisms, thus requiring a name change. Species-specific terms will be updated only when it is shown that the term is not species-specific. If a species does not have a gene annotated to a specific GO term, then either no such gene is present in the species or one has not been found yet.

7. Structures will inevitably have to change with the discovery of new genes that do not follow the currently established hierarchies. Curators are to be aware of the "True Path (Judy's) Rule": the pathway up the hierarchy must always be true. If a new gene is found to break this rule, or species-specificity becomes a problem, the hierarchy should be restructured by adding more nodes and connecting terms so as to create a new path that keeps the upward hierarchy true. When a term is added to the Master GO List, the curator needs to add all of the parents and children of the new term.

A suggestion was made to simplify the hierarchy: we might consider throwing out the "part of" GO terms and instead use only the "is a" GO terms. After discussion, it was realized that too much information would be lost by eliminating those terms. The GO List will maintain both "is a" and "part of" terms.
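The True Path Rule can be pictured as a check over the arcs of the hierarchy. The following is only a minimal sketch, not a tool agreed at the meeting: the dictionary layout, the function names, and the species restrictions attached to 'cuticle synthesis' and 'cell wall organization' are all assumptions made for illustration. It walks every "is a" and "part of" arc upward from a term and reports any ancestor that would not hold for the annotating species.

```python
# Hypothetical layout: each term maps to a list of (relation, parent) arcs.
BEFORE = {
    "chitin metabolism": [
        ("part of", "cuticle synthesis"),
        ("part of", "cell wall organization"),
    ],
}

# The same area after adding extra nodes so that every upward path is true.
AFTER = {
    "cuticle chitin metabolism": [
        ("is a", "chitin metabolism"),
        ("part of", "cuticle synthesis"),
    ],
    "cell wall chitin metabolism": [
        ("is a", "chitin metabolism"),
        ("part of", "cell wall organization"),
    ],
}

# Assumption for this sketch: these two terms apply to one species only.
RESTRICTED_TO = {
    "cuticle synthesis": {"fly"},
    "cell wall organization": {"yeast"},
}


def ancestors(term, parents):
    """Return every term reachable by following arcs upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def true_path_violations(term, species, parents, restricted_to):
    """List ancestors of `term` that are not true for `species`."""
    return sorted(
        ancestor
        for ancestor in ancestors(term, parents)
        if ancestor in restricted_to and species not in restricted_to[ancestor]
    )


# A yeast gene annotated to the unsplit term breaks the rule ...
print(true_path_violations("chitin metabolism", "yeast", BEFORE, RESTRICTED_TO))
# ... but not once it is annotated to the more specific daughter term.
print(true_path_violations("cell wall chitin metabolism", "yeast", AFTER, RESTRICTED_TO))
```

Keeping the relation label on each arc, even though this check ignores it, reflects the decision to retain both "is a" and "part of" terms.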
The example used to work through this issue was the Process Ontology for chitin. Chitin metabolism is a part of cuticle synthesis in the fly, and part of cell wall organization in yeast. As a result of the above discussion, the parent 'chitin metabolism' will now have daughters 'cuticle chitin metabolism' and 'cell wall chitin metabolism', with the appropriate catabolism and synthesis terms underneath them.

  chitin metabolism
    chitin biosynthesis
    chitin catabolism
    cuticle chitin metabolism
      cuticle chitin biosynthesis
      cuticle chitin catabolism
    cell wall chitin metabolism
      cell wall chitin biosynthesis
      cell wall chitin catabolism

The procedure to add terms is made particularly difficult because the Process Ontology is a DAG and, given the current state of knowledge, it is volatile. We need an automated procedure to add all the arcs when it is necessary to expand the structure. A tool is needed that will:
  1. not allow bad paths (solution: add extra nodes)
  2. let curators see all paths
  3. automate a split, given a curator's decision

8. A CVS file system will be used to process changes made to the GO list. This will automatically record the curator's name and the date, and the curator must provide a succinct reason for making the change. The CVS client/server will be set up by Mike Cherry: accounts will be set up for each individual curator for access and modifications. Additional curators may join by submitting a user name to Mike.

9. An idea was proposed that the GO list hierarchy show, next to each GO term, the number of genes from each organism DB listed underneath that particular GO term. For example:

  %DNA metabolism (M-3, D-9, Y-5)
  %DNA replication (M-1, D-4, Y-3)
  %DNA dependent DNA replication (M-2, D-5, Y-2)

This means that there are 3 mouse (M) genes, 9 Drosophila (D) genes, and 5 yeast (Y) genes under the entire category of %DNA metabolism, and that these can subsequently be broken down further by each subordinate GO term for finer resolution.
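One way such per-organism counts might be produced is sketched below. This is an illustration under stated assumptions, not the agreed design: the nesting of the three terms, the gene names, and the function names are hypothetical, and the numbers it prints are not the figures quoted above. Because the ontology is a DAG, a gene can be reached by more than one path, so the sketch counts distinct genes rather than summing the counts of child terms.

```python
from collections import defaultdict

ORGANISMS = ("M", "D", "Y")  # mouse, Drosophila, yeast

# Hypothetical nesting of the three terms (parent -> children).
CHILDREN = {
    "DNA metabolism": ["DNA replication"],
    "DNA replication": ["DNA dependent DNA replication"],
}

# Hypothetical direct annotations: term -> list of (organism, gene).
ANNOTATIONS = {
    "DNA metabolism": [("M", "geneA")],
    "DNA replication": [("M", "geneB"), ("D", "geneC")],
    "DNA dependent DNA replication": [
        ("M", "geneB"), ("M", "geneD"), ("D", "geneE"), ("Y", "geneF"),
    ],
}


def terms_under(term, children):
    """Return `term` plus every term below it; each term is visited once."""
    seen = {term}
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def gene_counts(term, children, annotations):
    """Count distinct genes per organism annotated at or below `term`."""
    genes = defaultdict(set)
    for covered in terms_under(term, children):
        for organism, gene in annotations.get(covered, []):
            genes[organism].add(gene)
    return {organism: len(genes[organism]) for organism in ORGANISMS}


def label(term, children, annotations):
    """Format a term in the '%term (M-n, D-n, Y-n)' style of the example."""
    counts = gene_counts(term, children, annotations)
    summary = ", ".join(f"{org}-{counts[org]}" for org in ORGANISMS)
    return f"%{term} ({summary})"


for name in ("DNA metabolism", "DNA replication", "DNA dependent DNA replication"):
    print(label(name, CHILDREN, ANNOTATIONS))
```

In this sketch geneB is annotated at two levels but is counted once under %DNA metabolism, which is the reason for collecting genes in sets.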
These counts should help curators understand the level of 'depth' to which other curators are annotating.

10. EC#s should be kept in the GO term lists because they give curators a searchable key during annotation.

11. The following ideas and agenda items were proposed as goals within a 1-2 week (?) time frame from the end of this GO Meeting:
  a) Stanford curators stop GO and send an updated file to Michael Ashburner.
  b) Michael Ashburner parses v. 0.2a7 ---> v. 0.9
  c) v. 0.9 ---> Suzi for syntax check ---> assign unique GO# to all terms (v. 1.0); and parse into XML
  d) CVS established/tested by Mike Cherry; as well, Mike will try to register the web site www.gene.org if still available; if not, other suggested names: www.genestogo.org or some combination of the words GO and gene (GOgeneGO; GOgene; geneGO; etc.)
  e) all GO again.

Independently, Suzi and Joel will determine the necessary XML syntax and processes.

Between June 1999 and the next meeting, two stages will be implemented:

STAGE 1:
  CVS established
  initial curation: annotate as many genes as possible (FlyBase hopes to get 3000-4000 genes done; are other DBs up to the challenge?)
  XML exports to Suzi established
  a Central Manager will be hired

STAGE 2:
  a working database
  something of real functionality for the public: "genes to GO"

12. Making the database available to public users should coincide with a descriptive (promotional) write-up of the GO Project in a widely circulated genetics-oriented journal, such as Trends in Genetics. Members of the GO Project should be thinking of ideas for this paper and what they would like to see in it.

13. Janan will meet with Ken Fasman (Astra) and Lisa Brooks (NIH Program Officer) at the JAX MGI Advisory Board meeting in early June 1999 and discuss initiating a co-operative (MGI, SGD, FlyBase) grant for submission in November 1999. After this meeting, Janan will update other members of the GO Project on the ideas for the grant via email correspondence.

14. Resources: $200K from Astra for '99 and a promise of $200K for '00. Michael Ashburner will have the money placed in an EBI U.S. account at Cambridge Trust Company.

15. The next GO Meeting is scheduled for 6-9 October 1999, to be hosted by MGI in Bar Harbor, Maine.