Gene Ontology Meeting
February 25-26, 2000 at Astra-Zeneca in Cambridge, MA

Attendees:
Michael Ashburner (FlyBase)
Suzanna Lewis (FlyBase)
Heather Butler (FlyBase)
Judy Blake (MGI)
Janan Eppig (MGI)
David Hill (MGI)
Joel Richardson (MGI)
Martin Ringwald (MGI)
Allan Peter Davis (MGI)
Michael Rebhan (Astra Zeneca)
Mike Cherry (SGD)
Cathy Ball (SGD)
Midori Harris (SGD)
Andrew Kasarskis (SGD)

AGENDA ITEMS
Progress Reports
Celera Report
Papers
Collaborators and Other Projects
Ontology Issues
Style and Work Practices
Tools for GO
Questions from Michael R.
Plans for Next Meeting

PROGRESS REPORTS

Mouse folks:
Judy Blake has submitted the GO grant. The mouse members have assigned approximately 4500 genes to GO terms.
100 by hand
650 by EC number
1270 using Swiss-Prot
2500 using mouse nomenclature
Without counting the Swiss-Prot data, they used 474 Molecular Function terms, 50 Cellular Component terms and 80 Biological Process terms. Since they are using automated annotation, they have performed a variety of quality checks, such as looking for more than one annotation within an ontology. They have come close to exhausting the current automated assignments and are going to be doing more by hand in the future.

Yeast Folks:
SGD has 1524 genes in the gene association file. About 1000 of these are ORFs and the rest are tRNAs or snoRNAs. All SGD genes have been GO-annotated by hand.

Fly Folks:
FlyBase currently has about 3000 genes annotated, mostly by hand. Heather has worked through the protein kinases and will next tackle the protein phosphatases. Annotation of new genes will be largely done by sequence similarity, while existing genes will be done by hand in related chunks. When the Drosophila sequence is released in March, there will be a large number of sequences annotated to a high-level GO ID. These will be deepened to more specific GO nodes with time.

CELERA REPORT

GO was used in the annotation of the Drosophila genome at Celera.
Suzanna made a dataset with all genes annotated to the molecular function GO and used it for BLAST searches. Usually, the level of GO node was quite high -- only one or two terms from the top. Where experts in a field were expected to be annotating genes, the specificity of the GO nodes used was increased (for example, olfactory receptors). Ultimately, there were 40 bins labelled by GO name (the 40th was "unknown"). Annotators were then able to make a pretty reasonable guess as to the function of the new fly gene. A second binning with biological process and cellular component showed a terrific correlation with the first. About half the genes from the Celera set are associated with a GO term. Since a given gene has a less than 50% chance of having been seen by a human, an association with a GO term is very valuable. FlyBase is still waiting to receive the sequence -- it will be released with the publication of the papers in March. FlyBase will be responsible for updating the sequence in GenBank.

PAPERS

We agreed to immediately pursue three publications:
1) Nature Genetics -- They solicited a short (2500 word) article from David Botstein. It will be submitted March 10, with a short author list.
2) Genome Biology -- Michael Ashburner has been asked to write a short (1000 word) article for their premier issue. It will most likely have an authorship along the lines of "The GO Consortium".
3) Genome Research -- Judy Blake will adapt the grant to a "big" paper to be submitted to Genome Research.
4) NAR database issue -- We will submit a paper to NAR as a matter of course. The submission won't be until August or September. Since there are likely to be changes in the NAR policies, we will discuss the details of the NAR paper at the next meeting.

COLLABORATORS AND OTHER PROJECTS

There was a great deal of discussion about taking on other organisms and collaborators.
The conclusion was that before we take on other organisms, we must first meet the following goals:
--We need to be in a database (Suzanna Lewis will be working on this, with help from Joel Richardson). Hopefully, this will be accomplished by the next meeting. See "Plans for Next Meeting" for more detailed steps.
--Documentation of philosophy, styles and practices needs to be written to record and communicate our current thinking. See "Plans for Next Meeting" for more detailed steps.
--A "GO manager" needs to be hired to coordinate changes to the ontology, arrange training, communicate with all groups, etc. Midori Harris has volunteered to assume the responsibility.

Michael Ashburner suggested we have two classes of partners -- the first with write permission and the second without it. These "second class" partners will have to funnel suggestions and comments through a full partner.

Other organism groups that have expressed interest include worm, Arabidopsis, and S. pombe. We'll invite a representative from the worm and Arabidopsis database groups to the next GO meeting.

Michael Ashburner has received a grant application for "BioBabel" -- a proposal to adopt GO terms within Swiss-Prot, Enzyme Commission and InterPro. Representatives from this group can also be invited to the next meeting.

ONTOLOGY ISSUES

Methods and practices for editing and maintaining the ontology took up a large portion of the discussions. Conclusions will be listed, and in the cases where the discussion is particularly illuminating, the discarded options will be listed as well.

1) Changes to GO nodes that have multiple parents...
When editing one of the ontologies, it is more convenient to add another node in only one position. For example, if we start with the structure shown below:
a b d e f
If we want to add node 'c' as a parent of 'd' and a child of node 'a', do we need to edit all the appropriate lines, or just one?
The group decided to make an "editable" non-redundant version of the ontologies:
Linear, redundant format (for viewing):
a b d e f c d e f
Non-redundant format (for editing):
a b d % c e f
The envisioned procedure is that a curator checks out the compressed, or non-redundant, version and then views an expanded version using a planned tool we're calling "The Validator." When an edit needs to be made to an ontology, it is made in the compressed version and tested with the Validator. The compressed version is then checked back into CVS. The Validator will be written by Joel, using specifications mentioned later. The web will display the expanded, read-only format.

2) We will add GO ids to parent terms. For example, we used to state:
term1 ; GOID1 % term2
Now we will state:
term1 ; GOID1 % term2 ; GOID2

3) GO nodes should aggressively avoid using species-specific definitions. We agreed to substitute "Yeast mating" with "Mating, sensu Saccharomyces." Using the "sensu" reference makes the node available to other species that use the same process/function/component. Each organism database will take care of its contributions to the species-specific language.

4) We will get rid of cellular component references in the function ontology. For example, "mitochondrial primase" need only be "primase." There are many cases where component terms are appropriate in the process ontology, so those will remain. Michael A. will take care of this.

5) Joel pointed out these logical relationships that we need to make sure are true in the ontologies:
if A is part of B and C isa B, is A part of C? --- YES
if A isa B and B isa C, is A isa C? --- YES
if A is part of B and B is part of C, is A part of C? --- YES
if A isa B and C is part of B, is C part of A? --- NOT NECESSARILY
Joel will send out a list of the logical inconsistencies that he has detected.

6) An example that got a lot of attention is the case of the mitotic chromosome's location in the cellular component ontology.
While the mitotic chromosome resides in the nucleus in yeast, it is cytoplasmic at this stage of cell life in mouse or fly. In addition, many organisms have chromosomes that are NOT located in the nucleus. The solution arrived at was to remove chromosome from the nucleus in general and place the appropriate subsets of chromosomes in the correct place (nuclear, cytoplasmic, mitochondrial).

7) We need to track deleted GO ids. There are three types of things that can happen to GO terms -- merging two (or more) nodes, splitting a node, and deleting a term.
a. When a term is deleted, we will cut the line out and paste it at the end of the file (or as a child of the parent "defunct", I don't recall the final decision), using the following format: and tags.

11) We currently cannot standardize rules for subdividing ontology terms, but instead will continue to make each decision on a case-by-case basis.

12) Gene products in themselves are not nodes of the function ontology, although doing something with or to a specific gene product can be one. For example, being hedgehog is not likely to be a function, but being a hedgehog receptor or hedgehog receptor ligand are functions.

13) We may eventually need a synonym table to facilitate queries.

14) Changes that need to be made to the ontology to meet the current style include eliminating unnecessary hyphens, adjusting grammar so that "transporters" are described as "transport" and "transporting," and removing words like "protein" and "factor" where we can be more explicit.

15) Heather and Midori will write some documentation about the evidence codes.

16) We need to think about a "best practices" document that will state and explain good work habits for both current and future annotators. In the meantime, we will share any help documents, such as SGD's "Instructions for Annotating Genes Using GO."

TOOLS FOR GO

1) Database
Suzanna will get a handle on this. The major difficulty has been hiring a programmer. Michael R.
offered some help on this from Astra-Zeneca. Suzanna is planning on using MySQL to create a version to distribute from the central site. The schema is not yet ready, but Suzanna and Joel will work on this together. The database will also need the ability to bulk load.

2) Validator - Joel will do this
We need a validator to check for:
a. cycles
b. deletion of nodes used in gene association files
c. syntactic correctness (refer to logical relationships described in the ONTOLOGY ISSUES section)
d. unique IDs
e. warning message of the number of affected nodes
f. orphans
g. new nodes have IDs
Associated with the validator is the ability to compact and expand the ontologies for writing and reading. The validator will run on the central site, as well as locally for checking before an edited ontology is checked back in. Joel plans on writing this in Python, so each site will need to install it.

3) GO BLAST server - Mike C. will take care of this
The GO BLAST server will use a dataset of GO-annotated protein sequences. The results should show each GO node associated with a gene product, as well as a few generations of ancestors.

4) Annotation aids
It would be nice for curators to have a tool that, given a single node, displays all other gene products at that node (and nearby nodes) as well as all their other GO associations. This would assist curators in assigning a gene product to as many GO terms as needed, by showing them all other GO terms that might be related.

5) Suzanna's browser needs to be installed at Stanford, so we can all be using it from the same server.

6) Michael R. suggested we make a link to a DTD (document type definition) file. Suzanna will look into finding a tool that will read the XML and create a DTD file.

ANSWERS TO QUESTIONS FROM MICHAEL R.

1) GO ids will be stable. They may be "defuncted", but they will not go away.

2) "is a" and "part of" are likely to be used for quite some time.
However, "part of" means "can be a part of", NOT "is always a part of."

3) Incyte still expresses interest, but that's all we've received from them.

4) Homepage recommendations -- Mike C. will add a bit from the grant to give the homepage more detail. It might also benefit from the addition of statistics from the gene association files.

5) Should we have an ftp site that allows one to download the most recent version of GO?

6) Michael R. will create a FAQ to be linked from the home page.

7) Mike C. will put SGD's PowerPoint GO presentations on the GO site.

PLANS FOR THE NEXT MEETING

The next GO meeting will be in Cambridge, UK, June 29 and 30. The plans are:

1) Have documentation ready
a. GO philosophy document (Michael A., Judy and Midori)
b. Rules for making changes to GO (Michael A. and Andrew)
c. Rules for applying GO terms -- this is currently project-specific. Each project needs to think about this and bring something to the table next time. This should also include particularly illuminating examples, such as chitin synthesis and mitotic chromosomes. It should also emphasize how to avoid making GO nodes too species-specific, and mention the logical aspects of inserting or moving nodes.

2) Invite representatives from BioBabel, Arabidopsis, and C. elegans.

3) Have database in place

4) Create programs described above

5) Work HARD on adding more GO definitions. We have permission to use the Oxford Dictionary of Biochemistry and Molecular Biology.

6) Make the ontology edits mentioned above

7) Write three (!) papers
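
ILLUSTRATIVE SKETCH OF VALIDATOR CHECKS

Several of the Validator checks listed under TOOLS FOR GO (cycles, orphans, unique IDs, new nodes have IDs, deletion of nodes used in gene association files) can be sketched in a few lines of Python. This is only a rough illustration and not Joel's design: the data layout (a dict mapping each term to a list of its parents, and a dict mapping each term to its GO id), every function name, and the toy graph and ids are assumptions made up for the example.

```python
from collections import defaultdict


def find_cycle(parents):
    """Return one cycle as a list of terms, or None if there is none.

    `parents` (hypothetical layout) maps each term to a list of its parents.
    """
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = defaultdict(int)  # every term starts as UNSEEN

    def visit(term, path):
        state[term] = ACTIVE
        path.append(term)
        for parent in parents.get(term, []):
            if state[parent] == ACTIVE:
                # We walked back onto the current path: that is a cycle.
                return path[path.index(parent):] + [parent]
            if state[parent] == UNSEEN:
                found = visit(parent, path)
                if found:
                    return found
        path.pop()
        state[term] = DONE
        return None

    for term in list(parents):
        if state[term] == UNSEEN:
            found = visit(term, [])
            if found:
                return found
    return None


def find_orphans(parents, roots):
    """Terms that have no parent and are not declared roots."""
    return sorted(t for t, ps in parents.items() if not ps and t not in roots)


def find_id_problems(ids):
    """`ids` maps term -> GO id (or None). Returns (missing, duplicated)."""
    missing = sorted(t for t, go_id in ids.items() if not go_id)
    terms_by_id = defaultdict(list)
    for term, go_id in ids.items():
        if go_id:
            terms_by_id[go_id].append(term)
    duplicated = {i: sorted(ts) for i, ts in terms_by_id.items() if len(ts) > 1}
    return missing, duplicated


def find_deleted_in_use(old_ids, new_ids, associated_ids):
    """Ids dropped by an edit that gene association files still reference."""
    return sorted((set(old_ids) - set(new_ids)) & set(associated_ids))


# Toy graph, loosely modelled on the a-f example above (an assumption):
# 'd' has two parents, 'b' and 'c', and both of those are children of 'a'.
toy_parents = {
    "a": [],
    "b": ["a"],
    "c": ["a"],
    "d": ["b", "c"],
    "e": ["d"],
    "f": ["d"],
}
```

A curator-side script could call these functions on the compressed ontology before checking it back in, and refuse the check-in if any of them reports a problem. The "part of" and "isa" rules from ONTOLOGY ISSUES are not covered here, since they need edge types that this toy layout does not record.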