These are some notes I typed up after the Nov. 2000 meeting. I've deposited them as a supplement to Courtland Yockey's notes.
midori

Minutes of Gene Ontology Consortium Meeting
4-5-6 November 2000 (Berkeley 2000 Meeting)

Table of Contents

Groups represented
Process Ontology Discussion
Embracing the Explosion
Sensu to the rescue
Anatomical Lexicons and GO
Granularity and the Purpose of the GO
Distinction between GO and Partner Databases
Progress Report - General
What is the GO NOT
A couple of brief notes
Mapping GO to other Lexicons
Human Annotation
Evidence Codes & Documentation (Midori)
...and whatever happened to Proteome
Progress Report - Flybase (Michael Ashburner & Heather Butler)
Progress Report - SGD (Midori Harris, Mike Cherry)
Progress Report - MGI (David Hill)
Progress Report - TAIR (Leonore Reiser)
Progress Report - Wormbase (Erich Schwarz)
Progress Report - Prokaryotes & Protozoa
Software Developments
Java Browser (John Richter)
Java Editor (John Richter)
Technical Discussion : aiming for a Generic Ontology Builder
HTML Browser (Brad Marshall)
BLAST Server (Suzanna Lewis)
Timing, Resource, and Prioritization Issues
Interim Solution during resource-limited period
Downloads from the 'outside'
Discussion around the Editing Process
Schema Discussion, with emphasis on DBxREF
GO SLIM
Utility of GO SLIM
Status of GO SLIM
Funding
Publications
Next Meeting
Appendix A: Current GO SLIM

In attendance (complete list with little distinction for partial attendance)

Michael Ashburner (Flybase)
Suzanna Lewis (Fly)
Mike Cherry (yeast)
Heather Butler (fly)
Midori Harris (yeast)
Sue Rhee (arabidopsis)
J. Yoon (arabidopsis)
Leonore Reiser (arabidopsis)
Chris Mungall (fly)
Courtland Yockey (AstraZeneca)
Judy Blake (mouse)
Brad Marshall (fly)
David Hill (mouse)
Selina Dwight (yeast) (arrived afternoon 5 Nov)
Erich Schwarz (worm)
Cathy Ball (yeast)
Gavin Sherlock (yeast)
Kara Dolinski (yeast)
John Matese (yeast)
John Richter (fly)
Lukas Müller (arabidopsis)

Groups represented

Drosophila - Flybase
Yeast - SGD
Mouse - MGI
Arabidopsis - TAIR
C. elegans - Wormbase
Pharma - AstraZeneca

Minutes authored by Judy Blake and Courtland Yockey
Minutes edited by Midori Harris

Note ... items under 'clearinghouse' topics are incorporated into Progress Reports

Process Ontology Discussion

From before the meeting actually started through to the end of the first day, discussion kept ranging back to the Process Ontology. By 9 AM, Michael and Judy were discussing how the term 'heart development' might exist in both GO and MGI DAGs, but with different meanings in each. By the end of the day, it appeared that a consensus had been reached whereby 'heart development (sensu Mammalia)' in GO might be a conceptual equivalent to what MGI and its users might need under the heading 'heart development.' Needless to say, much additional discussion must follow, but there was a definite feeling that progress was made in tackling the 'anatomy of process' problem (my term, not introduced at the meeting).

It appeared that this was very much a fly/mouse oriented discussion, and that the granularity around the worm, plant, and fungal anatomical issues was not deeply addressed. That being said, the 'solution' arrived at was to 'embrace the explosion' in such a manner that any organism might be handled. It also appeared that addressing mouse anatomy was a generally accepted surrogate for mammals in general, as a first pass.

Embracing the Explosion

What does it mean to include anatomy? Better yet, what happens if you do not include it, or include it 'incorrectly'?
It was thought at one time that anatomy could indeed be excluded from the GO, but this runs into a number of problems, chief among them being that anyone looking at a Process Ontology might well expect to see terms such as 'spermatogenesis' or 'heart development'. In the absence of such 'important' and pervasive terms, the utility of the GO is significantly diminished.

One possible solution discussed at length and eventually abandoned was that of combinatorial terms. Such Combi-Terms (my phrase, not used at the meeting) would take components from existing ontologies, creating a new term that does not exist in either, effectively creating a term set that is not itself hierarchical but bridges two hierarchies, thereby allowing entry and tracing of each one touched. For example, the notion of encoding the concept 'heart development' as 'heart (anatomical ontology)' + 'development (process ontology)' was discussed. In such a situation, a term could be either a literal (an explicit part of one of the ontologies) or a symbol combination (a combination of literals, a Combi-Term). The combination inherits the lexicon and the hierarchical position of each component term, but not the definitions of the components; a definition and GO ID would be assigned to the Combi-Term.

However, this solution is prone to drift error, meaning that the Combi-Terms might not keep up with changes in the anatomical and process ontologies. Further, there are problems inherent in using only symbols and divorcing those from the meanings of the symbols while retaining their hierarchy positional information, chief among these being suspicious hierarchy tracing. For instance, given 'heart development' applied to a fly, there is no constraint preventing hierarchy tracing to 'heart valve development,' which is inherently nonsensical for the fly.

Sensu to the rescue

The solution that seemed communally acceptable was to invoke the existing 'sensu' notation.
In the case of 'heart development,' the sub-graph might look like:

heart development
 %heart development sensu Insecta
 %heart development sensu Mammalia

20 groups worldwide to handle nomenclature for Candida. The Candida sequence is close to being completed. The pombe sequence is completed and the paper has been submitted for publication (Sanger group). A GO annotation set has been submitted with external references being Swissprot IDs.

Progress Report - MGI (David Hill)

- GO Browser now publicly available for MGI records (created/implemented by John Corradi)
- Gene-by-gene annotation of 15,000 genes
  - effort is concentrated on moving IEA evidence to ISS evidence (or greater)
  - updated annotations now being downloaded to GO webpage every Friday
- Major Revision of GO subsections
  - Process Ontology: apoptosis (with Flybase/Heather)

Progress Report - TAIR (Leonore Reiser)

- Major Revisions
  - Working (with Ji Yoon) on the introduction of terms into the three Ontologies to support Arabidopsis in the first instance, Plantae in general as a followthrough
  - Plant terms - 162 total to date, 63 in the Component Ontology (50 with definitions)
- Expressed strong support for visual-based ontology curation and browsing tools
- Thesauri
  - Worked on EGAD plant <> GO associations
- A contact has been initiated with Maize DB, but little feedback has been received to date (is this contact via Susan at Cornell?).
- A contact has been established at the Carnegie Institute within the Rice community regarding working with GO and Cyanobase
- A plant-related mailing list has been initiated
- The notion of a 'plant clearinghouse' for GO annotation is embryonic and being organized by Leonore and Richard Burkerist (formerly of the Sanger).
Progress Report - Wormbase (Erich Schwarz)

- Erich is Paul Sternberg's first employee
- Wormbase growth
  - Wen Chen (will be adding expression data to Wormbase AceDB)
  - Raymond Lee (will be porting Wormbase from AceDB to other platforms)
  - to fill one more curator's slot by summer & one db programmer
- Erich will be the GO-responsible party
- AceDB is beautiful for assisting in positional cloning.
- What is the relationship with Lincoln Stein?
  - Invaluable
  - He is on the grant that currently underwrites Wormbase
  - Essential at the level of asking Lincoln to do things .. Lincoln put together the Wormbase.Org site at Paul Sternberg's request
  - Eventually Wormbase will move from Cold Spring Harbor to CalTech
- relationship to Proteome ... they are using GO ... ?in the human curation?
  - General feeling is that the scientists have lost control of Proteome, the businessmen have smelled profit.
  - Proteome is helping out Worm by providing Worm database definition lines (?).
- relationship with the Sanger ... John Hodgin (gene mapping db) ... Sanger/WashU sequence feed ... very complex interaction map with other smaller dbs ... Richard Durbin involved as a subsidiary PI, like Lincoln Stein
- goal ... weekly updates rather than once every 3 months

Progress Report - Prokaryotes & Protozoa

- Charlie Hodgkin (Glaxo; used to write software for Michael) initially contacted Michael and expressed his intention to use the GO for annotating E. coli. This launched an e-mail flurry that eventually led to interaction between Michael and Heather, and Monica Riley and her postdoc Greta Serres (around ISMB '99).
- Charlie, with Monica's support, is quite interested in putting the E. coli GO annotations in the public domain, but there is no indication of when this might be completed.
- Michael and Heather have mapped Monica Riley's latest (non-GO) classification to the GO, but it cannot be publicly released. This mapping required the addition of many terms to the GO and has set up the GO for use for most enteric bacteria.
- Monica has a great deal (10 years of work) invested in her classification scheme and has a great deal of interest in seeing a proper mapping/merge between GO and her scheme.
- Michael and Heather have also obtained from Monica Riley the Genprotec enzymes list (a list of E. coli proteins), and this has been parsed into the Function Ontology.
- The situation around EcoCyc is complicated and it is unclear when EcoCyc <> GO mapping might be done. EcoCyc ownership is being resolved between DoubleTwist, Pangea (now DoubleTwist), and SRI; an NIH grant (content unknown) is being held up pending resolution.
- There is an interest at Stanford in annotating cyanobacteria with GO, but it is unclear where this is going in the near term.
- A previously noted protist interaction is dead (Russ Altman?) : a Canadian group has recently become interested in annotating protists with the GO in conjunction with some large scale sequencing.
- There is currently no activity on the horizon for annotating viruses with the GO.
- The group has not looked at the 'minimal microbial genome' to see if it is fully represented in the GO.
- Michael would like to see Pseudomonas included (for its xenobiotic metabolism) and Streptomyces (for its antibiotic biosynthesis)
- Three groups have independently applied for funding to the Wellcome Trust for the sequencing of 3 different protozoa (Leishmania, Trypanosoma cruzi, and Trypanosoma brucei), in close collaboration with a new Plasmodium database at the University of Pennsylvania. Al Ivans (Sanger), who I presume is attached to one or more of these, is interested in using the GO in their gene annotations, though nothing is firm to date
- General message: would like to identify communities, work to build a foundation for them, then invite them to build on that foundation

Software Developments

A relational database containing the GO terms is now available. It was unclear to me whether this is in Oracle in addition to MySQL, but that is not really relevant at this juncture.
The time is imminent when the following will be available to curators:

- direct writes to a pre-production (development) database instance
- elimination of duplicate ID and basic syntax errors owing to data integrity functions (currently no data integrity ... everything done as flat textfiles).
- Curation interface with undo, DAG view, and commit functionality (see below)
- Rollback to previous version, owing to audit trail functions

Java Browser (John Richter)

- Online at fruitfly.org
- Demonstration was very well received by participants
- A lot of effort has gone into making the browser platform independent
- All GO terms in all three Ontologies are found in a single tree (one window in Browser)
- Clicking any term brings up a minimal DAG in an accessory window
- Queries run as Perl5 regular expressions
- Query Window supports either Boolean or Perl5 regular expression queries
- After pulling a gene product, can click on a term block and thereby highlight the terms in the Ontology Tree Window; the minimal DAGs are also shown for these terms in an accessory window
- The number of gene products is no longer indicated in the interface

Java Editor (John Richter)

- The editor will have the same look and feel as the Browser when completed
- Editor runs as an application rather than as an applet; it is available under CVS and must be installed to be used
- GO ID is fixed; new terms get automatic assignment (there was brief discussion around how to assign IDs which did not reach final resolution; currently, ID ranges are apparently assigned to curation groups)
- Structural Edits (edits to the DAG structure)
  - simple select/click/drag to change structure
  - 'infinite' undos
  - term-merge by click/drag; one term lives on while the other is obsolesced and becomes a synonym of the live term
  - term-splitting enabled (do not recall details)
  - term obsolescence is automatic if all children are obsolesced
  - cyclic graphs are not automatically disallowed; thus curators need to avoid descent along obsolesced terms
  - For searches, obsolesced terms should all be treated as leaves
- Rollbacks: In theory, 'infinite' rollback is possible, but there is currently no 'History Viewer' and John suggested that design of one would be significantly more complex than either Browser or Editor.
- Logic Validation is currently not within the remit of the Editor
- Erich: would like to have a visual cue that indicates which terms have been changed recently
- Provides for tagging a term as part of GO Slim, and capability of adding any number of additional segregation tags in the future.
- Import & Export of subtrees will soon be available (a very useful feature)
- A Gene Product Viewer will be added in the near future, the nature of which is currently unknown.

Technical Discussion : aiming for a Generic Ontology Builder

- The current Editor is componentized
- The DAG component is generic, not specific for GO
- Using Java components has allowed fancy click/drag functionality, including a smooth table resizer
- All components are available on the CVS repository : with DOCUMENTATION (fancy that)
- The GO Schema is available at www.fruitfly.org/annot/go/database/index.html
- People are encouraged to write ports to different database formats and submit them to the GO Database mailing list

HTML Browser (Brad Marshall)

- The HTML Browser is running off of the GO Relational Database (Chris Mungall's Perl API) rather than off of the GO Flatfiles, as the Java Browser does
- currently found at http://www.fruitfly.org/~bradmars/cgi-bin/go.cgi?accession=3700
- Similar : though generally weaker : functionality to the Java Browser.
- Plans are to increase functionality to be on par with the Java Browser

BLAST Server (Suzanna Lewis)

- There is an obvious need for the ability to BLAST against organism sequence sets using sequences retrieved via GO searches.
- An unasked question : why do they not rely on the archival databases to serve BLASTs?
- There was some discussion around whether to have 'all 400 versions of ADH' included or not
- The current thought is to have protein sequences in FASTA format deposited on the GO website and available for BLAST searches
- In the case of MGI, the plan is to submit a single protein sequence per gene. This single sequence will be the primary Swissprot sequence that represents each gene product. MGI annotation is to the level of the gene, not to the gene product; therefore, spliceforms and post-translational modifications do not matter that much to them at this point in time.
- In the case of yeast, about 2000 yeast protein sequences have already been provided.
- The FASTA header line for the yeast sequences was discussed as a standard.
  - general syntax : key:value[space] (key order not constrained)
  - 'provide everything that you can provide, the more the better'
  - community name for the gene (that used locally within a particular domain)
  - gene name (the generally accepted, or global name for the gene)
  - tab delimited list of GO IDs
  - one or more external database cross-references

Timing, Resource, and Prioritization Issues

- The primary software developers associated with the GO Consortium are John Richter and Chris Mungall. John is the primary driver behind development of the Java Browser and Editor, which appear to be the currently favored means of GO delivery to end users and curators.
- About 3 weeks of work remain (according to John) to reach a final stage on Java Editor & Browser
- John is currently under obligation for development of an open-source gene annotation platform, Apollo, and timing for completion of software development for the GO Consortium is currently critically dependent upon his obligations to Apollo. Also competing for resource is a Fly Genome Reannotation project that is on-going.
- There is significant concern (voiced by Michael) around the absence of rigorous syntactic control and internal checks.
John indicated, though, that syntactic checking algorithms are 'the stuff of papers' and were going to devour the majority of the development time. (I am quite fuzzy on exactly what the issue is here.) The concern was voiced that the GO DAG could be really screwed up if the Editor is used improperly ... powerful tools mean powerful edits. However, the potential for rollbacks really softens the potential impact of this problem.

Suzanna voiced the opinion that the priority lay in getting out of flatfile mode and into a relational database environment. It seemed that most everyone agreed with this.

A key untackled problem appears to be 'how to commit DAGs to a database,' which apparently both John and Judy are thinking hard about. (Again, I'm not quite clear on what the issue is here.)

Interim Solution during resource-limited period

John will do enough so that the Java Editor is available and will produce flatfiles; the flatfiles will be fed to the GO via the currently implemented CVS system.

Downloads from the 'outside'

Data is currently provided to external parties on an ad hoc basis. The MySQL database can be downloaded with the following caveat: 'If you make any changes, do not publish this database'. Among the groups who have downloaded is AstraZeneca (Bo Servenius, Lund, Sweden).

Presumably, people are writing applications on top of the database, so John is trying to give a lead time on large changes. A Perl Development Kit has been assembled to assist in the loading/updating of external representations.

A defined update/error reporting process has not been set up to date. However, each of these is currently handled through the e-mail lists (GO-FRIENDS, GO-DATABASE, GO-DIFF). The contact points should probably be Chris Mungall and John Richter.
Send format problems to John Richter (http://www.fruitfly.org/annot/go/) and suggestions to John at go-admin@bdgp.lbl.gov

Right now, remote access is relatively slow, presumably because there is lots of SQL being done remotely. John and Chris have some ideas bouncing about regarding 'fast access to GO Objects'. Consult John on this matter.

Discussion around the Editing Process (various) (see also Software Development: Java Editor)

- Judy: a database administrator would help to stem the proliferation of terms and provide some control over the process
- Still trying to resolve the issue of how to distribute editing to a database globally across groups and disciplines. A software tool is emerging (Java Editor, design by John Richter) but the issue of editing control is still being debated.
- Michael: Minor Changes - Just commit them; Major Changes - consult the group via e-mail and commit after a period (currently ~10 days); Really Major Changes - talk in person. Michael does not really want to see the current process changed.
- There was some discussion between Michael and Judy around similarity between the GO and extant nomenclatures, where committees meet to commit changes, but Michael would really want to avoid the potential emergence of bureaucracy, favoring maintenance of the present 'sociology' around the GO Consortium.
- It was generally accepted that there should be both a Working Database (Development) and an Authoritative Database (Production). There was some split of opinion over whether the Working Database should be publicly visible, but it appears that it will be made so, the argument being that the more skilled eyes that are looking at the work in development, the better.

Schema Discussion, with emphasis on DBxREF

A discussion ensued around whether the DBxREF field should refer to Terms and Definitions in the aggregate or separately.
This discussion gets to the issue of whether a particular piece of information derived from a Source should be traceable to the Source or not.

The current record structure looks like this:

- GO ID
- Term
- Definition
- Definition Refs
- Synonyms
- Synonym Refs
- DBxREFS (+ Boolean; supports Definition, Term, etc.)

Suzanna led the discussion around proposed changes to the Schema.

- suggest adding a 'subclass' table to support tags that allow quick segregation of the GO into subsets (such as GO SLIM).
- suggest addition of sequence relationships to support sequence accession number searches, with coincident links to Gene Products
- suggest addition of a type flag to the term_dbxref table to allow Term vs. Definition (or other type) values to be distinguished. Also, suggest addition of DBxREF tables as needed around other value sets (such as synonyms)

Trademark (Mike Cherry)

- A copyright statement has been added to the GO Consortium Page
- A logo [designed by John Matese] is available for the GO Consortium (found on http://www.geneontology.org/GO.usage.html; specifically http://genome-www.stanford.edu/images/GOthumbnail.gif)
- The Trademarking/Copyrighting issue comes up now as more people become interested in using GO, which increases the potential for product developers to incorporate GO without attribution or in an independent but apparently associated manner.

GO SLIM

- The utility of GO SLIM was generally recognized.
- A referent/cutout/subclass table will be added to the GO Relational DB that will allow extraction of GO SLIM, GO FAT (full), Plant Terms, and other term subsets.
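One way to picture the proposed subclass table is sketched below in Python with SQLite. This is only an illustration under stated assumptions, not the Consortium's schema: the table and column names (term, term_subset, acc, subset), the tag strings 'goslim' and 'plant', the assignment of 'cell wall' to the plant subset, and the placeholder term GO:9999999 are all hypothetical. The four real GO IDs are those listed in Appendix A.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a term table and a subset-tag table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE term (
            acc  TEXT PRIMARY KEY,   -- GO ID
            name TEXT NOT NULL
        );
        -- Hypothetical 'subclass' table: one row per (term, subset tag),
        -- so a term can belong to any number of subsets.
        CREATE TABLE term_subset (
            acc    TEXT NOT NULL REFERENCES term(acc),
            subset TEXT NOT NULL,
            PRIMARY KEY (acc, subset)
        );
        """
    )
    conn.executemany(
        "INSERT INTO term (acc, name) VALUES (?, ?)",
        [
            ("GO:0003673", "Gene_Ontology"),
            ("GO:0005575", "cellular_component"),
            ("GO:0005618", "cell wall"),
            ("GO:0005576", "extracellular"),
            # Made-up placeholder standing in for a fine-grained, untagged term.
            ("GO:9999999", "placeholder fine-grained term"),
        ],
    )
    conn.executemany(
        "INSERT INTO term_subset (acc, subset) VALUES (?, ?)",
        [
            ("GO:0003673", "goslim"),
            ("GO:0005575", "goslim"),
            ("GO:0005618", "goslim"),
            ("GO:0005576", "goslim"),
            ("GO:0005618", "plant"),  # illustrative tag assignment only
        ],
    )
    return conn


def terms_in_subset(conn, subset):
    """Return (acc, name) pairs for every term tagged with the given subset."""
    rows = conn.execute(
        """
        SELECT t.acc, t.name
        FROM term AS t
        JOIN term_subset AS s ON s.acc = t.acc
        WHERE s.subset = ?
        ORDER BY t.acc
        """,
        (subset,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for acc, name in terms_in_subset(db, "goslim"):
        print(acc, name)
```

Under this layout the full GO ('GO FAT') is simply the term table with no filter, and a new subset needs only new rows in the tag table rather than a schema change.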
Utility of GO SLIM

- Initially needed for high level Celera annotation
- Subsequently needed for Riken Annotation Jamboree (FANTOM meeting)
  - this was apparently an effort to annotate by 'computational means' about 2000 previously undescribed mouse cDNAs
  - 11 methods were used and a manuscript is apparently being assembled
  - one thing that the group rues is the absence of an upfront comparative metric between the 11 methods; they could tease out a comparison between the methods, but it would be tedious, though perhaps worth it
  - This effort involved the MGI (David) and took place around the time of the global genomics meeting in Japan
- Problem: The GO SLIM versions used for Celera and Riken were not identical
- Useful for easy to apprehend graphics, such as pie charts
- Useful for high level classification, such as across genes in a microarray experiment
- Potentially more stable than full GO (though this is not a sentiment shared by all participants). In particular, the 'lower half' of the Process Ontology is in flux.

Status of GO SLIM

- Currently maintained by hand (by Michael)
- One outcome of the meeting was a decision to add a GO SLIM entry to each Gene Product annotated. A commitment to automating this process was discussed. This GO SLIM view would be included in Swissprot and NCBI representations alongside the GO FAT view.
- The current criterion for a term being part of GO SLIM is 'someone says so'.
- There is no intrinsic connotation to the node-levels in GO. This means that there is nothing inherently similar among terms at Level 3, for instance. Thus, the slice through the Ontologies that would define GO SLIM is necessarily ragged (occurring at different levels for different branches).
- Currently no alerting system to highlight changes in GO SLIM; requires consultation of DIFF files and manual extraction.
- Most recent version 100-150 terms; see Appendix A.

Funding

- The GO Consortium expects to receive full funding for pending grants.
- Funding is for 3 years beginning 1 December 2000
- There is currently an administrative (Congressional Budget) delay on the disbursement of funds.
- Funding is initially to support 3 groups in the GO Consortium (Mouse, Yeast, Fly), with the addition of 2 groups after the 1st year (presumably Arabidopsis and C. elegans).
- Other funding ... Incyte (no details) ... AstraZeneca (no details)

Publications

- A Genome Research paper is in the process of being written (GO Consortium), which will provide more details on the GO and is a followthrough from the Nature Genetics paper recently published.
- A Genomics paper is in the process of being written (MGI) which will detail some of the recent mouse gene annotations incorporating the GO

Next Meeting

3, 4, 5 March
Palo Alto, CA
hosted by Sue Rhee, Carnegie Institute, Stanford University

Immediate Actions

- Doug: send THE LETTER to him for DoubleTwist; invite them to Dec. meeting. CALL ANDREW AND ASK HIM WHO TO SEND THE LETTER TO; EXPLAIN THE SITUATION.
- review John Richter's FAQ about GO page
- WE NEED TO SEND FASTA/GO FILE TO GO-SLIM
- WE NEED TO POST AT mgi THE mgi/GO FILE THAT IS SENT TO www.geneontology.org

Appendix A: Current GO SLIM

From: Suzanna Lewis[SMTP:suzi@bdgp.lbl.gov]
Sent: Wednesday, October 25, 2000 11:34 AM
To: ma11@gen.cam.ac.uk; midori@genome.Stanford.EDU; suzi@bdgp.lbl.gov
Cc: go@genome.Stanford.EDU; dph@titan.informatics.jax.org
Subject: Re: go_slim

also needed to add unlocalised to component slim. here is the updated version

$Gene_Ontology ; GO:0003673
 $cellular_component ; GO:0005575
  %cell wall ; GO:0005618
  %extracellular ; GO:0005576