This file contains the collected minutes of Gene Ontology Consortium meetings.

===================================================================

Ontology Meeting, Palo Alto, Jan 10 & 11, 1999

Participants:
SGD and Arabidopsis Databases: Mike Cherry, David Botstein, curators of SGD and Arabidopsis db.
FlyBase: Michael Ashburner & Suzi Lewis
MGD: Janan Eppig, Joel Richardson, Judith Blake
Astra: Ken Fasman

1. One or three? Answer is Three:
The first discussion reaffirmed that the Ontologies will be developed independently. In the future, we may explore edges between the independent ontologies, but not now. After stabilization and initial annotations of product to list terms, we will see whether to combine all together, what the levels of annotation are by species, and see about establishing edges between the three sets of terms.
- don't get into describing 'location' as part of 'function'
- three ontologies are a) function b) process c) subcellular location (but see below).

2. Gene Function vs. Gene Products?
Recognized confusion in gene product/complex listings in function list. Agreed to move protein complexes and their components to the 'cellular' list. Renamed 'cellular location' to 'cellular components'. Decided not to create a fourth category, but rather to think of 'cellular' syntax as 'is located in' with synonym 'is subcomponent of'. Recognized a subtle distinction between something that is 'part of' a macromolecule and something that is 'located in' i.e. 'part of' the nucleus. Still, decided all would be placed in this category. This is a system designed to deal with a state of incomplete knowledge. Cell is composed of the whole set of product complexes.

3. Define syntax:
a. function ... 'is a', hierarchical ...
b. process ... 'is a subprocess of' but may also list 'is an instance of', a DAG
c. component ... 'is located in' synonym: 'is subcomponent of' but may also list 'is an instance of', a DAG.

4.
Specific function syntax ...
a) Not for ... Drop the 'Not for Mus, Not for Drosophila' distinctions
b) Precursors ... receive function process annotation of mature protein.
c) Facultative/Obligatory ... 'maybe part of', 'sometimes part of', 'part of under certain conditions': some things are present at some times, not at others, or we don't know. Some things spend some of the time in one compartment and some in the other (cell cycle proteins). Decide not to annotate these distinctions. Decide function ontology will be a straight 'is a', no qualifiers.

5) Cell is a generic cell, will not divide by cell type. Generic gene products will be listed as needed. Example: alpha tubulin ... yeast has two, elegans has 6, some are there in mitosis, some are not. What are the leaves of the trees? Are they nodes representing genes from all species? So, we've thrown out individual gene products except as needed, but gene complexes are still here. Not creating a laundry list of gene products of each species, but adding component parts as necessary.

Example 1.
a) cellular component: gene product A, gene product B, gene product C
b) function: alpha tubulin
c) process: mitosis, axonal transport

Not going to put in the relatedness between complex, process, and function until we have some data and a better understanding of how this will work.

6) If you have process, function and localization, it's almost as good as having a small paragraph about the gene. (example of micro-array paper)

7) Implementation Plan
a) Michael does revision of current version
b) All participants edit list very carefully
c) Suzi assigns GO numbers 'for real'
d) Start alpha annotations
e) Future terms added through 'ontology manager', currently Michael.
8) Immediate future
a) requires prototype funding (possibly from Astra via Ken Fasman, other ideas too)
b) each database needs one curator for project
c) need overall curation manager (currently Michael, person would work with Michael)
d) each database needs capital money for computer 'kit'
e) need money for meetings/travel. two meetings/year, alternate coasts
f) need programmer, ultimately 2, one for database DBA, when db created, second for interface, tools.

Astra is buying our commitment to annotate our databases to the common vocabulary ... looking for increased usability and searchability of each individually and collectively.

===================================================================

Minutes of the GO Project Meeting

The GO Meeting was held 17-19 May 1999 at the Banbury campus of CSH.

In attendance were:
FlyBase (Michael Ashburner, Suzi Lewis)
SGD (Mike Cherry, Midori Harris)
MGI (Janan Eppig, Judy Blake, Joel Richardson, Martin Ringwald, Allan Davis)

A Summary of the Meeting:

The outlined agenda for the meeting was:
1. semantics
2. species specificity
3. content
4. implementation
5. software
6. resources

1. The GO Project is recognized as a shared, pragmatic database resource involving three separate ontologies (Gene Function, Process, Cellular Component) that represent independent structured sets of terms for performing biological queries across different species genomic databases. It is not a definitive phylogenetic classification system of biology. The current GO Project is composed of three Organism Databases: FlyBase (Drosophila), SGD (yeast), and MGD (mouse). It is hoped that additional organism databases may subsequently join. Each Organism Database will annotate their genes to the three ontological categories and deposit these results in a Universal GO browser (currently being constructed by Suzi). Concurrently, each Organism Database on their own will use their annotations in whatever way they see fit at their own web sites.
The three database (DB) groups are anxious to start implementing the project. It is felt that once people actually start 'getting their hands dirty' by annotating and testing out the system, we will recognize potential pitfalls and successes and will see areas of the ontologies that need to be developed further. 2. The system will be structured as: a Master GO List with assigned GO# identifiers. This Master List will be made available to all DB. Any curator can add to/modify this list (see below). All additions or modifiers will pass through a Central Manager (a biology-trained individual) who will then update the Master List accordingly. The Master List is anticipated to go through a period of flux during the initial phases and when any new organism joins the group. It will be the responsibility of each DB to regularly check in with the Master List and read all communications between DBs so that everyone is 'on the same page'. 3. It was recognized that this project is a WORK IN PROGRESS: meaning, it is a dynamic system where things will be added and changed as curators see problems and concerns. With subsequent biological knowledge, a re-organization of the hierarchies may even become necessary. Provisions for this re-organization will be made by the following: each DB group will initially be assigned a block of 'free' GO identification numbers (GO#) which can be used to add or modify new terms to the structured GO list. (New terms should use unique words/phrases). Any such modification will be simultaneously submitted to a Central Manager and representative of each DB via email/XML working from the current GO List. We anticipate that these initial modifications will be at lower levels of the hierarchy and not include any major re-organization that could directly and immediately affect the annotations of other DB. 
If that happens to be the case, the curator is responsible for FIRST contacting the Central Manager for inquiry, discussion, and approval PRIOR to initiating or submitting any changes to the GO list. Since a Central Manager has yet to be hired, it is prudent for curators not to initiate any major renovations to the GO List for quite some time. 4. Common Sense & Individual Responsibility must rule here: that is, each group is responsible for reading the corresponding emails submitted by each DB; each curator should check in frequently with the modified GO list on a regular basis; if a curator anticipates to spend a long time modifying some structure of the GO List he should alert all other members and then work as quickly as possible to update the changes. Effective communication is key to the success of this project. 5. Each DB will annotate their genes/gene products to any 'depth' they see fit. This should include annotating genes to complexes and not subunits of those complexes, since the cross-species homology will not necessarily hold true to such detailed levels. Once the practice is put in motion, curators will develop a better feel for how 'deep' they should be annotating based upon the work of other curators from other DB (see below). 6. Currently, the GO List records some species-specific GO terms. With subsequent biological knowledge, it may be found that these are no longer species-specific but shared amongst different organisms, thus requiring a name change. Currently species-specific terms will only be updated when it is shown that the term is not species specific. If a species does not have a gene annotated to a specific GO term, then either it is not present in the species or it hasn't been found yet. 7. Structures will have to inherently change with the discovery of new genes that do not necessarily follow the currently established hierarchies. 
Curators are to be aware of the "True Path (Judy's) Rule": the pathway up the hierarchy must always be true. If a new gene is found to break this rule or species-specificity becomes a problem, a restructuring of the hierarchy should occur by adding more nodes and connecting terms that create a new path to fulfil the trueness of the upward hierarchy. When a term is added to the Master GO List, the curator needs to add all of the parents and children of the new term.

A suggestion was made to simplify the hierarchy: we might consider throwing out the "part of" GO items and instead use only the "is a" GO terms. After discussion, it was realized that too much information would be lost by eliminating the terms. The GO List will maintain both "is a" and "part of" terms.

The example used to work through this effort was that of the Process Ontology for the gene product 'chitin'. Chitin metabolism is a part of cuticle synthesis in the fly, and part of cell wall organization in yeast. As a result of the above discussion, the parent 'chitin metabolism' will now have daughters 'cuticle chitin metabolism' and 'cell wall chitin metabolism', with the appropriate catabolism and synthesis terms underneath them.

chitin metabolism
  chitin biosynthesis
  chitin catabolism
  cuticle chitin metabolism
    cuticle chitin biosynthesis
    cuticle chitin catabolism
  cell wall chitin metabolism
    cell wall chitin biosynthesis
    cell wall chitin catabolism

The procedure to add terms is made particularly difficult because the Process Ontology is a DAG, and given the current state of knowledge, it is volatile. We need an automated procedure to add all the arcs when necessary to expand the structure. A tool is needed that will
1. not allow bad paths (solution: add extra nodes)
2. let curators see all paths
3. given a curator decision, automate a split

8. A CVS file system will be used to process changes made to the GO list.
This will automatically record the curator's name & date, and the curator must provide a succinct reason for implementing the change. The CVS client server will be set up by Mike Cherry: accounts will be set up for each individual curator for access and modifications. Additional curators may join by submitting a User name to Mike.

9. An idea was proposed that the GO list hierarchy encode next to the GO terms the number of genes specific to each organism DB listed underneath that particular GO term. For example:

%DNA metabolism (M-3, D-9, Y-5)
 %DNA replication (M-1, D-4, Y-3)
  %DNA dependent DNA replication (M-2, D-5, Y-2)

Meaning that there are 3 mouse (M) genes, 9 Drosophila (D) genes, and 5 yeast (Y) genes under the entire category of %DNA metabolism, and that they can subsequently be broken down further to each subordinate GO term for finer resolution. This should be helpful to curators in understanding to which level of 'depth' other curators are annotating.

10. EC#s should be kept in the GO term lists because they provide a searchable technique for curators during annotation.

11. The following ideas and agendas were proposed for goals within a 1-2 week (?) time frame from the end of this GO Meeting:
a) Stanford curators stop GO and send an updated file to Michael Ashburner.
b) Michael Ashburner parses v. 0.2a7 ---> v. 0.9
c) v. 0.9 ---> Suzi for syntax check ---> assign unique GO# to all terms (v. 1.0); and parse into XML
d) CVS established/tested by Mike Cherry; as well, Mike will try to register the web site www.gene.org if still available; if not, other suggested names: www.genestogo.org or some combination of the words GO and gene (GOgeneGO; GOgene; geneGO; etc....)
e) all GO again.

Independently, Suzi and Joel will determine necessary XML syntax and processes.
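The per-organism counts proposed in item 9 could be produced roughly as follows. This is only a minimal sketch, not an agreed tool: the in-memory tables CHILDREN and DIRECT, the function names, and all gene identifiers are hypothetical and chosen for illustration; only the term names and the "(M-n, D-n, Y-n)" label style come from the example in item 9. Genes are held in sets, so a gene annotated to both a term and one of its descendants is counted once.

```python
from collections import defaultdict

# Hypothetical ontology fragment: parent term -> child terms.
CHILDREN = {
    "DNA metabolism": ["DNA replication"],
    "DNA replication": ["DNA dependent DNA replication"],
    "DNA dependent DNA replication": [],
}

# Hypothetical direct annotations: term -> {organism code: set of gene ids}.
DIRECT = {
    "DNA metabolism": {"M": {"m1"}, "D": {"d1", "d2"}},
    "DNA replication": {"M": {"m2"}, "Y": {"y1"}},
    "DNA dependent DNA replication": {"M": {"m2", "m3"}, "D": {"d3"}, "Y": {"y2"}},
}


def genes_under(term):
    """Collect genes annotated to `term` or to any of its descendants, per organism."""
    collected = defaultdict(set)
    stack = [term]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:  # in a DAG a node can be reached by several paths
            continue
        seen.add(current)
        for organism, genes in DIRECT.get(current, {}).items():
            collected[organism] |= genes
        stack.extend(CHILDREN.get(current, []))
    return collected


def count_label(term, organisms=("M", "D", "Y")):
    """Format a term with its per-organism gene counts, e.g. '%term (M-3, D-3, Y-2)'."""
    genes = genes_under(term)
    counts = ", ".join(f"{code}-{len(genes[code])}" for code in organisms)
    return f"%{term} ({counts})"


print(count_label("DNA metabolism"))  # %DNA metabolism (M-3, D-3, Y-2)
```

Counting over the whole subtree, rather than only direct annotations, matches the reading "under the entire category" given in item 9.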
Between June 1999 and the next meeting, two stages will be implemented:

STAGE 1:
CVS established
initial curation: annotate as many genes as possible (FlyBase hopes to get 3000-4000 genes done; are other DB up to the challenge?)
XML exports to Suzi established
a Central Manager will be hired

STAGE 2:
a working database
something of real functionality for the public: "genes to GO"

12. Making the database available to public users should coincide with a descriptive (promotional) write-up of the GO Project in a widely circulated genetically-oriented journal, such as Trends in Genetics. Members of the GO Project should be thinking of ideas for this paper and what they would like to see in it.

13. Janan will meet with Ken Fasman (Astra) and Lisa Brooks (NIH Program Officer) at the JAX MGI Advisory Board meeting in early June 1999 and discuss initiating a co-operative (MGI, SGD, FlyBase) grant for submission in November 1999. After this meeting, Janan will update other members of the GO Project on the ideas for the grant via email correspondence.

14. Resources: $200K from Astra '99 and a promise of $200K for '00. Michael Ashburner will have the money established in an EBI U.S. account in Cambridge Trust Company.

15. The next GO Meeting is scheduled for 6-9 October 1999 to be hosted by MGI in Bar Harbor, Maine.

===================================================================

GO MEETING - The Jackson Labs. Oct 7-8 1999.

PEOPLE

MGD
Judy Blake
David Hill
Joel Richardson
Martin Ringwald
Janan Eppig
Charlie Ray
Ben King - Mouse sequencing
Jeff Davies
Richard Balderelli
Allan Davis

SGD
Andrew Kasarskis
Mike Cherry
Midori Harris

FB
Heather Butler
Michael Ashburner
Suzanna Lewis

Astra Zeneca
Michael Rebhan

AGENDA
1. Current CVS/annotation of GO
2. Putting sets together for common query interface
3. Publications
4. WWW pages
5. Other collaborations
6. Funding & resources
7. People

MINUTES

1. Progress

FB/Berkeley.
Nothing new on software; New versions imported into query tool.
FB/Cambridge
Report on progress of attribution in FB. About 1700 done. Celera annotation plans were reported. It is hoped that they will use GO for functional inference. FB to get its reference CDS set of genes GO'd by November 7. (Ashburner/Heather).

SGD
Midori annotating yeast genes with GO, done about 300 plus tRNAs. Also doing gene summaries of each gene in SGD. Have about 3000 to do. GO query tool for internal use on www for curators; better diff files.

MGD
Allan and David Hill have been doing assignments. Detailed hand annotation with MLC and GXD - have to write detailed reports on genes and then add GO terms. At the same time do first pass "GO-FISH" - have mapped 3,000 genes with GO terms. Also mapped via EC numbers. Had not been using CVS but keeping a file of changes. Mapping SWP Keywords to GO terms - done to letter 'E'. 650 SWP Keywords that seem to be relevant to GO. 40-50% map directly to GO. David will [or could !] finish within a week ! dph@informatics.jax.org - David Hill
MGD now beginning to use CVS (Allan)

For CVS problems: mark@genome.stanford.edu
Use "update" rather than "checkout".

Agreed number series for new terms:
SGD 0000001-0001500.
MGD 0001501-0003000.
FB 0008001-0009500.

2. Putting sets together:

What we are using now:
FB tagged value format
SGD tabbed list
MGD Excel file

Evidence statements - MGD argue for "stated by author". Following agreed as valid values:
IMP inferred from mutant phenotype
IGI inferred from genetic interaction {with }
IPI inferred from physical interaction {with } **note we changed this from protein interaction
ISS inferred from sequence similarity {with }
IDA inferred from direct assay
ASS author said so
NA not available

Evidence must not be null, even if the record is "not available".

We now want to agree on a tab delimited format - which SL can parse into XML.

MEOW Core
database. [mandatory] cardinality 1 ; controlled: MGI, FB, SGD
gene symbol.
[mandatory] cardinality 1
gene symbol synonym. cardinality 0, 1, >1 [white space allowed]
gene name. cardinality 0, 1 [white space allowed]
gene identifier. [mandatory] cardinality 1
chromosome. cardinality 0, 1
map position. cardinality 0, 1
short gene description. cardinality 1
db xref, NA, protein. cardinality 0, 1, >1

GO add-on
GO id. [mandatory] cardinality 1, >1
reference id. [mandatory] cardinality 1, >1 ; must be within domain of database identified in MEOW core
evidence. [mandatory] cardinality 1, >1 ; controlled, see above
aspect. cardinality 1 ; controlled F|P|C

DB,Gene_id,Gene_symbol,GOid,ref(|refs),evidence(|evidence),aspect,name,synonym(|synonym)

tab delimiter between fields (NOT commas)
within field delimiter is |
hard return at end of record
ascii

SGD_GO_files/gene_associations
MGD_GO_files/gene_associations
FB_GO_files/gene_associations

SGD & FB do a remove of old versions before committing new.

At this stage other data will not be dumped by contributing databases to GO.

2. Query/Editor tools/databases.

Private editorial tools
Local editorial interface to modify GO (ie to replace CVS) - but changes to go to editor for commitment. Stanford work on editor tool.
How do we compare for internal purposes between collab. d/bases ?

Public tools
At local sites [responsibility of collab d/bases]
Cross-genome
Data base
Servlet ? or other performance enhancement
Improved query database
GO query tool must have comment to GO email button (at first to all of GO list, so that we can all see what is going on).
Each database should implement its own query tool for GO. - all

3. WWW

Mike has registered: www.geneontology.org & www.genename.org
We agree to use geneontology.org as prime address and to close down the existing ebi and fruitfly sites (these then point to geneontology.org).
Need a top page - Cherry
Suzanna to check that the Query applet can run from this new web site. - Suzi
Suzanna will activate URL hyperlinks from query report.
- Suzi
Needs url syntax for MGD (see MGD Tools for Developers on home page - or contact Joel) and for SGD (contact Mike Cherry).
Tree will show number of gene_associations per node.
The CVS can automatically update the text files and automatically write a new version and date at top of file - Cherry

ftp
- three ontologies in both hierarchical and xml (rename "compartment" as "cellular component" in CVS repository). - Cherry
  will xml files be automatically updated by a script when ontologies are updated ? - yes, but need to look into mechanism - Suzi.
- GO.bib
- GO.doc .. MA to re-write as an html document. Add GXD as collaborator indep of MGD
- GO.defs
- ISMB paper
- geneassociations.fly
- geneassociations.mouse
- geneassociations.yeast

GO query tool from Suzanna
email button for contacts; go to entire list - Cherry
Must change proofs of the SGD/FB/MGD NAR January issue papers, for new url.
MA to write general introduction for web page - Ashburner
MA to update GO.doc - Ashburner
Suzi to give collaborators urls for definitions. (OUP acknowledgement) - Suzi

4. Publications

Where - TIGS .. probably the best for this first paper.
Alternatives:
Genome Research
NAR
Nature Genetics
Bioinformatics
Talk to Roberts about paper for NAR Special Issue for 2001. Ashburner, but next year.
Botstein & Cherry to do a draft then to Allan Davis at MGD - Botstein/Cherry/Allan

4. Other collaborators.

C. elegans - Sternberg's NIH application for WormBase has been submitted - for summer 2000 funding.
Arabidopsis: TAIR (The Arabidopsis Information Resource) - Carnegie-Stanford (science)/NCGR (computing). Started Sept 1, all of old AtDB curators moved over to Carnegie. Chris Town of TIGR is on TAIR grant. MA worried that could be more than one push - TIGR (NSF annotation grant); Mike Bevan at John Innes. Ashburner to follow up.
Monica Riley/Gretta Serres - functional assignments for E. coli.
Need to talk to TIGR about prokaryotes. Ashburner to follow up
Look at TRANSFAC classification.
Incyte collaboration, further discussions with Frank Russo. Ashburner/Suzi
Swiss-Prot. Ashburner

5. Grants.

Janan will lead on an NIH-NHGRI R01 grant - Lisa Brooks - for Feb 1 2000. - Janan
What should we ask for:
curator for MGD
curator for SGD
curator for WormBase ? as supplement
[curator for FB already on MRC grant]
Core:
GO manager/editor
software support
travel/kit

Funding cycle:
FB to 2003 NIH 2002 MRC
SGD 2001 NIH
MGD 2001 NIH
GXD 2000-2005 (NIH Institute of Child Health)
GO 8/00-8/03 ?

Astra-Zeneca: Would Ken be willing to write two cheques, one to EBI and one to UCB since we are the only two who now need to draw on funds ? Contracts between EBI and Jax and EBI and Stanford are academic at the moment. Should we set up a non-profit GO Inc ? Ashburner for action

6. Content

MA to finish Style Manual, work on with Andrew - Ashburner/Andrew
Need to look again at %enzyme - split by EC - what would we lose ? - use classification of substrates imposed on EC ? - Ashburner

7. Next meeting

Feb 24-26 2000 - Boston / Harvard. Talk to Bill. Ashburner
Talk to FCK re: a meeting in Les Treilles. Ashburner
Friends of GO - activate and update - add Mike Rebhan. - Ashburner
bionet.announce when new pages up and data into query tool.

FINAL REMARKS

Substantial progress has been made by all three database groups in implementing GO over the summer. This is very encouraging. Although there have been some areas of GO content that have needed changing (and several that have needed adding, as expected), in general the three ontologies seem to be working rather well. A major message of this meeting is that we must get something substantial in the public view as soon as possible. To this end we have rationalised the web sites for GO and agreed an output format for gene associations to be sent to Suzanna to drive the Query Tool. We have also agreed on a paper about GO for TIGS to be done this year.
We hope that the new web pages with a Query Tool with content can be up in a matter of weeks, tho we know that until mid-November Suzanna and Ashburner are very busy with the fly annotation. =================================================================== Gene Ontology Meeting February 25-26, 2000 at Astra-Zeneca in Cambridge, MA Attendees: Michael Ashburner (FlyBase) Suzanna Lewis (FlyBase) Heather Butler (FlyBase) Judy Blake (MGI) Janan Eppig (MGI) David Hill (MGI) Joel Richardson (MGI) Martin Ringwald (MGI) Allan Peter Davis (MGI) Michael Rebhan (Astra Zeneca) Mike Cherry (SGD) Cathy Ball (SGD) Midori Harris (SGD) Andrew Kasarskis (SGD) AGENDA ITEMS Progress Reports Celera Report Papers Collaborators and Other Projects Ontology Issues Style and Work Practices Tools for GO Questions from Michael R. Plans for Next Meeting PROGRESS REPORTS Mouse folks: Judy Blake has submitted the GO grant. The mouse members have assigned approximately 4500 genes to GO terms. 100 by hand 650 by EC number 1270 using Swiss-Prot 2500 using mouse nomenclature Without counting the Swiss-Prot data, they used 474 Molecular Function terms, 50 Cellular Component terms and 80 Biological Process terms. Since they are using automated annotation, they have performed a variety of quality checks, such as looking for more than one annotation within an ontology. They have come close to exhausting the current automated assignments and are going to be doing more by hand in the future. Yeast Folks: SGD has 1524 genes in the gene association file. About 1000 of these are ORFs and the rest are tRNAs or snoRNAs. All SGD genes have been GO-annotated by hand. Fly Folks: FlyBase currently has about 3000 genes annotated mostly by hand. Heather has worked through the protein kinases and will next tackle the protein phosphatases. Annotation of new genes will be largely done by sequence similarity, while existing genes will be done by hand in related chunks. 
When the Drosophila sequence is released in March, there will be a large amount of sequences annotated to a high-level GO ID. These will be deepened to more specific GO nodes with time. CELERA REPORT GO was used in the annotation of the Drosophila genome at Celera. Suzanna made a dataset with all genes annotated to the molecular function GO and used it for BLAST searches. Usually, the level of GO node was quite high -- only one or two terms from the top. Where experts in a field were expected to be annotating genes, the specificity of the GO nodes used were increased (for example, olfactory receptors). Ultimately, there were 40 bins labelled by GO name (the 40th was "unknown"). Annotators were then able to have a pretty reasonable guess as to the function of the new fly gene. A second binning with biological process and cellular component showed a terrific correlation with the first. About half the genes from the Celera set are associated with a GO term. Since a given gene has a less than 50% chance of having been seen by a human, an association with a GO term is very valuable. FlyBase is still waiting to receive the sequence -- it will be released with the publication of the papers in March. FlyBase will be responsible for updating the sequence in GenBank. PAPERS We agreed to immediately pursue three publications: 1) Nature Genetics solicited a short (2500 word) article from David Botstein. It will be submitted March 10, with a short author list. 2) Genome Biology -- Michael Ashburner has been asked to write a short (1000 word) article for their premier issue. It will most likely have an authorship along the lines of "The GO Consortium". 3) Genome Research -- Judy Blake will adapt the grant to a "big" paper to be submitted to Genome Research. 4) NAR database issue -- We will submit a paper to NAR as a matter of course. The submission won't be until August or September. 
Since there are likely to be changes in the NAR policies, we will discuss the details of the NAR paper at the next meeting. COLLABORATORS AND OTHER PROJECTS There was a great deal of discussion about taking on other organisms and collaborators. The conclusion was that before we take on other organisms, we must first meet the following goals: --We need to be in a database (Suzanna Lewis will be working on this, with help from Joel Richardson). Hopefully, this will be accomplished by the next meeting. See "Plans for Next Meeting" for more detailed steps. --Documentation of philosophy, styles and practices needs to be written to record and communicate our current thinking. See "Plans for Next Meeting" for more detailed steps. --A "GO manager" to coordinate changes to the ontology, arrange training, communicate with all groups, etc needs to be hired. Midori Harris has volunteered to assume the responsibility. Michael Ashburner suggested we have two classes of partners -- the first with write permission and the second without it. These "second class" partners will have to funnel suggestions and comments through a full partner. Other organism groups that have expressed interest include worm, Arabidopsis, and S. pombe. We'll invite a representative from the worm and Arabidopsis database groups to the next GO meeting. Michael Ashburner has received a grant application for "BioBabel" -- a proposal to adopt GO terms within SwissProt, Enzyme Commission and Interpro. Representatives from this group can also be invited to the next meeting. ONTOLOGY ISSUES Methods and practices for editing and maintaining the ontology took up a large portion of the discussions. Conclusions will be listed, and in the cases where the discussion is particularly illuminating, the discarded options will be listed as well. 1) Changes to GO nodes that have multiple parents... When editing one of the ontologies, it is more convenient to add another node in only one position. 
For example, if we start with the structure shown below:

a
 b
  d
   e
   f

If we want to add node 'c' as a parent of 'd' and a child of node 'a', do we need to edit all the appropriate lines, or just one? The group decided to make an "editable" non-redundant version of the ontologies:

Linear, redundant format (for viewing):

a
 b
  d
   e
   f
 c
  d
   e
   f

Non-redundant format (for editing):

a
 b
  d % c
   e
   f

The envisioned procedure is that a curator checks out the compressed, or non-redundant, version and then views an expanded version using a planned tool we're calling "The Validator." When an edit needs to be made to an ontology, it is made in the compressed version and tested with the Validator. The compressed version is then checked back into the cvs. The Validator will be written by Joel, using specifications mentioned later. The web will display the expanded, read-only format.

2) We will add GO id to parent terms. For example, we used to state:
term1 ; GOID1 % term2
Now we will state:
term1 ; GOID1 % term2 ; GOID2

3) GO nodes should aggressively avoid using species-specific definitions. We agreed to substitute "Yeast mating" with "Mating, sensu Saccharomyces." Using the "sensu" reference makes the node available to other species that use the same process/function/component. Each organism database will take care of their contributions to the species-specific language.

4) We will get rid of cellular component references in the function ontology. For example, "mitochondrial primase" needs only be "primase." There are many cases where component terms are appropriate in the process ontology, so those will remain. Michael A. will take care of this.

5) Joel pointed out these logical relationships that we need to make sure are true in the ontologies:
if A is part of B and C isa B, is A part of C? --- YES
if A isa B and B isa C, is A isa C? --- YES
if A is part of B and B is part of C, is A part of C? --- YES
if A isa B and C is part of B, is C part of A?
--- NOT NECESSARILY

Joel will send out a list of the logical inconsistencies that he has detected.

6) An example that got a lot of attention is the case of the mitotic chromosome's location in the cellular component ontology. While the mitotic chromosome resides in the nucleus in yeast, it is cytoplasmic at this stage of cell life in mouse or fly. In addition, many organisms have chromosomes that are NOT located in the nucleus. The solution arrived at was to remove chromosome from the nucleus in general and place the appropriate subsets of chromosomes in the correct place (nuclear, cytoplasmic, mitochondrial).

7) We need to track deleted GO ids. There are three types of things that can happen to GO terms: merging two (or more) nodes, splitting a node, deleting a term.
a. When a term is deleted, we will cut the line out and paste it at the end of the file (or as a child of the parent "defunct", I don't recall the final decision), using the following format: and tags.

11) We currently cannot standardize rules for subdividing ontology terms, but instead will continue to make each decision on a case-by-case basis.

12) Gene products in themselves are not nodes of the function ontology, although doing something with or to a specific gene product can be one. For example, being hedgehog is not likely to be a function, but being a hedgehog receptor or hedgehog receptor ligand are functions.

13) We may eventually need a synonym table to facilitate queries.

14) Changes that need to be made to the ontology to meet the current style include eliminating unnecessary hyphens, adjusting grammar so that "transporters" become described as "transport" and "transporting," and removing words like "protein" and "factor" where we can be more explicit.

15) Heather and Midori will write some documentation about the evidence codes.

16) We need to think about a "best practices" document that will state and explain good work habits for both current and future annotators.
In the meantime, we will share any help documents, such as SGD's "Instructions for Annotating Genes Using GO."

TOOLS FOR GO

1) Database
Suzanna will get a handle on this. The major difficulty has been hiring a programmer. Michael R. offered some help on this from Astra-Zeneca. Suzanna is planning on using MySQL to create a version to distribute from the central site. The schema is not yet ready, but Suzanna and Joel will work on this together. The database will also need to support bulk loading.

2) Validator - Joel will do this
We need a validator to check for:
a. cycles
b. deletion of nodes used in gene association files
c. syntactic correctness (refer to the logical relationships described in the ONTOLOGY ISSUES section)
d. unique IDs
e. warning message of the number of affected nodes
f. orphans
g. new nodes have IDs
Associated with the validator is the ability to compact and expand the ontologies for writing and reading. The validator will run on the central site, as well as locally for checking before an edited ontology is checked back in. Joel plans on writing this in Python, so each site will need to install it.

3) GO BLAST server - Mike C. will take care of this
The GO BLAST server will use a dataset of GO-annotated protein sequences. The results should show each GO node associated with a gene product, as well as a few generations of ancestors.

4) Annotation aids
It would be nice for curators to have a tool that, given a single node, displays all other gene products at that node (and nearby nodes) as well as all their other GO associations. This would assist curators in assigning a gene product to as many GO terms as needed, by showing them all other GO terms that might be related.

5) Suzanna's browser needs to be installed at Stanford, so we can all be using it from the same server.

6) Michael R. suggested we make a link to a DTD (document type definition) file. Suzanna will look into finding a tool that will read the XML and create a DTD file.
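The structural checks in the Validator list above (cycles, unique IDs, orphans, new nodes have IDs) can be sketched roughly as follows. This is only an illustration, not Joel's Validator: the in-memory representation (a list of dicts with 'id', 'name' and 'parents' keys), the function name validate and the short ids used with it are all assumptions made for the example.

```python
from collections import defaultdict


def validate(terms, root_ids):
    """Run a few structural checks on an ontology.

    terms    -- list of dicts with keys 'id', 'name', 'parents' (hypothetical layout)
    root_ids -- set of ids that are allowed to have no parent
    Returns a dict mapping a check name to the list of problems found.
    """
    problems = defaultdict(list)

    # Checks d and g: every node has an ID, and no ID is used twice.
    seen = set()
    for t in terms:
        if not t.get("id"):
            problems["missing_id"].append(t["name"])
        elif t["id"] in seen:
            problems["duplicate_id"].append(t["id"])
        else:
            seen.add(t["id"])

    # Parent lists keyed by id (nodes without an id cannot be checked further).
    parents = {t["id"]: list(t["parents"]) for t in terms if t.get("id")}

    # Check f: a non-root node with no parent is an orphan; a parent that is
    # not itself a node is reported separately.
    for tid, ps in parents.items():
        if tid not in root_ids and not ps:
            problems["orphan"].append(tid)
        for p in ps:
            if p not in parents:
                problems["unknown_parent"].append((tid, p))

    # Check a: depth-first search; reaching a node that is still being
    # visited means the child-to-parent edges form a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {tid: WHITE for tid in parents}

    def visit(tid):
        colour[tid] = GREY
        for p in parents[tid]:
            if p not in parents:
                continue
            if colour[p] == GREY:
                problems["cycle"].append((tid, p))
            elif colour[p] == WHITE:
                visit(p)
        colour[tid] = BLACK

    for tid in parents:
        if colour[tid] == WHITE:
            visit(tid)

    return dict(problems)
```

An empty result means none of these checks fired. The remaining items on the list (nodes used in gene association files, counting affected nodes, syntactic correctness) need the association files and the flat-file syntax, so they are left out of this sketch.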
ANSWERS TO QUESTIONS FROM MICHAEL R.

1) GO ids will be stable. They may be "defuncted", but they will not go away.
2) "is a" and "part of" are likely to be used for quite some time. However, "part of" means "can be a part of", NOT "is always a part of."
3) Incyte still expresses interest, but that's all we've received from them.
4) Homepage recommendations -- Mike C. will add text from the grant to give the homepage more detail. It might also benefit from the addition of statistics from the gene association files.
5) Should we have an ftp site that allows one to download the most recent version of GO?
6) Michael R. will create a FAQ to be linked from the home page.
7) Mike C. will put SGD's PowerPoint GO presentations on the GO site.

PLANS FOR THE NEXT MEETING

The next GO meeting will be in Cambridge, UK, June 29 and 30. The plans are:
1) Have documentation ready
a. GO philosophy document (Michael A., Judy and Midori)
b. Rules for making changes to GO (Michael A. and Andrew)
c. Rules for applying GO terms -- this is currently project-specific. Each project needs to think about this and bring something to the table next time. This should also include particularly illuminating examples, such as chitin synthesis and mitotic chromosomes. It should also emphasize how to avoid making GO nodes too species-specific, and mention the logical aspects of inserting or moving nodes.
2) Invite representatives from BioBabel, Arabidopsis, and C. elegans.
3) Have database in place
4) Create programs described above
5) Work HARD on adding more GO definitions. We have permission to use the Oxford Dictionary of Biochemistry and Molecular Biology.
6) Make the ontology edits mentioned above
7) Write three (!)
papers

History
We need to establish a FAQ page
We need to arrange for introductory sessions for new groups

Status
An assumption built into this is that if a term is associated with a gene product, it necessarily follows that all parent terms are also an accurate and truthful description of that gene. The structure of the ontology is tested and validated continuously as the curators assure that the parents, parts and GO terms are all true.
* Yeast: 1800 genes associated to GO, representing half of the total number of yeast genes that have a name; evidence codes for almost all of them
* Fly: automated annotation at the jamboree, but not submitted to GO until curators validate them.
* Mouse: mostly done automatically until now; see the handout for the numbers. When conflicts arise they try not to change the ontology unless they have to. This is done by going up to a broader term. Moving to hand annotation, particularly for new genes.

Tools and Common resources
* Now available: John's web browser (www.informatics.jax.org/~jpc/GO), modeled after MESH, and Brad's browser (www.fruitfly.org/~bradmars/cgi-bin/go.cgi), which is running off the Informix database
* Database (Informix, MySQL, and Oracle?) implemented, and there are Perl object methods in the repository. We will be writing updates to the ontology in the database after the fall meeting.
* Ontology editor, first priority. Suzi (et al.) to do by next meeting
* Mike C. to use Ian's scripts to automatically perform regular validation for the text version until the editor is ready.
* Merge the two html versions of the browser (Brad and John)
* Suzi/John to fix the Java browser and decide to either pull the plug or continue development. Add link to Java help page
* Each organism database to provide a fasta file of protein sequences for those gene products that have been annotated. Suzi (et al.)
will set up blast search services for GO * API to be refined as applications are developed * Steffan and Heather to work up prototype for next meeting of rules between the separate ontologies * Definitions, Michael is to contact Julian Dow for definitions from Dictionary of Cell Biology * Mike to e-mail style manual to Michael, who will then check it into CVS * Suzi/Brad to clean up XML version Content * Use part-of relationship to solve the 's/t/ protein kinase' (multipart protein) problem. E.g. %s/t protein kinase 20 groups worldwide to handle nomenclature for Candida. The Candida sequence is close to being completed. The pombe sequence is completed and the paper has been submitted for publication (Sanger group). A GO annotation set has been submitted with external references being Swissprot IDs. Progress Report - MGI (David Hill) GO Browser now publically available for MGI records (created/implemented by John Corradi) Gene-by-gene annotation of 15,000 genes effort is concentrated on moving IEA evidence to ISS evidence (or greater) updated annotations now being downloading to GO webpage every Friday Major Revision of GO subsections Process Ontology: apoptosis (with Flybase/Heather) Progress Report - TAIR (Leonore Reiser) Major Revisions Working (with Ji Yoon) on the introduction of terms into the three Ontologies to support Arabidopsis in the first instance, Plantae in general as a followthrough Plant terms - 162 total to date, 63 in the Component Ontology (50 with definitions) Expressed strong support for visual-based ontology curation and browsing tools Thesauri Worked on EGAD plant <> GO associations A contact has been initiated with Maize DB, but little feedback has been received to date (is this contact via Susan at Cornell?). 
A contact has been established at the Carnegie Institute within the Rice community regarding working with GO and Cyanobase A plant-related mailing list has been initiated The notion of a 'plant clearinghouse' for GO annotation is embryonic and being organized by Lenore and Richard Burkerist (formerly of the Sanger). Progress Report - Wormbase (Erich Schwarz) Erich is Paul Sternberg's first employee Wormbase growth Wen Chen (will be adding expression data to Wormbase AceDB) Raymond Lee (will be porting Wormbase from AceDB to other platforms) to fill one more curator's slot by summer & one db programmer Erich will be the GO-responsible party AceDB is beautiful for assisting in positional cloning. What is the relationship with Lincoln Stein? Invaluable He is on the grant that currently underwrites Wormbase Essential at the level of asking Lincoln to do things .. Lincoln put together the Wormbase.Org site at Paul Sternberg's request Eventually Wormbase will move from Cold Spring Harbor to CalTech relationship to Proteome ... they are using GO ... ?in the human curation? General feeling is that the scientists have lost control of Proteome, the businessmen have smelled profit. Proteome is helping out Worm by providing Worm database definition lines (?). relationship with the Sanger ... John Hodgin (gene mapping db) ... Sanger/WashU sequence feed ... very complex interaction map with other smaller dbs ... Richard Durbin involved as a subsidiary PI, like Lincoln Stein goal ... weekly updates rather than once every 3 months Progress Report - Prokaryotes & Protozoa Charlie Hodgkin (Glaxo; used to write software for Michael) initially contacted Michael and expressed his intention to use the GO for annotating E. coli. This launched an e-mail flurry that eventually led to interaction between Michael and Heather, and Monica Riley and her postdoc Greta Serres (around ISMB '99). Charlie, with Monica's support, is quite interested in putting the E. 
coli GO annotations in the public domain, but there is no indication of when this might be completed. Michael and Heather have mapped Monica Riley's latest (non-GO) classification to the GO, but it cannot be publicly released. This mapping required the addition of many terms to the GO and has set up the GO for use for most enteric bacteria. Monica has a great deal (10 years of work) invested in her classification scheme and has a great deal of interest in seeing a proper mapping/merge between GO and her scheme. Michael and Heather have also obtained from Monica Riley the Genprotec enzymes list (a list of E. coli proteins), and this has been parsed into the Function Ontology. The situation around EcoCyc is complicated and it is unclear when EcoCyc <> GO mapping might be done. EcoCyc ownership is being resolved between DoubleTwist, Pangea (now DoubleTwist), and SRI; an NIH grant (content unknown) is being held up pending resolution. There is an interest at Stanford in annotating cyanobacteria with GO, but it is unclear where this is going in the near term. A previously noted protist interaction is dead (Russ Altman?): a Canadian group has recently become interested in annotating protists with the GO in conjunction with some large scale sequencing. There is currently no activity on the horizon for annotating viruses with the GO. The group has not looked at the 'minimal microbial genome' to see if it is fully represented in the GO. Michael would like to see Pseudomonas included (for its xenobiotic metabolism) and Streptomyces (for its antibiotic biosynthesis). Three groups have independently applied for funding to the Wellcome Trust for the sequencing of 3 different protozoa (Leishmania, Trypanosoma cruzi, and Trypanosoma brucei), in close collaboration with a new Plasmodium database at the University of Pennsylvania.
Al Ivans (Sanger), who I presume is attached to one or more of these, is interested in using the GO in their gene annotations, though nothing is firm to date General message: would like to identify communities, work to build a foundation for them, then invite them to build on that foundation Software Developments A relational database containing the GO terms is now available. It was unclear to me whether this is in Oracle in addition to MySQL, but that is not really relevant at this juncture. The time is imminent when the following will be available to curators: direct writes to a pre-production (development) database instance elimination of duplicate ID and basic syntax errors owing to data integrity functions (currently no data integrity ... everything done as flat textfiles). Curation interface with undo, DAG view, and commit functionality (see below) Rollback to previous version, owing to audit trail functions Java Browser (John Richter) Online at fruitfly.org Demonstration was very well received by participants A lot of effort has gone into making the browser platform independent All GO terms in all three Ontologies are found in a single tree (one window in Browser) Clicking any term brings up a minimal DAG in an accessory window Queries run as Perl5 regular expressions Query Window supports either Boolean or Perl5 regular expression queries After pulling a gene product, can click on a term block and thereby highlight the terms in the Ontology Tree Window; the minimal DAGs are also shown for these terms in an accessory window The number of gene products is no longer indicated in the interface Java Editor (John Richter) The editor will have the same look and feel as the Browser when completed Editor runs as an application rather than as an applet; it is available under CVS and must be installed to be used GO ID is fixed; new terms get automatic assignment (there was brief discussion around how to assign ID's which did not reach final resolution; currently, ID 
ranges are apparently assigned to curation groups)
Structural Edits (edits to the DAG structure)
simple select/click/drag to change structure
'infinite' undos
term-merge by click/drag; one term lives on while the other is obsolesced and becomes a synonym of the live term
term-splitting enabled (do not recall details)
term obsolescence is automatic if all children are obsolesced
cyclic graphs are not automatically disallowed; thus curators need to avoid descent along obsolesced terms
For searches, obsolesced terms should all be treated as leaves
Rollbacks: In theory, 'infinite' rollback is possible, but there is currently no 'History Viewer' and John suggested that design of one would be significantly more complex than either Browser or Editor.
Logic Validation is currently not within the remit of the Editor
Erich: would like to have a visual cue that indicates which terms have been changed recently
Provides for tagging a term as part of GO Slim, and the capability of adding any number of additional segregation tags in the future.
Import & Export of subtrees will soon be available (a very useful feature)
A Gene Product Viewer will be added in the near future, the nature of which is currently unknown.
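The term-merge behaviour noted above (one term lives on, the other is obsolesced and becomes a synonym of the live term) and the rule that searches treat obsolesced terms as leaves can be illustrated with a small sketch. It is not the Java Editor's code: the Term class, its field names and the two functions are hypothetical, and the notes do not say what happens to the children of the obsolesced term, so this sketch leaves them where they are.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    # Hypothetical in-memory record for one ontology term.
    id: str
    name: str
    parents: set = field(default_factory=set)
    synonyms: set = field(default_factory=set)
    obsolete: bool = False


def merge_terms(terms, keep_id, drop_id):
    """Merge the term drop_id into keep_id.

    The kept term gains the dropped term's name and synonyms as synonyms.
    The dropped term is flagged obsolete but stays in the table, so its
    ID is never reused or lost.
    """
    keep, drop = terms[keep_id], terms[drop_id]
    keep.synonyms.add(drop.name)
    keep.synonyms |= drop.synonyms
    drop.obsolete = True


def children_for_search(terms, term_id):
    """Return the sorted ids of the children a search may descend into.

    An obsolete term is treated as a leaf, so it reports no children.
    """
    if terms[term_id].obsolete:
        return []
    return sorted(t.id for t in terms.values() if term_id in t.parents)
```

Because the obsolesced term keeps its record, a lookup by its old ID still succeeds and can be redirected to the live term through the synonym.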
Technical Discussion: aiming for a Generic Ontology Builder
The current Editor is componentized
The DAG component is generic, not specific for GO
Using Java components has allowed fancy click/drag functionality, including a smooth table resizer
All components are available on the CVS repository, with DOCUMENTATION (fancy that)
The GO Schema is available at www.fruitfly.org/annot/go/database/index.html
People are encouraged to write ports to different database formats and submit them to the GO Database mailing list
HTML Browser (Brad Mars)
The HTML Browser is running off of the GO Relational Database (Chris Mungall's Perl API) rather than off of the GO Flatfiles, as the Java Browser does
currently found at http://www.fruitfly.org/~bradmars/cgi-bin/go.cgi?accession=3700
Similar (though generally weaker) functionality to the Java Browser. Plans are to increase functionality to be on par with the Java Browser
BLAST Server (Suzanna Lewis)
There is an obvious need for the ability to BLAST against organism sequence sets using sequences retrieved via GO searches. An unasked question: why do they not rely on the archival databases to serve BLASTs?
There was some discussion around whether to have 'all 400 versions of ADH' included or not
The current thought is to have protein sequences in fasta format deposited on the GO website and available for BLAST searches
In the case of MGI, the plan is to submit a single protein sequence per gene. This single sequence will be the primary Swissprot sequence that represents each gene product. MGI annotation is to the level of the gene, not to the gene product; therefore, spliceforms and post-translational modifications do not matter that much to them at this point in time.
In the case of yeast, about 2000 yeast protein sequences have already been provided. The fasta header line for the yeast sequences was discussed as a standard.
general syntax : key:value[space] (key order not constrained) 'provide everything that you can provide, the more the better' community name for the gene (that used locally within a particular domain) gene name (the generally accepted, or global name for the gene) tab delimited list of GO ID's one or more external database cross-references Timing, Resource, and Prioritization Issues The primary software developers associated with the GO Consortium are John Richter and Chris Mungall. John is the primary driver behind development of the Java Browser and Editor, which appear to be the currently favored for GO delivery to end users and curators. About 3 weeks of work remain (according to John) to reach a final stage on Java Editor & Browser John is currently under obligation for development of an open-source gene annotation platform, Apollo, and timing for completion of software development for the GO Consortium is currently critically dependent upon his obligations to Apollo. Also competing for resource is a Fly Genome Reannotation project that is on-going. There is significant concern (voiced by Michael) around the absence of rigorous syntactic control and internal checks. John indicated, though, that syntactic checking algorithms are 'the stuff of papers' and were going to devour the majority of the development time. (I am quite fuzzy on exactly what the issue is here). The concern was voiced that the GO DAG could be really screwed up if the Editor is used improperly ... powerful tools mean powerful edits. However, the potential for rollbacks really softens the potential impact of this problem. Suzanna voiced the opinion that the priority lay in getting out of flatfile mode and into a relational database environment. It seemed that most everyone agreed with this. A key untackled problem appears to be 'how to commit DAGs to a database,' which apparently both John and Judy are thinking hard about. (again, I'm not quite clear on what this issue is here.) 
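Returning to the fasta header syntax noted at the start of this section (key:value pairs separated by spaces, key order not constrained, GO IDs as a tab delimited list), a parser could look roughly like the sketch below. The notes do not record the actual key names, so the keys 'gene', 'name', 'go' and 'xref' are placeholders invented for the example, as is the function name parse_header.

```python
def parse_header(line):
    """Parse a fasta header of the form '>key:value key:value ...'.

    Keys may come in any order and may repeat (for example several
    cross-references). Returns a dict mapping each key to a list of values.
    The key names are not fixed by the minutes; 'go' is assumed here to be
    the key that carries the tab-delimited list of GO IDs.
    """
    if not line.startswith(">"):
        raise ValueError("not a fasta header line")
    fields = {}
    # Split on single spaces only, so tabs inside a value are preserved.
    for token in line[1:].rstrip("\n").split(" "):
        if not token:
            continue
        # Split on the first colon only: values such as GO IDs contain colons.
        key, sep, value = token.partition(":")
        if not sep:
            raise ValueError("expected key:value, got %r" % token)
        if key == "go":
            fields.setdefault(key, []).extend(v for v in value.split("\t") if v)
        else:
            fields.setdefault(key, []).append(value)
    return fields
```

Splitting each token on its first colon is what lets values that themselves contain a colon, such as GO IDs or database cross-references, pass through unchanged.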
Interim Solution during resource-limited period
John will do enough so that the Java Editor is available and will produce flatfiles; the flatfiles will be fed to the GO via the currently implemented CVS system.

Downloads from the 'outside'
Data is currently provided to external parties on an ad hoc basis. The MySQL database can be downloaded with the following caveat: 'If you make any changes, do not publish this database'. Among the groups who have downloaded it is AstraZeneca (Bo Servenius, Lund, Sweden). Presumably, people are writing applications on top of the database, so John is trying to give a lead time on large changes. A Perl Development Kit has been assembled to assist in the loading/updating of external representations. A defined update/error reporting process has not been set up to date. However, each of these is currently handled through the e-mail lists (GO-FRIENDS, GO-DATABASE, GO-DIFF). The contact points should probably be Chris Mungall and John Richter. Send format problems to John Richter (http://www.fruitfly.org/annot/go/) and suggestions to John at go-admin@bdgp.lbl.gov
Right now, remote access is relatively slow, presumably because there is lots of SQL being done remotely. John and Chris have some ideas bouncing about regarding 'fast access to GO Objects'. Consult John on this matter.

Discussion around the Editing Process (various) (see also Software Development: Java Editor)
Judy: a database administrator would help to stem the proliferation of terms and provide some control over the process
Still trying to resolve the issue of how to distribute editing to a database globally across groups and disciplines. A software tool is emerging (Java Editor, design by John Richter) but the issue of editing control is still being debated.
Michael: Minor Changes - just commit them; Major Changes - consult the group via e-mail and commit after a waiting period (currently ~10 days); Really Major Changes - talk in person.
Michael does not really want to see the current process changed. There was some discussion between Michael and Judy around similarity between the GO and extant nomenclatures, where committees meet to commit changes, but Michael would really want to avoid the potential emergence of bureaucracy, favoring maintenance of the present 'sociology' around the GO Consortium. It was generally accepted that there should be both a Working Database (Development) and an Authoritative Database (Production). There was some split of opinion over whether the Working Database should be publicly visible, but it appears that it will be made so, the argument being that the more skilled eyes that are looking at the work in development, the better. Schema Discussion, with emphasis on DBxREF A discussion ensued around whether the DBxREF field should refer to Terms and Definitions in the aggregate or separately. This discussion gets to the issue of whether a particular piece of information derived from a Source should be traceable to the Source or not. The current record structure looks like this: GO ID Term Definition Definition Refs Synonyms Synonym Refs DBxREFS (+ Boolean; supports Definition, Term, etc.) Suzanna led the discussion around proposed changes to the Schema. suggest adding a 'subclass' table to support tags that allow quick segregation of the GO into subsets (such as GO SLIM). suggest addition of sequence relationships to support sequence accession number searches, with coincident links to Gene Products Suggest addition of a type flag to the term_dbxref table to allow Term vs. Definition (or other type) values to be distinguished. 
Also, suggest addition of DBxREF tables as needed around other value-sets (such as synonyms)

Trademark (Mike Cherry)
A copyright statement has been added to the GO Consortium Page
A logo [designed by John Matese] is available for the GO Consortium (found on http://www.geneontology.org/GO.usage.html; specifically http://genome-www.stanford.edu/images/GOthumbnail.gif)
The trademark/copyright issue comes up now as more people become interested in using GO, and as the potential increases for product developers to incorporate GO without attribution or in an independent but apparently associated manner.

GO SLIM
The utility of GO SLIM was generally recognized. A referent/cutout/subclass table will be added to the GO Relational DB that will allow extraction of GO SLIM, GO FAT (full), Plant Terms, and other term subsets.
Utility of GO SLIM
Initially needed for high level Celera annotation
Subsequently needed for Riken Annotation Jamboree (FANTOM meeting)
this was apparently an effort to annotate by 'computational means' about 2000 previously undescribed mouse cDNAs
11 methods were used and a manuscript is apparently being assembled
one thing that the group rues is the absence of an upfront comparative metric between the 11 methods; they could tease out a comparison between the methods, but it would be tedious, though perhaps worth it
This effort involved the MGI (David) and took place around the time of the global genomics meeting in Japan
Problem: The GO SLIM versions used for Celera and Riken were not identical
Useful for easy to apprehend graphics, such as pie charts
Useful for high level classification, such as across genes in a microarray experiment
Potentially more stable than full GO (though this is not a sentiment shared by all participants). In particular, the 'lower half' of the Process Ontology is in flux.
Status of GO SLIM
Currently maintained by hand (by Michael)
One outcome of the meeting was a decision to add a GO SLIM entry to each Gene Product annotated.
A commitment to automating this process was discussed. This GO SLIM view would be included in Swissprot and NCBI representations alongside the GO FAT view.
The current criterion for a term being part of GO SLIM is 'someone says so'. There is no intrinsic connotation to the node-levels in GO. This means that there is nothing inherently similar among terms at Level 3, for instance. Thus, the slice through the Ontologies that would define GO SLIM is necessarily ragged (occurring at different levels for different branches).
Currently there is no alerting system to highlight changes in GO SLIM; it requires consultation of DIFF files and manual extraction.
Most recent version 100-150 terms; see Appendix A.

Funding
The GO Consortium expects to receive full funding for pending grants. Funding is for 3 years beginning 1 December 2000
There is currently an administrative (Congressional Budget) delay on the disbursement of funds.
Funding is initially to support 3 groups in the GO Consortium (Mouse, Yeast, Fly), with the addition of 2 groups after the 1st year (presumably Arabidopsis and C. elegans).
Other funding ... Incyte (no details) ... AstraZeneca (no details)

Publications
A Genome Research paper is in the process of being written (GO Consortium), which will provide more details on the GO and is a followthrough from the Nature Genetics paper recently published.
A Genomics paper is in the process of being written (MGI), which will detail some of the recent mouse gene annotations incorporating the GO

Next Meeting
3, 4, 5 March, Palo Alto, CA, hosted by Sue Rhee, Carnegie Institute, Stanford University

Immediate Actions
Doug:send THE LETTER to him to for DoubleTwist:.invite them to Dec. meeting:.CALL ANDREW AND ASK HIM WHO TO SEND THE LETTER TO:EXPLAIN THE SITUATION..
review John Richter's FAQ about GO page: WE NEED TO SEND FASTA/GO FILE TO GO-SLIM WE NEED TO POST AT mgi THE mgi/GO FILE THAT IS SENT TO www.geneontology.org Appendix A: Current GO SLIM From: Suzanna Lewis[SMTP:suzi@bdgp.lbl.gov] Sent: Wednesday, October 25, 2000 11:34 AM To: ma11@gen.cam.ac.uk; midori@genome.Stanford.EDU; suzi@bdgp.lbl.gov Cc: go@genome.Stanford.EDU; dph@titan.informatics.jax.org Subject: Re: go_slim also needed to add unlocalised to component slim. here is the updated version $Gene_Ontology ; GO:0003673 $cellular_component ; GO:0005575 %cell wall ; GO:0005618 %extracellular ; GO:0005576 20 groups worldwide to handle nomenclature for Candida. The Candida sequence is close to being completed. The pombe sequence is completed and the paper has been submitted for publication (Sanger group). A GO annotation set has been submitted with external references being Swissprot IDs. Progress Report - MGI (David Hill) GO Browser now publically available for MGI records (created/implemented by John Corradi) Gene-by-gene annotation of 15,000 genes effort is concentrated on moving IEA evidence to ISS evidence (or greater) updated annotations now being downloading to GO webpage every Friday Major Revision of GO subsections Process Ontology: apoptosis (with Flybase/Heather) Progress Report - TAIR (Leonore Reiser) Major Revisions Working (with J. Yoon) on the introduction of terms into the three Ontologies to support Arabidopsis in the first instance, Plantae in general as a followthrough Plant terms - 162 total to date, 63 in the Component Ontology (50 with definitions) Expressed strong support for visual-based ontology curation and browsing tools Thesauri Worked on EGAD plant <> GO associations A contact has been initiated with Maize DB, but little feedback has been received to date (is this contact via Susan at Cornell?). 
A contact has been established at IRRI within the Rice community regarding working with GO and Cyanobase A plant-related mailing list has been initiated The notion of a 'plant clearinghouse' for GO annotation is embryonic and being organized by Lenore and Richard Burkiewich (formerly of the Sanger). Progress Report - Wormbase (Erich Schwarz) Erich is Paul Sternberg's first WormBase employee Wormbase growth Wen Chen (will be adding expression data to Wormbase AceDB) Raymond Lee (will be porting Wormbase from AceDB to other platforms) to fill one more curator's slot by summer & one db programmer Erich will be the GO-responsible party AceDB is beautiful for assisting in positional cloning. What is the relationship with Lincoln Stein? Invaluable He is on the grant that currently underwrites Wormbase Essential at the level of asking Lincoln to do things .. Lincoln put together the Wormbase.Org site at Paul Sternberg's request Eventually Wormbase will move from Cold Spring Harbor to CalTech Proteome is helping out Worm by providing Worm sequence definition lines relationship with the Sanger ... John Hodgin (gene mapping db) ... Sanger/WashU sequence feed ... very complex interaction map with other smaller dbs ... Richard Durbin involved as a subsidiary PI, like Lincoln Stein Progress Report - Prokaryotes & Protozoa Charlie Hodgkin (Glaxo; used to write software for Michael) initially contacted Michael and expressed his intention to use the GO for annotating E. coli. This launched an e-mail flurry that eventually led to interaction between Michael and Heather, and Monica Riley and her postdoc Greta Serres (around ISMB '99). Charlie, with Monica's support, is quite interested in putting the E. coli GO annotations in the public domain, but there is not indication of when this might be completed. There is an interest at Carnegie in annotating cyanobacteria with GO, but it is unclear where this is going in the near term. 
A previously noted protist interaction is dead: a Canadian group has recently become interested in annotating protists with the GO in conjunction with some large scale sequencing. There is currently no activity on the horizon for annotating viruses with the GO. The group has not looked at the 'minimal microbial genome' to see if it is fully represented in the GO. Michael would like to see Pseudomonas included (for its xenobiotic metabolism) and Streptomyces (for its antibiotic biosynthesis) Three groups have independently applied for funding to the Wellcome Trust for the sequencing of 3 different protozoa (leishmani, trypansoma cruzi, and trypansoma bruceii), in close collaboration with a new plasmodium database at the University of Pennsylvania. Al Ivans (Sanger), who I presume is attached to one or more of these, is interested in using the GO in their gene annotations, though nothing is firm to date General message: would like to identify communities, work to build a foundation for them, then invite them to build on that foundation Software Developments A relational database containing the GO terms is now available. It was unclear to me whether this is in Oracle in addition to MySQL, but that is not really relevant at this juncture. The time is imminent when the following will be available to curators: direct writes to a pre-production (development) database instance elimination of duplicate ID and basic syntax errors owing to data integrity functions (currently no data integrity ... everything done as flat textfiles). 
Curation interface with undo, DAG view, and commit functionality (see below) Rollback to previous version, owing to audit trail functions Java Browser (John Richter) Online at fruitfly.org Demonstration was very well received by participants A lot of effort has gone into making the browser platform independent All GO terms in all three Ontologies are found in a single tree (one window in Browser) Clicking any term brings up a minimal DAG in an accessory window Queries run as Perl5 regular expressions Query Window supports either Boolean or Perl5 regular expression queries After pulling a gene product, can click on a term block and thereby highlight the terms in the Ontology Tree Window; the minimal DAGs are also shown for these terms in an accessory window The number of gene products is no longer indicated in the interface Java Editor (John Richter) The editor will have the same look and feel as the Browser when completed Editor runs as an application rather than as an applet; it is available under CVS and must be installed to be used GO ID is fixed; new terms get automatic assignment (there was brief discussion around how to assign ID's which did not reach final resolution; currently, ID ranges are apparently assigned to curation groups) Structrual Edits (edits to the DAG structure) simple select/click/drag to change structure 'infinite' undoes term-merge by click/drag; one term lives on while the other is obsolesced and becomes a synonym of the live term term-splitting enabled (do not recall details) term obsolescence is automatic if all children are obsolesced cyclic graphs are not automatically disallowed; thus curators need to avoid descent along obsolesced terms For searches, obsolesced terms should all be treated as leaves Rollbacks: In theory, 'infinite' rollback is possible, but there is currently no 'History Viewer' and John suggested that design of one would be significantly more complex than either Browser or Editor. 
- Logic Validation is currently not within the remit of the Editor
- Erich: would like to have a visual cue that indicates which terms have been changed recently
- Provides for tagging a term as part of GO Slim, and the capability of adding any number of additional segregation tags in the future.
- Import & Export of subtrees will soon be available (a very useful feature)
- A Gene Product Viewer will be added in the near future, the nature of which is currently unknown.
- Technical Discussion: aiming for a Generic Ontology Builder
  - The current Editor is componentized
  - The DAG component is generic, not specific for GO
  - Using Java components has allowed fancy click/drag functionality, including a smooth table resizer
  - All components are available on the CVS repository, with DOCUMENTATION (fancy that)
- The GO Schema is available at http://www.godatabase.org/dev/database/database/
- People are encouraged to write ports to different database formats and submit them to the GO Database mailing list

HTML Browser (Brad Mars)

- The HTML Browser is running off of the GO Relational Database (Chris Mungall's Perl API) rather than off of the GO Flatfiles, as the Java Browser does
- Currently found at http://www.godatabase.org/cgi-bin/go.cgi
- Similar, though generally weaker, functionality to the Java Browser.
- Plans are to increase functionality to be on par with the Java Browser

BLAST Server (Suzanna Lewis)

There is an obvious need for the ability to BLAST against organism sequence sets using sequences retrieved via GO searches.
An unasked question: why do they not rely on the archival databases to serve BLASTs?
There was some discussion around whether to have 'all 400 versions of ADH' included or not.
The current thought is to have protein sequences in FASTA format deposited on the GO website and available for BLAST searches.
In the case of MGI, the plan is to submit a single protein sequence per gene. This single sequence will be the primary Swissprot sequence that represents each gene product.
MGI annotation is to the level of the gene, not to the gene product; therefore, spliceforms and post-translational modifications do not matter that much to them at this point in time.
In the case of yeast, about 2000 yeast protein sequences have already been provided.
The FASTA header line for the yeast sequences was discussed as a standard.
- general syntax: key:value[space] (key order not constrained)
- 'provide everything that you can provide, the more the better'
- community name for the gene (that used locally within a particular domain)
- gene name (the generally accepted, or global, name for the gene)
- tab-delimited list of GO IDs
- one or more external database cross-references

Timing, Resource, and Prioritization Issues

The primary software developers associated with the GO Consortium are John Richter and Chris Mungall. John is the primary driver behind development of the Java Browser and Editor, which appear to be the currently favored tools for GO delivery to end users and curators.
About 3 weeks of work remain (according to John) to reach a final stage on the Java Editor & Browser.
John is currently under obligation for development of an open-source gene annotation platform, Apollo, and the timing for completion of software development for the GO Consortium is critically dependent upon his obligations to Apollo. Also competing for resource is an ongoing Fly Genome Reannotation project.
There is significant concern (voiced by Michael) around the absence of rigorous syntactic control and internal checks. John indicated, though, that syntactic checking algorithms are 'the stuff of papers' and were going to devour the majority of the development time. (I am quite fuzzy on exactly what the issue is here.)
The concern was voiced that the GO DAG could be really screwed up if the Editor is used improperly ... powerful tools mean powerful edits. However, the potential for rollbacks really softens the potential impact of this problem.
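As an aside on the FASTA header convention listed above (key:value pairs separated by a space, key order not constrained, GO IDs as a tab-delimited list), the following is a minimal Python sketch of how such a header might be parsed. The key names (`symbol`, `name`, `go`, `xref`), the gene, and the cross-reference are hypothetical and invented for illustration; the minutes do not record the actual keys. The sketch also assumes that values contain no spaces.

```python
def parse_header(line):
    """Parse '>key:value key:value ...' into {key: [values]}.

    Assumptions (not from the minutes): pairs are separated by single
    spaces, values contain no spaces, and a value may be a tab-delimited
    list (as proposed for the GO IDs).
    """
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    fields = {}
    # Split on spaces only, so tabs inside a value are preserved.
    for token in line[1:].rstrip("\n").split(" "):
        if not token:
            continue
        # Split at the FIRST colon, so 'go:GO:0005618' keeps its GO ID intact.
        key, sep, value = token.partition(":")
        if not sep or not key:
            raise ValueError(f"not a key:value pair: {token!r}")
        fields.setdefault(key, []).extend(v for v in value.split("\t") if v)
    return fields


# Hypothetical header; key order is deliberately shuffled.
header = ">go:GO:0005618\tGO:0005576 symbol:GENE1 xref:DB:0001 name:example_gene"
parsed = parse_header(header)
```

Because the key order is not constrained, the result is keyed by name rather than by position, and a consumer can simply ignore keys it does not recognise.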
Suzanna voiced the opinion that the priority lay in getting out of flatfile mode and into a relational database environment. It seemed that most everyone agreed with this.
A key untackled problem appears to be 'how to commit DAGs to a database,' which apparently both John and Judy are thinking hard about. (Again, I'm not quite clear on what the issue is here.)
Interim Solution during resource-limited period: John will do enough so that the Java Editor is available and will produce flatfiles; the flatfiles will be fed to the GO via the currently implemented CVS system.

Downloads from the 'outside'

Data is currently provided to external parties on an ad hoc basis.
The MySQL database can be downloaded with the following caveat: 'If you make any changes, do not publish this database'.
Among the groups who have downloaded is AstraZeneca (Bo Servenius, Lund, Sweden).
Presumably, people are writing applications on top of the database, so John is trying to give a lead time on large changes.
A Perl Development Kit has been assembled to assist in the loading/updating of external representations.
A defined update/error reporting process has not been set up to date. However, each of these is currently handled through the e-mail lists (GO-FRIENDS, GO-DATABASE, GO-DIFF). The contact points should probably be Chris Mungall and John Richter.
Send format problems to John Richter (http://www.fruitfly.org/annot/go/) and suggestions to John at go-admin@bdgp.lbl.gov
Right now, remote access is relatively slow, presumably because there is lots of SQL being done remotely. John and Chris have some ideas bouncing about regarding 'fast access to GO Objects'. Consult John on this matter.

Discussion around the Editing Process (various) (see also Software Development: Java Editor)

Judy: a database administrator would help to stem the proliferation of terms and provide some control over the process.
Still trying to resolve the issue of how to distribute editing to a database globally across groups and disciplines. A software tool is emerging (Java Editor, design by John Richter) but the issue of editing control is still being debated.
Michael: Minor Changes - Just commit them; Major Changes - consult the group via e-mail and commit after a waiting period (currently ~10 days); Really Major Changes - talk in person. Michael does not really want to see the current process changed.
There was some discussion between Michael and Judy around similarity between the GO and extant nomenclatures, where committees meet to commit changes, but Michael would really want to avoid the potential emergence of bureaucracy, favoring maintenance of the present 'sociology' around the GO Consortium.
It was generally accepted that there should be both a Working Database (Development) and an Authoritative Database (Production). There was some split of opinion over whether the Working Database should be publicly visible, but it appears that it will be made so, the argument being that the more skilled eyes that are looking at the work in development, the better.

Schema Discussion, with emphasis on DBxREF

A discussion ensued around whether the DBxREF field should refer to Terms and Definitions in the aggregate or separately. This discussion gets to the issue of whether a particular piece of information derived from a Source should be traceable to the Source or not.
The current record structure looks like this:
- GO ID
- Term
- Definition
- Definition Refs
- Synonyms
- Synonym Refs
- DBxREFS (+ Boolean; supports Definition, Term, etc.)
Suzanna led the discussion around proposed changes to the Schema.
- Suggest adding a 'subclass' table to support tags that allow quick segregation of the GO into subsets (such as GO SLIM).
- Suggest addition of sequence relationships to support sequence accession number searches, with coincident links to Gene Products.
- Suggest addition of a type flag to the term_dbxref table to allow Term vs. Definition (or other type) values to be distinguished.
- Also, suggest addition of DBxREF tables as needed around other value-sets (such as synonyms).

Trademark (Mike Cherry)

A copyright statement has been added to the GO Consortium Page.
A logo [designed by John Matese] is available for the GO Consortium (found on http://www.geneontology.org/GO.usage.html; specifically http://genome-www.stanford.edu/images/GOthumbnail.gif).
The Trademarking/Copyrighting issue comes up now as more people become interested in using GO, and as the potential increases for product developers to incorporate GO without attribution or in an independent but apparently associated manner.

GO SLIM

The utility of GO SLIM was generally recognized. A referent/cutout/subclass table will be added to the GO Relational DB that will allow extraction of GO SLIM, GO FAT (full), Plant Terms, and other term subsets.
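
The two schema suggestions above (a subset table for extracting slices such as GO SLIM, and a type flag on term_dbxref) can be sketched roughly as follows. This is only an illustration in Python with an in-memory SQLite database, not the actual GO schema: apart from term_dbxref, which is named in the discussion, every table name, column name, and the `xref_type` values are assumptions. The term GO:9999999 and the EXAMPLE: cross-references are placeholders invented for the example.

```python
import sqlite3

# Hypothetical schema; only the table name term_dbxref comes from the minutes.
SCHEMA = """
CREATE TABLE term (
    id   INTEGER PRIMARY KEY,
    acc  TEXT NOT NULL UNIQUE,
    name TEXT NOT NULL
);
CREATE TABLE subset (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE term_subset (
    term_id   INTEGER NOT NULL REFERENCES term(id),
    subset_id INTEGER NOT NULL REFERENCES subset(id),
    PRIMARY KEY (term_id, subset_id)
);
CREATE TABLE term_dbxref (
    term_id   INTEGER NOT NULL REFERENCES term(id),
    xref      TEXT NOT NULL,
    xref_type TEXT NOT NULL CHECK (xref_type IN ('term', 'definition'))
);
"""


def build_demo():
    """Create an in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO term (id, acc, name) VALUES (?, ?, ?)",
        [
            (1, "GO:0005575", "cellular_component"),
            (2, "GO:0005618", "cell wall"),
            (3, "GO:9999999", "placeholder term outside the slim"),
        ],
    )
    conn.execute("INSERT INTO subset (id, name) VALUES (1, 'goslim')")
    conn.executemany(
        "INSERT INTO term_subset (term_id, subset_id) VALUES (?, 1)",
        [(1,), (2,)],
    )
    conn.executemany(
        "INSERT INTO term_dbxref (term_id, xref, xref_type) VALUES (?, ?, ?)",
        [(2, "EXAMPLE:0001", "term"), (2, "EXAMPLE:0002", "definition")],
    )
    return conn


def terms_in_subset(conn, subset_name):
    """Return the accessions of all terms tagged with the given subset."""
    rows = conn.execute(
        """SELECT t.acc FROM term t
           JOIN term_subset ts ON ts.term_id = t.id
           JOIN subset s ON s.id = ts.subset_id
           WHERE s.name = ? ORDER BY t.acc""",
        (subset_name,),
    )
    return [acc for (acc,) in rows]


def xrefs_of_type(conn, acc, xref_type):
    """Return cross-references of one type (term vs. definition) for a term."""
    rows = conn.execute(
        """SELECT x.xref FROM term_dbxref x
           JOIN term t ON t.id = x.term_id
           WHERE t.acc = ? AND x.xref_type = ? ORDER BY x.xref""",
        (acc, xref_type),
    )
    return [xref for (xref,) in rows]
```

With a many-to-many table of this kind, a term can belong to any number of subsets, so a second slice (for example plant terms) would need only another row in `subset` rather than a schema change.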
Utility of GO SLIM
- Initially needed for high-level Celera annotation
- Subsequently needed for the Riken Annotation Jamboree (FANTOM meeting)
  - this was apparently an effort to annotate by 'computational means' about 2000 previously undescribed mouse cDNAs
  - 11 methods were used and a manuscript is apparently being assembled
  - one thing that the group rues is the absence of an upfront comparative metric between the 11 methods; they could tease out a comparison between the methods, but it would be tedious, though perhaps worth it
  - this effort involved the MGI (David) and took place around the time of the global genomics meeting in Japan
- Problem: the GO SLIMs used for Celera and Riken were not identical
- Useful for easy-to-apprehend graphics, such as pie charts
- Useful for high-level classification, such as across genes in a microarray experiment
- Potentially more stable than full GO (though this is not a sentiment shared by all participants). In particular, the 'lower half' of the Process Ontology is in flux.

Status of GO SLIM
- Currently maintained by hand (by Michael)
- One outcome of the meeting was a decision to add a GO SLIM entry to each Gene Product annotated. A commitment to automating this process was discussed. This GO SLIM view would be included in Swissprot and NCBI representations alongside the GO FAT view.
- The current criterion for a term being part of GO SLIM is 'someone says so'.
- There is no intrinsic connotation to the node-levels in GO. This means that there is nothing inherently similar among terms at Level 3, for instance. Thus, the slice through the Ontologies that would define GO SLIM is necessarily ragged (occurring at different levels for different branches).
- Currently no alerting system to highlight changes in GO SLIM; requires consultation of DIFF files and manual extraction.
- Most recent version has 100-150 terms; see Appendix A.

Funding

The GO Consortium expects to receive full funding for pending grants.
Funding is for 3 years beginning 1 December 2000.
There is currently an administrative (Congressional Budget) delay on the disbursement of funds.
Funding is initially to support 3 groups in the GO Consortium (Mouse, Yeast, Fly), with the addition of 2 groups after the 1st year (presumably Arabidopsis and C. elegans).
Other funding ... Incyte (no details) ... AstraZeneca (no details)

Publications

A Genome Research paper is in the process of being written (GO Consortium), which will provide more details on the GO and is a follow-through from the Nature Genetics paper recently published.
A Genomics paper is in the process of being written (MGI), which will detail some of the recent mouse gene annotations incorporating the GO.

Next Meeting

3, 4, 5 March
Palo Alto, CA
hosted by Sue Rhee, Carnegie Institute, Stanford University

Immediate Actions

Doug: send THE LETTER to him for DoubleTwist; invite them to Dec. meeting. CALL ANDREW AND ASK HIM WHO TO SEND THE LETTER TO; EXPLAIN THE SITUATION.
Review John Richter's FAQ about GO page.
WE NEED TO SEND FASTA/GO FILE TO GO-SLIM
WE NEED TO POST AT MGI THE MGI/GO FILE THAT IS SENT TO www.geneontology.org

Appendix A: Current GO SLIM

From: Suzanna Lewis[SMTP:suzi@bdgp.lbl.gov]
Sent: Wednesday, October 25, 2000 11:34 AM
To: ma11@gen.cam.ac.uk; midori@genome.Stanford.EDU; suzi@bdgp.lbl.gov
Cc: go@genome.Stanford.EDU; dph@titan.informatics.jax.org
Subject: Re: go_slim

also needed to add unlocalised to component slim. here is the updated version

$Gene_Ontology ; GO:0003673
 $cellular_component ; GO:0005575
  %cell wall ; GO:0005618
  %extracellular ; GO:0005576

GO mapping - GO needs modification to allow full use with human - problem of the irregularity of GO updates.
For AstraZeneca, a company that has been generous in its financial help to the GO Consortium, Ken Fasman said that there was increasing concern with different ontologies being used by different data providers and that a policy decision had been made to insist on the use of GO for any product that they would purchase after a 24 month notice period. AstraZeneca intends to seek support for this policy more broadly in the pharmaceutical industry. Ken was also concerned that there might be a number of different and non-collaborating efforts to use GO for human genes, and, even worse, modifications of the GO ontologies independent of the GO Consortium.

Outcomes.
=========

The objectives of the GO Consortium are simple: to collaborate with others (preferably one other) to develop the GO ontologies so that they can be most effectively used for the annotation of human genes, and to receive from the collaborating group(s) a table of assignments of GO terms to human genes that will allow the human genome to be searched along with the genomes of others by GO terms.

The GO Consortium recognises that there are pre-conditions necessary for these objectives to be achieved:

A mechanism must exist for those using GO, but outwith the Consortium, to suggest changes (additions, corrections etc) to the GO ontologies. For human genes this will be through David Hill of the Jackson Labs. GO expects those who propose new terms to define these terms (see GO documentation) at the time of request. The companies now using GO for human genes in products (i.e. Celera Genomics and Proteome) both said that they will now begin to feed new terms to GO and to suggest changes required for the annotation of human genes.

The Consortium must increase the rigour of their syntactical checks on GO data and the synchrony of release of the same data in different forms. This will require a single validation script for each class of file to be run whenever data is committed.
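
To make the idea of a per-file-class validation script concrete, here is a minimal Python sketch for one class of file: ontology flat-file lines of the form shown in the GO SLIM listing elsewhere in these minutes (a relationship symbol, the term name, ` ; `, then a GO ID). It is a sketch under stated assumptions, not the Consortium's script: the set of allowed symbols (`$`, `%`, `<`), the seven-digit ID rule, and the rule that one ID must always carry the same name are assumptions made for illustration. The sample input deliberately contains two errors.

```python
import re

# Assumed line shape: optional indentation, one relationship symbol,
# the term name, ' ; ', and a GO ID with exactly seven digits.
LINE_RE = re.compile(r"^\s*([$%<])(.+?) ; (GO:\d{7})\s*$")


def validate(lines):
    """Return a list of (line_number, message) for every problem found."""
    errors = []
    name_by_id = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        match = LINE_RE.match(line)
        if match is None:
            errors.append((lineno, "syntax error"))
            continue
        _symbol, name, go_id = match.groups()
        # A term may legitimately appear under several parents in a DAG,
        # so a repeated ID is only an error when the name differs.
        previous = name_by_id.setdefault(go_id, name)
        if previous != name:
            errors.append((lineno, f"{go_id} already used for {previous!r}"))
    return errors


sample = [
    "$Gene_Ontology ; GO:0003673",
    " $cellular_component ; GO:0005575",
    "  %cell wall ; GO:0005618",
    "  %extracellular ; GO:0005618",   # deliberate error: ID reused
    "  %malformed line GO:5576",       # deliberate error: bad syntax
]
problems = validate(sample)
```

Run from a commit hook, a script of this kind would refuse the commit whenever the returned list is non-empty, which is one way of running the check 'whenever data is committed'.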
It may be necessary (as was suggested by Andrew) to go to a regular (e.g. monthly) public release at pre-determined times and dates.

We need a mechanism to ensure much better user feedback. One suggestion is to run an open User's Meeting once (or more) a year. This will not be by invitation but will require pre-registration so as to avoid a logistic catastrophe. We expect that one of these meetings will be at the time of one of the regular Consortium meetings and that maybe another could occur at the time of, e.g., ISMB or a similar meeting.

GO is in the public domain (there was some discussion as to whether protection under, e.g. a GPL, is desirable). There is an implicit contract between the GO Consortium and commercial users of GO - the commercial users get the information for free, but they have an obligation to give the Consortium useful feedback.

There can, of course, be problems with public feedback to GO from commercial companies. GO should establish mechanisms other than the public mailing list to allow people to comment on GO, both in general and in detail, in a manner that is private (although, of course, any resulting changes to GO would be public).

It is in the long term interests of the commercial users of genomic data for there to be stability and uniformity in annotation. For the consumers of data, their interests are that they can use the same analytical methods on data coming from the public and commercial domains or from two or more different commercial concerns. For the providers of data, their interests are not to have to spend resources re-inventing the wheel and to be able to easily QC their data by comparison with public data or data from other commercial sources.

There are two major public groups annotating the "complete" human genome sequence, the Ensembl group at Hinxton and the NCBI group.
At the moment they are using different assemblies of the sequence (Santa Cruz and Schuler, respectively) but there is an agreement in principle, at least, for the two groups to share a common name space - the International Gene Index. There are clearly a number of name space/identifier issues that will make everyone's job harder - at least in the short term - but these were well beyond the remit of this group. To some extent, for the purposes of GO, there is already a common name space between the EBI and NCBI for gene products - the protein_id's of the GenBank/EMBL-Bank/DDBJ records. A maintained table of correspondence(s) between protein_id's and other name spaces (Swiss-Prot, HUGO, LocusLink, Ensembl etc) might be a good idea, at least until the IGI reaches full term.

The NCBI will be importing GO annotation for about 10,000 "known" human genes from Proteome Inc into both RefSeq and LocusLink. The NCBI will provide a general methodological statement as to how the particular gene product to GO assignments were made. All of these assignments will be attributed to Proteome Inc.

The Ensembl team will work very closely with others at the EBI and with the HUGO Gene Nomenclature Committee in London, to establish a central, open repository, called GOAH, to track assignments of GO terms to human gene products which can be used by other databases worldwide (see Appendix).

Some action items.

* Kevin Roberg-Perez - send in proposal for partitioning enzymes (the issue here being that the children of "enzyme" in $molecular_function are very 'flat').
* Richard Mural - send in problems with mouse gene associations.
* GO - to arrange user's meeting at next meeting (hosted by TAIR).
* GO - establish email methods to allow companies to comment on GO privately to the GO consortium.
* Suzanna Lewis - immediate validation of gene associations on CVS commit.
* GO - consider regular dated updates, rather than updates on edit as now is the case.

Thankyous.
==========

We thank Mrs Beatrice Toliver for her help and courtesy at the Banbury Centre and Dr. Jan Witkowski for allowing us to use the Conference Centre for this meeting.

Attendees & their affiliations.
===============================

Rolf Apweiler (European Bioinformatics Institute - Swiss-Prot; InterPro)
Michael Ashburner (European Bioinformatics Institute - GO Consortium - FlyBase)
Judith Blake (Jackson Laboratory - Mouse Genome Database - GO Consortium)
Mike Cherry (Department of Genetics, Stanford - GO Consortium - SaccDB)
Janan Eppig (Jackson Laboratory - Mouse Genome Database - GO Consortium)
David Hill (Jackson Laboratory - Gene Expression Database - GO Consortium)
Suzanna Lewis (BDGP, Berkeley - GO Consortium)
Martin Ringwald (Jackson Laboratory - GO Consortium - Gene Expression Database)
Michele Clamp (Sanger Centre - Ensembl)
Donna Maglott (NCBI - LocusLink - RefSeq)
Lincoln Stein (Cold Spring Harbor Lab. - WormBase - DAS)
Sue Povey (University College London - HUGO Nomenclature Committee)
Lisa Brooks (NHGRI)
Kevin Roberg-Perez (Proteome Inc)
Darryl Gietzen (Incyte)
Ken Fasman (AstraZeneca, Boston)
Andrew Kasarskis (DoubleTwist)
Richard Mural (Celera Genomics, East)
Paul Thomas (Celera Genomics, West)
Jennifer Wortman (Celera Genomics, East)

Appendix - The Ensembl/EBI GOAH proposal.
=========================================

GOAH - GO Annotation of Human.

The EBI proposes to provide a central, open database tracking assignments of human gene products to the Gene Ontology (GO) resource.

The Gene Ontology project (www.geneontology.org) provides a framework to assign functional information to gene products. GO was founded by three model organism databases (FlyBase, SGD and MGI) and has expanded to 5 databases, taking in WormBase and TAIR. GO has proved to be very successful in these databases, capturing functional information in a way which can be queried across species databases and providing a consistent framework for aspects such as evidence tracking.
The effective use of the human genome will require some aspect of functional tracking of gene products. The proven GO experience, in particular in Mouse, indicates that GO will work well for this task. Unlike other organisms, there is no clear central database for human genome resources, and it is likely that this will remain the case.

We propose therefore to provide a central, open repository, called GOAH, to track assignments of GO terms to human gene products which can be used by other databases worldwide. This repository would be manned with two or three editors providing overall curation and quality control. These editors would be the point of contact for individual researchers wishing to contact GOAH. For large scale projects with a proven track record of functional assignment, such as HUGO, Proteome Inc, SWISS-PROT and MIM, direct editing of the GOAH database will be allowed, with the editors providing conflict resolution and general consistency of the project.

The human gene products would be tracked via the internationally agreed protein identifiers (protein_id), an established identifier system for proteins shared by the International Collaboration of DNA databases. All information stored in GOAH would be placed in the public domain without restriction.

The EBI provides an ideal location to provide the GOAH resource, with synergies to the Ensembl team of genome annotation and the SWISS-PROT team of protein functional assignment. In addition the EBI has strong links to the main players in this field, such as the NCBI, Proteome Inc, Celera and CSHL. The necessary resources for GOAH have already been found at the EBI and committed to furthering functional assignment in human, either directly in this GOAH project or in some collaboration with other interested parties.

===================================================================

Summary of the GO Consortium Meeting held March 4 & 5 at the Carnegie Institution, Stanford University, Palo Alto, CA.
****************

Hosts: Sue Rhee and the TAIR group
Participants: TAIR, SGD, BDGP and FlyBase, MGI, DictyBase, Worm (full list of individuals at the end of the document)
Guests: Han Xie of Compugen; Mark Wilkinson, of the NRC Canada, a TAIR collaborator

Summary Agenda... This is not in the order of the meeting, but rather supports some structure for this report.

1. Introduction to GO for New People and Systems.
2. Revision of Enzymes to incorporate E.C. terms.
3. Short Notes, Updates and Action Items.
   Definitions
   SP2GO
   InterPro
   GO-SLIM
   Obsolete terms
   Energy Derivation
   Top Level Terms
4. Sort Process Ontology into component parts and other Process considerations.
5. "Determination", "Differentiation" and "Development".
6. Major Divisions of the Process Ontology
7. Physiology - Initial Discussions
8. Report on Narrative vs. Combinatorial approach re anatomy in biological process terms.
9. Software Update from BDGP group.
10. New procedures for revising ontologies.
11. In General, Things to Do, some sooner, some later.
12. Specifically, For the Next Meeting (July 14, 15) in Bar Harbor
13. Progress Reports, including short reports from Compugen and NRC Canada
14. Full List of Participants.

*******************

1. Introduction to GO for New People and Systems

Brief History; What ontologies are being developed; What are the rules and procedures for both ontology development and annotation of genomes; Review of the public presence of the GO consortium.

2. Revision of Enzymes to incorporate E.C. terms.

It was agreed that incorporating EC higher level terms was a good thing to do. Some of the EC strings are very long because they are adding definitions into the term. We will move the definitions into the definitions file. This will not restrict searches since the search includes the definitions. Michael will work to tidy the list and replace the current enzyme set with the new representation. Synonyms will be added as needed.
Curators are reminded that synonyms for the protein should NOT be entered, just synonyms for the molecular function.

3. Short Notes, Updates and Action Items.

a) Definitions: We still only have about 10% of terms with definitions. The rule is, if you add a new term, you need to add a definition. We remind ourselves that the GO:ID goes with the definition, not the term, in cases of revision in the use of a term. SGD crew are scanning OUP Dict of Molec. Biol. into an ASCII file for us to use in adding the definitions. This will be incorporated into the GO-EDITOR (John Richter). We will add ISBN numbers to each definition, as well as a personal signature to each definition we add.

b) Update of SP2GO files. Need to continue with timely updates. New process from MGI will update SP with each MGI update. David Hill (dph@informatics.jax.org) continues to be the primary person managing this file.

c) InterPro: Michael Ashburner reviewed the history of InterPro for new people. Mapping of InterPro to GO is public at EBI, but is not posted at the GO site yet. Action: place InterPro:GO mapping at GO site.

d) GO-SLIM: At the moment, there is a hand-curated GO-SLIM. Ultimately, an attribute of a GO term will be that it is a member of a certain GO-SLIM representation. We recognize that there will be different slices of the GO that will be useful to different annotation communities. So we expect to support different GO-SLIM sets. GO-SLIM implementation will wait for the database.

e) Obsolete terms: When a term becomes obsolete, a note should be appended to the definition to explain why it became obsolete. The note might also contain suggested terms to search if you are considering this obsolete term. John Richter will make sure that obsolete terms are supported in the GO-EDITOR.

f) Energy Derivation: Natasha Maltsev of Argonne National Laboratory has a list of energy derivations that will be a starting point for expansion in this area. Michael will get the list from Natasha.
g) Top Level Terms: We don't want to limit top level terms. We need to think of them as 'collectors'. So when considering the addition of high level terms, consider 'Do we need this collective term?'. When we consider that we have a term (growth and maintenance, for example) because we cannot distinguish by experimental data to which term we should annotate a protein, that is an annotation perspective. But we also want to include terms so that we can group things.

h) Prions will not be represented since they relate to disease state.

4. Sort Process Ontology into component parts and other Process considerations.

We discussed whether the process ontology should be separated into two parts: cellular and multicellular. This discussion is not new. We recognized the utility of having a complete unit of the process ontology representing cellular-level processes since this is needed and practical for the unicellular organisms. Thus we will work towards a robust representation of cellular processes that will be useful to all. This decision led further to a recognition that the process ontology is sorting into 4 major components. These are: cellular processes, developmental processes, physiological processes, and behavioral processes. We agreed to break out cellular processes and to specifically represent them at the top of the Process ontology. Most of the discussion over the rest of the meeting then focused on developmental processes.

a.) Differentiation is a cellular process; morphogenesis is a multicellular process.

b.) We discussed whether to break apart the terms 'growth and maintenance' and 'cell organization and biogenesis'. At first, it seemed that we should. However, we quickly realized the utility of these terms in that some preliminary experimental evidence couldn't distinguish as to whether a gene product was involved in 'growth' or in the 'maintenance' of an organism.
We did agree to change the term 'cell organization and biogenesis' to 'cell organization and/or biogenesis'. Midori will incorporate this into the work to carefully define the high level terms. There is still some confusion as to the difference between the 'cell organization and/or biogenesis' node and the 'growth and maintenance' node.

c.) Every high-level node needs careful definitions: Midori and Michael will work on this soon.

d.) Remove terms 'oncogenesis' and 'tumor suppressor'. These terms reflect phenotypes. 'Oncogenesis' is really 'unregulated or mis-regulated cell cycle control'. The ontology term relates to cell cycle regulation. The evidence for the association of a gene product with the process of cell cycle regulation often comes from the study of the disease state. This same argument supports the removal of the term 'tumor suppressor', which is, after all, a phenotype statement and not a biological process statement.

e.) Cell Motility: Under 'cell motility', 'vesicle transport' and 'spindle function' are examples of cell motility. So, we may need to extend the upper level with a term 'motility', then a daughter term 'cell motility' and, as a daughter term of that, 'cytokinesis'. So 'cytokinesis' would be a part of 'cell motility'. Cell division is a synonym rather than a GO term because the term is used both as 'division of the nucleus' and as a synonym for 'cytokinesis', i.e., division of the entire cell. So, synonyms need not be unique. We need some further work here under 'cell cycle' as there are multiple usages of these terms. So, we need precise definitions of our usages.

5. "Determination", "Differentiation" and "Development".

Definition of 'Determination' and definition of 'Differentiation'. How shall we represent these concepts? This reflects a 50 year debate in developmental biology. We need to rewrite these definitions so that they are less experimentally based.
Consider, throughout, 'has the definition been written in terms of the experimental method?'. If so, consider revising the definition.

Sound bites from this interesting discussion
- determination: when the decision has been made to adopt a developmental stage (tricky because it is often before the actual differentiation occurs)
- differentiation: when you actually express a set of characteristics...process whereby relatively unspecialized cells acquire
- so is 'cell specification' a synonym for determination? Or is it that specification is the same as establishing an identity but not yet determined. It is a temporal thing. You are getting signal. It is not the same as determination.
- autonomous specification: specification produced by an inheritance of molecules. A type of cell specification.
- conditional specification: the specification determined by the relative position of cells in an organism. A type of cell specification.
- not the same as competence, which is a characteristic of a cell.

Conclusion: Competence is the 'ability' to do something. Competence is not a process, it's a state. So we throw it out. But, if useful, we could have 'establishment of competence' or 'maintenance of competence'.

Conclusion: the term 'Development' as a high level process will be used to consider the whole history of the organism. This generated a lot of discussion as we considered 'embryonic development' and 'post-embryonic development'. This is a hard distinction to support for plants and for larval development. Different communities use these terms in different ways. Post-embryonic development is useful for fly...keep it in???

What is covered by the term 'brain development'? It continues throughout life. What do we mean by a term like 'heart development'? Does that mean the developmental process up until you have a heart? Or does it include the further development of the heart after a recognizable organ is formed?
Embryogenesis, morphogenesis, organogenesis are all DAGs...some things that are parts of embryogenesis will be part of morphogenesis as well.

So... development morphogenesis aggregation differentiation maturation aging senescence
Option 1 global heart development formation of the heart beyond formation of the heart

6. Major Divisions of the Process Ontology.

% cell
% development
% physiology
% behavior

7. Physiology - Initial Discussions.

Having struggled through the beginnings of a representation of development, and at least conceptualizing the work needed to realize this part of the process ontology, we recognize that physiology is the next big area to struggle with. Animal and plant physiologies will be pretty independent. Cellular processes are independent of physiology. So here is where the DAG structure becomes imperative. For example, 'hormone response' has both physiological and cellular components. We can relate them through the use of the DAG structure. Remember that we are trying to develop a tool for biologists that works....not trying to represent all biology. We need to make somewhat arbitrary decisions, such as where to put 'germination', that address what the user cares about, i.e. 'what genes are involved in the process of germination?'. Physiological processes are heavily impacted by outside signals (environment) and change in response to the environment. While not absolute, physiological statements more often reflect processes in the mature organism. In defining physiology as a grouping mechanism, we need to work down to the next level now. 'Transpiration' is an example of a physiological process, ditto 'perception of external stimuli', 'stress response', 'immune response', etc. 'Seed germination' and 'release of dormancy' are both physiological and developmental processes. Physiological processes ultimately will go down to the granularity of cellular processes. The DAG structure will help in the representation of all these terms in the Process ontology.
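
The role of the DAG described above, where a term such as 'seed germination' sits under both physiology and development, can be illustrated with a small Python sketch. The term names come from this discussion, but their exact placement here is only illustrative, and the gene products (`productA`, `productB`) are hypothetical, invented for the example.

```python
from collections import deque

# Child -> parents. A term may have several parents, which is what makes
# the structure a DAG rather than a tree. Placement is illustrative only.
PARENTS = {
    "process": [],
    "development": ["process"],
    "physiology": ["process"],
    "seed germination": ["development", "physiology"],
    "stress response": ["physiology"],
}

# Hypothetical gene products, invented purely for this example.
ANNOTATIONS = {
    "productA": ["seed germination"],
    "productB": ["stress response"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


def products_under(term, annotations=ANNOTATIONS, parents=PARENTS):
    """Return gene products annotated to `term` or to any of its descendants."""
    return {
        product
        for product, terms in annotations.items()
        if any(t == term or term in ancestors(t, parents) for t in terms)
    }
```

Because 'seed germination' has two parents, a query from either the developmental or the physiological side finds the same annotation, which is the user-facing behaviour the discussion asks for ('what genes are involved in the process of germination?').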
8. Report on Narrative vs. Combinatorial approach re anatomy in biological process terms.
This was the major event of this meeting. For many meetings, we have come back to the issue of species-specific anatomies and the incorporation of anatomical terms in the process ontology. Over a year ago, Joel Richardson proposed a combinatorial approach wherein a process term combined with an anatomical term would be used to annotate knowledge about a gene product. At the Hinxton meeting, the group agreed that this was a sensible and powerful approach. However, subsequent implementation efforts revealed difficulties in incorporating such biologically useful concepts as 'gametogenesis'. Also, the management of the combinatorial approach would be harder than the further development of what is now called the 'narrative' approach. The narrative approach is the current paradigm of building up the ontology incrementally as we describe the process in biological terms. Yet, in following discussions, the issue of whether or not to incorporate anatomies, which are themselves highly developed and precise ontologies, in the process ontology kept arising. Finally, at the human annotation meeting at Banbury last summer, we agreed that David Hill would 'do the experiment' and give a presentation at this meeting for the group to consider.
David used the example of "Heart Development". He developed an ontology for heart development in both the narrative and the combinatorial manners. A copy of that presentation is available. The end result was that the group was overwhelmed with the power of the combinatorial approach both to provide self-structured cross-product terms and to reveal new information and avenues for experimentation.
1. Do we leave it up to each group to decide whether to use this approach to process annotation? A resounding NO from the group.
2. Can we separate out subtrees that can be used to generate cross-products? Yes, could use GO-SLIM or other subtrees.
In fact, the GO-SLIM set may be the mechanism for grouping annotations across species. 3. There could be cross-products of cross-products....how far do we want to break this down? Don't have to go all the way down as long as the representation of the biology is correct. 4. Works as long as the two concepts are orthogonal, can't do with just anything and get the consistency needed. 5. Big worry...if each group is incorporating combined terms relative to their particular anatomy, we lose the power of the combination of all annotations. One approach is to ask the query...'give all products in heart development', and have query go out against all cross-products. We will have to work on this. 6. Can we have a join of the anatomies? then have a single anatomy to use in the cross-product with developmental processes? don't know...right now, we think the combinatorial approach is the right way to go, we will have to work on the implementation. 7. Some concern about ripping out anatomical terms from process right now. Can the primary process ontology be made more amenable to cross-species specific anatomical parts? 8. If we have multiple anatomies, then the search needs to go against anatomies...this can be done. Summary....Issues 1. There is general consensus to go forward with the combinatorial approach. 2. Do we need to have a shared anatomy? 3. How will others be able to use the ontologies to annotate if we have this complicated approach? 4. Parser...need to put into better language...earlier we tackle the problem of language, the better we can promote this for ourselves and others. 5. GOAL...write definitions for common developmental process terms. 6. Start working on further experiments with this approach...write definitions, work out mathematical properties. 7. Each group needs to provide an anatomy. 8. The anatomies needn't have GO:IDs, but the cross-products should have IDs. 9. We will use the developmental process as a demonstration of this approach.. 10. 
Immediate action items include:
a. schema changes (Joel and Suzi)
b. editor will work fine for now.
9. Software Update from BDGP group.
Brad Marshall - GO-Browser. Objective is to make browser better. Have moved from cgi scripts to XML backend with RDI to associate with different data sources. This makes for a more flexible backend. Want to chain a lot of data sources together with GO associations. Only want to retrieve a subgraph at a given depth. Much enthusiasm for power of this approach.
John Richter - GO-EDIT. This new editor is an open-source application that provides an annotation tool for GO type ontologies. Can rearrange, define terms, designate subsets... Released and available. Will start using this right away. First commit will result in messy Diff file, but then all will be well.
1st change....editors using Editor will write out to files and commit via CVS
2nd change...editors using Editor will write out directly to database...don't know when that will be.
John Richter - GO database. Database could be ready 3 wks after he starts working on it, but right now he's working on Apollo.
Suzi Lewis - Apollo Pedigree
10. New procedures for revising ontologies.
* who has 'write' access in each instance? Michael, Heather, David, Harold, Leonore, Midori, Karen
* how are people outside supposed to communicate suggestions to us?
Suggestion...two databases, one public, one writable (production). Curators will have db accounts...login to edit and write. A few have publish access. Publishing takes it over to the public db. We track changes, etc.
Publishers.... Michael, Heather, David, Harold, Leonore, Midori, Karen
Publishing involves clearing and reloading. The event gets a version number. 'Static' version would be a cron job (midnight of the first of every month???). There would need to be a name for each release.
11. In General Things to Do, some sooner, some later
1) Cross Products: Post GO:IDs for cross-products not for Anatomy...
2) Details for Cross-Products...may be summer before we can commit all this. 3) Post InterPro:GO mappings at GO Web site (Michael) 4) Post Anatomy Files from different groups. 5) Post FASTA files of unique set of AA seqs with GO annotations at GO site, include SP:ID in header. Set up for searching. 6) Review and update GO-SLIM files 7) Develop top levels of Physiology for next meeting 8) Add seqID column to Gene Associations table (use SP ID). 9) Consider posting 'good' other ontologies at GO site. This will involve a lot of discussion...Don't want impression that GO is responsible for quality of other ontologies. 10) Update SP translation tables (David Hill) 11) Move comments about reasons for changes from CVS to database (when we can) (John Richter). 12) xml dumps from Editor to Suzi. 13) Support old GO:IDs in database (John Richter). 14) Post 'Citing these data' on Web pages (MikeC). 15) Provide some kind of static or versioned GO for use by tools that incorporate the GO as part of their annotation suite. 16) Suzi needs to deal with <>...where ever you want to keep this character, put the / in front of it. 17) MikeC will create an anonymous server behind firewall at Stanford for cvs or provide a machine outside the firewall. 18) Make CVS world readable..provide a repository..may want to consider SourceForge... 12. Specifically, For the Next Meeting (July 14-15, Bar Harbor). 1. Midori and Michael will have some High-Level definitions for review (i.e., when does the process of differentiation start? when does it stop?). Change the term 'cell organization and biogenesis' to 'cell organization and/or biogenesis'. Also, need children of combined terms....thus, need 'cell organization' term and 'biogenesis' terms with definitions. 2. Need some work done to clarify and define terms in the area of 'Cell cycle'. Full List of Participants. TAIR: Sue Rhee, Leonore Reiser, Aisling Doyle, J. 
Yoon, Margarita Garcia DictyBase: Rex Chisholm, Warren Kibbe SGD: Mike Cherry, David Botstein, Midori Harris, Selina Dwight, Karen Christie, Dianna Fisk, Anand Sethuraman, Cathy Ball, Gavin Sherlock, Worm: Wen Chen BDGP and FlyBase: Michael Ashburner, Suzi Lewis, Heather Butler, John Richter, Brad Marshall, Chris Mungall MGI: Judith Blake, Joel Richardson, David Hill, Martin Ringwald, Janan Eppig, Harold Drabkin =================================================================== GO Meeting July 14-15,2001 Hosted by Judy Blake and the Jackson Laboratory in Bar Harbor, ME. Minutes compiled by Leonore Reiser Attendees (name-organization): Judith Blake-MGI Janan Eppig-MGI Joel Richardson-MGI Martin Ringwald-MGI David Hill-MGI Harold Drabkin-MGI Michael Ashburner-FlyBase Midori Harris-GO Suzi Lewis-BDGP John Richter-BDGP Brad Marshall-BDGP Pavel Tomancak-BDGP Mike Cherry-SGD Anand Sethuraman-SGD Karen Christie-SGD Selina Dwight-SGD Dianna Fisk-SGD Erich Schwarz-WormBase Raymond Lee-WormBase Rex Chisholm-DictyBase Liat Mintz-Compugen Courtland Yockey-Astra Zeneca Rolf Apweiler-SWISS-PROT Nicola Mulder-SWISS-PROT Jennifer Hogan-Incyte Matt Berriman-Sanger Tom Weigers-MGI Jim Kadin-MGI Carol Bult-MGI Sue Rhee-TAIR Leonore Reiser-TAIR Progress reports by group (presenter): MGD (Harold Drabkin): from handout supplied. 1. As of 7/12/01 MGI has 15638 annotations. 2. Total number of mouse genes with at least 1 associated GO term: 5673 Genes with molecular function term: 4615 Genes with process term: 3487 Genes with component term: 3584 3. Breakdown by annotation type: IEA: 3690 process; 5333 function; 3835 component Of these: 760 annotations are from EC mapping (694 function; 66 component) and 6542 from SWISS-PROT mapping (1740 process, 2603 molecular function; 2199 component). The remainder are from GOFISH. Hand Annotations: Annotating at a rate of 40-45/week. 
1115 genes annotated this way (636 process; 637 component; 788 function) that correspond to a total of 2780 annotations (1058 process; 722 component; 950 function). SGD (Karen Christie): 1. How SGD deals with annotations to unknown (function, component or process) is by assigning a reference, from either curation or from a publication with the appropriate evidence code. This distinguishes between cases where a curator has looked and nothing is known and cases where the literature for the gene has not yet been examined in order to make a GO annotation to a given ontology. We decided that this was a good approach for all curators to take and that this decision would be reflected in the documentation on the website. 2.Validation checking: includes checking that all of the required fields were being filled and scripts to identify terms that become obsolete so that annotations to these terms can be updated. 3. Cathy Ball and others have done a four- way genome comparison- Arabidopsis, Yeast, Fly ,Worm:- each sequence is BLASTED against each other to build gene families Criteria is (P =50 and 80% in one HSP).Each sequence is present only in one tree. Compared annotations with a high node, GO Slim type, process ontology and curated each of them by hand (by sequence similarity and FASTA definition lines). Annotating the entire gene cluster (have 1400 families- hand curated for process-14/15 very high level processes, 23 families that could not be ascribed to a process). Function calls were made electronically via a script. This dataset can be used as a check for annotations- and as a general guide for annotation. A limitation is that the represented proteins are only those found in all genomes. Flybase (Michael Ashburner): Working on cleaning up existing annotations: 1. Heather was fixing all legacy annotations that did not have literature or evidence codes. 2. Go Slim high level annotations being gone over for 7000-8000 genes. 
Michael has gone through all of them - annotated by hand.
3. Several thousand BLAST results being gone through.
4. 9000 genes with at least one GO annotation.
WormBase (Erich Schwarz):
1. One third of C. elegans genes have been annotated using InterPro. This data is up at the GO site.
2. For 19,000 cds sequences, ~2,200 have 'reasonable' phenotypes from RNAi screens. Working on turning phenotype data from RNAi to GO annotations. Mapping phenotypes to GO. For example, the paralyzed phenotype would map to the GO term 'locomotory behavior'. Evidence is inferred from RNAi, which would be given the evidence code IMP; this additional definition of IMP will be reflected in the documentation on the web pages.
DictyBase (Rex Chisholm):
1. Initial annotation from genome sequence 5X coverage ~8000 ORFs.
2. Have about 50 hand annotations. Will use InterPro, SWISS-PROT, EC mapping and have a set of annotations by next meeting.
Compugen (Liat Mintz):
1. Concentrating on GO Annotations at the level of transcripts. Methods include:
a. Protein clustering (Smith-Waterman)
b. Literature clustering - text mining tool.
c. mRNA clustering
First release includes human, rat, mouse annotations. Update to be released Aug. 1 will have additional annotations.
SWISS-PROT/InterPro (Rolf Apweiler):
1. Nicola Mulder's map of InterPro to GO terms has about 2700/4000 terms mapped. Many can't be mapped yet (as they include lots of viral terms).
2. SWISS-PROT mapping. From this about 50% of the terms have been mapped to GO but this still needs some manual curation.
3. Human gene annotation status. Yesterday (July 13, 2001) the first pass annotation from SWISS-PROT, TrEMBL and Ensembl annotations was completed using three electronic mapping methods. Imported 7316 GO annotations from Proteome and literature associations. The plan is to have complete coverage at a high level by September-October. After this, manual GO annotations will be part of the normal curation pipeline.
GO (Midori Harris):
1.
Training SWISS-PROT curators to do GO annotations. 2. A bit of PR work with Nature and the Wall St. Journal. 3. Working with Karen Christie of SGD to revamp the GO web pages. TAIR (Leonore Reiser): 1. 3901 annotations to genes corresponding to TIGR open reading frames using term matching (exact matches between GO terms and definition lines). Matches generated by script and validated by curator went in as IEA annotations to GO and TAIR FTP site. All annotations are to molecular function. 2. Progress on development of an anatomy ontology for Arabidopsis using GO editor to make DAGs. 266 terms as of 7/12/01, 50% have definitions Will work with rice (Gramene DB, IRRI) and Maize (MMP) to find top level terms for plants. Discussion Points: 1. GO Slim a). Consensus that there needs to be a new GOSLIM developed. Terms for GOSLIM will be selected by a small working group. b). A directory of the GOSLIM versions that have been used should be made available via the website. c). Some considerations in using GOEDIT to make GOSLIM files: Will have to wait until the database is up to implement GOSLIM notations as this is not accommodated by the flat files. Also, having everything in the database will make it easier to keep GOSLIM in synch with the current GO. The 'canonical' GOSLIM will be in the database and other versions (specific to certain projects) will be posted as flat files. d). Chris Mungall has been working on software for mapping full GO to GOSLIM. e). Midori Harris will take charge of new GOSLIM. 2.Gene association files a). Should we get rid of aspect in the files? Consensus is no, as this information is useful in consistency checking. b). How to deal with the fact that each group is annotating at different levels/to different database objects. At the moment, most groups are annotating at the level of gene, or transcript, but this differs from database to database. There is a need to define the object being annotated explicitly. 
Changes to be made are:
1) add a column that defines the object being annotated (for the moment the options will be gene, transcript and protein).
2) The symbol used in the association file will be the symbol for that object (e.g. if annotation is to a gene object then symbol = gene symbol; if annotation is to a protein object then symbol = protein symbol). The same holds for synonyms: they should match the object being annotated.
3) add a column for Taxonomy_ID (from NCBI) that defines the taxonomic node for the organism whose gene/protein/transcript is being annotated.
Symbol is still mandatory, but not every database has symbols for everything. In that case, using alternative names is OK (e.g. a gene symbol when annotation is to a transcript or protein). There may need to be further discussion on the issue of the symbol column.
c). Should there be another column specifically for PubMed references in addition to the database references for the evidence for the association? No, but we can allow for multiple entries in the DB:reference column to accommodate. Multiple reference identifiers (e.g. |MGI:209393| PMID:123333) will be separated using pipes with the model organism database identifier preceding the secondary identifier.
d). What to do with sequence identifiers? We reaffirmed a previous decision NOT to put GenBank or EMBL identifiers in the association files. Instead, sequence identifiers will be provided in a separate file.
e). Action Items: Midori will update the web pages with the new information/format for the association files. An XML format will be created to export the association files.
3. Definitions status: Up to 13% of terms are defined. Everyone agrees it is harder to define the higher nodes but crucial, especially as these are used for GOSLIM. Midori has taken a stab at some of the higher nodes in process and these need to be looked at. With respect to making it easier to add definitions, SGD has converted a dictionary into an electronic format.
4.
Discussion of "is this a function?". Many people have noted that there still seem to be gene products in the GO both in function and process ontologies. Dynein was cited as an example of a protein name in the function ontology.
a. Rex will look into identifying areas that need to be cleaned up as far as protein names and bring the suggestions back to the group.
5. How granular should we get in the ontologies - what belongs? The discussion about what is a function raised questions about how granular the ontologies should get and what is a function vs. a process term.
a). Granularity. In general, the answer is as far as we can go, within the bounds already defined. Community feedback into the ontologies is especially important as a means to improve granularity.
b). What to do with molecules? If we want to represent the function of a subunit (a molecule):
1. Add parts of the complex (e.g. add regulatory/catalytic component)
2. Remember we are annotating the product to the potential of the component.
3. To avoid a proliferation of terms, we can collapse nodes and gain specificity by the combined annotation. For example, RNA polymerases might be collapsed into a single function node and the component or process gives the specificity.
c). What about phosphorylation? It was suggested that having phosphorylation as a process was redundant with kinase in function and is inconsistent with the rule that a process requires more than one function. In this case we decided to keep protein phosphorylation in the process ontology because it would be expected as a child of protein modification.
6. Status of the GOEditor (John Richter).
a). Changes made to the editor.
1. Dropped obsolete relationships: the editor no longer shows obsolete terms.
2. History now shows differently. You can now view the entire history of a term.
3. There is a new data adaptor that includes the relationship "develops from". But this only works with the database adaptor, not flat files.
4.
Saving to the database now works. The editor tracks all changes in your session, checks that you are working on the database and adds history tracking. From now on history will show everyone's saves. Can now query the history of a term - selecting a term highlights the edits for that term for the life of that term.
5. Conflict checking is working - save fails if a conflict is detected.
6. ID generation will be from the database. ID ranges will no longer be required.
7. All three ontologies are loaded when using the database adaptor.
b). In progress.
1. A plug-in for generating IDs other than GO IDs. For example, people using the editor to create anatomy ontologies will be able to have their own prefixes.
2. A gene association plug-in.
3. Update to require passwords for loading and saving to the database will be instituted.
c). Discussion/comments on the editor.
1. When importing from other flat files (e.g. anatomy) they can be parsed into the database by stripping terms. However, there may be reasons to save the original database identifiers - perhaps as synonyms.
2. While conflict checking works, top level node edits need to be announced to the group before starting.
3. When will the database be the real thing? Perhaps in about 3 weeks, but perhaps not.
7. GO HTML Browser (Brad Marshall):
a.) Displays DAGs, gene associations, definitions. Searching by organism, term, etc... still some issues to resolve in the searching. BDGP has registered the domain name godatabase.org but this is not up yet.
b.) Some more work to be done on UI issues before release.
c.) Related to general software issues: Chris Mungall has been making progress with a BLAST server using GO annotated sequences from yeast.
8. Website documentation issues:
a). FTP site including CVS repository is moving to the outside of the firewall. This will allow anonymous read access to the CVS.
b). Updating to the FTP site. SGD has the only write access to the site.
So in the event that Chris wants to dump the database XML files, Mike will grab the specified files from BDGP and load them. c). DTD for the XML format needs to be moved to the website so that people can get to that. Currently the XML is updated once a month from Stanford to the database and then dumped out. This will move to nightly dumps once the DTD and XML is in place. d) . Currently we are just exporting the tree but will also be exporting the associations (as above discussion on an XML format and appropriate DTD to be written). e). The new format will allow people to specify and import subgraphs from the files. f). A validation step will be added to the XML dump to make sure it is correct. g). Move old documentation out of CVS but leave on the FTP site. Things to be archived include past minutes from meetings, old XML DTD files, old ontologies. Move the archive directory out of CVS but leave on the FTP site. Note that the abstracts/ directory is empty and could be removed. h). As part of the rearrangement that Midori and Karen are undertaking , we can add a jobs page. 9. GO expansion (specific issues and general concerns): a). Matt Berriman from the Pathogen Sequencing Unit at the Sanger Center raised some issues about how to expand the GO to include parasites. In particular the question was raised as how we can represent the components outside of the cell ... moving from place to place. Alterations to component ontology need to be made to accommodate this. A few basic ideas to deal with this: 1. add a host %extracellular %host Phenotype is clearly outside of the GO . There are no plans to expand to include phenotypes in the ontologies. >Some groups, like SWISS-PROT, may be annotating to groups of protein products. Eventually, these should move out to the level of a transcript. >In general we do not consider natural variants/polymorphisms, but it is really up to individual databases to decide on that. 
>Splice variants, isozymes and other polymorphisms will be an issue for databases to deal with. Each variant should be annotated appropriately. e). Expansion of the evidence codes. Generally agreed that more information needs to be provided so that it is clear how the annotation call was derived. Case in point by Rolf Apweiler For 26098 proteins 20757 from SWISS-PROT;5841 from Ensembl Annotations include: 1448 mappings from EC 7696 SWISS-PROT keyword mapping 11025 InterPro mappings 1. 3022 unique GO terms that are all IEA. In general, there is not need to expand upon the IEA code but rather to have a more specific DB: reference for each type of annotation. So rather than dumping everything into IEA and giving a general reference, use IEA and the reference should explicitly state if the association was made from InterPro mapping or keyword mapping, etc. Send the references for the analysis as part of the annotation record. Each analysis method should have its own reference (e.g. GOFISH). 2. When is it an ISS vs. IEA call.. It is only ISS if a person has looked at it, in Swiss-Prot there is always a person looking at the sequence. Can we enforce a standard of reporting for sequence analysis methods? Probably this is unrealistic to attain. The reference column of the association files will be used for providing more information on the different computational methods used for annotations. In addition, a set of descriptions of the methods used by each group will be put up on the GO website. f). Integration of annotations from non-model organism databases. SWISS-PROT and Compugen will provide first pass annotations to model organism genes to model organism databases to incorporate them and pass them on to the GO site. For non-model organisms, the associations will go directly to the GO site. 10.Organizational issues a) . Anyone can participate in discussion. Suggestion of terms can and should come from anyone however write access to the ontologies should remain limited. 
b). Who contributes to the annotations. The associations have to be from a database.
c). What constitutes membership. In general requires that participating groups accept the principles of the GO and are willing to commit to ongoing development of the GO.
1. putting associations into the public domain.
2. contribution of financial support or data.
3. contribution of software tools.
d). Due to the size of the group (currently 10 members) it may be more efficient to have smaller working groups that meet for various reasons (such as an executive committee, or ontology working groups) in order to focus on specific action items. To keep meeting at this rate (3-4 times a year) and to get things done, it may be helpful to change the structure of the meetings.
e). Should the GO site be a clearing house for information about other ontologies?
11. More about cross products and anatomy.
a). Cross products - David Hill expanded on the idea of cross products. We are in agreement to strip anatomy from process terms and proceed with cross products. For making cross products, it is important that the ontologies be orthogonal. We can expand the concept of cross products to many areas and it would be good to have a general tool for doing this that allows you to select specific nodes to create cross products (see d below). With respect to making anatomy ontologies for generating cross products with the developmental process node in the process ontology, we need to first make orthogonal ontologies of anatomy and developmental stage, then take the appropriate cross products from these to make the cross product with development. It is essential to take the time component out from staging (e.g. days post-fertilization, post-germination are not useful as there is a lot of variation in how rapidly development occurs within a species). An example: [stage (organism specific, internal ID) X anatomy (organism specific, internal ID)] X [developmental process (GO-generic, GO ID)] = GO ID.
Also, the cross-product of stage X anatomy will have an internal ID. b). Anatomy browsing- Pavel from BDGP demonstrated a browser being developed at BDGP to display gene expression patterns and fly anatomy. Uses both images and text display. c). Each organism database contributes their anatomy/ developmental stage ontologies and definitions to the GO and it will go into the CVS. Each group should be responsible (and responsive) for updates to their anatomy ontology. d). Developing a tool to generate cross products. John Richter thinks he can adapt the editor to have this function. Realistically there will not be a tool till after October for generating cross products. e). Report on papers citing or using the GO and activities related to GO. 1. MGI/RIKEN annotation paper is out. 2. Cathy Ball and SGD have their 4 way comparison paper in the works. 3. GO paper is out in Genome Research. 4. Matt Berriman -Parasites are GO in Trends in Parasitology. 5. David and Joel working on a paper about cross-products. 6. Courtland is doing a seminar for library sciences. 7. Michael will be doing a course in October. 8. TAIR review on plant data management includes small section on GO. 9. Postponed request from Annual Review of Genetics until next year. f). Next meeting will be at the Chicago Omni Hotel and includes a users meeting. Regular meeting will be from the 12th (new groups)-14th and users meeting on the 15th... =================================================================== GO MEETING in Chicago on Oct 13, 14 2001 Northwestern University, Chicago, Illinois USA Rex Chisholm, host List of Participants 1. Chris Mungall - BDGP, Berkeley 2. Brad Marshall - BDGP, Berkeley 3. Rex Chisholm - DictyBase, Northwestern 4. Suzi Lewis - BDGP, Berkeley 5. Michael Ashburner - FlyBase, EBI 6. J. Yoon - TAIR, Carnegie 7. Sue Rhee -TAIR, Carnegie 8. Peter Good - NHGRI 9. Trisha Dyck - DictyBase, Northwestern 10. Karen Christie - SGD, Stanford 11. 
Matt Berriman - Parasitic genomes, Sanger
12. Judith Blake - MGI, Jackson Laboratory
13. David Hill - MGI, Jackson Laboratory
14. Harold Drabkin - MGI, Jackson Laboratory
15. Janan Eppig - MGI, Jackson Laboratory
16. Raymond Lee - WormBase, CalTech
17. Wen Chen - WormBase, CalTech
18. Midori Harris - GO EDITOR, EBI
19. Evelyn Camon - SP-human proteins, EBI
20. Bernard de Bono - visitor from Cambridge
21. John Richter - BDGP, Boulder
22. Erich Schwarz - WormBase, CalTech
23. Jason Stewart - BDGP, Albuquerque
24. Hanqing Xie, Compugen
ACTION ITEMS FROM CHICAGO MEETING - OCTOBER 2001
1. ACTION ITEM: Get temporal and anatomical CVs into CVS from participating databases.
TEMPORAL: need to add temporal ontologies, but maybe we need to add dimensions rather than time or relative temporal terms? Discrete temporal terms, like Theiler Stages (Mus); 'life stages' in worm; template for others?
STATUS OF ANATOMIES
FlyBase: done. FlyBase also represents 'derived from' in organ-organ relationships.
TAIR.... pretty good, almost done, just checking...
SGD. anatomy...12 terms, done
MGI ... Martin met with folks in Edinburgh, fine for MGI to commit mouse anatomy to the web site. Adult anatomy is not complete, but we could contribute it. Defined by Edinburgh by anatomical space definitions... 3D... Embryological one is done.
Worm... Wen Chen...has been working on this. Life Stages for embryos is done. It has been converted into GO type form. Need to do refinement on definitions. Right now, Wen has built structure with 55 terms....
Worm... can keep temporal aspects. In C. elegans, because of knowledge of every cell, can actually do anatomy in terms of the big picture of all cells. Raymond has done a pilot on the feeding apparatus, the pharynx of worm. Working with David Hall of Einstein to work out anatomy (also Sylvia Martinelli at Sanger has some work to incorporate).
2. ACTION ITEM: need a tool to create cross-product terms
see further discussion of this point under BDGP report
3.
ACTION ITEM: Post new biological process that incorporates updating developmental processes. Ask for comments by Dec. 31. After that date, do the update and commit it. Post new Biological Process. 4. ACTION ITEM: come up with initial default GO slim, send around for comment, incorporate into database such that db changes could be flagged...then we could make that available as GO-SLIM. Ask that people who make variants and publish with that will post them...Midori and Michael will create default file... technically...flag in flatfile... Note, this is a carryover from Bar Harbor meeting. don't need 'static' one...??? but do need easy way to create one, need 'this one, that one, all the others' last time we agreed that we would a) archive any that are used b) people will use this feature a lot in the future, so how do we make this easier c) we decided to provide a default one.... when the database is implemented, this will be easier to manage... d) people that work with alternative GO-SLIMs will post them 5. ACTION ITEM: Agenda items for next meeting 1. Chris Mungall... intro to ontological formalisms 2. Michael... will report on the literature of relationships in ontologies, WordNet book report All terms need parents. The downside of using a quick upgrade of everything is that 'is-a' relationships may not be correct for all, particularly biological processes. Should we add an 'is-a' to the top level? Also, the 'part-of' relationship is complicated since we use 'part-of ' in different ways. Need to investigate the implications of this. We may decide that we don't want to get more complicated than we are. Do it as needed, well, why would we need it... 1) to use outside ontological tools 2) to resolve multiple uses of 'part-of' types 3) to facilitate doing queries 6. ACTION ITEM: standardization of database abbreviations Michael will finish off flatfile in one day of standard set of linking database abbreviations. He will post to CVS. done: go/doc/GO.xrf_abbs 7. 
ACTION ITEM: New Evidence Codes
1) invent an evidence code for 'nothing is known biologically' ND; ND:evidence-reference would be a local database citation that abstracts to methodology. Done
************
The next one came up during the user's meeting and didn't receive discussion from the full group....may have to wait until next meeting for consensus, or may be agreed by email...
2) proposal at User's Meeting to add an evidence code for 'inferred from curated orthology'; ICO; evidence-reference would be a database citation that abstracts to the methodology. This depends heavily on a shared understanding of the term 'curated orthology', but in this first instance refers to the case where RGD is transferring GO associations to Rat genes from MGD via the curated orthology relationship provided by MGI. Further discussion reveals the complexity of this. Orthologies are most often defined at this time by sequence similarity. The transference of functional annotation therefore might be considered 'ISS', EXCEPT that this really depends on the method of assignment of function to the orthologous object to begin with. If the assignment, for example, of a function to a mouse gene was as a result of a biochemical assay, then ISS for the rat protein might be appropriate. If the assignment of function to the mouse gene, however, was via an electronic assertion based on (perhaps) shared domains, then this would be an IEA assignment in mouse and even with the orthologous relationship to a rat gene, the rat gene GO assignment should not be given an ISS evidence code.
8. ACTION ITEM: We will add a date field to the entire gene association file
We are adding a date column, mandatory for all annotations, YYYYMMDD. It will mean the date on which the association was made; it will not need to be broken down into "created" vs. "updated," because we "update" annotations by adding new lines to the association files (and deleting old lines if the situation calls for it).
The date field can be used in conjunction with the ND code, so that curators can tell when it was that nothing useful was found. This was the original motivation for including the date, but we quickly realized that dating annotations was good for other reasons as well. This has now been documented in the GO documentation at the Web site.
9. ACTION ITEM: Update Documentation to explain new use of the TaxonID column in the gene association files.
Revised syntax for TAXON ID Column is: 1st ID = taxon encoding the gene product; 2nd taxonID refers to the context, i.e. the user organism of the gene product. Syntax will be taxon1 | (pipe) taxon2. Done
10. ACTION ITEM: Change requirement for submission of sequence information for gene products.
Remove sequence subdirectory and replace with subdirectory holding files of gpID: proteinSeqID from the participating db. BDGP will use these files to yank protein sequences and generate appropriate def line according to current GO standards. Resulting sequence sets will be posted regularly. Peptides will be the sequence type.
new syntax: DBID:geneObjectID DBID:seqAcc#; DBID:seqAcc#; DBID:seqAcc#; where multiple seqIDs are only added to reflect alternative transcripts, not allelic variants
11. ACTION ITEM: Update directory structure on GO web site. Karen will transmit this directory restructuring to Mike...
gp2protein gp2protein.sgd gene2protein.mgi gene2protein.etc /protein2FASTA created by se group, will go into monthly archive,
also get rid of species subdirectories, create new one 'gene-associations' and put all the mod association files in there.
rename 'monthly' dumps to an informative name
Parent directory 'Data Snapshots' subdirectories 'Current' and 'Archived' ??
SO monthly_downloads (cvs tag on the 1st of every month) subdirectory will be /current_yyyymmdd ontologies xml db gene-associations definitions database load sequence set / archived under that, the monthly subdirectories moved from the previous monthly
all the other directories will be the most current... the monthly is the snapshot version....
remove 'abstract' directory
remove 'archive' directory
'docs' okay
'external to go' ok
'mail' ok
'note' can be deleted
'ontology' keep
'schema' goes
'sequence' delete
'software' delete
'xml' delete
12. ACTION ITEMS FOR JOHN RICHTER
*need to attach cross-product terms to existing ontologies
*need to be able to track identifiers to components of the cross_product
*will notify users 'there may be dangling references here'
*need to address 'merging' issues.
John will release 2.7 next week; 2.8 will allow dangling objects. Other plug-ins that are being suggested: gene product fetcher...will get from database also, just select the ISBN for some select set of references, including at least the Oxford Grid ISBN number. John is working on an html toolkit that will show java trees on the web...a little servlet.
13. ACTION ITEM: User's guide for DAGedit... John will set up a WIKI page, and anyone can contribute.
14. ACTION ITEM: Transition into using database as primary repository
John will work on history tracking mechanism while we test the db and flatfile saving issue. On a Friday, John will say 'we're going to start using the database'. John will populate the db on the weekend with the newest stuff on CVS. Curators won't do anything over that weekend. From then on, curators will save to db and at the end of the session will also export to flatfiles. This will continue as long as we need to. At some point, John will have us revert to the old system if needed. It's important during that time to check email before using to check for recent messages. John will try to give us notice.
Also, this won't be done before Nov 2, but we are tentatively scheduling this event (the email msg) for Nov. 2.
ACTION ITEM: Continue to consider the need for a DBA.
ACTION ITEM: Need a UK mirror of the GO site....Midori will talk to Pete. Check with Chris Richter--he talked with Pete (Petteri Jokinen, EBI systems)
**************************************
ACTION ITEMS FROM BAR HARBOR MEETING - July 2001

1. GO Slim
a) Consensus that there needs to be a new GOSLIM developed. A small working group will select terms for GOSLIM.
b) A directory of the GOSLIM versions that have been used should be made available via the website.
c) Some considerations in using GOEDIT to make GOSLIM files: Will have to wait until the database is up to implement GOSLIM notations as this is not accommodated by the flat files. Also, having everything in the database will make it easier to keep GOSLIM in synch with the current GO. The 'canonical' GOSLIM will be in the database and other versions (specific to certain projects) will be posted as flat files.
d) Chris Mungall has been working on software for mapping full GO to GOSLIM.
e) Midori Harris will take charge of new GOSLIM.
2. Changes to syntax of gene associations files
There is a need to define the object being annotated explicitly. Changes to be made are:
1) add a column that defines the object being annotated (for the moment the options will be gene, transcript and protein).
2) The symbol used in the association file will be the symbol for that object (e.g. if annotation is to a gene object then symbol = gene symbol, annotation to protein object then symbol = protein symbol). Same holds for synonyms: they should match the object being annotated.
3) add a column for TaxonID (from NCBI) that defines the taxonomic node for the organism whose gene/protein/transcript is being annotated.
4) Midori will update the web pages with the new information/format for the association files. An XML format will be described to export the association files.
3.
"is this a function or a protein name?" Rex will look into identifying areas that need to be cleaned up as far as protein names and bring the suggestions back to the group. 4. Web site management 1) A validation step will be added to the XML dump to make sure it is correct. 2). Move old documentation out of CVS but leave on the FTP site. Things to be archived include past minutes from meetings, old XML DTD files, old ontologies. Move the archive directory out of CVS but leave on the FTP site. Note that the abstracts/ directory is empty and could be removed. Reports from Consortium Members 1. Rex Chisholm, DictyBase Warren Kibbe, getting the database set up, making a few GO associations for an interim collection of genes, 5600 full length cDNAs, ~ 70% of genes (between 8 and 10,000 genes). NIH grant pending which would provide some more curators. 2. Suzi Lewis, BDGP The next thing to accomplish with DAG-edit is to connect to the database, i.e., have the DB directly connected to the editor instead of the flatfiles. Once this is done, a secondary goal is to include plugins and to add a history viewer. Also, we want to switch to the database, and we want to have synchronized monthly updates. Every month (around 1st of month), take snapshot, export XML and to database (don't rewrite back to flatfiles). The XML and DB include gene associations. We have added another format of the data, RDF (an extension of XML) this will get us to DAML-OIL, Protege, etc). Chris has gotten sequence data...fly, yeast, Arabidopsis... BLAST search would bring back hyperlinked output with a small subgraph.... Sequence data is on the GO site only for SGD. Otherwise have to download... 3. Michael Ashburner, FlyBase Have appointed new curator, Rebecca ,will start Nov 5, FlyBase inherited a large number of electronic annotations. An editor has now looked at them all and there are no more IEAs. Several thousand proteins from Celera had no GO data. 
Michael has been through all of those and has been able to assign GO terms to about 1/4 of them. Still there are ~380 genes with GO annotations but no references. Michael is working on that. Regular FlyBase curators added GO terms as well. Re-annotation of release 3 has just started. Release 3 / Drosophila should be complete with no gaps for euchromatin and should have no ambiguities. Now true for 2 chromosome arms. Harvard and Berkeley trained 10 people to use Apollo to annotate fly sequences...Plan is to have reannotation by April 1. 15 min per gene.... Improving Cross-Links...with PIR, database of modified AA cross links with definition files 2) Minn. Biodegradation 3) MetaCyc...doing just pathways, working with Peter Karp following ISMB, 4) comments back from MIPS...They are interested in making a MIPS 2GO mapping available, Michael and Midori have both talked with Klaus Meyer. 4. Sue Rhee, TAIR Manual annotations have started. Tools developed to do this (PubDB). PubDB stores matches to gene names and keywords (GO, Anatomy) to papers. Curators can validate the matches and update/insert new genes, and gene aliases. Currently developing the web forms to validate and update and insert new annotations between genes and GO terms. Will package it and make the source codes available in a couple of months. A lot of this work supervised and carried out by Leonora. Leonore will be leaving TAIR, she is looking for more of an education/outreach kind of job. TAIR is actively looking for replacement for her. J. has been working with Leonore. Rest of curators at Carnegie are working with the GO, learning to do manual annotations. There are about 4,000 annotations using GOFISH methods to 3890 genes. Added 50 hand annotations to GO. Using PubDB matches of known gene names and GO terms within papers to screen literature and to provide a set for curators to work with to annotate genes. They use GO Editor, and other tools. 
So, they take whole set of GO terms, run them against abstracts, and make a file of matches for curators to manually validate using Web forms. Doing the same for gene names gives about 80% validation rate when examined. A lot of gene matching to papers, over 90% have GO term that match as well. TIGR curators will use PubDB and share the literature curation efforts. Two types of electronic annotations have been done. Nicky at InterPro has provided InterPro matches to Arabidopsis proteins. TAIR also ran InterproScan and used Nicky's InterPro to GO mapping. Nicky may have used set from MIPs. So, slight differences, but will submit Nicky's annotations to common protein set to GO. There are about 10,000 GO annotations out of about 20,000 proteins. For the remaining 5,000 proteins, will put in TAIR annotations. Currently, they are in the process of removing annotations that don't make sense for plants. Working with Peter Karp at MetaCyc to add plant pathways to MetaCyc. We used the pathologic script to find pathways matching to Arabidopsis enzymes.. All plant proteins thought to be enzyme or enzyme like (~6000) were passed through and ~1800 proteins were matched. 112 pathways have more than 50% reactions matching and 64 pathways have less than 50% reactions matching. We are waiting for Michael to finish going through MetaCyc to GO mapping and will submit the GO annotations from this after manual check (Lukas Mueller, mueller@acoma.stanford.edu ) Lukas Mueller is one of the TAIR curators who is very interested in GO. Anatomy development. 247 anatomy terms and 54 dev. stage terms. These will be submitted to GO once the new CVS architecture is set up. These have been submitted to the CVS repository for Plant Ontology Consortium Lincoln Stein has set up at CSHL. Will do comparisons to combine to create higher nodes once Gramene submits their ontologies. They contacted Paradigm to develop collaborations as they provide extensive service for phenotyping for plants. 
TIGR will use PubDB to annotate genes to GO. TAIR has provided login for them. We will discuss the process of operation. We decided to separate labor based on processes. TIGR has microbial systems in GO annotations. TIGR and TAIR are cleaning up genome annotations together.
5. Karen Christie, SGD
SGD has finished off the Oxford Dictionary. Marcel Mendoza, summer intern, went through word files from scanning dictionary, moved to RTF. Mike Cherry wrote web interface to query... password accessible. RTF files allow one to know the difference between the term definitions and string. Have added 2000 terms since July and now have about 17% of them defined. Internal progress reports including progress on GO. New curator Rama Balakrishnan; she will do GO annotations. Mike, Rama, Karen went and met with Russ Altman's group. He is moving into the genetics department and will have more interactions with SGD and GO. One of his grad students is working to put GO into Protege. Text matches between literature and GO terms to develop interface for curators to use to guide annotations. IEAs from Valerie Wood...they are comparing new set with all other IEAs. For genes with no annotations, may be good. Looking at comparing with new set from Valerie Wood.
6. Matt Berriman, Sanger Institute
Just released version 1 of GeneDB (http://www.genedb.org), genome database for parasitic groups, initially S. pombe, Trypanosoma brucei and Leishmania major. Every genome will be annotated to GO, and GO association files will be released as these are done. Have written some parasite-specific GO terms (http://www.sanger.ac.uk/Users/mb4/GO). Annotation of malaria genome still planned to happen before Christmas. TIGR has mentioned they would like to get involved in that too.
7. Harold Drabkin, MGI
Majority of annotations still electronic with over 6000 genes annotated.
There are over 1500 hand-annotated genes, and these are done both as new genes are entered into MGI and as curatorial review of gene families and other genes. There has been an increase in hand annotations with function unknown (discussed further in the 'annotation discussion'). MGI curators are focusing on sets of genes and on new genes with no GO annotations. The MGI gene association files are updated on a weekly basis and the new file sent to the GO site and posted on our ftp site. We have modified the associations file to include taxonID, object type, 'non' option and syntaxes as agreed by the group. MGI curators are now adding over 100 annotations per week.
8. Midori Harris, GO Editor
Added a couple hundred GO terms... Most of these have been specific requests from SP annotators. Have identified areas that need work when time allows. Reorganizing documentation. Still need to hire Midori's assistant, but are now interviewing.
9. Evelyn Camon, SWISS-PROT
EBI, Wolfgang Fleischmann is doing automatic annotations to all species. all of GO... of 100,000 SP annotations, 40% have GO assignments... EBI open database to provide access to human gene products...have reached the stage that NCBI/Proteome, SP/InterPro, other GO translations (EC, keywords, etc), SGD, MGI, FlyBase, all annotations all together in dataset... Oracle DB of all associations... 1.5M entries of (Protein to GO). Will also submit gene associations back to the GO. Each assignment has its own ProteinToGo ID, can extract information about GoAH. 28,727 eligible human proteins; 11,435 IEA proteins (9,636 InterPro true match, 4,000+ via SP keywords, 577 via EC codes); 9,662 hand annotated proteins in the human dataset, 6,864 done by Proteome Inc.; 2,830 by EBI/SP curation team. There will be a new release of InterPro in the next couple of weeks. 3,000 human entries identified by Paul, those by Proteome were removed from curation set. SP curators have been working hard to assign GO terms to the remaining...
This stage of the work is now complete. From now on, SP curators will annotate GO terms. Report from Nicky about how InterPro annotations are done. QuickGO browser has some new functionality. Now there are links to microarray expression database. Proteome analysis pages at EBI urgently need GO-SLIM.
10. Erich Schwarz, WormBase
IEA has done 1/3 of 19,000 genes. Creating new parsing of ontologies and expanding automatic annotation. Building anatomy and developmental timing ontologies. Erich has been trying to finish RNAi ... 52 phenotypic types. Currently limited by not having a full GO curator. Ideally by next meeting... Proteome has asked WormBase to help them with 'GO-izing' their standard vocabulary that they use for all the phenotypes. Erich has been in contact with WIT2 annotations group. Have set up a collaboration with them to do that.
11. Physiological Presentation from Bernard de Bono.
Bernard has been lecturing on human physiology for the past 6 years. At the MRC-LMB he is annotating protein repertoires from the human genome, and is therefore interested in bridging representations of physiological processes with gene products. During his talk he suggested a physiology model in which a biological process could be precisely defined in terms of a large-scale exchange between compartments. Four main types of compartment were defined: subcellular, cellular, extracellular and surface. He created fourteen tissue Cell Blocks: seven of them interface with the extracellular compartment only, while the other seven interface with the surface compartment as well. As Cell Blocks are the terminal leaves of a classification tree, it is not intended that a particular histological cell type should belong to more than one Cell Block.
The human Cell Blocks described are: a) Cutaneous, Respiratory, Urological, Uterine, Testicular, Gastrointestinal, and Placental b) Hematological, Endothelial, Endocrine, Muscular, Nervous, Skeletal and Connective Tissue The whole technical discussion covered the following points, points 1. and 2. having been described above. Points 18 to 21 involved suggested extensions to this model to be discussed at a later stage of development. 1. A Process is the exchange between one compartment and another. 2. A series of major functional Cell Blocks was created and classified in terms of compartment contact. 3. An organ then becomes a Cell Block composite. 4. Cellular and subcellular compartments can then be addressed by the location of the Cell Block. 5. Extracellular and Surface compartments can be addressed by the Cell Blocks that are bound by it. 6. As organs are Cell Block composites, the anatomical location of the Cell Blocks can be tracked down. 7. A Process is an objective that can be depicted by a series of sub-objectives that may have to be temporally sorted. 8. A Process may occur only during a specific milestone in the organism's stage of development. 9. Separate time scales represent 7. and 8. to become the temporal ontology. 10. A spatial ontology represents 4., 5. and 6. 11. The Location Ontology captures 9. and 10. 12. A Process then becomes a cross product of Location Ontology and Function Ontology. 13. The Process Ontology editor creates paths in this co-ordinate matrix and assigns Physiological alias to every path. From 7., a path may have subpaths that are sorted along the temporal ontology co-ordinate. 14. The organism database GO annotator generates ontology co-ordinates in terms of What (function ontology), Where (spatial ontology) & When (temporal ontology) for every sequence. 15. If a gene product's ontology annotations hit a path from 13. the gene product automatically inherits that Process. 16. 
Cross product annotation is space efficient and robust to updates. Process definitions are more precise. 17. The creation of a 'Tool Box' of basic physiological objectives is therefore feasible. 18. Compartments can be mapped into partitions. 19. Processes description can then extend to exchange between one partition and another. 20. An Enzyme can be seen as pulling Molecule A out and pushing Molecule B into the same partition. 21. Can embryological/developmental processes then be defined as a transition from one Cell Block to another? 22. Caveat: this cross product can be represented by a DAG, but needs more than just 'is a' and 'part of' type of edges. Conclusion: As more and more complex organisms are annotated at gene level, it will become increasingly evident that gene products participate in more physiological processes than practicable to annotate directly - suggesting that curated paths using the spatial, temporal and function Ontologies as co-ordinates may be the solution to represent physiology. Annotation Issues 1. Midori and Rex continue on search and destroy mission....to get gene products out of ontologies... 2. DEFINITIONS.... now 18% done. Tomorrow we will be looking at the Oxford Dictionary of Biochemistry and Molecular Biology that will be provided as part of the Editor. Keep Oxford definition and add local identifier... For edited references that come out of some other resource, add original resource and the modifier resource. So, with more than one citation, it means that definition is composite of the two resources. Good progress. ... including as it does new terms that are supposed to be only entered with a definition.... 3. Database abbreviations are not always consistent. We will make a little flatfile of these... database:identifier and will post to CVS.... 4. Annotation Discussions 1) Annotating to 'unknown' is different than annotating to 'didn't look, don't know'. 
So, when annotating to 'unknown' because biology isn't known, MGI puts 'unknown' and references the paper... go through all the papers that are for this gene, use the most recent paper as a reference...last paper about that gene that was looked at. SGD..if they look and there are papers available but they don't address the issue, then they use a generic SGD citation...so, like MGI, from modification date. TAIR associates to the last paper, 'NAS' give the last dated paper. Unknown is used as annotation tool to help the annotation pipeline so that you know the effort was made to annotate the GO.
We considered advantages and disadvantages of both approaches (MGI & SGD) before coming up with the ND solution.
SGD: cite generic SGD
Advantage: doesn't attribute statement to a paper that didn't actually contain it.
Disadvantage: no indication of when someone last looked for information.
MGI: cite most recent paper
Advantage: provides a date so that curators know to look only at more recent papers for additional information.
Disadvantage: implies that the paper actually stated that something was unknown.
Using ND in conjunction with the date added to all annotations captures the advantages of both previous approaches.
So question arises, what do we put on the GO site? Have ontological term.
1) invent an evidence code for 'nothing is known biologically' ND
2) evidence-reference would be a local database citation
3) ALSO will add a date field to the entire table...so will know the last annotation date...for each line in the gene association file....to the end...yyyymmdd
This will mean the date on which the association was made; it will not need to be broken down into 'created' or 'modified' because we 'update' annotations by adding new lines to the association files (and deleting old lines if the situation calls for it).
VIRUS/PATHOGEN/PARASITE
Virus using host gene products will have associations to the virus genome.
Should use the gene reference to the model organism database. Issue is how to annotate, for example, a mouse protein that is abnormally functioning in the normal process of the viral genome. That is to say, the function of the mouse protein is 'normal' for the viral process, but abnormal for the mouse process. We need a way to identify the genome of the process being annotated. There was intense discussion reflecting that a curator of mouse proteins wouldn't be annotating to the viral process because that process is 'abnormal' for mouse biology. But the curator of viral proteins might want to indicate that a mouse protein was 'part of' the viral transcriptome, or something like that. So, it was concluded that in that kind of instance, the curator was annotating a normal process, and the association file needed to indicate both the taxon of origin of the gene product (which is the function of the taxon ID now), and additionally, be able to indicate the taxon of the genome being annotated. So, if only one taxonID is presented, it means that is the taxon of origin for the gene product and the taxon of the genome being annotated. If two taxon IDs are presented, then the 1st one is the taxon ID of origin for the gene product and the second one is the taxon ID of the genome under annotation. The important point here is that what 'normal' means is relative to the organism being annotated, i.e., normal for the host vs. normal for the virus.
********************
CONCLUSION, TAXON ID Column: 1st ID = taxon encoding the gene product; 2nd taxonID refers to the context, i.e. the user organism of the gene product. Syntax will be taxon1 | (pipe) taxon2. Implicit here is that the user taxon indicates whose perspective of 'normal' applies.
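The two association-file conventions recorded in these minutes (the pipe-separated taxon column and the mandatory YYYYMMDD date) can be illustrated with a minimal Python sketch. This is not Consortium software: the function and class names are hypothetical, the taxon IDs are treated as plain NCBI taxonomy numbers without any prefix (an assumption, since the minutes do not fix the exact spelling of an ID), and the sample values are illustrative only.

```python
from datetime import date, datetime
from typing import NamedTuple


class TaxonContext(NamedTuple):
    origin: str   # 1st ID: taxon encoding the gene product
    context: str  # 2nd ID: taxon whose genome/process is being annotated


def parse_taxon_column(field: str) -> TaxonContext:
    """Split a 'taxon1|taxon2' field (hypothetical helper).

    A single ID means the taxon of origin is also the taxon of the
    genome being annotated, as concluded in the discussion above.
    """
    parts = [p.strip() for p in field.split("|")]
    if len(parts) not in (1, 2) or not all(parts):
        raise ValueError(f"expected 'taxon1' or 'taxon1|taxon2', got {field!r}")
    origin = parts[0]
    context = parts[1] if len(parts) == 2 else origin
    return TaxonContext(origin, context)


def parse_annotation_date(field: str) -> date:
    """Parse the mandatory YYYYMMDD date of an association line."""
    if len(field) != 8 or not field.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {field!r}")
    return datetime.strptime(field, "%Y%m%d").date()


if __name__ == "__main__":
    # Illustrative values: 10090 is the NCBI taxon ID for mouse; the second
    # ID stands in for a virus using the mouse gene product.
    print(parse_taxon_column("10090"))
    print(parse_taxon_column("10090|11676"))
    print(parse_annotation_date("20011015"))
```

Returning the origin taxon as the context when only one ID is present keeps downstream code from special-casing host-only annotations; whether a real loader should do this or keep the second value empty is a design choice the minutes leave open.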
********************** Columns in SP annotation files db contributing - identifier from the contributing database (SP, TrEMBL, international protein index) - 3rd column would be international protein identifier - from other sources, 3rd column would be the gene symbol...would put any gene names in the synonym field...wanted to be able to use the IPI... so our suggestion is that they use the gene name if they have it, use the IPI as a default, if they update gene name, the IPI moves to the synonym.... ********************** BDGP report on software and database development for the GO John Richter, software...DAG-Editor...version 1.207, 1) can load files directly from the GO site, 2) new plugins load automatically, but can disable that feature 3) can add new relationships Chris Mungall...GO database 1) monthly archives in database, XML and flatfiles 2) have been expanding schema....can now have sequences in database. 3) want not the FASTA, just the relevant protein seqID, these will be loaded into the GO database. We do have sequence directory on CVS, which has SGD file of sequences on it. This will be dumped and new subdirectory created as noted below. 4) building tools to help in the proper use of the files by the community. One example would be to have triggers to prevent others from grabbing files and using them in analysis without understanding the IEA or levels and evidence. Cross products make a new ontology (transcription occurs in the nucleus) move nucleus to this new ontology move transcription to this new ontology 'new term' nuclear transcription, has new relationship should be high priority on the list of things to do to be able to create cross products. Will put the cross products where they belong. So, do we create a huge file of all, or do you create cross products with dangle unfound terms. Why don't we just permit loading of the entire directory....also would need definitions. So the load would be huge, and what would be the point. 
If we were loading 10 different files all the time, why not load as one big file. Mode of operation is to allow cross product generation to users...so the real question is if a new process term, a cross-product term, were in the process file, there should be access to all the term components of the term including the anatomy terms...Have to support dependencies in the different files. So if you load process ontology, will be prompted to load the anatomy files. 'verified' in the sense that a curator has looked at it. GO DATABASE Chris Mungall sequence blast results multiple sequence viewer... width of sequence bar reflects the degree of similarity. shows the multiple sequences on the top and then the blast results underneath. GADFLY So question now is do we move into database? As in, do the curators edit into the database rather than into the flatfiles, and at the end of each day, commit as well to the flatfile. So, need a group to manage this database, someone familiar with MySQL, postINGRES.... BROWSERS - BRAD MARSHALL AMIGO...www.godatabase.org very soon....(on the internal LBL site) Future directions **** BLAST server so that you can do a BLAST search against the gene product sequences with links back to the tree ****get a portion of the tree that you like, and get a FASTA dump of all the sequences associated. ****new browser gives number of terms annotated to term or terms under it. ***coming soon, will be able to select terms and download into a FASTA file ****want to select more than one evidence code ****want to filter by NOT for one or more evidence codes ****curator approved is everything other than IEA ****advanced search, can search by gene products, or gene symbols, pick data sources, etc. used to have ability to paste in a list, but have taken that feature out. but now people are again asking for that. So, can put back that capability... ****have extensive docs for this... Jason Stewart, new to the group, may do some software development with GO. 
He is also familiar with MGED development of structured vocabularies. MGED has been working with Rosetta, Affymetrix, others, to have a data model for microarray gene expression. MAGEML is the mark-up language... just had a jamboree in Toronto and put all the source code in SourceForge. They are building annotation tools to help build the XML files. So, the model is very large, has 146 different classes, very interconnected, a lot of context. There are lots of parts of the model where 'terms' need to exist. Simple and complex, some are forms of restricted vocabularies, some of them will be real ontologies. What they did indicate in the model is that whenever they come on one of these terms, they designate it as an 'ontology', e.g. this is Jason's ontology for describing spots on a glass slide. SO when others use the file, and find an ontology term that they don't know about, they will have a url to find the ontology. So the program that is taking the XML and putting it in the local database needs to go get that ontology and put it in the database that is being developed. SO Jason is writing a perl script that will allow a researcher to go get the XML at the url link to load into their database. So, researchers can each submit their own ontologies... So everyone around the world can utilize the terms now...note that this will not facilitate the development of a community standard.
MIRROR...need a UK mirror, Midori will talk with Pete
VERSIONS, REVISIONS, RELEASES...Diff problems are due to genuine bugs in DAGedit that will be fixed in 2.7. Bo at AstraZeneca asked if we could include a diff along with the monthly. No, we decided not to do that; the new directory structure should help this issue, since all the files will be organized better.
Term Counts...Bo...Mike and Midori got the same numbers...no one knows why Bo got different numbers.
GOBO - global open biology ontologies
We, the GO consortium, have three common GO ontologies.
Other ontologies such as anatomy and temporal ontologies will be developed and 'owned' by MOD databases. At the Bar Harbor meeting last July we heard in particular from plant people about the need for phenotype ontologies. Also need an ontology for biological substances so that those terms can be taken out of the function ontology. SP crew may do this. So, we need an umbrella under which we can have a variety of ontologies. This would essentially be a web site, cvs, or ftp site onto which different communities would be encouraged to deposit their ontologies.
1) The ontologies would be open.
2) They should be instantiated in GO syntax, flatfiles, so that they could be used with GO tools.
3) They should be orthogonal with existing ontologies...this is the hardest to resolve...
4) They would share ID space
5) Definition files should accompany ontologies.
Orthogonal issue is the biggest concern. There are reasons for this. For example, there are alternative, competing ontologies in the same domain; by definition they are not orthogonal. We would want to distinguish between competing and complementary. We would need to explain why orthogonalities are there if they are. So, would be fine if GOBO was a web server site...communities would be invited to send us their urls...There are other web sites for biological ontologies, but they won't adhere to the five principles above. Also, the anatomies and other ones that are used as part of the GO project would be handled somewhat differently. Most of these, however, will have some aspect of involvement with the GO project. Suzi will be writing adaptors for RDF / DAML+OIL, a data adaptor for GOedit. The beauty of using DAGedit is that you get something that can be instantiated in GO syntaxes.
ACTION ITEM...Michael proposing to publish a short editorial about this. ...

GO COMPLIANCE and JOURNALS.
Michael reported that Nature has been doing the experiment of using GO terms as metadata to articles.
Michael had a discussion with Declan Butler about this. This might be a more interesting concept than 'compliance'. Nature editors would do this. Coming down to a 'keyword' for the article that is selected from the GO. Rex says maybe if up and running with Nature, we should suggest the concept in a letter to other mainstream journals. GOSlim could be a keyword list, but for article keywords, should be any GO term. Evelyn says...keywords...EMBL adds keywords to flatfiles. Curators at EMBL talk to 8,000 scientists a month. Authors add keywords; they could be encouraged to add GO terms as the keywords. GenBank might add as a dbXREF.

dbXREF...GO has not been accepted as a dbXREF...but Michael is going to the next advisory meeting. GO dbXREFs with SP? Midori says they're going to be stored in SP-Oracle DB and will be visible to the public sometime.
ACTION ITEM...Michael will try to get GO accepted as a dbXREF for the sequence databases

JOBS and FUNDING
Midori's assistant advertised
FlyBase curator of GO hired (Rebecca)
Evelyn hired
Lincoln's grant is going in.
GO jobs will be on a separate Web page
Judy will do Grant Progress Report due Nov 1.
This year TAIR and WormBase will get funding from this grant.

PUBLICATIONS
Michael is talking about GO at Novartis meeting in London that requires a publication; Michael will do
Matt has TIG paper to add to the progress report
David and Michael will continue on cross-product paper.
Panther/GO comparative paper...evaluation on electronic annotation...David and Suzi (FANTOM)
Cathy Ball, SGD - 4way comparative paper...did pull in all GO annotations that were available...still working on publication...
Matt has forthcoming book chapter with Midori to go in Current Protocols in Parasite Genomics
Han Xie...internal review of Compugen stuff...
Also...
Berriman, M., Aslett, M., Hall, N., Ivens, A (2001) Parasites are GO. Trends in Parasitology 17(10) 463-4.
PMID 11642257

WEB PAGES
Midori is working on various pages
Karen and Midori talking about totally redoing Web page
**** Separate page for the job listings
**** Page for getting in touch with GO, listing email lists and other contacts, participating groups would list particular contact for that group
**** Another page of all members and former members
time on this? Karen will send out url on personal space before posting publicly

NEXT MEETINGS
GO Users Meeting Feb 1, GO meeting on Feb 2 and 3 (O'Reilly meeting the 3 days before them). Academic staff get 25% discount...$600, Faculty get 50%...$495, for O'Reilly meeting
How many people from us will be there...? 25...
Might be advertised more heavily...how many would be expected at Users Mtg? over 100 at least
Would have to pay for network access. Yes, it's critical, we decided.
Next beyond that...Michael will host in Hinxton....
Next beyond that...maybe hosted by Compugen in Princeton...
TIGR - contact Michelle and ask for a TIGR rep

===================================================================
GO meeting - Tucson, Arizona - Feb. 2-3, 2002

Introductions.

Progress reports.

GO central (Midori Harris).
* Adding terms, reorganizing ontology.
* Add Jane Lomax as editor.
* Gene product search and destroy, GO-slim etc still awaiting action.
* MOBY effort (Mark Wilkinson's brainchild). Provide registry for sequence retrieval, annotation of where genes are GO terms from different sources. Biomoby.org site for more info. Get MOBY white paper from Midori.

FlyBase (Michael Ashburner).
* Becky Foulger has done a lot of clean up.
* Better representation of data. Release 3 of Dm sequence.
* Reannotation of sequence will be available soon.
* MA and SL have had a collaboration with Celera Proteomics to compare their assignment of GO terms to Drosophila proteins (made with their PANTHER system) to those made by FlyBase. This is now being prepared for publication.

SGD (Karen Christie and Mike Cherry).
* EC definitions: downloaded EC datafile; function ontology file to clean up. All EC definitions MUST be checked before loading to be sure they don't overwrite.
* Total of 1500 GO terms get a definition as a result of this EC to GO effort.
* Anonymous CVS server is running.
* Website reorganized since Chicago meeting. Check updates on people page of GO web site.

MGI (Harold Drabkin).
* Two full-time curators.
* 23% increase in annotated genes since October. 45% increase in SwissProt assignments.
* Increase in hand annotation; triage process picks papers that are GO-friendly.
* Backpopulating entries through interactions with rat database.
* Developing new web interface.
* Since GO is now in the database, searches will be much better. In addition, tracking of literature will be improved.

TAIR (Lukas Mueller).
* Manual GO curation is being done by all curators.
* Annotations to 10k genes.
* Linked to MetaCyc. AraCyc database established. Used MetaCyc to GO. Generated 1612 annotations.
* Secondary metabolism not yet well represented in GO. TAIR will add many new terms in this area.
* TAIR has developed anatomy (1000 terms) and development (120 terms) ontologies.
* Created tool called pubsearch. Links literature, gene info and GO annotation. Contains abstracts and full text files.

WormBase (Paul Sternberg).
* Focused on large RNAi screens (1100 genes assigned to about 2000 GO terms). Should produce a large number of new curated terms.
* Life stages DAG for embryo and post embryonic stages and adults currently circulating through the community.
* Anatomy DAG (Raymond Lee). Cell lineage relationship included in DAG.

DictyBase (Rex Chisholm, Tricia Dyck).
* 7200 cDNAs
* chromosome 2 complete.
* Currently have about 2000 genes with annotations.

PSU (Matt Berriman)
* Malaria. Genome shrunk. 412 genes annotated.
* Tryps bruc. Chr 1. 400 genes
* Leishmania. Interested in using GO
* Entamoeba. Matt will present.
* Life cycle using cycle function of DAG-Edit.

GRAMENE.
(Pankaj Jaiswal)
* 8000 gene products in SP-TrEMBL, about 50% have GO associations.
* Rice should be available for next meeting.
* 300 plant related terms.
* Manual curation of 4000 proteins in next few years.

GKB (Beth Nickerson).
* Waiting to hire.

EBI (Evelyn Camon)
* GO release in works
* Keyword to GO mapping improved.
* Single GO curator
* Receptor database at EBI
* Interaction database (INTACT)

TIGR (Michelle Gwinn)
* 10000 microbial genes annotated to GO
* Arabidopsis

Compugen (Han Xie)
* annotating proteins; new version in about 2 months. Based on 1M proteins, 90% annotated in some way.
* Used for oligo design, and in commercial database
* Algorithm development with GO clustering to generate primitive ontology from literature

AstraZeneca (Courtland Yockey)
* GO has an official home within AZ. Gene association files reconciled with annotation of internal databases.
* Feedback-good: used for microarray data analysis. Increasing the number of things that can be prioritized for understanding arrays.
* Need immunology, physiology and hematology areas added. GPCR has added value. Use function ontology as an organizing principle.

Incyte (Lisa Matthews).
* Incorporation of GO for YPD (6200 genes-46000 GO terms)
* Mycopath (1800 genes with 7900 GO terms).
* Improving tracking.

October Action Item updates (see list of action items at end).
1. mouse
2. 3. 4. no progress
5. on agenda for later
6. standardization complete
7. added ND evidence code
8. add date field completed
9. taxon ID done
10. change submission requirement
11. update data directory-done
12. action items for John R.
13. DAG edit user guide. In process
14. transition to DB in progress
15. continued need for DBA
16. UK mirror in the works.

Software and Database
* DAG-Edit
  o Moved to SourceForge site.
  o DBxrefs now more automated
  o Type filtering
  o Find improved
  o Plugins launched in background
  o Reduced size flat file format
  o Dangling parent references now working-can link out to other references
  o Cycles are supported
  o Macros can be saved
  o History plugin restored
  o Term change tracker plugin added
  o Can associate pictures with terms. Plugin.
  o Future directions. Database beta test. Postgres problems.
* Gene products viewer
* Create servlets
* Move DAML/OIL compatibility
* Interactive database mode
* Need to track obsolete terms somehow.
* Database beta trial to continue
  o Will continue until a week has passed without problems. First priority to get save to work. Then get load to work. Then all editing will be on database.
* SourceForge repository reviewed. http://sourceforge.net/projects/geneontology
  o Bugs should be submitted via tracker at SourceForge
  o Requests for term changes need to be made through SourceForge
* Fasta/sequence status. Most sequences are available.
* Database schema. Changed to match DAG-Edit. Currently running two systems, MySQL and Postgres
* AmiGO update
  o Added peptide sequences derived from SP.
  o Icons that go back to sources added
  o Active link to ISS to show sequences.
* FAQ-O-matic.
  o Chris will implement a basic FAQ and develop a system for allowing it to be updated by consortium members.

Content.
* Presence of non-coding RNAs in GO. But need ways of representing genes of various classes. MA proposed an ontology for sequence features--SO (sequence ontology).
* Removal of gene products will continue with MH, MA and RLC. They will be replaced with more appropriate descriptors.
* Remove cyclin as we have an appropriate replacement, but ensure that synonyms are present to allow searches.
* GO-slim still on the agenda for future.

GOBO.
* MA proposed adding GOBO to the GO repository. The group approved.
* Ontologies must be orthogonal
* Developers who own ontologies will be responsible for maintaining them.
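
A minimal sketch of one rough way to screen two candidate ontologies against the orthogonality requirement: compare normalized term names and list any that both ontologies use. It assumes each ontology is already available as a plain mapping from term ID to term name; the function names, that dict representation, and the example IDs and terms are hypothetical and chosen only for illustration (GO flat files would need a parser first). A shared name is only a hint that two ontologies overlap, not proof, so a curator would still review each hit.

```python
def normalize(name):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def shared_term_names(ontology_a, ontology_b):
    """Return the sorted normalized term names that appear in both ontologies.

    Each ontology is a dict mapping a term ID to a term name. This is an
    assumed in-memory form for illustration, not the GO flat-file format.
    """
    names_a = {normalize(name) for name in ontology_a.values()}
    names_b = {normalize(name) for name in ontology_b.values()}
    return sorted(names_a & names_b)


# Made-up IDs and terms, for illustration only.
anatomy = {"EX:0001": "Leaf", "EX:0002": "root hair"}
development = {"EX:1001": "seedling stage", "EX:1002": "Root  Hair"}

overlap = shared_term_names(anatomy, development)
print(overlap)  # ['root hair']
```

An empty result does not show that two ontologies are orthogonal, since the same concept may appear under different names; checking synonyms as well would catch more cases.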

Cross product Generation.
* Discussed the production of cross products for anatomy; agreed to wait until next time to discuss, once tools are available.

Annotation issues.
* Have been using broad evidence codes. Discussed the need to expand evidence codes to more precisely capture the evidence supporting use of GO terms.
  o Use of more detailed codes would be optional
  o JR suggested an ontology of evidence codes. The group agreed.
  o Important to be sure that ISS codes trace back to a high quality annotation, not just IEA.
  o Agreed to make finer grained evidence codes available.
  o Discussed what codes we should use. (see list-get electronic version from Midori). Agreed that each must have definitions to be agreed upon in the period before the next meeting.
  o Where possible suffixes should be used consistently.
* David Hill raised the issue of cardinality induced by mutant phenotype evidence code. Midori suggested simply making a note that the field may be used this way in this case.
* Matt Berriman asked what citation should be used when there is no database. The suggestion was to point to a URL.
* Discussed situations where two terms from separate ontologies are often linked. Considered the possibility of developing tools to provide curators with options. This broadened to a discussion of tools for annotation including Talisman, Pubsearch from TAIR and sequence based tools. Sharing of these tools via GMOD was encouraged.
* Use of NOT was discussed. This is to be used in cases where a source explicitly states that something is not true.

Content issues:
* Discussed pathogenesis. How should we handle cases when this is a function of a pathogen? Agreed to add pathogenesis as a term with synonym of virulence.

Ontology Structure and Representation.
* The need to begin to develop more varied relationship types was discussed. It was agreed that this was needed, but could be delayed. MA presented a list of some possible relationship types.
  He agreed to share his list with the group for consideration and possible expansion.

Next meeting.
12-14 May at CSH?? Hinxton in the fall, coordinate with ontology meeting (mid November). [Later changed to September, immediately after Genome Informatics]

Action Items.
* Submit electronic annotation methods and tools to Suzi et al.
* Suzi will generate a report on progress since the last meeting. To speed up organism reports.
* JR action item: spell checking method
* JR action item: Other people submit rules and use to check ontology
* Brad: grey out obsolete terms in AmiGO
* GO requests for terms through SourceForge.
* Add links to GO web site to submit or track requests.
* Brad: AmiGO. ISS add links out.
* Individual groups should look at GO.xrf_abbs to check that URLs are correct
* Brad: Add documentation for format of GO.xrf_abbs.
* Chris will set up GO FAQ.
* We will establish a process for updating the FAQ using a distributed.
* MA will send out SO (sequence ontology)
* JR will develop plug in for producing cross-products
* Send GO slims that have been used and published on to repository.
* MH and MA will circulate expanded evidence code vocabulary with definitions for review.
* Suzi et al will develop tools for annotation. Talisman tool.
* Pubsearch from TAIR should reside on GMOD.
* Beth will pursue May meeting at CSH, with Michelle Gwinn from TIGR as a backup.

Action Items from October 2001 Meeting (this is just the list, no details or results)
1. Get temporal and anatomical CVs into CVS from participating databases
2. Tool to create cross-product terms
3. Post new biological process that incorporates updating developmental processes
4. Come up with initial default GO slim
5. Ontological formalisms (for Feb meeting)
6. Standardization of database abbreviations
7. New evidence codes
8. Add date field to the entire gene association file
9. Update Documentation to explain new use of the TaxonID column
10.
    Change requirement for submission of sequence information for gene products
11. Update directory structure on GO web site
12. Action items for John Richter
13. User's guide for DAG-Edit
14. Transition into using database as primary repository
15. Continue to consider the need for a DBA
16. UK mirror

===================================================================
Meeting of the Gene Ontology Consortium
Cold Spring Harbor Laboratory, Plimpton Room
May 12-13, 2002

Sunday May 12th, 2002

ATTENDING:
----------
- Suzanna Lewis        FlyBase          Berkeley, CA
- Michael Ashburner    FlyBase          Cambridge, UK
- Karen Christie       SGD              Stanford, CA
- Judith Blake         MGI              Bar Harbor, ME
- Elizabeth Nickerson  GK               CSHL, NY
- Janan Eppig          MGI              Bar Harbor, ME
- Courtland E. Yockey  AstraZeneca      Delaware
- Matt Berriman        PSU(Sanger)      Cambridge, UK
- Katya Mantrova       Incyte Genomics  Beverly, MA
- Han Xie              Compugen         Jamesbrook, NJ
- Liat Mintz           Compugen         Jamesbrook, NJ
- Bernard de Bono      MRC              Cambridge, UK
- Michelle Gwinn       TIGR             Rockville, MD
- Linda Hannick        TIGR             Rockville, MD
- Harold Drabkin       MGI              Bar Harbor, ME
- David Hill           MGI              Bar Harbor, ME
- Rex Chisholm         DictyBase        Northwestern
- John Richter         BDGP             Berkeley, CA
- Pankaj Jaiswal       Gramene          Cornell, NY
- Susan McCouch        Gramene          Cornell, NY
- Martin Ringwald      MGI              Bar Harbor, ME
- Midori Harris        EBI              Hinxton, UK
- Eurie Hong           SGD              Stanford, CA
- Chandra Theesfeld    SGD              Stanford, CA
- Mike Cherry          SGD              Stanford, CA
- Doreen Ware          Gramene          Cornell, NY
- Chris Mungall        BDGP             Berkeley, CA
- Eimear Kenny         WB               Caltech, CA
- Lukas Mueller        TAIR             Carnegie Inst., Stanford, CA
- Daniel Barrell       EBI              Hinxton, UK
- Evelyn Camon         EBI              Hinxton, UK
- Becky Foulger        FlyBase          Cambridge, UK
- Jane Lomax           EBI              Hinxton, UK
- Amelia Ireland       EBI              Hinxton, UK

GROUP REPORTS
-------------
Database summary - Suzanna Lewis
- presented table of all non-IEA annotations
- number of terms has increased 10% since last meeting (now 22,000 F; 21,000 P; 15,000 C)
- idea to show % genome covered, but gets into issue of estimating numbers
  of genes
- new ec2go mapping (thanks to Daniel)
- pie charts on AmiGO!!! can break down by group, by GO-slim like terms (chosen by numerical representation)
- Michael Ashburner suggests being able to choose a specific GO-slim file
- Matt suggested being able to keep top level pie chart when going to breakdown of a slice

FlyBase - Becky Foulger
- Fritz Roth data set added

SGD - Karen Christie, Mike Cherry
- EC definitions (~1500), curators nearly finished checking these; Rama will add when done
- added ~300 Component annotations from a large scale analysis paper from Michael Snyder's lab (Kumar et al. 2002); added only when 2 different methods confirmed the localization
- working on some new GO tools to incorporate into SGD: one maps sets of genes to GO-slim terms - still in the prototype stage
- for annotations to the unknown terms, have changed over to use of the ND evidence code and have added the date column to our gene associations file
- have changed our software to be able to display NOT annotations
- 2 new curators, Eurie Hong and Chandra Theesfeld, both in attendance

*** Action Item 1a (Brad Marshall): make display of NOT data possible/correct in AmiGO (e.g.
    FBP26 for SGD; FlyBase, others have more)

SwissProt - Evelyn Camon
- GOA file re-released - human data only
- annotation for other species - 2.1 million associations, viewable in QuickGO
- next priority - organisms not covered by a MOD
- new SwissProt keyword file, on the website (74% of keywords mapped to GO)

PSU (Sanger) - Matt Berriman
- another organism for Plasmodium falciparum, 74% of genome annotated, ~3000 annotations, all non IEA
- tsetse: working on gene association file soon, based on BLAST hits, etc.; sequencing is done, 21000 ESTs, ~8500 seqs
- life cycle stage ontology - http://www.sanger.ac.uk/Users/mb4/PLO/

MGI - Harold Drabkin
- GO now incorporated into the MGI database, so MGI browser now faster than it was, not using flat files, and reading data current to within a day, allows Boolean operators
- Martin just put mouse anatomy file onto GOBO
- David has been working on a scheme which could be used for GO-Slim (text of document distributed at meeting attached to the end of this file)
- phenotype ontology - making it loadable into DAG-Edit

TAIR - Lukas Mueller
- added a GO annotation search to the website
- literature curation tool (available via GMOD), have used the tool at TAIR, annotations haven't yet gotten to web site
- added/rearranged terms in metabolism for plant specific pathways
- embryogenesis vs.
  morphogenesis, in plants morphogenesis is not an obligatory child of embryogenesis, Tanya's proposal for revisions should be available for discussion soon
- GO-slim version, plant version for TAIR

TIGR - Michelle Gwinn (Comprehensive Microbial Resource)
- CMR: many associations, but can't release them until the genomes have been published, one paper has been submitted, as associations become available will be added to the GO annotations table
  - schoenella
  - Bacillus anthracis
  - Klebsiella
- auto-annotation tool to assign IEA annotations to non-TIGR genomes, may not work until more genomes are manually curated

TIGR - Linda Hannick (Arabidopsis)
- 2 new people, team of 5 now
- 20% of Arabidopsis genome is done, approaching by paralogous families of genes, tool to display similarities, stuff, speeds up annotation, everyone specializes in one area, rather than random
- trying to coordinate with TAIR people to avoid duplication of effort
- next genome - Trypanosoma brucei

Gramene - Pankaj Jaiswal
- 9000 annotations for rice, mostly IEA
- ontology browser on Gramene
- putting in rice anatomy and temporal files
- trait ontologies in rice, refining structures
- working with MaizeDB to develop resources for anatomy and trait, phenotype ontologies
- working with Michael Ashburner on chemical ontologies

WormBase - Eimear Kenny
- goal to have detailed descriptions for genes by mid-2003
- Andre P.
- Erich working on more extensive gene descriptions
- 2 new WB curators, 1 is in the process of moving to CA, already doing lit curation
- now 3 WB curators working on GO
- WB is developing ontologies: due for release soon
  o cell lineage ontology (Raymond Lee)
  o developmental ontology (Wen Chen) life stages

DictyBase - Rex Chisholm
- EST collection from Dictyostelium, 2000+ IEA annotations
- chromosome II about to be completed, mostly IEA annotations
- still working on final schema for database, still using a prototype
- ontology: anatomy, life cycle, have passed on to David, as simple test of cross-product
- funding - everything looks good for DictyBase's funding to be approved

GO - Midori Harris
- got SourceForge suggestion tracking running, and is working well, also helps people making requests see that there is a line for requests
- Jane and Amelia making lots of progress - definitions at 30% now!!
- new curators: Amelia Ireland just started, and Cath Brooksbank about to start

Compugen - Liat Mintz
- no non-IEA annotations
- continuing to work on gene associations, including annotations in different products
- just published their paper in Genome Research
- oligo-libraries arranged based on GO terms, genes that are not annotated are often low-expressers
- over 10% of human genome is transcribed from both strands
- hope to release a new version soon

Incyte - Katya Mantrova
- complete translation of protein properties into GO terms
- all databases now annotated with GO terms
- new term suggestions - 90% are getting accepted
- BioKnowledge library by subscription only from June 1st, free trial until then

AstraZeneca - Courtland Yockey (post-meeting addition)
- Incorporation of GO into AZ Bioinformatics Infrastructure
  o Global "protein annotation pipeline" currently utilizes EBI's GOA and NCBI's Proteome annotations as primary public source material
  o All derivative global molecular class databases being constructed utilize GO annotations
  o A global target decision
    support system (under construction) will utilize GO annotations as well
  o A number of internal groups continue to either use or inquire about GO annotations from the standpoints of microarray data analysis and text mining applications
- Additions to GO Infrastructure in AstraZeneca
  o MySQL database mirror of GO MySQL database set up which includes both public and AZ internal GO annotations, and which will serve as reference set for derivative uses and views
  o Plans to port GO from MySQL to Oracle were not pursued in favor of a MySQL-only solution
  o Internal GO Annotation effort (GOAC Project) now spans >2000 genes and >13,000 annotations

ACTIONS ON PREVIOUS ACTION ITEMS
--------------------------------
- SourceForge suggestion tracker - working well
- John Garavelli visiting EBI - helping with terms that have RESIDs
- John Richter will come visit people if they ask nicely (for help, analysis of system-specific oddities/bugs with DAG-Edit/GOET)
- Not done: Chris - have linkouts to sequence in cases such as ISS with ________)
  o ISS with SP:nnnnnn, click on ISS, get new report page
    _ with Literature: PUBMED;
    _ ISS with SWP:nnnnnn
- GO.xrf_abbs file: stable way for database cross-reference (Ask Brad) - in progress
  o need to invent a metareference for linking for curator refs
  o need to talk about specific columns, e.g. gp2protein
  o clarify abbreviations
- SO document, Michael Ashburner has submitted, people are commenting

*** Action Item 1b - metareference for curator refs for AmiGO (BDGP and/or GO): create a metareference for linking for curator refs for definitions for AmiGO (e.g.
    GO:mah, SGD:krc, etc)

*** Action Item 1c - AmiGO (BDGP): linkouts in AmiGO to sequence in cases such as ISS with ________)

*** Action Item 2 - GO.xrf_abbs file (each group): examine the GO.xrf_abbs file with respect to those abbreviations used by your group, add or submit (to your favorite contact with CVS write permission)

CONTENT ISSUES
--------------
- Ligand, everyone has agreed upon a solution and Jane is about to implement it
- E&M (Embryogenesis and Morphogenesis, not Electricity and Magnetism, this is biology! ;) - Tanya sent draft out Friday, about ready to implement changes to accommodate the fact that in plants embryogenesis is separable from morphogenesis
- have not been terribly consistent about "biosynthesis of ___" when we are talking about modifying a residue within a protein; some of these activities are grouped under "biosynthesis" while others are under "protein modification" [NB: not the exact term text strings]
  o so for modification of bases or aa residues within the context of the RNA or protein, it will be only modification; biosynthesis applies when the substance is made as a free substance
  o post-meeting addition (MAH): the case of selenocysteine, which is produced by modification of a serine residue attached to a tRNA. I think the 'not free so not biosynthesis' reasoning applies here too.

*** Action Item 3 - GO content: modification vs. biosynthesis (GO) - examine ontologies for consistency of term names in the area of modifications to nucleotides/amino acid residues within the context of an already synthesized nucleic acid/protein

- proliferation of sensu terms, (Suzi) do we have rules?
  should only happen in the case of homonym terms, same text strings with unique meanings for each organism

*** Action Item 4 - GO content: sensu terms (GO) - evaluate sensu terms, and expand documentation

- Kirill - don't have a term for "group transfer", instances of it certainly are covered, won't do this (insert a high-level grouping term) yet
- use of "AND" in a term,
  o Suzi is against it, because there is high probability of violations of the true path, should work to use "AND" as a grouping mechanism and attempt to use the structure to represent the grouping;
  o "and/or" is acceptable ???
- will examine these on a case by case basis and see where they are appropriate
- annotation to two different terms with an OR [NB: this would be two lines in a gene_association.yfo file with an "OR"],
  o DAML+OIL has a way of indicating disjunctions
  o Chris and John are not in agreement on how to deal with this, will hash out the options for software solutions to this issue and report back
  o following the above "conclusion" there were some additional group comments suggesting that the better way to approach this is to construct the ontology to have the appropriate grouping terms so as to avoid the need to have an "OR" join between two associations

*** Action Item 5a - GO syntax: use of "and" and "and/or" (GO) - evaluate use of "and" and "and/or" in GO terms, target for elimination when possible

*** Action Item 5b - possibility of ambiguous gene associations conjoined with "OR" (BDGP: Chris, John) - discuss possible software solutions to the question of joining two different associations (gene product to GO term) with an "OR", [NB: resolution of this item was unclear; first communicate with GO people on Action Item 4a and discuss whether there is any real desire/need to do this.]

*** Action Item 6 - expansion/clarification of GO documentation (GO: Cath B) - Cath will evaluate GO documentation and expand/modify to clarify

- Integrity checks
  o Do we have any rules for integrity checking?
    are we at a stage where we could?
  o let's look for:
    _ child terms lacking parentage that they should have
    _ redundant relationships - still some of these

*** Action Item 7a - ontology integrity checking (John) - will create a SourceForge submission page for ontology errors DONE!!! 5/13/02

*** Action Item 7b - ontology integrity checking (each group) - curators should look for ontology errors, i.e. for items to consider for automated integrity checking and submit them to the SourceForge page that John will create

NATIONAL LIBRARY OF MEDICINE (NLM) AND GO (Judy Blake)
------------------------------------------------------
- she and Michael Ashburner will be working with NLM to bring GO into NLM and MeSH
- has some papers on the topic, "Lexical properties of the Gene Ontology", but in order to map GO terms to NLM terms (of any type), NLM requires definitions for all (GO) terms; when NLM brings in a new system, they are looking to incorporate the new system as synonyms to existing terms OR make new terms if no syn exists
- Michael Ashburner had meeting with Stuart Nelson (head of MeSH) and Betsy... (head of NLM), to establish seriousness of GO on this project
- Courtland ? incorporate GO into MeSH, or have a new UMLS ?
  A: both, looks like very good progress

GO-SLIMS
--------
- TAIR and Amelia are working on some generic GO-slims, one for plants and one for animals, will be very similar, except for some things like no photosynthesis in animals
- have archived GO-slim which was used for Celera drosophila, will archive other GO-slims that have been used and which can be found, have written a document on GO-slims to be updated with a caveat about obsolete terms in archived slims
- David has a proposal (see attachment), with an example about how to get all membrane things, need to join membrane of cell fraction with membrane of cell; David chose the GO-slim from the DAG and selected bins as a biologist, rather than using a computational method to divide the annotations
- Michael Ashburner is against having "Other" terms in GO-slims, David's GO-Slim highlights some grouping terms that may be missing from the GO
- there was general agreement that the display software of our dreams would be able to generate an 'other' category (on the fly, maybe?) for pie-chart purposes
- we definitely decided not to add 'other' grouping terms into the ontologies
- handling redundancy, when a term may be annotated to two terms, with different granularity, issues about collapsing redundancy
- each GO-slim should have a document attached to it that explains its rules
- ? from Courtland, about being able to use a GO-slim to map an annotation set, Chris suggested that this should be a script, would be nice to incorporate these scripts into AmiGO so that people can use the various GO-slims and use the one of your choice to map the association file(s) of your choice
- Evelyn ?
  - naming convention for GO-slims - no problem to have as many as are needed/used, but we will put them into repository

*** Action Item 8 - submit GO-slim scripts/rules (each group, as relevant) - Submit scripts (Chris is fine with Python, or Perl) for using/calculating GO-slims to BDGP

*** Action Item 9 - GO-slim naming conventions (GO): - confirm/review naming conventions for GO-slims and expand documentation if needed (Michael Ashburner claims that there is a naming convention in the document that he has just written)

*** Action Item 1d : AmiGO (BDGP): Incorporate GO-Slim scripts into AmiGO

DAG-EDIT ISSUES
----------------
- John will come visit you to talk about, help set-up DAG-Edit if you ask him nicely
- upcoming change of field in DAG-Edit, where the ID will not automatically be GOID, could DAG-Edit read ID prefix from root term : Action item for John
- spell checking is not done
- integrity checking - not started, no info to do it, will need to discuss what the rules are
- database
  o occasional problems, still complicated
- capture semantics of transport? email from Chris Mungall to Midori...
o does this mean the thing itself moving, or o Chris has a little thing he can display to talk about this - relationship type choosing is now allowed - John proposed: o determine new relationship types o inform everyone of what it is and symbol to be used o then implement o - would have to modify true path rule *** Action Item 10 - DAG-Edit/GOET (John Richter) - automatic recognition of ID prefix so that one doesn't have to manually change it all the time *** Action Item 11 - division of "part-of" into multiple relationship types (Chris and Jane) - will look into new relationships deriving from the current multiplicity of the meaning of the "part of" relationship - sure wish we could do cross-products o John will do a "macro" for this o Chris proposed being able to select a term in each ontology and have a table generated, where one could select rectangular blocks, David wanted to be able to see "part of" relationships... o John "but that'll be huge...." o David "Embrace the Explosion." *** Action Item 12a - GO dictionary (GO, John Garavelli) - we need a dictionary for John to use for spell checking (John Garavelli wants to write a script for this anyway so he will generate the dictionary) *** Action Item 12b - GO dictionary in editor (John Richter) - can write a spell checker for the editor once he has a dictionary *** Action Item 13 - Cross-product tool (interested parties (David, Bernard, ?), Chris, and John Richter) - cross-product tool: further discussion will clarify what is actually wanted as well as feasible, so that John can write a plug-in for curators to use via the editor *** Action Item 14 - New documentation for making cross products in DAG-Edit as currently exists (GO: Jane, Amelia) - create document on generating cross-products in DAG-Edit - How do we handle IDs when we split terms?
o Currently, the old term and ID become obsolete, and both new terms get new IDs, with obsolete ID as a synonym to each new term *** Action Item 15 - comment field: obsoletes & syntax (GO) - move obsolete IDs from synonyms to comment field and institute a regular (as in parsable) syntax for this field *** Action Item 1e - display comment field in AmiGO (Brad) - display comment field in AmiGO BERNARD'S PRESENTATION : GOAL (GO Active Language) -------------------------------------------------- - Progress since Chicago - representation of physiology as a cross-product of anatomy, and multiple GO aspects - it is possible for terms to inherit processes from parent terms - activity - any GO Function (F) or Process (P) - compartment (CPR) - can link P and C terms when we know where a process occurs - A biological process can be formally described as a relationship between two CPRs using an activity. - CPR - a region of biological space that can be unequivocally addressed using a combination of nodes from the C, cellular, and anatomical ontologies - Bolus discoideum (Latin for disk-shaped round lump) will be the hypothetical model organism o 3 developmental milestones o 7 cell types - system models processes o exchange o stage o complexing Bernard's document is downloadable from the MRC-LMB ftp site; he's also got a PowerPoint doc there: ftp://ftp.mrc-lmb.cam.ac.uk/pub/bdb/GOAL_Framework.pdf ftp://ftp.mrc-lmb.cam.ac.uk/pub/bdb/GOAL_Presentation.ppt GOET editor (for GOAL) ---------------------- - John demos new software, which will probably replace DAG-Edit - this new program is going to use a DAML+OIL-like format, which will make many things that we've wanted to do possible o e.g. history saves, undo, simpler modules for changing a dbxref (currently programmatically difficult in DAG-Edit) o will make editing of new types of data much easier o DAML+OIL will have advantages for some of the new data types, e.g.
SO; also allows some intrinsic restrictions/rules for a given class - this is available through GMOD project, available on SourceForge ANNOTATION ISSUES ----------------- Concurrent assignments - Evelyn Camon (Correlations between terms often used together) - system in QuickGO which gives curator hints which terms usually turn up together, shows up in QuickGO, also will throw up exceptions "weirdness detector" for annotators *** Action Item 16a - concurrent assignment protocol/docs for QuickGO (Evelyn) - get documentation from Tom Oinn on how he did it for QuickGO; add to documentation, to explain how this is calculated *** Action Item 16b - concurrent assignments from database (Chris) - pull this calculation on concurrent assignments from manual annotations using Database [NB: Fritz Roth is doing some calculations along this line] *** Action Item 1f - AmiGO (Brad) - show concurrent assignments in AmiGO Evidence Codes -------------- - two concerns - issue 1 - evidence codes for annotations o categorization (currently), but does not imply confidence level, o in discussions between this meeting and the previous one in Tucson, and from the surveys done for Fritz Roth, it has become apparent that the evidence codes in use now do not provide an indication of confidence, that curators felt that they could not make judgements on experiment quality from evidence code alone - issue 2 - judgements of sequence believability o another practical ?, qualitatively, to develop a system that provides meaning to people running algorithms, which sequences do you believe in? what are rules/criteria for deciding which sequences to use, and also that the annotations are believable - want a good test set for which to test algorithms, that only contains the genes and those annotations which are deemed to be of good quality - Evelyn suggested using the QuickGO algorithm for correlated annotation with all the current IEA stripped out and compare to same algorithm run with the IEAs, i.e. 
does it still predict the same correlations - consensus opinion to not include IEA or ISS in training sets - attempt to evaluate training set (no IEA or ISS) quality - FlyBase Panther calculations - transitive errors, only 4% (ISS) o other error type = F errors (F#%*-up, also about 4%) - to get the clustered set: o do within the group o use EBI clustering (TRIBE, InterPro, or SP clustering) o Liat/Compugen may be able to do the clustering - use the training set to help develop a tool that helps with annotation *** Action Item 17a - sequence clustering for sequences annotated with GO (Daniel? Liat?) - take sequences as they are now, run a clustering algorithm, generate trees, attach GO annotations and inspect by hand *** Action Item 17b - very cool annotation tool (????, highly dependent on above) - use this to develop an annotation tool that utilizes homology clustering Annotation Tools: ----------------- - Talisman can be downloaded from EBI, semi-documented, a curation interface for GO in SwissProt, some discussion of transmitting annotations when appropriate to another MOD, currently no programmers to support this tool/program - Lucas's tool (now on GMOD) o searches PubMed, creates linking table for specified info o preindexed papers against GO terms (Perl module for text string matching), o implemented as a Java servlet o right now only abstracts indexed o Sue Rhee recently found new software for PDF to text o MySQL database, trying to make it more generic, submitting a GMOD grant to expand applicability *** Action Item 18 - IEA/ISS methods (each group, GO: Midori): Groups to submit to Midori short blurbs on procedures for large scale annotation methods (bulk assignments, particularly with IEA or ISS) with URLs to add to the annotations guide Consistent term use: -------------------- - Midori raised an issue about attempting to make sure that we use terms in consistent ways, Lisa Matthews has offered to send some notes about term use at Incyte
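
Illustration (not part of the original minutes): a minimal Python sketch of the comparison suggested under Evidence Codes above, i.e. computing term correlations once with all annotations and once with IEA and ISS stripped out. It is not the QuickGO algorithm, which these minutes do not describe; simple counting of term pairs per gene product stands in for it. The column positions are an assumption about the tab-delimited gene association layout (object ID in column 2, GO ID in column 5, evidence code in column 7), and the sample rows, product names and GO IDs are made-up placeholders.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Assumed 0-based column positions in a tab-delimited gene association line.
OBJECT_ID_COL = 1
GO_ID_COL = 4
EVIDENCE_COL = 6


def parse_associations(lines, excluded_codes=frozenset()):
    """Yield (gene_product_id, go_id), skipping comments and excluded evidence codes."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # blank line or comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= EVIDENCE_COL:
            continue  # malformed row
        if cols[EVIDENCE_COL] in excluded_codes:
            continue
        yield cols[OBJECT_ID_COL], cols[GO_ID_COL]


def cooccurrence_counts(pairs):
    """Count how often two GO IDs are annotated to the same gene product."""
    terms_by_product = defaultdict(set)
    for product, go_id in pairs:
        terms_by_product[product].add(go_id)
    counts = Counter()
    for terms in terms_by_product.values():
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
    return counts


def row(product, go_id, evidence):
    """Build one hypothetical association line for the sample below."""
    return "\t".join(["DB", product, product, "", go_id, "REF:0", evidence, "", "P"])


# Made-up sample data; the GO IDs are placeholders only.
SAMPLE = [
    "!hypothetical sample file",
    row("P1", "GO:0000001", "IDA"), row("P1", "GO:0000002", "IMP"),
    row("P2", "GO:0000001", "IEA"), row("P2", "GO:0000002", "IDA"),
    row("P3", "GO:0000001", "TAS"), row("P3", "GO:0000002", "ISS"),
    row("P3", "GO:0000003", "IDA"),
]

all_counts = cooccurrence_counts(parse_associations(SAMPLE))
curated_counts = cooccurrence_counts(parse_associations(SAMPLE, {"IEA", "ISS"}))
```

Comparing `all_counts` with `curated_counts` shows which term pairs remain correlated once the IEA and ISS annotations are removed; a real run would read the lines from an association file instead of `SAMPLE`.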
================================================================== Monday May 13th, 2002 GOBO and SO ----------- SO -- - Mike Cherry did some reorganization of the directory structure and put the GOBO stuff into the CVS repository, see http://www.geneontology.org/doc/gobo.html - Martin put up first bit of mouse on Friday - SO attempts to provide a controlled vocabulary for sequence features, and types of genes, e.g. whether primary transcript is edited or not, located sequence features and clones and ways to locate them on the sequence - Michael Ashburner and Suzi will write a supplemental NIH-grant off the GO grant to get a software person to do SO, since this will require DAML+OIL type slots to adequately describe the information types - Lincoln Stein, Owen White, Ewan Birney, test project , servers to provide data in the same way DAS server using the SO terminology, will help refine the quality of the SO - John and Chris made some comments about conversion from GO format to DAML+OIL, John suggested that it might be easier to dev in DAML+OIL from the early stages Biochemical Ontology -------------------- - Pankaj has been working on this - Michael Ashburner has restructured some of this and hopes to release it upon his return to UK next week - most restructurings to split out classifications by compound type and classifications by action - also removed "compound names" - Pankaj will parse in CAS #s - will help to do metabolism by cross-products - MESH is semantically mixed, and heavily biased by pharmaceutical compounds - CAS is not open to the public - about 1400 terms now Disease ontologies ------------------ - Rat people are very interested in these, DictyBase as well - UMLS has tied together a lot of this type of information, though there are some major issues with licensing and public access; but there are many classifications already so do some of these provide a good starting point (JB) - Michael Ashburner is unaware of anyone doing this as yet - where 
does NCI fit into this? doesn't seem to be much of a relationship... - SMD/MGED may already have some starts on this type of ontology - also need to make sure that we get definitions into this - with respect to tying information to human disease (key to much of our various groups' funding), it is key that we make the relationships between genes/phenotypes and relationships to human diseases - some overlap with phenotype ontologies and Bernard's attempts to describe physiology - can't easily use SNOMED, restrictions on its use Cell type ontologies -------------------- - Martin wants to do a mammalian one - already one for Drosophila Phenotype ontologies -------------------- - Michael Ashburner trying to get together a group for this, Gramene very interested GKB - Elizabeth Nickerson ------------------------- - integrative database for human biology o biological processes o biological pathways o collaboration with GO, EBI - top-down approach, from topics down, rather than gene by gene - ? how to store as a knowledge base, rather than as a database...
want to see when the output of one assertion is the input of another assertion - tried lots of grammatical tagging, outputs were often unsatisfactory - now: input and output tagging, and linking to GO terms - and still want to link to references for every assertion - using Protege for structuring the data - but lack a good interface for authors to input data, currently using Excel spreadsheet for authors so that each sentence is associated with metadata in a way that can be imported into Protege GMOD: Generic Model Organism Database (Mike Cherry) ---------------------------------------------------- - www.gmod.org - organization headed by Lincoln Stein - idea is to create modules, small components that can be used - more robust, shareable, documented software modules - so that a new database starting up doesn't have to start de novo with writing their own software, GMOD proposal submissions close in a week - what would a new database have to create within their first 6 months in order to get up and running - initially asking everyone to make everything Open Source; for GMOD purposes, Open Source is as defined on the SourceForge site - GMOD site is an open repository of tools that are being made available - 4 older MODs are to be given supplemental funds for GMOD efforts, with the understanding that if one group is developing a tool, they enquire of the others, how would you use this tool and make it useable for all the groups Data Distribution (Chris Mungall) --------------------------------- AmiGO - pie charts o Matt's already suggested a slight modification (keeping original pie) - Graph view (from term pages) o some modifications to clarify (GOID #s) o call for suggestions for using the network graphs *** Action Item 1 continued - AmiGO (Brad Marshall) - additions to AmiGO - add a SourceForge site for AmiGO bugs/requests - gray out obsolete terms (post meeting addition) - link from treeview page to graph view - search function for the comments - don't automatically
toggle to gene product when the search result comes up null - need to make sure that definition references go up with the def, not in the general dbxrefs - add ability to upload files for multigene search - GOST, request for it to accept a seqID - want to be able to search with SwissProt accession numbers (this requires a gp2protein file for every organism, nothing for TIGR, PomBase, etc.) - having a way of hiding/deselecting GO terms in BLAST report that you don't believe - Chris has some experience with dividing TrEMBL into reliable and non-reliable, may be helpful to others in generating gp2protein files *** Action Item 19 - gp2protein file documentation (Chris??) - expand documentation for gp2protein files Monthly Releases ---------------- - request for synchrony between flat file releases, Definitions at the same time as ontology files - ftp site is being updated hourly (15 after the hour) - Courtland Yockey - Could we use the archives to track our understanding of biology? - Courtland Yockey - suggested a month-to-month diff file o is interested in this for corporate/pipeline people for being able to track and find differences, and that GO could provide this as a resource for others - Courtland Yockey suggesting some sort of monthly summary of major changes in a place where it is easy to find for part-time users of GO, monthly release notes?
this could also explain motivation for changes and clarify rationale *** Action Item 20 - monthly release notes (GO) - take a look at doing monthly release notes *** Action Item 21 - monthly diffs (Courtland Yockey) - will investigate DAG-Edit diffs, and communicate with John regarding proceeding further on utility of a plug-in for DAG-Edit that could do this Database beta test is over for now ---------------------------------- - hopefully will resume at next meeting - with the new GOET tool Planning Ahead --------------- Upcoming Meetings ----------------- - Genome Informatics - John, talk about GOET and relation to databases - ISMB - Michael Ashburner to give plenary - November meeting (17-20) in Hinxton - MGED meets GO, w/ diseases and chemicals o ontologies, and tools for building ontologies o Suzi soliciting Bernard to submit abstracts to this (Michael Ashburner is in a position of power to select items of key interest, e.g. Bernard's results, SO by Suzi) - Judy will do GO for course at Woods Hole in November - Midori will be doing 3 meetings o E-biosci o NetTab meeting - agents in bioinformatics (www.nettab.org) o Ontologies for Biology (European Science Foundation) - Heidelberg, Germany - FANTOM, part of why David generated rules for a GO-slim - KDD Cup 2002, this year will have both a FlyBase corpus and also an SGD corpus, the attendees are often from corporations and the results are rarely available to mere mortals, but may be available to the AZ's of the world Publications: ------------- - Chris on GO database - GO publication for Current Protocols - Judy and Midori Blake, J.A. and M. Harris (submitted) "The Gene Ontology (GO) Project: Structured vocabularies for molecular biology and their application to genome and expression analysis" in Current Protocols in Bioinformatics, Baxevanis, A., Davison, D., Page, R., Stein, L. and Stormo, G., eds.
Wiley & Sons, NY - Matt's in Current Protocols in Parasite Genomics - Evelyn: 1 to Genome Research (interpro2GO mapping); Bioinformatics article on InterPro; SwissProt article to Genome Research - Midori - 2 requests o Briefings in Bioinformatics _ thought to be neither time nor cost effective to put such an article in such a small journal, so opinion against accepting either of these at this point in time to avoid saturating market with the same thing again o Current Drug Discovery, _ sort out gene nomenclature mess... _ trade journal for portion of pharma industry Website stuff ------------- *** Action Item 22 - update to current GO home page (Karen) - make links to Gavin's source *** Action Item 23 - DAG-Edit user notes (Jane) - will post DAG-Edit user notes *** Action Item 24 - GO FAQ (Rama and Cath) - populate FAQ with Q & A's Amelia's website proposal ------------------------- - good start on content reorganization - suggestion to remove link to EP-GO browser - statistics on hits: o 951 to AmiGO o 91 to MGI o 52 to EP-GO - conclusion?
Hinxton GO meeting ------------------ - Genome Informatics (GI) meeting 4-8th September - GO Users meeting September 9th - GO Consortium meeting September 10-11 - users coming to GI, can extend housing an extra night, Users not coming to GI can refer to page of suggestions for housing, travel - Consortium Members, possible to extend housing for one night via registration page, may be possible to extend housing via another mechanism, Consortium meeting will be in Cambridge rather than in Hinxton - Suzi strongly encouraging attendance at GI, deadline for abs is middle of June - Structure of User's meeting o Midori's thoughts, thinking of still having talks, but also having poster sessions, panel discussions o workshops with John, Chris software stuff o advertise such types of contents *** Action Item 25 - Hinxton meeting (Michael Ashburner) - a: find venue for 10-11 meeting - b: get a Manchester person down to talk about DAML+OIL *** Action Item 26a - Hinxton Users meeting (Midori and Karen) - will work out logistics of registration (Consortium members will probably also use the registration page) *** Action Item 26b - Hinxton Users meeting (Midori) - add suggestion tick box to reg form for what would you like to see *** Action Item 26c - Hinxton Users meeting (Midori) - mailing to go-friends list asking about desired content/attendance for User's meeting GO meeting after Hinxton ------------------------ - proposal to have it in John's home town in St.
Croix, Virgin Islands - hotels not too expensive - TIGR - better in the spring, not winter - late January - arrive on Friday 24th, meeting 25th-26th, leave on 27th January 2003 o no Users meeting o pending quote from John *** Action Item 27 - quotes for Virgin Islands meeting proposal (John Richter) - will get quotes and send to list within the next week ================================================================== MGI Gene Ontology Progress Report of May, 2002 A GO-slim from a biological perspective (page 1/2) by David Hill Cellular Component 1.) non-structural extracellular: extracellular EXCLUDING extracellular matrix 2.) extracellular matrix: extracellular matrix 3.) plasma membrane: plasma membrane 4.) other membranes: (membrane EXCLUDING plasma membrane) OR (membrane fraction NOT plasma membrane) 5.) cytosol: cytosol OR (sarcoplasm EXCLUDING (sarcoplasmic reticulum OR junctional membrane complex)) 6.) cytoskeleton: cytoskeleton OR microtubule organizing center OR spindle OR muscle fiber OR cilia OR flagellum (sensu Eukarya) 7.) mitochondrion: mitochondrion 8.) ER/Golgi: endoplasmic reticulum OR ER-Golgi intermediate compartment OR Golgi apparatus OR transport vesicle OR Golgi vesicle 9.) translational apparatus: eukaryotic 43S pre-initiation complex OR eukaryotic 48S initiation complex OR eukaryotic translation initiation factor 2B complex OR eukaryotic translation initiation factor 4F complex OR nascent polypeptide-associated complex OR signal sequence receptor complex OR ribosome 10.) nucleus: nucleus 11.) other cytoplasmic organelle: acidocalcisome OR cytoplasmic exosome OR endosome OR glyoxysome OR lysosome OR peroxisome OR vacuole 12.) other cell component: cellular component NOT (1-11) Molecular Function 1.) defense/immunity protein: defense/immunity protein 2.) 
cytoskeletal protein: cytoskeletal regulator OR motor OR structural constituent of cytoskeleton OR structural constituent of eye lens OR structural constituent of muscle OR cytoskeletal binding protein 3.) transcription regulator: transcription regulator 4.) cell adhesion molecule: cell adhesion molecule 5.) ligand binding or carrier: ligand binding or carrier 6.) ligand: ligand 7.) receptor: receptor 8.) other signal transduction molecule: signal transducer EXCLUDING (ligand OR receptor) 9.) enzyme: enzyme 10.) transporter: transporter 11.) enzyme regulator: enzyme regulator 12.) other molecular function: NOT (1-11) Biological Process 1.) cell adhesion: cell adhesion 2.) cell-cell signaling: cell-cell signaling 3.) cell cycle and proliferation: cell cycle OR cell proliferation 4.) death: death 5.) cell organization and biogenesis: cell organization and biogenesis 6.) protein metabolism: protein metabolism 7.) DNA metabolism: DNA metabolism 8.) RNA metabolism: RNA metabolism OR transcription 9.) other metabolic processes: metabolism EXCLUDING (DNA metabolism OR RNA metabolism) 10.) stress response: stress response 11.) transport: transport 12.) developmental processes: developmental processes 13.) signal transduction:signal transduction 14.) other biological processes: NOT (1-12) =================================================================== Gene Ontology Consortium Meeting Lucy Cavendish College, Cambridge, UK September 10-11, 2002 Contents Participant list Progress Reports Action Items from last meeting Ontology Representation: Chris Wroe GOAL: Bernard de Bono Database & Software Content Issues Annotation Issues Documentation Other items Appendix 1: Collected action items from this meeting Appendix 2: Notes on C. Wroe and B. de Bono presentations A. Chris Wroe, DAML+OIL B. Bernard de Bono, GOAL Appendix 3: Handouts accompanying progress reports A. FlyBase B. GOA at EBI C. MGI D. SGD E. TIGR Eukaryotes F. 
TIGR Microbes Appendix 4: Action items from CSH May 2002 Participants: Michael Ashburner FlyBase Cambridge, UK Rama Balakrishnan SGD Stanford, CA Daniel Barrell EBI Hinxton, UK Tanya Berardini TAIR Carnegie Inst., Stanford, CA Matt Berriman PSU(Sanger) Hinxton, UK Judith Blake MGI Bar Harbor, ME Cath Brooksbank EBI Hinxton, UK Evelyn Camon EBI Hinxton, UK Mike Cherry SGD Stanford, CA Rex Chisholm DictyBase Northwestern Univ., Chicago, IL Karen Christie SGD Stanford, CA Bernard de Bono MRC-LMB Cambridge, UK Becky Foulger FlyBase Cambridge, UK Linda Hannick TIGR Rockville, MD Midori Harris EBI Hinxton, UK David Hill MGI Bar Harbor, ME Eurie Hong SGD Stanford, CA Amelia Ireland EBI Hinxton, UK Suzanna Lewis BDGP Berkeley, CA Jane Lomax EBI Hinxton, UK Brad Marshall BDGP Berkeley, CA Lisa Matthews Incyte Genomics Beverly, MA Suparna Mundodi TAIR Carnegie Inst., Stanford, CA Chris Mungall BDGP Berkeley, CA Sue Rhee TAIR Carnegie Inst., Stanford, CA John Richter BDGP Berkeley, CA Erich Schwarz WB Caltech, CA Valerie Wood PomBase(Sanger) Hinxton, UK Han Xie Compugen Jamesbrook, NJ Visiting on Tuesday, Sept. 10, 2002: Robert Stevens University of Manchester Chris Wroe University of Manchester Progress Reports GO Curators at EBI - Jane to visit NLM (more below) - 60% of terms now defined - MIPS Funcat <--> GO mapping posted (go/external2go/mips2go) - other aspects of progress touched on in review of action items from CSH FlyBase - see handout; highlights: - recuration for release 3 of Drosophila sequence (gaps filled; new genes) - Eleanor Whitfield (SP) cross-checks FB & SP annotations for redundancy ** action item 1: FB to use PubMed IDs instead of [or in addition to?] 
FBrf IDs SGD - see handout; highlights: - new GO Term Mapper and GO Term Finder tools - GO Tutorial (some parts generic GO, others SGD-specific) - at least one annotation for every gene known to encode a product MGI - see handout; highlights: - areas where GO annotation is focused - cross-product manuscript accepted (Genome Research) - work on cellular and developmental processes TAIR - replacing IEA with literature-based annotations - nifty cell viewer - organize annotation effort by cellular component (using cell viewer) or by pathway - Pubsearch tool helps with literature mining (from Suparna's Users Meeting talk) WormBase - developmental stage ontology to be released soon (waiting for some data on aging to be made public) - anatomy ontology also in the works; has about 5900 terms! - working on tool, way to handle GO annotations in ACeDB - will update RNAi --> GO term mapping (used for some WB IEA annotations) DictyBase - NIH funding started August 1 - SGD tables loaded with Dicty data - manual curation getting started PSU - malaria genome manually annotated to GO (lots of ISS updated to IDA, especially for cellular component) - annotations will be released when genome paper is published - now working on T. brucei - life cycle ontology in progress (Matt & John Richter will try to speed up DAG-Edit -- was very slow because of many many relationships) - for S. 
pombe: Data now in GeneDB (replaces PomBase) EBI "GOA" - see handout; highlights: - annotation file releases since last meeting: - 5 gene_association.goa_human releases - 3 gene_association.goa_sptr releases - want GO annotations associated with EMBL-Bank records by end of 2002 - manuscript submitted to Genome Research - possibility of SIB-based SP curators using GO to be explored (SP/EMBL retreat coming up late Sept) - UniProt Consortium [SP (EBI and SIB) + PIR] grant funded; will allow more manual assignment of GO terms to TrEMBL entries TIGR - two handouts: 1 on eukaryotes, 1 on microbes; highlights: - sharing Arabidopsis annotations with TAIR - Manatee tool: interface for editing GO terms and evidence - have RefSeq gi number --> GO ID; GO group recommends using protein id instead (gi's not shared by 3 collaborating nucleotide sequence dbs) - 7 microbial genomes annotated to GO; Vibrio cholerae on GO site; others awaiting genome completion and/or publication - GO terms displayed on CMR ** action item 2: TIGR to provide protein id --> GO ID ** action item 3: TIGR to send IEA annotations to GO for genomes not sequenced at TIGR Compugen - GO annotations updated (August 2002) Incyte - academic subscriptions to *PD databases - Lisa seeking to offer financial support for GO meetings Action Items from CSH meeting (May 2002) (also see complete list in appendix 4) 1. Many AmiGO items -- see software section 2. Check over GO.xrf_abbs file -- essentially done, except for incremental updates 3. GO content: modification vs. biosynthesis -- done 4. GO content: evaluate sensu terms -- done; essentially all will be kept; more "sensu" terms will be added, as will more generic terms as parents for "sensu" terms 5. GO syntax: use of 'and' and 'and/or'; 'or' in gene associations? -- Jane is working on removing most terms with "and"; nothing done with gene associations yet 6. 
Expansion/clarification of GO documentation -- not done yet, but Cath presented a plan of action that sounded good ** action item 4: Cath will update documentation and circulate drafts 7. Ontology integrity checking -- in progress; one thing done so far is that Amelia has a script that checks for several errors; John has set up SourceForge tracker for suggesting checks 8. Submit GO-slim scripts/rules -- ongoing 9. GO-slim naming conventions -- was done even before it became an action item 10. DAG-Edit/GOET automatic recognition of ID prefix -- not discussed; probably not done yet 11. division of 'part-of' into multiple relationship types -- not even started 12. GO dictionary -- done, with procedure in place for incremental updates (didn't touch on whether it'll be implemented in DAG-Edit or GOET, or, if so, when) 13. Cross-product tool -- nothing beyond current DAG-Edit yet (so cross-products can be done but not as easily as we'd like) 14. New documentation for making cross products in DAG-Edit -- not done yet 15. comment field: obsoletes & syntax (GO) -- done 16. concurrent assignments: QuickGO, database -- documentation on QuickGO at http://golgi.ebi.ac.uk/ego/manual.html and http://golgi.ebi.ac.uk/ego/index_internal.html; Evelyn will try to track down more; nothing on GO database side yet ** action item 5: Evelyn to continue tracking down info on QuickGO concurrent assignments ** action item 6: consortium, especially Chris M, to revisit concurrent annotations in GO database 17. clustering sequences annotated with GO; tool -- nothing yet 18. Short descriptions of IEA/ISS methods -- in progress 19. gp2protein file documentation -- Amelia did a small amount a while back; no word on updating or expanding it 20. monthly release notes -- in progress; see Documentation section 21. monthly diffs -- in progress; see Documentation section 22. update to current GO home page make links to Gavin's source -- not done yet 23. post DAG-Edit user notes -- done 24. 
GO FAQ -- Cath and Rama will work on FAQ (not much done yet) (other action items were related to organizing meetings) Ontology Structure & Representation: guest presentation by Chris Wroe, with input from Robert Stevens I'm not going to try to reproduce Chris' talk (!) but here are some highlights: - ontologies for biology (such as GO) are best done by biologists for biologists - description logic systems such as DAML+OIL provide a mechanism for building and maintaining ontologies (easier to maintain consistency and completeness with "hand-crafted" ontologies) - examples from GONG: finding inconsistencies and missing relationships that would be really hard to find manually - used MeSH chemical terms - missing 'isa' relationships added - some 'isa' relationships made more specific - errors corrected (e.g. a 'catabolism' term under a 'biosynthesis' parent) - DAML+OIL can be used at any point along spectrum -- don't have to have formal structures already in place to convert - OilEd tool now available; previously tools for use with DAML+OIL underdeveloped - definitions (in the DAML+OIL sense): formal definitions for concepts are easy to create for some (e.g. metabolism) terms but much more complicated for others (e.g. enzymes) - conversion to DAML+OIL will mean a large increase in source code; difficult or impossible to do DAML+OIL diff; Michel Klein developing "virtual cvs" - case study (how apt!): medical vocabularies and the "exploding bicycle" -- highlights need for constraints on what can be combined in cross-products Linda Hannick took good notes on this talk, so I've included them as Appendix 2A. GOAL (GO Annotation Language) update: presentation from Bernard Once again I'm not going to reproduce the whole presentation. 
Highlights: - concept of "structure," in this context referring to any physical entity, such as a gene product or a cellular component - structure provides activity - activity changes structure - word count on GO terms: - most frequently used words are connectors ('of', 'and', 'sensu', etc.) - 50% occurred only once; of these 65% are "structure" words - activity = change in structure over time, or A = (delta S)/(delta T) - defining structure (S) and measuring S and time (T) provides information on the activity (A) A(r) ---> A(p) activity S(1) ---> S(2) where A(r) and S(1) are starting activity and structure, respectively, and A(p) is activity provided by new structure S(2) - concept of "housing structure" S(H) -- the nearest common parent of S(1) and S(2), not affected by the activity that converts S(1) to S(2); relevant to measuring time (T) -- relative, not absolute, time is what's important A = [S(1)S(2)]/S(H) can compare different activities using function S[S(1)S(2)] A = S[S(1)S(2)]/[TS(H)] - collaborating with Rex Chisholm to try this for Dicty; update later Linda's notes are included in Appendix 2B. Database & Software Issues DAG-Edit & GOET - John hopes not to do any more development on DAG-Edit. There are a few bug fixes outstanding, but he won't add new features. John would like someone else (a Java programmer) to take over DAG-Edit maintenance; Sue offered to ask Danny Yoo to do it. - John will add an integrity check to the flat file helper to check for deletion of terms that were present in the files loaded. ** action item 7: add check for term deletion to flat file helper ** action item 8: Sue will ask Danny to take over DAG-Edit maintenance ** action item 9: Amelia will collect bug reports and feature requests from curators. If John can't act on feature suggestions, perhaps Danny can. - John is developing GOET in the context of image annotation for Drosophila.
This takes his time away from GO in the short run, but he will be working on the infrastructure of GOET, which will eventually benefit GO. AmiGO Brad has made progress on most of the AmiGO-related action items from last time: a) make display of NOT data possible/correct in AmiGO (e.g. FBP26 for SGD; FlyBase, others have more) -- DONE b) metareference for curator refs for AmiGO (BDGP and/or GO): create a metareference for linking for curator refs for definitions for AmiGO (e.g. GO:mah, SGD:krc, etc) Not done yet because there was no way to distinguish a definition dbxref from any other dbxref (also relevant to item l), nor was there any way to tell a reference to a person apart from a reference to a database entry. We'll introduce a prefix to be used for references to curators (GOC:) and Brad will generate web pages to be used as the metareferences. ** action item 10: change prefixes to "GOC:" for definition references that represent an individual curator or group of curators ** action item 11: Brad will create a form where curators can enter info (e.g. name, affiliation, dbxref entered in definition reference field), and create and link a web page for each GOC:xyz entry c) linkouts in AmiGO to sequence in cases such as ISS with ________) -- DONE d) Incorporate GO-Slim scripts into AmiGO -- not done yet e) display comment field in AmiGO -- This requires comments to be stored in the GO database, and will be done as soon as they are. 
** action item 12: Chris to get comments into the database f) show concurrent assignments in AmiGO -- another one in the pipeline, pending addition to GO database g) add a SourceForge site for AmiGO bugs/requests -- DONE h) gray out obsolete terms (post meeting addition) -- DONE i) link from treeview page to graph view -- DONE j) search function for the comments -- again, depends on having comments in database k) don't automatically toggle to gene product when the search result comes up null -- DONE l) need to make sure that definition references go up with the def, not in the general dbxrefs -- can be done once definition references are distinguished from other dbxrefs in the database m) add ability to upload files for multigene search -- DONE n) GOST, request for it to accept a seqID -- programming done; will be "live" once new Linux cluster is installed (probably is by now) o) want to be able to search with SwissProt accession numbers (this requires a gp2protein file for every organism, nothing for TIGR, PomBase, etc.) -- notes aren't quite clear; doesn't seem to be done yet p) having a way of hiding/deselecting GO terms in BLAST report that you don't believe -- hard, and not done, but one can now choose a cutoff score GO-Slim Issues (overlap between software & annotation): Many users have asked the model organism DBs (especially SGD) to provide files with gene symbols and GO (or GO-Slim) terms that have been assigned to the gene product. After some discussion we decided to do so, and to include both annotations to the "unknown" terms and genes that have not yet been annotated (the latter will be listed as "unexamined"). Mike Cherry also suggested a table showing each GO term and a list of gene products annotated to it (originally suggested to Mike by Fritz Roth). No decision on this one. On a related note, Chris has devised, and Matt has tried, a clunky method for generating pie charts using a GO-Slim of one's choosing. 
The clunky bit is that associations between gene/gene product IDs and GO-Slim terms have to be reloaded into a new database. ** action item 13: Add a link to the GO-Slim directory to the home page. ** action item 14: DBs to send GO-Slims and lists of all genes to BDGP. ** action item 15: BDGP to generate tables of gene ID <--> GO-Slim term for each DB that submits a gene list and a GO-Slim. Genes lacking annotations will get "unexamined"; annotations to "unknown" will be preserved. ** action item 16: Add hyperlinks to the gp2protein files: link from web page and from each gene_association file. Content Issues How to coordinate work of several curators, geographically dispersed and having backgrounds in different areas of biology, and maintain the consensus-building approach that has worked so well for us? We agreed that dividing up work based on areas of interest/expertise is a good way to go. To facilitate it, we'll need to keep track of who's working on what. We'll set up "interest groups" for any areas within the ontologies that are likely to require extensive additions or revisions, or to have proposed changes crop up frequently. Curators can join or leave groups as they please. Proposed changes relevant to an interest group should be handled (or at least seen) by that group. Can we come up with a way to tell whether a given area within an ontology has been extensively reviewed? There was an unfortunate incident recently where a change was made to a bit of the newly revamped 'development' portion of the process ontology. We'd like to avoid this sort of fumble in the future, but it's impossible to tell just by looking at the ontology which bits have been reviewed thoroughly and which parts still look much as they did two or three years ago. There's a lot of information socked away in CVS log files, the email archive, and meeting notes, but it would be much more convenient for curators if the excavation of ontology content history could be streamlined. 
In the long run, it should be possible to flag terms as "reviewed" in the database, but there's no simple solution for the flat files. We'll just have to keep records as well as or better than in the past, and spend time and effort to keep each other informed. It's not hopeless, though; there are a couple of things we can do to facilitate communication and record-keeping. To help with record-keeping, all ontology content changes will be put in the SourceForge curator request tracker from now on. Jane and Midori can add any GO curator to the list of possible assignees; every member database whose curators have GO CVS write access should have at least one curator on the SourceForge list. Note that putting things in the SourceForge tracker is not mutually exclusive with sending messages to the GO list. Any item that obviously is, or might be, involved or controversial should still go to the list. Err on the side of sending more things to the list if it's not clear. To keep everyone informed, we'll run a script that extracts the summary lines from new SourceForge entries and emails the resulting list to the GO mailing list. (In theory anyone can join the mailing list specifically for the SourceForge tracker, but few will want to, because the volume of email is huge and most of it is administrative dross.) Anyone can then follow the discussion of any item that looks interesting (try the SourceForge "monitor" option -- it's cool!), and anyone can choose to take the discussion onto the GO mailing list. We decided that there is no need for a "GO curators" mailing list: "interest groups" are likely to change over time, and anything relevant to more than the interest group should go to the main GO mailing list anyway. ** action item 17: Set up "interest groups" based on subject matter; maintain a list of groups and who's in them (on SourceForge if possible -- look into this).
** action item 18: All content changes, no matter how small, should go into the SourceForge tracker for archiving purposes. Summary entries should be nice and informative. ** action item 19: Set up script to email summaries from new (open) SourceForge tracker entries. On the specter of excessive granularity (a long involved discussion indeed): We reaffirmed that gene products should not appear as concepts (i.e. as ontology terms). But under some circumstances it is acceptable to mention gene products within ontology terms. The issue to be resolved is how fine-grained we should be in children of "protein biosynthesis," "protein binding," and some others. Many of the children of "protein binding" and of "protein biosynthesis" mention specific individual proteins; see the MGI handout for a list of terms that have come into question. There is an additional concern with protein biosynthesis terms: many of the too-specific ones added recently are actually intended to capture the results of experiments that measure levels of specific proteins, but do not distinguish effects on translation (the restricted definition of "protein biosynthesis," which is what we use in GO, and have implicitly decided to keep using) from effects on other steps in the overall process of making a protein (e.g. transcription, modification). We thought that adding terms for binding to (or biosynthesis of) any specific protein was reasonably consistent with the logic we apply when considering new terms, but we questioned the utility of having many many very specific terms. We agreed that we would keep or add terms that represent different mechanisms, such as "covalent protein binding" and "non-covalent protein binding" (hypothetical examples) or "viral protein biosynthesis." Michael came up with a two-part test; we can keep/add a "protein X biosynthesis" term if both criteria are met: 1. There is something specific about the biosynthesis of protein X, i.e. 
there are gene products involved in X biosynthesis but not general protein biosynthesis. 2. The proposed term is not redundant with any other process term. For example, we will make "glycoprotein biosynthesis" obsolete because it is redundant with "protein glycosylation." The same test can be applied to binding, transport, etc. But how to avoid losing information? Curators often want to capture what is known, as when an experiment detects binding to a particular protein substrate or altered levels of a specific gene product. The coffee break "Round Table" discussion led to a proposal: eventually make children of "protein binding" obsolete, and instead use annotation to indicate which protein is bound by the gene product of interest. The annotation would use the generic "protein binding" GO term, and a new column in the gene_association file where we can store an ID for the protein that is bound. Inevitably, though, there's a catch: the world is not yet ready for us to implement this in all situations. If the gene product being annotated binds a class of proteins -- the example was actin -- rather than a single protein, we're SOL for the present. In time there will be UniProt IDs representing protein families, but that could take months or even a year or two. There was some discussion of what to do in the meantime; the conclusion was to apply a couple more tests to identify terms that we should keep for now but make obsolete later. First, check over annotations that use the term; second, check whether the term has any children. Annotations will help us figure out whether the term meets the first criterion of the two-part test. A term that has children is most likely a useful grouping term. The same considerations, and possible future solution, apply to "protein X biosynthesis." 
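The two-part test and the interim checks described above can be read as a small decision procedure. The sketch below is one possible reading, not something the Consortium implemented: the record fields, the function name, and the three outcome labels are all hypothetical, and the order in which the checks are applied is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TermReview:
    """What a curator has established about one term (hypothetical record)."""
    name: str
    # Criterion 1: some gene products act in this process but not the general one.
    has_specific_gene_products: bool
    # Criterion 2 fails when an existing term already covers the same process.
    redundant_with: Optional[str] = None
    # Interim checks for terms that name a protein family or class (e.g. actin).
    names_protein_family: bool = False
    has_children: bool = False


def review(term: TermReview) -> str:
    """Return a suggested outcome; the labels are illustrative only."""
    if term.redundant_with is not None:
        # Fails criterion 2, like "glycoprotein biosynthesis".
        return "candidate for obsolescence"
    if term.has_specific_gene_products:
        # Both criteria of the two-part test are met.
        return "keep"
    if term.names_protein_family and term.has_children:
        # Assumed reading: a family-level grouping term stays until
        # annotation can point at protein family IDs instead.
        return "keep for now"
    return "candidate for obsolescence"


glyco = TermReview("glycoprotein biosynthesis", True,
                   redundant_with="protein glycosylation")
viral = TermReview("viral protein biosynthesis", True)
actin = TermReview("actin binding", False,
                   names_protein_family=True, has_children=True)
```

In practice the first criterion would be judged by checking over existing annotations, as noted above; the boolean here simply records the curator's conclusion.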
To address the issue of experiments that detect changes in levels of a particular protein, we have decided to consider adding terms for "gene expression" and regulation of same, but further discussion is required before we add them (I suspect that counter-arguments will be raised). If they are added, the new gene_association column could be used with them in the same way as proposed for protein binding. ** action item 20: Test all "protein biosynthesis" and "protein binding" terms. Apply the two-part test to all, and (for protein family or class ones) look at annotations and child terms. Circulate the list slated for obsolescence. Note: we are not going to make all "protein binding" terms obsolete yet. It would be good to determine which terms would pass the tests, though. ** action item 21: Circulate a proposal for incorporating "gene expression" and "regulation of gene expression" terms and definitions. ** action item 22: Discuss this again at the next meeting! "Cellular process" to distinguish from multicellular processes was generally well received. Examples where the distinction would be useful are cellular morphogenesis vs. organ or body morphogenesis, cellular respiration vs. breathing, etc. It will take some work to define "cellular process." ** action item 23: Propose definition for "cellular process" and discuss on mailing list. ** action item 24: Each model organism DB should review terms under "embryogenesis" and "morphogenesis" to check for correct parentage; also figure out which ones will go under "cellular process." "Cell surface" and related terms: these were added recently by TAIR curators, to capture information from experiments in plants that can narrow down localization to plasma membrane or cell wall but can't distinguish between the two (that's what's meant by "cell surface" in plant literature). The definitions and placement of the cell surface terms were discussed, and changes recommended. 
We also discussed other cellular component terms in the area of external or surface structures such as cell walls. The fairly generic term "external protective structure" will be changed because "protective" sounds too much like a process; we came up with "encapsulating." The revised term, "external encapsulating structure," will become a child of extracellular. The definition should mention that the structure lies outside the plasma membrane and surrounds the entire cell. We should also review the cell wall terms to make sure they're placed correctly -- apparently the plant cell wall term should be under extracellular. One thing that came up is that there are no cellular component terms that really reflect boundaries (as opposed to physical parts) such as that between inside and outside the cell. It will be interesting to look into boundary terms, considering how they might be defined and where they might fit relative to existing terms. ** action item 25: TAIR curators to improve definitions of "cell surface" and its children. ** action item 26: Change wording of GO:0030312 to "external encapsulating structure." Circulate new definition; make sure Michelle Gwinn has a chance to comment. ** action item 27: Review all "cell wall" terms to check parentage. Plant cell wall does need to be moved. ** action item 28: Start thinking about terms (and definitions, of course) to capture concept of boundary. Transport terms: Dianna Fisk (SGD) is collaborating with Can Tran, who works on TC. Function terms will thereby be kept consistent with what's in TC. Most transport process terms should be OK, but as always any problems should be noted and sent to the list. Transport terms that mention specific proteins should be put to the same test as binding and biosynthesis terms (see above), although we expect that the results will prompt us to keep more of the transport terms. 
Susceptibility/resistance: We decided to make all terms that say "X susceptibility/resistance" obsolete because they really represent traits. The biological processes that we were trying to represent can all be covered by "response to X" terms (many of which already exist; others can be added). IDs: should we encode F/P/C in the GOID? Although some users have asked for this (for convenience), the overwhelming consensus was that we will not add anything to current GOIDs to show whether the term is molecular function, biological process, or cellular component. We will eventually be in a position to build links between what are now the three separate ontologies, so it's better to use a single ID space for them. Annotation Issues We receive frequent requests for GO terms/IDs to be associated with UniGene IDs. One way it can be done is via a UniGene <--> LocusLink file available from NCBI. ** action item 29: Create UniGene <--> GO file (Daniel) Issue raised by TIGR (Linda Hannick): how to represent annotations made using multiple BLAST hits or similarity to a domain or family (rather than similarity to one other gene product) The problem: they feel that they're losing information about the annotation/curation procedure by putting only one accession number in the "with" column. For many of these comparisons, several sequences have to be included, and the similarities among them taken together, to get a believable conclusion about the annotations for the gene product of interest. Furthermore, many of these curated sequence sets are not yet published. Discussion centered mainly on whether the situation was best covered by using ISS or IC as the evidence code. The eventual decision was to continue to use ISS. 
Some key points that came up in the discussion (documented for posterity): - The argument in favor of IC was that considerable curator judgment is involved in making the determinations, which makes the procedure different from simply running BLAST and looking at the best hit. There was concern about "polluting" ISS by including cases where similarity is to a family rather than to a single gene product. - The counter-argument was two-fold. One point is that multiple sequence alignments are nevertheless still analyzing and comparing protein (or nucleic acid) sequences, and most curators have been mentally including these analyses under "ISS" all along, viewing them as consistent with the currently defined scope of ISS. - The second point was that IC is used in a well-defined set of circumstances, for a well-defined purpose. It would "pollute," or at least confuse, the scope of IC to use it for annotations that are based on sequence similarity; also, one could follow similar logic to broaden IC to include all curator evaluation of experimental results. We decided not to relax the current definition and scope of IC. Conclusion: allow >1 entry in "with" column for ISS Curators then enter any accession numbers available, and include an ID that allows a link to a page describing the entire set of sequences used. ** action item 30: add to documentation of "with" column use -- allow cardinality 0, 1, >1 for all evidence codes that use "with" at all; explain situations where cardinality 0 is allowed ** action item 31: annotations that use ISS, IPI, or IGI but have a blank "with" column should link to the annotation documentation (let people see the possible reasons why nothing's entered) Pseudogenes and other "doubtful" genes: If a gene is known to encode an RNA or protein product, there's no doubt that the product(s) can be annotated with GO terms (or the gene can be annotated in lieu of direct gene product annotation if necessary). 
Genes that look as though they encode a product (e.g. open reading frames with no stops) but haven't been individually studied tend to be annotated. If something is unmistakably a pseudogene -- lots of frameshifts, etc -- it's not annotated. But what about other cases that fall between the "obviously OK to annotate" and "obviously pseudogene" ends of the spectrum? From Michelle Gwinn: We have a class of genes which according to our sequence data have either a single frameshift or a single stop codon in their coding sequence. However, they also have screaming good hits to other characterized proteins and to HMMs that span the problem in the ORF. We reflect the presence of the defect with an addition to the common names of the proteins. The concern is that a single frameshift or stop may be read through, or could even reflect a sequencing error. To avoid losing information, we've decided that the best way to handle these cases is to use SO annotation to document the frameshift/stop/whatever anomaly, and GO annotations to capture what the product is thought to do if it is indeed expressed. Shared annotations: For some organisms, gene products are annotated by more than one group (e.g. MGI and SWISS-PROT do mouse; TIGR and TAIR do Arabidopsis). We must avoid circular annotations, especially those based on sequence similarity (ISS). Most (all?) of the groups that inherit annotations from another source tag them in the gene_association file some way. For example, MGI has a special reference used for annotations inherited from SWISS-PROT. This was regarded as a good way to handle shared annotations; any group that doesn't do something of the sort already should adopt the practice. ** action item 32: Each group that shares annotations should tag the ones that come from the other group(s). ** action item 33: Document this decision, and how to implement it. 
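As an illustration of the tagging practice for shared annotations, a group could separate its own annotations from inherited ones by the reference they carry. This is only a sketch under stated assumptions: the dictionary layout, the function name, and the reference ID "EXAMPLE_REF:inherited" are hypothetical and do not reflect any group's actual gene_association conventions.

```python
from typing import Dict, Iterable, List, Set, Tuple

Annotation = Dict[str, str]


def split_by_origin(
    annotations: Iterable[Annotation],
    inherited_refs: Set[str],
) -> Tuple[List[Annotation], List[Annotation]]:
    """Separate a group's own annotations from those tagged as inherited.

    An annotation counts as inherited when its reference is one of the
    special references the group reserves for that purpose.
    """
    own: List[Annotation] = []
    inherited: List[Annotation] = []
    for annotation in annotations:
        if annotation["reference"] in inherited_refs:
            inherited.append(annotation)
        else:
            own.append(annotation)
    return own, inherited


# Hypothetical rows; the IDs and references are placeholders.
rows = [
    {"gene": "GENE:1", "go_id": "GO:0030312", "reference": "PMID:0000000"},
    {"gene": "GENE:2", "go_id": "GO:0030312", "reference": "EXAMPLE_REF:inherited"},
]
own, inherited = split_by_origin(rows, {"EXAMPLE_REF:inherited"})
```

Only the `own` list would then be passed on to the other group, so that an inherited annotation is not handed back to its originator.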
Documentation Issues Monthly logs: Amelia has been working on a script to detect differences between one version of GO (ontologies + definitions) and another; she showed sample output that was very well received. There is still a bit of work to do to get it to prime-time quality, but it is in very good shape. We will run the script every month, when the flat files are archived and database releases made. In addition to running it regularly, we'll include it in the software repository on SourceForge, so that anyone can run it to compare any two versions of GO. ** action item 34: Amelia will continue polishing The Script. When it's ready for prime time, it will go in the software repository, and will be run every month to generate a log to accompany the flat file archives and database releases. Decide where to put the output. FAQ: Chris will help Cath and Rama set up a FAQ-o-matic page; thereafter, anyone can enter questions and answers. Cath and Rama will do a bunch to get things started and make sure the FAQ covers questions that we already know crop up frequently. ** action item 35: set up new faq-o-matic page (Cath & Rama, with a bit of help from Chris); everyone to add faq's and answers, though Cath & Rama will probably do the most, at least at first. ** action item 36: EBI GO curators circulate a set of instructions for using CVS. Other Items of Interest GO <-> UMLS: Jane will visit NLM for about a month starting Sept. 15. She will learn all about UMLS, and help them incorporate GO into the "Metathesaurus." That is, GO will become one of the ontologies indexed in the metathesaurus. Jane and some NLM people have already done a test integration. MeSH terms will be reviewed and new ones added in light of indexing GO in UMLS. Jane will report on this work at the next meeting. Funding: For the NIH grant, there's a progress report due soon (December 1?). Judy will coordinate, and email anyone who should contribute material.
We will apply for five years when we renew; the renewal is due March 1, 2003. Judy will also coordinate this. There will be four aims: 1. Develop and support ontologies for molecular biology. 2. Annotation using ontologies for informatics systems of consortium members; this will include support for meetings. 3. Provide informatics resource; covers database instantiations, data repository and means of access, and software tools. 4. Outreach: support for ways to provide training for new groups starting to use GO, perhaps by having them visit a "GO site". A "visiting scientist" sort of thing could also be a good way for GO curators to take advantage of domain experts' knowledge. Meeting support might also fall under this aim. Aims 1, 2, and 3 are essentially the same as in the original grant, with the scope of Aim 1 expanded a bit. Aim 4 is modified from the original aim to have other database groups join the consortium. We would also like to support an effort to annotate bacterial genomes (i.e. those not already done or in the works at TIGR) using GO. E. coli and B. subtilis are the most obvious ones; genomes sequenced at Sanger would also be good. ** action item 37: Progress report for current grant. ** action item 38: Prepare renewal grant application. GOBO: Covered in Michael's talk at the Users meeting. We have a supplement to the NIH grant to fund work on SO; Suzi will hire two people, one more biology-oriented, the other more techy, for a year. SOFG: conference coming up in November. Web pages: We'll keep the current appearance for the time being, but that shouldn't stop us from improving the organization. The home page can be split into a few shorter pages, based on the work Amelia did earlier. ** action item 39: Prepare a site with mock-ups of GO web pages derived by splitting up the current home page sensibly. Next Meetings: The next Consortium meeting will be January 25-26, 2003 in St. Croix. Plan to arrive on Jan. 24 and leave on Jan. 27.
John will make a group reservation; when we get the email about it, we must act promptly because rooms will go fast. There won't be a Users meeting. After that, the next meeting will be hosted by TIGR in June 2003, with a Users meeting. Linda will check on available dates; our first choice is June 2-4 (users on Monday June 2, consortium Tues-Wed June 3-4). Alternate dates are June 18-20. Appendix 1: Collected Action Items (numbered in the order they appear in the main document) 1. FB to use PubMed IDs instead of [or in addition to?] FBrf IDs. 2. TIGR to provide protein id --> GO ID. 3. TIGR to send IEA annotations to GO for genomes not sequenced at TIGR. 4. Cath will update documentation and circulate drafts. 5. Evelyn to continue tracking down info on QuickGO concurrent assignments. 6. Consortium, especially Chris M, to revisit concurrent annotations in GO database. 7. Add check for term deletion to flat file helper. 8. Sue will ask Danny to take over DAG-Edit maintenance. 9. Amelia will collect bug reports and feature requests for DAG-Edit from curators. If John can't act on feature suggestions, perhaps Danny can. 10. Change prefixes to "GOC:" for definition references that represent an individual curator or group of curators. 11. Brad will create a form where curators can enter info (e.g. name, affiliation, dbxref entered in definition reference field), and create and link a web page for each GOC:xyz entry. 12. Chris to get comments into the database. 13. Add a link to the GO-Slim directory to the home page. 14. DBs to send GO-Slims and lists of all genes to BDGP. 15. BDGP to generate tables of gene ID <--> GO-Slim term for each DB that submits a gene list and a GO-Slim. Genes lacking annotations will get "unexamined"; annotations to "unknown" will be preserved. 16. Add hyperlinks to the gp2protein files: link from web page and from each gene_association file. 17.
Set up "interest groups" based on subject matter; maintain a list of groups and who's in them (on SourceForge if possible -- look into this). 18. All content changes, no matter how small, should go into the SourceForge tracker for archiving purposes. Summary entries should be nice and informative. 19. Set up script to email summaries from new (open) SourceForge tracker entries. 20. Test all "protein biosynthesis" and "protein binding" terms. Apply the two-part test to all, and (for protein family or class ones) look at annotations and child terms. Circulate the list slated for obsolescence. Note: we are not going to make all "protein binding" terms obsolete yet. It would be good to determine which terms would pass the tests, though. 21. Circulate a proposal for incorporating "gene expression" and "regulation of gene expression" terms and definitions. 22. Discuss this [protein binding etc.] again at the next meeting! 23. Propose definition for "cellular process" and discuss on mailing list. 24. Each model organism DB should review terms under "embryogenesis" and "morphogenesis" to check for correct parentage; also figure out which ones will go under "cellular process." 25. TAIR curators to improve definitions of "cell surface" and its children. 26. Change wording of GO:0030312 to "external encapsulating structure." Circulate new definition; make sure Michelle Gwinn has a chance to comment. 27. Review all "cell wall" terms to check parentage. Plant cell wall does need to be moved. 28. Start thinking about terms (and definitions, of course) to capture concept of boundary. 29. Create UniGene <--> GO file (Daniel) 30. Add to documentation of "with" column use -- allow cardinality 0, 1, >1 for all evidence codes that use "with" at all; explain situations where cardinality 0 is allowed. 31. Annotations that use ISS, IPI, or IGI but have a blank "with" column should link to the annotation documentation (let people see the possible reasons why nothing's entered). 32. 
Each group that shares annotations should tag the ones that come from the other group(s). 33. Document this decision [shared annotation], and how to implement it. 34. Amelia will continue polishing The Script. When it's ready for prime time, it will go in the software repository, and will be run every month to generate a log to accompany the flat file archives and database releases. Decide where to put the output. 35. set up new faq-o-matic page (Cath & Rama, with a bit of help from Chris); everyone to add faq's and answers, though Cath & Rama will probably do the most, at least at first. 36. EBI GO curators circulate a set of instructions for using CVS. 37. Progress report for current grant. 38. Prepare renewal grant application. 39. Prepare a site with mock-ups of GO web pages derived by splitting up the current home page sensibly. ======================================================================= Appendix 2: Linda Hannick's notes on presentations by Chris Wroe and Bernard de Bono at the GO Consortium meeting, 10 Sept. 2002 Note: the pdf looks better! A. DAML+OIL Chris Wroe / Robert Stevens Experiments in how you can use hand-crafted text --> software-based technology What we can do, not tutorial... Helen Parkinson - What does the technology offer? Process: * Electronically generate rather than add manually. * Pathway; not all or nothing; some benefit part way too... * Simple additions from yesterday Making relationships to additional parents, etc Finds biological content error(s), finding relationships that are problematic Suggests additions, finds the missing relationships that are very hard to find by hand, suggests additional. Inconsistencies reasoned out (e.g., a case of catabolism under biosynthesis.) - What software is available? GONG: what have we done so far? Developing a stepwise methodology Incremental migration path adding semantic content to the GO in situ 1. Syntax transformation to DAML+OIL 2. Reasoning over existing content 3.
Adding partial concept descriptions 4. Adding complete " " 5. Concept composition at the point of use Allow the creation of new ontology terms at the point of use. --> Isa has to be done manually; P is easy less hard work than doing it all by hand. Migration path Definitions/descriptions (carbohydrate metabolism) broken down to a DAML+OIL necessary and sufficient conditions. Complete definition: biosynthesis of an amino acid Natural language pulls out the essentials Natural lang tool what you see is what you meant Metabolism terms easy; enzyme terms very complex to describe Absolutely explicit; lots of restrictions onProperty and has-class restrictions Top-down approach would be easier (dehydrogenase defined before malate dehydrogenase) Scripts were central Used as much automation as possible Many term phrases fit a stereotyped pattern Metabolism for example Hard coded UMLS lexical normalization tools to match up concepts from different ontologies May also help the parsing task * Additional DAML+OIL definitions represent a significant increase in the amount of 'source code' * Introduces large numbers of interdependencies (lots of these were missed in the hand-built ontology). * Michel Klein - conceptual cvs for DL Can't just do a diff on DL; need a cvs. Meeting tomorrow. Will e-mail the group re this.
Software * DL datastructure with API * Editor gui OilEd * Ontology server Case study from forerunners in medicine (SNOMED) * Learn from their mistakes; already avoided mistakes of early medical terminologies * Similar to medicine; large, complex concepts o SNOMEDrt relational terminology o 200K-300K concepts at the present time o results 200K concepts dissected over 2-3 yrs by 9 half-time clinicians (double coverage) o ~20M investment o tried to use scripts and tools; propagation of concepts like we are discussing o major early benefit was a more complete taxonomy for accurate retrieval of records o not open source; there is a gathering force behind going open source (Richter, Chris Schut) (global technological project, GTP). o Formal def of terms useful resource in its own right irrespective of DL reasoning * But different; well specified use-annotation Relatively small group of people who are highly skilled Medical record keeping more for the accountants What additional software is necessary? BioOntologies people using DL? Not extensively 2 diff ways to use as a standard, because it works with ontologies with property-based descriptions Open source? OIL is; Java client pops it in (Robert) License for display is _36000 on Solaris. Problems: Scaling The combinatorial explosion Example Burns How expensive Read II grew from 20K to 250K terms in ~100 staff-years, but still too small to be useful But too big to use... (SNOMED 3.5) Beat the explosion by having ~12 separate taxonomies, the elements of which can be combined to form more complex concepts. Didn't work because the sensible options have to be defined in the user interface. o No grammar rules o Possible to make nonsense terms o Impossible to detect equivalent terms, or classify composition Need a reference terminology in the middle. ------------------------------------------------------------------ B. GOAL (Bernard) It is structure that provides activity. It is activity that changes structure. 
Organism bias is a cumulative effect of structure bias that has infiltrated GO.

Word counts in GO
Split into two sets, A and B.
Set A: "of", "and", etc.
Set B (90% of words used in GO terms): occur 10 times or fewer in the DAG.
>50% of set B occur only once throughout the DAGs.
65% of set B are physical objects in our universe (e.g. glutamate).
The three DAGs are alike: most of the words are structures.

A = change in structure (ΔS) / change in transition time (ΔT)
Is it possible to have any sort of value for ΔS and ΔT?
Having a handle on Structure will have a profound effect on Activity.

Hypothetical structure classification:
Small mol
Gene prod
Complexes
Cells
Anatomy
Map activity on graph of above. Two structures that are more similar will be closer on the graph.
Can impose different organisms on the graph.

GOAL Object Definition
Ar shifts S1 to S2. Ap is new activity.
Navigate the structural graph by activities.
Distance along the tree will be significant.
Ar = s(S1,S2)/t(S1,S2)
r is "required"
p is "provided"
Extend to any level of complexity.
SH housing structure: the first part of the node that S1 and S2 have in common.
A = S1, S2 T( SH)

Profile comparisons of ACT objects
Now can compare Activities much in the way the BLOSUM matrix is used.
Compare whole-genome physiologies.
Show on the same structural graph what you mean by an activity.

=======================================================================
Appendix 3: Progress Reports

A. FlyBase Progress Report, Sept 2002. Cambridge Meeting.

1. GO terms added by continued literature curation of primary papers and personal communications by Cambridge FlyBase curators Rachel, Gillian and Chihiro.
2. Kerry Knight is currently assigning GO terms as part of her clean-up of free text in FB, referencing to primary papers.
3. All outstanding SWISS-PROT records (~1000) that were attached to FlyBase genes have now been analyzed and GO terms added based on the summary comments. GO terms are referenced directly to the SWISS-PROT record.
In addition, Eleanor Whitfield at SWISS-PROT is assigning GO terms to new SWISS-PROT records, and to SWISS-PROT records updated from SpTrEMBL. These are referenced to papers listed in the SWISS-PROT record and are also incorporated into our files. We now periodically check that all relevant SWISS-PROT entries are curated.
4. Becky is currently curating recent reviews, mainly on processes (e.g. oogenesis, embryogenesis, organogenesis, signaling), to increase the number of process GO annotations in FlyBase.
5. Work is ongoing to increase the number of definitions for fly-specific GO terms, especially for embryogenesis terms.
6. We have received a file of predicted GO annotations from the FB/PANTHER collaboration. A paper describing this experiment has just been submitted to Genome Research. The predictions have not been parsed into FB, because this analysis will be redone on the new Release 3 sequence.
7. The next major task will be to re-annotate the Release 3 protein set for GO terms. That should keep us busy for some time.

Rebecca & Michael.
-------------------------------------------------------------------------
B. Gene Ontology Annotation @ EBI

The GOA Project is headed by Rolf Apweiler
GOA Annotation Coordinator: Evelyn Camon (camon@ebi.ac.uk)
GOA Electronic Coordinator: Daniel Barrell (dbarrell@ebi.ac.uk)
URL: http://www.ebi.ac.uk/GOA
Last Updated: 03-SEP-2002

Current Status:
We have made 5 releases of GOA Human and 3 releases of GOA SPTR (GOA-All) on the EBI and GO ftp sites. In SRS these releases are merged into one database called GOA. The recent release of all our GO annotation makes the SWISS-PROT group at EBI a considerable contributor to the GO Consortium annotation effort, providing over 2.1 million GO associations across 507964 SWISS-PROT and TrEMBL entries covering 45407 species. GOA Human releases are in keeping with our Human Proteomics Initiative and the GO Consortium agreement to fast-track functional annotation of the human proteome.
We have not yet integrated GO data from other Consortium groups, due to the lack of manual annotation with PUBMED references in the association files. We are working particularly closely with Mouse Genome Informatics (MGI) and the FlyBase group to resolve these matters. As IPI is now indexing Mouse data, we will next work on releasing GOA Mouse.

Discussions have been initiated with EMBL-Bank on how to transfer GO annotations from GOA into EMBL flat files via its db_xref. It has been decided to add a link from EMBL-Bank flat files directly to QuickGO, e.g. db_xref="GOA:P22301". It is hoped that this will be achieved by the next EMBL release, which will be made public in a few weeks' time.

EBI maintains SWISS-PROT keyword to GO and InterPro to GO mappings. These are updated on a regular basis and shared with the GO Consortium, where they have been used to enhance the Consortium's data sets as well as those of external GO users (microarray/mass spec). We are also working closely with PIR to help with their keyword mappings.

A GOA paper has been submitted to Genome Research. The GOA project is ahead of schedule on all its grant deliverables.

HOW IS GO ANNOTATED IN SWISS-PROT/TrEMBL/InterPro?
GOA is produced by electronic and manual efforts.
The large-scale assignment of GO terms to SWISS-PROT and TrEMBL entries involves electronic techniques. This strategy exploits existing properties within the entries, including the presence of keywords and Enzyme Commission (EC) numbers as well as cross-references to InterPro entries, which are manually mapped to GO. Electronically combining these mappings with a table of matching SWISS-PROT and TrEMBL entries generates a table of associations. SWISS-PROT keyword and InterPro to GO mappings are maintained in-house and shared on the GO home page for local database updates.
Manual assignment of GO terms by SWISS-PROT curators uses published literature and provides more reliable GO annotation.
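
The electronic step described above amounts to a join between two tables: a mapping from entry properties (keywords, EC numbers, InterPro cross-references) to GO terms, and a table recording which entries carry which properties. The following is a minimal Python sketch of that join, not GOA's actual code. The function name `electronic_associations`, the record layout, the property prefixes, the accessions Q00001 and Q00002, and every property-to-GO pairing shown are assumptions made up for illustration.

```python
def electronic_associations(entry_properties, property_to_go):
    """Join entries to GO terms through their properties.

    Every association produced this way is tagged with the IEA evidence
    code, and the property that triggered it is kept for traceability.
    """
    associations = set()
    for accession, properties in entry_properties.items():
        for prop in properties:
            # Properties without a manual mapping yield no association.
            for go_id in property_to_go.get(prop, ()):
                associations.add((accession, go_id, "IEA", prop))
    return sorted(associations)


# Hypothetical mapping table: these pairings are NOT real mappings.
property_to_go = {
    "InterPro:IPR000402": ["GO:0005515"],
    "EC:1.1.1.37": ["GO:0006412"],
}

# Hypothetical table of entries and the properties they carry.
entry_properties = {
    "P22301": ["InterPro:IPR000402"],
    "Q00001": ["EC:1.1.1.37", "InterPro:IPR000402"],
    "Q00002": ["SP_KW:unmapped keyword"],
}

for row in electronic_associations(entry_properties, property_to_go):
    print("\t".join(row))
```

Keeping the triggering property alongside each association is a design choice of this sketch: it makes it straightforward to find and withdraw an electronic association when a mapping changes or when a curated annotation supersedes it.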
On each release of GOA, annotation with electronic evidence codes (IEA: 'inferred from electronic annotation') will be replaced with associations using codes that imply more experimental evidence. RETRIEVING DATA FROM GOA There are various ways of accessing and searching GOA project data, including several web-based browsers. The GOA files can also be downloaded. Resources & Descriptions Web-based tools QuickGO A fast web-based browser with access to core GO data and up-to-date electronic and manual EBI GO annotations. URL: http://www.ebi.ac.uk/ego/index.html SRS Search the GOA database or a mirror of the GO consortium repository (GO). URL: http://srs.ebi.ac.uk/ Proteome Analysis Pages GO annotations have been produced for classification of proteins belonging to each complete proteome. On the Proteome Analysis Pages a slimmed down version of GO (GO-slim), representing high-level GO terms, is displayed as a proteome overview. URL example: http://www.ebi.ac.uk/proteome/HUMAN/go/go.html EBI's GO-slim see: http://www.ebi.ac.uk/proteome/goslim_terms.html InterPro GO annotations made by InterPro are visible directly in InterPro entries. URL example: http://www.ebi.ac.uk/interpro/Ientry?ac=IPR000402 AmiGO GO Consortium browser with access to core GO data and released GOA data. URL: http://www.godatabase.org/docs/docs.html Downloads GOA 'Association File' This is a tab-delimited file of associations between gene products and GO terms and is the most common form of data transfer within the GO Consortium. For more information on our format read the GOA README file (http://www.ebi.ac.uk/proteome/goa/goaHelp.html) Two separate GOA association files are currently produced. 
Human GOA file access (contains GO annotations for all proteins in the nonredundant human proteome set): ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz http://www.geneontology.org/gene-associations/gene_association.goa_human SPTR GOA file access (contains GO annotations for all proteins in SWISS-PROT and TrEMBL): ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/SPTR/gene_association.goa_sptr.gz http://www.geneontology.org/gene-associations/gene_association.goa_sptr GOA Xref File For each GOA release we also distribute a file of cross references that displays the relationship between the entries in the GOA data set with other databases, such as EMBL/Genbank/DDBJ nucleotide sequence databases, HUGO and LocusLink and Refseq. GOA xref file: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ STATISTICS Statistics for GOA-Human and GOA-SPTR association files are available from the GOA homepage. (http://www.ebi.ac.uk/GOA) GRANT SUPPORT: GOA is supported by Grants QRLT-2001-00015 and QLRI-2000-00981 of the European Commission and a supplementary NIH grant, 1R01HGO2273-01. CONTACTING GOA: Post: EMBL-European Bioinformatics Institute Wellcome Trust Genome Campus Cambridge CB10 1SD UK Phone: +44 (0) 1223 494444 Fax: +44 (0) 1223 494468 E-mail:goa@ebi.ac.uk CREDITS: Daniel Barrell - GOA File updates David Binns - QuickGO Wolfgang Fleischmann - Automation Coordinator John Maslen - Talisman Paul Kersey - Xref file & data set generation Michele Magrane & all curators - GO Annotation Nicola Mulder, Alex Kanapin & Annotators - InterPro Rodrigo Lopez, Nicola Harte - SRS Midori Harris, Jane Lomax, Amelia Ireland, Cath Brooksbank - GO Curators Rolf Apweiler -SWISS-PROT Coordinator Peter Stoehr - Head of Database Operations (EMBL-Bank Issues) GOA Consortium Report (03-SEP-2002) SWISS-PROT/TrEMBL/InterPro -------------------------------------------------------------------------- C. MGI Gene Ontology Progress Report Sept. 
2002

General:
We continue to focus on our goal of having annotation for all genes in the database. Our efforts have focused on three areas:
1. Adding annotation to genes currently without any annotation
2. Replacing annotations that were "fished" from text records with literature-based annotation
3. Annotating genes having no GO annotation but having rat orthology

We have constructed a dataset that might be used as a "gold standard" to judge the efficiency of various annotation algorithms. This dataset comprises genes that have been hand annotated with evidence codes derived from experimental evidence (IDA, IPI, IMP, IGI). A second dataset derived from this series has only those genes that have had the same GO ID applied more than once by any combination of these.

MGI GO STATS as of August 27, 2002. [table converted to plain text]

Annotation Type                     30-Apr-02   27-Aug-02    Change   % Change
Total Genes annotated:[1]                7600        8576       976         13
Total Hand Annotation # of Genes         2125        2646       521         25
Orthology:                                 19          24         5         26
"IEA"
SwissProt to GO                          4852        6123      1271         26
Interpro to GO                           3376        3529       153          5
EC to GO                                  662         658        -4       -0.6
MLC Scan                                   40          40         0          0
GO Fish                                  2337        2228   -109[2]         -5

[1] Number of genes with at least ONE GO term of any kind.
[2] This figure has decreased due to our ongoing efforts to replace these with literature-based annotation.

Beyond GO
The phenotype ontology continues to be developed with the aid of the DAG-Editor[3], which has facilitated term merging and increasing the complexity of the DAG structure.
[3] Cynthia Smith, Cathleen Lutz, Carroll Goldsmith, Teresa Chu, and Alan P. Davis

Too many unnecessary GO terms: On the issues of excess granularity
[note: some color coding and underlining lost in conversion to plain text]

The GO was originally set up as a vocabulary to describe the molecular function, process, and cellular location of a gene product that could be used across model organism databases.
However, recently, the GO appears to be growing in areas that appear to reflect a cross over between product name and function and process. There are three example areas: 1. Protein Binding 2. Protein Biosynthesis 3. Immune Response : interleukin X biosynthesis..... 1. The function term ":Protein Binding" coupled with the "with" statement is intended to describe the interaction of a gene product with another protein. The creation of dozens of children that specifically refer to a single gene product in a single type of organism (mammal), as in the cases of interleukin-X binding, where X is a specific molecule, unnecessarily increase the granularity of the GO in a species specific manner. 2 and 3 . Protein Biosynthesis was originally meant to describe the processes involved in the formation of a peptide bond, either on the ribosome or not. The creation of specific terms for single instances of proteins is unnecessary. If the term is NOT meant to describe processes involved in peptide bond formation, it should not be a child of this term. The use of the term "XYZ protein biosynthesis" to be used for a description of any unknown process or combination of processes involved in altering the level of a particular gene product is ambiguous. If there is not evidence to pinpoint transcription, RNA processing, translation, post-translational processing, or RNA and/or protein degradation as the process or processes that are involved in the gene product to be annotated, then perhaps no annotation should be applied. If we proceed down this path, then XYZ biosynthesis will need to have specific children, XYZ biosynthesis, transcription, etc. 1. 
The first issue begins in protein biosynthesis, where we currently have: protein biosynthesis [GO:0006412]) amino acid activation + charged-tRNA modification + **glycoprotein biosynthesis+ CD4 biosynthesis + FasL biosynthesis + protein amino acid glycosylation + *integrin biosynthesis + **lipoprotein biosynthesis + **mannoprotein biosynthesis + *MHC class I biosynthesis + *MHC class II biosynthesis + *neurotransmitter receptor biosynthesis non-ribosomal peptide biosynthesis regulation of protein biosynthesis + regulation of translation + *TRAIL receptor biosynthesis + translational elongation + translational initiation + translational termination + viral protein biosynthesis *What we do not need is a separate term for each protein. As I understood from discussions on the GO-list, these terms were intended to encompass everything that goes into making the protein, from transcription, translation, and perhaps even degradation. They are intended to capture experiments that use protein /gene product A to influence the (levels) of protein/gene product B. There may be 100 steps between the two. This is making the GO terms experiment driven rather than the other way around. Such experiments are just NOT useful as evidence for any GO terms. They suggest experiments to be done. **The second issue regarding protein biosynthesis is that adding lipids and carbohydrates to proteins is a post-translational modification and does not belong under protein biosynthesis. The term "protein biosynthesis" should be restricted to processes that form a peptide bond, either on the ribosome (mostly) or not (antibiotics). 2. 
A second area of is the growth of a separate term for each protein binding: protein binding [GO:0005515] alpha-catenin binding ARF binding beta-amyloid binding beta-catenin binding cadherin binding calmodulin binding + clathrin binding collagen binding cyclin binding cytokine binding + chemokine binding + granulocyte macrophage colony-stimulating factor complex binding interferon binding + interleukin binding + interleukin receptor + interleukin-1 binding + interleukin-10 binding + interleukin-11 binding + interleukin-12 binding + interleukin-13 binding + interleukin-14 binding + interleukin-15 binding + interleukin-16 binding + interleukin-17 binding + interleukin-18 binding + interleukin-19 binding + interleukin-2 binding + interleukin-20 binding + interleukin-21 binding + interleukin-22 binding + interleukin-23 binding + interleukin-24 binding + interleukin-25 binding + interleukin-26 binding + interleukin-27 binding + interleukin-3 binding + interleukin-4 binding + interleukin-5 binding + interleukin-6 binding + interleukin-7 binding + interleukin-8 binding + interleukin-9 binding + cytoskeletal protein binding + DNA topoisomerase I binding dynein binding + enzyme binding + eukaryotic initiation factor 4E binding gamma-catenin binding growth factor binding + hemoglobin binding histone binding HSP70 protein binding + immunoglobulin binding + importin-alpha export receptor intermediate filament binding ISG15 carrier KU70 binding lamin binding lipoprotein binding + metarhodopsin binding neurexin binding nuclear localization sequence binding peroxisome targeting sequence binding + poly-glutamine tract binding polypeptide hormone binding + profilin binding protein amino acid binding + protein C-terminus binding protein carrier protein domain specific binding + protein signal sequence binding RAN protein binding Rho binding + RPTP-like protein binding SNARE binding + snoRNP binding syndecan binding TATA-binding protein binding TRAIL binding transcription factor 
binding +
Wnt-protein binding

This loses the utility of the Protein Binding and With fields. Are we going to have a separate term for every single pair of proteins? The chemokine and interferon terms could conceivably be expanded in a like manner. This is not needed. The primary term plus the "with" field is sufficient. Algorithms could be written so that, if the pairs are annotated properly, one could search the "with" field to come back with all binding partners.

3. A third area is related to the "biosynthesis" issue again. Why are separate terms for the biosynthesis of each interleukin needed?

immune response
 cytokine metabolism
  cytokine biosynthesis
   chemokine biosynthesis +
   connective tissue growth factor biosynthesis +
   granulocyte macrophage colony-stimulating factor biosynthesis +
   interferon type I biosynthesis +
   interferon-gamma biosynthesis +
   interleukin-1 biosynthesis [GO:0042222]
    regulation of interleukin-1 biosynthesis +
   interleukin-10 biosynthesis +
   interleukin-11 biosynthesis +
   interleukin-12 biosynthesis +
   interleukin-13 biosynthesis +
   interleukin-14 biosynthesis +
   interleukin-15 biosynthesis +
   interleukin-16 biosynthesis +
   interleukin-17 biosynthesis +
   interleukin-18 biosynthesis +
   interleukin-19 biosynthesis +
   interleukin-2 biosynthesis +
   interleukin-20 biosynthesis +
   interleukin-21 biosynthesis +
   interleukin-22 biosynthesis +
   interleukin-23 biosynthesis +
   interleukin-24 biosynthesis +
   interleukin-25 biosynthesis +
   interleukin-26 biosynthesis +
   interleukin-27 biosynthesis +
   interleukin-3 biosynthesis +
   interleukin-4 biosynthesis +
   interleukin-5 biosynthesis +
   interleukin-6 biosynthesis +
   interleukin-7 biosynthesis +
   interleukin-8 biosynthesis +
   interleukin-9 biosynthesis +
   regulation of cytokine biosynthesis +
   TRAIL biosynthesis +

All of these could be easily described using GO terms for translation, protein processing, etc. Again, we do not need a term for each specific protein product.
These too appear driven by the desire to want to use an experiment to create a GO term. We need to decide how granular the GO needs to be. Prepared by H. Drabkin 10/2/02 ------------------------------------------------------------------------- D. GO Report from SGD Outline - SGD Goals for GO Annotations - Definitions for GO Terms within SGD - Annotations - GO Tutorial - GO Tools - Pathway Tools SGD Goals for GO Annotations Definitions for GO terms within SGD SGD is making a big push to write definitions for all the terms that have been used to annotate SGD genes. There are about 68 component terms, 268 function terms and 287 process terms that need definitions. Each curator writes 2 definitions per month and also if the curator needs to annotate to a term that doesn't have a definition, he/she will write the definition before making the annotation. We are making good progress towards this goal. Annotations Our goals for the near future are: - Have at least one annotation for all the named genes. Out of 4297 named ORF's we do not have any annotation for only 367 loci. - Fill in annotations for genes that have partial annotations. - Polish all the annotations (work on the IEAs and the 'unknown' annotations). GO Tutorial SGD has created a tutorial to familiarize users with the Gene Ontology (GO) and how it is used at SGD. The tutorial gives an overview of GO and highlights pages and tools at SGD that use GO annotations with some cool mouseovers. In addition, the tutorial provides links to other sites that may help users take advantage of the power of GO. GO Tutorial: http://genome-www.stanford.edu/Saccharomyces/help/gotutorial.html GO Tools SGD has developed 2 tools to mine GO data. They are the GO Term Mapper and the GO Term Finder tools. The GO Term Mapper or the GO slim tool maps the granular GO terms used to annotate a list of genes to their more general parent terms (ie. GO Slim terms) from all three ontologies. 
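
The mapping that the GO Term Mapper performs can be pictured as a walk up the DAG from each granular term, keeping whichever ancestors belong to the slim set. The following Python sketch illustrates that idea only; it is not SGD's implementation. The function names `ancestors` and `map_to_slim`, the gene names, and the toy DAG are assumptions, and the parent links shown do not reproduce the real GO structure.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            # A DAG node may have several parents, so follow all of them.
            stack.extend(parents.get(current, ()))
    return seen


def map_to_slim(gene_annotations, parents, slim_terms):
    """Map each gene's granular terms to the slim terms among their ancestors."""
    result = {}
    for gene, terms in gene_annotations.items():
        slim_hits = set()
        for term in terms:
            slim_hits |= ancestors(term, parents) & slim_terms
        result[gene] = sorted(slim_hits)
    return result


# Toy DAG (child -> list of parents); illustrative relationships only.
parents = {
    "translational initiation": ["protein biosynthesis"],
    "protein biosynthesis": ["metabolism"],
    "regulation of translation": ["protein biosynthesis", "regulation"],
    "metabolism": ["biological process"],
    "regulation": ["biological process"],
}
slim_terms = {"metabolism", "regulation"}

# Hypothetical gene list with the granular terms each gene is annotated to.
gene_annotations = {
    "geneA": ["translational initiation"],
    "geneB": ["regulation of translation"],
    "geneC": ["biological process"],
}

print(map_to_slim(gene_annotations, parents, slim_terms))
```

Because a term can have more than one parent, a single granular annotation may map to several slim terms, as geneB does here; a gene annotated only above the slim level, like geneC, maps to none.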
The GO Term Finder finds all the terms and their parents for a list of genes (users query). The GO Term Finder gives a tree view of all the terms with the DAG relationships, that the query set of genes have been annotated to. Both these tools can take a file of gene names or ORF's as input and can be very useful for analysis of expression data. GO Term Mapper: http://genome-www4.stanford.edu/cgi-bin/SGD/GO/goTermMapper GO Term Finder: http://genome-www4.stanford.edu/cgi-bin/SGD/GO/goTermFinder Pathway Tools SGD is in the process of incorporating biochemical pathways into the database using Peter Karp's (Stanford Research Institute, CA) Pathway Tools. A summer student mapped E.C. numbers to metabolic enzymes in SGD (approximately 1000) by using ec2go and searching the literature. In the first build using the Pathway Tools, 828 reactions were created in 163 pathways. We are in the process of refining the pathways. We will be using the E.C. numbers to increase the GO function annotations and hopefully add to the current ec2go file as new GO function terms are created. ------------------------------------------------------------------------- E. TIGR eukaryotic GO update September 2002 Linda Hannick Associations currently at GO: Arabidopsis Aug-27-2002 # genes with GO assignments 5089 since last release 798 # terms assigned 10833 molecular function 6564 biological process 2807 cellular component 1462 Associations not yet released to GO Other euk GO annotations in progress and not yet released include chromosome 2 of T. brucei (manually curated) and O. sativa (IEA). New developments The Arabidopsis project is now sharing GO annotations with TAIR weekly. TAIR GO assignments will be stored in our database along with our own to prevent duplication of work. They will be displayed on our annotation interface. We have a new gi2ath association file from GenBank which will be uploaded to the GO ftp site after this meeting. 
Software improvements are making it faster and easier to assign GO terms. The Manatee interface now allows editing of GO terms and evidence. A new GO search page allows an annotated search of a particular genome in our database, or a search of the entire DAG.

TIGR annotators track new terms using temporary "TI:" IDs. The assignment of temporary terms is now enabled by a set of pages:

[pictures]

Track TI IDs
The TI: IDs are intended as a tracking device for new terms as they are submitted to SourceForge. They are replaced automatically in our database as we enter the newly assigned GO: ID, with the TI: ID becoming a synonym of the GO ID in our database.
-------------------------------------------------------------------------
F. TIGR microbial GO update - August 2002 - compiled by Michelle Gwinn

Associations currently at GO:
                    genes   terms
Vibrio cholerae      2924    6243

I just sent a new Vibrio file with more associations. You may have noticed that the number went down instead of up; this is due to the removal of GO terms from the plain "hypothetical proteins", as a result of discussion on the GO email list.
I also sent a gp2protein file for Vibrio.
--------------------------------
Associations (manual) not yet at GO:

Genome                       genes     terms
Shewanella oneidensis         3769      8307
Bacillus anthracis            4555      9673
Coxiella burnetii             1467      2711
Methylococcus capsulatus*     2616      4554
Geobacter sulfurreducens*     1916      4078
Listeria monocytogenes*      >1465     >3342   (in progress now)
                             -----   -------
TOTAL                        15788     32665
GRAND TOTAL                  18712     38908   (with Vibrio)

Genomes pending publication (submitted manuscripts) and subsequent release to GO web page:
Shewanella oneidensis
Bacillus anthracis
(Total of 17980 GO terms)

* indicates annotation is incomplete for that genome; more genes remain from that organism that need to be assigned GO terms
-----------------------------------
Other news:

Our automatic annotation tool is now assigning GO terms to microbial genomes
 - TIGR genomes, preliminary assignment - followed by manual review prior to release
 - non-TIGR genomes (IEA) for display on our CMR website (should we send these to GO?)

Comprehensive Microbial Resource (CMR) displaying GO terms for genes that have them. (should be functional by the time of the meeting)

Rough draft of prokaryotic GO Slim exists, work continues

db/software support of GO Slims is under construction
------------------------------------
If anyone has any questions or wants to chat about any of this, please don't hesitate to email me - mlgwinn@tigr.org

Hope you have a good meeting, see you in the winter,
Michelle

=======================================================================
Appendix 4: Action Items from May 2002 Meeting at CSH

ACTION ITEMS FROM MAY 12-13 GO MEETING

Action Item 1 - AmiGO (Brad Marshall, BDGP)
a) make display of NOT data possible/correct in AmiGO (e.g. FBP26 for SGD; FlyBase, others have more)
b) metareference for curator refs for AmiGO (BDGP and/or GO): create a metareference for linking for curator refs for definitions for AmiGO (e.g.
GO:mah, SGD:krc, etc) c) linkouts in AmiGO to sequence in cases such as ISS with ________) d) Incorporate GO-Slim scripts into AmiGO e) display comment field in AmiGO f) show concurrent assignments in AmiGO g) add a SourceForge site for AmiGO bugs/requests h) gray out obsolete terms (post meeting addition) i) link from treeview page to graph view j) search function for the comments k) don't automatically toggle to gene product when the search result comes up null l) need to make sure that definition references go up with the def, not in the general dbxrefs m) add ability to upload files for multigene search n) GOST, request for it to accept a seqID o) want to be able to search with SwissProt accession numbers (this requires a gp2protein file for every organism, nothing for TIGR, PomBase, etc.) p) having a way of hiding/deselecting GO terms in BLAST report that you don't believe Action Item 2 - GO.xrf_abbs file (each group): examine the GO.xrfs_abbs file with respect to those abbreviations used by your group, add or submit (to your favorite contact with CVS write permission) Action Item 3 - GO content: modification vs. biosynthesis (GO) - examine ontologies for consistency of term names in the area of modifications to nucleotides/amino acid residues within the context of an already synthesized nucleic acid/protein **DONE, except for a few individual cases that aren't straightforward Action Item 4 - GO content: sensu terms (GO) - evaluate sensu terms, and expand documentation **in progress Action Item 5a - GO syntax: use of 'and' and 'and/or' (GO) - evaluate use of 'and' and 'and/or' in GO terms, target for elimination when possible **in progress Action Item 5b - possibility of ambiguous gene associations conjoined with 'OR' (BDGP: Chris, John) - discuss possible software solutions to ? 
of joining two different associations (gene product to GO term) with an 'OR', [NB: resolution of this item was unclear; first communicate with GO people on Action Item 5a and discuss whether there is any real desire/need to do this.] Action Item 6 - expansion/clarification of GO documentation (GO: Cath B) - Cath will evaluate GO documentation and expand/modify to clarify Action Item 7a - ontology integrity checking (John) - will create a SourceForge submission page for ontology errors **DONE!!! 5/13/02 Action Item 7b - ontology integrity checking (each group) - curators should look for ontology errors, and submit them to the SourceForge page that John will create **two whole entries so far Action Item 8 - submit GO-slim scripts/rules (each group, as relevant) - Submit scripts (Chris is fine with Python, or Perl) for using/calculating GO-slims to BDGP Action Item 9 - GO-slim naming conventions (GO): - confirm/review naming conventions for GO-slims and expand documentation if needed (Michael Ashburner claims that there is a naming convention in the document that he has just written) **was done already (see go/GO_slims/README) Action Item 10 - DAG-Edit/GOET (John Richter) - automatic recognition of ID prefix so that one doesn't have to manually change it all the time Action Item 11 - division of 'part-of' into multiple relationship types (Chris and Jane) - will look into new relationships deriving from the current multiplicity of the meaning of the 'part of' relationship Action Item 12a - GO dictionary (GO, John Garavelli)- we need a dictionary for John to use for spell checking (John Garavelli wants to write a script for this anyway so he will generate the dictionary) **DONE; dictionary is updated frequently Action Item12b - GO dictionary in editor (John Richter) - can write a spell checker for the editor once he has a dictionary Action Item 13 - Cross-product tool (interested parties (David, Bernard, ?), Chris, and John Richter) - cross-product tool: further 
discussion will clarify what is actually wanted as well as feasible, so that John can write a plug-in for curators to use via the editor Action Item 14 - New documentation for making cross products in DAG-Edit as currently exists (GO: Jane, Amelia) - create document on generating cross-products in DAG-Edit Action Item 15 - comment field: obsoletes & syntax (GO) - move obsolete IDs from synonyms to comment field and institute a regular (as in parsable) syntax for this field **parsable syntax part is done - syntax established; only thing now is to make sure we use it Action Item 16a - concurrent assignment protocol/docs for QuickGO (Evelyn) - get documentation from Tom Oinn on how he did it for QuickGO; add to documentation, to explain how this is calculated Action Item 16b - concurrent assignments from database (Chris) - pull this calculation on concurrent assignments from manual annotations using Database [NB: Fritz Roth is doing some calculations along this line] Action Item 17a - sequence clustering for sequences annotated with GO (Daniel? Liat?) - take sequences as they are now, run a clustering algorithm, generate trees, attach GO annotations and inspect by hand Action Item 17b - very cool annotation tool (????, highly dependent on above) - use this to develop an annotation tool that utilizes homology clustering Action Item 18 - IEA/ISS methods (each group, GO: Midori): Groups to submit to Midori short blurbs on procedures for large scale annotation methods (bulk assignments, particularly with IEA or ISS) with urls to add to the annotations guide **I've received ONE response (thanks to Harold Drabkin) Action Item 19 - gp2protein file documentation (Chris??)- expand documentation for gp2protein files Action Item 20 - monthly release notes (GO) - take a look at doing monthly release notes **in progress; item for Sept. 
agenda Action Item 21 - monthly diffs (Courtland Yockey) - will investigate DAG-Edit diffs, and communicate with John regarding proceeding further on utility of a plug-in for DAG-Edit that could do this **progress on item 20 is relevant Action Item 22 - update to current GO home page (Karen) - make links to Gavin's source Action Item 23 - DAG-Edit user notes (Jane) - will post DAG-Edit user notes **DONE!! (thanks, Jane!) Action Item 24 - GO FAQ (Rama and Cath) - populate FAQ with Q & A's Action Item 25 - Hinxton meeting (Michael Ashburner) - a: find venue for 10-11 meeting **DONE - b: get a Manchester person down to talk about DAML+OIL **I've asked Action Item 26a- Hinxton Users meeting (Midori and Karen) - will work out logistics of registration (Consortium members will probably also use the registration page) **DONE Action Item 26b- Hinxton Users meeting (Midori) - add suggestion tick box to reg form for what would you like to see **DONE Action Item 26c- Hinxton Users meeting (Midori) - mailing to go-friends list asking about desired content/attendance for User's meeting **DONE (zero replies tho :( ) Action Item 27 - quotes for Virgin Islands meeting proposal (John Richter) - will get quotes and send to list within the next week **John sent one message and got several replies, so I assume this is in progress =================================================================== Gene Ontology Consortium Meeting Divi Carina Hotel, St Croix, US Virgin Islands January 25-26, 2002 Contents Participant list Progress Reports Action Items from last meeting Presentation: GO in UMLS: Jane Lomax Content Issues Database & Software Annotation Issues Miscellaneous Documentation: Appendix 1: Handouts accompanying progress reports [omitted from text version] Appendix 2: Action items from CSH May 2002 Appendix 3: Notes on J. Lomax presentation [omitted from text version] Appendix 4: Assorted documents relevant to agenda items. A. Email from Tanya Berardini B. 
Email from Aubrey De Grey
C. MGI Excessive granularity document
D. MGI Negation document
E. Documentation progress report (from Cath)
Appendix 5: Collected action items from this meeting

Participants

Michael Ashburner   FlyBase                       Cambridge, UK
Daniel Barrell      EBI                           Hinxton, UK
Matt Berriman       PSU(Sanger)                   Hinxton, UK
Judith Blake        MGI                           Bar Harbor, ME
Cath Brooksbank     EBI                           Hinxton, UK
Evelyn Camon        EBI                           Hinxton, UK
Tricia Dyck         DictyBase                     Northwestern University, Chicago, IL
Kara Dollinski      SGD                           Stanford, CA
Harold Drabkin      MGI                           Bar Harbor, ME
Dianna Fisk         SGD                           Stanford, CA
Becky Foulger       FlyBase                       Cambridge, UK
Linda Hannick       TIGR                          Rockville, MD
Midori Harris       EBI                           Hinxton, UK
David Hill          MGI                           Bar Harbor, ME
Eurie Hong          SGD                           Stanford, CA
Amelia Ireland      EBI                           Hinxton, UK
Jane Lomax          EBI                           Hinxton, UK
Brad Marshall       BDGP                          Berkeley, CA
Suparna Mundodi     TAIR                          Carnegie Inst., Stanford, CA
Chris Mungall       BDGP                          Berkeley, CA
Sue Rhee            TAIR                          Carnegie Inst., Stanford, CA
John Richter        BDGP                          Berkeley, CA
Valerie Wood        GeneDB S. pombe (Sanger PSU)  Hinxton, UK

Progress Reports

For full reports, see Appendix 1.

GO Editorial Office, EBI
- over 600 new terms added; 70% of terms now have definitions
- every GO synonym examined and a relationship to the term name assigned (as part of UMLS project)
- comments added to all obsolete terms
- SourceForge item notification script now up and running

DictyBase
- public beta of annotations is viewable and will be added to the GO repository after checking
- medical ontology has been developed and should be available soon

FlyBase
- 27,056 GO annotations now in FlyBase
- Swiss-Prot GO annotations continuing - these include annotations for non-D. melanogaster genes.
- most recent re-annotation of the Drosophila genome (release 3) is almost complete
- definitions added to a number of fly-specific process terms

GOA @ EBI
- 6 GOA-SPTR releases, 8 GOA-Human releases
- GOA dataset to be enhanced by mappings from Swiss Institute of Bioinformatics
- GOA cross-referenced directly in the EMBL nucleotide sequence database
- QuickGO browser updated

MGI
- annotations added at a steady rate; 41,000+ annotations to 9032 genes
- continued development of phenotype ontology; expected to be made public by mid-February
- RIKEN data has been loaded into the database

GeneDB S. pombe (Sanger PSU)
- total of 15,029 GO term assignments now made to process and component terms
- extensive overhaul of configuration files to give constant refinement of associations

PSU (Sanger)
- full manually curated GO annotation of malaria finished
- joint curation with TIGR of Trypanosoma brucei continues
- annotation of Aspergillus fumigatus and Theileria annulata genomes to come

SGD
- two new software tools: GO Term Finder and GO Tree View
- every ORF at SGD has a function and process term annotation
- every named ORF has a complete set of GO annotations

TAIR
- GO terms being added to the ontologies with definitions
- plant GO-slim developed and submitted
- aim to annotate all studied Arabidopsis genes to all three GO ontologies

TIGR
- T. brucei chromosomes 4 and 6, rice and Aspergillus fumigatus are in the works
- Shewanella association file recently submitted
- several bacterial genomes awaiting publication

ACTION ITEM: TAIR to update MetaCyc2GO mappings.

Action Items from last meeting

See Appendix 2 for full details. Action items arising from this were:

ACTION ITEM: John. 7 from last time [add term deletion feature to DAG-Edit].
ACTION ITEM: Brad. 10 and 11 [adding more information about GO curators to website/database] outstanding.
ACTION ITEM: Come up with system for notifying developers of format changes.
ACTION ITEM: Add "contributed by" column.
GO in UMLS: Jane Lomax

See Appendix 3 for the full presentation.

Progress report: GO has not yet been released with UMLS Metathesaurus, but substantial progress has been made. There has been a successful insertion of the molecular function ontology, with cellular component and biological process soon to follow. There are two major issues created for GO: how to handle GO 'synonyms', and ambiguity in GO term names. These issues are discussed later in the meeting.

Content Issues

Synonyms: distinguishing exact synonyms from related terms
- how many types to distinguish?
- how to store/represent (implications for tools)?

There was some discussion regarding the synonym types. In particular, whether a synonym with the "broader than" relationship to the main term reflects a missing parent or relationship in the tree, and also the number of relationships we need - do we need finer distinctions than true synonym vs related term? It was concluded that we would keep all the existing types of synonyms (exact, broader, narrower, related to, undefined) and the hierarchy of synonym types would be as follows:

related to
  [i] exact
  [i] broader
  [i] narrower
  [i] undefined

ACTION ITEM: Curators. When adding new synonyms, track which type they are. If they are 'broader than' or 'narrower than', consider whether it calls for a new term.
ACTION ITEM: Jane. Circulate synonym list again.
ACTION ITEM: BDGP. Look into rules that could be worked into DAG-Edit to make synonym maintenance easier.

GO/UMLS component term merge problems

The problem stems from ambiguity in term names. The term string "xxx complex" in GO refers to a cellular location, but the same string in UMLS usually refers to a protein entity and would be assigned the semantic type 'amino acid, peptide or protein'. The question is, does the GO cellular component term mean the same as the UMLS concept?
If it doesn't, and a new concept would have to be created, what semantic type should we assign it, and what relationship would need to be created between these new and existing concepts?

It was agreed that the GO 'xxx complex' cellular component terms were different in meaning from the existing 'xxx complex' concepts in UMLS, and GO term names should not be changed to fit with UMLS. It was decided that Jane should discuss possible solutions with UMLS people: possibly modify some GO term names in UMLS only (by adding 'location'?) or see whether UMLS can help come up with a solution in their system, and keep the consortium informed of progress. The consensus was that all cellular component terms should be in concepts with the semantic type 'cell component' (never part of a concept with the semantic type 'amino acid, peptide or protein') and that the relationship between the new (with GO term) and existing concepts should be something broad, like 'related to'.

ACTION ITEM: Jane. Discuss this with UMLS and fill us in on the results.

Cellular processes: questions to be resolved before the cellular process reorganization is committed

See Appendix 4A for the email from Tanya Berardini containing the questions.

- Cellular differentiation vs cell fate commitment and cell type development vs cell type differentiation

David Hill outlined a suggestion: cell differentiation can be broken down into the following steps: cell fate commitment, where a cell senses its location and begins to specialize, but can still switch types; cell type determination, where a cell switches irreversibly to a specific type; and cell development, where a cell physiologically matures into its type. Should we use these divisions in GO? The group agreed that we should.
Conclusion: Cell differentiation and its children will have the following structure:

cellular process
  [i] cell differentiation
    [p] cell fate commitment (exact synonym: cell fate specification)
    [p] cell fate determination
    [p] cell development (exact synonyms: cell morphogenesis, cell maturation)

- Response to endogenous stimulus and response to exogenous stimulus

Cellular response and organismal responses are usually linked; we would like to capture the relationship but don't want to violate true paths (e.g. for unicellular organisms). This means being very careful with parentage. Cue a big discussion of where to put the unicellular/multicellular split. A working solution was proposed: make the split as far below 'physiological process ; GO:0007582' as possible, and as and when needed, rather than splitting right below physiological processes. We will revisit this to see how the solution has worked. Leaving the "response to xxx" terms under cell communication is fine.

The group agreed that it was always important to keep annotation in mind when making these changes, and reaffirmed the need to keep GO process terms covering multicellular processes, as they are needed for annotation in many species and help in the development of orthogonal ontologies.

ACTION ITEM: David and Tanya. When splitting out multicellular vs unicellular processes, make the split as far below 'physiological process ; GO:0007582' as possible, and as and when needed, rather than splitting right below physiological processes.

Grouping terms in the function ontology

Prompted by Karen's email on G-nucleotide release factors and the related items RNA polymerase and hydrogen-translocating ATPases.

The function ontology contains grouping terms that reflect process or component info (e.g. DNA repair protein; membrane-associated functions). This cross-contamination is useful for helping curators find terms but is not consistent with the guidelines set out for function terms.
One approach would be to make relationships between the function and component or process ontologies and remove the grouping terms. This would require VERY careful curation as some functions act in many processes. A better solution would be to expand the toolset available to curators, e.g. Fritz Roth's statistical links and concurrent assignment tools. The conclusions were that no hard-coded links will be made between the ontologies and instead research would continue into tools to make statistical links.

ACTION ITEM: GO editorial team (and others). Start removing grouping terms slowly and carefully with all the usual communications. If obsoleting a term, ensure the corresponding process or component exists.

Should functions (particularly enzyme functions) be differentiated on the basis of environment?

1. pH-specific enzymes: Example given was GO:0030230 and GO:0030231, differentiated on the basis of the pH at which they act. Conclusion: different EC numbers - keep both terms; same EC numbers - obsolete the pH-specific examples and use the parent term.

2. Hydrogenases: Example given was GO:0008901 and its children GO:0016948 - GO:0016951. They have the same EC number but different metal ions associated with them. This could be solved in the same way as protein binding - at the annotation stage, use a chemical ontology and use the extra column to note the metal. Alternatively, we could use multiple parents and/or annotate to separate terms (e.g. hydrogenase, iron binding). The issue was not resolved after discussion and will probably be left until we have software to implement the new column.

Should we add 'activity' to function term strings?
- if so, do we change the main term string or add 'related terms'?

Two main arguments for this: first, it reduces the ambiguity of the term name, therefore helping when GO is included in other systems (specifically UMLS), and second, it will reduce user confusion. All agreed this was a timely step.

ACTION ITEM: Jane.
Add 'activity' to function term strings.

How to represent membrane proteins
- whether to have 'integral [to] membrane', what wording
- whether to add children (e.g. for type I, II, III, IV transmembrane)

In the component ontology, we used to have 'integral membrane protein' plus children, which was problematic because it didn't refer to a location but rather to a relationship between a membrane protein and a membrane. The wording was recently changed to 'integral to membrane'; did we want to keep this for the long term or find some other solution? The other issue, brought up by Evelyn, was whether to add more granular child terms for the different types of transmembrane protein, as this would help with Swiss-Prot/GO mappings. This idea was rejected because these are types of protein and not locations.

Conclusion: Keep the membrane terms as they are now (integral and peripheral); don't add the children as they don't reflect a location.

Should the 'host' term be used for viral cellular component terms?

The term 'host' was originally created for describing the cellular component of single-celled parasites infecting a host cell, so it was placed under 'extracellular'. A problem arose when trying to add the new viral terms, because viruses aren't cells, so the host cell environment is not extracellular. Various options were discussed, including moving 'host' out from under 'extracellular', but it was felt that the best option was to simply extend the definition of 'extracellular' so that it could be applied to organisms that aren't technically cells. A comment would also be added explaining why this was done.

ACTION ITEM: GO editorial team. Define extracellular to include outside a virus particle, then use host terms as parents for the appropriate virus cell component terms.

How should we handle component terms that can be both intracellular and extracellular?
Some complexes can be intra- or extracellular; the example given was 'immunoglobulin complex ; GO:0019814' which can be either membrane bound or circulating, so there are two is_a child terms, 'immunoglobulin, circulating ; GO:??' and 'immunoglobulin, membrane bound ; GO:00??'. The problem comes with the placement of the parent term, the generic 'immunoglobulin complex', which might be used when you know that a gene product is a component of an immunoglobulin molecule, but do not know whether it is membrane bound or circulating. At the moment the term is placed directly under 'cellular component', but it's going to end up a pretty long list! After some discussion, during which we considered whether we needed a generic term at all, it was felt that the most appropriate place for such terms is directly under cellular component where we currently have them.

ACTION ITEM: GO editorial team. Go through the enzyme complexes (see also SF entry 535294) and where applicable, make a general parent directly under 'cellular component' with children in specific locations.

Term grammar (for use in automated construction of sentences describing gene products)

See Appendix 4B for the email from Aubrey de Grey.

We are willing to alter the term grammar to suit Aubrey's needs as long as:
A: Aubrey sends terms so we don't have too much work to do!
B: we check carefully to make sure any changes won't wreck terms for biologists searching or curators annotating

ACTION ITEM: GO editorial team to get list from Aubrey and evaluate; adjust terms as needed.

Revisit 'catalyst' and 'regulator' part-of children of some enzymatic activity terms

Several enzymes are split into a catalyst and a regulator function. This item questioned the need for these terms as they sound like enzyme components rather than functions. After discussion, it was decided that they should be left as they are to allow maximal information about protein function to be captured.
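
Illustration for the term grammar item above (added for clarity; not something presented at the meeting). The sketch below shows, in Python, how slotting GO term names into a fixed sentence template produces the awkward output quoted in Appendix 4B, and how treating the function term as an activity, as suggested in the reply there, reads better. The function names naive_sentence and activity_sentence and the exact template wording are assumptions made for this example; this is not the code FlyBase uses.

```python
def naive_sentence(function, process, component):
    """Slot raw GO term names straight into a fixed template."""
    return (f"It encodes a {function} involved in {process} "
            f"which is a component of the {component}.")


def activity_sentence(function, process, component, product="a protein"):
    """Treat the function term as an activity rather than an entity."""
    return (f"It encodes {product} with {function} activity involved in "
            f"{process} which is a component of the {component}.")


if __name__ == "__main__":
    # Term strings as quoted in the Appendix 4B example.
    old_terms = ("heme binding", "nutritional response pathway", "extracellular")
    # Hypothetical adjusted process and component strings.
    new_terms = ("heme binding", "nutritional response", "extracellular space")

    print(naive_sentence(*old_terms))
    print(activity_sentence(*new_terms))
```

With the adjusted process and component strings, the second form reads as an ordinary sentence without putting a gene product name such as 'protein' into the function term itself.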
Revisit the "Round Table Discussion" on how to represent synthesis/binding/etc. of individual proteins

See Appendix 4C for the MGI excessive granularity document.

The problem is basically that GO cannot allow gene product names inside GO terms because of the rampant proliferation of terms that this generates; however, it is still useful to be able to annotate to this level of granularity. For instance, to be able to state that a gene product IL18_HUMAN is involved in 'interleukin-13 biosynthesis'.

The solution proposed by Chris was as follows: some GO terms would have 'slots', which would be filled in the gene_associations file. For instance, 'biosynthesis' would have a 'slot' named 'synthesizes'. The GO term 'interleukin-13 biosynthesis' would therefore not exist, and instead, the annotation for IL18_HUMAN would include an entry to GO term 'cytokine biosynthesis ; GO:0042089' or just plain 'biosynthesis ; GO:0009058'; this entry/line would also have a column for 'slot', which would read "synthesizes(interleukin-13)". Interleukin-13 could be replaced with an identifier from a product/family/physical-entity ontology. The proposition is described in more detail at http://www.fruitfly.org/~cjm/slots.html

The practical implications were discussed; there is a need for ontologies to cite in the slot values, for example, a chemical ontology and a protein family ontology. A few exist and more will be available in about a year. This will also require a rethink of annotation practice, and some new tools. Existing annotations would of course have to be retrofitted, but the bulk of this could be automated. Of great importance is considering our users: any changes need to be announced well in advance. In addition, would we change the front-end appearance of tools, e.g. AmiGO, or keep these changes behind the scenes?

One issue is that using the slots effectively creates GO terms that are cross-products, but do we instantiate these products - i.e. give them GO IDs?
For instance, if we were to instantiate all the terms generated by the cross product between 'synthesis' and a product/molecule/chemical ontology we would have actual GO IDs:

GO:9000001 IL-1 biosynthesis
GO:9000002 IL-2 biosynthesis
GO:9000003 IL-3 biosynthesis
GO:9000004 IL-4 biosynthesis
GO:9000005 IL-5 biosynthesis

The disadvantage is that any time the orthogonal ontology of products is changed, GO has to be changed (either manually or automatically) to reflect this. For example, if IL-1 was split into IL-1a, IL-1b we would need IL-1{a,b} {biosynthesis, receptor} etc in GO.

With the 'slots' approach there would be no GO ID for "IL-8 biosynthesis". Curators could still annotate genes as "IL-8 biosynthesis" by dynamically combining the terms using slots, but the disadvantage is that there would not be a single GO ID they could quote in a paper etc.

ACTION ITEM: Announce on the website that we'll implement this solution at some future date (no date set but will be 6+ months from now). Assemble a group (MA, Chris, David) to work on the implementation.

Interest Groups

Interest groups and areas have been extensively examined or claimed already; the problem is how to ensure that the interest group is informed when changes are made to that part of the ontology. We could have interest groups listed e.g. in SourceForge, or on our webpage, perhaps with a list of GO_Slim terms defining the area of interest alongside. Anyone making changes to these areas would then have to inform these groups first, then the onus would be on these groups to pipe up if they had a problem!

ACTION ITEM: Midori to put up interest groups on web page. Everybody to send group ideas & which they volunteer for. See if it works or if we need further formalization by putting groups in SourceForge.

Annotation

Annotation of disease genes

Annotations of genes implicated in disease to be submitted by Nat Goodman. These should be fine as long as he doesn't annotate actual disease processes, i.e.
he must only annotate the normal functions of genes implicated in disease.

Consistency and quality control
- Suggestion from Evelyn: a set of "standard annotations" for common proteins.

Evelyn has seen different terms assigned to "common" proteins; is this a QC problem or does it differ between organisms and what has been studied and what experiments have been done? How do you define "common proteins"?

Conclusion: Annotations are the responsibility of individual databases. Differences often reflect the state of experimentation. Evelyn has a unique perspective for spotting inconsistencies, because SWISS-PROT includes annotations from all organisms. She should keep communicating problems to the individual databases.

Negation

See Appendix 4D for MGI's handout.

Conclusion: The best solution in the long term is to use Chris's slots model; in the meanwhile, muddle through somehow - each group can decide what works best for them.

Database and Software

DAG-Edit & GOET

One line of GOET work has stopped, but GOET overall goes on. John is back working on DAG-Edit. :-)

New DAG-Edit features (full list appears in the release notes of the latest version):
- Search tool remembers last 10 searches on each field
- Configuration plugin allows users to show undefined terms in gray
- Changed flat file format to support multi-character types
- The available relationship types are now defined per-session, instead of per-adapter
- Created a Relationship Type Manager plugin that allows a user to define which types are available in a session
- Dbxrefs now have an editable description (however, the flat file format cannot store these descriptions)
- An arbitrary number of files can now be read in at one time (instead of just 3)
- File read history now stores groups of files, not one file at a time

John would like to switch over to the new flat file format. This should be announced on the proposed webpage for forthcoming software/data format changes, as well as on the GO site in SourceForge.
Users should be given adequate time to switch over; John suggests allowing two months after the announcement has gone up. The new format allows relationship symbols & types defined in headers, and multi-character relationship types are possible, as well as dbxref comments in the flat file. It also has a reduced file size due to non-redundant display of parentage.

Other planned features for DAG-Edit include:
- multiple terms viewable in gene product plug-in (only one can be viewed at a time at the moment)
- option to have a "delete" button to move terms to obsolete
- plug-in for cross-products
- spellcheck function to use with the dictionary file

AmiGO

Brad reported that the AmiGO GOst BLAST server is now live. He also reported that an AmiGO software upgrade is coming soon. Brad is interested in feedback from the community that uses GO on what data and tools they use and how they use them.

ACTION ITEM: Construct and post a user survey covering tools, AmiGO, etc. Send question ideas to Amelia Ireland. It will be sent out to GO-Friends and data collected in time for the grant application.

Database

There isn't much change to report on the database. Chris and Dave Emmert (Harvard) are developing CHADO, a postgres database. It will be more capable of holding different ontologies; it is expected that FlyBase and GMOD will use it and it will probably subsume the GO database.

Database Updates (Chris Mungall)

Chris says that automated database updates are taking place approximately once a month. He has a script which creates 4 downloads: terms; terms and annotations; terms, annotations and sequences; terms, annotations without IEA and sequences (for AmiGO). The script takes approximately two days to run. It was suggested that there should be a daily update of the database terms and structure to prevent the lag seen between the addition of the new terms and their appearance in AmiGO.

ACTION ITEM: Chris.
Suggestion: a daily release of a separate database containing just terms without annotations. The whole database should be updated every month. AmiGO would have the option to view the up-to-date term set with no associations.

Chris has scripts to map gene association files to GO-slim terms; they use 'bucket' terms such as "other enzyme" which are given temporary GO-slim IDs.

ACTION ITEM: Chris. Make use of parents rather than bucket terms to avoid confusion due to transient IDs.

Brad clarified the AmiGO pie chart maker behaviour and accepted suggestions for new features.

ACTION ITEM: Brad. Investigate piping GO-Slim mapping results to the AmiGO pie chart maker.
ACTION ITEM: Brad. Add the ability to dump AmiGO pie chart data as a flat file containing GO ID, term name and the number of gene products.

Miscellaneous

GO.bib file

During the updating of the documentation, Cath discovered the GO.bib file and asked who uses and maintains it and whether some guidelines could be drawn up for its content and usage. It was concluded that no one uses this document (let alone maintains it!) and it could be removed from the GO documentation.

GOBO

After the success of the Standards and Ontologies for Functional Genomics (SOFG) conference, Helen Parkinson (EBI) has had requests for an ontology site hosted at the EBI or at sofg.org. Michael Ashburner will talk to Helen and Chris Stoeckert about this.

Documentation

See Appendix 4E for Cath's progress report.

Cath has made significant progress in her work on the documentation. Unfortunately, Cath is no longer part of the GO team at the EBI, but she was able to do the work in her new role as part of the Outreach team. She has reorganized, rewritten and updated the documentation to make it clearer and easier for users to find the information they are looking for; to this end, she has split the information into sections relating to different GO users.
There were several action items relating to the documentation:

ACTION ITEM: Member databases. Each database should send annotation FAQs from their existing documentation to Cath for inclusion in GO FAQ. GO FAQ will have general annotation FAQs and then specific FAQs from each database and from the EBI.
ACTION ITEM: Everyone. Read over the new documentation (especially the style guide) and send any suggestions to Cath. This is available at http://www.ebi.ac.uk/~cath/
ACTION ITEM: Cath. The changeover to the new documentation will occur on 15 March.
ACTION ITEM: Cath. Update the synonym section of format guide to accommodate the decisions made at this meeting.
ACTION ITEM: Chris. Provide some documentation on the MySQL database.
ACTION ITEM: Jane and John. Update the DAG-Edit user guide.

Grant Proposal

Judy reviewed the schedule and plan for the upcoming competitive grant renewal for the GO Consortium. We will submit our proposal to the NHGRI on March 1. We will ask for continued support for the development of the ontologies, now including the Sequence Ontology for sequence features. We will ask for continued support for the annotation of genomes and gene products to the GO by the model organism databases and Swiss-Prot. We will ask for continued support for a community database resource which includes open access to the ontologies, the annotations to the GO, and other resources and tools. Some new aspects of the project are that we will continue to work to provide the ontologies in DAML+OIL, and will provide support for pilot projects that investigate or interact with the GO in new ways.

Next Meeting

Host: TIGR, June 3 - 4 (no users meeting).
Minutes: BDGP

Appendix 2. Action items from CSH May 2002

Action Items from Cambridge Sept 2002 meeting

1. FB to use PubMed IDs instead of [or in addition to?] FBrf IDs. - DONE
2. TIGR to provide protein id --> TIGR gene ID. - DONE
3. TIGR to send IEA annotations to GO for genomes not sequenced at TIGR. - NOT DONE.
Michelle says some of the IEA associations were being made based on incorrect GO associations and is working to fix this.
4. Cath will update documentation and circulate drafts. - see report
5. Evelyn to continue tracking down info on QuickGO concurrent assignments. - has tried. contact David Binns.
6. Consortium, especially Chris M, to revisit concurrent annotations in GO database. ?????
7. Add check for term deletion to flat file helper. - will put option in configuration manager
8. Sue will ask Danny to take over DAG-Edit maintenance. - DONE. He said no.
9. Amelia will collect bug reports and feature requests for DAG-Edit from curators. If John can't act on feature suggestions, perhaps Danny can. - DONE. SourceForge list.
10. Change prefixes to "GOC:" for definition references that represent an individual curator or group of curators. - when Brad does 11.
11. Brad will create a form where curators can enter info (e.g. name, affiliation, dbxref entered in definition reference field), and create and link a web page for each GOC:xyz entry. - new action item covering this.
12. Chris to get comments into the database. - code working. will do.
13. Add a link to the GO-Slim directory to the home page. - NOT DONE.
14. DBs to send GO-Slims and lists of all genes to BDGP. - in directory.
15. BDGP to generate tables of gene ID <--> GO-Slim term for each DB that submits a gene list and a GO-Slim. Genes lacking annotations will get "unexamined"; annotations to "unknown" will be preserved. ?????
16. Add hyperlinks to the gp2protein files: link from web page and from each gene_association file. - use docs
17. Set up "interest groups" based on subject matter; maintain a list of groups and who's in them (on SourceForge if possible -- look into this). - sort of DONE.
18. All content changes, no matter how small, should go into the SourceForge tracker for archiving purposes. Summary entries should be nice and informative. - ongoing; DONE.
19.
Set up script to email summaries from new (open) SourceForge tracker entries. - DONE.
20. Test all "protein biosynthesis" and "protein binding" terms. Apply the two-part test to all, and (for protein family or class ones) look at annotations and child terms. Circulate the list slated for obsolescence. Note: we are not going to make all "protein binding" terms obsolete yet. It would be good to determine which terms would pass the tests, though. - in progress.
21. Circulate a proposal for incorporating "gene expression" and "regulation of gene expression" terms and definitions. - decided against "regulation of gene expression"; Jane will circulate the "gene expression" def.
22. Discuss this [protein binding etc.] again at the next meeting! - DONE.
23. Propose definition for "cellular process" and discuss on mailing list. - DONE.
24. Each model organism DB should review terms under "embryogenesis" and "morphogenesis" to check for correct parentage; also figure out which ones will go under "cellular process." - in progress; mouse done.
25. TAIR curators to improve definitions of "cell surface" and its children. - DONE.
26. Change wording of GO:0030312 to "external encapsulating structure." Circulate new definition; make sure Michelle Gwinn has a chance to comment. - DONE.
27. Review all "cell wall" terms to check parentage. Plant cell wall does need to be moved. - DONE.
28. Start thinking about terms (and definitions, of course) to capture concept of boundary. - ongoing.
29. Create UniGene <--> GO file (Daniel) - DONE.
30. Add to documentation of "with" column use -- allow cardinality 0, 1, >1 for all evidence codes that use "with" at all; explain situations where cardinality 0 is allowed. - NOT DONE.
31. Annotations that use ISS, IPI, or IGI but have a blank "with" column should link to the annotation documentation (let people see the possible reasons why nothing's entered). - NOT DONE.
32.
Each group that shares annotations should tag the ones that come from the other group(s). - coming soon.
33. Document this decision [shared annotation], and how to implement it. - coming soon.
34. Amelia will continue polishing The Script. When it's ready for prime time, it will go in the software repository, and will be run every month to generate a log to accompany the flat file archives and database releases. Decide where to put the output. - script done; need to decide where output should go.
35. Set up new faq-o-matic page (Cath & Rama, with a bit of help from Chris); everyone to add FAQs and answers, though Cath & Rama will probably do the most, at least at first. - content collection 1st round done.
36. EBI GO curators circulate a set of instructions for using CVS. - DONE.
37. Progress report for current grant. - DONE.
38. Prepare renewal grant application. - in progress.
39. Prepare a site with mock-ups of GO web pages derived by splitting up the current home page sensibly. - NOT DONE.

Appendix 4A. Email from Tanya Berardini

Cellular process issues (from Tanya):

Subject: Cellular process issues for St.Croix

Hi everyone,

Here are a few issues that I think would be good to address at the meeting. David will be attending, while I won't be able to make it.

1. cell differentiation vs. cell fate commitment

right now, these terms are siblings

cell differentiation: The process whereby relatively unspecialized cells, e.g. embryonic or regenerative cells, acquire specialized structural and/or functional features that characterize the cells, tissues, or organs of the mature organism or some other relatively stable phase of the organism's life history. ref:ISBN:0198506732

cell fate commitment: The commitment of cells to specific cell fates and their capacity to differentiate into particular kinds of cells.
Positional information is established through protein signals that emanate from a localized source within a cell (the initial one-cell zygote) or within a developmental field. ref: ISBN:0716731185

2. response to endogenous stimulus and response to exogenous stimulus

Move to be children of physiological process/add physiological process as additional parent? Right now, they are children of cell communication.

response to endogenous stimulus: The change in state or activity of a cell or an organism as a result of the perception of an endogenous stimulus. ref: TAIR:sm

response to exogenous stimulus: The change in state of activity of an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of the perception of an external stimulus. ref: FB:hb

3. cell_type development vs. cell_type differentiation

Do we need both terms? Are they meant to describe different things? (e.g. pole cell development vs. pole cell differentiation) Check out the children of cell differentiation for a sample.

Thanks,
Tanya

Appendix 4B. Email from Aubrey de Grey

Term grammar (from Aubrey de Grey):

Subject: GO grammar

Hi Midori,

Am I alone in feeling that the GO ontologies are grammatically challenged? It seems to me that the terms in each of them should be such that a sentence of the form:

It encodes a[n] [function term] involved in [process term] which is localised to the [component term]

should always read properly, but in fact one gets things like:

It encodes a heme binding involved in nutritional response pathway which is a component of the extracellular.

as opposed to:

It encodes a heme binding protein involved in nutritional response which is a component of the extracellular space.

I care about this more than most because I construct such sentences automatically from GO data in FlyBase as part of the summary paragraphs that appear in the gene records.
But I think it looks decidedly untidy even when the terms are presented in tabular form, and it would probably take only a couple of hours' work to correct the common ones. Becky saw my point and suggested I mention it to you. What do you think?

Cheers,
Aubrey

reply:

Hi Aubrey,

I'll put this issue on the agenda for the GO meeting, since it's coming up so soon anyway. I don't think there will be any objection to adjusting the 'pathway' process terms, or the cellular component terms, since it will help with sentence generation, and won't hurt for any other purpose.

We absolutely _cannot_ make alterations such as 'heme binding' --> 'heme binding protein' in the function ontology. None of the ontologies includes terms representing gene products; rather, we did and do put a lot of effort into keeping gene product names (whether specific, like 'actin', or generic, like 'protein') out of GO. GO terms also do not represent what a gene product is (or is made of), but what it does and where it is found. Function terms represent activities, not entities.

It seems to me that it would be straightforward to adjust the sentence generation to accommodate function terms as activities rather than molecules, e.g.

It encodes [an RNA|a protein] with [function term] activity involved in ...

We would then be willing to fix any function terms that caused this construction to go awry.

reply to above:

Very good point re function - and very nice suggestion for the sentence structure. That's what I'll do. On a quick browse, the only group of function terms that would be a bit broken by your sentence structure are ones that end in "factor" (guanyl-nucleotide exchange factor, etc), and for them I guess dropping "factor" would actually be in line with the policy you describe. Great if you can adjust the process and component ontologies.

Appendix 4C. MGI Excessive granularity document

Excessive granularity.

As we add and refine terms in the ontologies, we need to keep two things in mind.
First, the terms should be as organism non-specific as possible. Secondly, the terms should be as meaningful as possible. As put forth in the last GO meeting, there are several branches of the GO that seem to have expanded unnecessarily. These fall into three broad categories: Protein Binding, Biosynthesis, and Regulation.

1. Protein binding:

In using the term GO:0005515, Protein binding, we can make use of not only the GO ontology structure itself, but also the attributes/qualifiers used in linking a term in the ontologies with a gene product, which are included in each annotation line supplied in a gene_association.db file. A good example is the use of the term "protein binding". This term can be qualified with both an evidence code and the "with" field. The combination allows a curator to record that a gene product binds to a specific protein product. The "with" field is intended to house a sequence identifier or db identifier pointing to a specific protein. Therefore, there should be no need to populate the GO with specific children of protein binding. However, that is not to say that in those instances where there may be ambiguity, we cannot have a child that describes binding to a product family. For example, actin binding, stat binding, etc. These can be used when the specific gene product is not identified.

Amel, amelogenin
F GO:0005515 protein binding IPI SWP:Q9CRG8

In this example, amelogenin was shown to bind to Q9CRG8, the protein specified by Bat3, HLA-B-associated transcript 3.

Acrp30 adipocyte complement related protein of 30 kDa
F GO:0005515 protein binding IPI SWP:Q60994

In the example above, the protein Acrp30 is shown to bind to SWP:Q60994, Acrp30; thus, the statement demonstrates that the protein oligomerizes.
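
The reading of the two annotation lines above can be sketched in Python. This is a minimal illustration only: the record layout and the field names (symbol, with_id, own_id) are assumptions made up for the sketch and are not the gene_association file format, and own_id (the gene product's own protein identifier) is a hypothetical field added so that binding to itself can be detected.

```python
from dataclasses import dataclass
from typing import Optional

PROTEIN_BINDING = "GO:0005515"  # the general "protein binding" term


@dataclass(frozen=True)
class Annotation:
    # Simplified, hypothetical record; not the real annotation file layout.
    symbol: str                     # gene symbol, e.g. "Amel"
    aspect: str                     # "F", "P" or "C"
    go_id: str
    term: str
    evidence: str                   # evidence code, e.g. "IPI"
    with_id: Optional[str] = None   # identifier from the "with" field, if any
    own_id: Optional[str] = None    # hypothetical: the product's own protein ID


def binding_partner(a: Annotation) -> Optional[str]:
    """Return the partner named in the "with" field of a protein binding/IPI line."""
    if a.go_id == PROTEIN_BINDING and a.evidence == "IPI":
        return a.with_id
    return None


def binds_itself(a: Annotation) -> bool:
    """True when the named partner is the annotated product itself (oligomerization)."""
    partner = binding_partner(a)
    return partner is not None and partner == a.own_id


amel = Annotation("Amel", "F", PROTEIN_BINDING, "protein binding", "IPI",
                  with_id="SWP:Q9CRG8")
acrp30 = Annotation("Acrp30", "F", PROTEIN_BINDING, "protein binding", "IPI",
                    with_id="SWP:Q60994", own_id="SWP:Q60994")
```

With these records, binding_partner(amel) gives the Bat3 identifier, and binds_itself(acrp30) is true because the "with" identifier is Acrp30's own, which is the oligomerization reading given in the text.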
Ablim1, actin-binding LIM protein
F GO:0003779 actin binding IDA

In this example, the actin-binding LIM protein was shown to bind actin, but the actual gene product was not specified (in mouse, there are several actins: actin, alpha 1 (Acta1), actin alpha 2 (Acta2), actin beta (Actb), actin alpha, cardiac (Actc1), actin gamma (Actg), and actin gamma2 (Actg2)). The term GO:0005515 would not be sufficient, since the "with" field could not be specified. However, in this case the GO:0003779 term allowed sufficient granularity in the annotation.

Another example uses GO:0005518, collagen binding.

Gp6, glycoprotein 6 (platelet)
F collagen binding IDA

In this example, glycoprotein 6 was shown to bind (a) collagen. However, in the case of

Mrc2, mannose receptor, C type 2
F collagen binding ISS EMBL:AF107292

a human ortholog of the murine mannose receptor was shown to bind collagen. In this instance, it is NOT the mouse protein that was assayed, so it would be inappropriate to use the human binding target. However, we infer that because the paper shows that AF107292 is the human ortholog of the mouse protein, we can assign the collagen binding function. Again, because a suitable child existed for the protein binding term, we can capture the protein binding function with more granularity than would otherwise be possible.

Therefore, in most cases, the use of the "with" field in combination with the IPI code is sufficient to annotate binding of one protein to another. It is therefore not necessary to consider creating protein-specific terms (e.g., interleukin 1-15 binding) to capture the information.

2. Biosynthesis

As maintained before, the notion of Protein Biosynthesis should mean specifically the building up of a polypeptide by translation. Any other fate of the protein, such as post-translational modification, etc. is NOT part of "Protein Biosynthesis". The use of the term Biosynthesis to include other metabolic fates is misleading.
Protein biosynthesis is already itself a child of metabolism. Thus, adding terms such as "biosynthesis of protein X" to mean anything affecting the appearance/level of protein X is not useful. If a gene product affects the translation of protein X, then the gene product's annotation should be to a specific term under protein biosynthesis (initiation, elongation, etc.). If the gene product affects/modifies a post-translational modification, etc., then it should be annotated to those processes.

3. Regulation

Additionally, terms are arising in several notes concerning the regulation, both positive and negative, of particular processes (biosynthesis, phosphorylation, etc.). Terms exist for the negative/positive regulation of phosphorylation/whatever of specific_protein_family_member X, X+1, etc. Is this granularity necessary? Would it be sufficient to have negative/positive regulation of phosphorylation/whatever, period, or of protein_family?

Protein Biosynthesis Example 1

protein biosynthesis [GO:0006412]
amino acid activation +
charged-tRNA modification +
**glycoprotein biosynthesis +
CD4 biosynthesis +
FasL biosynthesis +
protein amino acid glycosylation +
*integrin biosynthesis + and children
**lipoprotein biosynthesis and children +
**mannoprotein biosynthesis and children +
*MHC class I biosynthesis and children +
*MHC class II biosynthesis and children +
*neurotransmitter receptor biosynthesis
non-ribosomal peptide biosynthesis
regulation of protein biosynthesis +
regulation of translation +
TRAIL receptor biosynthesis + and children
translational elongation +
translational initiation +
translational termination +
viral protein biosynthesis

Biosynthesis Example 2

immune response
cytokine metabolism
cytokine biosynthesis
chemokine biosynthesis +
connective tissue growth factor biosynthesis +
granulocyte macrophage colony-stimulating factor biosynthesis +
interferon type I biosynthesis +
interferon-gamma biosynthesis +
interleukin-1 biosynthesis [GO:0042222]
regulation of interleukin-1 biosynthesis +
interleukin-10 biosynthesis +
interleukin-11 biosynthesis +
interleukin-12 biosynthesis +
interleukin-13 biosynthesis +
interleukin-14 biosynthesis +
interleukin-15 biosynthesis +
interleukin-16 biosynthesis +
interleukin-17 biosynthesis +
interleukin-18 biosynthesis +
interleukin-19 biosynthesis +
interleukin-2 biosynthesis +
interleukin-20 biosynthesis +
interleukin-21 biosynthesis +
interleukin-22 biosynthesis +
interleukin-23 biosynthesis +
interleukin-24 biosynthesis +
interleukin-25 biosynthesis +
interleukin-26 biosynthesis +
interleukin-27 biosynthesis +
interleukin-3 biosynthesis +
interleukin-4 biosynthesis +
interleukin-5 biosynthesis +
interleukin-6 biosynthesis +
interleukin-7 biosynthesis +
interleukin-8 biosynthesis +
interleukin-9 biosynthesis +
regulation of cytokine biosynthesis +
TRAIL biosynthesis +

Regulation Example

regulation of tyrosine phosphorylation of STAT protein
positive regulation of tyrosine phosphorylation of STAT protein
positive regulation of tyrosine phosphorylation of Stat1 protein
positive regulation of tyrosine phosphorylation of Stat2 protein
positive regulation of tyrosine phosphorylation of Stat3 protein
positive regulation of tyrosine phosphorylation of Stat4 protein
positive regulation of tyrosine phosphorylation of Stat5 protein
positive regulation of tyrosine phosphorylation of Stat6 protein
positive regulation of tyrosine phosphorylation of Stat7 protein

Appendix 4D. MGI Negation document

NOT Protein Binding

Background:

The GO term "protein binding" (GO:0005515) is used in the function ontology to specify that a gene product binds to another protein. It is used with the IPI evidence code and the "with" field to indicate the specific protein that the annotated gene product binds to.
In the examples below, Arl6ip has been shown to bind to Arl6 (SP:O88848), and Cdc42 has been shown to bind Cdc42ep5 Arl6ip ADP-ribosylation-like factor 6 interacting protein F protein binding IPI SP:O88848 Cdc42, cell division cycle 42 homolog (S. cerevisiae) F protein binding IPI SWP:Q9QZT9 The "not" qualifier has been provided for documentation of experiments that were designed to test a hypothesized function, cellular localization, and proposed participation in a biological process. For example, a protein product has homology to chitinase; however, experiments performed on the isolated protein demonstrated that the protein did NOT have chitinase activity. Chi3l3 chitinase 3-like 3- F NOT chitinase IDA Dilemma: In some experiments, protein binding to a specific protein has been shown to not occur. In the example below, a publication demonstrated that the gene product of Akap9 specifically binds one protein, but not the other. Akap9 A kinase (PRKA) anchor protein (yotiao) 9 C cytoplasm IDA F NOT protein binding IPI SWP:Q62348 F protein binding IPI SWP:Q9QZE7 However, in this instance, the use of the "NOT" may be confusing, as the GO term "protein binding" is (probably) meant to be very broad (it has no definition), and does (may) not imply "binding to a specific protein". For example, immunoprecipitation experiments could demonstrate that a particular gene product is associating with other proteins, but the proteins have not been identified. In this case, the "with" field may have to be left null. The risk, however, is that the "not" could be misinterpreted to mean that this gene product does NOT have the function of binding to a protein. Generally, when an assertion can use the "with" field, the annotation still makes sense if that field is blank. For example, when an ISS evidence code is used, but the accession number is not known, leaving the "with" field blank still means that the annotation was made based on sequence similarity. 
Another example is when the IMP evidence code is used. If the assertion is based on a specific mutant allele, it is possible to add a database identifier to the "with" field, when known. However, if the assertion is based on an RNAi experiment, the "with" field is often left blank. In these cases, the annotation makes sense even if the "with" field is blank. A problem can arise, however, if the "not" qualifier is used with protein binding and IPI. If the "with" field is left blank, the assertion reads that the gene product does not bind protein. Note that this is not a problem when a gene product can be annotated to one of the children, such as "actin-binding" (does NOT bind actin).

Proposal

We would still like to be able to capture this type of experiment, as it can provide information about the properties of the gene product. Therefore, it might be useful to create a term, such as "specific protein binding", as a child of protein binding. All/most of the current children of "protein binding" would then be moved to be children of the new term. The "not" qualifier would never be used if the "with" field is left blank. For example, the entry for Kdr is shown below:

Kdr, kinase insert domain protein receptor
F NOT specific protein binding IPI SP:P97946

The interpretation should be that Kdr does not bind specifically to Figf (c-fos induced growth factor). A second example is

Akap9, A kinase (PRKA) anchor protein
F NOT specific protein binding IPI SWP:Q62348
F specific protein binding IPI SWP:Q9QZE7

This paper demonstrated that Akap9 did NOT bind to Tsn (translin), but DID bind to Tsnax (translin-associated factor X).

Appendix 4E. Documentation progress report (from Cath)

GO DOCUMENTATION: PROGRESS REPORT

OVERARCHING PRINCIPLES

* Make as much as possible
comprehensible to broad audience
* Make it clear what audience each doc is aimed at
* Avoid redundancy wherever possible to make pages easier to update (FAQ is an exception to this principle: aim is to provide info with as few clicks as possible)

STUFF FOR THE GENERAL PUBLIC

An introduction to GO
* Purpose is to provide an overview that is clear and useful to first-time users. Links to more detailed documents make it useful for curators and annotators.
* Includes General documentation up to Data representation section: defines what's covered (and what isn't) in each ontology, plus the basics of DAG structure (will redo diagram so it looks better on the web)
* Replaced 'data representation' with a blurb about what file formats we produce and where you can download them from. Also includes a para on GO slims.
* New section that puts GO in the context of ontologies in general: discusses GOBO and the new list of ontologies on the MGED site; discusses cross products, and discusses mappings to other classification systems.
* 'Contributing to GO' points to the SourceForge site and to the mailing lists.
* Still needs a link to the FAQs.

FAQs
* Have kept html v. simple and not added comprehensive contents list yet because these will be pasted into FAQomatic and this might do some of the formatting for us.
* Where are the main gaps and who can provide material to fill them?
* What other sections do we need and what order should the sections be in?
* Do we need a section on annotations to each of the MODs or should these be dealt with by the MODs' own FAQs? If so, GOA FAQs could be moved to GOA web page and we could just provide links to each MOD's FAQ.
* Should the order of questions in any of the sections be swapped around?
* Need to install the faqomatic (http://faqomatic.sourceforge.net/fom-serve/cache/1.html). Can Chris do this?
* Do we need someone to be in charge of the FAQ or should it be a free for all?
* Can each of the people who provide questions check them: in some cases I've made them more general and added bits.

GO style guide (or should it be the GO content guide?)
* Revamp of GO usage guide. Purpose is to explain not only how the ontologies are created and edited, but also the rationale behind why we do it this way.
* I wrote it as a practical guide for curators but I don't think it's working.
* Starts off at the level of terms and what we can do with them; then moves up a level to relationships between terms; then deals with whole ontologies and the rules specific to each one.
* Should we split the purely philosophical (more in tune with the original purpose of this doc) from the purely practical? If we did this, people who aren't part of the consortium but want to know more about why we do it that way would have something deeper than the intro.
* If we did this, should we merge the purely practical stuff with the format guide? This would avoid the constant cross-referencing between docs.
* Things that have been added include
  - Much more comprehensive contents list
  - link to full list of database cross-references
  - Amelia's list of 'standard' definitions
  - Clearer guidelines on sensu

GO format guide
* Main aim to help anyone who wants to parse the files; but also of use to curators because you always end up tweaking the flat files at some point.
* As with the style guide, it starts off at the level of terms, then moves up to relationships between terms and the structure of entire files.
* Not sure what to do with the stuff on the bibliography.
* Need someone to write up something on structure of mySQL files, or provide a link if this is already there on the godatabase site.
* Added
  - Jane's syntax for comments
  - More on how to use sensu.

Stuff that I haven't done... (mention at this point that I'm now full time outreach)

Publications on/about GO
* Update; reverse order so most recent first.
STUFF FOR CURATORS CVS user guide for curators * new doc; I'm happy to edit this but I need someone to write it. Volunteers? DAG-Edit User Guide (Jane's doc) * Does this need to be more visible from the front page now that other ontologies are using it more and more? * Jane to add some info on creating cross products * Needs formatting in same style as rest of docs; I'm not going to do anything else with it. Dummies' guides * Each group to maintain their own local 'dummies' guides': SGD and EBI now have these. * I'll turn the EBI one into HTML and leave it on my website (or the website of one of the other curators? I'll have to pass over the responsibility of updating this to someone else. STUFF FOR ANNOTATORS GO Annotation Guide * Computational annotation methods need updating. FlyBase (Becky Foulger) SGD (Karen Christie) MGI (Harold Drabkin - has already sent info to Midori) TAIR (Suparna has sent) WormBase ? PomBase (Val Wood) RGD ? DictyBase (Rex Chisholm) PSU (Matt Berriman) Gramene (Pankaj Jaiswal) GKB ? EBI (Daniel Barrell) TIGR (Michelle Gwinn/Linda Hannick) Compugen (liat Mitz/Han Xie) AstraZeneca Courtland Yockey Incyte Lisa Matthews * I'll add the ones I've got but then I'd like to hand over to someone else. * Need to document standard operating procedures for shared annotations (tag annotations that come from other groups). Appendix 5. Collected action items from this meeting Action Items, St. Croix January 2003 1. TAIR. Update MetaCyc2GO mappings. 2. John. Action item 7 from last time [add term deletion feature to DAG-Edit]. 3. Brad. Action items 10 and 11 [adding more information about GO curators to website/database] outstanding. 4. Come up with system for notifying developers of format changes. 5. Add "contributed by" column. 6. Curators. When adding new synonyms, track which type they are. If they are 'broader than' or 'narrower than', consider whether it calls for a new term. 7. Jane. Circulate synonym list again. 8. BDGP. 
Look into rules that could be worked into DAG-Edit to make synonym maintenance easier. 9. Jane. Discuss this with UMLS and fill us in on the results. 10. David and Tanya. When splitting out multicellular v/s unicellular processes, make the split as far below 'physiological process ; GO:0007582' as possible, and as and when needed, rather than splitting right below physiological processes. 11. GO editorial team (and others). Start removing grouping terms slowly and carefully with all the usual communications. If obsoleting a term, ensure the corresponding process or component exists. 12. Jane. Add activity to function term strings. 13. GO editorial team. Define extracellular to include outside a virus particle, then use host terms as parents for the appropriate virus cell component terms. 14. GO editorial team. Go through the enzyme complexes (see also SF entry 535294) and where applicable, make a general parent directly under 'cellular component' with children in specific locations. 15. GO editorial team. Get a list from Aubrey of ill-fitting GO terms and evaluate; adjust terms as needed. 16. Announce on the website that we'll implement this solution at some future date (no date set but will be 6+ months from now). Assemble a group (MA, Chris, David) to work on the implementation. 17. Midori. Put up interest groups on web page. Everybody to send group ideas & which they volunteer for. See if it works or if we need further formalization by putting groups in SourceForge. 18. Construct and post a user survey covering tools, AmiGO, etc.. Send question ideas to Amelia Ireland. It will be sent out to GO-Friends and data collected in time for the grant application. 19. Chris. Suggestion: a daily release of a separate database containing just terms without annotations. The whole database should be updated every month. AmiGO would have the option to view the up-to-date term set with no associations. 20. Chris. 
Make use of parents rather than bucket terms to avoid confusion due to transient IDs. 21. Brad. Investigate piping GO-Slim mapping results to the AmiGO pie chart maker. 22. Brad. Add the ability to dump AmiGO pie chart data as a flat file containing GO ID, term name and the number of gene products. 23. Member databases. Each database should send annotation FAQs from their existing documentation to Cath for inclusion in GO FAQ. GO FAQ will have general annotation FAQs and then specific FAQs from each database and from the EBI. 24. Everyone . Read over the new documentation (especially the style guide) and send any suggestions to Cath. This is available at http://www.ebi.ac.uk/~cath/ . 25. Cath. The changeover to the new documentation will occur on 15 March. 26. Cath. Update the synonym section of format guide to accommodate the decisions made at this meeting. 27. Chris. Provide some documentation on the mySQL database. 1 Number of genes with at least ONE GO term of any kind. 2 Decreased due to movement to obsolete. This also holds for Interpro and EC to GO 3 This figure has decreased due to our ongoing efforts to replace these with literature based annotation.. 4 Cynthia Smith, Cathleen Lutz, Carroll Goldsmith, Teresa Chu, and Alan P. Davis =================================================================== TIGR 20030603 Group reports Most of these reports were provided as written material and more details from the individuals can be found there. We reviewed these quickly and so the comments here are simply those that I happened to catch in passing. FB report (Becky) In addition to existing GenBank/Swiss-Prot sequence curation and a paper-by-paper approach to literature curation, GO-annotations are being done on a gene-by-gene basis to fill in holes. GO data from gene models that were split or merged in the Release 3 genome reannotation have been mostly re-partitioned. 
Chris Mungall has also given Becky a list of genes where the coding sequence has changed; GO data for these gene models has started to be assessed.

SGD report (Karen Christie)

Currently pushing to remove IEAs; all gone but those for about 60 Ty encoded ORFs. Microbial Structure Ontology was described (for Fungal structures): Judy asks, are they interested; Karen says yes they are interested, some groups (Neurospora, Aspergillus) already participating. Mike: small community, not a lot of money to sustain, applying for grants now. Aspergillus already has a database set up (in Manchester I think). Cross-reference to SGD for annotation, Candida, Aspergillus, Neurospora. Chandra, Eurie, and Maria have initiated this and it is now a part of OBO. In addition to Maria Costanzo and Jodi Hirschman, who are attending a GO meeting for the first time, SGD welcomes another curator, Rob Nash.

MGI (Harold)

RIKEN annotation increased total genes annotated by 37% (3300 genes); this was done by inheriting GO annotation done on RIKEN clones. These came in mostly as ISS or TAS evidence codes. They are developing (or collaborating on) three ontologies: GO, anatomy, and phenotype, all with a common structure. This allows the use of common tools such as the DAG-editor and ontology browser. The MGI GO browser now displays comment field (important for MGI annotators and users). Changes to software were implemented so that users don't get links to obsolete GO nodes. These include enhancements to the editorial interface, and automatic removal of obsolete terms being assigned via SP2GO and IP2GO translation tables. Editorial interface enhancements were needed to aid reannotation of genes mapping to obsolete terms, because they go live within 24 hours of any changes. Much of this is to assist with keeping the annotations up to date.
MGI can now track original source of a GO annotation, to help track when a curator has manually changed an annotation that was originally obtained from dataloads. An additional enhancement they have added to the interface is the inclusion of a GO marker notes field to supplement the notes field associated with each individual annotation. The new notes field is meant for notes pertaining to the state of annotation rather than notes about the marker itself. Other software development: A version of the GOTermFinder is being developed at MGI and is available at http://www.spatial.maine.edu/~mdolan/MGI_Term_Finder.html BDGP (Suzanna speaking for Chris Mungall) Chris gave a talk on slots and cross products at Genome Informatics; there is concern, which we share here that this will make things overly complex. That is why we will prototype first and Chris has started work on this. He is also looking at third party tools. Chris proposes that the BDGP software folk meet with MGI (aka David and Joel, etc.) prior to next GO meeting for first implementation of properties (also for other ontologies). Since the words 'slots' and 'properties' are synonymous GO will go with the word properties (and properties have values) DAG-Edit: next version (1.4) will support properties, which means we need a new flat file format (with tag-values). We will still have some backward compatibility with existing flat file format. Not using XML solely because it is not that easily readable by humans GO-Slim - needs to know which slim files go with which annotation files GO Database - monthly loads are now more regular and reliable with a new QC procedure, also daily loads of ontology terms with no QC, is now storing more data in GO database (not yet available in AmiGO). The 'with' column is now "fully normalized". He did note that not everyone is providing a gp2protein file - really need to have these from everyone who is providing an annotation file. This was added as an action item. 
Karen Eilbeck joining Berkeley group to work on SO.

TAIR (Suparna)

Annotation Update: The complete set of numbers is in the handout. The rate of annotations is about 150 genes a month or 2 genes a day per curator. To the GO ontology itself, they have added about 150 new terms since last meeting. They have updated their gp2protein file recently. The main TAIR database is at NCGR in Santa Fe. The GO associations from Carnegie at Stanford are updated weekly. TAIR held a very successful literature curation meeting at TAIR in March 2003.

Updating MetaCyc2GO file: The mapping to MetaCyc has problems (going to function instead of process). Approximately 80 new pathways have been added (50 have existing GO terms, need about 30 more terms to complete mapping). TAIR (as personified by Suparna) are now updating the mappings from MetaCyc pathways to GO functions and, once finished with this task, will pass the mappings on to GO central (Amelia) to check for errors.

They have updated their web site: users can now search for genes with GO terms as keywords, and an Evidence description has been added to give more info about the experiment. They are also developing a new ontology browser. Along with this, they also are developing a GO awareness campaign for the Arabidopsis community. Lukas Mueller will be going to Cornell and running Solanaceae database.

TIGR (Linda)
* Arabidopsis: At TIGR this project is going through and renaming gene products correctly. Funding will end in fall and TIGR will then turn over all A. thaliana data to TAIR.
* T. brucei: This annotation effort is still active and progressing.
(Michelle)
* Bacillus anthracis and Coxiella burnetii: just released annotation files of the gene products to GO terms.
* Other prokaryotic genomes: These gene products have been annotated with GO terms and are awaiting publication to be released.

TIGR uses 'Manatee' to assist curators in GO annotation.
This adds new GO search capabilities that assist the curators in fully annotating prokaryotic and eukaryotic genomes. Manatee is available on SourceForge, but it depends on TIGR database schema. BDGP and TIGR may add an Apollo connection to Manatee. Rex/Dictybase are interested in this as well. Don't have gp2protein files for all prokaryotes, in some cases because the data was not available when the annotation file was released.

Wormbase (?)

They provide biweekly updates for the public, including GO annotations. They also raised one question regarding the cardinality of evidence codes to annotation. There followed a discussion about whether multiple evidence codes belong on one row or in individual rows (one row per evidence code).

Resolved: The decision was to do the latter and make the cardinality one to one.

Final question was whether conference abstracts are legitimate references.

Resolved: Yes! Conference abstracts can serve as references.

Dictybase (Rex)

They are working for a late June official release of DictyBase (based on SGD's code and schema; special thanks to Mike and all). This release will include 1800 loci with 8949 GO annotations (all except 40 to IEA). They now have two full-time curators (Petra and Pascale), and one new programmer (who will start in July). This developer can help John out since he is experienced with Java and is partially funded by GO. Suzi to ask John to contact Rex (done).

Gramene (Pankaj)

They will be making a new release in late June, with 4500 new non-IEA gene associations. Most of their recent focus has been on curating mutants and phenotypes. They are working with other databases on mitochondria and chloroplasts. Likewise, they are working with rice database on nomenclature issues. They are also now working with Maize people to try to get gene association file for maize incorporated.

GO-editorial (Mostly Jane, with a soupcon of comments from Midori)

Amelia has created a nice digest that is available monthly.
This digest summarizes: new terms, obsoletes, new definitions, basic data on changes, and links to appropriate SourceForge entries. It is kept on the ftp site. Please send suggestions for improvements to Amelia. There will soon be a cron job that mails announcement of each new digest to go-friends (AI). Component terms have increased quite a lot (with effort by BRENDA group to create complex terms for enzyme complexes). In addition, we now have definitions for 78% of terms (yeah!!). They have brought the GO synonyms file up-to-date. Molecular function terms now have the word 'activity' as part of term name. They have generated a list of obsolete terms with suggestions for remapping for review. There are new web page drafts for review (presented later in meeting).
Interest groups - not linked to anything

GOA (Evelyn)
We can look on the GOA web site for latest statistics and news (http://www.ebi.ac.uk/GOA). They have produced three releases since February. They have now updated their associations file to include the source of the annotation, so credit/blame can be made appropriately. They have also now integrated the manual annotations from other sites (fly, MGI, and SGD). The HAMAP group at SIB, Geneva is working on a HAMAP2GO mapping and may be involved in manual GO annotation of Swiss-Prot microbial proteins. In total, the GOA project has released more than 3 million annotations to 600,000 proteins. They have also written two papers about GOA (and GO) and two more are planned for this year. Big news is that LocusLink is now using the GOA annotations. SwissProt will henceforth be responsible for updating the former Proteome annotations to GOA annotations.
Evelyn is just back from ontology workshop in Japan. Edgar W from Transfac was one of the chairs. Evelyn spoke about GO and GOA and got a favorable response. She found that many Japanese were aware of GO but were generating other ontologies (cell types and anatomy) that are up and coming on the OBO site.
Some groups had developed ontologies similar to GO, because they didn't seem to realize GO existed or didn't realize they could (and should) request new terms. Because of this, she raised the question of how we will decide which ontology will go into OBO (this is a good question). Who decides which cell-type ontology will be the standard? Answer: Michael will probably just decide by fiat.
SwissProt is going to be employing two full-time GO annotators. Interviewing begins in July. Daniel is thinking about doing a release of the GOA database (which is in Postgres) for the general public (from Suzi, he should get in touch with Chris in case we end up switching to Postgres here as well for GO).

Incyte
They are doing manual annotation (a la Proteome) using weekly updates from LocusLink and GenBank. The statistics for their annotations are in the handout. They have also restarted monthly term suggestions for GO terms. They are doing new product development - BioKnowledge Retriever. This will include two new ontologies (mammalian disease, mammalian expression) and they are interested in making these public and in working with other groups to develop these and make them public. This is something to consider for OBO.

Maize
MaizeDB will cease to exist in 3 months, now called maizeGDB. He is here to learn because they are just getting started with GO. The URL for the new maize database is http://www.maizegdb.org/

RGD
They have generated about 3000 annotations (distributed equally at ~1K per GO ontology). They are working on using the GO terms (building their own GO browser) for gene search strategies.
Their browser will be utilizing GO terms as part of search strategies to identify genes, including genes annotated to terms' descendants. They have a disease specific orientation and want to utilize other ontologies to organize this type of data in RGD.

Pathogen at Sanger
Next time

Annotation Issues

IEA
TIGR is using an HMM scoring function for assignments and since this is more sophisticated than keyword matches they would like a means to add quality information to IEA. Someone pointed out that this is also true for multiple alignments. David says the appropriate thing to do is to use different references for different types of analysis. Suzi says that this argument can also be extended to all evidence types, as discussed before. David suggests extending filtering in AmiGO to also qualify the query to IEAs with certain references. (Brad, another issue is the bulk of IEAs. Too slow for web interface when IEAs are loaded.) TAIR solution is evidence description, but this is internal. MA: if a db wants an internal one then they can. David: we have reference. Midori: GO reference refers to GO pub.
Added 3 action items below: BDGP needs to implement filter, group needs to establish a collection of references to methods, BDGP also needs to explore ways to deal with size explosion of associations other than omitting IEAs from AmiGO.

Suspect annotations
Rex: if inaccurate annotations are discovered at one site that came from another site they can't change/fix the annotation because it didn't arise from their own site. I second this because maintaining high quality in the associations is one of the main utilities of GO, people use it as the default golden reference set. Rex noticed an actin with motor activity, easy to notice. How are we to do this?
Judy: what can the group as a whole do to help?
Midori: the owner has to make the correction. How can notification that a correction is required occur?
MA: every MOD has a mechanism in place to receive and make corrections.
Question is, do we begin to build association quality assurance tools to detect these? Gp2protein could be used together with BLAST, using best-hit match and flagging discrepancies in associations.
Suzi: GOST tool can be used for new annotation.
Karen E: how many levels up the tree is acceptable? Any number.
David: incompleteness of annotation is also an issue.
MA: it would help even more if this tool were available and used during the process of annotation.
Another means of improving quality is by adding the ability to file error reports directly from AmiGO pages. Three action items added below.
Late addendum from Evelyn: Concurrent Assignments tool from EBI, Manatee has something similar; AI for AmiGO to be able to do this type of thing, (Amazon-like: others who annotated to this also annotated...).

More 'rules' for annotation
Midori: The current rules are broad and do not contain specific guidelines for handling of every situation. Just make suggestions, best practices. How do you identify common proteins?
Evelyn: AmiGO needs concurrent assignments.
Midori: oral tradition is now written down. The rule is that we are annotating to potential.
Long discussion of potential.
Amelia: slide show. Solution is to use the word intrinsic to distinguish regulator activity versus extrinsic regulator activity.
Harold: function is not necessarily an attribute of gene product, it can also be applied to complexes.
Jane: Is transient activity okay? Yes.
MA: complex should have a defined stoichiometry.
Karen: Is there an issue with counting # of subunits? No.
Midori: the point is not to have a component term for every ImmunoPrecipitation-able agglomeration; "defined stoichiometry" doesn't imply identical subunit composition between species.
Resolved: use the word intrinsic to distinguish regulator activity (regulatory function that occurs when the gp is part of a complex) versus extrinsic regulator activity and to change the relationship type to is-a.
- CDK-cyclin example - start including the word 'intrinsic' for the regulator activity to clearly indicate that it is part of a complex, without which the kinase activity of CDK kinase subunit is not active either.

Jane's item
Resolved: Binding stands alone (not binding activity)

Treemap demo (Eric Baehrecke)
He is interested in steroid activated programmed cell death signaling, both fly and human apoptosis. Ben Shneiderman is software person who is interested in information visualization (hyperlinks, Spotfire, Treemap) and has developed a strategy for analysis of genome data using GO and Treemap displays. The components of the tool include: a GO parser, parser for genome data, a view in Treemap. The visual variables that may be controlled are color hue, color intensity, and area of the rectangle representing the data. Eric B will look into what they need to do in order to enable us to link to Treemap from the GO Tools list. See http://www.cs.umd.edu/hcil/treemap/ for more information.

Properties implementation
Group seemed to feel that the most important priority is completion of software for direct saves to the database. Believe that this will assist implementation of properties. Did allow that John's proposal for new flat file format looks good and useful and since most of this work is already done, it will be good to have around. We also agreed that the existing flat file format would never go away, although property information may be lost in a direct conversion.

Brad's report
Brad described STAG, which is an SQL templating system. It returns SQL query results as XML dumps. A generic piece of Perl software uses the template to generate the query. Machete is a software package that sits on top of STAG. It is a lightweight Perl application that maps CGI parameters to the proper SQL, HTML, and XML templates. It uses a library of templates to replace the current Perl API.
This will result in all SQL queries, HTML pages and XML transformations being maintained as a library of templates. This will allow future generations of AmiGO to be flexible, expandable, customizable and portable. As the GO schema becomes more integrated with Chado it will allow more types of queries across a wider collection of data in the future. There was quite a bit of interest. Both Rex and Judy are interested in having Chris and John talk to their counterparts on the technical staff at DictyBase and JAX.

Proteasome or part of relationships
Different forms of part of:
David: if it is always there then it is a child. Distinguish between those that never change and those that vary, where only the child will have that part.
Midori: We all agree that there must be multiple is-a children for complexes of different composition, which is clear. The question we need to address is what to do when the composition is the same.
David: If we don't know it is safer to create two subtypes.
Judy: is-always-found-there and is-it-the-same are two separate questions.
Conclusion for the first question was to change the documentation to not require part-of to mean always. Different subunit composition implies different terms. If the composition is identical, then this is a single term and multiple parents are allowed (nay encouraged). In other words, complexes that have the same composition in all locations may receive multiple parentage; however, complexes that have varying compositions need separate terms with the specific localizations. In the future, we may need to add more sub-types for things like myristoylated or phosphorylated forms of the compound.

Physiological processes
David: This area needs revision and continued discussion. We will create an interest group to handle the reorganization/structuring of the physiological process node of the biological process ontology.
Those interested should contact Tanya (tberardi@acoma.stanford.edu) or David (dph@jax.informatics.org). The group will meet and discuss via email and present a report at the next meeting. Proposed top nodes right under 'physiological process' are 'organismal physiological process' and 'cellular physiological process'.

Behaviour
Peter Midford in Arizona is already working on behaviour ontologies for loggerhead turtles and jumping spiders, and we feel this level of detail is beyond the scope of GO. However, there still needs to be some descriptive capability for behaviour within GO, both for Drosophila and maybe for mouse, to be able to annotate certain genes. The essential question relates to what should be included in Process. It is clear in Drosophila that one can pin certain genes to behaviours like walking or circadian rhythms because these are hard-wired. Conversely, there is need for an auxiliary ontology developed specifically to deal with behaviours in mouse since much knowledge in this area is not tied directly to specific gene activity.
Conclusion - we do want behaviour in GO, but there may be other ontologies, for groups like mouse, that will extend these. In these cases we'll recommend that these auxiliary ontologies be consistent with GO and include any necessary cross-references to GO terms. To support this the GO terms should be at a level that can be used for many organisms for behaviours that have a genetically defined component.

Localization of viruses
We had previously discussed (at earlier meetings) and considered expanding the definition of extracellular to include extraviral in order to be able to include viral host cells. There was an objection from virologists that it doesn't make sense to consider viral host cells as extracellular. Therefore, we have now decided to reverse the decision from the previous meeting and will remove viral reference from the definition of extracellular. Added action item.

Purity vs. pragmatism aka obsoletism
Question: When do terms become obsolete? Two issues, when redefining the term and when removing gene product names. Word-smithing changes to the definitions that do not impact the meaning, and only clarify the original meaning, do not require that the term be made obsolete (the criterion is that no annotations will ever be affected by the change to the definition). However, if the fundamental concept changes then the term needs a new ID and the older one must be made obsolete. Michelle found some terms that when going from primary ID to secondary ID (arise from merges only) were not strictly synonymous. This is a problem.
David: don't just remove gene products, they need to be replaced.
There followed a very lengthy discussion regarding the issue of function grouping. Much of group wants to use synonyms (broader) to deal with these. For now put portmanteau terms into synonyms. Drive function to purely subsumption hierarchy. (Function grouping ontology).

Synonyms
Distinguishing between exact synonyms and inexact synonyms. Work is done, just need incremental improvements. John is on it so that DAG-Edit makes this easier.

Structure terms
We are keeping structure terms until we have properties. Values can come from anatomy or cellular component, or cell type. Wording is wrong, but we can live with it.

Disappearing GO ids
Michelle had this problem with DAG-Edit.
Midori: special case of terms from Michael and shouldn't happen again.
TAIR: Sporadically disappearing definitions.
Michael: if term is not in ontology then definition is not saved. Need DAG-Edit to warn if there are definitions without terms. Another reason for going to database.
Amelia: most problems apparently due to CVS rather than DAG-Edit.

Behaviour
Resolved: These are to remain as they are.

Viral component terms
Two action items were added to change extracellular definition and move terms.

Scope of Metabolism
What does 'part of' mean within the context of metabolism?
The present definition is very broad and the question is should it include its own regulation. Currently it does. In general, this is an issue. Transport is not included as a 'part of' metabolism. Are regulation and transport equivalent (or analogous) concepts? On the other hand, is it more correctly called intermediate metabolism? Definition needs to be examined. Looking at a more sophisticated way to model, but in the meantime, regulation is an inherent part of process although strictly speaking the relationship is not the same PART OF as it is for the steps in a process. Because of this, we may need another relationship type for regulation. Midori will send some examples to Chris and BDGP for consideration. For now, transport will not be included in metabolism, but regulation will be.

Synonyms
Should gene products be included in synonyms? Yes, because people are going to be using these to look for them. Does this mean that gene products are permissible in term names then too? Yes, this is okay when the gene product is not the complete term, but indicates the substrate within the complete term. P53 is the common usage, but never is the name of a gene. Since the meaning is in the definition then the wording doesn't matter and it is okay to use the gene product as the string. However it is preferable to qualify this, that is, use something like 'p53-class' instead of just p53 in term names. However, if the gene product is used then it should be applicable across species and not restricted to a particular narrow group. Cyclin is another case, but it is more broadly used. We could possibly skirt the issue by using the string 'class' as a qualifier to gene product.

Transporters (aka ATP synthase terms)
Question was whether to create two separate terms for bi-directional reactions and then annotate to both terms.
Resolved: Policy is that we will create a single term (that describes both directions of a bidirectional reaction) unless there is reason to believe that there is a biological justification to separate the two directions of the reaction into separate functions.

Function Grouping Terms or Conglomerate functions
The examples used in this discussion were 'T cell receptor' and 'myosin'. There was a lot of discussion about whether or not it is appropriate to create function terms that describe the sum of the parts. That is, a term to represent the single unary function that is created through the contributions of all the different individual functions that make up a complex, e.g. the function that something like a T cell receptor or myosin may have. One of the advantages of representing the various activities of 'T cell receptor' with multiple parentage, as elucidated by David, is that it is a way to help annotators, who otherwise need to know that 'DNA helicase activity' includes ATPase activity, etc. Rex argued the other side, that this approach could lead to an unmanageable proliferation of terms to represent this sort of information. Karen brought up the cautionary example of 'GTPase activator activity', which currently has two parentage lines: one from 'enzyme regulator activity', which is fine, and a second line of descent from 'signal transducer activity', which is a problem because it makes 'receptor signaling protein activity' an ancestor of 'GTPase activator activity'. This is clearly wrong (there is a SourceForge entry already entered for this). 'GTPase activator activity' is an old term, so this may have come about because at the time the known GTPase(s) was/were all involved in receptor signaling. The eventual thought seemed to settle on the idea that to create this sort of 'grouping term' in the function ontology opens up the potential for true path violations of the type illustrated by the 'GTPase activator activity' example.
It was suggested to have some sort of Function Of Gene product (FOG) Ontology to make the correlations between individual functions and a specific gene product or class of gene products. The Function ontology itself will become more like a hierarchy than a DAG. The relationships in the FOG ontology will not be 'ISA' but will be a flavor of 'PART OF' to indicate their contribution to the conglomerate function.

Web page
1. for credits have people use the sourceforge style link to logo, so we can count some of the usage statistics.
2. home page is: about, what's new, downloads, credits.
3. Link to AmiGO and a search box all in the left panel.
4. Jennifer suggested that we use the Sanger style links: site links across the top and page links down the left (plus the standard search tools) and no one objected or offered a counter-proposal. She will implement the web site as demonstrated within the next few weeks. AI is to prototype the Sanger style page. (Comment added later by Jennifer: This action item was for me, as stated in the bottom of the final list of action items in the minutes.)

Next meeting
September 13-19: Working group on first implementation of properties in Bar Harbor (Chris, John, David...).
September 24-25: phenotype meeting will immediately precede GO meeting in Bar Harbor.
September 26-27: next GO meeting in Bar Harbor.
January 16-17 at Stanford.
Decision is still to be made regarding user's meeting in September.

Action Items
1. ALL: update gp2protein on central CVS site.
2. Suparna & Amelia: update metacyc mappings (and check that no functions are mapped to)
3. Amelia: change monthly report file names so they'll sort by date. DONE!
4. Amelia: cron job that mails announcement of each new monthly digest to go-friends
5. BDGP, JAX: first prototype to be implemented for properties prior to JAX meeting
6. BDGP (SwissProt?): need to provide a tool for tentative assignment of GO terms.
7. one row, one term, one reference, one evidence code. DONE!
8. (IEA) Midori: to assemble method references for IEAs
9. (IEA) BDGP to explore means of including larger number of associations in DB and AmiGO.
10. (IEA) BDGP to add filtering that is a combination of evidence code and reference.
11. (suspect annotations) Midori et al.: Add some things to documentation to describe procedure for error reporting, whether in terms or in associations.
12. (suspect annotations) GO-central to add links on main web site to report errors in annotation.
13. (suspect annotations) Brad to add button to AmiGO to mail error reports.
14. SUZI: write a tool to look at and report on consistency of annotation.
15. ALL: review annotation documentation and send in comments to GO-central (Midori to oversee).
16. BRAD: to add term based page. This would show all gene products and the other terms that had been used on each of those gene products. An "other customers who used this term, also used these terms".
17. JOHN: Need DAG-Edit to warn if there are definitions without terms when saving so that the definitions are not lost.
18. GO central: for all part-of children in the function ontology, change the relationship to is-a and change wording to 'intrinsic regulator' or 'intrinsic catalyst'.
19. Jane: remove 'activity' from 'binding' terms; DONE!
20. Midori & Jane to dredge up what problems were at end of database save testing; send to John. DONE!
21. JOHN: Need DAG-Edit and central repository to work more seamlessly...DB or transparent CVS must be implemented.
22. GO-central improve documentation on synonyms
23. David organizing physiological process interest group
24. Physiological interest group is to report on progress next time
25. GO-central delete references to viruses in the definition of extracellular.
26. GO-central move viral component terms back into intracellular.
27. Midori to send examples of regulation to BDGP and Chris et al. to examine how to correctly indicate and model regulation.
28. Eurie: can now proceed to use gene products in terms with the addition of the suffix class and other situations will be handled in the same way.
29. GO-central: Update the documentation to reflect the decision on transporters
30. Amelia: Check on the terms in question and make sure they are consistent with the decision regarding transporters (and other bi-directional functions).
31. Michelle: Originally this AI was to send examples of messed up merges to GO-central for resolution. This was done. There are a few "sensu Eukarya" terms with secondary ids that did not have "sensu Eukarya" in them (Amelia generated a list of about 10). However, it turns out that it is ok that they are that way because, due to the placement of the old terms in the graph (as children of mitochondrial things for example), it is logically implied that they are Eukaryotic and therefore it is fine to make them secondary ids of Eukaryotic specific new terms. The problem for TIGR arose when those terms with mitochondrial parents were used to annotate some bacterial proteins (even though we knew about the path violations for bacteria) because at that point bacterial counterparts did not exist for those terms and they still wanted to capture the information. Therefore, the new Action Item is for TIGR to fix these annotations now that the bacterial counterpart terms are in GO. Thanks to Midori and Amelia for clarification of this.
32. transcription factor is wrong (mis-defined and mis-annotated). Interest group is going to fix this and report the solution.
33. All interest groups to provide short (one page more or less) reports for next meeting.
34. Jennifer: to provide a mock-up of the GO home page using Sanger style links.

Minutes by Suzanna Lewis and Karen Christie. Thanks to everyone who found the time to review, comment and fill in the holes.
===================================================================
GO Consortium Meeting - Bar Harbor, ME - September 26-27, 2003

[Next Meeting: Stanford- SGD organizing: - GO Users Open Mtg; Jan. 15th. GO Consortium Mtg. Jan. 16-17.]

Opening Comments:
Meeting organization: We are a very cohesive group that works well together and we want this to continue. Therefore as we grow in size and in objectives we must continuously address the effectiveness of our organization in order to maintain 1) effective communication, 2) the quality of what the project produces and 3) informalities of the group, so that all feel welcome to contribute and comment. At this point the group has grown to the extent that we must adjust and strengthen the structure and organization of the GO Consortium. We recognize that there are four major sub-groups here: 1) Ontology Development, including Interest Groups; 2) Annotation; 3) Database and Software Development and 4) Production and Distribution. In this context, we need to discuss how to go about revising the structure of the GO Consortium meetings. For example, the 'whole' group meets less frequently and sub-groups meet more frequently. This topic was a thread through the meeting and there was further discussion at the end of the meeting.

1) Group Participant List
EBI-Ontology group (Midori Harris, Jane Lomax, Jen Clark, Amelia Ireland)
Berkeley DB group (Suzi Lewis, Chris Mungall)
FlyBase (Michael Ashburner, Rebecca Foulger)
SGD (Mike Cherry, Rama Balakrishnan, Maria Costanzo, Rob Nash)
MGI (Judy Blake, David Hill, Harold Drabkin, Martin Ringwald, Mary Dolan, Li Ni, Joel Richardson, Janan Eppig, Alex Diehl)
TAIR (Tanya Berardini, Suparna Mundodi)
SWISS-PROT (Evelyn Camon, Daniel Barrell)
Sanger Parasite Group (Matt Berriman)
S. pombe/Sanger (Val Wood)
WormBase (Eimear Kenney, Kimberly Van Auken)
DictyBase (Rex Chisholm, Pascale Gaudet, Warren Kibbe, Cathy Li)
GKB (Lisa Matthews)
RGD (Susan Bromberg, Norie De la Cruz, Victoria Petri, Mary Shimoyama, Lan Zhao)
TIGR (Michelle Gwinn)
Incyte (Allan Davis)
ZFIN (Doug Howe, Sridhar Ramachandran)
Gramene (Pankaj Jaiswal)

2) Updates on Action Items from St. Croix Meeting
The full listing of Action Items from St. Croix Meeting is at end of report. Most Action Items are completed. Not_Done or In_Progress or Special_Notes items listed here.
1. Update all gp2protein files in CVS. Need to send reminders to some groups.
6. BDGP (SwissProt): Request for tool for tentative assignment of GO terms. Not Done.
8. Assemble 'methods' references for IEA. In progress - work done by Midori and Michelle. GO is going to maintain a set of generic references describing IEA techniques, for use by databases that do not have reference collections of their own to call on. These will then allow users of the data to distinguish between the different ways that GO terms have been assigned that fall under the IEA umbrella. [Action Item 34, BHmtg]
9. IEA - BDGP to explore means of including larger number of associations in DB and AmiGO. In progress [see Action Item 28 BHmtg for a related topic, that is, removing defunct associations].
10. IEA - BDGP to add filtering that is combination of evidence code and reference. Not Done (needs number 8 to be completed first.)
32. Transcription factor issue...Interest group is going to fix and report. Not Done.

3) Reports from EBI-GO

3) Reports from Ontology Development Interest Groups
Beyond the reports from the Interest Groups, there was considerable discussion about how to involve more experts in certain biological areas in the development of the ontologies. Lisa reported considerable success for GKB by going to specialty meetings and approaching individuals to discuss GKB and elicit their help.
Also, follow-up site visits to researchers' institutions might help. GKB uses a powerpoint template to guide contributors. It was decided that GO should also take this proactive approach [Action Item 3]. While the use of ppt is not applicable to GO it is clear that a comparable user guide and standards are needed for newbie ontology contributors. [Action Item 47]

a) Physiology
Tanya and David provided a file with revisions for physiology section of Process. This will be implemented. Complete revision with terms and definitions available from SGD, MGI, others.

b) Plants Interest Group
We have revamped all the extracellular component terms and are now rearranging and expanding the children of the sexual reproduction terms.
sensu Magnoliophyta
- A problem was discovered with the 'sensu Magnoliophyta' terms. Many of these terms seem misleading because they actually refer to phenomena that also occur more broadly outside Magnoliophyta. However it was pointed out that 'sensu Magnoliophyta' just means 'in the sense of Magnoliophyta' and so does not exclude annotation of non-flowering plant gene products to such a term.
- One alternative would be to replace the word 'Magnoliophyta' with a sensu word that could apply equally to all groups (that might be annotated with such a term). This would be quite time consuming because we would have to check each annotation case using the term, whether it applied to all plants and whether all green algae were included etc.
- At the moment there are no non-flowering plant species being annotated and so there is not an urgent need for terms to be created for the annotation of non-flowering plants.
- With these points in mind it was decided that we should concentrate on making the flowering plant terms exhaustive and stick to 'sensu Magnoliophyta'. We will create terms for non-flowering plants when non-flowering plants are being annotated.

4) Reports from Annotation Groups
The following groups submitted progress reports of their activities since the last GO Consortium meeting.
a) FlyBase - ok b) TAIR - ok c) MGI - ok d) SGD - ok e) GOA - ok f) WormBase - ok g) TIGR - ok h) Sanger Pathogen - ok i) Incyte - ok j) RGD - ok k) ZFIN - ok l) DictyBase - ok

5) Ontology Development Issues

a) Logical consistency checks
In the documentation there is an example of a logical relationship: If A is a part of B and C is an instance of B, then must A be a part of C? Then there is an example with "cytoplasm". Jane notes this logic isn't always true in the ontologies and asks whether we can fix this. This led to a discussion of "part of" and how we use it in GO.
Chris (and John) said there are 4 types of part of (letting A represent the 'larger' component and B represent the sub-component):
1. B is sometimes part of A
2. B is necessarily (always) a part of A (this is the one we almost always use)
3. A necessarily has part B
4. A necessarily has part B -and- B is necessarily a part of A (both directions of relationship)
Chris: Technically what many ontologies do is to use the weakest relationship (#1) as the default because it assumes the least. These relationships can then be adjusted to become more restrictive (and precise) as more is known. In practice, we (GO) already are using the part-of relationship in the stricter sense of #2, most of the time. (As an aside, Chris met and discussed this with Stuart Aitken in Edinburgh. He is also thinking about this and doing a lot of work in this area.)
Chris also described the distinction between 'part' and 'proper part'. A proper-part is a direct part and therefore is not transitive. E.g.: "a nail is a proper part of a finger and a finger is a proper part of a hand but a nail is not a proper part of a hand".
There were several decisions made. First, we agreed to update documentation as it regards the use of 'part-of'.
Second, we agreed to henceforth only use 'part-of' in the sense of type #2. Third, we agreed to track down all cases that do not use 'part-of' in the sense of type #2 and restructure the ontology as needed. [Action Item 5, 16]. Fourth, we will consider adding all the different logically distinct 'part-of' relationships because these may prove to be needed in many cases in the future.
b) 'Signal Transducer Activity' term disagreements
The question is whether the current "signal transducer activity" term is appropriate for GO.
Harold/David think it is. They proposed a new definition: "the activity of converting one type of signal into another type of signal" (signals can be light, chemical, etc.). They say the process of signal transduction is more than one step but the function of "signal transducer activity" is the first step.
Amelia: There was an issue with "receptor binding" and "signal transducer activity" - not all signal transducers are receptors. If a receptor is under signal transducer activity it should be involved in signal transduction. If a change is made to the definition of "signal transducer activity" then the term should be obsoleted, even though there are lots of annotations to it, especially since Amelia feels the term has been used incorrectly. Report is attached at the end of Meeting Notes.
Midori: there is the question of whether there is a molecular activity of "signal transducer activity".
Amelia: What about steroid receptors that move steroids in/out of cells?
Many: should we change the wording, add a comment?
RESOLUTION TO "signal transducer activity" question: [Action Item 4]
- need to obsolete the current term
- make a new term with the same name but a new definition
- create the new definition to everyone's satisfaction (to be ironed out later)
- add a warning to the comment on the appropriate use of this term
- clean up the children terms - some need to be moved to other areas of the ontology.
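The four senses of 'part of' listed under 5a can be sketched as a small classification, which may help when tracking down links that do not meet the agreed type #2 convention. This is only an illustrative Python sketch, not part of any GO tool: the names `PartOfSense`, `classify` and `needs_review` are hypothetical, and it assumes that type #4 (both directions) also satisfies type #2, since it includes "B is necessarily a part of A".

```python
from enum import Enum


class PartOfSense(Enum):
    """The four senses of 'part of' from the discussion (A = whole, B = part)."""
    SOMETIMES = 1             # 1. B is sometimes part of A
    NECESSARILY_PART = 2      # 2. B is necessarily (always) a part of A
    NECESSARILY_HAS_PART = 3  # 3. A necessarily has part B
    BOTH = 4                  # 4. both directions hold


def classify(b_always_in_a: bool, a_always_has_b: bool) -> PartOfSense:
    """Map the two 'necessarily' conditions onto one of the four senses."""
    if b_always_in_a and a_always_has_b:
        return PartOfSense.BOTH
    if b_always_in_a:
        return PartOfSense.NECESSARILY_PART
    if a_always_has_b:
        return PartOfSense.NECESSARILY_HAS_PART
    return PartOfSense.SOMETIMES


def needs_review(sense: PartOfSense) -> bool:
    """Flag links that do not guarantee 'B is necessarily part of A'.

    Assumption: sense 4 implies sense 2, so it is not flagged.
    """
    return sense in (PartOfSense.SOMETIMES, PartOfSense.NECESSARILY_HAS_PART)
```

A curator-supplied pair of yes/no answers for a given link would be passed to `classify`, and any link for which `needs_review` returns true would be a candidate for restructuring.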
c) Presence/Absence of function grouping terms
Midori: A couple of meetings ago it was decided to remove from Function those terms that grouped things based on something other than activity - like Processes or Components. But having the grouping terms is useful for annotation, so people are in no hurry to remove them.
ex. "defense immunity protein activity"
This term is a grouping term found in the function ontology that is solely based on Process and has many children terms where this is also true. We don't want function terms that represent a process because 1) it is a process, not a function and 2) any is-a relationship of a child term to this parent is illogical. While tempting, we don't want terms grouped in Function by nature of being in the same Process.
Judy: Maybe we're trying too hard to put a function on everything and are wanting these function terms when really we should just have a process and no function.
Suzi: Some problems come back to the fact that there is a relationship between function and process which we don't reflect.
Midori: Agreements at meetings don't always carry over into agreement in email after the meetings.
Judy: If there is angst, then we need more discussion and to resolve things at meetings. General agreement at a meeting can break down in fuzzy specific instances in emails afterwards.
Judy/Midori: practicality versus purity of function ontology
Rex: Perhaps people don't realize there are analogous Process terms to use. Maybe state more clearly in the emails.
[Action Item 6. RESOLUTION TO process grouping terms in Function. We will not use Process to group Function terms unless all of the terms being grouped share the same type of function. GO curators will continue to bring these to the attention of the group via email; if agreement is reached quickly - great. If not, it will be resolved at a meeting. Also, in the emails be sure to point out the Process term alternatives to the Function term.
Things in Function should have things grouped by function.]
d) Consistency of Parentage (catalysis and binding)
Amelia: catalysis and binding - sometimes an enzyme activity has parents of both the catalysis term and binding term. Mostly there is only the catalysis parent. Which way should it be?
Consensus: enzyme activities should have only the catalysis parent.
[Action Item 17. Remove all binding parents to enzyme activities where appropriate. Document the fact that binding is not always a parent of enzyme. Binding only when stable binding occurs.]
e) Difference between activation of/positive regulation of/induction of/etc.
Evelyn: positive regulation does not equal activation.
Consensus: some redundancy, can't make synonyms in all cases - need some new definitions and comments.
[Action Item 18: curation team will go through and find these and try to resolve them, redefine them as needed and put notes in comments.]
f) Synonyms in ontology files (this was actually discussed after the old action items but seems to belong with this section).
Michael: following the experiment of integrating GO into UMLS it was clear that "synonym" was being used in many ways. Jane has made a synonym file with all of the relationships in the file. Format: GOid/GO term/synonym type id/synonym. This info should be in the db and in the GO ontology files, not just the synonym file. Discussion on whether to stop using inexact synonyms in favor of entry words - answer was no.
5 types of synonyms (one parent, 4 children):
related (~)
 %exact (=)
 %broader than (<)
 %narrower than (>)
 %other related (!=)
For broader than and narrower than synonyms one must always ask if the synonym should be a GO term.
[ACTION ITEM 8: consensus and resolution: Put the 5 types into the database. Put the 5 types into the flat files. John will need to make DAG-Edit work with this. Jane needs to write documentation. Chris will add them to the db. A warning of the new file format will go out prior to implementation.]
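To make the synonym typing above concrete, the following is a minimal Python sketch of how the five types and one synonym-file record might be represented. It is an illustration under stated assumptions, not the format Jane's file actually uses: the minutes give only the field order (GOid, GO term, synonym type id, synonym), so the tab delimiter, the use of the symbol as the type id, and the names `SynonymType`, `SynonymRecord`, `parse_synonym_line` and `candidate_go_term` are all hypothetical.

```python
from enum import Enum
from typing import NamedTuple


class SynonymType(Enum):
    """The five synonym types; RELATED is the parent of the other four."""
    RELATED = "~"
    EXACT = "="
    BROADER = "<"
    NARROWER = ">"
    OTHER_RELATED = "!="


# One parent, four children: every other type is a kind of RELATED.
PARENT = {t: SynonymType.RELATED
          for t in SynonymType if t is not SynonymType.RELATED}


class SynonymRecord(NamedTuple):
    go_id: str
    term: str
    syn_type: SynonymType
    synonym: str


def parse_synonym_line(line: str) -> SynonymRecord:
    """Parse one record; assumes four tab-separated fields (an assumption)."""
    go_id, term, type_symbol, synonym = line.rstrip("\n").split("\t")
    return SynonymRecord(go_id, term, SynonymType(type_symbol), synonym)


def candidate_go_term(record: SynonymRecord) -> bool:
    """Broader/narrower synonyms prompt the question 'should this be a term?'"""
    return record.syn_type in (SynonymType.BROADER, SynonymType.NARROWER)
```

For example, a placeholder record such as `"GO:0000000\texample term\t<\texample synonym"` would parse to a `BROADER` synonym and be flagged by `candidate_go_term` for a curator to consider.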
6) Annotation Issues
a) Need for Annotation Consistency
We discussed the need for greater attention to consistency in annotations. Our users expect the annotations to be based on shared standards so that they can be compared and used in comparative genomics contexts. We agreed that we need to more formally identify a mechanism/team/process to ensure greater annotation consistency. This effort will include the development or employment of tools to evaluate annotations. [Action Items: 48]
b) ISS and sequence dissimilarity
When two sequences are similar but are missing some key piece of sequence similarity, which tells you that your protein can't have the function in question, what do you do?
[Action Item 24; 29] Add to documentation - use the NOT field for ISS annotation with sequence dissimilarity.
c) Annotating to Complexes
This was a major and continuing topic at this meeting. Should we assign function to members of a complex when these members either do not engage in the (typically) catalytic activity, or we don't really know the function of the member? This was a very long discussion. There were two separate problems that were discussed simultaneously (see below). Discussion ranged over both of the problems throughout.
Two separate problems:
Problem 1. There is an ontology problem: when the function ontology has an enzyme activity term with children "regulation of activity" and "catalytic activity", there is a true path violation for the regulator, in that its path goes up to the catalytic activity when it does not have that activity. This could be solved by removing the "enzyme activity" parent from the regulatory subunit. The regulatory subunit would have as parent "enzyme activity regulator". People feared that this would remove a link between the regulatory subunit and the function it was regulating. Others said that the link would be preserved with the component ontology term assignments.
There were suggestions to rearrange the ontology - but nothing seemed to satisfy the needs. In the end the decision was to remove the "enzyme activity" parent from the regulator term.
[Action Item 10] "enzyme activity" terms will no longer have as children their regulatory subunits. The regulatory subunit will have as a parent "enzyme regulator". We recognize that this removes a link in function between the regulator and the enzyme activity. However, we feel this will be covered in the annotation of the gene to the complex in question.
Problem 2. What to do when annotating the function of a subunit of a complex when that subunit does not have a known activity on its own. Up to now we have been annotating to the potential of a subunit and therefore would annotate the function of the complex as the function of one of the subunits (this is in the documentation). This is not actually correct of course, since the individual subunits do not have the function of the whole complex. But to not do this would lose the relationship of the subunit to the function of the complex to which it contributes. Ideally we would be annotating the functions of complexes and assigning gene products as parts of complexes with those functions, but many databases don't have the ability to do that.
- Some suggested making relationships between GO's Function and Component ontologies.
- Some suggested not linking function to the subunits (if nothing is known about what they individually do) at all.
- Some suggested adding a qualifier in the association file - suggestions: direct/indirect, associated_with, etc.
- Some suggested modifying the association file format to include a way to indicate that gene products A plus B plus C are needed for a particular function.
[Action Item 11] Regarding the annotation of gene products that are members of a complex:
1. The complex should appear in the component ontology.
2.
Gene products that are members of that complex should be annotated to that component term.
3. The complex itself (the instance of it in your DB) should be annotated to the appropriate function.
4. Gene products that are members of that complex should (if a more precise functional granularity is not known) be annotated to the function of the entire complex, but must have an additional qualifier added. This mandatory qualifier will be placed in the "NOT" column. The string we will use for this qualifier has not yet been finalized, but the candidates that we have discussed are "associated_with", "component_of", and "contributes_to". Whichever string is decided upon, the consequence is that there will now be two allowed values in the NOT column: "NOT" and the qualifier ("associated_with" or "component_of" or "contributes_to"). If both NOT and the qualifier value are needed for the association then they will be separated with a pipe character '|'.
[Action Item 30] In a related topic - Mike will add "complex" as an allowed type of "DB_OBJECT_TYPE" in the gene association file for those groups who are able to store complexes in their dbs and assign terms to them.
d) Validation of Annotation
Up to now there has been no validation within the data sets or between the data sets. Can we use the test set? We want the consortium to check annotations.
Michael: Need a tool that takes association files, gets proteins, clusters them, and presents to annotators the GO terms attached to the clusters for viewing. Need to flag things that are ok but come up in the screen, so they don't have to be looked at again. Once something is found that needs attention - send a message to the contributing db to fix it. The first time these checks are run it will be a lot to go through, but once that's done it should be (hopefully) fairly easy to maintain. Maybe we should have GO school/camp for 2 weeks.
Suzi: 3 things:
1. take existing annotations and check for consistency
2.
have a given set of genes annotated by two methods and check for agreement
3. GO camp/school useful for a. resolving discrepancies, b. new people education
It's very important to check consistency between dbs.
Mike: consistency is a goal, and so is sharing; we must share the nitty gritty of methods to make this work.
Suzi: maybe we should all use the same tools.
Mike: that's what GMOD is for.
David: consistency with component and function will be easier than process; process will be different for different species. Will need to choose wisely what defines a shared process. [Action Item 49]
7) Resource Issues
a) Report from development group on instantiating GO in Prolog
The underlying structure of the ontologies is going to have a big shift into a logic programming language, Prolog. This new paradigm will impact the development and storage of the ontologies, but the annotation processes will remain the same and most users won't see a difference. We will continue to provide the GO in various formats.
Chris Mungall gave a report from the working group that met in Bar Harbor prior to the GO meeting. This group included Chris Mungall, Suzi Lewis, David Hill, Harold Drabkin, Joel Richardson, Jim Kadin, and Alex Diehl.
- GO is a mix of 'stem' terms and 'composite' terms. For example: 'oxygen binding' is a composite term of the compound 'oxygen' and the term of the function 'binding'.
- A more complex term is "positive regulation of smooth muscle contraction" - it can be broken down into its component parts:
the action in the term is "contraction"
"muscle" is the thing being affected by the action
"smooth" is a modifier for "muscle" and so a modifier for the thing being affected
"regulation" is a modifier of the action
"positive" is a modifier of "regulation"
(Aside to this discussion: What would be the term in an anatomy ontology, "muscle" or "smooth muscle"? - answer: "smooth muscle" would be a child of "muscle")
We might want to think about GO as a language system.
GO terms are highly regular in their structure. They lend themselves to formatting: for regulation terms--> QUALIFIER, "regulation of" PROCESS, where PROCESS is "contraction" or "biosynthesis", etc. PROCESS can itself have modifiers. One can deconstruct the GO terms like this and build a grammar. There is a programming language called "Prolog" that breaks down terms into parts/classes.
Steps in using this for GO:
1. Take all or part of GO and decompose.
2. Maintain this breakdown in GO itself.
3. Make "oxygen binding" a cross product of a compound term (from a compound ontology) and the function "binding". Now many parallel hierarchies like transport and binding can be maintained more easily.
Question: if we have a way of generating a compound term, should we still maintain the compound terms in GO or just have them made as users need them? Answer and consensus - we should maintain them in GO. The Phenotype Ontology will produce a massive cross product from the Anatomy ontology and Process. We will use the build-up process the first time a term is needed, and then the term will get an id and be in the ontologies permanently. There will be a user mode for creating specific terms.
Discussion of Chris's talk:
Michael: mapping of component terms in PO to base GO terms
Midori: will this help sorting parent/child relationships for new terms? Chris - it should.
Martin: will there be 1 rule of decomposition or several rules? Chris - it will create standard wording.
Judy: Will people who do ontology development need to use Prolog? Chris - No, just need to make sure they add rules as necessary.
Rex: so with a new term from many places, will the tool make the term from the many places? Chris - you will put in the term, the tool will suggest optional add-ons or alternative names, and parent terms for you to review.
Rex: will the tool read an anatomy file? Chris/Suzi - yes, it will. A set of developers will work on the core/primitive terms and annotators will work on derived terms.
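As a rough illustration of the regulation-term grammar described above (QUALIFIER, "regulation of", PROCESS), the pattern can be sketched in a few lines. This is a Python sketch rather than the Prolog implementation presented at the meeting, and it makes assumptions that the minutes do not state: the qualifier is treated as optional and limited to "positive" or "negative", and the names `RegulationTerm`, `decompose` and `compose` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class RegulationTerm(NamedTuple):
    qualifier: Optional[str]  # e.g. "positive"; None when no qualifier is given
    process: str              # the regulated process, possibly with modifiers


# Assumed grammar: [QUALIFIER ] "regulation of " PROCESS
_PATTERN = re.compile(
    r"^(?:(?P<qualifier>positive|negative) )?regulation of (?P<process>.+)$"
)


def decompose(term: str) -> Optional[RegulationTerm]:
    """Split a regulation term into its parts; None if it does not fit."""
    match = _PATTERN.match(term.strip())
    if match is None:
        return None
    return RegulationTerm(match.group("qualifier"), match.group("process"))


def compose(parts: RegulationTerm) -> str:
    """Rebuild the standard wording from the decomposed parts."""
    prefix = f"{parts.qualifier} " if parts.qualifier else ""
    return f"{prefix}regulation of {parts.process}"
```

Here the PROCESS part ("smooth muscle contraction" in the example from the talk) is left as a single string; breaking it further into action, affected entity and modifier would need the anatomy and compound vocabularies discussed above.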
Michael: primitive will come from anatomical ontology? Chris - yes
Michael: mouse anatomy will have "head development", a compound term, but here the primitive is mouse head, not just head, and this term will be used for many types of heads - do we want only one GO term?
David: we should have all head types as children of "head development" (children would be "mouse head development", "fly head development", etc.)
Rex: how will the anatomies be used? Import them all and then sort it out? - not sure of answer to this one.
Prolog demo:
- run deconstruction - get stem terms: regulation([regulation, qualifier(Q), regulates: P])
- Grammar: qualifier regulation of process [regulation,qualifier:positive,regulates:[contraction, affects:[muscle, qualifier:smooth]]]
- first step: go through all GO and break down into stem terms.
- Test parent/child relationships
Chris showed an amino acid test - the term "glycine binding" has a parent "amino acid binding" but needs "serine family amino acid binding" as parent term, since glycine has parent "serine family amino acid" in the compound ontology. It showed that GO was missing an intermediate term and suggested what to do - either add glycine as a direct child of amino acid in the compound ontology or make a new intermediate term so that "glycine binding" can be a child of "serine family amino acid binding". This tool should solve the interleukin problem from before.
More discussion:
Michael: this new tool is an easier route to maintainability and communication with other ontologies - what are the downsides? For the GO curation/editorial team there will be transitional pain - but not for the users. Once the transition is done, will there be other downsides?
Chris: We will need to maintain the other ontologies.
Judy: It depends on a stable contributing base of vocabularies; some are not so stable, but it will likely be approximately what it is now.
David: GO curators need to now maintain all of the member ontologies.
If "oxygen" doesn't exist in a compound db, who puts it in?
Michael: need a "buy-in" of base ontologies. If we can ensure that all of the ontologies we rely on are around this table or under institutional control, then we don't need GO developers to maintain them, just need reliable people to maintain them and provide quick turnaround - except maybe compound and protein families.
Judy: What about UniProt and PIR? PIR/PANTHER families are a possibility.
Michael: chemical ontology -- EBI has a real chemist - so there will be work on that eventually at EBI. Cell type ontology is fairly mature; all anatomies must be in CVS.
Rex: We need a clearing house to tell people which anatomies their terms should be involved in - define lines of what each ontology encompasses.
Martin: who can write to files at OBO?
Michael: each file will have a person who does the updates?
Martin: but right now who can write to these files?
Suzi: there is a short list of people with write access.
c) Report on DAG-Edit
Suzi gave a presentation for John, highlighting new properties (see presentation for details). There was a question on whether DAG-Edit can save changes between two versions - the GO curation team says you can save the histories - need to check on this. [Action Item 42]. Question to the group on when to shift to the new DAG-Edit, which is ready to go. Will organize a testing period, and John will visit users during this period [Action Item 43].
d) Report on AmiGO
Chris gave a presentation for Brad (see presentation for details). A test of the new underlying data structure was done for the GO term correlations/concurrent assignments tool. "Genes who liked that GO term also liked this one." - it worked fine. This was Action Item 16 from the St. Croix meeting.
e) Demonstrations
1. Joel Richardson: Viewing annotation vocabulary graphically, using GraphViz. Currently works on mouse data, has plans to make it generic; maybe it should work off the GO db. SGD has a similar tool. Will work towards putting these out on GMOD.
2.
Eimear Kenney: Textpresso. This is a tool for mining the full text of publications for relevant sentences.
3. David Hill: Automated paragraph generation from GO annotations. An attempt to develop rules for making a nice text paragraph based on the annotation and GO terms assigned to a protein. Did this because granting agencies and users want text output. David did a test with Pax6. He developed simple sentence structure rules that allow the automatic fill-in of GO terms and annotation information and production of a text description of what is known about a protein. This text is generated from underlying data and is basically the reverse of the deconstruction described by Chris. Both are necessary for complete usefulness of the GO system.
Ultimately, it is hoped that GO data will be presented to the user with options on viewing - the normal GO term assignment tables, a graphical interface like Joel's, and a text entry like David's describing the sum total of what all of the GO terms and annotations tell us about the protein.
f) Slots = properties
Slots is being accomplished with the Prolog deconstruction stuff Chris presented. For slots we need to decompose ontologies, add relationship types, and have axiomatic ontologies (elemental, basic terms). Chris will start decomposing terms - needs volunteers to go through them - David/Amelia/plant person. Should be done with some testing by next meeting. [Action Item 50]
8) Lingering questions from this meeting
1. from TAIR: TAIR has a pathway to term map and SGD has a map to another term in the same tree but at a different level. How should this be handled? We didn't return to this question.
2. GO Slim: should 3 files be required when sending in a Slim: 1) GO Slim itself, 2) GO term mapping to GO Slim, 3) mapping of genes to GO Slim?
9) Action Items from this meeting
Ontology Development Action Items
1. Create SOPs for checking of ontology integrity
2. Document process for revision of subtrees
3.
Create SOP for getting people into interest groups and other interest group activities.
4. RESOLUTION TO "signal transducer activity" question:
i. need to obsolete the current term
ii. make a new term with the same name but a new definition
iii. create the new definition to everyone's satisfaction (to be ironed out later)
iv. add a warning to the comment on the appropriate use of this term
v. clean up the children terms - some need to be moved to other areas of the ontology.
5. Document Logic Consistency issues in regards to 'Part-Of' designations. Following documentation, track down instances that are not always 'necessarily part-of', figure out what to do with them (known examples: proteasome and polarisome)
6. RESOLUTION TO process grouping terms in Function. We will not use Process to group Function terms unless all of the terms being grouped share the same type of function. GO curators will continue to bring these to the attention of the group via email; if agreement is reached quickly - great. If not, it will be resolved at a meeting. Also, in the emails be sure to point out the Process term alternatives to the Function term. Things in Function should have things grouped by function.
7. Alex will send in SF ticket on 'regulation of survival gene products' under "apoptosis" and GO team will check it out.
8. RESOLUTION on synonym types: Put the 5 types into the database. Put the 5 types into the flat files. John will need to make DAG-Edit work with this. Jane needs to write documentation. Chris will add them to the db. A warning of the new file format will go out prior to implementation.
9. As needed, add English terms as synonyms.
10. RESOLUTION of the regulator subunit of enzyme activity as child of activity question: "enzyme activity" terms will no longer have as children their regulatory subunits. The regulatory subunit will have as a parent "enzyme regulator".
We recognize that this removes a link in function between the regulator and the enzyme activity. However, we feel this will be covered in the annotation of the gene to the complex in question.
11. RESOLUTION of "subunit of complex" annotation issue: It was decided to annotate the gene products of a complex to the complex with component terms, and to continue to annotate the individual subunits to the function of the entire complex but with a qualifier in the "NOT" column - the qualifier will be "associated_with". Therefore, there will be two allowed values in the NOT column: "NOT" and "associated_with". If you need to use both values at once, separate them with a pipe. [Subsequent discussion as to whether 'associated_with' or 'component_of' would be the better tag.]
Action Items specifically for the GO Editorial Office in Hinxton.
12. Add two new curators to the web site 'people page'.
13. Commit the new web site with improved index. (jen)
14. Send URL of function ontology documentation round to group for discussion. (done)
15. Document the difference between a parent/grouping term in the function ontology and a single term in the process ontology.
16. Document the 5 different part_of terms and the fact that we mostly use just one of them (necessarily part of).
17. Document the fact that binding is not always a parent of enzyme. Binding is only a parent when stable binding occurs. Remove Binding as parent where appropriate.
18. Standardize use of 'activation', 'induction', 'positive regulation of'. GO curation team will go through and find instances of "positive regulation of"/"activation of"/"induction of" and try to resolve them, redefine them as needed and put notes in comments.
19. Keep an eye out for any standard operating procedure information coming from the Annotators meetings.
20. GO.evidence.html has a bad link. Fix this.
21. Two new tools were demonstrated. Add these to the tools page: Joel Richardson's Annotation
Browser and the Textpresso program that Eimear Kenney presented.
22. The folks in the GO office are to test the new DAG-Edit for a few weeks prior to release.
23. Jane will write documentation on Synonym Types. Need to send a warning of the new file format prior to implementation.
24. Add to documentation - use the NOT field for ISS annotation with sequence dissimilarity.
25. Change the documentation so that ISS can have cardinality >1. Add documentation that clarifies the section where it tells annotators that if you are unsure of the function/process of your gene to bump up to the next higher term. Add that if that bumping gets you to the root of the ontology you should then use the "unknown" term for that ontology.
Action Items for Annotation Groups and for Annotation Oversight
26. Formally identify an Annotation Oversight Team; they will a) assess quality, b) set standards, c) evaluate the annotations of contributing groups, d) alert those groups to annotations that may need attention.
27. RESOLUTION: If there are IEA sets of associations that have not been updated in one year, they will be removed from the front page and AmiGO if a call to the submitting group doesn't result in an updated file.
28. RESOLUTION: Use the NOT field for ISS annotation with sequence dissimilarity. Everyone keep the ISS consistency issue (how much similarity is enough for different groups) in mind and think about ways to improve it.
29. Mike will add "complex" as an allowed type of "DB_OBJECT_TYPE" in the gene association file for those groups who are able to store complexes in their dbs and assign terms to them.
30. Need a new tool that will check for situations where annotation of GO terms was made (for ex. to a mouse gene) based on terms added to another gene (for ex. from human) with ISS, but where the annotation of the match protein (in ex. human) has since changed. An email would be sent to an annotator to review the annotation for the mouse gene again.
31.
Add documentation that clarifies the section where it tells annotators that if you are unsure of the function/process of your gene to bump up to the next higher term. Add that if that bumping gets you to the root of the ontology you should then use the "unknown" term for that ontology (in EBI-GO List too)
32. Everyone should be using a script to check for formatting errors in the association files before submitting them to GO. SGD and others have such scripts to share.
33. Send comments on text sent out by Michelle for HMMs and pairwise matches IEA references. Send any other text for other types of IEA evidence around for comment.
34. Organize the quality control checking for annotations. Make a tool to do the comparisons. Organize the GO school/camp. We all must buy into the concept of annotation consistency.
Action Items for Software and Database Development and Production
35. Send reminders to groups who need to update gp2protein.
36. Mike's group will be establishing a production manager and will hire someone to do the job. This person will work with Brad on AmiGO, Suzi's group on database validations, and various Annotation Groups on standards for GO association files.
37. If there are IEA sets of associations that have not been updated in one year, they will be removed from the front page and AmiGO if a call to the submitting group doesn't result in an updated file. (This is the same as Action item #28, but is here because it affects both groups)
38. An AmiGO request: Add a species filter to AmiGO. This could be done either by using the identity of the contributing database or, independent of source database, by using the taxon id from GenBank available for the related sequence. Note the SF site should be used for this kind of AmiGO request (hence #40 below).
39. Provide a SF ticket for AmiGO improvements and suggestions; provide focus group for AmiGO improvements.
40. Software/db group send Midori db format requirements for the IEA references.
41.
There was a question on whether DAG-Edit can save changes between two versions - GO curation team says you can save the histories - need to check on this.
42. Organize testing period for new DAG-Edit. Approximately 6 weeks of testing. John will visit users of DAG-Edit during the testing period.
43. New flat files in the new format should have a different file name format: "function.obo, process.obo, component.obo". These files will be terms plus definitions. Feel we still need all three, although with the new system they can be combined.
Other Action Items
44. Run a test-set on all the GO tools. Generate a test set of genes for tool validation. Get a responsible person to manage the test system. Post results so users can see the kind of analysis/visualizations provided by a Tool. Nobody made a clear commitment to organize this, but many were in favor of it.
45. Pankaj is in contact with a group that wants to translate GO into several European languages, Arabic and Chinese. Pankaj will talk to this group and learn the details of their plans. Need precise input as to how the group would deal with update issues.
46. Develop further documentation for ontology development guidelines so that when we get help from outside experts to develop specific branches of the ontology we have a way to introduce them to some of the basic tenets and standards that are needed in order to do this [EBI editorial staff].
47. Mike Cherry (primarily, but not all by himself of course) to propose and set up a mechanism/team/process to develop 1) manual methods, 2) automated assessment tools, and 3) documentation to ensure greater annotation consistency.
48. Pursue by all possible means methods for improving consistency of annotations: computationally, based on sequence; comparatively, between alternate methods carried out on the same gene sets; through training and documentation (camp?) [Suzi, Mike, Michael, and Judy]
49.
Chris will start decomposing terms and David/Amelia/plant person will work with him to help test the results and change the ontologies as needed.
10) Summary
Proposal for future organization:
- software will be broken into development group and production group
- the production group will be handled by the Stanford group
- annotation needs quality control oversight - for now Mike is checking into this.
Should we change the way we organize the GO Consortium meeting schedule?
- all will be the same for the next meeting (in Stanford)
- Maybe we should have breakout group meetings for the subgroups (GO ontology development, annotation, software) which report back to the big group. However, many people are vested in several of these areas.
- Maybe we should have breakout groups for the interest groups which report back.
Mike: there will be 1/2 day available for breakouts. They are expecting the meeting to take 2 full days.
Judy: maybe a series of small group meetings followed by the big group.
Suzi: then the big group would only meet 1-2 times a year.
General agreement on this - suggestion to schedule the big meeting following the Stanford meeting after the small meetings have been scheduled.
Addendum 1: Report of Action Items from St. Croix meeting.
1. ALL: update gp2protein on central CVS site. Still several need to update.
2. Suparna & Amelia: update metacyc mappings (and check that no functions are mapped to) DONE
3. Amelia: change monthly report file names so they'll sort by date. DONE!
4. Amelia: cron job that mails announcement of each new monthly digest to go-friends. This is DONE, in the sense that the auto-mailing works, but not done in the sense that the reporting can be improved (Judy did I get this right?)
5. BDGP, JAX: first prototype to be implemented for properties prior to JAX meeting. DONE (Chris reported at Bar Harbor meeting)
6. BDGP (SwissProt?): need to provide a tool for tentative assignment of GO terms. NOT DONE
7.
one row, one term, one reference, one evidence code. DONE! 8. (IEA) Midori: to assemble method references for IEAs.....stuff to discuss at BH mtg 9. (IEA) BDGP to explore means of including larger number of associations in DB and AmiGO. IEA db tuning...also need expiring date...NOT DONE 10. (IEA) BDGP to add filtering that is a combination of evidence code and reference. for IEA...and TIGR, add filter for taxonID, query tool for AmiGO NOT DONE 11. (suspect annotations) Midori et al.: Add some things to documentation to describe procedure for error reporting, whether in terms or in associations. DONE 12. (suspect annotations) GO-central to add links on main web site to report errors in annotation. DONE 13. (suspect annotations) Brad to add button to AmiGO to mail error reports. DONE 14. (suspect annotations) NOT DONE but this will be part of the new annotation oversight system. 15. ALL: review annotation documentation and send in comments to GO-central (Midori to oversee). nothing sent...EBI-GO updating documentation as per BH meeting 16. BRAD: to add term based page. This would show all gene products and the other terms that had been used on each of those terms. A "other customers who used this term, also used these terms". Amazon dot.com approach. Not done in production AmiGO, but has been done in test of new AmiGO architecture. 17. JOHN: Need DAG-Edit to warn if there are definitions without terms when saving so that the definitions are not lost. DONE in beta version 18. GO central: for all part-of children in the function ontology, change the relationship to is-a and change wording to 'intrinsic regulator' or 'intrinsic catalyst'. DONE 19. Jane: remove 'activity' from 'binding' terms; DONE! 20. Midori & Jane to dredge up what problems were at end of database save testing; send to John. DONE 21. JOHN: Need DAG-Edit and central repository to work more seamlessly...DB or transparent CVS must be implemented. NOT DONE 22. GO-central improve documentation on synonyms. 
DONE 23. David organizing physiological process interest group. DONE 24. Physiological interest group is to report on progress next time. DONE 25. GO-central delete references to viruses in the definition of extracellular. DONE 26. GO-central move viral component terms back into intracellular. DONE 27. Midori to send examples of regulation to BDGP and Chris et al. to examine how to correctly indicate and model regulation. DONE 28. Eurie: can now proceed to use gene products in terms with the addition of the suffix class and other situations will be handled in the same way. OK 29. GO-central: Update the documentation to reflect the decision on transporters. OK 30. Amelia: Check on the terms in question and make sure they are consistent with the decision regarding transporters (and other bi-directional functions). DONE 31. Michelle: Originally this AI was to send examples of messed up merges to GO-central for resolution. This was done. There are a few "sensu Eukarya" terms with secondary ids that did not have "sensu Eukarya" in them (Amelia generated a list of about 10). However, it turns out that it is ok that they are that way because, due to the placement of the old terms in the graph (as children of mitochondrial things for example), it is logically implied that they are Eukaryotic and therefore it is fine to make them secondary ids of Eukaryotic specific new terms. The problem for TIGR arose when those terms with mitochondrial parents were used to annotate some bacterial proteins (even though we knew about the path violations for bacteria) because at that point bacterial counterparts did not exist for those terms and they still wanted to capture the information. Therefore, the new Action Item is for TIGR to fix these annotations now that the bacterial counterpart terms are in GO. Thanks to Midori and Amelia for clarification of this. DONE 32. transcription factor is wrong (mis-defined and mis-annotated). 
Interest group is going to fix this and report the solution. NOT DONE 33. All interest groups to provide short (one page more or less) reports for next meeting. 34. Jennifer: to provide a mock-up of the GO home page using Sanger style links. Tried this, but didn't work well. Addendum 2: Signal Transducer Activity report signal transducer activity: current def "Mediates the transfer of a signal from the outside to the inside of a cell [or cellular compartment] by means other than the introduction of the signal molecule itself into the cell." The proposed definition of signal transducer is based on the concepts carried by both "signal" and "transducer". I've been looking over the definitions of the individual components of "signal transducer"; there are two components: 1. detect signal << what is this? 2. change signal into another activity <<>> does not have to be a molecule; it can be light (see further on). Are all these proteins therefore signal transducing molecules? Certainly cytokines are accepted as signal transducing molecules with the ability to induce signal transduction via receptor binding. >> No, the transducer is the thing that converts one type of signal to another. Cytokines are the signal, not the transducer. (paraphrasing) "To me, signal transducer..." or "I see signal transducers..." - you both have a concept of what a signal transducer is, but I think that the current def and the new def fail to capture it. I think that the term 'signal transducer activity' has been used to describe the activity of anything involved in a signal transduction cascade, and by using the term thus you are not capturing any more information than you already have by annotating to the process term 'signal transduction'. If you want to have a term to represent conversion of one type of signal information into another, I think it should be a new term because I don't think that 'signal transducer activity' will have been used in this way.
A signal transducer would thus be a gene product that converts one type of signal into another." It seems possible that more than one of the proteins in a signal transduction pathway could be signal transducers, but not necessarily all of them, since not all of them will change the signal to another form. How is a signal transducer thus defined any different from a transporter? ("Enables the directed movement of substances (such as macromolecules, small molecules, ions) into, out of, within or between cells."). Substance and signal are not the same things. A substance is always a physical entity; a signal is not. Insulin binding its receptor is a signal, but so is heat, etc. Binding to a receptor does not mean a substance is then transported into the cell. Reply to the part about transducer vs transporter: with a transporter, a substance goes in one end and out the other. The "signal" can be a substance (like a pheromone), but it doesn't have to be. The transducer converts one type of signal to another (a chemical signal (like a pheromone) to a conformational change, etc.). ...in the transducer vs. transporter debate - I understand the difference between transducers and transporters; however, we've got all the receptor activities lumped under 'signal transducer activity' and some receptors work by conveying the signal molecule into the cell. Then these should not be called transducers, they are transporters. We can no longer therefore broadly classify all receptor activities as signal transducers - each receptor activity will need to be assessed and recategorized. Are receptors which transport a signal molecule into a cell therefore not signal transducers? Are they involved in signal transduction, though? Or would we say that there has been a change in the signal type, i.e., the incoming signal is extracellular steroid molecules, and the outgoing signal is intracellular steroid molecules. In the above cases, there is no transducer.
=================================================================== GO Consortium Meeting - Stanford, CA - January 16-17, 2004 [Next Meeting: Chicago - Dictybase organizing - October 2004] Group Participant List SGD (Mike Cherry, Karen Christie, Kara Dolinski, Eurie Hong, Dianna Fisk, Rama Balakrishnan, Rob Nash, Stacia Engel) TAIR (Sue Rhee, Tanya Berardini, Suparna Mundodi) MGI (Judy Blake, Joel Richardson, Harold Drabkin, David Hill, Mary Dolan) ZFIN (Doug Howe) RGD (Victoria Petri) Dictybase (Rex Chisholm, Petra Fey, Karen Pilcher) EBI-Ontology Group: (Midori Harris, Jane Lomax, Jen Clark, Amelia Ireland) GOA (Evelyn Camon, Daniel Barrell) Wormbase (Kimberly Van Auken, Ranjana Kishore) Incyte (Burk Braun) Gramene (not present) IRIS (Richard Bruskiewich) Berkeley DB Group: (Suzi Lewis, Chris Mungall, John Day-Richter, Brad Marshall) TIGR (Linda Hannick) FlyBase (Michael Ashburner, Rebecca Foulger) S. pombe/Sanger (Val Wood) Pathogen/Sanger (not present) TOC 1. Opening Comments: GO Grant Update 2. Annotation Groups: Progress Reports 3. Interest Group Reports 4. Ontology Development Issues 4.1 Metabolism terms: divide into cellular and organismal metabolism 4.2 Regulation of non-biological processes 4.3 Transcription/translation factor activity 4.4 Component ontology annotations 4.5 Protein classification 4.6 Use of 'sensu' 4.7 Documentation of function ontology 4.8 GO_Slims Development 4.9 NameSpace Ontology 4.10 GO email archive search 4.11 Gene association file errors 4.12 Date tracking for definitions 5. Software Report 5a. Presentation of OBOL 5b. Report on the DAG-Edit workshop 5c. Update on changes to AmiGO 6. Annotation Issues 6a. Problems with pathway information annotation 7. Future Meetings 8. Final Item - Incorporation of GO in WormBookIII 9. Summary of Action Items from this meeting. 10. Review of Action Items from past meeting [Bar Harbor] 1. Opening Comments: GO Grant Update (Judy) Judy reported on the status of the GO funding from NHGRI. 
In the competitive renewal, the GO funding mechanism was changed from an R01 to a P41 (research_resource), and significant new funds were requested. Current indications are that we will be funded for 3 years. However, several cuts in funding have been proposed, including one requested software engineering position. Additionally, there will be no new group (sub-contract) funding. Small side projects are also not funded. There may be additional adjustments; we are awaiting official notifications. We hope that any further adjustments in funding can be shared across all of the groups receiving funding through this grant except for the European contracts (due to the dismal exchange rate) and BDGP (which has already had one position cut). 2. Annotation Progress Reports: Reports were issued from the following groups: SGD - Report available TAIR - Report available MGI - Report available TIGR - Report available ZFIN - Report available RGD - Report available Dictybase - Report available Flybase - Report available GOA - Report available EBI-Ontology - Report available Wormbase - Report available Incyte - Report available Sanger/Pombe - Report available Sanger-pathogen - Report available Gramene - Report available IRIS - ok, no electronic report BDGP Software - Report available 3. Interest Group Reports a. Plant Interest Group Plant interest group report is available at http://www.ebi.ac.uk/~jclark/GOwebsite/text%20in%20development/plants_folder/plants.htm This is also available as a text file with other reports from this meeting. No other interest groups reporting. 4. Ontology Development Issues 4.1 Metabolism terms It was decided that "metabolism" would be split into "cellular metabolism" and "organismal metabolism". This is similar to the division of "physiological process" into "organismal physiological process" and "cellular physiological process". Further discussion about this will continue in coming weeks.
4.2 Regulation of non-biological processes Example: regulation of water crystallization: water crystallization is not a biological process but it is regulated biologically (e.g., "regulation of water crystallization" IS a biological process). Some don't have "is a" parents. How about "water metabolism", homeostasis? Action item: find homes for these terms. As we proceed with consistency checks throughout the ontology, we will need to provide parents for these terms. 4.3 Transcription/translation factor activity Many of these terms are really describing complex processes and not activities. For example, "translation initiation factor activity" (GO:0003743): functions in the initiation of ribosome-mediated translation of mRNA; "transcription factor activity" (GO:0003700): Any protein required to initiate or regulate transcription; includes both gene regulatory proteins as well as the general transcription factors. There is not a specific "activity". These terms are heavily used by the biology community. What is needed are real definitions. Where should a regulatory activity go? One suggestion is to think about how a particular activity is being assayed to help with making a real definition. There was a long discussion about this. From a practical point of view, not all activities are usefully 'atomized'. Biologists may not be able to nimbly provide a definition of 'transcription factor activity', but they do have a conception of the complex matrix of action described by this term. We agreed that these more complicated function terms need to be included in the ontology. Action item: Eurie and Michael will strive to provide a definition for 'transcription factor activity'. 4.4 Component ontology annotations Transient associations for complex: "associated with" vs "part of". Should we distinguish between stable components of a complex versus something that by some experiment localizes to the complex? 
Action item: We will add a new qualifier for "Colocalizes with" that is appropriate for indicating that the gene product has been found in the vicinity of a structure. Action item: The GO office will update the documentation for Component rules with discussion of this qualifier and its use. 4.5 Protein classification: The use of a protein classification system in the GO is being investigated. 4.6 Sensu definitions: Jen led the discussion of various options for incorporating taxonomic information as necessary. Action item: 'sensu' terms will have a mixture of English phrases and Latin genera, along with the taxon ID. The definition of any sensu term would include the point that it is not totally restricted to a particular grouping. http://www.geneontology.org/GO.usage.html#taxon 4.7 Documentation of function ontology Current documentation for creating terms for the function ontology is being further refined. One item that was discussed was representing complex functions in the function ontology (e.g., receptor tyrosine kinase, which is a receptor (ligand binding) and a kinase) (doesn't make sense to deconstruct); if "receptor tyrosine kinase" is a child of both receptor and kinase, then its function is defined by its placement in the graph. In some cases, however, there may be a complex function: when the two or more functions are not mutually exclusive; the two activities are coupled; the functions are dependent on each other. Action item: These concepts will be added to the documentation. 4.8 GO_Slims Development Some people wanted documentation for developing their own GO slims. Currently, there is only the readme file in the GO-Slim directory in cvs. There IS software: Map2Slim script: Takes an annotation and puts it into an appropriate place in the Slim. To use, you need a Slim file and an association file. The output is binned. The program allows creating bins that don't exist in the GO, such as "chaperone + chaperone regulator". This is available at the GO site.
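The binning step described for Map2Slim in 4.8 can be sketched roughly as follows. This is a minimal illustration in Python, not the Map2Slim script itself: the function names, the toy ontology, the slim set and the gene names are all hypothetical, and the sketch assumes that each annotation is placed in the most specific slim term(s) found at or above the annotated term.

```python
from collections import defaultdict

def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def map_to_slim(term, parents, slim):
    """Return the most specific slim terms at or above `term`."""
    hits = ancestors(term, parents) & slim
    # Drop a slim term when another hit is one of its descendants.
    return {h for h in hits
            if not any(o != h and h in ancestors(o, parents) for o in hits)}

def bin_annotations(annotations, parents, slim):
    """Group gene products by the slim term(s) their annotations map to."""
    bins = defaultdict(set)
    for gene, term in annotations:
        for slim_term in map_to_slim(term, parents, slim):
            bins[slim_term].add(gene)
    return dict(bins)

# Hypothetical toy ontology: child term -> set of parent terms.
PARENTS = {
    "protein kinase activity": {"kinase activity"},
    "kinase activity": {"catalytic activity"},
    "catalytic activity": {"molecular_function"},
    "GTP binding": {"binding"},
    "binding": {"molecular_function"},
}
# Hypothetical slim and association data (gene product, annotated term).
SLIM = {"kinase activity", "catalytic activity", "binding"}
ANNOTATIONS = [
    ("geneA", "protein kinase activity"),
    ("geneB", "GTP binding"),
    ("geneC", "catalytic activity"),
]

BINS = bin_annotations(ANNOTATIONS, PARENTS, SLIM)
```

In this toy case geneA lands in "kinase activity" only, because that slim term is more specific than "catalytic activity". Combined bins of the "chaperone + chaperone regulator" kind are not modelled here.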
4.9 NameSpace Ontology Mark Wilkinson will be responsible for updating BioMoby with the provided NameSpace designations. 4.10 Improve GO email archive search? A suggestion was made to have the GO email archive searchable by Google. However, it would mean that everyone in the world will see it. If we can restrict it, that would be better. Action item: Brad and Mike will look into whether it is possible to keep a Google search of the email archive separate from the general Google search of the GO web pages. 4.11 Gene Association file errors Each group has checked these. However, one issue brought to light is that the files are getting big! It was suggested that we start compressing these files (gzip). Also, many are not updated frequently; in May two more will be over a year old. The groups need to address this. Also, it was suggested that people having write access to the cvs repository use ssh (so that the password is never sent as clear text over the internet). The downside is that one has to type the password more often; however, it is more secure. John pointed out that we don't need to type our passwords each time if we get a private key on our computers. Action item: groups to investigate whether large files, or compressing files, will pose any problems at their own site. 4.12 Tracking the date of a term definition A suggestion was made to track when a term definition was made. The date might be added to the definition itself, perhaps through DAG-Edit. Action item: Implement if easy and does no harm. 5. Software Report 5a. Chris Mungall gave a presentation of OBOL (Open Bio-Ontology Language); this is available at the GO site (http://www.geneontology.org/meeting/go_2004_01_stanford/). This is the result of the decomposition of implicit info out of GO terms into defined classifications. OBOL will support the slot-based annotations discussed in previous meetings, and will more readily support creation of cross-product and composite terms. 5b.
John Day-Richter summarized the DAG-Edit workshop held at Hinxton in Dec. 2003. It was very useful for working out bugs and highlighting functionality that many people were unaware of, such as a. CVS plugin for DAG-Edit b. Term change tracker plugin c. OBO data adapter: can break ontology into multiple files based on name spaces d. Category manager plugin: used for GO_Slim set up. Future enhancements will include a filter plugin, search, color, decoration, and more. Some of these, however, will require switching to OBO format files. Action item: Proposal: by end of month will post OBO files, curator trials -- use OBO for two weeks to work out bugs, then general switch to OBO as master; More general document for users. Announcement on site. Why use the new format? The advantages of the new format for the average user are that the file is more readable; it is one file rather than four; it is smaller overall (35%); and using the OBO format will allow the user to take advantage of many of the functions of DAG-Edit that are not saved with the old flat file format. 5c. Brad Marshall reported that although the AmiGO browser seems to be the same, many changes have been implemented to the way it functions. There are also plans to incorporate GraphViz visualization. The GO database is currently being updated monthly. However, once it is moved to Stanford, the updates will be more frequent. 6. Annotation Issues 6a. Problems with pathway information annotation; inference from existing genome annotations to novel genome annotations. A discussion was generated by TIGR (Linda and Michelle Gwinn) concerning how to annotate pathways. GO does NOT, per se.
The following example was discussed: A is converted to B via x (gene product); B is converted to C via y (gene product). x is annotated using the GO ID for its function; y is annotated to the GO ID for its function (both need an ISS at least). Then, if there is a Process term Biosynthesis of C, both gene products are annotated to its GO ID using an IC code (the curator infers the pathway because both x and y exist). The with field when the IC code is used here will have at least two GO IDs corresponding to the two GO activities (read: the curator infers that this process occurs because of the presence of the two activities (GO IDs) in that organism). 7. Future Meetings It has been brought up several times that due to the increasing size of the consortium as well as budget considerations, the usefulness of the large group meeting every three to four months is less than it was. It is suggested that these large group meetings occur no more than twice a year. However, there will be planned smaller meetings with a specific focus. The intended participation would be limited to those members that have a specific interest in the topic. Suggested topics would be: * Tools and development (e.g., the DAG-Edit workshop) * GO Annotation **(This might be open to other groups like Xenopus, etc. This might be run as annotation jamborees, etc. Might be a good place to discuss quality and efficiency problems, etc.) * SO (sequence ontology) * Ontology Content (e.g. Tanya and David for cell physiology) When a large meeting occurs, it will be important that someone who goes to the smaller meetings attends to give a report of these meetings to the group as a whole. 8. Final Item. Worm Book III would like to embed use of GO terms within all articles of the book; the online version of the book would then link out to the appropriate GO term pages. This is a good idea for us to be involved in, especially with respect to our grant mandate to make GO more accessible.
Perhaps we could make a dictionary based on an alpha dump of the ontology. We could post it periodically. Obsolete terms would need to be removed. ACTION ITEM: Mike Cherry to look into how best to do this. 9. Summary of Action Items from this meeting. 1. Action item: Eurie and Michael will strive to provide a definition for 'transcription factor activity'. 2. Action item: We will try to set up a pilot project that has a web page "indexing" key point discussions in the GO email archives 3. Action item: We will add a new qualifier for "Colocalizes with" that is appropriate for indicating that the gene product has been found in the vicinity of a structure. 4. Action item: Jen will update the documentation for Component rules with discussion of this qualifier and its use. 5. Action item: Brad and Mike will look into whether it is possible to keep a Google search of the email archive separate from the general Google search of the GO web pages. 6. Action item: groups to investigate if large files, compressing files, will pose any problems at their own site. 7. Action item: Mike Cherry to look into how best to interact with WormBookIII to embed GO terms in on-line version of the book. Need to consider 'glossary' approach and how to maintain currency. 8. Action item: 'sensu' terms will have a mixture of English phrase and Latin genera, along with the taxon ID. The definition of any sensu term would include the point that it is not totally restricted to a particular grouping. 9. Action item: Proposal: by end of month will post OBO files, curator trials-- use OBO for two weeks to work out bugs, then general switch to OBO as master; More general document for users. Announcement on site 10. Action item: We will try to set up a pilot project that has a web page "indexing" key point discussions in the GO email archives. 11. Action Item: "Not" column will be renamed "Qualifier". 
When it has any value other than NOT or NULL, it should be used for annotations for components of a complex only. This will allow reasoning across membership in a component to infer function. There should be checks for complex entries. Subunit will have annotation to a particular subunit activity, if known, or to "contributes_to" and that gene product must also be annotated as a component of complex. e.g., specific example eIF2; has three subunits (alpha, beta, gamma); one binds GTP; one binds RNA. But the whole complex binds the ribosome (needs all three); so all three get "contributes to" ribosome binding, and one gets GTP binding, another gets RNA binding. AND all three are annotated to EIF2 complex. 12. Action Item: In column #12 of the gene_association file, "complex" will be allowed as a type of "DB_OBJECT". 13. Action Item: Concepts relating to the use of complex functions (e.g. receptor tyrosine kinase) will be added to the documentation. 14. Action Item: Add Joel Richardson's tool to the tools page. Note language change from Python to Java. (Jen) 15. Action Item: Document OBO flat file format advantages for annotators (There are none.) 16. Action Item: Write documentation for the process and component ontologies along the same lines as the function documentation that had already been written. (Jen) 17. Action Item: Add documentation to remind people that the definition is there to clarify the meaning of the term name if there is any ambiguity. This is to be added to the general documentation as well as to the documentation for each ontology. (Jen) 18. Action Item: See if there is an easy way to add the date that a definition was made. 10. Review of Action Items from past meeting [Bar Harbor] 1. Standard operating procedures for checking ontology integrity: These would include checking true path rules, missing parents (terms that are not "is_a" to anything), etc.
Currently, some groups have scripts that do this; perhaps these can be incorporated into DAG-Edit. This will be continued. 2. Documenting process for revision of sub-trees (for example, the changes being made to split physiological process). Currently much of the discussion occurs via telephone. This type of thing should be documented so new members can review discussions - We need some sort of "audit trail" of discussions - some are in SourceForge entries, and should be made easier to find. It is important to record not just the changes but also the rationale and logic behind things that have been extensively discussed. Ultimately, summaries should be submitted to SourceForge. A suggestion was made that chat room technology could be used instead of phone conversations, so that everything is logged. Action item: we will try to set up a pilot project that has a web page "indexing" key point discussions in the GO email archives. 3. Work on documentation on procedures for getting people into interest groups is progressing. 4. The fate of "signal transducer activity" is still under investigation. 5. The checking of logic consistencies for "Part-Of" is in progress. 6. When is a function actually a process? Documentation is being drafted to add to principles of ontology development. 7. A SourceForge item on "regulation of survival gene products" under "apoptosis" to be checked by GO team. The term name was changed to "regulation of survival gene product activity" and moved under "anti-apoptosis". DONE 8. Synonym types: Documentation describing the types of synonyms (exact, related, broader, narrower, not same) is on the GO site. However, synonym type is only displayed in OBO format files. 9. Added English terms as synonyms as needed. DONE 10. Resolution of the "regulator subunit of enzyme activity" as a child of the enzyme activity: The enzyme activity will no longer have its regulatory subunits as children. The term "enzyme regulator" will be the parent of these terms.
DONE 11. Additional use of the NOT field. The discussion basically revolved around three choices: a. Allow the NOT field to have multiple uses. It was suggested that we rename the qualifier column from "NOT" to "qualifier", so that it can have the values "NOT", "contributes_to" or NULL. If using "contributes_to" to annotate to a function term, then the annotation must be accompanied by a component annotation for the same gene product to the complex. There was some concern about putting two different types of info in one column. Subunits are annotated with the function of the complex using "contributes_to" as qualifier, and the subunit is annotated as a component of the complex. b. A special evidence code for being in a complex instead of "contributes to". Like CPX? c. Use an additional column to indicate that the activity occurs only when complexed to something else. Action Item: "Not" column will be renamed "Qualifier". When it has any value other than NOT or NULL, it should be used for annotations for components of a complex only. This will allow reasoning across membership in a component to infer function. There should be checks for complex entries. Subunit will have annotation to a particular subunit activity, if known, or to "contributes_to" and that gene product must be annotated as a component of complex. e.g., specific example eIF2; has three subunits (alpha, beta, gamma); one binds GTP; one binds RNA. But the whole complex binds the ribosome (needs all three); so all three get "contributes to" ribosome binding, and one gets GTP binding, another gets RNA binding. AND all three are annotated to EIF2 complex. 12. Two new curators added to web site "people" page: DONE 13. New web site with improved index implemented. DONE 14. Changes to function ontology documentation: DONE 15. Document difference between parent/grouping in function vs a single term in process. DONE 16. Different types of "Part of". OBO can handle, but flat file format can't. The 5 different types of 'part of' are documented. 17.
Document that "binding" is not always a parent of enzyme. It is only a parent when stable binding occurs. DONE 18. Standardization of the use of "activation", "induction", and "positive regulation of" is being documented. 19. Add any procedural information coming out of Curators/Annotators meetings 20. Bad link was fixed in GO.evidence.html DONE 21. New Tools: Joel's Graphical Annotation Browser demo. 22. DAG-EDIT 1.409 beta 4 being tested. 23. Synonym documentation completed. DONE 24. Document the use of the NOT field for an ISS annotation with sequence dissimilarity. DONE 25. Document that the with field used with ISS can have cardinality > 1. ??? 26. An "Annotation Oversight Team" will be created to assess quality, set standards, evaluate annotations of contributing groups, and alert groups to annotations that need attention. Not Done: Will continue with plan to have 'GO Annotation Camp' this year. 27. COMPUGEN will resubmit their annotation file. RESOLUTION: If there are IEA sets that have not been updated in one year, they will be removed from the front page and AMIGO. 28. See 24. 29. Action Item: In column #12 of the gene_association file, "complex" will be allowed as a type of "DB_OBJECT". 30. A tool is needed to check validity of an annotation that was made to a Riken gene based on ISS to a SP record or IP domain. When the IP domain is removed by SwissProt (because it was actually found to not be a real domain, or always associated with a particular activity, etc.) the annotations are no longer valid. This is being researched by Suzi and David. 31. Documentation added to tell annotators that if you are unsure of function/process, bump annotation up to next level; if that results in the root of the ontology, you need to use "unknown". DONE. 32. Formatting errors in association files are being found: everyone should check. DONE. It was proposed that all groups use the same script to check this. 33.
Getting groups references for any "in-house" analysis and post on GO site continues. 34. Annotation QC: defer to annotation discussion 35. Send out reminders to groups to update their gp2protein files! Used in Amigo and other. 36. A production manager that will work on AMIGO, GO database, etc. will be hired 37. This is #27 again. 38. AMIGO will have a filter added for "organization" and "species" 39. Sourceforge now has a tracker for AMIGO requests. 40. Format for IEA references in progress. 41. Term change tracker plugin in DAG Edit allows you to track changes. DONE 42. A second DAG-EDITOR course will be held on the West coast.. 43. New flat files (OBO format) will have "obo" extension. At some point, the three ontology files will be combined. 44. Action Item: hire a person to run test sets on all of the GO tools. 45. Request to translate the GO into various non-English languages, including Arabic and Chinese. Pankaj to investigate the need, how to update, etc. 46. Create documentation for ontology development guidelines aimed at people that we would recruit as outside experts to develop specific branches of GO. EBI: in progress 47. Mike Cherry developing manual methods, automated assessment tools and documentation aimed at improving annotation consistency. 48. Improve consistency in annotation. Mary Dolan is currently analyzing the annotation consistency between mouse and human genes. ; continue working; 49. The "decomposing" of the GO ontologies is in progress. To be reported on at a later date. 
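The qualifier rule recorded under item 11 of this review (a "contributes_to" function annotation must be accompanied by a component annotation of the same gene product to the complex) lends itself to the kind of automated check the item asks for. The following is a minimal sketch in Python under stated assumptions: the record layout, the aspect codes as used here, the set of complex terms and the subunit names are illustrative only, and the assignment of GTP binding and RNA binding to particular eIF2 subunits is arbitrary, since the minutes do not say which subunit does which.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    gene: str
    qualifier: str  # "", "NOT" or "contributes_to"
    term: str
    aspect: str     # assumed codes: "F" function, "P" process, "C" component

def missing_complex_annotation(annotations, complex_terms):
    """List gene products that use contributes_to on a function term
    but are not annotated as a component of any known complex."""
    in_complex = {
        a.gene for a in annotations
        if a.aspect == "C" and a.qualifier != "NOT" and a.term in complex_terms
    }
    needs_complex = {
        a.gene for a in annotations
        if a.aspect == "F" and a.qualifier == "contributes_to"
    }
    return sorted(needs_complex - in_complex)

# Hypothetical data modelled on the eIF2 example; the gamma subunit is
# deliberately left without its complex annotation so the check fires.
COMPLEX_TERMS = {"eIF2 complex"}
ROWS = [
    Annotation("eIF2-alpha", "contributes_to", "ribosome binding", "F"),
    Annotation("eIF2-alpha", "", "eIF2 complex", "C"),
    Annotation("eIF2-beta", "contributes_to", "ribosome binding", "F"),
    Annotation("eIF2-beta", "", "RNA binding", "F"),
    Annotation("eIF2-beta", "", "eIF2 complex", "C"),
    Annotation("eIF2-gamma", "contributes_to", "ribosome binding", "F"),
    Annotation("eIF2-gamma", "", "GTP binding", "F"),
]

PROBLEMS = missing_complex_annotation(ROWS, COMPLEX_TERMS)
```

A real check would need the component ontology to decide which terms are complexes; here that knowledge is passed in as a plain set to keep the sketch self-contained.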
===================================================================
GO Consortium Meeting - Chicago, IL - October 15-16, 2004
[Next Meeting: Pasadena, CA - WormBase organizing - May 2005]

Group Participant List
SGD (Rama Balakrishnan, Mike Cherry, Maria Costanzo, Mayank Thanawala)
TAIR (Suparna Mundodi)
MGI (Judy Blake, Harold Drabkin, David Hill, Mary Dolan)
ZFIN (Doug Howe)
RGD (Victoria Petri, Mary Shimoyama, Simon Twigger)
dictyBase (Rex Chisholm, Petra Fey, Pascale Gaudet, Warren Kibbe, Karen Pilcher, Sohel Merchant)
EBI-Ontology Group (Midori Harris, Jane Lomax, Jen Clark, Amelia Ireland)
GOA (Daniel Barrell, Emily Dimmer)
WormBase (Ranjana Kishore)
Incyte (Renee White)
Gramene (not present)
IRIS (not present)
DB Group (Chris Mungall)
TIGR (Michelle Gwinn-Giglio)
FlyBase (Michael Ashburner, Rebecca Foulger)
S. pombe/Sanger (Arnaud Kerhornou)
Pathogen/Sanger (Arnaud Kerhornou)
SO (Karen Eilbeck)
Reactome (not present)
TGD (Mike Cherry, Nick Stover, Mayank Thanawala)

A. Report from GO External Advisory Board Meeting (Judy Blake)
The advisory board made three key points:
1) We should put a heavy emphasis on production and annotation. If we move to a more formal ontological structure, it must not affect the usability of GO for annotation. We need to better support users and annotation efforts. We should consider separating GO production and management from research and development. Several who were at the meeting felt that the reasons why we should move to a more formal ontological structure might not have been conveyed adequately to the board.
2) Metrics: we need to track our progress and usage in concrete ways in order to justify our funding.
3) Usability: we need to support naive users, whose first or only exposure to GO is often on the gene page of a MOD, as well as first-time annotators who need to know how to get started and where to find support.
Discussion:
* it's critical for us to set goals and objectives for the coming year, in order to prepare for submission of our next grant in ~1 year
* we've had conflicting requests and suggestions from computer scientists vs. biologists, and we've perhaps put disproportionate energy into interactions with the computer scientist community. We need to think about our core community, remembering that our funding is from NIH, which is looking for practical applications that facilitate biomedical research.
* it might be helpful to seek separate funding for production and for research
* in addition to our users who want to annotate genes, we also have users who want to see large numbers of annotated genes. We need to think about how the most complete annotation of genes could be achieved. As more genomes are sequenced, people will be wanting to use GO even when they don't have a MOD. NHGRI is funding genome sequences without funding annotation. We need to think about how we could capture the limited annotation that may come out of those efforts, while ensuring that new annotation is of good quality. Could a grant support GO annotation in a non-species-specific way, for example by funding a GO outreach person to assist annotators?

B. Annotation Reports and Issues
1. Reports from Consortium Groups
   GO Editorial Office
   SGD
   MGI
   FlyBase
   TAIR
   GOA
   RGD
   dictyBase
   WormBase
   TIGR
   Incyte
(Any discussions that occurred after the reports are noted here but the reports themselves are not recapitulated.)
Some new members of the group were introduced in this section.
Action item: Jen will update the "GO People" section of the GO website.
GO Editorial Office: In response to a question about how they prioritize addressing SourceForge items, Midori said that the highest priority items are:
- small things that can be done quickly
- issues that are of interest to many groups
- adding terms about previously unrepresented areas of biology, which may have repercussions throughout the ontology
The comment was made that the format of emails about SourceForge items is not easy to read.
Action item: Midori will look into whether the format of SourceForge emails can be customized.
Judy requested that those participating in email discussions that are not entered into SourceForge or forwarded to the GO list periodically write a summary of the discussion and send it to the group, so that all are aware of the discussion. Jen has documented the development interest group discussion and (since the Consortium meeting) has written a template for documentation of interest group discussions which she will make available on the web.

TAIR: TAIR allows user submission of GO annotation; this led to a discussion of whether we can provide a generic spreadsheet for GO annotation by inexperienced users. How can we encourage the scientific community to do GO annotations? TAIR has some GO annotators who have volunteered because they want to explore careers in bioinformatics. When community members submit individual annotations, their names are displayed on TAIR web pages; this may be an incentive.
Action item: Jen will get the formatted Excel spreadsheet for user submission of GO annotations from TAIR, will modify it to be applicable to any organism, and will put it on the website to help new annotators.

dictyBase: Pascale has constructed a "cross-product" ontology, the Dicty development ontology (presented at the GO Users meeting), that contains terms created by combining GO process terms with terms from a Dicty anatomy ontology.
The presentation file will be available for reference at the GO website, and Jen will provide a link to it from the development interest group documentation. Judy made the point that it might be useful to display to users any previous GO annotations (deleted or obsoleted) for a gene product. Incyte: Renee described a quality control method that involves assembling a list of terms that are frequently used together ("GO pairs"; e.g., 'protein kinase activity' and 'protein amino acid phosphorylation'), and checking whether all gene products annotated to one are also annotated to the other. Judy asked whether this or any other quality control method could be shared with the public efforts. Mike asked whether Incyte's Pfam-to-GO mappings are different from the public ones; Renee answered that they may be, since Incyte started with Interpro and have done hand curation of the mappings. Action item: Renee will look into whether Incyte's quality control methods or Pfam-to-GO mappings may be shared with the public GO efforts. If permission is obtained, Jen will put them on the GO website. 2. Report from GO Annotation Camp (Mike Cherry) Mike went through each point of the report; lack of comments indicates assent by the group. Discussions about specific items are noted here. Numbering of items corresponds to numbering in the Annotation Camp report. 1) Curation examples. Submission of annotation examples is a requirement for GOC members. Collection of annotation examples is now in the SourceForge Annotation Issues tracker (item #1047963), or examples may be sent to Midori. Action item: Each database must submit a set of 10 papers and accompanying GO annotations (see SourceForge item #1047963 for details). 2) README file. 
All gene association files must be accompanied by a README file summarizing the current annotation strategy: how genes are prioritized for GO annotation; whether multiple annotations to the same term, derived from different papers, are included; and any other annotation methods that may differ between MODs. Action item: Each database must submit a README file describing annotation strategy to accompany its gene association file. 11) Component terms and IEP. The annotation campers had decided that component annotations should never be supported by IEP evidence. Midori asked whether 'never' was too strong a qualifier. It was agreed that there may be legitimate exceptions to this rule. On a related topic, Pascale asked whether localizations inferred by localizing a GFP fusion protein should have the 'colocalizes with' qualifier. The consensus was that in general they shouldn't, since those experiments are designed to indicate localization of the wild-type protein. However, there may be other evidence that affects confidence in the results (e.g., if a fusion protein is localized to the lysosome/vacuole), and this is an area where curator judgement must be exercised on a case-by-case basis. 14) Choosing the appropriate level for GO annotation. Delete 'using IGI' from last sentence; it's not relevant to the example. 17) Points to remember for suggesting new terms. Judy pointed out that use of gene product names in terms is still under discussion. Item 2 should be changed to 'Avoid using gene product names in new term names'. Jane suggested adding the point that an informative name should be used for the SourceForge entry. Item 5: Suparna asked whether companion terms should really be added if not needed immediately for annotation. Midori said this is subject to curator judgement, and Mike pointed out that it's helpful to have a discussion of companion terms in SourceForge for future reference even if they aren't immediately incorporated into the ontology. 
This led to a discussion of the issue of bare leaf nodes: how many GO terms are not used to annotate genes? Chris looked up the current statistics:
- 18,000 terms exist, including obsoletes
- 8,700 have a gene(s) attached
- 10,000 have genes attached OR have child terms with genes attached

20) Policy on curation of every paper available. The annotation campers had decided that it was ideal to annotate using each available paper: the number of independent annotations can provide a measure of confidence in the assignment. Michael felt that this redundancy could interfere with computation, particularly if it's applied inconsistently; it needs to be documented in the gene association README file. Several ideas were proposed for filtering annotations in order to create a subset to use in data analysis. A file could be created containing a reference subset of annotations, or particular annotations for each gene could be tagged in the gene association file. SGD currently provides a file containing a single GO term of each aspect for each gene, but the consensus was that a representative GO annotation set for higher organisms would need more than a single annotation per gene. Chris suggested that a tool could provide users with custom-generated gene association files, restricted by number of annotations per gene, type of evidence code, etc.

22) What to put into the DB_Object_Type column? It should be added to this section that MGD and ZFIN use allele identifiers for IMP annotations.

24) Expanding GO evidence codes. Michael's evidence code hierarchy was discussed. The consensus was that this hierarchy should be available on the OBO site.
Action item: Make the evidence code hierarchy available at the OBO site.

Summary discussion: Michael pointed out that parts of the Annotation Camp report should be incorporated into the online annotation documentation.
Action item: Incorporate all relevant parts of the Annotation Camp report into online GO annotation documentation.
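Chris's suggestion under item 20 (custom-generated gene association files restricted by evidence code and by number of annotations per gene) could look roughly like the sketch below. This is an illustration only, not a description of any existing GO tool: the function name custom_subset, its parameters, and the (gene, GO ID, evidence, reference) tuple layout are all assumptions. The minutes do not say which annotations should survive a per-gene cap, so this sketch simply keeps the first ones in input order.

```python
from collections import defaultdict


def custom_subset(rows, allowed_evidence=None, max_per_gene=None):
    """Return a restricted list of annotations (hypothetical helper).

    rows: iterable of (gene_id, go_id, evidence_code, reference) tuples.
    allowed_evidence: set of evidence codes to keep, or None for all.
    max_per_gene: cap on annotations kept per gene, or None for no cap.
    """
    kept_per_gene = defaultdict(int)
    subset = []
    for gene_id, go_id, evidence, reference in rows:
        # Restrict by type of evidence code.
        if allowed_evidence is not None and evidence not in allowed_evidence:
            continue
        # Restrict by number of annotations per gene (first come, first kept).
        if max_per_gene is not None and kept_per_gene[gene_id] >= max_per_gene:
            continue
        kept_per_gene[gene_id] += 1
        subset.append((gene_id, go_id, evidence, reference))
    return subset
```

Because the restrictions are chosen by the user at request time, a tool of this kind would not require the Consortium to agree on a single reference subset.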
The question was raised as to whether there should be future annotation camps. There was enthusiasm for continuing them, although perhaps making them only 2-3 days long rather than a week. This would help ensure annotation consistency between MODs. They could be attached to another meeting such as the BioCurator meeting. MGI already does GO workshops, aimed at new users, at larger meetings. We should perhaps design a short tutorial/workshop to present to new GO annotators. 3. GO Annotation Topics a) Annotation consistency between groups Future Annotation Camps should address this. b) Curated Reference sets for data analysis groups This was discussed previously for item 20 of the Annotation Camp report, 'Policy on curation of every paper available'. We returned to the issue. There are many different ways to do this, ranging from merely stripping out multiple identical annotations, to providing users with a selected annotation set. It was felt that it's potentially dangerous to create a selected set of annotations, and we explored possible ways to do it. Michael suggested that if a gene is annotated to both a term and to its parent term, we could remove the annotation to the parent. However, annotations to parent terms may have better evidence than annotations to child terms, and taking away annotations to parents may remove knowledge: e.g., a protein found in both nucleus and nucleolus would lose its nuclear localization term. Multiple functions or localizations of a single gene product could also be lost. Judy suggested that we table this discussion for next meeting while individual groups think further about the issue; providing a tool for users to generate custom subsets could obviate the need for us to make a decision on this. c) NAS vs. TAS There are differences between databases on the use of NAS. For a statement in a paper describing a direct experiment and referring to "data not shown", Flybase would typically assign NAS evidence, while MGI would assign IDA. 
Suparna explained that at TAIR, if the paper explicitly states 'Data not shown', NAS evidence would be used. But if a paper describes a direct experiment and the author makes a statement on gene function or process based on that experiment, they make a TAS annotation and use an evidence description to record the type of evidence on which it is based. There was a discussion of whether curators should interpret data shown in figures. The general consensus was that it's acceptable to curate something that's obvious in a figure although not explicitly stated, but it's not acceptable to look at data and come to different conclusions than the authors did. However, TAIR curators have occasionally not curated a conclusion (or have tried to contact an author about it) if it doesn't appear to be supported by the data.

d) IDs permissible with IPI
IDs used with IPI should be protein identifiers - Swiss-Prot, TrEMBL, RefSeq - and MOD identifiers should not be used if they refer to a gene rather than a protein. The ID used should lead to a FASTA file of protein sequence. For IGI evidence, gene IDs should be used. This will be difficult for UniProt annotators, but it should also be very rare.

C. Ontology Content Reports and Issues
1. GO Editorial Office Report (Midori Harris)
See above (B1, Reports from Consortium Groups)
2. Reports from GO Content Meeting at Carnegie Sept 19-20, 2004
a) PAMGO (Michelle Gwinn-Giglio)
Problems with current pathogenesis terms:
- pathogenesis is a 'victim-centric' term - expresses the point of view of the host
- many terms missing
- need to capture symbiotic interactions
- not always obvious whether organism is pathogenic, sometimes a relationship can be either pathogenic or symbiotic
The PAMGO proposal would add a general list of host/hostee interactions, and would remove 'pathogenesis'. There was general acceptance of the proposal, but some strong opinions that pathogenesis terms should be retained and symbiosis terms added.
After more research, Michelle came to the conclusion that symbiosis is any kind of relationship in which two organisms live intimately together, and pathogenesis is an instance of this. There are currently three trees that represent these various points of view: 1. contains only general terms (PAMGO) 2. very complex, with separate general, symbiosis, and pathogenesis subtrees 3. symbiosis tree with pathogenesis subtree The PAMGO group favors #1; Michelle favors #3, because it allows for better representation of biofilms and symbiotic relationships not involving hosts. The defense response part of this area is still under discussion; there is even less consensus between plant and animal researchers. David commented that in constructing development trees, issues like this were resolved by adding general terms that both camps agreed on, and then adding child terms acceptable to each group. The general consensus at this meeting is that #3 is best and represents a large improvement over current terms; it can always be refined in the future. The final plan will be circulated and will be implemented unless there's a serious problem. Action item: Finalize the new symbiosis/pathogenesis terms and incorporate them into the process ontology. b) Metabolism (Jane Lomax) Currently, all metabolism terms have 'physiological process' parentage but not 'cellular physiological process' parentage; this is a problem for annotation of gene products of unicellular organisms. For example, carbohydrate catabolism occurs, with different mechanisms, at both the cellular and organismal levels, but the GO term 'carbohydrate catabolism' does not have 'cellular physiological process' parentage. The proposed solution to this problem is to create new children of metabolism: organismal metabolism, cellular metabolism, primary metabolism. 
Some transport terms will be included, where it is an integral part of metabolism - for example, in plants some substances have to be transported between tissues for certain types of metabolism. Consensus: this is straightforward and should be implemented. Action item: Finalize the new metabolism terms and incorporate them into the process ontology. c) 'Regulates' Relationship Type (Midori Harris) Many would like to add a relationship type for 'regulates'. This would have consequences for tools and displays, and would need detailed analysis of the scope of the project. The consensus was that this should be looked into. Chris said that converting AmiGO and DAG-Edit to deal with this relationship type would be straightforward. We should let GO users know about this soon so they have time to change their tools. Action item: Look into the ramifications of adding the new relationship type 'regulates'. Midori will announce this upcoming change to the GO-friends mailing list. d) Cell cycle (Amelia Ireland) Currently, there are separate terms for each phase of mitosis with children describing parts of each phase. This was based on the S. cerevisiae cell cycle and is problematic for organisms with different cell cycles. The proposed solution includes redefining cell cycle phases as processes, and adding more specific terms for cell cycle events. Consensus: the high-level terms are fine, and the child terms may need work. Action item: Proceed to revise the cell cycle node along the lines already established. 3. Lessons learned from content meetings (Midori Harris) - choose topics carefully: topics should be important to many groups and not easily resolvable by email - have GO curators acting as liaisons to outside experts - distribute materials in advance so people are prepared 4. Other Content Proposals for Discussion a) removal of terms This discussion centered on the tension between "ontological purity" and "scruffy necessity". 
Some members of the group feel that too many terms are obsoleted, too quickly; this creates a lot of re-curation work and may adversely affect the way users look at our project. The mandate of the GO editorial group is ontological purity, but perhaps this needs to be re-evaluated. If widely used terms describing well-known gene products (e.g., cytochrome P450) disappear, that does not serve our user community.

Everyone agreed that increased use of synonyms could help this situation. Term names can be made more precise, while commonly used, imprecise terms could be synonyms. Synonyms should be used liberally, and we should improve our use and display of them in various tools.

Some were concerned that 'vague' terms that represent extremely important concepts for biologists (transcription factor activity, chaperone activity, G-protein coupled receptor) have been obsoleted or may be slated for obsoletion. The argument was made that these terms do have clear meanings for scientists, and they represent special cases that need to be preserved. Retaining them may even lead to blurring of the line between function and process, but we should permit this for these special cases.

Chris proposed a less severe form of obsoletion, where a term is deprecated but not immediately removed. David suggested that if all annotations to a term are automatically transferable to its new equivalent, then the obsoletion should not have happened in the first place; the definition of the original term should have been improved, or the term should have been merged with another.

We discussed the specific case of 'chaperone activity', which has been obsoleted. Rama explained that the word 'chaperone' is used to mean three separate activities: transporting something; unfolded protein binding; and unfolded protein binding and re-folding activity. David suggested creating a lexical grouping term to be the parent of all three of these activities.
Amelia thought that this would not make sense and would be analogous to creating a term 'factor activity' to group all 'factors'. Judy suggested that we could include lexical grouping terms in the GO, but tag them in some way to mark them as 'impure'. However, some thought this solution was simplistic, and others thought there would be no point in tagging a term if people would go on using it as before. Amelia suggested that we could create special terms crossing the process/function line, e.g., transcription factor could be a child of process: transcription and function: DNA binding activity. Chris observed that tagging would address the problem of vague terms, while cross-parentage addresses the problem of precise, complex terms.

Summary: We need to use synonyms more aggressively and liberally. We can't achieve purity, so we need to explore options for alternative solutions when they become necessary. We need to write up examples for ways to deal with exceptional terms. There was no consensus on whether lexical grouping terms or cross-aspect terms are a good idea. The group recognizes that there are exceptions that need special attention rather than immediate obsoletion. The definition may need reworking, or we may need to implement special solutions.

b) chemoattractant activity
Rex questioned the obsoletion of 'chemoattractant activity' because he felt that this was a definable function and furthermore, that these molecules have no other function. Amelia said that the major problem with this term was with its definition, since "attracting motile cells" is not a function. Rex argued that the function is more than simple receptor binding, and Harold observed that while measuring chemoattractant activity requires observing a process, it's still a function. The consensus was that we will reinstate the term, with a better definition.
Rex proposed the definitions of chemoattractant/ chemorepellant activity: "Provides a signal to induce positive/negative directional cell movement". The definition and placement in the ontology of the analogous term 'pheromone activity' may provide an example. Action item: Reinstate terms 'chemoattractant/chemorepellant activity'. c) ABC transporters Michelle explained that the old term names were ambiguous and implied particular gene products, which didn't work for bacteria, where several functions reside in separate gene products. The parent term, 'ATPase activity, coupled to transmembrane movement of substances', accurately described the molecular function of all its child terms. One issue with this obsoletion was the sheer number of annotations involved. This has already led to a new procedure for alerting people to obsoletions. Harold was concerned that the obsoletion will lead to loss of information about ATP binding, and that this connection (ATPases bind ATP) should be intrinsic to the ontology rather than accomplished by concurrent annotation. The problem with this is that we would need to create substrate binding terms for all enzyme-substrate pairs, and we have already made an explicit decision not to do this. After much discussion, a consensus emerged that this relationship shouldn't be built into the ontology but should be in a curator check: curators should consider annotating with both terms. There is already a note associated with 'ATPase activity, coupled to transmembrane movement of substances' that says curators should consider also annotating to 'ATP binding'. The narrower-than synonym 'ATP-binding cassette transporter' should also be added to 'ATPase activity, coupled to transmembrane movement of substances'. Judy observed that this is a good example of an issue that needs to be resolved face-to-face. 
Action item: Add 'ATP-binding cassette transporter' as a narrower-than synonym of 'ATPase activity, coupled to transmembrane movement of substances'.

d) Definition of molecular function
Harold pointed out that currently molecular function is defined as an 'elemental activity'. The definition needs to be broadened to include complex functions. There was consensus that this should be done, and Jane agreed to do it.
Action item: Broaden definition of molecular function to include complex functions.

e) RCA evidence code
A new evidence code, RCA (reviewed computational analysis), was proposed to refer to computational analyses that are reviewed and published, and that don't rely on sequence comparison. Examples of this type of study are PMID:14566057 and PMID:12826619. The major reasons that a new evidence code is needed are:
1) The confidence level differs between a "typical" TAS (e.g., a statement in a review, where the review cites other research papers showing evidence for the annotation; generally high confidence) vs. using TAS for computational analyses (generally lower confidence than a statement in a review or introduction of a paper).
2) Computational biologists who use GO annotations in their methods would often like to eliminate annotations based on other computational methods, to reduce the circular argument/proliferation of errors problem. They would not be able to do this if we use TAS.
3) IEA would not be appropriate for these papers because IEA implies the absence of curator input. Thus, IEA, we think, is generally a different type of evidence (and generally lower confidence) than RCA.
The consensus was that we will add this evidence code.
Action item: Add and document new evidence code, RCA (reviewed computational analysis).

D. GO Database and Tools Report (Mike Cherry)
1. GO Database
Soon, all gene association files will be analyzed to remove errors.
This filtering process will take the input file and create a new file in which the following corrections have been made:
- remove IEAs older than a year
- remove annotations to obsolete terms
- make sure GOIDs are valid
- change secondary IDs to primary
- standardize headers
The processed files would be used to generate AmiGO. If this processing were done today, only SGD's and ZFIN's files would not have changes; other groups would have ~10-150 changes per file.

We discussed the issue of whether annotations with IEA evidence, older than one year, should be removed. The consensus was that they should. If groups believe that their IEA annotations are still current, they can review them and update the date yearly so they will be retained.

Chris asked whether users might want to see gene products annotated to obsolete terms - perhaps these should not be removed. There were differing opinions on whether annotations to obsoletes should be retained. On the one hand, they are more informative than the absence of annotation, at least to users who understand what obsoletion means. On the other hand, removing obsoletes would drive annotation, forcing re-annotation of obsoletes to be the highest priority. The consensus was that the process should remove annotations to obsolete terms that are older than a certain (undecided) age. Renee suggested that when a term is obsoleted, it could be replaced with a parent term as a temporary placeholder until manual re-curation can be done.

Mike will give each DB a report of the errors in their files, and after some period of time, we will enforce the filtering.

How would we want to handle one-time efforts? They will go out of date because they don't keep up with ontology changes; however, the checking script could eliminate obsolete terms. These files could be kept in a separate directory, but there was concern that users might not find them. There was no consensus on how to handle this issue.
Action item: Mike Cherry to give each database a list of the errors in its gene association file. Action item: Jen will ask Mike for information about the new checking script and will document the annotation checks on the GO website. 2. AmiGO Mayank has been installing AmiGO at Stanford; AmiGO and GOST are running; will start to switch over from Berkeley to Stanford soon. Mike will send test URLs to Consortium members; when all are satisfied, AmiGO users will be redirected to Stanford. The first step will be to simply replicate what's done at Berkeley; changes will be implemented later. Action item: Continue with and finish the installation of AmiGO at Stanford. 3. DAG-Edit Midori has found some minor bugs in the DAG-Edit 1.419 beta version and has emailed John. The final version should be available soon. 4. Web pages Jen has re-done the tools page. The Advisory Board has suggested that the first page should be simpler; Jen and others will look into that. Action item: Jen will design a simple front page for the GO website that is friendlier for biologists and other newcomers. It should include links to explanatory pages for new users and new annotators. She will also check that the links to SourceForge are working correctly. 5. User Stats AmiGO web pages are being hit 18,000 times per week. Usage has increased. Action item: Jen will fix the AmiGO search box on the front page of the GO website so that either terms or annotations can be searched. Mike Cherry will send her information on how to do this. 6. Documentation No report. E. GO User Support (Michael Ashburner) 1. Legacy annotation sets and what to do with them See section D1 (GO Database) above. 2. Outreach to new groups Several ideas were proposed: - we could have GOC staff member(s) dedicated to assisting annotation efforts. Center for Bio-Ontologies? 
- we could try to work with sequencing centers (JGI, Broad Institute) - we could try to convince program officers of funding agencies of the need to fund functional annotation along with genome sequencing - we could use large meetings such as PAG, ISMB, ASM as opportunities to hold workshops and inform people about GO - we could advertise the GO Users meeting as a place to learn about GO as well as to present its uses. Part of the meeting could be devoted to a GO tutorial. - we should seek out genome databases with which we are not in contact: Bombyx, Xenopus, Maize Jen will be presenting a GO annotation tutorial at the upcoming PAG (Plant and Animal Genome) conference (January 2005). We should make sure that as many databases as possible know about this. Action item: Jen will try to make contact with as many genome databases as possible to make sure they're aware of the tutorial at the PAG meeting. 3. Requests to join GO Consortium, GO Associates idea So far, we have accepted into the Consortium groups that work with us on ontology development and return to us a gene association file. But an increasing number of groups would like to join, and we don't want the GOC meetings to grow to an unworkable size. We could limit the number of people attending from each group, but no one really liked this idea. CGD, MetaCyc, GermOnline, and a toxicogenomics database have asked to join the GOC; except for CGD, these are not organism-specific genome databases. The consensus was that we should establish the status of GO Associate. Associates will contribute annotation files and participate in a GO meeting (equivalent to the current users meeting, broadened to include an educational component); we may invite specific associates to a GOC meeting as they become educated and actively involved in the project. Action item: Document the status of GO Associate and invite interested groups to join. F. SO/GO Development Reports 1. 
OBOL (Chris Mungall) Chris has found 248 missing relationships. They are listed at http://www.fruitfly.org/~cjm/obol. The editorial office reviews them and adds some but not all. OBOL can be used behind the scenes to suggest new relationships or to create new definitions. Action item: We will proceed with using OBOL to make computed definitions of cell differentiation and maintain them in a cell type ontology. Discussion: GO contains implicit orthogonal ontologies, e.g., a chemical ontology. How do we decide which cross-product terms should exist in GO vs. in separate cross-product ontology? Michael suggested that if a term is needed for annotation, then it should be instantiated in GO. But there is concern about 'bloat' making it difficult for annotators to find terms. David suggested creating a separate namespace for extremely specific terms, e.g., mouse development terms. Another suggestion would be to incorporate all the separate ontologies into GO but provide tools for users to filter out terms not relevant to them. 2. SO Content Meeting reports plus SO Development report (Karen Eilbeck) SO Content Meetings were held at Berkeley Aug 22-23, 2004 and at Hinxton, Sept 22, 2004. The structure of SO was changed drastically after the last meeting. The revised ontology, 'so-meeting.obo', is available for comments. MODs are starting to use SO. SOFA content is frozen for 1 year, until next May. 3. OBO (Michael Ashburner) We are getting many requests to add ontologies to OBO, however we can't add contradictory ontologies, and the quality of some ontologies may be variable. Michael proposes splitting the OBO site into subdirectories. The OBO core directory would contain ontologies being worked on by GOC members or which are needed by GOC members for making cross-products with GO; entries must be approved by the GOC. Another directory would contain all other ontologies. 
Action item: Distribute the ontologies at the OBO site into two subdirectories, one containing GO Consortium-approved ontologies in active use by consortium members, and the other for any other ontologies. G. Plan Meetings and Assess Past Meetings The question was raised as to whether we really need to read database reports at the GOC meeting. At this meeting, this consumed a morning and there was relatively little discussion. The consensus was that we will not have each group talk about each report in future meeting (although reports must still be provided). Any special issues or new developments requiring discussion should be submitted as agenda items. WormBase (Caltech) offered to host the next meeting. May and September were discussed as possible times, but it was decided that a year from now would be too long an interval, so May was agreed upon. The suggestion was made to have back-to-back GO Users and GOC meetings, 1 1/2 days each. H. Summary of Action Items from this meeting 1. Action item: Jen will update the "GO People" section of the GO website. 2. Action item: Midori will look into whether the format of SourceForge emails can be customized. 3. Action item: Jen will get the formatted Excel spreadsheet for user submission of GO annotations from TAIR, will modify it to be applicable to any organism, and will put it on the website to help new annotators. 4. Action item: Renee will look into whether Incyte's quality control methods or Pfam-to-GO mappings may be shared with the public GO efforts. If permission is obtained, Jen will put them on the GO website. 5. Action item: Each database must submit a set of 10 papers and accompanying GO annotations (see SourceForge item #1047963 for details). 6. Action item: Each database must submit a README file describing annotation strategy to accompany its gene association file. 7. Action item: Michael Ashburner will make the evidence code hierarchy available at the OBO site. 8. 
Action item: The Editorial Office will incorporate all relevant parts of the Annotation Camp report into online GO annotation documentation. 9. Action item: The Editorial Office will finalize the new symbiosis/pathogenesis terms and incorporate them into the process ontology. 10. Action item: Jane will finalize the new metabolism terms and incorporate them into the process ontology. 11. Action item: Chris and the Editorial Office will look into the ramifications of adding the new relationship type 'regulates'. Midori will announce this upcoming change to the GO-friends mailing list. 12. Action item: Amelia will proceed to revise the cell cycle node along the lines already established. 13. Action item: Amelia will reinstate terms 'chemoattractant/chemorepellant activity'. 14. Action item: Jane will add 'ATP-binding cassette transporter' as a narrower-than synonym of 'ATPase activity, coupled to transmembrane movement of substances'. 15. Action item: The Editorial Office will broaden the definition of molecular function to include complex functions. 16. Action item: The Editorial Office will add and document the new evidence code, RCA (reviewed computational analysis). 17. Action item: Mike Cherry will give each database a list of the errors in its gene association file. 18. Action item: Jen will ask Mike for information about the new checking script and will document the annotation checks on the GO website. 19. Action item: SGD personnel will continue with and finish the installation of AmiGO at Stanford. 20. Action item: Jen will design a simple front page for the GO website that is friendlier for biologists and other newcomers. It should include links to explanatory pages for new users and new annotators. She will also check that the links to SourceForge are working correctly. 21. Action item: Jen will fix the AmiGO search box on the front page of the GO website so that either terms or annotations can be searched. 
Mike Cherry will send her information on how to do this. 22. Action item: Jen will try to make contact with as many genome databases as possible to make sure they're aware of the tutorial at the PAG meeting. 23. Action item: The Editorial Office will document the status of GO Associate and invite interested groups to join. 24. Action item: Chris will proceed with using OBOL to make computed definitions of cell differentiation and maintain them in a cell type ontology. 25. Action item: Amelia will distribute the ontologies at the OBO site into two subdirectories, one containing GO Consortium-approved ontologies in active use by consortium members, and the other for any other ontologies. I. Review of Action Items from last meeting [Stanford] 1. Action item: Eurie and Michael will strive to provide a definition for 'transcription factor activity'. A definition has been considered but not yet incorporated into the ontology. 2. Action item: We will try to set up a pilot project that has a web page "indexing" key point discussions in the GO email archives. Jen has worked on a prototype and hasn't gotten much feedback yet. Judy pointed out that it's important to record key discussions so we don't revisit the same issues over and over again. Jen will work out a way to do it, then people who are actively involved in each discussion could pick the key points to index. 3. Action item: We will add a new qualifier for "Colocalizes with" that is appropriate for indicating that the gene product has been found in the vicinity of a structure. DONE. 4. Action item: Jen will update the documentation for Component rules with discussion of this qualifier and its use. DONE. 5. Action item: Brad and Mike will look into whether it is possible to keep a Google search of the email archive separate from the general Google search of the GO web pages. This is not possible. 6. Action item: groups to investigate if large files, compressing files, will pose any problems at their own site. 
No problems identified. 7. Action item: Mike Cherry to look into how best to interact with WormBookIII to embed GO terms in on-line version of the book. Need to consider 'glossary' approach and how to maintain currency. Mike has talked to Lisa Gerard at Wormbase; this is in progress. 8. Action item: 'sensu' terms will have a mixture of English phrase and Latin genera, along with the taxon ID. The definition of any sensu term would include the point that it is not totally restricted to a particular grouping. DONE. 9. Action item: Proposal: by end of month will post OBO files, curator trials-- use OBO for two weeks to work out bugs, then general switch to OBO as master; More general document for users. Announcement on site. DONE. 10. Action item: We will try to set up a pilot project that has a web page "indexing" key point discussions in the GO email archives. Duplicate of item #2 above. 11. Action Item: "Not" column will be renamed "Qualifier". When it has any other value other than NOT or NULL, it should be used for annotations for components of a complex only. This will allow reason across membership in a component to infer function. Should be checks for complex entries. Subunit will have annotation to a particular subunit activity, if known, or to "contributes_to" and that gene product must also be annotated as a component of complex. e.g., specific example eIF2; has three subunits (alpha, beta, gamma); one binds GTP; one binds RNA. But the whole complex binds the ribosome (needs all three); so all three get "contributes to" ribosome binding, and one gets GTP binding, the other gets RNA binding. AND all three are annotated to EIF2 complex. DONE. 12. Action Item: In column #12 of the gene_association file, "complex" will be allowed as a type of "DB_OBJECT". DONE. Reactome may have used it; no one else has. 13: Action Item: Concepts relating to the use of complex functions (e.g. receptor tyrosine kinase) will be added to the documentation. DONE. 14. 
Action Item: Add Joel Richardson's tool to the tools page. Note language change from Python to Java. (Jen) DONE. 15. Action Item: Document OBO flat file format advantages for annotators (There are none.) DONE. 16. Action Item: Write documentation for the process and component ontologies along the same lines as the function documentation that had already been written. (Jen) Process ontology documentation is in progress; component ontology documentation has not been started. 17. Action Item: Add documentation to remind people that the definition is there to clarify the meaning of the term name if there is any ambiguity. This is to be added to the general documentation as well as to the documentation for each ontology. (Jen) DONE. 18. Action Item: See if there is an easy way to add the date that a definition was made. Not done. There was agreement that we should do this, but we need to work out the specifics (which date, how to display it). Midori will write some specifications and ask for comments, then will add it as a feature request to SourceForge. 19. Action Item: General improvements to GO website. Since the January 2004 Consortium meeting, the following additions and changes have been made to the website: new meetings page added; OBO file format documented; instructions on accessing cvs by ssh added; function ontology documentation added; obsoletion standard operating procedure added; information on the annotation checking script added; page added for acknowledgements of outside experts; evidence code summary table added; menu added to give easier access to SourceForge; documentation added about frequency of updates of GO downloads and mapping files; 'mailto:' links added to site for people in interest groups and databases; tools developer's page added; more detail added to annotation guide; style of website encoded in cascading style sheets and footer fixed. 
===================================================================
GO Consortium Meeting - Pasadena, CA - April 8-9, 2005
[Next Meeting: Berkeley organizing - March 25-29, 2006 (to be confirmed)]

Group Participant List:
SGD (Rama Balakrishnan, Mike Cherry, Karen Christie)
TAIR (Tanya Berardini, Sue Rhee)
MGI (Judy Blake, Alexander Diehl, Mary Dolan, Harold Drabkin, David Hill, Li Ni)
ZFIN (Doug Howe)
RGD (Simon Twigger, Jennifer Smith)
dictyBase (Rex Chisholm, Pascale Gaudet, Karen Pilcher)
GO Editorial Office (Midori Harris, Jane Lomax, Amelia Ireland, Jen Clark)
GOA (Evelyn Camon)
WormBase (Igor Antoshechkin, Carol Bastiani, Wen Chen, Eimear Kenny, Ranjana Kishore, Raymond Lee, Hans-Michael Muller, Erich Schwarz, Paul Sternberg, Kimberly Van Auken)
BioBase (Incyte) (not present)
Gramene (not present)
IRIS (not present)
BDGP (Suzanna Lewis, John Day-Richter, Chris Mungall, Shenqiang Shu)
TIGR (Michelle Gwinn-Giglio)
FlyBase (Michael Ashburner, Russ Collins)
S. pombe/Sanger (Val Wood)
Pathogen/Sanger (not present)
SO (not present)
Reactome (not present)
TGD (Mike Cherry)

TABLE OF CONTENTS
INTRODUCTION AND WELCOME
  Report from GO External Advisory Board (Judy Blake)
ONTOLOGY ISSUES
  Regulator vs. regulation
  "Structural constituent" function terms
  Behavioral Response terms
  Issue about oocyte growth from the Development interest group
  Chemoattractant/chemorepellant
ANNOTATION ISSUES
  Redundancy in the annotations (Mike Cherry)
  Online annotation form
  Allow genes to be entered in the 'with' column with colocalizes_with
  Use of 'with' column for entering species
  "Views" for particular organism sets
  Pure hypothetical proteins
  Inter-annotator consistency
  Evaluation of electronic annotations
  Obsoletion and moving annotations
  Supporting new annotation groups
RESOURCE ISSUES
  Advances in representations and use of cross-products (Chris Mungall)
  OBO-Edit (John Day-Richter)
  Advisory Board and other GO notes (Mike Cherry)
  MOBY namespace issues (Chris Mungall)
BRAINSTORMING SESSION
  GO Home Page
  Home Page prototypes (Amelia Ireland)
  AmiGO: searching and visualization
  GO survey
NEXT MEETINGS
  Annotation Camp
  Ontology Development Meeting
  Grant Meeting
  GO Consortium Meeting
  GO Users Meeting
ACTION ITEMS
  Summary of Action items from last meeting, not yet done [Chicago]
  Summary of Action items from this meeting [Pasadena]
  Review of all Action items from last meeting [Chicago]

INTRODUCTION AND WELCOME
Report from GO External Advisory Board (Judy Blake): Topics that were discussed include annotation coverage and numbers of annotations from the different databases (David Hill). Annotation consistency was a concern and we need to explicitly describe how the Consortium addresses this issue (Judy Blake). Progress on ontology development, including the newly instituted 'ontology workshop' groups, was also discussed (Jane Lomax), and the development of new tools to deal with complexity and integrating other ontologies with GO was presented (Chris Mungall). The progress and impact of GO were presented by Mike Cherry. The metrics included the number of hits to the geneontology.org pages and the number of publications found in PubMed by searching for "Gene Ontology."
In 2004, there were 120 publications of this type, and so far in 2005 there are 90. These are impressive numbers compared to 0 in 1999, especially considering PubMed searches only title and abstract, indicating that GO was integral to the study. The goals for the grant were also discussed (due February 1, 2006). Because funding from the NIH may not increase very much, yet we have ideas for expanding the scope of GO activities, the Advisory Board suggested looking for other possible sources for funding. One possibility is to look at what agencies are funding the research publications that cite the Gene Ontology. Overall, the response from the committee was very positive and they had many suggestions for what to do next. They encourage the Consortium to focus on the core mission and understanding the users and the community. ONTOLOGY ISSUES Regulator vs. regulation: There is redundancy in regulation terms in the process and function ontologies. For example, we have 'regulation of enzyme activity' (process) and 'enzyme regulator activity' (function). Can we move all of this to function or process? There is a proposal to add the new relationship type 'regulates' that would potentially take care of this issue, allowing annotation to anonymous classes of regulation: inhibition, activation, down-regulation, up-regulation. (David Hill, Harold Drabkin) Discussion: _ Is the ontology redundant in function/process? Should all regulation be process? Many of the regulation terms are not defined. Many regulation terms should probably not be function, they should be process. A difficult example is the regulatory subunit of a kinase. Should this be annotated to 'enzyme regulator activity' and also contributes_to 'kinase activity'? You sometimes need to annotate to a function and a process. It is not possible to transfer everything to either function or process. For example, a gene product can be involved in transport (process) but does not have transporter activity (function). 
_ There is a question of what exactly a function is, and what exactly a process is. In the GO documentation, a process is defined as a collection of functions, although a process could potentially be only one function. There are different views on what exactly constitutes a function. One view is that, to count as an activity, a function must have a direct inhibitor and can be thought of as dose-dependent. Example:
_ The opposite view is that everything has a function, even a brick. A brick has no inhibitor; you cannot plot concentration of brick vs. what a brick does.
_ Making the decision to annotate to function or process is often based on how much information is available.
_ Another issue is regulators and regulated gene products. Gene products that act as regulators are also regulated. The is_a and the part_of relationships don't really capture the interactions here. One proposal to help clarify this is a new relationship type, 'regulates.' Regulation terms would have different relationship types in different nodes, for example:
-'regulation'
--is_a 'regulation of cell differentiation'
-'cell differentiation'
--regulates 'regulation of cell differentiation'
_ Overall, people are in favor of the new relationship type 'regulates.' One criticism is that 'is_a' and 'part_of' are very concrete whereas 'regulates' is vague. Also, there were concerns that the ontology would look different if the "regulates" information is found in the relationship rather than in the term name. This objection is unfounded: the ontology and terms will look exactly the same; only the relationship types will change. Another problem is the degrees and kinds of regulation. Some children of positive regulation terms include activation and maintenance, and children of negative regulation are down-regulation and inhibition.
Action item: Look at how the relationship type 'regulates' and positive/negative regulation will affect the ontologies and the annotation files.
[Chris Mungall]
Action item: Look at current annotations to 'enzyme regulation' type terms to see what has been used. [Jane Lomax]
Action item: New relationship type 'regulates'. [John Day-Richter]

"Structural constituent" function terms: [See SourceForge item 1113374] Some children of 'structural molecule activity' are functions but include cellular component information. These terms include 'extracellular matrix structural constituent,' 'extracellular matrix constituent conferring elasticity,' 'extracellular matrix constituent, lubricant activity,' etc.
Discussion:
_ Non-catalytic functions belong in the function ontology. These 'structural constituent' functions are such terms. After examination, people agreed that they all looked like legitimate functions. The problem seems to be the difficulty of describing the activity that we are trying to represent, and nobody was able to suggest better phrasing.
_ 'Structural molecule activity' (GO:0005198) is a legitimate function, although it could be reworded to just 'structural molecule' as it is not exactly an activity. The child terms are problematic, however, because they incorporate cellular component or anatomical terms.
A few of the existing terms could be renamed to remove the reference to the extracellular matrix, thereby generating legitimate children of 'structural molecule':
extracellular matrix constituent conferring elasticity ; GO:0030023
change to: structural molecule constituent conferring elasticity
extracellular matrix constituent, lubricant activity ; GO:0030197
change to: structural molecule constituent, lubricant activity
matrix constituent conferring compression resistance ; GO:0030021
change to: structural molecule constituent conferring compression resistance
matrix constituent conferring tensile strength ; GO:0030020
change to: structural molecule constituent conferring tensile strength
_ The decision was to postpone this until we have the tools to deal with these terms that will allow us to do the decomposition and annotations.

Behavioral Response terms: There is currently no link between the behavior terms and the 'response to' terms, which means there is quite a lot of redundancy. Do we want to keep these terms? We've talked before about how these terms shouldn't be applied to higher animals. For lower animals, should they perhaps be combined with 'response to' terms?
Discussion:
_ Some people felt that behavior is a type of response. After discussion it was agreed that they are different things. For example, 'cell motility' is a behavior, and not a response. 'Chemotaxis' is a response. Some responses are not behaviors, such as 'response to pathogen,' 'inflammatory response,' 'response to UV,' etc.
_ Definitions: 'Behavior' is: "The specific actions or reactions of an organism in response to external or internal stimuli. Patterned activity of a whole organism in a manner dependent upon some combination of that organism's internal state and external conditions." 'Response to stimulus' is: "A change in state or activity of a cell or organism (in terms of movement, secretion, enzyme production, gene expression, etc.)
as a result of the perception of a stimulus." The distinction is whether the whole organism or just certain cells of the organism are responding. Also, some behaviors are independent of stimulus, whereas all responses are dependent on stimulus. Action item: Document the fact that we will systematically put both parents on behavioral response terms where there are terms in both nodes. The terms currently in the ontology have correct parentage. [Jane Lomax] Issue about oocyte growth from the Development interest group: [See SourceForge item 1170007] Chris found a missing relationship using OBOL and wanted to know why we rejected it. [GO:0001555] oocyte growth is_a [GO:0007281] germ cell development We would like to decide whether it is best to have a process term 'development' or to have a collector term 'developmental process' instead. The issue is a philosophical/semantic/logical one. Currently the term is called 'development' but the relationships to child terms are such as would fit the term 'developmental process'. This sets up 'development' as a collector term rather than a process term in spite of its name. The definition of development states that it is: Definition: Biological processes specifically aimed at the progression of an organism over time from an initial condition (e.g. a zygote, or a young adult) to a later condition (e.g. a multicellular animal or an aged adult). Comments: Note that this term was initially 'developmental process' and was renamed. The concept of 'developmental process' was spotted by OBOL as inconsistent since at this high level 'growth' is an is_a child of 'development', whilst the lower level terms 'x growth' terms are part_of 'x development'. For example 'oocyte growth' would be considered to be part_of 'germ cell development'. 
Therefore in creating a standard dag structure for the development node it is impossible to state a consistent relationship between 'x growth' and 'x development' without considering the high level terms to be simple collector terms that are exempt from the rules. Likewise, is the actin cytoskeleton a type of cytoskeleton, or is it a part of THE cytoskeleton? So the question here is: 1) Should the term 'development' be changed to 'developmental process' and documented as a collector term that is exempt from the rules of the standard dag structure applying to all the more specific terms? or 2) Should the term 'development' be redefined so that it is truly a process term and then all its child term changed to be part_of, in keeping with the lower level terms? Discussion: _ From the ontology point of view, it is problematic to have growth R' development and oocyte growth R germ cell development, where R is not equal to R'. _ 'Oocyte' may have a different role relative to 'growth' than 'germ cell' does to 'development,' or development may mean a different thing in both contexts (developmental process vs. development as a whole). If this is the case it should be clearly specified in the definition (and preferably reflected in the name). _ An oocyte is_a germ cell (according to CL) and growth is_a development (according to GO). Is it because oocyte growth doesn't actually refer to developmental growth, but rather to an increase in diameter that is not necessarily related to growth in the developmental sense? Could we have two kinds of growth? 1. 'Developmental growth.' 2. 'Growth' that occurs when development is not taking place. _ When we get to the higher nodes, the distinction between is_a and part_of gets a bit fuzzy and is dependent on how we interpret the definition of something like development. If we interpret it as the collection of processes, then growth is an is_a because it is a developmental process. 
If we consider development the entirety of processes, then growth would be a part_of because it is a part of the entirety of development.
_ This cannot be the case in GO because if 'developmental process' is a collector term then its is_a children must also be collector terms. An analogy is a stamp collection. A stamp can be part_of a stamp collection, but it is not an is_a child to a stamp collection since only another stamp collection can be a type of stamp collection. A stamp cannot be a type of stamp collection.
_ Conclusion: The term 'development' cannot be a collector term; it must be a process term.
_ Next discussion point: Since development is certainly now a process term then if 'growth' was kept as a child of 'development' then it must have the part_of relationship.
_ This would be a problem for other terms since 'development' of all individuals of all species is not really a single process. For example, 'plant development' is not really a part of the grand overarching process of development of all species. 'Plant development' is a type of development.
_ Perhaps the development term could be considered as 'organism development' so that the graph would appear as follows:
[i]organism development
---[i]plant development
---[p]growth
---[p]organ development
_ Dictyostelium undergoes development as an aggregation of cells rather than as a single organism so 'organism development' would not work. How about 'entity development' rather than 'organism development' since that would accommodate Dictyostelium?
_ An alternative suggestion is to move growth outside development, as a direct child of 'biological_process.' This would also fix the problem that some growth events (like certain instances of cell growth), are not developmental events. Developmental growth events would have both 'development' and 'growth' as parents.
Current structure:
[i]development
---[i] growth
------[i]oocyte growth
---[i]oocyte development
------[p]oocyte growth
With new structure:
[i]growth
---[i]cell growth
------[i]oocyte growth
[i]development
---[i]oocyte development
------[p]oocyte growth
_ Relationships are now okay with OBOL.
Action item: Move growth outside development, as a direct child of 'biological_process.'

Chemoattractant/chemorepellant: [See SourceForge item 1052249] At the last meeting, it was decided that 'chemoattractant activity' should be restored. However, there were still objections as to whether this was a legitimate function term. The problem is that the definition of a chemo[attract/repell]ant would involve binding to a specific receptor, setting off a signaling cascade that induces positive or negative chemotaxis. This definition invokes the function of 'receptor binding' and the process of 'induction of positive/negative chemotaxis'. Chemo[attract/repell]ants can be defined without referencing chemotaxis, which is a process. A way to represent these molecules is with new receptor binding terms - 'chemo[attract/repell]ant receptor binding' - in combination with the existing process terms 'induction of [positive/negative] chemotaxis'. Both sets of terms can have chemo[attract/repell]ant as narrower-than synonyms.
Discussion:
_ The argument against this was that there is no way to clearly annotate a chemoattractant that would unequivocally distinguish the chemoattractant itself from a protein inside the cell that would bind the chemoattractant receptor and trigger signaling events.
_ Resolution: Functions will be conceptually divided into two classes, those that involve reactions and those that do not. A number of other terms that were obsoleted based on the fact that they were functions for which activities could not be assigned will be restored.
Action item: A list of function terms to be restored that includes 'chemoattractant' and 'chemorepellent' activities will be circulated.
[Amelia Ireland]

ANNOTATION ISSUES
Redundancy in the annotations (Mike Cherry): To address the presence of redundancy in the annotations, we will be filtering to remove redundant annotations. Everyone will still submit association files, but the GO database will now include a directory that contains filtered files. Some of the requirements for filtering:
_ Every row in the file must be correct. If it's not, it will not be included. You will be notified if it's incorrect.
_ If the GO ID is a secondary ID, it will be replaced with a primary ID.
_ If the GO term is obsolete, it will be removed.
_ If an IEA is more than 1 year old, it will be removed.
In addition, each species/taxon ID will have only one "authority" MOD, and all other annotations for that species will be filtered. If another source wants their annotations to appear in this filtered file, they will coordinate with the MOD to have their annotations included in the set submitted from the MOD. This will not be implemented for a month or two.
Action item: The information regarding the requirements for implementation of the filtering of annotations will be sent to the GOC members. [Mike Cherry]

Online annotation form: Should we allow bench scientists to submit annotations? Many researchers have annotations that would be useful to have. If we allow submissions from non-Consortium/non-database groups, how should we do it? What tools should we use? Do the annotations need to be reviewed by curators?
Discussion:
_ We should definitely allow researchers to submit their annotations. These submissions could be sent to GOA, who will send them to an appropriate MOD. The database can submit that data in their association file, citing the source of the annotation.
_ There are groups, such as the chicken group (Fiona McCarthy), that come to us for mentoring. We need to talk further about training of other groups.
_ Many tools for annotation exist; only TAIR and TIGR have tools for user submission.
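Illustration (not part of the minutes as recorded): the filtering requirements listed under "Redundancy in the annotations" above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the script that was or will be used. The field names, the lookup tables, and the function name are hypothetical; "correct" is approximated here as "all required fields are present"; and the "authority" rule is assumed to be checked against the group that assigned the annotation.

```python
from datetime import date, timedelta

# Hypothetical field names; a real gene association file is tab-delimited
# and has more columns than these.
REQUIRED_FIELDS = ("db", "object_id", "go_id", "evidence",
                   "taxon", "date", "assigned_by")


def filter_annotations(rows, secondary_to_primary, obsolete_ids,
                       authority_by_taxon, today):
    """Split annotation rows into (kept, rejected).

    rows: list of dicts, one per annotation; "date" is a datetime.date.
    secondary_to_primary: dict mapping secondary GO IDs to primary GO IDs.
    obsolete_ids: set of GO IDs of obsolete terms.
    authority_by_taxon: dict mapping a taxon ID to its single authority.
    rejected is a list of (row, reason) pairs, so that submitters
    can be notified about what was dropped and why.
    """
    one_year_ago = today - timedelta(days=365)
    kept, rejected = [], []
    for row in rows:
        # Assumption: a row is "correct" if every required field is filled.
        if any(not row.get(field) for field in REQUIRED_FIELDS):
            rejected.append((row, "incorrect row"))
            continue
        # Replace a secondary GO ID with its primary ID.
        go_id = secondary_to_primary.get(row["go_id"], row["go_id"])
        if go_id in obsolete_ids:
            rejected.append((row, "obsolete term"))
            continue
        if row["evidence"] == "IEA" and row["date"] < one_year_ago:
            rejected.append((row, "IEA older than one year"))
            continue
        # Only the authority for a taxon contributes rows for that taxon.
        authority = authority_by_taxon.get(row["taxon"])
        if authority is not None and row["assigned_by"] != authority:
            rejected.append((row, "not from the authority for this taxon"))
            continue
        kept.append(dict(row, go_id=go_id))
    return kept, rejected
```

Returning the rejected rows together with a reason is one way to support the requirement that submitters be notified of incorrect rows; how notification would actually be done was not specified at the meeting.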
Action item: Suzi wants to hear from everyone about what tools they have for annotation. Send to Shu at Berkeley. [Everyone] Action item: Ask GMOD if there are tools available for user submission of annotations. Allow genes to be entered in the 'with' column with colocalizes_with: There is a proposal to use the 'with' field to describe the qualifier when using 'colocalizes_with.' An example in which MGI would like to use it is when a paper localizes a gene product to the lysosome by colocalization with Lamp1, a marker for lysosomes. (David Hill) GeneX | GO: lysosome | IDA | colocalizes_with | Lamp1 Discussion: _ Should we use this column in a new way? The 'with' column refers to the evidence code, not to the qualifier. Perhaps we are muddying the use of this column by using it in all different ways. The use of the colocalizes_with qualifier is already confusing for annotators, and this is adding another layer of complexity. _ Additional information is not necessarily being added if the db object in the 'with' column is already annotated. _ Part of the rationale for doing this is that we are not 100% confident that the gene product is actually in the lysosome. However, this is the way that many highly studied proteins were originally localized, through a colocalization experiment. _ The idea of adding a new column to the association file was suggested. See also discussion on species in the 'with' column. Action item: Add a SourceForge item for the issue of using the 'with' column when using colocalizes_with. [David Hill] Use of 'with' column for entering species: TAIR would also like to use the 'with' column in a new way, adding taxon IDs of bacteria for terms like 'response to bacteria' and 'response to fungi.' The use case we have in mind is capturing which organism was used in an experiment where plants are subjected to bacterial attack and then respond. The pathogen discussion group is quite enthusiastic about this. 
Discussion: _ If this is implemented, you can no longer do ISS with another database object in the 'with' column. Entering gene products and tax IDs are two totally different concepts. _ Another important question is whether this is in the scope of GO. _ MGI has a detailed notes field separate from GO from which they can also retrieve data and where they keep this type of information. _ Using the 'with' column is not necessarily the best way to do it. Perhaps this should be a new column in the annotation file if people want to capture taxon IDs and use the 'with' column to describe the qualifier. Action item: Write up a proposal on how to capture information such as taxon IDs (describing the GO term) and proteins that colocalize_with other proteins (describing the qualifier). [John Day-Richter, Chris Mungall] "Views" for particular organism sets: The prokaryote and plant groups are using a subset of GO terms that are specific to their organisms. The resolution is to just leave it; it is not being maintained and will not be updated unless the file is required for specific purposes. Discussion: _ We should not encourage people to use only a subset of GO. Some groups do in fact filter out portions of the ontologies to which they cannot annotate (for example, removing 'chloroplast' for animal annotation). However, this is for annotation purposes and is not intended for viewing the ontologies. _ The groups that are using these "views" are using them because they have resource issues and a tight time constraint. People argue that 7,000 terms is not that much better than 18,000 terms. Pure hypothetical proteins: What should annotators do about purely hypothetical proteins with no sequence similarity to anything and have no possible GO terms? The former guidelines were that these genes would not get annotated at all; should we continue doing that, or annotate them to "unknown", with the RCA evidence code? 
(Michelle Gwinn-Giglio) The consensus is that this is not an appropriate use of RCA. Function/Process/Component "unknown" should be used here with the ND evidence code. GO is not for determining whether a gene is real or not. Moreover, there have been significant improvements in the quality of the genome assemblies since the GO project started. Inter-annotator consistency: We need to ensure consistency between annotators and document the way each database addresses annotation consistency. This is important for the grant renewal. (Mike Cherry) Discussion: _ Comparison of curators (MGI vs. UniProt mouse annotation) showed that there are many discrepancies between curators (approximately 50% of the time), however, all curators are over 90% accurate. This means that the vast majority of annotations are correct but they are also incomplete. _ At SGD, curators pair up monthly to review papers and discuss the possible GO annotations. Measures like this and attendance at the GO annotation camp help ensure consistency. However, because of the difference in resources (i.e., number of curators), each database needs to set their own standards and document their protocols for annotation. Action item: Add information about inter-annotator checks in the README files and in the progress reports. [Everyone] Action item: Inter-annotator consistency will be discussed further at the GO annotation camp. Evaluation of electronic annotations: Several users have been interested in the reliability of electronic annotations, and there is only one publication, still in press, that addresses this question (Camon, E.B. et al. 2005. BMC Bioinformatics 6 (Suppl 1):S17). What has been done to assess electronic annotations? Can we say something on the website to answer this kind of question? (Jen Clark) Discussion: _ There have been a few studies for different tools. TargetP is 80-90% correct in its predictions but does not work well for plants. 
EC2go and InterPro2GO are quite good and have 91-100% precision. _ This is something we should definitely look into and post on the GO website. Action item: Write up a summary of the reliability of electronic annotations and reference the paper that talks about this. Add this to the FAQs. [Jen Clark] Obsoletion and moving annotations: At the last meeting, we agreed that terms do not need to be obsoleted if the definition change was meant only to improve the wording and did not change the way annotators use the term. There are cases, like the new definition of morphogenesis and development, where only a few annotations need to be changed. The question was whether it was okay to keep the term and ask the annotators to verify their annotations and change them where appropriate. Discussion: _ The argument for doing this is that it is simpler. _ The group generally argued against that: if the definition changed sufficiently such that annotators need to verify or change their annotations, then the term needs a new ID. This is important because not only do the GO Consortium members need to be aware of this; every GO user needs to be alerted to the change, and the only way to do it is by obsoleting the term. _ The GO editorial office will go back to obsoleting or merging if the annotators need to check their annotations. They will only redefine without obsoleting if there is no possibility of a mistake. In the case of the morphogenesis terms, this means that 'x morphogenesis' terms defined as 'development' will need to be merged into the corresponding 'x development' term and a new 'x morphogenesis' term made. Action item: Document what happens when a term/definition changes enough such that annotators need to modify annotations, using development/morphogenesis as an example. [Tanya Berardini, Jen Clark] Supporting new annotation groups: _ Tutorials at meetings have so far been quite effective. The PAG meeting generated a lot of interest in GO.
We should be on the lookout for opportunities to hold tutorials at specialty meetings, particularly in areas where the ontology needs to be developed. _ Direct mentoring is also very effective. Fiona McCarthy (chicken annotation) visited MGI for two weeks and generated a full set of electronic annotations. This tutorial turned into much more than just GO topics; much time went into general management of an MOD. _ The Annotation Camp last year was very successful, and registration for the second camp looks to be about 40+ annotators. These people will be split into smaller groups for discussion. The meeting is June 1-4 at Stanford. _ Genomes for infectious organisms are a big focus for the NIH right now. We should try to send a "trainer" to the NIAID Bioinformatics Resource Meeting. Action item: Work with the BRC centers (NIAID) to provide any support they want. Try to tie in a tutorial with one of their group meetings. [Judy Blake] RESOURCE ISSUES Advances in representations and use of cross-products (Chris Mungall): Currently there are many cross-products in the GO. Many of them we would like to keep but many are redundant. (For example, cysteine metabolism and sulfur amino acid biosynthesis. The reasons for this are historical; GO preceded the chemical ontology ChEBI.) To successfully create cross-products there must be consistency amongst the different ontologies. Chris has been looking at the cell ontology (CL) and the cross-products that can be made with GO. There are around 800 cell types in GO, whereas the CL has 700 terms, 300 of which do not have matches in GO. Differences between CL and GO: _ Definitions: Example: T cell vs. T lymphocyte. The GO definitions are not consistent with the CL develops_from relationship type. _ Inconsistent structure: Example: hemocyte, plasmocyte, lamellocyte. _ Granularity: The CL sometimes has less detail (e.g., retinal cone cell vs. photoreceptor cell). More commonly, the GO has less detail (e.g.
neuroblast, cell proliferation). The question is, though, do we need to create more GO terms only if we annotate to them? _ Naming style: Differences in hyphenation (B cell), suffix (CL: neuron, GO: neuron cell). _ Missing synonyms. _ Spelling: Example: oesinophil. _ Different relationships: Some terms are part_of in one ontology and is_a in another ontology. OBOL parses terms to find OBO terms embedded inside the GO terms and finds inconsistencies between GO and OBO. The new approach is to remove dependency on text analysis and augment the GO to integrate information from other ontologies. Chris showed a few examples of how the ontology file would look with the CL terms integrated. The relationship types need better names: currently they are intersection_of and has_output. GO: 30183B-cell differentiation ; GO: 30183 intersection_of cell differentiation ; GO: 30154 intersection_of B lymphocyte ; CL: 236 Why should we do this? _ Makes GO more computable: easier to maintain ontologies and consistency between ontologies. _ Integrates ontologies such that you can query across ontologies. _ Automates inconsistency detection: batch mode, dynamically from OBO-Edit. _ Explores new browsing and navigation paradigms. For example, when looking at the DAG for 'lymphocyte differentiation,' show 'cell differentiation' (GO) and 'lymphocyte' (CL) in adjacent windows. A prototype for browsing cross-products was generated using MGI's structured notes fields that show cell types, anatomical structures, cell lines in which the gene product was expressed, and developmental stages for the GO annotations. _ Filters dynamically: allows us to thin out complex parts of the ontology. Proposed plan: _ Integrate GO with CL to begin with. The next logical ontology to integrate after CL is ChEBI. Eventually integrate anatomy ontologies but these will be the hardest. _ Create cross-product information using OBOL and integrate it with GO. 
Add the cross-product information to the GO terms but don't do cross-products for everything. _ Synchronize the ontologies. _ Iterate and evaluate. _ Look at cross-products of process and process. OBO-Edit (John Day-Richter): _ OBO-Edit has a new plug-in that allows generation of cross-products. To create a cross-product, first load two or more ontologies and, using the plug-in, select a core term and drag and drop the terms to be crossed into the cross-product window. Next, drag and drop a relationship type into the box, add a property (for example, 'has_output'), and create a rule for generating the cross-product name (for example, '$1 $2' = term 1, space, term 2). When the cross is executed, the new term(s) are created with new ID(s). The term name can later be edited, and any operation you would do on a 'normal' term can be applied. Clicking the "+" on the right side of the plug-in window allows you to add an unlimited number of terms. _ OBO-Edit has new editing modes: for example, when merging terms, dragging a term brings up a menu that lets you choose what to do with the term. There are also new keystrokes for functions using letters (M = merge). _ OBO-Edit has a major improvement in speed, due entirely to displaying fewer paths in the DAG viewer. _ OBO-Edit has "instances": instantiated versions of the classes we have been creating. Allows instance browsers, for example different sequence. _ OBO1.2 file format changes the way synonyms are handled. One can specify synonym type. This new feature makes the OBO-Edit-generated files incompatible with DAG Edit. One can open OBO files with OBO-Edit, but should not commit them to CVS. Advisory Board and other GO notes (Mike Cherry): _ Web usage for GO web site (filtered for robots, images): GOC home page had close to 1 million hits within the past year; GO database had over 4 million hits. _ More than 5000 different IP addresses. _ Most users from USA, Europe, then Asia.
_ Publications: 0 papers in 1999; 169 papers mentioned "Gene Ontology" in 2004; 98 as of April 9, 2005. This is particularly impressive because PubMed searches only titles and abstracts, indicating that GO is an important aspect of the study. _ Citations: the two most cited GO Consortium papers have been cited over 1000 times, which is extremely high. _ Number of links reported by Google: >7200 pages link to geneontology.org. _ Links from NCBI = approx. 46,000; links from EBI = approx.14,000. _ AmiGO will be run from Stanford as of May 1, 2005. The database is built weekly. _ Datafiles download site will change slightly (web address). MOBY namespace issues (Chris Mungall): To synchronize our data with MOBY, we must fix two things in our association file. First, we need to be consistent with the authority in column 1. Sometimes Flybase is Flybase, but sometimes it is FB. Similarly, SGD is sometimes SGD, at other times, SGDID. Second, we need to modify our global IDs to represent the namespace. For example, the NCBI IDs for gi and PubMed ID are indistinguishable. One way to fix this is with the use of LSIDs (Life Science Identifier, http://lsid.sourceforge.net/). With an LSID, this ID: PMID: 12571434 becomes this LSID: urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434 where 'urn' stands for Uniform Resource Name. Discussion: _ Using the LSID allows you to represent more information. It is also a way to help unify IDs globally, not just within GO groups. If the LSID is going to become a life sciences standard, perhaps we should move towards this ID. The LSIDs don't necessarily need to be shown on the web pages but can be used for storing IDs. _ One of the problems with the LSID is that you cannot automatically generate a URL from it; you need a resolver. Also, some long-standing IDs are not amenable to LSIDs, such as EC numbers. _ For the time being, it is sufficient that everyone is aware of this issue. MOBY is not yet ready for us. 
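As a rough illustration of the two fixes described under "MOBY namespace issues", the sketch below normalizes the column 1 authority names and rewrites a prefixed ID as an LSID in the pattern shown above. The function names, the alias table, and the choice of "FlyBase" and "SGD" as the canonical spellings are assumptions for the example, not agreed Consortium conventions; only the PMID mapping is taken from the minutes.

```python
from typing import Mapping, Tuple

# Assumed canonical spellings for the column 1 authority names.
DB_ALIASES: Mapping[str, str] = {
    "FB": "FlyBase",
    "Flybase": "FlyBase",
    "SGDID": "SGD",
}

# Prefix -> (authority, namespace). Only the PMID entry comes from the
# minutes; any further entries would need to be agreed separately.
LSID_NAMESPACES: Mapping[str, Tuple[str, str]] = {
    "PMID": ("ncbi.nlm.nih.gov", "pubmed"),
}


def canonical_db(name: str) -> str:
    """Map a variant database name to one agreed spelling."""
    return DB_ALIASES.get(name, name)


def to_lsid(prefixed_id: str, namespaces: Mapping[str, Tuple[str, str]]) -> str:
    """Rewrite 'PREFIX: local_id' as 'urn:lsid:authority:namespace:local_id'."""
    prefix, _, local_id = prefixed_id.partition(":")
    authority, namespace = namespaces[prefix.strip()]
    return f"urn:lsid:{authority}:{namespace}:{local_id.strip()}"


print(canonical_db("FB"))
print(to_lsid("PMID: 12571434", LSID_NAMESPACES))
```

Keeping the namespace explicit is what removes the ambiguity noted above between NCBI gi numbers and PubMed IDs. Going the other way, from an LSID to a URL, would still need a resolver, as the discussion points out.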
_ Whatever the solution, it needs to satisfy GO and the community of users, so we need to get a list of our requirements. Simon Twigger will be at the MOBY meeting in May, so he can talk to Mark Wilkinson about some of the ways to solve this ID problem. Action item: Talk to Mark Wilkinson about LSIDs and come up with a proposal for how we might make changes to our IDs. [Chris Mungall] BRAINSTORMING SESSION GO Home Page: This brainstorming session was to get ideas for improving the GO website. _ We should consult with biologists and web designers for help on this. _ Reactome has a very nice front page. Was designed by professionals. _ You still need to know who the users are. _ A survey of colleagues who are not involved with GO, see what they think. _ Less text, more graphics. _ Look at some of the other homepages and see what we like and don't like. Reactome, Google, etc. There are many examples of how not to do it. _ Send around your favorite 5 home pages. Doesn't need to be bioinformatics. Needs to have similar content types. _ Done this at TAIR. Worked with a professional, fed them ideas about needs for biologists, and they came up with 3 different options. _ http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html talks about evaluating webpages. _ http://webpagesthatsuck.com/ _ What would Amazon do? We're trying to sell our product. Amazon is successful and uses certain principles. We should try to use these principles. _ What is your community? Website should be for community. Market research: what is community and what are their needs? A lot of this will flow from that. Lots of documentation on the website but no one reads it. _ In paper, there are 10 points... if you are text miner, do this... etc. _ Look at what parts of the website are used most and focus on those. _ Do online survey of users. _ If somebody comes to a page, goes to another page, and jumps back, can we look at that? _ How are the web pages used? _ We need to target biologists.
They have no idea what it is about. Need to encourage use by biologists. _ How can we better serve? _ Look at failed queries. Figure out what people are looking for but can't find. _ Of the sites linking to GO, can we look at where they are coming from and who is linking the most? _ Objections to the current page are that the front page is kind of random. They are important things but not well arranged. _ Things have changed over time. What we did before is not the same so we need to adapt. _ "The goal of gene ontology is to do controlled vocabulary"... so now what? _ Menus on left are very hard to navigate. _ Searching is hard for documentation... always get hits from email archive. _ The kinds of questions we ask are different from other people. _ We are more persistent. We know things exist and work harder to find stuff. Other people don't know it's there and will give up. _ We need to clarify who the documentation is intended for: new users, advanced users, etc. Not much distinction between annotation and editing the ontologies. _ We need a direct way to get to annotation parts and ontology development. Home Page prototypes (Amelia Ireland): 1. New index page: mission statement needs to be in 10th grade style. Needs to be shorter and catchier. Search box: should be right at the top. Can search by gene/protein name or by gene/protein description. Should set up AmiGO so that you don't have to click any options for term/name. If you are biologist and type in your gene name, you won't get it. Section for popular links: this could come from usage stats. Links could change from time to time. Maybe "What would you like to do?" on home page. Sections: for annotators, for biologists, for programmers, etc. News bulletins: put news on separate page. Recent ontology changes: more news. Site info: stats for website, grant information. Left menu: open menus, home, downloads, GO tools, documentation, get involved, curator guides, about GO, terms of use, contact GO, site map.
can be opened up. Annotator/curator: confusing to people. Ontology editing, annotations... what to use? What makes sense to biologists? Search boxes: are they ever used on pages other than the home page? Many say yes. Search boxes should stay on every page. 2. Quickstart guide: Don't want to read through whole manual, just go and do it. Four main categories. Get a gene with annotations. Find gene products associated with process, function, component. General introduction to GO. Three types of people: annotators, ontology editors, consumers. Need to aim at consumers. Simple start - search, short statement, and quickstart. Usage stats: basis for quickstart. 3. Simple page: quickstart, downloads, tools, get involved, contact. (People like this one.) 4. Pictures: small icons next to each broad topic. (People really like this one.) 5. Separate by users: New users + Advanced users. (People like the previous one better.) 6. Separate by main topics: tools, downloads, etc. 7. I want to: know more about what GO is... see the def of a GO term... others... Discussion: _ Don't want to come to a page and see text. Too long. _ At ZFIN, we have a process where we sit down as a group. What does the webpage have to do? What is the goal? Do paper mock-ups of what it should look like. What can we get rid of and what do we need? Take it across the street to people in the labs and ask them to use it. Ask them: What do you expect when you click here? What is important to you? What is missing? Commonly: always have search box at top or at the right. Doug will send some criteria to the GO list. _ Need LESS text, even if it is high school level. _ Frequently cited paper: "a tool for the unification of biology" - possible tag line for webpage. _ What we need to focus on is the process. Who will be responsible? How will we get feedback? Who will take time to work on this and report back to everyone? _ The reason to have this discussion now is that all of you need to feel empowered.
Everyone's ideas are important. We're not designing this for ourselves, it's for others. Who are we trying to reach? _ Can we send out a message to GO list and ask what do you need from the GO page? What would make it easier to use? _ We can't be limited by that. There are many more users than there are on the email list. _ Maybe the survey would be good because it will reach more users. _ Have a link at the bottom of pages that says, "Did you find what you are looking for?" _ We should keep track of failed searches, and use that information to assess what users are looking for. _ Access to the database is limited. The static pages are only supporting material. Action item: To help in the new design of the GO home page, send examples of web pages that we like, and pages that we do not like. [Everyone] AmiGO: searching and visualization: This brainstorming session was to get ideas for improving the AmiGO browser. _ Current behavior seems non-intuitive to some people: when you click on a term, a new page opens. _ Perhaps the search can retrieve a list of terms rather than a DAG. _ Synonyms are not displayed prominently enough. Many users search on a synonym and get to a page that has different information. _ The search should integrate annotations and ontologies better. The SGD search tool, for example, searches gene names, gene products, GO terms, etc. _ A new functionality would be the ability to search all ontologies in OBO. With the cross-products this will happen anyway. How will cross-products affect the speed of AmiGO? We may need to think about new ways to display the data. Chris Mungall presented some ideas addressing that issue in his talk about cross-products. _ Everyone agreed that the ontologies must be updated more frequently. Currently updated every month; at Stanford it is currently updated every week. Is that enough? Should we have daily updates? Annotations can be updated less frequently. _ Should the different databases look more similar?
One way to start would be to use the same ontology browser. What would it take for everyone to use the same browser? (People agree that ontology data need to be up-to-date.) _ AmiGO and GOC "look and feel" could be more similar; currently it's not clear they are the same thing or part of the same thing. _ Ideas of tools for annotators: term suggestion tool (for example, Incyte/Biobase GO pairs); the Wormbase tool OrthoGO lists annotations from InParanoid calculated orthologs; also, a way to transfer GO terms into GO annotation tool. _ GOst-type tool for ESTs; the problem is about the number of sequences a user would submit. _ Some databases have their own GO browsers; we need to know what improvements they have made over AmiGO. GO survey: To better understand who the GO users are, what information they are looking for when they come to the GO/AmiGO web pages, and how to better target the users we are trying to reach, it was suggested that we do surveys when we go to meetings. Suggested questions include: _ Have you heard of GO? _ What do you think it is? _ Why do you use GO? _ Has it worked for you? _ What do you do? (arrays, etc.) There may be two surveys, one for beginners and one for more advanced users. Action item: Write up a survey that we can take to meetings. Try to minimize effort by putting it online with easy analysis. Put the link in your MOD newsletter. [Mike Cherry, Everyone] NEXT MEETINGS Annotation Camp: June 1-4, 2005, Stanford. Ontology Development Meeting: October/November, 2005, TIGR. Focus on immunology with transport side. We have $15,000 support for this meeting from anonymous sources. Grant Meeting: November (possibly November 17-18, 2005), Banbury? Washington, DC? Meeting of the PIs and others. GO Consortium Meeting: Possibly March 25-29, 2006, St. Croix. Berkeley will organize. Also plan to do outreach in Puerto Rico. GO Users Meeting: September 14-15, 2005. Integrated with MGED 8 meeting in Bergen, Norway. 
ACTION ITEMS Summary of Action items from last meeting, not yet done [Chicago]: 1. Action item: Jen will get the formatted Excel spreadsheet for user submission of GO annotations from TAIR and will put it on the website to help new annotators. Jen has this from TAIR; not on website yet. 2. Action item: Renee will look into whether Incyte's quality control methods or Pfam-to-GO mappings may be shared with the public GO efforts. If permission is obtained, Jen will put them on the GO website. Don't know; Incyte (now called BioBase) not present. 3. Action item: Each database must submit a set of 10 papers and accompanying GO annotations (see SourceForge item #1047963 for details). 3 or 4 databases have done this. 4. Action item: Each database must submit a README file describing annotation strategy to accompany its gene association file. Some have done this; Mike will warn people. 5. Action item: The Editorial Office will broaden the definition of molecular function to include complex functions. Draft written by Midori; will include non-activity functions. 6. Action item: Jen will ask Mike for information about the new checking script and will document the annotation checks on the GO website. In progress? 7. Action item: SGD personnel will continue with and finish the installation of AmiGO at Stanford. Almost done. 8. Action item: The Editorial Office will document the status of GO Associate and invite interested groups to join. In progress; haven't decided on description of GO Associate. 9. Action item: Chris will proceed with using OBOL to make computed definitions of cell differentiation and maintain them in a cell type ontology. In progress. Summary of Action items from this meeting [Pasadena]: 1. Action item: Look at how the relationship type 'regulates' and positive/negative regulation will affect the ontologies and the annotation files. [Chris Mungall] 2. Action item: Look at current annotations to 'enzyme regulation' type terms to see what has been used.
[Jane Lomax] 3. Action item: Add new relationship type 'regulates'. [John Day-Richter] 4. Action item: Document the fact that we will systematically put both parents on behavioral response terms where there are terms in both nodes. For example, behavioral response to ether. The terms currently in the ontology have correct parentage. [Jane Lomax] 5. Action item: Move growth outside development, as a direct child of 'biological_process.' [GO Office] 6. Action item: A list of function terms to be restored that includes 'chemoattractant' and 'chemorepellent' activities will be circulated. [Amelia Ireland] 7. Action item: The information regarding the requirements for implementation of the filtering of annotations will be sent to the GOC members. [Mike Cherry] 8. Action item: Suzi wants to hear from everyone what tools they have for annotation. Send to Shu at Berkeley. [Everyone] 9. Action item: Ask GMOD if there are tools available for user submission of annotations. 10. Action item: Add a SourceForge item for the issue of using the 'with' column when using colocalizes_with. [David Hill] 11. Action item: Write up a proposal on how to capture information such as taxon IDs (describing the GO term) and proteins that colocalize_with other proteins (describing the qualifier). [John Day-Richter, Chris Mungall] 12. Action item: Add information about inter-annotator consistency in the README files and in the progress reports. [Everyone] 13. Action item: Inter-annotator consistency will be discussed further at the GO annotation camp. 14. Action item: Write up a summary of the reliability of electronic annotations and reference the paper that talks about this. Add this to the FAQs. [Jen Clark] 15. Action item: Document what happens when a term/definition changes enough such that annotators need to modify annotations, using development/morphogenesis as an example. [Tanya Berardini, Jen Clark] 16. Action item: Work with the BRC centers (NIAID) to provide any support they want. 
Try to tie in a tutorial with one of their group meetings. [Judy Blake] 17. Action item: In the bibliography section of the GO web site, make an option to sort by year or by topic. [GO Office] 18. Action item: Talk to Mark Wilkinson about LSIDs and come up with a proposal for how we might make changes to our IDs. [Chris Mungall] 19. Action item: To help in the new design of the GO home page, send examples of web pages that we like, and pages that we do not like. [Everyone] 20. Action item: Write up a survey that we can take to meetings. Try to minimize effort by putting it online with easy analysis. Put the link in your MOD newsletter. [Mike Cherry, Everyone] 21. Action item: Investigate running GO Users Meetings as satellites of other conferences (not necessarily GO Consortium Meetings), both because of timing and to reach different/broader audiences. Contact the organizers of the MGED meeting (Bergen, Norway) to set up an adjacent Users Meeting. [Midori Harris] Review of all Action items from last meeting [Chicago]: 1. Action item: Jen will update the "GO People" section of the GO website. DONE. 2. Action item: Midori will look into whether the format of SourceForge emails can be customized. Can't be customized. Unresolvable. 3. Action item: Jen will get the formatted Excel spreadsheet for user submission of GO annotations from TAIR and will put it on the website to help new annotators. Jen has this from TAIR; not on website yet. 4. Action item: Renee will look into whether Incyte's quality control methods or Pfam-to-GO mappings may be shared with the public GO efforts. If permission is obtained, Jen will put them on the GO website. Don't know; Incyte (now called BioBase) not present. 5. Action item: Each database must submit a set of 10 papers and accompanying GO annotations (see SourceForge item #1047963 for details). 3 or 4 databases have done this. 6.
Action item: Each database must submit a README file describing annotation strategy to accompany its gene association file. Some have done this; Mike will warn people. 7. Action item: Michael Ashburner will make the evidence code hierarchy available at the OBO site. DONE. 8. Action item: The Editorial Office will incorporate all relevant parts of the Annotation Camp report into online GO annotation documentation. DONE. 9. Action item: The Editorial Office will finalize the new symbiosis/pathogenesis terms and incorporate them into the process ontology. DONE. 10. Action item: Jane will finalize the new metabolism terms and incorporate them into the process ontology. DONE. 11. Action item: Chris and the Editorial Office will look into the ramifications of adding the new relationship type 'regulates'. Midori will announce this upcoming change to the GO-friends mailing list. DONE. 12. Action item: Amelia will proceed to revise the cell cycle node along the lines already established. In progress. 13. Action item: Amelia will reinstate terms 'chemoattractant/chemorepellent activity'. No one responded to SourceForge item; terms will be reinstated. 14. Action item: Jane will add 'ATP-binding cassette transporter' as a narrower-than synonym of 'ATPase activity, coupled to transmembrane movement of substances'. DONE. 15. Action item: The Editorial Office will broaden the definition of molecular function to include complex functions. Draft written by Midori; will include non-activity functions. 16. Action item: The Editorial Office will add and document the new evidence code, RCA (reviewed computational analysis). DONE. 17. Action item: Mike Cherry will give each database a list of the errors in its gene association file. DONE. 18. Action item: Jen will ask Mike for information about the new checking script and will document the annotation checks on the GO website. In progress? 19.
Action item: SGD personnel will continue with and finish the installation of AmiGO at Stanford. Almost done. 20. Action item: Jen will design a simple front page for the GO website that is friendlier for biologists and other newcomers. It should include links to explanatory pages for new users and new annotators. She will also check that the links to SourceForge are working correctly. Amelia presented some prototypes; in progress. 21. Action item: Jen will fix the AmiGO search box on the front page of the GO website so that either terms or annotations can be searched. Mike Cherry will send her information on how to do this. DONE. 22. Action item: Jen will try to make contact with as many genome databases as possible to make sure they're aware of the tutorial at the PAG meeting. DONE. 23. Action item: The Editorial Office will document the status of GO Associate and invite interested groups to join. In progress; haven't decided on description of GO Associate. 24. Action item: Chris will proceed with using OBOL to make computed definitions of cell differentiation and maintain them in a cell type ontology. In progress. 25. Action item: Amelia will distribute the ontologies at the OBO site into two subdirectories, one containing GO Consortium-approved ontologies in active use by Consortium members, and the other for any other ontologies. DONE. =================================================================== GO Consortium Meeting St.
Croix, USVI March 31 - April 2, 2006 [Next meeting: November, 2006, Hinxton or Marseille (to be determined)] GROUP PARTICIPANT LIST BDGP/SO John Day-Richter, Karen Eilbeck, Suzanna Lewis, Chris Mungall, Shu ShengQiang DictyBase Rex Chisholm FlyBase Michael Ashburner, Susan Tweedie GeneDB Pathogen Matt Berriman GeneDB Pombe Val Wood GOA Daniel Barrell, Evelyn Camon GOEO Jennifer Clark, Midori Harris, Amelia Ireland, Jane Lomax MGI Judy Blake, Alex Diehl, Mary Dolan, Harold Drabkin, David Hill NCBO Barry Smith PAMGO Candace Collmer, Trudy Torto-Alallibo Reactome Lisa Matthews RGD Susan Bromberg SGD / CGD Mike Cherry, Karen Christie, Stan Dong, Stacia Engel, Eurie Hong, Marek Skrzypek TAIR Tanya Berardini, Sue Rhee TIGR Linda Hannick, Michelle Gwinn-Giglio WB Ranjana Kishore, Kimberly Van Auken ZFIN Doug Howe Friday, March 31, 2006 Principles of Biomedical Ontology Design Barry Smith, Department of Philosophy, University at Buffalo, National Center for Biomedical Ontology Barry's talk was divided into five sections: the first four covered general ontological issues, while the fifth was devoted to GO specifically. There are many other ontologies besides GO, each with various pros and cons regarding their construction. The least sloppy ontology is the Foundational Model of Anatomy (FMA), characterized as follows: Pros: * Clear statement of scope, we know what it is: human structural anatomy * Powerful, proper (formal) treatment of definitions - most important feature * Single inheritance is_a hierarchy - an objective good * From the whole organism to the biological molecule Cons: * Some unfortunate artifacts in the ontology deriving from its specific computer representation (Protege) - FMA was built manually, but there came a point where it was too big to be maintained manually. [An ontology should never contain entities dictated by your programmer. Needed to include non-anatomical terms to make it work with the program; this is bad.] 
Formal Aristotelian Definitions An A = Def. a B which Cs Parent which differentiates in this way. This is also why single inheritance is good; there is only one place to look. For example: cell = an anatomical structure which consists of cytoplasm, etc. There are many circular definitions in GO. These are useless. Not bad, just useless. Every single definition should tell you where in the is_a hierarchy the term belongs. Every definition is an encapsulation; they give you the content you need in a modular form. An ontology has to be designed both for human beings and computers. Terms used in definitions should be simpler than the term to be defined. Many of GO's existing definitions have this problem. (FMA 90,000 terms - it would be a nice discipline for the GO to represent terms in this diagram, because this lets you know if you have terms that don't go anywhere.) The Gene Ontology is characterized as follows: Pros: * Open source * Cross-species * Impressive annotation resource * Impressive policies for maintenance * Has recognized the need for reform Cons: * Poor formal architecture * Poor support for automatic reasoning and error-checking * No cross-ontology relations * Not (yet) transgranular Granularity is one very important challenge for bioinformatics. If data comes in granular packages, then ontologies must be granular. GO doesn't give a basis for reasoning in organized granularity. GO can make big strides forward without changing the content, i.e. distinguish cellular from physiological process. GO deals with definitions in a way that is worse than useless. Logically speaking, they are total nonsense. For example, GO: hemolysis of host red blood cells is defined as: The processes by which an organism effects hemolysis, the lytic destruction of red blood cells with the release of intracellular hemoglobin, in its host. This sort of definition is worse than circular. David: What if there is a parent that defined hemolysis? Barry: This would be fine. 
GO is now adopting structured definitions which are built out of genus and differentiae. For example, GO: neuronal cell differentiation - differentiation by which a cell acquires features of a neuron. Michael raised the issue that, in the past, Barry has criticized the GO because it has complex terms in the definitions. But, some chemical terms are inherently complex, right? Barry's response: Definitions should use terms that are less complex than the term itself. You're going to have to produce a computer-friendly version of these definitions. If possible, you'll need to produce a human-friendly version, as well. Judy pointed out that we are also dealing with community vocabularies that were constructed with different concerns. Barry's response: We need to move to a new kind of paradigm. We don't need to allow their terminology to thwart current ontology efforts. Barry: The problem is that the UMLS accepts any group (community) developed ontologies without worrying about their quality. Another example, the National Cancer Institute Thesaurus (NCIT): Pros: * Open source * Broad coverage * Some formal structure (OWL-DL) * OWL-DL: a collection of languages used by WWW, DL maximally expressible formal logic that is still computable * Has realized the error of its ways (a good ontology needs a more expressive language than DL.) Cons: * Full of errors (many inherited from UMLS) * Has verbal definitions * Has logically incompatible definitions * Confuses definitions with descriptions Goals: to make use of current terminology best practices to relate relevant concepts to one another in a formal structure, i.e., to support automatic reasoning. Of 37,261 nodes, 33,720 remain formally undefined, while about half have verbal definitions, sometimes more than one, e.g., disease progression. This assumes that people already know what is meant. Three verbal definitions are logically incompatible. For example, cancer is defined as a process and an object. 
Disease definitions treat them as a condition and a process. Like the GO, the definitions here get confused with descriptions. The NCIT recognizes three classes of plants and three kinds of cells. How best to deal with this? Barry's response: Generally, use of of_a (use of 'other') is bad practice. There are three kinds of cells in the NCIT that do not overlap: * Abnormal cell - top-level class * Normal cell - is a subclass of 'microanatomy' * Cell is a subclass of 'Other anatomic concept' (so that cells themselves are concepts) Neither abnormal nor normal cells are types of cells. Another example, the UMLS Semantic Network, an upper level ontology for the biomedical domain: Pros: * Broad coverage * No multiple inheritance Cons: * Incoherent use of 'conceptual entities' * Relationship: location_of For example, 'fungus location_of vitamin' - what does this really mean? Every instance of fungus located in some vitamin? Every instance of fungus is located in every vitamin? Should be: every instance of A is such that there is some instance of B. General Ontological Overview Good ontologies require a consistent use of terms, supported by logically coherent (non-circular) definitions and a coherent shared treatment of relations in equivalent human-readable and computable formats. There are Three Fundamental Dichotomies: * Continuants vs. Occurrents * Dependents vs. Independents * Types vs. Instances ONTOLOGIES ARE REPRESENTATIONS OF TYPES, NOT INSTANCES. Types exist to bind different communities, and this is precisely what is missing from the UMLS where the terms are all different and produced by different groups. Types are sometimes called kinds, universals, categories, species, genera, etc. GO has three ontologies: * Molecules, cell components, organisms are independent continuants which have functions. * Functions are dependent continuants that become realized through special sorts of processes we call functionings. 
* Processes (occurrents) include: functionings, side-effects, stochastic processes Continuants (aka endurants) have continuous existence in time. They can gain or lose parts, i.e. preserve their identity through change, but they exist in toto whenever they exist at all. Snapshots of continuants (you, 3D) Occurrents are never wholly there. They unfold themselves in successive phases and exist only in their phases. Videos of occurrents (your life, 4D: 3D + time) How should complexes be treated? There are special problems that arise in the world of molecules..... Dependent entities require independent continuants as their bearers, e.g., there is no grin without a cat. Independent continuants are such things as organisms, cells, molecules, and environments. Dependent continuants are things like qualities, function, or spatial region. All occurrents are dependent entities. They are dependent upon independents. The basic ontology has three things: * Independent continuant: component * Dependent continuant: function * Occurrent: process Two families are occurrents: functioning vs. side-effects, stochastic processes. Michael asked if it makes sense to have a process of temperature regulation as instantiated in a person? David's response: that would be a great way to define a biological process in the GO. Midori asked if independent continuants can represent entities when something is removed instead of added. Barry's response: Yes, but.... Some dependent continuants are realizable, such as 'expression of a gene', 'applications of a therapy', 'course of a disease', or 'execution of an algorithm.' Functions vs. functionings. The function of your heart = to pump blood into your body. The OBO Foundry There is a movement in the NIH to try to avoid waste of data and to try to encourage reuse of data. There are very generously funded projects to try to serve this need and this is money down the drain. They're not proactive; they accept the data that is thrown at them. 
You will never make data interoperable unless you actively pursue that. Old strategy: UMLS - rooted in faithfulness to the ways language is used by different communities. Each community created their own terminology and structure independently. We need common, enduring ways of organizing biomedical data. We need preestablished, reference ontology upon which groups can draw. This can indeed help make data interoperable. New strategy: OBO foundry - preemptive regimentation of language, structure, and format. We are making progress on the first two. Draft version can be found on: http://obofoundry.org/ The goal is a step-by-step evolution. In time, the OBO foundry ontologies will be so recognizably good that groups will enforce use of specific terminology in reporting results. The OBO foundry will be initiated by a subset of ontologies who agree to a core set of principles. OBO Foundry * OBO-UBO * GO * SO * RNA ontology * PATO * FuGO (Functional Genomics Investigation Ontology) * Some others The OBO foundry will consist of two kinds of ontologies: a reference ontology and an application ontology, like NCIT or the FuGO ontology. The reference ontologies will provide a repertoire of database schemas to use. Criteria for inclusion: the ontologies must be open, must agree to collaborate, must have common formal language, identifier space, versioning, clearly delineated content, textual definitions, well-documented, and a plurality of independent users. Further criteria will be added over time to begin to improve quality. The main non-trivial step forward is the adoption of methodology of shared, coherent, defined definitions which promotes quality control, guarantees automatic reasoning, and yields direct connection to temporally indexed instance data. Types and Instances We've now seen a distinction between types and instances: science text vs. clinical document, man vs. Michael. Instances are not represented in an ontology. We're interested in generalizations. 
Nevertheless, instances must still be taken into account. Instances are divided into types which are arranged hierarchically. Once you've got the types in order, the instance becomes less important. But....they should always be in the back of your mind. Each node in an ontology should consist of a term, an identifier, synonyms, and a definition. An ontology is a computable representation of biological reality. When people talk about concepts, they are expressing a fundamental confusion. There are terms in your ontology and types in reality. Nodes are connected by relationships. We're trying to capture reality; that's why we curate the scientific literature. We want to teach the computer to reason about biological reality like we do. The computer can't read science texts, so the annotator is finding a way to teach the computer how the terms fit the instances. The computer should have up-to-date knowledge. There are some rules on types. Don't confuse types with words, concepts, ways of getting to know types, etc. Once you have a good word for a type, you should use that term forever more. John asked what the problem is with concepts. Barry's response: Concepts encourage inward thinking. is_a and part_of should be used the same way in all ontologies referring to the same types and relations. There is no type non-mammal, non-membrane, other metalworker in New Zealand. Ontology of terms is NOT equal to a logic of terms, e.g., there are no conjunctive and disjunctive types * anatomic structure, system, or substance * musculoskeletal and connective tissue disorder * rheumatism, excluding the back Which types exist in reality is not a function of our knowledge. Rex asked about the musculoskeletal example. What if there was a disorder that affected both musculoskeletal and connective tissue? Would you add the word 'both' to the term? 
One point we need to remember is that we wouldn't deny that something exists, but we need to think about how these terms are put together to avoid confusion. The solution here: 'disorder affecting both musculoskeletal and connective tissue.' The word both makes a difference. It's never wrong to be painstakingly literal. John asked what the basis is for saying something is or is not a type. Precoordination vs. post-coordination. We can observe instances of a type, and we should examine terms in the ontology with and/or to see if they really represent a type. Should we provide an AND search union or intersection? What do people really want? In the world of instances, there are clear boundaries, but there are also continuums, e.g., temperature, color, bowls, cups. This means that there is a necessary element of conventionality to how you divide the continuum. Multiple Inheritance All multiple inheritance can be unpacked into clear, separate hierarchies. There are technologies for normalizing hierarchies, and you can generate any combination of normalized hierarchy. Using breast cancer as an example: breast cancer can have a parent term neoplasm and a parent term disease of the breast. These could be split into two ontologies: one classified as location, the other as manifestation of disease. Most of the 'diamonds' can be cleanly unpacked into two different hierarchies, and then one can map between the two; this is the way this should be done. Problems with multiple inheritance: * source of errors * encourages laziness * serves as obstacles to integration with neighboring ontologies * hampers use of Aristotelian method of definitions Compositionality The meanings of compound terms should be determined by the meanings of the simpler terms. Common rules allow alignment with other ontologies. There are 15 such rules, which can be sent around, if wanted. If we have rules stated, then it's easier to train, avoid mistakes, and classify. 
But most of all, if all the ontologies use the same rules, then those ontologies become automatically aligned with each other. The Gene Ontology is useful because lots of people use it. We want lots of people to use the GO and thereby use the cell ontology, the SO, etc. OBO Relation Ontology The relation ontology consists of formally defined relationships. An ontology comprises terms with well-defined relationships and good definitions. is_a Correct definition of is_a: Every instance of A is an instance of B. A is_a B = def for all x, if x is an instance of A, then x is an instance of B. Occurrents: the is_a definition works fine for occurrents. Continuants: only continuants change. This means that continuants need to take time continuously into account. Every instance of A at time t is an instance of B at time t. This is being a little more careful about time. In the ontology, we're only ever going to say things about time and instances. part_of part_of as a relation between types is more problematic than is standardly supposed. There are two kinds of part_of: relations between types and relations between instances: human heart part_of human and Mary's heart part_of Mary. This is incredibly important if you want to avoid mistakes. All-some structure: all instances of A are instance-level parts of some instance of B. This works in the untensed sense of processes. But continuants need to take some account of time. How to use the OBO Relation Ontology? The all-some form gives us cascading inferences because if you have all-some form, whichever A you choose as the first term, the instance of B of which it is a part will be included in some C, which will include as part also the A with which you began. The same principle applies to the other relations in the OBO Relation Ontology. What about something that occurs only sometimes as part_of a process? Barry: That's okay. There are three kinds of relations: between types, between instances and types, and between instances. 
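The is_a and all-some part_of definitions above can be illustrated with a small sketch. This is only a toy under stated assumptions, not part of the OBO Relation Ontology itself: the instance names (marys_heart, myocyte_1, and so on), the dictionaries, and the function names are all hypothetical, and time is ignored, so it corresponds to the untensed reading only.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy data: the known instances of each type.
instances_of: Dict[str, Set[str]] = {
    "cardiac myocyte": {"myocyte_1"},
    "human heart": {"marys_heart", "johns_heart"},
    "heart": {"marys_heart", "johns_heart", "dog_heart"},
    "human": {"mary", "john"},
}

# Hypothetical instance-level part_of facts, written as (part, whole).
instance_part_of: Set[Tuple[str, str]] = {
    ("myocyte_1", "marys_heart"),
    ("marys_heart", "mary"),
    ("johns_heart", "john"),
}


def type_is_a(a: str, b: str) -> bool:
    """A is_a B: every instance of A is an instance of B."""
    return instances_of[a] <= instances_of[b]


def type_part_of(a: str, b: str, parts: Set[Tuple[str, str]]) -> bool:
    """All-some form: every instance of A is part of some instance of B."""
    return all(
        any((x, y) in parts for y in instances_of[b])
        for x in instances_of[a]
    )


def transitive_closure(pairs: Set[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Close the instance-level part_of relation under transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


closure = transitive_closure(instance_part_of)
print(type_is_a("human heart", "heart"))
print(type_part_of("human heart", "human", instance_part_of))
print(type_part_of("heart", "human", instance_part_of))
print(type_part_of("cardiac myocyte", "human", closure))
```

In this toy data, 'heart part_of human' fails because one heart instance belongs to a dog, which is the point of the all-some reading. The last call shows the cascading inference: once the instance-level relation is closed under transitivity, 'cardiac myocyte part_of human heart' and 'human heart part_of human' give 'cardiac myocyte part_of human'.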
You need to keep these three kinds of relations always in mind. There's no constraint on single or multiple inheritance for part_of relations. You can't define everything. You have to take some terms and relations as primitive. We now need to deal with continuity. The human body is very highly connected. This means that you have parts which have no joints between them. This means that there are physical boundaries, but that there are also fiat boundaries that are not physical boundaries, but boundaries that we create by fiat. There is continuity, attachment, and adjacency. Practically the only things in the body not connected to other things in the body are blood cells. Sample relationships: * attached_to * synapsed_with There is also attachment, location, and containment. In order to understand containment, you need to understand the different kinds of holes. Containment involves relations to a hole or cavity, e.g., a hole that you dig in the ground has a fiat lid, your mouth has a fiat boundary. This is why you need to distinguish between instances and type relations. A continuous_with B is different for instance and type. continuous_with is not always symmetric. Every lymph node is continuous with some lymphatic vessel. adjacent_to is also not symmetric. This is important because there is an expectation of symmetry. transformation_of: child becomes adult, pre-RNA becomes mature RNA. Always think about the order! derives_from: zygote derives_from ovum and sperm. Two instances become one instance. Budding and capture are two other relations that need to be considered. A biological example of capture might be eating (not general agreement about this). There is a suite of defined relations between types: foundational, spatial, temporal, participation. To be added are: lacks, dependent_on, quality_of, functioning_of. Alex pointed out that lacks relates an instance and a type, e.g., this fly lacks wings. Is this explicit for this relation? Barry: Yes. 
What would be an example of quality_of? Barry: temperature. We must choose the relations that we can assert. Comment: There are lots of instances in biology where there are multiple ways to get to a particular state. How do you address that in an ontology? We're going to need a pathway ontology at different levels of granularity. We will need much cruder ontologies for pathways that will take care of every level of granularity. The Gene Ontology The Gene Ontology is composed of three ontologies, or so it thinks, with three central questions that need to be addressed: location, function, and process, and three granularities: cellular, molecular, and organ + organism. GO has cells, but it does not include terms for molecules or organisms within any of its three ontologies, except e.g., GO: xxx host which was a hack. OBO-UBO will provide top level terms, so you would choose the term host from the UBO. Host is kind of a relational term, but UBO has the facility to talk about this. Instance - a particular entity in spatiotemporal reality. Type - A general kind instantiated by an open-ended totality of instances which have certain qualities and propensities in common of the sort that can be documented in scientific literature. Biological process instance - A change or complex of changes on the level of granularity of the cell or organism, mediated by one or more gene products. Molecular function instance - The propensity of a gene product instance to perform actions, such as catalysis or binding, on the molecular level of granularity. Molecular function execution instance, aka "functioning": a process instance on the molecular level of granularity that is the result of the action of a gene product instance. Are the relations between functions and processes a matter of granularity? Molecular functions are defined as the building blocks of biological processes, but you do not assert part_of between ontologies. 
Michael pointed out that this was a very conscious design choice. You must get relations between molecular and higher level terms correct. What does function mean? To say that an entity has a biological function means that it's part of an organism and has a propensity to act reliably to contribute to survival. A better definition would be: function means it's part of an organism and has a disposition to act reliably in such a way as to contribute to the organism's canonical life plan. Does this exclude the idea of abnormal? What is canonical vs. variance vs. pathological? There are biological functions and there are molecular functions? Are all molecular functions biological functions? The function of the heart is to pump blood. But you can have malfunctionings, side-effects, accidents, and background stochastic activity. (Examples?) These things exist on all levels of granularity. If you do not have a prototype of good function, then you do not have a function. Where you have a function, then you have a scale: heart, healthy heart, unhealthy heart. What about cases like sickle cell, where there is positive selection, but only in some environments? Response: We will need an ontology of biological environments, niches, habitats. A 'reliable' term has built into it the idea of a certain environment. The sickle cell example is really about two different functions: oxygen carrying and malarial resistance. Can variant be thought of as an intermediate between canonical and pathological? For example, most left lungs have only two lobes, but three lobes is a variant. How does this relate to instances and types? There are no pathological functions. Malfunctions lead to pathology. We're going to have to recognize variant functions. Why did we introduce variants? Because functions always come with a scale. It only makes sense to talk about functions with a prototype function. Functions are associated with certain characteristic process shapes. 
If it's true that there is always a prototypical end to the function, then it follows that there are no bad functions. Hypothesis: there are no 'bad' functions. It is not the function of an oncogene to cause cancer. Oncogenes were in every case proto-oncogenes with functions of their own. They become oncogenes because of bad (non-prototypical) environments. Comment that even using the terms oncogene and proto-oncogene involves pathology. Response: Talking about function is part and parcel of talking about pathologies. Functions are non-pathological. Is this true for molecules? Yes. Is it true on all levels of granularity? Does it make sense to talk about a pathological molecule? An oncogene would be an example. Comment: What about hypersensitivity? At the cellular level it results in cell death. This is good for the organism, but is it good for the cell? Response: An immune response is a response at the biological level which includes functions that may be good for the organism, but not good for the cell. Some things may even be good for the population on the whole. But, we need to recognize that there is a huge amount of thought in population genetics and we should be very careful about how we speak about this. Are there any exceptions to the definition of molecular function? Response: I don't believe there is an exception for molecular functions. They always make a contribution to the canonical life plan. There is frequent discussion about the use of evidence from pathological molecules to inform what is represented in the GO. Like the FMA, the GO is a canonical ontology. That's why thinking about a variance could be important. What does canonical mean? What does normal mean? The gene ontology is a canonical ontology, a computational representation of the ways in which genes normally function. You need to think carefully about what this means. The FMA is a canonical representation, a computational representation of types and relations. 
Granularity There are two kinds of causality: successive causality Each stage in the history of a disease presupposes the earlier stages. In this case, we need to reason across time, track the order of events in time. We need pathway ontologies on every level of granularity. We especially need these things for the disease level. simultaneous causality Illustrated by Boyle's law. Two things happen simultaneously. It's not about events, it's about changes. (compare Boyle's law: a rise in temperature causes a simultaneous increase in pressure.) Networks are continuants. At any given time, there are networks existing in the organism at different levels of granularity. Changes in one cause simultaneous changes in all the others. Generally speaking, when you're dealing with organisms at coarser grains, you're dealing with networks at higher levels. We need ontologies of networks at the molecular and at higher levels, e.g., digestive system - simultaneous causality. But there is a granularity gulf in the way data is collected, i.e., for most existing data sources, there is a fixed, single granularity. However, many clinical phenomena cross granularity. The GO consists of three ontologies: MF and BP are dependent, while CC is independent. If we normalize, then we realize that we're missing the independent bearers, such as organism and complex. Judy raised the issue that we may need separate ontologies for cell and physiological processes. Are many existing terms cross-products of these two? Yes, consider cell differentiation. But aren't cellular processes dependent upon molecular function? Barry's response: Harping on about granularity. Do the coarser grains as much service as the finer grains. That's why we need a disease ontology. GO has cellular components, but we've never had anatomy terms. 
Normalization of granular levels is key: molecule, molecular function, molecular process; cellular component, cellular function, cellular process; organism, organism-level biological function, organism-level biological process. What about hosts? Would we need to go up a level? Response: Host is an organism, is this not included in the GO? Does there need to always be an example, or an instance, of each level? There are likely to be situations where the molecular function and the process are the same thing. Annotation Methodology Scientific curators use experimental observations reported in the literature to link gene products with gene ontology terms - actually observe instances. Is it true that they're always looking at typical instances? The annotations yield a slowly growing map of biological reality. If done properly, this institutes a virtuous cycle, and the bigger and better the ontology becomes. What we're doing when we're annotating: an experiment is an instance from which we infer facts about types. But, we also learn about the instances acted upon. The instances described are typical in the sense that there's nothing interfering with them that would mess up the conclusions, i.e. this is not an artifactual experiment. Experimental records document a variety of such instances; they document the existence of real-world molecules that have the potential to execute. Annotations will help determine what is typical and what is not. We have a glossary now: Instance - a particular entity in spatiotemporal reality. Type - A general kind instantiated by an open-ended totality of instances which share certain qualities and propensities in common of the sort that can be documented in the scientific literature. Gene product instance - Generated by expression of a DNA sequence, that plays a role. Biological process instance - A change or complex of changes on the level of granularity of the cell or organism, mediated by one or more gene products. 
Cellular component instance - Molecular function instance - The propensity of a gene product to perform actions, such as catalysis or binding, on the molecular level of granularity. Types are trivial once you know the instances! Molecular function execution instance, (aka, "functionings"): a process instance on the molecular level of granularity that is the result of the action of a gene product instance. Type - a type of molecular function execution instance (aka, a type of functioning). Should 'activity' be dropped from Molecular Function terms? Pros: * functions are never activities (they are propensities, potentials) * many functions are never realized * current remedy is ugly, and not universally acceptable structural constituent of bone Cons: * much renaming work would be needed to advance clarity As soon as you try to state carefully what annotators are doing, the activity term messes things up. Suzi pointed out that this illustrates that additional relationships need to be added in order to go from function to process. Barry's response: One problem that needs to be addressed is that if you have a relationship between molecular function and biological process, then it will look like you are appeasing yourself. Jane also pointed out that there is a conceptual problem in that there is a mixture of functions and activities in the GO, e.g., catalysis vs. transcription factor. Rex explained that one of the reasons that 'activity' was added was to distinguish between the gene product and the activity, e.g., DNA polymerase vs. DNA polymerase activity. Barry's response: activities should go into a molecular activity ontology, while functions should go in a molecular function ontology. We need to create a new level of clarity in the way people think and speak. Can we replace activity with function, since they're all functions when they're under that branch of the tree? If we searched and replaced with function, does this work? 
Sometimes the user community mistakes gene products for function. If a potential user of the GO has a protein which has a molecular function characterized as alcohol dehydrogenase, but is not known as such, isn't this confusing to people? One possible solution would be to have names of molecules classified according to function, delete the word 'activity' and then make sure there are no strange terms. Also, do we want a molecular functioning ontology? Does this parallel a molecular function ontology? Judy: We cannot change the nomenclature of gene/gene products, this is fixed. Alex: I am worried about removing 'activity' since it is commonly used. John: This is just lazy grammar, 'activity' is what it does, not what it has, can be changed. David: This is more common in the biochemical world, than anywhere else. ACTION ITEM #1: Very seriously consider removing the word 'activity' from the molecular function terms and consider renaming the molecular function ontology. Principles for Building Biomedical Ontologies : A GO Perspective David Hill, MGI David's talk was centered around the idea of taking the principles that Barry talked about and discussing how we've applied them in GO. Most of the big arguments that GO has had address issues that Barry raised. The principles for building a good biomedical ontology are as follows: * Univocity - word means the same thing * Positivity - not a membrane is not a good term * Objectivity * Single Inheritance * Definitions - formal, written definitions * Basis in Reality * Types vs. Instances * Ontology Alignment The Challenge of Univocity One of the first problems that GO dealt with is that different people in different communities use words differently. So, how does a computer know what people are talking about? In GO, we dealt with this by creating primary terms and many characterized synonyms. Another challenge is illustrated by the question, what does a bud mean? The answer is different for vertebrates, plants, and yeasts. 
This is now the inverse of what we had before: people use the same words to describe different things. So how did GO deal with that? The computer doesn't know the difference, so that was when we decided to create the sensu designations for terms. These describe the term in the case of metazoans, fungi, etc., as the biologists in the field think about it. Synonyms are incredibly important, as we had to have terms that biologists will search on. One question that arose: how to represent the function of something like a tRNA? It's term: triplet codon-amino acid adaptor activity. But no biologist is ever going to search on this term. That's why we created tRNA as a synonym. What would happen if we removed activity from this term? Not all tRNAs could then be annotated to it. The Challenge of Positivity Sometimes absence is a distinction in the biologist's mind. Some organelles have membranes around them, but some, like centrioles, do not. So the GO came up with two types of organelles: membrane and non-membrane bound. Note the logical difference here between 'non-membranous bound organelle' and 'not a membrane bound organelle.' We don't want the latter, as it signifies everything other than what we're talking about. Alex: Biologists would understand 'non-membrane bound organelle'. Rex: A better term is needed. The Challenge of Objectivity For some gene products, we have no idea what they do. Database users want to know if we don't know anything (exhaustiveness with respect to knowledge) so the unknown terms were created. Annotating to these terms means that an annotator looked at all the literature, and we don't know what the function is. But consider 'G-protein coupled receptor, unknown ligand' - there is no difference between this term and the parent term 'G-protein coupled receptor.' In this case, instead of using a term that incorporates 'unknown,' we should annotate to the parent term. 
How should we annotate genes that we assume have some function, but we don't know what that is? This will be especially critical for annotating the reference genomes. We will want to annotate to molecular function and use the ND evidence code (see discussion on Sunday). Single Inheritance This is something that is going to be really hard for the GO, due to incompleteness in the graph, as well as other reasons. GO has many is_a diamonds. Technically the diamonds are correct, but they could be eliminated. This would involve choosing one term as the primary parent, while the other would be derived. What do these pairs have in common? * Locomotory behavior vs. larval behavior * GTPase regulator activity vs. enzyme activator activity * Non-membrane bound organelle vs. intracellular organelle All of these terms differentiate from the parent by a different factor - type of behavior vs. what is behaving. One side is a type of behavior, the other is location. Conceptually, we can insert an intermediate groups term, a descriptive behavior term, behavior of a thing. This is no better but allows you to figure out why you have this diamond. Chris: There are more complex diamonds in the GO, where the result is not a cross-product. Alex asked why you couldn't reason with this type of structure? As a biologist, it seems reasonable. Could we just annotate to two different terms: larval behavior and locomotory behavior? Does this get users to the same information? Would doing this make it more difficult for annotators to always know all the right terms to select? Judy pointed out that there is a danger here in having dependency on annotation to represent a term that maybe should be in the ontology. There is also some feeling that annotating to each term separately does not capture all of the same information. 
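As an illustration of the is_a diamonds discussed above, the following minimal Python sketch looks for terms that have two is_a parents sharing a common ancestor. The toy graph, the child term 'larval locomotory behavior', and the function names ancestors and find_diamonds are all assumptions made for illustration; none of this is taken from GO software or from the ontology files.

```python
from itertools import combinations

# Toy is_a graph: term -> set of direct is_a parents (illustrative terms only).
IS_A = {
    "behavior": set(),
    "locomotory behavior": {"behavior"},
    "larval behavior": {"behavior"},
    "larval locomotory behavior": {"locomotory behavior", "larval behavior"},
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a edges upward."""
    seen = set()
    stack = list(is_a.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, ()))
    return seen


def find_diamonds(is_a):
    """List (child, parent_a, parent_b, shared_ancestors) for each is_a diamond."""
    diamonds = []
    for child, parents in is_a.items():
        # Compare every pair of direct parents of the same child term.
        for a, b in combinations(sorted(parents), 2):
            shared = ancestors(a, is_a) & ancestors(b, is_a)
            if shared:
                diamonds.append((child, a, b, sorted(shared)))
    return diamonds


for diamond in find_diamonds(IS_A):
    print(diamond)
```

On this toy graph the only diamond reported is the one closing at 'behavior'. A report like this could help decide which parent to treat as primary, but it says nothing about which differentiating factor (type of behavior vs. what is behaving) each parent represents.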
Having Good GO Definitions There are two types of definitions: * definition written by a biologist (the one we all see when it's written out) * definition given by where the term is placed in the graph We all strive to make the former definitions necessary and sufficient. The latter are also necessary, but they are not often sufficient. The set of necessary conditions is determined by the graph, and the graph is only considered a partial definition dependent upon placement (selecting the appropriate parents) and relationships. Requests for new terms, thus, need to consider all relationships. Consider, for example, a proteasome complex, which may be part_of the cytosol, part_of the endoplasmic reticulum and part_of the nucleus. Does this mean that logically it has to be a part_of each place all the time? Chris' response: It's valid to not be a part of these parents. But, if it's present in the ER, cytosol, and nucleus and moves between these places, is that still valid? No. It could be wrong if it's not part of each thing. For example, red blood cells wouldn't have the nuclear proteasome complex. If the same complex is found in two places (no difference in the composition of the complex), are we double annotating if we annotate to both? The feeling is that curators should annotate to both types. At present, the definition means that if something is a cytosolic proteasome complex, then it is, at some point, part_of the cytosol. Barry stated that this is fine for an is_a relationship, but part_of always means part_of. Chris: We're not making that strong a statement with our part_of complex. Barry: Are you talking about the part_of relationship in the relationship ontology of the GO? Which definition of part_of is used? How do you account for motion? Earlier when this issue was discussed, we said that it had to be part_of sometimes. Is this true for all ontologies? What about the cytokinesis example?
Where should we put it: part of mitosis, part of cell cycle? We know that this is not true for all situations all of the time. The Importance of Relationships We annotate to regulatory subunits of catalytic activities. But, we didn't know how to express the regulation of that catalytic activity, so we need a new relationship for regulates. This new relationship type would address the idea that this gene product impinges upon, but doesn't actually have, the catalytic activity. There was some discussion of the extent to which the regulates relationship should be used. As an example, suppose that you have a wise father who regulates the behavior of a child who, in turn, regulates the behavior of her pet dog. Does the father also regulate the behavior of the dog via the daughter? What about a kinase regulating a transcription factor that regulates some other activity? Or MAPK signaling pathways - does a MAPKKK regulate a MAPK? There are probably many examples in biochemistry where A regulates B which regulates C, but we only know about A and C. The real issue here, then, is what is the definition of regulation: directly regulates or indirectly regulates? GO Textual Definitions: we strive to create similarly structured, normalized definitions, e.g., glial cell differentiation. For process terms, David would like to propose that every process is defined by a beginning and an end. Does this work for all situations? Can we make extremely explicit process definitions? Would this lead to a lot of new child terms? New sensu terms? Basis in Reality GO is designed by a consortium and represents a consensus. Large-scale developments are the result of compromise and annotators are constantly looking for examples in the literature.
If we align different ontologies, such as the cell type or chemical ontologies, with GO, then that permits generation of consistent and complete definitions - formal definitions with necessary and sufficient conditions in both human readable and computer readable forms. Types vs. Instances What are the instances we are dealing with in our work as ontology builders and scientific curators? The first question to answer is: What knowledge are we trying to capture? We're interested in understanding how genes contribute to the biology of an organism. What is meant by gene products? We have gene product types and gene product instances, e.g., shh in this cell in this mouse. The instance is the actual molecule that can be physically isolated and takes up space. What do the experiments do? Experiments are designed to study the properties of gene products. How do we represent that accumulated knowledge? We connect what the wet bench biologists see to our understanding of biology in the ontology. What are the instances? They're in the labs. We use what experimenters report about those instances. How do we connect instances with knowledge representation in the GO? Some examples: * A molecular function annotation using IDA: Paper reports assaying the activity of retinoid dehydrogenase. Annotation made: retinal dehydrogenase activity. What are the instances? There are gene product instances, molecular function instances revealed by that assay, and instances of molecular function associated with an instance of the enzymes. Conclusion: If I have this molecule, it has the potential to have this function. * Using IMP: Gucy2c activity goes away in mutant cells - observation was that when they had this molecule, they had cGMP; when it was gone, they didn't have it. * Using IGI: Don't have enzymatic activity when animals are doubly mutant for subunits of the enzyme. Single mutants have activity intermediate between wild-type and double mutants.
The instances are the gene products, hexA and hexB, and the molecular function instances. We say that these molecules have the potential to execute the activity. Sue commented that the assertion being made in each of these cases is slightly different. For IDA, the annotation is clear, but the latter cases suggest that the gene products are involved, but is their role direct or regulatory? Should we use the contributes_to qualifier in these cases? There is agreement that the evidence code does tell you something about the quality of the inference and what types of caveats a user should expect. But, can you compute that aspect of the evidence code? Another example is that of molecular function terms, such as protein binding, with the IPI evidence code. What are the instances? Flag-tagged molecules, molecules to which an antibody exists, execution of binding, potential to execute binding, potential to execute specific binding. Some discussion about whether protein binding is a good function term. Conclusions? Process annotation using IMP: * Observation is that in the presence of Shh you get a specific process of heart development. Instances are functional and non-functional molecules of Shh, development of a mouse heart, functioning of a Shh molecule. This process is the result of gene products executing their functions. Process annotation using IPI: * IPI is often used to annotate to MF: protein binding * In this example, a catenin-interacting protein was annotated to BP: Wnt receptor signaling pathway. * There was no instance of a function during Wnt receptor signaling, so where did the missing information come from? * An inference was made based upon previous annotations. * This represents a chain of inferences. How far do we want to take it? Much discussion ensued on the chain of inferences. Some felt that IPI for protein binding would be the only annotation that could be made. 
Others felt that the annotation depended on the context of the experiment and what the authors state. If the authors claim that this [participation in a process] is true based upon their IPI experiment, and the experiment is in a peer-reviewed journal, then it's okay for an annotator to make that annotation. Otherwise, the correct evidence code would be IC. In the signaling literature, it is not uncommon to find that authors do co-IP experiments to show that a protein of interest is involved in a particular signaling cascade. But how far do we take the inference? When do we take the inference? Who makes the assertion? Does that matter? Does it matter what the community assumes to be true? If, for example, protein A binds protein B and protein B is involved in two completely different processes, would you annotate to both? No. If the binding experiment is the only one performed in the paper, is that enough? If not, do the other experiments influence the process annotation? If we comprehensively annotated genes, can we make the same conclusions as wet bench biologists? Can we make a rule about if and when we will use IPI for biological process annotations? A check of the database indicates that almost all of the core databases have used IPI for process annotations. The conclusion was that we should think about this more and come up with examples of when we would or would not make the inference to inform our decision about making a rule for IPI and process annotations. This would be an agenda item for the coming GO annotation camp. Saturday, April 1, 2006 Discussion of Goals in the New Grant Proposal Aim 1 - Ontology Development Suzi Lewis, BDGP and David Hill, MGI What is our aim? Suzi Lewis, BDGP As NIH program director Peter Good has stated, the heart of the project is still the ontology; that's the resource that people still use.
Thus, GO will maintain comprehensive, logically rigorous, biologically accurate ontologies paying close attention to both content and relationships in these ontologies. Content Development David Hill, MGI Comprehensive annotation can drive ontology development. With this in mind, David presented an example of ontology development using the process of blood pressure regulation, an important and well-funded area of research. In mid-November, when David began focusing on curation of genes having to do with blood pressure regulation, there were three relevant terms in the GO: regulation of blood pressure, and positive and negative regulation of blood pressure. To further develop this part of the ontology, he first got a textbook on medical physiology and proposed a basic structure in a SourceForge item that included 43 new terms. Then, the refinement process began... A textbook was not sufficient for describing all aspects of blood pressure regulation and from reading published papers, he was able to add more terms, as well as synonyms for many terms. Work continued; he read more and more papers and added more and more terms. One interesting aspect that came out of this was that not all new terms added had to do with blood pressure regulation, because the genes involved in regulating blood pressure affect other processes as well, such as water consumption and kidney function. Thus, in the process, the blood pressure node improved and other nodes improved, as well since you're now starting to understand genes as you expand nodes in the graph. In the end, most new terms added were new leaf nodes, with relatively few changes to the actual structure of the graph. Summary of ontology development steps: 1 Consult a textbook. 2 Identify papers. 3 Read papers. 4 Enter SourceForge items. 5 Modify GO. 6 Curate papers. Good news: 14 new genes initially annotated to blood pressure regulation. Now, 23 genes annotated, with five genes comprehensively annotated. 
Along the way, other genes got annotated and not all annotations had to do with blood pressure, which is good for clinicians. Bad news: There is still an outstanding SourceForge item about the terms, as Amelia found inconsistencies in some terms and definitions. Although these issues are relatively minor, they're still there and need to be addressed and fixed. When proposing new terms, it is important to think as an ontology developer, not as an annotator. Since the issues raised did not get in the way of curation, it was too easy to switch over to the role of an annotator and just go ahead and curate the papers, without tidying up the ontology issues. How can we prevent this type of short-circuiting? One way would be to have some assignment of responsibility for ontology development. The ontology developers could point out logical issues, while expert curators deal with major logical issues. Minor issues could then be addressed as concrete proposals that could perhaps be presented in a way where the curator could quickly accept or reject the proposal. Any final decisions would then be based on whether the ontology really represents the biology. Some discussion on this: Does it become the responsibility of a curator to assume the role of an ontology developer when they recognize that new terms are needed? To some extent yes, as we need to keep in mind that ontology development is key to the GO and dependent upon feedback. We have become a more cohesive group and are really at a point where ontology development and annotation intersect, leading to shifting responsibilities. This may mean that we need some changes in how we interact and do things and that we may need to address any problems that exist in how we manage ontology development. We should be creative in how we think about this; maybe the SourceForge method needs improvement, maybe there are other options besides SourceForge? 
Harold: When you have the interest and the background, that's when the interest to develop the ontology kicks in. Lisa: More collaboration between groups is good, we are doing pathways related to human disease. Rex: We should identify areas of interest or holes in the ontology. Suzi: We need more structure in how this is done. There may be some value in alerting people that a particular area of the ontology needs attention. That way, people with expertise in the field can provide insight, since it is not always clear to an annotator when some terms might be missing, especially if the area under development is not congruent with their background and/or interests. It may also be necessary to have someone at each database be responsible for ontology development. Identifying such people would give the whole process more structure. In addition, the editorial office could provide a timeline of projects outlining different areas of development. Annotators could subscribe to specific mailing lists about these issues. Bottom line: there is shared responsibility for ontology development!! Editorial office had a poster about time lines of ontology development projects. (PRE)ACTION ITEM #2: Need to work out the balance of power/responsibility between the GO office and annotator/ontology developers to complete SourceForge items. David to the GO editorial office: Do you want to be the enforcer? Jane: We do hound people to finish an item. Judy: We need a systematic way to bring closure to an item. Aim 2 - Reference Genomes Rex Chisholm, dictyBase and Judy Blake, MGI The second aim of the grant built on the idea of ontology development. In light of the criticisms of the previous proposal, it was important to pay attention to the biology and pay strong attention to how the ontology was used. What is valuable to biologists about GO? There is an enormous effort to sequence more genomes. There are 100 billion letters of sequence in GenBank representing 160,000 different organisms. 
The next step, however, is to use that information to understand these organisms and GO has an important role here because some organisms may only have a handful of experiments. In light of this, GO would like to have a set of well defined, well characterized reference genomes that could be used for electronic annotation of new organisms. What are the characteristics of a reference genome? 1 Needs to have a sequenced genome. 2 Needs to have an active and robust MOD supporting it. 3 Needs to have broad-scale functional genomics projects. 4 Needs to have an adequate research community adding new information. 5 Needs to have a sufficient literature base. 6 The reference genome list needs to be distributed across the tree of life in such a way as to ensure a range of organisms are represented. So, a list of nine reference genomes was chosen, starting with the two extremes: E. coli and humans. Humans have to be there since that's where interest lies and much funding derives. E. coli is there since it is the largest component of our biomass and there is a proposal for an E. coli database that is planning to use GO in their annotation. Other genomes identified include mouse, fly, worm, S. cerevisiae, Arabidopsis, and Dictyostelium. Those genomes not identified as reference genomes are not considered less significant, but the reference genomes will be the focus of a committed, coordinated annotation effort. What did we commit to? Curation of the reference genomes will provide broad and deep annotation, with each of the genomes being consulted and agreeing to establish foci of curation. 
Curation of genes that are relevant to human disease will be emphasized and at a minimum their curation will include: 1) looking at a provided list of genes, 2) using that list to prioritize annotations to determine if you can add information about the function of those genes in your organism, and 3) looking at those genes and also at the human annotations to see if the human annotations can be improved based upon your annotations. As the reference genomes are annotated, it will be important to provide data for metrics. It is unlikely that we will have complete annotation for those nine reference genomes, but we want to be able to measure progress. This may require some slight reorganization of how information about GO curation is captured. We want to establish a GO annotation team for this part of the project. For most of the reference genomes, there already are GO-supported, "embedded" curators. This fact is very relevant to the previous discussion, as these individuals will play a coordination, as well as annotation, role for their respective genomes. These individuals will be funded by GO and will answer more to the GO, as we will need to expand leverage to complete these annotations. Another role that these individuals will play is in providing outreach and training to other curation efforts. They will agree to participate in annotation consistency exercises to see if curators within their database have some level of consistency in how to think about curating GO. Part of the annotation consistency efforts may include everyone curating a human paper and comparing annotations. They will also participate in annotation camps and workshops, for which there will be a more regular process. There will be regular communication in the form of biweekly (fortnightly) phone conferences. 
And lastly, there will be a bidirectional process where each database has a responsibility of reviewing the GOA annotations for their organisms, ensuring feedback between GOA and the reference MODs. This will require a lot of communication with the EBI group about human gene curation. EBI already accepts some human annotation from MGI and other groups and would be happy to take additional annotations. Using EBI's curation tools, annotators can directly annotate to the GOA database. However, there is some concern about other MODs performing human gene curation. How will the outside world view this? We will need to make concrete and tangible progress on this or it will look bad. It might be best to focus on particular genes and show that we've made progress. All MODs have high priorities within their own organisms and so we will need to be realistic and methodical in our approach. The key role of the MODs will be to still do what the MODs do. That's the easy part of this! We're promising to do what we're already doing. The GOA/EBI group annotation tool is online and can be made available to outside groups, e.g. AgBase, _____, Roslin group for chicken. Sue expressed concerns that the annotation of human could be problematic: if we say we will do human and then don't do it, we will look bad. MGI already has a list of human disease genes (~4000), and people are already contributing to InParanoid and using it to compare MOD genes to the human list; we can then also track MOD progress on this list of genes. This new effort will need a lot of coordination and we will need to know how to represent progress. We will generate a list of human disease genes, which will number several thousand. This is not an impossible list to generate. In addition, we can provide a list of InParanoid orthologs to help groups identify the relevant genes to annotate. This leads well into the discussion on co-annotation of the mouse/rat/human genomes.
Cooperative Annotation - Human/Mouse/Rat Annotation Judy Blake, MGI There is a heavy call from the GO user community for human annotation, and it will be beneficial to look at how the mouse/rat/human groups can work collaboratively since in many cases papers report experiments on genes from all three species. Mary Dolan has already done some work on the annotation consistency between mouse and human annotations and will also be adding chicken, fly, and zebrafish to this analysis (all predicated on knowing the orthology). One consideration, though, is that there are different intersections of information based upon the experimental focus on an organism. For example, rat gene products are often studied in the context of neurophysiology. At present, these groups do things independently and likely have similar but different priority in curation sets. How can these groups exchange information and help each other out? One way would be to create a shared annotation group, which may be a core group of people funded by GO, of human, mouse, and rat annotators that focus on gene products implicated in human disease. This would involve shared curation of the literature, collective work on the same dataset, and an update of each group's editorial tools to accommodate multiple species. These groups would also need to develop shared quality control protocols that would ensure high quality curation. Currently at MGI, they annotate rat gene products as they come to them, and these annotation lines are sent straight to GOA and posted on the MGI ftp site. Another proposal that relates to David's idea of depth of annotation: when these groups take on a certain area for curation, they also take on coordinated ontology development. The groups would begin by identifying and reading reviews and then using the reviews to improve data sets, triage the primary literature, and improve the ontologies. 
This would involve identifying all of the organisms and gene products described, and they are working on various computational ways to identify that information. Overall, these efforts would help the groups to simultaneously curate the same, focused set of genes and allow for development of a working group amongst these three genomes. Many mouse papers have data from rats and humans and it doesn't make sense to have three different MODs look at the same paper. Curation of rat genes begins with abstracts and then focuses on disease genes, particularly nervous system disease genes. This is fine, because concentrating on disease collectively gets the synergy. It is agreed that it doesn't make sense for people to look at the same genes in mouse and human, but with this new idea, people would be looking at the same group of genes involved in a particular area and this would help by reflecting how people in each field currently think about this topic. How would the curation be split up? Ideally, by references and by organism. This effort would involve literature-based, shared discussions that could start with a recent review and then choose a focus for annotation. This work might also lead to productive interactions with Reactome. If out of this work came a list of genes that were well curated in mouse/rat/human, then these genes could be bumped to the top of Reactome's curation priority list and help give them ideas about what pathways to coordinate. The exact tools and processes for doing this are not all in place yet, but will come via discussion as work progresses and we see growth in each of these areas. For GOA, this will be different from their normal curation strategy, as they currently fully curate each gene product. However, the hope is that this strategy extends beyond mouse/rat/human. For example, FlyBase could look at their homologs, too. 
In addition, we could put in place responsibilities so that when you curate a paper, you curate the whole paper and all the information in that paper. Sharing this gene list amongst all the reference genomes will provide the broadest representation of what these genes do in biology. Reference Genomes Rex Chisholm, dictyBase How do we think of metrics for both breadth and depth? One important point in annotating the reference genomes is to have metrics, coordinated by and agreed upon by the consortium, to assess the breadth and depth of annotation. How do we develop these metrics? There was some discussion of this on the email list while this proposal was being written. In some ways the answer seems obvious, but there are a lot of hidden complexities and different groups think about this in different ways. Nevertheless, we should all collect the same set of numbers. What is meant by breadth? A group would have broad annotation when they've annotated all the genes in the genome. There are, however, different kinds of information from different organisms. For example, rat is good for physiology, mouse for disease models, yeast for signaling, etc., and biologists tend to think about a process based upon what they know from a variety of organisms. For a given organism, we will need to track the number of genes that actually have experimental annotations. In addition, we'll want to know what genes' functions are known mainly by ISS and what percentage of the genome is annotated with ISS exclusively. We'll also want to know what is annotated based on IEA (sometimes this is the only type of annotation known). Tracking these numbers for each reference genome, and collectively, will give us a way to monitor progress towards breadth. The goal is to do the entire job - a huge task - and we need to be able to show progress along the way. It would be good to compare the snapshot taken at the time the grant was being written with one taken this summer.
It would also be good to add, in the annotation file, the number of annotatable genes and proteins, since this will be critical for assessing metrics. It is not always trivial, however, to know what percentage of genes you can annotate. The cerevisiae genome, for example, has 20-25% of genes with no annotatable information, and it has a compact genome with a long history of experimental study! It may still take a future publication to annotate these genes. In addition to measuring how many annotatable genes and proteins exist for each organism, it might also be helpful to record how much ontology space is covered for each organism. This bears on the second point, which is how to measure depth. If you've annotated every single paper ever published, then you've achieved 100% depth. But that's not realistic, as many papers aren't gene-centric and not all papers have curatable information. So, what should the denominator be here? Is there any way to capture the number of relevant papers? Would working with natural language processing groups be useful for gathering this kind of information? Another measure of depth might include an assessment of how far down any given branch the GO is used, i.e., how close do you get to the leaves? Shu (FlyBase) has written an algorithm for assessing this. One off-shoot (so to speak) of this would be that lots of annotations to a leaf node would highlight potential new areas for ontology development. Depth can be measured as: (1) % of papers used for curation; (2) average # of papers per curated gene. Can we capture the # of relevant papers, since we know it is a subset of all papers? How? Problems: Even if a paper mentions rat, it may not mention any genes. Even if a paper mentions a gene, it may not give you information that is curatable for GO. Some of these types of projects will be very useful to the natural language processing groups; Mike and Judy already have collaborations with such groups.
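The breadth and depth numbers discussed above could be collected along the following lines. This is only a sketch under stated assumptions: the input is taken to be a list of (gene, evidence code, reference) tuples, the set of codes counted as experimental is limited to those named in these minutes, and the function name annotation_metrics, the sample genes, and the sample references are hypothetical. The number of annotatable genes has to be supplied by each group, since (as noted) it is not trivial to determine.

```python
# Evidence codes treated as experimental here (an assumption; extend as needed).
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI"}


def annotation_metrics(annotations, annotatable_genes):
    """Summarize breadth and depth from (gene, evidence_code, reference) tuples."""
    codes_by_gene = {}
    papers_by_gene = {}
    for gene, code, reference in annotations:
        codes_by_gene.setdefault(gene, set()).add(code)
        # Only experimentally supported annotations count towards curated papers.
        if code in EXPERIMENTAL:
            papers_by_gene.setdefault(gene, set()).add(reference)

    experimental = len(papers_by_gene)
    iss_only = sum(1 for codes in codes_by_gene.values() if codes == {"ISS"})
    iea_only = sum(1 for codes in codes_by_gene.values() if codes == {"IEA"})
    paper_links = sum(len(papers) for papers in papers_by_gene.values())
    return {
        "experimental_fraction": experimental / annotatable_genes,
        "iss_only_fraction": iss_only / annotatable_genes,
        "iea_only_fraction": iea_only / annotatable_genes,
        "papers_per_curated_gene": paper_links / experimental if experimental else 0.0,
    }


# Hypothetical sample data: gene names and references are placeholders.
SAMPLE = [
    ("geneA", "IDA", "paper1"),
    ("geneA", "IMP", "paper2"),
    ("geneB", "ISS", "paper3"),
    ("geneC", "IEA", "pipeline1"),
    ("geneD", "IDA", "paper1"),
    ("geneD", "ISS", "paper4"),
]

print(annotation_metrics(SAMPLE, annotatable_genes=8))
```

The sketch deliberately leaves out the '% of papers used for curation' measure, because the denominator (the number of relevant papers) is exactly the open question raised above.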
ACTION ITEM #3: Begin to coordinate processes for reference genomes to start setting priorities and tracking progress. Acquire and distribute lists of genes for curation focus and set-up fortnightly discussions. [Point Person: Rex Chisholm] We need to develop a list of "disease genes" and process to monitor progress. We need to refocus and integrate roles of GO funded annotators. Aim 3 - Outreach Michael Ashburner, FlyBase and Suzi Lewis, BDGP We promised a tool that would provide better ways of annotating through the ortholog sets. This might be software that shows a phylogenetic tree and protein alignments and allows curators to click on any gene in the tree to see the gene's annotations. It might also allow for dragging terms from one organism to the other, resulting in ISS annotations. To accomplish this, we will need ortholog sets (from other groups) and protein sets from each of the reference genomes. Many groups are calculating orthologs - each does this differently and each uses a different data format. It would be really nice to have a common dataset so that we can compare these different methods on one set. It would also be nice to have a common output file. We'd like to be able to add in any group/organism that can provide a protein set (Matt has 40). Michelle mentioned differences in current annotation of gene models in eukaryotic organisms. Common Coding Sequence ?not sure of name? - collaboration between 3 groups annotating human. ACTION ITEM #4: Any other ideas for shared curation software, please forward to Chris Mungall. To avoid circular annotations, curators would only want to drag and drop experimentally derived annotations, which could be color-coded based upon evidence code. Where would the annotations go afterwards? The tool could generate an annotation file, and GOA already has a tool to do this. 
One thing to keep in mind, though, is that if anything changes about the original annotation, you would want to check the related annotation for continued accuracy. WORKING GROUP: We should establish a working group for this software so that we can leverage what other groups have done and figure out, technically, how to pull this all together. Two critical needs for this type of software are: 1) reliable, agreed upon protein sets for establishing orthologs, and 2) a common output from the various ortholog groups.

There is pressure to keep the number of genomes annotated up for taxonomic spread, but also to keep it down because of the enormity of work involved. The GO will support annotation across all organisms, and has been doing so for some time now. TIGR became an official member of GO in 2001, and TAIR joined us before that. As of the time the grant was written, there were 1,867 genome projects, of which 339 were published and in some sense complete. The former number includes over 900 prokaryotic genomes, over 500 eukaryotic genomes, and 26 meta-genomes. A number of additional mammalian genomes have just been funded. Over 100 pathogens are being sequenced at Sanger. Thus, there will be an increasing number of genomes coming in the future. For the great majority of these genomes, however, there is either no or relatively poor funding for annotation efforts and associated databases. This poses quite a major problem for the GO, which has traditionally interacted with database groups and not with individual sequencing projects. It is in the interests of science and the GO Consortium to reach out to these groups to encourage and support GO annotations across as broad a phylogenetic breadth as possible. To this end, it is important for all of us to assume an ambassadorial role at talks and meetings, and this role has been formalized under Jen Clark at the editorial office.
But we need to take a number of additional steps to go beyond outreach and beyond initial contacts. We need to provide a basic toolkit of training materials for new groups to annotate their organisms. This would include reference annotations from reference genomes that could be used for transferring annotations. This is one of the reasons for having a broad phylogenetic spread of reference genomes. In the past, we have also been asked to give advice about tools. People want to know what the best tools are for annotating a genome, but we have been reluctant to recommend one group's tool over that of another. What we could do, however, is provide the best SOPs for using the existing tools. There are many tools that use GO for microarray analysis, and when people see the list on the web page, they often don't know where to start. When giving tutorials, we currently use one or two of these tools, but don't necessarily work that closely with developers to give them feedback on the software. It would be very good to begin working with these people behind the scenes to help improve the tools. It would also be good to have a wiki for each of the tools and to partner with groups that have a vested interest in further developing and improving these tools. Mike suggested that we could partner with someone, e.g. MGED, to work on this. Sue suggested that we could be more proactive in collaborating with these various software developers - they are often looking for nodes that are overrepresented, which of course will be affected by nodes that are underannotated. GO slims are often too broad for some uses, so we may need to help people develop better slims to really do what they need. We also need to encourage developers to incorporate evidence codes, so they don't get ignored by so many people.
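As an illustration of what "incorporating evidence codes" could look like in a tool, the sketch below keeps only experimental, non-negated annotations. It assumes the tab-delimited gene association layout in which the qualifier is the fourth column, the GO ID the fifth, and the evidence code the seventh; the function name, the sample rows, and the set of codes treated as experimental are illustrative assumptions.

```python
import csv
import io

# Evidence codes treated as experimental in this sketch (an assumption).
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI", "IEP"}

def experimental_positive(gaf_text):
    """Yield (object_id, go_id) for experimental annotations without NOT."""
    reader = csv.reader(io.StringIO(gaf_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue  # skip blank lines and '!' comment lines
        qualifiers = row[3].split("|") if row[3] else []
        if "NOT" in qualifiers:
            continue  # a NOT annotation says the gene product lacks this role
        if row[6] in EXPERIMENTAL:
            yield row[1], row[4]

# Hypothetical rows; the IDs and references are placeholders, not real data.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    ["!example header line"],
    ["DB", "S0001", "GENE1", "", "GO:0000001", "PMID:1", "IDA", "", "P"],
    ["DB", "S0002", "GENE2", "NOT", "GO:0000002", "PMID:2", "IMP", "", "P"],
    ["DB", "S0003", "GENE3", "", "GO:0000003", "PMID:3", "IEA", "", "F"],
])
result = list(experimental_positive(SAMPLE))
```

Only the first data row survives: the second is negated and the third is electronically inferred.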
Mike seconded Sue's comment about some developers ignoring evidence codes entirely, and they are using a gene-association file to predict other annotations, and have no idea whether they are using experimental or IEA annotations. In summary, we know that some tools don't use the evidence codes, some ignore the WITH column and thus, the NOT qualifier, and that tools using the GO slims may be using GO annotations that are too high in the tree to be really useful. Accordingly, each group should think about what is the right depth of the ontology for their organism. Has anyone performed an independent analysis of GO tools? Sorin Draghici at Wayne State University has done this. (See his publication in Bioinformatics, Khatri and Draghici 21 (18): 3587.) End Discussion. Another form of curation outreach can be provided in the mappings of external classification systems to GO, such as ec2go or interpro2go. When using these, though, it is important to consider how extensively they are maintained, since some are updated regularly and some probably haven't been updated in three or four years. Information about frequency of updates should be transparent in the mappings file. Also, we should always be on the lookout for new classification systems to map to GO, such as the KEGG orthology. Most importantly, however, is that we're going to provide training. We've already had two annotation camps, and are planning a third for this coming July in Palo Alto. We would also like to have one annotation workshop per year, alternating on which side of the Atlantic it's held. We should especially reach out to Asian groups, particularly some groups in Japan who've already sent curators to the 2005 Annotation Camp and are very interested in learning how to do manual annotation . WORKING GROUP: There is going to be an outreach working group and they will need to come up with a plan for how to encourage and support annotations from a broad spectrum of genome projects. 
Part of this group's focus should be on writing SOPs for GO curation and working very closely with new groups to ensure regular submission of gene association files. This working group should include the "embedded" GO curators at each of the MODs. Another issue related to this has to do with freshness, or currency, of annotations, since we know that annotations can go stale for a number of reasons. For reference genomes and well-supported MODs, refreshing annotations is an integral part of their job. For single-pass annotations, however, there may not be resources to refresh them, especially if they are performed at a sequencing center that has long since moved on to the next genome. We do have an existing system to deal with this, in that a gene association file that has not been refreshed within the last twelve months is put into a separate gene association file archive. Another outreach idea is that of an annual annotation challenge. This might be similar to the CASP (Critical Assessment of Techniques for Protein Structure Prediction) and GASP (Genome Annotation Assessment Project) projects. An annotation challenge might also be a good way to interact with software developers creating tools for GO. We could give groups a set of genomes, have them run their programs on them, submit resulting GO annotations, and then compare their annotations to known annotations to see how the different programs perform. The winning program could then be run against all of the genomes that need refreshing. The annotation challenge could perhaps also be extended to evaluating tools that assess consistency of annotation within the consortium. GO should also continue trying to reach out to additional prokaryotic annotation groups, as there are now a number of different groups in many different locations, such as the Joint Genome Institute, Argonne National Labs or the Pasteur Institute, to name a few.
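For the annotation challenge idea above, comparing submitted annotations with known annotations could start from something as simple as precision and recall over (gene, term) pairs. The sketch below is only that: the function name and the toy submissions are hypothetical, and exact matching ignores the ontology structure, so a prediction to a parent of the known term counts as a miss.

```python
def score(predicted, known):
    """Return (precision, recall) for predicted vs. known (gene, term) pairs."""
    predicted, known = set(predicted), set(known)
    true_pos = len(predicted & known)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(known) if known else 0.0
    return precision, recall

# Hypothetical reference set and two hypothetical submissions.
known = {("g1", "T1"), ("g1", "T2"), ("g2", "T3"), ("g3", "T4")}
tool_a = {("g1", "T1"), ("g2", "T3"), ("g3", "T9")}
tool_b = known | {("g4", "T5"), ("g4", "T6"), ("g5", "T1"), ("g5", "T2")}

score_a = score(tool_a, known)
score_b = score(tool_b, known)
```

Here the second tool recovers every known annotation but half of its output is unsupported, which is the kind of trade-off a challenge would need to weigh when picking a "winning" program.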
We could also reach out to groups involved in the NIAID program to sequence organisms that are potential bioterrorist agents, which is ongoing at several BRCs (Bioinformatics Resource Centers). There are some particular issues, such as the operon issue or annotation by pathway hole filling, that arise when annotating prokaryotic genomes that may merit some changes in the GO. Specifically, this may require changes to the evidence codes, and we may need to improve links between function and process. These issues are especially important as annotation of the E. coli genome, one of the reference genomes, becomes more coordinated (see Riley et al, NAR, 2006 Vol. 34, No. 1, 1-9). Aim 4 - Community Advocacy Mike Cherry, SGD The main idea behind community advocacy is that there would be advocates to seek out what is needed by those people that use the GO. As a group, we want to know, and should start thinking about, who else is using the GO and what can we do to meet their needs? To begin to address this, we may want to designate a publicity person for the GO and have a place on the website where we highlight the use of the GO. This is not simply an issue of interface design, but also really addresses the questions of what is GO and why should I care about it in an effort to draw people in more. We need to recognize that in talking to non-genomics biologists about GO, we may need to use new language and be pro-active about determining exactly what people do and do not know about GO. GO is now a resource that is expected to be there and as such, will be subject to criticism. But, it is our responsibility to get out there and help people who are using it and address issues they might have. Michelle suggested that we might want to work with someone professional to develop training materials. Candace seconded this and mentioned that there are lots of people at colleges/universities using GO in their courses and training materials would be really helpful. 
Perhaps we could collaborate with an educational professional to produce GO training materials. Reactome has had some experience with a group that came in to produce training material that they would then sell as a subscription. The number of people subscribed might help to give some idea of how many people use their resource. How else to determine who uses GO? We could compile lists of people who mention GO in publications and in talks at meetings. For the former, we could perform a full-text search using "gene ontology." It would also be good to foster a user community that gets immediate feedback from GO in the form of quick responses to email queries. We could also try some sort of monthly call-in where people can ask questions or simply lurk and listen. Could we put into place some mechanism that allows people to tell us what genes they would really like to see annotated? We could also make some changes to the email lists. Sometimes people post to a GO mailing list, but it's not to the right list. We should consider consolidating the lists and making it more transparent which list people should post to when they have a question or comment. Having good training materials on the web is key. Courses on genomics and informatics would use GO if there were training materials available. This would increase student awareness of GO. Is there any way that we can determine who is downloading files and perhaps get automatic feedback from them when they do so? Tracking downloads, however, would be a low estimate of how many people use GO. Sue was recently at a Systems Biology meeting, and nearly every talk mentioned GO. Attending meetings may be one way to get a feel who is using it. She also felt that looking for citations in publications is also a decent step. John: We need to foster the idea that we are responsive to the community. 
Karen brought up Rama's comment that we need to consolidate/manage our various email lists (go-database, amigo, go) and make sure that questions don't get lost and that everything gets answered in a timely fashion. David: For medium to large scale analyses that use GO, we could offer to focus on annotating genes which would be relevant to the analysis and get it done so that they can use it before publication of the paper. Jane suggested that we could have a tracker for this. What about a monthly newsletter, a wiki, rss feeds? As many of the MODs already send out monthly newsletters, perhaps they could append a new section devoted exclusively to GO. Using the MODs to reach more people was especially helpful for the GO survey done last fall, results of which are included as an appendix to the grant submission. Also, any courses that people teach, such as those at TIGR, Jackson Labs, Woods Hole, CSHL, could include GO as part of the curriculum. There will be no new hires for outreach, so existing GO curators should think about how to reach out to their user community. If you have more ideas about how best to do this, please send them to Eurie Hong at SGD, who will be coordinating community outreach. One other point to keep in mind is that a lot of groups doing functional annotation are doing structural annotation, as well. This is partly why we need the Sequence Ontology (SO). SO has become the fourth ontology, and although there are differences between the SO and GO ontologies, they are all supported and needed to handle the complexity of gene annotation.

Aim 5 - Organization
Suzi Lewis, BDGP

Three years ago, the aims of the GO grant proposal were to create structured vocabularies, support and promote use of GO in annotation projects, add new MODs to the consortium, and build and disseminate informatics resources and tools to support community use of GO vocabularies. How can we measure progress towards these aims?
The number of hits to the web site has gone up, the number of publications mentioning GO has gone up, and the number of links to the GO web site has increased. 17 of the 24 NIH agencies that fund research have supported projects that use GO, and there are many hits when searching for GO in Google Scholar. The GO survey netted close to 1500 replies in three weeks, almost all of which were positive. So, where do we go from here? Looking at the 2006 aims:
1. maintain comprehensive, logically rigorous, biologically accurate ontologies
2. comprehensively annotate nine reference genomes in as complete detail as possible
3. support annotations across all organisms
4. provide annotations and tools to the research community

Given that the funding level is expected to remain the same, how are we going to scale up? The answer is that we need to do a little better, need to get more efficient, and need to be more coordinated. This can be done! We already have some working groups, such as the AmiGO and OBO-Edit working groups, and these have been successful. But, any new working groups that form will have specific questions to ask, including:
1. Why is this group here?
2. What is the lifespan of this group? (could differ greatly between groups)
3. Who is the group leader?
4. Are we making progress? What are the obstacles to making progress?
5. What are the group's priorities and what are the criteria for setting priorities?
6. What will the group deliver?
7. What are the criteria for membership? Who has a vested interest in this issue?
8. How often should you meet? How should you meet? Via email? Other ways?

Each group can decide what works for them and what communication method will work best, but having metrics is essential. How are you going to measure your progress/success? What tests do you want to build? Different groups will need to communicate with one another. How are you going to share information?
The decision making process for all of this still needs to be fleshed out, but efficiency and lack of bureaucracy are key. If there are software issues that come up, please raise them with Chris. As an example, the reference genome group will have Rex Chisholm in charge, will produce annotations and protein sets, will comprise members for each of the nine reference genomes, will participate in fortnightly conference calls, and will have as its metric the percentage of the genome that is annotated. This group will work with the computational group and the user community, processes for which will be decided. Levels of organization in the GOC can be described as follows:
1. GO PIs, Judy, Michael, Mike, Suzi - set priorities, obtain funding.
2. Midori, David, Jen, Rex, Eurie, and Chris - ontology content, annotator-ontology liaison, annotation outreach, reference annotation, community advocacy, computational architecture, respectively.
3. Curators, some of whom are directly funded, from MODs: human, mouse, zebrafish, fly, weed, worm, slime mold, budding yeast, microbes - perform annotation

Parting thoughts: This isn't a drastic change from what we're already doing, but it is trying to be more conscious about what we're doing. We don't want things to fall through the cracks. One of the reasons that projects fail is that when there are so many things to do, you pick the easy things over the things you should do. We need to do the hard, important things.

Ontological Content: Relationships and Terms

Missing "is_a"s
Chris Mungall, BDGP

Tangled DAGs and complexity - complexity is increasing exponentially (i.e., total number of paths to the root node). The first topic addresses the issue of missing is_a relationships in the GO. Jane Lomax has worked on filling in missing is_a relationships in the cellular component ontology. At the outset, the process and component ontologies are not is_a complete.
In the past, our methodology has been that as long as a term has a parent, we haven't cared so much about the relationship. Why is there a need to have an is_a parent for each term? Because without them, we're missing terms, and when we're missing terms it's much harder to get the ontology to work with other ontologies and tools that are out there. For example, the Protégé ontology editor assumes is_a completeness and so does not work well with GO, as orphan terms appear as root terms. Accurate ontologies also aid accurate searching. How can we get started at untangling complex DAGs? In our current display, when drawing the DAG, we ignore the relationship type. We label it, but we don't use the information on the label. For example, the Srb-mediator complex in its current incarnation is displayed multiple times showing every possible way to draw a route to the top. In Jane's work, she took all of the is_a orphans and either found the correct parent or created the correct parent. There were 277 is_a orphans in the cellular component ontology. In the 'fixed' version of the cellular component ontology, these is_a orphans are gone, which increases the total number of mixed-paths-to-root, actually making it more complex to look at. Some browsers, such as that of the FMA (Foundational Model of Anatomy), have distinct is_a and part_of browsers, which helps to think about the ontology better. A summary of Jane's work:
old:
* 277 is_a orphans/1688 terms
* avg is_a paths to root: 1.4
* avg mixed-paths to root: 6.97
new:
* 0 is_a orphans
* avg is_a paths to root: 3.36
* avg mixed-paths to root: 38.6

This work illustrates how this project needs to coordinate with the AmiGO working group. We want to make this new version live as soon as possible, but the display will need to be addressed. It would be good to let people have a look, since the component ontology will look very different when these changes are implemented. There are also some terms missing part_of roots.
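The quantities in the summary above (is_a orphans, is_a paths to root, mixed paths to root) can be illustrated on a toy fragment. The sketch below is not the procedure Jane used; the five terms and their edges are hypothetical and chosen only to show how the counts are defined.

```python
from functools import lru_cache

ROOT = "cellular_component"

# Hypothetical toy fragment: child -> list of (relation, parent).
EDGES = {
    "cellular_component": [],
    "cell": [("is_a", "cellular_component")],
    "organelle": [("is_a", "cellular_component"), ("part_of", "cell")],
    "nucleus": [("is_a", "organelle"), ("part_of", "cell")],
    "nucleolus": [("part_of", "nucleus")],
}

def count_paths(term, relations):
    """Count distinct paths from term to ROOT using only the given relations."""
    @lru_cache(maxsize=None)
    def walk(node):
        if node == ROOT:
            return 1
        return sum(walk(parent) for rel, parent in EDGES[node] if rel in relations)
    return walk(term)

def is_a_orphans():
    """Non-root terms that have no is_a parent at all."""
    return [term for term, parents in EDGES.items()
            if term != ROOT and not any(rel == "is_a" for rel, _ in parents)]

mixed = {t: count_paths(t, {"is_a", "part_of"}) for t in EDGES if t != ROOT}
is_a_only = {t: count_paths(t, {"is_a"}) for t in EDGES if t != ROOT}
orphans = is_a_orphans()
```

In this fragment 'nucleolus' is the only is_a orphan, and giving it an is_a parent would add paths to the root, which is the same effect seen in the summary figures: fewer orphans, more paths. A check like `is_a_orphans()` is also the kind of test an editor could run before saving.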
For example, unlocalized complexes don't have part_of roots at the top of the tree. All of these complexes need to be housed somewhere, although some of them are so old that there are no references, no definitions, and no annotations made to them.

**Demonstration of new is_a complete cellular component ontology in OBO-Edit**

To ensure is_a completeness, a new high level term 'X component' was introduced. For example, cell projection component was introduced to have a complete is_a path for existing terms, such as cell projection membrane. This is analogous to what the FMA, which is is_a complete, did when they invented the term 'cardinal body part' to have a parent term for the 'head', 'trunk', etc., since extremities are not organs. FuGO also needed these types of top level terms. One issue that still remains is what to have as an is_a parent for cell components, i.e., terms that are part_of cell. Would it be okay to have a parent term cell component, even though the whole ontology is called cellular component? Barry Smith recommended that this trick only be used when dealing with entities that are not anything else, since we don't want to introduce multiple inheritance for this purpose. If the cell component term was introduced, we'd have the same number of high level terms and could remove some redundancies. Chris and Suzi also suggested renaming the ontology. We could refer to a 'cell-level entity' and then also have cell parts. Barry pointed out that the FMA uses part in the name, and uses 'cardinal part' if it's the only way to have an is_a relationship for every term. It is used, however, for entities that are there only when the whole is there. So, we could use the term 'cell part', 'membrane part', etc.

So Jane has done the following:
* created terms like 'cell projection component' to give the various types of cell projections an is_a parent.
* may have added some is_a parents that are needed
* one issue came up: she didn't create a term "cell component", which is almost identical to the name of the ontology, i.e. "cellular component", so there are still a bunch of things right off the top.

ACTION ITEM #5: Consider what, if any, are the repercussions of renaming the cellular component to cell level entity?

ACTION ITEM #6: Change new terms ending in '...component' to now end in '...part.' [Jane Lomax]

Note that the current version of AmiGO does not have the ability to disentangle paths, nor do other groups' browsers. In the meantime, the MODs will need to address this in their displays, and AmiGO will need to deal with this. We don't necessarily want to wait for the tools to catch up to implement this, but the displays will explode a bit. Annotators should take a look at this in OBO-Edit to see if they can live with this. There may be some redundancies that could be removed, which would help with the increased paths to root. OBO-Edit needs a verification system to check for is_a completeness and warn before saving.

ACTION ITEM #7: Jane will send an is_a complete cellular component ontology to the GO list.

There are also lots (i.e., thousands) of missing is_a relationships in the process ontology that will need to be dealt with. This should be a little bit easier since the tools are already dealing with this with respect to cellular component. However, once we make this live, we're committed to maintaining is_a completeness. Any new terms entered should not create new is_a orphans. OBO-Edit has a verification system for this that could be revived and could be tested by the OBO-Edit WORKING GROUP.

Fixing the Upper Levels of Biological Process
Chris Mungall, BDGP and Barry Smith

Examining the top levels of the Biological Process ontology, we also find that there are some issues with granularity. We have been talking about diamonds and DAGs and how they're not so good for ontologies.
If diamonds are treated correctly they're not such a problem, but diamonds at the upper levels of the ontology are more of a problem, especially when they are at the very top. What is the solution? Some of the high level terms in the BP ontology are: cellular process, physiological process, cellular physiological process, and organismal physiological process. Do these terms define the granular level of the process or of the end result? The definitions seem inconsistent. In some cases, the definitions refer to the goal of the process, not the level at which the process occurs. Further, there is a lack of symmetry in these terms and definitions. For example, why is there no term 'organismal process'? And what does physiological process really mean? It is vague and hard to distinguish from biological process, although some dictionaries define it as everything but developmental processes. We should think of and define a process as something that has a beginning and an end. But where's the beginning and end of a physiological process? Of metabolism? There is also an issue with behavior terms. We've got something called behavior which is not considered a physiological process, but 'behavioral physiological' child terms are useful. All of the behavior terms are now 'response to...' terms, but shouldn't response be a physiological process, since the definition is about a set of things that happen in response to something else? Jen commented that she once looked up "physiology" and it appears to be a catch-all for anything that wasn't "development" or some other major branch of research at the time the term was introduced. David feels that physiological process is equivalent to biological process. Should we just eliminate the term 'physiological process'? Alternatively, we could add the granularity at the top level of the process ontology itself. We could have molecular, cellular, and organismal levels.
These are disjoint, whereas physiology is not. The advantage is that each is distinguished from a parent in the same way. So, for example, a term like cardiac contraction would be an is_a child of organismal level process, while cell contraction would be part_of that term and also an is_a child of cellular level process.

Barry's revision:
Biological process:
* molecular level process
* cellular level process
* organismal level process

Barry: "I do not believe that there is such a thing as "molecular physiology", however there are such things as "molecular biological process"." Rex and David provided examples to the contrary, showing that people do actually talk about "molecular physiology".

What about single-celled organisms? This has been a source of past and current confusion, since in these cases cell and organism are the same thing. From the perspective of the GO, we need to make a decision about what a single-celled organism is. We may also want the term 'multicellular level process' or 'multicellular organism process', and we need to consider the cases of organisms like Dicty that can exist in both states and some bacterial species that associate but don't meet criteria for being deemed multicellular. Barry suggested adding two children of cellular level process: single cell process and multicellular process. We could also have single organism process and multiorganism process. Agreed that we should look at a few branches of the ontology and see how this would work. What will the effects of these changes be? We can sort through this in-house as there may be many merges and changes, but we should keep an eye towards what other ontologies, such as FMA, do. We also need to consider the implications of this for pathological processes. What are molecular level processes vs cellular processes? For example, if a molecule penetrates a cell, it's a molecular process, but if a cell captures a molecule this would be a cellular process.
We could figure out the granularity by determining the agent. The trick is to find an example where there is no apparent classification as one or the other. It should be obvious, though, where the action is - at the molecular or cellular level. Think about viruses attaching to cells, DNA molecules penetrating a hole in the cell, etc. Also think about how this reorganization would affect development and cell differentiation terms in the GO. Development is an organismal process, but a big part of that, cell differentiation, is a cellular process. Will this be okay? The relationship to the former would be part_of, while the relationship to the latter would be is_a.

ACTION ITEM #8: Take the new molecular/cellular/multicellular arrangement of the biological process ontology, try it out, and see how it works. The WORKING GROUP for this will include: Chris, Jane, David, Alex, Michelle, Rex, and Val. Jane will make a .obo file so that people can look at this. The group will also need to create a document outlining their philosophical approach and the results. Should there be a development site to help with working on this?

David brought up the sticky issue of placing developmental process under 'organismal process': you eventually get down to 'cell differentiation', which also needs to be under 'cellular process'. This is the problem of development and the child terms of development.

Relations Between Function, Process, and Component
Chris Mungall, BDGP

How can we make links between the GO ontologies? The discussion of this focused on links between molecular function and biological process using histidine degradation as an example. (See Overbeek, et al, NAR 2005 33(17):5691-5702). Histidine degradation is a complex process with a branched pathway. In GO, it is represented by the BP term 'histidine catabolism.' Each of the reactions in the process is mapped to a GO MF term, such as 'histidine ammonia-lyase activity'.
We have some variants for different ways that histidine catabolism takes place in different organisms and could map pathway branches seen in different organisms to GO child terms. Two could be mapped, histidine catabolism to glutamate and formate, and histidine catabolism to glutamate and formamide, but one could not: histidine catabolism to glutamate and formiminotetrahydrofolate is not represented in the GO. Because of this, there is no way to link relevant functions to processes. Michelle: The microbial community is looking for this type of relationship because reaction steps for each process is how they (TIGR) annotate a bacterial genome. There is clearly lots of curation work that needs to be done. How will the results fit into the current ontological structure? But, an important issue here is what is a function and what should be in the function ontology? We must settle that first. We also need to be clear about what we mean when we are talking about types, processes, and functions. Every single term in the GO should represent an actual type in biology, but the converse is not necessarily true. What are the relations in reality? There are relationships between types in the same ontology and between functions and processes at a given level of granularity. What are the instances and relations in reality? Some particular gene product has some molecular function instance - has a functioning- which makes up a more complex multi-step process. Examples of type: histidine ammonia lyase function, histidine ammonia lyase reaction, histidine catabolism. How hard is it going to be to determine when we can make these relations? What are the types and relations in reality? Some reactions will only take place in one pathway. Barry commented that if this is the case, then we can't assert part_of relations. But can lack of knowledge be used to exclude a part_of relationship? What about using the relationship has_part? This exists in the OBO relation ontology. 
We need a way to distinguish those reactions that always occur as part of 'histidine catabolism' versus those that are sometimes part of histidine catabolism depending upon the organism and maybe sometimes part of other processes. For example, alcohol dehydrogenase functions in many pathways. Barry's comments: if we have multiple inheritance in part_of parents, then every single instance must have that part. If we have an entity that is sometimes part of one thing and sometimes another, then we cannot assert the part_of relationship. The part_of definition means all the time; has_part means the parent process necessarily has that function as part of the process. We need to collect more data and think about how to integrate that data. The data is captured in annotations and so we should be able to use annotations to help predict the relationships between process and function. These ideas will probably work fine for well-characterized pathways, but what about other pathways, such as cytokinesis, where all of the molecular events are not yet defined? At some point we are going to want to talk about pathways at an organismal level, not just at the molecular level. To do this we will want to consult with the pathway databases and have some level of synchronization with them. In GO, however, we will still be missing information about the order of events, so we would need to bring in new relations such as 'preceded_by.' Prokaryotic annotators would like to see those relationships brought into GO soon. (Michelle gets complaints from people in her classes that GO doesn't represent pathways.) So, how would we manifest this in GO? Molecular functions reside in gene products and the function is distinct from the functioning. In reality, there is a certain bit of redundancy here; do we want to manifest this directly in the GO? Do we want to have functioning terms in the process ontology? We know that not all GO function terms have a corresponding functioning term. 
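The distinction Barry draws between part_of and has_part can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the organism names, the step labels 'step_B' and 'step_C', and the process labels 'process_X' and 'process_Y' are invented placeholders, and the two helper functions are hypothetical, not part of any GO tool. The idea is that a has_part link from a process to a function is only a candidate when every known variant of the process includes that function, and a part_of link from a function to a process is only a candidate when every recorded occurrence of the function lies within that process.

```python
# Toy data (invented for illustration): reactions used by each organism's
# variant of one process, and the processes each reaction is known to occur in.
variants = {
    "organism_1": {"histidine ammonia-lyase activity", "step_B"},
    "organism_2": {"histidine ammonia-lyase activity", "step_C"},
}
reaction_contexts = {
    "histidine ammonia-lyase activity": {"histidine catabolism"},
    "step_B": {"histidine catabolism"},
    "step_C": {"histidine catabolism", "process_X"},
    "alcohol dehydrogenase activity": {"process_X", "process_Y"},
}


def has_part_candidates(process_variants):
    """Reactions present in every known variant of the process.

    Only these could support 'process has_part reaction'.
    """
    reaction_sets = list(process_variants.values())
    return set.intersection(*reaction_sets) if reaction_sets else set()


def part_of_candidates(process, contexts):
    """Reactions whose every recorded occurrence is within `process`.

    Only these could support 'reaction part_of process'.
    """
    return {r for r, processes in contexts.items() if processes == {process}}


always_in_process = has_part_candidates(variants)
only_in_process = part_of_candidates("histidine catabolism", reaction_contexts)
```

With this toy data, 'step_B' qualifies for part_of but not has_part (one organism lacks it), while a reaction such as 'alcohol dehydrogenase activity', which is recorded in several processes, qualifies for neither. Note that such a check only reflects current knowledge, which is exactly the concern raised above about using lack of knowledge to exclude a relationship.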
Some redundancy already exists in the GO, in cases such as 'iron transport' and 'iron transporter.' We also have some redundancy between component types and function types. This is not necessarily bad, but we should be aware of the reasons why it is there. The functioning of the gene product is implicit; it exists in reality but not in GO. If we're not going to create functioning terms, then we will need to link the function and process ontologies in another way. We shouldn't use part_of between the ontologies. What about having a functioning_of relationship? Another proposal would be to remove the function terms and replace them with highly granular process terms [see below]. One point to consider is that not every organism will have a particular reaction as part of its pathway. This is why we're using the has_part relation and not the part_of relation, but this arrangement might still be problematic for creating GO slims. It would then be important to annotate with NOT to the process if you know a reaction/enzyme exists but the pathway is broken, which happens a lot in bacteria. Use of the NOT qualifier in these cases would be consistent with the way we normally use it. Can one function have many types of functionings? There can be many instances of functionings but only one functioning type. Processing Function Terms Amelia Ireland, GOEO In the past there has been a lot of discussion about function terms and a number of term obsoletions. The proposal presented here would move function terms representing steps in a process to the process ontology. For example, many terms under 'protein modification' in the process ontology are just one-step reactions. On the other hand, we have multi-step reactions that appear in the function ontology, e.g., ent-kaurene oxidase activity. We also have processes and functions that appear to be identical, such as histone methylation and histone methyltransferase activity.
We have had a one-step rule for creating functions, but what constitutes a step can vary between functions. Some function terms fit the definition and some do not. Some discussion as to where the one-step rule is stated. It is in the documentation and has been used by the editorial office for quite some time. When the one-step rule was instated, no one spoke up to disagree. But, perhaps the one-step rule does not clearly state what we want and can be interpreted in different ways. We seem to agree that a function is an instance of what a gene product can do at the level of the molecule and does not represent just one step in a process. For example, think of the Michaelis-Menten equation which takes into account several steps in an enzymatic reaction, e. g., binding, catalysis. However, even though there are steps, some said that there is still just one thing that the molecule does. Does the presence or absence of reaction intermediates have any bearing on this discussion? No, since for any enzymatic reaction that occurs there are going to be some intermediate steps. What about situations where, for a given function, there is one gene product in eukaryotes, but many gene products in prokaryotes? Would we represent the same reaction in different ways in GO? Would there be multiple functions for the prokaryotic gene products and one bundled function for the eukaryotic gene product? But why annotate to one conglomerate term when the current incarnation of GO allows annotators to annotate to more than one term, i.e. one term for each reaction? (See cerevisiae fatty acid synthetase, for example.) This could create problems in term nomenclature. What about multifunctional proteins? In these cases, curators would annotate to one term if this function represented a cascade of events that won't stop once started. Otherwise, curators should annotate to separate functions. 
Amelia commented here that it seems clear that many functions don't fit the definition of "elemental activities describing the actions of a gene product at the molecular level", and that there are "clean" and "dirty" functions in the GO. "Clean" functions include terms like 'arginase activity', 'transaminase activity', etc. and we can make "clean" relations between these functions and the process term 'arginine catabolism to glutamate.' Another example of a "clean" function would be 'Notch binding' which can be related to 'Notch signaling pathway.' On the other hand, "dirty" functions are not necessarily steps in a process, but represent a combination of two or more attributes from function, component, process, or some other domain. An example of a "dirty" function is 'transcription factor activity' which is defined as: Any activity required to initiate or regulate transcription; includes the actions of both gene regulatory proteins as well as general transcription factors. Other examples include receptor activity, structural molecule activity, enzyme regulator activity, hormone activity, etc. These "dirty" functions represent a role or a class and cannot be linked to process terms using existing GO relations. What is the solution? Amelia proposes that we move terms representing events into the process ontology. We could start with catalytic activities and move them under the parent term 'metabolism,' creating terms that correspond to particular EC numbers. But, if enzyme activities are "clean", why move them to process? Because these functions are events - occurrents - and should be dealt with in a different way to those function terms representing a role or class of gene product. Using arginine biosynthesis as an example, individual reactions could be organized according to the substances involved. Reactions involving arginine would be is_a children of arginine metabolism.
The reactions making up a specific arginine biosynthesis pathway could be made part_of children of the term representing that pathway, e.g. 'arginine biosynthesis from xxx'. Will listing the individual steps as children of arginine metabolism and the relevant pathway be consistent with how biologists think about this? Another example of moving function terms is illustrated by the binding terms. For example, 'Notch binding' could be a part_of 'Notch signaling pathway.' We could do the same for transporter activity, permease activity, receptor activity, ligand binding during signaling pathway, regulator activity, etc. We could also reinstate some obsolete terms such as cell adhesion molecule as the concept of cell adhesion receptor binding. All moved terms will be given many function- and process-style synonyms. So, what's left? The Brave New Function World. This world would redefine function in more colloquial terms, consistent with the dictionary definition that defines a function as the purpose that a gene product serves in the normal activity of an organism. We could keep "dirty" terms representing combinations of function, process, or other information in function, and the new function definition would allow us to add useful terms currently not allowed, such as toxin, ligand, other suggestions? Does this proposal reflect that there are problems with our function terms, or with the definitions? For example, biochemists and cell biologists look at ligands differently. There is a function in the connotation of ligand; was it just poorly defined? Maybe the definitions need to be clarified. The Brave New Function World would include the following functions: energy transducer enzyme motor nutrient reservoir signal structural molecule receptor regulator transporter toxin There is great concern that this will create new problems and confusion, in part because these are names of things, not functions. 
Barry's suggestion is that perhaps GO should create an ontology of molecular entities so that GO can clarify use of words to describe what a molecule is, what it does, what it could do, etc. But do we want to have a molecular entity ontology and a function ontology? This discussion seems to be raising a number of separate issues: * For "clean" functions, how would you show which functions are needed for a particular process? * Perhaps we need to talk about better upper level organization for function, similar to what we discussed for process? * There is a lot of messiness in the function ontology, but processing terms to group functions is not right. * There is an issue with physical entities. We do have a chance to do this clean, and we don't want to invent a jargon any more than we have to. We could keep the Brave New Functions, but call them molecular function entities or add 'function' to the names. There has been a long-standing issue with trying to define the intersection between the molecules we can assay and our understanding of their function. This is part of why we added the word 'activity' to the function terms. We've purged anything from the function ontology that sounded like a protein, but we still have these issues. We need to readdress the one-step rule and correct this, if needed. The prokaryotic community wants an association between function and process, so we're going to have to find a way to do this, but not by moving function terms into process. This seems backwards to people, as the "cleanest" functions, such as enzymatic activities, are the ones that people think biologists would most expect to see in a function ontology. 
In summary, Amelia presented the pros and cons of her proposal:

Pros:
* Reactions involving a certain substrate can all be homed underneath one term, xxx metabolism
* Reactions can be linked to pathways
* Greater precision in annotation
* Ontologies more consistent and pure
* Less redundancy between process and function
* GO function closer to colloquial interpretation

Cons:
* Some adjustment needed by those accustomed to seeing binding and enzymatic reactions in function
* May take some time to implement

An alternative solution presented would be to allow complex functions and redefine molecular function to allow for terms that represent all the biological activities a gene product has. But some molecules have completely separate functions, such as actin which polymerizes into filaments and inhibits DNaseI. Other examples cited include GPCRs and immunoglobulins.

Other thoughts from this discussion:
* Make grouping terms under function that indicate that this function initiates processes.
* Perhaps we need to broaden our definition of entity so that we can annotate not only to the function of gene products, but to complexes, as well, since in some cases, such as RNA polymerase activity, no single gene product has that function. This relates to the issue of granularity of entities.
* Can we establish links between function and complex terms in cellular component?

ACTION ITEM #9: Make actin polymerization a function term.

ACTION ITEM #10: Amelia, Chris, and David (and Barry, if he's available) should get together and come up with a single proposal for making connections between function and process. This will likely include writing new definitions for function and process.

Sunday, April 2nd, 2006

Two Taxon Option
Jane Lomax, GOEO

This issue arose from the content meeting held at TIGR in November 2005, where a proposal was drafted for a dual annotation system that accommodates annotation of multi-organism processes.
A key component of this proposal is that for multi-organism annotation, two taxa may be placed in the Taxon ID field. The first taxon ID refers to the organism that encodes the gene product, while the second refers to the organism that interacts with it. The two taxa should be pipe-separated. Dual-taxon annotation will also require adding a lot more terms and so we're considering having a few GO curators visit PAMGO later this year to develop that part of the ontology. Other plans include writing detailed documentation and guidance for annotators who may need to curate these interactions. There will also be an announcement of this documentation on the annotation mailing list and dual-taxon annotation will be discussed at the upcoming annotation camp. At the moment, though, only TIGR and PAMGO groups are probably doing this. If the interaction is between two individuals of the same species should annotators still put two taxa in the ID field? Yes. The issue of dual-taxon annotation started out originally because some bacterial proteins are injected into plant cells, which leads to the 'hypersensitive response'. Currently, this term is a child of 'programmed cell death', but is this the right parentage? We may need to create a separate ontology for these interactions, since 'hypersensitive response' doesn't fit where it currently resides. It is agreed that the GO recognizes that there are still some big problems in the underlying structure of the ontology with regard to this type of annotation and we may need to recognize a new structure of the process ontology to get this right. Having an upper level term of 'organismal process' may help understand how to place these terms correctly. ACTION ITEM #11: Add detailed documentation on dual-taxon annotation, announce this to the annotators' mailing list, include the info in annotation camp discussions, and work on developing the ontology. [Candace Collmer, Jane Lomax, Amelia Ireland, others?]
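As a rough illustration of the pipe-separated Taxon ID field described above, the following Python sketch splits such a field into the encoding organism and the interacting organism. It is a minimal sketch under assumptions: the 'taxon:' prefix on each ID, the function name parse_taxon_field, and the numeric IDs 1111 and 2222 are placeholders chosen for the example, not values taken from these minutes.

```python
def parse_taxon_field(field):
    """Split a Taxon ID field into (encoding_taxon, interacting_taxon).

    Assumes each entry looks like 'taxon:<digits>' and that at most two
    entries are present, separated by a pipe. The second element of the
    result is None for an ordinary single-taxon annotation.
    """
    parts = field.split("|")
    if not 1 <= len(parts) <= 2:
        raise ValueError(f"expected one or two taxa, got {len(parts)}: {field!r}")

    ids = []
    for part in parts:
        prefix, _, number = part.partition(":")
        if prefix != "taxon" or not number.isdigit():
            raise ValueError(f"malformed taxon entry: {part!r}")
        ids.append(int(number))

    encoding = ids[0]                                # organism encoding the gene product
    interacting = ids[1] if len(ids) == 2 else None  # organism it interacts with
    return encoding, interacting


single = parse_taxon_field("taxon:1111")
dual = parse_taxon_field("taxon:1111|taxon:2222")
same_species = parse_taxon_field("taxon:1111|taxon:1111")
```

Keeping the same-species case as two identical IDs, rather than collapsing it to one, follows the answer given above that two taxa should still be entered when both individuals belong to the same species.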
Two-Taxon Annotation - Capturing the Host Side of Interactions Jane Lomax, GOEO Given that there is a range of interactions, some harmful, some beneficial, are we annotating these interactions consistently? At present, when an interaction is mutually beneficial, we annotate both organisms. But, if the interaction is pathogenic, we only annotate the agent, not the host. A new proposal would allow for annotation of host gene products in pathogenic interactions. There is some opposition to this because, for the host, this represents abnormal processes. For example, would we want to annotate CD4 to HIV binding? Would we want to allow for ISS transference of this kind of annotation? In the case of viruses this could be particularly dangerous as there are many species-specific host-virus interactions. What about cases of rhizobium binding to plant proteins (plant nodulation)? Does apparent co-evolution, or positive selection from both organisms, matter for this discussion? There may be a grey area between interactions that are called mutualistic versus interactions that are pathogenic. Is it always clear how to distinguish the two? Where does normal start and stop? How do you distinguish an agent in the environment that is neutral versus one that causes a pathogenic response? A binding reaction could lead to a pathogenic response; how do you not annotate that activity? Isn't it a normal process to bind something in the environment? We do have MF terms that involve a gene product interacting with DDT, which may not reflect that gene product's normal function. Does the canonical life plan include interaction with viruses? Yes. But again, the normal function of CD4 is not to bind HIV. Yet it does bind, and not when mutated. There is some agreement that it is difficult to draw the line between what is solely beneficial and what may sometimes be pathogenic (think about intestinal flora here).
In the absence of any real objective dividing line, then at this point, we have to say that capturing these types of annotations is not okay and we continue to only annotate 'normal' processes. But, there are likely many different examples that will need to be considered before taking a definitive stance on this issue. ACTION ITEM #12: Candace and Trudy will send examples and relevant references to the annotation list as they come up, so that we can consider these on a case-by-case basis. [Candace Collmer, Trudy Torto-Alallibo] Obsoleting Justification Policy Judy Blake, MGI We have been getting a lot of feedback on term obsoletion from the GO user community, for example, GO users at the BRC meetings. It is necessary, therefore, to try to clarify the discussion about complex functions and about how we obsolete terms. The one-step rule for defining functions should have been challenged before becoming embedded into the system. There are two main issues with the current obsoletion situation: 1 when we obsolete terms 2 the way we handle our systems when we do obsolete terms. When searching in AmiGO for a term that has been obsoleted, a term that people use biologically, the obsolete terms come to the top of the page. We need to correct this. The AMIGO WORKING GROUP will address this. We continue to debate how we work with the rules about changing definitions. We can take an absolutist approach and say that any change requires a new ID, but what we've come to over time is that if we felt a definition didn't quite capture the essence of the term, we would change the definition, but not the ID. Another issue (see Item #7 in Judy's handout, Obsoletes Redux) is that obsoletion has been used as a shorthand way to get people to change their annotations. But is there a better way to get people to look at their annotations? We could have annotations deprecated until they are re-checked. It may also be necessary to have a better versioning system for GO terms. 
It would be possible to granularly track the way terms are changing, and perhaps we could embed the version of the term in the OBO file. One question, though: what constitutes a version-worthy change? OBO-Edit already has a mechanism in place to track obsolete with new replacement, but do people need to track less fundamental changes? SGD has a way for curators to add 'date last updated' tags to their annotations. This is manual and it is always left up to the curator to decide what changes necessitate a change to the update tag. OBO-Edit 1.2 can assign automatic or manual replacement. We need to be clear on why we obsolete terms, though. Sometimes we really need to obsolete a term but in other cases we know that there is a concept we want and we wouldn't make the term obsolete except for the fact that we know there are erroneous annotations. Bottom line: we'll still want to have automatic replacement as well as suggestions for having a look at another possible replacement term. One concern is that our users don't understand why we keep adding and deleting terms such as cytokinesis. Cytokinesis has been obsoleted because we've diddled with the definition, but users don't understand why we're doing this. The key here is that probably the definition hasn't altered the correctness of the term. Another case concerns that of the molecular function plasma protein. We obsoleted this term, but didn't make a term to replace it, such as a component term 'extracellular.' Would this still have been within the scope of the GO, i.e., is plasma a cellular component? There seem to be two issues here: 1) There does need to be some mechanism for obsoleting terms and checking annotations. But not every change to the definition warrants obsoletion. Trivial changes don't warrant obsoletion. 2) Annotations drive what the GO is. Operationally, there needs to be a lot closer communication between the editorial office and the annotators. We need to think about our policies for communication.
ACTION ITEM #13: Develop a new policy for communicating about term obsoletions. The person proposing obsoletion should get in touch with the annotating groups (using contact information from the gene association file) informing them that the term is under review while also soliciting input and suggestions on what changes to make. This could be scripted. But, we still need to have a system for making sure people check their annotations if changes have been made to a term. Perhaps we could use the regular monthly reports to alert people that a term has changed somewhat and ask them to please review their annotations. But how do we enforce review? It could be set up so that until a term is reviewed it will be removed from the file. Filtering would be done by annotation date. An additional feature built into this script could alert curators when leaves have been added to the original parent annotation so that curators could check and possibly add higher granular annotations. Concerns: Is there a way, without having a gene association file, to find and check for obsolete terms? How do we make term obsoletions obvious to groups that are not part of the consortium? We may now have a mechanism in place for communicating obsoletions, but we still need criteria for deciding when an annotation is potentially affected by a change in the definition. And, people really need to check their annotations before a change happens. Should we revisit old obsoletes and for those made obsolete because of annotation issues, should we go back to the old terms and make their IDs secondary IDs? ACTION ITEM #14: Reinforce the policy that we will no longer obsolete a term just because the definition has changed or because annotations are thought to be bad or incorrect. ACTION ITEM #15: Explore the proper technical solution for establishing a mechanism to notify GO users when a term has changed, or rather, when we are thinking of changing a term. 
This solution should consider:
* Do we need to enforce a way of making sure people make any necessary changes?
* How best to contact people? Monthly reports? Newsletter? Web page?
* Can we adapt the existing system but have a version for 'proposed changes to a term's definition'?
* Could we have users subscribe to a mailing list to be alerted when there are changes to terms they're interested in? We could have a 'track this term' feature in AmiGO for communication outside GO.
* Which people should be alerted to obsoletions?

The EC has a procedure whereby, on a periodic basis, they post all proposed changes and have a public comment period. The editorial office shouldn't have to pester people to review their annotations; this is something that the PIs should do. At many past GO meetings, for many past proposals, there were no comments or objections raised. Is silence tacit approval? That seemed to be the case, but the editorial office would like an acknowledgement that the proposed changes are really okay with people, since silence does not mean 'okay.' Maybe we also need to have a minimal length of time for which a SourceForge item must stay open to allow people time to look things over. With the new grant proposal refocusing 'embedded' GO curators' efforts, there should be a greater response to potential term changes. We could also try having a wiki where each group needs to actively check a box, yes or no, by a certain time. We will need to give people a reasonable amount of time to respond, and also perhaps provide a box for comments. A script would remind people, 24 hours before the deadline, if they haven't responded. Each site must appoint someone to take care of this responsibility. GO PIs will be notified of chronically non-responding groups.

ACTION ITEM #16: Revisit obsolete terms to see which can be merged with current terms. Decide if the IDs of obsoletes could be made secondary IDs to the currently existing term.
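The idea raised under ACTION ITEM #13, that annotations needing review could be picked out by annotation date, might look something like the Python sketch below. Everything concrete here is assumed for illustration: the record layout, the placeholder names 'gene_1', 'term_A' and so on, the dates, and the helper annotations_needing_review are hypothetical and do not describe an existing GO script.

```python
from datetime import date

# Hypothetical annotation records: gene product, term, and annotation date.
annotations = [
    {"gene": "gene_1", "term": "term_A", "date": date(2005, 1, 15)},
    {"gene": "gene_2", "term": "term_A", "date": date(2006, 3, 1)},
    {"gene": "gene_3", "term": "term_B", "date": date(2004, 6, 30)},
]

# Hypothetical record of when each term's definition last changed.
# Terms that never changed are simply absent.
term_changed_on = {"term_A": date(2006, 1, 1)}


def annotations_needing_review(records, changed_on):
    """Return annotations made before the term they use was last changed."""
    flagged = []
    for record in records:
        changed = changed_on.get(record["term"])
        if changed is not None and record["date"] < changed:
            flagged.append(record)
    return flagged


to_review = annotations_needing_review(annotations, term_changed_on)
flagged_genes = [record["gene"] for record in to_review]
```

In this toy example only the annotation for 'gene_1' is flagged, because it predates the change to 'term_A'; the later annotation to the same term and the annotation to the unchanged 'term_B' pass through. Whether flagged annotations are merely reported or actually withheld from the file is the policy question left open in the discussion.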
ACTION ITEM #17: John Day-Richter will talk to the GO Editorial Office to find the best way to implement these changes and suggestions in OBO-Edit. ACTION ITEM #18: Add a term creation date to the .obo file. Annotation Issues Issues Arising from Annotation Camp Karen Christie, SGD There were a number of annotation issues that arose at the 2005 annotation camp that seemed appropriate for discussion by the entire consortium. However, there has been a considerable amount of time (nine months) between the annotation camp and this latest consortium meeting. Since we are about to have another annotation camp, we should work out procedures for how to reconcile annotation issues that arise at camp in a timely fashion. The upcoming annotation camp will consist of two parts, one of which is internal and will involve representatives from each of the reference genomes and the other of which will consist of outreach and training. For non-contentious annotation issues, is it okay for camp participants to put concrete proposals directly to the list to be checked and ratified, rather than wait for the next consortium meeting? Such proposals would be specifically about annotation issues; content issues would still need to go through the appropriate channels. The general feeling is that yes, this would be okay, but it would be good to provide a bullet summary of the recommendations in the camp minutes, which were excellent last year. Also, since we now have group leaders for different aspects of GO, such as AmiGO, content, we should make explicit contact with each of the other groups if issues arise at camp that affect these groups. The upcoming annotations camp will emphasize annotation consistency. How can we measure and track annotation consistency and quality across groups? This is something that we said we wanted to do in the grant proposal. Dates for the upcoming annotation camp are tentatively set for July 10 -14, 2006. 
Camp will be funded, in part, by the Genetics Department at Stanford. Annotating to Unknown David Hill, MGI Proposal: unknown terms in the ontology would be eliminated and annotations to these terms would be removed and gene products reannotated to the root of each branch of the ontology, BP, MF, CC. Everyone agreed that this was okay, but we did not establish a timetable for this change. Annotation of Common Knowledge Paper Introductions Karen Christie, SGD This issue came up at the last annotation camp. Some groups allow curation of information from the introduction of papers where authors cite research papers as references. These groups use TAS for the evidence code in these cases. Other groups require curators to actually look these statements up. At camp, we didn't come up with a recommendation for the best annotation practice for this situation. It was generally agreed that TAS is not a high quality evidence code and for reference genomes especially, the goal should be for each gene product to be annotated with an experimental evidence code. References cited in the introduction of a paper are not always the most relevant and may not even be from the same species as the gene product being annotated. Some groups have TAS annotations and have used the code in different ways. Rat uses it, pombe uses it as a placeholder until the original paper can be curated, and SGD used it to curate information from reviews. TAIR has a large number of TAS annotations and GOA has TAS annotations inherited from Proteome and annotations from Swiss-Prot where annotators weren't distinguishing between author statements and experimental evidence. Should groups retro-curate and remove TAS annotations? Yes, if possible. ACTION ITEM #19: TAS is no longer considered a useful evidence code and will not be used in any consistency measures of reference genome annotation.
Since part of the idea of the reference genomes is to provide a source of IEA annotations for other groups, we strongly encourage reference genome annotators to not use TAS, and instead use experimental evidence codes whenever possible. The GO documentation should also state this in a clear fashion. Annotation of Common Knowledge 'Textbook Knowledge' Karen Christie, SGD Common knowledge, such as that found in textbooks like Stryer's Biochemistry text, has, in the past, been annotated using the TAS evidence code. For example, alcohol dehydrogenase has been annotated to CC:cytosol using TAS and Stryer, but this information really isn't traceable from Stryer to an experimental paper. This type of TAS annotation predates the IC evidence code which might be a much better way to annotate this kind of information. Consider the case of ribosomal proteins, where there may be strong sequence conservation leading to an ISS annotation to CC:ribosome. In this case, then, an annotation to BP:translation, using the IC evidence code would be appropriate. (Note how IC is being used in this way to link ontologies together.) What reference should be used? What entity goes in the WITH column? Curators should cite the paper that they're reading in the reference column and put the GO term that they used to make the connection in the WITH column. If the original ISS annotation was made with an internal db_ref, then that reference would be used for the IC annotation, too. We should consider establishing criteria for making IC statements, as some inferences may be more explicit than others. This would be a good topic for discussion at annotation camp. NAS vs Experimental Evidence Codes - Data Not Shown Harold Drabkin, MGI This is another issue that came up at annotation camp. When experimental results are reported in a paper but followed with 'data not shown,' what is the appropriate evidence code to use?
MGI uses IDA (or whatever is the correct experimental evidence code) because it is often the case that the experimental method is clear and journals would require the data to be shown, if needed. 'Data not shown' can often be the result of space limitations for publication. 'Data not shown' can be subdivided into two types, though: the first being cases where it is clear what assay was used, and the second being cases where it's not so clear how the data was acquired. Should curators annotate the latter, and if so, using what evidence code? NAS? Should curators contact the authors about this type of data? What about information cited as personal communication? The general consensus seems to be that curators should use their best judgement in these cases. If you are confident that the author is clearly stating what they did, then use an experimental evidence code. It's okay to accept that the authors did what they said they did. If you are not confident about the experimental evidence code, however, it's probably best not to curate the information, as NAS is not a very useful evidence code and we should strive not to use it. Information cited as personal communication should not be annotated. Same Protein, Different Organisms, Different Strains Michelle Gwinn-Giglio, TIGR When annotating different bacterial strains, is it okay to use experimental evidence codes if the actual experiments were performed in a different strain? This question was posed to the mailing list and no real consensus was reached. Should the annotations be made using the ISS evidence code, or is IDA okay? We know that bacterial strains can be hugely different from one another. What constitutes a strain or species in bacteria? Michelle thinks that it's okay to transfer the experimental evidence code when the sequence similarity is 100%. Some agree with this, but others disagree and think that ISS or even IC would be the more appropriate evidence code. 
People think that this type of annotation is okay, because when we make annotations, we're always making inferences and asserting the typical function of a gene product in that species. However, there are cases where gene products that are 100% identical do not have the same function in different strains. The feeling is that we do need to be pragmatic here. The sequencing and the biochemistry of a particular organism may be done on two entirely different strains and we are really annotating the potential of a given gene product. Further, it is probably more likely that the distinctions annotators will encounter will lie not in molecular function but in biological process. You may have the same base activity for a gene product, but the process it's involved in might be different.

ACTION ITEM #20: Add to the documentation that it is okay to use experimental evidence codes for identical/similar gene products from different strains of the same species.

Annotation of Gene Products 'Acted Upon'
Michelle Gwinn, TIGR

In the current documentation, we state that we only annotate those gene products that are involved in a process, not those that are acted upon by a process. For example, we annotate gene products that are involved in the process of secretion, but not the product being secreted. There is some concern that if we follow this argument, we could annotate to genes upon which a transcription factor acts, or that when two proteins bind, that protein A acts upon protein B by binding it. We need to be careful how we go down this route. There is strong feeling that if we decide to pursue this type of annotation that we don't do so in the context of the current annotation file format. We should, instead, place these annotations in a different file. There are groups, such as MGI, who capture this kind of information in a text/notes field under the category of 'target.'
Other groups, such as TAIR, have added new relationships such as 'has_protein_modification_type' that are in TAIR, but are not included in the gene association file sent to GO. This highlights that individual databases can easily store additional information, but can we be consistent about how this information is eventually presented?

But does GO want to support this type of data? At present, the relationship between the genes and GO terms is implicit. If the relationship is made explicit, then it would be okay to capture this type of information. We can easily add more to the same file, or to a new one, where we explicitly state these relationships, since there are groups out there that want to see this type of information. Could this be a way for GO to handle those things that are not judged as 'normal,' such as host-virus interactions?

This problem arises because we have these implicit relationships. However, these relationships are actually more subtle than we usually state. For example, the cellular component annotations really represent the end destination for a gene product. We don't annotate every point along its path while it gets to that final destination. There is a feeling that we could capture this type of information, but that until we make relationships more explicit we probably want to keep these annotations in a separate file. We could add another column to the current gene association file, but users might then have to filter the file to remove this set, and it could create confusion. Also, what are the ramifications for existing GO tools?

John pointed out that we are already discussing creating a more expressive gene association file and recording this type of data could be part of a pilot project related to this. But there is still some discussion about whether or not GO really wants to support this type of annotation, since this type of annotation would really expand the role of what GO provides. Is this within our project scope?
Some annotators see this as a logical progression of what information users will want to see in GO. For example, if a gene is annotated to MF: tyrosine kinase, the next question is likely, what gene products does it phosphorylate? General consensus: having these annotations will be useful, but will require more explicit relations than what we are currently making, so we need to decide how we want to do this.

ACTION ITEM #21: Individual groups can collect this [acted_upon] data knowing that, in the future, GO will present this type of relation. Chris, John, Sue, Candace, and Michelle will form a WORKING GROUP to come up with a proposal for how to implement this. Note that this type of annotation will not be a core requirement, but that GO will facilitate its display if groups want to do this.

Annotations Inferred from Genetic Context
Michelle Gwinn-Giglio, TIGR

What evidence code could be used for annotations made on the basis of genome context? This situation arises frequently in bacteria, when the evidence that surrounding genes are involved in a process is well-supported, but the evidence for other genes, also within the operon, is less well-supported. One possible evidence code is IEP, but if the expression data is not shown, this won't work. What about IC? Michelle would like to propose a new evidence code, IGC, Inferred from Genomic Context, to deal with these annotations. The idea here is that you are using the positional context of the gene to make the annotation.

What would curators put in the WITH column for these annotations? The WITH column could have the SO ID for operon, but it is agreed that we would want to capture which operon is being annotated and perhaps even list all genes in this operon, or at least the first and last genes in the operon. Could annotators make function annotations in these cases, in addition to process annotations?
There was general agreement that, for prokaryotes, using operons to make function as well as process annotations is okay, especially since TIGR uses operon position as a contributing factor in the annotations, rather than the sole support for one. Could this evidence code also be used for eukaryotes in cases of synteny? This might be harder to do, but we do want a way to annotate cases where the evidence based on sequence similarity might not be very strong, but there is other evidence, i.e., genomic context, that suggests a gene product is involved in a particular process. Matt brought up the example of variant surface proteins for illustration. This issue also came up at annotation camp, where the sequence similarity of flanking genes was very good, but the sequence similarity of the gene to be annotated was not so good. If you just used ISS to annotate the latter, what would you put in the WITH column?

ACTION ITEM #22: Create a new evidence code IGC, Inferred from Genomic Context. The precise definition of this code and procedures for annotation (what to put in the WITH column) will be hashed out and added to the documentation. [Michelle Gwinn and Matt Berriman]

Pseudogene Annotation
Michael Ashburner, FlyBase

There is general and strong agreement that GO will not annotate to pseudogenes as defined in the sequence ontology (SO). It was generally agreed that annotating pseudogenes is wrong and that they should not have a GO annotation; instead they should have a SO annotation. When the pseudogene acquires a function it is no longer a pseudogene and could then gain a GO term. One issue that arises with this, though, is that MODs need to be careful and consistent with how they define pseudogenes. This is especially true for the reference genomes. Does a single frameshift constitute a pseudogene? Probably not, in most cases.
There are 12 genes in FlyBase that have a premature stop codon and these are not called pseudogenes as they are known to be functional in other strains and there are no other frameshifting deletions within their sequence.

Looking up Sequence Accession Numbers
Karen Christie, SGD

This is another annotation issue that came out of discussions at last year's annotation camp. For papers that discuss sequence similarity (or show alignments of protein sequences classified in the same family) but do not give accession numbers for the proteins listed (and many papers don't), is it okay for curators to look up the accession IDs to make an ISS annotation? If the paper being annotated doesn't show the experimental evidence for the gene product used to make the ISS annotation, and you don't know if an experiment has actually been done with that gene product, should you make the annotation? Can you check to see if the experiment has been done and then make the annotation?

MGI curators look up accession numbers all the time. They establish the orthology relationship and then look to see if there is a direct experiment for that gene product. For proteins, they then add the Uniprot accession ID to the WITH column. Ref_seq IDs could be used in the case of ISS annotations based on nucleotide similarity. Not all databases have been doing this, though. Other groups take the authors' word on the sequence similarity, but don't look up the accession IDs. For these ISS annotations, there is nothing in the WITH column. The ISS evidence code, in fact, actually predates the WITH column. There is agreement, though, that going forward, we need to fill in the WITH field for ISS annotations and that there must be an experimental evidence code for the gene product cited in the WITH field. This is not stated in the current documentation on ISS.

ACTION ITEM #23: Update the documentation on using the ISS evidence code to emphasize that annotators need to enter something in the WITH field.
In the case of gene products, there must be an experimental evidence code for that gene product which supports the annotation, i.e., we don't want to have circular ISS annotations. Uniprot IDs, ref_seq IDs, or individual MOD gene IDs would be okay to use in the WITH column. Old ISS annotations that don't have an entry in the WITH column will not need to be retrofitted immediately.

Usage of HMM Evidence
Michelle Gwinn-Giglio, TIGR

This issue first arose about a year and a half ago and refers to cases where HMM models, or other models such as a neural network model, are used to determine sequence similarity. Can these models be cited in the WITH field? TIGR has been using them in this way from the start, but there seemed to be some concern that this was not okay. One of the concerns was about whether or not the models and their IDs are stable. The models may change over time, but the ID associated with a given model is stable. For CBS (Center for Biological Sequence Analysis, Copenhagen) models, the names of the models correspond to their IDs and these models are available to anyone who may want to test their sequences against them. There is general agreement that it is okay to use the CBS models in the WITH field for ISS annotations. SignalP, another program that uses both an HMM and a neural network model, will require a new, specific abbreviation.

WITH Column Working Group Report
Harold Drabkin, MGI

What entries are acceptable for the WITH column? WITH is an evidence code qualifier that consists of a searchable database ID and that relates to the evidence codes in the following way:

ISS - something similar to the gene product
IMP - could be an allele
IPI - whatever interacts with the gene product
IGI - whatever interacts with the gene
IC - the informative GO term
IEA - under discussion
IDA - under discussion

There was some discussion about whether there is a legitimate entry in the WITH column for IDA annotations.
Some curators have suggested that a target or drug could be added to the WITH column for IDA, but the general consensus is that this is not in the spirit of what we want to capture in the WITH field and therefore, we won't allow entries in the WITH column for IDA annotations at this time.

What to put in the IEA WITH field generated much discussion. We want users to be clear on what the WITH field signifies, but at present, a number of different IDs, from different algorithms, are used here, for example, interpro2go, spkw2go, ec2go, etc. Are these IDs sufficient for users to understand the relationship between the gene product and the WITH field? There is a general consensus that the combination of a reference and an appropriate ID in the WITH column is enough information for users to figure out what the ID means for IEA annotations. Although the IDs may vary, the common relationship between them and the gene product is that the ID is an 'object that the gene product matched when the algorithm was run.'

ACTION ITEM #24: GO will disallow WITH column entries for IDA annotations.

ACTION ITEM #25: Document that WITH column entries are essential for all match-based methods of annotation and that a valid database ID is required for IEA WITH entries. The WITH column won't be mandatory for tools that just predict GO annotations, as the reference entered will describe the tool/algorithm used.

What is inferable from RCA Evidence?
Linda Hannick, TIGR

In some cases, large-scale experiments make statements about function from their experiments, such as physical interactions. Many groups feel comfortable annotating to a process based upon large-scale experiments, but not to molecular function terms. Should the GO have a policy on what types of annotations can be made using large-scale data and the RCA evidence code?
One specific example that Val put forward to the mailing list concerned a cerevisiae paper where authors made function assertions based upon analysis of large-scale interaction data. The data from this paper was then used to annotate to MF terms. Subsequently, SGD has reviewed these annotations and removed any annotations to MF terms. But can, and do, we want to make a statement that MF annotations can never be made from computational analysis? The general consensus is that it is difficult, at this point, to say that curators should never annotate to MF using the RCA evidence code, especially since the body of work that relates to this question is still relatively small.

ACTION ITEM #26: Add more examples of how the RCA evidence code can and should be used for GO annotation based on published literature to date.

Large-Scale vs Small-Scale, but Same Evidence Type
Eurie Hong, SGD

Large and small-scale experiments can, and often do, result in annotations that use the same experimental evidence code, such as IPI or IMP. Should GO somehow try to differentiate the two types of experiment for users? The discussion was centered around whether large-scale vs small-scale is really the crux of the matter. A large-scale experiment is not synonymous with a poor quality experiment, and likewise, small-scale does not equate with high quality. Is the real issue one of experimental method, rather than scale? Are we trying to inform users about the potential pitfalls of different experimental methods and is that within the scope of GO? Would doing so require expanding the current evidence codes?

ACTION ITEM #27: No conclusion about how to distinguish large- vs small-scale experiments was reached. People are encouraged to keep thinking about this issue which clearly needs more discussion.
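The WITH column decisions recorded above (ISS needs a WITH entry, IC carries the informative GO term, IDA takes no WITH entry) could be checked mechanically. The following is only a minimal sketch: the function name, the messages, and the assumption that multiple WITH entries are separated by '|' are illustrative and are not part of any agreed GO tool.

```python
# Sketch only: rule names, messages, and the '|' separator are assumptions.

def check_with(evidence: str, with_field: str) -> list[str]:
    """Return a list of problems for one evidence code / WITH column pair."""
    problems = []
    entries = [e for e in with_field.split("|") if e]

    # IDA: WITH entries are disallowed.
    if evidence == "IDA" and entries:
        problems.append("IDA annotations may not have a WITH entry")

    # ISS: something similar to the gene product must be cited.
    if evidence == "ISS" and not entries:
        problems.append("ISS annotations need a WITH entry")

    # IC: the informative GO term goes in the WITH column.
    if evidence == "IC" and not any(e.startswith("GO:") for e in entries):
        problems.append("IC annotations need a GO term in WITH")

    # Every entry should look like a database ID of the form DB:accession.
    for entry in entries:
        db, sep, accession = entry.partition(":")
        if not (db and sep and accession):
            problems.append(f"WITH entry is not a DB:ID pair: {entry}")

    return problems
```

Calling `check_with("ISS", "")` returns one problem, while `check_with("IC", "GO:0005840")` returns an empty list. The check that the cited gene product itself has an experimental annotation would need access to the annotation files and is not attempted here.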
GO Reference Collection
Midori Harris, GOEO

The GO reference collection, a collection of descriptions of methods that groups use for ISS, IEA, and ND evidence codes, needs to be more visible and easier to use. This information is immensely useful, but not even all GO Consortium members knew that it existed even though it currently has a home in the GO CVS repository. The AMIGO WORKING GROUP agreed that it is really important to make these references more visible and there is also general agreement that different groups who are using the same process should be citing the same references, thus avoiding duplication in these references.

ACTION ITEM #28: The AMIGO WORKING GROUP will implement a strategy to incorporate and display the contents of the GO references.

ACTION ITEM #29: Existing GO references will be examined to check for and eliminate redundancy. [Midori Harris and Karen Christie]

Ontological Content, continued

The Use of Sensu in the GO
Chris Mungall, BDGP

There are currently two uses of sensu in the GO. The original use of sensu was as a linguistic qualifier or linguistic disambiguator meant to distinguish cases where the same word referred to different types that are not related, for example, 'bud' or 'trichome'. A second use has arisen, though, which is that of a type qualifier. This use is legitimate, but shouldn't be lumped together with the original use. Organism type specificity is a genuine challenge for the GO, but sensu has been wrongly recruited to fix this. There are two problems: 1) We have conflated the meaning of sensu resulting in lack of precision, and 2) We have added taxon IDs which isn't quite right for this use.

The proposed solution is to retain sensu for its original purpose, that of a linguistic qualifier. Its interpretation then becomes: 'as used in the XYZ' community. Taxon IDs would not be required, as the use would not be restricted to organism-specific communities.
Biochemists or cell biologists working on the same organism may talk about the same term differently. A second part of this solution is to introduce a new relation for genuine organism-specific terms (contextual parts). This involves the idea of contextual synonyms, exact synonyms with a context qualifier. This allows users to configure particular applications, e.g., a user could configure to use the plant context exact synonyms, but we don't need to be as specific here as the actual taxon ID. The context should be the insect community, plant community, etc., in all places where this occurs. Context, in these cases, refers to sociolinguistic context. Further, the use of sensu would not be inherited as it is right now. There would be no need to carry its use through.

For other situations, we do want to introduce genuinely different biological subtypes. In this case Chris proposes adding an 'in_organism' relationship. An example would be that of 'thylakoid', where we would have 'thylakoid, in cyanobacteria' as a subtype of 'thylakoid' instead of thylakoid (sensu Cyanobacteria). We could use the NCBI taxonomy as our organism ontology to make the relationship between the term and the taxon. These could be put in the .obo file, but this would mean that the .obo file would no longer be an insular, standalone file. Alternatively, we could keep the links in a separate file. What about cases where something is present in many organisms, such as all gram negative bacteria? What would we do then? Would we need to create a new ID? We do allow for combinations in the .obo file, so there are ways to address this. We do want to use valid taxon IDs.

ACTION ITEM #30: Chris Mungall and Jen Clark will discuss the different aspects of changes to our use of sensu, write documentation on this, and implement the new strategy. This change will then be announced to the community.
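To make the proposed renaming concrete, the sketch below splits a term name of the form 'X (sensu Y)' into its base name and qualifier, and builds the 'X, in y' style name used in the thylakoid example. It is an illustration only: the function names are hypothetical, and the code cannot decide whether a given sensu is a linguistic qualifier (to be kept) or a type qualifier (to become an in_organism subtype); that remains a curator decision.

```python
import re

# Matches names such as "thylakoid (sensu Cyanobacteria)".
SENSU = re.compile(r"^(?P<base>.+?)\s*\(sensu (?P<qualifier>[^)]+)\)$")


def split_sensu(name):
    """Return (base name, sensu qualifier), or (name, None) if there is no sensu."""
    match = SENSU.match(name)
    if match is None:
        return name, None
    return match.group("base"), match.group("qualifier")


def in_organism_name(name):
    """Rewrite 'X (sensu Y)' as 'X, in y'; other names are returned unchanged.

    Lower-casing the qualifier simply mirrors the example in the minutes.
    """
    base, qualifier = split_sensu(name)
    if qualifier is None:
        return name
    return f"{base}, in {qualifier.lower()}"
```

For instance, `in_organism_name("thylakoid (sensu Cyanobacteria)")` gives 'thylakoid, in cyanobacteria', while a name without sensu is passed through untouched.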
Demonstration of Cross-Products between GO and CO
Chris Mungall, BDGP

Chris showed a demo of how cross-product terms could be represented in OBO-Edit and in AmiGO using the example of 'larval locomotory behavior,' which exists as a diamond in the current tree. First, he took the term and created a logical, cross-product definition, locomotory behavior during larval stage, using the larval stage definition from the FlyBase anatomy/dev stage ontology. The generic term here is locomotory behavior, and its differentiating characteristic is that it occurs during the larval stage. Having the logical definition makes it possible to disentangle the diamond.

Another example is that of the term 'differentiation,' a term that implicitly refers to the cell-type ontology. Performing a first pass using the OBO software, we can make cross-product terms such as 'osteoblast differentiation,' meaning: cell differentiation that has_participant osteoblast. We need a better relationship, though, and don't want to use a relationship from the cell ontology because we want a relationship between a cell and a process, i.e., between a continuant and a process, and the cell ontology relationships are between continuants. Are the definitions created necessary and sufficient? Yes, the OBO intersection_of tag indicates necessary and sufficient. Adding these terms will allow for querying GO by cell types, but we will need OBO 1.2 to be able to represent them.

One concern: the example used the term 'larval locomotory behavior' but lots of species have larvae. This highlights the need for more general anatomy and developmental stage ontologies and provides one reason why cross products between GO terms and cell type are being tackled first - it's much less species-centric. When looking at cross-products in OBO-Edit, annotators can see the creation of new terms based upon is_a relationships in the cell ontology.
For example, 'macrophage cell activation' would have an is_a child 'microglial cell activation'. OBO-Edit infers that this child term is okay, but curators would be able to check yes or no to confirm that the term makes sense and should remain in the ontology. Reasons for not accepting a term might include: 1) The relationship between the cells is not correct, or 2) The process may not actually occur. OBO-Edit will allow for curators to check the correctness of terms before they get added and will allow for changes to the computable definitions within the cross product window. There will be ways to do this in bulk, if needed.

Where do we go from here? We are ready to start putting these terms into the .obo file, and will need to use OBO 1.2 format to accommodate these relationship types. We will also need to have a way to hide these from the average user until the world is ready for this.

Slightly off topic discussion ensued about splitting the gene_ontology.obo file into edit and general versions, generally agreed upon. Chris and John agreed to experiment with the edit version.

Slightly off topic discussion ensued about post-composition in the gene-association files, for example a new column in the file format for slots, e.g.:

OBOREL: located_in [MA:liver]
OBOREL: has_primary_participant [FBbt: Y_neuron]

ACTION ITEM #31: Chris and John will develop a plan for implementing cross-products between GO terms and the cell-type ontology. Part of this plan will involve splitting the Gene Ontology .obo file into an edit version that would be filtered into a gene ontology .obo file still using obo-edit 1.0. Also, will need to consider what to do with explicitly stated relationships (relevant to earlier discussion of acted_upon) and work out the specifics of what should happen if curators need to provide feedback on relationships within the cell ontology (e.g., contact Oliver Hoffman). In parallel, we should also come up with a plan for AmiGO development and user education.
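The cross-product idea demonstrated above can be sketched in a few lines. The first function writes a logical definition using the OBO intersection_of tag (a generic term plus a differentiating relation); the second proposes child terms from is_a links between cell types, which a curator would then accept or reject. This is not OBO-Edit's implementation: the identifiers with the EX: prefix are placeholders, has_participant is the provisional relation mentioned in the discussion, and the single is_a link is the one quoted in the minutes.

```python
# Illustrative sketch; IDs, relation name, and data are placeholders.

def obo_stanza(term_id, name, genus_id, relation, cell_id):
    """Build a [Term] stanza whose definition is genus + differentia."""
    return "\n".join([
        "[Term]",
        f"id: {term_id}",
        f"name: {name}",
        f"intersection_of: {genus_id}",             # the generic term
        f"intersection_of: {relation} {cell_id}",   # the differentiating characteristic
    ])


def candidate_children(process, cell_is_a):
    """For each cell-type link child -> parent, propose
    '<child> <process>' is_a '<parent> <process>' for curator review."""
    return [
        (f"{child} {process}", f"{parent} {process}")
        for child, parent in sorted(cell_is_a.items())
    ]


# Hypothetical input: child cell type -> parent cell type.
CELL_IS_A = {"microglial cell": "macrophage cell"}

stanza = obo_stanza("EX:0000001", "osteoblast differentiation",
                    "EX:0000002", "has_participant", "EX:0000003")
proposals = candidate_children("activation", CELL_IS_A)
```

Here `proposals` holds the pair ('microglial cell activation', 'macrophage cell activation'). A proposal would be rejected if the cell-type relationship is wrong or if the process does not actually occur.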
Working Group Reports

The AmiGO Working Group
Eurie Hong, SGD

This presentation addressed who the AmiGO working group is and how they should interact with the rest of the group. The AmiGO working group is open to all consortium members and there is a Majordomo mailing list for people to sign up for if they would like to participate. (Works the same way as other GO mailing lists.) Currently, AmiGO operates under a three-month release cycle, broken up approximately as follows: one to two months of developing mock-ups and specifications and one month of a testing cycle to identify bugs and tweak features, followed by production and installation at Stanford. This implies four releases a year, but that number could be higher, depending upon what issues arise during each release cycle.

Questions:

At what point should users become involved with AmiGO development? Should we have focus groups for AmiGO prototypes, since there are different types of users? How would we identify users that are not curators? The consensus seems to be that it's important to get people from outside the consortium to provide feedback on AmiGO development. We could perhaps find these people via SourceForge entries or from a GOC newsletter. We could also tap into other communities that use AmiGO, such as the plant ontology consortium.

If there are proposals for changes, should they be sent out to the consortium for a period of feedback? Agreed that it would be helpful to send an email indicating which issues are being addressed for the next release and which remain open in SourceForge. This would allow people to see the current priorities and make specific comments or requests, if needed.

How should we handle user support? The email addresses on the AmiGO web pages could be routed to a more targeted group of people, to make sure that messages reach the group that will definitely deal with them.
News on upcoming releases:

Current Release
* Main fix - viewing term siblings and parents
* Ontology filter for terms
* Obsoletes sorted to the bottom of the page

Next Release
* Improved searching and filtering (search box on every page)
* Reorganization of term search results to look more like gene search results
* Try to make it obvious to users where they will end up when they click on something

High Priority Items
* Dealing with tangled paths to root
* Displaying cross-product terms
* Displaying IEA annotations (takes a lot longer to load the database with IEAs)

Final comments: If any MOD users have questions about AmiGO, please send them on to the working group, and please join if you are interested in AmiGO. One note: you don't have to be on the AmiGO mailing list to email the group.

OBO-Edit: The OBO-Edit Working Group
John Day-Richter, BDGP

The OBO-Edit Working Group was formed to address bugs in the OBO-Edit software and to give direction to OBO-Edit development. The working group currently consists of members of every group in the consortium. A user survey listed several new features as having high priority. First on this list was a user's guide, followed by a basic annotation tool to have some way of associating a gene product ID with a term, OBOL integration, and bug fixing. Other desirable features include usage movies, FAQs, how-tos, and public webinars open to the world (see below). Midori requested a future directions email so that the working group could help prioritize the rest of the user requests.

A user's guide with greater documentation will be released separately from software releases, since in the past, having more documentation wasn't necessarily grounds for issuing a new release. Members of the working group have signed on to co-write and edit various sections of the documentation. Another goal has been to have a more regular release schedule with input from the working group required about when to make a beta version an official version.
The group still needs to work out what criteria need to be met before a release will be deemed official, but getting to another official version of OBO-Edit should be a high priority. Once an official version exists, however, the official editing software will not change until another official version is released.

Improved communication in the form of a remote tutorial was tried in February, using different technologies like VNC, Gizmo, and IRC text chat, but there were some technical problems with VNC and Gizmo that will need to be addressed for the success of future webinars. Complete transcripts of the IRC commentary from this first training session are coming.

ACTION ITEM #32: To alleviate technical problems with remote tutorials, which are likely more cost-effective than flying everyone to a particular place for training, GO will investigate retaining the services of a company for hosting future webinars. John will investigate various options available to us.

Once a new version of OBO-Edit has been released, should users get it immediately? In other words, what quality control measures are in place? John recommends that, when a new version is released, annotators use the new version but don't commit their files. He has written a test suite that must be passed before each new release. If there is a new bug report for a given version, then a test is added which must be passed before the next release is issued. All working group members have a list of things that they must test because John can't run all possible GUI tests. This has helped a lot.

ACTION ITEM #33: The OBO-Edit User's Guide, which is available in OBO-Edit in the docs directory, will also be made available on the GO website. John will talk to Mike Cherry about how to get this done.

ACTION ITEM #34: The OBO-Edit working group should come up with a time line for release of the official version.
Transition to OBO-Edit Version 1.2
Mike Cherry, SGD

We will need to make an announcement to the user community regarding the switch to OBO 1.2. We should pick a date for switching over and then be very deliberate about this change, giving users plenty of time so that they can fix any scripts that use the file. What is the best vehicle for announcing the change? The GO home page, the GO Friends email list, a monthly GO newsletter (see below)? When we switch to OBO 1.2, are we going to generate both an old and a new obo file? Yes, but perhaps the gene_ontology.obo should be in the new format and the old file should be renamed something else. We will also want to make announcements when there are other big changes to the file, such as changes to the use of sensu or having links between process and function.

The need to make specific versions of the gene_ontology.obo file was also discussed. Maybe we need to make specific versions when there is a substantial change in format/structure, e.g., when a new relationship type is brought in; say, make version 2 when the 'sensu' changes are made. OBO 1.2 will have the functionality for meta-data (like tracker IDs in SourceForge).

How best to communicate changes in the GO? One suggestion was to bundle the monthly release with a report/release notes that would document any changes. Currently, a semi-automated monthly report is generated. These release notes would include the results of the monthly report, along with a human-readable summary that highlights any major changes in the file. For the monthly report, Rama has suggested that SourceForge IDs be added so that changes can be traced back to the original request.

ACTION ITEM #35: The monthly archive of the GO will also include Release Notes. These notes would include the output of the monthly report script, including the relevant SourceForge IDs, as well as human-readable text that summarizes significant additions or changes to the GO file.
[GO Editorial Office]

ACTION ITEM #36: Form a NEWSLETTER WORKING GROUP to develop a GO newsletter that will provide a vehicle for making announcements about a number of GO-related issues, such as major changes to GO, our meeting schedule, what decisions were made at meetings, GO workshops and tutorials, and maybe even the new-term-of-the-month or GO tip-of-the-day. Determine the proper target group for the newsletter. The first newsletter should go out before the switch to obo 1.2. [Eurie Hong, Jane Lomax, John Day-Richter, GO PIs, and other to-be-determined volunteers]

Production Priorities
Mike Cherry, SGD

The production priorities for the GO include:

1. Genome protein sets (gp2protein)
2. User support
3. Production systems change
4. Database changes
5. On-the-fly species annotations

Genome protein sets

Genome protein sets (gp2protein files) should include all proteins in the organism, not just the proteins that are annotated. We also want to put together a fasta file from each organism, which is important to have as the input file for InParanoid and other analyses. What would the defline format be? The standard fasta format. InParanoid actually wants a one-to-one mapping file of gene to protein, but for the reference genomes we may need to have all ID mappings between genes and protein products, as well as other ID mappings such as Uniprot, IPI, CCD, MOD IDs, GI, protein_id, ref_seq, etc. The protein set should include predicted proteins, but we may want to make a separate file for ncRNAs. How often should these files be updated? Updates could coincide with each MOD's release cycle.

ACTION ITEM #37: Mike will send an email out to let groups know what the genome protein set fasta file format should be, specifically, what IDs should be included.
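Since InParanoid is described as wanting a one-to-one gene-to-protein mapping, a group could check its mapping file before submission. The sketch below assumes a simple layout that is NOT specified in these minutes: one gene per line, a tab, then one or more protein IDs separated by ';', with comment lines starting with '!'. The function names and the EX: identifiers in the usage note are likewise hypothetical.

```python
from collections import defaultdict


def parse_gp2protein(text):
    """Parse an assumed 'gene<TAB>protein[;protein...]' mapping into a dict."""
    mapping = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("!"):   # skip blank and comment lines
            continue
        gene_id, _, protein_ids = line.partition("\t")
        mapping[gene_id].extend(p for p in protein_ids.split(";") if p)
    return dict(mapping)


def not_one_to_one(mapping):
    """Return the gene IDs that map to zero or to several proteins."""
    return sorted(gene for gene, proteins in mapping.items()
                  if len(proteins) != 1)
```

Given the text `"EX:g1\tEX:p1\nEX:g2\tEX:p2;EX:p3\n"`, `not_one_to_one(parse_gp2protein(...))` reports only `EX:g2`, which would need to be resolved (or handled in a separate all-mappings file) before running a one-to-one analysis.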
Mike has also been tracking experimental evidence codes (IMP, IDA, IPI, IGI, and IEP) for each organism, including information about the percentage of genes from each organism that have been annotated to an experimental code. One issue is that we need to figure out a way to report how many genes are actually in each organism.

ACTION ITEM #38: Each MOD should supply the current number of genes for their organism. Rex will then determine exactly what should be included in the gene-association file, i.e., the total number or the number of genes split out based upon protein coding genes and ncRNA genes.

User Support

In the past, user support has included email lists for users to report problems. Do we need to make changes to where these emails go? Do we want specialized lists or some more generic lists? Currently there are the following e-mail lists: GO, GO-database, GO-webmaster, GO friends, GO-In, GO-Top. For users, it is probably easier to have one email address, but we would then need to have a system for replying to the emails, dealing with the problem, and tracking the resolution. In essence, a more formalized system for user support. As always, if a question comes into the go friends email list, then the first person to whom it's relevant should still answer the email.

ACTION ITEM #39: Set up a formalized system for coordinating user help. This will include rotating responsibility for reading the emails and answering or forwarding them to the appropriate person or group. To facilitate dealing with issues raised in the emails, each group should send a list of contacts to the GO Editorial Office. Information about the new user support system will be added to the newsletter. [Mike, Eurie, GO Editorial Office]

Production Systems Changes

The GODB and AmiGO currently run off of SGD servers. GO will be moving to a cluster environment, where there will be multiple database servers and GO HTML servers.
Last year, we started building the GO Lite (minus IEA) database three times a week. This means that AmiGO is always using 2-4-day-old data. More cluster nodes are now on order which will allow for daily updates. Including IEAs, however, would increase the time required to build the database. We will now have CVS available through the GO web site which will allow users to get back to an old version of GO on the web.

Other potential production changes include: building an updating script so that the database would not always have to be rebuilt from scratch every time and switching the GO database schema to Chado, a GMOD product. AmiGO does not work off of Chado yet, but Chris has written a schema to simulate the GO schema over Chado. Letters of support needed. We would also like to consider adding history tracking of GO terms and IDs. And, lastly, we need to consider what types of files need to be archived.

Survey of GO File Downloads (from 44 replies)

The most popular download is the MySQL database. This is followed by the GO flat file (all users currently sitting in this room - no non-members said that they're using the flat file). If we switch to a different database schema, we will need to make sure that we bring the MySQL users along.

On-the-Fly Species Gene Annotation

This is mainly useful for multi-species gene association data sets, such as those that come from GOA or Uniprot. Daniel and Evelyn commented that this is trivial to maintain and that GOA already has a web interface for that, since people like to download specific data sets. Rolf Apweiler is keen to allow people to pick and choose which data set they want to see. An on-the-fly viewer could be done at Stanford.

Outstanding Issues

Content issue regarding protein complexes (Midori). Names and synonyms (Jen). IPI chain of inferences (David). This will also be a good issue for annotation camp. Consortium members are urged to think about annotation issues for the upcoming annotation camp in July.
Final Comments
Judy Blake, MGI

The GO grant has been submitted for renewed funding, but we will be adding supplementary material in May. Any short-term items that may have user impact could be added to the supplementary material. April 17th is the date set for receiving the first reports from the WORKING GROUPS. Judging from the size of this meeting's agenda, it would be a good idea to have another meeting in about six months, instead of waiting a whole year. The next GO meeting will thus likely be in November 2006 in either Hinxton or Marseille.

Action Items

ACTION ITEM #1: Very seriously consider removing the word 'activity' from the molecular function terms and consider renaming the molecular function ontology.

(PRE)ACTION ITEM #2: Need to work out the balance of power/responsibility between the GO office and annotator/ontology developers to complete SourceForge items.

ACTION ITEM #3: Begin to coordinate processes for reference genomes to start setting priorities and tracking progress. Acquire and distribute lists of genes for curation focus and set-up fortnightly discussions. [Point Person: Rex Chisholm]

ACTION ITEM #4: Any other ideas for shared curation software, please forward to Chris Mungall.

ACTION ITEM #5: Consider what, if any, are the repercussions of renaming the cellular component to cell level entity?

ACTION ITEM #6: Change new terms ending in '...component' to now end in '...part.' [Jane Lomax]

ACTION ITEM #7: Jane will send an is_a complete cellular component ontology to the GO list.

ACTION ITEM #8: Take the new molecular/cellular/multicellular arrangement of the biological process ontology, try it out, and see how it works. The WORKING GROUP for this will include: Chris, Jane, David, Alex, Michelle, Rex, and Val. Jane will make a .obo file so that people can look at this. The group will also need to create a document outlining their philosophical approach and the results. Should there be a development site to help with working on this?
ACTION ITEM #9: Make actin polymerization a function term.

ACTION ITEM #10: Amelia, Chris, and David (and Barry, if he's available) should get together and come up with a single proposal for making connections between function and process. This will likely include writing new definitions for function and process.

ACTION ITEM #11: Add detailed documentation on dual-taxon annotation, announce this to the annotators' mailing list, include the info in annotation camp discussions, and work on developing the ontology. [Candace Collmer, Jane Lomax, Amelia Ireland, others?]

ACTION ITEM #12: Candace and Trudy will send examples and relevant references to the annotation list as they come up, so that we can consider these on a case-by-case basis. [Candace Collmer, Trudy Torto-Alallibo]

ACTION ITEM #13: Develop a new policy for communicating about term obsoletions. The person proposing obsoletion should get in touch with the annotating groups (using contact information from the gene association file) informing them that the term is under review while also soliciting input and suggestions on what changes to make. This could be scripted.

ACTION ITEM #14: Reinforce the policy that we will no longer obsolete a term just because the definition has changed or because annotations are thought to be bad or incorrect.

ACTION ITEM #15: Explore the proper technical solution for establishing a mechanism to notify GO users when a term has changed, or rather, when we are thinking of changing a term.

ACTION ITEM #16: Revisit obsolete terms to see which can be merged with current terms. Decide if the IDs of obsoletes could be made secondary IDs to the currently existing term.

ACTION ITEM #17: John Day-Richter will talk to the GO Editorial Office to find the best way to implement these changes and suggestions in OBO-Edit.

ACTION ITEM #18: Add a term creation date to the .obo file.
ACTION ITEM #19: TAS is no longer considered a useful evidence code and will not be used in any consistency measures of reference genome annotation. Since part of the idea of the reference genomes is to provide a source of IEA annotations for other groups, we strongly encourage reference genome annotators to not use TAS, and instead use experimental evidence codes whenever possible. The GO documentation should also state this in a clear fashion.

ACTION ITEM #20: Add to the documentation that it is okay to use experimental evidence codes for identical/similar gene products from different strains of the same species.

ACTION ITEM #21: Individual groups can collect this [acted_upon] data knowing that, in the future, GO will present this type of relation. Chris, John, Sue, Candace, and Michelle will form a WORKING GROUP to come up with a proposal for how to implement this. Note that this type of annotation will not be a core requirement, but that GO will facilitate its display if groups want to do this.

ACTION ITEM #22: Create a new evidence code IGC, Inferred from Genomic Context. The precise definition of this code and procedures for annotation (what to put in the WITH column) will be hashed out and added to the documentation. [Michelle Gwinn and Matt Berriman]

ACTION ITEM #23: Update the documentation on using the ISS evidence code to emphasize that annotators need to enter something in the WITH field. In the case of gene products, there must be an experimental evidence code for that gene product which supports the annotation, i.e., we don't want to have circular ISS annotations. Uniprot IDs, ref_seq IDs, or individual MOD gene IDs would be okay to use in the WITH column. Old ISS annotations that don't have an entry in the WITH column will not need to be retrofitted immediately.

ACTION ITEM #24: GO will disallow WITH column entries for IDA annotations.
ACTION ITEM #25: Document that WITH column entries are essential for all match-based methods of annotation and that a valid database ID is required for IEA WITH entries. The WITH column won't be mandatory for tools that just predict GO annotations, as the reference entered will describe the tool/algorithm used.

ACTION ITEM #26: Add more examples of how the RCA evidence code can and should be used for GO annotation based on published literature to date.

ACTION ITEM #27: No conclusion about how to distinguish large- vs small-scale experiments was reached. People are encouraged to keep thinking about this issue which clearly needs more discussion.

ACTION ITEM #28: The AMIGO WORKING GROUP will implement a strategy to incorporate and display the contents of the GO references.

ACTION ITEM #29: Existing GO references will be examined to check for and eliminate redundancy. [Midori Harris and Karen Christie]

ACTION ITEM #30: Chris Mungall and Jen Clark will discuss the different aspects of changes to our use of sensu, write documentation on this, and implement the new strategy. This change will then be announced to the community.

ACTION ITEM #31: Chris and John will develop a plan for implementing cross-products between GO terms and the cell-type ontology. Part of this plan will involve splitting the Gene Ontology .obo file into an edit version that would be filtered into a gene ontology .obo file still using obo-edit 1.0. Also, will need to consider what to do with explicitly stated relationships (relevant to earlier discussion of acted_upon) and work out the specifics of what should happen if curators need to provide feedback on relationships within the cell ontology (e.g., contact Oliver Hoffman). In parallel, we should also come up with a plan for AmiGO development and user education.
ACTION ITEM #32: To alleviate technical problems with remote tutorials, which are likely more cost-effective than flying everyone to a particular place for training, GO will investigate retaining the services of a company for hosting future webinars. John will investigate various options available to us.

ACTION ITEM #33: The OBO-Edit User's Guide, which is available in OBO-Edit in the docs directory, will also be made available on the GO website. John will talk to Mike Cherry about how to get this done.

ACTION ITEM #34: The OBO-Edit working group should come up with a time line for release of the official version.

ACTION ITEM #35: The monthly archive of the GO will also include Release Notes. These notes would include the output of the monthly report script, including the relevant SourceForge IDs, as well as human-readable text that summarizes significant additions or changes to the GO file. [GO Editorial Office]

ACTION ITEM #36: Form a NEWSLETTER WORKING GROUP to develop a GO newsletter that will provide a vehicle for making announcements about a number of GO-related issues, such as major changes to GO, our meeting schedule, what decisions were made at meetings, GO workshops and tutorials, and maybe even the new-term-of-the-month or GO tip-of-the-day. Determine the proper target group for the newsletter. The first newsletter should go out before the switch to obo 1.2. [Eurie Hong, Jane Lomax, John Day-Richter, GO PIs, and other to-be-determined volunteers]

ACTION ITEM #37: Mike will send an email out to let groups know what the genome protein set fasta file format should be, specifically, what IDs should be included.

ACTION ITEM #38: Each MOD should supply the current number of genes for their organism. Rex will then determine exactly what should be included in the gene-association file, i.e., the total number or the number of genes split out based upon protein coding genes and ncRNA genes.

ACTION ITEM #39: Set up a formalized system for coordinating user help. This will include rotating responsibility for reading the emails and answering or forwarding them to the appropriate person or group. To facilitate dealing with issues raised in the emails, each group should send a list of contacts to the GO Editorial Office. Information about the new user support system will be added to the newsletter. [Mike, Eurie, GO Editorial Office]
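
Illustrative note (not part of the meeting record): the action items above on WITH column entries for ISS and IDA annotations describe rules that could be checked by script. The following is a minimal Python sketch of such a check, not an existing GO tool. It assumes the tab-delimited gene association file layout in which the second column is the DB object ID, the seventh is the evidence code, and the eighth is the WITH column, with comment lines starting with "!". The function name check_with_column and all sample IDs are made-up placeholders. The IEA rule is not covered, because the file alone does not say whether an IEA annotation came from a match-based method or from a prediction tool.

```python
# 0-based column positions, assuming the tab-delimited gene association
# file layout: DB object ID, evidence code, and WITH (or From) column.
OBJECT_ID_COL, EVIDENCE_COL, WITH_COL = 1, 6, 7


def check_with_column(lines):
    """Return (line_number, object_id, message) for each rule violation.

    Rules sketched from the action items:
      - an ISS annotation needs something in the WITH column
      - an IDA annotation must not have a WITH entry
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        # Skip blank lines and "!" comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= WITH_COL:
            problems.append((number, "", "too few columns"))
            continue
        object_id = fields[OBJECT_ID_COL]
        evidence = fields[EVIDENCE_COL]
        with_value = fields[WITH_COL].strip()
        if evidence == "ISS" and not with_value:
            problems.append((number, object_id, "ISS annotation has empty WITH column"))
        elif evidence == "IDA" and with_value:
            problems.append((number, object_id, "IDA annotation has a WITH entry"))
    return problems


def _row(object_id, evidence, with_value):
    # Build one hypothetical annotation row; every ID here is a placeholder.
    return "\t".join(
        ["XDB", object_id, "sym", "", "GO:0000000", "XDB_REF:0000000",
         evidence, with_value, "P"]
    )


sample = [
    "!hypothetical example file",
    _row("X0001", "ISS", ""),             # violation: ISS without WITH
    _row("X0002", "ISS", "XDB:X0009"),    # fine
    _row("X0003", "IDA", "XDB:X0009"),    # violation: IDA with WITH
    _row("X0004", "IDA", ""),             # fine
]

for number, object_id, message in check_with_column(sample):
    print(f"line {number}: {object_id}: {message}")
```

Since old ISS annotations without a WITH entry are not to be retrofitted immediately, a check like this would more sensibly produce a report for each annotating group than reject a file outright.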