GO Annotation Camp
Clark Center
Stanford University
Stanford, CA
June 1-4 2005

SUMMARY

The second GO Annotation Camp was held at Stanford University for three and a half days from June 1st through the 4th. The first GO Annotation Camp (held in Cambridge, England in June 2004) was a relatively small event (20 people) attended exclusively by people from groups that were already members of the GO Consortium (GOC) and were already familiar with using GO to annotate genes. In contrast, this second Annotation Camp was larger (55 people) and had a large proportion of attendees coming from groups that are not currently GOC members. While 19 attendees were representatives of GOC groups, there were 36 people (65% of attendees) coming from non-GOC groups.

The 19 GOC attendees were either curators or PIs of established GO groups, most of which were Model Organism Databases (MODs). The background of the non-GOC attendees, with respect to GO, varied widely. Some attendees were already quite familiar with GO and had chosen to come in order to learn how to get started using GO as an annotation system. Other attendees were less familiar with GO and had come to evaluate whether GO would meet their needs. With the varied background of the attendees at this second Annotation Camp, we decided to focus on the basics to introduce the GO system and how to use it for genome annotation.

Each of the four days began with Introductions. At some point during the four days, everyone stood up and took a few minutes to introduce themselves, their background, and their current interests to the group. It was really interesting to hear what each person was doing and served to help people connect with others having interests in common.

The remainder of the first day consisted largely of presentations. In the morning, there were two presentations, each given by a panel of experienced GO curators.
The first presentation (given by panel 1: Karen Christie, Ceri Van Slyke, Petra Fey, Harold Drabkin, Rebecca Foulger, Rachael Huntley, and Rama Balakrishnan) covered the basics of the Gene Ontologies themselves and the evidence codes, and provided a quick overview of a number of tools useful for viewing the ontologies. The second presentation (given by panel 2: Eurie Hong, Tanya Berardini, Karen Pilcher, Doug Howe, Michelle Gwinn, Russell Collins, Stacia Engel, and Carol Bastiani) described what constitutes a GO annotation and presented specific examples of making GO annotations.

In the afternoon, there were two presentations (by Rebecca Foulger, formerly of FlyBase and now of UniProt, and Karen Christie, of Saccharomyces Genome Database), each going through a single paper in detail to show what annotations could be drawn from the paper and what evidence codes were appropriate for each annotation. Ceri Van Slyke, of ZFIN, also presented a specific example from a third paper. The day concluded with a time for group discussion and questions that had not yet been addressed.

On the remaining three days, there was one presentation after the Introductions. On the second day, Michelle Gwinn-Giglio of The Institute for Genomic Research (TIGR) spoke about the prokaryotic genome annotation pipeline at TIGR, which is based largely on sequence-similarity-based comparison methods. This was of particular interest to many of the attendees as they will often be in a similar situation of trying to annotate a genome for which there has been relatively little experimental characterization of the genes. On the morning of the third day, Jane Lomax, of the GO Editorial Office, spoke about the activities of that office in dealing with requests to add or make changes to GO terms. This was also very useful as a number of people had indicated that they wanted to know how to submit a new term.
As many of these new groups will be annotating organisms not previously considered by the GOC, it is likely that there will need to be changes in existing GO terms as well as new terms to accommodate the biology of these organisms. On the fourth day, Mike Cherry spoke about creating and maintaining gene_association files and Jane Lomax spoke about the other ontologies available from the Open Biomedical Ontologies umbrella, both of which are practical topics for any group that will be using GO and other controlled vocabularies to annotate a genome. The presentations given during the GO Annotation Camp can be found here: ftp://ftp.geneontology.org/pub/go/teaching_resources/presentations/ One of the main activities of the Annotation Camp was the Small Group Working sessions. In preparation for the Camp, all attendees were asked to submit two papers that they would be interested in talking about at the Annotation Camp. They were instructed to "choose a paper that presents a tricky issue that you want to talk about or a paper that you feel is a good sample paper from your research community." They were also asked to send some information about each paper, including a 1-2 sentence synopsis of the key issue of each paper, with respect to why they chose it for the annotation exercises. The papers submitted by attendees were used to select papers to read for Small Group Working sessions. In the course of three such sessions, each person read four papers. The group was divided into twelve small groups, of 3-5 people, where at least one person was a curator from a GOC group. Each group read a paper and discussed what annotations could be made from the paper. Having at least one GOC member per group gave all the people new to GO access to someone experienced while they got some practice in making GO annotations. 
This exercise worked well when the papers had been selected by GO Consortium members, with one or more specific aspects (Function, Process, or Component) or Evidence Codes in mind. It was more variable when done with papers selected by non-GOC attendees; some papers were not suitable for GO annotations at all. As a consistency exercise, two different groups read each paper, with the results recorded in a table. At the end of the day, for both day two and day three, there was a large group discussion. This discussion worked best when both of the groups that had read the same paper got together in advance and compared their choices. Next time, it would probably be better to select a smaller set of papers from the list of sample annotation papers being developed by the GOC. Although the exercise could be refined, the majority found the Small Group Working sessions very useful, as evidenced both by verbal comments and by the responses to our survey.

Overall the GO annotation camp went very well. Funding from the Stanford Genetics Department and an anonymous gift allowed a number of people to attend who otherwise would have been unable to come.

PARTICIPANT LIST

Consortium Members

Petra Fey               DictyBase
Karen Pilcher           DictyBase
Russell Collins         FlyBase
Jane Lomax              GO Editorial Office
Judith Blake            MGI
Harold J Drabkin        MGI
Jennifer R. Smith       RGD
Rama Balakrishnan       SGD
Mike Cherry             SGD
Karen Christie          SGD
Stacia Engel            SGD
Eurie Hong              SGD
Tanya Berardini         TAIR
Rachael Huntley         TAIR
Michelle Gwinn-Giglio   TIGR
Rebecca Foulger         UniProt
Carol Bastiani          WormBase
Doug Howe               ZFIN
Ceri E. Van Slyke       ZFIN

Other Participants

Edward Braun            University of Florida
Vasily Cherepanov       University of Victoria (Canada)
Candace Collmer         Wells College
Colleen Crangle         ConverSpeech LLC
Peter D'Eustachio       NYU School of Medicine
Christine Elsik         Texas A&M University
Kamal Gajendran         Nat. Center for Genome Resources
Jeremy D. Glasner       University of Wisconsin
Mark Hance              Virginia Bioinformatics Institute
Mark Heiges             University of Georgia
Heather Hood            Oregon Health & Science Univ.
Trupti Joshi            Univ. of Missouri-Columbia
Joseph Karalius         Seminis Vegetable Seeds
Varsha Khodiyar         University College London
Cindy Krieger           Tetrahymena Genome Database
Kitsos Louis            Inst. of Mol. Biol. & Biotech. (Greece)
Xinghua Lu              Medical Univ. South Carolina
John MacMullen          UNC Chapel Hill
Chunhong Mao            Virginia Tech
Emma Master             Concordia University
Paul Morris             Bowling Green State University
Lukas Mueller           Cornell
Nicole Perna            University of Wisconsin
Anjan Purkayastha       Virginia Bioinformatics Institute
Shingo Sakaniwa         National Inst. of Genetics (Japan)
Jun Sese                University of Tokyo
Jaswinder Singh         UC Berkeley
Nicholas A. Stover      Tetrahymena Genome Database
Tsuyoshi Tanaka         Natl. Inst. Agrobiol. Sci. (Japan)
Pantelis Topalis        Inst. of Mol. Biol. & Biotech. (Greece)
Sucheta Tripathy        Virginia Polytechnic and State
Babu Valliyodan         University of Missouri
Chisato Yamasaki        BIRC, AIST
Gongxin Yu              Virginia Tech
Jim Zheng               Medical Univ. South Carolina

COLLECTED ACTION ITEMS

Below are the collected action items, and one potential additional item for discussion by the GO Consortium as a whole. Each action item also appears in context after the question and discussion that spurred it.

ACTION ITEM 1: We need to expand the documentation for IMP to be more explicit about the fact that IMP is appropriate for ALL cases involving changes or variation in a single gene, not just for cases where a given allele is designated "wild type" and others are designated "mutant".

ACTION ITEM 2: Question for the GO list: Should a difference in philosophy regarding use of TAS be included in each group's README file?

ACTION ITEM 3: Revisit the guidelines for TAS? The discussion at the recent Annotation Camp highlighted that it doesn't really seem reasonable to use TAS for common knowledge annotations.
The newer code IC seems better suited to represent this type of annotation, especially since often there is not a traceable statement in the reference used, e.g. Stryer, that can actually be traced back to original data. Use of TAS in this situation seems to be a holdover from the idea that the evidence codes can provide an indication of the reliability of the annotation and the desire to use a "better" evidence code. However, we have debunked this idea every time we have discussed it at a Consortium meeting, so perhaps it is time to revisit whether the evidence code TAS should be allowed for a common knowledge statement, even if it is not traceable in the reference used.

ACTION ITEM 4: Confirm/clarify with GO Consortium as a whole when it is appropriate to use an experimental evidence code for an experiment not actually shown in the paper.

ACTION ITEM 5: Ask the GO Consortium whether genomic structure and synteny can be included as acceptable evidence to include in the ISS evidence code.

ACTION ITEM 6: Could the people representing new groups here please let us know the status of your projects, and whether there is any assistance from the GO Consortium that would be useful to you. Please send this info to Jennifer Clark (jenclark@ebi.ac.uk).

POSSIBLE ACTION ITEM: Perhaps the GO group should discuss whether the question below should have a definitive answer. If diversity of practice is acceptable, then the specific practice should be described in each group's README file.

QUESTION: If a paper with a sequence comparison that you wish to use for annotation, for example paper 13 (Keeling PJ and Palmer JD), does not provide the sequence accession number, should you go look it up?
DISCUSSION: Different groups had different practices. Some felt that they must have an accession # in order to make an annotation, and they would try to identify the appropriate accession #.
Other groups felt that it was asking for trouble to make assumptions about which accession # corresponded to the sequence the authors had used, and that if it wasn't listed in the paper, it wasn't appropriate to fill one in.

TOPICS DISCUSSED

There were numerous times for group discussion during the Annotation Camp. Presentations were informal and often stimulated discussion. In addition, at the end of each day there was a scheduled 90-minute time slot for group discussion of the issues that had arisen during the day. The following is a summary of the various questions asked and resulting discussions that took place over the course of the three-and-a-half-day camp, organized in a Question and Discussion format, where questions about similar topics have been grouped together regardless of the actual order of questions and discussion. There are a handful of action items relating to topics that should be referred to the GO list for discussion and a decision.

EVIDENCE CODES

Cross species expression

IDA versus IGI

- QUESTION: What evidence code should be used if the researchers express a protein from one organism in a system based on a different organism?
DISCUSSION: For a cross species expression experiment, e.g. human RAS protein complementing a RAS mutation in yeast, the appropriate evidence code is IGI. This is considered to be a genetic interaction because two different genes are interacting, the yeast RAS gene (mutated in this case) and the human RAS gene (functional in this case). This particular type of experiment is not considered to be a direct assay (IDA) because it is an in vivo genetic complementation assay. The expressed protein is not directly shown to have a specific activity, only to be able to compensate for the lack of activity in the mutant.
NOTE: In discussion afterwards, the case of transfections and assaying a resulting extract was brought up. This was not discussed at the Annotation Camp.

IGI versus IMP

- QUESTION: What evidence code should be used if the paper shows rescue of a deletion in one organism with the wildtype gene from another organism?
DISCUSSION: As decided previously, cross species complementation goes under IGI. Situations involving a mutant or allele of a single gene are considered IMP.

IDA

- QUESTION: How does one annotate recombinant proteins?
DISCUSSION: When a recombinant protein is purified and assayed in an in vitro system, this is typically classified as IDA.
NOTE: It was brought up in discussion afterwards that there are non-in vitro ways to assay recombinant proteins; such experiments would not necessarily be considered to be IDA.

IMP

- QUESTION: What kinds of annotations can be made from phenotypes?
DISCUSSION: GO tries to describe the normal processes and functions, not the phenotypes of mutants. However, analysis of mutant phenotypes often allows reasonable suppositions about the normal function of a given gene product. Typically, phenotypes of mutants (or allelic variants) allow annotations of process, if not of function.

- QUESTION: Is it reasonable to make an annotation using a gain of function mutation? Perhaps the only mutation available is a gain of function allele that is known to be constitutively active. Can a GO annotation be made from such a mutation?
DISCUSSION: While Russ felt that the example that initiated the discussion was not appropriate for annotation, others thought such an allele could be useful for annotation, particularly if nothing else is available. The basic conclusion was that it depends upon the situation. The basic goal of GO is to provide some indication as to the normal process, function, and location of the gene product. If a curator judges that a gain of function mutation provides insight as to the normal role of the gene product, then it is fine to make an annotation based on it, with the IMP evidence code.
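
The guidance discussed so far (cross species complementation, mutants or alleles of a single gene, and in vitro assays of purified recombinant protein) can be summarized as a small decision helper. This is only a sketch: the `Experiment` class, its field names, and `suggest_evidence_code` are hypothetical names introduced for illustration and are not part of any GO tool, and real cases still depend on curator judgment.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical summary of an experiment reported in a paper."""
    # A gene from one organism rescues a mutant in another organism,
    # e.g. human RAS complementing a RAS mutation in yeast.
    cross_species_complementation: bool = False
    # A mutant or allele of a single gene is being studied.
    single_gene_variant: bool = False
    # A recombinant protein was purified and assayed in vitro.
    purified_in_vitro_assay: bool = False


def suggest_evidence_code(exp: Experiment) -> str:
    """Return the evidence code suggested by the camp discussions."""
    # Cross species complementation is checked first: it is treated as a
    # genetic interaction even though a mutant is also involved.
    if exp.cross_species_complementation:
        return "IGI"
    if exp.single_gene_variant:
        return "IMP"
    if exp.purified_in_vitro_assay:
        return "IDA"
    # Anything else (e.g. transfection followed by an extract assay) was
    # not settled at the camp.
    return "curator judgment needed"


rescue = Experiment(cross_species_complementation=True, single_gene_variant=True)
print(suggest_evidence_code(rescue))
```

Checking the cross species case before the single-gene case mirrors the discussion of rescuing a deletion with a wildtype gene from another organism, which is annotated with IGI rather than IMP.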

- QUESTION: How does one deal with polymorphism or allelic variation at a given locus?
DISCUSSION: This is within the domain of IMP. Though such variants are not considered "mutants" by those studying these situations, GO places all situations where one is looking at a change in a single gene, whether it's considered to be a "mutant" versus "wild type" situation or "normal" allelic variation within a population, under the evidence code IMP. Remember that the designation of a certain allele as "mutant" versus "wild type" is often context dependent anyway.

ACTION ITEM 1: We need to expand the documentation for IMP to be more explicit about the fact that IMP is appropriate for ALL cases involving changes or variation in a single gene, not just for cases where a given allele is designated "wild type" and others are designated "mutant".

IEP

- QUESTION: When is it appropriate to use IEP?
DISCUSSION: Most groups feel that the IEP evidence code should be used very sparingly. However, occasionally, it is useful. For example, suppose a group performs microarray expression analysis and finds a cluster of genes with the same expression pattern, where nine of ten are known to be involved in ribosome biogenesis and one protein is uncharacterized. In a really tightly linked cluster like this, it may be reasonable to give the uncharacterized gene a process annotation to "ribosome biogenesis".

IMP versus IPI

- QUESTION: How would GO annotate two-hybrid and deletion mutant experiments that determine which portion of a protein interacts?
DISCUSSION: Use of the IPI evidence code will allow you to capture the other protein that interacts with the protein being annotated. Note though that GO does NOT capture the specifics of which domain is involved in the interaction. This is beyond the scope of GO.

ND

- QUESTION: Is there a place to indicate the date that an annotation using the ND (no data) evidence code was made?
DISCUSSION: Every annotation, regardless of evidence code, is supposed to be associated with a date indicating when it was made. The date column in the gene_association file is supposed to contain the date the annotation in that row was made, not the date the file was created or committed or any date that is general to the file. Each row should have its own date indicating when the association between the gene product and the term in that row was made. For annotations to the unknown terms using the ND code, the date is particularly useful because it records when the literature and/or sequence databases were checked. If there is a paper about the given gene that is more recent than that date, users and curators know that it may be possible to update the unknown annotation to something more specific.

ND versus TAS/NAS

- QUESTION: What evidence code should be used for annotations to the unknown terms?
DISCUSSION: There are a couple of different situations for making annotations to the unknown terms that are handled with different evidence codes. For example, if an author says that nothing is known about the function of a given gene product, it is permissible to cite their paper for the annotation and use either TAS or NAS as appropriate. However, if a curator searches the literature and possibly also does sequence analysis and there is "no information" available for a given gene, then there is no research paper to cite. In this case, some groups, those that are able to create internal references, cite their own internal reference regarding the use of the ND evidence code. The GO Consortium has also created a generic reference for use of the ND code so that groups that do not have the ability to create internal references may cite the GO reference. Exactly what constitutes "no information" depends slightly on the group. Some groups, like SGD, look only for available literature and do not do any sequence analysis.
Other groups, like MGI, look at both literature and internally performed sequence analysis.

TAS

- QUESTION: When is it appropriate to use the TAS evidence code, or is it better to follow the trail back to one of the cited papers?
DISCUSSION: The appropriate course depends on the specific situation. Note that not all references cited in a review are relevant to the organism you are trying to annotate. Sometimes the reference is about a gene from a different organism that is similar to the gene in the organism you are trying to annotate. Thus, it is a good idea to make sure that the reference cited in a review is for the organism you are annotating. There is also some variability between groups as to whether the philosophy is to annotate from a review using TAS, or to use that review to trace back to the original research papers and annotate from those papers with experimental evidence codes. This difference in philosophy is acceptable, though perhaps it should be documented in each group's README file.

- QUESTION: Is it permissible to use information from the introduction of a research paper to annotate a gene?
DISCUSSION: Different groups have different philosophies on this question. Some groups, including SGD, allow curators to make an annotation from the introduction with TAS, especially if this TAS annotation may help curators use the associated reference at a later date to trace the statement back to a paper with original data. Other groups, including MGI, do not allow this, but instead require the curator to trace the statement back to the original source and make the annotation from the original reference, if appropriate. RGD curators are also required to trace the statement back to the cited sources before making any annotation.
They have noticed that sometimes the cited references do not actually deal with the organism of interest, but with some other organism; thus the cited papers would not actually allow any annotations to be made for the organism of interest.

ACTION ITEM 2: Question for the GO list: Should a difference in philosophy regarding use of TAS be included in each group's README file?

TAS versus IC

- QUESTION: What evidence code is appropriate to use for statements of "common knowledge"?
DISCUSSION: The current documentation states that TAS may be used as the evidence code for statements of common knowledge. For example, let's say you have a paper that says that Protein X is an xxxxx, with a direct assay for activity, so you can use IDA for this function term. Then it also makes a mutation in the gene for Protein X and shows that it is involved in process yyyy, so you can use IMP for the process term. But the paper does not have any direct evidence about the localization of Protein X. However, everyone knows that process yyyy occurs in the cytoplasm, so you can annotate Protein X to the component term "cytoplasm ; GO:0005737" by TAS using a general reference like Biochemistry by Lubert Stryer.

TAIR stores each chapter of a book as a separate reference. This will help in situations where the book is felt to be an appropriate reference for the annotation.

However, Karen commented that SGD no longer allows use of TAS and the general textbook by Stryer for the situation represented by the component annotation in the example above, as there is not really a traceable statement in Stryer providing evidence that process yyyy occurs in this location in yeast. SGD feels that it is better to use the newer evidence code IC for these "common knowledge" types of annotations.
Thus, if an SGD curator felt that it was reasonable to make the annotation "cytoplasm" based on the knowledge that Protein X has the process annotation yyyy, then the curator could assign the component term "cytoplasm ; GO:0005737" using IC and the GO ID of the process term yyyy. This seems a more accurate representation of the annotation: the curator used common knowledge to assign the component term, based on the process term.

Peter D'Eustachio further commented that many of these "common knowledge" types of statements are often not well based in actual experiments conducted on the organism of interest. Early biochemists would often perform experiments with materials that were easy to obtain, e.g. calf thymus, and assume that this accurately represented the situation for another organism, e.g. human. This may or may not be the case.

ACTION ITEM 3: Revisit the guidelines for TAS? The discussion at the recent Annotation Camp highlighted that it doesn't really seem reasonable to use TAS for common knowledge annotations. The newer code IC seems better suited to represent this type of annotation, especially since often there is not a traceable statement in the reference used, e.g. Stryer, that can actually be traced back to original data. Use of TAS in this situation seems to stem from the idea that the evidence codes can provide an indication of the reliability of the annotation and the desire to use a "better" evidence code. However, we have debunked this idea every time we have discussed it at a Consortium meeting, so perhaps it is time to revisit whether the evidence code TAS should be allowed for a common knowledge statement, even if it is not traceable in the reference used.

IC

- QUESTION: When is it appropriate to use IC? For example, transcription factors may also be found in the cytoplasm, but would you make an annotation like this by IC?
DISCUSSION: The IC evidence code is basically meant to be used in those cases where you cannot find any references from which to make an annotation in one (or possibly two) of the ontologies, but the curator feels that a reasonable annotation can be made in the remaining ontology based on the GO annotations that have already been made for the gene in the other ontologies.

One of our classic examples for IC is that of a transcription factor. For example, let's say a curator has annotated a gene product to the function term "transcription factor activity ; GO:0003700" and to the process term "regulation of transcription from RNA polymerase II promoter ; GO:0006357", but cannot find any data that discusses the location of the gene product. For an annotator of a eukaryotic organism, with the knowledge that it is a transcription factor, it is reasonable to assume that it must be in the nucleus at least some of the time in order to act as a transcription factor. The curator may therefore make the component annotation "nucleus ; GO:0005634" with the evidence code IC from GO:0003700 and GO:0006357, rather than leaving this gene with no component annotation or assigning the term "cellular component unknown ; GO:0008372".

It is true that some transcription factors are regulated at the level of localization and are thus found in the cytoplasm until they are activated. If there is experimental evidence showing that a transcription factor is found in the cytoplasm, then you can give this annotation the appropriate evidence code. However, it would not be reasonable to assign the component term "cytoplasm ; GO:0005737" to a transcription factor by IC, because it is not reasonable to assume that all transcription factors will be localized to the cytoplasm at some time.

Note that it is possible for the same gene product to be annotated to both "nucleus ; GO:0005634" and to "cytoplasm ; GO:0005737" as GO allows all appropriate annotations to be made.
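
The transcription factor example above can be written out as a short sketch. The `Annotation` tuple and the `infer_nucleus_by_ic` helper are hypothetical names used only for illustration, and the IDA and IMP codes on the starting annotations are assumed for the sake of the example; only the GO IDs and the IC reasoning come from the discussion.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Annotation(NamedTuple):
    go_id: str
    aspect: str                      # "F", "P" or "C"
    evidence: str
    with_from: Tuple[str, ...] = ()  # GO IDs the inference is made from

TRANSCRIPTION_FACTOR_ACTIVITY = "GO:0003700"
NUCLEUS = "GO:0005634"


def infer_nucleus_by_ic(annotations: Sequence[Annotation],
                        eukaryote: bool = True) -> Optional[Annotation]:
    """Propose "nucleus" by IC for a transcription factor lacking a component."""
    if not eukaryote:
        return None
    # IC is only for the case where no component annotation could be made.
    if any(a.aspect == "C" for a in annotations):
        return None
    if TRANSCRIPTION_FACTOR_ACTIVITY not in [a.go_id for a in annotations]:
        return None
    # The supporting function and process terms are recorded with the IC code.
    support = tuple(a.go_id for a in annotations if a.aspect in ("F", "P"))
    return Annotation(NUCLEUS, "C", "IC", support)


existing = [
    Annotation("GO:0003700", "F", "IDA"),  # evidence code assumed
    Annotation("GO:0006357", "P", "IMP"),  # evidence code assumed
]
proposal = infer_nucleus_by_ic(existing)
print(proposal.go_id, proposal.evidence, "|".join(proposal.with_from))
# prints: GO:0005634 IC GO:0003700|GO:0006357
```

The helper deliberately never proposes "cytoplasm ; GO:0005737" by IC, in line with the discussion: that term needs experimental evidence for a transcription factor.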
Note also that we don't assign the term "cytoplasm ; GO:0005737" to every protein just because it was made in the cytoplasm. This isn't really useful. Typically, the term "cytoplasm ; GO:0005737" would be assigned if there is experimental evidence showing cytoplasmic localization, or it may be inferred by the curator if the cytoplasm is thought to be the normal localization of that type of gene product but no experiments have been performed with this specific gene product in this organism.

IC versus ISS

- QUESTION: In paper 13 (Keeling PJ and Palmer JD), the authors do a sequence comparison and say that this indicates that the gene product in question is an enolase, which corresponds to the function term "phosphopyruvate hydratase activity ; GO:0004634". Based on this information, what would you do for the process and component annotations? Would the appropriate evidence code be IC or ISS?
DISCUSSION: The appropriate course of action depends on the specific situation. For example, there are some InterPro domains, e.g. IPR000013 shown below, whose mappings in the interpro2go file include process and component, as well as function, terms. In a case like this, it would be appropriate to use ISS with IPR000013 for all three annotations, as these come directly from the interpro2go file. However, if the mapping file gave only a function annotation and the curator was the source of any further annotations, then IC should be used.

InterPro:IPR000013 Peptidase M7, snapalysin > GO:metallopeptidase activity ; GO:0008237
InterPro:IPR000013 Peptidase M7, snapalysin > GO:proteolysis and peptidolysis ; GO:0006508
InterPro:IPR000013 Peptidase M7, snapalysin > GO:extracellular region ; GO:0005576

NAS versus experimental evidence codes

- QUESTION: What evidence code do you use for an experiment referred to in the paper, but not directly shown?
DISCUSSION: At the Cambridge GO camp this was brought up, but different people seem to remember different outcomes for the discussion.
Becky and David (MGI) remember that it was OK to use experimental evidence codes for this, with the theory that the authors had done the work but most likely had to remove the figures from the paper due to space restrictions imposed by the journal. In the Cambridge meeting minutes, this specific discussion doesn't seem to be covered, but the 'Evidence Code Usage' section does say that NAS can be used for 'data not shown', and does not specify that an experimental evidence code may be used for an experiment not shown in the paper. At this Annotation Camp, the issue was brought up again because different groups were annotating 'data not shown' in different ways: some were using NAS, and others were using experimental codes. The consensus from the Stanford GO camp discussion was that it is okay to use an experimental evidence code if the authors, for example, show an assay for enzymes A, B and C and then say "and D, E and F have the same activity (data not shown)", so that the Materials & Methods section shows a protocol for that experiment. But if they mention an unrelated experiment with 'data not shown', then NAS should be used because you can't see any comparable experiment and it wouldn't always be clear which experimental code should be used.
NOTE: In subsequent discussion, it was apparent that this issue is not resolved. The Action Item below needs to be discussed at the next Consortium meeting.

ACTION ITEM 4: Confirm/clarify with GO Consortium as a whole when it is appropriate to use an experimental evidence code for an experiment not actually shown in the paper.

Multiple evidence codes for the same annotation

- QUESTION: Should I use both ISS and IDA as evidence for annotations to the same GO term?
DISCUSSION: Sure. Generally, the more different pieces of evidence you can use to support an annotation, especially when they are of different types, the more confidence you can place in the annotation.
There may be cases where you feel that keeping an ISS annotation, even when you have an IDA annotation to the same term, is appropriate.

Making annotations based on sequence comparisons

IEA and ISS

- QUESTION: When would one use ISS versus IEA for annotations made via sequence similarity comparisons?
DISCUSSION: To use ISS, a human must have examined and verified the annotation. For example, if you make GO annotations by doing an automated scan of InterPro hits, then you must use IEA because there has not been any human judgment involved in making the annotations.

ISS

- QUESTION: I am trying to annotate an organism where there are two different "species": A and B, both of which are parasites, but one interacts with plants and the other with animals. I am trying to annotate the sequence of species A; however, for some genes all of the data comes from species B. What annotations are reasonable to make for species A by sequence similarity comparison?
DISCUSSION: We often find that sequence similarity comparisons are better suited to making annotations in the Molecular Function ontology than in either the Biological Process or Cellular Component ontologies. For example, pectin metabolism is relevant to the species that interacts with plants, but not to the species that interacts with animals, so a process annotation to "pectin metabolism" just does not make sense for the latter. In a case like this, it is often reasonable to make function annotations by ISS, though it is often not reasonable to make any process annotations by ISS.

- QUESTION: Can one use orthology to make GO annotations?
DISCUSSION: One needs to be very careful about transferring GO annotations, particularly process, to orthologs. Even if orthologs in different species are still performing the same molecular activity (function), they might not be involved in the same processes. For example, Zebrafish has undergone a genome duplication.
When there is an ortholog, we often give both copies the same annotation with ISS evidence from a gene in a different species. At a later time, when there is experimental evidence in Zebrafish, one of the orthologs may receive a NOT qualifier to this annotation if it is shown to be involved in some other process but not the one predicted by the original sequence similarity comparison. - QUESTION: It seems that an annotation based on a simple pair-wise BLAST is not as good as an annotation based on meeting the threshold to match an HMM. How does GO indicate the quality of an ISS annotation? Would it be possible to have a system to differentiate the quality of the annotation? DISCUSSION: The GO Consortium is aware that there are many different levels of quality of evidence used for annotations made with the ISS (or any other) evidence code. We have discussed, on multiple occasions, the idea of having a more granular hierarchy of evidence codes. However, each time, we have come to the conclusion that the type of method does NOT automatically provide any information about the quality of the annotation. To judge the quality of the annotation, one really needs to look at the specific details of the method used. Both for this reason, and to maintain a simple and easily used system of evidence codes, the GOC has decided against using an expanded hierarchy of evidence codes. In the example asked about, a simple pair-wise BLAST versus a match to an HMM, one can look at the methods described in the references for the annotations and make a judgment. - QUESTION: Is it possible for GO to set standards for ISS annotations? DISCUSSION: It is not possible to establish a single cutoff score that is meaningful across-the-board. Michelle Gwinn will touch on this subject more thoroughly in her talk. What GO requires is for each ISS method to be described in an abstract that is used as the reference for all annotations made via that method.
In this way, users can judge for themselves what criteria were used. - QUESTION: Using sequence similarity methods, how do you decide which function is the most appropriate, which sub-function or which family? DISCUSSION: This question was deferred as Michelle Gwinn's talk about the prokaryotic annotation pipeline at TIGR will address this type of issue. - QUESTION: In paper 1 (VanWagoner et al), they discuss an operon/gene cluster. What evidence code is appropriate? Is ISS acceptable for this? DISCUSSION: We discussed whether genomic structure and synteny are acceptable evidence to include in the ISS evidence code. People seemed to feel that it was reasonable to include consideration of genome organization and synteny in the ISS evidence code. However, no one could remember this question ever being discussed at a Consortium meeting previously. ACTION ITEM 5: Ask the GO Consortium whether genomic structure and synteny can be included as acceptable evidence to include in the ISS evidence code. - QUESTION: Are you allowed to cite a paper for an annotation by ISS if they do not give an accession # for the protein, nucleic acid sequence, InterPro domain, etc. they used for the comparison? DISCUSSION: While it is best to have a with statement for an ISS annotation, it is not mandatory. It is up to each group to establish their own requirements here. SGD, which does not do any of its own sequence analysis, has relied exclusively upon the published literature for those ISS annotations that it has currently. If an accession number has been given, then we will put it in the with column. However, if no accession number is given, we may make the ISS annotation anyway, even though we cannot fill the with column. We do not try to track down a sequence identifier if none was given. If you did track one down, could you ever be certain that it was actually what the author used?
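Because the with column is recommended but not mandatory for ISS annotations, a group that wants to review its own practice might flag, rather than reject, ISS rows that lack a with value. The following is a minimal sketch only, assuming a tab-delimited gene_association file in which the object ID is column 2, the evidence code is column 7 and the with value is column 8; the function name and the example rows are hypothetical and do not come from any group's actual data or scripts.

```python
# Zero-based positions in a tab-delimited gene_association row.
# Column 2 = object ID, column 7 = evidence code, column 8 = with.
OBJECT_ID_COL = 1
EVIDENCE_COL = 6
WITH_COL = 7


def iss_rows_missing_with(lines):
    """Return object IDs of ISS annotations whose with column is empty.

    This only reports rows for curator review; it does not treat an
    empty with column as an error.
    """
    flagged = []
    for line in lines:
        # Skip blank lines and comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= WITH_COL:
            continue  # too short to inspect; left for other checks
        if cols[EVIDENCE_COL] == "ISS" and not cols[WITH_COL].strip():
            flagged.append(cols[OBJECT_ID_COL])
    return flagged


# Made-up rows for illustration only (placeholder IDs and references).
EXAMPLE_ROWS = [
    "!example comment line",
    "\t".join(["EX", "G0001", "geneA", "", "GO:0000000", "EX_REF:1", "ISS", "EX:P00001", "F"]),
    "\t".join(["EX", "G0002", "geneB", "", "GO:0000000", "EX_REF:2", "ISS", "", "F"]),
    "\t".join(["EX", "G0003", "geneC", "", "GO:0000000", "EX_REF:3", "IDA", "", "F"]),
]
```

Calling iss_rows_missing_with(EXAMPLE_ROWS) reports only the second example gene, since the first ISS row carries a with value and the third row is not ISS.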
- QUESTION: If a paper with a sequence comparison, for example paper 13 (Keeling PJ and Palmer JD), that you wish to use for annotation does not provide the sequence accession number, should you go look it up? DISCUSSION: Different groups had different practices. Some felt that they must have an accession # in order to make an annotation, and they would try to identify the appropriate accession #. Other groups felt that it was asking for trouble to make assumptions about what accession # corresponded to the sequence the authors had used, and that if it wasn't listed in the paper, it wasn't appropriate to fill one in. POSSIBLE ACTION ITEM: Perhaps the GO group should discuss whether the above question should have a definitive answer. If diversity of practice is acceptable, then the specific practice should be described in each group's README file. - QUESTION: Can you make an ISS annotation for the function of gene C by ISS with gene B, even if the only function annotation for Gene B was generated by ISS with Gene A? DISCUSSION: This is not recommended procedure. This type of transference of annotations, from an uncharacterized protein to another uncharacterized protein, has the potential to pollute the pool of annotations with information that is not well supported by either experimental data or by good sequence similarity to a characterized protein. Some groups, like MGI, do not allow this type of "ISS from an ISS" annotation to be made. However, MGI does do things to attempt to make it clear what is known about sequence similarity of its genes to other species and what information can be drawn from that knowledge. MGI has a "special reference", J:73065, that means "orthology established by MGI (either by MGI itself, or by literature referencing)".
Using this internally-generated reference as the reference for the GO annotation, if MGI has determined that a given mouse protein has orthologs in other closely related species, it can make GO annotations for the mouse protein. For annotations made in this way, the evidence code is ISS, the with field contains the identifier of the sequence of the ortholog (from human, rat, whatever), and we have a notes field that carries the PMID of the paper with the human/rat /etc. experiment. The paper listed in the notes must describe an experiment using the orthologous protein listed in the with column; this can't be another sequence similarity comparison. In this way, MGI is able to make annotations for proteins for which they have determined one or more orthologs in other species, even when there is no publication about the mouse gene. Note though that in this system, the database must have a system and expertise to determine orthology. The following questions were discussed in the context of Michelle Gwinn's talk about TIGR's pipeline for annotation of bacterial genomes: - QUESTION: Where do the known proteins come from for an HMM and how is the trusted cutoff determined? DISCUSSION: We start with a combination of automated clustering methods and manual curation from the literature. For TIGR to include a protein in an HMM seed, it must be experimentally characterized or have really good bioinformatic evidence for function. To determine the trusted cutoff, we look at the distribution of the scores and check the literature. Trusted cutoff is different for each HMM. One HMM can have a trusted cutoff of 10 and another 100. You have to check if your protein is higher than the trusted cutoff. The trusted cutoff may even be negative, again simply check to see if your protein's score is greater than the trusted cutoff score. There are currently 2585 HMM families at TIGR; many of these are for prokaryotic functions/proteins. 
About 1300 of these are at the equivalog level, that is modeling one specific function. - QUESTION: Where does the significance value of a hit come from: BLAST or Smith-Waterman? DISCUSSION: Both. We use liberal BLAST scores for collecting proteins to put into the Smith-Waterman portion of the BLAST_extend_repraze (BER) search, but then we are much more conservative in terms of assigning function. We store BLAST P values for our pairwise matches as well as statistics from the Smith-Waterman alignment including percent id/similarity, gaps, and indels. At TIGR we tend to focus on percent identity and length of match as the most important indicators. Gaps or insertions in the alignment would reduce the matching regions as well as the overall quality of the match. - QUESTION: How do you evaluate your sequence matches? DISCUSSION: We look at a combination of several different sequence based tools: BER (BLAST_extend_repraze; our tool for generating pairwise alignments), HMMs, Prosite, MHMM, SignalP, etc. Once we have all the data, TIGR is very conservative. We prefer to undercall, rather than overcall because we wish to avoid transitive annotation. We have evolved a multi-level system for naming proteins with evidence of various types. Our highest name level is the one where we believe that we know the precise function of a protein. In this case we will give the protein a precise name. In order for us to do this we need to see at least one of the following two pieces of evidence: a full-length match to an experimentally characterized protein at greater than ~40% identity (depending on the length of the two proteins, short proteins require higher identity) OR an above trusted cutoff match to an equivalog (modeling one specific function) level HMM.
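The two criteria for the highest name level can be sketched as a simple either/or check. This is only an illustration under stated assumptions, not TIGR's actual pipeline code: the class and function names are hypothetical, the 40% figure is the approximate value quoted above and in practice depends on protein length, and "full-length" is modeled here by an assumed coverage threshold of 0.9 that does not appear in the discussion.

```python
from dataclasses import dataclass

# Assumed illustrative thresholds. The discussion gives only "~40% identity"
# and says short proteins need more; the coverage value is hypothetical.
MIN_PERCENT_IDENTITY = 40.0
MIN_FULL_LENGTH_COVERAGE = 0.9


@dataclass
class PairwiseMatch:
    percent_identity: float
    query_coverage: float        # fraction of the query that is aligned
    subject_coverage: float      # fraction of the matched protein aligned
    subject_characterized: bool  # match protein is experimentally characterized


@dataclass
class HmmHit:
    score: float
    trusted_cutoff: float  # specific to each HMM; may be negative
    is_equivalog: bool     # HMM models one specific function


def above_trusted_cutoff(hit):
    """A hit counts only if its score is greater than that HMM's cutoff."""
    return hit.score > hit.trusted_cutoff


def qualifies_for_precise_name(matches, hmm_hits):
    """Return True if at least one of the two kinds of evidence is present."""
    strong_pairwise = any(
        m.subject_characterized
        and m.percent_identity > MIN_PERCENT_IDENTITY
        and min(m.query_coverage, m.subject_coverage) >= MIN_FULL_LENGTH_COVERAGE
        for m in matches
    )
    equivalog_match = any(
        h.is_equivalog and above_trusted_cutoff(h) for h in hmm_hits
    )
    return strong_pairwise or equivalog_match
```

Comparing each score with its own HMM's cutoff, rather than with a fixed number, is what lets the same test work for an HMM whose trusted cutoff is 10, 100, or negative.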
If these criteria are not met, the protein will get a name consistent with what evidence is available: it might be given a name based on its membership in a protein family ("xxxxx family protein"), it might be given a putative designation ("putative xxxxx protein") or it might be given the homolog designation ("xxxxx protein homolog", which indicates just that there is some level of sequence similarity between the two proteins, but no claims on function). For example, if we had a poor pairwise match to an experimentally characterized biotin synthase and nothing else, we would likely call the protein "biotin synthase homolog" or if the match was very poor it would be demoted to "hypothetical protein". To avoid transitive annotation, we require that in order for a pairwise match to be considered as evidence for a functional annotation, the match protein must itself be experimentally characterized. - QUESTION: Are the experimentally determined proteins listed? DISCUSSION: We maintain a database containing accessions of proteins that have been experimentally characterized. Our annotation tools highlight these proteins in our search results to aid annotators. We are constantly adding new accessions to this database, however, it is not even remotely an exhaustive set. This is an area where we very much wish we could invest more resources. - QUESTION: What do you call a homolog? DISCUSSION: TIGR uses the word homolog to denote that there is some sequence similarity but we are making no claims about the protein's function (because the similarity is so weak). We use the word ortholog very carefully only when there is evidence to support that a given gene is equivalent in function to a specific gene in a related organism. We have a hierarchy of terms we use based on how strong the sequence similarity is (see above).
Ortholog [we believe we know the exact function of this protein based on a strong match to characterized examples from other organisms] is a very strong statement; homolog [weak but consistent similarity OR strong similarity to something not expected in this organism (photosynthetic gene in non-photosynthetic bug)] is much weaker. 25-30% of the genes in new genomes fall into the categories "hypothetical" or "conserved hypothetical" protein. It is common to find paralogous families of proteins within a species which have no homology to anything with a characterized function, but do match unknown proteins from other species (the conserved hypotheticals), and also to find paralogous families that do not match anything outside the species at all (hypothetical proteins). - QUESTION: What proportion of proteins in a genome have frameshifts or point mutations, and in these cases, what gets deposited in GenBank? DISCUSSION: Typically, we see around 10-100 proteins in this class in an average prokaryotic genome. For proteins where we can identify a biological translation exception (selenocysteines, programmed frameshifts), an appropriate translation is sent to GenBank. For proteins in which we find single frameshifts or in-frame stops which have no experimental documentation for function in this organism or elsewhere, no translation is sent. - QUESTION: Do you (TIGR) go back and add/update annotations? DISCUSSION: Not on a genome wide scale - we do propagate updates stemming from new HMMs to older genomes by searching the new HMMs against those genomes and then checking the annotation of genes that score well to the HMMs. But this only affects a few genes per genome - we would like to do large scale genome reannotation, but do not yet have the resources to do so. QUALIFIERS NOT - QUESTION: How do you capture negative results? Can one use the NOT qualifier to indicate a negative result in an experiment?
For example, in Paper 1 (VanWagoner TM et al.), they make a mutation that has no phenotype. DISCUSSION: Typically, GO does not attempt to capture this type of negative data. The NOT qualifier was instituted in order to be able to make statements that a particular GO term should NOT be associated with a given gene. In practice, this qualifier is used sparingly. It is particularly useful in cases where a cursory inspection might result in making a positive annotation with that term. For example, SGD has made an annotation to say that Rcl1p is NOT an RNA cyclase (Function ontology). In this paper the authors do a sequence comparison (Figure 1) and state that superficially Rcl1p is similar to other known RNA cyclases. But a closer look at the sequence conservation in the catalytic site residues revealed that it lacks key residues. A direct assay testing this protein for RNA cyclase activity was negative. [Billy E et al., EMBO J. 2000, 19(9):2115-26. PMID: 10790377]. Thus, we made a "NOT RNA cyclase activity" annotation for RCL1 because a superficial look at sequence similarity might lead someone to assume that it is an RNA cyclase. The use of the NOT qualifier allows us to make an annotation that points people to the negative evidence against this assumption. Contributes to - QUESTION: In your presentation (about the Chang et al. paper; presented by Karen Christie), you made an annotation to a complex term. Would you give RMI1 an annotation to a function term with the qualifier "contributes to"? DISCUSSION: That's a good question. Often, you would want to make a "contributes to 'function term'" annotation. However, I didn't feel it was warranted from this paper. They do show that Rmi1p can be isolated with the Sgs1p/Top3p complex, but they do not show that Rmi1p is present as part of a complex that has enzymatic activities associated with Sgs1p or Top3p.
If they had done an assay with the whole complex and shown activity, a "contributes to: function" would definitely have been appropriate. However, from this paper, I felt that "molecular function unknown" was the appropriate function annotation for RMI1. Colocalizes with - QUESTION: In paper 18 (Masaki M et al.), protein A binds directly to protein B that binds to complex CX, as per the diagram below. Is this sufficient evidence for a component annotation? Should a term be created for the protein B-CX complex? DISCUSSION: Clearly, it is fine to annotate protein B to "colocalizes with 'CX complex'". It was felt that in this situation, where they are identifying protein A, protein B, and complex CX all together, that it was also reasonable to annotate protein A to "colocalizes with 'CX complex'" and also to "protein binding (with protein B)". However, GO would probably not invent a name and corresponding term for "protein B - CX complex" unless the authors describe it in such terms. [drawing not included in text version - see pdf or html versions] OTHER GO ANNOTATION QUESTIONS Granularity or level of annotation - QUESTION: How does one deal with downstream effects of a given mutation? For example in Paper 6 (Glise B et al.), they deal with a mutation that has downstream effects on wing disc development. Should this downstream effect get captured? DISCUSSION: This depends on the situation. There is no hard and fast rule. Rather the appropriate course of action depends both on the philosophy of the group and the specific situation. The fact that mutant phenotypes may reflect a downstream effect rather than the primary defect is something that the GO consortium recognized. For example, in S. cerevisiae, splicing mutations often display translation defects. We now know that this is because the majority of the few genes that contain introns in cerevisiae are ribosomal protein genes. 
With this knowledge, we now feel that it is not appropriate to annotate a splicing gene to the process term "translation ; GO:0043037". However, if this is the first paper characterizing a gene and its phenotype when mutated, then you will probably want to make an annotation so that you can provide some hint of what process the gene is involved in. Once more is known, you may decide that the annotation needs to be reviewed in light of current information. Another situation is illustrated with this example. Let's say you have two transcription factors. One is a general transcription factor involved in the basic process of transcription. Because it is so basic, it is involved in many processes. Clearly it should get a process annotation for "transcription from Pol II promoter", but one might not wish to annotate all the downstream effects. The second transcription factor is specifically involved in activating one specific pathway, e.g. histidine biosynthesis. In this case, in addition to annotating to "transcription from Pol II promoter", it would be very useful to also annotate to "histidine metabolism", since it is specifically involved in this process. - QUESTION: In your presentation (about the Chang et al paper; presented by Karen Christie), you had a list of several possible specific terms, but you ended up selecting only "response to DNA damage" as an annotation. Why? DISCUSSION: I listed each of the specific phenotypes the authors mentioned in the paper as a short summary to help me when I looked for a term. Some of these phenotypes are very similar to GO terms. However, I felt that the most basic process that was consistent with and represented all of the various phenotypes, some of which may be downstream effects, was "response to DNA damage". Specific Annotation Questions - QUESTION: Relating to protein binding, how do you deal with annotation of homo-oligomers? 
DISCUSSION: There are already terms in GO that are appropriate for annotation of homo-oligomers. In the process ontology, there is "protein homooligomerization ; GO:0051260" and in the function ontology there is "protein homodimerization activity ; GO:0042803". Annotations to these terms would generally use the IPI evidence code and contain the same protein ID in the with column (#8) as is found in column 2, which contains the ID of the object being annotated. For example, in UniProt, the human protein RAD51_HUMAN (accession number Q06609) has this annotation: protein homooligomerization ; GO:0051260 ; IPI with Q06609 - QUESTION: In your presentation (about the Chang et al paper; presented by Karen Christie), you annotated RMI1 to "molecular function unknown", but not to "protein binding" even though they showed that Rmi1p binds to Sgs1p and Top3p. Why did you not annotate with the "protein binding" term? DISCUSSION: While a number of groups, such as MGI, choose to use the with column in conjunction with the term "protein binding" to store protein-protein interaction data, SGD does not. We have another system for storing this type of data and do not double curate it into GO as well. Use of the term "protein binding" to effectively become a method for creating a database of protein interactions is allowed by the GO Consortium, but is not required. - QUESTION: Would you use experiments done in human cell lines for annotation, even if these lines are aneuploid or otherwise known to be abnormal? DISCUSSION: The general thought was that it is often fine and is really up to the discretion of curators annotating human genes. For human, an experiment in a cell line will often be the only experimental data available. In addition, while non-primary cell lines are often known to have abnormalities, these changes from the normal state are often not relevant to the given experiment.
In this situation, it is reasonable to draw conclusions about the normal process of a gene and thus make a GO annotation. Annotation methods and consistency - QUESTION: Is there any checking about consistency of annotation, either between different groups, or within a single group? DISCUSSION: One of the reasons for this, and the previous, GO Annotation Camp, is to address the issues of training in annotation and consistency checking and to discuss specific annotation questions so that the GO Consortium as a whole comes to consensus on the appropriate course of annotation. - QUESTION: Is there any training for new users? DISCUSSION: We are in the process of building a set of papers to serve as a training set. Each group is to provide 10 papers and a "key" to what annotations may be made from each paper. These can then be used both for training new people and for exercises to check for consistency between curators. - QUESTION: How do the three ontologies relate to each other? How can a curator be sure that they have made all the appropriate annotations? DISCUSSION: QuickGO suggests possible terms that a curator may wish to consider, based on analysis of "concurrent annotations", i.e. terms that tend to be annotated to the same genes. Proteome used GO Pairs. We are assessing these types of methods to see if they can be incorporated into curation tools. - QUESTION: How do you search for a GO term? How do you know to search for protein kinase for example? DISCUSSION: Sometimes by familiarity. As with any system, the more you use it, the better at it you get. Some people look at the annotation of a similar gene in another DB, and use its annotations as a guide. GO also has "synonyms", in addition to the primary term names, to help make it easier for someone unfamiliar with the GO system to find the right term. Though we have been calling them synonyms, they do not necessarily mean exactly the same thing. 
We have four types, referred to as "broad_synonym", "exact_synonym", "narrow_synonym", or "related_synonym". While it is not always possible to use a common usage phrase as the main name of the GO term, we try to add common usage phrases as synonyms to help with searching. OTHER GO QUESTIONS Obsolete terms - QUESTION: What information is still associated with an obsolete term? DISCUSSION: All information that was part of a GO term (the GO id, term name, definition, aspect, any dbxrefs) remains part of that GO term when it becomes obsolete. When a term becomes obsolete, the word "OBSOLETE" becomes the first word of the definition. If a term was defined at the time it became obsolete, the original definition is retained after the word "OBSOLETE". If it was not defined, the phrase "was not defined before being made obsolete" is used. In addition, the Comment field is used to suggest an appropriate replacement term or terms. It is worth noting that it is possible to create a new term with a term name that is identical to the term name of a term that has been obsoleted. This is because the GO id is attached to the definition of a term, not to the term name per se. Thus if it becomes clear that a GO term has been defined incorrectly, in a way that would have allowed inappropriate annotations to be made, our practice is to obsolete the original GO term and replace it with a new GO term that has a different GO id. In this way, users who have used the original GO term to annotate are alerted to the change and can use the comment on the obsoleted term to help identify the appropriate replacement term. - QUESTION: How do databases handle obsolete terms? What happens to the obsolete terms? How are they stored? DISCUSSION: Terms that have been obsoleted are never deleted from the GO file. In the OBO format of the GO file, obsolete terms are given the attribute "is_obsolete: true" and the word "OBSOLETE" is added to the definition as the first word.
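A database that wants to find obsolete terms in the OBO format file can look for the "is_obsolete: true" attribute described above. The following is a minimal sketch, not a complete OBO parser: it assumes each term is a "[Term]" stanza of "tag: value" lines, the function names are hypothetical, and the two terms in the sample text use made-up placeholder IDs, names and definitions.

```python
def parse_obo_terms(text):
    """Yield one dict per [Term] stanza, mapping each tag to a list of values."""
    term = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; emit the term collected so far, if any.
            if term is not None:
                yield term
            # Only [Term] stanzas are collected; other stanza types are skipped.
            term = {} if line == "[Term]" else None
        elif term is not None and ": " in line:
            tag, value = line.split(": ", 1)
            term.setdefault(tag, []).append(value)
    if term is not None:
        yield term


def obsolete_terms(text):
    """Return the stanzas that carry the attribute 'is_obsolete: true'."""
    return [t for t in parse_obo_terms(text) if t.get("is_obsolete") == ["true"]]


# Made-up sample for illustration; these are not real GO terms.
SAMPLE_OBO = """\
format-version: 1.0

[Term]
id: GO:9990001
name: example active term

[Term]
id: GO:9990002
name: example retired term
def: "OBSOLETE. Original definition text." []
comment: Consider a replacement term instead.
is_obsolete: true
"""
```

Running obsolete_terms(SAMPLE_OBO) returns only the second stanza, whose definition begins with "OBSOLETE" and whose comment field is where a suggested replacement would be found.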
In the older GO file format, terms made obsolete become children of the appropriate obsolete node. However, how individual databases handle obsolete terms depends on the database. SGD, for example, does not keep obsolete terms in our database. Our nightly script to load and update GO checks if a term that has become obsolete is used for any annotations in our database or is included in any GO-slims. If it is, curators examine the situation and fix it. However, once the obsolete term is not used for any annotations in the database, it will be deleted from SGD. References for GO annotations - QUESTION: Sometimes a reference has a Supplementary Material section that has been used for GO Annotations. The problem is that sometimes the Supplementary Data becomes unavailable if the journal or author doesn't maintain older data. Is it still OK to use Supplementary Data for GO annotations? DISCUSSION: SGD deals with this problem by getting a copy of the Supplementary Data file(s) at the time we make GO annotations from it and then we make the Supplementary Data file available through SGD's ftp site, as well as via the original site on the journal's website. This way, we make sure that the original data remains available to the public. Helping ensure that supplementary datasets, especially large scale ones, remain publicly available is something that our Advisory Board has said is important, so we are starting to make more of an effort to host the supplementary data files for published papers. However, every database group will need to make its own decision about how to handle this type of question depending on its own resources and priorities. - QUESTION: For some genes, the only papers we have are older papers where only a phenotype, without knowing a specific gene, or perhaps a biochemical assay is shown. Do you use such papers for GO annotations? DISCUSSION: The answer to this question depends in part on curator discretion and group practice. 
For old papers describing biochemical assays of a protein, one of the complications can be if the protein may have multiple isozymes. If you can unambiguously assign the paper to a specific isozyme, or if the proposed function is general enough to apply to all isozymes, it may be reasonable to use the paper for GO annotations. - QUESTION: Is it permissible to use a meeting abstract for a GO annotation? or a thesis? or a publication in a foreign language? DISCUSSION: Each group has its own policies on references that are not (and will not be) in PubMed. WormBase allows use of Meeting Abstracts, and maintains these abstracts in our database with WBPaper identifiers. In contrast, SGD does not allow these; the curator must be able to read the full text of a peer reviewed paper in order to be allowed to make a GO annotation from it. Making changes to the Gene Ontologies - QUESTION: If you don't have write access to the CVS repository, how do you request a new term? DISCUSSION: We have a SourceForge tracker for suggestions to improve the Gene Ontologies. All suggested changes, whether for new terms or changes to existing terms, structure, definition, etc., should be entered into the GO request tracker, regardless of whether the requester has CVS write access or not. The SourceForge tracker is our official tracking mechanism for all requests and their outcome. - QUESTION: How many terms should be created, i.e. do we need a term that is specific for Drosophila? DISCUSSION: GO may have as many terms as are needed. The guiding principle for whether a species specific term is needed or not is whether the process (or function or component) is unique to a given species or taxon. If the process occurs similarly across all known organisms, only a single term is needed. However, if a specific organism or taxon, exemplified by the model organism Drosophila melanogaster in this example, does something differently than other organisms, then a specific term is needed.
These terms representing taxon-specific processes are labeled with the word "sensu" and the relevant taxon, e.g. "sensu Drosophila". However, it is important to note that using the sensu designation in a term does not exclude that term from being used to annotate species outside that designation. For example, a 'sensu Drosophila' term might reasonably be used to annotate a mosquito gene product. Synonyms - QUESTION: Do you have different types of synonyms? How do you display them? DISCUSSION: GO has four types of synonyms: related, exact, broad and narrow. The OBO format file uses the tags: "broad_synonym", "exact_synonym", "narrow_synonym", "related_synonym". Unfortunately, the synonym types are not currently displayed in AmiGO. NOTE: Since the Annotation Camp, a new version of AmiGO has been moved into production. The individual term pages, like the one for which the URL is given below, now indicate the type for all synonyms. http://www.godatabase.org/cgi-bin/amigo/go.cgi?view=details&depth=1&query=307 - QUESTION: Does AmiGO search for synonyms? DISCUSSION: Yes. AmiGO searches for a match within the synonyms as well as within the term names. Unfortunately, the results page does not indicate what the match was when it occurs within the synonym. NOTE: Since the Annotation Camp, a new version of AmiGO has been moved into production. If you search for something like "cyclin", you can now see whether the match was within the term name or a synonym and the specific text that matches is highlighted. Note that only synonyms that match the query are shown; other synonyms that do not match the query are not shown on the search results. File Format, Error checking, and filtering of annotations - QUESTION: What is the difference between the OBO file format and the GO file format? DISCUSSION: The GO file format is the original format of the Gene Ontology files.
There is a separate file for each of the three ontologies, biological process, molecular function, and cellular component. Term relationships are indicated by the amount of indentation from the terms above it and the relationship type is indicated with a symbol in front of the term name. There is an additional fourth file for the term definitions. The OBO format file is somewhat like XML in format and it contains the information about each of the three ontologies, including the term definitions, in a single file. In addition, the OBO file contains the specific synonym type ("broad_synonym", "exact_synonym", "narrow_synonym", "related_synonym"). The GO format includes the synonyms, but not the synonym types. - QUESTION: Currently the error checking script for the gene_association files just lists the line number. Could it print the whole line as well, so that it is easier to find the error? DISCUSSION: Mike Cherry "Yes, we can do that. The script is still evolving. The way to make it do what we all want is to provide feedback." - QUESTION: In Mike Cherry's list of errors, why is it an error to have IEAs more than a year old? DISCUSSION: In order to prevent having a lot of out of date information that was electronically generated, the GO Consortium has decided to implement a limit on the currency of IEA annotations. Those more than a year old will be excluded. For groups with ongoing annotation programs that perform IEAs via scripts, this will mean that they need to rerun their script. All annotations that are still made by the script will get the new date and can be kept in the gene_association file. - QUESTION: If the GOA project has GO annotations for a given species, that the relevant MOD did not capture, can the checking script alert the MOD? DISCUSSION: To prevent redundant annotations being present in AmiGO, the GO Consortium will be instituting a new policy. Currently, AmiGO shows two copies of the same annotations in some circumstances.
For example, MGI makes annotations that are subsumed into the GOA project. Although the gene_association files credit the appropriate source, both get loaded into the GO database and then displayed on AmiGO. It would be OK for there to be multiple annotations if they had independent sources, but this is effectively the same annotation just duplicated. Thus, in the future, for each species (or taxon), there will be one official source. If GOA is also making annotations for that species, then the GOA project and the MOD have to communicate so that GOA can contribute its annotations to the MOD and all annotations relevant to one species will be in a single file, contributed by the official source.

- QUESTION: On the ftp site, is the time given in the "Last Change" column for each file accurate?
DISCUSSION: Yes, but note that the ftp server uses times in Greenwich Mean Time (GMT) rather than Pacific Time, so an update that occurs when it is 11pm at Stanford will get a time of 7am the next day.

Documentation

- QUESTION: Do the README files provide information about a given group's annotation philosophy, e.g. whether they use "protein binding" annotations to capture protein-protein interaction data, or exactly what procedure they use to assign an unknown term with the evidence code ND?
DISCUSSION: We have just recently decided that this information must be present in each group's README file. It is probably not all there yet, but we will be implementing expanded README files with this type of information.

Other Ontologies

- QUESTION: Is there a way to find out what other ontologies are being developed, even those that are not part of OBO?
DISCUSSION: OBO is a main resource for most of the ontologies. Beyond that, Michael Ashburner and Suzanna Lewis keep an eye on a variety of other ontology projects.

- QUESTION: How does GO deal with expression data, for example if a gene is expressed in the leaf versus the root?
DISCUSSION: GO does not deal with expression data at all.
GO deals with the cellular level, not the tissue level. For information outside of the scope of GO, there are other ontologies. For plants, the Plant Ontology vocabularies (www.plantontology.org) for plant structure and plant growth and developmental stages can deal with information about the site of expression.

- QUESTION: Is there an ontology for microarray experiments?
DISCUSSION: The Microarray Gene Expression Data Society (MGED Society) develops standards and ontologies for microarray experiments.

GO Tools

- QUESTION: Can I use tools like Textpresso to make GO annotations?
DISCUSSION: Textpresso, a tool to index and search the full text of papers, is not designed to make GO annotations. However, it may be useful to help find the appropriate papers from which a curator may make GO annotations. Textpresso indexes the full text of a paper, so it can be particularly useful in the identification of papers that are relevant to a given organism when the abstracts of those papers do not mention the organism, and it can be useful in identifying topics, based on the keywords it uses to search. However, curation itself is performed by a human curator who reads the paper and judges what annotations can be drawn from the paper.

- QUESTION: How do I trim down the full Gene Ontology to create a GO-slim?
DISCUSSION: You can customize a GO-slim in whatever way you like, whether that's a subset of high level terms or any other subset of terms. To create your own GO slim, you need to use DAG-Edit (http://www.godatabase.org/dev/java/dagedit/docs/index.html). There is a tutorial on how to do this at http://www.geneontology.org/GO.teaching.resources.shtml#tut, called 'Creating a GO slim using DAG-Edit' (an online version will be available soon).

- QUESTION: How do I map a set of associations to a GO-slim?
DISCUSSION: To map an association file to your own, or an existing, GO slim, you can use the web-based tool Generic GO Term Mapper (http://go.princeton.edu/cgi-bin/GOTermMapper) for any species, or GO Slim Mapper (http://db.yeastgenome.org/cgi-bin/GO/goTermMapper) for Saccharomyces annotations only. Alternatively, if you know some Perl, you can use the map2slim script (http://www.geneontology.org/GO.slims.shtml?all#script).

- QUESTION: In one of the small groups, someone asked if it would be possible for DAG-Edit to include the evidence codes so that annotators could look at both the term and the scope of each evidence code in one tool, rather than having to go to the GO documentation each time for the evidence codes.
DISCUSSION: This is a question for the designer of DAG-Edit.

- QUESTION: Have you thought of creating a forum on the website so that people can submit info on new ontologies, etc.?
DISCUSSION: There was discussion about whether a web forum, or wiki, would improve the accessibility of GO to people looking for information on it or other biological ontologies. The main concern seemed to be whether a web forum would be any more effective at attracting users than the existing email list, for which the email archive is available for searching. We are always trying to figure out the best way to get the word out about the GO project and how to get regular researchers interested, not just the people who already know about GO.

- QUESTION: At the last GO Consortium meeting, there was discussion about a Common Annotation Tool. Has any progress been made on such a thing?
DISCUSSION: While there is not a single annotation tool in use by all Consortium members, there are a couple of tools that are freely available. The Arabidopsis Information Resource (TAIR) and Rat Genome Database (RGD) collaborate to create PubSearch (at pubsearch.org), a curation tool designed to help manage the literature.
A paper about PubSearch will be coming out soon in Current Protocols in Bioinformatics. TIGR creates the Manatee tool. Two versions are available, eukaryotic and prokaryotic. Both PubSearch and Manatee are freely available.

Other

- QUESTION: Why is "ATPase activity" NOT a child of "ATP binding"? Why do we need a separate annotation to "ATP binding"?
DISCUSSION: The term "ATPase activity" used to be a child of "ATP binding". There was much discussion at the time about moving it out, but no one present remembered the details well enough to explain exactly why the change was made; it was likely because keeping "ATPase activity" always under "ATP binding" caused a true path violation. Peter D'Eustachio commented that some ATPases are very rapid at catalysis and subsequent release, such that they do not exhibit stable binding to ATP, though without researching the reasons for the change, we do not remember whether this was a consideration in the decision.

- QUESTION: Does anyone spend time on the research aspects of GO, i.e. the AI/CS aspects of it?
DISCUSSION: The majority of people developing GO are biologists working as curators trying to annotate genes. Their focus is typically on the biological perspective, i.e. trying to create a term or terms that represent the process or function they need to annotate. However, some members of Suzanna Lewis's group at the Berkeley Drosophila Genome Project (BDGP) are interested in the data structure and data models.

- QUESTION: Can PubMed enforce the requirement of GO annotations, similarly to how submission to GenBank requires use of Gene IDs?
DISCUSSION: Use of standard gene names and IDs is a much bigger problem, so the Model Organism Database (MOD) community would probably focus on requiring standard gene names and/or gene IDs first, before requesting use of GO IDs. On the Biocurator mailing list, there is currently a discussion about what we (the MODs) would want the journals to require.
It was also mentioned that if there were undergraduate courses on biocuration, it might help people understand what information needs to appear in their publications for them to get properly indexed.

QUESTIONS RELATING TO DATABASE ORGANIZATION, RATHER THAN GO

- QUESTION: How does information about the strain background get captured?
DISCUSSION: It depends upon the organization. Some groups do not capture this information at all. Other groups do. Whether and how to do this depends upon the needs and interest of your community.

- QUESTION: Can GO be used to annotate a(n) 1) microRNA, 2) mature processed protein, 3) alternative transcript?
DISCUSSION: GO can be used to annotate any gene product, regardless of whether the gene product is composed of protein or RNA. In practice, many groups annotate at the level of gene, rather than gene product (protein or RNA product). This is because most database groups generate database objects with unique IDs for genes, but do not generate separate database objects with their own unique IDs for the corresponding products of these genes. Every object to be annotated needs to exist as a database object with its own unique identifier within your database. Any database object representing a gene that has a gene product, or any database object representing a gene product directly, can be annotated with GO. However, it is not appropriate to use GO to annotate a sequence feature that does not have a gene product, e.g. an origin of replication or a protein binding site. Annotation of such sequence-based features should be performed with the Sequence Ontology (SO), not the Gene Ontology (GO).

- QUESTION: Is it OK to do a set of first pass annotations with weak evidence, e.g. InterProScan?
DISCUSSION: This is up to each group to decide based on the needs of the community each serves. The community needs should guide your annotation practice.
Saying that gene A has similarity to XXX domain may motivate somebody to look at that gene, so this annotation may be useful to the community.

- QUESTION: How do you decide the priority between updating existing annotations and annotating another genome?
DISCUSSION: Each group needs to get feedback from its own user community and assign priorities accordingly.

- QUESTION: Do databases keep what they have in their gene_association file in sync with what they show on their production website?
DISCUSSION: Each database has its own procedures for updating its production website and for creating its gene_association file. For example, Zfin is not in sync: it submits its gene_association files only weekly, but database updates show up immediately. At SGD, database updates also show up immediately, but gene_association files are submitted daily. At MGI, both the database and the gene_association file are updated weekly. WormBase gene_association files had not been in synchrony with WormBase releases, but user feedback has prompted WormBase to maintain synchrony between its gene_association file and WormBase releases.

- QUESTION: Is Community Annotation (annotations made by members of the research community) an effective way to get good annotations into the database? How do you do Quality Control on such annotations?
DISCUSSION: This seems to be highly dependent on the specific community and perhaps also on specifically what sort of information is being asked for. The Phytophthora database (at Virginia Bioinformatics Institute) is using a community annotation system with good success. They have about 1600 genes. Users log in and make annotations. Annotations from the "experts" are not checked. However, annotations from other sources go to a temp database to be checked before entry into the production database. So far, 25 people have provided 1600 annotations, mostly during the Joint Genome Institute (JGI) annotation jamboree for Phytophthora.
Nicole Perna used a Community Annotation system for E. coli and it worked well. In contrast, SGD was asked to set up a system to allow users to contribute statements about key highlights in published papers. In over two years, only a handful of people have contributed fewer than a hundred annotations about research publications.

- QUESTION: How do you (asked of the Phytophthora community) deal with conflicting annotations?
DISCUSSION: So far the Phytophthora project has not had to deal with conflicting annotations. However, it is important to note that conflicting annotations are allowed by GO. If there is dissension in the field, it is permissible to make annotations reflecting both points of view, each with an appropriate reference.

ACTION ITEM 6: Can the people representing new groups here please let us know the status of your projects, and if there is any assistance from the GO Consortium that would be useful to you? Please send this info to Jennifer Clark (jenclark@ebi.ac.uk).

READING LIST

PAPERS PRESENTED TO THE FULL GROUP

A. Loyola A et al., Functional analysis of the subunits of the chromatin assembly factor RSF. Mol Cell Biol. 2003 Oct;23(19):6759-68. [PMID:12972596] (full paper presented by Rebecca Foulger of UniProt)
B. Chang M et al., RMI1/NCE4, a suppressor of genome instability, encodes a member of the RecQ helicase/Topo III complex. EMBO J. 2005 Jun 1;24(11):2024-33. [PMID:15889139] (full paper presented by Karen Christie of SGD)
C. Appelbaum L et al., Homeobox-clock protein interaction in zebrafish. A shared mechanism for pineal-specific and circadian gene expression. J Biol Chem. 2005 Mar 25;280(12):11544-51. [PMID:15657039] (single example from the paper presented by Ceri Van Slyke of Zfin)

PAPERS DISCUSSED IN SMALL WORKING GROUPS

1. VanWagoner TM et al., Characterization of three new competence-regulated operons in Haemophilus influenzae. J Bacteriol. 2004 Oct;186(19):6409-21. [PMID:15375121]
2.
Sakuragi N et al., Functional analysis of a novel gene, DD3-3, from Dictyostelium discoideum. Biochem Biophys Res Commun. 2005 Jun 17;331(4):1201-6. [PMID:15883003]
3. Iizuka K et al., Genetically linked C-type lectin-related ligands for the NKRP1 family of natural killer cell receptors. Nat Immunol. 2003 Aug;4(8):801-7. [PMID:12858173]
4. Huang S et al., Arabidopsis VILLIN1 Generates Actin Filament Cables That Are Resistant to Depolymerization. Plant Cell. 2005 Feb;17(2):486-501. [PMID:15659626]
5. Kawashima CG et al., Characterization and expression analysis of a serine acetyltransferase gene family involved in a key step of the sulfur assimilation pathway in Arabidopsis. Plant Physiol. 2005 Jan;137(1):220-30. [PMID:15579666]
6. Glise B et al., Shifted, the Drosophila ortholog of Wnt inhibitory factor-1, controls the distribution and movement of Hedgehog. Dev Cell. 2005 Feb;8(2):255-66. [PMID:15691766]
7. Abramovitch RB et al., Pseudomonas type III effector AvrPtoB induces plant disease susceptibility by inhibition of host programmed cell death. EMBO J. 2003 Jan 2;22(1):60-9. [PMID:12505984]
7a. Kumar S et al., The role of reactive oxygen species on Plasmodium melanotic encapsulation in Anopheles gambiae. Proc Natl Acad Sci U S A. 2003 Nov 25;100(24):14139-44. [PMID:14623973]
8a. Qutob D et al., Expression of a Phytophthora sojae necrosis-inducing protein occurs during transition from biotrophy to necrotrophy. Plant J. 2002 Nov;32(3):361-73. [PMID:12410814]
8b. Tyagi S et al., The ORF3 protein of hepatitis E virus interacts with liver-specific alpha1-microglobulin and its precursor alpha1-microglobulin/bikunin precursor (AMBP) and expedites their export from the hepatocyte. J Biol Chem. 2004 Jul 9;279(28):29308-19. [PMID:15037615]
9a. Imhof I et al., Phosphatidylethanolamine is the donor of the phosphorylethanolamine linked to the alpha1,4-linked mannose of yeast GPI structures. Glycobiology. 2000 Dec;10(12):1271-5. [PMID:11159918]
9b.
Ito Y et al., Organ-specific alternative transcripts of KNOX family class 2 homeobox genes of rice. Gene. 2002 Apr 17;288(1-2):41-7. [PMID:12034492]
10a. Kucharski R and Maleszka R. Transcriptional profiling reveals multifunctional roles for transferrin in the honeybee, Apis mellifera. J Insect Sci. 2003;3:27. [PMID:15841243]
10b. Chen ZJ et al., Molecular cloning of a regulatory protein for membrane-bound guanylate cyclase GC-A. Biochem Biophys Res Commun. 2000 Nov 11;278(1):106-11. [PMID:11071862]
11a. Deng M et al., Mapping Gene Ontology to proteins based on protein-protein interaction data. Bioinformatics. 2004 Apr 12;20(6):895-902. [PMID:14751964]
11b. Palsson A and Gibson G. Association between nucleotide variation in Egfr and wing shape in Drosophila melanogaster. Genetics. 2004 Jul;167(3):1187-98. [PMID:15280234]
12a. Birbes H et al., A mitochondrial pool of sphingomyelin is involved in TNFalpha-induced Bax translocation to mitochondria. Biochem J. 2005 Mar 15;386(Pt 3):445-51. [PMID:15516208]
12b. Thorpe CJ et al., Wnt/beta-catenin regulation of the Sp1-related transcription factor sp5l promotes tail development in zebrafish. Development. 2005 Apr;132(8):1763-72. [PMID:15772132]
13. Keeling PJ and Palmer JD. Lateral transfer at the gene and subgenic levels in the evolution of eukaryotic enolase. Proc Natl Acad Sci U S A. 2001 Sep 11;98(19):10745-50. [PMID:11526220]
14. Henderson MJ et al., EDD, the human hyperplastic discs protein, has a role in progesterone receptor coactivation and potential involvement in DNA damage response. J Biol Chem. 2002 Jul 19;277(29):26468-78. [PMID:12011095]
15. Barloy-Hubler F et al., Smc01944, a secreted peroxidase induced by oxidative stresses in Sinorhizobium meliloti 1021. Microbiology. 2004 Mar;150(Pt 3):657-64. [PMID:14993315]
16. Koushika SP et al., A post-docking role for active zone protein Rim. Nat Neurosci. 2001 Oct;4(10):997-1005. [PMID:11559854]
17.
Prigge MJ et al., Class III Homeodomain-Leucine Zipper Gene Family Members Have Overlapping, Antagonistic, and Distinct Roles in Arabidopsis Development. Plant Cell. 2005 Jan;17(1):61-76. [PMID:15598805]
18. Masaki M et al., Mixed lineage kinase LZK and antioxidant protein-1 activate NF-kappaB synergistically. Eur J Biochem. 2003 Jan;270(1):76-83. [PMID:12492477]