!version: $Revision: 1.8 $
!date: $Date: 2000/11/08 17:09:39 $
!
!Gene Ontology
!sgd annotation guide
!

This document is a set of instructions for SGD curators on how to annotate yeast genes to GO nodes. Some things in it are yeast-specific, and the brief introduction to CVS is specific to users inside the Stanford Genetics department. The 2000-09-12 update includes more SGD-specific information than the previous version. SGD now has a web-based interface for annotation, and a script that generates the gene_association.sgd file from SGD's Oracle database.

Using the Gene Ontology (GO)

Contents
Web interface for annotation
Browsing in the ontologies
Annotation format
Getting the files: a very tiny bit on CVS

Web interface for annotation

First, log into Oracle from the curator login page. Click on GO Info. Enter the locus name or locus_no, or choose the feature option from the pull-down and enter the feature name or feature_no for the item you're annotating. To retrieve locus_no or feature_no using SQL, see Local Oracle Resources and, in particular, the links under Getting Started for a basic introduction to SQL.

Search the ontologies to find the GO number(s) for items that accurately describe the gene of interest. Be sure to pick at least one item from each of the three ontologies, even if you have to use "unknown" a lot. You can, of course, use as many GO nodes as you need, if your gene product is involved in several processes, is found in more than one location, etc.

Note: if there is more than one GO number attached to a node, use only the first one listed.

To annotate a gene that hasn't been annotated yet, or to add to existing annotations, fill out the form, using one row per combination of GO node + reference.

-Choose an ontology from the pull-down menu. The processing script will squawk if the GO number and ontology don't match.
-Enter the GO ID number in the box provided; you don't have to include the leading zeros.
Note that if you've just added a new GO item, you will have to wait until the GO loading script runs (in the middle of the night) before you can use the new GO number for annotations.
-Enter one reference ID in the Ref box (ace paper object as of 2000-09-12). If there's more than one reference for an annotation, use a separate row for each one.
-Choose the codes for the types of evidence that the reference contains to support the association. You can choose more than one code.
-Fill in the names of loci or features that have the same GO ID number, with the same reference and supporting evidence. There's one box for locus names, and another for feature names.

Continue filling in rows until you've made all the annotations, then click "submit." The Oracle database will be updated right away, and the new annotations will be included in the gene_association.sgd file when the script runs at night.

If the gene has already been annotated, existing annotations will show up as rows that are already filled out. Loci and features that have the same annotation (i.e. the same GO number, same reference, and same evidence code(s) assigned) are included. Scroll down to find empty rows to add annotations. Existing annotations can be deleted or changed. To delete one, click the "delete this annotation" button; to change one, delete the old annotation and replace it with a new one.

Browsing in the ontologies

The easiest way to find the GO numbers that apply to your gene of interest is to use Mira's browser. The GO curation interface has a link to the browser, which will open a new window showing the entry page for Mira's program, with the pathname for the copies of the ontologies in the ftp directory. If you have checked out the ontology files from the CVS repository, you can substitute the pathname of the directory that contains your copies of the ontology files; remember to include a slash at the end.
Clicking "Submit Query" takes you to a set of frames so that you can search three files at once. You can use the default selections (component on top, function in the middle, and process on the bottom), or use the selector to change to another file in the same directory. Type in a string and click "Submit Query;" you can use text or a GO number. The search automatically works as though it's using wildcards (i.e. you can type in all or part of a word, phrase, number, etc.) and supports Boolean AND and NOT operators.

Examples:
mitochon matches GO terms containing "mitochondrion," "mitochondria," "mitochondrial"
Using Boolean AND: type & collagen finds GO terms containing both "type" and "collagen"
Using Boolean NOT: collagen & ~V finds GO terms containing "collagen" but not "V"

Annotation format

GO annotations for yeast genes are stashed in go/SGD_GO_files/gene_association.sgd, which is one of many GO-related files in our CVS repository. See below for info on how to work with files in CVS. (Note: the gene_association.sgd file may be Read-only when checked out. Use [ctrl]-x [ctrl]-q in emacs to change it to Read/Write).

Search the ontologies to find the GO number(s) for items that accurately describe the gene of interest. Be sure to pick at least one item from each of the three ontologies, even if you have to use "unknown" a lot. You can, of course, use as many GO nodes as you need, if your gene product is involved in several processes, is found in more than one location, etc.

Add one line to the gene_association.sgd file for each unique combination of gene product + GO node + reference + evidence code, using the following format:

DB [tab] gene UID [tab] gene symbol [tab] NOT [tab] GO ID [tab] ref [tab] evidence [tab] with [ID] [tab] aspect [tab] gene name [tab] synonym(|synonym)\n

What all the gibberish means:

*Use tabs to separate fields. Don't actually type "[tab]," just hit the "tab" key!
*DB is the database that contributed this line.
We use "SGD"; other collaborators (for now) are FB (FlyBase) and MGI (Mouse Genome Informatics). Required.
*gene UID is the unique identifier used by the database for the gene. We use the SGDID. Required.
*gene symbol is what we yeast-oids usually call "gene name" (ACT1, YHL035C, etc.; use the standard name if there is one). Required.
*NOT means, well, "not." Used for cases where the literature explicitly states that the GO term is not suitable for the gene product (e.g. Yfg7p is not in the vacuole). Not required, and probably won't be used very often. Don't forget to enter a blank column for any line that doesn't say "not."
*GO ID is the GO number for the node (duh). Don't forget the "GO:" prefix. Required.
*reference is the reference for the assignment of a GO node to a gene product. It's required. Put only one reference per line; if there's more than one reference, use a separate line (see examples). Published papers are much preferred, but we can cite "SGD said so" or "YPD said so" if we find information in a database but can't track down a paper. Use the unique SGD identifier (ace paper object name now, reference ID after the shift to Oracle), with the prefix 'SGD:'. Some useful ace paper objects:

Stryer 3rd edition                                         strye_1988_rcavm
Stryer 4th edition                                         strye_1995_rcbsz
Alberts et al. 3rd edition                                 alber_1994_rcavn
Oxford Dictionary of Biochemistry and Molecular Biology    smith_1997_rcavl
found in SGD                                               cherr_2000_rcbtn
found in YPD                                               costa_2000_rcbva
found in SwissProt                                         bairo_2000_rcbvb

*evidence is what kind of data support the claim that the gene product has the function (process, cellular component). Put only one evidence code per line; if there's more than one kind of evidence for the association, even in a single paper, use a separate line for each one (see examples). Choose from the list of codes. Required.
IMP    inferred from mutant phenotype
IGI    inferred from genetic interaction
IPI    inferred from physical interaction
ISS    inferred from sequence similarity (this will also cover structural similarity)
IDA    inferred from direct assay
IEP    inferred from expression pattern
IEA    inferred from electronic annotation (used for wholly or partially automated assignments, which means SGD will use it seldom or never)
TAS    traceable author statement (use for information that seems quite reliable even though the actual evidence isn't right there; examples include reviews and textbooks) (formerly ASS, "author said so")
NAS    non-traceable author statement (use for database references or claims in papers that don't meet "TAS" criteria) (formerly NA, "not available")

There is another code, NR ("not recorded"), which appears for annotations done before the SGD curators began tracking evidence types. It should not be used for new annotations--use TAS or NAS. More information on the fine points of evidence code usage is available--see the GO.evidence document.

*aspect is which ontology the term came from. It's optional, because each GO number "knows" its ontology, but it's highly recommended as internal quality control. Choose from the list:

F    function ontology
P    process ontology
C    cellular component ontology

*gene name is the full, spelled out gene name (e.g. "sonic hedgehog," "couch potato"), which yeast genes don't have. It's optional, and white space is allowed.
*synonym is what we call an alias--another name for the gene. Optional; white space allowed; separate multiple aliases with pipes.

Include a carriage return (a.k.a. newline, \n) at the end of each line.
Include an empty column (a tab) for any optional field that you leave empty!
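As a rough illustration of the field rules above, here is a minimal Python sketch that assembles one gene_association.sgd line and checks the required fields. It is not part of any SGD script: the function name format_association and its parameter names are invented for this example. The "with" column appears in the format line but is not described in this guide, so the sketch leaves it empty unless a value is passed. Only the nine current evidence codes are accepted; NR and the older code names are rejected.

```python
# Sketch only: field order follows the format line in this guide.
# All names here (format_association, negate, with_id, ...) are hypothetical.

EVIDENCE_CODES = {"IMP", "IGI", "IPI", "ISS", "IDA", "IEP", "IEA", "TAS", "NAS"}
ASPECTS = {"F", "P", "C"}


def format_association(db, gene_uid, symbol, go_id, ref, evidence,
                       aspect="", negate=False, with_id="",
                       gene_name="", synonyms=()):
    """Return one tab-delimited annotation line, ending in a newline."""
    required = {"DB": db, "gene UID": gene_uid, "gene symbol": symbol,
                "GO ID": go_id, "reference": ref, "evidence": evidence}
    for name, value in required.items():
        if not value:
            raise ValueError(f"{name} is required")
    if not (go_id.startswith("GO:") and go_id[3:].isdigit()):
        raise ValueError("GO ID must look like GO:0006281")
    if evidence not in EVIDENCE_CODES:
        raise ValueError(f"unknown evidence code: {evidence}")
    if aspect and aspect not in ASPECTS:
        raise ValueError(f"aspect must be F, P, or C, not {aspect}")

    fields = [db, gene_uid, symbol,
              "NOT" if negate else "",   # empty column when not negated
              go_id, ref, evidence,
              with_id,                   # undocumented here; usually empty
              aspect, gene_name,
              "|".join(synonyms)]        # aliases separated by pipes
    if any("\t" in f or "\n" in f for f in fields):
        raise ValueError("fields may not contain tabs or newlines")
    # Empty optional fields still get their tab, so every line has 11 columns.
    return "\t".join(fields) + "\n"
```

Calling format_association("SGD", "S0001207", "DNA2", "GO:0006281", "SGD:formo_1999_rbofx", "IMP", aspect="P", synonyms=["WEB2", "YHR164C"]) yields a line with eleven columns, three of them empty (NOT, with, and gene name). An annotation backed by two kinds of evidence would need two calls, one per evidence code.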
Examples:

SGD S0002260 CDC2 GO:0005658 SGD:burge_1998_rbmmw NA C
SGD S0002260 CDC2 GO:0003891 SGD:burge_1998_rbmmw NA F
SGD S0002260 CDC2 GO:0006272 SGD:burge_1998_rbmmw NA P
SGD S0004295 ACO1 GO:0005759 SGD:mcali_1997_rblks NA C glu1|YLR304C
SGD S0004295 ACO1 GO:0005759 SGD:alber_1994_rcavn NA C glu1|YLR304C
SGD S0001207 DNA2 GO:0006281 SGD:formo_1999_rbofx IMP P WEB2|YHR164C
SGD S0001207 DNA2 GO:0006281 SGD:formo_1999_rbofx IGI P WEB2|YHR164C
SGD S0001207 DNA2 GO:0006262 SGD:budd__1995_aaxoz IMP P WEB2|YHR164C
SGD S0001207 DNA2 GO:0006262 SGD:budd__1995_aaxoz IDA P WEB2|YHR164C
SGD S0001207 DNA2 GO:0006262 SGD:bragu_1998_rbloh IMP P WEB2|YHR164C
SGD S0001207 DNA2 GO:0006262 SGD:bragu_1998_rbloh IDA P WEB2|YHR164C

(Note: CDC2 also has several aliases; I've left them out just to make the point that "synonym" is an optional field.)

Getting the files: a very tiny bit on CVS

We're using CVS to maintain and edit the ontology files and associated files such as GO numbers that we've added and the files containing our annotations. Here's the minimum you need; for more information on CVS, there's an email that Mike sent to the GO mailing list (note that a lot of the stuff in the email only applies to folks coming from outside our firewall).

To check out files, first add this line to your .cshrc file (obviously, you only have to do this once):

    setenv CVSROOT /share/go/cvs

Then open a window on one of the DEC machines (alberich, fafner, or fasolt). Stay in your home directory, or cd to any other directory where you may want CVS to create the go/[whatever] directory. Type in:

    cvs co [directory name]

The directories you need are go/ontology and go/SGD_GO_files. go/ontology has three files: function.ontology, process.ontology, and component.ontology (formerly called compartment.ontology). go/SGD_GO_files has gene_names and gene_associations. If you want to edit any of the ontologies, you'll also need the go/numbers directory.
CVS will create a directory called go/ that has the subdirectories you check out. You only really have to check out files once, although you can always re-check out a file (good to know in case you do something horrendous to your copy).

To update a file in any of your go/ subdirectories, cd to the directory and give the command:

    cvs update [filename]

If there has been a change to the repository copy (i.e. someone else worked on the file and committed the changes), you'll get the message "U [filename]." If there are no differences, nothing happens. In either case, you're now ready to go on your merry way and use the file.

To commit your changes to the CVS repository (the "master copy"), cd to the directory containing your copy of the file, and give this command:

    cvs ci [filename]

An editor window will appear so that you can type in notes/comments/etc. explaining what you changed and why. This isn't such a big deal for annotations, since they're pretty much self-explanatory, but always use it when you edit the ontologies themselves!!! Save your notes and quit the editor, and you'll get a message saying that you've checked in the file, with old and new version numbers.

Text version 2000-03-03 MAH
Updated 2000-09-12 MAH