!version: $Revision: 1.8 $
!date: $Date: 2000/11/08 17:09:39 $
!
!Gene Ontology
!sgd annotation guide
!
This document is a set of instructions for SGD curators on how to
annotate yeast genes to GO nodes. Some things in it are
yeast-specific, and the brief introduction to cvs is specific to users
inside the Stanford Genetics department.

The 2000-09-12 update includes more SGD-specific information than the
previous version. SGD now has a web-based interface for annotation,
and a script that generates the gene_association.sgd file from SGD's
Oracle database.

Using the Gene Ontology (GO)

Contents

Web interface for annotation
Browsing in the ontologies 
Annotation format
Getting the files: a very tiny bit on CVS    

Web interface for annotation

First, log into Oracle from the curator login page. Click on GO
Info. Enter the locus name or locus_no, or choose the feature option
from the pull-down and enter the feature name or feature_no for the
item you're annotating. To retrieve locus_no or feature_no using SQL,
see Local Oracle Resources and, in particular, the links under Getting
Started for a basic introduction to SQL.


Search the ontologies to find the GO number(s) for items that
accurately describe the gene of interest. Be sure to pick at least one
item from each of the three ontologies, even if you have to use
"unknown" a lot. You can, of course, use as many GO nodes as you need,
if your gene product is involved in several processes, is found in
more than one location, etc. Note: if there is more than one GO number
attached to a node, use only the first one listed.


To annotate a gene that hasn't been annotated yet, or to add to
existing annotations, fill out the form, using one row per combination
of GO node + reference.

 -Choose an ontology from the pull-down menu. The processing script
  will squawk if the GO number and ontology don't match.

 -Enter the GO ID number in the box provided; you don't have to
  include the leading zeros. Note that if you've just added a new GO
  item, you will have to wait until the GO loading script runs (in the
  middle of the night) before you can use the new GO number for
  annotations.

 -Enter one reference ID in the Ref box (ace paper object as of
  2000-09-12). If there's more than one reference for an annotation, use
  a separate row for each one.

 -Choose the codes for the types of evidence that the reference
  contains to support the association. You can choose more than one
  code.

 -Fill in the names of loci or features that have the same GO ID
  number, with the same reference and supporting evidence. There's one
  box for locus names, and another for feature names.
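
If you keep GO numbers in your own notes or scripts, it can help to put
them in one consistent form first. Here is a minimal Python sketch,
assuming the seven-digit, "GO:"-prefixed form that appears in
gene_association.sgd. normalize_go_id is a hypothetical helper for
illustration only; it is not part of the web interface or the loading
script.

```python
def normalize_go_id(raw, width=7):
    """Return a GO ID such as 'GO:0005658' from '5658', 5658 or 'GO:5658'.

    The width of 7 digits is an assumption based on the GO IDs shown
    in gene_association.sgd.
    """
    text = str(raw).strip()
    # Drop an existing "GO:" prefix so that only the number is left.
    if text.upper().startswith("GO:"):
        text = text[3:]
    if not text.isdigit():
        raise ValueError("not a GO number: %r" % (raw,))
    # Put the leading zeros back and add the prefix.
    return "GO:" + text.zfill(width)
```

For example, normalize_go_id("5658") and normalize_go_id("GO:0005658")
both give the same string.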


Continue filling in rows until you've made all the annotations, then
click "submit." The Oracle database will be updated right away, and
the new annotations will be included in the gene_association.sgd when
the script runs at night.

If the gene has already been annotated, existing annotations will show
up as rows that are already filled out. Loci and features that have
the same annotation (i.e. the same GO number, same reference, and same
evidence code(s) assigned) are included. Scroll down to find empty
rows to add annotations. To delete an existing annotation, click the
"delete this annotation" button. To change an annotation, delete the
old one and replace it with a new one.

Browsing in the ontologies

The easiest way to find the GO numbers that apply to your gene of
interest is to use Mira's browser. The GO curation interface has a
link to the browser, which will open a new window showing the
entry page for Mira's program, with the pathname for the copies of the
ontologies in the ftp directory. If you have checked out the ontology
files from the CVS repository, you can substitute the pathname of the
directory that contains your copies of the ontology files; remember to
include a slash at the end.

Clicking "Submit Query" takes you to a set of frames so that you can
search three files at once. You can use the default selections
(component on top, function in the middle, and process on the bottom),
or use the selector to change to another file in the same
directory. Type in a string and click "Submit Query;" you can use text
or a GO number. The search automatically works as though it's using
wildcards (i.e. you can type in all or part of a word, phrase, number,
etc.) and supports Boolean AND and NOT operators.

Examples:

mitochon matches GO terms containing "mitochondrion," "mitochondria,"
"mitochondrial"

Using Boolean AND: type & collagen finds GO terms containing both
"type" and "collagen"

Using Boolean NOT: collagen & ~V finds GO terms containing "collagen"
but not "V"
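
The search rules above can be sketched in Python. This is only an
illustration of the behavior described here, not Mira's code: the
function name matches is made up, the term strings in the usage note
are made-up examples rather than lines from the ontology files, and
the case-sensitive comparison is an assumption.

```python
def matches(query, term):
    """Return True if `term` satisfies a query such as 'collagen & ~V'.

    Pieces are separated by '&' (Boolean AND). A piece that starts
    with '~' (Boolean NOT) must be absent from the term. Every piece
    is treated as a substring, which gives the wildcard-like behavior.
    Assumption: matching is case-sensitive.
    """
    for piece in query.split("&"):
        piece = piece.strip()
        if not piece:
            continue
        if piece.startswith("~"):
            # NOT: reject the term if the string is present.
            if piece[1:].strip() in term:
                return False
        elif piece not in term:
            # AND: every plain piece must be present.
            return False
    return True
```

With this sketch, matches("mitochon", "mitochondrial matrix") is true,
matches("type & collagen", "type I interferon") is false, and
matches("collagen & ~V", "collagen type V") is false.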


Annotation format

  GO annotations for yeast genes are stashed in
go/SGD_GO_files/gene_association.sgd, which is one of many GO-related
files in our CVS repository. See below for info on how to work with
files in CVS. (Note: the gene_association.sgd file may be Read-only
when checked out.  Use [ctrl]-x [ctrl]-q in emacs to change it to
Read/Write).

  Search the ontologies to find the GO number(s) for items that
accurately describe the gene of interest. Be sure to pick at least one
item from each of the three ontologies, even if you have to use
"unknown" a lot. You can, of course, use as many GO nodes as you need,
if your gene product is involved in several processes, is found in
more than one location, etc.

Add one line to the gene_association.sgd file for each unique
combination of gene product + GO node + reference + evidence code,
using the following format:

DB [tab] gene UID [tab] gene symbol [tab] NOT [tab] GO ID  [tab] ref [tab] evidence [tab] with [ID] [tab] aspect [tab] gene name [tab] synonym(|synonym)\n

What all the gibberish means:

*Use tabs to separate fields. Don't actually type "[tab]," just hit the
"tab" key!

*DB is the database that contributed this line. We use "SGD;" other
collaborators (for now) are FB (FlyBase) and MGI (Mouse Genome
Informatics). Required.

*gene UID is the unique identifier used by the database for the
gene. We use the SGDID. Required.

*gene symbol is what we yeast-oids usually call "gene name" (ACT1,
YHL035C, etc.; use the standard name if there is one). Required.

*NOT means, well, "not." Used for cases where the literature
explicitly states that the GO term is not suitable for the gene
product (e.g. Yfg7p is not in the vacuole). Not required, and probably
won't be used very often. Don't forget to enter a blank column for any
line that doesn't say "not."

*GO ID is the GO number for the node (duh). Don't forget the "GO:"
prefix. Required.

*reference is the reference for the assignment of a GO node to a gene
product. It's required. Put only one reference per line; if there's
more than one reference, use a separate line (see examples). Published
papers are much preferred, but we can cite "SGD said so" or "YPD said
so" if we find information in a database but can't track down a
paper. Use the unique SGD identifier (ace paper object name now,
reference ID after the shift to Oracle), with the prefix 'SGD:'. Some
useful ace paper objects:
 
  Stryer 3rd edition  strye_1988_rcavm
  Stryer 4th edition  strye_1995_rcbsz
  Alberts et al. 3rd edition  alber_1994_rcavn
  Oxford Dictionary of Biochemistry and Molecular Biology  smith_1997_rcavl
  found in SGD  cherr_2000_rcbtn
  found in YPD  costa_2000_rcbva
  found in SwissProt  bairo_2000_rcbvb
 
*evidence is what kind of data support the claim that the gene product
has the function (process, cellular component). Put only one evidence
code per line; if there's more than one kind of evidence for the
association, even in a single paper, use a separate line for each one
(see examples). Choose from the list of codes. Required.
 
  IMP inferred from mutant phenotype
  IGI inferred from genetic interaction
  IPI inferred from physical interaction
  ISS inferred from sequence similarity (this will also cover structural similarity)
  IDA inferred from direct assay
  IEP inferred from expression pattern
  IEA inferred from electronic annotation (used for wholly or partially automated assignments, which means SGD will use it seldom or never)
  TAS traceable author statement (use for information that seems quite reliable even though the actual evidence isn't right there; examples include reviews and textbooks) (formerly ASS, "author said so")
  NAS non-traceable author statement (use for database references or claims in papers that don't meet "TAS" criteria) (formerly NA, "not available")
 
There is another code, NR ("not recorded"), which appears for
annotations done before the SGD curators began tracking evidence
types. It should not be used for new annotations--use TAS or NAS. More
information on the fine points of evidence code usage is
available--see the GO.evidence document.

*aspect is which ontology the term came from. It's optional, because
each GO number "knows" its ontology, but it's highly recommended as
internal quality control. Choose from the list:
 
  F function ontology
  P process ontology
  C cellular component ontology
 
*gene name is the full, spelled out gene name (e.g. "sonic hedgehog,"
"couch potato"), which yeast genes don't have. It's optional, and
white space is allowed.

*synonym is what we call an alias--another name for the gene.
Optional; white space allowed; separate multiple aliases with pipes.

Include a carriage return (a.k.a. newline, \n) at the end of each
line. Include an empty column (a tab) for any optional field that you
leave empty!
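
If you would rather generate lines than type the tabs by hand, here is
a minimal Python sketch that follows the eleven-column format line
given above. The function and argument names are made up for
illustration and are not part of any SGD script. The "with" column is
not described in this guide, so the sketch simply leaves it blank
unless you pass a value.

```python
# Evidence codes allowed for new annotations (NR is deliberately left out).
EVIDENCE_CODES = {"IMP", "IGI", "IPI", "ISS", "IDA", "IEP", "IEA", "TAS", "NAS"}
ASPECTS = {"F", "P", "C"}


def association_line(gene_uid, symbol, go_id, ref, evidence, aspect="",
                     negate=False, with_id="", gene_name="", synonyms=(),
                     db="SGD"):
    """Build one tab-separated annotation line, ending in a newline."""
    if evidence not in EVIDENCE_CODES:
        raise ValueError("unknown evidence code: %r" % (evidence,))
    if aspect and aspect not in ASPECTS:
        raise ValueError("aspect must be F, P or C: %r" % (aspect,))
    fields = [
        db,
        gene_uid,
        symbol,
        "NOT" if negate else "",   # empty column when the line isn't a NOT
        go_id,
        ref,
        evidence,
        with_id,                   # assumption: blank unless supplied
        aspect,
        gene_name,
        "|".join(synonyms),        # multiple aliases separated by pipes
    ]
    # Empty optional fields still get their tab, so every line has
    # the same number of columns.
    return "\t".join(fields) + "\n"
```

A call such as association_line("S0001207", "DNA2", "GO:0006281",
"SGD:formo_1999_rbofx", "IMP", aspect="P", synonyms=["WEB2",
"YHR164C"]) gives one line with eleven columns, four of them empty.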

Examples:

SGD	S0002260	CDC2	GO:0005658	SGD:burge_1998_rbmmw	NA	C		
SGD	S0002260	CDC2	GO:0003891	SGD:burge_1998_rbmmw	NA	F		
SGD	S0002260	CDC2	GO:0006272	SGD:burge_1998_rbmmw	NA	P		
SGD	S0004295	ACO1	GO:0005759	SGD:mcali_1997_rblks	NA	C		glu1|YLR304C
SGD	S0004295	ACO1	GO:0005759	SGD:alber_1994_rcavn	NA	C		glu1|YLR304C
SGD	S0001207	DNA2	GO:0006281	SGD:formo_1999_rbofx	IMP	P		WEB2|YHR164C
SGD	S0001207	DNA2	GO:0006281	SGD:formo_1999_rbofx	IGI	P		WEB2|YHR164C
SGD	S0001207	DNA2	GO:0006262	SGD:budd__1995_aaxoz	IMP	P		WEB2|YHR164C
SGD	S0001207	DNA2	GO:0006262	SGD:budd__1995_aaxoz	IDA	P		WEB2|YHR164C
SGD	S0001207	DNA2	GO:0006262	SGD:bragu_1998_rbloh	IMP	P		WEB2|YHR164C
SGD	S0001207	DNA2	GO:0006262	SGD:bragu_1998_rbloh	IDA	P		WEB2|YHR164C


(Note: CDC2 also has several aliases; I've left them out just to make
the point that "synonym" is an optional field.)
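
The DNA2 lines show the one-line-per-combination rule at work: each
reference/evidence pair for a GO node gets its own line. A small
Python sketch of that expansion follows; expand_annotations is a
hypothetical helper, and the input layout (a dict from reference to a
list of evidence codes) is just one convenient assumption.

```python
def expand_annotations(go_id, evidence_by_ref):
    """Yield (go_id, ref, evidence) once per reference/evidence pair."""
    for ref, codes in evidence_by_ref.items():
        for code in codes:
            yield (go_id, ref, code)


# The GO:0006262 annotations for DNA2 from the examples above.
rows = list(expand_annotations("GO:0006262", {
    "SGD:budd__1995_aaxoz": ["IMP", "IDA"],
    "SGD:bragu_1998_rbloh": ["IMP", "IDA"],
}))
```

Here rows holds four tuples, one for each of the four GO:0006262
lines in the examples.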


Getting the files: a very tiny bit on CVS

We're using CVS to maintain and edit the ontology files and associated
files such as GO numbers that we've added and the files containing our
annotations.

Here's the minimum you need; for more information on CVS, there's an
email that Mike sent to the GO mailing list (note that a lot of the
stuff in the email only applies to folks coming from outside our
firewall).


To check out files, first add this line to your .cshrc file
(obviously, you only have to do this once):

setenv CVSROOT /share/go/cvs


Then open a window on one of the DEC machines (alberich, fafner, or
fasolt). Stay in your home directory, or cd to any other directory
where you may want CVS to create the go/[whatever] directory. Type in:

cvs co [directory name]

The directories you need are go/ontology and
go/SGD_GO_files. go/ontology has three files: function.ontology,
process.ontology, and component.ontology (formerly called
compartment.ontology). go/SGD_GO_files has gene_names and
gene_associations. If you want to edit any of the ontologies, you'll
also need the go/numbers directory. CVS will create a directory called
go/ that has the subdirectories you check out. You only really have to
check out files once, although you can always re-check out a file
(good to know in case you do something horrendous to your copy).
 
To update a file in any of your go/ subdirectories, cd to the
directory and give the command:

cvs update [filename]

If there has been a change to the repository copy (i.e. someone else
worked on the file and committed the changes), you'll get the message
"U [filename]." If there are no differences, nothing happens. In
either case, you're now ready to go on your merry way and use the
file.

To commit your changes to the CVS repository (the "master copy"), cd to
the directory containing your copy of the file, and give this command:

cvs ci [filename]

An editor window will appear so that you can type in
notes/comments/etc explaining what you changed and why. This isn't
such a big deal for annotations, since they're pretty much
self-explanatory, but always use it when you edit the ontologies
themselves!!! Save and quit the file, and you'll get a message saying
that you've checked in the file, with old and new version numbers.

Text version 2000-03-03 MAH
Updated 2000-09-12 MAH