Minutes of the GO Project Meeting

The GO Meeting was held 17-19 May 1999 at the Banbury campus of CSH. In
attendance were:
FlyBase (Michael Ashburner, Suzi Lewis)
SGD (Mike Cherry, Midori Harris)
MGI (Janan Eppig, Judy Blake, Joel Richardson, Martin Ringwald, Allan Davis)

A Summary of the Meeting:
The outlined agenda for the meeting was:
1. semantics
2. species specificity
3. content
4. implementation
5. software
6. resources

1.  The GO Project is recognized as a shared, pragmatic database resource
involving three separate ontologies (Gene Function, Process, Cellular
Component)  that represent independent structured sets of terms for
performing biological queries across the genomic databases of different species.
It is not a definitive phylogenetic classification system of biology.  The
current GO Project is composed of three Organism Databases: FlyBase
(Drosophila), SGD (yeast), and MGD (mouse).  It is hoped that additional
organism databases may subsequently join.

Each Organism Database will annotate its genes to the three ontological
categories and deposit the results in a Universal GO browser (currently
being constructed by Suzi).  Concurrently, each Organism Database is free
to use its own annotations in whatever way it sees fit at its own web
site.

The three database (DB) groups are anxious to start implementing the
project.  It is felt that once people actually start 'getting their hands
dirty' by annotating and testing out the system, we will recognize
potential pitfalls and successes and will see areas of the ontologies that
need to be developed further.

2.  The system will be structured as:  a Master GO List with assigned GO#
identifiers.  This Master List will be made available to all DB.  Any
curator can add to/modify this list (see below).  All additions or
modifications will pass through a Central Manager (a biology-trained
individual) who will then update the Master List accordingly.  The Master
List is anticipated to go through a period of flux during the initial
phases and when any new organism joins the group.  It will be the
responsibility of each DB to regularly check in with the Master List and
read all communications between DBs so that everyone is 'on the same page'.

3.  It was recognized that this project is a WORK IN PROGRESS: meaning, it
is a dynamic system where things will be added and changed as curators see
problems and concerns.  With subsequent biological knowledge, a
re-organization of the hierarchies may even become necessary.  Provisions
for this re-organization will be made by the following:  each DB group will
initially be assigned a block of 'free' GO identification numbers (GO#)
which can be used when adding terms to, or modifying terms in, the GO list.
(New terms should use unique words/phrases).  Any such modification will be
simultaneously submitted to a Central Manager and representative of each DB
via email/XML working from the current GO List.  We anticipate that these
initial modifications will be at lower levels of the hierarchy and not
include any major re-organization that could directly and immediately
affect the annotations of other DBs.  If a change would have that effect, the
curator is responsible for FIRST contacting the Central Manager for
inquiry, discussion, and approval PRIOR to initiating or submitting any
changes to the GO list.  Since a Central Manager has yet to be hired, it is
prudent for curators not to initiate any major renovations to the GO List
for quite some time.
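
As a rough illustration of the block-allocation idea in item 3, the sketch
below gives each DB a disjoint range of GO numbers and draws new identifiers
from it.  The block size, the starting number, and the names IdBlock and
assign_blocks are assumptions made for this example only; the meeting did
not fix any of them.

```python
class IdBlock:
    """A contiguous range of GO numbers reserved for one DB (hypothetical)."""

    def __init__(self, db, start, size):
        self.db = db
        self.next_id = start
        self.end = start + size  # exclusive upper bound of the block

    def allocate(self):
        """Return the next unused GO number, or fail when the block is used up."""
        if self.next_id >= self.end:
            raise RuntimeError(
                f"{self.db} has used up its block; ask the Central Manager for more"
            )
        go_number = self.next_id
        self.next_id += 1
        return go_number


def assign_blocks(dbs, first_free, size):
    """Give each DB its own block so that no two DBs can issue the same GO#."""
    return {
        db: IdBlock(db, first_free + i * size, size)
        for i, db in enumerate(dbs)
    }


# Assumed numbers: blocks of 1000 identifiers, starting at 10000.
blocks = assign_blocks(["FlyBase", "SGD", "MGD"], first_free=10000, size=1000)
fly_id = blocks["FlyBase"].allocate()
sgd_id = blocks["SGD"].allocate()
```

Because the ranges never overlap, curators at different DBs can add terms
without first coordinating on the numbers themselves; only the content of
the change still needs to go to the Central Manager.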

4.  Common Sense & Individual Responsibility must rule here: that is, each
group is responsible for reading the emails submitted by each DB; each
curator should check the modified GO list regularly; and a curator who
anticipates spending a long time modifying some structure of the GO List
should alert all other members and then work as quickly as possible to
complete the changes.  Effective communication is key to the success of
this project.

5.  Each DB will annotate their genes/gene products to any 'depth' they see
fit.  This should include annotating genes to complexes and not subunits of
those complexes, since the cross-species homology will not necessarily hold
true to such detailed levels.  Once the practice is put in motion, curators
will develop a better feel for how 'deep' they should be annotating based
upon the work of other curators from other DB (see below).

6.  Currently, the GO List records some species-specific GO terms.  With
subsequent biological knowledge, it may be found that these are no longer
species-specific but shared amongst different organisms, thus requiring a
name change.  Currently species-specific terms will only be updated when it
is shown that the term is not species specific. If a species does not have
a gene annotated to a specific GO term, then either it is not present in
the species or it hasn't been found yet.

7.  Structures will have to inherently change with the discovery of new
genes that do not necessarily follow the currently established hierarchies.
Curators are to be aware of the "True Path (Judy's) Rule": the pathway up
the hierarchy must always be true.  If a new gene is found to break this
rule or species-specificity becomes a problem, the hierarchy should be
restructured by adding nodes and connecting terms so that every upward
path is true again.  When a term is added to the Master GO List, the
curator needs to add all of the parents and children of the new term.  A
suggestion was made to simplify the hierarchy by throwing out the "part
of" GO terms and keeping only the "is a" GO terms.  After discussion, it
was realized that too much information would be lost by eliminating those
terms.  The GO List will maintain both "is a" and "part of" terms.
	The example used to work through this effort was that of the
Process Ontology for the gene product 'chitin'. Chitin metabolism is a part
of cuticle synthesis in the fly, and part of cell wall organization in
yeast. As a result of the above discussion, the parent 'chitin metabolism'
will now have daughters 'cuticle chitin metabolism' and 'cell wall chitin
metabolism', with the appropriate catabolism and synthesis terms underneath
them.

	chitin metabolism
		chitin biosynthesis
		chitin catabolism
		cuticle chitin metabolism
			cuticle chitin biosynthesis
			cuticle chitin catabolism
		cell wall chitin metabolism
			cell wall chitin biosynthesis
			cell wall chitin catabolism

The procedure to add terms is made particularly difficult because the
Process Ontology is a DAG, and given the current state of knowledge, it is
volatile.  We need an automated procedure to add all the arcs when
necessary to expand the structure.

A tool is needed that will
	1. not allow bad paths (solution: add extra nodes)
	2. let curators see all paths
	3. automate a split, given a curator's decision
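
To make the "see all paths" requirement concrete, here is a minimal sketch
that lists every upward path from a term to a root of the DAG, so a curator
could inspect each one against the True Path Rule.  The PARENTS table is a
small fragment of the chitin example; the relation labels on each arc, the
'cuticle synthesis' parent arc, and the function name paths_to_roots are
assumptions made for illustration, not part of the agreed GO List.

```python
# child term -> list of (relation, parent term); relation labels are assumed
PARENTS = {
    "chitin metabolism": [],
    "cuticle synthesis": [],
    "cuticle chitin metabolism": [
        ("is a", "chitin metabolism"),
        ("part of", "cuticle synthesis"),  # assumed arc, inferred from the text
    ],
    "cuticle chitin biosynthesis": [("is a", "cuticle chitin metabolism")],
}


def paths_to_roots(term, parents=PARENTS):
    """Return every upward path from term to a root, as lists of term names."""
    arcs = parents[term]
    if not arcs:
        return [[term]]  # a term with no parents is a root
    paths = []
    for _relation, parent in arcs:
        for path in paths_to_roots(parent, parents):
            paths.append([term] + path)
    return paths


all_paths = paths_to_roots("cuticle chitin biosynthesis")
for path in all_paths:
    print(" -> ".join(path))
```

A term with two parents yields two paths, and each one has to be checked by
a curator; the code only enumerates the paths and does not itself decide
whether a path is biologically true.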

8.  A CVS file system will be used to process changes made to the GO list.
This will automatically record the curator's name and the date; the curator
must also provide a succinct reason for the change.  The CVS client/server
will be set up by Mike Cherry: accounts will be set up for each individual
curator for access and modifications.  Additional curators may join by
submitting a user name to Mike.

9.  An idea was proposed that the GO list hierarchy encode next to the GO
terms the number of genes specific to each organism DB listed underneath
that particular GO term.  For example:
		%DNA metabolism (M-3, D-9, Y-5)
			%DNA replication (M-1, D-4, Y-3)
				%DNA dependent DNA replication (M-2, D-5, Y-2)
Meaning that there are 3 mouse (M) genes, 9 Drosophila (D) genes, and 5
yeast (Y) genes under the entire category of %DNA metabolism, and that they
can subsequently be broken down further to each subordinate GO term for
finer resolution.  This should be helpful to curators in understanding to
which level of 'depth' other curators are annotating.
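
One possible way to produce such labels is sketched below, assuming that the
count beside a term covers genes annotated to that term or to any term
beneath it.  The gene names and their placement are made up, so the output
will not reproduce the numbers in the example above; CHILDREN, DIRECT,
genes_under and label are likewise hypothetical names.

```python
# parent term -> child terms (a fragment of the example hierarchy)
CHILDREN = {
    "DNA metabolism": ["DNA replication"],
    "DNA replication": ["DNA dependent DNA replication"],
    "DNA dependent DNA replication": [],
}

# term -> {organism code -> genes annotated directly}; gene names are made up
DIRECT = {
    "DNA metabolism": {"M": {"m1"}, "D": set(), "Y": {"y1"}},
    "DNA replication": {"M": {"m2"}, "D": {"d1"}, "Y": set()},
    "DNA dependent DNA replication": {"M": {"m2", "m3"}, "D": {"d2"}, "Y": {"y2"}},
}


def genes_under(term):
    """Collect, per organism, the genes annotated to term or any descendant."""
    totals = {org: set(genes) for org, genes in DIRECT.get(term, {}).items()}
    for child in CHILDREN.get(term, []):
        for org, genes in genes_under(child).items():
            totals.setdefault(org, set()).update(genes)
    return totals


def label(term):
    """Format a term with its per-organism gene counts, as in the example."""
    counts = genes_under(term)
    inner = ", ".join(f"{org}-{len(counts.get(org, ()))}" for org in ("M", "D", "Y"))
    return f"%{term} ({inner})"


for term in CHILDREN:
    print(label(term))
```

Sets are used rather than running totals so that a gene annotated at two
levels (m2 here) is counted only once under the higher term.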

10.  EC#s should be kept in the GO term lists because they provide a
searchable technique for curators during annotation.

11.  The following ideas and agendas were proposed for goals within a 1-2
week (?) time frame from the end of this GO Meeting:

a) Stanford curators stop GO and send an updated file to Michael Ashburner.
b) Michael Ashburner parses v. 0.2a7 ---> v. 0.9
c) v. 0.9 ---> Suzi for syntax check ---> assign unique GO# to all terms
(v. 1.0); and parse into XML
d) CVS established/tested by Mike Cherry; Mike will also try to
register the web site www.gene.org if still available; if not, other
suggested names are www.genestogo.org or some combination of the words GO
and gene (GOgeneGO; GOgene; geneGO; etc.)
e) all GO again.

Independently, Suzi and Joel will determine necessary XML syntax and processes.
Between June 1999 and the next meeting, two stages will be implemented:

	STAGE 1:	CVS established
			initial curation: annotate as many genes as possible
			(FlyBase hopes to get 3000-4000 genes done; are other
			DB up to the challenge?)
			XML exports to Suzi established
			a Central Manager will be hired

	STAGE 2:	a working database
			something of real functionality for the public:
			"genes to GO"

12.  Making the database available to public users should coincide with a
descriptive (promotional) write-up of the GO Project in a widely circulated
genetics-oriented journal, such as Trends in Genetics.  Members of the
GO Project should be thinking of ideas for this paper and what they would
like to see in it.

13.  Janan will meet with Ken Fasman (Astra) and Lisa Brooks (NIH Program
Officer) at the JAX MGI Advisory Board meeting in early June 1999 and
discuss initiating a co-operative (MGI, SGD, FlyBase) grant for submission
in November 1999.  After this meeting, Janan will update other members of
the GO Project on the ideas for the grant via email correspondence.

14.  Resources: $200K from Astra '99 and a promise of $200K for '00.
Michael Ashburner will have the money deposited in an EBI U.S. account at
Cambridge Trust Company.

15.  The next GO Meeting is scheduled for 6-9 October 1999 to be hosted by
MGI in Bar Harbor, Maine.