This document specifies the syntax of Gene Product Annotation Data (GPAD) and Gene Product Information (GPI) formats. GPAD describes the relationships between biological entities (such as gene products) and biological descriptors (such as GO terms). GPI describes the biological entities.
This is an working draft, for comment by the community.
Comments should be sent to go-discuss AT geneontology.org
The Gene Ontology project provides annotations describing attributes of biological entities such as genes and gene products.
The Gene Ontology has historically provided annotations via Gene Association Format (GAF), including GAF-1 and GAF-2. Ontologies are distributed separately, using an OWL serialization or OBO format.
The use of GAF has some drawbacks:
GAF formats will continue to be supported, but the need for a way to represent genes/gene products separately from annotations, as well as the need to use the evidence ontology has lead to the creation of the GPAD (Gene Product Annotation Data) and GPI (Gene Product Information) formats, defined here.
Whilst GPAD and GPI have been defined for use within the Gene Ontology Consortium for GO annotation, this specification is designed to be reusable for analagous ontology-based annotation - for example, gene phenotype annotation.
TODO provide link ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
We first start with some preliminary definitions, including a description of the notation used in this specification.
The body of the document is split in two - the first part defines GPAD syntax, the second defined GPI syntax.
his document uses only a very simple form of UML class diagrams that are expected to be easily understandable by readers familiar with the basic concepts of object-oriented systems.
TODO - move UML elsewhere
GPAD and GPI document structures are defined using a standard BNF notation, which is summarized below.
Annotation*
[Qualifier]
GPI and GPAD documents consist of sequences of Unicode characters and are encoded in UTF-8 [RFC 3629].
Alpha_Char ::= a |b |c |d |e |f |g |h |i |j |k |l |m |n |o |p |q |r |s |t |u |v |w |x |y |z
| A |B |C |D |E |F |G |H |I |J |K |L |M |N |O |P |Q |R |S |T |U |V |W |X |Y |Z
Digit ::= 0 |1 |2 |3 |4 |5 |6 |7 |8 |9
Alphanumeric_Char ::= Alpha_Char | Digit
There is a single space character allowed
Space ::= ' '
An identifier consists of a prefix and a local identifier separated by a colon symbol
ID ::= Prefix ':' Local_ID
A Prefix must not contain special characters such as ’:’s
Prefix ::= Alphanumeric | '_' | '-'
A local identifier can consist of any non-whitespace character
Local_ID ::= (Any_ASCII_Character - ws)
An OBO ID is a type of identifier
OBO_ID ::= ID
OBO identifiers (which include GO identifiers) SHOULD follow the OBO identifier policy
References are also types of identifier
Reference ::= ID
The GO database registry contains a list of valid prefixes that can be used in GPAD or GPI files. Every identifier used in a GPAD or GPI file SHOULD have an entry in the registry.
The combination of prefix plus Local_ID (see previous section) describes how an identifier should be mapped to a URI.
Open-ended property-value pairs are allowed at different points throughout a document. The property symbol or “tag” is a shorthand way of specifying a URI that denotes the actual property used.
Property_Symbol ::= ID | Alphanumeric
Dates are written into what is equivalent to the date portion of ISO-8601, omitting hyphens:
YYYYMMDD ::= Year Month Day_of_month
Year ::= digit digit digit digit
Month ::= digit digit
Day_of_month ::= digit digit
Both months and days count from 1. E.g. Jan=1, first day of month=1.
A Date is equivalent to an xsd:date, and inherits the same semantics and constraints.
A GPAD document consists of a header followed by zero or more annotations
GPAD_Doc ::= GPAD_Header Annotation*
This is illustrated in the following UML diagram:
A header consists of an obligatory format version declaration followed by zero or more metadata lines:
GPAD_Header ::= '!gpa-version: 1.1' nl
GPAD_Header_Line*
Each metadata line starts with an exclamation mark ‘!’. One mark indicates a structured tag-value pair, two marks indicates free text.
GPAD_Header_Line ::=
'!' Property_Symbol ':' Space* Value nl |
'!!' (Char - nl)* nl
The list of allowed property symbols is open-ended and outside the scope of this specification. Different groups may decide on their own conventions. Examples include:
In this specification, an annotation is an association between a biological entity (such as a gene or gene product) and an ontology class (term). The association describes some aspect of that entity, and includes with metadata about the association, such as evidence and provenance.
Each annotation is on a separate line of tab separated values:
Annotation ::= Col_1 tab Col_2 tab ... Col_12 nl
Each of these columns has its own syntax, as specified below:
DB ::= Prefix
DB_Object_ID ::= Local_ID
Qualifiers ::= [Qualifier] ('|' Qualifier)*
Ontology_Class_ID ::= OBO_ID
References ::= Reference ('|' Reference)*
Evidence_type ::= OBO_ID
With_or_From ::= [ID] ('|' ID)*
Interacting_taxon_ID ::= [Taxon_ID]
Date ::= YYYYMMDD
Assigned_by ::= Prefix
Annotation_Extensions ::= [Extension_Conj] ('|' Extension_Conj)*
Annotation_Properties ::= [Property_Value_Pair] ('|' Property_Value_Pair)*
C1 and C2 combine to form a unique reference to the Entity being described. C4 and C11 (and optionally C8) combine to form a Description of a biological attribute. C3 is the relationship between this entity and the description (for example, “involved in”). The other columns combine to make metadata about the annotation; C5, C6 and C7 describe the evidence for the association plus its provenance.
Documentation on this column is available on the column 16 page (note that we identify columns by their position in GAF).
The value of this column is a pipe-separated list of zero or more conjunctive expressions:
Extension_Conj ::= [Relational_Expression] (',' Relational_Expression)*
Relation_Expression ::= Relation_Symbol '(' ID ')'
A property value pair uses an open-ended vocabulary of properties to association information with the annotation.
Property_Value_Pair ::= Property_Symbol '=' Property_Value
Property_Value ::= (AnyChar - ('=' | '|' | nl))
TODO define AnyChar such that escaping is allowed
Properties may include the name or ID of the curator who made the annotation. Recommendations on the property vocabulary will be provided separately.
TODO
A qualifier is either a logical modifier or a relation
Qualifier ::= Modifier | Relation_Symbol
The only modifier is logical negation
Modifier ::= 'NOT'
This specification does not mandate which relations are allowed. The full set of relations allowed are specified in a separate GO relations ontology. This ontology also specifies the domain and range constraints for these relations.
For GO annotations, all annotation QC rules apply.
Specifically
Detailed semantics will be provided in future via a mapping to OWL. For now, consult the main GO documentation for a description of the columns used here, and recommendations on how to use them. E.g.
See also the section on mapping to GAF-2 in this document.
GPI_Doc ::= GPI_Header Entity*
A header consists of an obligatory format version declaration followed by an obligator database declaration then zero or more lines starting with an exclamation point:
GPI_Header ::= '!gpi-version: 1.1' nl
'!namespace: ' Prefix nl
Header_Line*
Each metadata line starts with an exclamation mark ‘!’. One mark indicates a structured tag-value pair, two marks indicates free text.
GPI_Header_Line ::=
'!' Property_Symbol ':' Space* Value nl |
'!!' (Char - nl)* nl
TODO - decide whether to support multiple namespaces in one document (we are leaning towards allowing this for the mega-gpi use case - DECIDED)
A GP entity is any biological entity that can be annotated using GPAD
Each entity is written on a separate line of tab separated values:
Entity ::= Col_1 tab Col_2 tab ... Col_9 nl
Each of these columns has its own syntax, as specified below:
DB_Object_ID ::= Local_ID
DB_Object_Symbol ::= xxxx
DB_Object_Name ::= xxxx
DB_Object_Synonyms ::= [Label] ('|' Label)*
DB_Object_Type ::= Type_Symbol
DB_Object_Taxon ::= Taxon_ID
Parent_ObjectID ::= [ID??] ('|' ID)*
DB_Xrefs ::= [ID??] ('|' ID)*
Properties ::= [Property_Value] (',' Property_Value)*
The ID for the entity is formed as follows:
ID = CONCAT( Namespace ':' Local_ID)
TODO describe semantics of other columns
GAF is broadly speaking the relational join of GPAD and GPI. However, there are subtle differences that require additional operations when interconverting between the two.
The following mappings describe how GAF-2 columns map to GPAD/GPI
DB --> GPAD-c1 & GPI-Header
DB_Object_ID --> GPAD-c2 & GPI-c1
DB_Object_Symbol --> GPI-c3
Qualifiers --> GPAD-c3
(see notes below)Ontology_Class_ID --> GPAD-c4
References --> GPAD-c5
Evidence_code --> ECO(GPAD-c6)
With_or_From ::= GPAD-c7
Apsect --> derived from GPAD-c4
DB_Object_Name --> GPI-c3
DB_Object_Synonym --> GPI-c4
DB_Object_Type --> GPI-c5
Taxon --> GPI-c6 (| GPAD-c8)
Date --> GPAD-c9
Assigned_by ::= GPAD-c10
Annotation_Extension ::= GPAD-c11
Gene-Product-Form-ID ::= see notes
One crucial difference between GAF and GPAD/GPI is that in GAF, associations MUST be made to gene or canonical protein (which corresponds to a gene). If the intent is to describe some aspect a particular form (e.g. isoform) of that gene, then this is indicated in GAF-c17.
In GPAD, associations are made directly to entity being described, as closely as possible. The mapping must therefore include extra steps.
When converting from an annotation GPAD/GPI to GAF, first check the Parent field in the GPI for the corresponding entity. If there is no entry, proceed according to the table above. If there is an entity, then use this entity as if it were in GPAD-c1+c2, and place the original value of GPAD-c1+c2 in GAF-c17.
TODO this is still underspecified - discussed 2012-03-14, decided probably fine
The following permanent URL contains a mapping table that specifies how to generate a specific ECO class ID from (1) a GAF code and (2) a GO_REF used in the With column of a GAF:
http://purl.obolibrary.org/obo/eco/gaf-eco-mapping.txt
GAF has a qualifier column (GAF-c4) which may contain the NOT modifier, or the contributes_to or colocalized_with relations. There is no way to indicate any other relationship type.
When translating, if no relation is specified, then the follow defaults are chosen:
If a NOT modifer is present, then this is also included (separated by ‘|’).
If the following relations are present in the GAF, they are used in the GPAD document as-is:
See http://www.geneontology.org/GO.annotation.conventions.shtml
TODO
GO