% GPAD and GPI Tabular Formats version 1.2 % Gene Ontology Consortium % Revision: $Rev: 7306 $ ### Abstract This document specifies the syntax of Gene Product Annotation Data (GPAD) and Gene Product Information (GPI) formats. GPAD describes the relationships between biological entities (such as gene products) and biological descriptors (such as GO terms). GPI describes the biological entities. ### Status This is an working draft, for comment by the community. * Date: $Date: 2013-03-22 15:21:43 -0700 (Fri, 22 Mar 2013) $ Comments should be sent to go-discuss AT geneontology.org ## Changes in GPAD and GPI relative to 1.0 - add column for db in GPI, do not use the header - allow a relation in the GPAD interacts_with_taxon (list), relation(taxon) -> DavidOS: interacts_with_other_organism (RO id?) - GPAD Standard set of properties: annotation_id ("id"), "curator_name" (DC_Author), "go_evidence" (shorthand), comment - GPI properties: DB_Subset (swissprot vs tremble), Annotation_Complete (Date), slim/subset type of thing - JSON: properties is an array with objects - file names: *.gpa (also accepted *.gpad) and *.gpi ## Introduction ### Background The [Gene Ontology project](http://geneontology.org) provides annotations describing attributes of biological entities such as genes and gene products. The Gene Ontology has historically provided annotations via Gene Association Format (GAF), including [GAF-1](http://www.geneontology.org/GO.format.gaf-1_0.shtml) and [GAF-2](http://www.geneontology.org/GO.format.gaf-2_0.shtml). Ontologies are distributed separately, using an OWL serialization or OBO format. The use of GAF has some drawbacks: * Combined representation of gene/gene product data and annotations leads to redundancy/repetition * No way to represent gene/gene product metadata for unannotated genes * Requirement to maintain backward compatibility makes it harder to introduce enhancements such as use of an ontology for evidence types GAF formats will continue to be supported, but the need for a way to represent genes/gene products separately from annotations, as well as the need to use the evidence ontology has lead to the creation of the GPAD (Gene Product Annotation Data) and GPI (Gene Product Information) formats, defined here. Whilst GPAD and GPI have been defined for use within the Gene Ontology Consortium for GO annotation, this specification is designed to be reusable for analagous ontology-based annotation - for example, gene phenotype annotation. *TODO* provide link ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ ### Outline We first start with some preliminary definitions, including a description of the notation used in this specification. The body of the document is split in two - the first part defines GPAD syntax, the second defined GPI syntax. ## Preliminary Definitions ### UML Notation his document uses only a very simple form of UML class diagrams that are expected to be easily understandable by readers familiar with the basic concepts of object-oriented systems. *TODO* - move UML elsewhere ### BNF Notation GPAD and GPI document structures are defined using a standard BNF notation, which is summarized below. * terminal symbols are single quoted * non-terminal symbols are unquoted * zero or more symbols are indicated by following the symbol with a star; e.g. `Annotation*` * zero or one symbols are written using square brackets; e.g. `[Qualifier]` * alternative symbols are written using vertical bars * groupings are written using parentheses * complementation is written using minus symbol GPI and GPAD documents consist of sequences of Unicode characters and are encoded in UTF-8 [RFC 3629]. * TODO - do we allow UTF-8 or restrict to ASCII? DECIDED: ASCII ### Basic Characters Alpha_Char ::= a |b |c |d |e |f |g |h |i |j |k |l |m |n |o |p |q |r |s |t |u |v |w |x |y |z | A |B |C |D |E |F |G |H |I |J |K |L |M |N |O |P |Q |R |S |T |U |V |W |X |Y |Z Digit ::= 0 |1 |2 |3 |4 |5 |6 |7 |8 |9 Alphanumeric_Char ::= Alpha_Char | Digit ### Spacing Characters There is a single space character allowed Space ::= ' ' ### Identifiers An identifier consists of a prefix and a local identifier separated by a colon symbol ID ::= Prefix ':' Local_ID A Prefix must not contain special characters such as ':'s Prefix ::= Alphanumeric | '_' | '-' A local identifier can consist of any non-whitespace character Local_ID ::= (Any_ASCII_Character - ws) An OBO ID is a type of identifier OBO_ID ::= ID OBO identifiers (which include GO identifiers) SHOULD follow the [OBO identifier policy](http://www.obofoundry.org/id-policy.shtml) References are also types of identifier Reference ::= ID ### GO database registry The [GO database registry](http://www.geneontology.org/cgi-bin/xrefs.cgi) contains a list of valid prefixes that can be used in GPAD or GPI files. Every identifier used in a GPAD or GPI file SHOULD have an entry in the registry. The combination of prefix plus Local_ID (see previous section) describes how an identifier should be mapped to a URI. ### Property Symbols Open-ended property-value pairs are allowed at different points throughout a document. The property symbol or "tag" is a shorthand way of specifying a URI that denotes the actual property used. Property_Symbol ::= ID | Alphanumeric ### Dates Dates are written into what is equivalent to the date portion of ISO-8601, omitting hyphens: YYYYMMDD ::= Year Month Day_of_month Year ::= digit digit digit digit Month ::= digit digit Day_of_month ::= digit digit Both months and days count from 1. E.g. Jan=1, first day of month=1. A Date is equivalent to an [xsd:date](http://www.w3.org/TR/xmlschema11-2/#date), and inherits the same semantics and constraints. ## GPAD Syntax ### GPAD Document Structure A GPAD document consists of a header followed by zero or more annotations GPAD_Doc ::= GPAD_Header Annotation* This is illustrated in the following UML diagram: ![](http://geneontology.org/specifications/gpad/gpad-document-uml.png) ### GPAD Headers A header consists of an obligatory format version declaration followed by zero or more metadata lines: GPAD_Header ::= '!gpa-version: 1.1' nl GPAD_Header_Line* Each metadata line starts with an exclamation mark '!'. One mark indicates a structured tag-value pair, two marks indicates free text. GPAD_Header_Line ::= '!' Property_Symbol ':' Space* Value nl | '!!' (Char - nl)* nl The list of allowed property symbols is open-ended and outside the scope of this specification. Different groups may decide on their own conventions. Examples include: * Project_name: E.g. SGD * URL: E.g. http://www.yeastgenome.org/ * Funding: e.g. NHGRI * Date: an ISO-8601 formatted date describing when the file was produced ### Annotations In this specification, an annotation is an association between a biological entity (such as a gene or gene product) and an ontology class (term). The association describes some aspect of that entity, and includes with metadata about the association, such as evidence and provenance. ![](http://geneontology.org/specifications/gpad/gpad-uml.png) Each annotation is on a separate line of tab separated values: Annotation ::= Col_1 tab Col_2 tab ... Col_12 nl Each of these columns has its own syntax, as specified below: 1. `DB ::= Prefix` 2. `DB_Object_ID ::= Local_ID` 3. `Qualifiers ::= [Qualifier] ('|' Qualifier)*` 4. `Ontology_Class_ID ::= OBO_ID` 5. `References ::= Reference ('|' Reference)*` 6. `Evidence_type ::= OBO_ID` 7. `With_or_From ::= [ID] ('|' ID)*` 8. `Interacting_taxon_ID ::= [Taxon_ID]` 9. `Date ::= YYYYMMDD` 10. `Assigned_by ::= Prefix` 11. `Annotation_Extensions ::= [Extension_Conj] ('|' Extension_Conj)*` 12. `Annotation_Properties ::= [Property_Value_Pair] ('|' Property_Value_Pair)*` C1 and C2 combine to form a unique reference to the *Entity* being described. C4 and C11 (and optionally C8) combine to form a *Description* of a biological attribute. C3 is the relationship between this entity and the description (for example, "involved in"). The other columns combine to make metadata about the annotation; C5, C6 and C7 describe the evidence for the association plus its provenance. ### Annotation extensions Documentation on this column is available on the [column 16](http://www.geneontology.org/GO.annotation.col_16.shtml) page (note that we identify columns by their position in GAF). The value of this column is a pipe-separated list of zero or more conjunctive expressions: Extension_Conj ::= [Relational_Expression] (',' Relational_Expression)* Relation_Expression ::= Relation_Symbol '(' ID ')' ### Property-Value pairs A property value pair uses an open-ended vocabulary of properties to association information with the annotation. Property_Value_Pair ::= Property_Symbol '=' Property_Value Property_Value ::= (AnyChar - ('=' | '|' | nl)) *TODO* define AnyChar such that escaping is allowed Properties may include the name or ID of the curator who made the annotation. Recommendations on the property vocabulary will be provided separately. *TODO* ### Qualifiers and Relations A qualifier is either a logical modifier or a relation Qualifier ::= Modifier | Relation_Symbol The only modifier is logical negation Modifier ::= 'NOT' This specification does not mandate which relations are allowed. The full set of relations allowed are specified in a separate GO relations ontology. This ontology also specifies the domain and range constraints for these relations. ### GPAD Validation For GO annotations, all [annotation QC rules](http://www.geneontology.org/GO.annotation_qc.shtml) apply. Specifically * All Prefixes MUST be registered in GO.xrf_abbs * The Ontology_Class_ID MUST be a GO ID and SHOULD correspond to a non-obsolete, non-merged class * The Evidence_code MUST be an ECO ID ### GPAD Semantics Detailed semantics will be provided in future via a mapping to OWL. For now, consult the main GO documentation for a description of the columns used here, and recommendations on how to use them. E.g. * [GAF-2](http://www.geneontology.org/GO.format.gaf-2_0.shtml). See also the section on mapping to GAF-2 in this document. ## GPI Syntax ### GPI Document Structure GPI_Doc ::= GPI_Header Entity* ### GPI Headers A header consists of an obligatory format version declaration followed by an obligator database declaration then zero or more lines starting with an exclamation point: GPI_Header ::= '!gpi-version: 1.1' nl '!namespace: ' Prefix nl Header_Line* Each metadata line starts with an exclamation mark '!'. One mark indicates a structured tag-value pair, two marks indicates free text. GPI_Header_Line ::= '!' Property_Symbol ':' Space* Value nl | '!!' (Char - nl)* nl *TODO* - decide whether to support multiple namespaces in one document (we are leaning towards allowing this for the mega-gpi use case - DECIDED) ### GP Entities A GP entity is any biological entity that can be annotated using GPAD ![](http://geneontology.org/specifications/gpad/gpi-uml.png) Each entity is written on a separate line of tab separated values: Entity ::= Col_1 tab Col_2 tab ... Col_9 nl Each of these columns has its own syntax, as specified below: 1. `DB_Object_ID ::= Local_ID` 2. `DB_Object_Symbol ::= xxxx` 3. `DB_Object_Name ::= xxxx` 4. `DB_Object_Synonyms ::= [Label] ('|' Label)*` 5. `DB_Object_Type ::= Type_Symbol` 6. `DB_Object_Taxon ::= Taxon_ID` 7. `Parent_ObjectID ::= [ID??] ('|' ID)*` 8. `DB_Xrefs ::= [ID??] ('|' ID)*` 9. `Properties ::= [Property_Value] (',' Property_Value)*` The ID for the entity is formed as follows: ID = CONCAT( Namespace ':' Local_ID) *TODO* describe semantics of other columns ### GPI Validation * TODO ### GPI Semantics * TODO ## Mapping between GAF-2 and GPI/GPAD GAF is broadly speaking the relational join of GPAD and GPI. However, there are subtle differences that require additional operations when interconverting between the two. ### Column mappings The following mappings describe how GAF-2 columns map to GPAD/GPI 1. `DB --> GPAD-c1 & GPI-Header` 2. `DB_Object_ID --> GPAD-c2 & GPI-c1` 3. `DB_Object_Symbol --> GPI-c3` 4. `Qualifiers --> GPAD-c3` (see notes below) 5. `Ontology_Class_ID --> GPAD-c4` 6. `References --> GPAD-c5` 7. `Evidence_code --> ECO(GPAD-c6)` 8. `With_or_From ::= GPAD-c7` 9. `Apsect --> derived from GPAD-c4` 10. `DB_Object_Name --> GPI-c3` 11. `DB_Object_Synonym --> GPI-c4` 12. `DB_Object_Type --> GPI-c5` 13. `Taxon --> GPI-c6 (| GPAD-c8)` 14. `Date --> GPAD-c9` 15. `Assigned_by ::= GPAD-c10` 16. `Annotation_Extension ::= GPAD-c11` 17. `Gene-Product-Form-ID ::= see notes` ### Gene Product Form rewrites One crucial difference between GAF and GPAD/GPI is that in GAF, associations MUST be made to gene or canonical protein (which corresponds to a gene). If the intent is to describe some aspect a particular form (e.g. isoform) of that gene, then this is indicated in GAF-c17. In GPAD, associations are made directly to entity being described, as closely as possible. The mapping must therefore include extra steps. When converting from an annotation GPAD/GPI to GAF, first check the Parent field in the GPI for the corresponding entity. If there is no entry, proceed according to the table above. If there is an entity, then use this entity *as if* it were in GPAD-c1+c2, and place the original value of GPAD-c1+c2 in GAF-c17. *TODO* this is still underspecified - discussed 2012-03-14, decided probably fine ### ECO mapping The following permanent URL contains a mapping table that specifies how to generate a specific ECO class ID from (1) a GAF code and (2) a GO_REF used in the With column of a GAF: http://purl.obolibrary.org/obo/eco/gaf-eco-mapping.txt ### Default Annotation Relations GAF has a qualifier column (GAF-c4) which may contain the NOT modifier, or the contributes_to or colocalized_with relations. There is no way to indicate any other relationship type. When translating, if no relation is specified, then the follow defaults are chosen: * part_of (cell component) * involved_in (biological process) * enables (molecular functon) If a NOT modifer is present, then this is also included (separated by '|'). If the following relations are present in the GAF, they are used in the GPAD document as-is: * colocalized_with * contributes_to See [http://www.geneontology.org/GO.annotation.conventions.shtml](conventions) ## References *TODO*