GO Gene Product Annotation Data File Format Guide
Gene product annotation data (GPAD) format is used to submit gene and gene product annotations to the GO Consortium. Please note that gene product information uses GPI format files.
File header
The first line of the file declares the format and version:
!gpad-version: 1.0
Other information, such as contact details for the submitter or database group, useful links, etc., can be included in the file by prefixing the line with an exclamation mark (!); such lines will be ignored by parsers.
It is strongly suggested that a line be added at the bottom of the header with the column names in it:
!DB DB_Object_ID Relationship GO ID Reference Evidence Code With/From Interacting taxon Date Assigned By Annotation XP
An example of a full file header:
!gpad-version: 1.0
!CVS Version: Revision: 1.134 $
!GOC Validation Date: 08/26/2009 $
!Submission Date: 8/26/2009
!
!Project_name: Pombase - Schizosaccharomyces pombe DB
!URL: http://www.pombase.org/
!Contact Email: val@sanger.ac.uk
!
!DB DB_Object_ID Relationship GO ID Reference Evidence Code With/From Interacting taxon Date Assigned By Annotation XP
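As a minimal sketch of how a parser might treat the header, the following Python function reads the leading lines that start with an exclamation mark, picks out the gpad-version declaration, and keeps the remaining header text. The function name read_gpad_header and the shape of its return value are hypothetical and are not part of the format.

```python
def read_gpad_header(lines):
    """Collect header information from the leading '!' lines of a GPAD file.

    Returns (version, comments). Hypothetical helper for illustration only.
    """
    version = None
    comments = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("!"):
            break  # first non-header line: annotation rows begin here
        body = line[1:].strip()
        if body.lower().startswith("gpad-version:"):
            version = body.split(":", 1)[1].strip()
        elif body:
            comments.append(body)  # free-text header lines, ignored by parsers
    return version, comments


header = [
    "!gpad-version: 1.0",
    "!Submission Date: 8/26/2009",
    "!",
    "!URL: http://www.pombase.org/",
    "UniProtKB\tP12345",  # truncated stand-in for the first annotation row
]
version, comments = read_gpad_header(header)
```

Stopping at the first line without a leading exclamation mark is a design choice of this sketch; a stricter reader could instead skip such lines wherever they occur.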
Annotation File Fields
GPAD data is held in tab-delimited columns; fields with multiple values (for example, references) should have these values separated by pipes.
Content | Required? | Cardinality | Example |
---|---|---|---|
DB | required | 1 | UniProtKB |
DB Object ID | required | 1 | P12345 |
Relationship | required | 1 or more | NOT|part of |
GO ID | required | 1 | GO:0003993 |
Reference(s) | required | 1 or greater | PMID:2676709 |
Evidence Code | required | 1 | ECO:0000315 |
With (or) From | optional | 0 or greater | GO:0000346 |
Interacting taxon | optional | 0 or 1 | 9606 |
Date | required | 1 | 20090118 |
Assigned By | required | 1 | SGD |
Annotation XP | optional | 0 or greater | part_of(CL:0000576) |
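The table above can be turned into a small row parser. The sketch below assumes that every row carries all eleven columns in the order listed, with an empty string for any omitted optional field; the key names and the parse_gpad_row function are hypothetical. The sample row simply reuses the values from the Example column and does not represent a real annotation.

```python
GPAD_COLUMNS = [
    "db", "db_object_id", "relationship", "go_id", "reference",
    "evidence_code", "with_from", "interacting_taxon", "date",
    "assigned_by", "annotation_xp",
]
# Columns whose cardinality in the table can exceed 1 (pipe-separated values).
MULTI_VALUED = {"relationship", "reference", "with_from", "annotation_xp"}


def parse_gpad_row(line):
    """Split one tab-delimited GPAD row into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GPAD_COLUMNS):
        raise ValueError(
            f"expected {len(GPAD_COLUMNS)} columns, got {len(values)}"
        )
    row = {}
    for name, value in zip(GPAD_COLUMNS, values):
        if name in MULTI_VALUED:
            # An empty field becomes an empty list (cardinality 0).
            row[name] = [v for v in value.split("|") if v]
        else:
            row[name] = value
    return row


example = "\t".join([
    "UniProtKB", "P12345", "NOT|part of", "GO:0003993", "PMID:2676709",
    "ECO:0000315", "GO:0000346", "9606", "20090118", "SGD",
    "part_of(CL:0000576)",
])
row = parse_gpad_row(example)
```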
Definitions and requirements for field contents
- DB
- refers to the database from which the identifier in DB object ID is drawn. This is not necessarily the group submitting the file. If a UniProtKB ID is the DB object ID, DB should be UniProtKB.
must be one of the values from the set of GO database cross-references
this field is mandatory, cardinality 1
- DB Object ID
- a unique identifier (from the database in DB) for the item being annotated
this field is mandatory, cardinality 1
- In GPAD 1.0 format, the identifier may reference a top-level primary gene or gene product identifier, or an identified variant of a gene or gene product. Contents may include protein sequence identifiers: for example, identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column.
- If the gene product is not a top-level gene or gene product identifier, the Gene Product Information (GPI) file should contain information about the canonical form of the gene or gene product.
- The DB object ID is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB object ID field) or annotations to a protein object (protein ID in DB object ID field).
- Relationship
- the relationship between the gene product in the DB:DB object ID and the GO ID
composed of up to three parts: an operator (optional), a modifier, also called a qualifier (optional), and an atomic relation (required)
this field is mandatory, cardinality 1 or greater; entries are pipe-separated
- The operator may be one of two values, not or always. Operators are optional.
Valid qualifiers are contributes to and colocalizes with. In addition, annotations encompassing interactions with other organisms may use the qualifiers host, other organism or symbiont. Qualifiers are optional.
The atomic relations depend upon the term namespace, and are as follows:
gene product actively participates in molecular function
gene product actively participates in biological process
gene product part of cellular component
An atomic relation must be used.
- See also the documentation on qualifiers in the GO annotation guide
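The composition rules for the Relationship field can be sketched as a small classifier over its pipe-separated entries. The value sets below are taken from the text above; the function name parse_relationship, the lower-casing, and the treatment of underscores as spaces are assumptions of this sketch, since the exact spelling accepted in files is not specified here.

```python
OPERATORS = {"not", "always"}
QUALIFIERS = {
    "contributes to", "colocalizes with",
    "host", "other organism", "symbiont",
}
ATOMIC_RELATIONS = {"actively participates in", "part of"}


def parse_relationship(field):
    """Return (operator, qualifier, atomic_relation) for a Relationship field.

    Assumption: entries are compared case-insensitively and underscores are
    treated as spaces, so "NOT" and "part_of" are both accepted.
    """
    operator = qualifier = relation = None
    for entry in field.split("|"):
        token = entry.strip().lower().replace("_", " ")
        if token in OPERATORS:
            operator = token
        elif token in QUALIFIERS:
            qualifier = token
        elif token in ATOMIC_RELATIONS:
            relation = token
        else:
            raise ValueError(f"unrecognised relationship entry: {entry!r}")
    if relation is None:
        raise ValueError("an atomic relation is required")
    return operator, qualifier, relation


parsed = parse_relationship("NOT|part of")
```

The sketch does not check that the atomic relation matches the namespace of the GO term, because that requires looking the term up in the ontology.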
- GO ID
- the GO identifier for the term attributed to the DB object ID
this field is mandatory, cardinality 1
- Reference
- one or more unique identifiers for a single source cited as an authority for the attribution of the GO ID to the DB object ID. This may be a literature reference or a database record. The syntax is DB:accession.
Note that only one reference can be cited on a single line in the gene association file. If a reference has identifiers in more than one database, multiple identifiers for that reference can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, the PubMed ID must be included; if the model organism database has its own identifier for the reference, that can also be included.
this field is mandatory, cardinality 1 or greater; for cardinality >1 use a pipe to separate entries (e.g. PMID:2676709|SGD_REF:S000047763).
- Evidence Code
- one of the codes from the Evidence Code ontology, ECO
this field is mandatory, cardinality 1
- With [or] From (column 7)
- Also referred to as with, from or the with/from column
- one of:
- DB:gene|protein|seq_ID
- GO:ID
- CHEBI:ID
this field is required for some evidence codes
cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. CGSC:pabA|CGSC:pabB)
Note: This field is used to hold an additional identifier for annotations using certain evidence codes: ECO:0000305 [IC]; ECO:0000203, ECO:0000256 and ECO:0000265 [all IEA]; ECO:0000316 [IGI]; ECO:0000021 [IPI]; ECO:0000031, ECO:0000250 and ECO:0000255 [all ISS]. For example, it can identify another gene product to which the annotated gene product is similar (ECO:0000031, ECO:0000250 and ECO:0000255, ISS) or with which it interacts (ECO:0000021, IPI). More information on the meaning of with or from column entries is available in the evidence code documentation entries for the relevant codes.
Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Cardinality = 0 is not allowed for ISS annotations (ECO:0000031, ECO:0000250 and ECO:0000255) made after October 1, 2006. Annotations where evidence is ECO:0000316 [IGI], ECO:0000021 [IPI], or ECO:0000031, ECO:0000250 or ECO:0000255 [all ISS] and with cardinality = 0 should link to an explanation of why there is no entry in with. Cardinality may be >1 for any of the evidence codes that use with; for ECO:0000021 [IPI] and ECO:0000316 [IGI], cardinality >1 has a special meaning (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. FB:FBgn1111111|FB:FBgn2222222).
Note that a gene ID may be used in the with column for an ECO:0000021 [IPI] annotation, or for an ECO:0000031, ECO:0000250 or ECO:0000255 [all ISS] annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products. A gene ID may also be used if the cited reference provides enough information to determine which gene ID should be used, but not enough to establish which protein ID is correct.
'GO:id' is used only when the evidence code is ECO:0000305 [IC], and refers to the GO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference' column will be that used to assign the GO term(s) from which the inference is made. This field is mandatory for evidence code ECO:0000305 [IC].
The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, GO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the with column for ECO:0000031, ECO:0000250 or ECO:0000255 [ISS] annotations.
The with column may not be used with the evidence codes ECO:0000314 [IDA], ECO:0000304 [TAS], ECO:0000303 [NAS], or ECO:0000307 [ND].
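The hard constraints on the with column described above can be sketched as a simple check. The evidence code sets come directly from the text; the function name check_with_from, the wording of the messages, and the reading of "made after October 1, 2006" as a Date value strictly later than 20061001 are assumptions of this sketch. The softer recommendations (for example, linking to an explanation when an IGI or IPI annotation has no entry) are not covered.

```python
# Evidence codes for which the with column may not be used
# (IDA, TAS, NAS, ND).
WITH_FORBIDDEN = {"ECO:0000314", "ECO:0000304", "ECO:0000303", "ECO:0000307"}
# Evidence codes for which the with column is mandatory (IC).
WITH_MANDATORY = {"ECO:0000305"}
# ISS codes: an empty with column is not allowed after the cutoff date.
ISS_CODES = {"ECO:0000031", "ECO:0000250", "ECO:0000255"}
ISS_CUTOFF = "20061001"


def check_with_from(evidence_code, with_from, date):
    """Return a list of problems found; an empty list means no rule is broken.

    `with_from` is the raw column text and `date` is in YYYYMMDD format, so
    plain string comparison orders dates correctly.
    """
    entries = [e for e in with_from.split("|") if e]
    problems = []
    if evidence_code in WITH_FORBIDDEN and entries:
        problems.append("with column may not be used with this evidence code")
    if evidence_code in WITH_MANDATORY and not entries:
        problems.append("with column is mandatory for ECO:0000305 [IC]")
    if evidence_code in ISS_CODES and not entries and date > ISS_CUTOFF:
        problems.append("ISS annotations after 20061001 need a with entry")
    return problems


ok = check_with_from("ECO:0000305", "GO:0000346", "20090118")
bad = check_with_from("ECO:0000314", "UniProtKB:P12345", "20090118")
```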
- Taxon
- taxonomic identifier for interacting organism
to be used only in conjunction with terms that have the biological process term multi-organism process or the cellular component term host cell as an ancestor. The first taxon ID should be that of the organism encoding the gene or gene product, and the taxon ID after the pipe should be that of the other organism in the interaction.
this field is mandatory for terms with parentage under multi-organism process or host cell, cardinality 1; annotations to other terms should have this column blank
See the GO annotation conventions for more information on multi-organism terms.
- Date
- Date on which the annotation was made; format is YYYYMMDD
this field is mandatory, cardinality 1
- Assigned By
- The database which made the annotation
one of the values from the set of GO database cross-references
Used for tracking the source of an individual annotation.
Value will differ from the DB column for any annotation that is made by one database and incorporated into another.
this field is mandatory, cardinality 1
- Annotation XP
- one of:
- DB:gene_id
- DB:sequence_id
- CHEBI:CHEBI_id
- Cell Type Ontology:CL_id
- GO:GO_id
Contains cross-references to other ontologies that can be used to qualify or enhance the annotation. The cross-reference is prefaced by an appropriate GO relationship; references to multiple ontologies can be entered. For example, if a gene product is localized to the mitochondria of lymphocytes, the GO ID (column 4) would be mitochondrion ; GO:0005739, and the annotation extension column would contain a cross-reference to the term lymphocyte from the Cell Type Ontology.
Targets of certain processes or functions can also be included in this field to indicate the gene, gene product, or chemical involved; for example, if a gene product is annotated to protein kinase activity, the annotation extension column would contain the UniProtKB protein ID for the protein phosphorylated in the reaction.
See the documentation on using the annotation extension column for details of practical usage; a wider discussion of the annotation extension column can be found on the GO wiki.
this field is optional, cardinality 0 or greater
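An annotation extension entry such as part_of(CL:0000576) pairs a relationship with a cross-reference. The sketch below splits such entries with a regular expression; it assumes the relation(DB:accession) shape shown in the field table and pipe separation between multiple entries, and the function name parse_annotation_xp is hypothetical. Real files may use richer syntax that this pattern does not cover.

```python
import re

# relation(DB:accession), e.g. part_of(CL:0000576)
EXTENSION = re.compile(
    r"^(?P<relation>[^()|]+)\((?P<db>[^:()]+):(?P<accession>[^()]+)\)$"
)


def parse_annotation_xp(field):
    """Return a list of (relation, db, accession) tuples; [] if the field is empty."""
    result = []
    for entry in field.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # cardinality 0: nothing to parse
        match = EXTENSION.match(entry)
        if match is None:
            raise ValueError(f"not of the form relation(DB:accession): {entry!r}")
        result.append((match["relation"], match["db"], match["accession"]))
    return result


extensions = parse_annotation_xp("part_of(CL:0000576)")
```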
Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: GO ID; Reference; With or From; and Annotation XP.
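Because the same dbname:dbaccession shape recurs across these fields, one small helper can serve all of them. The sketch below splits at the first colon only, on the assumption that an accession may itself contain colons; the names split_dbxref and parse_dbxref_list are hypothetical.

```python
def split_dbxref(xref):
    """Split 'dbname:dbaccession' into (dbname, dbaccession)."""
    db, sep, accession = xref.partition(":")
    if not sep or not db or not accession:
        raise ValueError(f"not a dbname:dbaccession cross-reference: {xref!r}")
    return db, accession


def parse_dbxref_list(field):
    """Parse a pipe-separated list of cross-references, e.g. a Reference field."""
    return [split_dbxref(x) for x in field.split("|") if x]


go_term = split_dbxref("GO:0003993")
references = parse_dbxref_list("PMID:2676709|SGD_REF:S000047763")
```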