GO Gene Product Information File Format Guide

Gene product information (GPI) format is used to submit gene and gene product information to the GO Consortium. Please note that annotation information uses GPAD format files.

GPI format version
File header
Data fields
Definitions and requirements for field contents

GPI format version

This document describes GPI version 1.0, as presented at the November 2011 GO Consortium meeting.

File header

The first line of the file declares the format and version:

!gpi-version: 1.0

Other information, such as contact details for the submitter or database group, useful links, etc., can be included in the file by prefixing each line with an exclamation mark (!); such lines will be ignored by parsers.

It is strongly suggested that a line be added at the bottom of the header with the column names in it:

!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs

An example of a full file header:

!gpi-version: 1.0
!CVS Version: Revision: 1.134 $
!GOC Validation Date: 08/26/2009 $
!Submission Date: 8/26/2009
!
!Project_name: Pombase - Schizosaccharomyces pombe DB
!URL: http://www.pombase.org/
!Contact Email: val@sanger.ac.uk
!
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs
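As an illustration of the header rules above, the following minimal Python sketch checks the version declaration on the first line and separates lines beginning with an exclamation mark from data lines. The function name read_gpi_header and its return shape are hypothetical and not part of the GPI specification; the single data line reuses example values from this guide.

```python
def read_gpi_header(lines):
    """Split GPI lines into (version, header_comments, data_lines).

    Hypothetical helper: the name and return shape are illustrative only.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    prefix = "!gpi-version:"
    # The first line of the file must declare the format and version.
    if not lines or not lines[0].startswith(prefix):
        raise ValueError("first line must look like '!gpi-version: 1.0'")
    version = lines[0][len(prefix):].strip()

    comments, data = [], []
    for ln in lines[1:]:
        if ln.startswith("!"):
            # Lines prefixed with '!' carry no data and are set aside.
            comments.append(ln[1:].strip())
        elif ln.strip():
            data.append(ln)
    return version, comments, data


example = [
    "!gpi-version: 1.0",
    "!Project_name: Pombase - Schizosaccharomyces pombe DB",
    "!URL: http://www.pombase.org/",
    "UniProtKB\tP12345\tprotein",
]
version, comments, data = read_gpi_header(example)
```

Here the comment lines are collected rather than discarded, so that a caller can still inspect contact details or the column-name line if it wishes.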

Data Fields

GPI data is held in tab-delimited columns; fields with multiple values (for example, gene product synonyms) should have these values separated by pipes.

Fields in the GPI file
Content	Required?	Cardinality	Example
DB	required	1	UniProtKB
DB Object ID	required	1	P12345
DB Object Symbol	required	1	PHO3
DB Object Name	optional	0 or 1	Toll-like receptor 4
DB Object Synonym	optional	0+, pipe-separated	hToll|Tollbooth
DB Object Type	required	1	protein
Taxon	required	1	taxon:9606
Parent GP ID	optional	0 or 1
External GP xrefs	optional	0+, pipe-separated	UniProtKB:P12345
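The tab and pipe conventions described above can be sketched in Python as follows. This is only an illustration: the column order is an assumption taken from the suggested column-name header line, the names COLUMNS, MULTI_VALUED and parse_gpi_row are hypothetical, and the sample row simply reuses the example values from the table rather than representing a real record.

```python
# Assumed column order, copied from the suggested '!DB DB_Object_ID ...' header line.
COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Type", "Taxon", "DB_Object_Symbol",
    "DB_Object_Name", "DB_Object_Synonym", "Parent_GP_ID", "DB_Object_Xrefs",
]
# Fields with cardinality 0+ whose values are separated by pipes.
MULTI_VALUED = {"DB_Object_Synonym", "DB_Object_Xrefs"}


def parse_gpi_row(line):
    """Turn one tab-delimited GPI data line into a dict (illustrative helper)."""
    values = line.rstrip("\n").split("\t")
    # Pad short rows so that trailing optional columns read as empty.
    values += [""] * (len(COLUMNS) - len(values))
    record = {}
    for name, value in zip(COLUMNS, values):
        if name in MULTI_VALUED:
            # An empty field yields an empty list rather than [''].
            record[name] = [v for v in value.split("|") if v]
        else:
            record[name] = value
    return record


row = "\t".join([
    "UniProtKB", "P12345", "protein", "taxon:9606", "PHO3",
    "Toll-like receptor 4", "hToll|Tollbooth", "", "UniProtKB:P12345",
])
record = parse_gpi_row(row)
```

Optional single-valued fields that are left blank, such as Parent GP ID in this row, come back as empty strings.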

Definitions and requirements for field contents

DB: refers to the database from which the identifier in DB object ID is drawn. This is not necessarily the group submitting the file. If a UniProtKB ID is the DB object ID, DB should be UniProtKB.
must be one of the values from the set of GO database cross-references
this field is mandatory, cardinality 1
DB Object ID: a unique identifier (from the database in DB) for the item being annotated
this field is mandatory, cardinality 1
In GPI 1.0 format, the identifier may reference a top-level primary gene or gene product identifier, or an identified variant of a gene or gene product. Contents may include protein sequence identifiers: for example, identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column.
If the identifier is not a top-level gene or gene product identifier, the Parent GP ID field should contain the canonical form of the gene or gene product.
The DB object ID is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB object ID field) or annotations to a protein object (protein ID in DB object ID field).
DB Object Symbol: a (unique and valid) symbol to which DB object ID is matched
this field is mandatory, cardinality 1
The DB Object Symbol field should be a symbol that means something to a biologist wherever possible (a gene symbol, for example). It is not a unique identifier or an accession number (unlike the DB object ID), although IDs can be used as a DB object symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).
ORF names can be used for otherwise unnamed genes or proteins.
If gene products are annotated, the gene product symbol can be used if available. Many gene product annotation entries may share a gene symbol.
The text entered in the DB object name and DB object symbol should refer to the entity in DB object ID. For example, several alternative transcripts from one gene may be annotated separately, each with specific gene product identifiers in DB object ID, but with the same gene symbol in the DB object symbol column.
DB Object Name: name of gene or gene product
this field is not mandatory, cardinality 0, 1 [white space allowed]
The text entered in the DB object name and DB object symbol should refer to the entity in DB object ID. For example, several alternative transcripts from one gene may be annotated separately, each with specific gene product identifiers in DB object ID, but with the same gene symbol in the DB object symbol column.
DB Object Synonym: Gene symbol [or other text]
Note that we strongly recommend that gene synonyms are included in the gene association file, as this aids the searching of GO.
this field is not mandatory, cardinality 0, 1, >1 [white space allowed]; for cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene)
DB Object Type: A description of the type of the gene or gene product being annotated.
one of the following: protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; any subtype of ncRNA in the Sequence Ontology. If the precise product type is unknown, gene_product should be used.
this field is mandatory, cardinality 1
The object type (gene_product, transcript, protein, protein_complex, etc.) listed in the DB object type field must match the database entry identified by the DB object ID. Note that DB object type refers to the database entry (i.e. it represents a protein, functional RNA, etc.); this column does not reflect anything about the GO term or the evidence on which the annotation is based. For example, if your database entry represents a protein-encoding gene, then protein goes in the DB object type column.
Taxon: taxonomic identifier(s)
The NCBI taxon ID of the species encoding the gene product.
this field is mandatory, cardinality 1
taxon should be specified as a number without the prefix "taxon". Note that this is a change from GAF format.
Parent GP ID: If the DB object ID refers to a variant of a gene product, this column will hold the identifier of the gene product from which it was derived.
this field is mandatory, cardinality 1, for variant forms of a gene product (e.g. identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification).
if the DB object ID refers to the canonical form of a gene product, this column should be blank.
The identifier used must be a standard 2-part global identifier, e.g. UniProtKB:OK0206
The entity in the parent GP ID column is not necessarily the canonical form of the gene product; the canonical form can be identified because its own entry in the GPI file has a blank parent GP ID.
External GP xrefs: Identifiers for this object in other databases
Optional, cardinality 0+; multiple identifiers should be pipe-separated
Identifiers used must be standard 2-part global identifiers, e.g. UniProtKB:OK0206
This column should be used to record IDs for this object in other databases; for gene products in model organism databases, this may include the UniProt ID, NCBI gene or protein IDs, etc.
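Two of the requirements above lend themselves to a simple automated check: the DB Object Type must come from the listed set, and the Parent GP ID and External GP xrefs must be 2-part identifiers. The Python sketch below is one possible way to test them, not an official validator. The regular expression is an assumption about what "2-part" means (a database name, a colon, then an accession), the names ALLOWED_TYPES, TWO_PART_ID and check_record are hypothetical, and the set of types omits the Sequence Ontology ncRNA subtypes, which would need to be added by the caller.

```python
import re

# Types named explicitly in this guide. Subtypes of ncRNA from the
# Sequence Ontology are also permitted but are NOT enumerated here.
ALLOWED_TYPES = {
    "protein_complex", "protein", "transcript", "ncRNA", "rRNA",
    "tRNA", "snRNA", "snoRNA", "gene_product",
}

# Assumed shape of a 2-part identifier: dbname:dbaccession, no spaces or pipes.
TWO_PART_ID = re.compile(r"^[^\s:|]+:[^\s|]+$")


def check_record(db_object_type, parent_gp_id="", xrefs=()):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if db_object_type not in ALLOWED_TYPES:
        problems.append(f"unrecognised DB Object Type: {db_object_type!r}")
    # Parent GP ID may be blank for the canonical form of a gene product.
    if parent_gp_id and not TWO_PART_ID.match(parent_gp_id):
        problems.append(f"Parent GP ID is not a 2-part identifier: {parent_gp_id!r}")
    for xref in xrefs:
        if not TWO_PART_ID.match(xref):
            problems.append(f"xref is not a 2-part identifier: {xref!r}")
    return problems


ok = check_record("protein", "UniProtKB:OK0206", ["UniProtKB:P12345"])
bad = check_record("gene", "OK0206")
```

Returning a list of problems, rather than raising on the first one, lets a submitter see every issue in a row at once.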

Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: GO ID [column 5], where dbname is always GO; DB:Reference; With or From; and Taxon, where dbname is always taxon. For GO IDs, do not repeat the 'GO:' prefix (i.e. always use GO:0000000, not GO:GO:0000000).