The GO Flat File Format
This format is now deprecated and the use of OBO 1.2 format is recommended.
Introduction
The structure of the old GO flat files was designed with an eye towards ease of editing in a plain text editor. The indentation scheme allowed curators to easily see the structure of the DAG, and a fair amount of redundant information allowed a curator to visualize the term they were working on without having to constantly review the entire file. The individual ontologies are held in separate files and the definitions are kept in a further separate file:
Biological Process (process.ontology)
Molecular Function (function.ontology)
Cellular Component (component.ontology)
Definitions (GO.defs)
File front matter
The first lines of each file carry information about the version, the date of last update, (optionally) the source of the file, the name of the database, the domain of the file and the editors of the file. Comment lines start with a ! character. These lines are present in both the ontology files and the definitions file.
Here's an example of the front matter of a GO flat file:
!autogenerated-by: DAG-Edit version 1.315
!saved-by: midori
!date: Fri Jan 03 17:14:37 GMT 2003
!version: $ Revision: 1.17 $
!type: % ISA Is a
!type: < PARTOF Part of
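As an illustration, the front matter can be read with a few lines of Python. This is a minimal sketch, not part of any GO tool: the function name parse_front_matter is hypothetical, and it assumes every header line has the shape "!key: value" and that the header ends at the first line that does not start with !.

```python
def parse_front_matter(lines):
    """Collect '!key: value' comment lines from the top of a GO flat file."""
    meta = {}
    for line in lines:
        if not line.startswith("!"):
            break  # assumption: the header ends at the first non-comment line
        key, sep, value = line[1:].partition(":")
        if not sep:
            continue  # a bare comment without a 'key: value' shape
        # Keys such as 'type' repeat, so each key maps to a list of values.
        meta.setdefault(key.strip(), []).append(value.strip())
    return meta


header = [
    "!autogenerated-by: DAG-Edit version 1.315",
    "!saved-by: midori",
    "!date: Fri Jan 03 17:14:37 GMT 2003",
    "!version: $ Revision: 1.17 $",
    "!type: % ISA Is a",
    "!type: < PARTOF Part of",
    "$Gene_Ontology ; GO:0003673",
]
meta = parse_front_matter(header)
for key, values in meta.items():
    print(key, values)
```

Only the first colon on a line is treated as the separator, so values that contain colons themselves (the date and version lines) are kept whole.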
GO format ontology files
Following the comments in the ontology files is a line beginning with a $, reflecting the domain and aspect of the ontology:
$Gene_Ontology ; GO:0003673
Relationships between terms
In the GO flat files, the symbol % is used to represent an is-a relationship and the symbol < to represent a part-of relationship.
Parent-child relationships between terms are represented by indentation:
parent_term
 child_term
is-a relationships
%term0
 %term1
means that term1 is a (is a subclass of) term0
%term0
 %term1 % term2
means that term1 is a term0 and is a term2.
part-of relationships
%term0
 <term1
means that term1 is part of term0.
%term0
 <term1 < term2 < term3
means that term1 is part of term0 and part of term2 and term3.
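The way indentation and relationship symbols combine can be sketched in Python. This is an illustrative reading of the rules above, not a reference parser: the function name read_edges is hypothetical, one leading space per level of depth is an assumption, and escaped characters inside term names are not handled.

```python
import re


def read_edges(lines):
    """Return (child, relation, parent) triples from indented flat file lines."""
    edges = []
    stack = []  # stack[d] is the most recent term seen at depth d
    for raw in lines:
        if not raw.strip() or raw.startswith("!"):
            continue  # skip blank lines and comment lines
        body = raw.lstrip(" ")
        depth = len(raw) - len(body)  # assumption: one space per level
        symbol, rest = body[0], body[1:]
        # Additional parents follow the term, each introduced by ' % ' or ' < '.
        parts = re.split(r"\s+([%<])\s+", rest)
        term = parts[0].split(";")[0].strip()
        del stack[depth:]  # drop terms that are not ancestors of this line
        if stack:
            # The leading symbol describes the link to the enclosing parent.
            edges.append((term, symbol, stack[-1]))
        for relation, parent in zip(parts[1::2], parts[2::2]):
            edges.append((term, relation, parent.split(";")[0].strip()))
        stack.append(term)
    return edges


edges = read_edges([
    "%term0",
    " <term1 < term2 < term3",
])
for child, relation, parent in edges:
    print(child, relation, parent)
```

For the two-line sample, the sketch reports term1 as part of term0 (from the indentation) and as part of term2 and term3 (from the rest of the line).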
Line syntax
Each line of the flat file contains, at minimum, a GO term string and ID, the relationship type and a certain level of indentation.
Secondary IDs are shown after the primary ID:
%term name ; termID, secondaryID, secondaryID
If a term has synonyms, they are written after the term information:
%term name ; termID ; synonym:[synonym1] ; synonym:[synonym2]
The syntax for database cross-references is
%term name ; termID ; database_abbreviation:identifier ; database_abbreviation:identifier
The syntax for relationships to other terms is
%term name ; termID [R] parentTerm1 ; parentTermID1 [R] parentTerm2 ; parentTermID2
where [R] represents the relationship symbol % or <.
The order in which items appear on a line (where [item] indicates optional items, (X|Y) are alternatives, and * means one or more may be present) is:
(<|%)term ; primaryID[, secondaryID]* [; db cross ref]* [; synonym:text]* [ (<|%) term]*
An example from the molecular function ontology (would appear as a single line in the file):
%peroxidase activity ; GO:0004601, GO:0016685 ; EC:1.11.1.7 ; MetaCyc:PEROXID-RXN ; synonym:myeloperoxidase activity ; synonym:peroxidase reaction % antioxidant activity ; GO:0016209
- peroxidase activity ; GO:0004601 is the term name and ID
- GO:0016685 is a secondary ID for GO:0004601
- EC:1.11.1.7 and MetaCyc:PEROXID-RXN are cross-references to equivalent objects in other databases
- myeloperoxidase activity and peroxidase reaction are synonyms for peroxidase activity
- % antioxidant activity ; GO:0016209 indicates that the term is an is-a child of antioxidant activity
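The breakdown above can be reproduced with a short Python sketch. It is only an illustration of the line syntax: the function name parse_term_line and the dictionary keys are hypothetical, and the sketch assumes that term names and synonyms contain no ; and no free-standing % or < characters.

```python
import re


def parse_term_line(line):
    """Split one ontology line into its fields, following the order given above."""
    body = line.strip()
    relation, rest = body[0], body[1:]
    # The term's own fields come first; extra parents follow ' % ' or ' < '.
    chunks = re.split(r"\s+([%<])\s+", rest)
    fields = [field.strip() for field in chunks[0].split(";")]
    name = fields[0]
    ids = [term_id.strip() for term_id in fields[1].split(",")]
    synonyms, xrefs = [], []
    for field in fields[2:]:
        if field.startswith("synonym:"):
            synonyms.append(field[len("synonym:"):])
        else:
            xrefs.append(field)  # anything else is taken as database:identifier
    parents = []
    for parent_relation, chunk in zip(chunks[1::2], chunks[2::2]):
        parent_name, _, parent_id = chunk.partition(";")
        parents.append((parent_relation, parent_name.strip(), parent_id.strip()))
    return {
        "relation": relation,
        "name": name,
        "primary_id": ids[0],
        "secondary_ids": ids[1:],
        "xrefs": xrefs,
        "synonyms": synonyms,
        "parents": parents,
    }


line = (
    "%peroxidase activity ; GO:0004601, GO:0016685 ; EC:1.11.1.7 ; "
    "MetaCyc:PEROXID-RXN ; synonym:myeloperoxidase activity ; "
    "synonym:peroxidase reaction % antioxidant activity ; GO:0016209"
)
record = parse_term_line(line)
for key, value in record.items():
    print(key, value)
```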
GO format definition files
The definitions for terms in all three ontology files are stored in the GO.defs file. Each definition must contain the following:
- term: the name of the term to which the definition refers
- goid: the term's unique identifier
- definition: the definition of the term
- definition_reference: one or more references for the definition
A definition may also have a comment:
- comment: text
An example definition:
term: unfolded protein response
goid: GO:0030968
definition: The series of molecular signals generated as a consequence of the presence of unfolded proteins in the endoplasmic reticulum (ER) or other ER-related stress; results in changes in the regulation of transcription and translation.
definition_reference: GOC:mah
definition_reference: PMID:12042763
comment: Note that this term should not be confused with 'response to unfolded protein ; GO:0006986', which refers to any cellular response to the presence of unfolded proteins anywhere in the cell. Also see 'ER-associated protein catabolism ; GO:0030433'.
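A definitions file of this shape could be read along the following lines. This is a hedged sketch rather than an official reader: the names parse_defs, missing_fields and REQUIRED are hypothetical, and it assumes that every entry begins with a term: line and that each field occupies a single line.

```python
REQUIRED = ("term", "goid", "definition", "definition_reference")


def parse_defs(lines):
    """Group 'key: value' lines into one dictionary per definition entry."""
    entries = []
    current = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and front matter comments
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "term":
            # Assumption: a 'term:' line opens a new entry.
            current = {"definition_reference": []}
            entries.append(current)
        if current is None:
            continue  # ignore fields that appear before the first entry
        if key == "definition_reference":
            current[key].append(value)  # this field may repeat
        else:
            current[key] = value
    return entries


def missing_fields(entry):
    """List the required fields that are absent or empty in an entry."""
    return [name for name in REQUIRED if not entry.get(name)]


sample = [
    "term: unfolded protein response",
    "goid: GO:0030968",
    "definition: The series of molecular signals generated as a consequence "
    "of the presence of unfolded proteins in the endoplasmic reticulum (ER) "
    "or other ER-related stress; results in changes in the regulation of "
    "transcription and translation.",
    "definition_reference: GOC:mah",
    "definition_reference: PMID:12042763",
]
entries = parse_defs(sample)
print(entries[0]["goid"], entries[0]["definition_reference"])
print(missing_fields(entries[0]))
```

Splitting on the first colon only keeps identifiers such as GO:0030968 and PMID:12042763 intact. The optional comment field is stored like any other single-valued field when it is present.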