The GO Flat File Format

This format is now deprecated; the OBO 1.2 format is recommended instead.

Introduction

The structure of the old GO flat files was designed with an eye towards ease of editing in a plain text editor. The indentation scheme allowed curators to easily see the structure of the DAG, and a fair amount of redundant information allowed a curator to visualize the term they were working on without having to constantly review the entire file. The individual ontologies are held in separate files and the definitions are kept in a further separate file:

Biological Process (process.ontology)
Molecular Function (function.ontology)
Cellular Component (component.ontology)
Definitions (GO.defs)

File front matter

The first lines of each file carry information about the version, the date of last update, (optionally) the source of the file, the name of the database, the domain of the file and the editors of the file. Comment lines start with a !. These lines are present in both the ontology files and the definitions file.

Here's an example of the front matter of a GO flat file:

!autogenerated-by:  DAG-Edit version 1.315
!saved-by:          midori
!date:              Fri Jan 03 17:14:37 GMT 2003
!version:           $ Revision: 1.17 $
!type: % ISA Is a
!type: < PARTOF Part of
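
As a minimal sketch of how the front matter could be read, the Python below collects the leading ! lines into key/value pairs. The function name parse_front_matter is hypothetical, and the sketch assumes every front matter line has the form !key: value and that the front matter ends at the first line that does not start with !. Pairs are kept in a list rather than a dictionary because a key such as type can occur more than once.

```python
def parse_front_matter(lines):
    """Return (key, value) pairs from the leading '!' comment lines."""
    entries = []
    for line in lines:
        if not line.startswith("!"):
            break  # assumed: front matter ends at the first non-comment line
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line[1:].partition(":")
        if sep:
            entries.append((key.strip(), value.strip()))
    return entries


sample = [
    "!autogenerated-by:  DAG-Edit version 1.315",
    "!saved-by:          midori",
    "!date:              Fri Jan 03 17:14:37 GMT 2003",
    "!version:           $ Revision: 1.17 $",
    "!type: % ISA Is a",
    "!type: < PARTOF Part of",
    "$Gene_Ontology ; GO:0003673",
]
front_matter = parse_front_matter(sample)
for key, value in front_matter:
    print(key, "=", value)
```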

GO format ontology files

Following the comments in the ontology files is a line beginning with a $, reflecting the domain and aspect of the ontology:

$Gene_Ontology ; GO:0003673

Relationships between terms

In the GO flat files, the symbol % represents an is-a relationship and the symbol < represents a part-of relationship.

Parent-child relationships between terms are represented by indentation:

parent_term
 child_term

is-a relationships

%term0
 %term1

means that term1 is a term0 (that is, term1 is a subclass of term0).

%term0
 %term1 % term2

means that term1 is a term0 and is a term2.

part-of relationships

%term0
 <term1

means that term1 is part of term0.

%term0
 <term1 < term2 < term3

means that term1 is part of term0 and part of term2 and term3.
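
The indentation scheme described above can be turned into explicit parent-child links with a small stack. The Python below is a sketch under stated assumptions, not a reference parser: it assumes one space per indentation level (as in the examples), that each line holds only a relationship symbol and a term name, and that the file is well formed. The name read_hierarchy and the labels is_a and part_of are hypothetical.

```python
def read_hierarchy(lines):
    """Return (child, relation, parent) triples from indented term lines."""
    symbols = {"%": "is_a", "<": "part_of"}
    stack = []  # stack[d] is the most recent term seen at depth d
    triples = []
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped or stripped.startswith("!"):
            continue  # skip blank lines and comments
        depth = len(line) - len(stripped)  # assumed: one space per level
        symbol, name = stripped[0], stripped[1:].strip()
        del stack[depth:]  # drop terms at this depth or deeper
        if stack:
            # The leading symbol gives the relationship to the
            # term one indentation level up.
            triples.append((name, symbols[symbol], stack[-1]))
        stack.append(name)
    return triples


lines = [
    "%term0",
    " %term1",
    "  <term2",
    " <term3",
]
triples = read_hierarchy(lines)
for child, relation, parent in triples:
    print(child, relation, parent)
```

Here term1 and term3 are both linked to term0, and term2 is linked to term1, because each line attaches to the nearest preceding line with one less level of indentation.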

Line syntax

Each line of the flat file contains, at minimum, a GO term string and ID, the relationship type and a certain level of indentation.

Secondary IDs are shown after the primary ID:

%term name ; termID, secondaryID, secondaryID

If a term has synonyms, they are written after the term information:

%term name ; termID ; synonym:[synonym1] ; synonym:[synonym2]

The syntax for database cross-references is

%term name ; termID ; database_abbreviation:identifier ; database_abbreviation:identifier

The syntax for relationships to other terms is

%term name ; termID [R] parentTerm1 ; parentTermID1 [R] parentTerm2 ; parentTermID2

where [R] represents the relationship symbol % or <.

The order in which items appear on a line (where [item] indicates an optional item, (X|Y) indicates alternatives, and * means the preceding item may be repeated) is:

(<|%)term ; primaryID[, secondaryID]* [; db cross ref]* [; synonym:text]* [ (<|%) term]*

An example from the molecular function ontology (it appears as a single line in the file):

%peroxidase activity ; GO:0004601, GO:0016685 ; EC:1.11.1.7 ; MetaCyc:PEROXID-RXN ; synonym:myeloperoxidase activity ; synonym:peroxidase reaction % antioxidant activity ; GO:0016209

  • peroxidase activity ; GO:0004601 is the term name and ID
  • GO:0016685 is a secondary ID for GO:0004601
  • EC:1.11.1.7 and MetaCyc:PEROXID-RXN are cross-references to equivalent objects in other databases
  • myeloperoxidase activity and peroxidase reaction are synonyms for peroxidase activity
  • % antioxidant activity ; GO:0016209 indicates that the term is an is-a child of antioxidant activity
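
The line syntax above can be split into its parts roughly as follows. This Python sketch assumes the field order given above, that relationship symbols introducing further parents are surrounded by whitespace, and that term names contain no unescaped ; % or < characters. The function name parse_term_line and the dictionary keys are hypothetical.

```python
import re


def parse_term_line(line):
    """Split one ontology line into its fields (indentation is ignored)."""
    text = line.strip()
    relation, body = text[0], text[1:]
    # Separate the term itself from any further parents; the capturing
    # group keeps each relationship symbol in the result list.
    parts = re.split(r"\s+([%<])\s+", body)
    fields = [f.strip() for f in parts[0].split(";")]
    ids = [i.strip() for i in fields[1].split(",")]
    term = {
        "relation": relation,
        "name": fields[0],
        "id": ids[0],
        "secondary_ids": ids[1:],
        "xrefs": [],
        "synonyms": [],
        "listed_parents": [],
    }
    for field in fields[2:]:
        if field.startswith("synonym:"):
            term["synonyms"].append(field[len("synonym:"):])
        else:
            term["xrefs"].append(field)  # database_abbreviation:identifier
    for symbol, parent in zip(parts[1::2], parts[2::2]):
        name, _, go_id = parent.partition(";")
        term["listed_parents"].append((symbol, name.strip(), go_id.strip()))
    return term


line = (
    "%peroxidase activity ; GO:0004601, GO:0016685 ; EC:1.11.1.7 ; "
    "MetaCyc:PEROXID-RXN ; synonym:myeloperoxidase activity ; "
    "synonym:peroxidase reaction % antioxidant activity ; GO:0016209"
)
parsed = parse_term_line(line)
print(parsed["name"], parsed["id"], parsed["listed_parents"])
```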

GO format definition files

The definitions for terms in all three ontology files are stored in the GO.defs file. Each definition must contain the following:

  • term: the name of the term to which the definition refers
  • goid: the term's unique identifier
  • definition: the definition of the term
  • definition_reference: one or more references for the definition

A definition may also have a comment:

  • comment: text

An example definition:

term: unfolded protein response
goid: GO:0030968
definition: The series of molecular signals generated as a consequence of the presence of unfolded proteins in the endoplasmic reticulum (ER) or other ER-related stress; results in changes in the regulation of transcription and translation.
definition_reference: GOC:mah
definition_reference: PMID:12042763
comment: Note that this term should not be confused with 'response to unfolded protein ; GO:0006986', which refers to any cellular response to the presence of unfolded proteins anywhere in the cell. Also see 'ER-associated protein catabolism ; GO:0030433'.
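
A definitions file in this layout could be read along the following lines. The Python sketch assumes that each record starts at a term: line and that every field sits on a single line in the form key: value. The source does not state how records are separated, so this is only one plausible reading. The name parse_definitions is hypothetical. Values are stored in lists because definition_reference may appear more than once.

```python
def parse_definitions(lines):
    """Group 'key: value' lines into one dict of lists per definition."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and front matter comments
        # Split on the first colon only; IDs such as GO:0030968 keep theirs.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "term":
            records.append({})  # assumed: 'term' opens a new record
        if records:
            records[-1].setdefault(key, []).append(value)
    return records


sample = [
    "term: unfolded protein response",
    "goid: GO:0030968",
    "definition: The series of molecular signals generated as a consequence "
    "of the presence of unfolded proteins in the endoplasmic reticulum (ER) "
    "or other ER-related stress; results in changes in the regulation of "
    "transcription and translation.",
    "definition_reference: GOC:mah",
    "definition_reference: PMID:12042763",
]
records = parse_definitions(sample)
print(records[0]["goid"], records[0]["definition_reference"])
```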
