GO Annotation File Format Guide
This page documents the file formats used to store gene associations (annotations), that is, data capturing the attributes of gene products using terms from the Gene Ontology. It also describes the QC checks run on the data submitted by members of the GO Consortium. For more general information on annotation, please see the GO annotation guide.
Annotation File Format Guide
The Gene Ontology Consortium stores annotation data, the representation of gene product attributes using GO terms, in tab-delimited plain text files. Each line in the file represents a single association between a gene product and a GO term, together with an evidence code and a reference that supports the association.
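As a minimal sketch of reading such a file, the Python function below splits each annotation line into its tab-separated fields. The function name `iter_annotation_rows` is hypothetical, and skipping lines that begin with `!` assumes that header and comment lines are marked that way; this is not the Consortium's own parser.

```python
def iter_annotation_rows(lines):
    """Yield each annotation line as a list of tab-separated fields."""
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and lines starting with "!" carry no annotation.
        if not line or line.startswith("!"):
            continue
        yield line.split("\t")
```

Any iterable of text lines can be passed in, for example an open file handle.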
There are two annotation file formats: GAF 2.0, the format currently used by the GO Consortium, and GAF 1.0, the older format, which captures slightly less information.
GAF 2.0
The primary format used for annotation files by the GO Consortium.
GAF 1.0
This format is deprecated (as of June 2010), but the GO Consortium continues to provide files in this format for users who have not yet switched to GAF 2.0.
- GAF 1.0 format specification
Annotation Quality Control
The GO Consortium runs a number of automated checks to verify the quality of the annotations submitted to the GO database. These checks are detailed on the annotation QC checks page.
Annotation QC script
This Perl script performs a subset of the quality control checks listed on the annotation QC checks page. It validates the format of gene association files and partially checks the data they contain. The script is run on all gene association files before they are loaded into the GO database, and the results of this filtering step are reported back to the submitting group.
This script is intended to be generic and to enforce the standards defined by the GO Consortium. Use this script to validate your gene association file before committing it to the archive. Suggestions are welcome for enhancements to this process. Download the script directly, via the GO web CVS interface, or from the directory go/software/utilities in the GO CVS repository.
Submitted gene association files are committed to the GO CVS repository into the gene association file submissions directory (go/gene-associations/submission/). The checking and filtering script is run nightly on any newly deposited files by the GO Database staff at Stanford. The output of the script is placed in the gene association file directory (go/gene-associations/) and subsequently used to load the GO database.
Errors Checked
The input file is checked for the following types of errors. If a row of the gene association file is found to contain an error, it is removed from the final output file.
The script checks each line for the correct number of columns and the cardinality of each column, looks for leading or trailing whitespace, and performs a number of specific checks on the data in particular columns.
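The column-count and whitespace checks can be sketched in Python as follows. This is an illustration rather than the Perl script's actual logic: the function name is hypothetical, the column counts (17 for GAF 2.0, 15 for GAF 1.0) are an assumption to confirm against the format specifications, and the cardinality check is not covered.

```python
# Assumption: GAF 2.0 rows have 17 tab-separated columns and GAF 1.0 rows have 15.
EXPECTED_COLUMNS = {"2.0": 17, "1.0": 15}


def check_row_shape(line, gaf_version="2.0"):
    """Return a list of problems found in one annotation line (empty if none)."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    expected = EXPECTED_COLUMNS[gaf_version]
    if len(fields) != expected:
        problems.append(f"expected {expected} columns, found {len(fields)}")
    for number, value in enumerate(fields, start=1):
        # A field that changes when stripped has leading or trailing whitespace.
        if value != value.strip():
            problems.append(f"column {number} has leading or trailing whitespace")
    return problems
```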
These specific checks include use of the defined terms for Qualifier, Evidence, Aspect, and DB Object type columns. The DB:Reference, Taxon and GO ID columns are checked for minimal form. The Date is also verified to match the YYYYMMDD format.
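A rough Python sketch of the date and minimal-form checks is shown below. The patterns are assumptions (a GO ID as `GO:` plus seven digits, a taxon as `taxon:` plus digits, with multiple taxon IDs separated by `|`), and the function names are hypothetical; the format specification is the authority on what the script actually accepts.

```python
import re
from datetime import datetime

# Assumed minimal forms; the authoritative patterns are in the format specification.
GO_ID_PATTERN = re.compile(r"GO:[0-9]{7}")
TAXON_PATTERN = re.compile(r"taxon:[0-9]+")


def is_valid_date(value):
    """True if value is eight digits that form a real calendar date (YYYYMMDD)."""
    if not re.fullmatch(r"[0-9]{8}", value):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def has_minimal_go_id_form(value):
    """True if value looks like a GO ID; does not check that the term exists."""
    return GO_ID_PATTERN.fullmatch(value) is not None


def has_minimal_taxon_form(value):
    """True if every taxon entry in the column has the assumed form."""
    # Assumption: multiple taxon IDs, if present, are separated by "|".
    return all(TAXON_PATTERN.fullmatch(part) for part in value.split("|"))
```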
Column 1 and all other database abbreviations used within the gene association file are checked (case insensitively) to confirm that each abbreviation is defined within the GO database cross-references.
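The case-insensitive comparison might look like the following Python sketch. The function name is hypothetical, and loading the defined abbreviations from the cross-references file is left out because its layout is not described here.

```python
def unknown_abbreviations(used, defined):
    """Return the abbreviations in `used` that are not in `defined`, ignoring case."""
    defined_lower = {abbreviation.lower() for abbreviation in defined}
    return sorted({a for a in used if a.lower() not in defined_lower})
```

Rows whose abbreviations appear in the returned list would be the ones flagged as errors.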
The GO IDs mentioned in the file are checked, using the current gene_ontology.obo file. Rows with obsolete GO IDs are removed, as well as any row containing an invalid GO ID.
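One way to sketch this check in Python is to collect current and obsolete term IDs from the OBO file and classify each GO ID against them. The sketch assumes that stanzas are separated by blank lines, that each `[Term]` stanza has an `id:` line, and that obsolete terms carry `is_obsolete: true`. The function names are hypothetical, and alternate IDs are ignored.

```python
def classify_go_ids(obo_text):
    """Return (current_ids, obsolete_ids) parsed from the [Term] stanzas of an OBO file."""
    current, obsolete = set(), set()
    # Assumption: stanzas are separated by blank lines.
    for stanza in obo_text.split("\n\n"):
        lines = [line.strip() for line in stanza.strip().splitlines()]
        if not lines or lines[0] != "[Term]":
            continue
        term_id = next((line[4:].strip() for line in lines if line.startswith("id: ")), None)
        if term_id is None:
            continue
        if "is_obsolete: true" in lines:
            obsolete.add(term_id)
        else:
            current.add(term_id)
    return current, obsolete


def go_id_status(go_id, current, obsolete):
    """Classify a GO ID as 'ok', 'obsolete', or 'invalid' (not found in the ontology)."""
    if go_id in current:
        return "ok"
    if go_id in obsolete:
        return "obsolete"
    return "invalid"
```

Under this sketch, only rows whose GO ID is classified as `ok` would be kept.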
All IEA annotations that are over one year old are removed. This filtering step uses the date of annotation stated in column 14, so the date column must be accurate for the filter to work correctly.
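The age filter can be sketched in Python as below. Treating "one year" as 365 days is an assumption, as is the hypothetical function name; the evidence code and the column 14 date are passed in as strings.

```python
from datetime import date, datetime, timedelta


def is_stale_iea(evidence, annotation_date, today=None):
    """True if the annotation is IEA and its YYYYMMDD date is more than a year old."""
    if evidence != "IEA":
        return False
    if today is None:
        today = date.today()
    annotated = datetime.strptime(annotation_date, "%Y%m%d").date()
    # Assumption: "over one year old" means more than 365 days.
    return today - annotated > timedelta(days=365)
```

Passing `today` explicitly makes the check reproducible, for example `is_stale_iea("IEA", "20090101", today=date(2011, 1, 1))`.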
Taxon IDs
A major component of the filtering is the requirement that particular taxon IDs can only be included within the association files provided by specific projects. For example, the taxon ID for Mus musculus (taxon:10090) is limited to the file provided by the Mouse Genome Informatics project. Please see the list of species and relevant database groups for more details.
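A taxon restriction of this kind could be expressed as a lookup table, as in the Python sketch below. The table holds only the Mus musculus example from the text; the project key `mgi`, the table name, and the function name are hypothetical, and the real list of species and groups is maintained separately.

```python
# Illustrative table: taxon ID -> the only project allowed to submit it.
# Assumption: "mgi" stands for the Mouse Genome Informatics project's file.
RESTRICTED_TAXA = {"taxon:10090": "mgi"}


def taxon_allowed(taxon_id, submitting_project, restricted=RESTRICTED_TAXA):
    """True if the project may include this taxon ID in its association file."""
    owner = restricted.get(taxon_id)
    # Taxon IDs absent from the table are treated as unrestricted.
    return owner is None or owner == submitting_project.lower()
```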
Script command line options
Usage help for the script is available with the -h option. The script is designed to be run from the go/gene-associations/submission directory within a GO CVS sandbox. By default the script needs the go/doc/GO.xrf_abbs and go/ontology/gene_ontology_edit.obo files. The input gene association file is read from STDIN by default, or from the file specified with the -i option.
Usage
A. check a file for any errors, obsolete GO IDs or old IEA annotations
filter-gene-association.pl -i gene_association.sgd.gz
B. filter any problems and output the validated lines, including headers
filter-gene-association.pl -i gene_association.fb.gz -w > filtered-output
C. check a file with taxon ID checking turned off, and write the bad lines to STDOUT
filter-gene-association.pl -i gene_association.fb.gz -p nocheck -e > bad-lines
System requirements
The script is written in basic Perl and should be portable to most systems. It has been tested on Mac OS X with Perl 5.8.1 and on Solaris with Perl 5.6.1 and greater.
Submitted by Mike Cherry, 2005-10-19; script updated to include new checks on 2011-01-01