This Perl script is a quality control check that validates the format of gene association files and partially checks the data they contain. It is run on all gene association files before they are loaded into the GO database, and the results of this filtering step are reported back to the submitting group.
This script is intended to be generic and to enforce the standards defined by the GO Consortium. Use this script to validate your gene association file before committing it to the archive. The checks provided define the minimum standard format for the repository. Suggestions are welcome for enhancements to this process. Download the script directly, via the GO web CVS interface, or from the directory go/software/utilities in the GO CVS repository.
Submitted gene association files are committed to the GO CVS repository into the gene association file submissions directory (go/gene-associations/submission/). The checking and filtering script is run nightly on any newly deposited files by the GO Database staff at Stanford. The output of the script is placed in the gene association file directory (go/gene-associations/) and subsequently used to load the GO database.
The input file is checked for the following types of errors. If a row of the gene association file is found to contain an error it is removed from the final output file.
The script checks each line for the correct number of columns and the cardinality of each column, looks for leading or trailing whitespace, and performs a number of specific checks on the data in particular columns.
These specific checks include use of the defined terms for Qualifier, Evidence, Aspect, and DB Object type columns. The DB:Reference, Taxon and GO ID columns are checked for minimal form. The Date is also verified to match the YYYYMMDD format.
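The per-row checks described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl script's actual code: the 15-column layout, the column positions, the small evidence-code subset, and the name check_row are all assumptions made for the example.

```python
import re
from datetime import datetime

EXPECTED_COLUMNS = 15                    # assumption: 15-column row layout
ASPECTS = {"P", "F", "C"}                # process, function, component
EVIDENCE_CODES = {"IDA", "IEA", "IMP", "ISS", "TAS"}  # illustrative subset only

GO_ID_RE = re.compile(r"GO:\d{7}")
TAXON_RE = re.compile(r"taxon:\d+")
DATE_RE = re.compile(r"\d{8}")


def check_row(line):
    """Return a list of problems found in one tab-delimited row."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, found {len(cols)}"]

    problems = []
    for number, value in enumerate(cols, start=1):
        if value != value.strip():
            problems.append(f"column {number}: leading or trailing whitespace")

    if not GO_ID_RE.fullmatch(cols[4]):
        problems.append("column 5: malformed GO ID")
    if cols[6] not in EVIDENCE_CODES:
        problems.append("column 7: unknown evidence code")
    if cols[8] not in ASPECTS:
        problems.append("column 9: unknown aspect")
    if not all(TAXON_RE.fullmatch(t) for t in cols[12].split("|")):
        problems.append("column 13: malformed taxon")

    date = cols[13]
    if not DATE_RE.fullmatch(date):
        problems.append("column 14: date is not YYYYMMDD")
    else:
        try:
            datetime.strptime(date, "%Y%m%d")
        except ValueError:
            problems.append("column 14: date is not a real calendar date")
    return problems
```

A row that yields an empty list would be kept; any other row would be reported and dropped from the output.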
Column 1 and all other database abbreviations used within the gene association file are checked to confirm that each abbreviation (case insensitive) is defined in the GO database cross-references.
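A case-insensitive abbreviation lookup of this kind might look like the Python sketch below. It assumes the cross-reference file lists each database on a line of the form "abbreviation: NAME"; that layout and the function names are assumptions for illustration, not a description of the script's own parser.

```python
def load_abbreviations(lines):
    """Collect database abbreviations, lowercased for case-insensitive lookup."""
    known = set()
    for line in lines:
        key, sep, value = line.partition(":")
        # assumption: abbreviations appear as "abbreviation: NAME" lines
        if sep and key.strip() == "abbreviation":
            known.add(value.strip().lower())
    return known


def is_known_db(abbreviation, known):
    """True if the abbreviation is defined, ignoring case."""
    return abbreviation.strip().lower() in known
```

For a value such as a DB:Reference entry, the prefix before the first colon would be the part passed to is_known_db.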
The GO IDs mentioned in the file are checked, using the current gene_ontology.obo file. Rows with obsolete GO IDs are removed, as well as any row containing an invalid GO ID.
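As a rough illustration of this check, the Python sketch below reads [Term] stanzas from OBO-format lines and sorts their IDs into current and obsolete sets, using the standard "id:" and "is_obsolete: true" tags. The function names are hypothetical, and the script itself may process the ontology file differently.

```python
def load_go_ids(obo_lines):
    """Return (current_ids, obsolete_ids) from the [Term] stanzas of an OBO file."""
    current, obsolete = set(), set()
    in_term = False
    term_id = None
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # a new stanza begins; only [Term] stanzas define GO terms
            in_term = (line == "[Term]")
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
            current.add(term_id)
        elif in_term and term_id and line == "is_obsolete: true":
            current.discard(term_id)
            obsolete.add(term_id)
    return current, obsolete


def classify_go_id(go_id, current, obsolete):
    """Classify an annotated GO ID as 'ok', 'obsolete', or 'invalid'."""
    if go_id in current:
        return "ok"
    if go_id in obsolete:
        return "obsolete"
    return "invalid"
```

Rows whose GO ID classifies as anything other than "ok" would be removed from the output.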
All IEA annotations that are over one year old are removed. This filtering step uses the date of annotation stated in column 14, so the accuracy of the information in the date column is very important.
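The age test on IEA annotations could be expressed along the lines of the following Python sketch. Treating "one year" as 365 days is an assumption, as is the function name; the script's exact cutoff is not stated here.

```python
from datetime import date, datetime, timedelta

ONE_YEAR = timedelta(days=365)   # assumption: "one year" taken as 365 days


def is_stale_iea(evidence, annotation_date, today=None):
    """True for an IEA annotation whose YYYYMMDD date is more than a year old."""
    if evidence != "IEA":
        return False
    if today is None:
        today = date.today()
    annotated = datetime.strptime(annotation_date, "%Y%m%d").date()
    return today - annotated > ONE_YEAR
```

Only the evidence code and the date from column 14 are needed, which is why a wrong date can cause a valid annotation to be dropped or a stale one to be kept.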
A major component of the filtering is the requirement that particular taxon IDs can only be included within the association files provided by specific projects. For example, the taxon ID for Mus musculus (taxon:10090) is limited to the file provided by the Mouse Genome Informatics project. Please see the list of species and relevant database groups for more details.
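One way to picture the taxon restriction is a lookup from restricted taxon IDs to the single project allowed to annotate them, as in the Python sketch below. Only the Mus musculus entry comes from the text above; the project label "mgi", the table name, and the function name are hypothetical.

```python
# assumption: one owning project per restricted taxon; "mgi" is an illustrative label
RESTRICTED_TAXA = {
    "taxon:10090": "mgi",   # Mus musculus, Mouse Genome Informatics
}


def taxon_allowed(taxon, project, restricted=RESTRICTED_TAXA):
    """True if this project's file may contain annotations for the taxon."""
    owner = restricted.get(taxon)
    # unrestricted taxa are allowed in any file
    return owner is None or owner == project.lower()
```

Under this sketch a mouse annotation found in another group's file would be rejected, while taxa that are not in the table pass through unchanged.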
Usage help for the script is available with the -h option. The script is designed to be run from the go/gene-associations/submission directory within a GO CVS sandbox. By default the script needs the go/doc/GO.xrf_abbs and go/ontology/gene_ontology_edit.obo files. The input gene association file is read from STDIN by default, or from the specified file defined with the -i option.
A. check a file for any errors, obsolete GO IDs or old IEA annotations
filter-gene-association.pl -i gene_association.sgd.gz
B. filter any problems and output the validated lines, including headers
filter-gene-association.pl -i gene_association.fb.gz -w > filtered-output
C. check a file with taxon ID checking turned off, and write the bad lines to STDOUT
filter-gene-association.pl -i gene_association.fb.gz -p nocheck -e > bad-lines
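The examples above pass gzip-compressed files with -i, and the script otherwise reads STDIN. A minimal Python sketch of that style of input handling follows; detecting compression from a ".gz" suffix is an assumption of the sketch, and open_input is a hypothetical helper rather than part of the script.

```python
import gzip
import sys


def open_input(path=None):
    """Return a text stream: STDIN when no path is given, else the named file."""
    if path is None:
        return sys.stdin
    if path.endswith(".gz"):
        # assumption: compressed input is recognised by its .gz suffix
        return gzip.open(path, "rt")
    return open(path, "r")
```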
The script is written in basic Perl and should be portable to most systems. It has been tested on Mac OS X with Perl 5.8.1 and on Solaris with Perl 5.6.1 and greater.
Submitted by Mike Cherry, 2005-10-19
* GO:0050896 : response to stimulus
* GO:0007610 : behavior
* GO:0051716 : cellular response to stimulus
* GO:0009628 : response to abiotic stimulus
* GO:0009607 : response to biotic stimulus
* GO:0042221 : response to chemical stimulus
* GO:0009719 : response to endogenous stimulus
* GO:0009605 : response to external stimulus
* GO:0006950 : response to stress
* GO:0048585 : negative regulation of response to stimulus
* GO:0048584 : positive regulation of response to stimulus
* GO:0048583 : regulation of response to stimulus