Annotation Quality Control Checks

This Perl script is provided as a quality control check in an effort to validate the format and to partially check the data provided within the gene association files. This script is used on all gene association files before they are loaded into the GO database. The results of this filtering step are reported back to the submitting group.

This script is intended to be generic and to enforce the standards defined by the GO Consortium. Use this script to validate your gene association file before committing it to the archive. The checks provided define the minimum standard format for the repository. Suggestions are welcome for enhancements to this process. Download the script directly, via the GO web CVS interface, or from the directory go/software/utilities in the GO CVS repository.

Submitted gene association files are committed to the GO CVS repository into the gene association file submissions directory (go/gene-associations/submission/). The checking and filtering script is run nightly on any newly deposited files by the GO Database staff at Stanford. The output of the script is placed in the gene association file directory (go/gene-associations/) and subsequently used to load the GO database.

Errors Checked

The input file is checked for the following types of errors. If a row of the gene association file is found to contain an error it is removed from the final output file.

The script checks each line for the correct number of columns and the cardinality of each column, looks for leading or trailing whitespace, and performs a number of specific checks on the data in particular columns.

These specific checks include use of the defined terms for Qualifier, Evidence, Aspect, and DB Object type columns. The DB:Reference, Taxon and GO ID columns are checked for minimal form. The Date is also verified to match the YYYYMMDD format.
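The format checks described above can be sketched in a few lines of Python. This is an illustration only, not the Perl script itself. The 15-column layout, the column positions other than the date (column 14), the vocabulary lists and the name check_line are all assumptions made for the example.

```python
import re
from datetime import datetime

# Illustrative vocabularies (assumptions); the script's own lists may differ.
QUALIFIERS = {"NOT", "contributes_to", "colocalizes_with"}
EVIDENCE_CODES = {"IC", "IDA", "IEA", "IEP", "IGI", "IMP",
                  "IPI", "ISS", "NAS", "ND", "RCA", "TAS"}
ASPECTS = {"P", "F", "C"}
OBJECT_TYPES = {"gene", "transcript", "protein", "protein_structure", "complex"}
EXPECTED_COLUMNS = 15  # assumption: 15 tab-separated columns per row

GO_ID_RE = re.compile(r"GO:\d{7}")
TAXON_RE = re.compile(r"taxon:\d+")
REFERENCE_RE = re.compile(r"[^:\s|]+:[^\s|]+")  # minimal DB:Reference form


def check_line(line):
    """Return a list of error messages for one annotation row."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, found {len(cols)}"]

    errors = []
    for number, value in enumerate(cols, start=1):
        if value != value.strip():
            errors.append(f"column {number}: leading or trailing whitespace")

    qualifier, go_id, reference, evidence = cols[3], cols[4], cols[5], cols[6]
    aspect, object_type, taxon, date = cols[8], cols[11], cols[12], cols[13]

    # The qualifier is optional; multiple values are pipe-separated.
    for value in filter(None, qualifier.split("|")):
        if value not in QUALIFIERS:
            errors.append(f"column 4: unknown qualifier {value!r}")
    if not GO_ID_RE.fullmatch(go_id):
        errors.append("column 5: GO ID is not of the form GO:0000000")
    if not all(REFERENCE_RE.fullmatch(r) for r in reference.split("|")):
        errors.append("column 6: reference is not of the form DB:Reference")
    if evidence not in EVIDENCE_CODES:
        errors.append(f"column 7: unknown evidence code {evidence!r}")
    if aspect not in ASPECTS:
        errors.append(f"column 9: unknown aspect {aspect!r}")
    if object_type not in OBJECT_TYPES:
        errors.append(f"column 12: unknown DB Object type {object_type!r}")
    if not all(TAXON_RE.fullmatch(t) for t in taxon.split("|")):
        errors.append("column 13: taxon is not of the form taxon:12345")
    if not re.fullmatch(r"\d{8}", date):
        errors.append("column 14: date is not YYYYMMDD")
    else:
        try:
            datetime.strptime(date, "%Y%m%d")
        except ValueError:
            errors.append("column 14: date is not a valid calendar date")
    return errors


# A made-up row; the identifiers are placeholders, not real records.
example = "\t".join(["XDB", "ID0001", "abc1", "", "GO:0005515", "PMID:12345",
                     "IPI", "XDB:ID0002", "F", "", "", "gene",
                     "taxon:10090", "20051019", "XDB"])
print(check_line(example))
```

A row that yields any message would be dropped from the output, following the behaviour described above.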

Column 1 and all other database abbreviations used within the gene association file are checked to confirm that each abbreviation (case insensitive) is defined within the GO database cross-references.

The GO IDs mentioned in the file are checked, using the current gene_ontology.obo file. Rows with obsolete GO IDs are removed, as well as any row containing an invalid GO ID.
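As a rough illustration of the GO ID check, the sketch below reads OBO-format text, collects the IDs of [Term] stanzas, and separates those flagged with is_obsolete: true. It is not the Perl script's implementation. The function names are hypothetical, the sample stanza for GO:9999999 is made up, and alt_id lines are ignored for brevity.

```python
def load_go_ids(obo_text):
    """Return (current_ids, obsolete_ids) found in OBO-format text."""
    current, obsolete = set(), set()
    term_id = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id:"):
            term_id = line[len("id:"):].strip()
            current.add(term_id)
        elif in_term and term_id and line.startswith("is_obsolete:"):
            if line.split(":", 1)[1].strip() == "true":
                current.discard(term_id)
                obsolete.add(term_id)
    return current, obsolete


def classify_go_id(go_id, current, obsolete):
    """Return 'ok', 'obsolete' or 'invalid' for a GO ID."""
    if go_id in current:
        return "ok"
    if go_id in obsolete:
        return "obsolete"
    return "invalid"


# Tiny sample; GO:9999999 is a made-up ID used only to show the obsolete case.
SAMPLE_OBO = """\
[Term]
id: GO:0005515
name: protein binding

[Term]
id: GO:9999999
name: made-up obsolete term
is_obsolete: true
"""

current_ids, obsolete_ids = load_go_ids(SAMPLE_OBO)
for go_id in ("GO:0005515", "GO:9999999", "GO:1234567"):
    print(go_id, classify_go_id(go_id, current_ids, obsolete_ids))
```

Rows classified as obsolete or invalid correspond to the rows removed by the check described above.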

All IEA annotations that are over one year old are removed. This filtering step uses the date of annotation stated in column 14, so the validity of the information in the date column is very important.
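The age filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's own logic: the function name is hypothetical, and "over one year old" is interpreted here as more than 365 days before the day the check is run, which may not match the exact cutoff the script uses.

```python
from datetime import date, datetime, timedelta


def is_stale_iea(evidence, date_str, today):
    """Return True for an IEA annotation dated more than 365 days before today.

    date_str is the column 14 value in YYYYMMDD form.
    """
    if evidence != "IEA":
        return False
    annotated = datetime.strptime(date_str, "%Y%m%d").date()
    return today - annotated > timedelta(days=365)


run_date = date(2005, 10, 19)  # fixed date so the example is reproducible
print(is_stale_iea("IEA", "20040101", run_date))
print(is_stale_iea("IEA", "20050601", run_date))
print(is_stale_iea("IDA", "20000101", run_date))
```

Passing the run date in as a parameter, rather than reading the clock inside the function, keeps the check easy to test.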

Taxon IDs

A major component of the filtering is the requirement that particular taxon IDs can only be included within the association files provided by specific projects. For example, the taxon ID for Mus musculus (taxon:10090) is limited to the file provided by the Mouse Genome Informatics project. Please see the list of species and relevant database groups for more details.
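One way to picture the taxon restriction is a lookup table from taxon ID to the project allowed to submit it. In the sketch below only the mouse entry comes from the text above. The project key "mgi", the function name, and the choice to treat unlisted taxon IDs as unrestricted are assumptions for illustration.

```python
# Taxon ID -> project allowed to submit it. Only taxon:10090 is taken from
# the documentation; the key "mgi" is a hypothetical project label.
TAXON_OWNER = {
    "taxon:10090": "mgi",
}


def taxon_allowed(taxon, project):
    """Return True if the given project may submit rows for this taxon.

    Taxon IDs absent from the table are treated as unrestricted here,
    which is an assumption of this sketch.
    """
    owner = TAXON_OWNER.get(taxon)
    return owner is None or owner == project.lower()


print(taxon_allowed("taxon:10090", "mgi"))
print(taxon_allowed("taxon:10090", "sgd"))
```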

Script command line options

Usage help for the script is available with the -h option. The script is designed to be run from the go/gene-associations/submission directory within a GO CVS sandbox. By default the script needs the go/doc/GO.xrf_abbs and go/ontology/gene_ontology_edit.obo files. The input gene association file is read from STDIN by default, or from the file specified with the -i option.

Usage

A. check a file for any errors, obsolete GO IDs or old IEA annotations

filter-gene-association.pl -i gene_association.sgd.gz

B. filter any problems and output the validated lines, including headers

filter-gene-association.pl -i gene_association.fb.gz -w > filtered-output

C. check file without the taxid checking on, and write the bad lines to STDOUT

filter-gene-association.pl -i gene_association.fb.gz -p nocheck -e > bad-lines

System requirements

The script is written using basic Perl and should be portable to most systems. It has been tested on Mac OS X with Perl 5.8.1 and on Solaris with Perl 5.6.1 and greater.

Submitted by Mike Cherry, 2005-10-19


Additional checks instated in 2010

The SQL included on these pages can be used to directly query the GO database. Copy and paste the SQL into the text box located on the AmiGO Goose page: http://berkeleybop.org/goose
  1. No use of the 'NOT' qualifier with 'protein binding'; GO:0005515
     Justification: Even if an identifier is available in the 'with' column, a qualifier only informs on the GO term; it cannot instruct users to restrict the annotation to just the protein identified in the 'with' column. An annotation applying GO:0005515 with the NOT qualifier therefore implies that the annotated protein cannot bind anything, which is such a wide-reaching statement that few curators would want to make it. This rule only applies to GO:0005515. Children of this term can be qualified with NOT, because further information on the type of binding is then supplied in the GO term. For example, NOT + 'GO:0051529 NFAT4 protein binding' would be fine, as the negative binding statement only applies to the NFAT4 protein.
  2. Annotations to 'protein binding'; GO:0005515 should be made with IPI, and the interactor should be in the 'with' field
     Justification: Annotations that apply GO:0005515 'protein binding' with the TAS, NAS, IC, IMP, IGI and IDA evidence codes are not informative. None of these evidence codes allow protein accessions to be included in the 'with' field, and as most proteins need to interact with another protein in order to function, annotations that do not provide details of the interactor are not very informative. This is less of a problem with child terms of protein binding, where the type of protein is identified in the GO term name.
  3. Reciprocal annotations for protein binding should be made
     Justification: When annotating to terms that are descendants of protein binding, and when the curator can supply the accession of the interacting protein, it is essential that reciprocal annotations are available: if you say protein A binds protein B, then you also need a second annotation stating that protein B binds protein A.
     This will be a soft QC: a script will make these inferences, and it is up to each MOD to evaluate and include the inferences in their GAF/DB.
  4. No ISS annotations to 'protein binding'; GO:0005515
     Justification: Take an example annotation: protein_A GO:0005515 (protein binding) IPI PMID:12345 with = protein_X. This annotation line can be interpreted as: protein_A was found to carry out the 'protein binding' activity in PMID:12345, and this function was Inferred from the results of a Physical Interaction (IPI) assay, which involved protein_X. However, if we would like to transfer this annotation to protein_A's ortholog 'protein_B', the ISS annotation that would be created would be: protein_B GO:0005515 (protein binding) ISS GO_REF:curator_judgement with = protein_A. This is interpreted as: 'It is inferred that protein_B carries out protein binding activity due to its sequence similarity (which was curator determined) with protein_A (which was experimentally shown to carry out 'protein binding')'. The ISS annotation will therefore not display the accession of the interacting protein X. Such an annotation display can be confusing, as the value in the 'with' column just provides further information on why the ISS/IPI or IGI annotation was created. This means that an ISS projection from 'protein binding' is not particularly useful, as you are only really telling the user that you think a homologous protein binds a protein, based on overall sequence similarity. This rule only applies to GO:0005515, as descendant terms such as 'GO:0048273 mitogen-activated protein kinase p38 binding' used as ISS annotations are informative: the GO term name contains far more specific information as to the identity of the interactor.
  5. Only use the IEP evidence code with terms from the Biological Process ontology
     Justification: The IEP evidence code is used where process involvement is inferred from the timing or location of expression of a gene, particularly when comparing a gene that is not yet characterized with the timing or location of expression of genes known to be involved in a particular process. This type of annotation is only suitable with terms from the Biological Process ontology.
  6. Curators should not use the IPI evidence code with catalytic activity molecular function terms
     Justification: The IPI (Inferred from Physical Interaction) evidence code is used where an annotation can be supported by interaction evidence between the gene product of interest and another molecule (see evidence code documentation: http://www.geneontology.org/GO.evidence.shtml#ipi). While the IPI evidence code is frequently used to support annotations to terms that are children of GO:0005488; binding, the Binding WG thinks it unlikely that enough information can be obtained from a binding interaction to support an annotation to a term that is a child of GO:0003824; catalytic activity. Such IPI annotations to child terms of GO:0003824 may need to be revisited and corrected.
  7. Annotations to high-level 'response to' terms should not be made using any evidence code
     The following high-level terms have been deemed not usable:
      * GO:0050896 : response to stimulus
      * GO:0007610 : behavior
      * GO:0051716 : cellular response to stimulus
      * GO:0009628 : response to abiotic stimulus
      * GO:0009607 : response to biotic stimulus
      * GO:0042221 : response to chemical stimulus
      * GO:0009719 : response to endogenous stimulus
      * GO:0009605 : response to external stimulus
      * GO:0006950 : response to stress
      * GO:0048585 : negative regulation of response to stimulus
      * GO:0048584 : positive regulation of response to stimulus
      * GO:0048583 : regulation of response to stimulus
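Several of the checks listed above need only the fields of a single annotation, so they can be sketched without access to the ontology graph. The Python below is an illustration under stated assumptions, not the consortium's implementation: the dictionary keys and function name are hypothetical, and the checks that depend on term ancestry (reciprocal annotations, IPI with children of catalytic activity) are left out.

```python
PROTEIN_BINDING = "GO:0005515"

# The high-level terms listed above as not usable for annotation.
HIGH_LEVEL_TERMS = {
    "GO:0050896", "GO:0007610", "GO:0051716", "GO:0009628",
    "GO:0009607", "GO:0042221", "GO:0009719", "GO:0009605",
    "GO:0006950", "GO:0048585", "GO:0048584", "GO:0048583",
}


def check_2010_rules(ann):
    """Return a list of problems for one annotation.

    ann is a dict with the hypothetical keys 'go_id', 'qualifier',
    'evidence', 'with_field' and 'aspect' (P, F or C).
    """
    problems = []
    qualifiers = set(filter(None, ann["qualifier"].split("|")))

    if ann["go_id"] == PROTEIN_BINDING:
        if "NOT" in qualifiers:
            problems.append("NOT qualifier used with protein binding")
        if ann["evidence"] != "IPI" or not ann["with_field"]:
            problems.append("protein binding needs IPI and a 'with' value")

    if ann["evidence"] == "IEP" and ann["aspect"] != "P":
        problems.append("IEP used outside Biological Process")

    if ann["go_id"] in HIGH_LEVEL_TERMS:
        problems.append("annotation to a high-level term")

    return problems


# Made-up annotation used only to exercise the checks.
example = {"go_id": "GO:0005515", "qualifier": "NOT", "evidence": "IDA",
           "with_field": "", "aspect": "F"}
for problem in check_2010_rules(example):
    print(problem)
```

Returning a list of messages, rather than a single pass or fail flag, lets one annotation report every check it trips.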