The following basic checks ensure that submitted gene association files conform to the GAF spec, and come from the original GAF check script.
SELECT gene_product.symbol, CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref, IF(association.is_not=1,"NOT","") AS 'not', term.acc, term.name, evidence.code, db.name AS assigned_by FROM association INNER JOIN gene_product ON association.gene_product_id = gene_product.id INNER JOIN dbxref AS gpx ON gene_product.dbxref_id = gpx.id INNER JOIN term ON association.term_id = term.id INNER JOIN evidence ON association.id = evidence.association_id INNER JOIN db ON association.source_db_id=db.id WHERE association.is_not='1' AND term.acc = 'GO:0005515'
/^(.*?\t){3}not\tGO:0005515\t/i
Even if an identifier is available in the 'with' column, a qualifier only informs on the GO term; it cannot instruct users to restrict the annotation to just the protein identified in the 'with' column. An annotation to protein binding ; GO:0005515 with the not qualifier therefore implies that the annotated protein cannot bind anything.
This is such a wide-reaching statement that few curators would want to make it.
This rule only applies to GO:0005515; children of this term can be qualified with not, as further information on the type of binding is then supplied in the GO term; e.g. not + NFAT4 protein binding ; GO:0051529 would be fine, as the negative binding statement only applies to the NFAT4 protein.
For more information, see the binding guidelines on the GO wiki.
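As an illustration, the regular expression for this rule can be applied to individual GAF lines from Python. This is a minimal sketch, not the original check script: the function name and the sample lines are hypothetical, and the lines are truncated after the evidence column for brevity.

```python
import re

# Same pattern as the rule's regular expression: three leading columns, then a
# qualifier of "not", then GO:0005515 in column 5 (case-insensitive).
NOT_PROTEIN_BINDING = re.compile(r"^(.*?\t){3}not\tGO:0005515\t", re.IGNORECASE)

def violates_not_protein_binding(gaf_line: str) -> bool:
    """Return True if the line qualifies protein binding ; GO:0005515 with NOT."""
    return NOT_PROTEIN_BINDING.search(gaf_line) is not None

# Hypothetical sample lines (identifiers are made up).
flagged = "UniProtKB\tP00001\tabc1\tNOT\tGO:0005515\tPMID:123456\tIPI"
allowed = "UniProtKB\tP00001\tabc1\tNOT\tGO:0051529\tPMID:123456\tIPI"
print(violates_not_protein_binding(flagged), violates_not_protein_binding(allowed))
```

The second line passes because the rule applies only to GO:0005515 itself, not to its children.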
SELECT gene_product.symbol, CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref, IF(association.is_not=1,"NOT","") AS 'not', term.acc, term.name, evidence.code, db.name AS assigned_by FROM association INNER JOIN gene_product ON association.gene_product_id = gene_product.id INNER JOIN dbxref AS gpx ON gene_product.dbxref_id = gpx.id INNER JOIN term ON association.term_id = term.id INNER JOIN evidence ON association.id = evidence.association_id INNER JOIN db ON association.source_db_id=db.id WHERE evidence.code IN ('NAS','TAS','IDA','IMP','IGC','IEP','ND','IC','RCA','EXP', 'IGI') AND (term.acc = 'GO:0005515' OR term.acc = 'GO:0005488')
Annotations to binding ; GO:0005488 or protein binding ; GO:0005515 with the TAS, NAS, IC, IMP, IGI and IDA evidence codes are not informative as they do not allow the interacting partner to be specified. If the nature of the binding partner is known (protein or DNA for example), an appropriate child term of binding ; GO:0005488 should be chosen for the annotation. In the case of chemicals, ChEBI IDs can go in the 'with' column. Children of protein binding ; GO:0005515 where the type of protein is identified in the GO term name do not need further specification.
For more information, see the binding guidelines on the GO wiki.
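The same test can be run directly against a tab-separated GAF line rather than the database. The sketch below assumes the standard GAF column order (GO ID in column 5, evidence code in column 7); the set of codes is copied from the SQL query for this rule, while the function name and sample line are hypothetical.

```python
# Evidence codes and term IDs copied from the SQL query for this rule.
FLAGGED_CODES = {"NAS", "TAS", "IDA", "IMP", "IGC", "IEP", "ND", "IC", "RCA", "EXP", "IGI"}
BINDING_TERMS = {"GO:0005488", "GO:0005515"}

def violates_binding_evidence(gaf_line: str) -> bool:
    """Return True for binding / protein binding lines that use a flagged code."""
    cols = gaf_line.rstrip("\n").split("\t")
    go_id, evidence = cols[4], cols[6]  # GAF columns 5 and 7
    return go_id in BINDING_TERMS and evidence in FLAGGED_CODES

# Hypothetical sample line: protein binding supported by IDA.
line = "UniProtKB\tP00001\tabc1\t\tGO:0005515\tPMID:123456\tIDA\t\tF"
print(violates_binding_evidence(line))
```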
SELECT gene_product.symbol, CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref, IF(association.is_not=1,"NOT","") AS 'not', term.acc, term.name, term.term_type, evidence.code, db.name AS assigned_by FROM association INNER JOIN gene_product ON association.gene_product_id = gene_product.id INNER JOIN dbxref AS gpx ON gene_product.dbxref_id = gpx.id INNER JOIN term ON association.term_id = term.id INNER JOIN evidence ON association.id = evidence.association_id INNER JOIN db ON association.source_db_id=db.id WHERE evidence.code = 'IEP' AND term.term_type != 'biological_process'
The IEP evidence code is used where process involvement is inferred from the timing or location of expression of a gene, particularly when comparing a gene that is not yet characterized with the timing or location of expression of genes known to be involved in a particular process. This type of annotation is only suitable with terms from the Biological Process ontology.
For more information, see the binding guidelines on the GO wiki.
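On a GAF line, the ontology of the term is available in the Aspect column (column 9, where P denotes Biological Process), so the IEP restriction can be sketched without a database lookup. This assumes the Aspect column is filled in correctly; the function name and sample lines are hypothetical.

```python
def violates_iep_aspect(gaf_line: str) -> bool:
    """Return True if an IEP annotation is made outside Biological Process."""
    cols = gaf_line.rstrip("\n").split("\t")
    evidence, aspect = cols[6], cols[8]  # GAF columns 7 and 9
    return evidence == "IEP" and aspect != "P"

# Hypothetical sample lines: IEP on a process term, then on a function term.
process_line = "UniProtKB\tP00001\tabc1\t\tGO:0008150\tPMID:123456\tIEP\t\tP"
function_line = "UniProtKB\tP00001\tabc1\t\tGO:0003674\tPMID:123456\tIEP\t\tF"
print(violates_iep_aspect(process_line), violates_iep_aspect(function_line))
```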
The entire GAF is converted to OWL and combined with the main GO ontology and auxiliary constraint ontologies. The resulting ontology is checked for consistency and unsatisfiable classes using a complete DL reasoner such as HermiT.
When annotating to terms that are descendants of protein binding, and when the curator can supply the accession of the interacting protein, it is essential that reciprocal annotations are available; i.e. if you say protein A binds protein B, then you also need a second annotation stating that protein B binds protein A.
This will be a soft QC; a script will make these inferences and it is up to each MOD to evaluate and include the inferences in their GAF/DB.
For more information, see the binding guidelines on the GO wiki.
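The inference step could look roughly like the following. This is only a sketch of the idea, not the script referred to above: it reduces each annotation to a pair (annotated protein, protein in the 'with' column) and ignores which GO term should be used for the reciprocal annotation, which is left to the curator. All names are hypothetical.

```python
def missing_reciprocals(pairs):
    """Given (annotated_id, with_id) pairs, return the reciprocal pairs that are absent."""
    seen = set(pairs)
    return sorted({(b, a) for (a, b) in seen if (b, a) not in seen})

# Hypothetical interactions: A-B is annotated in both directions, A-C only once.
interactions = [("A", "B"), ("B", "A"), ("A", "C")]
print(missing_reciprocals(interactions))
```

Each pair returned is a candidate annotation for the MOD responsible for that gene product to review.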
SELECT gene_product.symbol, CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref, IF(association.is_not=1,"NOT","") AS 'not', term.acc, term.name, evidence.code, db.name AS assigned_by FROM association INNER JOIN gene_product ON association.gene_product_id = gene_product.id INNER JOIN dbxref AS gpx ON gene_product.dbxref_id = gpx.id INNER JOIN term ON association.term_id = term.id INNER JOIN evidence ON association.id = evidence.association_id INNER JOIN db ON association.source_db_id=db.id WHERE evidence.code IN ('ISS','ISO','ISA','ISM') AND term.acc = 'GO:0005515'
If we take an example annotation:
gene product: protein A
GO term: protein binding ; GO:0005515
evidence: IPI
reference: PMID:123456
with/from: with protein X
This annotation line can be interpreted as: protein A was found to carry out the 'protein binding' activity in PMID:123456, and this function was Inferred from the results of a Physical Interaction (IPI) assay, which involved protein X.
However, if we would like to transfer this annotation to protein A's ortholog 'protein B', the ISS annotation that would be created would be:
gene product: protein B
GO term: protein binding ; GO:0005515
evidence: ISS
reference: GO_REF:curator_judgement
with/from: with protein A
This is interpreted as: 'it is inferred that protein B carries out protein binding activity due to its sequence similarity (curator determined) with protein A, which was experimentally shown to carry out protein binding'.
Therefore the ISS annotation will not display the accession of the interacting protein X. Such an annotation display can be confusing, as the value in the 'with' column just provides further information on why the ISS, IPI or IGI annotation was created. This means that an ISS projection from protein binding is not particularly useful, as you are only really telling the user that you think a homologous protein binds a protein, based on overall sequence similarity.
This rule only applies to GO:0005515, as descendant terms such as mitogen-activated protein kinase p38 binding ; GO:0048273 used as ISS annotations are informative as the GO term name contains far more specific information as to the identity of the interactor.
For more information, see the binding guidelines on the GO wiki.
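The loss of information in the worked example can be made concrete with a small sketch. The Annotation record and the project_by_similarity helper are hypothetical and exist only for illustration; the set of sequence-similarity codes is taken from the SQL query for this rule.

```python
from dataclasses import dataclass

SIMILARITY_CODES = {"ISS", "ISO", "ISA", "ISM"}  # codes from the SQL query

@dataclass(frozen=True)
class Annotation:
    gene_product: str
    go_id: str
    evidence: str
    reference: str
    with_from: str

def project_by_similarity(source: Annotation, ortholog: str) -> Annotation:
    """Hypothetical helper: build the ISS annotation transferred to an ortholog."""
    # The 'with' column now holds the source protein, not its binding partner.
    return Annotation(ortholog, source.go_id, "ISS",
                      "GO_REF:curator_judgement", source.gene_product)

def violates_similarity_protein_binding(a: Annotation) -> bool:
    """Return True for GO:0005515 annotations made with a similarity code."""
    return a.go_id == "GO:0005515" and a.evidence in SIMILARITY_CODES

ipi = Annotation("protein A", "GO:0005515", "IPI", "PMID:123456", "protein X")
iss = project_by_similarity(ipi, "protein B")
print(iss.with_from, violates_similarity_protein_binding(iss))
```

The projected record no longer mentions protein X anywhere, which is why the rule flags it.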
All IC annotations should include a GO ID in the "With/From" column; for more information, see the IC evidence code guidelines.
Use IDA only when no identifier can be placed in the "With/From" column. When there is an appropriate ID for the "With/From" column, use IPI.
All IPI annotations should include a nucleotide/protein/chemical identifier in the "With/From" column (column 8). From the description of IPI in the GO evidence code guide: "We strongly recommend making an entry in the with/from column when using this evidence code to include an identifier for the other protein or other macromolecule or other chemical involved in the interaction. When multiple entries are placed in the with/from field, they are separated by pipes. Consider using IDA when no identifier can be entered in the with/from column." All annotations made after January 1 2012 that break this rule will be removed.
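The three "With/From" rules above (IC, IDA and IPI) can be sketched as one function over the evidence code and the raw column value. This is an illustrative reading of the rules, assuming pipe-separated entries as described in the quoted guide and seven-digit GO IDs; the function name and messages are hypothetical.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")

def with_from_problems(evidence: str, with_from: str) -> list[str]:
    """Return a list of rule violations for one annotation's With/From column."""
    entries = [e for e in with_from.split("|") if e]
    problems = []
    if evidence == "IC" and not any(GO_ID.fullmatch(e) for e in entries):
        problems.append("IC requires a GO ID in With/From")
    if evidence == "IPI" and not entries:
        problems.append("IPI requires an identifier in With/From")
    if evidence == "IDA" and entries:
        problems.append("IDA has a With/From identifier; consider IPI")
    return problems

print(with_from_problems("IPI", ""))
print(with_from_problems("IC", "GO:0005515"))
```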
Ontology operations such as term merges and obsoletions may be out of sync with annotation releases. Each GO entry T in the GAF is checked to see if it corresponds to a valid (non-obsolete) term in the ontology. If not, metadata for other terms is checked. If the term has been merged into a term S (i.e. S has alt_id of T) then T is replaced by S in the GAF line.
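The replacement step might be sketched as follows, assuming the ontology has already been loaded into a set of valid term IDs and a mapping from each alt_id to its primary ID. Both structures and the function name are hypothetical, and the IDs "GO:T" and "GO:S" are placeholders for the T and S of the description.

```python
def resolve_go_id(go_id, valid_terms, alt_id_to_primary):
    """Return the ID to use in the GAF line, or None if no replacement is known."""
    if go_id in valid_terms:
        return go_id  # already a valid, non-obsolete term
    # T was merged into S: S lists T as an alt_id, so T is replaced by S.
    return alt_id_to_primary.get(go_id)

valid_terms = {"GO:S"}
alt_id_to_primary = {"GO:T": "GO:S"}
print(resolve_go_id("GO:T", valid_terms, alt_id_to_primary))
```

A return value of None marks a line whose term is neither valid nor merged, for example an obsolete term, and that line needs separate handling.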
SELECT gene_product.symbol, CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref, IF(association.is_not=1,"NOT","") AS 'not', term.acc, term.name, evidence.code, db.name AS assigned_by FROM term INNER JOIN graph_path ON term.id = graph_path.term2_id INNER JOIN term AS term2 ON graph_path.term1_id = term2.id INNER JOIN association ON graph_path.term2_id = association.term_id INNER JOIN evidence ON association.id = evidence.association_id INNER JOIN gene_product ON association.gene_product_id = gene_product.id INNER JOIN dbxref AS gpx ON gene_product.dbxref_id = gpx.id INNER JOIN db ON association.source_db_id=db.id WHERE term2.acc = 'GO:0003824' AND evidence.code = 'IPI'
The IPI (Inferred from Physical Interaction) evidence code is used where an annotation can be supported by interaction evidence between the gene product of interest and another molecule (see the evidence code documentation). While the IPI evidence code is frequently used to support annotations to terms that are children of binding ; GO:0005488, the Binding working group thinks it unlikely that enough information can be obtained from a binding interaction to support an annotation to a term that is a child of catalytic activity ; GO:0003824. Such IPI annotations to child terms of catalytic activity ; GO:0003824 may need to be revisited and corrected.
For more information, see the catalytic activity annotation guide on the GO wiki.
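Outside the database, the graph_path lookup in the SQL query corresponds to walking up the ontology from the annotated term. The sketch below assumes a simple child-to-parents mapping and treats catalytic activity itself as covered by the rule; the TOY: identifiers are placeholders, not real GO terms, and the function names are hypothetical.

```python
CATALYTIC_ACTIVITY = "GO:0003824"

def ancestors(term, parents):
    """Collect all ancestors of a term by following parent links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def violates_ipi_catalytic(go_id, evidence, parents):
    """Return True for IPI annotations at or below catalytic activity."""
    under = go_id == CATALYTIC_ACTIVITY or CATALYTIC_ACTIVITY in ancestors(go_id, parents)
    return evidence == "IPI" and under

# Toy ontology fragment with placeholder IDs (not real GO terms).
TOY_PARENTS = {
    "TOY:0000002": ["TOY:0000001"],
    "TOY:0000001": ["GO:0003824"],
    "TOY:0000003": ["GO:0005488"],
}
print(violates_ipi_catalytic("TOY:0000002", "IPI", TOY_PARENTS))
```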
SELECT gene_product.symbol, CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref, IF(association.is_not=1,"NOT","") AS 'not', term.acc, term.name, db.name AS assigned_by FROM association INNER JOIN gene_product ON association.gene_product_id = gene_product.id INNER JOIN term ON association.term_id = term.id INNER JOIN db ON association.source_db_id=db.id INNER JOIN dbxref AS gpx ON gene_product.dbxref_id = gpx.id WHERE term.acc IN ( 'GO:0050896', 'GO:0007610', 'GO:0051716', 'GO:0009628', 'GO:0009607', 'GO:0042221', 'GO:0009719', 'GO:0009605', 'GO:0006950', 'GO:0048585', 'GO:0048584', 'GO:0048583', 'GO:0001071', 'GO:0000988')
Some terms are too high-level to provide useful information when used for annotation, regardless of the evidence code used.
We provide and maintain the list of too high-level terms as two subsets in the ontology:
Both subsets denote high level terms, not to be used for any manual annotation.
For inferred electronic annotations (IEAs), we allow the use of terms from the gocheck_do_not_manually_annotate subset. These terms may still offer some general information, but a human curator should always be able to find a more specific annotation.
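A subset-based version of this check could be sketched as below. It assumes each term's subset tags have already been read from the ontology. The name gocheck_do_not_manually_annotate comes from the text above; the name gocheck_do_not_annotate for the stricter subset does not appear there and is an assumption, as is the function name.

```python
DO_NOT_ANNOTATE = "gocheck_do_not_annotate"  # assumed name, see note above
DO_NOT_MANUALLY_ANNOTATE = "gocheck_do_not_manually_annotate"

def annotation_allowed(term_subsets, evidence: str) -> bool:
    """Return True if a term with these subset tags may be used with this evidence code."""
    if DO_NOT_ANNOTATE in term_subsets:
        return False  # never used for direct annotation
    if DO_NOT_MANUALLY_ANNOTATE in term_subsets:
        return evidence == "IEA"  # electronic annotation only
    return True

print(annotation_allowed({DO_NOT_MANUALLY_ANNOTATE}, "IEA"))
print(annotation_allowed({DO_NOT_MANUALLY_ANNOTATE}, "IDA"))
```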
To be added
SELECT gene_product.symbol, CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref, IF(association.is_not=1,"NOT","") AS 'not', term.acc, term.name, evidence.code, CONCAT(dbxref.xref_dbname, ':', dbxref.xref_key) AS evxref, db.name AS assigned_by FROM association INNER JOIN evidence ON association.id = evidence.association_id INNER JOIN gene_product ON association.gene_product_id = gene_product.id INNER JOIN term ON association.term_id = term.id INNER JOIN dbxref ON evidence.dbxref_id = dbxref.id INNER JOIN dbxref AS gpx ON gene_product.dbxref_id = gpx.id INNER JOIN db ON association.source_db_id=db.id WHERE dbxref.xref_dbname = 'PMID' AND dbxref.xref_key REGEXP '^[^0-9]'
/^(.*?\t){5}([^\t]*\|)*PMID:(?!\d+)/
References in the GAF (Column 6) should be of the format db_name:db_key|PMID:12345678, e.g. SGD_REF:S000047763|PMID:2676709. No other format is acceptable for PubMed references; the following examples are invalid:
This is proposed as a HARD QC check: incorrectly formatted references will be removed.
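For illustration, the reference column can also be validated entry by entry. The sketch below is slightly stricter than the rule's regular expression, since it requires every PMID entry to consist of digits only after the prefix; the function name and the malformed sample values are hypothetical.

```python
import re

PMID = re.compile(r"PMID:\d+")

def bad_pmid_refs(reference_column: str) -> list[str]:
    """Return the PMID entries in GAF column 6 that are not of the form PMID:<digits>."""
    return [ref for ref in reference_column.split("|")
            if ref.startswith("PMID:") and not PMID.fullmatch(ref)]

print(bad_pmid_refs("SGD_REF:S000047763|PMID:2676709"))
print(bad_pmid_refs("PMID:unpublished|PMID:123456"))
```

Entries from other databases, such as SGD_REF:S000047763, are ignored because the rule only concerns PubMed references.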
SELECT gene_product.symbol, CONCAT(gpx.xref_dbname, ':', gpx.xref_key), IF(association.is_not = 1, "NOT", "") AS 'not', term.acc, term.name, evidence.code, CONCAT(dbxref.xref_dbname, ':', dbxref.xref_key) AS evxref, db.name AS assigned_by FROM association INNER JOIN evidence ON association.id = evidence.association_id INNER JOIN gene_product ON association.gene_product_id = gene_product.id INNER JOIN term ON association.term_id = term.id INNER JOIN dbxref ON evidence.dbxref_id = dbxref.id INNER JOIN dbxref AS gpx ON gpx.id = gene_product.dbxref_id INNER JOIN db ON association.source_db_id = db.id WHERE ( evidence.code = 'ND' AND term.acc NOT IN ( 'GO:0005575', 'GO:0003674', 'GO:0008150' ) ) OR ( NOT(evidence.code = 'ND') AND term.acc IN ( 'GO:0005575', 'GO:0003674', 'GO:0008150' ) ) OR ( evidence.code = 'ND' AND ( CONCAT(dbxref.xref_dbname, ':', dbxref.xref_key) NOT IN ( 'GO_REF:0000015', 'FB:FBrf0159398', 'ZFIN:ZDB-PUB-031118-1', 'dictyBase_REF:9851', 'MGI:MGI:2156816', 'SGD_REF:S000069584', 'CGD_REF:CAL0125086', 'RGD:1598407', 'TAIR:Communication:1345790', 'AspGD_REF:ASPL0000111607' ) ) )
The No Data (ND) evidence code should be used for annotations to the root nodes only and should be accompanied by GO_REF:0000015 or an internal reference. PMIDs cannot be used for annotations made with ND.
The SQL code identifies all ND annotations that do not use GO_REF:0000015 or one of the alternative internal references listed for it in the GO references file.
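The three conditions in the WHERE clause can be restated as a small function, which may be easier to read than the SQL. The root term IDs and the list of accepted references are copied from the query; the function name and messages are hypothetical.

```python
ROOT_TERMS = {"GO:0005575", "GO:0003674", "GO:0008150"}
ND_REFERENCES = {
    "GO_REF:0000015", "FB:FBrf0159398", "ZFIN:ZDB-PUB-031118-1",
    "dictyBase_REF:9851", "MGI:MGI:2156816", "SGD_REF:S000069584",
    "CGD_REF:CAL0125086", "RGD:1598407", "TAIR:Communication:1345790",
    "AspGD_REF:ASPL0000111607",
}

def nd_problems(go_id: str, evidence: str, reference: str) -> list[str]:
    """Return the ND rule violations for one annotation."""
    problems = []
    if evidence == "ND":
        if go_id not in ROOT_TERMS:
            problems.append("ND used with a non-root term")
        if reference not in ND_REFERENCES:
            problems.append("ND used without GO_REF:0000015 or an equivalent reference")
    elif go_id in ROOT_TERMS:
        problems.append("root term annotated with an evidence code other than ND")
    return problems

print(nd_problems("GO:0008150", "ND", "GO_REF:0000015"))
print(nd_problems("GO:0005515", "ND", "PMID:123456"))
```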
GO taxon constraints ensure that annotations are not made to inappropriate species or sets of species. See http://www.biomedcentral.com/1471-2105/11/530 for more details.
This check ensures that the GO IDs used for annotations are valid IDs and are not obsolete.