Guide to GO Evidence Codes
This document is a guide to the standard usage of the GO evidence codes.
Annotators may find the evidence code decision tree a useful aid in selecting the correct evidence code for an annotation.
Introduction
A GO annotation consists of a GO term associated with a specific reference that describes the work or analysis upon which the association between a specific GO term and gene product is based. Each annotation must also include an evidence code to indicate how the annotation to a particular term is supported. Although evidence codes do reflect the type of work or analysis described in the cited reference which supports the GO term to gene product association, they are not necessarily a classification of types of experiments/analyses. Note that these evidence codes are intended for use in conjunction with GO terms, and should not be considered in isolation from the terms. If a reference describes multiple methods that each provide evidence to make a GO annotation to a particular term, then multiple annotations with identical GO identifiers and reference identifiers but different evidence codes may be made.
Out of all the evidence codes available, only Inferred from Electronic Annotation (IEA) is not assigned by a curator. Manually-assigned evidence codes fall into four general categories: experimental, computational analysis, author statements, and curatorial statements.
Use of an experimental evidence code in a GO annotation indicates that the cited paper displayed results from a physical characterization of a gene or gene product that has supported the association of a GO term. The experimental evidence codes are:
- Inferred from Experiment (EXP)
- Inferred from Direct Assay (IDA)
- Inferred from Physical Interaction (IPI)
- Inferred from Mutant Phenotype (IMP)
- Inferred from Genetic Interaction (IGI)
- Inferred from Expression Pattern (IEP)
Use of the computational analysis evidence codes indicates that the annotation is based on an in silico analysis of the gene sequence and/or other data as described in the cited reference. The evidence codes in this category also indicate a varying degree of curatorial input. The computational analysis evidence codes are:
- Inferred from Sequence or structural Similarity (ISS)
- Inferred from Sequence Orthology (ISO)
- Inferred from Sequence Alignment (ISA)
- Inferred from Sequence Model (ISM)
- Inferred from Genomic Context (IGC)
- Inferred from Biological aspect of Ancestor (IBA)
- Inferred from Biological aspect of Descendant (IBD)
- Inferred from Key Residues (IKR)
- Inferred from Rapid Divergence (IRD)
- Inferred from Reviewed Computational Analysis (RCA)
Author statement codes indicate that the annotation was made on the basis of a statement made by the author(s) in the reference cited. The author statement evidence codes used by GO are:
- Traceable Author Statement (TAS)
- Non-traceable Author Statement (NAS)
Use of the curatorial statement evidence codes indicates an annotation made on the basis of a curatorial judgement that does not fit into one of the other evidence code classifications. The curatorial statement codes are:
- Inferred by Curator (IC)
- No biological Data available (ND)
All of the above evidence codes are assigned by curators. However, GO also uses one evidence code that is assigned by automated methods, without curatorial judgement. The automatically-assigned evidence code is:
- Inferred from Electronic Annotation (IEA)
Evidence codes are not statements of the quality of the annotation. Within each evidence code classification, some methods produce annotations of higher confidence or greater specificity than other methods. In addition, the way in which a technique has been applied or interpreted in a paper will also affect the quality of the resulting annotation. Thus evidence codes cannot be used as a measure of the quality of the annotation.
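The classification described in this introduction can be written down as a small lookup table. The following Python sketch is illustrative only: the names `EVIDENCE_CATEGORIES`, `category_of`, and `is_manually_assigned` are hypothetical and do not belong to any GO tool, and the label used for the IEA group is our own.

```python
# Evidence codes grouped by the categories named in this guide.
# All names here are illustrative; they are not part of any GO software.
EVIDENCE_CATEGORIES = {
    "experimental": ("EXP", "IDA", "IPI", "IMP", "IGI", "IEP"),
    "computational analysis": (
        "ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA",
    ),
    # Traceable and Non-traceable Author Statement.
    "author statement": ("TAS", "NAS"),
    "curatorial statement": ("IC", "ND"),
    # The only code assigned without curatorial judgement.
    "automatically assigned": ("IEA",),
}


def category_of(code: str) -> str:
    """Return the category that an evidence code belongs to."""
    normalized = code.strip().upper()
    for category, codes in EVIDENCE_CATEGORIES.items():
        if normalized in codes:
            return category
    raise ValueError(f"unknown evidence code: {code!r}")


def is_manually_assigned(code: str) -> bool:
    """True for every code except IEA, which is assigned automatically."""
    return category_of(code) != "automatically assigned"
```

A category says nothing about annotation quality, so a table like this is suitable for grouping or filtering annotations but not for ranking them.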
Experimental Evidence Codes
EXP: Inferred from Experiment
This code is used in an annotation to indicate that an experimental assay has been located in the cited reference, whose results indicate a gene product's function, process involvement, or subcellular location (indicated by the GO term). The EXP code is the parent code for the IDA, IPI, IMP, IGI and IEP experimental codes.
The EXP evidence code can be used where any of the assays described for the IDA, IPI, IMP, IGI, or IEP evidence codes is reported. However, groups are strongly encouraged to annotate to one of the more specific experimental codes (IDA, IPI, IMP, IGI, or IEP) instead of EXP, and all curators directly involved in the GO Reference Genome annotation effort are obliged to use these and not EXP.
The EXP code exists for groups who would like to contribute high-quality GO annotations that are produced from directly associating GO terms to gene products by citing experimental published results, but where the group is unable to fit the appropriate specific experimental GO evidence code to each annotation.
A published reference should always be cited in the reference column, and no value should be entered into the with/from column of EXP annotations.
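The two column rules for EXP annotations stated above can be checked mechanically. This is a minimal sketch, assuming the reference and with/from values are available as plain strings; the function name `check_exp_annotation` is hypothetical.

```python
from typing import List


def check_exp_annotation(reference: str, with_from: str) -> List[str]:
    """Return a list of problems found in an EXP annotation (empty if none)."""
    problems = []
    # A published reference should always be cited.
    if not reference.strip():
        problems.append("reference column must cite a published reference")
    # No value should be entered in the with/from column for EXP.
    if with_from.strip():
        problems.append("with/from column must be empty for EXP")
    return problems
```

The check only tests whether the columns are filled in; it cannot tell whether the cited reference is actually a published one.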
IDA: Inferred from Direct Assay
Updated November 9, 2007
- Enzyme assays
- In vitro reconstitution (e.g. transcription)
- Immunofluorescence (for cellular component)
- Cell fractionation (for cellular component)
- Physical interaction/binding assay (sometimes appropriate for cellular component or molecular function)
The IDA evidence code is used to indicate a direct assay was carried out to determine the function, process, or component indicated by the GO term. Curators therefore need to be careful, because an experiment considered as a direct assay for a term from one ontology may be a different kind of evidence for a term from another of the ontologies. In particular, there are more kinds of direct assays for cellular component than for function or process. For example, a fractionation experiment might provide "direct assay" evidence that a gene product is in the nucleus, but "protein interaction" (IPI) evidence for its function or process.
For transfection experiments or other experiments where a gene from one organism or tissue is put into a system that is not its normal environment, the annotator should use the author's intent and interpretation of the experiment as a guide as to whether IMP or IDA is appropriate. When the author is comparing differences between alleles, regardless of the simplicity or complexity of the assay, IMP is appropriate. When the author is using an expression system as a way to investigate the normal function of a gene product, IDA is appropriate.
Examples where the IDA evidence code should be used:
- Binding assays can provide direct assay evidence for annotating to the xxx binding molecular function terms. (Use IDA only when no identifier can be placed in the with/from column; when there is an appropriate ID for the with/from column, use IPI).
- Assays describing the isolation of a complex by immunoprecipitation of a tagged subunit should use IDA, not IPI. Thus this type of assay can provide IDA for annotation to a component term for the specific complex because it is a direct assay for a complex.
- Transfections into a cell line, overexpression, or ectopic expression of a gene when the expression system used is considered to be an assay system to address basic, normal functions of gene product even if it would not normally be expressed in that cell type or location. If the experiments were conducted to assess the normal function of the gene and the assay system is believed to reproduce this function, i.e., the authors would consider their experiment to be a direct assay, and not a comparison between various alleles of a gene, then the IDA code should be used. This is in contrast with a situation where overexpression affects the function or expression of the gene and that difference from normal is used to make an inference about the normal function; in this case use the IMP evidence code.
Examples where the IDA evidence code should not be used:
- Binding assays where it is possible to put an ID corresponding to the specific binding partner that was shown to interact directly with the gene product being annotated should be annotated with the IPI code, not with IDA.
- Transfection into a cell line, overexpression, or ectopic expression of a gene where the effects of various alleles of a gene are compared to each other or to wild-type. For this type of experiment, annotate using IMP.
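The choices described in this section reduce to two small decisions. The sketch below is only an aid to reading the rules: the function names are hypothetical, and the boolean `compares_alleles` stands in for the curator's reading of the author's intent, which no code can determine.

```python
from typing import Sequence


def code_for_binding_assay(partner_ids: Sequence[str]) -> str:
    """IPI when a binding partner ID can go in with/from, otherwise IDA."""
    return "IPI" if len(partner_ids) > 0 else "IDA"


def code_for_expression_system(compares_alleles: bool) -> str:
    """IMP when alleles are compared, IDA when normal function is assayed."""
    return "IMP" if compares_alleles else "IDA"
```

For example, a binding assay with no identifiable partner gives `code_for_binding_assay([])`, which returns `"IDA"`, while a transfection experiment comparing a mutant allele to wild-type gives `code_for_expression_system(True)`, which returns `"IMP"`.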
IPI: Inferred from Physical Interaction
Updated November 9, 2007
- 2-hybrid interactions
- Co-purification
- Co-immunoprecipitation
- Ion/protein binding experiments
Covers physical interactions between the gene product of interest and another molecule (such as a protein, ion or complex). IPI can be thought of as a type of IDA, where the actual binding partner or target can be specified, using "with" in the with/from field.
Examples where the IPI evidence code should be used:
- Binding assays where it is possible to put an ID corresponding to the specific binding partner that was shown to interact directly with the gene product being annotated should be annotated with the IPI code, not with IDA.
Examples where the IPI evidence code should not be used:
- Assays describing the isolation of a complex by immunoprecipitation of a tagged subunit should use IDA, not IPI with the ID corresponding to the tagged subunit in the with/from column because not all subunits of the complex interact directly with the tagged subunit. Thus this type of assay can provide IDA for annotation to a component term for the specific complex because it is a direct assay for a complex, but it would not necessarily be true to say that all members of the complex interact directly with the tagged subunit.
- Annotations to protein binding ; GO:0005515 should not be used to describe an antibody binding to another protein. However, an effect of an antibody on an activity or process can support a function or process annotation, using the IMP code.
Usage of the With/From Column for IPI
We strongly recommend making an entry in the with/from column when using this evidence code to include an identifier for the other protein or other macromolecule or other chemical involved in the interaction. When multiple entries are placed in the with/from field, they are separated by pipes. Consider using IDA when no identifier can be entered in the with/from column.
Two examples of how the with/from column is used with IPI are shown in the table below. Abcd3, a mouse gene, is annotated to protein binding ; GO:0005515, based on a 1999 paper (PMID:10551832). The with/from field has the UniProt protein IDs of the proteins Abcd3 binds to. Alb, a rat gene, is annotated to drug binding based on a 2002 paper (PMID:12458670). In this case the CHEBI ID (chemical ID) of the drug that Alb binds to is provided in the with/from column.
... | 2. DB Object ID | 3. DB Object Symbol | 4. Qualifier | 5. GO ID | 6. DB:Reference | 7. Evidence Code | 8. With/From | ...
---|---|---|---|---|---|---|---|---
... | MGI:1349216 | Abcd3 | | GO:0005515 | PMID:10551832 | IPI | UniProt:P33897\|UniProt:Q61285 | ...
... | RGD:RGD2085 | Alb | | GO:0008144 | PMID:12458670 | IPI | CHEBI:28939 | ...
Note: For an interacting protein, a protein ID is recommended in the with/from column for an IPI annotation, but a gene ID may be used if the database does not have identifiers for individual gene products. A gene ID may also be used if the cited reference provides enough information to determine which gene ID should be used, but not enough to establish which protein ID is correct, for example, in cases where there is a one-to-many relationship between a gene and its protein products.
Note that there has been some discrepancy between groups as to the use of the with/from column; please see the note on usage of the with/from column for more details.
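The pipe-separated layout of the with/from field for IPI can be sketched as a pair of helper functions. The names `format_with_from` and `parse_with_from` are hypothetical, and the `database:accession` check simply mirrors the shape of the identifiers shown in the table above; it does not verify that the database abbreviation or the accession exists.

```python
from typing import List, Sequence


def format_with_from(identifiers: Sequence[str]) -> str:
    """Join database:accession identifiers with pipes for the with/from column."""
    for identifier in identifiers:
        database, separator, accession = identifier.partition(":")
        if not (database and separator and accession) or "|" in identifier:
            raise ValueError(
                f"not a database:accession identifier: {identifier!r}"
            )
    return "|".join(identifiers)


def parse_with_from(field: str) -> List[str]:
    """Split a with/from value back into its individual identifiers."""
    return [part for part in field.split("|") if part]
```

With the Abcd3 example, `format_with_from(["UniProt:P33897", "UniProt:Q61285"])` produces the value shown in the table, and `parse_with_from` reverses it.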
IMP: Inferred from Mutant Phenotype
Updated November 9, 2007
- mutations, natural or introduced, that result in partial or complete impairment or alteration of the function of that gene
- polymorphism or allelic variation (including where no allele is designated wild-type or mutant)
- any procedure that disturbs the expression or function of the gene, including RNAi, anti-sense RNAs, antibody depletion, or the use of any molecule or experimental condition that may disturb or affect the normal functioning of the gene, including: inhibitors, blockers, modifiers, any type of antagonists, temperature jumps, changes in pH or ionic strength.
- overexpression or ectopic expression of wild-type or mutant gene that results in aberrant behavior of the system or aberrant expression where the resulting mutant phenotype is used to make a judgment about the normal activity of that gene product.
The IMP evidence code covers those cases when the function, process or cellular localization of a gene product is inferred based on differences in the function, process, or cellular localization between two different alleles of the corresponding gene. The IMP code is used for cases where one allele may be designated 'wild-type' and another as 'mutant'. It is also used in cases where allelic variation occurs naturally and no specific allele is designated as wild-type or mutant. Caution should be used when making annotations from gain-of-function mutations as it may be difficult to infer a gene's normal function from a gain of function mutation, although it is sometimes possible.
For transfection experiments or other experiments where a gene from one organism or tissue is put into a system that is not its normal environment, the annotator should use the author's intent and interpretation of the experiment as a guide as to whether IMP or IDA is appropriate. When the author is comparing differences between alleles, regardless of the simplicity or complexity of the assay, IMP is appropriate. When the author is using an expression system as a way to investigate the normal function of a gene product, IDA is appropriate.
Examples where the IMP code should be used:
- use of an inhibitor of a gene product's activity in order to see the effect of absence, or significant depletion, of that gene product. For example, an experiment showing that inhibition of 12-LOX activity by baicalein in a murine bladder cancer cell line inhibits cell proliferation in a concentration-dependent manner (see PMID:15161019) results in an annotation of the 12-LOX gene to the GO term cell proliferation using the IMP evidence code.
- transfection into a cell line, overexpression, or ectopic expression of a gene where the effects of various alleles of a gene are compared to each other or to wild-type. For this type of experiment, annotate using IMP.
- In situations where a mutation in gene A provides information about the function, process, or component of gene B, do not use IGI. Use the IMP evidence code and use column-16 or the Annotation Extension column to provide additional data. For example, if a mutation in gene A causes a mislocalization of gene B, gene A is annotated to protein localization using IMP and the gene B identifier is added to the Annotation Extension column with the appropriate relationship.
Examples where the IMP code should not be used:
- mutation in gene B provides information about gene A being annotated. For this type of experiment, use the IGI code.
- complementation of a mutation in one organism by a gene from a different organism. For this type of experiment, use the IGI code.
- Transfections into a cell line, overexpression, or ectopic expression of a gene when the expression system used is considered to be an assay system to address basic, normal functions of gene product even if it would not normally be expressed in that cell type or location. If the experiments were conducted to assess the normal function of the gene and the assay system is believed to reproduce this function, i.e., the authors would consider their experiment to be a direct assay, and not a comparison between various alleles of a gene, then the IDA code should be used. This is in contrast with a situation where overexpression affects the function or expression of the gene and that difference from normal is used to make an inference about the normal function; in this case use the IMP evidence code.
Usage of the With/From Column for IMP
We recommend making a "with" entry in the with/from column when using this evidence code to indicate the identifier for the allele in which the phenotype was observed. When multiple entries are placed in the with/from field, they are separated by pipes.
Example for how the with/from column should be filled in
- The mouse gene product Actc1 (actin, alpha, cardiac ; MGI:87905), has a GO annotation to muscle thin filament assembly ; GO:0030240, inferred from mutant phenotype, IMP of MGI:2180072 (symbol: Actc1tm1Jll; name: targeted mutation 1, James Lessard), from PMID:9114002. MGI:2180072 is entered in the with/from column for this annotation.
IGI: Inferred from Genetic Interaction
Updated May 1, 2012
- "Traditional" genetic interactions such as suppressors, synthetic lethals, etc.
- Functional complementation
- Rescue experiments
- Inference about one gene drawn from the phenotype of a mutation in a different gene
Includes any combination of alterations in the sequence (mutation) or expression of more than one gene/gene product. This code can therefore cover any of the IMP experiments that are done in a non-wild-type background; the key is what the comparison is made against. If there is a single mutation or difference between the two strains compared, use IMP. If there are multiple mutations or differences between the two strains compared, use IGI. When redundant copies of a gene must all be mutated to see an informative phenotype, use IGI. Caution should be used when making annotations from genetic combinations that include gain-of-function mutations as it may be difficult to infer a gene's normal function from a gain of function mutation, although it is sometimes possible. Note that some organisms, such as mouse, will have far, far more IGI than IMP annotations. Use IMP for "phenotypic similarity," as described above.
"Functional complementation" above refers to experiments in which a gene from one organism complements a deletion or other mutation in another species. For these annotations, the with/from column should list the identifiers for the gene that complements or is complemented by the gene of interest. In annotations from cross-species functional complementation experiments, the gene referred to in the with/from column will be from a different species than the gene being annotated.
Examples where the IGI evidence code should be used:
- mutation in gene B provides information about gene A being annotated. For this type of experiment, use the IGI code.
- complementation of a mutation in one organism by a gene from a different organism. For this type of experiment, use the IGI code.
Examples where the IGI evidence code should not be used:
- In situations where a mutation in gene A provides information about the function, process, or component of gene B, do not use IGI. Use the IMP evidence code and use column-16 or the Annotation Extension column to provide additional data. For example, if a mutation in gene A causes a mislocalization of gene B, gene A is annotated to protein localization using IMP and the gene B identifier is added to the Annotation Extension column with the appropriate relationship.
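The rule given earlier in this section, that the choice between IMP and IGI depends on how many differences separate the two strains being compared, can be sketched as follows. The function name and its single integer parameter are hypothetical simplifications; counting the differences is a curatorial judgement.

```python
def code_for_genetic_comparison(differences_between_strains: int) -> str:
    """Return IMP for a single difference between strains, IGI for several."""
    if differences_between_strains < 1:
        raise ValueError("the compared strains must differ in at least one gene")
    return "IMP" if differences_between_strains == 1 else "IGI"
```

What matters is the comparison, not the background: a double mutant compared with wild-type differs in two genes and gives IGI, whereas the same double mutant compared with one of its single-mutant parents differs in one gene and gives IMP.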
Usage of the With/From Column for IGI
We recommend making an entry in the with/from column when using this evidence code to include an identifier for the other gene(s) involved in the genetic interaction. When multiple entries are placed in the with/from field, they are separated by pipes.
Note that there has been some discrepancy between groups as to the use of the with/from column; please see the Note on Usage of the With/from Column for more details.
For example, if the annotation is based on a double mutation, one of which is the gene being annotated, the identifier for the gene being annotated would be placed in the DB object ID field (column 2) while the identifier for the second mutated gene would be placed in the with/from field (column 8). If the annotation is based on a triple mutation, one of which is the gene being annotated, the identifiers for both of the two other mutated genes would be placed in the with/from field, separated by a pipe.
... | 2. DB Object ID | 3. DB Object Symbol | 4. Qualifier | 5. GO ID | 6. DB:Reference | 7. Evidence Code | 8. With/From | ...
---|---|---|---|---|---|---|---|---
... | FB:gene_A_ID | gene A | | GO:0006796 | PMID:11979277 | IGI | FB:gene_B_ID | ...
... | FB:gene_A_ID | gene A | | GO:0006796 | PMID:11979277 | IGI | FB:gene_B_ID\|FB:gene_C_ID | ...
... | SGD:gene_A_ID | gene A | | GO:0006796 | PMID:11979277 | IGI | PomBase:gene_B_ID | ...
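The double-mutant and triple-mutant cases described above can be sketched as a function that assembles columns 2 through 8 of an annotation row. The function name `igi_columns` is hypothetical, the qualifier is left empty for simplicity, and the placeholder identifiers are the ones used in the table.

```python
from typing import List, Sequence


def igi_columns(
    annotated_gene_id: str,
    annotated_gene_symbol: str,
    go_id: str,
    reference: str,
    other_gene_ids: Sequence[str],
) -> List[str]:
    """Return columns 2 through 8 for an IGI annotation, in order."""
    return [
        annotated_gene_id,         # 2. DB Object ID: the gene being annotated
        annotated_gene_symbol,     # 3. DB Object Symbol
        "",                        # 4. Qualifier (left empty in this sketch)
        go_id,                     # 5. GO ID
        reference,                 # 6. DB:Reference
        "IGI",                     # 7. Evidence Code
        "|".join(other_gene_ids),  # 8. With/From: the other mutated gene(s)
    ]
```

For a triple mutation, passing `["FB:gene_B_ID", "FB:gene_C_ID"]` as `other_gene_ids` yields the with/from value `FB:gene_B_ID|FB:gene_C_ID`, matching the second row of the table.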
IEP: Inferred from Expression Pattern
Updated November 9, 2007
- Transcript levels or timing (e.g. Northerns, microarray data)
- Protein levels (e.g. Western blots)
The IEP evidence code covers cases where the annotation is inferred from the timing or location of expression of a gene, particularly when comparing a gene that is not yet characterized with the timing or location of expression of genes known to be involved in a particular process. Use this code with caution! It may be difficult to determine whether the expression pattern really indicates that a gene plays a role in a given process, so the IEP evidence code is usually used in conjunction with high level GO terms in the biological process ontology.
Note that we have not yet encountered any examples where we feel it is valid to make annotations to terms from the cellular component or molecular function ontologies on the basis of expression pattern data. Thus we currently recommend that this code be restricted to annotations to terms from the biological process ontology. Also, different annotating groups use different identifiers (gene or protein or gene_product) and no inference should be made as to whether an annotation made using IEP concerns a gene, RNA or protein.
Examples where the IEP evidence code should be used:
- genes upregulated during a stress condition may be annotated to the process of stress response (for example, heat shock proteins)
- genes selectively expressed at specific developmental stages in specific organs may be annotated to xxx development
Examples where the IEP evidence code should not be used:
- Function and component annotations should not be made with IEP.
- Exogenous expression or overexpression of a gene should not be annotated using IEP; only the normal expression pattern should lead to an IEP annotation.
- Overexpression of a gene causing increased activity of an enzyme should be annotated using IDA or IMP (see IDA documentation)
- Overexpression (wild type or mutated) of a gene causing an abnormal phenotype should be annotated using IMP
- Exogenous expression of a gene (for example, a transcription factor) and assaying of its function should be annotated using IDA
- Binding assays with overexpressed proteins or exogenously expressed proteins should be annotated using IPI for protein binding or IDA for binding to other molecules.
- Observation of protein localization for a component annotation should be made using the IDA evidence code.
- Annotation to the molecular function term transcription factor activity where the experimental evidence is that introduction of the gene to be tested into an in vitro assay system leads to expression of the appropriate reporter gene. Annotate using the IDA evidence code.
- Annotation to a binding molecular function term, e.g. calmodulin binding, where the experiment was to screen an expression library (a library expressing various proteins) to identify which of the library proteins interact with a particular protein of interest. Annotate using the IPI evidence code with the accession number of the interacting protein (or its corresponding gene) in the with/from field.
- Annotating an enzymatic function to a Molecular Function term based on an overexpression experiment. Since this is not the normal expression pattern, the IEP code does not apply. For example, guanylate cyclase 2f from rat (GC-F) can be annotated to the Molecular Function term guanylate cyclase activity based on the experimental result that over-production of GC-E and GC-F in COS cells resulted in production of, or an increase in, guanylyl cyclase activity (PMID:7831337). IDA would be the appropriate evidence code for this annotation.
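The two main restrictions on IEP in this section can be sketched as a simple check. The function name is hypothetical, the single-letter aspect labels ("P", "F", "C") are an assumption about how the ontology of a term is represented, and `normal_expression` stands in for the curator's judgement that the expression pattern is not exogenous or the result of overexpression.

```python
from typing import List


def check_iep_annotation(aspect: str, normal_expression: bool) -> List[str]:
    """Return problems with a proposed IEP annotation (empty list if none).

    aspect: "P" (biological process), "F" (molecular function) or
            "C" (cellular component); these labels are an assumption.
    normal_expression: False for exogenous expression or overexpression.
    """
    if aspect not in ("P", "F", "C"):
        raise ValueError(f"unknown aspect: {aspect!r}")
    problems = []
    if aspect != "P":
        problems.append("IEP is recommended only for biological process terms")
    if not normal_expression:
        problems.append("only the normal expression pattern supports IEP")
    return problems
```

An empty result does not mean that IEP is justified; the guide still asks for caution and usually for high-level process terms.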
Computational Analysis Evidence Codes
ISS: Inferred from Sequence or Structural Similarity
Updated April 1, 2008
The ISS evidence code or one of its sub-categories should be used whenever a sequence-based analysis forms the basis for an annotation and review of the evidence and annotation has been done manually. If the annotation has not been reviewed manually, the correct evidence code is IEA, even if the evidence supporting the annotation is all sequence based. ISS should be used if a combination of sequence-based tools or methods is used. If only one particular type of sequence-based evidence is used then one of the more specific sub-categories of ISS may be more appropriate for the annotation. There are three specific sub-categories of ISS, mentioned briefly here and described in more detail below:
- ISA: If the primary piece of evidence is a pairwise or multiple alignment then ISA (Inferred from Sequence Alignment) would likely be the appropriate evidence code to use.
- ISO: If the primary piece of evidence is the assertion of orthology between the gene product and a gene product in another organism, ISO (Inferred from Sequence Orthology) would likely be the appropriate evidence code to use.
- ISM: If any kind of sequence modeling method (e.g. Hidden Markov Models) is the primary piece of evidence then the ISM (Inferred from Sequence Model) code is the most appropriate.
ISS can also be used for structural similarity with experimentally characterized gene products, as determined by crystallography, nuclear magnetic resonance, or computational prediction. In practice, ISS annotations are rarely, if ever, made purely from structural information. When included, structural information is generally at the level of secondary structure modeling or prediction derived from sequence information. Secondary structure information is particularly useful as one component of RNA gene predictions and in some domain models.
Population of the with field is important when using the ISS code or one of its sub-categories. The entry in with is the accession of the object or model to which your query has similarity. It is mandatory for annotators to make an entry in the with field when using the ISS code or one of its sub-categories if the annotation is based on an alignment with other proteins (e.g. UniProt) or a sequence model contained in a database (e.g. Pfam, InterPro). If the annotation is based on a method such as tRNASCAN which cannot be referred to with an accession number, the with field may be left empty. Entries in the with field should be in the format database:accession, where database is one of the abbreviations listed in the GO database abbreviations collection and accession is the accession number of the object the sequence similarity is with. Multiple entries in the with field should be separated by pipes.
If the searches and evaluation of the sequence-based data are described in a published paper, the ID (either one assigned by PubMed or one assigned by another database such as a Model Organism Database) of the paper should be placed in the reference column. However, if the group that is doing the GO annotation performed the searches and evaluation of the sequence-based data, and there is no published reference, a reference can be used from the GO Consortium's collection of GO references; if there is nothing appropriate in this set, the annotating group should write a description of the methods of data collection and evaluation used and submit it to the GO Consortium. This will be added to the reference collection and will receive a GO_REF accession number for use in annotations. In all cases, the ID of the reference describing the methodology of the sequence analysis should be placed in the reference column.
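The rules for the with field of ISS-family annotations can be sketched as a validation function. The names `WITH_ENTRY` and `check_iss_with` are hypothetical, and the flag `accession_available` is an assumption standing for whether the matched object or model can be referred to by an accession (it would be False for a method such as tRNASCAN). The sketch checks only the `database:accession` shape and does not look the database abbreviation up in the GO abbreviations collection.

```python
import re
from typing import List

# One with-field entry: database:accession, with no pipe or whitespace.
WITH_ENTRY = re.compile(r"^[^\s:|]+:[^\s|]+$")


def check_iss_with(with_field: str, accession_available: bool) -> List[str]:
    """Return problems with the with field of an ISS, ISA, ISO or ISM annotation."""
    problems = []
    entries = [entry for entry in with_field.split("|") if entry]
    # An entry is mandatory when the match can be cited by accession.
    if accession_available and not entries:
        problems.append("with field is mandatory when an accession exists")
    for entry in entries:
        if not WITH_ENTRY.match(entry):
            problems.append(f"not in database:accession format: {entry}")
    return problems
```

Accessions that themselves contain a colon are accepted, because only the first colon is treated as the separator between database and accession.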
Examples of when to use ISS:
- An ISS annotation is often based on more than just one type of sequence-based evidence. Often, a host of searches are performed for any given query protein. These searches might include BLAST, profile HMMs, TMHMM, SignalP, PROSITE, InterPro, etc. Evaluation of output from these search tools (bear in mind that every search may not yield results for every protein) leads an annotator to a particular ISS annotation for a particular protein. For example, a BLAST search might reveal that a query protein matches an experimentally characterized protein from another species at 50% identity over the full lengths of both proteins. After reading literature about the match protein, the curator sees that the match protein is known to contain a domain located in the plasma membrane and another domain that extends into the cytoplasm. It is also known from the literature that the experimentally characterized match protein requires the binding of ATP to function. TMHMM analysis of the query protein predicts several membrane spanning regions in one half of the protein (consistent with location in a membrane). In addition there are PROSITE and Pfam results which reveal the presence of an ATP-binding domain in the other half of the protein which TMHMM predicts to be cytoplasmic. These four search results taken together point to a probable identification of the query protein as having the function of the match protein.
- PMID:8674114 describes comparative analysis of several newly identified and previously characterized snoRNAs. They list a number of sequence features, both conserved sequence elements and a region of complementarity to rRNA, and spacings that are characteristic of box C/D snoRNAs. As the authors don't develop a predictive method, the analysis they describe isn't considered to be a model, so ISM is not appropriate. As being a member of the box C/D snoRNA family is predictive for being a methylation guide, one could make annotations for a number of snoRNAs based on this paper. Note that the yeast U24 gene (snR24) is also experimentally characterized in this paper. Thus, for snR24 from S. cerevisiae, it is possible to make annotations using both the ISS and the IMP evidence codes, or one might choose not to make the ISS-based annotation for snR24 since experimental evidence is available.
ISA: Inferred from Sequence Alignment
- Sequence similarity with experimentally characterized gene products, as determined by alignments, either pairwise or multiple (tools such as BLAST, ClustalW, MUSCLE)
- An entry in the with field is mandatory.
The ISA code is a sub-category of the ISS code. It should be used whenever a sequence alignment is the basis for making an annotation, but only when a curator has manually reviewed the alignment and choice of GO term or, if the information is in a published paper, when the authors have manually reviewed the evidence. Such alignments may be pairwise alignments (the alignment of two sequences to one another) or multiple alignments (the alignment of 3 or more sequences to one another). BLAST produces pairwise alignments and any annotations based solely on the evaluation of BLAST results should use this code. GO policy states that in order to assert that a query protein has the same function as a match protein, the match protein MUST be experimentally characterized. This prevents transitive annotation errors. A transitive annotation error occurs when a protein gets its annotation by virtue of a match to an uncharacterized protein that may itself have gotten its annotation from yet another uncharacterized protein, and so on. With the high number of genome sequences currently in the public databases, the risk of transitive annotation errors is high. However, by requiring that every alignment used for a GO annotation contain an experimentally characterized protein, transitive annotation errors can be significantly reduced.
The process of evaluating a sequence alignment involves checking that the length of the matching region and the percent identity with the matching sequence are sufficient to infer shared function. Residues or secondary structures that are important for function should be conserved. The guiding principle in making sequence similarity based annotations should be that there is a good reason to believe that the comparison is relevant. This evaluation may be carried out by the curator, when sequence analysis is performed by the curators, or by authors of a published paper, when the curator is making annotations based on literature. In literature-based annotation it is incumbent upon the curator to identify which of the proteins in the sequence analysis are experimentally characterized so as to populate the with field.
A note about when to use ISO instead of ISA: If it is known that the experimentally characterized match protein in question is the functional ortholog of the query protein, then the code ISO (Inferred from Sequence Orthology) may be used (see the ISO section below). Orthologs are generally determined from phylogenetic analysis using algorithms such as maximum likelihood or neighbor joining. The presumption is that orthologs often have the same or similar biological function and/or engage in the same or similar biological processes. It can sometimes be difficult to determine when proteins are orthologs of each other, but if one is confident of orthology, the orthology-specific code should be used.
Note that we have not set definitive numerical cutoffs for the extent or percentage identity of sequence similarity comparisons because groups annotating very different organisms from the current MODs / reference genomes may find that a given arbitrarily selected numerical cutoff does not work when applied to a new organism. It is up to each annotating group to use judgment as to what sequence similarity comparisons are relevant for the purpose of making GO annotations.
It is mandatory to make an entry in the with column when using ISA. The entry in with is the accession number of the experimentally characterized sequence(s) that match the query sequence. Multiple entries in the with field should be separated by pipes. Annotations made with ISA without an entry in the with field will be filtered out by the Annotation File Format Quality Control script, which is run monthly.
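The with-field rule can be sketched as a small filter. This is an illustration only, not the Annotation File Format Quality Control script itself: the dictionary keys (`gene`, `evidence`, `with`), the gene names, and the function names are hypothetical.

```python
def split_with_field(with_field):
    """Split a pipe-separated with field into individual accessions."""
    return [entry.strip() for entry in with_field.split("|") if entry.strip()]


def passes_with_check(annotation, codes_requiring_with=("ISA",)):
    """Return False only if the evidence code requires a with entry and none is given."""
    if annotation["evidence"] not in codes_requiring_with:
        return True
    return len(split_with_field(annotation.get("with", ""))) > 0


# Hypothetical annotations: the second one lacks the mandatory with entry.
annotations = [
    {"gene": "query_1", "evidence": "ISA", "with": "UniProtKB:O00217"},
    {"gene": "query_2", "evidence": "ISA", "with": ""},
]

# Only annotations that satisfy the rule are kept.
kept = [a for a in annotations if passes_with_check(a)]
```

Passing a longer tuple as `codes_requiring_with` would let the same check cover other codes for which this guide makes the with field mandatory.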
If the generation and evaluation of the alignment was described in a published paper and then curated by a GO annotator, a reference to the paper should be placed in the reference column. However, if the same group that is doing the GO annotation performed the generation and evaluation of the alignment, then a reference should be placed in the reference column that describes the methodology used. If there is no publication for this methodology, a reference can be used from the GO Consortium's collection of GO references; if there is nothing appropriate in this set, the annotating group should write a description of the methods of data collection and evaluation used and submit it to the GO Consortium. This will be added to the reference collection and will receive a GO_REF accession number for use in annotations.
Examples of when to use ISA:
- A curator generates a pairwise alignment between a query Haemophilus influenzae protein that he/she is trying to annotate and a Vibrio marinus protein. The curator sees that the Vibrio protein is experimentally characterized. The curator evaluates the alignment and sees that the two proteins match over nearly their entire lengths at 68% identity. Furthermore, after reading information on the characterized Vibrio protein, the curator looks for the important residues needed for catalysis and binding in the Vibrio protein and finds that they are conserved in the Haemophilus protein. The curator reads the available literature on the Vibrio protein to determine what is known about that protein. The curator can then assign GO terms to the Haemophilus protein based on what has been experimentally determined in the Vibrio protein. The code for this annotation is ISA, and the accession number of the Vibrio protein should be placed in the with field. If the process used by the curator for evaluation of the sequence alignments is not in a published paper, the curator should refer to a GO standard reference, for example GO_REF:0000012.
- A curator performs sequence similarity analysis on a group of genes (e.g. sequence similarity alignments of the human NDUFS8 gene (UniProtKB accession: O00217) with several other genes) and identifies several genes with very high sequence identity to the experimentally characterized human NDUFS8 gene: orangutan and chimpanzee (both 100% sequence identity), crab-eating macaque (95% identity), and gorilla (92% identity). The curator judges that these high sequence matches to the human sequence mean that all the proteins possess a similar function. Annotations are therefore made for the related genes in orangutan (UniProt:Q5RC7), macaque (UniProt:Q60HE3), chimpanzee (UniProt:Q0MQI3), and gorilla (UniProt:Q0MQI2) by ISS with the experimentally characterized human NDUFS8 protein, and the accession number of the human NDUFS8 gene is included in the with column for each of these annotations. As there is no published paper describing this sequence analysis, the id of the GO_REF (e.g. GO_REF:0000024) that describes the process the curator carried out to make this judgment is placed in the REF_DB_ID field.
- PMID:2165073 identifies a new gene, AAC3, that is similar to two known genes of the same species (S. cerevisiae) based on Southern hybridization. Cloning and sequencing of the new AAC3 gene indicates that it is similar to the previously characterized ADP/ATP translocators AAC1 and PET9. For the AAC3 gene, an annotation may be made to the function term ATP:ADP antiporter activity using the evidence code ISA; the reference is the paper which performed the analysis, and the accession numbers of the experimentally characterized genes with which AAC3 was aligned (AAC1 and PET9) should be placed in the with field.
- PMID:12507466 describes a set of proteins containing both experimentally confirmed and predicted N-terminal acetyltransferases (NATs) that were collected and assigned to orthologous groups based on phylogenetic analysis. Three of the groups, Ard1, Mak3, and Nat3, were named based on the well characterized gene by that name from S. cerevisiae that is a member of the group. In addition, a previously unknown group with unknown substrate specificity was identified, called Nat5 based on the name of the S. cerevisiae member of the group. About the Nat5 family, the authors make this statement: "Nat5p represents a family of the putative NATs with orthologous proteins identified in yeast, S. pombe, C. elegans, D. melanogaster, A. thaliana and H. sapiens. The finding of this new family is only based on sequence similarity of Nat5p (YOR253Wp) to other NATs. Our attempts to detect any Nat5p substrates in yeast by 2D-gel electrophoresis has been so far unsuccessful, but this may reflect the rarity of the substrates in vivo or that Nat5p is acting on the smaller polypeptides with mobility parameters undetectable by our regular 2D-gel procedure." As a protein with sequence similarity to other NATs, the annotation that may be made for NAT5 is to the function term peptide alpha-N-acetyltransferase activity. Although this paper clearly discusses orthology relationships, the evidence code for this annotation for NAT5 is ISA because it is not based on the orthology relationship, but merely on similarity with the other experimentally characterized NATs in yeast, MAK3, ARD1, and NAT3, and the accession numbers of these three genes should be placed in the with field. The reference is the paper which performed the analysis. Note that this paper may also be used for annotations using the ISO code when the annotation is based on the orthology relationships described in the paper.
ISO: Inferred from Sequence Orthology
- Pairwise or multiple alignments between a query protein and experimentally characterized match proteins when the proteins are established to be orthologs of each other
- Phylogenetic analysis of a set of proteins to define orthologous groups.
- An entry in the with field is mandatory.
The ISO code is a sub-category of the ISS code. Orthology is a relationship between genes in different species indicating that the genes derive from a common ancestor. Orthology is established by multiple criteria generally including amino acid and/or nucleotide sequence comparisons and one or more of the following:
- phylogenetic analysis
- coincident expression
- conserved map location
- functional complementation
- immunological cross-reaction
- similarity in subcellular localization
- subunit structure
- substrate specificity
- response to specific inhibitors
It should be noted that there are known cases where a gene in one organism is significantly different in size from its ortholog(s) in other species. For example, the U2 snRNA in S. cerevisiae is much larger than vertebrate U2 snRNAs due to several additional domains. However it has been shown that both S. cerevisiae and vertebrate U2 snRNAs have the same conserved core and perform the same basic role in the spliceosome, even though a simplistic sequence comparison might miss this due to the large size difference between U2 in S. cerevisiae and U2 in mammalian species.
When making an annotation using the ISO evidence code, an entry in the with field is mandatory. This entry will be the accession number of an experimentally characterized orthologous gene product. The matching orthologous gene product must have substantiating experimental evidence to support the annotation. In addition, there will be cases where a gene product in one species is the ortholog of several closely related paralogous genes in another species. In these cases, the ID for all of these paralogs should be included in the with field. Annotations made with ISO without an entry in the with field will be filtered out by the Annotation File Format Quality Control script.
If the paper being used to make the annotation demonstrates the orthology, then that paper is used as the reference for that annotation. However, if the group doing the annotation is establishing orthology and there is no published reference, a reference can be used from the GO Consortium's collection of GO references; if there is nothing appropriate in this set, the annotating group should write a description of the methods of data collection and evaluation used and submit it to the GO Consortium. This will be added to the reference collection and will receive a GO_REF accession number for use in annotations.
It is important to note that if revised predictions on orthologous protein sets are produced at a later time than the original annotation, annotations should be updated accordingly.
Example of when to use ISO:
- PMID:12507466 describes a set of proteins containing both experimentally confirmed and predicted N-terminal acetyltransferases (NATs) that were collected and assigned to orthologous groups based on phylogenetic analysis. Three of the groups, Ard1, Mak3, and Nat3, were named based on the well characterized gene by that name from S. cerevisiae that is a member of the group. Proteins in these orthologous groups without experimental characterization can be assigned the function term peptide alpha-N-acetyltransferase activity based on orthology to the experimentally characterized proteins within the orthologous group. The evidence code for this annotation is ISO, the reference is the paper which performed the analysis, and the accession numbers of the experimentally characterized members of the orthologous group should be placed in the with field. The paper also makes it clear that the genes ARD1, MAK3, and NAT3 are well characterized experimentally; thus one could use the relevant one of these genes in the with field for annotations of members of their orthology groups without further reading. There may be additional characterized genes in each group, but this is not obvious from the paper. Note also that this paper describes a putative Nat5 family based only on sequence similarity of Nat5p (YOR253Wp) to other NATs. As there is no experimentally characterized member of the Nat5 family, no annotations may be made based on the Nat5 orthology grouping, though see the ISA section for a description of the annotation which may be made for NAT5.
ISM: Inferred from Sequence Model
- Prediction methods for non-coding RNA genes such as tRNAscan-SE, Snoscan, and Rfam
- Predicted presence of recognized functional domains or membership in protein families, as determined by tools such as profile Hidden Markov Models (HMMs), including Pfam and TIGRFAM
- Predicted protein features using tools such as TMHMM (transmembrane regions), SignalP (signal peptides on secreted proteins), and TargetP (subcellular localization)
- Any other kind of domain modeling tool or collections of them such as SMART, PROSITE, PANTHER, InterPro, etc.
- An entry in the with field is required when the model used is an object with an accession number (as found with Pfam, TIGRFAM, InterPro, PROSITE, Rfam, etc.) The with field may be left blank for tools such as tRNAscan and Snoscan where there is not an object with an accession to point to.
The ISM code is a sub-category of the ISS code. The ISM code should be used any time that evidence from some kind of statistical model of a sequence or group of sequences is used to make a prediction about the function of a protein or RNA. Generally, when searching sequences with these modeling tools, the results include statistical scores (such as e values and cutoff scores) that help curators decide when a result is significant enough to warrant making an annotation. If an annotator manually checks these scores and determines if the result makes sense in the context of other information known about the sequence and decides that the evidence warrants a particular annotation, then the evidence code is ISM. However, if a tool that looks only at the scores makes annotations automatically and there is no manual review, the evidence code should be IEA.
It is important to note that some models are more functionally specific than others. In particular this is seen in the profile HMMs and somewhat in PROSITE motifs. Some HMMs are built so that all of the proteins used in building the model and all of the proteins that score well to the model have the exact same function. These models can therefore be used to predict precise functions in match proteins. Other models are built to reflect the shared sequence found among members of superfamilies or subfamilies. These can be used to predict varying levels of functional specificity and may often only provide very general annotations such as identification of a protein as an oxidoreductase. Finally, many models predict the presence of particular domains in a protein which may or may not provide information on the function of a protein. For example, the CUB domain is found in a functionally diverse set of proteins and does not allow annotations to function to be made based on its presence alone. Therefore it is very important during the manual annotation process to assess what information it is safe to conclude from a match to any given model.
Some of the sequence-based modeling techniques result in models specific to individual sequence families. The profile HMMs, PROSITE motifs, and InterPro are in this group. In such cases, the with field should be populated with the accession number of the model specific for the functional domain or protein in question. Other sequence-based modeling techniques such as tRNAscan and Snoscan are methods that result in the prediction of a set of sequences within a particular class (e.g. tRNAs, snoRNAs) and there are not specific models that one can link to each ncRNA. In these cases the with field may be left blank.
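The ISM rules described above (manual review decides between ISM and IEA, and the with field depends on whether the model has an accession) can be sketched as two small helpers. This is an illustrative sketch only: the function names and the way the tool is identified by an exact name string are assumptions, and the set of tools is limited to the two named in the text.

```python
# Tools the text names as having no per-model accession to cite.
# Illustrative only; matching is by exact tool name and the set is not exhaustive.
TOOLS_WITHOUT_MODEL_ACCESSION = {"tRNAscan", "Snoscan"}


def code_for_model_hit(manually_reviewed):
    """ISM when a curator reviewed the scores and context, otherwise IEA."""
    return "ISM" if manually_reviewed else "IEA"


def ism_with_field_is_acceptable(tool, with_field):
    """The with field may be blank only for tools without a model accession."""
    if tool in TOOLS_WITHOUT_MODEL_ACCESSION:
        return True
    return bool(with_field.strip())


# A reviewed Pfam hit needs the model accession in the with field.
pfam_ok = ism_with_field_is_acceptable("Pfam", "Pfam:PF05426")
pfam_missing = ism_with_field_is_acceptable("Pfam", "")
# A reviewed tRNAscan prediction may leave the with field blank.
trnascan_ok = ism_with_field_is_acceptable("tRNAscan", "")
```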
If the search for, and evaluation of, the sequence-based model data was described in a published paper, a reference to the paper should be placed in the reference column. However, if the search for and evaluation of the data was performed by the same group that is doing the GO annotation, then a reference should be placed in the reference column that describes the methodology used. If there is no publication for this methodology, a reference can be used from the GO Consortium's collection of GO references; if there is nothing appropriate in this set, the annotating group should write a description of the methods of data collection and evaluation used and submit it to the GO Consortium. This will be added to the reference collection and will receive a GO_REF accession number for use in annotations.
Examples of when to use ISM:
- A curator performs an HMM search for a query protein. The result is that the query protein scores above the trusted cutoff to the HMM PF05426 alginate lyase. This HMM describes a family of alginate lyases. After review of all documentation associated with the HMM to determine functional specificity, or lack thereof, of the HMM and review of the scores that the query protein received, if the curator is confident that the query protein is indeed an alginate lyase, the appropriate annotations should be made using ISM as the evidence code, and putting Pfam:PF05426 in the with column. Since this search and evaluation was performed by the curator, a GO standard reference should be used to describe the search and evaluation methods (e.g. GO_REF:0000011).
- A paper describes using PROSITE searches with the protein of interest and concludes the protein has a particular binding activity based on a match to a particular PROSITE motif. The curator would make the appropriate GO annotations, using ISM as the evidence code, putting the accession number of the PROSITE motif that provided the evidence in the with column, and the PMID number of the paper that described the work in the reference column.
- A curator runs the program tRNAscan (Lowe, T.M. and Eddy, S.R. NAR, 1997) on a newly sequenced bacterial genome to find the tRNAs. tRNAscan produces a list of the tRNA genes contained within that genome. A curator checks the results of the analysis to make sure that the predictions make sense and are consistent with what is known about the organism. Each of these genes is given appropriate annotations for a tRNA. The evidence code is ISM, and a reference describing the process the curator used (either a published paper or a GO standard reference) should be placed in the reference column. The with column may be left blank.
- PMID:10024243 describes the use of a probabilistic model to predict snoRNA genes in yeast. Each of these genes may be given appropriate annotations for a snoRNA. The evidence code is ISM, and the reference is the paper describing the work. The with column may be left blank.
IGC: Inferred from Genomic Context
Updated November 9, 2007
- operon structure
- syntenic regions
- pathway analysis
- genome scale analysis of processes
This evidence code can be used whenever information about the genomic context of a gene product forms part of the evidence for a particular annotation. Genomic context includes, but is not limited to, such things as identity of the genes neighboring the gene product in question (i.e. synteny), operon structure, and phylogenetic or other whole genome analysis.
IGC may be used in situations where part of the evidence for the function of a protein is that it is present in a putative operon for which the other members of the operon have strong sequence or literature based evidence for function. The presence of the gene in an operon specific for a particular function, pathway, complex, etc. is itself a form of evidence. When using this code on the basis of operon structure, curators are encouraged to put the ID numbers of the genes in the operon in the with/from field.
The IGC evidence code can also be used to annotate gene products encoded by genes within a region of conserved synteny. For instance, sequence similarity alone may be too low to make an inference but orthology can often be predicted based on the position of a gene within a region of synteny and this used to strengthen the assertion. In these cases the with/from field should be used to store the identity of the positional ortholog.
In the area of process annotations, in order for us to assert that a gene product is involved in a particular process in the cell, that process itself must be happening in that cell. The only way to know if a process is happening is to determine if all of the elements required for that process are present. This is often accomplished by looking to see if there are genes in the genome which can complete every step in the process in question. The same holds true for subunits of protein complexes. This often entails examining many different gene products and many different evidence types found all around the genome of an organism to reach a particular conclusion.
When the method used to make annotations using the IGC code is performed internally by the annotating group and is not published, a short description of the method should be written and added to the GO Consortium's collection of GO references, where it will be given a GO_REF ID which can be used to cite the reference in gene association files.
Usage of the With/From Column for IGC
We recommend making an entry in the with/from column when using this evidence code. In cases where operon structure or synteny are the compelling evidence, include identifier(s) for the neighboring genes in the with/from column. In cases where metabolic reconstruction is the compelling evidence, and there is an identifier for the pathway or system, that should be entered in the with/from column. When multiple entries are placed in the with/from field, they are separated by pipes.
Note that there has been some discrepancy between groups as to the use of the with/from column; please see the Note on Usage of the With/from Column for more details.
... | 2. DB Object ID | 3. DB Object Symbol | 4. Qualifier | 5. GO ID | 6. DB:Reference | 7. Evidence Code | 8. With/From | ... |
---|---|---|---|---|---|---|---|---|
... | TIGR_CMR:gene_B_ID | gene B | | GO:0009231 | GO_REF:0000025 | IGC | operon_geneA_ID\|operon_geneC_ID (from operon in annotated organism) | ... |
... | TIGR_CMR:gene_A_ID | gene A | | GO:0009102 | PMID:15347579 | IGC | TIGR_GenProp:GenProp0036 | ... |
IBA: Inferred from Biological aspect of Ancestor
Updated May 3, 2011
- A type of phylogenetic evidence whereby an aspect of a descendant is inferred through the characterization of an aspect of an ancestral gene.
IBD: Inferred from Biological aspect of Descendent
Updated May 3, 2011
- A type of phylogenetic evidence whereby an aspect of an ancestral gene is inferred through the characterization of an aspect of a descendant gene.
IKR: Inferred from Key Residues
Updated May 2, 2012
- A type of manually-curated evidence derived from sequence analysis, characterized by the lack of key sequence residues. All annotations that apply this evidence code should use the 'NOT' qualifier. This evidence code is used to annotate a gene product when, although homologous to a particular protein family, it has lost essential residues and is very unlikely to be able to carry out an associated function, participate in the expected associated process, or be found in a certain location. This annotation statement can be supported by a published literature reference (e.g. a PubMed identifier) that has described the sequence analysis efforts, or by a GO Reference that describes the process a curator undertook to become sufficiently convinced of the sequence mutation. Where an IKR annotation statement is made using a GO Reference, inclusion of an identifier in the 'with/from' column of the annotation format that can indicate to the user the lacking residues (e.g. an alignment, domain or annotation rule identifier) is absolutely required. In contrast, when an IKR annotation statement is supported by a published literature reference, a value in the 'with/from' field is highly recommended although not required. This evidence code is also referred to as IMR (Inferred from Missing Residues).
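The IKR requirements above can be summarized as a short validation sketch. It is illustrative only and not an official GO Consortium check: the dictionary keys (`qualifier`, `reference`, `with`), the function name, and the wording of the messages are hypothetical. Missing required items are reported as errors, and the recommended-but-optional with/from value is reported as a warning.

```python
def check_ikr(annotation):
    """Return (errors, warnings) for an annotation that uses the IKR code."""
    errors, warnings = [], []

    # All IKR annotations should carry the NOT qualifier.
    qualifiers = annotation.get("qualifier", "").split("|")
    if "NOT" not in qualifiers:
        errors.append("IKR annotations should use the NOT qualifier")

    has_with = bool(annotation.get("with", "").strip())
    if annotation["reference"].startswith("GO_REF:"):
        # Supported by a GO Reference: a with/from identifier is required.
        if not has_with:
            errors.append("with/from is required when the reference is a GO_REF")
    elif not has_with:
        # Supported by literature: with/from is recommended, not required.
        warnings.append("with/from is recommended for literature-based IKR")

    return errors, warnings


curator_based = {
    "qualifier": "NOT",
    "reference": "GO_REF:0000047",
    "with": "InterPro:IPR000126",
}
paper_based = {"qualifier": "NOT", "reference": "PMID:12568721", "with": ""}

curator_result = check_ikr(curator_based)
paper_result = check_ikr(paper_based)
```

Here the curator-based annotation yields no errors or warnings, while the paper-based one yields only a warning, reflecting the difference between "absolutely required" and "highly recommended".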
Examples where the IKR evidence code should be used:
- Curator-Determined IKR Annotation Example: Rat HPT (P06866) is homologous to serine proteases and contains a match to the peptidase S1 domain. However further sequence analysis by a curator looking at the Peptidase S1B, active site, established it has lost all essential catalytic residues, making it unable to carry out serine protease activity.
- Curator-Determined IKR Annotation Example, Using PAINT: Curators determined that Drosophila neuroligin protein does not have carboxylesterase activity, based on phylogeny-based evidence. The Panther identifier in the 'with/from' field links out to an evidence record citing annotation data from orthologous gene products, supporting the annotation statement.
- Paper-Curated IKR Annotation Example: Ross,J., Jiang,H., Kanost,M.R. and Wang,Y. (2003) Serine proteases and their homologs in the Drosophila melanogaster genome: an initial analysis of sequence conservation and phylogenetic relationships. Gene 30;304:117-31 (PMID:12568721). The authors describe the determination of serine protease activity of proteins from the D. melanogaster S1 serine protease gene family, by determining the presence of conserved His, Asp, Ser catalytic triad residues in retrieved sequences. If all three residues were present in the conserved TAAHC, DIAL, and GDSGGP motifs, the sequence was considered to have serine protease activity. Any sequence lacking one of the key residues was identified as a serine protease homolog, lacking proteolytic activity.
... | 2. DB Object ID | 3. DB Object Symbol | 4. Qualifier | 5. GO ID | 6. DB:Reference | 7. Evidence Code | 8. With/From | ... |
---|---|---|---|---|---|---|---|---|
... | P06866 | RatHPT | NOT | GO:0004252 serine-type endopeptidase activity | GO_REF:0000047 | IKR | InterPro:IPR000126 | ... |
... | P06866 | neuroligin | NOT | GO:0004091 carboxylesterase activity | GO_REF:0000033 | IKR | PANTHER:PTHR11559_AN146 | ... |
... | FB:FBgn0033192 | gene S1 | NOT | GO:0004252 serine-type endopeptidase activity | PMID:12568721 | IKR | | ... |
Examples where the IKR evidence code should not be used:
- If there is experimental evidence available from a publication to support a NOT-evidenced annotation. In such instances, the curator should make the IDA, IMP or EXP NOT-qualified annotation based on the experimental evidence. If a paper supplies data that showed the active site was missing and additionally carried out an experimental assay to show lack of activity, it would be correct to create two annotation statements from this paper; both NOT IKR and NOT IDA.
- CAUTION: Where curators make judgements about function using the IKR evidence code, they should be able to draw on some level of expertise regarding the protein family, as there will always be exceptions to the rule. For instance, Q9H4A3 (WNK1_HUMAN) is a good example where nature has confounded prediction; Cys-250 is present instead of the conserved Lys which is expected to be an active site residue. However, Lys-233 appears to fulfill the required catalytic function.
IRD: Inferred from Rapid Divergence
Updated May 3, 2011
- A type of phylogenetic evidence characterized by rapid divergence from ancestral sequence. Annotating with this evidence code implies a NOT annotation.
RCA: Inferred from Reviewed Computational Analysis
Updated November 9, 2007
- Predictions based on computational analyses of large-scale experimental data sets
- Predictions based on computational analyses that integrate datasets of several types, including experimental data (e.g. expression data, protein-protein interaction data, genetic interaction data, etc.), sequence data (e.g. promoter sequence, sequence-based structural predictions, etc.), or mathematical models
The RCA code should be used for annotations made from predictions based on computational analyses of large-scale experimental data sets, or on computational analyses that integrate multiple types of data. Acceptable experimental data types include protein-protein interaction data (e.g. two-hybrid results, mass spectroscopic identification of proteins identified by affinity tag purifications, etc.), synthetic genetic interactions, and microarray expression results. Sequence-based data based on the sequence of the gene product, including structural predictions based on sequence, may be included provided that the analysis included non-sequence-based data as well. Sequence information related to promoter sequence features may also be included as a data type within these analyses. Predictions based on mathematical modelling which attempts to duplicate existing experimental results are also appropriate for use of this evidence code.
Analyses based purely on comparisons of the gene product sequence, including sequence similarity with experimentally characterized gene products, as determined by pairwise or multiple alignment; prediction methods for non-coding RNA genes; recognized functional domains, as determined by tools such as InterPro, Pfam, SMART, etc. and including the use of files such as interpro2go, pfam2go, smart2go to convert the domain hits to GO terms; predicted protein features, e.g., transmembrane regions, signal sequence, etc.; structural similarity with experimentally characterized gene products, as determined by crystallography, nuclear magnetic resonance, or computational prediction; or analyses combining multiple types of data based on the gene product sequence should use the ISS evidence code (or the IEA code if it is not reviewed by a curator).
Similarly for experimental data, if the annotation was made purely on the basis of an experimental result, e.g. a protein-protein interaction with a characterized protein, a genetic interaction with a characterized gene, or having a similar microarray expression pattern as a characterized gene, then the appropriate experimental evidence code, IPI, IGI, or IEP, respectively, should be used instead.
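The boundaries between RCA, ISS, IEA, and the experimental codes described above can be sketched as a decision function. This is a simplified illustration under stated assumptions, not an official decision procedure: the data-type labels, the `large_scale` flag, and the function name are hypothetical, and a real decision rests on curator judgment.

```python
# Hypothetical labels for analyses based purely on the gene product sequence.
SEQUENCE_TYPES = {
    "sequence_similarity",
    "ncRNA_prediction",
    "domain_prediction",
    "protein_feature_prediction",
    "structural_similarity",
}

# A single experimental result maps to its own experimental evidence code.
SINGLE_EXPERIMENT_CODES = {
    "protein_interaction": "IPI",
    "genetic_interaction": "IGI",
    "expression_pattern": "IEP",
}


def suggest_code(data_types, large_scale, curator_reviewed):
    """Suggest an evidence code for an annotation derived from an analysis."""
    data_types = set(data_types)
    if not data_types:
        raise ValueError("at least one data type is needed")
    # Only IEA is assigned without curator review.
    if not curator_reviewed:
        return "IEA"
    # Purely gene-product-sequence-based analyses use ISS, not RCA.
    if data_types <= SEQUENCE_TYPES:
        return "ISS"
    # A single, small-scale experimental result uses its experimental code.
    if len(data_types) == 1 and not large_scale:
        (only,) = data_types
        if only in SINGLE_EXPERIMENT_CODES:
            return SINGLE_EXPERIMENT_CODES[only]
    # Large-scale or integrated analyses that include non-sequence data.
    return "RCA"


integrated = suggest_code({"protein_interaction", "sequence_similarity"}, True, True)
sequence_only = suggest_code({"sequence_similarity", "domain_prediction"}, False, True)
```

In this sketch, `integrated` comes out as RCA because non-sequence data are part of the analysis, and `sequence_only` comes out as ISS.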
Examples where the RCA evidence code should be used:
- A 2003 study (PMID:14566057) analyzed all interactions for S. cerevisiae present in the Database of Interacting Proteins (DIP) and made predictions about the roles of genes that were uncharacterized at the time. GO annotations resulting from this publication include the process term 'rRNA processing' for both UTP30 and NOP6, neither of which was experimentally characterized at the time. A role for NOP6 in the biogenesis of the small ribosomal subunit has subsequently been indicated via a genetic interaction with the experimentally characterized gene EMG1.
- A 2003 study (PMID:12826619) ...
Examples where the RCA evidence code should not be used:
- Annotations based on more than one type of gene product sequence based evidence, including such things as BLAST, profile HMMs, TMHMM, SignalP, PROSITE, InterPro, mapping files such as interpro2go etc. should use the ISS code.
- Annotations based on integrated computational analyses, if they have not been reviewed by a curator, should receive the IEA code.
Author Statement Evidence Codes
TAS: Traceable Author Statement
Updated November 9, 2007
- Any statement in an article where the original evidence (experimental results, sequence comparison, etc.) is not directly shown, but is referenced in the article and therefore can be traced to another source.
The TAS evidence code covers author statements that are attributed to a cited source. Typically this type of information comes from review articles. Material from the introductions and discussion sections of non-review papers may also be suitable if another reference is cited as the source of experimental work or analysis.
When annotating with this code the curator should use caution and be aware that authors often cite papers dealing with experiments that were performed in organisms different from the one being discussed in the paper at hand. Thus a problem with the TAS code is that it may turn out from following up the references in the paper that no experiments were performed on the gene in the organism actually being characterized in the primary paper. For this reason we recommend (when time and resources allow) that curators track down the cited paper and annotate directly from the experimental paper using the appropriate experimental evidence code. When this is not possible and it is necessary to annotate from reviews, the TAS code is the appropriate code to use for statements that are associated with a cited reference.
Once an annotation has been made to a given term using an experimental evidence code, we recommend removing any annotations made to the same term using the TAS evidence code.
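The clean-up recommended above can be sketched in a few lines of Python. This is only an illustration: the dictionary keys (`object_id`, `go_id`, `evidence`) and the example rows are hypothetical and do not correspond to any particular database's schema.

```python
# Experimental evidence codes as listed earlier in this guide.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def drop_superseded_tas(annotations):
    """Drop TAS rows whose (gene product, GO term) pair also has experimental support.

    Each annotation is a dict with the hypothetical keys
    "object_id", "go_id" and "evidence".
    """
    # Collect every (gene product, GO term) pair backed by an experimental code.
    supported = {
        (a["object_id"], a["go_id"])
        for a in annotations
        if a["evidence"] in EXPERIMENTAL_CODES
    }
    # Keep everything except TAS rows for a pair that is already supported.
    return [
        a
        for a in annotations
        if a["evidence"] != "TAS" or (a["object_id"], a["go_id"]) not in supported
    ]


# Made-up rows: the first TAS row is superseded by the IDA row, the second is not.
rows = [
    {"object_id": "DB:0001", "go_id": "GO:0005634", "evidence": "TAS"},
    {"object_id": "DB:0001", "go_id": "GO:0005634", "evidence": "IDA"},
    {"object_id": "DB:0002", "go_id": "GO:0005634", "evidence": "TAS"},
]
kept = drop_superseded_tas(rows)
```

Annotations with other evidence codes are left untouched, since the recommendation only concerns TAS.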
Note that prior to July 2006, the TAS evidence code could be used for annotations based on information found in a textbook or dictionary, because textbook material has often become common knowledge (e.g. "everybody" knows that enolase is a glycolytic enzyme). However, at the 2006 GO Annotation Camp, it was concluded that this sort of information is not traceable to its source and is thus not suitable for the TAS evidence code. When annotating on the basis of common knowledge possessed by the curator, consider the IC code. When annotating an author statement that is not associated with a cited reference, use the NAS code.
Examples where the TAS evidence code should be used:
- Annotating the twelve S. cerevisiae genes (RPO21, RPB2, RPB3, RPB4, RPB5, RPO26, RPB7, RPB8, RPB9, RPB10, RPC10, and RPB11) that are part of the core complex of RNA polymerase II to the GO term DNA-directed RNA polymerase II, core complex ; GO:0005665 based on a table in a 1998 publication (PMID:9774381) that lists each of these genes as encoding a subunit of the enzyme and gives one or more references for each subunit.
- Annotating the human myo9b gene to the GO term Rho GTPase activator activity ; GO:0005100 based on this statement in the introduction of a 2002 research article (PMID:11801597):
"Biochemical characterization of both bacterially expressed Myr5 and Myr7 tail domains and tissue-purified human Myo9b demonstrate that these myosins IX are active GAPs for Rho but not Rac or CDC 42 (3,4,7)."
Examples where the TAS evidence code should not be used:
- In 2001 (PMID:11158314), the authors state:
"All of the CELF proteins contain multiple potential protein kinase C and casein kinase II phosphorylation sites. All are predicted to have predominantly nuclear localization, and CELF3, CELF4, and CELF5 each possess a consensus nuclear localization signal sequence near the C terminus."
As this paper provided no reference to support the authors' assertion that CELF3 is located in the nucleus (nor any sequence analysis related to this statement), and in the absence of better published data at the time of curation, CELF3 has been annotated to the GO term nucleus with the NAS evidence code and not the TAS evidence code.
... | 2. DB Object ID | 3. DB Object Symbol | 4. Qualifier | 5. GO ID | 6. DB:Reference | 7. Evidence Code | 8. With/From | ...
---|---|---|---|---|---|---|---|---
... | UniProt:Q5SZQ8 | CELF3_HUMAN | | GO:0005634 | PMID:11158314 | NAS | | ...
- When an annotator makes an annotation based on a combination of another GO annotation and common knowledge. For example, if a curator makes an annotation to the cellular component term nucleus on the basis that the gene product is already annotated to the molecular function term general RNA polymerase II transcription factor activity and the common knowledge that transcription factors interacting with RNA polymerase II act in the nucleus, then the IC evidence code should be used with the GO ID for the GO term from which the annotation was derived in the with/from field and the same reference should be cited as was used for the annotation to the term whose GO ID is placed in the with/from field.
NAS: Non-traceable Author Statement
Updated November 9, 2007
- Database entries that don't cite a paper (e.g. UniProt Knowledgebase records, YPD protein reports)
- Statements in papers (abstract, introduction, or discussion) that a curator cannot trace to another publication
The NAS evidence code should be used in all cases where the author makes a statement that a curator wants to capture but for which there are neither results presented nor a specific reference cited in the source used to make the annotation. The source of the information may be peer reviewed papers, textbooks, or database records. For some annotations using the NAS code, there will not be an entry in the with/from field.
The NAS code is also used for making annotations from database entries when a curator reviews the annotations that result. Typically such annotations will refer to an unpublished reference describing what was done, either a reference with a GO_REF id or an internal reference from the specific annotating database.
Cases where the NAS code should be used:
- In 2001 (PMID:11158314), the authors state that:
"All of the CELF proteins contain multiple potential protein kinase C and casein kinase II phosphorylation sites. All are predicted to have predominantly nuclear localization, and CELF3, CELF4, and CELF5 each possess a consensus nuclear localization signal sequence near the C terminus."
... | 2. DB Object ID | 3. DB Object Symbol | 4. Qualifier | 5. GO ID | 6. DB:Reference | 7. Evidence Code | 8. With/From | ...
---|---|---|---|---|---|---|---|---
... | UniProt:Q5SZQ8 | CELF3_HUMAN | | GO:0005634 | PMID:11158314 | NAS | | ...
Cases where the NAS code should not be used:
- When an author makes a statement that is attributed to a source cited in the reference list, use the TAS evidence code.
- When an annotator makes an annotation based on a combination of another GO annotation and common knowledge. For example, if a curator makes an annotation to the cellular component term nucleus on the basis that the gene product is already annotated to the molecular function term general RNA polymerase II transcription factor activity and the common knowledge that transcription factors interacting with RNA polymerase II act in the nucleus, then the IC evidence code should be used with the GO ID for the GO term from which the annotation was derived in the with/from field and the same reference should be cited as was used for the annotation to the term whose GO ID is placed in the with/from field.
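The choice between TAS, NAS and IC described in these sections can be summarized as a small decision function. This is a simplified sketch, not an official decision procedure: the function name and its three boolean parameters are hypothetical, and real curation involves judgment that booleans cannot capture.

```python
def statement_evidence_code(evidence_shown, cites_source, inferred_from_go_annotation):
    """Suggest an evidence code for a statement-based annotation.

    evidence_shown: the source itself presents results or analysis.
    cites_source: the statement is attributed to a cited reference.
    inferred_from_go_annotation: the curator infers the term from another
        GO annotation combined with common knowledge.
    Returns "IC", "TAS", "NAS", or None when an experimental or
    computational analysis code should be chosen instead.
    """
    if inferred_from_go_annotation:
        return "IC"   # the with/from field must then carry the supporting GO ID
    if evidence_shown:
        return None   # not an author statement; annotate from the evidence itself
    return "TAS" if cites_source else "NAS"


# The CELF3 statement: no results shown, no reference cited.
celf3_code = statement_evidence_code(False, False, False)
# The myo9b statement: no results shown, but references (3,4,7) are cited.
myo9b_code = statement_evidence_code(False, True, False)
```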
Curatorial Statement Evidence Codes
IC: Inferred by Curator
Updated September 22, 2011
The IC evidence code is to be used for those cases where an annotation is not supported by any direct evidence, but can be reasonably inferred by a curator from other GO annotations, for which evidence is available.
An example would be when there is evidence (be it direct assay, sequence similarity or even from electronic annotation) that a particular gene product has the function RNA polymerase II transcription factor activity ; GO:0003702. There is no direct evidence showing that this gene product is located in the nucleus, but this would be a perfectly reasonable inference for a curator to make since the curator is annotating a eukaryotic gene product that is associated with a specific nuclear RNA polymerase. This inference will be linked to the annotation to the term RNA polymerase II transcription factor activity ; GO:0003702 in two ways: both annotations will share the same reference; and the annotation inferred by a curator will include one or more with/from statements pointing to the GO term(s) used by the curator for the inference.
In many cases a GO term can be inferred from just one other annotation, as described above. Occasionally, a curator has to infer the GO term from multiple GO annotations. The with/from field in these annotations will therefore supply more than one GO identifier, obtained from the set of supporting GO annotations that are assigned to the same gene or gene product identifier and cite publicly available references. In addition, such IC annotations will use the reference GO_REF:0000036.
Usage of the With/From Column for IC
Note that the with/from field must always be filled in with a GO ID when using this evidence code.
For example, a 1998 publication (PMID:9651335) provides evidence that the protein encoded by the S. cerevisiae UGA3 gene has the function specific RNA polymerase II transcription factor activity ; GO:0003704. From this, the curator deduces it is located in the nucleus and thus makes an annotation to the cellular component term nucleus ; GO:0005634 with the GO ID for the function term in the with/from field of the component annotation.
... | 2. DB Object ID | 3. DB Object Symbol | 4. Qualifier | 5. GO ID | 6. DB:Reference | 7. Evidence Code | 8. With/From | ...
---|---|---|---|---|---|---|---|---
... | SGDID:S000002329 | UGA3 | | GO:0003704 | PMID:9651335 | IPI | | ...
... | SGDID:S000002329 | UGA3 | | GO:0005634 | PMID:9651335 | IC | GO:0003704 | ...
The second example, shown below, illustrates the use of IC with GO_REF:0000036. In this case, a curator has inferred an annotation for the CUP9 gene to the GO term transcriptional open complex formation at RNA polymerase II promoter ; GO:0001113 based on evidence from PMID:9427760 and PMID:18708352; the with/from column supplies the GO IDs derived from these two publications.
... | 2. DB Object ID | 3. DB Object Symbol | 4. Qualifier | 5. GO ID | 6. DB:Reference | 7. Evidence Code | 8. With/From | ...
---|---|---|---|---|---|---|---|---
... | SGDID:S000006098 | CUP9 | | GO:0000122 | PMID:9427760 | IMP | | ...
... | SGDID:S000006098 | CUP9 | | GO:0000978 | PMID:9427760 | IDA | | ...
... | SGDID:S000006098 | CUP9 | | GO:0001103 | PMID:18708352 | IPI | CYC8 | ...
... | SGDID:S000006098 | CUP9 | | GO:0001113 | GO_REF:0000036 | IC | GO:0000122\|GO:0000978\|GO:0001103 | ...
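The rule that an IC annotation must carry one or more GO IDs in its with/from field lends itself to an automated check. The sketch below assumes that the field is available as a pipe-separated string and that GO IDs consist of `GO:` followed by seven digits; the function name is hypothetical.

```python
import re

# A GO identifier: the prefix "GO:" followed by exactly seven digits.
GO_ID = re.compile(r"GO:\d{7}")


def ic_with_from_problems(evidence, with_from):
    """Return a list of problems with the with/from field of an IC annotation.

    An empty list means no problem was found. Annotations with other
    evidence codes are not checked here.
    """
    if evidence != "IC":
        return []
    # Entries are separated by pipes; ignore empty fragments.
    entries = [e for e in with_from.split("|") if e]
    if not entries:
        return ["IC annotation has an empty with/from field"]
    return [f"{e!r} is not a GO ID" for e in entries if not GO_ID.fullmatch(e)]


# The two IC rows from the examples above pass the check.
uga3_problems = ic_with_from_problems("IC", "GO:0003704")
cup9_problems = ic_with_from_problems("IC", "GO:0000122|GO:0000978|GO:0001103")
```

The check is deliberately limited to the format of the field; whether the listed GO terms actually justify the inference remains a curatorial decision.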
ND: No biological Data available
Updated November 9, 2007
Used for annotations when information about the molecular function, biological process, or cellular component of the gene or gene product being annotated is not available.
Use of the ND evidence code indicates that the annotator at the contributing database found no information that allowed making an annotation to any term indicating specific knowledge from the ontology in question (molecular function, biological process, or cellular component) as of the date indicated. This code should be used only for annotations to the root terms, molecular function ; GO:0003674, biological process ; GO:0008150, or cellular component ; GO:0005575, which, when used in annotations, indicate that no knowledge is available about a gene product in that aspect of GO.
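The restriction of ND to the three root terms can be expressed as a one-line check. In this sketch the function name is hypothetical; the root term IDs are the ones given above.

```python
# Root terms of the three ontologies, as given in this guide.
ROOT_TERMS = {
    "GO:0003674": "molecular function",
    "GO:0008150": "biological process",
    "GO:0005575": "cellular component",
}


def nd_usage_ok(go_id, evidence):
    """Return False only when ND is used with a term that is not a root term.

    Annotations with any other evidence code are not judged by this check.
    """
    return evidence != "ND" or go_id in ROOT_TERMS


root_ok = nd_usage_ok("GO:0008150", "ND")      # ND on the biological process root
non_root_ok = nd_usage_ok("GO:0005634", "ND")  # ND on nucleus, which is not a root
```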
Annotations made with the ND evidence code should be accompanied by a reference that explains that curators looked but found no information. Note that some groups check only published literature while other groups also make sequence comparisons to see if an annotation can be made on the basis of a sequence comparison. The GO Reference collection includes a reference that can be used with ND when both literature and sequence have been checked; to use it, put "GO_REF:0000015" in the reference column of a gene association file.
Note that use of the ND evidence code with an annotation to one of the root nodes to indicate lack of knowledge in that aspect makes a statement about the lack of knowledge only with respect to that particular aspect of the ontology. Use of the ND evidence code to indicate lack of knowledge in one particular aspect does not make any statement about the availability of knowledge or evidence in the other GO aspects.
Even if an author states in a paper that there is no data available or nothing is known about the gene product in a particular GO aspect, annotation to the corresponding root node should be made with the ND evidence code citing either the annotating group's internal reference or the GOC's reference on use of the ND evidence code, not a specific paper.
Note: The ND evidence code, unlike other evidence codes, should be considered a code that indicates curation status or progress rather than the method used to derive an annotation.
When a gene product is annotated to a GO term using the NOT qualifier, this is a statement that it is not appropriate to associate that specific GO term with that particular gene product. However, such a negative annotation does not make any positive statements about the role of that gene product. Thus, there should always be a positive annotation, in addition to the NOT annotation. If nothing is known about the role of the gene product in a given aspect (molecular function, biological process, or cellular component) of GO, then the positive annotation should be made to the root node for that aspect using the ND evidence code.
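The requirement that every NOT annotation be accompanied by a positive annotation in the same aspect can be checked over a set of annotations. The following is a sketch under stated assumptions: annotations are dicts with the hypothetical keys `object_id`, `aspect` and `qualifier`, the aspect is a single letter (F, P or C), and the qualifier is a pipe-separated string that may contain `NOT`.

```python
def aspects_missing_positive(annotations):
    """Return sorted (object_id, aspect) pairs that have NOT rows but no positive row."""
    negative, positive = set(), set()
    for a in annotations:
        key = (a["object_id"], a["aspect"])
        # A qualifier field may hold several pipe-separated values, e.g. "NOT|contributes_to".
        if "NOT" in a["qualifier"].split("|"):
            negative.add(key)
        else:
            positive.add(key)
    return sorted(negative - positive)


# Made-up rows: a NOT annotation in the process aspect with no positive partner.
rows = [
    {"object_id": "DB:0001", "aspect": "P", "qualifier": "NOT"},
    {"object_id": "DB:0001", "aspect": "F", "qualifier": ""},
]
missing = aspects_missing_positive(rows)

# Adding a positive annotation in the same aspect (for instance to the root
# term with the ND code) resolves the problem.
rows.append({"object_id": "DB:0001", "aspect": "P", "qualifier": ""})
missing_after_fix = aspects_missing_positive(rows)
```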
Computationally-assigned Evidence Codes
IEA: Inferred from Electronic Annotation
Updated November 9, 2007
- Annotations based on "matches" in sequence similarity comparisons if they have not been reviewed by a curator
- Annotations transferred from database records, if not reviewed by a curator
- Annotations made on the basis of keyword mapping files, if not reviewed by a curator
- If annotations based on sequence similarity based methods have been reviewed by a curator, use ISS instead and change the reference from the one that describes the computational analysis to one that says that the curator reviewed the sequence similarity and approved it.
Used for annotations that depend directly on computation or automated transfer of annotations from a database, particularly when the analysis is performed internally and not published. A key feature that distinguishes this evidence code from others is that it is not made by a curator; use IEA when no curator has checked the specific annotation to verify its accuracy. The actual method used (BLAST search, Swiss-Prot keyword mapping, etc.) doesn't matter.
When the method used to make annotations using the IEA code is performed internally by the annotating group and is not published, a short description of the method should be written and added to the GO Consortium's collection of GO references, where it will be given a GO_REF ID which can be used to cite the reference in gene association files.
Examples where the IEA evidence code should be used:
- Annotations based on "matches" in sequence similarity comparisons if they have not been reviewed by a curator. If annotations based on sequence similarity based methods have been reviewed by a curator, use ISS instead.
- Annotations transferred from database records, if not reviewed by a curator. If such annotations are reviewed by a curator and the database record has no linked publication, consider the NAS code.
- Annotations made on the basis of keyword mapping files, if not reviewed by a curator
Examples where the IEA evidence code should not be used:
- Annotations based on "matches" in sequence similarity comparisons that have been reviewed by a curator should be made with the ISS code.
- Annotations transferred from database records, where the annotation is reviewed by a curator should not receive the IEA code. If the source is not traceable and the annotation is worth making, NAS should be used.
Usage of the With/From Column for IEA
At the January 2007 GOC meeting, it was agreed that all annotations made with this evidence code after May 1, 2007 must include an entry in the with/from column indicating which individual sequences, sequence objects, methods, keyword mapping files, etc. are the basis of the annotation. When multiple entries are placed in the with/from field, they are separated by pipes.
... | 2. DB Object ID | 3. DB Object Symbol | 4. Qualifier | 5. GO ID | 6. DB:Reference | 7. Evidence Code | 8. With/From | ...
---|---|---|---|---|---|---|---|---
... | UniProt:A0A7W6 | A0A7W6_9PARI | | GO:0006118 | GOA:interpro\|GO_REF:0000002 | IEA | InterPro:IPR005797 | ...
... | UniProt:A0A7W4 | A0A7W6_9PARI | | GO:0006118 | GOA:spkw\|GO_REF:0000004 | IEA | SP_KW:KW-0496 | ...
... | UniProt:A0K8M1 | A0K8M1_BURCH | | GO:0004830 | GOA:spec\|GO_REF:0000003 | IEA | EC:6.1.1.2 | ...
... | UniProt:A0KAB8 | Y2695_BURCH | | GO:0008237 | GOA:hamap\|GO_REF:0000020 | IEA | HAMAP:MF_00009 | ...
... | UniProt:O77797 | AKAP3_BOVIN | | GO:0009434 | GOA:compara\|GO_REF:0000019 | IEA | Ensembl:ENSMUSP00000093091 | ...
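The with/from requirement for IEA can be checked as sketched below. The sketch assumes that the annotation date is available as a `YYYYMMDD` string, as in gene association files, and that the with/from field is a pipe-separated string; the function names are hypothetical.

```python
from datetime import date, datetime

# The requirement applies to annotations made after May 1, 2007.
REQUIRED_AFTER = date(2007, 5, 1)


def with_from_entries(with_from):
    """Split a with/from field on pipes, dropping empty fragments."""
    return [entry for entry in with_from.split("|") if entry]


def iea_with_from_ok(evidence, with_from, annotation_date):
    """Return False when an IEA annotation made after the cutoff lacks a with/from entry.

    annotation_date is assumed to be a string in YYYYMMDD format.
    """
    if evidence != "IEA":
        return True
    made_on = datetime.strptime(annotation_date, "%Y%m%d").date()
    if made_on <= REQUIRED_AFTER:
        return True  # older annotations are not covered by the requirement
    return bool(with_from_entries(with_from))


# An IEA row with an InterPro entry in with/from, dated after the cutoff.
ok = iea_with_from_ok("IEA", "InterPro:IPR005797", "20070601")
```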
Obsolete Evidence Codes
NR: Not Recorded
Updated November 9, 2007
Used for annotations done before curators began tracking evidence types (appears in some legacy annotations). It may not be used for new annotations.
Note on Usage of the With/from Column
As of April 2007, we are aware that there has been some variability in usage of the with/from column. Some groups have used an annotation in combination with the IDs in the with/from field in the same line to indicate specific interactions that occur in pairwise or other specific combinations, while others have used the with/from field to indicate all interactions with that gene that are described in a paper, without any indication as to whether they occur at the same time or not. This issue has been placed on the agenda of the next GO Consortium meeting for resolution.