!FlyBase readme file for gene_association.fb
!version: $Revision: 1.7 $
!date: $Date: 2014/11/06
!from: FlyBase
!saved-by: gocur@morgan.harvard.edu

1. TABLE OF CONTENTS
================================================================================

1. TABLE OF CONTENTS
2. INTRODUCTION
3. GENE_ASSOCIATION.FB FILE FORMAT
4. METHODS OF GO ANNOTATION
   4.1.1 PUBLISHED LITERATURE
   4.1.2 CONFERENCE ABSTRACTS
   4.1.3 GENBANK RECORDS
   4.1.4 UNIPROT/SWISS-PROT RECORDS
   4.1.5 GENOMIC SEQUENCE DATA
   4.1.6 PERSONAL COMMUNICATIONS
5. ELECTRONIC (IEA) GO ANNOTATION IN FLYBASE
6. USE OF THE ND EVIDENCE CODE IN FLYBASE
7. CONTACT INFORMATION


2. INTRODUCTION
================================================================================

This file provides a brief description of how GO data is captured in FlyBase
and how it is displayed in the gene_association.fb file.

FlyBase is a database of genetic and molecular data for Drosophila. FlyBase
includes data on all species from the family Drosophilidae; the primary species
represented is Drosophila melanogaster. FlyBase is produced by a consortium of
researchers funded by the National Institutes of Health, U.S.A., and the
Medical Research Council, London. This consortium includes both Drosophila
biologists and computer scientists at Harvard University, University of
Cambridge (UK), and Indiana University.

For additional information, please visit the FlyBase web site at
http://flybase.org/.


3. GENE_ASSOCIATION.FB FILE FORMAT
================================================================================

The gene_association.fb file contains GO annotations for Drosophila
melanogaster gene products. The gene_association.fb file uses the standard file
format for gene_association files of the Gene Ontology (GO) Consortium. A more
complete description of the file format is found here:

http://geneontology.org/page/go-annotation-file-gaf-format-20

The following provides a brief description of the columns in the
gene_association files.
Lines beginning 'FB File:' refer specifically to the format in
gene_association.fb.

1: DB
   The database contributing the gene_association file.
   FB File: always "FB" for gene_association.fb.

2: DB_Object_ID
   A unique identifier in the database for the item being annotated.
   FB File: This is always the primary FBgn (FlyBase gene identifier) for a
   Drosophila gene.
   Example: FBgn0000490

3: DB_Object_Symbol
   A (unique and valid) symbol to which the DB_Object_ID is matched.
   FB File: This is always the primary gene symbol for a Drosophila gene.
   Example: dpp

4: Qualifier (this field is optional)
   One or more of 'NOT', 'contributes_to' or 'colocalizes_with' as
   qualifier(s) for a GO annotation. Multiple qualifiers are separated by a
   pipe (|).

5: GO ID
   The unique GO identifier for the GO term attributed to the DB_Object_ID.
   Example: GO:0005160

6: DB:Reference
   The unique identifier for the reference to which the GO annotation is
   attributed.
   FB File: Each FlyBase reference, including published literature, conference
   abstracts, personal communications, sequence records and computer files,
   has a unique 7 digit identifier (an FBrf). Where this reference is a
   published paper with a PubMed identifier, the PubMed ID is also listed in
   column 6, separated from the FBrf with a pipe (|).
   Example: FB:FBrf0136863|PMID:11432817

7: Evidence
   Several types of evidence codes may be used in FlyBase GO annotation:

   Experimental evidence codes (generally preferred):
      IDA (Inferred from Direct Assay)
      IMP (Inferred from Mutant Phenotype)
      IGI (Inferred from Genetic Interaction)
      IPI (Inferred from Physical Interaction)
      IEP (Inferred from Expression Pattern)

   Computational analysis evidence codes:
      ISS (Inferred from Sequence or structural Similarity)
      ISO (Inferred from Sequence Orthology)
      ISA (Inferred from Sequence Alignment)
      ISM (Inferred from Sequence Model)
      IGC (Inferred from Genomic Context)
      IBA (Inferred from Biological aspect of Ancestor)
      IBD (Inferred from Biological aspect of Descendant)
      IKR (Inferred from Key Residues)
      IRD (Inferred from Rapid Divergence)
      RCA (Inferred from Reviewed Computational Analysis)

   Author statement evidence codes (generally avoided for new annotations):
      TAS (Traceable Author Statement)
      NAS (Non-traceable Author Statement)

   Curatorial statement evidence codes:
      IC (Inferred by Curator)
      ND (No biological Data available)

   Automatically assigned evidence:
      IEA (Inferred from Electronic Annotation)

8: With (or) From
   Some evidence codes require or recommend that this column contain an
   appropriate database identifier, e.g. an interacting gene or gene product,
   or a similar sequence. For IC, the GO identifier of the term used as the
   basis of a curator inference is given.
   IGI example: FB:FBgn0261530
   IPI example: UniProtKB:P41046
   ISS example: RGD:69264
   IC example:  GO:0045298
   IEA example: InterPro:IPR000504

9: Aspect
   Which ontology the GO term belongs to: Function (F), Process (P) or
   Component (C).
   Example: P

10: DB_Object_Name
   FB File: The full name of the FlyBase gene. Where a FlyBase gene has no
   full name, this field is left blank.
   Example: decapentaplegic

11: DB_Object_Synonym
   FB File: Alternative names and symbols by which the database object is
   known. Multiple synonyms of a FlyBase gene are separated by a pipe (|).
   Example: BMP|CG9885|DPP|DPP-C|Decapentaplegic|Decapentaplegic/Bone Morphogenetic Protein|Dm-DPP|Dpp|Haplo-insufficient|Hin-d|M(2)23AB|M(2)LS1|TGF-b|TGF-beta|TGFbeta|Tegula|Tg|blink|blk|bone morphogenetic protein|bone morphogenic protein|heldout|ho|l(2)10638|l(2)22Fa|l(2)k17036|shortvein|shv

12: DB_Object_Type
   The type of object being annotated.
   FB File: always "gene" for gene_association.fb.

13: Taxon
   The taxonomic identifier of the species encoding the gene product.
   Example: taxon:7227

14: Date
   The date of last annotation update, in the format 'YYYYMMDD'.
   Example: 20070821
   FB File: FlyBase started to record annotation dates in 2006; only date
   stamps later than 20060803 are accurate.

15: Assigned_by
   The source of the GO annotation.
   FB File: Either FB or UniProtKB (see section 4.1.4).

16: Annotation Extension
   This column is not currently used in FlyBase GO annotation. It could
   contain cross references to other ontologies that can be used to qualify or
   enhance the annotation.

17: Gene Product Form ID
   This column is not currently used in FlyBase GO annotation. It could
   contain an identifier for a specific form of gene product.

A new gene_association.fb file is submitted to the GO consortium for each
FlyBase release. There are currently 5-6 releases of FlyBase per year.


4. METHODS OF GO ANNOTATION
================================================================================

Database Objects
----------------

Currently all GO annotations in FlyBase are attributed to genes. The GO terms
describe the attributes of the products (both RNA and protein) encoded by these
Drosophila genes.

Redundancy in gene_association.fb
---------------------------------

Redundant GO annotations at FlyBase are captured; if two papers show the same
GO data, both sets of GO data will be captured and displayed. Multiple lines of
evidence for a single GO annotation are also captured, since multiple
annotations to the same GO term add to the confidence of the GO annotation.
In addition, if two or more papers show conflicting GO data, all sets of GO
data are recorded. If a conclusion is reached in subsequent references, then GO
terms which are no longer correct will be removed or replaced.

FlyBase makes GO annotations mainly from published primary papers. We no longer
make GO annotations based on conference abstracts or GenBank records, and we
only rarely curate new GO data from reviews and personal communications from
FlyBase users.

4.1.1 PUBLISHED LITERATURE

Literature curation at FlyBase is primarily done by a paper-by-paper approach.
GO curation is one part of this literature curation. The GO curators also
curate literature on a gene-by-gene basis.

4.1.2 CONFERENCE ABSTRACTS

FlyBase no longer curates new GO data based on abstracts from the Annual
Drosophila Research Conference. Existing GO annotations associated with
abstracts will gradually be removed and replaced with data from primary papers
where available.

4.1.3 GENBANK RECORDS

In the past, GO annotations were taken from GenBank records, where the record
lists the function or location of a gene product. These GO annotations were
supported by the NAS evidence code. No new annotations are being added using
the NAS evidence code, and the existing annotations based on GenBank records
will gradually be removed as experimental evidence becomes available.

4.1.4 UNIPROT/SWISS-PROT RECORDS

Swiss-Prot records created before early 2002 were curated for GO based on
information in the 'Comments' field, supported by the NAS evidence code.

4.1.5 GENOMIC SEQUENCE DATA

BLASTP searches are performed on known protein sequences and on protein
sequences predicted from the genomic sequence of Drosophila melanogaster. Genes
are GO-annotated based on sequence similarity to proteins of known function in
Drosophila and/or other organisms. The ID of the similar sequence is entered in
the 'with' column (column 8). This can be a GenBank accession, a UniProt ID or
a gene identifier from a model organism database.
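
As a small illustration of the 'with' column described in 4.1.5, the sketch
below splits a column 8 value into (database, accession) pairs. The function
name split_with_field is hypothetical and is not part of any FlyBase tool. It
assumes that identifiers take the DB:accession form shown in the column 8
examples in section 3, and that multiple identifiers, if present, are separated
by a pipe (|).

```python
def split_with_field(value):
    """Split a column 8 (With/From) value into (database, accession) pairs.

    Hypothetical helper for illustration only. Assumes DB:accession
    identifiers, with multiple identifiers separated by a pipe (|).
    """
    pairs = []
    for item in value.split("|"):
        item = item.strip()
        if not item:
            continue  # an empty column yields no identifiers
        # Split on the first colon only, so "GO:0045298" -> ("GO", "0045298").
        db, _, accession = item.partition(":")
        pairs.append((db, accession))
    return pairs


# Values taken from the column 8 examples in section 3.
print(split_with_field("RGD:69264"))
print(split_with_field("UniProtKB:P41046"))
```

The database prefix can then be used to tell whether the similar sequence comes
from UniProt, a model organism database, or another source.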

4.1.6 PERSONAL COMMUNICATIONS

Personal communications to FlyBase from database users are archived. GO data
is attributed to these communications where applicable.


5. ELECTRONIC (IEA) GO ANNOTATION IN FLYBASE
================================================================================

IEA-supported GO annotations in FlyBase are currently based on a single source:

i. INTERPRO 2 GO MAPPINGS

GO terms are assigned to FlyBase genes through InterPro protein domain
assignments. InterPro protein domains are assigned to FlyBase genes as part of
an ongoing collaboration between UniProt and FlyBase.

InterPro-predicted GO terms that are identical to an existing non-IEA GO
annotation for a FlyBase gene are excluded. InterPro-predicted GO terms that
are a parent of (i.e. less specialized than) an existing non-IEA GO annotation
for a gene are also excluded. All remaining InterPro GO predictions (including
predicted GO terms that are identical to an existing IEA GO annotation for a
given gene) are added into FlyBase, supported by the Inferred from Electronic
Annotation (IEA) evidence code. These annotations are updated for every release
of FlyBase. For further information and corresponding annotations see
FBrf0174215.


6. USE OF THE ND EVIDENCE CODE IN FLYBASE
================================================================================

The Gene Ontology (GO) Consortium created the evidence code "ND" to indicate
"no biological data available". This code is used for annotations to the three
root terms `molecular_function ; GO:0003674', `biological_process ; GO:0008150'
or `cellular_component ; GO:0005575'.

In FlyBase the use of any of these three GO terms, attributed to reference
FBrf0159398 and supported by the ND evidence code, signifies that a curator has
examined the available literature and sequence for this gene and that, as of
the date of the annotation to the root term, there is no information supporting
an annotation to any more specific GO term in that ontology.


7. CONTACT INFORMATION
================================================================================

Questions or comments about this file should be sent to:
gocur@morgan.harvard.edu
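

As a rough illustration of the 17-column layout described in section 3, the
sketch below reads one tab-delimited line into named fields. The column keys,
the function name parse_gaf_line and the sample row are all hypothetical: the
row is assembled from the per-column examples in section 3, with the evidence
code and aspect filled in as placeholders, and is not a real annotation record.
The sketch assumes that columns are tab-separated and that comment lines start
with '!', as in the header of this readme.

```python
# Hypothetical field names for the 17 columns, in file order.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence", "with_from", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]


def parse_gaf_line(line):
    """Return a dict of the 17 columns, or None for a '!' comment line."""
    if line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    # Pad short rows so the unused trailing columns (16, 17) are present.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    record = dict(zip(GAF_COLUMNS, fields))
    # Columns 4, 6 and 11 may hold several pipe-separated values.
    for key in ("qualifier", "db_reference", "db_object_synonym"):
        record[key] = [v for v in record[key].split("|") if v]
    return record


# Synthetic row for illustration only; "IDA" and "F" are placeholders.
sample = "\t".join([
    "FB", "FBgn0000490", "dpp", "", "GO:0005160",
    "FB:FBrf0136863|PMID:11432817", "IDA", "", "F", "decapentaplegic",
    "BMP|CG9885|DPP", "gene", "taxon:7227", "20070821", "FB",
])

record = parse_gaf_line(sample)
print(record["db_object_id"], record["go_id"], record["db_reference"])
# An annotation of the kind described in section 6 would have evidence "ND".
print(record["evidence"] == "ND")
```

Keeping the pipe-separated columns as lists makes it straightforward to pick
out the FBrf or the PubMed ID from column 6 without further string handling.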