!FlyBase readme file for gene_association.fb
!version: $Revision: 1.7 $
!date: $Date: 2014/11/06 $
!from: FlyBase 
!saved-by: gocur@morgan.harvard.edu

1. TABLE OF CONTENTS
================================================================================

1.  	TABLE OF CONTENTS
2.  	INTRODUCTION
3.  	GENE_ASSOCIATION.FB FILE FORMAT
4.	METHODS OF GO ANNOTATION
		4.1.1	PUBLISHED LITERATURE
		4.1.2	CONFERENCE ABSTRACTS
		4.1.3	GENBANK RECORDS
		4.1.4	UNIPROT/SWISS-PROT RECORDS
		4.1.5	GENOMIC SEQUENCE DATA
		4.1.6	PERSONAL COMMUNICATIONS	
5.	ELECTRONIC (IEA) GO ANNOTATION IN FLYBASE
6. 	USE OF THE ND EVIDENCE CODE IN FLYBASE
7.	CONTACT INFORMATION


2. INTRODUCTION
================================================================================

This file provides a brief description of how GO data is captured in
FlyBase and how it is displayed in the gene_association.fb file.

FlyBase is a database of genetic and molecular data for Drosophila.
FlyBase includes data on all species from the family Drosophilidae; the
primary species represented is Drosophila melanogaster. FlyBase is
produced by a consortium of researchers funded by the National
Institutes of Health, U.S.A., and the Medical Research Council, London.
This consortium includes both Drosophila biologists and computer
scientists at Harvard University, University of Cambridge (UK), and Indiana
University.

For additional information, please visit the FlyBase web site at
http://flybase.org/.


3. GENE_ASSOCIATION.FB FILE FORMAT
================================================================================

The gene_association.fb file contains GO annotations for Drosophila melanogaster
gene products. The gene_association.fb file uses the standard file
format for gene_association files of the Gene Ontology (GO) Consortium.
A more complete description of the file format is found here:

http://geneontology.org/page/go-annotation-file-gaf-format-20

The following provides a brief description of the columns in the
gene_association files. Lines beginning 'FB File:' refer specifically
to the format in gene_association.fb.


 1: 	DB
		The database contributing the gene_association file.
		FB File: always "FB" for gene_association.fb.

 2: 	DB_Object_ID
		A unique identifier in the database for the item being annotated.
		FB File: This is always the primary FBgn (FlyBase gene identifier) for a Drosophila gene.	
		Example: FBgn0000490
 
 3: 	DB_Object_Symbol
		A (unique and valid) symbol to which the DB_Object_ID is matched.
		FB File: This is always the primary gene symbol for a Drosophila gene.
		Example: dpp

 4: 	Qualifier (this field is optional)
		One or more of 'NOT', 'contributes_to' or 'colocalizes_with' as qualifier(s) for a GO annotation.
   		Multiple qualifiers are separated by a pipe (|).
		

 5: 	GO ID
		The unique GO identifier for the GO term attributed to the DB_Object_ID.
		Example: GO:0005160

 6: 	DB:Reference
		The unique identifier for the reference to which the GO annotation is attributed.
		FB File: Each FlyBase reference including published literature,
		conference abstracts, personal communications, sequence records and
		computer files has a unique 7 digit identifier (an FBrf). Where this
		reference is a published paper with a PubMed identifier, the PubMed ID
		is also listed in column 6, separated from the FBrf with a pipe (|).
		Example: FB:FBrf0136863|PMID:11432817

 7: 	Evidence
		Several types of evidence codes may be used in FlyBase GO annotation:
		
		Experimental evidence codes (generally preferred):
		IDA (Inferred from Direct Assay)
		IMP (Inferred from Mutant Phenotype)
		IGI (Inferred from Genetic Interaction)
		IPI (Inferred from Physical Interaction)
		IEP (Inferred from Expression Pattern)
		
		Computational analysis evidence codes:
		ISS (Inferred from Sequence or structural Similarity)
		ISO (Inferred from Sequence Orthology)
		ISA (Inferred from Sequence Alignment)
		ISM (Inferred from Sequence Model)
		IGC (Inferred from Genomic Context)
		IBA (Inferred from Biological aspect of Ancestor)
		IBD (Inferred from Biological aspect of Descendant)
		IKR (Inferred from Key Residues)
		IRD (Inferred from Rapid Divergence)
		RCA (Inferred from Reviewed Computational Analysis)
		
		Author statement evidence codes (generally avoided for new annotations):
		TAS (Traceable Author Statement)
		NAS (Non-traceable Author Statement)
		
		Curatorial statement evidence codes:
		IC (Inferred by Curator)
		ND (No biological Data available)
		
		Automatically assigned evidence:
		IEA (Inferred from Electronic Annotation)

 8: 	With (or) From
		Some evidence codes require or recommend that this column contain an appropriate database identifier, e.g. an interacting gene or gene product, or a similar sequence. For IC, the GO identifier of the term used as the basis of a curator inference is given.
		
		IGI example:	FB:FBgn0261530
		IPI example:	UniProtKB:P41046
		ISS example:	RGD:69264
		IC example:	GO:0045298
		IEA example:	InterPro:IPR000504

 9: 	Aspect
		Which ontology the GO term belongs to: Function (F), Process (P) or Component (C).
		Example: P

10: 	DB_Object_Name
 		FB File: The full name of the FlyBase gene. Where a FlyBase gene has no
		full name, this field is left blank.
		Example: decapentaplegic
    	
11: 	DB_Object_Synonym
		FB File: Alternative names and symbols by which the database object is
		known. Multiple synonyms of a FlyBase gene are separated by a pipe (|).
		Example: BMP|CG9885|DPP|DPP-C|Decapentaplegic|Decapentaplegic/Bone Morphogenetic Protein|Dm-DPP|Dpp|Haplo-insufficient|Hin-d|M(2)23AB|M(2)LS1|TGF-b|TGF-beta|TGFbeta|
		Tegula|Tg|blink|blk|bone morphogenetic protein|bone morphogenic protein|heldout|ho|l(2)10638|l(2)22Fa|l(2)k17036|shortvein|shv

12: 	DB_Object_Type
		The type of object being annotated.
		FB File: always "gene" for gene_association.fb.

13: 	Taxon
		The taxonomic identifier of the species encoding the gene product.
		Example: taxon:7227

14: 	Date
		The date of last annotation update, in the format 'YYYYMMDD'. 
		Example: 20070821
		FB File: FlyBase started to record annotation dates in 2006;
		only date stamps later than 20060803 are accurate.

15: 	Assigned_by
		The source of the GO annotation.
		FB File: Either FB or UniProtKB (see section 4.1.4).
		

16: 	Annotation Extension
		This column is not currently used in FlyBase GO annotation. It could contain cross references to other ontologies that can be used to qualify or enhance the annotation.
		
		
17: 	Gene Product Form ID
		This column is not currently used in FlyBase GO annotation. It could contain an identifier for a specific form of gene product.
		
A new gene_association.fb file is submitted to the GO consortium for each FlyBase release. There are currently 5-6 releases of FlyBase per year.
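
As an illustration of the column layout described above, the following
is a minimal Python sketch of a line parser. It assumes the columns are
tab-separated and that lines beginning with '!' are comments, as in the
GO Consortium GAF 2.0 format. The class name, attribute names and
helper functions are illustrative choices, not part of any FlyBase or
GO tool, and the sample line is assembled from the per-column examples
above with an arbitrarily chosen evidence code and aspect; it is not a
real annotation row.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass
class GafRecord:
    """One annotation row; attribute names are illustrative, not official."""
    db: str                     # column 1
    db_object_id: str           # column 2
    db_object_symbol: str       # column 3
    qualifiers: List[str]       # column 4, pipe-separated
    go_id: str                  # column 5
    references: List[str]       # column 6, pipe-separated
    evidence: str               # column 7
    with_from: List[str]        # column 8, assumed pipe-separated
    aspect: str                 # column 9
    db_object_name: str         # column 10
    synonyms: List[str]         # column 11, pipe-separated
    db_object_type: str         # column 12
    taxon: str                  # column 13
    annotation_date: date       # column 14, YYYYMMDD
    assigned_by: str            # column 15
    annotation_extension: str   # column 16, unused in gene_association.fb
    gene_product_form_id: str   # column 17, unused in gene_association.fb


def split_pipe(field: str) -> List[str]:
    """Split a pipe-separated field, dropping empty entries."""
    return [item for item in field.split("|") if item]


def parse_gaf_line(line: str) -> Optional[GafRecord]:
    """Parse one line; return None for blank lines and '!' comment lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("!"):
        return None
    cols = line.split("\t")
    # Pad so that missing trailing columns (16, 17) become empty strings.
    cols += [""] * (17 - len(cols))
    return GafRecord(
        db=cols[0],
        db_object_id=cols[1],
        db_object_symbol=cols[2],
        qualifiers=split_pipe(cols[3]),
        go_id=cols[4],
        references=split_pipe(cols[5]),
        evidence=cols[6],
        with_from=split_pipe(cols[7]),
        aspect=cols[8],
        db_object_name=cols[9],
        synonyms=split_pipe(cols[10]),
        db_object_type=cols[11],
        taxon=cols[12],
        annotation_date=datetime.strptime(cols[13], "%Y%m%d").date(),
        assigned_by=cols[14],
        annotation_extension=cols[15],
        gene_product_form_id=cols[16],
    )


# Illustrative line built from the per-column examples; not a real row.
SAMPLE_LINE = "\t".join([
    "FB", "FBgn0000490", "dpp", "", "GO:0005160",
    "FB:FBrf0136863|PMID:11432817", "IDA", "", "F",
    "decapentaplegic", "DPP|Dpp|shv", "gene", "taxon:7227",
    "20070821", "FB", "", "",
])

record = parse_gaf_line(SAMPLE_LINE)
print(record.db_object_symbol, record.go_id, record.references)
```

Splitting the pipe-separated columns into lists at parse time keeps the
later handling of qualifiers, references and synonyms uniform.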



4. METHODS OF GO ANNOTATION
================================================================================

Database Objects
----------------
Currently all GO annotations in FlyBase are attributed to genes. The GO
terms describe the attributes of the products (both RNA and protein)
encoded by these Drosophila genes.


Redundancy in gene_association.fb
---------------------------------
Redundant GO annotations at FlyBase are captured; if two papers show
the same GO data, both sets of GO data will be captured and displayed.
Multiple lines of evidence for a single GO annotation are also
captured, since multiple annotations to the same GO term add to the
confidence of the GO annotation.

In addition, if two or more papers show conflicting GO data, all sets
of GO data are recorded. If in subsequent references a conclusion is
reached, then GO terms which are no longer correct will be removed or
replaced.
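
Because redundant annotations are kept, a consumer of the file may want
to see how many independent lines of evidence support each gene/GO term
pair. The following Python sketch shows one way to do that. The
function name and the tuple layout are assumptions made for this
example, and the identifiers in the sample rows are made-up
placeholders, not real FlyBase records.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# A row is (gene_id, go_id, reference, evidence_code).
Row = Tuple[str, str, str, str]


def group_evidence(rows: Iterable[Row]) -> Dict[Tuple[str, str], Set[Tuple[str, str]]]:
    """Collect the distinct (reference, evidence) pairs per (gene, GO term)."""
    support = defaultdict(set)
    for gene_id, go_id, reference, evidence in rows:
        support[(gene_id, go_id)].add((reference, evidence))
    return dict(support)


# Placeholder rows: two papers support the same term for one gene.
EXAMPLE_ROWS = [
    ("gene_1", "term_A", "paper_1", "IDA"),
    ("gene_1", "term_A", "paper_2", "IMP"),
    ("gene_1", "term_B", "paper_1", "IDA"),
]

for (gene_id, go_id), evidence_set in sorted(group_evidence(EXAMPLE_ROWS).items()):
    print(gene_id, go_id, len(evidence_set))
```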

FlyBase makes GO annotations mainly from published primary papers.
We no longer make GO annotations based on conference abstracts or
GenBank records, and we only rarely curate new GO data from reviews and
personal communications from FlyBase users.

4.1.1	PUBLISHED LITERATURE

Literature curation at FlyBase is primarily done by a paper-by-paper
approach. GO curation is one part of this literature curation.
The GO curators also curate literature on a gene-by-gene basis.


4.1.2	CONFERENCE ABSTRACTS

FlyBase no longer curates new GO data based on abstracts from the
Annual Drosophila Research Conference. Existing GO annotations
associated with abstracts will gradually be removed and replaced with
data from primary papers where available.


4.1.3	GENBANK RECORDS

In the past, GO annotations were taken from GenBank records, where the
record lists the function or location of a gene product. These GO
annotations were supported by the NAS evidence code. No new annotations
are being added using the NAS evidence code and the existing
annotations based on GenBank records will gradually be removed as
experimental evidence becomes available.


4.1.4	UNIPROT/SWISS-PROT RECORDS

Swiss-Prot records created before early 2002 were curated for GO
based on information in the 'Comments' field, supported by the NAS
evidence code.


4.1.5	GENOMIC SEQUENCE DATA

BLASTP searches are performed on known protein sequences and protein
sequences predicted based on the genomic sequence of the Drosophila
melanogaster genome. Genes are GO-annotated based on sequence
similarity to proteins of known function in Drosophila and/or other
organisms. The ID of the similar sequence is entered in the 'With (or)
From' column (column 8). This can be a GenBank accession, a UniProt ID
or a gene identifier from a model organism database.
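
Since the identifier of the similar sequence can come from several
databases, it can be useful to separate the database prefix from the
accession. The following Python sketch does this for identifiers written
as 'DB:accession', as in the column 8 examples in section 3. The
function name is an assumption made for this example, and the handling
of several pipe-separated identifiers in one field is an assumption
based on the GAF format rather than on anything stated here.

```python
from typing import List, Tuple


def split_with_from(field: str) -> List[Tuple[str, str]]:
    """Return (database, accession) pairs from a With (or) From field."""
    pairs = []
    for entry in field.split("|"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first colon only, so accessions may contain colons.
        database, separator, accession = entry.partition(":")
        if not separator:
            # No prefix present: keep the whole entry as the accession.
            database, accession = "", entry
        pairs.append((database, accession))
    return pairs


print(split_with_from("RGD:69264"))
print(split_with_from("UniProtKB:P41046|FB:FBgn0261530"))
```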


4.1.6	PERSONAL COMMUNICATIONS

Personal communications to FlyBase from the database users are
archived. GO data is attributed to these communications where
applicable.


5. ELECTRONIC (IEA) GO ANNOTATION IN FLYBASE
================================================================================

IEA-supported GO annotations in FlyBase are currently based on a single source:

i. INTERPRO 2 GO MAPPINGS

GO terms are assigned to FlyBase genes through InterPro protein domain
assignments. InterPro protein domains are assigned to FlyBase genes as
part of an ongoing collaboration between UniProt and FlyBase.

InterPro-predicted GO terms that are identical to an existing non-IEA GO annotation 
for a FlyBase gene are excluded. InterPro-predicted GO terms that are a parent of 
(i.e. less specialized than) an existing non-IEA GO annotation for a gene are also excluded. 
All remaining InterPro GO predictions (including predicted GO terms
that are identical to an existing IEA GO annotation for a given gene)
are added into FlyBase, supported by the Inferred from Electronic
Annotation (IEA) evidence code.

These annotations are updated for every release of FlyBase.
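
The exclusion rule described above can be sketched in Python as follows.
This is an illustration only, not the FlyBase pipeline: the function
names, the dictionary-based ontology and the term names are all
hypothetical, and "parent" is interpreted here as any less specialized
ancestor of an existing non-IEA term.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def filter_predictions(predicted: Set[str], non_iea: Set[str],
                       parents: Dict[str, Set[str]]) -> Set[str]:
    """Drop predictions identical to, or less specialized than, non-IEA terms."""
    excluded = set(non_iea)
    for term in non_iea:
        excluded |= ancestors(term, parents)
    return {term for term in predicted if term not in excluded}


# Toy ontology with hypothetical term names (child -> set of parents).
TOY_PARENTS = {
    "specific_binding": {"binding"},
    "binding": {"root"},
    "kinase_activity": {"root"},
}

kept = filter_predictions(
    predicted={"binding", "specific_binding", "kinase_activity"},
    non_iea={"specific_binding"},
    parents=TOY_PARENTS,
)
print(sorted(kept))
```

In this toy case "specific_binding" is dropped because it duplicates an
existing non-IEA annotation and "binding" is dropped because it is less
specialized than one, so only "kinase_activity" is kept.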

For further information and corresponding annotations see FBrf0174215.


6. USE OF THE ND EVIDENCE CODE IN FLYBASE
================================================================================

The Gene Ontology (GO) Consortium created the evidence code "ND" to
indicate "no biological data available". This code is used for
annotations to the three root terms `molecular_function ;
GO:0003674', `biological_process ; GO:0008150' or
`cellular_component ; GO:0005575'. In FlyBase the use of any of
these three GO terms, attributed to reference FBrf0159398 and supported
by the ND evidence code, signifies that a curator has examined the
available literature and sequence for this gene and that, as of the date
of the annotation to the root term, there is no information
supporting an annotation to any GO term in that ontology.
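
As an illustration, the following Python sketch lists, for each gene,
the ontologies (aspects) for which only an ND annotation has been
recorded against FBrf0159398. The function name and the tuple layout
are assumptions made for this example, the reference is assumed to
appear in column 6 as 'FB:FBrf0159398' by analogy with the column 6
example in section 3, and the gene identifiers in the sample rows are
made-up placeholders.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed column 6 form of the reference used for ND annotations.
ND_REFERENCE = "FB:FBrf0159398"

# A row is (gene_id, aspect, evidence_code, reference_field).
Row = Tuple[str, str, str, str]


def aspects_without_data(rows: Iterable[Row]) -> Dict[str, Set[str]]:
    """Map each gene to the aspects (F, P, C) that carry an ND annotation."""
    result: Dict[str, Set[str]] = {}
    for gene_id, aspect, evidence, reference_field in rows:
        if evidence == "ND" and ND_REFERENCE in reference_field.split("|"):
            result.setdefault(gene_id, set()).add(aspect)
    return result


# Placeholder rows; the gene identifiers and second reference are not real.
EXAMPLE_ROWS = [
    ("gene_1", "F", "ND", "FB:FBrf0159398"),
    ("gene_1", "C", "ND", "FB:FBrf0159398"),
    ("gene_1", "P", "IMP", "FB:paper_1"),
    ("gene_2", "P", "IDA", "FB:paper_2"),
]

print(aspects_without_data(EXAMPLE_ROWS))
```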



7. CONTACT INFORMATION
================================================================================

Questions or comments about this file should be sent to:
gocur@morgan.harvard.edu