GOA  README
-----------

1.  Contents
------------

1.  Contents
2.  Introduction
3.  List of files and file formats
4.  SWISS-PROT/TrEMBL/Ensembl non-redundant human proteome set
5.  IPI
6.  Ancillary mappings
7.  Assignment of GO terms to SWISS-PROT/TrEMBL/Ensembl data
8.  Addition of GO assignments from other data sources
9.  Contacts
10. Copyright Notice

2.  Introduction
----------------

GOA (GO Annotation@EBI) is a project run by the European Bioinformatics
Institute that aims to provide assignments of gene products to the Gene
Ontology (GO) resource.  The goal of the Gene Ontology Consortium is to 
produce a dynamic controlled vocabulary that can be applied to all eukaryotes, 
even while knowledge of gene and protein roles in cells is still accumulating and 
changing.  In the GOA project, this vocabulary will be applied to a
non-redundant set of proteins described in the SWISS-PROT, TrEMBL and Ensembl
databases that collectively provide complete proteomes for Homo sapiens and other
organisms.

In the first stage of this project, GO assignments have been applied to a 
data set representing the human proteome by a combination of electronic 
mappings and manual curation. Subsequently GO assignments for all complete 
proteomes will be provided at this site. 

For further information please refer to our web site at:
http://www.ebi.ac.uk/GOA

3.  List of files and file formats
----------------------------------

The GOA project produces two gene_association files:

i) gene_association.goa_human

   Locations: http://www.geneontology.org/gene-associations/gene_association.goa
              ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz


This file contains the GO assignments for the proteins of the 
non-redundant human proteome set. 


ii) gene_association.goa_sptr

   Locations: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/SPTR/gene_association.goa_sptr.gz

This file contains all GO assignments for the SWISS-PROT and TrEMBL databases.

We have complied with the file format described by the Gene Ontology Consortium
for annotation files (http://www.geneontology.org/GO.annotation.html#file).
Since we deal with proteins rather than genes, the semantics of some fields
differ slightly from those in other gene association files.

1.  DB 
    Database from which annotated entry has been taken.
    One of either SPTR (SWISS-PROT TrEMBL) or ENSEMBL (Ensembl).
    Example: SPTR
    
2.  DB_Object_ID
    A unique identifier in the DB for the item being annotated.
    Here: Accession number or identifier of the annotated protein.
    Either a SWISS-PROT/TrEMBL accession number or an Ensembl peptide ID.
    Example: O00165
    
3.  DB_Object_Symbol
    A (unique and valid) symbol to which DB_Object_ID is matched.
    Here: SWISS-PROT/TrEMBL entry name or Ensembl peptide ID.  
    Example: HAX1_HUMAN
    
4.  NOT
    Here: Currently not applicable, always empty.
    
5.  GOid
    The GO identifier for the term attributed to the DB_Object_ID.
    Example: GO:0005625
    
6.  DB:Reference
    Reference cited to support the attribution.
    See section 7 for an explanation of the reference types used.
    Examples: PUBMED:9058808, GOA:interpro, GOA:spkw, GOA:spec. 
    
7.  Evidence
    One of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, NR, E, P or ND.
    Example: TAS
    
8.  With
    Example: SPTR:O00341
    
9.  Aspect
    One of the three ontologies: P (biological process), F (molecular function) 
    or C (cellular component).
    Example: P
    
10. DB_Object_Name
    Name of gene or gene product
    Here: Either empty (for Ensembl peptides) or the abbreviated
    description line (for SWISS-PROT and TrEMBL entries).
    Example: HS1-associated protein X-1
    
11. Synonym
    Gene_symbol [or other text]
    Here: International Protein Index identifier (section 5).  
    Example: IPI00010440
    
12. DB_Object_Type
    What kind of entity is being annotated.
    Here: always 'protein'
    Example: protein
    
13. Taxon_ID
    Identifier for the species being annotated.
    Here: always 'taxon:9606' for human proteins.
    Example: taxon:9606

14. Date
    The date of last annotation update in the format 'YYYYMMDD'.
    Example: 20030228

15. Assigned_By
    Attribute describing the source of the annotation.  One of either SPTR, MGI, SGD, FB.
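
The 15 fields above can be read with a few lines of Python.  The following
is a minimal sketch, not an official GOA tool.  It assumes the columns are
tab-separated and that comment lines start with '!', as in the GO annotation
file format; the field names and the helper functions are our own
illustrative choices.  The sample line is stitched together from the
per-field examples above and is not a real annotation.

```python
from collections import namedtuple
from datetime import datetime

# Illustrative field names (assumed, not defined by GOA), in column order.
FIELDS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon_id", "date", "assigned_by",
]
Annotation = namedtuple("Annotation", FIELDS)


def parse_line(line):
    """Parse one gene_association line; return None for '!' comment lines."""
    if line.startswith("!"):
        return None
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(columns)))
    return Annotation(*columns)


def annotation_date(record):
    """Convert the 'YYYYMMDD' date field to a datetime.date."""
    return datetime.strptime(record.date, "%Y%m%d").date()


# Synthetic line built from the field examples in this section.
sample = ("SPTR\tO00165\tHAX1_HUMAN\t\tGO:0005625\tPUBMED:9058808\tTAS\t\tP\t"
          "HS1-associated protein X-1\tIPI00010440\tprotein\ttaxon:9606\t"
          "20030228\tSPTR\n")
record = parse_line(sample)
print(record.db_object_id, record.go_id, annotation_date(record))
```

Empty fields (such as NOT and With in the sample) are kept as empty strings
so that column positions are preserved.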

iii) xrefs.goa

    Location: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/xrefs.goa

In addition to the principal file with mappings of SWISS-PROT/TrEMBL/Ensembl to GO, 
a file has been prepared describing the relationship 
between the entries in this set and other databases, such as the 
EMBL/Genbank/DDBJ nucleotide sequence databases, HUGO, and LocusLink and 
RefSeq at the NCBI.  This file is tab-delimited (multiple entries in 
individual fields are separated by commas). The fields are as follows:

1.  Database from which annotated entry has been taken.
    One of either SP (SWISS-PROT), TR (TrEMBL) or ENSEMBL (Ensembl).
2.  SWISS-PROT or TrEMBL accession number or Ensembl ID.
3.  International Protein Index identifier (see section 5).  
4.  Supplementary SWISS-PROT/TrEMBL entries associated with this IPI entry
5.  Supplementary Ensembl entries associated with this IPI entry
6.  RefSeq NP sequences associated with this IPI entry
7.  RefSeq XP sequences associated with this IPI entry
8.  Protein identifiers (cross reference to EMBL/Genbank/DDBJ nucleotide
                         databases)
9.  HUGO HGNC number, HUGO official gene symbol
10. NCBI LocusLink loci number, LocusLink Provisional Gene Symbol (replaced 
                                 by HUGO Official gene symbol where available)
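
As a rough illustration of the layout just described (tab-separated fields,
comma-separated values within a field), a line of xrefs.goa could be split
as follows.  This is only a sketch: the field names are our own labels, the
function is hypothetical, and the identifiers ENSP_A and ENSP_B in the
sample line are placeholders rather than real Ensembl IDs.

```python
# Illustrative labels (assumed) for the 10 fields, in file order.
XREF_FIELDS = [
    "db", "accession", "ipi", "extra_sptr", "extra_ensembl",
    "refseq_np", "refseq_xp", "protein_ids", "hugo", "locuslink",
]


def parse_xrefs_line(line):
    """Return a dict mapping each field label to a list of its values."""
    columns = line.rstrip("\r\n").split("\t")
    # Pad short lines so that every label is present in the result.
    columns += [""] * (len(XREF_FIELDS) - len(columns))
    row = {}
    for name, column in zip(XREF_FIELDS, columns):
        # Multiple entries within one field are separated by commas.
        row[name] = [value.strip() for value in column.split(",") if value.strip()]
    return row


# Placeholder line: only the first five fields are filled in.
sample_xref = "SP\tO00165\tIPI00010440\t\tENSP_A,ENSP_B\t\t\t\t\t\n"
row = parse_xrefs_line(sample_xref)
print(row["accession"], row["ipi"], row["extra_ensembl"])
```

Every field is returned as a list, so single-valued and multi-valued fields
can be handled in the same way.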


4.  The SWISS-PROT/TrEMBL/Ensembl non-redundant human proteome set
------------------------------------------------------------------

SWISS-PROT is a high quality, manually curated database of protein sequences. 
Its automatically produced supplement, TrEMBL, contains all protein sequences 
that can be predicted from the EMBL nucleotide sequence database and that have 
not yet been annotated to SWISS-PROT standards.  Together, SWISS-PROT and 
TrEMBL (SPTR) offer a complete picture of the protein world.  It is a 
priority of the SWISS-PROT team to merge redundant entries: SPTR is 
completely non-redundant at the sequence level. However, when alternative 
sequences are submitted for the same gene, there is a delay before such 
entries are identified and merged. 

The aim of the Ensembl group is to develop a software system which produces 
and maintains automatic annotation of eukaryotic genomes.  There may be 
protein coding sequences that have not yet been annotated in the 
nucleotide sequence databases, but whose existence has nonetheless been 
predicted by Ensembl.  

The SWISS-PROT/TrEMBL/Ensembl non-redundant human proteome set is obtained by
combining (i) all SWISS-PROT entries for Homo sapiens (ii) all TrEMBL entries
that have been mapped to a given gene locus (up to a maximum of 1 TrEMBL entry
per locus) (iii) all additional TrEMBL entries that are non-redundant (at a 
95% similarity threshold) with any entries already included in the set and (iv)
all additional Ensembl entries that are similarly non-redundant with any of the
entries that are already included.  The set is updated weekly following each
release of SPTR.  It is this set of non-redundant protein sequences that will 
be annotated with the vocabulary defined by the GO consortium.
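
The four-step construction above can be sketched as a greedy selection.
This is a toy illustration only and not the procedure used at the EBI: the
similarity measure (Python's difflib ratio) is a stand-in for a real
sequence comparison, the choice of the first TrEMBL entry per locus is an
assumption, and treating "similarity >= 0.95" as redundant is our reading
of the 95% threshold.  All accessions and sequences below are made up.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.95  # assumed interpretation of the 95% similarity threshold


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; not a real alignment method."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_redundant(seq, chosen):
    """True if seq is at least THRESHOLD similar to any chosen sequence."""
    return any(similarity(seq, other) >= THRESHOLD for other in chosen.values())


def build_set(swissprot, trembl_by_locus, trembl_other, ensembl):
    """Combine the sources in the order (i) to (iv) described above."""
    chosen = dict(swissprot)                      # (i) all SWISS-PROT entries
    for entries in trembl_by_locus.values():      # (ii) at most 1 per locus
        if entries:
            accession, seq = entries[0]           # assumption: take the first
            chosen[accession] = seq
    for source in (trembl_other, ensembl):        # (iii) TrEMBL, then (iv) Ensembl
        for accession, seq in source.items():
            if not is_redundant(seq, chosen):
                chosen[accession] = seq
    return chosen


# Made-up toy data.
sp = {"P1": "AAAAAAAAAA"}
by_locus = {"L1": [("Q1", "CCCCCCCCCC"), ("Q2", "CCCCCCCCCD")]}
other = {"Q3": "AAAAAAAAAA", "Q4": "GGGGGGGGGG"}
ens = {"ENSP1": "GGGGGGGGGG", "ENSP2": "TTTTTTTTTT"}
print(sorted(build_set(sp, by_locus, other, ens)))
```

In the toy data, Q3 duplicates a SWISS-PROT sequence and ENSP1 duplicates a
TrEMBL sequence already in the set, so neither is added.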

For more information see http://www.ebi.ac.uk/proteome/CPhelp.html.  The set 
can be downloaded (in FASTA format) via 
http://www.ebi.ac.uk/proteome/HUMAN/download.html or taken directly from the 
EBI ftp site at ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/fasta_files/proteomes/9606.SPEns.FASTAC
    
5.  IPI    
-------

The IPI (International Protein Index) provides a top-level overview of the 
main databases that describe the human proteome: SWISS-PROT, TrEMBL, Ensembl,
and NCBI's RefSeq databases.  IPI assigns stable identifiers to clusters of   
matching proteins from its contributing databases. IPI is updated monthly.  For
more details see http://www.ebi.ac.uk/IPI/IPIhelp.html

The SWISS-PROT/TrEMBL/Ensembl proteome set contains 1 protein for each IPI 
entry associated with a SWISS-PROT, TrEMBL or Ensembl entry.  Other sequences
from these databases and RefSeq also associated with these IPI entries are 
listed in the xrefs.goa file (see section 3).

6.  Ancillary mappings
----------------------

Mappings between SPTR and EMBL/Genbank/DDBJ are derived from the cross 
references to these databases found in SWISS-PROT and TrEMBL entries.
Mappings between SPTR and HUGO, LocusLink and RefSeq are derived from various 
publicly available sources of information that allow the electronic 
tracking of identifiers between databases. Contentious or contradictory data 
is referred to a curator for judgement.

7.  Assignment of GO terms to SWISS-PROT/TrEMBL/Ensembl data
------------------------------------------------------------

In this release, we use four data sources to assign GO terms to proteins.

A) PUBMED:nnnnnnnn
Curators have read the abstract or full paper with the PubMed identifier
nnnnnnnn and assigned the GO terms manually.  Where a journal is not indexed
by PubMed, an internal identifier is provided instead, e.g. PBTnnnnnnnn.
Please contact goa@ebi.ac.uk for details.

B) GOA:interpro
Transitive assignment using InterPro matches.
In detail, the protein in question has one or more InterPro matches.
The InterPro domain or family is assigned to the corresponding
GO term using the interpro2go file.

C) GOA:spkw
Transitive assignment using SWISS-PROT keywords. 
In detail, the protein in question is from SWISS-PROT or TrEMBL and
has a SWISS-PROT keyword. This keyword is assigned to the 
corresponding GO term using the spkw2go file.

D) GOA:spec
Transitive assignment using enzyme codes.
In detail, the protein in question is from SWISS-PROT or TrEMBL and
has an Enzyme Commission number in its description line. This EC
number is then assigned to the corresponding GO term using the
EC crossreferences in the GO ontology files.

The files interpro2go and spkw2go are found at 
http://www.geneontology.org/index.html#classification. 
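
The transitive assignments B) to D) share one pattern: a protein carries a
feature (an InterPro match, a keyword or an EC number), and a mapping file
links that feature to GO terms.  A minimal sketch of this pattern follows.
It does not parse the real interpro2go or spkw2go files; the mapping is
passed in as a plain dictionary, and every feature identifier and GO
identifier in the example is a placeholder, not a real mapping.

```python
def transitive_assignments(protein_features, feature_to_go, reference):
    """Derive (accession, GO id, reference) rows via a feature-to-GO mapping.

    protein_features: accession -> list of feature identifiers
    feature_to_go:    feature identifier -> list of GO identifiers
    reference:        value for the DB:Reference field, e.g. 'GOA:interpro'
    """
    rows = set()  # a set removes duplicates when two features map to one term
    for accession, features in protein_features.items():
        for feature in features:
            for go_id in feature_to_go.get(feature, ()):
                rows.add((accession, go_id, reference))
    return sorted(rows)


# Placeholder data: these are NOT real InterPro entries or GO mappings.
matches = {"O00165": ["IPR_X", "IPR_Y"], "Q_NONE": ["IPR_Z"]}
mapping = {"IPR_X": ["GO:0000001"], "IPR_Y": ["GO:0000001", "GO:0000002"]}

for row in transitive_assignments(matches, mapping, "GOA:interpro"):
    print("\t".join(row))
```

The same function could be reused for keywords or EC numbers by passing a
different mapping and the reference GOA:spkw or GOA:spec.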

8.  Addition of GO assignments from other data sources
-------------------------------------------------------

GO terms have already been assigned to a certain proportion of the human
proteome by Proteome Inc.  This data has been incorporated into the
classifications described in the gene_association.goa files.

9.  Contacts
------------

Please direct any questions to goa@ebi.ac.uk.  We welcome any feedback.

10. Copyright Notice
--------------------
 
GOA - GO Annotation@EBI
Copyright 2002 (C) The European Bioinformatics Institute.
This README and the accompanying databases may be copied and
redistributed freely, without advance permission, provided that this
copyright statement is reproduced with each copy.

$Date: 2003/03/03 16:51:52 $