GO Database Guide

What is the GO Database?

The GO database is a relational database comprising the GO ontologies and the annotations of genes and gene products to terms in the GO. The advantage of housing both the ontologies and annotations in a single database is that powerful queries can be performed over annotations using the ontology. The GO database is the source of the data accessible through the AmiGO browser and search engine.

The GO database is built from source data at regular intervals, and is currently maintained a MySQL database. GO database builds can be downloaded and installed on your local machine or queried remotely.

For a good online introduction to databases in bioinformatics, we highly recommend Relational Databases for Biologists.

Data in the GO database

The GO database is generated from the most recent version of the ontology and the annotation files contributed by members of the GO Consortium. There are a number of different sets of data available for download.

GO database builds and update frequency
Build name	Contents	Updated	Downloads
termdb	ontologies, definitions and mappings to other dbs	daily	MySQL OBO XML OWL RDF XML SQL
assocdb	termdb (above); all manual gene product annotations; electronic annotations (IEA) from all databases other than UniProtKB ^[1]	weekly	All files can be found in the latest-lite directory in the GO database archives; files are named as follows: MySQL: go_`YYYYMMDD`-assocdb-tables.tar.gz SQL: go_`YYYYMMDD`-assocdb-data.gz OBO XML: go_`YYYYMMDD`-assocdb.obo-xml.gz RDF XML: go_`YYYYMMDD`-assocdb.rdf-xml.gz
seqdb	assocdb (above), plus protein sequences for most of the gene products	weekly	Files are in the latest-lite directory; they are named as follows: MySQL: go_`YYYYMMDD`-seqdb-tables.tar.gz SQL: go_`YYYYMMDD`-seqdb-data.gz
full GO database	termdb (above), plus manual and electronically generated (IEA) annotations	monthly	Files are in the latest-full directory; they are named as follows: MySQL: go_`YYYYMM`-assocdb-tables.tar.gz SQL: go_`YYYYMM`-assocdb-data.gz

[1] The majority of the sequences and IEA [electronic] annotations from UniProtKB are omitted from the weekly database builds due to the large size of the data set. For those species with a dedicated authoritative database group, such as Drosophila, mouse or Saccharomyces, UniProtKB annotations are collected and submitted by the dedicated group, and hence the UniProtKB IEA annotations for these species do appear in the GO database. Please see the annotation policies on avoiding redundancy for further information on authoritative database groups.

Back to top

Database Availability

Online access

The AmiGO browser and search engine provides web browser-based access to the GO database. As well as allowing users to search, browse, and download terms and annotations, AmiGO has analysis tools for further data processing.

Those familiar with MySQL can use GOOSE, the GO Online SQL Environment, a web interface for running SQL queries against the GO database. There are a number of sample queries for those unfamiliar with MySQL query syntax.

GO Database Mirrors

There are several sites that provide mirrors of the GO database accessible through a MySQL client (see the SQL section below for more on SQL queries of the GO database).

European Bioinformatics Institute (EBI)

The EBI mirror is of the full GO database: ontology data and all associations, both manual and electronic.

Connection parameters for the GO database mirror at the EBI
Parameter	Value
host	mysql.ebi.ac.uk
user	go_select
password	amigo
database	go_latest
port	4085

Ensembl

Ensembl provide builds of the full GO database going back several years. The GO databases are typically named ensembl_go_[ENSEMBL_BUILD_NO].

Connection parameters for the Ensembl GO database mirror
Parameter	Value
host	ensembldb.ensembl.org
user	anonymous

To connect to a database from a MySQL command line client, type the following:

mysql -h[host] -u[user] -p[password] -P[port] [database-name]

For example:

mysql -hmysql.ebi.ac.uk -ugo_select -pamigo -P4085 go_latest

Once connected you can use the SHOW DATABASES command to see available databases. To see when the database was built, use the command SELECT * FROM instance_data.

Access Policies

The GO database server and mirrors are a shared resource and thus we require data mining to be performed in a manner that allows others to utilize this resource at the same time. Any activity that mines the GO database using AmiGO or GOOSE must be controlled so that only one request is issued at a time. Users with greater needs than this should install a local copy of the database.

Local installation from pre-built database dumps

The GO database can be downloaded as a MySQL database dump and reconstituted on any system where MySQL is running. The data is also available as SQL statements for use with other database management systems; unfortunately we can only supply minimal support for non-MySQL databases.

Downloads and further details can be found on the GO database downloads page. Older builds of the database are available from the GO archives.

Build your own

You can create your own instance of a GO database, either by building one de novo, or by augmenting an existing build, e.g. by loading your own annotations or by adding other OBO ontologies. The go-dev software set contains the code and scripts required for these processes.

go-dev is a set of GO-related software and scripts, created and maintained by members of the GO Consortium. It can be downloaded via SVN. You may also need the gobo-dbic package for some applications.

To build a GO database from scratch, you will need a MySQL server, and the GO database SQL source, found in the directory sql of the go-dev software kit.

To load ontologies or data from OBO files or gene association files, you will need to install go-perl and go-db-perl (found in go-dev) and use the load scripts there.

Local GO mirror

Another way of creating your own instance of the GO database is by using a script provided in the go-dev distribution for creating GO mirrors: go_db_install.pl. Usage and examples are given by:

go-dev/trunk/go-db-perl/scripts/go_db_install.pl -h

For example, the following command loads the latest database dump into a database called "go_latest" on localhost:

go-dev/trunk/go-db-perl/scripts/go_db_install.pl -v -d localhost

This example will load the latest lite database dump into a database called "go_latest_lite" on localhost:

go-dev/trunk/go-db-perl/scripts/go_db_install.pl -i -e go_latest_lite -v -d localhost

This method is also useful for cron jobs. Please note that this method also creates a time stamp database with a suffix identical to the database name that you are loading.

Back to top

Querying the Database

Querying via AmiGO

The most common way to query the database is via the AmiGO browser interface. The Advanced Query provides a reasonably flexible way to query the database, but far more powerful queries can be executed using SQL.

Querying in SQL

You can query the GO database using the SQL query language; either download and install the GO MySQL dump to query your local copy, or connect to one of the GO database mirror nodes. MySQL queries can be performed through the MySQL command line client, or by using software such as SquirrelSQL or MySQL Query Browser, one of the GUI tools from available from mysql.com.

If you are unsure as to how to construct your query, the best place to start is the example queries page on the GO wiki.

Querying via perl

You can connect to a local or remote MySQL installation using the API provided by go-perl and go-db-perl, part of the go-dev software set. The API can be used to get terms, subgraphs and annotated gene products, as well as to perform more complex analyses.

You can also use the perl API interactively through a perl shell, as follows (substituting connection details as appropriate):

GOshell.pl -d go_latest -dbuser go_select -dbauth amigo -port 4085 -h mysql.ebi.ac.uk

Type help for more instructions.

Querying via java

The GHOUL (GO Hibernate Object Layer) library provides an API for accessing GO through java calls and HQL queries.

The code is avaliable from the GO sourceforge project in the svn repository can be browsed here.

Type help for more instructions.

Alternate means of querying

There are also a few less conventional means of querying the database programmatically.

XML via DBStag

Nested XML can be automatically extracted from the database using DBStag. A number of SQL templates are available in the stag-templates directory in CVS.

SPARQL Endpoint

We have an experimental SPARQL Endpoint for querying an RDF view of the GO database. This service is highly experimental and not fully supported; if it proves useful it may be supported in future. Comments are welcome!

Back to top

Database Schema

A relational schema specifies a collection of table definitions, providing structure for the data housed in the database instance. The schema for the GO database consists of tables for storing the terms and structure (as a graph) of the GO ontologies, along with gene product and annotation data.

Autodoc

See the automatically generated GO schema documentation.

The schema is partitioned into different modules or sub-schemas. This is purely a documentation convention. The go-graph module is for housing the ontology; the central tables are term and term2term. Annotations are stored in the go-association module; the main tables are association and gene_product. The GO database can also be enhanced with views; these are in the directory go-dev/sql/view. Autogenerated documentation is available for all tables and all views.

Schema Diagram (ER)

The database structure can also be seen in the entity-relationship diagram shown below; please note that the database schema used is dated June 2009.

View a larger version

Additional notes on the schema

Primary and foreign keys

As a convention, all tables in the schema use the column named id as the primary key, and foreign key columns are all named reference_id. All foreign keys are explicitly declared in the schema; however, MySQL drops these. This means you should always consult the documentation here, rather than relying on the MySQL DESCRIBE TABLES command or the CREATE TABLE commands in the database dump. The GO database schema page allows you to traverse primary and foreign key links via internal html page links.

All keys are numeric surrogate identifiers. They are meaningless, not consistent between builds, and bear no correspondence to public identifiers such as GO:0008050.

Terms as nodes in a graph

The central concept in OBO style ontologies and in the GO database is the graph. The terms are nodes in the graph, and the relationships between them are arcs (see the ontology structure documentation for more information). These is handled by the tables term and term2term.

The terms constituting the nodes in an ontology graph represent the kinds of entities that exist within the domain of that ontology. The edges in the graph represent the relations that hold between these entities. The edge types or relations in GO are drawn from the OBO Relation Ontology.

Note that these edge types or relations are also stored in the term table. This allows us to reuse the same schema structures, and potentially allows us to have hierarchies of relations, which may be required in the future.

Traversing Graphs

When performing ontology-oriented queries, it is often necessary to perform some kind of graph traversal. It is possible to use the term2term table to iterate through the graph, but this requires multiple database calls as MySQL does not support transitive operators such as Oracle's CONNECT BY. Most implementations of SQL do not support the kind of recursive querying required to answer queries such as "find all DNA binding genes".

This kind of query is possible with the GO database because we precompute the path from every node to all of its ancestors. This is known as the transitive closure of a relationship. This goes in the graph_path table, which also holds the distance between terms.

In particular, we calculate the reflexive transitive closure, which means that every term is related to itself (the distance between the terms is zero). In practical terms, this makes it easier to write queries such as "find all DNA binding genes", because such queries should return genes annotated directly to DNA binding, as well as to its children.

Note that we can use the same table to query for descendants as well as ancestors; it is simply a matter of switching around term1 and term2 in the graph_path query constraints.

The diagram below shows an example of the reflexive transitive closure of cellular pigment accumulation and its ancestors. The solid lines indicate direct is a relationships (stored in the term2term table); the dotted lines indicate the implied ancestral relationships (i.e. the closure), stored in the graph_path table.

Transitive closure of cellular pigment accumulation and its ancestors.

Schema Source

The SQL source for the schema is maintained in go-dev, in the sql directory; see the sql/modules directory for the SQL DDL.

Back to top

Finding out more

Contact

If you have specific questions, either technical or content-related, regarding the database, please contact the GO helpdesk.

The GO software group mailing list often hosts discussions of the database and related software. New list members must be approved before joining, but all conversations on the list are archived and can be viewed without a subscription.

Frequently Asked Questions

See the FAQ on the GO wiki.

Future extensions

There are a number of extensions to the schema planned over the coming years. These changes will be introduced in a way that retains backwards compatibility. You can see some examples of these in the form of unpopulated tables and columns in the database.

Deductive closure

A more advanced algorithm (e.g. that used by the OBO-Edit reasoner) for computing the graph_path table may be used in future, allowing us to discriminate paths of different edge types. Amongst other things it will help create relation-centric graph views.

Cross products

The schema already supports logical (complete) definitions via the complete relation qualifier, although these are not populated in the public GO database as the cross-products are still experiemntal.

For more background, see the logical definitions page on the GO wiki.

Similar database schemas

The GO schema is the predecessor of the ontology-based components of other bioinformatics schemas, including Chado, BioSQL and OBD. We have no plans to migrate to any of these schemas in the near future.

Back to top