ABOUT

This directory contains example gene sets to be used for GO evaluation
purposes.

FILE NAMING CONVENTIONS

The suffix ".gset" should be used.

Whilst complete metadata should be included in the file, the filename
should be reasonably self-descriptive. For published gene sets it
should include primary author, year and a brief description.

The filename should NOT contain any special characters or
whitespace. E.g. use:

  Zebedee_et_al_2008_bouncing_genes.gset
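
The naming rules above could be checked automatically. The following
is a minimal sketch, not part of any existing tooling. It assumes that
"special characters" means anything other than ASCII letters, digits
and underscores; that reading, and the function name
is_valid_gset_filename, are assumptions made for illustration.

```python
import re

# Assumption: only ASCII letters, digits and underscores are allowed
# before the mandatory ".gset" suffix.
GSET_NAME = re.compile(r"[A-Za-z0-9_]+\.gset")


def is_valid_gset_filename(name):
    """Return True if name follows the conventions sketched above."""
    return GSET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("Zebedee_et_al_2008_bouncing_genes.gset",
                      "bouncing genes.gset"):
        print(candidate, is_valid_gset_filename(candidate))
```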

FORMAT

Each gene set file consists of two parts - a header which includes
metadata about the gene set, and the actual list of genes.

These are separated by a line containing three hash symbols ("###").

Note the list of fields is extensible - we may add more at a later date.

Each tag should start with a '*' at the start of the line, and be
followed by the tag, then a ':', then a space, followed by the value.

The value should fit on one line, except where noted below.

 * reference: 
    - optional
    - should be a colon-separated ID or a URL, where the db prefix of the ID is registered in GO.xref_abbs
    - Example: PMID:123456
 * data_url:
    - optional
    - if this gene set can be generated by a REST query then the URL should be specified here.
 * year_submitted:
    - Example: 2012
 * submitted_by: 
    - Example: GOC:group
 * contact_email:
 * description: 
    - description of the experiment or how the curator generated the gene list
    - this field can carry on over multiple lines
 * reference_set:
    - numeric identifier from the NCBI Taxonomy OR the suffix of a GAF
    - Example: 9606
    - Example: wb
 * ontology_subset:
    - optional
    - By default the entirety of GO is used. To run enrichment over a subset, specify the ontology
      namespace label (e.g. biological_process - remember the underscore) OR a term ID (e.g. GO:0008150).
      If a term ID is used, all descendants of that term are considered in the analysis.
 * expected_enriched_term:
    - (should list top 3?)
 * item_id_type:
    - This should be EITHER a database from GO.xref_abbs (e.g. UniProtKB or SGD) OR 'symbol' OR 'name'
    - we prefer that identifiers are used but in some cases the author may only provide a gene symbol list
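
The two either/or fields above (reference_set and ontology_subset)
can be told apart by their shape. The sketch below is illustrative
only: the function names and return labels are hypothetical, and it
assumes that GO term IDs are "GO:" followed by seven digits and that
the namespace labels are the three standard GO namespaces.

```python
import re

# The three GO namespace labels (note the underscores).
GO_NAMESPACES = {"biological_process", "molecular_function",
                 "cellular_component"}
# Assumption: a GO term ID is "GO:" followed by exactly seven digits.
GO_TERM_ID = re.compile(r"GO:[0-9]{7}")


def classify_reference_set(value):
    """Return 'taxon' for a numeric taxonomy ID, else 'gaf_suffix'."""
    value = value.strip()
    return "taxon" if re.fullmatch(r"[0-9]+", value) else "gaf_suffix"


def classify_ontology_subset(value):
    """Return 'all', 'namespace' or 'term' for an ontology_subset value."""
    value = value.strip()
    if not value:
        return "all"  # field omitted: the entirety of GO is used
    if value in GO_NAMESPACES:
        return "namespace"
    if GO_TERM_ID.fullmatch(value):
        return "term"  # the term and all of its descendants
    raise ValueError("unrecognised ontology_subset: %r" % value)
```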

The list of genes or gene symbols is newline separated. In the future
we may also allow ranked gene sets and scored gene sets (e.g. read
counts).
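
To make the layout described in this section concrete, here is a
minimal parsing sketch together with a made-up example file. It is
not a reference implementation. The names parse_gset and EXAMPLE, the
gene entries GENE1 and GENE2, and the description text are all
hypothetical. The sketch assumes the separator line is exactly "###",
accepts optional whitespace between the '*' and the tag, treats
non-tag header lines as continuations of the previous value, and
keeps only the last value if a tag is repeated.

```python
import re

SEPARATOR = "###"
# '*' at the start of the line, then the tag, a ':', an optional space
# and the value.
TAG_LINE = re.compile(r"^\*\s*([A-Za-z_]+):\s?(.*)$")

# Hypothetical example file; the values are placeholders.
EXAMPLE = """\
*reference: PMID:123456
*year_submitted: 2012
*description: Hypothetical example gene set,
  continued on a second line
*reference_set: 9606
*item_id_type: symbol
###
GENE1
GENE2
"""


def parse_gset(text):
    """Split gene set text into (metadata dict, list of gene entries)."""
    header, genes = {}, []
    in_header, last_tag = True, None
    for raw in text.splitlines():
        line = raw.rstrip()
        if in_header:
            if line.strip() == SEPARATOR:
                in_header = False  # everything after this is the gene list
                continue
            match = TAG_LINE.match(line)
            if match:
                last_tag = match.group(1)
                header[last_tag] = match.group(2).strip()
            elif line.strip() and last_tag is not None:
                # Continuation of a multi-line value such as description.
                header[last_tag] = (header[last_tag] + " " + line.strip()).strip()
        elif line.strip():
            genes.append(line.strip())
    return header, genes


if __name__ == "__main__":
    metadata, gene_list = parse_gset(EXAMPLE)
    print(metadata["description"])
    print(gene_list)
```

Because the list of fields is extensible, the sketch stores every tag
it encounters rather than checking against a fixed set.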

SEE ALSO

Other gene sets are available in other formats, e.g. GMT:

http://cccb.dfci.harvard.edu/genesigdb/downloadall.jsp
http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats