ABOUT

This directory contains example gene sets to be used for GO evaluation
purposes.

FILE NAMING CONVENTIONS

The suffix ".gset" should be used.

Whilst complete metadata should be included in the file, the filename
should be reasonably self-descriptive. For published gene sets it
should include the primary author, the year and a brief description.
The filename should NOT contain any special characters or whitespace.

E.g. use:

  Zebedee_et_al_2008_bouncing_genes.gset

FORMAT

Each gene set file consists of two parts - a header which includes
metadata about the gene set, and the actual list of genes. These are
separated by a line containing 3 hash symbols.

Note that the list of fields is extensible - we may add more at a
later date.

Each header line should start with a '*', followed by the tag, then a
':', then a space, followed by the value. The value should fit on one
line, except where noted below.

 * reference:
   - optional
   - should be a colon-separated ID field or a URL, where the db is
     registered in GO.xref_abbs
   - Example: PMID:123456
 * data_url:
   - optional
   - if this gene set can be generated by a REST query then the URL
     should be specified here
 * year_submitted:
   - Example: 2012
 * submitted_by:
   - Example: GOC:group
 * contact_email:
 * description:
   - description of the experiment or how the curator generated the
     gene list
   - this field can carry on over multiple lines
 * reference_set:
   - numeric identifier from NCBI Taxonomy OR the suffix of a GAF
   - Example: 9606
   - Example: wb
 * ontology_subset:
   - optional
   - by default the entirety of GO is used. To run enrichment over a
     subset, specify the ontology namespace label (e.g.
     biological_process - remember the underscore) OR a term ID
     (e.g. GO:0008150). If a term ID is used, all descendants of that
     term are considered in the analysis
 * expected_enriched_term:
   - (should list top 3?)
 * item_id_type:
   - this should be EITHER a database from GO.xref_abbs (e.g.
     UniProtKB or SGD) OR 'symbol' OR 'name'
   - we prefer that identifiers are used, but in some cases the
     author may only provide a gene symbol list

The list of genes or gene symbols is newline-separated.

In the future we may also allow ranked gene sets and scored gene sets
(e.g. read counts).

SEE ALSO

Other gene sets are available in other formats - e.g. GMT

  http://cccb.dfci.harvard.edu/genesigdb/downloadall.jsp
  http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
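
EXAMPLE PARSER

The following is a minimal Python sketch, not an official tool, of how
a file in the FORMAT described above could be read. It assumes that the
separator line is exactly "###", that whitespace after the '*' is
optional, and that any header line not starting with '*' continues the
previous value (as allowed for description). The function name
parse_gset and the gene identifiers in the sample are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_gset(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Split .gset text into (metadata, genes).

    Assumptions (not stated in the format description): the separator
    is a line that is exactly '###' once stripped, and a non-'*' header
    line is a continuation of the previous tag's value.
    """
    metadata: Dict[str, str] = {}
    genes: List[str] = []
    current_tag: Optional[str] = None
    in_header = True

    for raw in text.splitlines():
        line = raw.strip()
        if in_header:
            if line == "###":
                # Everything after the separator is the gene list.
                in_header = False
            elif line.startswith("*"):
                # "* tag: value" -> split on the first ':' only, so
                # values such as "PMID:123456" keep their own colon.
                tag, _, value = line[1:].partition(":")
                current_tag = tag.strip()
                metadata[current_tag] = value.strip()
            elif line and current_tag is not None:
                # Continuation of a multi-line value (e.g. description).
                joined = metadata[current_tag] + " " + line
                metadata[current_tag] = joined.strip()
        elif line:
            # Gene list: one identifier or symbol per line.
            genes.append(line)

    return metadata, genes


# Illustrative input only; GENE1 and GENE2 are made-up symbols.
example = """\
* reference: PMID:123456
* year_submitted: 2012
* description: genes found in a hypothetical screen
  continued on a second line
* reference_set: 9606
* item_id_type: symbol
###
GENE1
GENE2
"""

metadata, genes = parse_gset(example)
print(metadata["reference"])
print(genes)
```

Unknown tags are kept as-is rather than rejected, since the list of
fields is described as extensible.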