Geneset annotation data

Data sources

We currently obtain the geneset annotation data from several public data resources and keep them up-to-date, so that you don’t have to do it:

Source              Update frequency
MSigDB              whenever a new release is available
Gene Ontology       whenever a new release is available
KEGG                whenever a new release is available
WikiPathways        whenever a new release is available
Reactome            whenever a new release is available
Disease Ontology    whenever a new release is available

The most up-to-date data information can be accessed here.

Geneset object

Geneset annotation data are both stored and returned as a geneset object, which is essentially a collection of fields (attributes) and their values:

{
  "_id": "WP60",
  "_version": 1,
  "genes": [
    {
      "ensemblgene": "YOL165C",
      "mygene_id": "853999",
      "name": "putative aryl-alcohol dehydrogenase",
      "ncbigene": "853999",
      "symbol": "AAD15",
      "uniprot": "Q08361"
    },
    {
      "mygene_id": "850488",
      "name": "pseudo",
      "ncbigene": "850488",
      "symbol": "AAD6"
    },
    {
      "ensemblgene": "YNL331C",
      "mygene_id": "855385",
      "name": "putative aryl-alcohol dehydrogenase",
      "ncbigene": "855385",
      "symbol": "AAD14",
      "uniprot": "P42884"
    },
    {
      "ensemblgene": "YCR107W",
      "mygene_id": "850471",
      "name": "putative aryl-alcohol dehydrogenase",
      "ncbigene": "850471",
      "symbol": "AAD3",
      "uniprot": "P25612"
    },
    {
      "ensemblgene": "YJR155W",
      "mygene_id": "853620",
      "name": "putative aryl-alcohol dehydrogenase",
      "ncbigene": "853620",
      "symbol": "AAD10",
      "uniprot": "P47182"
    },
    {
      "ensemblgene": "YDL243C",
      "mygene_id": "851354",
      "name": "putative aryl-alcohol dehydrogenase",
      "ncbigene": "851354",
      "symbol": "AAD4",
      "uniprot": "Q07747"
    }
  ],
  "is_public": true,
  "name": "Toluene degradation",
  "source": "wikipathways",
  "taxid": 559292
}

The example above contains the most commonly available fields but omits some that are specific to particular data sources. For complete examples, see GO_0004568_9606 (a Gene Ontology geneset) or WP60 (a WikiPathways geneset), or find a list of the available fields at: http://mygeneset.info/v1/metadata/fields
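Because some fields are optional (the AAD6 entry in the example has no "ensemblgene" or "uniprot" value), code that reads a geneset object should tolerate missing keys. The following is a minimal sketch that works on a trimmed copy of the WP60 example; the helper name gene_field is hypothetical and not part of any MyGeneset client.

```python
# A trimmed copy of the WP60 example (two of its six genes).
geneset = {
    "_id": "WP60",
    "name": "Toluene degradation",
    "source": "wikipathways",
    "taxid": 559292,
    "genes": [
        {"ensemblgene": "YOL165C", "mygene_id": "853999", "ncbigene": "853999",
         "symbol": "AAD15", "uniprot": "Q08361"},
        {"mygene_id": "850488", "name": "pseudo", "ncbigene": "850488",
         "symbol": "AAD6"},
    ],
}


def gene_field(geneset, field):
    """Return `field` for every gene in the geneset (hypothetical helper).

    Genes that lack the field yield None instead of raising KeyError.
    """
    return [gene.get(field) for gene in geneset.get("genes", [])]


symbols = gene_field(geneset, "symbol")
uniprot_ids = gene_field(geneset, "uniprot")
print(symbols)
print(uniprot_ids)
```

Using dict.get keeps the result aligned with the gene list, so a None at a given position shows which gene lacks the field.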

_id field

Each individual geneset object contains an “_id” field as the primary key. The form of the “_id” value differs between built-in data sources, but it is typically the primary ID used in the source data. For example, for MSigDB, this is the original geneset id. For genesets from sources that cover multiple species (KEGG, GO, WikiPathways), “_id” is typically a combination of the pathway id and the organism taxid, as in GO_0004568_9606. User-submitted genesets have randomly generated “_id” values. Here is an example. If you are searching for a particular GO term or KEGG ID using the query endpoint, we recommend using “kegg.id” or “go.id” plus the species filter instead of “_id”.
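As a rough illustration of searching by a source-specific field together with the species filter, the sketch below only builds a query URL and sends nothing. Several details are assumptions not confirmed on this page: the query endpoint path /v1/query, the q parameter, the field:"value" expression syntax, and the GO:0004568 form of the “go.id” value. Only the host, the field names and the “species” parameter come from this document, and build_query_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Host and version prefix taken from the metadata URL quoted in this document.
BASE_URL = "http://mygeneset.info/v1"


def build_query_url(field, value, species):
    """Build a query URL that searches one field and filters by species.

    Assumed, not confirmed here: the /query path, the `q` parameter and the
    field:"value" expression syntax.
    """
    params = {"q": f'{field}:"{value}"', "species": species}
    # urlencode percent-encodes the colon and the quotes in the expression.
    return f"{BASE_URL}/query?{urlencode(params)}"


url = build_query_url("go.id", "GO:0004568", "9606")
print(url)
```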

Note

Regardless of what the value of the “_id” field looks like, it always works with our geneset annotation service /v1/geneset/<geneid>.
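A small sketch of how an “_id” value maps onto the annotation service path. The host is taken from the metadata URL quoted in this document; the helper name geneset_url is hypothetical, and percent-encoding the id is a defensive choice rather than a documented requirement.

```python
from urllib.parse import quote

# Host and version prefix taken from the metadata URL quoted in this document.
BASE_URL = "http://mygeneset.info/v1"


def geneset_url(geneset_id):
    """Return the annotation-service URL for a geneset _id (hypothetical helper)."""
    # Percent-encode the id so that unusual characters cannot alter the path.
    return f"{BASE_URL}/geneset/{quote(geneset_id, safe='')}"


print(geneset_url("WP60"))
print(geneset_url("GO_0004568_9606"))
```

The resulting URL can then be fetched with any HTTP client.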

_score field

You will often see a “_score” field in the returned geneset object. It is an internal score representing how well the query matches the returned geneset object. It carries little meaning in the geneset annotation service, where only one geneset object is returned. In the geneset query service, the returned geneset hits are sorted by score in descending order by default.
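If hits from several queries are merged on the client side, the same descending order can be restored locally. This is a minimal sketch with made-up ids and scores, not real service output; treating a missing “_score” as 0.0 is an assumption chosen for illustration, and by_score is a hypothetical helper.

```python
# Illustrative hits only: these ids and scores are invented.
hits = [
    {"_id": "example-a", "_score": 1.7},
    {"_id": "example-b", "_score": 4.2},
    {"_id": "example-c"},  # no _score, e.g. taken from an annotation lookup
]


def by_score(hits):
    """Return hits sorted by _score, highest first (hypothetical helper).

    Hits without a _score are treated as 0.0 and therefore sort last
    among positive scores.
    """
    return sorted(hits, key=lambda hit: hit.get("_score", 0.0), reverse=True)


ranked = by_score(hits)
print([hit["_id"] for hit in ranked])
```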

Species filter

We support ALL species annotated by NCBI and Ensembl. All of our services allow you to pass a “species” parameter to limit the query results. The “species” parameter accepts taxonomy ids as input. You can look up the taxonomy id for your favorite species in NCBI Taxonomy.

For convenience, we allow you to pass these common names for commonly used species (e.g. “species=human,mouse,rat”):

Search term (common name)    Genus name                  Taxonomy id
human                        Homo sapiens                9606
mouse                        Mus musculus                10090
rat                          Rattus norvegicus           10116
mosquito                     Anopheles gambiae           180454
fruitfly                     Drosophila melanogaster     7227
nematode                     Caenorhabditis elegans      6239
zebrafish                    Danio rerio                 7955
thale-cress                  Arabidopsis thaliana        3702
rice                         Oryza sativa                39947
dog                          Canis lupus familiaris      9615
chicken                      Gallus gallus               9031
horse                        Equus caballus              9796
chimpanzee                   Pan troglodytes             9598
frog                         Xenopus tropicalis          8364
pig                          Sus scrofa                  9823
pseudomonas-aeruginosa       Pseudomonas aeruginosa      208964
brewers-yeast                Saccharomyces cerevisiae    559292

If needed, you can pass “species=all” to query against all available species, although we recommend passing only the species you need for a faster response.
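The service resolves the common names itself, but a client may still want to normalize its input to taxonomy ids before building a request. The sketch below does that with a local copy of the common-name table. The helper name species_param is hypothetical, and joining several taxonomy ids with commas is an assumption made by analogy with the “species=human,mouse,rat” example.

```python
# Common names and taxonomy ids copied from the species table.
COMMON_NAME_TO_TAXID = {
    "human": 9606, "mouse": 10090, "rat": 10116, "mosquito": 180454,
    "fruitfly": 7227, "nematode": 6239, "zebrafish": 7955,
    "thale-cress": 3702, "rice": 39947, "dog": 9615, "chicken": 9031,
    "horse": 9796, "chimpanzee": 9598, "frog": 8364, "pig": 9823,
    "pseudomonas-aeruginosa": 208964, "brewers-yeast": 559292,
}


def species_param(species):
    """Build a `species` value from common names and/or taxonomy ids.

    Hypothetical helper. Returns "all" as soon as "all" is requested;
    otherwise returns the taxonomy ids joined by commas (assumed format).
    """
    parts = []
    for item in species:
        text = str(item).strip().lower()
        if text == "all":
            return "all"
        if text.isdigit():
            parts.append(text)  # already a taxonomy id
        elif text in COMMON_NAME_TO_TAXID:
            parts.append(str(COMMON_NAME_TO_TAXID[text]))
        else:
            raise ValueError(f"unknown species: {item!r}")
    return ",".join(parts)


print(species_param(["human", "mouse", 10116]))
print(species_param(["all"]))
```

Raising on an unknown name surfaces typos on the client side instead of sending a request with an unintended filter.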