The webserver provides an API for advanced users, as described on this page.
The base URL for API calls is https://gmgc.embl.de/api/v1.0/
and API calls return JSON (except where noted).
Also, note that the resources can all be downloaded for local processing. For large-scale analyses, that will be more efficient than repeatedly calling the API.
version
Returns the version of the resource.
curl https://gmgc.embl.de/api/v1.0/version
{
"gmgc-version": "1.0.0",
"last-updated": "Jun 1 2020"
}
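The lookup calls on this page all follow the same pattern: append a path to the base URL and parse the JSON reply. The following is a minimal Python sketch of that pattern using only the standard library. The helper names `api_url` and `get_json` are hypothetical and are not part of any official GMGC client.

```python
import json
import urllib.request

# Base URL as given at the top of this page.
BASE_URL = "https://gmgc.embl.de/api/v1.0/"


def api_url(*parts):
    """Join path components onto the API base URL (hypothetical helper)."""
    return BASE_URL + "/".join(str(part).strip("/") for part in parts)


def get_json(url, timeout=30):
    """Perform a GET request and decode the JSON reply (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With network access, `get_json(api_url("version"))` should return a dictionary like the one shown above.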
Lookups whose <identifier> matches a single unigene return the information documented below. Alternatively, if the provided <identifier> matches more than one unigene, the reply will include a list of matches.
# Using an eggNOG identifier
curl https://gmgc.embl.de/api/v1.0/unigene/4PQJ6
{
"matches": [
{
"source": "SAMN06172490",
"biome": [
"dog gut",
"human skin",
"cat gut"
],
"taxonomy": "1262977",
"id": "GMGC10.000_000_027.PEPT"
},
{
"source": "SAMN06172460",
"biome": [
"dog gut"
],
"taxonomy": "742823",
"id": "GMGC10.000_186_864.PEPT"
},
(...)
]
}
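Because the same endpoint can return either a single record or a list of matches, client code may want to normalise the two shapes. The following is a minimal Python sketch; the function name `matched_unigene_ids` is hypothetical, and it assumes that a multi-match reply always has a top-level "matches" key and that a single-match reply has a "name" key, as in the examples on this page.

```python
def matched_unigene_ids(reply):
    """Return the unigene identifiers contained in a unigene lookup reply.

    Assumption: multi-match replies carry a "matches" list whose entries
    have an "id" field; single-match replies carry a "name" field.
    """
    if "matches" in reply:
        return [match["id"] for match in reply["matches"]]
    return [reply["name"]]
```

Each returned identifier can then be looked up individually to retrieve the full record.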
unigene/<identifier>
- returns information about the given unigene, including the number of samples in which it was identified, its length, and its habitats.
curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304
{
"gene_family": "GMGC10.205_457_183.UNKNOWN",
"cluster": "GMGC10.146_435_694.SCLAV_5304",
"query": "GMGC10.054_598_380.SCLAV_5304",
"name": "GMGC10.054_598_380.SCLAV_5304",
"taxonomy": [
{
"name": "environmental samples",
"id": "59619",
"rank": "genus"
}
],
"samples": 573,
"length": 88062,
"habitat": [
"human skin",
"human gut"
],
"genome_bins": 25,
"strand": "+",
"complete": 1
}
unigene/<identifier>/dna_sequence
- returns the DNA sequence of the requested unigene
curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/dna_sequence
{
"query": "GMGC10.054_598_380.SCLAV_5304",
"name": "GMGC10.054_598_380.SCLAV_5304",
"dna_sequence": "ATGAAGTTAGGGGAGAAAATAA..."
}
unigene/<identifier>/protein_sequence
- returns the translated (amino acid) sequence of the requested unigene
curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/protein_sequence
{
"query": "GMGC10.054_598_380.SCLAV_5304",
"name": "GMGC10.054_598_380.SCLAV_5304",
"protein_sequence": "MKLGEKIMRLGKKTSRAISIALL..."
}
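The sequence endpoints return JSON rather than FASTA, so a small conversion step may be useful before passing sequences to other tools. The following Python sketch assumes a reply with the "name" field and a sequence field ("dna_sequence" or "protein_sequence") as in the examples above; the function name `reply_to_fasta` and the line width of 60 are illustrative choices.

```python
def reply_to_fasta(reply, key, width=60):
    """Format a sequence reply as a FASTA record.

    `key` is the name of the sequence field, e.g. "dna_sequence" or
    "protein_sequence". The sequence is wrapped at `width` characters.
    """
    sequence = reply[key]
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + reply["name"] + "\n" + "\n".join(lines) + "\n"
```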
unigene/<identifier>/features
- returns intrinsic sequence features as well as annotations from Pfam, SMART and eggNOG
curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/features
{
"features": {
"intrinsic": [
{
"feature": "COIL",
"end": 351,
"start": 321
}
],
"pfam": [
{
"domain": "Pfam:RCC1",
"evalue": 3.4e-06,
"bitscore": 24.2,
"end": 891,
"start": 834
},
(...)
],
"smart": [
{
"domain": "PbH1",
"evalue": 4989.07360092172,
"bitscore": 2.3,
"end": 782,
"start": 755
},
(...)
],
"eggnog": {
"cog_functional_category": "M",
"eggnog_ogs": [
"2IDH9@201174",
"4D0CR@85004",
"COG5184@1",
"COG5184@2"
],
"seed_ortholog_score": 340.5,
"go_terms": [],
"kegg_pathway": [],
"bigg_reaction": [],
"ec_number": "-",
"cazy": "-",
"kegg_reaction": [],
"seed_eggnog_ortholog": "1394175.AWUN01000002_gene844",
"predicted_protein_name": "-",
"seed_ortholog_evalue": 7.8e-89,
"kegg_ko": [],
"eggnog_free_text_description": "Listeria-Bacteroides repeat domain (List_Bact_rpt)",
"brite": [],
"kegg_module": []
}
},
"query": "GMGC10.054_598_380.SCLAV_5304",
"name": "GMGC10.054_598_380.SCLAV_5304"
}
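Domain annotations in the reply carry an e-value, and some of them (such as the SMART hit above) may be far from significant. The following Python sketch keeps only domains at or below a chosen e-value. It assumes the reply layout shown above, with "pfam" and "smart" lists under "features"; the function name `domains_below_evalue` is hypothetical and any threshold passed to it is the caller's choice, not a recommendation from the API.

```python
def domains_below_evalue(reply, source, max_evalue):
    """Return the domain hits from `source` with evalue <= max_evalue.

    `source` is assumed to be "pfam" or "smart", the two list-valued
    annotation sources shown in the features reply.
    """
    hits = reply["features"].get(source, [])
    return [hit for hit in hits if hit["evalue"] <= max_evalue]
```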
unigene/<identifier>/samples
- returns the list of samples where the given unigene was identified.
curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/samples
{
"samples": [
"SAMEA1906425",
"SAMEA1906421",
"SAMEA1906417",
(...)
],
"query": "GMGC10.054_598_380.SCLAV_5304",
"name": "GMGC10.054_598_380.SCLAV_5304"
}
unigene/<identifier>/genome_bins
- returns the list of genome bins where the given unigene was identified.
curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/genome_bins
{
"genome_bins": [
"GMBC10.101_017",
"GMBC10.160_410",
"GMBC10.133_210",
(...)
],
"query": "GMGC10.054_598_380.SCLAV_5304",
"name": "GMGC10.054_598_380.SCLAV_5304"
}
unigene/<identifier>/antibiotics
- returns the list of antibiotics associated with the specified unigene
curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.000_001_095.UGD/antibiotics
{
"antiobiotics": [
"actinomycin",
"actinomycind",
"arylomycin",
(...)
],
"query": "GMGC10.000_001_095.UGD",
"name": "GMGC10.000_001_095.UGD"
}
unigene/<identifier>/aro_terms
- returns the list of associated ARO/CARD identifiers
curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.000_001_095.UGD/aro_terms
{
"aro_terms": [
"ARO:1000001",
"ARO:3000000",
"ARO:3002984",
"ARO:3003577",
"ARO:3003580",
"ARO:3004112",
"ARO:3004269"
],
"query": "GMGC10.000_001_095.UGD",
"name": "GMGC10.000_001_095.UGD"
}
genome_bin/<genome_bin_id>
- returns information about the requested genome_bin identifier
curl https://gmgc.embl.de/api/v1.0/genome_bin/GMBC10.001_023
{
"min_contig_size": 3323,
"name": "GMBC10.001_023",
"contamination": 0,
"genome": "SAMEA3708885.bin.14",
"N50": 2611971,
"total_bp_size": 2611971,
"nr_contigs": 72,
"GTDB_tk": "d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lachnospira;s__Lachnospira rogosae",
"category": "high-quality",
"quality": 99.33,
"completeness": 99.33,
"max_contig_size": 243991
}
sample/<sample_id>
- returns information about the specified sample
curl https://gmgc.embl.de/api/v1.0/sample/SAMEA3708885
{
"ena_link": "https://www.ebi.ac.uk/ena/data/view/SAMEA3708885",
"longitude": 12.568337,
"habitat": "human gut",
"latitude": 55.676097,
"name": "SAMEA3708885"
}
There are also plural versions of the lookups above, which work with POST:
unigenes
unigenes/features
unigenes/samples
unigenes/genome_bins
unigenes/dna_sequence
unigenes/protein_sequence
samples
genome_bins
These correspond to the calls above, except that they accept multiple inputs, passed in as JSON (with the correct content type), in the format: {"names": ["gene-A", "gene-B"]} (or sample A, or genome_bin A, …).
E.g.:
curl --header 'Content-Type: application/json' \
--request POST \
--data '{"names": ["GMGC10.003_873_867.PHOA", "GMGC10.016_471_114.PHOA"]}' \
'https://gmgc.embl.de/api/v1.0/unigenes/genome_bins'
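The same POST request can be prepared from Python. The sketch below uses only the standard library and mirrors the curl call above: a JSON body of the form {"names": [...]} with the application/json content type. The function name `build_batch_request` is hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://gmgc.embl.de/api/v1.0/"


def build_batch_request(endpoint, names):
    """Build (but do not send) a POST request for one of the plural endpoints.

    `endpoint` is a path such as "unigenes/genome_bins"; `names` is an
    iterable of identifiers.
    """
    payload = json.dumps({"names": list(names)}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the reply with `json.load`; this step requires network access.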
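habitat/<habitat_name>
- returns the list of samples for the given habitat, together with download URLs for the corresponding habitat-specific subcatalogs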
curl https://gmgc.embl.de/api/v1.0/habitat/human_gut
{
"samples": [
"SAMEA1906426",
"SAMEA1906424",
"SAMEA1906422",
(...)
],
"subcatalog_url": "https://gmgc.embl.de/downloads/v1.0/GMGC10.human-gut.95nr.fna.gz",
"name": "human gut",
"subcatalog_no_rare_url": "https://gmgc.embl.de/downloads/v1.0/GMGC10.human-gut.no-rare.95nr.fna.gz"
}
query/sequence
The call is a POST request with up to 50 sequences as an attached FASTA file. The attachment should be called fasta.
For example, create test.fasta containing:
>MySeq
AALAMSALMALSJLAJLACAOSIJDAOSIJDALAASKJDASLKJALCEMALWPQRODASLKJALCKMALWPQRODASLKJ
ALCCKMALWPQRODASLKJALCKMALWPQROQUPJALSFAASLUFPASUFASFJA
and then submit it with curl:
curl -X POST \
-F 'fasta=@test.fasta' \
-F 'mode=all' \
-F 'return_seqs=true' \
-F 'return_bins=true' \
'https://gmgc.embl.de/api/v1.0/query/sequence'
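Since a single request accepts at most 50 sequences, larger inputs need to be split before submission. The following Python sketch splits FASTA text into chunks that respect that limit. The function name `split_fasta` is hypothetical, and the parser is deliberately simple: it assumes each record starts with a ">" header line and ignores blank lines.

```python
def split_fasta(text, max_records=50):
    """Split FASTA text into chunks of at most `max_records` sequences each."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record.
            records.append([line])
        elif line.strip() and records:
            # Sequence lines are appended to the current record.
            records[-1].append(line)
    return [
        "\n".join("\n".join(record) for record in records[i:i + max_records]) + "\n"
        for i in range(0, len(records), max_records)
    ]
```

Each returned chunk can be written to its own file and submitted as a separate request.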
Note that the algorithm will always return its best matches as hits, and it is the user's responsibility to filter them appropriately (i.e., if no good matches exist in the catalog, the algorithm will still return something, but with a high e-value).
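Parameters:
mode: "all" or "besthit"
return_seqs: boolean
return_bins: boolean
The response has the following structure (the lines starting with # are annotations, not part of the JSON):
{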
"results":
[{
"query_name": "Q1",
"hits":
[ # hits is always a list, but if the request included `mode=besthit`, this will be a list of size one.
{"unigene_id": "GMGC10.000_000_000.NAME",
"evalue": 12e-23,
"bitscore": 232.2,
# sequences are only provided if the request included `return_seqs=true`
"dna_sequence": "ATTATACAA...",
"protein_sequence": "MEPATA...",
# Genome bins are only provided if the request included `return_bins=true`
"genome_bins":
[ "GMBC10.001_023"
, "GMBC10.202_232"
]
},
(...)
]
}, {
"query_name": "Q2",
"hits": (...)
}]
}
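Because the service always returns its best matches, even poor ones, replies usually need to be filtered by e-value on the client side. The following Python sketch does this for a decoded response with the structure shown above. The function name `filter_hits` and the default threshold of 1e-5 are illustrative assumptions, not values recommended by the API; choose a cutoff appropriate to your analysis.

```python
def filter_hits(response, max_evalue=1e-5):
    """Map each query name to its hits with evalue <= max_evalue.

    Queries without any acceptable hit are kept with an empty list, so
    that "no good match" remains visible to the caller.
    """
    filtered = {}
    for result in response["results"]:
        filtered[result["query_name"]] = [
            hit for hit in result["hits"] if hit["evalue"] <= max_evalue
        ]
    return filtered
```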