GMGC API (Version 1.0)

The webserver provides an API for advanced users, as described on this page.

The base URL for API calls is https://gmgc.embl.de/api/v1.0/ and API calls return JSON (except where noted).

Note that all resources can also be downloaded for local processing. For large-scale analyses, this will be more efficient than repeatedly calling the API.

Version

version

Returns the version of the resource.

Example

curl https://gmgc.embl.de/api/v1.0/version

Output

{
    "gmgc-version": "1.0.0",
    "last-updated": "Jun 1 2020"
}
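
As a minimal sketch of calling the API from Python with only the standard library, the helper below joins an endpoint path onto the base URL given above and decodes a JSON reply. The names `api_url` and `fetch_json` are illustrative and not part of the GMGC service. The sample reply is the output shown above, so the last lines run without network access.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://gmgc.embl.de/api/v1.0/"


def api_url(path):
    """Join an endpoint path (e.g. 'version') onto the API base URL."""
    return BASE_URL + path.lstrip("/")


def fetch_json(path, timeout=30):
    """Perform a GET request and decode the JSON reply (needs network access)."""
    with urlopen(api_url(path), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline illustration: decode the reply documented above.
sample_reply = '{"gmgc-version": "1.0.0", "last-updated": "Jun 1 2020"}'
version_info = json.loads(sample_reply)
print(version_info["gmgc-version"])
```

With network access, `fetch_json("version")` would be expected to return the same kind of dictionary.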

Lookup

A lookup that matches a single unigene returns the information documented below. Alternatively, if the provided <identifier> matches more than one unigene, the reply will include a list of matches.

Matching multiple unigenes

Example

# Using an eggNOG identifier
curl https://gmgc.embl.de/api/v1.0/unigene/4PQJ6

Output

{
  "matches": [
    {
      "source": "SAMN06172490",
      "biome": [
        "dog gut",
        "human skin",
        "cat gut"
      ],
      "taxonomy": "1262977",
      "id": "GMGC10.000_000_027.PEPT"
    },
    {
      "source": "SAMN06172460",
      "biome": [
        "dog gut"
      ],
      "taxonomy": "742823",
      "id": "GMGC10.000_186_864.PEPT"
    },
(...)
  ]
}
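
Because a lookup reply can take either shape, client code may want to normalise it. The sketch below assumes, based only on the examples in this section, that a multi-match reply carries a "matches" list whose entries have an "id" field and that a single-match reply carries a "name" field. The function name `matched_ids` is hypothetical.

```python
def matched_ids(reply):
    """Return the unigene identifiers contained in a lookup reply.

    Assumption taken from the documented examples: a multi-match reply has
    a "matches" list, whereas a single-match reply has a "name" field.
    """
    if "matches" in reply:
        return [match["id"] for match in reply["matches"]]
    return [reply["name"]]


# Replies abbreviated from the examples in this section.
multi = {
    "matches": [
        {"source": "SAMN06172490", "taxonomy": "1262977",
         "id": "GMGC10.000_000_027.PEPT"},
        {"source": "SAMN06172460", "taxonomy": "742823",
         "id": "GMGC10.000_186_864.PEPT"},
    ]
}
single = {"name": "GMGC10.054_598_380.SCLAV_5304", "samples": 573}

print(matched_ids(multi))
print(matched_ids(single))
```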

Matching a single unigene

unigene/<identifier> - returns information about the given unigene, including the number of samples in which it was identified, its length, and its habitats.

Example

curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304

Output

{
  "gene_family": "GMGC10.205_457_183.UNKNOWN",
  "cluster": "GMGC10.146_435_694.SCLAV_5304",
  "query": "GMGC10.054_598_380.SCLAV_5304",
  "name": "GMGC10.054_598_380.SCLAV_5304",
  "taxonomy": [
    {
      "name": "environmental samples",
      "id": "59619",
      "rank": "genus"
    }
  ],
  "samples": 573,
  "length": 88062,
  "habitat": [
    "human skin",
    "human gut"
  ],
  "genome_bins": 25,
  "strand": "+",
  "complete": 1
}

unigene/<identifier>/dna_sequence - returns the DNA sequence of the requested unigene

Example

curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/dna_sequence

Output

{
  "query": "GMGC10.054_598_380.SCLAV_5304",
  "name": "GMGC10.054_598_380.SCLAV_5304",
  "dna_sequence": "ATGAAGTTAGGGGAGAAAATAA..."
}

unigene/<identifier>/protein_sequence - returns the translated (amino acid) sequence of the requested unigene

Example

curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/protein_sequence

Output

{
  "query": "GMGC10.054_598_380.SCLAV_5304",
  "name": "GMGC10.054_598_380.SCLAV_5304",
  "protein_sequence": "MKLGEKIMRLGKKTSRAISIALL..."
}
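
Sequence replies are plain JSON, so downstream tools that expect FASTA need a small conversion step. The following is one possible sketch, not an official client. The function name `to_fasta` and the default line width of 60 are arbitrary choices.

```python
def to_fasta(reply, key, width=60):
    """Format a sequence reply as a FASTA record wrapped at `width` columns.

    `key` is the reply field holding the sequence, i.e. "dna_sequence" or
    "protein_sequence" in the documented replies.
    """
    sequence = reply[key]
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + reply["name"]] + lines) + "\n"


# Only the truncated prefix shown above is used here; a real reply
# would contain the full sequence.
reply = {
    "query": "GMGC10.054_598_380.SCLAV_5304",
    "name": "GMGC10.054_598_380.SCLAV_5304",
    "protein_sequence": "MKLGEKIMRLGKKTSRAISIALL",
}
print(to_fasta(reply, "protein_sequence", width=10), end="")
```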

unigene/<identifier>/features - returns intrinsic sequence features as well as annotations from Pfam, SMART and eggNOG

Example

curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/features

Output

{
  "features": {
    "intrinsic": [
      {
        "feature": "COIL",
        "end": 351,
        "start": 321
      }
    ],
    "pfam": [
      {
        "domain": "Pfam:RCC1",
        "evalue": 3.4e-06,
        "bitscore": 24.2,
        "end": 891,
        "start": 834
      },
      (...)
    ],
    "smart": [
      {
        "domain": "PbH1",
        "evalue": 4989.07360092172,
        "bitscore": 2.3,
        "end": 782,
        "start": 755
      },
      (...)
    ],
    "eggnog": {
      "cog_functional_category": "M",
      "eggnog_ogs": [
        "2IDH9@201174",
        "4D0CR@85004",
        "COG5184@1",
        "COG5184@2"
      ],
      "seed_ortholog_score": 340.5,
      "go_terms": [],
      "kegg_pathway": [],
      "bigg_reaction": [],
      "ec_number": "-",
      "cazy": "-",
      "kegg_reaction": [],
      "seed_eggnog_ortholog": "1394175.AWUN01000002_gene844",
      "predicted_protein_name": "-",
      "seed_ortholog_evalue": 7.8e-89,
      "kegg_ko": [],
      "eggnog_free_text_description": "Listeria-Bacteroides repeat domain (List_Bact_rpt)",
      "brite": [],
      "kegg_module": []
    }
  },
  "query": "GMGC10.054_598_380.SCLAV_5304",
  "name": "GMGC10.054_598_380.SCLAV_5304"
}
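
The example above includes a SMART hit with an e-value of about 4989, which suggests that domain hits are reported without significance filtering. A client may therefore want to apply its own cutoff. The sketch below uses an arbitrary threshold of 1e-3 (an assumption, not a recommendation from the GMGC authors), and the function name `significant_domains` is hypothetical.

```python
def significant_domains(features, max_evalue=1e-3):
    """Return (source, domain) pairs for Pfam/SMART hits with e-value <= max_evalue."""
    kept = []
    for source in ("pfam", "smart"):
        for hit in features.get(source, []):
            if hit["evalue"] <= max_evalue:
                kept.append((source, hit["domain"]))
    return kept


# The "features" object abbreviated from the example output above.
features = {
    "pfam": [
        {"domain": "Pfam:RCC1", "evalue": 3.4e-06, "bitscore": 24.2,
         "end": 891, "start": 834},
    ],
    "smart": [
        {"domain": "PbH1", "evalue": 4989.07360092172, "bitscore": 2.3,
         "end": 782, "start": 755},
    ],
}
print(significant_domains(features))
```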

unigene/<identifier>/samples - returns the list of samples in which the given unigene was identified.

Example

curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/samples

Output

{
  "samples": [
    "SAMEA1906425",
    "SAMEA1906421",
    "SAMEA1906417",
    (...)
  ],
  "query": "GMGC10.054_598_380.SCLAV_5304",
  "name": "GMGC10.054_598_380.SCLAV_5304"
}

unigene/<identifier>/genome_bins - returns the list of genome bins in which the given unigene was identified.

Example

curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.054_598_380.SCLAV_5304/genome_bins

Output

{
  "genome_bins": [
    "GMBC10.101_017",
    "GMBC10.160_410",
    "GMBC10.133_210",
    (...)
  ],
  "query": "GMGC10.054_598_380.SCLAV_5304",
  "name": "GMGC10.054_598_380.SCLAV_5304"
}

Antibiotics

unigene/<identifier>/antibiotics - returns the list of antibiotics associated with the specified unigene

Example

curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.000_001_095.UGD/antibiotics

Output

{
  "antiobiotics": [
    "actinomycin",
    "actinomycind",
    "arylomycin",
    (...)
  ],
  "query": "GMGC10.000_001_095.UGD",
  "name": "GMGC10.000_001_095.UGD"
}

unigene/<identifier>/aro_terms - returns the list of associated ARO/CARD identifiers

Example

curl https://gmgc.embl.de/api/v1.0/unigene/GMGC10.000_001_095.UGD/aro_terms

Output

{
  "aro_terms": [
    "ARO:1000001",
    "ARO:3000000",
    "ARO:3002984",
    "ARO:3003577",
    "ARO:3003580",
    "ARO:3004112",
    "ARO:3004269"
  ],
  "query": "GMGC10.000_001_095.UGD",
  "name": "GMGC10.000_001_095.UGD"
}

Genome bins

genome_bin/<genome_bin_id> - returns information about the requested genome bin

Example

curl https://gmgc.embl.de/api/v1.0/genome_bin/GMBC10.001_023

Output

{
  "min_contig_size": 3323,
  "name": "GMBC10.001_023",
  "contamination": 0,
  "genome": "SAMEA3708885.bin.14",
  "N50": 2611971,
  "total_bp_size": 2611971,
  "nr_contigs": 72,
  "GTDB_tk": "d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lachnospira;s__Lachnospira rogosae",
  "category": "high-quality",
  "quality": 99.33,
  "completeness": 99.33,
  "max_contig_size": 243991
}
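
The "GTDB_tk" field packs a full lineage into one semicolon-separated string with rank prefixes such as `d__` and `p__`. A possible way to split it into named ranks is sketched below. The function name `parse_gtdb` and the prefix-to-rank table are illustrative, and the table covers only the seven prefixes that appear in the example.

```python
# Rank prefixes as they appear in the example lineage above.
RANKS = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_gtdb(lineage):
    """Split a 'd__X;p__Y;...' lineage string into a {rank: name} dictionary."""
    result = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        if name:  # skip empty or unassigned ranks such as "s__"
            result[RANKS.get(prefix, prefix)] = name
    return result


lineage = ("d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;"
           "f__Lachnospiraceae;g__Lachnospira;s__Lachnospira rogosae")
taxonomy = parse_gtdb(lineage)
print(taxonomy["genus"], "|", taxonomy["species"])
```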

Samples

sample/<sample_id> - returns information about the specified sample

Example

curl https://gmgc.embl.de/api/v1.0/sample/SAMEA3708885

Output

{
  "ena_link": "https://www.ebi.ac.uk/ena/data/view/SAMEA3708885",
  "longitude": 12.568337,
  "habitat": "human gut",
  "latitude": 55.676097,
  "name": "SAMEA3708885"
}

Batched queries

There are also plural versions of the lookups above, which are called with POST requests:

unigenes
unigenes/features
unigenes/samples
unigenes/genome_bins
unigenes/dna_sequence
unigenes/protein_sequence
samples
genome_bins

These correspond to the calls above, except that they accept multiple inputs, passed as a JSON body (with the Content-Type header set to application/json) in the format {"names": ["gene-A", "gene-B"]} (or sample names or genome bin names, for the samples and genome_bins calls).

For example:

curl --header 'Content-Type: application/json' \
     --request POST \
     --data '{"names": ["GMGC10.003_873_867.PHOA", "GMGC10.016_471_114.PHOA"]}' \
     'https://gmgc.embl.de/api/v1.0/unigenes/genome_bins'
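
The same batched call can be sketched in Python with the standard library. The code below only constructs the request object, so it runs without network access. `batch_request` and `post_batch` are hypothetical helper names, and the shape of the reply is not documented here, so it is not parsed beyond JSON decoding.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://gmgc.embl.de/api/v1.0/"


def batch_request(endpoint, names):
    """Build (but do not send) a POST request carrying {"names": [...]} as JSON."""
    payload = json.dumps({"names": list(names)}).encode("utf-8")
    return Request(
        BASE_URL + endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_batch(endpoint, names, timeout=60):
    """Send the batched request and decode the JSON reply (needs network access)."""
    with urlopen(batch_request(endpoint, names), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = batch_request(
    "unigenes/genome_bins",
    ["GMGC10.003_873_867.PHOA", "GMGC10.016_471_114.PHOA"],
)
print(req.get_method(), req.full_url)
```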

Habitats

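habitat/<habitat_name> - returns the list of samples from the specified habitat, together with download URLs for the habitat-specific subcatalogs. In the example below, the space in the habitat name is written as an underscore.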

Example

curl https://gmgc.embl.de/api/v1.0/habitat/human_gut

Output

{
  "samples": [
    "SAMEA1906426",
    "SAMEA1906424",
    "SAMEA1906422",
    (...)
  ],
  "subcatalog_url": "https://gmgc.embl.de/downloads/v1.0/GMGC10.human-gut.95nr.fna.gz",
  "name": "human gut",
  "subcatalog_no_rare_url": "https://gmgc.embl.de/downloads/v1.0/GMGC10.human-gut.no-rare.95nr.fna.gz"
}

Search

Query by sequence

query/sequence

The call is a POST request with up to 50 sequences in an attached FASTA file. The attachment should be called fasta.

For example, create test.fasta containing:

>MySeq
AALAMSALMALSJLAJLACAOSIJDAOSIJDALAASKJDASLKJALCEMALWPQRODASLKJALCKMALWPQRODASLKJ
ALCCKMALWPQRODASLKJALCKMALWPQROQUPJALSFAASLUFPASUFASFJA

and with curl:

curl -X POST \
     -F 'fasta=@test.fasta' \
     -F 'mode=all' \
     -F 'return_seqs=true' \
     -F 'return_bins=true' \
     'https://gmgc.embl.de/api/v1.0/query/sequence'
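
Since a request may carry at most 50 sequences, it can be useful to check a FASTA file before uploading it. The sketch below counts header lines. The function names are hypothetical, and how the server reacts to an oversized request is not documented here.

```python
MAX_SEQUENCES = 50  # limit stated above for query/sequence


def count_fasta_records(text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def check_query_size(text):
    """Return the number of records, or raise ValueError if the count is unusable."""
    n = count_fasta_records(text)
    if n == 0:
        raise ValueError("no FASTA records found")
    if n > MAX_SEQUENCES:
        raise ValueError(f"{n} sequences given, but at most {MAX_SEQUENCES} are allowed")
    return n


# The test.fasta content from the example above.
test_fasta = (
    ">MySeq\n"
    "AALAMSALMALSJLAJLACAOSIJDAOSIJDALAASKJDASLKJALCEMALWPQRODASLKJALCKMALWPQRODASLKJ\n"
    "ALCCKMALWPQRODASLKJALCKMALWPQROQUPJALSFAASLUFPASUFASFJA\n"
)
print(check_query_size(test_fasta))
```

Larger inputs could be split into chunks of at most 50 records and submitted one request at a time.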

Note that the algorithm will always return its best matches as hits, and it is the user’s responsibility to filter them appropriately (i.e., if no good matches exist in the catalog, the algorithm will still return something, but with a high e-value).

Parameters:

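mode: "all" or "besthit" (with "besthit", only the single best hit is returned for each query)
return_seqs: boolean; if true, the DNA and protein sequences of each hit are included
return_bins: boolean; if true, the genome bins of each hit are included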

Output

    {
        "results": [
            {
                "query_name": "Q1",
                # hits is always a list, but if the request included `mode=besthit`, this will be a list of size one.
                "hits": [
                    {
                        "unigene_id": "GMGC10.000_000_000.NAME",
                        "evalue": 12e-23,
                        "bitscore": 232.2,
                        # sequences are only provided if the request included `return_seqs=true`
                        "dna_sequence": "ATTATACAA...",
                        "protein_sequence": "MEPATA...",
                        # genome bins are only provided if the request included `return_bins=true`
                        "genome_bins": [
                            "GMBC10.001_023",
                            "GMBC10.202_232"
                        ]
                    },
                    (...)
                ]
            },
            {
                "query_name": "Q2",
                "hits": (...)
            }
        ]
    }
    }
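
Because the search always returns hits, even poor ones, results usually need filtering on the client side. The sketch below keeps only hits at or below an e-value cutoff. The cutoff of 1e-5 is an arbitrary assumption, `filter_hits` is a hypothetical name, and the sample reply is made up to follow the structure shown above (including the placeholder identifiers).

```python
def filter_hits(reply, max_evalue=1e-5):
    """Map each query name to its hits whose e-value is at most max_evalue."""
    return {
        result["query_name"]: [
            hit for hit in result["hits"] if hit["evalue"] <= max_evalue
        ]
        for result in reply["results"]
    }


# Made-up reply following the documented structure; identifiers are placeholders.
reply = {
    "results": [
        {
            "query_name": "Q1",
            "hits": [
                {"unigene_id": "GMGC10.000_000_000.NAME", "evalue": 12e-23, "bitscore": 232.2},
                {"unigene_id": "GMGC10.000_000_001.NAME", "evalue": 3.5, "bitscore": 20.1},
            ],
        },
        {
            "query_name": "Q2",
            "hits": [
                {"unigene_id": "GMGC10.000_000_002.NAME", "evalue": 8.0, "bitscore": 18.4},
            ],
        },
    ]
}

for query, hits in filter_hits(reply).items():
    print(query, [hit["unigene_id"] for hit in hits])
```

A query with no hit below the cutoff, such as Q2 here, maps to an empty list rather than being dropped, so callers can tell "no good match" apart from "query not submitted".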