Global Microbial Gene Catalog (GMGC)

The Global Microbial Gene Catalog (GMGC, currently version 1.0) is an integrated, consistently processed gene catalog of the microbial world, combining metagenomics and high-quality sequenced isolates (from the proGenomes2 database, Mende et al., 2019).

A total of 2.3 billion genes were used to build the catalog. Removing exact duplicates (100% identity) left a non-redundant catalog of 966 million sequences. Clustering these at the species level (95% nucleotide identity) produced the main GMGC catalog, which contains 302 million genes. See the Downloads page for more details and for habitat-specific subcatalogs, which may be more convenient for your use case.

Identifiers

Genes in the main catalog and subcatalogs are identified with the scheme GMGC10.###_###_###.NAME. The initial GMGC10 indicates the version of the catalog (Global Microbial Gene Catalog 1.0). The numeric ID uniquely identifies the gene (the underscores are for readability only). Finally, NAME is the predicted gene name, obtained from eggNOG-mapper (Huerta-Cepas et al., 2017).
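
The identifier scheme can be handled programmatically with a small parser. The following is a minimal sketch, not an official GMGC tool: the function names parse_gmgc_id and format_gmgc_id, the GeneId type, and the example identifier are all hypothetical, and the sketch assumes that each ### stands for exactly three digits.

```python
import re
from typing import NamedTuple

# Assumption: the scheme is GMGC10.<3 digits>_<3 digits>_<3 digits>.<NAME>
_GMGC_ID = re.compile(r"^(GMGC10)\.(\d{3}_\d{3}_\d{3})\.(.+)$")


class GeneId(NamedTuple):
    """Parsed pieces of an identifier (hypothetical helper type)."""
    version: str  # catalog version prefix, e.g. "GMGC10"
    number: int   # numeric gene ID with the underscores removed
    name: str     # predicted gene name


def parse_gmgc_id(identifier: str) -> GeneId:
    """Split an identifier into version, numeric ID, and gene name."""
    match = _GMGC_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a GMGC 1.0 identifier: {identifier!r}")
    version, digits, name = match.groups()
    # The underscores are for readability only, so drop them.
    return GeneId(version, int(digits.replace("_", "")), name)


def format_gmgc_id(number: int, name: str) -> str:
    """Rebuild the identifier string from a numeric ID and a gene name."""
    digits = f"{number:09d}"
    return f"GMGC10.{digits[:3]}_{digits[3:6]}_{digits[6:]}.{name}"


# Made-up identifier, used only to illustrate the round trip.
example = parse_gmgc_id("GMGC10.000_000_123.EXAMPLE")
print(example)
print(format_gmgc_id(example.number, example.name))
```

Storing the numeric part as an integer keeps comparisons and lookups independent of the underscore formatting.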

Searching

Search uses a k-mer-based index. Briefly, the query sequence is broken up into consecutive 7-mers, and the 100 database sequences that share the highest number of k-mers with the query are retrieved. These are aligned against the query using a fast implementation of Smith-Waterman (Zhao et al., 2013) and sorted by alignment score. This method is very efficient at recovering sequences that are similar to ones in the database, but it is less sensitive than BLAST.
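
The two-stage idea described above (k-mer prefiltering, then local alignment) can be illustrated with a toy sketch. This is not the code behind the GMGC search: it assumes that "consecutive 7-mers" means overlapping windows with a step of one, it uses a plain-Python Smith-Waterman with a linear gap penalty in place of the SSW library, and all function names, scoring parameters, and sequences are hypothetical.

```python
from collections import Counter, defaultdict

K = 7        # k-mer length, as stated in the text
TOP_N = 100  # number of candidates kept after the k-mer stage


def kmers(seq, k=K):
    """Set of overlapping k-mers in seq (assumed step of 1)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(db):
    """Map each k-mer to the names of the database sequences containing it."""
    index = defaultdict(set)
    for name, seq in db.items():
        for kmer in kmers(seq):
            index[kmer].add(name)
    return index


def candidates(query, index, top_n=TOP_N):
    """Names of the top_n sequences sharing the most k-mers with the query."""
    shared = Counter()
    for kmer in kmers(query):
        for name in index.get(kmer, ()):
            shared[name] += 1
    return [name for name, _ in shared.most_common(top_n)]


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (illustrative scoring values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def search(query, db, index, top_n=TOP_N):
    """Prefilter by shared k-mers, then rank candidates by alignment score."""
    hits = [(name, sw_score(query, db[name]))
            for name in candidates(query, index, top_n)]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy database with made-up sequences.
toy_db = {
    "seq_a": "MKTAYIAKQRQISFVKSHFSRQ",
    "seq_b": "MKTAYIAKQRAAAAAAAAAAAA",
    "seq_c": "GGGGGGGGGGGGGGGG",
}
toy_index = build_index(toy_db)
for name, score in search("MKTAYIAKQRQISFVKSHFSRQ", toy_db, toy_index):
    print(name, score)
```

The sketch also shows where the sensitivity trade-off comes from: a database sequence that shares no exact 7-mer with the query never reaches the alignment stage, however well it might align.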

References

  1. Daniel R Mende, Ivica Letunic, Oleksandr M Maistrenko, Thomas S B Schmidt, Alessio Milanese, Lucas Paoli, Ana Hernández-Plaza, Askarbek N Orakov, Sofia K Forslund, Shinichi Sunagawa, Georg Zeller, Jaime Huerta-Cepas, Luis Pedro Coelho, and Peer Bork. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. In Nucleic Acids Research, 2019. https://doi.org/10.1093/nar/gkz1002
  2. Jaime Huerta-Cepas, Kristoffer Forslund, Luis Pedro Coelho, Damian Szklarczyk, Lars Juhl Jensen, Christian von Mering, and Peer Bork. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. In Mol Biol Evol, 2017. https://doi.org/10.1093/molbev/msx148
  3. Mengyao Zhao, Wan-Ping Lee, Erik P. Garrison, and Gabor T. Marth. SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications. In PLoS One, 2013. https://doi.org/10.1371/journal.pone.0082138