The main catalog was built at the species level (95% nucleotide identity) and includes 302,655,267 unigenes. Additionally, we make available a 100% non-redundant catalog (966,108,540 genes) and a 90% amino-acid identity catalog (210,478,083 genes). Note that the 90% catalog is a subcatalog of the main one and identifiers are kept consistent.
Files can also be downloaded from our GMGC10.data git-annex repository. This is recommended only for more advanced users, but it may be a better interface in settings such as an HPC cluster (where direct web access is not available), and it provides automated data integrity checks.
The 14 habitats considered in this version of the catalog give rise to 14 different sub-catalogs. Additionally, for convenience, we provide versions which exclude rare unigenes and may be more appropriate for uses such as short read mapping.
File name | GMGC.100nr.fna.gz |
---|---|
Gene count | |
File size |
This table shows the structure of the clusters at 95% nucleotide identity. Note that this table requires ca. 300 GB to store and contains 8,533,537,889 rows.
This table contains triplets of the form `[ORF1] [REL] [ORF2]`. `[ORF2]` is always an ORF in the GMGCv1, while `[ORF1]` is one of the original ORFs. `[REL]` is one of

- `=`: the ORFs are identical
- `C`: `ORF1` is contained in `ORF2` as a substring
- `R`: `ORF1` can be represented by `ORF2` (i.e., `ORF1` is 95% identical to a substring of `ORF2`).

To understand the ORF structure, you also need to download the renaming tables.
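As an illustration of the triplet format described above, the following is a minimal sketch of how a small extract of the table could be parsed and grouped by representative ORF. It assumes one whitespace-separated triplet per line; the ORF names in the example are hypothetical. The full table has billions of rows, so grouping in memory like this is only practical on a subset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# The three relation codes described in the text
VALID_RELATIONS = {"=", "C", "R"}


def iter_triplets(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (orf1, rel, orf2) for each non-empty line.

    Assumption: fields are separated by whitespace (e.g., tabs).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        orf1, rel, orf2 = line.split()
        if rel not in VALID_RELATIONS:
            raise ValueError(f"unexpected relation {rel!r}")
        yield orf1, rel, orf2


def group_by_representative(
    lines: Iterable[str],
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each ORF2 to the list of (ORF1, REL) pairs it represents."""
    members = defaultdict(list)
    for orf1, rel, orf2 in iter_triplets(lines):
        members[orf2].append((orf1, rel))
    return dict(members)


# Hypothetical ORF names, for illustration only
example = [
    "sampleA_1\t=\tsampleB_7",
    "sampleA_2\tC\tsampleB_7",
    "Fr12_42\tR\tsampleC_3",
]
clusters = group_by_representative(example)
```

For the complete file, iterating with `iter_triplets` over a decompressed stream and writing results out incrementally is likely to be more realistic than building a dictionary.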
There are two types of ORF used for building the GMGC:

- Metagenomic ORFs. These are named according to the format `{SAMPLE}_{#}`.
- ProGenomes2 ORFs. These were renamed as `Fr12_{#}` (Freeze 12 being an internal name for ProGenomes2, as construction of GMGCv1 predated the publication of ProGenomes2). The reason for renaming is that short names were needed to keep memory usage down.
At the end of the process, the representative ORF for each cluster is assigned a `GMGC10` name (`GMGC10` being the abbreviated form of `GMGC 1.0`).
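The two original ORF naming schemes can be told apart by their prefix. The sketch below is one possible way to do so; it assumes that `{#}` is a decimal integer and that sample names may themselves contain underscores (so the name is split on the last underscore). The sample name used in the example is hypothetical.

```python
from typing import Optional, Tuple


def classify_orf_name(name: str) -> Tuple[str, Optional[str], int]:
    """Return (source, sample, index) for an original ORF name.

    Assumption: the trailing {#} is a decimal integer.
    """
    prefix, sep, index = name.rpartition("_")
    if not sep or not prefix or not index.isdigit():
        raise ValueError(f"not a recognised ORF name: {name!r}")
    if prefix == "Fr12":
        # ProGenomes2 ORFs were renamed Fr12_{#}; no sample is attached
        return ("progenomes2", None, int(index))
    # Otherwise treat the prefix as the sample name: {SAMPLE}_{#}
    return ("metagenomic", prefix, int(index))


# Hypothetical names, for illustration only
pg = classify_orf_name("Fr12_123")
mg = classify_orf_name("SAMPLE_X_45")
```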
Protein clusters were built at 90% amino acid identity and are intended to represent sequences with the same function. Broader families were also built at lower identity thresholds, namely 50%, 30%, and 20% (all subject to e-value thresholding, see the manuscript for details).
The unigene metadata table contains basic information for each unigene: its original name (which indicates which sample the representative was assembled from), the habitat of this sample, the size (in nucleotides), and whether the ORF is a complete ORF.
ORFs are annotated to the environment from which they were assembled. Unigenes aggregate multiple ORFs and can, thus, represent multiple environments. The file GMGC10.gene-environment.tsv contains the link between unigenes and environments:
- `0`: the unigene does not contain any ORF from the respective environment,
- `1`: the unigene contains at least one ORF from the respective environment,
- `2`: the unigene contains at least one ORF from the respective environment and was detected in at least 5 samples from the environment.

Genes were assigned taxonomy as described in the manuscript: Taxonomic assignments
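To make the 0/1/2 environment codes of GMGC10.gene-environment.tsv concrete, here is a minimal sketch that lists, for each unigene, the environments at or above a chosen code. The file layout is an assumption (a tab-separated file with a header row, the unigene name in the first column, and one column per environment); the unigene and environment names in the example are hypothetical.

```python
import csv
import io
from typing import Iterator, List, TextIO, Tuple


def environments_per_unigene(
    handle: TextIO, min_code: int = 1
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (unigene, environments) keeping codes >= min_code.

    min_code=1: at least one ORF from the environment.
    min_code=2: additionally detected in at least 5 samples.
    Assumed layout: header row, first column = unigene name.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    environments = header[1:]
    for row in reader:
        codes = [int(c) for c in row[1:]]
        kept = [e for e, c in zip(environments, codes) if c >= min_code]
        yield row[0], kept


# Hypothetical extract, for illustration only
sample_text = (
    "unigene\tgut\tsoil\tmarine\n"
    "unigene_A\t2\t0\t1\n"
    "unigene_B\t0\t1\t0\n"
)
present = list(environments_per_unigene(io.StringIO(sample_text)))
robust = list(environments_per_unigene(io.StringIO(sample_text), min_code=2))
```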
The GMGC was also annotated with eggnog-mapper: GMGC10.emapper2.annotations.tsv.gz (this file is NGLess-compatible).
CARD/ResFAM annotations (in NGLess-compatible format) are also available as GMGC10.card_resfam.tsv.
Per-sample metadata includes the habitat, GPS coordinates, number of basepairs, and other information on each sample. Note that some samples use an internal name in addition to the public one.
This is a large table with 6 columns, representing 4 different sparse matrices:

- `unigene`: The unigene
- `sample`: The sample
- `scaled`: The scaled value
- `raw`: The raw value
- `raw_unique`: The raw value for uniquely mapped reads only
- `normed10m`: The normed value scaled by 10 million

The meanings of `raw`, `scaled`, and `normed10m` are similar to those used in the NGLess count function, except that `normed10m` is scaled to 10 million reads. Briefly, `raw` is the number of reads mapped to the unigene (distributed using option `{dist1}` in NGLess), `scaled` takes the size of each unigene into account, and `normed10m` takes into account both the size of each unigene and the number of reads in each sample.
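Because the table stores four sparse matrices in long format, a common first step is to extract one sample's column from one of the matrices. The following is a minimal sketch that streams the rows rather than loading the whole file. It assumes a tab-separated file with a header row using the six column names listed above; the unigene and sample names in the example are hypothetical.

```python
import csv
import io
from typing import Dict, TextIO

VALUE_COLUMNS = ("scaled", "raw", "raw_unique", "normed10m")


def sample_profile(
    handle: TextIO, sample: str, value: str = "normed10m"
) -> Dict[str, float]:
    """Collect {unigene: value} for one sample, reading row by row.

    Assumption: the header row names the columns unigene, sample,
    scaled, raw, raw_unique, normed10m.
    """
    if value not in VALUE_COLUMNS:
        raise ValueError(f"unknown value column {value!r}")
    profile = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["sample"] == sample:
            profile[row["unigene"]] = float(row[value])
    return profile


# Hypothetical extract, for illustration only
table_text = (
    "unigene\tsample\tscaled\traw\traw_unique\tnormed10m\n"
    "u1\ts1\t3.5\t4\t2\t10.0\n"
    "u2\ts1\t1.0\t1\t1\t2.5\n"
    "u1\ts2\t0.5\t1\t0\t1.25\n"
)
s1_normed = sample_profile(io.StringIO(table_text), "s1")
s2_raw = sample_profile(io.StringIO(table_text), "s2", value="raw")
```

Unigenes absent from the returned dictionary have no entry for that sample in the sparse representation. For the real file, the handle would typically be a decompressing text stream (for example, one opened with `gzip.open(path, "rt")` if the file is gzip-compressed).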
Warning: Note that this table file is 396GB (compressed) and contains 35,790,210,719 rows. Make sure you have the resources to handle it prior to downloading it.
Genome bins have IDs of the form `GMBC10.NNN_NNN`. They are ordered by decreasing quality (defined as `completeness - 5 * contamination`, with both metrics estimated using CheckM), so that the first bin (`GMBC10.000_000`) is the highest quality genome.
We also provide a table with estimated quality measures from each bin: GMBC10.meta.tsv. Bins have been classified into three categories:
- `high-quality`: >90% complete & <5% contamination,
- `medium-quality`: >50% complete & <10% contamination,
- `low-quality`: all other bins.

As unigenes can represent multiple ORFs, a unigene may be present in multiple genome bins (and not all unigenes are in a genome bin). The file GMGC10.GMBC10.tsv associates unigenes with genome bins.
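The quality score and the three quality categories for genome bins can be written down directly from their definitions. The sketch below assumes that completeness and contamination are expressed as percentages (0 to 100); the function names are illustrative and not part of any GMGC tool.

```python
def quality_score(completeness: float, contamination: float) -> float:
    """Quality used for ordering bins: completeness - 5 * contamination."""
    return completeness - 5 * contamination


def quality_category(completeness: float, contamination: float) -> str:
    """Classify a bin using the thresholds given in the text.

    Assumption: both inputs are percentages in the range 0 to 100.
    """
    if completeness > 90 and contamination < 5:
        return "high-quality"
    if completeness > 50 and contamination < 10:
        return "medium-quality"
    return "low-quality"


# Illustrative values only
score = quality_score(95, 2)
categories = [
    quality_category(95, 2),
    quality_category(95, 7),
    quality_category(40, 1),
]
```

Note that the inequalities are strict, so a bin that is exactly 90% complete falls into the medium-quality category under this reading.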
Individual genome FASTA files can be downloaded through the API, e.g., https://gmgc.embl.de/api/v1.0/genome_bin/GMBC10.001_023/fasta.
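For scripted downloads, the URL can be assembled from a bin ID following the pattern of the example above. This is a minimal sketch; it assumes that each `N` in `GMBC10.NNN_NNN` is a decimal digit, and the helper name is hypothetical.

```python
import re

# URL pattern taken from the example in the text
API_TEMPLATE = "https://gmgc.embl.de/api/v1.0/genome_bin/{bin_id}/fasta"

# Assumption: each N in GMBC10.NNN_NNN is a single decimal digit
BIN_ID_RE = re.compile(r"GMBC10\.\d{3}_\d{3}")


def genome_bin_fasta_url(bin_id: str) -> str:
    """Build the FASTA download URL for a genome bin ID."""
    if not BIN_ID_RE.fullmatch(bin_id):
        raise ValueError(f"not a valid genome bin ID: {bin_id!r}")
    return API_TEMPLATE.format(bin_id=bin_id)


url = genome_bin_fasta_url("GMBC10.001_023")
```

The resulting URL can then be fetched with any HTTP client, for example `urllib.request.urlretrieve(url, "GMBC10.001_023.fna")` from the Python standard library (the output file name here is arbitrary).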