Bioinformatics. 2019 May 1;35(9):1544-1552.
doi: 10.1093/bioinformatics/bty830.

MSPminer: abundance-based reconstitution of microbial pan-genomes from shotgun metagenomic data

Florian Plaza Oñate et al. Bioinformatics.

Abstract

Motivation: Analysis toolkits for shotgun metagenomic data achieve strain-level characterization of complex microbial communities by capturing intra-species gene content variation. Yet these tools are limited by the set of available reference genomes, which is far from covering all microbial variability: many species have not yet been sequenced or have only a few strains available. Binning co-abundant genes obtained from de novo assembly is a powerful reference-free technique to discover and reconstitute the gene repertoire of microbial species. While current methods accurately identify the core parts of species, they miss many accessory genes or split them into small gene groups that remain unassociated with core clusters.

Results: We introduce MSPminer, a computationally efficient software tool that reconstitutes Metagenomic Species Pan-genomes (MSPs) by binning co-abundant genes across metagenomic samples. MSPminer relies on a new robust measure of proportionality coupled with an empirical classifier to group and distinguish not only species core genes but also accessory genes. Applied to a large-scale metagenomic dataset, MSPminer successfully delineates in a few hours the gene repertoires of 1661 microbial species with similar specificity and higher sensitivity than existing tools. The taxonomic annotation of MSPs reveals hitherto unknown microorganisms and brings coherence to the nomenclature of the species of the human gut microbiota. The provided MSPs can be readily used for taxonomic profiling and biomarker discovery in human gut metagenomic samples. In addition, MSPminer can be applied to gene count tables from other ecosystems to perform similar analyses.

Availability and implementation: The binary is freely available for non-commercial users at www.enterome.com/downloads.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Simplified model illustrating the rationale behind the method. Of the six samples, all except the fourth carry a strain of a microbial species, represented by a circle. The absolute abundance of each strain is indicated at the bottom right. Core genes (red, blue, yellow) are present in all the strains while accessory genes (green, purple) are found only in some. In addition, the yellow gene is tagged as shared because it is observed in sample 4, which does not contain the species. After shotgun sequencing, core genes yield directly proportional mapped read counts across samples, the proportionality coefficient being roughly equal to the ratio of their lengths. In contrast, such a relationship between a core and an accessory gene is observed only in the subset of samples where the accessory gene is present.
Fig. 2.
Method for comparing gene count profiles and classifying genes in MSPs. The counts of a gene (g2) are compared across metagenomic samples to the counts of the core seed (g1) with which it is associated. The coefficient of proportionality α between g1 and g2 is estimated to be 0.75. The solid line of slope α corresponds to the expected counts. Dashed lines represent the gene quantification thresholds before and after adjustment according to α. Black and grey crosses are structural and undetermined zeros, respectively. Only structural zeros are taken into account to assign g2 to a given class (cf. braces). Black and grey points are inlier and outlier samples, respectively. The distance between the single outlier and the expected proportional count corresponds to the residual rs.
Fig. 3.
MSPminer workflow.
Fig. 4.
Evaluation of the measures of proportionality. (A) Comparison of Pearson's correlation coefficient, Spearman's correlation coefficient and the proposed measure of proportionality to detect an association between the median abundance vector of the core genes of the simulated species and the abundance vectors of each of its genes. The x-axis corresponds to the percentage of samples where a gene is detected and the y-axis corresponds to the intensity of the relationship between the compared vectors. The closer the value is to 1, the stronger the relationship. (B) Comparison of the performance of the robust (black) and the non-robust (grey) measures of proportionality to detect a relationship between the noisy abundance vector of each gene of the simulated species and the outlier-free median abundance vector of its core genes. The proportion of outliers is gradually increased to 5%, 10% and 20%.
Fig. 5.
Evaluation of the clustering algorithm. (A) Impact of the number of samples where the simulated species is detected on clustering. (B) Impact of strain mixture on clustering.
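
The rationale in Figs. 1 and 2 can be illustrated with a small, noise-free sketch. It is not MSPminer's actual measure of proportionality or classifier, whose details are not given here. It assumes that read counts equal strain abundance times gene length, estimates α as the median of per-sample count ratios, and flags a sample as an outlier when its relative deviation from α × seed count exceeds a threshold. The function names, the `tolerance` parameter, the gene lengths and the abundances are all hypothetical, chosen for illustration only.

```python
from statistics import median


def expected_counts(abundances, gene_length, carriers=None):
    """Noise-free counts: abundance x gene length, zero where the gene is absent.

    Assumption for illustration: counts scale linearly with abundance and length.
    """
    if carriers is None:
        carriers = [True] * len(abundances)
    return [a * gene_length if c else 0 for a, c in zip(abundances, carriers)]


def estimate_alpha(seed, gene):
    """Median of per-sample ratios gene/seed over samples where both are detected.

    The median is a simple robust stand-in, not the estimator used by MSPminer.
    """
    ratios = [g / s for s, g in zip(seed, gene) if s > 0 and g > 0]
    return median(ratios) if ratios else None


def flag_outliers(seed, gene, alpha, tolerance=0.5):
    """Flag samples whose count deviates from alpha * seed by more than
    `tolerance` in relative terms (hypothetical threshold)."""
    flags = []
    for s, g in zip(seed, gene):
        if s > 0 and g > 0:
            expected = alpha * s
            flags.append(abs(g - expected) / expected > tolerance)
        else:
            flags.append(False)
    return flags


# Six samples; the fourth does not carry the species (abundance 0), as in Fig. 1.
abundances = [10, 4, 7, 0, 2, 5]

seed = expected_counts(abundances, gene_length=1000)   # core seed g1
core = expected_counts(abundances, gene_length=750)    # core gene g2
core[1] = 9000                                          # inject one aberrant count
accessory = expected_counts(
    abundances, gene_length=500,
    carriers=[True, False, True, False, True, False],   # present in some strains only
)

alpha_core = estimate_alpha(seed, core)
alpha_accessory = estimate_alpha(seed, accessory)
core_outliers = flag_outliers(seed, core, alpha_core)

print("alpha (core gene):", alpha_core)
print("alpha (accessory gene):", alpha_accessory)
print("outlier samples (core gene):", core_outliers)
```

In this toy setting the estimated coefficient equals the ratio of gene lengths despite the injected aberrant count, and the accessory gene follows the same kind of relationship only in the samples where it is present.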
