Predicting functions of uncharacterized gene products from microbial communities

Zhang, Yancong; Bhosle, Amrisha; Bae, Sena; Eckenrode, Kelly; Huang, Xueying; Tang, Jingjing; Lavrentovich, Danylo; Awad, Lana; Hua, Ji; Wang, Ya; Morgan, Xochitl C.; Li, Bin; Krueger, Andy; Garrett, Wendy S.; Franzosa, Eric A.; Huttenhower, Curtis

doi:10.1038/s41587-025-02813-7

Download PDF

Article
Open access
Published: 15 October 2025

Predicting functions of uncharacterized gene products from microbial communities

Yancong ZhangÂ ORCID: orcid.org/0000-0002-2768-2975^1,2,3,4,
Amrisha Bhosle^2,3,4,
Sena Bae^3,4,5,
Kelly Eckenrode^3,4,
Xueying Huang^3,4,
Jingjing TangÂ ORCID: orcid.org/0000-0003-1447-5018^3,4,
Danylo LavrentovichÂ ORCID: orcid.org/0000-0002-8432-9596⁶,
Lana Awad³,
Ji Hua³,
Ya Wang^3,4,
Xochitl C. Morgan^3,4,
Bin Li⁷,
Andy Krueger⁷,
Wendy S. GarrettÂ ORCID: orcid.org/0000-0002-5092-0150^2,4,5,8,
Eric A. FranzosaÂ ORCID: orcid.org/0000-0002-8798-7068^2,3,4^Â na1 &
â€¦
Curtis HuttenhowerÂ ORCID: orcid.org/0000-0002-1110-0096^2,3,4,5^Â na1Â

Nature Biotechnology (2025)Cite this article

11k Accesses
54 Altmetric
Metrics details

Subjects

Abstract

The majority of genes in microbial communities remain uncharacterized. Here we develop a method to infer putative function for microbial proteins at scale by assessing community-wide multiomics data. We predict high-confidence functions for >443,000 protein families (~82.3% previously uncharacterized), including >27,000 protein families with weak homology to known proteins and >6,000 protein families without homology. These were drawn from 1,595 gut metagenomes and 800 metatranscriptomes from the Integrative Human Microbiome Project (HMP2/iHMP). Integrating additional information such as sequence similarity, genomic proximity and domainâ€“domain interactions improves performance of the method. Our methodâ€™s implementation, FUGAsseM, is generalizable and predicts protein function in both well-studied and undercharacterized communities. FUGAsseM achieves similar levels of accuracy in the context of microbial communities when compared to state-of-the-art approaches designed for application to single organisms while simultaneously providing much greater breadth of coverage. This initial study expands the functional landscape of the human gut microbiome and allows for exploration of microbial proteins in undercharacterized communities.

Lineage-specific microbial protein prediction enables large-scale exploration of protein ecology within the human gut

Article Open access 03 April 2025

Functional host-specific adaptation of the intestinal microbiome in hominids

Article Open access 06 January 2024

The standardisation of the approach to metagenomic human gut analysis: from sample collection to microbiome profiling

Article Open access 19 May 2022

Main

Microbial proteins represent the entire span of enzymatic, structural and other molecular functions (MFs) necessary across the tree of life, such as metabolizing host-inaccessible dietary components¹, producing immunomodulatory small molecules² and driving biogeochemical cycling^3,4,5. However, even within the human gut microbiomeâ€”arguably the best-characterized human microbial community habitatâ€”up to 70% of proteins are uncharacterized^6,7. The environmental and health relevance of this functional â€˜dark matterâ€™ combined with its tremendous scale (for example, several million genes across the diversity of human gut microbiomes^8,9) necessitates a scalable approach for predicting protein functions in microbial communities^10,11,12. Although culture-based and other in vitro techniques provide the gold standard for functional insights^13,14, it is challenging to rapidly characterize most microorganisms using cultivation in settings similar to their natural environments, let alone in a sufficiently high-throughput manner¹⁵. Similarly, experimental characterization of proteins is undeniably a labor-intensive, time-consuming and expensive process that cannot approach the scale of thousands of new protein families per year¹⁶.

Computational strategies for protein function prediction have a long history, especially in microbial isolate genomes^17,18,19, and they have gained popularity with the exponential growth of sequence data^20,21,22. Such methods typically integrate multiple data types from individual organisms, including sequence similarity^23,24,25,26, genomic context^27,28, evolutionary relationships^29,30, protein structure³¹, proteinâ€“protein interactions^32,33,34, gene expression^35,36 or an ensemble of these data types^37,38,39. Direct application of single-organism methods to communities is greatly limited by the prevalence of novel sequences lacking sequence similarity to known proteins, as well as the gap between shotgun metagenomes (which provide only DNA) and other functional assays. This limits the knowledge of genomic context, protein structures, proteinâ€“protein interactions or expression in most communities, making systematic prediction of protein functions for microbial communities challenging.

Given the success of high-throughput sequencing technologies for microbial community characterization, metatranscriptomics (MTX) is becoming an increasingly practical way to profile whole-community RNA to provide microbial functional information in situ⁴⁰. Considering that transcriptional profiles of microorganisms frequently differ between isolate culture and natural community settings⁴¹, MTX provides a crucial way to link genetic potential (gene carriage and abundance) to functional activity in human^42,43,44 and other environmental microbiomes^5,45,46. Genes with similar functions tend to be coexpressed^47,48, indicating that they are active in the same biological processes (BPs). Therefore, coexpression patterns have been leveraged for protein function prediction in single-organism settings^36,49,50. Recently, this approach was applied to infer functions of unknown genes in marine microbial communities⁵. Aside from this exploratory study, however, which did not further develop or investigate the methodology itself, automated function prediction for microbial communities has been surprisingly unexplored.

To address the dramatic undercharacterization of microbial community gene products, we developed a method to systematically predict protein function in microbiomes by leveraging coexpression patterns from metatranscriptomes, along with other types of community-wide evidence such as genomic proximity, sequence similarity and domainâ€“domain interactions of proteins assembled from microbial communities. We use a two-layered random forest (RF) classifier system to assign target proteins to functions through â€˜guilt-by-associationâ€™ learning functional association networks among microbial proteins. For a given function, an individual RF classifier is trained for each type of association evidence to assign unannotated proteins to that function on the basis of their associations with annotated proteins, forming the first layer of machine learners. In the second layer, an ensemble RF classifier integrates the per-evidence prediction confidence scores from the first layer to produce a single combined confidence score, adjusting evidence weighting per function to capture biological trends and enhance prediction accuracy. The resulting method yields comparably or more accurate predictions than state-of-art single-organism methods while also achieving far greater sensitivity among diverse organisms found only in community data. When applying our method to the Integrative Human Microbiome Project (HMP2/iHMP)⁴⁴, we annotated >443,000 previously uncharacterized protein families with Gene Ontology (GO)⁵¹ terms (including >33,000 novel protein families that lack notable sequence homology to known proteins) with high-confidence function predictions. In this process, both well-studied and less studied gut taxa were further characterized with increased functional diversity. An open-source implementation of this method (FUGAsseM, function predictor of uncharacterized gene products by assessing high-dimensional community data in microbiomes)⁵², along with documentation, is available online (http://huttenhower.sph.harvard.edu/fugassem). Our work predicts functions of uncharacterized gene products in microbial communities and provides methodology supporting future microbial community-based function prediction.

Results

MTX-based coexpression patterns capture comprehensive functional activity in microbial communities

To first assess the relationship between protein function and gene expression in the human microbiome, we screened 800 metatranscriptomes from the HMP2 inflammatory bowel disease (IBD) multiomics database (IBDMDB), which spanned a total of 109 participants with Crohnâ€™s disease (nâ€‰=â€‰52), with ulcerative colitis (nâ€‰=â€‰30) and without IBD (nâ€‰=â€‰27), followed longitudinally for up to 1â€‰year⁴⁴. Specifically, we quantified the expression of the protein families previously profiled from 1,595 metagenomes by MetaWIBELE⁷â€”a tool that predicts bioactivity from metagenomically assembled protein families (Methods). A total of 582,744 protein families detected in the metatranscriptomes were contributed by 336 species with at least 500 protein families per species (Fig. 1a, Supplementary Table 1 and Extended Data Fig. 1a).

**Fig. 1: Metatranscriptomes encode functional relationships and can be used for predicting functions of uncharacterized proteins in microbial communities.**

To examine the characterization level of these protein families, we generated a set of â€˜informativeâ€™ (refs. ^53,54) BP terms (the largest and most diverse ontology aspect in GO⁵¹) inspired by single-organism methods⁵⁵. To be included in this set, annotations from child terms were propagated to all ancestor terms within the GO directed acyclic graph (DAG). For a given species, an informative BP term is defined as a term that contains at least a specified number of annotated proteins (taking DAG inheritance into account) while none of its child terms individually reach this threshold (Extended Data Fig. 1g). While the concept is not reliant on a specific threshold number k, we used kâ€‰=â€‰20 throughout this work. After applying this criterion to GO protein family annotations within species, the resulting species-specific informative terms were subsequently combined by union to create a single nonredundant set of informative terms across species, ensuring that no term was a parent term of others within the set. In other words, if a term X met the kâ€‰=â€‰20 threshold for informativeness in at least one species and was not a parent term of another such term Y identified in any species, it was included in the set of informative terms used in this study. All the informative terms used in this study are organized in Supplementary Table 2. According to the novelty categories defined in our previous publication⁷, we classified the aforementioned MTX protein families into â€˜SCâ€™ (strong homology to characterized UniProtKB⁵⁶ proteins with informative BP terms), â€˜SNIâ€™ (strong homology to characterized UniProtKB proteins with noninformative BP terms), â€˜SUâ€™ (strong homology to uncharacterized UniProtKB proteins without any BP terms), â€˜UPIâ€™ (strong homology to uncharacterized UniParc proteins), â€˜RHâ€™ (remote homology to UniProt proteins) and â€˜NHâ€™ (no homology to UniProt proteins) (Methods). SC comprised 83,280 families (14.3% of the total). Conversely, 499,464 families (85.7% of the total) were functionally uncharacterized, including 11.9% SNI, 60.5% SU, 3.6% UPI, 8.0% RH and 1.7% NH.

As a baseline for investigation, we began with pangenomes of common species in the human microbiome. Despite the degree to which these organisms have been well studied as isolates, their pangenomes in typical communities remain predominantly undercharacterized. Overall, 60.5% of the HMP2 protein families were not annotated with any BP terms (that is, SU) in UniProtKB. To further characterize the SU protein families, we included the annotations of MF and cellular component (CC) terms. We stratified SU into â€˜SU_MFâ€™ (SU families with MF term annotations in UniProtKB), â€˜SU_CCâ€™ (SU families with only CC term annotations in UniProtKB) and â€˜SU_nonGOâ€™ (SU families without any GO annotations in UniProtKB). Of the 352,527 SU families, 123,921 and 51,183 families were labeled as SU_MF and SU_CC respectively, while the remaining 77,423 families lacked any GO annotations (Extended Data Fig. 1a). Even in the well-characterized Escherichia coli pangenome, only 37.6% protein families were annotated with BP terms and 24.9% were not annotated with any GO terms (Extended Data Fig. 1b). As a positive control, the protein families of E.â€‰coli K-12 strains were well annotated, including 64.6% families with BP terms, 24.6% with either MF or CC terms and only 10.8% without GO annotations (Extended Data Fig. 1c). The predominance of uncharacterized proteins even in the pangenomes of well-studied microorganisms highlights the need for expanded function prediction in microbial communities.

Next, as a first step in assigning putative function predictions, we investigated the degree to which MTX is functionally informative in isolation. Notably, in single organisms, proteins in the same metabolic or regulatory pathway tend to be coexpressed; hence, coexpression indicates functional relatedness and can be used to predict pathway comembership^36,49,50. In microbial communities, MTX are to date the most widely available whole-community transcriptional data encoding potential functional relatedness. We began with simple MTX coexpression networks (that is, Pearsonâ€™s correlation computed over a pair of proteinsâ€™ expression values across samples), for which, as expected, both characterized and uncharacterized proteins showed coexpression patterns with strong connection (Râ€‰>â€‰0.5) (Extended Data Fig. 1d). Many uncharacterized protein families were highly correlated with characterized families, where their correlations were comparable to correlations within characterized families (Extended Data Fig. 1e). To further distinguish the legitimate lack of coexpression cases from those where proteins were not well expressed (for example, low prevalence), we defined a set of â€˜well-expressedâ€™ proteins (that is, detected in at least 10% of total MTX samples) and measured how the uncharacterized proteins are close to characterized neighbors among these well-expressed proteins in MTX coexpression networks (Methods). Among the top 25 species with the most novel (that is, RH and NH) proteins (Fig. 1b), many well-expressed uncharacterized proteins were strongly correlated with the characterized proteins at the transcriptional level (Fig. 1c and Supplementary Table 3), suggesting their functional relatedness in the gut.

We next sought to compare MTX-based coexpression with isolate-based coexpression using isolate data for HMP2 species from the STRING database⁵⁷. In total, 12 of the top 25 species with the greatest number of novel proteins were available in STRING, in agreement with the observation that a large fraction of species from microbial communities lack corresponding reference isolates (Supplementary Table 4). In most species (11 of 12), STRING-linked proteins showed significantly stronger correlations than STRING-unlinked proteins in MTX-based networks (Mannâ€“Whitney test on difference between average correlation values, Pâ€‰<â€‰0.05) (Fig. 1d). Moreover, the coexpression similarity (quantified by Pearsonâ€™s correlation) between MTX and STRING networks was itself significantly correlated with the reference representation of HMP2 species (Pearsonâ€™s correlation, Râ€‰=â€‰0.6, Pâ€‰=â€‰0.031) (Extended Data Fig. 1f). That is, the better characterized a species is, the more similar its MTX-based and STRING-based coexpression networks are. However, many more coexpression relationships were captured by MTX than by isolates; only E.â€‰coli and Pseudomonas aeruginosa contained direct expression data in STRING (version 11.5), which were then transferred by homology to 148 of 336 HMP2 species. These findings highlight the abundance of coexpressed yet functionally uncharacterized proteins in gut metatranscriptomes.

On the basis of these initial assessments, we developed a methodâ€”FUGAsseMâ€”to systematically predict functions of unknown proteins in the context of microbial communities (Fig. 1e). FUGAsseM is built on an integrative machine learning framework directly analogous to those used for single-organism function prediction tasks. It integrates multiple types of community-based data such as coexpression patterns from MTX, proximity in metagenomic assemblies (that is, genes occurring near each other in a contig), sequence similarity of proteins (that is, modeled coarsely as comembership in the same UniRef50 cluster⁵⁸) and predicted domainâ€“domain interactions⁵⁹ (Methods). For each function of interest, FUGAsseM first builds an individual RF classifier for each data type in the first layer, which has the effect of mapping individual assay measurement values to its confidence level for functional relatedness, resulting in a prediction score indicating the likelihood of a protein familyâ€™s functional annotation based on its specific feature set. Next, it builds an ensemble classifier that combines RF predictions from all data types to provide final confidence of each protein family for a given function (Extended Data Fig. 1h). This process evaluates the relative contribution of each data type in predicting functional associations per gene per function, determining their overall informativeness in assigning functional annotations. FUGAsseM is designed for flexible annotation of gene sets from any source, as long as sufficient training data are available, making it broadly applicable across diverse functional categories such as GO terms⁵¹, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways⁶⁰ or MetaCyc modules⁶¹. In this study, we focused on GO annotations as a proof-of-concept application, leveraging the extensive resources available for GO to demonstrate FUGAsseMâ€™s predictive performance.

FUGAsseM accurately predicts functions of uncharacterized proteins from microbial communities

To assess FUGAsseMâ€™s prediction accuracy, we first compared its community-based predictions to those based on existing isolate data using a cross-validation approach. To this end, we trained FUGAsseM using isolate-based network data (that is, coexpression and integrated network data for isolates) from STRING (Methods). FUGAsseMâ€™s performance using MTX coexpression alone (FUGAsseM-MTX) was comparable to STRINGâ€™s isolate coexpression (Fig. 2a, Extended Data Fig. 2a,c and Supplementary Table 5), with the added advantage of much broader applicability to any organism detected in a community (Pearsonâ€™s correlation for BP terms: Râ€‰=â€‰0.43, Pâ€‰<â€‰0.0001). FUGAsseMâ€™s performance was further improved by adding genomic proximity within metagenomically assembled contigs, sequence similarity and domainâ€“domain interactions (that is, FUGAsseM-full), which also significantly correlated with predictions from STRINGâ€™s integrated data (Fig. 2a, Extended Data Fig. 2a,c and Supplementary Table 6). This also held true for species-wise comparisons across diverse GO terms but, again, applicable to any organism detected metaomically (Fig. 2b, Extended Data Fig. 2b,d and Supplementary Tables 7 and 8). Thus, FUGAsseM performed comparably to state-of-the-art data integration approaches but with the ability to infer functions for many more species directly from whole-community profiles.

**Fig. 2: FUGAsseM predicts BP terms of microbial communities with high accuracy.**

FUGAsseMâ€™s accuracy was also comparable to that of other state-of-the-art methods for function prediction in single organisms (Fig. 2c). Two approaches that use sequence information with outstanding performance in CAFA3 (ref. ²²) were selected for benchmarking: NetGO2.0 (ref. ⁶²) and DeepGOPlus⁶³. To perform the comparison, we applied DeepGOPlus with default parameter settings to the top ten most abundant species among those processed by FUGAsseM and NetGO2.0 to five species (a limitation of its web-based interface; Methods). FUGAsseM accurately predicted existing annotations with comparable performance to NetGO2.0, DeepGOPlus and STRING (Fig. 2c, Extended Data Fig. 3a and Supplementary Table 9). FUGAsseM-MTX achieved an average area under the receiver operating characteristic curve (AUROC) of 0.71 for informative BP term prediction. Strikingly, its AUROC was improved to 0.95 by aggregating other community-wide data (that is, FUGAsseM-full). Additionally, given the great advances in the fieldâ€™s capacity for protein structure prediction^64,65,66,67, we compared FUGAsseM to recent approaches based on structure homology when available (Methods). The results showed that FUGAsseM-MTX was comparable to predictions based on structural homology (while using only MTX covariation) and FUGAsseM-full still greatly outperformed both methods (Fig. 2c). This was one of the first demonstrations of synergy between multiple data types and multiple organisms within a community, also evidenced in additional evaluations below (Fig. 2c). Notably, FUGAsseM also remained accurate when predicting MF and CC terms (Extended Data Figs. 2e,f and 3b,c and Supplementary Table 9).

FUGAsseM further maintained its accuracy when predicting completely new annotations, using a temporally held-out annotation set analogous to that in CAFAâ€™s evaluations^20,21,22 (Fig. 2d,e). Methods relying on homology for prediction may be susceptible to circularity-induced overperformance (for example, protein X from gold standard is annotated to term Y because of homology to protein Z, while predictor also assigns X to Y because of homology to Z). Thus, we designed a temporally held-out evaluation inspired by CAFA, using UniProt⁵⁶ annotations of HMP2 proteins lacking experimental evidence at the first time point (T₀: release 2019_01, used by FUGAsseM for training) and adding experimental evidence by the second time point (T₁: release 2022_01) (Methods). We trained FUGAsseM using HMP2 SC annotations available at T₀, corresponding with GO in UniProt (release 2019_01). We then assessed its accuracy in predicting new annotations that were experimentally verified between then and T₁ (release 2022_01), including â€˜accumulated evidenceâ€˜ that gained new experimental validation at T₁ and â€˜new evidenceâ€™ that was unseen at T₀ and newly added to the database with experimental validation at T₁.

The only organism with sufficiently many new protein annotations proved to be E.â€‰coli; however, for this taxon, FUGAsseMâ€™s MTX model and full model achieved high performance with an averaged AUROC of 0.80 (Fig. 2d, Extended Data Fig. 2g,h and Supplementary Table 10). The enrichment of newly experimentally validated annotations among high-confidence predictions was also highly statistically significant (gene set enrichment analysis (GSEA), qâ€‰<â€‰0.002; Fig. 2e and Extended Data Fig. 2i,j). Thus, FUGAsseM continues to perform well even when assessed using temporally held-out, newly experimentally validated function annotations. Similarly, we assessed the performance of FUGAsseM with STRING-based isolate data using this temporal hold-out approach (Methods). Strikingly, the FUGAsseM-MTX model outperformed STRINGâ€™s isolate-based coexpression for BP predictions (Extended Data Fig. 3d). Meanwhile, the FUGAsseM-full model closely matched STRINGâ€™s integrated predictions while excelling in identifying novel annotations with experimental support absent from the training data (Extended Data Fig. 3e).

To further assess the robustness of FUGAsseM and minimize confounding from the potential circularity of homology-based annotations, we evaluated its performance using only experimentally confirmed annotations from our gold-standard set (Methods). The FUGAsseM-full model consistently demonstrated high predictive accuracy across both experimentally validated and other annotations, indicating its effectiveness in mitigating these potential biases (Extended Data Fig. 4a). Furthermore, the FUGAsseM-MTX model exhibited much higher performance for experimentally validated annotations compared to those lacking experimental support, reinforcing the reliability of MTX-derived functional predictions (Extended Data Fig. 4b). This expanded analysis confirms that FUGAsseMâ€™s performance advantages persist independently of homology effects.

MTX-based coexpression contributes substantially to FUGAsseM predictions

We further explored the contributional importance of each data type integrated in the FUGAsseM-full model. To this end, we assessed the importance scores learned by the second-layer RF for accurate GO terms (that is, successful models), where the latter were defined as the models that resulted in high-confidence (that is, prediction probabilityâ€‰â‰¥â€‰0.75) predictions of GO annotations. Both MTX-based coexpression and sequence similarity achieved average importance scores over 0.28, exceeding that of the third most important data type (genomic proximity, 0.11) (Fig. 3aâ€“c and Supplementary Table 11). Their importance was also much higher than proximity (proteins assembled in the same contig) and predicted domainâ€“domain interactions, highlighting the important role of MTX-based coexpression in function prediction.

**Fig. 3: MTX-based coexpression contributes substantially to FUGAsseM-full protein function predictions.**

MTX-based coexpression maintained a large contribution to predicting annotations with new experimental evidence over time, drawing from the corresponding evaluation above (Fig. 3dâ€“f and Supplementary Table 12). To avoid potential circularity for predicting annotations that were made previously on the basis of the same type of evidence, we examined the importance of data types for predicting new annotations excluded from the FUGAsseMâ€™s training models. Strikingly, MTX-based coexpression achieved a notable contribution (average importance scoresâ€‰â‰¥â€‰0.42) to predicting new BP annotations even when these annotations were made using other data types (for example, sequence similarity of 0.39) (Fig. 3d). Given that MF terms are often carried out by the action of a single macromolecule, sequence similarity remained a high contributor to predictions of MF terms (Fig. 3e), consistent with previous findings²¹. This provides an interesting demonstration that MTX coexpression can be informative for function assignment in a manner directly analogous to that of transcriptional coexpression within single organisms under controlled conditions³⁶.

Moreover, we conducted additional analyses using experimentally validated annotations from our gold-standard set, effectively reducing potential confounding from homology-based annotation transfers (Methods). Notably, MTX-based coexpression features contributed more prominently to predicting annotations with experimental validation in the integrated model compared to other types of data (Extended Data Fig. 4c,d). This suggests that incorporating MTX-derived coexpression features enhances functional inference, reinforcing their value in improving annotation accuracy and reliability.

Furthermore, we performed an ablation study to quantify the contribution of each evidence type to the prediction of FUGAsseM-full models, systematically training FUGAsseM models while excluding one data type at a time (Methods). To minimize potential homology-based circularity, we used a temporal hold-out validation approach, ensuring that predictions were derived from independent multiomics evidence rather than propagated database annotations. This analysis revealed that MTX-based coexpression remained a key predictive feature across all GO aspects, with particularly strong contributions to BP annotations (Extended Data Fig. 5). These findings underscore the critical role of MTX-derived signals in capturing functional relationships within microbial communities.

Predicted high-confidence functional annotations refine the functional landscape of the human microbiome

Because the preceding evaluations provided high confidence in FUGAsseMâ€™s accuracy, we next applied it to predict community-encoded functions (that is, informative GO annotations) of the human gut microbiome. We integrated coexpression patterns from 800 HMP2 metatranscriptomes and proximity patterns from assemblies of 1,595 HMP2 metagenomes, computed sequence similarities modeled coarsely as comembership within UniRef50 and drew domainâ€“domain interactions from our previous published results⁷ for the HMP2 protein families (Methods). To determine whether a target protein is confidently assigned to a function, we defined a cutoff for new predictions with high confidence. These were optimized per term and taxon by maximizing expected F₁ score (Extended Data Fig. 6a). Predictions with known annotations in UniProt tended to achieve higher confidence than those without (Extended Data Fig. 6b), acting as positive controls. Given these, we used prediction probability 0.75 as a uniform threshold for all models to define high-confidence predictions (termed the â€˜defaultâ€™ threshold) (Extended Data Fig. 6c). This retained true positives with high prediction confidence while including new predictions. Next, we defined prediction probability 0.85 as a â€˜stringentâ€™ threshold for higher precision, maintaining a sufficient true positive rate but at an especially low false positive rate.

Among the 546,251 protein families from the HMP2 species processed by FUGAsseM, we assigned high-confidence (with the default cutoff) predictions to 443,549 families, including 267,944 protein families with GO BP annotations (that is, 49.1% of the total), 364,652 with MF annotations (that is, 66.8% of the total) and 120,134 with CC annotations (that is, 22.0% of the total) (Extended Data Fig. 7aâ€“c). In total, 364,965 of these were previously uncharacterized (that is, 82.3% of all predictions), including 33,912 novel protein families (that is, RH and NH with homology at <90% identify or 80% coverage to any proteins in UniProt; 20,456 predicted with BP annotations, 24,438 predicted with MF annotations and 9,608 predicted with CC annotations) with comparable prediction confidence to previously annotated proteins. Notably, quantitative analysis showed that FUGAsseM substantially captured functions that were also significantly enriched among protein families previously prioritized as bioactive in IBD⁷ (Extended Data Fig. 7h,i). This result showcases FUGAsseMâ€™s ability to assign putative functions to uncharacterized yet phenotypically important proteins in IBD.

High-confidence predicted annotations expanded our understanding of microbial protein functions from both well-studied and less studied species in the human gut (Fig. 4 and Extended Data Fig. 7dâ€“g). Focusing on the top 25 species with the greatest number of novel (that is, NH and RH) proteins, on average, the fraction of protein families annotated with BP terms per taxon expanded from 12.1% to 57.4% (4.7-fold increase) on the basis of the default threshold and to 28.8% (2.4-fold increase) on the basis of the stringent threshold (Fig. 4a). Most of these predictions were for the pangenomes of common gut taxa such as Bacteroides thetaiotaomicron, Ruminococcaceae bacterium and Roseburia inulinivorans. This emphasizes the degree to which even species with well-characterized strains (for example, E.â€‰coli, which is among the top 25 species) may still have undercharacterized community pangenomes (Extended Data Fig. 1b); in such cases, FUGAsseM strikingly improved pangenome characterization (Fig. 4b and Supplementary Table 13). Other less studied species such as Sutterella wadsworthensis and Firmicutes bacterium CAG 124 with many novel proteins were also much better annotated (with an average of 25.6% of novel proteins annotated using stringent threshold). Moreover, 9,784 (of 15,638) protein families previously annotated with informative GO annotations were assigned to more informative GO terms, expanding their characterization (Fig. 4b and Extended Data Fig. 7e,g). In total, 62,209 (of 116,740) uncharacterized protein families (that is, lacking informative GO annotations) were assigned high-confidence functional annotations. Overall, gut microbial proteins gained extensive high-confidence predictions across GO aspects, expanding the functional repertoire of the gut microbiome.

**Fig. 4: FUGAsseM predictions greatly expand putative function assignments in the human gut microbial gene catalog.**

To expand statistics for functional annotations spanning all of GOâ€™s aspects, we observed a 3.4-fold increase in BP annotations (meanâ€‰Â±â€‰SD: 11â€‰Â±â€‰29 annotations per term per species in the training set versus 37â€‰Â±â€‰136 in the high-confidence prediction set), a 4.3-fold increase in MF annotations (10â€‰Â±â€‰25 versus 43â€‰Â±â€‰129) and a 4.1-fold increase in CC annotations (21â€‰Â±â€‰41 versus 86â€‰Â±â€‰190) on the basis of the default confidence threshold (Fig. 4c). In particular, annotations for BP terms with high confidence were substantially increased by both the default and the stringent thresholds (Fig. 4d and Supplementary Table 14). Most of these predictions benefited from the contribution of MTX-based coexpression (Fig. 3a,d). Conversely, as is typical for many of their terms, newly predicted MF and CC annotations sometimes represented general functions that require additional knowledge of the context in which the gene product acts. They were also more likely to benefit from the contribution of sequence similarity for prediction (Fig. 3). Our analysis reveals a substantial number of microbial proteins in the human gut that remain uncharacterized. While their precise functional roles are yet to be fully elucidated, their presence in transcriptionally active communities suggests potential biological importance.

Uncharacterized proteins increase the diversity of species-shared and species-specific functions in the human gut

The top 15 most frequently predicted BP terms were assigned to more than half of the total species in the HMP2 by FUGAsseM, furthering their characterization (Fig. 5a, first column, and Supplementary Tables 15 and 16). To select the most frequently predicted terms in HMP2, we sorted all predicted terms in descending order on the basis of the count of species that were assigned at least one protein to each corresponding term. Most of these functions were related to essential or housekeeping activities (for example, DNA processing, RNA processing, nucleotide or amino acid biosynthetic process and CC organization). FUGAsseM achieved an average cross-validated AUROC of more than 0.95 over these top 15 terms (second column in Fig. 5a), indicating the high confidence of these predictions. The increase in the annotations of uncharacterized protein families (average of 33.84 and 12.06 protein families per term per species annotated with default and stringent thresholds, respectively) suggests undercharacterized diversity of microbial proteins for these essential functions (Fig. 5a, third column). For example, 542 novel proteins from 48 species were newly predicted to be involved in the regulation of cell shape (GO:0008360), greatly increasing the protein diversity of this process.

**Fig. 5: Uncharacterized proteins are predicted to be broadly distributed and species-specific functions with high confidence in the human microbiome.**

FUGAsseM also improved the diversity of species-specific functions (Fig. 5b and Supplementary Tables 17 and 18). We examined the top 15 least frequently predicted BP terms that were the bottom 15 terms from the ranking list (introduced above). These terms were often specific to a few species but with good performance (average AUROC of 0.88). Unlike the housekeeping processes shared by many species, these functions were enriched for highly specific metabolic processes, including toxin metabolism (for example, altering the toxicity of ingested compounds and impacting health outcomes⁶⁸), iron ion homeostasis (for example, changing iron availability impacting on host-microbiota interactions^69,70) and ammonium ion metabolism (for example, affecting enteric nervous system function⁷¹). Many nonmetabolic processes were also predicted with high confidence and were often related to interaction between microorganisms and their host or environment. These included outer-membrane assembly, which is responsible for binding of and response to host proteins^72,73, transmembrane transport, which has a role in nutrient uptake^74,75, and dicarboxylic acid transport, which is involved in the uptake of succinate, fumarate and malate by the microbiome⁷⁶. Species-specific functions might be underestimated previously as housekeeping functions, which are typically easier to predict, dominate the annotations. The increased diversity of such specialized functions by uncharacterized proteins (including novel proteins) suggests richness of unexplored functions, which potentially distinguish clades and contribute to the fitness of both microorganisms and the host.

Newly annotated proteins in microbial nutrient use and phage defense

A wide array of taxa and functions were covered by new FUGAsseM predictions in the human gut, ranging from broad housekeeping processes to highly taxon-specific nutrient use and phage responses (Fig. 6). For example, the annotations of housekeeping functions (for example, cell wall organization GO:0071555)⁷⁷ were greatly expanded in multiple well-studied and less studied species (Fig. 6a, Supplementary Table 19 and Supplementary Note 1). Similarly, species-specific annotations such as disaccharide metabolism (GO:0005984) were also improved (Fig. 6b and Supplementary Table 20). Many bacteria in the human gut use disaccharides as an important energy source⁷⁸ and these were better characterized in health-relevant microorganisms such as Faecalibacterium prausnitzii and Hungatella hathewayi (Supplementary Note 1). Lastly, the importance of coexpression for function prediction was also affirmed by these data, with many uncharacterized proteins receiving new annotations on the basis of strong coexpression with annotated proteins in their source species despite lacking annotated homologs (Fig. 6c,d and Supplementary Tables 21 and 22). For example, H.â€‰hathewayi is associated with colorectal cancer and has the ability to break down various glycosaminoglycans, aiding colonization⁷⁹. The use of FUGAsseM resulted in the assignment of high-confidence functional annotations to 3,231 uncharacterized proteins of H.â€‰hathewayi, including chemotaxis (a fitness attribute in nutrient foraging⁸⁰; GO:0006935) and cobalamin biosynthesis (contributing to glycosaminoglycan-degrading⁷⁹; GO:0009236) (Fig. 6d and Supplementary Note 1).

**Fig. 6: Selected examples of specific newly predicted function annotations spanning pathways and functions in the human gut microbiome.**

F.â€‰prausnitzii, which has been reported to be associated with health⁴⁴ and exhibits high phylogenetic diversity^7,81, showed largely expanded characterization of its pangenomes, which are at present notably underannotated given its health relevance (Figs. 1b and 4). Strikingly, uncharacterized F.â€‰prausnitzii proteins confidently predicted as viral responsive (GO:0019058 and GO:0051607) were strongly coexpressed with a small number of previously characterized pathway members (Fig. 6c). Previous studies showed that F.â€‰prausnitzii phages were enriched in IBD⁸², including Myroviridae type I cluster 6 (Taranis and Toutatis) and Siphoviridae type I cluster 1 (Lugh) families. Although F.â€‰prausnitzii uncharacterized proteins lacked global sequence similarity to related previously annotated phage proteins (Extended Data Fig. 8a and Supplementary Table 21), their similar domain architectures and coexpression confirmed their functional relatedness. In addition, the subunits of types 1C and 1E, the most common CRISPR systems of F.â€‰prausnitzii strains⁸³, were newly predicted in F.â€‰prausnitzii. These spanned Cas1 or Cas5 and Cas3 genes from type 1E and 1C systems, respectively (Extended Data Fig. 8b). These previously annotated and newly predicted CRISPRâ€“Cas proteins showed 60â€“80% amino acid sequence identity and strong coexpression within system types and their similarity dropped to ~20% between system types with a weak correlation at the expression level. These findings emphasize the informative role of MTX-based coexpression in deducing functional capacity, complementing sequence similarity.

In another case, proteins involved in carbohydrate catabolism (GO:0044275) were newly predicted in Bacteroides, a dominant clade and also a primary glycan degrader in the human gut microbiota. This activity is mediated by polysaccharide use loci (PULs), which typically include multiple carbohydrate active enzymes (CAZymes) that enable complex carbohydrate recognition, uptake and breakdown^84,85,86,87. Although the newly predicted proteins in B.â€‰thetaiotaomicron involved in GO:0044275 lacked significant homology to annotated PUL proteins, their domain architectures tended to be similar, including glycosyl hydrolase Pfam domains, outer-membrane-related domains and PUL domains from the same operon (Extended Data Fig. 8c). Again, MTX-based coexpression showed the strongest contribution to the predictions of carbohydrate catabolic processes in B.â€‰thetaiotamicron (Fig. 6e and Supplementary Table 23). Similarly, in Bacteroides fragilis, coexpression patterns revealed associations between these glycoside hydrolases and other known PUL proteins^88,89, indicating the concerted action of protein sets (Fig. 6f, Supplementary Table 22 and Supplementary Note 1). These extended annotations shed light on the crucial role of Bacteroides in polysaccharide metabolism for energy harvesting, aiding in understanding both colonization and ecological persistence.

In addition to these examples in well-studied species, FUGAsseM provided insights into protein function for less studied species. In many less studied species, the MTX-based expression of novel proteins was strongly correlated with annotated pathway members, indicative of underlying functional associations active specifically within a community context (Fig. 6g,h). For example, uncharacterized proteins in Subdoligranulum spp., which have a role in carbohydrate fermentation^8,90 and are associated with chronic-inflammation-related immune markers⁹¹, were predicted with high confidence by leveraging coexpression patterns. In particular, FUGAsseM expanded the repertoire of proteins contributing to peptidoglycan synthesis (for example, phospho-N-acetylmuramoyl-pentapeptide transferase) in Subdoligranulum (Fig. 6g and Supplementary Table 22), which is a key component in sensing by host proteins and contributing to the regulation of inflammatory response toward bacteria^92,93. Additionally, uncharacterized proteins from undercharacterized species such as Methanobrevibacter spp. were predicted to be associated with amino acid biosynthesis enzymes (Fig. 6h and Supplementary Table 22), such as aspartateâ€“semialdehyde dehydrogenase, which participates in sulfur amino acid processes unique to these Archaea and associated with metabolic disease^94,95. Taken together, FUGAsseMâ€™s coexpression-based function prediction furthered the characterization of both well-studied and less studied members of the gut microbiome.

Discussion

While culture-independent sequencing has provided increasingly vast catalogs of microbial community gene families, a large majority of their functions remain unknown. Here, we present a new method, FUGAsseM, that provides precise predictions of functions for uncharacterized proteins by integrating diverse community-wide data, including MTX-based coexpression, MGX-based genomic proximity, sequence similarity and predicted domainâ€“domain interactions. FUGAsseM uses a two-layered RF classifier system to assign proteins to new functions through guilt by association. For a given function, individual classifiers are initially trained for each data type to assign new proteins to that function on the basis of their associations with other annotated proteins. In the second layer, an ensemble classifier integrates the per-evidence prediction confidence scores from the first layer to produce a unified confidence score. This architecture allows for flexible adjustment of evidence weighting on a function-by-function basis, thus capturing biological trends that favor greater overall prediction accuracy. FUGAsseM can incorporate a large diversity of data types, while also scaling efficiently with the number of datasets per type. We used FUGAsseM to assign high-confidence GO annotations (that is, BP, MF and CC) to >443,000 protein families in the human gut, including >33,000 families with <90% identity or <80% coverage to any UniProt member. These predictions substantially expanded the functional repertoire of both common gut taxa (for example, F.â€‰prausnitzii) and less studied species (for example, S.â€‰wadsworthensis) and ranged from broadly distributed housekeeping activities to highly specialized metabolic functions. The functional diversity of uncharacterized proteins was particularly well informed by the analysis of MTX-based coexpression. An open-source implementation of FUGAsseM, supporting documentation, tutorials and all current data products (including function predictions of HMP2 protein families) are available online (http://huttenhower.sph.harvard.edu/fugassem).

FUGAsseM is modeled on computational strategies for protein function prediction in single organisms, which have become extremely accurate over the past two decades^17,18. These, in turn, build on simple annotation transfer by homology, under the assumption that proteins with sufficiently similar sequences frequently carry out similar functions^23,24,25,26. Particularly with the advent of deep-learning-based structural modeling, this intuition can be extended to include function transfer based on structural similarity^19,96,97. A wide range of other data types have also been used for function prediction or transfer, including gene neighborhood methods (that is, relating genes located in the same genomic loci, such as operons^27,98,99) and phylogenetic profiling (that is, relating genes on the basis of co-occurrence across species^97,100). Network-based approaches rely on guilt by association for prediction, such as proteinâ€“protein interactions^101,102,103, epistatic genetic interactions^104,105 or coexpression¹⁰⁶ from DNA microarrays^35,36,49, qPCR¹⁰⁷ or RNA-seq⁵⁰.

However, analogous function prediction methods for microbial communities have not been previously developed. Compared to single organisms, microbial communities tend to contain organisms with substantially more diverse genetic backgrounds, thus introducing additional complexity for function prediction. Even human-associated communities typically contain a high proportion of novel sequences lacking homology to known proteins and, thus, with limited knowledge of genomic context, protein structure or proteinâ€“protein interaction. Approximately 45% of assembled protein families in the human microbiome were new (RH and NH with homology with <90% amino acid sequence identity or <80% coverage to any UniProt member) and that fraction was even larger in undercharacterized microbial communities such as marine or soil communities (~70% to 90%)⁷. Thus, community-wide function prediction requires a combination of leveraging community-wide data directly, transferring function among organisms using more than simple homology and scalable computational methods, as direct experimental characterization would be challenging. FUGAsseM, thus, represents one of the first efforts to specifically predict functions in complex microbial communities. It enhances the functional characterization of the human microbiome, with the percentage of protein families assigned to informative BP terms increasing from 14.4% to 49.1% (3.4-fold increase). By comparing the number of annotated families assigned by FUGAsseM to eggNOG-mapper¹⁰⁸ (which assigns protein families by single-best-hit sequence homology), FUGAsseM clearly demonstrated superior coverage of meaningful annotations compared to eggNOG-mapper, particularly for proteins lacking notable homologs in UniProt (Extended Data Fig. 9 and Supplementary Table 24). This underscores the strength of the FUGAsseM framework in accurately predicting previously uncharacterized protein functions within microbial communities.

FUGAsseM differs from previous single-organism methods mainly by using community-level data designed specifically for the complexity of microbiomes. Single-organism algorithms developed for analyzing isolate genomes are limited in the context of communities^{23,24,25,62,63}. These tools typically rely on the existence of functional data (gene expression, protein interactions, etc.) for microbial isolates^57,106 that do not exist for the vast majority of microbiome members. For example, only two of 336 HMP2 species have direct expression data in STRING. In their absence, functions are often transferred on the basis of homology, which is highly variable and unreliable without additional supporting data. FUGAsseM addresses these limitations by incorporating advanced multiomics techniques, allowing a much larger number of sequences from communities to receive at least putative functional interpretations. The combination of MGX with MTX makes numerous protein sequences from communities accessible, allowing for a more comprehensive exploration of microbial functionality in ecosystems. By applying FUGAsseMâ€™s integrative prediction approach to whole-community data (for example, MTX-based coexpression, MGX-based genomic proximity and across-species sequence similarity), we can address proteins with little homology and for which no isolate functional data are available.

MTX coexpression for function prediction in communities is of particular note, as it may be argued that coexpression analysis from microarrays unlocked the annotation potential of genomes as they began to be sequenced in the early 2000s³⁵. Unlike an isolateâ€™s expression profiles from a monoculture environmentâ€”which are already highly informative⁵⁷â€”MTX captures the transcriptional profiles of microorganisms in situ^{40,41,109,110}. It also has the potential to cover the same set of taxa and genes spanning communities as profiled by typical metagenomic DNA sequencing methods. Particularly in combination with other community-relevant data, such as cross-species homology, FUGAsseM is especially suitable even for communities without isolates. Moreover, FUGAsseMâ€™s MTX-based approach can reduce the impact of circularity in function prediction (for example, training sequence similarity to predict functions that are themselves annotated through sequence similarity). Correspondingly, our evaluations on predictions based on experimental evidence (independent from the training set) showed substantial contribution from MTX-based coexpression to high-confidence predictions (Figs. 2d,e and 3dâ€“f).

In our analysis, FUGAsseM substantially extended functional annotations within the human microbiome, highlighting key ecological processes such as iron and ammonium ion homeostasis and toxin metabolism (Fig. 5b). Our predictions reveal insights into the ecological dynamics of iron within the gut, identifying specific protein families involved in its acquisition and regulation. This is particularly important given that iron, albeit essential for numerous enzymatic processes in microbial physiology, is a limiting nutrient in the gut, leading to competitive interactions both among microbial species and between microorganisms and the host^69,70. The transportation of iron is a crucial bioprocess in the gut given that iron availability is dependent on barrier integrity and inflammation; in turn, this influences pathogen colonization and the expression of virulence factors, as well as the composition and resilience of the microbial community. Bacteria possess complex mechanisms to sequester iron from their environment, which can affect host iron homeostasis and overall health, leading to a tightly regulated balance between iron acquisition and host defense systems^111,112.

Similarly, our method identified proteins linked to ammonium ion metabolism, shedding light on a crucial but less explored aspect of nitrogen cycling within gut communities. The ability to efficiently process ammonium ions in certain microorganisms affects the entire community by influencing nitrogen availability, which is essential for the synthesis of critical biomolecules such as amino acids and nucleotides¹¹³. Nitrogen cycling, particularly through ammonium ions, is pivotal for microbial growth and ecosystem function and disruptions in this cycle can lead to notable shifts in community structure and function^114,115. These metabolic pathways also highlight the adaptability of gut microorganisms to use ammonia under low-oxygen conditions, which is often seen in densely populated regions of the gut, contributing to the efficiency of nutrient recycling and energy conservation in this ecosystem^116,117. Our approach further predicted proteins involved in other transport-related functions, including amino acid, anion and dicarboxylic acid transmembrane transport (Fig. 5b). These processes reflect a range of nutrient uptake and waste removal, affecting microbial viability and interaction within the gut ecosystem. We also detailed the MFs and CCs newly annotated by FUGAsseM, such as the ATP activity involved in essential microbial energy processes, species-specific functions such as endo-1,4-Î²-xylanase activity and various CCs (Extended Data Fig. 10). The annotation of endo-1,4-Î²-xylanase is a noteworthy example as it underlines the role of highly specific microorganisms in the degradation of individual plant-derived fibers within the gut, thus simultaneously influencing host and microbial nutrient processing. While our predictions suggest that certain functions are primarily associated with specific species, this may reflect either true biological exclusivity or the current limitations of available data. Future studies with expanded datasets and experimental validation will help further distinguish between species-specific and widely conserved functions.

In the process of expanding the functional landscape of the human gut microbiome, FUGAsseMâ€™s annotations suggested several BPs that are more prevalent or critical than previously understood. For instance, the process of cell shape regulation, which influences microbial adaptability, colonization and interaction within the gut environment, was newly predicted across a diverse array of species, including both gram-positive and gram-negative bacteria. Similarly, the annotation of flagellum components in various species from different phyla highlights an unexpected prevalence of motility functions (Supplementary Table 15). These findings highlight the diversity of strategies used for microbial colonization, interaction and nutrient acquisition within the human gut.

Additionally, FUGAsseMâ€™s annotation underscores the extent to which even well-studied taxa possess expansive pangenomes that remain undercharacterized in endogenous, human-associated communities. Among the microorganisms analyzed, F.â€‰prausnitzii and Bacteroides massiliensis were notably underannotated (fraction of BP annotations: 29.2% and 27.9%, respectively) (Fig. 4a). Specifically, F.â€‰prausnitzii, with diverged phylogeny and varying phenotypes^81,118, showed considerable gaps in functional annotations (15,122 of the total 21,358 protein families were uncharacterized), likely because of its complex strain variability, which can impact MTX species-based normalization and dilute coexpression signals. The expression profiles of uncharacterized B.â€‰massiliensis proteins showed moderate correlations with characterized proteins (average Pearsonâ€™s correlation, Râ€‰=â€‰0.56) (Fig. 1c), which may be because of the limited number of previously annotated protein families (only 8% of B.â€‰massiliensis protein families have been annotated with informative BP terms). Interestingly, our predictions also highlight the extent to which even well-studied taxa, such as E.â€‰coli and B.â€‰thetaiotaomicron, exhibit large, undercharacterized pangenomes when found in natural, human-associated communities. In some instances, these taxa have seen dramatic expansions in their pangenome annotations, which is comparable to other highly prevalent taxa in the human gut that lack extensively studied isolate genomes, such as Eubacterium rectale and H.â€‰hathewayi (Fig. 4a). It is noteworthy that, even with these improvements, the limited knowledge of annotations and coexpression patterns still diminish the effectiveness of the guilt-by-association approach in FUGAsseM. Incorporating additional types of data in future studies, such as structural data, could potentially mitigate this limitation. These insights underscore the potential for unexplored biological functions within the gut microbiome, emphasizing the need for further research to fully elucidate these processes and their implications.

One example of note is FUGAsseMâ€™s newly predicted phage defense and CRISPRâ€“Cas proteins in F.â€‰prausnitzii, as this represents a pathway of general interest in an especially health-relevant taxon. Any predicted function assignment, whether in isolates or a community, remains putative until it is verified by guided experimentation. One potential approach to validate these might be deleting newly predicted CRISPR system elements to observe the impact on plasmid conjugation and phage infection¹¹⁹. However, F.â€‰prausnitzii is challenging to isolate from stool samples and requires specific culturing techniques that mimic the anaerobic conditions in the gut¹²⁰. Additionally, F.â€‰prausnitzii has a complex and highly variable genome, making genetic manipulation difficult¹²⁰. As an alternative, it may be possible to take advantage of the â€˜natural experimentâ€™ of variable F.â€‰prausnitzii genetics⁷. This would involve selection of a diverse group of F.â€‰prausnitzii strains that naturally span a combinatorial range of presence and absence patterns for predicted CRISPRâ€“Cas systems from the gut microbiome, expose them to various phages (for example, Myroviridae and Siphoviridae) in stool culture and assess their susceptibilities. However, it is not clear that appropriate F.â€‰prausnitzii phages are available for such a screen without themselves requiring new, labor-intensive isolation. Despite the challenges posed by the organismâ€™s isolation difficulties and F.â€‰prausnitziiâ€™s genetic complexity, exploring the newly predicted phage defense and CRISPRâ€“Cas proteins represents an important avenue of research, which could potentially benefit from leveraging other validation techniques such as heterologous expression and fluorescence-based assays.

While FUGAsseM is able to leverage a range of capabilities unique to microbial communities, such as community-wide MTX, it is also limited in some ways by microbiome assays. As introduced above, MTX is limited in sensitivity to rare microorganisms and transcripts and the technology itself can be difficult to apply because of the need to deplete diverse ribosomal RNAs without damaging mRNAs^121,122. MTX-based coexpression might include more noise and lack sensitivity for low-abundance taxa or transcripts, as compared to highly controlled isolate-based experimental expression profiles, but is applicable to almost any taxon or pathway detected in a community (Fig. 2a,b). When focusing on MGX, genomic proximity can be informative for predicting functions encoded at the same locus (for example, operon). This relies on MGX assembly, which can easily break down especially in communities¹²³. It also requires organisms to possess substantial operonic structure, which not all microorganisms do¹²⁴. Even when relying on simple cross-organismal sequence homology, it is difficult to avoid circularity and â€˜annotation rotâ€™ by transferring annotations that themselves have been made solely by sequence-based transfer^125,126, which is a challenge faced by the broad field for function prediction. Both our evaluation on predictions with new experimental evidence independent of training data and annotations with experimental evidence in the gold-standard set suggests that we have avoided this pitfall so far but it may also be useful to include additional orthogonal data types, such as protein structural similarity. Lastly, many assay types that are especially informative for single-organism function prediction are simply not yet available in microbial community contexts, such as proteinâ€“protein interactions or genetic epistasis.

However, despite these limitations, there is a need to address the drastic underannotation of protein function within microbial communities. It will remain extremely challenging to translate the health potential of the human microbiome without a deeper understanding of hostâ€“microorganism and microorganismâ€“microorganism molecular interactions, let alone understand the ecologies of environmental microbiomes that are less well studied. This cannot be achieved when many millions of microbial proteins remain functionally uncharacterized. While experimental assays are the gold standard for such annotation, the scope of the problem means that it must be guided and accelerated computationally. FUGAsseM provides a first pass at annotating the most accessible portion of the â€˜dark matterâ€™ of microbial communities and the human microbiome and we hope that future applications and expansions of the method will improve our molecular understanding of the microbiome and the mechanisms that microorganisms use to interact with each other and their environments.

Methods

Overview

FUGAsseM provides a system for predicting new functional assignments for uncharacterized proteins found in microbial communities through the integration of various community-wide data types. Using a guilt-by-association approach, FUGAsseM first preprocesses each data type to generate a corresponding network representation (that is, with proteins as nodes and edges representing proteinâ€“protein functional associations suggested by that data type). Then, for each function of interest (defined as a gene set, such as a GO term), FUGAsseM trains a two-layer machine learning classifier to assign proteins to the function on the basis of their network relationships with other proteins, particularly those already assigned to the function. Here, we demonstrated FUGAsseMâ€™s accuracy and superior coverage in the task of predicting functional assignments for proteins from the human gut microbiome in comparison to methods trained on noncommunity data, including two state-of-the-art function prediction methods. We then applied FUGAsseM to generate and analyze a catalog of new functional assignments for previously uncharacterized human gut proteins. Key aspects of the FUGAsseM methodology are summarized below, with further information on network data, machine learning approaches and evaluation procedures provided in the Supplementary Information.

FUGAsseM methodology

FUGAsseM relies most heavily on (1) MGX to create a catalog of all potentially characterizable proteins in a set of target communities; (2) a gold-standard set with initial functional assignments (where a function is defined as a set of genes that fulfill similar roles or biochemical conversions); and (3) MTX coexpression for initial function prediction whenever possible, as it is available for many environments of interest and covers the majority of proteins when available^44,109,127. When available, these can be integrated with other data types (homology, genomic proximity, etc.). FUGAsseM provides predicted estimates for each protein family assigned to each function of interest. It uses a two-step strategy to process network data and predict functions of microbial community protein families, respectively.

Protein families

FUGAsseM operates on nonredundant protein sequences within stratified taxonomic bins that are compiled from target communities using metagenomic data. To generate these protein families, metagenomic shotgun sequencing data are initially processed with quality control procedures (including adaptor detection, low-quality read base trimming and potential contaminant read elimination), which can be performed using tools such as KneadData (a quality control pipeline (http://huttenhower.sph.harvard.edu/kneaddata). In this case, MGX reads are processed to yield an input gene catalog stratified by taxonomic bin, which are generated using either assembly-based approaches (assembling metagenomes, followed by nonredundant protein family construction and taxonomic assignments^7,128,129) or reference-based approaches (aligning metagenomic reads to the pangenomes of stratified taxa, followed by a translated search performed against reference protein sequence catalogs^130,131).

Gold-standard set

FUGAsseM takes a gold-standard set as input to train a list of functions (that is, sets of genes that perform similar roles or chemical transformations), which is designed to be broadly applicable to any type of functions such as GO terms⁵¹, KEGG pathways⁶⁰ or MetaCyc modules⁶¹. For a given function type, a gold-standard set is defined by a set of functional assignments, describing which protein families (as introduced above) are already known to be annotated to which function (that is, in which gene sets). Those original functional annotations can come from any source of gene sets (for example, UniProt⁵⁶, eggNOG¹³² or KEGG⁶⁰).

Network data (prediction evidence)

FUGAsseM is able to incorporate any assay data of the aforementioned protein families for function prediction in microbial communities that can be modeled as a network. FUGAsseM can be trained using any collection of one or more shotgun metagenomic datasets, optionally accompanied by paired metatranscriptomes. Each type of data is typically encoded as a weighted proteinâ€“protein network, in which the semantics of each network and its weights differ by data type. This is true for coexpression (edges represent transcriptional similarity), sequence similarity (edges represent homology), genomic proximity (edges representing co-occurrence within contigs or operons) or protein interaction (edges representing interactions between proteins or protein domains). This network ensemble is fed to the machine learning module of FUGAsseM. Comprehensive details regarding the network data are available in the Supplementary Information.

Machine learning

FUGAsseM can accept any type of network data as input and ultimately treats them uniformly in the machine learning process for function prediction. Each data type is represented as a weighted proteinâ€“protein network, where the interpretation and weighting of each network vary by data type. FUGAsseM uses a â€˜layered learningâ€™ strategy to train machine learners (Extended Data Fig. 1h). For each function in the gold standard, each network is first processed in a data-type-specific manner by the initial layer of the machine learning module, where an individual classifier is trained to learn cofunction patterns on the basis of the shared annotation of each protein pair to the function. This results in an evidence weighting behavior that learns to upweight individual data type results only when they correspond with proteins of similar function. Next, all data types are integrated by a second layer of the machine learning module that combines individual classifiersâ€™ predictions to train an ensemble classifier for final prediction. This procedure can be carried out in two stagesâ€”training and prediction (potentially on different datasets spanning the same community types). RFs (as implemented in Python scikit-learn) are used for machine learning in FUGAsseM, a state-of-the-art method that is widely used in microbiome applications¹³³, as detailed in the Supplementary Information.

Synthetic evaluation

To assess the accuracy and reliability of FUGAsseM, we conducted two primary evaluation analyses: cross-validation and temporal hold-out validation. Using a cross-validation approach, we compared the predictions of FUGAsseM and other state-of-the-art tools against different subsets of the gold-standard set (as detailed above), which assesses the performance of these tools in annotating proteins with their prospective functions. In the temporal hold-out process, we adopted the benchmarking approach from CAFA, a community-wide effort aimed at evaluating the performance of computational methods in predicting protein function^20,21,22. In the context of CAFA, the dataset is typically held out temporally, meaning that tests consist of annotations that were not available at the time when training and predictions were made (that is, proteins that have accumulated experimental validation evidence between training and evaluation). Using this approach, the temporally held-out dataset serves as an independent validation to evaluate the performance of FUGAsseM in predicting protein functions. By withholding a subset of annotations until after predictions are made, we aim to simulate real-world scenarios where new experimentally validated annotations are continuously being added over time.

Dataset overview

We used the data from HMP2/iHMP2 (ref. ⁴⁴) for both benchmarking and application. A gene catalog with stratified taxonomic bins was built from 1,595 available shotgun metagenomes, which comprised 1,665,223 protein families after quality control⁷. MTX reads from 800 metatranscriptomes that are paired with the MGX samples were previously quality-controlled by KneadData (version 0.7.0) with default settings (http://huttenhower.sph.harvard.edu/kneaddata)⁴⁴, including trimming and filtering low-quality reads, removing potential human read contamination and eliminating ribosomal RNAs.

Evaluation approaches

All methods used to evaluate the performance of FUGAsseM in this study are detailed in the Supplementary Information. These include quantitative metrics, cross-validation procedures, temporal hold-out strategies, evaluation against experimental evidence, per-data-type ablation analyses and comparisons to existing methods.

Application to the HMP2

FUGAsseM analysis

Preparation of protein families

We applied the FUGAsseM algorithm to proteins from the human gut microbiome (as described above). To prepare protein families with stratified taxonomic bins for FUGAsseM, we assembled quality-controlled HMP2 MGX reads into metagenomic contigs using MEGAHIT (version 1.1.3)¹³⁴. Open reading frames (ORFs) in each contig were predicted by Prodigal (version 2.6)¹³⁵. Complete ORFs were clustered into nonredundant gene catalogs with 95% sequence identity over 90% alignment coverage using CD-HIT (version 4.7)¹³⁶. The resulting representative amino acid translations of the nonredundant gene catalogs were further clustered into protein families requiring 90% amino acid sequence identity and 80% coverage with CD-HIT (following the same criteria used in UniRef90)⁵⁸. Lastly, taxonomic assignment for each protein family was performed using a utility of MetaWIBELE⁷, which searches proteins against the UniRef90 database with DIAMOND¹³⁷ and uses a guilt-by-association approach to propagate consistent taxonomic assignments of classified members to unclassified members within the same metagenomic species pangenomes⁷.

Alternatively, the protein families stratified into taxonomic bins can also be generated using reference-based approaches. For example, HUMAnN¹³¹, a tool to profile functions from microbial communities, identifies nonredundant protein families at species level from metagenomes. Briefly, quality-controlled MGX reads are initially screened using MetaPhlAn¹³¹ to detect the known species present in each sample. Then, all sample reads are aligned against the combined pangenomes of species identified in the prescreen using Bowtie2 (ref. ¹³⁸). After the nucleotide-level search, a translated search is conducted against a protein sequence catalog based on UniRef⁵⁸ using DIAMOND¹³⁷ for reads that do not initially map. This yields abundance profiles of protein families (UniRef90s) organized by the species responsible for those genes.

Preparation of gold-standard set

We prepared a gold standard for training from GO annotations of all available HMP2 proteins in UniProt⁵⁶ (release 2019_01). Because of its hierarchical structure, GO terms are divided into different sizes representing how many proteins are assigned to a given term within a taxon. Large terms may represent general functions of most proteins and predicting these functions would be not enough to elucidate specific functions of unknown proteins. Small terms typically represent specialized functions but too few proteins annotated with these terms would limit the prediction efficiency because of lack of training information. Thus, we constructed an informative GO set that defines a set of terms with an adequate number of annotated proteins based on the GO database (releases 2021-07-02) with FUGAsseMâ€™s utility. Adopting the definition from previous studies^53,54, we propagated the annotations from child terms to all their ancestor terms within the GO DAG and defined a GO term as informative term if it contained more than certain number k of annotated proteins (kâ€‰=â€‰20 in this work, although the concept of informative GO terms does not prescribe a specific k value) and all of its child terms contain fewer than k proteins. This utility selects informative terms within taxon and then combines them through union to produce a combined, nonredundant set across taxa, ensuring hierarchical independence by excluding any terms that were parent terms of others within the set. This results in a uniform set of informative terms that can be applied to all taxa while keeping taxon-specific informative terms.

Preparation of network input data

To prepare MTX network input data for FUGAsseM, we built a normalized MTX abundance of protein families table with taxonomic stratification using FUGAsseMâ€™s utility. First, quality-controlled HMP2 MTX shotgun sequencing reads per sample were aligned to the MGX-based nonredundant gene catalogs (as introduced above) using Bowtie2 with default parameter settings¹³⁸. Second, MTX count data were calculated by featureCounts¹³⁹ with default settings, followed by summing up the number of reads mapped to all gene sequences from the same protein family into counts of the protein family itself. Third, the resulting MTX counts were first divided by the length of the representative in each protein family in kilobases, generating abundances in reads per kilobase (RPK) units. Because MTX abundance is strongly correlated with DNA abundance estimates in communities^110,140, we used an MTX-specific procedure to normalize MTX abundance within taxon (termed as â€˜taxon-specificâ€™ scaling¹⁴¹). It normalized the RPK units of each protein family into within-taxon total-sum scaling, where per-sample MTX abundance of a protein family is divided by the total MTX abundance contributed by its source taxon. This resulted in a normalized MTX abundance table that estimates the â€˜relative expressionâ€™ of protein families within taxon, compensating for variation in RNA abundance driven by underlying changes in taxonomic abundance.

Another option to prepare MTX abundance input relies on MGX-based protein families profiled by HUMAnN¹³¹. In this approach, HUMAnN first aligns quality-controlled MTX reads using Bowtie2 (ref. ¹³⁸) to the sample-specific pangenomes prescreened by the paired MGX (as introduced above). Then, HUMAnNâ€™s translated search aligns unmapped reads against UniRef90 (ref. ⁵⁸) using DIAMOND¹³⁷. The resulting MTX abundance of the UniRef90 families stratified by contributed species are further normalized within species using HUMAnNâ€™s utility â€˜humann_renorm_tableâ€™.

To prepare other types of network input data, we constructed UniRef50-like protein families by clustering the representatives of the MGX-based protein families using CD-HIT (version 4.7) with 50% amino acid sequence identity and 80% coverage, which were formatted as vector-based evidence encoding sequence similarity information. We obtained genomic context information for the operated protein families by extracting cocontig patterns from the previously generated metagenomic assemblies in HMP2 (ref. ⁷). Additionally, domainâ€“domain interactions were obtained from previous annotations for the HMP2 protein families⁷, which were assigned to the corresponding families by searching protein domains against the DOMINE database⁵⁹ (version 2.0; a database that includes both known and predicted protein domainâ€“domain interactions).

Running FUGAsseM on the HMP2 data

When applying FUGAsseM to the HMP2, we excluded extremely rare taxa for analysis that contained fewer than 500 protein families detected by MTX. Given these MTX abundance profiles, MTX-based coexpression networks were predicted by calculating Pearson correlations among the protein families. We used FUGAsseM to predict GO annotations of the HMP2 protein families with parameter settings â€˜--minimum-coverage 0.05 --go-mode union --go-level 20â€™. For a given term, its assignment to a protein family with prediction probability greater than 0.75 was delineated as a high-confidence prediction (referred to as the default threshold), which ensures retention of true positives with high confidence while incorporating new predictions. A threshold of 0.85 was also introduced as a stringent threshold to enhance precision, sustaining an adequate true positive rate but at an especially low false positive rate. This resulted in 443,549 and 292,036 protein families being labeled to 295 informative GO terms at the default and stringent thresholds, respectively.

eggNOG-mapper analysis

To enable a fair comparison between FUGAsseM and eggNOG-mapper, we uniformly defined the characterization levels and consistently applied them across the same set of protein families. Specifically, eggNOG-mapper was run using default parameters (E valueâ€‰<â€‰0.001) on protein families previously processed by FUGAsseM. Results were then filtered to retain only the informative GO terms (as described above) defined consistently between methods, ensuring a rigorous comparative assessment. Then, we compared the numbers of high-confidence predictions (prediction probabilityâ€‰â‰¥â€‰0.75 for FUGAsseM, E valueâ€‰<â€‰0.001 for eggNOG-mapper)) for informative GO terms produced by these two methods.

Protein categories

We used the homology and informative BP annotations to categorize protein families. This distinction is interpreted as classification of sequence novelty and functional characterization level. Following our previous study⁷, we defined a protein family as having â€˜strong homologyâ€™ if it shared the same nonredundant CD-HIT cluster as a representative protein exhibiting â‰¥90% amino acid sequence identity and â‰¥80% coverage of the longest sequence compared to known proteins in the UniRef90 (release 2019_01)⁵⁸. We defined strong homologs to known UniRef90 proteins as SC (strong homology, characterized) if they were assigned at least one informative GO BP term⁵¹ by UniProtKB⁵⁶. SNI indicates strongly homologous proteins assigned only to higher-level noninformative BP terms and SU indicates no BP annotations. UPI indicates that the homologous UniRef90 cluster was built by UniParc proteins (which lack GO annotations). Nonhomologous proteins with <25% identity or <25% coverage or no hit at all were termed as NH, whereas RH entailed modest similarity (25%â€‰â‰¤â€‰identityâ€‰<â€‰90% and 25%â€‰â‰¤â€‰coverageâ€‰<â€‰80%).

Functional enrichment

We conducted GSEA on the HMP2 protein families using the GSEA function in the clusterProfiler (version 3.10.1)¹⁴² R package. Two sources of GO annotations for the protein families were used for comparison: prior annotations from UniProt⁵⁶ and predicted annotations by FUGAsseM. For each BP term from each annotation source, we ranked the protein families by their priority scores of MetaWIBELE⁷ (measuring the bioactive potential of protein families in a specific environmental or phenotypic setting of interest) in the entire gene list and calculated a GSEA enrichment score that captured the degree to which a gene set was overrepresented at the extremes (top or bottom) of the entire ranked list of genes. To assess significance, we performed 1,000 permutations for each gene set and multiple-hypothesis testing was corrected for using a Benjaminiâ€“Hochberg false discovery rate (FDR) approach with a significance threshold FDR-adjusted P valueâ€‰<â€‰0.25.

Separate analysis for MTX-based coexpression

MTX-based coexpression patterns

To differentiate cases of a genuine lack of coexpression from instances where proteins were inadequately expressed (for example, low prevalence), we collected proteins that were detected by MTX in more than 10% of total samples, which was termed as a well-expressed set. Using these expected well-expressed proteins, we calculated their Pearsonâ€™s correlation to distinguish legitimate lack of high coexpression cases from low-expressed cases between prior characterized and uncharacterized protein families.

Comparison to STRING-based coexpression

We compared MTX-based coexpression profiles by themselves to the corresponding isolatesâ€™ coexpression from STRING⁵⁷. We matched proteins between MTX-based profiles and STRING-based profiles based on common UniProt IDs and then extracted their coexpression patterns. A pair of proteins was defined as linked if their linkage score (measuring their association degree) was more than the first quartile of the maximum score in STRING. A Mannâ€“Whitney test was used to assess the difference of MTX-based coexpression distributions between linked and unlinked proteins in isolates. To examine the relationship between species characterization or representation and similarity of MTX-based versus STRING-based coexpression networks, we defined a metric (referred to as â€˜reference representationâ€™) calculated as the percentage of proteins from a species in the HMP2 dataset that had strong homologs in UniRef90. Then, we compared the reference representation of a species to the quantitative similarity between MTX-based and STRING-based coexpression calculated by Pearsonâ€™s correlation.

Coexpression network

To visualize coexpression networks of interest, we selected protein families with high-confidence predictions based on the stringent threshold (that is, prediction probabilityâ€‰â‰¥â€‰0.85) and connected families if their correlation coefficient were more than the 90th percentile value of the whole pairwise correlations of the corresponding nodes per species per term.

Abbreviation

All abbreviations used in this study are organized in Supplementary Table 25.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Associated data generated during this study are included in the published article and its Supplementary Information. The precomputed function predictions from the previously published HMP2 data spanning 451,830 protein families (with preserved annotation and/or FUGAsseMâ€™s predictions) and 295 target GO terms are available online (http://huttenhower.sph.harvard.edu/fugassem). All assembled metagenomic contigs, ORFs, gene families, protein families, functional profiles, taxonomic profiles and prioritized profiles of protein families related with this study are also available online (http://huttenhower.sph.harvard.edu/metawibele). Raw data of HMP2 metagenomes and metatranscriptomes were obtained from the IBDMDB website (https://ibdmdb.org). The following public databases were used: UniProt (https://www.uniprot.org/), UniRef90 (https://www.uniprot.org/uniref/), Pfam (https://pfam.xfam.org/) and DOMINE (https://manticore.niehs.nih.gov/cgi-bin/Domine). Source data are provided with this paper.

Code availability

The open-source FUGAsseM software is available online (http://huttenhower.sph.harvard.edu/fugassem). Manuals and online tutorials describing FUGAsseM are available on GitHub (https://github.com/biobakery/fugassem). User support is provided through the bioBakery help forum (https://forum.biobakery.org). Additional software details are provided in Methods.

References

Armet, A. M. et al. Rethinking healthy eating in light of the gut microbiome. Cell Host Microbe 30, 764â€“785 (2022).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Sharon, G. et al. Specialized metabolites from the microbiome in health and disease. Cell Metab 20, 719â€“730 (2014).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Baldrian, P. et al. Active and total microbial communities in forest soil are largely different and highly stratified during decomposition. ISME J. 6, 248â€“258 (2012).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Singleton, C. M. et al. Methanotrophy across a natural permafrost thaw environment. ISME J. 12, 2544â€“2558 (2018).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Salazar, G. et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell 179, 1068â€“1083 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Joice, R., Yasuda, K., Shafquat, A., Morgan, X. C. & Huttenhower, C. Determining microbial products and identifying molecular targets in the human microbiome. Cell Metab. 20, 731â€“741 (2014).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Zhang, Y. et al. Discovery of bioactive microbial gene products in inflammatory bowel disease. Nature 606, 754â€“760 (2022).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649â€“662 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. 39, 105â€“114 (2021).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Peisl, B. Y. L., Schymanski, E. L. & Wilmes, P. Dark matter in hostâ€“microbiome metabolomics: tackling the unknownsâ€”a review. Anal. Chim. Acta 1037, 13â€“27 (2018).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Vanni, C. et al. Unifying the known and unknown microbial coding sequence space. eLife 11, e67667 (2022).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Pavlopoulos, G. A. et al. Unraveling the functional dark matter through global metagenomics. Nature 622, 594â€“602 (2023).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Browne, H. P. et al. Culturing of â€˜unculturableâ€™ human microbiota reveals novel taxa and extensive sporulation. Nature 533, 543â€“546 (2016).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Lagier, J. C. et al. Culture of previously uncultured members of the human gut microbiota by culturomics. Nat. Microbiol. 1, 16203 (2016).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Almeida, A. et al. A new genomic blueprint of the human gut microbiota. Nature 568, 499â€“504 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Schnoes, A. M., Ream, D. C., Thorman, A. W., Babbitt, P. C. & Friedberg, I. Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput. Biol. 9, e1003063 (2013).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Rost, B., Liu, J., Nair, R., Wrzeszczynski, K. O. & Ofran, Y. Automatic prediction of protein function. Cell Mol. Life Sci. 60, 2637â€“2650 (2003).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Lee, D., Redfern, O. & Orengo, C. Predicting protein function from sequence and structure. Nat. Rev. Mol. Cell Biol. 8, 995â€“1005 (2007).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Xin, F. & Radivojac, P. Computational methods for identification of functional residues in protein structures. Curr. Protein Pept. Sci. 12, 456â€“469 (2011).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221â€“227 (2013).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Jiang, Y. et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 17, 184 (2016).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 244 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Jensen, L. J. et al. Prediction of human protein function from post-translational modifications and localization features. J. Mol. Biol. 319, 1257â€“1265 (2002).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Wass, M. N. & Sternberg, M. J. ConFuncâ€”functional annotation in the twilight zone. Bioinformatics 24, 798â€“806 (2008).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Clark, W. T. & Radivojac, P. Analysis of protein function and its prediction from amino acid sequence. Proteins 79, 2086â€“2096 (2011).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
You, R. et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 34, 2465â€“2473 (2018).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Korbel, J. O., Jensen, L. J., von Mering, C. & Bork, P. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat. Biotechnol. 22, 911â€“917 (2004).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Enault, F., Suhre, K. & Claverie, J. M. Phydbac â€˜Gene Function Predictorâ€™: a gene annotation tool based on genomic context analysis. BMC Bioinformatics 6, 247 (2005).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. & Yeates, T. O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA 96, 4285â€“4288 (1999).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Engelhardt, B. E., Jordan, M. I., Muratore, K. E. & Brenner, S. E. Protein molecular function prediction by Bayesian phylogenomics. PLoS Comput. Biol. 1, e45 (2005).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Pazos, F. & Sternberg, M. J. Automated prediction of protein function and detection of functional sites from structure. Proc. Natl Acad. Sci. USA 101, 14754â€“14759 (2004).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Deng, M., Zhang, K., Mehta, S., Chen, T. & Sun, F. Prediction of protein function using proteinâ€“protein interaction data. J. Comput. Biol. 10, 947â€“960 (2003).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Nabieva, E., Jim, K., Agarwal, A., Chazelle, B. & Singh, M. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21, i302â€“i310 (2005).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Wells, J. A. & McClendon, C. L. Reaching for high-hanging fruit in drug discovery at proteinâ€“protein interfaces. Nature 450, 1001â€“1009 (2007).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Brown, M. P. et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci. USA 97, 262â€“267 (2000).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
van Noort, V., Snel, B. & Huynen, M. A. Predicting gene function by conserved co-expression. Trends Genet. 19, 238â€“242 (2003).
ArticleÂ PubMedÂ Google ScholarÂ
Guan, Y. et al. Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol. 9, S3 (2008).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C. & Morris, Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 9, S4 (2008).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Piovesan, D. & Tosatto, S. C. E. INGA 2.0: improving protein function prediction for the dark proteome. Nucleic Acids Res. 47, W373â€“w378 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Bashiardes, S., Zilberman-Schapira, G. & Elinav, E. Use of metatranscriptomics in microbiome research. Bioinform. Biol. Insights 10, 19â€“25 (2016).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Franzosa, E. A. et al. Sequencing and beyond: integrating molecular â€˜omicsâ€™ for microbial community profiling. Nat. Rev. Microbiol. 13, 360â€“372 (2015).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Franzosa, E. A. et al. Relating the metatranscriptome and metagenome of the human gut. Proc. Natl Acad. Sci. USA 111, E2329â€“E2338 (2014).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Heintz-Buschart, A. et al. Integrated multi-omics of the human gut microbiome in a case study of familial type 1 diabetes. Nat. Microbiol. 2, 16180 (2016).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Lloyd-Price, J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655â€“662 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Coolen, M. J. & Orsi, W. D. The transcriptional response of microbial communities in thawing Alaskan permafrost soils. Front. Microbiol. 6, 197 (2015).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Vorobev, A. et al. Transcriptome reconstruction and functional analysis of eukaryotic marine plankton communities via high-throughput metagenomics and metatranscriptomics. Genome Res. 30, 647â€“659 (2020).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Lee, H. K., Hsu, A. K., Sajdak, J., Qin, J. & Pavlidis, P. Coexpression analysis of human genes across many microarray data sets. Genome Res. 14, 1085â€“1094 (2004).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Gaiteri, C., Ding, Y., French, B., Tseng, G. C. & Sibille, E. Beyond modules and hubs: the potential of gene coexpression networks for investigating molecular mechanisms of complex brain disorders. Genes Brain Behav. 13, 13â€“24 (2014).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Stuart, J. M., Segal, E., Koller, D. & Kim, S. K. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249â€“255 (2003).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
van Dam, S., Vosa, U., van der Graaf, A., Franke, L. & de Magalhaes, J. P. Gene co-expression analysis for functional classification and geneâ€“disease predictions. Brief. Bioinform. 19, 575â€“592 (2018).
PubMedÂ Google ScholarÂ
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25â€“29 (2000).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Zhang, Y. & Maharjan, S. biobakery/fugassem: FUGAsseM v0.3.8. Zenodo https://doi.org/10.5281/zenodo.16477039 (2025).
Hvidsten, T. R., Komorowski, J., Sandvik, A. K. & Laegreid, A. Predicting gene function from gene expressions and ontologies. In Proceedings of the Pacific Symposium on Biocomputing (eds Altman, R. B., Dunker, A. K., Hunter, L., Lauderdale, K. & Klein, T. E.) (World Scientific, 2001).
Zhou, X., Kao, M. C. & Wong, W. H. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA 99, 12783â€“12788 (2002).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Myers, C. L., Barrett, D. R., Hibbs, M. A., Huttenhower, C. & Troyanskaya, O. G. Finding function: evaluation methods for functional genomic data. BMC Genomics 7, 187 (2006).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43, D213â€“D221 (2015).
ArticleÂ PubMedÂ Google ScholarÂ
von Mering, C. et al. STRING: known and predicted proteinâ€“protein associations, integrated and transferred across organisms. Nucleic Acids Res. 33, D433â€“D437 (2005).
ArticleÂ Google ScholarÂ
Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282â€“1288 (2007).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Yellaboina, S., Tasneem, A., Zaykin, D. V., Raghavachari, B. & Jothi, R. DOMINE: a comprehensive collection of known and predicted domain-domain interactions. Nucleic Acids Res. 39, D730â€“D735 (2011).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27â€“30 (2000).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 42, D459â€“D471 (2014).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Yao, S. et al. NetGO 2.0: improving large-scale protein function prediction with massive sequence, text, domain, family and network information. Nucleic Acids Res. 49, W469â€“w475 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Kulmanov, M. & Hoehndorf, R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics 37, 1187 (2021).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
RodrÃguez Del RÃo, Ã. et al. Functional and evolutionary significance of unknown genes from uncultivated taxa. Nature 626, 377â€“384 (2024).
ArticleÂ PubMedÂ Google ScholarÂ
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679â€“682 (2022).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439â€“d444 (2022).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243â€“246 (2024).
ArticleÂ PubMedÂ Google ScholarÂ
Koppel, N., Maini Rekdal, V. & Balskus, E. P. Chemical transformation of xenobiotics by the human gut microbiota. Science 356, eaag2770 (2017).
ArticleÂ PubMedÂ Google ScholarÂ
Das, N. K. et al. Microbial metabolite signaling is required for systemic iron homeostasis. Cell Metab. 31, 115â€“130.e116 (2020).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Seyoum, Y., Baye, K. & Humblot, C. Iron homeostasis in host and gut bacteriaâ€”a complex interrelationship. Gut Microbes 13, 1â€“19 (2021).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Chen, Z. et al. The role of intestinal bacteria and gutâ€“brain axis in hepatic encephalopathy. Front. Cell. Infect. Microbiol. 10, 595759 (2020).
ArticleÂ PubMedÂ Google ScholarÂ
Galdiero, S. et al. Microbeâ€“host interactions: structure and role of gram-negative bacterial porins. Curr. Protein Pept. Sci. 13, 843â€“854 (2012).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Hogbom, M. & Ihalin, R. Functional and structural characteristics of bacterial proteins that bind host cytokines. Virulence 8, 1592â€“1601 (2017).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Jaehme, M. & Slotboom, D. J. Diversity of membrane transport proteins for vitamins in bacteria and archaea. Biochim. Biophys. Acta 1850, 565â€“576 (2015).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Fujita, M. et al. A TonB-dependent receptor constitutes the outer membrane transport system for a lignin-derived aromatic compound. Commun. Biol. 2, 432 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Connors, J., Dawe, N. & van Limbergen, J. The role of succinate in the regulation of intestinal inflammation. Nutrients 11, 25 (2018).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Boudreau, M. A., Fisher, J. F. & Mobashery, S. Messenger functions of the bacterial cell wall-derived muropeptides. Biochemistry 51, 2974â€“2990 (2012).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Hosaka, H., Kawamura, M., Hirano, T., Hakamata, W. & Nishio, T. Utilization of sucrose and analog disaccharides by human intestinal bifidobacteria and lactobacilli: search of the bifidobacteria enzymes involved in the degradation of these disaccharides. Microbiol. Res. 240, 126558 (2020).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Rawat, P. S., Li, Y., Zhang, W., Meng, X. & Liu, W. Hungatella hathewayi, an efficient glycosaminoglycan-degrading Firmicutes from human gut and its chondroitin ABC exolyase with high activity and broad substrate specificity. Appl. Environ. Microbiol. 88, e0154622 (2022).
ArticleÂ PubMedÂ Google ScholarÂ
Cullender, T. C. et al. Innate and adaptive immunity interact to quench microbiome flagellar motility in the gut. Cell Host Microbe 14, 571â€“581 (2013).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Lopez-Siles, M., Duncan, S. H., Garcia-Gil, L. J. & Martinez-Medina, M. Faecalibacterium prausnitzii: from microbiology to diagnostics and prognostics. ISME J. 11, 841â€“852 (2017).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Cornuault, J. K. et al. Phages infecting Faecalibacterium prausnitzii belong to novel viral genera that help to decipher intestinal viromes. Microbiome 6, 65 (2018).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Bai, Z. et al. Comprehensive analysis of 84 Faecalibacterium prausnitzii strains uncovers their genetic diversity, functional characteristics, and potential risks. Front. Cell. Infect. Microbiol. 12, 919701 (2022).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Koropatkin, N. M. & Smith, T. J. SusG: a unique cell-membrane-associated Î±-amylase from a prominent human gut symbiont targets complex starch molecules. Structure 18, 200â€“215 (2010).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Martens, E. C. et al. Recognition and degradation of plant cell wall polysaccharides by two human gut symbionts. PLoS Biol. 9, e1001221 (2011).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Wu, M. et al. Genetic determinants of in vivo fitness and diet responsiveness in multiple human gut Bacteroides. Science 350, aac5992 (2015).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Terrapon, N. et al. PULDB: the expanded database of polysaccharide utilization loci. Nucleic Acids Res. 46, D677â€“d683 (2018).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Pavarina, G. C., Lemos, E. G. M., Lima, N. S. M. & Pizauro, J. M. Jr. Characterization of a new bifunctional endo-1,4-Î²-xylanase/esterase found in the rumen metagenome. Sci. Rep. 11, 10440 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Carneiro, L. et al. Selective xyloglucan oligosaccharide hydrolysis by a GH31 Î±-xylosidase from Escherichia coli. Carbohydr. Polym. 284, 119150 (2022).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Lin, H. et al. Multiomics study reveals Enterococcus and Subdoligranulum are beneficial to necrotizing enterocolitis. Front. Microbiol. 12, 752102 (2021).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Shi, T. T. et al. Comparative assessment of gut microbial composition and function in patients with Gravesâ€™ disease and Gravesâ€™ orbitopathy. J. Endocrinol. Invest. 44, 297â€“310 (2021).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Girardin, S. E. et al. Nod1 detects a unique muropeptide from gram-negative bacterial peptidoglycan. Science 300, 1584â€“1587 (2003).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Hasegawa, M. et al. Differential release and distribution of Nod1 and Nod2 immunostimulatory molecules among bacterial species and environments. J. Biol. Chem. 281, 29054â€“29063 (2006).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Elshorbagy, A. et al. Amino acid changes during transition to a vegan diet supplemented with fish in healthy humans. Eur. J. Nutr. 56, 1953â€“1962 (2017).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Dong, Z., Sinha, R. & Richie, J. P. Jr. Disease prevention and delayed aging by dietary sulfur amino acid restriction: translational implications. Ann. N. Y. Acad. Sci. 1418, 44â€“55 (2018).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Whisstock, J. C. & Lesk, A. M. Prediction of protein function from protein sequence and structure. Q. Rev. Biophys. 36, 307â€“340 (2003).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Sleator, R. D. & Walsh, P. An overview of in silico protein function prediction. Arch. Microbiol. 192, 151â€“155 (2010).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Overbeek, R., Fonstein, M., Dâ€™Souza, M., Pusch, G. D. & Maltsev, N. The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA 96, 2896â€“2901 (1999).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Teichmann, S. A. & Babu, M. M. Conservation of gene co-regulation in prokaryotes and eukaryotes. Trends Biotechnol. 20, 407â€“410 (2002).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Eisenberg, D., Marcotte, E. M., Xenarios, I. & Yeates, T. O. Protein function in the post-genomic era. Nature 405, 823â€“826 (2000).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Sharan, R., Ulitsky, I. & Shamir, R. Network-based prediction of protein function. Mol. Syst. Biol. 3, 88 (2007).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Wang, P. I. & Marcotte, E. M. Itâ€™s the machine that matters: predicting gene function and phenotype from protein networks. J. Proteomics 73, 2277â€“2289 (2010).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Ryan, C. J. et al. High-resolution network biology: connecting sequence with function. Nat. Rev. Genet. 14, 865â€“879 (2013).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Costanzo, M. et al. The genetic landscape of a cell. Science 327, 425â€“431 (2010).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Costanzo, M. et al. Environmental robustness of the global yeast genetic interaction network. Science 372, eabf8424 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Serin, E. A., Nijveen, H., Hilhorst, H. W. & Ligterink, W. Learning from co-expression networks: possibilities and challenges. Front. Plant Sci. 7, 444 (2016).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Southard, J. N. Protein analysis using real-time PCR instrumentation: incorporation in an integrated, inquiry-based project. Biochem. Mol. Biol. Educ. 42, 142â€“151 (2014).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Cantalapiedra, C. P., HernÃ¡ndez-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825â€“5829 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Schirmer, M. et al. Dynamics of metatranscription in the inflammatory bowel disease gut microbiome. Nat. Microbiol. 3, 337â€“346 (2018).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Zhang, Y., Thompson, K. N., Huttenhower, C. & Franzosa, E. A. Statistical approaches for differential expression analysis in metatranscriptomics. Bioinformatics 37, i34â€“i41 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Parrow, N. L., Fleming, R. E. & Minnick, M. F. Sequestration and scavenging of iron in infection. Infect. Immun. 81, 3503â€“3514 (2013).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Sanchez-Jimenez, A., Marcos-Torres, F. J. & Llamas, M. A. Mechanisms of iron homeostasis in Pseudomonas aeruginosa and emerging therapeutics directed to disrupt this vital process. Microb. Biotechnol. 16, 1475â€“1491 (2023).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Kim, C. S. et al. Seasonal and spatial environmental influence on Opisthorchis viverrini intermediate hosts, abundance, and distribution: insights on transmission dynamics and sustainable control. PLoS Negl. Trop. Dis. 10, e0005121 (2016).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Isobe, K. & Ohte, N. Ecological perspectives on microbes involved in N-cycling. Microbes Environ. 29, 4â€“16 (2014).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Yi, M. et al. Temporal changes of microbial community structure and nitrogen cycling processes during the aerobic degradation of phenanthrene. Chemosphere 286, 131709 (2022).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Davila, A. M. et al. Intestinal luminal nitrogen metabolism: role of the gut microbiota and consequences for the host. Pharmacol. Res. 68, 95â€“107 (2013).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Hou, K. et al. Microbiota in health and diseases. Signal. Transduct. Target. Ther. 7, 135 (2022).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Fitzgerald, C. B. et al. Comparative analysis of Faecalibacterium prausnitzii genomes shows a high level of genome plasticity and warrants separation into new species-level taxa. BMC Genomics 19, 931 (2018).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Silas, S. et al. Type III CRISPRâ€“Cas systems can provide redundancy to counteract viral escape from type I systems. eLife 6, e27601 (2017).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Cuiv, P. O. et al. Isolation of genetically tractable most-wanted bacteria by metaparental mating. Sci. Rep. 5, 13282 (2015).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Deutscher, M. P. Degradation of RNA in bacteria: comparison of mRNA and stable RNA. Nucleic Acids Res. 34, 659â€“666 (2006).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Reck, M. et al. Stool metatranscriptomics: a technical guideline for mRNA stabilisation and isolation. BMC Genomics 16, 494 (2015).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Sczyrba, A. et al. Critical assessment of metagenome interpretationâ€”a benchmark of metagenomics software. Nat. Methods 14, 1063â€“1071 (2017).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Yue, Q. et al. Functional operons in secondary metabolic gene clusters in Glarea lozoyensis (Fungi, Ascomycota, Leotiomycetes). mBio 6, e00703 (2015).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Friedberg, I. Automated protein function predictionâ€”the genomic challenge. Brief. Bioinform. 7, 225â€“242 (2006).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Jeffery, C. J. Current successes and remaining challenges in protein function prediction. Front. Bioinform. 3, 1222182 (2023).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Abu-Ali, G. S. et al. Metatranscriptome of human faecal microbial communities in a cohort of adult men. Nat. Microbiol. 3, 356â€“366 (2018).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59â€“65 (2010).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834â€“841 (2014).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Franzosa, E. A. et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nat. Methods 15, 962â€“968 (2018).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10, e65088 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309â€“d314 (2019).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Thomas, A. M. et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. 25, 667â€“678 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Li, D., Liu, C. M., Luo, R., Sadakane, K. & Lam, T. W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674â€“1676 (2015).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Li, W., Jaroszewski, L. & Godzik, A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17, 282â€“283 (2001).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59â€“60 (2015).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357â€“359 (2012).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923â€“930 (2014).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Zhang, Y. et al. Metatranscriptomics for the human microbiome and microbial community functional profiling. Annu. Rev. Biomed. Data Sci. 4, 279â€“311 (2021).
ArticleÂ PubMedÂ Google ScholarÂ
Klingenberg, H. & Meinicke, P. How to normalize metatranscriptomic count data for differential expression analysis. PeerJ 5, e3859 (2017).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Yu, G., Wang, L. G., Han, Y. & He, Q. Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16, 284â€“287 (2012).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ

Download references

Acknowledgements

This work was supported in part by a grant from Takeda Pharmaceuticals (C.H.), a grant from Hillâ€™s Pet Nutrition (C.H.), National Institutes of Health (NIH) National Institute of Allergy and Infectious Diseases grant U19AI110820 (D. A. Rasko, University of Maryland), NIH National Institute of Diabetes and Digestive and Kidney Diseases grant R24DK110499 (C.H.), the National Natural Science Fund for Excellent Young Scientists Fund Program Overseas (Y.Z.), the Agricultural Science and Technology Innovation Program grant ASTIP (Y.Z.) and the startup package from Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (Y.Z.). The computations in this paper were run in part on the Faculty of Arts and Sciences (FAS) Research Computing Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.

Author information

These authors contributed equally: Eric A. Franzosa, Curtis Huttenhower.

Authors and Affiliations

Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
Yancong Zhang
Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Yancong Zhang,Â Amrisha Bhosle,Â Wendy S. Garrett,Â Eric A. FranzosaÂ &Â Curtis Huttenhower
Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
Yancong Zhang,Â Amrisha Bhosle,Â Sena Bae,Â Kelly Eckenrode,Â Xueying Huang,Â Jingjing Tang,Â Lana Awad,Â Ji Hua,Â Ya Wang,Â Xochitl C. Morgan,Â Eric A. FranzosaÂ &Â Curtis Huttenhower
Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
Yancong Zhang,Â Amrisha Bhosle,Â Sena Bae,Â Kelly Eckenrode,Â Xueying Huang,Â Jingjing Tang,Â Ya Wang,Â Xochitl C. Morgan,Â Wendy S. Garrett,Â Eric A. FranzosaÂ &Â Curtis Huttenhower
Department of Immunology and Infectious Diseases, Harvard T. H. Chan School of Public Health, Boston, MA, USA
Sena Bae,Â Wendy S. GarrettÂ &Â Curtis Huttenhower
Systems, Synthetic, and Quantitative Biology Program, Harvard University, Cambridge, MA, USA
Danylo Lavrentovich
Takeda Development Center Americas, Inc., Lexington, MA, USA
Bin LiÂ &Â Andy Krueger
Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
Wendy S. Garrett

Authors

Yancong Zhang
View author publications
Search author on:PubMedÂ Google Scholar
Amrisha Bhosle
View author publications
Search author on:PubMedÂ Google Scholar
Sena Bae
View author publications
Search author on:PubMedÂ Google Scholar
Kelly Eckenrode
View author publications
Search author on:PubMedÂ Google Scholar
Xueying Huang
View author publications
Search author on:PubMedÂ Google Scholar
Jingjing Tang
View author publications
Search author on:PubMedÂ Google Scholar
Danylo Lavrentovich
View author publications
Search author on:PubMedÂ Google Scholar
Lana Awad
View author publications
Search author on:PubMedÂ Google Scholar
Ji Hua
View author publications
Search author on:PubMedÂ Google Scholar
Ya Wang
View author publications
Search author on:PubMedÂ Google Scholar
Xochitl C. Morgan
View author publications
Search author on:PubMedÂ Google Scholar
Bin Li
View author publications
Search author on:PubMedÂ Google Scholar
Andy Krueger
View author publications
Search author on:PubMedÂ Google Scholar
Wendy S. Garrett
View author publications
Search author on:PubMedÂ Google Scholar
Eric A. Franzosa
View author publications
Search author on:PubMedÂ Google Scholar
Curtis Huttenhower
View author publications
Search author on:PubMedÂ Google Scholar

Contributions

Y.Z., E.A.F. and C.H. designed the methods. Y.Z., X.H., J.T., D.L., L.A., J.H. and E.A.F. carried out the evaluations and applications. Y.Z. implemented the software. Y.Z. and A.B. wrote the tutorial document and tested the software. Y.Z. drafted the paper with feedback from the other authors. S.B., K.E., Y.W., X.C.M., E.A.F., C.H. and all other authors edited the paper. B.L., A.K., W.S.G. and all other authors participated in interpretation of the primary findings. All authors approved the final paper.

Corresponding authors

Correspondence to Yancong Zhang, Eric A. Franzosa or Curtis Huttenhower.

Ethics declarations

Competing interests

C.H. is on the scientific advisory boards of Seres Therapeutics, Empress Therapeutics and ZOE Nutrition. W.S.G. is on the scientific advisory boards of Empress Therapeutics, Freya Biosciences, Sail Biosciences, Seres Therapeutics and the Gates Foundation, all unrelated to this study. W.S.G.â€™s laboratory has received funding from Merck and Astellas. B.L. and A.K. are Takeda employees and may own company stock. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisherâ€™s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Uncharacterized gut microbial proteins are coexpressed with known proteins and annotated using a two-layer ML model.

(a) Human gut microbial protein families detected in HMP2 metatranscriptomes were categorized into different levels of under-characterization. (b) The top 25 species with the most novel proteins (RH and NH) show diverse protein category composition, with uncharacterized proteins prevalent. (c) In contrast to the human gut microbiome, most proteins in E. coli K-12 strains are well-characterized. (d) Uncharacterized proteins exhibit similar MTX-based coexpression patterns to those of characterized proteins (nâ€‰=â€‰1,543 total gene pairs). â€˜knownâ€™ indicates transcript correlations among characterized proteins; â€˜hybridâ€™ between characterized and uncharacterized; â€˜unknownâ€™ among uncharacterized proteins. Box plots display the median (line at the 50th percentile), interquartile range (box spanning the 25th to 75th percentiles), whiskers (extending to 1.5Ã— IQR), and mean values (dark points). (e) Strong transcript correlations were also observed across different characterization levels (nâ€‰=â€‰1,331 total gene pairs); Box plots as in (d). (f) Network similarity between MTX-based and STRING-based coexpression patterns was significantly correlated with species reference representation. Pearson correlation coefficients (95% CI) and unadjusted P values are shown (nâ€‰=â€‰13 species). Reference representation was estimated by the percentage of UniRef90 homologs in HMP2 MGX previously profiled by MetaWIBELE. (g) â€˜Informativeâ€™ GO terms were defined as those with at least a minimum number of annotated proteins, while each child term contained fewer than this threshold. (h) FUGAsseM uses a two-layer machine learning architecture. In the first layer (green), random forest classifiers predict functional confidence scores from individual data types (purple). These scores are then integrated by a second-layer ensemble RF classifier (orange), generating a final confidence score (red).

Source data

Extended Data Fig. 2 FUGAsseM accurately predicts MF and CC terms in microbial communities.

(a-d) FUGAsseM predictions significantly correlated with STRING-based results (MF: nâ€‰=â€‰95 terms, 148 species; CC: nâ€‰=â€‰12 terms, 35 species) and identified additional terms in species lacking isolates (MF: nâ€‰=â€‰27 terms, 159 species; CC: nâ€‰=â€‰2 terms, 46 species). Pearson correlation coefficients (95% CI) and unadjusted P values shown. (e,f) FUGAsseM achieved comparable accuracy to DeepGOPlus and NetGO2.0; STRING data were available for 14 of 25 (MF) and 6 of 18 (CC) abundant species. (g,h) Hold-out validation confirmed high AUROC (MF: nâ€‰=â€‰12 terms; CC: nâ€‰=â€‰10 terms). (i,j) FUGAsseM predicted significantly higher scores (GSEA method; FDR-adjusted Pâ€‰<â€‰0.002 for multiple comparisons) for annotations gaining experimental support over time. Box plots display the median (line at the 50th percentile), interquartile range (box spanning the 25th to 75th percentiles), whiskers (extending to 1.5Ã— IQR), and mean values (dark points).

Source data

Extended Data Fig. 3 FUGAsseM achieves high accuracy and generalizability for GO term prediction.

(a-c) FUGAsseM achieved AUROCs comparable to state-of-the-art methods across GO categories: BP (nâ€‰=â€‰1,010 term-species pairs), MF (nâ€‰=â€‰1,282 term-species pairs), and CC (nâ€‰=â€‰92 term-species pairs). (d-e) When predicting BP annotations with newly accumulated experimental evidence, FUGAsseM-MTX (d) and FUGAsseM-full (e) maintain high accuracy with community-wide data compared to those with isolate-based data (nâ€‰=â€‰34 total terms for temporal hold-out evaluation). â€˜Accumulated evidenceâ€™: the annotations that lacked experimental evidence at T₀ but gained accumulated experimental validation from T₀ to T₁; â€˜New evidenceâ€™: totally unseen annotations at T₀ with accumulated experimental validation at T₁. Box plots display the median (line at the 50th percentile), interquartile range (box spanning the 25th to 75th percentiles), whiskers (extending to 1.5Ã— IQR), and mean values (dark points).

Source data

Extended Data Fig. 4 FUGAsseM accurately predicts annotations validated by experimental evidence.

(a,b) The FUGAsseM-full (a) and FUGAsseM-MTX (b) models retained high accuracy when predicting annotations for experimentally validated annotations from the gold standard dataset (nâ€‰=â€‰142 and 285 total terms evaluated with and without experimental evidence, respectively). Box plots display the median (line at the 50th percentile), interquartile range (box spanning the 25th to 75th percentiles), whiskers (extending to 1.5Ã— IQR), and mean values (dark points). (c,d) Distribution of Random Forest importance scores of FUGAsseM-full modelâ€™s second (data integration) layer, which predicted annotations with (c) (nâ€‰=â€‰3,725 term-species pairs for prediction) and without (d) (nâ€‰=â€‰11,513 term-species pairs for prediction) experimental time in the gold standard set. Box plots as in a.

Source data

Extended Data Fig. 5 Temporal hold-out evaluation by removing one data type at a time.

Removing MTX-based coexpression showed dominant influence on the performance of FUGAsseM to predict new annotations that accumulated experimental evidence over time for (a) Biological Processes, (b) Molecular Functions, and (c) Cellular Components. Averaged AUROC of each term was calculated across the HMP2 species used in this study. The top 25 GO terms with the highest average AUROC across species and methods are shown (or all available terms if fewer than 25 are present).

Source data

Extended Data Fig. 6 Threshold selection for high-confidence predictions of FUGAsseM.

(a) The thresholds of prediction confidence when achieving the maximum F1 score were heterozygous across models (here, Random Forest models) used for predicting each term per species (nâ€‰=â€‰21,785 total term-species pairs for prediction). Box plots display the median (line at the 50th percentile), interquartile range (box spanning the 25th to 75th percentiles), whiskers (extending to 1.5Ã— IQR), and mean values (dark points). (b) In addition, known annotations in UniProt tended to be predicted with higher confidence than those unknowns in all types of GO aspects. A threshold like 0.75 prediction probability looks strict enough to cover most of known annotations with high confidence. (c) Though the threshold of 0.75 achieved high recall while keeping new potentially true predictions, it still looks â€˜defaultâ€™ with low precision. We defined a â€˜stringentâ€™ threshold (that is, 0.85 prediction probability) that doubled the precision while maintaining recall. However, a more â€˜stringentâ€™ threshold makes it more possible to miss true new predictions meanwhile.

Source data

Extended Data Fig. 7 Functional repertoire of the gut microbial protein families is expanded by FUGAsseM in the HMP2.

(a-c) FUGAsseM assigned high-confidence annotations to previously uncharacterized protein families across all GO aspectsâ€”BP (a), MF (b), and CC (c)â€”in the HMP2. (dâ€“g) Among the top 25 most uncharacterized HMP2 species with the greatest number of novel proteins, many novel proteins were annotated with high-confidence MF and CC terms, improving functional coverage for both well-studied and poorly characterized taxa. Predictions were defined at two thresholds: â€˜defaultâ€™ (probability â‰¥ 0.75) and â€˜stringentâ€™ (â‰¥ 0.85). Categories include: â€˜no_annâ€™ (no high-confidence predictions), â€˜preserved_annâ€™ (proteins already annotated in UniProt), â€˜amp_ann (default/stringent)â€™ (known proteins with new predictions), and â€˜new_ann (default/stringent)â€™ (uncharacterized proteins with new annotations). Full results are in Supplementary Table 13. (h-i) FUGAsseM annotations also captured biologically relevant signals associated with inflammatory bowel disease (IBD). Gene Set Enrichment Analysis (GSEA) showed that proteins prioritized by MetaWIBELE in IBD were more strongly enriched in FUGAsseM-predicted annotations. Normalized enrichment scores (NES) revealed significant enrichment (h) and depletion (i) among protein families associated with IBD. Here, NES is an adjusted enrichment score by correcting differences of gene-set sizes, reflecting the degree to which a gene list is overrepresented at the top of a gene list ranked by MetaWIBELEâ€™s prioritization. The top 25 BP most significantly enriched terms (GSEA test; FDR-adjusted Pâ€‰<â€‰0.25 for multiple comparisons) are listed in decreasing order by NES.

Source data

Extended Data Fig. 8 Domain architectures support FUGAsseM function predictions in Faecalibacterium prausnitzii and Bacteroides thetaiotaomicron.

(a) Proteins predicted with viral life cycle (GO:0019058) showed similar domain architectures with the previously annotated proteins, clustering into two main phage clusters. (b) Similarly, proteins predicted with defend response to virus (GO:0051607) exhibited domain patterns characteristic of two distinct CRISPR systems. Proteins shown were also strongly coexpressed (Fig. 6c). (c) In B. thetaiotaomicron, proteins predicted with cellular carbohydrate catabolic process (GO:0044275) contained glycosyl hydrolase and outer membraneâ€”key components of the starch utilization system for carbohydrate metabolism. Domain architectures of randomly selected 25 B. thetaiotaomicronâ€™s proteins predicted with GO:0044275 (strongly coexpressed in Fig. 6e) are shown.

Source data

Extended Data Fig. 9 FUGAsseM predicted more protein families with informative GO annotations compared to eggNOG-mapper.

(a-c) Total number of annotated protein families by FUGAsseM (full model with integrated data) and eggNOG-mapper, including proteins with and without homologs to UniProtKB proteins. Only predictions with high confidence (prediction probability â‰¥0.75 for FUGAsseM, e-valueâ€‰<â€‰0.001 for eggNOG-mapper) are included for (a) Biological Processes, (b) Molecular Functions, and (c) Cellular Components. (d-f) Distribution of annotated protein families with high confidence per species are shown (nâ€‰=â€‰620 total method-species pairs). Statistical analysis was performed using two-sided unpaired Wilcoxon tests with unadjusted P values reported. Box plots display the median (line at the 50th percentile), interquartile range (box spanning the 25th to 75th percentiles), whiskers (extending to 1.5Ã— IQR), and outliers (values beyond 1.5Ã— IQR from the quartiles).

Source data

Extended Data Fig. 10 Uncharacterized proteins are predicted to be broadly distributed and species-specific MF and CC terms with high confidence in the human microbiome.

(a) Enumeration of the fraction of annotated taxa (first column), AUROC values per taxon (middle column; nâ€‰=â€‰3,849 total term-species pairs for prediction), and numbers of annotations (third column) for species-shared BP terms predicted by FUGAsseM in the HMP2. The top 15 terms with the largest number of species with at least one assignment are listed in decreasing order of average preserved annotations across species before running FUGAsseM (full results in Supplementary Table 15,16). â€˜Beforeâ€™ represents the number of annotations before running FUGAsseM. â€˜After (default)â€™ represents the number after running FUGAsseM with the â€˜defaultâ€™ threshold, and â€˜after (stringent)â€™ reflects values based on the â€˜stringentâ€™ threshold. Box plots display the median (line at the 50th percentile), interquartile range (box spanning the 25th to 75th percentiles), and whiskers (extending to 1.5Ã— IQR). (b) Fraction of annotated taxa (first column), AUROC values per taxon (middle column; nâ€‰=â€‰89 total term-species pairs for prediction), and numbers of annotations (third column) for species-specific MF terms predicted by FUGAsseM from the HMP2 cohort. The 15 least frequently predicted terms are listed in decreasing order of mean number of preserved annotations as in a (full results in Supplementary Table 17,18). (c) Fraction of annotated taxa (first column), AUROC values per taxon (middle column; nâ€‰=â€‰926 total term-species pairs for prediction), and numbers of annotations (third column) for species-shared CC terms predicted by FUGAsseM in the HMP2. CC terms are listed in decreasing order of mean number of preserved annotations as in a (full results in Supplementary Table 15,16). Box plots as in a.

Source data

Supplementary information

Supplementary Information

Supplementary Notes for results interpretation and Methods.

Reporting Summary

Supplementary Tables 1â€“25

Supplementary Table 1: Proportions of protein families detected by HMP2 MTX with different characterization levels. Supplementary Table 2: A list of informative GO terms defined in this study. Supplementary Table 3: Maximum MTX-based correlation coefficients between well-expressed characterized and uncharacterized proteins from the HMP2 species. Supplementary Table 4: Coexpression scores for the matched proteins detected by MTX and STRING. Supplementary Table 5: Mean AUROC across species per term of FUGAsseM based on MTX-based and STRING-based coexpression. Supplementary Table 6: Mean AUROC across species per term of FUGAsseM based on community-integrated data and STRING-integrated networks. Supplementary Table 7: Mean AUROC across terms per species of FUGAsseM based on MTX-based and STRING-based coexpression. Supplementary Table 8: Mean AUROC across terms per species of FUGAsseM based on community-integrated data and STRING-integrated networks. Supplementary Table 9: AUROC results of each method for each term within each species based on cross-validation evaluation. Supplementary Table 10: AUROC results of each method for each term within each species based on temporal hold-out evaluation. Supplementary Table 11: Prediction contribution of each data type for overall predictions. Supplementary Table 12: Prediction contribution of each data type for predictions with experimental evidence. Supplementary Table 13: Number of protein families with high-confidence predictions from the top 25 most uncharacterized species in the HMP2. Supplementary Table 14: Per-species fraction of protein families with functional annotations before and after processed by FUGAsseM-full. Supplementary Table 15: Per-species AUROC results of the top 15 most frequently predicted terms. Supplementary Table 16: Averaged fraction of protein families with the top 15 most frequently predicted terms before and after being processed by FUGAsseM-full. Supplementary Table 17: Per-species AUROC results of the top 15 least frequently predicted terms. Supplementary Table 18: Averaged fraction of protein families with the top 15 least frequently predicted terms before and after processed by FUGAsseM-full. Supplementary Table 19: Proteins predicted to GO:0071555. Supplementary Table 20: Proteins predicted to GO:0005984. Supplementary Table 21: Proteins predicted to GO:0019058 and GO:0051607 in F.â€‰prausnitzii. Supplementary Table 22: Proteins predicted to other functions in other taxa. Supplementary Table 23: Proteins predicted to GO:0044275 in B.â€‰thetaiotaomicron. Supplementary Table 24: Protein annotations of informative GO terms assigned by eggNOG-mapper. Supplementary Table 25: Abbreviations used in this study.

Source data

Source Data Fig. 1

Values of bar plots and box plots and statistical source data.

Source Data Fig. 2

Values of box plots, violin plots, point plots, scatter plots, heatmaps and density plots and statistical source data.

Source Data Fig. 3

Values of box plots and violin plots.

Source Data Fig. 4

Values of bar plots and point plots.

Source Data Fig. 5

Values of bar plots, box plots and point plots.

Source Data Fig. 6

Values of network plots.

Source Data Extended Data Fig. 1

Values of bar plots, box plots, point plots and scatter plots and statistical source data.

Source Data Extended Data Fig. 2

Values of box plots, violin plots, point plots, scatter plots, heatmaps and density plots and statistical source data.

Source Data Extended Data Fig. 3

Values of box plots, violin plots, point plots and density plots and statistical source data.

Source Data Extended Data Fig. 4

Values of box plots, violin plots and point plots.

Source Data Extended Data Fig. 5

Values of point plots.

Source Data Extended Data Fig. 6

Values of box plots, violin plots, point plots and line plots.

Source Data Extended Data Fig. 7

Values of bar plots and point plots.

Source Data Extended Data Fig. 8

Domain annotations of each protein.

Source Data Extended Data Fig. 9

Values of bar plots and box plots and statistical source data.

Source Data Extended Data Fig. 10

Values of bar plots, box plots and point plots.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the articleâ€™s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the articleâ€™s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zhang, Y., Bhosle, A., Bae, S. et al. Predicting functions of uncharacterized gene products from microbial communities. Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02813-7

Download citation

Received: 06 August 2024
Accepted: 15 August 2025
Published: 15 October 2025
DOI: https://doi.org/10.1038/s41587-025-02813-7

Subjects

Abstract

Similar content being viewed by others

Main

Results

MTX-based coexpression patterns capture comprehensive functional activity in microbial communities

FUGAsseM accurately predicts functions of uncharacterized proteins from microbial communities

MTX-based coexpression contributes substantially to FUGAsseM predictions

Predicted high-confidence functional annotations refine the functional landscape of the human microbiome

Uncharacterized proteins increase the diversity of species-shared and species-specific functions in the human gut

Newly annotated proteins in microbial nutrient use and phage defense

Discussion

Methods

Overview

FUGAsseM methodology

Protein families

Gold-standard set

Network data (prediction evidence)

Machine learning

Synthetic evaluation

Dataset overview

Evaluation approaches

Application to the HMP2

FUGAsseM analysis

Preparation of protein families

Preparation of gold-standard set

Preparation of network input data

Running FUGAsseM on the HMP2 data

eggNOG-mapper analysis

Protein categories

Functional enrichment

Separate analysis for MTX-based coexpression

MTX-based coexpression patterns

Comparison to STRING-based coexpression

Coexpression network

Abbreviation

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

Search

Quick links