Extended Data Fig. 1: Uncharacterized gut microbial proteins are coexpressed with known proteins and annotated using a two-layer ML model.
From: Predicting functions of uncharacterized gene products from microbial communities

(a) Human gut microbial protein families detected in HMP2 metatranscriptomes were categorized into different levels of under-characterization. (b) The top 25 species with the most novel proteins (RH and NH) show diverse protein category composition, with uncharacterized proteins prevalent. (c) In contrast to the human gut microbiome, most proteins in E. coli K-12 strains are well-characterized. (d) Uncharacterized proteins exhibit similar MTX-based coexpression patterns to those of characterized proteins (n = 1,543 total gene pairs). ‘known’ indicates transcript correlations among characterized proteins; ‘hybrid’ between characterized and uncharacterized; ‘unknown’ among uncharacterized proteins. Box plots display the median (line at the 50th percentile), interquartile range (box spanning the 25th to 75th percentiles), whiskers (extending to 1.5× IQR), and mean values (dark points). (e) Strong transcript correlations were also observed across different characterization levels (n = 1,331 total gene pairs); box plots as in (d). (f) Network similarity between MTX-based and STRING-based coexpression patterns was significantly correlated with species reference representation. Pearson correlation coefficients (95% CI) and unadjusted P values are shown (n = 13 species). Reference representation was estimated by the percentage of UniRef90 homologs in HMP2 MGX previously profiled by MetaWIBELE. (g) ‘Informative’ GO terms were defined as those with at least a minimum number of annotated proteins, while each child term contained fewer than this threshold. (h) FUGAsseM uses a two-layer machine learning architecture. In the first layer (green), random forest classifiers predict functional confidence scores from individual data types (purple). These scores are then integrated by a second-layer ensemble RF classifier (orange), generating a final confidence score (red).
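
The ‘informative’ GO term rule in (g) can be illustrated with a minimal Python sketch. This is not the FUGAsseM implementation: the function name `informative_terms`, the toy term identifiers, and the threshold of 3 are hypothetical, and the sketch assumes that each term's protein set already includes proteins annotated to its descendants.

```python
from typing import Dict, List, Set


def informative_terms(
    annotations: Dict[str, Set[str]],
    children: Dict[str, List[str]],
    min_proteins: int,
) -> Set[str]:
    """Select terms with at least `min_proteins` annotated proteins
    whose direct children each have fewer than `min_proteins`.

    Assumption: `annotations` is already propagated up the hierarchy,
    so a parent's protein set contains those of its descendants.
    """

    def count(term: str) -> int:
        return len(annotations.get(term, set()))

    selected: Set[str] = set()
    for term in annotations:
        # The term itself must reach the threshold.
        if count(term) < min_proteins:
            continue
        # Every child must fall below the threshold (vacuously true for leaves).
        if all(count(child) < min_proteins for child in children.get(term, [])):
            selected.add(term)
    return selected


# Toy hierarchy (hypothetical identifiers): root -> {a, b}, a -> {a1}
toy_annotations = {
    "root": {"p1", "p2", "p3", "p4", "p5"},
    "a": {"p1", "p2", "p3"},
    "b": {"p4"},
    "a1": {"p1"},
}
toy_children = {"root": ["a", "b"], "a": ["a1"]}

selected = informative_terms(toy_annotations, toy_children, min_proteins=3)
print(sorted(selected))  # ['a']
```

In this toy hierarchy, `root` is excluded because its child `a` also reaches the threshold, so `a` is the most specific term that still has enough annotated proteins.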