Extended Data Fig. 6: Threshold selection for high-confidence predictions of FUGAsseM.
From: Predicting functions of uncharacterized gene products from microbial communities

(a) The prediction-confidence thresholds that achieved the maximum F1 score were heterogeneous across the models (here, Random Forest models) used to predict each term per species (n = 21,785 total term-species pairs for prediction). Box plots display the median (line at the 50th percentile), interquartile range (IQR; box spanning the 25th to 75th percentiles), whiskers (extending to 1.5× IQR), and mean values (dark points). (b) In addition, known annotations in UniProt tended to be predicted with higher confidence than unknown ones across all GO aspects. A threshold of 0.75 prediction probability appears strict enough to cover most known annotations with high confidence. (c) Although the 0.75 threshold achieved high recall while retaining potentially true new predictions, it is best regarded as a "default" setting because its precision is low. We defined a "stringent" threshold (that is, 0.85 prediction probability) that doubled the precision while maintaining recall. However, a more stringent threshold also makes it more likely that true new predictions are missed.
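To make the threshold logic in panels (a) and (c) concrete, the following is a minimal Python sketch, not the FUGAsseM implementation. It scans candidate thresholds for one hypothetical term-species model, picks the one with the highest F1, and then reports precision and recall at the fixed 0.75 and 0.85 cut-offs. The function names (`best_f1_threshold`, `precision_recall_at`) and the toy labels and scores are assumptions made up for illustration only.

```python
def _confusion(y_true, y_score, threshold):
    """Count TP, FP, FN when calling a prediction positive if score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_score) if s < threshold and y == 1)
    return tp, fp, fn


def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) for the observed score that maximizes F1."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(y_score)):  # each observed score is a candidate cut-off
        tp, fp, fn = _confusion(y_true, y_score, t)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


def precision_recall_at(y_true, y_score, threshold):
    """Return (precision, recall) at a fixed prediction-probability threshold."""
    tp, fp, fn = _confusion(y_true, y_score, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (invented): 1 = known annotation, 0 = not annotated.
toy_true = [1, 1, 1, 0, 1, 0, 0, 0]
toy_score = [0.95, 0.90, 0.80, 0.78, 0.60, 0.55, 0.30, 0.10]

print(best_f1_threshold(toy_true, toy_score))
print(precision_recall_at(toy_true, toy_score, 0.75))
print(precision_recall_at(toy_true, toy_score, 0.85))
```

In this toy example the F1-optimal threshold differs from both fixed cut-offs, which mirrors the per-model heterogeneity described in panel (a). Unlike the result in panel (c), recall drops here when moving from 0.75 to 0.85, so the example illustrates only the mechanics of the comparison and not the reported outcome.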