Nucleic Acids Res. 2006;34(20):5966-73.
doi: 10.1093/nar/gkl731. Epub 2006 Oct 26.

Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches

Yi-Kuo Yu et al. Nucleic Acids Res. 2006.

Abstract

Protein sequence database search programs may be evaluated both for their retrieval accuracy (the ability to separate meaningful from chance similarities) and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.

Figures

Figure 1
The accuracy of BLAST statistics. 10 000 shuffled mouse sequences were compared to shuffled human RefSeq (20) sequences from Build 35 of the human genome. The number of queries whose best match had a reported P-value ≤ x is plotted against x, using a log–log scale. Curves are shown for B-BLAST, S-BLAST, SU-BLAST, C-BLAST and CU-BLAST. The diagonal line indicates the theoretical prediction for all curves. The vertical line at x = 10^−4 indicates the point at which a single query with equal or better P-value is expected.
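The calibration check described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `calibration_counts` and its inputs are hypothetical, and it assumes one best-match P-value has already been collected per shuffled query. If the reported P-values are accurate, the observed count at threshold x should be close to N·x for N queries.

```python
from typing import List, Sequence, Tuple


def calibration_counts(
    best_p_values: Sequence[float], thresholds: Sequence[float]
) -> List[Tuple[float, int, float]]:
    """Return (x, observed, expected) for each threshold x.

    observed = number of queries whose best-match P-value is <= x
    expected = N * x, the count predicted if P-values are accurate
    """
    n_queries = len(best_p_values)
    rows = []
    for x in thresholds:
        observed = sum(1 for p in best_p_values if p <= x)
        rows.append((x, observed, n_queries * x))
    return rows


if __name__ == "__main__":
    # Toy input only; the study used 10 000 shuffled queries.
    for x, observed, expected in calibration_counts(
        [0.5, 0.01, 0.2, 0.9], [0.1, 1.0]
    ):
        print(f"x={x:g} observed={observed} expected={expected:g}")
```

Plotting the observed counts against x on log-log axes, with the line N·x as the reference, gives a plot of the kind the caption describes.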
Figure 2
BLAST retrieval accuracy. The 3586 astral40 sequences having at least one other relative in the astral40 data set (18,19) are used as queries in a search of this database. The results are pooled and sorted by E-value, and ROC curves are produced by plotting the number of true positives against the number of false positives as one descends the retrieval list. In (A), ROC curves are shown for B-BLAST, S-BLAST, SU-BLAST, C-BLAST and CU-BLAST. The ROC_5000 scores for these programs are also shown, each having a standard error of ± 0.0002 (2). In (B), the same ROC curves are shown in a semi-log plot, using the scales of coverage and errors per query (30).
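The pooling and ranking procedure in this caption can be illustrated with a short sketch of a truncated ROC score. It assumes the usual ROC_n definition (the area under the ROC curve up to the n-th false positive, normalized by n times the total number of true positives). The function name `roc_n` and the `(e_value, is_true_positive)` input layout are hypothetical, and ties in E-value are broken arbitrarily by input order.

```python
from typing import Iterable, Tuple


def roc_n(hits: Iterable[Tuple[float, bool]], n: int, total_true: int) -> float:
    """Truncated ROC score over pooled (e_value, is_true_positive) hits.

    total_true is the number of true relationships that could be found,
    including any that never appear in the retrieval list.
    """
    ranked = sorted(hits, key=lambda hit: hit[0])  # lowest E-value first
    true_seen = 0
    false_seen = 0
    area = 0
    for _, is_true in ranked:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    # If the list ran out before n false positives, credit the remaining
    # slots with every true positive seen so far.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


if __name__ == "__main__":
    toy_hits = [(1e-10, True), (1e-8, True), (1e-5, False), (1e-3, True), (0.1, False)]
    print(roc_n(toy_hits, n=2, total_true=3))
```

A score of 1.0 means every true positive outranks the first false positive; the ROC_5000 scores in the caption correspond to n = 5000 under this definition.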
Figure 3
The probability density function for λ. The ungapped scale parameter λ (26) is calculated for the standard BLOSUM-62 amino acid substitution matrix (22) using the observed amino acid frequencies of two proteins. Empirical probability density functions are shown for all pairs of unrelated proteins from the astral40 data set (18,19), as well as for all pairs of non-identical related proteins.
Figure 4
The empirical probability density of alignment and compositional P-values from shuffled sequences. 10 000 shuffled mouse query sequences were compared using S-BLAST to shuffled human RefSeq (20) sequences from Build 35 of the human genome. For each query, the alignment P-value Pa of the best match to the database was found, and the compositional P-value Pc was then calculated for the database sequence involved.
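The abstract says the alignment and compositional measures are combined into a unified measure, but it does not give the formula. As a hedged illustration only, the sketch below uses the textbook rule for two independent P-values: if Pa and Pc are independent and uniform under the null model, the probability that their product is at most p = Pa·Pc equals p(1 − ln p), which is Fisher's method with two terms. The function name is hypothetical, and the paper may weight or combine the two values differently.

```python
import math


def combine_independent_p_values(p_a: float, p_c: float) -> float:
    """Combined P-value for two independent P-values via their product.

    Assumption: p_a and p_c are independent and uniform on (0, 1] under
    the null model. This is a generic rule, not the paper's formula.
    """
    if not (0.0 < p_a <= 1.0 and 0.0 < p_c <= 1.0):
        raise ValueError("P-values must lie in (0, 1]")
    product = p_a * p_c
    if product == 0.0:
        raise ValueError("product underflowed; work with log P-values instead")
    # P(U1 * U2 <= product) for independent uniforms U1, U2.
    return product * (1.0 - math.log(product))


if __name__ == "__main__":
    print(combine_independent_p_values(1e-4, 0.05))
```

The combined value is always at least as large as the raw product, so multiplying the two P-values together without this correction would overstate significance.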

References

    1. Gribskov M., Robinson N.L. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 1996;20:25–33.
    2. Schäffer A.A., Aravind L., Madden T.L., Shavirin S., Spouge J.L., Wolf Y.I., Koonin E.V., Altschul S.F. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001;29:2994–3005.
    3. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
    4. Altschul S.F., Koonin E.V. Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases. Trends Biochem. Sci. 1998;23:444–447.
    5. Pearson W.R., Sierk M.L. The limits of protein sequence comparison? Curr. Opin. Struct. Biol. 2005;15:254–260.