Comparative Study
Nucleic Acids Res. 2001 Jan 15;29(2):351-61.
doi: 10.1093/nar/29.2.351.

The estimation of statistical parameters for local alignment score distributions

S F Altschul et al. Nucleic Acids Res. 2001.

Abstract

The distribution of optimal local alignment scores of random sequences plays a vital role in evaluating the statistical significance of sequence alignments. These scores can be well described by an extreme-value distribution. The distribution's parameters depend upon the scoring system employed and the random letter frequencies; in general they cannot be derived analytically, but must be estimated by curve fitting. For obtaining accurate parameter estimates, a form of the recently described 'island' method has several advantages. We describe this method in detail, and use it to investigate the functional dependence of these parameters on finite-length edge effects.
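The extreme-value form mentioned in the abstract is commonly written as P(S ≥ x) ≈ 1 − exp(−K·m·n·e^(−λx)) for two random sequences of lengths m and n. The abstract does not spell this formula out, so the Python sketch below assumes that standard form. The function names and the parameter values in the usage lines are illustrative placeholders, not estimates from the paper.

```python
import math

def expected_count(score, m, n, lam, K):
    """Expected number of distinct local alignments scoring >= score
    (assumed form: K * m * n * exp(-lambda * score))."""
    return K * m * n * math.exp(-lam * score)

def tail_probability(score, m, n, lam, K):
    """P(S >= score) under the assumed extreme-value approximation.

    expm1 keeps precision when the expected count is very small.
    """
    return -math.expm1(-expected_count(score, m, n, lam, K))

# Usage with placeholder parameters (hypothetical values for illustration only).
p = tail_probability(score=60, m=300, n=300, lam=0.27, K=0.04)
print(f"P(S >= 60) is roughly {p:.3g}")
```

Because the tail probability depends exponentially on λ, a small relative error in the estimated λ becomes a large relative error in the reported significance, which is why accurate parameter estimates matter.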

PubMed Disclaimer

Figures

Figure 1
Islands in a local alignment path graph. (a) Schematic representation of the path graph. In every cell C, the red line records the choice made by the optimization procedure of the Smith–Waterman algorithm. These lines partition all cells with non-zero scores into islands according to the anchoring points (circles) to which they are connected. (b) Score landscape on a 50 × 50 path graph. The score at every cell of the path graph is represented by its height above the surface and is color-coded, with zero scores corresponding to blue areas and increasingly red colors for higher scores. The example shown is generated with a BLOSUM-62 scoring matrix and a score of –(11 + k) for each gap of length k. The islands are easily seen.
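The island partition described in this caption can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that an anchor is the first positive-scoring cell on a traceback path, it uses the caption's gap cost of 11 + k through affine-gap recurrences, and it takes a hypothetical `score` callback (a simple match/mismatch function in the usage lines) in place of BLOSUM-62.

```python
def island_peaks(a, b, score, gap_open=11, gap_ext=1):
    """Map each island anchor (i, j) to the highest score inside that island.

    A gap of length k costs gap_open + k * gap_ext, i.e. 11 + k by default.
    Anchors are 1-based cell coordinates (an assumption of this sketch).
    """
    NEG = float("-inf")
    m = len(b)
    H_prev, A_prev = [0] * (m + 1), [None] * (m + 1)      # best scores, anchors
    F_prev, FA_prev = [NEG] * (m + 1), [None] * (m + 1)   # vertical-gap state
    peaks = {}
    for i in range(1, len(a) + 1):
        H_cur, A_cur = [0] * (m + 1), [None] * (m + 1)
        F_cur, FA_cur = [NEG] * (m + 1), [None] * (m + 1)
        E, EA = NEG, None                                  # horizontal-gap state
        for j in range(1, m + 1):
            # Horizontal gap: open from the left neighbour or extend.
            open_e = H_cur[j - 1] - gap_open - gap_ext
            if open_e >= E - gap_ext:
                E, EA = open_e, A_cur[j - 1]
            else:
                E -= gap_ext
            # Vertical gap: open from the cell above or extend.
            open_f = H_prev[j] - gap_open - gap_ext
            if open_f >= F_prev[j] - gap_ext:
                F_cur[j], FA_cur[j] = open_f, A_prev[j]
            else:
                F_cur[j], FA_cur[j] = F_prev[j] - gap_ext, FA_prev[j]
            # Diagonal step: starting from a zero cell founds a new island.
            best = H_prev[j - 1] + score(a[i - 1], b[j - 1])
            anchor = A_prev[j - 1] if H_prev[j - 1] > 0 else (i, j)
            if E > best:
                best, anchor = E, EA
            if F_cur[j] > best:
                best, anchor = F_cur[j], FA_cur[j]
            if best > 0:
                # The cell inherits the anchor of the path that produced it.
                H_cur[j], A_cur[j] = best, anchor
                peaks[anchor] = max(peaks.get(anchor, 0), best)
        H_prev, A_prev, F_prev, FA_prev = H_cur, A_cur, F_cur, FA_cur
    return peaks

# Usage with a toy match/mismatch score (hypothetical stand-in for BLOSUM-62).
toy = lambda x, y: 5 if x == y else -4
print(island_peaks("ACGT", "ACGT", toy))
```

Each island contributes one peak score, so a single sequence comparison yields many data points rather than only the single optimal score.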
Figure 2
Schematic representation of a path graph used to avoid edge effects in the estimation of λ and K via the island method. The n × n scoring lattice (gray square in the middle) is surrounded by a border of width b. Only islands that are anchored within the central n × n area (shown in dark red) are counted. Islands anchored outside this area (green) are ignored. Note that some of the ignored islands reach into the inner area and some of the accepted islands reach into the border region since the classification of an island depends only on the position of its anchor (circles); borders thus are required on all sides to suppress edge effects properly.
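The acceptance rule in this caption depends only on where an island is anchored. A small Python sketch of that filter follows; it assumes islands are given as a mapping from 1-based anchor coordinates to peak scores on an (n + 2b) × (n + 2b) lattice. The function name and data layout are hypothetical.

```python
def central_islands(peaks, n, b):
    """Keep only islands whose anchor lies in the central n x n area.

    peaks: dict mapping 1-based anchor (i, j) to the island's peak score.
    The border of width b surrounds the central area on all four sides.
    """
    lo, hi = b + 1, b + n
    return {
        (i, j): s
        for (i, j), s in peaks.items()
        if lo <= i <= hi and lo <= j <= hi
    }

# Usage: with n = 2 and b = 2, only anchors in rows/columns 3..4 are counted.
example = {(1, 1): 20, (3, 3): 15, (6, 6): 40, (4, 2): 9}
print(central_islands(example, n=2, b=2))
```

An accepted island may still reach its peak score inside the border region; the filter deliberately ignores where the island extends.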
Figure 3
Estimates of λ obtained via the island method with different cutoffs c. Standard errors for the estimates are shown with error bars. The plotted horizontal line indicates the best estimate of the asymptotic λ. Details of the simulation are given in the legend to Table 1.
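One way to see how an estimate of λ can depend on a cutoff c is the following Python sketch. It assumes that integer island scores at or above c follow a geometric-like tail proportional to e^(−λx), for which the maximum-likelihood estimate is ln(1 + 1/(mean − c)). The page does not state the paper's estimator, so this form and the function name are assumptions.

```python
import math

def lambda_estimate(island_scores, cutoff):
    """Maximum-likelihood decay rate for integer scores >= cutoff,
    assuming a geometric-like tail proportional to exp(-lambda * x)."""
    kept = [s for s in island_scores if s >= cutoff]
    if not kept:
        raise ValueError("no island scores at or above the cutoff")
    excess = sum(kept) / len(kept) - cutoff  # mean score above the cutoff
    if excess <= 0:
        raise ValueError("all kept scores equal the cutoff; estimate is unbounded")
    return math.log1p(1.0 / excess)

# Usage with made-up scores (illustration only, not data from the paper).
print(lambda_estimate([30, 31, 33, 30, 36, 32], cutoff=30))
```

Raising c discards low-scoring islands, where the asymptotic tail may not yet hold, at the price of fewer data points and therefore larger standard errors.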
Figure 4
Estimates of λ derived from borderless n × n sequence comparisons by the island method, as a function of 1/n. Approximately 1 000 000 islands with a score of at least 37 were generated to produce the estimates, which thus have a standard error of 0.1%; the size of the symbols represents one standard error. The plotted line represents the theory of equation 9 for the apparent λ(n,n). The scoring system and random sequence model are the same as those described in the legend to Table 1.
Figure 5
The mean length l(x) of optimal island alignments, as a function of the alignment score x. Error bars, representing one standard error, grow with score primarily because the number of alignments on which the mean length estimates are based decreases. The plotted line represents a linear regression on the data for scores ≥47. Details of the simulation are given in the legend to Table 1.
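The regression mentioned in this caption can be sketched in Python as an ordinary least-squares line fitted only to points at or above a score threshold. The sketch is unweighted, which is an assumption: the caption does not say whether the fit accounts for the growing error bars. The data in the usage lines are synthetic.

```python
def fit_line(points, min_x):
    """Least-squares slope and intercept using only points with x >= min_x."""
    pts = [(x, y) for x, y in points if x >= min_x]
    if len(pts) < 2:
        raise ValueError("need at least two points at or above min_x")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Usage with synthetic (score, mean length) pairs, not values from the paper.
synthetic = [(40, 70.0), (47, 95.0), (50, 101.0), (53, 107.0)]
slope, intercept = fit_line(synthetic, min_x=47)
print(slope, intercept)
```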
