The estimation of statistical parameters for local alignment score distributions

S F Altschul¹, R Bundschuh, R Olsen, T Hwa

PMID: 11139604
PMCID: PMC29669
DOI: 10.1093/nar/29.2.351

Comparative Study

Nucleic Acids Res. 2001 Jan 15;29(2):351-61.

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Abstract

The distribution of optimal local alignment scores of random sequences plays a vital role in evaluating the statistical significance of sequence alignments. These scores can be well described by an extreme-value distribution. The distribution's parameters depend upon the scoring system employed and the random letter frequencies; in general they cannot be derived analytically, but must be estimated by curve fitting. For obtaining accurate parameter estimates, a form of the recently described 'island' method has several advantages. We describe this method in detail, and use it to investigate the functional dependence of these parameters on finite-length edge effects.
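As a minimal sketch of how such an extreme-value distribution is used in practice, the snippet below evaluates the standard Karlin-Altschul tail form P(S ≥ x) ≈ 1 − exp(−K·m·n·e^(−λx)) for two random sequences of lengths m and n. The function name `evd_p_value` and the numeric values of λ and K are illustrative assumptions, not values taken from this paper.

```python
import math


def evd_p_value(score, m, n, lam, K):
    """Approximate P(S >= score) under an extreme-value (Gumbel) model.

    m, n : lengths of the two compared sequences
    lam  : scale parameter (lambda) of the distribution
    K    : pre-factor parameter of the distribution
    """
    # Expected number of distinct local alignments scoring at least `score`.
    expected = K * m * n * math.exp(-lam * score)
    # 1 - exp(-expected), computed stably when `expected` is tiny.
    return -math.expm1(-expected)


# Hypothetical parameter values, chosen only to exercise the function.
p = evd_p_value(score=60, m=300, n=300, lam=0.27, K=0.04)
```

Because the probability depends exponentially on λ, small errors in the fitted λ translate into large errors in the reported significance, which is why accurate parameter estimation matters.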

Figures

**Figure 1**
Islands in a local alignment path graph. (a) Schematic representation of the path graph. In every cell C the red line recalls the choice made by the optimization procedure of the Smith–Waterman algorithm. By these lines, all the cells with non-zero scores are partitioned into islands according to which anchoring points (circles) they are connected to. (b) Score landscape on a 50 × 50 path graph. The score at every cell of the path graph is represented by its height above the surface and color-coded with zero scores corresponding to blue areas and increasingly red colors for higher scores. The example shown is generated with a BLOSUM-62 scoring matrix, and a score –(11 + k) for each gap of length k. The islands are easily seen.

**Figure 2**
Schematic representation of a path graph used to avoid edge effects in the estimation of λ and K via the island method. The n × n scoring lattice (gray square in the middle) is surrounded by a border of width b. Only islands that are anchored within the central n × n area (shown in dark red) are counted. Islands anchored outside this area (green) are ignored. Note that some of the ignored islands reach into the inner area and some of the accepted islands reach into the border region since the classification of an island depends only on the position of its anchor (circles); borders thus are required on all sides to suppress edge effects properly.

**Figure 3**
Estimates of λ obtained via the island method with different cutoffs c. Standard errors for the estimates are shown with error bars. The plotted horizontal line indicates the best estimate of the asymptotic λ. Details of the simulation are given in the legend to Table 1.
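To illustrate what an estimate at a given cutoff c might look like, here is a hedged sketch of a maximum-likelihood estimator for λ. It assumes integer island peak scores whose distribution above c is geometric, P(S = c + j) ∝ e^(−λj), which gives λ̂ = ln(1 + 1/(s̄ − c)) with s̄ the mean of the retained scores. Whether this matches the exact estimator behind the figure is an assumption, and the name `estimate_lambda` is hypothetical.

```python
import math


def estimate_lambda(island_scores, cutoff):
    """ML estimate of lambda from integer island peak scores >= cutoff.

    Assumes the scores at or above `cutoff` follow a geometric law,
    P(S = cutoff + j) proportional to exp(-lambda * j) for j = 0, 1, 2, ...
    """
    kept = [s for s in island_scores if s >= cutoff]
    if not kept:
        raise ValueError("no island reaches the cutoff")
    mean_excess = sum(kept) / len(kept) - cutoff
    if mean_excess <= 0:
        raise ValueError("all retained scores equal the cutoff")
    # For a geometric law, E[S - cutoff] = 1 / (exp(lambda) - 1).
    return math.log1p(1.0 / mean_excess)


# Toy input: the score 10 falls below the cutoff and is ignored.
lam_hat = estimate_lambda([30, 31, 32, 10], cutoff=30)
```

Raising the cutoff discards low-scoring islands that deviate from the asymptotic law, at the price of fewer retained islands and therefore larger standard errors.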

**Figure 4**
Estimates of λ derived from borderless n × n sequence comparisons by the island method, as a function of 1/n. Approximately 1 000 000 islands with a score of at least 37 were generated to produce the estimates, which thus have a standard error of 0.1%; the size of the symbols represents one standard error. The plotted line represents the theory of equation 9 for the apparent λ(n,n). The scoring system and random sequence model are the same as those described in the legend to Table 1.

**Figure 5**
The mean length l(x) of optimal island alignments, as a function of the alignment score x. Error bars, representing one standard error, grow with score primarily because the number of alignments on which the mean length estimates are based decreases. The plotted line represents a linear regression on the data for scores ≥47. Details of the simulation are given in the legend to Table 1.

Grants and funding

Wellcome Trust/United Kingdom
