Comparative Study
BMC Bioinformatics. 2003 Jun 3;4:21.
doi: 10.1186/1471-2105-4-21. Epub 2003 Jun 3.

EasyGene--a prokaryotic gene finder that ranks ORFs by statistical significance

Thomas Schou Larsen et al. BMC Bioinformatics. 2003.

Abstract

Background: Contrary to other areas of sequence analysis, a measure of statistical significance of a putative gene has not been devised to help in discriminating real genes from the masses of random Open Reading Frames (ORFs) in prokaryotic genomes. Therefore, many genomes have too many short ORFs annotated as genes.

Results: In this paper, we present a new automated gene-finding method, EasyGene, which estimates the statistical significance of a predicted gene. The gene finder is based on a hidden Markov model (HMM) that is automatically estimated for a new genome. Using extensions of similarities in Swiss-Prot, a high quality training set of genes is automatically extracted from the genome and used to estimate the HMM. Putative genes are then scored with the HMM, and based on score and length of an ORF, the statistical significance is calculated. The measure of statistical significance for an ORF is the expected number of ORFs in one megabase of random sequence at the same significance level or better, where the random sequence has the same statistics as the genome in the sense of a third order Markov chain.
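To make the phrase "the same statistics as the genome in the sense of a third order Markov chain" more concrete, the following minimal sketch estimates next-base frequencies for every three-base context and samples a random sequence from them. It is an illustration only: the function names (`fit_third_order`, `sample_sequence`), the uniform fallback for unseen contexts, and the explicit `start` seed are assumptions of this sketch, not details taken from EasyGene.

```python
import random
from collections import Counter, defaultdict


def fit_third_order(seq):
    """Count how often each base follows every 3-base context in seq."""
    counts = defaultdict(Counter)
    for i in range(3, len(seq)):
        counts[seq[i - 3:i]][seq[i]] += 1
    return counts


def sample_sequence(counts, length, rng, start):
    """Sample a sequence of the requested length from the fitted chain.

    start is a 3-base seed (hypothetical choice for this sketch).
    """
    out = list(start)
    while len(out) < length:
        context = "".join(out[-3:])
        followers = counts.get(context)
        if not followers:
            # Assumption: unseen contexts fall back to a uniform choice.
            base = rng.choice("ACGT")
        else:
            bases, weights = zip(*sorted(followers.items()))
            base = rng.choices(bases, weights=weights)[0]
        out.append(base)
    return "".join(out[:length])


if __name__ == "__main__":
    genome = "ATGGCGTACGTTAGCATGCGTACGATCGATCGTAGCTAGCTAGGCTA"
    model = fit_third_order(genome)
    print(sample_sequence(model, 60, random.Random(0), start=genome[:3]))
```

Random ORFs found in such a sampled sequence can then serve as the background against which real ORFs are compared.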

Conclusions: The result is a flexible gene finder whose overall performance matches or exceeds other methods. The entire pipeline of computer processing from the raw input of a genome or set of contigs to a list of putative genes with significance is automated, making it easy to apply EasyGene to newly sequenced organisms. EasyGene with pre-trained models can be accessed at http://www.cbs.dtu.dk/services/EasyGene.

Figures

Figure 1
The overall HMM architecture. Each box corresponds to a submodel with more than one state. The number above each box indicates the number of bases modelled by that submodel. An 'X' indicates a variable number.
Figure 2
Enlargement of the null model and the internal looped codons. LEFT: The state structure of the NULL model. The background state is of third order and models the general composition of the genome. The three shadow states model coding regions on the complementary strand. There are transitions from the background state to the first RBS state and to the first state modelling the start codon. RIGHT: Details of the model of internal codons. A codon is modelled by three states, with one transition from the last state back to the first and one out of the codon model. By putting several codon models in series, the length distribution of coding regions can be captured. From the last state there is a transition to the first state of the 'BSTOP' model, which models the last codon before the stop codon.
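The effect of putting looped codon models in series can be sketched with a small simulation. A single looped codon gives a geometric length distribution; chaining several gives a distribution with a peak away from the minimum length. The function name `sample_length` and the loop probability values are hypothetical and chosen only for illustration.

```python
import random


def sample_length(n_models, p_loop, rng):
    """Number of codons emitted by n_models looped codon submodels in series.

    Each submodel emits one codon, then loops back with probability p_loop
    or moves on to the next submodel with probability 1 - p_loop.
    """
    codons = 0
    for _ in range(n_models):
        codons += 1
        while rng.random() < p_loop:
            codons += 1
    return codons


if __name__ == "__main__":
    rng = random.Random(1)
    lengths = [sample_length(3, 0.99, rng) for _ in range(2000)]
    print(min(lengths), sum(lengths) / len(lengths))
```

With three submodels and a loop probability of 0.99, the expected length is 3 / (1 - 0.99) = 300 codons.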
Figure 3
The state structure of the RBS model. The RBS model consists of seven states for modelling the ribosome binding site followed by a set of tied states for the variable region between the RBS and the start codon. From the last state there is a transition to the first of the three states modelling the start codon.
Figure 4
Relationship between R, Γ and the variable length in codons, l'. The numbers are taken from the E. coli runs described in Results and Discussion, but the qualitative behavior is independent of the genome.
Figure 5
Gene length distribution imposed by HMM architecture. The model length distribution given by a negative binomial (equation 8 with n = 3) compared to the length histogram of set A genes for H. pylori J99.
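Equation 8 is not reproduced in this excerpt, so the following is a sketch under a stated assumption: it uses the standard negative binomial for the sum of n geometric lengths, P(l) = C(l-1, n-1) p^(l-n) (1-p)^n for l >= n, where p is a shared loop probability. The paper's exact parameterization may differ, and `neg_binomial_pmf` is a hypothetical name.

```python
from math import comb


def neg_binomial_pmf(length, n=3, p_loop=0.99):
    """Probability of a total of `length` codons from n looped submodels.

    Assumes every submodel shares the same loop probability p_loop.
    """
    if length < n:
        return 0.0
    return comb(length - 1, n - 1) * p_loop ** (length - n) * (1 - p_loop) ** n


if __name__ == "__main__":
    # Probabilities over a long range should sum to approximately 1.
    print(sum(neg_binomial_pmf(l, 3, 0.5) for l in range(1, 200)))
```

Under this parameterization the mean length is n / (1 - p), which is one way to choose p from the average gene length of a training set.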
Figure 6
Assessing the optimal number of HMM coding branches. Performance curves for 1, 2, 3, and 4 Markov branches of looped codon submodels for E. coli. The performance curves are constructed as follows. First, the positive R-values are sorted in ascending order for each of the 10 subsets of set T (test sets). Then, for each R-value in ascending order, we calculate the fraction of genes in set T scoring below R (true positive rate) and the fraction of ORFs (with lengths greater than or equal to 20 codons) in one megabase of double-stranded sequence scoring below R (false positive rate). The resulting 10 files of true and false positive rates are concatenated, and 30 false positive cutoffs are selected (from 0 to 0.15 in steps of 0.005). The false positive entries in the 10 files that fall between consecutive cutoffs are found, and the corresponding true positive entries are averaged. Each average false positive rate (halfway between two consecutive cutoffs) is thus associated with an average true positive rate, and these pairs are plotted.
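The curve construction described in this caption might be sketched as follows. The function names (`rates`, `binned_curve`), the argument `n_random_orfs`, and the choice to count values equal to the threshold as "below" are assumptions of this sketch rather than details given in the caption.

```python
from bisect import bisect_right


def rates(gene_r, random_r, n_random_orfs):
    """Return (false positive rate, true positive rate) per gene R-value.

    gene_r: R-values of test-set genes; random_r: R-values of ORFs in the
    random sequence; n_random_orfs: total ORFs used as the denominator.
    """
    genes = sorted(gene_r)
    rand = sorted(random_r)
    points = []
    for r in genes:
        tp = bisect_right(genes, r) / len(genes)
        fp = bisect_right(rand, r) / n_random_orfs
        points.append((fp, tp))
    return points


def binned_curve(points, step=0.005, n_bins=30):
    """Average true positive rates within consecutive false positive bins."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for fp, tp in points:
        k = int(fp / step)
        if 0 <= k < n_bins:
            sums[k] += tp
            counts[k] += 1
    # Each bin is reported at its midpoint; empty bins are skipped.
    return [((k + 0.5) * step, sums[k] / counts[k])
            for k in range(n_bins) if counts[k]]
```

In use, the points from all 10 test subsets would be concatenated before being passed to `binned_curve`.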
Figure 7
Assessing the optimal order of looped codon states. Performance curves for 3rd, 4th and 5th order Markov states of looped codon submodels for E. coli. For an explanation of how the performance curves are constructed, see the caption of Figure 6.
Figure 8
Comparing significance and log-odds. Performance curves comparing significance and log-odds scores for E. coli. For an explanation of how the performance curves are constructed, see the caption of Figure 6.
Figure 9
Statistical characteristics of random sequences. The top two panels show the mean and variance of log-odds scores versus variable ORF length in random sequences (E. coli model). The bottom panel shows a logarithmic plot of the length distribution of random ORFs. The linear regression lines are shown in all three plots.
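The regression lines in this figure suggest how a length-normalized score could be computed. The sketch below fits a least-squares line and then standardizes a log-odds score using the fitted mean and variance at a given ORF length. The formula used here, (score - mean(l)) / sqrt(var(l)), is an assumption about how the standard score Γ is defined; this excerpt does not state it. The names `fit_line` and `standard_score` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standard_score(score, length, mean_line, var_line):
    """Standardize a log-odds score by the fitted mean and variance.

    mean_line and var_line are (slope, intercept) pairs from fit_line,
    describing random-ORF mean and variance as linear functions of length.
    """
    mean = mean_line[0] * length + mean_line[1]
    var = var_line[0] * length + var_line[1]
    return (score - mean) / var ** 0.5
```

For example, with a fitted mean of -0.5 per codon and a fitted variance of 0.04 per codon, a 100-codon ORF scoring -10 would lie 20 standard deviations above the random expectation.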
Figure 10
Comparing predicted and found number of false positives. Empirical and theoretical number of false positives per Mb double-stranded random sequence according to the E. coli model.
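The empirical side of this comparison can be illustrated with a simple count: the number of random ORFs scoring at or above a threshold, divided by the amount of random sequence searched. The function name, its arguments, and the convention that higher scores are better are assumptions made for this sketch.

```python
def false_positives_per_mb(random_scores, threshold, megabases):
    """Empirical number of random ORFs per Mb scoring at or above threshold.

    random_scores: scores of ORFs found in random sequence;
    megabases: total length of random sequence searched, in Mb.
    """
    hits = sum(1 for s in random_scores if s >= threshold)
    return hits / megabases


if __name__ == "__main__":
    scores = [1.0, 2.5, 3.0, 0.2]
    print(false_positives_per_mb(scores, threshold=2.0, megabases=2.0))
```

Sweeping the threshold and plotting this count against the theoretical expectation gives a comparison of the kind the caption describes.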
Figure 11
Probability density functions for the standard score. Empirical (dots) and theoretical (line) probability density functions for the standard scores (Γ) in random sequences (E. coli model). The lower plot is an enlargement of the distribution tail.
