TY - GEN
T1 - Mining genes in DNA using GeneScout
AU - Yin, Michael M.
AU - Wang, Jason
PY - 2002/12/1
Y1 - 2002/12/1
N2 - In this paper, we present a new system, called GeneScout, for predicting gene structures in vertebrate genomic DNA. The system contains specially designed hidden Markov models (HMMs) for detecting functional sites including protein-translation start sites, mRNA splicing junction donor and acceptor sites, etc. Our main hypothesis is that, given a vertebrate genomic DNA sequence S, it is always possible to construct a directed acyclic graph G such that the path for the actual coding region of S is in the set of all paths on G. Thus, the gene detection problem is reduced to that of analyzing the paths in the graph G. A dynamic programming algorithm is used to find the optimal path in G. The proposed system is trained using an expectation-maximization (EM) algorithm and its performance on vertebrate gene prediction is evaluated using the 10-way cross-validation method. Experimental results show the good performance of the proposed system and its complementarity to a widely used gene detection system.
KW - Bioinformatics
KW - Data mining
KW - Gene finding
KW - Hidden Markov models
KW - Knowledge discovery
UR - http://www.scopus.com/inward/record.url?scp=78149338925&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78149338925&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:78149338925
SN - 0769517544
SN - 9780769517544
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 733
EP - 736
BT - Proceedings - 2002 IEEE International Conference on Data Mining, ICDM 2002
T2 - 2nd IEEE International Conference on Data Mining, ICDM '02
Y2 - 9 December 2002 through 12 December 2002
ER -