GeneScout: A data mining system for predicting vertebrate genes in genomic DNA sequences

Michael M. Yin, Jason T.L. Wang

Research output: Contribution to journal › Article › peer-review

22 Scopus citations


Automated detection or prediction of coding sequences from within genomic DNA has been a major rate-limiting step in the pursuit of vertebrate genes. Programs currently available are far from being powerful enough to elucidate a gene structure completely. In this paper, we present a new system, called GeneScout, for predicting gene structures in vertebrate genomic DNA. The system contains specially designed hidden Markov models (HMMs) for detecting functional sites including protein-translation start sites, mRNA splicing junction donor and acceptor sites, etc. An HMM model is also proposed for exon coding potential computation. Our main hypothesis is that, given a vertebrate genomic DNA sequence S, it is always possible to construct a directed acyclic graph G such that the path for the actual coding region of S is in the set of all paths on G. Thus, the gene detection problem is reduced to that of analyzing the paths in the graph G. A dynamic programming algorithm is used to find the optimal path in G. The proposed system is trained using an expectation-maximization algorithm and its performance on vertebrate gene prediction is evaluated using the 10-way cross-validation method. Experimental results show that the proposed system performs well and is comparable to existing gene discovery tools.
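The path-analysis step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how G is built or how its paths are scored, so everything here is an assumption made for illustration: nodes are plain integers already numbered in topological order, each edge carries a hypothetical additive score, and "optimal" is taken to mean the highest total score. The function name `best_path` is likewise hypothetical and is not part of GeneScout.

```python
from collections import defaultdict


def best_path(num_nodes, edges, source, sink):
    """Return (score, path) for the highest-scoring source-to-sink path in a DAG.

    Assumption: every edge (u, v, score) satisfies u < v, so visiting nodes in
    increasing order is a valid topological order. Scores are hypothetical.
    """
    outgoing = defaultdict(list)
    for u, v, score in edges:
        if not u < v:
            raise ValueError("edges must go from a lower to a higher node index")
        outgoing[u].append((v, score))

    neg_inf = float("-inf")
    best = [neg_inf] * num_nodes   # best[v]: best score of any path source -> v
    prev = [None] * num_nodes      # prev[v]: predecessor of v on that best path
    best[source] = 0.0

    # Dynamic programming: relax outgoing edges in topological order.
    for u in range(num_nodes):
        if best[u] == neg_inf:
            continue  # u is not reachable from the source
        for v, score in outgoing[u]:
            if best[u] + score > best[v]:
                best[v] = best[u] + score
                prev[v] = u

    if best[sink] == neg_inf:
        return neg_inf, []  # no path from source to sink

    # Trace the predecessors back from the sink to recover the path.
    path = []
    node = sink
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    return best[sink], path


# Toy graph: node 0 is the source, node 4 the sink; scores are made up.
toy_edges = [
    (0, 1, 2.0), (0, 2, 1.0),
    (1, 3, 1.5), (2, 3, 3.0),
    (1, 4, 1.0), (3, 4, 0.5),
]
score, path = best_path(5, toy_edges, source=0, sink=4)
print(score, path)
```

Because each edge is examined exactly once, the running time is linear in the size of the graph, which is why a dynamic programming search stays tractable even when the set of all paths is very large.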

Original language: English (US)
Pages (from-to): 201-218
Number of pages: 18
Journal: Information Sciences
Issue number: 1-3
State: Published - Jun 14 2004

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Theoretical Computer Science
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence


Keywords

  • Bioinformatics
  • Gene finding
  • Hidden Markov models
  • Knowledge discovery
  • Soft computing

