TY - JOUR
T1 - DNA sequence classification via an expectation maximization algorithm and neural networks
T2 - A case study
AU - Ma, Qicheng
AU - Wang, Jason T.L.
AU - Shasha, Dennis
AU - Wu, Cathy H.
N1 - Funding Information:
Manuscript received June 1, 2001; revised October 1, 2001. This work was supported in part by NSF Grants IIS-9988345 and IIS-9988636. Q. Ma is with the Novartis Pharmaceuticals Corporation, Summit, NJ 07901 USA. J. T. L. Wang is with the Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102 USA. D. Shasha is with the Courant Institute of Mathematical Sciences, New York University, New York, NY 10012 USA. C. H. Wu is with the National Biomedical Research Foundation, Georgetown University Medical Center, NW, Washington, DC 20007 USA. Publisher Item Identifier S 1094-6977(01)11259-9.
PY - 2001/11
Y1 - 2001/11
N2 - This paper presents new techniques for biosequence classification, with a focus on recognizing E. coli promoters in DNA. Specifically, given an unlabeled DNA sequence S, we want to determine whether S is an E. coli promoter. We use an expectation maximization (EM) algorithm to locate the -35 and -10 binding sites in an E. coli promoter sequence. Our EM algorithm differs from previously published EM algorithms in that, instead of assuming a uniform distribution for the lengths of the spacer between the -35 and -10 binding sites and of the spacer between the -10 binding site and the transcriptional start site, it deduces the probability distribution for these lengths. Based on the located binding sites, we select features in each E. coli promoter sequence according to their information content and represent the features using an orthogonal encoding method. We then feed the features to a neural network (NN) for promoter recognition. Empirical studies show that the proposed approach achieves good performance on different datasets.
AB - This paper presents new techniques for biosequence classification, with a focus on recognizing E. coli promoters in DNA. Specifically, given an unlabeled DNA sequence S, we want to determine whether S is an E. coli promoter. We use an expectation maximization (EM) algorithm to locate the -35 and -10 binding sites in an E. coli promoter sequence. Our EM algorithm differs from previously published EM algorithms in that, instead of assuming a uniform distribution for the lengths of the spacer between the -35 and -10 binding sites and of the spacer between the -10 binding site and the transcriptional start site, it deduces the probability distribution for these lengths. Based on the located binding sites, we select features in each E. coli promoter sequence according to their information content and represent the features using an orthogonal encoding method. We then feed the features to a neural network (NN) for promoter recognition. Empirical studies show that the proposed approach achieves good performance on different datasets.
KW - Bayesian inference
KW - Bioinformatics
KW - Data mining
KW - Expectation maximization (EM)
KW - Neural networks (NNs)
KW - Promoter recognition
UR - http://www.scopus.com/inward/record.url?scp=0035521109&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0035521109&partnerID=8YFLogxK
U2 - 10.1109/5326.983930
DO - 10.1109/5326.983930
M3 - Article
AN - SCOPUS:0035521109
SN - 1094-6977
VL - 31
SP - 468
EP - 475
JO - IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews
JF - IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews
IS - 4
ER -