We describe a method for discovering active motifs in a set of related protein sequences. The method is an automatic two step process: (1) find candidate motifs in a small sample of the sequences; (2) test whether these motifs are approximately present in all the sequences. To reduce the running time, we develop two optimization heuristics based on statistical estimation and pattern matching techniques. Experimental results obtained by running these algorithms on generated data and functionally related proteins demonstrate the good performance of the presented method compared with visual method of O'Farrell and Leopold. By combining the discovered motifs with an existing fingerprint technique, we develop a protein classifier. When we apply the classifier to the 698 groups of related proteins in the PROSITE catalog, it gives information that is complementary to the BLOCKS protein classifier of Henikoff and Henikoff. Thus, using our classifier in conjunction with theirs, one can obtain high confidence classifications (if BLOCKS and our classifier agree) or suggest a new hypothesis (if the two disagree).
All Science Journal Classification (ASJC) codes