PIDA: A new algorithm for pattern identification



OJB™

Online Journal of Bioinformatics ©

Volume 8 (1):30-40, 2007


Putonti C¹,², Pettitt BM¹,²,³, Reid JG³, Fofanov Y¹,²

¹Department of Computer Science, ²Department of Biology and Biochemistry, and ³Department of Chemistry, University of Houston, Houston, Texas, USA

ABSTRACT

Putonti C, Pettitt BM, Reid JG, Fofanov Y. PIDA: A new algorithm for pattern identification. Online J Bioinformatics, 8 (1):30-40, 2007.

Algorithms for motif identification in sequence space have predominantly focused on recognizing patterns of a fixed length that contain regions of perfect conservation, possibly separated by regions of unconstrained sequence. Such motifs occur in everything from proteins with distinct active sites to non-coding RNAs with specific structural elements that are necessary to maintain functionality. When an insertion or deletion occurs within an unconstrained portion of a pattern, the pattern may retain its functionality. In that case the length of the pattern becomes variable, and the pattern may be overlooked by existing motif detection methods. The Pattern Island Detection Algorithm (PIDA) presented here was developed to recognize patterns whose occurrences vary in length, within sequences over an alphabet of any size. PIDA first identifies all regions of perfect conservation longer than a user-specified threshold and builds those conservation “islands” into fixed-length patterns. The algorithm then modifies these fixed-length patterns by identifying additional, different islands that can be incorporated into each pattern through insertions/deletions within the “water” separating the islands. To provide benchmarks for this analysis, PIDA was used to search for patterns within randomly generated sequences as well as within sequences known to contain conserved patterns. For each pattern found, the statistical significance is calculated from the pattern’s likelihood of appearing by chance, which provides a means of identifying the patterns that are likely to have a functional role.
The PIDA approach to motif finding is designed to perform best when searching for patterns of variable length, although it can also identify patterns of a fixed length. PIDA has been designed to be as generally applicable as possible, because sequence problems of this type are varied: transcription factor binding sites in DNA, structural motifs in non-coding RNA, and high-contact-order domains in certain proteins. The algorithm was implemented in C++ and is freely available upon request from the authors.
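
The island-and-water idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' C++ implementation; it is a simplified Python illustration under stated assumptions: islands are taken to be exact k-mers of a single length `k` shared by every input sequence (no extension to maximal islands), only ordered pairs of islands are considered, and the function names `find_islands`, `water_lengths`, and `pair_islands`, as well as the `max_water` cutoff, are hypothetical.

```python
from collections import defaultdict


def find_islands(seqs, k):
    """Return {kmer: {seq_index: [start, ...]}} for k-mers found in every sequence."""
    hits = defaultdict(lambda: defaultdict(list))
    for i, s in enumerate(seqs):
        for p in range(len(s) - k + 1):
            hits[s[p:p + k]][i].append(p)
    # Keep only k-mers that are perfectly conserved across all sequences.
    return {kmer: dict(pos) for kmer, pos in hits.items() if len(pos) == len(seqs)}


def water_lengths(islands, a, b, k, n_seqs):
    """Smallest non-negative gap between island a and a later island b, per sequence.

    Returns None if, in some sequence, b never follows a without overlapping it.
    """
    gaps = []
    for i in range(n_seqs):
        candidates = [pb - (pa + k)
                      for pa in islands[a][i]
                      for pb in islands[b][i]
                      if pb >= pa + k]
        if not candidates:
            return None
        gaps.append(min(candidates))
    return gaps


def pair_islands(seqs, k, max_water):
    """Classify ordered island pairs as 'fixed' or 'variable' length patterns.

    max_water is an assumed cutoff on the water length, not a PIDA parameter.
    """
    islands = find_islands(seqs, k)
    patterns = {}
    for a in islands:
        for b in islands:
            if a == b:
                continue
            gaps = water_lengths(islands, a, b, k, len(seqs))
            if gaps is None or max(gaps) > max_water:
                continue
            kind = "fixed" if len(set(gaps)) == 1 else "variable"
            patterns[(a, b)] = (kind, gaps)
    return patterns


# Toy DNA sequences: two conserved islands separated by water of differing length.
seqs = ["TTACGTACGGCAT", "CACGTTTTGGCAA", "GACGTCGGCACC"]
print(pair_islands(seqs, k=4, max_water=5))
```

For these three toy sequences the only shared 4-mers are `ACGT` and `GGCA`, separated by 2, 3, and 1 characters of water respectively, so the pair is reported as a variable-length pattern. A method that assumes a fixed spacing between the two islands would not group these three occurrences together.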

KEY WORDS: pattern discovery, motif conservation, variable length patterns
