©1996-2007 All Rights Reserved. Online Journal of Bioinformatics.
This journal satisfies the refereeing requirements
(DEST) for the Higher Education Research Data Collection (Australia).
OJB™
Online Journal of Bioinformatics ©
Volume 8 (1):30-40, 2007
PIDA: A new algorithm for pattern identification
Putonti C1,2,
1Department of Computer Science, 2Department
of Biology and Biochemistry, and 3Department of Chemistry,
ABSTRACT
Putonti C, Pettitt BM, Reid JG, Fofanov Y. PIDA: A new algorithm
for pattern identification. Online J Bioinformatics, 8 (1):30-40, 2007. Algorithms for
motif identification in sequence space have predominantly focused on
recognizing patterns of a fixed length containing regions of perfect
conservation with possible regions of unconstrained sequence. Such motifs
can be found in everything from proteins with distinct active sites, to
non-coding RNAs with specific structural elements
that are necessary to maintain functionality. If an
insertion/deletion occurs within an unconstrained portion of the pattern,
the pattern may still retain its functionality. In such a case
the length of the pattern is variable, and the pattern may be overlooked by
existing motif detection methods. The Pattern Island Detection
Algorithm (PIDA) presented here has been developed to recognize patterns that
have occurrences of varying length within sequences over an alphabet of any size.
PIDA first identifies all regions of perfect conservation (of length
greater than a user-specified threshold) and then builds those conservation
“islands” into fixed-length patterns. Next, the algorithm modifies these
fixed-length patterns by identifying additional (and different) islands that
can then be incorporated into each pattern through insertions/deletions within
the “water” separating the islands. To provide some benchmarks for this
analysis, PIDA was used to search for patterns within randomly generated
sequences as well as sequences known to contain conserved patterns. For
each of the patterns found, the statistical significance is calculated based
upon the pattern’s likelihood of appearing by chance, thus providing a means to
identify those patterns that are likely to have a functional role. The
PIDA approach to motif finding is designed to perform best when searching for
patterns of variable length, although it can also identify patterns of a
fixed length. PIDA has been designed to be as generally applicable as
possible, since a variety of sequence problems are of this type, from
transcription factor binding sites in DNA, to structural motifs in non-coding
RNA, to high-contact-order domains in certain proteins. The algorithm was
implemented in C++ and is freely available upon request from the authors.
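
The first step described above, finding regions of perfect conservation, can be illustrated with a minimal C++ sketch. This is not the authors' implementation: the function name `conservedIslands` and the parameter `minLen` are hypothetical. The sketch also makes two simplifying assumptions: it reports only windows of exactly `minLen` residues, and it treats a window as conserved when it occurs exactly in every input sequence. Assembling islands into patterns and handling insertions/deletions in the "water" are not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not from the paper): returns every substring of
// length `minLen` that occurs, exactly, in all of the input sequences.
std::set<std::string> conservedIslands(const std::vector<std::string>& seqs,
                                       std::size_t minLen) {
    assert(minLen > 0 && "island length threshold must be positive");
    std::set<std::string> islands;
    if (seqs.empty()) return islands;

    // Seed the candidates with every window of the first sequence.
    const std::string& first = seqs.front();
    for (std::size_t i = 0; i + minLen <= first.size(); ++i)
        islands.insert(first.substr(i, minLen));

    // Discard any candidate that is missing from a later sequence.
    for (std::size_t s = 1; s < seqs.size(); ++s) {
        for (auto it = islands.begin(); it != islands.end();) {
            if (seqs[s].find(*it) == std::string::npos)
                it = islands.erase(it);
            else
                ++it;
        }
    }
    return islands;
}
```

Because the candidates are plain strings, the same sketch applies to DNA, RNA, or protein alphabets without modification, which is in the spirit of the alphabet independence claimed for PIDA.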
KEY WORDS: pattern discovery, motif conservation,
variable length patterns
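
The abstract does not give the formula used to score a pattern's likelihood of appearing by chance. As a rough illustration only, the C++ sketch below computes the expected number of chance matches of a fixed-length pattern under a background model assumed here (not taken from the paper): residues are independent and uniformly distributed over the alphabet. The function name `expectedChanceMatches` and its parameters are hypothetical, and variable-length occurrences are not modelled.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumption (not from the paper): residues are independent and uniformly
// distributed over an alphabet of `alphabetSize` symbols.
// Returns the expected number of chance matches of a fixed-length pattern
// that constrains `fixedPositions` residues, summed over `windows`
// candidate start positions (linearity of expectation).
double expectedChanceMatches(std::size_t alphabetSize,
                             std::size_t fixedPositions,
                             std::size_t windows) {
    assert(alphabetSize > 0 && "alphabet must contain at least one symbol");
    // Probability that one window matches all constrained positions.
    const double perWindow =
        std::pow(1.0 / static_cast<double>(alphabetSize),
                 static_cast<double>(fixedPositions));
    return static_cast<double>(windows) * perWindow;
}
```

Under this model, a pattern whose observed count greatly exceeds the expected count is a candidate for functional relevance. A realistic analysis would likely need a background model fitted to the sequences' actual composition.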