©1996-2009
All Rights Reserved. Online Journal
of Bioinformatics .
You may not store these pages in any form except for your own personal use. All
other usage or distribution is illegal under international copyright treaties.
Permission to use any of these pages in any other way besides the before mentioned must be gained in writing from the
publisher. This article is exclusively copyrighted in its entirety to OJB
publications. This article may be copied once but may not be, reproduced or re-transmitted without the express permission of the
editors. This journal satisfies the refereeing requirements
(DEST) for the Higher Education Research Data Collection (Australia). Linking:To link to this page or
any pages linking to this page you must link directly to this page only here
rather than put up your own page.
OJBTM
Online Journal of Bioinformatics©
Volume 8 (2):165-178, 2007
Constructing Phylogeny from Whole Genome
Alignment
Chan PY, Lam
TW, Liu CM, Yiu SM
Department of Computer Science, The
ABSTRACT
Chan PY, Lam
TW, Liu CM, Yiu SM.,
Constructing Phylogeny from Whole
Genome Alignment, Online J Bioinformatics, 8 (2):165-178, 2007. To reconstruct a
phylogeny for a given set of species, most of the previous approaches are based
on the similarity information derived from a subset of conserved regions (or
genes) in the corresponding genomes. In some cases, the regions chosen may not
reflect the evolutionary history of the species and may be too restricted to
differentiate the species. It is generally believed that the inference could be
more accurate if whole genomes are being considered. The best existing solution
that makes use of complete genomes was proposed by Henz
et al.[14] They can construct a phylogeny for 91 prokaryotic genomes in 170 CPU
hours with an accuracy of about 70% (based on the measurement of non-trivial
splits) while other approaches that use whole genomes can only deal with no
more than 20 species. Note that Henz et al. measure
the distance between the species using BLASTN which is not primarily designed
for whole genome alignment. Also, their approach is not scalable, for example,
it probably takes over 1000 CPU hours to construct a phylogeny for all 230
prokaryotic genomes published by NCBI. In addition, we found that non-trivial splits is only a rough indicator of the accuracy
of the phylogeny. In this paper, we propose the followings. (1) To evaluate the
quality of a phylogeny with respect to a model answer, we suggest using the concept
of the maximum agreement subtree as it can capture
the structure of the phylogeny. (2) We propose to use whole genome alignment
software (such as MUMmer) to measure the distances
between the species and derive an efficient approach to generate these
distances. From the experiments on real data sets, we found that our approach
is more accurate and more scalable than Henz et al.’s
approach. We can construct a phylogenetic tree for the same set of 91 genomes
with an accuracy more than 20%higher (with respect to both evaluation measures)
in 2 CPU hours (more than 80 times faster than their approach). Also, our
approach is scalable and can construct a phylogeny for 230 prokaryotic genomes
with accuracy as high as 85% in only 9.5 CPU hours.
Keywords: whole genome alignment, phylogeny,
maximum compatible subtree, maximum agreement subtree.