Ph.D Thesis

Ph.D StudentGronau Ilan
SubjectReconstructing Phylogenetic Trees from Noisy Metrics
DepartmentDepartment of Computer Science
Supervisor PROFESSOR EMERITUS Shlomo Moran
Full Thesis textFull thesis text - English Version


Phylogenetic reconstruction is the task of inferring the evolutionary tree spanning a given set of species (taxa) from molecular sequences extracted from these species. A popular approach for dealing with this task is to divide it into two independent parts: first, infer inter-species distances from the input sequences (assuming some stochastic model of sequence evolution); then, use the inferred distances to construct a tree. This distance-based approach for phylogenetic reconstruction is very popular mainly because it leads to very efficient algorithms. The inter-species distances obtained in the first phase are often regarded as noisy versions of some additive distances, which are simply counts of evolutionary events (such as site substitutions). Reconstructing a tree from completely accurate estimates of these additive distances is an easy task. However, dealing successfully with error in distance estimation is much more complicated. We identify two factors which influence the accuracy of distance-based reconstruction in this noisy setting: the accuracy of the distance estimates computed in the first phase, and the robustness of the algorithm used in the second phase to the noise in their input distances. This thesis studies both factors as sources for improving the accuracy of phylogenetic reconstruction.

The first part of the thesis presents two novel algorithms which have proven reconstruction guarantees relating noise in the input distances to the minimal weight (or length) of correctly reconstructed edges. Both algorithms are very efficient in running time and are shown to have proven properties of robustness to the noise in the input distance estimates. The second part of the thesis deals with analyzing and minimizing the error of the distance estimates obtained from the input sequences. We identify a wide range of common models for site substitution in which additive distances can be defined in many different ways. We specify the set of valid distance formulae in these models and describe a general framework for analyzing the estimation error implied by each formula. This approach is shown to enable a dramatic reduction in distance estimation error. Experimental results demonstrate that this also leads to significant improvement in the accuracy of phylogenetic reconstruction.