|M.Sc Thesis||Department of Computer Science|
|Assoc. Prof. El-Yaniv Ran|
Protein homology detection is a major task in computational biology. Homology detection tools are used for analyzing new proteins. Indeed, the functional and structural properties of a new protein can be better understood if homologous proteins can be identified. During the past few decades several homology detection algorithms have been proposed, based on ideas from areas such as string comparison, Hidden Markov Models, kernel methods and other fields as well. While capable of recognizing many homologies, these tools are less successful with some specific types of proteins and homologies.
This thesis discusses two novel homology detection tools representing improvement over existing algorithms for two specific types of homologies. The first tool is a variant of the BLAST algorithm that improves its performance on a group of proteins known as low complexity proteins. While low complexity proteins account for a large portion of all known proteins, BLAST's statistical model fails to correctly assess the significance of alignments in which such proteins are involved. We propose a novel tool for fixing the problem by post normalization of BLAST's e-value, composed of three phases. First we identify those alignments in which low complexity proteins are involved. This is done by computing the Jensen-Shannon divergence between the distribution of amino acids of each sequence and the background distribution of amino acids, and marking those alignments in which one of the two distances is higher than some threshold as suspicious. In the second phase the e-value of all suspicious alignments is re-estimated by considering the probability of observing the alignment based on the amino acid composition of the compared proteins. Finally, another estimation of the alignment’s e-value is performed by considering the structure of the alignment.
The second tool is a new homology detection algorithm based on vectorial representation of proteins. A protein is represented by its degree of association to a fixed reference set of entities from the protein space, as estimated by an association measure. The similarity of two proteins is evaluated through the distance between their two vectors. This algorithm is particularly effective on remote homologies, characterized by high structural and low sequential similarities. While these homologies are of biological interest, existing tools usually fail to identify them due to the low sequential similarity. This thesis explores different aspects of the algorithm, such as the choice of association measure and the choice of reference set are examined.
The thesis presents and evaluates both methods. The results may be of interest to the study of proteins in general.