Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Erihov Miri
Subject: Identifying Accurate Twilight-Zone Alignments with Machine Learning and Local Information
Department: Department of Electrical Engineering
Supervisors: Professor Crammer Yacov, Dr. Rechel Kolodny
Full thesis text: English version


Abstract

Finding alignments between protein sequences is a fundamental procedure in bioinformatics research. For example, these alignments are necessary to predict the structure and function of an unknown protein sequence. The alignment is done by searching the database of known proteins for protein sequences that resemble the unknown query sequence. A good aligner is both highly sensitive and accurate.

A sensitive aligner is one that performs well even when matching proteins of low sequence identity, i.e., one that finds the so-called twilight-zone alignments.

Indeed, we need such sensitive aligners because there are many proteins with only distantly related counterparts in the Protein Data Bank. However, an aligner's high sensitivity can come at the expense of its accuracy, because it may also find alignments between unrelated proteins. Improving the accuracy of sensitive sequence aligners can be useful to many bioinformatics applications; here, accuracy is measured in terms of the similarity of the matched sub-structures. Rather than aligning a query sequence to a database, we re-rank alignments found by the state-of-the-art sequence aligner HHSearch and identify the most accurate ones.

Ranking in a post-processing step, as we suggest here, allows us to train a machine learning classifier that is specific to the task and hence performs better. Our classifier is based on a Random Forest that predicts local structure and a Support Vector Machine that uses the predicted local structure together with other local structural cues to identify accurate alignments. We consider two settings: the first simulates a search in a database of proteins with known structures, and the second a search in a sequence database. The difference between these settings is that in the first, reliable structural information can be used for the target protein, while in the second we must resort to predicting structural features of the target. In both cases, we assume that only the sequence of the query protein is known, and we predict its structural features. We rigorously test the performance of our classifiers, which are based on different combinations of structural cues, with respect to one another and to commonly used scores: HH-score, HHprob, and percent sequence similarity and identity. We show that the top performer is our classifier that receives as input a combined signal of (true or predicted) secondary structure, local structure, and local sequence profile patterns.
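To make the re-ranking idea concrete, the following minimal Python sketch re-orders candidate alignments by one of the baseline scores named above, percent sequence identity. It is an illustration only and not the classifier described in the thesis: the function names `percent_identity` and `rerank` are hypothetical, the toy sequences are invented, and the choice to compute identity over gap-free columns is an assumption (other denominators are also in use).

```python
def percent_identity(aligned_query: str, aligned_target: str, gap: str = "-") -> float:
    """Percent of gap-free aligned columns whose residues are identical.

    Assumption: the denominator is the number of columns in which both
    sequences have a residue; other conventions exist.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [
        (q, t)
        for q, t in zip(aligned_query, aligned_target)
        if q != gap and t != gap
    ]
    if not columns:
        return 0.0
    matches = sum(1 for q, t in columns if q == t)
    return 100.0 * matches / len(columns)


def rerank(alignments, score):
    """Return the alignments sorted from highest to lowest score."""
    return sorted(alignments, key=score, reverse=True)


# Toy candidate alignments (invented for illustration).
candidates = [
    ("AC-EFG", "TCWEFA"),
    ("ACDE-G", "ACDQWG"),
]
ranked = rerank(candidates, lambda pair: percent_identity(*pair))
```

In a post-processing setup like the one described, the `score` argument would be replaced by the output of a trained classifier, while the re-ranking step itself stays the same.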