M.Sc Thesis

M.Sc StudentKounin Solomon
SubjectEfficient Nearest Neighbor Search in Biological Strings
DepartmentDepartment of Biomedical Engineering


The nearest neighbor search (NNS) problem is the following: given a set S of n data points in some metric space X, preprocess S so that the queries, which require finding the point pÎS closest to a query point qÎX, can be efficiently answered. High dimensional nearest neighbor problems arise naturally when complex objects are represented by vectors of numerical features. In our case, those objects are DNA strings, presented by DNA profiles based on the genotypes of short tandem repeat (STR) markers. For large d, the present NNS solutions provide little improvement over the brute-force algorithm. Thus, in order to increase significantly the performance of the NNS solutions, approximate nearest neighbor (ANN) algorithms were introduced, allowing a certain amount of error, defined by a positive real e, to appear in the results. This approach makes it possible to answer the query with nearly logarithmic time complexity. In the present study, an efficient ANN solution for the problem of human identification based on DNA profiles has been provided. This solution is applicable also to a wide variety of other types of biometric profiles: fingerprints, palm geometries, voice, etc. The dataset, consisting of STR vectors, is organized as a kd-tree data structure by using the sliding midpoint splitting rule for the sub-division of the vector space. The adopted version of the standard ANN search algorithm could be efficiently used to handle search requests based on both complete and partial query profiles. In addition, algorithmic changes were introduced utilizing a parallel processing approach for query execution.