M.Sc Thesis | |

M.Sc Student | Kounin Solomon |
---|---|

Subject | Efficient Nearest Neighbor Search in Biological Strings |

Department | Department of Biomedical Engineering |

Supervisor | PROFESSOR EMERITUS Isak Gath |

The *nearest neighbor search*
(NNS) problem is the following: given a set *S* of *n* data points in
some metric space *X*, preprocess *S* so that the queries, which
require finding the point *pÎS*
closest to a query point *qÎX*,
can be efficiently answered. High dimensional nearest neighbor problems arise
naturally when complex objects are represented by vectors of
numerical features. In our case,
those objects are DNA strings, presented by DNA profiles based on the genotypes
of *short tandem repeat* (STR) markers. For large *d*, the present NNS
solutions provide little improvement over the brute-force algorithm. Thus, in
order to increase significantly the performance of the NNS solutions, *approximate
nearest neighbor* (ANN) algorithms were introduced, allowing a certain
amount of error, defined by a positive real *e*, to appear in the results.
This approach makes it possible to answer the query with nearly logarithmic
time complexity. In the present study, an efficient ANN solution for the
problem of human identification based on DNA profiles has been provided. This
solution is applicable also to a wide variety of other types of biometric
profiles: fingerprints, palm geometries, voice, etc. The dataset, consisting of
STR vectors, is organized as a kd-tree data structure by using the sliding
midpoint splitting rule for the sub-division of the vector space. The adopted
version of the standard ANN search algorithm could be efficiently used to
handle search requests based on both complete and partial query profiles. In
addition, algorithmic changes were introduced utilizing a parallel processing
approach for query execution.