Ph.D Thesis

Ph.D StudentShalem Mirit
SubjectCombined Search over Heterogeneous Repositories
DepartmentDepartment of Computer Science
Supervisors DR. Yaron Kanza
MR Ziv Bar-Yossef
Full Thesis textFull thesis text - English Version


In complex search tasks that utilize information from several data sources, it is often required to pose several basic search queries, join the answers to these queries, where each answer is given as a ranked list of items, and return a ranked list of combinations. Evaluation of such complex queries consists of several parts. The first part requires extracting, for each basic query independently, the relevant information from a suitable repository. Next, the results of the basic queries should be integrated, i.e., joined according to the join conditions of the complex query. Finally, the join result should be filtered and ranked, to provide the user with a manageable sized answer that has a high probability of satisfying her.

This thesis studies several aspects of complex query evaluation. We begin with the question how to extract data efficiently from XML documents. Current algorithms incur high memory costs on queries that involve child-axis nodes. In this work we provide an analytical explanation for this phenomenon. We present a large-scale study of the space complexity of evaluating twig queries (a fragment of XPath) over indexed XML documents.

The next part of the complex search is to join the ranked results of the basic queries, and return the "best" results. We discuss two problems of the top-k join of ranked lists. First, a join is a lossy operation, and over heterogeneous data sources some highly-ranked items may not appear in any combination, although they may be relevant to the user. We introduce top-k maximal join, which solves this problem by allowing maximal (incomplete) combinations in the answer, i.e., combinations that cannot be extended. We present novel algorithms for computing the top-k maximal combinations. One of the algorithms is instance optimal w.r.t. algorithms that compute a θ-approximate answer. A second algorithm is much more efficient than adaptations of existing algorithms for top-k maximal join.

Second, the top-k join result may include too many repetitions of items, and repetitions may reduce user satisfaction. The main task is to choose a subset of the join result that would maximize the probability to satisfy the user. To that end, we propose two measures for estimating the quality of result sets, namely, coverage and optimality-ratio. We present new semantics for complex search queries, aiming at providing high coverage and high optimality ratio. We show empirically that our semantics outperform top-k and other existing semantics, w.r.t. user satisfaction.