Technion - Israel Institute of Technology
Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Khalaman Savva
Subject: Utilizing Inter-Document Similarities in Federated Search
Department: Department of Industrial Engineering and Management
Supervisor: Professor Oren Kurland


Abstract

The main goal of a search engine is to find documents in a collection that satisfy an information need expressed by a given query. The classical search setting assumes that all documents are stored and indexed in a single centralized collection. However, when data is proprietary and/or not free, this assumption may not hold. In such settings, federated (distributed) information retrieval becomes important; that is, running a query against different, potentially non-overlapping, collections.

We address two important tasks in federated search. The first is collection selection; that is, selecting which collections to use for retrieval for a query. The second is results merging; that is, merging the document lists retrieved from the selected collections. Most approaches to these two tasks estimate collection and document relevance while ignoring inter-collection and inter-document relations. We propose a single general model for both tasks, which we adapt from work on fusing document lists retrieved from the same collection. The model lets similar collections, in the collection selection task, and similar documents, in the results merging task, provide relevance status support to each other. Specifically, the model uses information induced from clusters of similar instances (collections or documents) to re-rank an initially retrieved list of instances.
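
The idea of similar instances lending relevance support to each other can be illustrated with a minimal sketch. This is not the model developed in the thesis: it swaps in a simple nearest-neighbor score smoothing in place of the cluster-based formulation, and the function names, the cosine similarity, and the parameters `k` and `lam` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rerank_with_neighbor_support(scores, vectors, k=2, lam=0.5):
    """Re-rank instances (collections or documents) of an initial list.

    Hypothetical scheme: each instance's new score interpolates its
    initial score with the similarity-weighted mean of the initial
    scores of its k most similar instances in the list.
    """
    new_scores = {}
    for i in scores:
        # k most similar other instances, as (similarity, id) pairs
        sims = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in scores if j != i),
            reverse=True,
        )[:k]
        total = sum(s for s, _ in sims)
        support = sum(s * scores[j] for s, j in sims) / total if total > 0 else 0.0
        new_scores[i] = (1 - lam) * scores[i] + lam * support
    return sorted(new_scores, key=new_scores.get, reverse=True)

# Toy example: "b" is initially ranked last but is similar to the top item "a".
initial_scores = {"a": 0.9, "b": 0.5, "c": 0.6, "d": 0.55}
term_vectors = {"a": {"x": 1.0}, "b": {"x": 1.0}, "c": {"y": 1.0}, "d": {"y": 1.0}}
reranked = rerank_with_neighbor_support(initial_scores, term_vectors, k=1, lam=0.4)
```

In this toy example, the instance "b" moves up the list because its most similar neighbor, "a", has a high initial score.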

Extensive empirical evaluation demonstrates the effectiveness of our proposed model for the collection selection and results merging tasks. Specifically, the resultant performance is in many cases superior to that attained by state-of-the-art methods. Our work is also the first to study the cluster hypothesis for the collection selection task. In particular, we show that Voorhees’ nearest-neighbor test can hold to a decent extent for datasets with specific characteristics.
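
For readers unfamiliar with the nearest-neighbor test, the following minimal sketch shows its general shape: for each relevant item, count how many of its k most similar items are also relevant. How collections are represented and compared in the thesis is not stated here, so the `similarity` function, the item identifiers, and the helper name are assumptions; k = 5 is the value commonly used for this test.

```python
from collections import Counter

def nearest_neighbor_test(items, relevant, similarity, k=5):
    """Histogram of relevant-neighbor counts.

    For every relevant item, find its k most similar other items and
    count how many of them are relevant. The returned Counter maps
    that count to the number of relevant items having it.
    """
    histogram = Counter()
    for r in items:
        if r not in relevant:
            continue
        neighbors = sorted(
            (j for j in items if j != r),
            key=lambda j: similarity(r, j),
            reverse=True,
        )[:k]
        histogram[sum(1 for j in neighbors if j in relevant)] += 1
    return histogram

# Toy example: items are points on a line; closer points are more similar.
toy_items = [0, 1, 2, 10, 11]
toy_relevant = {0, 1, 10}
toy_hist = nearest_neighbor_test(
    toy_items, toy_relevant, lambda a, b: -abs(a - b), k=2
)
```

The more relevant items that have many relevant neighbors, the more strongly the cluster hypothesis holds for the dataset.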