Technion - Israel Institute of Technology
Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Michael Bendersky
Subject: Passage Language Models in Ad Hoc Document Retrieval
Department: Department of Industrial Engineering and Management
Supervisor: Full Professor Kurland Oren
Full Thesis text: English Version


Abstract

Over the last decade or so, search engines (e.g., Google) have become a crucial tool for discovering information in on-line data repositories. Finding documents in a corpus (repository) that pertain to users’ queries is a hard challenge, especially in light of increasingly large and diverse corpora such as the World Wide Web (WWW).

The ad hoc retrieval task, which is the principal task that search engines perform, is to rank documents in a corpus in response to a query by their assumed relevance to the information need that the query represents. While there are numerous challenges involved in ad hoc retrieval, a prominent one is that a document could be of interest to a user (or deemed relevant to the user’s query) even if only a few, potentially small, parts of it, i.e., passages, actually contain information that pertains to the information need. In such situations, methods that compare the document as a whole to the query face significant difficulties in detecting the required documents.

Passage-based document-retrieval approaches address this challenge by using the information from (only some of) the document passages to rank a document in response to a query. In this thesis we show that several of these previously proposed passage-based approaches, along with some new ones, can in fact be derived from the same probabilistic model.
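As a rough illustration of the passage-based idea, the sketch below scores a document by its best-scoring passage rather than by the document as a whole. It is a minimal example of one well-known passage-based approach and is not claimed to be the specific formulation derived in the thesis. The fixed-length, non-overlapping windows, the smoothing parameter `mu`, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def split_into_passages(tokens, size=50):
    """Cut a token list into non-overlapping windows (an assumed passage definition)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def query_log_likelihood(query, text, corpus_counts, corpus_len, mu=10.0):
    """Log-probability of the query under a smoothed unigram model of `text`."""
    counts = Counter(text)
    vocab_size = len(corpus_counts)
    score = 0.0
    for term in query:
        # Add-one smoothing keeps the corpus probability non-zero for unseen terms.
        p_corpus = (corpus_counts.get(term, 0) + 1) / (corpus_len + vocab_size)
        # Smooth the text's term counts with mu pseudo-counts from the corpus model.
        p_term = (counts[term] + mu * p_corpus) / (len(text) + mu)
        score += math.log(p_term)
    return score


def max_passage_score(query, doc_tokens, corpus_counts, corpus_len, size=50):
    """Score a document by its highest-scoring passage."""
    return max(
        query_log_likelihood(query, passage, corpus_counts, corpus_len)
        for passage in split_into_passages(doc_tokens, size)
    )


# Toy data: only the last tenth of doc_a is about the query.
doc_a = ["filler"] * 90 + ["jaguar", "speed"] * 5
doc_b = ["filler"] * 100
corpus_counts = Counter(doc_a + doc_b)
corpus_len = len(doc_a) + len(doc_b)
query = ["jaguar", "speed"]

best_passage = max_passage_score(query, doc_a, corpus_counts, corpus_len, size=10)
whole_doc = query_log_likelihood(query, doc_a, corpus_counts, corpus_len)
```

In this toy setting the single relevant passage of `doc_a` matches the query far better than the document taken as a whole, which is the situation that motivates passage-based ranking.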

While our formulation and derived ranking algorithms are not committed to any specific estimation paradigm, we use the successful language modeling framework to instantiate specific algorithms. In doing so, we propose a novel passage language model that integrates information from the ambient document to an extent controlled by the estimated document homogeneity. Several document homogeneity measures that we propose yield passage language models that are more effective than previously proposed ones. Furthermore, we demonstrate the benefits of using our proposed passage language model for constructing and utilizing a passage-based relevance model.
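To make the idea of a homogeneity-controlled passage language model more concrete, here is a minimal sketch. The abstract does not give the estimate, so the three-way linear interpolation, the entropy-based homogeneity measure, the parameter `lambda_corpus`, and all function names below are assumptions chosen for illustration, not the thesis's definitions. The more homogeneous the document is estimated to be, the more weight the document's own statistics receive relative to the passage's.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram model: relative term frequencies."""
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}


def entropy_homogeneity(doc_tokens):
    """Hypothetical measure in [0, 1]: 1 for a single repeated term, 0 when all terms differ."""
    n = len(doc_tokens)
    if n <= 1:
        return 1.0
    entropy = -sum(p * math.log(p) for p in mle(doc_tokens).values())
    # log(n) is the largest entropy a document of n tokens can have.
    return 1.0 - entropy / math.log(n)


def passage_lm(term, passage, doc, corpus_lm, lambda_corpus=0.1):
    """Assumed interpolation of passage, ambient-document, and corpus models."""
    h = entropy_homogeneity(doc)
    lambda_doc = (1.0 - lambda_corpus) * h        # grows with homogeneity
    lambda_psg = 1.0 - lambda_corpus - lambda_doc  # remaining probability mass
    return (lambda_psg * mle(passage).get(term, 0.0)
            + lambda_doc * mle(doc).get(term, 0.0)
            + lambda_corpus * corpus_lm.get(term, 0.0))


# Toy example: a passage drawn from a four-token document.
doc = ["a", "a", "b", "c"]
passage = ["b", "c"]
corpus_lm = {"a": 0.5, "b": 0.25, "c": 0.25}
total_mass = sum(passage_lm(t, passage, doc, corpus_lm) for t in corpus_lm)
```

Because the three weights sum to one, the result remains a probability distribution over the vocabulary whenever the corpus model is one. For a fully heterogeneous document the document component vanishes and the estimate reduces to a corpus-smoothed passage model.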

Finally, we show that our proposed document-homogeneity measures are also effective means for integrating document-query and passage-query similarity information for document retrieval.