Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Levi Natali
Subject: Navigating in the Dark: Modeling Uncertainty in Ad Hoc Retrieval Using Multiple Relevance Models
Department: Department of Industrial Engineering and Management
Supervisor: Professor Oren Kurland
Full thesis text: English version


Abstract

Search engines have become a highly important tool for finding information in digital repositories such as the Web. The basic task that search engines have to address is finding the documents in a corpus (collection) that pertain to the information need expressed by a given query. This task is often referred to as ad hoc retrieval.

Users of search engines tend to use short queries (i.e., a few keywords) to describe their information needs. As a result, queries are often ambiguous.

Consequently, it can be quite challenging to infer the underlying information need without any additional information about the need and/or the user who formulated the query.

Most existing retrieval methods address the uncertainty about the information need implicitly as part of the retrieval process. That is, the choice of a model (models) to represent the information need essentially reflects the possible interpretation(s) of the need. For example, a specific query expansion technique employed with specific parameter values (e.g., the number of terms to expand the query with) yields an expanded query form that represents one possible interpretation of the information need.

Indeed, most retrieval models are committed to a specific (single) representation of the information need that is supposedly expressed by the query.

We present a novel probabilistic framework for ad hoc retrieval that explicitly addresses the uncertainty about the information need expressed by a query. In doing so we account for two major factors that affect uncertainty, namely (1) the fact that the same query can be used to represent different information needs, and (2) the nature of the corpus upon which the search is performed. As a case in point for the latter, a query about the Jaguar car that is issued over the Web should preferably include the term car, yet this term has no discriminative power in a portal dedicated to cars.

The retrieval model that we derive integrates multiple relevance models, e.g., statistical language models that are presumed to generate terms in relevant documents. These relevance models potentially correspond to information needs that may underlie the query.

To exemplify the practical potential of our framework, we take a pseudo feedback approach, and construct multiple relevance models based on documents sampled from an initially retrieved list. We then propose several faithfulness measures. Empirical evaluation demonstrates the performance merits of our methods with respect to using a single relevance model.
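
As a rough illustration of the pseudo-feedback idea described above, the following Python sketch builds several unigram language models from random samples of an initially retrieved list and combines them in a linear mixture. It is only a sketch under stated assumptions, not the thesis's actual method: the function names, the toy documents, the maximum-likelihood estimates, the additive smoothing, and the uniform mixture weights are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def unigram_model(docs):
    """Maximum-likelihood unigram language model over tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def sample_relevance_models(top_docs, num_models, sample_size, seed=0):
    """Build one language model per random sample of the retrieved list."""
    rng = random.Random(seed)
    return [unigram_model(rng.sample(top_docs, sample_size))
            for _ in range(num_models)]


def mix_models(models, weights):
    """Linear mixture of models; weights are normalized to sum to 1."""
    norm = sum(weights)
    mixed = Counter()
    for model, weight in zip(models, weights):
        for term, prob in model.items():
            mixed[term] += (weight / norm) * prob
    return dict(mixed)


def score(doc, model, vocab_size, eps=0.1):
    """Negative cross entropy between the model and an additively
    smoothed document language model (higher is better)."""
    counts = Counter(doc)
    denom = len(doc) + eps * vocab_size
    return sum(prob * math.log((counts[term] + eps) / denom)
               for term, prob in model.items())


# Toy initially retrieved list for the ambiguous query "jaguar" (made-up data).
top_docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle"],
    ["jaguar", "car", "dealer"],
]
candidates = [["banana", "apple"], ["jaguar", "car"]]

models = sample_relevance_models(top_docs, num_models=3, sample_size=2)
# Uniform weights are an assumption made for this sketch only.
mixed = mix_models(models, weights=[1.0] * len(models))

vocab_size = len({term for doc in top_docs + candidates for term in doc})
ranked = sorted(candidates,
                key=lambda doc: score(doc, mixed, vocab_size),
                reverse=True)
```

In this sketch, a per-model weight stands in for whatever estimate of a model's quality one has available; the faithfulness measures mentioned above could plausibly play that role, but their definitions are not given in this abstract, so uniform weights are used here.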