|M.Sc Student||Gelfer Kalmanovich Inna|
|Subject||Cluster-Based Relevance Models|
|Department||Department of Industrial Engineering and Management|
|Supervisor||Professor Oren Kurland|
Search engines (e.g., Bing, Google, Yahoo!) have to address the ad hoc retrieval task: finding documents in a corpus (repository) that pertain to an information need expressed by a given query. The relevance model approach to ad hoc retrieval is based on the following premise: given a query, there is a statistical (relevance) language model that “generates” the terms both in the query and in the documents that are relevant to the information need the query expresses. Empirically, the relevance model approach yields state-of-the-art retrieval performance. The most prominent method for estimating a relevance model is based on pseudo-feedback, that is, on utilizing information from the documents in an initially retrieved list. However, some (or even many) of the documents in the list might not be relevant to the underlying information need; furthermore, query-related aspects might be insufficiently represented in the list. To address these issues, we propose novel approaches to constructing relevance models that utilize information induced from clusters of similar documents created offline. Such clusters can potentially help, for example, to represent the query context manifested in the corpus better than single documents can. Indeed, our empirical evaluation demonstrates the merits of constructing relevance models using cluster-based information. Specifically, our best-performing models yield better (and more robust) performance than the standard relevance-model-estimation approach, which is based only on documents.
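To make the pseudo-feedback idea concrete, the following is a minimal sketch of a commonly used relevance-model estimate, in which the probability of a term is a mixture of the language models of the initially retrieved items, each weighted by its query likelihood. It is an illustration only and not the estimation method developed in the thesis: the toy documents, the Jelinek-Mercer smoothing with `lam=0.5`, the function names, and in particular the choice to treat a cluster as the concatenation of its documents are all assumptions made for this example.

```python
from collections import Counter


def language_model(tokens, collection_lm, lam=0.5):
    """Jelinek-Mercer smoothed unigram model (an assumed smoothing choice):
    (1 - lam) * p_ml(w | text) + lam * p(w | collection)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        w: (1 - lam) * counts[w] / total + lam * p_c
        for w, p_c in collection_lm.items()
    }


def query_likelihood(query, lm):
    """p(q | text): product of the query-term probabilities under the model."""
    p = 1.0
    for term in query:
        p *= lm.get(term, 0.0)
    return p


def relevance_model(query, texts, collection_lm, lam=0.5):
    """Mixture estimate: p(w | R) is proportional to sum_x p(w | x) * p(q | x),
    where x ranges over the given texts (documents or, hypothetically, clusters).
    Assumes every query term occurs in the collection vocabulary."""
    lms = [language_model(x, collection_lm, lam) for x in texts]
    weights = [query_likelihood(query, lm) for lm in lms]
    z = sum(weights)
    return {
        w: sum(wt * lm[w] for wt, lm in zip(weights, lms)) / z
        for w in collection_lm
    }


# Toy "initially retrieved list" (invented data).
docs = [
    "jaguar car engine speed".split(),
    "jaguar car dealer price".split(),
    "jaguar cat jungle prey".split(),
]

# Collection model estimated from the toy documents themselves.
all_tokens = [t for d in docs for t in d]
collection_counts = Counter(all_tokens)
collection_lm = {w: c / len(all_tokens) for w, c in collection_counts.items()}

# Hypothetical offline clusters, each represented as the concatenation of its documents.
clusters = [docs[0] + docs[1], docs[2]]

query = ["jaguar", "car"]
rm_docs = relevance_model(query, docs, collection_lm)
rm_clusters = relevance_model(query, clusters, collection_lm)

for name, rm in (("documents", rm_docs), ("clusters", rm_clusters)):
    top = sorted(rm, key=rm.get, reverse=True)[:3]
    print(name, top)
```

In this sketch the only difference between the document-based and the cluster-based estimate is the set of texts passed to `relevance_model`; how the thesis actually induces and combines cluster information is not specified in the abstract.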