Search engines (e.g., Bing, Google, Yahoo!) have to address the ad hoc
retrieval task: finding documents in a corpus (repository) that pertain to
an information need expressed by a given query. The relevance model approach
to ad hoc retrieval rests on the following premise: given a query,
there is a statistical language model (the relevance model) that “generates” the
terms both in the query and in documents that are relevant to the
information need expressed by the query. Empirically, the relevance
model approach attains state-of-the-art retrieval performance. The most
prominent method for estimating a relevance model is based on pseudo-feedback:
it utilizes information from documents in an initially
retrieved list. However, some (or even many) of the documents in the
list might not be relevant to the underlying information need;
furthermore, query-related aspects might be insufficiently represented
in the list. To address these issues, we propose novel approaches to
constructing relevance models that utilize information induced from
offline-created clusters of similar documents. Such
clusters can, for example, represent the query context manifested
in the corpus better than single documents can. Indeed, our
empirical evaluation demonstrates the merits of constructing relevance models
using cluster-based information. Specifically, our best-performing models
yield better (and more robust) performance than the standard relevance-model estimation
approach, which is based only on documents.
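
For orientation, the standard document-based estimation referred to above can be sketched as follows. This is a minimal illustration only, assuming the common estimate p(w|R) ∝ Σ_d p(w|d) p(q|d) over the initially retrieved documents, with Jelinek-Mercer smoothing of the document language models. The function names, the smoothing method, and the parameter `lam` are hypothetical choices made for this sketch, not details stated in the text, and the sketch does not implement the cluster-based models proposed here.

```python
import math
from collections import Counter


def unigram_lm(tokens, corpus_counts, corpus_len, lam=0.9):
    """Return a smoothed unigram model p(w|d) for one document.

    Assumption: Jelinek-Mercer smoothing,
    p(w|d) = lam * p_ml(w|d) + (1 - lam) * p_ml(w|corpus).
    """
    counts = Counter(tokens)
    n = len(tokens)

    def prob(w):
        p_doc = counts[w] / n if n else 0.0
        p_corpus = corpus_counts[w] / corpus_len
        return lam * p_doc + (1 - lam) * p_corpus

    return prob


def relevance_model(query, feedback_docs, corpus_docs, lam=0.9):
    """Estimate p(w|R) from pseudo-feedback documents (hypothetical helper).

    query:         list of query terms
    feedback_docs: token lists of the initially retrieved documents
    corpus_docs:   token lists of all documents (for smoothing statistics)
    """
    corpus_counts = Counter(t for d in corpus_docs for t in d)
    corpus_len = sum(corpus_counts.values())
    vocab = set(corpus_counts)

    scores = Counter()
    for doc in feedback_docs:
        p = unigram_lm(doc, corpus_counts, corpus_len, lam)
        # Query likelihood p(q|d): how strongly this document "votes".
        q_lik = math.prod(p(t) for t in query)
        for w in vocab:
            scores[w] += p(w) * q_lik

    # Normalize so the result is a probability distribution over the vocabulary.
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()} if total > 0 else {}
```

In this sketch every document in the retrieved list contributes in proportion to its query likelihood, so a non-relevant document that happens to match the query terms still shapes the estimate; this is the weakness that motivates drawing on information beyond single documents.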