|M.Sc Thesis||Department of Industrial Engineering and Management|
|Supervisor:||Dr. Kurland Oren|
|Full Thesis text|
Search engines (e.g., Google) have become a crucial means for finding information in large corpora (repositories).
The ad hoc retrieval task is the core challenge that search engines have to cope with: finding documents that pertain to an information need underlying a query.
Pseudo-feedback-based (query expansion) retrieval is the process by which the documents most highly ranked by a search performed in response to a query are used for forming an expanded query, which is used for (re-) ranking the entire corpus. The underlying idea is to automatically ``enrich'' the original query with (assumed) related terms so as to bridge the ``vocabulary gap'' between short queries and relevant documents. While this process improves retrieval effectiveness on average, there are many queries for which the resultant effectiveness is inferior to that obtained by using the original query. In this work we address this performance robustness problem by tackling the query drift problem --- the shift in "intention" from the original query to its expanded form.
One approach for ameliorating query drift that we present relies on fusing the lists retrieved in response to the query and to its expanded form so as to perform query-anchoring of the latter. While the fusion-based approach relies on co-occurrence of documents in retrieved lists, our second approach re-ranks the (top) retrieval results of one retrieved method by the scores assigned by another. Both approaches are based on the assumption that documents retrieved in response to the expanded query, and which exhibit high query similarity, are less likely to exhibit query drift.
We show empirically that our approach posts significantly better performance than that of retrieval based only on the original query. The performance is also more robust than the retrieval using the expanded query, albeit somewhat lower on average. This performance- robustness trade-off characterizes most of the methods that we have examined.