Technion, Israel Institute of Technology - Graduate School
M.Sc. Thesis
M.Sc. Student: Katz Adir
Subject: Integrating Explicit and Pseudo Feedback to Improve Retrieval Effectiveness
Department: Department of Industrial Engineering and Management
Supervisor: Professor Oren Kurland


Abstract

One of the biggest challenges faced by search engines is finding documents in a corpus that pertain to the information need expressed by a query. In many cases, the user of the search engine does not know how to effectively express his information need. Consequently, the query posted to the search engine may result in an ineffective search. If the user provides relevance feedback for results presented by the search engine in response to his query, then this feedback can be utilized so as to improve retrieval effectiveness using a second search. It is often the case that the relevance feedback is used for automatic query expansion, or more generally, query reformulation.

However, relevance feedback, if available, is often scarce. For example, the user may mark only one document as relevant. In such a case, effectively using the feedback becomes a hard challenge due to potential query drift. That is, while methods using several relevant documents focus on their commonalities, in the case of a single relevant document there are no such commonalities to rely on.

In this work, we explore a suite of methods that enrich the information induced from minimal relevance feedback with that induced from pseudo feedback; i.e., the documents most highly ranked by an initial search. Specifically, information induced from the given relevant documents is used to identify pseudo relevant documents that are more likely than others to be relevant. Several of our methods consider candidate pseudo relevant documents independently of each other. We also propose methods that use clusters of similar documents to identify high quality pseudo relevant documents.
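
To make the general idea concrete, the following is a minimal illustrative sketch, not the thesis's actual methods, which the abstract does not specify. It assumes that candidates are scored independently by cosine similarity between raw term-count vectors, and that the documents the user marked as relevant are pooled into a single centroid. The function names (tf_vector, cosine, select_pseudo_relevant) and the parameter m are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map lowercased whitespace tokens to raw term counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_pseudo_relevant(relevant_docs, top_retrieved, m):
    """Pick the m top-retrieved documents most similar to the known relevant ones.

    Assumption: similarity to the pooled term counts of the relevant documents
    stands in for "more likely than others to be relevant".
    """
    # Pool the term counts of all documents the user marked as relevant.
    centroid = Counter()
    for doc in relevant_docs:
        centroid.update(tf_vector(doc))

    # Skip documents already known to be relevant; score each remaining
    # candidate independently of the other candidates.
    known = set(relevant_docs)
    candidates = [doc for doc in top_retrieved if doc not in known]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine(centroid, tf_vector(doc)),
        reverse=True,
    )
    return ranked[:m]
```

In this sketch, ties keep the order of the initial ranking because Python's sort is stable. A cluster-based variant would replace the per-document score with a score shared by groups of similar documents; it is not shown here because the abstract does not describe how clusters are formed or used.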

Extensive empirical evaluation performed with TREC data attests to the merits of our proposed methods. As a case in point, our methods substantially outperform the standard approach of using only the given relevant documents for retrieval. Furthermore, our methods consistently outperform a previously proposed approach for selecting the pseudo relevant documents to be used with the true relevant documents; namely, simply using the top-retrieved documents as pseudo relevant.