Technion - Israel Institute of Technology
Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Shprincis Igal
Subject: Studying The Cluster Hypothesis using Fine Grained Analysis of Relevant Documents
Department: Department of Industrial Engineering and Management
Supervisor: Professor Oren Kurland


Abstract

The cluster hypothesis states that closely associated documents tend to be relevant to the same information needs.

Being a qualitative statement, the cluster hypothesis does not refer to the specific representation method applied for the document/query nor to the association measure employed.

Our goal is to explore the extent to which the cluster hypothesis is affected by the relative portion of relevant text in documents and their passages.

To do so we examine the effect of tightening the requirement for a document/passage to be deemed relevant, based on the fraction of relevant text it contains, on the extent to which the cluster hypothesis holds.

To measure the extent to which the cluster hypothesis holds we use several previously proposed tests.
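As a rough illustration of the setup described above, the following sketch tightens a relevance threshold on the fraction of relevant text and then applies a nearest-neighbor style test: for each relevant document, it counts how many of its k nearest neighbors are also relevant. Whether this particular test is among those used in the thesis is an assumption, and the function names (`relevant_set`, `nn_test`), the cosine similarity measure, and the toy data are all hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relevant_set(rel_fraction, threshold):
    """Documents whose fraction of relevant text is at least `threshold`."""
    return {doc for doc, frac in rel_fraction.items() if frac >= threshold}


def nn_test(vectors, relevant, k=5):
    """Average number of relevant documents among the k nearest
    neighbors of each relevant document (higher suggests the cluster
    hypothesis holds to a greater extent)."""
    counts = []
    for doc in relevant:
        others = [o for o in vectors if o != doc]
        others.sort(key=lambda o: cosine(vectors[doc], vectors[o]), reverse=True)
        counts.append(sum(1 for o in others[:k] if o in relevant))
    return sum(counts) / len(counts) if counts else 0.0


# Toy data (hypothetical): two-dimensional document vectors and the
# fraction of each document's text that is marked relevant.
vectors = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.8, 0.2],
    "d4": [0.0, 1.0],
    "d5": [0.1, 0.9],
}
rel_fraction = {"d1": 0.9, "d2": 0.6, "d3": 0.2, "d4": 0.0, "d5": 0.0}

# Compare a lenient and a strict relevance requirement.
for threshold in (0.1, 0.5):
    score = nn_test(vectors, relevant_set(rel_fraction, threshold), k=2)
    print(threshold, score)
```

Raising the threshold shrinks the relevant set, so the same test can yield a different score under a stricter relevance requirement; comparing scores across thresholds is one way to read the effect the abstract describes.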

In addition, we analyze the properties of, and the relations between, different types of meta-documents; these consist of either relevant or non-relevant passages in top-retrieved documents.

Finally, we propose an extended model for query-performance prediction which utilizes the aforementioned meta-documents.

Our conclusions shed light on some practical aspects of the extent to which the cluster hypothesis holds and on the properties of top-retrieved documents.