|Ph.D. Thesis||Department of Industrial Engineering and Management|
|Supervisor:||Dr. Kurland Oren|
We live in a special age where almost every piece of information one can think of is available to us at a touch of the keyboard and a click of the mouse. However, the massive volume of digitized information that humanity has accumulated is a double-edged sword. While it may be beneficial to us as users to have more information at our disposal, it can create challenges for search engines to identify that relevant “needle” in the electronic haystack.
One such challenge is that many documents, e.g., those on the Web, are long and topically diverse. Specifically, such a document could contain information relevant to an information need expressed by a query and, at the same time, much non-relevant information. A possible approach to addressing this challenge is to infer relevance based on smaller units of text within the documents, namely passages, and to use this inference in the overall document-relevance estimation process.
In this thesis we first present the potential merits of utilizing passages for ranking documents by their presumed relevance to an information need expressed by a given query while having (minimal) relevance information. We also present a practical novel method for ranking documents using passages within these settings.
We then present novel passage-based methods for ranking documents given a query without the use of relevance feedback information; i.e., the ad hoc retrieval task. We present a probabilistic retrieval model that integrates information induced from the document as a whole and from its passages. Specifically, our model integrates document-query similarities, passage-query similarities, inter-passage similarities, and inter-document similarities.
Another method that we propose takes advantage of an additional source of information: clusters of similar documents. Using information induced from query-specific clusters, that is, clusters created from top-retrieved documents, was shown to be very effective in the document ranking task. We show that integrating the two sources of information, clusters and passages, can yield performance that is superior to that of using each alone.
Empirical evidence shows that the performance of our models is superior to that of (i) commonly used passage-based document ranking methods; (ii) a state-of-the-art pseudo-feedback-based approach; and (iii) a state-of-the-art term-proximity-based retrieval model.
In a slightly different vein, we propose a novel performance prediction model for passage-based retrieval used in question answering systems. Our model utilizes named entities to estimate the probability that the retrieved list contains the correct answer. We show that this model yields prediction quality higher than that of various reference comparisons.