Ph.D Thesis

Ph.D StudentLempel Ronny
SubjectLink Structure Analysis and Query Result Caching in Web
Search Engines
DepartmentDepartment of Computer Science
Supervisor PROFESSOR EMERITUS Shlomo Moran


The World Wide Web is an ever-growing sea of information. The volume of data available online increases exponentially, as more and more people and entities publish data on the Web. The Web has no authorship standards, no editorial board and no imposed topical hierarchy. In this chaotic arena of text, images, voice and video, hundreds of millions of people from all over the world are surfing in search of information on every possible walk of life.

Web search is made possible by complex information retrieval systems called search engines. Search engines are presented with queries of users, who expect the queries to be instantaneously answered with ranked lists of the most relevant URLs available online for each query. In order to meet these demands, search engines must collect and index billions of resources, and develop highly efficient retrieval and ranking algorithms that are capable of effectively answering queries in almost a blink of the eye.

This research focuses on two aspects of search engine technology.

The first part examines the impact of link structure analysis on the retrieval and ranking components of search engines. It begins with the study of theoretical properties of several link analyzing algorithms. These properties promote a better understanding of the individual algorithms and of the differences which separate each algorithm from the others. We then present techniques which augment link-based algorithms, by integrating general relevance factors, both positive and negative, into the analysis. This enables to boost the rankings of pages with desirable traits, and to lower the rankings of pages that are linked to irrelevant pages. We also demonstrate that connectivity information can improve multimedia retrieval by presenting an image retrieval system that is based on adaptations of link analyzing algorithms.

The second part of the research deals with the caching and prefetching of search results in search engines. Successful caching and prefetching of results can significantly lower the computational resources required for executing the stream of queries with which the engines are faced. We first examine a single user's search session, and propose a strategy for the prefetching of search results that minimizes the expected computational cost of answering the session's queries. We then experiment with several combined prefetching and cache replacement policies, measuring the hit ratio of the query result cache that results from each policy. These experiments are driven by a stream of over seven million real queries, submitted to a major search engine. Finally, we study the problem of caching search results in the framework of competitive analysis, under a concrete stochastic model of the query stream. We prove that a certain online caching algorithm incurs a cost whose expected value is no more than 4 times the expected cost of any caching algorithm.