Ph.D Thesis

Ph.D Student: Gurevich Maxim
Subject: External Search Engine Mining
Department: Department of Electrical and Computers Engineering
Supervisors: Mr. Ziv Bar-Yossef, Prof. Idit Keidar


Search engines maintain large amounts of valuable data, such as web content, user queries, clicks, and browsing trails. This data is fully accessible only to the search engines themselves. Other parties, such as users, advertisers, and researchers, have very limited access to the data through the public interfaces that search engines provide (e.g., the search interface).

External techniques for mining search engine data are rare and under-developed. Such external methods, which rely only on public interfaces, are appealing since they can be used by anyone, not relying on the goodwill of search engines. External mining can be used by search engine users and partners to objectively benchmark the quality of the service they get and by researchers to compare search engines and study properties of the web. Even search engines themselves may benefit from external mining, as it can help them reveal their strengths and weaknesses relative to their competitors.

In this work we propose a comprehensive framework for externally mining search engine indices and query logs. We designed an algorithm for estimating index properties, such as index size and freshness, language/domain/topic composition, density of spam, etc. We developed methods for sampling user queries from search engine query logs and for estimating query frequency (or popularity) in the logs. Finally, we designed an algorithm for estimating the visibility of a given web page in a search engine, and extracting the specific queries on which it is most visible.

Our algorithms make extensive use of tools from statistics (Monte Carlo methods), information retrieval, and databases. The correctness and efficiency of the algorithms were analyzed both theoretically and empirically. The empirical analysis relies on a synthetic search engine we built locally and on real commercial search engines.
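
To give a flavor of how Monte Carlo methods can estimate an index property through a public search interface alone, the following is a minimal illustrative sketch. It is not the algorithm developed in the thesis. It assumes a toy synthetic index, a query pool of single terms, and self-normalized importance sampling as the estimator; all names (`build_index`, `run_queries`, `estimate_fraction`) are hypothetical. Picking a random query and then a random result favors documents that match many queries, so each sampled document is reweighted by the inverse of its (unnormalized) sampling probability.

```python
import random


def build_index(num_docs, vocab, rng):
    """Toy synthetic corpus (assumption): every fifth document is 'long'
    (10 terms); all others contain 2 terms."""
    return {
        doc_id: frozenset(rng.sample(vocab, 10 if doc_id % 5 == 0 else 2))
        for doc_id in range(num_docs)
    }


def run_queries(index, query_pool):
    """Stand-in for the public search interface: term -> matching doc ids."""
    results = {q: [] for q in query_pool}
    for doc_id, terms in index.items():
        for term in terms:
            if term in results:
                results[term].append(doc_id)
    return results


def estimate_fraction(index, results, has_property, num_samples, rng):
    """Estimate the fraction of documents with a property, two ways.

    Returns (naive, weighted). The naive estimate ignores sampling bias;
    the weighted one applies self-normalized importance sampling.
    """
    nonempty = [q for q, hits in results.items() if hits]
    naive_hits = 0.0
    weighted_hits = 0.0
    weight_total = 0.0
    for _ in range(num_samples):
        query = rng.choice(nonempty)
        doc_id = rng.choice(results[query])
        # Unnormalized probability of drawing this document:
        # sum over the pool queries it matches of 1 / |results(query)|.
        p = sum(1.0 / len(results[t]) for t in index[doc_id] if t in results)
        weight = 1.0 / p
        f = 1.0 if has_property(index[doc_id]) else 0.0
        naive_hits += f
        weighted_hits += weight * f
        weight_total += weight
    return naive_hits / num_samples, weighted_hits / weight_total


rng = random.Random(0)
vocab = [f"term{i}" for i in range(50)]
index = build_index(2000, vocab, rng)
results = run_queries(index, vocab)


def is_long(terms):
    return len(terms) >= 10


true_fraction = sum(is_long(t) for t in index.values()) / len(index)
naive, weighted = estimate_fraction(index, results, is_long, 5000, rng)
print(f"true={true_fraction:.3f} naive={naive:.3f} weighted={weighted:.3f}")
```

In this toy setting the naive estimate substantially overstates the share of long documents, because long documents match more queries, while the weighted estimate should land close to the true fraction. Against a real search engine the weights would have to be computed from fetched page content and reported result counts, which introduces further complications that this sketch ignores.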