Ph.D Thesis

Ph.D StudentKraus Naama
SubjectEffective Search in Distributed Environments
DepartmentDepartment of Electrical and Computer Engineering
Supervisors PROF. Idit Keidar
DR. David Carmel


In this thesis, we tackle search effectiveness of distributed search (DiS), and in particular explore tradeoffs between search quality and other considerations. We explore three scenarios: tail-tolerant distributed search, similarity search over endless data-streams, and network-efficient similarity search in peer-to-peer networks. Distributed search engines typically allow missing some of the search results for various reasons, which degrades search quality. For example, in the scenarios that we explored, search quality degrades due to late responses, capacity limitations, and limited network bandwidth.

We propose algorithms that improve search quality by better exploiting the infrastructure's resources, namely, index redundancy, space capacity, and network cost. We achieve improvements by considering the internals of the search algorithms in use, rather than using them as black boxes.

We evaluate our algorithms both theoretically and empirically, by formulating a DiS algorithm's success probability and measuring its empirical recall. The success probability of a DiS algorithm captures the probability that it finds a query's search result. For measuring recall, we consider centralized search as our ground truth, and measure an algorithm's recall with respect to the search results of a given centralized search algorithm. We measure recall by conducting empirical evaluations using real-world datasets. We compare our algorithms to prior art and show, both theoretically and empirically, that they increase search quality in the three scenarios that we examined.

The topics covered in this thesis are summarized as follows.

Tail-tolerant distributed search:

We introduce a novel approach for constructing and searching a distributed index with redundancy when some query results are missed due to high tail latency. We propose rSmartRed, an optimal strategy for selecting the number of node replicas to search over at runtime, which considers each node's likelihood to contain results that are relevant to the query, as well the probability to miss its results due to high latency. In addition, when feasible, we propose to replace Replication with Repartition, which constructs independent index instances instead of exact copies. Our tail-tolerant distributed search improves search effectiveness when results are omitted due to a high latency compared to naively using Replication as a black box.

Similarity search over endless data-streams:

We present Stream-LSH, a similarity search algorithm that uses a bounded index for indexing unbounded data. We propose a randomized policy for dynamically maintaining items in the index, which takes into account items' age, quality, and popularity attributes. We show that Stream-LSH better exploits capacity resources which improves search effectiveness compared to prior art.

Efficient similarity search in peer-to-peer networks.

We present NearBucket-LSH, an effective algorithm for similarity search in large-scale distributed online social networks organized as peer-to-peer overlays. As communication is a dominant consideration in distributed systems, we focus on minimizing the network cost while guaranteeing good search quality. We decrease the network cost by considering the internals of the similarity search and the peer-to-peer architecture, and harnessing their properties to our needs.