Ph.D Thesis

Ph.D StudentSagy Guy
SubjectGeometric Methods for Analyzing and Monitoring
Large Distributed Data
DepartmentDepartment of Computer Science
Supervisors PROF. Assaf Schuster
Full Thesis textFull thesis text - English Version


In the distributed systems model, data is partitioned over a set of nodes often geographically dispersed. In some systems, the data is more static and is stored over distributed databases, while on the other extreme, large sensor networks monitor continuous data streams. As the amount of data increases, so does the need for efficient distributed data analysis or monitoring tools.

A basic requirement in these distributed systems is the ability to detect objects whose score, according to a given function, exceeds some threshold. Since an object's data can be partitioned over various nodes, computing its global score requires collecting its data over the network. A main challenge is to perform threshold queries or monitoring with minimum network communication, i.e., without collecting the data from the nodes to a central location.

In this work we present new geometric methods for performing threshold and monitoring queries. First, we present a geometric method for distributed databases. This method enables top-k vectorial aggregation queries over such databases, where the local data of objects in each node consists of d attributes (not a single scalar). Next, we study the challenge of performing threshold queries when the object score is determined by a general function (e.g., non-linear or non-monotonic). We present a novel approach for representing scoring functions as a difference of two monotonic functions (d.m. representation) and provide a generic threshold queries (tentative bounds) algorithm that supports general scoring functions. Then, we apply the d.m. representation approach to top-k queries over centralized databases.

Next, we generalize the geometric method and define new constraints for threshold monitoring. We call these constraints Safe Zones. This method is more flexible and allows local constraints to be tailored to the behavior of the data over the different  nodes. Unlike the earlier approach, it thus supports heterogenous data distribution. This method incurs a much lower communication cost than the current state-of-the-art, and can be used to monitor high-dimensional data.

Finally, we present efficient algorithms for handling constraint violations. When a node's data violates the local constraint, it may (or may not) cause a threshold crossing. Violation resolution is the process of deciding whether a threshold crossing has occurred.

We present an analysis of the violation resolution problem and suggest communication bounds for resolution algorithms. In addition, we use a geometric approach for developing an efficient communication and low latency resolution algorithm.