Ph.D Thesis

Ph.D StudentLazerson Arnon
SubjectEfficient Monitoring of Distributed Data Streams
DepartmentDepartment of Computer Science
Supervisor PROF. Assaf Schuster
Full Thesis textFull thesis text - English Version


As the number of connected devices grows, the velocity and volume of the data they produce and consume grows. This massive data is scattered across many physically distributed sites, incurring high communication costs. Thus, traditional data mining algorithms assuming that data is co-located or that communication is inexpensive are no longer adequate. Moreover, most traditional data mining algorithms only consider one-shot computation, where a query is processed once on a fixed data set. Yet data is increasingly dynamic, and many applications continuously require up-to-date results.

Emerging large-scale applications rely on continuous tracking of complex queries over collections of massive, dynamic, and physically distributed data streams.  Thus, in addition to the space- and time-efficiency requirements of conventional stream processing (at each distributed site), effective solutions also need to guarantee communication efficiency.  Continuously collecting the data to a central location is infeasible in large-scale applications, as the excess communication required interferes with the normal operation of the data network.  Furthermore, in the case of battery-operated devices such as wireless sensor nodes, central data accumulation depletes the power supply of individual devices, reducing the network lifetime.

We focus on reducing the communication cost of distributed threshold monitoring queries, where a threshold on the value of a function over aggregated distributed data streams is defined, and the system must issue an alert when this threshold is crossed. Threshold monitoring queries have been used in numerous applications either directly or as the main building block for other queries such as top-k, “heavy-hitters'”, quantiles, and so on. Furthermore, they can be naturally extended to the more general problem of approximate function monitoring, where the goal is to track the value of a function to within user-prescribed error bounds.

We show that by carefully constructing local conditions that are indicative of a global change in the aggregated value, we can reduce the communication cost of distributed threshold monitoring and function approximation tasks by orders of magnitude in many important cases.  We develop a novel method for deriving computationally efficient local conditions. These lightweight local conditions can be rapidly evaluated on the fly, reducing run-time and power consumption by up to six orders of magnitude in some cases.