M.Sc Thesis


M.Sc StudentAvishay Livne
SubjectMonitoring Distributed Data Streams
DepartmentDepartment of Computer Science
Supervisors Full Professors Schuster Assaf
Professor Daniel Keren
Full Thesis textFull thesis text - English Version


Abstract

Monitoring distributed streams of data is a basic construct in many distributed systems. Examples include a wireless sensor network, where we would like to receive an alert every time the average of the temperature readings taken by the sensors exceeds a given threshold, or a distributed search engine, where we would like to determine the set of queries whose frequency of use exceeds a given threshold.

So far, research has focused on detecting global properties that are defined by simple aggregates (e.g., sum, average, or minimum) over the distributed data. For these cases, there are usually local conditions for the occurrence of the global property of interest.

In many cases, however, global properties of interest are expressed by more complex functions. Examples include detecting when the variance of the sensor readings in a sensor network exceeds a given threshold, determining the set of pairs of queries whose correlation exceeds a given threshold, and performing distributed classification tasks. Monitoring such complex functions is an expensive task in terms of communication, processing and storage access.

We study global properties of the following form: we assume that each node holds a time varying two dimensional vector. In addition, we assume that we are given an arbitrary scoring function that is calculated on the weighted average of the local vectors held by the nodes. We are interested in detecting when the value of this function violates a predetermined threshold value.

In the general case described above, it is far harder to devise local conditions such as the ones employed for simple aggregates. Even in the apparently trivial case of f(x)=x2, we cannot make any inference on whether f((x)/2) is above or below the threshold t, when given the position of f(x) and f(y) vis-à-vis a threshold value t. As this simple example illustrates, it is generally very difficult to find local indications for monitoring nonlinear functions. The problem is that it is not enough to know the function's values at the local data vectors; we have to know something about the possible domain which contains the average of the local vectors.

Our proposed approach is based on the mathematical and geometric properties of the scoring functions, which have never been used by the data mining community. We plan to apply this approach for different data mining algorithms in different distributed setups, for which current solutions are either suboptimal or nonexistent.