Ph.D Thesis

Ph.D StudentYehezkel Aviv
SubjectGeneralizations of the Cardinality Estimation Problem and
Applications to Computer Networks
DepartmentDepartment of Computer Science
Supervisor PROF. Reuven Cohen
Full Thesis textFull thesis text - English Version


Classical processing algorithms for database management systems usually require several passes over (static) data sets in order to produce an accurate answer to a user query. However, for a wide range of application domains, the data set is very large and is updated on a continuous basis, making this approach impractical. Such big data streams appear in a wide variety of computer science applications. They are common, for example, in computer networks, where detailed usage statistics (such as the source IP addresses of packets) from different parts of the network need to be continuously collected and analyzed for various security and management tasks.

For this reason, there is a growing interest in developing ``streaming algorithms'' for efficient processing and querying of continuous data streams. These algorithms seek to provide accurate results while minimizing both the required storage and the processing time per stream element, at the price of a small inaccuracy in their output. Streaming algorithms use small fixed-size storage to store a summary (``sketch'') of the input data, and use probabilistic algorithms to accurately estimate the desired quantity.

A fundamental streaming problem is the ``cardinality estimation problem'': given a very long stream of elements x1, x2, ?, with repetitions, the goal is to find the number n of distinct elements. This is a well-known problem in numerous big data applications for network monitoring, security, query optimization, query execution progress, etc. The cardinality estimation problem can be generalized to set expressions over multiple streams, which yields many new important big data applications in large database systems, network monitoring and network security.

It is easy to use linear O(n) space to produce an accurate solution to the cardinality problem. This can be done, for example, by comparing the value of a newly encountered element, xi, to every (stored) value encountered so far. If the value of xi has not been seen before, it is stored and counted. However, for a wide range of application domains, the data set is very large, making linear space algorithms impractical. Thus, streaming algorithms are required.

This thesis focuses on several generalizations of the cardinality estimation problem. It proposes new streaming algorithms, analyzes their efficiency and statistical performance, and shows how they can be used for solving timely management and monitoring problems in computer networks.