Ph.D Thesis | |

Ph.D Student | Yehezkel Aviv |
---|---|

Subject | Generalizations of the Cardinality Estimation Problem and Applications to Computer Networks |

Department | Department of Computer Science |

Supervisor | PROF. Reuven Cohen |

Full Thesis text |

Classical processing algorithms for database management systems usually require several passes over (static) data sets in order to produce an accurate answer to a user query. However, for a wide range of application domains, the data set is very large and is updated on a continuous basis, making this approach impractical. Such big data streams appear in a wide variety of computer science applications. They are common, for example, in computer networks, where detailed usage statistics (such as the source IP addresses of packets) from different parts of the network need to be continuously collected and analyzed for various security and management tasks.

For this reason, there is a growing interest in developing ``streaming algorithms'' for efficient processing and querying of continuous data streams. These algorithms seek to provide accurate results while minimizing both the required storage and the processing time per stream element, at the price of a small inaccuracy in their output. Streaming algorithms use small fixed-size storage to store a summary (``sketch'') of the input data, and use probabilistic algorithms to accurately estimate the desired quantity.

A
fundamental streaming problem is the ``cardinality estimation problem'': given
a very long stream of elements *x*_{1}, *x*_{2}, ?,
with repetitions, the goal is to find the number *n* of distinct elements.
This is a well-known problem in numerous big data applications for network
monitoring, security, query optimization, query execution progress, etc. The
cardinality estimation problem can be generalized to set expressions over
multiple streams, which yields many new important big data applications in
large database systems, network monitoring and network security.

It
is easy to use linear *O(n)* space to produce an accurate solution to the
cardinality problem. This can be done, for example, by comparing the value of a
newly encountered element, *x*_{i}, to every (stored) value
encountered so far. If the value of *x*_{i} has not been seen
before, it is stored and counted. However, for a wide range of application
domains, the data set is very large, making linear space algorithms
impractical. Thus, streaming algorithms are required.

This thesis focuses on several generalizations of the cardinality estimation problem. It proposes new streaming algorithms, analyzes their efficiency and statistical performance, and shows how they can be used for solving timely management and monitoring problems in computer networks.