Ph.D Thesis

Ph.D StudentAdas Dolev
SubjectConcurrent Sketches and their Applications
DepartmentDepartment of Computer Science
Supervisor PROF. Roy Friedman
Full Thesis textFull thesis text - English Version


Real-time processing of continuous high-volume streams of data is a common requirement in many contemporary application environments.

These applications share the need to handle big data, which requires algorithms to process new data exceptionally efficiently, forcing algorithms to rely on scarcely available fast SRAM memory.

However, such memory is limited in size, which motivates approximate solutions that conserve space.

Quite often approximate statistics are good enough provided that their associated approximation error is relatively small.

This gives rise to compact approximate data structures that only save a small representation of the stream, known as a sketch.

That is, sketches maintain compact approximate statistics about streams of data, thereby enabling quickly answering queries regarding the data stream without having to reprocess it.

This dissertation studies several aspects of concurrent sketches and their applications.

In particular, we focus on the following topics: Conflict-free Replicated Data Types (CRDT) Sliding Window Sketches, Limited Associativity Caches, Multi-Producers Single-Consumer Queue and Cache Admission Filter.

In our first contribution, we introduce the notion of sliding window CRDT sketches where we collect global statistics about the stream in a decentralized manner.

Conflict-free Replicated Data Types (CRDT) is a promising approach for expressing the desired semantics of distributed sketches.

Two important properties of CRDT are that an operation can always be accepted at any given node and updates are propagated asynchronously to other nodes.

This provides resilience to network failure and disconnection since no coordination with other nodes is necessary to serve an operation.

Our algorithms, CRDT-LT and CRDT-AT, include compact structures that support insertion at any time during the window, querying over the window as well as over any interval inside that window that is provided at query time.

The structure provides probabilistic accuracy guarantees for point queries and can be extended to a broad range of problems, such as finding heavy hitters and count distinct.

Our work is the first to offer a truly decentralized CRDT sliding window sketch.

In our second contribution, we show that limited associativity caches are a promising direction for software caches.

Specifically, we demonstrate that limited associativity enables simple yet efficient realizations of multiple cache management schemes that can be trivially parallelized.

We show that the obtained hit ratio is usually similar to fully associative caches of the same management policy, but the throughput is improved by up to x5 compared to production-grade caching libraries, especially in multi-threaded executions.

In our third contribution, we study the implementation of wait-free multi-producer single-consumer queues.

In applications such as sharded data processing systems, which many parallel sketches are, data flow programming and load sharing applications, multiple concurrent data producers are feeding requests into the same data consumer.

This can be naturally realized through concurrent queues, where each consumer pulls its tasks from its dedicated queue.

In the last contribution, we presented TinyCache - An Effective Cache Admission Filter.

TinyCache is a compact table-based management policy for datastore caches that is easily parallelizable.