|M.Sc Student||Meir Hagar|
|Subject||Oak: Off-Heap Allocated Keys Big Data Analytics|
|Department||Department of Electrical Engineering||Supervisor||Professor Idit Keidar|
|Full Thesis text|
We present Oak (Off-heap Allocated Keys), a scalable concurrent key-value map designed for real-time big data analytics. Oak offloads the data from the virtual machine heap in managed-memory languages like Java, thereby reducing garbage collection overheads.
Oak is optimized for big keys and values and for frequent incremental maintenance
of existing values, as prevalent in streaming analytics use cases. To this end, it adopts
a zero-copy approach to data update and retrieval, e.g., through concurrent update-in-place. Oak's API is similar to that of Java's ConcurrentNavigableMap with adjustments for efficient zero-copy implementation. It provides strong (atomic) semantics for read, write, and various read-modify-write operations, such as compute (in-situ update) and put-if-absent, as well as (non-atomic) ascending and descending iterators.
We provide proof of Oak's correctness, by identifying linearization points for all operations, so that concurrent operations appear to execute in the order of their linearization points. We further report on our experiments which show that Oak is faster by 1.3-4.8x than the currently standard concurrent KV-map, the Java ConcurrentSkipListMap. In addition, our results demonstrate that off-heap allocation is beneficial in scenarios with conditional updates of large values.
Our industrial partners are integrating Oak as the core data index in the popular Apache Druid in-memory analytics platform. This integration is beyond the scope of this thesis.