|Ph.D Student||Kalinsky Oren|
|Subject||Relational Join Acceleration through Algorithms and Hardware|
|Department||Department of Electrical Engineering||Supervisors||Professor Yoav Etsion|
|Professor Benny Kimelfeld|
|Full Thesis text|
The wide availability of data in the information era requires efficient data management solutions. Specifically, Relational Database Management Systems (RDBMSs) have been a key component in many software solutions for more than five decades, and have an estimated market value of tens of billions of dollars. A core problem in RDBMSs is the evaluation of complex join queries over large datasets. Such join queries are known to have a high runtime complexity which incurs ”expensive” query evaluation in terms of time, memory, and energy. In this thesis, we investigate the problem of accelerating relational join queries and approach the problem from three different directions.
Our first approach is developing a novel relational join algorithm that utilizes recently published
theoretical ideas on efficient evaluation of multiway join queries. Our algorithm, namely Cached-TrieJoin (CTJ), augments trie joins with smart caching of intermediate results. The flexible caching mechanism can dynamically conform to the amount of memory available. This property benefits join queries and execution environments where the number of intermediate results does not fit into memory. The runtime complexity of our algorithm matches the worst-case output size, and provides state-of-the-art performance.
To further accelerate relational join queries, we relax the full-precision guarantee.
Focusing on online-aggregation queries, we allow to trade precision for performance. We develop an online aggregation algorithm that integrates random walks into CTJ, our exact join algorithm. By embracing recent developments for online aggregation, we offer an unbiased count aggregation estimator. Moreover, to the best of our knowledge, we are the first to offer an unbiased estimator for count distinct aggregations. We test our algorithms on an exploration system for the semantic web and show that existing multiway join solutions, including CTJ, do not provide interactive performance. However, we show that such performance can be achieved using the approach of online aggregation which provides approximate answers with accuracy that improves over time. We design an online aggregation algorithm named AuditJoin that returns an approximate answer in one second, and continues to improve the accuracy every second. Compared to existing online aggregation solutions, AuditJoin improves the accuracy by 5-100?.
Finally, we explore another direction to surpass the superior performance achieved by CTJ. A
fundamental assumption of CTJ is the underlying hardware it operates on, namely the CPU. The generic nature of the CPU incurs performance and energy overhead. In this work, we reduce the overhead by designing a specialized hardware accelerator for join queries. TrieJax, our CTJ-based accelerator, leverages the inherent parallelism of CTJ by incorporating both static and dynamic multithreading execution. Furthermore, the smart intermediate result cache of TrieJax prevents the spill of intermediate results into memory and reduces the number of memory accesses. TrieJax outperforms existing database hardware accelerators by 7-63? on average, while consuming 15-79? less energy. Compared to software query engines, that incorporate modern join algorithm such as CTJ, TrieJax achieves a speedup of 9-20? and reduce the energy consumption by 59-110?.