|M.Sc Student||Tolchinsky Igor|
|Subject||Rethinking Locality in NUMA Systems|
|Department||Department of Electrical and Computers Engineering|
|Supervisors||PROF. Isaac Keslassy|
|PROF. Avi Mendelson|
In NUMA systems, data and thread placement are assumed to have a significant impact on overall system performance, since accesses to remote locations incur a higher penalty than local accesses. Many algorithms have therefore focused on increasing locality in NUMA-based systems. However, it is unclear when these algorithms really matter. In fact, characterizing the sensitivity of workloads to data and execution placement remains an open problem.
The main contribution of this work is motivated by the surprising observation that, for many applications executed on contemporary architectures, and most likely on future NUMA systems, a major limiting factor is bandwidth. This leads to a second observation: most current systems are optimized for latency rather than bandwidth, for example through heavy use of speculation-based mechanisms. To demonstrate that latency matters less than bandwidth in many environments, we develop a model that separates the effect of latency from the effect of bandwidth.
We start our investigation by developing a novel analytical model. Our model and measurements indicate that, for some applications, bandwidth rather than latency can be the main source of inefficiency. However, we also observed that bandwidth did not scale with the number of processors.
This observation highlights that, in NUMA systems, thread and data placement are important for reducing traffic, while the use of speculative architectural mechanisms such as out-of-order (OoO) execution and prefetching is one of the main factors increasing bandwidth usage on the interconnect. The dissertation shows that a significant portion of workloads is more sensitive to bandwidth than to latency. For these workloads, a different set of optimizations may therefore be appropriate.
Our workloads consist of synthetic kernels as well as C2C, PARSEC and Google's datacenter benchmark. These benchmarks are used to validate the model and to measure the performance degradation of contemporary benchmarks on AMD-based and Intel-based servers. Finally, we analytically calculate the potential performance degradation due to improper placement on Google's datacenter benchmark.
We find that PARSEC and Google's datacenter benchmark are not sensitive to improper placement with respect to latency.