Ph.D. Thesis, Department of Electrical Engineering
Supervisors: Prof. Uri Weiser, Prof. Idit Keidar, Assoc. Prof. Avinoam Kolodny
While on-die memory sub-systems have long been a principal challenge in computer architecture, the shift to Chip Multiprocessors (CMPs) both greatly intensifies the problem and opens the door to new types of solutions. In this work we address two such cases, where the interplay between caches and multithreading produces unique phenomena and calls for novel cache designs.
In our first contribution we address a new cache organization for multi-core machines (CMPs with a handful of cores). We introduce Nahalal, an architecture whose novel floorplan topology partitions cached data according to its usage (shared versus private data), and thus enables fast access to shared data for all processors while keeping each processor's private data close to it. In Nahalal, a fraction of the on-die memory capacity budget is dedicated to hot shared data and is located in the center of the chip, enclosed by all processors. The rest of the cache capacity is placed in the outer area of the die and provides private storage space for each core. The Nahalal topology is particularly appropriate for common multi-threaded applications, since a small subset of their working set is typically shared by many cores and is accessed numerous times during the application's lifetime.
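The shared-versus-private placement idea can be illustrated with a minimal sketch. This is an assumption-laden toy, not the thesis's actual mechanism: the function name, the sharer threshold, and the bank labels are all invented for illustration; the only point it captures is that lines touched by multiple cores go to the central shared bank, while single-owner lines stay in the owner's outer bank.

```python
# Toy sketch of a Nahalal-style placement decision (illustrative only;
# the threshold and names below are assumptions, not the real design).

SHARED_THRESHOLD = 2  # a line touched by >= this many cores counts as "shared" (assumed)

def place_line(accessing_cores):
    """Decide where a cache line should live, given its observed sharers.

    accessing_cores: set of core ids that have accessed the line.
    Returns "center" for the shared bank enclosed by all cores, or
    ("private", core_id) for the owning core's outer bank.
    """
    if len(accessing_cores) >= SHARED_THRESHOLD:
        return "center"           # hot shared data: roughly equidistant from all cores
    (core,) = accessing_cores     # exactly one accessor so far
    return ("private", core)      # private data stays near its owner core

print(place_line({0, 3, 5}))      # shared by three cores
print(place_line({2}))            # private to core 2
```

The sketch deliberately ignores migration and capacity limits; in a real design, lines would move between regions as their sharing pattern changes.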
In our second contribution we address the interplay of threads and caches in the new generation of many-core machines (CMPs with a large number of simple cores). These systems now use both large caches and aggressive multi-threading to mitigate the off-chip memory access problem, but the combination of these two very different approaches makes performance prediction challenging, and hinders understanding of the basic interplay between workloads and architectures and its effect on performance and power.
To address these challenges, we provide a high-level, closed-form model that captures both the architecture and the application characteristics. Specifically, for a given application, the model describes its performance and power as a function of the number of threads it runs in parallel, across a range of architectures.
We use the analytical model to qualitatively study how different properties of both the workload and the architecture affect performance and power. Our findings reveal distinctly different behavior patterns across application families and architectures, as well as a non-intuitive "performance valley" in which machines deliver inferior performance. We characterize the shape of this valley and provide insights into how it can be avoided.
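The valley phenomenon can be reproduced with a toy throughput model. This is a hedged sketch, not the thesis's actual closed-form model: every parameter (cache size, per-thread working set, memory latency, memory-access fraction) and the simple miss-rate curve are assumptions chosen only to make the qualitative shape visible. Performance first rises while the combined working set fits in cache, drops into the valley once the cache overflows, and recovers only when enough threads run to hide the off-chip latency.

```python
# Toy performance-vs-threads model illustrating a "performance valley"
# (all parameters below are assumed values for illustration).

CACHE_LINES = 4096       # total on-die cache capacity, in lines (assumed)
WSET_PER_THREAD = 512    # per-thread working set, in lines (assumed)
T_MEM = 200              # off-chip memory latency, in cycles (assumed)
CPI_HIT = 1.0            # cycles per instruction when hitting in cache (assumed)
MEM_FRACTION = 0.2       # fraction of instructions accessing memory (assumed)

def miss_rate(threads):
    """Miss rate stays low until the combined working set exceeds the cache."""
    demand = threads * WSET_PER_THREAD
    if demand <= CACHE_LINES:
        return 0.01                                  # cold/conflict misses only
    return min(1.0, 0.01 + (demand - CACHE_LINES) / demand)

def throughput(threads):
    """Aggregate instructions per cycle: each thread's CPI includes miss
    stalls, and running more threads overlaps those stalls."""
    cpi = CPI_HIT + MEM_FRACTION * miss_rate(threads) * T_MEM
    return threads / cpi

# Sweep thread counts: rise (cache regime), valley (cache overflow),
# recovery (enough threads to hide memory latency).
for n in (1, 4, 8, 16, 64, 256):
    print(n, round(throughput(n), 2))
```

With these assumed numbers, throughput peaks around 8 threads, collapses at 16 when the cache overflows, and climbs back past its earlier peak only at very high thread counts, mirroring the valley shape the model exposes.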