M.Sc Thesis

M.Sc StudentGreen Oded
SubjectScheduling Directives for Shared-Memory Many-Core
Processsor Systems
DepartmentDepartment of Electrical and Computer Engineering
Supervisor ASSOCIATE PROFESSOR Yitzhak Birk


We consider a shared-memory (no private caches) many-core architecture. A program comprises a set of single-core tasks along with a set of precedence relations among them, which represent data dependences and ensure correct execution. For reasons such as programming convenience and reduced code foot print, multiple-instance (“duplicable”) tasks are used in data-parallel situations such as summing up the rows of a matrix. Dispatching tasks to cores is done by hardware within very few clock cycles and at a very high rate.  This is thus a dataflow machine at the inter-task level, with conventional control flow within each task. The Plurality Hypercore is such an architecture.

The precedence constraints guarantee correctness, and the absence of private caches obviates the need to consider which core should execute any given task. However, one must still decide the dispatching order whenever the number of runnable tasks exceeds that of available cores. This choice among correct execution orders can impact performance: 1) it can mitigate bottlenecks, namely situations wherein a task that must precede many others is scheduled later than it could have been and now causes cores to be idle awaiting its completion, and 2) it can impact the instantaneous memory footprint of the program and its data, thereby affecting the hit rate of the shared cache. For a given number of cores and a program with known task execution times, one could simply add precedence relations in order to enforce the desired scheduling order. This, however, is impossible in general, and the problem is most acute with duplicable tasks, as the precedence constraints apply jointly to all task instances.

This work focuses on scheduling constructs (“primitives”) that can be used by programmers and by automatic optimization tools to further direct the runtime dispatcher, with special attention to duplicable tasks.  Such constructs must express relations that occur in real programs and whose translation into scheduling directives impacts performance. Additionally, they must lend themselves to efficient implementation in hardware. We present several such primitives that increase the expressive power of the programmer and/or optimizer, along with examples in which they increase performance with different numbers of cores. We also outline their implementation in the context of a Hypercore-like system, thereby proving them to be practical.