M.Sc Thesis

M.Sc Student: Fuchs Amit
Subject: Fault-Tolerant Operating System for Many-Core
Department: Department of Computer Science
Supervisor: Prof. Avi Mendelson
Full Thesis Text: English Version


Creating operating systems for many-core processors is a critical challenge. Contemporary multi-core systems inhibit scaling by relying on hardware-based cache coherency and atomic primitives to guarantee consistency and synchronization when accessing shared memory. Moreover, projections of transistor-scaling trends predict that hardware fault rates will increase by orders of magnitude, and that the microarchitecture alone will not provide adequate robustness in the exascale era. Resilience must be considered at all levels; operating systems cannot continue to assume that processors are error-free.

A fault-tolerant distributed operating system is presented, designed to harness the massive parallelism in many-core distributed shared memory processors. It targets scale-out architectures with 1,000-10,000 fault-prone cores on-chip and waives traditional hardware-based consistency over the shared memory. The operating system allows applications to remain oblivious to hardware faults and efficiently utilize all cores of exascale systems-on-chip without performing explicit synchronization.

To scale efficiently and reliably as the number of cores rapidly increases while their reliability decreases, the new operating system provides fault-tolerant task-level parallelism to applications through a coarse-grained data-flow programming model. A decentralized wait-free execution engine was created to maximize task parallelism, scalability, and resiliency over unreliable processing cores. It combines message-passing and shared memory without strong consistency guarantees. Fine-grained checkpoints are intrinsic at all levels, enabling on-the-fly recovery of application-level tasks in the case of hardware faults, automatically resuming their execution with minimal costs.
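
The recovery idea described above can be illustrated with a minimal sketch. It is not the thesis's execution engine: it is sequential rather than decentralized or wait-free, and every name in it (`Task`, `CoreFault`, `run_graph`, `faulty_attempts`) is hypothetical. It assumes that each task is a pure function of its inputs, that committed task outputs serve as checkpoints, and that a hardware fault is simulated by raising an exception, after which the task is re-executed from its checkpointed inputs.

```python
class CoreFault(Exception):
    """Simulated hardware fault on a processing core (hypothetical)."""


class Task:
    """A coarse-grained data-flow task: a pure function plus its dependencies."""

    def __init__(self, name, fn, deps=()):
        self.name = name
        self.fn = fn
        self.deps = tuple(deps)


def run_graph(tasks, faulty_attempts=None, max_retries=3):
    """Run tasks in data-flow order, re-executing a task after a simulated fault.

    faulty_attempts maps a task name to the number of attempts that should
    fail before the task succeeds (an assumption made for this illustration).
    Returns (results, retries).
    """
    faulty_attempts = dict(faulty_attempts or {})
    results = {}   # committed outputs; these act as the checkpoints
    retries = {}   # number of recoveries performed per task
    pending = {t.name: t for t in tasks}

    while pending:
        # A task is ready once all of its inputs have been committed.
        ready = [t for t in pending.values()
                 if all(d in results for d in t.deps)]
        if not ready:
            raise ValueError("cycle or missing dependency in task graph")

        for task in ready:
            # Inputs are read from committed results and never mutated,
            # so a failed attempt can simply be restarted from them.
            inputs = tuple(results[d] for d in task.deps)
            for _ in range(max_retries + 1):
                try:
                    if faulty_attempts.get(task.name, 0) > 0:
                        faulty_attempts[task.name] -= 1
                        raise CoreFault(task.name)
                    output = task.fn(*inputs)
                except CoreFault:
                    retries[task.name] = retries.get(task.name, 0) + 1
                    continue
                results[task.name] = output   # commit only on success
                break
            else:
                raise RuntimeError(f"task {task.name} failed after retries")
            del pending[task.name]

    return results, retries
```

In this sketch the application code (the task functions) never observes the fault; only the runner sees it, which mirrors the goal of keeping applications oblivious to hardware faults.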

A prototype implementation of the new operating system was experimentally evaluated on a many-core full-system simulator. The presented results exemplify the characteristics and benefits of the new approach.