M.Sc Thesis | |
---|---|
M.Sc Student | Fuchs Amit |
Subject | Fault-Tolerant Operating System for Many-Core Processors |
Department | Department of Computer Science |
Supervisor | Prof. Avi Mendelson |
A fault-tolerant distributed operating system is presented, designed to harness the massive parallelism in many-core distributed shared memory processors. It targets scale-out architectures with 1,000-10,000 fault-prone cores on-chip and waives traditional hardware-based consistency over the shared memory. The operating system allows applications to remain oblivious to hardware faults and efficiently utilize all cores of exascale systems-on-chip without performing explicit synchronization.
To scale efficiently and reliably as the number of cores rapidly increases while their reliability decreases, the new operating system provides fault-tolerant task-level parallelism to applications through a coarse-grained data-flow programming model. A decentralized wait-free execution engine was created to maximize task parallelism, scalability, and resiliency over unreliable processing cores. It combines message-passing and shared memory without strong consistency guarantees. Fine-grained checkpoints are intrinsic at all levels, enabling on-the-fly recovery of application-level tasks when hardware faults occur and automatically resuming their execution at minimal cost.
A prototype implementation of the new operating system was experimentally evaluated on a many-core full-system simulator. The presented results illustrate the characteristics and benefits of the new approach.
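
The general idea of data-flow tasks that are re-executed after a fault can be illustrated with a small sketch. This is not the thesis implementation: it is a sequential, centralized toy, whereas the engine described above is decentralized and wait-free. The names `run_dataflow`, `CoreFault`, and `faulty_attempts` are hypothetical and exist only for this illustration. The assumption is that a task's committed inputs serve as its checkpoint, so a task that fails can simply be run again from those inputs.

```python
from collections import deque


class CoreFault(Exception):
    """Simulated hardware fault on a core (hypothetical, for illustration)."""


def run_dataflow(tasks, deps, faulty_attempts=()):
    """Run tasks in data-flow order, re-executing any task that faults.

    tasks: name -> function taking the results of its dependencies, in order
    deps: name -> list of dependency names (tasks without an entry have none)
    faulty_attempts: (name, attempt) pairs on which a fault is injected
    """
    results = {}  # committed outputs; these act as the checkpoints
    attempts = {name: 0 for name in tasks}
    pending = {name: len(deps.get(name, [])) for name in tasks}
    consumers = {name: [] for name in tasks}
    for name, dep_names in deps.items():
        for dep in dep_names:
            consumers[dep].append(name)

    # A task becomes ready once all of its inputs have been produced.
    ready = deque(name for name in tasks if pending[name] == 0)
    while ready:
        name = ready.popleft()
        attempts[name] += 1
        inputs = [results[dep] for dep in deps.get(name, [])]
        try:
            if (name, attempts[name]) in faulty_attempts:
                raise CoreFault(name)
            output = tasks[name](*inputs)
        except CoreFault:
            # Inputs are still held in `results`, so the task is retried.
            ready.append(name)
            continue
        results[name] = output
        for consumer in consumers[name]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    return results, attempts


if __name__ == "__main__":
    tasks = {"a": lambda: 2, "b": lambda: 3, "sum": lambda x, y: x + y}
    deps = {"sum": ["a", "b"]}
    # Inject a fault on the first attempt of "sum"; it is transparently retried.
    results, attempts = run_dataflow(tasks, deps, faulty_attempts={("sum", 1)})
    print(results["sum"], attempts["sum"])
```

In this sketch the application code (the three lambdas) contains no fault handling or synchronization; the retry happens entirely inside the scheduler loop, which is the property the abstract attributes to the operating system.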