Ph.D Thesis

Ph.D StudentAgbaria Adnan
SubjectReliability in High Performance Distributed Computing
DepartmentDepartment of Computer Science
Supervisor PROF. Roy Friedman


The increase in microcomputing power and network speed offer the possibility of replacing a supercomputer with a cluster of small computers, thereby reducing the cost of high performance computing. However, unlike traditional supercomputers, a distributed cluster is prone to partial hardware failures, and the probability of such a failure grows linearly with the number of nodes. Therefore, a cluster of workstations for high performance computing cannot replace a supercomputer functionality without supporting reliability.

This thesis addresses practical and theoretical approaches for providing reliability in high performance distributed computing systems. We start by presenting a framework for defining the distributed computing model that we consider. We define the model with checkpoint and recovery (changeably, restart) events, and then conditions on which a set of checkpoints is consistent. Checkpointing an application is the act of saving its state during execution on stable storage so that if the application fails, it can be recovered from the last saved state.

Following this, we present the Starfish system, an environment for executing dynamic (and static) MPI programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.

We also present a mechanism that enables Starfish to be a heterogeneous reliable distributed system. This is the heterogeneous checkpoint/restart module that allows to recover an application from a saved state that was taken in a hardware architecture and/or operating system that can be different from those in the machine on which it is recovered.

Next, we present a classification of distributed executions based on rollback propagation during recovery. Specifically, an execution is k-rollback, if k indicates the maximum number of checkpoints that need to be rolled back for recovery. We show a new class of execution, called d-bounded cycles, and show that this execution is (n-1)d-rollback. In addition, we present a checkpointing protocol that guarantees a d-bounded cycles execution.

Finally, we present a technique called overhead ratio for evaluating distributed checkpointing protocols. This technique gives a quantitative evaluation for each checkpointing protocol. Particularly, it allows us to compare various checkpointing protocols easily. As a result, we can choose the right protocol for a given environment.