M.Sc Thesis

M.Sc Student: Kama Alon
Subject: Transparent Fault-Tolerant Java Virtual Machine
Department: Department of Computer Science
Supervisor: Prof. Roy Friedman


Replication is one of the prominent approaches for obtaining fault tolerance.  In a distributed environment, where computers are connected by a network, replication can be implemented by having multiple copies of a program run concurrently.  In cases where a copy on one of the computers crashes, the others may proceed normally and mask that failure.

Implementing replication as described above on commodity hardware and in a transparent fashion, i.e., without changing the programming model, poses many challenges.  Deciding at what level (hardware, operating system, middleware, or application) to implement the replication has ramifications for development costs and the portability of programs.  Other difficulties lie in coordinating the copies in the face of non-determinism, such as I/O and environment differences (e.g., different clocks).  Finally, overhead must be minimized so that performance remains acceptable.

We report on an implementation of transparent fault tolerance at the virtual machine level of Java.  We describe the design of the system and present performance results that in certain cases are equivalent to those of non-replicated executions.  We also discuss design decisions stemming from implementing replication at the virtual machine level, and the special considerations necessary in order to support Symmetric Multi-Processors (SMP).