Ph.D Thesis

Ph.D Student: Hadad Erez
Subject: Architectures for Fault-Tolerant Middleware Services
Department: Department of Computer Science
Supervisor: Prof. Roy Friedman


Middleware-based distributed computing is the key technique for constructing enterprise-scale software systems, both now and in the foreseeable future, for reasons such as flexibility, component re-usability and abstraction of low-level component interaction.

Partial failures are a fundamental problem in distributed systems, arising from the actual distribution of system functions among multiple components that can fail independently of one another. In the context of middleware client-server interaction, such failures might disrupt the service provided to clients, which is undesirable, especially for mission-critical applications. Hence, our initial research problem focuses on adding fault-tolerance capabilities to the middleware as a generic method for constructing highly-available middleware services.

In this thesis we present FTS, a CORBA Fault-Tolerance Service that operates through active replication of objects. FTS embodies a novel approach to solving our research problem, called the object-adaptor approach. In essence, this new approach melds together qualities of preceding techniques, resulting in a unique combination of portability, interoperability, performance and simplicity. Furthermore, through FTS we provide constructive criticism of the Fault-Tolerant CORBA (FT-CORBA) standard, which aims to be a comprehensive solution to the above problem. Last, we study the performance of FTS, and deduce general limitations of CORBA's ability to support sophisticated high-level services.

As additional contributions, we study two aspects of FTS operation that have general relevance to middleware services and/or state machine replication. First, we focus on batching (or packing) multiple requests into a single ABCAST message as a technique to boost the throughput of the ABCAST protocol, and consequently the throughput of an active replication service that uses it. We observe that maintaining good throughput with ABCAST protocols under varying client request rates involves adapting the batching threshold to the request rate. Consequently, we provide two variants of an adaptive batching mechanism and show their advantage over classic fixed-threshold batching mechanisms.
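
As a rough illustration of the idea of adapting the batching threshold to the request rate, the following Python sketch may help. It is not either of the two variants developed in the thesis; the class name AdaptiveBatcher, the parameters target_delay, max_batch and alpha, and the rate estimator are all assumptions made for this example.

```python
class AdaptiveBatcher:
    """Hypothetical sketch: packs requests into batches whose size tracks the arrival rate."""

    def __init__(self, target_delay=0.01, max_batch=64, alpha=0.5):
        self.target_delay = target_delay  # assumed bound on how long a request may wait (seconds)
        self.max_batch = max_batch        # assumed upper limit on requests per message
        self.alpha = alpha                # smoothing factor for the rate estimate
        self.rate = 0.0                   # estimated requests per second
        self.last_arrival = None
        self.pending = []

    def threshold(self):
        # Requests expected to arrive within target_delay, clamped to [1, max_batch].
        expected = int(self.rate * self.target_delay)
        return max(1, min(self.max_batch, expected))

    def submit(self, request, now):
        """Add a request at time `now`; return a full batch, or None if still collecting."""
        if self.last_arrival is not None and now > self.last_arrival:
            sample = 1.0 / (now - self.last_arrival)
            # Exponentially weighted moving average of the instantaneous rate.
            self.rate = self.alpha * sample + (1 - self.alpha) * self.rate
        self.last_arrival = now
        self.pending.append(request)
        if len(self.pending) >= self.threshold():
            batch, self.pending = self.pending, []
            return batch  # in a real system this would become one ABCAST message
        return None
```

Under a low request rate the threshold stays at 1, so each request is sent immediately; under a high rate the threshold grows and more requests share one message. A practical batcher would also need a timer that flushes a partial batch when arrivals stop, which this sketch omits.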

Last, we present a scheme for reducing the memory footprint of replication services such as FTS and FT-CORBA compliant systems. One of the largest memory-consuming components is the reply cache, which is used for maintaining at-most-once semantics of request execution. We show how to apply Selective Acknowledgments (SACKs) to speed up cache purging beyond its time-based expiration mechanism, in a way that matches FTS's client-to-single-server interaction. This technique can also be applied to many types of passive replication systems. We further test this technique and show its effectiveness.
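
The interplay between a reply cache, time-based expiration and acknowledgment-driven purging can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the scheme implemented in FTS: the class ReplyCache, its method names, the ttl parameter, and the acknowledgment format (a cumulative request id plus a list of selectively acknowledged ids) are all hypothetical.

```python
class ReplyCache:
    """Hypothetical sketch of a reply cache with time-based and SACK-based purging."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl       # assumed lifetime of a cached reply (seconds)
        self.entries = {}    # (client_id, request_id) -> (reply, stored_at)

    def execute(self, client_id, request_id, handler, now):
        """Run handler at most once per request; duplicates get the cached reply."""
        key = (client_id, request_id)
        if key in self.entries:
            return self.entries[key][0]
        reply = handler()
        self.entries[key] = (reply, now)
        return reply

    def acknowledge(self, client_id, cumulative, selective=()):
        """Purge replies the client reports as received: every request id up to
        `cumulative`, plus the individual ids listed in `selective`."""
        acked = set(selective)
        purge = [k for k in self.entries
                 if k[0] == client_id and (k[1] <= cumulative or k[1] in acked)]
        for key in purge:
            del self.entries[key]

    def expire(self, now):
        """Fallback purge: drop entries older than ttl, acknowledged or not."""
        stale = [k for k, (_, stored_at) in self.entries.items()
                 if now - stored_at > self.ttl]
        for key in stale:
            del self.entries[key]
```

Here acknowledged replies leave the cache as soon as the acknowledgment arrives, instead of occupying memory until the expiration timer fires; the timer remains as a safety net for clients that never acknowledge.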