M.Sc Thesis


M.Sc Student: Zilber Erez
Subject: TPT-RAID: a High Performance Box-Fault Tolerant Storage System
Department: Department of Electrical and Computer Engineering
Supervisor: Associate Professor Yitzhak Birk


Abstract

Storage devices are inexpensive. Reliable storage systems, however, remain expensive, and even they are susceptible to box-level failures, which render an entire ECC group unavailable. One solution is a multi-box RAID, wherein each error-correction group uses at most one block from each storage box, so that the failure of a single box costs each group at most one block. We introduce TPT-RAID, a highly available and scalable, yet simple, multi-box RAID. TPT-RAID extends the idea of an out-of-band SAN controller into the RAID: data is sent directly between hosts and targets and among targets, while the RAID controller supervises the ECC calculations performed by the targets. This avoids a communication bottleneck at the controller and improves performance dramatically while retaining the simplicity of centralized control. Moreover, TPT-RAID conforms to a conventional switched network architecture, whereas an in-band RAID controller would either constitute a communication bottleneck or have to be constructed as a full-fledged router. TPT-RAID can be implemented as a software extension to a SAN controller without hardware changes. This, as well as TPT-RAID's scalability, is demonstrated by our TPT-RAID prototype, which uses InfiniBand, an emerging very-high-speed interconnect with RDMA capability. We prove the correctness and completeness of TPT-RAID. Finally, we describe the required protocol extensions.
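To make the multi-box idea and target-side ECC computation concrete, the following is a minimal in-memory sketch, not the TPT-RAID protocol itself. It assumes single-parity XOR (RAID-5-style) error-correction groups and a simple rotated placement. The names `MultiBoxRaid`, `multi_box_layout` and `xor_blocks` are hypothetical, and plain dictionaries stand in for the storage boxes (targets) and for the direct target-to-target transfers that the prototype performs over InfiniBand. The sketch shows two properties: each group places at most one block on any box, so losing a whole box leaves every group recoverable; and on a small write the data target computes the parity delta and hands it straight to the parity target, so no data passes through the controller.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks (assumed RAID-5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def multi_box_layout(num_boxes: int, group_width: int, num_groups: int):
    """Give each ECC group `group_width` distinct boxes (at most one block
    per box), rotating the starting box so that parity spreads across boxes.
    The rotation is an illustrative assumption, not the thesis's layout."""
    assert group_width <= num_boxes
    return [[(g + i) % num_boxes for i in range(group_width)]
            for g in range(num_groups)]


class MultiBoxRaid:
    """Hypothetical model: boxes[b][g] is box b's block of group g.
    The last member of each group holds that group's parity."""

    def __init__(self, num_boxes, group_width, num_groups, block_size=4):
        self.layout = multi_box_layout(num_boxes, group_width, num_groups)
        self.boxes = [dict() for _ in range(num_boxes)]
        zero = bytes(block_size)
        for g, members in enumerate(self.layout):
            for b in members:
                self.boxes[b][g] = zero  # all-zero data => all-zero parity

    def parity_box(self, g):
        return self.layout[g][-1]

    def write(self, g, pos, new_data: bytes):
        """Small write of data block `pos` in group g. The controller only
        directs the transfers: the data target computes old XOR new and
        sends that delta directly to the parity target, which folds it in."""
        assert pos < len(self.layout[g]) - 1, "pos must name a data block"
        data_box = self.layout[g][pos]
        delta = xor_blocks(self.boxes[data_box][g], new_data)  # at data target
        self.boxes[data_box][g] = new_data
        p = self.parity_box(g)
        self.boxes[p][g] = xor_blocks(self.boxes[p][g], delta)  # at parity target

    def read_with_failed_box(self, g, pos, failed_box):
        """Read block `pos` of group g while `failed_box` is down. Because the
        group has at most one block on that box, XOR of the survivors suffices."""
        box = self.layout[g][pos]
        if box != failed_box:
            return self.boxes[box][g]
        survivors = [self.boxes[b][g] for b in self.layout[g] if b != failed_box]
        return xor_blocks(*survivors)


if __name__ == "__main__":
    raid = MultiBoxRaid(num_boxes=5, group_width=4, num_groups=5)
    raid.write(0, 0, b"ABCD")
    raid.write(0, 1, b"efgh")
    raid.write(0, 2, b"1234")
    # Box 0 fails entirely; group 0's block on it is rebuilt from boxes 1..3.
    assert raid.read_with_failed_box(0, 0, failed_box=0) == b"ABCD"
```

The read-modify-write delta update is the standard single-parity small-write rule (new parity = old parity XOR old data XOR new data). It was chosen here only because it shows ECC work done entirely at the targets. Codes with more than one check block would need a different update.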