Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Timor Aviel
Subject: Using Under-Utilized CPU Resources to Enhance its Reliability
Department: Department of Electrical Engineering
Supervisors: Professor Avi Mendelson, Professor Yitzhak Birk
Full Thesis Text: English version


Abstract

Soft errors (often called transient faults) are temporary faults that arise in a circuit from internal noise and from external sources such as cosmic particle hits. Though still relatively rare, soft errors are increasingly becoming a major impediment to processor reliability, primarily because of processor scaling characteristics. In the past, systems designed to tolerate such faults relied on high-cost customized solutions that used replicated hardware components to detect and recover from microprocessor faults. Today, the capability to detect and recover from faults is desired in commodity hardware, where price, performance and power are the main drivers. For such systems, the traditional solutions are inadequate and new approaches are needed.


We introduce two independent micro-architecture level techniques: Double Execution and Double Decoding. Both exploit the low average resource utilization that characterizes modern processors to enhance processor reliability. Double Execution protects the Out-Of-Order part of the CPU by executing each instruction twice on the existing hardware. Double Decoding extends the Double Execution scheme to also protect the decode logic: it uses a second, low-performance, low-power instruction decoder to detect soft errors in the decoder logic. We show that these techniques improve the processor's reliability with relatively low performance, power and hardware overheads, and that they are simple to implement. Finally, whenever the resulting reliability is "excessive", it can be traded back for performance by increasing the clock rate and/or reducing the voltage, with the resulting solution dominating single execution in all respects.
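
The idea behind Double Execution can be illustrated with a small software analogy. The sketch below is not the thesis's micro-architecture; it only shows the detect-by-comparison principle. The names `double_execute`, `FaultyALU` and `SoftErrorDetected` are hypothetical, and the retry-on-mismatch recovery policy is an assumption, since the abstract does not describe how recovery is performed.

```python
import operator


class SoftErrorDetected(Exception):
    """Raised when repeated executions of one instruction keep disagreeing."""


class FaultyALU:
    """Hypothetical execution unit that flips the lowest result bit
    on the call indices listed in `faulty_calls` (a simulated soft error)."""

    def __init__(self, faulty_calls=()):
        self.faulty_calls = set(faulty_calls)
        self.calls = 0

    def __call__(self, op, a, b):
        result = op(a, b)
        if self.calls in self.faulty_calls:
            result ^= 1  # simulated transient bit flip
        self.calls += 1
        return result


def double_execute(op, a, b, alu, max_retries=1):
    """Execute op(a, b) twice on the same unit and compare the results.

    On a mismatch the pair is re-executed (assumed recovery policy);
    after `max_retries` failed retries, SoftErrorDetected is raised.
    """
    for _ in range(max_retries + 1):
        first = alu(op, a, b)
        second = alu(op, a, b)
        if first == second:
            return first
    raise SoftErrorDetected(f"results disagree for operands {a}, {b}")


# A fault on the very first execution is detected and masked by one retry.
alu = FaultyALU(faulty_calls={0})
recovered = double_execute(operator.add, 2, 3, alu)
```

Because a transient fault rarely corrupts two executions in the same way, comparing the two results detects it. The cost in this analogy is the extra execution, which the thesis proposes to absorb using otherwise idle processor resources.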