OBAFEMI AWOLOWO UNIVERSITY, ILE-IFE, NIGERIA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ACHIEVING FAULT-TOLERANCE IN OPERATING SYSTEM DESIGN AND IMPLEMENTATION
Introduction
Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults. A fault-tolerant system may tolerate one or more types of fault, including: (i) transient, intermittent, or permanent hardware faults; (ii) software and hardware design errors; (iii) operator errors; and (iv) externally induced upsets or physical damage. An extensive methodology has been developed in this field over the past thirty years, and a number of fault-tolerant machines have been built. Most deal with random hardware faults, while a smaller number deal with software, design, and operator faults to varying degrees. A large body of supporting research has been reported.
Research on fault tolerance and dependable systems covers a wide spectrum of applications, including embedded real-time systems, commercial transaction systems, transportation systems, and military and space systems. The supporting research includes system architecture, design techniques, coding theory, testing, validation, proof of correctness, modelling, software reliability, operating systems, parallel processing, and real-time processing. These areas draw on widely diverse core expertise, ranging from formal logic, the mathematics of stochastic modelling, and graph theory to hardware design and software engineering.
Recent developments include the adaptation of existing fault-tolerance techniques to RAID disk arrays, in which information is striped across several disks to improve bandwidth and a redundant disk holds encoded information so that the data can be reconstructed if a disk fails. Another area is the use of application-based fault-tolerance techniques to detect errors in high-performance parallel processors.
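The reconstruction idea behind the redundant disk can be illustrated with a minimal sketch. It assumes the simplest possible encoding, a byte-wise XOR parity block, which the text above does not specify; the function names (xor_blocks, make_stripe, reconstruct) and the sample data are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block.

    Assumption: the redundant disk stores the XOR of all data blocks.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, failed_index):
    """Rebuild the block held by the failed disk from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# One stripe spread over three data disks plus one redundant disk.
data = [b"ABCD", b"EFGH", b"IJKL"]
stripe = make_stripe(data)

# Simulate the loss of disk 1 and rebuild its contents.
recovered = reconstruct(stripe, failed_index=1)
```

Because XOR is its own inverse, combining every surviving block (data and parity alike) yields the missing one, so the same routine also rebuilds the parity block if the redundant disk itself fails. A single parity block tolerates only one failed disk per stripe.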