Achieving Fault-Tolerance in Operating System Design and Implementation

OSAGU, JESSICA CHINEZIE
OBAFEMI AWOLOWO UNIVERSITY, ILE-IFE, NIGERIA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Introduction
Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults. A fault-tolerant system may tolerate one or more types of fault, including (i) transient, intermittent, or permanent hardware faults, (ii) software and hardware design errors, (iii) operator errors, and (iv) externally induced upsets or physical damage. An extensive methodology has been developed in this field over the past thirty years, and a number of fault-tolerant machines have been built. Most of them deal with random hardware faults, while a smaller number deal with software, design, and operator faults to varying degrees. A large amount of supporting research has been reported.

Research on fault tolerance and dependable systems covers a wide spectrum of applications, including embedded real-time systems, commercial transaction systems, transportation systems, and military and space systems. The supporting research includes system architecture, design techniques, coding theory, testing, validation, proof of correctness, modelling, software reliability, operating systems, parallel processing, and real-time processing. These areas often draw on widely diverse core expertise, ranging from formal logic and the mathematics of stochastic modelling to graph theory, hardware design, and software engineering.
Recent developments include the adaptation of existing fault-tolerance techniques to RAID disk arrays, in which data is striped across several disks to improve bandwidth and a redundant disk holds encoded information (for example, parity) so that the data can be reconstructed if a disk fails. Another area is the use of application-based fault-tolerance techniques to detect errors in high-performance parallel processors.
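
The redundant-disk idea can be illustrated with a small sketch. It assumes the encoded information is a simple XOR parity block, which is only one of several possible encodings and is not specified in the text above. The function names (xor_blocks, make_stripe, reconstruct) and the sample data are hypothetical and chosen purely for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block (assumed XOR parity)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, failed_index):
    """Rebuild the block at failed_index from all surviving blocks in the stripe."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Three hypothetical data "disks", each holding one 4-byte block of a stripe.
data = [b"ABCD", b"EFGH", b"IJKL"]
stripe = make_stripe(data)

# Simulate losing the second disk and rebuilding its block from the others.
recovered = reconstruct(stripe, 1)
print(recovered == data[1])
```

Because XOR is its own inverse, the same routine rebuilds either a lost data block or the lost parity block, but a single parity block can only cover one failed disk per stripe.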



