Xxxxx Xxxxxx
Class
Date
Instructor
Abstract
This paper examines failures that occur in distributed and centralized systems. It also discusses how to isolate these failures and the procedures needed to fix them.
Failures
A distributed system is a collection of network-attached systems working as one. Users of such a system should perceive it as a single integrated system. Building systems this way has many benefits; improved availability, reduced cost, and higher performance are just a few. However, even with these benefits, failures still occur. After all, there is no such thing as a "perfect system," is there?
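The availability benefit mentioned above can be illustrated with a small calculation. The sketch below is not taken from any particular system; it assumes that each node fails independently and that the service stays up as long as at least one replica is running. The function name and the 0.99 figure are hypothetical, chosen only for illustration.

```python
def replicated_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` nodes is up.

    Assumption (for illustration only): nodes fail independently and
    any single surviving replica is enough to serve users.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - node_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical per-node availability of 99%.
    for n in (1, 2, 3):
        print(n, replicated_availability(0.99, n))
```

Under these assumptions each added replica shrinks the chance of a total outage, which is one reason replication is used. Real failures are often correlated (a shared network or power source, for example), so the figure should be read as an upper bound rather than a guarantee.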
This paper discusses four types of failure: crash failures, timing failures, network failures, and Byzantine failures.
Crash Failures
Crash failures can halt the distributed system for some time. The most common cause of this type of fault is an operating system failure. This type of fault can be isolated to individual "problem" systems, which can then be developed further to produce more fault-tolerant systems. Crash failures are also seen in centralized systems. How many times has a user seen the blue screen of death on a Windows-based system?
Timing Failures
Timing failures occur in a distributed system when a client expects a response from the server and the response does not arrive within the expected time frame. Some clients cannot wait for the required response from the server. This causes the server operation to fail, resulting in a timing failure.
Network Failures
Because distributed systems communicate across a network, the failure of that network would be disastrous. Depending on the scale of the network, the failure could be easy or very difficult to isolate. Traceroutes could be used to see whether there is a distinguishable point of failure in the network. Once the point of