Robert Martinez
POS 355
May 12, 2014
William Davis
Failures in a Distributed System
A distributed system is a collection of individual computers that appears to its users to work as a single unit. These computers share processing power, memory, and hard drive space. While this type of system is very efficient, it does have its problems.
The four categories of failures that occur in a distributed system are hardware failures, omission failures, operating system failures (crashes), and Byzantine failures. Failures are often confused with faults. Paul Krzyzanowski, in his paper titled Fault Tolerance, Dealing with an imperfect world, defines a fault as “a deviation from the expected behavior of a system: a malfunction.” Krzyzanowski also lists three types of faults: transient, intermittent, and permanent. Transient faults occur once and then disappear, such as when a message fails to reach its destination and has to be resent. Intermittent faults recur, repeatedly appearing and then disappearing. Permanent faults persist until the faulty component is repaired or replaced.
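The three fault types can be told apart by how a fault shows up over time. The following is a minimal sketch of that idea, not something taken from Krzyzanowski's paper: the function name classify_fault and the representation of a component's history as a list of true/false observations are assumptions made only for illustration.

```python
def classify_fault(observations):
    """Classify a fault from a history of checks (True = fault observed).

    Hypothetical helper for illustration only. The rules below are a
    simplification of the transient / intermittent / permanent distinction.
    """
    faults = sum(observations)  # True counts as 1, False as 0
    if faults == 0:
        return "none"
    if faults == 1:
        # The fault appeared a single time.
        return "transient"
    first = observations.index(True)
    if all(observations[first:]):
        # Once the fault appeared, it never went away.
        return "permanent"
    # The fault appeared more than once but also disappeared in between.
    return "intermittent"


if __name__ == "__main__":
    print(classify_fault([False, True, False, False]))
    print(classify_fault([False, True, False, True, False]))
    print(classify_fault([False, True, True, True]))
```

In practice a system cannot know in advance which type of fault it is seeing, so a history like this would only support a best guess.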
Hardware failures are the failure of a physical component in a system. These failures were once very common, but with changes in design and in how components are manufactured they are becoming fewer and fewer. Most hardware failures occur at network connections or hard drives. Distributed systems use redundant servers and backup drives so that the failure of a single component does not bring down the whole system.
Red Hat characterizes an omission failure as “a component that does not respond to an input from another component, and thereby fails by not producing the expected output” (Wulf, 2013). Most users recognize omission failures as a failure to send or receive a message. Distributed systems handle omission failures with measures such as the acknowledgment (ACK) response in a reliable end-to-end transmission. If the sender does not receive the ACK response within a set time, the transmission is resent.
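The ACK-and-resend approach described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the behavior of any particular protocol or product: the LossyChannel class and the send_reliably function are hypothetical names, and a missing ACK is modeled as a False return value rather than a real timeout.

```python
class LossyChannel:
    """Hypothetical channel that drops its first `drops` transmissions."""

    def __init__(self, drops):
        self.drops = drops
        self.delivered = []

    def send(self, message):
        """Return True (an ACK) if delivered, False if the message was lost."""
        if self.drops > 0:
            self.drops -= 1
            return False  # omission failure: no ACK comes back
        self.delivered.append(message)
        return True


def send_reliably(channel, message, max_attempts=5):
    """Resend `message` until an ACK arrives or the attempts run out.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if channel.send(message):
            return attempt
    return None


if __name__ == "__main__":
    channel = LossyChannel(drops=2)
    attempts = send_reliably(channel, "hello")
    print("delivered after", attempts, "attempts")
```

The limit on attempts matters: without it, a sender facing a permanent fault would resend forever instead of reporting the failure.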
References
Krzyzanowski, P. (2009, April). Fault Tolerance, Dealing with an imperfect world. Retrieved from https://www.cs.rutgers.edu/~pxk/rutgers/notes/content/ft.html
Wulf, J. (2013, October 31). JBoss Enterprise SOA Platform 4.2. Retrieved from https://access.redhat.com/site/documentation/en-US/JBoss_Enterprise_SOA_Platform/4.2/html/SOA_ESB_Programmers_Guide/SOA_ESB_Programmers_Guide-_Fault_tolerance_and_Reliability_-_Failure_classification_.html