POS/355
June 10, 2013
It is important to understand that no distributed system is ever completely safe from failure. No matter how fault tolerant a system is designed to be, there is no such thing as a completely failure-proof system. Problems will always arise, so taking the necessary precautions and having strong problem-solving skills are essential to protecting a distributed system against any type of failure. We will discuss four types of failures that may occur within a distributed system and the proper way of addressing each of them. Without the proper precautions, knowledge, and understanding of distributed systems and their failures, business continuity is put at risk and can be disrupted.

One of the most common failures in a distributed system is hardware failure, which is also one of the main reasons why performing backups is necessary. Nothing makes you realize the importance of backups like an unrecoverable hard disk failure. Depending on which piece of hardware was the root of the failure, the fix can be as simple as a plug-and-play replacement, or the damage can be as extensive as a catastrophic meltdown. This type of failure also applies to a centralized system and can have the same consequences if the system is not properly designed to be fault tolerant.

To isolate this failure, you must understand the purpose of a synchronous system. This type of system sends a message to a device and waits a given time for it to respond. If no response is received within that time, it sends the message again. After a certain number of resends, the device is labeled as failed. The way to fix and avoid this failure is physical redundancy, meaning either active replication or a primary backup of the system. Physical redundancy also involves keeping spare physical components on hand to replace any hardware that has failed. Another common