Fault Tolerance In Distributed Systems

The term fault tolerance can be understood from its two parts. A fault is an error or malfunction, that is, a deviation from normal behavior, and tolerance is the ability to put up with or endure that error. Fault tolerance is therefore the ability of a system to deal with malfunctions.



A fault is a deviation of a system from its expected behavior, in other words a malfunction. Faults can arise from various causes, including hardware failure, software bugs, network problems and operator errors.

Faults can be divided into three categories:

Transient faults

A transient fault occurs once and then disappears. For example, a network message fails to reach its destination on the first attempt but arrives when it is retransmitted.
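The retransmission example can be sketched in a few lines of Python. This is a minimal illustration, not a real network client: `TransientError`, `send_with_retry` and `make_flaky_sender` are hypothetical names introduced here, and the "network" is a simulated sender that drops a fixed number of messages before delivering.

```python
class TransientError(Exception):
    """Hypothetical error type standing in for a lost message."""


def send_with_retry(send, message, max_attempts=3):
    """Call send(message) until it succeeds or the attempts run out.

    Returns a (result, attempts_used) pair on success and re-raises the
    last TransientError if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message), attempt
        except TransientError as exc:
            last_error = exc  # message lost: retransmit on the next pass
    raise last_error


def make_flaky_sender(failures):
    """Simulated sender that drops the first `failures` messages."""
    remaining = [failures]

    def send(message):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientError("message lost")
        return f"delivered: {message}"

    return send


# The first transmission is lost, the retransmission gets through.
result, attempts = send_with_retry(make_flaky_sender(1), "hello")
print(result, "after", attempts, "attempts")
```

Retrying only helps when the fault really is transient. If the sender keeps failing, the loop gives up after `max_attempts` and surfaces the error instead of hiding it.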

Intermittent faults

An intermittent fault occurs, vanishes and then reappears, and this cycle repeats. These faults are often considered the most annoying kind. A loose connection is a typical cause of intermittent faults.

Permanent faults

Permanent faults are persistent. They continue to exist until the faulty component has been replaced or repaired. Disk head crashes, burnt-out power supplies and software bugs are examples of permanent faults (Coulouris, 2009).
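The three categories can be told apart, roughly, by how a fault shows up over time. The sketch below is a heuristic written for illustration only: `classify_fault` is a hypothetical helper, and it assumes the system is sampled at regular intervals with `True` meaning the fault was observed in that interval. A finite observation window cannot prove that a fault is permanent, so "permanent" here only means the fault appeared once and was still present when observation ended.

```python
def classify_fault(observations):
    """Classify a fault from a time-ordered list of booleans.

    True means the fault was observed in that interval (an assumption of
    this sketch, not a standard interface).
    """
    if not any(observations):
        return "no fault"

    # Count separate episodes, i.e. runs of consecutive faulty intervals.
    episodes = 0
    previous = False
    for faulty in observations:
        if faulty and not previous:
            episodes += 1
        previous = faulty

    if episodes > 1:
        return "intermittent"  # the fault came back after vanishing
    if observations[-1]:
        return "permanent"     # appeared once and never went away
    return "transient"         # appeared once and then disappeared


print(classify_fault([False, True, False, False]))       # transient
print(classify_fault([True, False, True, False, True]))  # intermittent
print(classify_fault([False, True, True, True]))         # permanent
```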

Any of the faults defined above can be either a fail-stop failure or a Byzantine failure. A fail-stop fault is one in which the faulty part stops functioning: it either produces no result or produces an output that shows the component has failed. A Byzantine fault is one in which the faulty part continues to run but produces incorrect results. Dealing with this type of fault is considered more troublesome.
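The difference matters because a fail-stop component is visibly silent, while a Byzantine component looks healthy unless its answer is checked against something else. The following sketch assumes one simple way of doing that check, majority voting across replicas. The replica functions and `vote` are hypothetical names, and this is not a full Byzantine agreement protocol.

```python
from collections import Counter


def healthy(x):
    return x * x        # correct replica


def fail_stop(x):
    return None         # stopped: produces no result at all


def byzantine(x):
    return x * x + 1    # keeps running but returns a wrong answer


def vote(replicas, x):
    """Return the answer given by a strict majority of all replicas."""
    results = [replica(x) for replica in replicas]
    # Fail-stop replicas are easy to spot: they simply have no answer.
    answers = [r for r in results if r is not None]
    if not answers:
        raise RuntimeError("every replica has stopped")
    value, count = Counter(answers).most_common(1)[0]
    if count <= len(replicas) // 2:
        raise RuntimeError("no majority answer")
    return value


# One silent replica and one lying replica are each outvoted by two good ones.
print(vote([healthy, fail_stop, healthy], 3))   # 9
print(vote([healthy, healthy, byzantine], 3))   # 9
```

With a single replica the fail-stop case is still detectable, since no answer comes back, but the Byzantine case is not: the wrong value 10 would be accepted as if it were correct.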

When fault tolerance is discussed, the terms synchronous and asynchronous are also used. A synchronous system is one that responds to a message within a known, bounded amount of time. An asynchronous system gives no such guarantee. Communication over a serial port is an example of a synchronous system, and communication via IP packets is an example of an asynchronous one (Pallickara, Bulut, Fox, 2013).
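The practical consequence is in how silence is interpreted. In a synchronous system, a reply that has not arrived within the known bound means something has failed. In an asynchronous system, a slow reply and a failed component look the same. The sketch below encodes that rule. `interpret_silence` and its parameters are hypothetical, with `bound_seconds=None` standing for an asynchronous system.

```python
def interpret_silence(elapsed_seconds, bound_seconds=None):
    """Decide what a missing reply tells us after `elapsed_seconds`.

    bound_seconds is the guaranteed maximum response time of a synchronous
    system, or None for an asynchronous system with no such guarantee.
    """
    if bound_seconds is None:
        # No bound: the reply may be slow or the peer may have failed.
        return "unknown"
    if elapsed_seconds > bound_seconds:
        # The bound has been exceeded, so the peer must have failed.
        return "failed"
    return "still waiting"


print(interpret_silence(0.5, bound_seconds=1.0))  # still waiting
print(interpret_silence(2.0, bound_seconds=1.0))  # failed
print(interpret_silence(2.0))                     # unknown
```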

Approaches to faults

Systems are designed so that the presence of faults is reduced. Fault avoidance is a process in which design and validation steps are carried out to ensure that the system prevents faults from occurring.

Fault removal is an approach in which faults in the system are identified and then removed. Methods include testing, verification and debugging, as well as replacing failed components or adding heat sinks to solve thermal dissipation problems.

Fault tolerance is the recognition that faults will occur in a system and that systems must be designed so that they are tolerant to the ...