Fault tolerance is the property of a system to continue operating correctly even if some of its components fail or experience faults. It requires that there is no single point of failure, faults are isolated to specific components, and failures do not propagate through the system. Fault tolerance is important for distributed systems to provide reliability, availability, and security. Common fault tolerance techniques are replication, where multiple copies of data are stored at different sites, and checkpointing, where consistent system states are periodically saved to stable storage and can be restored to after a failure.
Fault tolerance is the property of a system to continue operating correctly even if some of its components fail or experience faults. It requires that there is no single point of failure, faults are isolated to specific components, and failures do not propagate through the system. Fault tolerance is important for distributed systems to provide reliability, availability, and security. Common fault tolerance techniques are replication, where multiple copies of data are stored at different sites, and checkpointing, where consistent system states are periodically saved to stable storage and can be restored to after a failure.
Fault tolerance is the property of a system to continue operating correctly even if some of its components fail or experience faults. It requires that there is no single point of failure, faults are isolated to specific components, and failures do not propagate through the system. Fault tolerance is important for distributed systems to provide reliability, availability, and security. Common fault tolerance techniques are replication, where multiple copies of data are stored at different sites, and checkpointing, where consistent system states are periodically saved to stable storage and can be restored to after a failure.
Fault-tolerance or graceful degradation is the property that
enables a system (often computer-based) to continue operating
properly in the event of the failure of (or one or more faults within) some of its components. As an example in real life environment, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communications links which are imperfect or overloaded.
The basic characteristics of fault tolerance require:
1. No single point of failure
2. No single point of repair 3. Fault isolation to the failing component 4. Fault containment to prevent propagation of the failure 5. Availability of reversion modes
Two main reasons for the occurrence of a fault :
1. Node failure -Hardware or software failure.
2. Malicious Error-Caused by unauthorized Access.
Fault Tolerance is needed in order to provide 3 main feature to
distributed systems :
1. Reliability-Focuses on a continuous service with out any
interruptions. 2. Availability - Concerned with read readiness of the system. 3. Security-Prevents any unauthorized access.
Examples-Patient Monitoring systems, flight control systems, Banking
Services etc. Fault Tolerance Techniques 1) Replication :
Creating multiple copies or replica of data items and storing them
at different sites Main idea is to increase the availability so that if a node fails at one site, so data can be accessed from a different site. Has its limitation too such as data consistency and degree of replica.
2 ) Check Pointing
Saving the state of a system when they are in a consistent state
and storing it on a stable storage. Each such instance when a system is in the stable state is called a check point. In case of a failure, system is restored to its previous consistent state. Saves useful computation.