Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 15

FAULT AVOIDANCE AND TOLERANCE TECHNIQUE

BY

Nithiyanandham M. Pavithra R.

Fault-Intolerance and Fault-Tolerance


The fault intolerance (or fault-avoidance) approach improves system reliability by removing the source of failures (i.e., hardware and software faults) before normal operation begins

The approach of fault-tolerance expect faults to be present during system operation, but employs design techniques which insure the continued correct execution of the computing process

FAULTS

A DS should be fault-tolerant Should be able to continue functioning in the presence of faults

Fault tolerance is related to dependability Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network errors/outages

Dependability
Dependability Includes Availability Reliability Safety Maintainability

Availability: A measurement of whether a system is ready to be used immediately. System is up and running at any given moment Reliability: A measurement of whether a system can run continuously without failure. System continues to function for a long period of time A system goes down 1ms/hr has an availability of more than 99.99%, but is unreliable A system that never crashes but is shut down for a week once every year is 100% reliable but only 98% available Safety: A measurement of how safe failures are System fails, nothing serious happens For instance, high degree of safety is required for systems controlling nuclear power plants Maintainability: A measurement of how easy it is to repair a system A highly maintainable system may also show a high degree of availability

A fault is the cause of the error An error is part of a system state that may lead to a failure A system fails when it cannot meet its promises (specifications) Faults can be three categories Transient (appear once and disappear) Intermittent (appear-disappear-reappear behavior) A loose contact on a connector intermittent fault Permanent (appear and persist until repaired)

FAULTS
Any fault may be fail-silent (fail-stop) Byzantine synchronous system vs. asynchronous system E.g., IP packet versus serial port transmission

ACHIEVING FAULT TOLERENCE


Redundancy information redundancy Hamming codes, parity memory ECC memory time redundancy Timeout & retransmit physical redundancy/replication TMR, RAID disks, backup servers Replication vs. redundancy: Replication: multiple identical units functioning concurrently
vote on outcome

REDUNDANCY
Redundancy is key technique for hiding failures When one unit is functioning others are available to fill in in case the unit ceases to work. Redundancy types: 1. Information: add extra (control) information Error-correction codes in messages 2. Time: perform an action persistently until it succeeds: Transactions 3. Physical: add extra components (S/W & H/W) Process replication, electronic circuits

ACTIVE REPLICATION Technique for fault tolerance through physical redundancy No redundancy:

Triple Modular Redundancy (TMR): Threefold component replication to detect and correct a single component failure

Availability: how much fault tolerance? 100 % fault-tolerance cannot be achieved. The closer we wish to get to 100%, the more expensive the system will be. Availability: % of time that the system is functioning five nines: system is up 99.999% of the time: 55.6 minutes downtime per year Three nines: system is up 99.9% of the time: 8.76 hours downtime per year Downtime includes all time when the system is unavailable.

GOAL OF FAULT TOLERANT COMPUTING


Dependability Reliability Availability Safety Security Performability Maintainability Testability Goal of tolerance

POINTS OF FAILURE Points of failure: A system is k-fault tolerant if it can withstand k faults. Need k+1 components with silent faults k can fail and one will still be working Need 2k+1 components with Byzantine faults k can generate false replies: k+1 will provide a majority vote

You might also like