Abraham 1986
Invited Paper
This paper describes a variety of fault and error models which are used as the basis for designing fault-tolerant Very Large Scale Integrated (VLSI) systems. The fault models describe physical defects and failures and the input patterns which will expose them, and are suitable for testing, while error models describe the effects on the functional outputs of defects and are useful for on-line error detection. The models are described at various levels of abstraction. The differences between fault and error models for identical functional modules are also illustrated.

INTRODUCTION

Advances in Very Large Scale Integration (VLSI) technology result in complex chips which, in turn, lead to lower cost for extremely complex systems. As such systems are increasingly used in critical applications, there is a need to ensure that the computations produced by these systems are dependable. The discipline of fault-tolerant computing deals with the design of systems which are tolerant to failures and, hence, result in a higher level of dependability.

Laprie [1] has proposed a service-oriented definition of system failure where a system failure is defined to occur when the delivered service deviates from the specified service. The failure occurs because the system was erroneous; an error is that part of the system state which is liable to failure, i.e., to the delivery of a service not complying with the specified service. The cause of an error is said to be a fault. Thus impairments to dependable operation are the faults in the system, which could be either due to incorrect design or specification of the system, or faults during the manufacturing process, or a fault-free system could subsequently become faulty due to physical or environmental causes. This paper will discuss faults and errors due to improper manufacturing or subsequent wearout of systems in the field. The area of design and specification faults, although important, is beyond the scope of this paper.

Faulty VLSI chips could be produced during manufacture because of photolithography errors, deficiencies in process quality, improper design, etc. These can result in contamination, improper contacts, electromigration, corrosion, oxide defects, etc. [2]-[6]. Even if a chip is manufactured perfectly, it could subsequently wear out in the field due to electromigration, hot-electron injection, spreading charge loss, etc. [2], [7]-[10]. Environmental effects, such as alpha particles, cosmic radiation, etc., can also cause a circuit to produce erroneous data [11]-[14].

A fault-tolerant system has mechanisms for detecting erroneous data produced by the system, techniques for correction of the error and recovery from it, isolating the faulty part of the circuit which caused the error, and reconfiguring the system so that the faulty portion of the system is no longer used. In most cases, the error detection mechanisms are designed (for cost effectiveness) with the assumption that only a limited number of faults are present in the system. This assumption is valid if the system is periodically exercised to flush out any latent faults.

Thus fault-tolerant operation requires that:

1) manufactured chips and systems be initially tested to reject faulty units;
2) the system be tested periodically in the field to flush out latent faults;
3) error detection mechanisms be available to detect errors during operation;
4) error correction, recovery, and reconfiguration mechanisms be available to isolate the faulty units.

The testing and diagnosis process requires the availability of a sequence of input patterns which will expose any faults in the system. On the other hand, the error detection mechanism requires a knowledge of the types of errors which are produced under fault and an appropriate encoding of the outputs of modules so that errors can be detected. An externally caused change of a storage value (by a cosmic ray, for example) could result in an error at the output of a functional module, but a test which attempts to detect a fault will not discover any permanent fault in the module. Thus as increases in density are achieved by smaller geometries, the likelihood of transient errors typically increases, requiring error detection mechanisms for reliable operation.

Manuscript received October 25, 1985; revised December 15, 1985. This research was supported in part by the Semiconductor Research Corporation under Contract SRC RSCH 84-06-049 and in part by the Joint Services Electronics Program (U.S. Army, U.S. Navy, and U.S. Air Force) under Contract N00014-84-C-0149.
The authors are with the Computer Systems Group, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Fig. 2. Connector-switch-attenuator (CSA) elements to model MOS elements.

Fig. 3. CSA model of NOR gate with a broken line modeled.
Fig. 4. Circuit for TTL NAND gate.

each defect was simulated, its effect on the output of the circuit was noted. Some of the results are shown in Table 2. For example, if resistor RB1 were open, the output would be permanently logic 0 (s-a-0 in the table). It can be noted that many of the opens and shorts will result in inputs or

Table 2 Defects in TTL NAND Gate Circuit

Fault Equivalence and Dominance: Consider the three-input NAND gate shown in Fig. 5. This gate has four lines (three inputs and one output) and would, therefore, have eight stuck-at faults, each line stuck at 0 or 1. However, the faults A, B, or C stuck at 0 would result in the output D being permanently 1 and, therefore, it is impossible to distinguish an input stuck at 0 from the output stuck at 1. These faults are said to be equivalent. Now consider the fault A-stuck-at-1. In order to detect this fault, a 0 has to be applied on A, and 1s at B and C so that the effect of the fault can be propagated to D. The correct value of D will be a 1 and it will be a 0 under fault. This test for A-stuck-at-1 will, therefore, also detect the fault D-stuck-at-0. Hence, A-stuck-at-1 is said to dominate D-stuck-at-0. Using the relations of equivalence and dominance allows many faults to be combined into a single class, reducing the number of faults to be considered in a complex system. A three-input NAND gate, therefore, will have four different fault classes and the tests for these faults are shown in Table 3. In the table, the fault consisting of line d-stuck-at-0 is shown as d/0.
on the potential number of faults detected.

Fig. 6. Two faults which are functionally equivalent.

Fig. 8. n-MOS NOR gate with four faults.
present, then under the input 111, the output may not be 0 but may be an indeterminate voltage represented as I. Thus under this fault and this input combination, the output is not permanently stuck at 0 or 1. This example might seem to demonstrate a major drawback of the stuck-at fault model. However, if we consider whether tests generated for stuck-at faults will detect the transistor faults (even though they cannot be modeled as stuck-at faults), it can be seen that the stuck fault model is still viable [42].

Table 4 Tests for n-MOS NAND Gate

Inputs    Outputs
A B C     F0 F1 F2
1 1 1     0  I  1
0 1 1     1  0  1
1 0 1     1  1  1
1 1 0     1  1  1

Table 4 shows the behavior of the n-MOS NAND gate under the faults when the tests for the stuck faults in a NAND gate are applied. Outputs are shown for no fault (F0) and for the two shorts (F1 and F2). Even though the output under F1 for the first input combination is indeterminate (shown as I), it will be seen that the next input combination will produce a logical output which is different from the correct output and, thus, the fault will be detected. Similar results hold for an n-MOS NOR gate. Fig. 8 shows such a gate with four faults: faults 1 and 2 being shorts and faults 3 and 4 being opens. Table 5 gives the behavior of the gate when stuck-fault tests are applied under these four faults. Again, fault 1 produces an indeterminate output for the first input combination but will be eventually detected (by the last test) when it produces an incorrect logical output. Faults 2 and 4 can be seen to be equivalent (at least under this input set). Fault 3 produces a high-impedance output under the combination 000 and, in n-MOS technology, this is logically equivalent to the previous output (shown as Q"), at least for a short period of time until any residual charge leaks away from the output. If the tests are applied in the order shown, this fault will be detected since Q" will be a 0 and the correct output should be a 1. It is interesting to note that if the 000 combination were applied first, and the output line happened to have a logic 1 stored on it, this particular fault would not have been detected by the tests.

These examples show that, in general, many faults which cannot be modeled as stuck-at faults can still be detected by applying a stuck-at fault test set.

The Bridging Fault Model: Since many physical faults at the circuit level will result in shorts between interconnecting lines, the stuck-at fault model was extended to bridging faults. This fault model treats shorts between two lines in the logic gate network and assumes that both lines will have the wired-AND or wired-OR logic value under fault. Mei [43] has shown that any single stuck-at fault test set will be able to detect any bridging fault between two or more input leads of a gate, as well as any bridging fault which results in feedback such that the total number of inversions in the loop is odd. This is a special case of short faults, at the transistor level, treated in the previous section.

The Stuck-Open and Stuck-On Fault Models: Wadsack [26] pointed out a special case of the transistor-level open fault, which he called the stuck-open fault, where an open transistor (or a broken line) can lead to a CMOS gate behaving as if it had memory. Consider the CMOS NOR gate shown in Fig. 9. It has two p-channel transistors in series connecting V_DD to the output C, and two n-channel tran-
Application of 512000 pseudorandom patterns was found to detect 98 percent of stuck-at faults but only 85 percent of stuck-open faults. This also points out the difficulty and importance of tests for stuck-open faults.
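One reason stuck-open faults are harder to detect is that a floating output keeps its previous value, so a test needs an initializing pattern before the pattern that exposes the fault. The following is a minimal switch-level sketch of a two-input CMOS NOR gate, assuming ideal switches and an output that holds its charge indefinitely. The transistor labels `pA`, `pB`, `nA`, `nB` and the function names are hypothetical and are not taken from the paper.

```python
def cmos_nor(a, b, previous, open_transistor=None):
    """Switch-level output of a two-input CMOS NOR gate.

    open_transistor names a stuck-open device: "pA" or "pB" (series
    pull-up) or "nA" or "nB" (parallel pull-down); None is fault-free.
    """
    p_a = (a == 0) and open_transistor != "pA"
    p_b = (b == 0) and open_transistor != "pB"
    n_a = (a == 1) and open_transistor != "nA"
    n_b = (b == 1) and open_transistor != "nB"
    if p_a and p_b:      # both series p-channel devices conduct
        return 1
    if n_a or n_b:       # at least one parallel n-channel device conducts
        return 0
    return previous      # floating output: stored charge holds the old value


def run(sequence, open_transistor=None, initial=0):
    """Apply a sequence of (a, b) patterns and collect the outputs."""
    out = initial
    outputs = []
    for a, b in sequence:
        out = cmos_nor(a, b, out, open_transistor)
        outputs.append(out)
    return outputs
```

With the n-channel device on input A stuck open, the sequence (0,0), (1,0) yields outputs 1, 1 instead of the correct 1, 0, so the fault is detected. The sequence (0,1), (1,0) yields 0, 0 in both the good and the faulty gate, so the same second pattern fails to expose the fault when the output was not first set to 1.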
Fig. 10. Gates and latch used to model transistor-level faults in CMOS NOR gate.

Functional-Level Fault Models

Transistor-level models more accurately represent failure mechanisms, but involve a much greater degree of complexity, since the number of primitive elements (transistors and interconnections) which can be faulty is very large. A VLSI system may consist of many functional modules, each module being implemented by these primitive elements. In many cases, it is sufficient to know whether a module is faulty or not. In this case, a fault model at the functional module level which accurately includes the effect of faults in transistors and interconnections is very useful, since the number of modules is much smaller than the number of transistors, reducing the complexity of treatment. In doing so, however, one has to be careful about loss of informa-
module will be all of the same type, either 0s becoming 1s, or 1s becoming 0s. Such errors are described as unidirectional. An example of this behavior is in PLAs, where single device and interconnect failures have been shown to result in unidirectional errors [65], [66]. The erroneous selection or lack of selection of a product line for certain inputs results in unidirectional errors in the output. Fig. 13 illustrates eleven classes of faults including stuck-at faults (faults 1-3), shorts (faults 4-9), and crosspoint faults (faults 10, 11) which result in only unidirectional errors at the output of a PLA.

Failure to separately buffer both the uncomplemented and complemented bit lines in PLAs as in Fig. 13 has been shown to yield error models which are equivalent to unidirectional errors on inputs as well as the outputs [67]. Fig. 14 illustrates the interesting phenomenon of how a short between the gate and drain of an enhancement transistor (case 2) or a bit-line stuck-at-0 (case 1) in the AND plane of the PLA can result in what is equivalent to a 1-bit change on the input. In Fig. 14, a case 1 or case 2 fault causes an output error equivalent to an input of (MI) rather than the correct (001). Without input buffering, the unidirectional error model for outputs of PLAs must be modified to include errors on inputs and outputs. The authors have shown that a comprehensive error model for PLAs lacking input buffering can be summarized as follows for shorts and stuck-at faults:

1) The output is a unidirectional error in the correct output for the correct input.
2) The output is the correct output for an input which is a unidirectional error in the correct input.
3) The output is a unidirectional error in the correct output for an input vector which is a unidirectional error in the correct input.

Soft-Error Models: Errors due to environmentally induced transient failures or intermittent circuit instabilities are referred to as soft errors. Disturbances such as the electron-hole pairs created by the passage of an alpha particle through the active device regions of an n-MOS memory are at best difficult to detect by means of classical testing. Fault models for such phenomena are meaningless. Concurrent error detection techniques based on a realistic error model allow for the detection of such errors. The cost is a function of the hardware required for codeword checkers and redundant information bits. Based on measurements and simulation, single-bit error detection/correction has been shown to be sufficient for most soft errors in MOS memories [11], [12]. However, models must include information concerning not only the type of errors but also their frequency, if error detection and correction is to be comprehensive. For example, in memories if soft errors occur frequently and memory words are accessed infrequently, single-bit correction per word may not be sufficient.

Indeterminate Errors: Certain classes of shorts, such as
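The unidirectional error model described in this section can be made concrete with a small sketch. The first function tests whether an observed word differs from the correct word only by changes in one direction. The other two use a Berger code, which is one well-known unidirectional error-detecting code; the paper does not name a specific code, so that choice and all function names here are assumptions made for illustration.

```python
def is_unidirectional(correct, observed):
    """True if at least one bit changed and all changes go the same way."""
    if len(correct) != len(observed):
        raise ValueError("words must have the same length")
    zero_to_one = any(c == "0" and o == "1" for c, o in zip(correct, observed))
    one_to_zero = any(c == "1" and o == "0" for c, o in zip(correct, observed))
    return zero_to_one != one_to_zero


def berger_encode(data):
    """Append the binary count of 0s in `data` as check bits."""
    width = len(data).bit_length()  # enough bits to hold counts 0..len(data)
    return data + format(data.count("0"), f"0{width}b")


def berger_check(word, data_len):
    """True if the check bits equal the number of 0s in the data part."""
    data, check = word[:data_len], word[data_len:]
    return int(check, 2) == data.count("0")
```

A 1-to-0 error can only raise the count of 0s in the data bits while lowering the value of the check bits, and a 0-to-1 error does the opposite, so any unidirectional error in the encoded word leaves the two out of step and `berger_check` returns False.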
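The remark that single-bit correction per word may not be sufficient when soft errors are frequent and accesses are infrequent can be illustrated with a rough calculation. The sketch below assumes that upsets in a given word arrive as a Poisson process and that accumulated errors are corrected only when the word is accessed. Both assumptions, the function name, and the parameters are introduced here for illustration and do not come from the paper.

```python
import math


def prob_uncorrectable(upset_rate, interval):
    """Probability of two or more upsets in one word between accesses.

    Assumes Poisson arrivals at `upset_rate` events per unit time in the
    word, with correction applied only every `interval` time units.
    """
    if upset_rate < 0 or interval < 0:
        raise ValueError("rate and interval must be non-negative")
    mean = upset_rate * interval
    # -expm1(-mean) is 1 - e^(-mean), the chance of at least one upset;
    # subtracting mean * e^(-mean) removes the single-upset case.
    return -math.expm1(-mean) - mean * math.exp(-mean)
```

For a small expected number of upsets per interval the result is close to half the square of that number, so under these assumptions lengthening the time between accesses tenfold raises the chance of an uncorrectable word roughly a hundredfold.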
The unique applications of error and fault models have been noted throughout the text. The following discussion illustrates this uniqueness with one specific module. The previous descriptions of fault and error models for PLAs present an opportunity for drawing some comparisons. If failures are considered to be the spurious appearance or disappearance of an active device in the OR or AND plane of a PLA (crosspoint failures), the previous discussion has shown that from a functional fault model perspective the result will be the addition or deletion of a literal in a product term of at least one output function, or in adding or deleting a product term (implicant) from at least one output function. Test algorithms based on this fault model should, therefore, be derived such that they verify that these situations have not occurred. The growth or shrinkage of product terms, however, provides little information that is useful for detecting errors at the output of a PLA. A useful error model for PLAs based on the appearance or disappearance of an active device is one which consists of unidirectional errors on the output lines. Therefore, by simply encoding the outputs of the PLA in a unidirectional error detecting code, errors due to the modeled failures can be detected. Error detection is based on erroneous output logic values rather than functional changes within the module. Testing, on the other hand, is concerned with the functional change of a device, gate, circuit, or module and the input(s) necessary to detect this malfunction. A similar comparison can be made for all of the modules discussed in this paper.

CONCLUSIONS

VLSI systems are becoming ever more complex, with many new types of failures and failure effects, as technology reduces linewidths and increases density. There is a great need for a thorough understanding of how defects and failures affect these complex systems. The inherent contradiction in the requirements for fault models has to be solved; they are required to be very accurate but, at the same time, applicable to very large systems. The solution seems to be to describe accurately the effects of faults within higher functional modules and thus make complex systems tractable by reducing the number of primitive elements. The hierarchical approaches which are used by the design community must also be exploited in fault and error modeling and test generation. Powerful CAD tools for modeling and test generation, which can easily be used by designers and test engineers, must be developed if we are to exploit the potential of VLSI. It is clear that an arbitrary system would be very difficult to model and test. It is imperative that complex systems be designed so that they are easily testable, and design for testability will become one of the primary requirements of very complex systems. On-line, concurrent error detection can also be made cost-effective for many applications with appropriate design techniques which eliminate the likelihood of insidious errors. Reliable systems of the future will most certainly test for faults and check for errors which are based on models derived from an analysis of the effects of realistic physical failures. Based on accurate and tractable fault and error models, efficient and comprehensive approaches to reliable VLSI system design can be formulated.

REFERENCES

[1] J.-C. Laprie, "Dependable computing and fault tolerance: Concepts and terminology," in Proc. IEEE Int. Symp. on Fault-Tolerant Computing, pp. 2-11, June 1985.
[2] T. E. Mangir, "Sources of failures and yield improvement for VLSI and restructurable interconnects for RVLSI and WSI: Part I-Sources of failures and yield improvement for VLSI," Proc. IEEE, vol. 72, pp. 690-708, June 1984.
[3] H. R. Bolin, "Process defects and effects on MOSFET gate reliability," in Proc. Reliability Physics Symp., pp. 252-254, 1980.
[4] E. D. Colbourne, G. P. Coverley, and S. K. Behera, "Reliability of MOS LSI circuits," Proc. IEEE, vol. 62, pp. 244-259, Feb. 1974.
[5] J. T. Easterbrook and R. G. Bennetts, "Failure mechanisms in logic circuits and their related fault effects," in Proc. IEEE Conf. on New Developments in Automatic Testing, pp. 44-47, Nov. 1977.
[6] G. L. Schnable, L. J. Gallace, and H. L. Pujol, "Reliability of CMOS integrated circuits," IEEE Computer, vol. 11, pp. 6-17, Oct. 1978.
[7] P. B. Ghate, "Electromigration-induced failures in VLSI interconnects," in Proc. Reliability Physics Symp., pp. 292-299, Mar. 1982.
[8] P. S. Ho, "Basic problems for electromigration in VLSI applications," in Proc. Reliability Physics Symp., pp. 288-291, Mar. 1982.
[9] A. Ito, H. A. Swasey, and E. W. George, "Hot electron reliability modeling in VLSI devices," in Proc. Reliability Physics Symp., pp. 96-101, 1983.
[10] D. S. Peck, "New concerns about integrated circuit reliability," in Proc. Reliability Physics Symp., pp. 1-6, 1978.
[11] T. C. May, "Soft errors in VLSI: Present and future," IEEE Trans. Comp., Hybrids, Manuf. Technol., vol. CHMT-2, pp. 377-387, Dec. 1979.
[12] C. H. Sie, R. A. Youngblood, J. H. Liao, and A. Turk, "Soft failure modes in MOS RAMs," in Proc. Reliability Physics Symp., pp. 27-32, 1977.
[13] W. T. Anderson, Jr. and S. C. Binari, "Radiation effects in GaAs devices and ICs," in Proc. Reliability Physics Symp., pp. 316-319, 1983.
[14] J. W. Peeples and T. J. Every, "Parametric influence on system soft error rates," in Proc. Reliability Physics Symp., pp. 255-260, 1980.
[15] N. Burgess and R. I. Damper, "The inadequacy of the stuck-at fault model for testing MOS LSI circuits: A review of MOS failure mechanisms and some implications for computer-aided design and test of MOS LSI circuits," Software Microsyst., vol. 3, pp. 30-36, Apr. 1984.
[16] S. Cai, M. Mezzalama, and P. Prinetto, "A review of fault models for LSI/VLSI devices," Software Microsyst., vol. 2, pp. 44-53, Apr. 1983.
[17] G. R. Case, "Analysis of actual mechanisms in CMOS logic gates," in Proc. Design Automation Conf., pp. 265-270, 1976.
[18] C. Timoc, M. Buehler, T. Griswold, C. Pina, F. Stott, and L. Hess, "Logical models of physical failures," in Proc. IEEE Int. Test Conf., pp. 546-553, Nov. 1983.
[19] J. P. Hayes, "Fault modeling," IEEE Design and Test, vol. 2, pp. 88-95, Apr. 1985.
[20] J. W. Bandler and A. E. Salama, "Fault diagnosis of analog circuits," Proc. IEEE, vol. 73, pp. 1279-1325, Aug. 1985.
[21] J. Galiay, Y. Crouzet, and M. Vergniault, "Physical versus logical fault models in MOS LSI circuits: Impact on the testability," IEEE Trans. Comput., vol. C-29, pp. 527-531, June 1980.
[22] B. Courtois, "Failure mechanism, fault hypothesis, and analytical testing of LSI-NMOS (HMOS) circuits," in Proc. VLSI 81, pp. 341-350, Aug. 1981.
[23] S. K. Jain and V. D. Agrawal, "Modeling and test generation algorithms for MOS circuits," IEEE Trans. Comput., vol. C-34, pp. 426-433, May 1985.
[24] K. W. Chiang and Z. G. Vranesic, "On fault detection in CMOS logic networks," in Proc. 20th Design Automation Conf., pp. 50-56, June 1983.
[25] M. Acken, "Testing for bridging faults in CMOS circuits," in Proc. 20th Design Automation Conf., pp. 717-718, June 1983.
[26] R. L. Wadsack, "Fault modeling and logic simulation of CMOS