
Fault and Error Models for VLSI

JACOB A. ABRAHAM, FELLOW, IEEE, AND W. KENT FUCHS, MEMBER, IEEE

Invited Paper

This paper describes a variety of fault and error models which are used as the basis for designing fault-tolerant Very Large Scale Integrated (VLSI) systems. The fault models describe physical defects and failures and the input patterns which will expose them, and are suitable for testing, while error models describe the effects on the functional outputs of defects and are useful for on-line error detection. The models are described at various levels of abstraction. The differences between fault and error models for identical functional modules are also illustrated.

INTRODUCTION

Advances in Very Large Scale Integration (VLSI) technology result in complex chips which, in turn, lead to lower cost for extremely complex systems. As such systems are increasingly used in critical applications, there is a need to ensure that the computations produced by these systems are dependable. The discipline of fault-tolerant computing deals with the design of systems which are tolerant to failures and, hence, result in a higher level of dependability.

Laprie [1] has proposed a service-oriented definition of system failure where a system failure is defined to occur when the delivered service deviates from the specified service. The failure occurs because the system was erroneous; an error is that part of the system state which is liable to failure, i.e., to the delivery of a service not complying with the specified service. The cause of an error is said to be a fault. Thus impairments to dependable operation are the faults in the system, which could be either due to incorrect design or specification of the system, or faults during the manufacturing process, or a fault-free system could subsequently become faulty due to physical or environmental causes. This paper will discuss faults and errors due to improper manufacturing or subsequent wearout of systems in the field. The area of design and specification faults, although important, is beyond the scope of this paper.

Faulty VLSI chips could be produced during manufacture because of photolithography errors, deficiencies in process quality, improper design, etc. These can result in contamination, improper contacts, electromigration, corrosion, oxide defects, etc. [2]-[6]. Even if a chip is manufactured perfectly, it could subsequently wear out in the field due to electromigration, hot-electron injection, spreading charge loss, etc. [2], [7]-[10]. Environmental effects, such as alpha particles, cosmic radiation, etc., can also cause a circuit to produce erroneous data [11]-[14].

A fault-tolerant system has mechanisms for detecting erroneous data produced by the system, techniques for correction of the error and recovery from it, isolating the faulty part of the circuit which caused the error, and reconfiguring the system so that the faulty portion of the system is no longer used. In most cases, the error detection mechanisms are designed (for cost effectiveness) with the assumption that only a limited number of faults are present in the system. This assumption is valid if the system is periodically exercised to flush out any latent faults. Thus fault-tolerant operation requires that:

1) manufactured chips and systems be initially tested to reject faulty units;
2) the system be tested periodically in the field to flush out latent faults;
3) error detection mechanisms be available to detect errors during operation;
4) error correction, recovery, and reconfiguration mechanisms be available to isolate the faulty units.

The testing and diagnosis process requires the availability of a sequence of input patterns which will expose any faults in the system. On the other hand, the error detection mechanism requires a knowledge of the types of errors which are produced under fault and an appropriate encoding of the outputs of modules so that errors can be detected. An externally caused change of a storage value (by a cosmic ray, for example) could result in an error at the output of a functional module, but a test which attempts to detect a fault will not discover any permanent fault in the module. Thus as increases in density are achieved by smaller geometries, the likelihood of transient errors typically increases, requiring error detection mechanisms for reliable operation.

Manuscript received October 25, 1985; revised December 15, 1985. This research was supported in part by the Semiconductor Research Corporation under Contract SRC RSCH 84-06-049 and in part by the Joint Services Electronics Program (U.S. Army, U.S. Navy, and U.S. Air Force) under Contract N00014-84-C-0149.
The authors are with the Computer Systems Group, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

0018-9219/86/0500-0639$01.00 © 1986 IEEE

PROCEEDINGS OF THE IEEE, VOL. 74, NO. 5, MAY 1986


Fault and Error Models

As systems increase in complexity, it is useful to be able to describe faults and errors at various levels of abstraction in the system. A fault which is described at a very low level, for example, at the level of transistors, may very accurately describe the physical phenomena causing the fault but, because of the extremely large number of transistors in a VLSI chip, the model may be intractable for the purpose of deriving tests for the fault. Similarly, an error model, which is described for a very small functional block, may require duplication of the block in order to detect the errors, while an error model described for a more complex block might only require the addition of a small amount of hardware in order to detect the errors. The two requirements for fault and error models, in some sense, are contradictory. They are: accuracy (that is, realistic faults and errors should be modeled); and tractability (that is, very complex systems should be handled). Recent research, therefore, deals with deriving realistic models at higher levels which can accurately capture the faults and errors at lower levels.

As an example, consider a contact between two conducting lines in a VLSI circuit. If the contact is faulty, there is a break between two lines and the fault can be described at this level. It may also turn out that the break is equivalent to the input of a gate being permanently set to logic 0 or, in turn, the input of a complex functional module being permanently 0. It would, of course, be simpler for purposes of analysis to consider the fault at the highest possible level of abstraction.

Now suppose that the contact is not permanently open but is periodically open, depending on the temperature, vibration, or other external causes. This can be described as an intermittent fault. However, testing for the fault may never expose it since the fault may not be active during the test. If, however, the circuit is designed so that the error caused by the line being open can be detected during normal computation, the goal of dependable operation can still be achieved. Again, the error can be described at the output of a very small module or at the output of a much larger module. If errors at the output of a large module can be detected with very low overhead, this is clearly the desirable situation.

A physical failure can also lead to the output of a module being at a nonlogical value (for example, midway between logic 0 and 1, called indeterminate). Such faults are difficult to describe and detect, but the errors due to the faults may be detected by error detection techniques.

The remainder of the paper describes a variety of fault and error models which have been proposed in recent literature. The applicability of the models is discussed with comparisons and contrasts made for specific VLSI modules. Fault and error models are presented for transistor, gate, and functional levels of abstraction.

FAULT MODELS

Fault models are a description of the effect of a defect or failure (which can cause an error at the output) on a circuit. As discussed earlier, fault models are driven by the requirement to derive high-quality tests for complex circuits. Thus a useful fault model will naturally lead to a test generation procedure for the fault.

Transistor-Level Fault Models

Defects and failures in present-day integrated circuits can be abstracted to shorts and opens in the interconnects and degradation of devices [2]. Fault models at the transistor level, therefore, can characterize physical failures quite accurately [15]-[19]. Unfortunately, as the complexity of VLSI increases, the number of potential faults at the device and interconnect level increases drastically. Nevertheless, it is necessary to study the effects of failures at the transistor level and develop accurate fault models at this level. The better understanding of the effects of failures, thus obtained, can be used to develop accurate fault models at higher levels which can be applied in complex systems. This approach is analogous to that used in the hierarchical design of VLSI systems where complex circuits are built up from smaller cells.

This section will describe various transistor-level fault models. MOS technologies are emphasized since much of the published research has concentrated on these, and also since they are one of the current dominant technologies for VLSI systems.

Fault models proposed at the transistor level incorporate one or more of the following classes of faults:

1) shorts and opens of transistors or interconnections,
2) delay effects of failures,
3) coupling or crosstalk between nodes of a circuit,
4) degradations of elements.

Shorts and opens are included in most fault models, while the more accurate (and more complex) models include delays. Faults where activity on one node affects the logic values on another node in the circuit are primarily applied to memories. Fault models which incorporate degradations of elements (for example, transistor parameter changes, or changes in the value of a resistor) are usually used in analog circuits [20]. Such fault models are analog in nature, and are not within the scope of this paper.

Transistor-Level Shorts and Opens: Transistor-level shorts and opens model many of the physical failures and defects in integrated circuits. A study performed by Galiay [21] on 4-bit microprocessor chips revealed that a great majority of faults were shorts and opens at the transistor level. A general fault model for shorts and opens was proposed by Courtois [22]. The model was derived from electrical considerations and divided the faults into three classes, requiring progressively greater difficulty of test. The model is given in Table 1. There has recently been a significant amount of work attempting to derive tests for shorts, opens, stuck line, and transistor faults in MOS circuits [23]-[25].

Table 1 Transistor-Level Fault Model for n-MOS Technology

Class 0: Single physical defect: faulty contact or precontact; transistor stuck on or open; metal line open.
Class 1: Faults in Class 0 plus shorts between adjacent metal or diffusion lines.
Class 2: Faults in Class 1 plus shorts between any metal or diffusion; multiple defects.

MOS networks have the property that lines left in a high impedance state (floating) will retain their previous logic



value for a relatively long time, until their charge leaks away. This is due to the inherent capacitance of the lines. This property can be used to design very efficient storage elements, as long as the clock period is much smaller than the time necessary to drain away the charge. Unfortunately, because of this property, faults in MOS circuits can lead to situations where stuck-at tests cannot detect some of the transistor-level shorts or opens [26]. The stuck-open fault [26] caused by a broken line controlling a gate in a CMOS circuit is an example.

Another class of faults occurs because of shorts. Chiang and Vranesic [24] noted that in the presence of a stuck-on transistor (or a short which causes the circuit to look as if the transistor is permanently on), both the p and n networks in a CMOS circuit may conduct, and the fault may or may not be detectable depending upon the resistance of the stuck-on device. Baschiera and Courtois [27] have outlined some of the problems of testing for stuck-on faults. They have shown that, depending upon the resistances of the transistors, some potential vectors may not detect stuck-on faults while others will, and that care must be taken in deriving the test patterns. They also conclude that if a test vector tests for stuck-on faults, it also tests for the corresponding stuck-open faults, while the converse is not true.

Bryant [28] has developed a concurrent fault simulation system which can model MOS faults. These are represented as though extra fault transistors were added to the network; the gate nodes of the fault transistors are considered to be extra fault inputs to the network that control the presence or absence of the faults. To model the behavior of ratioed circuits, each transistor is given a discrete strength. Fig. 1 shows the class of MOS failures using the additional fault transistors. For example, a transistor stuck closed, shown in Fig. 1(d), is represented by adding a parallel transistor of the same strength γ1, controlled by line f. When the fault is active, f = 1, the transistor will always be closed logically. A short between two nodes, m and n, shown in Fig. 1(e), is represented by the fault transistor connecting m and n and a strength γp+1 (this would be a strength greater than that of any normal transistor), so that setting the transistor state to 1 shorts the source and drain nodes together so that they act as a single node.

Fig. 1. Modeling MOS failures using fault transistors. (a) Node n stuck-at-zero. (b) Node n stuck-at-one. (c) Transistor t stuck-open. (d) Transistor t stuck-closed. (e) Short n and m. (f) Open n into n1 and n2.

Hayes [29], [19] models faults in MOS circuits by viewing them as a network of switches, attenuators, and amplifiers linked by connectors. Fig. 2 shows the various MOS circuit elements and the corresponding connector-switch-attenuator (CSA) elements. Fig. 3 shows the CSA model of a NOR gate with the fault where the transistor S3 is not connected to ground. The wells, Wp and Wn, are added to model charge storage. Unlike a capacitor, however, the input/output signals of the wells are restricted to a finite set of logic values, although the storage capacity or size of the well corresponds to that of the underlying capacitance.

Fig. 2. Connector-switch-attenuator (CSA) elements to model MOS elements. (Interconnecting conductor: logic connector; nMOS transistor: positive switch; pMOS transistor: negative switch; load transistor: attenuator.)

Fig. 3. CSA model of NOR gate with a broken line modeled.
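The fault-transistor representation attributed to Bryant above can be illustrated with a deliberately small sketch. It is not the simulator of [28]: it assumes a single output node, treats every transistor as a switch to one supply rail, and uses made-up integer strengths rather than the strength values of that model. The names Switch, output_value, and inverter are hypothetical. The example is a ratioed n-MOS inverter in which a "stuck-closed" fault on the driver is modeled by a parallel fault transistor of the same strength, gated by a fault input f.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Switch:
    """One transistor connecting the output node to a supply rail (assumed model)."""
    gate: str      # name of the signal on the gate terminal
    strength: int  # illustrative discrete strength; the larger value wins
    rail: int      # 1 = connects the output to VDD, 0 = connects it to ground

def output_value(switches, signals):
    """Resolve the output as the rail reached through the strongest closed switch.

    Returns 0 or 1, or None when no switch is closed or the strongest
    closed switches disagree (both cases are outside this sketch).
    """
    closed = [s for s in switches if signals[s.gate] == 1]
    if not closed:
        return None
    top = max(s.strength for s in closed)
    rails = {s.rail for s in closed if s.strength == top}
    return rails.pop() if len(rails) == 1 else None

# Ratioed n-MOS inverter: a weak, always-on load against a stronger driver.
LOAD = Switch(gate="on", strength=1, rail=1)
DRIVER = Switch(gate="a", strength=2, rail=0)
# Fault transistor in parallel with the driver, same strength, gated by f.
STUCK_CLOSED = Switch(gate="f", strength=2, rail=0)

def inverter(a, f=0):
    """Inverter output for input a; f = 1 activates the stuck-closed fault."""
    signals = {"on": 1, "a": a, "f": f}
    return output_value([LOAD, DRIVER, STUCK_CLOSED], signals)
```

With f = 0 the fault transistor is open and the circuit behaves normally; with f = 1 the output is pulled low for either input, which is the logical behavior of a driver that is always closed. Treating the fault as an extra input is what lets one network description serve both the fault-free and the faulty circuit.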

ABRAHAM AND FUCHS: FAULT AND ERROR MODELS FOR VLSI


The fault where the line to ground is broken is modeled by breaking the line CO and representing it as stuck-at the output value C. When A = 1, B = 0 is applied to this model, the value C is propagated through the switch S3 to the output C, thus holding the previous value of the output, and modeling the stuck-open fault. The wells can also be used to model delays and faults producing changes in delays in the circuit, which is discussed in the next section.

The Bryant and Hayes models discussed above suffer from two limitations. First, both models assume that the gate terminal of a transistor is decoupled from the drain and the source, that is, the states of the drain and source can never affect the state of the gate. This simplifies the model of a transistor, but is not general enough to model the effect of physical failures such as a short between the gate and source of a transistor. Secondly, in MOS logic, it is known that in addition to the ground and VDD levels (representing logic levels 0 and 1), intermediate voltage levels do exist. Banerjee [30] used a multivalued algebra as the basis of a model to simulate the effects of physical faults in MOS circuits. The model allows for five logic levels and five logic conditions which are related to the strengths of the logic levels. An MOS circuit is viewed as a set of nodes connected by three-terminal devices representing transistors. The state of a node in a circuit is described by (a, b), where "a" refers to the condition of the node and "b" refers to the logic level of the node.

The five logic levels used for describing voltage ranges are:

0: hard 0
0*: soft 0
I: indeterminate, near the logic threshold
1*: soft 1
1: hard 1.

Five basic node conditions are also defined:

I: input node
C: charge-storage node
S: strong driving node
W: weak driving node
F: faulty node.

Using results from circuit simulation, an ordering was found among various possible circuit states of nodes. (In fact, it was found that out of the 25 possible combinations, only 19 node states are possible for MOS circuits.) This model was also used as the basis of a fault simulator.

Delay Effects of Failures: The transistor-level fault models discussed above treat the transistor as a logical switch. This will, at best, allow only a crude approximation of delays in the circuit, in particular, the delay effects of failures. Since there have been reports of failures which only result in timing errors [31]-[33], fault models are needed which can accurately predict erroneous timing behavior.

In the CSA model developed by Hayes [29], the concept of time is introduced by making the rate at which a well changes state depend on the strength levels of the logic signals applied to its terminals. Unfortunately, this model does not provide any indication as to how the delay values may be determined from circuit parameters.

A circuit-level simulator, such as SPICE, can be used to obtain accurate delay values under faults. This, unfortunately, is computationally very expensive. A compromise is to derive an approximation to circuit delays in less computer time. An MOS fault simulator which can produce the output waveforms under faults has been developed by Shih [34]. This simulator uses a table lookup of the transistor I-V characteristics to simulate both the fault-free circuit and many faulty circuits in one pass. The simulator can either plot voltage waveforms or extract both the logic and delay values for both the fault-free circuit and the faulty circuits from the waveforms. Simulation of tests generated under the assumption of no circuit delays showed that they would not detect some of the faults, demonstrating the need for using delay information in order to derive accurate tests.

Coupling Between Nodes of a Circuit: As integrated circuit densities increase, the interconnection lines carrying logic signals become physically closer together, increasing the likelihood that logic signals and signal changes on one line can affect the logic values on other lines. A more accurate fault model would, therefore, include such coupling effects under failure. Memory fault models have long incorporated this effect [35], [36]. The special case of shorts between two lines is included in even the simpler transistor-level fault models but the coupling fault is different from a short fault in that it includes cases where a change in the logic value on a line (perhaps faster than some given rate) is necessary to affect an adjacent line.

Gate-Level Fault Models

Early fault models were developed at the logic gate level. The popularity of this approach can be attributed to several reasons. Such models are simple to design and use. Many faults in discrete technologies can be represented by faults at the logic gate level. Use of such fault models allows many of the powerful results in mathematics relating to Boolean algebra to be applied to deriving tests for complex systems. Finally, a fault model at the logic gate level can be used to represent faults in many different technologies if, in fact, defects and faults in these technologies can be mapped to gate faults.

The Stuck-At Fault Model: One of the earliest and still widely used fault models is the stuck-at model [37]. In this model, it is assumed that physical defects and faults will result in the lines at the logic gate level of the circuit being permanently (stuck-at) logic 0 or 1. This model has been the source of a great deal of research. It is still very popular since it has been shown that many defects at the transistor and circuit levels can, in fact, be modeled by the stuck-at fault model at the logic level. In practice, only single stuck faults are considered in a circuit.

A subset of the stuck fault model is the pin fault model, where only input/output pins of a module are assumed to be stuck at 0 or 1 under failure [38]. This has sometimes been used when testing a printed circuit board with many VLSI devices. Unfortunately, this fault model does not even include a high percentage of gate-level stuck faults within the module in most cases and is, therefore, inappropriate for VLSI.

Transistor Defects and Stuck-At Faults: An important question in the usefulness of a simple model such as the stuck-at fault model is whether physical faults at the transistor or circuit levels can be modeled by permanent (stuck-at) 0s or 1s at the inputs and outputs of logic gates. The



results of one study [31] which attempted to answer this question are quite interesting. Fig. 4 shows the circuit for a TTL NAND gate. In the study, hard opens and shorts at the circuit level were simulated using a circuit simulator. As each defect was simulated, its effect on the output of the circuit was noted. Some of the results are shown in Table 2. For example, if resistor RB1 were open, the output would be permanently logic 0 (s-a-0 in the table). It can be noted that many of the opens and shorts will result in inputs or the output being stuck at logic 0 or 1. Some of the defects would not be detected by a logic-level test and, hence, are labeled as undetectable in the table. However, they might be detected using other information. For example, the circuit simulation showed that RB2 open would not cause a logic-level failure in the circuit, but it would increase the circuit delay from 7 to 4 ns; this would be detectable only by what the authors called an AC test, which also checked for the timing in the circuit. Another study has shown that many single transistor-level faults in bipolar and MOS technologies cannot be modeled as single stuck-at faults; however, they can be accurately modeled as multiple stuck-at faults [18]. Conversely, there is some evidence that many gate-level stuck-at faults will realistically rarely occur. One study has correlated mask-level point defects to transistor-level and gate-level fault models through simulation [39]. Initial results for some example circuits indicate that many stuck-at faults are much more likely to occur than others. The implication for test generation is that fault coverage is dependent on the likelihood of fault occurrence as well as on the potential number of faults detected.

Fig. 4. Circuit for TTL NAND gate.

Table 2 Defects in TTL NAND Gate Circuit

Physical Defect                                    Effect on Gate
RB1 open                                           output s-a-0
RB2, RL, SBD1, SBD2 open, T2, T3 base open         undetectable
T1 emitter open, base open                         input A s-a-1
T2 emitter open                                    input B s-a-1
T3 emitter open                                    input C s-a-1
T1, T2, T3 collector open                          output s-a-0
T4 collector open, base-emitter short              output s-a-1
T1, T2, T3 collector-emitter short                 undetectable
T4 collector-emitter short                         output s-a-0
T1 base-emitter short                              input A s-a-1
T2 base-emitter short                              input B s-a-1
T3 base-emitter short                              input C s-a-1
T1, T2, T3, T4 collector-base short                output s-a-0

Fault Equivalence and Dominance: Consider the three-input NAND gate shown in Fig. 5. This gate has four lines (three inputs and one output) and would, therefore, have eight stuck-at faults, each line stuck at 0 or 1. However, the faults A, B, or C stuck at 0 would result in the output D being permanently 1 and, therefore, it is impossible to distinguish between an input stuck at 0 from the output stuck at 1. These faults are said to be equivalent. Now consider the fault A-stuck-at-1. In order to detect this fault, a 0 has to be applied on A, and 1s at B and C so that the effect of the fault can be propagated to D. The correct value of D will be a 1 and it will be a 0 under fault. This test for A-stuck-at-1 will, therefore, also detect the fault D stuck-at-0. Hence, A-stuck-at-1 is said to dominate D stuck-at-0. Using the relations of equivalence and dominance allows many faults to be combined into a single class, reducing the number of faults to be considered in a complex system. A three-input NAND gate, therefore, will have four different fault classes and the tests for these faults are shown in Table 3. In the table, the fault consisting of line d-stuck-at-0 is shown as d/0.

Fig. 5. Three-input NAND gate.

Table 3 Tests for 3-Input NAND Gate

The notions of equivalence and dominance can be applied to more complex circuits. Thus two faults which are in different parts of a larger circuit could possibly be equivalent. For example, Fig. 6 shows a simple circuit with four inputs and one output. Stuck-at-1 faults on the two lines marked a and b are equivalent, that is, the function under either of the faults is the same. However, equivalences such as these are more difficult to detect and, in practice, only equivalences and dominances around a gate are normally considered. More information on the concepts of fault equivalence and dominance, as well as the idea of reducing the number of fault classes by fault collapsing, are found in [40], [41].

Fig. 6. Two faults which are functionally equivalent.
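The collapsing argument for the three-input NAND gate can be checked by brute force. The following is a minimal sketch, not a procedure taken from the cited work: it enumerates the eight single stuck-at faults, groups faults detected by exactly the same input vectors (equivalence), and then drops any class that is automatically covered because every test for some other class also detects it (dominance, in the sense used in the text). The function names nand3, detecting_tests, equivalence_classes, and collapse are illustrative assumptions.

```python
from itertools import product

LINES = ("A", "B", "C", "D")  # three inputs and output D of the NAND gate

def nand3(a, b, c, fault=None):
    """Output D of a 3-input NAND; fault is None or (line, stuck_value)."""
    value = {"A": a, "B": b, "C": c}
    if fault is not None and fault[0] in value:
        value[fault[0]] = fault[1]           # input line stuck
    d = 0 if (value["A"] and value["B"] and value["C"]) else 1
    if fault is not None and fault[0] == "D":
        d = fault[1]                          # output line stuck
    return d

def detecting_tests(fault):
    """Input vectors on which the faulty output differs from the good one."""
    return frozenset(v for v in product((0, 1), repeat=3)
                     if nand3(*v) != nand3(*v, fault=fault))

FAULTS = [(line, v) for line in LINES for v in (0, 1)]

def equivalence_classes():
    """Group faults that are detected by exactly the same vectors."""
    classes = {}
    for fault in FAULTS:
        classes.setdefault(detecting_tests(fault), []).append(fault)
    return classes

def collapse():
    """Drop a class when every test of some other class also detects it."""
    classes = equivalence_classes()
    return {tests: faults for tests, faults in classes.items()
            if not any(other < tests for other in classes)}
```

Under these assumptions the eight faults fall into five equivalence classes, and removing D stuck-at-0 (covered by the test for any input stuck-at-1) leaves the four classes mentioned in the text, one test vector each: 111, 011, 101, and 110.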



Detection of Transistor Faults by Stuck-At Tests: When we consider faults in MOS technologies, as has already been stated, it can be seen that some of these faults cannot be modeled by the stuck-at fault model. For example, Fig. 7 shows an n-MOS NAND gate with two faults which are shorts (shown as dotted lines). If the short marked 1 is present, then under the input 111, the output may not be 0 but may be an indeterminate voltage represented as I. Thus under this fault and this input combination, the output is not permanently stuck at 0 or 1. This example might seem to demonstrate a major drawback of the stuck-at fault model. However, if we consider whether tests generated for stuck-at faults will detect the transistor faults (even though they cannot be modeled as stuck-at faults), it can be seen that the stuck fault model is still viable [42].

Fig. 7. n-MOS NAND gate with two faults.

Table 4 shows the behavior of the n-MOS NAND gate under the faults when the tests for the stuck faults in a NAND gate are applied. Outputs are shown for no fault (F0) and for the two shorts (F1 and F2). Even though the output under F1 for the first input combination is indeterminate (shown as I), it will be seen that the next input combination will produce a logical output which is different from the correct output and, thus, the fault will be detected. Similar results hold for an n-MOS NOR gate. Fig. 8 shows such a gate with four faults: faults 1 and 2 being shorts and faults 3 and 4 being opens. Table 5 gives the behavior of the gate when stuck-fault tests are applied under these four faults. Again, fault 1 produces an indeterminate output for the first input combination but will be eventually detected (by the last test) when it produces an incorrect logical output. Faults 2 and 4 can be seen to be equivalent (at least under this input set). Fault 3 produces a high-impedance output under the combination 000 and, in n-MOS technology, this is logically equivalent to the previous output (shown as Q^n), at least for a short period of time until any residual charge leaks away from the output. If the tests are applied in the order shown, this fault will be detected since Q^n will be a 0 and the correct output should be a 1. It is interesting to note that if the 000 combination were applied first, and the output line happened to have a logic 1 stored on it, this particular fault would not have been detected by the tests.

Table 4 Tests for n-MOS NAND Gate

Inputs        Outputs
A  B  C       F0  F1  F2
1  1  1       0   I   1
0  1  1       1   0   1
1  0  1       1   1   1
1  1  0       1   1   1

Fig. 8. n-MOS NOR gate with four faults.

Table 5 Tests for n-MOS NOR Gate

These examples show that, in general, many faults which cannot be modeled as stuck-at faults, can still be detected by applying a stuck-at fault test set.

The Bridging Fault Model: Since many physical faults at the circuit level will result in shorts between interconnecting lines, the stuck-at fault model was extended to bridging faults. This fault model treats shorts between two lines in the logic gate network and assumes that both lines will have the wired-AND or wired-OR logic value under fault. Mei [43] has shown that any single stuck-at fault test set will be able to detect any bridging fault between two or more input leads of a gate, as well as any bridging fault which results in feedback such that the total number of inversions in the loop is odd. This is a special case of short faults, at the transistor level, treated in the previous section.

The Stuck-Open and Stuck-On Fault Models: Wadsack [26] pointed out a special case of the transistor-level open fault, which he called the stuck-open fault, where an open transistor (or a broken line) can lead to a CMOS gate behaving as if it had memory. Consider the CMOS NOR gate shown in Fig. 9. It has two p-channel transistors in series connecting VDD to the output C, and two n-channel transistors in parallel from C to the ground. Now suppose there is a break in the line shown at 1 in the figure. In order to detect this break, one would have to attempt to control the transistor connecting A through the line which is potentially broken, while at the same time deactivating the other transistor using B. This can be done by setting A = 1 and B = 0. Under this combination of inputs, C will be equal to 0. Unfortunately, if the break has occurred, and if C happened to have 0 as the previous output, the conditions, A = 1, B = 0, will result in C floating but this would still be recognized as 0 by subsequent logic levels, and fault 1 will not be detected. In order to detect this fault, therefore, we would have to set C equal to logic 1 which can be done by setting A = 0, B = 0, and then applying the test, A = 1, B = 0. Thus a sequence of inputs is necessary in order to detect the stuck-open fault.

Fig. 9. CMOS NOR gate.

Since a stuck-open fault cannot be modeled by the stuck-at model, Wadsack attempted to derive a logic gate and latch network in which stuck-at faults could be used to model stuck-open faults in the transistor-level CMOS NOR gate. This is shown in Fig. 10. The clocked D-type latch CL simulates the charge-storage property of the output line. Under fault-free conditions, the CK input is logic 1. Faults in the transistor-level CMOS NOR gate are modeled by stuck faults injected in this network; for example, in order to model the stuck-open fault shown in the previous figure, a stuck-at-1 fault is injected at the output of gate G1. When the combination A = 1, B = 0 occurs, the input to the clock is 0, thus forcing the output C to retain the previous value. Even more complex gate-level networks have been proposed to model a larger set of faults in the CMOS gate [44]. Transistor-level faults, including bridging, have also been modeled by gate-level circuitry for an interesting class of general array logic structures [45]. The problem with such approaches which attempt to model transistor-level faults with gate-level circuits is that the circuitry can become very complicated and there is a necessity to include storage elements; in addition, it can be seen that many of the stuck faults at the gate-level model do not correspond to any physical fault in the transistor representation.

There has been other work in the area of detection of stuck-open faults. Test generation algorithms which use the stuck-at test set as the basis to detect stuck-open faults were described in [24], [46], [47]. It has recently been shown that a variation of CMOS, Domino CMOS, is easier to test, because most stuck-open faults are automatically tested during the precharge and evaluation phases of this particular technology [48].

These methods of generating test sets for a CMOS logic circuit are based on static behavior, in other words, they assume a zero delay through all gates and interconnections. Under dynamic behavior, when one allows variable delays, the test sets derived by these methods can be shown to be invalidated [33]. In this study, Reddy et al. gave examples which showed that: 1) there exist certain functions for which an irredundant CMOS realization using conventional design techniques does not have any valid test set under dynamic behavior; and 2) if an undetectable stuck-open fault is present, it may invalidate other tests, or the circuit may malfunction for some other pair of input changes. The proposed testable CMOS circuits use extra controllable inputs and additional logic. A novel hybrid CMOS realization was proposed by Jha [49]. It was shown that a valid test set was guaranteed to exist for this realization even under
dynamic behavior for any stuck-open fault. A simulation of
a CMOS microprocessor was also performed by Timoc [50].

"WC
B
Application of 512000 pseudorandompatterns was found
t o detect 98 percent of stuck-at faults but only 85 percent
of stuck-open faults. This also points out the difficulty and
importance of tests for stuck-open faults.
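
The two-pattern test described earlier for the break at 1 in the CMOS NOR gate of Fig. 9 can be illustrated with a small switch-level sketch. This is only an illustration under simplifying assumptions: the function names `cmos_nor` and `run`, and the idea of modeling the floating output as "retain the previous value", are ours and are not taken from the cited work.

```python
def cmos_nor(a, b, prev_c, a_pulldown_open=False):
    """Switch-level sketch of a 2-input CMOS NOR gate (hypothetical model).

    prev_c is the charge currently held on output C. When neither the
    pull-up nor the pull-down network conducts, C floats and keeps prev_c.
    """
    pull_up = (a == 0 and b == 0)                 # series p-transistors conduct
    a_conducts = (a == 1 and not a_pulldown_open) # n-transistor driven by A
    b_conducts = (b == 1)                         # n-transistor driven by B
    if pull_up:
        return 1
    if a_conducts or b_conducts:
        return 0
    return prev_c                                 # floating output retains charge


def run(sequence, initial_c, faulty):
    """Apply a sequence of (A, B) vectors and return the observed C values."""
    c = initial_c
    observed = []
    for a, b in sequence:
        c = cmos_nor(a, b, c, a_pulldown_open=faulty)
        observed.append(c)
    return observed


if __name__ == "__main__":
    # Single vector A=1, B=0 with C previously 0: good and faulty gates agree.
    print(run([(1, 0)], 0, faulty=False), run([(1, 0)], 0, faulty=True))
    # Initialize C to 1 with A=0, B=0, then apply A=1, B=0: the outputs differ.
    print(run([(0, 0), (1, 0)], 0, faulty=False),
          run([(0, 0), (1, 0)], 0, faulty=True))
```

In this sketch the single vector leaves the fault unobserved, while the initializing vector followed by the test vector exposes it, which is the reason a stuck-open fault needs a sequence of inputs rather than a single pattern.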

Fig. 10. Gates and latch used to model transistor-level faults in CMOS NOR gate.

Functional-Level Fault Models

Transistor-level models more accurately represent failure mechanisms, but involve a much greater degree of complexity, since the number of primitive elements (transistors and interconnections) which can be faulty is very large. A VLSI system may consist of many functional modules, each module being implemented by these primitive elements. In many cases, it is sufficient to know whether a module is faulty or not. In this case, a fault model at the functional module level which accurately includes the effect of faults in transistors and interconnections is very useful, since the number of modules is much smaller than the number of transistors, reducing the complexity of treatment. In doing so, however, one has to be careful about loss of informa-

ABRAHAM AND FUCHS: FAULT AND ERROR MODELS FOR VLSI 645


tion which results in some failures at the transistor level, for example, not being included in the functional model. Another major problem is that different implementations of the same function would have different transistor-level faults, which would, in turn, result in different functional-level behavior. Some attempts at describing the effects of failures at the functional level for a variety of functional modules are briefly outlined in this section.

General Fault Models for Functional Blocks: Given a combinational function with N inputs, a very general fault model is to assume that this function can be transformed into any other combinational function of N inputs and, therefore, testing it requires application of all 2^N input combinations. Such exhaustive testing is impossible for functions with a large number of inputs. However, if a function with a large number of inputs is implemented using a regular interconnection of subfunctions, each with a relatively small number of inputs, the exhaustive fault model can be used for the subfunctions very effectively. For example, if an adder is implemented as a cascade of full adder cells (a ripple-carry adder), each adder cell has three inputs and the entire adder array can be tested with eight input combinations which guarantee that each full adder module is exhaustively tested, assuming at most one module is faulty. Techniques for testing arrays of relatively simple cells have been developed by Kautz [51], Friedman [52], and Cheng [53]. McCluskey [54] has described methods of partitioning circuits so they can be tested under the exhaustive fault model.

The analogous fault model for sequential circuits assumes that, under fault, the circuit can be transformed into any other sequential circuit. A usual restriction is that the number of states will not increase under failure. Checking experiments [55] for such general faults are quite lengthy and have not been applied widely to VLSI, except in special cases [56], [57].

Very complex VLSI systems are most readily implemented in a regular, cellular fashion in many instances. In these cases, an exhaustive fault model would be very tractable, given the regular structure of implementation. Unfortunately, most exhaustive fault models assume that a combinational circuit does not become sequential under fault; this is clearly not true for the stuck-open faults described earlier. Very little work has been done in models which, in fact, do not make this limiting assumption.

Models for Small Functional Modules: A detailed study of an n-MOS decoder with shorts and opens was performed by Banerjee [32]. A decoder can be described functionally as having N inputs and 2^N outputs and, under normal operation, exactly one output line is activated corresponding to the input address. It was found that all single transistor-level faults could be described by the following functional faults:

1) Instead of the correct line, an incorrect line is activated.
2) In addition to the correct line, an incorrect line is activated.
3) No line is activated.

It can be seen that such a description is very simple and was shown to incorporate all of the physical shorts and opens possible in the transistor-level description. If a decoder is used as a functional module in a large system, this functional fault description can be used to derive tests for the decoder or to find the effects of faults in the decoder on the overall system.

Another study [58] involved a multiplexer functional block. A multiplexer has N inputs and log2 N control signals, and one output, where the output is selected to be one of the inputs by the address on the control lines. Under a fault, it was found that the behavior of the multiplexer module could be described in the following functional manner:

1) A 0 and a 1 cannot be selected on every input line.
2) When selecting some input, another input will be selected instead of, or in addition to, the correct input (producing the AND or OR of these two values).
3) Selection of a value on an input, followed by its complement on the same input, will result in error at the output.
4) Selection of a value on an input followed by the selection of the opposite value on a line whose address differs in exactly 1 bit from the first address will result in error.

It was also pointed out that 3) and 4) were the result of stuck-open faults, thus showing that stuck-open as well as stuck-at or transistor-level faults could easily be described for this functional module at a higher level.

Memory Fault Models: There has been a great deal of effort in deriving test patterns for memories since these constitute a major portion of large systems. A random-access memory (RAM) chip consists of an array of memory cells, an address decoder, address and data registers, and read/write logic. Faults in a memory chip may cause it to have changes in some electrical parameters or in timing. Faults will also cause the memory to be functionally incorrect. A widely used set of functional faults for memories is as follows:

1) One or more cells in the system are stuck at logic 0 or 1.
2) Two or more cells are coupled. A pair of memory cells, i and j, are said to be coupled for transition from x to y if one cell of the pair, say cell i, changes the state of the other cell, cell j, from 0 to 1 or from 1 to 0.
3) The state of a memory cell can be altered as a result of certain patterns of 1s and 0s or transitions in other cells.

A fault model which did not assume knowledge of the memory array structure would allow coupling between any pairs of cells. If the cells which are adjacent to each other are known, coupling can be restricted to neighboring cells, reducing the test complexity. Many testing algorithms have been derived to detect each of these faults in memory systems [59].

Fault Models for Programmable Logic Arrays (PLAs): PLAs are widely used in VLSI systems since they provide a means of implementing complex functions in a very regular and structured fashion. A PLA is shown schematically in Fig. 11 with X1 to X4 as the inputs, four product terms, and two outputs, F1 and F2. The internal fusible links or transistors in the PLA array are shown schematically as dots in this diagram. Faults in the PLA could include stuck-at, bridging, or shorts, as well as contact or cross-point faults caused by the spurious presence or absence of a contact (fusible link or transistor) between a row and a column of the PLA. It has been shown in [60] that the contact faults dominate most single stuck-at, bridging, and shorted diode faults. The



effects of contact faults within the PLA can be represented functionally as changes in the product terms of the PLA [61]. A single-input contact fault on a row of a PLA causes the growth, shrinkage, or disappearance of the corresponding product term. Similarly, a single-output contact fault causes the appearance of a new implicant or the disappearance of the corresponding implicant. Fig. 12 shows five faults in the PLA of Fig. 11 and their effects on the corresponding product terms. For example, if the contact at Q1 is faulty (it no longer has an effect), this is equivalent to the product term growing in size. On the other hand, if there is a spurious contact at Q2, this has the effect of shrinking the product term, as shown in Fig. 12. These functional effects can be used to derive tests for faults within PLAs [62].

Fig. 11. Schematic representation of a PLA.

Fig. 12. Effect of contact faults on PLA product terms (panels: growth fault at Q1, shrinkage fault at Q2, appearance fault at Q3, disappearance fault at Q4 and Q5).

Fault Models for Microprocessors: Microprocessors are used as building blocks in very complex systems. Even though these modules are quite complex themselves, fairly effective functional-level fault models at the register-transfer or architectural level have been derived for them. Thatte [63] visualizes a microprocessor as a set of functions including register decoding, data transfer, data manipulation, and instruction sequencing. A functional fault model is developed for each of these functions. Improvements in this model were made by Brahme [64]. This fault model for instruction sequencing uses an extrapolation of the techniques applied to derive fault models for smaller functions like decoders. The specific fault model is based on conceptual treatment of instructions as consisting of micro-instructions which, in turn, are comprised of a set of micro-orders. Under fault, one or all of the following events are assumed to happen:

1) One or more micro-orders are inactive; therefore, the instruction is not executed completely.
2) Micro-orders which are normally inactive become active.
3) A set of micro-instructions is active in addition to, or instead of, the normal micro-instructions.

The effect of a fault would be incorrect logic values at the external pins or the destruction of internally stored data. This fault model, in conjunction with fault models for other functions of the microprocessor, was used to derive comprehensive tests for microprocessors using only architectural level information. If the details of implementation of the microprocessor are not known, this is probably the only approach to deriving reasonable tests for faults (the details of which are unknown) in microprocessors.

ERROR MODELS

Accurate and tractable fault models provide a viable means of determining the effect of physical failures on individual transistors, gates, or functional modules. However, fault models do not yield immediate insight into the problem of determining the nature of errors due to these physical failures in the information at the output of circuit modules during normal operation. Concerns such as the extent of information corruption due to a physical failure, the extent of error propagation, and the time between failure inception and possible error detection (error latency), can be investigated by first developing models of the types of errors generated by physical failures. Error models are also necessary for the development of self-checking VLSI design, which allows for concurrent detection of errors due to physical failures during normal operation. Error models are a means of classifying the effect of physical failures on the system during periods of actual computation. Although there is a considerable body of research reported in the literature about the development of fault models for VLSI, there is a distinct lack of similar research concerning accurate and tractable error models. The discussion which follows presents work performed primarily by researchers interested in developing techniques for designing self-checking logic.

Transistor-Level Error Models

The effects of shorts and opens on VLSI transistors and interconnect result in error models which are dependent on the circuit design and application. For example, in n-MOS, a short between the gate and drain of an enhancement transistor can result in a very different class of errors if the transistor is employed in a PLA rather than in a bit-serial multiplier. For this reason, transistor-level error models cannot be discussed without also discussing their specific application. In contrast to fault models, error models at all levels of abstraction are only meaningful when the discussion is application and implementation specific.

Unidirectional Error Models: There are situations in which a single physical failure can create an electrical short or open and all the resulting bit errors at the outputs of a



module will be all of the same type, either 0s becoming 1s, or 1s becoming 0s. Such errors are described as unidirectional. An example of this behavior is in PLAs, where single device and interconnect failures have been shown to result in unidirectional errors [65], [66]. The erroneous selection or lack of selection of a product line for certain inputs results in unidirectional errors in the output. Fig. 13 illustrates eleven classes of faults including stuck-at faults (faults 1-3), shorts (faults 4-9), and crosspoint faults (faults 10, 11) which result in only unidirectional errors at the output of a PLA.

Fig. 13. Buffered PLA faults yielding unidirectional output errors.

Failure to separately buffer both the uncomplemented and complemented bit lines in PLAs as in Fig. 13 has been shown to yield error models which are equivalent to unidirectional errors on inputs as well as the outputs [67]. Fig. 14 illustrates the interesting phenomenon of how a short between the gate and drain of an enhancement transistor (case 2) or a bit-line stuck-at4 (case 1) in the AND plane of the PLA can result in what is equivalent to a 1-bit change on the input. In Fig. 14, a case 1 or case 2 fault causes an output error equivalent to an input of (MI) rather than the correct (001). Without input buffering, the unidirectional error model for outputs of PLAs must be modified to include errors on inputs and outputs. The authors have shown that a comprehensive error model for PLAs lacking input buffering can be summarized as follows for shorts and stuck-at faults:

1) The output is a unidirectional error in the correct output for the correct input.
2) The output is the correct output for an input which is a unidirectional error in the correct input.
3) The output is a unidirectional error in the correct output for an input vector which is a unidirectional error in the correct input.

Fig. 14. Equivalent input error for a PLA fault.

Soft-Error Models: Errors due to environmentally induced transient failures or intermittent circuit instabilities are referred to as soft errors. Disturbances such as the electron-hole pairs created by the passage of an alpha particle through the active device regions of an n-MOS memory are at best difficult to detect by means of classical testing. Fault models for such phenomena are meaningless. Concurrent error detection techniques based on a realistic error model allow for the detection of such errors. The cost is a function of the hardware required for codeword checkers and redundant information bits. Based on measurements and simulation, single-bit error detection/correction has been shown to be sufficient for most soft errors in MOS memories [11], [12]. However, models must include information concerning not only the type of errors but also their frequency, if error detection and correction is to be comprehensive. For example, in memories if soft errors occur frequently and memory words are accessed infrequently, single-bit correction per word may not be sufficient.

Indeterminate Errors: Certain classes of shorts, such as



the fault F, in Fig. 7, can result in logic values at the outputs of transistors which are between 0 and 1, as has previously been stated. These indeterminate values correspond to errors when the values are interpreted incorrectly by the transistors they drive. Indeterminate logic levels are difficult to detect, just as their corresponding fault models are difficult to test. Several gates driven by the same indeterminate value may interpret the logic value differently with the resultant effect of some gates receiving correct information and others erroneous inputs. An error may go undetected if a codeword checker interprets the intermediate voltage as a correct logic value, while the next stage of the logic circuit interprets it as an erroneous value.

Applications of the Models: Error detection for unidirectional errors on outputs of functional modules can be accomplished by encoding the outputs with unidirectional error detecting codes and using appropriate codeword checkers. Codeword checkers have recently been developed for MOS technologies based on transistor-level fault models [68], [69], which are considerably less costly than classical gate-level designs [70], [71]. These designs use mask-level layout and analysis to reduce the size of the checkers. For instances in which failures can result in unidirectional errors on both inputs and outputs, modified versions of standard encoding have been developed which afford the use of a single checker, but which still detect errors on inputs and/or outputs [72]. Fig. 15 illustrates such an implementation of a self-checking PLA. The input checkbits C_i are functions of the input information bits I_i, while the output checkbits C_o are a function of both the output information bits I_o and the input information. Methods of circuit design have yet to be devised which allow for ease in detection of indeterminate logic values either through testing or concurrent error detection. Soft errors caused by localized upsets, although difficult to test, are typically easily detected concurrently by simple encoding schemes with appropriate codeword checkers.

Fig. 15. Concurrent error detection for unidirectional errors in PLAs.

Designing for Error Avoidance and Containment: After an error model has been developed based on a realistic set of physical failures and a specific circuit design, the design cycle can be repeated in an attempt to eliminate the likelihood of those failures which result in difficult-to-detect insidious error classes. As an example, the selection of multiple word select lines in a read-only memory (ROM) due to a physical failure results in a unidirectional error model [72]. However, by employing design rules which minimize the likelihood of such a failure, e.g., in MOS by increasing the distance between the buried contacts of the depletion-mode transistors and the V,, contacts, or using a precharging scheme which isolates V,, and the word select lines, such troublesome errors can be eliminated from the model. For ROMs and PLAs, a restricted error model can be developed with these fault-avoiding layout rules, which includes only potential single-bit errors in the outputs or input lines. Instead of employing unidirectional error detecting codes, and the resultant complexity in codeword checkers, a simple parity checker scheme can be employed which detects all errors in the error model [73]. Fig. 16 presents the self-checking ROM design employing the error restricting layout rules. P1 is the odd parity of each data word (including P2 and P3), P2 is the odd parity of the address of each data word, and P3 is the even parity of the address of each data word. Noncodewords are detected by a self-testing parity checker and comparator. Other fault-avoiding designs have been developed for PLAs [74], [75] and for memories to minimize susceptibility to soft errors. The techniques for memories have included: folded bit-


Fig. 16. A scheme for concurrent error detection in ROMs



sense lines (reducing the sensitivity to bit-line and sense amplifier alpha hits), decreased front-end gate oxides (increasing the charge storage density), and novel cell designs [11].

Gate-Level Error Models

Gate-level error models have typically followed models for gate-level faults [76]. Single stuck-at lines yield error models which are single-bit errors on outputs or inputs of gates. Shorts on input lines or output lines have typically been assumed to result in the logical AND or OR of the constituent logic values, which is again a single-bit error on the output or input of a gate.

Gate-level error models have classically been used in developing self-checking logic circuits. The self-checking property of a circuit relates in a formal manner a specific fault model to an error model. A totally self-checking combinational circuit possesses the properties that for every fault from the fault model the circuit never produces an error that is not detectable, i.e., an incorrect codeword output, and the circuit produces a detectable error, i.e., a noncodeword output, for at least one codeword input.

The error preserving nature of circuits is also important in considering the propagation of errors throughout a system before detection occurs. Combinational circuits that always propagate detectable errors from the error model are called code disjoint (i.e., for the fault-free circuit, noncodeword inputs always map into noncodeword outputs). A more complex model of error propagation has been developed by Smith and Lam [77]. They also provide necessary and sufficient conditions for the placement of checkers in a self-checking system.

Functional-Level Error Models

Functional-level error models explicitly state the functional errors due to physical failures in specific modules. In broad terms, functional error models can be divided into models of information transfer and storage (e.g., interconnection networks), information manipulation (e.g., ALUs), and control (e.g., instruction sequences). Very few transistor, gate, or functional error models include timing errors as part of the model. The advantage of functional-level error models over transistor- and gate-level equivalents is that complex systems can be modeled without the details involved in considering individual devices or gates. Also, functional-level error models are typically amenable to error detection, without the necessity of a large amount of error detection hardware and software, as is often inherent in error detection techniques at a lower level of implementation. Finally, errors can be modeled for modules in which the actual failure mechanism or the details of implementation are not known. The disadvantage of functional-level error modeling is that the relationship between realistic failures and modeled errors may be tenuous. Care must be taken to verify functional error models through fault simulation, measurement, and accelerated life testing. There is a distinct need for research concerning the correlation between proposed functional error models and the classes of errors produced by actual physical failures.

Information Transfer and Storage Error Models: Classic examples of information transfer and storage modules in VLSI are the multibit bus and readable/writable memory chip. Bus and memory error models have included single-bit errors (due to shorts between two lines, or single broken lines), random errors, or unidirectional errors. Note should be made that a failure in the address decoding mechanism of a memory may cause a read or write of the wrong location, which may be observed as a random error at the output. If such a failure is permanent, then, as with PLAs and ROMs [73], coding schemes can be used which force such a failure into a class of detectable errors by encoding the address with the data and using layout rules which avoid the likelihood of multiple activated word lines.

A more complex error model is necessary for interconnection networks composed of multiple stages of switches. Multistage interconnection networks (MINs), as an example, are an alternative approach to multiprocessing in environments where the cost of a complete crossbar interconnection network is not feasible. An example MIN composed of 2 x 2 switches and connecting eight processors to eight memory modules is given in Fig. 17. In developing an error model for this network, it is convenient to

Fig. 17. Example multistage interconnection network.

Valid States technique has been used in designing a self-checking
microprogram sequencer [79], [MI.
Structure-Based Error Models: Most VLSl systems are
implemented in a highly structured and hierarchical manner.
In developing the error models for these systems the struc-
E # # # # # # # #
'8 ' 9 '10 sll s12 s13 s ~ d '15 '1.5
ture can often be exploited to yield insights into tractable
and accurate functional error models. The MlNs described
above are examples ofthisapproach.Anotherinteresting
Invalid States area of investigation is the structure of algorithms that are
Fig. 18. Possible functional states in a 2 X 2 switching ele- implemented by specific VLSl systems.
ment. Certain classes of errors in software structures caused by
physical failures can be modeled from a functional perspec-
consider the effect of shorts and opens on the interconnect- tive. Severalresearchershave proposed an error model for
ing links; however, the switches themselves are most easily control flow in macro- and micro-code software [81]-[83].
consideredfromafunctional perspective. A lower level Detectable errors are considered to be invalid branches, i.e.,
model would be limited toa specific design and implemen- any branch that deviates from the flow graph of the original
tationtechnology.In this case, an exhaustive functional program. Techniques for detecting errors in this model are
error model can be derived from the functional fault model based on the generation and storage of checking signatures
for the switches. All possible functional states of a single at predetermined points inthe object code. An error model
2 X 2 switch are given in Fig. 18. Some ofthe statesare forlinked data structures has also beendeveloped. The
invalid in the sense that the switch should never be in that model has been used forstructural integrity checking of
state during normal use. Other states are possible (or valid), data structures, and includes missing or erroneous pointers
however, they may be incorrect for a specific switch setting. between nodes (records) [84]. Limits are placed on the
In the examplethatfollows, a class ofnetworks,namely maximum number of total erroneous pointers and the num-
Delta networks, as illustrated in Fig. 17, are examined. If all ber of erroneouspointers in each node.Deviations from
the data paths of a single switch are considered to be in the the correct structure can be detected by techniques such as
same state atany given instance then it is not difficult to redundantpointers [84], signatures stored in the leaves
show that given the topology of the network and the fault (signatured access paths), or the use of code-theoretic tech-
model for the links and switches, any failure in a switch or niques (distributed and appended checks) [85].
link will result in one of the following errors [78]: Designingfor Error Avoidance and Containment: It is
especiallyuseful at the functional level, as it is at the
transistorand gate level, to restrict,throughdesigntech-
1) a unidirectional error will occur in the information niques, the types of undetectable errors that are likely to
flowing through the network; occur during normal operation. For failures resulting in the
2) theinformationwill be transferred tothewrong selection of the wrong source or destination in an informa-
destination where the address of the wrong destina- tion transfer or storage operation, the examples above have
tion will be a unidirectional error in the correct ad- shownhowthe destination or source address can be
dress;
3) both of the above will occur; or
4) the source processor will fail to receive an acknowledge signal from a destination memory (assuming circuit-switched operation).

Based on this error model, a simple method of concurrent error detection utilizing unidirectional error detecting codes over both the destination address and data can be devised [78].

An alternative error model is necessary if 2 × 2 switches are employed in the network, a centralized control strategy is used, and yet the specific interconnection topology is unknown. The error model for such a scenario is as follows [78]:

1) the correct information may arrive at one or more incorrect destinations, or no destination;
2) there may be unidirectional errors in the information arriving at the correct destination.

In order to comprehensively detect the errors in this model a method of detecting all random destination errors and all unidirectional errors in the information must be devised. A novel encoding strategy has been developed which encodes the complete address with the checkbits of the data in order to detect the errors in this model [78]. A similar

encoded and stored with the data in a manner allowing detection of errors during the information transfer operation.

Another extremely useful error containment technique is the distribution of portions of computation or information transfer and storage over several physically separate functional units. An error model can then be developed that allows any single functional unit to fail in any arbitrary manner (a random error model). At its simplest level this technique is often used in multichip memory design where each chip supplies at most 1 bit for any given n-bit word. The arbitrary failure of a single chip, including the address decoder, results in at most only a single-bit error. This same technique has been applied to MINs in order to limit the effect of any single control or switch chip failure [86]. The data paths through the network are divided up into slices of one or more bits so that the resulting error model is one in which random errors may be present in the information received at the correct destination, however, the errors will be confined to a specific region in the information word (e.g., 1 byte). A more sophisticated application of this technique has been developed for constraining the extent of errors due to a faulty processor in a multiprocessor environment [87]. The technique is algorithm-specific and utilizes encoded data and algorithm modification to ensure that all errors due to a faulty processor can be detected and, in this case, corrected.
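As an illustration of the unidirectional error detecting codes applied over the destination address and data, the following Python sketch uses a Berger code, one well-known code of this class. The choice of the Berger code, the field widths, and all function names are assumptions made for illustration and are not taken from [78]. The check symbol is the number of 0's in the concatenated address and data, so an error pattern that changes bits in only one direction drives the zero count of the information part and the value of the check part apart.

```python
def berger_check(info: int, width: int) -> int:
    """Berger check symbol: the number of 0 bits in a width-bit information word."""
    return width - bin(info & ((1 << width) - 1)).count("1")


def encode(address: int, data: int, addr_width: int, data_width: int) -> tuple[int, int]:
    """Concatenate address and data, then compute one check symbol over both."""
    info = (address << data_width) | data
    return info, berger_check(info, addr_width + data_width)


def is_valid(info: int, check: int, width: int) -> bool:
    """A received (info, check) pair is a code word only if the zero count matches."""
    return berger_check(info, width) == check


def inject_unidirectional(info, check, info_mask, check_mask, direction):
    """Force the masked bits to 0 ("1to0") or to 1 ("0to1"), never a mixture."""
    if direction == "1to0":
        return info & ~info_mask, check & ~check_mask
    return info | info_mask, check | check_mask


if __name__ == "__main__":
    # Hypothetical 3-bit destination address and 4-bit data word.
    info, check = encode(address=0b101, data=0b1100, addr_width=3, data_width=4)
    bad_info, bad_check = inject_unidirectional(info, check, 0b0010100, 0b01, "1to0")
    print(is_valid(info, check, 7), is_valid(bad_info, bad_check, 7))
```

Because the address is covered by the same check symbol as the data, a word delivered with a unidirectionally corrupted address field fails the check in the same way as one with corrupted data. Detection of arbitrary (random) destination errors, as required by the alternative error model, needs the additional encoding described in [78] and is not shown here.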

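The error containment obtained by distributing a word over physically separate chips can be sketched in the same way. The Python model below is a simplified assumption, not a description of any particular memory: each chip holds one bit position of every word, and a failed chip (including a failed address decoder on that chip) is modeled as returning an arbitrary bit. The class and function names are hypothetical.

```python
import random


class BitSlicedMemory:
    """n_bits chips, each supplying exactly one bit of every stored word."""

    def __init__(self, n_bits: int, n_words: int):
        self.n_bits = n_bits
        self.chips = [[0] * n_words for _ in range(n_bits)]
        self.faulty_chip = None           # index of the failed chip, if any
        self._rng = random.Random(0)      # fixed seed so runs are repeatable

    def write(self, addr: int, word: int) -> None:
        for i in range(self.n_bits):
            self.chips[i][addr] = (word >> i) & 1

    def read(self, addr: int) -> int:
        word = 0
        for i in range(self.n_bits):
            if i == self.faulty_chip:
                bit = self._rng.getrandbits(1)   # arbitrary output of the failed chip
            else:
                bit = self.chips[i][addr]
            word |= bit << i
        return word


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    mem = BitSlicedMemory(n_bits=8, n_words=4)
    mem.write(2, 0b10110010)
    mem.faulty_chip = 5
    print(hamming_distance(mem.read(2), 0b10110010))  # never exceeds 1
```

However the failed chip behaves, the word that is read differs from the stored word in at most the one bit position supplied by that chip, which is the random error model confined to a known region of the word.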


A SUMMARY COMPARISON OF FAULT AND ERROR MODELS

The unique applications of error and fault models have been noted throughout the text. The following discussion illustrates this uniqueness with one specific module. The previous descriptions of fault and error models for PLAs present an opportunity for drawing some comparisons. If failures are considered to be the spurious appearance or disappearance of an active device in the OR or AND plane of a PLA (crosspoint failures), the previous discussion has shown that from a functional fault model perspective the result will be the addition or deletion of a literal in a product term of at least one output function, or in adding or deleting a product term (implicant) from at least one output function. Test algorithms based on this fault model should, therefore, be derived such that they verify that these situations have not occurred. The growth or shrinkage of product terms, however, provides little information that is useful for detecting errors at the output of a PLA. A useful error model for PLAs based on the appearance or disappearance of an active device is one which consists of unidirectional errors on the output lines. Therefore, by simply encoding the outputs of the PLA in a unidirectional error detecting code, errors due to the modeled failures can be detected. Error detection is based on erroneous output logic values rather than functional changes within the module. Testing, on the other hand, is concerned with the functional change of a device, gate, circuit, or module and the input(s) necessary to detect this malfunction. A similar comparison can be made for all of the modules discussed in this paper.

CONCLUSIONS

VLSI systems are becoming ever more complex, with many new types of failures and failure effects, as technology reduces linewidths and increases density. There is a great need for a thorough understanding of how defects and failures affect these complex systems. The inherent contradiction in the requirements for fault models has to be solved; they are required to be very accurate but, at the same time, applicable to very large systems. The solution seems to be to describe accurately the effects of faults within higher functional modules and thus make complex systems tractable by reducing the number of primitive elements. The hierarchical approaches which are used by the design community must also be exploited in fault and error modeling and test generation. Powerful CAD tools for modeling and test generation, which can easily be used by designers and test engineers, must be developed if we are to exploit the potential of VLSI. It is clear that an arbitrary system would be very difficult to model and test. It is imperative that complex systems be designed so that they are easily testable, and design for testability will become one of the primary requirements of very complex systems. On-line, concurrent error detection can also be made cost-effective for many applications with appropriate design techniques which eliminate the likelihood of insidious errors. Reliable systems of the future will most certainly test for faults and check for errors which are based on models derived from an analysis of the effects of realistic physical failures. Based on accurate and tractable fault and error models, efficient and comprehensive approaches to reliable VLSI system design can be formulated.

REFERENCES

[1] J.-C. Laprie, "Dependable computing and fault tolerance: Concepts and terminology," in Proc. IEEE Int. Symp. on Fault-Tolerant Computing, pp. 2-11, June 1985.
[2] T. E. Mangir, "Sources of failures and yield improvement for VLSI and restructurable interconnects for RVLSI and WSI: Part I-Sources of failures and yield improvement for VLSI," Proc. IEEE, vol. 72, pp. 690-708, June 1984.
[3] H. R. Bolin, "Process defects and effects on MOSFET gate reliability," in Proc. Reliability Physics Symp., pp. 252-254, 1980.
[4] E. D. Colbourne, G. P. Coverley, and S. K. Behera, "Reliability of MOS LSI circuits," Proc. IEEE, vol. 62, pp. 244-259, Feb. 1974.
[5] J. T. Easterbrook and R. G. Bennetts, "Failure mechanisms in logic circuits and their related fault effects," in Proc. IEEE Conf. on New Developments in Automatic Testing, pp. 44-47, Nov. 1977.
[6] G. L. Schnable, L. J. Gallace, and H. L. Pujol, "Reliability of CMOS integrated circuits," IEEE Computer, vol. 11, pp. 6-17, Oct. 1978.
[7] P. B. Ghate, "Electromigration-induced failures in VLSI interconnects," in Proc. Reliability Physics Symp., pp. 292-299, Mar. 1982.
[8] P. S. Ho, "Basic problems for electromigration in VLSI applications," in Proc. Reliability Physics Symp., pp. 288-291, Mar. 1982.
[9] A. Ito, H. A. Swasey, and E. W. George, "Hot electron reliability modeling in VLSI devices," in Proc. Reliability Physics Symp., pp. 96-101, 1983.
[10] D. S. Peck, "New concerns about integrated circuit reliability," in Proc. Reliability Physics Symp., pp. 1-6, 1978.
[11] T. C. May, "Soft errors in VLSI: Present and future," IEEE Trans. Comp., Hybrids, Manuf. Technol., vol. CHMT-2, pp. 377-387, Dec. 1979.
[12] C. H. Sie, R. A. Youngblood, J. H. Liao, and A. Turk, "Soft failure modes in MOS RAMs," in Proc. Reliability Phys Symp., pp. 27-32, 1977.
[13] W. T. Anderson, Jr. and S. C. Binari, "Radiation effects in GaAs devices and ICs," in Proc. Reliability Physics Symp., pp. 316-319, 1983.
[14] J. W. Peeples and T. J. Every, "Parametric influence on system soft error rates," in Proc. Reliability Physics Symp., pp. 255-260, 1980.
[15] N. Burgess and R. I. Damper, "The inadequacy of the stuck-at fault model for testing MOS LSI circuits: A review of MOS failure mechanisms and some implications for computer-aided design and test of MOS LSI circuits," Software Microsyst., vol. 3, pp. 30-36, Apr. 1984.
[16] S. Cai, M. Mezzalama, and P. Prinetto, "A review of fault models for LSI/VLSI devices," Software Microsyst., vol. 2, pp. 44-53, Apr. 1983.
[17] G. R. Case, "Analysis of actual mechanisms in CMOS logic gates," in Proc. Design Automation Conf., pp. 265-270, 1976.
[18] C. Timoc, M. Buehler, T. Griswold, C. Pina, F. Stott, and L. Hess, "Logical models of physical failures," in Proc. IEEE Int. Test Conf., pp. 546-553, Nov. 1983.
[19] J. P. Hayes, "Fault modeling," IEEE Design and Test, vol. 2, pp. 88-95, Apr. 1985.
[20] J. W. Bandler and A. E. Salama, "Fault diagnosis of analog circuits," Proc. IEEE, vol. 73, pp. 1279-1325, Aug. 1985.
[21] J. Galiay, Y. Crouzet, and M. Vergniault, "Physical versus logical fault models in MOS LSI circuits: Impact on the testability," IEEE Trans. Comput., vol. C-29, pp. 527-531, June 1980.
[22] B. Courtois, "Failure mechanism, fault hypothesis, and analytical testing of LSI-NMOS (HMOS) circuits," in Proc. VLSI '81, pp. 341-350, Aug. 1981.
[23] S. K. Jain and V. D. Agrawal, "Modeling and test generation algorithms for MOS circuits," IEEE Trans. Comput., vol. C-34, pp. 426-433, May 1985.
[24] K. W. Chiang and Z. G. Vranesic, "On fault detection in CMOS logic networks," in Proc. 20th Design Automation Conf., pp. 50-56, June 1983.
[25] M. Acken, "Testing for bridging faults in CMOS circuits," in Proc. 20th Design Automation Conf., pp. 717-718, June 1983.
[26] R. L. Wadsack, "Fault modeling and logic simulation of CMOS and MOS integrated circuits," Bell Syst. Tech. J., pp. 1449-1474, May-June 1978.
[27] D. Baschiera and B. Courtois, "Testing CMOS: A challenge," VLSI Des., pp. 58-62, Oct. 1984.
[28] M. D. Schuster and R. E. Bryant, "Concurrent fault simulation of MOS digital circuits," in Proc. Conf. on Advanced Research in VLSI, pp. 1-10, 1984.
[29] J. P. Hayes, "Fault modeling for digital integrated circuits," IEEE Trans. Computer-Aided Design, vol. CAD-3, pp. 202-208, July 1984.
[30] P. Banerjee and J. A. Abraham, "A multivalued algebra for modeling physical failures in MOS VLSI circuits," IEEE Trans. Computer-Aided Design, vol. CAD-3, pp. 312-321, July 1985.
[31] C. C. Beh, K. H. Arya, C. E. Radke, and K. E. Torku, "Do stuck fault models reflect manufacturing defects?" in Proc. IEEE Semiconductor Test Conf., pp. 35-42, Nov. 1982.
[32] P. Banerjee, "A model for simulating physical failures in MOS VLSI circuits," Tech. Rep. CSC-13, Coordinated Sci. Lab., Univ. of Illinois at Urbana-Champaign, 1985.
[33] S. M. Reddy, M. K. Reddy, and J. G. Kuhl, "On testable design for CMOS logic circuits," in Proc. IEEE Int. Test Conf., pp. 435-445, Oct. 1983.
[34] H.-C. Shih, J. T. Rahmeh, and J. A. Abraham, "An MOS fault simulator with timing information," in Proc. Int. Conf. on Computer-Aided Design, pp. 45-47, Nov. 1985.
[35] W. Barraclough, A. C. L. Chiang, and W. Sohl, "Techniques for testing the microcomputer family," Proc. IEEE, vol. 64, pp. 943-950, June 1976.
[36] R. Nair, S. M. Thatte, and J. A. Abraham, "Efficient algorithms for testing semiconductor random-access memories," IEEE Trans. Comput., vol. C-27, pp. 572-576, June 1978.
[37] J. F. Poage, "Derivation of optimum tests to detect faults in combinational circuits," in Proc. Symp. on Mathematical Theory of Automata, pp. 483-528, 1963.
[38] M. L. Ketelsen, "An integrated circuit fault model for digital systems," Tech. Rep. CSLR-743, Coordinated Sci. Lab., Univ. of Illinois at Urbana-Champaign, 1976.
[39] J. P. Shen, W. Maly, and F. J. Ferguson, "Inductive fault analysis of MOS integrated circuits," IEEE Design and Test, vol. 2, pp. 13-26, Dec. 1985.
[40] E. J. McCluskey and F. W. Clegg, "Fault equivalence in combinational logic networks," IEEE Trans. Comput., vol. C-20, pp. 1286-1293, Nov. 1971.
[41] D. R. Schertz and G. Metze, "A new representation for faults in combinational digital circuits," IEEE Trans. Comput., vol. C-21, pp. 858-866, Aug. 1972.
[42] P. Banerjee and J. A. Abraham, "Characterization and testing of physical failures in MOS logic circuits," IEEE Design and Test, vol. 1, pp. 76-86, Aug. 1984.
[43] K. C. Y. Mei, "Bridging and stuck-at faults," IEEE Trans. Comput., vol. C-23, pp. 720-726, July 1974.
[44] S. A. Al-Arian and D. P. Agrawal, "Modeling and testing of CMOS circuits," in Proc. Int. Conf. on Computer Design, pp. 763-769, Oct. 1984.
[45] S. W. Sievers and A. Avizienis, "Analysis of a class of totally self-checking functions implemented in a MOS LSI general logic structure," in Proc. 11th Int. Symp. on Fault-Tolerant Computing, pp. 256-261, June 1981.
[46] Y. M. El-ziq, "Automatic test generation for stuck-open faults in CMOS VLSI," in Proc. 18th Design Automation Conf., pp. 347-354, June 1981.
[47] R. Chandramouli, "On testing stuck-open faults," in Proc. 13th Int. Symp. on Fault-Tolerant Computing, pp. 258-265, June 1983.
[48] V. G. Oklobdzija and P. G. Kovijanic, "On testability of CMOS domino logic," in Proc. 14th Int. Symp. on Fault-Tolerant Computing, pp. 50-55, June 1984.
[49] N. K. Jha and J. A. Abraham, "Testable CMOS logic circuits under dynamic behavior," in Proc. Int. Conf. on Computer-Aided Design, pp. 131-133, Nov. 1984.
[50] C. Timoc, F. Stott, K. Wickman, and L. Hess, "Adaptive self-test for a microprocessor," in Proc. IEEE Semiconductor Test Conf., pp. 701-703, Oct. 1983.
[51] W. H. Kautz, "Testing for faults in cellular logic arrays," in Proc. Symp. on Switching and Automata Theory, pp. 161-174, 1967.
[52] A. D. Friedman, "Easily testable iterative systems," IEEE Trans. Comput., vol. C-22, pp. 1061-1064, Dec. 1973.
[53] W.-T. Cheng, "Testing and error detection in iterative logic arrays," Tech. Rep. CSC-44, Coordinated Sci. Lab., Univ. of Illinois at Urbana-Champaign, 1985.
[54] E. J. McCluskey and S. Bozorgui-Nesbat, "Design for autonomous test," IEEE Trans. Comput., vol. C-31, pp. 866-875, Nov. 1981.
[55] F. C. Hennie, Finite-State Models for Logical Machines. New York: Wiley, 1968.
[56] T. Sridhar and J. P. Hayes, "Design of easily testable bit-sliced systems," IEEE Trans. Comput., vol. C-30, pp. 842-854, Nov. 1981.
[57] T. Davis, R. Kunda, and W. K. Fuchs, "Testing of bit-serial multipliers," in Proc. IEEE Int. Conf. on Computer Design, pp. 430-434, Oct. 1985.
[58] J. A. Abraham and V. K. Agarwal, "Test generation for digital systems," in Fault-Tolerant Computing: Theory and Techniques, D. K. Pradhan, Ed. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[59] M. S. Abadir and H. K. Reghbati, "Functional testing of semiconductor random-access memories," Comput. Surv., vol. 15, pp. 175-198, Sept. 1983.
[60] D. L. Ostapko and S. J. Hong, "Fault analysis and test generation for programmable logic arrays," IEEE Trans. Comput., vol. C-28, pp. 617-626, Sept. 1979.
[61] J. E. Smith, "Detection of faults in programmable logic arrays," IEEE Trans. Comput., vol. C-28, pp. 845-853, Nov. 1979.
[62] P. Bose and J. A. Abraham, "Test generation for programmable logic arrays," in Proc. 19th Design Automation Conf., pp. 574-580, June 1982.
[63] S. M. Thatte and J. A. Abraham, "Test generation for microprocessors," IEEE Trans. Comput., vol. C-29, pp. 429-441, June 1980.
[64] D. S. Brahme and J. A. Abraham, "Functional testing of microprocessors," IEEE Trans. Comput., vol. C-33, pp. 475-485, June 1984.
[65] C. P. Mak, J. A. Abraham, and E. S. Davidson, "The design of PLAs with concurrent error detection," in Proc. 12th Int. Symp. on Fault-Tolerant Computing, pp. 303-310, June 1982.
[66] H. Dong and E. J. McCluskey, "Matrix representation of PLAs and an application to characterize errors," Center for Reliable Computing Tech. Rep. 81-11, Stanford Univ., Stanford, CA, Sept. 1981.
[67] P. Banerjee and J. A. Abraham, "Fault characterization of VLSI MOS circuits," in Proc. Int. Conf. on Circuits and Computers, pp. 564-568, Sept. 1982.
[68] N. K. Jha and J. A. Abraham, "Techniques for efficient MOS implementation of totally self-checking checkers," in Proc. IEEE Int. Symp. on Fault-Tolerant Computing, pp. 430-435, June 1985.
[69] M. Nicolaidis, I. Jansch, and B. Courtois, "Strongly code disjoint checkers," in Proc. 14th Int. Symp. on Fault-Tolerant Computing, pp. 16-21, June 1984.
[70] D. A. Anderson and G. Metze, "Design of totally self-checking check circuits for m-out-of-n codes," IEEE Trans. Comput., vol. C-22, pp. 263-269, Mar. 1973.
[71] M. A. Marouf and A. D. Friedman, "Design of self-checking checkers for Berger codes," in Proc. 8th Int. Symp. on Fault-Tolerant Computing, pp. 179-184, June 1978.
[72] W. K. Fuchs and J. A. Abraham, "A unified approach to concurrent error detection in highly structured logic arrays," in Proc. 14th Int. Symp. on Fault-Tolerant Computing, pp. 4-9, June 1984 (also to appear in IEEE J. Solid-State Circuits).
[73] C.-Y. Chen, W. K. Fuchs, and J. A. Abraham, "Efficient concurrent error detection in ROMs and PLAs," in Proc. IEEE Int. Conf. on Computer Design, pp. 525-529, Oct. 1985.
[74] Y. Tamir and C. H. Sequin, "Design and application of self-testing comparators implemented with MOS PLAs," IEEE Trans. Comput., vol. C-33, pp. 493-506, June 1984.
[75] M. Nicolaidis and B. Courtois, "Design of self-checking systems based on analytical fault hypotheses," IMAG Computer Architecture Group Res. Rep. 353, Mar. 1983.
[76] J. F. Wakerly, Error Detecting Codes, Self-Checking Circuits and Applications. New York, NY: Elsevier-North Holland, 1978.
[77] J. E. Smith and P. Lam, "A theory of totally self-checking system design," IEEE Trans. Comput., vol. C-32, pp. 831-844, Sept. 1983.
[78] W. K. Fuchs, K. H. Huang, and J. A. Abraham, "Concurrent error detection in VLSI interconnection networks," in Proc. 10th Int. Symp. on Computer Architecture, pp. 309-315, June 1983.
[79] C. Y. Wong, W. K. Fuchs, J. A. Abraham, and E. S. Davidson, "The design of a microprogram control unit with concurrent error detection," in Proc. 13th Int. Symp. on Fault-Tolerant Computing, pp. 476-483, June 1983.
[80] M. M. Yen, W. K. Fuchs, and J. A. Abraham, "Designing for concurrent error detection in VLSI: Application to a microprogram control unit," IEEE J. Solid-State Circuits (to appear).
[81] M. Namjoo, "Design of concurrently testable microprogrammed control units," in Proc. 15th Annu. Workshop on Microprogramming, pp. 173-180, Oct. 1982.
[82] T. Sridhar and S. M. Thatte, "Concurrent checking of program flow in VLSI processors," in Proc. IEEE Int. Test Conf., pp. 191-199, Nov. 1982.
[83] J. P. Shen and M. A. Schuette, "On-line self-monitoring using signatured instruction streams," in Proc. IEEE Int. Test Conf., pp. 275-282, Nov. 1983.
[84] D. J. Taylor, D. E. Morgan, and J. P. Black, "Redundancy in data structures: Improving software fault tolerance," IEEE Trans. Software Eng., vol. SE-6, pp. 585-594, Nov. 1980.
[85] W. K. Fuchs, "Concurrent error detection in VLSI systems through structure encoding," Tech. Rep. CSC-43, Coordinated Sci. Lab., Univ. of Illinois at Urbana-Champaign, 1985.
[86] J. E. Lilienkamp, D. H. Lawrie, and P.-C. Yew, "A fault-tolerant interconnection network using error-correcting codes," in Proc. Int. Conf. on Parallel Processing, pp. 123-125, Aug. 1982.
[87] K. H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Trans. Comput., vol. C-33, pp. 518-528, June 1984.

654 PROCEEDINGS OF THE IEEE, VOL. 74, NO. 5, MAY 1986
