Self Repair Technology For Logic Circuits: Architecture, Overhead and Limitations

Computer Engineering
Self Repair Technology for Logic Circuits
Architecture, Overhead and Limitations
Heinrich T. Vierhaus
BTU Cottbus
Computer Engineering Group
CREDES / ZUSYS / DAAD Summer School 2011, Tallinn

Outline
1. Introduction: Nano Structure Problems
2. The Problem of Wear-Out
3. Repair for Memory and FPGAs
4. Basic Logic Repair Strategies & Structures
5. Test and Repair Administration
6. De-Stressing Strategies
7. Cost, Overhead, Single Points of Failure
8. Summary and Conclusions

1. Introduction
A bunch of new problems from nano structures ...

Nanoelectronic Problems
Lithography:
The wavelength used to „map“ structural information from
masks to wafers is larger (4 times or more) than the minimum
structural features (193 nm versus 90 / 65 / 45 nm).
Adaptation of layouts for correction of mapping faults.
Statistical Parameter Variations:

The number of atoms in MOS-transistor channels becomes so
small that statistical variations of doping densities have an impact
on device parameters such as threshold voltages.

New Problems with Nano-Technologies
[Figure: lithography setup with light source (wavelength: 193 nm), mask (reticle), resist, and wafer with exposed resist. Feature size: down to 28 nm.]
Layout Correction
Modified layout for compensation of mapping faults.
Compensation is critical and non-ideal.
Faults are not random but correlated!
Requires fast fault diagnosis.
Doping Fluctuations in MOS Transistors
[Figure: MOS transistor cross-section with poly-Si gate, n regions with individual doping atoms, and p-substrate.]
Density and distribution of doping atoms cause shifts in transistor threshold voltages!

Nanostructure Problems
Individual device characteristics such as Vth depend more strongly
on statistical variations of underlying physical features such
as doping profiles.
Primary Relevance: Yield
A significant share of basic devices will be „out of specs“ and need
replacement by backup elements for yield improvement after
production.
Primary Relevance: Yield
Smaller features mean higher stress (field strength, current
density) and also foster new mechanisms of early wear-out.
Primary Relevance: Lifetime
Recognition and compensation of transient errors „in time“ is becoming a must,
e. g. because charged particles can discharge circuit nodes.
Primary Relevance: Dependability

Fault Tolerant Computing
Fault event, handled at one of three levels:
Software-based fault detection & compensation: works only for transient faults! (specific)
HW logic & RT-level detection & compensation: typically works for transient and permanent faults! (universal)
Transistor- and switch-level compensation: typically works for specific types of transient faults only! (very specific)

2. Wear-Out Problems and Mechanisms

Structures on ICs used to live longer than either their application
or even their users. Not any more ...

IC Structures May Get Tired
„Wear-out“ effects in nano-electronic ICs are likely to appear much earlier,
causing a lot of problems for dependable long-time applications!

Fault Effects on ICs

[Figure: IC cross-section with Metal 1 to Metal 3, via, polyimide, low-k insulator, field oxide, high-k gate oxide, and n / p regions with an n-well; annotated with metal migration and low-k insulator deterioration.]
Transistor deterioration (HCI, NBTI), eventually gate oxide shorts!

Wear-Out Mechanisms
Metal Migration:
Metal atoms (Al, Cu) tend
to migrate under high current
density and high temperature.
Stress migration:
Migration effects may be enhanced
under mechanical stress conditions.
Effect:
Metal lines and vias may actually
suffer line interrupts. The effect is
partly reversible by changing current
directions.

Metal Migration
[Figure: metal wire under high current density between neighbor wires, shown new and after some time in operation: voids (holes) lead to an open defect, displaced metal leads to a short to a neighbor wire.]
Vias are especially prone to such defects.
The effect is partly reversible by reversing the direction of current flow!

Transistor Degradation
Negative Bias Temperature Instability (NBTI): Reduced switching speed
for p-channel MOS transistors that have operated under long-time constant
negative gate bias. The effect is partly reversible.
Hot Carrier Injection (HCI): Reduced switching speed for n-channel MOS
transistors, induced by positive gate bias and frequent switching.
Not reversible.
Gate Oxide Deterioration: Induced by high field strength. Not reversible.
Dielectric Breakdown: Insulating layers between metal lines may break down,
causing shorts between signal lines.
Design technology must include a prospective „life time budget“!

Management of Wear-Out by
„Fault Tolerant Computing“?
Built-in fault tolerance and error compensation are needed in
nano-technologies anyway for the management of transient faults.
Wear-out induced faults may show up as „intermittent“ faults first,
which become more and more frequent.
Faults in synchronous circuits and systems are detected „by clock cycle“.
Hence, for many types of fault tolerant architecture, the detection does not
even recognize whether the fault is permanent or not.

Triple Modular Redundancy

[Figure: one input signal feeds Execution Units 1, 2 and 3; a comparator / voter forms the result output by majority and raises an error-detect signal.]
Can detect and compensate almost any type of fault.
Overhead about 200-300 %, additional signal delays.
The voter itself is not covered but must be a „self checking checker“.
Standard (by law) in avionics applications!
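The majority voting described above can be sketched in a few lines. This is a minimal illustration, assuming the three execution units deliver integer result words and that voting is done bit by bit; the function names `majority` and `error_detect` are hypothetical and not taken from the slides.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three result words."""
    return (a & b) | (a & c) | (b & c)


def error_detect(a: int, b: int, c: int) -> bool:
    """True if at least one unit disagrees with the others."""
    return not (a == b == c)


# Example: unit 3 delivers a corrupted word, the voter masks it.
good = 0b1010
bad = 0b0110
result = majority(good, good, bad)
flag = error_detect(good, good, bad)
print(bin(result), flag)
```

The sketch also shows the limitation named above: the voter is a plain function here, and nothing checks the voter itself.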
Error Detecting / Correcting Codes
[Figure: data and signature pass through transmission / storage; afterwards a signature comparison drives fault detection and error correction of the data.]
Often applicable to 1- or 2-bit faults only.
Often limited to certain fault models (uni-directional).
Becomes expensive if applied to computational units.
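As one concrete instance of an error correcting code, the sketch below uses the classic Hamming(7,4) code. The slides do not name a specific code, so this choice is an assumption for illustration only. It corrects any single-bit fault and shows why such codes are limited to few-bit faults: a 2-bit fault is silently mis-corrected.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into 7 bits (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, syndrome); a non-zero syndrome is the flipped position."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the single bit the syndrome points to
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
sent = hamming74_encode(word)

one_fault = list(sent)
one_fault[4] ^= 1  # single-bit fault at position 5
print(hamming74_decode(one_fault))

two_faults = list(sent)
two_faults[0] ^= 1
two_faults[1] ^= 1  # double fault: decoder "corrects" the wrong bit
print(hamming74_decode(two_faults))
```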
Can TMR and Codes Compensate Permanent Faults?
Fault / error detection circuitry typically works on a clock-cycle basis.
It does not „know“ if a fault is transient or permanent.
A permanent fault is a fault event that occurs repeatedly in several to many
successive clock cycles.
Error correction technology can detect and compensate such permanent faults
as well as transient faults.
A critical condition occurs if transient faults occur on top of
permanent faults. Then the superposition of fault effects is likely to
exceed the system's fault handling capacity.
System components that run actively „in parallel“ suffer from the same
wear-out effects. Therefore there is an increase in dependability before
wear-out limits, but no significant life time extension!
Redundancy and Wear-Out
During the normal life time of the system, duplication or triplication
can enhance reliability significantly. But area and power consumption
are also about tripled.
And by the end of normal operating time (out of fuel / steam) all three
systems will fail shortly one after the other!
Reliability enhancement is not equal to life time extension!
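The statement above can be illustrated with the textbook reliability model for TMR. The following is a sketch under simplifying assumptions that are not from the slides: each unit fails independently with a constant failure rate \lambda, and the voter is perfect. Wear-out means a rising failure rate, so this constant-rate model is only the simplest case that already shows the effect.

```latex
\begin{align}
R(t) &= e^{-\lambda t} && \text{(single unit)}\\
R_{\mathrm{TMR}}(t) &= 3R(t)^2 - 2R(t)^3 && \text{(at least 2 of 3 units work)}\\
R_{\mathrm{TMR}}(t) > R(t) &\iff R(t) > \tfrac{1}{2} \iff t < \frac{\ln 2}{\lambda}\\
\mathrm{MTTF}_{\mathrm{TMR}} &= \int_0^\infty R_{\mathrm{TMR}}(t)\,dt
  = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda} < \frac{1}{\lambda}
\end{align}
```

In this model TMR is more reliable than a single unit only during the early part of the life time, and its mean time to failure is even shorter than that of a single unit.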
Self Repair?
Fault event, handled at one of three levels:
Software-based fault detection & compensation: works only for transient faults! (specific)
HW logic & RT-level detection & compensation: typically works for transient and permanent faults! (universal)
Self Repair for permanent faults!
Transistor- and switch-level compensation: typically works for specific types of transient faults only! (very specific)

3. Repair for Memory and FPGAs
Compensation of transient faults is not enough.
Some technologies for transient compensation can handle
permanent faults, too, but not in the long run and not with
additional transient faults!

Memory Test & Repair

[Figure: memory array with line address decoder, lines, read / write lines, columns, and a spare column.]
Memory Test & Repair (2)
[Figure: the same memory array with spare column, extended by a memory BIST controller.]
... is already state-of-the-art!
FPGA-based Self Repair

[Figure: FPGA macro-blocks working as CPUs, each an array of alternating logic blocks (L) and wiring blocks (W), plus a memory holding application SW & data and configuration SW.]
* e. g. proposed by McCluskey et al. IEEE Design and Test 2004
FPGA-based embedded controller: 8051

In-System FPGA Repair

[Figure: FPGA-based CPUs built from logic blocks (L) and wiring blocks (W); one CPU contains a fault and is under repair. Labels: system function, repair function. A memory holds application SW & data and configuration SW.]

Repair Mechanism: Row/Line-Shift

[Figure: array of CLB rows: rows of occupied CLBs, one row with a faulty CLB, and a reserve row.]
Little overhead for the re-configuration process.
Loss of many “good” CLBs for every fault.
Distributed Backup CLBs

[Figure: array of CLBs marked as functionally occupied, faulty, non-occupied (reserve), and the non-occupied CLB selected as replacement.]
Minimum loss of functional CLBs.
High effort for re-wiring requires massive „embedded“ computing power (32-bit CPU, 500 MHz).
Self Repair within FPGA Basic Blocks
Heterogeneous repair strategies required (memory, logic).
Logic blocks may use methods known from memory BISR.
Additional repair strategies are necessary for logic elements.
The basic overhead of FPGAs versus standard logic
(about a factor of 10) is increased further.
Repair strategies for logic may use some features already
used in FPGAs (e. g. switched interconnects).

Structure of a CLB Slice

[Figure: CLB slice with program input, logic SRAM, logic field with redundant row, multiplexers (MUX), and flip-flops (FF) at the outputs.]
FPGAs for a Solution?

The granularity of re-configurable logic blocks (CLBs)
in most FPGAs is on the order of several thousand transistors.
Replacement strategies must work at a granularity of
blocks in the area of 100-500 transistors for fault densities
between 0.01 % and 0.1 %.
An efficient FPGA repair mechanism requires detailed fault diagnosis
plus specific repair schemes, which cannot be kept as pre-computed
reconfiguration schemes.
Computation of specific repair schemes requires „in-system
EDA“ (re-placement and routing) with a massive demand
for computing power.
There is no source of such „always available“ computing power.
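The granularity argument above can be made concrete with a small calculation. This is a rough sketch assuming independent transistor faults (the slides point out that real faults are correlated, so the numbers are optimistic) and a group of 3 functional blocks plus 1 spare that is repairable if at most one block is faulty. The function names are hypothetical.

```python
from math import comb


def p_block_faulty(n_transistors: int, fault_density: float) -> float:
    """Probability that a block contains at least one faulty transistor."""
    return 1.0 - (1.0 - fault_density) ** n_transistors


def p_group_repairable(n_transistors: int, fault_density: float,
                       k_functional: int = 3, spares: int = 1) -> float:
    """Probability that no more than `spares` blocks of the group are faulty."""
    q = p_block_faulty(n_transistors, fault_density)
    total = k_functional + spares
    return sum(comb(total, i) * q**i * (1.0 - q) ** (total - i)
               for i in range(spares + 1))


# Compare RLB-sized blocks with CLB-sized blocks at a fault density of 0.1 %.
for size in (100, 500, 5000):
    print(size,
          round(p_block_faulty(size, 0.001), 3),
          round(p_group_repairable(size, 0.001), 3))
```

Under these assumptions a block of several thousand transistors is almost certainly faulty at 0.1 % fault density, so one spare per group cannot help, while blocks of about 100 transistors remain repairable in most cases.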

Self-Repairing FPGA ?
[Figure: reconfigurable logic (CLBs alternating with wiring blocks, WB) hosting a virtual CPU; a program memory holds the configuration scheme and the new configuration.]
Advanced FPGA Structures

[Figure: FPGA with rows of CPU / WB / CPU blocks, CLB / WB / CLB blocks, and ALU / WB / MULT blocks.]
... are only partly re-configurable for performance reasons!

FPGA / CPLD Repair

Looks pretty easy at first glance because of the regular architecture!
Requires lines / columns of switches for configuration at
inputs and between AND / OR matrices.
Requires additional programmability of cross-points
by double-gate transistors as in EEPROMs or Flash memory.
Not fully compatible with standard CMOS.
Limited number of (re-) configurations.
Floating gate (FAMOS) transistors are fault-sensitive!

4. Basic Logic Repair Strategies
Repair techniques that replace failing building blocks by redundant
elements from a „silent“ storage are not new.
IBM has been selling such computer systems for decades, specifically for
applications in banks.
But always with few (2-10) backup elements (CPUs), assuming
a small number of failures (< 10) within years.

Mainframes
... will often contain „redundant“ CPUs for eventual fault
compensation. But one faulty transistor then „costs“ a whole CPU,
limiting the fault handling to a few (about 10) permanent fault cases.
Granularity of Replacement
[Figure: expected fault density (1 out of ...) over granularity in transistors, axis from 10^0 to 10^6 with the marks transistor, gate, macro, FPGA block, cores, CPU. Regions: hardly explored (logic), block-level replacement (e. g. FPGAs), core replacement (e. g. CPU).]

Repair Overhead versus Element Loss

[Figure: repair procedure overhead and functioning elements lost over the size of replaced blocks (granularity), axis from 1 to 10M. Labels: prohibitive overhead, prohibitive fault density, and new methods and architectures in between.]
Built-in Self Repair (BISR)

BISR is well understood for highly regular structures such as embedded
memory blocks.
BISR essentially depends on built-in self test (BIST) with high
diagnostic resolution.
Fault Detection -> Fault Diagnosis -> Fault Isolation -> Redundancy Allocation
Fault / Redundancy Management
Redundancy management must monitor faults, replacements, and available redundancy, and
must also re-establish a „working“ system state after power-down states.
Levels of Repair
Transistors - Switch Level
Replace transistors or transistor groups
Losses by reconfiguration (switched-off „good“ devices):
Potentially small (20 – 50 %) for transistor faults
Overhead for test and diagnosis: Very high
Repair overhead will dominate reliability!
Gate Level
Replace gates or logic cells
Losses by reconfiguration:
Medium (60 to 90 %) for single transistor faults
Overhead for test and diagnosis: High
Macro-Block Level
Replace functional macros (ALU, FPU, CPU)
Losses by reconfiguration: High, 99 % or more
Overhead for test and diagnosis: Maybe acceptable
The Fault Isolation Problem
[Figure: a driver feeding Load 1 and Load 2; one load input has a gate short.]
GND-shorts of input gates affect the whole fan-in
network and make redundancy obsolete!
Block-Level Repair
[Figure: group of AND gates (&) with switching elements (SE).]
Blocks of logic / RT elements (gates and larger) each contain
a redundant element that can replace a faulty unit.

Switching Concept (1)

[Figure: configurations 1 and 2 of functional blocks 1 to 3 plus a replacement block between inputs and outputs, with test in / test out terminals.]

Switching Concept (2)

[Figure: configurations 3 and 4 of blocks 1 to 3 plus a replacement block between inputs and outputs, with test in / test out terminals.]

A Regular Switching Scheme

The scheme is regular and scalable by nature, comprising always k functional
blocks of the same nature plus 1 additional block for backup.
Building blocks are separated by (pass-) transistor switches at inputs and
outputs, providing a full isolation of a faulty block.
Always 2 additional pass-transistors between two functional blocks.
The reconfiguration scheme is regular in shifting functionality between
blocks, which results in a simple scheme of administration.
The functional access to the „spare“ block can be used for testing purposes.
In any state of (re-) configuration, the potentially „faulty“ block is connected
to test input / output terminals.
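The regular shifting described above can be sketched as a mapping from logical functions to physical blocks. This is a minimal model under assumptions of our own: physical blocks are numbered 0..k with the spare as block k, and a configuration is identified by the index of the one block that is isolated and connected to the test terminals. The names `configuration` and `isolated` are hypothetical.

```python
from typing import Dict


def configuration(k: int, isolated: int) -> Dict[int, int]:
    """Map logical functions 0..k-1 onto physical blocks 0..k.

    Exactly one physical block (`isolated`) is taken out of functional
    use and connected to test in / test out. Functions at or beyond the
    isolated position are shifted by one block towards the spare.
    """
    if not 0 <= isolated <= k:
        raise ValueError("isolated block index out of range")
    return {f: (f if f < isolated else f + 1) for f in range(k)}


# k = 3 functional blocks plus 1 spare (physical block 3).
print(configuration(3, 3))  # normal operation: the spare is at the test terminals
print(configuration(3, 1))  # block 1 isolated: functions 1 and 2 shift by one
```

Because every configuration differs from its neighbor by a single shift, a small counter or state register is enough to administer the block, which matches the simple administration claimed above.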

Overhead Depending on Block Size

                     Transistors
Basic Element    functional  backup  norm. switch  ext. switch
3/4 2-NAND           12         4         18           24
3/4 2-AND            18         6         18           24
3/4 2-XOR            18         6         18           24
H-Adder              36        12         24           30
F-Adder              90        30         30           36
For small basic blocks, the switches make up the essential overhead (200 %)!
For larger basic blocks, the overhead can be reduced to about 30-50 %
... not counting test and administration overhead!
Extract larger basic units from seemingly irregular logic netlists!

Overhead
Transistors per RLB (3 functional units)
Basic Block   functional  backup  Switches (min. / ext.)  Overhead
2-NAND            12         4          18 / 24             230 %
2-AND             18         6          18 / 24             160 %
XOR               18         6          18 / 24             160 %
Half Adder        36        12          24 / 30             116 %
Full Adder        90        30          30 / 36              73 %
8-bit ALU       4500      1500         168 / 224             38 %

5. Test and Repair Administration
[Figure: two administration styles. Centralized control: a central unit with test generator, test analyzer, configurator and logic status memory, and system monitoring serves the RLBs; this unit may be faulty! De-centralized test and control: each RLB has its own configuration (Conf.) and BIST logic.]

Blocks, Switching, Administration

[Figure: local versus remote (re-) configuration. Local: columns of switches around F-units and red.-units, each with its own Conf.-unit, supervised by a global control unit. Remote: columns of switches around F-units and red.-units with decoders; the Conf.-units sit at the global control unit.]

Combining Test and Re-Configuration

[Figure: a test input drives the logic under test and a reference; a comparator checks the test output and raises fault detect, which controls the next state of the configuration memory / counter.]

Test and Administration

[Figure: functional blocks 1 to n and a replacement block between input switches and output switches, with test in / test out; a self-test circuit with test clock, state register, decoder, fault indicator, and fault flag fixes the state at a fault.]
Each of the elements in a block is testable via specific
test inputs.
Test is done by comparison with reference outputs. The system is run
through states of re-configuration with the same input test pattern applied.
At test, a functional unit is always removed
from normal operation and connected to test I / Os.
In case of a „fault detect“, the system is fixed in the current status.
Such a procedure of self-test and self-reconfiguration can
run at every system start-up, avoiding a central „fault memory“.
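The start-up procedure above can be sketched as a loop over the configurations. This is a behavioral model only, with assumptions of our own: blocks are modeled as Python callables with two inputs, the spare is the last block, and the test patterns with their reference outputs are given as a list. The names `block_passes` and `self_test_and_configure` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Block = Callable[[int, int], int]
Vector = Tuple[Tuple[int, int], int]  # ((input a, input b), reference output)


def block_passes(block: Block, vectors: Sequence[Vector]) -> bool:
    """Compare the block under test against the reference outputs."""
    return all(block(*inputs) == expected for inputs, expected in vectors)


def self_test_and_configure(blocks: List[Block],
                            vectors: Sequence[Vector]) -> Tuple[int, bool]:
    """Step through all configurations; return (isolated block, fault flag).

    In each step one block is connected to the test terminals. On the
    first mismatch the configuration is fixed, so the faulty block stays
    isolated. Without any fault the spare (last block) ends up isolated.
    """
    for under_test, block in enumerate(blocks):
        if not block_passes(block, vectors):
            return under_test, True  # "fix at fault"
    return len(blocks) - 1, False


def nand(a: int, b: int) -> int:
    return 1 - (a & b)


def stuck_at_one(a: int, b: int) -> int:  # hypothetical faulty NAND
    return 1


NAND_VECTORS = [((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(self_test_and_configure([nand, stuck_at_one, nand, nand], NAND_VECTORS))
```

Since the procedure is deterministic, running it again after power-down reproduces the same configuration, which is why no central fault memory is needed in this model.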

Controller for (Re-) Configuration

[Figure: RLB with units f1 to f3 and a reference between two columns of switches, test in / test out, and a scan path (scan in / scan out). The controller contains a decoder producing the switch signals s1 to s4, test control bits, and BISR logic with the signals act, clock, reset, freset, fault, and test.]
RLB + controller minimum complexity: 80 transistors (3 + 1 configuration).
A controller may drive one or several re-configurable
blocks in parallel, depending on their size.

Local Interconnects
The block-based repair scheme so far cannot cover faults on wires between
re-configurable blocks.
For small basic blocks (such as logic gates) the majority of
wiring is between re-configurable units and not covered.
For larger (RT-level) basic blocks the majority of wiring
is within basic blocks and covered.
Schemes that can also cover inter-block wiring are possible,
but require FPGA-like configurable switching and complex switching schemes.

Essentials of the Repair Scheme

Logic self repair is feasible at a cost below triple modular
redundancy (TMR).
There is a trade-off between the size of the reconfigurable
logic blocks (RLBs) and the maximum tolerable fault density.
Administration, not redundancy, makes up the critical overhead.
Effort can be saved by administrating several RLBs in
parallel.
Low-level interconnects between RLBs make for the essential
„single point of failure“ in the repair scheme!

6. De-Stressing
[Figure: component failure rates (10^-4 to 10^-1) over system life time (t1 to t4), with one failure curve without de-stressing and one failure curve with de-stressing.]

The Purpose of De-Stressing

Building blocks in digital systems of equal type may be more or
less heavily used.
Blocks running with the highest dynamic load and at the highest
temperature are candidates for early failure.
Using otherwise „silent“ resources to relieve such units from stress
periodically may benefit the overall life time of the system.
The re-configuration scheme developed for repair may also serve
such a purpose with slight modifications.
... and the scheme must be compatible with repair architectures!

The Scheme of De-Stressing
[Figure: four configuration states (state 0 to state 3) of building blocks BB1 to BB3 and the backup block RB, carrying Task 1 (medium load), Task 2 (low load), and Task 3 (heavy load); in each state one block is connected to test.]
A better initial distribution of tasks and stress makes a better re-distribution.
Repair capabilities can be preserved.
But: De-stressing may need re-organisation within an active system, while repair has been off-line so far!

Modified Control Scheme

For de-stressing, functions have to be shifted while the system
is in „hot“ operation.
As long as all building blocks are fully functional, running two
functional blocks in parallel serving the same inputs and outputs
is possible.
With a total of k building blocks (including the spare one) there are
k „stable“ states of re-configuration (for k = 4: 1 normal, 3 repairs) and (k-1)
intermediate states for „handover“ in case of de-stressing.
No extra switches are necessary, but there is additional overhead
in state management and state decoding.

FSM including Transitional States

[Figure: state graph with the stable states 0, 1, 2, 3 and the transitional states 0/1, 1/2, 2/3 in between; edges are labeled tr=1 and tr=0.]
If a „flying“ transition between repair states becomes necessary,
the control logic will have seven states instead of four!
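The seven-state control logic can be sketched as a small state machine. The transition rule used here is an assumption read from the edge labels of the state graph: from a stable state, tr=1 enters the handover state in which two blocks run in parallel, and tr=0 then completes the move to the next stable state. Staying in the last state when no further block is available is also our assumption. The function names are hypothetical.

```python
from typing import List


def build_states(k: int) -> List[str]:
    """k stable states plus k-1 handover states, e.g. '0', '0/1', '1', ..."""
    states = []
    for s in range(k):
        states.append(str(s))
        if s < k - 1:
            states.append(f"{s}/{s + 1}")
    return states


def next_state(state: str, tr: int, k: int) -> str:
    """One clock step of the assumed transition rule."""
    if "/" in state:
        # Handover: both blocks are active until tr drops back to 0.
        return state.split("/")[1] if tr == 0 else state
    s = int(state)
    if tr == 1 and s < k - 1:
        return f"{s}/{s + 1}"
    return state


states = build_states(4)
print(len(states), states)

state = "0"
trace = [state]
for tr in (1, 0, 1, 0, 1, 0):
    state = next_state(state, tr, 4)
    trace.append(state)
print(trace)
```

For k = 4 building blocks this gives 4 stable plus 3 transitional states, which agrees with the count of seven stated above.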
Control Logic Functionality

[Figure: three BBs and one RB with test in / test out.]
Test access to each of the four basic blocks is possible through the extra test access.
With a test input pattern applied, the RBB is run through the 4 states.
If a BB or the RB is found to be faulty through the test access, the control
is fixed in this state. The faulty block is then not in functional use.
The controller has a „fault“ flag, which indicates the
status of „backup in use“.
Once an RBB has a fault detected, it cannot be used
for de-stressing operations.
As long as an RBB has no fault detected, it can activate
the re-configuration for de-stressing with an extra
control signal, which makes the FSM run through the
scheme of extended logic states for „hot“ re-configuration.

Extended Control Logic

[Figure: reconfigurable block (RB) with test in / test out; an FSM with decoder generates the switch control signals; test out = „1“ means fault detect and sets a fault flag flip-flop (FF). Further signals: test, clock, FF reset, FSM reset, tr.]

7. Overhead and Limitations

BISR requires additional overhead.
The inevitable extra circuitry used for fault administration is
not fault-free by definition.
But we can assume that such circuitry, if fabricated correctly,
is not in heavy use all the time and will exhibit a much reduced
failure rate from stress.
Memory cells used for repair state administration are prone to
transient fault effects from particle radiation.
With suitable state encoding (1-out-of-n code), a parity check
can be applied.
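Why a parity check fits a 1-out-of-n encoding can be shown in a few lines: every valid state word has exactly one bit set, hence odd parity, and any single bit flip produces a word with zero or two bits set. The sketch below models the state register as an integer; the function names are hypothetical.

```python
def one_hot(state: int, n: int) -> int:
    """Encode state 0..n-1 as a 1-out-of-n word."""
    if not 0 <= state < n:
        raise ValueError("state out of range")
    return 1 << state


def parity_ok(word: int) -> bool:
    """A valid 1-out-of-n word has an odd number of set bits."""
    return bin(word).count("1") % 2 == 1


def flip(word: int, bit: int) -> int:
    """Model a transient upset of one memory cell."""
    return word ^ (1 << bit)


n = 4
single_upsets_detected = all(
    not parity_ok(flip(one_hot(s, n), b)) for s in range(n) for b in range(n)
)
print(single_upsets_detected)

# Limitation: two upsets can turn one valid state into another valid state.
double = flip(flip(one_hot(0, n), 0), 1)
print(double == one_hot(1, n), parity_ok(double))
```

The parity check thus detects every single-bit upset of the state register but, as the last lines show, not every double upset.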

Overhead
Overhead factors:
- Number and size of redundant elements,
- Number of switches for (re-) configuration,
- Test and fault diagnosis,
- Control logic,
- Extra overhead for system management.

Cost / Overhead
(3 functional blocks plus 1 backup in RLB)
Basic Block   Trans. funct.  Trans. backup  Switch Trans.  Contr. Unit Tr.*  Overhead % *
2-NAND        3 * 4          4              30             81 / 200          960 / 1950
H-Adder       3 * 12         12             40             81 / 200          369 / 700
F-Adder       3 * 30         30             50             81 / 200          179 / 311
2-bit ALU     3 * 352        352            140            81 / 200          54.2 / 65.5
4-bit ALU     3 * 699        699            180            81 / 200          45.8 / 51.5
8-bit ALU     3 * 1367       1367           260            81 / 200          41.6 / 44.5
* without / with extensions for de-stressing; controller design
optimized for supervision by parity control.

Sources of Overhead
                 Complexity            Overhead in %
Basic Block      (trans.)     redund.  switches  control  ctrl/destr.
2-NAND               4           33      250       675       1666
H-Adder             12           33      111       225        555
F-Adder             30           33       55        90        222
2-bit ALU          352           33       13       7.6       18.9
4-bit ALU          699           33      8.5       3.8        9.5
8-bit ALU         1367           33      6.2         2        4.8
Switches and control overhead dominate; a reasonable lower bound
for the complexity of basic blocks is around 100-200 transistors.
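The percentages in the overhead tables appear to follow a simple rule: every extra transistor (backup block, switches, controller) is related to the transistors of the 3 functional blocks. The slides do not state this formula, so it is an inference; the sketch below reproduces the half adder, full adder, and ALU rows with it. The function name `overhead_percent` is hypothetical.

```python
from typing import Dict


def overhead_percent(unit_transistors: int, switch_transistors: int,
                     control_transistors: int,
                     k_functional: int = 3, spares: int = 1) -> Dict[str, float]:
    """Overhead contributions in % of the functional transistor count."""
    functional = k_functional * unit_transistors
    parts = {
        "redundancy": spares * unit_transistors,
        "switches": switch_transistors,
        "control": control_transistors,
    }
    result = {name: 100.0 * value / functional for name, value in parts.items()}
    result["total"] = sum(result.values())
    return result


# Half adder, controller without de-stressing extension (81 transistors).
print({k: round(v, 1) for k, v in overhead_percent(12, 40, 81).items()})
# 8-bit ALU, controller with de-stressing extension (200 transistors).
print({k: round(v, 1) for k, v in overhead_percent(1367, 260, 200).items()})
```

The fixed controller cost is what makes small blocks so expensive: it is divided by 36 transistors for the half adder but by 4101 for the 8-bit ALU.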

Overhead and Block Size

[Figure: overhead in % (10 to 1000) over basic block size in transistors (10 to 10^4), with one curve for self repair and one for self repair plus de-stressing; 33 % is marked on the overhead axis.]
The Switching Problem (1)

[Figure: three arrangements of pass-transistor switches with their switch control lines: one compensates „always on“ faults, one compensates „always off“ faults, and one compensates both „always on“ and „always off“ faults.]
... always in one single transistor.

Single Points of Failure

[Figure: transistor switches driven by the configuration control network (switch control) between the signal wiring and the reconfigurable logic block (RLB), with three marked fault sites.]
1: short gate - signal input
2: short gate - block input
3: channel short
Pass Transistor Faults
Short
A short condition between the signal input (Usign) and the control
input (Uctrl) may be solved by designing the gate input line (Rbr)
as a fuse. Then one additional transistor is needed as a „power sink“.

Blowing Fuses
[Figure: pass transistor between sin and sout with a gate short; labels: CTL in, VDDhigh, fuse in the gate line, power-sink transistor.]

8. Summary and Conclusions

Logic self-repair is not impossible, but not cheap either.
The lower bound for logic blocks is about 100 transistors.
Experience shows that most logic designs „yield“ some potential
for logic extraction.
Repair technologies work even (much) better for regular processor
architectures such as VLIW processors.
In real-life designs, a large part of the system (memory, 50-90 %;
functional units, 10-40 %) is regular. Only a small fraction is truly
„irregular“ and needs higher overhead.
No such strategy yet for analog and mixed signal circuits!

Real Embedded Systems

[Figure: embedded system with two CPUs (data path, control, and cache each), memory, DSP, and mixed signal / RF blocks.]
... only a small fraction of the real system is truly irregular and needs
„expensive“ logic repair!
Regular Processor Architectures
[Figure: regular processor with register file, control logic, and multiple parallel processing units (Add, Mult); the control logic needs logic BISR.]
Regular processor structures with multiple parallel units need
expensive logic (self-) repair only for their control logic. Reconfiguration
of data-path elements can be arranged by software, which does not wear out!
Design for Repairability

[Figure: design flow. Starting from the RT netlist, extract obvious regular blocks and compose RT-RLBs; in the random logic circuitry, find and extract regular entities until done; compose gate-level RLBs from the rest logic; compose the RLB control scheme; estimate reliability.]
This is the END !
Thank you for not falling asleep !

(I would have....)
