
Computer Engineering

Self Repair Technology for Logic Circuits

Architecture, Overhead and Limitations

Heinrich T. Vierhaus
BTU Cottbus
Computer Engineering Group

CREDES / ZUSYS / DAAD Summer School 2011, Tallinn



Outline
1. Introduction: Nano Structure Problems
2. The Problem of Wear-Out
3. Repair for Memory and FPGAs
4. Basic Logic Repair Strategies & Structures
5. Test and Repair Administration
6. De-Stressing Strategies
7. Cost, Overhead, Single Points of Failure
8. Summary and Conclusions


1. Introduction

A bunch of new problems from nano structures ...


Nanoelectronic Problems
Lithography:
The wavelength used to „map“ structural information from
masks to wafers is larger (4 times or more) than the minimum
structural features (193 nm versus 90 / 65 / 45 nm).
Layouts must be adapted to correct for mapping faults.

Statistical Parameter Variations:
The number of doping atoms in MOS transistor channels becomes so
small that statistical variations of doping densities have an impact
on device parameters such as threshold voltages.


New Problems with Nano-Technologies

[Figure: a light source (wavelength 193 nm) projects the mask (reticle) pattern onto the resist-coated wafer; feature size down to 28 nm]

Layout Correction

[Figure: modified layout for compensation of mapping faults]

Compensation is critical and non-ideal.
Faults are not random but correlated!
Requires fast fault diagnosis.

Doping Fluctuations in MOS Transistors

[Figure: MOS transistor cross-section with poly-Si gate, n regions with individual n doping atoms, and p-substrate]

Density and distribution of doping atoms
cause shifts in transistor threshold voltages!


Nanostructure Problems
Individual device characteristics such as Vth depend more strongly
on statistical variations of underlying physical features such
as doping profiles.
Primary Relevance: Yield

A significant share of basic devices will be „out of specs“ and needs
replacement by backup elements for yield improvement after
production.
Primary Relevance: Yield

Smaller features mean higher stress (field strength, current
density) and also foster new mechanisms of early wear-out.
Primary Relevance: Lifetime

Recognition and compensation of transient errors „in time“ is becoming a must,
e. g. because charged particles can discharge circuit nodes.
Primary Relevance: Dependability


Fault Tolerant Computing

Fault event, handled at one of three levels:

Software-based fault detection & compensation:
Works only for transient faults! (specific)

HW logic & RT-level detection & compensation:
Typically works for transient and permanent faults! (universal)

Transistor- and switch-level compensation:
Typically works for specific types of transient faults only! (very specific)


2. Wear-Out Problems and Mechanisms


Structures on ICs used to live longer than either their application
or even their users. Not any more ...


IC Structures May Get Tired

„Wear-out“ effects in nano-electronic ICs are likely to appear much earlier,
causing a lot of problems for dependable long-time applications!


Fault Effects on ICs


[Figure: IC cross-section showing metal layers 1-3, a via, polyimide (low-k) insulation, field oxide, n-well, n and p regions, and the (high-k) gate oxide; annotated with metal migration and low-k insulator deterioration]

Transistor deterioration (HCI, NBTI),
eventually gate oxide shorts!


Wear-Out Mechanisms
Metal Migration:
Metal atoms (Al, Cu) tend
to migrate under high current
density and high temperature.

Stress Migration:
Migration effects may be enhanced
under mechanical stress conditions.

Effect:
Metal lines and vias may actually
be interrupted (open lines). The effect is
partly reversible by changing current
directions.


Metal Migration
[Figure: a metal wire between neighboring wires under high current density, shown new and after some time in operation; labels mark voids (holes), an open defect, and a short to a neighbor]

Vias are especially prone to such defects.
The effect is reversible by reversing the direction of current flow!


Transistor Degradation
Negative Bias Temperature Instability (NBTI): Reduced switching speed
for p-channel MOS transistors that have operated under long-time constant
negative gate bias. The effect is partly reversible.

Hot Carrier Injection (HCI): Reduced switching speed for n-channel MOS
transistors, induced by positive gate bias and frequent switching.
Not reversible.

Gate Oxide Deterioration: Induced by high field strength. Not reversible.

Dielectric Breakdown: Insulating layers between metal lines may break,
causing shorts between signal lines.

Design technology must include a prospective „life time budget“!


Management of Wear-Out by
„Fault Tolerant Computing“?
Built-in fault tolerance and error compensation are needed in
nanotechnologies anyway for the management of transient faults.

Wear-out induced faults may show up as „intermittent“ faults first,
which become more and more frequent.

Faults in synchronous circuits and systems are detected „by clock cycle“.
Hence, for many types of fault tolerant architecture, the detection does
not even recognize whether a fault is permanent or not.


Triple Modular Redundancy


[Figure: the input signal feeds Execution Units 1, 2 and 3 in parallel; a comparator / voter delivers the majority result and an error-detect signal]

Can detect and compensate almost any type of fault.
Overhead about 200-300 %, additional signal delays.
The voter itself is not covered but must be a „self checking checker“.
Standard (by law) in avionics applications!
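
The majority voting described on this slide can be sketched in software. This is only an illustration under assumptions, not the hardware voter of the slide: the function names and the two example units are hypothetical.

```python
from collections import Counter


def tmr_vote(results):
    """Return (majority_value, error_detected) for the outputs of three units."""
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        # All three units disagree: no majority, the fault cannot be masked.
        raise RuntimeError("no majority among the three units")
    return value, count < 3


def run_tmr(units, x):
    """Feed the same input to all three units and vote on their outputs."""
    return tmr_vote([unit(x) for unit in units])


def good_unit(x):
    return x + 1


def stuck_unit(x):
    return 0  # hypothetical unit with a permanent fault
```

With `[good_unit, good_unit, stuck_unit]` the vote still returns the correct result and raises the error-detect flag. The voter itself remains uncovered, as noted above.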

Error Detecting / Correcting Codes

[Figure: data and its signature pass through transmission / storage; the signature is recomputed from the received data and compared, yielding a fault-detect signal and driving error correction]

Often applicable to 1- or 2-bit faults only.
Often limited to certain fault models (uni-directional).
Becomes expensive if applied to computational units.

Can TMR and Codes Compensate Permanent Faults?
Fault / error detection circuitry typically works on a clock-cycle basis.
It does not „know“ if a fault is transient or permanent.
A permanent fault is a fault event that occurs repeatedly in several to many
successive clock cycles.
Error correction technology can detect and compensate such permanent faults
as well as transient faults.
A critical condition occurs if transient faults occur on top of
permanent faults. Then the superposition of fault effects is likely to
exceed the system‘s fault handling capacity.
System components that run actively „in parallel“ suffer from the same
wear-out effects. Therefore there is an increase in dependability before
wear-out limits, but no significant life time extension!
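
Since detection works per clock cycle, telling a permanent fault from a transient one needs extra bookkeeping on top of the detector. A minimal sketch, assuming a per-cycle error flag is available and that a run of `threshold` consecutive errors is taken as a suspected permanent fault. Both the function and the threshold value are hypothetical and not part of the presented architecture.

```python
def classify_fault(error_flags, threshold=3):
    """Classify a stream of per-clock-cycle error flags.

    Returns "none", "transient" or "permanent" (suspected), based on the
    longest run of consecutive cycles in which an error was flagged.
    """
    longest = run = 0
    for flag in error_flags:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    if longest == 0:
        return "none"
    return "permanent" if longest >= threshold else "transient"
```

Isolated or scattered flags stay classified as transient. A wear-out fault that first shows up intermittently would therefore only be recognized once its error runs become long enough.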

Redundancy and Wear-Out

During the normal life time of the system, duplication or triplication
can enhance reliability significantly. But area and power consumption
are also roughly tripled.
And by the end of normal operating time (out of fuel / steam) all three
systems will fail shortly one after the other!
Reliability enhancement is not equal to life time extension!
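
The point can be illustrated with the textbook reliability of a 2-out-of-3 system with an ideal voter, R_TMR = 3R^2 - 2R^3, which exceeds the single-unit reliability R only while R > 0.5. The sketch below assumes a Weibull wear-out model for the single unit; the model and the parameters `eta` and `beta` are illustrative assumptions, not data from this talk.

```python
import math


def unit_reliability(t, eta=10.0, beta=4.0):
    """Weibull survival probability of one unit (eta, beta are assumed values)."""
    return math.exp(-((t / eta) ** beta))


def tmr_reliability(r):
    """Probability that at least 2 of 3 identical units work (ideal voter)."""
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    for t in (2.0, 5.0, 8.0, 10.0, 12.0):
        r = unit_reliability(t)
        print(f"t = {t:4.1f}  single: {r:.4f}  TMR: {tmr_reliability(r):.4f}")
```

Early in life the triplicated system is clearly more reliable; once the units approach their common wear-out limit, it falls below even a single unit.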

Self Repair?
Fault event, handled at one of three levels:

Software-based fault detection & compensation:
Works only for transient faults! (specific)

HW logic & RT-level detection & compensation:
Typically works for transient and permanent faults! (universal)

Self Repair for permanent faults!

Transistor- and switch-level compensation:
Typically works for specific types of transient faults only! (very specific)


3. Repair for Memory and FPGAs

Compensation of transient faults is not enough.

Some technologies for transient compensation can handle
permanent faults, too, but not in the long run and not with
additional transient faults!


Memory Test & Repair


[Figure: memory array with line address decoding, lines, read / write lines, columns, and a spare column]

Memory Test & Repair (2)


[Figure: the same memory array with line address decoding, read / write lines, columns and spare column, now with an added memory BIST controller]

... is already state-of-the-art!

FPGA-based Self Repair


[Figure: two FPGA macro-blocks working as CPUs, each a regular array of alternating logic blocks (L) and wiring blocks (W); a memory holds application software & data and the configuration software]

* e. g. proposed by McCluskey et al., IEEE Design and Test 2004

FPGA-based embedded controller: 8051


In-System FPGA Repair


[Figure: two FPGA-based CPUs built from logic blocks (L) and wiring blocks (W), backed by a memory with application software & data and configuration software. One CPU contains a fault and is under repair; the labels „system function“ and „repair function“ mark the tasks that remain to be executed.]


Repair Mechanism: Row/Line-Shift


[Figure: array of CLB rows with rows of occupied CLBs, one row containing a faulty CLB, and a reserve row at the bottom]

Little overhead for the re-configuration process.
Loss of many “good” CLBs for every fault.

Distributed Backup CLBs


[Figure: array of CLBs in which non-occupied reserve CLBs are distributed among the functionally occupied CLBs; a faulty CLB is replaced by a selected replacement CLB]

Minimum loss of functional CLBs.
High effort for re-wiring requires massive „embedded“
computing power (32-bit CPU, 500 MHz).

Self Repair within FPGA Basic Blocks

Heterogeneous repair strategies required (memory, logic).
Logic blocks may use methods known from memory BISR.
Additional repair strategies are necessary for logic elements.
The basic overhead of FPGAs versus standard logic
(a factor of about 10) increases further.
Repair strategies for logic may use some features already
used in FPGAs (e. g. switched interconnects).


Structure of a CLB Slice


[Figure: CLB slice. The logic inputs address an SRAM-based logic field that is loaded via „Program in“; multiplexers (MUX) select the outputs, which can pass through flip-flops (FF). A redundant row of logic / SRAM with its own MUX and FF is added.]

FPGAs for a Solution?


The granularity of re-configurable logic blocks (CLBs)
in most FPGAs is in the order of several thousand transistors.
Replacement strategies must work at a granularity of
blocks of about 100-500 transistors for fault densities
between 0.01 % and 0.1 %.
An efficient FPGA repair mechanism requires detailed fault diagnosis
plus specific repair schemes, which cannot be kept as pre-computed
reconfiguration schemes.
Computation of specific repair schemes requires „in-system
EDA“ (re-placement and routing) with a massive demand
for computing power.
There is no source of such „always available“ computing power.
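
The granularity argument above can be made concrete with a small calculation. The sketch assumes that faults hit transistors independently with probability equal to the fault density, which is a simplification (the lithography slide notes that some faults are correlated). The function name is hypothetical.

```python
def p_block_faulty(n_transistors, fault_density):
    """Probability that a block of n transistors contains at least one fault,
    assuming independent faults with the given per-transistor density."""
    return 1.0 - (1.0 - fault_density) ** n_transistors


if __name__ == "__main__":
    for n in (100, 500, 5000):
        for density in (1e-4, 1e-3):
            print(f"{n:5d} transistors, density {density:.2%}: "
                  f"{p_block_faulty(n, density):.3f}")
```

Under these assumptions a 500-transistor block at a fault density of 0.1 % is hit with a probability of roughly 39 %, while a block of several thousand transistors is almost certainly hit. That is why CLB-sized replacement units are too coarse at such fault densities.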

Self-Repairing FPGA ?
[Figure: reconfigurable logic as an array of CLBs and wiring blocks (WB), with a program memory holding the configuration scheme; a virtual CPU mapped onto the array computes a new configuration]

Advanced FPGA Structures


[Figure: advanced FPGA with CPU, ALU and MULT blocks embedded among CLBs and wiring blocks (WB)]

... are only partly re-configurable for performance reasons!


FPGA / CPLD Repair


Looks pretty easy at first glance because of the regular architecture!

Requires lines / columns of switches for configuration at
inputs and between AND / OR matrices.

Requires additional programmability of cross-points
by double-gate transistors as in EEPROMs or Flash memory.

Not fully compatible with standard CMOS.
Limited number of (re-) configurations.
Floating gate (FAMOS) transistors are fault-sensitive!


4. Basic Logic Repair Strategies

Repair techniques that replace failing building blocks by redundant
elements from a „silent“ reserve are not new.

IBM has been selling such computer systems, specifically for
applications in banks, for decades.

But always with few (2-10) backup elements (CPUs), assuming
a small number of failures (< 10) within years.


Mainframes

... will often contain „redundant“ CPUs for eventual fault
compensation. But one faulty transistor then „costs“ a whole CPU,
limiting the fault handling to a few (about 10) permanent fault cases.

Granularity of Replacement

[Figure: granularity of replacement versus expected fault density (1 out of ...). The granularity axis runs from 10^0 to 10^6 transistors: transistor, gate, macro, FPGA block, cores, CPU. Three regions are marked: hardly explored (logic), block-level replacement (e. g. FPGAs), and core replacement (e. g. CPU).]


Repair Overhead versus Element Loss


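[Figure: repair procedure overhead and functioning elements lost, plotted over the size of replaced blocks (granularity) from 1 to 10M transistors. The small-block end is labelled „prohibitive overhead“, the large-block end „prohibitive fault density“; the region in between calls for new methods and architectures.]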

Built-in Self Repair (BISR)


BISR is well understood for highly regular structures such as embedded
memory blocks.
BISR essentially depends on built-in self test (BIST) with high
diagnostic resolution.

Fault Detection -> Fault Diagnosis -> Fault Isolation -> Redundancy Allocation
(all steps under Fault / Redundancy Management)

Redundancy management must monitor faults, replacements and available redundancy, and
must also re-establish a „working“ system state after power-down states.

Levels of Repair
Transistors - Switch Level
Replace transistors or transistor groups
Losses by reconfiguration (switched-off „good“ devices):
Potentially small (20 – 50 %) for transistor faults
Overhead for test and diagnosis: Very high

Note: Repair overhead will dominate reliability!

Gate Level
Replace gates or logic cells
Losses by reconfiguration:
Medium (60 to 90 %) for single transistor faults
Overhead for test and diagnosis: High

Macro-Block Level
Replace functional macros (ALU, FPU, CPU)
Losses by reconfiguration: High, 99 % or more
Overhead for test and diagnosis: Maybe acceptable

The Fault Isolation Problem

[Figure: a driver feeding Load 1 and Load 2; one load input has a gate short]

GND-shorts of input gates affect the whole fan-in
network and make redundancy obsolete!

Block-Level Repair
[Figure: a block of AND gates (&) including one redundant gate, connected through switching elements (SE)]

Blocks of logic / RT elements (gates and larger) each contain
a redundant element that can replace a faulty unit.


Switching Concept (1)


[Figure: switching states 1 and 2 of a block with Functional Blocks 1-3 and a replacement block between inputs and outputs; in each state one block is connected to the Test in / Test out terminals instead of the functional inputs and outputs]


Switching Concept (2)


[Figure: switching states 3 and 4 of the same block with Functional Blocks 1-3 and a replacement block; again one block per state is connected to the Test in / Test out terminals]


A Regular Switching Scheme


The scheme is regular and scalable by nature, always comprising k functional
blocks of the same type plus 1 additional block for backup.
Building blocks are separated by (pass-) transistor switches at inputs and
outputs, providing full isolation of a faulty block.
There are always 2 additional pass-transistors between two functional blocks.

The reconfiguration scheme is regular in shifting functionality between
blocks, which results in a simple scheme of administration.
The functional access to the „spare“ block can be used for testing purposes.
In any state of (re-) configuration, the potentially „faulty“ block is connected
to test input / output terminals.
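
The shifting scheme can be sketched as a mapping from functions to physical blocks. The numbering below is an assumption made for illustration: blocks 0..k-1 are the functional blocks, block k is the spare, and the configuration state is the index of the block that is isolated and connected to the test terminals.

```python
def configure(k, isolated):
    """Return (function -> block mapping, block on the test terminals).

    k        -- number of functional blocks; block index k is the spare
    isolated -- index of the block taken out of functional use (0..k)
    """
    if not 0 <= isolated <= k:
        raise ValueError("isolated block index out of range")
    # Functions above the isolated block shift by one position towards the spare.
    mapping = {f: (f if f < isolated else f + 1) for f in range(k)}
    return mapping, isolated
```

For `k = 3`, `configure(3, 3)` is the normal state with the spare on the test terminals, and `configure(3, 1)` moves functions 1 and 2 to blocks 2 and 3. Every function moves by at most one block, which is why switches are only needed between neighboring blocks.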


Overhead Depending on Block Size


Transistors (3/4: 3 functional elements out of 4 per block)

Basic Element   Functional   Backup   Norm. switch   Ext. switch
2-NAND              12          4          18             24
2-AND               18          6          18             24
2-XOR               18          6          18             24
Half Adder          36         12          24             30
Full Adder          90         30          30             36

For small basic blocks, the switches make up the essential overhead (200 %)!
For larger basic blocks, the overhead can be reduced to about 30-50 %
... not counting test and administration overhead!

Extract larger basic units from seemingly irregular logic netlists!


Overhead
Transistors per RLB (3 functional units)

Basic Block   Functional   Backup   Switches (min. / ext.)   Overhead
2-NAND            12          4          18 / 24              230 %
2-AND             18          6          18 / 24              160 %
XOR               18          6          18 / 24              160 %
Half Adder        36         12          24 / 30              116 %
Full Adder        90         30          30 / 36               73 %
8-bit ALU       4500       1500         168 / 224              38 %
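
The overhead column appears to follow from (backup + extended switches) / functional transistors. That reading is an assumption inferred from the numbers; it comes close to the tabulated values, for example about 116.7 % for the half adder and 38.3 % for the 8-bit ALU. A short sketch for recomputing it:

```python
# (functional, backup, min. switches, ext. switches) transistors per RLB,
# copied from the table above
RLB_TABLE = {
    "2-NAND": (12, 4, 18, 24),
    "2-AND": (18, 6, 18, 24),
    "XOR": (18, 6, 18, 24),
    "Half Adder": (36, 12, 24, 30),
    "Full Adder": (90, 30, 30, 36),
    "8-bit ALU": (4500, 1500, 168, 224),
}


def overhead_percent(functional, backup, switches):
    """Extra transistors (backup + switches) relative to the functional ones."""
    return 100.0 * (backup + switches) / functional


if __name__ == "__main__":
    for name, (functional, backup, _min_sw, ext_sw) in RLB_TABLE.items():
        print(f"{name:10s} {overhead_percent(functional, backup, ext_sw):6.1f} %")
```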


5. Test and Repair Administration

[Figure: two administration styles. Centralized control: a test generator, a test analyzer, a configurator and a logic status memory with system monitoring serve all RLBs and logic; the central controller may be faulty! De-centralized test and control: each RLB has its own configuration (Conf.) and BIST unit.]


Blocks, Switching, Administration


[Figure: local versus remote (re-) configuration. In both cases columns of switches surround groups of functional units (F-Unit) and redundant units (Red.-Unit). Local: each group has its own configuration unit (Conf.-Unit) below a global control unit. Remote: the groups only contain decoders, and the configuration units sit with the global control unit.]


Combining Test and Re-Configuration


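[Figure: the test input drives both the logic under test and a reference; a comparator checks the test output and raises „fault detect“, which controls the next state of the configuration memory / counter]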


Test and Administration


Each of the elements in a block is testable via specific
test inputs.

Test is done by comparison with reference outputs. The system is run
through states of re-configuration with the same input test pattern applied.
At test, a functional unit is always removed from normal operation and
connected to test I/Os.

In case of a „fault detect“, the system is fixed in the current status.

Such a procedure of self-test and self-reconfiguration can
run at every system start-up, avoiding a central „fault memory“.

[Figure: Functional Blocks 1..n and a replacement block between input switches and output switches, with Test in / Test out terminals. A self-test circuit driven by the test clock raises the fault indicator; the fault flag fixes the state register („fix at fault“), whose decoder drives the switches.]
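
The start-up procedure can be sketched as a loop over the configuration states. The sketch assumes that state s puts block s on the test terminals, that the last block is the spare, and that a golden reference function supplies the reference outputs. All names, including the example NAND blocks, are hypothetical.

```python
def self_test(blocks, reference, test_patterns):
    """Run through the configuration states; state s tests block s.

    Returns (state, fault_flag). On the first mismatch the state is frozen
    ("fix at fault"), which keeps the faulty block out of functional use.
    """
    for state, block in enumerate(blocks):
        for pattern in test_patterns:
            if block(*pattern) != reference(*pattern):
                return state, True
    # No fault found: stay in the last state, where the spare block is idle.
    return len(blocks) - 1, False


def nand2(a, b):
    return 1 - (a & b)


def stuck_at_1(a, b):
    return 1  # hypothetical faulty block


PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Because the result is recomputed at every start-up, no non-volatile record of earlier faults is needed, at the price of repeating the test each time.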


Controller for (Re-) Configuration


[Figure: controller for an RLB with functional blocks f1-f3 and a reference block between two columns of switches. Test control bits arrive through a scan path (scan in / scan out); a decoder derives the switch signals s1-s4 from the configuration state. Further controller signals: BISR clock, test, reset, freset, act and fault.]

RLB + controller minimum complexity: 80 transistors
(3 + 1 configuration).

A controller may drive one or several re-configurable
blocks in parallel, depending on their size.


Local Interconnects
The block-based repair scheme so far cannot cover faults on wires between
re-configurable blocks.
For small basic blocks (such as logic gates), the majority of
wiring is between re-configurable units and not covered.
For larger (RT-level) basic blocks, the majority of wiring
is within basic blocks and covered.
Schemes that can also cover inter-block wiring are possible,
but require FPGA-like configurable switching and complex switching schemes.


Essentials of the Repair Scheme


Logic self repair is feasible at a cost below triple modular
redundancy (TMR).

There is a trade-off between the size of the reconfigurable
logic blocks (RLBs) and the maximum tolerable fault density.

Administration, not redundancy, makes up the critical overhead.

Effort can be saved by administrating several RLBs in
parallel.

Low-level interconnects between RLBs make for the essential
„single point of failure“ in the repair scheme!


6. De-Stressing
[Figure: component failure rates (10^-4 to 10^-1) over system life time (t1 to t4), with one failure curve without de-stressing and one failure curve with de-stressing]


The Purpose of De-Stressing


Building blocks of equal type in digital systems may be more or
less heavily used.
Blocks running with the highest dynamic load and at the highest
temperature are candidates for early failure.
Using otherwise „silent“ resources to relieve such units from stress
periodically may extend the overall life time of the system.

The re-configuration scheme developed for repair may also serve
such a purpose with slight modifications.

... and the scheme must be compatible with repair architectures!

The Scheme of De-Stressing

[Figure: four configuration states (0 to 3) of an RLB in which Task 1 (medium load), Task 2 (low load) and Task 3 (heavy load) are mapped onto building blocks BB1-BB3 and the backup block RB; in each state one block is idle on the test terminals]

A better initial distribution of tasks and stress makes for
a better re-distribution.

Repair capabilities can be preserved.

But: De-stressing may need re-organisation within an
active system, while repair has been off-line so far!



Modified Control Scheme


For de-stressing, functions have to be shifted while the system
is in „hot“ operation.

As long as all building blocks are fully functional, running two
functional blocks in parallel serving the same inputs and outputs
is possible.
With a total of k building blocks (including the spare one) there are
k „stable“ states of re-configuration (for k = 4: 1 normal, 3 repairs) and (k-1)
intermediate states for „handover“ in case of de-stressing.

No extra switches are necessary, but there is additional overhead
in state management and state decoding.


FSM including Transitional States


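[Figure: state diagram. The stable states 0, 1, 2, 3 are linked through the transitional states 0/1, 1/2 and 2/3: tr = 1 moves a stable state into the following transitional state, tr = 0 completes the move into the next stable state.]

If a „flying“ transition between repair states becomes necessary,
the control logic will have seven states instead of four!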
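
A minimal sketch of this seven-state controller follows. The transitions for tr = 1 and tr = 0 follow the state diagram; two behaviors are assumptions that the slide does not specify: a detected fault freezes the current state, and state 3 has no successor.

```python
STATES = ["0", "0/1", "1", "1/2", "2", "2/3", "3"]


def next_state(state, tr, fault=False):
    """Next controller state for the handover signal tr (0 or 1)."""
    i = STATES.index(state)
    if fault or i == len(STATES) - 1:
        return state  # assumption: frozen on fault, no wrap-around from state 3
    stable = i % 2 == 0
    if stable and tr == 1:
        return STATES[i + 1]  # enter handover: two blocks run in parallel
    if not stable and tr == 0:
        return STATES[i + 1]  # complete handover into the next stable state
    return state
```

Driving `tr` with the sequence 1, 0, 1, 0, 1, 0 walks the controller from state 0 through all three handovers to state 3.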

Control Logic Functionality


Test access to each of the four basic blocks is possible through the extra test access.
With a test input pattern applied, the RBB is run through the 4 states.

If a BB or the RB is found to be faulty through the test access, the control
is fixed in this state. The faulty block is then not in functional use.
The controller has a „fault“ flag, which indicates the
status of „backup in use“.
Once an RBB has a fault detected, it cannot be used
for de-stressing operations.
As long as an RBB has no fault detected, it can activate
the re-configuration for de-stressing with an extra
control signal, which makes the FSM run through the
scheme of extended logic states for „hot“ re-configuration.

[Figure: RBB with three basic blocks (BB) and the backup block (RB), plus Test in / Test out terminals]


Extended Control Logic


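[Figure: extended control logic. The reconfigurable block (RB) with Test in / Test out delivers „1“ for fault detect, which sets the fault flag flip-flop (FF). The FSM, driven by the test clock and the tr signal, feeds a decoder that generates the switch control signals; FF reset and FSM reset are separate inputs.]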


7. Overhead and Limitations


BISR requires additional overhead.
The inevitable extra circuitry used for fault administration is
not fault-free by definition.
But we can assume that such circuitry, if fabricated correctly,
is not in heavy use all the time and will exhibit much reduced
failure from stress.
Memory cells used for repair state administration are prone to
transient fault effects from particle radiation.
With a suitable state encoding (1-out-of-n code), a parity check
can be applied.
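
Why a 1-out-of-n code combines well with a parity check can be shown in a few lines: a valid code word has exactly one bit set and therefore odd parity, and any single bit flip produces zero or two set bits, which is even. The helper names below are hypothetical.

```python
def encode_state(state, n):
    """1-out-of-n code word for a repair state (exactly one bit set)."""
    if not 0 <= state < n:
        raise ValueError("state out of range")
    return [1 if i == state else 0 for i in range(n)]


def parity_ok(word):
    """A valid 1-out-of-n word has odd parity; a single bit flip makes it even."""
    return sum(word) % 2 == 1


def flip(word, position):
    """Model a transient single-bit upset in the state memory."""
    return [bit ^ 1 if i == position else bit for i, bit in enumerate(word)]
```

A double upset that moves the single 1 to another position keeps the parity odd and would pass this check, so the sketch only covers single-bit effects.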


Overhead
Overhead factors:

- Number and size of redundant elements,
- Number of switches for (re-) configuration,
- Test and fault diagnosis,
- Control logic,
- Extra overhead for system management.


Cost / Overhead
(3 functional blocks plus 1 backup in RLB)

Basic Block   Trans. funct.   Trans. backup   Switch Trans.   Contr. Unit Tr.*   Overhead % *
2-NAND           3 * 4              4              30             81 / 200        960 / 3600
H-Adder          3 * 12            12              40             81 / 200        369 / 700
F-Adder          3 * 30            30              50             81 / 200        179 / 311
2-bit ALU        3 * 352          352             140             81 / 200        54.2 / 65.5
4-bit ALU        3 * 699          699             180             81 / 200        45.8 / 51.5
8-bit ALU        3 * 1367        1367             260             81 / 200        41.6 / 44.5

* without / with extensions for de-stressing; controller design
optimized for supervision by parity control.


Sources of Overhead
                                    Overhead in %
Basic Block   Complexity (trans.)   redund.   switches   control   ctrl/destr.
2-NAND                 4              33        250        675        1666
H-Adder               12              33        111        225         555
F-Adder               30              33         55         90         222
2-bit ALU            352              33         13          7.6        18.9
4-bit ALU            699              33          8.5        3.8         9.5
8-bit ALU           1367              33          6.2        2           4.8

Switches and control overhead dominate; a reasonable lower bound
for the complexity of basic blocks is around 100-200 transistors.
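
The cost figures can be recomputed from the transistor counts, assuming each overhead share is the extra transistor count divided by the 3 functional blocks, with 81 controller transistors for repair only and 200 with de-stressing. The formula is inferred from the tabulated values, not stated in the slides. It agrees with the totals for all rows except the 2-NAND case with de-stressing, where it gives about 1950 % rather than the tabulated 3600 %.

```python
# name: (transistors per basic block, switch transistors), from the cost table
BLOCKS = {
    "2-NAND": (4, 30),
    "H-Adder": (12, 40),
    "F-Adder": (30, 50),
    "2-bit ALU": (352, 140),
    "4-bit ALU": (699, 180),
    "8-bit ALU": (1367, 260),
}
CONTROL = {"repair only": 81, "with de-stressing": 200}


def overhead(size, switches, control, k=3):
    """Overhead shares in percent of the k functional blocks."""
    functional = k * size
    parts = {"redundancy": size, "switches": switches, "control": control}
    pct = {name: 100.0 * value / functional for name, value in parts.items()}
    pct["total"] = sum(pct.values())
    return pct


if __name__ == "__main__":
    for name, (size, switches) in BLOCKS.items():
        for mode, control in CONTROL.items():
            total = overhead(size, switches, control)["total"]
            print(f"{name:10s} {mode:18s} {total:8.1f} %")
```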


Overhead and Block Size


[Figure: overhead in % (10 to 1000, with 33 % marked) over basic block size (10 to 10^4 transistors), with one curve for self repair and one for self repair plus de-stressing]

The Switching Problem (1)


[Figure: three arrangements of redundant pass-transistor switches with their switch control lines. One arrangement compensates an „always on“ fault, one compensates an „always off“ fault, and a combined arrangement compensates both.]

... always in one single transistor.



Single Points of Failure


[Figure: a transistor switch between the signal wiring and a reconfigurable logic block (RLB), with its switch control driven by the configuration control network. Marked fault locations:]

1: short gate - signal input
2: short gate - block input
3: channel short

Pass Transistor Faults

[Figure: pass transistor with a short between signal input and control input]

A short between the signal input (Usign) and the control
input (Uctrl) may be handled by designing the gate input line (Rbr)
as a fuse. Then one additional transistor is needed as a „power sink“.


Blowing Fuses
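[Figure: circuit for blowing the fuse. Elements shown: control input (CTL in), supply VDDhigh, fuse in the gate line, gate short to the signal input (sin), signal output (sout), and the power-sink transistor.]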


8. Summary and Conclusions


Logic self-repair is not impossible, but not cheap either.

The lower bound for logic blocks is about 100 transistors.

Experience shows that most logic designs „yield“ some potential
for logic extraction.
Repair technologies work even (much) better for regular processor
architectures such as VLIW processors.
In real-life designs, a large part of the system (memory, 50-90 %;
functional units, 10-40 %) is regular. Only a small fraction is truly
„irregular“ and needs higher overhead.
There is no such strategy yet for analog and mixed signal circuits!


Real Embedded Systems


[Figure: embedded system with two CPUs (each with data path, control and cache), memory control, DSP, memory, and a mixed signal / RF block]

... only a small fraction of the real system is truly irregular and needs
„expensive“ logic repair!

Regular Processor Architectures

[Figure: regular processor with control logic, register file and multiple parallel processing units (Add, Mult); the control logic is marked „needs logic BISR“]

Regular processor structures with multiple parallel units need
expensive logic (self-) repair only for their control logic. Reconfiguration
of data-path elements can be arranged by software, which does not wear out!

Design for Repairability


[Figure: design flow. Starting from the RT netlist, extract obvious regular blocks and compose RT-level RLBs. In the remaining random logic circuitry, find and extract regular entities and compose gate-level RLBs until done, leaving a random logic rest. Then compose the RLB control scheme and estimate reliability.]

This is the END !

Thank you for not falling asleep !


(I would have....)
