
IEEE TRANSACTION ON RELIABILITY, VOL. XX, NO. X, MONTH 2017

A methodology based on reconfigurable partition technique to deal with permanent faults on FPGA
Victor Manuel Gonçalves Martins, Eduardo Augusto Bezerra, and Marcelo Daniel Berejuck

Victor Manuel Gonçalves Martins and Eduardo Augusto Bezerra: Department of Electrical Engineering, Federal University of Santa Catarina (UFSC), Florianopolis, Santa Catarina, Brazil, E-mail: {victor.martins, eduardo.bezerra}@eel.ufsc.br; Marcelo Daniel Berejuck: Department of Computing Engineering, Federal University of Santa Catarina (UFSC), Ararangua, Brazil, E-mail: marcelo.berejuck@ufsc.br

Abstract: Dynamic Partial Reconfiguration (DPR) has been used as a solution to deal with permanent faults in space-borne platforms based on Field Programmable Gate Array (FPGA) devices when they are exposed to the radiation environment. The solutions based on DPR proposed so far usually make a reservation for the modules of the system, and for an amount of reconfigurable partitions as well. When a fault is detected, the module is moved to another partition, and the partition where the module was previously allocated is discarded. This paper introduces a technique in which the partitions are not discarded, which means a better usage of the silicon resources available in the FPGA. The results show that the proposed mechanism can tolerate permanent faults (hard errors), with only a slight increase of memory required by an embedded processor.
Index Terms: FPGA Reliability, Fault Removal and Prevention, Aging Mitigation.

I. INTRODUCTION
SPACE-BORNE high-performance computing platforms based on Field Programmable Gate Array (FPGA) technology are exposed to harsh radiation environments and can suffer from hardware faults due to cosmic radiation. These faults can be transient or permanent. Transient faults are caused by high-energy subatomic particles striking the FPGA and are characterized as Single Event Effects. As a consequence, they can affect the configuration memory and associated circuitry and ports of an FPGA. Permanent faults can occur due to radiation effects in a device such as Single Event Latch-up (SEL) and the Total Ionizing Dose (TID) [1].
To address this problem, FPGA manufacturers developed technologies to make some devices less sensitive to cosmic radiation. However, such devices can be more expensive than their off-the-shelf counterparts. Several authors have been suggesting Dynamic Partial Reconfiguration (DPR) as a solution to deal with transient and permanent faults on off-the-shelf FPGA devices when used in space-borne computing platforms [2][3][4][5][6]. This technique allows a portion of the FPGA memory to be written while the rest of the device is still working. This feature can allow an FPGA to self-recover from radiation-induced faults by changing the on-chip architecture of the system.
The solutions based on DPR proposed so far usually make a reservation for the modules of the system, and for an amount of reconfigurable partitions as well. When a fault is detected, the module is moved to another partition, and the partition where the module was previously allocated is discarded.
This paper introduces a new approach in which the reconfigurable partitions are not discarded. It is a methodology that allows relocating hardware modules on multiple partitions, without the need for multiple partial bitstreams [7]. To detect the permanent faults on the FPGA, we proposed a Differential Delay Sensor (DDS) technique. It was originally developed to detect a natural physical degradation process that occurs on integrated circuits and is known as "transistor aging" [8]. However, it can be used to detect permanent faults induced by radiation, because any change in the FPGA configuration will produce a delay pattern different from the expected one.
The results show that the proposed mechanism can tolerate permanent faults (hard errors), with only a slight increase of memory required by an embedded processor. In the case of an ongoing fault in a module, the modules can be reallocated to any FPGA reconfigurable partition, as the proposed strategy ensures that the modules will have the same physical interfaces.
The paper is organized as follows. Section II introduces the related work about partial reconfiguration on FPGA. Section III describes technical details regarding the Virtex-6 Field Programmable Gate Array (FPGA) family, used as a case study. The section describes low-level details about the Partial Reconfiguration resources location and routing for that FPGA family. Section IV presents the problem and contributions. Section V explains the implemented case study, including its processing and memory requirements. Section VI summarizes and discusses the results obtained from the case study implementation. Section VII concludes the paper and lists the future work.

II. RELATED WORK
Partial reconfiguration allows changing some system components without interrupting the operation of all of them. Usually, the tools provided by FPGA manufacturers generate different partial bitstreams for each reconfigurable partition. This means that N × M partial bitstreams are required to implement M components in N Reconfigurable Partitions (RPs). This restriction increases the memory needed to accommodate all partial bitstreams.
Several authors have addressed this issue. In [9] there is a list of operations that give RPs flexibility in the use of their partial bitstreams. It requires that the flow Translation → Mapping → Placement and Routing runs twice. Once for use at the end of the "PR2UCF" command to remove the Proxy Logics location, and another time to use the Direct Routing

Constraints of the FPGA Editor, in graphic mode (manual) to remove the routing of signals that connect the RPs to the rest of the system.
The same authors presented in [10] a manual solution for the crossing RPs routing problem, using an FPGA Editor. They developed the Intellectual Property Interface (IPIF) module, which requires all input and output signals of RPs to go through a defined section of the FPGA resources. The module allows the control of I/O signals, reducing the probability of signal routing crossing the RPs. However, the added module further restricts the freedom of the tools used in the following flow. Moreover, the problem of signals crossing an RP was only reduced, not eliminated.
GoAhead, introduced in [11], is a new tool, and not just a manipulation of the Xilinx flow of tools. Instead of using the constraints to force a particular Placement and Routing, they manipulate the Xilinx Design Language (XDL) obtained at the end of Mapping, so that the Placement and Routing Tool has only one option available for a given resource location or routing. In [12] the GoAhead analysis requires repeated conversions between the formats XDL and Native Circuit Description (NCD). These conversions require high implementation times, which makes it unsuitable for FPGAs with higher resource density.
In [12] there is a process used to allow the partial relocation of bitstreams in several RPs. It is based on [10] and it eliminates the problem of routing signals crossing the RPs. Instead of IPIF modules, a Look Up Table (LUT) is added to the static part of the system in each input/output signal of RPs. In this way, the signals that build the interface between the RPs and the rest of the system are delimited by Proxy Logic-LUT pairs. These pairs are then mapped to locations with certain restrictions, and the routing is forced through constraints that ensure the uniformity of the interface between the RPs and the rest of the system.
Other examples of works that adopt partial reconfiguration applied to space-borne platforms are [2], [3], [4], [5] and [6]. In all these proposals the recovery action is performed through the reallocation of the module where a permanent fault was detected to another RP, excluding the entire RP where the module was originally. Moreover, they use standard Partial Reconfiguration tools, resulting in a library N times greater (when compared to the methods of bitstream reallocation as in [12], where N is the total number of RPs). It is done in this way in order to maintain all necessary variants of bitstreams, allowing the reallocation of any module to any RP.
All the strategies found in the related work result in a waste of resources in the partition where a permanent fault is detected, requiring a great amount of external memory to store the entire library of bitstreams.

III. FPGA AND PARTIAL RECONFIGURATION OVERVIEW
This Section introduces the Xilinx Virtex-6 FPGA family [13]. This is the FPGA technology adopted as a case study for the experiments using the methodology introduced in this paper.

A. Resources and Memory Configuration Architecture
The Virtex-6 architecture is similar to the previous Virtex-4 and Virtex-5 ones, and also the latest 7-Series. Virtex-6 devices support Partial Reconfiguration (PR) in a bi-dimensional region of resources. This characteristic allows constraining the position of a module or even changing its properties (configuration) dynamically. The reconfiguration of each rectangular region can be performed from outside, via the Joint Test Action Group (JTAG) or by the SelectMAP interface, or internally using the Internal Configuration Access Port (ICAP) resource.
The set of available FPGA resources is composed of Configurable Logic Blocks (CLBs), blocks of Digital Signal Processing (DSP), Block RAMs, Input-Output Blocks and clock management modules, such as Phase Locked Loops (PLLs). These resources are organized in an array structure (rows vs. columns), and each one has a switch box associated to it. As shown in the example of Fig. 1, the resources are sorted in columns, and each row contains a section of each column. Every row is split in two clock regions, identified as XxYy.
Fig. 1. Example of FPGA resources distribution.
Fig. 1 also shows that the resource distribution is homogeneous if we compare row by row, but is heterogeneous along the row. This detail will be important in the phase of definition of RPs.
In Virtex-6 FPGAs, the CLB column section in a row contains 40 CLBs (50 on the 7-Series). In order to configure a CLB section (40 CLBs), a group of frames is required. Each frame has 81 words of 32 bits. Those frames are adjacent to each other in the FPGA configuration memory. The first part of the frames holds information regarding the routing used to configure the corresponding switch box. The rest of the frames includes the configuration of all the LUTs in the 40 CLBs, of the Slice properties and multiplexers configuration, and also the INIT parameter, which defines the initial values of the Flip-Flops present in the 40 CLBs group.

B. PR Interface and Routing
The reuse of a partial bitstream in different RPs demands a set of rules that must be followed, both regarding the interface

of each RP with the rest of the system and relating to the routing in the respective borders.
1) RPs Resources: The following requirements must be followed regarding the window of resources for each RP:
• All RPs need to include exactly the same FPGA resources with the same physical distribution;
• The vertical RP limits (up and down) need to be coincident with the FPGA clock regions;
• RPs only can include physical I/O resources if they are not used by the system.
After deciding how many resources are assigned to each RP, defining the window of resources using tools such as PlanAhead [14] is an easy process.
2) RPs Interface: A regular tool flow ignores the compatibility of the RP interfaces. So, we must find ways to ensure that compatibility is guaranteed. Fig. 2 a) and b) show two routing examples that do not allow the creation of relatively similar interfaces using the available constraints [15]. Fig. 2 c) shows how to structure the routing between the RPs and the rest of the system.
Fig. 2. Routing Rules (panels a, b and c).
The remaining problems that result from the normal PR tools are:
1) Situations where one output signal from one partition has a fanout greater than one. In this scenario, the net has multiple destinations. Some of them can be inside the RP - Fig. 2 b);
2) Situations where the source of the input signal of the partition has a fanout greater than one. In this situation, the input module is just one of several destinations of the net. Often the same net is the input for multiple partitions, or can be input in more than one submodule in the same partition - Fig. 2 a);
3) As each clock region of an FPGA has multiple clock trees, during the Place and Route step, different relative clock trees may be selected in the RPs.
For the conditions 1) and 2), if the partitions are defined as reconfigurable, these are solved from the partitions side. Configuring Reconfigurable="true" in the xpartition.pxml file, the tools for PR add a primitive LUT1 to every input and output of an RP. These additions are designated Proxy Logic and can be locked to specific resources using the following constraints:
PIN "<rp_name.net_name>" LOC=SLICE_XxYy;
PIN "<rp_name.net_name>" BEL=x6LUT;
Adopting the suggestions of the Isolation Design Flow (IDF) [16], the constraint SCC_BUFFER can be used. As the Proxy Logic, this constraint forces the addition of a primitive LUT1 to every input and output of the system interconnecting with the RPs. Its resource localization can be forced and an example of these constraints is:
PIN "<system.net_name>" SCC_BUFFER=SLICE_XxYy;
PIN "<system.net_name>" BEL=x6LUT;
Regarding the problem 3), it is possible to ensure which clock tree(s) is(are) used in a certain RP, using the primitive BUFHCE [17] with an associated constraint. In addition to setting the clock tree, the primitive BUFHCE allows enabling/disabling the clock signal input in its RP. This is an important requirement in the relocation process of modules that require the suspension of the clock signal. An example of how to add one of these constraints is as follows:
INST "<BUFHCE_name_of_instance>" LOC=BUFHCE_XxYy;
3) RPs Routing: In the previous sections (III-B1 and III-B2), techniques have been described that ensure the compatibility of the FPGA resources between RPs and their location. Herein we present the techniques that ensure compatibility in the system routing.
A necessary guarantee is that the routing of each RP is completely isolated from the rest of the system. This is achieved by adding the flag Isolated="true" to every RP in the corresponding statement in the xpartition.pxml file.
The other important point is to ensure that the routing of the signals that connect the RPs to the rest of the system is compatible. With the application of the techniques presented in Section III-B2, this process can be performed using the Directed Routing Constraints option existing in the tool fpga_edline.exe. With this option, after running the Placement and Routing tool, it is possible to extract the description of how the routing of a selected set of signals was performed. Properly processing this information, it is possible to build a constraints group that ensures the desired routing uniformity of the same set of signals after a second execution of the flow Translation → Mapping → Placement and Routing [15]. Here is an example of a Directed Routing Constraint with Absolute Placement Constraint:
NET "net_name"
ROUTE="{3;1;6vlx240tff1156;9aab57fd!-1;"
"42256;-156792;S!0;-683;-224!1;-9745;1703!2;"
"-10236;-2730!3;1789;1251!4;843;392;L!}";
INST "rp_name/rp_net_name_PROXY" LOC=SLICE_X96Y71;
INST "rp_name/rp_net_name_PROXY" BEL="D6LUT";
INST "SCC_BUFFER_sys/sys_net_name" LOC=SLICE_X91Y71;
INST "SCC_BUFFER_sys/sys_net_name" BEL="C6LUT";
This description is divided into two parts. The first identifies the net name and shows its routing (ROUTE=). The second indicates what resources are source and destination of the net. In this example, the net originates from a Proxy Logic of an RP and targets an existing SCC_BUFFER in the rest of the system.
The second and third lines of the ROUTE constraint show the detailed signal routing. The first two arguments define the absolute xy coordinate of the signal source. The third argument S!0 indicates the signal Source and the connection point 0. The remaining groups of three arguments define coordinates for the following connection points. The final argument, L!, indicates the destination point of the signal, which means an input of an FPGA resource. To find the absolute value at each connection point, the relative value is added to the previous absolute value as follows:
42256;-156792;S!0; (absolute source point)
-683;-224!1;
41573;-157016 (absolute connection point 1)
-9745;1703!2;
31828;-155313 (absolute connection point 2)
-10236;-2730!3;
21592;-158043 (absolute connection point 3)
1789;1251!4;
23381;-156792 (absolute connection point 4)
843;392;L!}";
24224;-156400 (absolute destination point)
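The accumulation above can be checked with a short script. This is only an illustrative sketch: the function name `absolute_points` is hypothetical, and the values are the ones from the ROUTE example in the text.

```python
def absolute_points(source, offsets):
    """Return the absolute (x, y) points of a route.

    The first point is the absolute source; every later point is the
    previous absolute point plus the next relative offset.
    """
    points = [source]
    x, y = source
    for dx, dy in offsets:
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


# Values taken from the ROUTE example in the text.
source = (42256, -156792)
offsets = [(-683, -224), (-9745, 1703), (-10236, -2730), (1789, 1251), (843, 392)]

points = absolute_points(source, offsets)
for index, point in enumerate(points):
    print(index, point)
```

The last entry of `points` is the absolute destination point, and the entries in between are connection points 1 to 4.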
In the description of a signal routing, only the coordinates of the signal source are absolute. Therefore, to achieve consistent routing of two signals belonging to two RPs interfaces with the system description the same signal description can be used, simply modifying the signal source coordinates.
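As a rough illustration of reusing one routing description for a second RP, the sketch below rewrites only the source coordinates of a ROUTE string. It assumes that the source is the pair of integers immediately before the S!0 marker, as in the example in the text; the function name `retarget_route` and the new coordinates are hypothetical, and whether any other field of the string must change in the real tool flow is not addressed here.

```python
import re

# Assumption: the absolute source coordinates are the two integers that
# immediately precede the "S!0" marker, as in the example in the text.
SOURCE_PATTERN = re.compile(r"(-?\d+);(-?\d+);S!0;")


def retarget_route(route, new_source):
    """Return a copy of a ROUTE string with new source coordinates.

    All relative offsets after the source are left untouched, so the
    shape of the route is preserved.
    """
    new_x, new_y = new_source
    result, count = SOURCE_PATTERN.subn(f"{new_x};{new_y};S!0;", route, count=1)
    if count != 1:
        raise ValueError("no source point (x;y;S!0;) found in ROUTE string")
    return result


# ROUTE value from the text, joined into one string.
route_rp1 = (
    "{3;1;6vlx240tff1156;9aab57fd!-1;"
    "42256;-156792;S!0;-683;-224!1;-9745;1703!2;"
    "-10236;-2730!3;1789;1251!4;843;392;L!}"
)

# Hypothetical source coordinates for the same signal in a second RP.
route_rp2 = retarget_route(route_rp1, (42256, -100000))
print(route_rp2)
```

Because every point after the source is relative, the two strings differ only in the source pair.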

IV. DESIGN METHODOLOGY TO DETECT AND RECOVERY THE TARGET FPGAS
Our target is the permanent faults in FPGAs (i.e. hard errors). This Section presents a new partial reconfiguration design flow to increase the reliability, allowing the swap of modules between RPs. A new Differential Delay Sensor (DDS) is also introduced. It was used to help in the decision of when a swap should take place.

A. Proposed Design Flow
Partial Reconfiguration provides extensibility of resources on a device and flexibility to change modules at run time. However, the Xilinx flow requires different partial bitstreams for the same component, if it is to be implemented in different Reconfigurable Partitions (RPs) (even if they have the same structure). This means that N × M partial bitstreams must be generated to implement M components in N RPs. This restriction may significantly increase the amount of memory required to accommodate all partial bitstreams. To solve this problem the Partial Bitstream for Multiple Partitions (PB4MP) was developed in a previous work [7].
The proposed Partial Reconfiguration mechanism requires a deep knowledge of low level details of the target Virtex-6 Family [13]. Using the compilation of knowledge detailed in Section III, plus previous third party work [9], a design flow has been developed [7].
The flow permits implementing an FPGA system where all RPs have the same physical interface, which allows using only one partial bitstream for each module, which can then be allocated in any RP. Therefore only M partial bitstreams are required, instead of the N × M required in the aforementioned works.
Fig. 3 shows the full design flow of a PB4MP-based FPGA system. It is based on the standard Xilinx tools flow (ISE and PlanAhead) and requires only a small user intervention at the initial phase of the hardware design, as in the IDF process [16]. The remaining steps are automated.
Fig. 3. FPGA Development + Design Flow.
The base of the flow is the sequence Synthesis → Floor Plan → Translation → Mapping → Placement and Routing → Configuration Bitstream Generation. It is important that the Hardware Description Language specification is structured to correctly assign each module to the appropriate RP. This includes adding primitives BUFHCE and deciding between adding the normal SCC_BUFFER, as in Fig. 4 a), or a primitive LUT2 that allows disabling all interface signals, as in Fig. 4 b). If the second option is chosen, each physical LUT6 can implement two LUT2, which helps to reduce the FPGA resource overhead. Another step that needs the designer's attention is the Floor Plan, when deciding the RPs location. It is imperative to follow the requirements identified before in this Section.
Fig. 4. The Buffer Options. a) LUT: signal to signal_buffered; b) LUT: signal and if_enable to signal_buffered.
After this stage the PB4MP Tool generates the first constraints group. These constraints will be responsible for the buffers inclusion (SCC_BUFFER or LUT2). This time the tool does not lock the resources for these buffers. It only defines a window of possible locations for them, as shown in Fig. 5. This is a way to give some freedom for the next steps to obtain a good placement and routing.
Initially, the Placement and Routing operation generates an NCD file. The PB4MP tool is then executed again and the nets that interconnect the modules in the RPs to the remaining parts of the system are extracted using the Directed

Routing Constraints tool from fpga_edline.exe (which is the command line version of FPGA editor, used for making changes in the NCD file). With this information, as shown in Fig. 5, the tool chooses one group of interconnect nets to replicate (automatically or defined by the designer), and with this information, a second constraints group (based on ROUTE constraints) is created. The sequence Translation → Mapping → Placement and Routing is executed again. At the end of these two stages the final bitstream is generated and can be loaded into the FPGA.
Fig. 5. Partition interface process.
To take advantage of PB4MP in order to increase reliability, an extra constraints list was added to the flow process. The list is built using the CONFIG PROHIBIT constraint, which provides the option to select the FPGA resources that are not to be used. In the selected policy implemented for this first version, a full resource column is removed from every three columns, delimited by the FPGA rows organization [13]. To assure the desired reliability and maintainability levels, this process never excludes two columns with the same relative position in different RPs, as shown at the bottom in Fig. 8.
This distribution allows two resource management processes: process A, which allows swapping the resources in use, and process B where, if a module presents a permanent fault, it can replace the RP by another module that does not use the same resource where the permanent fault has been detected. The CONFIG PROHIBIT constraint excludes the resource (CLB, DSP, BRAMs, etc) but does not exclude the associated switching box. However, if the resource is not used, there is no routing for that resource, which means the switch box is not used, or it is used in a different way.
This distribution may look like a waste of 33% of the RP resources; however, in an FPGA system, as a result of the routing congestion, this is not a real overhead. When the routing is limited to the RP size, this congestion is bigger. Therefore, in practice, the proposed approach does not imply a relevant additional hardware consumption. Moreover, this is only a demonstration example. In a real system with 10 RPs and the same distribution policy, this resource waste will be less than 10%.

B. Proposed Differential Delay Sensor (DDS)
Transistors present some delay when they are switching from a logical level to another (from 0 to 1, for instance). That delay may grow over time due to the stress conditions of temperature, supply voltage, and duty cycle [18]. This natural physical degradation process is known as "transistor aging" [8]. We are assuming here that any transistor in the FPGA can introduce a delay due to those stress conditions. However, the delay also may be different from an expected value due to harsh radiation environments. It means that when the delay increases, there is no assurance that it was caused by transistor aging or any other reason. In other words, the DDS sensor allows knowing that a failure has occurred, but without knowing what the origin of the failure was.
Considering this scenario, the Differential Delay Sensor (DDS) strategy is proposed in this work. In the DDS strategy, the delays in two identical circuits are measured. Each circuit is a Delay Meter (DM). One of the DMs is the gold sensor that uses FPGA resources which are not used in any other circuits (they are only used to implement the gold sensor). The second DM uses resources that are in use when a specific module is allocated in a specific RP. This strategy allows the delay measurement with almost no aging effects influence, as temperature and supply voltage are equal on both sensors.
Each Delay Meter has a column of multiplexers, connected in series and with all outputs registered as shown in Fig. 6. Normally, the input is always low ('0'). To measure the delay, in a clock rising edge the input is forced to high ('1'), and in the next clock rising edge, all multiplexer outputs are registered. As each multiplexer implies a small delay, any variation in the delay will result in a different vector presented in the flip-flops (FFs) used to register the multiplexer outputs.
Fig. 6. Delay Meter (DM) Diagram.
The Delay Meter construction is based on the Xilinx CARRY4 primitive [17] and on the flip-flops available in the same FPGA slice. As shown in Fig. 7, each CARRY4 primitive has four multiplexers, and in the Xilinx Virtex-6 family, together, the four multiplexers have a delay of 68 ps. In a 200 MHz clock system, a full CARRY4 allows detecting a clock variation equivalent to around 2.7 MHz. Reading all four multiplexer outputs simultaneously, the precision is around 17 ps, allowing the detection of a 0.7 MHz clock variation (in the same 200 MHz system). In our Delay Meter strategy, the CARRY4 inputs DI and S are assigned to high, the input CINIT is used in the first element to be the DM input, and

the CI is connected to the previous CO(3). The outputs O are ignored.
Fig. 7. Virtex-6 CARRY4 Primitive [17].
For the Differential Delay Sensor to measure the true aging, all modules must have two small FPGA resource groups reserved. The first one, where the gold Delay Meter is allocated, gives the delay reference and has the same relative position in the FPGA. The second one is an exclusive resource group reserved for the Delay Meter implementation. Only when the system performs a measurement will it be configured as a DM. In the second group, in contrast, another DM has its resources configured to force the stress phase along the multiplexer chain. When the system decides to measure the aging, the gold DM is temporarily implemented and both delays are registered and analyzed. At the bottom left side of Fig. 8, the first column represents this resource allocation policy used to implement the Differential Delay Sensor (DDS).
The option to use the CARRY4 primitives and no other resources such as, for instance, LUTs is the preferred one, as this implementation almost does not use FPGA switching boxes to interconnect the used resources. This results in a better control to assure that both Delay Meters are identical. Moreover, Xilinx does not share technical information about the switch box resource, making it difficult to use them.
The DDS has another advantage when compared to approaches such as [19] that use Ring Oscillator (ROSC) sensors. In our proposed approach, the sensor is static (the gold Delay Meter is not even used 99.9% of the time), which means less power consumption.

V. STUDY CASE IMPLEMENTED
The block diagram in Fig. 8 shows the System-on-Chip Platform [20] implemented as a case study, which is used to explore and to evaluate the proposed approach. The platform supports dynamic reconfiguration, and it was implemented in a Xilinx XC6VLX240T FPGA. It is based on a MIPS32 ISA, the Plasma softcore, which is freely available at Opencores [21]. Any processor (hardcore or softcore) could be used. We adopted that platform because it works well and fulfills our needs regarding hardware structure.
The platform employs the Advanced eXtensible Interface 4 (AXI4) family of protocols, which is becoming the industry's standard for bus-based interconnections. The hardware reconfiguration is done by the Internal Configuration Access Port (ICAP) interface, which is interconnected through the AXI4 and managed by the Embedded Parallel Operating System (EPOS) [22] running on Plasma, which provides the necessary run-time support to implement the system features and the communication with the remaining modules. RP1, RP2 and RP3 represent the RPs that contain, in this case, one Advanced Encryption Standard (AES) component each (three copies, in a Triple Modular Redundancy (TMR) configuration).
The TMR option has two motivations: firstly, because it provides fault prevention and fault recovery capabilities to the system; secondly, because it represents the worst case scenario, as it uses the same FPGA resources to implement the three modules (all of them in the same component). The three AESs are implemented following the resources distribution discussed in Section IV-A, plus the policy regarding the small resource groups described in Section IV-B. We have chosen a Real-Time Star Network-on-Chip (RTSNoC) [23] based interconnection scheme for connecting the reconfigurable partitions to the Central Processor Unit (CPU) where the voter is running (implemented in software). Each AES is connected to a port of an RTSNoC router, and one of the RTSNoC router ports is connected to an AXI4 bridge.
To enable the Delay Meters characterization and to monitor the system, the System Monitor (SYSMON) [24] primitive was included, which allows measuring the temperature, the supply voltage (VCCINT) and the device current (ICCINT).

A. TMR Architecture
The TMR mechanism implies implementing the same module/function three times. The redundant modules run in parallel, performing the same computation. The results from the three modules are analyzed by another module, responsible for voting the correct result. This way, even if a module produces wrong results, the other two will guarantee the correct operation. The weak point in this strategy is the voter, as if it fails, the whole TMR strategy will also fail. The major advantage of TMR is its error detection and correction feature. An important drawback is the need to employ three times more hardware [8].

B. Memory Usage and Recovery Execution Time
The memory usage and the execution time of the software that implements the recovery procedure depend on several details of the adopted strategy. The basis of the strategy is the same, i.e. swapping modules between RPs, but it can be done in two different ways:
• The modules are available externally to the FPGA in a memory holding the partial bitstreams;
• The swap is implemented by moving the corresponding part of the FPGA configuration memory.
The adopted alternative is the second one, as it does not need extra hardware (the external memory). Nevertheless, this
solution can be more or less demanding on the memory and runtime requirements. If the Operating System (OS) has a considerable amount of free memory available, it is possible to read and write the whole configuration memory of the corresponding module at once. However, if the free memory is very limited, the option is to read and write the configuration frame by frame. In this work, we adopt the latter solution.

[Figure: block diagram of the FPGA. Recoverable labels: MIPS32 (EPOS) processor P0, Internal RAM, ICAP, Voter, AXI4 Bus, System Monitor, Timer, UART, RTSNoC, reconfigurable partitions RP1, RP2 and RP3 with Delay Meters (DM), swap strategies A and B. Legend: Gold Delay Meter, Module Delay Meter, Unused Resources, Permanent Fault.]
Fig. 8. System Diagram.

In order to evaluate the memory consumption and the run time, a set of equations was defined based on TABLE I and TABLE II. TABLE I describes all software functions, their memory usage (compiled using gcc for the MIPS32 architecture with the -O3 optimization flag) and their run time requirements for the Virtex-6 family. The Bytes column shows the memory usage required for the code part ($M_{CODE}$). The Run time column details the parameters that influence the execution time, indicating the number of clock cycles. TABLE II details all hardware parameters that influence the run time and the memory usage ($M_{DATA}$).

TABLE I
SOFTWARE ROUTINES WITH INDIVIDUAL MEMORY AND EXECUTION TIME CHARACTERISTICS.
Software Routine | Description | Bytes | Run time (Cycles)
RP_swap(rp1, rp2) | Function that uses the XHwICAP routines to implement the algorithm to swap two RPs. | 416 |
DDS_measure(rp) | Function that uses the XHwICAP routines below to measure the delay in one RP (*). | 2376 |
XHwICAP Base Driver | Base Driver to communicate with XHwICAP (includes the Init(), SelfTest() and GetConfigReg() functions). | 7012 |
XHwICAP_DeviceReadFrame() | Routine to read one frame from the FPGA configuration memory. | 556 | $NC_{RF}$ (3720)
XHwICAP_DeviceWriteFrame() | Routine to write one frame to the FPGA configuration memory. | 784 | $NC_{WF}$ (4010)
Total $M_{CODE}$ | | 11144 |

TABLE II
PARTITION CHARACTERISTICS.
RP Parameters | Parameters description for each hardware module
$Size_{Frame}$ | Number of bytes in one frame.
$N_{RPs}$ | Number of RPs in the system.
$N_{RPFrames}$ | Number of frames that hold one full RP configuration.
$Frequency$ | Frequency of the system clock (used by the processing unit).

The memory required is divided into two parts: code ($M_{CODE}$, detailed in TABLE I) and temporary data ($M_{DATA}$). The data section stores the initial addresses of the RPs, four bytes for each one, and temporarily backs up the FPGA configuration memory content of the RPs being swapped. In the low-cost option, with one read/write operation per frame, twice the frame size $Size_{Frame}$ is required (equation 1a). In the worst case, when the swap is done with a single whole read/write operation for each RP, twice the size of the configuration memory of an RP is required, that is, twice the frame size $Size_{Frame}$ for each of the $N_{RPFrames}$ frames (equation 1b).

    $M_{DATA} = N_{RPs} \times 4 + Size_{Frame} \times 2$    (1a)

    $M_{DATA} = N_{RPs} \times 4 + Size_{Frame} \times N_{RPFrames} \times 2$    (1b)

(*) Regarding the function DDS_measure(rp), debugging the configuration memory of the target FPGA (Virtex-6 family) showed that the CLBs are organized in small columns of 40 CLBs each. There are 36 frames to configure any one of these columns. However, only 12 frames are used to configure the CLBs themselves. The other 24 set up the switch boxes, which are not used in the Delay Meter (DM), so this section of the configuration memory does not need to be modified. To configure the portion of memory covered by the 12 frames, it is not necessary to reserve memory for backing up the full frames. It is sufficient to keep masks that change only the mandatory bits, which saves memory. The storage of these masks is nevertheless the reason why the DDS_measure(rp) function requires a certain amount of memory.
Equation 2 gives the amount of memory required to implement the recovery ability. The total amount depends on the FPGA family, on the frame size ($Size_{Frame}$), and on the number of frames moved in each read/write operation.

    $MEM_{Total} = M_{CODE} + M_{DATA}$    (2)
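As a worked illustration of equations 1a, 1b and 2, the following C sketch evaluates the data and total memory footprint for both swap options. The function names and the enum are hypothetical and are not part of the authors' software; only the formulas come from the text. The example values quoted below (3 RPs, 324-byte frames, 1,524 frames per RP, 11,144 bytes of code) are the ones reported in TABLE I and in Section VI.

```c
#include <stddef.h>

/* Hypothetical helper names; only the formulas come from equations 1a, 1b and 2. */
enum swap_mode {
    SWAP_FRAME_BY_FRAME, /* low-cost option, equation 1a */
    SWAP_WHOLE_RP        /* worst case, equation 1b */
};

/* Temporary data memory in bytes: 4 bytes per RP start address plus two buffers. */
size_t m_data_bytes(size_t n_rps, size_t size_frame, size_t n_rp_frames,
                    enum swap_mode mode)
{
    size_t buffer = (mode == SWAP_WHOLE_RP) ? size_frame * n_rp_frames
                                            : size_frame;
    return n_rps * 4 + buffer * 2;
}

/* Equation 2: total memory is the code part plus the data part. */
size_t mem_total_bytes(size_t m_code, size_t m_data)
{
    return m_code + m_data;
}
```

With the reported values, `m_data_bytes(3, 324, 1524, SWAP_FRAME_BY_FRAME)` evaluates to 660 bytes and the total to 11,804 bytes, whereas the whole-RP option would need 987,564 bytes of data memory, which shows why the frame-by-frame option suits a memory-constrained processor.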
The total number of clock cycles required to execute a complete swap between two RPs ($NC_{Total}$) is determined by two factors (equation 3): the FPGA family, because the frame size $Size_{Frame}$ influences the number of clock cycles spent by the functions XHwICAP_DeviceReadFrame() ($NC_{RF}$) and XHwICAP_DeviceWriteFrame() ($NC_{WF}$); and the RP size $N_{RPFrames}$, which defines how many frames need to be read and written.

    $NC_{Total} = (NC_{RF} + NC_{WF}) \times N_{RPFrames} \times 2$    (3)

After obtaining the total number of cycles, equation 4 gives the time in seconds required to swap two RPs ($T_{Swap}$).

    $T_{Swap} = \frac{NC_{Total}}{Frequency}$    (4)

Regarding the differential delay measurement, a new parameter $N_{CLBsConfFrames}$ was defined. This value corresponds to the number of frames used to configure a single small CLB column. Equation 5 shows that, for each of the $N_{CLBsConfFrames}$ frames, it is necessary to read the corresponding frame, change the needed bits, and write the modified frame back, in order to implement the gold Delay Meter (or to release the corresponding resources).

    $NC_{DDS} = (NC_{RF} + NC_{WF}) \times N_{CLBsConfFrames}$    (5)

The DM is always small enough to be implemented in one single small CLB column. Because of that, the value $NC_{DDS}$ depends only on the FPGA family, which has an impact on almost all parameters.

C. DM and DDS Oriented Tests
Although the system of Fig. 8 has been developed to implement strategies A and B, two test programs were written for the purpose of analyzing the DM and DDS behavior.
1) DM Characterization and DDS Measure: The first test program, designed to study the DM behavior and the DDS differential measurement, is described in the following pseudo-code.

    void main_application_DM_characterization() {
        ...
        init_counter();
        while (1) {
            ...
            /* Keep the AES module allocated in RP3 busy. */
            send_data(&AES3, input_data);
            get_data(&AES3, output_data[3]);
            /* Periodically sample the DDS of RP3 and the System Monitor. */
            if (read_counter() >= COUNTER_LIMIT) {
                DDS_measure(3, dds[3]);
                Get_Sysmonitor(&sysmon);
                printf("Golden DM(RP3):%f, Module DM:%f", dds[3].dm_golden, dds[3].dm_module);
                printf("Temperature:%f, Vccint:%f, Iccint:%f", sysmon.temperature, sysmon.vccint, sysmon.iccint);
                reset_counter();
            }
        }
    }

In this test, the processor uses only the AES module allocated in RP3 and cyclically (about three times per second) reads the DDS values and some values from the System Monitor primitive (the temperature, the supply voltage $V_{CCINT}$ and the current $I_{CCINT}$). The RP3 partition happens to be the one that is physically right next to the System Monitor primitive in the device, so the System Monitor provides the most representative temperature reading available for that partition. The module is used at a moderate rate so that the RP3 partition is not overheated too much. However, to speed up the observation of the temperature evolution, the cooling fan coupled to the FPGA was turned off during testing.
2) Reconfigurable Partitions Temperature Measure: The second program, aimed at measuring temperature differences between RPs, is exemplified in the next pseudo-code.

    void main_application_DDS_temperature() {
        ...
        init_counter();
        count100 = 0;
        enable_AES_RP3 = 1;
        while (1) {
            /* The AES module in RP3 is exercised only while enabled. */
            if (enable_AES_RP3) {
                send_data(&AES3, input_data);
                get_data(&AES3, output_data[3]);
            }
            if (read_counter() >= COUNTER_LIMIT) {
                DDS_measure(1, dds[1]);
                DDS_measure(3, dds[3]);
                Get_Sysmonitor(&sysmon);
                printf("Golden DM(RP1):%f, Golden DM(RP3):%f", dds[1].dm_golden, dds[3].dm_golden);
                printf("Temperature:%f, Vccint:%f, Iccint:%f", sysmon.temperature, sysmon.vccint, sysmon.iccint);
                count100++;
                /* Every 100 samples, toggle the activity of the RP3 module. */
                if (count100 >= 100) {
                    enable_AES_RP3 ^= 0x01;
                    count100 = 0;
                }
                reset_counter();
            }
        }
    }

In this algorithm, the processor still communicates only with the module allocated in RP3, now in an intermittent (flashing) mode, but in addition to reading the DDS values in RP3 and the System Monitor primitive, it also obtains the DDS values present in RP1. This demonstrates that it is possible to monitor the temperature at different RPs, which also creates the conditions to use this information in the future to minimize the FPGA aging due to the Negative Bias Temperature Instability (NBTI).

VI. EXPERIMENTAL RESULTS AND DISCUSSION
In the System-on-Chip case study shown in Fig. 8, the input data are sent to the three AES modules (TMR). After some processing time, the results are retrieved from the modules. The three results are analyzed by a voter unit, implemented in software. If a mismatch is detected, the faulty module is identified, and another RP is selected to swap with the faulty one. Before performing the swap, the input clock signals are disabled on both RPs. The voting function runs on the Plasma CPU, making it a single point of failure. To overcome this drawback, a solution based on [25] is under development as related work aiming to improve the CPU dependability figures. In a real system, concurrently with this operation, the DDS_measure(rp) function is called periodically to identify modules that have been in the same RP for too long. If the Differential Delay Sensor (DDS) sets off an alarm (returns an alarm value), the module is swapped with another one, following a policy based on swapping history analysis.
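The voting and fault-identification step described above can be sketched in C as follows. This is only an illustrative assumption about how a software 2-of-3 voter might look: the function name, the 16-byte block size (one AES block) and the return convention are hypothetical and are not taken from the authors' implementation.

```c
#include <string.h>

#define BLOCK_BYTES 16       /* assumed output size: one AES block */
#define TMR_ALL_AGREE   (-1) /* no faulty module detected */
#define TMR_NO_MAJORITY (-2) /* all three outputs differ */

/*
 * Hypothetical 2-of-3 majority voter.
 * Returns TMR_ALL_AGREE when the three outputs match, the index (0..2) of
 * the single disagreeing module when exactly one differs, and
 * TMR_NO_MAJORITY when no two outputs match.
 * When a majority exists, the agreed block is copied into `voted`.
 */
int tmr_vote(unsigned char out[3][BLOCK_BYTES], unsigned char voted[BLOCK_BYTES])
{
    int eq01 = memcmp(out[0], out[1], BLOCK_BYTES) == 0;
    int eq02 = memcmp(out[0], out[2], BLOCK_BYTES) == 0;
    int eq12 = memcmp(out[1], out[2], BLOCK_BYTES) == 0;

    if (eq01 && eq02) { memcpy(voted, out[0], BLOCK_BYTES); return TMR_ALL_AGREE; }
    if (eq01)         { memcpy(voted, out[0], BLOCK_BYTES); return 2; }
    if (eq02)         { memcpy(voted, out[0], BLOCK_BYTES); return 1; }
    if (eq12)         { memcpy(voted, out[1], BLOCK_BYTES); return 0; }
    return TMR_NO_MAJORITY;
}
```

In such a scheme, a non-negative return value would identify the module whose RP is a candidate for swapping, while the no-majority case would have to be handled separately because the voter cannot tell which result is correct.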
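Equations 3 to 5 of Section V-B can be evaluated with a few lines of C before looking at the measured figures. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the inputs are the per-frame cycle counts of TABLE I (3,720 for a read, 4,010 for a write) together with the TABLE III values (1,524 frames per RP, 50 MHz clock).

```c
#include <stdint.h>

/* Hypothetical helper names; the formulas follow equations 3, 4 and 5. */

/* Equation 3: every frame is read and written; the factor 2 presumably
 * covers the two RPs that take part in the swap. */
uint64_t nc_total(uint64_t nc_rf, uint64_t nc_wf, uint64_t n_rp_frames)
{
    return (nc_rf + nc_wf) * n_rp_frames * 2;
}

/* Equation 5: read-modify-write of the frames that configure one CLB column. */
uint64_t nc_dds(uint64_t nc_rf, uint64_t nc_wf, uint64_t n_clbs_conf_frames)
{
    return (nc_rf + nc_wf) * n_clbs_conf_frames;
}

/* Equation 4: convert a cycle count into seconds at the given clock frequency. */
double cycles_to_seconds(uint64_t cycles, double frequency_hz)
{
    return (double)cycles / frequency_hz;
}
```

For example, `nc_total(3720, 4010, 1524)` gives 23,561,040 cycles, which `cycles_to_seconds(..., 50e6)` converts to about 0.471 s, and `nc_dds(3720, 4010, 12)` gives 92,760 cycles.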

Each AES module takes 4,172 LUTs, which in the Virtex-6 family [13] means that a minimum of 522 CLBs are required. As a section of a CLB column has 40 CLBs, at least 14 such CLB columns are needed. In practice, due to placement and routing requirements (without any CONFIG_PROHIBIT constraint), the module implementation requires 23 CLB columns. This FPGA occupation figure is a good opportunity to implement the strategy presented at the bottom of Fig. 8, with no waste of extra resources in the second stage and with no routing congestion. Each Delay Meter (DM) takes 10 slices (a CLB has two slices in Virtex-6 FPGAs). With this in mind, a whole small CLB column is reserved for the three modules.
The characteristics of each RP are presented in TABLE III.

TABLE III
AES RP CHARACTERISTICS.
RP Parameters | Parameters Values
$Size_{Frame}$ | 324
$N_{RPs}$ | 3
$N_{RPFrames}$ | 1,524 (1,012 for CLBs + 512 for BRAMs)
$Frequency$ | 50 MHz (DMs use 200 MHz internally)

A. Analysis of Memory Usage and Recovery Execution Time
Using TABLE III in equation 1a results in $M_{DATA}$ = 660 bytes, and from equation 2 it follows that our approach needs only 11,804 bytes of additional system memory. Regarding the run time, swapping two RPs takes 23,561,040 clock cycles (equation 3), which corresponds to 471 ms (using equation 4). Regarding the delay measurement in one module, given that $N_{CLBsConfFrames}$ = 12, equation 5 returns 92,760 clock cycles (1.8 ms).
Considering the RPs of Fig. 8, each A swap operation takes 1.4 seconds, and each B swap operation takes 471 ms. Although this is still a considerable time, the probability that a B swap operation occurs is relatively low. The period between A swap operations will be large (days or weeks), and they can be performed when the system is in an idle state. As the delay measurement is performed just once a day (or once a week), the 1.8 ms required for each measurement can be considered negligible.

B. Analysis of Partial Bitstreams Library
The tool bit_inform.exe has been developed to help understand the influence of the resource distribution policy on the modules' bitstreams. This tool takes the three bitstreams (from the three modules) and compares their data frames. The comparison results are listed in TABLE IV.

TABLE IV
BITSTREAMS CONSTRAINTS INTERFERENCE.
Detail | Without Constraints | With Constraints
Average Total Bits 0 | 1689700 | 1685917
Average Total Bits 1 | 280219 | 284002
Total Matching Bits 0 | 1396224 (-18%) | 1342436 (-21%)
Matching Routing Bits 0 | 1055328 (-15%) | 1025529 (-17%)
Matching Settings Bits 0 | 340896 (-76%) | 316907 (-78%)
Total Matching Bits 1 | 59063 (-79%) | 14728 (-95%)
Matching Routing Bits 1 | 18706 (-87%) | 4755 (-97%)
Matching Settings Bits 1 | 40357 (-72%) | 9973 (-93%)

The first two lines show the average number of bits 0 and bits 1 in the three analyzed bitstreams. The remaining lines list the number of matching bits (0 and 1) found in all bitstreams, divided into two groups: the routing bits, used to configure the switching boxes; and the settings bits, responsible for the CLB configuration. For each value, the reduction percentage relative to the corresponding total average is given.
It is clear that the percentage reduction in bits 0 is not very significant. This happens because, in general, if a resource is not used, its corresponding configuration memory is filled with 0s. As a result, all bits 0 used to configure a CLB remain 0 in another bitstream where the same CLB is not used. Regarding the bits 1, the results show that, with constraints to force the desired distribution, the matching bits decrease by about 95% (97% for routing bits and 93% for settings bits). This reduction will never be 100%, as the interface between each module and the remaining FPGA logic must be identical in order to implement PB4MP. The routing percentage reduction also shows that, with the same resource distribution, the switching box configuration memory is affected in almost the same way, even if no constraints have been used for it.

C. Simulation of CARRY4 Primitive Aging
To obtain an observable difference in the Differential Delay Sensor measurements, the FPGA system must be left working for several weeks, or even for months. To speed this up, some FPGA devices would have to be sacrificed by exposing them to harsh environments.
To help prove the DDS operation, an equivalent circuit for the CARRY4 primitive, as shown in Fig. 7, has been designed and analyzed using the AgingCalc [26] tool. Each multiplexer was built only with NAND gates from the tool's existing library. As the Virtex-6 family is produced in 40nm technology [27], and AgingCalc does not have this option, two 32nm CMOS libraries were selected: the 32nm bulk library and the 32nm HP library. These two libraries, from the Predictive Technology Model (PTM) [28], were developed by the Nanoscale Integration and Modeling (NIMO) group of Arizona State University (ASU).
Several simulation runs showed that, with all inputs assigned to '1', the CARRY4 primitive exhibits more aging effects in the carry-chain propagation. Two temperature scenarios are analyzed for both libraries: 50°C (122°F) and 100°C (212°F). Fig. 9 shows the AgingCalc results for the first 10 years.
There are two important remarks regarding these data: at 100°C, aging makes the path delay increase more and more; and a large part of the degradation happens in the first year (minimum 5% and maximum 22%).
[Figure: two panels, "Path-delay with Ageing" (delay in ps) and "Path-delay Degradation with Ageing" (degradation in %), over years 0 to 10, for the 32nm_bulk and 32nm_HP libraries at 50°C and 100°C.]
Fig. 9. CARRY4 equivalent circuit Aging Analysis

This means that, when it is not possible to ensure a controlled temperature, it is imperative to mitigate the aging effects from the very first moment.
Fig. 9 shows results obtained from an equivalent CARRY4 circuit with the worst-case vector applied to its inputs. When the best-case vector is applied (with all inputs assigned to '0'), the degradation after ten years was between 0% and 1% in all simulations. This means that it is possible to spare the FPGA resources of the gold Delay Meter (DM) sensor, which allows the final DDS objective to be reached: measuring the aging over time with no temperature (or supply voltage) interference.

D. Delay Meter Characterization
Running the test code presented in Section V-C1 with the FPGA cooling fan off, it was possible to observe the evolution of the temperature inside the device and its influence on the DMs of a DDS. Taking advantage of the System Monitor primitive, it was possible to correlate the values obtained through the DDS with the temperature, supply voltage $V_{CCINT}$ and current $I_{CCINT}$ values measured by the primitive. The first relationship, shown in Fig. 10, is the characterization of a Golden DM.

[Figure: Delay Meter value (in bits) versus System Monitor temperature (35°C to 60°C), with linear trend line y = -0.1394x + 32.877, R² = 0.9539.]
Fig. 10. Characterization of a DM used in Virtex-6.

In order to minimize the risk of getting a wrong reading, each Golden DM value in Fig. 10 corresponds to an average of 100 consecutive measurements performed by the routine DDS_measure(rp) whenever it is called. Equation 6 was deduced from the resulting graph, in this case with $m = 0.1394$ and $y_o = 32.877$.

    $y(x) = -m \cdot x + y_o$    (6)

Although the trend is linear and has a Coefficient of Determination $R^2 = 0.9539$, the value obtained from the DM undergoes strong fluctuations as the temperature increases. These fluctuations are due to the FPGA's own clock resources [29], which automatically detect that the clock signal has changed due to the increasing temperature and compensate for this change. However, even with this mechanism in the device, the value read by a DM still tends to decrease with increasing temperature, following a straight line. After this characterization, it is possible to use this DM also as a temperature sensor.

E. Differential Delay Sensor Characterization
Having validated the utility of a DM as a temperature sensor, it was necessary to validate that a pair of these sensors can measure the aging of the FPGA resources. Using the same data obtained in the previous test (resulting from the first code), a first graph was drawn with the evolution of the values of both DMs of the same DDS (Golden DM and Module DM), followed by a bar graph indicating the result of GoldenDM − ModuleDM. Both graphs are shown in Fig. 11.

[Figure: top panel, Golden DM and Module DM values (in bits) versus System Monitor temperature (35°C to 48°C); bottom panel, bar graph of Golden DM − Module DM over the same range.]
Fig. 11. Noise/error analysis in reading of DDS.

It is apparent that, with an average difference of less than half a bit (positive or negative), the two DMs always track each other, even when strong fluctuations caused by the clock features occur.

F. Temperature Influence on Current and Power Supply FPGA
Drawing on the same data obtained by running the first code, the influence of temperature variation on the current $I_{CCINT}$ that runs through the FPGA was analyzed. The relationship between these two quantities is shown in the graph of Fig. 12.
It is clear that the increase in temperature also entails an increase in the current flowing through the device. Between 35°C and 60°C the increment is nearly 250 mA per 10°C, and this increment tends to grow further as the temperature increases.
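To illustrate why the increment grows with temperature, the C sketch below evaluates a second-order trend line and its slope. It assumes that the trend line annotated in Fig. 12 reads I(T) = 0.0004 T² − 0.0166 T + 3.5341 (current in amperes, T in °C); these coefficients are rounded values read from the figure annotation, and the function names are hypothetical.

```c
/* Assumed coefficients, as read (rounded) from the Fig. 12 trend-line annotation. */
#define FIT_A 0.0004
#define FIT_B (-0.0166)
#define FIT_C 3.5341

/* Fitted current in amperes at temperature t_celsius. */
double iccint_fit(double t_celsius)
{
    return FIT_A * t_celsius * t_celsius + FIT_B * t_celsius + FIT_C;
}

/* Local slope of the fit, expressed in mA per 10 degrees Celsius. */
double iccint_slope_ma_per_10c(double t_celsius)
{
    return (2.0 * FIT_A * t_celsius + FIT_B) * 10.0 * 1000.0;
}
```

Under these assumed coefficients the slope rises from about 114 mA per 10°C at 35°C to about 314 mA per 10°C at 60°C, a range that brackets the figure of nearly 250 mA per 10°C quoted in the text. Because the annotated coefficients are rounded, the numbers should be read as indicative only.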
[Figure: System Monitor current (in A) versus System Monitor temperature (35°C to 60°C), with trend line y = 0.0004x² - 0.0166x + 3.5341, R² = 0.998.]
Fig. 12. Temperature influence on current $I_{CCINT}$.

The same relationship was established between the temperature and the supply voltage $V_{CCINT}$ that feeds the FPGA, giving rise to the graph of Fig. 13.

[Figure: System Monitor supply voltage (in V, about 1.028 to 1.033) versus System Monitor temperature (35°C to 60°C).]
Fig. 13. Temperature influence on power supply $V_{CCINT}$.

In the opposite direction to the current $I_{CCINT}$, the voltage decreases with temperature, by approximately 1 mV for every 10°C. Although the temperature does influence this voltage, the influence is much smaller than the one on the current.
By multiplying the voltage by the current, the evolution of the power consumption with rising temperature was also examined. Fig. 14 shows the resulting graph.

[Figure: power (in W) versus System Monitor temperature (35°C to 60°C), with trend line y = 0.0004x² - 0.018x + 3.6738, R² = 0.9977.]
Fig. 14. Temperature influence on power consumption $V_{CCINT} \cdot I_{CCINT}$.

Influenced by temperature in a similar way to the current $I_{CCINT}$, the power consumption increases by about 250 mW per 10°C (in the range between 35°C and 60°C). This means that, after the temperature increased by 25°C, the power consumption of the device increased by 16%.

G. Using DDS as Temperature Sensor in FPGA Partitions
By running the second program code (Section V-C2), which uses only the Golden DMs of the DDSs in the RP1 and RP3 partitions, the possibility of also using the DDS as a local temperature sensor was analyzed. The results obtained are plotted in the graphs of Fig. 15.

[Figure: top panel, Golden DM values (in bits) for RP3 and RP1; bottom panel, SysMon temperature (in °C); both over time (clock cycles).]
Fig. 15. Temperature measurement using the Golden DM existing in DDS.

In the first half of the test, the FPGA cooling fan was off; in the second half it was on. This difference is apparent in two ways: the temperature oscillations read by the System Monitor primitive (SysMon) reach higher values in the first half; and the value read from the RP1 Golden DM also fluctuates more strongly in the first half. This means that, without active cooling, the device not only heats up, but part of that heat also spreads to the rest of the FPGA, mainly affecting adjacent areas. Even with active cooling, the SysMon reading always keeps some oscillation. As already mentioned, this is because the RP3 partition is physically close to the System Monitor primitive, which is therefore always strongly influenced by the heating that occurs in this partition.
In order to compare the temperature between RP1 and RP3, and simultaneously show that the characterization registered in Fig. 10 is valid, the equation of the trend line $y(x) = -0.1394x + 32.877$ was converted into the equation $x(y) = \frac{32.877 - y}{0.1394}$. Applying this equation to the Fig. 15 data produced the graph in Fig. 16.

[Figure: temperature (in °C) over time (clock cycles) for Golden DM (RP3), Golden DM (RP1) and System Monitor.]
Fig. 16. Conversion of the Golden DM values in temperature.
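The conversion used for Fig. 16 amounts to inverting the linear characterization of equation 6. A minimal C sketch follows, assuming the fitted constants m = 0.1394 and y_o = 32.877 reported for the characterized Golden DM; the function names are hypothetical, and a different device or DM would presumably need its own characterization.

```c
#define DM_SLOPE  0.1394 /* m: DM bits lost per degree Celsius (Fig. 10 fit) */
#define DM_OFFSET 32.877 /* y_o: fitted DM value extrapolated to 0 degrees Celsius */

/* Equation 6: expected Golden DM value (in bits) at a given temperature. */
double dm_from_temperature(double t_celsius)
{
    return -DM_SLOPE * t_celsius + DM_OFFSET;
}

/* Inverse of equation 6: temperature estimate from a Golden DM reading. */
double temperature_from_dm(double dm_value)
{
    return (DM_OFFSET - dm_value) / DM_SLOPE;
}
```

For instance, a Golden DM reading of 26.0 bits maps to roughly 49.3°C under these constants.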
In this new chart it is possible to see and compare the temperatures in RP1 and RP3 with the one measured by the SysMon primitive. Again, the difference between the first and the second part is observable. In the second part, with cooling, the temperature variations in the RP3 partition have a smaller amplitude, and the temperature variations in the RP1 partition suffer almost no interference from the changes occurring in the RP3 partition.
The second program algorithm, by provoking a strong intermittent activity in the module allocated in the RP3 partition, also causes an alternation between two temperature levels. With no active cooling, the temperature switches between 48°C and 56°C, and with active cooling it switches between roughly 45°C and 51°C.

VII. CONCLUSIONS AND FUTURE WORK
This work's main objective is to improve dependability attributes in an FPGA-based design. A new aging sensor is proposed, the Differential Delay Sensor (DDS), which together with the developed design flow (PB4MP) adds an aging effects mitigation unit to an FPGA system. The aim is not only to provide fault recovery capabilities to the system but also to allow fault prevention regarding aging effects.
The new mitigation unit has a different philosophy when compared to works where the focus is on the aging effects of critical paths. The presented approach tackles the aging effects in all the resources of each Reconfigurable Partition from the very beginning. According to the results shown in Fig. 9, it is crucial to provide aging control from the early usage stages, before the permanent part of $V_{th}$ increases.
An important feature of the DDS is that its aging measurements are free from interference from other sources, such as temperature or power supply. When any of these sources changes, it causes similar interference levels in both Delay Meters, so the difference always reflects the actual aging effect.
The AgingCalc tool uses Predictive Technology Model models, which can have certain discrepancies from the technologies used to produce the FPGAs. The Triple-Oxide Approach [27] is an example used by Xilinx to achieve significant static power reductions. The real gate-level construction of the CARRY4 primitive may also differ from the equivalent circuit designed here, resulting in more or less path-delay degradation.
The priority is to monitor and to prevent the aging effects with no concerns regarding the critical paths. This will mitigate the NBTI in its general form, allowing any module to be allocated in the RPs.
The test reported in the graphs of Fig. 15 showed that, using only one DM of the DDS present in each allocated module, it is possible to monitor the temperature in each of the RPs. This feature can allow the system to protect the RPs against overheating, which the AgingCalc simulations in Section VI-C showed to accelerate the aging due to NBTI. Also, as shown in Fig. 14, the heating causes an increase in power consumption, which is another phenomenon to be combated.
As future work, DDS PB4MP will be adapted to be easily used in the Vivado flow. It is also planned to perform the same analysis for the Positive Bias Temperature Instability (PBTI), so that AgingCalc will support both forms of Bias Temperature Instability (BTI). Once the FinFET libraries (from 2nm to 7nm nodes) have been included in AgingCalc, it will be interesting to evaluate the aging effects in recent FPGAs.
To validate the simulation results, it will be important to sacrifice some FPGA devices, exposing them to extreme conditions, which will give more practical results.

ACKNOWLEDGMENT
This work has been partly funded by the Brazilian National Council for Scientific and Technological Development (CNPq).

REFERENCES
[1] V. Dumitriu, L. Kirischian, and V. Kirischian. Run-time recovery mechanism for transient and permanent hardware faults based on distributed, self-organized dynamic partially reconfigurable systems. IEEE Transactions on Computers, 65(9):2835–2847, Sept 2016.
[2] David P. Montminy, Rusty O. Baldwin, Paul D. Williams, and Barry E. Mullins. Using relocatable bitstreams for fault tolerance. In Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pages 701–708, Edinburgh, Scotland, August 2007.
[3] Lev Kirischian, Victor Dumitriu, Pil Woo Chun, and Galina Okouneva. Research article - mechanism of resource virtualization in RCS for multitask stream applications. International Journal of Reconfigurable Computing, 2010(159367):1–13, 2010.
[4] Cristiana Bolchini, Antonio Miele, and Chiara Sandionigi. Increasing autonomous fault-tolerant FPGA-based systems lifetime. In Proceedings of the 17th IEEE European Test Symposium (ETS), pages 1–6, Annecy, France, May 2012.
[5] Jaime Espinosa, David de Andres, Juan Carlos Ruiz, and Pedro Gil. Tolerating multiple faults with proximate manifestations in FPGA-based critical designs for harsh environments. In Proceedings of the IEEE International Conference on Field Programmable Logic and Applications (FPL), pages 292–299, Oslo, Norway, August 2012.
[6] Victor Dumitriu, Lev Kirischian, and Valeri Kirischian. Run-time recovery mechanism for transient and permanent hardware faults based on distributed, self-organized dynamic partially reconfigurable systems. IEEE Transactions on Computers, PP(89):1–14, December 2015.
[7] Victor Manuel Gonçalves Martins, Joao Gabriel Reis, Horácio C. C. Neto, and Eduardo Augusto Bezerra. Designing partial bitstreams for multiple Xilinx FPGA partitions. In Proceedings of the IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 256–259, Vancouver, Canada, May 2015.
[8] Dhiraj K. Pradhan. Fault-Tolerant Computer System Design. Prentice Hall, 1996.
[9] Yoshihiro Ichinomiya, Sadaki Usagawa, Motoki Amagasaki, Masahiro Iida, Morihiro Kuga, and Toshinori Sueyoshi. Designing flexible reconfigurable regions to relocate partial bitstreams. In Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), page 241, Toronto, Canada, May 2012.
[10] Yoshihiro Ichinomiya, Motoki Amagasaki, Masahiro Iida, Morihiro Kuga, and Toshinori Sueyoshi. A bitstream relocation technique to improve flexibility of partial reconfiguration. In Proceedings of the IEEE International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), pages 139–152, Fukuoka, Japan, September 2012.
[11] Christian Beckhoff, Dirk Koch, and Jim Torresen. GoAhead: A partial reconfiguration framework. In Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 37–44, Toronto, Canada, May 2012.
[12] Tomás Drahonovský, Martin Rozkovec, and Ondrej Novák. Relocation of reconfigurable modules on Xilinx FPGA. In Proceedings of the IEEE Annual International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS), pages 175–180, Karlovy Vary, Czech Republic, April 2013.
[13] Xilinx Inc. Virtex-6 FPGA configuration user guide v3.9 (ug360). http://www.xilinx.com/support/documentation/user_guides/ug360.pdf, November 2015.
[14] Xilinx Inc. Partial reconfiguration tutorial: PlanAhead design tool v14.5, April 2013. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/PlanAhead_Tutorial_Partial_Reconfiguration.pdf.
[15] Xilinx Inc. Constraints guide v14.5 (ug625).
http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/cgd.pdf, April 2013.
[16] Xilinx Inc. Isolation design flow.
http://www.xilinx.com/applications/isolation-design-flow.html, 2016.
[17] Xilinx Inc. Virtex-6 libraries guide for HDL designs v14.7 (ug623).
http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/virtex6_hdl.pdf,
October 2013.
[18] W. Mizubayashi, T. Mori, K. Fukuda, Y. X. Liu, T. Matsukawa,
Y. Ishikawa, K. Endo, S. O’uchi, J. Tsukada, H. Yamauchi, Y. Morita,
S. Migita, H. Ota, and M. Masahara. PBTI for n-type tunnel FinFETs. In
Proceedings of the International Conference on IC Design & Technology
(ICICDT), pages 1–4, Leuven, Belgium, June 2015.
[19] Deepashree Sengupta and Sachin S. Sapatnekar. Predicting circuit aging
using ring oscillators. In Proceedings of the IEEE 19th Asia and
South Pacific Design Automation Conference (ASP-DAC), pages 430–
435, Singapore, January 2014.
[20] Tiago Rogério Mück and Antônio Augusto Fröhlich. Seamless integration
of HW/SW components in a HLS-based SoC design environment. In Proceedings
of the International Symposium on Rapid System Prototyping
(RSP), pages 109–115, Montreal, Canada, October 2013.
[21] OpenCores. Opencores. http://opencores.org, November 2014.
[22] The EPOS Project. Embedded parallel operating system.
http://epos.lisha.ufsc.br, 2014.
[23] Marcelo Daniel Berejuck. Worst-case Latency Network-on-chip for
Real-time Systems. Technical report, Federal University of Santa
Catarina, Florianopolis, Brazil, 2015. PhD Thesis.
[24] Xilinx Inc. Virtex-6 FPGA system monitor user guide v1.2 (ug370).
http://www.xilinx.com/support/documentation/user_guides/ug370.pdf,
September 2014.
[25] Tuo Li, Muhammad Shafique, Semeen Rehman, Jude Angelo Ambrose,
Jörg Henkel, and Sri Parameswaran. DHASER: Dynamic heterogeneous
adaptation for soft-error resiliency in ASIP-based multi-core systems. In
Proceedings of the IEEE/ACM International Conference on Computer-
Aided Design (ICCAD), pages 646–653, San Jose, USA, Nov. 2013.
[26] Jakson dos Santos Pachito. Aging Prediction Methodology for Digital
Circuits. Technical report, University of Algarve, Faro, Portugal,
2012. MSc Thesis, http://w3.ualg.pt/~jsemiao/port/ files/Jackson
thesis final.pdf.
[27] Xilinx Inc. Power consumption at 40 and 45nm - white paper: Spartan-6
and Virtex-6 devices v1.0, April 2009.
http://www.xilinx.com/support/documentation/white_papers/wp298.pdf.
[28] Arizona State University (ASU). Predictive technology model (PTM)
website, January 2012. http://ptm.asu.edu/.
[29] Xilinx Inc. Virtex-6 FPGA clocking resources user guide v2.3 (ug362).
http://www.xilinx.com/support/documentation/user_guides/ug362.pdf,
January 2014.
