
FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks
Kai Zhao, Sheng Di, Senior Member, IEEE, Sihuan Li, Xin Liang, Yujia Zhai, Jieyang Chen, Kaiming Ouyang,
Franck Cappello, Fellow, IEEE, and Zizhong Chen, Senior Member, IEEE

Abstract—Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems
in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by
high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference
process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is
unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault
tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on how to protect the CNN
inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic
ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional
ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow
integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our
evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results
demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%∼8% in both error-free and
error-injected situations).

Index Terms—Algorithm-Based Fault Tolerance, Deep Learning, Silent Data Corruption, Reliability, High-Performance Computing

• Kai Zhao, Sihuan Li, Yujia Zhai, and Zizhong Chen are with the Department of Computer Science and Engineering, University of California, Riverside, CA 92521.
• Sheng Di and Franck Cappello are with the Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL 60439.
• Xin Liang and Jieyang Chen are with the Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831.

1 INTRODUCTION

Deep learning using convolutional neural networks (CNNs) is becoming the key state-of-the-art technique in science and technology fields such as image classification [1], [2], [3], object detection [4], natural language processing [5], medical image analysis [6], and drug design [7]. More and more scientific research (such as cosmological simulation and materials analysis) is also addressing the great potential of leveraging CNN techniques to analyze extremely large amounts of data in a supercomputer environment, achieving unprecedented discoveries in their domains [8].

The reliability of CNN inference is becoming a critical concern [9] because CNN inference applications are being widely utilized in different scenarios, including high-performance scientific simulations and safety-critical systems [10], [11] such as aerospace and autonomous vehicles. CNN inference applications usually run for a long time or continuously to process many inference tasks. For example, the inference engine in autonomous vehicles runs continuously to predict road conditions. As a result, even though a single inference task for one input finishes in seconds, the reliability of CNN inference is still critically important given the long execution time of the inference applications.

In the domain of CNN inference, machine learning applications can be very error prone for two reasons. On the one hand, recent literature indicates that soft errors are inevitable in modern systems, from edge computing devices to supercomputers [12], [13], because of multiple factors [14] such as high-energy cosmic radiation [15], aging, and wear of devices [16]. On the other hand, CNN inference applications often call for power-efficient and cost-efficient machine learning accelerators, which [17], [18] may adopt overclocking with voltage underscaling, incurring more soft errors than common hardware does.

Soft errors may cause serious consequences in CNN inference systems. Recent studies [19], [20], [21] indicate that resilient convolutional neural networks are essential for guaranteeing the correctness of inference applications. Researchers [19] demonstrate that a single bit flip during CNN image classification could result in as much as a 40% and 70% SDC rate in the datapath and memory, respectively. Such high SDC rates would degrade the CNN prediction accuracy dramatically. Furthermore, a neutron beam test [20] shows that when running YOLO [4] classification, the failure-in-time (FIT) rate caused by SDCs could be as much as 38 for the Nvidia K40 and 96 for the Nvidia Tegra X1, which fails to meet the ISO 26262 standard for functional safety of road vehicles [22].

Existing resilient solutions are insufficient for protecting CNN inference applications against these soft errors. Error-correcting code (ECC), for example, suffers from memory area cost and relatively high latency and power consumption. According to [23], ECC with chip-kill applied to all data, compared with no ECC protection, has an average of 40% overhead in memory energy, 20% overhead in system energy, and 20% overhead in performance for computation-bounded applications. Moreover, ECC cannot handle multiple bit flips or computational errors. Techniques based on instruction duplication (ID) [24] incur high overhead and require both application-specific and hardware-specific optimization, and optimizing and deploying ID techniques on all CNN accelerators is difficult.

Considering all the drawbacks and limitations of ECC and ID, algorithm-based fault tolerance (ABFT) [25] is an attractive solution for realizing resilient CNNs. It has much lower overhead than other techniques, and it is architecture independent, meaning that it supports any hardware accelerator. The idea of ABFT is to detect and/or correct soft errors based on known invariants of the algorithm. Over the past thirty years, ABFT schemes have been successful in detecting errors for matrix operations [26], [27], [28], [29], [30], [31], iterative methods [32], [33], [34], [35], data transformation kernels [36], and sorting algorithms [37]. However, the existing ABFT schemes for matrix operations focus mainly on large and square matrices. Moreover, they incur more than 50% overhead when applied to CNN soft error protection (shown by our experiments in Section 6.3).

In this paper, we propose a strategy comprising a series of ABFT schemes for protecting the CNN inference stage against soft errors. We focus on the convolutional layers in CNN because they consume the major portion of the computation time [38], [39], [40], [41].

The main contributions of this paper are summarized as follows.
• We design several ABFT schemes that can be applied to any convolution implementation on any hardware. They can detect and correct errors at runtime. We provide an in-depth analysis of the ABFT schemes in terms of fault protection ability and runtime.
• We design a multischeme workflow for soft error protection with layerwise optimization to obtain a high detection/correction ability with limited runtime overhead. Additionally, our solution can protect the bias operation, grouped convolution, and back propagation.
• We implement an efficient soft error detection library for CNN, called FT-Caffe, and evaluate FT-Caffe on ImageNet [42] using four popular CNN models: AlexNet [1], VGG-19 [2], ResNet-18 [3], and YOLOv2 [4]. Experimental results on the Bebop supercomputer [43] using up to 128 nodes demonstrate that FT-Caffe can keep the correctness of the inferences with 4%∼8% overhead in both error-free and erroneous cases.

In the rest of the paper, we first introduce background about convolutional layers and the existing ABFT techniques applicable to matrix-matrix multiplication (MM)-based convolution implementations. In Section 3, we propose four novel ABFT schemes that can be applied to any convolution implementation. In Section 4, we analyze the fault protection ability and runtime of the four schemes and propose an efficient multischeme workflow integrating all four schemes. In Section 5, we discuss how to support bias, grouped convolution, and back propagation. In Section 6, we evaluate our solutions for both the error-free case and the erroneous case. In Section 7, we discuss related work on fault tolerance in convolutional neural networks. We present our concluding remarks in Section 8.

2 BACKGROUND

This section introduces the high-level ideas of convolutional layers and the existing ABFT techniques for MM-based convolution algorithms. The notations and symbols used in this paper are summarized in Table 1.

2.1 Definition of Convolutional Layer

The convolutional layer can be represented as the following convolution operation:

O[n][m][x][y] = B[m] + \sum_{k=0}^{Ch-1} \sum_{i=0}^{R-1} \sum_{j=0}^{R-1} D[n][k][Ux+i][Uy+j] × W[m][k][i][j]    (1)

with 0 ≤ n < N, 0 ≤ m < M, 0 ≤ x, y < E, and E = (H − R + U)/U.

The convolution operation involves two significant inputs: the feature map (fmap) D, D ∈ R^{N×Ch×H×H}, and the convolutional kernels W, W ∈ R^{M×Ch×R×R}. Note that all the matrices and vectors in this paper are highlighted in bold in order to differentiate them from scalar numbers, according to the naming convention. The bias, denoted as B, is applied to the output after convolution, and the final result is denoted as O, O ∈ R^{N×M×E×E}. Since the bias operation is independent of the convolution computation, in the rest of this section we describe only the protection for the convolution computation. In Section 5.1, we will discuss the protection for bias.

TABLE 1
Notations and Symbols Used in This Paper
  D    Feature map, dimension is 4D
  W    Kernels, also called filters, dimension is 4D
  O    Output, dimension is 4D
  B    Bias, dimension is 1D
  C    Checksums
  S    Block summations of O, corresponding to checksums
  ⊗    Convolution operation
  N    First dimension of D and O
  M    First dimension of W and second dimension of O
  Ch   Second dimension of D and W, also called channels
  H    Third and fourth dimension of D
  R    Third and fourth dimension of W
  E    Third and fourth dimension of O
  U    Stride size

2.2 Implementation of Convolutional Layer

Convolution can be implemented efficiently in several ways [44]. The first option is MM-based convolution [45], which reshapes the kernel and feature map into two temporary matrices and then applies matrix-matrix multiplication (MM) on them. Another way to implement convolution is called direct convolution, which performs the convolution operation directly. It is widely used in AI accelerators including Eyeriss [39], DianNao [46], and the NVIDIA Deep Learning Accelerator [47]. Fast Fourier transform–based convolution [48] leverages FFT to compute the convolution. It is particularly suitable for relatively large feature maps and kernels. However, it is inferior to the Winograd convolution [44] when the sizes of the feature map and kernel are relatively small.
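To make the notation in Equation (1) and Table 1 concrete, the following is a minimal NumPy sketch of direct convolution (our illustration only, not the FT-Caffe or MKL-DNN implementation); the array names mirror the symbols above, and the nested loops follow the equation literally.

```python
import numpy as np

def direct_conv(D, W, B, U=1):
    """Reference direct convolution following Equation (1).
    D: fmap of shape (N, Ch, H, H); W: kernels of shape (M, Ch, R, R);
    B: bias of shape (M,); U: stride. Returns O of shape (N, M, E, E)."""
    N, Ch, H, _ = D.shape
    M, _, R, _ = W.shape
    E = (H - R + U) // U
    O = np.zeros((N, M, E, E), dtype=D.dtype)
    for n in range(N):
        for m in range(M):
            for x in range(E):
                for y in range(E):
                    # sum over k, i, j of D[n][k][Ux+i][Uy+j] * W[m][k][i][j], plus bias
                    patch = D[n, :, U * x:U * x + R, U * y:U * y + R]
                    O[n, m, x, y] = B[m] + np.sum(patch * W[m])
    return O
```

Production frameworks replace these loops with the MM-based, FFT, or Winograd kernels listed above; the checksum schemes proposed in Section 3 are designed to be independent of whichever implementation is chosen.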

Modern CNN frameworks and accelerators generally choose the best implementation of convolution automatically based on hardware resources and model structure, because the various implementations have different constraints on memory, architecture, and CNN model.

2.3 ABFT for Matrix-Matrix Multiplication

Traditional ABFT designed for matrix-matrix multiplication can be applied to the MM calculation of the MM-based convolution [20], but it has at least three limitations. (1) It supports only the MM-based convolution implementation, which is not always the best-fit implementation selected by the CNN framework and accelerator. (2) It incurs high overhead (more than 50%, as shown in Section 6.3) because of the small and irregular shape of the matrices used by MM-based convolution. (3) It cannot cover the reorganization operations of the feature map before the MM calculation. Therefore, new ABFT schemes are needed in order to protect the convolutional layer more effectively.

3 NOVEL ABFT SCHEMES FOR CONVOLUTION

In this section, we present four novel ABFT schemes, each supporting any convolution implementation and being able to protect the whole convolution process. In Section 4, we propose a multischeme workflow using all the schemes in different stages to maximize the soft error protection ability with minimized performance overhead.

3.1 Preliminary Analysis – Convolution

For clear description, we interpret convolution at the block level. Specifically, in Equation (1), D, W, and O are all 4D matrices. They can be represented as being composed of multiple blocks, as shown in Figure 1. For any n and m (0 ≤ n < N, 0 ≤ m < M), Dn, Wm, and Onm are blocks. The dimensions of Dn, Wm, and Onm are Ch×H×H, Ch×R×R, and E×E, respectively. The notation ⊗ is used to represent the convolution computation between blocks Dn and Wm. The convolution operation defined by Equation (1) can be simplified at the block level as follows.

Onm = Dn ⊗ Wm,   0 ≤ n < N, 0 ≤ m < M    (2)

[Fig. 1. Interpretation of Convolution at the Block Level: the blocks D1..Dn and W1..Wm, the output blocks Onm (Level 1 and Level 2), and the input and output checksums Cd1, Cd2, Cw1, Cw2, Co1–Co7.]

Equation (2) can be interpreted by using the blue part in Figure 1. Since each of the 3D substructures of D and W is treated as a block, D and W can be thought of as two 1D vectors of blocks. At the block level, the convolution operation is similar to matrix-matrix multiplication. The element (i, j) of O is calculated by using the ith element of D and the jth element of W. As illustrated in Figure 1, the elements involved in the convolutional layers can be split into two levels, which are covered by our protection solution, respectively.

We can derive that the convolution operation (denoted by ⊗) has a distributive property as follows.

D1 ⊗ W1 + D2 ⊗ W1
  = \sum_{k=0}^{Ch-1} \sum_{i=0}^{R-1} \sum_{j=0}^{R-1} D1[k][Ux+i][Uy+j] × W1[k][i][j] + \sum_{k=0}^{Ch-1} \sum_{i=0}^{R-1} \sum_{j=0}^{R-1} D2[k][Ux+i][Uy+j] × W1[k][i][j]
  = \sum_{k=0}^{Ch-1} \sum_{i=0}^{R-1} \sum_{j=0}^{R-1} (D1 + D2)[k][Ux+i][Uy+j] × W1[k][i][j]
  = (D1 + D2) ⊗ W1

Similarly, we can get the following equation.

D1 ⊗ W1 + D1 ⊗ W2 = D1 ⊗ (W1 + W2)    (3)

The distributive property, Formula (3), is the key to proving the equivalence between the sum of the output and the output checksum. This property will be used later to prove the correctness of our design.

3.2 Preliminary Analysis – CNN Checksums

In general, we compute checksums for D and W and then use them to derive the checksums for O. Soft errors can be detected and corrected by comparing O with its checksums. We introduce all the checksums (shown in Table 2 and as yellow blocks in Figure 1) that are necessary for our ABFT schemes.

TABLE 2
Checksums Used by Schemes
  Scheme                         Checksums of D and W     Checksums of O
  Full Checksum (FC)             Cd1, Cw1                 Co1, Co2
  Row Checksum (RC)              Cd1, Cd2                 Co1, Co3
  Column Checksum (ClC)          Cw1, Cw2                 Co2, Co4
  Checksum-of-Checksum (CoC)     Cd1, Cw1, Cd2, Cw2       Co5, Co6, Co7
  CoC Detection Only (CoC-D)     Cd1, Cw1, Cd2, Cw2       Co5

We define the checksums of D and W as follows.

Cd1 = \sum_{n=0}^{N-1} Dn,   Cd2 = \sum_{n=0}^{N-1} n·Dn,   Cw1 = \sum_{m=0}^{M-1} Wm,   Cw2 = \sum_{m=0}^{M-1} m·Wm    (4)

The four checksums (denoted as input checksums) can be treated as four blocks of D and W. The checksums of O (denoted as output checksums) are defined as the convolution result of input checksums and/or inputs.

Co1 = Cd1 ⊗ W,   Co2 = D ⊗ Cw1,   Co3 = Cd2 ⊗ W,   Co4 = D ⊗ Cw2,   Co5 = Cd1 ⊗ Cw1,   Co6 = Cd1 ⊗ Cw2,   Co7 = Cd2 ⊗ Cw1    (5)
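To tie Equations (4) and (5) together, here is a hedged NumPy sketch of how the input and output checksums could be computed; the block-level convolution `conv` is a placeholder for whatever implementation the layer actually uses, which is the point of making the schemes implementation independent.

```python
import numpy as np

def input_checksums(D, W):
    """Input checksums from Equation (4).
    D has shape (N, Ch, H, H); W has shape (M, Ch, R, R)."""
    N, M = D.shape[0], W.shape[0]
    Cd1 = D.sum(axis=0)                                # sum of fmap blocks
    Cd2 = np.tensordot(np.arange(N), D, axes=(0, 0))   # weighted sum: sum_n n * D_n
    Cw1 = W.sum(axis=0)                                # sum of kernel blocks
    Cw2 = np.tensordot(np.arange(M), W, axes=(0, 0))   # weighted sum: sum_m m * W_m
    return Cd1, Cd2, Cw1, Cw2

def output_checksums(D, W, Cd1, Cd2, Cw1, Cw2, conv):
    """Output checksums from Equation (5).
    `conv(d, w)` is any block-level ⊗ routine returning one E x E block; e.g. it can
    be built from the direct_conv sketch above (an assumption, not FT-Caffe code):
      conv = lambda d, w: direct_conv(d[None, ...], w[None, ...], np.zeros(1), U=1)[0, 0]
    """
    Co1 = np.stack([conv(Cd1, W[m]) for m in range(W.shape[0])])   # Cd1 ⊗ W
    Co2 = np.stack([conv(D[n], Cw1) for n in range(D.shape[0])])   # D ⊗ Cw1
    Co3 = np.stack([conv(Cd2, W[m]) for m in range(W.shape[0])])   # Cd2 ⊗ W
    Co4 = np.stack([conv(D[n], Cw2) for n in range(D.shape[0])])   # D ⊗ Cw2
    Co5 = conv(Cd1, Cw1)                                           # Cd1 ⊗ Cw1
    Co6 = conv(Cd1, Cw2)                                           # Cd1 ⊗ Cw2
    Co7 = conv(Cd2, Cw1)                                           # Cd2 ⊗ Cw1
    return Co1, Co2, Co3, Co4, Co5, Co6, Co7
```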

The output O is represented in the form of blocks (i.e., Level 1 in Figure 1). Elements inside the same block are independent with respect to checksums (Level 2 in Figure 1). That is, we perform the checksum comparison independently for each element across blocks. Therefore, multiple soft errors in the same block can be detected and corrected independently.

In what follows, we describe the four schemes we proposed, each involving one or more input and output checksums. The checksums required by each scheme are summarized in Table 2.

3.3 Full Checksum Scheme (FC)

The first scheme we designed is called the full checksum scheme, or FC, because it is based on checksums from both D and W, as shown in Figure 1 and Table 2.

Cd1 and Cw1 are calculated before the convolution operation, so any memory error striking D or W during the convolution would not affect Cd1 or Cw1. As for the output checksums, we can get the following equations by applying the distributive property of ⊗.

Co1[m] = (\sum_{n=0}^{N-1} Dn) ⊗ Wm = \sum_{n=0}^{N-1} (Dn ⊗ Wm) = \sum_{n=0}^{N-1} Onm
Co2[n] = Dn ⊗ (\sum_{m=0}^{M-1} Wm) = \sum_{m=0}^{M-1} (Dn ⊗ Wm) = \sum_{m=0}^{M-1} Onm

These equations show the equality between the sum of the output and the output checksums. Let So1 and So2 be the summations of the output, where So1[m] = \sum_{n=0}^{N-1} Onm and So2[n] = \sum_{m=0}^{M-1} Onm. We can compare Co1, Co2 with So1, So2 to detect, locate, and correct soft errors if they exist.

3.4 Row Checksum Scheme (RC)

Compared with the full checksum scheme, the second ABFT scheme we designed involves only the row checksums of the output O, so we call it the row checksum scheme.

The row checksums used in this scheme are Co1 and Co3. Co3 is computed from the convolution operation between Cd2 and W, and the related output summation is defined by So3[m] = \sum_{n=0}^{N-1} n × Onm.

For the detection of soft errors, we need to compare Co1 with So1. If they are not equal to each other at location j, the error can be located by i = (Co3[j] − So3[j]) / (Co1[j] − So1[j]) and j, and it can be corrected by adding Co1[j] − So1[j] to the block (i, j).

3.5 Column Checksum Scheme (ClC)

The third scheme we proposed is called the column checksum scheme because it involves only the column checksums of the output O. The column checksums used in this scheme are Co2 and Co4. Co4 is defined by performing the convolution operation between D and Cw2, and the related output summation is defined as So4[n] = \sum_{m=0}^{M-1} m × Onm. To detect soft errors, we compare Co2 with So2 first. If they are not equal to each other at location i, the error can be located by i and j (= (Co4[i] − So4[i]) / (Co2[i] − So2[i])), and it can be recovered by adding Co2[i] − So2[i] to the block (i, j).

3.6 Checksum-of-Checksum Scheme (CoC/CoC-D)

Unlike the three schemes above, which all need D and/or W to calculate output checksums, the last scheme we proposed involves neither D nor W but only their checksums, so it is named the checksum-of-checksum scheme (or CoC scheme for short). Specifically, Co5, Co6, and Co7 are the output checksums we use in this scheme. Similar to Co1, using the distributive property we can get three equations between the output checksums and the output as follows.

Co5 = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} Onm = So5
Co6 = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} m × Onm = So6
Co7 = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} n × Onm = So7

So5, So6, and So7 are defined as the output summations corresponding to Co5, Co6, and Co7. Let O(i, j) be the corrupted output block, O′ be the correct output, and let δ = O′ij − Oij be the difference. Using the output checksums, we can get the following.

Co5 − So5 = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} (O′nm − Onm) = δ
Co6 − So6 = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} m × (O′nm − Onm) = j × δ
Co7 − So7 = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} n × (O′nm − Onm) = i × δ

The location (i, j) can be obtained by i = (Co7 − So7)/δ and j = (Co6 − So6)/δ. Then the soft error can be fixed by adding δ to Oij.

If only soft error detection is required, we do not need to compute Co6 and Co7, thus reducing the number of computations. The input checksums Cd1, Cd2 and Cw1, Cw2, however, are still required for soft error detection. We denote such a detection scheme by CoC-D.
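As an illustration, the CoC detection and correction logic above can be written directly from the three checksum differences. The following is a hedged NumPy sketch (our reading of the scheme, not FT-Caffe code); So5, So6, and So7 are recomputed block summations of the possibly corrupted output O, and the tolerance handling is an assumption for floating-point arithmetic.

```python
import numpy as np

def coc_detect_and_correct(O, Co5, Co6, Co7, tol=1e-4):
    """CoC scheme sketch: O has shape (N, M, E, E); Co5/Co6/Co7 are E x E blocks.
    Detects a single corrupted output block and corrects it in place."""
    N, M = O.shape[0], O.shape[1]
    So5 = O.sum(axis=(0, 1))                                      # sum_{n,m} O_nm
    So6 = np.tensordot(np.arange(M), O.sum(axis=0), axes=(0, 0))  # sum_{n,m} m * O_nm
    So7 = np.tensordot(np.arange(N), O.sum(axis=1), axes=(0, 0))  # sum_{n,m} n * O_nm
    delta = Co5 - So5
    if np.max(np.abs(delta)) <= tol:
        return False                      # no soft error detected (CoC-D stops here)
    # locate the corrupted block: (Co7 - So7) = i*delta and (Co6 - So6) = j*delta
    i = int(round(np.sum(Co7 - So7) / np.sum(delta)))
    j = int(round(np.sum(Co6 - So6) / np.sum(delta)))
    if 0 <= i < N and 0 <= j < M:
        O[i, j] += delta                  # add delta to the corrupted block O_ij
        return True
    raise RuntimeError("illegal error location; fall back to the next scheme in the workflow")
```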
4 MULTISCHEME WORKFLOW

In this section, we first discuss the fault protection abilities and runtime of the four schemes we proposed in Section 3. Then, we propose a multischeme workflow, powered by a calibrated arrangement of the four schemes and layerwise optimization.

4.1 Analysis of Protection Ability for Convolution Checksum Schemes

In this section, we analyze the fault protection ability of all the schemes.

4.1.1 Fault Model

The fault model for soft errors that we discuss in this paper includes transient faults in computational units and data corruption faults (both transient and persistent) in memory (including cache). In the following text, we use fault to represent a malfunction event, and we denote its corresponding symptom as soft error.

Soft error protection includes error detection and error correction. Error detection means that the scheme can detect

soft errors without knowing the exact location. Error correction means that the scheme can locate the soft error locations and recover the incorrect result.

Without loss of generality, in the following analysis we consider at most one fault per convolution. One convolutional neural network contains several or even tens of convolutional layers, and the total forward execution time of a CNN model is usually within seconds. Thus, we can reasonably assume that at most one fault may strike one convolutional layer, considering the short execution time of a single layer. Multiple faults per convolution can also be detected by our schemes and recovered by recomputing the corrupted convolutional layer.

4.1.2 Analysis of Soft Error in D and W

One fault occurring during the convolution execution can result in multiple soft errors in W and D. The soft errors in W can be detected by comparing the checksum of W with Cw1 and corrected by reloading the weights from the CNN model. The soft errors in D do not need correction because D will be discarded after the convolution computation; the resulting errors in the output can be detected and corrected by the checksums of the output, as demonstrated below.

4.1.3 Analysis of Soft Error in O

One fault during the convolution execution can result in corruption of one block row or column of O. By definition, row i of O is computed by the ith block of D with W. Thus, one fault in D would result in at most one corrupted row. Column j of O is computed by D with the jth block of W. Thus, one fault in W would result in at most one corrupted column. Moreover, the intermediate result will be reused only by the same row or column, such that one fault in the computational units would corrupt only values in the same row or column. Accordingly, in the following sections we discuss the soft error protection ability in the context of at most one corrupted row or column of O.

4.1.4 Soft Error Protection Ability of CoC Scheme

Figure 2 demonstrates the protection ability of the CoC scheme when soft errors strike the input or output data. As shown in Figure 2(a), multiple soft errors can be detected by using only Co5. A single soft error in O can be corrected by CoC using all checksums including Co5, Co6, and Co7, as shown in Figure 2(b). However, CoC cannot correct soft errors across multiple blocks in O.

[Fig. 2. Soft Error Protection Ability of CoC Scheme (soft error happens in inputs and outputs).]

Figure 3 illustrates the protection ability of the CoC scheme when soft errors happen inside the checksums. Such soft errors can cause inconsistency among the output checksums of CoC, which can be used for error detection. For example, in Figure 3(a), Cd1 is corrupted, leading to corrupted Co5 and Co6 with correct Co7. We can detect this abnormal pattern when comparing the checksums with the summation of O, thereby detecting the input checksum corruption. The input D, W, and output O are clean and without soft errors since the fault frequency is at most once per convolution. Thus, we can safely discard all the checksums and finish this convolution computation.

[Fig. 3. Soft Error Protection Ability of CoC Scheme (soft error happens in checksums): (a) SDC in Cd1, (b) SDC in Cw1, (c) SDC in Cd2, (d) SDC in Cw2.]

4.1.5 Soft Error Protection Ability of Row Checksum Scheme and Column Checksum Scheme

Since the row checksum scheme and column checksum scheme are symmetric with each other, we discuss them together in this section. As shown in Figure 4(a), the row checksum scheme can detect and correct soft errors if they are in the same row. If the soft errors are in the same column, as shown in Figure 4(b), the row checksum scheme can only detect soft errors; it has no correction ability. The column checksum scheme, on the contrary, can detect and correct errors located in the same column but fails to correct those appearing in the same row.

[Fig. 4. Soft Error Protection Ability of Row/Column Checksum Schemes: (a) Row Checksum Scheme, (b) Column Checksum Scheme.]

4.1.6 Soft Error Protection Ability of Full Checksum Scheme

The full checksum scheme has the highest ability to correct soft errors. The scheme uses both the row checksum Co1

and the column checksum Co2 so that it can correct soft errors in both directions, as shown in Figure 5(a)(b). If soft errors exist in Co1 (Figure 5(d)), however, Co1 can no longer be used to locate or correct soft errors. To support error correction in this situation, we use checksums Co5 and Co6 from the CoC scheme to locate the corrupted column, and we then use Co2 to correct the soft errors. If soft errors exist in Co2 (Figure 5(c)), Co5 and Co7 are used to locate the corrupted row, and Co1 is used to correct the soft errors.

[Fig. 5. Soft Error Protection Ability of Full Checksum Scheme: (a) soft error in the same row, (b) soft error in the same column, (c) soft error in the same row (including Co2), (d) soft error in the same column (including Co1).]

4.1.7 Conclusion

In this section, we define our fault model and analyze the soft error protection ability of the four schemes. We conclude that the CoC scheme has the lowest error correction ability and that the full checksum scheme has the best error correction ability. The abilities of the row checksum scheme and column checksum scheme are higher than that of the CoC scheme but lower than that of the full checksum scheme. CoC-D (discussed in Section 3.6) can detect multiple soft errors but has no correction ability. The analysis here serves as the fundamental basis of our low-overhead, high-protection design, which will be presented in Section 4.3.

4.2 Runtime Analysis

In this section, we analyze the time complexity theoretically and present runtimes of all schemes based on experiments.

Table 3 shows the time complexity of some basic checksum operations, where α is the coefficient of CPU-intensive operations and β represents the coefficient of memory-intensive operations.

TABLE 3
Runtimes of Basic Operations
  Operation                           Derived Runtime
  Block-level convolution Dn ⊗ Wm     αChR²E²
  Total convolution operations        αNMChR²E²
  Compute the checksum of D           βNChH²
  Compute the checksum of O           βNME²

Table 4 shows the theoretical time complexity of all the schemes. The full checksum scheme has the best soft error correction ability; however, its runtime is relatively long. Although the CoC scheme has lower ability than the other three schemes in correcting soft errors, it has the shortest runtime. Note that the kernel checksums Cw1 and Cw2 can be precalculated before the application, so there is no cost in generating kernel checksums in the row, column, and CoC schemes.

To verify the correctness of the derived time complexity of the four schemes, we execute them on a supercomputer using four CNN models. We show the normalized worst-case runtime of the four schemes in the separate column of Figure 6. The other columns of this figure represent the worst-case runtime of multischeme workflows and will be discussed in the next section. Experiments confirm our conclusion that CoC and CoC-D have the shortest runtime and that the runtime of the full checksum scheme is relatively long. We also see that the column checksum scheme has a much longer runtime than the row checksum scheme does. The reason is twofold. On the one hand, W blocks have smaller sizes than D blocks have, leading to a longer time to compute D ⊗ Cw2 in the column checksum scheme than to compute Cd2 ⊗ W in the row checksum scheme. On the other hand, computing the row checksums (Co1 and Co3) is more efficient than computing the column checksums (Co2 and Co4), because the row checksum calculation can be reduced to efficient column-summation operations.

[Fig. 6. Worst-Case Normalized Runtime (baseline is CoC-D), comparing CoC-D, CoC, RC, ClC, and FC under the workflows Separate, CoC+FC, CoC+ClC+FC, and CoC+RC+FC: (a) AlexNet, (b) YOLOv2, (c) YOLOv2 (Conv8), (d) VGG-19, (e) ResNet-18, (f) ResNet-18 (Conv1).]

TABLE 4
ABFT Schemes Runtime
  Scheme   Derived Runtime                         Soft Error Correction Ability
  FC       α(N + M)ChR²E² + β(NChH² + 2NME²)       High
  RC       2αMChR²E² + 2β(NChH² + NME²)            Middle
  ClC      2αNChR²E² + 2β(NME²)                    Middle
  CoC      3αChR²E² + β(2NChH² + 3NME²)            Low
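As a rough illustration of how the Table 4 formulas can be used, the following sketch evaluates the per-layer cost model for each scheme; α and β are treated as free coefficients here (illustrative values, not measured constants), so the output should only be read as a relative comparison under the stated model.

```python
def scheme_costs(N, M, Ch, R, E, H, alpha=1.0, beta=0.1):
    """Theoretical per-layer runtimes of the four schemes from Table 4.
    alpha: coefficient of compute-intensive work; beta: coefficient of memory-intensive work."""
    return {
        "FC":  alpha * (N + M) * Ch * R**2 * E**2 + beta * (N * Ch * H**2 + 2 * N * M * E**2),
        "RC":  2 * alpha * M * Ch * R**2 * E**2 + 2 * beta * (N * Ch * H**2 + N * M * E**2),
        "ClC": 2 * alpha * N * Ch * R**2 * E**2 + 2 * beta * (N * M * E**2),
        "CoC": 3 * alpha * Ch * R**2 * E**2 + beta * (2 * N * Ch * H**2 + 3 * N * M * E**2),
    }

# Example: a layer with a small 3x3 kernel (shape values are illustrative only)
costs = scheme_costs(N=64, M=128, Ch=64, R=3, E=28, H=30)
# CoC is by far the cheapest in compute; the relative order of RC, ClC, and FC
# depends on the layer shape and on the memory-access behavior discussed above.
print(sorted(costs, key=costs.get))
```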

4.3 Multischeme Workflow for Soft Error Protection

The four schemes we proposed have pros and cons in terms of their soft error correction ability and runtime overhead. To achieve the highest protection ability and lowest overhead, we propose a multischeme workflow integrating the four schemes, as shown in Figure 7. The workflow is made up of two modules: error detection and error correction. In our designed workflow, we use CoC-D to detect errors because it has the lowest overhead. For the error correction, we put CoC at the beginning because it is the most lightweight method. By comparison, FC has the highest correction ability but also the highest time overhead, so we put it at the end of the workflow.

[Fig. 7. Multischeme Workflow Designed to Detect/Correct Soft Errors: after the convolution operation, CoC-D detects errors; if errors are detected, CoC attempts correction first, a per-layer controller then decides whether to use RC or ClC (or skip them), and FC is invoked last if the error remains uncorrected.]

The error detection module will be executed for every execution whether there is a soft error or not. Thus, any unnecessary computations should be avoided in order to reduce the overall overhead. For instance, both CoC-D and FC are able to detect all the soft errors, but we adopt only CoC-D in the workflow for error detection because FC has a much higher overhead. RC and ClC cannot detect soft errors correctly if the checksum is corrupted.

The error correction module will not be executed until some soft errors are detected. The schemes in this module will be invoked to fix soft errors according to the workflow. If a scheme fails to correct the errors because of inconsistency of checksum blocks or illegal error locations, the next-level scheme will be invoked.

Since the checksums can be reused among different schemes in the workflow, the runtime of the workflow is actually lower than the sum of all schemes' runtimes. For example, both CoC-D and CoC use Co5; if CoC-D detects soft errors and CoC is invoked to correct them, CoC can save the time of computing Co5 and its corresponding summation So5, since they have already been computed by CoC-D. This analysis can also be confirmed by our experiments. As shown in Figure 6, the relative runtime of CoC in the second column is reduced compared with that of CoC in the first column. The relative runtime of RC in the third column is reduced compared with that of RC in the first column.

The decision to put RC and ClC in the workflow between CoC and FC is controlled by each layer. The reason to control RC/ClC in each layer is that their relative runtimes differ across layers. Since RC and ClC are symmetric, in the following we present our analysis based mainly on RC, without loss of generality.

We denote the runtime of the workflow CoC+FC as t0, the runtime of the workflow CoC+RC as t1, and the runtime of the workflow CoC+RC+FC as t2. Enabling RC can fix some soft errors before FC, thus changing the runtime from t0 to t1. When RC fails to correct the soft errors, however, FC still needs to be invoked, and the runtime will increase from t0 to t2. Denoting the probability of row soft errors by pr and the probability of column soft errors by pc, we can derive the average time saved by RC as ty = pr(t0 − t1) and the average time increase caused by RC as tn = pc(t2 − t0). In order to minimize the total runtime, RC should be enabled when ty > tn.

We give an example to further illustrate when RC should be enabled. Figure 6(b) shows the average runtime among all the convolutional layers in YOLOv2. In this figure, the runtime of CoC+RC is much lower than that of CoC+FC, and the runtime of CoC+RC+FC is slightly higher than that of CoC+FC. Therefore, enabling RC can save significant runtime when the soft errors are able to be corrected by RC. On the other hand, a small runtime penalty is incurred if RC fails to correct the soft errors. However, for the conv8 layer in YOLOv2 (shown in Figure 6(c)), CoC+RC's runtime is close to that of CoC+FC. Thus, enabling RC in this layer would barely reduce the overall runtime even though the soft errors can be corrected by RC. Moreover, CoC+RC+FC's runtime is much higher than CoC+RC's. As a result, the total runtime will increase significantly if the soft errors cannot be corrected by RC. Hence, for this layer, it is better to use CoC+FC for error correction with RC disabled.

In practice, the runtimes t0, t1, and t2 can be computed by offline profiling. The probability values pc and pr can be estimated based on the size of D and the size of W. For instance, soft errors often strike each element of the input under an independent and identical distribution. In this situation, it is easy to derive that the probability of soft errors occurring in D is proportional to that of W, i.e., pr/pc = (number of elements in D) / (number of elements in W).
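A compact way to express this per-layer decision is sketched below (our reading of the rule above, not FT-Caffe's actual code); t0, t1, and t2 come from offline profiling, and the element counts of D and W are used only through the pr : pc proportionality stated above.

```python
def enable_rc_for_layer(t0, t1, t2, d_elems, w_elems):
    """Decide whether to enable RC between CoC and FC for one layer.
    t0: runtime of CoC+FC, t1: CoC+RC, t2: CoC+RC+FC (from offline profiling).
    d_elems/w_elems: numbers of elements in D and W (only their ratio matters)."""
    p_r = d_elems / (d_elems + w_elems)   # probability that the fault corrupts a row (strikes D)
    p_c = w_elems / (d_elems + w_elems)   # probability that the fault corrupts a column (strikes W)
    t_saved = p_r * (t0 - t1)             # expected time saved when RC succeeds
    t_added = p_c * (t2 - t0)             # expected time added when RC fails and FC still runs
    return t_saved > t_added
```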
5 RESOLVING BIAS, GROUPED CONVOLUTION, AND BACK PROPAGATION

In this section, we extend our solution to support bias, grouped convolution, and the back propagation of convolutional layers.

5.1 Bias

Bias is a 1D vector that needs to be added to the output of the convolutional layers. FT-Caffe provides protection for the bias operation.

Many CNN frameworks add bias on the fly with the convolution calculation. As a result, the output O already contains bias, whereas the output checksums do not contain bias, since they are calculated from the inputs and input checksums without bias. In order to compare the output checksums and the output O, bias has to be subtracted from the output summation before comparison. Subtracting bias from the output O directly before verification and then adding bias back to O after verification is not feasible, however, because of the overhead of modifying every element in O. Table 5 shows the output checksums and the adjusted output

summations used for comparison in order to detect errors. The bias part of the formulations can be precomputed.

TABLE 5
Bias Adjustments for Output Checksums Comparison
  Checksum   Adjusted Summation
  Co1        So1[m][i][j] − N × Bias[m]
  Co3        So3[m][i][j] − (\sum_{i=1}^{N} i) × Bias[m]
  Co2        So2[n][i][j] − \sum_{m} Bias[m]
  Co4        So4[n][i][j] − \sum_{m} m × Bias[m]
  Co5        So5[i][j] − N × \sum_{m} Bias[m]
  Co6        So6[i][j] − N × \sum_{m} m × Bias[m]
  Co7        So7[i][j] − (\sum_{i=1}^{N} i) × \sum_{m} Bias[m]

5.2 ABFT for Grouped Convolution

Grouped convolution is a special kind of convolution, and our schemes need to be modified to support it. Define the number of groups as G. Each fmap basic block has Ch/G instead of Ch channels. All the M kernel basic blocks are divided into G groups, each having M/G 3D basic blocks. The kernel blocks in the gth group do convolution only with the gth channel group of every fmap block. Figure 8 shows this process for N = 2, M = 4, and G = 2.

[Fig. 8. Demonstration of Grouped Convolution, Groups = 2: each channel group of d1 and d2 convolves only with its own kernel group (W1, W2 versus W3, W4), producing the output blocks O11–O24.]

The checksums for the fmap, Cd1 and Cd2, stay the same. The checksums for the kernel are redefined as

Cw1 = [ \sum_{m=0}^{M/G-1} Wm,  \sum_{m=M/G}^{2M/G-1} Wm,  ...,  \sum_{m=(G-1)M/G}^{M-1} Wm ]
Cw2 = [ \sum_{m=0}^{M/G-1} m × Wm,  \sum_{m=M/G}^{2M/G-1} m × Wm,  ...,  \sum_{m=(G-1)M/G}^{M-1} m × Wm ]

where Cw1 and Cw2 are the combination of the G checksums, one from each kernel group. Each group checksum has Ch/G channels, so Cw1 and Cw2 each have G × Ch/G = Ch channels, which is the same as every fmap block.

The definition of the output checksums Co1, Co2, ..., Co7 stays the same. Let X[l..r] represent the channels from l to r in matrix X. We can prove the following property for any Dn and Wm according to Equation (1).

Dn ⊗ Wm = Dn[1..k−1] ⊗ Wm[1..k−1] + Dn[k..Ch] ⊗ Wm[k..Ch],  0 ≤ k < Ch

Using this equation, we can prove the relation between Co2 and O as follows.

Co2[n] = Dn ⊗ [ \sum_{m=0}^{M/G-1} Wm,  \sum_{m=M/G}^{2M/G-1} Wm,  ...,  \sum_{m=(G-1)M/G}^{M-1} Wm ]
       = Dn[0..Ch/G−1] ⊗ \sum_{m=0}^{M/G-1} Wm + Dn[Ch/G..2Ch/G−1] ⊗ \sum_{m=M/G}^{2M/G-1} Wm + ··· + Dn[(G−1)Ch/G..Ch−1] ⊗ \sum_{m=(G-1)M/G}^{M-1} Wm
       = \sum_{m=0}^{M-1} Dn ⊗ Wm = \sum_{m=0}^{M-1} Onm

Similar equations can be proved for Co1, Co3, Co4, Co5, Co6, and Co7. Therefore, all the ABFT schemes we proposed can be applied to grouped convolution.

5.3 ABFT for Convolution Back Propagation

Our schemes can also be applied to back propagation together with the forward pass so that the convolutional layers can be fully protected in the training phase.

During back propagation, the gradient of the kernel, ∇W, is used by methods such as gradient descent in order to update W. The gradient of the fmap, ∇D, is used to get ∇O of the previous layer. As shown in Figure 9, the gradients are calculated as D ⊗ ∇O = ∇W and Wᵀ ⊗ ∇O = ∇D. Checksums of ∇O are used in this situation to protect the two convolution operations.

[Fig. 9. Demonstration of Checksum Design for Back Propagation: checksum (CK) blocks are appended to D, Wᵀ, and ∇O to protect the operations D ⊗ ∇O = ∇W and Wᵀ ⊗ ∇O = ∇D.]

Since CNN models are usually trained in a more stable environment than the inference stage and since the training stage can tolerate some soft errors because of its iterative-convergent nature, we focus our experiments on the inference stage.

6 EXPERIMENTAL EVALUATION

In this section, we evaluate our multischeme workflow using our FT-Caffe fault-tolerant CNN framework.

6.1 Experimental Setup

FT-Caffe. Our FT-Caffe framework is based on Intel-Caffe. MKL-DNN is enabled to support dynamic selection of the convolution execution. MKL-DNN contains all the convolution implementations we discussed in Section 2.2. It automatically chooses the most suitable implementation to use for each convolutional layer. To compare the runtime overhead of our solution with that of the ABFT designed for matrix-matrix multiplication, we also perform the experiments based on the MM-based convolution implementation.

CNN models and dataset. We tested our FT-Caffe with four widely used networks: AlexNet, VGG-19, ResNet-18, and YOLOv2. Pretrained Caffe models are used together with the model prototypes for deployment. We adopt the ImageNet validation set, which contains 50k images. The images are preprocessed to a smaller size in order to save picture processing time when the program starts. The batch size is set to 64.

Experimental platforms. We conducted our experiments on the Bebop supercomputer [43] at Argonne National Laboratory using up to 128 nodes. Each node is equipped with 128 GB of memory and two Intel Xeon E5-2695 v4 processors (each with 16 cores).

Error injection. To demonstrate the overhead of our fault-tolerant solutions, we inject soft errors at the source code level as most ABFT works did [29], [36]. The consequences of one computational fault or memory fault are simulated by randomly corrupting a selected row or column of the output. We denote the total number of convolutional layers of a CNN model as L. To assess the overhead accurately, we run the experiments for L epochs corresponding to the number of convolutional layers of each network (L = 5, 9, 16, 21 for AlexNet, YOLOv2, VGG-19, and ResNet-18, respectively). For the ith epoch, we inject errors into the ith convolutional layer. The final overhead is the arithmetic mean of all the inference executions, and the standard deviation in our experiments is within 5%.
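For concreteness, the following is a hedged sketch of such an injection strategy (our illustration, not the FT-Caffe injector): it corrupts one randomly selected block row or block column of a convolutional layer's output, which is the symptom produced by a single fault in D, W, or the computational units (Section 4.1.3).

```python
import numpy as np

def inject_row_or_column_error(O, rng=np.random.default_rng()):
    """Simulate the symptom of one fault: corrupt one block row or column of O.
    O has shape (N, M, E, E); returns which kind of corruption was injected."""
    N, M, E, _ = O.shape
    scale = np.abs(O).mean() + 1e-12
    if rng.random() < 0.5:
        n = rng.integers(N)                                   # fault striking D: one corrupted row
        O[n, :, :, :] += rng.standard_normal((M, E, E)) * scale
        return ("row", int(n))
    else:
        m = rng.integers(M)                                   # fault striking W: one corrupted column
        O[:, m, :, :] += rng.standard_normal((N, E, E)) * scale
        return ("column", int(m))
```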
6.2 Experimental Results with MKL-DNN

In this section, we present our evaluation results with MKL-DNN. We analyze the results from the perspective of execution time overhead for both error-free cases and erroneous cases.

Error-free cases. The experimental results in the error-free cases are presented in Figure 10(a). We can see from the figure that our FT-Caffe can protect the inference stage with less than 4%, 4.5%, 8%, and 5% overhead for AlexNet, VGG-19, YOLOv2, and ResNet-18, respectively, regardless of the convolution implementation. These results show that our CoC-D error detection scheme has a relatively short runtime compared with the convolution execution, which is attributed to the design of avoiding unnecessary computations in our solution (see Section 4.3). The reason ResNet-18 has higher overhead than the other models is that ResNet-18 has small convolution kernels (the W size is M×C×3×3) in all the convolutional layers, which have relatively short computing time; thus, the checksum computation and verification time percentage is relatively large.

Erroneous cases – RC/ClC disabled. To show the effectiveness of our layerwise optimization for RC/ClC, we first test our multischeme workflow with RC/ClC disabled in erroneous cases. Figure 10(b) demonstrates that the runtime overheads (including both error detection and error correction) of the four CNN models are all below 9%. The error detection overhead is higher than the error correction overhead because the error detection scheme is executed for every convolution operation, whereas the error correction schemes are invoked only when errors are detected. The full checksum scheme dominates the error correction overhead, thus confirming our analysis in Section 4 that FC has high protection ability and a relatively longer runtime.

Erroneous cases – layerwise RC/ClC optimization. Figure 10(c) demonstrates the runtime overhead with layerwise optimization enabled. Every layer decides whether to use RC/ClC independently, as described in Section 4.3. Compared with Figure 10(b), the error correction overhead decreases by 40%∼60% (e.g., 1.55% → 0.72% for YOLOv2, as shown in Figure 10(b) vs. (c)) in all CNN models because of the effectiveness of RC. Figure 11(a) shows the distribution of the various workflows resulting from the layerwise RC/ClC optimization. We can see that RC is enabled in all layers of AlexNet and VGG-19, while it is disabled in 30% to 40% of the layers in ResNet-18 and YOLOv2. The results demonstrate the need for layerwise RC optimization, since RC is not suitable for all layers in the same CNN model. Figure 11(b) shows the distribution of soft errors by the schemes that correct them. Less than 5% of the soft errors are corrected by CoC because of the low correction ability of CoC. RC corrects nearly 90% of the soft errors in AlexNet and VGG-19 because RC is enabled in all layers of these two CNN models and the probability of soft errors striking a row in O is higher than the probability of soft errors striking a column.

[Fig. 10. Runtime Overhead with MKL-DNN: (a) error-free, (b) erroneous (RC/ClC disabled), (c) erroneous (RC/ClC enabled).]

[Fig. 11. Breakdown Analysis of Multischeme Workflow with MKL-DNN: (a) distribution of different workflows (CoC+FC, CoC+RC+FC, CoC+ClC+FC), (b) distribution of soft errors corrected by each scheme (CoC, RC, ClC, FC).]

Erroneous cases – breakdown of error correction overhead by layer. To better illustrate the overhead of our solution for each model, we present in Figure 12 the breakdown of the overhead by layer. The figure demonstrates that layers have diverse error protection overheads due to the different shapes of D and W. We also notice that the overhead of RC differs among layers in the same model, thus confirming the functionality of our layerwise RC/ClC optimization.

[Fig. 12. Breakdown of Runtime Overhead by Layer with MKL-DNN: (a) AlexNet, (b) YOLOv2, (c) VGG-19, (d) ResNet-18.]

6.3 Experimental Results with MM-Based Convolution

In this section, we evaluate the runtime overhead of

our multischeme workflow and that of the traditional MM-based ABFT. Since the MM-based ABFT supports only the MM-based convolution implementation, we set the convolution implementation to the MM-based mode in MKL-DNN. We implemented the MM-based ABFT rigorously based on [26], which has ≤1% overhead for large and square matrices, as claimed by the authors of that work. The overhead of the MM-based ABFT in the convolution execution is shown in Table 6. The MM-based ABFT incurs up to 60% overhead even without error injection for the four CNN models. This result is consistent with our analysis in Section 2.3. Considering that the MM-based ABFT cannot protect the whole process of MM-based convolution and cannot protect other convolution implementations, we conclude that the MM-based ABFT is unsuitable for soft error protection of CNN applications.

TABLE 6
Overhead of MM-Based ABFT for MM-Based Convolution, No Error Injection
  Model      AlexNet   YOLOv2   VGG-19   ResNet-18
  Overhead   27.9%     57.5%    45.8%    61.2%

Figure 13 shows the overhead of our multischeme workflow for MM-based convolution. The overhead of our solution is below 6% in the error-free cases and below 6.5% in the cases with injected errors for all CNN models. The layerwise RC/ClC optimization reduces the overhead for error correction by as much as 77%. Figure 14(a) shows the fractions of the different workflows chosen by the convolutional layers. Figure 14(b) shows the distribution of soft errors that are corrected by the different schemes. Compared with the MKL-DNN implementation, more layers adopt RC for error correction in the MM-based convolution (see Figure 11 versus Figure 14). The reason is that the relative runtime of RC compared with FC is lower in the MM-based convolution implementation than in other implementations.

[Fig. 13. Runtime Overhead with MM-based Convolution: (a) error-free, (b) erroneous (RC/ClC disabled), (c) erroneous (RC/ClC enabled).]

[Fig. 14. Breakdown Analysis of Multischeme Workflow with MM-Based Convolution: (a) distribution of different workflows, (b) distribution of soft errors corrected by each scheme.]

6.4 Parallel Performance Evaluation

In this section, we present the parallel performance evaluation results of AlexNet, YOLOv2, VGG-19, and ResNet-18. Original images of the ImageNet validation dataset are used without preprocessing in order to better demonstrate the process of a parallel CNN inference application. At the beginning of the parallel process, images are distributed to the local disk of each node; then each node first performs the data processing step to convert the images to the suitable size required by the CNN models and then executes the inference step under the protection of our multischeme workflow.

We conducted the parallel evaluation in both error-free and erroneous cases. However, because of space limits, we present only the parallel performance evaluation results in the situation with injected errors (as shown in Figure 15). In fact, the evaluation results in the error-free situation are similar. Specifically, the experiments show that our multischeme workflow has very good scalability: the soft error protection overhead does not increase with the number of nodes at all. In absolute terms, the overhead stays around 2%∼6% in the erroneous cases and is only 1%∼4% in the error-free cases.

7 RELATED WORK

The importance of fault tolerance for convolution has been emerging in recent years. Guaranteeing the correctness of inference is vital in safety-critical use cases [19]. To achieve better resiliency for CNN networks, researchers have been exploring solutions from different perspectives, including hardware, system, and software. For hardware, Kim et al. [49] proposed a hardened 3D die-stacked memory based on the fault characteristics in convolutional DNNs. Li et al. [19] proposed to add redundant circuits selectively to harden the latches based on an analysis of data resiliency. Compared with traditional full-hardware redundancy techniques, those partial-hardware redundancy techniques may not double the power usage. However, hardware modification incurs significant effort considering the varied CNN

[Fig. 15. Parallel Performance Evaluation of Our Solution with Injected Errors on the Bebop Supercomputer: time (seconds) for image processing, inference, and soft error protection on 8–128 nodes, for (a) AlexNet, (b) YOLOv2, (c) VGG-19, (d) ResNet-18.]

models and their accelerators. At the system level, other than DMR/TMR protection, checkpoint/restart (C/R) is also applied to large-scale machine learning systems. Subsequently, Qiao et al. proposed a more efficient C/R scheme based on their derived upper bound on the extra iteration cost with perturbations [50]. While those C/R techniques are promising for protecting model training from soft errors, they are not good fits for inference, since one inference execution can be very fast and applying C/R incurs significant overhead. Researchers have therefore pursued lightweight software-level solutions. By applying ABFT techniques to MM-based convolution, Santos et al. [20] reported that 50%∼60% of radiation-induced corruptions could be corrected. Unfortunately, the traditional ABFT works only for MM-based convolution, which is inefficient in most cases. In contrast, our solutions can work for any convolution implementation.

8 CONCLUSION AND FUTURE WORK

This work focuses on extending ABFT to convolution operations in convolutional neural networks. We propose four ABFT schemes and a multischeme workflow to protect the convolutional layer. We further extend our schemes to support bias, grouped convolution, and convolution back propagation. We implement an efficient CNN framework, FT-Caffe, that is resilient to silent data corruption.

Experiments demonstrate that our proposed fault-tolerant solutions incur negligible overhead. In absolute terms, FT-Caffe can achieve less than 8% overhead for the most widely used CNN models, including AlexNet, YOLO, VGG-19, and ResNet-18, in both error-free and erroneous cases.

We plan to extend the implementation to more CNN frameworks and to design architecture-specific optimizations for different hardware including GPU, FPGA, and AI accelerators.

ACKNOWLEDGMENTS

This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations – the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, to support the nation's exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This work was also supported by the National Science Foundation under Grants CCF-1513201, CCF-1619253, and OAC-2034169. We acknowledge the computing resources provided on Bebop, which is operated by the Laboratory Computing Resource Center at Argonne National Laboratory.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[4] J. Redmon and A. Farhadi, "Yolo9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[5] Y. Goldberg, "Neural network methods for natural language processing," Synthesis Lectures on Human Language Technologies, vol. 10, no. 1, pp. 1–309, 2017.
[6] S.-C. B. Lo, H.-P. Chan, J.-S. Lin, H. Li, M. T. Freedman, and S. K. Mun, "Artificial convolution neural network for medical image pattern recognition," Neural Networks, vol. 8, no. 7-8, pp. 1201–1214, 1995.
[7] E. Gawehn, J. A. Hiss, and G. Schneider, "Deep learning in drug discovery," Molecular Informatics, vol. 35, no. 1, pp. 3–14, 2016.
[8] J. M. Wozniak et al., "Candle/supervisor: A workflow framework for machine learning applied to cancer research," BMC Bioinformatics, vol. 19, no. 18, p. 491, 2018.
[9] J. J. Zhang et al., "Building robust machine learning systems: Current progress, research challenges, and opportunities," in Proceedings of the 56th Annual Design Automation Conference 2019, ser. DAC '19. New York, NY, USA: ACM, 2019.
[10] M. T. Le, F. Diehl, T. Brunner, and A. Knol, "Uncertainty estimation for deep neural object detectors in safety-critical applications," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), 2018, pp. 3873–3878.
[11] S. Burton, L. Gauerhof, and C. Heinzemann, "Making the case for safety of machine learning in highly automated driving," in Computer Safety, Reliability, and Security. Cham: Springer International Publishing, 2017, pp. 5–16.
[12] M. Snir et al., "Addressing failures in exascale computing," International Journal of High Performance Computing Applications, 2014.
[13] L. Bautista-Gomez, F. Zyulkyarov, O. Unsal, and S. McIntosh-Smith, "Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '16. IEEE Press, 2016.
[14] L. Tan, S. L. Song, P. Wu, Z. Chen, R. Ge, and D. J. Kerbyson, "Investigating the interplay between energy efficiency and resilience in high performance computing," in 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, 2015, pp. 786–796.
[15] J. P. Walters, K. M. Zick, and M. French, "A practical characterization of a NASA SpaceCube application through fault emulation and laser testing," in Proceedings of the 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, ser. DSN '13. USA: IEEE Computer Society, 2013, pp. 1–8.
[16] A. Geist, "How to kill a supercomputer: Dirty power, cosmic rays, and bad solder," IEEE Spectrum, vol. 10, pp. 2–3, 2016.
[17] J. Zhang, K. Rangineni, Z. Ghodsi, and S. Garg, "ThunderVolt: Enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators," in Proceedings of the 55th Annual Design Automation Conference, ser. DAC '18. New York, NY, USA: ACM, 2018.
[18] W. Choi, D. Shin, J. Park, and S. Ghosh, "Sensitivity based error resilient techniques for energy efficient deep neural network accelerators," in Proceedings of the 56th Annual Design Automation Conference 2019, ser. DAC '19. New York, NY, USA: ACM, 2019.
[19] G. Li et al., "Understanding error propagation in deep learning neural network (DNN) accelerators and applications," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '17. New York, NY, USA: ACM, 2017, pp. 8:1–8:12.
[20] F. F. d. Santos, L. Draghetti, L. Weigel, L. Carro, P. Navaux, and P. Rech, "Evaluation and mitigation of soft-errors in neural network-based object detection in three GPU architectures," in 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2017, pp. 169–176.
[21] B. Reagen et al., "Ares: A framework for quantifying the resilience of deep neural networks," in Proceedings of the 55th Annual Design Automation Conference, ser. DAC '18. New York, NY, USA: ACM, 2018, pp. 17:1–17:6.
[22] R. Salay, R. Queiroz, and K. Czarnecki, "An analysis of ISO 26262: Using machine learning safely in automotive software," arXiv preprint arXiv:1709.02435, 2017.
[23] D. Li, Z. Chen, P. Wu, and J. S. Vetter, "Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '13. New York, NY, USA: ACM, 2013.
[24] X. Vera, J. Abella, J. Carretero, and A. González, "Selective replication: A lightweight technique for soft errors," ACM Transactions on Computer Systems (TOCS), vol. 27, no. 4, p. 8, 2009.
[25] K.-H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, vol. 100, no. 6, pp. 518–528, 1984.
[26] P. Wu, Q. Guan, N. DeBardeleben, S. Blanchard, D. Tao, X. Liang, J. Chen, and Z. Chen, "Towards practical algorithm based fault tolerance in dense linear algebra," in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '16. New York, NY, USA: ACM, 2016, pp. 31–42.
[27] P. Wu, D. Li, Z. Chen, J. S. Vetter, and S. Mittal, "Algorithm-directed data placement in explicitly managed non-volatile memory," in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 141–152.
[28] L. Tan, S. Kothapalli, L. Chen, O. Hussaini, R. Bissiri, and Z. Chen, "A survey of power and energy efficient techniques for high performance numerical linear algebra operations," Parallel Computing, vol. 40, no. 10, pp. 559–573, 2014.
[29] J. Chen, H. Li, S. Li, X. Liang, P. Wu, D. Tao, K. Ouyang, Y. Liu, K. Zhao, Q. Guan, and Z. Chen, "Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs," in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC '18. Piscataway, NJ, USA: IEEE Press, 2018, pp. 68:1–68:12.
[30] J. Chen, N. Xiong, X. Liang, D. Tao, S. Li, K. Ouyang, K. Zhao, N. DeBardeleben, Q. Guan, and Z. Chen, "TSM2: Optimizing tall-and-skinny matrix-matrix multiplication on GPUs," in Proceedings of the ACM International Conference on Supercomputing, 2019, pp. 106–116.
[31] C. Rivera, J. Chen, N. Xiong, S. L. Song, and D. Tao, "TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs," 2020.
[32] Z. Chen, "Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '13. New York, NY, USA: ACM, 2013, pp. 167–176.
[33] D. Tao, S. L. Song, S. Krishnamoorthy, P. Wu, X. Liang, E. Z. Zhang, D. Kerbyson, and Z. Chen, "New-Sum: A novel online ABFT scheme for general iterative methods," in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '16. New York, NY, USA: ACM, 2016, pp. 43–55.
[34] D. Tao, S. Di, X. Liang, Z. Chen, and F. Cappello, "Improving performance of iterative methods by lossy checkpointing," in Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018, pp. 52–65.
[35] D. Tao, "Fault tolerance for iterative methods in high-performance computing," Ph.D. dissertation, UC Riverside, 2018.
[36] X. Liang, J. Chen, D. Tao, S. Li, P. Wu, H. Li, K. Ouyang, Y. Liu, F. Song, and Z. Chen, "Correcting soft errors online in fast Fourier transform," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '17. New York, NY, USA: ACM, 2017, pp. 30:1–30:12.
[37] S. Li, H. Li, X. Liang, J. Chen, E. Giem, K. Ouyang, K. Zhao, S. Di, F. Cappello, and Z. Chen, "FT-iSort: Efficient fault tolerance for introsort," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '19. New York, NY, USA: Association for Computing Machinery, 2019.
[38] J. Cong and B. Xiao, "Minimizing computation in convolutional neural networks," in Artificial Neural Networks and Machine Learning – ICANN 2014. Springer International Publishing, 2014, pp. 281–290.
[39] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
[40] S. Jin, S. Di, X. Liang, J. Tian, D. Tao, and F. Cappello, "DeepSZ: A novel framework to compress deep neural networks by using error-bounded lossy compression," in Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019, pp. 159–170.
[41] Z. Hu, X. Zou, W. Xia, S. Jin, D. Tao, Y. Liu, W. Zhang, and Z. Zhang, "Delta-DNN: Efficiently compressing deep neural networks via exploiting floats similarity," in The 49th International Conference on Parallel Processing (ICPP 2020), 2020.
[42] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[43] Bebop supercomputer. Available at https://www.lcrc.anl.gov/systems/resources/bebop, 2019, online.
[44] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4013–4021.
[45] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
[46] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '14. New York, NY, USA: ACM, 2014, pp. 269–284.
[47] NVIDIA, http://nvdla.org, 2019, online.
[48] S. Liu, Q. Wang, and G. Liu, "A versatile method of discrete convolution and FFT (DC-FFT) for contact analyses," Wear, vol. 243, no. 1, pp. 101–111, 2000.
[49] J.-S. Kim and J.-S. Yang, "DRIS-3: Deep neural network reliability improvement scheme in 3D die-stacked memory based on fault analysis," in Proceedings of the 56th Annual Design Automation Conference 2019, ser. DAC '19. New York, NY, USA: ACM, 2019, pp. 129:1–129:6.
[50] A. Qiao, B. Aragam, B. Zhang, and E. Xing, "Fault tolerance in iterative-convergent machine learning," in Proceedings of the 36th International Conference on Machine Learning, ser. ICML, vol. 97. Long Beach, California, USA: PMLR, 2019, pp. 5220–5230.
Kai Zhao received his bachelor's degree from Peking University in 2014 and will receive his Ph.D. degree from the University of California, Riverside in 2022. He is a long-term intern at Argonne National Laboratory. His research interests include high-performance computing, scientific data management and reduction, and resilient machine learning. Email: kzhao016@ucr.edu.

Sheng Di (Senior Member, IEEE) received his master's degree from Huazhong University of Science and Technology in 2007 and Ph.D. degree from the University of Hong Kong in 2011. He is currently a computer scientist at Argonne National Laboratory. Dr. Di's research interests involve resilience in high-performance computing (such as silent data corruption, optimization of checkpoint models, and in-situ data compression) and broad research topics in cloud computing (including optimization of resource allocation, cloud network topology, and prediction of cloud workload/hostload). He is working on multiple HPC projects, such as detection of silent data corruption, characterization of failures and faults for HPC systems, and optimization of multilevel checkpoint models. Email: sdi1@anl.gov.

Sihuan Li is a Ph.D. student in computer science at the University of California, Riverside. He obtained his bachelor's degree in math from Huazhong University of Science and Technology, China. He did a long-term internship at Argonne National Laboratory. Broadly speaking, his research interests fall into high-performance computing. Specifically, he mainly studies algorithm-based fault tolerance (ABFT), lossy compression, and their applications in large-scale scientific simulations. He is an IEEE student member. Email: sli049@ucr.edu.

Xin Liang is a Computer/Data Scientist at Oak Ridge National Laboratory. He received his Ph.D. degree from the University of California, Riverside in 2019 and his bachelor's degree from Peking University in 2014. His research interests include high-performance computing, parallel and distributed systems, scientific data management and reduction, big data analytics, scientific visualization, and cloud computing. He has interned in multiple national laboratories and worked on several exascale computing projects. He is a member of the IEEE. Email: liangx@ornl.gov.

Yujia Zhai received his bachelor's degree from the University of Science and Technology of China in 2016, a master's degree from Duke University in 2018, and will receive his Ph.D. degree from the University of California, Riverside in 2023. His research interests include high-performance computing, parallel and distributed systems, and numerical linear algebra software. Email: yzhai015@ucr.edu.

Jieyang Chen (Member, IEEE) is a Computer Scientist in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. He received his master's and Ph.D. degrees in Computer Science from the University of California, Riverside in 2014 and 2019. He received a bachelor's degree in Computer Science and Engineering from Beijing University of Technology in 2012. His research interests include high-performance computing, parallel and distributed systems, and big data analytics. Email: chenj3@ornl.gov.

Kaiming Ouyang received his bachelor's degree from the University of Electronic Science and Technology of China in 2016 and will receive his Ph.D. degree from the University of California, Riverside in 2021. He is a long-term intern in the PMRS group at Argonne National Laboratory, led by Dr. Balaji and supervised by Dr. Si. His research interest is parallel runtime systems. Email: kouya001@ucr.edu.

Franck Cappello (Fellow, IEEE) is the director of the Joint-Laboratory on Extreme Scale Computing, gathering six of the leading high-performance computing institutions in the world: Argonne National Laboratory, the National Center for Supercomputing Applications, Inria, Barcelona Supercomputing Center, Julich Supercomputing Center, and RIKEN AICS. He is a senior computer scientist at Argonne National Laboratory and an adjunct associate professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He is an expert in resilience and fault tolerance for scientific computing and data analytics. Recently he started investigating lossy compression for scientific data sets to respond to the pressing needs of scientists performing large-scale simulations and experiments. His contribution to this domain is one of the best lossy compressors for scientific data sets respecting user-set error bounds. He is a member of the editorial board of the IEEE Transactions on Parallel and Distributed Systems and of the ACM HPDC and IEEE CCGRID steering committees. He is a fellow of the IEEE. Email: cappello@mcs.anl.gov.

Zizhong Chen (Senior Member, IEEE) received a bachelor's degree in mathematics from Beijing Normal University, a master's degree in economics from the Renmin University of China, and a Ph.D. degree in computer science from the University of Tennessee, Knoxville. He is a professor of computer science at the University of California, Riverside. His research interests include high-performance computing, parallel and distributed systems, big data analytics, cluster and cloud computing, algorithm-based fault tolerance, power and energy efficient computing, numerical algorithms and software, and large-scale computer simulations. His research has been supported by the National Science Foundation, the Department of Energy, CMG Reservoir Simulation Foundation, Abu Dhabi National Oil Company, Nvidia, and Microsoft Corporation. He received a CAREER Award from the US National Science Foundation and a Best Paper Award from the International Supercomputing Conference. He is a Senior Member of the IEEE and a Life Member of the ACM. He currently serves as a subject area editor for the Elsevier Parallel Computing journal and an associate editor for the IEEE Transactions on Parallel and Distributed Systems. Email: chen@cs.ucr.edu.
