
This article has been accepted for publication in IEEE Transactions on Industrial Electronics but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIE.2018.2866050.

Multi-wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis

Minghang Zhao, Myeongsu Kang, Member, IEEE, Baoping Tang, and Michael Pecht, Fellow, IEEE

Abstract—Wavelet transform, an effective tool to decompose signals into a series of frequency bands, has been widely used for vibration-based fault diagnosis in machinery. Likewise, the use of deep learning algorithms is becoming popular to automatically learn discriminative features from input data for the sake of improving diagnostic performance. However, the fact that no general consensus has been reached as to which wavelet basis functions are useful for diagnosis motivated this investigation of methods to fuse multi-wavelet transforms into deep learning algorithms. In this paper, two methods—i.e., multi-wavelet coefficients fusion in deep residual networks by concatenation (MWCF-DRN-C) and multi-wavelet coefficients fusion in deep residual networks by maximization (MWCF-DRN-M)—were developed to capture discriminative information from diverse sets of wavelet coefficients for fault diagnosis. The efficacy of the developed methods was verified by applying them to planetary gearbox fault diagnosis.

Index Terms—Deep residual networks, fault diagnosis, feature learning, multi-wavelet coefficients fusion, planetary gearbox.

Manuscript received September 18, 2017; revised February 5, 2018 and June 14, 2018; accepted August 5, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 51775065 and in part by the Technology Innovation Program (or Industrial Strategic Technology Development Program (10076392, Development of Vehicle Self Diagnosis System and Service for Automobile Driving Safety Improvement) funded by the Ministry of Trade, Industry and Energy, Korea). (Corresponding author: Myeongsu Kang.)
Minghang Zhao and Baoping Tang are with the State Key Laboratory of Mechanical Transmission, Chongqing University, Chongqing 400044, China (e-mail: minghang.zhao@gmail.com, bptang@cqu.edu.cn).
Myeongsu Kang and Michael Pecht are with the Center for Advanced Life Cycle Engineering, University of Maryland, College Park, MD 20742, USA (e-mail: mskang@calce.umd.edu, pecht@calce.umd.edu).

I. INTRODUCTION

Planetary gearboxes can provide higher gear ratios and power densities than the general fixed-shaft gearboxes, and have been used in a wide range of critical mechanical systems, such as helicopters, heavy trucks, and wind turbines [1]. However, due to longtime dust corrosion, heavy loads, and tough operating environments, planetary gearboxes may suffer various kinds of faults, such as damages to the bearings, gear root cracking, and shaft imbalance [2]. Consequently, effective fault diagnosis and timely maintenance of planetary gearboxes are helpful to ensure the safety of helicopters and heavy trucks, the long-term generation of wind power, and the reliable operation of other related mechanical systems.

However, fault diagnosis of planetary gearboxes is often more challenging than that of fixed-shaft gearboxes, because the dynamic characteristics of planetary gearboxes are much more complicated. To be specific, the planet gears not only mesh with the sun gear, but also mesh with the ring gear simultaneously; planetary gearboxes mostly have to operate together with other rotating components, such as fixed-shaft gearboxes and motors; and there may be irregular noises from the working environments, which make the vibration signals more complex. Furthermore, the operating conditions mostly change with time, so that even the vibration signals of the same health condition may also vary a lot. As a result, it is very difficult for general signal processing (or envelope analysis) based fault detection methods to detect the defect frequency and then diagnose the faults of planetary gearboxes accurately.

The state-of-the-art machine learning-based fault diagnosis methods can be classified into two categories: shallow learning vs. deep learning. In the shallow learning-based methods, feature construction based on statistical features (e.g., root-mean-square (RMS), kurtosis, and energy [3], [4]) is an essential step. Then, the features are fed into shallow learning algorithms, such as support vector machines, neural networks, and decision trees, for the sake of fault diagnosis. The use of shallow learning algorithms with domain knowledge-based statistical features can be computationally more efficient than deep learning-based approaches. Likewise, monitoring the features can offer relatively direct health information about a target system, while learned features derived from deep learning algorithms may not. However, it is difficult and time-consuming to determine which features should be extracted, because an optimized set of features varies from case to case in industrial applications. Further, the fact that conventional shallow learning algorithms are not scalable for dealing with high-volume and high-dimensional data for fault diagnosis will be an issue.


Deep learning algorithms [5] that automatically learn a good set of features from the monitoring data for diagnosis can be an effective way to address the above-mentioned disadvantages [6]-[15]. For example, Jia et al. [6] used deep auto-encoders to pre-train deep neural networks, which achieved higher accuracies than shallow neural networks; Wang et al. [13] applied convolutional neural networks (CNNs) to the time-frequency representations obtained by using continuous wavelet transform, which yielded higher accuracies than traditional shallow classifiers. Therefore, the feature learning ability of deep learning algorithms is a great benefit for fault diagnosis applications. However, training deep learning algorithms is often not an easy task. For example, traditional deep auto-encoders involve too many weights to be trained, and although CNNs adopt a weight sharing strategy to reduce the number of weights, back-propagating errors through multiple layers can result in exploding/vanishing gradient problems, which can be a primary cause of training failures. Additionally, most of the previous deep learning algorithms were not specifically designed for vibration-based fault diagnosis. Hence, to further improve diagnostic performance, it is necessary to explore new deep learning architectures.

Time-frequency analysis can uncover the dynamic properties of non-stationary vibration signals. Various time-frequency analysis methods (e.g., short-time Fourier transform, wavelet transform, and empirical mode decomposition) have been used in vibration-based fault diagnosis. However, the short-time Fourier transform only has a fixed resolution in the frequency domain, and empirical mode decomposition lacks a theoretical demonstration. Due to the merit of multi-resolution localization in detecting faults from gearboxes [16], the wavelet transform [17] was used in this study to generate time-frequency representations of vibration signals as the input of the deep learning method. To be specific, the discrete wavelet packet transform (DWPT) was adopted, which can generate a series of matrices of wavelet packet coefficients.

However, there is still no general consensus as to which wavelet can offer optimal performance for fault diagnosis. Furthermore, an optimal wavelet basis function can vary depending on fault detection and diagnosis applications [12], [13], [18]-[20]. For example, Ding and He [12] employed a deep learning model for fault diagnosis of spindle bearings, which took a matrix containing wavelet packet energies as input. Specifically, the authors used a Daubechies 8 wavelet for DWPT. Wang et al. [13] used a Morlet wavelet to generate time-frequency representations from the vibration signals and further applied a CNN to fault diagnosis, which took the time-frequency representations as input. Kang et al. [18] used DWPT with a Daubechies 20 wavelet to decompose acoustic emission signals into a series of wavelet packet nodes and extract features from the nodes, such as relative wavelet packet node energies, for bearing fault diagnosis.

To address the aforementioned issue, wavelet selection [21], [22] and multi-wavelet coefficients fusion are two promising solutions. Here, the main objective of wavelet selection is to find an optimal wavelet for fault diagnosis. For example, Vakharia et al. [21] used a criterion named the "maximum energy to Shannon entropy ratio" to specify an optimal wavelet for feature extraction. However, there is still a likelihood that an optimal wavelet specified by wavelet selection methods may not be effective for time-frequency representations for learning a discriminative set of features that will be used for planetary gearbox fault diagnosis under non-stationary operating conditions. This motivated us to learn a good set of features from time-frequency representations obtained by diverse wavelets and verify the effectiveness of the method to fuse multiple wavelet transforms into deep learning algorithms. That is, this paper aims to learn a good set of features from diverse wavelets. Furthermore, learning diverse features is vital for increasing the performance of deep learning methods [23], and the fusion of multi-wavelet coefficients into deep learning algorithms can enable learning diverse features. Therefore, this study can avoid the wavelet selection problem.

A deep residual network (DRN) [24], [25] is a state-of-the-art deep learning method. DRNs can automatically learn discriminative features from the input data. The difference between a DRN and the classical CNN is that the DRN uses identity skip-connections (ISCs) in the deep architecture to make the trainable parameters easier to optimize. The DRN integrates many "tricks" for a better training of deep neural networks, such as momentum, batch normalization (BN), L2 regularization, and variance-scaling weight initialization. These tricks make it reliable and applicable to experimental data with different properties. Therefore, the DRN has the potential to learn a good set of features from the input data and to correctly identify the health status of an object machine.

In this study, two multi-wavelet coefficients fusion methods within the DRN architecture were developed for fault diagnosis—multi-wavelet coefficients fusion in a DRN by concatenation (MWCF-DRN-C) and multi-wavelet coefficients fusion in a DRN by maximization (MWCF-DRN-M), respectively. The MWCF-DRN-C method concatenates a series of matrices of wavelet packet coefficients and takes a single concatenated matrix as an input, whereas the MWCF-DRN-M method re-designs the architecture of the DRN for the sake of multi-wavelet coefficients fusion. These methods can adaptively adjust the contribution of wavelet packet coefficients to fault diagnosis, with the goal of improving diagnostic performance. Likewise, these methods can learn better features for fault diagnosis than the state-of-the-art deep learning methods (i.e., CNN and DRN) taking a matrix of wavelet packet coefficients obtained from a certain wavelet basis function.

The remainder of this paper is organized as follows. Section II describes a simulation system to collect vibration signals under variable operating conditions used for planetary gearbox fault diagnosis. Section III delineates the inclusion of domain knowledge into the deep models—that is, MWCF-DRN-C and MWCF-DRN-M—and defines the input of the methods using multi-wavelet coefficients fusion. In Section IV, performance comparisons are conducted to verify the effectiveness of the developed methods for fault diagnosis, and the limitations are discussed. Section V gives conclusions.

II. FAULT DESCRIPTION OF PLANETARY GEARBOXES


To verify the effectiveness of the developed methods, fault diagnosis of a planetary gearbox was considered. A drivetrain dynamics simulator was used to simulate the faults. The simulator was mainly composed of a 3-phase 3 HP motor, a 2-stage planetary gearbox with 192:7 gear ratio (including 4 planet gears in the 1st stage and 3 planet gears in the 2nd stage), a 2-stage parallel gearbox (with 29:100 gear ratio in the 1st stage and 5:2 gear ratio in the 2nd stage), and a programmable heavy duty magnetic brake (with a maximum torque of 65 lb·ft), as shown in Fig. 1. More information about the simulator can be found in [26]. Vibration signals in the vertical direction were collected at 25.6 kHz using an accelerometer, which was mounted at the input side of the planetary gearbox.

Fig. 1. The drivetrain dynamics simulator used in this study.

This study considered nine health conditions of a planetary gearbox under non-stationary operating conditions, including 1 normal and 8 faulty statuses (i.e., bearing and gear defects), as summarized in Table I. For each health condition, 12 16-s vibration signals were collected as the rotational speed of the motor linearly increased from 20 Hz to 36 Hz under three different torsional loads, as described in Table II. In particular, each 16-s vibration signal was equally divided into 100 observations, with each observation in the length of 4,096 samples. Therefore, the total number of observations was 1200 for each health condition, as presented in Table II. Likewise, although each of the 0.16-s vibration signals already contained a certain level of noise, white Gaussian noise was artificially embedded into them to increase the level of difficulty in fault diagnosis. This is based on the assumption that the vibration signals can contain a higher level of noise in real-world vibration-based fault diagnosis applications; noise was added to the signals to enforce a signal-to-noise ratio of 5 dB. The average RMSs are shown in Table I to illustrate the intensities of the faults under non-stationary operating conditions.

TABLE I
A SUMMARY OF HEALTH STATES OF THE PLANETARY GEARBOX

Health status | Label | Description                            | RMS (m/s²)
Healthy       | H     | Healthy bearings and gears             | 1.00 ± 0.26
Faulty        | BFB   | A ball fault in a bearing              | 1.54 ± 0.66
Faulty        | IFB   | An inner race fault in a bearing       | 1.70 ± 0.82
Faulty        | OFB   | An outer race fault in a bearing       | 1.86 ± 0.88
Faulty        | CFB   | A composite fault of BFB, IFB, and OFB | 1.77 ± 0.86
Faulty        | TRC   | A tooth root crack on a gear           | 1.60 ± 0.66
Faulty        | TSP   | A tooth surface pitting on a gear      | 1.64 ± 0.68
Faulty        | TCF   | A tooth chipped fault on a gear        | 1.01 ± 0.25
Faulty        | TMF   | A tooth missing fault on a gear        | 1.61 ± 0.70

TABLE II
NUMBER OF OBSERVATIONS FOR EACH HEALTH STATUS

Load (lb·ft) | Number of 16-s vibration signals under each load condition | Number of observations in each 16-s vibration signal | Total number of observations for each class
1            | 4                                                           | 100                                                  | 1200
6            | 4                                                           | 100                                                  |
18           | 4                                                           | 100                                                  |

III. DEVELOPED METHODS FOR MULTI-WAVELET COEFFICIENTS FUSION IN A DRN

As stated in Section I, as a candidate solution to address the fact that no general consensus has been reached as to which wavelet function provides optimal diagnostic performance, multi-wavelet coefficients fusion is considered in this study. Accordingly, this section mainly discusses the essential idea behind the two developed methods, MWCF-DRN-C and MWCF-DRN-M, by presenting the theoretical background of a DRN and the design of the DRN architecture.

A. Input Data Configuration

As a classical multi-resolution analysis algorithm, DWPT [17] enables a signal to be decomposed into two sets of wavelet coefficients, i.e., the approximation coefficients at a low-frequency band and the detail coefficients at a relatively high-frequency band. As indicated in Fig. 2, the decomposition is then repeated recursively not only on the approximation coefficients but also on the detail coefficients, so that the information on various frequency bands can be revealed.

Fig. 2. An example of a 3-level decomposition tree of DWPT, where $W_{i,j}$ is the wavelet coefficients in the $j$th sub-band at the $i$th decomposition level. $W_{1,0}$ and $W_{1,1}$ are the approximation coefficients and detail coefficients of the original signal, respectively. $W_{2,0}$ and $W_{2,1}$ are the approximation coefficients and detail coefficients of $W_{1,0}$, respectively.

Mathematically, DWPT can be implemented using a series of convolutions with a pair of low-pass and high-pass filters [17]. Specifically, the high-pass filter $h(\cdot)$ and the low-pass filter $g(\cdot)$, such that $g(k) = (-1)^k h(1-k)$, can be defined by:

$h(k) = \frac{1}{\sqrt{2}} \langle \varphi(t), \varphi(2t - k) \rangle$    (1)

$g(k) = \frac{1}{\sqrt{2}} \langle \psi(t), \psi(2t - k) \rangle$    (2)

where $\varphi(\cdot)$ is a scaling function, $\psi(\cdot)$ is the corresponding wavelet function, $\langle \cdot,\cdot \rangle$ is an inner product, and $t$ and $k$ are variables. For a discrete 1-dimensional signal, its wavelet coefficients at various frequency sub-bands and decomposition levels can be calculated iteratively by:

$W_{i+1,2j}(\tau) = \sum_{k} h(k - 2\tau) W_{i,j}(k)$    (3)

$W_{i+1,2j+1}(\tau) = \sum_{k} g(k - 2\tau) W_{i,j}(k)$    (4)

where $W_{0,0}$ is the original discrete signal in the length of $N$, $\{W_{i,j}(k), k = 1, 2, \ldots, N/2^{i}\}$ are the wavelet coefficients in the $j$th sub-band at the $i$th decomposition level, $\{W_{i+1,2j}(\tau), \tau = 1, 2, \ldots, N/2^{i+1}\}$ and $\{W_{i+1,2j+1}(\tau), \tau = 1, 2, \ldots, N/2^{i+1}\}$ are the wavelet coefficients in the $(2j)$th and $(2j+1)$th sub-bands at the $(i+1)$th decomposition level, respectively, and $j$ ranges from 0 to $2^{i} - 1$ at the $i$th decomposition level.
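For concreteness, a minimal NumPy sketch of the recursion in (3) and (4) is given below. The DB2 filter taps and the simplified boundary handling are illustrative assumptions; the paper specifies only the filter relations (1)-(4), not an implementation.

```python
import numpy as np

# DB2 filter taps used only for illustration; g is one common quadrature-mirror
# counterpart of h (the paper states g(k) = (-1)^k h(1 - k)).
h = np.array([0.4829629, 0.8365163, 0.2241439, -0.1294095])    # filter used in (3)
g = np.array([-0.1294095, -0.2241439, 0.8365163, -0.4829629])  # filter used in (4)

def decompose_once(subbands):
    """Apply (3) and (4) to every sub-band: filter, then keep every second sample."""
    out = []
    for w in subbands:
        out.append(np.convolve(w, h, mode="same")[::2])   # W_{i+1, 2j}
        out.append(np.convolve(w, g, mode="same")[::2])   # W_{i+1, 2j+1}
    return out

def dwpt(signal, depth=6):
    """Full packet tree: 2**depth terminal sub-bands of length len(signal)/2**depth."""
    subbands = [np.asarray(signal, dtype=float)]
    for _ in range(depth):
        subbands = decompose_once(subbands)
    return subbands

coeffs = dwpt(np.random.randn(4096))
print(len(coeffs), coeffs[0].shape)   # -> 64 (64,)
```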
As mentioned in Section I, previous research has found that the performance of DWPT-assisted fault diagnosis applications was highly dependent on the selected wavelet functions [16]. Aiming at this problem, a straightforward solution is to use multiple wavelet functions together for comprehensive time-frequency representations of vibration signals.


In this study, Daubechies wavelets were used as an example in the experiment since they were widely used in vibration-based fault diagnosis [3], [12], [19]. However, it is notable that the developed methods are applicable to other wavelets, such as Symlet, Coiflet, Morlet, and so forth.

As depicted in Fig. 3, the wavelet packet coefficients at different frequency bands obtained using a certain wavelet can be formed into a 2-dimensional (2D) matrix; then, the 2D matrices derived from different wavelets can be stacked together as a 3-dimensional (3D) matrix. Likewise, since the depth of DWPT was 6 (i.e., depth = 6) and the length of an observation was 4,096 in the experiment, the dimension of a 3D matrix of wavelet packet coefficients in this study was $p \times q \times N_w = (4096/2^{\text{depth}}) \times (2^{\text{depth}}) \times N_w = 64 \times 64 \times N_w$, where $N_w$ is the number of selected wavelets.

Fig. 3. A series of 2D matrices of wavelet packet coefficients obtained using various Daubechies wavelets.
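As a hedged sketch (not the authors' code), the $64 \times 64 \times N_w$ input can be assembled with the PyWavelets package, which is an assumed helper library not named in the paper; the wavelets db1, db8, and db20 below are chosen arbitrarily for illustration.

```python
import numpy as np
import pywt  # PyWavelets; an assumed helper library, not named in the paper

def wavelet_packet_image(x, wavelet, depth=6):
    """Return the p x q (= 64 x 64) matrix of terminal wavelet packet coefficients."""
    wp = pywt.WaveletPacket(data=x, wavelet=wavelet, mode='periodization', maxlevel=depth)
    nodes = wp.get_level(depth, order='freq')           # 2**depth sub-bands, low to high
    return np.stack([node.data for node in nodes])      # shape: (64, 64)

def multi_wavelet_input(x, wavelets=('db1', 'db8', 'db20')):
    """Stack the 2D matrices from several Daubechies wavelets into a 64 x 64 x N_w tensor."""
    return np.stack([wavelet_packet_image(x, w) for w in wavelets], axis=-1)

observation = np.random.randn(4096)          # one 0.16-s vibration segment (placeholder data)
x_input = multi_wavelet_input(observation)
print(x_input.shape)                          # -> (64, 64, 3)
```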
B. Background Theory of a DRN

A DRN can be interpreted as a model that is a stack of various components, including a convolutional layer, a series of residual building units (RBUs), a BN, a rectifier linear unit activation function (ReLU), a global average pooling (GAP), and a fully connected output layer [24], [25]. As shown in Fig. 4(a), a RBU can be composed of two BNs, two ReLUs, two convolutional layers, and one ISC. A brief architecture of a DRN is shown in Fig. 4(b).

Fig. 4. (a) A RBU, (b) a brief architecture of a DRN, in which "Conv 3 × 3" refers to a convolutional layer with a convolutional kernel size of 3 × 3.

The convolutional layer is used to learn features, in which each convolutional kernel behaves as a trainable feature extractor. Compared with matrix multiplications in the traditional fully connected layers, the use of convolutions in convolutional layers enables reduction of the number of weights and computational complexity, which is expressed by:

$O_C(i,j) = \sum_{u} \sum_{v} \sum_{c} I_C(i-u, j-v, c) \cdot K(u, v, c) + b$    (5)

where $I_C$ is the input feature map of a convolutional layer; $K$ is a convolutional kernel; $b$ is a bias; $O_C$ is a channel of the output feature map; $i$, $j$, and $c$ are the indexes of row, column, and channel of the feature map, respectively; and $u$ and $v$ are the indexes of row and column of the convolutional kernel, respectively. Since a convolutional layer can have more than one convolutional kernel, more than one channel of the output feature map can be obtained. In this study, convolutional kernels of a 3 × 3 size were used because they not only have a higher computational efficiency than larger kernels, but they can also be large enough to detect local maxima [27].

In each training iteration, a mini-batch of observations is randomly selected and fed into the DRN. However, the distributions of learned features in the mini-batches often continuously change in the training iterations, which is known as the internal covariance shift problem [28]. In such a case, the weights and biases have to be continuously updated to adapt to the changed distributions. As a result, the training of deep networks can be challenging. BN [28] is a kind of technique which is used to address this problem and is expressed by:

$\mu = \frac{1}{N_{\text{batch}}} \sum_{s=1}^{N_{\text{batch}}} x_s$    (6)

$\sigma^2 = \frac{1}{N_{\text{batch}}} \sum_{s=1}^{N_{\text{batch}}} (x_s - \mu)^2$    (7)

$\hat{x}_s = \frac{x_s - \mu}{\sqrt{\sigma^2 + \epsilon}}$    (8)

$y_s = \gamma \hat{x}_s + \beta$    (9)

where $x_s$ is a feature of the $s$th observation in a mini-batch, $N_{\text{batch}}$ is the mini-batch size, $\epsilon$ is a constant value which is close to zero, and $y_s$ is the output feature of BN. The input features are normalized to have a mean of 0 and a standard deviation of 1 in (6), (7), and (8), so that the input features are enforced to have similar distributions; then, $\gamma$ and $\beta$ are trained to scale and shift the normalized features to desirable distributions. The optimization of $\gamma$ and $\beta$ is achieved using a gradient descent algorithm, which is expressed by:

$\gamma \leftarrow \gamma - \frac{\eta}{N_{\text{batch}}} \sum_{s} \sum_{k} \frac{\partial E_s}{\partial Path\_r_k} \frac{\partial Path\_r_k}{\partial \gamma}$    (10)

$\beta \leftarrow \beta - \frac{\eta}{N_{\text{batch}}} \sum_{s} \sum_{k} \frac{\partial E_s}{\partial Path\_b_k} \frac{\partial Path\_b_k}{\partial \beta}$    (11)

where $\eta$ is the learning rate, and $E_s$ is the error of the $s$th observation. $Path\_r$ and $Path\_b$ are two collections of differentiable paths that connect $\gamma$ and $\beta$ with the error at the output layer, respectively.
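A minimal NumPy sketch of the forward pass in (6)-(9) is shown below; in practice the gradients in (10) and (11) are handled by automatic differentiation, so only the forward computation is illustrated.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Per-feature batch normalization following (6)-(9).

    x has shape (N_batch, n_features); gamma and beta are learned per feature.
    """
    mu = x.mean(axis=0)                      # (6) mini-batch mean
    var = x.var(axis=0)                      # (7) mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # (8) normalize to zero mean, unit variance
    return gamma * x_hat + beta              # (9) learned scale and shift

x = np.random.randn(128, 4) * 3.0 + 7.0      # a mini-batch with shifted statistics
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # approximately 0 and 1
```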
The ReLU is used to achieve nonlinear transformations by enforcing the negative features to be zero. It is expressed by:

$O_R(i,j,c) = \max\{I_R(i,j,c), 0\}$    (12)

where $I_R$ and $O_R$ are the input and output feature maps of the ReLU, respectively. The derivative of the ReLU is expressed by:

$\frac{\partial O_R(i,j,c)}{\partial I_R(i,j,c)} = \begin{cases} 1, & \text{if } I_R(i,j,c) > 0 \\ 0, & \text{if } I_R(i,j,c) < 0 \end{cases}$    (13)

Its derivative is either 1 or 0, which can reduce the risk of gradient vanishing and exploding compared with the sigmoid and tanh activation functions.

The ISCs are the key component that makes a DRN easier to train than the traditional CNNs.


In the training process of traditional CNNs without ISCs, the gradients of error with respect to the weights (and biases) need to be back-propagated layer by layer. For example, the gradients on the $l$th layer are dependent on the weights at the $(l+1)$th layer. If the weights at the $(l+1)$th layer are not optimal, the gradients on the $l$th layer cannot be optimal as well. As a result, it is difficult to effectively train the weights in a CNN with multiple layers. ISCs solve this problem by directly connecting some convolutional layers to deeper layers, so that it can be easy for the gradients to be back-propagated through a deep network. In other words, the gradients can be back-propagated into the layers more easily than in the traditional CNNs, so that the weights and biases can be updated effectively. It has been shown that a DRN with tens or hundreds of layers can be easily trained and yield higher accuracies than the CNNs without ISCs [25].

A GAP was applied before the final fully connected output layer, which is expressed by:

$O_G(c) = \underset{i,j}{\operatorname{average}}\ I_G(i,j,c)$    (14)

where $I_G$ and $O_G$ are the input and output feature maps of the GAP, respectively. The GAP enables the shift variant problem to be addressed by calculating a global feature from each channel of the input feature map. In this study, the shift variant problem means that the fault-related impulses can exist in different locations in the observations, and the GAP can ensure that the DRN learns features which are invariant to the locations. The output feature maps of the GAP are fed to the fully connected output layer to pick up the classification results.

The training process of a DRN follows the same principle as the general neural networks. The training data are propagated into a DRN and processed while passing through a series of convolutional layers, BNs, and ReLUs followed by a GAP and a fully connected output layer. More specifically, at the fully connected output layer, a softmax function is used to estimate the possibility of an observation belonging to the classes [29], which is expressed by:

$y_n = \frac{e^{x_n}}{\sum_{z=1}^{N_{\text{class}}} e^{x_z}}, \quad \text{for } n = 1, \ldots, N_{\text{class}}$    (15)

where $x_n$ is the feature at the $n$th neuron of the output layer, $y_n$ is the output, which can be seen as the estimated possibility of an observation belonging to the $n$th class, and $N_{\text{class}}$ is the total number of classes. Then, the cross-entropy error, which measures the distance between the true label $t$ and the output $y$, can be calculated by:

$E(y,t) = -\sum_{n=1}^{N_{\text{class}}} t_n \ln(y_n)$    (16)

where $t_n$ is the true possibility of the observation belonging to the $n$th class. Note that the partial derivative of the cross-entropy error with respect to the neurons at the fully connected output layer can be expressed by:

$\frac{\partial E}{\partial x_n} = y_n - t_n$    (17)

Then, the error is back-propagated through the network to update the weights and biases, which are expressed by:

$w \leftarrow w - \frac{\eta}{N_{\text{batch}}} \sum_{s} \sum_{n} \sum_{k} \frac{\partial E_s}{\partial x_n} \frac{\partial x_n}{\partial Net\_w_{n,k}} \frac{\partial Net\_w_{n,k}}{\partial w}$    (18)

$b \leftarrow b - \frac{\eta}{N_{\text{batch}}} \sum_{s} \sum_{n} \sum_{k} \frac{\partial E_s}{\partial x_n} \frac{\partial x_n}{\partial Net\_b_{n,k}} \frac{\partial Net\_b_{n,k}}{\partial b}$    (19)

where $w$ is a weight, $b$ is a bias, and $Net\_w$ and $Net\_b$ are two collections of differentiable paths that connect the weight and bias with the neurons at the fully connected output layer, respectively. The training procedures can be repeated a certain number of times so that the parameters can be optimized. In summary, the parameters that need to be optimized while training include $\gamma$ and $\beta$ in the BNs and the weights and biases in the convolutional layers and the fully connected output layer.
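The following NumPy sketch illustrates (15)-(17) for a single observation and checks numerically that the gradient at the output neurons reduces to $y_n - t_n$; it is an illustration, not part of the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Eq. (15): class possibilities from the output-layer features x of one observation."""
    e = np.exp(x - x.max())                 # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, t):
    """Eq. (16): distance between the one-hot true label t and the softmax output y."""
    return -np.sum(t * np.log(y + 1e-12))

x = np.array([2.0, 0.5, -1.0])              # features at the 3 output neurons
t = np.array([1.0, 0.0, 0.0])               # the true class is the first one
y = softmax(x)
print(cross_entropy(y, t))
print(y - t)                                # Eq. (17): dE/dx_n = y_n - t_n
```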
C. Design of the Fundamental Architecture for DRNs

Deep learning models' architectures, including depth (i.e., the number of nonlinear transformation layers) and width (i.e., the numbers of kernels in the convolutional layers), are key factors that influence the models' performance, such as the test accuracy and computation time. Zoph and Le [30] employed reinforcement learning for network architecture search, which was computationally expensive and used 800 graphics processing units to train the deep models with different hyperparameters. Suganuma et al. [31] investigated a method using genetic programming to design deep networks. Despite such studies, neural network architecture optimization is a long-standing issue in the field–that is, there is still no general consensus as to how deep or wide the network should be.

Fig. 5. A typical architecture of a DRN, in which m indicates the number of convolutional kernels, and "/2" is meant to move the convolutional kernels with a stride of 2.

The fundamental architecture of the DRN used in this study is pictorially illustrated in Fig. 5. Likewise, the essential idea behind this architecture is described as follows. First, the architecture has 19 convolutional layers and 1 fully connected layer in depth. Note that it is important to contain a sufficient number of nonlinear transformation layers to ensure that the input data can be converted to be discriminative features. In previous studies conducted for vibration- and current-based fault diagnosis using deep learning, no more than 10 nonlinear transformation layers have been used [6]-[15]. Considering the increased level of nonlinearity of the acquired data, the DRN contains more nonlinear transformation layers, where a nonlinear transformation layer stands for a convolutional layer with a nonlinear activation function (i.e., ReLU) in this study. As mentioned above, DRNs with tens or hundreds of layers can be easily trained due to the use of ISCs [25], so that the depth of the DRN architecture is in a reasonable range.

Then, the first convolutional layer (i.e., the layer closest to the input layer) and three convolutional layers in the RBUs, which have a stride of 2, are used to reduce the size of the feature maps.


In Fig. 5, $m$ indicates the number of convolutional kernels, which is increased to $2m$ and $4m$ in deeper layers because a few basic features can be integrated to be many different high-level features. $m$ is set to 4 in this study. To further alleviate over-fitting, dropout [32] with a ratio of 50% is applied to the GAP. In other words, half of the neurons in the GAP layer were randomly selected and set to be zero in each training iteration, which can be interpreted as a process of adding noise to the network, in order to prevent the DRN from memorizing too much non-discriminative information and ensure a high generalizability.
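The sketch below assembles such an architecture with tf.keras. It is not the authors' implementation (they used TensorFlow 1.0.1): the grouping of the nine RBUs into three stages, the placement of the three stride-2 RBU convolutions, and the 1 × 1 projection shortcut are assumptions inferred from Fig. 5 and the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def rbu(x, filters, stride=1):
    """Pre-activation residual building unit: BN-ReLU-Conv-BN-ReLU-Conv plus a skip."""
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, strides=stride, padding='same',
                      kernel_initializer='he_normal',
                      kernel_regularizer=tf.keras.regularizers.l2(1e-4))(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding='same',
                      kernel_initializer='he_normal',
                      kernel_regularizer=tf.keras.regularizers.l2(1e-4))(y)
    if stride != 1 or shortcut.shape[-1] != filters:
        # Shortcut handling when the feature-map size or channel count changes is not
        # specified in the paper; a 1x1 projection convolution is one common choice.
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding='same')(shortcut)
    return layers.Add()([shortcut, y])

def build_drn(input_shape=(64, 64, 1), n_classes=9, m=4):
    """A DRN sketch with 19 convolutional layers and 1 fully connected layer."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(m, 3, strides=2, padding='same',
                      kernel_initializer='he_normal')(inputs)   # first conv, stride 2
    for filters in [m, 2 * m, 4 * m]:
        for block in range(3):
            x = rbu(x, filters, stride=2 if block == 0 else 1)  # three strided RBU convs
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.5)(x)                                  # 50% dropout on the GAP output
    outputs = layers.Dense(n_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)

model = build_drn()
model.summary()
```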
D. Multi-wavelet Coefficients Fusion in a DRN

In this subsection, the motivations for developing multi-wavelet coefficients fusion methods are described, and the architectures of the two developed methods (MWCF-DRN-C and MWCF-DRN-M) are presented.

1) Motivations of Multi-wavelet Coefficients Fusion

Fig. 6. (a) Schematic diagram of a rolling element bearing with a crack on its inner race, in which the arrows denote rotating and moving directions, (b) an illustration of impulses due to the crack, (c) schematic diagram of a gear with a broken tooth, and (d) an illustration of the waveform generated by the faulty gear.

The faults on bearings and/or gears often produce relatively large amplitudes in the waveform of vibration signals. For example, for a bearing with a crack on its inner race (see Fig. 6(a)), there will be a sudden change of contact force every time a rolling ball passes over the crack, and the sudden change of contact force will create an impulse in the waveform, as indicated in Fig. 6(b). For a bearing with a crack on its outer race, the rolling balls will strike on the crack and lead to impulses as well. Similarly, a broken tooth on a gear (see Fig. 6(c)) can lead to a large amplitude every time the broken tooth meshes with the tooth of another gear, as indicated in Fig. 6(d).

Conventional signal processing-based fault diagnosis methods often rely on the detection of fault-related waveforms. For example, for a bearing rotating at a constant speed, the fault-related impulses will be generated periodically; if the time interval between the impulses matches the ball passing frequency of the inner race, it is possible to determine whether the bearing has a fault on its inner race. However, for large rotating machines with multi-stage gearboxes, the vibration signals are often composed of multiple components because of vibrations excited by the meshing of multi-stage gear transmissions, rotations of shafts and bearings, or environmental noise. For a rotating machine operating at varying rotating speeds, the frequencies of these vibration components can be non-stationary. Moreover, when a fault is at its early stage, the fault-related information in the waveforms is not easily detected. As a result, the fault-related information can be overwhelmed by the other components, which makes fault diagnosis a challenging task.

To deal with the non-stationary vibration signals, DWPT is employed to decompose the vibration signals into multiple sub-band signals. However, it is generally unknown which sub-band signal contains the most intrinsic information about the system's health conditions (i.e., normal and several faulty conditions). Likewise, because informative sub-band signals can be varied due to changes in operating conditions (e.g., rotating speeds), this study combines all the wavelet coefficients at each terminal node and uses them as input for the deep learning methods.

In general, different wavelets may be optimal for diagnosing different types of faults under different operating conditions, so that it is unlikely for a certain wavelet to be the most effective for diagnosing all types of faults in consideration (e.g., bearing inner race faults, outer race faults, ball faults, gear surface pits, and gear root cracks). Therefore, the fusion of multiple wavelets can improve the performance of a fault diagnosis task involving the classification of multiple fault types.

2) MWCF-DRN-C

The developed MWCF-DRN-C method is based on a well-known fact in deep learning–that is, learning diverse features is critical for increasing performance [23]. The multi-wavelet coefficients fusion concept can be considered a promising way to introduce diversity into a DRN. To enable multi-wavelet coefficients fusion, one of the simplest methods is to concatenate all 2D matrices of wavelet packet coefficients and propagate them into the DRN.

As illustrated in Fig. 7(a), a special design in MWCF-DRN-C is the use of a concatenation layer to combine multiple wavelet packet coefficients by forming a $p \times q \times N_w$ matrix, where $N_w$ is the number of wavelets in consideration and $p$ and $q$ are the dimensionality of a 2D matrix of wavelet packet coefficients (see Section III.A). Note that the concatenation layer does not have any parameter to be trained. Then, with the use of the concatenation layer, the first convolutional layer has more trainable weights that can be used for multi-wavelet coefficients fusion. To be specific, each convolutional kernel in the first convolutional layer of MWCF-DRN-C has $3 \times 3 \times N_w$ weights, while a convolutional kernel in the first convolutional layer of the DRN without multi-wavelet coefficients fusion only has $3 \times 3$ weights, where $3 \times 3$ indicates that the length and width of a convolutional kernel are both 3. This difference is caused by the nature of convolutional layers, i.e., the number of channels of a convolutional kernel has to be the same as the number of channels of the input feature map. After a supervised training process, the trainable weights and biases of the MWCF-DRN-C can be optimized to learn a discriminative set of features for accurate fault diagnosis.
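A hedged sketch of the MWCF-DRN-C front end is given below, reusing build_drn() from the earlier DRN sketch; the Keras Concatenate layer plays the role of the parameter-free concatenation layer, and $N_w = 3$ is chosen arbitrarily for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_mwcf_drn_c(n_wavelets=3, n_classes=9):
    """MWCF-DRN-C sketch: one 64 x 64 x 1 input per wavelet, concatenated along channels."""
    inputs = [layers.Input(shape=(64, 64, 1), name=f'wp_matrix_{i + 1}')
              for i in range(n_wavelets)]
    fused = layers.Concatenate(axis=-1)(inputs)     # parameter-free concatenation layer
    # The unchanged DRN backbone fuses the wavelets in its first 3 x 3 x N_w convolution.
    backbone = build_drn(input_shape=(64, 64, n_wavelets), n_classes=n_classes)
    return tf.keras.Model(inputs, backbone(fused))

model = build_mwcf_drn_c()
dummy = [np.random.randn(8, 64, 64, 1).astype('float32') for _ in range(3)]
print(model(dummy).shape)    # -> (8, 9)
```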
3) MWCF-DRN-M

The development of MWCF-DRN-M is closely related to the working principle of wavelet analysis in fault diagnosis. For fault detection of rotating machinery, wavelet analysis often works as a method to discover fault-related waveforms by generating very positive or negative wavelet packet coefficients.


In other words, compared with the wavelet packet coefficients which are close to zero, very positive or negative wavelet packet coefficients are more likely to represent fault-related waveforms if the wavelet is effective in detecting the fault-related waveforms. However, for large rotating machineries with multi-stage gear transmissions, it is often unavoidable that some unimportant wavelet packet coefficients can have large absolute values as well, which is mainly caused by the other vibration components mentioned in Section III.D.1.

Aiming at the above issues, an individual convolutional layer with trainable parameters is applied to each 2D matrix of wavelet packet coefficients with the goal of highlighting the fault-related wavelet packet coefficients, i.e., transforming the important wavelet packet coefficients to be large features. Then, motivated by the fact that it is generally unknown which wavelet can be the most effective in detecting the fault-related waveforms, the developed MWCF-DRN-M method uses a maximization layer [33]-[35] to fuse the information from multiple wavelets (i.e., the output features of the individual convolutional layers). To be specific, the element-wise maximum values are taken as the output in the maximization layer.

The architecture of the developed MWCF-DRN-M is illustrated in Fig. 7(b). The individual convolutional layers (which are applied to the 2D matrices of wavelet packet coefficients) and the maximization layer are the special designs that differentiate the MWCF-DRN-M from the original DRN. The working principle of the special designs is further explained here. Although the maximization layer is parameterless, the individual convolutional layers make it a trainable process, which enables the network to adjust the values of the features before performing the element-wise maximum feature selection. In this way, the developed MWCF-DRN-M can automatically learn which features to select for the sake of yielding high diagnostic accuracy. This alternative method facilitates the inclusion of physics-based knowledge into the DRN.

Fig. 7. Architectures of (a) MWCF-DRN-C and (b) MWCF-DRN-M, in which a "2D matrix i" refers to a 2-dimensional matrix of wavelet packet coefficients obtained by using an i-tap Daubechies wavelet, abbreviated as DBi in this study, and N is the number of Daubechies wavelets in consideration. Both MWCF-DRN-C and MWCF-DRN-M have the same RBUs as the DRN in Fig. 5.
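The sketch below mirrors this design with tf.keras, reusing rbu() from the earlier DRN sketch. How the fused feature maps connect to the backbone and the stride of the individual convolutions are assumptions; the paper specifies only the per-wavelet convolutional layers and the element-wise maximization.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_mwcf_drn_m(n_wavelets=3, n_classes=9, m=4):
    """MWCF-DRN-M sketch: one trainable convolution per wavelet, fused by element-wise max."""
    inputs = [layers.Input(shape=(64, 64, 1), name=f'wp_matrix_{i + 1}')
              for i in range(n_wavelets)]
    # Individual convolutional layers highlight fault-related coefficients per wavelet.
    branches = [layers.Conv2D(m, 3, strides=2, padding='same',
                              kernel_initializer='he_normal')(x) for x in inputs]
    fused = layers.Maximum()(branches)          # parameter-free element-wise maximization
    x = fused
    for filters in [m, 2 * m, 4 * m]:
        for block in range(3):
            x = rbu(x, filters, stride=2 if block == 0 else 1)
    x = layers.ReLU()(layers.BatchNormalization()(x))
    x = layers.Dropout(0.5)(layers.GlobalAveragePooling2D()(x))
    outputs = layers.Dense(n_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)

model = build_mwcf_drn_m()
dummy = [np.random.randn(4, 64, 64, 1).astype('float32') for _ in range(3)]
print(model(dummy).shape)    # -> (4, 9)
```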
IV. EXPERIMENTAL RESULTS

The two developed methods—i.e., MWCF-DRN-C and MWCF-DRN-M—were implemented using TensorFlow 1.0.1, which is a machine learning library open-sourced by Google. Experimental comparisons were made with the classical CNN and DRN to verify the efficacy of the developed methods.

A. Hyperparameters Setup

The hyperparameters were set based on the setups in generic DRNs [24], [25]. The learning rate was initialized to 0.1 and reduced to 0.01 at the 40th epoch and 0.001 at the 80th epoch. The training was terminated at the 100th epoch, so that the trainable parameters could be updated in large steps at the beginning and slightly fine-tuned at the end of the training process. The coefficient of momentum [5] was set to 0.9, which is used to accelerate the training process by making use of the update in the previous iteration. The mini-batch size was set to 128, which indicated that 128 observations were randomly selected and fed into the deep architecture in each iteration; the training process can be accelerated compared with feeding one observation in each iteration. The weight decay coefficient of L2 regularization was set to 0.0001, which was the same as for the DRN in [24].
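These settings can be reproduced, for instance, with the following tf.keras training configuration; the callback-based schedule is an assumed equivalent of the stepwise learning-rate reduction described above, and the L2 weight decay is already applied through the kernel regularizers in the earlier model sketch.

```python
import tensorflow as tf

# Stated hyperparameters: SGD with momentum 0.9, mini-batch size 128, 100 epochs,
# learning rate 0.1 reduced to 0.01 at epoch 40 and 0.001 at epoch 80.
def lr_schedule(epoch, lr):
    if epoch < 40:
        return 0.1
    elif epoch < 80:
        return 0.01
    return 0.001

model = build_drn(input_shape=(64, 64, 1), n_classes=9)   # from the earlier sketch
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])

# x_train: (n, 64, 64, 1) wavelet packet matrices; y_train: one-hot labels for the 9 classes.
# model.fit(x_train, y_train, batch_size=128, epochs=100,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```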
B. Performance Comparisons

In this section, the state-of-the-art deep learning algorithms without multi-wavelet coefficients fusion (i.e., CNN and DRN taking a matrix of wavelet coefficients using a certain Daubechies wavelet) were used for performance comparisons. A 10-fold cross-validation [5] was conducted to evaluate the methods; that is, the dataset was randomly divided into 10 subsets. In each test, one subset was used as the testing data, and the other nine subsets were put together to be the training data. The tests were repeated 10 times, so that each subset has a chance to be the testing data. As a result, 10 accuracies can be obtained for each method, and their average value (i.e., the average accuracy) was used as the metric to evaluate the method. Experimental results of CNN, DRN, MWCF-DRN-C, and MWCF-DRN-M are given in Tables III-IV. The overall average accuracies are given in Table V and discussed below.
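A sketch of this evaluation protocol is given below; build_and_train() is a hypothetical stand-in for fitting any of the compared models, and scikit-learn's KFold is an assumed helper not mentioned in the paper.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(x, y, build_and_train, n_splits=10):
    """10-fold cross-validation: return the mean and standard deviation of the accuracies."""
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True).split(x):
        model = build_and_train(x[train_idx], y[train_idx])   # hypothetical training routine
        pred = model.predict(x[test_idx]).argmax(axis=1)      # softmax outputs -> class index
        accuracies.append((pred == y[test_idx]).mean())
    return np.mean(accuracies), np.std(accuracies)
```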


1) Performance comparison with CNN and DRN

As mentioned above, both the CNN and DRN took a matrix of wavelet coefficients using a certain Daubechies wavelet, from DB1 to DB30. To ensure a fair and reliable comparison, the same hyperparameters mentioned above were adopted. As shown in Table III, the DRN outperformed the CNN no matter which DB was used. The overall average test accuracy of the DRN with different DBs was 91.45%, which was 2.89% higher than that of the CNN.

For the MWCF-DRN-C and MWCF-DRN-M methods, a set of matrices of wavelet packet coefficients using $N_w$ randomly selected Daubechies wavelets, from DB1 to DB30, was taken as the input, where $N_w$ = 2, 6, 10, 14, 18, 22, 26, and 30 were considered in this study (see Table IV). The reason that the developed methods did not consider a full factorial design was to reduce computational burden. The same 10-fold cross-validation was employed in the performance evaluation of the developed methods.

TABLE III
ACCURACIES OF THE CNN AND DRN USED FOR FAULT DIAGNOSIS OF THE PLANETARY GEARBOX (UNIT: %)

Wavelet | Training accuracy (CNN) | Training accuracy (DRN) | Test accuracy (CNN) | Test accuracy (DRN)
DB1  | 86.74 ± 2.34 | 94.98 ± 0.54 | 83.44 ± 3.01 | 88.41 ± 0.95
DB2  | 88.54 ± 1.03 | 96.31 ± 0.52 | 86.05 ± 1.50 | 90.57 ± 0.64
DB3  | 90.86 ± 2.00 | 96.78 ± 0.54 | 88.13 ± 1.83 | 91.21 ± 1.11
DB4  | 91.44 ± 1.57 | 96.92 ± 0.80 | 88.51 ± 1.50 | 91.26 ± 1.69
DB5  | 91.70 ± 0.91 | 96.93 ± 1.01 | 89.01 ± 1.05 | 92.18 ± 0.96
DB6  | 91.77 ± 1.60 | 97.15 ± 0.63 | 89.04 ± 1.72 | 91.76 ± 0.85
DB7  | 91.88 ± 0.62 | 97.03 ± 0.80 | 89.17 ± 0.99 | 91.92 ± 0.97
DB8  | 92.27 ± 1.44 | 97.32 ± 0.29 | 89.39 ± 1.78 | 92.93 ± 0.79
DB9  | 92.28 ± 1.64 | 97.28 ± 0.49 | 89.45 ± 1.54 | 91.85 ± 0.84
DB10 | 92.19 ± 0.58 | 96.78 ± 0.85 | 89.24 ± 1.16 | 91.77 ± 1.33
DB11 | 91.83 ± 1.90 | 97.07 ± 0.90 | 88.43 ± 2.02 | 92.19 ± 1.31
DB12 | 91.86 ± 0.84 | 97.01 ± 0.71 | 89.14 ± 1.06 | 92.07 ± 1.29
DB13 | 92.13 ± 1.61 | 96.88 ± 0.88 | 89.68 ± 1.47 | 91.87 ± 1.40
DB14 | 92.48 ± 1.33 | 97.05 ± 0.33 | 89.71 ± 1.66 | 92.02 ± 0.82
DB15 | 92.18 ± 1.09 | 96.71 ± 0.73 | 89.48 ± 1.52 | 91.94 ± 1.43
DB16 | 92.54 ± 1.38 | 97.09 ± 0.71 | 89.56 ± 1.96 | 91.97 ± 1.21
DB17 | 91.82 ± 1.53 | 97.04 ± 0.83 | 88.81 ± 1.58 | 92.03 ± 0.76
DB18 | 91.38 ± 0.92 | 96.69 ± 0.98 | 88.39 ± 1.39 | 92.18 ± 0.98
DB19 | 90.65 ± 2.81 | 96.38 ± 0.94 | 87.06 ± 3.21 | 91.37 ± 1.22
DB20 | 91.62 ± 0.72 | 96.66 ± 0.77 | 88.53 ± 1.06 | 91.40 ± 0.87
DB21 | 91.66 ± 1.22 | 96.16 ± 1.11 | 88.77 ± 1.53 | 91.20 ± 1.42
DB22 | 92.13 ± 0.49 | 96.56 ± 1.04 | 89.33 ± 1.05 | 91.50 ± 0.95
DB23 | 92.18 ± 1.31 | 96.22 ± 0.84 | 89.58 ± 1.42 | 91.62 ± 1.04
DB24 | 92.14 ± 0.59 | 95.78 ± 0.89 | 89.18 ± 0.95 | 90.56 ± 1.12
DB25 | 91.01 ± 1.60 | 96.69 ± 0.68 | 88.03 ± 1.65 | 91.57 ± 0.75
DB26 | 91.88 ± 0.90 | 96.11 ± 0.79 | 89.12 ± 0.92 | 91.18 ± 0.97
DB27 | 90.94 ± 2.05 | 95.96 ± 0.85 | 88.43 ± 2.16 | 90.79 ± 1.44
DB28 | 91.35 ± 0.42 | 96.36 ± 0.72 | 88.49 ± 0.92 | 91.05 ± 1.16
DB29 | 90.04 ± 4.05 | 96.31 ± 0.96 | 86.64 ± 3.69 | 90.81 ± 0.98
DB30 | 91.63 ± 0.60 | 95.42 ± 0.72 | 88.98 ± 1.08 | 90.26 ± 1.29

TABLE IV
ACCURACIES OF THE DEVELOPED MWCF-DRN-C AND MWCF-DRN-M METHODS WITH RANDOMLY SELECTED DB WAVELETS (UNIT: %)

Number of randomly selected Daubechies wavelets | Training accuracy (MWCF-DRN-C) | Training accuracy (MWCF-DRN-M) | Test accuracy (MWCF-DRN-C) | Test accuracy (MWCF-DRN-M)
2  | 97.05 ± 0.69 | 97.00 ± 0.66 | 91.56 ± 1.49 | 92.41 ± 0.83
6  | 98.42 ± 0.20 | 97.80 ± 0.63 | 93.07 ± 0.73 | 93.57 ± 0.98
10 | 98.80 ± 0.22 | 98.21 ± 0.23 | 92.72 ± 1.03 | 93.65 ± 0.93
14 | 98.95 ± 0.31 | 98.38 ± 0.23 | 93.10 ± 0.79 | 93.80 ± 0.98
18 | 99.18 ± 0.17 | 98.59 ± 0.20 | 93.45 ± 0.85 | 94.13 ± 0.45
22 | 99.36 ± 0.13 | 98.55 ± 0.37 | 93.01 ± 0.79 | 93.74 ± 0.81
26 | 99.43 ± 0.16 | 98.60 ± 0.21 | 92.81 ± 0.78 | 93.94 ± 0.97
30 | 99.47 ± 0.09 | 98.69 ± 0.24 | 93.01 ± 0.59 | 93.88 ± 0.67

TABLE V
COMPARISON OF THE OVERALL AVERAGE ACCURACIES IN TABLE III AND TABLE IV (UNIT: %)

Method      | Training accuracy | Test accuracy
CNN         | 91.44 ± 1.37      | 88.56 ± 1.61
DRN         | 96.59 ± 0.76      | 91.45 ± 1.08
MWCF-DRN-C  | 98.83 ± 0.25      | 92.84 ± 0.88
MWCF-DRN-M  | 98.23 ± 0.35      | 93.64 ± 0.83

As presented in Table V, the developed MWCF-DRN-C yielded an overall average test accuracy of 92.84%, which achieved 4.28% and 1.39% improvements when compared with the CNN and DRN with no multi-wavelet coefficients fusion. Further, the developed MWCF-DRN-M achieved an overall average test accuracy of 93.64%, which yielded 5.08% and 2.19% improvements compared with the CNN and DRN, respectively.

To get a sense of the feature learning ability of the methods (i.e., CNN, DRN, MWCF-DRN-C, and MWCF-DRN-M), which further facilitates the improvement of fault diagnosis, the feature maps at four different layers were visualized using t-distributed stochastic neighbor embedding [36], which is an effective dimensionality reduction method to visualize high-dimensional data in a lower-dimensional feature space. The working effect of the deep learning methods is shown in Figs. 8-11. It can be seen that observations of different health conditions at the input layer were highly agglomerated and became more and more separable at deeper layers for the CNN, DRN, MWCF-DRN-C, and MWCF-DRN-M; that is, these methods were effective for transforming the input data into discriminative features through a series of nonlinear transformations. Likewise, the learned high-level features obtained by the MWCF-DRN-C and MWCF-DRN-M were more separable than those obtained by the CNN and DRN methods, as shown in Figs. 8(d), 9(d), 10(d), and 11(d), and they were fed to the fully connected output layer to pick up classification results. For example, the learned features for "TRC" and "TSP" by the CNN and DRN overlapped each other, whereas the learned features by the developed methods had few overlapped points. Additionally, as shown in Figs. 10(d) and 11(d), it can be observed that the composite fault "CFB" was also basically separable from the other health conditions, which means that the MWCF-DRN-C and MWCF-DRN-M are able to distinguish the composite fault from the other health states.

Fig. 8. 2D visualization of feature maps at 4 different layers in the CNN with DB1, which yields an accuracy of 83.06%: (a) the input layer, (b) the 7th convolutional layer, (c) the 13th convolutional layer, and (d) the GAP layer.

Fig. 9. 2D visualization of feature maps at 4 different layers in the DRN with DB1, which yields an accuracy of 86.57%: (a) the input layer, (b) the 3rd RBU, (c) the 6th RBU, and (d) the GAP layer.

Fig. 10. 2D visualization of feature maps at 4 different layers in the MWCF-DRN-C with 30 DBs, which yields an accuracy of 92.69%: (a) the input layer, (b) the 3rd RBU, (c) the 6th RBU, and (d) the GAP layer.

Fig. 11. 2D visualization of feature maps at 4 different layers in the MWCF-DRN-M with 30 DBs, which yields an accuracy of 94.35%: (a) the input layer, (b) the 3rd RBU, (c) the 6th RBU, and (d) the GAP layer.
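A sketch of this visualization step is given below for a single-input model; the layer name, perplexity value, and use of scikit-learn's TSNE and matplotlib are assumptions for illustration rather than the authors' procedure.

```python
import numpy as np
import tensorflow as tf
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize_features(model, layer_name, x, labels):
    """Embed an intermediate feature map (e.g., the GAP output) in 2D with t-SNE."""
    extractor = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    feats = extractor.predict(x).reshape(len(x), -1)
    emb = TSNE(n_components=2, perplexity=30).fit_transform(feats)
    for c in np.unique(labels):
        idx = labels == c
        plt.scatter(emb[idx, 0], emb[idx, 1], s=5, label=str(c))   # one color per health state
    plt.legend()
    plt.show()
```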


2) Performance comparison between MWCF-DRN-C and MWCF-DRN-M

As illustrated in Table V, the MWCF-DRN-M method slightly outperformed the MWCF-DRN-C method, achieving a 0.80% improvement in terms of test accuracy. The input indeed involves many redundant wavelet coefficients; the developed MWCF-DRN-M method effectively eliminated redundant information via the element-wise maximization operation and reduced the degree of complexity when optimizing trainable weights during the training process, whereas the MWCF-DRN-C method had to optimize the weights in the convolutional layer to eliminate the redundant information, and its high-level features might still contain much redundant information even after the training process.

Figs. 10(d) and 11(d) illustrate the high-level feature maps using the developed MWCF-DRN-C and MWCF-DRN-M in a lower-dimensional feature space, respectively. The "TRC" and "TSP" under the MWCF-DRN-M scheme were more agglomerated with themselves than under the MWCF-DRN-C scheme. This was the reason why the MWCF-DRN-M outperformed the MWCF-DRN-C in terms of test accuracy.

C. Discussion on the Limitations of the Developed Methods

However, a long-standing issue with machine learning-powered fault diagnosis applications is that they often need to deal with a new fault (or a new composition of multiple faults) that was not met before. According to the nature of supervised learning, the new fault would be misclassified into one of the classes in consideration during the training process. A promising solution to this problem is to develop a method to measure the degree of membership for unseen data, where the degree of membership can be defined as the probability that the unseen test data belongs to each of the classes in consideration. If none of the degrees is larger than a pre-defined threshold, the test data (e.g., a fault not met before) would be classified into an unknown class.
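A minimal sketch of this rejection rule is shown below; the 0.9 threshold is an arbitrary placeholder and the function is hypothetical rather than part of the developed methods.

```python
import numpy as np

def classify_with_rejection(probabilities, threshold=0.9, unknown_label=-1):
    """Treat softmax outputs as membership degrees; reject when none exceeds the threshold.

    probabilities: (n_observations, n_classes) softmax outputs of a trained model.
    """
    best = probabilities.argmax(axis=1)
    confident = probabilities.max(axis=1) >= threshold
    return np.where(confident, best, unknown_label)

p = np.array([[0.97, 0.02, 0.01],      # confidently class 0
              [0.40, 0.35, 0.25]])     # no class exceeds the threshold -> unknown
print(classify_with_rejection(p))      # -> [ 0 -1]
```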
Another unsolved problem with machine learning-powered and Wigner-Ville distribution), and data (e.g., vibration signals,
fault diagnosis applications is how to deal with the same faults current signals, and acoustic signals) for fault diagnosis.
with different intensities. Domain adaptation, which refers to a
group of techniques that aim to improve the performance of a REFERENCES
task by using the knowledge from a closely related domain, can
[1] Y. Lei, J. Lin, M. Zuo, and Z. He, “Condition Monitoring and Fault
probably be used to address this problem. For example, the Diagnosis of Planetary Gearboxes: A Review,” Measurement, vol. 48, pp.
difference between the distributions of the same fault at 292–305, 2014.
different intensities can be narrowed using some domain [2] W. Liu, B. Tang, J. Han, X. Lu, N. Hu, and Z. He, “The Structure Healthy
Condition Monitoring and Fault Diagnosis Methods in Wind Turbines: A
adaptation methods, so that a model trained on one intensity can
Review,” Renew. Sustain. Energy Rev., vol. 44, pp. 466–472, 2015.
correctly identify the fault under the other intensities. [3] X. Jin, M. Zhao, T. W. S. Chow, and M. Pecht, “Motor Bearing Fault
Additionally, to apply the developed methods in industry, the Diagnosis Using Trace Ratio Linear Discriminant Analysis,” IEEE Trans.
following limitations at least need to be addressed. First, Ind. Electron., vol. 61, no. 5, pp. 2441–2451, 2014.
[4] J. Tian, C. Morillo, M. H. Azarian, and M. Pecht, “Motor Bearing Fault
unsupervised and semi-supervised versions of the developed Detection Using Spectral Kurtosis-Based Feature Extraction Coupled
methods should be developed, because it is not easy to obtain with K-Nearest Neighbor Distance Analysis,” IEEE Trans. Ind. Electron.,
labelled data. Second, the developed methods were trained on vol. 63, no. 3, pp. 1793–1803, 2016.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge,
balanced datasets in this study. However, in general, deep MA, USA: MIT Press, 2016.
learning algorithms often yielded relatively low performance [6] F. Jia, Y. Lei, J. Lin, X. Zhou, and N. Lu, “Deep Neural Networks: A
on imbalanced datasets. Thus, it will be necessary to explore Promising Tool for Fault Characteristic Mining and Intelligent Diagnosis
of Rotating Machinery with Massive Data,” Mech. Syst. Signal Process.,
the impact of the developed methods on imbalanced datasets. If vol. 72–73, pp. 303–315, 2016.
needed, the integration of generative adversarial network-based [7] T. Ince, S. Kiranyaz, L. Eren, M. Askar, and M. Gabbouj, “Real-Time
Further, the underlying structure of the developed methods can be applied to various time-frequency analysis methods (e.g., local mean decomposition, empirical wavelet transform, and the Wigner-Ville distribution) and to various types of data (e.g., vibration, current, and acoustic signals) for fault diagnosis.

REFERENCES

[1] Y. Lei, J. Lin, M. Zuo, and Z. He, "Condition Monitoring and Fault Diagnosis of Planetary Gearboxes: A Review," Measurement, vol. 48, pp. 292–305, 2014.
[2] W. Liu, B. Tang, J. Han, X. Lu, N. Hu, and Z. He, "The Structure Healthy Condition Monitoring and Fault Diagnosis Methods in Wind Turbines: A Review," Renew. Sustain. Energy Rev., vol. 44, pp. 466–472, 2015.
[3] X. Jin, M. Zhao, T. W. S. Chow, and M. Pecht, "Motor Bearing Fault Diagnosis Using Trace Ratio Linear Discriminant Analysis," IEEE Trans. Ind. Electron., vol. 61, no. 5, pp. 2441–2451, 2014.
[4] J. Tian, C. Morillo, M. H. Azarian, and M. Pecht, "Motor Bearing Fault Detection Using Spectral Kurtosis-Based Feature Extraction Coupled with K-Nearest Neighbor Distance Analysis," IEEE Trans. Ind. Electron., vol. 63, no. 3, pp. 1793–1803, 2016.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[6] F. Jia, Y. Lei, J. Lin, X. Zhou, and N. Lu, "Deep Neural Networks: A Promising Tool for Fault Characteristic Mining and Intelligent Diagnosis of Rotating Machinery with Massive Data," Mech. Syst. Signal Process., vol. 72–73, pp. 303–315, 2016.
[7] T. Ince, S. Kiranyaz, L. Eren, M. Askar, and M. Gabbouj, "Real-Time Motor Fault Detection by 1-D Convolutional Neural Networks," IEEE Trans. Ind. Electron., vol. 63, no. 11, pp. 7067–7075, 2016.
[8] Y. Lei, F. Jia, J. Lin, S. Xing, and S. Ding, "An Intelligent Fault Diagnosis Method Using Unsupervised Feature Learning Towards Mechanical Big Data," IEEE Trans. Ind. Electron., vol. 63, no. 5, pp. 3137–3147, 2016.
[9] W. Sun, R. Zhao, R. Yan, S. Shao, and X. Chen, "Convolutional Discriminative Feature Learning for Induction Motor Fault Diagnosis," IEEE Trans. Ind. Informat., vol. 13, no. 3, pp. 1350–1359, 2017.
[10] L. Jing, T. Wang, M. Zhao, and P. Wang, "An Adaptive Multi-Sensor Data Fusion Method Based on Deep Convolutional Neural Networks for Fault Diagnosis of Planetary Gearbox," Sensors, vol. 17, E414, 2017.
[11] F. Wang, H. Jiang, H. Shao, W. Duan, and S. Wu, "An Adaptive Deep Convolutional Neural Network for Rolling Bearing Fault Diagnosis," Meas. Sci. Technol., vol. 28, p. 095005, 2017.
[12] X. Ding and Q. He, "Energy-Fluctuated Multiscale Feature Learning with Deep ConvNet for Intelligent Spindle Bearing Fault Diagnosis," IEEE Trans. Instrum. Meas., vol. 66, no. 8, pp. 1926–1935, 2017.
[13] P. Wang, Ananya, R. Yan, and R. X. Gao, "Virtualization and Deep Recognition for System Fault Classification," J. Manuf. Syst., vol. 44, pp. 310–316, 2017.
[14] M. Xia, T. Li, L. Xu, L. Liu, and C. Silva, "Fault Diagnosis for Rotating Machinery Using Multiple Sensors and Convolutional Neural Networks," IEEE/ASME Trans. Mechatron., vol. 23, pp. 101–110, 2017.
[15] W. Zhang, C. Li, G. Peng, Y. Chen, and Z. Zhang, "A Deep Convolutional Neural Network with New Training Methods for Bearing Fault Diagnosis under Noisy Environment and Different Working Load," Mech. Syst. Signal Process., vol. 100, no. 1, pp. 439–453, 2018.
[16] R. Yan, R. Gao, and X. Chen, "Wavelets for Fault Diagnosis of Rotary Machines: A Review with Applications," Signal Process., vol. 96, pp. 1–15, 2014.
[17] R. Gao and R. Yan, Wavelets. Boston, MA, USA: Springer, 2011.
[18] M. Kang, J. Kim, J. Kim, A. Tan, E. Kim, and B. Choi, "Reliable Fault Diagnosis for Low-Speed Bearings Using Individually Trained Support Vector Machines with Kernel Discriminative Feature Analysis," IEEE Trans. Power Electron., vol. 30, pp. 2786–2797, 2015.
[19] Y. Wang, G. Xu, L. Liang, and K. Jiang, "Detection of Weak Transient Signals Based on Wavelet Packet Transform and Manifold Learning for Rolling Element Bearing Fault Diagnosis," Mech. Syst. Signal Process., vol. 54–55, pp. 259–276, 2015.
[20] Y. Qin, "A New Family of Model-Based Impulsive Wavelets and Their Sparse Representation for Rolling Bearing Fault Diagnosis," IEEE Trans. Ind. Electron., vol. 65, no. 3, pp. 2716–2726, 2018.
[21] V. Vakharia, V. Gupta, and P. Kankar, "Efficient Fault Diagnosis of Ball Bearing Using ReliefF and Random Forest Classifier," J. Braz. Soc. Mech. Sci. & Eng., vol. 39, no. 8, pp. 2969–2982, 2017.
[22] D. Vautrin, X. Artusi, M. Lucas, and D. Farina, "A Novel Criterion of Wavelet Packet Best Basis Selection for Signal Classification with Application to Brain–Computer Interfaces," IEEE Trans. Biomed. Eng., vol. 56, no. 11, pp. 2734–2738, 2009.
[23] Y. Chen, X. Jin, J. Feng, and S. Yan, "Training Group Orthogonal Neural Networks with Privileged Information," in Proc. International Joint Conference on Artificial Intelligence, 19-25 August 2017, Australia.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition, 27-30 June 2016, pp. 770–778.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Identity Mappings in Deep Residual Networks," in Proc. 14th European Conference on Computer Vision, 8-16 October 2016, Amsterdam, Netherlands, pp. 630–645.
[26] Drivetrain diagnostics simulator. [Online]. Available: http://spectraquest.com/drivetrains/details/dds/
[27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," in Proc. 29th IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 26–Jul. 1, 2016, pp. 2818–2826.
[28] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in Proc. 32nd International Conference on Machine Learning, 7-9 July 2015, Lille, France, pp. 448–456.
[29] P. Zhou and J. Austin, "Learning Criteria for Training Neural Network Classifiers," Neural Comput. Appl., vol. 7, no. 4, pp. 334–342, 1998.
[30] B. Zoph and Q. V. Le, "Neural Architecture Search with Reinforcement Learning," in Proc. International Conference on Learning Representations, 24-26 April 2017, Toulon, France.
[31] M. Suganuma, S. Shirakawa, and T. Nagao, "A Genetic Programming Approach to Designing Convolutional Neural Network Architectures," in Proc. Genetic and Evolutionary Computation Conference, 15-19 July 2017, Berlin, Germany, pp. 497–504.
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[33] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout Networks," in Proc. 30th International Conference on Machine Learning, 16-21 June 2013, Atlanta, GA, USA, pp. 1319–1327.
[34] Y. Huang, X. Sun, M. Lu, and M. Xu, "Channel-Max, Channel-Drop and Stochastic Max-Pooling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshop, 11-12 June 2015, Boston, MA, USA, pp. 9–17.
[35] Z. Liao and G. Carneiro, "A Deep Convolutional Neural Network Module that Promotes Competition of Multiple-Size Filters," Pattern Recognit., vol. 71, pp. 94–105, 2017.
[36] L. J. P. van der Maaten and G. E. Hinton, "Visualizing High-Dimensional Data Using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, 2008.

Minghang Zhao was born in Shandong, China, in June 1991. He received the B.E. degree in mechanical engineering from the College of Mechanical Engineering, Chongqing University, Chongqing, China, in June 2013.
He is currently working toward the Ph.D. degree under the supervision of Prof. Baoping Tang in the State Key Laboratory of Mechanical Transmission, Chongqing University, Chongqing, China. He was previously a visiting research scholar in the Center for Advanced Life Cycle Engineering (CALCE), University of Maryland, College Park, MD, USA, from 2016 to 2017. His research interests include data-driven fault diagnosis, prognostics, and health management of mechanical and electrical systems.

Myeongsu Kang (M'17) received the B.E. and M.S. degrees in computer engineering and information technology and the Ph.D. degree in electrical, electronics, and computer engineering from the University of Ulsan, Ulsan, South Korea, in 2008, 2010, and 2015, respectively.
He is currently a Research Scientist with the Center for Advanced Life Cycle Engineering (CALCE), University of Maryland, College Park, MD, USA. His current research interests include data-driven anomaly detection, diagnostics, and prognostics of complex systems, such as automotive, railway transportation, and avionics, for which failure would be catastrophic. He has expertise in analytics, machine learning, system modeling, and statistics for prognostics and health management.

Baoping Tang received the M.Sc. degree in 1996 and the Ph.D. degree in 2003, both from the College of Mechanical Engineering, Chongqing University, Chongqing, China.
He is currently a Professor and Ph.D. supervisor in the College of Mechanical Engineering, Chongqing University, Chongqing, China. His main research interests include wireless sensor networks, mechanical and electrical equipment security service and life prediction, and measurement technology and instruments.
Dr. Tang won a National Scientific and Technological Progress 2nd Prize of China in 2004 and a National Invention 2nd Prize of China in 2015. He has published more than 150 papers in his research career.

Michael Pecht (S'78-M'83-SM'90-F'92) received the B.S. degree in acoustics, the M.S. degrees in electrical engineering and engineering mechanics, and the Ph.D. degree in engineering mechanics from the University of Wisconsin at Madison, Madison, WI, USA, in 1982.
He is the Founder of the Center for Advanced Life Cycle Engineering (CALCE), University of Maryland, College Park, MD, USA, where he is also a Chair Professor.
Dr. Pecht is a Professional Engineer and an American Society of Mechanical Engineers (ASME) Fellow. He has received the IEEE Undergraduate Teaching Award and the International Microelectronics Assembly and Packaging Society (IMAPS) William D. Ashman Memorial Achievement Award for his contributions to electronics reliability analysis.