10.1007@s13042 019 00967 W

International Journal of Machine Learning and Cybernetics
https://doi.org/10.1007/s13042-019-00967-w
ORIGINAL ARTICLE
A robust multilayer extreme learning machine using kernel

risk‑sensitive loss criterion
Xiong Luo1,2,3 · Ying Li1,2,3 · Weiping Wang1,2,3 · Xiaojuan Ban1,2 · Jenq‑Haur Wang4 · Wenbing Zhao5
Received: 23 October 2017 / Accepted: 22 May 2019

© Springer-Verlag GmbH Germany, part of Springer Nature 2019
Abstract
More recently, extreme learning machine (ELM) has emerged as a novel computing paradigm that enables the neural network
(NN) based learning to be achieved with fast training speed and good generalization performance. However, the single hidden
layer NN using ELM may be not effective in addressing some large-scale problems with more computational efforts. To avoid
such limitation, we utilize the multilayer ELM architecture in this article to reduce the computational complexity, without
the physical memory limitation. Meanwhile, it is known to us all that there are a lot of noises in the practical applications,
and the traditional ELM may not perform well in this instance. Considering the existence of noises or outliers in training
dataset, we develop a more practical approach by incorporating the kernel risk-sensitive loss (KRSL) criterion into ELM,
on the basis of the efficient performance surface of KRSL with high accuracy while still maintaining the robustness to outli-
ers. A robust multilayer ELM, i.e., the stacked ELM using the minimum KRSL criterion (SELM-MKRSL), is accordingly
proposed in this article to enhance the outlier robustness on large-scale and complicated dataset. The simulation results on
some synthetic datasets indicate that the proposed approach SELM-MKRSL can achieve higher classification accuracy and
is more robust to the noises compared with other state-of-the-art algorithms related to multilayer ELM.
Keywords Extreme learning machine (ELM) · Multilayer perceptron · Kernel risk-sensitive loss (KRSL) · Deep learning
1 Introduction are proved to be well-performed [1–3]. At the same time,

many researchers continue making an intensive study in
With the rapid development in the field of artificial intel- new and effective algorithms. In the last decade, an emerg-
ligence, there have been lots of practical applications ing machine learning algorithm, named extreme learning
using machine learning approaches and these applications machine (ELM) [4, 5], has drawn many researchers’ atten-
tions. The hidden node parameters of neural network (NN)
This manuscript is recommended by the 8th International using ELM, including the input weights connecting input
Conference on Extreme Learning Machines (ELM2017). layer and hidden layer, and the biases of hidden neurons,
are assigned randomly without being adjusted manually.
* Xiong Luo Consequently, ELM is an effective method with extremely
xluo@ustb.edu.cn
fast training speed and good generalization performance [6].
1
School of Computer and Communication Engineering, There have been a lot of variants of ELM addressing dif-
University of Science and Technology Beijing (USTB), ferent practical issues [7–18]. Meanwhile, some theoreti-
Beijing 100083, China cal improvements on ELM are also achieved. For instance,
2
Beijing Key Laboratory of Knowledge Engineering a sparse ELM was proposed, and it can realize a highly
for Materials Science, Beijing 100083, China compact NN to reduce the storage space and testing time
3
Key Laboratory of Geological Information Technology, [19, 20]. An enhanced bidirectional ELM was proposed to
Ministry of Land and Resources, Beijing 100037, China reduce the network complexity [21]. In this network, only
4
Department of Computer Science and Information the hidden nodes with large residual error are reserved and
Engineering, National Taipei University of Technology, the unimportant nodes are discarded. The online sequential
Taipei 10608, Taiwan
ELM was incorporated into the ensemble NNs, while pre-
5
Department of Electrical Engineering and Computer Science, senting an improved algorithm called ensemble of online
Cleveland State University, Cleveland, OH 44115, USA
13
Vol.:(0123456789)
sequential ELM [22]. However, when solving some prob- to the non-convexity of the performance surface for corren-
lems with large-scale datasets, shallow structure is not tropic loss (C-loss), it is very flat when it is far away from the
efficient enough even with a large quantity of hidden neu- optimal solution, and it is sharp around the optimal solution,
rons. Hence, the study on multilayer architecture and deep resulting in poor accuracy and slow convergence [38]. Con-
learning is extraordinarily important and necessary. Then, sequently, an alternative to the correntropy in kernel space,
an ELM-based hierarchical learning framework was devel- named kernel risk-sensitive loss (KRSL), was proposed
oped for multilayer perceptron [23]. It includes two main [39]. The performance surface of KRSL is more convex in
components, which are unsupervised multilayer feature rep- comparison with that of C-loss, hence it can achieve better
resentation and supervised feature classification separately. convergence performance in some applications [40].
In addition, a stacked ELM was designed, while dividing an Motivated by it, in this article, we propose a robust multi-
ELM-based large single NN into some NNs with multiple layer ELM, i.e., the stacked ELM using the minimum KRSL
small ELMs [24]. These sub-ELMs are connected serially (SELM-MKRSL), which utilizes KRSL as loss function by
and stacked on multiple layers. substituting MSE in the S-ELM framework to measure the
In addition to the challenge of multilayer architecture, one similarity between the desired output and the learned model
of the most common issues is the outliers or noises existed output. Here, the multilayer ELM architecture is used in an
in the training set [25]. On this condition, the poor gener- effort to tackle the issue of large-scale and complicated
alization performance may be achieved by traditional ELM dataset. The dimensionality of feature space is fixed, that
[26], partly because the mean square error (MSE) is used in is, the number of hidden neurons is user specified at first,
ELM as cost function. MSE is a global similarity measure without the problem of memory overflow. Meanwhile, the
and it applies to Gaussian distributed error [27]. Neverthe- multilayer structure makes it capable to train extensive data
less, it is very common for the data which are non-Gaussian and achieve good generalization. Differing from it in [37],
distributed in many real-world applications. In order to over- we use KRSL instead of correntropy as cost function in this
come the poor robustness to noisy datasets, a probabilistic article to achieve more robustness to noises. Meanwhile,
regularized ELM was proposed [28]. It constructs a new on the basis of it in [40], we extend KRSL not just in the
objective function to minimize both mean and variance of field of adaptive filtering, but in classification applications.
the modeling error, and it is proved to be more robust com- Moreover, considering that the L1-norm and L2-norm are
pared with traditional ELM. Moreover, based on the fuzzy used in the traditional ELM algorithm to acquire structural
ELM (F-ELM), a ridge regression based extreme learning risk minimization and do achieve a good performance [11],
fuzzy system (RR-EL-FS) was developed to enhance the we combine KRSL and the L2 penalty as a loss function in
robustness to small and noisy datasets [29]. In RR-EL-FS, S-ELM architecture in our work. The main contributions of
the strategy of ridge regression is incorporated into F-ELM this article are further summarized in two aspects as follows:
to substitute the least square which is used for computing the
output weight. In order to handle the noisy data regression 1. We utilize minimum KRSL (MKRSL) criterion to
problems, the constrained optimization method based ELM replace the MSE in ELM framework. The KRSL is
for regression (CO-ELM-R) was proposed [30]. It combines robust to non-Gaussian distributed noises and outliers,
two Lagrange multipliers that mimic support vector regres- so it can remarkably improve the anti-noise ability of
sion (SVR) into the basis of ELM to cope with infeasible NN-based learning algorithm. Furthermore, some theo-
constraints of the regression optimization problem. Recently, retical analyses are provided.
a local similarity measure named correntropy was presented 2. In order to further enhance the stability and generaliza-
[31]. Correntropy is a robust nonlinear generalized correla- tion performance of our proposed approach, we add a L2
tion in kernel space and uses the concept like probability to regularization term to the cost function.
evaluate the similarity of two variables [32]. It is flexible
and easy to be implemented, thus it has been utilized in The structure of this article is organized as follows. In
various applications [33, 34]. Considering the advantages Sect. 2, the original ELM, the stacked ELM as well as the
of correntropy, a regularized correntropy criterion for ELM basic concept and properties of KRSL are described. In
was designed to address the datasets with noises [35]. In Sect. 3, we detail the algorithm SELM-MKRSL proposed
this method, the Gaussian kernel function is utilized to in this article. In Sect. 4, the simulation results are shown to
substitute Euclidean norm of the MSE criterion, which can evaluate the algorithmic performance and noise robustness
enhance the anti-noise ability of ELM [35, 36]. Moreover, of our proposed SELM-MKRSL. The comparison between
a variant of stacked ELM was proposed to reduce the train- SELM-MKRSL and original S-ELM as well as correntropy-
ing time and be robust for outliers rejection, which used the based S-ELM is also given in this section. Finally, we con-
correntropy-optimized principle component analysis (PCA) clude this article.
to replace the original PCA method [37]. However, owing
13
2 Related work The core idea of this method is to minimize the training
error as well as the norm of output weights by:
2.1 Extreme learning machine (ELM) Minimize ∶ ‖𝐇𝜷 − 𝐓‖2F + 𝜇‖𝜷‖2F , (8)
Extreme learning machine is a kind of generalized learning where ‖⋅‖F can be defined as ‖⋅‖2 which denotes the L2-
algorithm for single hidden layer feedforward NN [4–6]. norm, μ is a tuning parameter.
For ELM, the input weights and the biases of hidden layer According to the scale of the training dataset, we can
are initialized randomly and obtain different solutions of (8) in an effort to decrease the
{( need)}Nnot be tuned. Given N computational complexity.
arbitrary training samples xi , ti i=1, where xi ∈ Rd is the
input data vector, ti ∈ Rm is the corresponding target label. In the case that the number of training samples is less
The output of the original ELM with L hidden neurons can than the hidden neurons, that is N < L, the output weight
be written as: vector can be represented as [24]:
( )−1
∑ 𝐈
L
𝜷 = 𝐇T + 𝐇𝐇T 𝐓, (9)
ti = fL (xi ) = 𝜷 j g(wj ⋅ xi + bj ), i = 1, … , N, (1) C
j=1
and the corresponding output of ELM is:
where βj is the output weight vector connecting the jth hid- ( )−1
𝐈
den node with the output layer, g(·) is a nonlinear activation f (x) = h(x)𝜷 = h(x)𝐇T + 𝐇𝐇T 𝐓, (10)
C
function, wj ∈ Rd is the input weight vector connecting the
input layer with the jth hidden unit, and bj denotes the bias where I is the identity matrix and C is a constant.
term of the jth hidden neuron. In another case that the number of training samples is
Then, (1) can be represented in the matrix form: much larger than the hidden neurons, that is N ≫ L, the out-
put weight vector can be represented as [24]:
𝐇𝜷 = 𝐓, (2)
[ []T ]T ( )−1
where 𝜷 = 𝜷 1 , 𝜷 2 , … , 𝜷 L , 𝐓 = t1 , t2 , … , tN , and the 𝐈
𝜷= + 𝐇T 𝐇 𝐇T 𝐓, (11)
hidden layer output matrix H is: C
and the corresponding output of ELM is:
⎡ h(x1 ) ⎤ ⎡ g(w1 ⋅ x1 + b1 ) ⋯ g(wL ⋅ x1 + bL ) ⎤ ( )−1
𝐇=⎢ ⋮ ⎥=⎢ ⋮ ⋱ ⋮ ⎥.
f (x) = h(x)𝜷 = h(x)
𝐈
+ 𝐇T 𝐇 𝐇T 𝐓. (12)
⎢ ⎥ ⎢ ⎥ C
⎣ h(x N ⎦
) ⎣ g(w1 ⋅ xN + b 1 ) ⋯ g(w L ⋅ x N + b L ⎦
)
(3)
The output weight β can be acquired by solving 2.2 Stacked extreme learning machine (S‑ELM)
Minimize ∶ ‖𝐇𝜷 − 𝐓‖2F , (4) When solving some problems with large-scale dataset, we
where ‖⋅‖F represents the Frobenius norm. often need plenty of hidden nodes to train a learning model,
The least square solution of (4) is given as: so that good generalization performance could be achieved.
However, too many hidden neurons may make the train-
𝜷 = 𝐇† 𝐓, (5) ing NN become very complicated and incur the issue with
†
where H is the Moore–Penrose generalized inverse of out of memory. In response to this limitation, an improved
matrix H [41, 42]. We can use the orthogonal projection algorithm called stacked extreme learning machine (S-ELM)
method to compute H† when HTH is nonsingular [4]: was developed [24]. S-ELM is a multilayer NN which is
( )−1 composed of multiple sub-ELMs lying in distinct layers. It
𝐇† = 𝐇T 𝐇 𝐇T , (6) decomposes a large single ELM network into multiple small
ELMs, and these sub-ELMs are connected serially. The
and the output weight β can be obtained by: hidden layer output of the lower layer is transmitted to the
( )−1 higher layer. It should be noted that the hidden layer outputs
𝜷 = 𝐇T 𝐇 𝐇T 𝐓. (7) are not passed directly but extracted several key components
via the PCA dimension reduction method [44].
Considering the existence of noises in real-world appli-
Assumed that the dimensionality of the feature space is
cations, the equality constrained optimization method-
fixed with L hidden neurons, where L can be appointed by
based ELM was further proposed [43] to improve the gen-
users. In the lower layer of S-ELM, we first conduct a standard
eralization performance and stability of the standard ELM.
ELM network and acquire the hidden layer output matrix H.
The corresponding output weight β can be obtained by (9)
13
or (11). However, owing to the randomness of the generated expectation, and FAB(a, b) represents the simultaneous distri-
hidden layer parameters, i.e., input weights and biases, the bution of (A, B). In this article, we only take Gaussian kernel
connections between hidden layer and output layer may have into consideration and the corresponding kernel function is
redundant information, and will not be linearly independent. expressed as:
Hence, we perform a PCA method on β and the dimension of � �
β is cut down from L to L′, where L′ < L. 1
𝜅𝜎 (a − b) = √ exp −
(a − b)2
During the implementation of PCA dimension reduction, a 2𝜎 2
. (18)
𝜎 2𝜋
series of eigenvectors can be obtained, and they are arranged in
descending order on the basis of their eigenvalues. We select However, due to the high non-convexity of the C-loss per-
the first L′ eigenvectors as Q ∈ RL×L′. Then, the output weight formance surface, which is given by Closs(A, B) = 1 − Vσ(A,
after dimension reduction can be expressed as 𝜷 � = 𝜷 T 𝐐. B), it is extremely steep around the optimal solution, and it
Similarly, the hidden layer output is obtained as: becomes very flat when it is away from the optimal solution.
This will result in slow convergence and poor performance.
𝐇� = 𝐇𝐐. (13)
To address this issue, a new measure criterion, i.e., KRSL,
Then, the dimension-reduced hidden layer output H′ is was proposed [39].
passed to the higher layer to represent the whole information With two random variables A and B, KRSL is defined as:
of the lower hidden layer. In the next hidden layer of S-ELM,
(L−L′) new hidden neurons will be generated randomly. And L𝜆 (A, B) =
1
E[exp(𝜆(1 − 𝜅𝜎 (A − B)))]
then we have the output matrix Hnew of these new nodes as: 𝜆
(19)
𝜆∫
1
= exp(𝜆(1 − 𝜅𝜎 (a − b)))dFAB (a, b),
⎡ hnew (x1 ) ⎤ ⎡ h1 (x1 ) ⋯ hL−L� (x1 ) ⎤
𝐇new = ⎢ ⋮ ⎥=⎢ ⋮ ⋱ ⋮ ⎥. (14)
⎢ ⎥ ⎢ ⎥ where λ is the risk-sensitive parameter and λ > 0. In prac-
⎣ hnew (xN ) ⎦ ⎣ h1 (xN ) ⋯ hL−L� (xN ) ⎦
tical applications,
{( )}N with a finite number of data instances
Thus, the hidden layer output of this layer is formulated as: ai , bi i=1, we often calculate empirical KRSL to approx-
[ ] imately substitute above equation as:
𝐇 = 𝐇� , 𝐇new . (15)
1 ∑
N
We repeat these procedures mentioned above on the sub- L̂ 𝜆 (A, B) = exp(𝜆(1 − 𝜅𝜎 (ai − bi ))). (20)
sequent hidden layer until the last one, and the final output of N𝜆 i=1
S-ELM network is calculated as (10) or (12). We describe the
total process of the combination as follows: Here, the empirical KRSL [ measures]T the similarity
between[ the vector
]T 𝐀 = a 1 , a2 , … , aN and the vector
[ � 𝐇] → 𝐇� 1
� 𝐁 = b1 , b2 , … , bN . When there is no ambiguity, we denote
[𝐇� 1 , 𝐇new 2 ] → 𝐇� 2 L̂ 𝜆 (A, B) as L̂ 𝜆 (𝐀, 𝐁).
𝐇 2 , 𝐇new 3 → 𝐇 3
. (16) Generally, there are some significant properties of KRSL.
[ � ] ⋮ � They are listed as follows, and they have been proved in [39]:
𝐇 N−2 , 𝐇new (N−1) → 𝐇 [ N−1 ]
𝐇� N−1 , 𝐇new N
Property 1 KRSL is symmetric: L𝜆 (A, B) = L𝜆 (B, A).
2.3 Kernel risk‑sensitive loss (KRSL) Proper ty 2 K R S L i s p o s i t i ve a n d b o u n d e d :

1
𝜆
≤ L𝜆 (A, B) ≤ 𝜆1 exp(𝜆), and when A = B, L𝜆 (A, B) reaches
The KRSL is a modified local similarity measure based on the the minimum.
correntropy [39]. Both correntropy and KRSL are effective in
processing the data which have large outliers, because they are Property 3 L𝜆 (A, B) ≈ 1
𝜆
+ Closs (A, B) when λ is small
not sensitive to noises. enough.
Given two arbitrary variables A and B, the correntropy of
them can be defined by: Property 4 L𝜆 (A, B) ≈ 1
𝜆
+ 1
2𝜎 2
E[(A − B)2 ] when σ is large
enough.
∫ (17) [ ]T
V𝜎 (A, B) = E[𝜅𝜎 (A − B)] = 𝜅𝜎 (a − b)dFAB (a, b),
Property 5 Let e = 𝐀−𝐁 = e1 , e2 , … , eN , where
ei = ai − bi and i = 1, 2, …, N. The empirical KRSL L̂ 𝜆 (𝐀, 𝐁)
is the function of e, and when ‖e‖∞ = maxi=1,2,…,N ��ei �� ≤ 𝜎 ,
where σ is kernel band width, ĸσ(·) represents the Mer-
cer kernel function [45], E(·) denotes the mathematical
it is convex.
13
Property 6 In the case that ‖e‖∞ > 𝜎 , the empirical KRSL where λ is the risk-sensitive parameter satisfying λ > 0, η is
L̂ 𝜆 (𝐀, 𝐁) will be convex as long as the risk-sensitive param- the regularization coefficient, and ĸσ(·) is the kernel function
eter λ is large enough. defined as (18) with bandwidth σ.
Using the MKRSL criterion in SELM-MKRSL, for the
τth iteration, the optimal solution of above equation can be
3 Stacked ELM using the minimum KRSL expressed as:
criterion
𝜷 𝜏+1 = (𝐇T 𝐀𝐇 + 𝜂𝐈)−1 𝐇T 𝐀𝐓, (24)
In this section, we present the implementation of our algo- where A is a diagonal matrix and its diagonal element Aii
rithm SELM-MKRSL. Furthermore, we also provide the is defined as:
theoretical analysis for the convergence and computational
1 ∑ ( ( ))
N
complexity of SELM-MKRSL. Aii = 𝛼i𝜏+1 = exp 𝜆 1 − 𝜅𝜎 (ti − hi 𝜷 𝜏 ) 𝜅𝜎 (ti − hi 𝜷 𝜏 ).
2
N𝜎 i=1
3.1 The proposed algorithm (25)
Actually, it is easy to deduce
The KRSL is used as the loss function of S-ELM to replace 𝜕
the objective function as (4). Then, the problem can be J
𝜕𝜷 KRSL
(𝜷) = 0
solved by minimizing the KRSL between the target output
1 ∑ ( ( ( )))
N
and the predicted output. The new criterion is given as: ⇒− exp 𝜆 1 − 𝜅𝜎 ti − hi 𝜷
2
N𝜎 i=1
∑
N ( )( )
J(𝜷) = min L𝜆 (ti − yi ), (21) × 𝜅𝜎 ti − hi 𝜷 ti − hi 𝜷 hi + 2𝜂𝜷̃ =0
𝜷 (N )
i=1 ∑ T ∑N
⇒ hi 𝛼i hi + 𝜂 𝜷 = hTi 𝛼i ti
where N is the number of training samples, ti denotes the i=1 i=1
target output of the ith sample xi. In addition, yi denotes the ( )−1 (N )
∑
N
∑
calculated output of xi in training network, and it can be ⇒𝛽= hTi 𝛼i hi +𝜂 × hTi 𝛼i ti ,
expressed as: i=1 i=1
(26)
yi = hi 𝜷, (22)
where hi is the hidden layer output vector for xi, and β is the
hidden layer output weight, which is given by
where 𝜂 = 2𝜂̃ represents the regularization coefficient, and
[ ]T αi can be obtained by (25).
𝜷 = 𝜷 1, 𝜷 2, … , 𝜷 L .
We implement the MKRSL to calculate the output weight
Moreover, in order to improve the stability and learning β as described above in each hidden layer during the training
performance, a L2 regularization term is added to (21). As of S-ELM network. The architecture of our proposed SELM-
a result, we can obtain a new objective function defined as: MKRSL is illustrated in Fig. 1 and the whole process of train-
ing SELM-MKRSL is summarized in Algorithm 1. It should
� �
1 �
N
� � � �� be pointed out that, Lτ mentioned in Algorithm 1 is the KRSL
2
JKRSL (𝜷) = min
𝜷 N𝜆 i=1
exp 𝜆 1 − 𝜅𝜎 ti − hi 𝜷 + 𝜂‖𝜷‖F , of the desired output and the model output in the τth iteration,
(23) which is calculated as:
1 ∑ ( ( ( )))
N
L𝜏 = exp 𝜆 1 − 𝜅𝜎 ti − yi . (27)
N𝜆 i=1
13
13
(a) (b)
The initial β βT
Dimension reduction
1 ... j ... L
The τ-th β
1 ... L’
No
Lτ-Lτ-1< ε ? β’
Yes
T
(c)
1 1
1 1 1
...
...
...
...
...
xi L’ L’ ti
...
...
...
...
...
L L L
d Input m
weight
Input matrix wj Random Output
layer hidden nodes Randomly generated nodes layer
Fig. 1 The architecture of SELM-MKRSL: a MKRSL; b PCA; c S-ELM
3.2 Convergence analysis Hence, this sequence is non-increasing finally. In

addition, it has been proved that the KRSL is bounded,
Motivated by [35], in accordance with the above analysis, that is 𝜆1 ≤ L𝜆 (A, B) ≤ 𝜆1 exp(𝜆) [39]. As a consequence,
two theorems for the convergence of our approach are pro- {JKRSL (𝛽 ,𝜏 𝛼 )𝜏 is bounded} as well. So the sequence
𝜏 𝜏
vided as follows. JKRSL (𝛽 , 𝛼 ), 𝜏 = 1, 2, ⋯ is verified to be converged.

{ }
Theorem 1 The sequence JKRSL (𝛽 𝜏 , 𝛼 𝜏 ), 𝜏 = 1, 2, ⋯ con- Theorem 2 When A = I and η = 0 in (24), the MKRSL cri-
verges after iteratively calculating by (24) and (25). terion is equivalent to MSE.
Proof According to (23) and (24), it can be seen that Proof If A = I and η = 0, we have:
𝜷 ∗ = (𝐇T 𝐀𝐇 + 𝜂𝐈)−1 𝐇T 𝐀𝐓 = (𝐇T 𝐇)−1 𝐇T 𝐓. (34)
JKRSL (𝛽 𝜏 , 𝛼 𝜏 ) ≥ JKRSL 𝛽 𝜏+1 , 𝛼 𝜏 ≥ JKRSL 𝛽 𝜏+1 , 𝛼 𝜏+1 .
( ) ( )
We can observe that this equation is the same as (7).
(33) Consequently, the theorem turns out to be true.
13
3.3 Computational complexity is provided by the U.S. Postal Service and it has 7291
training samples and 2007 testing samples, each of which
In Algorithm 1, the main computational cost is the update of contains 256 features [46].
output weight β in Step 2. The computational complexity ISOLET is a dataset used for classification task from
of vector α in each iteration for one hidden layer is O(NL) UCI machine learning repository [47]. It has 7797
according to (30). Furthermore, the calculation of the instances in total with 617 attributes and is trained for
output weight using (24) takes the computational cost as predicting which letter name is spoken. In our simulations,
O(L3 + L2N + L2m + 2LN2 + LNm) in each iteration of one we select 6238 samples randomly as training data and the
layer, where m is the number of output labels. As a conse- rest 1559 samples as testing data.
quence, the total computational cost for Step 2 is O(kNIter SPECT heart is a dataset that describes diagnosing
(L3 + L2N + L2m + 2LN2 + LNm)) with total k hidden layers of cardiac single proton emission computed tomography
and NIter iterations in each layer. Generally speaking, the (SPECT) images [48]. Each of the patients is classified
number of training samples N is larger than the number of into two categories: normal and abnormal. There are 267
hidden neurons L, i.e., N > L, and the number of labels m is instances in total and 80 of them are used for training as
usually small. Hence, the overall computational complexity well as the remainder 187 instances are for testing with
of Algorithm 1 is O(kNIterLN2). 22 attributes.
It should be pointed out that the iteration times NIter will Internet advertisements dataset represents a set of pos-
not be so large if the parameters in SELM-MKRSL are set sible advertisements on Internet pages [49]. It is used to
appropriately. Hence, the MKRSL criterion used in multi- predict whether an image is an advertisement (“ad”) or not
layer ELM may not significantly increase the computational (“nonad”) with 3279 samples and 1558 features. We take
cost compared with MSE. 70% of the samples as training data and the remaining 30%
as testing data.
Balance scale dataset was generated to model psychologi-
4 Performance evaluation and analysis cal experimental results [50]. Each example is classified as
having the balance scale tip to the right, tip to the left, or be
In this section, the simulation results are presented to evalu- balanced. There are 625 samples in total with four attributes.
ate the performance as well as the noise robustness of our We take 75% of it to train the model and the rest 25% to test
proposed approach on some synthetic datasets. Meanwhile, the performance.
we compare the SELM-MKRSL proposed in this article with Drug consumption is a dataset used for estimating when
the original S-ELM [24] and the S-ELM using correntropy people use the drug lastly [51]. The dataset contains 1885
(SELM-correntropy), in terms of the performance in dealing instances and each sample has 30 features. We choose 1320
with contaminated training data. As to SELM-correntropy, samples to train randomly and the rest to test.
we replace the MSE criterion in the same way as [35], just in Contraceptive method choice dataset is a subset of the
the S-ELM architecture instead. Therefore, these three meth- 1987 National Indonesia Contraceptive Prevalence Survey
ods all have the same multilayer architecture in the learning [52]. It was collected to predict the current contraceptive
network, so that we can evaluate the effect of MSE, corren- method choice (no use, long-term methods, or short-term
tropy and KRSL criterions on classification performance. methods) of a woman based on her demographic and socio-
All the simulations in this article are conducted on the con- economic characteristics. The dataset has 1473 instances
ditions as follows: Intel-i7 2.8G CPU, 16G RAM, Windows with 9 attributes, about 67% of which is used as training
7, MATLAB R2012a. data and the rest 33% as testing data.
Parkinson speech dataset is used to estimate whether peo-
4.1 Datasets ple have Parkinson’s disease (PD) or not [53]. The training
data were collected from 40 people, half of which are PD
In this article, we implement the simulations on some data- patients and the other half are healthy individuals. Each
sets which are described in detail below. With regard to the person has 26 voice samples, so the training dataset con-
selection of these datasets, we choose those whose distribu- tains 1040 samples in all. As to the collection of the testing
tion is relatively balanced. Meanwhile, the number of data dataset, 28 PD patients were asked to say only the sustained
samples ranges from 200 to 10,000 and the dimensionality vowels ‘a’ and ‘o’ three times respectively, which made a
of features ranges from single digit to four figures. Note that total of 168 recordings. Each sample in training dataset or
the number of features will not be more than the correspond- test dataset has 26 features extracted from voice samples.
ing number of samples. QSAR biodegradation dataset contains 1055 instances
USPS is a handwritten digits dataset transformed from with 41 attributes to discriminate ready (356) and not ready
8-bit grayscale images of digits “0” to “9”. This dataset (699) biodegradable molecules [54]. We choose 750 samples
13
randomly as training data and the remaining 305 samples as regularization coefficient η. Here, we just take dataset USPS
testing data. for instance, and we set the noise P(i) as Uniform distribu-
Seismic bumps dataset describes the problem of high tion to demonstrate the process of selecting various param-
energy (higher than 104 J) seismic bumps forecasting in a eters. In our simulations, we choose the iterative termination
coal mine [55]. There are 2584 samples in total and each threshold as ε = 10−3, and the number of hidden layers is set
sample has 19 features. 70% of the data is used for training to k = 3 for all the NNs.
and the rest 30% is used for testing. Firstly, we estimate the effect of risk-sensitive parameter
In order to estimate the algorithmic robustness with outli- λ and bandwidth σ on classification performance. Actually,
ers, we add a mixed noise model to the training data, which there is no efficient way to determine the appropriate param-
is assumed to be n(i) = rP(i) + (1 − r)Q(i), where 0 < r < 1 eter values, so we just find a relatively good parameter set-
[39]. Here, P(i) and Q(i) are two noise processes both inde- ting within the following bounds: λ ∈ {1, 3, 5} and σ ∈ {0.2,
pendent with each other, and the variances of them are 𝜎P2 0.4, 0.6, …, 8}. Figure 2 shows the testing classification
and 𝜎Q2 , respectively. In the simulations below, we set Q(i) accuracy of SELM-MKRSL with the increase of bandwidth
as Gaussian distribution with zero-mean and r = 0.06. Mean- σ, where the risk-sensitive parameter λ is prefixed. We can
while, we change the 𝜎Q2 to observe the variation of classifi- observe that, as σ increases, the classification accuracy
cation performance. With respect to the noise P(i), first increases steeply, then tends to converge, and slightly
� √we √ con-
�
sider three cases: (a) Uniform distribution over − 5, 5 , decreases at last. In terms of λ, the testing accuracy val-
ues have similar variation tendency and nearly overlap with
(b) Gaussian distribution with zero-mean and 𝜎P2 = 1, and different λ. Hence, the SELM-MKRSL is not very sensitive
(c) Binary distribution with Pr{A(i) = 1} = Pr{A(i) = − 1} = to the risk-sensitive parameter λ. According to the simula-
0.5. Usually the variance of Q(i) is much larger than that of tion designed above, we set λ as 5 and σ as 1 in the following
P(i), so Q(i) represents larger outliers here. The generated simulations on dataset USPS when P(i) is set as Uniform
noises above are overlapped with the original datasets. distribution. Similarly, as shown in Figs. 3 and 4, we can
determine the parameters λ and σ in the same way when
4.2 Simulation results and discussion P(i) is set as Gaussian distribution and Binary distribution
respectively.
We first demonstrate the selections of the parameters in our Then, we need to adjust the regularization coefficient η
proposed SELM-MKRSL, and then show the impacts of in our proposed SELM-MKRSL. The testing accuracy and
them on classification performance and the outlier robust- training time with the change of parameter η are shown in
ness. The number of parameters in SELM-MKRSL is Fig. 5. It can be seen that as η increases, the testing accu-
slightly more than that of original S-ELM, due to the use racy does not change very much, just decreases a little. As
of KRSL in our proposed approach. Considering that the for training time, it spends much time to train the learning
distribution of the datasets used in this article is balanced, network when η becomes large. As a consequence, we set
and what we concern is the correct classified ratio on all cat- parameter η as 0.1 in the simulations below on dataset USPS
egories, not on only one specific type, we choose accuracy when P(i) is set as Uniform distribution.
to evaluate the classification performance of the different Subsequently, the effect of equality constrained opti-
methods. Specifically, for those datasets used in the binary mization-based regularization parameter C is evaluated.
classification problem, we use other two criterions, including In the simulations, C is set within {10−20, 10−19, …, 1 019,
precision and recall, to evaluate the performance addition- 1020}, and the result is illustrated in Fig. 6a. It can be seen
ally. As for multi-class classification datasets, the precision that, the testing accuracy of S-ELM increases rapidly at
and recall may not have significant meaning to evaluate the first, and then becomes stable and converges in a wide
classification performance, hence we just use accuracy as range of C. Nevertheless, in terms of SELM-MKRSL and
the measure criterion. SELM-correntropy, the testing accuracy is almost invari-
able regardless of the variation of C. The reason of this
4.2.1 Parameters selection and analysis phenomenon is that parameter C is just used for calcu-
lating the initial output weight β in SELM-MKRSL and
For S-ELM, there are only three parameters required to be SELM-correntropy. Whether the initial β approximates to
adjusted. They are the number of total hidden layers k, the the optimal solution or not will not influence the classifica-
number of hidden nodes in each layer L, and the equality tion performance of these two algorithms, because the iter-
constrained optimization-based regularization parameter C. ative updating rule will make the output weight approach
While for SELM-MKRSL, in addition to these parameters the optimal solution gradually. However, the nearer the
in S-ELM, there are more parameters to be tuned, including distance between the initial output weight and the opti-
risk-sensitive parameter λ, kernel bandwidth σ, and another mal one is, the less iterations may conduct to reach to the
13
Fig. 2 The relationship between 90

the bandwidth σ and testing
accuracy with different λ in
SELM-MKRSL on dataset
USPS when P(i) is set as uni- 80
form distribution
70
Testing Accuracy (%)

60
50
40
=1
30
=3
=5
20
0 1 2 3 4 5 6 7 8

USPS when P(i) is set as Gauss- 80
ian distribution
70
60
50
40 =1
=3
=5
30
0 1 2 3 4 5 6 7 8
13

USPS when P(i) is set as binary 80
distribution
70
Testing Accuracy (%) 60
50
40 =1
=3
=5
30
0 1 2 3 4 5 6 7 8
84 160
82
140
80
120
78
76
100
Training Time (s)
74
80
72
70 60
68
40
66
SELM-MKRSL SELM-MKRSL
64 20
0 0.5 1 1.5 2 2.5 3 3.5 4 0 0.5 1 1.5 2 2.5 3 3.5 4
(a) Testing accuracy (b) Training time
Fig. 5 The relationship between the regularization coefficient η and testing accuracy as well as training time in SELM-MKRSL on dataset USPS
13
90 82
80
80
78
70 76

74
60
72
50 70
68
40 SELM-MKRSL SELM-MKRSL
SELM-Correntropy 66 SELM-Correntropy
S-ELM S-ELM
30 64
-20 -15 -10 -5 0 5 10 15 20 0 100 200 300 400 500 600 700 800 900 1000
log C L
(a) Regularization parameter C (b) Number of hidden nodes L
Fig. 6 The relationship between two parameters and testing accuracy in three algorithms
optimal solution, in an effort to reduce the training time. In we first give the initial results with zero noise, in order to
the SELM-correntropy, it replaces the MSE by correntropy provide a better insight of the baseline performance. Fig-
and conducts the iterations to update the output weight β ure 7a, b show the comparison results of three algorithms on
[45]. In the proposed SELM-MKRSL, the process of itera- datasets USPS and ISOLET, respectively. We can see that,
tions using MKRSL criterion makes it attain the optimal when there is no extra artificial noise on dataset, the testing
solution for β. Meanwhile, we can observe that the testing accuracy values of S-ELM and SELM-correntropy
accuracy of SELM-MKRSL converges in a higher position are almost similar and are slightly higher than that of the
than S-ELM and SELM-correntropy. proposed SELM-MKRSL with the increase of hidden nodes’
Afterwards, we study the impact of the number of hidden number L. However, when there are noises in datasets, the
neurons L on the classification performance. As shown in classification performance of our proposed method is better
Fig. 6b, the testing accuracies of all three algorithms first than that of other two algorithms. Figure 8a, b show the
slightly increase and then become stable with the increasing relationship between the outlier variance and testing accu-
of L. In addition, as we all know, the complexity of training racy on datasets USPS and ISOLET, respectively,
� √ √ when� the
network will grow large gradually if the number of hidden noise P(i) is set as uniform distribution over − 5, 5 . The
nodes increases. Considering the situation mentioned above, experimental results on datasets SPECT Heart and Internet
we determine the number of hidden nodes L as 500 on data- Advertisements are also given in Fig. 8c, d. It is obvious that
set USPS. Furthermore, it is noted that the testing classifica- the classification performance of our proposed SELM-
tion accuracy of the proposed SELM-MKRSL is higher than MKRSL is nearly stable with a relatively high testing accu-
that of other two approaches as shown in Fig. 6a, b. racy in comparison with S-ELM and SELM-correntropy,
regardless of the changes of outlier variance. As a result, the
proposed algorithm SELM-MKRSL is applicable to the
4.2.2 Simulation results
training of the datasets with noises.
In addition, we also conduct the simulations in distinct
In order to demonstrate the outlier robustness of our pro-
noise settings to evaluate the algorithmic robustness on
posed approach, we conduct the simulations on SELM-
different types of noise. Figure 9 shows the relationship
MKRSL with different amplitudes and types of noises. We
between outlier variance and testing accuracy when the
use the same noise model as described in Sect. 4.1 and
noise P(i) is set as zero-mean Gaussian distribution with
change the noise variance 𝜎Q2 to observe the impacts of out-
𝜎P2 = 1, and Fig. 10 shows the relationship when P(i) is set
lier variances on classification performance and algorithmic
as Binary distribution with Pr{A(i) = 1} = Pr{A(i) = − 1} =
robustness. Before we conduct the simulations with noises,
0.5. As illustrated in these figures, we change the variance
13
95 96
94 94
93
92
92
90

91
88
90
86
89
84
88
82
87 SELM-Correntropy SELM-Correntropy
S-ELM S-ELM
86 80
100 200 300 400 500 600 700 800 900 1000 100 200 300 400 500 600 700 800 900 1000
L L
(a) USPS dataset (b) ISOLET dataset
Fig. 7 The comparison results of three algorithms on two datasets with zero noise when increasing L (the number of hidden nodes)
of noise Q(i) from 5 to 40, and observe the impact on the corresponding precision and recall are also presented in
testing accuracy. It can be seen that our proposed SELM- Table 5 where the best values are in bold, when the noise
MKRSL is robust to the noise type and can achieve good P(i) is uniform distribution and the variance of the noise
classification accuracy compared with original S-ELM and Q(i) is set as 10. We can see that the precision of the original
SELM-correntropy. S-ELM is slightly higher than that of SELM-correntropy
The reason why our proposed approach can achieve and the proposed SELM-MKRSL. However, the recall of
good performance is not because SELM-MKRSL starts at a our proposed SELM-MKRSL is better than that of other two
higher point, but due to the robustness and insensitivity of methods. This may be because that the S-ELM only classi-
KRSL to the noises or outliers. In fact, the SELM-MKRSL fies the sample who has very high probability to be positive
algorithm may not outperform other two methods when the sample as positive one, and mistakenly estimates the sample
data is not contaminated with noises. However, while the as negative one whose probability is not extremely high.
data contains noises, SELM-MKRSL performs better com- With regard to SELM-MKRSL, we notice that it can achieve
pared with S-ELM and SELM-correntropy. better recall with relatively high precision compared with
The parameter settings of the three distinct approaches other methods. Among those simulation results, the reason
on four datasets are summarized in Table 1. Furthermore, why the precision of Parkinson speech dataset can achieve
the comparison of corresponding testing classification accu- 100% for all three approaches is just that all the instances in
racy is represented in Table 2, where the highest accuracy testing dataset come from the PD patients and they are all
values are in bold. In order to present the generalization of positive samples.
the proposed algorithm, we conduct the simulations on more As a conclusion, the simulation results demonstrate that
datasets when the noise P(i) is Uniform distribution. All SELM-MKRSL proposed in this article is more robust to
these datasets are downloaded from the UCI machine learn- large noises and can achieve better performance compared
ing repository [56], and the simulation results as well as the with other state-of-the-art algorithms, including S-ELM
corresponding parameter settings are given in Tables 3 and [24] and SELM-correntropy.
4, respectively. As shown in Tables 2 and 4 where the bold
values are the highest accuracy values, we can see that the
testing accuracy values of our proposed algorithm are higher 4.2.3 Discussion
than those of other two methods, regardless of the types and
variances of the noises. Therefore, the proposed algorithm It is well known that the loss function for a training model
in this article is effective and can achieve better classifi- is usually a similarity measure between the predicted out-
cation performance. For the binary classification datasets, put and the desired one. Meanwhile, MSE is one of the
13
84 84
82 82
80 80
78 78

76 76
74 74
72 72
70 70
SELM-Correntropy SELM-Correntropy
S-ELM S-ELM
68 68
5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40
Outlier Variance Outlier Variance
(a) USPS (b) ISOLET

95 100
90
90
85
80
80
70
75
70 60
65
50
60
40
55
30
50 SELM-Correntropy SELM-Correntropy
S-ELM S-ELM
45 20
5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40
(c) SPECT Heart (d) Internet Advertisements
Fig. 8 The relationship between outlier variance and testing accuracy in three algorithms on four datasets when the noise P(i) is uniform distri-
bution
most common methods to measure the similarity between of the C-loss can be highly non-convex, which is very flat
two variables owing to its simplicity. Nevertheless, as away from the optimal solution and very steep around the
researched in [57, 58], second-order statistics (such as solution, resulting in poor convergence performance [38].
MSE) is not suitable for dealing with non-Gaussian data, The KRSL used in algorithm SELM-MKRSL can solve the
which is very usual in practical applications. Recent stud- problem mentioned above. Its performance surface can be
ies in information theoretic learning (ITL) indicate that more convex than that of C-loss, leading to higher accu-
the ITL costs can obtain higher-order statistics of data racy and faster convergence speed, and at the same time
and achieve better performance than MSE especially on maintaining the robustness to noises [39]. In consequence,
non-Gaussian data [57, 59]. The correntropy, as a local the proposed algorithm SELM-MKRSL can outperform
similarity measure in ITL, can effectively eliminate the S-ELM and SELM-correntropy on the dataset with noises.
bad influence of outliers and has been used in many adap-
tive algorithms [60]. However, the performance surface
13
86 86
84 84
82 82
80 80

78 78
76 76
74 74
72 72
S-ELM S-ELM
70 70
5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40
(a) USPS (b) ISOLET

100
90
90
80 80
70
70
60
60
50
50 40
30
40
S-ELM S-ELM
20
5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40
Fig. 9 The relationship between outlier variance and testing accuracy in three algorithms on four datasets when the noise P(i) is Gaussian distri-
bution
5 Conclusion and future work the algorithm SELM-MKRSL proposed in this article
enhances the noise robustness characteristic of multilayer
In this article, a novel criterion of training the multilayer ELM, and it is verified that this approach is applicable for
ELM is proposed, and a new robust stacked ELM, i.e., large-scale data problems with outliers.
SELM-MKRSL, is accordingly designed. Compared with Nevertheless, in this article, the noises in the datasets are
the MSE used in the traditional ELM, the KRSL is more artificial and may have limitations in evaluating the actual
efficient to measure the similarity between the target out- performance. In the future, we will continue the study and
put and the predicted output when the training data are apply the proposed approach to some other application
contaminated with noises, because the KRSL is insensitive fields, e.g., human action recognition. As we all know, there
to large outliers and may be little influenced by noises. are many noises or outliers in the data collected from human
The simulation results indicate that the proposed SELM- movements, and we will use these practical datasets to fur-
MKRSL is superior to other methods and is robust to ther evaluate the effectiveness of our proposed approach.
various noises with higher classification accuracy. Hence,
13
86 86
84 84
82 82
80 80

78 78
76 76
74 74
72 72
S-ELM S-ELM
70 70
5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40
(a) USPS (b) ISOLET

100
90
90
80 80
70
70
60
60
50
50 40
30
40
S-ELM S-ELM
20
5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40
Fig. 10 The relationship between outlier variance and testing accuracy in three algorithms on four datasets when the noise P(i) is binary distri-
bution
Table 1 Parameter settings of Datasets Noise type S-ELM SELM-correntropy SELM-MKRSL

three different approaches on P(i)
four datasets L σ η λ σ η
USPS Uniform distribution 500 3.0 3.0 5.0 1.0 0.1

Gaussian distribution 500 3.0 2.8 5.0 1.0 0.2
Binary distribution 500 2.5 2.0 5.5 1.5 0.6
ISOLET Uniform distribution 200 2.0 1.4 3.0 2.0 1.2
SPECT Heart Uniform distribution 20 4.5 3.6 1.5 0.2 0.8
Internet Adver- Uniform distribution 250 1.5 0.4 1.5 0.6 3.4
tisements Gaussian distribution 250 1.0 1.2 1.5 0.5 3.8
13
Table 2 Performance Datasets Noise type Noise Accuracy (%)

comparison among three P(i) variance
different approaches on four 𝜎Q2 S-ELM SELM-cor- SELM-MKRSL
datasets rentropy
USPS Uniform distribution 5 77.98 76.38 79.32

15 77.13 79.02 81.96
25 79.07 79.12 79.87
35 77.48 77.28 80.02
Gaussian distribution 5 82.46 81.12 82.96
15 81.61 81.12 83.81
25 81.76 81.27 83.96
35 81.91 81.81 83.36
Binary distribution 5 81.51 82.21 84.26
15 81.96 80.47 83.26
25 81.17 80.12 82.96
35 81.17 80.22 82.16
ISOLET Uniform distribution 10 77.16 79.73 82.68
20 77.81 78.51 82.10
30 76.52 78.13 82.62
40 81.14 80.24 82.68
20 82.23 82.75 83.13
30 81.85 82.94 84.99
40 79.86 81.08 83.58
20 81.91 82.38 82.49
30 81.23 80.50 81.33
40 81.31 79.15 81.33
SPECT heart Uniform distribution 10 74.87 79.14 89.84
20 87.17 87.70 91.98
30 90.37 91.44 91.98
40 74.33 77.54 91.02
20 72.19 67.43 74.33
30 75.40 88.77 89.14
40 77.01 88.24 89.84
20 74.33 59.36 91.98
30 85.03 78.61 91.98
40 77.54 86.10 91.98
Internet advertisements Uniform distribution 10 71.94 72.20 83.27
20 72.86 71.15 82.35
30 70.22 70.36 84.72
40 70.75 71.01 83.40
20 74.70 73.25 84.45
30 74.84 71.94 84.32
40 71.41 75.23 85.64
20 73.25 77.60 83.40
30 71.28 69.17 84.06
40 74.04 73.25 85.90
13
Table 3 Parameter settings Datasets S-ELM SELM-correntropy SELM-MKRSL

among three different
approaches on six more datasets L σ η λ σ η
when the noise P(i) is Uniform
distribution Balance scale 250 1.0 0.2 2.5 1.0 0.4
Drug consumption 250 0.5 1.0 2.5 0.4 0.6
contraceptive method choice 200 1.5 1.2 2.5 1.6 0.2
Parkinson speech 300 1.0 0.8 0.5 1.4 0.2
QSAR biodegradation 300 1.2 0.6 2.5 1.6 0.2
Seismic bumps 300 1.2 0.8 2.5 1.6 0.2
Table 4 Performance Datasets Noise variance Accuracy (%)

comparison among three 𝜎Q2
different approaches on six S-ELM SELM-corren- SELM-MKRSL
more datasets when the noise tropy
P(i) is uniform distribution
Balance scale 10 74.67 88.00 90.67
20 83.33 87.33 91.33
30 74.00 89.33 89.67
40 75.33 86.00 90.00
Drug consumption 10 72.92 75.04 77.70
20 75.40 78.05 79.29
30 75.75 77.70 78.05
40 74.16 76.81 77.35
Parkinson speech 10 77.98 70.83 85.12
20 64.29 52.38 94.05
30 70.24 58.93 86.90
40 72.02 68.45 83.33
Contraceptive method choice 10 52.01 52.64 53.49
20 51.59 52.22 53.28
30 50.74 52.43 53.07
40 53.49 54.12 54.76
QSAR biodegradation 10 78.69 77.70 80.66
20 77.70 78.03 82.30
30 79.34 76.72 81.31
40 80.98 80.00 81.64
Seismic bumps 10 92.47 92.86 94.01
20 92.60 92.73 93.88
30 91.58 91.84 94.01
40 92.09 93.24 93.37
Table 5 Performance Datasets Metric (%) S-ELM SELM-correntropy SELM-MKRSL

comparison on precision and
recall among three different QSAR biodegradation Precision 82.92 82.73 83.61
approaches for the binary
Recall 85.66 89.53 89.56
classification datasets
Parkinson speech Precision 100.00 100.00 100.00
Recall 47.44 39.41 97.17
SPECT heart Precision 97.25 97.15 93.06
Recall 80.35 86.13 96.98
Seismic bumps Precision 93.64 93.62 93.62
Recall 99.25 99.90 99.96
Internet advertisements Precision 92.01 91.79 84.45
Recall 70.71 71.00 98.24
13
Acknowledgements This work is funded by the National Key Research 19. Bai Z, Huang GB, Wang D (2014) Sparse extreme learning
and Development Program of China under Grant 2016YFC0600510, machine for classification. IEEE Trans Cybern 44(10):1858–1870
the National Natural Science Foundation of China under Grants 20. Cao JW, Zhao YF, Lai XP, Ong MEH, Yin C, Koh ZX, Liu
U1836106 and U1736117, the Key Laboratory of Geological Infor- N (2015) Landmark recognition with sparse representation
mation Technology of Ministry of Land and Resources under Grant classification and extreme learning machine. J Franklin Inst
2017320, and the University of Science and Technology Beijing— 352(10):4528–4545
National Taipei University of Technology Joint Research Program 21. Cao WP, Ming Z, Wang XZ, Cai SB (2017) Improved bidirec-
under Grant TW201705. tional extreme learning machine based on enhanced random
search. Memet Comput 5:1–8
22. Lan Y, Soh YC, Huang GB (2009) Ensemble of online sequential
extreme learning machine. Neurocomputing 72:3391–3395
References 23. Tang JX, Deng CW, Huang GB (2016) Extreme learning machine
for multilayer perceptron. IEEE Trans Neural Netw Learn Syst
1. Serengil SI, Ozpinar A (2017) Workforce optimization for bank 27(4):809–821
operation centers: a machine learning approach. Int J Interact Mul- 24. Zhou HM, Huang GB, Lin ZP, Wang H, Soh YC (2015) Stacked
timed Artif Intell 4(6):81–87 extreme learning machines. IEEE Trans Cybern 45(9):2013–2025
2. Elvira C, Ochoa A, Gonzalvez JC, Mochón F (2018) Machine- 25. Luo X, Deng J, Liu J, Wang W, Ban X, Wang JH (2017) A quan-
learning-based no show prediction in outpatient visits. Int J Inter- tized kernel least mean square scheme with entropy-guided learn-
act Multimed Artif Intell 4(7):29–34 ing for intelligent data analysis. China Commun 14(7):127–136
3. Alasadi AHH, Alsafy BM (2017) Diagnosis of malignant mela- 26. Miche Y, Bas P, Jutten C, Simula O, Lendasse A (2008) A meth-
noma of skin cancer types. Int J Interact Multimed Artif Intell odology for building regression models using extreme learning
4(5):44–49 machine: OP-ELM. In: Proc 16th Eur symposium artif neural
4. Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: netw—adv comput intell learn, pp 247–252
theory and applications. Neurocomputing 70(1–3):489–501 27. Guo D, Shamai S, Verdu S (2005) Mutual information and mini-
5. Huang GB, Wang DH, Lan Y (2011) Extreme learning machines: mum mean-square error in Gaussian channels. IEEE Trans Inf
a survey. Int J Mach Learn Cybern 2(2):107–122 Theory 51(4):1261–1282
6. Cervellera C, Maccio D (2017) An extreme learning machine 28. Lu XJ, Ming L, Liu WB, Li HX (2017) Probabilistic regularized
approach to density estimation problems. IEEE Trans Cybern extreme learning machine for robust modeling of noise data. IEEE
47(10):3254–3265 Trans Cybern 48(8):2368–2377
7. Iosifidis A, Gabbouj M (2015) On the kernel extreme learning 29. Zhang T, Deng ZH, Choi KS, Liu JF, Wang ST (2017) Robust
machine speedup. Pattern Recognit Lett 68:205–210 extreme learning fuzzy systems using ridge regression for small
8. Huang GB, Bai Z, Kasun LLC, Vong CM (2015) Local receptive and noisy datasets. In: Proc IEEE int conf fuzzy syst, pp 1–7
fields based extreme learning machine. IEEE Comput Intell Mag 30. Wong SY, Yap KS, Yap HJ (2016) A constrained optimization
10(2):18–29 based extreme learning machine for noisy data regression. Neu-
9. Zhu H, Tsang ECC, Wang XZ (2016) Monotonic classification rocomputing 171:1431–1443
extreme learning machine. Neurocomputing 225:205–213 31. Santamaria I, Pokharel PP, Principe JC (2006) Generalized cor-
10. Cao JW, Zhang K, Luo MX, Yin C, Lai XP (2016) Extreme learn- relation function: definition, properties, and application to blind
ing machine and adaptive sparse representation for image clas- equalization. IEEE Trans Signal Process 54(61):2187–2197
sification. Neural Netw 81:91–102 32. Liu W, Pokharel PP, Principe JC (2007) Correntropy: properties
11. Luo X, Yang X, Jiang C, Ban XJ (2018) Timeliness online regu- and applications in non-Gaussian signal processing. IEEE Trans
larized extreme learning machine. Int J Mach Learn Cybern Signal Process 55(11):5286–5298
9(3):465–476 33. He R, Zheng WS, Hu BG (2011) Maximum correntropy criterion
12. Mozaffari A, Azad NL (2016) Self-controlled bio-inspired for robust face recognition. IEEE Trans Pattern Anal Mach Intell
extreme learning machines for scalable regression and classifi- 33(8):1561–1576
cation: a comprehensive analysis with some recommendations. 34. Chen BD, Xing L, Liang JL, Zheng N, Principe JC (2014) Steady-
Artif Intell Rev 46(2):167–223 state mean-square error analysis for adaptive filtering under
13. Zhai JH, Shao QY, Wang XZ (2016) Architecture selection of the maximum correntropy criterion. IEEE Signal Process Lett
ELM networks based on sensitivity of hidden nodes. Neural 21(7):880–884
Process Lett 44(2):471–489 35. Xing HJ, Wang XM (2013) Training extreme learning machine
14. Balasundaram S, Gupta D (2016) On optimization based via regularized correntropy criterion. Neural Comput Appl
extreme learning machine in primal for regression and classifi- 23(7–8):1977–1986
cation by functional iterative method. Int J Mach Learn Cybern 36. Luo X, Sun J, Wang L, Wang W, Zhao W, Wu J, Wang JH,
7(5):707–728 Zhang Z (2018) Short-term wind speed forecasting via stacked
15. Zhu H, Tsang ECC, Wang XZ, Aamir Raza Ashfaq R (2017) extreme learning machine with generalized correntropy. IEEE
Monotonic classification extreme learning machine. Neurocom- Trans Ind Inf 14(11):4963–4971
puting 225:205–213 37. Luo X, Xu Y, Wang WP, Yuan MM, Ban XJ, Zhu YQ, Zhao
16. Ding S, Zhang N, Zhang J, Xu X, Shi Z (2017) Unsupervised WB (2018) Towards enhancing stacked extreme learning
extreme learning machine with representational features. Int J machine with sparse autoencoder by correntropy. J Franklin
Mach Learn Cybern 8(2):587–595 Inst 355(4):1945–1966
17. Alom MZ, Sidike P, Taha TM, Asari VK (2017) State preserving 38. Syed MN, Pardalos PM, Principe JC (2014) On the optimization
extreme learning machine: a monotonically increasing learning properties of the correntropic loss function in data analysis.
approach. Neural Process Lett 45(2):703–725 Optim Lett 8(3):823–839
18. Luo X, Jiang C, Wang W, Xu Y, Wang JH, Zhao W (2019) User 39. Chen BD, Xing L, Xu B, Zhao H, Zheng N, Principe JC (2017)
behavior prediction in social networks using weighted extreme Kernel risk-sensitive loss: definition, properties and applica-
learning machine with distribution optimization. Future Gener tion to robust adaptive filtering. IEEE Trans Signal Process
Comput Syst 93:1023–1035 65(11):2888–2901
13
40. Luo X, Zhang D, Yang LT, Liu J, Chang X, Ning H (2016) A 53. Sakar BE, Isenkul ME, Sakar CO, Sertbas A, Gurgen F, Delil S,
kernel machine-based secure data sensing and fusion scheme in Apaydin H, Kursun O (2013) Collection and Analysis of a Parkin-
wireless sensor networks for the cyber-physical systems. Future son speech dataset with multiple types of sound recordings. IEEE
Gener Comput Syst 61:85–96 J Biomed Health 17(4):828–834
41. Serre D (2010) Matrices: theory and applications. Springer, 54. Mansouri K, Ringsted T, Ballabio D, Todeschini R, Consonni
New York V (2013) Quantitative structure-activity relationship mod-
42. Dwyer PS, Rao CR, Mitra SK (1973) Generalized inverse of els for ready biodegradability of chemicals. J Chem Inf Model
matrices and its applications. J Am Stat Assoc 68:239 53(4):867–878
43. Huang GB, Zhou HM, Ding X, Zhang R (2012) Extreme learn- 55. Sikora M, Wrobel L (2010) Application of rule induction algo-
ing machine for regression and multiclass classification. IEEE rithms for analysis of data collected by seismic hazard monitoring
Trans Syst Man Cybern Part B Cybern 42(2):513–529 systems in coal mines. Arch Min Sci 55(1):91–114
44. Candes EJ, Li X, Ma Y (2011) Robust principal component 56. Dua D, Karra TE (2019) UCI machine learning repository. Uni-
analysis? J ACM 58(3):11 versity of California, School of Information and Computer Sci-
45. Schölkopf B, Smola AJ (2002) Learning with kernels: support ence, Irvine. http://archive.ics.uci.edu/ml
vector machines, regularization, optimization, and beyond. MIT 57. Principe JC (2010) Information theoretic learning: Renyi’s entropy
Press, Cambridge and kernel perspectives. Springer, New York
46. Hull JJ (1994) A database for handwritten text recognition 58. Chen BD, Zhu Y, Hu JC, Principe JC (2013) System parame-
research. IEEE Trans Pattern Anal Mach Intell 16(5):550–554 ter identification: information criteria and algorithms. Elsevier,
47. Cole R, Fanty M (1994) UCI machine learning repository. https Amsterdam
://archive.ics.uci.edu/ml/datasets/ISOLET 59. Chen M, Li Y, Luo X, Wang W, Wang L, Zhao W (2019) A
48. Kurgan LA, Cios KJ, Tadeusiewicz R, Ogiela M, Goodenday novel human activity recognition scheme for smart health using
LS (2001) Knowledge discovery approach to automated cardiac multilayer extreme learning machine. IEEE Internet Things J
SPECT diagnosis. Artif Intell Med 23(2):149–169 6(2):1410–1418
49. Kushmerick N (1999) Learning to remove Internet advertise- 60. Chen LJ, Qu H, Zhao JH, Chen BD, Principe JC (2016) Efficient
ments. In: Proc int conf autonom agents, pp 175–181 and robust deep learning with correntropy-induced loss function.
50. Klahr D, Siegler RS (1978) The representation of children’s Neural Comput Appl 27(4):1019–1031
knowledge. Adv Child Dev Behav 12:61–116
51. Fehrman E, Muhammad AK, Mirkes EM, Egan V, Gorban AN Publisher’s Note Springer Nature remains neutral with regard to
(2017) The five factor model of personality and evaluation of jurisdictional claims in published maps and institutional affiliations.
drug consumption risk. In: Palumbo F, Montanari A, Vichi M.
(eds) Data science. Studies in classification, data analysis, and
knowledge organization. Springer, Cham
52. Lim TS, Loh WY, Shih YS (2000) A comparison of prediction
accuracy, complexity, and training time of thirty-three old and
new classification algorithms. Mach Learn 40(3):203–228
13

10.1007@s13042 019 00967 W

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

10.1007@s13042 019 00967 W

Uploaded by

Copyright:

Available Formats

International Journal of Machine Learning and Cybernetics

A robust multilayer extreme learning machine using kernel

Received: 23 October 2017 / Accepted: 22 May 2019

1 Introduction are proved to be well-performed [1–3]. At the same time,

2.3 Kernel risk‑sensitive loss (KRSL) Proper ty 2 K R S L i s p o s i t i ve a n d b o u n d e d :

Fig. 1 The architecture of SELM-MKRSL: a MKRSL; b PCA; c S-ELM

3.2 Convergence analysis Hence, this sequence is non-increasing finally. In

vided as follows. JKRSL (𝛽 , 𝛼 ), 𝜏 = 1, 2, ⋯ is verified to be converged.

Fig. 2 The relationship between 90

Testing Accuracy (%)

Fig. 3 The relationship between 90

Fig. 4 The relationship between 90

Testing Accuracy (%) 60

(a) Testing accuracy (b) Training time

Testing Accuracy (%)

(a) Regularization parameter C (b) Number of hidden nodes L

Testing Accuracy (%)

(a) USPS dataset (b) ISOLET dataset

Testing Accuracy (%)

Outlier Variance Outlier Variance

(a) USPS (b) ISOLET

Outlier Variance Outlier Variance

(c) SPECT Heart (d) Internet Advertisements

Testing Accuracy (%)

Outlier Variance Outlier Variance

(a) USPS (b) ISOLET

Outlier Variance Outlier Variance

(c) SPECT Heart (d) Internet Advertisements

Testing Accuracy (%)

Outlier Variance Outlier Variance

(a) USPS (b) ISOLET

Outlier Variance Outlier Variance

(c) SPECT Heart (d) Internet Advertisements

Table 1 Parameter settings of Datasets Noise type S-ELM SELM-correntropy SELM-MKRSL

USPS Uniform distribution 500 3.0 3.0 5.0 1.0 0.1

Table 2 Performance Datasets Noise type Noise Accuracy (%)

USPS Uniform distribution 5 77.98 76.38 79.32

Table 3 Parameter settings Datasets S-ELM SELM-correntropy SELM-MKRSL

Table 4 Performance Datasets Noise variance Accuracy (%)

Table 5 Performance Datasets Metric (%) S-ELM SELM-correntropy SELM-MKRSL

You might also like

2.3 Kernel risk‑sensitive loss (KRSL) Proper ty 2 K R S L i s p o s i t i ve a n d b o u n d e d :

3.2 Convergence analysis Hence, this sequence is non-increasing finally. In