
Pattern Recognition 99 (2020) 107050


Collaborative and geometric multi-kernel learning for multi-class classification

Zhe Wang∗, Zonghai Zhu∗, Dongdong Li
Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, PR China

Article history: Received 16 April 2019; Revised 31 July 2019; Accepted 12 September 2019; Available online 14 September 2019

Keywords: Multi-class classification; Empirical kernel mapping; Multiple empirical kernel learning; Regularized learning

Abstract

Multi-class classification is the problem of classifying a sample into one of three or more classes. In this paper, we propose an algorithm named collaborative and geometric multi-kernel learning (CGMKL) to classify multi-class data into the corresponding class directly. The CGMKL uses Multiple Empirical Kernel Learning (MEKL) to map the samples into multiple kernel spaces, and then trains a softmax function in each kernel space. To realize collaborative learning, one regularization term, which enforces consistent outputs of samples across the different kernel spaces, provides complementary information. Moreover, another regularization term makes the classification result exhibit a geometric feature by reducing the within-class distance of the outputs of samples. Extensive experiments on multi-class data sets validate the effectiveness of the CGMKL.

© 2019 Elsevier Ltd. All rights reserved.

∗ Corresponding authors.
E-mail addresses: wangzhe@ecust.edu.cn (Z. Wang), 13564251556@163.com (Z. Zhu).
https://doi.org/10.1016/j.patcog.2019.107050
0031-3203/© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

In the field of machine learning, the classification task may face the multi-class classification problem [1–3], where the data set has more than two classes. In such a case, the designed classifier is required to classify a sample into one of three or more classes. Generally, traditional classifiers, such as the Support Vector Machine (SVM) [4,5] and Logistic Regression (LR) [6], are designed to deal with binary classification problems. When they are applied to multi-class classification problems, they must convert the multi-class problem into several binary-class problems [7]. The most common strategies to deal with the multi-class classification problem are One-Versus-Rest (OVR) and One-Versus-One (OVO) [8].

The OVR [9] constructs one classifier per class, which is trained to distinguish the samples in a single class from the samples in all remaining classes. For a data set with k classes, the OVR constructs k binary classifiers to classify all samples in the data set. The major problem in the OVR is that the number of samples in the single class may be far less than that in all remaining classes, which causes the imbalanced problem [10]. In the imbalanced problem [11,12], the classifiers tend to overly focus on the samples in all remaining classes, thereby misclassifying the samples in the single class. To deal with the imbalanced problem, additional techniques such as cost-sensitive learning [13,14] and re-sampling [15,16] are required to balance the misclassification cost or the data distribution. Moreover, in the case of a high number of classes, the decision boundaries may get overly complex [17].

The OVO [18] is usually considered to be more accurate than the OVR. It creates one classifier between any two classes. Therefore, the OVO creates simpler problems with fewer samples and does not cause the imbalanced problem as the OVR does. However, there are some weaknesses in terms of reproducibility of decision boundaries as well as computational complexity as the number of classes increases [19]. For a data set with k classes, the OVO constructs k(k−1)/2 binary classifiers, and then places the test samples into all binary classifiers to provide the voting results. Moreover, the OVO faces the non-competence problem [20,21], as it assigns the sample to all of the binary classifiers, though some of the classifiers are not meaningful.

As opposed to the OVO and OVR, which are required to calculate several binary classifiers, the softmax function [22,23] is a multi-class algorithm optimized by minimizing a unified negative log-likelihood of the training data. Suppose the data set has k classes. In probability theory, the output of the softmax function naturally represents a categorical distribution, which is a probability distribution on k possible outcomes. Then, the outcome with the maximum probability is the corresponding class of the sample. The advantage of the softmax function is that it is easy to solve for the weight vector by using gradient descent. Moreover, for an input sample, the softmax can directly provide the probability of belonging to each class.

In this manner, the softmax function avoids the imbalanced problem caused by the OVR, and the computational complexity and non-competence problem caused by the OVO.

Although the softmax function can deal with the multi-class classification problem, it is hard for the softmax function to tackle the data in the original space well, especially when the boundary of the data distribution is non-linear. To deal with this problem, the kernel method [24,25] is used to map the original data into a kernel space, thus handling data with a nonlinear distribution. The kernel method can be categorized into two types, Implicit Kernel Mapping (IKM) and Empirical Kernel Mapping (EKM) [26]. The IKM constructs nonlinear relations of the input data through an implicit feature expression. Generally, the IKM deals with the kernel function in the manner of inner products between pairs of samples x_i and x_j (i, j = 1, 2, . . . , N). However, this kind of manner does not exist in the softmax function. Different from the IKM, the EKM enriches the expression mode of the sample by mapping the original sample x into the kernel space with an explicit form Φe(x) in accordance with the kernel mapping Φe. As Φe(x) provides the detailed value of each dimension for the sample x in the kernel space, the EKM can be embedded into the softmax function naturally.

Although the EKM provides an explicit feature expression in the kernel space, a single kernel may fail to fully excavate and utilize the relationships among samples. To further enrich the expressions of the samples and utilize the classification ability of different kernels, the Multiple Kernel Learning (MKL) framework [27–29] was proposed and demonstrated to have superior classification ability. Numerous studies have continuously advanced the development of the MKL. For example, SimpleMKL [30] introduced an adaptive l2 norm into the MKL. The GLMKL used the grouped lasso to construct the connection of kernels, thus ensuring hierarchy and sparsity. The EasyMKL [31] combined the kernels derived from multiple sources in a data-driven way to enhance the accuracy. Recently, the Nyström [32] and data-dependent [33] methods have been widely used to reduce the complexity and learn the optimal kernel. Beyond these improvements, the MKL is widely used in many applications such as biomedical applications [34] and disease prediction [35]. Besides, the MKL can also be adapted for feature selection [36]. Obviously, the MKL can tackle complex situations and improve classification ability [37,38]. Owing to the effectiveness of the MKL, this paper combines the softmax with Multiple Empirical Kernel Learning (MEKL). The softmax function is learned in each explicit kernel space.

However, combining the softmax function and the MEKL framework still leaves two main problems to be solved. The first problem is how to make the softmax functions work collaboratively between different kernel spaces. The second problem is how to control the output trend of the data to help improve the classification ability in each kernel space. To this end, this paper designs and introduces one regularization term RU, which requires consistent outputs of samples in different kernel spaces, to provide complementary information from the different kernel spaces, thereby realizing collaborative working. Moreover, this paper designs another regularization term RG, which requires the outputs of samples to have small within-class distances in each kernel space, to make the output trend of samples suit the geometric feature of the classification task. Generally, the samples with the same class label are expected to be close to each other after they are projected on the solution vector. Then, two parameters are used to control the importance of the two regularization terms RU and RG.

As a result, this paper proposes a collaborative and geometric multi-kernel learning (CGMKL) for multi-class classification. The CGMKL combines the softmax function and MEKL to tackle multi-class classification. Moreover, the CGMKL introduces two regularization terms to improve the classification ability further. The contributions of this paper are given as follows:

• CGMKL realizes multi-class classification under the MEKL framework through combining the softmax function and MEKL. By doing so, the MEKL enriches the expressions of the samples and greatly improves the classification ability of the softmax function.
• CGMKL offers complementary information between different kernel spaces by introducing a regularization term RU, which keeps the outputs of samples consistent across different kernel spaces. By doing so, classifiers in different kernel spaces can learn from each other and keep working collaboratively.
• CGMKL makes the output trend of the data suit classification through introducing a regularization term RG, which reduces the within-class distance of the outputs of samples. By doing so, the classification result exhibits a geometric feature.

The remainder of this paper is organized as follows. Section 2 presents a brief introduction of multi-class classification and the MEKL. The detailed description of the proposed CGMKL is illustrated in Section 3. The experimental results are reported and discussed in Section 4. Finally, the conclusions are provided in Section 5.

2. Related work

2.1. Multi-class classification

To deal with the multi-class classification problem, the OVO and OVR strategies are the two major methods. However, these two kinds of methods have deficiencies in classification. Therefore, this paper uses the softmax function to directly classify the samples into their corresponding class. In multi-class classification, suppose the training set contains N samples and is written as {x_1, φ_1}, {x_2, φ_2}, . . . , {x_N, φ_N}, where x_i ∈ R^{1×d} and the label φ_i ∈ {1, 2, . . . , k}. Then, the loss function of the softmax is written as follows:

J = -\sum_{i=1}^{N} \sum_{j=1}^{k} I(\phi_i = j) \log \frac{e^{x_i w_j + b_j}}{\sum_{j'=1}^{k} e^{x_i w_{j'} + b_{j'}}},   (1)

where I(φ_i = j) is a boolean function. If the label φ_i is equal to j, I(φ_i = j) returns 1; otherwise, it returns 0. The w_j and b_j are the weight vector and threshold corresponding to the jth class. The underlying implication of the loss function J is that, if x_i belongs to the jth class, the value of x_i w_j + b_j is required to be as large as possible, thereby improving the probability that the sample x_i belongs to the jth class.

As the aforementioned formula is relatively complex, its derivative may also be complex. Therefore, we write the matrix form of the softmax loss as follows:

J = -\sum_{i=1}^{N} \log \frac{\exp(x_i W) y_i^T}{\exp(x_i W) \mathbf{1}^T}.   (2)

In the above formula, x_i = [x_i, 1] pads a value 1 to match the corresponding threshold. The W ∈ R^{(d+1)×k} concatenates the weight vectors and the thresholds of all classes. The 1 ∈ R^{1×k} is a row vector whose elements are all equal to 1. The y_i is the one-hot vector representation of the label, and exp(x_i W) represents the element-wise exponential e^{x_i W}.
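As a rough illustration of Eq. (2), the following NumPy sketch (our own, not code from the paper) evaluates the matrix-form softmax loss; the variable names are assumptions, `X` is already padded with the constant-1 column and `Y` holds one-hot labels.

```python
import numpy as np

def softmax_loss(X, Y, W):
    """Matrix-form softmax loss of Eq. (2).

    X : (N, d+1) samples padded with a trailing 1 (bias term)
    Y : (N, k)   one-hot labels
    W : (d+1, k) weight vectors and thresholds stacked per class
    """
    scores = X @ W                                   # x_i W for every sample
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability only
    expo = np.exp(scores)                            # exp(x_i W)
    prob_true = (expo * Y).sum(axis=1)               # exp(x_i W) y_i^T
    partition = expo.sum(axis=1)                     # exp(x_i W) 1^T
    return -np.log(prob_true / partition).sum()
```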
2.2. MEKL

This paper utilizes the kernel method to map the samples from the original space into kernel spaces. By doing so, a nonlinear classification problem can be converted into a linear classification problem in the kernel space. Generally, the kernel method can be divided into two types, the IKM and the EKM. As the name suggests, the IKM maps the sample into the kernel space in an implicit way.

Different from the IKM, the EKM maps the original sample into the kernel space and provides the detailed value of each dimension for the sample in the kernel space. Suppose the training set contains N samples defined as {(x_i, φ_i)}_{i=1}^N, where φ_i is the label of the sample. The symmetric positive semi-definite kernel matrix is defined as K = [ker_{ij}]_{N×N} with ker_{ij} = Φ(x_i)·Φ(x_j) = ker(x_i, x_j). Suppose the rank of K is equal to r; then the kernel matrix K is decomposed as

K = Q_{N\times r} \Lambda_{r\times r} Q_{N\times r}^T,   (3)

where Λ_{r×r} is a diagonal matrix whose elements are the r positive eigenvalues of K, and Q_{N×r} contains the eigenvectors corresponding to those r eigenvalues. To obtain an explicit form in the kernel space, the mapping function is defined as Φe (Φe: I → F). For an input sample x in the original feature space I, x can be mapped into the kernel space F by Φe, and Φe(x) is calculated as

\Phi^e(x) = [\ker(x, x_1), \ker(x, x_2), \ldots, \ker(x, x_N)]\, Q_{N\times r} \Lambda_{r\times r}^{-1/2}.   (4)

It is known that different kernels play different roles in different scenarios. Therefore, combining multiple kernels is an effective method to improve the generalization ability in classification and regression tasks. In this paper, m kernel matrices and the corresponding mapping functions are used. For the training set, the expressions in the m kernel spaces are {Φe_1(x_i), . . . , Φe_l(x_i), . . . , Φe_m(x_i)}_{i=1}^N.

3. Proposed CGMKL

In this section, the proposed CGMKL is introduced. Firstly, how to combine the softmax function and the MEKL is introduced. Moreover, the regularization terms RU and RGl are described. Then, the solution process and the pseudo code of the CGMKL are presented.

3.1. Learning framework of CGMKL

The proposed CGMKL maps the samples into m empirical kernel spaces in accordance with the rule of MEKL. Then, the CGMKL trains a softmax function to directly classify the samples into their corresponding class in each kernel space. To make the learning processes in the kernel spaces collaborative, a regularization term RU is introduced to enforce consistent outputs of samples in different kernel spaces. Moreover, another regularization term RGl, l = 1, . . . , m, is introduced to make the outputs of samples exhibit a geometric feature. The entire framework of the CGMKL is written as follows:

L = \sum_{l=1}^{m} [R_{emp}(f_l) + c\, R_{G_l}] + \lambda R_U,   (5)

where the parameters c and λ control the importance of the terms RGl and RU, respectively.

In detail, Remp(fl) represents the loss of the softmax function in the lth kernel space. In each kernel space, the loss function of the softmax function is required to be as small as possible. Generally, the accuracy on the training samples increases as the value of the loss function declines. The detailed description of Remp(fl) is written as follows:

R_{emp}(f_l) = -\sum_{i=1}^{N} \log \frac{\exp(x_i^l W_l) y_i^T}{\exp(x_i^l W_l) \mathbf{1}^T},   (6)

where x_i^l = [Φe_l(x_i), 1] represents the ith training sample mapped into the lth kernel space with a threshold, and y_i is the one-hot vector representation of the label. To combine the loss functions of the kernel spaces, we accumulate these loss functions to realize the softmax under the MEKL framework.

As the accumulation is a simple way to combine multiple kernels, it does not consider the relationship between different kernel spaces. In order to obtain complementary information from the different kernel spaces, a regularization term RU is designed to require consistent outputs of samples in different kernel spaces. Through introducing RU, the softmax functions in different kernel spaces work collaboratively with each other. The detailed formula of RU is written as:

R_U = \frac{1}{2N} \sum_{l=1}^{m} \sum_{i=1}^{N} \Big( x_i^l W_l - \frac{1}{m} \sum_{j=1}^{m} x_i^j W_j \Big) \Big( x_i^l W_l - \frac{1}{m} \sum_{j=1}^{m} x_i^j W_j \Big)^T.   (7)

Then, we use X_l to represent the matrix form of the sample set {x_i^l}_{i=1}^N in the lth kernel space. Each row of X_l corresponds to one sample of {x_i^l}_{i=1}^N. In this manner, the regularization term RU can be rewritten as follows:

R_U = \frac{1}{2N} \sum_{l=1}^{m} \mathrm{tr}\Big( \big( X_l W_l - \frac{1}{m} \sum_{j=1}^{m} X_j W_j \big) \big( X_l W_l - \frac{1}{m} \sum_{j=1}^{m} X_j W_j \big)^T \Big),   (8)

where tr is the trace of a square matrix.

Generally, for a binary classification problem, the outputs of samples are expected to exhibit a geometric feature: a small within-class distance and a large between-class distance. To transfer this geometric feature to multi-class classification, the between-class distance is not considered, as this distance may be hard to exploit with multiple classes. Therefore, a regularization term RGl is designed to reduce the within-class distance of samples after they are projected on the solution vector. The detailed formula of RGl is written as:

R_{G_l} = \frac{1}{2N} \mathrm{tr}(G_l W_l W_l^T G_l^T),   (9)

where G_l is the scatter matrix in the lth kernel space. In practice, we first calculate the mean value of the samples in each class. Next, the samples in each class subtract their corresponding class mean value, and then they are mapped into the kernel space by using the kernel mapping Φe_l. To match the dimensions of W_l, these samples are padded with one threshold whose value is set to 1. Finally, G_l is obtained by stacking these samples vertically.

As a result, the detailed loss function of the CGMKL is written as follows:

L = \sum_{l=1}^{m} \Big[ -\sum_{i=1}^{N} \log \frac{\exp(x_i^l W_l) y_i^T}{\exp(x_i^l W_l) \mathbf{1}^T} + c\, \frac{1}{2N} \mathrm{tr}(G_l W_l W_l^T G_l^T) \Big] + \lambda \frac{1}{2N} \sum_{l=1}^{m} \mathrm{tr}\Big( \big( X_l W_l - \frac{1}{m} \sum_{j=1}^{m} X_j W_j \big) \big( X_l W_l - \frac{1}{m} \sum_{j=1}^{m} X_j W_j \big)^T \Big).   (10)

3.2. Solution process of CGMKL

The partial derivative with respect to W_l, l = 1, . . . , m, in each kernel space is related to the terms Remp(fl), RGl and RU. As these terms are accumulated in the learning framework, the derivatives of Remp(fl), RGl and RU with respect to W_l can be calculated independently. To solve for the optimal W_l, the connection between the matrix differential and the matrix derivative is established as follows:

df = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\partial f}{\partial W_{ij}}\, dW_{ij} = \mathrm{tr}\Big( \Big(\frac{\partial f}{\partial W}\Big)^T dW \Big),   (11)

where f is the loss function and W_{ij} is the element in row i, column j of W. As described in the above formula, the derivative of the loss function f with respect to the solution matrix W can be calculated by using the matrix differential and the trace.
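Before differentiating the objective term by term, the following NumPy sketch (our own illustration, with assumed variable and helper names) evaluates the full loss of Eq. (10) for given per-space data and weights; `Xs` are the padded empirical-kernel sample matrices, `Gs` the class-mean-centered, padded scatter matrices of Eq. (9).

```python
import numpy as np

def cgmkl_loss(Xs, Gs, Ws, Y, c, lam):
    """Eq. (10): per-space softmax losses + c * R_Gl + lambda * R_U.

    Xs : list of m arrays, each (N, D_l + 1)  -- empirical-kernel samples, padded with 1
    Gs : list of m arrays                     -- class-mean-centered samples, padded with 1
    Ws : list of m arrays, each (D_l + 1, k)  -- weights per kernel space
    Y  : (N, k) one-hot labels
    """
    N = Y.shape[0]
    outputs = [X @ W for X, W in zip(Xs, Ws)]               # X_l W_l per space
    mean_out = sum(outputs) / len(outputs)                   # (1/m) sum_j X_j W_j

    loss = 0.0
    for X, G, W, out in zip(Xs, Gs, Ws, outputs):
        expo = np.exp(out - out.max(axis=1, keepdims=True))  # stabilized exp(x_i^l W_l)
        r_emp = -np.log((expo * Y).sum(1) / expo.sum(1)).sum()   # Eq. (6)
        r_g = np.trace(G @ W @ W.T @ G.T) / (2 * N)              # Eq. (9)
        diff = out - mean_out
        r_u = np.trace(diff @ diff.T) / (2 * N)                  # one summand of Eq. (8)
        loss += r_emp + c * r_g + lam * r_u
    return loss
```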

1) For the partial derivative of Remp(fl) with respect to W_l, l = 1, 2, . . . , m, in each kernel space, the formula of Remp(fl) and the simplified process are listed as follows:

R_{emp}(f_l) = -\sum_{i=1}^{N} \log \frac{\exp(x_i^l W_l) y_i^T}{\exp(x_i^l W_l) \mathbf{1}^T}
            = -\sum_{i=1}^{N} \big( \log(\exp(x_i^l W_l))\, y_i^T - \log(\exp(x_i^l W_l) \mathbf{1}^T) \big)
            = \sum_{i=1}^{N} \big( -x_i^l W_l\, y_i^T + \log(\exp(x_i^l W_l) \mathbf{1}^T) \big).   (12)

To calculate the partial derivative, the necessary formula about the element-wise product ⊙ and the all-ones vector 1 is listed as follows:

\mathbf{1}(u \odot v) = u^T v,   (13)

where u and v are column vectors with the same dimension, and 1 is a row vector whose dimension is the same as that of u and v.

Then, we take the differential of Eq. (12). The detailed process is listed as follows:

dR_{emp}(f_l) = \sum_{i=1}^{N} \Big( -d(x_i^l W_l)\, y_i^T + \frac{d(\exp(x_i^l W_l))\, \mathbf{1}^T}{\exp(x_i^l W_l) \mathbf{1}^T} \Big)
            = \sum_{i=1}^{N} \Big( -d(x_i^l W_l)\, y_i^T + \frac{(\exp(x_i^l W_l) \odot d(x_i^l W_l))\, \mathbf{1}^T}{\exp(x_i^l W_l) \mathbf{1}^T} \Big)
            = \sum_{i=1}^{N} \Big( -d(x_i^l W_l)\, y_i^T + d(x_i^l W_l)\, \frac{\exp(x_i^l W_l)^T}{\exp(x_i^l W_l) \mathbf{1}^T} \Big).   (14)

Since Remp(fl) is a scalar, the right-hand side of Eq. (14) is equal to the trace of itself. Then, we calculate the trace of the right-hand side of Eq. (14) as follows:

\mathrm{tr}\Big( \sum_{i=1}^{N} \Big( -d(x_i^l W_l)\, y_i^T + d(x_i^l W_l)\, \frac{\exp(x_i^l W_l)^T}{\exp(x_i^l W_l) \mathbf{1}^T} \Big) \Big)
 = \sum_{i=1}^{N} \mathrm{tr}\Big( d(x_i^l W_l) \Big( -y_i^T + \frac{\exp(x_i^l W_l)^T}{\exp(x_i^l W_l) \mathbf{1}^T} \Big) \Big)
 = \sum_{i=1}^{N} \mathrm{tr}\Big( \Big( -y_i^T + \frac{\exp(x_i^l W_l)^T}{\exp(x_i^l W_l) \mathbf{1}^T} \Big) d(x_i^l W_l) \Big)
 = \sum_{i=1}^{N} \mathrm{tr}\Big( \Big( -y_i^T + \frac{\exp(x_i^l W_l)^T}{\exp(x_i^l W_l) \mathbf{1}^T} \Big) x_i^l\, dW_l \Big).   (15)

According to the relationship between the matrix differential and the matrix derivative in Eq. (11), the partial derivative of Remp(fl) with respect to W_l can be calculated as follows:

\frac{\partial R_{emp}}{\partial W_l} = \sum_{i=1}^{N} \Big( \Big( -y_i^T + \frac{\exp(x_i^l W_l)^T}{\exp(x_i^l W_l) \mathbf{1}^T} \Big) x_i^l \Big)^T
 = \sum_{i=1}^{N} (x_i^l)^T \Big( \frac{\exp(x_i^l W_l)}{\exp(x_i^l W_l) \mathbf{1}^T} - y_i \Big).   (16)

2) For the regularization term RGl in Eq. (9), the partial derivative of RGl with respect to W_l can be calculated as follows:

dR_{G_l} = \frac{1}{N} \mathrm{tr}\big( G_l W_l\, d(W_l^T G_l^T) \big) = \frac{1}{N} \mathrm{tr}\big( (G_l^T G_l W_l)\, dW_l^T \big).   (17)

Then,

\frac{\partial R_{G_l}}{\partial W_l} = \frac{1}{N}\, G_l^T G_l W_l.   (18)

3) For the remaining regularization term RU regarding consistent outputs of samples in different kernel spaces, according to the formula of RU in Eq. (8), each kernel space has a strong connection with the other kernel spaces. To calculate the partial derivative of RU with respect to W_l, we must calculate the derivative in the lth kernel space and in the other kernel spaces, respectively. For the lth kernel space,

dR_U = \frac{1}{N} \mathrm{tr}\Big( \big( X_l W_l - \frac{1}{m}\sum_{j=1}^{m} X_j W_j \big)\, d\big( X_l W_l - \frac{1}{m}\sum_{j=1}^{m} X_j W_j \big)^T \Big)
 = \frac{1}{N} \mathrm{tr}\Big( X_l^T X_l W_l\, dW_l^T - \frac{1}{m} X_l^T X_l W_l\, dW_l^T - \frac{1}{m} \sum_{j=1}^{m} X_l^T X_j W_j\, dW_l^T + \frac{1}{m^2} \sum_{j=1}^{m} X_l^T X_j W_j\, dW_l^T \Big)
 = \frac{1}{N} \mathrm{tr}\Big( \frac{m-1}{m} X_l^T X_l W_l\, dW_l^T - \frac{m-1}{m^2} \sum_{j=1}^{m} X_l^T X_j W_j\, dW_l^T \Big).   (19)

For the other kernel spaces,

dR_U = \frac{1}{N} \mathrm{tr}\Big( \big( X_o W_o - \frac{1}{m}\sum_{j=1}^{m} X_j W_j \big)\, d\big( X_o W_o - \frac{1}{m}\sum_{j=1}^{m} X_j W_j \big)^T \Big)
 = \frac{1}{N} \mathrm{tr}\Big( -\frac{1}{m} X_l^T X_o W_o\, dW_l^T + \frac{1}{m^2} \sum_{j=1}^{m} X_l^T X_j W_j\, dW_l^T \Big).   (20)

Here, X_o and W_o represent the sample matrix and the weight matrix in another kernel space. It is worth noting that there are m − 1 other kernel spaces. Therefore, the derivatives in the other kernel spaces are accumulated together and the detailed process is listed as follows:

dR_U = \frac{1}{N} \mathrm{tr}\Big( -\frac{1}{m} \sum_{j=1, j\neq l}^{m} X_l^T X_j W_j\, dW_l^T + \frac{m-1}{m^2} \sum_{j=1}^{m} X_l^T X_j W_j\, dW_l^T \Big).   (21)

Then, the partial derivative of RU with respect to W_l can be calculated by accumulating the results in all kernel spaces. The detailed derivative is listed as follows:

dR_U = \frac{1}{N} \mathrm{tr}\Big( \Big( \frac{m-1}{m} X_l^T X_l W_l - \frac{1}{m} \sum_{j=1, j\neq l}^{m} X_l^T X_j W_j \Big) dW_l^T \Big).   (22)

Therefore, the partial derivative of RU with respect to W_l is:

\frac{\partial R_U}{\partial W_l} = \frac{1}{N} \Big( \frac{m-1}{m} X_l^T X_l W_l - \frac{1}{m} \sum_{j=1, j\neq l}^{m} X_l^T X_j W_j \Big).   (23)

4) Finally, by accumulating these derivatives, we can get the partial derivative of the loss function L with respect to W_l as follows:

\frac{\partial L}{\partial W_l} = \sum_{i=1}^{N} (x_i^l)^T \Big( \frac{\exp(x_i^l W_l)}{\exp(x_i^l W_l) \mathbf{1}^T} - y_i \Big) + \frac{c}{N}\, G_l^T G_l W_l + \frac{\lambda}{N} \Big( \frac{m-1}{m} X_l^T X_l W_l - \frac{1}{m} \sum_{j=1, j\neq l}^{m} X_l^T X_j W_j \Big).   (24)
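A compact NumPy sketch of Eq. (24) (our own illustration; the variable names follow the hypothetical `cgmkl_loss` helper above) computes the gradient for one kernel space l:

```python
import numpy as np

def cgmkl_grad(l, Xs, Gs, Ws, Y, c, lam):
    """Gradient of the CGMKL loss with respect to W_l, following Eq. (24)."""
    N, m = Y.shape[0], len(Xs)
    X_l, G_l, W_l = Xs[l], Gs[l], Ws[l]

    scores = X_l @ W_l
    expo = np.exp(scores - scores.max(axis=1, keepdims=True))
    prob = expo / expo.sum(axis=1, keepdims=True)             # row-wise softmax
    grad = X_l.T @ (prob - Y)                                  # Eq. (16)

    grad += (c / N) * (G_l.T @ G_l @ W_l)                      # Eq. (18)

    cross = sum(Xs[j] @ Ws[j] for j in range(m) if j != l)     # sum_{j != l} X_j W_j
    grad += (lam / N) * ((m - 1) / m * X_l.T @ (X_l @ W_l)
                         - X_l.T @ cross / m)                  # Eq. (23)
    return grad
```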

The RMSProp strategy is used to update W_l according to the partial derivative in Eq. (24). Different from plain gradient descent, a memory matrix VW_l, whose elements are initialized to 0, is used to record the previous gradients. For each iteration, VW_l is reassigned as:

VW_l^{new} = \alpha \times VW_l + \beta \times \Big( \frac{\partial L}{\partial W_l} \Big)^2.   (25)

Then, the new weight matrix W_l is updated in accordance with the element-wise product of the gradient of W_l and the memory matrix VW_l, and it is calculated as follows:

W_l^{new} = W_l - \eta\, \frac{1}{\sqrt{VW_l^{new} + \epsilon}} \odot \frac{\partial L}{\partial W_l},   (26)

where η is the learning rate. The pseudo code of the CGMKL training procedure is listed in Table 1.

Table 1
Training process of CGMKL.

Input: Training samples {x_i, y_i}_{i=1}^N, m candidate kernels {ker_l(x_i, x_j)}_{l=1}^m;
Output: The weight matrices W_l, l = 1, . . . , m;
1. Calculate the mapping function Φe_l (l = 1, . . . , m) of the m kernel spaces.
2. Obtain {x_i^l}_{i=1}^N, the sample matrix X_l, and the scatter matrix G_l (l = 1, . . . , m);
3. Initialize k = 0, η, α, β, ε, c, λ, the maximum number of iterations maxiters, W_l with a random normal distribution, and VW_l^k (l = 1, . . . , m);
4. Calculate the value of loss^k according to Eq. (10);
5. While k ≤ maxiters
6.   k = k + 1;
7.   For each kernel space l (l = 1, . . . , m),
8.     Calculate ∂L/∂W_l according to Eq. (24);
9.     Calculate VW_l^k = α × VW_l^{k−1} + β × (∂L/∂W_l)^2;
10.    Update W_l^k = W_l^{k−1} − η (1/√(VW_l^k + ε)) ⊙ ∂L/∂W_l;
11.  End for
12.  Calculate the value of loss^k according to Eq. (10);
13.  If |loss^k − loss^{k−1}| < 10^{−4}:
14.    Break;
15.  End if
16. End while

For a test sample x, if x^l = [Φe_l(x), 1] is its explicit form in the lth kernel space, the output of the sample is calculated as:

F(x) = \sum_{l=1}^{m} x^l W_l.   (27)

The output F(x) is a vector similar to the one-hot form in the multi-class classification problem. In the output vector F(x), the position with the maximum value corresponds to the class of the test sample.
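Putting Table 1 and Eqs. (25)–(27) together, a minimal training-and-prediction sketch (our own, reusing the hypothetical `cgmkl_loss` and `cgmkl_grad` helpers sketched above) could look as follows; all defaults are assumptions.

```python
import numpy as np

def train_cgmkl(Xs, Gs, Y, c, lam, eta=1.0, alpha=0.9, beta=0.1,
                eps=1e-8, max_iters=500, tol=1e-4, seed=0):
    """RMSProp-style training of the per-space weights W_l (Table 1, Eqs. (25)-(26))."""
    rng = np.random.default_rng(seed)
    k = Y.shape[1]
    Ws = [0.01 * rng.standard_normal((X.shape[1], k)) for X in Xs]
    Vs = [np.zeros_like(W) for W in Ws]                  # memory matrices VW_l
    prev = cgmkl_loss(Xs, Gs, Ws, Y, c, lam)
    for _ in range(max_iters):
        for l in range(len(Xs)):
            g = cgmkl_grad(l, Xs, Gs, Ws, Y, c, lam)     # Eq. (24)
            Vs[l] = alpha * Vs[l] + beta * g ** 2        # Eq. (25)
            Ws[l] -= eta * g / np.sqrt(Vs[l] + eps)      # Eq. (26)
        cur = cgmkl_loss(Xs, Gs, Ws, Y, c, lam)
        if abs(cur - prev) < tol:                        # stopping rule of Table 1
            break
        prev = cur
    return Ws

def predict_cgmkl(Xs_test, Ws):
    """Eq. (27): F(x) = sum_l x^l W_l, then take the position of the maximum value."""
    F = sum(X @ W for X, W in zip(Xs_test, Ws))          # Xs_test: padded test features
    return F.argmax(axis=1)
```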
4. Experiments

In this section, experiments are designed to investigate the effectiveness of the proposed CGMKL. This section consists of three major subsections. The first subsection describes the used data sets and algorithms. The second subsection presents the classification performance of the used algorithms on the used data sets. The last subsection discusses the parameters and the convergence of the proposed CGMKL.

4.1. Experimental setting

In the experiment, 22 multi-class data sets are selected to validate the effectiveness of the CGMKL. The description of these data sets is listed in Table 2. To validate the performance of the CGMKL, another 6 classical algorithms, including softmax [23], Back Propagation Neural Network (BPNN) [23], RandomForest [39], EasyMKL [31], SVM (OVO) and SVM (OVR) [19], are selected as comparison algorithms. As the softmax is the basic algorithm of CGMKL, it is selected as the baseline algorithm. The BPNN is used to reflect the difference between a hidden layer and the multi-kernel mapping. The RandomForest directly deals with multi-class classification from the perspective of a decision-tree ensemble. The EasyMKL is a classical multiple kernel learning method. The SVM is a famous algorithm in machine learning. Here, the SVM utilizes the OVO and OVR strategies to deal with the multi-class problems.

In the CGMKL, the RBF kernel is calculated as ker(x_i, x_j) = exp(−||x_i − x_j||²/(2σ²)), where σ is set to the average value of all the l2-norm distances ||x_i − x_j||_2, i, j = 1, . . . , N, and N is the number of samples. The number of kernels is set to 2. The value of σ is multiplied by {1, 2} to reflect the two kernels under different measuring scales. The learning rate η is set to 1. The parameters α, β and ε in the RMSProp strategy are set to 0.9, 0.1 and 10^{−8}, respectively. The parameters c and λ controlling the two regularization terms are selected from {0.01, 0.1, 1, 10, 100}. In the softmax and BPNN, the learning rate is selected from {0.0001, 0.001, 0.01, 0.1, 1, 10, 100}. To make the architecture of the BPNN deeper, the number of hidden layers is increased to 3. In each hidden layer, the number of hidden units is selected from {64, 128, 256}. The activation function of the BPNN is the sigmoid function. In the RandomForest, the number of trees is set to 100. In the EasyMKL, the parameter c is selected from {0.01, 0.1, 1, 10, 100} and λ is selected from {0.01, 0.1, 1}. In the SVM, the parameter c controlling the slack factors and the σ controlling the scale of the RBF kernel are selected from {0.01, 0.1, 1, 10, 100}. All of the algorithms adjust their parameters to achieve the highest results on the used data sets.

To report a statistical comparison, the Bayesian signed-rank test [40] is used to further compare the difference between two algorithms. The Bayesian signed-rank test can clearly reflect the difference between two algorithms in the form of numeric comparisons. Here, the parameter rope(r) in the Bayesian signed-rank test is set to 0.01. The prior parameters of the Dirichlet distribution are set to s = 0.6 and z_0 = 0. The number of Monte Carlo samples is set to 50000.

In the experiments, 5-fold cross-validation is used to validate the classification ability of all used algorithms on each used data set. In the 5-fold cross-validation, the original data set is split into 5 parts, where one is used for testing and the others for training. Then, the classification result of the algorithm can be obtained in accordance with the training and testing data. By repeating the process 5 times and averaging the results, the average result on the 5 folds is obtained. To make the comparison fair, we use the same partitions of the training and testing data when the different comparison algorithms are trained on a data set. Each algorithm adjusts its parameters to achieve the highest classification result on each data set.

4.2. Classification result

4.2.1. Performance on multi-class data sets

Table 3 presents the classification results of all used algorithms on the multi-class data sets. According to the table, the CGMKL achieves the best result on 9 out of 22 data sets, and the average Acc of the CGMKL reaches the highest value of 84.95%, which demonstrates that the CGMKL outperforms the other comparison algorithms. Compared to the softmax, the average Acc of the CGMKL is about 5.5% higher, so it can be concluded that the kernel mapping greatly improves the classification ability. Compared to the BPNN, the CGMKL also exhibits a superior classification result, which indicates that the empirical kernel mapping provides better classification ability than the hidden layer does. Compared to the multi-kernel algorithm EasyMKL, as the CGMKL has another two features, collaborative working and the geometric feature, the CGMKL exceeds the EasyMKL in the classification task. Compared to the RandomForest, the CGMKL also shows superior classification ability.

Table 2
Description of data sets.

Name Classes Dimensions Samples Name Classes Dimensions Samples

Iris 3 4 150 Cmc 3 9 1473


Hayesroth 3 4 160 Semeion 10 256 1593
Yale 15 1024 165 Segmentation 7 18 2310
Seeds 3 7 210 YaleB 38 1024 2414
JAFFE 10 1024 213 2k2k 10 784 4000
ORL 40 644 400 Wine_Quality_White 7 11 4898
Movement_libra 15 90 360 Optdigits 10 64 5620
Led7digit 10 7 500 Statlog 6 36 6435
Balance 3 4 625 Marketing 9 13 8993
Vehicle 4 18 846 USPS 10 256 9298
Coil_20 20 1024 1440 Penbased 10 16 10992

Table 3
Classification results of all used comparison algorithms on the used data sets (The best result on each data set is written in bold).

Data Set CGMKL softmax BPNN RandomForest EasyMKL SVM (OVO) SVM (OVR)
Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%)

Iris 98.00 ± 2.98 95.33 ± 3.80 97.33 ± 2.79 95.33 ± 3.80 97.33 ± 2.49 97.33 ± 2.79 95.33 ± 5.06
Hayesroth 83.21 ± 4.22 55.63 ± 6.98 81.45 ± 9.37 83.58 ± 6.53 77.47 ± 2.58 74.09 ± 7.03 60.32 ± 10.43
Yale 84.44 ± 7.76 85.56 ± 8.20 84.44 ± 10.57 74.44 ± 18.43 73.33 ± 15.63 77.78 ± 15.17 73.56 ± 14.02
Seeds 95.71 ± 5.16 92.86 ± 4.45 93.81 ± 2.71 90.48 ± 9.82 90.48 ± 12.05 92.38 ± 8.14 93.33 ± 8.49
JAFFE 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 98.12 ± 2.07 98.61 ± 1.92 98.62 ± 1.28 99.50 ± 1.12
ORL 98.00 ± 2.27 98.00 ± 1.90 97.75 ± 2.56 96.50 ± 1.37 95.25 ± 2.15 98.75 ± 1.25 92.25 ± 5.11
Movement_libra 72.83 ± 9.64 61.50 ± 15.05 70.83 ± 11.15 66.83 ± 12.62 72.87 ± 7.75 71.17 ± 12.41 63.67 ± 14.64
Led7digit 70.95 ± 2.20 74.52 ± 3.08 75.18 ± 3.21 70.85 ± 3.94 74.97 ± 2.35 75.11 ± 2.02 58.36 ± 2.49
Balance 97.61 ± 0.49 87.70 ± 1.98 95.81 ± 2.57 84.80 ± 1.06 92.80 ± 1.81 92.93 ± 2.26 85.21 ± 4.25
Vehicle 81.28 ± 2.82 68.33 ± 1.34 77.65 ± 2.21 75.16 ± 1.68 70.55 ± 2.22 83.06 ± 2.04 79.19 ± 1.17
Coil_20 93.45 ± 2.87 91.79 ± 4.71 93.38 ± 3.27 98.14 ± 1.74 87.45 ± 3.77 94.45 ± 2.40 91.74 ± 4.40
Cmc 56.15 ± 2.57 50.87 ± 2.35 52.71 ± 3.50 52.77 ± 3.03 49.29 ± 0.62 54.93 ± 0.89 47.67 ± 1.60
Semeion 94.00 ± 1.88 91.49 ± 2.09 92.05 ± 1.31 94.19 ± 1.14 85.94 ± 1.35 95.25 ± 1.53 88.93 ± 2.82
Segmentation 97.10 ± 0.90 88.18 ± 2.48 93.72 ± 0.80 98.01 ± 0.79 97.75 ± 0.76 96.67 ± 1.12 90.95 ± 5.12
YaleB 84.35 ± 19.95 81.07 ± 24.64 77.94 ± 26.78 89.31 ± 13.52 52.30 ± 21.04 83.71 ± 20.77 76.66 ± 29.57
2k2k 93.12 ± 2.18 87.15 ± 3.24 89.39 ± 2.40 91.17 ± 1.74 63.63 ± 1.95 88.56 ± 3.18 81.65 ± 3.08
Wine_Quality_White 51.86 ± 4.27 51.41 ± 2.00 52.21 ± 3.46 52.39 ± 3.41 40.16 ± 2.46 45.30 ± 0.51 44.35 ± 5.88
Optdigits 98.95 ± 0.23 95.63 ± 0.89 96.46 ± 0.84 97.49 ± 0.61 97.74 ± 0.73 98.26 ± 0.66 96.98 ± 1.03
Statlog 88.66 ± 0.47 80.70 ± 2.71 83.79 ± 1.58 89.49 ± 1.34 88.45 ± 0.98 87.89 ± 0.97 84.13 ± 2.81
Marketing 31.72 ± 1.71 31.31 ± 1.08 32.70 ± 0.63 31.49 ± 1.07 30.23 ± 1.84 28.43 ± 1.67 24.34 ± 1.20
USPS 97.80 ± 1.26 93.25 ± 1.78 94.62 ± 1.77 96.06 ± 1.29 92.28 ± 1.90 97.39 ± 1.53 95.95 ± 1.64
Penbased 99.62 ± 0.09 86.97 ± 2.46 98.03 ± 1.51 99.00 ± 0.42 97.36 ± 0.24 99.51 ± 0.19 98.49 ± 0.45
average Acc 84.95 ± 3.45 79.51 ± 4.42 83.24 ± 4.32 82.98 ± 4.15 78.47 ± 4.03 83.25 ± 4.08 78.30 ± 5.74

Compared to the SVM (OVO), although the SVM (OVO) shows good classification ability, it must convert the task into multiple binary classification problems to deal with the multi-class problems. Compared to the remaining SVM (OVR), as the SVM (OVR) has many disadvantages, the CGMKL obviously outperforms the SVM (OVR).

The Bayesian signed-rank test further validates the superior performance of the CGMKL. As is shown in Fig. 1, the first row of the heat-map indicates that the performance of the CGMKL is much better than that of the other comparison algorithms. Moreover, the first column of the heat-map shows that none of the comparison algorithms outperforms the CGMKL. In detail, the CGMKL performs better than the softmax, EasyMKL and SVM (OVR) with probability 100%, and better than the RandomForest and SVM (OVO) with probabilities 87% and 82%. Compared to the BPNN, the CGMKL is superior with probability 98%.

Fig. 1. Heat map of the Bayesian signed-rank test. The value on each grid reflects the probability that the algorithm listed in the row outperforms the algorithm listed in the column.

4.2.2. Performance on image data with varying percentage of training data

In the experiment, there are 6 image data sets with a high number of dimensions. To validate the performance of the CGMKL with different percentages of training samples, these data sets are randomly sampled 10 times with different percentages of training samples. The percentages are selected from {0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. To make the experiment fair, the selected training samples for each percentage in each sampling process are the same for all algorithms. Then, line charts are used to present the classification results with different percentages on the image data sets. The performance of all algorithms can be evaluated by the areas under the curves.
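A minimal sketch of this shared-sampling protocol (our assumption of how the repeated splits could be fixed; scikit-learn and stratification are used only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def shared_percentage_splits(y, percentages=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
                             repeats=10):
    """Same train/test index pairs for every compared algorithm."""
    idx = np.arange(len(y))
    splits = {}
    for p in percentages:
        splits[p] = [train_test_split(idx, train_size=p, random_state=rep,
                                      stratify=y) for rep in range(repeats)]
    return splits

# splits = shared_percentage_splits(y)
# train_idx, test_idx = splits[0.5][0]   # 50% training data, first repetition
```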

Fig. 2. Classification results of different comparison algorithms varying with the percentage of training data on Yale, JAFFE, ORL, Coil_20, YaleB and 2k2k.

Fig. 3. Average classification results of different comparison algorithms varying with the percentage of training data on Yale, JAFFE, ORL, Coil_20, YaleB and 2k2k.

As is shown in Fig. 2, it is found that the CGMKL achieves the best result on JAFFE and 2k2k. Except for JAFFE and 2k2k, the CGMKL provides relatively good classification results and achieves suboptimal results. In detail, the CGMKL outperforms the BPNN on YaleB and 2k2k, and is competitive with the BPNN on JAFFE, ORL and Coil_20, but the performance of the CGMKL is worse than that of the BPNN on Yale. Compared to the softmax and SVM (OVO), the CGMKL outperforms the two algorithms on 4 out of 5 data sets. Moreover, the CGMKL outperforms the RandomForest on 3 out of 5 data sets, and it outperforms the EasyMKL and SVM (OVR) on all data sets.

To further investigate the overall performance with a varying percentage of training samples, Fig. 3 presents the average performance with different percentages of training data on all used data sets. According to the figure, the performances of all algorithms generally increase with the percentage. Moreover, regardless of the percentage, the average performance of the CGMKL achieves the highest score. The result reveals that the CGMKL can obtain better performances on the image data sets with high dimensions. The better performance might be attributed to the rich expressions of the samples brought by the multiple kernels and the prior information brought by the two regularization terms.

Table 4
Classification results regarding different values of k on the used data sets (the best result on each data set is written in bold).

Data set k=1 k=2 k=3 k=4 k=5


Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%)

Iris 98.00 ± 1.83 98.00 ± 2.98 98.67 ± 1.83 98.67 ± 1.83 98.00 ± 4.47
Hayesroth 82.65 ± 4.80 83.21 ± 4.22 82.56 ± 8.10 82.47 ± 7.41 83.67 ± 8.14
Yale 80.00 ± 11.79 84.44 ± 7.76 86.00 ± 7.60 86.22 ± 6.17 86.44 ± 6.96
Seeds 93.81 ± 4.94 95.71 ± 4.45 95.24 ± 4.76 95.24 ± 5.32 95.24 ± 6.07
JAFFE 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00
ORL 97.50 ± 2.50 98.00 ± 2.27 97.25 ± 2.71 97.75 ± 1.85 97.50 ± 2.65
Movement_libra 72.33 ± 12.50 72.83 ± 9.64 74.17 ± 10.67 74.67 ± 10.17 74.33 ± 9.83
Led7digit 69.49 ± 2.36 70.95 ± 2.20 73.24 ± 3.83 72.79 ± 4.33 73.72 ± 2.53
Balance 96.63 ± 0.91 97.61 ± 0.49 97.44 ± 0.90 97.42 ± 0.93 97.25 ± 1.26
Vehicle 80.11 ± 2.24 81.28 ± 2.82 80.68 ± 2.97 81.31 ± 2.13 79.39 ± 2.21
Coil_20 94.01 ± 3.90 93.45 ± 2.87 93.13 ± 3.76 92.55 ± 3.60 91.73 ± 3.85
Cmc 52.62 ± 2.41 56.15 ± 2.57 55.60 ± 1.20 55.68 ± 1.87 55.94 ± 1.70
Semeion 93.88 ± 2.07 94.00 ± 1.88 93.77 ± 2.08 92.94 ± 1.99 92.68 ± 1.74
Segmentation 96.41 ± 1.30 97.10 ± 0.90 97.10 ± 0.95 96.84 ± 1.04 96.97 ± 1.02
YaleB 82.82 ± 21.76 84.35 ± 19.95 85.51 ± 19.07 85.31 ± 19.04 85.48 ± 19.17
2k2k 92.72 ± 2.42 93.12 ± 2.18 92.64 ± 2.78 92.26 ± 2.61 91.39 ± 3.08
Wine_Quality_White 43.02 ± 3.07 51.86 ± 4.27 50.43 ± 2.40 51.00 ± 3.59 49.59 ± 1.70
Optdigits 98.97 ± 0.29 98.95 ± 0.23 98.99 ± 0.36 98.92 ± 0.24 98.74 ± 0.33
Statlog 87.18 ± 0.90 88.66 ± 0.47 88.64 ± 0.45 88.53 ± 0.62 88.21 ± 0.50
Marketing 30.07 ± 1.75 31.72 ± 1.71 31.80 ± 1.04 31.79 ± 2.51 32.45 ± 1.25
USPS 97.88 ± 1.21 97.80 ± 1.26 97.73 ± 1.25 97.65 ± 1.35 97.57 ± 1.47
Penbased 99.60 ± 0.11 99.62 ± 0.09 99.58 ± 0.05 99.59 ± 0.09 99.54 ± 0.06
average Acc 83.62 ± 3.87 84.95 ± 3.45 85.01 ± 3.58 84.98 ± 3.58 84.81 ± 3.64

Fig. 4. Bayesian signed-rank test corresponding to Table 4. The value on each grid reflects the probability that the algorithm listed in the row outperforms the algorithm listed in the column.

4.3. Discussion

4.3.1. Number of the kernel functions

Generally, the kernel parameter σ significantly affects the classification performance. Different data sets require different values of σ to obtain the best classification result. In the experiment, a generally applicable measuring method, which can be used for different data sets, is adopted. The RBF kernel is calculated as ker(x_i, x_j) = exp(−||x_i − x_j||²/(2σ²)), where σ is set to the average value of all the l2-norm distances ||x_i − x_j||_2, i, j = 1, . . . , N, and N is the number of samples. To reflect kernels under different measuring scales, the value of σ is multiplied by a factor selected from {1, 2, 3, 4, 5}.

To study how the value of k affects the classification results, the parameter k, which denotes the number of kernel functions, is selected from {1, 2, 3, 4, 5}. When we use one kernel function, σ is multiplied by 1. When we use two kernel functions, σ is multiplied by 1 and 2 to provide the two kernel functions. When we use three kernel functions, σ is multiplied by 1, 2 and 3. Similarly, the rest can be done in the same manner. Then, the classification results with different values of k are listed in Table 4.

From the perspective of the average Acc in Table 4, it is found that the classification result of a single kernel is worse than that of multiple kernels. Although the CGMKL achieves the best average result when k is equal to 3, the advantage is not obvious. The heat-map of the Bayesian signed-rank test demonstrates similar results. According to Fig. 4, it is found that the classification abilities of the CGMKL with multiple kernels are competitive. The first column of the heat-map shows that the probability that the CGMKL with two kernels outperforms the single-kernel version is 24%; with three kernels it is 42%, with four kernels 48%, and with five kernels 50%. These results also reflect that the classification ability of the CGMKL increases with the number of kernels. However, it is worth noting that the requirement of computing resources also increases with the number of kernel functions. In fact, the computational cost of improving the classification ability by adding more kernels is too high. Therefore, two kernels are preferred in the CGMKL.
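To make the kernel construction described above concrete, the following sketch (our own illustration, not the authors' code; all names are assumptions) computes σ as the average pairwise l2 distance and builds one empirical kernel map per scale factor following Eqs. (3)–(4).

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """ker(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def average_pairwise_sigma(X):
    """sigma = average of all l2 distances ||x_i - x_j||_2 over the training set."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2).mean()

def build_empirical_maps(X_train, factors=(1, 2), tol=1e-10):
    """One empirical kernel map per scale factor, via K = Q Lambda Q^T (Eqs. (3)-(4))."""
    sigma = average_pairwise_sigma(X_train)
    maps = []
    for f in factors:
        K = rbf_kernel(X_train, X_train, f * sigma)
        eigval, eigvec = np.linalg.eigh(K)
        keep = eigval > tol                                   # r positive eigenvalues
        proj = eigvec[:, keep] / np.sqrt(eigval[keep])        # Q Lambda^{-1/2}
        maps.append(lambda X, s=f * sigma, P=proj:
                    rbf_kernel(X, X_train, s) @ P)            # Phi_e(x) of Eq. (4)
    return maps
```

Two factors, {1, 2}, reproduce the two-kernel setting preferred above; passing a longer factor list gives the k = 3, 4, 5 variants of Table 4.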
4.3.2. Parameters analysis

The proposed CGMKL has two parameters, c and λ. Therefore, these two parameters are combined to analyze the relationship between the classification performance and the parameters. Here, the data sets with a relatively high number of classes are selected, as the CGMKL mainly deals with multi-class classification problems. Concerning the parameter λ in Fig. 5, when the value of λ is relatively small, the change of the classification results is relatively stable as the value of c varies. When the value of λ increases, the classification results corresponding to different values of c exhibit fluctuations. This phenomenon can be seen obviously on the data sets Yale, JAFFE, ORL, Movement_libras and Wine_Quality_White. Moreover, the performance with a small λ is better than that with a large λ, which means that a moderate attention to the consistent outputs in different kernel spaces can improve the classification ability of the CGMKL, while excessive attention may be adverse to the classification ability.

Fig. 5. Classification results of varying c and λ.

When the value of c is relatively small, the change of the classification results is relatively unstable as the value of λ varies. This phenomenon can be seen obviously on the data sets Yale, ORL and Movement_libras. Conversely, when the value of c is relatively large, the change of the classification results is relatively stable as the value of λ varies. Generally, the performance with a small c is better than that with a large c, which means that excessive attention to the within-class distances may be adverse to the classification ability. In summary, Fig. 5 also reflects that the two regularization terms play an important role in the classification. For most data sets, setting both c and λ to approximately 0.1 is preferable in accordance with Fig. 5.

Fig. 6. Sub-figure (a) shows the original data set, and the other sub-figures present the visualized results with a varying parameter c.

4.3.3. Visualized results with varying parameters

Fig. 6 presents the visualized results with a varying parameter c. In this figure, sub-figure (a) shows the original data set with three classes. The samples in different classes are marked with different colors. As the data set has three classes, the outputs of the samples are three-dimensional. According to the sub-figures in Fig. 6, it is obvious that the within-class distance of the samples with the same class declines as the parameter c increases. These figures reflect the function of the regularization term RGl, which requires the outputs of samples to exhibit a small within-class distance.

Fig. 7. Sub-figure (a) shows the original data set, and the samples in different classes are marked with different colors. In the other sub-figures, the samples with different colors represent the outputs in different kernel spaces, and the visualized results with a varying parameter λ are presented.

Fig. 7 presents the visualized results with a varying parameter λ. In this figure, sub-figure (a) shows the original data set. In the other sub-figures, the samples with different colors represent the outputs of the samples in the different kernel spaces. As is shown in Fig. 7, it is obvious that the outputs of the samples in the different kernel spaces become more and more similar as the parameter λ increases. Therefore, these figures demonstrate that the regularization term RU can make the outputs of the samples in different kernel spaces consistent.

4.3.4. Convergence analysis

Fig. 8 shows the convergence of the proposed CGMKL. In the CGMKL, RMSProp is used to accelerate the convergence speed. Therefore, the convergence curve oscillates slightly. According to the figure, it is obvious that the value of the loss declines rapidly as the number of iterations increases in the early stage. Moreover, when the number of iterations reaches 100, the value of the loss tends to be stable. Therefore, it can be concluded that the CGMKL has a fast convergence speed.

Fig. 8. Convergence of the CGMKL on the used data sets.

5. Conclusion

This paper proposes an algorithm named CGMKL to deal with multi-class classification. In detail, this paper places the softmax function under the MEKL framework. In this manner, the softmax function can utilize the explicit features in the kernel space efficiently. To improve the collaborative working between different kernel spaces, one regularization term RU is designed to require consistent outputs of samples in different kernel spaces.

Moreover, to make the outputs of samples exhibit geometric classification features, a geometric projection regularization term RGl is designed to reduce the within-class distance of the outputs of samples in each kernel space.

The classification results demonstrate that the CGMKL exhibits superior classification ability to the other comparison algorithms. The classification results regarding the number of kernels show that using multiple kernels performs better than using a single kernel. However, too many kernels bring computational complexity and do not necessarily improve the classification results greatly. Therefore, two kernels are enough according to the results. The classification results regarding the collaborative working and the geometric projection indicate the effectiveness of the two regularization terms.

There are several advantages of the CGMKL. Firstly, the CGMKL illustrates that the empirical kernel can be introduced into the softmax function. Moreover, the learning in different kernel spaces can be combined in accordance with the requirement of consistent outputs. In this manner, the softmax functions in different kernel spaces can learn from each other. The consistent output also provides a way to establish a connection between different spaces. In addition, the regularization term regarding the geometric feature validates that, if the outputs of samples with the same class label are close to each other, the classification results can be improved. Therefore, this term can be introduced into other learning frameworks.

The proposed CGMKL also has some limitations. When the number of samples is huge, it is hard for the CGMKL to train multiple kernel functions due to the memory constraint. The reason is that the CGMKL uses all the training samples to calculate the kernel function, and the requirement of memory increases with the number of samples. A huge number of samples leads to a huge kernel matrix, which consumes a lot of computing resources to calculate the matrix decomposition. To improve the efficiency of the CGMKL, some data reduction and feature selection methods are required. Moreover, the CGMKL can only tackle samples in vector form. Other types of samples are required to be converted into the vector form.

Acknowledgment

This work is supported by the "Shuguang Program" supported by Shanghai Education Development Foundation and Shanghai Municipal Education Commission, the Natural Science Foundation of China under Grant no. 61672227, the Natural Science Foundation of China under Grant no. 61806078, the National Key R&D Program of China under Grant no. 2018YFC0910500, the Special Fund Project for Shanghai Informatization Development in Big Data under Grant no. 201901043, and the National Major Scientific and Technological Special Project for "Significant New Drugs Development" under Grant no. 2019ZX09201004.

References

[1] C. Hsu, C. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw., 2002.
[2] M. Lapin, M. Hein, B. Schiele, Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification, IEEE Trans. Pattern Anal. Mach. Intell. 40 (7) (2018) 1533–1554.
[3] A. Fernández-Baldera, J.M. Buenaposada, L. Baumela, BAdaCost: multi-class boosting with costs, Pattern Recogn. 79 (2018) 467–479.
[4] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[5] L. Ke, Y. Wu, N. Yu, P. Li, L. Yang, Hierarchical multi-class classification in multimodal spacecraft data using DNN and weighted support vector machine, Neurocomputing 259 (11) (2017) 55–65.
[6] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley and Sons, 2000.
[7] A. Rocha, S.K. Goldenstein, Multiclass from binary: expanding one-versus-all, one-versus-one and ECOC-based approaches, IEEE Trans. Neural Netw. Learn. Syst. 25 (2) (2017) 289–302.
[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes, Pattern Recogn. 44 (8) (2011) 1761–1776.
[9] J.H. Hong, S.-B. Cho, A probabilistic multi-class strategy of one-vs.-rest support vector machines for cancer classification, Neurocomputing 71 (16–18) (2008) 3275–3281.
[10] B. Krawczyk, M. Galar, M. Wozniak, H. Bustince, F. Herrera, Dynamic ensemble selection for multi-class classification with one-class classifiers, Pattern Recogn. 83 (2018) 34–51.
[11] H. He, Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications, Wiley-IEEE Press, 2013.
[12] A. Khatami, M. Babaie, A. Khosravi, H.R. Tizhoosh, S. Nahavandi, Parallel deep solutions for image retrieval from imbalanced medical imaging archives, Appl. Soft Comput. 63 (2018) 197–205.
[13] Z. Zhang, X. Luo, S. García, F. Herrera, Cost-sensitive back-propagation neural networks with binarization techniques in addressing multi-class problems and non-competent classifiers, Appl. Soft Comput. 56 (C) (2017) 357–367.
[14] S.H. Khan, M. Hayat, M. Bennamoun, F.A. Sohel, R. Togneri, Cost-sensitive learning of deep feature representations from imbalanced data, IEEE Trans. Neural Netw. Learn. Syst. PP (99) (2017) 1–15.
[15] W.W. Ng, J. Hu, D.S. Yeung, S. Yin, F. Roli, Diversified sensitivity-based undersampling for imbalance classification problems, IEEE Trans. Cybernet. 45 (11) (2017) 2402–2412.
[16] X. Yuan, L. Xie, M. Abouelenien, A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data, Pattern Recogn. 77 (2018) 160–172.
[17] E. Lughofer, O. Buchtala, Reliable all-pairs evolving fuzzy classifiers, IEEE Trans. Fuzzy Syst. 21 (4) (2013) 625–641.
[18] L. Zhou, Q. Wang, H. Fujita, One versus one multi-class classification fusion using optimizing decision directed acyclic graph for predicting listing status of companies, Inform. Fusion 36 (2016) 80–89.
[19] T. Wu, C. Lin, R. Weng, Probability estimates for multi-class classification by pairwise coupling, J. Mach. Learn. Res. 5 (4) (2004) 975–1005.
[20] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, Dynamic classifier selection for one-vs-one strategy: avoiding non-competent classifiers, Pattern Recogn. 46 (12) (2013) 3412–3424.
[21] I. Mendialdua, J.M. Martinez-Otzeta, I. Rodriguez-Rodriguez, T. Ruiz-Vazquez, B. Sierra, Dynamic selection of the best base classifier in one versus one, Knowl.-Based Syst. 85 (C) (2015) 298–306.
[22] C.K.I. Williams, D. Barber, Bayesian classification with Gaussian processes, IEEE Trans. Pattern Anal. Mach. Intell. 20 (12) (1998) 1342–1351.
[23] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, The MIT Press, 2016.
[24] K.R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. Neural Netw. 12 (2) (2001) 181.
[25] J. Shawe-Taylor, N. Cristianini, Kernel method for pattern analysis, J. Am. Stat. Assoc. 101 (476) (2004) 1730.
[26] H. Xiong, M.N.S. Swamy, M.O. Ahmad, Optimizing the kernel in the empirical feature space, IEEE Trans. Neural Netw. 16 (2) (2005) 460–474.
[27] F.R. Bach, Consistency of the group lasso and multiple kernel learning, J. Mach. Learn. Res. 9 (2) (2007) 1179–1225.
[28] M. Gönen, E. Alpaydin, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.
[29] S.S. Bucak, R. Jin, A.K. Jain, Multiple kernel learning for visual object recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 36 (7) (2014) 1354–1369.
[30] A. Rakotomamonjy, F.R. Bach, S. Canu, Y. Grandvalet, SimpleMKL, J. Mach. Learn. Res. 9 (3) (2008) 2491–2521.
[31] F. Aiolli, M. Donini, EasyMKL: a scalable multiple kernel learning algorithm, Neurocomputing 169 (2015) 215–224.
[32] N. Rabin, D. Fishelov, Multi-scale kernels for Nyström based extension schemes, Appl. Math. Comput. 319 (2018) 165–177.
[33] Q. Wang, G. Fu, L. Li, H. Wang, Y. Li, Data-dependent multiple kernel learning algorithm based on soft-grouping, Pattern Recogn. Lett. 112 (2018) 111–117.
[34] Y. Shi, F. Tillmann, D. Anneleen, T. LeonCharles, J.A. Suykens, B.D. Moor, Y. M., L2-norm multiple kernel learning and its application to biomedical data fusion, BMC Bioinformat. 11 (1) (2010) 309.
[35] D. Zhang, Y. Wang, L. Zhou, H. Yuan, D. Shen, Multimodal classification of Alzheimer's disease and mild cognitive impairment, Neuroimage 55 (3) (2011) 856–867.
[36] H. Xue, Y. Song, H. Xu, Multiple indefinite kernel learning for feature selection, in: Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 3210–3216.
[37] S. Sun, J. Shawe-Taylor, L. Mao, PAC-Bayes analysis of multi-view learning, Inform. Fusion 35 (2016) 117–131.
[38] J. Zhao, X. Xie, X. Xu, S. Sun, Multi-view learning overview: recent progress and new challenges, Inform. Fusion 38 (2017) 43–54.
[39] T.K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20 (8) (1998) 832–844.
[40] A. Benavoli, G. Corani, J. Demsar, Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis, J. Mach. Learn. Res. 77 (1) (2016) 1–36.

Zhe Wang received the B.Sc. and Ph.D. degrees from the Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2003 and 2008, respectively. He is now a full Professor in the Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China. His research interests include feature extraction, kernel-based methods, image processing, and pattern recognition. At present, he has more than 40 papers with the first or corresponding author published in famous international journals including IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Neural Networks and Learning, and Pattern Recognition, etc.

Zonghai Zhu received his B.Sc. degree from the Department of Information, Mechanical and Electrical Engineering, Shanghai Normal University, China, in 2010. Now, he is a candidate for the Ph.D. degree in the Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China. His research interests include pattern recognition and imbalanced problems.

Dongdong Li received the B.Sc. and Ph.D. degrees from the Department of Computer Science and Engineering, Zhejiang University, Hangzhou, China, in 2003 and 2008, respectively. She is now a full Assistant Professor in the Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China. Her research interests include speech processing, affective computing and pattern recognition. At present, she has more than 15 papers with the first or corresponding author published in famous international journals and conferences.
