Pattern Recognition: Zhe Wang, Zonghai Zhu, Dongdong Li
Pattern Recognition 99 (2020) 107050
journal homepage: www.elsevier.com/locate/patcog
Article history: Received 16 April 2019; Revised 31 July 2019; Accepted 12 September 2019; Available online 14 September 2019

Keywords: Multi-class classification; Empirical kernel mapping; Multiple empirical kernel learning; Regularized learning

Abstract

Multi-class classification is the problem of classifying a sample into one of three or more classes. In this paper, we propose an algorithm named collaborative and geometric multi-kernel learning (CGMKL) to classify multi-class data into the corresponding class directly. The CGMKL uses Multiple Empirical Kernel Learning (MEKL) to map the samples into multiple kernel spaces, and then trains the softmax function in each kernel space. To realize the collaborative learning, one regularization term, which enforces consistent outputs of samples in the different kernel spaces, provides complementary information. Moreover, another regularization term equips the classification result with a geometric feature by reducing the within-class distance of the outputs of samples. Extensive experiments on multi-class data sets validate the effectiveness of the CGMKL.

© 2019 Elsevier Ltd. All rights reserved.
∗ Corresponding authors.
E-mail addresses: wangzhe@ecust.edu.cn (Z. Wang), 13564251556@163.com (Z. Zhu).
https://doi.org/10.1016/j.patcog.2019.107050

1. Introduction

In the field of machine learning, the classification task may face the multi-class classification problem [1–3], where the data set has more than two classes. In such a case, the designed classifier is required to classify the sample into one of three or more classes. Generally, traditional classifiers, such as the Support Vector Machine (SVM) [4,5] and Logistic Regression (LR) [6], are designed to deal with binary-classification problems. When they are applied to multi-class classification problems, they must convert the multi-class problem into several binary-class problems [7]. The most common strategies to deal with the multi-class classification problem are One-Versus-Rest (OVR) and One-Versus-One (OVO) [8].

The OVR [9] constructs one classifier per class, which is trained to distinguish the samples in the single class from the samples in all remaining classes. For a data set with k classes, the OVR constructs k binary classifiers to classify all samples in the data set. The major problem in the OVR is that the number of samples in the single class may be far less than that in all remaining classes, which causes the imbalanced problem [10]. In the imbalanced problem [11,12], the classifiers tend to overly focus on the samples in all remaining classes, thereby misclassifying the samples in the single class. To deal with the imbalanced problem, additional technologies such as cost-sensitive learning [13,14] and resampling [15,16] are required to balance the misclassification cost or data distribution. Moreover, in the case of a high number of classes, the decision boundaries may get overly complex [17].

The OVO [18] is usually considered to be more accurate than the OVR. It creates one classifier between any two classes. Therefore, the OVO creates simpler problems with fewer samples and does not cause the imbalanced problem as the OVR does. However, there are some weaknesses in terms of reproducibility of decision boundaries as well as computational complexity as the number of classes increases [19]. For a data set with k classes, the OVO constructs k(k−1)/2 binary classifiers, and then places the test samples into all binary classifiers to provide the voting results. Moreover, the OVO faces the non-competence problem [20,21], as it assigns the sample to all of the binary classifiers, though some of the classifiers are not meaningful.

As opposed to the OVO and OVR, which are required to calculate several binary classifiers, the softmax function [22,23] is a multi-class algorithm optimized by minimizing a unified negative log-likelihood of the training data. Suppose the data set has k classes. In probability theory, the output of the softmax function naturally represents a categorical distribution, which is a probability distribution on k possible outcomes. Then, the outcome with the maximum probability is the corresponding class of the sample. The advantage of the softmax function is that it is easy to solve for the weight vector by using gradient descent. Moreover, for an input sample, the softmax can directly provide the probability belonging
to each class. In this manner, the softmax function avoids the imbalanced problem caused by the OVR, and the computational complexity and non-competence problem caused by the OVO.

Although the softmax function can deal with the multi-class classification problem, it is hard for the softmax function to tackle the data in the original space well, especially when the boundary of the data distribution is non-linear. To deal with this problem, the kernel method [24,25] is used to map the original data into the kernel space, thus dealing with data that have a nonlinear distribution. The kernel method can be categorized into two types, including Implicit Kernel Mapping (IKM) and Empirical Kernel Mapping (EKM) [26]. The IKM constructs nonlinear relations of the input data in an implicit feature expression. Generally, the IKM deals with the kernel function in the manner of $x_i \cdot x_j$ $(i, j = 1, 2, \ldots, N)$. However, this kind of manner does not exist in the softmax function. Different from the IKM, the EKM enriches the expression mode of the sample by mapping the original sample $x$ into the kernel space with an explicit form $\Phi_e(x)$ in accordance with the mapping function $\Phi_e$. As $\Phi_e(x)$ provides the detailed value of each dimension for the sample $x$ in the kernel space, the EKM can be embedded into the softmax function naturally.

Although the EKM provides the explicit feature expression in the kernel space, a single kernel may fail to fully excavate and utilize the relationship of samples. To further enrich the expressions of the sample and utilize the classification ability of different kernels, the Multiple Kernel Learning (MKL) framework [27–29] was proposed and demonstrated to have superior classification ability. Numerous studies have continuously advanced the development of the MKL. For example, SimpleMKL [30] introduced an adaptive l2 norm into the MKL. The GLMKL used the grouped lasso to construct the connection of kernels, thus ensuring hierarchy and sparsity. The EasyMKL [31] combined the kernels derived from multiple sources in a data-driven way to enhance the accuracy. Recently, the Nystrom [32] and data-dependent [33] methods have been widely used to reduce the complexity and learn the optimal kernel. Beyond these improvements, the MKL is widely used in many applications such as biomedical applications [34] and disease prediction [35]. Besides, the MKL can also be adapted for feature selection [36]. Obviously, the MKL can tackle complex situations and improve classification ability [37,38]. Owing to the effectiveness of the MKL, this paper combines the softmax with Multiple Empirical Kernel Learning (MEKL). The softmax function is learned in each explicit kernel space.

However, combining the softmax function and the MEKL framework still leaves two main problems to be solved. The first problem is how to make the softmax functions work collaboratively between different kernel spaces. The second problem is how to control the output trend of the data to help improve the classification ability in each kernel space. To this end, this paper designs and introduces one regularization term RU, which requires consistent outputs of samples in different kernel spaces, to provide complementary information from the different kernel spaces, thereby realizing the collaborative working. Moreover, this paper designs another regularization term RG, which requires the outputs of samples to have small within-class distances in each kernel space, to make the output trend of samples suit the geometric feature of the classification task. Generally, the samples with the same class label are expected to be close to each other after they are projected on the solution vector. Then, two parameters are used to control the importance of the two regularization terms RU and RG.

As a result, this paper proposes collaborative and geometric multi-kernel learning (CGMKL) for multi-class classification. The CGMKL combines the softmax function and MEKL to tackle multi-class classification. Moreover, the CGMKL introduces two regularization terms to improve the classification ability further. The contributions of this paper are given as follows:

• CGMKL realizes the multi-class classification under the MEKL framework through combining the softmax function and MEKL. By doing so, the MEKL enriches the expressions of the sample and greatly improves the classification ability of the softmax function.

• CGMKL offers the complementary information between different kernel spaces by introducing a regularization term RU, which keeps the outputs of samples consistent in different kernel spaces. By doing so, classifiers in different kernel spaces can learn from each other and keep working collaboratively.

• CGMKL makes the output trend of the data suit classification through introducing a regularization term RG, which reduces the within-class distance of the outputs of samples. By doing so, the classification result exhibits a geometric feature.

The remainder of this paper is organized as follows. Section 2 presents a brief introduction of the multi-class classification and the MEKL. The detailed description of the proposed CGMKL is illustrated in Section 3. The experimental results are reported and discussed in Section 4. Finally, the conclusions are provided in Section 5.

2. Related work

2.1. Multi-class classification

To deal with the multi-class classification problem, the OVO and OVR strategies are the two major methods. However, these two kinds of methods have deficiencies in classification. Therefore, this paper uses the softmax function to directly classify the samples into their corresponding class. In the multi-class classification, suppose the training set contains N samples and is written as $\{x_1, \phi_1\}, \{x_2, \phi_2\}, \ldots, \{x_N, \phi_N\}$, where $x_i \in \mathbb{R}^{1 \times d}$ and the label $\phi_i \in \{1, 2, \ldots, k\}$. Then, the loss function of the softmax is written as follows:

$$J = -\sum_{i=1}^{N} \sum_{j=1}^{k} I(\phi_i = j) \log \frac{e^{x_i w_j + b_j}}{\sum_{j'=1}^{k} e^{x_i w_{j'} + b_{j'}}}, \qquad (1)$$

where $I(\phi_i = j)$ is a boolean function. If the label $\phi_i$ is equal to $j$, $I(\phi_i = j)$ returns 1; otherwise, it returns 0. The $w_j$ and $b_j$ are the weight vector and threshold corresponding to the jth class. The underlying implication of the loss function J is that, if $x_i$ belongs to the jth class, the value of $x_i w_j + b_j$ is required to be as large as possible, thereby improving the probability that the sample $x_i$ belongs to the jth class.

As the aforementioned formula is relatively elaborate, its derivative may be complex. Therefore, we write the matrix formula of the softmax function as follows:

$$J = -\sum_{i=1}^{N} \log \frac{\exp(x_i W)\, y_i^T}{\exp(x_i W)\, \mathbf{1}^T}. \qquad (2)$$

In the above formula, $x_i = [x_i, 1]$ pads a value 1 to match the corresponding threshold. The $W \in \mathbb{R}^{(d+1) \times k}$ concatenates the weight vectors and the thresholds of all classes. The $\mathbf{1} \in \mathbb{R}^{1 \times k}$ is a row vector whose elements are all equal to 1. The $y_i$ is the one-hot vector representation of the label, and $\exp(x_i W)$ represents the element-wise $e^{x_i W}$.

2.2. MEKL

This paper utilizes the kernel method to map the sample from the original space into the kernel space. By doing so, the nonlinear classification can be converted into a linear classification problem in the kernel space. Generally, the kernel method can be divided into two types, including IKM and EKM. As the name suggests, the IKM maps the sample into the kernel space in an implicit
way. Different from the IKM, the EKM maps the original sample into the kernel space and provides the detailed value of each dimension for a sample in the kernel space. Suppose the training set contains N samples defined as $\{(x_i, \phi_i)\}_{i=1}^{N}$, where $\phi_i$ is the label of the sample. The symmetric positive semi-definite kernel matrix is defined as $K = [ker_{ij}]_{N \times N}$ with $ker_{ij} = \Phi(x_i) \cdot \Phi(x_j) = ker(x_i, x_j)$. Suppose the rank of K is equal to r; then the kernel matrix K is decomposed into:

$$K = Q_{N \times r} \Lambda_{r \times r} Q_{N \times r}^T, \qquad (3)$$

where $\Lambda_{r \times r}$ is a diagonal matrix whose elements are the r positive eigenvalues of K, and $Q_{N \times r}$ contains the eigenvectors corresponding to the r eigenvalues. To reflect the visualized form in the kernel space, the mapping function is defined as $\Phi_e$ ($\Phi_e: I \to F$). For an input sample x in the original feature space I, x can be mapped into the kernel space F by $\Phi_e$, and $\Phi_e(x)$ is calculated as:

$$\Phi_e(x) = [ker(x, x_1), ker(x, x_2), \ldots, ker(x, x_N)]\, Q \Lambda^{-1/2}. \qquad (4)$$

It is known that different kernels play different roles in different scenarios. Therefore, combining multiple kernels is an effective method to improve the generalization ability in classification and regression tasks. In this paper, m kernel matrices and corresponding mapping functions are used. For the training set, the expressions in the m kernel spaces are $\{\Phi_{e1}(x_i), \ldots, \Phi_{el}(x_i), \ldots, \Phi_{em}(x_i)\}_{i=1}^{N}$.

3. Proposed CGMKL

In this section, the proposed CGMKL is introduced. Firstly, how to combine the softmax function and MEKL is introduced. Moreover, the regularization terms RU and RGl are described. Then, the solution process and the pseudo code of the CGMKL are presented.

3.1. Learning framework of CGMKL

The proposed CGMKL maps the samples into m empirical kernel spaces in accordance with the rule of MEKL. Then, the CGMKL trains the softmax function to directly classify the samples into their corresponding class in each kernel space. To make the learning process in each kernel space collaborative, a regularization term RU is introduced to enforce consistent outputs of samples in different kernel spaces. Moreover, another regularization term RGl, l = 1, ..., m is introduced to make the outputs of samples exhibit a geometric feature. The entire framework of the CGMKL is written as follows:

$$L = \sum_{l=1}^{m} [R_{emp}(f_l) + c R_{Gl}] + \lambda R_U, \qquad (5)$$

where the parameters c and λ control the importance of the two regularization terms. In order to get the complementary information in different kernel spaces, the regularization term RU is designed to require consistent outputs of samples in different kernel spaces. Through introducing RU, the softmax functions in different kernel spaces work collaboratively with each other. The detailed formula of RU is written as:

$$R_U = \frac{1}{2N} \sum_{l=1}^{m} \sum_{i=1}^{N} \left( x_i^l W_l - \frac{1}{m} \sum_{j=1}^{m} x_i^j W_j \right) \left( x_i^l W_l - \frac{1}{m} \sum_{j=1}^{m} x_i^j W_j \right)^T. \qquad (7)$$

Then, we use $X_l$ to represent the matrix form of the sample set $\{x_i^l\}_{i=1}^{N}$ in the lth kernel space. Each row of $X_l$ corresponds to one sample of $\{x_i^l\}_{i=1}^{N}$. In this manner, the regularization term RU can be rewritten as follows:

$$R_U = \frac{1}{2N} \sum_{l=1}^{m} \mathrm{tr}\!\left( \left( X_l W_l - \frac{1}{m} \sum_{j=1}^{m} X_j W_j \right)^T \left( X_l W_l - \frac{1}{m} \sum_{j=1}^{m} X_j W_j \right) \right), \qquad (8)$$

where tr is the trace of the square matrix.

Generally, for a binary-classification problem, the outputs of samples are expected to exhibit a geometric feature in which the outputs of samples have a small within-class distance and a large between-class distance. To transfer this geometric feature to multi-class classification, the between-class distance is not considered, as this distance may be hard to work with in multiple classes. Therefore, a regularization term RGl is designed to reduce the within-class distance of samples after they are projected on the solution vector. The detailed formula of RGl is written as:

$$R_{Gl} = \frac{1}{2N} \mathrm{tr}(G_l W_l W_l^T G_l^T), \qquad (9)$$

where Gl is the scatter matrix in the lth kernel space. In practice, we first calculate the mean value of the samples in each class. Next, the samples in each class subtract their corresponding class mean value, and then they are mapped into the kernel space by using the mapping function $\Phi_{el}$. To match the dimensions of Wl, these samples are padded with one threshold whose value is set to 1. Finally, Gl is formed by vertically stacking these samples.

As a result, the detailed loss function of the CGMKL is written as follows:

$$L = -\sum_{l=1}^{m} \sum_{i=1}^{N} \log \frac{\exp(x_i^l W_l)\, y_i^T}{\exp(x_i^l W_l)\, \mathbf{1}^T} + c \sum_{l=1}^{m} \frac{1}{2N} \mathrm{tr}(G_l W_l W_l^T G_l^T) + \lambda \frac{1}{2N} \sum_{l=1}^{m} \mathrm{tr}\!\left( \left( X_l W_l - \frac{1}{m} \sum_{j=1}^{m} X_j W_j \right)^T \left( X_l W_l - \frac{1}{m} \sum_{j=1}^{m} X_j W_j \right) \right). \qquad (10)$$
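As an illustrative sketch (our own code, not the authors' released implementation), the empirical kernel mapping of Eqs. (3) and (4) can be realized by an eigendecomposition of the training kernel matrix; the function names and the rank tolerance `tol` are assumptions:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)), the kernel used in Section 4.1.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def empirical_kernel_projector(K, tol=1e-10):
    """Eq. (3): K = Q Lambda Q^T; returns P = Q_r Lambda_r^{-1/2} so that
    Eq. (4) becomes Phi_e(x) = [ker(x, x_1), ..., ker(x, x_N)] @ P."""
    eigvals, Q = np.linalg.eigh(K)              # symmetric PSD eigendecomposition
    keep = eigvals > tol                        # keep the r positive eigenvalues
    return Q[:, keep] / np.sqrt(eigvals[keep])  # Q_{N x r} Lambda_{r x r}^{-1/2}

# Map the training samples themselves into the empirical kernel space.
X = np.random.RandomState(0).randn(20, 3)
K = rbf_kernel_matrix(X, sigma=1.0)
P = empirical_kernel_projector(K)
Phi = K @ P                                     # N x r explicit feature matrix
```

A useful sanity check on this construction is that the mapped training features reproduce the kernel matrix, i.e., `Phi @ Phi.T` equals `K` up to numerical error, which is exactly what makes the explicit form usable inside the softmax function.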
1) For the empirical loss term $R_{emp}(f_l)$, the partial derivative with respect to Wl can be calculated as follows:

$$\frac{\partial R_{emp}}{\partial W_l} = \sum_{i=1}^{N} (x_i^l)^T \left( \frac{\exp(x_i^l W_l)}{\exp(x_i^l W_l)\, \mathbf{1}^T} - y_i \right). \qquad (16)$$

2) For the regularization term RGl in Eq. (9), the partial derivative of RGl with respect to Wl can be calculated as follows:

$$dR_{Gl} = \frac{1}{N} \mathrm{tr}\!\left( G_l W_l \, d(W_l^T G_l^T) \right) = \frac{1}{N} \mathrm{tr}\!\left( (G_l^T G_l W_l)\, dW_l^T \right). \qquad (17)$$

3) For the regularization term RU in Eq. (8), the partial derivative of RU with respect to Wl can be calculated as follows:

$$\frac{\partial R_U}{\partial W_l} = \frac{m-1}{Nm} \left( X_l^T X_l W_l - \frac{1}{m-1} \sum_{j=1, j \neq l}^{m} X_l^T X_j W_j \right). \qquad (23)$$

4) Finally, by accumulating these derivatives, we can get the partial derivative of the loss function L with respect to Wl as follows:

$$\frac{\partial L}{\partial W_l} = \sum_{i=1}^{N} (x_i^l)^T \left( \frac{\exp(x_i^l W_l)}{\exp(x_i^l W_l)\, \mathbf{1}^T} - y_i \right) + \frac{c}{N} G_l^T G_l W_l + \frac{\lambda(m-1)}{Nm} \left( X_l^T X_l W_l - \frac{1}{m-1} \sum_{j=1, j \neq l}^{m} X_l^T X_j W_j \right). \qquad (24)$$

The RMSProp strategy is used to update Wl according to the partial derivative. Different from the gradient descent, RMSProp keeps one memory variable $V_{W_l}$ that accumulates the squared partial derivatives and adaptively scales the update step (see steps 9 and 10 of Table 1).
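To make the update concrete, here is a hedged sketch (our illustration, not the paper's code) of the per-space gradient of Eq. (24) and the RMSProp step; the variable names, hyperparameter defaults, and the stand-in scatter matrix are assumptions. The consistency gradient uses the algebraically equivalent form $(\lambda/N)\,X_l^T(X_l W_l - \frac{1}{m}\sum_j X_j W_j)$ of Eq. (23):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)        # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def grad_wrt_Wl(Xs, Ws, Gl, Y, l, c, lam):
    """Eq. (24): softmax term (Eq. (16)) + (c/N) Gl^T Gl Wl (from Eq. (17))
    + (lam/N) Xl^T (Xl Wl - mean_j Xj Wj) (equivalent to Eq. (23))."""
    N = Y.shape[0]
    m = len(Xs)
    P = softmax_rows(Xs[l] @ Ws[l])             # row-wise exp(xW) / (exp(xW) 1^T)
    g_emp = Xs[l].T @ (P - Y)
    g_geo = (c / N) * (Gl.T @ Gl @ Ws[l])
    mean_out = sum(Xs[j] @ Ws[j] for j in range(m)) / m
    g_con = (lam / N) * (Xs[l].T @ (Xs[l] @ Ws[l] - mean_out))
    return g_emp + g_geo + g_con

def rmsprop_step(W, V, g, eta=1.0, alpha=0.9, beta=0.1, eps=1e-8):
    # Steps 9-10 of Table 1 (the alpha, beta, eps values here are placeholders).
    V = alpha * V + beta * g ** 2
    return W - eta * g / np.sqrt(V + eps), V

# Tiny usage with random data: two kernel spaces, three classes.
rng = np.random.default_rng(0)
N, r, k, m = 8, 4, 3, 2
Xs = [np.hstack([rng.standard_normal((N, r)), np.ones((N, 1))]) for _ in range(m)]
Ws = [rng.standard_normal((r + 1, k)) for _ in range(m)]
Y = np.eye(k)[rng.integers(0, k, N)]
Gl = Xs[0] - Xs[0].mean(axis=0)                 # crude stand-in for the class-centered scatter
g = grad_wrt_Wl(Xs, Ws, Gl, Y, l=0, c=0.1, lam=0.1)
W_new, V_new = rmsprop_step(Ws[0], np.zeros_like(Ws[0]), g)
```

One design point worth noting: when all kernel spaces already produce identical outputs, the consistency gradient vanishes, so RU only pushes the per-space softmax functions toward agreement rather than toward any fixed target.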
Table 1
Training process of CGMKL.

Input: Training samples $\{x_i, y_i\}_{i=1}^{N}$, m candidate kernels $\{ker_l(x_i, x_j)\}_{l=1}^{m}$;
Output: The weight matrices $W_l$, l = 1, ..., m;
1. Calculate the mapping function $\Phi_{el}$ (l = 1, ..., m) of the m kernel spaces;
2. Obtain $\{x_i^l\}_{i=1}^{N}$, the sample matrix $X_l$, and the scatter matrix $G_l$ (l = 1, ..., m);
3. Initialize k = 0, η, α, β, ε, c, λ, the maximum number of iterations maxiters, $W_l$ with a random normal distribution, and $V_{W_l}^{k}$ (l = 1, ..., m);
4. Calculate the value of $loss^{k}$ according to Eq. (10);
5. While k ≤ maxiters
6.   k = k + 1;
7.   For each kernel space l (l = 1, ..., m),
8.     Calculate $\partial L / \partial W_l$ according to Eq. (24);
9.     Calculate $V_{W_l}^{k} = \alpha \times V_{W_l}^{k-1} + \beta \times (\partial L / \partial W_l)^2$;
10.    Update $W_l^{k} = W_l^{k-1} - \eta \left( 1 / \sqrt{V_{W_l}^{k} + \varepsilon} \right) \partial L / \partial W_l$;

4. Experiments

In this section, experiments are designed to investigate the effectiveness of the proposed CGMKL. This section consists of three major subsections. The first subsection describes the used data sets and algorithms. The second subsection presents the classification performance of the used algorithms on the used data sets. The last subsection discusses the parameters and the convergence of the proposed CGMKL.

4.1. Experimental setting

In the experiment, 22 multi-class data sets are selected to validate the effectiveness of the CGMKL. The description of these data sets is listed in Table 2. To validate the performance of the CGMKL, another 6 classical algorithms, including softmax [23], Back Propagation Neural Network (BPNN) [23], RandomForest [39], EasyMKL [31], SVM (OVO) and SVM (OVR) [19], are selected as comparison algorithms. As the softmax is the basic algorithm of CGMKL, it is selected as the baseline algorithm. The BPNN is used to reflect the difference between the hidden layer and the multi-kernel mapping. The RandomForest directly deals with multi-class classification from the perspective of a decision-tree ensemble. The EasyMKL is a classical multiple kernel learning method. The SVM is a famous algorithm in machine learning. Here, the SVM utilizes the OVO and OVR strategies to deal with the multi-class problems.

In the CGMKL, the 'RBF' kernel is calculated as $ker(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, where σ is set to the average value of all the l2-norm distances $\|x_i - x_j\|_2$, i, j = 1, ..., N, and N is the number of the samples. The number of kernels is set to 2. The value of σ is multiplied by {1, 2} to reflect the two kernels under different measuring scales. The learning rate η is set to 1. The parameters α, β and ε in the RMSProp strategy are set to …

4.2. Classification result

4.2.1. Performance on multi-class data sets

Table 3 presents the classification results of all used algorithms on the multi-class data sets. According to the table, the CGMKL gets the best results on 9 out of 22 data sets, and the average Acc of the CGMKL achieves the highest value, 84.95%, which demonstrates that the CGMKL outperforms the other comparison algorithms. Compared to the softmax, the average Acc of the CGMKL is about 5.5% higher than that of the softmax. Therefore, it is concluded that the kernel function greatly improves the classification ability. Compared to the BPNN, the CGMKL also exhibits a superior classification result, which indicates that the empirical kernel mapping provides better classification ability than the hidden layer does. Compared to the multi-kernel algorithm EasyMKL, as the CGMKL has another two features, including the collaborative working and the geometric feature, the CGMKL exceeds the EasyMKL in the classification task. Compared to the RandomForest, the CGMKL also achieves a higher average Acc (84.95% versus 82.98%).
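The multi-scale RBF setting described in Section 4.1 (σ set to the average pairwise l2 distance, then multiplied by {1, 2}) can be sketched as follows; this is our illustrative reading, and whether the average includes the zero i = j distances is an assumption:

```python
import numpy as np

def multi_scale_sigmas(X, scales=(1, 2)):
    # Average of ||x_i - x_j||_2 over all ordered pairs i, j = 1, ..., N
    # (including the zero diagonal, by assumption), scaled by each factor.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    base = np.sqrt(d2).mean()
    return [s * base for s in scales]

# Two points at distance 5: pairwise distances {0, 5, 5, 0}, mean 2.5.
sigmas = multi_scale_sigmas(np.array([[0.0, 0.0], [3.0, 4.0]]))
print(sigmas)   # [2.5, 5.0]
```

Each resulting σ then parameterizes one of the two empirical kernel spaces, so the two kernels view the data at different measuring scales.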
Table 2
Description of data sets.
Table 3
Classification results of all used comparison algorithms on the used data sets (The best result on each data set is written in bold).
Data Set CGMKL softmax BPNN RandomForest EasyMKL SVM (OVO) SVM (OVR)
Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%) Acc ± std (%)
Iris 98.00 ± 2.98 95.33 ± 3.80 97.33 ± 2.79 95.33 ± 3.80 97.33 ± 2.49 97.33 ± 2.79 95.33 ± 5.06
Hayesroth 83.21 ± 4.22 55.63 ± 6.98 81.45 ± 9.37 83.58 ± 6.53 77.47 ± 2.58 74.09 ± 7.03 60.32 ± 10.43
Yale 84.44 ± 7.76 85.56 ± 8.20 84.44 ± 10.57 74.44 ± 18.43 73.33 ± 15.63 77.78 ± 15.17 73.56 ± 14.02
Seeds 95.71 ± 5.16 92.86 ± 4.45 93.81 ± 2.71 90.48 ± 9.82 90.48 ± 12.05 92.38 ± 8.14 93.33 ± 8.49
JAFFE 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 98.12 ± 2.07 98.61 ± 1.92 98.62 ± 1.28 99.50 ± 1.12
ORL 98.00 ± 2.27 98.00 ± 1.90 97.75 ± 2.56 96.50 ± 1.37 95.25 ± 2.15 98.75 ± 1.25 92.25 ± 5.11
Movement_libra 72.83 ± 9.64 61.50 ± 15.05 70.83 ± 11.15 66.83 ± 12.62 72.87 ± 7.75 71.17 ± 12.41 63.67 ± 14.64
Led7digit 70.95 ± 2.20 74.52 ± 3.08 75.18 ± 3.21 70.85 ± 3.94 74.97 ± 2.35 75.11 ± 2.02 58.36 ± 2.49
Balance 97.61 ± 0.49 87.70 ± 1.98 95.81 ± 2.57 84.80 ± 1.06 92.80 ± 1.81 92.93 ± 2.26 85.21 ± 4.25
Vehicle 81.28 ± 2.82 68.33 ± 1.34 77.65 ± 2.21 75.16 ± 1.68 70.55 ± 2.22 83.06 ± 2.04 79.19 ± 1.17
Coil_20 93.45 ± 2.87 91.79 ± 4.71 93.38 ± 3.27 98.14 ± 1.74 87.45 ± 3.77 94.45 ± 2.40 91.74 ± 4.40
Cmc 56.15 ± 2.57 50.87 ± 2.35 52.71 ± 3.50 52.77 ± 3.03 49.29 ± 0.62 54.93 ± 0.89 47.67 ± 1.60
Semeion 94.00 ± 1.88 91.49 ± 2.09 92.05 ± 1.31 94.19 ± 1.14 85.94 ± 1.35 95.25 ± 1.53 88.93 ± 2.82
Segmentation 97.10 ± 0.90 88.18 ± 2.48 93.72 ± 0.80 98.01 ± 0.79 97.75 ± 0.76 96.67 ± 1.12 90.95 ± 5.12
YaleB 84.35 ± 19.95 81.07 ± 24.64 77.94 ± 26.78 89.31 ± 13.52 52.30 ± 21.04 83.71 ± 20.77 76.66 ± 29.57
2k2k 93.12 ± 2.18 87.15 ± 3.24 89.39 ± 2.40 91.17 ± 1.74 63.63 ± 1.95 88.56 ± 3.18 81.65 ± 3.08
Wine_Quality_White 51.86 ± 4.27 51.41 ± 2.00 52.21 ± 3.46 52.39 ± 3.41 40.16 ± 2.46 45.30 ± 0.51 44.35 ± 5.88
Optdigits 98.95 ± 0.23 95.63 ± 0.89 96.46 ± 0.84 97.49 ± 0.61 97.74 ± 0.73 98.26 ± 0.66 96.98 ± 1.03
Statlog 88.66 ± 0.47 80.70 ± 2.71 83.79 ± 1.58 89.49 ± 1.34 88.45 ± 0.98 87.89 ± 0.97 84.13 ± 2.81
Marketing 31.72 ± 1.71 31.31 ± 1.08 32.70 ± 0.63 31.49 ± 1.07 30.23 ± 1.84 28.43 ± 1.67 24.34 ± 1.20
USPS 97.80 ± 1.26 93.25 ± 1.78 94.62 ± 1.77 96.06 ± 1.29 92.28 ± 1.90 97.39 ± 1.53 95.95 ± 1.64
Penbased 99.62 ± 0.09 86.97 ± 2.46 98.03 ± 1.51 99.00 ± 0.42 97.36 ± 0.24 99.51 ± 0.19 98.49 ± 0.45
average Acc 84.95 ± 3.45 79.51 ± 4.42 83.24 ± 4.32 82.98 ± 4.15 78.47 ± 4.03 83.25 ± 4.08 78.30 ± 5.74
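As a quick arithmetic check, the "average Acc" entry for the CGMKL column can be reproduced from the 22 per-data-set values above:

```python
# CGMKL Acc (%) values from Table 3, in row order (Iris ... Penbased).
cgmkl_acc = [98.00, 83.21, 84.44, 95.71, 100.00, 98.00, 72.83, 70.95,
             97.61, 81.28, 93.45, 56.15, 94.00, 97.10, 84.35, 93.12,
             51.86, 98.95, 88.66, 31.72, 97.80, 99.62]
avg = sum(cgmkl_acc) / len(cgmkl_acc)
print(round(avg, 2))   # 84.95, matching the table's average Acc row
```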
Fig. 2. Classification results of different comparison algorithms vary with the percentage of training data on Yale, JAFFE, ORL, Coil_20, YaleB and 2k2k.
Fig. 3. Average classification results of different comparison algorithms vary with the percentage of training data on Yale, JAFFE, ORL, Coil_20, YaleB and 2k2k.
areas under the curve. As shown in Fig. 2, it is found that the CGMKL achieves the best result on JAFFE and 2k2k. Except for the JAFFE and 2k2k, the CGMKL provides relatively good classification results and obtains suboptimal results. In detail, the CGMKL outperforms the BPNN on YaleB and 2k2k, and is competitive with the BPNN on JAFFE, ORL and Coil_20, but the performance of the CGMKL is worse than that of the BPNN on Yale. Compared to the softmax and the SVM (OVO), the CGMKL outperforms the two algorithms on 4 out of 5 data sets. Moreover, the CGMKL outperforms the RandomForest on 3 out of 5 data sets, and it outperforms the EasyMKL and the SVM (OVR) on all data sets.

To investigate the entire performance with a varying percentage of training samples further, Fig. 3 presents the average performance with different percentages of training data on all used data sets. According to the figure, the performances of all algorithms generally increase with the percentage. Moreover, regardless of the percentage, the average performance of the CGMKL achieves the highest score. The result reveals that the CGMKL can get better performances on the image data sets with high dimensions. The better performance might be attributed to the rich expressions of the sample brought by the multiple kernels and the prior information brought by the two regularization terms.
Table 4
Classification results regarding different values of k on the used data sets (the best result on each data set is written in bold).
Iris 98.00 ± 1.83 98.00 ± 2.98 98.67 ± 1.83 98.67 ± 1.83 98.00 ± 4.47
Hayesroth 82.65 ± 4.80 83.21 ± 4.22 82.56 ± 8.10 82.47 ± 7.41 83.67 ± 8.14
Yale 80.00 ± 11.79 84.44 ± 7.76 86.00 ± 7.60 86.22 ± 6.17 86.44 ± 6.96
Seeds 93.81 ± 4.94 95.71 ± 4.45 95.24 ± 4.76 95.24 ± 5.32 95.24 ± 6.07
JAFFE 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00
ORL 97.50 ± 2.50 98.00 ± 2.27 97.25 ± 2.71 97.75 ± 1.85 97.50 ± 2.65
Movement_libra 72.33 ± 12.50 72.83 ± 9.64 74.17 ± 10.67 74.67 ± 10.17 74.33 ± 9.83
Led7digit 69.49 ± 2.36 70.95 ± 2.20 73.24 ± 3.83 72.79 ± 4.33 73.72 ± 2.53
Balance 96.63 ± 0.91 97.61 ± 0.49 97.44 ± 0.90 97.42 ± 0.93 97.25 ± 1.26
Vehicle 80.11 ± 2.24 81.28 ± 2.82 80.68 ± 2.97 81.31 ± 2.13 79.39 ± 2.21
Coil_20 94.01 ± 3.90 93.45 ± 2.87 93.13 ± 3.76 92.55 ± 3.60 91.73 ± 3.85
Cmc 52.62 ± 2.41 56.15 ± 2.57 55.60 ± 1.20 55.68 ± 1.87 55.94 ± 1.70
Semeion 93.88 ± 2.07 94.00 ± 1.88 93.77 ± 2.08 92.94 ± 1.99 92.68 ± 1.74
Segmentation 96.41 ± 1.30 97.10 ± 0.90 97.10 ± 0.95 96.84 ± 1.04 96.97 ± 1.02
YaleB 82.82 ± 21.76 84.35 ± 19.95 85.51 ± 19.07 85.31 ± 19.04 85.48 ± 19.17
2k2k 92.72 ± 2.42 93.12 ± 2.18 92.64 ± 2.78 92.26 ± 2.61 91.39 ± 3.08
Wine_Quality_White 43.02 ± 3.07 51.86 ± 4.27 50.43 ± 2.40 51.00 ± 3.59 49.59 ± 1.70
Optdigits 98.97 ± 0.29 98.95 ± 0.23 98.99 ± 0.36 98.92 ± 0.24 98.74 ± 0.33
Statlog 87.18 ± 0.90 88.66 ± 0.47 88.64 ± 0.45 88.53 ± 0.62 88.21 ± 0.50
Marketing 30.07 ± 1.75 31.72 ± 1.71 31.80 ± 1.04 31.79 ± 2.51 32.45 ± 1.25
USPS 97.88 ± 1.21 97.80 ± 1.26 97.73 ± 1.25 97.65 ± 1.35 97.57 ± 1.47
Penbased 99.60 ± 0.11 99.62 ± 0.09 99.58 ± 0.05 99.59 ± 0.09 99.54 ± 0.06
average Acc 83.62 ± 3.87 84.95 ± 3.45 85.01 ± 3.58 84.98 ± 3.58 84.81 ± 3.64
different kernel spaces can improve the classification ability of the CGMKL. Excessive attention may be adverse to the classification ability.

When the value of c is relatively small, the change of the classification results is relatively unstable as the value of λ varies. The phenomenon can be seen obviously in the data sets including Yale, ORL and Movement_libras. Conversely, when the value of c is relatively large, the change of the classification results is relatively stable as the value of λ varies. Generally, the performance with a small c is better than that with a large c, which means that excessive attention to the within-class distances may be adverse to the classification ability. In summary, Fig. 5 also
reflects that the two regularization terms play important roles in the classifications. For most data sets, setting both c and λ to approximately 0.1 is preferable in accordance with Fig. 5.

4.3.3. Visualized results with varying parameters

Fig. 6. Sub-figure (a) shows the original data set, and the other sub-figures present the visualized results with a varying parameter c.

Fig. 6 presents the visualized results with a varying parameter c. In this figure, sub-figure (a) shows the original data set with three classes. The samples in different classes are marked with different colors. As the data set has three classes, the output of the samples has three dimensions. According to the sub-figures in Fig. 6, it is obvious that the within-class distance of the samples with the same class declines as the parameter c increases. These figures reflect the function of the regularization term RGl, which requires the outputs of samples to exhibit a small within-class distance.
Fig. 7. Sub-figure (a) shows the original data set, and the samples in different classes are marked with different colors. In the other sub-figures, the samples with different colors represent the outputs in different kernel spaces, and the visualized results with a varying parameter λ are presented.

Fig. 7 presents the visualized results with a varying parameter λ. In this figure, sub-figure (a) shows the original data set. In the other sub-figures, the samples with different colors represent the outputs of the samples in different kernel spaces. As shown in Fig. 7, it is obvious that the outputs of the samples in the different kernel spaces become more and more similar as the parameter λ increases. Therefore, these figures demonstrate that the regularization term RU can make the outputs of the samples in different kernel spaces consistent.

4.3.4. Convergence analysis

Fig. 8 shows the convergence of the proposed CGMKL. In the CGMKL, the RMSProp is used to accelerate the convergence speed. Therefore, the convergence curve oscillates slightly. According to the figure, it is obvious that the value of the loss declines rapidly as the number of iterations increases in the early stage. Moreover, when the number of iterations reaches 100, the value of the loss tends to be stable. Therefore, it can be concluded that the CGMKL has a fast convergence speed.
5. Conclusion max function can utilize the explicit features in the kernel space
efficiently. To improve the collaborative working between differ-
This paper proposes a algorithm named CGMKL to deal with ent kernel spaces, one regularization term RU is designed to re-
multi-class classification. In detail, this paper places the softmax quire the consistent outputs of samples in different kernel spaces.
function under the MEKL framework. In this manner, the soft- Moreover, to make the outputs of samples to be with geometric
Z. Wang, Z. Zhu and D. Li / Pattern Recognition 99 (2020) 107050 13
classification features, a geometric projection regularization term [7] A. Rocha, S.K. Goldenstein, Multiclass from binary: expanding one-versus-all,
RGl is designed to reduce the within-class distance of the outputs one-versus-one and ecoc-based approaches, IEEE Trans. Neur. Netw. Learn.
Syst. 25 (2) (2017) 289–302.
of samples in each kernel space. [8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, An overview of
The classification results demonstrate that CGMKL exhibits superior classification ability compared with the other algorithms. The results regarding the number of kernels show that using multiple kernels performs better than using a single kernel; however, too many kernels increase the computational complexity without necessarily improving the classification results greatly, so two kernels are enough according to the results. The results regarding the collaborative working and the geometric projection indicate the effectiveness of the two regularization terms.

There are several advantages of CGMKL. Firstly, CGMKL illustrates that the empirical kernel can be introduced into the softmax function. Moreover, the learning in the different kernel spaces can be combined through the requirement of consistent outputs; in this manner, the softmax functions in the different kernel spaces can learn from each other, and the consistent outputs provide a way to establish a connection between the spaces. In addition, the regularization term regarding the geometric feature validates that the classification results can be improved if the outputs of samples with the same class label are close to each other; the term can therefore be introduced into other learning frameworks.

The proposed CGMKL also has some limitations. When the number of samples is huge, it is hard for CGMKL to train multiple kernel functions due to the memory constraint, because CGMKL uses all the training samples to calculate the kernel matrix. The memory requirement increases with the number of samples: a huge training set leads to a huge kernel matrix, which consumes considerable computing resources for the matrix decomposition. To improve the efficiency of CGMKL, data reduction and feature selection methods are required. Moreover, CGMKL can only handle samples in vector form; other types of samples must first be converted into vectors.
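The memory limitation can be made concrete with a sketch. The code below is an illustration under assumed details (an RBF kernel, an eigendecomposition-based mapping; `empirical_kernel_map` is a hypothetical helper, not the paper's code): it materializes the full n × n kernel matrix and decomposes it, which is exactly the O(n²) memory and O(n³) decomposition cost described above.

```python
import numpy as np

def empirical_kernel_map(X, gamma=1.0):
    """Empirical kernel mapping of a training set via eigendecomposition.

    Materializes the full n x n RBF kernel matrix, so memory grows as
    O(n^2) and the decomposition costs O(n^3) -- the scalability
    limitation discussed above. Illustrative sketch only.
    """
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    vals, vecs = np.linalg.eigh(K)   # eigenvalues in ascending order
    keep = vals > 1e-12              # drop numerically zero modes
    # mapped features whose inner products reproduce K: Phi = V * sqrt(Lambda)
    Phi = vecs[:, keep] * np.sqrt(vals[keep])
    return Phi, K

X = np.random.RandomState(0).randn(50, 3)
Phi, K = empirical_kernel_map(X, gamma=0.5)
# inner products in the mapped space recover the kernel matrix
print(np.allclose(Phi @ Phi.T, K, atol=1e-6))  # True
```

Approximations such as Nyström extension schemes, as in [32], avoid forming the full matrix and are one route to the data reduction mentioned above.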
Acknowledgment

This work is supported by the “Shuguang Program” of the Shanghai Education Development Foundation and Shanghai Municipal Education Commission, the Natural Science Foundation of China under Grants no. 61672227 and no. 61806078, the National Key R&D Program of China under Grant no. 2018YFC0910500, the Special Fund Project for Shanghai Informatization Development in Big Data under Grant no. 201901043, and the National Major Scientific and Technological Special Project for “Significant New Drugs Development” under Grant no. 2019ZX09201004.

References

[1] C. Hsu, C. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425.
[2] M. Lapin, M. Hein, B. Schiele, Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification, IEEE Trans. Pattern Anal. Mach. Intell. 40 (7) (2018) 1533–1554.
[3] A. Fernández-Baldera, J.M. Buenaposada, L. Baumela, BAdaCost: multi-class boosting with costs, Pattern Recogn. 79 (2018) 467–479.
[4] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[5] L. Ke, Y. Wu, N. Yu, P. Li, L. Yang, Hierarchical multi-class classification in multimodal spacecraft data using DNN and weighted support vector machine, Neurocomputing 259 (11) (2017) 55–65.
[6] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley and Sons, 2000.
[7] A. Rocha, S.K. Goldenstein, Multiclass from binary: expanding one-versus-all, one-versus-one and ECOC-based approaches, IEEE Trans. Neural Netw. Learn. Syst. 25 (2) (2014) 289–302.
[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes, Pattern Recogn. 44 (8) (2011) 1761–1776.
[9] J.H. Hong, S.-B. Cho, A probabilistic multi-class strategy of one-vs.-rest support vector machines for cancer classification, Neurocomputing 71 (16–18) (2008) 3275–3281.
[10] B. Krawczyk, M. Galar, M. Wozniak, H. Bustince, F. Herrera, Dynamic ensemble selection for multi-class classification with one-class classifiers, Pattern Recogn. 83 (2018) 34–51.
[11] H. He, Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications, Wiley-IEEE Press, 2013.
[12] A. Khatami, M. Babaie, A. Khosravi, H.R. Tizhoosh, S. Nahavandi, Parallel deep solutions for image retrieval from imbalanced medical imaging archives, Appl. Soft Comput. 63 (2018) 197–205.
[13] Z. Zhang, X. Luo, S. García, F. Herrera, Cost-sensitive back-propagation neural networks with binarization techniques in addressing multi-class problems and non-competent classifiers, Appl. Soft Comput. 56 (2017) 357–367.
[14] S.H. Khan, M. Hayat, M. Bennamoun, F.A. Sohel, R. Togneri, Cost-sensitive learning of deep feature representations from imbalanced data, IEEE Trans. Neural Netw. Learn. Syst. PP (99) (2017) 1–15.
[15] W.W. Ng, J. Hu, D.S. Yeung, S. Yin, F. Roli, Diversified sensitivity-based undersampling for imbalance classification problems, IEEE Trans. Cybernet. 45 (11) (2015) 2402–2412.
[16] X. Yuan, L. Xie, M. Abouelenien, A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data, Pattern Recogn. 77 (2018) 160–172.
[17] E. Lughofer, O. Buchtala, Reliable all-pairs evolving fuzzy classifiers, IEEE Trans. Fuzzy Syst. 21 (4) (2013) 625–641.
[18] L. Zhou, Q. Wang, H. Fujita, One versus one multi-class classification fusion using optimizing decision directed acyclic graph for predicting listing status of companies, Inform. Fusion 36 (2016) 80–89.
[19] T. Wu, C. Lin, R. Weng, Probability estimates for multi-class classification by pairwise coupling, J. Mach. Learn. Res. 5 (2004) 975–1005.
[20] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, Dynamic classifier selection for one-vs-one strategy: avoiding non-competent classifiers, Pattern Recogn. 46 (12) (2013) 3412–3424.
[21] I. Mendialdua, J.M. Martinez-Otzeta, I. Rodriguez-Rodriguez, T. Ruiz-Vazquez, B. Sierra, Dynamic selection of the best base classifier in one versus one, Knowl.-Based Syst. 85 (2015) 298–306.
[22] C.K.I. Williams, D. Barber, Bayesian classification with Gaussian processes, IEEE Trans. Pattern Anal. Mach. Intell. 20 (12) (1998) 1342–1351.
[23] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, The MIT Press, 2016.
[24] K.R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. Neural Netw. 12 (2) (2001) 181–201.
[25] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[26] H. Xiong, M.N.S. Swamy, M.O. Ahmad, Optimizing the kernel in the empirical feature space, IEEE Trans. Neural Netw. 16 (2) (2005) 460–474.
[27] F.R. Bach, Consistency of the group lasso and multiple kernel learning, J. Mach. Learn. Res. 9 (2008) 1179–1225.
[28] M. Gönen, E. Alpaydin, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.
[29] S.S. Bucak, R. Jin, A.K. Jain, Multiple kernel learning for visual object recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 36 (7) (2014) 1354–1369.
[30] A. Rakotomamonjy, F.R. Bach, S. Canu, Y. Grandvalet, SimpleMKL, J. Mach. Learn. Res. 9 (2008) 2491–2521.
[31] F. Aiolli, M. Donini, EasyMKL: a scalable multiple kernel learning algorithm, Neurocomputing 169 (2015) 215–224.
[32] N. Rabin, D. Fishelov, Multi-scale kernels for Nyström based extension schemes, Appl. Math. Comput. 319 (2018) 165–177.
[33] Q. Wang, G. Fu, L. Li, H. Wang, Y. Li, Data-dependent multiple kernel learning algorithm based on soft-grouping, Pattern Recogn. Lett. 112 (2018) 111–117.
[34] Y. Shi, T. Falck, A. Daemen, L.-C. Tranchevent, J.A.K. Suykens, B. De Moor, Y. Moreau, L2-norm multiple kernel learning and its application to biomedical data fusion, BMC Bioinformat. 11 (1) (2010) 309.
[35] D. Zhang, Y. Wang, L. Zhou, H. Yuan, D. Shen, Multimodal classification of Alzheimer's disease and mild cognitive impairment, NeuroImage 55 (3) (2011) 856–867.
[36] H. Xue, Y. Song, H. Xu, Multiple indefinite kernel learning for feature selection, in: Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 3210–3216.
[37] S. Sun, J. Shawe-Taylor, L. Mao, PAC-Bayes analysis of multi-view learning, Inform. Fusion 35 (2016) 117–131.
[38] J. Zhao, X. Xie, X. Xu, S. Sun, Multi-view learning overview: recent progress and new challenges, Inform. Fusion 38 (2017) 43–54.
[39] T.K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20 (8) (1998) 832–844.
[40] A. Benavoli, G. Corani, J. Demšar, Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis, J. Mach. Learn. Res. 18 (77) (2017) 1–36.
Zhe Wang received the B.Sc. and Ph.D. degrees from the Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2003 and 2008, respectively. He is now a full Professor in the Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China. His research interests include feature extraction, kernel-based methods, image processing, and pattern recognition. At present, he has more than 40 papers as first or corresponding author published in famous international journals, including IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Neural Networks and Learning Systems, and Pattern Recognition.

Dongdong Li received the B.Sc. and Ph.D. degrees from the Department of Computer Science and Engineering, Zhejiang University, Hangzhou, China, in 2003 and 2008, respectively. She is now an Assistant Professor in the Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China. Her research interests include speech processing, affective computing, and pattern recognition. At present, she has more than 15 papers as first or corresponding author published in famous international journals and conferences.