
This article has been accepted for publication in IEEE Transactions on Fuzzy Systems. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TFUZZ.2023.3296503

Fuzzy Twin Support Vector Machines with Distribution Inputs

Zhizheng Liang, Member, IEEE, Shifei Ding

Abstract—The fuzzy twin support vector machine (FTSVM) is a powerful and effective classifier due to the use of nonparallel hyperplanes and fuzzy membership functions. This paper extends the FTSVM model to uncertain objects with probability distributions and proposes a distribution-input FTSVM (DFTSVM) model. Unlike the classical FTSVM, the DFTSVM model adopts an insensitive pinball loss function that can suppress feature noise. It also defines a fuzzy membership function of uncertain objects in terms of the Wasserstein distance and k-nearest neighbors. We use properties of Gaussian distributions to transform the original DFTSVM into a tractable model. When the covariance matrices of uncertain objects are positive definite, the two optimization problems in DFTSVM are smooth and convex, and a quasi-Newton algorithm is employed to solve the DFTSVM model. We also analyze the DFTSVM model to demonstrate its noise insensitivity and its weighted scatter minimization. The fuzzy decision rule we define is independent of Gaussian assumptions, and a kernel version of DFTSVM is achieved via a reduced-set strategy. In the experimental part, we demonstrate the effectiveness of DFTSVM when handling uncertain objects. Moreover, we illustrate how the DFTSVM model can be modified to deal with large-scale data sets by constructing probability density functions of samples, and how to model deep features as distribution inputs for image data sets.

Index Terms—fuzzy membership, FTSVM, fuzzy decision rules, kernel functions, data classification

Manuscript created June, 2023. This work is partly supported by the NSFC under Grant 61976216. Zhizheng Liang and Shifei Ding are with the School of Computer Science and Technology, China University of Mining and Technology, China (e-mail: liang@cumt.edu.cn; dingsf@cumt.edu.cn).

I. INTRODUCTION

The twin support vector machine [1], as a variant of the support vector machine (SVM) [2], [3], is an effective tool for classification and regression problems. Unlike SVM, the twin SVM (TSVM) tries to find two nonparallel hyperplanes that are obtained by solving two optimization subproblems. Due to its theoretical merits and good generalization performance in practical problems, the TSVM model has found wide applications in many fields, such as image classification [3], credit risk [4], and imbalanced learning [5].

The idea of adopting two nonparallel hyperplanes for the classification problem originally appeared in the generalized eigenvalue proximal SVM (GEPSVM) [6]. The aim of GEPSVM is to make samples in the positive class close to one hyperplane and samples in the negative class approach the other hyperplane. Inspired by the idea of GEPSVM, Jayadeva et al. [1] proposed the twin SVM, which contains two quadratic optimization problems. Since then, numerous variants of TSVM have been developed [7], [8]. These modified models attempt to address the drawbacks of the original TSVM, such as its sensitivity to noise and outliers. The idea of structural risk [9] is employed to improve TSVM via a regularization term. Variance minimization is introduced to propose the recursive projection twin support vector machine (RPTSVM) [10]. A pair of nonparallel parametric-margin hyperplanes is explored in twin parametric-margin support vector machines [11]. The TSVM model with the pinball loss [12] employs the quantile distance to suppress noisy data near the boundary between the two classes. The large-scale pinball TSVM [13] uses the kernel trick directly and does not need to compute the inverse of matrices in the dual. The general twin support vector machine with the pinball loss function [14] is robust to noise and stable under resampling. The robust nonparallel SVM via second-order cone programs [15] can effectively suppress noise. Recently, a comprehensive survey on TSVM can be found in [16].

In addition to employing robust loss functions to suppress noise and outliers, some weighting schemes have been designed to impose different importance on samples in the training set. Among these weighting schemes, weights derived from fuzzy sets [4], [5], [17] have proved effective in reducing the effect of outliers. The fuzzy support vector machine (FSVM) [18] assigns higher membership values to important samples such that outliers are suppressed. The fuzzy TSVM (FTSVM) [19], which produces two nonparallel hyperplanes, uses the margin of samples to reduce the influence of samples with high uncertainty. Clifford geometric algebra is employed to achieve the decision surface in the Clifford fuzzy SVM [20]. The fuzzy least squares TSVM [21] constructs fuzzy hyperplanes instead of crisp hyperplanes by using the membership degrees of samples. The fuzzy least squares projection TSVM [22] and the robust fuzzy least squares TSVM [23] were developed to deal with the problem of data imbalance. The large-scale fuzzy least squares TSVM [24] employs the sequential minimization principle to construct an iterative algorithm for large-scale learning. To handle outputs represented by asymmetric trapezoidal fuzzy numbers, the authors of [25] developed an asymmetric dual-regression model within the framework of TSVM.

Intuitionistic fuzzy sets (IFSs), as a generalization of fuzzy sets, have been employed to construct the weights of samples in SVM and its variants. Besides the membership function, intuitionistic fuzzy sets also define a non-membership function. Thus, IFSs can better describe the uncertain information of samples, since the non-membership describes the possibility of not belonging to the set. As a result, using IFSs to design the weights of samples has attracted much attention over


the past several years. The intuitionistic fuzzy SVM [26] introduces a score function of intuitionistic fuzzy numbers to measure the importance of samples. The intuitionistic fuzzy TSVM (IFTSVM) [27] embeds intuitionistic fuzzy numbers into TSVM, and the score function is explored in a Hilbert space. The IFTSVM model is also employed to deal with imbalanced data [28]. By replacing the hinge loss function in IFTSVM with the pinball loss function, the pinball IFTSVM (PIFTSVM) can effectively deal with noisy data [29]. The IFTSVM model is also used to deal with the multiclass problem of leaf recognition [30]. The intuitionistic fuzzy TSVM is combined with fuzzy ART to deal with the problem of class imbalance learning [31]. The intuitionistic fuzzy proximal SVM (IFPSVM) [32] computes the membership degree of a sample from its distance to the corresponding class center, and it obtains the non-membership degree of a sample from the ratio of the number of heterogeneous points to the number of all data points. The safe intuitionistic fuzzy TSVM [33] is used to deal with unlabeled samples in semi-supervised learning. The intuitionistic fuzzy weighted least squares TSVM [34] explores the local information of data points to construct the membership and non-membership functions of samples, and linear equations are solved instead of quadratic programming problems.

Most TSVM models based on fuzzy sets [22] and IFSs [29] deal with noisy data more effectively than the classical TSVM. For the noisy data mentioned in [22], [23], [28], [29], each attribute or feature of a sample is recorded as a single value that may be contaminated by noise. Unlike noisy data whose features have single-valued representations, there are specific types of uncertain data that have been explored in [35], [36], [37], where each example is described by a random vector or each attribute of an example is described by interval numbers. Examples with these representations are referred to as uncertain objects. When uncertain objects are represented as continuous random variables, they have probability density functions that describe the uncertainty of samples. The maximum-margin classifier with probabilistic constraints [38] classifies uncertain examples with high probability. The least squares regression model is designed from the viewpoint of additive perturbation [39]. The maximum-margin classifier with Gaussian uncertainty (MMCGU) [40] defines an optimization model by considering each uncertain object as a Gaussian random vector and simplifies the model via Gaussian assumptions. To capture the nonlinear characteristics of data, the authors of [41] studied uncertain objects with isotropic Gaussian distributions in the kernel space. Uncertain kernel Fisher discriminant analysis (UKFDA) [42] uses Fisher discriminant criteria to deal with two types of uncertain objects. The UKFDA model defines covariance matrices, within-class scatter matrices, and between-class scatter matrices of uncertain objects. Recently, the uncertainty-aware twin support vector machine (UTSVM) [43] uses two nonparallel hyperplanes to cope with uncertain objects with Gaussian distributions.

Although some methods [40], [42], [43] have been proposed to classify uncertain objects, these methods do not effectively handle uncertain objects containing outliers. To this end, this paper investigates how to robustly tackle uncertain objects with probability distributions. Different from some previous methods [40], [43], we employ robust loss functions and weaken the Gaussian assumptions on uncertain objects. In the DFTSVM model, uncertain objects are regarded as random vectors whose means and covariances are assumed to be provided. To suppress outliers in uncertain data, we define a membership function of uncertain objects based on the Wasserstein distance on the probability space. The original model consists of intractable optimization problems due to the presence of multi-dimensional integrals. We make use of some techniques to make DFTSVM tractable. The transformed optimization problems in DFTSVM are smooth and convex under positive definite covariance matrices. Thus, some optimization algorithms can be employed to handle DFTSVM, and we resort to the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [44] to solve the DFTSVM model. Some characteristics of DFTSVM are analyzed, and its kernel version in a reduced kernel space is developed. We conduct a series of experiments on data sets to demonstrate that our DFTSVM model is much better than some previous models in the presence of outliers. Overall, the main contributions of this paper are summarized as follows.

• To suppress outliers in uncertain data, we introduce a new fuzzy TSVM model with distribution inputs, where we employ an insensitive loss function and construct a membership function of uncertain objects. To capture the uncertain information from distribution inputs, we employ the Wasserstein distance to define the fuzzy membership function.
• We explore properties of Gaussian random vectors to simplify DFTSVM and solve it in the primal in terms of the BFGS algorithm. We define a fuzzy decision rule, independent of Gaussian assumptions, to classify uncertain objects. We discuss how noise is affected by a parameter in our DFTSVM model and explain DFTSVM from the viewpoint of weighted scatter minimization. We also extend DFTSVM to its kernel version via a reduced-set strategy.
• We carry out a series of experiments to illustrate the effectiveness of DFTSVM. In addition, we show how our DFTSVM model can be modified to approximately deal with large-scale data sets by generating probability density functions of samples, and how our DFTSVM model can be employed to classify image data sets by modeling deep features as probability distributions.

The rest of this paper is organized as follows. Section II describes the fuzzy TSVM and MMCGU. Section III introduces the DFTSVM model and discusses its properties. Section IV describes the experimental results. Conclusions and further work are stated in the final section.

II. RELATED WORK

Let {(x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n)} be n samples in binary classification problems, where x_i ∈ R^m (i = 1, ..., n) and y_i ∈ {1, −1}. For ease of description, let the first l samples belong to the positive class, and the rest of the samples belong to the negative class. When uncertain objects


are described, we use {(X_1, y_1), ..., (X_i, y_i), ..., (X_n, y_n)} to denote n uncertain objects, where X_i is the ith random vector with a continuous probability distribution. In the case of probability distributions, the means and covariance matrices of random vectors are widely used statistical information.

A. Fuzzy TSVM

It is found that outliers in the data set affect the hyperplanes of SVM and TSVM. The fuzzy twin SVM incorporates a fuzzy membership function into TSVM to suppress noise and outliers. Each sample in the training set is assigned a membership value, so samples make different contributions to the hyperplanes. The optimization model of the fuzzy TSVM is formulated as

\min_{w_1,b_1} \frac{\|w_1\|^2 + b_1^2}{2} + c_1 \sum_{i=1}^{l} (w_1^T x_i + b_1)^2 + c_2 \sum_{i=l+1}^{n} s_i \xi_i
\text{s.t. } y_i(w_1^T x_i + b_1) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = l+1, \dots, n    (1)

\min_{w_2,b_2} \frac{\|w_2\|^2 + b_2^2}{2} + c_3 \sum_{i=l+1}^{n} (w_2^T x_i + b_2)^2 + c_4 \sum_{i=1}^{l} s_i \xi_i
\text{s.t. } y_i(w_2^T x_i + b_2) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, l.    (2)

In contrast to the twin bounded SVM (TBSVM) [9], the fuzzy TSVM uses the membership of each sample to improve the performance of TBSVM. A small s_i reduces the effect of x_i, and s_i is obtained by defining a membership function. The general principle of devising a membership function is to make the samples that may be outliers or noise take small values. The optimization problems (1) and (2) are generally solved in the dual. To make the fuzzy TSVM suitable for online learning, the authors of [17] used incremental and decremental learning strategies to achieve two nonparallel hyperplanes.

B. Maximum-margin classifier with Gaussian uncertainty

Let {(x_1, Σ_1, y_1), ..., (x_i, Σ_i, y_i), ..., (x_n, Σ_n, y_n)} be n uncertain objects with Gaussian distributions, with means x_i ∈ R^m (i = 1, ..., n) and covariance matrices Σ_i (i = 1, ..., n). In fact, the ith uncertain object is regarded as a random vector X_i with the m-dimensional Gaussian distribution N(x_i, Σ_i). For uncertain objects, the following optimization model is built to achieve the hyperplane w^T x + b = 0:

\min \frac{\lambda \|w\|^2}{2} + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathbb{R}^m} \max(0, 1 - y_i(w^T x + b)) f_{X_i}(x) \, dx    (3)

where f_{X_i}(x) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp(-\frac{1}{2}(x - x_i)^T \Sigma_i^{-1} (x - x_i)) is the probability density function of the random vector X_i. To make (3) have a tractable representation, the authors of [40] simplified (3) as

\min \frac{\lambda \|w\|^2}{2} + \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{d_{x_i}}{2}\left(1 + \mathrm{erf}\left(\frac{d_{x_i}}{d_{\Sigma_i}}\right)\right) + \frac{d_{\Sigma_i}}{2\sqrt{\pi}} \exp\left(-\left(\frac{d_{x_i}}{d_{\Sigma_i}}\right)^2\right) \right]    (4)

where d_{\Sigma_i} = \sqrt{2 w^T \Sigma_i w}, d_{x_i} = 1 - y_i(w^T x_i + b), and \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} dt. By exploring the convexity of the objective function of (4) with respect to w and b, several optimization algorithms [40] have been proposed to solve (4).

III. FUZZY TWIN SUPPORT VECTOR MACHINE WITH DISTRIBUTION INPUTS

In this section, we first define the fuzzy membership of uncertain objects. Then, we describe the proposed DFTSVM in linear cases, define the fuzzy decision rule of uncertain objects, and study several properties of DFTSVM. Finally, we describe the proposed DFTSVM in nonlinear cases.

A. Fuzzy membership of uncertain objects

As with the fuzzy TSVM, we need to define a membership function of uncertain objects. But defining it is more challenging since uncertain objects involve distributional information. Fortunately, some existing measures [45], such as the Kullback-Leibler divergence and the Wasserstein distance, can be employed to compute the distance between two probability distributions. The Wasserstein distance from the theory of optimal transport is a metric defined on the probability space. It has achieved wide applications in different fields since it captures the geometrical relations between distributions. Here, we adopt the Wasserstein distance to measure the difference between two probability distributions since it is a metric on the probability space. The L2-norm Wasserstein distance between two Gaussian distributions N(m_1, Σ_1) and N(m_2, Σ_2) has an analytical expression [45], denoted by

D(m_1, m_2)^2 = \|m_1 - m_2\|^2 + \mathrm{tr}(\Sigma)    (5)

where \Sigma = \Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}, tr() denotes the trace of a matrix, and \Sigma_1^{1/2} is the square root of the matrix Σ_1. When Σ_1 and Σ_2 are diagonal matrices, D(m_1, m_2) can be computed easily. When the uncertain objects have the same uncertainty, i.e., Σ_1 = Σ_2, D(m_1, m_2) degenerates into the Euclidean distance.

Unlike the fuzzy SVM [18], in which the class center is employed, we explore local neighborhood information to define the fuzzy membership of uncertain objects. To this end, for each uncertain object, we first find its k-nearest neighbors from the same class by using a k-neighborhood search strategy. Assume that N_k(x_i) denotes the index set of uncertain objects in the k-nearest neighborhood of the ith uncertain object. From the k-nearest neighbors, we can compute the local center of the ith uncertain object. For computational simplicity, we assume that the uncertain objects in the neighborhood are statistically independent and follow Gaussian distributions. Based on Proposition 1 in the supplemental material, the local center of the ith uncertain object is denoted by v_i = \frac{1}{|N_k(x_i)|} \sum_{j \in N_k(x_i)} x_j, and the local covariance is obtained by \tilde{\Sigma}_i = \frac{1}{|N_k(x_i)|^2} \sum_{j \in N_k(x_i)} \Sigma_j. Thus, for the ith uncertain object X_i, we generate a virtual object V_i whose mean and covariance are v_i and \tilde{\Sigma}_i, respectively. Based on the local information of each uncertain object, we define the following fuzzy membership function to obtain the weights of uncertain objects:

s(x_i) = \begin{cases} \exp(-\delta \, D(x_i, v_i)/r_+) & y_i = 1 \\ \exp(-\delta \, D(x_i, v_i)/r_-) & y_i = -1 \end{cases}    (6)

where v_i is the mean of the ith virtual object V_i corresponding to the ith uncertain object X_i, r_+ = \max_{i=1,\dots,l} D(x_i, v_i), r_- = \max_{i=l+1,\dots,n} D(x_i, v_i), and δ is a non-negative parameter controlling the shape of the membership function. It is observed that δ = 0 produces equal membership degrees for all uncertain objects. For the positive class, the local center will approach its class center if k = l − 1. For the negative class, the local center will approach its class center if k = n − l − 1.
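As a concrete illustration of (5) and (6), the following is a minimal Python sketch. It is not the authors' released code: the helper names (wasserstein2_gauss, fuzzy_memberships) are ours, and the k-nearest neighbors are searched here under the Wasserstein distance itself, since the paper leaves the search metric unspecified.

```python
import numpy as np
from scipy.linalg import sqrtm

def wasserstein2_gauss(m1, S1, m2, S2):
    """Squared L2 Wasserstein distance between N(m1,S1) and N(m2,S2), eq. (5)."""
    r1 = sqrtm(S1)
    cross = np.real(sqrtm(r1 @ S2 @ r1))
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

def fuzzy_memberships(means, covs, labels, k=5, delta=5.0):
    """Membership s(x_i) of eq. (6) from the k-nearest same-class neighbors."""
    labels = np.asarray(labels)
    n = len(means)
    d = np.zeros(n)                      # D(x_i, v_i) for every object
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        same.sort(key=lambda j: wasserstein2_gauss(means[i], covs[i],
                                                   means[j], covs[j]))
        nbrs = same[:k]                  # index set N_k(x_i)
        v = np.mean([means[j] for j in nbrs], axis=0)      # local center v_i
        S = sum(covs[j] for j in nbrs) / len(nbrs) ** 2    # local covariance
        d[i] = np.sqrt(wasserstein2_gauss(means[i], covs[i], v, S))
    s = np.empty(n)
    for y in (1, -1):
        idx = labels == y
        s[idx] = np.exp(-delta * d[idx] / d[idx].max())    # r_+ or r_-
    return s
```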


B. The linear case

The ϵ-insensitive pinball loss function [46] has been employed in classification problems. In fact, there are several variants of the insensitive loss function [47] in regression problems. They play a similar role in reducing the effect of noise and outliers in data. In this paper, we employ the following ϵ-insensitive pinball loss function:

L_\tau^\epsilon(u) = \begin{cases} \tau(u - \epsilon) & u > \epsilon \\ 0 & -\epsilon \le u \le \epsilon \\ (\tau - 1)(u + \epsilon) & u < -\epsilon. \end{cases}    (7)

From (7), one can see that the width of the insensitive zone is determined by ϵ instead of τ. Thus, τ does not affect the insensitive zone. When τ = 1 and ϵ = 0, L_\tau^\epsilon(u) degenerates into the hinge loss function. When τ = 0.5, L_\tau^\epsilon(u) is the symmetric ϵ-insensitive tube. As with TSVM, we attempt to look for two nonparallel hyperplanes to classify uncertain objects. Following the ideas of the fuzzy TSVM and using the loss function in (7), we construct the following model to achieve two hyperplanes from uncertain data:

\min_{w_1,b_1} \frac{c_1(\|w_1\|^2 + b_1^2)}{2} + \sum_{i=1}^{l} \frac{s_i}{2} \int_{\mathbb{R}^m} (w_1^T x + b_1)^2 f_{X_i}(x)\,dx + c_2 \sum_{j=l+1}^{n} s_j \int_{\mathbb{R}^m} L_\tau^\epsilon(1 - y_j(w_1^T x + b_1)) f_{X_j}(x)\,dx    (8)

\min_{w_2,b_2} \frac{c_3(\|w_2\|^2 + b_2^2)}{2} + \sum_{i=l+1}^{n} \frac{s_i}{2} \int_{\mathbb{R}^m} (w_2^T x + b_2)^2 f_{X_i}(x)\,dx + c_4 \sum_{j=1}^{l} s_j \int_{\mathbb{R}^m} L_\tau^\epsilon(1 - y_j(w_2^T x + b_2)) f_{X_j}(x)\,dx    (9)

where s_i is the fuzzy membership of the ith uncertain object, and f_{X_j}(x) is the probability density function of the jth uncertain object. From (8) and (9), we find that directly optimizing them is impractical since multi-dimensional integrals appear. The objective functions of (8) and (9) are strongly convex, and there exist unique solutions to them. We refer to (8) and (9) as the fuzzy TSVM with distribution inputs (DFTSVM) in linear cases. Since (8) and (9) have similar expressions, we focus mainly on how to solve (8) in the following. From (8), we define the losses of a single uncertain object as

L_i = \int_{\mathbb{R}^m} (w_1^T x + b_1)^2 f_{X_i}(x)\,dx    (10)

L_j = \int_{\mathbb{R}^m} L_\tau^\epsilon(1 - y_j(w_1^T x + b_1)) f_{X_j}(x)\,dx    (11)

where L_i denotes the loss of the ith uncertain object in the positive class, and L_j denotes the loss of the jth uncertain object in the negative class. To deal with (10) and (11), we introduce the following lemmas to simplify L_i and L_j.

Lemma 1: Let X be an m-dimensional random vector with mean µ and covariance Σ. We have

\int_{\mathbb{R}^m} x x^T f_X(x)\,dx = \Sigma + \mu\mu^T    (12)

\int_{\mathbb{R}^m} (Ax + b) f_X(x)\,dx = A\mu + b    (13)

where A is a matrix with proper dimensions, and b is a vector with proper dimensions.

Lemma 2: Let Z be a one-dimensional Gaussian random variable with mean µ and variance σ^2. The integral I = \int_{z \ge 0} z f_Z(z)\,dz can be simplified as

I = \frac{\mu}{2}\,\mathrm{erfc}\left(-\frac{\mu}{\sqrt{2}\sigma}\right) + \frac{\sigma}{\sqrt{2\pi}} \exp\left(-\frac{\mu^2}{2\sigma^2}\right)    (14)

where the complementary error function [48] is \mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} \exp(-t^2)\,dt and \mathrm{erf}(x) = 1 - \mathrm{erfc}(x).

Lemma 2 employs the Gaussian assumption, but we do not need it in Lemma 1. For L_i (i = 1, ..., l), we have L_i = \int_{\mathbb{R}^m} (w_1^T x x^T w_1 + 2 w_1^T x b_1 + b_1^2) f_{X_i}(x)\,dx. Using Lemma 1 gives

L_i = w_1^T (x_i x_i^T + \Sigma_i) w_1 + 2 w_1^T x_i b_1 + b_1^2.    (15)

To simplify L_j, we need to define two halfspaces. These two halfspaces are defined as \Omega_j^1 = \{x : 1 - y_j(w_1^T x + b_1) - \epsilon \ge 0\} and \Omega_j^2 = \{x : -(1 - y_j(w_1^T x + b_1) + \epsilon) \ge 0\}. Since y_j = −1 in the negative class, we have \Omega_{l+1}^1 = \dots = \Omega_n^1 and \Omega_{l+1}^2 = \dots = \Omega_n^2. Using \Omega_j^1 and \Omega_j^2, we can rewrite L_j as

L_j = \tau \int_{\Omega_j^1} (1 - y_j(w_1^T x + b_1) - \epsilon) f_{X_j}(x)\,dx - (1-\tau) \int_{\Omega_j^2} (1 - y_j(w_1^T x + b_1) + \epsilon) f_{X_j}(x)\,dx.    (16)

Unlike (15), we assume that the uncertain object follows a Gaussian distribution in (16). For clarity, we let \bar{X}_j = 1 - y_j(w_1^T X_j + b_1) - \epsilon and \hat{X}_j = -(1 - y_j(w_1^T X_j + b_1) + \epsilon), where j = l+1, ..., n. When X_j is a Gaussian random vector with mean x_j and covariance Σ_j, \bar{X}_j and \hat{X}_j are Gaussian random variables. Using properties of Gaussian random vectors, we can obtain the means and variances of \bar{X}_j and \hat{X}_j. Specifically, the mean and variance of \bar{X}_j are 1 - y_j(w_1^T x_j + b_1) - \epsilon and w_1^T \Sigma_j w_1; the mean and variance of \hat{X}_j are -(1 - y_j(w_1^T x_j + b_1) + \epsilon) and w_1^T \Sigma_j w_1. Based on \bar{X}_j and \hat{X}_j, L_j in (16) can be rewritten as

L_j = \tau \int_0^{\infty} x f_{\bar{X}_j}(x)\,dx + (1 - \tau) \int_0^{\infty} x f_{\hat{X}_j}(x)\,dx.    (17)

Using Lemma 2, we rewrite the loss L_j in (17) as

L_j = \tau \left[ \frac{u_j\,\mathrm{erfc}(-e_{1,j})}{2} + \frac{\sigma_j \exp(-(e_{1,j})^2)}{\sqrt{2\pi}} \right] + \bar{\tau} \left[ \frac{\bar{u}_j\,\mathrm{erfc}(-e_{2,j})}{2} + \frac{\sigma_j \exp(-(e_{2,j})^2)}{\sqrt{2\pi}} \right]    (18)

where u_j = 1 - y_j(w_1^T x_j + b_1) - \epsilon, \bar{u}_j = -(1 - y_j(w_1^T x_j + b_1) + \epsilon), \sigma_j = \sqrt{w_1^T \Sigma_j w_1}, e_{1,j} = u_j/(\sqrt{2}\sigma_j), \bar{\tau} = 1 - \tau, and e_{2,j} = \bar{u}_j/(\sqrt{2}\sigma_j).
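To make (7), (14), and (18) concrete, here is a small hedged Python sketch (helper names are ours) that evaluates the insensitive pinball loss and its expectation for a Gaussian margin variable, together with a Monte Carlo sanity check of the closed form.

```python
import numpy as np
from scipy.special import erfc

def pinball(u, tau, eps):
    """The eps-insensitive pinball loss of eq. (7)."""
    return np.where(u > eps, tau * (u - eps),
                    np.where(u < -eps, (tau - 1.0) * (u + eps), 0.0))

def expected_pinball_gauss(mu, var, tau, eps):
    """E[L(u)] for u ~ N(mu, var): eq. (18), built from Lemma 2 (eq. (14))."""
    sigma = np.sqrt(var)
    def half_mean(m):                    # E[max(z, 0)] for z ~ N(m, sigma^2)
        e = m / (np.sqrt(2.0) * sigma)
        return m * erfc(-e) / 2.0 + sigma * np.exp(-e * e) / np.sqrt(2.0 * np.pi)
    return tau * half_mean(mu - eps) + (1.0 - tau) * half_mean(-(mu + eps))

u = np.random.default_rng(0).normal(1.2, 0.7, 500_000)   # u = 1 - y(w^T X + b)
print(pinball(u, 0.5, 0.5).mean())                        # Monte Carlo estimate
print(expected_pinball_gauss(1.2, 0.49, 0.5, 0.5))        # closed form of (18)
```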

Using (15) and (18), we rewrite (8) as

\min f_1(w_1, b_1) = \frac{c_1(\|w_1\|^2 + b_1^2)}{2} + \sum_{i=1}^{l} \frac{s_i(w_1^T(x_i x_i^T + \Sigma_i)w_1 + 2 w_1^T x_i b_1 + b_1^2)}{2} + c_2 \sum_{j=l+1}^{n} s_j \tau \left[ \frac{u_j\,\mathrm{erfc}(-e_{1,j})}{2} + \frac{\sigma_j \exp(-(e_{1,j})^2)}{\sqrt{2\pi}} \right] + c_2 \sum_{j=l+1}^{n} s_j \bar{\tau} \left[ \frac{\bar{u}_j\,\mathrm{erfc}(-e_{2,j})}{2} + \frac{\sigma_j \exp(-(e_{2,j})^2)}{\sqrt{2\pi}} \right].    (19)

Similarly, we rewrite (9) as

\min f_2(w_2, b_2) = \frac{c_3(\|w_2\|^2 + b_2^2)}{2} + \sum_{i=l+1}^{n} \frac{s_i(w_2^T(x_i x_i^T + \Sigma_i)w_2 + 2 w_2^T x_i b_2 + b_2^2)}{2} + c_4 \sum_{j=1}^{l} s_j \tau \left[ \frac{v_j\,\mathrm{erfc}(-e_{3,j})}{2} + \frac{\bar{\sigma}_j \exp(-(e_{3,j})^2)}{\sqrt{2\pi}} \right] + c_4 \sum_{j=1}^{l} s_j \bar{\tau} \left[ \frac{\bar{v}_j\,\mathrm{erfc}(-e_{4,j})}{2} + \frac{\bar{\sigma}_j \exp(-(e_{4,j})^2)}{\sqrt{2\pi}} \right]    (20)

where v_j = 1 - y_j(w_2^T x_j + b_2) - \epsilon, \bar{v}_j = -(1 - y_j(w_2^T x_j + b_2) + \epsilon), \bar{\sigma}_j = \sqrt{w_2^T \Sigma_j w_2}, e_{3,j} = v_j/(\sqrt{2}\bar{\sigma}_j), and e_{4,j} = \bar{v}_j/(\sqrt{2}\bar{\sigma}_j). If Σ_i (i = 1, ..., n) degenerate into zero matrices, the uncertain objects become deterministic. When Σ_i (i = 1, ..., n) are zero matrices, the optimization problems (19) and (20) reduce to

\min \frac{c_1(\|w_1\|^2 + b_1^2)}{2} + \sum_{i=1}^{l} \frac{s_i(w_1^T x_i + b_1)^2}{2} + c_2 \sum_{j=l+1}^{n} s_j (\tau \max(u_j, 0) + \bar{\tau} \max(\bar{u}_j, 0))    (21)

\min \frac{c_3(\|w_2\|^2 + b_2^2)}{2} + \sum_{i=l+1}^{n} \frac{s_i(w_2^T x_i + b_2)^2}{2} + c_4 \sum_{j=1}^{l} s_j (\tau \max(v_j, 0) + \bar{\tau} \max(\bar{v}_j, 0)).    (22)

The objective functions in (21) and (22) are non-smooth, but (19) and (20) have smooth objective functions. Hence, (19) and (20) can be regarded as smoothed versions of (21) and (22) if the covariance matrices are properly selected, i.e., Σ_i = λI_m with λ approaching zero, where I_m is the identity matrix. Note that the covariance matrices Σ_i (i = 1, ..., n) reflect the uncertain information of objects. Covariance matrices in a high-dimensional space may be singular if there are not enough data points. In (18), we have to compute \sigma_j = \sqrt{w_1^T \Sigma_j w_1} and e_{1,j} = u_j/(\sqrt{2}\sigma_j). Since σ_j appears in the denominator of e_{1,j}, w_1^T Σ_j w_1 may be zero if Σ_j is not strictly positive definite. To address the singularity problem of covariance matrices, a widely used strategy is to perturb the covariance matrices so that they are non-singular via the regularization technique, i.e., Σ_i + λ_i I_m (i = 1, ..., n), where λ_i is a small positive constant. In addition, employing principal component analysis (PCA) [3] or deep learning models [49] to reduce the dimension of the data not only overcomes the singularity problem of covariances but also reduces the uncertain information of the data.

It is observed that (19) and (20) are convex optimization problems since they are obtained by simplifying (8) and (9). Thus, some convex optimization algorithms can be used to solve them. We can also see that (19) and (20) are unconstrained optimization problems. The first-order information of the objective functions plays an important role in designing optimization algorithms. Although the optimization problems (19) and (20) have complex appearances, it is interesting to note that the partial derivatives of their objective functions with respect to the optimization variables have compact representations. When differentiating f_1(w_1, b_1) with respect to w_1 and b_1, we obtain the gradient information given in the supplemental material.

When the gradient information of the objective function is provided, we can use some first-order optimization algorithms to solve (19) and (20). Although the pinball loss function is non-smooth, the objective functions of (19) and (20) are smooth when the covariance matrices are positive definite. Considering the characteristics of the objective functions in (19) and (20), in this paper we make use of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [44] to solve DFTSVM, since the BFGS algorithm is effective for solving smooth optimization problems. To give an overall description of our scheme for handling uncertain objects, we list the main steps of solving DFTSVM in Algorithm 1; the BFGS algorithm is described in the supplemental material due to page limits.

Algorithm 1: The pseudo-code of DFTSVM in linear cases
1: Input: the training objects and their labels;
2: Output: the labels of test objects;
3: Obtain the representation of uncertain objects;
4: Calculate fuzzy memberships of uncertain objects by (6);
5: Perform the BFGS algorithm on (19) and (20);
6: Employ the fuzzy decision rule in (27).
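A minimal sketch of step 5 of Algorithm 1 for the first problem (19), using SciPy's BFGS with numerical gradients; the paper's analytic gradients (supplemental material) would normally be supplied via the jac argument. The function name and argument layout are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import erfc

def f1(theta, Xp, Sp, sp, Xn, Sn, sn, c1, c2, tau, eps):
    """Objective of eq. (19); (Xp, Sp, sp) are means, covariances, and fuzzy
    memberships of the positive class, (Xn, Sn, sn) those of the negative class."""
    w, b = theta[:-1], theta[-1]
    obj = 0.5 * c1 * (w @ w + b * b)
    for x, S, s in zip(Xp, Sp, sp):      # quadratic terms, via eq. (15)
        obj += 0.5 * s * (w @ (np.outer(x, x) + S) @ w + 2.0 * (w @ x) * b + b * b)
    for x, S, s in zip(Xn, Sn, sn):      # expected pinball loss, eq. (18)
        sigma = np.sqrt(w @ S @ w)
        m_u = 1.0 + (w @ x + b)          # 1 - y_j (w^T x_j + b) with y_j = -1
        for t, mu in ((tau, m_u - eps), (1.0 - tau, -(m_u + eps))):
            e = mu / (np.sqrt(2.0) * sigma)
            obj += c2 * s * t * (mu * erfc(-e) / 2.0
                                 + sigma * np.exp(-e * e) / np.sqrt(2.0 * np.pi))
    return obj

# res = minimize(f1, np.zeros(m + 1), method="BFGS",
#                args=(Xp, Sp, sp, Xn, Sn, sn, 1.0, 1.0, 0.5, 0.5))
# w1, b1 = res.x[:-1], res.x[-1]
```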
C. The fuzzy decision rule of uncertain objects

After (19) and (20) are solved, we obtain two nonparallel hyperplanes: 0 = w_1^T x + b_1 and 0 = w_2^T x + b_2. For uncertain objects, we need a decision rule to classify them. In (19) and (20), we employ the losses of uncertain objects to construct the objective functions. Assume that an uncertain test object X_t with mean x_t and covariance Σ_t is given. We define its losses by using the two nonparallel hyperplanes as follows:

D_1 = \int_{\mathbb{R}^m} (w_1^T x + b_1)^2 f_{X_t}(x)\,dx,    (23)

D_2 = \int_{\mathbb{R}^m} (w_2^T x + b_2)^2 f_{X_t}(x)\,dx.    (24)

Using Lemma 1, we rewrite (23) and (24) as

D_1 = w_1^T (x_t x_t^T + \Sigma_t) w_1 + b_1^2 + 2 w_1^T x_t b_1,    (25)

D_2 = w_2^T (x_t x_t^T + \Sigma_t) w_2 + b_2^2 + 2 w_2^T x_t b_2.    (26)

It is clear that the losses D_1 and D_2 are independent of Gaussian assumptions. After D_1 and D_2 are calculated, the label of the test object X_t is given by

y = \begin{cases} 1 & \text{if } D_1 \le D_2 \\ -1 & \text{if } D_1 > D_2. \end{cases}    (27)

Using (27), we obtain the label of the uncertain object X_t. Since (27) involves uncertain information, we refer to it as the fuzzy decision rule of uncertain objects.

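A hedged sketch of the fuzzy decision rule (25)-(27); the function name is ours.

```python
import numpy as np

def dftsvm_predict(x_t, Sigma_t, w1, b1, w2, b2):
    """Classify an uncertain test object with mean x_t and covariance Sigma_t
    by the losses of eqs. (25)-(26); no Gaussian assumption is needed."""
    M = np.outer(x_t, x_t) + Sigma_t
    d1 = w1 @ M @ w1 + b1 * b1 + 2.0 * (w1 @ x_t) * b1   # eq. (25)
    d2 = w2 @ M @ w2 + b2 * b2 + 2.0 * (w2 @ x_t) * b2   # eq. (26)
    return 1 if d1 <= d2 else -1                         # eq. (27)
```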

D. Further analysis of DFTSVM

In this subsection, we show how noise is affected by the parameter τ and explain DFTSVM from the viewpoint of weighted scatter minimization.

Unlike a deterministic sample, each uncertain object gives a probability distribution. This actually points out that there are infinitely many data points for each uncertain object. From (19), we can see that each uncertain object in the negative class defines its loss over different halfspaces. For each uncertain object, the two sets \Omega_j^1 and \Omega_j^2 affect the loss, and the data points in the insensitive zone make no contribution to the objective function. From the necessary condition of the optimization problem (8), we have

0 = c_1 b_1 + \sum_{i=1}^{l} s_i(w_1^T x_i + b_1) + c_2 \sum_{j=l+1}^{n} s_j \left[ -\tau \int_{\Omega_j^1} y_j f_{X_j}(x)\,dx + \bar{\tau} \int_{\Omega_j^2} y_j f_{X_j}(x)\,dx \right].    (28)

Here, we only consider the first-order information about b_1. From (28), we observe that the parameter τ does not affect the sizes of \Omega_j^1 and \Omega_j^2. But the parameter τ controls the ratio of \sum_{j=l+1}^{n} s_j \int_{\Omega_j^1} (-y_j) f_{X_j}(x)\,dx to \sum_{j=l+1}^{n} s_j \int_{\Omega_j^2} y_j f_{X_j}(x)\,dx. The quantity \sum_{j=l+1}^{n} s_j \int_{\Omega_j^1} (-y_j) f_{X_j}(x)\,dx denotes the losses from data points in \Omega_j^1, and the quantity \sum_{j=l+1}^{n} s_j \int_{\Omega_j^2} y_j f_{X_j}(x)\,dx denotes the losses from data points in \Omega_j^2. Note that τ takes values in the interval [0, 1]. When τ takes small values, \bar{\tau} = 1 - \tau takes relatively large values. When τ takes small values, \sum_{j=l+1}^{n} s_j \int_{\Omega_j^1} (-y_j) f_{X_j}(x)\,dx takes large values, and \sum_{j=l+1}^{n} s_j \int_{\Omega_j^2} y_j f_{X_j}(x)\,dx takes small values to ensure that the equality (28) holds. This shows that there are many data points corresponding to uncertain objects in \Omega_j^1 and few data points corresponding to uncertain objects in \Omega_j^2. In such a case, the optimization problem (8) is insensitive to feature noise close to the hyperplane w_1^T x + b_1 = 0. When the parameter τ = 0.5, the data points corresponding to uncertain objects in the two sets are relatively balanced. As τ approaches one, there are many data points corresponding to uncertain objects in \Omega_j^2. As a result, τ = 0.5 is preferable in the presence of feature noise.

Due to the introduction of uncertain information, it is meaningful to explore how the scatter of uncertain objects is defined in (8). Assume that the uncertain objects contain a data point m_1 which lies on the hyperplane w_1^T x + b_1 = 0, i.e., w_1^T m_1 + b_1 = 0. For uncertain objects in the positive class, the weighted scatter in the projected space determined by the direction w_1 is defined as

g_1(w_1, b_1) = \frac{1}{2} \sum_{i=1}^{l} s_i \int_{\mathbb{R}^m} (w_1^T x - w_1^T m_1)^2 f_{X_i}(x)\,dx.    (29)

We have g_1(w_1, b_1) = 0.5 \sum_{i=1}^{l} \int_{\mathbb{R}^m} s_i (w_1^T x + b_1)^2 f_{X_i}(x)\,dx since w_1^T m_1 + b_1 = 0. Assume there exists a data point m_2 in the negative class lying on the hyperplane 1 + w_1^T x + b_1 + ϵ = 0. The weighted scatter of uncertain objects in the negative class is defined by

g_2(w_1, b_1) = \frac{1}{2} \sum_{i=l+1}^{n} s_i \int_{\mathbb{R}^m} |w_1^T x - w_1^T m_2| f_{X_i}(x)\,dx.    (30)

Using 1 + w_1^T m_2 + b_1 + ϵ = 0 and y_i = −1, we have

g_2(w_1, b_1) = \frac{1}{2} \sum_{i=l+1}^{n} s_i \int_{\mathbb{R}^m} |1 - y_i(w_1^T x + b_1) + \epsilon| f_{X_i}(x)\,dx.    (31)

From (29) and (31), we define the following optimization model:

\min_{w_1,b_1} \frac{c_1(\|w_1\|^2 + b_1^2)}{2} + g_1(w_1, b_1) + C_2\, g_2(w_1, b_1).    (32)

From (29) and (30), we observe that the weighted scatter of uncertain objects is defined by different distance measures. In fact, we combine the scatter with the margin of uncertain objects to define (32). Specifically, the model (32) tries to make a trade-off between margin maximization and scatter minimization. If we can build the relationship between (8) and (32), this will explain why (8) exploits the scatter of uncertain objects. To this end, we define the following losses of uncertain objects that may contain misclassified data points:

C_3 \sum_{i=l+1}^{n} s_i \int_{\mathbb{R}^m} \max(1 - y_i(w_1^T x + b_1) - \epsilon, 0) f_{X_i}(x)\,dx,    (33)

C_4 \sum_{i=l+1}^{n} s_i \int_{\mathbb{R}^m} \max(1 - y_i(w_1^T x + b_1) + \epsilon, 0) f_{X_i}(x)\,dx.    (34)

If C_3 = τc_2, C_4 = (τ − 1)c_2, and C_2 = (1 − τ)c_2, adding (33) and (34) to the objective function of (32) produces the model (8). This means that we combine the scatter of uncertain objects with their margin to define the objective function of (8). Hence, the optimization problem (8) tries to make a trade-off among the margin, the scatter, and misclassified uncertain objects. This indicates that the optimization problem (8) embodies the idea of weighted scatter minimization, since the fuzzy membership is employed to construct the weights of uncertain objects. Overall, we employ (28) to show the role of τ, and we define (32) from (29) and (31) to obtain (8) from (33) and (34). Likewise, the parameter τ in (9) affects noise, and the optimization problem (9) can be explained from the viewpoint of weighted scatter minimization.
E. The nonlinear case

Kernel functions can capture the nonlinear features of data, so it is worth studying how to implement the nonlinearity for uncertain objects via the kernel trick. In kernel-based learning models, the input data are mapped into a high-dimensional feature space by a nonlinear map ϕ. In fact, explicit mappings are often not given, and kernel functions


such as Gaussian kernels and polynomial kernels are widely adopted in practical applications. Note that there are many strategies for modeling the uncertain information of data. For example, one can directly obtain uncertain information from a kernel space. Here, we assume that the uncertain information of the data in the original space is provided. Namely, the mean x_i and the covariance matrix Σ_i in the original space are used to describe the uncertain object X_i. Via the nonlinear map ϕ, we denote the mean and covariance matrix in the feature space by ϕ(x_i) and ϕ(Σ_i). Following the idea in [43], we expand the ith uncertain object X_i at its mean x_i in terms of a Taylor series and attain ϕ(X_i) ≈ ϕ(x_i) + J(X_i − x_i), where J is the Jacobian matrix computed by J = \partial\phi(X_i)/\partial X_i |_{X_i = x_i}. In such a case, the uncertain information in the feature space ϕ(Σ_i) is expressed as J Σ_i J^T. If the covariance matrix is computed by using a batch of data points, one may directly use the data points to obtain the covariance in the kernel space instead of kernelizing covariance matrices in the original space.

In kernel-based methods, w_1 and w_2 are restricted to a specific subspace consisting of linear combinations of data points in a feature space. Specifically, w_1 = \sum_{i=1}^{\bar{n}} \alpha_i \phi(\bar{x}_i) and w_2 = \sum_{i=1}^{\bar{n}} \beta_i \phi(\bar{x}_i), where \bar{n} << n and \bar{x}_i (i = 1, ..., \bar{n}) are obtained by performing the c-means clustering algorithm [2] on the means of uncertain objects. Unlike the work in [43], we employ a reduced-set strategy to avoid the large kernel matrix. It is observed that all of the means of uncertain objects are employed if c = n. Using w_1, w_2, and the nonlinear technique for uncertain objects, we achieve a kernel version of (19), denoted by

\min_{\alpha,b_1} \frac{c_1(\alpha^T K \alpha + b_1^2)}{2} + \sum_{i=1}^{l} \frac{s_i(\alpha^T \bar{M}_i \alpha + 2\alpha^T k_i b_1 + b_1^2)}{2} + c_2 \sum_{j=l+1}^{n} s_j \tau \left[ \frac{u_j\,\mathrm{erfc}(-e_{1,j})}{2} + \frac{\sigma_j \exp(-(e_{1,j})^2)}{\sqrt{2\pi}} \right] + c_2 \sum_{j=l+1}^{n} s_j \bar{\tau} \left[ \frac{\bar{u}_j\,\mathrm{erfc}(-e_{2,j})}{2} + \frac{\sigma_j \exp(-(e_{2,j})^2)}{\sqrt{2\pi}} \right]    (35)

where e_{1,j} = u_j/(\sqrt{2}\sigma_j), e_{2,j} = \bar{u}_j/(\sqrt{2}\sigma_j), u_j = 1 - y_j(\alpha^T k_j + b_1) - \epsilon, \bar{u}_j = -(1 - y_j(\alpha^T k_j + b_1) + \epsilon), \sigma_j = \sqrt{\alpha^T M_j \alpha}, \bar{M}_i = k_i k_i^T + M_i, M_i is the \bar{n} \times \bar{n} matrix whose element at the ith row and jth column is (\partial k(\bar{x}_i, x)/\partial x |_{x=x_i})^T \Sigma_i\, (\partial k(\bar{x}_j, x)/\partial x |_{x=x_i}), k(\bar{x}_i, x) is a kernel function, and k_i is the ith column of the kernel matrix obtained from the reduced set and the means of the uncertain objects.

Similarly, we obtain a kernel version of (20), denoted by

\min_{\beta,b_2} \frac{c_3(\beta^T K \beta + b_2^2)}{2} + \sum_{i=l+1}^{n} \frac{s_i(\beta^T \bar{N}_i \beta + 2\beta^T k_i b_2 + b_2^2)}{2} + c_4 \sum_{j=1}^{l} s_j \tau \left[ \frac{v_j\,\mathrm{erfc}(-e_{3,j})}{2} + \frac{\bar{\sigma}_j \exp(-(e_{3,j})^2)}{\sqrt{2\pi}} \right] + c_4 \sum_{j=1}^{l} s_j \bar{\tau} \left[ \frac{\bar{v}_j\,\mathrm{erfc}(-e_{4,j})}{2} + \frac{\bar{\sigma}_j \exp(-(e_{4,j})^2)}{\sqrt{2\pi}} \right]    (36)

where v_j = 1 - y_j(\beta^T k_j + b_2) - \epsilon, \bar{v}_j = -(1 - y_j(\beta^T k_j + b_2) + \epsilon), e_{3,j} = v_j/(\sqrt{2}\bar{\sigma}_j), e_{4,j} = \bar{v}_j/(\sqrt{2}\bar{\sigma}_j), \bar{\sigma}_j = \sqrt{\beta^T N_j \beta}, \bar{N}_i = k_i k_i^T + N_i, and N_i is the \bar{n} \times \bar{n} matrix whose element at the ith row and jth column is defined as (\partial k(\bar{x}_i, x)/\partial x |_{x=x_i})^T \Sigma_i\, (\partial k(\bar{x}_j, x)/\partial x |_{x=x_i}).

The model that consists of (35) and (36) gives DFTSVM in nonlinear cases. Unlike TSVM, (35) and (36) are not a perfect kernelization of (19) and (20), since we make use of an approximate scheme to obtain the uncertain information in a feature space. When the number of uncertain objects is large, we can perform the c-means clustering algorithm [2] on the means of the uncertain objects to obtain a set of data points as the reduced set. From (35), we observe that uncertain objects in the kernel space are formed, i.e., the ith uncertain object is regarded as a random vector with mean k_i and covariance M_i in (35). It is clear that (35) and (36) are also unconstrained optimization problems. We employ the BFGS algorithm to solve them. After (35) and (36) are optimized, a decision rule that has a similar form to the linear one is employed to classify uncertain objects.
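For a Gaussian kernel k(\bar{x}_a, x) = \exp(-\|\bar{x}_a - x\|^2/(2\sigma^2)), the gradient needed for M_i has the closed form k(\bar{x}_a, x_i)(\bar{x}_a - x_i)/\sigma^2, so k_i and M_i can be assembled as in the following sketch (our own helper, assuming the reduced set Xbar is stored row-wise):

```python
import numpy as np

def kernel_object_stats(x_i, Sigma_i, Xbar, sigma2):
    """Mean k_i and covariance M_i of the i-th uncertain object in the reduced
    kernel space, via the Taylor expansion phi(X_i) ~ phi(x_i) + J(X_i - x_i)."""
    diff = Xbar - x_i                                 # rows: xbar_a - x_i
    k_i = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigma2))
    G = (k_i[:, None] * diff) / sigma2                # row a: grad of k(xbar_a, .) at x_i
    M_i = G @ Sigma_i @ G.T                           # n_bar x n_bar matrix M_i
    return k_i, M_i
```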
IV. EXPERIMENTAL RESULTS

In this section, we perform a series of experiments on synthetic examples and data sets from real-world applications to validate the performance of DFTSVM. The DFTSVM model contains several parameters affecting its performance. As done in TSVM, we let c_1 = c_3 and c_2 = c_4 to reduce the number of parameters. The parameters c_1 and c_2 are selected from the set {10^i, i = −3, −2, ..., 2, 3}. When the Gaussian kernel k(x_i, x_j) = \exp(-\|x_i - x_j\|^2/(2\sigma^2)) is employed, σ^2 is taken from the set {10^i, i = −3, −2, ..., 2, 3}. The parameter τ is chosen from the set {0, 0.3, 0.5, 0.7, 0.9, 1}, and we let ϵ = 0.5, δ = 5, and k = 5 in the k-nearest neighbors.

A. Synthetic examples

The cross-plane problem can be effectively handled by TSVM-related methods rather than SVM-related methods. When data contain uncertain information, it is useful to explore the structural information of the uncertain data. In this subsection, we employ the cross-plane problem of uncertain data to validate that DFTSVM is effective for capturing the structural information of data. Following the idea in [43], we generate uncertain objects in a two-dimensional space for visualization. The uncertain objects with Gaussian distributions are close to the two straight lines x(1) − x(2) = 0 and x(1) + x(2) = 0. To generate uncertain objects, we first obtain a series of data points from these two straight lines, and then we perturb each coordinate of the data points to make them slightly deviate from the straight lines. The perturbed data points are regarded as the means of the uncertain objects. The covariance matrices of the uncertain objects are sampled from Wishart distributions. In the experiments, we directly use the function "wishrand(Σ, df)" in the Matlab toolbox, where Σ is the base matrix and df is the level of deviation of the covariance matrix. We take df = 2 + 2*rand(1, n) and change the degree of uncertainty by tuning Σ. The covariance matrices with Σ = 2I_m have much bigger uncertainty than those with Σ = I_m.
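A Python analogue of this Matlab construction, with scipy.stats.wishart in place of wishrand; the helper name and the sampling ranges for the means are our guesses, since the paper does not state them.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)

def cross_plane_objects(n_per_class=100, base=np.eye(2), spread=0.3):
    """Means near the lines x(1) - x(2) = 0 and x(1) + x(2) = 0; covariances
    drawn from a Wishart distribution with df = 2 + 2*rand and base matrix."""
    t = rng.uniform(-8.0, 8.0, 2 * n_per_class)
    means = np.stack([t, t], axis=1)
    means[n_per_class:, 1] *= -1.0                  # second class on x(1)+x(2)=0
    means += rng.normal(0.0, spread, means.shape)   # slight deviation from the lines
    covs = [wishart.rvs(df=2.0 + 2.0 * rng.random(), scale=base)
            for _ in range(2 * n_per_class)]
    labels = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]
    return means, covs, labels
```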


[Figure] Fig. 1: Visualization of two nonparallel hyperplanes (straight lines) obtained by UTSVM and DFTSVM. (a) UTSVM with I_m; (b) UTSVM with 2I_m; (c) DFTSVM with I_m; (d) DFTSVM with 2I_m.

Fig. 1 shows uncertain objects generated by changing Σ. In Fig. 1, the sign '+' denotes the means of uncertain objects in the positive class, the sign '*' denotes the means of uncertain objects in the negative class, and the confidence ellipses for '+' and '*' denote the degree of uncertainty. In fact, the hyperplanes in a two-dimensional space are straight lines. Fig. 1 shows the hyperplanes achieved by UTSVM and DFTSVM with different base matrices. From Fig. 1, we can see that uncertain information affects the hyperplanes. When the data do not contain outliers, both UTSVM and DFTSVM can capture the main structure of the uncertain objects. In addition, we verify the robustness of DFTSVM in the presence of outliers. To this end, we generate outliers by changing the labels of some uncertain objects. Fig. 2 shows uncertain data containing outliers. Once again, we employ UTSVM and DFTSVM to attain two nonparallel hyperplanes. From Fig. 2, we find that the hyperplanes obtained by UTSVM are severely affected by outliers since they deviate from the means of the uncertain objects. The DFTSVM model performs well since its hyperplanes can fit the two classes of uncertain objects to some degree. Overall, the experiments show that the DFTSVM model is robust in the presence of outliers.

Note that in the above experiments, we explore uncertain objects with Gaussian distributions. However, uncertain objects do not follow Gaussian distributions in the general case. Hence, it is necessary to study the effectiveness of DFTSVM on uncertain objects with non-Gaussian distributions. Here, we generate uncertain objects from skew-normal distributions [42]. The ith uncertain object in class k has the representation O_i^k ∼ SN(ζ^k + o, Q, α) (i = 1, ..., n_k, k = 1, 2), where ζ^k is a location parameter, α is a parameter tuning the skewness of the distribution, Q is a positive definite matrix, and o is a random vector whose elements are sampled from the uniform distribution (−0.05, 0.05). In our experiments, Q = (1, −0.95; −0.95, 1) and α = (1000, 0). We first generate two sets of uncertain objects, where each class contains 150 uncertain objects. For the first data set (a), half of the uncertain objects in the positive class are generated by ζ_1^1 = (−0.25, 0)^T, the other half of the uncertain objects in the positive class are generated by ζ_2^1 = (0.25, 0)^T, and the uncertain objects in the negative class are produced by using ζ^2 = (0, 0). For the second data set (b), half of the uncertain objects in the positive class are generated by ζ_1^1 = (−0.5, 0)^T, the other half by ζ_2^1 = (0.5, 0)^T, and the uncertain objects in the negative class are generated by following the same parameters as in the first data set. In addition to the data sets (a and b), we change twenty percent of the labels of the data sets (a and b) to make them contain outliers. This produces two new data sets (c and d).
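Multivariate skew-normal sampling is not built into SciPy; one standard route is the additive representation z = δ|u_0| + v with v ~ N(0, Q − δδ^T) and δ = Qα/\sqrt{1 + α^T Q α}, which the following sketch (ours, under the parameterization we read into SN(ζ, Q, α)) uses:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_skew_normal(zeta, Q, alpha, n):
    """n draws from the multivariate skew-normal SN(zeta, Q, alpha) via the
    additive representation z = delta*|u0| + v, v ~ N(0, Q - delta delta^T)."""
    alpha = np.asarray(alpha, dtype=float)
    delta = Q @ alpha / np.sqrt(1.0 + alpha @ Q @ alpha)
    u0 = np.abs(rng.standard_normal(n))
    v = rng.multivariate_normal(np.zeros(len(zeta)), Q - np.outer(delta, delta), n)
    return np.asarray(zeta) + u0[:, None] * delta + v

Q = np.array([[1.0, -0.95], [-0.95, 1.0]])
pos_half = sample_skew_normal([-0.25, 0.0], Q, [1000.0, 0.0], 75)   # data set (a)
```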
We compare the DFTSVM model with some uncertain data classifiers, such as the power SVM (PSVM) [50], MMCGU [40], structural regularized support vector machines (SRSVM) [2], the second-order cone-programming approach for robust classification (SOCP-RC) [38], uncertain kernel Fisher discriminant analysis (UKFDA) [42], and UTSVM [43]. The specific parameters of the models are given in the supplemental material. We generate fifty data points from each uncertain object and use these data points to estimate its mean and covariance matrix. We describe the performance of the various classifiers in terms of the error rate, which is inversely related to accuracy [14], under tenfold cross-validation. The experimental results are shown in Table I.

As shown in Table I, the classifiers obviously yield lower error rates if the location parameters have a large distance. This


[Figure] Fig. 2: Visualization of two nonparallel hyperplanes from UTSVM and DFTSVM with outliers. (a) UTSVM with I_m; (b) UTSVM with 2I_m; (c) DFTSVM with I_m; (d) DFTSVM with 2I_m.

TABLE I: Error rates (%) and standard deviations of various methods on artificial data sets
Data sets PSVM [50] MMCGU [40] UKFDA [42] SRSVM [2] SOCP-RC [38] UTSVM [43] DFTSVM
a 17.99(0.32) 16.87(1.72) 17.32(0.89) 17.43(0.26) 17.51(0.42) 16.32(0.25) 15.33(0.31)
b 9.21(1.55) 3.33(1.42) 3.41(1.36) 3.49(1.21) 4.46(1.62) 2.66(1.43) 2.21(1.29)
c 20.21(0.89) 18.33(2.01) 19.72(1.05) 19.95(0.51) 18.03(0.89) 18.15(0.50) 17.09(0.42)
d 13.37(2.52) 5.76(2.48) 6.23(1.58) 6.37(1.89) 5.96(1.73) 6.13(1.72) 4.52(1.57)

comes from the fact that uncertain objects in two classes have large margins in such a case. We can see that the performance of each method becomes worse when the data set contains label noise. Our DFTSVM model is clearly superior to the other models in the presence of label noise since we employ the insensitive loss function and fuzzy membership functions in designing our classifier. Overall, the DFTSVM model is robust in the presence of label noise.

B. Experiments on the UCI data sets

In this subsection, we validate the DFTSVM model on some data sets from the UCI repository. These data sets are often employed to test the performance of classifiers. The statistical information of the data sets we use in binary classification problems is Australian (14 attributes/690 samples), Breast (10/683), Diabetes (8/768), Heart (13/270), Ionosphere (34/351), Monk1 (6/556), Monk2 (6/601), Monk3 (6/554), and Sonar (60/208). The original features of the samples in these data sets are not described by probability density functions. Following the ideas in [42], [43], we generate uncertain objects from the original samples for each data set. In addition to the initial uncertain data, we also explore two types of noise (feature noise and label noise) on the uncertain data. For feature noise, we corrupt the means of uncertain objects with zero-mean Gaussian noise. The ratio of the variance of the noise to that of the means of uncertain objects is set to 0.05. For label noise, we randomly flip the labels of ten percent of the uncertain objects in the training set. We use these new data sets to verify the robustness of DFTSVM. We compare the DFTSVM model with the models in subsection IV-A. When implementing SOCP-RC, we make each class of objects have the same uncertainty degree to reduce the number of second-order cone constraints. For each data set, we randomly choose seventy percent of the uncertain objects for training and use the rest for testing. To reduce the randomness from the division of the data set, we report experimental results over twenty runs and make use of an additional five runs to select the parameters of the models. The number of reduced sets in Gaussian kernels is 100. Table II lists the error rates of the various methods and their standard deviations, where F and L denote feature noise and label noise, respectively.

As can be seen from Table II, the error rate of each method increases when feature and label noise appears. When data do not contain noise, the MMCGU model obtains the best performance on the Australian data set, and DFTSVM does not always give better performance on some data sets. This is because the weights of uncertain objects in DFTSVM are not optimal, and small weights weaken the contributions to the loss function. The PSVM model does not perform well since we employ covariance matrices to obtain uncertainty. Among all of the methods we test, only DFTSVM explores the weights of uncertain objects. When data are polluted, our DFTSVM model achieves better performance than the other models in most


TABLE II: Error rates (%) and standard deviations of various methods with nonlinear kernels on the UCI data sets (unmarked rows: uncertain data; F: feature noise; L: label noise)
Data sets PSVM [50] MMCGU [40] UKFDA [42] SRSVM [2] SOCP-RC [38] UTSVM [43] DFTSVM
Australian 14.92±3.78 13.02±2.45 13.12±3.05 13.71±4.02 13.27±5.12 13.09±3.23 13.03±3.48
Australian F 15.32±3.21 14.89±2.74 15.02±3.42 15.12±3.22 14.77±4.89 14.75±4.33 14.05±3.28
Australian L 16.79±3.96 15.89±3.01 16.32±3.42 16.78±3.29 16.12±4.05 15.32±3.46 15.12±3.25
Breast 4.54±1.62 4.13±1.52 4.01±1.52 4.05±2.03 3.44±1.62 3.19±1.05 3.23±1.31
Breast F 6.97±2.01 4.98±2.12 5.24±4.06 5.58±3.09 5.49±3.72 4.73±5.02 4.13±1.59
Breast L 7.89±3.04 5.33±2.78 5.68±3.05 5.99±4.05 5.35±3.12 5.27±3.62 5.09±2.85
Diabetes 26.79±2.42 25.95±2.46 26.42±3.92 26.35±2.49 25.58±3.52 24.79±3.53 24.56±3.49
Diabetes F 29.34±3.81 27.39±2.58 28.56±3.23 28.95±3.45 27.52±3.62 27.41±2.59 27.03±3.05
Diabetes L 31.24±3.05 28.53±3.17 28.99±3.41 29.02±3.42 28.67±3.42 28.33±3.05 28.02±3.52
Heart 22.89±4.81 21.77±4.08 22.28±3.86 22.38±4.79 22.49±4.54 20.83±4.56 21.08±4.63
Heart F 25.32±4.77 23.46±4.72 23.89±4.22 23.75±5.03 23.46±4.95 23.49±4.32 23.25±4.17
Heart L 26.58±4.09 24.52±5.03 24.88±5.12 24.52±6.03 24.98±5.32 24.48±5.06 23.89±4.25
Ionosphere 6.48±3.52 5.68±3.47 5.95±3.02 5.98±3.81 5.86±3.52 5.42±3.52 5.39±3.81
Ionosphere F 8.98±3.95 7.32±3.52 7.99±3.82 8.07±4.12 7.45±3.89 7.26±3.22 7.33±3.45
Ionosphere L 10.05±4.32 8.52±3.27 8.92±4.05 9.03±5.12 8.67±4.15 8.39±4.01 8.27±3.92
Monk1 3.23±3.51 2.09±2.46 2.18±1.69 2.52±3.42 1.21±3.12 1.02±1.62 1.07±1.89
Monk1 F 5.32±3.03 3.55±2.08 3.98±2.69 4.05±3.01 3.65±2.32 3.46±2.47 3.53±2.41
Monk1 L 5.89±3.02 4.56±2.89 4.92±3.01 5.21±3.05 4.82±2.77 4.25±3.01 4.24±2.56
Monk2 14.21±4.25 12.29±4.85 13.35±4.56 13.46±4.23 12.75±5.01 11.29±3.54 11.15±3.24
Monk2 F 16.82±4.21 14.57±4.62 14.76±4.53 15.33±4.05 14.68±3.89 14.62±4.53 14.21±4.09
Monk2 L 17.92±5.05 15.32±4.23 15.93±5.24 15.91±4.05 15.39±4.77 15.43±4.29 14.99±4.08
Monk3 4.51±1.82 3.62±1.48 4.07±2.32 3.92±2.05 3.58±1.43 3.08±1.45 3.10±1.78
Monk3 F 6.79±2.02 5.45±1.68 5.89±1.96 6.05±3.12 5.62±1.73 5.32±2.05 5.37±1.79
Monk3 L 7.52±2.26 5.99±1.75 6.38±1.65 6.93±2.01 5.35±1.82 6.02±1.99 5.49±1.88
Sonar 12.41±5.32 11.49±5.85 12.28±5.32 12.21±4.01 10.04±5.08 10.11±4.89 10.02±5.01
Sonar F 14.59±5.82 13.21±5.27 13.79±5.46 13.64±5.25 13.09±5.12 13.28±5.32 13.02±5.28
Sonar L 15.33±5.23 14.63±4.72 14.58±5.05 15.02±5.39 14.23±5.04 14.55±4.77 13.29±5.22
Average rank 7.00 3.55 5.00 5.44 3.66 1.77 1.55
Average rank F 7.00 2.72 5.11 5.77 3.61 2.33 1.44
Average rank L 7.00 3.16 5.00 5.61 3.55 2.55 1.11
Friedman test χ2F=44.95, p=4.78e-8
Friedman test F χ2F=47.32, p=1.61e-8
Friedman test L χ2F=46.22, p=2.67e-8

cases.

Since we test different models on multiple data sets, it is reasonable to use a statistical test to compare these models. The Friedman test with post-hoc tests [51], as a statistical tool, is chosen to rank the classifiers. The null hypothesis of the Friedman test is that all the classifiers have equal ranks. Under this null hypothesis, we need to compute the Friedman statistic, whose distribution is \chi_F^2 with k − 1 degrees of freedom, where k is the number of classifiers. The Friedman test makes use of the average ranks of the classifiers. The average ranks of the different classifiers in the three cases are listed in Table II. From the average ranks of the methods, we obtain the Friedman statistics, which are given in the last lines of Table II. From the Friedman statistics in Table II, we can see that the null hypothesis does not hold. The Nemenyi test is then used to show whether the performance of two classifiers differs in a statistical sense. Two classifiers are significantly different in statistics if the difference between their average ranks is greater than the critical difference CD = q_\alpha \sqrt{k(k+1)/(6N)} [51], where N is the number of data sets and q_\alpha is calculated by using the Studentized range statistic.
[51] are also employed to visualize the difference between each class of odd numbers to form the training set. For the
the performance of different classifiers. Fig.3 shows the CD test set, we randomly choose one hundred samples from each
diagrams of various classifiers in three cases, where the value odd class and twenty samples from each even class. There
of CD is obtained by using α = 0.05. From Fig.3, we can see are ten classes in the test set, but there are only five odd
that the DFTSVM model is clearly superior to other models in classes in the training set. For each odd class in the training
the presence of noise since the mean rank of DFTSVM is much set, the corresponding odd numbers are used as the positive
lower than those of other models. We have theoretically shown class, and the other odd classes form the negative class. Thus,
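For concreteness, the following sketch (our illustration, not code from the paper) computes the average ranks, the Friedman statistic, and the Nemenyi critical difference from a matrix of error rates; with k = 7 classifiers and N = 9 data sets it yields CD ≈ 3.00, consistent with the critical distance reported in Fig.3:

```python
import numpy as np
from scipy.stats import chi2, rankdata

def friedman_cd(error_rates, q_alpha=2.949):
    """error_rates: (N data sets) x (k classifiers), lower is better.
    q_alpha = 2.949 is the Studentized-range value for k = 7, alpha = 0.05."""
    N, k = error_rates.shape
    ranks = np.vstack([rankdata(row) for row in error_rates])  # rank 1 = best
    avg_ranks = ranks.mean(axis=0)                             # as in Table II
    chi2_f = 12.0 * N / (k * (k + 1)) * (np.sum(avg_ranks ** 2)
                                         - k * (k + 1) ** 2 / 4.0)
    p_value = chi2.sf(chi2_f, df=k - 1)     # chi-square with k-1 degrees of freedom
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))  # Nemenyi critical difference
    return avg_ranks, chi2_f, p_value, cd
```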


Fig. 3: Comparisons of various methods in terms of the CD diagrams (Critical Distance = 3.0021): (a) uncertain data; (b) uncertain data with feature noise; (c) uncertain data with label noise.

TABLE III: Precision (%) and average precision of various methods on handwritten numerical data sets
Data sets PSVM [50] MMCGU [40] UKFDA [42] SRSVM [2] SOCP-RC [38] UTSVM [43] DFTSVM
1 vs.(3,5,7,9) 82.37±5.38 85.96±3.92 84.73±4.06 83.75±3.70 86.32±2.89 86.00±3.89 87.84±4.05
3 vs.(1,5,7,9) 75.68±3.46 78.12±4.45 77.35±3.84 75.52±4.06 80.16±2.77 80.33±3.76 84.85±4.47
5 vs.(1,3,7,9) 70.33±4.08 73.26±3.75 73.16±3.97 72.28±4.15 70.16±2.85 72.28±4.05 75.36±3.89
7 vs.(1,3,5,9) 89.87±4.91 92.91±4.09 90.33±4.32 90.15±4.60 90.31±4.53 92.09±3.02 93.02±4.57
9 vs.(1,3,5,7) 56.20±5.20 61.52±4.09 59.55±4.09 62.32±2.29 63.18±4.51 63.05±4.02 62.18±4.57
Average 74.89 78.35 77.02 76.80 78.02 78.75 80.65

C. Experiments on the open-set environment

In this subsection, we evaluate the DFTSVM model in the open-set environment. Unlike classical classification problems, the open-set recognition experiment explores the problem where some classes do not appear at the training stage. This simulates the case that the distributional information of the training set is different from that of the test set. The data we use are taken from the USPS data set. This handwritten data set contains 9,298 samples. To simulate the open-set recognition environment, we randomly choose one hundred samples from each class of odd numbers to form the training set. For the test set, we randomly choose one hundred samples from each odd class and twenty samples from each even class. There are ten classes in the test set, but there are only five odd classes in the training set. For each odd class in the training set, the corresponding odd numbers are used as the positive class, and the other odd classes form the negative class. Thus, we obtain five binary classification problems. We evaluate the performance of each model on the ten classes of handwritten numeral characters. Since the test set contains five classes that do not appear in the training set, we employ the precision and the average precision to measure the performance of the various methods. The precision of the ith binary problem is defined as P_i = TP_i / (TP_i + FP_i), where TP_i is the number of samples that are correctly classified as the positive class, and FP_i is the number of samples in the negative class that are misclassified. To generate uncertain objects from the image data, we first obtain the eight neighbors of each pixel in the image, and then we compute the mean and variance from the current pixel and its eight neighbors (a sketch of this construction is given at the end of this subsection). Thus, for each image, we generate a random vector whose covariance is a diagonal matrix. In the experiments, we test the different models with linear kernels, and Table III shows the experimental results.

As shown in Table III, none of the methods obtain good performance on 9 vs. (1,3,5,7), but they obtain better performance on 7 vs. (1,3,5,9). UTSVM is superior to DFTSVM on 9 vs. (1,3,5,7). This may come from the fact that the weights of uncertain objects in DFTSVM are not optimal. MMCGU, SOCP-RC, and UTSVM obtain similar performance on this data set. It is observed from Table III that the average precision of DFTSVM is the highest among all the methods we test. In summary, the DFTSVM model is superior to the other models since it uses an insensitive loss function and a fuzzy membership function to construct the classifier.
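As referenced above, the following is a minimal sketch (our own illustration; the paper gives no code) of how an uncertain object can be built from a grayscale image; padding the border by edge replication is our assumption:

```python
import numpy as np

def image_to_uncertain_object(img):
    """For every pixel, compute the mean and variance over its 3x3 neighborhood
    (the pixel plus its eight neighbors), giving a random vector whose
    covariance matrix is diagonal."""
    padded = np.pad(img.astype(float), 1, mode="edge")
    # the nine shifted copies of the image: the center pixel and its 8 neighbors
    shifts = [padded[i:i + img.shape[0], j:j + img.shape[1]]
              for i in range(3) for j in range(3)]
    stack = np.stack(shifts, axis=0)        # shape (9, H, W)
    mean = stack.mean(axis=0).ravel()       # mean vector of the random vector
    var = stack.var(axis=0).ravel()         # diagonal of its covariance matrix
    return mean, var
```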


D. Potential applications in modeling large-scale data sets

Although the DFTSVM model is originally designed for handling uncertain objects, in this subsection, we show how to make DFTSVM handle large-scale data sets by modeling probability density functions for each class of samples. Here, we define the following framework to achieve the hyperplane w_1^T x + b_1 = 0:

\min_{w_1,b_1} \; c_1 g(w_1,b_1) + p_+ \int_{\mathbb{R}^m} L_+(x) f_+(x)\,dx + c_2 p_- \int_{\mathbb{R}^m} L_-(x) f_-(x)\,dx,    (37)

where g(w_1, b_1) is a regularization term that controls the complexity of the model, p_+ is the prior probability of the positive class, p_- is the prior probability of the negative class, L_+(x) is a loss function of the positive sample x, L_-(x) is a loss function of the negative sample x, f_+(x) is the probability density function of positive samples, and f_-(x) is the probability density function of negative samples.

The framework of (37) is generic for supervised learning since we do not give the specific loss function and regularization term. Moreover, L_+(x) and L_-(x) in (37) may take the same loss function. Although the exact density functions f_+(x) and f_-(x) are unknown, we can approximate them by using some methods. It is noted that the Gaussian mixture model (GMM) [45] is very useful for representing probability distributions in unsupervised learning. When the number of components in GMM is large enough, the GMM method can model any complex density function with proper precision. Based on this idea, we model f_+(x) and f_-(x) by a convex combination of Gaussian distributions, respectively. Specifically, we use Gaussian mixture models to approximate f_+(x) and f_-(x), denoted by

f_+(x) = \sum_{i=1}^{l} q_i f_{X_i}(x), \qquad f_-(x) = \sum_{i=l+1}^{n} q_i f_{X_i}(x),    (38)

where q_i is the proportion of the ith Gaussian distribution, q_i ≥ 0, \sum_{i=1}^{l} q_i = 1, \sum_{i=l+1}^{n} q_i = 1, and f_{X_i}(x) denotes the Gaussian density function with mean x_i and covariance Σ_i. To be consistent with the proposed DFTSVM, we assume that l Gaussian functions are employed to model f_+(x) and that n − l Gaussian functions are used to model f_-(x). In fact, the number of components in GMM is generally determined by some information criterion, but this is not a key issue here since we use GMM in supervised learning. The famous expectation-maximization algorithm [45] is used to estimate the parameters of GMM. When we adopt the regularization term and the losses in (8), the following model is achieved by substituting (38) into (37):

\min_{w_1,b_1} \; \frac{c_1}{2}(\|w_1\|^2 + b_1^2) + \sum_{i=1}^{l} \frac{p_+ q_i}{2} \int_{\mathbb{R}^m} (w_1^T x + b_1)^2 f_{X_i}(x)\,dx + c_2 p_- \sum_{j=l+1}^{n} q_j \int_{\mathbb{R}^m} L_{\epsilon\tau}(1 - y_j(w_1^T x + b_1)) f_{X_j}(x)\,dx.    (39)

It is interesting to note that (39) is equivalent to (8) if we regard p_+ q_i and p_- q_j as the weights of the uncertain objects in (8). For (39), each component in the Gaussian mixture model is a Gaussian distribution, but for (8), we do not assume that all uncertain objects follow Gaussian distributions. We can obtain the hyperplane w_1^T x + b_1 = 0 by solving (39). Similarly, we can obtain the hyperplane w_2^T x + b_2 = 0 under GMM. When the distribution inputs of DFTSVM are obtained by performing the GMM method on deterministic samples, we refer to it as the GMM-based DFTSVM (GDFTSVM).

To verify the effectiveness of (39), we perform experiments on NDC data sets that are generated by the David Musicant data generator [52]. Using the generator, we can produce data sets of any size with a fixed dimension of 32. When there are plenty of samples in the training set, one often selects some prototypes from the original samples to reduce the training time of models. Here, the main aim of the experiments is to test whether covariance matrices from GMM are beneficial for improving the models. Unlike previous experiments, we use the prototypes from GMM to train the classifiers. To this end, we test classifiers including the fuzzy SVM (FSVM) [18], FTSVM [17], IFTSVM [27], the sparse pinball TSVM (SPTSVM) [53], PIFTSVM [29], and SRSVM. For these classifiers, we use the decision rule of TSVM since the test samples are not uncertain objects. When implementing SPTSVM, we neglect the weights of the prototypes from GMM since SPTSVM does not contain the weights of samples. Moreover, we have pointed out that PIFTSVM degenerates into SPTSVM under proper conditions [29]. The main reason for choosing these classifiers is that we can directly use the weights of the prototypes from GMM. When implementing SRSVM, we regard each component of GMM as a prototype. Thus, covariance matrices are directly employed in SRSVM.

In our experiments, we change the size of the data set while fixing the number of components of GMM to 100. That is, we use a convex combination of 100 Gaussian functions to model the probability density function of the samples, and the number of prototypes in each class is actually 100 for the classifiers we test. At the training stage, 70% of the samples are randomly selected to obtain the GMM, and 100 prototypes of each class are achieved. At the test stage, the rest (30%) of the samples are used for testing. Table IV lists the performance indexes of the various methods, including the error rates of the classifiers and the training time of GMM and the classifiers, where 1k = 1000 and 1e-2 = 0.01.

As can be seen from Table IV, as the number of data points varies from 10k to 200k, it is necessary to consume much more time to implement GMM, whose training time varies from 5.87 seconds to 198.8 seconds. When we only use the prototypes from GMM to train FSVM, FTSVM, IFTSVM, SPTSVM, and PIFTSVM, we find that they cannot obtain better performance. This is due to the fact that the number of prototypes from GMM is not large. In general, these classifiers require many more data points to improve their performance. Note that SPTSVM is not better than PIFTSVM since SPTSVM neglects the weights of the prototypes. When covariance matrices are employed, we find that SRSVM and GDFTSVM significantly outperform the other models since both models make full use of the covariance matrices from GMM. The performance of GDFTSVM is better than that of SRSVM since the former is derived from GMM and the latter does not explore the weights of the prototypes. It is found that GDFTSVM consumes much more time than the other models since GDFTSVM involves computing integrals. From Table IV, we find that the performance of each method is not always improved as the size of the generated data set changes. This comes from the fact that the size of the test set becomes large as the number of samples in the generated data set increases.

In addition, we change the number of components of GMM to explore the performance of the various methods. We choose NDC (100k) as our data set. Fig.4 shows the error rates of the various methods as the number of components of GMM varies. As can be seen from Fig.4, GDFTSVM and SRSVM are obviously better than the other methods since these two models make use of covariance matrices. Note that GDFTSVM achieves better performance than SRSVM since SRSVM does not consider the weights of the prototypes and has a non-smooth objective function. Overall, it is reasonable to employ covariance matrices to improve the performance of models when some prototypes (means of clusters) of data sets, instead of all data points, are used to train classifiers on large-scale data sets.
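To see why the integrals in (39) are tractable, note the standard Gaussian identity (a routine step, consistent with but not copied from the paper's derivation). For the quadratic term,

\int_{\mathbb{R}^m} (w_1^T x + b_1)^2 f_{X_i}(x)\,dx = (w_1^T x_i + b_1)^2 + w_1^T \Sigma_i w_1,

which follows from E[z²] = (E[z])² + Var(z) with z = w_1^T x + b_1 ∼ N(w_1^T x_i + b_1, w_1^T Σ_i w_1). The pinball-type term likewise reduces to one-dimensional Gaussian integrals, which one would expect to be expressible through the (complementary) error function (cf. [48]).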

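The prototype-extraction step described above can be sketched as follows, assuming scikit-learn's GaussianMixture as the EM implementation (the paper does not name a library); the component means serve as the 100 prototypes per class, the mixing proportions give the q_i in (38), and the component covariances are what SRSVM and GDFTSVM additionally exploit:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def class_prototypes(X_class, n_components=100, seed=0):
    """Fit a GMM to the samples of one class and return its component means
    (the prototypes), covariance matrices, and mixing weights (the q_i)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",
                          random_state=seed).fit(X_class)
    return gmm.means_, gmm.covariances_, gmm.weights_

# e.g., for the positive class of a large training set X_pos:
# means, covs, weights = class_prototypes(X_pos)
```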

TABLE IV: Error rates (%) and training time (s) of various methods on the NDC data sets (for each data set, the first row lists error rates and standard deviations, and the second row lists training time; GMM contributes only prototypes, so no error rate is reported for it)

Data sets   GMM [45]  FSVM [18]   FTSVM [17]  IFTSVM [27]  SPTSVM [53]  PIFTSVM [29]  SRSVM [2]   GDFTSVM
NDC(10k)    —         29.75±2.28  29.83±4.55  29.39±3.12   28.97±2.01   28.66±1.74    16.42±1.62  14.21±4.43
            5.87      4.0e-3      8.3e-3      1.6e-2       7.2e-3       1.7e-2        5.0e-3      0.04
NDC(30k)    —         28.25±6.56  27.79±6.23  28.17±4.28   23.47±3.50   22.52±5.05    17.53±4.03  15.94±6.85
            31.97     3.6e-3      7.4e-3      1.5e-2       5.0e-3       1.5e-2        5.2e-3      0.03
NDC(50k)    —         25.52±5.03  25.68±4.69  24.53±5.17   23.81±4.34   21.10±4.22    15.57±5.31  13.59±2.83
            52.76     4.0e-3      8.5e-3      1.7e-2       6.7e-3       1.6e-2        5.3e-3      0.84
NDC(70k)    —         28.32±4.57  28.14±5.25  26.34±6.03   29.88±5.26   29.58±4.88    14.68±5.18  12.68±5.54
            74.20     3.4e-3      7.5e-3      1.6e-2       5.1e-3       1.8e-2        6.0e-3      0.93
NDC(90k)    —         27.79±5.21  27.65±4.74  27.20±3.93   25.08±4.72   23.43±5.26    12.24±4.86  10.14±6.33
            93.79     4.0e-3      8.6e-3      2.0e-2       6.7e-3       1.9e-2        6.2e-3      1.83
NDC(100k)   —         28.13±3.53  27.39±4.78  27.16±5.45   24.55±4.62   24.12±4.56    13.21±3.95  11.61±6.65
            101.21    3.7e-3      7.5e-3      1.6e-2       6.5e-3       1.6e-2        5.2e-3      1.87
NDC(200k)   —         28.73±6.23  28.46±5.57  28.36±4.09   28.47±5.31   27.82±5.15    18.23±4.76  16.51±2.32
            198.8     4.0e-3      8.3e-3      2.0e-2       6.7e-3       1.8e-2        6.2e-3      1.05

TABLE V: Error rates (%) and standard deviations of various methods on medical data sets
Data sets PSVM [50] MMCGU [40] UKFDA [42] SRSVM [2] SOCP-RC [38] UTSVM [43] DFTSVM
Pneumonia images 7.05±3.10 6.25±3.13 6.36±2.79 6.32±2.75 6.13±3.45 6.22±2.98 5.78±3.01
Breast images 24.52±5.43 22.37±5.65 21.78±5.70 21.34±2.05 20.89±6.01 21.13±5.79 19.76±5.68

Fig. 4: Performance of various classifiers with the change of components in GMM (error rate (%) versus the number of components on a log10 scale, for FSVM, FTSVM, IFTSVM, SPTSVM, PIFTSVM, SRSVM, and GDFTSVM).

E. Medical images with deep features

It is interesting to explore deep features instead of raw features on some image data sets by using convolutional neural networks. In this subsection, we show how deep features are used in DFTSVM on two medical image sets in binary classification problems [49]. The breast image set consists of 780 images, and the pneumonia image set contains 5,856 images. Here, we achieve deep features by using a pre-trained convolutional neural network (CNN), i.e., ResNet18, and the features are taken from the layer "res5b-relu". The deep features of each image have a tensor representation of 7 × 7 × 512. We sample the features along the third axis and achieve features whose size is 7 × 7 × 128 (7 × 7 × 512/4). In such a case, we regard each image as 49 instances with 128 dimensions and model each image as a random vector with the mean and covariance computed from the 49 instances (see the sketch at the end of this subsection). For the pneumonia image set, we choose 1,148 images from the original test and validation sets to reduce the computational time of the algorithms. Seventy percent of the images are randomly chosen to form the training set, and the rest (30%) of the images are used for testing. In our experiments, the number of reduced sets in the kernel methods is set to 100 in the case of Gaussian kernels. Experimental results are reported over ten runs. Table V shows the experimental results on the two medical image sets.

As can be observed from Table V, MMCGU and UTSVM obtain similar performance on the pneumonia data set, and SRSVM and MMCGU achieve similar results on the breast data set. It is found that the DFTSVM model produces the lowest error rate and outperforms the other models on the two image data sets. In contrast to the other models, the DFTSVM model employs robust loss functions and fuzzy membership functions. Among all the methods, UTSVM and DFTSVM produce two nonparallel hyperplanes, while the other models give a single hyperplane. The experiments show that uncertain data classifiers can be used to classify deep features by modeling deep features as random vectors.
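As mentioned above, a sketch of this feature-to-distribution step (our illustration; subsampling every fourth channel is an assumption consistent with 512/4 = 128):

```python
import numpy as np

def deep_features_to_random_vector(feat):
    """feat: a (7, 7, 512) deep-feature tensor. View the 7x7 grid as 49
    instances of dimension 128 and summarize them by a mean and covariance."""
    sub = feat[:, :, ::4]                  # keep every 4th channel -> (7, 7, 128)
    inst = sub.reshape(-1, sub.shape[-1])  # 49 instances of dimension 128
    mean = inst.mean(axis=0)
    # with 49 instances in 128 dimensions the covariance is rank-deficient;
    # a small ridge (cov += delta * I) can enforce positive definiteness,
    # which DFTSVM assumes for its smooth, convex subproblems
    cov = np.cov(inst, rowvar=False)
    return mean, cov
```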


Fig. 5: Effect of parameters in DFTSVM on the image data set: (a) error rates of DFTSVM with varying c1 and c2; (b) error rates of DFTSVM with varying ϵ and τ; (c) the running time of DFTSVM with varying c1 and c2; (d) the running time of DFTSVM with varying ϵ and τ.

F. Sensitivity analysis of parameters of DFTSVM

Like TSVM and its variants, the parameters of DFTSVM affect its performance. We conduct a sensitivity analysis of the parameters of DFTSVM on the breast image data set. To investigate the effect of the parameters of DFTSVM, we follow experimental settings similar to those in Section IV.E. To visualize the experimental results in a three-dimensional space, we first study the effect of the parameters c1 and c2 by fixing ϵ = τ = 0.5 and σ = 0.1, and then we explore the effect of ϵ and τ by fixing c1 = c2 = 1 and σ = 0.1. Fig.5 shows the experimental results, where the error rates and the running time of DFTSVM are reported. Additional experiments on the parameters δ and k can be found in the supplemental material.

From Fig.5 (a), we find that the DFTSVM model achieves lower error rates if c1 takes small values and c2 takes values in the interval [0.1, 10]. From Fig.5 (b), we observe that τ = 0 does not yield good experimental results, but the classification performance of DFTSVM is stable over a wide range of parameters. This shows that the parameters τ and ϵ are much less sensitive than the parameters c1 and c2 in terms of classification performance. We can see from Fig.5 (c) and (d) that the running time of DFTSVM varies with the change of parameters. It is noted that the model with better performance requires more training time in the general case. The experiments show that achieving better performance with DFTSVM requires selecting proper parameters and that training the model for handling uncertain data consumes much more time.

V. CONCLUSIONS AND FURTHER WORK

This paper develops a novel fuzzy TSVM model with distribution inputs to handle uncertain objects. Unlike previous models, we introduce a fuzzy membership function of uncertain objects based on the Wasserstein distance. To solve the DFTSVM model, we transform it into a tractable model. The transformed model consists of convex and smooth optimization problems when the covariance matrices are positive definite. Some properties of DFTSVM are analyzed, and we also employ a reduced-set scheme to obtain a kernel version of DFTSVM. From the experimental results, we draw the following conclusions: 1) the pinball function in DFTSVM can reduce the effect of feature noise, and the weights of uncertain objects in DFTSVM can suppress label noise; 2) the GMM-based DFTSVM model achieves better performance than models that only depend on prototypes for approximately dealing with large-scale data sets; 3) modeling deep features as random vectors is suitable for classifiers designed for handling this type of uncertain data.

Although we have investigated DFTSVM from several aspects, there are some problems to be addressed in the future. How to achieve the fuzzy membership function of uncertain objects without Gaussian assumptions is interesting. In addition, developing efficient algorithms to solve DFTSVM is worthy of further study.

REFERENCES

[1] Jayadeva, R. Khemchandani, and S. Chandra, "Twin support vector machines for pattern classification," IEEE Trans. on PAMI, vol. 29, pp. 905–910, 2007.
[2] H. Xue, S. Chen, and Q. Yang, "Structural regularized support vector machine: A framework for structural large margin classifier," IEEE Transactions on Neural Networks, vol. 22, no. 4, pp. 573–587, 2011.
[3] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[4] Y. Wang, S. Wang, and K. K. Lai, "A new fuzzy support vector machine to evaluate credit risk," IEEE Trans. on Fuzzy Systems, vol. 13, pp. 820–831, 2005.
[5] R. Batuwita and V. Palade, "FSVM-CIL: Fuzzy support vector machines for class imbalance learning," IEEE Trans. on Fuzzy Systems, vol. 18, pp. 558–571, 2010.
[6] O. L. Mangasarian and E. W. Wild, "Multisurface proximal support vector machine classification via generalized eigenvalues," IEEE Trans. on PAMI, vol. 28, pp. 69–74, 2006.
[7] Y.-H. Shao, N.-Y. Deng, and Z.-M. Yang, "Least squares recursive projection twin support vector machine for classification," Pattern Recognit., vol. 45, pp. 2299–2307, 2012.
[8] Y. Tian, X. Ju, Z. Qi, and Y. Shi, "Improved twin support vector machine," Science China Mathematics, vol. 57, pp. 417–432, 2014.
[9] Y.-H. Shao, C. Zhang, X.-B. Wang, and N.-Y. Deng, "Improvements on twin support vector machines," IEEE Trans. on Neural Networks, vol. 22, pp. 962–968, 2011.
[10] X. Chen, J. Yang, Q. Ye, and J. Liang, "Recursive projection twin support vector machine via within-class variance minimization," Pattern Recognit., vol. 44, pp. 2643–2655, 2011.


[11] X. Peng, "TPMSVM: A novel twin parametric-margin support vector machine for pattern recognition," Pattern Recognit., vol. 44, pp. 2678–2692, 2011.
[12] Y. Xu, Z. Yang, and X. Pan, "A novel twin support-vector machine with pinball loss," IEEE Transactions on Neural Networks, vol. 28, no. 2, pp. 359–370, 2017.
[13] M. Tanveer, A. Tiwari, R. Choudhary, and M. A. Ganaie, "Large-scale pinball twin support vector machines," Machine Learning, vol. 111, pp. 3525–3548, 2022.
[14] M. Tanveer, A. Sharma, and P. Suganthan, "General twin support vector machine with pinball loss function," Information Sciences, vol. 494, pp. 311–327, 2019.
[15] J. López, S. Maldonado, and M. Carrasco, "Robust nonparallel support vector machines via second-order cone programming," Neurocomputing, vol. 364, pp. 227–238, 2019.
[16] M. Tanveer, T. Rajani, R. Rastogi, Y. Shao, and M. A. Ganaie, "Comprehensive review on twin support vector machines," Annals of Operations Research, 2022.
[17] A. R. de Mello, M. R. Stemmer, and A. L. Koerich, "Incremental and decremental fuzzy bounded twin support vector machine," Information Sciences, vol. 526, pp. 20–38, 2020.
[18] C.-F. Lin and S.-D. Wang, "Fuzzy support vector machines," IEEE Trans. on Neural Networks, vol. 13, no. 2, pp. 464–471, 2002.
[19] S. Chen and X. Wu, "A new fuzzy twin support vector machine for pattern classification," Journal of Machine Learning and Cybernetics, vol. 9, pp. 1553–1564, 2018.
[20] R. Wang, X. Y. Zhang, and W. Cao, "Clifford fuzzy support vector machines for classification," Advances in Applied Clifford Algebras, vol. 26, pp. 825–846, 2016.
[21] J. S. Sartakhti, N. Ghadiri, and H. Afrabandpey, "Fuzzy least squares twin support vector machines," Eng. Appl. Artif. Intell., vol. 85, pp. 402–409, 2019.
[22] M. A. Ganaie, M. Tanveer, and A. D. N. Initiative, "Fuzzy least squares projection twin support vector machines for class imbalance learning," Appl. Soft Comput., vol. 113, p. 107933, 2021.
[23] B. Richhariya and M. Tanveer, "A robust fuzzy least squares twin support vector machine for class imbalance learning," Applied Soft Computing, vol. 71, pp. 418–432, 2018.
[24] M. A. Ganaie, M. Tanveer, and C.-T. Lin, "Large scale fuzzy least squares twin SVMs for class imbalance learning," IEEE Trans. on Fuzzy Systems, 2022.
[25] P.-Y. Hao, "Asymmetric possibility and necessity regression by twin-support vector networks," IEEE Trans. on Fuzzy Systems, vol. 29, pp. 3028–3042, 2021.
[26] M. Ha, C. C. Wang, and J. Chen, "The support vector machine based on intuitionistic fuzzy number and kernel function," Soft Computing, vol. 17, pp. 635–641, 2013.
[27] S. Rezvani, X. Wang, and F. Pourpanah, "Intuitionistic fuzzy twin support vector machines," IEEE Trans. on Fuzzy Systems, vol. 27, pp. 2140–2151, 2019.
[28] S. Rezvani and X. Wang, "Intuitionistic fuzzy twin support vector machines for imbalanced data," Neurocomputing, vol. 507, pp. 16–25, 2022.
[29] Z. Liang and L. Zhang, "Intuitionistic fuzzy twin support vector machines with the insensitive pinball loss," Appl. Soft Comput., vol. 115, p. 108231, 2022.
[30] S. Laxmi and S. Gupta, "Multi-category intuitionistic fuzzy twin support vector machines with an application to plant leaf recognition," Eng. Appl. Artif. Intell., vol. 110, p. 104687, 2022.
[31] S. Rezvani and X. Wang, "Class imbalance learning using fuzzy ART and intuitionistic fuzzy twin support vector machines," Inf. Sci., vol. 578, pp. 659–682, 2021.
[32] S. Laxmi and S. K. Gupta, "Intuitionistic fuzzy proximal support vector machines for pattern classification," Neural Processing Letters, vol. 51, pp. 2701–2735, 2020.
[33] L. Bai, X. Chen, Z. Wang, and Y. Shao, "Safe intuitionistic fuzzy twin support vector machine for semi-supervised learning," Applied Soft Computing, 2022.
[34] M. Tanveer, M. A. Ganaie, A. Bhattacharjee, and C. T. Lin, "Intuitionistic fuzzy weighted least squares twin SVMs," IEEE Trans. on Cybernetics, vol. PP, 2022.
[35] B. Tavakkol, M. K. Jeong, and S. L. Albin, "Object-to-group probabilistic distance measure for uncertain data classification," Neurocomputing, vol. 230, pp. 143–151, 2017.
[36] B. Jiang, J. Pei, Y. Tao, and X. Lin, "Clustering uncertain data based on probability distribution similarity," IEEE Trans. on Knowledge and Data Engineering, vol. 25, pp. 751–763, 2013.
[37] Y. Li, J. Chen, and L. Feng, "Dealing with uncertainty: A survey of theories and practices," IEEE Trans. on Knowledge and Data Engineering, vol. 25, pp. 2463–2482, 2013.
[38] P. K. Shivaswamy, C. Bhattacharyya, and A. J. Smola, "Second order cone programming approaches for handling missing and uncertain data," J. Mach. Learn. Res., vol. 7, pp. 1283–1314, 2006.
[39] T. Falck, J. A. K. Suykens, and B. D. Moor, "Robustness analysis for least squares kernel based regression: an optimization approach," IEEE Conference on Decision and Control, pp. 6774–6779, 2009.
[40] C. Tzelepis, V. Mezaris, and I. Patras, "Linear maximum margin classifier for learning from uncertain data," IEEE Trans. on PAMI, vol. 40, pp. 2948–2962, 2018.
[41] ——, "Video event detection using kernel support vector machine with isotropic gaussian sample uncertainty (KSVM-iGSU)," in MMM, 2016.
[42] B. Tavakkol, M. K. Jeong, and S. L. Albin, "Measures of scatter and fisher discriminant analysis for uncertain data," IEEE Trans. on Systems, Man, and Cybernetics, vol. 51, pp. 1690–1703, 2021.
[43] Z. Liang and L. Zhang, "Uncertainty-aware twin support vector machines," Pattern Recognition, vol. 129, p. 108706, 2022.
[44] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2009.
[45] J. Delon and A. Desolneux, "A wasserstein-type distance in the space of gaussian mixture models," SIAM J. Imaging Sci., vol. 13, pp. 936–970, 2020.
[46] X. Huang, L. Shi, and J. A. K. Suykens, "Support vector machine classifier with pinball loss," IEEE Trans. on PAMI, vol. 36, pp. 984–997, 2014.
[47] T. Hu, D. Xiang, and D.-X. Zhou, "Online learning for quantile regression and support vector regression," Journal of Statistical Planning and Inference, vol. 142, pp. 3107–3122, 2012.
[48] N. Baikov, "Algorithm and implementation details for complementary error function," IEEE Trans. on Computers, vol. 66, pp. 1106–1118, 2017.
[49] J. Yang, R. Shi, and B. Ni, "Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis," in 18th International Symposium on Biomedical Imaging, 2021, pp. 191–195.
[50] W. Zhang, S. X. Yu, and S.-H. Teng, "Power SVM: Generalization with exemplar classification uncertainty," IEEE Conference on CVPR, pp. 2144–2151, 2012.
[51] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, 2006.
[52] D. Musicant, "Normally distributed clustered dataset," University of Wisconsin-Madison, Madison, WI, USA, 1998.
[53] M. Tanveer, A. Tiwari, R. Choudhary, and S. Jalan, "Sparse pinball twin support vector machines," Applied Soft Computing, vol. 78, pp. 164–175, 2019.

Zhizheng Liang received his Ph.D. in pattern analysis and intelligent systems from Shanghai Jiaotong University (China) in 2005. He is now an associate professor at the School of Computer Science and Technology, China University of Mining and Technology, China. He has published more than 70 papers in different fields. His current interests include fuzzy pattern recognition and machine learning.

Shifei Ding received his Ph.D. degree from Shandong University of Science and Technology in 2004. He is a professor and Ph.D. supervisor at China University of Mining and Technology. His research interests include intelligent information processing and granular computing. He has published five books, including one on twin support vector machines, and more than 180 papers in international conferences and journals. He is one of the Highly Cited Chinese Researchers.
