
Expert Systems With Applications 229 (2023) 120449

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

A new support vector machine for categorical features


Taeil Jung a, Jaejik Kim b,∗
a LG CNS, Seoul, 07795, South Korea
b Department of Statistics, Sungkyunkwan University, Seoul, 03063, South Korea

ARTICLE INFO

Keywords:
SVM
Categorical features
Mixed-type features
Variable importance

ABSTRACT

Support vector machine (SVM) was originally developed to solve binary classification problems for objects with only continuous features, and it often outperforms other classifiers. However, we often encounter datasets with mixed-type features or categorical features only. This study proposes an efficient SVM for dealing with such datasets. The proposed SVM uses a subset of categorical features and it performs well in most cases, including imbalanced and/or high-dimensional categorical datasets. In particular, it is more efficient than existing SVMs in high-dimensional categorical datasets. To validate its performance, it is applied to simulated datasets and various benchmark datasets, and it is also compared to existing SVMs for categorical features.

1. Introduction

Support vector machine (SVM) is a popular and efficient method for solving classification problems in the big data era. Basically, it finds a linear decision boundary maximizing the distance between two classes in a feature space transformed by a kernel function, and the decision boundary is identified by relying on a small number of training observations known as support vectors. Since SVM was developed by Boser et al. (1992) and Cortes and Vapnik (1995), its performance and applicability have been constantly improved and expanded by addressing the following key issues.

First, SVM requires solving a constrained quadratic optimization with a Lagrangian dual form, which is computationally expensive for a large number of observations and features. To solve this optimization problem quickly and accurately, Yuh-Jye and Mangasarian (2001b) introduced a new formulation of the optimization using a smoothing method, and Yuh-Jye and Mangasarian (2001a) proposed to use a small random subset of the data when solving the optimization and evaluating the nonlinear separating surface. In addition, Lee and Jang (2010) proposed a fast computing method for the optimization by decomposing the quadratic programming of 1-slack structural SVMs into a series of small quadratic programming problems.

Another issue with SVM is the extension to multi-class problems. SVM was originally designed for binary classification. The easiest way to solve multi-class problems is to consider multiple binary classifications such as one-versus-one and one-versus-all (Duan & Keerthi, 2005; Hsu & Lin, 2002). Another multiple binary classification approach performs SVMs within class hierarchies. Dietterich and Bakiri (1995) used decision trees to solve multi-class problems, and Platt et al. (1999) used directed acyclic graph structures. Instead of multiple binary classifications, Crammer and Singer (2001) proposed a multi-class SVM with a single optimization problem (see also Van den Burg & Groenen, 2016; Lee et al., 2004).

SVM was originally developed for classification on continuous features. However, in practical classification problems, we often encounter datasets with categorical features. To extend SVM to such cases, several approaches have been proposed. A simple approach is to transform categorical values into appropriate numeric values. Integer encoding and one-hot encoding, also known as dummy encoding, are examples of such transformations. Hsu et al. (2016) indicated that one-hot encoding is more stable than integer encoding in datasets with a moderate number of categorical features. Thus, in general, one-hot encoding is preferred over integer encoding.

Also, Carrizosa et al. (2017) proposed an SVM that identifies clusters of categories and builds the model in the clustered categorical feature space. This method has the advantage of providing interpretable results for categorical features. However, it uses one-hot encoding for each category and works only with linear decision functions in order to keep the interpretability. Wilson and Martinez (1997) introduced an overlap metric, which is a simple distance measure for categorical features. It simply assigns one to observations that have the same category value and zero to observations that have different category values. Belanche and Villegas (2013) developed a categorical kernel function based on the overlap metric and used it as a kernel function in SVM. Kasif et al. (1998) proposed the value difference metric (VDM), which is also a distance measure for categorical features defined by conditional probabilities for each category, and Tang and Yin (2005) directly plugged VDM into the Gaussian kernel function in SVM for mixed-type features.

∗ Corresponding author.
E-mail addresses: taeill421@naver.com (T. Jung), jaejik@skku.edu (J. Kim).

https://doi.org/10.1016/j.eswa.2023.120449
Received 26 February 2022; Received in revised form 17 March 2023; Accepted 8 May 2023
Available online 13 May 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
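The integer and one-hot encodings discussed in the introduction can be written in a few lines of NumPy. The sketch below is only an illustration; the function names are ours and are not part of the paper or any library API.

    import numpy as np

    def integer_encode(xc):
        """Map each category of one feature to an arbitrary integer code."""
        cats = np.unique(xc)
        lookup = {c: i for i, c in enumerate(cats)}
        return np.array([lookup[v] for v in xc])

    def one_hot_encode(xc):
        """One 0/1 dummy column per category of one feature."""
        cats = np.unique(xc)
        return (xc[:, None] == cats[None, :]).astype(float)

    xc = np.array(["red", "blue", "red", "green"])
    print(integer_encode(xc))   # e.g. [2 0 2 1]
    print(one_hot_encode(xc))   # 4 x 3 indicator matrix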

Peng et al. (2015) proposed heterogeneous SVM (HSVM), which maps categories into a real space by minimizing a radius margin error, instead of encoding categories or computing distances. To find the optimal mapping, it iteratively performs SVM.

However, while categorical features have nominal values, one-hot encoding and integer encoding generate discrete numeric values for categorical features. This discrepancy might lead to poor classification performance because the discrete values have no intrinsic meaning and do not fully reflect the effect of the categorical features. Furthermore, distance measures for categorical features and distance measures for continuous features are not directly comparable due to their different natures. Therefore, if we use such distance measures in SVM, poor performance might result because SVM uses the direct sum of the distance values for categorical and continuous features in a kernel function. This issue occurs in the categorical kernel, VDM, and HSVM as well. Basically, such methods handle categorical features by assigning appropriate numeric values to each category, which can be thought of as continuous values.

To address these issues, this study proposes a novel approach for dealing with categorical features in SVM. Suppose that we have both continuous and categorical features. Instead of assigning appropriate values to each category, the proposed method moves data points in a kernel space of the continuous features by reflecting the effect of each categorical feature, and the usual SVM is applied to the data points. Since it does not replace nominal categorical values with numeric values, it is free from the different natures of continuous and categorical features. Similarly to HSVM, the proposed method also requires performing SVM iteratively in order to reflect the effect of categorical features. However, while HSVM iteratively performs SVM in a space of both continuous and categorical features, SVM in the proposed method works only in a space of continuous features throughout the process. Furthermore, when considering the effect of categorical features, it uses a subset of categorical features rather than all of them. Thus, for datasets with high-dimensional categorical features, the proposed method is much faster than HSVM. Also, it can be applied to datasets with only categorical features by generating real-valued random points.

Section 2 introduces some background of this study and several existing SVMs for categorical features, and Section 3 describes a new SVM algorithm for categorical features. In Section 4, the proposed method and existing methods are applied to simulated datasets and various benchmark datasets, and they are compared in terms of prediction performance.

2. Related works

In this section, we review SVM for continuous features and introduce SVM using categorical kernel functions (CKSVM; Belanche & Villegas, 2013), SVM using the value difference metric (VSVM; Tang & Yin, 2005), and heterogeneous SVM (HSVM; Peng et al., 2015). Suppose that we have p continuous features X^r = (X_1^r, …, X_p^r)^⊤ and q categorical features X^c = (X_1^c, …, X_q^c)^⊤. Then, the ith observation has an output y_i ∈ {−1, 1} and an observed feature vector x_i = (x_i^{r⊤}, x_i^{c⊤})^⊤, i = 1, …, n, where x_i^r = (x_{i1}^r, …, x_{ip}^r)^⊤ is a vector of continuous features and x_i^c = (x_{i1}^c, …, x_{iq}^c)^⊤ is a vector of categorical features, and the jth categorical feature has G_j categories (i.e., X_j^c = g, g = 1, …, G_j).

2.1. Review of SVM for continuous features

SVM is an effective method for solving binary classification problems by constructing an optimal hyperplane in a feature space. We can find the hyperplane by a linear decision function with a maximum margin between the two classes in the space transformed by a mapping function φ (Chapelle et al., 2002). Suppose that we have n observations (y_i, x_i^r), i = 1, …, n. Then the decision function, f(x_i^r) = w^⊤ φ(x_i^r) + b, can be found by solving the following convex quadratic problem:

    \min_{w,b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i, \quad \text{s.t. } y_i(w^\top \phi(x_i^r) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \dots, n,   (1)

where w ∈ R^p and b ∈ R are the parameters of the decision function, C is a cost parameter penalizing the training error, ξ_i is a slack variable, and φ(·) is a nonlinear mapping function. φ(·) is not specified explicitly but rather through a kernel function K(x_i^r, x_{i'}^r) = ⟨φ(x_i^r), φ(x_{i'}^r)⟩. A common approach to solving the optimization problem of (1) is to consider the Lagrangian dual objective function as follows:

    \max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{i'=1}^{n} \alpha_i \alpha_{i'} y_i y_{i'} \langle \phi(x_i^r), \phi(x_{i'}^r) \rangle, \quad \text{s.t. } \sum_{i=1}^{n} \alpha_i y_i = 0, \ 0 \le \alpha_i \le C, \ i = 1, \dots, n,   (2)

where α_i is a Lagrangian multiplier and α = (α_1, …, α_n)^⊤. Therefore, the dual function of (2) can be expressed solely by kernel functions, and the inner products in a high-dimensional space can be computed efficiently without specifying the mapping function φ(·). Typical choices of the kernel function are the Gaussian, polynomial, and neural network functions, etc.

The constrained convex quadratic problem of (1) can be expressed as an unconstrained empirical hinge loss minimization problem with a penalty term for the parameters as follows (Vapnik, 1991):

    \min_{w,b} \ \frac{\lambda}{2}\|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max\big(0, 1 - y_i\{w^\top \phi(x_i^r) + b\}\big),   (3)

where λ = 1/(nC). The hinge loss, max(0, 1 − y_i{w^⊤ φ(x_i^r) + b}), gives zero penalty to observations inside their margin and a linear penalty to observations on the wrong side and far away. Similarly to (2), the optimization problem of (3) can be written in terms of kernel functions as follows:

    \min_{\gamma,b} \ \frac{\lambda}{2} \sum_{i=1}^{n} \sum_{i'=1}^{n} \gamma_i \gamma_{i'} K(x_i^r, x_{i'}^r) + \frac{1}{n} \sum_{i=1}^{n} \max\Big(0, 1 - y_i\Big\{\sum_{i'=1}^{n} \gamma_{i'} K(x_i^r, x_{i'}^r) + b\Big\}\Big),   (4)

where γ = (γ_1, …, γ_n)^⊤ is the vector of coefficients of the decision function f(x^r) = Σ_{i=1}^{n} γ_i K(x_i^r, x^r) + b. Since the coefficients γ_i are different from the α_i in (2), they should not be interpreted as Lagrangian multipliers (Chapelle, 2007). The Representer theorem introduced in Kimeldorf and Wahba (1970) implies that the optimal solution of w in (3) can be expressed as w = Σ_{i=1}^{n} γ_i φ(x_i^r), and the decision function of (4) has the form f(x^r) = Σ_{i=1}^{n} γ_i K(x_i^r, x^r) + b. To solve the unconstrained optimization problem of (4), Chapelle (2007) rewrote (4) using the kernel matrix, which has the kernel function values as its elements, and solved it by taking gradients with respect to γ and b. Also, Shalev-Shwartz et al. (2011) obtained the solution of γ by computing sub-gradients with respect to the parameter vector w, and they found the bias term b for given γ.

2.2. SVM using categorical kernel function

Belanche and Villegas (2013) proposed a categorical kernel function by converting the overlap metric into a probabilistic version, and it is defined by

    K(x_i^c, x_{i'}^c) = \exp\Big(\frac{\tau}{q} \sum_{j=1}^{q} K_u^c(x_{ij}^c, x_{i'j}^c)\Big), \quad \tau > 0,   (5)

where τ is a parameter of the kernel function, and K_u^c(x_{ij}^c, x_{i'j}^c) = (1 − [P_{X_j^c}(x_{ij})]^ν)^{1/ν} if x_{ij}^c = x_{i'j}^c, and K_u^c(x_{ij}^c, x_{i'j}^c) = 0 otherwise, where ν > 0 and P_{X_j^c}(x_{ij}) is the probability mass function of the jth categorical feature X_j^c. The categorical kernel function of (5) has a similar form to the Gaussian kernel function for continuous features, and it can be used directly as a kernel function in SVM.
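Because problems of the form (4) are solved repeatedly later in the paper, the following minimal NumPy sketch minimizes that objective by plain full-batch subgradient descent. It is a simplified stand-in for the Pegasos-style solvers cited above; the Gaussian kernel, step size, and iteration count are illustrative choices, and the function names are ours.

    import numpy as np

    def rbf_kernel(X, Z, sigma=1.0):
        """Gaussian kernel matrix between the rows of X and the rows of Z."""
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def fit_kernel_hinge(K, y, lam=0.01, lr=0.1, n_iter=500):
        """Minimize (lam/2) g'Kg + (1/n) sum_i max(0, 1 - y_i(K_i'g + b))
        over (g, b) by full-batch subgradient descent."""
        n = len(y)
        g, b = np.zeros(n), 0.0
        for _ in range(n_iter):
            viol = y * (K @ g + b) < 1.0                  # points with nonzero hinge loss
            grad_g = lam * (K @ g) - (K[:, viol] @ y[viol]) / n
            grad_b = -y[viol].sum() / n
            g -= lr * grad_g
            b -= lr * grad_b
        return g, b

Similarly, the categorical kernel (5) can be computed directly from the empirical category frequencies. This is a sketch under the same caveats (the empirical estimate of the pmf is an assumption; the paper does not prescribe how P_{X_j^c} is estimated):

    def categorical_kernel(Xc, tau=1.0, nu=1.0):
        """Kernel (5): k_u(a, b) = (1 - P_j(a)^nu)^(1/nu) if a == b else 0,
        combined as K = exp((tau/q) * sum_j k_u)."""
        n, q = Xc.shape
        pmf = []
        for j in range(q):
            vals, counts = np.unique(Xc[:, j], return_counts=True)
            pmf.append(dict(zip(vals, counts / n)))       # empirical pmf of feature j
        S = np.zeros((n, n))
        for j in range(q):
            pj = np.array([pmf[j][v] for v in Xc[:, j]])
            ku = (1.0 - pj ** nu) ** (1.0 / nu)           # value used when categories match
            same = Xc[:, j][:, None] == Xc[:, j][None, :]
            S += np.where(same, ku[:, None], 0.0)
        return np.exp(tau * S / q)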

2.3. SVM using VDM

The value difference metric (VDM) introduced by Kasif et al. (1998) is a distance function for categorical features defined by the class information of the data, and in binary classification problems it is given by

    V(x_i^c, x_{i'}^c) = \sum_{j=1}^{q} \sum_{Y \in \{-1,1\}} \big[P(Y \mid x_{ij}^c) - P(Y \mid x_{i'j}^c)\big]^2.   (6)

For continuous features, it considers the following distance function:

    D(x_i^r, x_{i'}^r) = \sum_{j=1}^{p} \frac{(x_{ij}^r - x_{i'j}^r)^2}{16\sigma_j^2},   (7)

where σ_j^2 is the variance of X_j^r. From (6) and (7), the heterogeneous VDM (HVDM) is obtained as H(x_i, x_{i'}) = V(x_i^c, x_{i'}^c) + D(x_i^r, x_{i'}^r), and the kernel function using HVDM is defined by K(x_i, x_{i'}) = exp(−τ H(x_i, x_{i'})).

2.4. Heterogeneous SVM

Heterogeneous SVM (HSVM), proposed by Peng et al. (2015), initially maps the q categorical features into a q-dimensional real space by integer encoding, and it iteratively updates the encoded categorical feature values by minimizing the radius margin, which is a type of generalization error bound. Let z_{ij}, i = 1, …, n, j = 1, …, q, be the real values obtained from x_{ij}^c by integer encoding, and let x_i^* = (x_{i1}^r, …, x_{ip}^r, z_{i1}, …, z_{iq})^⊤.

HSVM minimizes the following objective function based on the radius margin error bound:

    M = \frac{R^2}{\mu^2} = R^2 \|w\|^2,   (8)

where R^2 is the squared radius of a minimum sphere enclosing all n observations and μ^2 is the maximum margin between the two classes in the feature space transformed by φ. Note that μ^2 = 1/‖w‖^2. In the iterative update process of z_{ij}, the solution of ‖w‖^2 can be obtained by implementing SVM for x_i^*, i = 1, …, n, and the solution of R^2 can be obtained by solving another optimization problem:

    \min_{R, c} R^2, \quad \text{s.t. } \|c - \phi(x_i^*)\|^2 \le R^2, \ i = 1, \dots, n,   (9)

where c is the center of the sphere with the minimum radius. Under the optimal R^2 and ‖w‖^2, z_{ij} is updated by the gradient of M with respect to z_{ij}. This update process is iterated until M converges to an optimal point. Note that all p continuous and q categorical features are used in the entire iterative update process, and all q categorical features are updated at every iteration. This means that intensive computing is required for high-dimensional categorical features.

3. Methodology

3.1. Effect of categorical features

In general, distance-based models handle categorical features by assigning an appropriate real value to each category. As we mentioned in Section 2, one-hot encoding, the categorical kernel, VDM, and HSVM basically follow this approach. However, since the concept of distance for categorical features differs from the concept of distance for continuous features, it is not easy to find appropriate real values for categorical features in the same space as the continuous features, and the discrepancy between the natures of categorical and continuous features might lead to poor model performance.

To address this issue, we propose an alternative approach for dealing with categorical features in SVM. The proposed method considers data points in a kernel space of continuous features and iteratively moves the data points in that space by reflecting the effect of each categorical feature.

Now, an issue is how to quantify the effect of each categorical feature. In SVM for continuous features, if the ith object is far away from the hyperplane in the direction of the SVM parameter vector w, the probability of class 1 (Y_i = 1) is much greater than the probability of class 2 (Y_i = −1) (i.e., P(Y_i = 1|x_i^r) ≫ P(Y_i = −1|x_i^r)). In contrast, if it is far from the hyperplane in the opposite direction of w, then P(Y_i = 1|x_i^r) ≪ P(Y_i = −1|x_i^r). Also, if it is close to the hyperplane, the two conditional probabilities would be similar. The same concept can be applied to categorical features, and we can conversely use this idea to move data points in a continuous feature space transformed by the mapping function φ. For the ith object and the jth categorical feature, if P(Y_i = 1|x_{ij}^c = g) > P(Y_i = −1|x_{ij}^c = g), we can move the data point φ(x_i^r) in the direction of w, and if P(Y_i = 1|x_{ij}^c = g) < P(Y_i = −1|x_{ij}^c = g), φ(x_i^r) can be updated in the opposite direction of w. Fig. 1 illustrates that if all circles have higher P(Y = 1|X_j^c) than P(Y = −1|X_j^c) and all squares have higher P(Y = −1|X_j^c), the circles are updated in the direction of w and the squares in the opposite direction by reflecting the effect of the categorical feature X_j^c. Note that each data point may be updated by a different distance depending on the category to which it belongs.

To represent such updates, we define an effect measure, β_{jg}, for a category g of the jth categorical feature as follows:

    \beta_{jg} = P(Y = 1 \mid X_j^c = g)P(X_j^c = g) - P(Y = -1 \mid X_j^c = g)P(X_j^c = g)
             = P(Y = 1, X_j^c = g) - P(Y = -1, X_j^c = g), \quad j = 1, \dots, q, \ g = 1, \dots, G_j.   (10)

β_{jg} of (10) can be estimated directly from the training data as

    \hat\beta_{jg} = \frac{n(Y = 1, X_j^c = g)}{n} - \frac{n(Y = -1, X_j^c = g)}{n},   (11)

where n(Y = 1, X_j^c = g) and n(Y = −1, X_j^c = g) are the numbers of training observations with X_j^c = g in classes Y = 1 and Y = −1, respectively.

Table 1
An example for the effect of a categorical variable X_j^c.
          X_j^c = 1   X_j^c = 2   X_j^c = 3   X_j^c = 4   Total
Y = 1         2          14          35           9         60
Y = −1        8           6          15          11         40
Total        10          20          50          20        100

As shown in (10), we consider a difference of joint probabilities as an effect measure for each category, instead of the conditional probabilities. The reason why we use joint probabilities in (10) is that P(X_j^c = g) can serve as a weight of the category g among all categories of X_j^c. For example, suppose that we have the frequencies for the categorical feature X_j^c shown in Table 1. If we use the difference of estimated conditional probabilities for classes Y = 1 and −1, the first category (X_j^c = 1) has the largest difference. However, since the first category has only 10 observations, it should have a small weight in terms of the effect of the categorical feature. If we consider estimated joint probabilities, from Table 1 we obtain β̂_j1 = −0.06, β̂_j2 = 0.08, β̂_j3 = 0.20, and β̂_j4 = −0.02. This means that the third category has the greatest influence on separating the two classes in X_j^c, and if an object has X_j^c = 3, it strongly tends to be classified into class Y = 1.

Using β_{jg} of (10), we define the importance of a categorical feature X_j^c as

    \kappa_j = \max\{|\hat\beta_{jg}|, \ g = 1, \dots, G_j\}, \quad j = 1, \dots, q.   (12)

We can determine the order of the categorical features based on the importance (12), and we can reduce the number of categorical features used in the proposed method by employing this order.
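A short sketch of the estimators (11) and (12); running it on the Table 1 frequencies reproduces the category effects discussed above. The function names are ours.

    import numpy as np

    def category_effects(xc, y):
        """beta_hat_{jg} of (11) for one categorical feature: joint-frequency difference."""
        n = len(y)
        return {g: ((y[xc == g] == 1).sum() - (y[xc == g] == -1).sum()) / n
                for g in np.unique(xc)}

    def importance(Xc, y):
        """kappa_j of (12): largest absolute category effect of each feature."""
        return np.array([max(abs(b) for b in category_effects(Xc[:, j], y).values())
                         for j in range(Xc.shape[1])])

    # Reproducing the Table 1 example: categories 1-4 with class counts (2,8), (14,6), (35,15), (9,11)
    xc = np.repeat([1, 2, 3, 4], [10, 20, 50, 20])
    y = np.concatenate([np.r_[np.ones(2), -np.ones(8)], np.r_[np.ones(14), -np.ones(6)],
                        np.r_[np.ones(35), -np.ones(15)], np.r_[np.ones(9), -np.ones(11)]])
    print(category_effects(xc, y))   # {1: -0.06, 2: 0.08, 3: 0.2, 4: -0.02}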

Fig. 1. An example of data points updated by the effect of a categorical feature.

3.2. SVM for categorical features

The method proposed in this study is applicable both to datasets with mixed-type features and to datasets with only categorical features. First, we consider datasets with mixed-type features. To handle categorical features in SVM, we can consider the following optimization problem:

    \min_{w, b, h} \ \frac{\lambda}{2}\|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y_i(w^\top h(\phi(x_i^r); x_i^c) + b)\},   (13)

where h(φ(x_i^r); x_i^c): R^L → R^L is an unknown function adjusting φ(x_i^r) by the categorical observation x_i^c, and φ(x_i^r) = (φ_1(x_i^r), …, φ_L(x_i^r))^⊤. To minimize (13), if P(Y_i = 1|x_i^c) > P(Y_i = −1|x_i^c), the function h should map φ(x_i^r) in the direction of w in the space transformed by φ. Otherwise, it should update φ(x_i^r) in the opposite direction of w. Also, (13) can be written in terms of the kernel function as follows:

    \min_{\gamma, b, h} \ \frac{\lambda}{2} \sum_{i=1}^{n} \sum_{i'=1}^{n} \gamma_i \gamma_{i'} K^h(x_i^r, x_{i'}^r) + \frac{1}{n} \sum_{i=1}^{n} \max\Big(0, 1 - y_i\Big\{\sum_{i'=1}^{n} \gamma_{i'} K^h(x_i^r, x_{i'}^r) + b\Big\}\Big),   (14)

where K^h(x_i^r, x_{i'}^r) is the kernel function representing the inner product of h(φ(x_i^r); x_i^c) and h(φ(x_{i'}^r); x_{i'}^c). That is, K^h(x_i^r, x_{i'}^r) = ⟨h(φ(x_i^r); x_i^c), h(φ(x_{i'}^r); x_{i'}^c)⟩.

However, it is not easy to find the coefficients (γ, b) and the function h minimizing (14) simultaneously. Therefore, we propose an iterative algorithm for the minimization of (14). That is, we first find (γ, b) minimizing (14) for given h, and then we find h for given (γ, b). This procedure is iterated until the objective function of (14) no longer decreases. (γ, b) can be obtained by applying SVM to h(φ(x_i^r); x_i^c), i = 1, …, n, through kernel functions.

To estimate the unknown function h, the strategy that we propose is to update φ(x_i^r) by considering a single categorical feature at a time. Thus, for the current w, φ(x_i^r) is updated by reflecting the effect of the jth categorical feature as follows:

    \phi_{t+1}(x_i^r) = \phi_t(x_i^r) + \frac{\eta \, \hat\beta_{jg}(x_{ij}^c = g)}{\|w_t\|^2} \, w_t, \quad i = 1, \dots, n, \ t = 0, 1, 2, \dots,   (15)

where η is a learning rate, β̂_{jg}(x_{ij}^c = g) is the estimated effect of (11) corresponding to the category value g of x_{ij}^c, and w_t is the SVM parameter vector estimated at iteration t. The learning rate η can be determined by cross-validation (CV).

As shown in (15), the sign of β̂_{jg}(x_{ij}^c = g) determines the direction of the update of φ(x_i^r). If the sign is positive, φ(x_i^r) is updated in the direction of w_t; otherwise, it is updated in the opposite direction of w_t. Mandic (2004) suggested that the normalization of w should be guaranteed for the stability of nonlinear SVM. Therefore, ‖w_t‖^2 in (15) can be considered a normalizing factor.

One of the main benefits of SVM is that kernel functions are used instead of specifying the mapping function φ. Thus, the update (15) can be carried out through the kernel function, without specifying φ, as follows:

    K_{t+1}^h(x_i^r, x_{i'}^r) = K_t^h(x_i^r, x_{i'}^r) + \delta_t^{i'jg} \, \gamma_t^\top K_{it}^h + \delta_t^{ijg} \, \gamma_t^\top K_{i't}^h + \delta_t^{ijg} \delta_t^{i'jg} \, \gamma_t^\top K_t^h \gamma_t,   (16)

where K_t^h(x_i^r, x_{i'}^r) is the kernel function value for φ_t(x_i^r) and φ_t(x_{i'}^r) at iteration t, δ_t^{ijg} = η β̂_{jg}(x_{ij}^c = g)(γ_t^⊤ K_t^h γ_t)^{−1}, γ_t = (γ_{1t}, …, γ_{nt})^⊤ is the coefficient vector at iteration t, K_t^h is the n × n kernel matrix with K_t^h(x_i^r, x_{i'}^r) as the element in the ith row and i'th column, and K_{it}^h is the ith column vector of K_t^h. Note that, for given h, the solution of w in (13) at iteration t is w_t = Σ_{i=1}^{n} γ_{it} φ_t(x_i^r); from this, we can derive (16). As shown in (16), the kernel function values at iteration t + 1 are obtained only through the coefficient vector γ_t and the kernel function values at t, instead of w and φ.

For a single categorical feature, the update of (16) and the optimization of (14) are iterated until (14) no longer decreases. Once the reflection of the effect of one categorical feature is completed, the effect of the next categorical feature is taken into account in the same iterative manner. In the update process, since the effects of the categorical features are considered sequentially, the order in which the categorical features enter the update process is critical for finding the function h effectively. Thus, we propose to determine the order of the categorical features by the importance κ_j, j = 1, …, q, of (12). This means that, throughout the update process, the effect of the categorical feature with the higher importance value is taken into account first.

Note that the function h does not have an explicit form. In fact, it is a function representing the paths along which the data points φ(x_i^r) are updated in the space transformed by φ, and these paths are found through the iterative updates of the parameter vector γ_t and the kernel function values. To recover these paths, we need to keep the kernel function values at t = 0 and all coefficient vectors γ_t, t = 1, …, T, where T is the index of the iteration at which the algorithm is terminated.

As we mentioned, the proposed method can also be applied to datasets with only categorical features. For such datasets, we propose to randomly generate x_1^r, …, x_n^r from a p-variate normal distribution, and then the proposed method can be applied to the resulting mixed-type dataset with the randomly generated continuous features. Since x_1^r, …, x_n^r are randomly generated, they have no effect on classification. Therefore, to avoid computational complexity, p = 1 or 2 is recommended.
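The kernel update (16) acts only on the stored kernel matrix and coefficient vector. A direct vectorized sketch in NumPy (variable and function names are ours):

    import numpy as np

    def update_kernel(K, gamma, xc_j, beta_j, eta=0.1):
        """One update of (16) for a single categorical feature j.
        K: current n x n kernel matrix K_t^h; gamma: current coefficient vector gamma_t;
        xc_j: the n category values of feature j; beta_j: dict of estimated effects (11)."""
        wTw = float(gamma @ K @ gamma)                               # ||w_t||^2 = gamma_t' K_t^h gamma_t
        delta = eta * np.array([beta_j[g] for g in xc_j]) / wTw      # delta_t^{ijg} for each i
        s = K @ gamma                                                # s[i] = gamma_t' K_{it}^h = <phi_t(x_i), w_t>
        return K + np.outer(s, delta) + np.outer(delta, s) + wTw * np.outer(delta, delta)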

Fig. 2. An example for the update process of data points.

Algorithm 1 Support vector machine for categorical features (SVM-C)
1: Initialization: Set t ← 0 and j ← 1. If there are no continuous features in the dataset, generate x_1^r, …, x_n^r randomly from the p-variate normal distribution N_p(0, I).
2: Estimate the effects β_{jg}, g = 1, …, G_j, j = 1, …, q, of all categories of all categorical features using (11).
3: Compute the importance κ_j, j = 1, …, q, of all categorical features using (12), and then sort the categorical features in decreasing order of κ_j. Let X_j^c be the categorical feature with the jth largest κ_j value.
4: For a given kernel function form, perform SVM for x_1^r, …, x_n^r by solving the optimization problem of (4) and obtain the initial coefficient vector γ_0 = (γ_{10}, …, γ_{n0})^⊤ and kernel function matrix K_0^h. Also, set ℒ_0 to the objective function value of (4) for γ_0 and b_0.
5: Loop:
6:   Set l ← 1.
7:   For X_j^c, update the kernel function values to K_{t+1}^h(x_i^r, x_{i'}^r) for i, i' = 1, …, n using (16).
8:   For the kernel function values K_{t+1}^h(x_i^r, x_{i'}^r) obtained in Step 7, solve the following optimization problem:

         \min_{\gamma, b} \ \frac{\lambda}{2} \gamma^\top K_{t+1}^h \gamma + \frac{1}{n} \sum_{i=1}^{n} \max\big\{0, 1 - y_i(\gamma^\top K_{i,t+1}^h + b)\big\}.   (17)

9:   Set γ_{t+1} and b_{t+1} to the solution of (17). Also, set ℒ_{t+1} to the objective function value of (17) for γ_{t+1} and b_{t+1}.
10:  If l = 1 and ℒ_{t+1} ≥ ℒ_t, STOP.
11:  If ℒ_{t+1} ≥ ℒ_t, then j ← j + 1 and go to Step 6. Otherwise, t ← t + 1, l ← l + 1, and go to Step 7.
12: End Loop

Fig. 2 shows an example of the update process. Fig. 2(a) presents data points for both classes randomly generated from the bivariate normal distribution. In Figs. 2(b), 2(c) and 2(d), the data points are updated by reflecting the effects of categorical features, and they are finally well separated.

Algorithm 1 and Fig. 3 describe the detailed process of the proposed method. The minimization of (17) can be solved by the stochastic sub-gradient descent method (Shalev-Shwartz et al., 2011). Since Algorithm 1 stops when the objective function no longer decreases at the first iteration of the next categorical feature, it may not use all q categorical features. Thus, at the end of Algorithm 1, we have a subset of categorical features used for classification, which can be helpful for interpreting the classification outcomes.

We can predict a new observation x_0 = (x_0^{r⊤}, x_0^{c⊤})^⊤ using the SVM trained by Algorithm 1 as follows:

    \hat{Y}_0 = \mathrm{sign}\Big(\sum_{i=1}^{n} \gamma_{iT} K_T^h(x_i^r, x_0^r) + b_T\Big),   (18)

where the subscript T is the index of the iteration at which Algorithm 1 stops, and γ_{iT}, i = 1, …, n, and b_T are the coefficients and bias term obtained at iteration T. K_T^h(x_i^r, x_0^r), i = 1, …, n, can be obtained by updating through the kernel functions K_0^h, K_1^h, …, K_{T−1}^h. To obtain K_T^h(x_i^r, x_0^r), i = 1, …, n, we first compute the kernel function values K_0^h(x_i^r, x_0^r), i = 1, …, n, and then they are updated into K_1^h(x_i^r, x_0^r), i = 1, …, n, using (16). By iterating (16) through the kernel function values and the coefficient vectors, we finally obtain K_T^h(x_i^r, x_0^r), i = 1, …, n. If the sign of Ŷ_0 is positive, the observation is classified into class Y = 1; otherwise, into class Y = −1.
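The following Python skeleton strings the earlier sketches together into the training loop of Algorithm 1 and the prediction rule (18). It reuses category_effects, update_kernel, and fit_kernel_hinge from the sketches above. The data structures (the history list in particular), the default values of η and λ, and the zero-effect fallback for categories unseen in training are our own choices and are not specified in the paper.

    import numpy as np

    def objective(K, y, gamma, b, lam):
        """Objective of (4)/(17): penalty term plus averaged hinge loss."""
        hinge = np.maximum(0.0, 1.0 - y * (K @ gamma + b))
        return 0.5 * lam * (gamma @ K @ gamma) + hinge.mean()

    def svm_c_train(K0, Xc, y, eta=0.1, lam=0.01):
        """Skeleton of Algorithm 1. K0: initial kernel matrix of the continuous (or random)
        features; Xc: n x q categorical features; y: labels in {-1, +1}."""
        n, q = Xc.shape
        effects = [category_effects(Xc[:, j], y) for j in range(q)]            # beta_hat of (11)
        kappa = np.array([max(abs(v) for v in e.values()) for e in effects])   # kappa_j of (12)
        K = K0.copy()
        gamma, b = fit_kernel_hinge(K, y, lam)
        loss = objective(K, y, gamma, b, lam)
        history = []                                    # states needed to replay (16) at prediction
        for j in np.argsort(-kappa):                    # features enter by decreasing importance
            first = True
            while True:
                K_new = update_kernel(K, gamma, Xc[:, j], effects[j], eta)
                gamma_new, b_new = fit_kernel_hinge(K_new, y, lam)
                loss_new = objective(K_new, y, gamma_new, b_new, lam)
                if loss_new >= loss:                    # no decrease for this feature
                    if first:                           # Step 10: stop the whole algorithm
                        return history, gamma, b
                    break                               # Step 11: move to the next feature
                history.append({"j": j, "gamma": gamma, "K": K,
                                "xc_j": Xc[:, j], "beta": effects[j]})
                K, gamma, b, loss, first = K_new, gamma_new, b_new, loss_new, False
        return history, gamma, b

    def svm_c_predict(k0, x0_cats, history, gamma_T, b_T, eta=0.1):
        """Prediction (18): replay (16) on the test kernel column k0[i] = K_0^h(x_i^r, x_0^r).
        eta must match the value used in training."""
        k = k0.astype(float).copy()
        for st in history:
            gamma, K, beta = st["gamma"], st["K"], st["beta"]
            wTw = float(gamma @ K @ gamma)                                # ||w_t||^2
            d_tr = eta * np.array([beta[g] for g in st["xc_j"]]) / wTw    # deltas of training points
            d_0 = eta * beta.get(x0_cats[st["j"]], 0.0) / wTw             # delta of the new observation
            s = K @ gamma                                                 # <phi_t(x_i), w_t> for each i
            k = k + d_0 * s + (gamma @ k) * d_tr + wTw * d_0 * d_tr
        return np.sign(gamma_T @ k + b_T)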

Fig. 3. Flowchart of Algorithm 1.

4. Experiments

In this section, we evaluate the performance of the proposed SVM (SVM-C1) through simulations and benchmark datasets. We also compare the proposed SVM to SVM using one-hot encoding (OSVM), categorical kernel SVM (CKSVM), SVM using VDM (VSVM), and heterogeneous SVM (HSVM).

1 Python code for SVM-C is available on https://sites.google.com/view/jaejik.

4.1. Simulation

To test the performance of the proposed SVM, we generate data from logistic regression models, and we consider a case with both continuous and categorical features (Case 1) and a case with only categorical features (Case 2). For Case 1, we generate 5 continuous features X_1^r, …, X_5^r from the standard normal distribution and 100 binary categorical features X_1^c, …, X_100^c from the Bernoulli distribution. However, only 3 continuous features and 5 categorical features are included in the logistic regression model for the data generation. For Case 2, we generate only 100 binary categorical features. For Cases 1 and 2, the logistic regression models are, respectively,

    Case 1: \log\frac{p}{1-p} = \sin(2X_1^r - X_2^r) - 0.8X_3^r + \theta_0 + \theta_1 X_1^c + \cdots + \theta_5 X_5^c,
    Case 2: \log\frac{p}{1-p} = \theta_0 + \theta_1 X_1^c + \cdots + \theta_5 X_5^c,   (19)

where p = P(Y = 1|X) and θ_0, θ_1, …, θ_5 are randomly generated from the uniform(3, 5) and uniform(−5, −3) distributions. For the training and test sets, 1000 and 5000 observations are generated at each iteration, and this procedure is repeated 100 times. For Case 2, random continuous data points are generated from the univariate standard normal distribution (i.e., p = 1). In this simulation, we use the Gaussian kernel function for all SVM models, and all tuning parameters for all models are determined by 5-fold CV.

Fig. 4 shows how the random data points are updated in a one-dimensional space through an example of Case 2. Table 2 presents the correctly classified rates and their standard deviations for each method. From Table 2, we see that SVM-C has better classification performance than the other methods in both cases. As mentioned in Section 3, SVM-C uses a subset of categorical features during the training process, while the other methods use all categorical features. The average number of categorical features used in SVM-C for Cases 1 and 2 was 3.30 and 3.83, respectively, while the average number of categorical features used in SVM-C that matched the categorical features in the true model (19) was 3.08 and 3.67. This means that most categorical features used in SVM-C matched those of the true model, and it used fewer categorical features than the true model. SVM-C does not aim to identify the true categorical features; rather, it uses the categorical features that contribute to reducing the hinge loss. Thus, it tends to employ as few categorical features as possible while preserving prediction accuracy.

4.2. Benchmark datasets

We consider 18 benchmark datasets with binary classification problems. Except for the 'handwriting' (Lecun et al., 1998), 'backache' (Chatfield, 1985), and 'Amazon reviews' (He & McAuley, 2016) datasets, all 15 remaining datasets are available at the UCI machine learning repository (Dua & Graff, 2017). Detailed information on the datasets is given in Table 3. These datasets cover various situations for binary classification problems. As shown in Table 3, datasets 1–6 and 18 have only categorical features, while datasets 7–17 have both categorical and continuous features. Datasets 16–18 have high-dimensional categorical features. Also, datasets 12–18 have imbalanced classes.

The original 'handwriting' dataset consists of 28 × 28 pixel values (0–255) for handwritten numbers 0, 1, …, 9. We modified it into a binary classification problem with categorical features. In this dataset, we focus on distinguishing the numbers 3 and 5, and we created a binary feature for each pixel by assigning one to pixel values greater than 128 and zero to values less than or equal to 128. In 'splice', we focus on classifying exon/intron and intron/exon boundaries. In 'contraceptive', we create a binary response by encoding 'Long-term' and 'Short-term' as 'use'. In 'Amazon reviews', we deal only with reviews for auto parts with at least five reviews, and we created a binary response from the review ratings: ratings from 1 to 3 are encoded as 'negative' reviews, and ratings of 4 and 5 are considered 'positive' reviews. The original 'drug design' dataset has 139,351 binary features. However, due to the computational feasibility of the existing methods, we use only the 10,000 binary features selected by Cheng et al. (2002). In this dataset, the number of categorical features is much larger than the number of objects.
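The simulation design in (19) can be sketched as follows. The Bernoulli(0.5) choice for the binary features and the assignment of the uniform(3, 5) and uniform(−5, −3) ranges to individual θ's are assumptions on our part; the paper only states the two ranges.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate(n, case=1, p_cont=5, q_cat=100):
        """One draw from the Case 1 / Case 2 models in (19)."""
        Xr = rng.standard_normal((n, p_cont))
        Xc = rng.integers(0, 2, size=(n, q_cat))          # binary categorical features (assumed p = 0.5)
        theta = np.concatenate([rng.uniform(3, 5, 3),     # split of the two ranges is an assumption
                                rng.uniform(-5, -3, 3)])  # theta_0, ..., theta_5
        logit = theta[0] + Xc[:, :5] @ theta[1:]
        if case == 1:
            logit += np.sin(2 * Xr[:, 0] - Xr[:, 1]) - 0.8 * Xr[:, 2]
        y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-logit)), 1, -1)
        return Xr, Xc, y

    Xr_train, Xc_train, y_train = simulate(1000, case=1)
    Xr_test, Xc_test, y_test = simulate(5000, case=1)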

Fig. 4. An example of Case 2 for SVM-C.

Table 2
Classification accuracy for each case.
Case SVM-C OSVM CKSVM VSVM HSVM
Case 1 0.799 (0.111) 0.728 (0.132) 0.728 (0.130) 0.694 (0.119) 0.728 (0.132)
Case 2 0.825 (0.071) 0.740 (0.146) 0.740 (0.146) 0.727 (0.127) 0.805 (0.136)

( ): Standard deviation.

Table 3
Dataset information.
No.  Dataset          # of obs.  Categorical  Continuous  Class 1  Class −1
1    monk-3           432        6            0           204      228
2    mushroom         8124       22           0           4208     3916
3    kr-vs-kp         3196       36           0           1527     1669
4    handwriting      11552      784          0           6131     5421
5    splice           1535       60           0           767      768
6    house votes      232        16           0           113      119
7    jcrx             690        11           4           307      383
8    diagnosis        120        6            1           50       70
9    contraceptive    1473       7            2           955      629
10   Australian       690        8            6           307      383
11   crx              690        9            6           307      383
12   adult            30162      8            6           22654    7508
13   backache         180        26           5           155      25
14   allhypo          3772       21           8           3481     291
15   hepatitis        155        13           5           123      32
16   advertisement    3279       1555         3           459      2820
17   Amazon reviews   20473      804          3           17895    2578
18   drug design      2543       10000        0           192      2351

4.3. Experimental results for benchmark datasets

For all datasets, we removed observations with at least one missing value, and we also removed categorical features with only one observed category. All continuous features are standardized. To evaluate the prediction performance accurately, we repeated 5-fold CV ten times for each dataset, and we consider the classification accuracy for the balanced datasets and the F-measure and G-mean for the imbalanced datasets as performance measures. All tuning parameters for all models are determined by 5-fold CV.

Table 4 shows the accuracy and its standard deviation. As shown in Table 4, SVM-C has better accuracy than the other models in most datasets, except for four datasets. Even on those four datasets, it shows comparable performance to HSVM, which has the highest accuracy. We also evaluate the performance of the models on the imbalanced datasets 12–18 by using the F-measure and G-mean metrics. To provide a comprehensive analysis, we include OSVM and CKSVM variants that incorporate a cost-sensitive learning technique (denoted OSVMC and CKSVMC) as competing models. The performance results for the imbalanced datasets are presented in Table 5. As indicated, SVM-C generally outperforms the other models in identifying the minor class, with the exception of 'backache' and the G-mean of 'advertisement'. In particular, SVM-C successfully detects the minor class in the 'Amazon reviews' dataset, which is missed by the other models.

Table 6 presents the computing times of SVM-C and HSVM for the four datasets with high-dimensional categorical features. As shown in Table 6, the computing time of SVM-C is significantly lower than that of HSVM for all four datasets. The HSVM algorithm requires a complexity of O(n^2 q) to compute the gradient of the kernel function at every iteration, and the optimization is performed in a kernel space of all p continuous and q categorical features. In contrast, since SVM-C uses only a single categorical feature to compute the gradient at each iteration, it requires only a complexity of O(n^2), and the optimization is computed only in a kernel space of the p continuous features. Therefore, as the number of categorical features increases, the difference in computing time between SVM-C and HSVM grows much larger.

Moreover, SVM-C uses a single categorical feature in the order determined by its importance at each iteration, and the algorithm is usually terminated before using all categorical features. That is, not all categorical features are used in the model. This is another reason why SVM-C has a much lower computing time than HSVM for high-dimensional categorical features. Table 7 presents the average number of categorical features used in SVM-C for each dataset. From Table 7, we see that SVM-C uses a much smaller number of categorical features in comparison to the total number of categorical features in most datasets. In particular, while 'drug design' has 10,000 categorical features, SVM-C used only 2.1 categorical features on average and achieved the highest classification rate (0.956). In 'mushroom', the output variable has edible and poisonous mushroom classes, and SVM-C mainly used the features 'odor', 'ring type', 'population', and 'gill-color'. In 'allhypo', the categorical features 'hypopituitary', 'thyroid surgery', 'lithium', and 'goitre' were primarily used for predicting the presence or absence of hypothyroidism. For the 'hepatitis' dataset, it mainly considered the features 'ascites', 'varices', 'sex', and 'spleen palable' when training the model. In the 'handwriting' dataset, it used 62.5 pixels on average among the 784 pixels. As shown in Fig. 5, handwritten numbers 3 and 5 are not easy to distinguish with human eyes. The pixels used in SVM-C are presented by the red areas in Fig. 5. They appear to be areas with important characteristics distinguishing handwritten numbers 3 and 5.
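Table 5 reports the F-measure and G-mean. The paper does not spell out the formulas, so the sketch below uses the usual definitions: F is the harmonic mean of precision and recall on the minor class, and the G-mean is the geometric mean of the two class recalls.

    import numpy as np

    def f_measure_g_mean(y_true, y_pred, minor=1):
        """F-measure and G-mean with respect to the minor class label."""
        tp = np.sum((y_pred == minor) & (y_true == minor))
        fp = np.sum((y_pred == minor) & (y_true != minor))
        fn = np.sum((y_pred != minor) & (y_true == minor))
        tn = np.sum((y_pred != minor) & (y_true != minor))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0          # recall of the minor class
        specificity = tn / (tn + fp) if tn + fp else 0.0     # recall of the major class
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        g = np.sqrt(recall * specificity)
        return f, g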

Table 4
Classification accuracy for each dataset.
No. Dataset SVM-C OSVM CKSVM VSVM HSVM
1 monk-3 0.971 (0.016) 0.971 (0.016) 0.971 (0.016) 0.971 (0.016) 0.971 (0.016)
2 mushroom 0.978 (0.072) 0.997 (0.002) 0.971 (0.079) 0.980 (0.008) 0.999 (0.002)
3 kr-vs-kp 0.999 (0.012) 0.922 (0.021) 0.725 (0.045) 0.862 (0.005) 0.979 (0.013)
4 handwriting 0.971 (0.010) 0.969 (0.004) 0.983 (0.002) 0.938 (0.005) 0.989 (0.003)
5 splice 0.978 (0.012) 0.868 (0.019) 0.952 (0.009) 0.882 (0.019) 0.962 (0.018)
6 house votes 0.948 (0.044) 0.773 (0.038) 0.529 (0.001) 0.529 (0.001) 0.834 (0.042)
7 jcrx 0.862 (0.016) 0.665 (0.023) 0.719 (0.032) 0.438 (0.168) 0.832 (0.041)
8 diagnosis 0.999 (0.001) 0.980 (0.036) 0.846 (0.020) 0.746 (0.026) 0.729 (0.027)
9 contraceptive 0.697 (0.016) 0.573 (0.001) 0.663 (0.016) 0.494 (0.032) 0.613 (0.026)
10 Australian 0.842 (0.018) 0.572 (0.010) 0.692 (0.029) 0.413 (0.146) 0.830 (0.038)
11 crx 0.864 (0.028) 0.862 (0.019) 0.863 (0.014) 0.546 (0.049) 0.869 (0.015)
12 adult 0.857 (0.007) 0.760 (0.001) 0.760 (0.001) 0.533 (0.016) 0.838 (0.002)
13 backache 0.865 (0.027) 0.859 (0.039) 0.859 (0.039) 0.859 (0.039) 0.875 (0.028)
14 allhypo 0.967 (0.010) 0.951 (0.011) 0.943 (0.012) 0.918 (0.011) 0.963 (0.011)
15 hepatitis 0.866 (0.048) 0.828 (0.068) 0.829 (0.072) 0.829 (0.072) 0.858 (0.005)
16 advertisement 0.974 (0.014) 0.906 (0.017) 0.961 (0.010) 0.925 (0.002) 0.956 (0.005)
17 Amazon reviews 0.898 (0.028) 0.874 (0.018) 0.874 (0.018) 0.874 (0.018) 0.874 (0.018)
18 drug design 0.956 (0.006) 0.945 (0.008) 0.924 (0.010) 0.924 (0.010) 0.937 (0.003)

( ): Standard deviation.

Table 5
F-measure (F) and G-mean (G) values for imbalanced datasets.
Dataset          Metric  SVM-C          OSVM           CKSVM          VSVM           HSVM           OSVMC          CKSVMC
adult            F       0.650 (0.099)  0.000 (0.000)  0.000 (0.000)  0.405 (0.014)  0.565 (0.089)  0.000 (0.000)  0.001 (0.003)
adult            G       0.721 (0.079)  0.000 (0.000)  0.000 (0.000)  0.121 (0.022)  0.649 (0.065)  0.000 (0.000)  0.008 (0.022)
backache         F       0.318 (0.049)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.359 (0.054)  0.000 (0.000)  0.000 (0.000)
backache         G       0.468 (0.213)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.497 (0.240)  0.000 (0.000)  0.000 (0.000)
allhypo          F       0.792 (0.009)  0.572 (0.006)  0.476 (0.005)  0.000 (0.000)  0.773 (0.010)  0.613 (0.007)  0.568 (0.006)
allhypo          G       0.892 (0.130)  0.652 (0.157)  0.577 (0.147)  0.000 (0.000)  0.888 (0.120)  0.704 (0.161)  0.624 (0.152)
hepatitis        F       0.513 (0.102)  0.203 (0.141)  0.000 (0.000)  0.000 (0.000)  0.461 (0.124)  0.247 (0.146)  0.000 (0.000)
hepatitis        G       0.585 (0.425)  0.325 (0.410)  0.000 (0.000)  0.000 (0.000)  0.542 (0.415)  0.371 (0.431)  0.000 (0.000)
advertisement    F       0.899 (0.024)  0.562 (0.053)  0.852 (0.027)  0.725 (0.029)  0.809 (0.025)  0.579 (0.042)  0.881 (0.024)
advertisement    G       0.905 (0.156)  0.650 (0.100)  0.884 (0.078)  0.826 (0.067)  0.814 (0.238)  0.764 (0.137)  0.910 (0.085)
Amazon reviews   F       0.321 (0.158)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.001 (0.001)
Amazon reviews   G       0.415 (0.264)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.057 (0.052)
drug design      F       0.657 (0.004)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.477 (0.006)  0.001 (0.001)  0.000 (0.000)
drug design      G       0.744 (0.127)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.611 (0.051)  0.012 (0.053)  0.000 (0.000)

( ): Standard deviation.

Table 6
Computing time (seconds) of SVM-C and HSVM on the datasets with high-dimensional categorical features.
Dataset Handwriting Advertisement Amazon reviews Drug design
SVM-C 16,461.3 (87.0) 551.2 (544.6) 39,558.2 (1750.9) 525.0 (116.2)
HSVM 1,682,311.5 (4,793.3) 161,520.9 (654.9) 1,977,937.5 (2128.6) 14,123,263.6 (6,903.4)

( ): Standard deviation.

5. Discussion and conclusion

Categorical features are often involved in classification and regression problems. SVM was originally developed for continuous features, and several methods have been developed for dealing with categorical features in SVM. Similarly to the usual SVM for continuous features, most methods apply distance measures to categorical features, or they regard categorical features as continuous features by assigning appropriate numeric values to the categories. However, it is not easy to accurately evaluate categories in a continuous feature space. Also, if we have high-dimensional categorical features, such methods could lead to poor classification performance, and they could require intensive computing. To overcome these problems, we propose an approach that updates data points in a kernel space of continuous features by reflecting the effect of each categorical feature. Since the proposed method does not assign numeric values to categorical features, it is free from the problem caused by directly comparing categorical and continuous features. Also, since it finds a hyperplane only in a kernel space of continuous features throughout the process and, by utilizing an importance measure, does not use all categorical features, it is computationally efficient for high-dimensional categorical features relative to HSVM.

Table 7
The average number of categorical features used in SVM-C.
No. Dataset 𝑞 Avg. #’s No. Dataset 𝑞 Avg. #’s
1 monk-3 6 2.84 (0.91) 10 Australian 8 7.95 (0.71)
2 mushroom 22 4.66 (2.33) 11 crx 9 3.90 (0.91)
3 kr-vs-kp 36 10.50 (1.35) 12 adult 8 7.67 (0.75)
4 handwriting 784 62.50 (4.95) 13 backache 26 11.24 (1.54)
5 splice 60 19.20 (0.75) 14 allhypo 21 10.02 (2.11)
6 house votes 16 5.80 (1.99) 15 hepatitis 13 3.80 (1.63)
7 jcrx 11 7.45 (0.50) 16 advertisement 1555 10.20 (2.48)
8 diagnosis 6 4.01 (0.10) 17 Amazon reviews 804 67.4 (5.26)
9 contraceptive 7 2.03 (0.52) 18 drug design 10000 2.1 (0.52)

( ): Standard deviation.

Fig. 5. An example of handwritten numbers 3 and 5; the red areas represent pixels used in SVM-C.

However, the proposed SVM has an additional tuning parameter, η in (15), compared to most SVMs, and it is determined by CV. Since this could be a computational burden, we might need to develop a scale-invariant measure to reflect the effect of categorical features, instead of η. In addition, Algorithm 1 does not guarantee the global optimum of (14) due to the difficulty of estimating the function h. This is a limitation of the proposed method. To find the optimal function h reflecting the effect of categorical features, the proposed method considers a single categorical feature at a time. If the true hyperplane depends on combinations of categorical features, it could perform poorly. For better performance, we might need to develop a method considering joint effects of categorical features.

Furthermore, our proposed method demonstrates strong performance in detecting the minor class in imbalanced datasets. As a potential improvement for addressing imbalanced class problems, we can consider exploring the application of cost-sensitive learning techniques to our proposed method in future work.

CRediT authorship contribution statement

Taeil Jung: Problem conceptualization, Methodology, Algorithm development, Programming, Data analysis. Jaejik Kim: Problem conceptualization, Methodology, Algorithm development, Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

All data are publicly available.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2022R1F1A1072444).

References

Belanche, L. A., & Villegas, M. A. (2013). Kernel functions for categorical variables with application to problems in the life sciences. In CCIA (pp. 171–180). http://dx.doi.org/10.3233/978-1-61499-320-9-171.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on computational learning theory – COLT '92 (pp. 144–152). Pittsburgh, PA, USA. http://dx.doi.org/10.1145/130385.130401.
Van den Burg, G. J. J., & Groenen, P. J. F. (2016). GenSVM: A generalized multiclass support vector machine. Journal of Machine Learning Research, 17, 1–42. URL: http://jmlr.org/papers/v17/14-526.html.
Carrizosa, E., Nogales-Gomez, A., & Romero Morales, D. (2017). Clustering categories in support vector machines. Omega, 66, 28–37. http://dx.doi.org/10.1016/j.omega.2016.01.008.
Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19, 1155–1178. http://dx.doi.org/10.1162/neco.2007.19.5.1155.
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159. http://dx.doi.org/10.1023/A:1012450327387.
Chatfield, C. (1985). The initial examination of data. Journal of the Royal Statistical Society. Series A (General), 148, 214–253. http://dx.doi.org/10.2307/2981969.
Cheng, J., Hatzis, C., Hayashi, H., Krogel, M.-A., Morishita, S., Page, D., & Sese, J. (2002). KDD Cup 2001 report. ACM SIGKDD Explorations Newsletter, 3, 47–64. http://dx.doi.org/10.1145/507515.507523.
Cortes, C., & Vapnik, V. N. (1995). Support-vector networks. Machine Learning, 20, 273–297. http://dx.doi.org/10.1007/BF00994018.
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292. URL: https://www.jmlr.org/papers/v2/crammer01a.html.
Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286. http://dx.doi.org/10.1613/jair.105.
Dua, D., & Graff, C. (2017). UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.

Duan, K. B., & Keerthi, S. S. (2005). Which is the best multiclass SVM method? An empirical study. In N. C. Oza, R. Polikar, J. Kittler, & F. Roli (Eds.), Lecture notes in computer science: vol. 3541, MCS 2005: Multiple classifier systems (pp. 278–285). Berlin, Heidelberg: Springer. http://dx.doi.org/10.1007/11494683_28.
He, R., & McAuley, J. (2016). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW '16: Proceedings of the 25th international conference on world wide web (pp. 507–517). Montreal, Quebec, Canada. http://dx.doi.org/10.1145/2872427.2883037.
Hsu, C. W., Chang, C. C., & Lin, C. J. (2016). A practical guide to support vector classification: Tech. Rep., Department of Computer Science, National Taiwan University. URL: https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, 415–425. http://dx.doi.org/10.1109/72.991427.
Kasif, S., Salzberg, S., Waltz, D., Rachlin, J., & Aha, D. W. (1998). A probabilistic framework for memory-based reasoning. Artificial Intelligence, 104, 287–311. http://dx.doi.org/10.1016/S0004-3702(98)00046-0.
Kimeldorf, G. S., & Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41, 495–502. http://dx.doi.org/10.1214/aoms/1177697089.
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 86 (pp. 2278–2324). http://dx.doi.org/10.1109/5.726791.
Lee, C., & Jang, M. (2010). A modified fixed-threshold SMO for 1-slack structural SVMs. ETRI Journal, 32, 120–128. http://dx.doi.org/10.4218/etrij.10.0109.0425.
Lee, Y., Lin, Y., & Wahba, G. (2004). Multicategory support vector machines. Journal of the American Statistical Association, 99, 67–81. http://dx.doi.org/10.1198/016214504000000098.
Mandic, D. P. (2004). A generalized normalized gradient descent algorithm. IEEE Signal Processing Letters, 11, 115–118. http://dx.doi.org/10.1109/LSP.2003.821649.
Peng, S., Hu, Q., Chen, Y., & Dang, J. (2015). Improved support vector machine algorithm for heterogeneous data. Pattern Recognition, 48, 2072–2083. http://dx.doi.org/10.1016/j.patcog.2014.12.015.
Platt, J. C., Cristianini, N., & Taylor, J. S. (1999). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, & K. Muller (Eds.), NIPS'99: Proceedings of the 12th international conference on neural information processing systems (pp. 547–553). URL: https://proceedings.neurips.cc/paper/1999/file/4abe17a1c80cbdd2aa241b70840879de-Paper.pdf.
Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming: Series B, 127, 3–30. http://dx.doi.org/10.1007/s10107-010-0420-4.
Tang, J., & Yin, J. (2005). Developing an intelligent data discriminating system of anti-money laundering based on SVM. In 2005 international conference on machine learning and cybernetics, Vol. 6 (pp. 3453–3457). Guangzhou, China. http://dx.doi.org/10.1109/ICMLC.2005.1527539.
Vapnik, V. N. (1991). Principles of risk minimization for learning theory. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), NIPS'91: Proceedings of the 4th international conference on neural information processing systems (pp. 831–838). URL: https://papers.nips.cc/paper/1991/file/ff4d5fbbafdf976cfdc032e3bde78de5-Paper.pdf.
Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 1–34. http://dx.doi.org/10.1613/jair.346.
Yuh-Jye, L., & Mangasarian, O. L. (2001a). RSVM: Reduced support vector machines. In V. Kumar, & R. Grossman (Eds.), Proceedings of the 2001 SIAM international conference on data mining (pp. 1–17). http://dx.doi.org/10.1137/1.9781611972719.13.
Yuh-Jye, L., & Mangasarian, O. L. (2001b). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20, 5–22. http://dx.doi.org/10.1023/A:1011215321374.
