Research Survey on Support Vector Machine

Huibing Wang
Faculty of Software, Fujian Normal University
Fuzhou, China 350108

Jinbo Xiong
Fujian Engineering Research Center of Public Service Big Data Mining and Application
Fuzhou, China 350108

Zhiqiang Yao∗
Fujian Engineering Research Center of Public Service Big Data Mining and Application
Fuzhou, China

Mingwei Lin
Fujian Engineering Research Center of Public Service Big Data Mining and Application

Jun Ren
Faculty of Software, Fujian Normal University
jun

ABSTRACT
Support vector machine (SVM), as a learning machine, has shown good learning and generalization ability in classification, regression and forecasting. Because of its excellent learning performance, SVM has long been a hotspot in machine learning. Hence, this paper gives a systematic introduction to SVM, including the theory of SVM, a summary and comparison of the quadratic programming optimizations and parameter optimizations for SVM, and an introduction to some newer SVMs such as FSVM, TSVM and MSVM. Then, the applications of SVM in real life are presented, especially in the area of mobile multimedia. Finally, we conclude with a discussion of the direction of further SVM improvement.

KEYWORDS
SVM, mobile multimedia, optimization algorithms, SVM extension, application

1 INTRODUCTION
SVM was first proposed by Vapnik in 1995 [1] and was originally named "support-vector networks". It is a binary classification algorithm that implements the idea of non-linearly mapping input vectors to a very high-dimensional feature space and constructing a linear decision surface (hyperplane) in that feature space. Hence, it performs well on both separable and non-separable problems. This hyperplane is optimal in the sense of being a maximal margin classifier with respect to the training data [2]. The Structural Risk Minimisation (SRM) principle, which SVM follows, equips SVM with a greater ability to generalise. Owing to these advantages, SVM was quickly applied to classification and regression problems [3]. Moreover, SVM copes with small samples, nonlinear problems and the "curse of dimensionality", which is why it has long attracted the attention of many researchers. As for application, it has been widely
10th, July 2017, Chongqing, People’s Republic of China Wang et al.

multimedia [8]. We also need to put great effort into designing applications that take into account the way the user perceives the overall quality of the provided service. SVM has been applied to many areas of mobile multimedia. Thus, this article focuses on SVM applications in the field of mobile multimedia.

The remainder of the paper is organized as follows. Section 2 introduces the principle of SVM through three cases. In Section 3, some algorithm optimizations and parameter optimizations for SVM are presented and compared. Section 4 covers some popular extensions of SVM such as FSVM, TWSVM and MSVM. The applications of SVM in many popular areas, especially mobile multimedia, are presented in Section 5. Finally, Section 6 concludes with the direction of further SVM improvement.

[Figure 1: Points of Males and Females. Scatter plot with legend entries Female and Male.]


SVM shows significant advantages on both separable and non-separable problems. Separable problems include linearly separable problems and non-linearly separable problems. Thus, this section presents the theory of SVM through three cases.

Case of Linearly Separable
Classification problems originated from the linearly separable situation, and SVM was first applied to linearly separable problems [3]. Classification results often provide valuable information for further prediction or analysis, so people are committed to using effective classification algorithms. For instance, suppose we are given training data consisting of some people's weight, height and corresponding sex, and we want to use them to predict the gender of unlabeled data [2].

As Figure 1 shows, the two types of points represent males and females, respectively. Figure 2 shows that there are a number of lines we can draw to divide the space into two regions. It is easy to see that the black solid line is the optimal line, since it maximizes the margin between itself and the nearest points of each class.

[Figure 2: Classification Lines.]

SVM extends the two-dimensional linearly separable problem to multiple dimensions and aims to seek the optimal classification surface, also called the optimal hyperplane. The optimal hyperplane can be defined as:

$$ w^{T} x + b = 0, \tag{1} $$

in which $w$ refers to the weight vector and $b$ is the threshold. Thus, the relation between $x_i$ and $f(x_i)$ can be defined as:

$$ f(x) = w^{T} x + b. \tag{2} $$

Seeking the optimal hyperplane is equivalent to maximizing the distance between the hyperplane and the vectors closest to it [1]. Define the Euclidean distance between the nearest points and the hyperplane $f(x)$ as:

$$ r = \left| \frac{f(x)}{\|w\|} \right|, \tag{3} $$

where $f(x)$ refers to the functional margin.

Assume that the functional margin $f(x)$ between the nearest points and the hyperplane is 1, as Figure 3 shows. This assumption comes with the constraint $y_i (w^{T} x_i + b) \ge 1$, $i = 1, 2, \ldots, n$, which implies that all training data lie on or beyond the two bounding hyperplanes, and the training data on these hyperplanes




[Figure 3: Optimal Classification Line.]
[Figure 4: Case of Non-linearly Separable.]

are called support vectors (SVs). Thus, the margin $d$ between the parallel bounding planes can be defined as:

$$ d = 2r = \frac{2}{\|w\|}. $$

Thus the objective function can be presented as:

$$ \max_{w,b} \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i (w^{T} x_i + b) \ge 1, \; i = 1, 2, \ldots, n. \tag{4} $$

For ease of calculation, equation (4) is translated into:

$$ \min_{w,b} \frac{1}{2} \|w\|^{2} \quad \text{s.t.} \quad y_i (w^{T} x_i + b) \ge 1, \; i = 1, 2, \ldots, n. \tag{5} $$

The solution to this optimization problem is given by the saddle point of the Lagrange functional:

$$ L(w, b, \alpha) = \frac{1}{2} \|w\|^{2} - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^{T} x_i + b) - 1 \right], \tag{6} $$

where $\alpha_i$ are Lagrange multipliers, subject to $\alpha_i \ge 0$; equation (6) has to be minimized with respect to $w$ and $b$ and maximized with respect to $\alpha_i$. The objective function is thus translated into:

$$ p^{*} = \min_{w,b} \theta(w) = \min_{w,b} \max_{\alpha_i \ge 0} L(w, b, \alpha). \tag{7} $$

Classical Lagrange duality enables the primal problem to be transformed into its dual problem when the Karush-Kuhn-Tucker (KKT) conditions are satisfied. Hence, when equation (7) fulfills the KKT condition $\sum_{i=1}^{n} \alpha_i \left[ y_i (w^{T} x_i + b) - 1 \right] = 0$, it is equal to its dual function:

$$ d^{*} = \max_{\alpha_i \ge 0} \min_{w,b} L(w, b, \alpha). \tag{8} $$

It is the KKT condition that determines that only the SVs contribute to the extremum of the objective function. The minimum of the Lagrangian $L$ with respect to $w$ and $b$ is then given by

$$ \frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0. \tag{9} $$

Hence, from equations (6), (9) and the KKT condition, the dual problem $d^{*}$ can be changed into maximizing

$$ \theta(\alpha) = \min_{w,b} L(w, b, \alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{T} x_j, \tag{10} $$

which is a convex quadratic optimization problem.

Suppose that $\alpha_i$ is the solution of equation (10), $X_r$ is a support vector belonging to the positive class, and $X_s$ is a support vector belonging to the negative class. Then the optimal solution $w$ and $b$ can be expressed as follows:

$$ w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad b = -\frac{1}{2} w^{T} (X_r + X_s). \tag{11} $$

Hence, from equations (2), (9) and (11), the final decision function is:

$$ f(x) = \operatorname{sgn}\left[ \sum_{\text{SVs}} \alpha_i y_i x_i^{T} x + b \right]. \tag{12} $$

Case of Non-linearly Separable
In addition to linearly separable problems, there are many more nonlinear cases in real life. As the example in Figure 4 shows, no line can classify the two classes well; only curves can.

SVM shows its superiority on nonlinear problems by mapping the input vectors from a low-dimensional feature space to a high-dimensional feature space, in which it searches for a linear optimal separating hyperplane.
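The idea of mapping inputs to a higher-dimensional feature space, where a linear hyperplane suffices, can be illustrated with a toy sketch. The data set, the feature map `phi` and the hyperplane `(w, b)` below are assumptions chosen for illustration only; they do not come from the paper, and no SVM training is performed here.

```python
# Illustrative sketch: a 1-D data set that no single threshold can split
# becomes linearly separable after the explicit map phi(x) = (x, x^2).
# The data, phi, and the hyperplane (w, b) are hypothetical demo choices.

def phi(x):
    """Map a scalar input to the 2-D feature space (x, x^2)."""
    return (x, x * x)


def linear_decision(w, b, z):
    """Return the sign of w^T z + b, the linear rule in feature space."""
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score >= 0 else -1


def separable_by_threshold(xs, ys):
    """Try every 1-D linear rule sign(s * (x - t)) and report if any fits."""
    pts = sorted(xs)
    cuts = [pts[0] - 1.0]
    cuts += [(lo + hi) / 2 for lo, hi in zip(pts, pts[1:])]
    cuts += [pts[-1] + 1.0]
    for t in cuts:
        for s in (1, -1):
            preds = [1 if s * (x - t) >= 0 else -1 for x in xs]
            if preds == list(ys):
                return True
    return False


# Class +1 lies on both sides of class -1, so no threshold on x works.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

# In feature space the hyperplane x^2 = 2.5 (w = (0, 1), b = -2.5)
# separates the two classes.
w, b = (0.0, 1.0), -2.5
mapped_predictions = [linear_decision(w, b, phi(x)) for x in xs]
```

Here `separable_by_threshold(xs, ys)` checks the original one-dimensional inputs, while `mapped_predictions` applies a single linear rule after the mapping; an SVM would learn `w` and `b` from data rather than having them fixed by hand.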


It has been demonstrated that this mapping does not incur large storage and computational expense [2].

The following is an example of the mapping process. Define a curve in two-dimensional space as:

$$ g(x) = C_0 + C_1 x + C_2 x^{2}. $$

Map it into a three-dimensional space as follows:

$$ y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1 \\ x \\ x^{2} \end{bmatrix}, \qquad a = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} C_0 \\ C_1 \\ C_2 \end{bmatrix}. $$

Then, the primal function is translated into:

$$ g(x) \;\Rightarrow\; f(y) = \langle a, y \rangle = a^{T} y. $$

This is a linear function of $y$, which means the original nonlinear problem has been converted into a linearly separable problem. SVM employs the same idea, mapping $w$ and $x$ into a high-dimensional space to obtain new $w'$ and $x'$:

$$ f(x) = \langle w, x \rangle + b \;\Rightarrow\; g(x') = \langle w', x' \rangle + b. $$

Hence, the new decision function can be defined as follows:

$$ g(x') = \sum_{i=1}^{n} \alpha_i y_i \langle x'_i, x' \rangle + b, \tag{13} $$

where the dot product is the core operation. According to the existence theorem of kernel functions, for any sample set $K$ there is a map $F$ under which $f(k)$ (in the high-dimensional space) is linearly separable. Based on this theorem, SVM employs a kernel function during the mapping process, so that the dot product that would otherwise be computed in the high-dimensional space is evaluated in the low-dimensional space. The decision function with a kernel function is as follows:

$$ f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b. \tag{14} $$

Different kernel functions can be chosen to construct different types of SVM. The following are a few commonly used kernel functions:

(1) Polynomial

$$ K(x, x_i) = (x \cdot x_i)^{q}, \quad q = 1, 2, \ldots; \tag{15} $$

(2) Gaussian Radial Basis Function

$$ K(x, x_i) = \exp\left( -\frac{\|x - x_i\|^{2}}{2\sigma^{2}} \right); \tag{16} $$

(3) Multi-Layer Perceptron

$$ K(x, x_i) = \tanh\left( b (x \cdot x_i) - c \right). \tag{17} $$

Case of Non-separable
In the last case, some points belonging to the positive class may be classified into the negative class. In other words, no hyperplane can separate the points of different categories accurately. Usually these error points are regarded as noise that can be ignored. However, machines cannot deal with the error points as humans do. Thus, Cortes [1] introduced slack variables $\xi_i \ge 0$ (a measure of the misclassification error) and a penalty coefficient $C$ (the emphasis placed on outliers) to offset the effect of noise. The constraint $y_i (w^{T} x_i + b) \ge 1$, $i = 1, 2, \ldots, n$, is modified to:

$$ y_i (w^{T} x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n. $$

$C$ is a given value, and the problem is subject to the above constraint. Too small a $\xi$ may lead to overfitting, while too large a $\xi$ may lead to underfitting. Equations (5) and (6) are changed into:

$$ \Phi(w, \xi) = \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{n} \xi_i, \tag{18} $$

$$ L(w, b, \alpha) = \frac{1}{2} \|w\|^{2} - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^{T} x_i + b) - (1 - \xi_i) \right]. \tag{19} $$

After derivation, the decision function for this case is still equation (12).

In summary, SVM is a powerful classifier that handles each of these classification cases with the same decision function, high classification accuracy and low computational cost.

3 THE OPTIMIZATION OF SVM ALGORITHM
SVM shows its superiority in many ways; however, its high classification accuracy and low computational cost depend on good treatment of the quadratic programming problem and of parameter selection. Thus, this section introduces some optimization algorithms for SVM.

Algorithms for Quadratic Programming Optimization
From the implementation point of view, training an SVM is equivalent to solving a linearly constrained Quadratic Programming (QP) problem [11]. The QP problem is challenging when the number of data points exceeds a few thousand, which is often the case in practical applications. Thus, much research on this difficult QP problem has been proposed.

A comparison of three commonly used QP optimization algorithms is shown in Table 1. From the comparison, the superiority of SMO is apparent, and it is the most


Table 1: Comparison of QP Optimization Algorithms

Name: Chunking algorithm [9]
Introduction: Breaks a large QP problem into a series of smaller QP problems.
Advantages: Reduces the storage requirement.
Disadvantages: Relies heavily on the number of SVs; ineffective on large data sets with a high percentage of SVs.
Improvements: Double chunking technique for solving large-scale problems [10].

Name: Decomposition algorithm [11]
Introduction: Breaks a large QP problem into a series of smaller QP problems (in a different way from the chunking algorithm); its main work is selecting a suitable working set.
Advantages: Reduces the training cost.
Disadvantages: Its results are overly dependent on the choice of working set.
Improvements: New methods for working set selection [12]; two new decomposition algorithms for training bound-constrained SVM [13].

Name: Sequential Minimal Optimization (SMO) [14]
Introduction: A special case of the decomposition algorithm; instead of using numerical QP as an inner loop as the former two do, SMO uses an analytic QP step.
Advantages: Much faster than the former two; reduces the storage requirement efficiently.
Disadvantages: Still time-consuming for large-scale problems; training slows down when large margin values are used.
Improvements: Methods for solving large-size problems [15]; methods for speeding up training [16].
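The analytic two-multiplier step attributed to SMO in Table 1 can be sketched in a few lines. What follows is a minimal sketch of the widely taught simplified SMO variant with a linear kernel, not Platt's full algorithm [14]: the function name `smo_train`, the choice of the second multiplier by largest $|E_i - E_j|$, and the toy data are all assumptions made for illustration.

```python
# Simplified SMO sketch for a linear-kernel soft-margin SVM (pure Python).
# Assumptions: teaching simplification, not the full SMO of [14]; the
# second multiplier j is picked by the largest |E_i - E_j|.

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def smo_train(X, y, C=10.0, tol=1e-3, max_passes=5, max_iter=1000):
    n = len(X)
    alpha = [0.0] * n
    b = 0.0

    def f(x):
        # Decision value: sum_i alpha_i * y_i * <x_i, x> + b
        return sum(alpha[i] * y[i] * dot(X[i], x) for i in range(n)) + b

    passes, iters = 0, 0
    while passes < max_passes and iters < max_iter:
        iters += 1
        changed = 0
        for i in range(n):
            E_i = f(X[i]) - y[i]
            violates = ((y[i] * E_i < -tol and alpha[i] < C)
                        or (y[i] * E_i > tol and alpha[i] > 0))
            if not violates:
                continue
            errors = [f(X[k]) - y[k] for k in range(n)]
            j = max((k for k in range(n) if k != i),
                    key=lambda k: abs(E_i - errors[k]))
            E_j = errors[j]
            a_i, a_j = alpha[i], alpha[j]
            # Box bounds that keep sum_i alpha_i * y_i = 0 and 0 <= alpha <= C
            if y[i] != y[j]:
                L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            k_ii, k_jj, k_ij = dot(X[i], X[i]), dot(X[j], X[j]), dot(X[i], X[j])
            eta = 2 * k_ij - k_ii - k_jj
            if L == H or eta >= 0:
                continue
            # Analytic solution of the two-variable QP, clipped to [L, H]
            new_j = min(H, max(L, a_j - y[j] * (E_i - E_j) / eta))
            if abs(new_j - a_j) < 1e-5:
                continue
            new_i = a_i + y[i] * y[j] * (a_j - new_j)
            b1 = b - E_i - y[i] * (new_i - a_i) * k_ii - y[j] * (new_j - a_j) * k_ij
            b2 = b - E_j - y[i] * (new_i - a_i) * k_ij - y[j] * (new_j - a_j) * k_jj
            if 0 < new_i < C:
                b = b1
            elif 0 < new_j < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            alpha[i], alpha[j] = new_i, new_j
            changed += 1
        passes = passes + 1 if changed == 0 else 0

    dim = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][d] for i in range(n)) for d in range(dim)]
    return alpha, w, b


# Hypothetical 1-D toy data: the maximum-margin boundary lies at x = 2.
X_toy = [(0.0,), (1.0,), (3.0,), (4.0,)]
y_toy = [-1, -1, 1, 1]
alpha_toy, w_toy, b_toy = smo_train(X_toy, y_toy)
predictions = [1 if dot(w_toy, x) + b_toy >= 0 else -1 for x in X_toy]
```

Each update touches only two multipliers and solves their sub-problem in closed form, which is why no numerical QP solver is needed in the inner loop. This simplified selection rule can stop before every multiplier satisfies the KKT conditions, which is one reason practical implementations use more careful working-set heuristics.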

widely used in SVM nowadays. For large-scale problems, however, it still needs further improvement.

Parameter Optimization for SVM
SVM involves many parameters, including slack variables, the penalty parameter and kernel parameters, and the selection of these parameters strongly affects the efficiency of SVM. Traditional optimization methods include cross-validation [17] and the grid algorithm. The former improves classification accuracy, but at the cost of heavy computation and low efficiency. The latter is simple and easy to use; however, it is sensitive to the initial values of the parameters and is time-consuming, with low accuracy and intensive computation. To address these defects, many new optimization methods have been proposed, which are collectively referred to as swarm intelligence optimization.

Table 2 compares these parameter optimization methods. PSO and DE have each shown their own advantages. PSO has been combined with many other improvement algorithms to improve its performance. DE, a newer hot spot with significant advantages, still needs much further research.

4 VARIANT SUPPORT VECTOR MACHINE
SVM is a strong classifier that performs very well owing to its solid theoretical foundation and good classification results. However, SVM shows limitations in solving large-scale problems and multi-class classification problems. Therefore, this section presents some newer SVMs that improve on the traditional SVM.

Fuzzy Support Vector Machine (FSVM)
Many research results show that SVM is very sensitive to noise and outliers in the training sample due to overfitting [22]. Hence, FSVM [23, 24] was proposed to deal with this problem. Traditional SVM treats all training points uniformly, and each training point belongs to either one class or the other. However, some input points may not be exactly assigned to one of these two classes. Some are more important and should be fully assigned to one class so that SVM can separate these points more correctly. Some data points corrupted by noise are less meaningful and should be discarded. The key point of FSVM is that it applies a fuzzy membership (importance weight) to each sample in the training dataset, such that different samples make different contributions to the construction of the SVM hyperplane. The fuzzy membership helps the SVM reduce the effect of outliers and noise in data points. Some newer methods assigned smaller weight or even zero weight


Table 2: Comparison of Swarm Intelligence Optimization

Name: Ant colony optimization algorithm (ACO) [18]
Introduction: An evolutionary algorithm based on population optimization; also called a heuristic search swarm intelligence algorithm.
Advantages: Strong robustness and strong global search ability; not sensitive to the initial values of the parameters.
Disadvantages: Slow convergence; easily falls into a local optimum.

Name: Genetic algorithms (GAs) [19]
Introduction: Follows Darwin's evolutionary thought; finds the optimal value through a series of operations: initializing the population, calculating fitness with a fitness function, selection, crossover, mutation and so on.
Advantages: High efficiency and global optimization; reduces the reliance on initial value selection.
Disadvantages: Complex; easily falls into a local optimum; not suited to peak problems.

Name: Particle Swarm Optimization (PSO) [20]
Introduction: An evolutionary algorithm newer than GA, which originated in the foraging behavior of birds.
Advantages: Has GA's advantages and is simpler than GA; has higher optimization efficiency and is easy to implement.
Disadvantages: Premature convergence; not suitable for multi-peak problems; easily falls into a local optimum.

Name: Differential evolution (DE) [21]
Introduction: Arguably one of the most powerful stochastic real-parameter optimization algorithms in current use; it has drawn the attention of many researchers all over the world and has generated many variants of the basic algorithm with improved performance.
Advantages: Simpler and faster; relatively few algorithm control parameters; strong robustness and stronger global search ability; can solve multi-peak problems.
Disadvantages: Premature convergence; weak local search ability.
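As a rough illustration of how a swarm method from Table 2 could search SVM parameters, here is a minimal PSO sketch. Everything specific in it is an assumption: the function names, the swarm settings, and above all `surrogate_validation_error`, which is a made-up smooth stand-in for a cross-validation error over $(\log C, \log \gamma)$. In practice that objective would train and validate an SVM for each candidate.

```python
import random

# Minimal particle swarm optimization sketch (pure Python).
# Assumption: swarm settings and the objective are hypothetical demo choices.

def pso_minimize(objective, bounds, n_particles=20, n_iters=200,
                 inertia=0.7, c_cognitive=1.5, c_social=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # per-particle best position
    best_val = [objective(p) for p in pos]      # per-particle best value
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # swarm-wide best

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to own best + pull to swarm best
                vel[i][d] = (inertia * vel[i][d]
                             + c_cognitive * r1 * (best_pos[i][d] - pos[i][d])
                             + c_social * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


def surrogate_validation_error(params):
    """Hypothetical stand-in for cross-validation error over (log C, log gamma)."""
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2


best_params, best_error = pso_minimize(
    surrogate_validation_error, [(-5.0, 5.0), (-5.0, 5.0)])
```

The surrogate has a single minimum at $(1, -2)$, so it does not exercise the premature-convergence and multi-peak weaknesses listed for PSO in Table 2; a real validation-error surface may be far less well behaved.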

to noisy and outlier data [25, 26], and a new type of FSVM [27] combined with some other algorithms was proposed to increase the difference between the meaningful training points and outliers. The future work for FSVM would focus on the selection of fuzzy membership functions.

Twin Support Vector Machine (TWSVM)
TWSVM was proposed by Jayadeva et al. [28] for the binary classification problem and has become a research hot spot in machine learning during the last few years [29]. TWSVM generates two nonparallel hyperplanes by solving a pair of smaller-sized quadratic programming problems (QPPs), such that each hyperplane is closer to one class and as far as possible from the other. The strategy of solving two smaller-sized QPPs rather than a single larger-sized QPP makes the learning speed of TWSVM faster than that of the standard SVM in theory. Owing to its lower computational complexity and better generalization ability, TWSVM has been studied extensively and developed rapidly in the last few years. Many variants of TWSVM have been proposed, such as twin bounded support vector machine [30], nonparallel SVM [31], least-square TSVM [32], twin support vector regression [33], etc. Recently, a novel twin support vector machine with the pinball loss (Pin-TSVM) was proposed, which uses the quantile distance to reduce noise sensitivity while remaining fast [34].

Multi-class Support Vector Machine
SVM was originally designed for binary classification, while in real life most applications involve multi-class classification problems. Multi-class SVM [35] follows two approaches: (1) consider all data in one optimization formulation directly; (2) construct and combine several binary classifiers. We typically construct a multi-class classifier by combining several binary classifiers, because this approach has driven much of the progress in algorithm research, while the former approach incurs high computational complexity and achieves low efficiency in application [36]. Multi-class problems require the number of variables to be proportional to the number of classes, which means that solving a multi-class problem consumes more computation. In response to such problems, some novel methods have been proposed. Lee et al. [37] proposed a multi-category SVM, which is a rightful extension from a binary SVM to a multi-category


SVM. Madzarov et al. [38] presented a novel architecture of SVM classifiers based on a binary decision tree, which takes advantage of both the efficient computation of the decision tree architecture and the high classification accuracy of SVMs. Recently, a novel multiple projection twin support vector machine (PTSVM) was proposed [39], which greatly improves the generalization ability.

Distributed Support Vector Machine (DSVM)
DSVM is based on connected networks. The network is divided into several sub-regions by a certain strategy, and an SVM is trained in each sub-region, either serially or in a distributed manner. Finally, the sub-regions obtain an approximate solution or the global optimal solution of the original SVM problem through some communication. Navia-Vazque et al. [40] proposed a true DSVM, as opposed to a parallel SVM, which includes two distributed schemes: a distributed packet method for propagating raw data, and a more elaborate distributed semiparametric SVM. It reduces the total amount of information transferred between nodes and provides a privacy protection mechanism for information sharing. Then, Lu et al. [41] considered strongly connected networks, Flouri et al. [42] proposed a DSVM over wireless sensor networks, and Kim et al. [43] later proposed an improvement of this method. Recently, distributed semi-supervised SVM [44], resilient DSVM [45] and some other new DSVM methods have been proposed, which differ from existing parallel algorithms and show good performance and experimental results. The various sub-problems only need to exchange hull information with each other, without a central processing unit. Owing to these advantages, DSVM has become a popular research direction.

It is worth mentioning that newer types of SVM, such as the DSVM above, achieve better and better performance. An even newer method is SVM classification trees [46], which is designed for multi-class problems. Given the advantages of these new methods, they are likely to be future research hot spots.

5 APPLICATIONS OF SVM
SVM has a strong theoretical advantage: it guarantees that the solution found is the global optimum rather than a local minimum. In real applications, SVM has also been successfully used for classification, regression and forecasting in all areas of life. In particular, it shows significant ability in pattern recognition, such as text recognition [47], handwriting recognition [48], face recognition [49], etc. In the economic field, SVM is used for real estate forecasting [50], stock forecasting [51], etc. In the transportation field, SVM is applied to identify license plate numbers [52], trained to distinguish driving performance and physiological measurements [53], etc. In the medical field, SVM has been used in automatic medical decision support systems for medical images [54], prediction of substrate specificities of membrane transport proteins [55], gene classification [56], etc.

Besides the applications in traditional areas, SVM has also been applied in the area of mobile multimedia, mainly in multimedia retrieval and multimedia event detection (MED). For multimedia retrieval, SVM was applied to music and image retrieval early on. In 2001, Tong et al. [57] proposed the use of a support vector machine active learning algorithm for conducting effective relevance feedback for image retrieval, which can quickly learn a boundary that separates the images and achieves high search accuracy. Mandel et al. (2006) [58] described a system for performing flexible music similarity queries using SVM active learning to achieve high search accuracy. As the demand for multimedia grows and digital technology evolves, searching and organizing large-scale datasets becomes a challenging task. Thus, some improved approaches have been presented in succession. For example, Yildizer et al. [59] proposed an extremely fast content-based image retrieval (CBIR) system that uses a Multiple Support Vector Machines Ensemble; CBIR has become very popular for browsing, searching and retrieving images from large databases of digital images with minimum human intervention. Aryafar et al. [60] introduced a classifier fusion of multimodal audio and lyrics data to address the limitations of single-modality classification. As for multimedia event detection, SVM has been applied to MED based on audio or video in recent years. Wang et al. (2016) [61] proposed classifying clips for events using a "recurrent SVM", and Tzelepis et al. (2016) [62] used a kernel SVM with isotropic Gaussian sample uncertainty for video event detection; both achieved good results. Beyond these, SVM can also provide real-time multimedia quality assessment [63], mixed-type audio classification [64], etc. In the future, with the development of SVM and its extensions, SVM may be more widely used in mobile multimedia and achieve much more success.


