
Soliman MAMA, Abo-Bakr RM. Linearly and quadratically separable classifiers using adaptive approach. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 26(5): 908-918 Sept. 2011. DOI 10.1007/s11390-011-0188-x

Linearly and Quadratically Separable Classifiers Using Adaptive Approach


Mohamed Abdel-Kawy Mohamed Ali Soliman (1) and Rasha M. Abo-Bakr (2)

(1) Department of Computer and Systems Engineering, Faculty of Engineering, Zagazig University, Zagazig, Egypt
(2) Department of Mathematics, Faculty of Science, Zagazig University, Zagazig, Egypt

E-mail: mamas2000@hotmail.com; rasha_abobakr@hotmail.com

Received October 3, 2009; revised May 14, 2011.

Abstract   This paper presents a fast adaptive iterative algorithm to solve linearly separable classification problems in R^n. In each iteration, a subset of the sampling data (n-points, where n is the number of features) is adaptively chosen and a hyperplane is constructed such that it separates the chosen n-points at a margin and best classifies the remaining points. The classification problem is formulated and the details of the algorithm are presented. Further, the algorithm is extended to solving quadratically separable classification problems. The basic idea is based on mapping the physical space to another larger one where the problem becomes linearly separable. Numerical illustrations show that few iteration steps are sufficient for convergence when classes are linearly separable. For nonlinearly separable data, given a specified maximum number of iteration steps, the algorithm returns the best hyperplane that minimizes the number of misclassified points occurring through these steps. Comparisons with other machine learning algorithms on practical and benchmark datasets are also presented, showing the performance of the proposed algorithm.

Keywords   linear classification, quadratic classification, iterative approach, adaptive technique

1 Introduction

Pattern recognition[1-2] is the scientific discipline whose goal is the classification of objects into a number of categories or classes. Depending on the application, these objects can be images, signal waveforms, or any type of measurements that need to be classified. Linear separability is an important topic in the domains of artificial intelligence and machine learning. There are many real-life problems in which there is a linear separation. A linear model is very robust against noise, since a nonlinear model may fit the noisy samples in the training data and perform more calculations to do so, yet be less efficient than a linear model on testing data. Multilayer nonlinear (NL) neural networks, such as those trained by the back-propagation algorithm, work well for nonlinear classification problems. However, using back-propagation for a linear problem is overkill: thousands of iterations may be needed to reach a solution that a linear separation method can deliver quickly. Linear separability methods are also used for training Support Vector Machines (SVMs)[3-4] used for pattern recognition. Support Vector Machines are linear learning machines on linearly or nonlinearly separable data. They are trained by finding a hyperplane that linearly

separates the data. In the case of nonlinearly separable data, the data are mapped into some other Euclidean space; thus, SVM is still doing a linear separation but in a different space. In this paper, a novel and efficient method of finding a hyperplane which separates two linearly separable (LS) sets in R^n is proposed. It is an adaptive iterative linear classifier (AILC) approach. The main idea in our approach is to detect the boundary region between the two classes where the points of different classes are close to each other. Then, from this region, n-points belonging to the two different classes are chosen and a hyperplane is constructed such that each of the n-points lies at a prescribed distance from it (but points belonging to each class lie at opposite sides). There exist precisely two such hyperplanes, from which we choose the one that correctly classifies more points. If the chosen hyperplane successfully classifies all the points, calculations are terminated. Otherwise, other n-points are chosen to start the next iteration. These n-points are chosen adaptively from the misclassified ones as those that were furthest from the hyperplane constructed in the current iteration, because such points most probably lie in the critical region between the two classes. Compared with other iterative linear classifiers, this approach is


adaptive, and numerical results show that very few iteration steps are sufficient for convergence even for large sampling data. The concept of a hyperplane is extended to performing quadratic classifications, not just linear ones. Analogous to the separating hyperplane that is represented by a linear (first degree) equation, in quadratic classification a second degree hypersurface is constructed to separate the two classes. This paper is divided into seven sections. In Section 2, a brief survey of methods which classify LS classes is introduced, showing the theoretical basis for the ones most related to the proposed classifier. In Section 3, the main idea, geometric interpretation and mathematical formulation of the proposed AILC are presented. Illustrative examples are given in Section 4. The quadratically separable classifier is discussed and demonstrated by some examples in Section 5. Comparisons with other known algorithms are performed for linearly and nonlinearly separable benchmark datasets and results are presented in Section 6. Finally, in Section 7, conclusions and future work are discussed.

2 Comparison with Existing Algorithms

Numerous techniques exist in the literature for solving the linear separability classification problem. These techniques include methods based on solving linear constraints (the Fourier-Kuhn elimination algorithm[5] or linear programming[6]), methods based on the perceptron algorithm[7], and methods based on computational geometry (convex hull) techniques[8]. In addition, statistical approaches are characterized by an explicit underlying probability model, which provides a probability that an instance belongs to a specific class, rather than simply a classification. The algorithms most related to the one proposed in this work are the perceptron and SVM algorithms. The perceptron algorithm was proposed by Rosenblatt[5] for computing a hyperplane that linearly separates two finite and disjoint sets of points. In the perceptron algorithm, starting with an arbitrary hyperplane, the dataset is tested sequentially point after point to check if it is correctly classified. If a point is misclassified, the current hyperplane is updated to correctly classify this point. This process is repeated until a hyperplane is found that succeeds in classifying the full dataset. If two classes are linearly separable, the perceptron algorithm will provide, in a finite number of steps, a hyperplane that linearly separates the two classes. However, it is not known ahead of time how many iteration steps are needed for the algorithm to converge. SVM[3], as a linear learning method, is trained by finding an optimum hyperplane that separates the dataset (with the largest possible margin) by solving a constrained convex quadratic programming optimization problem, which is time consuming.

In the proposed AILC, starting with an arbitrary hyperplane, the full dataset is tested and the information about the relative locations of the misclassified points with respect to the hyperplane is utilized to predict the critical region between the two classes where a better hyperplane can exist. This adaptive nature of the iteration speeds up the convergence to a hyperplane that successfully separates the two classes. In Section 3, the classification problem is reformulated to produce the required information at low cost. In addition, the theoretical basis and implementation of AILC are provided.

3 Adaptive Iterative Linear Classifier (AILC)

In this section we present the adaptive iterative linear classifier (AILC). The main idea in our approach is to simulate how one can predict a line in R^2 that separates points belonging to two linearly separable classes. First, it detects the boundary region between the two classes where points of different classes are close to each other. From this region of interest, it can choose two points (one point of each class) that seem to be most difficult (nearest) and predict a line that not only separates the two points but also, as much as possible, correctly separates the two classes; that is, it tries to construct a line having one of the points with the remaining points of its class on one side of the line and the second point with the rest of its class on the other side. If such a line exists, the task is done. Otherwise, another two points are chosen to start the next iteration. These new points are chosen adaptively as those expected, by the line constructed in the current iteration, to lie in the border region between the two classes. Construction of a separating line in our approach is characterized by the requirement that the 2-points lie at a prescribed distance (but at opposite sides) from it. In fact, there exist precisely two such lines, from which we choose the one that correctly separates more points. A generalization to R^n is straightforward. Starting with n-points in R^n belonging to two different classes, we construct a hyperplane such that each of the n-points lies at a prescribed distance (but points belonging to each class lie at opposite sides) from it. Again, there exist precisely two such hyperplanes, from which we choose the one that correctly classifies more points. If the chosen hyperplane successfully classifies all the points, we terminate calculations. Otherwise, a new iteration is started by choosing other n-points from the misclassified ones (see Subsection 3.1 for more details).

This approach is more efficient than other related methods proposed in the literature. For example, the CLS method[9-11] examines each possible hyperplane passing through every set of n-points to check if it can successfully classify the remaining points. When such a hyperplane is reached, the required hyperplane is constructed such that it, further, properly separates the n-points according to their classes.

3.1 Geometric Interpretation and Theoretical Basis for AILC

The classification problem considered in this work consists of finding a hyperplane P that linearly separates N points in R^n. Each of these points belongs to one of the two disjoint classes A or B that lie in the positive or negative half space of P, respectively. If the training data are linearly separable, then a hyperplane

P(w; t):  x^T w + t = 0  (1)

exists such that

x_i^T w + t > 0, for all x_i ∈ A,
x_i^T w + t < 0, for all x_i ∈ B,  (2)

where x_i ∈ R^n is the feature vector (or the coordinates) of point i, while w ∈ R^n is termed the weight vector and t ∈ R the bias (or threshold) of the hyperplane. Defining the class identifier

d_i = +1 if x_i ∈ class A,  d_i = −1 if x_i ∈ class B,  (3)

(2) reduces to the single form

d_i (x_i^T w + t) > 0,  i = 1, 2, ..., N.  (4)

Dividing (4) by |t| yields

d_i (x_i^T W + c) = e_i,  e_i > 0,  i = 1, 2, ..., N,  (5)

where W = w/|t| is a weight vector having the same direction as w (normal to the hyperplane P(W; c): x^T W + c = 0) and pointing to its positive half space, and c = 1 or −1 according to whether the sign of t is positive or negative, respectively. In (5), we have introduced the variables e_i, i = 1, 2, ..., N for the first time. These variables will be the source of information in our approach. According to (5), a hyperplane P(W; c) will correctly separate the two classes if e_i > 0, i = 1, 2, ..., N. However, for a trial hyperplane P(W*; c*), if substitution of W* and c* in (5) produces a negative value for e_i, then point i is misclassified by P*. Another interesting property of these variables is that each e_i is a measure of the distance between point x_i and P. This can easily be proven as follows. Recall that the distance between any point x_i ∈ R^n and the hyperplane P(W; c) is given by

ρ(x_i, P) = |x_i^T W + c| / ||W|| = |e_i| / ||W|| > 0,  (6)

where ||W|| is the L2 norm of W (the length of vector W); then

|e_i| = ||W|| ρ(x_i, P).  (7)

In our approach, since W = (w_1, w_2, ..., w_n)^T consists of n unknown components, we choose n-points and assume that they all lie at a constant distance from a trial hyperplane P such that each point lies in the proper half space according to its class. Substitution of x_i^T, d_i and e_i = ε > 0, i = 1, 2, ..., n in (5), noting that c = 1 or −1, produces two linear systems of equations in the n unknowns w_1, w_2, ..., w_n. Solution of these systems (assuming linear independence of the equations) produces two hyperplanes: P_1 = P(W_1; 1) and P_2 = P(W_2; −1). The first adaptive feature of the proposed algorithm is to select from P_1 and P_2 the more efficient one in classifying the remaining N − n points.

Fig.1. Choice of the better hyperplane. The arrow of each hyperplane refers to its positive half-space.

In Fig.1, an illustration in R^2 is presented with N = 16 (8 points of each class), where we refer by a black circle to the class with identifier d = 1 and by a triangle to the other class with d = −1. The starting 2-points are enclosed in squares. Both P_1 and P_2 successfully separate the chosen points into the two classes. However, it is not guaranteed that both P_1 and P_2 correctly classify the full N-set of points. P_2 succeeded in classifying 12 points (5 circles in its positive half space and 7 triangles on the other side) but failed with the remaining 4 points, whereas P_1 succeeded in classifying 6 points (4 circles in its positive half space and 2 triangles on the other side) but failed with the remaining 10 points. Thus the algorithm chooses P_2.
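As a concrete illustration of (5)-(7), the following short sketch evaluates e_i and the point-to-hyperplane distances for a small made-up 2-D dataset and a trial hyperplane; the points, labels, and the values of W and c below are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Made-up 2-D points and class identifiers (+1 for class A, -1 for class B).
X = np.array([[4.0, 3.0], [3.0, 4.0], [-4.0, -4.0], [-3.0, 0.0]])
d = np.array([1, 1, -1, -1])

# A trial hyperplane x^T W + c = 0 (assumed for illustration).
W = np.array([0.5, 0.5])
c = 1.0

e = d * (X @ W + c)                    # e_i = d_i (x_i^T W + c), Eq. (5)
dist = np.abs(e) / np.linalg.norm(W)   # rho(x_i, P) = |e_i| / ||W||, Eqs. (6)-(7)

print("e =", e)                        # all positive -> the hyperplane separates the classes
print("distances =", dist)
print("misclassified points:", np.flatnonzero(e < 0))
```

A negative entry in e flags a misclassified point, and the magnitude of e_i (scaled by ||W||) tells how far that point lies from the trial hyperplane; this is exactly the information the adaptive step exploits.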

3.2 Mathematical Formulation

Let x_i^T = [x_i1, x_i2, ..., x_in] be the row representation of the components of an input data point x_i ∈ R^n that has n features, and let N be the number of data points belonging to the two disjoint classes (A and B). Then, applying (5) to all N points yields the system

D (X^T W + C) = E,  (8)

where

D = diag(d_1, d_2, ..., d_N),
X^T is the N × n matrix whose i-th row is x_i^T = [x_i1 x_i2 ... x_in],
W = [W_1, W_2, ..., W_n]^T,
C = [c, c, ..., c]^T = c J_N,  (9)
E = [e_1, e_2, ..., e_N]^T,

and J_N is an N-vector whose entries are all unity. Thus, the classification problem is formulated by

D (X^T W + c J_N) = E.  (10)

One has to notice that the matrices X^T and D represent the input data such that, for each point i, X^T contains in row i the feature vector x_i^T, and D is a diagonal matrix whose diagonal elements are the elements of the vector d = [d_1 d_2 ... d_N]^T. Thus, interchanging the rows of both X^T and D corresponds to reordering the N points. In (10), c = 1 or −1. Also, referring to (5), for a separating hyperplane all the entries of the vector E must be positive. Hence the classification problem reads: find a hyperplane such that the entries of E are all positive, or equivalently, find W and c such that

E > 0.  (11)

The proposed solution consists of partitioning the N-system (10) into two subsystems; the first consists of the first n equations while the second consists of the remaining (N − n) equations. Let X^T be partitioned as X^T = [a; b], with a an n × n block and b an (N − n) × n block; then (10) is rewritten as

[D_1 0; 0 D_2] ([a; b] W + c [J_1; J_2]) = [E_1; E_2],  (12)

where a is a nonsingular square matrix of dimension n, b is in general a rectangular matrix of dimension (N − n) × n, J_1 and J_2 are vectors of unit entries with n and N − n components, respectively, and D_1 and D_2 are diagonal square matrices of dimensions n and N − n, respectively. (12) can then be written as

D_1 (a W + c J_1) = E_1,  (13)
D_2 (b W + c J_2) = E_2.  (14)

And the classification problem becomes: find W and c such that E_1 > 0 and E_2 > 0.

3.3 Adaptive Iterative Linear Classifier (AILC)

To simplify the solution of (13) and (14), choose a small positive number ε and assume

e_1 = e_2 = ... = e_n = ε > 0,  (15)

then E_1 = ε J_1 > 0 and hence, upon substitution in (13), using D_1^{-1} = D_1, and solving for W as a function of c, (13) reduces to

W = a^{-1} Q.  (16)

Here Q = ε D_1 J_1 − c J_1 is a vector of length n that is computed easily because its i-th entry is given by ε d_i − c, 1 ≤ i ≤ n. Substituting (16) in (14),

E_2 = D_2 b W + c D_2 J_2.  (17)

To compute E_2, note that its i-th entry is

e_i = d_i (b_i^T W + c),  n + 1 ≤ i ≤ N.  (18)

Clearly, since the vector Q depends on the value of c, so do both W and E_2.

3.3.1 Adaptive Procedure

In the proposed AILC, we try to speed up the convergence rate by making full use of all available information within and after each iteration. Two adaptive choices are performed as follows. First, within iteration r, the algorithm chooses the value of c as +1 or −1 such that the constructed hyperplane correctly classifies more points, as described in Subsection 3.1. In Algorithm 1 the implementation of this adaptive choice is presented.


Algorithm 1. Iteration r (a^{-1}, b, D_1, D_2; c, W, E_2, m)
1. Set c = 1.
2. Compute the vectors W(c) and E_2(c) using (16), (17).
3. Compute m(c) as the number of negative entries of E_2(c).
4. if m(c) = 0, then E_2(c) > 0, go to step 8.
5. else if c = 1, set c = −1 and repeat steps 2-4.
6. if m(1) < m(−1), then c = 1 produces the accepted hyperplane P_r. Set c = 1, go to step 9.
7. else c = −1 produces the accepted hyperplane P_r. Set c = −1, go to step 9.
8. The separating hyperplane P is defined by c, W(c). end iteration r.
9. The best hyperplane P_r is defined by c, W = W(c); return also E_2(c), m(c). end iteration r.
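A minimal sketch of this iteration in Python/NumPy follows; it implements Eqs. (15)-(18) and the adaptive choice of c. The function name and signature are mine (hypothetical), chosen to mirror the pseudocode above.

```python
import numpy as np

def iteration_r(a_inv, b, d1, d2, eps):
    """One AILC iteration: try c = +1 and c = -1, keep the better hyperplane.

    a_inv : inverse of the n x n matrix a of the chosen n-points
    b     : (N - n) x n matrix of the remaining points
    d1,d2 : class identifiers (+1/-1) of the chosen and remaining points
    eps   : prescribed margin value for the chosen n-points (Eq. 15)
    """
    best = None
    for c in (1.0, -1.0):
        Q = eps * d1 - c                 # Q_i = eps * d_i - c          (Eq. 16)
        W = a_inv @ Q                    # W = a^{-1} Q                 (Eq. 16)
        E2 = d2 * (b @ W + c)            # e_i = d_i (b_i^T W + c)      (Eqs. 17-18)
        m = int(np.sum(E2 < 0))          # number of misclassified remaining points
        if m == 0:                       # separating hyperplane found (step 8)
            return c, W, E2, 0
        if best is None or m < best[3]:  # keep the hyperplane with fewer errors
            best = (c, W, E2, m)
    return best                          # best hyperplane P_r (step 9)
```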

Second, after an iteration r, the vector E_r = [E_1; E_2] is computed. E_r is constructed as the augmentation of E_1, all of whose n entries equal ε, and E_2, whose entries are computed by (18). In fact, E_r contains important information about the fitness of the constructed hyperplane P_r as a separator. First, recall that a negative sign of an entry e_i of E_r means that point i is misclassified by the hyperplane. Second, by (7), the absolute value of e_i provides a measure of the distance of point i from the hyperplane. Thus, if the entries of E_r are all positive, then P_r is an acceptable classifier; otherwise, the entries having the lowest values in E_r correspond to the furthest misclassified points from P_r, and hence such points most probably lie in the critical region between the two classes where an objective classifier P has to be constructed. Accordingly, we choose n of these points (which, in addition, must be linearly independent and belong to both of the different classes) to determine the hyperplane in the next iteration. So, matrix a in (12) is chosen by adaptively reordering the input matrix X^T after each iteration such that the first n rows of X^T and D correspond to the data of the chosen n-points. An illustration in R^2 is shown in Fig.2, where black circles and triangles refer to the classes that must lie in the positive and negative half space, respectively. The misclassified points lie in the shaded regions and the chosen 2-points for the next iteration are shown in rectangles.

3.3.2 Implementation of AILC

Algorithm 1 describes a typical iteration r that returns either a separating hyperplane P or a hyperplane P_r that, although it does not successfully classify all the points, minimizes the number m of misclassified points through the adaptive choice of c. The Adaptive Reordering Algorithm (Algorithm 2) rearranges X^T, d such that the first n points in X^T (forming a in the next iteration) satisfy the conditions: 1) they correspond to rows that have the lowest values in E, 2) a is nonsingular, and 3) they belong to the two classes. The details of this algorithm are presented in Algorithm 2. The complete algorithm AILC is presented in Algorithm 3.
Algorithm 2. Adaptive Reordering (n, N, ε, X^T, d, E_2)
1. Form vector E as the augmentation of E_1 (all its n entries equal ε) and E_2.
2. Form vector F such that its entries are the row numbers of E when it is sorted in ascending order.
3. Set a(n, n) = zero matrix, da(n) = zero vector, flag(N) = zero vector.
4. Set i = 1, j = 1.
5. while i < n
6.   while j < N
       I. k = F(j), a_i^T = X_k^T.
       II. if rank(first i rows of a) = i, then set: da(i) = d(k); flag(k) = i; break, end.
       III. j = j + 1; go to step 6.
7.   i = i + 1; go to step 5.
8. i = n.
9. while j ≤ N
       I. k = F(j), a_i^T = X_k^T.
       II. if (d(k) ≠ da(n − 1) and rank(first i rows of a) = i), then set: da(i) = d(k); flag(k) = i; break, end.
       III. j = j + 1; go to step 9.
10. for each 1 ≤ k ≤ N, if (flag(k) = i ≠ 0), set X_k^T = X_i^T, d(k) = d(i).
11. for i = 1 to n, set X_i^T = a_i^T, d(i) = da(i).
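The following NumPy sketch captures the intent of Algorithm 2: pick the n rows with the lowest entries of E, subject to the chosen rows being linearly independent and (for the last pick) both classes being represented. It is a simplified reading of the pseudocode above, with hypothetical names, not the authors' exact implementation.

```python
import numpy as np

def adaptive_reordering(XT, d, E, n):
    """Return indices of the n-points that will form matrix `a` in the next iteration."""
    order = np.argsort(E)                 # row numbers sorted by ascending e_i
    chosen, labels = [], []
    for k in order:
        trial = XT[chosen + [k]]
        # keep the candidate only if the selected rows stay linearly independent
        if np.linalg.matrix_rank(trial) != len(chosen) + 1:
            continue
        # for the last slot, insist on the other class if all previous picks agree
        if len(chosen) == n - 1 and len(set(labels)) == 1 and d[k] == labels[0]:
            continue
        chosen.append(k)
        labels.append(d[k])
        if len(chosen) == n:
            break
    return chosen

# The caller then reorders X^T and d so that these rows come first (steps 10-11).
```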

Fig.2. Illustration of the adaptive choice of the next iteration in the classifier (AILC) in R^2.

4 Numerical Illustration

In this section, the use of the algorithm AILC is demonstrated by three linearly separable (LS) examples.

Algorithm 3. AILC (N, n, X^T, d, ε, rmax; c, W, r, m)
Input: data N, n, the N × n array X^T, the class identifier N × 1 array d, the maximum number of iterations rmax, and the parameter ε.
Output: a hyperplane (c, W), iteration r, and number of misclassified points m.
1. Arrange X^T, d such that the first n rows of X^T form a nonsingular n × n matrix a.
2. Set m0 = N, r = 1.
3. while r ≤ rmax
   a) Form the partitioned matrices a, b, D_1, D_2, then compute a^{-1} (see (12)).
   b) Call Iteration r (a^{-1}, b, D_1, D_2; c, W, E_2, m).
   c) if m = 0 (successful separation), return c, W, r, m; break; end.
   d) else if m < m0, set m0 = m, copt = c, Wopt = W, ropt = r.
   e) Call Adaptive Reordering (n, N, ε, X^T, d, E_2).
   f) r = r + 1.
   g) go to step 3.
4. return the data of the hyperplane with the minimum number of misclassified points: c = copt, W = Wopt; also return m = m0, r = ropt. end.
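A compact driver in the spirit of Algorithm 3 is sketched below; it reuses the iteration_r and adaptive_reordering sketches given earlier and is hypothetical helper code under those assumptions, not the authors' implementation.

```python
import numpy as np

def ailc(XT, d, eps=0.4, rmax=100):
    """Return (c, W, misclassified_count, iteration) for the best hyperplane found."""
    N, n = XT.shape
    idx = adaptive_reordering(XT, d, np.zeros(N), n)   # initial nonsingular choice (step 1)
    best = (None, None, N, 0)
    for r in range(1, rmax + 1):
        if len(idx) < n:                               # could not find n independent rows
            break
        rest = [i for i in range(N) if i not in idx]
        a, b = XT[idx], XT[rest]
        c, W, E2, m = iteration_r(np.linalg.inv(a), b, d[idx], d[rest], eps)
        if m < best[2]:
            best = (c, W, m, r)                        # remember the best hyperplane so far
        if m == 0:                                     # separating hyperplane found
            break
        E = np.empty(N)                                # augmented vector E over all N points
        E[idx], E[rest] = eps, E2
        idx = adaptive_reordering(XT, d, E, n)         # worst points start the next iteration
    return best
```

For linearly separable input the loop typically stops with m = 0 after a few iterations; otherwise the best hyperplane seen within rmax iterations is returned, matching step 4.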


Fig.3. 2D plot of the two-class classification problem (class A (black circles), class B (triangles)). Squares indicate the worst points after iterations 1 and 2. (a) Original dataset. (b) After iteration 1. (c) After iteration 2. (d) After iteration 3.

Table 1. Weight Vectors and Threshold Values Obtained by Executing the Algorithm
i (iteration)    W (weight vector)        c (threshold)
1                (1.6375, 2)              1
2                (0.0481, 0.4192)         1
3                (0.3, 1.175)             1

The first is a 2D-classification problem where successive iterations are visualized to illustrate the adaptive feature and convergence behavior of the algorithm. The influence of the value of ε and of the reordering of the input data on the convergence is numerically discussed. The second example is a 3D-classification problem in R^3, while the third one is a 4D-classification problem in R^4 where the standard benchmark classification dataset IRIS[12] is arranged as two LS classes.

Example 1. A 2D-classification problem consists of two classes A (black circles) and B (triangles) given by:

A = {(4, 3), (0, 4), (2, 1.6), (7, 3), (3, 4), (4, 3), (3, 2)}
B = {(4, 4), (3, 0), (6, 1), (1, 0), (1, 0.5), (0, 7), (6, 2)}.

Points of the two classes A, B are represented in Fig.3(a), showing the great difficulty in classifying these data. Circles about the starting two points (4, 3), (4, 4) are also shown. Figs. 3(b)-3(d) show the application of our algorithm to this problem with ε = 0.45. After each iteration, the computed weight vector and threshold are shown in Table 1. To discuss the dependency of the proposed algorithm on the starting n-points and the parameter ε, we repeat solving the previous example starting with another two points (3, 0), (4, 4) and select ε = 0.4. The number of iterations changes; two iterations were required to classify these difficult data although the starting points

Table 2. Weight Vectors and Threshold Values Obtained by Executing the Algorithm
i (iteration)    W (weight vector)        c (threshold)
1                (0.4667, 0.8167)         1
2                (0.1355, 0.7059)         1

belong to the same class (B). The results are presented in Table 2 and Fig.4. It should be mentioned that no more than 4 iterations were needed to solve this classification problem, irrespective of the starting points and for 0.0001 < ε < 0.5.

Example 2. The algorithm presented in Algorithm 3 was tested by applying it to an LS 3D-classification problem that consists of two classes A and B:

A = {(1, 4.5, 1), (2, 4, 3), (6, 5, 4), (4, 6, 5), (4, 5, 6), (1, 3, 1)}
B = {(0, 4, 0), (2, 4, 3), (4, 4, 2), (3, 4, 4), (2, 3, 3), (4, 4, 1)}.

Fig.4. Classification of the same classes (black circles = class A, triangles = class B) represented in Fig.3(a) when we start with (3, 0), (4, 4) and ε = 0.4. The worst points after the first iteration are included in squares.

Table 3. Weight Vectors and Threshold Values Obtained by Executing the Algorithm
i (iteration)    W (weight vector)              c (threshold)
1                (0.9375, 0.175, 0.425)         1
2                (0.465, 0.175, 0.31)           1

Starting with the points (0, 4, 0), (1, 4.5, 1), (2, 4, 3) and choosing ε = 0.3, Algorithm 3 was applied to classify these data. Two iterations were sufficient to solve this classification problem, as shown in Fig.5. The situations after the first and second iterations are shown in Figs. 5(a) and 5(b), respectively. In each case, the graph was rotated such that the view was perpendicular to the separating plane. After the first iteration the points (2, 4, 3), (1, 3, 1), (0, 4, 0) were found to be the worst. The results of the different iterations are summarized in Table 3.

Fig.5. Original dataset and the constructed hyperplanes for the 3D-problem of Example 2. (a) After the first iteration. (b) After the second iteration.

Example 3. The IRIS dataset[12] classifies a plant as being an Iris Setosa, Iris Versicolour or Iris Virginica. The dataset describes every iris plant using four input parameters (Sepal length, Sepal width, Petal length, and Petal width). The dataset contains a total of 150 samples, with 50 samples for each of the three classes. Some of the publications that used only the samples belonging to the Iris Versicolour and the Iris Virginica classes include: Fisher[13] (1936), Dasarathy (1980), Elizondo (1997), and Gates (1972). Although the IRIS dataset is nonlinearly separable, it is known that all the samples of the Iris Setosa class are linearly separable from the rest of the samples (Iris Versicolour and Iris Virginica). Therefore, in this example, a linearly separable dataset was constructed from the IRIS dataset such that the samples belonging to the Iris Versicolour and Iris Virginica classes were grouped in one class and the Iris Setosa was considered to be the other class. Thus, a linearly separable 4D-classification problem was considered in this example, with 100 points in class A and 50 points in class B. Using the proposed algorithm with ε = 0.5, the data were completely classified after two iterations and the results are collected in Table 4.

Table 4. Weight Vectors and Threshold Values for the Iris Classification Problem
i (iteration)    W (weight vector)                          c (threshold)
1                (0, 0, 0, 2.5)                             1
2                (0.3763, 0.1096, 0.3907, 0.2335)           1

5 Classification of Quadratically Separable Sets

Two classes A, B are said to be quadratically separable if there exists a quadratic polynomial P2(y) = 0, y ∈ R^m, such that P2(y) > 0 if y ∈ A and P2(y) < 0 if y ∈ B. In R^2, a general quadratic polynomial can be put in the form

w_1 y_1^2 + w_2 y_2^2 + w_3 y_1 y_2 + w_4 y_1 + w_5 y_2 + c = 0.  (18)

(18) represents a conic section (a parabola, ellipse, or hyperbola depending on the values of the coefficients w_i). Now, consider a mapping φ: R^2 → R^5 such that a point y = (y_1, y_2) ∈ R^2 is mapped into a point x ∈ R^5 with components

x_1 = y_1^2,  x_2 = y_2^2,  x_3 = y_1 y_2,  x_4 = y_1,  x_5 = y_2.  (19)

Using this mapping, P2(y) = 0 is transformed into a hyperplane x^T w + c = 0 in R^5. The transformed linear classification problem can be solved by algorithm AILC to get w and c, and hence a quadratic polynomial P2(y) = 0 is determined. Generally, a quadratic polynomial in R^m can be transformed into a hyperplane in R^n with n = m +
m(m + 1)/2. Quadratic polynomials in R^3 represent surfaces such as ellipsoids, paraboloids, hyperboloids, and cones. Although the algorithm is applicable to higher dimensions, we present an example in R^2 for convenience of visualization. A set of points belonging to two classes (black points and red + points) is presented in Fig.6. The mapping defined by (19) is used to generate coordinates in R^5 corresponding to the input data points. Algorithm AILC is used to solve the transformed linearly separable problem with two different values ε = 0.4 and ε = 0.5. For each of these values, the resulting quadratic equation is plotted in blue. Although the algorithm successfully classified the points in both cases, it shows sensitivity to the value of ε. For ε = 0.4, five iterations were required to converge to a parabola (see Fig.6(a)), while it takes eleven iterations when ε = 0.5 to converge to the hyperbola shown in Fig.6(b). Moreover, the algorithm may diverge for other ranges of ε values, compared with the case of the linearly separable classification problems where very few iterations (1-3) were sufficient for convergence for 0 < ε < 0.5.
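A minimal sketch of the lifting map (19) follows: each 2-D point is mapped to R^5, so that a conic separator in the original space becomes a hyperplane in the lifted space on which AILC (or any linear classifier) can be run. The function name is illustrative, not from the paper.

```python
import numpy as np

def phi(Y):
    """Map an (N, 2) array of points y = (y1, y2) to R^5 as in Eq. (19)."""
    y1, y2 = Y[:, 0], Y[:, 1]
    return np.column_stack([y1**2, y2**2, y1 * y2, y1, y2])

# Example: lift the data, run the linear classifier in R^5, and read the
# returned weights (w1..w5) and c as the conic coefficients of Eq. (18).
Y = np.array([[1.0, 2.0], [-0.5, 0.3], [2.0, -1.0]])
X = phi(Y)                      # shape (3, 5)
# For a general dimension m the lifted space has n = m + m(m + 1)/2 features,
# e.g. m = 2 -> n = 5 and m = 3 -> n = 9.
```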

Fig.6. Classification by a conic section using different values of ε. (a) ε = 0.4. (b) ε = 0.5.

Fig.7. Application of the algorithm produces an ellipse for the quadratically separable data.

For the difficult dataset presented in Fig.7, the application of the algorithm produces the separating ellipse shown.

6 Numerical Results

In this section we discuss the performance of the algorithm AILC compared with other learning algorithms in the case of linearly and nonlinearly separable practical and benchmark datasets.

6.1 Classification of Linearly Separable Datasets

For the evaluation of the AILC algorithm, the following linearly separable datasets were chosen, including the benchmark dataset IRIS[12] and some randomly generated datasets.
1) IRIS: a full description of the IRIS dataset is given in Section 4 (Example 3). Here, we consider two classes: Iris Setosa (50 samples) versus the non-Setosa (the remaining 100 samples belonging to the Iris Versicolour and the Iris Virginica).
2) G 589 2. 3) G 1972 2. 4) G 19001 2. 5) G 1367 10. 6) G 1353 15.
The following procedure describes the automatic generation of data. Generate a random array consisting of N rows and n columns as the input matrix X^T. To define the class identifier d, we first generate a random vector of length n + 1 for the weight w and c, then for 1 ≤ i ≤ N compute b_i = x_i^T w + c and define d_i as +1 or −1 according to b_i > ε or b_i < −ε, where ε is a small positive number that preserves a margin between the two generated sets. The generated data consist of X^T and d in the form of an N × (n + 1) array. Table 5 gives a summary of the datasets being used.

Table 5. Description of the Benchmark and Randomly Generated Linearly Separable Datasets
Dataset       Samples N    Features n
IRIS          150          4
G 589 2       589          2
G 1972 2      1 972        2
G 19001 2     19 001       2
G 1367 10     1 367        10
G 1353 15     1 353        15
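A sketch of this generation procedure is given below; the handling of points that fall inside the margin band (resampling them) and the sampling ranges are my reading of the description and should be taken as assumptions.

```python
import numpy as np

def generate_ls_dataset(N, n, eps=0.1, seed=0):
    """Generate N linearly separable points in R^n with an eps margin."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1, 1, size=n)            # random weight vector
    c = rng.uniform(-1, 1)                    # random bias
    X, d = [], []
    while len(X) < N:
        x = rng.uniform(-1, 1, size=n)
        b = x @ w + c
        if b > eps:
            X.append(x); d.append(1)          # class A
        elif b < -eps:
            X.append(x); d.append(-1)         # class B
        # points inside the margin band (-eps, eps) are discarded and resampled
    return np.array(X), np.array(d)

# Example: a set analogous to "G 589 2" (589 samples, 2 features).
XT, d = generate_ls_dataset(589, 2)
```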

In the next experiment these linearly separable datasets are used to compare the performance of our proposed algorithm AILC and other machine learning algorithms including a decision tree, a support vector machine and a radial basis function network. A summary of these algorithms is given in Table 6. We compared our results with the implementations in WEKA[14-15].
Table 6. Summary of Machine Learning Algorithms Used to Produce the Results of Tables 7 and 9
J48        Decision tree learner
RBF        Radial basis function network
MLP (L)    Multilayer perceptron with back-propagation neural network using L hidden layers
SMO (d)    Sequential minimal optimization algorithm for support vector classification with a polynomial kernel of degree d
AILC (d)   Proposed adaptive iterative linear (d = 1) and quadratic (d = 2) classifier

Table 7. Results for the Empirical Comparison Showing the Number of Misclassified Instances
              J48    SMO (1)    RBF    AILC (1)
IRIS          0      0          0      0 (2)
G 589 2       1      4          2      0 (3)
G 1972 2      5      2          66     0 (4)
G 19001 2     1      0          162    0 (2)
G 1367 10     11     0          5      0 (42)
G 1353 15     21     0          21     0 (255)

For each dataset, the full data were used in training the different algorithms to predict the best separating hyperplane. The number of misclassified samples, if any, is reported in Table 7. In addition, for AILC, the number of iterations required to obtain the separating hyperplane is given in parentheses. One can easily conclude (from Table 7 and many other experiments not reported here) that although the number of required iterations increases significantly with the number of features n, it is nearly independent of the number of samples N. Being independent of N shows the strength of the adaptive technique, while being significantly dependent on n is the weakness of the proposed technique, resulting from the assumption that the chosen n-points have to be at an equal and prescribed distance from the hyperplane. However, the proposed algorithm succeeded in separating all these datasets while the other algorithms did not.

6.2 Behavior of Algorithm AILC in Nonlinearly Separable Datasets

In this subsection, we discuss the behavior of the proposed adaptive iterative linear classifier algorithm if the dataset is nonlinearly separable, and a comparison among this algorithm and decision tree, back-propagation neural network and support vector machines is presented.

6.2.1 Datasets Used for Empirical Evaluation

For an empirical evaluation of the algorithm AILC on nonlinearly separable datasets, we have chosen five datasets from the UCI machine learning repository[12] for binary classification tasks.
1) Breast-Cancer (BC). We used the original Wisconsin breast cancer dataset, which consists of 699 samples of breast-cancer medical data from two classes. Sixteen examples containing missing values have been removed. 65.5% of the samples came from the majority class.
2) Pima Indian Diabetes (DI). This dataset contains 768 samples with eight attributes (features) each plus a binary class label.
3) Ionosphere (IO). This database contains 351 samples of radar return signals from the ionosphere. Each sample consists of 34 real-valued attributes plus binary class information.
4) IRIS. A full description of the IRIS dataset is given in Section 4 (Example 3). Here, only the 100 samples belonging to the Iris Versicolour and the Iris Virginica classes are considered.
5) Sonar (SN). The sonar database is a high-dimensional dataset describing sonar signals with 60 real-valued attributes. The dataset contains 208 samples.
Table 8 gives an overview of the datasets being used. The numbers of examples in brackets show the original size of the dataset before the examples containing missing values were removed.

Table 8. Numerical Description of the Benchmark Datasets Used for Empirical Evaluation
        Samples (Instances)    Majority Class (%)    Features (Attributes)
BC      (699) 683              65.50                 9
DI      768                    65.10                 8
IO      351                    64.10                 34
IRIS    100                    50.00                 4
SN      208                    53.40                 60

There exist many different techniques to evaluate the performance of different learning techniques based on data with a limited number of samples. The stratified ten-fold cross-validation technique is gaining ascendancy and is probably the evaluation method of choice in most practical limited-data situations. In this technique, the data are divided randomly into ten parts in which the class is represented in approximately the same proportions as in the full dataset. Each part is

held out in turn and the learning scheme is trained on the remaining nine-tenths; then its error rate is calculated on the holdout set. Thus the learning procedure is executed a total of ten times on different training sets (each of which has a lot in common). Finally, the ten error estimates are averaged to yield an overall error estimate. In this study, the technique of cross validation was applied to the benchmark datasets (see Table 8) to predict the performance of our proposed algorithm AILC and other machine learning algorithms including decision tree, back-propagation neural network and support vector machines (see Table 6). We compared our results with the implementations in WEKA[14-15]. The results of the comparison are summarized in Table 9, where the number of misclassified instances and the accuracy of classification, in parentheses, are given.

Table 9. Results for the Empirical Comparison Showing the Number of Misclassified Instances and Accuracy on the Test Set Using 10-Fold Cross Validation
           BC             DI              IO             IRIS        SN
J48        32 (95.31%)    196 (74.48%)    34 (90.31%)    6 (94%)     60 (71.15%)
MLP (3)    36 (94.73%)    181 (76.43%)    31 (91.17%)    7 (93%)     41 (80.28%)
SMO (1)    21 (96.93%)    179 (76.69%)    44 (87.46%)    6 (94%)     50 (75.96%)
SMO (2)    24 (96.49%)    171 (77.73%)    33 (90.60%)    7 (93%)     37 (82.21%)
AILC (1)   37 (94.58%)    199 (74.09%)    69 (80.34%)    6 (94%)     71 (65.87%)
AILC (2)   -              -               -              4 (96%)     -

Although AILC is a linear classifier, it produces reasonable results even in the case of nonlinearly separable datasets. Again, as in the linearly separable case (Subsection 6.1), one can easily conclude that the performance of AILC is independent of the size of the samples N but degrades with the increase of the feature dimension n. Note that for the IRIS dataset, where n = 4, AILC is as accurate as SVM when using a polynomial kernel of degree 1, and its performance outperforms that of SVM when using a polynomial kernel of degree 2. For the datasets BC (n = 9) and DI (n = 8), comparable results are obtained even though N is large (see Table 8). On the other hand, less acceptable results are obtained in the case of IO (n = 34) and SN (n = 60).
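The sketch below illustrates the stratified ten-fold procedure described above, using the ailc sketch from Section 3 as the learner; the fold construction is a straightforward NumPy implementation under stated assumptions, not the WEKA evaluation code used for the reported numbers.

```python
import numpy as np

def stratified_folds(d, k=10, seed=0):
    """Split sample indices into k folds while preserving class proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in np.unique(d):
        idx = np.flatnonzero(d == label)
        rng.shuffle(idx)
        for j, i in enumerate(idx):        # deal this class round-robin over the folds
            folds[j % k].append(int(i))
    return [np.array(f) for f in folds]

def cross_validate(XT, d, eps=0.4, rmax=100, k=10):
    """Return total misclassified test samples and overall accuracy."""
    errors = 0
    for test in stratified_folds(d, k):
        train = np.setdiff1d(np.arange(len(d)), test)
        c, W, _, _ = ailc(XT[train], d[train], eps, rmax)   # train on nine tenths
        e_test = d[test] * (XT[test] @ W + c)               # Eq. (5) on the held-out tenth
        errors += int(np.sum(e_test < 0))
    return errors, 1.0 - errors / len(d)
```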

7 Conclusions

A fast adaptive iterative algorithm, AILC, for classifying linearly separable data is presented. In a binary classification problem containing N samples with n features, the main idea of the algorithm is that it adaptively chooses a subset of n-samples and constructs a hyperplane that separates the n-samples at a margin and best classifies the remaining points. This process is repeated until the separating hyperplane is obtained. If such a hyperplane is not obtained after the prescribed number of iterations, the algorithm returns the hyperplane that misclassifies the fewest samples. Further, a quadratically separable classification problem can be mapped from its physical space to another larger one where the problem becomes linearly separable. From the various numerical illustrations and the comparisons with other classification algorithms on benchmark datasets, one can conclude: 1) the algorithm is fast due to its adaptive feature; 2) the complexity of the algorithm is C1 N + C2 n^2, where C1 and C2 are independent of N, which ensures excellent performance especially when n is small; 3) the assumption that the n-samples must lie at a prescribed margin from the hyperplane is restrictive and makes the convergence rate dependent on n; on the other hand, the user must provide the prescribed parameter ε, which is problem dependent; 4) convergence rates of AILC are measured either by the number of iterations required to get the separating hyperplane or by the number of misclassified samples after the prescribed number of iterations. Theoretical and numerical results show that the convergence rates are nearly independent of N but degrade with the increase of n, and usually fewer iterations are sufficient for convergence for small n. Although reasonable results were obtained, convergence was greatly dependent on the margin of the chosen n-points, which in turn depends on the prescribed parameter ε. Other algorithms are in development to predict the value of ε that ensures a maximum margin for the n-points. Moreover, the classification problem as formulated in Section 3 may be developed into a linear programming algorithm that determines ε as an n-valued vector, rather than a scalar value, and produces the hyperplane with maximum margin.

References
[1] Duda R O, Hart P E, Stork D G. Pattern Classification. New York: Wiley-Interscience, 2000.
[2] Theodoridis S, Koutroumbas K. Pattern Recognition. Academic Press, An Imprint of Elsevier, 2006.
[3] Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines. Vol. I, Cambridge University Press, 2003.
[4] Atiya A. Learning with kernels: Support vector machines, regularization, optimization, and beyond. IEEE Transactions on Neural Networks, 2005, 16(3): 781.

[5] Rosenblatt F. Principles of Neurodynamics. Spartan Books, 1962.
[6] Taha H A. Operations Research: An Introduction. Macmillan Publishing Co., Inc., 1982.
[7] Zurada J M. Introduction to Artificial Neural Systems. Boston: PWS Publishing Co., USA, 1999.
[8] Barber C B, Dobkin D P, Huhdanpaa H. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 1996, 22(4): 469-483.
[9] Tajine M, Elizondo D. New methods for testing linear separability. Neurocomputing, 2002, 47(1-4): 295-322.
[10] Elizondo D. Searching for linearly separable subsets using the class of linear separability method. In Proc. IEEE-IJCNN, Budapest, Hungary, Jul. 25-29, 2004, pp.955-960.
[11] Elizondo D. The linear separability problem: Some testing methods. IEEE Transactions on Neural Networks, 2006, 17(2): 330-344.
[12] www.archive.ics.uci.edu/ml/datasets.html, Mar. 31, 2009.
[13] Fisher R A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7: 179-188.
[14] http://www.cs.waikato.ac.nz/ml/weka/, May 1, 2009.
[15] Witten I H, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, 2005.

Rasha M. Abo-Bakr was born in 1976 in Egypt and received her Bachelor's degree from the Mathematics (Computer Science) Department, Faculty of Science, Zagazig University, Egypt. She was awarded her Master's degree in computer science in 2003, with a thesis titled "Computer Algorithms for System Identification". Since 2003 she has been an assistant lecturer at the Mathematics (Computer Science) Department, Faculty of Science, Zagazig University. She received her Ph.D. degree in mathematics & computer science from Zagazig University in 2011, with a dissertation titled "Symbolic Modeling of Dynamical Systems Using Soft Computing Techniques". Her research interests are artificial intelligence, soft computing technologies, and astronomy.

Mohamed Abdel-Kawy Mohamed Ali Soliman received the B.S. degree in electrical and electronic engineering from M.T.C. (Military Technical College), Cairo, Egypt, with grade Excellent, in 1974, the M.S. degree in electronic and communications engineering from the Faculty of Engineering, Cairo University, Egypt, with research on observers in modern control systems theory, in 1985, and the Ph.D. degree in aeronautical engineering, with the thesis "Intelligent Management for Aircraft and Spacecraft Sensors Systems", in 2000. He is currently head of the Computer and Systems Engineering Department, Faculty of Engineering, Zagazig University. His research interests lie in the intersection of the general fields of computer science and engineering, brain science, and cognitive science.
