
International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 6, June 2013)

A Review of Data Classification Using K-Nearest Neighbour Algorithm

Aman Kataria¹, M. D. Singh²
¹P.G. Scholar, Thapar University, Patiala, India
²Assistant Professor, EIE Department, Thapar University, Patiala, India
Abstract— Whether the task lies in the field of neural networks or in an application of biometrics, viz. handwriting classification or iris detection, perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbor classifier, in which classification is achieved by identifying the nearest neighbors to a query example and using those neighbors to determine the class of the query. K-NN classification classifies instances based on their similarity to instances in the training data. This paper presents the outputs obtained with various distances used in the algorithm, which may help in understanding the response of the classifier for a desired application; it also discusses computational issues in identifying nearest neighbors and mechanisms for reducing the dimension of the data.

Keywords— K-NN, Biometrics, Classifier, Distance

I. INTRODUCTION

The belief inherent in Nearest Neighbor classification is quite simple: examples are classified based on the class of their nearest neighbors. For example, if it walks like a duck, quacks like a duck, and looks like a duck, then it is probably a duck. The k-nearest neighbor classifier is a conventional nonparametric classifier that provides good performance for optimal values of k. In the k-nearest neighbor rule, a test sample is assigned the class most frequently represented among the k nearest training samples. If two or more such classes are tied, the test sample is assigned the class with minimum average distance to it. It can be shown that the k-nearest neighbor rule becomes the Bayes optimal decision rule as k goes to infinity [1]. However, it is only in the limit, as the number of training samples goes to infinity, that this nearly optimal behavior of the k-nearest neighbor rule is assured.

II. ALGORITHM OF K-NN CLASSIFIER

A. Basic

In 1968, Cover and Hart proposed the K-Nearest Neighbor algorithm, which was refined over time. K-Nearest Neighbor is usually computed with the Euclidean distance; other measures are also available, but the Euclidean distance offers a splendid combination of ease, efficiency and productivity [2]. An example is classified by determining the majority label among its k nearest neighbors [3]. In other words, the method is very easy to implement: if the k nearest examples of an example "x" in the feature space mostly carry the same label "y", then "x" is assigned to "y". In theory the K-NN method relies on a limiting theorem, but in the decision process only a small number of nearest neighbors are considered, so the class-imbalance problem of the example set can be mitigated. Because K-NN uses only a limited number of nearest neighbors rather than an explicit decision boundary, it is well suited to classifying example sets whose class boundaries intersect or overlap. The Euclidean distance can be calculated as follows [4]. If two vectors xi and xj are given, where xi = (xi1, xi2, xi3, ..., xin) and xj = (xj1, xj2, xj3, ..., xjn), the distance [5] between xi and xj is

D(xi, xj) = √( ∑k=1..n (xik − xjk)² )    (1)

In this work, this formula is used to find the nearest neighbors of an example. The K-NN algorithm is powerful and lucid to implement, but one of its main drawbacks is its inefficiency for large-scale and high-dimensional data sets. The main reason is its "lazy" learning nature: it has no true training phase, which results in a high computational cost at classification time. Yang and Liu [5] set k to 30-45, since they found stable effectiveness in that range. In the same way, Joachims [6] tried different values k ∈ {15, 30, 45, 60}. Considering these two attempts, values k ∈ {15, 30, 45} are explored for the K-NN classifier, and the value of k that gives the best performance on the test samples is selected. The K-NN classifier (also known as an instance-based classifier) works on the premise that an unknown instance can be classified by relating it to known instances through some distance/similarity function. The intuition is that two instances far apart in the instance space, as defined by the appropriate distance function, are less likely to belong to the same class than two closely situated instances [7].
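As an illustration of the rule just described (a minimal sketch, not the authors' implementation; all names are hypothetical), the following Python function classifies a query point by a majority vote among its k nearest neighbors under the Euclidean distance of equation (1).

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest
    training points, using the Euclidean distance of equation (1)."""
    # Squared coordinate differences summed over features, then square-rooted.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]           # indices of the k closest rows
    votes = Counter(train_y[i] for i in nearest)  # count labels among the neighbors
    return votes.most_common(1)[0][0]             # majority label

# Toy usage with illustrative data.
train_X = np.array([[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
train_y = np.array([1, 1, 2, 2])
print(knn_predict(train_X, train_y, np.array([0.85, 0.9]), k=3))  # expected: 2
```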

B. Use in Data Mining

Data mining is the extraction of hidden information from large databases. Classification is a data mining task of forecasting the value of a categorical variable by building a model based on one or more numerical and/or categorical variables. The classification mining function is used to achieve a deeper understanding of the database structure. There are various classification techniques, such as decision tree induction, Bayesian networks, lazy classifiers and rule-based classifiers. Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms and machine learning methods. Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction [8]. Data mining applications can use a variety of parameters to examine the data, including association, sequence or path analysis, classification, clustering, and forecasting. The classification technique is capable of processing a wider variety of data and is growing in popularity. The various classification techniques are Bayesian networks, tree classifiers, rule-based classifiers, lazy classifiers [9], fuzzy set approaches, rough set approaches, etc.

III. MATERIAL AND METHODOLOGY

A. Material

Outputs obtained with different methodologies have been compared. The classification routine takes the following inputs.

a) Sample
Matrix whose rows will be classified into groups. Sample must have the same number of columns as Training.

b) Training
Matrix used to group the rows in the matrix Sample. Training must have the same number of columns as Sample. Each row of Training belongs to the group whose value is the corresponding entry of Group.

c) Group
Vector whose distinct values define the grouping of the rows in Training.

d) K
The number of nearest neighbors used in the classification. Default is 1.

e) Distance
1) Euclidean
2) Cityblock (taxicab metric)
3) Cosine
4) Correlation

f) Rule
1) Nearest
2) Random
3) Consensus

I) Distance

a) Euclidean distance
The Euclidean distance between points p and q is the length of the line segment connecting them (pq). In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance from p to q, or from q to p, is given by [10]:

d(p,q) = d(q,p) = √( (q1 − p1)² + (q2 − p2)² + ··· + (qn − pn)² )    (2)

The position of a point in a Euclidean n-space is a Euclidean vector. So p and q are Euclidean vectors [11], starting from the origin of the space, and their tips indicate two points. The Euclidean norm, or Euclidean length, or magnitude of a vector measures the length of the vector:

|p| = √( p1² + p2² + p3² + ··· + pn² ) = √(p·p)    (3)

where the last equality involves the dot product. A vector can be described as a directed line segment from the origin of the Euclidean space (vector tail) to a point in that space (vector tip) [12]. If we consider that its length is actually the distance from its tail to its tip, it becomes clear that the Euclidean norm of a vector is just a special case of Euclidean distance: the Euclidean distance between its tail and its tip. The distance between points p and q may have a direction, so it may be represented by another vector, given by [13]

q − p = (q1 − p1, q2 − p2, ···, qn − pn)    (4)

In a three-dimensional space (n = 3), this is an arrow from p to q, which can also be regarded as the position of q relative to p. It may also be called a displacement vector if p and q represent two positions of the same point at two successive instants of time. The Euclidean distance between p and q is just the Euclidean length of this distance (or displacement) vector [14]:

|q − p| = √( (q − p)·(q − p) )    (5)

which is equivalent to equation (2), and also to:

|q − p| = √( |p|² + |q|² − 2p·q )    (6)
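As a quick numerical check of equations (2), (5) and (6) (an illustrative aside, not part of the paper's experiment), the snippet below computes the same Euclidean distance in three equivalent ways with numpy.

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 8.0])

d_direct = np.sqrt(((q - p) ** 2).sum())          # equation (2)
d_norm   = np.sqrt(np.dot(q - p, q - p))          # equation (5)
d_expand = np.sqrt(np.dot(p, p) + np.dot(q, q)    # equation (6)
                   - 2 * np.dot(p, q))

# All three expressions give the same Euclidean distance (≈ 6.164).
print(d_direct, d_norm, d_expand)
```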

b) Cityblock (Taxicab metric)
The taxicab distance, d1, between two vectors p and q in an n-dimensional real vector space with a fixed Cartesian coordinate system, is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes. More formally,

d1(p,q) = ||p − q||1 = ∑i=1..n |pi − qi|    (7)

where p = (p1, p2, p3, ..., pn) and q = (q1, q2, q3, ..., qn) are vectors [15]. For example, in the plane, the taxicab distance between (p1, p2) and (q1, q2) is |p1 − q1| + |p2 − q2| [16].
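A one-line implementation of equation (7), with illustrative values that are not taken from the experiment:

```python
import numpy as np

def cityblock(p, q):
    """Taxicab (L1) distance of equation (7): the sum of the absolute
    coordinate-wise differences between p and q."""
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum())

print(cityblock([1.0, 2.0], [4.0, 0.0]))  # |1-4| + |2-0| = 5.0
```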

c) Cosine distance
The cosine of the angle between two vectors can be derived from the Euclidean dot product formula:

a·b = ||a|| ||b|| cos θ    (8)

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitudes [17] as

Similarity = cos(θ) = A·B / (||A|| ||B||) = ∑i AiBi / ( √(∑i Ai²) √(∑i Bi²) )    (9)

The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity. For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since term frequencies (tf-idf weights) cannot be negative; the angle between two term frequency vectors cannot be greater than 90°.
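A small illustration of equations (8) and (9) follows (the names are illustrative only); a cosine distance is then typically taken as 1 minus this similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) of equation (9): dot product divided by the product of norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = [1.0, 0.0], [1.0, 1.0]
print(cosine_similarity(a, b))      # ~0.7071, i.e. cos(45°)
print(1 - cosine_similarity(a, b))  # the corresponding cosine distance
```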
d) Correlation
The distance correlation of two random variables is obtained by dividing their distance covariance [18] by the product of their distance standard deviations. The distance correlation is

dCor(X,Y) = dCov(X,Y) / √( dVar(X) dVar(Y) )    (10)

II) Rule

a) Nearest
Majority rule with nearest point tie-break (by default)
b) Random
Majority rule with random point tie-break
c) Consensus

B. Material
To get the outputs, training data and sample data are chosen, and with different rules and different distance metrics we get different classified outputs. The data chosen for the classification are as follows (an illustrative re-computation of the nearest neighbors for these data is sketched after Fig. 2):
Sample = [0.559 0.510; 0.101 0.282; 0.987 0.988]
Training = [0 0; 0.559 0.559; 1 1]
Group = [1; 2; 3]

C. Results

Case 1

Fig.1 Distance used Euclidean and Rule Nearest

In this case the Euclidean distance is used, and this figure is referenced in Table I. Using the Nearest rule, the classification result was medium.

Case 2

Fig.2. Distance used Euclidean and Rule Random
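The twelve cases were produced by running the classification routine of Section III.A over the data above with each distance metric and tie-break rule. The sketch below is an illustrative re-computation, not the authors' code: it finds the nearest Training row for each Sample row with K = 1 under the Euclidean and Cityblock metrics using scipy. The Cosine and Correlation distances are omitted here because they are not well defined for Training rows such as [0 0] (zero norm for the cosine) or constant rows such as [1 1] (zero variance for the correlation), and the tie-break rules are not exercised, since with K = 1 no ties arise.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Data from Section III.B.
sample   = np.array([[0.559, 0.510], [0.101, 0.282], [0.987, 0.988]])
training = np.array([[0.0, 0.0], [0.559, 0.559], [1.0, 1.0]])
group    = np.array([1, 2, 3])

for metric in ("euclidean", "cityblock"):
    # Distance from every Sample row to every Training row.
    d = cdist(sample, training, metric=metric)
    nearest = d.argmin(axis=1)        # index of the closest Training row (K = 1)
    print(metric, group[nearest])     # predicted group per Sample row, e.g. [2 1 3]
```

With K = 1 each Sample row simply inherits the Group label of its single nearest Training row.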

In Case 2 (Fig. 2 above) the Euclidean distance is used, and the figure is referenced in Table I. Using the Random rule, the classification result was good.

Case 3

Fig.3 Distance used Euclidean and Rule Consensus

In this case the Euclidean distance is used, and this figure is referenced in Table I. Using the Consensus rule, the classification result was excellent.

Case 4

Fig.4 Distance used Cityblock and Rule Nearest

In this case the Cityblock distance is used, and this figure is referenced in Table I. Using the Nearest rule, the classification result was excellent.

Case 5

Fig.5 Distance used Cityblock and Rule Random

In this case the Cityblock distance is used, and this figure is referenced in Table I. Using the Random rule, the classification result was medium.

Case 6

Fig.6 Distance used Cityblock and Rule Consensus

In this case the Cityblock distance is used, and this figure is referenced in Table I. Using the Consensus rule, the classification result was good.
Case 7

Fig.7 Distance used Cosine and Rule Nearest

In this case the Cosine distance is used, and this figure is referenced in Table I. Using the Nearest rule, the classification result was medium.

Case 8

Fig.8 Distance used Cosine and Rule Random

In this case the Cosine distance is used, and this figure is referenced in Table I. Using the Random rule, the classification result was poor.

Case 9

Fig.9 Distance used Cosine and Rule Consensus

In this case the Cosine distance is used, and this figure is referenced in Table I. Using the Consensus rule, the classification result was medium.

Case 10

Fig.10 Distance used Correlation and Rule Nearest

In this case the Correlation distance is used, and this figure is referenced in Table I. Using the Nearest rule, the classification result was poor.

Case 11

Fig.11 Distance used Correlation and Rule Random

In this case the Correlation distance is used, and this figure is referenced in Table I. Using the Random rule, the classification result was poor.

Case 12

Fig.12 Distance used Correlation and Rule Consensus

In this case the Correlation distance is used, and this figure is referenced in Table I. Using the Consensus rule, the classification result was medium.

The Hamming distance has not been used in this paper because that distance requires binary data, which the sample does not contain.

D. Inference

TABLE I
RESULTS AND EFFICIENCY OF CLASSIFIERS

Sr. no.   Case   Result                    Efficiency   Percent of efficiency
1         1      Classified successfully   Medium       99%
2         2      Classified successfully   Good         99.8%
3         3      Classified successfully   Excellent    100%
4         4      Classified successfully   Excellent    100%
5         5      Classified successfully   Medium       99%
6         6      Classified successfully   Good         99.8%
7         7      Classified successfully   Medium       99%
8         8      Classified successfully   Poor         98.5%
9         9      Classified successfully   Medium       99%
10        10     Classified successfully   Poor         98.5%
11        11     Classified successfully   Poor         98.5%
12        12     Classified successfully   Medium       99%

IV. CONCLUSION

Classifiers have paved an important path for the classification of data in biometrics, such as iris detection and signature verification. Compared with the other distances, the Euclidean distance shows higher efficiency, and compared with the Bayes algorithm, the K-Nearest Neighbor algorithm again maintains its efficiency. The KNN classifier is one of the most popular neighborhood classifiers in pattern recognition. However, it has limitations such as high calculation complexity, full dependence on the training set, and no weight difference between classes. To overcome this, an innovative method to improve the classification performance of KNN using a Genetic Algorithm (GA) is being implemented. Also, in the results almost every case has efficiency near 100%, because the training set and sample used are small and the distances involved are short.

REFERENCES

[1] R. O. Duda and P. E. Hart, "Pattern Classification and Scene Analysis", New York: John Wiley & Sons, 1973.
[2] Dasarathy, B. V., "Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques", IEEE Computer Society Press, 1990.
[3] Wettschereck, D. and Dietterich, T. G., "An Experimental Comparison of the Nearest Neighbor and Nearest Hyperrectangle Algorithms", Machine Learning, 9: 5-28, 1995.
[4] Platt, J. C., "Fast Training of Support Vector Machines Using Sequential Minimal Optimization", in Advances in Kernel Methods: Support Vector Machines (edited by Scholkopf B., Burges C., Smola A.), Cambridge MA: MIT Press, 185-208, 1998.
[5] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods", Proc. SIGIR '99, pp. 42-49, 1999.
[6] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", Proc. European Conf. Machine Learning, pp. 137-142, 1998.
[7] Man Lan, Chew Lim Tan, Jian Su, and Yue Lu, "Supervised and Traditional Term Weighting Methods for Automatic Text Categorization", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 4, April 2009.
[8] Thair Nu Phyu, "Survey of Classification Techniques in Data Mining", Proceedings of the International MultiConference of Engineers and Computer Scientists 2009 Vol I, IMECS 2009, March 18-20, 2009.
[9] William Perrizo, Qin Ding, Anne Denton, "Lazy Classifiers Using P-trees", Department of Computer Science, Penn State Harrisburg, Middletown, PA 17057.
[10] A. Y. Alfakih, "Graph rigidity via Euclidean distance matrices", Linear Algebra Appl., 310, pp. 149-165, 2000.
[11] M. Bakonyi and C. Johnson, "The Euclidean distance matrix completion problem", SIAM Journal on Matrix Analysis and Applications, 16, pp. 646-654, 1995.
[12] Elena Deza and Michel Marie Deza, "Encyclopedia of Distances", page 94, Springer, 2009.
[13] W. Glunt, T. L. Hayden, S. Hong, and J. Wells, "An alternating projection algorithm for computing the nearest Euclidean distance matrix", SIAM Journal on Matrix Analysis and Applications, 11, pp. 589-600, 1990.
[14] R. W. Farebrother, "Three theorems with applications to Euclidean distance matrices", Linear Alg. Appl., 95, 11-16, 1987.
[15] Akca, Z. and Kaya, R., "On the Taxicab Trigonometry", Jour. of Inst. of Math. & Comp. Sci. (Math. Ser.), 10, No. 3, 151-159, 1997.
[16] Thompson, K. and Dray, T., "Taxicab Angles and Trigonometry", Pi Mu Epsilon J., 11, 87-97, 2000.
[17] Bei-Ji Zou, "Shape-Based Trademark Retrieval Using Cosine Distance Method", Intelligent Systems Design and Applications (ISDA '08), Eighth International Conference on, 26-28 Nov. 2008.
[18] M. R. Kosorok, "Discussion of Brownian distance covariance", Ann. Appl. Stat. 3 (4), 1270-1278, 2009.
