
Introduction

Classification is one of the most important areas of research in several fields of computer science, such as data mining, artificial intelligence, and machine learning. Many algorithms can be used to solve classification problems; one of the most widely used is K-nearest neighbors (KNN). KNN is a supervised learning algorithm used for pattern recognition in machine learning. It has no explicit training phase and stores all of the training data. It works on labeled input data and can solve both regression and classification problems, under the assumption that similar things lie close to each other. To measure how far apart data points are, it uses a distance function; the Euclidean or Manhattan distance is typically used for this purpose.
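As an illustration of how these distances are usually computed, the following minimal sketch (assuming NumPy and two hypothetical four-feature samples) compares the Euclidean and Manhattan distance between two points:

import numpy as np

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

# Two hypothetical four-feature samples
p = np.array([5.1, 3.5, 1.4, 0.2])
q = np.array([6.7, 3.0, 5.2, 2.3])
print(euclidean_distance(p, q), manhattan_distance(p, q))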

In KNN, it is necessary to choose the right value of K. To do so, the algorithm is usually run multiple times with different values of K, and the value that reduces the error and increases the prediction accuracy is selected. K is the number of nearest neighbors considered for each query point. If K is set to 1, the results become very unstable; increasing K stabilizes them through averaging and majority voting, which leads to more accurate predictions.
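For instance, a minimal sketch of the majority-voting step (a generic illustration, not the specific implementation used in this paper) might look as follows:

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k):
    # Euclidean distance from the query point x to every training point
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Labels of the k nearest training points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote among those labels decides the predicted class
    return Counter(nearest_labels).most_common(1)[0][0]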

This paper aims at optimizing the value of K for each type of dataset. Finding the optimal value of K can be expensive on large datasets, so we perform some preliminary calculations and make an educated guess at the optimal value of K, saving time and reducing computation.

Related Work
The problem of neighbor selection has long been an issue for machine learning engineers. Since the prediction accuracy of the algorithm depends primarily on the value of k, we choose it as the decision-making parameter [3].

The concept of using a different number of neighbors at different times in the system was introduced by the authors in [2]. Features that require complex computations are introduced into the system at later stages of the process, depending on the discriminative ability of each feature. Feature ordering is implemented as a separate process because it is very important in this scenario. Introducing different numbers of features at different times thus reduces both the computational complexity and the memory requirements.

The study in [4] primarily focuses on how many features are considered for classification or prediction. Classification is usually performed irrespective of how the neighbors are distributed: points are classified on the basis of their closeness to each other or to the test point. Depending on the dimensions, the data is divided into multiple classes, and a centroid is selected in each class to determine the k local mean vectors in a particular spatial distribution. Using the harmonic mean distance from the test point reduces the error rate considerably, and it is especially helpful for datasets with few records and features.

[5] focused mainly on deception detection by dividing multiple cues into different classes. The authors developed a model that classifies features into two categories, verbal and non-verbal, and each feature is then labeled as truthful or deceptive. The KNN model applied to non-verbal cues shows higher accuracy than on verbal cues, so the accuracy of a KNN model depends heavily on the type of features in the system.

The work in [3] notes that the cross-validation technique commonly used for determining the value of k is time consuming. The authors therefore propose the k-tree and k*-tree methods, which first train a model to determine the value of k for one dataset and then reuse that value for future datasets. This effectively adds a separate training phase to KNN classification. KNN with the k-tree or k*-tree method is less prone to error and more accurate, and the running cost is also reduced significantly.

[1] states that combining a random forest with KNN reduces much of the processing overhead and improves accuracy. The RFDKNN algorithm first sorts features according to their importance using the Gini index and then reduces them by a certain proportion R. What makes it distinct and effective is the dynamic selection of neighbors: it includes a distance function to measure the distance between the test point and any training sample. The random forest reduces the number of features (dimensions) in the dataset. However, this method is not as effective for datasets with closely correlated features.
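A rough illustration of the feature-reduction step described above, assuming scikit-learn and a user-chosen reduction proportion R (this is our own sketch, not the authors' code), could look like this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def reduce_features(X, y, R=0.5):
    # Rank features by Gini-based importance using a random forest
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    # Keep only the top proportion R of the features
    kept = order[: max(1, int(R * X.shape[1]))]
    return X[:, kept], kept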

2. Proposed Methodology

2.1 Dataset

The dataset we use is the Iris dataset from the UCI Machine Learning Repository, which contains features describing flowers. Using those attributes, each row has to be labeled with the species it belongs to. There are three predefined species of flowers: Setosa, Virginica, and Versicolor.

The training dataset contains the following attributes (a short loading sketch follows the list):

1. Id
2. SepalWidthCm
3. PetalLengthCm
4. SepalLengthCm
5. PetalWidthCm
6. Species
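A brief sketch of how such a dataset can be loaded and split into features and labels (assuming a local Iris.csv file with the column names listed above) is given below:

import pandas as pd

# File name 'Iris.csv' is an assumption about the local copy of the dataset
df = pd.read_csv('Iris.csv')

# Split into feature columns and the species label
X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].values
y = df['Species'].values
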
2.2 Model

The main idea is to introduce three training phases for three datasets of different sizes. At the start, we train on each dataset, which has a fixed number of rows, and store the resulting value of k in a dictionary data structure. For example, let:

DS1 be the dataset with a larger number of rows, M.

DS2 be the dataset with a smaller number of rows, N.

DS3 be the dataset with an intermediate number of rows, (N + M)/2.

After training on each dataset, we store its value of k, keyed by the dataset, in the dictionary.

2.2.1 Training Phase


We took the 'iris' training dataset and trained on it to obtain the optimal value of k, plotting accuracy against k for this particular dataset. The proposed solution consists of three such training steps for three datasets of different sizes.

From this plot, the optimal value of k for this particular dataset is 4. We store that value in a dictionary and then repeat the procedure for the other datasets, storing their values in the dictionary shown below.
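A minimal sketch of this training step (assuming scikit-learn and the features X and labels y from the loading sketch above; an illustration, not necessarily the exact procedure used in our experiments) is:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def best_k(X, y, k_max=20):
    # Mean cross-validated accuracy for each candidate value of k
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in range(1, k_max + 1)}
    # Keep the k that gives the highest accuracy for this dataset
    return max(scores, key=scores.get)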

dictionary = {'DS1': 'x', 'DS2': 'y', 'DS3': 'z'}

Here x, y, and z represent the number of neighbors we use for each dataset. Once the training phase is finished, future datasets do not have to be trained. We only need to compare the number of rows of the new dataset with the number of rows of the three datasets above; the k value of the dataset with the smallest difference is used for the new dataset.

d = abs(df1.shape[0] - df.shape[0])

where

d = difference in the number of rows between the new dataset and one of the stored datasets (df.shape[0] gives the number of rows of a dataframe).

Thus

K = value of k stored for the dataset with Minimum(d1, d2, d3)
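Putting this lookup together, a small sketch (our own illustration; apart from the iris value of 4, the stored row counts and k values are assumptions) selects the stored k whose training-set size is closest to that of the new dataset:

# Row counts at which each stored k was obtained (illustrative values only)
trained_sizes = {'DS1': 1000, 'DS2': 150, 'DS3': 575}
trained_k = {'DS1': 7, 'DS2': 4, 'DS3': 5}

def k_for_new_dataset(df_new):
    # Pick the stored dataset whose row count is closest to the new dataset
    closest = min(trained_sizes, key=lambda name: abs(trained_sizes[name] - df_new.shape[0]))
    return trained_k[closest]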

Algorithm                   Accuracy
Simple KNN                  96%
KNN with training phase     97%

Selecting this dynamic value of k gives better accuracy than selecting the value of k randomly, using an arbitrary odd value, or taking the square root of the number of records in the dataset.

The results of these experiments showed a clear difference in accuracy, error rate, and running cost.

Abstract
KNN is a very simple yet efficient classification algorithm and is widely used in data mining because of its simple implementation. However, it has no well-defined mechanism for selecting the value of k, which is usually the main deciding factor in classification. This paper proposes a new way of determining the value of k by introducing separate training phases. A training phase is completed for three datasets of different sizes, and the corresponding value of k is stored in a dictionary data structure. Each new dataset is first compared with the existing datasets, and the k value of the dataset whose size differs least from that of the new dataset is adopted. Introducing a separate training phase for differing dataset sizes thereby increases the accuracy of our algorithm by a sizeable amount.

Conclusion and Future work


This paper introduced a new method for determining the value of k. One experiment was performed to determine the k value for a dataset, which constituted the training phase; another experiment measured the accuracy of the original KNN. There was a clear difference between the two in terms of accuracy, error rate, and running cost. The experiments were conducted on a dataset with few dimensions; in the future, this work can be extended to datasets with higher dimensions.
