
Evolutional Study on KNN and K-means

Algorithms

Hritika Vaishnav1 and Anamika Choudhary2

1 M.Tech Student, Computer Science and Engineering, JIET, Jodhpur
hritika16@gmail.com
2 Associate Professor, Computer Science and Engineering, JIET, Jodhpur
anamika.choudhary@jietjodhpur.ac.in

Abstract. We are currently in a technological age in which fresh innovations
make human work quick and easy. One of these fields is machine learning,
which enables a machine to carry out tasks in accordance with what it has
learned and understood. Machine learning now guides many online
applications; these systems are able to recognize our habits and preferences
much as people do. Machine learning is applied in a wide range of
industries, covering fields including finance, medicine, retail, social media,
robotics, automation, and game applications, among others, and these
sectors employ various algorithms. The objective of this work is a
comparison of the KNN and K-means algorithms, which, according to the
evaluation, differ from one another more than their similar names suggest.
K-means clustering is a tool for organizing data into groups without
predefined categories, like sorting a mixed bag of items based on their
similarities. KNN (K-Nearest Neighbors), on the other hand, is a supervised
approach in which the algorithm learns from labeled examples to make
predictions or classify new instances.
Keywords: Supervised learning, Unsupervised learning, KNN algorithm,
K-means algorithm.
Keywords: Supervised learning, Unsupervised learning, KNN algorithm,
K-means algorithm.

1. Introduction

Machine learning is a form of learning in which a computer acquires knowledge without
being explicitly programmed. It is an application of AI (Artificial Intelligence) that gives a
system the capacity to learn automatically from experience and to improve itself. Machine
learning can easily handle multi-dimensional, multi-variety data in a dynamic environment,
relying on experience rather than explicit programming to do so. Its main emphasis is the
creation of computer programs that can access data and use it for their own learning. By
starting from observations of data, such as direct experience or instruction, it becomes
straightforward to find patterns in data and to make better future judgements. [1]

Machine learning algorithms are mainly of three types: supervised, unsupervised, and
reinforcement learning. [2] Supervised learning algorithms use labelled data to predict
future data. Unsupervised learning algorithms use unlabelled data to describe hidden
structure. Reinforcement learning interacts with its environment, produces actions, and
discovers errors and rewards.

K-means clustering is also known as partitioning clustering. Suppose we have a database of n
objects and we partition the data into k parts. Each part represents a cluster, and k ≤ n, which
tells us that each object belongs to exactly one cluster and that each cluster contains at least
one object.

The KNN algorithm is a very simple machine learning algorithm based on supervised
learning. KNN classifies data: given any new input, it can tell which category that input
belongs to. It can also be used for regression, but it is mostly used to solve classification
problems. [7]

2. Supervised machine learning algorithms:

Using labelled examples, this type of algorithm applies what it has learnt in the past to make
predictions about new data. The learning approach builds an inferred function that can
anticipate output values by evaluating a known training dataset. Given enough training, the
system can produce a target for any new input. The learning process also compares its
output with the intended, correct output to identify errors, which can then be corrected in
the model.
Regression:
Regression is used especially in areas such as finance and investing. With the help of
regression, we can find out how the value of one quantity changes with another.

Definition of regression: regression is an analysis that describes the relation between a
dependent variable and an independent variable. For example, suppose y is a dependent
variable and x is an independent variable; as we change the value of x, its effect appears
in y. This kind of relation is explained by regression.

Although there are many types of regression, we will discuss some of the main types
mostly used in machine learning.

i. Linear Regression:
In linear regression, one predicts the result of a dependent variable with the help of a
single independent variable, assuming a straight-line relationship y = a·x + b.
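As a minimal sketch of this idea, a straight-line fit y = a·x + b can be recovered from a handful of points by ordinary least squares (the small dataset below is invented for illustration):

```python
import numpy as np

# Small invented dataset that roughly follows y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# np.polyfit with deg=1 returns the slope and intercept
# that minimize the sum of squared errors.
slope, intercept = np.polyfit(x, y, deg=1)

# Predict the dependent variable for a new value of x.
predicted = slope * 6.0 + intercept
```

Here the fitted slope comes out close to 2, matching the trend built into the data.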

ii. Polynomial Regression

In polynomial regression, the independent variable appears with powers greater than
one, as in y = b0 + b1x + b2x2 + …. As the power of the independent variable
increases, the hypothesis becomes more and more complex.

iii. Logistic Regression

Logistic regression estimates the probability of an event taking place. In it, the value
of the dependent variable is binary (e.g. true/false, yes/no, 0/1). For example,
questions such as whether it will rain tomorrow, or whether India will win the match,
can be addressed through logistic regression.
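A minimal sketch of this binary setup fits a sigmoid by plain gradient descent; the hours-studied pass/fail dataset here is an invented example, not from the original:

```python
import numpy as np

# Invented data: whether a student passed (1) or failed (0)
# as a function of hours studied.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

w, b = 0.0, 0.0
for _ in range(5000):
    # The sigmoid turns the linear score w*x + b into P(pass).
    p = 1.0 / (1.0 + np.exp(-(w * hours + b)))
    # Gradient descent on the log-loss.
    w -= 0.1 * np.mean((p - passed) * hours)
    b -= 0.1 * np.mean(p - passed)

# Probability that a student who studied 6 hours passes.
prob_6_hours = 1.0 / (1.0 + np.exp(-(w * 6 + b)))
```

After training, the model assigns a probability above 0.5 to students on the "pass" side of the data and below 0.5 to those on the "fail" side.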

iv. Non-Linear Regression

In non-linear regression, the parameters themselves can also appear with higher
powers; the model is non-linear in its parameters, not only in the independent
variable. Problems addressed by linear and polynomial regression can also be
expressed in this form.

Classification:
In classification, the output variables are called labels or categories because the
output is divided into two or more categories, such as categorizing an email as spam
or not spam.

In other words, classification assigns class labels based on training data whose class
labels are known. So, whenever we have to solve a problem that requires creating
categories or labels, we use classification algorithms.

A classification model takes observed values and tries to draw a conclusion from
them: given one or more inputs, it estimates the value of one or more outcomes. The
input variables in classification can be discrete or real-valued.

3. Unsupervised Machine Learning Algorithms


Unsupervised learning is a part of machine learning that gives a machine the ability to
understand an object, situation, or problem based on its past experience. Whenever a
machine or system is trained with unsupervised learning, it is given unlabeled data as
input; unlabeled data means data that has no label. For example, the name of an animal
is not defined, only raw data values are given; unsupervised learning must find out which
category each value belongs to and build clusters accordingly. So, unsupervised learning
is a method in which we train an AI (Artificial Intelligence) algorithm in such a way that
it can make its own decisions without any guidance. The model or system is trained so
that it groups all similar data values in the given data set, and these groups are called
clusters. A previously trained model can then be given new unlabeled or uncategorized
data without having to be trained again and again; the output of an unsupervised
learning model depends only on the algorithm coded once.

Clustering
The most commonly used method in unsupervised learning is clustering. Clustering
is a technique in which different groups of similar types of data are created; these
groups are called clusters. Whenever an object has to be identified, the model
checks which cluster the object's data values match, and the object is identified
accordingly.

No labeled data is given to the model during training in unsupervised learning. In
this learning, the model is trained in such a way that it identifies similar data or
patterns in the given data set and forms groups of them, called clusters. Common
clustering algorithms:
 K-Means Clustering
 Hierarchical Clustering
 DBSCAN (Density-Based Spatial Clustering)
 Expectation-Maximization (EM) Clustering

4. K-Nearest Neighbors (KNN) Algorithm



The K-Nearest Neighbor (KNN) algorithm is a supervised machine learning algorithm,

so it is built from labeled data; whenever new unlabeled data is passed to the KNN
model, it uses the labeled data given during training to easily classify the new instance.
KNN is also called a lazy learning algorithm because it does not build a model in
advance: it simply stores the training data and defers all computation until the moment
a decision must be made.

Working of K-Nearest Neighbor (KNN) Algorithm


 Within the KNN algorithm we first configure a value for the variable K; the
value of K tells the algorithm the number of nearest neighbours to consider.
 K is usually chosen as a small odd number so that the majority vote cannot
end in a tie and the algorithm can easily take a decision.
 In the second step we calculate the distance of each training point from the
new data point, for which we use the following Euclidean distance formula:
Euclidean Distance Formula:

D = √((X1 − Y1)2 + (X2 − Y2)2)

Here, D is the distance between a training point (X1, X2) and the new data point (Y1, Y2).

 In the third step the KNN algorithm selects the K training points nearest to
the new data point and counts how many of those nearest neighbours fall
into each category.

 In the fourth step, the KNN algorithm designates the category of the new data point.

 The new data point belongs to the category that holds the majority among
its K nearest neighbours.
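The steps above can be sketched in a few lines of Python; the two-feature training set below is invented for illustration:

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbours.

    train is a list of ((x1, x2), label) pairs."""
    # Step 2: Euclidean distance from the new point to every training point.
    dists = [(math.dist(point, new_point), label) for point, label in train]
    # Step 3: keep the k closest labelled points.
    nearest = sorted(dists)[:k]
    # Steps 4-5: the majority label among those neighbours wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated categories, "A" and "B".
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 7), "B"), ((6, 7), "B")]
```

For example, `knn_classify(train, (2, 2))` returns "A", because all three of that point's nearest neighbours carry the label "A".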

Advantages of the K-Nearest Neighbor (KNN) Algorithm

 The KNN algorithm is very easy to implement.
 KNN can be used for classification, regression, and searching.
 KNN is fairly robust to noisy data.
 KNN can also be used for large amounts of data.

Disadvantages of the K-Nearest Neighbor (KNN) Algorithm

 When the amount of training data is large, prediction slows down.
 The value of k always has to be determined, which sometimes gets a bit complex.
 The computation cost of KNN is high because the distance to every stored
data point has to be calculated.
 KNN is a lazy learning algorithm: it requires all data to be available
beforehand, since it first stores that data and only processes it at prediction time.
5. K-MEANS CLUSTERING ALGORITHM
The K-means clustering algorithm is based on unsupervised machine learning. K-means
is an algorithm in which objects are grouped, based on a number of attributes, into k
groups, where k is a positive integer. The data are grouped by minimizing the sum of
squared distances between each data point and its cluster centroid.

K-means is also referred to as a distance-based or centroid-based method: to assign a

point to a cluster, it computes distances, and each cluster in K-means has a centroid
attached to it.

Working of K-Means Algorithm

The K-means clustering algorithm operates through several distinct steps:
 Firstly, the number of desired clusters, denoted as K, is determined.
 Subsequently, K cluster centers are selected from the data points, ideally
with maximum separation between them.
 The algorithm then calculates the distance between each data point and each
cluster center, typically employing the Euclidean distance formula:
Euclidean Distance Formula:

D = √((X1 − C1)2 + (X2 − C2)2)

Here, D is the distance between a data point (X1, X2) and a cluster center (C1, C2).

 This distance D guides the assignment of each data point to the cluster
whose center is nearest.
 The center of each newly formed cluster is then recalculated as the mean of
all the data points contained in that cluster.
 The assignment and recalculation steps repeat until the cluster centers stop
changing.
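A minimal from-scratch sketch of these steps follows; the two-blob dataset is invented for illustration, and real initialization schemes (such as k-means++) are more careful than the plain random sampling used here:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Group points into k clusters by iterative centroid refinement."""
    random.seed(seed)
    # Step 2: pick k initial centers from the data points.
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Steps 3-4: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Step 5: recompute each center as the coordinate-wise
        # mean of the points assigned to it.
        new_centroids = []
        for i, cl in enumerate(clusters):
            if cl:
                new_centroids.append(tuple(sum(v) / len(cl) for v in zip(*cl)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's center
        centroids = new_centroids
    return centroids, clusters

# Two obvious blobs around (1, 1) and (8, 8).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
```

With this data the two returned centers settle at roughly (1.33, 1.33) and (8.33, 8.33), one per blob, regardless of which points are sampled initially.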

Advantages of the K-means Clustering Algorithm

 In terms of time complexity it is comparatively efficient: O(nkt), where n =
number of instances, k = number of clusters, and t = number of iterations.
 It usually converges quickly, although possibly only to a local optimum,
which is often of pragmatic utility.
 Techniques like simulated annealing or genetic algorithms can be employed
to locate the global optimum, enhancing its adaptability to various scenarios.

Disadvantages of the K-means Clustering Algorithm

 Foremost, the number of clusters (k) must be predetermined, posing a
challenge in scenarios where the optimal cluster count is unknown.
 Its susceptibility to noisy data and outliers can impact clustering accuracy.
 Additionally, K-means struggles to identify clusters with non-convex shapes,
limiting its applicability to certain datasets.

CONCLUSION

In conclusion, this research delved into the comparative analysis of the K-nearest neighbor
(KNN) and K-means clustering algorithms within the realm of machine learning. While both
algorithms share the commonality of the letter 'k,' they serve distinct purposes. K-means
clustering, an unsupervised learning algorithm, excels in grouping data into clusters, offering
valuable insights into patterns and structures. On the other hand, KNN, a supervised learning
algorithm, proves effective for classification tasks, leveraging labeled data to make predictions
for new, unlabeled instances.

The study highlighted the simplicity and versatility of the KNN algorithm, suitable for
classification, regression, and search operations. Despite its ease of implementation, challenges
such as determining the optimal 'k' value and computational costs were acknowledged.
Conversely, the K-means clustering algorithm demonstrated efficiency in grouping data based
on specified attributes. However, limitations in handling noisy data, outliers, and the
requirement to predefine the number of clusters (k) were identified.

In practical applications, the selection between these algorithms depends on the nature of the
problem and data characteristics. Understanding the nuances of KNN and K-means aids
practitioners in making informed choices for diverse machine learning tasks. Importantly, the
research emphasizes that, despite their similar nomenclature, these algorithms significantly
differ in functionality and applicability within the machine learning landscape.
REFERENCES
[1] Domingos, P. (2012). A Few Useful Things to Know About Machine Learning. Communications of
the ACM, 55(10), 78–87. [DOI: 10.1145/2347736.2347755]

[2] Mitchell, T. M. (1997). Machine Learning. McGraw-Hill Education.


[4] Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. Morgan Kaufmann.

[5] Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys
(CSUR), 31(3), 264–323. [DOI: 10.1145/331499.331504]

[6] Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on
Information Theory, 13(1), 21–27. [DOI: 10.1109/TIT.1967.1053964]

[7] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction (2nd ed.). Springer.

[8] MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1:
Statistics (pp. 281–297). University of California Press.

[9] Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification (2nd ed.). Wiley.

[10] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
