
Nearest Neighbor Classifiers

 Basic idea:
 If it walks like a duck, quacks like a duck, then it’s probably a duck


[Diagram: compute the distance from the test record to the training records, then choose the k "nearest" records]
01/31/2024 1
Nearest Neighbor Classifiers

What is the K-Nearest Neighbor (KNN) algorithm?

The K-Nearest Neighbor (KNN) algorithm is a distance-based supervised
learning algorithm that is used for solving classification problems.
It looks at the classes of the k nearest neighbors of a new point and
assigns the point to the class to which the majority of those k
neighbors belong.

To identify the nearest neighbors we use various techniques for
measuring distance, the most common of them being the
‘Euclidean distance’.

01/31/2024 2
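As a concrete illustration (not from the original slides), here is a minimal scikit-learn sketch of this idea; the points and labels are made-up toy values, and scikit-learn is assumed to be available.

from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D training points and their class labels (illustrative only).
X = [[1.0, 1.0], [1.2, 0.8], [3.0, 3.5], [3.2, 3.0]]
y = ["blue", "blue", "green", "green"]

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, Euclidean distance by default
knn.fit(X, y)

# The new point is assigned the majority class of its 3 nearest neighbors.
print(knn.predict([[2.9, 3.1]]))            # -> ['green']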
Nearest Neighbor Classifiers

When should you use KNN?


KNN can be used for both classification and regression predictive
problems. However, it is more widely used for classification problems
in industry.

Can I use KNN for both classification and regression
problems?
Yes! We just answered that question above, but let’s expand a bit
more on it. To evaluate any technique we generally look at 3
important aspects:
• Ease of interpreting the output
• Calculation time
• Predictive power

01/31/2024 3
Nearest neighbor method
 Majority vote within the k nearest neighbors

[Illustration: classifying a new point.
K = 1: blue
K = 3: green]

01/31/2024 5
Nearest-Neighbor Classifiers
 Requires three things
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve

 To classify an unknown record:
– Compute its distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the
class label of the unknown record (e.g., by taking a majority vote)

01/31/2024 6
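A small from-scratch sketch of these three steps (plain Python, illustrative only; the helper names are my own, not from the slides):

import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train_records, train_labels, unknown, k=3):
    # 1. Compute the distance from the unknown record to every training record.
    distances = [(euclidean(unknown, rec), label)
                 for rec, label in zip(train_records, train_labels)]
    # 2. Identify the k nearest neighbors.
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # 3. Majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]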
Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x]

The k-nearest neighbors of a record x are the data points
that have the k smallest distances to x.

01/31/2024 7
Nearest Neighbor Classification
 Compute distance between two points:
 Euclidean distance

d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}

 Determine the class from the nearest-neighbor list
 take the majority vote of class labels among the k nearest neighbors
 Weigh the vote according to distance
 weight factor: w = 1/d²

01/31/2024 8
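A hedged sketch of the distance-weighted variant (w = 1/d²); the function and variable names are illustrative, and eps is added only to avoid dividing by zero when a neighbor coincides with the query point.

import math
from collections import defaultdict

def knn_classify_weighted(train_records, train_labels, unknown, k=3, eps=1e-9):
    # Sort training records by Euclidean distance to the unknown record.
    nearest = sorted(
        (math.dist(unknown, rec), label)
        for rec, label in zip(train_records, train_labels)
    )[:k]
    # Each neighbor votes with weight w = 1/d^2 (eps guards against d = 0).
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d * d + eps)
    return max(votes, key=votes.get)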
Nearest Neighbor Classification
Euclidean Distance: Euclidean distance is calculated as the square root of the
sum of the squared differences between a new point (x) and an existing point (y).

Manhattan Distance: This is the distance between real vectors, calculated as the sum of
their absolute differences.

Hamming Distance: It is used for categorical variables. If the value (x) and the
value (y) are the same, the distance D is equal to 0; otherwise D = 1.

01/31/2024 9
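The three measures above as short Python functions (a sketch; the example values are my own):

import math

def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def hamming_distance(x, y):
    # Categorical attributes: 0 where the values match, 1 where they differ.
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

print(euclidean_distance([1.0, 1.0], [5.0, 7.0]))    # 7.21...
print(manhattan_distance([1.0, 1.0], [5.0, 7.0]))    # 10.0
print(hamming_distance(["red", "S"], ["red", "M"]))  # 1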
Nearest Neighbor Classification…
 Choosing the value of k:
 If k is too small, sensitive to noise points
 If k is too large, neighborhood may include points from
other classes

01/31/2024 10
Nearest Neighbor Classification…
 Scaling issues
 Attributes may have to be scaled to prevent distance measures
from being dominated by one of the attributes
 Example:
 height of a person may vary from 1.5 m to 1.8 m
 weight of a person may vary from 90 lb to 300 lb
 income of a person may vary from $10K to $1M

01/31/2024 11
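One common remedy is to rescale every attribute to a comparable range before computing distances. A minimal min-max scaling sketch (the sample rows only echo the ranges above and are otherwise made up):

def min_max_scale(rows):
    # Scale each column to [0, 1] so no single attribute dominates the distance.
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]

# Columns: height (m), weight (lb), income ($).
people = [[1.5, 90, 10_000], [1.8, 300, 1_000_000], [1.7, 160, 50_000]]
print(min_max_scale(people))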
Nearest Neighbor Classification…
 Problem with Euclidean measure:
 High dimensional data

 curse of dimensionality
 Can produce counter-intuitive results

111111111110 vs 011111111111   →   d = 1.4142
100000000000 vs 000000000001   →   d = 1.4142

 Solution: Normalize the vectors to unit length

01/31/2024 12
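A quick check of this remedy (a sketch using the four vectors above): before normalization both pairs are the same distance apart, but after normalizing to unit length the "mostly ones" pair becomes much closer than the "mostly zeros" pair, which matches intuition.

import math

def unit(v):
    # Divide each component by the vector's Euclidean length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

a1 = [int(c) for c in "111111111110"]
a2 = [int(c) for c in "011111111111"]
b1 = [int(c) for c in "100000000000"]
b2 = [int(c) for c in "000000000001"]

print(math.dist(a1, a2), math.dist(b1, b2))                          # 1.414... and 1.414...
print(math.dist(unit(a1), unit(a2)), math.dist(unit(b1), unit(b2)))  # ~0.426 and ~1.414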
Nearest Neighbor Classification…
 k-NN classifiers are lazy learners
 They do not build models explicitly
 Unlike eager learners such as decision tree induction and
rule-based systems
 Classifying unknown records is relatively expensive

01/31/2024 13
Nearest Neighbor Regression…
Let us start with a simple example. Consider the following table – it
consists of the height, age and weight (target) value for 10 people. As
you can see, the weight value of ID11 is missing. We need to predict the
weight of this person based on their height and age.

01/31/2024 14
A SIMPLE EXAMPLE TO UNDERSTAND THE USE OF
KNN FOR REGRESSION

01/31/2024 15
A SIMPLE EXAMPLE TO UNDERSTAND THE USE OF
KNN FOR REGRESSION

In the above graph, the y-axis represents the height of a person
(in feet) and the x-axis represents the age (in years). The points
are numbered according to their ID values. The yellow point (ID
11) is our test point.

If I ask you to identify the weight of ID11 based on the plot,
what would be your answer? You would likely say that since ID11
is closer to points 5 and 1, it must have a weight similar to
those IDs, probably between 72 and 77 kg (the weights of ID1 and ID5
from the table). That actually makes sense, but how does the
algorithm predict these values? Let’s find out.

01/31/2024 16
HOW DOES THE KNN ALGORITHM WORK

As we saw above, the KNN algorithm can be used for both
classification and regression problems. The KNN algorithm uses
‘feature similarity’ to predict the values of any new data points.
This means that the new point is assigned a value based on how
closely it resembles the points in the training set. From our
example, we know that ID11 has a height and age similar to ID1
and ID5, so its weight would also be approximately the same.

Had it been a classification problem, we would have taken the
mode as the final prediction. In this case, we have two values of
weight – 72 and 77. Any guesses on how the final value will be
calculated? The average of the values is taken to be the final
prediction.

01/31/2024 17
HOW DOES THE KNN ALGORITHM WORK

First, the distance between the new point and each training point
is calculated.

01/31/2024 18
HOW DOES THE KNN ALGORITHM WORK

The k closest data points are selected (based on the distance). In
this example, points 1, 5 and 6 will be selected if the value of k is 3.

01/31/2024 19
HOW DOES THE KNN ALGORITHM WORK

The k closest data points are selected (based on the
distance). In this example, points 1, 5 and 6 will be selected if
the value of k is 3.

The average of these data points is the final prediction for
the new point. Here, the weight of ID11 =
(77 + 72 + 60)/3 = 69.66 kg.

01/31/2024 20
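In code, the prediction step is just an average over the k nearest targets. A small sketch: only the three neighbor weights (77, 72, 60 kg) come from the slide; the height/age features below are placeholder values, since the full table was shown as an image.

import math

train_X = {1: (5.0, 45), 5: (5.1, 26), 6: (5.5, 30)}   # id -> (height, age); placeholder features
train_y = {1: 77, 5: 72, 6: 60}                         # weights in kg (from the slide)
new_point = (5.1, 38)                                   # ID 11; placeholder features

k = 3
nearest = sorted(train_X, key=lambda i: math.dist(train_X[i], new_point))[:k]
prediction = sum(train_y[i] for i in nearest) / k
print(round(prediction, 2))    # (77 + 72 + 60) / 3 = 69.67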
HOW DOES THE KNN ALGORITHM WORK

The next problem is how to select the value of k. This determines
the number of neighbors we look at when we assign a value to
any new observation.

01/31/2024 21
HOW DOES THE KNN ALGORITHM WORK

For a very low value of k (say k = 1), the model overfits the training data,
which leads to a high error rate on the validation set. On the other hand, for a high
value of k, the model performs poorly on both the training and validation sets. If you
observe closely, the validation error curve reaches a minimum at a value of k = 9.
This value of k is the optimum value for the model (it will vary for different
datasets). This curve is known as an ‘elbow curve‘ (because it has a shape like an
elbow) and is usually used to determine the k value.
from math import sqrt
from sklearn import neighbors
from sklearn.metrics import mean_squared_error

# x_train, y_train, x_test, y_test are assumed to have been prepared earlier.
rmse_val = []  # to store rmse values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(x_train, y_train)                      # fit the model
    pred = model.predict(x_test)                     # make predictions on the test set
    error = sqrt(mean_squared_error(y_test, pred))   # calculate rmse
    rmse_val.append(error)                           # store rmse values

01/31/2024 23
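A short follow-up to the loop above (assuming the same rmse_val list and that matplotlib is available): pick the k with the lowest validation RMSE and, optionally, plot the elbow curve.

best_k = rmse_val.index(min(rmse_val)) + 1      # +1 because k started at 1
print("best k:", best_k, "rmse:", min(rmse_val))

# Optional: visualize the elbow curve described above.
import matplotlib.pyplot as plt
plt.plot(range(1, 21), rmse_val, marker="o")
plt.xlabel("k")
plt.ylabel("validation RMSE")
plt.show()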
Lazy vs. Eager Learning
 Lazy vs. eager learning
 Lazy learning (e.g., instance-based learning): simply stores the
training data (or does only minor processing) and waits until it is
given a test tuple
 Eager learning: given a training set, constructs a classification
model before receiving new (e.g., test) data to classify
 Lazy: less time in training but more time in predicting
 Accuracy
 Lazy methods effectively use a richer hypothesis space, since they
use many local linear functions to form an implicit global
approximation to the target function
 Eager: must commit to a single hypothesis that covers the entire
instance space


01/31/2024 24
ASSIGNMENT……
1. Tune k-NN. Try larger and larger k values to see if you can
improve the performance of the algorithm on the Iris dataset.
2. Implement other distance measures that you can use to find
similar data, such as Hamming distance, Manhattan distance
and Minkowski distance.
3. Regression. Adapt the example and apply it to a regression
predictive modeling problem (e.g. predict a numerical value)
4. Data Preparation. Distance measures are strongly affected by
the scale of the input data. Experiment with standardization
and other data preparation methods in order to improve results.

01/31/2024 25
How Does the K-Means Algorithm Work?

[Flowchart: Start → decide the number of clusters K → determine the centroids →
compute the distance of each object to the centroids → group objects by minimum
distance → if any object changed group, recompute the centroids and repeat;
otherwise, end.]

26
K-Means Clustering Algorithm
 Step 1: Begin with a decision on the value of
k = the number of clusters.
 Step 2: Choose any initial partition that classifies
the data into k clusters. You may assign the
training samples randomly, or systematically
as follows:
1. Take the first k training samples as single-element
clusters.
2. Assign each of the remaining (N - k) samples to the
cluster with the nearest centroid. After each
assignment, re-compute the centroid of the
gaining cluster.
27
K-Means Clustering Algorithm
 Step 3: Take each sample in sequence and compute its
distance from the centroid of each of the clusters. If a
sample is not currently in the cluster with the closest
centroid, switch it to that cluster and update the
centroids of the cluster gaining the new sample and the
cluster losing the sample.
 Step 4: Repeat Step 3 until convergence is achieved, that
is, until a pass through the training samples causes no new
assignments.

28
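A compact sketch of these steps in plain Python (a batch, Lloyd-style variant: it reassigns all samples first and then recomputes the centroids, rather than updating after every single switch as in Step 3; the function and variable names are my own). Applied to the worked example on the following slides, with individuals 1 and 4 as initial centroids, it reproduces the clusters {1, 2} and {3, 4, 5, 6, 7}.

import math

def kmeans(points, initial_centroids, max_iter=100):
    centroids = [list(c) for c in initial_centroids]
    for _ in range(max_iter):
        # Assignment: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update: recompute each centroid as the mean of its cluster.
        new_centroids = [[sum(coord) / len(cluster) for coord in zip(*cluster)]
                         if cluster else centroids[i]
                         for i, cluster in enumerate(clusters)]
        if new_centroids == centroids:   # no centroid moved -> converged
            break
        centroids = new_centroids
    return centroids, clusters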
A simple example showing the implementation of the k-means algorithm
(using K = 2)

Individual Variable 1 Variable 2


1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5

29
 Step 1:
 Initialization: we randomly choose the following two
centroids (k = 2) for the two clusters.
 In this case the 2 centroids are:
m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

          Individual   Mean Vector
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)

30
Step 2:
 Compute the Euclidean distance of each individual to the two centroids:

Individual   Distance to Centroid 1   Distance to Centroid 2
1            0                        7.21
2            1.12                     6.10
3            3.61                     3.61
4            7.21                     0
5            4.72                     2.50
6            5.31                     2.06
7            4.30                     2.92

 Thus, we obtain two clusters containing:
{1, 2, 3} and {4, 5, 6, 7}.
 Their new centroids are:
m1 = (1.83, 2.33) and m2 = (4.13, 5.38).
31
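The distance table above can be reproduced in a few lines (a sketch using the same data and the two initial centroids):

import math

data = {1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
        5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5)}
m1, m2 = (1.0, 1.0), (5.0, 7.0)

for i, p in data.items():
    # Distance of individual i to centroid 1 and centroid 2.
    print(i, round(math.dist(p, m1), 2), round(math.dist(p, m2), 2))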
Step 3:
 Now, using these new centroids, we compute the Euclidean
distance of each object to them, as before (the distance table is
not reproduced here).

 Therefore, the new clusters are:
{1, 2} and {3, 4, 5, 6, 7}

 The next centroids are:
m1 = (1.25, 1.5) and m2 = (3.9, 5.1)

32
 Step 4:
The clusters obtained are:
{1, 2} and {3, 4, 5, 6, 7}
 Therefore, there is no change in the clusters.
 Thus, the algorithm comes to a halt here, and the
final result consists of the 2 clusters {1, 2} and
{3, 4, 5, 6, 7}.

33
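As a cross-check (assuming scikit-learn is installed), running k-means on the same seven individuals with individuals 1 and 4 as the initial centroids reproduces this final result:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
init = np.array([[1.0, 1.0], [5.0, 7.0]])   # individuals 1 and 4

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)            # [0 0 1 1 1 1 1]  -> clusters {1,2} and {3,4,5,6,7}
print(km.cluster_centers_)   # [[1.25 1.5 ] [3.9  5.1 ]]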
 [Plot of the two final clusters]

34
 (With K = 3)

[Plots: Step 1 and Step 2 of the same procedure with K = 3]

35
 [Plot of the K = 3 clustering result]

36
Exercise
 A department store in the Jakarta area wants to make sure that its sales
campaigns are right on target. In order to do that, it has to cluster its
customers based on: age, total spending, and shopping frequency in a year. The
department store has collected all the data. You are asked to help them cluster
these data into 3 clusters.

Customer     Age   Total Spend (in million rupiah)   Shopping Frequency
Customer 1 25 18,000 20
Customer 2 30 15,000 10
Customer 3 20 8,000 8
Customer 4 35 22,000 5
Customer 5 40 40,000 14
Customer 6 50 32,000 15
Customer 7 43 9,000 6
Customer 8 42 12,000 16

37
