KNN and K-Means
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
Nearest Neighbor Classifiers
[Figures: examples of nearest-neighbor classification]
Nearest neighbor method
Majority vote within the k nearest neighbors. In the example figure, the new point is classified blue with k = 1 but green with k = 3.
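To make the voting rule concrete, here is a minimal sketch in Python; the 2-D points and labels are invented for illustration, chosen so that the new point flips from blue at k = 1 to green at k = 3:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # Euclidean distance from x_new to every stored record
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # labels of the k closest records, then majority vote
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[3.0, 3.0], [1.0, 1.0], [3.5, 3.5], [4.0, 4.0]])
y_train = np.array(["blue", "blue", "green", "green"])
new = np.array([3.1, 3.1])
print(knn_predict(X_train, y_train, new, k=1))  # blue (single nearest point)
print(knn_predict(X_train, y_train, new, k=3))  # green (2 of 3 neighbors)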
Nearest-Neighbor Classifiers
Classifying an unknown record requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
Nearest Neighbor Classification
Compute the distance between two points, e.g. Euclidean distance:

d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}
Nearest Neighbor Classification
Euclidean Distance: calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (y).
Manhattan Distance: the distance between real-valued vectors, computed as the sum of their absolute differences.
Hamming Distance: used for categorical variables. If the value x and the value y are the same, the distance D is 0; otherwise, D = 1.
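All three metrics translate directly into code. A minimal sketch (the sample inputs are arbitrary):

import numpy as np

def euclidean(x, y):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    # sum of absolute differences
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def hamming(x, y):
    # count positions where categorical values differ (0 if equal, else 1)
    return sum(xi != yi for xi, yi in zip(x, y))

print(euclidean([1, 2], [4, 6]))            # 5.0
print(manhattan([1, 2], [4, 6]))            # 7
print(hamming(["red", "S"], ["red", "M"]))  # 1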
Nearest Neighbor Classification…
Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes
Nearest Neighbor Classification…
Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes (e.g., income in dollars versus height in meters)
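One common remedy is min-max normalization, which rescales every attribute to the [0, 1] range. A sketch, with made-up height/income values:

import numpy as np

def min_max_scale(X):
    # rescale each attribute (column) to [0, 1]
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

# height in metres vs. income in dollars: income would otherwise dominate
X = [[1.6, 40000.0], [1.8, 90000.0], [1.7, 55000.0]]
print(min_max_scale(X))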
Nearest Neighbor Classification…
Problem with the Euclidean measure:
– High-dimensional data: the curse of dimensionality
– Can produce counter-intuitive results, e.g. on 12-bit vectors:

111111111110 vs 011111111111: d = 1.4142
100000000000 vs 000000000001: d = 1.4142

Both pairs differ in exactly two bit positions, so both distances equal √2 ≈ 1.4142, even though the first pair shares ten 1s and the second pair shares no 1s at all.
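A quick check of both distances, using the bit strings from the example:

import numpy as np

def euclid_bits(a, b):
    # Euclidean distance between two binary strings, bit by bit
    u = np.array(list(a), dtype=int)
    v = np.array(list(b), dtype=int)
    return float(np.sqrt(((u - v) ** 2).sum()))

print(euclid_bits("111111111110", "011111111111"))  # 1.414..., 2 differing bits
print(euclid_bits("100000000000", "000000000001"))  # 1.414..., 2 differing bits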
Nearest Neighbor Classification…
k-NN classifiers are lazy learners:
– They do not build models explicitly
– Classifying unknown records is relatively expensive
Nearest Neighbor Regression…
Let us start with a simple example. Consider the following table: it consists of the height, age, and weight (target) values for 10 people. As you can see, the weight value of ID11 is missing. We need to predict the weight of this person based on their height and age.
A SIMPLE EXAMPLE TO UNDERSTAND THE USE OF KNN FOR REGRESSION
[Table and plot of the height/age/weight data for ID1-ID11]
HOW DOES THE KNN ALGORITHM WORK
First, the distance between the new point and each training point is calculated.
Next, the closest k points (the nearest neighbors) are selected.
Finally, the average of the target values of these k points becomes the prediction for the new point.
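These three steps map directly to code. A minimal sketch, with made-up height/age/weight rows standing in for the table above:

import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    # Step 1: distance between the new point and each training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: pick the k closest training points
    nearest = np.argsort(dists)[:k]
    # Step 3: average their target values as the prediction
    return y_train[nearest].mean()

# hypothetical (height_cm, age) -> weight_kg records
X_train = np.array([[167, 51], [182, 29], [176, 54], [173, 34], [172, 55]])
y_train = np.array([61.0, 78.0, 69.0, 71.0, 65.0])
print(knn_regress(X_train, y_train, np.array([170, 57]), k=3))  # 65.0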
For a very low value of k (say k = 1), the model overfits the training data, which leads to a high error rate on the validation set. On the other hand, for a high value of k, the model performs poorly on both the training and validation sets. If you observe closely, the validation error curve reaches a minimum at k = 9. This value of k is the optimum for the model (it will vary for different datasets). This curve is known as an 'elbow curve' (because it is shaped like an elbow) and is often used to determine the k value.
from sklearn import neighbors
from sklearn.metrics import mean_squared_error

rmse_val = []  # to store RMSE values for different k; x_train/y_train and
               # x_val/y_val are the train/validation splits prepared earlier
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(x_train, y_train)  # fit on the training data
    rmse_val.append(mean_squared_error(y_val, model.predict(x_val)) ** 0.5)
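To read the elbow off the curve, one would typically plot the stored RMSE values against k, e.g. with matplotlib:

import matplotlib.pyplot as plt

plt.plot(range(1, 21), rmse_val, marker="o")  # validation RMSE for each k
plt.xlabel("k")
plt.ylabel("validation RMSE")
plt.title("Elbow curve")
plt.show()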
Lazy vs. Eager Learning
– Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test record.
– Eager learning: given a set of training records, constructs a classification model before receiving new records to classify.
How the K-Means Algorithm Works?
[Flowchart: Start → number of clusters K → centroids → distance of objects to centroids → …]
K-Means Clustering Algorithm
Step 1: Begin with a decision on the value of k = the number of clusters.
Step 2: Choose any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N - k) samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of the cluster gaining the new sample and the cluster losing it.
Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
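A from-scratch sketch of these steps in Python. The seven 2-D points are chosen to be consistent with the distance table in the worked example below, and the initial centroids are individuals 1 and 4, as in that example:

import numpy as np

def k_means(X, k, centroids=None, max_iter=100):
    # Step 2 (option 1): default initial centroids are the first k samples
    if centroids is None:
        centroids = X[:k].astype(float)
    for _ in range(max_iter):
        # Step 3: distance of every sample to every centroid; nearest wins
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its members
        # (assumes no cluster goes empty, which holds for this data)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once a full pass changes nothing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
labels, centroids = k_means(X, k=2, centroids=X[[0, 3]])
print(labels)  # [0 0 1 1 1 1 1] -> clusters {1, 2} and {3, 4, 5, 6, 7}
print(centroids)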
A simple example showing the implementation of the k-means algorithm (using k = 2)
Step 1: Initialization
We randomly choose the following two centroids (k = 2) for the two clusters. In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Group      Individual   Mean Vector (centroid)
Group 1    1            (1.0, 1.0)
Group 2    4            (5.0, 7.0)
Step 2:
Compute the distance of each individual to the two centroids:

Individual   Distance to Centroid 1   Distance to Centroid 2
1            0                        7.21
2            1.12                     6.10
3            3.61                     3.61
4            7.21                     0
5            4.72                     2.50
6            5.31                     2.06
7            4.30                     2.92

Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}, and their new centroids are recomputed from these members.
Step 3:
Now, using these new centroids, we recompute the distance of each individual to the centroids and reassign the individuals to the nearest cluster.
Step 4:
The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}. There is no change from the previous pass, so the algorithm halts here, and the final result consists of the two clusters {1, 2} and {3, 4, 5, 6, 7}.
[Plot of the final two clusters]
The same example with K = 3:
[Figures: Step 1 and Step 2 of the K = 3 run, and a plot of the resulting three clusters]
Exercise
A department store in the Jakarta area wants to make sure that its sales campaigns are right on target. To do that, it has to cluster its customers based on age, total spending, and shopping frequency per year. The department store has collected all the data. You are asked to help it cluster these data into 3 clusters.
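A possible starting point, assuming the customer data sits in a CSV file; the file and column names (customers.csv, age, total_spending, frequency) are hypothetical:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("customers.csv")              # hypothetical file name
cols = ["age", "total_spending", "frequency"]  # hypothetical column names
X = MinMaxScaler().fit_transform(df[cols])     # scale so spending can't dominate
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
df["cluster"] = kmeans.labels_
print(df.groupby("cluster")[cols].mean())      # profile each customer segment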