Why do you need to scale data in KNN

Asked 3 years, 8 months ago   Active 7 months ago   Viewed 47k times

24

Could someone please explain to me why you need to normalize data when using K nearest neighbors.

I've tried to look this up, but I still can't seem to understand it.

I found the following link:

https://discuss.analyticsvidhya.com/t/why-it-is-necessary-to-normalize-in-knn/2715

But in this explanation, I don't understand why a larger range in one of the features affects the predictions.

k-nearest-neighbour

asked Jun 26 '17 at 17:41 by bugsyb

I think normalization has to be justified from the subject-matter point of view. Essentially, what matters is what defines the distance between points. You have to find a convenient arithmetic definition of distance that reflects the subject-matter definition of distance. In my limited experience, I have normalized in some but not all directions based on subject-matter considerations. – Richard Hardy Jun 26 '17 at 19:00

1 For an instructive example, please see stats.stackexchange.com/questions/140711. – whuber ♦ Jun 26 '17 at 19:28

It seems like any scaling (min-max or robust) is acceptable, not just standard
scaling. Is that correct? – skeller88 Apr 10 '20 at 20:20


3 Answers

34

The k-nearest neighbor algorithm relies on majority voting based on class membership of the 'k' nearest samples for a given test point. The nearness of samples is typically based on Euclidean distance.

Consider a simple two-class classification problem, where a Class 1 sample is chosen (black) along with its 10 nearest neighbors (filled green). In the first figure, the data is not normalized, whereas in the second one it is.

Notice how, without normalization, all the nearest neighbors are aligned along the axis with the smaller range, i.e. $x_1$, leading to incorrect classification.

Normalization solves this problem!

answered Jun 26 '17 at 18:56 by kedarps, edited Jun 26 '17 at 20:09
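To see the same effect numerically, here is a minimal sketch (the feature ranges, random seed, and k = 10 are arbitrary illustration choices, not taken from the answer above): without scaling, the 10 nearest neighbours of a query point are selected almost entirely by the wide-range feature $x_2$, so they spread out along $x_1$; after standardizing both features, they are close in both coordinates.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Two features with very different ranges (assumed for illustration).
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 1, n)      # narrow-range feature
x2 = rng.uniform(0, 1000, n)   # wide-range feature
X = np.column_stack([x1, x2])
query = np.array([[0.5, 500.0]])

# Neighbours on the raw data: the distance is dominated by x2,
# so the selected points spread out along the x1 axis.
raw_nn = NearestNeighbors(n_neighbors=10).fit(X)
raw_idx = raw_nn.kneighbors(query, return_distance=False)[0]

# Neighbours after standardizing both features.
scaler = StandardScaler().fit(X)
scaled_nn = NearestNeighbors(n_neighbors=10).fit(scaler.transform(X))
scaled_idx = scaled_nn.kneighbors(scaler.transform(query), return_distance=False)[0]

print("x1 spread of the 10 neighbours, raw data:   ", np.ptp(X[raw_idx, 0]))
print("x1 spread of the 10 neighbours, scaled data:", np.ptp(X[scaled_idx, 0]))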

2 This answer is exactly right, but I fear the illustrations might be deceptive
because of the distortions involved. The point might be better made by
drawing them both so that the two axes in each are at the same scale. –
whuber ♦ Jun 26 '17 at 19:30

1 I found it difficult to fit all data points in the same scale for both figures. Hence,
I mentioned in a note that scales of axes are different. – kedarps Jun 26 '17 at
19:55

1 That difficulty actually is the point of your response! One way to overcome it is
not to use such an extreme range of scales. A 5:1 difference in scales, rather
than a 1000:1 difference, would still make your point nicely. Another way is to
draw the picture faithfully: the top scatterplot will seem to be a vertical line of
points. – whuber ♦ Jun 26 '17 at 19:57

2 @whuber, I misunderstood your first comment. Fixed the plots, hopefully it's
better now! – kedarps Jun 26 '17 at 20:10

1 @Undertherainbow That is correct! – kedarps Mar 11 '19 at 19:33


10

Suppose you had a dataset (m "examples" by n "features") and all but one feature dimension had values strictly between 0 and 1, while a single feature dimension had values that range from -1000000 to 1000000. When taking the Euclidean distance between pairs of "examples", the values of the feature dimensions that range between 0 and 1 may become uninformative, and the algorithm would essentially rely on the single dimension whose values are substantially larger. Just work out some example Euclidean distance calculations and you can understand how the scale affects the nearest-neighbor computation.

answered Jun 26 '17 at 20:08 by Derek Jones, edited Jun 26 '17 at 20:26
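Following that suggestion, a small worked calculation (with invented numbers) shows that even a near-maximal difference in a 0-to-1 feature contributes almost nothing to the Euclidean distance next to a modest difference in the large-scale feature:

import math

# Two points: a large difference in the 0-1 feature, a small one
# in the feature that ranges over +/- 1,000,000 (invented numbers).
a = (0.1, 250_000.0)
b = (0.9, 250_300.0)

# Squared contribution of each feature to the Euclidean distance.
d_small = (a[0] - b[0]) ** 2   # 0.64
d_large = (a[1] - b[1]) ** 2   # 90000.0

distance = math.sqrt(d_small + d_large)
print(distance)                       # about 300.001 -- set by the large-scale feature
print(d_large / (d_small + d_large))  # about 0.99999 -- its share of the squared distance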


0

If the scale of the features is very different, then normalization is required. This is because the distance calculation in KNN uses the feature values directly. When one feature's values are much larger than the others', that feature will dominate the distance and hence the outcome of KNN.

See the example on gist.github.com

answered Jul 4 '20 at 0:47 by Ajey
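Since the gist itself is not reproduced here, the following is a minimal sketch in the same spirit (the wine dataset, k = 5, and standard scaling are assumptions for illustration, not the content of the linked example): it compares a KNN classifier with and without standardization on data whose features have very different scales.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The wine data has features on very different scales (e.g. proline vs. hue).
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# KNN on the raw features: large-scale features dominate the distance.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# The same model with standardization applied first.
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

print("test accuracy without scaling:", raw_knn.score(X_te, y_te))
print("test accuracy with scaling:   ", scaled_knn.score(X_te, y_te))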

