K Nearest Neighbors

K Nearest Neighbors
 The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised

learning classifier, which uses proximity to make classifications or predictions
about the grouping of an individual data point. It is one of the popular and
simplest classification and regression classifiers used in machine learning
today.

For classification problems, a class label is assigned on the basis of a majority
vote—i.e. the label that is most frequently represented around a given data
point is used. While this is technically considered “plurality voting”, the
term, “majority vote” is more commonly used in literature. The distinction
between these terminologies is that “majority voting” technically requires a
majority of greater than 50%, which primarily works when there are only two
categories. When you have multiple classes—e.g. four categories, you don’t
necessarily need 50% of the vote to make a conclusion about a class; you
could assign a class label with a vote of greater than 25%.
 Regression problems use a similar concept as classification problem, but in
this case, the average the k nearest neighbors is taken to make a prediction
about a classification. The main distinction here is that classification is used
for discrete values, whereas regression is used with continuous ones.
However, before a classification can be made, the distance must be defined.
Euclidean distance is most commonly
 It's also worth noting that the KNN algorithm is also part of a family of “lazy
learning” models, meaning that it only stores a training dataset versus
undergoing a training stage. This also means that all the computation occurs
when a classification or prediction is being made. Since it heavily relies on
memory to store all its training data, it is also referred to as an instance-
based or memory-based learning method.
Applications of k-NN in machine learning
 The k-NN algorithm has been utilized within a variety of applications, largely
within classification. Some of these use cases include:
 - Data preprocessing: Datasets frequently have missing values, but the KNN
algorithm can estimate for those values in a process known as missing data
imputation.
 - Recommendation Engines: Using clickstream data from websites, the KNN
algorithm has been used to provide automatic recommendations to users on
additional content. This shows that the a user is assigned to a particular
group, and based on that group’s user behavior, they are given a
recommendation. However, given the scaling issues with KNN, this approach
may not be optimal for larger datasets.
 - Finance: It has also been used in a variety of finance and economic use
cases. For example it shows how using KNN on credit data can help banks
assess risk of a loan to an organization or individual. It is used to determine
the credit-worthiness of a loan applicant. Its used in stock market
forecasting, currency exchange rates, trading futures, and money laundering
analyses.
 - Healthcare: KNN has also had application within the healthcare industry,
making predictions on the risk of heart attacks and prostate cancer. The
algorithm works by calculating the most likely gene expressions.
 - Pattern Recognition: KNN has also assisted in identifying patterns, such as
in text and digit classification. This has been particularly helpful in
identifying handwritten numbers that you might find on forms or mailing
envelopes.
Advantages and disadvantages of the KNN algorithm
 Just like any machine learning algorithm, k-NN has its strengths and
weaknesses. Depending on the project and application, it may or may not be
the right choice.
 Advantages
 Easy to implement: Given the algorithm’s simplicity and accuracy, it is one of
the first classifiers that a new data scientist will learn.
 - Adapts easily: As new training samples are added, the algorithm adjusts to
account for any new data since all training data is stored into memory.
 - Few hyperparameters: KNN only requires a k value and a distance metric,
which is low when compared to other machine learning algorithms.
 Disadvantages
 - Does not scale well: Since KNN is a lazy algorithm, it takes up more memory
and data storage compared to other classifiers. This can be costly from both a
time and money perspective. More memory and storage will drive up business
expenses and more data can take longer to compute. While different data
structures, such as Ball-Tree, have been created to address the computational
inefficiencies, a different classifier may be ideal depending on the business
problem.
 - Curse of dimensionality: The KNN algorithm tends to fall victim to the curse of
dimensionality, which means that it doesn’t perform well with high-dimensional
data inputs. This is sometimes also referred to as the peaking phenomenon,
where after the algorithm attains the optimal number of features, additional
features increases the amount of classification errors, especially when the
sample size is smaller.
 - Prone to overfitting: Due to the “curse of dimensionality”, KNN is also more
prone to overfitting. While feature selection and dimensionality reduction
techniques are leveraged to prevent this from occurring, the value of k can
also impact the model’s behavior. Lower values of k can overfit the data,
whereas higher values of k tend to “smooth out” the prediction values since it
is averaging the values over a greater area, or neighborhood. However, if the
value of k is too high, then it can underfit the data.
Analytical Models we may use in the L&D phase:
Skills Gap Analysis.
An effective learning and development strategy relies on a process in which
one continually moves through learning, attaining competencies and then
Move to organizational capabilities.
An essential part of this process is to determine the competencies/skills
which are required in the organization and the level of these competencies
already in existence with the employees.
This is the Skills Gap Analysis as applied to the Learning & Development phase.
To give an example: Consider a list of competencies required in employees.
When an organization looks at competencies required in employees, it often considers

the following skills :
Communication skill, written and spoken
Business Acumen. Business Acumen is keenness and quickness in understanding
and dealing with a business situation in a manner that is likely to
lead to a good outcome
Planning & Organization, refers to the skill of steps to be taken, prioritize tasks
and time management
Problem solving skills , This refers to defining a problem; determining the cause
of the problem; identifying, prioritizing, selecting alternatives for a solution;
implementing a solution.
Future Developments, process that helps the company establish and maintain
relationships with future prospects, learn about changing markets,
supply chain alterations and changing macro economic environments
Changing Business Landscapes : responding to the new global challenges , changing supply chains,
changing labor standards, human rights, climate change, market place
Process and performance indicators, specifically focus on business processes measuring and
monitoring operations, HR, Marketing and supply chain processes
Identify Resource Requirements, to provide services, reach markets, maintain relationships
with clients and earn revenue.
Supply chain management (SCM) is management of the flow of goods, data, and
finances related to a product or service, from the procurement of raw materials to
the delivery of the product at its final destination
Market knowledge :to know about the potential consumers behaviour which is directly and
indirectly connected to the products and services
One can apply skills gap analysis in the L& D phase to identify which skills
Need to be enhanced by training ( L&D)
Skills Gap Analysis

reqd exist
communication
market 10 businessacumen
0
supplychain planning&organization
-10
identify resource requirements problemsolvngskill
processes and performance indicators futuredev
changing business landscape

Analytical Models we can apply L&D phase:
Principal Component Analysis:
To determine inherent dimensions in the skills required in the employees

To determine inherent dimensions existing in the employee skill set.
K-means Cluster Analysis to determine groups of employees having homogenous skills.

Factor Analysis in HR context:
Principal Component Analysis
Assume there are many variables under

consideration many of which are interrelated.
So there are groups of inter correlated variables.
Each of these groups is represented by a
Factor or Dimension, F1, F2 …
Factors are linear combinations of variables
so that
F1 = a11V1+a12V2+…
F2 = a21V1+a22V2+…
Consider an example in Factor Analysis:
Consider a database having 6 competencies on which

Employees have been rated.
So there are 6 variables:

V1= competency on business acumen
V2= competency on planning & development
V3= competency on problem solving skill
V4= understanding future development
V5= understanding on changing lanscape
V6= competency on communication
The six variables have the following correlation martrix between them:
v1 v2 v3 v4 v5 v6
-------------------------------------------------------------------------------------------
V1 1 .8 .9 .1 .06 .08
V2 1 .85 .08 .1 .04
V3 1 .02 .05 .04
V4 1 .9 .08
V5 1 .02
V6 1
-------------------------------------------------------------------------------------------
There is very high correlation between v1,v2 and v3.

Also very high correlation between v4 and v5. v6 is on its own.
This result indicates that there are three factors,F1,F2 , F3 ;
F1 representing v1,v2 and v3. F2 representing v4 and v5. F3=v6
F1 measures a dimension which is common to v1,v2,v3.
F2 measures another dimension, common to v4,v5 and
F3 measuringv6.
F1: can be labelled knowledge of business sense
F2: changing business scenario
F3: communication
Rotation of factors :
To help clearer identification of factors we make use of varimax rotation.

Other approaches are quartimax, equamax.
Save factor scores for further analysis

K Nearest Neighbors

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

K Nearest Neighbors

Uploaded by

Copyright:

Available Formats

K Nearest Neighbors

 The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised

When an organization looks at competencies required in employees, it often considers

Skills Gap Analysis

identify resource requirements problemsolvngskill

processes and performance indicators futuredev

changing business landscape

To determine inherent dimensions in the skills required in the employees

K-means Cluster Analysis to determine groups of employees having homogenous skills.

Principal Component Analysis

Assume there are many variables under

Consider a database having 6 competencies on which

So there are 6 variables:

There is very high correlation between v1,v2 and v3.

To help clearer identification of factors we make use of varimax rotation.

Save factor scores for further analysis

You might also like