
K-Nearest Neighbor(KNN) Algorithm for Machine Learning

K-Nearest Neighbour (K-NN) is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.

o K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; to which of these categories does this data point belong? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. Between two points (x1, y1) and (x2, y2) it can be calculated as: d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Consider the below image:

o As we can see, the 3 nearest neighbors are from Category A, hence this new data point must belong to Category A. A minimal sketch of this procedure in Python is given below.
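The steps above can be written as a short, self-contained Python sketch. The helper names (euclidean_distance, knn_classify) and the tiny two-category dataset are illustrative assumptions, not part of the original example.

```python
import math
from collections import Counter

def euclidean_distance(p, q):
    # straight-line distance between two points of equal dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(training_data, new_point, k=5):
    # Steps 2-3: compute distances and keep the k nearest neighbours
    neighbours = sorted(
        training_data,
        key=lambda item: euclidean_distance(item[0], new_point)
    )[:k]
    # Steps 4-5: count the categories among the neighbours and pick the majority
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# hypothetical 2-D points labelled with Category 'A' or 'B'
training_data = [((1.0, 1.1), 'A'), ((1.2, 0.9), 'A'), ((1.1, 1.4), 'A'),
                 ((3.0, 3.2), 'B'), ((3.1, 2.9), 'B')]
print(knn_classify(training_data, (1.5, 1.5), k=5))   # -> 'A' (3 votes vs 2)
```

With k=5 all five training points are considered, three of which belong to Category A, so the new point is assigned to Category A, matching the walkthrough above.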

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try several values to find the best among them. The most commonly used value for K is 5.
o A very low value of K, such as K=1 or K=2, can be noisy and make the model sensitive to the effects of outliers.
o Large values of K smooth out noise, but they can blur the boundary between categories and increase the computation required. A small sketch of choosing K via cross-validation is shown after this list.
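One common way to pick K in practice is cross-validation. The sketch below is a minimal illustration using scikit-learn's GridSearchCV on the built-in Iris dataset; the candidate K values are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# try several candidate values of K and keep the one with the best
# cross-validated accuracy
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # e.g. {'n_neighbors': 5}
```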
Introduction to Decision Tree

In general, decision tree analysis is a predictive modelling tool that can be applied across many areas. Decision trees can be constructed by an algorithmic approach that splits the dataset in different ways based on different conditions. Decision trees are among the most powerful algorithms that fall under the category of supervised algorithms.

They can be used for both classification and regression tasks. The two main entities of a tree are decision nodes, where the data is split, and leaves, where we get the outcomes. An example of a binary tree for predicting whether a person is fit or unfit, given information such as age, eating habits and exercise habits, is shown below −

In the above decision tree, the questions are decision nodes and the final outcomes are leaves. We have the following two types of decision trees.

• Classification decision trees − In this kind of decision tree, the decision variable is categorical. The above decision tree is an example of a classification decision tree.
• Regression decision trees − In this kind of decision tree, the decision variable is continuous.

Implementing Decision Tree Algorithm

Gini Index

It is the name of the cost function that is used to evaluate binary splits in the dataset. It works with a categorical target variable such as "Success" or "Failure".

The lower the value of the Gini index, the higher the homogeneity (purity) of a node. A perfect Gini index value is 0 and the worst is 0.5 (for a 2-class problem). The Gini index for a split can be calculated with the help of the following steps −
• First, calculate the Gini index for each sub-node using the formula 1 − (p² + q²), where p² + q² is the sum of the squares of the probabilities of success and failure in that node.
• Next, calculate the Gini index for the split using the weighted Gini score of each node of that split.

The Classification and Regression Tree (CART) algorithm uses the Gini method to generate binary splits.
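As a rough illustration of the weighted Gini calculation described above, the following Python sketch computes the Gini index for a candidate split. It assumes each row is a list whose last element is the class label; the function name gini_index is our own choice, not a fixed API.

```python
def gini_index(groups, classes):
    """Weighted Gini index for the two groups produced by a split."""
    # total number of samples across both groups of the split
    n_instances = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        size = len(group)
        if size == 0:            # avoid division by zero for an empty group
            continue
        score = 0.0
        for class_val in classes:
            # proportion of rows in this group belonging to class_val
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # node impurity 1 - (p^2 + q^2), weighted by the group's relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini

# tiny hypothetical split: left group is pure, right group is mixed
left = [[2.7, 0], [1.3, 0]]
right = [[7.5, 1], [9.0, 1], [3.1, 0]]
print(gini_index([left, right], classes=[0, 1]))   # -> about 0.267
```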

Split Creation

A split basically consists of an attribute in the dataset and a value for that attribute. We can create a split in the dataset with the help of the following three parts −

• Part 1: Calculating Gini Score − We have just discussed this part in the previous section.
• Part 2: Splitting a dataset − It may be defined as separating a dataset into two lists of rows, given the index of an attribute and a split value for that attribute. After getting the two groups − right and left − from the dataset, we can calculate the value of the split by using the Gini score calculated in the first part. The split value decides in which group a record will reside.
• Part 3: Evaluating all splits − The next part after finding the Gini score and splitting the dataset is the evaluation of all splits. For this purpose, we first check every value of each attribute as a candidate split. Then we find the best possible split by evaluating the cost of each split. The best split will be used as a node in the decision tree. A sketch of this split-creation procedure is given after this list.
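A minimal sketch of Parts 2 and 3 is shown below. It assumes the gini_index function from the previous sketch and the same row format (class label in the last column); the helper names test_split and get_split are illustrative.

```python
def test_split(index, value, dataset):
    """Part 2: split the rows into left/right groups on one attribute value."""
    left, right = [], []
    for row in dataset:
        (left if row[index] < value else right).append(row)
    return left, right

def get_split(dataset):
    """Part 3: try every attribute value as a candidate split, keep the best."""
    class_values = list(set(row[-1] for row in dataset))
    best = {'gini': float('inf')}
    for index in range(len(dataset[0]) - 1):      # every attribute column
        for row in dataset:                        # every value as a candidate
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < best['gini']:
                best = {'index': index, 'value': row[index],
                        'groups': groups, 'gini': gini}
    return best   # dict describing the node: attribute index, value, groups
```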

Building a Tree

As we know, a tree has a root node and terminal nodes. After creating the root node, we can build the tree in the following two parts −

Part 1: Terminal node creation

While creating terminal nodes of a decision tree, one important point is to decide when to stop growing the tree or creating further terminal nodes. It can be done by using two criteria, namely maximum tree depth and minimum node records, as follows −

• Maximum Tree Depth − As the name suggests, this is the maximum depth of the tree, i.e., the maximum number of levels of nodes below the root node. We must stop adding terminal nodes once the tree reaches this maximum depth.
• Minimum Node Records − It may be defined as the minimum number of training patterns that a given node is responsible for. We must stop splitting a node once the number of records it holds reaches or falls below this minimum.

A terminal node is used to make a final prediction.

Part 2: Recursive Splitting

Now that we understand when to create terminal nodes, we can start building our tree. Recursive splitting is the method used to build the tree. In this method, once a node is created, we can create the child nodes (nodes added to an existing node) recursively on each group of data generated by splitting the dataset, by calling the same function again and again. A sketch of this recursion is given below.
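The sketch below illustrates recursive splitting with the two stopping criteria described above (maximum depth and minimum node records). It builds on the get_split helper from the split-creation sketch; the dict-based node representation and the function names are assumptions made for illustration, not a definitive implementation.

```python
def to_terminal(group):
    """Terminal node: the most common class label among the group's rows."""
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

def split(node, max_depth, min_size, depth):
    """Recursively create child nodes or terminal nodes for a split node."""
    left, right = node['groups']
    del node['groups']
    if not left or not right:                 # one side empty: no useful split
        node['left'] = node['right'] = to_terminal(left + right)
        return
    if depth >= max_depth:                    # maximum tree depth reached
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    for side, group in (('left', left), ('right', right)):
        if len(group) <= min_size:            # minimum node records reached
            node[side] = to_terminal(group)
        else:
            node[side] = get_split(group)     # split this group again
            split(node[side], max_depth, min_size, depth + 1)

def build_tree(train, max_depth, min_size):
    root = get_split(train)                   # root node on the full training set
    split(root, max_depth, min_size, 1)
    return root
```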

Prediction

After building a decision tree, we can use it to make predictions. Basically, prediction involves navigating the decision tree with a given row of data.

We can make a prediction with the help of a recursive function, as above. The same prediction routine is called again with the left or right child node, as sketched below.
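A minimal recursive prediction routine over the dict-based tree built in the previous sketch might look like this; it is a sketch under those assumptions, not a definitive implementation.

```python
def predict(node, row):
    """Navigate the tree with a row of data until a terminal node is reached."""
    child = node['left'] if row[node['index']] < node['value'] else node['right']
    # an internal node is a dict; a terminal node is a plain class label
    if isinstance(child, dict):
        return predict(child, row)
    return child

# hypothetical usage with the helpers defined above:
# tree = build_tree(training_rows, max_depth=3, min_size=1)
# print(predict(tree, new_row))
```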

Assumptions

The following are some of the assumptions we make while creating a decision tree −

• While preparing the decision tree, the whole training set is taken as the root node.
• The decision tree classifier prefers the feature values to be categorical. If you want to use continuous values, then they must be discretized prior to model building.
• Based on the attributes' values, the records are recursively distributed.
• A statistical approach is used to decide which attribute is placed at each node position, i.e., as the root node or an internal node.

Classification Algorithms - Random Forest

Introduction

Random forest is a supervised learning algorithm which is used for both classification and regression. However, it is mainly used for classification problems. As we know, a forest is made up of trees, and more trees mean a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, gets a prediction from each of them, and finally selects the best solution by means of voting. It is an ensemble method which is better than a single decision tree because it reduces over-fitting by averaging the results.

Working of Random Forest Algorithm

We can understand the working of the Random Forest algorithm with the help of the following steps −

• Step 1 − First, start with the selection of random samples from a given dataset.
• Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it
will get the prediction result from every decision tree.
• Step 3 − In this step, voting will be performed for every predicted result.
• Step 4 − At last, select the most voted prediction result as the final prediction result.

The following diagram will illustrate its working −
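In code, these four steps are normally handled by a library implementation rather than written by hand. The sketch below is a minimal example using scikit-learn's RandomForestClassifier on the built-in Iris dataset; the parameter values (100 trees, 30% test split) are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Steps 1-2: n_estimators trees are built on random bootstrap samples;
# Steps 3-4: the forest aggregates their votes into a final prediction
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```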


Introduction to SVM

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms which are used for both classification and regression. But generally, they are used in classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have their own unique way of implementation as compared to other machine learning algorithms. Lately, they have become extremely popular because of their ability to handle multiple continuous and categorical variables.

Working of SVM

An SVM model is basically a representation of different classes in a hyperplane in multidimensional space. The hyperplane will be generated in an iterative manner by SVM so that the error can be minimized. The goal of SVM is to divide the datasets into classes by finding a maximum marginal hyperplane (MMH).
The following are important concepts in SVM −

• Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.
• Hyperplane − As we can see in the above diagram, it is a decision plane or space which divides a set of objects having different classes.
• Margin − It may be defined as the gap between the two lines on the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.

The main goal of SVM is to divide the datasets into classes by finding a maximum marginal hyperplane (MMH), and it can be done in the following two steps −

• First, SVM will generate hyperplanes iteratively that segregate the classes in the best way.
• Then, it will choose the hyperplane that separates the classes correctly with the largest margin. A small sklearn-based sketch is given after this list.
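The sketch below is a minimal classification example using scikit-learn's SVC; the Iris dataset, the linear kernel, and the value of C are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# a linear kernel looks for a maximum-margin separating hyperplane;
# C controls the trade-off between a wide margin and misclassified points
clf = SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)

print(clf.support_vectors_[:3])   # data points closest to the hyperplane
print(clf.score(X_test, y_test))  # accuracy on the held-out data
```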

Difference between supervised and unsupervised learning.

Businesses around the world today are smart and do everything to get and retain their customers. They can identify malicious credit/debit card transactions, identify a person uniquely with face or eye detection as a password to unlock a device, offer what their customers are looking for in the least possible time, separate spam from regular emails, and predict within how much time one can reach an intended destination depending upon the length of the road, weather conditions, traffic, etc.

These challenging tasks are possible only when the algorithms carrying out such predictions
are smart, and the learning approaches are the ones which make the algorithms smart.

When it comes to data mining, there are two main approaches of Machine Learning −

• Supervised learning
• Unsupervised learning

Read through this article to find out more about supervised and unsupervised learning and how
they are different from each other.

What is Supervised Learning Approach?

The supervised learning approach of machine learning is the approach in which the algorithms are trained by using labelled datasets. The datasets train the algorithms to be smarter. They make it easy for the algorithms to predict the outcome as accurately as possible. A dataset is a collection of related yet discrete data, which can be used or managed individually as well as a group. Labelled datasets are pieces of data that are tagged with one or more labels pertaining to certain properties or characteristics.

For example, look at the picture below. It depicts classification and labelling −

The labelled datasets make the algorithms understand the relationships within the data and carry out classification or prediction on new data quickly and with high accuracy. In this approach, human intervention is necessary to define properties and characteristics as well as to label the data appropriately.
Supervised learning is used for data where the input and output can be precisely mapped.

Different Approaches of Supervised Learning

Supervised learning is divided further into two approaches −

• Classification − In this approach, algorithms are trained to categorize the data into distinct units depending on their labels. Examples of some classification algorithms are Decision Tree, Random Forest, Support Vector Machine, etc. Classification can be Binary or Multi-class.
• Regression − This approach makes a computer program understand the relationship between dependent and independent data. As the name suggests, regression means "going back to": the algorithm is exposed to past data. Once training is completed, the algorithm can easily predict future values. Some popular regression algorithms are Linear, Logistic, and Polynomial regression. Regression can be Linear or Non-linear.

Both of the above approaches of machine learning are used for prediction, and both work with labelled datasets. Then what is the difference between the two?

What is Unsupervised Learning Approach?

The unsupervised learning approach of machine learning does not use labelled datasets for training the algorithms. Instead, the machines learn on their own by accessing massive amounts of unclassified data and finding their implicit patterns. The algorithms analyze and cluster the unlabelled datasets. There is no human intervention required while analyzing and clustering, hence the name "Unsupervised".

Different Approaches of Unsupervised Learning

The unsupervised learning approach is of the following three types −

• Association − This approach uses rules to find relationships between variables in a dataset. It is often used in suggestions and recommendations, for example, suggesting an item to a customer with "The customers who bought this item also bought", or "You may also like", or simply by showing allied product images and recommending related items. For instance, when the primary product being purchased is a computer, a wireless mouse and a remote keyboard may also be suggested.
• Clustering − It is a learning technique in data mining where unlabelled or unclassified data are grouped depending on either similarities or differences among them. This technique helps businesses understand market segments based on their customers' demographics.
• Dimensionality Reduction − It is a learning technique used to reduce the number of random variables or 'dimensions' to obtain a set of principal variables when the number of variables is very high. This technique helps with data compression without compromising the usability of the data. It is used, for example, for pre-processing audio/visual data to improve the quality of the outcome or for making the background of an image transparent.

Difference between Supervised and Unsupervised Learning

The following table highlights the major differences between Supervised and Unsupervised
learning –

• Objective − Supervised learning: to train the algorithm for prediction; the outcome the algorithm predicts mostly occurs as per human expectation. Unsupervised learning: to train the algorithm to find insights from a large volume of unclassified data.
• Dataset Labelling − Supervised learning: the datasets used are labelled. Unsupervised learning: the data used are unclassified.
• Knowledge of Classes − Supervised learning: the classes of data are known. Unsupervised learning: the number of classes is unknown, as the data is uncategorized and unlabelled.
• Human Intervention − Supervised learning: human intervention is required to label the data appropriately. Unsupervised learning: the algorithm takes care of both the input and the output of the data analysis; human intervention is only required for data validation.
• Proximity with Artificial Intelligence − Supervised learning: with a remarkable amount of human intervention, it seems distant from real Artificial Intelligence. Unsupervised learning: with less human intervention, it is very close to Artificial Intelligence.
• Computational Complexity − Supervised learning is simple and inexpensive. Unsupervised learning is complicated, time-consuming, and requires more resources.
• Learning Process − In supervised learning, training the algorithm takes place offline. In unsupervised learning, training the algorithms takes place in real time.

CLUSTERING
Data Mining - Cluster Analysis

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

What is Clustering?

Clustering is the process of grouping a set of abstract objects into classes of similar objects.

Points to Remember

• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis

• Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base.
And they can characterize their customer groups based on the purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures inherent
to populations.
• Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a city
according to house type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as detection of credit card
fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −

• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only small spherical clusters.
• High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and
usable.

Clustering Methods

Clustering methods can be classified into the following categories −

• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method

Partitioning Method

Suppose we are given a database of 'n' objects; the partitioning method constructs 'k' partitions of the data. Each partition will represent a cluster and k ≤ n. This means that it will classify the data into k groups, which satisfy the following requirements −

• Each group contains at least one object.
• Each object must belong to exactly one group.
What are the types of the partitional algorithm?

There are two types of partitional algorithms which are as follows −

K-means clustering − K-means clustering is the most common partitioning algorithm. K-means reassigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster using a measure of distance or similarity. The following steps are used in K-means clustering:
• Select K initial cluster centroids c1, c2, c3, ..., ck.
• Assign each instance x in the dataset to the cluster whose centroid is nearest to x.
• For each cluster, recompute its centroid as the average of all the data points contained in that cluster.
• Repeat the assignment and update steps until convergence, i.e., until the cluster assignments no longer change.

The initial values for the means are arbitrarily assigned; they can be assigned randomly, or the values of the first k input items themselves can be used. The convergence criterion can be based on the squared error, but it is not required to be; for example, the algorithm can stop when no data points are reassigned to different clusters. Other termination techniques simply look at a fixed number of iterations; a maximum number of iterations can be included to ensure stopping even without convergence. A small sklearn-based sketch is given below.
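The following is a minimal K-means sketch using scikit-learn; the tiny 2-D dataset and the parameter values (max_iter, tol, n_init) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical 2-D data with two dense regions
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# max_iter guarantees termination even if the squared-error criterion
# has not fully converged; tol is the convergence threshold
kmeans = KMeans(n_clusters=2, max_iter=300, tol=1e-4, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_)  # each centroid = mean of the points in its cluster
print(kmeans.labels_)           # cluster assignment of each data point
```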

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −

• Agglomerative Approach
• Divisive Approach
Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another. It keeps on doing so until all of the groups are merged into one or until the termination condition holds.

Divisive Approach

This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of hierarchical clustering −

• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.

Introduction to Hierarchical Clustering

Hierarchical clustering is another unsupervised learning algorithm that is used to group together the unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following two categories.

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is treated as a single cluster and then pairs of clusters are successively merged or agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster and the process of clustering involves dividing (top-down approach) the one big cluster into various smaller clusters.

Steps to Perform Agglomerative Hierarchical Clustering

We are going to explain the most used and important type of hierarchical clustering, i.e., agglomerative. The steps to perform it are as follows −

• Step 1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters at the start. The number of data points will also be K at the start.
• Step 2 − Now, in this step we form a bigger cluster by joining the two closest data points. This results in a total of K-1 clusters.
• Step 3 − Now, to form more clusters we join the two closest clusters. This results in a total of K-2 clusters.
• Step 4 − Now, to form one big cluster, repeat the above step until only one cluster remains, i.e., there are no more clusters left to join.
• Step 5 − At last, after making one single big cluster, the dendrogram is cut to divide the data into multiple clusters depending upon the problem. A small sketch using SciPy is given after this list.
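The sketch below illustrates these steps using SciPy's linkage and fcluster functions; the small 2-D dataset, the ward linkage method, and the cut into 3 clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# hypothetical 2-D data points
X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9]])

# Steps 1-4: start with every point as its own cluster and repeatedly
# merge the two closest clusters until one big cluster remains
Z = linkage(X, method='ward')

# Step 5: cut the resulting hierarchy to obtain the desired number of clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)

# dendrogram(Z) can be plotted with matplotlib to visualise the merge hierarchy
```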

Density based clustering

Introduction

DBSCAN is the abbreviation for Density-Based Spatial Clustering of Applications with Noise. It is an unsupervised clustering algorithm. DBSCAN can find clusters of any size in huge amounts of data and can work with datasets containing a significant amount of noise. It is basically based on the criterion of a minimum number of points within a region.
What is DBSCAN Algorithm?

The DBSCAN algorithm can efficiently cluster densely grouped points into one cluster. It identifies local density in the data points among large datasets. DBSCAN can handle outliers very effectively. An advantage of DBSCAN over the K-means algorithm is that the number of centroids need not be known beforehand in the case of DBSCAN.

DBSCAN algorithm depends upon two parameters epsilon and minPoints.

Epsilon is defined as the radius of each data point around which the density is considered.

minPoints is the number of points required within the radius so that the data point becomes a
core point.

The circle can be extended to higher dimensions.

Working of DBSCAN Algorithm

In the DBSCAN algorithm, a circle with a radius epsilon is drawn around each data point and the data point is classified as a Core Point, Border Point, or Noise Point. A data point is classified as a core point if there are at least minPoints data points within its epsilon radius. If it has fewer than minPoints points but falls within the epsilon radius of a core point, it is known as a Border Point, and if neither condition holds, it is considered a Noise Point.

Let us understand the working through an example.

In the above figure, we can see that point A has no other points inside its epsilon (e) radius, hence it is a Noise Point. Point B has minPoints (=4) points within its epsilon radius, thus it is a Core Point. The remaining highlighted point has only 1 point (fewer than minPoints) within its radius, hence it is a Border Point.

Steps Involved in DBSCAN Algorithm.


• First, all the points within epsilon radius of each point are found, and the core points are identified as those with a number of neighbors greater than or equal to minPoints.
• Next, for each core point, if it is not already assigned to a particular cluster, a new cluster is created for it.
• All the densely connected points related to the core point are found and assigned to the same cluster. Two points are called densely connected if there is a neighbor point that has both of them within epsilon distance.
• Then all the points in the data are iterated over, and the points that do not belong to any cluster are marked as noise. A small sklearn-based sketch is given after this list.
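The sketch below is a minimal example using scikit-learn's DBSCAN; the tiny dataset and the values of eps and min_samples are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical data: two dense groups plus one far-away outlier
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [7, 8],
              [25, 80]])

# eps is the radius of the neighbourhood; min_samples plays the role of minPoints
db = DBSCAN(eps=3, min_samples=2)
labels = db.fit_predict(X)

print(labels)   # points labelled -1 are noise points; others get a cluster id
```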

Advantages of the DBSCAN Algorithm

• DBSCAN does not require the number of centroids to be known beforehand, as is the case with the K-Means algorithm.
• It can find clusters of any shape.
• It can also locate clusters that are not connected to any other group or cluster. It works well with noisy datasets.
• It is robust to outliers.

Disadvantages of the DBSCAN Algorithm

• It does not work well with datasets that have varying densities.
• It cannot be employed with multiprocessing as it cannot be partitioned.
• It cannot find the right clusters if the dataset is sparse.
• It is sensitive to the parameters epsilon and minPoints.
