
2. Classification and Regression Models - K-Nearest Neighbors
01. How “Classification” works
In the Amazon Fine Food Reviews dataset, the class labels are 0 and 1. Given any query point, its predicted label is either 0 or 1. Because there are only 2 classes, such a problem is called a binary classification problem.
Similarly, in the MNIST dataset, the class labels are the digits 0 to 9. Given any query point, its predicted label is one of the values 0 to 9. Because there are 10 classes, such a problem is called a 10-class classification problem, or more generally a multi-class classification problem.


In both of the examples above, the output class label comes from a finite set of values. Hence we call the problem of predicting such an output a classification problem.

The algorithm is trained on the training data and learns a function ‘f’. In the validation stage, this learned function is used to predict the output for the test data.
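The train-then-predict workflow above can be sketched with a tiny hand-rolled 1-nearest-neighbor classifier (the data points and the `predict_1nn` helper below are made up for illustration; for K-NN, “training” is simply storing the labelled points):

```python
import math

def predict_1nn(train_X, train_y, query):
    """Return the label of the single nearest training point."""
    dists = [math.dist(x, query) for x in train_X]
    nearest = dists.index(min(dists))
    return train_y[nearest]

# "Training": store the labelled points (the function f is implicit in them).
train_X = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.1, 8.7)]
train_y = [0, 0, 1, 1]

# Prediction: a query near the first cluster gets label 0, near the second gets 1.
print(predict_1nn(train_X, train_y, (1.1, 0.9)))  # → 0
print(predict_1nn(train_X, train_y, (8.5, 9.0)))  # → 1
```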

02. Data Matrix Notation

Here the rows represent the data points and the columns represent the dimensions/features. Let us assume we have ‘n’ data points, and each data point has components in ‘d’ dimensions. Then the dataset is represented mathematically in the form of a set as

Dn = {(xi, yi) | xi ∈ Rd, yi ∈ {0, 1}, i = 1, 2, …, n}
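The n × d data-matrix layout can be made concrete with a small NumPy sketch (the numbers below are made up):

```python
import numpy as np

# n = 4 data points, each with d = 3 features: row i is the data point xi ∈ R^d.
X = np.array([[0.5, 1.2, 3.3],
              [0.1, 0.4, 2.2],
              [1.5, 2.2, 0.3],
              [0.9, 1.1, 1.0]])
# yi ∈ {0, 1}: one class label per row of X.
y = np.array([0, 1, 1, 0])

print(X.shape)  # (4, 3) → n rows (points), d columns (features)
print(y.shape)  # (4,)   → one label per data point
```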

Here we convert the class labels from the ‘Negative’/‘Positive’ format to the 0/1 format, because the linear-algebra machinery works on numbers, not on the strings ‘Positive’ and ‘Negative’.
Note: As every data point ‘xi’ has ‘d’ dimensions, we use the notation xi ∈ Rd. As ‘yi’ takes only two values (i.e., 0 and 1), we use the notation yi ∈ {0, 1}. As there are only 2 classes in the dataset, we call this problem a binary classification problem.
Note: Consider the MNIST dataset: there are 60,000 data points, each data point is a 784-dimensional vector, and the class labels are the digits 0 to 9. The MNIST dataset is then represented mathematically in the form of a set as

Dn = {(xi, yi) | xi ∈ R784, yi ∈ {0, 1, …, 9}, i = 1, 2, …, 60000}

As there are 10 classes in the MNIST dataset, we call this problem a 10-class classification (or) multi-class classification problem.


03. Classification vs Regression
Classification - Definition
Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category is already known.
Classification deals with predicting a qualitative (or categorical) response.

17. Amazon Fine Food Reviews System

29.17 Time Based Splitting


Time-based splitting can be done only if the dataset has a timestamp column.

💡 Note: Whenever a ‘Time’ column is available in the dataset, and the behavior of the data changes over time, time-based splitting is preferred over random splitting.
If the ‘Time’ column is not present in the dataset, then we have to fall back on random splitting, as we have no other option.

Procedure to perform Time Based Splitting

1. Sort the dataset ‘Dn’ in the ascending order of the time column values.

2. Split the sorted dataset: the first 60% of the points go into ‘DTrain’, the next 20% into ‘Dcv’, and the last 20% into ‘DTest’.

3. As in the case of random splitting, we use

DTrain → for finding the nearest neighbors
Dcv → for choosing the optimal ‘K’ value
DTest → for measuring the accuracy
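The three steps above can be sketched as follows (the record structure and field name "time" are made up for illustration; any sortable timestamp works):

```python
def time_based_split(records, time_key=lambda r: r["time"]):
    """Sort records by timestamp, then split 60% / 20% / 20% in time order."""
    ordered = sorted(records, key=time_key)          # step 1: sort by time
    n = len(ordered)
    n_train = int(0.6 * n)
    n_cv = int(0.2 * n)
    d_train = ordered[:n_train]                      # oldest 60% → DTrain
    d_cv = ordered[n_train:n_train + n_cv]           # next 20%   → Dcv
    d_test = ordered[n_train + n_cv:]                # newest 20% → DTest
    return d_train, d_cv, d_test

# Toy dataset of 10 timestamped records.
records = [{"time": t, "label": t % 2} for t in range(10)]
d_train, d_cv, d_test = time_based_split(records)
print(len(d_train), len(d_cv), len(d_test))  # → 6 2 2
```

Because the split follows time order, the model is always evaluated on data that is "newer" than anything it was trained on, which mirrors how it would be used in production.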


29.19 Weighted K-NN


Weighted K-NN assigns a weight (priority) to each of the ‘K’ nearest neighbors, typically based on its distance from the query point: the closer a neighbor is, the larger its weight, and hence the larger its influence on the predicted label.
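One common weighting scheme, sketched below, is inverse-distance voting: each of the k nearest neighbors votes with weight 1/distance (the data and the `weighted_knn_predict` helper are made up for illustration):

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, query, k=3):
    """Classify by summing 1/distance votes over the k nearest neighbors."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    votes = defaultdict(float)
    for dist, label in dists[:k]:
        votes[label] += 1.0 / (dist + 1e-9)  # epsilon avoids division by zero
    return max(votes, key=votes.get)         # label with the largest total weight

train_X = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
train_y = [0, 0, 1, 1, 1]
print(weighted_knn_predict(train_X, train_y, (0.1, 0.1), k=3))  # → 0
```

With plain (unweighted) K-NN a distant neighbor counts exactly as much as a very close one; inverse-distance weighting lets the nearby points dominate the vote.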

18. K-NN for Regression
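For regression, the standard adaptation of K-NN is to predict the mean of the k nearest neighbors’ real-valued targets instead of taking a majority vote; a minimal sketch on made-up 1-D data:

```python
import math

def knn_regress(train_X, train_y, query, k=3):
    """Predict the average target value of the k nearest training points."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    neighbours = [y for _, y in dists[:k]]
    return sum(neighbours) / k

# Toy 1-D regression data: y happens to equal x here.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 10.0]
print(knn_regress(train_X, train_y, (2.0,), k=3))  # mean of 1, 2, 3 → 2.0
```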

