
Classification

• Classification is a supervised machine learning method in which the model tries to predict the correct label for given input data.
• In classification, the model is fully trained on the training data and then evaluated on test data before being used to make predictions on new, unseen data.
• There are two types of learners in classification: eager learners and lazy learners.
• Eager learners are machine learning algorithms that first build a model from the training dataset before making any predictions on future data.
• They spend more time during the training process because they learn weights that generalize well beyond the training data, but they require less time to make predictions.
• Most machine learning algorithms are eager learners; some examples:
• Logistic Regression
• Support Vector Machine
• Decision Trees
• Artificial Neural Networks
Lazy learners
• Lazy learners, or instance-based learners, on the other hand, do not build a model from the training data right away; this is where the "lazy" comes from.
• They simply memorize the training data and, each time a prediction is needed, search for the nearest neighbors in the whole training data, which makes them very slow at prediction time.
• Examples of this kind include K-Nearest Neighbor (KNN) and case-based reasoning, both described below.

K-Nearest Neighbor (KNN)

• K-Nearest Neighbor is one of the simplest machine learning algorithms based on the supervised learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
• The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it makes no assumptions about the underlying data.

• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at classification time, performs an action on it.
• At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
• Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images, and based on the most similar features it will put the image in either the cat or the dog category.
Why do we need a K-NN Algorithm?
• Suppose there are two categories, Category A and Category B, and we have a new data point x1. To which of these categories does the data point belong?
• To solve this type of problem, we need a K-NN algorithm.
• With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?

• The working of K-NN can be explained with the algorithm below:
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new data point to each point in the training data.
• Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
• Step-6: Our model is ready.
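As a concrete sketch of these steps, here is a minimal from-scratch implementation in Python (the 2-D points, labels, query, and k = 3 below are invented purely for illustration):

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point
    distances = [(math.dist(query, p), label)
                 for p, label in zip(train_points, train_labels)]
    # Step 3: keep the k nearest neighbors
    nearest = sorted(distances)[:k]
    # Steps 4-5: count categories among the neighbors and pick the majority
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points in two categories, A and B
points = [(1, 1), (2, 1), (1, 2), (6, 5), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(points, labels, query=(2, 2), k=3))  # -> "A"
```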


Suppose we have a new data point and we need to put it in the required category.
• First, we choose the number of neighbors: k = 5.
• Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For points (x1, y1) and (x2, y2) it is calculated as:

d = √((x2 − x1)² + (y2 − y1)²)

• By calculating the Euclidean distances we get the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Consider the image below:
• Since 3 of the 5 nearest neighbors are from Category A, the new data point must belong to Category A.
How do we select the right value of K?

• There is no particular method for determining the correct value of K.
• The value of K that delivers the best accuracy on both the training and testing data is selected.
• It is recommended to always select an odd value of K.
• When K is even, a situation may arise in which the neighbors are split equally between the two groups. In the diagram below, the elements from both groups are equal within the inner "Red" circle (k = 4).

• In this condition, the model is unable to make a correct classification; it will randomly assign one of the two classes to the new, unknown data point.
• Choosing an odd value of K is preferred because such a state of equality between the two classes can never occur: one of the two groups will always hold the majority.
• The impact of selecting a smaller or larger value of K on the model:
• Larger K: underfitting occurs when the value of K is too large. In this case, the model is unable to learn the training data correctly.
• Smaller K: overfitting occurs when the value of K is too small. The model captures all of the training data, including noise, and performs poorly on test data.
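In practice, a common way to pick K is to try a range of odd values and keep the one with the best validation accuracy. A small sketch using scikit-learn, assuming it is available (the synthetic dataset is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Try odd values of K only, to avoid ties between the two classes
scores = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```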
Advantages of the KNN algorithm:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective when the training data is large.
Disadvantages of the KNN algorithm:
• The value of K always needs to be determined, which can sometimes be complex.
• The computation cost is high, because the distance from a query point to every training sample must be calculated.
Case-based reasoning
• Case-based reasoning is any problem-solving approach that uses past solutions to solve similar problems.
• It assumes that knowledge can be acquired through past experience; it can warn you of avenues that lead to failure, or help you recall successful past solutions that could be adapted to the problem at hand.
• For example, Google Maps uses case-based reasoning to tell you how long your journey will take, by examining the patterns of past users to see how long it took them to get from point A to point B. Even if your route runs between two slightly different points, it infers how long your journey will take.
Model Evaluation and Selection
 Evaluation metrics: How can we measure accuracy? What other metrics should we consider?
 Use a validation (test) set of class-labeled tuples, rather than the training set, when assessing accuracy
 Methods for estimating a classifier's accuracy:
   Holdout method, random subsampling
   Cross-validation
   Bootstrap
 Comparing classifiers:
   Confidence intervals
   Cost-benefit analysis and ROC curves
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:

Actual class \ Predicted class | C1                   | ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             | 6954               | 46                | 7000
buy_computer = no              | 412                | 2588              | 3000
Total                          | 7366               | 2634              | 10000

 Given m classes, an entry CM(i,j) in a confusion matrix indicates the number of tuples of class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
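Tabulating a confusion matrix from actual and predicted labels takes only a few lines; a minimal sketch (the toy label lists are invented for illustration):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Entry [i][j] = number of tuples of class i labeled as class j."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(ci, cj)] for cj in classes] for ci in classes]

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
for row in confusion_matrix(actual, predicted, ["yes", "no"]):
    print(row)
# [2, 1]  row "yes": 2 true positives, 1 false negative
# [1, 2]  row "no":  1 false positive, 2 true negatives
```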
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P | C  | ¬C |
C     | TP | FN | P
¬C    | FP | TN | N
      | P' | N' | All

 Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All
 Error rate: 1 − accuracy, or
  Error rate = (FP + FN)/All
 Class imbalance problem:
   One class may be rare, e.g., fraud or HIV-positive
   Significant majority of the negative class and minority of the positive class
   Sensitivity: true positive recognition rate
    Sensitivity = TP/P
   Specificity: true negative recognition rate
    Specificity = TN/N
Classifier Evaluation Metrics: Precision and Recall, and F-measures

 Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
  Precision = TP/(TP + FP)
 Recall: completeness – what % of positive tuples did the classifier label as positive?
  Recall = TP/(TP + FN)
 A perfect score is 1.0
 There is an inverse relationship between precision and recall
 F measure (F1 or F-score): the harmonic mean of precision and recall:
  F1 = 2 × precision × recall / (precision + recall)
 Fβ: a weighted measure of precision and recall:
  Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
   assigns β times as much weight to recall as to precision
Classifier Evaluation Metrics: Example

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.50 (accuracy)

 Precision = 90/230 = 39.13%   Recall = 90/300 = 30.00%
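These figures can be checked directly from the matrix entries; a quick verification in Python using the numbers above:

```python
TP, FN, FP, TN = 90, 210, 140, 9560
P, N = TP + FN, FP + TN                            # 300 positive, 9700 negative tuples

print(f"sensitivity = {TP / P:.2%}")               # 30.00%
print(f"specificity = {TN / N:.2%}")               # 98.56%
print(f"accuracy    = {(TP + TN) / (P + N):.2%}")  # 96.50%
print(f"precision   = {TP / (TP + FP):.2%}")       # 39.13%
print(f"recall      = {TP / P:.2%}")               # 30.00%
```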
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods

 Holdout method
   The given data is randomly partitioned into two independent sets
     Training set (e.g., 2/3) for model construction
     Test set (e.g., 1/3) for accuracy estimation
   Random subsampling: a variation of holdout
     Repeat holdout k times; accuracy = avg. of the accuracies obtained
 Cross-validation (k-fold, where k = 10 is most popular)
   Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
   At the i-th iteration, use Di as the test set and the others as the training set
   Leave-one-out: k folds where k = # of tuples, for small data sets
   Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as in the initial data
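A short sketch of both methods using scikit-learn, assuming it is available (the synthetic dataset and the decision tree learner are only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: 2/3 of the data for training, 1/3 for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print("holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# Stratified 10-fold cross-validation: class distribution preserved per fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=skf).mean())
```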
Evaluating Classifier Accuracy: Bootstrap

 Bootstrap
   Works well for small data sets
   Samples the given training tuples uniformly with replacement
     i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
 There are several bootstrap methods; a common one is the .632 bootstrap
   A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data ends up in the bootstrap sample, and the remaining 36.8% forms the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
   Repeat the sampling procedure k times; the overall accuracy of the model is
    Acc(M) = Σ_{i=1..k} [0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set]
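A small numpy sketch of one bootstrap round, confirming that roughly 63.2% of the original tuples land in the training sample:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                           # number of tuples in the data set
sample = rng.integers(0, d, size=d)  # sample d tuple indices with replacement
in_training = np.unique(sample)      # tuples that made it into the bootstrap
print(len(in_training) / d)          # ~0.632; the remaining ~36.8% form the test set
```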
Estimating Confidence Intervals: Classifier Models M1 vs. M2

 Suppose we have two classifiers, M1 and M2. Which one is better?
 Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)
 These mean error rates are just estimates of the error on the true population of future data cases
 What if the difference between the two error rates is merely attributable to chance?
   Use a test of statistical significance
   Obtain confidence limits for our error estimates
Estimating Confidence Intervals: Null Hypothesis

 Perform 10-fold cross-validation
 Assume the samples follow a t-distribution with k−1 degrees of freedom (here, k = 10)
 Use the t-test (Student's t-test)
 Null hypothesis: M1 & M2 are the same
 If we can reject the null hypothesis, then
   we conclude that the difference between M1 & M2 is statistically significant
   choose the model with the lower error rate
Estimating Confidence Intervals: t-test

 If only one test set is available: pairwise comparison
   For the i-th round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)_i and err(M2)_i
   Average over the 10 rounds to get the mean error rates err(M1) and err(M2)
   The t-test computes the t-statistic with k−1 degrees of freedom:
    t = (err(M1) − err(M2)) / √(var(M1 − M2)/k)
    where
    var(M1 − M2) = (1/k) Σ_{i=1..k} [err(M1)_i − err(M2)_i − (err(M1) − err(M2))]²
 If two test sets are available: use the non-paired t-test
    t = (err(M1) − err(M2)) / √(var(M1)/k1 + var(M2)/k2)
    where k1 & k2 are the numbers of cross-validation samples used for M1 & M2, respectively
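A minimal sketch of the paired t-statistic computed directly from the formula above (the ten per-fold error rates are invented for illustration; scipy.stats.ttest_rel gives an equivalent test, though it estimates the variance with k−1 instead of k):

```python
import math

# Hypothetical per-fold error rates obtained with the same 10-fold partitioning
err_m1 = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.14, 0.13, 0.15]
err_m2 = [0.10, 0.14, 0.12, 0.11, 0.12, 0.13, 0.11, 0.12, 0.11, 0.13]

k = len(err_m1)
diffs = [a - b for a, b in zip(err_m1, err_m2)]
mean_diff = sum(diffs) / k
var = sum((d - mean_diff) ** 2 for d in diffs) / k   # variance of the differences
t = mean_diff / math.sqrt(var / k)

print(t)  # compare |t| against the t-table value for k-1 = 9 degrees of freedom
```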
Estimating Confidence Intervals: Table for t-distribution

 Symmetric
 Significance level: e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of the population
 Confidence limit: z = sig/2
Estimating Confidence Intervals: Statistical Significance

 Are M1 & M2 significantly different?
   Compute t. Select a significance level (e.g., sig = 5%)
   Consult the table for the t-distribution: find the t-value corresponding to k−1 degrees of freedom (here, 9)
   The t-distribution is symmetric; typically the upper percentage points of the distribution are shown, so look up the value for confidence limit z = sig/2 (here, 0.025)
   If t > z or t < −z, then the t-value lies in the rejection region:
     Reject the null hypothesis that the mean error rates of M1 & M2 are the same
     Conclude: there is a statistically significant difference between M1 & M2
   Otherwise, conclude that any difference is due to chance
Model Selection: ROC Curves

 ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
 Originated from signal detection theory
 Shows the trade-off between the true positive rate and the false positive rate
 The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
 The area under the ROC curve is a measure of the accuracy of the model; a model with perfect accuracy has an area of 1.0
 Rank the test tuples in decreasing order: the one most likely to belong to the positive class appears at the top of the list
 The plot also shows a diagonal line: the closer the curve is to the diagonal (i.e., the closer the area is to 0.5), the less accurate the model
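A short sketch computing the points of an ROC curve and the area under it with scikit-learn, assuming it is available (the synthetic dataset and logistic regression scorer are only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Probability scores used to rank test tuples from most to least likely positive
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)  # the (FPR, TPR) points of the curve
print("AUC:", roc_auc_score(y_te, scores))      # 1.0 = perfect, 0.5 = diagonal line
```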
Issues Affecting Model Selection
 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Chapter 8. Classification: Basic Concepts

 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy: Ensemble Methods
 Summary
Ensemble Methods: Increasing the Accuracy

 Ensemble methods
   Use a combination of models to increase accuracy
   Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
 Popular ensemble methods
   Bagging: averaging the prediction over a collection of classifiers
   Boosting: weighted vote with a collection of classifiers
   Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation

 Analogy: diagnosis based on multiple doctors' majority vote
 Training
   Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
   A classifier model Mi is learned for each training set Di
 Classification: to classify an unknown sample X
   Each classifier Mi returns its class prediction
   The bagged classifier M* counts the votes and assigns the class with the most votes to X
 Prediction: can be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple
 Accuracy
   Often significantly better than a single classifier derived from D
   For noisy data: not considerably worse, more robust
   Proven improved accuracy in prediction
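A small sketch of bagging decision trees with scikit-learn, assuming it is available (the noisy synthetic dataset is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, where bagging's robustness tends to show
X, y = make_classification(n_samples=500, flip_y=0.1, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
bag = BaggingClassifier(tree, n_estimators=50, random_state=0)  # 50 bootstrap replicates

print("single tree :", cross_val_score(tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bag, X, y, cv=5).mean())
```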
Boosting

 Analogy: consult several doctors and combine their weighted diagnoses, where each weight is based on previous diagnostic accuracy
 How does boosting work?
   Weights are assigned to each training tuple
   A series of k classifiers is iteratively learned
   After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
   The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
 The boosting algorithm can be extended for numeric prediction
 Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)

 Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
 Initially, all tuple weights are set to the same value, 1/d
 Generate k classifiers in k rounds. At round i:
   Tuples from D are sampled (with replacement) to form a training set Di of the same size
   Each tuple's chance of being selected is based on its weight
   A classification model Mi is derived from Di
   Its error rate is calculated using Di as a test set
   If a tuple is misclassified, its weight is increased; otherwise it is decreased
 Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). The error rate of classifier Mi is the sum of the weights of the misclassified tuples:
  error(Mi) = Σ_j w_j × err(Xj)
 The weight of classifier Mi's vote is
  log((1 − error(Mi)) / error(Mi))
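scikit-learn's AdaBoostClassifier implements a variant of this algorithm; a small sketch, assuming the library is available (the dataset is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Boost shallow trees ("decision stumps"); each round re-weights the
# misclassified tuples so the next stump concentrates on them
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
ada = AdaBoostClassifier(stump, n_estimators=50, random_state=0)

print("stump alone:", cross_val_score(stump, X, y, cv=5).mean())
print("AdaBoost   :", cross_val_score(ada, X, y, cv=5).mean())
```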
Random Forest (Breiman 2001)

 Random forest:
   Each classifier in the ensemble is a decision tree classifier, generated from a random selection of attributes at each node to determine the split
   During classification, each tree votes and the most popular class is returned
 Two methods to construct a random forest:
   Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at that node. The CART methodology is used to grow the trees to maximum size
   Forest-RC (random linear combinations): creates new attributes (or features) that are linear combinations of the existing attributes (reduces the correlation between individual classifiers)
 Comparable in accuracy to Adaboost, but more robust to errors and outliers
 Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting
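A small sketch in the spirit of Forest-RI using scikit-learn's RandomForestClassifier, assuming the library is available (here max_features plays the role of F, the number of attributes considered at each split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each tree sees a bootstrap sample; each split considers sqrt(20) random attributes
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("random forest:", cross_val_score(rf, X, y, cv=5).mean())
```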
Classification of Class-Imbalanced Data Sets

 Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spills, faults, etc.
 Traditional methods assume a balanced class distribution and equal error costs, so they are not suitable for class-imbalanced data
 Typical methods for imbalanced data in two-class classification:
   Oversampling: re-sample data from the positive class
   Under-sampling: randomly eliminate tuples from the negative class
   Threshold-moving: move the decision threshold t so that rare-class tuples are easier to classify, leaving less chance of costly false negative errors
   Ensemble techniques: combine multiple classifiers, as introduced above
 The class-imbalance problem remains difficult for multiclass tasks
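A small sketch of threshold-moving with scikit-learn, assuming it is available (the imbalanced synthetic dataset and the thresholds 0.5 and 0.2 are only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# ~5% positive class: a class-imbalanced problem
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for t in (0.5, 0.2):               # lowering t makes the rare class easier to predict
    pred = (proba >= t).astype(int)
    print(f"t={t}: recall on the rare class = {recall_score(y_te, pred):.2f}")
```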
Chapter 8. Classification: Basic Concepts

 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy: Ensemble Methods
 Summary
