
Classification

Unit-5
Topics Covered – Unit 5
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
 Statistical Based
• Bayesian Classification
 Distance Based
• KNN
 Decision Tree Based
• ID3
 Neural Network Based
 Rule Based
What is Classification

 Classification is the task of assigning objects to one


of the several predefined categories.

 Given a database D={t1,t2,…,tn} and a set of classes


C={C1,…,Cm}, the Classification Problem is to
define a mapping f: D → C where each ti is assigned
to one class.
Classification

Attribute set
Classification Class label
(x) Model (y)

Classification as the task of mapping an input attribute set x into its class label y

 Classification model is useful for:


 Descriptive Modeling
 Predictive Modeling
Classification Examples
 Teachers classify students’ grades as A, B, C, D, or F.

 Identify mushrooms as poisonous or edible.

 Predict when a river will flood.

 Identify individuals with credit risks.

 Speech recognition

 Pattern recognition
Classification Ex: Grading

 If x >= 90 then grade = A.
 If 80 <= x < 90 then grade = B.
 If 70 <= x < 80 then grade = C.
 If 60 <= x < 70 then grade = D.
 If x < 60 then grade = F.

[Decision tree figure: successive splits on x at 90, 80, 70, and 60, with leaves A, B, C, D, and F]

Classify the following marks: 78, 56, 99
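These rules translate directly into code. A minimal sketch (assuming, as above, that marks below 60 map to F):

def grade(x):
    # Walk the thresholds from highest to lowest, mirroring the decision tree.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print([grade(x) for x in (78, 56, 99)])   # ['C', 'F', 'A']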
Topics Covered
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
 Statistical Based
• Bayesian Classification
 Distance Based
• KNN
 Decision Tree Based
• ID3
 Neural Network Based
 Rule Based
General approach to Classification

 Two step process:


 Learning step
• Where a classification algorithm builds the classifier
by analyzing or “learning from” a training set made
up of database tuples and their associated class
labels.
 Classification step
• The model is used to predict class labels for given
data.
 Classes must be predefined
 Most common techniques use DTs, NNs, or are
based on distances or statistical methods.
Model Construction

Training Data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

A classification algorithm analyzes the training data and produces the classifier (model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Use the Model in Prediction

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
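A minimal sketch (the helper name is illustrative, not from the slides) that encodes the learned rule and applies it to the unseen record:

def predict_tenured(rank, years):
    # Classifier learned from the training data: professors, or anyone with
    # more than 6 years of service, are predicted to be tenured.
    return "yes" if rank == "Professor" or years > 6 else "no"

print(predict_tenured("Professor", 4))   # Jeff -> 'yes'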
Defining Classes
Topics Covered
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
 Statistical Based
• Bayesian Classification
 Distance Based
• KNN
 Decision Tree Based
• ID3
 Neural Network Based
 Rule Based
Issues in Classification

 Missing Data
 Ignore missing value
 Replace with assumed value

 Measuring Performance
 Classification accuracy on test data
 Confusion matrix
• provides the information needed to determine how
well a classification model performs
Confusion Matrix

                          Predicted Class
                          Class = 1   Class = 0
Actual Class   Class = 1  f11         f10
               Class = 0  f01         f00

• Each entry fij in this table denotes the number of records from
class i predicted to be of class j.
• For instance, f01 is the number of records from class 0 incorrectly
predicted as class 1.
• The total number of correct predictions: (f11+ f00)
• The total number of incorrect predictions: (f01 + f10)
Classification Performance

 Definition of the Terms:


 Positive (P) : Observation is positive (for example: is an apple).
 Negative (N) : Observation is not positive (for example: is not an apple).
 True Positive (TP) : Observation is positive, and is predicted to be
positive.
 False Negative (FN) : Observation is positive, but is predicted negative.
 True Negative (TN) : Observation is negative, and is predicted to be
negative.
 False Positive (FP) : Observation is negative, but is predicted positive.
Class Statistics Measures

 Accuracy: Overall, how often is the classifier correct?


(TP+TN)/(TP+TN+FP+FN)
 Error Rate: Overall, how often is it wrong?
(FP+FN)/(TP+TN+FP+FN)
equivalent to 1 minus Accuracy
 Specificity: measures how well the negative class is identified (true negative rate)
TN/(FP+TN)
 Sensitivity/Recall: the ratio of the number of correctly classified positive examples
to the total number of positive examples.
TP/(TP+FN)
High Recall indicates the class is correctly recognized.
Class Statistics Measures
 Precision: is a measure of how accurate a model’s positive
predictions are. TP/(TP+FP)
 High Precision indicates an example labeled as positive is indeed positive

 F-measure: The F measure (F1 score or F score) is used to evaluate


the overall performance of a classification model and is defined as the
weighted harmonic mean of the precision and recall of the test.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

 High recall, low precision: This means that most of the positive examples are
correctly recognized (low FN) but there are a lot of false positives.
 Low recall, high precision: This shows that we miss a lot of positive examples
(high FN) but those we predict as positive are indeed positive (low FP)
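A short sketch computing all of these measures from the four counts; the counts passed in at the bottom match the worked example on the next slides:

def classification_metrics(tp, fp, fn, tn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

print(classification_metrics(tp=100, fp=10, fn=5, tn=50))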
Example to interpret Confusion Matrix

Assume a test set with TP = 100, FN = 5, FP = 10, TN = 50.

Classification Rate/Accuracy: (TP + TN) / (TP + TN + FP + FN)
= (100 + 50) / (100 + 5 + 10 + 50) ≈ 0.91

Recall = TP / (TP + FN) = 100 / (100 + 5) ≈ 0.95

Precision = TP / (TP + FP) = 100 / (100 + 10) ≈ 0.91

F-measure = (2 × Recall × Precision) / (Recall + Precision)
= (2 × 0.95 × 0.91) / (0.95 + 0.91) ≈ 0.93


Example
 We have a total of 20 cats and dogs and our
model predicts whether it is a cat or not.
 Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’,
‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘dog’,
‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]

 Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’,


‘dog’, ‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’, ‘dog’, ‘dog’,
‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]
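A minimal sketch counting the confusion-matrix entries for these two lists, treating 'cat' as the positive class:

actual    = ['dog','cat','dog','cat','dog','dog','cat','dog','cat','dog',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']
predicted = ['dog','dog','dog','cat','dog','dog','cat','cat','cat','cat',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']

tp = sum(a == 'cat' and p == 'cat' for a, p in zip(actual, predicted))
fn = sum(a == 'cat' and p == 'dog' for a, p in zip(actual, predicted))
fp = sum(a == 'dog' and p == 'cat' for a, p in zip(actual, predicted))
tn = sum(a == 'dog' and p == 'dog' for a, p in zip(actual, predicted))
print(tp, fn, fp, tn)   # with 'cat' as positive: TP=6, FN=1, FP=2, TN=11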
Example
Accuracy
Precision
 Ex 1:- In spam detection, we need to
focus on precision.

 Suppose a mail is not spam but the
model predicts it as spam: this is a FP
(False Positive). We always try to
reduce FP.

 Ex 2:- Precision is important in


music or video recommendation
systems, e-commerce websites, etc.
Wrong results could lead to
customer churn and be harmful to
the business.
Recall

 Ex 1:- Suppose we predict whether a
person has cancer or not. The person
is suffering from cancer, but the
model predicts that they are not
suffering from cancer: this is a FN
(False Negative).

 Ex 2:- Recall is important in


medical cases where it doesn’t
matter whether we raise a
false alarm but the actual
positive cases should not go
undetected!
Confusion Matrix for Multi-class Classification

 For a 5-class problem with classes A, B, C, D, E

For more detail:


https://www.youtube.com/watch?v=FAr2GmWNbT0
Confusion Matrix for Multi-class Classification
Confusion Matrix for Multi-class Classification
Confusion Matrix for Multi-class Classification

Example
When to use Accuracy / Precision /
Recall / F1-Score?

 Accuracy is used when the True Positives and True Negatives are
more important. Accuracy is a better metric for Balanced Data.

 Whenever False Positive is much more important use Precision.

 Whenever False Negative is much more important use Recall.

 F1-Score is used when the False Negatives and False Positives


are important. F1-Score is a better metric for Imbalanced Data.
Topics Covered
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
 Statistical Based
• Bayesian Classification
 Distance Based
• KNN
 Decision Tree Based
• ID3
 Neural Network Based
 Rule Based
Statistical Based Algorithms -
Bayesian Classification
 Bayesian classifiers are statistical classifiers. They can
predict class membership probabilities such as the
probability that a given tuple belongs to a particular
class.
 Based on Bayes rule of conditional probability.
 Assumes that the contributions of all attributes are
independent and that each attribute contributes equally (hence
the name "naive").
 Classification is made by combining the impact that the
different attributes have on the prediction to be made.
Bayes Theorem
 Bayes’ Theorem is a way of finding a probability when we know
certain other probabilities.
 The formula is:

P(c|x) = P(x|c) · P(c) / P(x)
•P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
•P(c) is the prior probability of class.
•P(x|c) is the likelihood which is the probability of the predictor given class.
•P(x) is the prior probability of the predictor.
Bayes Theorem
 Data tuple (A): 35-year-old customer with an income of $40,000
 Hypothesis (B): customer will buy a computer
 Likelihood: P(A|B)
 the probability of observing the data (a 35-year-old with an income of
$40,000) given that the hypothesis (customer will buy a computer) is true.
 Prior Probability of the data: P(A)
 the prior probability of a customer being a 35-year-old with an income
of $40,000.
 Posterior Probability: P(B|A)
 the probability of the customer buying a computer given their age and income.
 Prior Probability: P(B)
 the probability of the customer buying a computer (regardless of age and
income).
Naïve Bayes Classifier

 Naive Bayes is a kind of classifier which uses the


Bayes Theorem.
 It predicts membership probabilities for each class
such as the probability that given record or data point
belongs to a particular class.
 The class with the highest probability is considered as
the most likely class.
 This is also known as Maximum A Posteriori (MAP).
 Naive Bayes classifier assumes that all the features
are unrelated to each other.
Naïve Bayes Classifier
 In real datasets, a record is a vector of attribute values X = (x1, x2, …, xn), and we want

P(y | x1, …, xn) = P(x1, …, xn | y) · P(y) / P(x1, …, xn)

 By substituting for X, expanding using the chain rule, and applying the naive
independence assumption:

P(y | x1, …, xn) = P(y) · P(x1 | y) · P(x2 | y) · … · P(xn | y) / P(x1, …, xn)

 For all entries in the dataset, the denominator does not change; it remains static.
Therefore, the denominator can be removed and a proportionality can be introduced:

P(y | x1, …, xn) ∝ P(y) · ∏ P(xi | y)

 For multivariate classification, the predicted class is the one with the highest
posterior probability:

y = argmax over y of  P(y) · ∏ P(xi | y)

Weather dataset
The posterior probability can be calculated by first constructing a frequency table for
each attribute against the target. Then, transform the frequency tables into likelihood
tables and finally use the naive Bayesian equation to calculate the posterior probability
for each class. The class with the highest posterior probability is the outcome of the
prediction.
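A minimal sketch of this frequency-table / likelihood-table procedure. The tiny weather-style dataset below is illustrative only, not the dataset from the slide:

from collections import Counter, defaultdict

data = [  # (Outlook, Windy) -> Play
    ({"Outlook": "Sunny", "Windy": "No"}, "No"),
    ({"Outlook": "Sunny", "Windy": "Yes"}, "No"),
    ({"Outlook": "Overcast", "Windy": "No"}, "Yes"),
    ({"Outlook": "Rainy", "Windy": "No"}, "Yes"),
    ({"Outlook": "Rainy", "Windy": "Yes"}, "No"),
    ({"Outlook": "Overcast", "Windy": "Yes"}, "Yes"),
]

# Frequency tables: count each (attribute, value) per class, and each class.
class_counts = Counter(label for _, label in data)
freq = defaultdict(Counter)          # freq[(attr, class)][value] = count
for record, label in data:
    for attr, value in record.items():
        freq[(attr, label)][value] += 1

def posterior(record):
    """P(class | record), up to the constant denominator P(record)."""
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)                      # prior P(class)
        for attr, value in record.items():               # likelihoods P(value | class)
            score *= freq[(attr, label)][value] / n_label
        scores[label] = score
    return scores

print(posterior({"Outlook": "Sunny", "Windy": "Yes"}))   # pick the class with the highest score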
Advantages of Naïve Bayes

 It is easy to use.
 Unlike other classification approaches, only one
scan of the training data is required.
 The naive Bayes approach can easily handle missing
values by simply omitting that probability when
calculating the likelihoods of membership in each
class.
 In cases where there are simple relationships, the
technique often does yield good results.
Disadvantages of Naïve Bayes

 Although the naive Bayes approach is straightforward


to use, it does not always yield satisfactory results.
 First, the attributes usually are not independent. We
could use a subset of the attributes by ignoring any
that are dependent on others.
 The technique does not handle continuous data.
 Dividing the continuous values into ranges could be
used to solve this problem, but the division of the
domain into ranges is not an easy task, and how this is
done can certainly impact the results.
Topics Covered
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
 Statistical Based
• Bayesian Classification
 Distance Based
• KNN
 Decision Tree Based
• ID3
 Neural Network Based
 Rule Based
Distance Based Algorithms

 Place items in class to which they are “closest”

 Must determine distance between an item and a class.

 Classes represented by
 Centroid: Central value.
 Medoid: Representative point.

 Algorithm: KNN
K Nearest Neighbor (KNN)
 KNN captures the idea of similarity
(sometimes called distance, proximity, or
closeness) with some mathematics: calculating
the distance between points on a graph.
 K in KNN refers to the number of nearest
neighbors.
 Properties of KNN −
 Lazy learning algorithm − it does not have a
specialized training phase; it keeps all of the
training data and uses it at classification time.
 Non-parametric learning algorithm − it
doesn’t assume anything about the underlying
data.
K Nearest Neighbor (KNN)

 Training set includes classes.


 For a new item, its distance to each item in the training set
must be determined.
 Only the K closest entries in the training set are considered
further.
 The new item is then placed in the class that contains the
most items from this set of K closest items.
KNN Example
KNN Algorithm
 Step 1 − As with any algorithm, we need a dataset. So during the
first step of KNN, we must load the training data as well as the test data.
 Step 2 − Next, we need to choose the value of K, i.e. the number of nearest
data points to consider. K can be any integer.
 Step 3 − For each point in the test data do the following −
 3.1 − Calculate the distance between the test point and each row of the training
data using a distance measure, namely Euclidean, Manhattan, or Hamming
distance. The most commonly used distance measure is Euclidean.
 3.2 − Sort the training rows in ascending order of distance.
 3.3 − Choose the top K rows from the sorted array.
 3.4 − Assign a class to the test point based on the most frequent class
of these rows.
 Step 4 − End
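A minimal sketch of these steps on a small made-up 2-D dataset, using Euclidean distance and K = 3:

from collections import Counter
import math

# Step 1: load training data as (point, class) pairs; the points below are illustrative only.
training = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((3.0, 3.2), 'B'),
            ((3.1, 2.9), 'B'), ((0.9, 1.1), 'A')]

def knn_classify(test_point, training, k=3):
    # Step 3.1: distance from the test point to every training row (Euclidean).
    distances = [(math.dist(test_point, x), label) for x, label in training]
    # Steps 3.2 and 3.3: sort by distance and keep the K closest rows.
    nearest = sorted(distances)[:k]
    # Step 3.4: assign the most frequent class among those K rows.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((1.1, 1.0), training))   # expected: 'A'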
KNN Algorithm
KNN Algorithm
Methods of calculating the distance between points
 Euclidean Distance: calculated as the square root of the sum of the squared
differences between a new point (x) and an existing point (y):
d(x, y) = sqrt( Σ (xi − yi)² )
 Manhattan Distance: the distance between real vectors given by the sum of their
absolute differences:
d(x, y) = Σ |xi − yi|
 Hamming Distance: used for categorical variables.
If the value (x) and the value (y) are the same for an attribute, the distance D for
that attribute is 0; otherwise D = 1.
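These three measures as small Python functions (a sketch; x and y are equal-length sequences of attribute values):

def euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def hamming(x, y):
    # Categorical: count the attributes where the two values differ (0 per match, 1 per mismatch).
    return sum(xi != yi for xi, yi in zip(x, y))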
Example of KNN Algorithm-1
Example of KNN Algorithm-1
Example of KNN Algorithm-1
Example of KNN Algorithm-1

The average of these data points is the final prediction for the
new point.

Here, we have weight of ID11 = (59+72+60)/3 = 63.66 kg.


KNN

 Choosing the value of k:


 If k is too small, sensitive to noise points
 If k is too large, neighborhood may include
points from other classes

Topics Covered
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
 Statistical Based
• Bayesian Classification
 Distance Based
• KNN
 Decision Tree Based
• ID3
 Neural Network Based
 Rule Based
Decision Tree based Algorithms

 In Decision tree approach, a tree is constructed to


model the classification process.
 Once the tree is built, it is applied to each tuple in the
database and results in a classification for that tuple.
 There are two basic steps in the technique:
 building the tree
 and applying the tree to the database.

Most research has focused on how to build effective trees


as the application process is straightforward.
Decision Tree

Training Data:

ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)

Home Owner?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → Income?
                               < 80K → NO
                               > 80K → YES
Another Example of Decision Tree

(Same training data as above.)

MarSt?
  Married → NO
  Single, Divorced → Home Owner?
                       Yes → NO
                       No  → Income?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test Data:

Home Owner   Marital Status   Annual Income   Defaulted Borrower
No           Married          80K             ?

Start from the root of the tree and, at each internal node, follow the branch that
matches the test record:
 Home Owner = No → take the No branch to the MarSt node
 Marital Status = Married → take the Married branch, which leads to the leaf NO

The model therefore predicts Defaulted Borrower = No for this record.
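The same traversal written as nested if/else tests, one test per internal node (a hand-written sketch of this particular tree):

def classify(home_owner, marital_status, annual_income):
    # Root node: Home Owner
    if home_owner == 'Yes':
        return 'No'                      # leaf: NO
    # MarSt node
    if marital_status == 'Married':
        return 'No'                      # leaf: NO
    # Income node (Single or Divorced); income is in thousands (K)
    return 'No' if annual_income < 80 else 'Yes'

print(classify('No', 'Married', 80))     # test record -> 'No'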
Parts of a Decision Tree
Decision Tree
Given:
 D = {t1, …, tn} where ti=<ti1, …, tih>
 Attributes {A1, A2, …, Ah}
 Classes C={C1, …., Cm}
Decision or Classification Tree is a tree associated with
D such that
 Each internal node is labeled with attribute, Ai
 Each arc is labeled with predicate which can be
applied to attribute at parent
 Each leaf node is labeled with a class, Cj
Decision Tree based Algorithms

 Solving the classification problem using decision trees


is a two-step process:
 Decision tree induction: Construct a DT using training
data.
 For each ti ∈ D, apply the DT to determine its class.

 DT approaches differ in how the tree is built.

 Algorithms: ID3, C4.5, CART


DT Induction
DT Induction
 The recursive algorithm builds the tree in a top-down fashion.
 Using the initial training data, the "best" splitting attribute is
chosen first. [Algorithms differ in how they determine the "best
attribute" and its "best predicates" to use for splitting. ]
 Once this has been determined, the node and its arcs are
created and added to the created tree.
 The algorithm continues recursively by adding new subtrees to
each branching arc.
 The algorithm terminates when some "stopping criteria" is
reached. [Again, each algorithm determines when to stop the tree
differently. One simple approach would be to stop when the tuples in the
reduced training set all belong to the same class. This class is then used to
label the leaf node created.]
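A compact sketch of this recursive, top-down procedure for categorical attributes. The choose_best function is left pluggable, since algorithms differ in how they pick the "best" splitting attribute, and the stopping criteria here (single class, or no attributes left) are one simple choice among many:

from collections import Counter

def majority_class(rows, target):
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, target, choose_best):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                 # stopping criterion: all tuples in one class
        return classes.pop()
    if not attributes:                    # no attributes left: label with the majority class
        return majority_class(rows, target)
    best = choose_best(rows, attributes, target)   # e.g. ID3 picks the highest information gain
    node = {"split_on": best, "branches": {}}
    for value in {r[best] for r in rows}:          # one arc per value of the splitting attribute
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, remaining, target, choose_best)
    return node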
DT Induction

 Splitting attributes: Attributes in the


database schema that will be used to label
nodes in the tree and around which the
divisions will take place.
 Splitting predicates: The predicates by
which the arcs in the tree are labeled.
DT Issues

 Choosing Splitting Attributes


 Ordering of Splitting Attributes
 Splits
 Tree Structure
 Stopping Criteria
 Training Data
 Pruning
DT Issues
 Choosing Splitting Attributes
Name      Gender   Height   Output1 (Correct)   Output2 (Actual Assignment)
Kristina F 1.6m medium Medium
Jim M 2m Tall Short
Maggie F 1.9m Medium Short
Martha F 1.88m Short medium
Stephanie F 1.7m Medium Tall
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Short
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Short
Debbie F 1.8m Tall Medium
Todd M 1.95m Medium Tall
Kim F 1.9m Short Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Short
DT Issues
 Ordering of Splitting Attributes
 The order in which the attributes are chosen is also
important.
DT Issues
 Splits
 With some attributes, the domain is small, so the number of
splits is obvious based on the domain (as with the gender
attribute).
 However, if the domain is continuous or has a large number
of values, the number of splits to use is not easily
determined.

[Figure: (i) Binary split: "Annual Income > 80K?" with branches Yes / No;
(ii) Multi-way split: "Annual Income?" with branches < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K]


DT Issues
 Tree Structure
 a balanced tree with the fewest levels is desirable.
 However, in this case, more complicated comparisons
with multiway branching may be needed.
 Some algorithms build only binary trees.
DT Issues
 Stopping Criteria
 when the training data are perfectly classified.
 when stopping earlier would be desirable to prevent the
creation of larger trees. This is a trade-off between accuracy of
classification and performance.
 Training Data
 The structure of the DT created depends on the training data.
 If the training data set is too small, then the generated tree
might not be specific enough to work properly with the more
general data.
 If the training data set is too large, then the created tree may
overfit.
DT Issues

 Pruning
 Once a tree is constructed, some modifications to the tree
might be needed to improve the performance of the tree
during the classification phase.
 The pruning phase might remove redundant comparisons
or remove subtrees to achieve better performance.
Comparing Decision Trees
ID3
 ID3 stands for Iterative Dichotomiser 3
 Creates the tree using information theory concepts and
tries to reduce the expected number of comparisons.
 ID3 chooses the split attribute with the highest
information gain:
 Information gain = (entropy of the distribution before
the split) − (entropy of the distribution after it)
Entropy
 Entropy
 Is used to measure the amount of uncertainty or
surprise or randomness in a set of data.
 When all data belongs to a single class, entropy is
zero as there is no uncertainty.
 An equally divided sample has an entropy of 1
Entropy
The mathematical formula for Entropy is:

E(S) = − Σ pi · log2(pi)

Where 'pi' is simply the frequentist probability of an element/class 'i' in our
data.

For simplicity’s sake let’s say we only have two classes , a


positive class and a negative class. Therefore ‘i’ here could be
either + or (-). So if we had a total of 100 data points in our
dataset with 30 belonging to the positive class and 70 belonging
to the negative class then ‘P+’ would be 3/10 and ‘P-’ would be
7/10.
Information Gain

 Gain is defined as the difference


between how much information is
needed to make a correct classification
before the split versus how much
information is needed after the split.
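Entropy and information gain as code. A small sketch; the gain is written here in the usual ID3 form, Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) · Entropy(Sv):

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    # rows: list of dicts; split on `attribute` and compare entropy before vs. after the split.
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

# 30 positive and 70 negative examples, as in the text: entropy ≈ 0.881
print(entropy(['+'] * 30 + ['-'] * 70))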
ID3

[Worked ID3 example figure, by Apurva Joshi]
Advantages of ID3

 Understandable prediction rules are


created from the training data.
 Builds the fastest tree.

 Builds a short tree.

 Only need to test enough attributes until


all data is classified.
 Finding leaf nodes enables test data to be
pruned, reducing number of tests.
Disadvantages of ID3

 Data may be over-fitted or over classified,


if a small sample is tested.
 Only one attribute at a time is tested for
making a decision.
 Classifying continuous data may be
computationally expensive, as many trees
must be generated to see where to break
the continuum.
Topics Covered
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
 Statistical Based
• Bayesian Classification
 Distance Based
• KNN
 Decision Tree Based
• ID3
 Neural Network Based
 Rule Based
Artificial Neural Networks

 ANNs are multi-layer, fully-connected neural nets.

 They consist of an input layer, multiple hidden layers, and an
output layer.
 Every node in one layer is connected to every node in
the next layer.
Artificial Neural Networks

 ANN is a computational model that is inspired by the way


biological neural networks in the human brain process
information.

Schematic diagram of biological neuron


Artificial Neural Networks

 Neural diagram that makes the analogy between the


neuron structure and the artificial neurons in a neural
network.
A Single Neuron

 Basic unit of computation, often called a node or unit.


 It receives input from some other nodes, or from an external
source and computes an output.
 Each input has an associated weight (w), which is assigned
on the basis of its relative importance to other inputs.
 The node applies a function f (the activation function) to the weighted
sum of its inputs.
Artificial Neural Networks

X1  X2  X3  |  Y
1   0   0   | -1
1   0   1   |  1
1   1   0   |  1
1   1   1   |  1
0   0   1   | -1
0   1   0   | -1
0   1   1   |  1
0   0   0   | -1

[Figure: input nodes X1, X2, X3 feed a "black box" output node Y]

Output Y is 1 if at least two of the three inputs are equal to 1.

A simple model of this black box assigns each input a weight of 0.3 and uses a
threshold t = 0.4:

Y = sign(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4)

where sign(x) = +1 if x ≥ 0, and −1 if x < 0.
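The single-neuron model above can be checked directly in code (a sketch of the sign-threshold unit, not a trained network):

def neuron(x1, x2, x3, w=0.3, t=0.4):
    s = w * x1 + w * x2 + w * x3 - t        # weighted sum minus threshold
    return 1 if s >= 0 else -1              # sign activation

for x in [(1,0,0), (1,0,1), (1,1,0), (1,1,1), (0,0,1), (0,1,0), (0,1,1), (0,0,0)]:
    print(x, neuron(*x))    # outputs +1 exactly when at least two inputs are 1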
Mathematical model of a Neuron
Training ANN
 Solving a classification problem using NNs involves several steps:
 Determine the number of input nodes (attributes) as well as output nodes.
 The number of hidden layers (between the source and the sink nodes) also
must be decided. This step is performed by a domain expert.
 Determine weights (labels) and functions to be used for the graph.
 For every training example, perform a forward pass using the current
weights, and calculate the output of each node going from left to right. The
final output is the value of the last node.
 Compare the final output with the actual target in the training data, and
measure the error using a loss function.
 Perform a backwards pass from right to left and propagate the error to
every individual node using backpropagation. Calculate each weight’s
contribution to the error, and adjust the weights accordingly using gradient
descent. Propagate the error gradients back starting from the last layer.
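These steps can be put together in a short numerical sketch. Everything below (layer sizes, learning rate, the numpy implementation) is an illustrative assumption, not the slides' example; it trains a tiny one-hidden-layer network on the "at least two inputs are 1" data from earlier:

import numpy as np

X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],
              [0,0,1],[0,1,0],[0,1,1],[0,0,0]], dtype=float)
y = np.array([[0],[1],[1],[1],[0],[0],[1],[0]], dtype=float)  # 1 = at least two inputs are 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output
lr = 0.5

for epoch in range(5000):
    # Forward pass: compute each node's output from left to right.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error at the output (gradient of a squared-error loss).
    err = out - y
    # Backward pass: propagate the error and adjust weights with gradient descent.
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X);  b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);    b1 -= lr * d_h.mean(axis=0)

print((out > 0.5).astype(int).ravel())  # predictions should match y after training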
Processing of ANN

 Processing of ANN depends upon the following


three building blocks −
 Network Topology
• Feedforward and feedback networks
 Adjustments of Weights or Learning
• Supervised learning
• Unsupervised learning
 Activation Functions
Some issues
 There are many issues to be examined:
 Attributes (number of source nodes)
 Number of hidden layers
 Number of hidden nodes
 Training data
 Number of sinks(output nodes)
 Interconnections
 Weights
 Activation functions
 Learning Technique
 Stop
Topics Covered
 What is Classification
 General Approach to Classification
 Issues in Classification
 Classification Algorithms
 Statistical Based
• Bayesian Classification
 Distance Based
• KNN
 Decision Tree Based
• ID3
 Neural Network Based
 Rule Based
Rule based Algorithm
 A rule-based classifier uses a set of IF-THEN rules
for classification.
 An IF-THEN rule is an expression of the form: –
IF condition THEN conclusion
 where
• Condition (or LHS) is rule antecedent/precondition
• Conclusion (or RHS) is rule consequent
 Rule: (Condition) → y
 where
• Condition is a conjunction of tests on attributes
• y is the class label
Rule based Algorithm

 Example of classification rules:


R: IF age = youth AND student = yes THEN buys_computer = yes
 The condition consists of one or more attribute tests that
are logically ANDed
• such as age = youth, and student = yes
 The rule’s consequent contains a class prediction
• we are predicting whether a customer will buy a computer
 R can also be rewritten as:
R: (age = youth) ∧ (student = yes) → (buys_computer = yes)
Rule based Algorithm

 Another Example:
 If 90 <= grade, then class = A
 If 80 <=grade and grade < 90, then class = B
 If 70 <=grade and grade < 80, then class = C
 If 60 <=grade and grade < 70, then class = D
 If grade < 60, then class = F

 These rules relate directly to the corresponding Decision Tree


that could be created.
 There are algorithms that generate rules from trees as well as
algorithms that generate rules without first creating DTs.
Building Classification Rules

 Direct Method: extract rules directly from data


 1R Algorithm
 Sequential covering algorithms
• e.g.: PRISM, RIPPER, CN2, FOIL, and AQ

 Indirect Method: extract rules from other


classification models
 e.g. decision trees
Rule Extraction from a Decision Tree

 A DT can always be used to generate rules but they are


not equivalent. The differences are,
 The tree has an implied order in which the splitting is
performed. Rules have no order.
 A tree is created based on looking at all classes. But
only one class is examined at a time when rules are
generated.
Generating Rules from DTs
Generating Rules Example

Decision
Tree

Rules
Generating Rules Example
(contd..)

Optimized Tree

Optimized set of
rules
Generating Rules Example
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Assessment of a Rule

 Coverage of a rule:
 fraction of records that satisfy the antecedent of the rule

 Accuracy of a rule:
 fraction of records that satisfy the antecedent that also satisfy the consequent of the rule

Tid   Refund   Marital Status   Taxable Income   Class
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Rule: (Status = Single) → No
Coverage = 40% (4 of the 10 records are Single), Accuracy = 50% (2 of those 4 have Class = No)
Rule Coverage and Accuracy

Coverage(R) = n_covers / |D|
Accuracy(R) = n_correct / n_covers

Where
– D: class-labeled data set
– |D|: number of instances in D
– n_covers: number of instances covered by R
– n_correct: number of instances covered by R that are correctly classified by R

For the same data set as above, (Status = Single) → No
gives Coverage = 40%, Accuracy = 50%.
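A small sketch computing the coverage and accuracy of (Status = Single) → No over the ten records above:

records = [
    ('Yes', 'Single',   125, 'No'),  ('No', 'Married', 100, 'No'),
    ('No',  'Single',    70, 'No'),  ('Yes','Married', 120, 'No'),
    ('No',  'Divorced',  95, 'Yes'), ('No', 'Married',  60, 'No'),
    ('Yes', 'Divorced', 220, 'No'),  ('No', 'Single',   85, 'Yes'),
    ('No',  'Married',   75, 'No'),  ('No', 'Single',   90, 'Yes'),
]

covered = [r for r in records if r[1] == 'Single']          # antecedent: Status = Single
correct = [r for r in covered if r[3] == 'No']              # consequent: Class = No
print(len(covered) / len(records), len(correct) / len(covered))   # 0.4 and 0.5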
Characteristics of Rule Sets:
Strategy

 Mutually exclusive rules


 Classifier contains mutually exclusive rules if the
rules are independent of each other
 Every record is covered by at most one rule

 Exhaustive rules
 Classifier has exhaustive coverage if it accounts for
every possible combination of attribute values
 Each record is covered by at least one rule
Characteristics of Rule Sets:
Strategy

 Rules are not mutually exclusive


 A record may trigger more than one rule
 Solution?
• Ordered rule set
• Unordered rule set – use voting schemes

 Rules are not exhaustive


 A record may not trigger any rules
 Solution?
• Use a default class
Ordered Rule Set

 Rules are rank ordered according to their priority


 An ordered rule set is known as a decision list
 When a test record is presented to the classifier
 It is assigned to the class label of the highest ranked rule it has triggered
 If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name     Blood Type   Give Birth   Can Fly   Live in Water   Class
turtle   cold         no           no        sometimes       ?

The turtle record triggers both R4 and R5; because R4 is ranked higher in the
decision list, the turtle is classified as a reptile.
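A minimal sketch of a decision list: rules are tried in rank order and the first one that fires assigns the class; the default class used at the end is an illustrative assumption, not from the slides:

rules = [                                      # ordered: R1 first, R5 last
    (lambda r: r['give_birth'] == 'no'  and r['can_fly'] == 'yes',       'Birds'),
    (lambda r: r['give_birth'] == 'no'  and r['live_in_water'] == 'yes', 'Fishes'),
    (lambda r: r['give_birth'] == 'yes' and r['blood_type'] == 'warm',   'Mammals'),
    (lambda r: r['give_birth'] == 'no'  and r['can_fly'] == 'no',        'Reptiles'),
    (lambda r: r['live_in_water'] == 'sometimes',                        'Amphibians'),
]

def classify(record, rules, default='Mammals'):
    for condition, label in rules:
        if condition(record):              # first (highest-ranked) rule that fires wins
            return label
    return default                         # used only if no rule is triggered

turtle = {'blood_type': 'cold', 'give_birth': 'no', 'can_fly': 'no', 'live_in_water': 'sometimes'}
print(classify(turtle, rules))             # 'Reptiles' (R4 fires before R5)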
