
Data Mining:

Concepts and Techniques

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 8. Classification: Basic Concepts

 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
Supervised vs. Unsupervised Learning

 Supervised learning (classification)
 Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
 New data are classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction

 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts unknown or missing values
 Typical applications
 Credit/loan approval: whether an application should be approved
 Medical diagnosis: whether a tumor is cancerous or benign
 Fraud detection: whether a transaction is fraudulent
 Web page categorization: which category a page belongs to
Classification—A Two-Step Process

 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
 The set of tuples used for model construction is the training set
 The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: for classifying future or unknown objects
 Previously unseen records should be assigned a class as accurately as possible
 Estimate the accuracy of the model
 The known label of each test sample is compared with the classified result from the model
 The accuracy rate is the percentage of test-set samples that are correctly classified by the model
 The test set is independent of the training set (otherwise overfitting results)
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select among models, it is called a validation set
 A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
Process (1): Model Construction

Training data are fed to a classification algorithm, which produces the classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

Classifier

Testing
Data Unseen Data

(Jeff, Professor, 4)
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no Tenured?
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
8
Classification Techniques
 Decision Tree-based Methods
 Naïve Bayes and Bayesian Belief Networks
 Rule-based Methods
 Support Vector Machines
 Memory-based Reasoning
 Neural Networks
Chapter 8. Classification: Basic Concepts

 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
Decision Tree Induction: An Example

 Training data set: Buys_computer
 The data set follows the style of Quinlan’s ID3 (Playing Tennis) example

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

 Resulting tree:

age?
 <=30   → student?
            no  → no
            yes → yes
 31..40 → yes
 >40    → credit_rating?
            excellent → no
            fair      → yes
Algorithm for Decision Tree Induction

 Basic algorithm (a greedy algorithm)
 The tree is constructed in a top-down, recursive, divide-and-conquer manner
 At the start, all the training examples are at the root
 Attributes are categorical (continuous-valued attributes are discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning (majority voting is then employed to label the leaf)
 There are no samples left
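The same greedy, entropy-based induction is available in standard libraries. Below is a minimal sketch (not from the slides) using scikit-learn on the buys_computer data; the ordinal encoding step and all variable names are our own choices, and one-hot encoding would be safer in practice:

```python
# Sketch: fit a decision tree with the information-gain (entropy) criterion.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# (age, income, student, credit_rating) -> buys_computer
rows = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
X = [r[:4] for r in rows]
y = [r[4] for r in rows]

enc = OrdinalEncoder()              # categorical values -> integer codes
X_enc = enc.fit_transform(X)

# criterion="entropy" corresponds to the information-gain heuristic above
tree = DecisionTreeClassifier(criterion="entropy").fit(X_enc, y)
print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))
```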
Decision Tree - Classification

 A decision tree builds classification or regression models in the form of a tree structure.
 It breaks a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed.
 The final result is a tree with decision nodes and leaf nodes.
 A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast, and Rainy).
 A leaf node (e.g., Play) represents a classification or decision.
 The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
 Decision trees can handle both categorical and numerical data.

Key Terms:
 Decision tree induction is the learning of decision trees from class-labeled training tuples.
 A decision tree is a flowchart-like tree structure in which:
 an internal node (non-leaf node) denotes a test on an attribute,
 a branch represents an outcome of the test,
 a leaf node (or terminal node) holds a class label.
 The topmost node in a tree is the root node.
Decision Tree Algorithm

 Decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
 The top-down approach starts with a training set of tuples and their associated class labels.
 The training set is recursively partitioned into smaller subsets as the tree is being built.

Input:
 Data partition, D, which is a set of training tuples and their associated class labels;
 attribute list, the set of candidate attributes;
 Attribute selection method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split-point or a splitting subset.

Output: A decision tree.

1. The algorithm begins with the original set D as the root node.
2. On each iteration, it iterates through every unused attribute of the set D and calculates the information gain (IG) of that attribute.
3. It then selects the attribute with the largest information gain.
4. The set D is then split on the selected attribute to produce subsets of the data.
5. The algorithm continues to recurse on each subset, considering only attributes never selected before. (A sketch of this loop follows.)
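A from-scratch sketch of the loop above, assuming all attributes are categorical; the helper names (entropy, info_gain, id3) are ours, not from the text:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute index."""
    n = len(labels)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[attr], []).append(lab)
    info_a = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - info_a

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:           # all samples belong to one class
        return labels[0]
    if not attrs:                       # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = id3(list(sub_rows), list(sub_labels),
                              [a for a in attrs if a != best])
    return ("split on attribute", best, branches)
```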
Attribute Selection Method

 An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes.
 The attribute selection measure provides a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
 The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly.

Information Gain
 Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.
 1) The expected information needed to classify a tuple in D is given by

    Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)

 2) The expected information still required after partitioning on attribute A (the attribute information) is

    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

 3) Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

    Gain(A) = Info(D) - Info_A(D)

 The attribute A with the highest Gain(A) is chosen as the splitting attribute at node N.
Transaction Database

 The table presents a training set, D, of class-labeled tuples randomly selected from the AllElectronics customer database.
 In this example, each attribute is discrete-valued.
 The class label attribute, buys_computer, has two distinct values (namely, yes and no); therefore, there are two distinct classes (i.e., m = 2).
 Let class C1 correspond to yes and class C2 correspond to no.
 There are nine tuples of class yes and five tuples of class no.
 A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute.
 The expected information needed to classify a tuple in D:

    Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits

 Next, we need to compute the expected information requirement for each attribute.
 Let’s start with the attribute age. We need to look at the distribution of yes and no tuples for each category of age.
 For the age category “youth,” there are two yes tuples and three no tuples.
 For the category “middle aged,” there are four yes tuples and zero no tuples.
 For the category “senior,” there are three yes tuples and two no tuples.
 The expected information needed to classify a tuple in D if the tuples are partitioned according to age is

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694 bits

 Hence, Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits.
 Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches are grown for each of the attribute’s values. The tuples are then partitioned accordingly, as shown in the figure.
Attribute Selection: Information Gain

 Class P: buys_computer = “yes” (9 tuples)
 Class N: buys_computer = “no” (5 tuples)

    Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age    p_i  n_i  I(p_i, n_i)
<=30   2    3    0.971
31…40  4    0    0
>40    3    2    0.971

 (5/14) I(2,3) means the branch “age <=30” has 5 of the 14 samples, with 2 yes’es and 3 no’s. Hence

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

    Gain(age) = Info(D) - Info_age(D) = 0.246

 Similarly,

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
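As a sanity check, the numbers above can be reproduced in a few lines of Python (our own code, not part of the slides):

```python
import math

def I(p, n):
    """Expected information for a partition with p yes-tuples and n no-tuples."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

info_D = I(9, 5)                                              # 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)   # 0.694
print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# -> 0.94 0.694 0.246   (Gain(age) = 0.246 bits)
```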
Overfitting

 Overfitting occurs when a machine learning model tries to cover all the data points, or more data points than required, in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, which reduces its efficiency and accuracy. An overfitted model has low bias and high variance.

Underfitting

 Underfitting occurs when a machine learning model is not able to capture the underlying trend of the data. It can happen, for example, when training is stopped at an early stage to avoid overfitting: the model may not learn enough from the training data and, as a result, may fail to find the best fit for the dominant trend in the data.
Overfitting and Tree Pruning

 Overfitting: An induced tree may overfit the training data
 Too many branches, some of which may reflect anomalies due to noise or outliers
 Poor accuracy on unseen samples
 Two approaches to avoid overfitting (a sketch of both follows)
 Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
 It is difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree, producing a sequence of progressively pruned trees
 Use a set of data different from the training data to decide which is the “best pruned tree”
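Both pruning styles can be tried with scikit-learn rather than the textbook's own procedure. A hedged sketch on a stock dataset: max_depth and min_samples_leaf act as prepruning thresholds, while ccp_alpha drives cost-complexity postpruning; the dataset choice and parameter values here are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5).fit(X_tr, y_tr)
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_tr, y_tr)   # postpruning

for name, model in [("prepruned", pre), ("postpruned", post)]:
    # fewer leaves than an unconstrained tree, with held-out accuracy
    print(name, model.get_n_leaves(), round(model.score(X_te, y_te), 3))
```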
Chapter 8. Classification: Basic Concepts

 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Summary
Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics

 Total probability theorem:

    P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)

 Bayes’ theorem:

    P(H | X) = P(X | H) P(H) / P(X)

 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy a computer, regardless of age, income, …
 P(X): the probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
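A tiny numeric illustration of how the pieces combine (our own code; the likelihood values are borrowed from the naïve Bayes worked example later in this chapter):

```python
p_h = 9 / 14             # prior: P(buys_computer = yes)
p_x_given_h = 0.044      # likelihood: P(X | yes)
p_x_given_not_h = 0.019  # P(X | no)

p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # total probability
posterior = p_x_given_h * p_h / p_x                     # Bayes' theorem
print(round(posterior, 3))   # P(yes | X) ≈ 0.807
```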
Naïve Bayes Classifier

 A simplifying assumption: attributes are conditionally independent (i.e., there is no dependence relation between attributes):

    P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × … × P(x_n | Ci)

 This greatly reduces the computation cost: only the class distribution needs to be counted
 If A_k is categorical, P(x_k|Ci) is the number of tuples in Ci having value x_k for A_k, divided by |Ci,D| (the number of tuples of Ci in D)
 If A_k is continuous-valued, P(x_k|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ,

    g(x, μ, σ) = (1 / (√(2π) σ)) e^{-(x-μ)² / (2σ²)}

 and P(x_k|Ci) = g(x_k, μ_Ci, σ_Ci)
Naïve Bayes Classifier: Training Dataset

 Classes:
 C1: buys_computer = ‘yes’
 C2: buys_computer = ‘no’
 Training data: the AllElectronics buys_computer table shown earlier (9 yes tuples, 5 no tuples)
 Data to be classified:
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayes Classifier: An Example

 P(Ci):

    P(buys_computer = “yes”) = 9/14 = 0.643
    P(buys_computer = “no”) = 5/14 = 0.357

 Compute P(X|Ci) for each class:

    P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
    P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
    P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
    P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
    P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
    P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
    P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
    P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

 X = (age <= 30, income = medium, student = yes, credit_rating = fair)

    P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
    P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

 P(X|Ci) × P(Ci):

    P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
    P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007

 Therefore, X belongs to class “buys_computer = yes”
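The same computation can be reproduced mechanically from counts. A minimal sketch, assuming the 14-tuple table shown earlier; all helper names are ours:

```python
rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")   # the tuple X to classify

scores = {}
for c in ("yes", "no"):
    members = [r for r in rows if r[4] == c]
    score = len(members) / len(rows)         # prior P(Ci)
    for k, value in enumerate(x):            # times each P(x_k | Ci)
        score *= sum(1 for r in members if r[k] == value) / len(members)
    scores[c] = score

print(scores)                        # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))   # -> 'yes'
```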
Avoiding the Zero-Probability Problem

 Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

    P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci)

 Ex. Suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990), and income = high (10)
 Use the Laplacian correction (or Laplacian estimator)
 Add 1 to each case:

    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003

 The “corrected” probability estimates are close to their “uncorrected” counterparts
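The Laplacian correction is simple enough to show as arithmetic. A sketch (our own code) using the counts from the example:

```python
counts = {"low": 0, "medium": 990, "high": 10}   # income counts, 1000 tuples
k = len(counts)                                  # 3 categories, one +1 each

smoothed = {v: (c + 1) / (sum(counts.values()) + k) for v, c in counts.items()}
print(smoothed)   # low: 1/1003, medium: 991/1003, high: 11/1003
```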
Naïve Bayes Classifier: Comments

 Advantages
 Easy to implement
 Good results obtained in most cases
 Disadvantages
 The assumption of class-conditional independence causes a loss of accuracy
 In practice, dependencies exist among variables
 E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by a Naïve Bayes classifier
 How to deal with these dependencies? Bayesian Belief Networks (Chapter 9)
Nearest Neighbor Classifiers

 Basic idea:
 If it walks like a duck and quacks like a duck, then it’s probably a duck
 Given a test record, compute its distance to the training records, then choose the k “nearest” records

Nearest-Neighbor Classifiers
 Require three things:
 The set of stored records
 A distance metric to compute the distance between records
 The value of k, the number of nearest neighbors to retrieve
 To classify an unknown record:
 Compute its distance to the other training records
 Identify the k nearest neighbors
 Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

 The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Nearest Neighbor Classification

 Compute the distance between two points:
 Euclidean distance:

    d(p, q) = √( Σ_i (p_i - q_i)² )

 Determine the class from the nearest-neighbor list
 Take the majority vote of the class labels among the k nearest neighbors
 Optionally weigh each vote according to distance
 e.g., weight factor w = 1/d²
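Putting the distance computation and majority vote together, a from-scratch sketch of the k-NN rule (the toy training points and helper names are made up for illustration):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, x, k=3):
    """train is a list of (point, label); returns the majority label
    among the k records nearest to x."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((3.0, 3.2), "B"), ((2.9, 3.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))   # -> 'A'
```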
Nearest Neighbor Classification…

 Choosing the value of k:
 If k is too small, the classifier is susceptible to overfitting due to noise points in the training data
 If k is too large, the neighborhood may include points from other classes
Nearest Neighbor Classification…

 Scaling issues
 Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
 Example:
 the height of a person may vary from 1.5 m to 1.8 m
 the weight of a person may vary from 90 lb to 300 lb
 the income of a person may vary from $10K to $1M
 Solution: normalize the vectors to unit length (a related scaling sketch follows)
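One common way to do such scaling is min-max normalization, sketched below; the slide's unit-length normalization is an alternative, and the sample values are illustrative:

```python
def min_max_scale(column):
    """Rescale a numeric column to the [0, 1] interval."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

heights = [1.5, 1.6, 1.7, 1.8]                   # metres
incomes = [10_000, 50_000, 500_000, 1_000_000]   # dollars
print(min_max_scale(heights))   # both columns now lie in [0, 1],
print(min_max_scale(incomes))   # so neither dominates the distance
```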
K-Nearest Neighbor (KNN) Algorithm

 K-Nearest Neighbour is one of the simplest machine learning algorithms based on the supervised learning technique.
 The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
 The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
 Distance measures: square distance, Euclidean, Manhattan, or Hamming distance; e.g., in two dimensions the Euclidean distance between (x1, x2) and (y1, y2) is

    d = √( (x1 - y1)² + (x2 - y2)² )
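Sketches of the named distance measures, with our own helper names and toy inputs:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

print(euclidean((1, 2), (4, 6)))      # 5.0
print(manhattan((1, 2), (4, 6)))      # 7
print(hamming("karolin", "kathrin"))  # 3
```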
Evaluation of a Machine Learning Model

True Positive (TP)
 True Positive (TP): Instances that are actually positive (belong to the positive class) and are correctly identified as positive by the model.
 Here's an example to illustrate this concept:
 Suppose you have a binary classification model that predicts whether an email is spam (positive class) or not spam (negative class). Consider the following scenario:
 True Positive (TP):
 True Class: The email is actually spam.
 Model Prediction: The model correctly predicts that the email is spam.
 Example: An email containing typical characteristics of spam (e.g., certain keywords, links, or patterns) is correctly identified as spam by the model.
True Negative (TN)
 True Negative (TN): Instances that are actually negative (belong to the negative class) and are correctly identified as negative by the model.
 Here's an example to illustrate this concept:
 Using the same spam-vs-not-spam model, consider the following scenario:
 True Negative (TN):
 True Class: The email is actually not spam (negative class).
 Model Prediction: The model correctly predicts that the email is not spam.
 Example: A regular, non-spam email that lacks characteristics commonly associated with spam is correctly identified as not spam by the model.
False Positive (FP)
 False Positive (FP): Instances that are actually negative (belong to the negative class) but are incorrectly identified as positive by the model.
 Here's an example to illustrate this concept:
 Using the same spam-vs-not-spam model, consider the following scenario:
 False Positive (FP):
 True Class: The email is actually not spam (negative class).
 Model Prediction: The model incorrectly predicts that the email is spam.
 Example: A regular, non-spam email is mistakenly identified as spam by the model due to certain features or patterns in the email that the model misinterprets.
False Negative (FN)
 False Negative (FN): Instances that are actually positive (belong to the positive class) but are incorrectly identified as negative by the model.
 Here's an example to illustrate this concept:
 Using the same spam-vs-not-spam model, consider the following scenario:
 False Negative (FN):
 True Class: The email is actually spam (positive class).
 Model Prediction: The model incorrectly predicts that the email is not spam.
 Example: An email containing characteristics typical of spam is mistakenly identified as not spam by the model, leading to a false negative.
Confusion Matrix

 A confusion matrix is a matrix representation of the prediction results of a binary test. It is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.
Accuracy, Sensitivity, and Specificity

 Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Sensitivity (true positive rate, recall) = TP / (TP + FN)
 Specificity (true negative rate) = TN / (TN + FP)
 Error rate (misclassification rate) = (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy
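These formulas are easy to compute directly from the four confusion-matrix counts; the counts below are made up for illustration:

```python
TP, TN, FP, FN = 40, 45, 5, 10   # hypothetical confusion-matrix counts

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.85
error_rate  = (FP + FN) / (TP + TN + FP + FN)   # 0.15
sensitivity = TP / (TP + FN)                    # true positive rate: 0.8
specificity = TN / (TN + FP)                    # true negative rate: 0.9
print(accuracy, error_rate, sensitivity, specificity)
```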
