
CH5

Data Mining
Classification

Prepared By Dr. Maher Abuhamdeh


Classification: Definition

• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it.
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
• Prediction:
– models continuous-valued functions, i.e., predicts
unknown or missing values
Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class).

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set -> Learn Classifier -> Model (the model is then used to classify the Test Set)
Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
Classification: Application 2
• Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the information on the account holder as attributes.
– When does a customer buy, what does he buy, how often does he pay on time, etc.
• Label past transactions as fraud or fair transactions. This forms the
class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Classification: Application 3

• Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be
lost to a competitor.
– Approach:
• Use detailed records of transactions with each of the past and present customers to find attributes.
– How often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 4

• Sky Survey Cataloging


– Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of
the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Courtesy: http://aps.umn.edu

Class: stage of formation (Early, Intermediate, Late)

Attributes:
• Image features
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Training Set -> Learning algorithm (Induction) -> Learn Model -> Model

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Model -> Apply Model to the Test Set (Deduction)
KNN: K-Nearest Neighbor Method

• KNN approach
– K-Nearest Neighbor method
– Given a new test query q, find the k closest training queries to it in terms of Euclidean distance.
– Train a local ranking model online using the neighboring training queries (Ranking SVM).
– Rank the documents of the test query using the trained local model.
Example

• We have data from a questionnaire survey (asking people's opinion) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Y = Classification
7                                7                                 Bad
7                                4                                 Bad
3                                4                                 Good
1                                4                                 Good
Example (Cont.)

• The factory now produces a new paper tissue that passed the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new tissue is?

1. Suppose we use K = 3.
2. Calculate the distance between the query instance and all the training samples.
Example (Cont.)

• The coordinates of the query instance are (3, 7). Instead of the distance we compute the squared distance, which is faster to calculate (no square root needed):

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Squared Distance
7                                7                                 (7-3)² + (7-7)² = 16
7                                4                                 (7-3)² + (4-7)² = 25
3                                4                                 (3-3)² + (4-7)² = 9
1                                4                                 (1-3)² + (4-7)² = 13
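As a quick check of these values, here is a minimal Python sketch (the variable names are illustrative, not from the slides) that computes the squared distances for this example:

```python
# Squared Euclidean distance for the paper-tissue k-NN example.
training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)

def squared_distance(a, b):
    # Sum of squared coordinate differences; no square root is needed for ranking.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

for point, label in training:
    print(point, label, squared_distance(point, query))
# Prints 16, 25, 9 and 13, matching the table above.
```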
Example (Cont.)

3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance:

X1 (seconds)   X2 (kg/square meter)   Distance                    Sort rank   Class
7              7                      √((7-3)² + (7-7)²) = 4      3           Bad
7              4                      √((7-3)² + (4-7)²) = 5      4           Bad
3              4                      √((3-3)² + (4-7)²) = 3      1           Good
1              4                      √((1-3)² + (4-7)²) ≈ 3.6    2           Good
Example (Cont.)

4. Gather the categories of the 3 nearest neighbors (sort ranks 1, 2, and 3):

X1 (seconds)   X2 (kg/square meter)   Distance                    Sort rank   Class Y
7              7                      √((7-3)² + (7-7)²) = 4      3           Bad
7              4                      √((7-3)² + (4-7)²) = 5      4           Bad (not among the 3 nearest)
3              4                      √((3-3)² + (4-7)²) = 3      1           Good
1              4                      √((1-3)² + (4-7)²) ≈ 3.6    2           Good
Example (Cont.)

5. Use the simple majority of the categories of the nearest neighbors as the prediction for the query instance. We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passed the laboratory test with X1 = 3 and X2 = 7 is in the Good category.
• If K = 1, we choose only the nearest point, which has class Good.
Example
Age Loan Default
25 40,000 JD N
35 60,000 JD N
45 80,000 JD N
20 20,000 JD N
35 120,000 JD N
52 18,000 JD N
23 95,000 JD Y
40 62,000 JD Y
60 100,000 JD Y
48 221,000 JD Y
33 150,000 JD Y
We can now use the training set to classify an unknown case (Age=48 and
Loan=142,000 JD) using Euclidean distance. If K=1 then the nearest neighbour is
the last case in the training set with Default=Y.  
Example
Age Loan Default Distance Rank
25 40,000 JD N 102,000
35 60,000 JD N 82,000
45 80,000 JD N 62,000
20 20,000 JD N 122,000
35 120,000 JD N 22,000 2
52 18,000 JD N 124,000
23 95,000 JD Y 47,000
40 62,000 JD Y 80,000
60 100,000 JD Y 42,000 3
48 221,000 JD Y 79,000
33 150,000 JD Y 8,000 1
D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.0  >> Default=Y
With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The
prediction for the unknown case is again Default=Y  
Standardized Distance

One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales or when there is a mixture of numerical and categorical variables. For example, if one variable is annual income (in dollars, JD, …) and the other is age in years, then income will have a much higher influence on the calculated distance. One solution is to standardize the training set as shown below.

After standardizing, with K = 3 there are two Default=N and one Default=Y among the three closest neighbours, so the prediction for the unknown case becomes Default=N.
• Standardized Age
• Min age = 20
• Max age = 60
• Standardized value = (x - min) / (max - min)
• X = (25 - 20) / (60 - 20) = 5/40 = 0.125
• Standardized Loan
• Min loan = 18,000
• Max loan = 221,000 (the maximum in the training set above)
• X = (40,000 - 18,000) / (221,000 - 18,000) ≈ 0.11
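A minimal Python sketch of this min-max standardization, using the Age/Loan training set listed earlier (the variable names are illustrative):

```python
# Min-max standardization for the Age/Loan k-NN example.
ages = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48, 33]
loans = [40000, 60000, 80000, 20000, 120000, 18000, 95000, 62000, 100000, 221000, 150000]

def min_max(values):
    # Rescale every value into [0, 1] so that no attribute dominates the distance.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

std_ages, std_loans = min_max(ages), min_max(loans)
print(round(std_ages[0], 3))   # 0.125 for Age = 25
print(round(std_loans[0], 3))  # ~0.11 for Loan = 40,000 JD
```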
Pseudocode for the basic k-NN

• Input: D = {(x1, c1), ..., (xN, cN)} and a new instance x to be classified
• FOR each labelled instance (xi, ci), calculate d(xi, x)
• Order d(xi, x) from lowest to highest (i = 1, ..., N)
• Select the K nearest instances to x: DxK
• Assign to x the most frequent class in DxK
• END
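A runnable Python sketch of this pseudocode, using plain Euclidean distance and a majority vote (the function name is illustrative):

```python
import math
from collections import Counter

def knn_classify(training, x, k):
    """Basic k-NN: training is a list of (vector, class) pairs, x is the query vector."""
    # Compute the distance d(xi, x) for every labelled instance.
    distances = [(math.dist(xi, x), ci) for xi, ci in training]
    # Order from lowest to highest and keep the K nearest instances.
    nearest = sorted(distances)[:k]
    # Assign the most frequent class among the K nearest instances.
    return Counter(ci for _, ci in nearest).most_common(1)[0][0]

# Paper-tissue example from the slides: query (3, 7) with K = 3 gives "Good".
training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
print(knn_classify(training, (3, 7), k=3))
```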
Advantages and disadvantages

• Advantages
• The KNN algorithm provides good accuracy on many domains
• Easy to understand and implement
• Very quick to train (it simply stores the training data)
• The KNN algorithm can estimate complex target concepts locally
Disadvantages

• KNN has a large storage requirement because it has to store all the training data
• KNN is slow for huge data sets because all the training instances have to be visited
• The value of the parameter K (the number of nearest neighbours) must be determined
Naïve Bayesian
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured

Naïve Bayesian Classifier: Training Dataset

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (posteriori probability), the probability
that the hypothesis holds given the observed data sample X
• P(H) (prior probability), the initial probability
– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood), the probability of observing the sample X, given that the
hypothesis holds
– E.g., Given that X will buy computer, the prob. that X is 31..40, medium
income

Bayesian Theorem
• Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be written as
posteriori = likelihood × prior / evidence
• Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
Towards the Naïve Bayesian Classifier
• Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes' theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only

P(Ci|X) ∝ P(X|Ci) P(Ci)

needs to be maximized
Derivation of the Naïve Bayes Classifier

• A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• This greatly reduces the computation cost: only the class distribution has to be counted
• If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) · e^(-(x-μ)² / (2σ²))

and P(xk|Ci) = g(xk, μCi, σCi)
Probabilities for weather data

Outlook    Yes  No   | Temperature  Yes  No   | Humidity  Yes  No   | Windy  Yes  No   | Play  Yes   No
Sunny      2    3    | Hot          2    2    | High      3    4    | False  6    2    |       9     5
Overcast   4    0    | Mild         4    2    | Normal    6    1    | True   3    3    |
Rainy      3    2    | Cool         3    1    |                     |                  |
Sunny      2/9  3/5  | Hot          2/9  2/5  | High      3/9  4/5  | False  6/9  2/5  |       9/14  5/14
Overcast   4/9  0/5  | Mild         4/9  2/5  | Normal    6/9  1/5  | True   3/9  3/5  |
Rainy      3/9  2/5  | Cool         3/9  1/5  |                     |                  |

witten&eibe
Day  Outlook   Temp  Humidity  Windy  Play
1    Sunny     Hot   High      False  No
2    Sunny     Hot   High      True   No
3    Overcast  Hot   High      False  Yes
4    Rainy     Mild  High      False  Yes
5    Rainy     Cool  Normal    False  Yes
6    Rainy     Cool  Normal    True   No
7    Overcast  Cool  Normal    True   Yes
8    Sunny     Mild  High      False  No
9    Sunny     Cool  Normal    False  Yes
10   Rainy     Mild  Normal    False  Yes
11   Sunny     Mild  Normal    True   Yes
12   Overcast  Mild  High      True   Yes
13   Overcast  Hot   Normal    False  Yes
14   Rainy     Mild  High      True   No
Probabilities for weather data

Outlook    Yes  No   | Temperature  Yes  No   | Humidity  Yes  No   | Windy  Yes  No   | Play  Yes   No
Sunny      2    3    | Hot          2    2    | High      3    4    | False  6    2    |       9     5
Overcast   4    0    | Mild         4    2    | Normal    6    1    | True   3    3    |
Rainy      3    2    | Cool         3    1    |                     |                  |
Sunny      2/9  3/5  | Hot          2/9  2/5  | High      3/9  4/5  | False  6/9  2/5  |       9/14  5/14
Overcast   4/9  0/5  | Mild         4/9  2/5  | Normal    6/9  1/5  | True   3/9  3/5  |
Rainy      3/9  2/5  | Cool         3/9  1/5  |                     |                  |

• A new day:
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Likelihood of the two classes:
For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
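A small Python check of this calculation, hard-coding the conditional probabilities from the table above (a sketch, not a general implementation):

```python
# Naive Bayes for the new day: Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~0.0206

# Normalize so the two posteriors sum to 1.
total = p_yes + p_no
print(f"P(yes|E) = {p_yes / total:.3f}")  # ~0.205
print(f"P(no|E)  = {p_no / total:.3f}")   # ~0.795
```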
Age     Income  Has a car  Buy/class
senior  middle  yes        yes
youth   low     yes        no
junior  high    yes        yes
youth   middle  yes        yes
senior  high    no         yes
junior  low     no         no
senior  middle  no         no

Counts:
Age     Yes  No   | Income  Yes  No   | Has a car  Yes  No   | Buy (class)  yes  no
senior  2    1    | middle  2    1    | yes        3    1    |              4    3
junior  1    1    | low     0    2    | no         1    2    |
youth   1    1    | high    2    0    |                      |

Conditional probabilities:
senior  2/4  1/3  | middle  2/4  1/3  | yes        3/4  1/3  |              4/7  3/7
junior  1/4  1/3  | low     0/4  2/3  | no         1/4  2/3  |
youth   1/4  1/3  | high    2/4  0/3  |                      |
Example on Naïve Bayes (Cont.)

Age    Income  Has a car  Buy
youth  middle  no         ?

Likelihood for class (yes) = 1/4 × 2/4 × 1/4 = 0.03125
Likelihood for class (no) = 1/3 × 1/3 × 2/3 = 0.07407

P(X|yes) × P(yes) = 0.03125 × 4/7 ≈ 0.0179
P(X|no) × P(no) = 0.07407 × 3/7 ≈ 0.0317

Since 0.0317 > 0.0179, X = (youth, middle, no) belongs to class “no”.
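A quick Python check of these two scores, with the conditional probabilities read from the table above (sketch only; the variable names are illustrative):

```python
# Naive Bayes for X = (Age=youth, Income=middle, Has a car=no).
likelihood_yes = (1/4) * (2/4) * (1/4)          # P(X|yes) = 0.03125
likelihood_no  = (1/3) * (1/3) * (2/3)          # P(X|no)  ~ 0.07407

score_yes = likelihood_yes * (4/7)              # ~0.0179
score_no  = likelihood_no * (3/7)               # ~0.0317

print("yes" if score_yes > score_no else "no")  # prints "no"
```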
Bayes’s rule
• Probability of event H given evidence E :

Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

• A priori probability of H:
– Probability of the event before evidence is seen: Pr[H]
• A posteriori probability of H:
– Probability of the event after evidence is seen: Pr[H | E]

from Bayes “Essay towards solving a problem


in the doctrine of chances” (1763)
Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England
witten&eibe 39
Naïve Bayes for classification

• Classification learning: what’s the probability of the class


given an instance?
– Evidence E = instance
– Event H = class value for instance
• Naïve assumption: evidence splits into parts (i.e. attributes)
that are independent

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × … × Pr[En | H] × Pr[H] / Pr[E]

witten&eibe 40
Weather data example

Outlook Temp. Humidity Windy Play Evidence E


Sunny Cool High True ?

Probability of class “yes”:

Pr[yes | E] = Pr[Outlook = Sunny | yes]
              × Pr[Temperature = Cool | yes]
              × Pr[Humidity = High | yes]
              × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]

            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
witten&eibe 41
Missing values
• Training: instance is not included in
frequency count for attribute value-class
combination
• Classification: attribute will be omitted
from calculation
• Example: Outlook Temp. Humidity Windy Play
? Cool High True ?

Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of “no” = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
P(“no”) = 0.0343 / (0.0238 + 0.0343) = 59%

witten&eibe 42
The “zero-frequency problem”

• What if an attribute value doesn’t occur with every class value?


(e.g. “Humidity = high” for class “yes”)
– Probability will be zero! Pr[Humidity = High | yes] = 0
– The a posteriori probability will also be zero: Pr[yes | E] = 0
(No matter how likely the other values are!)
• Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
• Result: probabilities will never be zero!
(also: stabilizes probability estimates)

witten&eibe 43
Avoiding the 0-Probability Problem

• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

P(X|Ci) = ∏(k=1..n) P(xk|Ci)

• Example: suppose a dataset with 1000 tuples in which income = low (0 tuples), income = medium (990), and income = high (10)
• Use the Laplacian correction (or Laplace estimator)
– Add 1 to each count:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” probability estimates are close to their “uncorrected” counterparts
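A minimal Python sketch of the Laplacian correction for this income example (the counts come from the slide; the function name is illustrative):

```python
# Laplace (add-one) correction for P(income = v | class) over 1000 tuples.
counts = {"low": 0, "medium": 990, "high": 10}

def laplace_prob(value, counts, k=1):
    # Add k to every count; the denominator grows by k times the number of distinct values.
    total = sum(counts.values()) + k * len(counts)
    return (counts[value] + k) / total

for v in counts:
    print(v, round(laplace_prob(v, counts), 5))
# low -> 1/1003, medium -> 991/1003, high -> 11/1003
```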
Numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
• The probability density function for the normal distribution is defined by two parameters:
– Sample mean: μ = (1/n) Σ(i=1..n) xi
– Standard deviation: σ = √( (1/(n-1)) Σ(i=1..n) (xi - μ)² )
– Then the density function f(x) is

f(x) = (1 / (√(2π) σ)) · e^(-(x-μ)² / (2σ²))

(Karl Gauss, 1777-1855, great German mathematician)
Day  Outlook   Temp  Humidity  Windy  Play
1    Sunny     85    85        False  No
2    Sunny     80    90        True   No
3    Overcast  83    86        False  Yes
4    Rainy     70    96        False  Yes
5    Rainy     68    80        False  Yes
6    Rainy     65    70        True   No
7    Overcast  64    65        True   Yes
8    Sunny     72    95        False  No
9    Sunny     69    70        False  Yes
10   Rainy     75    80        False  Yes
11   Sunny     75    70        True   Yes
12   Overcast  72    90        True   Yes
13   Overcast  81    75        False  Yes
14   Rainy     71    91        True   No
Statistics for weather data

Outlook: Sunny 2/9 (yes), 3/5 (no); Overcast 4/9, 0/5; Rainy 3/9, 2/5
Temperature: yes values 64, 68, 69, 70, 72, … with μ = 73, σ = 6.2; no values 65, 71, 72, 80, 85, … with μ = 75, σ = 7.9
Humidity: yes values 65, 70, 70, 75, 80, … with μ = 79, σ = 10.2; no values 70, 85, 90, 91, 95, … with μ = 86, σ = 9.7
Windy: False 6/9 (yes), 2/5 (no); True 3/9, 3/5
Play: P(yes) = 9/14, P(no) = 5/14

• Example density value:

f(temperature = 66 | yes) = (1 / (√(2π) × 6.2)) · e^(-(66-73)² / (2 × 6.2²)) = 0.0340
witten&eibe 48
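A short Python sketch of this Gaussian density computation, with the mean and standard deviation taken from the table above:

```python
import math

def gaussian_density(x, mu, sigma):
    # Normal probability density used by naive Bayes for numeric attributes.
    return (1 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# f(temperature = 66 | yes) with mu = 73 and sigma = 6.2 for the "yes" class.
print(round(gaussian_density(66, 73, 6.2), 4))  # ~0.034
```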
Classifying a new day

• A new day:
Outlook  Temp.  Humidity  Windy  Play
Sunny    66     90        True   ?

Likelihood of “yes” = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of “no” = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
P(“yes”) = 0.000036 / (0.000036 + 0.000136) = 20.9%
P(“no”) = 0.000136 / (0.000036 + 0.000136) = 79.1%
• Missing values during training are not included in the calculation of the mean and standard deviation

witten&eibe 49
Naïve Bayes Text Classification

p(class | document) = p(class) × p(document | class) / p(document)

p(document | class) = ∏i p(wordi | class)

p(class | document) ∝ p(class) × ∏i p(wordi | class)
Naïve Bayes Text Classification

P(wordi | class) = (Tct + λ) / (Nc + λV)

where:
• Tct: the number of times the word occurs in category c
• Nc: the number of words in category c
• V: the size of the vocabulary (the number of unique words)
• λ: a positive constant, usually 1 or 0.5, to avoid zero probabilities
• You have a set of reviews (documents) and a classification:

DOC  TEXT                          CLASS
1    I loved the movie             +
2    I hated the movie             -
3    A great movie, good movie     +
4    Poor acting                   -
5    Great acting, a good movie    +

• First we need to extract the unique words:
• I, loved, the, movie, hated, a, great, poor, acting, good
• We have 10 unique words
Doc  I  loved  the  movie  hated  a  great  poor  acting  good  Class
1    1  1      1    1      0      0  0      0     0       0     +
2    1  0      1    1      1      0  0      0     0       0     -
3    0  0      0    2      0      1  1      0     0       1     +
4    0  0      0    0      0      0  0      1     1       0     -
5    0  0      0    1      0      1  1      0     1       1     +
• Take the documents with positive outcomes:

Doc  I  loved  the  movie  hated  a  great  poor  acting  good  Class
1    1  1      1    1      0      0  0      0     0       0     +
3    0  0      0    2      0      1  1      0     0       1     +
5    0  0      0    1      0      1  1      0     1       1     +

P(+) = 3/5 = 0.6

Compute P(I|+), P(loved|+), P(the|+), P(movie|+), P(hated|+), P(a|+), P(great|+), P(poor|+), P(acting|+), P(good|+).
Let P(wk|+) = (nk + 1) / (n + number of unique words), where
n = the number of words in the (+) documents = 14
nk = the number of times word k occurs in the (+) documents

• P(I|+) = (1+1) / (14+10) = 0.0833
• P(loved|+) = (1+1) / (14+10) = 0.0833
• P(the|+) = (1+1) / (14+10) = 0.0833
• P(movie|+) = (4+1) / (14+10) = 0.2083
• P(hated|+) = (0+1) / (14+10) = 0.0417
• P(a|+) = (2+1) / (14+10) = 0.125
• P(great|+) = (2+1) / (14+10) = 0.125
• P(poor|+) = (0+1) / (14+10) = 0.0417
• P(acting|+) = (1+1) / (14+10) = 0.0833
• P(good|+) = (2+1) / (14+10) = 0.125
• Take the documents with negative outcomes:

Doc  I  loved  the  movie  hated  a  great  poor  acting  good  Class
2    1  0      1    1      1      0  0      0     0       0     -
4    0  0      0    0      0      0  0      1     1       0     -

P(-) = 2/5 = 0.4

Compute P(I|-), P(loved|-), P(the|-), P(movie|-), P(hated|-), P(a|-), P(great|-), P(poor|-), P(acting|-), P(good|-).
n = the number of words in the (-) documents = 6

• P(I|-) = (1+1) / (6+10) = 0.125
• P(loved|-) = (0+1) / (6+10) = 0.0625
• P(the|-) = (1+1) / (6+10) = 0.125
• P(movie|-) = (1+1) / (6+10) = 0.125
• P(hated|-) = (1+1) / (6+10) = 0.125
• P(a|-) = (0+1) / (6+10) = 0.0625
• P(great|-) = (0+1) / (6+10) = 0.0625
• P(poor|-) = (1+1) / (6+10) = 0.125
• P(acting|-) = (1+1) / (6+10) = 0.125
• P(good|-) = (0+1) / (6+10) = 0.0625
vNB = argmax over vj of P(vj) × ∏(w in words) P(w | vj), where vj is a class value.

Let's classify a new sentence: "I hated the poor acting"

• If vj = +:
P(+) × P(I|+) × P(hated|+) × P(the|+) × P(poor|+) × P(acting|+)
= 0.6 × 0.0833 × 0.0417 × 0.0833 × 0.0417 × 0.0833
≈ 6.03 × 10^-7

• If vj = -:
P(-) × P(I|-) × P(hated|-) × P(the|-) × P(poor|-) × P(acting|-)
= 0.4 × 0.125 × 0.125 × 0.125 × 0.125 × 0.125
≈ 1.22 × 10^-5

Since 1.22 × 10^-5 > 6.03 × 10^-7, the sentence is classified as negative.
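A compact Python sketch that reproduces this text-classification example end to end, using add-one smoothing (λ = 1); the variable names are illustrative:

```python
from collections import Counter

# Training reviews from the slides.
docs = [("I loved the movie", "+"), ("I hated the movie", "-"),
        ("A great movie, good movie", "+"), ("Poor acting", "-"),
        ("Great acting, a good movie", "+")]

def tokens(text):
    return text.lower().replace(",", "").split()

vocab = {w for text, _ in docs for w in tokens(text)}     # the 10 unique words
word_counts = {c: Counter() for c in ("+", "-")}
doc_counts = Counter()
for text, c in docs:
    word_counts[c].update(tokens(text))
    doc_counts[c] += 1

def score(sentence, c, lam=1):
    # P(class) times the product of smoothed P(word | class).
    n_c = sum(word_counts[c].values())
    s = doc_counts[c] / len(docs)
    for w in tokens(sentence):
        s *= (word_counts[c][w] + lam) / (n_c + lam * len(vocab))
    return s

sentence = "I hated the poor acting"
print(score(sentence, "+"))   # ~6.0e-07
print(score(sentence, "-"))   # ~1.2e-05, so the sentence is classified as negative
```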
