Classification
By Naveen Aggarwal


Statistical Learning
Data – instantiations of some or all of the random variables
describing the domain; they are evidence
Hypotheses – probabilistic theories of how the domain works
The surprise candy example: candies of two flavors (cherry and lime) come in very large bags
of 5 kinds, indistinguishable from the outside
h1: 100% cherry – P(c|h1) = 1, P(l|h1) = 0
h2: 75% cherry + 25% lime
h3: 50% cherry + 50% lime
h4: 25% cherry + 75% lime
h5: 100% lime

Problem formulation
Given a new bag, the random variable H denotes the bag
type (h1 – h5); Di is a random variable whose value is cherry or lime;
after seeing d1, d2, …, dN, predict the flavor (value) of DN+1.

Example:
The hypothesis prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1}
Data:
Q1: After seeing d1, what is P(hi|d1)?
Q2: After seeing d1, what is P(d2 = lime | d1)?
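
To make Q1 and Q2 concrete, here is a minimal Python sketch (not part of the original slides) that computes the posterior over h1–h5 and the predictive probability for the next candy, assuming the first observed candy is lime:

```python
# A minimal sketch of the candy-bag calculation: posterior over the hypotheses
# after observing candies, and the Bayesian prediction for the next candy.
# The priors and per-hypothesis lime probabilities are the ones given above.

priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def posterior(observations):
    """P(hi | d1..dN) for a sequence of 'cherry'/'lime' observations."""
    unnorm = {}
    for h, prior in priors.items():
        likelihood = 1.0
        for d in observations:
            likelihood *= p_lime[h] if d == "lime" else (1.0 - p_lime[h])
        unnorm[h] = prior * likelihood
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

def predict_lime(observations):
    """P(d_{N+1} = lime | d1..dN) = sum_i P(lime | hi) * P(hi | d1..dN)."""
    post = posterior(observations)
    return sum(p_lime[h] * post[h] for h in post)

print(posterior(["lime"]))        # Q1: P(hi | d1 = lime) = {0, 0.1, 0.4, 0.3, 0.2}
print(predict_lime(["lime"]))     # Q2: P(d2 = lime | d1 = lime) = 0.65
```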


Bayesian Classification: Why?


A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has
performance comparable to decision tree and selected neural network
classifiers
Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct — prior knowledge can be combined
with observed data
Standard: Even when Bayesian methods are computationally intractable,
they can provide a standard of optimal decision making against which other
methods can be measured


Bayesian Theorem: Basics


Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the
observed data sample X
P(H) (prior probability): the initial probability
E.g., the probability that X will buy a computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the
hypothesis holds
E.g., given that X will buy a computer, the probability that X is 31..40 with medium income


Bayesian Theorem
Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the Bayes
theorem

P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be written as
posteriori = likelihood x prior/evidence
Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for
the k classes
Practical difficulty: require initial knowledge of many probabilities, significant
computational cost


Towards Naïve Bayesian Classifier


• Let D be a training set of tuples and their associated class labels, and each
tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Since P(X) is constant for all classes, only
P(Ci|X) ∝ P(X|Ci) P(Ci)
needs to be maximized
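
As a minimal illustration (the helper names and arguments below are assumptions, not from the slides), the decision rule is just an argmax over P(X|Ci) · P(Ci):

```python
# A minimal sketch of the decision rule: pick the class that maximizes
# P(X|Ci) * P(Ci); the evidence P(X) is ignored since it is the same for
# every class.

def classify(x, classes, prior, likelihood):
    """prior[c] = P(c); likelihood(x, c) = P(x | c)."""
    return max(classes, key=lambda c: likelihood(x, c) * prior[c])
```

Everything that follows is about how the class-conditional probability P(X|Ci) is estimated.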


Derivation of Naïve Bayes Classifier


• A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• This greatly reduces the computation cost: Only counts the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak
divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) · exp( −(x − μ)² / (2σ²) )
and P(xk|Ci) = g(xk, μCi, σCi)
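
For instance, a small sketch of the Gaussian density for a continuous attribute (the attribute value 35 and the class statistics mu = 38, sigma = 12 are assumed, illustrative numbers, not from the slides):

```python
# Gaussian class-conditional density for a continuous attribute, assuming the
# class mean and standard deviation have been estimated from the tuples of Ci.
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

p_age_given_ci = gaussian(35, 38, 12)   # P(age = 35 | Ci) under the fitted Gaussian
```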


Naïve Bayesian Classifier: Training Dataset


Classes:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no

Naïve Bayesian Classifier: An Example


P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357

Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci): P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
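
The same calculation can be reproduced programmatically. The sketch below is not part of the original slides; the tuple list simply re-encodes the training table above. It counts the relevant frequencies and compares P(X|Ci) · P(Ci) for the two classes:

```python
# Naive Bayes on the buys_computer example: estimate P(Ci) and each P(xk|Ci)
# by counting, then compare P(X|Ci) * P(Ci) for the two classes.
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")             # the sample X to classify

class_counts = Counter(r[-1] for r in rows)
scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(rows)                       # P(Ci)
    for k, value in enumerate(x):                 # multiply in each P(xk|Ci)
        n_match = sum(1 for r in rows if r[-1] == c and r[k] == value)
        score *= n_match / n_c
    scores[c] = score

print(scores)                          # approximately {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))     # 'yes'
```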


Avoiding the 0-Probability Problem


• Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the predicted prob. will
be zero
n
P ( X | C i)   P ( x k | C i)
k  1

• Ex. Suppose a dataset with 1000 tuples: income = low (0 tuples), income = medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their “uncorrected” counterparts
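
A minimal sketch of the correction for the income example (the counts are the ones given above):

```python
# Laplacian correction: add 1 to each of the q value-counts and q to the
# denominator, so no estimated probability is exactly zero.
counts = {"low": 0, "medium": 990, "high": 10}        # raw counts out of 1000 tuples

def laplace_probs(counts):
    q = len(counts)                                   # number of distinct values
    total = sum(counts.values()) + q
    return {v: (n + 1) / total for v, n in counts.items()}

print(laplace_probs(counts))   # {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}
```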


Naïve Bayesian Classifier: Comments


Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
• E.g., in hospital patient data: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by the naïve Bayesian classifier
How to deal with these dependencies?
Bayesian Belief Networks


Bayesian Belief Networks


A Bayesian belief network allows a subset of the variables to be conditionally
independent
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of the joint probability distribution
Nodes: random variables
Links: dependency
Example (figure): X and Y are the parents of Z, and Y is the parent of P;
there is no direct dependency (link) between Z and P
The network has no loops or cycles


Bayesian Belief Network: An Example

(Figure: a belief network over Family History, Smoker, LungCancer, Emphysema,
PositiveXRay, and Dyspnea; Family History (FH) and Smoker (S) are the parents of LungCancer.)

The conditional probability table (CPT) for the variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

The CPT gives the conditional probability of LungCancer for each possible
combination of values of its parents.
Derivation of the probability of a particular combination of values x1, …, xn of X
from the CPTs:

P(x1, …, xn) = ∏(i=1..n) P(xi | Parents(Yi))

Making Statistical Inferences


Bayesian –
predictions are made using all hypotheses, weighted by their posterior probabilities

MAP – maximum a posteriori
uses the single most probable hypothesis to make the prediction
often much easier than full Bayesian prediction; as we get more and more data,
it gets closer to the Bayesian-optimal prediction

ML – maximum likelihood
assumes a uniform prior over H
appropriate when (1) there is no preferable hypothesis a priori and (2) the data set is large
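
Continuing the candy-bag example, a small sketch contrasting the Bayesian and MAP predictions; the posterior values below are the ones obtained after a single lime observation in the earlier sketch:

```python
# Bayesian prediction averages over all hypotheses; MAP commits to the single
# most probable one.
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}
posterior = {"h1": 0.0, "h2": 0.1, "h3": 0.4, "h4": 0.3, "h5": 0.2}

# Bayesian: weight each hypothesis's prediction by its posterior probability
bayes_pred = sum(posterior[h] * p_lime[h] for h in posterior)        # 0.65

# MAP: use only the single most probable hypothesis
h_map = max(posterior, key=posterior.get)                            # h3
map_pred = p_lime[h_map]                                             # 0.5

print(bayes_pred, h_map, map_pred)
```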


Learning with Complete Data


Parameter learning - to find the numerical parameters for a
probability model whose structure is fixed
Data are complete when each data point contains values for every
variable in the model
Maximum-likelihood parameter learning: discrete model
With complete data, the ML parameter-learning problem for a Bayesian
network decomposes into separate learning problems, one for each
parameter
A significant problem with ML learning: probability 0 may be assigned to
events that have not been observed
• Various tricks are used to avoid this problem; one is to start each event's
count at 1 instead of 0

Learning with Hidden Variables


Many real-world problems have hidden variables
which are not observable in the data available for
learning.
Question: If a variable (disease) is not observed,
why not construct a model without it?
Answer: Hidden variables can dramatically reduce
the number of parameters required to specify a
Bayesian network, which in turn reduces the
amount of data needed for learning.

EM: Learning mixtures of Gaussians


The unsupervised clustering problem
P(x) = Σ(i=1..k) P(C=i) P(x|C=i)
If we knew which component generated each xj, we could estimate that component's
parameters; if we knew the parameters of each component, we would know which
component each xj belongs to. However, we know neither, …
EM – expectation and maximization
Pretend we know the parameters of the model, then infer the probability
that each xj belongs to each component; iterate until convergence.
For the mixture of Gaussians, initialize the mixture-model parameters
arbitrarily and iterate the following two steps:
E-step: compute pij = P(C=i | xj) = α P(xj | C=i) P(C=i), and pi = Σj pij
M-step: compute the new μi = Σj pij xj / pi, Σi = Σj pij xj xjᵀ / pi, wi = pi / N (component weight, N = number of examples)


EM Algorithm steps
E-step computes the expected value pij of the hidden
indicator variables Zij, where Zij is 1 if xj was generated by the
i-th component and 0 otherwise
M-step finds the new values of the parameters that
maximize the log likelihood of the data, given the
expected values of Zij
• EM increases the log likelihood of the data at each iteration.
• EM can reach a local maximum in likelihood.
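
Putting the two steps together, here is a minimal sketch of EM for a one-dimensional mixture of k Gaussians (an illustration, not code from the slides):

```python
# EM for a 1-D mixture of k Gaussians.
# E-step: responsibilities pij; M-step: re-estimate weights, means, variances.
import math
import random

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, k, iters=50):
    mus = random.sample(xs, k)          # arbitrary initialization, as above
    vars_ = [1.0] * k
    ws = [1.0 / k] * k
    for _ in range(iters):
        # E-step: pij = P(C=i | xj), normalized over the k components
        resp = []
        for x in xs:
            scores = [ws[i] * gauss(x, mus[i], vars_[i]) for i in range(k)]
            z = sum(scores)
            resp.append([s / z for s in scores])
        # M-step: new parameters from the expected assignments
        for i in range(k):
            pi = sum(r[i] for r in resp)                           # effective count
            mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / pi
            vars_[i] = max(sum(r[i] * (x - mus[i]) ** 2
                               for r, x in zip(resp, xs)) / pi, 1e-6)
            ws[i] = pi / len(xs)                                   # component weight
    return ws, mus, vars_

# e.g. data drawn from two clusters, around 0 and around 5
data = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
print(em_gmm(data, 2))
```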


Lazy vs. Eager Learning


Lazy vs. eager learning
Lazy learning (e.g., instance-based learning): Simply stores training data
(or only minor processing) and waits until it is given a test tuple
Eager learning (the methods discussed above): given a training set,
constructs a classification model before receiving new (e.g., test) data to
classify
Lazy: less time in training but more time in predicting
Accuracy
Lazy method effectively uses a richer hypothesis space since it uses many
local linear functions to form its implicit global approximation to the
target function
Eager: must commit to a single hypothesis that covers the entire instance
space

Lazy Learner: Instance-Based Methods


Instance-based learning:
Store training examples and delay the processing (“lazy evaluation”) until
a new instance must be classified
Typical approaches
k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
Locally weighted regression
• Constructs local approximation
Case-based reasoning
• Uses symbolic representations and knowledge-based inference


Nearest-neighbor models
The key idea: Neighbors are similar
Density estimation example: estimate x’s probability density by the density of its
neighbors
Connecting with table lookup, NBC, decision trees, …
How to define the neighborhood N (see the sketch below)
If it is too small, it contains no data points
If it is too big, the estimated density is the same everywhere
A solution is to define N to contain k points, where k is large enough to ensure a
meaningful estimate
• For a fixed k, the size of N varies
• The choice of k affects the estimate
• For most low-dimensional data, k is usually between 5 and 10
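
A minimal sketch of k-nearest-neighbor density estimation in one dimension, under the assumption that the density near x is approximately k / (n · V), where V is the size of the smallest neighborhood around x containing k of the n points:

```python
# k-NN density estimate in 1-D: density(x) ~= k / (n * V).
import random

def knn_density(x, data, k=10):
    dists = sorted(abs(x - d) for d in data)
    radius = dists[k - 1]                 # neighborhood just large enough to hold k points
    volume = 2 * radius                   # length of the interval [x - r, x + r]
    return k / (len(data) * volume)

sample = [random.gauss(0, 1) for _ in range(1000)]
print(knn_density(0.0, sample))           # roughly the N(0,1) peak density, ~0.4
```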


The k-Nearest Neighbor Algorithm


All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
Target function could be discrete- or real- valued
For discrete-valued, k-NN returns the most common value among the k training
examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training
examples

(Figure: Voronoi diagram of the 1-NN decision surface around a query point xq, with positive and negative training examples.)

Discussion on the k-NN Algorithm


k-NN for real-valued prediction for a given unknown tuple
Returns the mean of the values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors according to their distance to
the query xq
w 1
• Give greater weight to closer neighbors d ( xq , x )2
i
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be dominated by
irrelevant attributes
To overcome it, stretch the axes or eliminate the least relevant attributes
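
A minimal sketch of distance-weighted k-NN classification using w = 1/d(xq, xi)²; the example data and helper names are assumptions, not from the slides:

```python
# Distance-weighted k-NN: each of the k nearest neighbors votes with weight
# w = 1 / d(xq, xi)^2, so closer neighbors count for more.
import math
from collections import defaultdict

def knn_classify(xq, examples, k=5):
    """examples: list of (feature_vector, label) pairs; xq: query vector."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    neighbors = sorted(examples, key=lambda e: dist(xq, e[0]))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = dist(xq, x)
        votes[label] += 1.0 / (d ** 2 + 1e-9)     # avoid division by zero at d = 0
    return max(votes, key=votes.get)

# e.g. two clusters in 2-D
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_classify((0.5, 0.5), train, k=3))       # -> "A"
```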


Thanks
