Scribe: Naive Bayes Classifier

SMAI Scribe

Naive Bayes Classifier

April 2020
1 Basics of Probability
Random variable: A random variable is a numerical description of the outcome of a statistical
experiment; in other words, it is a variable whose possible values are numerical outcomes of a
random phenomenon.
Coin toss example: When we toss a coin only two outcomes are possible, namely head and tail; let
us denote them by 0 and 1 respectively.

Mathematically, X ∈ {0, 1}
Sample Space: All the possible outcomes of an experiment, or, in terms of a random variable, the
set of values that the random variable can take.
Dice roll example: Random variable X = "the score shown on the top face". X could be 1, 2,
3, 4, 5 or 6, so the sample space is {1, 2, 3, 4, 5, 6}.
Probability: In terms of a random variable it can be written as P(X = value) = probability of
that state.
Two dice roll example [1]: Let the random variable be X = "the sum of the scores on the
two dice". All possible sums and their probabilities are listed below.

There are 6 × 6 = 36 possible outcomes, and the sample space (the set of possible sums of the
scores on the two dice) is {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. Probability of each sum:

2 occurs just once, so P(X = 2) = 1/36
3 occurs twice, so P(X = 3) = 2/36 = 1/18
4 occurs three times, so P(X = 4) = 3/36 = 1/12
5 occurs four times, so P(X = 5) = 4/36 = 1/9
6 occurs five times, so P(X = 6) = 5/36
7 occurs six times, so P(X = 7) = 6/36 = 1/6
8 occurs five times, so P(X = 8) = 5/36
9 occurs four times, so P(X = 9) = 4/36 = 1/9
10 occurs three times, so P(X = 10) = 3/36 = 1/12
11 occurs twice, so P(X = 11) = 2/36 = 1/18
12 occurs just once, so P(X = 12) = 1/36
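
The distribution above can be reproduced by enumerating all 36 equally likely outcomes; a minimal sketch:

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of rolling two dice
# and count how often each sum occurs.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))

for s in sorted(counts):
    print(f"P(X = {s}) = {Fraction(counts[s], 36)}")
```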

Types of random variable [2]

Discrete random variable:


• A discrete random variable is one which may take on only a countable number of distinct
values. Let a discrete random variable X take k different values, with the probability that X
= xi defined to be P(X = xi) = pi. Then

0 ≤ pi ≤ 1 for each i, and p1 + p2 + ... + pk = 1.

Continuous random variable:

• A continuous random variable is one which takes an infinite number of possible values.
• It is defined over an interval of values, and is represented by the area under a curve. The
probability of observing any single value is equal to 0, since the number of values which may
be assumed by the random variable is infinite.

• The probability that X is in a set of outcomes A, written P(A), over an interval of real numbers is
defined to be the area above A and under a curve. The curve, which represents a function
p(x), must satisfy the following: p(x) ≥ 0 for all x, and the total area under the curve is equal to 1.
Probability distribution
A probability distribution is a function that describes the likelihood of obtaining the possible values
that a random variable can take. It can be divided into two types:
• Probability mass function, for discrete variables: these can take on a countable number of values.
• Probability density function, for continuous variables: these can assume an infinite number of
values between any two values.

Joint Probability:
Joint probability is a statistical measure that calculates the likelihood of two events occurring
together (simultaneously). Let A and B be two events over a sample space S; then the joint
probability is denoted

P(A and B), P(A, B) or P(AB)

If A and B are independent events, then

P(A and B) = P(A) ∗ P(B)

Conditional Probability: The probability of an event occurring, given that another event has
already occurred. Mathematically,

P(A|B) = P(A and B) / P(B)    (1)

P(B|A) = P(A and B) / P(A)    (2)

Bayes Theorem: From eqn (1),

P(A and B) = P(A|B) ∗ P(B)

From eqn (2),

P(A and B) = P(B|A) ∗ P(A)

Equating the two, we get

P(A|B) = P(B|A) ∗ P(A) / P(B)

where P(A|B) is the posterior probability, P(B|A) is the likelihood, and P(A) is the prior probability.
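
A quick numerical check of the rule, with hypothetical values for the prior, likelihood and evidence:

```python
# Hypothetical numbers chosen only to illustrate Bayes' rule.
p_a = 0.3          # prior probability P(A)
p_b_given_a = 0.8  # likelihood P(B|A)
p_b = 0.5          # evidence P(B)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.48
```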
Marginal Probability: The probability of particular values of a subset of the variables without
reference to the values of the other variables; in simpler words, the probability of an event irrespective
of the outcome of another variable.
Let X and Y be two random variables. Then the marginal probability of X = d is

P(X = d) = Σ_{y ∈ S_Y} P(X = d, Y = y)

where S_Y represents all the possible values of the random variable Y. In other words, we are
holding X constant (X = d) while iterating over all the possible Y values and summing up the joint
probabilities. Let X ∈ {c, d} and Y ∈ {a, b}. Then, marginalizing for X using the joint probability,

P(X = c) = P(X = c, Y = a) + P(X = c, Y = b)

Using conditional probability,

P(X = c) = P(X = c | Y = a) ∗ P(Y = a) + P(X = c | Y = b) ∗ P(Y = b)
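
A minimal sketch of this marginalization over a hypothetical joint table:

```python
# Hypothetical joint distribution P(X, Y) over X in {c, d} and Y in {a, b}.
joint = {
    ("c", "a"): 0.10, ("c", "b"): 0.30,
    ("d", "a"): 0.25, ("d", "b"): 0.35,
}

# Marginalize out Y: P(X = c) = sum over y of P(X = c, Y = y)
p_x_c = sum(p for (x, y), p in joint.items() if x == "c")
print(p_x_c)  # 0.4
```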

Gaussian Distribution / Normal Distribution / Bell Curve: It is a type of continuous
probability distribution for a real-valued random variable.
In one dimension, the general form of its probability density function is:

p(x; µ, σ²) = (1 / (σ √(2π))) · e^(−(x − µ)² / (2σ²))

The parameter µ is the mean; it defines the location of the peak of the normal distribution. The
parameter σ is the standard deviation; it defines the spread (width) of the distribution.

For the multidimensional case, its probability density function is:


p(x; µ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) · e^(−(1/2) (x − µ)^T Σ^(−1) (x − µ))    (1)
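
A small numerical sketch of these densities (the particular µ, Σ and x are hypothetical; scipy is assumed available):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# 1-D Gaussian: density of x = 1.0 under N(mu = 0, sigma = 2)
print(norm.pdf(1.0, loc=0.0, scale=2.0))

# Multivariate Gaussian: density of x under N(mu, Sigma), matching eqn (1)
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([1.0, -0.5])
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))

# Direct evaluation of the formula, for comparison with the scipy value
n = len(mu)
diff = x - mu
direct = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
    (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
print(direct)
```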

Properties: [4]
• Mean = Median = Mode

5
• Symmetric about the centre

• Most of the values lie within 2 standard deviations of the mean

2 Models
Discriminative model: It learns only a hard discriminative boundary instead of the underlying
distribution, i.e., it learns the conditional probability P(Y|X).
Examples: Logistic regression, Support Vector Machines, traditional neural networks, nearest
neighbour, Conditional Random Fields.
Generative Model: It models the actual distribution of each class, i.e., it learns the joint probability
distribution P(X, Y).
Examples: Naive Bayes, Bayesian networks, Markov random fields, Hidden Markov Models (HMM).
Parameter Estimation:
A technique to estimate the optimal model parameters of a chosen distribution using sampled data.

1. We assume the underlying distribution of our sample data
2. Estimate the optimal model parameters.
Techniques
Maximum Likelihood Estimation
MLE maximizes the probability of observing the data from the joint probability distribution, given
a specific probability distribution and its parameters [5]. Mathematically,

θ_MLE = arg max_θ p(X | θ)
      = arg max_θ ∏_i p(xi | θ)
      = arg max_θ log ∏_i p(xi | θ)
      = arg max_θ Σ_i log p(xi | θ)    (2)

(Taking the logarithm does not change the arg max, since log is a monotonically increasing function.)
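
For a Gaussian, maximizing Σ_i log p(xi | θ) has a closed-form solution, the sample mean and (biased) sample variance; a minimal sketch with hypothetical data:

```python
import numpy as np

# Hypothetical sample assumed to come from a 1-D Gaussian.
x = np.array([2.1, 1.9, 2.4, 2.0, 1.7, 2.3])

# Maximizing sum_i log p(x_i | mu, sigma^2) gives the closed-form MLE:
mu_mle = x.mean()
sigma2_mle = ((x - mu_mle) ** 2).mean()   # note: divides by N, not N-1
print(mu_mle, sigma2_mle)
```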

Maximum A Posteriori
MAP is an estimate of an unknown quantity that equals the mode of the posterior distribution.
From Bayes' rule the posterior can be expressed as

p(θ | X) = p(X | θ) ∗ p(θ) / p(X) ∝ p(X | θ) ∗ p(θ)    (3)
Thus, the mathematical expression for MAP is

θ_MAP = arg max_θ p(X | θ) ∗ p(θ)
      = arg max_θ log p(X | θ) + log p(θ)
      = arg max_θ log ∏_i p(xi | θ) + log p(θ)
      = arg max_θ Σ_i log p(xi | θ) + log p(θ)    (4)

If we assume the prior follows a uniform distribution, then log p(θ) is a constant and we can see
that MLE is a special case of MAP [6]:

θ_MAP = arg max_θ Σ_i log p(xi | θ) + log p(θ)
      = arg max_θ Σ_i log p(xi | θ) + const
      = arg max_θ Σ_i log p(xi | θ)    (5)
      = θ_MLE
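
A small Bernoulli/Beta sketch (hypothetical coin-flip data, not from the notes) makes this concrete: with a uniform Beta(1, 1) prior the MAP estimate coincides with the MLE, while an informative prior pulls it toward the prior mode.

```python
# Coin-flip data: 7 heads out of 10 tosses (hypothetical numbers).
heads, n = 7, 10

# MLE of the head probability theta: argmax of the likelihood.
theta_mle = heads / n

# MAP with a Beta(a, b) prior: mode of the posterior Beta(heads + a, n - heads + b),
# i.e. (heads + a - 1) / (n + a + b - 2).
def theta_map(a, b):
    return (heads + a - 1) / (n + a + b - 2)

print(theta_mle)          # 0.7
print(theta_map(1, 1))    # uniform prior -> 0.7, same as the MLE
print(theta_map(2, 2))    # informative prior pulls the estimate toward 0.5
```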
Anomaly Detection
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected
behavior, called outliers.
1. Build a generative model: fit a distribution to correct (normal) behaviour.

2. Check whether future events align with the distribution; if not, label them as anomalies.
Assumption: We make an assumption of independence between the features, i.e.,

p(x1, x2, x3, x4, ..., xn) = p(x1) ∗ p(x2) ∗ p(x3) ∗ ... ∗ p(xn)

Now we can fit a univariate Gaussian to each of the dimensions, as below:

p(x1, x2, x3, ..., xn) = p(x1 | µ1, σ1²) ∗ p(x2 | µ2, σ2²) ∗ ... ∗ p(xn | µn, σn²)

as illustrated by the plots in [8].
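
A minimal Python sketch of this per-feature fitting and density check (the data and the threshold ε here are hypothetical; scipy is assumed available):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical "normal behaviour" training data: 500 examples, 2 features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))

# Fit an independent univariate Gaussian to each feature.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

def p(x):
    # p(x) = product over features of p(x_j | mu_j, sigma_j^2)
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

epsilon = 1e-4  # assumed threshold; in practice tuned on a validation set
x_new = np.array([5.2, 18.0])
print("anomaly" if p(x_new) < epsilon else "normal")
```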

The above technique will fail when we observe a large correlation between our features, since we
made an assumption of independence.

The reason is that the model makes probability predictions in concentric, axis-aligned contours
around the feature means. The probabilities of the two red-circled examples in the referenced plot
are basically the same, even though we can clearly see that the green one is an outlier.

Solution: We apply a multivariate Gaussian distribution, i.e., we model p(x) all in one go instead of
modelling each feature separately:

p(x; µ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) · e^(−(1/2) (x − µ)^T Σ^(−1) (x − µ))    (6)

Σ is the covariance matrix, which captures the correlation between all the features.

The contours now produced capture the mean and covariance of both features, so we are now
likely to identify the green value as anomalous [9].
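
A sketch of the multivariate variant under the same assumptions (hypothetical data and threshold; scipy's multivariate_normal is assumed available):

```python
import numpy as np
from scipy.stats import multivariate_normal

# X_train as before: rows are examples, columns are features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))

mu = X_train.mean(axis=0)
Sigma = np.cov(X_train, rowvar=False)   # captures feature correlations

model = multivariate_normal(mean=mu, cov=Sigma)
epsilon = 1e-4  # assumed threshold
x_new = np.array([5.2, 18.0])
print("anomaly" if model.pdf(x_new) < epsilon else "normal")
```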

Disadvantages of not making the independence assumption (i.e., of using the full multivariate Gaussian):

• The multivariate Gaussian does not work well as the number of dimensions increases.


• Higher dimensions mean a bigger covariance matrix, and hence more computational overhead.
• Computing the inverse of the covariance matrix requires more computation, and the inverse
might not even exist.

3 Naive Bayes Classifier
The goal of any probabilistic classifier is:
Given features X = {x0, x1, ..., xn} and classes C = {c0, ..., ck}, determine the probability of the
features occurring in each class, and return the most likely class. Therefore, for each class, we
want to calculate P(ci | x0, x1, ..., xn).
In order to do this we use Bayes' rule,

P(A|B) = P(B|A) ∗ P(A) / P(B)

with A being the class ci and B the set of features x0, ..., xn. Here P(B) serves only as a normalization
term, and since we are usually unable to calculate P(x0, ..., xn) anyway, we can simply ignore it, and hence

P(ci | x0, ..., xn) ∝ P(x0, ..., xn | ci) ∗ P(ci)

3.1 An important assumption


To simplify the computation, an assumption is made that the features x0, ..., xn are conditionally
independent given ci. This allows us to write the conditional probability as

P(x0, ..., xn | ci) = P(x0 | ci) ∗ P(x1 | ci) ∗ ... ∗ P(xn | ci)

Hence the final class probability becomes

P(ci | x0, ..., xn) ∝ P(ci) ∏_{j=1}^{n} P(xj | ci)

3.1.1 Learning
Calculating the individual P(xj | ci) terms depends on what distribution your features follow.
E.g., in the context of text classification, where features may be word counts, the features may follow
a multinomial distribution. In other cases, where the features are continuous, they may follow
a Gaussian distribution.

3.2 Classification
Given the estimate of the probability of a given data point falling in a certain class, we need to
be able to use this to produce classifications. Naive Bayes does this by picking the ci that has the
largest probability given the data point’s features.
y = arg max_{ci} P(ci) ∏_{j=1}^{n} P(xj | ci)

This is referred to as the Maximum A Posteriori decision rule because we only use the P (B|A)
and P (A) terms, which are the likelihood and prior terms, respectively.
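
A compact from-scratch sketch of this decision rule for categorical features (the toy data, feature names and helper function are hypothetical; smoothing is omitted here and discussed in Section 3.4):

```python
from collections import Counter, defaultdict

# Toy training data: each row is (features, class). Purely illustrative.
data = [
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "sunny", "windy": "false"}, "no"),
    ({"outlook": "overcast", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
    ({"outlook": "overcast", "windy": "true"}, "yes"),
]

# Estimate the prior P(c) and conditionals P(x_j = v | c) by counting.
class_counts = Counter(c for _, c in data)
cond_counts = defaultdict(Counter)          # (class, feature) -> value counts
for x, c in data:
    for feat, val in x.items():
        cond_counts[(c, feat)][val] += 1

def predict(x):
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(data)                              # prior P(c)
        for feat, val in x.items():
            score *= cond_counts[(c, feat)][val] / n_c       # P(x_j | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict({"outlook": "sunny", "windy": "true"}))   # -> "no"
```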

3.3 Estimating Parameters: Y, Given Discrete Xi

Frequencies and probabilities for the weather data:

Classify the following instance:

Idea: compute a probability for each class based on the probability distribution in the
training data.

1. Take into account the probability of each attribute. Treat all attributes as equally important,
i.e., multiply the probabilities:

P(x0, ..., xn | ci) = P(x0 | ci) ∗ P(x1 | ci) ∗ ... ∗ P(xn | ci)

i.e., writing x for the new instance,

P(x | yes) = 2/9 ∗ 3/9 ∗ 3/9 ∗ 3/9 = 0.0082

P(x | no) = 3/5 ∗ 1/5 ∗ 4/5 ∗ 3/5 = 0.0576
2. Take into account the overall probability of a given class:

P(ci) ∏_{j=1}^{n} P(xj | ci)

i.e.,

P(yes | x) ∝ 0.0082 ∗ 9/14 = 0.0053

P(no | x) ∝ 0.0576 ∗ 5/14 = 0.0206
3. Choose the class that maximizes this quantity. That means the new instance is
classified as no.

y = arg max_{ci ∈ {yes, no}} P(ci) ∏_{j=1}^{n} P(xj | ci)
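
The arithmetic above can be verified directly, using only the numbers already given:

```python
# Check of the worked example, using the conditional probabilities read off the table.
likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9)   # P(x | yes)
likelihood_no = (3/5) * (1/5) * (4/5) * (3/5)    # P(x | no)

score_yes = likelihood_yes * (9/14)              # proportional to P(yes | x)
score_no = likelihood_no * (5/14)                # proportional to P(no | x)

print(round(likelihood_yes, 4), round(likelihood_no, 4))  # 0.0082 0.0576
print(round(score_yes, 4), round(score_no, 4))            # 0.0053 0.0206
print("yes" if score_yes > score_no else "no")            # -> no
```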

3.4 Issues with Naive Bayes classifier


1. Zero-frequency problem: if there are no occurrences of a class label and a certain attribute
value together, then the frequency-based probability estimate will be zero.

Here P(overcast | No) has zero occurrences in the frequency table, which will result in a zero
posterior. The solution is to shift the zero frequencies slightly higher by adding one to the count
of every attribute value-class combination (Laplace / add-one smoothing); see the sketch after
this list.
2. Underflow: having a lot of unique features creates probabilities with small values, and multiplying
these probabilities may result in floating-point underflow. To prevent this, we sum
the logs of the probabilities instead of multiplying the probabilities (see the sketch after this list):

log P(x0, ..., xn | ci) = log ∏_{j=1}^{n} P(xj | ci) = Σ_{j=1}^{n} log P(xj | ci)

And the classification rule becomes

y = arg max_{ci} [ log P(ci) + Σ_{j=1}^{n} log P(xj | ci) ]

3. Independence assumption: in practical situations the predictors (features) are often not
independent, and this dependency is not exploited by the classifier. For classification tasks this
often isn't a problem; for regression, however, the independence assumption does become an issue.
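
A minimal sketch combining the first two fixes, add-one (Laplace) smoothing against zero frequencies and log-space scoring against underflow (all counts and probabilities here are hypothetical):

```python
import math

# Hypothetical counts for one attribute within class "no":
# "overcast" never occurs with "no", so its raw estimate would be 0.
counts_no = {"sunny": 3, "overcast": 0, "rainy": 2}
n_no = sum(counts_no.values())          # 5
k = len(counts_no)                      # 3 possible attribute values

# Zero-frequency fix: add 1 to every attribute value-class count.
p_overcast_no = (counts_no["overcast"] + 1) / (n_no + k)   # 1/8 instead of 0

# Underflow fix: sum log-probabilities instead of multiplying probabilities.
probs = [1e-5] * 1000                        # many small conditional probabilities
product = math.prod(probs)                   # underflows to 0.0
log_score = sum(math.log(p) for p in probs)  # about -11512.9, well behaved

print(p_overcast_no, product, log_score)
```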

4 References
1. "Random Variables", www.mathsisfun.com/data/random-variables.html

2. Valerie J. Easton and John H. McColl, Statistics Glossary v1.1

3. "Normal Distribution in Statistics", statisticsbyjim.com/basics/normal-distribution/

4. "Normal Distribution", www.mathsisfun.com/data/standard-normal-distribution.html

5. "Maximum likelihood estimation", en.wikipedia.org/wiki/Maximum_likelihood_estimation

6. "Maximum Likelihood Estimation VS Maximum A Posteriori", towardsdatascience.com/mle-vs-map-a989f423ae5c

7. Machine Learning - Andrew Ng, www.coursera.org/learn/machine-learning

8. Anomaly Detection, www.holehouse.org/mlclass/15_Anomaly_Detection.html

9. Naive Bayes Classifier, towardsdatascience.com/naive-bayes-classifier-81d512f50a7c

10. Naive Bayes example taken from: www2.cs.uh.edu/~arjun/courses/nlp/naive_bayes_keller.pdf
