Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester
Unit V
Course Objective: To familiarize students with machine learning and enable them to apply
suitable machine learning techniques for handling data and gaining knowledge from it, to
evaluate the performance of algorithms, and to provide solutions for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO5): Analyze the co-occurrence of data to find interesting frequent
patterns, and preprocess the data before applying it to any real-world problem for evaluation.
PROBABILISTIC LEARNING
Probabilistic classification learning is a form of implicit learning in which cues are
probabilistically associated with outcomes and participants process the associations
without explicit awareness. A probabilistic classifier is a classifier that is able to predict,
given an observation of an input, a probability distribution over a set of classes, rather
than only outputting the single most likely class that the observation should belong to.
Probabilistic classifiers provide classification that can be useful in its own right or when
combining classifiers into ensembles.
A probabilistic method or model is based on the theory of probability, or on the fact that
randomness plays a role in predicting future events. The opposite is a deterministic
method, which tells us that something can be predicted exactly, without the added
complication of randomness.
BAYESIAN LEARNING
Bayesian machine learning follows three steps:
1. Define a model using a "generative process" for the data, i.e., a sequence of steps
describing how the data was created.
2. Incorporate prior beliefs about the model parameters, which take the form of
distributions over the values that the parameters might take.
3. After running the learning algorithm, the user is left with an updated belief about the
parameters, i.e., a new distribution over the parameters.
Bayesian learning is most useful when:
• The user has prior beliefs about unknown model parameters or explicit information
about data generation, i.e., useful information the user wants to incorporate.
• The user has few data or many unknown model parameters, and it is hard to get an
accurate result with the data alone (without the added structure or information).
• The user wants to capture the uncertainty about the result, i.e., how sure or unsure the
model is, instead of only a single "best" result.
Suppose the user grabs a carton of milk from the fridge, sees that it is seven days past the
expiration date, and wants to know if the milk is still good or if it has gone bad. A quick
internet search leads him to believe that there is roughly a 50-50 chance that the milk is
still good. This is his prior belief (Figure 1).
From past experience, the user has some knowledge about how smelly milk gets when it
has gone bad. Specifically, suppose he rates smelliness on a scale of 0-10 (0 being no
smell and 10 being completely rancid) and has probability distributions over the
smelliness of good milk and of bad milk (Figure 2).
Here is how Bayesian learning works: when he gets some data, i.e., when he smells the
milk (Figure 3), he can apply the machinery of Bayesian inference (Figure 4) to
compute an updated belief about whether the milk is still good or has gone bad (Figure
5).
For example, if the user observes that the milk is about a 5 out of 10 on the smelliness
scale, he can then use Bayesian learning to factor in his prior beliefs and the distributions
over the smelliness of good vs. bad milk, and return an updated belief: there is now a 33%
chance that the milk is still good and a 67% chance that the milk has gone bad.
Consider a program that computes an updated belief about whether the milk has gone bad
whenever the user smells the milk. The program will do the following:
1. Encode the prior beliefs about whether the milk is still good or has gone bad, and the
probability distributions over the smelliness of good vs. bad milk.
2. Smell the milk and give this observation as an input to the program.
3. Run the inference algorithm to output the updated belief about whether the milk is still
good or has gone bad.
For a Bayesian model, the user needs to mathematically derive an inference algorithm, i.e.,
the learning algorithm that computes the final distribution over beliefs given the data.
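A minimal Python sketch of such a program, assuming the 50-50 prior from the example; the
smelliness distributions of good and bad milk are modelled here as Gaussians with made-up
parameters (the actual distributions in Figures 2-5 are not reproduced), so the printed numbers
are illustrative rather than the exact 33%/67% quoted above:

from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Likelihood of a smelliness reading under a normal distribution."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))

# Step 1: encode prior beliefs and (assumed) smelliness distributions.
prior_good, prior_bad = 0.5, 0.5      # 50-50 prior from the example
good_mean, good_std = 2.0, 1.5        # illustrative: good milk smells little
bad_mean, bad_std = 7.0, 1.5          # illustrative: bad milk smells a lot

# Step 2: observe the data (smelliness on the 0-10 scale).
smell = 5.0

# Step 3: Bayesian inference - update the belief with Bayes' theorem.
like_good = gaussian_pdf(smell, good_mean, good_std)
like_bad = gaussian_pdf(smell, bad_mean, bad_std)
evidence = like_good * prior_good + like_bad * prior_bad
posterior_good = like_good * prior_good / evidence

print(f"P(milk is good | smell={smell}) = {posterior_good:.2f}")
print(f"P(milk is bad  | smell={smell}) = {1 - posterior_good:.2f}")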
The equation below shows how to calculate the conditional probability of a new instance (vj)
given the training data (D) and a space of hypotheses (H):

P(vj | D) = Σ (over hi in H) P(vj | hi) P(hi | D)

where vj is a new instance to be classified, H is the set of hypotheses for classifying the
instance, hi is a given hypothesis, P(vj | hi) is the probability of vj given hypothesis hi, and
P(hi | D) is the posterior probability of the hypothesis hi given the data D.
Selecting the outcome with the maximum probability is an example of a Bayes optimal
classification.
Any system that classifies new instances according to the equation above is called a Bayes
optimal classifier, or Bayes optimal learner. No other classification method using the
same hypothesis space and same prior knowledge can outperform this method on
average. Although the classifier makes optimal predictions, it is not perfect given the
uncertainty in the training data and incomplete coverage of the problem domain and
hypothesis space. As such, the model will make errors. These errors are often referred
to as Bayes errors. Because the Bayes classifier is optimal, the Bayes error is the
minimum possible error.
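A small sketch of the Bayes optimal rule above; the hypothesis space and the posterior values
P(hi | D) are made-up assumptions, chosen only to illustrate that the weighted combination of
all hypotheses can disagree with the single most probable hypothesis:

# Bayes optimal classification: P(v | D) = sum over hypotheses h of P(v | h) * P(h | D)

# Assumed posteriors over three hypotheses given the training data D.
p_h_given_D = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Assumed class predictions of each hypothesis: P(v | h) for v in {"+", "-"}.
p_v_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

# Combine all hypotheses, weighted by their posterior probability.
p_v_given_D = {
    v: sum(p_v_given_h[h][v] * p_h_given_D[h] for h in p_h_given_D)
    for v in ("+", "-")
}

best = max(p_v_given_D, key=p_v_given_D.get)
print(p_v_given_D)                    # {'+': 0.4, '-': 0.6}
print("Bayes optimal class:", best)   # '-' even though h1 alone is the most probable hypothesis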
The Naive Bayes classifier applies Bayes' theorem with the "naive" assumption that the
features are independent of each other given the class. For example, a fruit may be
considered an apple if it is red, round, and about 3 inches in diameter. Even if these features
depend on each other or upon the existence of the other features, all of these properties are
treated as independently contributing to the probability that the fruit is an apple, and that is
why it is known as 'Naive'.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with its simplicity, Naive Bayes is known to perform well even compared with highly
sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c). The equation is given below:

P(c|x) = P(x|c) P(c) / P(x)
where,
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, i.e., the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
A training data set of weather conditions and the corresponding target variable 'Play'
(indicating the possibility of playing) is given below. The user needs to classify whether
players will play or not based on the weather condition. The steps to perform the
classification are:
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, e.g., Overcast probability =
0.29 and probability of playing = 0.64.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability,
so the prediction is that the players will play.
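The same calculation can be written as a short Python check using only the counts quoted
above:

# Naive Bayes for the weather / Play example:
# P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = 3 / 9     # likelihood from the likelihood table
p_yes = 9 / 14                # prior probability of playing
p_sunny = 5 / 14              # evidence

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny   # the two posteriors sum to one

print(f"P(Yes | Sunny) = {p_yes_given_sunny:.2f}")   # ~0.60
print(f"P(No  | Sunny) = {p_no_given_sunny:.2f}")    # ~0.40 -> predict 'Yes'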
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and in problems with
multiple classes.
• Real-time prediction: Naive Bayes is an eager learning classifier and it is fast, so it can
be used for making predictions in real time.
• Multi-class prediction: This algorithm is also well known for its multi-class prediction
feature; it can predict the probability of multiple classes of the target variable.
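As the text notes, Naive Bayes is widely used for text classification. Below is a minimal
scikit-learn sketch; the tiny corpus, labels, and class names are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: spam vs. ham (texts and labels are illustrative only).
texts = ["win money now", "cheap pills offer", "meeting at noon", "project status update"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)          # bag-of-words counts

clf = MultinomialNB()
clf.fit(X, labels)

new = vectorizer.transform(["win a cheap offer"])
print(clf.predict(new))          # likely ['spam']
print(clf.predict_proba(new))    # posterior probabilities over both classes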
The probabilities in a belief network are calculated using the following joint-distribution
formula:

P(X1, X2, ..., Xn) = Π (over i) P(Xi | Parents(Xi))

To be able to calculate the joint distribution, one needs the conditional probabilities
indicated by the network.
A Bayesian network can be used for building models from data and experts' opinions, and
it consists of two parts:
• a directed acyclic graph, and
• a table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems
under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
• Each node corresponds to a random variable, and a variable can be continuous or
discrete.
• An arc represents that one node directly influences the other node; if there is no directed
link between two nodes, they are independent of each other.
• If there is a directed arrow from node A to node B, then node A is called the parent of
node B.
A Bayesian network has two main components:
• Causal component
• Actual numbers
Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
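A small sketch of this factorisation on a made-up two-node network Rain -> WetGrass; the
structure and the conditional probability values are assumptions for illustration:

# Bayesian network: Rain -> WetGrass (illustrative structure and probabilities).
p_rain = {True: 0.2, False: 0.8}                      # P(Rain)
p_wet_given_rain = {                                  # P(WetGrass | Rain)
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}

def joint(rain, wet):
    """P(Rain=rain, WetGrass=wet) = P(Rain) * P(WetGrass | Rain)."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# The joint distribution must sum to 1 over all assignments.
total = sum(joint(r, w) for r in (True, False) for w in (True, False))
print(joint(True, True))   # 0.18
print(total)               # 1.0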
The strength of an association rule between two items (for instance, item1 and item2), also
called the confidence of the rule, is the number of transactions containing both item1 and
item2 divided by the number of transactions containing item1.
The confidence metric therefore estimates the likelihood that a transaction containing item1
will also include item2.
The lift is the ratio of the frequency with which the two items are observed together to the
frequency expected if they were independent (the product of their individual frequencies). A
lift greater than 1 means that item1 and item2 are more likely to be present together in
transactions, while values below 1 apply to cases where the two items are rarely associated.
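A small sketch computing confidence and lift for one pair of items from a list of transactions;
the transactions themselves are made up:

# Illustrative transactions (each is a set of purchased items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(items):
    """Fraction of transactions that contain all the given items."""
    return sum(items <= t for t in transactions) / n

item1, item2 = {"bread"}, {"milk"}
conf = support(item1 | item2) / support(item1)                     # P(item2 | item1)
lift = support(item1 | item2) / (support(item1) * support(item2))

print(f"confidence(bread -> milk) = {conf:.2f}")   # 0.67
print(f"lift(bread, milk)         = {lift:.2f}")   # 0.83 (< 1: slightly negatively associated)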
Apriori algorithm
The Apriori algorithm uses data organized in a horizontal layout. It is founded on the fact
that if a subset S appears k times in a database, any other subset S1 which contains S will
appear k times or less. This implies that, when deciding on a minimum support threshold
(the minimum frequency an itemset needs to have in order not to be discarded), we can avoid
counting S1 or any other superset of S if support(S) < minimum support. It can be said that
all such candidates are discarded a priori.
The algorithm computes the counts for all itemsets of k elements (starting with k = 1).
During the next iterations the previous sets are joined, creating all possible (k + 1)-itemsets.
Combinations appearing at a frequency below the minimum support are discarded. The
iterations end when no further extensions (joins) can be found.
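A compact sketch of the Apriori iterations described above; the transactions and the minimum
support threshold are made up, and a real application would more likely use a dedicated library:

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
min_support = 3  # minimum number of transactions an itemset must appear in

def count(itemset):
    return sum(itemset <= t for t in transactions)

# k = 1: start with frequent single items.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if count(frozenset([i])) >= min_support]
all_frequent = list(frequent)

k = 2
while frequent:
    # Join step: build candidate k-itemsets from the frequent (k-1)-itemsets.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step: supersets of an infrequent set can never be frequent,
    # so only candidates meeting the minimum support are kept.
    frequent = [c for c in candidates if count(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

for s in all_frequent:
    print(set(s), count(s))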
Eclat algorithm
The Eclat (Equivalence Class Clustering and bottom-up Lattice Traversal) algorithm uses
data organized in a vertical layout, which associates each item with the list of transactions
containing it. In an iterative, depth-first-search manner, the algorithm computes, for all
combinations of k items (starting from k = 1), the list of common transactions: during step k,
all combinations of k items are obtained by intersecting the transaction lists associated with
the (k-1)-itemsets. k is incremented by 1 each time until no frequent or candidate itemsets
can still be found.
The Eclat algorithm is generally faster than Apriori and requires only one database scan,
which finds the support for all itemsets with one element. All iterations for k > 1 rely only
on previously stored data.
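A small sketch of the Eclat idea: build the vertical layout (item -> set of transaction ids) in
one scan, then intersect these tid-sets depth-first; the transactions and the threshold are made
up:

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
min_support = 3

# Vertical layout: item -> set of transaction ids containing it (one database scan).
tidsets = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

frequent = {}

def eclat(prefix, items):
    """Depth-first extension of `prefix` with each item, intersecting tid-sets."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_support:
            frequent[prefix + (item,)] = len(tids)
            # Extend with the remaining items by intersecting transaction id sets.
            suffix = [(other, tids & other_tids) for other, other_tids in items]
            eclat(prefix + (item,), suffix)

eclat((), sorted(tidsets.items()))
print(frequent)   # e.g. {('a',): 4, ('b',): 4, ('b', 'a'): 3, ...}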
FP tree algorithm
The first database scan counts all items and sorts them in decreasing order of their global
occurrence in the database (the equivalent of applying a counter to the unraveled data of all
transactions). The second pass iterates line by line through the list of transactions; for each
transaction it sorts the elements by the global order (obtained in the first database pass) and
inserts them as nodes of a tree grown in depth. New nodes are introduced with a count value
of 1. Continuing the iterations, for each line new nodes are added to the tree at the point
where the ordered items differ from the existing tree. If the same prefix already exists, all
common nodes increase their count value by one.
The FP tree can be pruned by removing all nodes having a count value below a minimum
occurrence threshold. The remaining tree can be traversed, and, for instance, the paths from
the root node to the leaves correspond to sets of frequently co-occurring items.
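A condensed sketch of the two passes described above: the first pass counts global item
frequencies, and the second pass inserts each transaction, with its items sorted by that global
order, into a prefix tree whose nodes carry counts; the transactions and the threshold are made
up:

from collections import Counter

transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
min_count = 2  # prune threshold for node counts

# Pass 1: global frequency of every item.
freq = Counter(item for t in transactions for item in t)

# Pass 2: insert each transaction, sorted by descending global frequency, into the tree.
tree = {}  # item -> {"count": int, "children": {...}}
for t in transactions:
    node = tree
    for item in sorted(t, key=lambda i: (-freq[i], i)):
        child = node.setdefault(item, {"count": 0, "children": {}})
        child["count"] += 1                   # shared prefixes just increase the count
        node = child["children"]

def show(node, depth=0):
    """Print the tree; nodes below min_count are skipped, as in pruning."""
    for item, child in node.items():
        if child["count"] >= min_count:
            print("  " * depth + f"{item}: {child['count']}")
            show(child["children"], depth + 1)

show(tree)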