Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester
Unit V
Course Objective: To familiarize students with machine learning and enable them to apply
suitable machine learning techniques for handling data and gaining knowledge from it, to
evaluate the performance of algorithms, and to provide solutions for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO5): Analyze the co-occurrence of data to find interesting frequent
patterns, and preprocess the data before applying it to any real-world problem for evaluation.
PROBABILISTIC LEARNING
Probabilistic classification learning is a form of implicit learning in which cues are
probabilistically associated with outcomes and participants process the associations
without explicit awareness. A probabilistic classifier is a classifier that is able to predict,
given an observation of an input, a probability distribution over a set of classes, rather
than only outputting the single most likely class that the observation should belong to.
Probabilistic classifiers provide classification that can be useful in its own right or when
combining classifiers into ensembles.
A probabilistic method or model is based on the theory of probability, or on the fact that
randomness plays a role in predicting future events. The opposite is a deterministic
method, which tells us that something can be predicted exactly, without the added
complication of randomness.
BAYESIAN LEARNING
Bayesian machine learning follows three steps:
1. Define a model using a "generative process" for the data, i.e., a sequence of steps
describing how the data was created.
2. Incorporate prior beliefs about the model parameters, which take the form of
distributions over the values that the parameters might take.
3. After running the learning algorithm, the user is left with an updated belief about the
parameters, i.e., a new distribution over the parameters.
Bayesian learning is most useful when:
• The user has prior beliefs about unknown model parameters or explicit information
about data generation, i.e., useful information the user wants to incorporate.
• The user has few data or many unknown model parameters, and it is hard to get an
accurate result with the data alone (without the added structure or information).
• The user wants to capture the uncertainty about the result, i.e., how sure or unsure the
model is, instead of only a single "best" result.
Suppose the user grabs a carton of milk from the fridge, sees that it is seven days past the
expiration date, and wants to know if the milk is still good or if it has gone bad. A quick
internet search leads him to believe that there is roughly a 50-50 chance that the milk is
still good. This is his prior belief (Figure 1).
From past experience, the user has some knowledge about how smelly milk gets when it
has gone bad. Specifically, suppose he rates smelliness on a scale of 0-10 (0 being no
smell and 10 being completely rancid) and has probability distributions over the
smelliness of good milk and of bad milk (Figure 2).
Here is how Bayesian learning works: when he gets some data, i.e., when he smells the
milk (Figure 3), he can apply the machinery of Bayesian inference (Figure 4) to
compute an updated belief about whether the milk is still good or has gone bad (Figure
5).
For example, if the user observes that the milk is about a 5 out of 10 on the smelliness
scale, he can then use Bayesian learning to factor in his prior beliefs and the distributions
over the smelliness of good vs. bad milk, and return an updated belief: there is now a 33%
chance that the milk is still good and a 67% chance that the milk has gone bad.
Consider a program that computes an updated belief about whether the milk has gone bad
whenever the user smells the milk. The program will do the following:
1. Encode the prior beliefs about whether the milk is still good or has gone bad, and the
probability distributions over the smelliness of good vs. bad milk.
2. Smell the milk and give this observation as an input to the program.
3. Run the inference algorithm to output the updated belief about whether the milk is still
good or has gone bad.
For a Bayesian model, the user needs to mathematically derive an inference algorithm, i.e.,
the learning algorithm that computes the final distribution over beliefs given the data.
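A minimal Python sketch of such a program, assuming the 50-50 prior from the example; the
smelliness distributions of good and bad milk are modelled here as Gaussians with made-up
parameters (the actual distributions in Figures 2-5 are not reproduced), so the printed numbers
are illustrative rather than the exact 33%/67% quoted above:

from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Likelihood of a smelliness reading under a normal distribution."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))

# Step 1: encode prior beliefs and (assumed) smelliness distributions.
prior_good, prior_bad = 0.5, 0.5      # 50-50 prior from the example
good_mean, good_std = 2.0, 1.5        # illustrative: good milk smells little
bad_mean, bad_std = 7.0, 1.5          # illustrative: bad milk smells a lot

# Step 2: observe the data (smelliness on the 0-10 scale).
smell = 5.0

# Step 3: Bayesian inference - update the belief with Bayes' theorem.
like_good = gaussian_pdf(smell, good_mean, good_std)
like_bad = gaussian_pdf(smell, bad_mean, bad_std)
evidence = like_good * prior_good + like_bad * prior_bad
posterior_good = like_good * prior_good / evidence

print(f"P(milk is good | smell={smell}) = {posterior_good:.2f}")
print(f"P(milk is bad  | smell={smell}) = {1 - posterior_good:.2f}")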
The equation below shows how to calculate the conditional probability of a new instance (vj)
given the training data (D) and a space of hypotheses (H):

P(vj | D) = Σ (over hi in H) P(vj | hi) P(hi | D)

where vj is a new instance to be classified, H is the set of hypotheses for classifying the
instance, hi is a given hypothesis, P(vj | hi) is the probability of vj given hypothesis hi, and
P(hi | D) is the posterior probability of the hypothesis hi given the data D.
Selecting the outcome with the maximum probability is an example of a Bayes optimal
classification.
Any system that classifies new instances according to the equation above is called a Bayes
optimal classifier, or Bayes optimal learner. No other classification method using the
same hypothesis space and same prior knowledge can outperform this method on
average. Although the classifier makes optimal predictions, it is not perfect given the
uncertainty in the training data and incomplete coverage of the problem domain and
hypothesis space. As such, the model will make errors. These errors are often referred
to as Bayes errors. Because the Bayes classifier is optimal, the Bayes error is the
minimum possible error.
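A small sketch of the Bayes optimal rule above; the hypothesis space and the posterior values
P(hi | D) are made-up assumptions, chosen only to illustrate that the weighted combination of
all hypotheses can disagree with the single most probable hypothesis:

# Bayes optimal classification: P(v | D) = sum over hypotheses h of P(v | h) * P(h | D)

# Assumed posteriors over three hypotheses given the training data D.
p_h_given_D = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Assumed class predictions of each hypothesis: P(v | h) for v in {"+", "-"}.
p_v_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

# Combine all hypotheses, weighted by their posterior probability.
p_v_given_D = {
    v: sum(p_v_given_h[h][v] * p_h_given_D[h] for h in p_h_given_D)
    for v in ("+", "-")
}

best = max(p_v_given_D, key=p_v_given_D.get)
print(p_v_given_D)                    # {'+': 0.4, '-': 0.6}
print("Bayes optimal class:", best)   # '-' even though h1 alone is the most probable hypothesis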
The Naive Bayes classifier applies Bayes' theorem with the "naive" assumption that the
features are independent of each other given the class. For example, a fruit may be
considered an apple if it is red, round, and about 3 inches in diameter. Even if these features
depend on each other or upon the existence of the other features, all of these properties are
treated as independently contributing to the probability that the fruit is an apple, and that is
why it is known as 'Naive'.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with its simplicity, Naive Bayes is known to perform well even compared with highly
sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c). The equation is given below:

P(c|x) = P(x|c) P(c) / P(x)
where,
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, i.e., the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
A training data set of weather conditions and the corresponding target variable 'Play'
(indicating the possibility of playing) is given below. The user needs to classify whether
players will play or not based on the weather condition. The steps to perform the
classification are:
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, e.g., Overcast probability =
0.29 and probability of playing = 0.64.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability,
so the prediction is that the players will play.
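The same calculation can be written as a short Python check using only the counts quoted
above:

# Naive Bayes for the weather / Play example:
# P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = 3 / 9     # likelihood from the likelihood table
p_yes = 9 / 14                # prior probability of playing
p_sunny = 5 / 14              # evidence

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny   # the two posteriors sum to one

print(f"P(Yes | Sunny) = {p_yes_given_sunny:.2f}")   # ~0.60
print(f"P(No  | Sunny) = {p_no_given_sunny:.2f}")    # ~0.40 -> predict 'Yes'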
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and in problems with
multiple classes.
• Real-time prediction: Naive Bayes is an eager learning classifier and it is fast, so it can
be used for making predictions in real time.
• Multi-class prediction: This algorithm is also well known for its multi-class prediction
feature; it can predict the probability of multiple classes of the target variable.
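As the text notes, Naive Bayes is widely used for text classification. Below is a minimal
scikit-learn sketch; the tiny corpus, labels, and class names are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: spam vs. ham (texts and labels are illustrative only).
texts = ["win money now", "cheap pills offer", "meeting at noon", "project status update"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)          # bag-of-words counts

clf = MultinomialNB()
clf.fit(X, labels)

new = vectorizer.transform(["win a cheap offer"])
print(clf.predict(new))          # likely ['spam']
print(clf.predict_proba(new))    # posterior probabilities over both classes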
The probabilities in a belief network are calculated using the following joint-distribution
formula:

P(X1, X2, ..., Xn) = Π (over i) P(Xi | Parents(Xi))

To be able to calculate the joint distribution, one needs the conditional probabilities
indicated by the network.
A Bayesian network can be used for building models from data and experts' opinions, and
it consists of two parts:
• a directed acyclic graph, and
• a table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems
under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
• Each node corresponds to a random variable, and a variable can be continuous or
discrete.
• An arc represents that one node directly influences the other node; if there is no directed
link between two nodes, they are independent of each other.
• If there is a directed arrow from node A to node B, then node A is called the parent of
node B.
A Bayesian network has two main components:
• Causal component
• Actual numbers
Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
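A small sketch of this factorisation on a made-up two-node network Rain -> WetGrass; the
structure and the conditional probability values are assumptions for illustration:

# Bayesian network: Rain -> WetGrass (illustrative structure and probabilities).
p_rain = {True: 0.2, False: 0.8}                      # P(Rain)
p_wet_given_rain = {                                  # P(WetGrass | Rain)
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}

def joint(rain, wet):
    """P(Rain=rain, WetGrass=wet) = P(Rain) * P(WetGrass | Rain)."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# The joint distribution must sum to 1 over all assignments.
total = sum(joint(r, w) for r in (True, False) for w in (True, False))
print(joint(True, True))   # 0.18
print(total)               # 1.0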
The strength of an association rule between two items (for instance, item1 and item2), also
called the confidence of the rule, is the number of transactions containing both item1 and
item2 divided by the number of transactions containing item1.
The confidence metric therefore estimates the likelihood that a transaction containing item1
will also include item2.
The lift is the ratio of the frequency with which the two items are observed together to the
frequency expected if they were independent (the product of their individual frequencies). A
lift greater than 1 means that item1 and item2 are more likely to be present together in
transactions, while values below 1 apply to cases where the two items are rarely associated.
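A small sketch computing confidence and lift for one pair of items from a list of transactions;
the transactions themselves are made up:

# Illustrative transactions (each is a set of purchased items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(items):
    """Fraction of transactions that contain all the given items."""
    return sum(items <= t for t in transactions) / n

item1, item2 = {"bread"}, {"milk"}
conf = support(item1 | item2) / support(item1)                     # P(item2 | item1)
lift = support(item1 | item2) / (support(item1) * support(item2))

print(f"confidence(bread -> milk) = {conf:.2f}")   # 0.67
print(f"lift(bread, milk)         = {lift:.2f}")   # 0.83 (< 1: slightly negatively associated)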
Apriori algorithm
The Apriori algorithm uses data organized in a horizontal layout. It is founded on the fact
that if a subset S appears k times in a database, any other subset S1 which contains S will
appear k times or less. This implies that, when deciding on a minimum support threshold
(the minimum frequency an itemset needs to have in order not to be discarded), we can avoid
counting S1 or any other superset of S if support(S) < minimum support. It can be said that
all such candidates are discarded a priori.
The algorithm computes the counts for all itemsets of k elements (starting with k = 1).
During the next iterations the previous sets are joined, creating all possible (k + 1)-itemsets.
Combinations appearing at a frequency below the minimum support are discarded. The
iterations end when no further extensions (joins) can be found.
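A compact sketch of the Apriori iterations described above; the transactions and the minimum
support threshold are made up, and a real application would more likely use a dedicated library:

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
min_support = 3  # minimum number of transactions an itemset must appear in

def count(itemset):
    return sum(itemset <= t for t in transactions)

# k = 1: start with frequent single items.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if count(frozenset([i])) >= min_support]
all_frequent = list(frequent)

k = 2
while frequent:
    # Join step: build candidate k-itemsets from the frequent (k-1)-itemsets.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step: supersets of an infrequent set can never be frequent,
    # so only candidates meeting the minimum support are kept.
    frequent = [c for c in candidates if count(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

for s in all_frequent:
    print(set(s), count(s))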
Eclat algorithm
The Eclat (Equivalence Class Clustering and bottom-up Lattice Traversal) algorithm uses
data organized in a vertical layout, which associates each item with the list of transactions
containing it. In an iterative, depth-first-search manner, the algorithm computes, for all
combinations of k items (starting from k = 1), the list of common transactions: during step k,
all combinations of k items are obtained by intersecting the transaction lists associated with
the (k-1)-itemsets. k is incremented by 1 each time until no frequent or candidate itemsets
can still be found.
The Eclat algorithm is generally faster than Apriori and requires only one database scan,
which finds the support for all itemsets with one element. All iterations for k > 1 rely only
on previously stored data.
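A small sketch of the Eclat idea: build the vertical layout (item -> set of transaction ids) in
one scan, then intersect these tid-sets depth-first; the transactions and the threshold are made
up:

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
min_support = 3

# Vertical layout: item -> set of transaction ids containing it (one database scan).
tidsets = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

frequent = {}

def eclat(prefix, items):
    """Depth-first extension of `prefix` with each item, intersecting tid-sets."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_support:
            frequent[prefix + (item,)] = len(tids)
            # Extend with the remaining items by intersecting transaction id sets.
            suffix = [(other, tids & other_tids) for other, other_tids in items]
            eclat(prefix + (item,), suffix)

eclat((), sorted(tidsets.items()))
print(frequent)   # e.g. {('a',): 4, ('b',): 4, ('b', 'a'): 3, ...}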
FP tree algorithm
The first database scan counts all items and sorts them in decreasing order of their global
occurrence in the database (the equivalent of applying a counter to the unraveled data of all
transactions). The second pass iterates line by line through the list of transactions; for each
transaction it sorts the elements by the global order (obtained in the first database pass) and
inserts them as nodes of a tree grown in depth. New nodes are introduced with a count value
of 1. Continuing the iterations, for each line new nodes are added to the tree at the point
where the ordered items differ from the existing tree. If the same prefix already exists, all
common nodes increase their count value by one.
The FP tree can be pruned by removing all nodes having a count value below a minimum
occurrence threshold. The remaining tree can be traversed, and, for instance, the paths from
the root node to the leaves correspond to sets of frequently co-occurring items.
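A condensed sketch of the two passes described above: the first pass counts global item
frequencies, and the second pass inserts each transaction, with its items sorted by that global
order, into a prefix tree whose nodes carry counts; the transactions and the threshold are made
up:

from collections import Counter

transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
min_count = 2  # prune threshold for node counts

# Pass 1: global frequency of every item.
freq = Counter(item for t in transactions for item in t)

# Pass 2: insert each transaction, sorted by descending global frequency, into the tree.
tree = {}  # item -> {"count": int, "children": {...}}
for t in transactions:
    node = tree
    for item in sorted(t, key=lambda i: (-freq[i], i)):
        child = node.setdefault(item, {"count": 0, "children": {}})
        child["count"] += 1                   # shared prefixes just increase the count
        node = child["children"]

def show(node, depth=0):
    """Print the tree; nodes below min_count are skipped, as in pruning."""
    for item, child in node.items():
        if child["count"] >= min_count:
            print("  " * depth + f"{item}: {child['count']}")
            show(child["children"], depth + 1)

show(tree)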