DMML Unit4 Ppt.pptx


MODULE 4

What is Regression?

• Regression analysis is a statistical method for
modelling the relationship between a dependent
(target) variable and one or more independent
(predictor) variables.
• More specifically, regression analysis helps us
understand how the value of the dependent
variable changes with respect to one independent
variable while the other independent variables
are held fixed.
• It predicts continuous/real values such
as temperature, age, salary, price, etc.
• Regression is mainly used to determine the
strength of predictors, forecast trends and
time series, and analyse cause-and-effect
relationships.
• It involves determining the best-fit line, i.e.,
a line drawn through the data points in such a
way that the total distance between the line and
the data points is minimized.
Terminologies Related to the
Regression Analysis:
• Dependent Variable: The main factor in Regression analysis which
we want to predict or understand is called the dependent variable.
It is also called target variable.
• Independent Variable: The factors which affect the dependent
variables or which are used to predict the values of the dependent
variables are called independent variable, also called as
a predictor.
• Outliers: An outlier is an observation with either a very low or a
very high value in comparison to the other observed values.
An outlier may distort the result, so it should be handled carefully.
• Multicollinearity: If the independent variables are highly
correlated with each other, the condition is called
multicollinearity. It should not be present in the
dataset, because it creates problems when ranking the most
influential variables.
• Underfitting and Overfitting: If our algorithm works well with the
training dataset but not with the test dataset, the problem
is called Overfitting. And if our algorithm does not perform well
even with the training dataset, the problem is called Underfitting.
Linear Regression Use Cases

• Sales Forecasting
• Risk Analysis
• Housing Applications To Predict the prices and
other factors
• Finance Applications To Predict Stock prices,
investment evaluation, etc.
Why do we use Regression Analysis?

• Regression analysis helps in the prediction of a
continuous variable.
• There are various scenarios in the real world
where we need future predictions, such as
weather conditions, sales, and marketing
trends; in such cases we need a technique
that can make predictions accurately.
• By performing the regression, we can confidently
determine the most important factor, the least
important factor, and how each factor is affecting
the other factors.
Types of Regression
Linear Regression:

• If there is only one input variable (x), then
such linear regression is called simple linear
regression.
• And if there is more than one input variable,
then such linear regression is called multiple
linear regression.
• Mathematically, we can represent a linear regression as:
• y = B0 + B1x + ε   (the familiar y = mx + c, plus an error term)
• Here,
• y = dependent variable (target variable)
x = independent variable (predictor variable)
B0 = intercept of the line (gives an additional degree of freedom)
B1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
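As an illustration of estimating B0 and B1, here is a minimal least-squares sketch in plain Python; the experience/salary data values are invented for the example.

```python
# Hypothetical data: years of experience (x) vs. salary in $1000s (y).
x = [1, 2, 3, 4, 5]
y = [30, 35, 42, 48, 53]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares estimates for y = B0 + B1*x + e:
# B1 = sum((xi - mean_x)(yi - mean_y)) / sum((xi - mean_x)^2)
# B0 = mean_y - B1 * mean_x
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

print(f"y = {b0:.1f} + {b1:.1f}x")  # y = 23.9 + 5.9x
```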
Linear Regression Line
• A linear line showing the relationship between the dependent and
independent variables is called a regression line. A regression line can
show two types of relationship:
• Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable
increases on X-axis, then such a relationship is termed as a Positive linear
relationship.
• Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and independent
variable increases on the X-axis, then such a relationship is called a
negative linear relationship.
Logistic Regression:
• The logistic regression algorithm works with
categorical variables such as 0 or 1, Yes or No, True
or False, Spam or Not Spam, etc.
• It is a predictive analysis algorithm which works on
the concept of probability.
• Logistic regression uses the sigmoid function (also
called the logistic function), which maps any real-valued
input to a value between 0 and 1. This sigmoid function
is used to model the data in logistic regression.
• The function can be represented as:
• f(x) = 1 / (1 + e^(-x))
• f(x) = output between the 0 and 1 value
• x = input to the function
• e = base of the natural logarithm
• It uses the concept of threshold levels: values above the threshold level
are rounded up to 1, and values below the threshold level are rounded down
to 0.
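A minimal sketch of the sigmoid and the threshold rule described above; the 0.5 threshold is the usual default, an assumption rather than something fixed by the slides.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real-valued input into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def classify(x, threshold=0.5):
    """Round the probability up to 1 at/above the threshold, down to 0 below it."""
    return 1 if sigmoid(x) >= threshold else 0

print(sigmoid(0))    # 0.5
print(classify(2))   # 1 (sigmoid(2) is about 0.88)
print(classify(-2))  # 0 (sigmoid(-2) is about 0.12)
```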

• There are three types of logistic regression:

• Binary (0/1, pass/fail)
• Multinomial (cats, dogs, lions)
• Ordinal (low, medium, high)
• With binary classification, let ‘x’ be some feature and ‘y’ be the
output, which can be either 0 or 1.
• The probability that the output is 1 given its input can be
represented as:
• P(y=1 | x)
• If we predict this probability via linear regression, we can state it
as:
• p(x) = β0 + β1x
• where p(x) = P(y=1 | x)

• A linear regression model can generate the predicted probability as
any number from negative to positive infinity, whereas the
probability of an outcome can only lie between 0 and 1, i.e. 0 < P(x) < 1.

• The odds are defined as the probability that the event will
occur divided by the probability that the event will not occur.
Unlike probability, the odds are not constrained to lie
between 0 and 1 but can take any value from zero to infinity.
• If the probability of success is P, then the odds of that event
are:
• odds = P / (1 − P)
• Example: If the probability of success (P) is 0.60 (60%), then


the probability of failure(1-P) is 1–0.60 = 0.40(40%). Then the
odds are 0.60 / (1–0.60) = 0.60/0.40 = 1.5.
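The odds computation from this example, as a quick check:

```python
p = 0.60               # probability of success
odds = p / (1 - p)     # probability of success over probability of failure
print(round(odds, 2))  # 1.5
```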
• It’s time to transform the model from linear regression to
logistic regression using the logistic function.
• To avoid the problem of predictions falling outside 0 and 1, the
log-odds function, or logit function, is used.
• Logistic regression can be expressed as:
• log( p(x) / (1 − p(x)) ) = β0 + β1x
• where the left-hand side is called the logit or log-odds function,
and p(x)/(1 − p(x)) is called the odds.
• The odds signify the ratio of the probability of success to the probability
of failure. Therefore, in logistic regression, a linear combination of the
inputs is mapped to the log-odds of the output being equal to 1.
If we take the inverse of the above function, we get:
• p(x) = 1 / (1 + e^−(β0 + β1x))
• This is known as the sigmoid function, and it gives an S-shaped
curve. It always gives a probability value in the range 0 < p < 1.
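The logit and the sigmoid are inverses of each other, which can be verified numerically; this small sketch is an illustration, not part of the slides.

```python
import math

def logit(p):
    """Log-odds: log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps log-odds back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

p = 0.8
z = logit(p)                 # log-odds of 0.8
print(round(sigmoid(z), 6))  # recovers 0.8
```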
• Using the logistic regression algorithm, banks can
predict whether a customer would default on loans or
not
• To predict the weather conditions of a certain place
(sunny, windy, rainy, humid, etc.)
• Ecommerce companies can identify buyers who are
likely to purchase a certain product
• Companies can predict whether they will gain or lose
money in the next quarter, year, or month based on
their current performance
• To classify objects based on their features and
attributes
Linear Regression vs. Logistic Regression
• Linear regression is used to predict a continuous dependent variable
using a given set of independent variables; logistic regression is used
to predict a categorical dependent variable using a given set of
independent variables.
• Linear regression is used for solving regression problems; logistic
regression is used for solving classification problems.
• In linear regression, we predict the values of continuous variables;
in logistic regression, we predict the values of categorical variables.
• In linear regression, we find the best-fit line, by which we can easily
predict the output; in logistic regression, we find the S-curve, by
which we can classify the samples.
• In linear regression, the least-squares estimation method is used to
estimate the coefficients; in logistic regression, the maximum-likelihood
estimation method is used.
• The output of linear regression must be a continuous value, such as
price or age; the output of logistic regression must be a categorical
value, such as 0 or 1, Yes or No.
• Linear regression is used to estimate the change in the dependent
variable for a change in the independent variables, whereas logistic
regression is used to calculate the probability of an event, for
example, classifying whether tissue is benign or malignant.
Confusion Matrix
• A confusion matrix is a table that is often used
to describe the performance of a
classification model (or "classifier") on a set of
test data for which the true values are known.
• true positives (TP): These are cases in which we predicted yes (they have the
disease), and they do have the disease.
• true negatives (TN): We predicted no, and they don't have the disease.
• false positives (FP): We predicted yes, but they don't actually have the
disease. (Also known as a "Type I error.")
• false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a "Type II error.")
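Counting the four cells from a set of predictions can be sketched as follows; the label vectors below are made up for illustration (1 = has the disease, 0 = does not).

```python
actual    = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # predicted yes, truly yes
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # predicted no, truly no
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # Type I error
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # Type II error

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=4 TN=4 FP=1 FN=1
```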
Performance Metrics for Classification
problems in Machine Learning
Confusion Matrix - The confusion matrix in itself is not a performance measure
as such, but almost all of the performance metrics are based on the confusion
matrix and the numbers inside it.
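For example, the common classification metrics derive directly from the four cells of the confusion matrix; the counts below are illustrative.

```python
tp, tn, fp, fn = 4, 4, 1, 1  # cells taken from a confusion matrix

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # fraction of correct predictions
precision = tp / (tp + fp)                   # how many predicted yes were right
recall    = tp / (tp + fn)                   # how many actual yes were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, round(f1, 4))  # 0.8 0.8 0.8 0.8
```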

What is Naive Bayes?

• Bayes' theorem is named after Thomas Bayes,
an 18th-century statistician.
• The Naive Bayes classifier works on the
principle of conditional probability, as given by
the Bayes theorem.
Machine learning falls into two categories:
– Supervised learning
– Unsupervised learning
Supervised learning falls into two categories:
– Classification
– Regression
• Naive Bayes algorithm falls under
classification.
Where is Naive Bayes Used?
• Face Recognition----As a classifier, it is used to identify
the faces or its other features, like nose, mouth, eyes,
etc.
• Weather Prediction ----It can be used to predict if the
weather will be good or bad.
• Medical Diagnosis ---Doctors can diagnose patients by
using the information that the classifier provides.
Healthcare professionals can use Naive Bayes to
indicate if a patient is at high risk for certain diseases
and conditions, such as heart disease, cancer, and
other ailments.
• News Classification ---With the help of a Naive Bayes
classifier, Google News recognizes whether the news is
political, world news, and so on.
Why is it called Naïve Bayes?
• The Naïve Bayes algorithm is comprised of two
words Naïve and Bayes, Which can be described as:
• Naïve: It is called naïve because it assumes that
the occurrence of a certain feature is independent
of the occurrence of other features. For example, if a
fruit is identified on the basis of color, shape, and
taste, then a red, spherical, and sweet fruit is
recognized as an apple. Hence each feature
individually contributes to identifying it as an
apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the
principle of Bayes' Theorem.
Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
• The formula for Bayes' theorem is given as:
• P(A|B) = P(B|A) · P(A) / P(B)
• Where,

• P(A|B) is Posterior probability: – the probability of event A occurring, given


event B has occurred

• P(B|A) is Likelihood probability: the probability of event B occurring, given


event A has occurred

• P(A) is Prior Probability: the probability of event A

• P(B) is Marginal Probability: the probability of event B


• This relates the probability of the hypothesis before getting the
evidence P(H), to the probability of the hypothesis after getting
the evidence, P(H∣E).
• For this reason, P(H) is called the prior probability, while P(H∣E) is
called the posterior probability.
• The factor that relates the two, P(E∣H) / P(E) , is called
the likelihood ratio.
• Using these terms, Bayes' theorem can be rephrased as "the
posterior probability equals the prior probability times the
likelihood ratio."
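A numeric sketch of this rephrasing; the disease-test numbers below are invented for illustration.

```python
p_h = 0.01          # P(H): prior probability of the hypothesis (has the disease)
p_e_given_h = 0.95  # P(E|H): probability of the evidence given the hypothesis
p_e = 0.06          # P(E): marginal probability of the evidence (positive test)

likelihood_ratio = p_e_given_h / p_e
p_h_given_e = p_h * likelihood_ratio  # posterior = prior * likelihood ratio
print(round(p_h_given_e, 4))          # 0.1583
```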
• Use Naive Bayes Classifier to predict whether
the given customer X will buy computer or
not.
X=< age<=30, income=medium, student=yes,
credit_rating = fair >
• Choose the class Ci that maximizes P(Ci|X):
• P(Yes|X) = ?
• P(No|X) = ?

P(Yes|X) = P(X|Yes) · P(Yes) / P(X)
P(No|X) = P(X|No) · P(No) / P(X)

Since P(X) is the same for both classes, it is enough to compare the
numerators:
P(X|Yes) · P(Yes) and P(X|No) · P(No)

P(X|Yes) · P(Yes) = P(age<=30|Yes) · P(income=medium|Yes) ·
P(student=yes|Yes) · P(credit=fair|Yes) · P(Yes)
= 0.222 · 0.444 · 0.667 · 0.667 · 0.643 = 0.028

P(X|No) · P(No) = P(age<=30|No) · P(income=medium|No) ·
P(student=yes|No) · P(credit=fair|No) · P(No)
= 0.6 · 0.4 · 0.2 · 0.4 · 0.357 = 0.007

Since 0.028 > 0.007, X is classified as Yes: the customer will buy a computer.
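The two scores above can be reproduced in a few lines; the probabilities are the ones read off the slide's training table.

```python
# P(X|Yes)·P(Yes) for X = <age<=30, income=medium, student=yes, credit=fair>
score_yes = 0.222 * 0.444 * 0.667 * 0.667 * 0.643
# P(X|No)·P(No) for the same X
score_no = 0.6 * 0.4 * 0.2 * 0.4 * 0.357

print(round(score_yes, 3), round(score_no, 3))  # 0.028 0.007
print("Yes" if score_yes > score_no else "No")  # Yes: X buys a computer
```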
A second example (play tennis), comparing P(N|X) and P(P|X) via
P(X|N) · P(N) and P(X|P) · P(P):

P(sunny|N) · P(cool|N) · P(high|N) · P(true|N) · P(N)
= 3/5 · 1/5 · 4/5 · 3/5 · 5/14 ≈ 0.02

P(sunny|P) · P(cool|P) · P(high|P) · P(true|P) · P(P)
= 2/9 · 3/9 · 3/9 · 3/9 · 9/14 ≈ 0.0053

Since 0.02 > 0.0053, X is classified as N (do not play).
Types of Naïve Bayes Model:

• There are three types of Naive Bayes Model, which are given
below:
• Gaussian: The Gaussian model assumes that features follow a
normal distribution. This means if predictors take continuous
values instead of discrete, then the model assumes that these
values are sampled from the Gaussian distribution.
• Multinomial: The Multinomial Naïve Bayes classifier is used when
the data is multinomially distributed. It is primarily used for
document classification problems, i.e., determining which category a
particular document belongs to, such as sports, politics, education, etc.
The classifier uses the frequency of words as the predictors.
• Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean
variables, such as whether a particular word is present in a document
or not. This model is also well known for document classification
tasks.
Advantages of Naive Bayes Classifier

The following are some of the benefits of the Naive


Bayes classifier:
• It is simple and easy to implement
• It doesn’t require as much training data
• It handles both continuous and discrete data
• It is highly scalable with the number of predictors and
data points
• It is fast and can be used to make real-time predictions
• It is not sensitive to irrelevant features
What Is A Bayesian Network?
• A Bayesian network is a Probabilistic Graphical
Model (PGM) that is used to compute uncertainties
using the concept of probability.
• Popularly known as Belief Networks, Bayesian
Networks are used to model uncertainties by
using Directed Acyclic Graphs (DAG).
• "A Bayesian network is a probabilistic graphical
model which represents a set of variables and
their conditional dependencies using a directed
acyclic graph."
• Real world applications are probabilistic in
nature, and to represent the relationship
between multiple events, we need a Bayesian
network.
• It can also be used in various tasks
including prediction, anomaly detection,
diagnostics, automated insight, reasoning,
time series prediction, and decision making
under uncertainty.
What Is A Directed Acyclic Graph?
• A Directed Acyclic Graph is used to represent a Bayesian
Network and like any other statistical graph, a DAG contains a
set of nodes and links, where the links denote the relationship
between the nodes.
• A DAG models the uncertainty of an event occurring based on
the Conditional Probability Distribution (CPD) of each random
variable. A Conditional Probability Table (CPT) is used to
represent the CPD of each variable in the network.
• Each node corresponds to a random variable, and a variable
can be continuous or discrete.
• Arcs, or directed arrows, represent the causal relationships or
conditional probabilities between random variables. These
directed links connect pairs of nodes in the graph and indicate
that one node directly influences the other; if there is no
directed link, the nodes are independent of each other.
– In the above diagram, A, B, C, and D are random variables represented
by the nodes of the network graph.
– If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
– Node C is independent of node A.
• Note: The Bayesian network graph does not contain any cyclic
graph. Hence, it is known as a directed acyclic graph or DAG.
What Is Conditional Probability?

• Conditional Probability of an event X is the


probability that the event will occur given that an
event Y has already occurred.
• p(X| Y) is the probability of event X occurring,
given that event, Y occurs.
• If X and Y are dependent events then the
expression for conditional probability is given by:
P (X| Y) = P (X and Y) / P (Y)
• Each node in the Bayesian network has a conditional
probability distribution P(Xi | Parents(Xi)), which
determines the effect of the parents on that node.
• Bayesian network is based on Joint probability
distribution and conditional probability. So let's
first understand the joint probability distribution:
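A one-line check of the conditional-probability formula above; the numbers are invented for illustration.

```python
p_x_and_y = 0.12  # P(X and Y): both events occur
p_y = 0.30        # P(Y)

p_x_given_y = p_x_and_y / p_y  # P(X|Y) = P(X and Y) / P(Y)
print(round(p_x_given_y, 2))   # 0.4
```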
What Is Joint Probability?

• Joint probability is a statistical measure of two or more events
happening at the same time, e.g., P(A, B, C), the probability of
events A, B, and C occurring together. It can be represented as the
probability of the intersection of two or more events.
• Joint probability distribution:
• If we have variables x1, x2, x3, ..., xn, then the probabilities of
the different combinations of x1, x2, x3, ..., xn are known as the
joint probability distribution.
• P[x1, x2, x3, ..., xn] can be written as follows in terms of
conditional probabilities (the chain rule):
• P[x1, x2, ..., xn] = P[x1 | x2, x3, ..., xn] · P[x2 | x3, ..., xn]
· ... · P[xn-1 | xn] · P[xn]
• In a Bayesian network, for each variable Xi this simplifies to:
• P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
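A tiny two-node network A → B makes the factorization concrete; the CPT numbers below are assumed for the sketch. The joint is P(A, B) = P(B|A) · P(A), and its entries sum to 1.

```python
p_a = 0.3                              # P(A = true)
p_b_given_a = {True: 0.9, False: 0.2}  # P(B = true | A)

# Build the full joint distribution from the factorization P(A, B) = P(B|A) * P(A).
joint = {}
for a in (True, False):
    pa = p_a if a else 1 - p_a
    for b in (True, False):
        pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
        joint[(a, b)] = pb * pa

print(round(joint[(True, True)], 2))   # 0.27 = 0.9 * 0.3
print(round(sum(joint.values()), 10))  # 1.0 — a valid distribution
```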
Example

• Let’s assume that we’re creating a Bayesian Network that will


model the marks (m) of a student on his examination. The marks
will depend on:
• Exam level (e): This is a discrete variable that can take two values,
(difficult, easy)
• IQ of the student (i): A discrete variable that can take two values
(high, low)
• The marks will in turn predict whether or not he/she will get
admitted (a) to a university.
• The IQ will also predict the aptitude score (s) of the student.
• With this information, we can build a Bayesian Network that will
model the performance of a student on an exam. The Bayesian
Network can be represented as a DAG where each node denotes a
variable that predicts the performance of the student.
• This distribution can be represented through a DAG and
Conditional Probability Tables. We can now calculate the joint
probability distribution of these 5 variables, i.e. the product of
conditional probabilities:
• p(a, m, i, e, s) = p(a | m) · p(m | i, e) · p(i) · p(e) · p(s | i)
• Here,
• p(a | m) represents the conditional probability of a student getting
admission based on his marks.
• p(m | i, e) represents the conditional probability of the student’s
marks, given his IQ level and exam level.
• p(i) denotes the probability of his IQ level (high or low).
• p(e) denotes the probability of the exam level (difficult or easy).
• p(s | i) denotes the conditional probability of his aptitude score,
given his IQ level.
• In general, we can formulate a Bayesian network as:
• P(X1, ..., Xn) = ∏ P(Xi | Parents(Xi))
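Plugging assumed CPT values into the student-network factorization gives one entry of the joint distribution; every number below is invented, since the slide's tables are not reproduced here.

```python
p_e = 0.7           # P(e): exam is easy (assumed)
p_i = 0.8           # P(i): IQ is high (assumed)
p_m_given_ie = 0.9  # P(m | i, e): good marks given high IQ, easy exam (assumed)
p_s_given_i = 0.75  # P(s | i): good aptitude score given high IQ (assumed)
p_a_given_m = 0.6   # P(a | m): admitted given good marks (assumed)

# p(a, m, i, e, s) = p(a|m) * p(m|i,e) * p(i) * p(e) * p(s|i)
joint = p_a_given_m * p_m_given_ie * p_i * p_e * p_s_given_i
print(round(joint, 4))  # 0.2268
```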
Bayesian Networks Application
• Disease Diagnosis: Bayesian Networks are commonly used in the field of
medicine for the detection and prevention of diseases. They can be used
to model the possible symptoms and predict whether or not a person is
diseased.
• Optimized Web Search: Bayesian networks are used to improve search
accuracy by understanding the intent of a search and providing the most
relevant search results. They can effectively map users' intent to the
relevant content and deliver the search results.
• Spam Filtering: Bayesian models have been used in the Gmail spam
filtering algorithm for years now. They can effectively classify documents
by understanding the contextual meaning of a mail. They are also used in
other document classification applications.
• Gene Regulatory Networks: GRNs are a network of genes that are
comprised of many DNA segments. They are effectively used to
communicate with other segments of a cell either directly or indirectly.
Mathematical models such as Bayesian Networks are used to model such
cell behavior in order to form predictions.
• Biomonitoring: Bayesian networks play an important role in monitoring
the quantity of chemical doses used in pharmaceutical drugs.
• Example: Harry installed a new burglar alarm at his
home to detect burglary. The alarm reliably responds
to a burglary, but it also responds to minor
earthquakes.
• Harry has two neighbors, John and Mary, who have
taken responsibility for informing Harry at work when
they hear the alarm. John always calls Harry when he
hears the alarm, but sometimes he confuses the
telephone ringing with the alarm and calls then, too.
• On the other hand, Mary likes to listen to loud music,
so she sometimes misses the alarm. Here we
would like to compute the probability of the burglar
alarm.
• Problem:
• Calculate the probability that alarm has
sounded, but there is neither a burglary, nor
an earthquake occurred, and John and Mary
both called the Harry.
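With the standard textbook CPT values for this network (assumed here, since the slide's tables are not shown: P(B)=0.001, P(E)=0.002, P(A|¬B,¬E)=0.001, P(J|A)=0.90, P(M|A)=0.70), the required probability is the product of five factors:

```python
p_b = 0.001              # P(Burglary) (assumed textbook value)
p_e = 0.002              # P(Earthquake) (assumed textbook value)
p_a_given_nb_ne = 0.001  # P(Alarm | no burglary, no earthquake) (assumed)
p_j_given_a = 0.90       # P(John calls | Alarm) (assumed)
p_m_given_a = 0.70       # P(Mary calls | Alarm) (assumed)

# P(J, M, A, ~B, ~E) = P(J|A) * P(M|A) * P(A|~B,~E) * P(~B) * P(~E)
p = p_j_given_a * p_m_given_a * p_a_given_nb_ne * (1 - p_b) * (1 - p_e)
print(f"{p:.6f}")  # 0.000628
```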
• What is the probability that John calls?
• P(R)=0.0025
