Lecture 13

Logistic Regression
Using Logistic Regression, we can classify a sample between two classes: 1 and 0.
If we can predict the probability P of a sample belonging to one of the two classes,
then the probability of it belonging to the other class is 1 - P.
Thus, we select one of the classes and find P. If P > 0.5, the sample belongs to the
selected class; otherwise it belongs to the other class.

The probability P can be modeled as follows:

P = 1 / (1 + e^{-(B_0 + B_1 x_1 + ... + B_m x_m)})

It is derived from the log of the "odds", log(P / (1 - P)) = B_0 + B_1 x_1 + ... + B_m x_m,
which is continuous and not restricted to the 0-1 range (of P).
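As a quick illustration, here is a minimal sketch of this model in Python; the coefficients B and the feature values x below are made up purely for demonstration:

import math

def sigmoid(z):
    # Maps any real-valued log-odds z to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients B = [B0, B1, B2] and one sample x = [x1, x2]
B = [-1.0, 0.8, 0.3]
x = [2.0, 1.5]
z = B[0] + B[1] * x[0] + B[2] * x[1]    # log-odds
P = sigmoid(z)                          # probability of the selected class
print(P, "-> class 1" if P > 0.5 else "-> class 0")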
Loss in Logistic Regression

h(x) = predicted y, i.e. the predicted probability that the sample belongs to class 1.

The cost (log loss) for a single sample is:

Cost(h(x), y) = -[ y log(h(x)) + (1 - y) log(1 - h(x)) ]

[Plot: cost vs. predicted probability h(x), with a red curve for y = 1 and a black curve for y = 0.]

The red line represents the 1 class (y = 1); here the right term of the cost function
vanishes. If the predicted probability is close to 1, the loss is small, and as the
probability approaches 0, the loss function goes to infinity.

The black line represents the 0 class (y = 0); here the left term of the cost function
vanishes. If the predicted probability is close to 0, the loss is small, but as the
probability approaches 1, the loss function goes to infinity.
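A minimal sketch of this per-sample cost; the probabilities below are made up for illustration:

import math

def log_loss(h, y):
    # Per-sample cost: -[ y*log(h) + (1 - y)*log(1 - h) ]
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

print(log_loss(0.99, 1))  # small loss: confident and correct
print(log_loss(0.01, 1))  # huge loss: confident but wrong
print(log_loss(0.01, 0))  # small loss
print(log_loss(0.99, 0))  # huge loss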
Replacing B’s with theta’s from now onwards

If there were m features originally,

h_θ(x^(i)) = 1 / (1 + e^{-(θ_0 + Σ_{j=1}^{m} θ_j x_j^(i))})

where x_j^(i) denotes the j-th feature value of the i-th sample.
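A vectorized sketch of this hypothesis, assuming NumPy; the function name predict_proba and the convention of storing θ_0 as the first entry of theta are illustrative choices, not from the lecture:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    # X: (N, m) matrix of feature values x_j^(i); theta: (m + 1,) with theta_0 first
    z = theta[0] + X @ theta[1:]   # log-odds z^(i) for every sample
    return sigmoid(z)              # h_theta(x^(i)), shape (N,)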
Derivation of LogLoss using MLE (Maximum Likelihood Estimation)

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution,
given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the
observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum
likelihood estimate.

The main aim of MLE is to find the values of our parameters for which the likelihood function is maximized. The likelihood function is
nothing but the joint pdf of the given sample observations. The joint distribution is the product of the conditional probabilities of
observing each example given the distribution parameters.
A random experiment whose outcomes are of two types, success S and failure F, occurring
with probabilities p and q respectively, is called a Bernoulli trial. If for this
experiment a random variable X is defined such that it takes value 1 when S occurs and
0 when F occurs, then X follows a Bernoulli distribution.

Conditional probability of each observation:

P(y^(i) | x^(i); θ) = h_θ(x^(i))^{y^(i)} * (1 - h_θ(x^(i)))^{1 - y^(i)}

Likelihood function:

L(θ) = Π_{i=1}^{N} h_θ(x^(i))^{y^(i)} * (1 - h_θ(x^(i)))^{1 - y^(i)}
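A small numeric sketch of this Bernoulli likelihood, with made-up predicted probabilities and labels:

# Toy predicted probabilities h_theta(x^(i)) and true labels y^(i)
h = [0.9, 0.2, 0.7]
y = [1, 0, 1]

# Likelihood = product over samples of h^y * (1 - h)^(1 - y)
L = 1.0
for h_i, y_i in zip(h, y):
    L *= (h_i ** y_i) * ((1 - h_i) ** (1 - y_i))
print(L)  # 0.9 * 0.8 * 0.7 = 0.504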
We need a value of θ that maximizes this likelihood function. To make our calculations
easier, we take the log on both sides. The function we get is also called the
log-likelihood function, the sum of the log conditional probabilities:

LL(θ) = Σ_{i=1}^{N} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]

Note that LL(θ) = -N * J(θ), where J(θ) is the average log loss defined earlier.
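Continuing the same made-up values, a quick check that exponentiating LL recovers the likelihood above and that LL = -N * J:

import math

h = [0.9, 0.2, 0.7]
y = [1, 0, 1]
N = len(y)

# Log-likelihood: sum of log conditional probabilities
LL = sum(y_i * math.log(h_i) + (1 - y_i) * math.log(1 - h_i)
         for h_i, y_i in zip(h, y))
J = -LL / N             # average log loss (the cost from earlier)
print(math.exp(LL))     # ~0.504, the likelihood from the previous sketch
print(LL, -N * J)       # LL equals -N * J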
To apply gradient descent here as well, we need to derive the formula for the derivatives.

Since J(θ) = -LL(θ)/N,

∂J/∂θ_j = -(1/N) ∂LL/∂θ_j

where

LL(θ) = Σ_{i=1}^{N} [ y^(i) log(h^(i)) + (1 - y^(i)) log(1 - h^(i)) ], with h^(i) = h_θ(x^(i)).

Using the chain rule, we can split the derivative. Assume

z^(i) = θ_0 + Σ_{j=1}^{m} θ_j x_j^(i), so that h^(i) = σ(z^(i)) = 1 / (1 + e^{-z^(i)}).

Then

∂LL/∂θ_j = Σ_{i=1}^{N} (∂LL/∂h^(i)) * (∂h^(i)/∂z^(i)) * (∂z^(i)/∂θ_j)

1st part:

∂LL/∂h^(i) = y^(i)/h^(i) - (1 - y^(i))/(1 - h^(i))

2nd part (derivative of the sigmoid function):

∂h^(i)/∂z^(i) = σ'(z^(i)) = e^{-z^(i)} / (1 + e^{-z^(i)})^2 = σ(z^(i)) (1 - σ(z^(i))) = h^(i) (1 - h^(i))

3rd part:

Since z^(i) is linear in θ_j,

∂z^(i)/∂θ_j = x_j^(i)

Let's combine the 3 parts:

∂LL/∂θ_j = Σ_{i=1}^{N} [ y^(i)/h^(i) - (1 - y^(i))/(1 - h^(i)) ] * h^(i)(1 - h^(i)) * x_j^(i) = Σ_{i=1}^{N} (y^(i) - h^(i)) x_j^(i)

Thus,

∂J/∂θ_j = (1/N) Σ_{i=1}^{N} (h^(i) - y^(i)) x_j^(i)

and the gradient descent update

θ_j := θ_j - α (1/N) Σ_{i=1}^{N} (h_θ(x^(i)) - y^(i)) x_j^(i)

is very similar to the way we update the parameters in linear regression.
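A sketch, assuming NumPy and made-up toy data, that compares the analytic gradient derived above with a finite-difference estimate of one component of ∂J/∂θ:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Average log loss J(theta); X already includes a column of ones for theta_0
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # Analytic gradient: (1/N) * X^T (h - y)
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

# Toy data: 4 samples, 2 features plus an intercept column of ones
X = np.array([[1, 0.5, 1.2], [1, -1.0, 0.3], [1, 2.0, -0.7], [1, 0.1, 0.8]])
y = np.array([1, 0, 1, 0])
theta = np.array([0.1, -0.2, 0.3])

# Finite-difference check of the derivative with respect to theta_0
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
numeric = (cost(theta + e0, X, y) - cost(theta - e0, X, y)) / (2 * eps)
print(gradient(theta, X, y)[0], numeric)   # the two values should agree closely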
Implementation Steps for Logistic Regression
1) Let's say we have N training samples with a total of m features and a binary label.
2) Initialize the parameters and obtain the predicted value of each sample using the
following, where x_j^(i) denotes the j-th feature value of the i-th sample:

h_θ(x^(i)) = 1 / (1 + e^{-(θ_0 + Σ_{j=1}^{m} θ_j x_j^(i))})

3) Update the parameters using the following, where α is the learning rate:

θ_j := θ_j - α (1/N) Σ_{i=1}^{N} (h_θ(x^(i)) - y^(i)) x_j^(i)

Iterate over steps 2 & 3 until the parameters don't change (a code sketch of these steps follows below).
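A minimal end-to-end sketch of these three steps, assuming NumPy; the toy dataset, the learning rate alpha = 0.1, and the convergence tolerance are illustrative choices, not part of the lecture:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, tol=1e-6, max_iter=10000):
    # Step 1: N samples, m features; prepend a column of ones so theta[0] plays the role of theta_0
    N, m = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])
    theta = np.zeros(m + 1)                        # Step 2: initialize parameters
    for _ in range(max_iter):
        h = sigmoid(Xb @ theta)                    # Step 2: predicted probability of each sample
        grad = Xb.T @ (h - y) / N                  # (1/N) * sum_i (h^(i) - y^(i)) x_j^(i)
        new_theta = theta - alpha * grad           # Step 3: update parameters
        if np.max(np.abs(new_theta - theta)) < tol:  # stop once the parameters barely change
            theta = new_theta
            break
        theta = new_theta
    return theta

# Toy usage (made-up data): the label is 1 when the first feature is positive
X = np.array([[1.0, 0.2], [2.0, -0.5], [-1.5, 0.7], [-2.2, 0.1]])
y = np.array([1, 1, 0, 0])
theta = fit_logistic_regression(X, y)
print(theta)
print(sigmoid(np.hstack([np.ones((4, 1)), X]) @ theta) > 0.5)  # predicted classes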
