Lec 13
Logistic Regression
Using logistic regression, we can classify a sample between two classes: 1 and 0.
If we can predict the probability P of a sample belonging to one of the two
classes, then we know the probability of it belonging to the other class is 1 - P.
Thus, we select one of the classes and find P. If P > 0.5, the sample belongs to the
selected class; otherwise it belongs to the other class.
The model is derived from the log of the "odds" (the logit), which is continuous
and unbounded, unlike P itself, which is confined to the 0-1 range:

$$\log\frac{P}{1-P} = \theta^T x \quad\Longrightarrow\quad P = h(x) = \frac{1}{1 + e^{-\theta^T x}}$$
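As a minimal sketch of this decision rule in Python (the function names and NumPy usage here are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    # Maps the continuous log-odds theta^T x back into the 0-1 range of P.
    return 1.0 / (1.0 + np.exp(-z))

def classify(theta, x):
    # P = predicted probability of the selected class (class 1).
    p = sigmoid(np.dot(theta, x))
    return 1 if p > 0.5 else 0
```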
Loss in Logistic Regression
h(x) = the predicted y, i.e. the model's predicted probability that y = 1.
The cost for a single training example is

$$\mathrm{Cost}\big(h(x), y\big) = -\,y\,\log\big(h(x)\big) - (1 - y)\,\log\big(1 - h(x)\big)$$

[Plot: cost vs. h(x); the red curve shows the case y = 1.] For an example from
the 1 class (y = 1), the right term of the cost function vanishes. Now if the
predicted probability is close to 1, our loss is small, and as the predicted
probability approaches 0, our loss approaches infinity. (Symmetrically, for
y = 0 the left term vanishes.)
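A short sketch of this per-example cost (the clipping below is only a numerical guard against log(0), not part of the formula):

```python
import numpy as np

def cost(h, y):
    # -y*log(h) - (1-y)*log(1-h); one of the two terms vanishes depending on y.
    h = np.clip(h, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

print(cost(0.99, 1))  # ~0.01: confident and correct, small loss
print(cost(0.01, 1))  # ~4.6:  confident and wrong, large loss
```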
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution,
given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the
observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum
likelihood estimate.
The main aim of MLE is to find the value of our parameters for which the likelihood function is maximized. The likelihood function is simply the joint pdf of the observed samples. Assuming the observations are independent, this joint distribution is the product of the conditional probabilities of observing each example given the distribution parameters.
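As a tiny worked example of "joint pdf = product of per-observation probabilities" (the data and candidate parameters below are made up purely for illustration):

```python
import numpy as np

samples = np.array([1, 1, 0, 1, 0])  # five independent Bernoulli observations

def likelihood(p, xs):
    # Joint pdf: product of P(x_i | p) over all observations.
    return np.prod(np.where(xs == 1, p, 1 - p))

for p in [0.2, 0.6, 0.9]:
    print(p, likelihood(p, samples))
# p = 0.6 (the sample mean, 3/5) gives the largest likelihood of the three,
# and is in fact the maximum likelihood estimate for this sample.
```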
A random experiment whose outcomes are of two types, success S and failure F,
occurring with probabilities p and q respectively, is called a Bernoulli trial.
If for this experiment a random variable X is defined such that it takes value 1
when S occurs and 0 if F occurs, then X follows a Bernoulli distribution.

In logistic regression, each label $y_i$ is treated as a Bernoulli trial with
success probability $h(x_i)$, so the likelihood function is the product of the
conditional probabilities over all N observations:

$$L(\theta) = \prod_{i=1}^{N} P(y_i \mid x_i; \theta) = \prod_{i=1}^{N} h(x_i)^{y_i}\,\big(1 - h(x_i)\big)^{1 - y_i}$$
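A minimal sketch of this likelihood in code (X, y, and theta are illustrative placeholders; in practice the product underflows quickly for large N, which is one more reason to work with logs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(theta, X, y):
    # Product over i of h(x_i)^y_i * (1 - h(x_i))^(1 - y_i).
    h = sigmoid(X @ theta)
    return np.prod(h**y * (1 - h)**(1 - y))
```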
We need a value for theta which will maximize this likelihood function. To make
our calculations easier, we take the log of both sides, which turns the product
into a sum. The function we get is called the log-likelihood function, the sum of
the log conditional probabilities:

$$LL(\theta) = \sum_{i=1}^{N}\Big[\,y_i \log h(x_i) + (1 - y_i)\log\big(1 - h(x_i)\big)\Big]$$

Comparing this with the cost above shows that $LL(\theta) = -N \cdot J(\theta)$, so maximizing the log-likelihood is the same as minimizing the average cost J.
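In code, the log turns the product into a sum; a brief sketch of the LL and J pair (all identifiers are illustrative, and sigmoid is redefined so the snippet stands alone):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # Sum of the log conditional probabilities over all N examples.
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def cost_J(theta, X, y):
    # Average cross-entropy cost: J = -LL / N, the quantity gradient descent minimizes.
    return -log_likelihood(theta, X, y) / len(y)
```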
To apply gradient descent here as well, we need to derive the formula for the
derivatives. Since

$$J(\theta) = -\frac{1}{N}\,LL(\theta),$$

where

$$LL(\theta) = \sum_{i=1}^{N}\Big[\,y_i \log h(x_i) + (1 - y_i)\log\big(1 - h(x_i)\big)\Big],$$

it is enough to differentiate LL. Using the chain rule, we can split the derivative.
Assume

$$z_i = \theta^T x_i, \qquad a_i = h(x_i) = \sigma(z_i).$$

Then

$$\frac{\partial LL}{\partial \theta_j} = \sum_{i=1}^{N} \frac{\partial LL_i}{\partial a_i}\cdot\frac{\partial a_i}{\partial z_i}\cdot\frac{\partial z_i}{\partial \theta_j}.$$

Thus, we need three parts.

1st part: Since $LL_i = y_i \log a_i + (1 - y_i)\log(1 - a_i)$,

$$\frac{\partial LL_i}{\partial a_i} = \frac{y_i}{a_i} - \frac{1 - y_i}{1 - a_i}.$$

2nd part: Proof for the derivative of the sigmoid function: since $\sigma(z) = \frac{1}{1 + e^{-z}}$,

$$\frac{d\sigma}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z)\big(1 - \sigma(z)\big), \qquad\text{so}\qquad \frac{\partial a_i}{\partial z_i} = a_i(1 - a_i).$$

3rd part: Since $z_i = \theta^T x_i$,

$$\frac{\partial z_i}{\partial \theta_j} = x_{ij}.$$

Let's combine the 3 parts:

$$\frac{\partial LL}{\partial \theta_j} = \sum_{i=1}^{N}\left(\frac{y_i}{a_i} - \frac{1 - y_i}{1 - a_i}\right) a_i (1 - a_i)\, x_{ij} = \sum_{i=1}^{N}\big(y_i - a_i\big)\, x_{ij},$$

and therefore

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{N}\sum_{i=1}^{N}\big(h(x_i) - y_i\big)\, x_{ij}.$$
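Putting the derived gradient to work, here is a minimal gradient-descent sketch under assumed names (X is an N×d matrix with a bias column, y holds 0/1 labels; the learning rate and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)      # a_i = h(x_i) for every example
        grad = X.T @ (h - y) / N    # dJ/dtheta_j = (1/N) sum_i (h(x_i) - y_i) x_ij
        theta -= lr * grad          # gradient descent step on J
    return theta

# Toy usage on made-up, nearly separable data.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + 0.1 * rng.normal(size=100) > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])  # bias column + one feature
theta = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ theta) > 0.5) == y)
print(theta, acc)  # accuracy should be close to 1.0 on this easy data
```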