FUTURE INSTITUTE OF TECHNOLOGY, KOLKATA

MACHINE LEARNING APPLICATIONS, 2024


PCCAIML 601
Report On

MLE & LOSS FUNCTION

Author: SHRUTANIK CHATTERJEE


Course Instructor: DR. PRADIPTA KR. BANERJEE

Department of CSE (AI & ML)


University Roll No: 34230822046
University Registration No: 223420120181 of 2022-23
6th Semester
Contents

1 Show that the derivative of σ(z) is σ(z) · (1 − σ(z))

2 Derivation of Cross Entropy Loss from Maximum Likelihood Estimation of Logistic Regression

3 Draw MSE, log-loss and hinge-loss
3.1 Python Code Implementation

4 If f(x; p) = p · (1 − p)^{x−1} and the sample [4, 5, 6, 5, 6, 3] is drawn from this distribution, estimate the maximum likelihood of p
1 Show that the derivative of σ(z) is σ(z) · (1 − σ(z))

Show that σ′(z) = σ(z)(1 − σ(z)).

The sigmoid function σ(z) is defined as:

σ(z) = \frac{1}{1 + e^{-z}}

Let us denote σ(z) as f(z). The derivative of f(z) with respect to z is:

\frac{d}{dz} f(z) = \frac{d}{dz} \left( \frac{1}{1 + e^{-z}} \right)

To find this derivative, we will use the quotient rule (here u′ and v′ are the derivatives of u and v respectively):

\frac{d}{dz} \left( \frac{u}{v} \right) = \frac{u′v − uv′}{v^2}

Let u = 1 and v = 1 + e^{-z}; then u′ = 0 and v′ = −e^{-z}.

∴ Applying the quotient rule, we get:

\frac{d}{dz} f(z) = \frac{0 · (1 + e^{-z}) − 1 · (−e^{-z})}{(1 + e^{-z})^2} = \frac{e^{-z}}{(1 + e^{-z})^2}

Also note that f(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{e^{z} + 1}, and f(z) = σ(z).

∴ \frac{d}{dz} σ(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \left[ \frac{1}{1 + e^{-z}} \right] · \left[ \frac{e^{-z}}{1 + e^{-z}} \right]

= \left[ \frac{1}{1 + e^{-z}} \right] · \left[ 1 − \frac{1}{1 + e^{-z}} \right]

= σ(z) · [1 − σ(z)]

∴ Derivative of 𝜎(𝑧) = 𝜎(𝑧) ⋅ (1 − 𝜎(𝑧))
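As a quick numerical sanity check of this identity (an illustration added here, not part of the derivation above), the short Python sketch below compares the analytic derivative σ(z)(1 − σ(z)) with a central finite-difference approximation; the helper names sigmoid and sigmoid_grad are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Analytic derivative shown above: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a central finite difference at several points
z = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
print(np.max(np.abs(numeric - sigmoid_grad(z))))  # on the order of 1e-10: the two agree
```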

2 Derivation of Cross Entropy Loss from Maximum Likelihood Estimation of Logistic Regression:
Derive the cross entropy loss from MLE estimation of Logistic Regression.

To derive the cross-entropy loss from MLE in Logistic Regression, we start with the likelihood function. In logistic regression, we model the probability of a binary outcome:

P(y = 1 ∣ x) = σ(ω^⊤ x) = \frac{1}{1 + e^{-ω^⊤ x}}

P(y = 0 ∣ x) = 1 − P(y = 1 ∣ x)

Here σ(z) is the logistic function, σ(z) = \frac{1}{1 + e^{-z}}.

• ω is the weight vector and x is the input vector.


For a dataset {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, the likelihood function can be written as:

L(ω) = \prod_{i=1}^{N} P(y_i ∣ x_i)


Taking the logarithm of the likelihood function (log-likelihood) to simplify calculations:

log L(ω) = \sum_{i=1}^{N} log P(y_i ∣ x_i)

∴ log P(y_i ∣ x_i) = y_i log(σ(ω^⊤ x_i)) + (1 − y_i) log(1 − σ(ω^⊤ x_i))
To maximize the likelihood function, we minimize the negative log-likelihood (averaged over the N samples):

∴ J(ω) = −\frac{1}{N} \sum_{i=1}^{N} [ y_i log(σ(ω^⊤ x_i)) + (1 − y_i) log(1 − σ(ω^⊤ x_i)) ]

Now J(ω) = −\frac{1}{N} \sum_{i=1}^{N} [ y_i log(\frac{1}{1 + e^{-ω^⊤ x_i}}) + (1 − y_i) log(1 − \frac{1}{1 + e^{-ω^⊤ x_i}}) ]

Now y_i log(\frac{1}{1 + e^{-ω^⊤ x_i}}) = y_i log(1) − y_i log(1 + e^{-ω^⊤ x_i}) = −y_i log(1 + e^{-ω^⊤ x_i})   … (i)

and (1 − y_i) log(1 − \frac{1}{1 + e^{-ω^⊤ x_i}}) = (1 − y_i) log(\frac{e^{-ω^⊤ x_i}}{1 + e^{-ω^⊤ x_i}})

= (1 − y_i) log(e^{-ω^⊤ x_i}) − (1 − y_i) log(1 + e^{-ω^⊤ x_i})

= −(1 − y_i) ω^⊤ x_i − (1 − y_i) log(1 + e^{-ω^⊤ x_i})   … (ii)

Substituting (i) and (ii) into the negative log-likelihood, we get:


J(ω) = \frac{1}{N} \sum_{i=1}^{N} [ y_i log(1 + e^{-ω^⊤ x_i}) + (1 − y_i) ω^⊤ x_i + (1 − y_i) log(1 + e^{-ω^⊤ x_i}) ]

= \frac{1}{N} \sum_{i=1}^{N} [ (1 − y_i) ω^⊤ x_i + log(1 + e^{-ω^⊤ x_i}) ]

Equivalently, if the labels are recoded as y_i ∈ {−1, +1} instead of {0, 1}, this can be written compactly as

J(ω) = \frac{1}{N} \sum_{i=1}^{N} log(1 + e^{-y_i ω^⊤ x_i}) ← cross entropy loss from
MLE estimation of Logistic Regression
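To illustrate that the {0, 1} cross-entropy form and the recoded {−1, +1} form above give the same value, here is a minimal NumPy sketch on hypothetical random data; the names X, w and y are assumptions made for this example, not quantities from the report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: N samples, d features, labels in {0, 1}
N, d = 100, 3
X = rng.normal(size=(N, d))
w = rng.normal(size=d)
y = rng.integers(0, 2, size=N)

z = X @ w                           # omega^T x_i for every sample
sigma = 1.0 / (1.0 + np.exp(-z))    # sigmoid probabilities

# Cross-entropy form: J = -(1/N) * sum[ y*log(sigma) + (1-y)*log(1-sigma) ]
J_cross_entropy = -np.mean(y * np.log(sigma) + (1 - y) * np.log(1 - sigma))

# Equivalent logistic-loss form with labels recoded to {-1, +1}
y_pm = 2 * y - 1
J_logistic = np.mean(np.log(1 + np.exp(-y_pm * z)))

print(J_cross_entropy, J_logistic)  # the two values agree up to floating-point error
```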
3 Draw MSE, log-loss and hinge-loss:

Draw MSE, Log Loss & Hinge Loss

(1) MSE (Mean Squared Error Loss): MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i − ŷ_i)^2

(2) Log Loss (Cross-Entropy Loss): L = −[ y log(ŷ) + (1 − y) log(1 − ŷ) ]

(3) Hinge Loss: L = max(0, 1 − y · f(x)), with y ∈ {−1, +1}

All three losses are plotted by the code in Section 3.1 below.
3.1 Python Code Implementation:
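The original code listing did not survive extraction, so the following is a minimal matplotlib sketch of what such a comparison could look like: it plots MSE against the residual (y − ŷ), and log loss and hinge loss against the margin y · f(x) with y ∈ {−1, +1}; those axis choices are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# x is interpreted as the residual (y - y_hat) for MSE,
# and as the margin y * f(x), y in {-1, +1}, for log loss and hinge loss.
x = np.linspace(-3, 3, 500)

mse = x ** 2                            # MSE as a function of the residual
log_loss = np.log(1 + np.exp(-x))       # logistic (cross-entropy) loss vs. margin
hinge_loss = np.maximum(0, 1 - x)       # hinge loss vs. margin

plt.figure(figsize=(7, 4))
plt.plot(x, mse, label="MSE")
plt.plot(x, log_loss, label="Log loss")
plt.plot(x, hinge_loss, label="Hinge loss")
plt.xlabel("residual / margin")
plt.ylabel("loss")
plt.title("Comparison of Loss Functions")
plt.legend()
plt.grid(True)
plt.show()
```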

Figure: Comparison of Loss Functions


4 If f(x; p) = p · (1 − p)^{x−1} and the sample [4, 5, 6, 5, 6, 3] is drawn from this distribution, estimate the maximum likelihood of p

If f(x; p) = p · (1 − p)^{x−1} and the following sample is drawn from this distribution: [4, 5, 6, 5, 6, 3], estimate the maximum likelihood of p.

1. Probability Mass Function: f(x; p) = p (1 − p)^{x−1}


2. L(p) = f(4; p) × f(5; p) × f(6; p) × f(5; p) × f(6; p) × f(3; p)

= p(1 − p)^3 × p(1 − p)^4 × p(1 − p)^5 × p(1 − p)^4 × p(1 − p)^5 × p(1 − p)^2

= p^6 × (1 − p)^{23}

To find the maximum likelihood estimate, we differentiate the likelihood function with respect to p.

\frac{∂}{∂p} L(p) = \frac{∂}{∂p} [ p^6 (1 − p)^{23} ]   [using the product rule: \frac{∂}{∂p}(uv) = u′v + uv′]

Here u = p^6 and v = (1 − p)^{23}.

\frac{∂}{∂p} L(p) = \frac{∂}{∂p} [ (1 − p)^{23} ] · p^6 + (1 − p)^{23} · \frac{∂}{∂p} [ p^6 ]

= 23(1 − p)^{22} · \frac{∂}{∂p}(1 − p) · p^6 + (1 − p)^{23} · 6p^5

= 23(1 − p)^{22} (0 − 1) p^6 + (1 − p)^{23} · 6p^5

= 6p^5 (1 − p)^{23} − 23 p^6 (1 − p)^{22}

= p^5 (1 − p)^{22} [ 6(1 − p) − 23p ]

= p^5 (1 − p)^{22} (6 − 29p)

Setting the derivative to zero: for the product to be equal to zero, at least one of the factors must be equal to zero. Thus we have three cases:

p^5 = 0 ⇒ p = 0,   (1 − p)^{22} = 0 ⇒ p = 1,   6 − 29p = 0 ⇒ p = 6/29

At p = 0 and p = 1 the likelihood itself is zero, so the maximum occurs at p = 6/29, which lies within the feasible range of probabilities 0 ⩽ p ⩽ 1.

∴ p = 6/29 ≈ 0.207

L(p) = (1 − 6/29)^{23} × (6/29)^6 ≈ 3.79 × 10^{−7}
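As a numerical cross-check of this result (a sketch added for illustration, not part of the original solution), the code below recomputes the estimate from the sample using the closed-form geometric MLE p̂ = n / Σ x_i and evaluates the likelihood at that point.

```python
import numpy as np

# Sample drawn from the geometric distribution f(x; p) = p * (1 - p)**(x - 1)
sample = np.array([4, 5, 6, 5, 6, 3])
n, total = len(sample), sample.sum()    # n = 6, total = 29

# Closed-form MLE for the geometric distribution: p_hat = n / sum(x_i)
p_hat = n / total
print(p_hat)                            # 6/29 ≈ 0.2069

# Likelihood at the estimate: L(p_hat) = p_hat**6 * (1 - p_hat)**23
L_hat = p_hat ** n * (1 - p_hat) ** (total - n)
print(L_hat)                            # ≈ 3.79e-07

# Grid search confirms the likelihood peaks at p = 6/29
p_grid = np.linspace(0.001, 0.999, 9999)
L_grid = p_grid ** n * (1 - p_grid) ** (total - n)
print(p_grid[np.argmax(L_grid)])        # ≈ 0.207
```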
