Linear Regression

Dr. Shaikh Fattah
Professor, Department of EEE, BUET
Chair, IEEE Signal Processing Society Bangladesh Chapter

International FDP Organized by Mizoram University & North-Eastern Hill University
Shillong, India, 25 March 2022
Brief Biography
▸ Prof., Dept. of EEE, BUET
▸ In-Charge, BUET Robotics Lab
▸ Editorial Board Member, IEEE Access
▸ Editorial Board Member, IEEE Potentials
▸ Editor-in-Chief, IEEE PES Enews; A/E CSSP
▸ Chair, IEEE PES-HAC
▸ Chair, IEEE SSIT Chapters Committee
▸ Member of LRPC, IEEE PES
▸ Chair ('15-'16), IEEE Bangladesh Section
▸ Chair, IEEE SPS Bangladesh (BD) Chapter
▸ Vice-Chair, RAS, PES, EMBS, SSIT BD Chapters
▸ General Chair, ICAICT 2020, SPICSCON 2021, RAAICON 2019, BECITHCON 2019, R10-HTC 2017
▸ TPC Chair: TENSYMP 2020, WIECON, MediTec, ICIVPR, ICAEE

Major Awards
▸ Concordia University's Distinguished Doctoral Dissertation Prize (ENS, 2009)
▸ 2007 URSI Canadian Young Scientist Award
▸ Dr. Rashid Gold Medal (in MSc)
▸ BAS-TWAS Young Scientists Prize (2014)
▸ 2018 R10 Outstanding Volunteer Award
▸ 2016 IEEE MGA Achievement Award
▸ 2017 IEEE R10 HTA Outstanding Volunteer Award
▸ 2016 IEEE R10 HTA Outstanding Activities Award for Bangladesh Section
▸ Best paper awards: IEEE BECITHCON 2019, Biomedical Track at IEEE TENCON 2017, WIECON-ECE 2016
▸ 2020 IEEE Video & Image Processing Cup, 1st Runner-Up (Faculty Advisor of Team)



Linear Regression Model

What is "Linear" here?

Linear Regression: Find the Best Line!

Loss Function

Linear Regression: Loss Function

Linear Regression: Cost Function

y, X and W, b Representation

Minimizing the Cost Function


Minimizing Cost: Direct Solution
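The direct-solution slides are presented graphically; as a hedged sketch of the standard closed-form least-squares result (the textbook formulation, not necessarily the slide's exact derivation), setting the gradient of the cost to zero gives the normal equations XᵀX w = Xᵀy:

```python
import numpy as np

# Design matrix with a leading column of ones (for the intercept) and targets;
# the rows echo the birthweight-IQ example used later in these slides.
X = np.array([[1.0, 575], [1.0, 650], [1.0, 832], [1.0, 850]])
y = np.array([59.0, 55.0, 67.0, 84.0])

# Solve the normal equations X^T X w = X^T y; lstsq is the numerically
# stable route (it avoids explicitly inverting X^T X).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # w[0] is the intercept b, w[1] is the slope m
```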


Feature Mapping: Handling Nonlinear Relations

Polynomial Feature Mapping

Polynomial Feature Mapping, M = 3

Polynomial Fitting: Model Quality

Polynomial Fitting: Model Complexity
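A minimal Python sketch of the polynomial-mapping idea for M = 3, assuming the standard mapping φ(x) = [1, x, x², x³] so that a model linear in φ(x) can fit a relation that is nonlinear in x (the sine target is purely illustrative):

```python
import numpy as np

def poly_features(x, M=3):
    """Map a scalar x to [1, x, x**2, ..., x**M]."""
    return np.array([x ** k for k in range(M + 1)])

# Fit a cubic (M = 3) by ordinary least squares on the mapped features.
xs = np.linspace(0.0, 1.0, 10)
ys = np.sin(2 * np.pi * xs)                      # illustrative nonlinear target
Phi = np.stack([poly_features(x) for x in xs])   # 10 x 4 design matrix
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)
print(Phi @ w)                                   # fitted values at the training points
```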


ℓ2 Regularizer
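This slide is also figure-based; a minimal sketch of ridge (ℓ2-regularized) least squares under the usual formulation, where the penalty λ‖w‖² shrinks the weights:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Solve (Phi^T Phi + lam * I) w = Phi^T y for the l2-regularized weights."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Larger lam shrinks the weights toward zero, taming over-complex polynomial fits.
```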


Linear Regression


How to Learn a Model?

▸ How to learn a mathematical model, after which you can predict any target value (e.g., IQ) given a feature value (e.g., birthweight)?

Training dataset (birthweight x in grams, IQ y at age 5):

  x (g)   y (IQ)
  575     59
  650     55
  832     67
  850     84
  933     87
  1001    81
  1111    88
  1230    92
  1321    101
  1370    102
  1390    85
  1422    95
  1480    120
  1487    114
  1490    100

Question: what will be the IQ of a baby who was born with weight 1090 grams? Answer: 84.

This is inference: it can be done after the model (i.e., the line fit through the scatter plot of IQ vs. birthweight) is trained. The training dataset above is what is used to train the model.


Learning a Model

▸ How do we represent a line in mathematics?
  • y = mx + b

For the best-fitting line through the birthweight-IQ dataset above:
  m = 0.052
  b = 29.21

But how can we find (or learn) m and b?


Learning a Model

▸ Let us try different random values of m and b:
  - m = 1 and b = 0, so y' = mx + b = x (e.g., y' = 1×1001 + 0 = 1001)

  x (g)   Actual IQ (y)   Predicted IQ (y')
  575     59              575
  650     55              650
  832     67              832
  850     84              850
  933     87              933
  1001    81              1001
  1111    88              1111
  1230    92              1230
  1321    101             1321
  1370    102             1370
  1390    85              1390
  1422    95              1422
  1480    120             1480
  1487    114             1487
  1490    100             1490

How close are these predicted IQs to the actual ones?


Learning a Model

▸ Let us try different random values of m and b:
  - m = 1 and b = 0

The error (predicted minus actual, y' − y):

  x (g)   y     y'     y' − y
  575     59    575    516
  650     55    650    595
  832     67    832    765
  850     84    850    766
  933     87    933    846
  1001    81    1001   920
  1111    88    1111   1023
  1230    92    1230   1138
  1321    101   1321   1220
  1370    102   1370   1268
  1390    85    1390   1305
  1422    95    1422   1327
  1480    120   1480   1360
  1487    114   1487   1373
  1490    100   1490   1390

  Sum of errors: Σ(y' − y) = 15812


Learning a Model

▸ Let us try different random values of m and b:
  - m = 0.052 and b = 29.21

  x (g)   y     y'        y' − y
  575     59    59.11     0.11
  650     55    63.01     8.01
  832     67    72.474    5.474
  850     84    73.41     -10.59
  933     87    77.726    -9.274
  1001    81    81.262    0.262
  1111    88    86.982    -1.018
  1230    92    93.17     1.17
  1321    101   97.902    -3.098
  1370    102   100.45    -1.55
  1390    85    101.49    16.49
  1422    95    103.154   8.154
  1480    120   106.17    -13.83
  1487    114   106.534   -7.466
  1490    100   106.69    6.69

  Error: Σ(y' − y) = -0.466

Note how the positive and negative errors nearly cancel; a raw sum of errors is therefore a poor measure of fit, which motivates squaring the errors next.


Learning a Model

▸ Let us observe the three options beside each other:
  m = 1, b = 0        m = 0.05, b = 2        m = 0.052, b = 29.21

[Figure: three scatter plots of the data with the corresponding fitted lines; visually, the third (m = 0.052, b = 29.21) is the best!]
Mean Squared Error (MSE)

▸ Let us compare their squared errors over the 15 examples:

              m = 1, b = 0          m = 0.05, b = 2         m = 0.052, b = 29.21
  x     y     y'      (y'−y)²       y'      (y'−y)²         y'        (y'−y)²
  575   59    575     266256        30.75   798.0625        59.11     0.0121
  650   55    650     354025        34.5    420.25          63.01     64.1601
  832   67    832     585225        43.6    547.56          72.474    29.964676
  850   84    850     586756        44.5    1560.25         73.41     112.1481
  933   87    933     715716        48.65   1470.7225       77.726    86.007076
  1001  81    1001    846400        52.05   838.1025        81.262    0.068644
  1111  88    1111    1046529       57.55   927.2025        86.982    1.036324
  1230  92    1230    1295044       63.5    812.25          93.17     1.3689
  1321  101   1321    1488400       68.05   1085.7025       97.902    9.597604
  1370  102   1370    1607824       70.5    992.25          100.45    2.4025
  1390  85    1390    1703025       71.5    182.25          101.49    271.9201
  1422  95    1422    1760929       73.1    479.61          103.154   66.487716
  1480  120   1480    1849600       76      1936            106.17    191.2689
  1487  114   1487    1885129       76.35   1417.5225       106.534   55.741156
  1490  100   1490    1932100       76.5    552.25          106.69    44.7561

  MEAN SQUARED ERRORS:  17922958/15          14019.9/15              936.9/15
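A minimal Python sketch that reproduces these numbers; the dataset and the three (m, b) candidates are taken directly from the table above:

```python
# Reproduce the MSE comparison from the table above.
xs = [575, 650, 832, 850, 933, 1001, 1111, 1230, 1321, 1370,
      1390, 1422, 1480, 1487, 1490]
ys = [59, 55, 67, 84, 87, 81, 88, 92, 101, 102, 85, 95, 120, 114, 100]

def mse(m, b):
    """Mean of the squared errors (y' - y)^2 for the line y' = m*x + b."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for m, b in [(1, 0), (0.05, 2), (0.052, 29.21)]:
    print(f"m={m}, b={b}: MSE = {mse(m, b):.4f}")
# m=1, b=0 gives ~1194863.9 (= 17922958/15); m=0.052, b=29.21 gives ~62.5 (= 936.9/15)
```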


Minimizing MSE to Learn a Model

▸ Comparing the errors of the three candidate lines over the n = 15 examples (same table as above), the objective becomes the mean squared error:

$$ \min \; \frac{1}{2n} \sum_{i=1}^{n} \left( y'^{(i)} - y^{(i)} \right)^2 $$

The objective is to minimize the mean squared error.


Learning a Line via Minimizing MSE

▸ How to learn a line of equation y' = mx + b given a labelled dataset?
  - By minimizing the mean squared error. That is:

$$ \min \; \frac{1}{2n} \sum_{i=1}^{n} \left( y'^{(i)} - y^{(i)} \right)^2 $$

$$ \min_{m,\,b} \; \frac{1}{2n} \sum_{i=1}^{n} \left( (m x^{(i)} + b) - y^{(i)} \right)^2 $$


Learning a Linear Regression Model via Minimizing a Cost Function

▸ Or, how to learn a linear regression model h_θ(x) = θ₀ + θ₁x given a labelled dataset (θ₀ = b, θ₁ = m, and h_θ(x) = y' when using y' = mx + b)?
  - By minimizing the mean squared error. That is:

$$ \min \; \frac{1}{2n} \sum_{i=1}^{n} \left( y'^{(i)} - y^{(i)} \right)^2 \;\equiv\; \min_{\theta_0,\,\theta_1} \; \frac{1}{2n} \sum_{i=1}^{n} \left( (\theta_0 + \theta_1 x^{(i)}) - y^{(i)} \right)^2 $$

The summation term is known as the cost function J(θ₀, θ₁). This problem is referred to as an optimization problem with the objective of minimizing J(θ₀, θ₁):

$$ \min_{\theta_0,\,\theta_1} \; J(\theta_0, \theta_1) $$
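A one-function Python sketch of this cost, matching the 1/(2n) convention above:

```python
def J(theta0, theta1, xs, ys):
    """Cost J(theta0, theta1) = (1/(2n)) * sum of squared prediction errors."""
    n = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# With the birthweight-IQ lists xs, ys defined earlier: J(29.21, 0.052, xs, ys) ~ 31.2
```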


Learning a Linear Regression Model via Minimizing a Cost Function

▸ But how can minimizing a cost function (e.g., a mean squared error) lead to fitting a given training dataset?
  - Let us assume our optimization objective is to minimize J(θ₁); thus, θ₀ = 0.

[Figure: left, the hypothesis y' = h_θ(x) plotted against the data; right, the cost J(θ₁) as a function of θ₁.]


Minimizing a Cost Function

[Figure: for y' = θ₁x with θ₁ = 3 (i.e., y' = 3x), the line misses the data and J(θ₁) is high; with θ₁ = 2 (i.e., y' = 2x), the line is closer to the data and J(θ₁) is lower.]


Learning a Linear Regression Model via Minimizing a Cost Function

- Our optimization objective is to minimize J(θ₁); thus, θ₀ = 0.

[Figure: at θ₁ = 1, the line y' = h_θ(x) passes through the data, and J(θ₁) sits at its lowest point.]

θ₁ = 1 corresponds to the linear regression model (i.e., line) that best fits this data! This θ₁ is at the global minimum.


Gradient Descent For Linear Regression

▸ Outline:
  - Have some cost function J(θ₀, …, θ_{n−1})
  - Start off with some guesses for θ₀, …, θ_{n−1}
    • It does not really matter what values you start off with, but a common choice is to set them all initially to zero
  - Keep changing θ₀, …, θ_{n−1} to reduce J(θ₀, …, θ_{n−1}) until we hopefully end up at a minimum location
    • When you are at a certain position on the surface of J, look around, then take a little step in the direction of the steepest descent, then repeat


Gradient Descent For Linear Regression

▸ Outline:
  - Have some cost function J(θ₀, …, θ_{n−1})
  - Start off with some guesses for θ₀, …, θ_{n−1}
    • It does not really matter what values you start off with, but a common choice is to set them all initially to zero
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \, \frac{\partial J(\theta_0, \ldots, \theta_{n-1})}{\partial \theta_j} $$

    }
    where α is the learning rate and ∂J/∂θⱼ is the partial derivative.


Gradient Descent For Linear Regression

▸ Outline (considering only two variables θ₀ and θ₁):
  - Have some cost function J(θ₀, θ₁)
  - Start off with some guesses for θ₀, θ₁
    • It does not really matter what values you start off with, but a common choice is to set them both initially to zero
  - Repeat until convergence {

$$ temp_0 = \theta_0 - \alpha \, \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} = \theta_0 - \alpha \, \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) $$

$$ temp_1 = \theta_1 - \alpha \, \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} = \theta_1 - \alpha \, \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} $$

$$ \theta_0 = temp_0, \quad \theta_1 = temp_1 $$

    }
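A minimal Python sketch of these simultaneous updates on the birthweight-IQ data; rescaling x to kilograms is an added assumption for numerical stability (the slides do not discuss feature scaling):

```python
# Gradient descent for y' = theta0 + theta1 * x on the birthweight-IQ data.
xs = [575, 650, 832, 850, 933, 1001, 1111, 1230, 1321, 1370,
      1390, 1422, 1480, 1487, 1490]
ys = [59, 55, 67, 84, 87, 81, 88, 92, 101, 102, 85, 95, 120, 114, 100]
xs_kg = [x / 1000 for x in xs]   # rescale grams -> kilograms (assumption)

theta0, theta1, alpha, n = 0.0, 0.0, 0.1, len(xs)
for _ in range(20000):
    # Simultaneous update: compute both gradients before touching theta.
    errs = [theta0 + theta1 * x - y for x, y in zip(xs_kg, ys)]
    grad0 = sum(errs) / n
    grad1 = sum(e * x for e, x in zip(errs, xs_kg)) / n
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

# theta1 is per kilogram; divide by 1000 to compare with the slides' m.
print(theta0, theta1 / 1000)   # ~29.21 and ~0.052
```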


Gradient Descent For Linear Regression

▸ The same outline and update rules as above raise a question: what is a partial derivative (i.e., ∂) exactly, and how was it calculated here for the mean squared error?


Gradient Descent For Linear Regression

▸ And why and how can α and ∂ serve in taking little steps in the direction of the steepest descent, so that we end up hitting the (global) minimum of J?


Average Speed vs. Instantaneous Speed

▸ Usain Bolt is regarded widely as the greatest sprinter of all time
  - He can run 100 meters in 9.58 seconds!

[Figure: distance vs. time, with the chord Δy/Δx from (0, 0) to (9.58 s, 100 m).]

What is the average speed of Usain Bolt?
  = change in distance / change in time
  = Δy/Δx
  = 100/9.58
  ≈ 10.43 m/s


Average Speed vs. Instantaneous Speed

[Figure: distance vs. time, with the chord Δy/Δx over the full race.]

But this average speed is different from the instantaneous speed!

Bolt does not cover the 100 m at a constant rate: he starts off a little slower, then accelerates, then decelerates a little towards the end.

[Figure: the actual distance curve with local slopes Δy/Δx measured at two different moments.]

This way, the local Δy/Δx changes from moment to moment (the opposite of a line, where it does not matter which two points you take since the slope is always the same). Consequently, at any given moment in time, a slope on the distance curve will differ from the average slope of the chord.


Instantaneous Speed

We can compute the slope around the steepest point if we are interested in the fastest instantaneous speed!

But that would be only an approximation, because the slope of the curve is constantly changing.

We can achieve a better approximation by measuring the slope with a smaller and smaller change in x, which yields a smaller and smaller change in y.


Instantaneous Slope

This instantaneous slope is what mathematicians denote as the derivative and write as:

$$ \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x} = \frac{dy}{dx} $$

where dx is an infinitely small change in x (d stands for differential).
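A quick numeric illustration of this limit, assuming an illustrative distance curve f(t) (not Bolt's actual split data): shrinking Δx makes the difference quotient settle toward the true derivative.

```python
# Difference quotient (f(t + h) - f(t)) / h for shrinking h, at t = 2 s.
f = lambda t: 100 * (t / 9.58) ** 2   # illustrative distance curve, not real data

for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, (f(2 + h) - f(2)) / h)
# The quotient settles toward the exact derivative 200*t/9.58**2 ~ 4.358 m/s at t = 2.
```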




Gradient Descent For Linear Regression

▸ Outline:
  - Have some cost function J(θ₀, …, θ_{n−1})
  - Start off with some guesses for θ₀, …, θ_{n−1}; a common choice is to set them all initially to zero
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \, \frac{\partial J(\theta_0, \ldots, \theta_{n-1})}{\partial \theta_j} $$

    }

What do the learning rate α and the partial derivative ∂ do?


The Impact of the Partial Derivative

▸ For simplicity, let us assume our optimization objective is to minimize J(θ₁); thus, θ₀ = 0.

[Figure: left, the hypothesis y' = θ₁x = 1x at θ₁ = 1; right, the corresponding point on the cost curve J(θ₁).]

h_θ(x) is the hypothesis function, and J(θ₁) is the cost function.


The Impact of the Partial Derivative

▸ With the objective of minimizing J(θ₁) (θ₀ = 0), consider the update at a point where the derivative is positive:

$$ \theta_1 = \theta_1 - \alpha \, \frac{d\, J(\theta_1)}{d\, \theta_1} = \theta_1 - \alpha \cdot (\text{positive number}) $$

This decreases θ₁ by a certain value: the new θ₁ lands to the left of the old θ₁, downhill toward the minimum.


The Impact of the Partial Derivative

▸ Now consider the update at a point where the derivative is negative:

$$ \theta_1 = \theta_1 - \alpha \, \frac{d\, J(\theta_1)}{d\, \theta_1} = \theta_1 - \alpha \cdot (\text{negative number}) $$

This increases θ₁ by a certain value: the new θ₁ lands to the right of the old θ₁, again downhill toward the minimum.


The Impact of the Partial Derivative

▸ Finally, at the minimum itself the derivative is 0:

$$ \theta_1 = \theta_1 - \alpha \cdot 0 $$

θ₁ remains the same; hence, gradient descent converges.


The Impact of the Learning Rate

▸ Objective: minimize J(θ₁), with θ₀ = 0. In the update

$$ \theta_1 = \theta_1 - \alpha \, \frac{d\, J(\theta_1)}{d\, \theta_1} $$

α is the learning rate. What happens if α is too small?


The Impact of the Learning Rate

▸ If α is a too-small number, θ₁ changes only a tiny bit on each step; hence, gradient descent becomes slow (it takes more time to converge).


The Impact of the Learning Rate

▸ If α is a too-large number, θ₁ changes a lot (and probably faster) on each step; hence, gradient descent can overshoot the minimum and, accordingly, fail to converge (or even diverge).


The Impact of the Learning Rate

▸ We can set α between 0 and 1 (say, 0.5, or a little more or less; hence, neither very small nor very large).


The Impact of the Learning Rate

▸ We can also keep α fixed, because as we approach the (global) minimum, gradient descent automatically starts taking smaller steps (i.e., θ₁ changes at a slower pace because the derivative becomes less steep).
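A tiny numeric check of these learning-rate effects, using J(θ₁) = θ₁² as an illustrative convex cost (derivative 2θ₁, minimum at 0; this is a stand-in, not the slides' actual J):

```python
# J(theta1) = theta1**2 has derivative 2*theta1 and its minimum at 0.
def run(alpha, theta1=1.0, steps=5):
    for _ in range(steps):
        theta1 = theta1 - alpha * 2 * theta1   # gradient-descent update
    return theta1

print(run(0.01))  # ~0.904: tiny steps, slow convergence
print(run(0.5))   # 0.0: reaches the minimum immediately
print(run(1.1))   # ~-2.49: overshoots and diverges (|theta1| grows each step)
```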


Gradient Descent For Linear Regression

▸ Outline:
  - Have some cost function J(θ₀, …, θ_{n−1})
  - Start off with some guesses for θ₀, …, θ_{n−1}; a common choice is to set them all initially to zero
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \, \frac{\partial J(\theta_0, \ldots, \theta_{n-1})}{\partial \theta_j} $$

    }

Here ∂ is the partial derivative and α is the learning rate, which controls how big a step we take when we update θⱼ. Now we understand the intuition behind gradient descent and how α and ∂ act together to make gradient descent work!




Logistic Regression



Example 1: Malignant or Benign

▸ Consider the example of recognizing whether a tumor in an input image is malignant or benign.

Input: an image (can be represented as a matrix of pixels)
  → Binary Classifier →
Output: Benign (0) or Malignant (1); the classes can be represented as integers!


Example 2: Spam or Not Spam

▸ As another example, consider the problem of detecting whether an email is spam or not spam.

Input: an email
  → Binary Classifier →
Output: Not Spam (0) or Spam (1); the classes can be represented as integers!

An email can be represented as a vector x = [x1, x2, . . . , xd], with each component xi corresponding to the presence (xi = 1) or absence (xi = 0) of a particular word (or feature) in the email.
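A minimal sketch of this presence/absence representation; the five-word vocabulary is borrowed from the concrete example later in these slides, and a real filter would use a far larger vocabulary and a proper tokenizer:

```python
vocab = ["and", "vaccine", "the", "of", "nigeria"]  # toy vocabulary (illustrative)

def to_features(email_text):
    """Binary presence vector: x_i = 1 iff vocab[i] occurs in the email."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(to_features("Get the vaccine of Nigeria now"))  # [0, 1, 1, 1, 1]
```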


Regression vs. Classification

▸ What are the possible outputs of the linear regression function h_θ(x) = θᵀx?

Here θ is a vector that holds all the parameters, that is, θ = [θ₀, θ₁, …, θ_m], and x is a vector that encompasses all the features, that is, x = [x₀, x₁, …, x_m] (x₀ shall always be equal to 1).


Regression vs. Classification

▸ What are the possible outputs of the linear regression function h_θ(x) = θᵀx?
  - Real-valued outputs:

  x = [x₀, …, x_m]  →  θᵀx  →  h_θ(x) ∈ ℝ


Regression vs. Classification

▸ What are the possible outputs of the linear regression function h_θ(x) = θᵀx?
  - Real-valued outputs.

Spam filter example: an email represented as x = [1, 0, 0, 1, 0], with parameters assumed to be θ = [3.2, 4.1, 2.9, 6.7, 1.1]:

  h_θ(x) = θᵀx = [3.2, 4.1, 2.9, 6.7, 1.1] · [1, 0, 0, 1, 0]


Regression vs. Classification

▸ The output h_θ(x) = θᵀx = 3.2 + 6.7 = 9.9 is real-valued (∈ ℝ), which makes it a regression problem.

But is the email spam or not spam? We need a discrete-valued output (e.g., 0 or 1).


Regression vs. Classification

▸ How can we make the possible outputs of h_θ(x) = θᵀx discrete-valued (as opposed to real-valued)?
  - By using an activation function (e.g., the sigmoid function):

$$ g(z) = \frac{1}{1 + e^{-z}} $$

[Figure: the sigmoid curve, rising from 0 toward 1 around z = 0.]

Assume a labeled example (x, y): if y = 1, we want g(z) ≈ 1 (i.e., we want a correct prediction). For this to happen, z ≫ 0.


Regression vs. Classification

▸ Likewise, if y = 0, we want g(z) ≈ 0 (i.e., we want a correct prediction). For this to happen, z ≪ 0.


Regression vs. Classification

▸ Recall the pipeline without an activation function:

  x = [x₀, …, x_m]  →  θᵀx  →  h_θ(x) ∈ ℝ


Regression vs. Classification

▸ With the sigmoid applied, the pipeline becomes:

  x = [x₀, …, x_m]  →  g(θᵀx)  →  h_θ(x) ∈ [0, 1]

We can now apply thresholding; for instance, if h_θ(x) < 0.5, predict 0; otherwise, predict 1.


Regression vs. Classification

▸ This makes it a classification problem (i.e., NOT a regression problem anymore), since the output is a discrete one.


The Logistic Regression Model

▸ What will be the output of the model h_θ(x) = θᵀx, where θ = [θ₀, …, θ_m] and x = [x₀, …, x_m]?
  - Real-valued.

▸ How can we make the output of h_θ(x) discrete?
  - By using the logistic function; this is the logistic regression model (or hypothesis function):

$$ h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} $$

  - And then applying thresholding after learning the model to predict the output as follows:

    if h_θ(x) < 0.5, predict 0
    if h_θ(x) ≥ 0.5, predict 1
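A minimal Python sketch of this hypothesis function and its thresholding rule:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis: g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def predict(theta, x):
    """Threshold at 0.5: predict 1 if h_theta(x) >= 0.5, else 0."""
    return 1 if h(theta, x) >= 0.5 else 0
```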


Towards Identifying the Logistic Regression Cost Function

▸ How to learn a logistic regression model h_θ(x) = g(θᵀx), where θ = [θ₀, …, θ_m] and x = [x₀, …, x_m]?
  - Perhaps by minimizing the Mean Squared Error (MSE). That is:

$$ \min_{\theta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( g(\theta^T x^{(i)}) - y^{(i)} \right)^2 \;\equiv\; \min_{\theta} \; J(\theta) $$

A convex cost function has a single global minimum that is easy for gradient descent to locate. Unfortunately, if we plot this cost function J(θ), it turns out to be non-convex.


Towards Identifying the Logistic Regression Cost Function

▸ Because this J(θ) is non-convex, gradient descent might get stuck at a local minimum and fail to locate the global minimum!


The Logistic Regression Cost Function

▸ How to learn a logistic regression model h_θ(x) = g(θᵀx), where θ = [θ₀, …, θ_m] and x = [x₀, …, x_m]?
  - Let us try a different cost function. That is:

$$ \mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases} $$

which is equivalent to:

$$ \mathrm{Cost}(h_\theta(x), y) = -y \log\left(h_\theta(x)\right) - (1 - y) \log\left(1 - h_\theta(x)\right) $$


The Logistic Regression Cost Function

▸ This function still assumes real-valued outputs for h_θ(x) (i.e., it still entails a regression problem), while logistic regression should predict discrete values (i.e., logistic regression is a classification problem).


The Logistic Regression Cost Function

▸ We therefore still need to apply the logistic function to the model output:

$$ g(z) = \frac{1}{1 + e^{-z}} $$


The Logistic Regression Cost Function

▸ How to learn a logistic regression model h_θ(x) = g(θᵀx)?
  - By minimizing the following cost function:

$$ \mathrm{Cost}(h_\theta(x), y) = -y \log\left(h_\theta(x)\right) - (1 - y) \log\left(1 - h_\theta(x)\right) $$

$$ = -y \log\left(g(\theta^T x)\right) - (1 - y) \log\left(1 - g(\theta^T x)\right) $$

$$ = -y \log\left(\frac{1}{1 + e^{-\theta^T x}}\right) - (1 - y) \log\left(1 - \frac{1}{1 + e^{-\theta^T x}}\right) $$


The Logistic Regression Cost Function

▸ If y = 1, we want θᵀx ≫ 0; if y = 0, we want θᵀx ≪ 0.

[Figure: −log(1/(1 + e^{−θᵀx})) decreases toward 0 as θᵀx grows, while −log(1 − 1/(1 + e^{−θᵀx})) increases, so each branch of the cost pushes θᵀx the right way.]


Learning a Logistic Regression Model

▸ How to learn a logistic regression model h_θ(x) = g(θᵀx), where θ = [θ₀, …, θ_m] and x = [x₀, …, x_m]?
  - By minimizing the above cost over the whole training set. That is:

$$ \min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) $$

$$ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ -y^{(i)} \log\left(\frac{1}{1 + e^{-\theta^T x^{(i)}}}\right) - (1 - y^{(i)}) \log\left(1 - \frac{1}{1 + e^{-\theta^T x^{(i)}}}\right) \right] $$
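A minimal Python sketch of this cost J(θ); a hand-rolled version, omitting the numerical safeguards against log(0) that library implementations add:

```python
import math

def J(theta, X, Y):
    """Logistic regression cost: average cross-entropy over the dataset."""
    total = 0.0
    for x, y in zip(X, Y):
        p = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / len(X)
```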


Gradient Descent For Logistic Regression

▸ Outline:
  - Have cost function J(θ), where θ = [θ₀, …, θ_m]
  - Start off with some guesses for θ₀, …, θ_m; a common choice is to set them all initially to zero
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j} $$

    }  (Note: update all θⱼ simultaneously.)

α is the learning rate, which controls how big a step we take when we update θⱼ.


Gradient Descent For Logistic Regression

▸ After applying the partial derivatives, the final update formula is:
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_j^{(i)} $$

    }


Inference After Learning

▸ After learning the parameters θ = [θ₀, …, θ_m], we can predict the output of any new unseen x = [x₀, …, x_m] as follows:

    if h_θ(x) = 1/(1 + e^{-θᵀx}) < 0.5, predict 0
    else (h_θ(x) ≥ 0.5), predict 1


A Concrete Example: The Training Phase

▸ Let us apply logistic regression to the spam email recognition problem, assuming α = 0.5 and starting with θ = [0, 0, 0, 0, 0, 0].

A training dataset:

           and  vaccine  the  of  nigeria  y
  Email a   1     1       0    1    1      1
  Email b   0     0       1    1    0      0
  Email c   0     1       1    0    0      1
  Email d   1     0       0    1    0      0
  Email e   1     0       1    0    1      1
  Email f   1     0       1    1    0      0


A Concrete Example: The Training Phase

▸ In this table, 1 entails that a word (e.g., "and") is present in an email (e.g., "Email a"), and 0 entails that the word is absent (e.g., "and" in "Email b").


A Concrete Example: The Training Phase

▸ There are 5 words (or features), [x₁, x₂, x₃, x₄, x₅], and we define 6 parameters; the first one, θ₀, is the intercept.


A Concrete Example: The Training Phase

▸ The parameter vector is θ = [θ₀, θ₁, θ₂, θ₃, θ₄, θ₅] and the feature vector is x = [x₀, x₁, x₂, x₃, x₄, x₅], where x₀ = 1 to account for the intercept:

           x₀=1  x₁=and  x₂=vaccine  x₃=the  x₄=of  x₅=nigeria  y
  Email a   1     1       1           0       1      1          1
  Email b   1     0       0           1       1      0          0
  Email c   1     0       1           1       0      0          1
  Email d   1     1       0           0       1      0          0
  Email e   1     1       0           1       0      1          1
  Email f   1     1       0           1       1      0          0


Recap: Gradient Descent For Logistic Regression

▸ Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_j^{(i)} $$

    }

First, let us calculate the factor (1/(1 + e^{-θᵀx⁽ⁱ⁾}) − y⁽ⁱ⁾) for every example in our training dataset.


A Concrete Example: The Training Phase

▸ With θ = [0, 0, 0, 0, 0, 0], θᵀx = 0 for every example, so 1/(1 + e⁰) = 0.5 and the factor times x₀ (= 1 everywhere) is:

  x              y   θᵀx                                  (1/(1+e^{-θᵀx}) − y)·x₀
  [1,1,1,0,1,1]  1   [0,0,0,0,0,0]·[1,1,1,0,1,1] = 0      (0.5 − 1)×1 = -0.5
  [1,0,0,1,1,0]  0   [0,0,0,0,0,0]·[1,0,0,1,1,0] = 0      (0.5 − 0)×1 = 0.5
  [1,0,1,1,0,0]  1   [0,0,0,0,0,0]·[1,0,1,1,0,0] = 0      (0.5 − 1)×1 = -0.5
  [1,1,0,0,1,0]  0   [0,0,0,0,0,0]·[1,1,0,0,1,0] = 0      (0.5 − 0)×1 = 0.5
  [1,1,0,1,0,1]  1   [0,0,0,0,0,0]·[1,1,0,1,0,1] = 0      (0.5 − 1)×1 = -0.5
  [1,1,0,1,1,0]  0   [0,0,0,0,0,0]·[1,1,0,1,1,0] = 0      (0.5 − 0)×1 = 0.5


Recap: Gradient Descent For Logistic Regression

▸ Second, let us calculate this factor multiplied by xⱼ⁽ⁱ⁾ for every example in our training dataset and for every θⱼ, where j is between 0 and m.




Recap: Gradient Descent For Logistic Regression

▸ Third, let us compute every θⱼ from its summed factor.


A Concrete Example: The Training Phase

▸ Update θ₀ (x₀ = 1 for every example), summing the factor column:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_0^{(i)} = -0.5 + 0.5 - 0.5 + 0.5 - 0.5 + 0.5 = 0 $$

Then θ₀ = θ₀ − α×0 = 0 − 0.5×0 = 0.

New parameter vector: θ = [0, θ₁, θ₂, θ₃, θ₄, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₁. The factor is now multiplied by x₁ = [1, 0, 0, 1, 1, 1] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_1^{(i)} = -0.5 + 0 + 0 + 0.5 - 0.5 + 0.5 = 0 $$

Then θ₁ = θ₁ − α×0 = 0 − 0.5×0 = 0.

New parameter vector: θ = [0, 0, θ₂, θ₃, θ₄, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₂, with x₂ = [1, 0, 1, 0, 0, 0] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_2^{(i)} = -0.5 + 0 - 0.5 + 0 + 0 + 0 = -1 $$

Then θ₂ = θ₂ − α×(−1) = 0 − 0.5×(−1) = 0.5.

New parameter vector: θ = [0, 0, 0.5, θ₃, θ₄, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₃, with x₃ = [0, 1, 1, 0, 1, 1] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_3^{(i)} = 0 + 0.5 - 0.5 + 0 - 0.5 + 0.5 = 0 $$

Then θ₃ = θ₃ − α×0 = 0 − 0.5×0 = 0.

New parameter vector: θ = [0, 0, 0.5, 0, θ₄, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₄, with x₄ = [1, 1, 0, 1, 0, 1] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_4^{(i)} = -0.5 + 0.5 + 0 + 0.5 + 0 + 0.5 = 1 $$

Then θ₄ = θ₄ − α×1 = 0 − 0.5×1 = -0.5.

New parameter vector: θ = [0, 0, 0.5, 0, -0.5, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₅, with x₅ = [1, 0, 0, 0, 1, 0] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_5^{(i)} = -0.5 + 0 + 0 + 0 - 0.5 + 0 = -1 $$

Then θ₅ = θ₅ − α×(−1) = 0 − 0.5×(−1) = 0.5.

New parameter vector after this full pass of simultaneous updates: θ = [0, 0, 0.5, 0, -0.5, 0.5]
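A short Python sketch that reproduces this single hand-computed gradient step, plus the inference on the new Email k shown a few slides below; the dataset and α = 0.5 are exactly those of the example:

```python
import math

# Toy spam dataset from the slides: x = [x0=1, and, vaccine, the, of, nigeria].
X = [[1, 1, 1, 0, 1, 1], [1, 0, 0, 1, 1, 0], [1, 0, 1, 1, 0, 0],
     [1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 1, 0]]
Y = [1, 0, 1, 0, 1, 0]
alpha = 0.5
theta = [0.0] * 6

def h(theta, x):
    """Logistic hypothesis h_theta(x) = sigmoid(theta^T x)."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

# One gradient-descent step with simultaneous updates, as on the slides.
grads = [sum((h(theta, x) - y) * x[j] for x, y in zip(X, Y)) for j in range(6)]
theta = [t - alpha * g for t, g in zip(theta, grads)]
print(theta)          # [0.0, 0.0, 0.5, 0.0, -0.5, 0.5]

# Inference on the new email k = [1, 0, 1, 0, 0, 1]:
k = [1, 0, 1, 0, 0, 1]
print(h(theta, k))    # ~0.731 >= 0.5, so predict class 1 (spam)
```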


A Concrete Example: Testing

▸ Let us now test logistic regression on the spam email recognition problem, using the just-learnt θ = [0, 0, 0.5, 0, -0.5, 0.5].
  - Note: Testing is typically done over a portion of the dataset that is not used during training, but rather kept only for testing the accuracy of the algorithm's predictions thus far.
  - In this example, we will test over all the examples that we used during training, just for illustrative purposes.


A Concrete Example: Testing

▸ Applying the threshold (if h_θ(x) ≥ 0.5, y' = 1; else y' = 0):

  x              y   θᵀx                                          h_θ(x) = 1/(1+e^{-θᵀx})   Predicted class (y')
  [1,1,1,0,1,1]  1   [0,0,0.5,0,-0.5,0.5]·[1,1,1,0,1,1] = 0.5     0.622459331               1
  [1,0,0,1,1,0]  0   [0,0,0.5,0,-0.5,0.5]·[1,0,0,1,1,0] = -0.5    0.377540669               0
  [1,0,1,1,0,0]  1   [0,0,0.5,0,-0.5,0.5]·[1,0,1,1,0,0] = 0.5     0.622459331               1
  [1,1,0,0,1,0]  0   [0,0,0.5,0,-0.5,0.5]·[1,1,0,0,1,0] = -0.5    0.377540669               0
  [1,1,0,1,0,1]  1   [0,0,0.5,0,-0.5,0.5]·[1,1,0,1,0,1] = 0.5     0.622459331               1
  [1,1,0,1,1,0]  0   [0,0,0.5,0,-0.5,0.5]·[1,1,0,1,1,0] = -0.5    0.377540669               0

No mispredictions!


A Concrete Example: Inference

▸ Let us infer whether a given new email, say, k = [1, 0, 1, 0, 0, 1], is spam or not, using logistic regression with the just-learnt parameter vector θ = [0, 0, 0.5, 0, -0.5, 0.5].

           x₀=1  x₁=and  x₂=vaccine  x₃=the  x₄=of  x₅=nigeria  y
  Email a   1     1       1           0       1      1          1
  Email b   1     0       0           1       1      0          0
  Email c   1     0       1           1       0      0          1
  Email d   1     1       0           0       1      0          0
  Email e   1     1       0           1       0      1          1
  Email f   1     1       0           1       1      0          0
  Email k   1     0       1           0       0      1          ?


A Concrete Example: Inference

▸ Computing the hypothesis for Email k:

θᵀk = [0, 0, 0.5, 0, -0.5, 0.5] · [1, 0, 1, 0, 0, 1] = 0.5×1 + 0.5×1 = 1

$$ h_\theta(k) = \frac{1}{1 + e^{-\theta^T k}} = \frac{1}{1 + e^{-1}} = 0.731 \geq 0.5 \;\Rightarrow\; \text{Class 1 (i.e., Spam)} $$


A Concrete Example: Inference

▸ Email k is therefore labelled spam (y = 1):

  Email k   1   0   1   0   0   1   →  y = 1

Somewhat interesting, since the model considered "vaccine" and "nigeria" indicative of spam!

