Linear Regression

Dr. Shaikh Fattah
Professor, Department of EEE, BUET
Chair, IEEE Signal Processing Society Bangladesh Chapter

International FDP Organized by Mizoram University & North-Eastern Hill University
Shillong, India, 25 March 2022
Brief Biography
▸ Prof., Dept. of EEE, BUET
▸ In-Charge, BUET Robotics Lab
▸ Editorial Board Member, IEEE Access
▸ Editorial Board Member, IEEE Potentials
▸ Editor-in-Chief, IEEE PES Enews; A/E CSSP
▸ Chair, IEEE PES-HAC
▸ Chair, IEEE SSIT Chapters Committee
▸ Member of LRPC, IEEE PES
▸ Chair ('15-'16), IEEE Bangladesh Section
▸ Chair, IEEE SPS Bangladesh (BD) Chapter
▸ Vice-Chair, RAS, PES, EMBS, SSIT BD Chapters
▸ General Chair, ICAICT 2020, SPICSCON 2021, RAAICON 2019, BECITHCON 2019, R10-HTC 2017
▸ TPC Chair: TENSYMP 2020, WIECON, MediTec, ICIVPR, ICAEE

Major Awards
▸ Concordia University's Distinguished Doctoral Dissertation Prize (ENS, 2009)
▸ 2007 URSI Canadian Young Scientist Award
▸ Dr. Rashid Gold Medal (in MSc)
▸ BAS-TWAS Young Scientists Prize (2014)
▸ 2018 R10 Outstanding Volunteer Award
▸ 2016 IEEE MGA Achievement Award
▸ 2017 IEEE R10 HTA Outstanding Volunteer Award
▸ 2016 IEEE R10 HTA Outstanding Activities Award for Bangladesh Section
▸ Best paper awards: IEEE BECITHCON 2019, Biomedical Track at IEEE TENCON 2017, WIECON-ECE 2016
▸ 2020 IEEE Video & Image Processing Cup, 1st Runner-Up (Faculty Advisor of Team)



Linear Regression Model

What is "Linear" here?

Linear Regression: Find the Best Line!

Loss Function

Linear Regression: Loss Function

Linear Regression: Cost Function

y, X and W, b Representation

Minimizing the Cost Function


Minimizing Cost: Direct Solution
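The direct-solution slides are presented graphically; as a hedged sketch of the standard closed-form least-squares result (the textbook formulation, not necessarily the slide's exact derivation), setting the gradient of the cost to zero gives the normal equations XᵀX w = Xᵀy:

```python
import numpy as np

# Design matrix with a leading column of ones (for the intercept) and targets;
# the rows echo the birthweight-IQ example used later in these slides.
X = np.array([[1.0, 575], [1.0, 650], [1.0, 832], [1.0, 850]])
y = np.array([59.0, 55.0, 67.0, 84.0])

# Solve the normal equations X^T X w = X^T y; lstsq is the numerically
# stable route (it avoids explicitly inverting X^T X).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # w[0] is the intercept b, w[1] is the slope m
```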


Feature Mapping: Handling Nonlinear Relations

Polynomial Feature Mapping

Polynomial Feature Mapping, M = 3

Polynomial Fitting: Model Quality

Polynomial Fitting: Model Complexity
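A minimal Python sketch of the polynomial-mapping idea for M = 3, assuming the standard mapping φ(x) = [1, x, x², x³] so that a model linear in φ(x) can fit a relation that is nonlinear in x (the sine target is purely illustrative):

```python
import numpy as np

def poly_features(x, M=3):
    """Map a scalar x to [1, x, x**2, ..., x**M]."""
    return np.array([x ** k for k in range(M + 1)])

# Fit a cubic (M = 3) by ordinary least squares on the mapped features.
xs = np.linspace(0.0, 1.0, 10)
ys = np.sin(2 * np.pi * xs)                      # illustrative nonlinear target
Phi = np.stack([poly_features(x) for x in xs])   # 10 x 4 design matrix
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)
print(Phi @ w)                                   # fitted values at the training points
```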


ℓ2 Regularizer
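This slide is also figure-based; a minimal sketch of ridge (ℓ2-regularized) least squares under the usual formulation, where the penalty λ‖w‖² shrinks the weights:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Solve (Phi^T Phi + lam * I) w = Phi^T y for the l2-regularized weights."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Larger lam shrinks the weights toward zero, taming over-complex polynomial fits.
```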


Linear Regression


How to Learn a Model?

▸ How to learn a mathematical model, after which you can predict any target value (e.g., IQ) given a feature value (e.g., birthweight)?

Training dataset (birthweight x in grams, IQ y at age 5):

  x (g)   y (IQ)
  575     59
  650     55
  832     67
  850     84
  933     87
  1001    81
  1111    88
  1230    92
  1321    101
  1370    102
  1390    85
  1422    95
  1480    120
  1487    114
  1490    100

Question: what will be the IQ of a baby who was born with weight 1090 grams? Answer: 84.

This is inference: it can be done after the model (i.e., the line fit through the scatter plot of IQ vs. birthweight) is trained. The training dataset above is what is used to train the model.


Learning a Model

▸ How do we represent a line in mathematics?
  • y = mx + b

For the best-fitting line through the birthweight-IQ dataset above:
  m = 0.052
  b = 29.21

But how can we find (or learn) m and b?


Learning a Model

▸ Let us try different random values of m and b:
  - m = 1 and b = 0, so y' = mx + b = x (e.g., y' = 1×1001 + 0 = 1001)

  x (g)   Actual IQ (y)   Predicted IQ (y')
  575     59              575
  650     55              650
  832     67              832
  850     84              850
  933     87              933
  1001    81              1001
  1111    88              1111
  1230    92              1230
  1321    101             1321
  1370    102             1370
  1390    85              1390
  1422    95              1422
  1480    120             1480
  1487    114             1487
  1490    100             1490

How close are these predicted IQs to the actual ones?


Learning a Model

▸ Let us try different random values of m and b:
  - m = 1 and b = 0

The error (predicted minus actual, y' − y):

  x (g)   y     y'     y' − y
  575     59    575    516
  650     55    650    595
  832     67    832    765
  850     84    850    766
  933     87    933    846
  1001    81    1001   920
  1111    88    1111   1023
  1230    92    1230   1138
  1321    101   1321   1220
  1370    102   1370   1268
  1390    85    1390   1305
  1422    95    1422   1327
  1480    120   1480   1360
  1487    114   1487   1373
  1490    100   1490   1390

  Sum of errors: Σ(y' − y) = 15812


Learning a Model

▸ Let us try different random values of m and b:
  - m = 0.052 and b = 29.21

  x (g)   y     y'        y' − y
  575     59    59.11     0.11
  650     55    63.01     8.01
  832     67    72.474    5.474
  850     84    73.41     -10.59
  933     87    77.726    -9.274
  1001    81    81.262    0.262
  1111    88    86.982    -1.018
  1230    92    93.17     1.17
  1321    101   97.902    -3.098
  1370    102   100.45    -1.55
  1390    85    101.49    16.49
  1422    95    103.154   8.154
  1480    120   106.17    -13.83
  1487    114   106.534   -7.466
  1490    100   106.69    6.69

  Error: Σ(y' − y) = -0.466

Note how the positive and negative errors nearly cancel; a raw sum of errors is therefore a poor measure of fit, which motivates squaring the errors next.


Learning a Model

▸ Let us observe the three options beside each other:
  m = 1, b = 0        m = 0.05, b = 2        m = 0.052, b = 29.21

[Figure: three scatter plots of the data with the corresponding fitted lines; visually, the third (m = 0.052, b = 29.21) is the best!]
Mean Squared Error (MSE)

▸ Let us compare their squared errors over the 15 examples:

              m = 1, b = 0          m = 0.05, b = 2         m = 0.052, b = 29.21
  x     y     y'      (y'−y)²       y'      (y'−y)²         y'        (y'−y)²
  575   59    575     266256        30.75   798.0625        59.11     0.0121
  650   55    650     354025        34.5    420.25          63.01     64.1601
  832   67    832     585225        43.6    547.56          72.474    29.964676
  850   84    850     586756        44.5    1560.25         73.41     112.1481
  933   87    933     715716        48.65   1470.7225       77.726    86.007076
  1001  81    1001    846400        52.05   838.1025        81.262    0.068644
  1111  88    1111    1046529       57.55   927.2025        86.982    1.036324
  1230  92    1230    1295044       63.5    812.25          93.17     1.3689
  1321  101   1321    1488400       68.05   1085.7025       97.902    9.597604
  1370  102   1370    1607824       70.5    992.25          100.45    2.4025
  1390  85    1390    1703025       71.5    182.25          101.49    271.9201
  1422  95    1422    1760929       73.1    479.61          103.154   66.487716
  1480  120   1480    1849600       76      1936            106.17    191.2689
  1487  114   1487    1885129       76.35   1417.5225       106.534   55.741156
  1490  100   1490    1932100       76.5    552.25          106.69    44.7561

  MEAN SQUARED ERRORS:  17922958/15          14019.9/15              936.9/15
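A minimal Python sketch that reproduces these numbers; the dataset and the three (m, b) candidates are taken directly from the table above:

```python
# Reproduce the MSE comparison from the table above.
xs = [575, 650, 832, 850, 933, 1001, 1111, 1230, 1321, 1370,
      1390, 1422, 1480, 1487, 1490]
ys = [59, 55, 67, 84, 87, 81, 88, 92, 101, 102, 85, 95, 120, 114, 100]

def mse(m, b):
    """Mean of the squared errors (y' - y)^2 for the line y' = m*x + b."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for m, b in [(1, 0), (0.05, 2), (0.052, 29.21)]:
    print(f"m={m}, b={b}: MSE = {mse(m, b):.4f}")
# m=1, b=0 gives ~1194863.9 (= 17922958/15); m=0.052, b=29.21 gives ~62.5 (= 936.9/15)
```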


Minimizing MSE to Learn a Model

▸ Comparing the errors of the three candidate lines over the n = 15 examples (same table as above), the objective becomes the mean squared error:

$$ \min \; \frac{1}{2n} \sum_{i=1}^{n} \left( y'^{(i)} - y^{(i)} \right)^2 $$

The objective is to minimize the mean squared error.


Learning a Line via Minimizing MSE

▸ How to learn a line of equation y' = mx + b given a labelled dataset?
  - By minimizing the mean squared error. That is:

$$ \min \; \frac{1}{2n} \sum_{i=1}^{n} \left( y'^{(i)} - y^{(i)} \right)^2 $$

$$ \min_{m,\,b} \; \frac{1}{2n} \sum_{i=1}^{n} \left( (m x^{(i)} + b) - y^{(i)} \right)^2 $$


Learning a Linear Regression Model via Minimizing a Cost Function

▸ Or, how to learn a linear regression model h_θ(x) = θ₀ + θ₁x given a labelled dataset (θ₀ = b, θ₁ = m, and h_θ(x) = y' when using y' = mx + b)?
  - By minimizing the mean squared error. That is:

$$ \min \; \frac{1}{2n} \sum_{i=1}^{n} \left( y'^{(i)} - y^{(i)} \right)^2 \;\equiv\; \min_{\theta_0,\,\theta_1} \; \frac{1}{2n} \sum_{i=1}^{n} \left( (\theta_0 + \theta_1 x^{(i)}) - y^{(i)} \right)^2 $$

The summation term is known as the cost function J(θ₀, θ₁). This problem is referred to as an optimization problem with the objective of minimizing J(θ₀, θ₁):

$$ \min_{\theta_0,\,\theta_1} \; J(\theta_0, \theta_1) $$
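A one-function Python sketch of this cost, matching the 1/(2n) convention above:

```python
def J(theta0, theta1, xs, ys):
    """Cost J(theta0, theta1) = (1/(2n)) * sum of squared prediction errors."""
    n = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# With the birthweight-IQ lists xs, ys defined earlier: J(29.21, 0.052, xs, ys) ~ 31.2
```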


Learning a Linear Regression Model via Minimizing a Cost Function

▸ But how can minimizing a cost function (e.g., a mean squared error) lead to fitting a given training dataset?
  - Let us assume our optimization objective is to minimize J(θ₁); thus, θ₀ = 0.

[Figure: left, the hypothesis y' = h_θ(x) plotted against the data; right, the cost J(θ₁) as a function of θ₁.]


Minimizing a Cost Function

[Figure: for y' = θ₁x with θ₁ = 3 (i.e., y' = 3x), the line misses the data and J(θ₁) is high; with θ₁ = 2 (i.e., y' = 2x), the line is closer to the data and J(θ₁) is lower.]


Learning a Linear Regression Model via Minimizing a Cost Function

- Our optimization objective is to minimize J(θ₁); thus, θ₀ = 0.

[Figure: at θ₁ = 1, the line y' = h_θ(x) passes through the data, and J(θ₁) sits at its lowest point.]

θ₁ = 1 corresponds to the linear regression model (i.e., line) that best fits this data! This θ₁ is at the global minimum.


Gradient Descent For Linear Regression

▸ Outline:
  - Have some cost function J(θ₀, …, θ_{n−1})
  - Start off with some guesses for θ₀, …, θ_{n−1}
    • It does not really matter what values you start off with, but a common choice is to set them all initially to zero
  - Keep changing θ₀, …, θ_{n−1} to reduce J(θ₀, …, θ_{n−1}) until we hopefully end up at a minimum location
    • When you are at a certain position on the surface of J, look around, then take a little step in the direction of the steepest descent, then repeat


Gradient Descent For Linear Regression

▸ Outline:
  - Have some cost function J(θ₀, …, θ_{n−1})
  - Start off with some guesses for θ₀, …, θ_{n−1}
    • It does not really matter what values you start off with, but a common choice is to set them all initially to zero
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \, \frac{\partial J(\theta_0, \ldots, \theta_{n-1})}{\partial \theta_j} $$

    }
    where α is the learning rate and ∂J/∂θⱼ is the partial derivative.


Gradient Descent For Linear Regression

▸ Outline (considering only two variables θ₀ and θ₁):
  - Have some cost function J(θ₀, θ₁)
  - Start off with some guesses for θ₀, θ₁
    • It does not really matter what values you start off with, but a common choice is to set them both initially to zero
  - Repeat until convergence {

$$ temp_0 = \theta_0 - \alpha \, \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} = \theta_0 - \alpha \, \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) $$

$$ temp_1 = \theta_1 - \alpha \, \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} = \theta_1 - \alpha \, \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} $$

$$ \theta_0 = temp_0, \quad \theta_1 = temp_1 $$

    }
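A minimal Python sketch of these simultaneous updates on the birthweight-IQ data; rescaling x to kilograms is an added assumption for numerical stability (the slides do not discuss feature scaling):

```python
# Gradient descent for y' = theta0 + theta1 * x on the birthweight-IQ data.
xs = [575, 650, 832, 850, 933, 1001, 1111, 1230, 1321, 1370,
      1390, 1422, 1480, 1487, 1490]
ys = [59, 55, 67, 84, 87, 81, 88, 92, 101, 102, 85, 95, 120, 114, 100]
xs_kg = [x / 1000 for x in xs]   # rescale grams -> kilograms (assumption)

theta0, theta1, alpha, n = 0.0, 0.0, 0.1, len(xs)
for _ in range(20000):
    # Simultaneous update: compute both gradients before touching theta.
    errs = [theta0 + theta1 * x - y for x, y in zip(xs_kg, ys)]
    grad0 = sum(errs) / n
    grad1 = sum(e * x for e, x in zip(errs, xs_kg)) / n
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

# theta1 is per kilogram; divide by 1000 to compare with the slides' m.
print(theta0, theta1 / 1000)   # ~29.21 and ~0.052
```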


Gradient Descent For Linear Regression

▸ The same outline and update rules as above raise a question: what is a partial derivative (i.e., ∂) exactly, and how was it calculated here for the mean squared error?


Gradient Descent For Linear Regression

▸ And why and how can α and ∂ serve in taking little steps in the direction of the steepest descent, so that we end up hitting the (global) minimum of J?


Average Speed vs. Instantaneous Speed

▸ Usain Bolt is regarded widely as the greatest sprinter of all time
  - He can run 100 meters in 9.58 seconds!

[Figure: distance vs. time, with the chord Δy/Δx from (0, 0) to (9.58 s, 100 m).]

What is the average speed of Usain Bolt?
  = change in distance / change in time
  = Δy/Δx
  = 100/9.58
  ≈ 10.43 m/s


Average Speed vs. Instantaneous Speed

[Figure: distance vs. time, with the chord Δy/Δx over the full race.]

But this average speed is different from the instantaneous speed!

Bolt does not cover the 100 m at a constant rate: he starts off a little slower, then accelerates, then decelerates a little towards the end.

[Figure: the actual distance curve with local slopes Δy/Δx measured at two different moments.]

This way, the local Δy/Δx changes from moment to moment (the opposite of a line, where it does not matter which two points you take since the slope is always the same). Consequently, at any given moment in time, a slope on the distance curve will differ from the average slope of the chord.


Instantaneous Speed

We can compute the slope around the steepest point if we are interested in the fastest instantaneous speed!

But that would be only an approximation, because the slope of the curve is constantly changing.

We can achieve a better approximation by measuring the slope with a smaller and smaller change in x, which yields a smaller and smaller change in y.


Instantaneous Slope

This instantaneous slope is what mathematicians denote as the derivative and write as:

$$ \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x} = \frac{dy}{dx} $$

where dx is an infinitely small change in x (d stands for differential).
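A quick numeric illustration of this limit, assuming an illustrative distance curve f(t) (not Bolt's actual split data): shrinking Δx makes the difference quotient settle toward the true derivative.

```python
# Difference quotient (f(t + h) - f(t)) / h for shrinking h, at t = 2 s.
f = lambda t: 100 * (t / 9.58) ** 2   # illustrative distance curve, not real data

for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, (f(2 + h) - f(2)) / h)
# The quotient settles toward the exact derivative 200*t/9.58**2 ~ 4.358 m/s at t = 2.
```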




Gradient Descent For Linear Regression

▸ Outline:
  - Have some cost function J(θ₀, …, θ_{n−1})
  - Start off with some guesses for θ₀, …, θ_{n−1}; a common choice is to set them all initially to zero
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \, \frac{\partial J(\theta_0, \ldots, \theta_{n-1})}{\partial \theta_j} $$

    }

What do the learning rate α and the partial derivative ∂ do?


The Impact of the Partial Derivative

▸ For simplicity, let us assume our optimization objective is to minimize J(θ₁); thus, θ₀ = 0.

[Figure: left, the hypothesis y' = θ₁x = 1x at θ₁ = 1; right, the corresponding point on the cost curve J(θ₁).]

h_θ(x) is the hypothesis function, and J(θ₁) is the cost function.


The Impact of the Partial Derivative

▸ With the objective of minimizing J(θ₁) (θ₀ = 0), consider the update at a point where the derivative is positive:

$$ \theta_1 = \theta_1 - \alpha \, \frac{d\, J(\theta_1)}{d\, \theta_1} = \theta_1 - \alpha \cdot (\text{positive number}) $$

This decreases θ₁ by a certain value: the new θ₁ lands to the left of the old θ₁, downhill toward the minimum.


The Impact of the Partial Derivative

▸ Now consider the update at a point where the derivative is negative:

$$ \theta_1 = \theta_1 - \alpha \, \frac{d\, J(\theta_1)}{d\, \theta_1} = \theta_1 - \alpha \cdot (\text{negative number}) $$

This increases θ₁ by a certain value: the new θ₁ lands to the right of the old θ₁, again downhill toward the minimum.


The Impact of the Partial Derivative

▸ Finally, at the minimum itself the derivative is 0:

$$ \theta_1 = \theta_1 - \alpha \cdot 0 $$

θ₁ remains the same; hence, gradient descent converges.


The Impact of the Learning Rate

▸ Objective: minimize J(θ₁), with θ₀ = 0. In the update

$$ \theta_1 = \theta_1 - \alpha \, \frac{d\, J(\theta_1)}{d\, \theta_1} $$

α is the learning rate. What happens if α is too small?


The Impact of the Learning Rate

▸ If α is a too-small number, θ₁ changes only a tiny bit on each step; hence, gradient descent becomes slow (it takes more time to converge).


The Impact of the Learning Rate

▸ If α is a too-large number, θ₁ changes a lot (and probably faster) on each step; hence, gradient descent can overshoot the minimum and, accordingly, fail to converge (or even diverge).


The Impact of the Learning Rate

▸ We can set α between 0 and 1 (say, 0.5, or a little more or less; hence, neither very small nor very large).


The Impact of the Learning Rate

▸ We can also keep α fixed, because as we approach the (global) minimum, gradient descent automatically starts taking smaller steps (i.e., θ₁ changes at a slower pace because the derivative becomes less steep).
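A tiny numeric check of these learning-rate effects, using J(θ₁) = θ₁² as an illustrative convex cost (derivative 2θ₁, minimum at 0; this is a stand-in, not the slides' actual J):

```python
# J(theta1) = theta1**2 has derivative 2*theta1 and its minimum at 0.
def run(alpha, theta1=1.0, steps=5):
    for _ in range(steps):
        theta1 = theta1 - alpha * 2 * theta1   # gradient-descent update
    return theta1

print(run(0.01))  # ~0.904: tiny steps, slow convergence
print(run(0.5))   # 0.0: reaches the minimum immediately
print(run(1.1))   # ~-2.49: overshoots and diverges (|theta1| grows each step)
```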


Gradient Descent For Linear Regression

▸ Outline:
  - Have some cost function J(θ₀, …, θ_{n−1})
  - Start off with some guesses for θ₀, …, θ_{n−1}; a common choice is to set them all initially to zero
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \, \frac{\partial J(\theta_0, \ldots, \theta_{n-1})}{\partial \theta_j} $$

    }

Here ∂ is the partial derivative and α is the learning rate, which controls how big a step we take when we update θⱼ. Now we understand the intuition behind gradient descent and how α and ∂ act together to make gradient descent work!




Logistic Regression



Example 1: Malignant or Benign

▸ Consider the example of recognizing whether a tumor in an input image is malignant or benign.

Input: an image (can be represented as a matrix of pixels)
  → Binary Classifier →
Output: Benign (0) or Malignant (1); the classes can be represented as integers!


Example 2: Spam or Not Spam

▸ As another example, consider the problem of detecting whether an email is spam or not spam.

Input: an email
  → Binary Classifier →
Output: Not Spam (0) or Spam (1); the classes can be represented as integers!

An email can be represented as a vector x = [x1, x2, . . . , xd], with each component xi corresponding to the presence (xi = 1) or absence (xi = 0) of a particular word (or feature) in the email.
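A minimal sketch of this presence/absence representation; the five-word vocabulary is borrowed from the concrete example later in these slides, and a real filter would use a far larger vocabulary and a proper tokenizer:

```python
vocab = ["and", "vaccine", "the", "of", "nigeria"]  # toy vocabulary (illustrative)

def to_features(email_text):
    """Binary presence vector: x_i = 1 iff vocab[i] occurs in the email."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(to_features("Get the vaccine of Nigeria now"))  # [0, 1, 1, 1, 1]
```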


Regression vs. Classification

▸ What are the possible outputs of the linear regression function h_θ(x) = θᵀx?

Here θ is a vector that holds all the parameters, that is, θ = [θ₀, θ₁, …, θ_m], and x is a vector that encompasses all the features, that is, x = [x₀, x₁, …, x_m] (x₀ shall always be equal to 1).


Regression vs. Classification

▸ What are the possible outputs of the linear regression function h_θ(x) = θᵀx?
  - Real-valued outputs:

  x = [x₀, …, x_m]  →  θᵀx  →  h_θ(x) ∈ ℝ


Regression vs. Classification

▸ What are the possible outputs of the linear regression function h_θ(x) = θᵀx?
  - Real-valued outputs.

Spam filter example: an email represented as x = [1, 0, 0, 1, 0], with parameters assumed to be θ = [3.2, 4.1, 2.9, 6.7, 1.1]:

  h_θ(x) = θᵀx = [3.2, 4.1, 2.9, 6.7, 1.1] · [1, 0, 0, 1, 0]


Regression vs. Classification

▸ The output h_θ(x) = θᵀx = 3.2 + 6.7 = 9.9 is real-valued (∈ ℝ), which makes it a regression problem.

But is the email spam or not spam? We need a discrete-valued output (e.g., 0 or 1).


Regression vs. Classification

▸ How can we make the possible outputs of h_θ(x) = θᵀx discrete-valued (as opposed to real-valued)?
  - By using an activation function (e.g., the sigmoid function):

$$ g(z) = \frac{1}{1 + e^{-z}} $$

[Figure: the sigmoid curve, rising from 0 toward 1 around z = 0.]

Assume a labeled example (x, y): if y = 1, we want g(z) ≈ 1 (i.e., we want a correct prediction). For this to happen, z ≫ 0.


Regression vs. Classification

▸ Likewise, if y = 0, we want g(z) ≈ 0 (i.e., we want a correct prediction). For this to happen, z ≪ 0.


Regression vs. Classification

▸ Recall the pipeline without an activation function:

  x = [x₀, …, x_m]  →  θᵀx  →  h_θ(x) ∈ ℝ


Regression vs. Classification

▸ With the sigmoid applied, the pipeline becomes:

  x = [x₀, …, x_m]  →  g(θᵀx)  →  h_θ(x) ∈ [0, 1]

We can now apply thresholding; for instance, if h_θ(x) < 0.5, predict 0; otherwise, predict 1.


Regression vs. Classification

▸ This makes it a classification problem (i.e., NOT a regression problem anymore), since the output is a discrete one.


The Logistic Regression Model

▸ What will be the output of the model h_θ(x) = θᵀx, where θ = [θ₀, …, θ_m] and x = [x₀, …, x_m]?
  - Real-valued.

▸ How can we make the output of h_θ(x) discrete?
  - By using the logistic function; this is the logistic regression model (or hypothesis function):

$$ h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} $$

  - And then applying thresholding after learning the model to predict the output as follows:

    if h_θ(x) < 0.5, predict 0
    if h_θ(x) ≥ 0.5, predict 1
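A minimal Python sketch of this hypothesis function and its thresholding rule:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis: g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def predict(theta, x):
    """Threshold at 0.5: predict 1 if h_theta(x) >= 0.5, else 0."""
    return 1 if h(theta, x) >= 0.5 else 0
```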


Towards Identifying the Logistic Regression Cost Function

▸ How to learn a logistic regression model h_θ(x) = g(θᵀx), where θ = [θ₀, …, θ_m] and x = [x₀, …, x_m]?
  - Perhaps by minimizing the Mean Squared Error (MSE). That is:

$$ \min_{\theta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( g(\theta^T x^{(i)}) - y^{(i)} \right)^2 \;\equiv\; \min_{\theta} \; J(\theta) $$

A convex cost function has a single global minimum that is easy for gradient descent to locate. Unfortunately, if we plot this cost function J(θ), it turns out to be non-convex.


Towards Identifying the Logistic Regression Cost Function

▸ Because this J(θ) is non-convex, gradient descent might get stuck at a local minimum and fail to locate the global minimum!


The Logistic Regression Cost Function

▸ How to learn a logistic regression model h_θ(x) = g(θᵀx), where θ = [θ₀, …, θ_m] and x = [x₀, …, x_m]?
  - Let us try a different cost function. That is:

$$ \mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases} $$

which is equivalent to:

$$ \mathrm{Cost}(h_\theta(x), y) = -y \log\left(h_\theta(x)\right) - (1 - y) \log\left(1 - h_\theta(x)\right) $$


The Logistic Regression Cost Function

▸ This function still assumes real-valued outputs for h_θ(x) (i.e., it still entails a regression problem), while logistic regression should predict discrete values (i.e., logistic regression is a classification problem).


The Logistic Regression Cost Function

▸ We therefore still need to apply the logistic function to the model output:

$$ g(z) = \frac{1}{1 + e^{-z}} $$


The Logistic Regression Cost Function

▸ How to learn a logistic regression model h_θ(x) = g(θᵀx)?
  - By minimizing the following cost function:

$$ \mathrm{Cost}(h_\theta(x), y) = -y \log\left(h_\theta(x)\right) - (1 - y) \log\left(1 - h_\theta(x)\right) $$

$$ = -y \log\left(g(\theta^T x)\right) - (1 - y) \log\left(1 - g(\theta^T x)\right) $$

$$ = -y \log\left(\frac{1}{1 + e^{-\theta^T x}}\right) - (1 - y) \log\left(1 - \frac{1}{1 + e^{-\theta^T x}}\right) $$


The Logistic Regression Cost Function

▸ If y = 1, we want θᵀx ≫ 0; if y = 0, we want θᵀx ≪ 0.

[Figure: −log(1/(1 + e^{−θᵀx})) decreases toward 0 as θᵀx grows, while −log(1 − 1/(1 + e^{−θᵀx})) increases, so each branch of the cost pushes θᵀx the right way.]


Learning a Logistic Regression Model

▸ How to learn a logistic regression model h_θ(x) = g(θᵀx), where θ = [θ₀, …, θ_m] and x = [x₀, …, x_m]?
  - By minimizing the above cost over the whole training set. That is:

$$ \min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) $$

$$ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ -y^{(i)} \log\left(\frac{1}{1 + e^{-\theta^T x^{(i)}}}\right) - (1 - y^{(i)}) \log\left(1 - \frac{1}{1 + e^{-\theta^T x^{(i)}}}\right) \right] $$
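A minimal Python sketch of this cost J(θ); a hand-rolled version, omitting the numerical safeguards against log(0) that library implementations add:

```python
import math

def J(theta, X, Y):
    """Logistic regression cost: average cross-entropy over the dataset."""
    total = 0.0
    for x, y in zip(X, Y):
        p = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / len(X)
```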


Gradient Descent For Logistic Regression

▸ Outline:
  - Have cost function J(θ), where θ = [θ₀, …, θ_m]
  - Start off with some guesses for θ₀, …, θ_m; a common choice is to set them all initially to zero
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j} $$

    }  (Note: update all θⱼ simultaneously.)

α is the learning rate, which controls how big a step we take when we update θⱼ.


Gradient Descent For Logistic Regression

▸ After applying the partial derivatives, the final update formula is:
  - Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_j^{(i)} $$

    }


Inference After Learning

▸ After learning the parameters θ = [θ₀, …, θ_m], we can predict the output of any new unseen x = [x₀, …, x_m] as follows:

    if h_θ(x) = 1/(1 + e^{-θᵀx}) < 0.5, predict 0
    else (h_θ(x) ≥ 0.5), predict 1


A Concrete Example: The Training Phase

▸ Let us apply logistic regression to the spam email recognition problem, assuming α = 0.5 and starting with θ = [0, 0, 0, 0, 0, 0].

A training dataset:

           and  vaccine  the  of  nigeria  y
  Email a   1     1       0    1    1      1
  Email b   0     0       1    1    0      0
  Email c   0     1       1    0    0      1
  Email d   1     0       0    1    0      0
  Email e   1     0       1    0    1      1
  Email f   1     0       1    1    0      0


A Concrete Example: The Training Phase

▸ In this table, 1 entails that a word (e.g., "and") is present in an email (e.g., "Email a"), and 0 entails that the word is absent (e.g., "and" in "Email b").


A Concrete Example: The Training Phase

▸ There are 5 words (or features), [x₁, x₂, x₃, x₄, x₅], and we define 6 parameters; the first one, θ₀, is the intercept.


A Concrete Example: The Training Phase

▸ The parameter vector is θ = [θ₀, θ₁, θ₂, θ₃, θ₄, θ₅] and the feature vector is x = [x₀, x₁, x₂, x₃, x₄, x₅], where x₀ = 1 to account for the intercept:

           x₀=1  x₁=and  x₂=vaccine  x₃=the  x₄=of  x₅=nigeria  y
  Email a   1     1       1           0       1      1          1
  Email b   1     0       0           1       1      0          0
  Email c   1     0       1           1       0      0          1
  Email d   1     1       0           0       1      0          0
  Email e   1     1       0           1       0      1          1
  Email f   1     1       0           1       1      0          0


Recap: Gradient Descent For Logistic Regression

▸ Repeat until convergence {

$$ \theta_j = \theta_j - \alpha \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_j^{(i)} $$

    }

First, let us calculate the factor (1/(1 + e^{-θᵀx⁽ⁱ⁾}) − y⁽ⁱ⁾) for every example in our training dataset.


A Concrete Example: The Training Phase

▸ With θ = [0, 0, 0, 0, 0, 0], θᵀx = 0 for every example, so 1/(1 + e⁰) = 0.5 and the factor times x₀ (= 1 everywhere) is:

  x              y   θᵀx                                  (1/(1+e^{-θᵀx}) − y)·x₀
  [1,1,1,0,1,1]  1   [0,0,0,0,0,0]·[1,1,1,0,1,1] = 0      (0.5 − 1)×1 = -0.5
  [1,0,0,1,1,0]  0   [0,0,0,0,0,0]·[1,0,0,1,1,0] = 0      (0.5 − 0)×1 = 0.5
  [1,0,1,1,0,0]  1   [0,0,0,0,0,0]·[1,0,1,1,0,0] = 0      (0.5 − 1)×1 = -0.5
  [1,1,0,0,1,0]  0   [0,0,0,0,0,0]·[1,1,0,0,1,0] = 0      (0.5 − 0)×1 = 0.5
  [1,1,0,1,0,1]  1   [0,0,0,0,0,0]·[1,1,0,1,0,1] = 0      (0.5 − 1)×1 = -0.5
  [1,1,0,1,1,0]  0   [0,0,0,0,0,0]·[1,1,0,1,1,0] = 0      (0.5 − 0)×1 = 0.5


Recap: Gradient Descent For Logistic Regression

▸ Second, let us calculate this factor multiplied by xⱼ⁽ⁱ⁾ for every example in our training dataset and for every θⱼ, where j is between 0 and m.




Recap: Gradient Descent For Logistic Regression

▸ Third, let us compute every θⱼ from its summed factor.


A Concrete Example: The Training Phase

▸ Update θ₀ (x₀ = 1 for every example), summing the factor column:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_0^{(i)} = -0.5 + 0.5 - 0.5 + 0.5 - 0.5 + 0.5 = 0 $$

Then θ₀ = θ₀ − α×0 = 0 − 0.5×0 = 0.

New parameter vector: θ = [0, θ₁, θ₂, θ₃, θ₄, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₁. The factor is now multiplied by x₁ = [1, 0, 0, 1, 1, 1] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_1^{(i)} = -0.5 + 0 + 0 + 0.5 - 0.5 + 0.5 = 0 $$

Then θ₁ = θ₁ − α×0 = 0 − 0.5×0 = 0.

New parameter vector: θ = [0, 0, θ₂, θ₃, θ₄, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₂, with x₂ = [1, 0, 1, 0, 0, 0] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_2^{(i)} = -0.5 + 0 - 0.5 + 0 + 0 + 0 = -1 $$

Then θ₂ = θ₂ − α×(−1) = 0 − 0.5×(−1) = 0.5.

New parameter vector: θ = [0, 0, 0.5, θ₃, θ₄, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₃, with x₃ = [0, 1, 1, 0, 1, 1] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_3^{(i)} = 0 + 0.5 - 0.5 + 0 - 0.5 + 0.5 = 0 $$

Then θ₃ = θ₃ − α×0 = 0 − 0.5×0 = 0.

New parameter vector: θ = [0, 0, 0.5, 0, θ₄, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₄, with x₄ = [1, 1, 0, 1, 0, 1] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_4^{(i)} = -0.5 + 0.5 + 0 + 0.5 + 0 + 0.5 = 1 $$

Then θ₄ = θ₄ − α×1 = 0 − 0.5×1 = -0.5.

New parameter vector: θ = [0, 0, 0.5, 0, -0.5, θ₅]


A Concrete Example: The Training Phase

▸ Update θ₅, with x₅ = [1, 0, 0, 0, 1, 0] across the six emails:

$$ \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_5^{(i)} = -0.5 + 0 + 0 + 0 - 0.5 + 0 = -1 $$

Then θ₅ = θ₅ − α×(−1) = 0 − 0.5×(−1) = 0.5.

New parameter vector after this full pass of simultaneous updates: θ = [0, 0, 0.5, 0, -0.5, 0.5]
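A short Python sketch that reproduces this single hand-computed gradient step, plus the inference on the new Email k shown a few slides below; the dataset and α = 0.5 are exactly those of the example:

```python
import math

# Toy spam dataset from the slides: x = [x0=1, and, vaccine, the, of, nigeria].
X = [[1, 1, 1, 0, 1, 1], [1, 0, 0, 1, 1, 0], [1, 0, 1, 1, 0, 0],
     [1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 1, 0]]
Y = [1, 0, 1, 0, 1, 0]
alpha = 0.5
theta = [0.0] * 6

def h(theta, x):
    """Logistic hypothesis h_theta(x) = sigmoid(theta^T x)."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

# One gradient-descent step with simultaneous updates, as on the slides.
grads = [sum((h(theta, x) - y) * x[j] for x, y in zip(X, Y)) for j in range(6)]
theta = [t - alpha * g for t, g in zip(theta, grads)]
print(theta)          # [0.0, 0.0, 0.5, 0.0, -0.5, 0.5]

# Inference on the new email k = [1, 0, 1, 0, 0, 1]:
k = [1, 0, 1, 0, 0, 1]
print(h(theta, k))    # ~0.731 >= 0.5, so predict class 1 (spam)
```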


A Concrete Example: Testing

▸ Let us now test logistic regression on the spam email recognition problem, using the just-learnt θ = [0, 0, 0.5, 0, -0.5, 0.5].
  - Note: Testing is typically done over a portion of the dataset that is not used during training, but rather kept only for testing the accuracy of the algorithm's predictions thus far.
  - In this example, we will test over all the examples that we used during training, just for illustrative purposes.


A Concrete Example: Testing

▸ Applying the threshold (if h_θ(x) ≥ 0.5, y' = 1; else y' = 0):

  x              y   θᵀx                                          h_θ(x) = 1/(1+e^{-θᵀx})   Predicted class (y')
  [1,1,1,0,1,1]  1   [0,0,0.5,0,-0.5,0.5]·[1,1,1,0,1,1] = 0.5     0.622459331               1
  [1,0,0,1,1,0]  0   [0,0,0.5,0,-0.5,0.5]·[1,0,0,1,1,0] = -0.5    0.377540669               0
  [1,0,1,1,0,0]  1   [0,0,0.5,0,-0.5,0.5]·[1,0,1,1,0,0] = 0.5     0.622459331               1
  [1,1,0,0,1,0]  0   [0,0,0.5,0,-0.5,0.5]·[1,1,0,0,1,0] = -0.5    0.377540669               0
  [1,1,0,1,0,1]  1   [0,0,0.5,0,-0.5,0.5]·[1,1,0,1,0,1] = 0.5     0.622459331               1
  [1,1,0,1,1,0]  0   [0,0,0.5,0,-0.5,0.5]·[1,1,0,1,1,0] = -0.5    0.377540669               0

No mispredictions!


A Concrete Example: Inference

▸ Let us infer whether a given new email, say, k = [1, 0, 1, 0, 0, 1], is spam or not, using logistic regression with the just-learnt parameter vector θ = [0, 0, 0.5, 0, -0.5, 0.5].

           x₀=1  x₁=and  x₂=vaccine  x₃=the  x₄=of  x₅=nigeria  y
  Email a   1     1       1           0       1      1          1
  Email b   1     0       0           1       1      0          0
  Email c   1     0       1           1       0      0          1
  Email d   1     1       0           0       1      0          0
  Email e   1     1       0           1       0      1          1
  Email f   1     1       0           1       1      0          0
  Email k   1     0       1           0       0      1          ?


A Concrete Example: Inference

▸ Computing the hypothesis for Email k:

θᵀk = [0, 0, 0.5, 0, -0.5, 0.5] · [1, 0, 1, 0, 0, 1] = 0.5×1 + 0.5×1 = 1

$$ h_\theta(k) = \frac{1}{1 + e^{-\theta^T k}} = \frac{1}{1 + e^{-1}} = 0.731 \geq 0.5 \;\Rightarrow\; \text{Class 1 (i.e., Spam)} $$


A Concrete Example: Inference

▸ Email k is therefore labelled spam (y = 1):

  Email k   1   0   1   0   0   1   →  y = 1

Somewhat interesting, since the model considered "vaccine" and "nigeria" indicative of spam!

