DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Welcome!
[Illustration: an email inbox with senders such as eddy, aarti, geoff, Ivy, daniel, andres, kin, and robert, where spam messages like "Re: Urgent Information :)" and "Congratulations!" promising "a million dollars!" are flagged]
Machine Learning
Applications of
Machine Learning
Machine Learning
Overview
What is
Machine Learning?
Machine learning
Andrew Ng
Question
Andrew Ng
Machine learning algorithms
- Supervised learning
- Unsupervised learning
- Recommender systems
- Reinforcement learning
Andrew Ng
Machine Learning
Overview
Supervised Learning
Part 1
Supervised learning: the algorithm learns a mapping from input x to output y by training on data labeled with the "right answers."
Andrew Ng
Input (X) Output (Y) Application
Andrew Ng
Regression: Housing price prediction
[Plot: price in $1000s (0 to 400) vs. house size in feet² (0 to 2500), with the training data and a fitted curve]
Regression: predict a number, from infinitely many possible outputs.
Andrew Ng
Machine Learning
Overview
Supervised Learning
Part 2
Classification: Breast cancer detection
[Plot: tumors marked benign (0) or malignant (1) against tumor size (diameter in cm)]
Andrew Ng
Classification: Breast cancer detection
[Number line of tumor diameter from 0 cm to 10 cm, with benign and malignant examples; the output could also have more than two categories, e.g. malignant type 1 vs. malignant type 2]
Classification: predict categories, from a small number of possible outputs.
Andrew Ng
Two or more inputs
[Plot: age vs. tumor size, with benign and malignant examples separated by a boundary]
Andrew Ng
Supervised learning: learns from data labeled with the "right answers."
Regression: predict a number (infinitely many possible outputs).
Classification: predict categories (small number of possible outputs).
Andrew Ng
Machine Learning
Overview
Unsupervised Learning
Part 1
Previous: Supervised learning
Andrew Ng
Supervised learning: learn from data labeled with the "right answers."
Unsupervised learning: find something interesting in unlabeled data.
[Plots: tumor size vs. age, once with class labels (supervised) and once without labels (unsupervised)]
Andrew Ng
Clustering: DNA microarray
genes (each row), individuals (each column)
Andrew Ng
Clustering: Grouping customers
[Illustration: grouping the members of a community into market segments, e.g. one segment motivated to stay updated with AI]
Andrew Ng
Machine Learning
Overview
Unsupervised Learning
Part 2
Unsupervised learning
Data only comes with inputs x, but not output labels y.
Algorithm has to find structure in the data.
Anomaly detection
Find unusual data points.
Andrew Ng
Question
Of the following examples, which would you address using
an unsupervised learning algorithm?
Given a set of news articles found on the web, group them into sets
of articles about the same story.
Andrew Ng
Machine Learning
Overview
Jupyter Notebooks
Linear Regression
with One Variable
Linear Regression
Model Part 1
House sizes and prices
[Plot: price in $1000s (0 to 500) vs. size in feet² (0 to 3000), with the training data and a straight line fitted by linear regression]
A supervised learning model is trained on data with the "right answers."
Regression model: predicts numbers (infinitely many possible outputs).
Classification model: predicts categories (small number of possible outputs).
Andrew Ng
House sizes and prices
[Plot: price in $1000s (0 to 500) vs. size in feet² (0 to 3000), showing the training data]
Data table:
size in feet²    price in $1000s
2104             400
1416             232
1534             315
852              178
⋮                ⋮
3210             870
Andrew Ng
Terminology
Training set: data used to train the model.
Notation:
x = "input" variable, also called a feature
y = "output" or "target" variable
m = number of training examples
(x, y) = single training example
(x^(i), y^(i)) = i-th training example

      size in feet² (x)    price in $1000s (y)
(1)   2104                 400
(2)   1416                 232
(3)   1534                 315
(4)   852                  178
 ⋮
(47)  3210                 870
Andrew Ng
Linear Regression
with One Variable
Linear Regression
Model Part 2
How to represent f?
training set → learning algorithm → f (the model)
Given an input x, the model outputs a prediction ŷ = f(x).
For linear regression with one variable: f_{w,b}(x) = wx + b, a straight line.
Andrew Ng
Linear Regression
with One Variable
Cost Function
Training set:
size in feet² (x)    price in $1000s (y)
2104                 460
1416                 232
1534                 315
852                  178
Model: f_{w,b}(x) = wx + b
w, b: the parameters of the model (also called coefficients or weights)
What do w and b do?
Andrew Ng
[Three example lines f_{w,b}(x) = wx + b plotted for x from 0 to 3:
w = 0, b = 1.5: a horizontal line at height 1.5
w = 0.5, b = 0: a line through the origin with slope 0.5
w = 0.5, b = 1: slope 0.5 with y-intercept 1]
Andrew Ng
Cost function: squared error cost function
J(w, b) = (1/2m) Σ_{i=1}^{m} (ŷ^(i) - y^(i))² = (1/2m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))²
where m is the number of training examples and ŷ^(i) = f_{w,b}(x^(i)).
Find w, b such that ŷ^(i) is close to y^(i) for all (x^(i), y^(i)).
Andrew Ng
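As a concrete illustration (a sketch of mine, not code from the slides; the function name compute_cost and the sample values are assumptions), the squared error cost can be computed with NumPy like this:

import numpy as np

def compute_cost(x, y, w, b):
    # x, y: 1-D arrays holding the m training inputs and targets
    # w, b: candidate parameters of the model f_{w,b}(x) = w*x + b
    m = x.shape[0]
    f = w * x + b                            # predictions for every example
    return np.sum((f - y) ** 2) / (2 * m)    # J(w, b)

# example call with the data table above (parameter values are hypothetical):
# compute_cost(np.array([2104., 1416., 1534., 852.]),
#              np.array([400., 232., 315., 178.]), w=0.2, b=20)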
Linear Regression
with One Variable
Cost Function
Intuition
model (simplified): f_w(x) = wx (set b = 0)
parameters: w
cost function: J(w) = (1/2m) Σ_{i=1}^{m} (f_w(x^(i)) - y^(i))²
goal: minimize J(w) over w
Andrew Ng
[Plots: left, f_w(x) vs. the input x for a fixed w; right, the cost J(w) vs. the parameter w; each choice of w gives one line on the left and one point on the J(w) curve on the right]
Andrew Ng
[The same pair of plots repeated for other values of w, tracing out the J(w) curve point by point; the w with the lowest J(w) gives the best-fitting line]
Andrew Ng
goal of linear regression: choose w to minimize J(w)
general case: choose w and b to minimize J(w, b)
[Plot: J(w) vs. w, with its minimum at the value of w that fits the data best]
Andrew Ng
Linear Regression
with One Variable
Visualizing
the Cost Function
Model: f_{w,b}(x) = wx + b
Parameters: w, b
Cost function: J(w, b) = (1/2m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))²
Objective: minimize J(w, b) over w and b
Andrew Ng
[Plots: left, the model f_{w,b}(x) over the housing data (price in $1000s vs. size in feet²); right, the cost J(w, b), which with two parameters is a surface over the (w, b) plane]
3D surface plot of J(w, b): a bowl-shaped surface over the (w, b) plane.
[Interactive lab figure: the fitted line over the housing data (price in $1000s vs. size in feet²) alongside the contour plot and the rotatable 3D surface of J(w, b)]
Andrew Ng
Linear Regression
with One Variable
Visualization examples
[Visualization examples: for several choices of w and b, the corresponding line over the housing data (price in $1000s vs. size in feet²) is shown next to the matching point on the contour and 3D surface plots of J(w, b); points far from the center of the contours correspond to poorly fitting lines, and the minimum of J(w, b) corresponds to the best fit]
Training Linear
Regression
Gradient Descent
Have some function J(w, b) (for linear regression or any other function).
Want min J(w, b) over w, b.
Outline:
- Start with some w, b (e.g. set w = 0, b = 0)
- Keep changing w, b to reduce J(w, b)
- Until we settle at or near a minimum
Note: some functions have more than one possible minimum.
Andrew Ng
Training Linear
Regression
Implementing
Gradient Descent
Gradient descent algorithm
repeat until convergence {
    w = w - α ∂/∂w J(w, b)
    b = b - α ∂/∂b J(w, b)
}
α: the learning rate (controls how big a step is taken).
∂/∂w J(w, b), ∂/∂b J(w, b): the derivative terms (which direction to step).
Here "=" is the assignment operator, as in code (the variable is overwritten), not a truth assertion.
Andrew Ng
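A minimal sketch of mine (not from the slides) of a single update step; dj_dw and dj_db stand for the two partial derivatives evaluated at the current parameters, and the point is that both new values are computed from the old w and b before either is overwritten:

def gradient_descent_step(w, b, alpha, dj_dw, dj_db):
    # dj_dw, dj_db: partial derivatives of J evaluated at the current (w, b)
    tmp_w = w - alpha * dj_dw   # both updates use the *old* w and b
    tmp_b = b - alpha * dj_db
    return tmp_w, tmp_b         # "=" in the math acts like this assignment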
Training Linear
Regression
Gradient Descent
Intuition
Gradient descent algorithm (one-parameter version for intuition, with b fixed at 0):
repeat until convergence { w = w - α d/dw J(w) }
[Plots: J(w) vs. w; where the slope d/dw J(w) is positive, w decreases toward the minimum, and where the slope is negative, w increases toward the minimum]
Andrew Ng
Training Linear
Regression
Learning Rate
If α is too small: the steps are tiny and gradient descent may be slow.
If α is too large: gradient descent may
- overshoot and never reach the minimum
- fail to converge, or even diverge
[Plots: small steps creeping toward the minimum vs. large steps bouncing back and forth across it]
Andrew Ng
If the current value of w is already at a local minimum, the slope there is zero, so the update w = w - α · 0 leaves w unchanged and gradient descent stays put.
Andrew Ng
Gradient descent can reach a local minimum with a fixed learning rate: as w approaches the minimum, the derivative becomes not as large, so the update steps automatically get smaller (large steps while the slope is large, smaller steps near the minimum).
Andrew Ng
Training Linear
Regression
Gradient Descent
for Linear Regression
Linear regression model: f_{w,b}(x) = wx + b
Cost function: J(w, b) = (1/2m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))²
Andrew Ng
(Optional) Deriving the derivative terms ∂/∂w J(w, b) and ∂/∂b J(w, b) with calculus.
Andrew Ng
Gradient descent algorithm
repeat until convergence {
    w = w - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x^(i)
    b = b - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))
}
Update w and b simultaneously (compute both right-hand sides before assigning either).
Andrew Ng
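Putting the pieces together, here is a compact sketch of mine (not the course's reference implementation; alpha and num_iters are illustrative) of batch gradient descent for f_{w,b}(x) = wx + b:

import numpy as np

def gradient_descent(x, y, alpha=1.0e-7, num_iters=1000):
    # x, y: 1-D arrays with the m training inputs and targets
    m = x.shape[0]
    w, b = 0.0, 0.0                     # start with w = 0, b = 0
    for _ in range(num_iters):
        err = (w * x + b) - y           # f_{w,b}(x^(i)) - y^(i) for every i
        dj_dw = np.dot(err, x) / m      # dJ/dw
        dj_db = np.sum(err) / m         # dJ/db
        w = w - alpha * dj_dw           # simultaneous update: both derivatives
        b = b - alpha * dj_db           # were computed before either change
    return w, b

# w, b = gradient_descent(np.array([2104., 1416., 1534., 852.]),
#                         np.array([400., 232., 315., 178.]))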
More than one local minimum
Andrew Ng
The squared error cost used with linear regression is a convex function: bowl-shaped, with a single global minimum and no other local minima, so gradient descent with an appropriate learning rate always converges to the global minimum.
Andrew Ng
Training Linear
Regression
Running
Gradient Descent
[Running gradient descent: a sequence of plots in which the straight-line fit over the housing data (price in $1000s vs. size in feet²) improves step by step while the corresponding point on the contour plot of J(w, b) moves toward the minimum]
"Batch" gradient descent: each step of gradient descent uses all the training examples (every size in feet² and price in $1000s in the training set), not just a subset.
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Multiple Features
Multiple features (variables)
Size in feet2 (𝑥) Price ($) in 1000’s (𝑦)
2104 400
1416 232
1534 315
852 178
… …
f_{w,b}(x) = wx + b   (the model with a single feature)
Andrew Ng
Multiple features (variables)
Size in feet²   Number of bedrooms   Number of floors   Age of home in years   Price ($) in $1000's
2104            5                    1                  45                     460
1416            3                    2                  40                     232
1534            3                    2                  30                     315
852             2                    1                  36                     178
…               …                    …                  …                      …
x_j = j-th feature
n = number of features
x^(i) = features of the i-th training example (a vector)
x_j^(i) = value of feature j in the i-th training example
Andrew Ng
Model:
Previously (one feature): f_{w,b}(x) = wx + b
With n features: f_{w,b}(x) = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + b
Collecting the parameters into a vector w = [w_1 w_2 ⋯ w_n] and the features into a vector x = [x_1 x_2 ⋯ x_n]:
f_{w,b}(x) = w ∙ x + b = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + b
This is multiple linear regression.
Andrew Ng
Linear Regression
with Multiple Variables
Vectorization
Part 1
Parameters and features:
w = [w_1 w_2 w_3] (a vector of parameters), b is a number, x = [x_1 x_2 x_3] (a vector of features)
Linear algebra counts from 1; code counts from 0.
w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])

Without vectorization (n = 3):
f = 0
for j in range(0, n):
    f = f + w[j] * x[j]
f = f + b
Andrew Ng
Linear Regression
with Multiple Variables
Vectorization
Part 2
Without vectorization:
f = 0
for j in range(0, 16):
    f = f + w[j] * x[j]
The loop runs sequentially: at step t_0 it adds w[0] * x[0], at t_1 it adds w[1] * x[1], …, at t_15 it adds w[15] * x[15].

Vectorization:
f = np.dot(w, x)
np.dot can multiply all the pairs w[0]·x[0], w[1]·x[1], …, w[15]·x[15] in parallel and then sum them into w[0]*x[0] + w[1]*x[1] + … + w[15]*x[15], so it runs much faster than the explicit loop, especially when n is large.
Andrew Ng
Gradient descent with vectors w = (w_1, w_2, ⋯, w_16) and derivatives d = (d_1, d_2, ⋯, d_16):
w = np.array([0.5, 1.3, …, 3.4])
d = np.array([0.3, 0.2, …, 0.4])
Compute w_j = w_j - 0.1 d_j for j = 1 … 16 (learning rate 0.1).

Without vectorization:
w_1 = w_1 - 0.1 d_1
w_2 = w_2 - 0.1 d_2
⋮
w_16 = w_16 - 0.1 d_16
for j in range(0, 16):
    w[j] = w[j] - 0.1 * d[j]

With vectorization: w = w - 0.1 d
w = w - 0.1 * d
Andrew Ng
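To see the speed difference on a larger vector, here is a rough benchmark sketch of mine (not from the slides; exact timings depend on the machine):

import numpy as np
import time

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)

start = time.time()
f = 0.0
for j in range(n):              # unvectorized dot product
    f = f + w[j] * x[j]
loop_time = time.time() - start

start = time.time()
f_vec = np.dot(w, x)            # vectorized dot product
dot_time = time.time() - start

print(f"loop: {loop_time:.3f}s   np.dot: {dot_time:.5f}s")
# np.dot is typically orders of magnitude faster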
Linear Regression
with Multiple Variables
Cost function: J(w_1, ⋯, w_n, b), written compactly as J(w, b).
Gradient descent:
repeat {
    w_j = w_j - α ∂/∂w_j J(w_1, ⋯, w_n, b)    i.e.   w_j = w_j - α ∂/∂w_j J(w, b)
    b = b - α ∂/∂b J(w_1, ⋯, w_n, b)          i.e.   b = b - α ∂/∂b J(w, b)
}
Andrew Ng
Gradient descent
One feature:
repeat {
    w = w - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x^(i)        ← ∂/∂w J(w, b)
    b = b - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))
} simultaneously update w, b

n features (n ≥ 2):
repeat {
    w_1 = w_1 - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x_1^(i)   ← ∂/∂w_1 J(w, b)
    ⋮
    w_n = w_n - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x_n^(i)
    b = b - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))
} simultaneously update w_j (for j = 1, ⋯, n) and b
Andrew Ng
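A vectorized sketch of these updates (my own illustration, not the course code; X is assumed to be an m × n feature matrix, and alpha/num_iters are placeholder values that assume reasonably scaled features):

import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        err = X @ w + b - y          # f_{w,b}(x^(i)) - y^(i) for all examples
        dj_dw = (X.T @ err) / m      # vector of partial derivatives dJ/dw_j
        dj_db = np.sum(err) / m      # dJ/db
        w = w - alpha * dj_dw        # simultaneous update of every w_j and b
        b = b - alpha * dj_db
    return w, b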
An alternative to gradient descent: the normal equation
• Only for linear regression
• Solves for w, b without iterations
Disadvantages:
• Doesn't generalize to other learning algorithms
• Slow when the number of features is large (> 10,000)
What you need to know:
• The normal equation method may be used in machine learning libraries that implement linear regression.
• Gradient descent is the recommended method for finding the parameters w, b.
Andrew Ng
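For reference, a sketch of mine of the closed-form idea using NumPy's least-squares solver, which also finds w and b without iterating (the column-of-ones trick folds b into the solution; this is not code from the course):

import numpy as np

def fit_closed_form(X, y):
    m = X.shape[0]
    Xb = np.hstack([X, np.ones((m, 1))])              # extra column of 1s for b
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)    # least-squares solution
    return theta[:-1], theta[-1]                      # (w, b)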
Practical Tips for
Linear Regression
Feature Scaling
Part 1
Feature and parameter values
price = w_1 x_1 + w_2 x_2 + b, where x_1 is the size (feet²) with range 300 - 2,000 and x_2 is the # bedrooms with range 0 - 5.
Example house: x_1 = 2000, x_2 = 5, true price = $500k.
With w_1 = 50, w_2 = 0.1, b = 50: price = 50 · 2000 + 0.1 · 5 + 50 = $100,050.5k, far too large.
With w_1 = 0.1, w_2 = 50, b = 50: price = 0.1 · 2000 + 50 · 5 + 50 = $500k, a reasonable estimate.
So a feature with a large range of values tends to end up with a small parameter, and a feature with a small range tends to end up with a larger parameter.
Andrew Ng
Feature size and parameter size: when a feature's values are large (size in feet², in the thousands), a good parameter for it tends to be small; when a feature's values are small (# bedrooms, 0 to 5), its parameter tends to be larger.
[Plots: scatter of the training data with x_1 = size in feet² and x_2 = # bedrooms, and the contour plot of J(w, b), which is tall and skinny when the features have very different ranges]

Feature size and gradient descent: with unscaled features the contours of J(w, b) are elongated and gradient descent can bounce back and forth for a long time before reaching the minimum; after the features are rescaled to comparable ranges, the contours become closer to circles and gradient descent takes a much more direct path to the minimum.
[Plots: the same scatter and contour plots before and after rescaling]
Andrew Ng
Practical Tips for
Linear Regression
Feature Scaling
Part 2
Feature scaling (divide by the maximum):
300 ≤ x_1 ≤ 2000 (size in feet²),   0 ≤ x_2 ≤ 5 (# bedrooms)
x_1,scaled = x_1 / 2000        x_2,scaled = x_2 / 5
Rescaled ranges: 0.15 ≤ x_1,scaled ≤ 1,   0 ≤ x_2,scaled ≤ 1
[Plots: the scatter of x_1 vs. x_2 before and after rescaling]
Andrew Ng
Mean normalization (subtract the mean μ_j, divide by the range):
300 ≤ x_1 ≤ 2000,   0 ≤ x_2 ≤ 5
x_1 = (x_1 - μ_1) / (2000 - 300)        x_2 = (x_2 - μ_2) / (5 - 0)
Normalized ranges: -0.18 ≤ x_1 ≤ 0.82,   -0.46 ≤ x_2 ≤ 0.54
[Plots: the data before and after mean normalization, now centered around 0]
Andrew Ng
Z-score normalization (subtract the mean μ_j, divide by the standard deviation σ_j):
300 ≤ x_1 ≤ 2000 with σ_1 = 450,   0 ≤ x_2 ≤ 5 with σ_2 = 1.4
x_1 = (x_1 - μ_1) / σ_1        x_2 = (x_2 - μ_2) / σ_2
Normalized ranges: -0.67 ≤ x_1 ≤ 3.1,   -1.6 ≤ x_2 ≤ 1.9
[Plots: the data before and after z-score normalization]
Andrew Ng
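A short NumPy sketch of z-score normalization (mine, with assumed variable names); the same μ and σ computed from the training set should also be applied to any later inputs:

import numpy as np

def zscore_normalize(X):
    mu = np.mean(X, axis=0)        # per-feature mean mu_j
    sigma = np.std(X, axis=0)      # per-feature standard deviation sigma_j
    return (X - mu) / sigma, mu, sigma

# X_norm, mu, sigma = zscore_normalize(X_train)
# a new example x would be normalized as (x - mu) / sigma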
Feature scaling: aim for roughly -1 ≤ x_j ≤ 1 for each feature x_j; nearby ranges such as -3 ≤ x_j ≤ 3 or -0.3 ≤ x_j ≤ 0.3 are also fine.
Examples:
0 ≤ x_1 ≤ 3              okay, no rescaling needed
-2 ≤ x_2 ≤ 0.5           okay, no rescaling needed
-100 ≤ x_3 ≤ 100         too large compared to ±1: rescale
-0.001 ≤ x_4 ≤ 0.001     too small: rescale
98.6 ≤ x_5 ≤ 105         values too large and far from 0: rescale
There is almost never any harm in rescaling, so when in doubt, rescale.
Andrew Ng
Practical Tips for
Linear Regression
Andrew Ng
Make sure gradient descent is working correctly
Objective: minimize J(w, b) over w, b. If gradient descent is working, J(w, b) should decrease after every iteration.
[Learning curve: J(w, b) on the vertical axis vs. # iterations (0 to 400) on the horizontal axis; the curve drops steeply at first (J after 100 iterations, then after 200), flattens out, and has likely converged by 400 iterations, meaning the parameters w, b are close to the global minimum. The number of iterations needed varies a lot between applications.]
Automatic convergence test: let ε ("epsilon") be a small number such as 10⁻³. If J(w, b) decreases by ≤ ε in one iteration, declare convergence (the parameters w, b found are likely close to the minimum).
Andrew Ng
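One way to wire this check into training, as a sketch of mine (the cost and gradient functions are passed in, so any of the hypothetical helpers sketched earlier would work):

def train_with_convergence_test(X, y, w, b, alpha, cost_fn, grad_fn,
                                epsilon=1e-3, max_iters=10000):
    J_history = []
    for it in range(max_iters):
        dj_dw, dj_db = grad_fn(X, y, w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        J_history.append(cost_fn(X, y, w, b))
        # automatic convergence test: stop once J decreases by <= epsilon
        if it > 0 and J_history[-2] - J_history[-1] <= epsilon:
            break
    return w, b, J_history   # plot J_history vs. iteration for the learning curve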
Practical Tips for
Linear Regression
Choosing the
Learning Rate
Identify problems with gradient descent:
[Plots: J(w, b) vs. # iterations that oscillate up and down or keep increasing] - either there is a bug in the code, or the learning rate α is too large.
A common bug is using a plus sign in the update, w_1 = w_1 + α d_1; use a minus sign: w_1 = w_1 - α d_1.
[Plots of J vs. the parameter w_1: with α too large, each update overshoots the minimum and J can increase; with a smaller α, J decreases on every iteration.]
If α is too small, gradient descent still works but takes a lot more iterations to converge.
Andrew Ng
Values of α to try: for example 0.001, 0.01, 0.1, 1, … (increasing by roughly 3× to 10× each time). Run gradient descent for a handful of iterations with each value, plot J(w, b) vs. # iterations, and pick the largest α that still makes the cost decrease steadily.
Andrew Ng
Practical Tips for
Linear Regression
Feature Engineering
Feature engineering: using intuition or domain knowledge to design new features, by transforming or combining the original features.
Example: predicting the price of a lot from its frontage (width) x_1 and depth x_2 starts as f_{w,b}(x) = w_1 x_1 + w_2 x_2 + b. Since the land area is often more predictive, define a new feature x_3 = x_1 · x_2 and use f_{w,b}(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b.
Andrew Ng
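In code, that new feature is just another column, as in this small sketch of mine (the sample numbers and column layout are assumptions):

import numpy as np

# X has two columns: frontage (x1) and depth (x2)
X = np.array([[30.0, 40.0],
              [25.0, 60.0]])
area = (X[:, 0] * X[:, 1]).reshape(-1, 1)   # engineered feature x3 = x1 * x2
X_eng = np.hstack([X, area])                # train on x1, x2 and x3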
Practical Tips for
Linear Regression
Polynomial Regression
Polynomial regression: fit curves by using powers of a feature as additional features.
Quadratic model: f_{w,b}(x) = w_1 x + w_2 x² + b (a parabola eventually comes back down, which may not suit housing prices)
Cubic model: f_{w,b}(x) = w_1 x + w_2 x² + w_3 x³ + b
[Plot: house price y vs. size x with the quadratic and cubic fits]
Note: features like x, x², x³ take on very different ranges of values, so feature scaling becomes especially important.
Andrew Ng
Choice of features: another option is f_{w,b}(x) = w_1 x + w_2 √x + b; the square-root feature keeps increasing as the size grows, but more and more slowly, and never comes back down.
[Plot: price y vs. size x with the square-root fit]
Andrew Ng
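Building such polynomial features by hand is a one-liner in NumPy; this is my own sketch (the sample sizes are illustrative), after which ordinary linear regression applies to the new columns:

import numpy as np

x = np.array([852.0, 1416.0, 1534.0, 2104.0])     # original feature: size
X_poly = np.c_[x, x**2, x**3]                      # engineered features x, x^2, x^3
X_poly = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)   # z-score scale each column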
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Motivations
Classification
Question Answer “𝑦”
Is this email spam? no yes
Is the transaction fraudulent? no yes
Is the tumor malignant? no yes
Andrew Ng
[Plot: malignant? ((no) 0 or (yes) 1) vs. tumor size x (diameter in cm), with benign and malignant examples marked]
Andrew Ng
Trying linear regression, f_{w,b}(x) = wx + b, on this data with a threshold of 0.5:
if f_{w,b}(x) < 0.5 → ŷ = 0        if f_{w,b}(x) ≥ 0.5 → ŷ = 1
[Plot: the fitted line over the malignant?/tumor-size data; adding one more training example far to the right shifts the best-fit line and moves the decision threshold, so previously correct classifications become wrong. Linear regression is therefore a poor fit for classification problems.]
Andrew Ng
Classification
Logistic Regression
Want outputs between 0 and 1.
Sigmoid function (logistic function): g(z) = 1 / (1 + e^(-z)), which always satisfies 0 < g(z) < 1.
[Plots: left, malignant? (0/1) vs. tumor size x (diameter in cm) with an S-shaped curve and a 0.5 threshold; right, g(z) vs. z from about -3 to 3, rising from near 0 toward 1 and passing through 0.5 at z = 0]
Andrew Ng
To build logistic regression, first compute z = w ∙ x + b, then apply the sigmoid g(z) = 1 / (1 + e^(-z)):
f_{w,b}(x) = g(w ∙ x + b) = 1 / (1 + e^(-(w ∙ x + b)))
This "logistic regression" model outputs values between 0 and 1.
Andrew Ng
Interpretation of logistic regression output
f_{w,b}(x) = 1 / (1 + e^(-(w ∙ x + b))) = P(y = 1 | x; w, b)
i.e. the "probability" that the class (label) is 1, given the input x and the parameters w, b.
Example: x is "tumor size" and y is 0 (not malignant) or 1 (malignant). If f_{w,b}(x) = 0.7, there is a 70% chance that y is 1; and since P(y = 0) + P(y = 1) = 1, there is a 30% chance that y is 0.
Andrew Ng
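A minimal sketch of the model in NumPy (mine, not code from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # g(z), strictly between 0 and 1

def predict_proba(x, w, b):
    return sigmoid(np.dot(w, x) + b)           # f_{w,b}(x) = P(y = 1 | x; w, b)

# an output above 0.5 would typically be classified as y-hat = 1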
Classification
Decision Boundary
f_{w,b}(x) = g(w ∙ x + b) = 1 / (1 + e^(-(w ∙ x + b))) = P(y = 1 | x; w, b), where z = w ∙ x + b and g(z) = 1 / (1 + e^(-z)).
[Plot: g(z) vs. z, equal to 0.5 exactly at z = 0]
When does the model predict ŷ = 1 (yes) and when ŷ = 0 (no)?
f_{w,b}(x) ≥ 0.5  ⇔  g(z) ≥ 0.5  ⇔  z ≥ 0  ⇔  w ∙ x + b ≥ 0   →   ŷ = 1
f_{w,b}(x) < 0.5  ⇔  g(z) < 0.5  ⇔  z < 0  ⇔  w ∙ x + b < 0   →   ŷ = 0
Andrew Ng
Decision boundary
f_{w,b}(x) = g(z) = g(w_1 x_1 + w_2 x_2 + b)
Example with w_1 = 1, w_2 = 1, b = -3: the decision boundary is where z = w ∙ x + b = 0, i.e. x_1 + x_2 - 3 = 0, the line x_1 + x_2 = 3.
Predict ŷ = 1 where x_1 + x_2 ≥ 3 and ŷ = 0 where x_1 + x_2 < 3.
[Plot: x_2 vs. x_1 (each from 0 to 3) with the line x_1 + x_2 = 3 separating the two regions of training examples]
Andrew Ng
Non-linear decision boundaries
f_{w,b}(x) = g(z) = g(w_1 x_1² + w_2 x_2² + b); for example with w_1 = 1, w_2 = 1, b = -1 the decision boundary is the circle x_1² + x_2² = 1: predict ŷ = 1 where x_1² + x_2² ≥ 1 (outside the circle) and ŷ = 0 inside it.
[Plot: x_2 vs. x_1 with a circular decision boundary of radius 1]
With higher-order polynomial features, logistic regression can form even more complex decision boundaries.
[Plot: x_2 vs. x_1 with an irregular, non-linear boundary]
Andrew Ng
Cost Function
Andrew Ng
Squared error cost (as used for linear regression):
J(w, b) = (1/m) Σ_{i=1}^{m} (1/2) (f_{w,b}(x^(i)) - y^(i))², the average over the training set of a per-example loss L(f_{w,b}(x^(i)), y^(i)).
For linear regression this J(w, b) is convex. For logistic regression, where f is the sigmoid of w ∙ x + b, the same squared error cost is non-convex, with many local minima where gradient descent can get stuck, so a different loss is used.
[Plots: a convex, bowl-shaped J(w, b) vs. a non-convex, wavy J(w, b)]
Andrew Ng
Logistic loss function, for an example with y^(i) = 1:
L(f_{w,b}(x^(i)), y^(i)) = -log(f_{w,b}(x^(i)))
[Plot: -log(f) for f between 0 and 1]
As f_{w,b}(x^(i)) → 1, the loss → 0; as f_{w,b}(x^(i)) → 0, the loss → ∞.
The loss is lowest when f_{w,b}(x^(i)) predicts close to the true label y^(i).
Andrew Ng
Logistic loss function, for an example with y^(i) = 0:
L(f_{w,b}(x^(i)), y^(i)) = -log(1 - f_{w,b}(x^(i)))
[Plot: -log(1 - f) for f between 0 and 1]
As f_{w,b}(x^(i)) → 0, the loss → 0; as f_{w,b}(x^(i)) → 1, the loss → ∞.
The further the prediction f_{w,b}(x^(i)) is from the target y^(i), the higher the loss.
Andrew Ng
Cost: J(w, b) = (1/m) Σ_{i=1}^{m} L(f_{w,b}(x^(i)), y^(i)). With the logistic loss, this cost function is convex, so gradient descent can reliably reach the global minimum.
Andrew Ng
Cost Function
Simplified Cost
Function for Logistic
Regression
Simplified loss function
L(f_{w,b}(x^(i)), y^(i)) = -y^(i) log(f_{w,b}(x^(i))) - (1 - y^(i)) log(1 - f_{w,b}(x^(i)))
Because y^(i) can only be 0 or 1:
if y^(i) = 1:  L(f_{w,b}(x^(i)), y^(i)) = -log(f_{w,b}(x^(i)))
if y^(i) = 0:  L(f_{w,b}(x^(i)), y^(i)) = -log(1 - f_{w,b}(x^(i)))
so this single expression is equivalent to the two-case definition above.
Andrew Ng
Simplified cost function
J(w, b) = (1/m) Σ_{i=1}^{m} L(f_{w,b}(x^(i)), y^(i))
        = -(1/m) Σ_{i=1}^{m} [ y^(i) log(f_{w,b}(x^(i))) + (1 - y^(i)) log(1 - f_{w,b}(x^(i))) ]
This cost, derived from the principle of maximum likelihood, is convex.
Andrew Ng
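A direct NumPy translation, as a sketch of mine (X is assumed to be an m × n matrix and y a 0/1 vector):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b):
    f = sigmoid(X @ w + b)                               # predictions in (0, 1)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)      # per-example logistic loss
    return loss.mean()                                   # J(w, b)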
Gradient Descent
Gradient Descent
Implementation
Training logistic regression: find the parameters w, b that minimize the cost J(w, b).
Given a new input x, output f_{w,b}(x) = 1 / (1 + e^(-(w ∙ x + b))), interpreted as P(y = 1 | x; w, b).
Andrew Ng
Gradient descent
repeat {
    w_j = w_j - α ∂/∂w_j J(w, b)        where ∂/∂w_j J(w, b) = (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x_j^(i)
    b = b - α ∂/∂b J(w, b)              where ∂/∂b J(w, b) = (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))
} simultaneous updates
Andrew Ng
Gradient descent for logistic regression
repeat {
    w_j = w_j - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x_j^(i)
    b = b - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))
} simultaneous updates
These updates look identical to the ones for linear regression, but the model f is different:
Linear regression:    f_{w,b}(x) = w ∙ x + b
Logistic regression:  f_{w,b}(x) = 1 / (1 + e^(-(w ∙ x + b)))
Same concepts as before apply: monitor gradient descent with a learning curve, use a vectorized implementation, and use feature scaling.
Andrew Ng
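Putting the pieces together, a vectorized sketch of mine of logistic-regression gradient descent (alpha and num_iters are illustrative and assume scaled features):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y        # f_{w,b}(x^(i)) - y^(i)
        w = w - alpha * (X.T @ err) / m     # same update shape as linear regression,
        b = b - alpha * err.mean()          # but f is the sigmoid of w.x + b
    return w, b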
Regularization to
Reduce Overfitting
The Problem of
Overfitting
Regression example
[Three fits of price vs. size:]
f(x) = w_1 x + b: does not fit the training set well (underfitting, "high bias")
f(x) = w_1 x + w_2 x² + b: fits the training set pretty well and generalizes to new examples
f(x) = w_1 x + w_2 x² + w_3 x³ + w_4 x⁴ + b: fits the training set extremely well but wiggles wildly and predicts poorly on new examples (overfitting, "high variance")
Andrew Ng
Classification
[Three decision boundaries for tumor data, plotted as x_2 vs. x_1:]
z = w_1 x_1 + w_2 x_2 + b, f_{w,b}(x) = g(z) with g the sigmoid function: a simple, nearly straight boundary that underfits
z = w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_2² + w_5 x_1 x_2 + b: an ellipse-like boundary that fits well
z = w_1 x_1 + w_2 x_2 + w_3 x_1² x_2 + w_4 x_1² x_2² + w_5 x_1² x_2³ + w_6 x_1³ x_2 + ⋯ + b: a highly contorted boundary that overfits
Andrew Ng
Regularization to
Reduce Overfitting
Addressing Overfitting
Collect more training examples
[Plots: price vs. size; with few examples a high-order model overfits, but with more training examples the same model learns a smoother curve]
Andrew Ng
Select features to include/exclude: with many features (size, bedrooms, floors, age, avg income, …, distance to coffee shop) but limited data, choose a smaller subset of the most relevant features to predict price.
Andrew Ng
Regularization
Reduce the size of the parameters w_j: rather than eliminating features outright, gently shrink their parameter values.
[Plots: the fitted curve of price vs. the features before and after shrinking the higher-order parameters]
Andrew Ng
Addressing overfitting
Options:
1. Collect more data
2. Select features (feature selection)
3. Reduce the size of the parameters: "regularization"
Andrew Ng
Regularization to
Reduce Overfitting
[Plots: price vs. size with a quadratic fit w_1 x + w_2 x² + b and an overfit quartic w_1 x + w_2 x² + w_3 x³ + w_4 x⁴ + b; if w_3 and w_4 are made very small (close to 0), the quartic behaves almost like the quadratic]
Andrew Ng
Regularization: small values of the parameters w_1, w_2, ⋯, w_n, b give a simpler model that is less likely to overfit. With many features (say w_1, ⋯, w_100, b) we usually don't know in advance which parameters to shrink, so regularization penalizes all of them by adding a term to the cost:
J(w, b) = (1/2m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))² + (λ/2m) Σ_{j=1}^{n} w_j²
where λ > 0 is the regularization parameter, balancing fitting the data against keeping the w_j small (b is usually not penalized).
Andrew Ng
Regularization: choosing λ for f_{w,b}(x) = w_1 x + w_2 x² + w_3 x³ + w_4 x⁴ + b.
If λ = 0 (no regularization), the model overfits. If λ is enormous, say λ = 10¹⁰, every w_j is driven to nearly 0 and f(x) ≈ b, a horizontal line that underfits.
[Plot: price vs. size for the two extremes]
Andrew Ng
Regularization to
Reduce Overfitting
Regularized Linear
Regression
Regularized linear regression
Cost: J(w, b) = (1/2m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))² + (λ/2m) Σ_{j=1}^{n} w_j²
Gradient descent
repeat {
    w_j = w_j - α ∂/∂w_j J(w, b)
    b = b - α ∂/∂b J(w, b)
} simultaneous update
Andrew Ng
Implementing gradient descent
The derivative terms now include the regularization penalty (for the w_j only, not for b):
∂/∂w_j J(w, b) = (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x_j^(i) + (λ/m) w_j
∂/∂b J(w, b) = (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))
repeat {
    w_j = w_j - α [ (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x_j^(i) + (λ/m) w_j ]
    b = b - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))
} simultaneous update
Andrew Ng
Implementing gradient descent (an equivalent view)
The w_j update can be rearranged as
w_j = w_j (1 - α λ/m) - α (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x_j^(i)
so on every iteration regularization first multiplies w_j by (1 - α λ/m), a factor slightly less than 1 that shrinks the parameter a little, and then applies the usual (unregularized) gradient step.
Andrew Ng
How we get the derivative term (optional)
∂/∂w_j J(w, b) = (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x_j^(i) + (λ/m) w_j
Differentiating the squared error term brings down a factor of 2 that cancels against the 1/2m, and differentiating the penalty (λ/2m) w_j² gives (λ/m) w_j.
Andrew Ng
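A sketch of the regularized gradient computation in NumPy (my own illustration; lambda_ avoids clashing with Python's lambda keyword):

import numpy as np

def regularized_gradients(X, y, w, b, lambda_):
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w   # penalty term added for every w_j
    dj_db = err.mean()                            # b is not regularized
    return dj_dw, dj_db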
Regularization to
Reduce Overfitting
Regularized Logistic
Regression
Regularized logistic regression
An overfit model: z = w_1 x_1 + w_2 x_2 + w_3 x_1² x_2 + w_4 x_1² x_2² + w_5 x_1² x_2³ + ⋯ + b, with f_{w,b}(x) = 1 / (1 + e^(-z))
[Plot: x_2 vs. x_1 with a highly contorted decision boundary]
Cost function: add the regularization term to the logistic cost:
J(w, b) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(f_{w,b}(x^(i))) + (1 - y^(i)) log(1 - f_{w,b}(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} w_j²
Minimizing this J(w, b) discourages very large w_j and yields a smoother decision boundary.
Andrew Ng
Regularized logistic regression
J(w, b) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(f_{w,b}(x^(i))) + (1 - y^(i)) log(1 - f_{w,b}(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} w_j²
Gradient descent
repeat {
    w_j = w_j - α ∂/∂w_j J(w, b)        where ∂/∂w_j J(w, b) = (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i)) x_j^(i) + (λ/m) w_j
    b = b - α ∂/∂b J(w, b)              where ∂/∂b J(w, b) = (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) - y^(i))
} simultaneous update
These updates have the same form as for regularized linear regression; only the definition of f_{w,b}(x) differs (here it is the sigmoid of w ∙ x + b).
Andrew Ng
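To close, a final sketch of mine combining the pieces for regularized logistic regression (again using lambda_ to avoid the Python keyword):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_logistic_gradients(X, y, w, b, lambda_):
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y                  # only f changes vs. linear regression
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w   # regularize every w_j
    dj_db = err.mean()                            # b is not regularized
    return dj_dw, dj_db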