Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Machine Learning

Welcome!
(Figure: an email inbox with messages from various senders — one message, "Re: Urgent Information :)", promises "Congratulations! … a million dollars!" — an everyday example of machine learning: spam filtering)
Machine Learning

Applications of
Machine Learning
Machine Learning
Overview

What is
Machine Learning?
Machine learning

"Field of study that gives computers the ability to learn
without being explicitly programmed."

Arthur Samuel (1959)

Andrew Ng
Question

If the checkers program had been allowed to play only ten games
(instead of tens of thousands) against itself, how would this have
affected its performance?

Would have made it better

Would have made it worse

Andrew Ng
Machine learning algorithms

- Supervised learning
- Unsupervised learning
- Recommender systems
- Reinforcement learning

Practical advice for applying learning algorithms

Andrew Ng
Machine Learning
Overview

Supervised Learning
Part 1
Supervised learning

Learns a mapping from input x to output label y.

The algorithm learns from data labeled with the "right answers".

Andrew Ng
Input (X)            Output (Y)                Application
email                spam? (0/1)               spam filtering
audio                text transcripts          speech recognition
English              Spanish                   machine translation
ad, user info        click? (0/1)              online advertising
image, radar info    position of other cars    self-driving car
image of phone       defect? (0/1)             visual inspection

Andrew Ng
Regression: Housing price prediction
(Plot: house size in feet² on the x-axis, price in $1000s on the y-axis)

Regression
Predict a number
infinitely many possible outputs

Andrew Ng
Machine Learning
Overview

Supervised Learning
Part 2
Classification: Breast cancer detection
(Plot: tumor size (diameter in cm) on a number line, with each example labeled benign or malignant)

Andrew Ng
Classification: Breast cancer detection
(Plot: tumor diameter from 0 cm to 10 cm, with categories such as benign, malignant type 1, malignant type 2)

Classification
predict categories
small number of possible outputs

Andrew Ng
Two or more inputs
(Plot: examples plotted by two inputs, age and tumor size)

Andrew Ng
Supervised learning
Learns from data labeled with the "right answers"

Regression: predict a number; infinitely many possible outputs
Classification: predict categories; small number of possible outputs

Andrew Ng
Machine Learning
Overview

Unsupervised Learning
Part 1
Previous: Supervised learning

Now: Unsupervised learning

Andrew Ng
Supervised learning: learn from data labeled with the "right answers".
Unsupervised learning: find something interesting in unlabeled data.

(Plots: age vs. tumor size — labeled classes on the left, unlabeled points grouped into clusters on the right)


Andrew Ng
Clustering: Google news

Andrew Ng
Clustering: DNA microarray
(Figure: microarray heat map — genes in each row, individuals in each column)

Andrew Ng
Clustering: Grouping customers
(Figure: customers grouped into market segments, e.g. "growing skills and knowledge", "developing career", "staying updated with AI")

Andrew Ng
Machine Learning
Overview

Unsupervised Learning
Part 2
Unsupervised learning
Data only comes with inputs x, but not output labels y.
Algorithm has to find structure in the data.

Clustering: group similar data points together.
Anomaly detection: find unusual data points.
Dimensionality reduction: compress data using fewer numbers.

Andrew Ng
Question
Of the following examples, which would you address using
an unsupervised learning algorithm?

Given email labeled as spam/not spam, learn a spam filter.

Given a set of news articles found on the web, group them into sets of articles about the same story.

Given a database of customer data, automatically discover market segments and group customers into different market segments.

Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not.

Andrew Ng
Machine Learning
Overview

Jupyter Notebooks
Linear Regression
with One Variable

Linear Regression
Model Part 1
House sizes and prices
(Plot: house size in feet² vs. price in $1000s, with a linear regression line fit to the data)

Supervised learning model: learns from data given the "right answers"
Regression model: predicts numbers; infinitely many possible outputs
Classification model: predicts categories; small number of possible outputs

Andrew Ng
House sizes and prices
(Plot: price in $1000s vs. size in feet²)

Data table
size in feet²    price in $1000
2104             400
1416             232
1534             315
852              178
3210             870

Andrew Ng
Terminology
Training set: data used to train the model

     size in feet²    price in $1000
(1)  2104             400
(2)  1416             232
(3)  1534             315
(4)  852              178
 ⋮
(47) 3210             870

Notation:
x = "input" variable / feature
y = "output" / "target" variable
m = number of training examples
(x, y) = single training example
(x^(i), y^(i)) = i-th training example
(the superscript (i) is an index — 1st, 2nd, 3rd, … — not an exponent)

Andrew Ng
Linear Regression
with One Variable

Linear Regression
Model Part 2
How to represent f?

training set → learning algorithm → model f

x (feature) → f → ŷ (prediction, estimated target)

f_w,b(x) = wx + b

Linear regression with one variable (univariate linear regression).

Andrew Ng
Linear Regression
with One Variable

Cost Function
Training set

size in feet² (x)    price in $1000 (y)
2104                 460
1416                 232
1534                 315
852                  178

Model:  f_w,b(x) = wx + b
w, b: parameters

What do w, b do?

Andrew Ng
(Three plots of f_w,b(x) = wx + b:
 left: w = 0, b = 1.5 — a horizontal line at 1.5;
 middle: w = 0.5, b = 0 — a line through the origin with slope 0.5;
 right: w = 0.5, b = 1 — slope 0.5, intercept 1)

Andrew Ng
Cost function: Squared error cost function

J(w,b) = (1/(2m)) Σ_{i=1}^{m} (ŷ^(i) − y^(i))² = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))²

m = number of training examples

Find w, b such that ŷ^(i) is close to y^(i) for all (x^(i), y^(i)).

Andrew Ng
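A minimal NumPy sketch of this cost function (illustrative helper name; assumes x and y are 1-D arrays of the same length):

import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) for univariate linear regression."""
    m = x.shape[0]                    # number of training examples
    f_wb = w * x + b                  # model predictions for every example
    return np.sum((f_wb - y) ** 2) / (2 * m)

# Example with the house-price data from the slides
x_train = np.array([2104.0, 1416.0, 1534.0, 852.0])   # size in feet^2
y_train = np.array([460.0, 232.0, 315.0, 178.0])      # price in $1000s
print(compute_cost(x_train, y_train, w=0.2, b=10.0))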
Linear Regression
with One Variable

Cost Function
Intuition
model:          f_w,b(x) = wx + b   →  simplified: f_w(x) = wx   (set b = 0)

parameters:     w

cost function:  J(w) = (1/(2m)) Σ_{i=1}^{m} (f_w(x^(i)) − y^(i))²

goal:           minimize_w J(w)

Andrew Ng
(Left plot: f_w(x) as a function of the input x, for a fixed w, drawn over the training points.
 Right plot: the cost J(w) as a function of the parameter w.)

Andrew Ng
(The same pair of plots repeated for other choices of w, each tracing out another point on the J(w) curve.)

Andrew Ng
(Plot: the J(w) curve over w)

goal of linear regression (simplified model): minimize_w J(w)

general case: minimize_{w,b} J(w,b)

Andrew Ng
Linear Regression
with One Variable

Visualizing
the Cost Function
Model:          f_w,b(x) = wx + b

Parameters:     w, b

Cost function:  J(w,b) = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))²

Objective:      minimize_{w,b} J(w,b)

Andrew Ng
(Left plot: the model f_w,b(x) over the training data — price in $1000s vs. size in feet².
 Right plot: the cost J(w,b) as a function of both w and b.)

Andrew Ng
(3D surface plot and contour plot of the cost J(w,b) as a function of w and b)

Andrew Ng
(Interactive figure from the optional lab: the rotatable 3D cost surface shown alongside the data — price in $1000s vs. size in feet²)

Andrew Ng
Linear Regression
with One Variable

Visualization examples
(Visualization examples: several choices of w and b marked on the cost plot, each shown with the corresponding line over the data — price in $1000s vs. size in feet²)

Andrew Ng
Training Linear
Regression

Gradient Descent
Have some function J(w,b)
Want min_{w,b} J(w,b)

Outline:
Start with some w, b (e.g. set w = 0, b = 0)
Keep changing w, b to reduce J(w,b)
Until we settle at or near a minimum

Andrew Ng
(Surface plot: a cost function with multiple local minima — starting gradient descent from different points can lead to different valleys)

Andrew Ng
Training Linear
Regression

Implementing
Gradient Descent
Gradient descent algorithm
Repeat until convergence:
    w = w − α (∂/∂w) J(w,b)
    b = b − α (∂/∂b) J(w,b)
α is the learning rate; ∂/∂w J(w,b) is the derivative term.
Simultaneously update w and b.

Note: in code, "a = c" is an assignment; in math, "a == c" is a truth assertion.

Correct: simultaneous update          Incorrect
tmp_w = w − α (∂/∂w) J(w,b)           tmp_w = w − α (∂/∂w) J(w,b)
tmp_b = b − α (∂/∂b) J(w,b)           w = tmp_w
w = tmp_w                             tmp_b = b − α (∂/∂b) J(w,b)   ← uses the already-updated w
b = tmp_b                             b = tmp_b

Andrew Ng
Training Linear
Regression

Gradient Descent
Intuition
Gradient descent algorithm (simplified to one parameter w)
repeat until convergence:
    w = w − α (d/dw) J(w)

Andrew Ng
(Plots of J(w) vs. w: where the slope dJ/dw is positive the update decreases w; where the slope is negative it increases w — in both cases w moves toward the minimum)

Andrew Ng
Training Linear
Regression

Learning Rate
w = w − α (d/dw) J(w)

If α is too small:
Gradient descent may be slow.

If α is too large:
Gradient descent may:
- Overshoot, never reach the minimum
- Fail to converge, or even diverge

Andrew Ng
(Plot: J(w) with the current value of w already at a local minimum)

At a local minimum the slope d/dw J(w) is zero, so the gradient descent update leaves w unchanged.

Andrew Ng
Can reach local minimum with fixed learning rate

(Plot: successive gradient descent steps — the first step is large, later steps near the minimum are not as large)

Near a local minimum:
- the derivative becomes smaller
- the update steps become smaller

So gradient descent can reach the minimum without decreasing the learning rate.

Andrew Ng
Training Linear
Regression

Gradient Descent
for Linear Regression
Linear regression model:  f_w,b(x) = wx + b
Cost function:            J(w,b) = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))²

Gradient descent algorithm
repeat until convergence:
    w = w − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x^(i)
    b = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))

Andrew Ng
(Optional) Deriving the derivative terms: differentiating J(w,b) gives
∂/∂w J(w,b) = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x^(i)  and  ∂/∂b J(w,b) = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)),
where the 2 from differentiating the square cancels the 1/(2m).

Andrew Ng
Gradient descent algorithm

repeat until convergence:
    w = w − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x^(i)
    b = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))

Update w and b simultaneously.

Andrew Ng
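A NumPy sketch of these updates (illustrative function names; assumes 1-D arrays x and y):

import numpy as np

def compute_gradient(x, y, w, b):
    """Gradients of the squared error cost for univariate linear regression."""
    m = x.shape[0]
    err = (w * x + b) - y          # f_w,b(x^(i)) - y^(i) for every example
    dj_dw = np.dot(err, x) / m     # (1/m) * sum(err * x)
    dj_db = np.sum(err) / m        # (1/m) * sum(err)
    return dj_dw, dj_db

def gradient_descent(x, y, w, b, alpha, num_iters):
    """Batch gradient descent with simultaneous updates of w and b."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)   # gradients from the old (w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])             # e.g. price in $1000s
print(gradient_descent(x_train, y_train, 0.0, 0.0, alpha=0.1, num_iters=1000))
# approaches w ≈ 200, b ≈ 100 for this tiny dataset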
More than one local minimum
(Surface plot: a cost function with several local minima)
Andrew Ng
The squared error cost is a convex function (a single bowl shape): it has one global minimum and no other local minima.

Andrew Ng
Training Linear
Regression

Running
Gradient Descent
(A sequence of plots showing gradient descent running: at each step the current (w, b) is marked on the cost plot and the corresponding straight line is drawn over the data — price in $1000s vs. size in feet² — until the line fits the training set well)
Andrew Ng
"Batch" gradient descent: each step of gradient descent uses all the training examples.

     size in feet²    price in $1000
(1)  2104             400
(2)  1416             232
(3)  1534             315
(4)  852              178
 ⋮
(47) 3210             870

Andrew Ng


Linear Regression
with Multiple Variables

Multiple Features
Multiple features (variables)
Size in feet2 (𝑥) Price ($) in 1000’s (𝑦)

2104 400
1416 232
1534 315
852 178
… …

𝑓𝑤,𝑏 𝑥 = 𝑤𝑥 + 𝑏

Andrew Ng
Multiple features (variables)
Size in feet²   Number of bedrooms   Number of floors   Age of home in years   Price ($) in $1000's
2104            5                    1                  45                     460
1416            3                    2                  40                     232
1534            3                    2                  30                     315
852             2                    1                  36                     178
…               …                    …                  …                      …

x_j = j-th feature
n = number of features
x^(i) = features of the i-th training example (a vector)
x_j^(i) = value of feature j in the i-th training example

Andrew Ng
Model:
Previously: 𝑓𝑤,𝑏 𝑥 = 𝑤𝑥 + 𝑏

𝑓𝑤,𝑏 x = 𝑤1 𝑥1 + 𝑤2 𝑥2 + ⋯ + 𝑤𝑛 𝑥𝑛 + 𝑏

Andrew Ng
f_w,b(x) = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + b

With w = [w_1 w_2 ⋯ w_n] and x = [x_1 x_2 ⋯ x_n]:
f_w,b(x) = w ∙ x + b

multiple linear regression

Andrew Ng
Linear Regression
with Multiple Variables

Vectorization
Part 1
Parameters and features
w = [w_1 w_2 w_3],  b is a number,  x = [x_1 x_2 x_3]
linear algebra: count from 1;  code: count from 0

w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])

Without vectorization (written out)
f_w,b(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b
f = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b

Without vectorization (loop)
f = 0
for j in range(0, n):
    f = f + w[j] * x[j]
f = f + b

Vectorization
f_w,b(x) = w ∙ x + b
f = np.dot(w, x) + b

Andrew Ng
Linear Regression
with Multiple Variables

Vectorization
Part 2
Without vectorization
for j in range(0, 16):
    f = f + w[j] * x[j]
Runs sequentially, one step per time step:
t_0:  f + w[0] * x[0]
t_1:  f + w[1] * x[1]
 ⋮
t_15: f + w[15] * x[15]

Vectorization
np.dot(w, x)
At t_0 the products w[0]*x[0], w[1]*x[1], …, w[15]*x[15] are computed in parallel,
then added together efficiently:
w[0]*x[0] + w[1]*x[1] + … + w[15]*x[15]

Andrew Ng
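A small timing sketch (not from the slides) you could run to compare the two approaches; the exact speedup depends on your machine and the array size:

import time
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
w, x = rng.random(n), rng.random(n)

start = time.time()                 # loop version: one multiply-add per iteration
f_loop = 0.0
for j in range(n):
    f_loop = f_loop + w[j] * x[j]
loop_s = time.time() - start

start = time.time()                 # vectorized version: optimized parallel dot product
f_vec = np.dot(w, x)
vec_s = time.time() - start

print(f"loop: {loop_s:.3f}s   vectorized: {vec_s:.5f}s")
print(np.isclose(f_loop, f_vec))    # same result either way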
Gradient descent
w = [w_1 w_2 ⋯ w_16]          w = np.array([0.5, 1.3, … 3.4])
d = [d_1 d_2 ⋯ d_16]          d = np.array([0.3, 0.2, … 0.4])
compute w_j = w_j − 0.1 d_j for j = 1 … 16

Without vectorization
w_1 = w_1 − 0.1 d_1
w_2 = w_2 − 0.1 d_2
 ⋮
w_16 = w_16 − 0.1 d_16

for j in range(0,16):
    w[j] = w[j] - 0.1 * d[j]

With vectorization
w = w − 0.1 d

w = w - 0.1 * d
Andrew Ng
Linear Regression
with Multiple Variables

Gradient Descent for
Multiple Regression
Previous notation:
  Parameters: w_1, ⋯, w_n and b
  Model: f_w,b(x) = w_1 x_1 + ⋯ + w_n x_n + b
  Cost function: J(w_1, ⋯, w_n, b)
  Gradient descent: repeat {
      w_j = w_j − α (∂/∂w_j) J(w_1, ⋯, w_n, b)
      b = b − α (∂/∂b) J(w_1, ⋯, w_n, b)
  }

Vector notation:
  Parameters: w = [w_1 ⋯ w_n] and b
  Model: f_w,b(x) = w ∙ x + b
  Cost function: J(w, b)
  Gradient descent: repeat {
      w_j = w_j − α (∂/∂w_j) J(w, b)
      b = b − α (∂/∂b) J(w, b)
  }
Andrew Ng
Gradient descent

One feature:
repeat {
    w = w − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x^(i)
    b = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneously update w, b

n features (n ≥ 2):
repeat {
    w_1 = w_1 − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_1^(i)       ← this sum is ∂/∂w_1 J(w,b)
     ⋮
    w_n = w_n − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_n^(i)
    b = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneously update w_j (for j = 1, ⋯, n) and b

Andrew Ng
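A compact NumPy sketch of these updates for n features (illustrative, not the course's lab code; assumes X is an (m, n) matrix and y an (m,) vector):

import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One simultaneous gradient descent update for multiple linear regression."""
    m = X.shape[0]
    err = X @ w + b - y           # f_w,b(x^(i)) - y^(i), shape (m,)
    dj_dw = X.T @ err / m         # vector of partial derivatives, shape (n,)
    dj_db = np.sum(err) / m
    return w - alpha * dj_dw, b - alpha * dj_db

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(1000):
    w, b = gradient_step(X, y, w, b, alpha=1e-7)   # tiny alpha because the features are unscaled
print(w, b)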
An alternative to gradient descent

Normal equation
• Only for linear regression
• Solves for w, b without iterations

Disadvantages
• Doesn't generalize to other learning algorithms
• Slow when the number of features is large (> 10,000)

What you need to know
• The normal equation method may be used in machine learning libraries that implement linear regression.
• Gradient descent is the recommended method for finding parameters w, b.

Andrew Ng
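For reference, a closed-form sketch of this idea using NumPy's least-squares solver (illustrative code, not from the slides): append a column of ones so b is learned as an extra weight.

import numpy as np

def fit_normal_equation(X, y):
    """Solve for (w, b) in closed form, without iterations."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])     # last column of ones for the intercept b
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)    # numerically stable least-squares solve
    return theta[:-1], theta[-1]

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(fit_normal_equation(X, y))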
Practical Tips for
Linear Regression

Feature Scaling
Part 1
Feature and parameter values

predicted price = w_1 x_1 + w_2 x_2 + b
x_1: size (feet²), range 300 − 2,000        x_2: # bedrooms, range 0 − 5

House: x_1 = 2000, x_2 = 5, price = $500k
What are reasonable sizes for the parameters w_1, w_2?

w_1 = 50, w_2 = 0.1, b = 50:
predicted price = 50 * 2000 + 0.1 * 5 + 50 = $100,050.5k   (far too high)

w_1 = 0.1, w_2 = 50, b = 50:
predicted price = 0.1 * 2000 + 50 * 5 + 50 = $500k          (reasonable)

Andrew Ng
Feature size and parameter size
size in feet²: large range of values → a good parameter w_1 tends to be small
# bedrooms:    small range of values → a good parameter w_2 tends to be large

(Left plot: the features, x_1 size in feet² vs. x_2 # bedrooms.
 Right plot: contours of the cost J(w,b) over w_1 and w_2, stretched into tall, narrow ellipses.)

Andrew Ng
Feature size and gradient descent
(Top: with the original features, the contours of J(w,b) are tall and skinny, so gradient descent can bounce back and forth and take a long time to find the minimum.
 Bottom: with rescaled features, the contours are closer to circles and gradient descent takes a much more direct path to the global minimum.)

Andrew Ng
Practical Tips for
Linear Regression

Feature Scaling
Part 2
Feature scaling (divide by the maximum)
300 ≤ x_1 ≤ 2000            0 ≤ x_2 ≤ 5

x_1,scaled = x_1 / 2000     x_2,scaled = x_2 / 5

0.15 ≤ x_1,scaled ≤ 1       0 ≤ x_2,scaled ≤ 1

(Plots: x_1 size in feet² vs. x_2 # bedrooms, before and after rescaling)

Andrew Ng
Mean normalization
300 ≤ x_1 ≤ 2000                     0 ≤ x_2 ≤ 5

x_1 := (x_1 − μ_1) / (2000 − 300)    x_2 := (x_2 − μ_2) / (5 − 0)

−0.18 ≤ x_1 ≤ 0.82                   −0.46 ≤ x_2 ≤ 0.54

(Plots: the features before and after mean normalization)

Andrew Ng
Z-score normalization (using the standard deviation σ)
300 ≤ x_1 ≤ 2000, σ_1 = 450          0 ≤ x_2 ≤ 5, σ_2 = 1.4

x_1 := (x_1 − μ_1) / σ_1             x_2 := (x_2 − μ_2) / σ_2

−0.67 ≤ x_1 ≤ 3.1                    −1.6 ≤ x_2 ≤ 1.9

(Plots: the features before and after z-score normalization)

Andrew Ng
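A small NumPy sketch of these three rescaling options (illustrative function names; X is an (m, n) feature matrix):

import numpy as np

def scale_by_max(X):
    """Divide each feature by its maximum value."""
    return X / X.max(axis=0)

def mean_normalize(X):
    """(x - mean) / (max - min) for each feature."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def zscore_normalize(X):
    """(x - mean) / standard deviation for each feature."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
print(zscore_normalize(X))   # each column now has mean 0 and standard deviation 1
# In practice, save the mean and standard deviation so new inputs can be scaled the same way.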
Feature scaling
Aim for about −1 ≤ x_j ≤ 1 for each feature x_j
(ranges such as −3 ≤ x_j ≤ 3 or −0.3 ≤ x_j ≤ 0.3 are also fine)

0 ≤ x_1 ≤ 3                okay, no rescaling needed
−2 ≤ x_2 ≤ 0.5             okay, no rescaling needed
−100 ≤ x_3 ≤ 100           too large → rescale
−0.001 ≤ x_4 ≤ 0.001       too small → rescale
98.6 ≤ x_5 ≤ 105           too large and far from zero → rescale

Andrew Ng
Practical Tips for
Linear Regression

Checking Gradient Descent
for Convergence
Gradient descent
w_j = w_j − α (∂/∂w_j) J(w,b),   b = b − α (∂/∂b) J(w,b)
How can you tell whether it is converging, i.e. finding parameters close to the global minimum of the cost?
Andrew Ng
Make sure gradient descent is working correctly
objective: min_{w,b} J(w,b)

(Learning curve: plot J(w,b), computed on the training set, against the number of iterations)
J(w,b) should decrease after every iteration.
On the example curve, J(w,b) is still dropping after 100 and 200 iterations and has likely converged by 400 iterations (we have found parameters w, b close to the global minimum).
The number of iterations needed varies a lot between applications.

Automatic convergence test
Let ε ("epsilon") be a small number, e.g. 10⁻³.
If J(w,b) decreases by ≤ ε in one iteration, declare convergence (found parameters w, b to get close to the global minimum).

Andrew Ng
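A sketch of tracking the learning curve and applying the automatic convergence test in code (variable names are illustrative):

import numpy as np

def gradient_descent_with_history(X, y, w, b, alpha, num_iters, epsilon=1e-3):
    """Record J(w,b) each iteration; stop early when it decreases by no more than epsilon."""
    m = X.shape[0]
    history = []
    for i in range(num_iters):
        err = X @ w + b - y
        w = w - alpha * (X.T @ err) / m
        b = b - alpha * np.sum(err) / m
        history.append(np.sum((X @ w + b - y) ** 2) / (2 * m))
        if i > 0 and history[-1] > history[-2]:
            print("warning: cost increased — alpha may be too large (or there is a bug)")
        elif i > 0 and history[-2] - history[-1] <= epsilon:
            print(f"converged after {i + 1} iterations")
            break
    return w, b, history

# history can then be plotted against the iteration number to inspect the learning curve.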
Practical Tips for
Linear Regression

Choosing the
Learning Rate
Identify a problem with gradient descent
(Plots: J(w,b) vs. # iterations — the cost bounces up and down, or keeps increasing)

Possible causes: a bug in the code, or the learning rate α is too large.
For example, w_1 = w_1 + α d_1 is a bug — use a minus sign: w_1 = w_1 − α d_1.

Adjust learning rate
(Plots of J against the parameter w_1: when α is too big the updates overshoot the minimum and J can increase; with a smaller α each step decreases J)

With a small enough α, J(w,b) should decrease on every iteration.
If α is too small, gradient descent takes a lot more iterations to converge.
(Debugging: with a very small α, if J still does not decrease on every iteration, that usually signals a bug in the code.)

Andrew Ng
Values of α to try:

… 0.001   0.01   0.1   1 …
(roughly tripling between trials, e.g. 0.001, 0.003, 0.01, 0.03, …)

(Plots: J(w,b) vs. # iterations for each candidate α — pick the largest α for which J still decreases steadily)

Andrew Ng
Practical Tips for
Linear Regression

Feature Engineering
Feature engineering
f_w,b(x) = w_1 x_1 + w_2 x_2 + b        (x_1 = frontage, x_2 = depth)

area = frontage × depth
x_3 = x_1 x_2

f_w,b(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b

Feature engineering: using intuition to design new features, by transforming or combining original features.

Andrew Ng
Practical Tips for
Linear Regression

Polynomial Regression
Polynomial regression

Quadratic: f_w,b(x) = w_1 x + w_2 x² + b
Cubic:     f_w,b(x) = w_1 x + w_2 x² + w_3 x³ + b

(Plot: price y vs. size x, with curved fits through the housing data)

Andrew Ng
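A brief sketch of fitting polynomial features with the same linear regression machinery (illustrative; with powers of x, feature scaling matters even more):

import numpy as np

x = np.arange(1.0, 21.0)                   # e.g. sizes 1..20
y = x ** 2                                  # a target that is quadratic in the input

X = np.column_stack([x, x ** 2, x ** 3])    # engineered features: x, x^2, x^3
X = (X - X.mean(axis=0)) / X.std(axis=0)    # z-score scale: the raw ranges differ enormously

w, b = np.zeros(X.shape[1]), 0.0
alpha, m = 0.1, X.shape[0]
for _ in range(5000):                       # ordinary gradient descent on the new features
    err = X @ w + b - y
    w -= alpha * (X.T @ err) / m
    b -= alpha * np.sum(err) / m
print(w, b)                                 # inspect which engineered feature carries the most weight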
Choice of features

f_w,b(x) = w_1 x + w_2 √x + b

(Plot: price y vs. size x — the square-root feature gives a curve that flattens out but keeps increasing)

Andrew Ng


Classification

Motivations
Classification
Question                             Answer "y"
Is this email spam?                  no / yes
Is the transaction fraudulent?       no / yes
Is the tumor malignant?              no / yes

y can only be one of two values → "binary classification"
The two classes can be written as no/yes, false/true, or 0/1:
the "negative class" (0, false) and the "positive class" (1, true).

Andrew Ng
(Plot: malignant? — 1 (yes) or 0 (no) — vs. tumor size x (diameter in cm), with benign and malignant examples marked)

Andrew Ng
Trying linear regression: f_w,b(x) = wx + b

(Plot: a straight line fit to the 0/1 labels — malignant? vs. tumor size x (diameter in cm) — with a threshold at 0.5)

if f_w,b(x) < 0.5 → ŷ = 0
if f_w,b(x) ≥ 0.5 → ŷ = 1

(Adding one extra example far to the right shifts the fitted line and the decision boundary, which is why linear regression is a poor fit for classification.)

Andrew Ng
Classification

Logistic Regression
Want outputs between 0 and 1

(Left plot: malignant? (0/1) vs. tumor size x (diameter in cm), with an S-shaped curve fit to the data.
 Right plot: the sigmoid function g(z) from about z = −3 to 3, passing through 0.5 at z = 0.)

sigmoid function / logistic function — outputs between 0 and 1

g(z) = 1 / (1 + e^(−z)),    0 < g(z) < 1

Andrew Ng
Building logistic regression

z = w ∙ x + b
g(z) = 1 / (1 + e^(−z))

f_w,b(x) = g(w ∙ x + b) = 1 / (1 + e^(−(w ∙ x + b)))

"logistic regression" — outputs between 0 and 1

Andrew Ng
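A minimal NumPy sketch of the sigmoid and the logistic regression model (illustrative function names):

import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); works on scalars or arrays."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """f_w,b(x) = g(w . x + b): the model's estimate of P(y = 1 | x)."""
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    """Class prediction: 1 where the probability is at or above the threshold."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0]])
w, b = np.array([1.0, 1.0]), -3.0        # gives the decision boundary x1 + x2 = 3 (see the next section)
print(predict_proba(X, w, b))
print(predict(X, w, b))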
Interpretation of logistic regression output

f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b))) = P(y = 1 | x; w,b)
the "probability" that the class is 1, given input x and parameters w, b

Example:
x is "tumor size", y is 0 (not malignant) or 1 (malignant)
f_w,b(x) = 0.7 → 70% chance that y is 1

P(y = 0) + P(y = 1) = 1

Andrew Ng
Classification

Decision Boundary
f_w,b(x) = g(w ∙ x + b) = 1 / (1 + e^(−(w ∙ x + b))) = P(y = 1 | x; w,b)

(Plot: the sigmoid g(z), which crosses 0.5 at z = 0)

Is f_w,b(x) ≥ 0.5?
    Yes: ŷ = 1        No: ŷ = 0

When is f_w,b(x) ≥ 0.5?
    g(z) ≥ 0.5  when  z ≥ 0,  i.e.  w ∙ x + b ≥ 0  →  ŷ = 1
    g(z) < 0.5  when  z < 0,  i.e.  w ∙ x + b < 0  →  ŷ = 0

Andrew Ng
Decision boundary
f_w,b(x) = g(z) = g(w_1 x_1 + w_2 x_2 + b)

Example: w_1 = 1, w_2 = 1, b = −3
Decision boundary: z = w ∙ x + b = 0
z = x_1 + x_2 − 3 = 0,  i.e.  x_1 + x_2 = 3

(Plot: the line x_1 + x_2 = 3 in the (x_1, x_2) plane; ŷ = 1 where x_1 + x_2 ≥ 3 and ŷ = 0 where x_1 + x_2 < 3)

Andrew Ng
Non-linear decision boundaries
f_w,b(x) = g(z) = g(w_1 x_1² + w_2 x_2² + b)

Example: w_1 = 1, w_2 = 1, b = −1
Decision boundary: z = x_1² + x_2² − 1 = 0,  i.e.  x_1² + x_2² = 1

(Plot: the unit circle in the (x_1, x_2) plane)
x_1² + x_2² ≥ 1  →  ŷ = 1        x_1² + x_2² < 1  →  ŷ = 0

Andrew Ng
Non-linear decision boundaries
f_w,b(x) = g(z) = g(w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_1 x_2 + w_5 x_2² + w_6 x_1³ + ⋯ + b)

(Plots: with higher-order polynomial features, the decision boundary in the (x_1, x_2) plane can take much more complex shapes)

Andrew Ng
Cost Function

Cost Function for
Logistic Regression
Training set

tumor size (cm)   …   patient's age   malignant?
10                    52              1
2                     73              0
5                     55              0
12                    49              1
…                     …               …

i = 1, …, m training examples;   j = 1, …, n features;   target y is 0 or 1

f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b)))

How to choose w = [w_1 w_2 ⋯ w_n] and b?

Andrew Ng
Squared error cost

J(w,b) = (1/m) Σ_{i=1}^{m} (1/2) (f_w,b(x^(i)) − y^(i))²   — the average over the training set of a loss L(f_w,b(x^(i)), y^(i))

Linear regression:   f_w,b(x) = w ∙ x + b                      → J(w,b) is convex
Logistic regression: f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b)))     → with squared error, J(w,b) is non-convex, so gradient descent can get stuck in local minima

Andrew Ng
Logistic loss function

L(f_w,b(x^(i)), y^(i)) = −log(f_w,b(x^(i)))        if y^(i) = 1
                       = −log(1 − f_w,b(x^(i)))    if y^(i) = 0

For y^(i) = 1 (plot of −log f over 0 < f ≤ 1):
As f_w,b(x^(i)) → 1, the loss → 0
As f_w,b(x^(i)) → 0, the loss → ∞
The loss is lowest when f_w,b(x^(i)) predicts close to the true label y^(i).

Andrew Ng
Logistic loss function

For y^(i) = 0:  L(f_w,b(x^(i)), y^(i)) = −log(1 − f_w,b(x^(i)))    (plot over 0 ≤ f < 1)

As f_w,b(x^(i)) → 0, the loss → 0
As f_w,b(x^(i)) → 1, the loss → ∞
The further the prediction f_w,b(x^(i)) is from the target y^(i), the higher the loss.
Andrew Ng
Cost
J(w,b) = (1/m) Σ_{i=1}^{m} L(f_w,b(x^(i)), y^(i))
With the logistic loss, this cost function is convex, so gradient descent can reach the global minimum.

Andrew Ng
Cost Function

Simplified Cost
Function for Logistic
Regression
Simplified loss function

L(f_w,b(x^(i)), y^(i)) = −y^(i) log(f_w,b(x^(i))) − (1 − y^(i)) log(1 − f_w,b(x^(i)))

if y^(i) = 1:
L(f_w,b(x^(i)), y^(i)) = −log(f_w,b(x^(i)))

Andrew Ng
Simplified loss function

L(f_w,b(x^(i)), y^(i)) = −y^(i) log(f_w,b(x^(i))) − (1 − y^(i)) log(1 − f_w,b(x^(i)))

if y^(i) = 1:  L(f_w,b(x^(i)), y^(i)) = −log(f_w,b(x^(i)))
if y^(i) = 0:  L(f_w,b(x^(i)), y^(i)) = −log(1 − f_w,b(x^(i)))

So the single formula gives back both cases of the original loss.

Andrew Ng
Simplified cost function

L(f_w,b(x^(i)), y^(i)) = −y^(i) log(f_w,b(x^(i))) − (1 − y^(i)) log(1 − f_w,b(x^(i)))

J(w,b) = (1/m) Σ_{i=1}^{m} L(f_w,b(x^(i)), y^(i))
       = −(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 − y^(i)) log(1 − f_w,b(x^(i))) ]

Andrew Ng
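A short NumPy sketch of this cost (illustrative; a small clip guards the log against predictions of exactly 0 or 1):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b, eps=1e-15):
    """J(w,b) = -(1/m) * sum of y*log(f) + (1-y)*log(1-f)."""
    m = X.shape[0]
    f = np.clip(sigmoid(X @ w + b), eps, 1 - eps)   # avoid log(0)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f)) / m

X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(logistic_cost(X, y, w=np.array([1.0, 1.0]), b=-3.0))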
Gradient Descent

Gradient Descent
Implementation
Training logistic regression

Find w, b (by minimizing the cost J(w,b))

Given a new x, output f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b))) = P(y = 1 | x; w, b)

Andrew Ng
Gradient descent

repeat {
    w_j = w_j − α (∂/∂w_j) J(w,b)        where (∂/∂w_j) J(w,b) = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i)
    b   = b   − α (∂/∂b) J(w,b)          where (∂/∂b) J(w,b)  = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneous updates

Andrew Ng
Gradient descent for logistic regression
repeat {
    w_j = w_j − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i)
    b   = b   − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneous updates

These updates look the same as for linear regression, but the model f is different:
Linear regression:   f_w,b(x) = w ∙ x + b
Logistic regression: f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b)))

Same concepts apply:
• Monitor gradient descent (learning curve)
• Vectorized implementation
• Feature scaling

Andrew Ng
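A gradient descent sketch for logistic regression (illustrative names; labels y are 0 or 1):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent for logistic regression; returns learned (w, b)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y      # f_w,b(x^(i)) - y^(i); f is now the sigmoid
        w -= alpha * (X.T @ err) / m      # same update shape as linear regression
        b -= alpha * np.sum(err) / m
    return w, b

X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = logistic_gradient_descent(X, y, alpha=0.1, num_iters=10000)
print((sigmoid(X @ w + b) >= 0.5).astype(int))   # matches y on this tiny example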
Regularization to
Reduce Overfitting

The Problem of
Overfitting
Regression example (price vs. size)

w_1 x + b:
  does not fit the training set well — underfit (high bias)

w_1 x + w_2 x² + b:
  fits the training set pretty well and generalizes — "just right"

w_1 x + w_2 x² + w_3 x³ + w_4 x⁴ + b:
  fits the training set extremely well with a wiggly curve — overfit (high variance)

Andrew Ng
Classification (three decision boundaries in the (x_1, x_2) plane), with f_w,b(x) = g(z), g the sigmoid function

z = w_1 x_1 + w_2 x_2 + b
  a straight-line boundary (underfit)

z = w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_2² + w_5 x_1 x_2 + b
  an elliptical boundary ("just right")

z = w_1 x_1 + w_2 x_2 + w_3 x_1² x_2 + w_4 x_1² x_2² + w_5 x_1² x_2³ + w_6 x_1³ x_2 + ⋯ + b
  a highly contorted boundary (overfit)

Andrew Ng
Regularization to
Reduce Overfitting

Addressing Overfitting
Collect more training examples
(Plots: price vs. size — with more training examples, the same high-order model produces a smoother fit and overfits less)

Andrew Ng
Select features to include/exclude
Features: size, bedrooms, floors, age, avg income, …, distance to coffee shop → price

Using all the features with insufficient data can overfit.
Using only a subset of selected features helps.
Disadvantage: useful features could be lost.

Andrew Ng
Regularization
Reduce the size of the parameters w_j

Before:                  f(x) = 28x − 385x² + 39x³ − 174x⁴ + 100
After regularization:    f(x) = 13x − 0.23x² + 0.000014x³ − 0.0001x⁴ + 10

(Plots: price vs. features — shrinking the higher-order weights turns the wiggly fit into a smoother curve)

Andrew Ng
Addressing overfitting
Options
1. Collect more data
2. Select features ("feature selection")
3. Reduce the size of the parameters ("regularization")

Andrew Ng
Regularization to
Reduce Overfitting

Cost Function with
Regularization
Intuition
(Plots: price vs. size — a quadratic fit w_1 x + w_2 x² + b, and a quartic fit w_1 x + w_2 x² + w_3 x³ + w_4 x⁴ + b)

If we make w_3, w_4 really small (≈ 0), the quartic behaves almost like the quadratic.

min_{w,b} (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))²
Adding a large penalty on w_3² and w_4² to this cost forces gradient descent to make w_3, w_4 ≈ 0.

Andrew Ng
Regularization
Small values of w_1, w_2, ⋯, w_n, b → a simpler model that is less likely to overfit.

Example: 100 features (size, bedrooms, floors, age, avg income, …, distance to coffee shop) → price,
with parameters w_1, w_2, ⋯, w_100, b. We usually don't know which parameters to penalize, so regularization penalizes all of them:

J(w,b) = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))² + (λ/(2m)) Σ_{j=1}^{n} w_j²,   with regularization parameter λ > 0.

Andrew Ng
Regularization: choosing λ
f_w,b(x) = w_1 x + w_2 x² + w_3 x³ + w_4 x⁴ + b

λ balances fitting the training data against keeping the parameters w_j small.
If λ = 0: no regularization → the model overfits.
If λ is huge (e.g. λ = 10¹⁰): all w_j ≈ 0 and f(x) ≈ b — a horizontal line that underfits.

(Plot: price vs. size)

Andrew Ng
Regularization to
Reduce Overfitting

Regularized Linear
Regression
Regularized linear regression

J(w,b) = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))² + (λ/(2m)) Σ_{j=1}^{n} w_j²

Gradient descent
repeat {
    w_j = w_j − α (∂/∂w_j) J(w,b)
    b   = b   − α (∂/∂b) J(w,b)
} simultaneous update

Andrew Ng
Implementing gradient descent
repeat {
    w_j = w_j − α [ (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i) + (λ/m) w_j ]
    b   = b   − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))        (b is not regularized)
} simultaneous update

Andrew Ng
Implementing gradient descent
repeat {
    w_j = w_j (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i)
    b   = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneous update

The factor (1 − α λ/m) is slightly less than 1, so on every iteration regularization shrinks w_j a little before the usual update is applied.

Andrew Ng
How we get the derivative term (optional)

(∂/∂w_j) J(w,b) = (1/(2m)) Σ_{i=1}^{m} 2 (f_w,b(x^(i)) − y^(i)) x_j^(i) + (λ/(2m)) 2 w_j
               = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i) + (λ/m) w_j

Andrew Ng
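A NumPy sketch of the regularized cost and one update step for linear regression (illustrative helper names; note that b is not regularized):

import numpy as np

def regularized_cost(X, y, w, b, lam):
    """Squared error cost plus the (lambda / 2m) * sum(w_j^2) term."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * m) + lam * np.sum(w ** 2) / (2 * m)

def regularized_step(X, y, w, b, alpha, lam):
    """One simultaneous gradient descent update with regularization on w only."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = (X.T @ err) / m + (lam / m) * w   # extra (lambda/m) * w_j term
    dj_db = np.sum(err) / m                   # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([4.0, 3.0, 6.0])
w, b = np.zeros(2), 0.0
for _ in range(500):
    w, b = regularized_step(X, y, w, b, alpha=0.1, lam=1.0)
print(regularized_cost(X, y, w, b, lam=1.0))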
Regularization to
Reduce Overfitting

Regularized Logistic
Regression
Regularized logistic regression

z = w_1 x_1 + w_2 x_2 + w_3 x_1² x_2 + w_4 x_1² x_2² + w_5 x_1² x_2³ + ⋯ + b
f_w,b(x) = 1 / (1 + e^(−z))

(Plot: a complex, overfit decision boundary in the (x_1, x_2) plane)

Cost function
J(w,b) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 − y^(i)) log(1 − f_w,b(x^(i))) ]

Regularized:
J(w,b) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 − y^(i)) log(1 − f_w,b(x^(i))) ] + (λ/(2m)) Σ_{j=1}^{n} w_j²

Andrew Ng
Regularized logistic regression

J(w,b) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 − y^(i)) log(1 − f_w,b(x^(i))) ] + (λ/(2m)) Σ_{j=1}^{n} w_j²

Gradient descent
repeat {
    w_j = w_j − α (∂/∂w_j) J(w,b)
    b   = b   − α (∂/∂b) J(w,b)
} simultaneous update

The derivative terms look just like regularized linear regression — (1/m) Σ (f_w,b(x^(i)) − y^(i)) x_j^(i) + (λ/m) w_j for w_j, with no λ term for b — except that f is now the sigmoid.

Andrew Ng
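To close the loop, a sketch of the regularized logistic regression gradients in NumPy (illustrative; b is again left unregularized):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gradients(X, y, w, b, lam):
    """Gradients of the regularized logistic cost J(w,b)."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y              # only difference from linear regression: f is the sigmoid
    dj_dw = (X.T @ err) / m + (lam / m) * w   # regularization adds (lambda/m) * w_j
    dj_db = np.sum(err) / m                   # b is not regularized
    return dj_dw, dj_db

# one gradient descent step on a tiny example
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]])
y = np.array([0, 0, 1, 1])
w, b, alpha, lam = np.zeros(2), 0.0, 0.1, 1.0
dj_dw, dj_db = regularized_logistic_gradients(X, y, w, b, lam)
w, b = w - alpha * dj_dw, b - alpha * dj_db
print(w, b)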
