Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Machine Learning

Welcome!
(Figure: an email inbox with messages from various senders — one message, "Re: Urgent Information :)", promises "Congratulations! … a million dollars!" — an everyday example of machine learning: spam filtering)
Machine Learning

Applications of
Machine Learning
Machine Learning
Overview

What is
Machine Learning?
Machine learning

"Field of study that gives computers the ability to learn
without being explicitly programmed."

Arthur Samuel (1959)

Andrew Ng
Question

If the checkers program had been allowed to play only ten games
(instead of tens of thousands) against itself, how would this have
affected its performance?

Would have made it better

Would have made it worse

Andrew Ng
Machine learning algorithms

- Supervised learning
- Unsupervised learning
- Recommender systems
- Reinforcement learning

Practical advice for applying learning algorithms

Andrew Ng
Machine Learning
Overview

Supervised Learning
Part 1
Supervised learning

Learns a mapping from input x to output label y.

The algorithm learns from data labeled with the "right answers".

Andrew Ng
Input (X)            Output (Y)                Application
email                spam? (0/1)               spam filtering
audio                text transcripts          speech recognition
English              Spanish                   machine translation
ad, user info        click? (0/1)              online advertising
image, radar info    position of other cars    self-driving car
image of phone       defect? (0/1)             visual inspection

Andrew Ng
Regression: Housing price prediction
(Plot: house size in feet² on the x-axis, price in $1000s on the y-axis)

Regression
Predict a number
infinitely many possible outputs

Andrew Ng
Machine Learning
Overview

Supervised Learning
Part 2
Classification: Breast cancer detection
(Plot: tumor size (diameter in cm) on a number line, with each example labeled benign or malignant)

Andrew Ng
Classification: Breast cancer detection
(Plot: tumor diameter from 0 cm to 10 cm, with categories such as benign, malignant type 1, malignant type 2)

Classification
predict categories
small number of possible outputs

Andrew Ng
Two or more inputs
(Plot: examples plotted by two inputs, age and tumor size)

Andrew Ng
Supervised learning
Learns from data labeled with the "right answers"

Regression: predict a number; infinitely many possible outputs
Classification: predict categories; small number of possible outputs

Andrew Ng
Machine Learning
Overview

Unsupervised Learning
Part 1
Previous: Supervised learning

Now: Unsupervised learning

Andrew Ng
Supervised learning: learn from data labeled with the "right answers".
Unsupervised learning: find something interesting in unlabeled data.

(Plots: age vs. tumor size — labeled classes on the left, unlabeled points grouped into clusters on the right)


Andrew Ng
Clustering: Google news

Andrew Ng
Clustering: DNA microarray
(Figure: microarray heat map — genes in each row, individuals in each column)

Andrew Ng
Clustering: Grouping customers
(Figure: customers grouped into market segments, e.g. "growing skills and knowledge", "developing career", "staying updated with AI")

Andrew Ng
Machine Learning
Overview

Unsupervised Learning
Part 2
Unsupervised learning
Data only comes with inputs x, but not output labels y.
Algorithm has to find structure in the data.

Clustering: group similar data points together.
Anomaly detection: find unusual data points.
Dimensionality reduction: compress data using fewer numbers.

Andrew Ng
Question
Of the following examples, which would you address using
an unsupervised learning algorithm?

Given email labeled as spam/not spam, learn a spam filter.

Given a set of news articles found on the web, group them into sets of articles about the same story.

Given a database of customer data, automatically discover market segments and group customers into different market segments.

Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not.

Andrew Ng
Machine Learning
Overview

Jupyter Notebooks
Linear Regression
with One Variable

Linear Regression
Model Part 1
House sizes and prices
(Plot: house size in feet² vs. price in $1000s, with a linear regression line fit to the data)

Supervised learning model: learns from data given the "right answers"
Regression model: predicts numbers; infinitely many possible outputs
Classification model: predicts categories; small number of possible outputs

Andrew Ng
House sizes and prices
(Plot: price in $1000s vs. size in feet²)

Data table
size in feet²    price in $1000
2104             400
1416             232
1534             315
852              178
3210             870

Andrew Ng
Terminology
Training set: data used to train the model

     size in feet²    price in $1000
(1)  2104             400
(2)  1416             232
(3)  1534             315
(4)  852              178
 ⋮
(47) 3210             870

Notation:
x = "input" variable / feature
y = "output" / "target" variable
m = number of training examples
(x, y) = single training example
(x^(i), y^(i)) = i-th training example
(the superscript (i) is an index — 1st, 2nd, 3rd, … — not an exponent)

Andrew Ng
Linear Regression
with One Variable

Linear Regression
Model Part 2
How to represent f?

training set → learning algorithm → model f

x (feature) → f → ŷ (prediction, estimated target)

f_w,b(x) = wx + b

Linear regression with one variable (univariate linear regression).

Andrew Ng
Linear Regression
with One Variable

Cost Function
Training set

size in feet² (x)    price in $1000 (y)
2104                 460
1416                 232
1534                 315
852                  178

Model:  f_w,b(x) = wx + b
w, b: parameters

What do w, b do?

Andrew Ng
(Three plots of f_w,b(x) = wx + b:
 left: w = 0, b = 1.5 — a horizontal line at 1.5;
 middle: w = 0.5, b = 0 — a line through the origin with slope 0.5;
 right: w = 0.5, b = 1 — slope 0.5, intercept 1)

Andrew Ng
Cost function: Squared error cost function

J(w,b) = (1/(2m)) Σ_{i=1}^{m} (ŷ^(i) − y^(i))² = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))²

m = number of training examples

Find w, b such that ŷ^(i) is close to y^(i) for all (x^(i), y^(i)).

Andrew Ng
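A minimal NumPy sketch of this cost function (illustrative helper name; assumes x and y are 1-D arrays of the same length):

import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) for univariate linear regression."""
    m = x.shape[0]                    # number of training examples
    f_wb = w * x + b                  # model predictions for every example
    return np.sum((f_wb - y) ** 2) / (2 * m)

# Example with the house-price data from the slides
x_train = np.array([2104.0, 1416.0, 1534.0, 852.0])   # size in feet^2
y_train = np.array([460.0, 232.0, 315.0, 178.0])      # price in $1000s
print(compute_cost(x_train, y_train, w=0.2, b=10.0))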
Linear Regression
with One Variable

Cost Function
Intuition
model:          f_w,b(x) = wx + b   →  simplified: f_w(x) = wx   (set b = 0)

parameters:     w

cost function:  J(w) = (1/(2m)) Σ_{i=1}^{m} (f_w(x^(i)) − y^(i))²

goal:           minimize_w J(w)

Andrew Ng
(Left plot: f_w(x) as a function of the input x, for a fixed w, drawn over the training points.
 Right plot: the cost J(w) as a function of the parameter w.)

Andrew Ng
(The same pair of plots repeated for other choices of w, each tracing out another point on the J(w) curve.)

Andrew Ng
(Plot: the J(w) curve over w)

goal of linear regression (simplified model): minimize_w J(w)

general case: minimize_{w,b} J(w,b)

Andrew Ng
Linear Regression
with One Variable

Visualizing
the Cost Function
Model:          f_w,b(x) = wx + b

Parameters:     w, b

Cost function:  J(w,b) = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))²

Objective:      minimize_{w,b} J(w,b)

Andrew Ng
(Left plot: the model f_w,b(x) over the training data — price in $1000s vs. size in feet².
 Right plot: the cost J(w,b) as a function of both w and b.)

Andrew Ng
(3D surface plot and contour plot of the cost J(w,b) as a function of w and b)

Andrew Ng
(Interactive figure from the optional lab: the rotatable 3D cost surface shown alongside the data — price in $1000s vs. size in feet²)

Andrew Ng
Linear Regression
with One Variable

Visualization examples
(Visualization examples: several choices of w and b marked on the cost plot, each shown with the corresponding line over the data — price in $1000s vs. size in feet²)

Andrew Ng
Training Linear
Regression

Gradient Descent
Have some function J(w,b)
Want min_{w,b} J(w,b)

Outline:
Start with some w, b (e.g. set w = 0, b = 0)
Keep changing w, b to reduce J(w,b)
Until we settle at or near a minimum

Andrew Ng
(Surface plot: a cost function with multiple local minima — starting gradient descent from different points can lead to different valleys)

Andrew Ng
Training Linear
Regression

Implementing
Gradient Descent
Gradient descent algorithm
Repeat until convergence:
    w = w − α (∂/∂w) J(w,b)
    b = b − α (∂/∂b) J(w,b)
α is the learning rate; ∂/∂w J(w,b) is the derivative term.
Simultaneously update w and b.

Note: in code, "a = c" is an assignment; in math, "a == c" is a truth assertion.

Correct: simultaneous update          Incorrect
tmp_w = w − α (∂/∂w) J(w,b)           tmp_w = w − α (∂/∂w) J(w,b)
tmp_b = b − α (∂/∂b) J(w,b)           w = tmp_w
w = tmp_w                             tmp_b = b − α (∂/∂b) J(w,b)   ← uses the already-updated w
b = tmp_b                             b = tmp_b

Andrew Ng
Training Linear
Regression

Gradient Descent
Intuition
Gradient descent algorithm (simplified to one parameter w)
repeat until convergence:
    w = w − α (d/dw) J(w)

Andrew Ng
(Plots of J(w) vs. w: where the slope dJ/dw is positive the update decreases w; where the slope is negative it increases w — in both cases w moves toward the minimum)

Andrew Ng
Training Linear
Regression

Learning Rate
w = w − α (d/dw) J(w)

If α is too small:
Gradient descent may be slow.

If α is too large:
Gradient descent may:
- Overshoot, never reach the minimum
- Fail to converge, or even diverge

Andrew Ng
(Plot: J(w) with the current value of w already at a local minimum)

At a local minimum the slope d/dw J(w) is zero, so the gradient descent update leaves w unchanged.

Andrew Ng
Can reach local minimum with fixed learning rate

(Plot: successive gradient descent steps — the first step is large, later steps near the minimum are not as large)

Near a local minimum:
- the derivative becomes smaller
- the update steps become smaller

So gradient descent can reach the minimum without decreasing the learning rate.

Andrew Ng
Training Linear
Regression

Gradient Descent
for Linear Regression
Linear regression model:  f_w,b(x) = wx + b
Cost function:            J(w,b) = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))²

Gradient descent algorithm
repeat until convergence:
    w = w − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x^(i)
    b = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))

Andrew Ng
(Optional) Deriving the derivative terms: differentiating J(w,b) gives
∂/∂w J(w,b) = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x^(i)  and  ∂/∂b J(w,b) = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)),
where the 2 from differentiating the square cancels the 1/(2m).

Andrew Ng
Gradient descent algorithm

repeat until convergence:
    w = w − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x^(i)
    b = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))

Update w and b simultaneously.

Andrew Ng
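A NumPy sketch of these updates (illustrative function names; assumes 1-D arrays x and y):

import numpy as np

def compute_gradient(x, y, w, b):
    """Gradients of the squared error cost for univariate linear regression."""
    m = x.shape[0]
    err = (w * x + b) - y          # f_w,b(x^(i)) - y^(i) for every example
    dj_dw = np.dot(err, x) / m     # (1/m) * sum(err * x)
    dj_db = np.sum(err) / m        # (1/m) * sum(err)
    return dj_dw, dj_db

def gradient_descent(x, y, w, b, alpha, num_iters):
    """Batch gradient descent with simultaneous updates of w and b."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)   # gradients from the old (w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])             # e.g. price in $1000s
print(gradient_descent(x_train, y_train, 0.0, 0.0, alpha=0.1, num_iters=1000))
# approaches w ≈ 200, b ≈ 100 for this tiny dataset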
More than one local minimum
(Surface plot: a cost function with several local minima)
Andrew Ng
The squared error cost is a convex function (a single bowl shape): it has one global minimum and no other local minima.

Andrew Ng
Training Linear
Regression

Running
Gradient Descent
(A sequence of plots showing gradient descent running: at each step the current (w, b) is marked on the cost plot and the corresponding straight line is drawn over the data — price in $1000s vs. size in feet² — until the line fits the training set well)
Andrew Ng
"Batch" gradient descent: each step of gradient descent uses all the training examples.

     size in feet²    price in $1000
(1)  2104             400
(2)  1416             232
(3)  1534             315
(4)  852              178
 ⋮
(47) 3210             870

Andrew Ng


Linear Regression
with Multiple Variables

Multiple Features
Multiple features (variables)
Size in feet2 (𝑥) Price ($) in 1000’s (𝑦)

2104 400
1416 232
1534 315
852 178
… …

𝑓𝑤,𝑏 𝑥 = 𝑤𝑥 + 𝑏

Andrew Ng
Multiple features (variables)
Size in feet²   Number of bedrooms   Number of floors   Age of home in years   Price ($) in $1000's
2104            5                    1                  45                     460
1416            3                    2                  40                     232
1534            3                    2                  30                     315
852             2                    1                  36                     178
…               …                    …                  …                      …

x_j = j-th feature
n = number of features
x^(i) = features of the i-th training example (a vector)
x_j^(i) = value of feature j in the i-th training example

Andrew Ng
Model:
Previously: 𝑓𝑤,𝑏 𝑥 = 𝑤𝑥 + 𝑏

𝑓𝑤,𝑏 x = 𝑤1 𝑥1 + 𝑤2 𝑥2 + ⋯ + 𝑤𝑛 𝑥𝑛 + 𝑏

Andrew Ng
f_w,b(x) = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + b

With w = [w_1 w_2 ⋯ w_n] and x = [x_1 x_2 ⋯ x_n]:
f_w,b(x) = w ∙ x + b

multiple linear regression

Andrew Ng
Linear Regression
with Multiple Variables

Vectorization
Part 1
Parameters and features
w = [w_1 w_2 w_3],  b is a number,  x = [x_1 x_2 x_3]
linear algebra: count from 1;  code: count from 0

w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])

Without vectorization (written out)
f_w,b(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b
f = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b

Without vectorization (loop)
f = 0
for j in range(0, n):
    f = f + w[j] * x[j]
f = f + b

Vectorization
f_w,b(x) = w ∙ x + b
f = np.dot(w, x) + b

Andrew Ng
Linear Regression
with Multiple Variables

Vectorization
Part 2
Without vectorization
for j in range(0, 16):
    f = f + w[j] * x[j]
Runs sequentially, one step per time step:
t_0:  f + w[0] * x[0]
t_1:  f + w[1] * x[1]
 ⋮
t_15: f + w[15] * x[15]

Vectorization
np.dot(w, x)
At t_0 the products w[0]*x[0], w[1]*x[1], …, w[15]*x[15] are computed in parallel,
then added together efficiently:
w[0]*x[0] + w[1]*x[1] + … + w[15]*x[15]

Andrew Ng
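A small timing sketch (not from the slides) you could run to compare the two approaches; the exact speedup depends on your machine and the array size:

import time
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
w, x = rng.random(n), rng.random(n)

start = time.time()                 # loop version: one multiply-add per iteration
f_loop = 0.0
for j in range(n):
    f_loop = f_loop + w[j] * x[j]
loop_s = time.time() - start

start = time.time()                 # vectorized version: optimized parallel dot product
f_vec = np.dot(w, x)
vec_s = time.time() - start

print(f"loop: {loop_s:.3f}s   vectorized: {vec_s:.5f}s")
print(np.isclose(f_loop, f_vec))    # same result either way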
Gradient descent
w = [w_1 w_2 ⋯ w_16]          w = np.array([0.5, 1.3, … 3.4])
d = [d_1 d_2 ⋯ d_16]          d = np.array([0.3, 0.2, … 0.4])
compute w_j = w_j − 0.1 d_j for j = 1 … 16

Without vectorization
w_1 = w_1 − 0.1 d_1
w_2 = w_2 − 0.1 d_2
 ⋮
w_16 = w_16 − 0.1 d_16

for j in range(0,16):
    w[j] = w[j] - 0.1 * d[j]

With vectorization
w = w − 0.1 d

w = w - 0.1 * d
Andrew Ng
Linear Regression
with Multiple Variables

Gradient Descent for
Multiple Regression
Previous notation:
  Parameters: w_1, ⋯, w_n and b
  Model: f_w,b(x) = w_1 x_1 + ⋯ + w_n x_n + b
  Cost function: J(w_1, ⋯, w_n, b)
  Gradient descent: repeat {
      w_j = w_j − α (∂/∂w_j) J(w_1, ⋯, w_n, b)
      b = b − α (∂/∂b) J(w_1, ⋯, w_n, b)
  }

Vector notation:
  Parameters: w = [w_1 ⋯ w_n] and b
  Model: f_w,b(x) = w ∙ x + b
  Cost function: J(w, b)
  Gradient descent: repeat {
      w_j = w_j − α (∂/∂w_j) J(w, b)
      b = b − α (∂/∂b) J(w, b)
  }
Andrew Ng
Gradient descent

One feature:
repeat {
    w = w − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x^(i)
    b = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneously update w, b

n features (n ≥ 2):
repeat {
    w_1 = w_1 − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_1^(i)       ← this sum is ∂/∂w_1 J(w,b)
     ⋮
    w_n = w_n − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_n^(i)
    b = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneously update w_j (for j = 1, ⋯, n) and b

Andrew Ng
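A compact NumPy sketch of these updates for n features (illustrative, not the course's lab code; assumes X is an (m, n) matrix and y an (m,) vector):

import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One simultaneous gradient descent update for multiple linear regression."""
    m = X.shape[0]
    err = X @ w + b - y           # f_w,b(x^(i)) - y^(i), shape (m,)
    dj_dw = X.T @ err / m         # vector of partial derivatives, shape (n,)
    dj_db = np.sum(err) / m
    return w - alpha * dj_dw, b - alpha * dj_db

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(1000):
    w, b = gradient_step(X, y, w, b, alpha=1e-7)   # tiny alpha because the features are unscaled
print(w, b)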
An alternative to gradient descent

Normal equation
• Only for linear regression
• Solves for w, b without iterations

Disadvantages
• Doesn't generalize to other learning algorithms
• Slow when the number of features is large (> 10,000)

What you need to know
• The normal equation method may be used in machine learning libraries that implement linear regression.
• Gradient descent is the recommended method for finding parameters w, b.

Andrew Ng
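For reference, a closed-form sketch of this idea using NumPy's least-squares solver (illustrative code, not from the slides): append a column of ones so b is learned as an extra weight.

import numpy as np

def fit_normal_equation(X, y):
    """Solve for (w, b) in closed form, without iterations."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])     # last column of ones for the intercept b
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)    # numerically stable least-squares solve
    return theta[:-1], theta[-1]

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(fit_normal_equation(X, y))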
Practical Tips for
Linear Regression

Feature Scaling
Part 1
Feature and parameter values

predicted price = w_1 x_1 + w_2 x_2 + b
x_1: size (feet²), range 300 − 2,000        x_2: # bedrooms, range 0 − 5

House: x_1 = 2000, x_2 = 5, price = $500k
What are reasonable sizes for the parameters w_1, w_2?

w_1 = 50, w_2 = 0.1, b = 50:
predicted price = 50 * 2000 + 0.1 * 5 + 50 = $100,050.5k   (far too high)

w_1 = 0.1, w_2 = 50, b = 50:
predicted price = 0.1 * 2000 + 50 * 5 + 50 = $500k          (reasonable)

Andrew Ng
Feature size and parameter size
size in feet²: large range of values → a good parameter w_1 tends to be small
# bedrooms:    small range of values → a good parameter w_2 tends to be large

(Left plot: the features, x_1 size in feet² vs. x_2 # bedrooms.
 Right plot: contours of the cost J(w,b) over w_1 and w_2, stretched into tall, narrow ellipses.)

Andrew Ng
Feature size and gradient descent
(Top: with the original features, the contours of J(w,b) are tall and skinny, so gradient descent can bounce back and forth and take a long time to find the minimum.
 Bottom: with rescaled features, the contours are closer to circles and gradient descent takes a much more direct path to the global minimum.)

Andrew Ng
Practical Tips for
Linear Regression

Feature Scaling
Part 2
Feature scaling (divide by the maximum)
300 ≤ x_1 ≤ 2000            0 ≤ x_2 ≤ 5

x_1,scaled = x_1 / 2000     x_2,scaled = x_2 / 5

0.15 ≤ x_1,scaled ≤ 1       0 ≤ x_2,scaled ≤ 1

(Plots: x_1 size in feet² vs. x_2 # bedrooms, before and after rescaling)

Andrew Ng
Mean normalization
300 ≤ x_1 ≤ 2000                     0 ≤ x_2 ≤ 5

x_1 := (x_1 − μ_1) / (2000 − 300)    x_2 := (x_2 − μ_2) / (5 − 0)

−0.18 ≤ x_1 ≤ 0.82                   −0.46 ≤ x_2 ≤ 0.54

(Plots: the features before and after mean normalization)

Andrew Ng
Z-score normalization (using the standard deviation σ)
300 ≤ x_1 ≤ 2000, σ_1 = 450          0 ≤ x_2 ≤ 5, σ_2 = 1.4

x_1 := (x_1 − μ_1) / σ_1             x_2 := (x_2 − μ_2) / σ_2

−0.67 ≤ x_1 ≤ 3.1                    −1.6 ≤ x_2 ≤ 1.9

(Plots: the features before and after z-score normalization)

Andrew Ng
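A small NumPy sketch of these three rescaling options (illustrative function names; X is an (m, n) feature matrix):

import numpy as np

def scale_by_max(X):
    """Divide each feature by its maximum value."""
    return X / X.max(axis=0)

def mean_normalize(X):
    """(x - mean) / (max - min) for each feature."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def zscore_normalize(X):
    """(x - mean) / standard deviation for each feature."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
print(zscore_normalize(X))   # each column now has mean 0 and standard deviation 1
# In practice, save the mean and standard deviation so new inputs can be scaled the same way.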
Feature scaling
Aim for about −1 ≤ x_j ≤ 1 for each feature x_j
(ranges such as −3 ≤ x_j ≤ 3 or −0.3 ≤ x_j ≤ 0.3 are also fine)

0 ≤ x_1 ≤ 3                okay, no rescaling needed
−2 ≤ x_2 ≤ 0.5             okay, no rescaling needed
−100 ≤ x_3 ≤ 100           too large → rescale
−0.001 ≤ x_4 ≤ 0.001       too small → rescale
98.6 ≤ x_5 ≤ 105           too large and far from zero → rescale

Andrew Ng
Practical Tips for
Linear Regression

Checking Gradient Descent
for Convergence
Gradient descent
w_j = w_j − α (∂/∂w_j) J(w,b),   b = b − α (∂/∂b) J(w,b)
How can you tell whether it is converging, i.e. finding parameters close to the global minimum of the cost?
Andrew Ng
Make sure gradient descent is working correctly
objective: min_{w,b} J(w,b)

(Learning curve: plot J(w,b), computed on the training set, against the number of iterations)
J(w,b) should decrease after every iteration.
On the example curve, J(w,b) is still dropping after 100 and 200 iterations and has likely converged by 400 iterations (we have found parameters w, b close to the global minimum).
The number of iterations needed varies a lot between applications.

Automatic convergence test
Let ε ("epsilon") be a small number, e.g. 10⁻³.
If J(w,b) decreases by ≤ ε in one iteration, declare convergence (found parameters w, b to get close to the global minimum).

Andrew Ng
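A sketch of tracking the learning curve and applying the automatic convergence test in code (variable names are illustrative):

import numpy as np

def gradient_descent_with_history(X, y, w, b, alpha, num_iters, epsilon=1e-3):
    """Record J(w,b) each iteration; stop early when it decreases by no more than epsilon."""
    m = X.shape[0]
    history = []
    for i in range(num_iters):
        err = X @ w + b - y
        w = w - alpha * (X.T @ err) / m
        b = b - alpha * np.sum(err) / m
        history.append(np.sum((X @ w + b - y) ** 2) / (2 * m))
        if i > 0 and history[-1] > history[-2]:
            print("warning: cost increased — alpha may be too large (or there is a bug)")
        elif i > 0 and history[-2] - history[-1] <= epsilon:
            print(f"converged after {i + 1} iterations")
            break
    return w, b, history

# history can then be plotted against the iteration number to inspect the learning curve.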
Practical Tips for
Linear Regression

Choosing the
Learning Rate
Identify a problem with gradient descent
(Plots: J(w,b) vs. # iterations — the cost bounces up and down, or keeps increasing)

Possible causes: a bug in the code, or the learning rate α is too large.
For example, w_1 = w_1 + α d_1 is a bug — use a minus sign: w_1 = w_1 − α d_1.

Adjust learning rate
(Plots of J against the parameter w_1: when α is too big the updates overshoot the minimum and J can increase; with a smaller α each step decreases J)

With a small enough α, J(w,b) should decrease on every iteration.
If α is too small, gradient descent takes a lot more iterations to converge.
(Debugging: with a very small α, if J still does not decrease on every iteration, that usually signals a bug in the code.)

Andrew Ng
Values of α to try:

… 0.001   0.01   0.1   1 …
(roughly tripling between trials, e.g. 0.001, 0.003, 0.01, 0.03, …)

(Plots: J(w,b) vs. # iterations for each candidate α — pick the largest α for which J still decreases steadily)

Andrew Ng
Practical Tips for
Linear Regression

Feature Engineering
Feature engineering
f_w,b(x) = w_1 x_1 + w_2 x_2 + b        (x_1 = frontage, x_2 = depth)

area = frontage × depth
x_3 = x_1 x_2

f_w,b(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b

Feature engineering: using intuition to design new features, by transforming or combining original features.

Andrew Ng
Practical Tips for
Linear Regression

Polynomial Regression
Polynomial regression

Quadratic: f_w,b(x) = w_1 x + w_2 x² + b
Cubic:     f_w,b(x) = w_1 x + w_2 x² + w_3 x³ + b

(Plot: price y vs. size x, with curved fits through the housing data)

Andrew Ng
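A brief sketch of fitting polynomial features with the same linear regression machinery (illustrative; with powers of x, feature scaling matters even more):

import numpy as np

x = np.arange(1.0, 21.0)                   # e.g. sizes 1..20
y = x ** 2                                  # a target that is quadratic in the input

X = np.column_stack([x, x ** 2, x ** 3])    # engineered features: x, x^2, x^3
X = (X - X.mean(axis=0)) / X.std(axis=0)    # z-score scale: the raw ranges differ enormously

w, b = np.zeros(X.shape[1]), 0.0
alpha, m = 0.1, X.shape[0]
for _ in range(5000):                       # ordinary gradient descent on the new features
    err = X @ w + b - y
    w -= alpha * (X.T @ err) / m
    b -= alpha * np.sum(err) / m
print(w, b)                                 # inspect which engineered feature carries the most weight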
Choice of features

f_w,b(x) = w_1 x + w_2 √x + b

(Plot: price y vs. size x — the square-root feature gives a curve that flattens out but keeps increasing)

Andrew Ng


Classification

Motivations
Classification
Question                             Answer "y"
Is this email spam?                  no / yes
Is the transaction fraudulent?       no / yes
Is the tumor malignant?              no / yes

y can only be one of two values → "binary classification"
The two classes can be written as no/yes, false/true, or 0/1:
the "negative class" (0, false) and the "positive class" (1, true).

Andrew Ng
(Plot: malignant? — 1 (yes) or 0 (no) — vs. tumor size x (diameter in cm), with benign and malignant examples marked)

Andrew Ng
Trying linear regression: f_w,b(x) = wx + b

(Plot: a straight line fit to the 0/1 labels — malignant? vs. tumor size x (diameter in cm) — with a threshold at 0.5)

if f_w,b(x) < 0.5 → ŷ = 0
if f_w,b(x) ≥ 0.5 → ŷ = 1

(Adding one extra example far to the right shifts the fitted line and the decision boundary, which is why linear regression is a poor fit for classification.)

Andrew Ng
Classification

Logistic Regression
Want outputs between 0 and 1

(Left plot: malignant? (0/1) vs. tumor size x (diameter in cm), with an S-shaped curve fit to the data.
 Right plot: the sigmoid function g(z) from about z = −3 to 3, passing through 0.5 at z = 0.)

sigmoid function / logistic function — outputs between 0 and 1

g(z) = 1 / (1 + e^(−z)),    0 < g(z) < 1

Andrew Ng
Building logistic regression

z = w ∙ x + b
g(z) = 1 / (1 + e^(−z))

f_w,b(x) = g(w ∙ x + b) = 1 / (1 + e^(−(w ∙ x + b)))

"logistic regression" — outputs between 0 and 1

Andrew Ng
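A minimal NumPy sketch of the sigmoid and the logistic regression model (illustrative function names):

import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); works on scalars or arrays."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """f_w,b(x) = g(w . x + b): the model's estimate of P(y = 1 | x)."""
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    """Class prediction: 1 where the probability is at or above the threshold."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0]])
w, b = np.array([1.0, 1.0]), -3.0        # gives the decision boundary x1 + x2 = 3 (see the next section)
print(predict_proba(X, w, b))
print(predict(X, w, b))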
Interpretation of logistic regression output

f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b))) = P(y = 1 | x; w,b)
the "probability" that the class is 1, given input x and parameters w, b

Example:
x is "tumor size", y is 0 (not malignant) or 1 (malignant)
f_w,b(x) = 0.7 → 70% chance that y is 1

P(y = 0) + P(y = 1) = 1

Andrew Ng
Classification

Decision Boundary
f_w,b(x) = g(w ∙ x + b) = 1 / (1 + e^(−(w ∙ x + b))) = P(y = 1 | x; w,b)

(Plot: the sigmoid g(z), which crosses 0.5 at z = 0)

Is f_w,b(x) ≥ 0.5?
    Yes: ŷ = 1        No: ŷ = 0

When is f_w,b(x) ≥ 0.5?
    g(z) ≥ 0.5  when  z ≥ 0,  i.e.  w ∙ x + b ≥ 0  →  ŷ = 1
    g(z) < 0.5  when  z < 0,  i.e.  w ∙ x + b < 0  →  ŷ = 0

Andrew Ng
Decision boundary
f_w,b(x) = g(z) = g(w_1 x_1 + w_2 x_2 + b)

Example: w_1 = 1, w_2 = 1, b = −3
Decision boundary: z = w ∙ x + b = 0
z = x_1 + x_2 − 3 = 0,  i.e.  x_1 + x_2 = 3

(Plot: the line x_1 + x_2 = 3 in the (x_1, x_2) plane; ŷ = 1 where x_1 + x_2 ≥ 3 and ŷ = 0 where x_1 + x_2 < 3)

Andrew Ng
Non-linear decision boundaries
f_w,b(x) = g(z) = g(w_1 x_1² + w_2 x_2² + b)

Example: w_1 = 1, w_2 = 1, b = −1
Decision boundary: z = x_1² + x_2² − 1 = 0,  i.e.  x_1² + x_2² = 1

(Plot: the unit circle in the (x_1, x_2) plane)
x_1² + x_2² ≥ 1  →  ŷ = 1        x_1² + x_2² < 1  →  ŷ = 0

Andrew Ng
Non-linear decision boundaries
f_w,b(x) = g(z) = g(w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_1 x_2 + w_5 x_2² + w_6 x_1³ + ⋯ + b)

(Plots: with higher-order polynomial features, the decision boundary in the (x_1, x_2) plane can take much more complex shapes)

Andrew Ng
Cost Function

Cost Function for
Logistic Regression
Training set

tumor size (cm)   …   patient's age   malignant?
10                    52              1
2                     73              0
5                     55              0
12                    49              1
…                     …               …

i = 1, …, m training examples;   j = 1, …, n features;   target y is 0 or 1

f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b)))

How to choose w = [w_1 w_2 ⋯ w_n] and b?

Andrew Ng
Squared error cost

J(w,b) = (1/m) Σ_{i=1}^{m} (1/2) (f_w,b(x^(i)) − y^(i))²   — the average over the training set of a loss L(f_w,b(x^(i)), y^(i))

Linear regression:   f_w,b(x) = w ∙ x + b                      → J(w,b) is convex
Logistic regression: f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b)))     → with squared error, J(w,b) is non-convex, so gradient descent can get stuck in local minima

Andrew Ng
Logistic loss function

L(f_w,b(x^(i)), y^(i)) = −log(f_w,b(x^(i)))        if y^(i) = 1
                       = −log(1 − f_w,b(x^(i)))    if y^(i) = 0

For y^(i) = 1 (plot of −log f over 0 < f ≤ 1):
As f_w,b(x^(i)) → 1, the loss → 0
As f_w,b(x^(i)) → 0, the loss → ∞
The loss is lowest when f_w,b(x^(i)) predicts close to the true label y^(i).

Andrew Ng
Logistic loss function

For y^(i) = 0:  L(f_w,b(x^(i)), y^(i)) = −log(1 − f_w,b(x^(i)))    (plot over 0 ≤ f < 1)

As f_w,b(x^(i)) → 0, the loss → 0
As f_w,b(x^(i)) → 1, the loss → ∞
The further the prediction f_w,b(x^(i)) is from the target y^(i), the higher the loss.
Andrew Ng
Cost
J(w,b) = (1/m) Σ_{i=1}^{m} L(f_w,b(x^(i)), y^(i))
With the logistic loss, this cost function is convex, so gradient descent can reach the global minimum.

Andrew Ng
Cost Function

Simplified Cost
Function for Logistic
Regression
Simplified loss function

L(f_w,b(x^(i)), y^(i)) = −y^(i) log(f_w,b(x^(i))) − (1 − y^(i)) log(1 − f_w,b(x^(i)))

if y^(i) = 1:
L(f_w,b(x^(i)), y^(i)) = −log(f_w,b(x^(i)))

Andrew Ng
Simplified loss function

L(f_w,b(x^(i)), y^(i)) = −y^(i) log(f_w,b(x^(i))) − (1 − y^(i)) log(1 − f_w,b(x^(i)))

if y^(i) = 1:  L(f_w,b(x^(i)), y^(i)) = −log(f_w,b(x^(i)))
if y^(i) = 0:  L(f_w,b(x^(i)), y^(i)) = −log(1 − f_w,b(x^(i)))

So the single formula gives back both cases of the original loss.

Andrew Ng
Simplified cost function

L(f_w,b(x^(i)), y^(i)) = −y^(i) log(f_w,b(x^(i))) − (1 − y^(i)) log(1 − f_w,b(x^(i)))

J(w,b) = (1/m) Σ_{i=1}^{m} L(f_w,b(x^(i)), y^(i))
       = −(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 − y^(i)) log(1 − f_w,b(x^(i))) ]

Andrew Ng
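A short NumPy sketch of this cost (illustrative; a small clip guards the log against predictions of exactly 0 or 1):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b, eps=1e-15):
    """J(w,b) = -(1/m) * sum of y*log(f) + (1-y)*log(1-f)."""
    m = X.shape[0]
    f = np.clip(sigmoid(X @ w + b), eps, 1 - eps)   # avoid log(0)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f)) / m

X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(logistic_cost(X, y, w=np.array([1.0, 1.0]), b=-3.0))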
Gradient Descent

Gradient Descent
Implementation
Training logistic regression

Find w, b (by minimizing the cost J(w,b))

Given a new x, output f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b))) = P(y = 1 | x; w, b)

Andrew Ng
Gradient descent

repeat {
    w_j = w_j − α (∂/∂w_j) J(w,b)        where (∂/∂w_j) J(w,b) = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i)
    b   = b   − α (∂/∂b) J(w,b)          where (∂/∂b) J(w,b)  = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneous updates

Andrew Ng
Gradient descent for logistic regression
repeat {
    w_j = w_j − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i)
    b   = b   − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneous updates

These updates look the same as for linear regression, but the model f is different:
Linear regression:   f_w,b(x) = w ∙ x + b
Logistic regression: f_w,b(x) = 1 / (1 + e^(−(w ∙ x + b)))

Same concepts apply:
• Monitor gradient descent (learning curve)
• Vectorized implementation
• Feature scaling

Andrew Ng
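A gradient descent sketch for logistic regression (illustrative names; labels y are 0 or 1):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent for logistic regression; returns learned (w, b)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y      # f_w,b(x^(i)) - y^(i); f is now the sigmoid
        w -= alpha * (X.T @ err) / m      # same update shape as linear regression
        b -= alpha * np.sum(err) / m
    return w, b

X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = logistic_gradient_descent(X, y, alpha=0.1, num_iters=10000)
print((sigmoid(X @ w + b) >= 0.5).astype(int))   # matches y on this tiny example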
Regularization to
Reduce Overfitting

The Problem of
Overfitting
Regression example (price vs. size)

w_1 x + b:
  does not fit the training set well — underfit (high bias)

w_1 x + w_2 x² + b:
  fits the training set pretty well and generalizes — "just right"

w_1 x + w_2 x² + w_3 x³ + w_4 x⁴ + b:
  fits the training set extremely well with a wiggly curve — overfit (high variance)

Andrew Ng
Classification (three decision boundaries in the (x_1, x_2) plane), with f_w,b(x) = g(z), g the sigmoid function

z = w_1 x_1 + w_2 x_2 + b
  a straight-line boundary (underfit)

z = w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_2² + w_5 x_1 x_2 + b
  an elliptical boundary ("just right")

z = w_1 x_1 + w_2 x_2 + w_3 x_1² x_2 + w_4 x_1² x_2² + w_5 x_1² x_2³ + w_6 x_1³ x_2 + ⋯ + b
  a highly contorted boundary (overfit)

Andrew Ng
Regularization to
Reduce Overfitting

Addressing Overfitting
Collect more training examples
(Plots: price vs. size — with more training examples, the same high-order model produces a smoother fit and overfits less)

Andrew Ng
Select features to include/exclude
Features: size, bedrooms, floors, age, avg income, …, distance to coffee shop → price

Using all the features with insufficient data can overfit.
Using only a subset of selected features helps.
Disadvantage: useful features could be lost.

Andrew Ng
Regularization
Reduce the size of the parameters w_j

Before:                  f(x) = 28x − 385x² + 39x³ − 174x⁴ + 100
After regularization:    f(x) = 13x − 0.23x² + 0.000014x³ − 0.0001x⁴ + 10

(Plots: price vs. features — shrinking the higher-order weights turns the wiggly fit into a smoother curve)

Andrew Ng
Addressing overfitting
Options
1. Collect more data
2. Select features ("feature selection")
3. Reduce the size of the parameters ("regularization")

Andrew Ng
Regularization to
Reduce Overfitting

Cost Function with
Regularization
Intuition
(Plots: price vs. size — a quadratic fit w_1 x + w_2 x² + b, and a quartic fit w_1 x + w_2 x² + w_3 x³ + w_4 x⁴ + b)

If we make w_3, w_4 really small (≈ 0), the quartic behaves almost like the quadratic.

min_{w,b} (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))²
Adding a large penalty on w_3² and w_4² to this cost forces gradient descent to make w_3, w_4 ≈ 0.

Andrew Ng
Regularization
Small values of w_1, w_2, ⋯, w_n, b → a simpler model that is less likely to overfit.

Example: 100 features (size, bedrooms, floors, age, avg income, …, distance to coffee shop) → price,
with parameters w_1, w_2, ⋯, w_100, b. We usually don't know which parameters to penalize, so regularization penalizes all of them:

J(w,b) = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))² + (λ/(2m)) Σ_{j=1}^{n} w_j²,   with regularization parameter λ > 0.

Andrew Ng
Regularization: choosing λ
f_w,b(x) = w_1 x + w_2 x² + w_3 x³ + w_4 x⁴ + b

λ balances fitting the training data against keeping the parameters w_j small.
If λ = 0: no regularization → the model overfits.
If λ is huge (e.g. λ = 10¹⁰): all w_j ≈ 0 and f(x) ≈ b — a horizontal line that underfits.

(Plot: price vs. size)

Andrew Ng
Regularization to
Reduce Overfitting

Regularized Linear
Regression
Regularized linear regression

J(w,b) = (1/(2m)) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))² + (λ/(2m)) Σ_{j=1}^{n} w_j²

Gradient descent
repeat {
    w_j = w_j − α (∂/∂w_j) J(w,b)
    b   = b   − α (∂/∂b) J(w,b)
} simultaneous update

Andrew Ng
Implementing gradient descent
repeat {
    w_j = w_j − α [ (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i) + (λ/m) w_j ]
    b   = b   − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))        (b is not regularized)
} simultaneous update

Andrew Ng
Implementing gradient descent
repeat {
    w_j = w_j (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i)
    b   = b − α (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i))
} simultaneous update

The factor (1 − α λ/m) is slightly less than 1, so on every iteration regularization shrinks w_j a little before the usual update is applied.

Andrew Ng
How we get the derivative term (optional)

(∂/∂w_j) J(w,b) = (1/(2m)) Σ_{i=1}^{m} 2 (f_w,b(x^(i)) − y^(i)) x_j^(i) + (λ/(2m)) 2 w_j
               = (1/m) Σ_{i=1}^{m} (f_w,b(x^(i)) − y^(i)) x_j^(i) + (λ/m) w_j

Andrew Ng
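A NumPy sketch of the regularized cost and one update step for linear regression (illustrative helper names; note that b is not regularized):

import numpy as np

def regularized_cost(X, y, w, b, lam):
    """Squared error cost plus the (lambda / 2m) * sum(w_j^2) term."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * m) + lam * np.sum(w ** 2) / (2 * m)

def regularized_step(X, y, w, b, alpha, lam):
    """One simultaneous gradient descent update with regularization on w only."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = (X.T @ err) / m + (lam / m) * w   # extra (lambda/m) * w_j term
    dj_db = np.sum(err) / m                   # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([4.0, 3.0, 6.0])
w, b = np.zeros(2), 0.0
for _ in range(500):
    w, b = regularized_step(X, y, w, b, alpha=0.1, lam=1.0)
print(regularized_cost(X, y, w, b, lam=1.0))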
Regularization to
Reduce Overfitting

Regularized Logistic
Regression
Regularized logistic regression

z = w_1 x_1 + w_2 x_2 + w_3 x_1² x_2 + w_4 x_1² x_2² + w_5 x_1² x_2³ + ⋯ + b
f_w,b(x) = 1 / (1 + e^(−z))

(Plot: a complex, overfit decision boundary in the (x_1, x_2) plane)

Cost function
J(w,b) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 − y^(i)) log(1 − f_w,b(x^(i))) ]

Regularized:
J(w,b) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 − y^(i)) log(1 − f_w,b(x^(i))) ] + (λ/(2m)) Σ_{j=1}^{n} w_j²

Andrew Ng
Regularized logistic regression

J(w,b) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 − y^(i)) log(1 − f_w,b(x^(i))) ] + (λ/(2m)) Σ_{j=1}^{n} w_j²

Gradient descent
repeat {
    w_j = w_j − α (∂/∂w_j) J(w,b)
    b   = b   − α (∂/∂b) J(w,b)
} simultaneous update

The derivative terms look just like regularized linear regression — (1/m) Σ (f_w,b(x^(i)) − y^(i)) x_j^(i) + (λ/m) w_j for w_j, with no λ term for b — except that f is now the sigmoid.

Andrew Ng
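To close the loop, a sketch of the regularized logistic regression gradients in NumPy (illustrative; b is again left unregularized):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gradients(X, y, w, b, lam):
    """Gradients of the regularized logistic cost J(w,b)."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y              # only difference from linear regression: f is the sigmoid
    dj_dw = (X.T @ err) / m + (lam / m) * w   # regularization adds (lambda/m) * w_j
    dj_db = np.sum(err) / m                   # b is not regularized
    return dj_dw, dj_db

# one gradient descent step on a tiny example
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]])
y = np.array([0, 0, 1, 1])
w, b, alpha, lam = np.zeros(2), 0.0, 0.1, 1.0
dj_dw, dj_db = regularized_logistic_gradients(X, y, w, b, lam)
w, b = w - alpha * dj_dw, b - alpha * dj_db
print(w, b)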
