
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Neural Network
Training

TensorFlow
implementation
Train a Neural Network in TensorFlow
The network takes input x through a hidden layer of 25 units (activations a[1]), a hidden layer of 15 units (a[2]), and an output layer of 1 unit (a[3]), all with sigmoid activations.

Given a set of (x, y) examples, how to build and train this in code?

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

# define the model: 25 units -> 15 units -> 1 unit, all sigmoid
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

# specify the loss
model.compile(loss=BinaryCrossentropy())

# train on the data
model.fit(X, Y, epochs=100)

Neural Network
Training

Training Details
Model Training Steps
Step 1. Specify how to compute the output given input x and parameters w, b (define the model), f_{w,b}(x) = ?

    logistic regression:
        z = np.dot(w, x) + b
        f_x = 1 / (1 + np.exp(-z))

    neural network:
        model = Sequential([
            Dense(...),
            Dense(...),
            Dense(...)
        ])

Step 2. Specify the loss and cost.

    logistic loss (one example), L(f_{w,b}(x), y):
        loss = -y * np.log(f_x) - (1 - y) * np.log(1 - f_x)

    binary cross entropy:
        model.compile(loss=BinaryCrossentropy())

    Cost over the m training examples:
        J(w, b) = (1/m) * sum_{i=1..m} L(f_{w,b}(x^(i)), y^(i))

Step 3. Train on data to minimize J(w, b).

    gradient descent:
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

    neural network:
        model.fit(X, y, epochs=100)

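For comparison, here is a minimal from-scratch sketch of these three steps for logistic regression in NumPy. It is a sketch only: the array shapes (X of shape (m, n), y of shape (m,)) and the learning-rate value are assumptions, and the gradient formulas are the standard logistic-regression derivatives rather than code taken from the slides.

import numpy as np

def train_logistic_regression(X, y, alpha=0.01, epochs=100):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        # Step 1: compute the output f_wb(x) = 1 / (1 + e^(-(w.x + b)))
        z = X @ w + b
        f_x = 1 / (1 + np.exp(-z))
        # Step 2: logistic loss, averaged over the m examples to get the cost J(w, b)
        # (computed here only to monitor training)
        cost = -np.mean(y * np.log(f_x) + (1 - y) * np.log(1 - f_x))
        # Step 3: gradient descent, w = w - alpha * dJ/dw and b = b - alpha * dJ/db
        dj_dw = X.T @ (f_x - y) / m
        dj_db = np.mean(f_x - y)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b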
1. Create the model
define the model, f(x) = ?, with parameters W[1], b[1], W[2], b[2], W[3], b[3]
(layer 1: 25 units with parameters w_1[1], b_1[1], ..., w_25[1], b_25[1]; layer 2: 15 units; layer 3: 1 unit),
producing activations a[1], a[2], a[3]:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

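As a quick sanity check on the parameters this creates, one can build the model for a known input size and inspect the weight shapes. This is a sketch assuming the model defined above and an arbitrary example input dimension of 400 features (the input size is not stated on the slide):

model.build(input_shape=(None, 400))
W1, b1 = model.layers[0].get_weights()   # W[1]: (400, 25), b[1]: (25,)
W2, b2 = model.layers[1].get_weights()   # W[2]: (25, 15),  b[2]: (15,)
W3, b3 = model.layers[2].get_weights()   # W[3]: (15, 1),   b[3]: (1,)
print(W1.shape, W2.shape, W3.shape)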
2. Loss and cost functions
MNIST digit binary classification problem.

Loss on one example:
    L(f(x), y) = -y * log(f(x)) - (1 - y) * log(1 - f(x))

Cost over the training set, as a function of all the parameters W[1], W[2], W[3] and b[1], b[2], b[3]:
    J(W, B) = (1/m) * sum_{i=1..m} L(f(x^(i)), y^(i))

from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy())

For regression (predicting numbers and not categories), use the squared error loss instead:

from tensorflow.keras.losses import MeanSquaredError
model.compile(loss=MeanSquaredError())

3. Gradient descent
repeat {
    w_j[l] = w_j[l] - alpha * dJ(w, b)/dw_j[l]
    b_j[l] = b_j[l] - alpha * dJ(w, b)/db_j[l]
}

(Plot: the cost J(w) decreasing toward its minimum as w is updated.)

In TensorFlow this training loop is run by:

model.fit(X, y, epochs=100)

Neural network libraries
Use code libraries instead of coding "from scratch".

It is still good to understand the implementation (for tuning and debugging).

Activation Functions

Alternatives to the
sigmoid activation
Demand Prediction Example
Input features: price, shipping cost, marketing, material.
Hidden-layer activations: affordability, awareness, perceived quality.

Second unit of the first layer:
    a_2[1] = g(w_2[1] . x + b_2[1])

Sigmoid: g(z) = 1 / (1 + e^(-z)), with 0 < g(z) < 1.
ReLU: g(z) = max(0, z).

Examples of Activation Functions
    a_2[1] = g(w_2[1] . x + b_2[1])

Linear activation function:  g(z) = z
Sigmoid:                     g(z) = 1 / (1 + e^(-z)),  0 < g(z) < 1
ReLU:                        g(z) = max(0, z)

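A minimal NumPy sketch of these three activations, just for intuition (inside Keras they are selected by name, e.g. activation='relu'):

import numpy as np

def linear(z):
    return z                      # g(z) = z

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # g(z) = 1 / (1 + e^(-z)), output strictly between 0 and 1

def relu(z):
    return np.maximum(0, z)       # g(z) = max(0, z)

print(linear(-2.0), sigmoid(0.0), relu(-3.0))   # -2.0 0.5 0.0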
Activation Functions

Choosing activation
functions
Output Layer
Network: x -> a[1] -> a[2] -> a[3] = f(x). How to choose g(z) for the output layer?

Binary classification (y = 0/1):              Sigmoid
Regression (y can be negative or positive):   Linear activation function
Regression (y >= 0 only):                     ReLU

Hidden Layer
Network: x -> a[1] -> a[2] -> a[3] = f(x). How to choose g(z) for the hidden layers?

Sigmoid: g(z) = 1 / (1 + e^(-z)). The curve is flat at both ends, so the gradients dJ(W, B)/dw
become small and gradient descent is slow.

ReLU: g(z) = max(0, z). Flat only for z < 0, so gradient descent is usually faster;
ReLU is the most common choice for hidden layers.

Choosing Activation Summary
Network: x -> a[1] -> a[2] -> a[3] = f(x).

Output layer: activation='sigmoid' for binary classification, activation='linear' for regression
with positive or negative y, activation='relu' for regression with y >= 0.
Hidden layers: use ReLU, activation='relu'.

from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

Activation Functions

Why do we need
activation functions?
Why do we need activation functions?
Demand prediction: input features x (price, shipping cost, marketing, material) feed a hidden layer
(affordability, awareness, perceived quality) and then an output unit (top seller?): x -> a[1] -> a[2].

If all g(z) are the linear activation g(z) = z, the whole network just computes f(x) = w . x + b
... no different than linear regression.

Linear Example

A one-unit hidden layer followed by a one-unit output layer, with g(z) = z:

    a[1] = w_1[1] * x + b_1[1]
    a[2] = w_1[2] * a[1] + b_1[2]

Substituting the first equation into the second:

    a[2] = (w_1[2] * w_1[1]) * x + (w_1[2] * b_1[1] + b_1[2])
         = w * x + b

So with linear activations, the two layers collapse into a single linear function of x.

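A tiny numeric check of this collapse (a sketch with made-up parameter values, not taken from the slides):

import numpy as np

w1, b1 = 2.0, -1.0      # layer 1: a1 = w1 * x + b1
w2, b2 = 3.0, 4.0       # layer 2: a2 = w2 * a1 + b2

x = np.array([0.0, 1.0, 2.5])
a2_two_layers = w2 * (w1 * x + b1) + b2          # pass x through both linear "layers"
a2_collapsed  = (w2 * w1) * x + (w2 * b1 + b2)   # the single equivalent linear function

print(np.allclose(a2_two_layers, a2_collapsed))  # True: stacked linear layers are just one linear model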
Example
A deeper network x -> a[1] -> a[2] -> a[3] -> a[4] with g(z) = z in the hidden layers:

    with a linear output unit:   a[4] = w_1[4] . a[3] + b_1[4]                    (equivalent to linear regression)

    with a sigmoid output unit:  a[4] = 1 / (1 + e^-(w_1[4] . a[3] + b_1[4]))     (equivalent to logistic regression)

Don't use linear activations in hidden layers.

Multiclass
Classification

Multiclass
MNIST example

y can be any of the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9; for example, an image of a handwritten 7 has y = 7.

Multiclass classification example
(Plot: training examples in the (x1, x2) plane. With two classes the model estimates P(y = 1 | x);
with four classes it estimates P(y = 1 | x), P(y = 2 | x), P(y = 3 | x), and P(y = 4 | x), and the
decision boundaries split the plane into four regions.)

Multiclass
Classification

Softmax
Logistic regression (2 possible output values):

    z = w . x + b
    a = g(z) = 1 / (1 + e^(-z)) = P(y = 1 | x)

Softmax regression (4 possible outputs):

    z1 = w1 . x + b1     a1 = e^z1 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 1 | x)
    z2 = w2 . x + b2     a2 = e^z2 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 2 | x)
    z3 = w3 . x + b3     a3 = e^z3 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 3 | x)
    z4 = w4 . x + b4     a4 = e^z4 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 4 | x)

Softmax regression (N possible outputs):

    z_j = w_j . x + b_j                        for j = 1, ..., N
    a_j = e^(z_j) / sum_{k=1..N} e^(z_k) = P(y = j | x)

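A minimal NumPy sketch of the softmax formula above (subtracting max(z) is a standard numerical-stability trick, not shown on the slide; it does not change the result):

import numpy as np

def softmax(z):
    # a_j = e^(z_j) / sum_k e^(z_k)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1, -1.0])   # hypothetical logits z1..z4
a = softmax(z)
print(a, a.sum())                     # probabilities P(y = j | x); they sum to 1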
Cost
Logistic regression:

    z = w . x + b
    a1 = g(z) = 1 / (1 + e^(-z)) = P(y = 1 | x)
    a2 = 1 - a1 = P(y = 0 | x)

    loss = -y * log(a1) - (1 - y) * log(1 - a1)

Softmax regression:

    a1 = e^z1 / (e^z1 + e^z2 + ... + e^zN) = P(y = 1 | x)
    ...
    aN = e^zN / (e^z1 + e^z2 + ... + e^zN) = P(y = N | x)

Crossentropy loss:

    loss(a1, ..., aN, y) = -log(a_j)   if y = j

(Plot of L = -log(a_j) against a_j from 0 to 1: the loss is large when a_j is near 0 and small when a_j is near 1.)

    J(w, b) = average loss over the training set

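A small sketch of that loss in code, assuming a holds the softmax outputs a1..aN and y is the true class label (1-indexed, as on the slide):

import numpy as np

def crossentropy_loss(a, y):
    # loss(a1, ..., aN, y) = -log(a_y): only the probability assigned to the true class matters
    return -np.log(a[y - 1])

a = np.array([0.10, 0.70, 0.15, 0.05])   # example softmax output for N = 4 classes
print(crossentropy_loss(a, 2))           # small loss: the model puts 0.70 on class 2
print(crossentropy_loss(a, 4))           # large loss: the model puts only 0.05 on class 4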
Multiclass
Classification

Neural Network with
Softmax output
Neural Network with Softmax output

Network for handwritten digits: x -> a[1] (25 units) -> a[2] (15 units) -> a[3] (10 units, softmax output layer).

Output layer (layer 3):

    z_1[3]  = w_1[3]  . a[2] + b_1[3]       a_1[3]  = e^(z_1[3])  / (e^(z_1[3]) + ... + e^(z_10[3])) = P(y = 1 | x)
    ...
    z_10[3] = w_10[3] . a[2] + b_10[3]      a_10[3] = e^(z_10[3]) / (e^(z_1[3]) + ... + e^(z_10[3])) = P(y = 10 | x)

With the activations used so far (as in logistic regression, a 1-unit output), each a_j[3] = g(z_j[3]) depends
only on its own z_j[3]. Softmax is different: every output depends on all of z_1[3], ..., z_10[3]:

    a[3] = (a_1[3], ..., a_10[3]) = g(z_1[3], ..., z_10[3])

MNIST with softmax

specify the model, f_{w,b}(x) = ?

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

specify the loss and cost, L(f_{w,b}(x), y):

model.compile(loss=SparseCategoricalCrossentropy())

train on data to minimize J(w, b):

model.fit(X, Y, epochs=100)

Note: a better (recommended) version comes later.

Multiclass
Classification

Improved implementation
of softmax
Numerical Roundoff Errors
Two ways to compute x = 2 / 10,000:

    option 1: compute x directly as 2 / 10,000.
    option 2: compute the same value through intermediate results, e.g. x = (1 + 1/10,000) - (1 - 1/10,000).

Because the computer stores numbers with finite precision, option 2 accumulates a small roundoff error.

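A quick Python demo of this kind of roundoff (the specific two-step expression is an illustrative assumption, not necessarily the exact one on the slide):

x1 = 2.0 / 10000.0                                      # option 1: compute x directly
x2 = (1.0 + 1.0 / 10000.0) - (1.0 - 1.0 / 10000.0)      # option 2: same value via intermediate results

print(f"{x1:.18f}")
print(f"{x2:.18f}")
print(x1 == x2)   # typically False: the intermediate results introduce a tiny roundoff error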
Numerical Roundoff Errors
More numerically accurate implementation of logistic loss:

Logistic regression:
    a = g(z) = 1 / (1 + e^(-z))

Original loss:
    loss = -y * log(a) - (1 - y) * log(1 - a)

More accurate loss (in code), written directly in terms of z so TensorFlow can rearrange the terms:
    loss = -y * log(1 / (1 + e^(-z))) - (1 - y) * log(1 - 1 / (1 + e^(-z)))

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

# original
model.compile(loss=BinaryCrossentropy())

# more numerically accurate
model.compile(loss=BinaryCrossentropy(from_logits=True))

More numerically accurate implementation of softmax

Softmax regression:
    (a1, ..., a10) = g(z1, ..., z10)

Loss = L(a, y) = -log(a1)    if y = 1
                  ...
                 -log(a10)   if y = 10

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

# original
model.compile(loss=SparseCategoricalCrossentropy())

# more numerically accurate
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))

MNIST (more numerically accurate)
model:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear')
])

loss:
model.compile(..., loss=SparseCategoricalCrossentropy(from_logits=True))

fit:
model.fit(X, Y, epochs=100)

predict:
logits = model(X)
f_x = tf.nn.softmax(logits)

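To turn the probabilities f_x into a predicted digit, one option (not shown on the slide) is to take the most probable class for each example:

import numpy as np

y_pred = np.argmax(f_x, axis=1)   # index of the largest softmax probability per example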
logistic regression
(more numerically accurate)

model:
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='linear')
])

loss:
from tensorflow.keras.losses import BinaryCrossentropy
model.compile(..., loss=BinaryCrossentropy(from_logits=True))

fit:
model.fit(X, Y, epochs=100)

predict:
logit = model(X)
f_x = tf.nn.sigmoid(logit)

Multi-label
Classification

Classification with
multiple outputs
(Optional)
Multi-label Classification

Is there a car?
Is there a bus?
Is there a pedestrian?

Multiple classes

One option: train three separate neural networks, one each for car, bus, and pedestrian.

Alternatively, train one neural network with three outputs:

    x -> a[1] -> a[2] -> a[3], where the output layer a[3] has three units (car, bus, pedestrian),
    each with a sigmoid activation.

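A minimal Keras sketch of that single multi-label network (the hidden-layer sizes are placeholders; the slide does not specify them):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

# Output layer: 3 sigmoid units, one per question (car? bus? pedestrian?),
# each treated as its own binary classification.
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=3, activation='sigmoid')
])
model.compile(loss=BinaryCrossentropy())   # binary cross entropy applied to each of the 3 outputs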
Additional Neural Network
Concepts

Advanced Optimization
Gradient Descent
    w_j = w_j - alpha * dJ(w, b)/dw_j

(Contour plots of J(w, b) over (w1, w2): when the steps creep slowly toward the minimum, we would
like to go faster by increasing alpha; when the steps oscillate back and forth, we would like to go
slower by decreasing alpha.)

Adam Algorithm Intuition
Adam: Adaptive Moment estimation. Instead of a single global learning rate alpha, Adam uses a
separate learning rate for every parameter:

    w_1  = w_1  - alpha_1  * dJ(w, b)/dw_1
    ...
    w_10 = w_10 - alpha_10 * dJ(w, b)/dw_10
    b    = b    - alpha_11 * dJ(w, b)/db

Adam Algorithm Intuition

(Contour plots of J(w, b) over (w1, w2).)

If w_j (or b) keeps moving in the same direction, increase alpha_j.
If w_j (or b) keeps oscillating, reduce alpha_j.

MNIST Adam

model:
model = Sequential([
    tf.keras.layers.Dense(units=25, activation='sigmoid'),
    tf.keras.layers.Dense(units=15, activation='sigmoid'),
    tf.keras.layers.Dense(units=10, activation='linear')
])

compile:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

fit:
model.fit(X, Y, epochs=100)

Additional Neural Network
Concepts

Additional Layer Types


Dense Layer

Network: x -> a[1] -> a[2] -> a[3].

Each neuron's output is a function of all the activation outputs of the previous layer, e.g.

    a_1[2] = g(w_1[2] . a[1] + b_1[2])

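A minimal NumPy sketch of that dense-layer computation (the shapes and values are illustrative assumptions):

import numpy as np

def dense(a_prev, W, b, g):
    # Column j of W holds the weights w_j for neuron j; every neuron sees all of a_prev.
    return g(W.T @ a_prev + b)

a1 = np.random.rand(4)                                   # previous layer's activations (4 values)
W2 = np.random.rand(4, 3)                                # weights for a 3-neuron dense layer
b2 = np.zeros(3)
a2 = dense(a1, W2, b2, lambda z: 1 / (1 + np.exp(-z)))   # a_j[2] = g(w_j[2] . a[1] + b_j[2])
print(a2.shape)                                          # (3,)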
Convolutional Layer

Each neuron only looks at part of the previous layer's inputs.

Why?
• Faster computation
• Needs less training data (less prone to overfitting)

Convolutional Neural Network
EKG example: the input is a time series x1, x2, x3, ..., x100 (100 readings of the EKG signal).

First hidden layer (9 units): each unit looks only at a window of the input,
e.g. x1-x20, x11-x30, x21-x40, ..., x81-x100.

Second hidden layer (3 units): each unit looks only at a window of the first layer's activations,
e.g. a_1[1]-a_5[1], a_3[1]-a_7[1], a_5[1]-a_9[1].

Network: x -> a[1] (9 units) -> a[2] (3 units) -> a[3] (output).

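For reference, a hedged Keras sketch of a small 1-D convolutional network for a 100-sample signal like this one; the kernel sizes, strides, and unit counts are illustrative assumptions, not the exact windows drawn on the slide:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense

# Input: 100 time steps with 1 value each (e.g. one EKG reading per step).
model = Sequential([
    tf.keras.Input(shape=(100, 1)),
    Conv1D(filters=9, kernel_size=20, strides=10, activation='relu'),   # each output sees a 20-sample window
    Conv1D(filters=3, kernel_size=5, activation='relu'),                # each output sees a window of layer-1 outputs
    Flatten(),
    Dense(units=1, activation='sigmoid')                                # final binary prediction
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy())
model.summary()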
