
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Neural Network
Training

TensorFlow
implementation
Train a Neural Network in TensorFlow
The network takes input x through a hidden layer of 25 units (activations a[1]), a hidden layer of 15 units (a[2]), and an output layer of 1 unit (a[3]), all with sigmoid activations.

Given a set of (x, y) examples, how to build and train this in code?

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

# define the model: 25 units -> 15 units -> 1 unit, all sigmoid
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

# specify the loss
model.compile(loss=BinaryCrossentropy())

# train on the data
model.fit(X, Y, epochs=100)

Neural Network
Training

Training Details
Model Training Steps
Step 1. Specify how to compute the output given input x and parameters w, b (define the model), f_{w,b}(x) = ?

    logistic regression:
        z = np.dot(w, x) + b
        f_x = 1 / (1 + np.exp(-z))

    neural network:
        model = Sequential([
            Dense(...),
            Dense(...),
            Dense(...)
        ])

Step 2. Specify the loss and cost.

    logistic loss (one example), L(f_{w,b}(x), y):
        loss = -y * np.log(f_x) - (1 - y) * np.log(1 - f_x)

    binary cross entropy:
        model.compile(loss=BinaryCrossentropy())

    Cost over the m training examples:
        J(w, b) = (1/m) * sum_{i=1..m} L(f_{w,b}(x^(i)), y^(i))

Step 3. Train on data to minimize J(w, b).

    gradient descent:
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

    neural network:
        model.fit(X, y, epochs=100)

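For comparison, here is a minimal from-scratch sketch of these three steps for logistic regression in NumPy. It is a sketch only: the array shapes (X of shape (m, n), y of shape (m,)) and the learning-rate value are assumptions, and the gradient formulas are the standard logistic-regression derivatives rather than code taken from the slides.

import numpy as np

def train_logistic_regression(X, y, alpha=0.01, epochs=100):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        # Step 1: compute the output f_wb(x) = 1 / (1 + e^(-(w.x + b)))
        z = X @ w + b
        f_x = 1 / (1 + np.exp(-z))
        # Step 2: logistic loss, averaged over the m examples to get the cost J(w, b)
        # (computed here only to monitor training)
        cost = -np.mean(y * np.log(f_x) + (1 - y) * np.log(1 - f_x))
        # Step 3: gradient descent, w = w - alpha * dJ/dw and b = b - alpha * dJ/db
        dj_dw = X.T @ (f_x - y) / m
        dj_db = np.mean(f_x - y)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b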
1. Create the model
define the model, f(x) = ?, with parameters W[1], b[1], W[2], b[2], W[3], b[3]
(layer 1: 25 units with parameters w_1[1], b_1[1], ..., w_25[1], b_25[1]; layer 2: 15 units; layer 3: 1 unit),
producing activations a[1], a[2], a[3]:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

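As a quick sanity check on the parameters this creates, one can build the model for a known input size and inspect the weight shapes. This is a sketch assuming the model defined above and an arbitrary example input dimension of 400 features (the input size is not stated on the slide):

model.build(input_shape=(None, 400))
W1, b1 = model.layers[0].get_weights()   # W[1]: (400, 25), b[1]: (25,)
W2, b2 = model.layers[1].get_weights()   # W[2]: (25, 15),  b[2]: (15,)
W3, b3 = model.layers[2].get_weights()   # W[3]: (15, 1),   b[3]: (1,)
print(W1.shape, W2.shape, W3.shape)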
2. Loss and cost functions
MNIST digit binary classification problem.

Loss on one example:
    L(f(x), y) = -y * log(f(x)) - (1 - y) * log(1 - f(x))

Cost over the training set, as a function of all the parameters W[1], W[2], W[3] and b[1], b[2], b[3]:
    J(W, B) = (1/m) * sum_{i=1..m} L(f(x^(i)), y^(i))

from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy())

For regression (predicting numbers and not categories), use the squared error loss instead:

from tensorflow.keras.losses import MeanSquaredError
model.compile(loss=MeanSquaredError())

3. Gradient descent
repeat {
    w_j[l] = w_j[l] - alpha * dJ(w, b)/dw_j[l]
    b_j[l] = b_j[l] - alpha * dJ(w, b)/db_j[l]
}

(Plot: the cost J(w) decreasing toward its minimum as w is updated.)

In TensorFlow this training loop is run by:

model.fit(X, y, epochs=100)

Neural network libraries
Use code libraries instead of coding "from scratch".

It is still good to understand the implementation (for tuning and debugging).

Activation Functions

Alternatives to the
sigmoid activation
Demand Prediction Example
Input features: price, shipping cost, marketing, material.
Hidden-layer activations: affordability, awareness, perceived quality.

Second unit of the first layer:
    a_2[1] = g(w_2[1] . x + b_2[1])

Sigmoid: g(z) = 1 / (1 + e^(-z)), with 0 < g(z) < 1.
ReLU: g(z) = max(0, z).

Examples of Activation Functions
    a_2[1] = g(w_2[1] . x + b_2[1])

Linear activation function:  g(z) = z
Sigmoid:                     g(z) = 1 / (1 + e^(-z)),  0 < g(z) < 1
ReLU:                        g(z) = max(0, z)

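A minimal NumPy sketch of these three activations, just for intuition (inside Keras they are selected by name, e.g. activation='relu'):

import numpy as np

def linear(z):
    return z                      # g(z) = z

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # g(z) = 1 / (1 + e^(-z)), output strictly between 0 and 1

def relu(z):
    return np.maximum(0, z)       # g(z) = max(0, z)

print(linear(-2.0), sigmoid(0.0), relu(-3.0))   # -2.0 0.5 0.0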
Activation Functions

Choosing activation
functions
Output Layer
Network: x -> a[1] -> a[2] -> a[3] = f(x). How to choose g(z) for the output layer?

Binary classification (y = 0/1):              Sigmoid
Regression (y can be negative or positive):   Linear activation function
Regression (y >= 0 only):                     ReLU

Hidden Layer
Network: x -> a[1] -> a[2] -> a[3] = f(x). How to choose g(z) for the hidden layers?

Sigmoid: g(z) = 1 / (1 + e^(-z)). The curve is flat at both ends, so the gradients dJ(W, B)/dw
become small and gradient descent is slow.

ReLU: g(z) = max(0, z). Flat only for z < 0, so gradient descent is usually faster;
ReLU is the most common choice for hidden layers.

Choosing Activation Summary
Network: x -> a[1] -> a[2] -> a[3] = f(x).

Output layer: activation='sigmoid' for binary classification, activation='linear' for regression
with positive or negative y, activation='relu' for regression with y >= 0.
Hidden layers: use ReLU, activation='relu'.

from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

Activation Functions

Why do we need
activation functions?
Why do we need activation functions?
Demand prediction: input features x (price, shipping cost, marketing, material) feed a hidden layer
(affordability, awareness, perceived quality) and then an output unit (top seller?): x -> a[1] -> a[2].

If all g(z) are the linear activation g(z) = z, the whole network just computes f(x) = w . x + b
... no different than linear regression.

Linear Example

A one-unit hidden layer followed by a one-unit output layer, with g(z) = z:

    a[1] = w_1[1] * x + b_1[1]
    a[2] = w_1[2] * a[1] + b_1[2]

Substituting the first equation into the second:

    a[2] = (w_1[2] * w_1[1]) * x + (w_1[2] * b_1[1] + b_1[2])
         = w * x + b

So with linear activations, the two layers collapse into a single linear function of x.

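A tiny numeric check of this collapse (a sketch with made-up parameter values, not taken from the slides):

import numpy as np

w1, b1 = 2.0, -1.0      # layer 1: a1 = w1 * x + b1
w2, b2 = 3.0, 4.0       # layer 2: a2 = w2 * a1 + b2

x = np.array([0.0, 1.0, 2.5])
a2_two_layers = w2 * (w1 * x + b1) + b2          # pass x through both linear "layers"
a2_collapsed  = (w2 * w1) * x + (w2 * b1 + b2)   # the single equivalent linear function

print(np.allclose(a2_two_layers, a2_collapsed))  # True: stacked linear layers are just one linear model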
Example
A deeper network x -> a[1] -> a[2] -> a[3] -> a[4] with g(z) = z in the hidden layers:

    with a linear output unit:   a[4] = w_1[4] . a[3] + b_1[4]                    (equivalent to linear regression)

    with a sigmoid output unit:  a[4] = 1 / (1 + e^-(w_1[4] . a[3] + b_1[4]))     (equivalent to logistic regression)

Don't use linear activations in hidden layers.

Multiclass
Classification

Multiclass
MNIST example

y can be any of the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9; for example, an image of a handwritten 7 has y = 7.

Multiclass classification example
(Plot: training examples in the (x1, x2) plane. With two classes the model estimates P(y = 1 | x);
with four classes it estimates P(y = 1 | x), P(y = 2 | x), P(y = 3 | x), and P(y = 4 | x), and the
decision boundaries split the plane into four regions.)

Multiclass
Classification

Softmax
Logistic regression (2 possible output values):

    z = w . x + b
    a = g(z) = 1 / (1 + e^(-z)) = P(y = 1 | x)

Softmax regression (4 possible outputs):

    z1 = w1 . x + b1     a1 = e^z1 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 1 | x)
    z2 = w2 . x + b2     a2 = e^z2 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 2 | x)
    z3 = w3 . x + b3     a3 = e^z3 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 3 | x)
    z4 = w4 . x + b4     a4 = e^z4 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 4 | x)

Softmax regression (N possible outputs):

    z_j = w_j . x + b_j                        for j = 1, ..., N
    a_j = e^(z_j) / sum_{k=1..N} e^(z_k) = P(y = j | x)

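A minimal NumPy sketch of the softmax formula above (subtracting max(z) is a standard numerical-stability trick, not shown on the slide; it does not change the result):

import numpy as np

def softmax(z):
    # a_j = e^(z_j) / sum_k e^(z_k)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1, -1.0])   # hypothetical logits z1..z4
a = softmax(z)
print(a, a.sum())                     # probabilities P(y = j | x); they sum to 1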
Cost
Logistic regression:

    z = w . x + b
    a1 = g(z) = 1 / (1 + e^(-z)) = P(y = 1 | x)
    a2 = 1 - a1 = P(y = 0 | x)

    loss = -y * log(a1) - (1 - y) * log(1 - a1)

Softmax regression:

    a1 = e^z1 / (e^z1 + e^z2 + ... + e^zN) = P(y = 1 | x)
    ...
    aN = e^zN / (e^z1 + e^z2 + ... + e^zN) = P(y = N | x)

Crossentropy loss:

    loss(a1, ..., aN, y) = -log(a_j)   if y = j

(Plot of L = -log(a_j) against a_j from 0 to 1: the loss is large when a_j is near 0 and small when a_j is near 1.)

    J(w, b) = average loss over the training set

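A small sketch of that loss in code, assuming a holds the softmax outputs a1..aN and y is the true class label (1-indexed, as on the slide):

import numpy as np

def crossentropy_loss(a, y):
    # loss(a1, ..., aN, y) = -log(a_y): only the probability assigned to the true class matters
    return -np.log(a[y - 1])

a = np.array([0.10, 0.70, 0.15, 0.05])   # example softmax output for N = 4 classes
print(crossentropy_loss(a, 2))           # small loss: the model puts 0.70 on class 2
print(crossentropy_loss(a, 4))           # large loss: the model puts only 0.05 on class 4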
Multiclass
Classification

Neural Network with
Softmax output
Neural Network with Softmax output

Network for handwritten digits: x -> a[1] (25 units) -> a[2] (15 units) -> a[3] (10 units, softmax output layer).

Output layer (layer 3):

    z_1[3]  = w_1[3]  . a[2] + b_1[3]       a_1[3]  = e^(z_1[3])  / (e^(z_1[3]) + ... + e^(z_10[3])) = P(y = 1 | x)
    ...
    z_10[3] = w_10[3] . a[2] + b_10[3]      a_10[3] = e^(z_10[3]) / (e^(z_1[3]) + ... + e^(z_10[3])) = P(y = 10 | x)

With the activations used so far (as in logistic regression, a 1-unit output), each a_j[3] = g(z_j[3]) depends
only on its own z_j[3]. Softmax is different: every output depends on all of z_1[3], ..., z_10[3]:

    a[3] = (a_1[3], ..., a_10[3]) = g(z_1[3], ..., z_10[3])

MNIST with softmax

specify the model, f_{w,b}(x) = ?

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

specify the loss and cost, L(f_{w,b}(x), y):

model.compile(loss=SparseCategoricalCrossentropy())

train on data to minimize J(w, b):

model.fit(X, Y, epochs=100)

Note: a better (recommended) version comes later.

Multiclass
Classification

Improved implementation
of softmax
Numerical Roundoff Errors
Two ways to compute x = 2 / 10,000:

    option 1: compute x directly as 2 / 10,000.
    option 2: compute the same value through intermediate results, e.g. x = (1 + 1/10,000) - (1 - 1/10,000).

Because the computer stores numbers with finite precision, option 2 accumulates a small roundoff error.

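A quick Python demo of this kind of roundoff (the specific two-step expression is an illustrative assumption, not necessarily the exact one on the slide):

x1 = 2.0 / 10000.0                                      # option 1: compute x directly
x2 = (1.0 + 1.0 / 10000.0) - (1.0 - 1.0 / 10000.0)      # option 2: same value via intermediate results

print(f"{x1:.18f}")
print(f"{x2:.18f}")
print(x1 == x2)   # typically False: the intermediate results introduce a tiny roundoff error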
Numerical Roundoff Errors
More numerically accurate implementation of logistic loss:

Logistic regression:
    a = g(z) = 1 / (1 + e^(-z))

Original loss:
    loss = -y * log(a) - (1 - y) * log(1 - a)

More accurate loss (in code), written directly in terms of z so TensorFlow can rearrange the terms:
    loss = -y * log(1 / (1 + e^(-z))) - (1 - y) * log(1 - 1 / (1 + e^(-z)))

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

# original
model.compile(loss=BinaryCrossentropy())

# more numerically accurate
model.compile(loss=BinaryCrossentropy(from_logits=True))

More numerically accurate implementation of softmax

Softmax regression:
    (a1, ..., a10) = g(z1, ..., z10)

Loss = L(a, y) = -log(a1)    if y = 1
                  ...
                 -log(a10)   if y = 10

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

# original
model.compile(loss=SparseCategoricalCrossentropy())

# more numerically accurate
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))

MNIST (more numerically accurate)
model:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear')
])

loss:
model.compile(..., loss=SparseCategoricalCrossentropy(from_logits=True))

fit:
model.fit(X, Y, epochs=100)

predict:
logits = model(X)
f_x = tf.nn.softmax(logits)

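To turn the probabilities f_x into a predicted digit, one option (not shown on the slide) is to take the most probable class for each example:

import numpy as np

y_pred = np.argmax(f_x, axis=1)   # index of the largest softmax probability per example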
logistic regression
(more numerically accurate)

model:
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='linear')
])

loss:
from tensorflow.keras.losses import BinaryCrossentropy
model.compile(..., loss=BinaryCrossentropy(from_logits=True))

fit:
model.fit(X, Y, epochs=100)

predict:
logit = model(X)
f_x = tf.nn.sigmoid(logit)

Multi-label
Classification

Classification with
multiple outputs
(Optional)
Multi-label Classification

Is there a car?
Is there a bus?
Is there a pedestrian?

Multiple classes

One option: train three separate neural networks, one each for car, bus, and pedestrian.

Alternatively, train one neural network with three outputs:

    x -> a[1] -> a[2] -> a[3], where the output layer a[3] has three units (car, bus, pedestrian),
    each with a sigmoid activation.

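A minimal Keras sketch of that single multi-label network (the hidden-layer sizes are placeholders; the slide does not specify them):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

# Output layer: 3 sigmoid units, one per question (car? bus? pedestrian?),
# each treated as its own binary classification.
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=3, activation='sigmoid')
])
model.compile(loss=BinaryCrossentropy())   # binary cross entropy applied to each of the 3 outputs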
Additional Neural Network
Concepts

Advanced Optimization
Gradient Descent
    w_j = w_j - alpha * dJ(w, b)/dw_j

(Contour plots of J(w, b) over (w1, w2): when the steps creep slowly toward the minimum, we would
like to go faster by increasing alpha; when the steps oscillate back and forth, we would like to go
slower by decreasing alpha.)

Adam Algorithm Intuition
Adam: Adaptive Moment estimation. Instead of a single global learning rate alpha, Adam uses a
separate learning rate for every parameter:

    w_1  = w_1  - alpha_1  * dJ(w, b)/dw_1
    ...
    w_10 = w_10 - alpha_10 * dJ(w, b)/dw_10
    b    = b    - alpha_11 * dJ(w, b)/db

Adam Algorithm Intuition

(Contour plots of J(w, b) over (w1, w2).)

If w_j (or b) keeps moving in the same direction, increase alpha_j.
If w_j (or b) keeps oscillating, reduce alpha_j.

MNIST Adam

model:
model = Sequential([
    tf.keras.layers.Dense(units=25, activation='sigmoid'),
    tf.keras.layers.Dense(units=15, activation='sigmoid'),
    tf.keras.layers.Dense(units=10, activation='linear')
])

compile:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

fit:
model.fit(X, Y, epochs=100)

Additional Neural Network
Concepts

Additional Layer Types


Dense Layer

Network: x -> a[1] -> a[2] -> a[3].

Each neuron's output is a function of all the activation outputs of the previous layer, e.g.

    a_1[2] = g(w_1[2] . a[1] + b_1[2])

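A minimal NumPy sketch of that dense-layer computation (the shapes and values are illustrative assumptions):

import numpy as np

def dense(a_prev, W, b, g):
    # Column j of W holds the weights w_j for neuron j; every neuron sees all of a_prev.
    return g(W.T @ a_prev + b)

a1 = np.random.rand(4)                                   # previous layer's activations (4 values)
W2 = np.random.rand(4, 3)                                # weights for a 3-neuron dense layer
b2 = np.zeros(3)
a2 = dense(a1, W2, b2, lambda z: 1 / (1 + np.exp(-z)))   # a_j[2] = g(w_j[2] . a[1] + b_j[2])
print(a2.shape)                                          # (3,)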
Convolutional Layer

Each neuron only looks at part of the previous layer's inputs.

Why?
• Faster computation
• Needs less training data (less prone to overfitting)

Convolutional Neural Network
EKG example: the input is a time series x1, x2, x3, ..., x100 (100 readings of the EKG signal).

First hidden layer (9 units): each unit looks only at a window of the input,
e.g. x1-x20, x11-x30, x21-x40, ..., x81-x100.

Second hidden layer (3 units): each unit looks only at a window of the first layer's activations,
e.g. a_1[1]-a_5[1], a_3[1]-a_7[1], a_5[1]-a_9[1].

Network: x -> a[1] (9 units) -> a[2] (3 units) -> a[3] (output).

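For reference, a hedged Keras sketch of a small 1-D convolutional network for a 100-sample signal like this one; the kernel sizes, strides, and unit counts are illustrative assumptions, not the exact windows drawn on the slide:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense

# Input: 100 time steps with 1 value each (e.g. one EKG reading per step).
model = Sequential([
    tf.keras.Input(shape=(100, 1)),
    Conv1D(filters=9, kernel_size=20, strides=10, activation='relu'),   # each output sees a 20-sample window
    Conv1D(filters=3, kernel_size=5, activation='relu'),                # each output sees a window of layer-1 outputs
    Flatten(),
    Dense(units=1, activation='sigmoid')                                # final binary prediction
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy())
model.summary()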
