
Neural Networks and Deep Learning

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
Gopal, M. (2019). Applied Machine Learning. McGraw-Hill Education.
Recognize the Number!
Early History

Walter Pitts (left) and Warren McCulloch (right)
proposed the first mathematical model of a neural
network in 1943.

A brief history is given in this link.

Image source: https://www.historyofinformation.com/detail.php?entryid=782
Deep Learning

ACM Turing Award 2018: Yoshua Bengio, Geoffrey Hinton and Yann LeCun,
for conceptual and engineering breakthroughs that made deep neural
networks a critical component of computing.

Image source: https://awards.acm.org/about/2018-turing
Deep Learning
• Neural networks became popular in the 1980s.
• Lots of successes, hype, and great conferences:
NeurIPS, Snowbird.
• Then along came SVMs, Random Forests and
Boosting in the 1990s, and Neural Networks took a
back seat.
• Re-emerged around 2010 as Deep Learning.
• By the 2020s, very dominant and successful.
• Part of the success is due to vast improvements in computing
power, larger training sets, and software such as TensorFlow
and PyTorch.
Deep Learning
• Deep learning is a branch of machine learning based on
artificial neural networks (ANNs) with multiple hidden
layers.
• Deep learning is well-suited for tasks such as image
recognition, speech recognition, and natural language
processing.
• The most widely used architectures in deep learning are
feedforward neural networks, convolutional neural networks
(CNNs), and recurrent neural networks (RNNs).
Single Layer Neural Network
    f(X) = β_0 + Σ_{k=1}^{K} β_k h_k(X)
         = β_0 + Σ_{k=1}^{K} β_k g(w_{k0} + Σ_{j=1}^{p} w_{kj} X_j)

[Diagram: input layer X_1–X_4, hidden layer A_1–A_5, output layer f(X) → Y]
Training the Neural Network
• A_k = h_k(X) = g(w_{k0} + Σ_{j=1}^{p} w_{kj} X_j) are called the
activations in the hidden layer.
• g(z) is called the activation function (e.g. sigmoid, ReLU).
• Activation functions in hidden layers are typically
nonlinear; otherwise the model collapses to a linear model.
• The model is fit by minimizing Σ_{i=1}^{n} (y_i − f(x_i))² for
regression, or the cross-entropy for classification.
• The weights w_{kj} are learned using the backpropagation
algorithm.
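Below is a minimal NumPy sketch of the forward pass for such a single-hidden-layer network; the dimensions and weight values are illustrative assumptions, not taken from the text.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W, b, beta, beta0):
    """Forward pass of a single-hidden-layer network.

    x     : (p,) input vector
    W     : (K, p) hidden-layer weights w_kj
    b     : (K,) hidden-layer biases w_k0
    beta  : (K,) output weights beta_k
    beta0 : scalar output bias beta_0
    """
    A = relu(W @ x + b)       # activations A_k = g(w_k0 + sum_j w_kj x_j)
    return beta0 + beta @ A   # f(x) = beta_0 + sum_k beta_k A_k

# Illustrative example with p = 4 inputs and K = 5 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W, b = rng.normal(size=(5, 4)), rng.normal(size=5)
beta, beta0 = rng.normal(size=5), 0.1
print(forward(x, W, b, beta, beta0))
```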
Activation Functions

Sigmoid:

    g(z) = e^z / (1 + e^z) = 1 / (1 + e^{−z})

ReLU:

    g(z) = (z)_+ = 0 if z < 0, z otherwise
Activation Function

The piecewise-linear ReLU function is now more widely used than the
sigmoid because it can be computed and stored more efficiently.
Example: MNIST Digits
• Handwritten digits as 28 × 28 grayscale images; 60K
training and 10K test images.
• Features are the 784 pixel grayscale values ∈ [0, 255].
• Labels are the digit classes 0–9.

• Goal: build a classifier to predict the image class.


Multilayer Neural Network
[Diagram: input layer X_1, …, X_p; first hidden layer A^(1)_1, …, A^(1)_{K1} (weights W_1); second hidden layer A^(2)_1, …, A^(2)_{K2} (weights W_2); output layer f_0(X), …, f_9(X) → Y_0, …, Y_9 (weights B)]
Details of Output Layer
• Let Z_m = β_{m0} + Σ_{l=1}^{K_2} β_{ml} A^{(2)}_l, m = 0, 1, …, 9, be 10
linear combinations of the activations at the second hidden layer.
• The output activation function is the softmax function:

    f_m(X) = Pr(Y = m | X) = e^{Z_m} / Σ_{l=0}^{9} e^{Z_l}

• Fit the model by minimizing the negative multinomial
log-likelihood (cross-entropy):

    − Σ_{i=1}^{n} Σ_{m=0}^{9} y_{im} log f_m(x_i)

• y_{im} is 1 if the true class for observation i is m, else 0,
i.e. one-hot encoded.
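A small NumPy sketch of the softmax and cross-entropy computation above; the score vector and label are made-up values for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; probabilities sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_onehot, probs):
    # Negative multinomial log-likelihood for one observation
    return -np.sum(y_onehot * np.log(probs))

# Illustrative: 10 output scores Z_m and a one-hot label for class 3
z = np.array([0.5, -1.2, 0.3, 2.0, 0.1, -0.4, 0.0, 1.1, -0.7, 0.2])
y = np.zeros(10)
y[3] = 1.0
p = softmax(z)
print(p[3], cross_entropy(y, p))
```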
Results
Method Test Error
Neural Network + Ridge Regularization 2.3%
Neural Network + Dropout Regularization 1.8%
Multinomial Logistic Regression 7.2%
Linear Discriminant Analysis 12.7%

• With so many parameters, regularization is essential.


• Very overworked problem — best reported rates are
< 0.5%!
• Human error rate is reported to be around 0.2%, or
20 of the 10K test images.
Convolutional Neural Network — CNN

• Shown are samples from the CIFAR100 database.
• It is a database of 60K labelled images with 20 superclasses
and 5 classes in each superclass.
• Each image is 32 × 32 pixels, RGB.
• The numbers for each image are organized in a three-
dimensional array called a feature map.
How CNNs Work

• CNNs mimic human cognition by recognizing
specific features or patterns in particular places of
the image.
• The CNN builds up an image in a hierarchical
fashion.
• Edges and shapes are recognized and pieced together
to form more complex shapes, eventually assembling
the target image.
How CNNs Work

Convolution layers (identify small features)
  → low-level features (edges, patches of colour)
Pooling layers (down-sample the small patterns identified)
  → mid-level features (eyes, ears)
Classifier
  → high-level classes (tiger, lion)
Convolution Layers

• Convolution layers are made up of convolution
filters (CFs).
• A filter determines whether a particular small
shape, edge, or other local feature is present
in the image.
• Convolution means repeatedly multiplying
matrix elements element-wise and adding the results.
• The resultant (convolved) image emphasizes the
regions of the original image that are similar to
the CF.
Convolution Filter
Input image (4 × 3 pixels):

    [ a  b  c ]
    [ d  e  f ]
    [ g  h  i ]
    [ j  k  l ]

Convolution filter (2 × 2):

    [ α  β ]
    [ γ  δ ]

Convolved image (3 × 2):

    [ aα + bβ + dγ + eδ    bα + cβ + eγ + fδ ]
    [ dα + eβ + gγ + hδ    eα + fβ + hγ + iδ ]
    [ gα + hβ + jγ + kδ    hα + iβ + kγ + lδ ]

• The CF is slid around the input image, scoring for matches.
• The scoring is done via dot products.
• If a sub-image of the input image is similar to the filter,
the score is high; otherwise it is low.
• CFs are learned during training.
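A NumPy sketch of the sliding dot-product computation described above; the pixel values and filter weights are arbitrary stand-ins for a–l and α–δ. (Strictly, this is the cross-correlation form used in the slide, with no filter flip.)

```python
import numpy as np

def convolve2d(image, filt):
    """Slide the filter over the image and take dot products."""
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * filt)
    return out

# A 4 x 3 image and a 2 x 2 filter, as in the slide, with made-up numbers
image = np.arange(12.0).reshape(4, 3)      # stands in for a..l
filt = np.array([[1.0, 0.0], [0.0, 1.0]])  # stands in for alpha..delta
print(convolve2d(image, filt))             # 3 x 2 convolved image
```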
Convolution Example

• The two filters shown here highlight vertical and horizontal stripes.
• The result of each convolution is a new feature map.
• In the first convolved image the vertical stripes are more prominent.
• In the second, the horizontal stripes are more prominent.
Convolution Layer
• In CNNs the filters are learned for the specific
classification task.
• The filter weights are the parameters going from
an input layer to a hidden layer, with one hidden
unit for each pixel in the convolved image.
CIFAR100 Example
• A colour image has three channels, represented
by a three-dimensional feature map (array).
• Each channel is a 2D (32 × 32) feature map — one
each for R, G and B.
• A single CF also has three channels, one per
colour, each of dimension 3 × 3, with potentially
different filter weights.
• At the first hidden layer, K CFs produce K 2D feature maps.
• The results of the three channel convolutions are summed
to form each 2D output feature map.
• The ReLU activation function is applied to the convolved
image, in a separate layer known as the detector layer.
Pooling Layers

• Max pooling summarizes each non-overlapping block of pixels in
a feature map by its maximum, reducing the size of the image
while retaining the strongest feature responses.
Pooling

Max pool:

    [ 1  2  5  3 ]
    [ 3  0  1  2 ]       [ 3  5 ]
    [ 2  1  3  4 ]  ⟶   [ 2  4 ]
    [ 1  1  2  0 ]

• Pooling layers transform a large image into a smaller
summary image.
• Max pooling: each non-overlapping 2 × 2 block is
replaced by its maximum.
• This reduces the dimension by a factor of 2 in each
direction.
• Allows for locational invariance.
• This sharpens the feature identification.
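A NumPy sketch of 2 × 2 max pooling applied to the example above.

```python
import numpy as np

def max_pool_2x2(x):
    """Replace each non-overlapping 2 x 2 block by its maximum."""
    H, W = x.shape
    x = x[:H - H % 2, :W - W % 2]                 # trim odd edges if any
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

# The 4 x 4 example from the slide
x = np.array([[1, 2, 5, 3],
              [3, 0, 1, 2],
              [2, 1, 3, 4],
              [1, 1, 2, 0]])
print(max_pool_2x2(x))   # [[3 5]
                         #  [2 4]]
```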
Architecture of a CNN for CIFAR100

• CFs are the analogue of the units in a hidden layer of an ANN.
• Each CF creates a new channel in the convolution layer.
• Each pooling layer reduces the size of the feature maps.
• A CNN is a sequence of convolve-and-pool layers,
continued until pooling has reduced each channel's
feature map down to a few pixels in each dimension.
Architecture of a CNN for CIFAR100

• The 3D feature maps are then flattened.
• Flattening treats the pixels as separate units; these are fed
into one or more fully-connected layers.
• Finally, at the output layer, a softmax activation is
applied over the 100 classes.
• Accuracy at the time of ISLR2: about 75%.
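One way such a convolve-and-pool architecture could be written in Keras (assuming TensorFlow is available); the number of filters, layer sizes and optimizer are illustrative choices, not the exact network described in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), padding="same", activation="relu",
                  input_shape=(32, 32, 3)),       # convolve: 32 filters, 3x3, RGB input
    layers.MaxPooling2D((2, 2)),                  # pool: halve each dimension
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                             # flatten the 3D feature maps
    layers.Dense(256, activation="relu"),         # fully-connected layer
    layers.Dense(100, activation="softmax"),      # 100 CIFAR100 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```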
Data Augmentation

• A trick to enhance accuracy in image modelling.
• Each image is replicated many times with random distortions.
• Distortions include rotation, zoom, horizontal/vertical shift,
shear, etc.
• This increases the training set considerably with somewhat
different examples, and thus protects against overfitting.
• It is a form of regularization: a cloud of images is
created around each original image, all sharing its label.
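A sketch of such random distortions using Keras' ImageDataGenerator; the distortion ranges are illustrative assumptions, and x_train / y_train stand for the training images and labels.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,        # random rotation (degrees)
    zoom_range=0.1,           # random zoom
    width_shift_range=0.1,    # horizontal shift (fraction of width)
    height_shift_range=0.1,   # vertical shift (fraction of height)
    shear_range=0.1,          # shear transformation
    horizontal_flip=True,     # random horizontal flips
)

# Each batch drawn from flow() is a randomly distorted copy of some
# training images, carrying the same labels as the originals, e.g.:
# model.fit(augmenter.flow(x_train, y_train, batch_size=128), epochs=30)
```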
Using Pretrained Networks to Classify Images

[Figure: six sample images with their true labels and the top-3 predicted classes and probabilities:]

True label: Flamingo        flamingo 0.83, spoonbill 0.17, white stork 0.00
True label: Cooper's hawk   kite (raptor) 0.60, great grey owl 0.09, robin 0.06
True label: Cooper's hawk   fountain 0.35, nail 0.12, hook 0.07
True label: Lhasa Apso      Tibetan terrier 0.56, Lhasa 0.32, cocker spaniel 0.03
True label: Cat             Old English sheepdog 0.82, Shih-Tzu 0.04, Persian cat 0.04
True label: Cape weaver     jacamar 0.28, macaw 0.12, robin 0.12

• A pretrained CNN, resnet50, trained on the imagenet
dataset (millions of labelled images in many categories),
is used to classify these images.
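A sketch of classifying an image with the pretrained resnet50 network in Keras; the filename "flamingo.jpg" is a placeholder, and the first call downloads the imagenet weights.

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import (ResNet50,
                                                    preprocess_input,
                                                    decode_predictions)
from tensorflow.keras.preprocessing import image

model = ResNet50(weights="imagenet")            # pretrained on imagenet

img = image.load_img("flamingo.jpg", target_size=(224, 224))  # placeholder file
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])      # top-3 classes with probabilities
```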
Document Classification: IMDB Movie Reviews

• The IMDB corpus consists of user-supplied movie


ratings for a large collection of movies. Each has
been labeled for sentiment as positive or negative.
Here is the beginning of a negative review -

This has to be one of the worst films of the 1990s. When my friends &
I were watching this film (being the target audience it was aimed at) we
just sat & watched the first half an hour with our jaws touching the floor
at how bad it really was. The rest of the time, everyone else in the theater
just started talking to each other, leaving or generally crying into their
popcorn . . .

• Labeled training and test sets, each consisting of


25,000 reviews, and each balanced with regard to
sentiment.
• Build a classifier to predict the sentiment of a review.
Featurization: Bag-of-Words
• From a dictionary, identify the 10K most
frequently occurring words.
• Create a binary vector of length p = 10K for
each document, and score a 1 in every position
where the corresponding word occurs.
• With n documents, this gives an n × p sparse
feature matrix X.
• Sparse matrix: store only the locations and values of the
non-zero elements.
• A lasso logistic regression model and a two-
hidden-layer neural network are fitted.
• Bag-of-words features are unigrams.
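A sketch of the bag-of-words featurization using scikit-learn's CountVectorizer; the two example reviews are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["this has to be one of the worst films of the 1990s",
           "one of the best films I have ever seen"]

vectorizer = CountVectorizer(max_features=10_000,  # keep the most frequent words
                             binary=True)          # 1 if the word occurs, else 0
X = vectorizer.fit_transform(reviews)              # n x p sparse matrix (CSR)
print(X.shape, X.nnz)                              # nnz = number of stored non-zeros
```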
Lasso versus Neural Network — IMDB Reviews
[Figure: classification accuracy on the IMDB data. Left panel: lasso logistic regression, accuracy versus −log(λ). Right panel: two-hidden-layer neural network, accuracy versus training epochs. Curves are shown for the train, validation and test sets; accuracy ranges from about 0.6 to 1.0.]
• The simpler lasso logistic regression model works as well
as the neural network in this case.
• glmnet was used to fit the lasso model, and is very
effective because it can exploit sparsity in the X
matrix.
Recurrent Neural Networks
• The previous approach does not take into
account the sequence of the words.
• It ignores the context in which words are used.
• Two ways to incorporate context:
  • Bag of n-grams model (consecutive
co-occurrences of n words).
  • Treat the document as a sequence, taking
into account the context of each word: the
words that precede it and those that follow.
Recurrent Neural Networks

Data as sequences:
• Documents are sequences of words, and their relative
positions have meaning.
• Time-series such as weather data
• Financial time series, market indices, stock and bond
price, exchange rates.
• Recorded speech or music.
• Handwriting, such as doctor’s notes.
RNNs build models that take into account this
sequential nature of the data, and build a memory of
the past.
Recurrent Neural Networks

• The document is a sequence of L words
X = {X_1, X_2, …, X_L},
with each element X_l representing a word.
• The target Y is often a single variable such as
Sentiment, or a one-hot vector for a
multiclass label.
• However, Y can also be a sequence, such
as the same document in a different
language.
Simple Recurrent Neural Network Architecture
[Diagram: inputs X_1, …, X_L feed into hidden states A_1, …, A_L through weights W; each A_l also receives A_{l−1} through weights U, and produces an output O_l through weights B, with Y the target at the end of the sequence.]

• The hidden layer is a sequence of vectors A_l, each receiving as
input X_l as well as A_{l−1}; A_l produces an output O_l.
• The same weights W, U and B are used at each step in the
sequence — hence the term recurrent (weight sharing).
• The A_l sequence represents an evolving model for the
response that is updated as each element X_l is processed.
RNN in Detail

• Suppose X_l = (X_{l1}, X_{l2}, …, X_{lp}) has p components, and the hidden
layer A_l = (A_{l1}, A_{l2}, …, A_{lK}) has K components. Then the
computation at the kth component of the hidden unit A_l is

    A_{lk} = g(w_{k0} + Σ_{j=1}^{p} w_{kj} X_{lj} + Σ_{s=1}^{K} u_{ks} A_{l−1,s})

    O_l = β_0 + Σ_{k=1}^{K} β_k A_{lk}

• Often we are concerned only with the prediction O_L at the last
unit. For squared-error loss and n sequence/response pairs,
we minimize

    Σ_{i=1}^{n} (y_i − o_{iL})² = Σ_{i=1}^{n} ( y_i − [ β_0 + Σ_{k=1}^{K} β_k g( w_{k0} + Σ_{j=1}^{p} w_{kj} x_{iLj} + Σ_{s=1}^{K} u_{ks} a_{i,L−1,s} ) ] )²
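A NumPy sketch of this forward recursion; g is taken to be tanh here, and all dimensions and weights are illustrative assumptions.

```python
import numpy as np

def rnn_forward(X, W, U, b, beta, beta0, g=np.tanh):
    """Forward pass of the simple RNN above.

    X : (L, p) sequence of inputs
    W : (K, p) input-to-hidden weights w_kj
    U : (K, K) hidden-to-hidden weights u_ks
    b : (K,)   hidden biases w_k0
    beta, beta0 : output weights and bias
    Returns the last output O_L.
    """
    K = b.shape[0]
    A = np.zeros(K)                      # A_0 = 0
    for l in range(X.shape[0]):
        A = g(b + W @ X[l] + U @ A)      # A_l depends on X_l and A_{l-1}
        O = beta0 + beta @ A             # O_l = beta_0 + sum_k beta_k A_lk
    return O

# Illustrative dimensions: L = 6 steps, p = 3 inputs, K = 4 hidden units
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
W, U, b = rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(4, 4)), np.zeros(4)
beta, beta0 = rng.normal(size=4), 0.0
print(rnn_forward(X, W, U, b, beta, beta0))
```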
RNN and IMDB Reviews

• The document feature is a sequence of words {W_l}. We
typically truncate/pad the documents to the same number
L of words (we use L = 500).
• Each word W_l is represented as a one-hot encoded binary
vector X_l (dummy variable) of length 10K, with all zeros
and a single one in the position for that word in the
dictionary.
• This results in an extremely sparse feature representation,
and would not work well on its own.
• Instead we use a lower-dimensional pretrained word
embedding matrix E (m × 10K, next slide).
• This reduces the binary feature vector of length 10K to a
real feature vector of dimension m ≪ 10K (e.g. m in the
low hundreds).
Word Embedding

[Figure: the sentence "this is one of the best films actually the best I have ever seen the film starts one fall day" shown first as one-hot vectors (one column of length 10K per word) and then as lower-dimensional embedded vectors.]

• The embedding matrix E can be learned during training
(an embedding layer).
• A pretrained embedding can also be used, with the weights
frozen (weight freezing).
• word2vec and GloVe are popular pretrained embeddings.
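A small sketch showing that multiplying the embedding matrix E by a one-hot vector is just a column lookup; the dictionary size, m and the word index are illustrative.

```python
import numpy as np

vocab_size, m = 10_000, 32
rng = np.random.default_rng(2)
E = rng.normal(size=(m, vocab_size))     # embedding matrix E (m x 10K)

word_index = 4237                        # position of a word in the dictionary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# E @ one_hot collapses the length-10K one-hot vector to an m-vector;
# in practice this is implemented as a simple column lookup.
assert np.allclose(E @ one_hot, E[:, word_index])
print(E[:, word_index].shape)            # (32,)
```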
RNN and IMDB Reviews

• With 32 hidden units in a single recurrent layer, and an
embedding E with m = 32, accuracy is only 76%.
• LSTM (long short-term memory) RNN:
here A_l receives input from A_{l−1} (short-term memory)
as well as from a version that reaches further back in
time (long-term memory).
• Accuracy improves to 87%, slightly less than the
88% achieved by glmnet.
• These data have been used as a benchmark for new
RNN architectures. The best reported result found
at the time of writing (2020) was around 95%.
[Figure: three daily time series for the New York Stock Exchange, 1962–1986: log(Trading Volume), Dow Jones Return and log(Volatility), plotted against time.]

Time Series Forecasting

New York Stock Exchange Data

Shown in the previous slide are three daily time series for the
period December 3, 1962 to December 31, 1986 (6,051
trading days):
• Log trading volume (v_t) - the fraction of all
outstanding shares traded on that day, relative to
a 100-day moving average of past turnover, on the log
scale.
• Dow Jones return (r_t) - the difference between
the logs of the Dow Jones Industrial Index on consecutive
trading days.
• Log volatility (z_t) - based on the absolute
values of daily price movements.

Goal: predict Log trading volume tomorrow, given its
observed values up to today, as well as those of Dow
Jones return and Log volatility.
Autocorrelation
[Figure: autocorrelation function of log(Trading Volume) for lags 0 to 35 trading days.]

• The autocorrelation at lag l is the correlation of all
pairs (v_t, v_{t−l}) that are l trading days apart.
• These sizable correlations give us confidence that
past values will be helpful in predicting the future.
• This is a curious prediction problem: the response v_t
is also a feature v_{t−l}!
RNN Forecaster

• We have only one series of data.
• Extract many short mini-series of input sequences
X = {X_1, X_2, …, X_L} with a predefined length L known as the
lag:

    X_1 = (v_{t−L}, r_{t−L}, z_{t−L}),  X_2 = (v_{t−L+1}, r_{t−L+1}, z_{t−L+1}),  …,  X_L = (v_{t−1}, r_{t−1}, z_{t−1}),  and Y = v_t

• Since T = 6,051, with L = 5 we can create 6,046 such (X, Y)
pairs.
• Use the first 4,281 as training data, and the following 1,770
as test data. We fit an RNN with 12 hidden units per lag step
(i.e. per A_l).
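A sketch of how such lagged (X, Y) pairs could be constructed; the three series here are random stand-ins for v_t, r_t and z_t.

```python
import numpy as np

def make_lagged_pairs(v, r, z, L=5):
    """Build (X, Y) pairs: X has shape (L, 3) with columns (v, r, z)
    for days t-L, ..., t-1, and Y = v_t."""
    series = np.column_stack([v, r, z])                      # (T, 3)
    T = len(v)
    X = np.stack([series[t - L:t] for t in range(L, T)])     # (T-L, L, 3)
    Y = v[L:]                                                # (T-L,)
    return X, Y

# Illustrative random data standing in for the T = 6,051 trading days
rng = np.random.default_rng(3)
T = 6051
v, r, z = rng.normal(size=T), rng.normal(size=T), rng.normal(size=T)
X, Y = make_lagged_pairs(v, r, z, L=5)
print(X.shape, Y.shape)      # (6046, 5, 3) (6046,)
```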
RNN Results for NYSE Data
[Figure: observed and predicted log(Trading Volume) over the test period, 1980–1986.]

The figure shows the predictions and the truth for the test period.

R² = 0.42 for the RNN.

R² = 0.18 for the straw man — using yesterday's value of Log
trading volume to predict today's.
Autoregression Forecaster

• The RNN forecaster is similar in structure to a traditional
autoregression procedure.

    y = ( v_{L+1}, v_{L+2}, v_{L+3}, …, v_T )ᵀ

    M = [ 1  v_L      v_{L−1}  ⋯  v_1     ]
        [ 1  v_{L+1}  v_L      ⋯  v_2     ]
        [ 1  v_{L+2}  v_{L+1}  ⋯  v_3     ]
        [ ⋮     ⋮        ⋮     ⋱    ⋮     ]
        [ 1  v_{T−1}  v_{T−2}  ⋯  v_{T−L} ]

• Fit an OLS regression of y on M, giving

    v̂_t = β̂_0 + β̂_1 v_{t−1} + β̂_2 v_{t−2} + ⋯ + β̂_L v_{t−L}

This is known as an order-L autoregression model, or AR(L).
• For the NYSE data we can also include lagged versions of DJ return
and log volatility in the matrix M, resulting in 3L + 1 columns.
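A sketch of fitting an AR(L) model by ordinary least squares with NumPy; the simulated series is an arbitrary stand-in for log trading volume.

```python
import numpy as np

def fit_ar(v, L=5):
    """Fit an order-L autoregression v_t ~ v_{t-1}, ..., v_{t-L} by OLS.
    Returns the estimated coefficients (intercept first)."""
    T = len(v)
    # Lagged design matrix M: leading column of 1s, then v_{t-1}, ..., v_{t-L}
    M = np.column_stack([np.ones(T - L)] +
                        [v[L - j:T - j] for j in range(1, L + 1)])
    y = v[L:]
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    return coef

# Illustrative AR(2)-style series standing in for log trading volume
rng = np.random.default_rng(4)
v = np.zeros(6051)
for t in range(2, len(v)):
    v[t] = 0.5 * v[t - 1] + 0.2 * v[t - 2] + rng.normal(scale=0.1)
print(fit_ar(v, L=5))     # intercept followed by 5 lag coefficients
```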
Autoregression Results for NYSE Data

• R² = 0.41 for the AR(5) model (16 parameters).
• R² = 0.42 for the RNN model (205 parameters).
• R² = 0.42 for the AR(5) model fit by a neural network.
• All the models can be improved by including day of the
week as a variable.
Summary of RNNs

• Many more complex variations of RNN exist.


• One variation treats the sequence as a one-
dimensional image, and uses CNNs for fitting.
• Bi-directional RNNs scan sequences in both directions.
• Can have additional hidden layers, where each hidden
layer is a sequence and treats the previous hidden
layer as an input sequence.
• Can have the output also be a sequence, with input and
output sharing the hidden units. So-called seq2seq
learning is used for language translation.
When to Use Deep Learning

Testing on the Hitters data (263 observations: 176 used for training, 87
for testing):

A lasso model with 4 variables gives a test error of 224.8.
When to Use Deep Learning

• CNNs have had enormous successes in image classification and


modeling, and are starting to be used in medical diagnosis.
• RNNs have had big wins in speech modeling, language
translation, and forecasting.

Should we always use deep learning models?


• Often the big successes occur when the signal-to-noise ratio is
high, datasets are large, and overfitting is not a big problem.
• For noisier data, simpler models can often work better.
1. On the NYSE data, the AR(5) model is much simpler
than an RNN, and performed as well.
2. On the IMDB review data, the linear model fit by
glmnet did as well as the neural network, and better
than the RNN.
• We endorse the Occam's razor principle — we prefer simpler
models if they work as well. They are more interpretable!
Fitting Neural Networks
Input Layer          Hidden Layer          Output Layer

[Diagram: single-hidden-layer network with inputs X_1–X_4, hidden units A_1–A_5, and output f(X) → Y]

    minimize over {w_k}_1^K, β:   (1/2) Σ_{i=1}^{n} (y_i − f(x_i))²

    where  f(x_i) = β_0 + Σ_{k=1}^{K} β_k g(w_{k0} + Σ_{j=1}^{p} w_{kj} x_{ij})

• This problem is difficult because the objective is non-convex.

• Despite this, effective algorithms have evolved that can optimize


complex neural network problems efficiently.
Non Convex Functions and Gradient Descent
• Let R(θ) = (1/2) Σ_{i=1}^{n} (y_i − f_θ(x_i))², with θ = ({w_k}_1^K, β).

[Figure: a non-convex objective R(θ) plotted against θ, with iterates θ^0, θ^1, θ^2, …, θ^7 descending toward a local minimum.]

1. Start with a guess θ^0 for all the parameters in θ, and set t = 0.
2. Iterate until the objective R(θ) fails to decrease:
   • Find a vector δ that reflects a small change in θ, such that
     θ^{t+1} = θ^t + δ reduces the objective, i.e. R(θ^{t+1}) < R(θ^t).
   • Set t ← t + 1.
Gradient Descent Continued

• How do we find a direction δ that points downhill? We compute the
gradient vector, which gives the direction of steepest ascent:

    ∇_θ R(θ^m) = ∂R(θ)/∂θ |_{θ=θ^m}

i.e. the vector of partial derivatives at the current guess θ^m.
• Hence, moving in the opposite direction of the gradient reduces R(θ).
• The size of the step is controlled by a parameter ρ.
• θ is updated by δ = −ρ ∇_θ R(θ^m), i.e.

    θ^{m+1} ← θ^m − ρ ∇_θ R(θ^m)

where ρ is the learning rate (typically small, e.g. ρ = 0.001).
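A generic sketch of this update rule; the quadratic objective at the end is an illustrative example.

```python
import numpy as np

def gradient_descent(grad_R, theta0, rho=0.001, max_iter=10_000, tol=1e-8):
    """Gradient descent: theta <- theta - rho * grad_R(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = rho * grad_R(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:      # stop when updates become tiny
            break
    return theta

# Illustrative quadratic objective R(theta) = 0.5 * ||A theta - b||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_R = lambda th: A.T @ (A @ th - b)
print(gradient_descent(grad_R, theta0=[0.0, 0.0], rho=0.05))
```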


Gradients and Backpropagation
R(θ) = Σ_{i=1}^{n} R_i(θ) = (1/2) Σ_{i=1}^{n} (y_i − f_θ(x_i))² is a sum, so the
gradient is a sum of gradients, with

    R_i(θ) = (1/2) (y_i − f_θ(x_i))²
           = (1/2) ( y_i − β_0 − Σ_{k=1}^{K} β_k g(w_{k0} + Σ_{j=1}^{p} w_{kj} x_{ij}) )²

• For ease of notation, let z_{ik} = w_{k0} + Σ_{j=1}^{p} w_{kj} x_{ij}.
• Backpropagation uses the chain rule for differentiation:

    ∂R_i(θ)/∂β_k = ∂R_i(θ)/∂f_θ(x_i) · ∂f_θ(x_i)/∂β_k
                 = −(y_i − f_θ(x_i)) · g(z_{ik})

    ∂R_i(θ)/∂w_{kj} = ∂R_i(θ)/∂f_θ(x_i) · ∂f_θ(x_i)/∂g(z_{ik}) · ∂g(z_{ik})/∂z_{ik} · ∂z_{ik}/∂w_{kj}
                    = −(y_i − f_θ(x_i)) · β_k · g′(z_{ik}) · x_{ij}
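A NumPy sketch that evaluates these chain-rule gradients for a single observation; g is taken to be tanh for illustration, and the data are made up.

```python
import numpy as np

def grads_single_obs(x, y, W, b, beta, beta0,
                     g=np.tanh, gprime=lambda z: 1 - np.tanh(z) ** 2):
    """Gradients of R_i = 0.5*(y - f(x))^2 for one observation,
    following the chain-rule expressions above."""
    z = b + W @ x                    # z_k = w_k0 + sum_j w_kj x_j
    A = g(z)                         # hidden activations
    f = beta0 + beta @ A             # prediction f_theta(x)
    resid = y - f
    d_beta0 = -resid                 # dR/dbeta_0
    d_beta = -resid * A              # dR/dbeta_k = -(y-f) * g(z_k)
    d_b = -resid * beta * gprime(z)  # dR/dw_k0   = -(y-f) * beta_k * g'(z_k)
    d_W = np.outer(d_b, x)           # dR/dw_kj   = -(y-f) * beta_k * g'(z_k) * x_j
    return d_W, d_b, d_beta, d_beta0

# Illustrative check with p = 3 inputs and K = 2 hidden units
rng = np.random.default_rng(5)
x, y = rng.normal(size=3), 1.0
W, b = rng.normal(size=(2, 3)), np.zeros(2)
beta, beta0 = rng.normal(size=2), 0.0
print(grads_single_obs(x, y, W, b, beta, beta0))
```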


Methods to Improve Performance

• Slow learning - gradient descent is slow, and a small
learning rate ρ slows it even further.
• Vectorization - express the computations as matrix operations.
• Stochastic gradient descent (minibatch) - rather than
compute the gradient using all the data, use a small
batch of observations drawn at random at each step.
• Regularization - ridge and lasso regularization can
be used to shrink the weights at each layer.
• Dropout - randomly remove a fraction φ of the
units in a layer when fitting the model.
  • This is done separately each time a training
observation is processed.
  • The surviving units are scaled up by a factor of
1/(1 − φ) to compensate.
Dropout Learning
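A sketch of (inverted) dropout as described above: units are dropped with probability φ and the survivors are rescaled by 1/(1 − φ) so the expected activation is unchanged.

```python
import numpy as np

def dropout(A, phi, rng, training=True):
    """Inverted dropout applied to a vector of activations A."""
    if not training or phi == 0.0:
        return A
    keep = rng.random(A.shape) >= phi          # Bernoulli keep-mask
    return A * keep / (1.0 - phi)              # rescale the survivors

rng = np.random.default_rng(6)
A = np.ones(10)                                # hidden activations
print(dropout(A, phi=0.4, rng=rng))            # ~40% zeros, survivors scaled to 1/0.6
```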
Problems with Gradient Descent Algorithms
1. Choice of a proper learning rate:
   • Too small: slow convergence.
   • Too large: oscillation.
2. The same learning rate is applied to all features.
   • It would be better to use a larger
     update for rarely occurring features
     and a smaller update for more frequent
     features.
3. Getting trapped in a sub-optimal local
   minimum.
4. The number of gradient calculations is n (the number of
   observations) at each iteration.
Gradient Descent

• Attributed to Cauchy (1847).
• Update the parameter set θ^m at each iteration:

    θ^{m+1} ← θ^m − ρ ∇_θ R(θ^m)

• ∇_θ R(θ^m) requires a gradient calculation at every training data
point, since

    ∇_θ R(θ^m) = (1/n) Σ_{i=1}^{n} ∇_θ R_i(θ^m)

• This is expensive when n is large (n could be of the order of millions).

Stochastic Gradient Descent

• Attributed to Robbins and Monro (1950s).
• Pick a random integer k ∈ {1, …, n}.
• Perform the update at this randomly chosen data point:

    θ^{m+1} ← θ^m − ρ ∇_θ R_k(θ^m)

• Much faster (n times) than calculating a full gradient.
• Does this make sense?

Stochastic Gradient Descent

• Consider a simple 1D problem:

    min_θ R(θ) = (1/2) Σ_{i=1}^{n} (a_i θ − b_i)²

• Solving the full-gradient equation ∇_θ R(θ) = 0 gives

    θ* = ( Σ_{i=1}^{n} a_i b_i ) / ( Σ_{i=1}^{n} a_i² )

• Solving for a particular point i, ∇_θ R_i(θ) = 0 gives

    θ_i* = b_i / a_i
Stochastic Gradient Descent

[Figure: the individual loss curves (a_1 θ − b_1)², (a_2 θ − b_2)², …, (a_n θ − b_n)², whose individual minimizers b_i/a_i span the interval Q = [min_i b_i/a_i, max_i b_i/a_i].]

• Q: the region of confusion.
• The full-gradient minimizer θ* lies somewhere in Q.
• Outside Q, the signs of ∇_θ R(θ) and ∇_θ R_i(θ) are the same.
• This means that outside Q, each ∇_θ R_i(θ) moves θ in the right direction.
• Inside Q this property breaks down: as you get closer to the
optimum, the fluctuations increase.
Stochastic Gradient Descent

• Inside Q this property breaks down: as you get closer to the
optimum, the fluctuations increase.
• You may not want the exact optimal point anyway, because that may
lead to overfitting.
• In general, when there is a difficult function to optimize, use
randomization.
• SGD uses a stochastic gradient g(θ) that is an unbiased
estimator of the full gradient:

    E[g(θ)] = ∇_θ R(θ)

• Estimators can be noisy, so g(θ) should also have low variance.
Stochastic Gradient Descent

min_θ R(θ)

At each iteration:
• Option 1: pick an index i with replacement.
• Option 2: pick an index i without replacement.
• Use g(θ) = ∇_θ R_i(θ) as the stochastic gradient (SG).
• Update θ^{m+1} ← θ^m − ρ g(θ^m).

Which one is used in practice?


Stochastic Gradient Descent

min_θ R(θ)

At each iteration:
• Option 1: pick an index i with replacement.
• Option 2: pick an index i without replacement.
• Use g(θ) = ∇_θ R_i(θ) as the stochastic gradient (SG).
• Update θ^{m+1} ← θ^m − ρ g(θ^m).

Which one is used in practice?

• In almost all implementations (TensorFlow, PyTorch),
option 2 is used (pre-shuffle and stream through the data).
• But for theoretical proofs, option 1 is used
(because of the IID property).
SGD (with single observations) Advantages

• Frequent updating of the weights provides quick
insight into model performance and the improvement
in the loss function.
• Simplest to understand and implement.
• Learning may be fast on some problems.
• The noisy update process may help avoid local minima.

SGD (with single observations) Disadvantages

• Updating the model too frequently may be too
expensive, especially for larger datasets (both in
number of observations and dimensions).
• Model parameters and model error may jump
around (higher variance).
• Frequent updating does not allow the parameters
to settle (an observation may be classified
correctly by one set of weights, but may be
misclassified by the updated weights in the next
iteration).
Variant of Stochastic Gradient Descent: Mini-batch

    min_θ R(θ) = (1/n) Σ_{i=1}^{n} R_i(θ)

• Use a mini-batch I_k of training points.
• Update θ^{m+1} ← θ^m − ρ (1/|I_k|) Σ_{j∈I_k} ∇_θ R_j(θ^m).
• Each iteration computes |I_k| stochastic gradients.
• This technique is great for parallelization (GPUs):
  • each core computes a stochastic gradient;
  • with more cores, larger mini-batches can be computed.
• Remark: very large mini-batches (which look like full GD)
may lead to OVERFITTING in ML problems.
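A sketch of mini-batch SGD on an illustrative least-squares problem; the batch size, learning rate and data are arbitrary choices.

```python
import numpy as np

def minibatch_sgd(grad_Ri, theta0, n, rho=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD: average grad_Ri over a random mini-batch I_k
    and take a gradient step."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        order = rng.permutation(n)                    # pre-shuffle (option 2)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            g = np.mean([grad_Ri(theta, i) for i in batch], axis=0)
            theta = theta - rho * g
    return theta

# Illustrative least-squares problem: R_i = 0.5 * (a_i . theta - b_i)^2
rng = np.random.default_rng(7)
A = rng.normal(size=(1000, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
grad_Ri = lambda th, i: A[i] * (A[i] @ th - b[i])
print(minibatch_sgd(grad_Ri, theta0=np.zeros(3), n=1000))
```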
Momentum Optimizer
[Figure: elongated contours of the loss in the (W_1, W_2) plane.]

If the gradient in one direction is much higher than in another
direction, gradient descent finds it difficult to reach the minimum.
Since the curvature in the W_2 direction is more pronounced, the gradient
in the W_2 direction is much larger, causing oscillation.
Momentum Optimizer

[Figure: the same loss contours in the (θ_1, θ_2) plane, showing the zig-zag path of plain gradient descent.]

Since the curvature in the θ_2 direction is more pronounced, the gradient
in the θ_2 direction is much larger, causing oscillation.
Momentum Optimizer

[Figure: with momentum, the path toward the minimum in the (θ_1, θ_2) plane is smoothed.]

    θ^{m+1} ← θ^m + γ V^m − ρ ∇R(θ^m)

where V is the momentum component (the accumulated past updates).
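A sketch of the momentum update; the ill-conditioned quadratic gradient is an illustrative stand-in for the situation in the figure.

```python
import numpy as np

def momentum_descent(grad_R, theta0, rho=0.05, gamma=0.9, max_iter=500):
    """Gradient descent with momentum:
    V <- gamma * V - rho * grad;  theta <- theta + V."""
    theta = np.asarray(theta0, dtype=float)
    V = np.zeros_like(theta)
    for _ in range(max_iter):
        V = gamma * V - rho * grad_R(theta)   # accumulate velocity
        theta = theta + V
    return theta

# Illustrative ill-conditioned quadratic: shallow in theta_1, steep in theta_2
grad_R = lambda th: np.array([0.2 * th[0], 10.0 * th[1]])
print(momentum_descent(grad_R, theta0=[5.0, 5.0]))
```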
Issues?

• Selection of hyperparameters (learning
rates for momentum and gradient).
• The same learning rate is used in all directions.
• It may be useful to have smaller learning
rates in directions of higher gradient, and
larger ones in directions of smaller gradient.
Other Efficient Variants of Optimizers

Adagrad: adaptively scales the learning rate in different
dimensions: parameters with the largest derivatives of the loss function
have a rapid decrease in their learning rate, while those with smaller
derivatives have a slower decrease. It uses the cumulative squared
gradient to update the learning rate.

RMSProp: uses an exponentially decaying average of the squared
gradient, discarding history from the extreme past.

Adam optimizer: a combination of the momentum optimizer and
RMSProp. First- and second-order moments are used, after correcting
for bias.
Adagrad

• Adaptively scales the learning rates in different
directions.
• The scale factor for each parameter is inversely
proportional to the square root of the sum of
historical squared values of its gradient.
• Parameters with large partial derivatives of the
loss function have a rapid decrease in learning
rate.
• Parameters with small partial derivatives of
the loss function have a relatively slower
decrease in learning rate.
Adagrad
• At a particular iteration t, the mini-batch stochastic gradient is

    g_t = (1/|I_k|) Σ_{j∈I_k} ∇_θ R_j(θ^t)

• Accumulate the squared gradients up to iteration t
(∘ denotes component-wise multiplication):

    w^t = Σ_{τ=1}^{t} g_τ ∘ g_τ

• Update (element-wise):

    θ^{t+1} ← θ^t − ( ρ / (εI + √(w^t)) ) ∘ g_t

where I is a vector of ones and ε is a small constant for numerical
stability.
Adagrad
• For a d-dimensional problem, written out element-wise:

    θ_1^{t+1} = θ_1^t − ρ / (ε + √(w_1^t)) · g_{t,1}
        ⋮
    θ_d^{t+1} = θ_d^t − ρ / (ε + √(w_d^t)) · g_{t,d}
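A sketch of the Adagrad update on an illustrative quadratic; the learning rate and data are arbitrary choices.

```python
import numpy as np

def adagrad(grad_R, theta0, rho=0.1, eps=1e-8, max_iter=1000):
    """Adagrad: per-coordinate learning rates scaled by the inverse
    square root of the accumulated squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    w = np.zeros_like(theta)                 # running sum of squared gradients
    for _ in range(max_iter):
        g = grad_R(theta)
        w += g * g                           # w^t = sum_tau g_tau o g_tau
        theta = theta - rho * g / (eps + np.sqrt(w))
    return theta

# Same ill-conditioned quadratic as in the momentum sketch
grad_R = lambda th: np.array([0.2 * th[0], 10.0 * th[1]])
print(adagrad(grad_R, theta0=[5.0, 5.0]))
```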
Adagrad
Advantages:
• Adaptively scales the learning rate for different dimensions
by normalizing with respect to the gradient magnitude in the
corresponding dimension.
• Eliminates the need to set the learning rate manually.
• Converges rapidly when applied to convex functions.
Disadvantages:
• For non-convex functions it may pass through many
complex terrains and end up in a local optimum.
• With a large number of iterations, the scale factor grows
and the learning rate becomes very small.
• In such cases, the model may stop learning.
RMSProp

• Instead of taking the cumulative sum of squares of the past
gradients, RMSProp uses an exponentially weighted average of the
squared gradients, and discards history from the extreme past.
• Converges rapidly within a local convex ball; it behaves as if
Adagrad had been initialized within that ball.
• Everything else is the same as Adagrad; only the learning-rate
update component is changed to

    w^t = β w^{t−1} + (1 − β) g_t ∘ g_t

• AdaDelta is a similar algorithm that uses a moving average instead
of an exponentially decaying one for the update term.
Adam (Adaptive Moment) Optimizer

• A combination of RMSProp and the momentum optimizer.
• Considers both the first and second moments of the gradient.
• Both the first and second moments are corrected for the bias
caused by their initialization at 0.
• First moment:

    s^t = β_1 s^{t−1} + (1 − β_1) g_t

• Second moment:

    w^t = β_2 w^{t−1} + (1 − β_2) g_t ∘ g_t

• Bias correction:

    ŝ^t = s^t / (1 − β_1^t) ;    ŵ^t = w^t / (1 − β_2^t)

• Update:

    θ^{t+1} ← θ^t − ρ ŝ^t / (εI + √(ŵ^t))
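A sketch of the Adam update with bias correction, again on an illustrative quadratic; the hyperparameter values shown are common defaults, used here only for demonstration.

```python
import numpy as np

def adam(grad_R, theta0, rho=0.01, beta1=0.9, beta2=0.999, eps=1e-8, max_iter=2000):
    """Adam: momentum (first moment) + RMSProp-style scaling (second
    moment), both bias-corrected for their initialization at zero."""
    theta = np.asarray(theta0, dtype=float)
    s = np.zeros_like(theta)                 # first moment
    w = np.zeros_like(theta)                 # second moment
    for t in range(1, max_iter + 1):
        g = grad_R(theta)
        s = beta1 * s + (1 - beta1) * g
        w = beta2 * w + (1 - beta2) * g * g
        s_hat = s / (1 - beta1 ** t)         # bias correction
        w_hat = w / (1 - beta2 ** t)
        theta = theta - rho * s_hat / (eps + np.sqrt(w_hat))
    return theta

# Same ill-conditioned quadratic as in the earlier sketches
grad_R = lambda th: np.array([0.2 * th[0], 10.0 * th[1]])
print(adam(grad_R, theta0=[5.0, 5.0]))
```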
Tuning Parameters for the Model

• Number of hidden layers.
• Number of units per layer.
• Regularization tuning parameters λ.
• Batch size and number of epochs for SGD.
• Data augmentation parameters.
Double Descent

• With neural networks, it seems better to have too
many hidden units than too few.
• Likewise, more hidden layers seem better than fewer.
• Running stochastic gradient descent until zero training
error often gives good out-of-sample error.
• This may seem to contradict the bias-variance trade-off
idea.
• Increasing the number of units or layers and again
training to zero error sometimes gives even better out-
of-sample error.
The Double-Descent Error Curve

[Figure: training error and test error versus degrees of freedom (2 to 50), showing the double-descent shape of the test error.]

• Data simulated from Y = sin(X) + ε, with X ∼ U[−5, 5] and ε
Gaussian with SD = 0.3.
• A natural spline with d degrees of freedom is fitted to the data.
• The figure shows that at d = 20 the test error becomes large,
but then falls again as d increases.
Less Wiggly Solutions
[Figure: four panels showing the fitted natural splines with 8, 20, 42 and 80 degrees of freedom; the interpolating fits with 42 and 80 degrees of freedom are less wiggly than the fit with 20.]

• To achieve zero residuals (interpolation) with d = 20,
there is just one solution.
• With d = 42 or 80 there are infinitely many
interpolating solutions.
Some Facts

• NNs work particularly well in cases with a high signal-
to-noise ratio — e.g. image recognition and language
translation.
• Fitting zero-error models is not always optimal; it
depends on the signal-to-noise ratio.
• It is possible to achieve better test-set performance
with regularized models, without interpolating the
training set.
Thank You.
