
NISS DATA SCIENCE

ESSENTIALS FOR BUSINESS


TUTORIAL: DEEP LEARNING

Ming Li
Amazon & U. of Washington

Disclaimer: the opinions expressed in this short course are the presenter's own and do not represent the views of the presenter's employers.
INTENDED AUDIENCE

• Statisticians or practitioners who are exploring data science applications
• Graduate students who are looking for data science opportunities
• University professors who are expanding the scope of their data science courses
• Scientists or analysts who are trying to deal with unstructured data such as text and images in their day-to-day work

• This tutorial focuses on introducing basic concepts and basic applications using R keras. Due to limited time, many important mathematical details are not covered.
LIST OF MATERIALS

• Coursera course: Deep Learning Specialization by Andrew Ng et al.: a great in-depth introduction to deep learning
• UFLDL Deep Learning Tutorial: a self-paced tutorial on deep learning basics and applications
• Deep Learning with R by François Chollet with J. J. Allaire: a great introduction to using the keras library in R to train deep learning models, written specifically for statisticians
• Dive into Deep Learning: an interactive deep learning book with code, math and discussion
DATA SCIENCE
RECAP

Science vs. Engineering: Physics is to Electrical Engineering as Statistics is to Statistical Engineering.

Statistical Engineering + Big Data & Software = Data Science

Statistician vs. Data Scientist:
• Focus: relatively focused on modeling (i.e. science) vs. mainly focused on the business problem and result (i.e. engineering)
• Workflow: bring data to the model vs. bring models to the data
• Data: relatively small in size and clean, in text formats vs. messy and large amounts of data in various file formats
• Data type: usually structured data vs. both structured and unstructured data
• Systems: usually isolated from the production system vs. usually embedded in the production system
How to Bridge the Gaps?

• Engineering aspects of big data, data pipelines and production systems:
  R/Python on a laptop → Cloud Environment
• Modeling aspects, expanding the toolset to deal with text and image data more efficiently:
  Traditional machine learning methods (such as regression, random forest, gradient boosting trees, …)
  +
  Deep learning models:
  • Standard feedforward neural network for typical data frames
  • Convolutional neural network for image data
  • Word embedding and Recurrent neural network for text data
DEEP LEARNING:
A BRIEF INTRODUCTION

THE NEW TOOL IN THE DATA SCIENTIST'S TOOLBOX
A Little Bit of History – Perceptron
• Fun video: https://www.youtube.com/watch?v=cNxadbrN_aI
• Classification of N points into 2 classes: -1 and +1 (i.e. two different colors in the scatter plot)
• In this example, only two features are used (x_1 and x_2)
• Linear function to separate the classes: find (w_0, w_1, w_2) such that

  φ_i = w_0 + w_1 x_{1,i} + w_2 x_{2,i}
  Pred_i = +1 if φ_i > 0, −1 if φ_i ≤ 0

• How to find a good line? The perceptron algorithm:
  • Start with random weights
  • For each data point:
    1. Predict the class label
    2. If the prediction is not correct, update the weights using a preset learning rate η and the feature values of that data point, for example for w_1:
       w_1 = w_1 + η (Actual_i − Pred_i) x_{1,i}

Key features of the perceptron
• It is a linear classification function, and the weights are updated after each data point is fed to the algorithm (a concept similar to stochastic gradient descent).
• The algorithm continues to update when we use the same dataset again and again (i.e. over multiple epochs).
• It will not converge for non-linearly-separable problems.
• We have to define training stop criteria, such as the accuracy of the model or the number of times the dataset is used for training.

* Figure from Sebastian Raschka, Python Machine Learning
A Little Bit of History – Perceptron
We set a maximum of M epochs to run. For i in 1:M:

  For every data point (j in 1:N), we update the weights based on prediction correctness, the learning rate and the feature values:

    φ_j = w_0 + w_1 x_{1,j} + w_2 x_{2,j}
    Pred_j = +1 if φ_j > 0, −1 if φ_j ≤ 0

    w_0 = w_0 + η (Actual_j − Pred_j)
    w_1 = w_1 + η (Actual_j − Pred_j) x_{1,j}
    w_2 = w_2 + η (Actual_j − Pred_j) x_{2,j}

  Calculate the accuracy for the entire dataset to see whether the stopping criteria have been met. For a dataset that is not linearly separable, we need an accuracy threshold or a maximum number of epochs to stop the algorithm.

Perceptron R notebook: link
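To make the loop concrete, here is a minimal R sketch of the perceptron training rule described above (not the code from the linked notebook; function and variable names are illustrative, and labels are assumed to be coded as -1/+1):

# Minimal perceptron sketch: features x1, x2; labels y in {-1, +1}
perceptron_train <- function(x1, x2, y, eta = 0.1, max_epochs = 20) {
  w <- runif(3, -0.5, 0.5)                      # random starting weights (w0, w1, w2)
  for (epoch in 1:max_epochs) {
    for (i in seq_along(y)) {
      phi  <- w[1] + w[2] * x1[i] + w[3] * x2[i]
      pred <- ifelse(phi > 0, 1, -1)
      # the update is zero when the prediction is correct
      w <- w + eta * (y[i] - pred) * c(1, x1[i], x2[i])
    }
    acc <- mean(ifelse(w[1] + w[2] * x1 + w[3] * x2 > 0, 1, -1) == y)
    if (acc == 1) break                         # stop once perfectly separated
  }
  w
}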


A Little Bit of History – Adaline
• Very similar to the perceptron; the only difference is that the error is calculated as (1/2)(Actual_i − φ_i)^2, i.e. the squared error.
• With the squared error, which is differentiable, we can now use the stochastic gradient descent method to update the weights:

    w_0 = w_0 + η (Actual_i − φ_i)
    w_1 = w_1 + η (Actual_i − φ_i) x_{1,i}
    w_2 = w_2 + η (Actual_i − φ_i) x_{2,i}

• We can also use the error over the entire dataset as the loss function (i.e. SSE): Σ_{i=1}^{N} (1/2)(Actual_i − φ_i)^2, and update the weights using gradient descent.
• When calculating prediction accuracy, we still classify based on whether φ_i is larger than zero or not, using the final weights learned from the data.
• If we use a logistic function as the activation, it becomes logistic regression.

* Figure from Sebastian Raschka, Python Machine Learning
Adaline R notebook: link
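Relative to the perceptron sketch above, the Adaline update only swaps the thresholded prediction for the raw linear output (same illustrative variable names):

# Inside the same loop as the perceptron sketch:
phi <- w[1] + w[2] * x1[i] + w[3] * x2[i]
w   <- w + eta * (y[i] - phi) * c(1, x1[i], x2[i])   # use phi, not the thresholded class
# Accuracy is still computed from sign(phi) with the final weights.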
Types of NNs
* Figure adapted from slides by Andrew Ng: Deep Learning Specialization

A direct extension of Adaline is to add more neurons in a layer, use non-linear activation functions, and add more intermediate layers to make it a feed forward neural network. Each neuron is fully connected to the neurons in the next layer.
FEED FORWARD
NEURAL NETWORK

Simple Feed Forward Neural Network (FFNN)
Concepts stated as early as the 1940s!

A network with 3 input features (x_1, x_2, x_3), one hidden layer of 4 neurons (h_1, …, h_4), and 2 outputs (y_1, y_2):

  h_j = f_1( b_{j0} + Σ_{i=1}^{3} b_{ji} x_i ),  j = 1, 2, 3, 4
  y_k = f_2( b_{k0} + Σ_{j=1}^{4} b_{kj} h_j ),  k = 1, 2

  y = f_2( f_1(x, b_1), b_2 )   (vector representation)

x_1, x_2, x_3: input feature vector for one observation
y_1, y_2: actual outcome for one observation
f_1(·), f_2(·): activation functions, usually non-linear
L(y, ŷ | b, x): loss function, where ŷ is the model forecast response and y is the actual observed response

Total number of parameters (i.e. the size of the NN): (3+1)*4 + (4+1)*2 = 26
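For illustration, a keras R sketch of this exact 3-4-2 structure (the activation choices here are placeholders, not part of the figure):

library(keras)

# 3 inputs -> 4 hidden neurons -> 2 outputs
model <- keras_model_sequential() %>%
  layer_dense(units = 4, activation = "relu", input_shape = c(3)) %>%   # f1
  layer_dense(units = 2, activation = "softmax")                        # f2

summary(model)   # reports (3+1)*4 + (4+1)*2 = 26 parameters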
Typical Loss Functions
❑ Two-class binary responses
  ✓ Binary cross-entropy:
      Σ_{i=1}^{N} [ −y_i log(p_i) − (1 − y_i) log(1 − p_i) ]
    where y_i is the actual value of 1 or 0, p_i is the predicted probability of being 1, and N is the total number of observations in the training data.

❑ Multiple-class responses
  ✓ Categorical cross-entropy for M classes:
      Σ_{i=1}^{N} Σ_{j=1}^{M} −y_{i,j} log(p_{i,j})
    where y_{i,j} is the actual value of 1 or 0 for class j, p_{i,j} is the predicted probability of being in class j, and N is the total number of observations in the training data.

❑ Continuous responses
  ✓ Mean square error
  ✓ Mean absolute error
  ✓ Mean absolute percentage error
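A tiny R illustration of the binary cross-entropy formula, with made-up numbers:

y <- c(1, 0, 1, 1)            # actual labels
p <- c(0.9, 0.2, 0.6, 0.99)   # predicted probabilities of being 1
sum(-y * log(p) - (1 - y) * log(1 - p))   # binary cross-entropy (keras reports the mean)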
From Slow Progress to Wide Adoption
1940s – 1980s, very slow progress due to:
❑ Computation hardware capacity limitations
❑ Limited number of observations with labeled results in datasets
❑ Lack of efficient algorithms to estimate the parameters in the model

1980s – 2010, a few applications in the real world due to:
❑ Moore's Law + GPUs
❑ Internet-generated large labeled datasets
❑ Efficient algorithms for optimization (i.e. SGD + backpropagation)
❑ Better activation functions (i.e. ReLU)

2010 – now, very broad application in various areas:
❑ Near-human-level image classification
❑ Near-human-level handwriting transcription
❑ Much improved speech recognition
❑ Much improved machine translation

Now we know that working neural network models usually contain many layers (i.e. the depth of the network is deep), and that to achieve near-human-level accuracy a deep neural network needs a huge training dataset (for example, millions of labeled pictures for an image classification problem).
Optimization Methods
❑ Mini-batch Stochastic Gradient Descent (SGD)
  ❑ Use a small segment of data (e.g. 128 or 256 observations) to update the model parameters:
      b = b − α ∇_b L(b, x^(m), y^(m))
    where α is the learning rate, a hyperparameter that can be changed, and m is the mini-batch.
  ❑ Gradients are efficiently calculated using the backpropagation method.
  ❑ When the entire dataset has been used once to update the parameters, it is called one epoch; multiple epochs are needed to reach convergence.
  ❑ An updated version with 'momentum' for quicker convergence:
      v = γ v_prev + α ∇_b L(b, x^(m), y^(m))
      b = b − v

❑ The optimal number of epochs is determined by when the model is not yet overfitted (i.e. validation accuracy reaches its best performance).
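A minimal R sketch of one momentum update, assuming a placeholder function grad_loss() that returns the mini-batch gradient computed by backpropagation (not a keras function):

# One mini-batch update with momentum (grad_loss() is a hypothetical placeholder)
sgd_momentum_step <- function(b, v, x_batch, y_batch, alpha = 0.01, gamma = 0.9) {
  g <- grad_loss(b, x_batch, y_batch)   # gradient on the mini-batch
  v <- gamma * v + alpha * g            # accumulate velocity
  b <- b - v                            # update parameters
  list(b = b, v = v)
}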
More Optimization Methods
• With Adaptive Learning Rates
– Adagrad: Scale learning rate inversely proportional
to the square root of the sum of all historical squared
values of the gradient (stored for every weight)
– RMSProp: Exponentially weighted moving average to
accumulate gradients
– AdaDelta: Uses sliding window to accumulate
gradients
– Adam (adaptive moments): momentum integrated,
bias correction in decay
• A good summary: http://ruder.io/optimizing-gradient-descent/
Activation Functions
  y = f_n( f_{n−1}( f_{n−2}( … f_1(x, b_1) … ), b_{n−1} ), b_n )

  Gradient (chain rule):  ∂y/∂b_1 = ∂y/∂f_{n−1} · ∂f_{n−1}/∂f_{n−2} · … · ∂f_1/∂b_1

❑ Intermediate layers
  ❑ ReLU (i.e. rectified linear unit) is usually a good choice, with the following good properties: (1) fast computation; (2) non-linear; (3) reduced likelihood of the gradient vanishing; (4) unconstrained response
  ❑ Sigmoid, studied in the past, is not as good as ReLU in deep learning, due to the vanishing gradient problem when there are many layers

❑ Last layer, which connects to the output
  ❑ Binary classification: sigmoid, with binary cross-entropy as the loss function
  ❑ Multiple-class, single-label classification: softmax, with categorical cross-entropy as the loss function
  ❑ Continuous responses: identity function (i.e. y = x)

* Graphs from wiki: link
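In keras R, these last-layer/loss pairings look as follows (a sketch; the layer sizes and the 20-feature input are arbitrary):

library(keras)

# Binary classification: sigmoid last layer + binary cross-entropy
model_bin <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(20)) %>%
  layer_dense(units = 1, activation = "sigmoid")
model_bin %>% compile(loss = "binary_crossentropy", optimizer = "rmsprop", metrics = c("accuracy"))

# Multiple-class, single-label: layer_dense(units = 10, activation = "softmax")
#   with loss = "categorical_crossentropy"
# Continuous response: layer_dense(units = 1) (identity/linear activation)
#   with loss = "mse"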


Deal With Overfitting
❑ With a huge number of parameters, even with a large amount of training data, there is a potential for overfitting
  ❑ Overfitting due to the size of the NN (i.e. the total number of parameters)
  ❑ Overfitting due to using the training data for too many epochs

❑ Solutions for overfitting due to NN size
  ❑ Dropout: randomly drop some proportion (such as 0.3 or 0.4) of the nodes at each layer, which is similar in spirit to the random forest concept
  ❑ Use L1 or L2 regularization on the weights at each layer

❑ Solutions for overfitting due to using too many epochs
  ❑ Run the NN with a large number of epochs to reach the overfitting region, and inspect the training/validation vs. epoch curve via cross validation
  ❑ Choose the number of epochs where the validation performance is best as the final NN model (i.e. early stopping)
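In keras R, both remedies are short; a sketch with arbitrary layer sizes, dropout rate and penalty:

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784),
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%   # L2 penalty on the weights
  layer_dropout(rate = 0.4) %>%                                     # randomly drop 40% of the nodes
  layer_dense(units = 10, activation = "softmax")

# Early stopping can be added when fitting, e.g.
#   callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 3))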
Recap of A Few Key Concepts
❑ Data: requires a large, well-labeled dataset
❑ Computation: intensive matrix-matrix operations
❑ Structure of a fully connected feedforward NN
  ❑ Size of the NN: total number of parameters
  ❑ Depth: total number of layers (this is where "deep" learning comes from)
  ❑ Width of a particular layer: number of nodes (i.e. neurons) in that layer
❑ Activation function
  ❑ Intermediate layers
  ❑ Last layer connecting to the outputs
❑ Loss function
  ❑ Classification (i.e. categorical response)
  ❑ Regression (i.e. continuous response)
❑ Optimization methods (SGD)
  ❑ Batch size
  ❑ Learning rate
  ❑ Epochs
❑ Dealing with overfitting
  ❑ Dropout
  ❑ Regularization (L1 or L2)
DEEP LEARNING
ACROSS PLATFORMS

Luckily, there is no need to write our own functions from scratch! Thousands of dedicated developers, engineers and scientists are working on open source deep learning frameworks.
Open NN Exchange Format (ONNX)

https://onnx.ai/index.html
THE MNIST DATASET
MNIST Dataset
❑ Originally created by NIST, then modified for machine learning training purposes
❑ Contains 70,000 handwritten digit images and the label of the digit in each image, where 60,000 images form the training dataset and the remaining 10,000 images form the testing dataset
❑ Census Bureau employees and American high school students wrote these digits
❑ Each image is 28x28 pixels in greyscale
❑ Yann LeCun used the convolutional network LeNet to achieve a < 1% error rate in the 1990s
❑ More details: https://en.wikipedia.org/wiki/MNIST_database
USING KERAS R PACKAGE
TO BUILD FEED FORWARD
NEURAL NETWORK MODEL
Procedures
❑ Data preprocessing (from image to a list of input features)
  ❑ One image of a 28x28 grey scale value matrix → 784 columns of features
  ❑ Scale the values to between 0 and 1 by dividing each value by 255
  ❑ Make the response categorical (i.e. 10 columns, with the corresponding digit column set to 1 and the rest set to zero)

❑ Load the keras package and build a neural network with a few layers
  ❑ Define a placeholder object for the NN structure
  ❑ 1st layer: 256 nodes, fully connected, using the 'relu' activation function and connected to the 784 input features
  ❑ 2nd layer: 128 nodes, fully connected, using the 'relu' activation function
  ❑ 3rd layer: 64 nodes, fully connected, using the 'relu' activation function
  ❑ 4th layer: 10 nodes, fully connected, using the 'softmax' activation function and connected to the 10 output columns
  ❑ Add dropout to the first three layers to prevent overfitting

❑ Compile the NN model: define the loss function, optimizer, and metrics to follow

❑ Fit the NN model using the training dataset: define the epochs, mini-batch size, and validation split used in the training where the metrics will be checked

❑ Predict using the fitted NN model on the testing dataset


R Scripts (shown in the notebook)

• Define the NN structure
• Compile, and define the loss function, optimizer and metrics to monitor during training
• Fit the model using the training dataset; define epochs, batch size and validation data
• Predict new outcomes using the trained model
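The slide shows these scripts only as screenshots; below is a hedged reconstruction of the described steps in keras R. The dropout rates, number of epochs and batch size are illustrative guesses, not necessarily the presenter's values.

library(keras)

# Data preprocessing: 28x28 images -> 784 features scaled to [0, 1]; labels -> 10 dummy columns
mnist   <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), 784)) / 255
x_test  <- array_reshape(mnist$test$x,  c(nrow(mnist$test$x),  784)) / 255
y_train <- to_categorical(mnist$train$y, 10)
y_test  <- to_categorical(mnist$test$y, 10)

# Define the NN structure: 256 -> 128 -> 64 -> 10, with dropout on the first three layers
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

# Compile: loss function, optimizer and metrics to monitor
model %>% compile(loss = "categorical_crossentropy",
                  optimizer = optimizer_rmsprop(),
                  metrics = c("accuracy"))

# Fit: epochs, mini-batch size and validation split
history <- model %>% fit(x_train, y_train,
                         epochs = 20, batch_size = 128, validation_split = 0.2)

# Predict on the testing dataset and compute accuracy
pred_prob  <- model %>% predict(x_test)
pred_class <- max.col(pred_prob) - 1          # column index -> digit 0-9
mean(pred_class == mnist$test$y)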


Size of the Model
Performance: about 98% accuracy without much fine tuning

A few misclassified images:


Cross-Validation Curve
FFNN HANDS-ON SESSION

Databricks Community Edition: link


FFNN Notebook: link
CONVOLUTIONAL
NEURAL NETWORK
Types of NNs
* Figure adapted from slides by Andrew Ng: Deep Learning Specialization

CNNs handle 2D, 3D or higher dimensional data.

There are many different CNN structures and applications. In this course, we focus on the application of image classification of the same handwritten digits as in the FFNN part.
Image as Numbers

Grey scale image: 1 channel; each pixel is an integer between 0 and 255 (e.g. a 4x4 image is a 4x4 matrix of integers).

Color picture: 3 channels (R, G, B); each pixel is three integers between 0 and 255 (e.g. a 4x4 color image is a 4x4x3 array of integers).

There are more ways to describe pictures.


Concept of Convolution
An Input (5x5) convolved with a Kernel/Filter (3x3) gives an Output (3x3), with stride 1 (move 1 pixel at a time).

For an n×n input picture, an f×f kernel, and stride s, the output after convolution has size:

  ( (n − f)/s + 1 ) × ( (n − f)/s + 1 )

* Animated graph from the UFLDL Tutorial: link


Extend to Multiple Channels and Filters
For a 6×6×3 input RGB picture, we need a 3×3×3 kernel/filter. We apply the same procedure to each 6×6 channel with its 3×3 filter slice and get 3 intermediate 4×4 matrices. The final output of the convolution for one kernel has size 4×4, obtained by summing the three intermediate matrices element-wise.

We can apply multiple filters to the same input; in the graph below, there are two sets of 3×3×3 kernels. The final output is two 4×4 matrices (i.e. a tensor).

* Figure adapted from slides by Andrew Ng: Deep Learning Specialization


Padding
Convolution reduces the size of the matrix after each convolution layer. Sometimes we want to keep the output dimension the same as the input dimension. This can be done by padding (i.e. putting zeroes at the edges of the input matrix).

For an n×n input picture, an f×f kernel, stride s, and padding p, the output after convolution has size:

  ( (n + 2p − f)/s + 1 ) × ( (n + 2p − f)/s + 1 )
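A tiny R helper for the two size formulas (the function name is made up), with worked examples:

# Output size of a convolution: n x n input, f x f kernel, stride s, padding p
conv_out_size <- function(n, f, s = 1, p = 0) floor((n + 2 * p - f) / s) + 1

conv_out_size(5, 3)                  # 3: the 5x5 input, 3x3 kernel, stride 1 example
conv_out_size(28, 3, s = 1, p = 1)   # 28: "same" padding keeps the dimension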
Estimation Process
Similar to FFNN, we define the number of convolution layers and the size & number of kernels to use at each layer to construct a CNN. The CNN estimation process finds a set of values for all the kernels that minimizes the loss function, using mini-batch gradient descent. For the MNIST dataset, the loss function is the same as the one we defined in the FFNN example. Due to the time limit, we will not go into the details of the estimation process through back-propagation. More details can be found: link

In the picture below, a 6x6 input is convolved with one 3x3 kernel of unknown parameters w_1, …, w_9 to produce a 4x4 output.


Flatten Layer
If we only have convolution layers, the output is a tensor. How can we connect a tensor to the output of 10 classes in the MNIST example? We need to flatten the tensor into a vector. The graph below shows a flatten operation converting a 4x4x3 tensor into a vector of 48 elements. A flatten layer has no parameters to estimate; it only changes the shape.

Once it is flattened, we can add a few fully connected layers before finally connecting to the output layer of 10 classes for the MNIST example.
Pooling Layer
A pooling layer is one way to reduce the dimension of the matrices: each f×f rectangular area (moved with stride s) is replaced by one single number, the maximum or the average of those values.

Example with a 4x4 input, f = 2 and s = 2:

Input (4x4):
  34   3  21   7
  55  12  78 124
  54   7  76  66
  21  15  23 210

Max pooling output (2x2):        Average pooling output (2x2):
  55  124                          26.00  57.50
  54  210                          24.25  93.75
CNN Example
With convolution layers, pooling layers, a flatten layer, and fully connected layers, we can create a functional CNN structure to connect inputs and outputs for image-related applications.

LeNet-5 Architecture

LeCun et al., 1998. Gradient-based learning applied to document recognition. Link, Link

More modern CNN structures can be found: link


Create and Train CNN Model in Keras
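The keras code appears on the slide only as a screenshot; here is a hedged reconstruction of a small CNN for MNIST with convolution, pooling, flatten and dense layers. The filter counts, layer sizes and epochs are illustrative choices, not necessarily the presenter's.

library(keras)

# MNIST images reshaped to 28x28x1 tensors and scaled to [0, 1]
mnist   <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), 28, 28, 1)) / 255
y_train <- to_categorical(mnist$train$y, 10)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%      # convolution layer
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%      # pooling layer
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%                                # flatten layer: tensor -> vector
  layer_dense(units = 64, activation = "relu") %>%   # fully connected layer
  layer_dense(units = 10, activation = "softmax")    # output layer: 10 classes

model %>% compile(loss = "categorical_crossentropy",
                  optimizer = optimizer_rmsprop(), metrics = c("accuracy"))
model %>% fit(x_train, y_train, epochs = 5, batch_size = 128, validation_split = 0.2)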
CNN HANDS-ON SESSION

CNN Notebook: link


RECURRENT
NEURAL NETWORK
Types of NNs
* Figures adapted from slides by Andrew Ng: Deep Learning Specialization

Sequence is key!

There are many different RNN structures and applications. In this course, we focus on one type of application called the many-to-one problem (i.e. the input is a sequence of words and the output is just one variable).
Problem Description & IMDB dataset

❑ Raw data: 50,000 movie review texts (X) and their corresponding sentiment of positive (1, 50%) or negative (0, 50%) (Y).
❑ Included in the keras package; can be easily loaded and processed.
❑ Task: train a model such that we can use the review text as input to predict whether its sentiment is positive or negative (i.e. a binary classification problem).
❑ It is one example of a many-to-one RNN application.
❑ The raw data can be found here. Sample data in its text format:

Input X (Review) → Output Y (Sentiment)

"Very smart, sometimes shocking, I just love it. It shoved one more side of David's brilliant talent. He impressed me greatly! David is the best. The movie captivates your attention for every second." → Positive (i.e. 1)

"I and a friend rented this movie. We both found the movie soundtrack and production techniques to be lagging. The movie's plot appeared to drag on throughout with little surprise in the ending. We both agreed that the movie could have been compressed into roughly an hour giving it more suspense and moving plot." → Negative (i.e. 0)
Analyzing Text – Tokenization
Algorithms cannot deal with raw text, so we have to convert text into numbers for machine learning methods.

Raw Text:
  "This movie is great !"
  "Great movie ? Are you kidding me ! Not worth the money."
  "Love it"
  …

Tokenize:
  [23, 55, 5, 78, 9]
  [78, 55, 8, 17, 12, 234, 33, 9, 14, 78, 32, 77, 4]
  [65, 36]
  …
Suppose we only have 250 unique words and punctuation marks in the entire dataset; each unique word is replaced with an integer between 1 and 250, and 0 is used for padding.

Padding & Truncating (to length 10):
  [23, 55, 5, 78, 9, 0, 0, 0, 0, 0]
  [78, 55, 8, 17, 12, 234, 33, 9, 14, 78]
  [65, 36, 0, 0, 0, 0, 0, 0, 0, 0]
  …

Now we have a typical data frame: each row is an observation and each column is a feature. Here we have 10 columns by design after the padding and truncating stage. We have converted raw text into categorical integers.
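A sketch of these tokenize / pad / truncate steps with the keras R text utilities, reusing the toy sentences and the vocabulary size of 250 from above (the learned integer indices will not match the toy numbers):

library(keras)

texts <- c("This movie is great !",
           "Great movie ? Are you kidding me ! Not worth the money.",
           "Love it")

# Tokenize: map up to 250 unique words to integer indices
tokenizer <- text_tokenizer(num_words = 250)
tokenizer %>% fit_text_tokenizer(texts)
seqs <- texts_to_sequences(tokenizer, texts)

# Pad with 0 / truncate so every review has exactly 10 columns
x <- pad_sequences(seqs, maxlen = 10, padding = "post", truncating = "post")
x   # one row per review, 10 integer features per row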
Analyzing Text – Encoding/Embedding
Categorical integers cannot be used directly in an algorithm as there is no mathematical relationship among these categories. We have to use either encoding or embedding.

One Hot Encoding (OHE): each integer column is expanded into 250 dummy columns, e.g. 23 → [0, 0, 0, …, 1, 0, 0, …, 0, 0]. The original data frame of 10 columns becomes 250x10 columns! OHE explodes the dimension if we have 10,000 unique words in the vocabulary and a few hundred words in each review of the training dataset. It is binary, sparse, and does not consider the relationship among words. OHE is generally not used in text-related models.

Dense Embedding: each integer is mapped to a dense vector of a configurable dimension, e.g. 23 → [0.3, 0.9, 0.1, 0.2]. Dense embedding utilizes the inherent relationships among words and dramatically reduces the embedded dimension. The dimension of the embedding vector space can be configured, such as 4 for this example (word2vec uses 300). It is low dimensional compared with OHE, real-valued, meaningful, and can be learned on specific data or taken from pre-trained embeddings.
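In keras R, the dense embedding is simply a trainable layer; a sketch matching the toy setup above (250 words plus the padding index 0, sequences of length 10, embedding dimension 4):

library(keras)

# Each integer index in 0..250 is mapped to a trainable 4-dimensional vector,
# so a length-10 review becomes a 10 x 4 matrix of real numbers.
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 251, output_dim = 4, input_length = 10)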
Analyzing Text – Pre-Trained Embeddings
• word2vec embedding: word2vec was first introduced in 2013 and was trained on a large collection of text in an unsupervised fashion. After training, each word is represented by a fixed-length vector (for example, a vector with 300 elements). We can quantify the relationship between two words using a cosine similarity measure. To train the embedding vectors, the continuous bag-of-words or continuous skip-gram method is used. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. There are pretrained word2vec embeddings based on large amounts of text (such as wiki pages, news reports, etc.) for general applications.
• More detail: https://code.google.com/archive/p/word2vec/

• fastText embedding: word2vec uses word-level information; however, words are built from meaningful components. fastText uses subword and morphological information to extend the skip-gram model of word2vec. With fastText, new words that did not appear in the training data can be represented well. It also supports 150+ different languages.
• More detail: https://fasttext.cc/

• BERT embedding: Bidirectional Encoder Representations from Transformers which uses


contextual information of text in a bi-directional manner.
• More detail: link
Analyzing Text – RNN Modeling
With an embedding (pre-trained or to be trained), we now have a typical data frame for model training: a many-to-one RNN structure where the sequence of words is the input and the RNN layer output is connected through a fully-connected layer to the binary output.

                 Intermediate outputs                                                          Final RNN output
                 (can be used for the next RNN layer)                                          (to the FFNN layer)

RNN Output    [0.2,0.3,0.3,0.6]  [0.8,0.2,0.3,0.7]  [0.6,0.2,0.7,0.4]  [0.7,0.3,0.8,0.2]  [0.4,0.3,0.7,0.2]
RNN Layer
Embedding     [0.2,0.4,0.1,0.7]  [0.7,0.1,0.5,0.4]  [0.4,0.2,0.9,0.3]  [0.6,0.1,0.8,0.4]  [0.3,0.2,0.9,0.0]
Raw Text      This               movie              is                 great              !
Time Stamp    t=0                t=1                t=2                t=3                t=4

(The embedding output at each time stamp is the input feature to the RNN.)
We will skip the theory part of RNN and go directly to the implementation.
IMDB dataset

❑ Raw data: 50,000 movie review texts (X) and their corresponding sentiment of positive (1, 50%) or negative (0, 50%) (Y).

❑ Included in the keras R package; can be easily loaded and preprocessed with built-in R functions.

❑ Preprocessing includes:
  o Set the size of the vocabulary (i.e. the N most frequently occurring words)
  o Set the length of the review by padding with '0' (by default) or truncating, as all reviews must have the same length for modeling
  o Any words not in the chosen vocabulary are replaced by '2' by default
  o Words are indexed by overall frequency in the chosen vocabulary

❑ Once the dataset is preprocessed, we can apply an embedding and then feed the data to the RNN layer.
R Code – Data Preprocessing
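The preprocessing code is shown on the slide as a screenshot; below is a hedged reconstruction of the described steps (the vocabulary size and review length are illustrative choices):

library(keras)

max_words <- 10000   # size of the vocabulary (N most frequent words)
max_len   <- 100     # pad/truncate every review to 100 words

imdb <- dataset_imdb(num_words = max_words)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

x_train <- pad_sequences(x_train, maxlen = max_len)   # pads with 0 by default
x_test  <- pad_sequences(x_test,  maxlen = max_len)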
R Code – RNN Modeling
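Likewise, a hedged reconstruction of the RNN modeling screenshot, continuing from the preprocessing sketch above: an embedding layer, a simple RNN layer, and a sigmoid output for the binary sentiment (unit counts are illustrative).

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = 32, input_length = max_len) %>%
  layer_simple_rnn(units = 32) %>%                # many-to-one: only the final state is kept
  layer_dense(units = 1, activation = "sigmoid")  # binary sentiment output

model %>% compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = c("accuracy"))

history <- model %>% fit(x_train, y_train, epochs = 10, batch_size = 128, validation_split = 0.2)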
RNN Extension – LSTM
A simple RNN layer is a good starting point, but its performance is usually not that good, because long-term dependencies are practically impossible to learn due to the vanishing gradient problem in the optimization process.

The LSTM (Long Short-Term Memory) RNN model introduces a "carry" mechanism such that useful information from earlier words can be carried over to later words, like a conveyor belt, without suffering from the vanishing gradient problem we see in the simple RNN case.

[Figure: the same "This movie is great !" example as before, with only the final LSTM output (e.g. [0.4, 0.3, 0.7, 0.2]) connected to the FFNN layer.]

A great tutorial about LSTM can be found at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Implementation: layer_simple_rnn() → layer_lstm()
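Per the implementation note, switching to an LSTM is a one-line change in the sketch above (the unit count stays illustrative):

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = 32, input_length = max_len) %>%
  layer_lstm(units = 32) %>%                      # was layer_simple_rnn(units = 32)
  layer_dense(units = 1, activation = "sigmoid")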
SIMPLE RNN / LSTM
HANDS-ON SESSION

RNN/LSTM Notebook: link


The End!
