NISS Deep Learning Tutorial
Ming Li
Amazon & U. of Washington
Statistics + Engineering (Big Data & Software) = Data Science (*statistical engineering)

A statistician focuses relatively more on modeling (i.e., the science), while a data scientist focuses mainly on the business problem and result (i.e., the engineering).
Deep learning models:
• Standard feedforward neural network for typical data frames
• Convolutional neural network for image data
• Word embedding and Recurrent neural network for text data
DEEP LEARNING
A BRIEF INTRODUCTION
A direct extension of Adaline is to add more neurons in a layer, use non-linear activation functions, and add more intermediate layers to make it a feedforward neural network.

[Figure: a feedforward neural network connecting input features x through intermediate layers to the output]
• Multiple-class responses
  o Categorical cross-entropy for $M$ classes:

$$-\sum_{i=1}^{N}\sum_{j=1}^{M} y_{i,j}\,\log p_{i,j}$$

where $y_{i,j}$ is the actual value of 1 or 0 indicating whether observation $i$ belongs to class $j$, $p_{i,j}$ is the predicted probability of observation $i$ being in class $j$, and $N$ is the total number of observations in the training data (a small hand computation follows after this list).
• Continuous responses
  o Mean square error
  o Mean absolute error
  o Mean absolute percentage error
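To make the categorical cross-entropy concrete, here is a hand computation in R for a single observation; the numbers are purely illustrative.

# Categorical cross-entropy for one observation (M = 3 classes, illustrative)
y <- c(0, 1, 0)        # actual: the observation belongs to class 2
p <- c(0.2, 0.7, 0.1)  # predicted class probabilities
-sum(y * log(p))       # = -log(0.7), approximately 0.357

Summing this quantity over all $N$ training observations gives the loss in the formula above.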
From Slow Progress to Wide Adoption
1940s – 1980s, very slow progress due to:
• Limited computation hardware capacity
• Limited number of observations with labeled results in the dataset
• Lack of efficient algorithms to estimate the parameters in the model

Now we know that working neural network models usually contain many layers (i.e., the depth of the network is deep), and that to achieve near-human-level accuracy a deep neural network needs a huge training dataset (for example, millions of labeled pictures for an image classification problem).
Optimization Methods
• Mini-batch Stochastic Gradient Descent (SGD)
  o Use a small segment of data (e.g., 128 or 256 observations) to update the model parameters: $b = b - \alpha \nabla_b L(b; x^{(m)}, y^{(m)})$, where $\alpha$ is the learning rate, a hyperparameter that can be changed, and $m$ indexes the mini-batch.
  o Gradients are efficiently calculated using the backpropagation method.
  o When the entire dataset has been used to update the parameters, one epoch is complete; multiple epochs are needed to reach convergence.
  o An updated version with 'momentum' for quicker convergence (sketched below):

$$v = \gamma v_{prev} + \alpha \nabla_b L(b; x^{(m)}, y^{(m)}), \qquad b = b - v$$
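A minimal R sketch of mini-batch SGD with momentum, assuming a user-supplied gradient function grad_fn(b, x, y) (a hypothetical helper, not from the tutorial); the learning rate, momentum, and batch size defaults are illustrative.

sgd_momentum <- function(b, x, y, grad_fn,
                         lr = 0.01, gamma = 0.9,
                         batch_size = 128, epochs = 10) {
  v <- rep(0, length(b))                  # velocity term, initialized at zero
  n <- nrow(x)
  for (epoch in seq_len(epochs)) {
    idx <- sample(n)                      # shuffle observations each epoch
    for (start in seq(1, n, by = batch_size)) {
      batch <- idx[start:min(start + batch_size - 1, n)]
      g <- grad_fn(b, x[batch, , drop = FALSE], y[batch])
      v <- gamma * v + lr * g             # momentum update: v = gamma * v_prev + alpha * gradient
      b <- b - v                          # parameter update: b = b - v
    }
  }
  b
}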
• Intermediate layers
  o ReLU (rectified linear unit) is usually a good choice, with the following good properties: (1) fast computation; (2) non-linearity; (3) reduced likelihood of the gradient vanishing; (4) unconstrained response
  o Sigmoid, studied in the past, is not as good as ReLU in deep learning, due to the gradient vanishing problem when there are many layers
• Loss function
  o Classification (i.e., categorical response)
  o Regression (i.e., continuous response)
• Dealing with overfitting
  o Dropout
  o Regularization (L1 or L2)
DEEP LEARNING
ACROSS PLATFORMS
Luckily, there is no need to write our own functions from scratch! Thousands of dedicated developers, engineers, and scientists are working on open-source deep learning frameworks.
Open Neural Network Exchange (ONNX)
https://onnx.ai/index.html
THE MNIST DATASET
MNIST Dataset
• Originally created by NIST, then modified for machine learning training purposes
• Contains 70,000 handwritten digit images, each labeled with the digit in the image; 60,000 images form the training dataset and the remaining 10,000 the testing dataset
• Census Bureau employees and American high school students wrote these digits
• Each image is 28x28 pixels in greyscale
• Yann LeCun used the convolutional network LeNet to achieve a < 1% error rate in the 1990s
• More details: https://en.wikipedia.org/wiki/MNIST_database
USING KERAS R PACKAGE TO BUILD FEED FORWARD NEURAL NETWORK MODEL
Procedures
• Data preprocessing (from image to a list of input features)
  o One image of a 28x28 greyscale value matrix → 784 columns of features
  o Scale the values to between 0 and 1 by dividing each value by 255
  o Make the response categorical (i.e., 10 columns, with the corresponding digit column set to 1 and the rest set to 0)
• Load the keras package and build a neural network with a few layers
  o Define a placeholder object for the NN structure
  o 1st layer: 256 nodes, fully connected, with the 'relu' activation function, connected from the 784 input features
  o 2nd layer: 128 nodes, fully connected, with the 'relu' activation function
  o 3rd layer: 64 nodes, fully connected, with the 'relu' activation function
  o 4th layer: 10 nodes, fully connected, with the 'softmax' activation function, connected to the 10 output columns
  o Add dropout to the first three layers to prevent overfitting
• Compile the NN model, defining the loss function, the optimizer, and the metrics to follow
• Fit the NN model using the training dataset, defining the number of epochs, the mini-batch size, and the validation split used in training where the metrics will be checked (see the sketch below)
Define NN Structure
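A minimal sketch of the above procedure with the keras R package; the layer sizes follow the bullets above, while the dropout rates, optimizer, epoch count, batch size, and validation split are illustrative choices rather than the tutorial's exact settings.

library(keras)

# Data preprocessing: flatten each 28x28 image to 784 features, scale to [0, 1]
mnist <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), 784)) / 255
x_test  <- array_reshape(mnist$test$x,  c(nrow(mnist$test$x),  784)) / 255
y_train <- to_categorical(mnist$train$y, 10)  # one-hot categorical response
y_test  <- to_categorical(mnist$test$y, 10)

# Define the NN structure: 256-128-64 'relu' layers with dropout, 'softmax' output
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%   # dropout rates are illustrative
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

# Compile: define loss function, optimizer, and metrics to follow
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = c("accuracy")
)

# Fit: epochs, mini-batch size, and validation split checked during training
history <- model %>% fit(
  x_train, y_train,
  epochs = 30, batch_size = 128,
  validation_split = 0.2
)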
2D, 3D or higher dimension data
We can apply multiple channels to the same input. In the graph below, there are two sets of 3×3×3 kernels, and the final output will be two 4×4 matrices (i.e., a tensor).
[Figure: a zero-padded input matrix convolved (×) with each kernel to produce (=) a 4×4 output matrix]
The picture below shows one 3x3 kernel with unknown parameters $w_1, \ldots, w_9$:

$$\times \begin{bmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ w_7 & w_8 & w_9 \end{bmatrix} =$$
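To make the sliding-window arithmetic concrete, here is a toy stride-1 "valid" 2D convolution in R; the conv2d helper and the random input are illustrative, not from the tutorial.

# Toy 2D convolution: slide an f x f kernel over the input,
# summing elementwise products at each position
conv2d <- function(x, w) {
  f <- nrow(w)
  out <- matrix(0, nrow(x) - f + 1, ncol(x) - f + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      out[i, j] <- sum(x[i:(i + f - 1), j:(j + f - 1)] * w)
    }
  }
  out
}

x <- matrix(rnorm(36), 6, 6)  # toy 6x6 input
w <- matrix(rnorm(9), 3, 3)   # 3x3 kernel holding the unknown w1, ..., w9
conv2d(x, w)                  # 4x4 output feature map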
Once it is flattened, we can add a few fully connected layers before finally connecting to the output layer of 10 classes for the MNIST example.
Input (4x4):

34  3  21   7
55 12  78 124
54  7  76  66
21 15  23 210

Max pooling, $f = 2$, $s = 2$ → Output (2x2):

55 124
54 210

Average pooling, $f = 2$, $s = 2$ → Output (2x2):

26    57.5
24.25 93.75
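A small R sketch to reproduce the two pooled outputs above; the pool helper is illustrative.

# 2x2 pooling with window f = 2 and stride s = 2 over the 4x4 input
x <- matrix(c(34,  3, 21,   7,
              55, 12, 78, 124,
              54,  7, 76,  66,
              21, 15, 23, 210), nrow = 4, byrow = TRUE)

pool <- function(x, f = 2, s = 2, agg = max) {
  out <- matrix(0, nrow(x) %/% s, ncol(x) %/% s)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      rows <- ((i - 1) * s + 1):((i - 1) * s + f)
      cols <- ((j - 1) * s + 1):((j - 1) * s + f)
      out[i, j] <- agg(x[rows, cols])   # max or mean of each 2x2 window
    }
  }
  out
}

pool(x, agg = max)   # 55 124 / 54 210
pool(x, agg = mean)  # 26 57.5 / 24.25 93.75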
CNN Example
With convolution layers, pooling layers, a flatten layer, and fully connected layers, we can create a functional CNN structure connecting inputs and outputs for image-related applications, as sketched below.
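A minimal keras R sketch of such a structure for 28x28 greyscale images; the filter counts, kernel sizes, and dense-layer width are illustrative choices, not a reproduction of the tutorial's model.

library(keras)

# Convolution -> pooling -> convolution -> pooling -> flatten -> dense -> softmax
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%                                 # to a vector of features
  layer_dense(units = 128, activation = "relu") %>%   # fully connected layer
  layer_dense(units = 10, activation = "softmax")     # 10 output classes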
LeNet-5 Architecture
LeCun et al., 1998. Gradient-based learning applied to document recognition.
Sequence is key!
• Raw data: 50,000 movie review texts (X) and their corresponding sentiments, positive (1, 50%) or negative (0, 50%) (Y).
• Included in the keras package; can be easily loaded and processed.
• Task: train a model such that we can use review text as input to predict whether its sentiment is positive or negative (i.e., a binary classification problem).
• It is one example of a many-to-one RNN application.
• The raw data can be found here. Sample data in its text format, labeled Negative (i.e., 0):

I and a friend rented this movie. We both found the movie soundtrack and production techniques to be lagging. The movie's plot appeared to drag on throughout with little surprise in the ending. We both agreed that the movie could have been compressed into roughly an hour, giving it more suspense and a moving plot.
Analyzing Text - Tokenization
Algorithms cannot deal with raw text, so we have to convert text into numbers for machine learning methods.

Now we have a typical data frame: each row is an observation, and each column is a feature. Here we have 10 columns by design after the padding and truncating stage, and we have converted raw text into categorical integers. A small sketch follows below.
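A minimal sketch of this tokenize-then-pad step using the keras R text tokenizer; the two example sentences and the vocabulary size are illustrative.

library(keras)

# Fit a tokenizer on raw text, convert to integer sequences, pad to 10 columns
texts <- c("I and a friend rented this movie",
           "We both found the movie soundtrack lagging")
tok <- text_tokenizer(num_words = 1000)  # vocabulary size (illustrative)
tok %>% fit_text_tokenizer(texts)
seqs <- texts_to_sequences(tok, texts)
pad_sequences(seqs, maxlen = 10)  # one row per review, 10 integer features each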
Analyzing Text – Encoding/Embedding
Categorical integers cannot be fed directly into an algorithm, as there is no mathematical relationship among these categories. We have to use either encoding or embedding. In the example below, each tokenized review (10 columns of categorical integers) maps to a dense embedding vector:

[23, 55, 5, 78, 9, 0, 0, 0, 0, 0] → [0.3, 0.9, 0.1, 0.2, …, 0.7, 0.8]
[78, 55, 8, 17, 12, 234, 33, 9, 14, 78] → [0.2, 0.7, 0.3, 0.7, …, 0.4, 0.3]
[65, 36, 0, 0, 0, 0, 0, 0, 0, 0] → [0.5, 0.8, 0.4, 0.6, …, 0.5, 0.9]

Compared with one-hot encoding (OHE), an embedding is low-dimensional, real-valued, and meaningful, and it can be learned on specific data or taken from pre-trained embeddings.
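Under the hood, an embedding is just a lookup table: a matrix whose row i holds the dense vector for token i. A toy R sketch, with randomly initialized values standing in for learned ones (all sizes illustrative):

vocab_size <- 300  # illustrative vocabulary size
embed_dim  <- 8    # illustrative embedding dimension
E <- matrix(runif(vocab_size * embed_dim), nrow = vocab_size)

tokens <- c(23, 55, 5, 78, 9)  # one tokenized (unpadded) review
E[tokens, ]                    # 5 x 8 matrix: one dense vector per token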
Analyzing Text – Pre-Trained Embeddings
• word2vec embedding: word2vec was first introduced in 2013 and was trained on a large collection of text in an unsupervised fashion. After training, each word is represented by a fixed-length vector (for example, a vector with 300 elements). We can quantify the relationship between two words using the cosine similarity measure. To train the embedding vectors, the continuous bag-of-words or continuous skip-gram method is used. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. There are pretrained word2vec embeddings based on large amounts of text (such as wiki pages, news reports, etc.) for general applications.
• More detail: https://code.google.com/archive/p/word2vec/
• fastText embedding: word2vec uses word-level information; however, words are built from meaningful components. fastText uses subword and morphological information to extend the skip-gram model in word2vec. With fastText, new words that did not appear in the training data can still be represented well. It also supports 150+ different languages.
• More detail: https://fasttext.cc/
RNN Layer
• Raw data: the 50,000 movie review texts (X) with positive (1, 50%) or negative (0, 50%) sentiment labels (Y) introduced earlier.
• Included in the keras R package; can be easily loaded and preprocessed with built-in R functions.
• Preprocessing includes:
  o Set the size of the vocabulary (i.e., the N most frequently occurring words)
  o Set the length of each review by padding (with '0' by default) or truncating, since all reviews must have the same length for modeling
  o Any word not in the chosen vocabulary is replaced by '2' by default
  o Words are indexed by overall frequency in the chosen vocabulary
• Once the dataset is preprocessed, we can apply embedding and then feed the data to the RNN layer.
R Code – Data Preprocessing
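The slide's code screenshot is not reproduced here; below is a minimal sketch of the preprocessing described above, where the vocabulary size of 10,000 and review length of 100 are illustrative choices.

library(keras)

vocab_size <- 10000  # keep the 10,000 most frequent words
maxlen     <- 100    # pad or truncate every review to 100 integers

# Load the IMDB reviews, already indexed by word frequency
imdb <- dataset_imdb(num_words = vocab_size)
x_train <- pad_sequences(imdb$train$x, maxlen = maxlen)
x_test  <- pad_sequences(imdb$test$x,  maxlen = maxlen)
y_train <- imdb$train$y
y_test  <- imdb$test$y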
R Code – RNN Modeling
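Again a minimal sketch rather than the slide's exact code, continuing from the preprocessing above; the embedding dimension, RNN units, optimizer, epochs, and batch size are illustrative.

# Embedding layer -> simple RNN layer -> sigmoid output for binary sentiment
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = 32,
                  input_length = maxlen) %>%
  layer_simple_rnn(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",  # binary classification: positive vs. negative
  metrics = c("accuracy")
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 10, batch_size = 128,
  validation_split = 0.2
)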
RNN Extension – LSTM
A simple RNN layer is a good starting point, but its performance is usually not that good, because long-term dependencies are practically impossible to learn due to the vanishing gradient problem in the optimization process.

The LSTM (Long Short-Term Memory) RNN model introduces a "carry" mechanism such that useful information from earlier words can be carried to later words like a conveyor belt, without suffering the vanishing gradient problem we see in the simple RNN case.
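Switching the sketch above from a simple RNN to an LSTM is a one-layer change in keras R (unit count illustrative):

# Same structure as the simple RNN sketch, swapping in an LSTM layer
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = 32,
                  input_length = maxlen) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")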