
Machine learning &

Deep learning
Hoang Van Nam
MICA Institute - HUST
Agenda

• Introduction

• Machine Learning

• Deep Learning

• CNNs

• Discussion

Introduction
Artificial Intelligence (AI)
• What is Artificial Intelligence (AI)?
• Using computers to solve problems
• Or make automated decisions
• For tasks that, when done by humans,
• Typically require intelligence
Timeline Of Intelligent Machines

• 1950: The Learning Machine (Alan Turing)
• 1952: Machine Playing Checkers (Arthur Samuel)
• 1957: Perceptron (Frank Rosenblatt)
• 1979: Stanford Cart
• 1986: Backpropagation (D. Rumelhart, G. Hinton, R. Williams)
• 1997: Deep Blue beats Kasparov
• 2011: Watson wins Jeopardy
• 2012: Google NN recognizing cats in YouTube videos
• 2014: Facebook DeepFace, Amazon Echo
• 2016: DeepMind wins Go
Limits of Artificial Intelligence
• “Strong” Artificial Intelligence ✘
• Computers thinking at a level that meets or surpasses people
• Computers engaging in abstract reasoning & thinking
• This is not what we have today
• There is no evidence that we are close to Strong AI

• “Weak” Pattern-Based Artificial Intelligence ✔
• Computers solve problems by detecting useful patterns
• Pattern-based AI is an extremely powerful tool
• Has been used to automate many processes today
• Driving, language translation
• This is the dominant mode of AI today
Major AI Approaches
Two Major AI Techniques

• Logic and Rules-Based Approach

• Machine Learning (Pattern-Based Approach)


Logic and Rules-Based Approach
• Logic and Rules-Based Approach
• Representing processes or systems using logical rules
• Top-down rules are created for the computer
• Computers reason about those rules
• Can be used to automate processes

• Example within law – Expert Systems


• TurboTax
• Personal income tax laws
• Represented as logical computer rules
• Software computes tax liability
Machine Learning (Pattern-Based)
• Machine Learning (ML)
• Algorithms find patterns in data and infer rules on their own
• “Learn” from data and improve over time
• These patterns can be used for automation or prediction
• ML is the dominant mode of AI today
Hybrid Systems
• Many successful AI systems are hybrids of

• Machine learning & Rules-Based hybrids


• e.g. Self-driving cars employ both approaches

• Human intelligence + AI Hybrids


• Also, many successful AI systems work best when
• They work with human intelligence
• AI systems supply information for humans
What is Machine Learning?
Option 1 - Build A Rule Engine

A human programmer studies the historical data and writes the rules by hand.

Input (historical purchase data):
  Age   Gender   Purchase Date   Items
  30    M        3/1/2017        Toy
  40    M        1/3/2017        Books
  ….    ……       …..             …..

Output (hand-designed rules):
  Rule 1: 15 < age < 30
  Rule 2: Bought Toy = Y, Last Purchase < 30 days
  Rule 3: Gender = ‘M’, Bought Toy = ‘Y’
  Rule 4: ……..
  Rule 5: ……..

Problems with hand-designed rules: Scalability, Adaptability, Closing the Loop
Option 2 - Learn The Business Rules From Data

Historical purchase data (training data):
  Age   Gender   Purchase Date   Items
  30    M        3/20/2017       Toy
  40    M        1/3/2017        Books
  ….    ……       …..             …..

Training data → Learning Algorithm → Model

Input (new, unseen data):
  Age   Gender
  35    F
  39    M

Model → Prediction (output), e.g. Items = Toy
Option 2 - Learn The Business Rules From Data

The same picture, with the columns labeled: the customer attributes are the inputs X and the
purchased items are the outputs Y. The learning algorithm fits a model f on the historical
(training) data such that f(X) = Y’, with Y’ ≈ Y, and that model is then used to predict Y’
for new, unseen X.
Machine learning as programming
We Call This Approach Machine Learning
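As a minimal illustration of learning rules from data (a hypothetical sketch: the table values are made up and scikit-learn is assumed to be available):

```python
# A minimal sketch of "learning business rules from data" (hypothetical data,
# assumes scikit-learn is installed).
from sklearn.tree import DecisionTreeClassifier

# Historical purchase data: [age, gender] -> item bought
# Gender encoded as 0 = F, 1 = M; items are the labels we want to predict.
X_train = [[30, 1], [40, 1], [25, 0], [35, 0]]
y_train = ["Toy", "Books", "Toy", "Books"]

# The learning algorithm infers the decision rules from the data itself.
model = DecisionTreeClassifier().fit(X_train, y_train)

# New, unseen customers: the model supplies f(X) = Y'
X_new = [[35, 0], [39, 1]]
print(model.predict(X_new))  # predicted items, e.g. ['Books' 'Toy']
```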
Why Use Machine Learning?
• Use ML when you can’t code it
• Complex tasks where deterministic solutions don’t suffice
• E.g. Recognizing speech/images
• Use ML when you can’t scale it
• Replace repetitive tasks needing human-like expertise
• E.g. recommendations, spam, fraud detection, machine translation
• Use ML when you have to adapt/personalize
• E.g. Recommendation and personalization

• Use ML when you can’t track it
• E.g. automated driving, fraud detection
Types Of Machine Learning
Reinforcement Learning
Supervised Learning
(Figure: a human corrects the model’s guess: “It is a cat.” / “No, it’s a Labrador.”)
Supervised Learning – How Machines Learn
Human intervention and validation required
(e.g. photo classification and tagging)

Labeled training data → Machine Learning Algorithm → Model → Prediction
The prediction is compared with the label (e.g. the model predicts “Cat” but the label is
“Labrador”), and the model is adjusted until its predictions match the labels.
Unsupervised Learning
No human intervention required
(e.g. customer segmentation)

Input → Machine Learning Algorithm → Prediction
Literature review on ML

• 1943: McCulloch & Pitts: early neural networks (did not learn)
• 1957: Bellman: Dynamic Programming
• 1960s: the visual cortex as inspiration for deep learning
• 1960-1981 and beyond: backpropagation, gradient descent, RNNs
• 1965: first deep feed-forward networks
• 1966: Baum: HMM
• 1977: Dempster: EM
• 1979: convolution + weight replication + subsampling
• Good Old-Fashioned Artificial Intelligence; SVM, Kernel-SVM
• Late 1980s-2000 and beyond: numerous improvements
• 2006/7: improved CNNs / GPU-CNNs / BP for MPCNNs / LSTM stacks
• Learning hierarchical representations through deep SL, UL, RL
Model Training

Model Training – Split the training data

All Labeled Dataset: 70% → Training Data, 30% held out
Model Training – Train with the training data

Training Data (70%) → Training → Trial Model
Model Training – Split off the test data

The remaining 30% of the labeled dataset is kept aside as Test Data.
Model Training – Model evaluation

Trial Model + Test Data → Evaluation Result
Model Training – Performance measurement

Trial Model + Test Data → Evaluation Result → Accuracy
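A minimal sketch of this 70/30 train/evaluate workflow, assuming scikit-learn and using one of its bundled toy datasets purely for illustration:

```python
# Sketch of the 70/30 train/test workflow above (assumes scikit-learn is
# installed; the toy dataset is used only for illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # all labeled data

# Split: 70% training data, 30% held-out test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train a trial model on the training data only
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the unseen test data and measure performance (accuracy)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```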
Deep Learning

What is Deep Learning?
• Deep Learning is a subfield of machine learning concerned with algorithms inspired by the
structure and function of the brain, called artificial neural networks.

• Data is passed through multiple non-linear transformations to generate a prediction.

• Objective: learn the parameters of the transformations that minimize a cost function.
Performance vs. Data
(Figure: deep learning algorithms keep improving as the amount of data grows, while
traditional machine learning algorithms plateau.)
Sample Deep Learning Use Cases

ASR/NLU, Language Translation, Self-Driving Cars,
Playing Go, Financial Risk, Medical Diagnosis
The Advent of Deep Learning

Enablers: Algorithms, Programming Models, Data, GPUs & Acceleration
Applications: image understanding, speech recognition, natural language processing, autonomy
Artificial Neuron/Perceptron

Input: a vector of training data x = (x0, x1, …, xn)
Output: a linear function of the input, ⟨w, x⟩ + b, where w are the weights and b the bias
Nonlinearity σ: transforms the output into the desired range of values
Training: learn the weights w and the bias b by minimizing a loss

f(x) = σ(⟨w, x⟩ + b)
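A small numpy sketch of such a neuron, assuming a sigmoid as the nonlinearity σ (the activation choice is illustrative, not prescribed by the slide):

```python
# Single artificial neuron: f(x) = sigmoid(<w, x> + b)
# (numpy sketch; the sigmoid is one possible choice of nonlinearity.)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Linear function of the input, then squashed into (0, 1)
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.8,  0.1, -0.4])  # learned weights
b = 0.2                          # learned bias
print(neuron(x, w, b))
```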
Human Brain Neuron
(Figure: a biological neuron with its inputs and output)
Neural Network

Input Layer (x0 … xn) → Hidden Layer 1 (neurons 0 … n) → Hidden Layer 2 (neurons 0 … n) → Output Layer (output neuron)
Every connection between layers carries a weight (w10, w11, w12, w13, …).
Neural Network – Forward Propagation

An input (e.g. 5) is fed through the layers from left to right: each neuron computes a weighted
sum of its inputs, applies its nonlinearity, and passes the result on, until the output neuron
produces a prediction.
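A numpy sketch of forward propagation through two hidden layers (layer sizes, random weights, and the sigmoid activation are all illustrative assumptions):

```python
# Forward propagation through a tiny fully-connected network (numpy sketch;
# layer sizes and the sigmoid activation are illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([5.0, 1.0, -2.0])            # input vector

# Randomly initialized weights/biases for 2 hidden layers and an output layer
W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)) * 0.1, np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)) * 0.1, np.zeros(1)

h1 = sigmoid(W1 @ x + b1)                 # hidden layer 1
h2 = sigmoid(W2 @ h1 + b2)                # hidden layer 2
y_hat = W3 @ h2 + b3                      # output (linear output layer)
print(y_hat)
```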


Neural Network – Backpropagation

The prediction (e.g. 5) is compared with the true label (e.g. 4); the difference defines an
error/loss. This error is propagated backwards from the output layer through the hidden layers,
assigning each weight its share of the loss.
Neural Network – Backpropagation

Using the propagated error, the weights are updated (w1’0, w1’1, w1’2, w1’3, …), and the
forward and backward passes are repeated until the loss is sufficiently small.
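A compact numpy sketch of one forward/backward pass with a gradient-descent weight update, for a one-hidden-layer network and a squared-error loss (sizes, values, and the learning rate are illustrative):

```python
# One step of backpropagation for a tiny 1-hidden-layer network with squared
# error loss (numpy sketch; sizes and learning rate are illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = np.array([5.0, 1.0]), 4.0           # input and true label

W1, b1 = rng.normal(size=(3, 2)) * 0.1, np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)) * 0.1, np.zeros(1)
lr = 0.1                                    # learning rate

# Forward pass
h = sigmoid(W1 @ x + b1)
y_hat = (W2 @ h + b2)[0]
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: chain rule from the loss back to each weight
d_yhat = y_hat - y                          # dL/dy_hat
dW2, db2 = d_yhat * h[None, :], np.array([d_yhat])
d_h = d_yhat * W2[0]                        # dL/dh
d_z1 = d_h * h * (1 - h)                    # through the sigmoid
dW1, db1 = np.outer(d_z1, x), d_z1

# Gradient descent update of every weight and bias
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss:", loss)
```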
NNs & DL: Neural Networks

Naming conventions:
◦ N-layer network: the input layer is not counted
◦ “Artificial Neural Networks” (ANN) or “Multi-Layer Perceptrons” (MLP)

Output layer: normally has no activation function (equivalently, a linear identity activation);
it outputs scores (0-1)
Sizing neural networks: by the number of neurons, or more commonly the number of parameters
DL: an NN-like model with many such stages
NNs & DL: Neural Networks Variations
NNs & DL: Neural Networks drawbacks
◦ Local minima: “Who is afraid of non-convex loss functions?” (Yann LeCun)
◦ Unsupervised learning
◦ No memory in plain networks → recurrent nets, LSTM
◦ Computational cost of conv layers → GPUs
◦ Memory bottleneck
◦ Network compression (SqueezeNet)
◦ Model re-designing
NNs & DL: DL vs Traditional
NNs & DL: Use cases
1. Data security: malware prediction, detecting abnormal data-access behaviour
2. Personal security: speeding up screening, spotting things human screeners miss
3. Financial trading: stock market prediction
4. Healthcare: cancer prediction
5. Marketing personalization: targeting audiences
6. Fraud detection: spotting potential cases of fraud
7. Recommendations: Amazon, Netflix
8. Online search
9. NLP
10. Smart cars
… and so on
NNs & DL: DL vs Traditional
NNs & DL: Number and size of layers

More neurons can express more complicated functions.

Overfitting: a model with high capacity fits the noise in the data
→ prefer smaller neural networks, regularization, dropout
However, smaller networks are harder to train with gradient descent: their loss functions have
fewer local minima, which are easy to converge to but often bad.
NNs & DL: Setting up the data and the model

Data preprocessing:
◦ mean subtraction, normalization
◦ PCA, whitening

Weight initialization: pitfall (all-zero initialization); use small random numbers

Regularization: L1, L2, max norm, dropout

Loss function:
◦ Classification: SVM loss, Softmax (cross-entropy loss), Hierarchical Softmax
◦ Regression: L2 or L1 norm
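A short numpy sketch of the preprocessing and initialization steps above (the 0.01 scale for the random initialization is a common heuristic, not something the slides prescribe):

```python
# Data preprocessing and weight initialization (numpy sketch; the 0.01 scale
# for the random init is a common heuristic, not the only choice).
import numpy as np

X = np.random.rand(100, 20)            # toy data: 100 samples, 20 features

# Mean subtraction and normalization (per feature)
X -= X.mean(axis=0)
X /= (X.std(axis=0) + 1e-8)

# Weight initialization: all zeros is a pitfall (every neuron computes the
# same thing); small random numbers break the symmetry.
n_in, n_hidden = 20, 50
W = 0.01 * np.random.randn(n_in, n_hidden)
b = np.zeros(n_hidden)
```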
NNs & DL: Setting up the data and the model

Solver:
◦ Vanilla update
◦ Momentum update
◦ Nesterov momentum update

Per-parameter adaptive learning rate methods:
◦ Adagrad, RMSprop, Adam
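The basic update rules written out as a numpy sketch (each rule is shown in isolation; hyperparameter values are illustrative, and Adam additionally keeps a running average of the gradient itself):

```python
# Parameter update rules from the slide (numpy sketch; hyperparameters are
# illustrative). `dw` is the gradient of the loss w.r.t. the weights `w`.
import numpy as np

lr, mu, decay, eps = 0.01, 0.9, 0.99, 1e-8
w = np.zeros(10); dw = np.random.randn(10)
v = np.zeros_like(w); cache = np.zeros_like(w)

# Vanilla update: step against the gradient
w -= lr * dw

# Momentum update: accumulate a velocity and move along it
v = mu * v - lr * dw
w += v

# RMSprop-style per-parameter adaptive learning rate
cache = decay * cache + (1 - decay) * dw**2
w -= lr * dw / (np.sqrt(cache) + eps)
```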
NNs & DL: Training networks

Before learning: sanity checks
◦ Look for the correct loss at chance performance
  E.g. CIFAR-10 with a Softmax classifier: 10 classes → 0.1 probability per class
  → expected initial loss -ln(0.1) = 2.302 (see the sketch below)
◦ Overfit a tiny subset of data

Babysitting the learning process
◦ Loss function, number of iterations, epochs
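A tiny numpy sketch of the chance-loss sanity check referenced above:

```python
# Sanity check: with random initialization a softmax classifier should give
# roughly uniform probabilities, so the expected cross-entropy loss on
# CIFAR-10 (10 classes) is -ln(0.1) ≈ 2.302 (numpy sketch).
import numpy as np

num_classes = 10
probs = np.full(num_classes, 1.0 / num_classes)   # chance-level prediction
true_class = 3                                     # any label
loss = -np.log(probs[true_class])
print(loss)   # ≈ 2.302; a very different initial loss hints at a bug
```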
NNs & DL: Training networks
◦ Train/Val accuracy

◦ Ratio of weights:updates
◦ First-layer Visualizations

◦ Choose solver
http://cs231n.github.io/neural-networks-3/#loss
NNs & DL: Training networks
Introduction to CNNs: How the brain’s visual system works
Introduction to CNNs: Neural networks
Introduction to CNNs: Image Convolution
Introduction to CNNs: Convolution Layer
Convolution as a neural layer:
◦ Goal: not to use predefined kernels, but instead to learn data-specific kernels
Introduction to CNNs: Convolution Layer
• Convolutional layers are locally connected
  ◦ A filter/kernel/window slides over the image or the previous map
  ◦ The position of the filter explicitly provides information for localizing
• Convolutional layers share weights spatially: translation-invariant
  ◦ Translation-invariant: a translated region will produce the same response at the
    correspondingly translated position
  ◦ A local pattern’s convolutional response can be re-used by different candidate regions
• Convolutional layers can be applied to images of any size, yielding proportionally-sized outputs
Convolution: Principle
Cross-correlation: computing a series of dot-products and putting them into an output vector
Convolution: similar to cross-correlation, but the kernel is flipped
Key feature: shift-invariant, linear → simple
Differences:
◦ Convolution is associative: F*(G*I) = (F*G)*I
◦ Cross-correlation: matches a template to an image
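A 1-D numpy sketch of the difference: cross-correlation slides the kernel as-is, while convolution flips it first (signal and kernel values are arbitrary):

```python
# Cross-correlation vs. convolution in 1-D (numpy sketch; values are arbitrary).
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])

# Cross-correlation: dot-products of the kernel with each window of the signal
xcorr = np.correlate(signal, kernel, mode="valid")

# Convolution: the same operation, but with the kernel flipped
conv = np.convolve(signal, kernel, mode="valid")

print(xcorr)  # [-2. -2. -2.]
print(conv)   # [ 2.  2.  2.]
```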
How to calculate Convolution

Complexity: O(w*h*Fw*Fh) for a w×h image and an Fw×Fh filter

Convolution → matrix multiplication
→ Toeplitz matrix
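A naive sliding-window implementation that makes the O(w*h*Fw*Fh) cost explicit (numpy sketch; flipping the kernel beforehand would turn the cross-correlation into a convolution proper):

```python
# Naive sliding-window 2-D filtering (numpy sketch). The nested loops make the
# O(w * h * Fw * Fh) complexity explicit; flip the kernel first to get a
# convolution rather than a cross-correlation.
import numpy as np

def conv2d_naive(image, kernel):
    Fh, Fw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - Fh + 1, W - Fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot-product of the kernel with one window of the image
            out[i, j] = np.sum(image[i:i + Fh, j:j + Fw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d_naive(image, kernel))
```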
Introduction to CNNs: HOG by Convolutional Layers

• Thinking of HOG as CONV layers [Mahendran & Vedaldi, CVPR 2015]

  Steps of computing HOG                 Convolutional perspective
  Computing image gradients              Horizontal/vertical edge filters
  Binning gradients into 18 directions   Directional filters + gating (non-linearity)
  Computing cell histograms              Sum/average pooling
  Normalizing cell histograms            Local response normalization (LRN)

• HOG, SIFT, and many other “hand-engineered” features are convolutional feature maps.
Introduction to CNNs: Convolution Layer
An image is not just “2D”, it is a volume: 2D × 3 (3 channels of intensity maps, e.g. RGB, HSV)
We want to learn more than one kernel for each layer
CNNs: Pooling layer
- Reduces the spatial size of the representation
- Reduces the number of parameters and the computation in the network → helps control overfitting
- Operates independently on every depth slice of the input and resizes it spatially,
  using the MAX operation (or Average Pooling, L2 pooling)
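A minimal numpy sketch of 2×2 max pooling with stride 2 on a single depth slice (the sizes are illustrative):

```python
# 2x2 max pooling with stride 2 on one depth slice (numpy sketch).
import numpy as np

def max_pool_2x2(x):
    H, W = x.shape
    # Group the slice into non-overlapping 2x2 blocks and take the max of each
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))   # 4x4 input -> 2x2 output, each value the max of a block
```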
CNNs: FC and ReLU layers
Fully-connected (FC) layers:
◦ As in a regular network: full connections to all activations in the previous layer
◦ Common use: predict a label
◦ Converting FC to CONV:
  ◦ Example: an FC layer with K=4096 on a 7×7×512 input can be equivalently expressed as a
    CONV layer with F=7, P=0, S=1, K=4096
  ◦ The filter size is exactly the size of the input volume, so the output is simply 1×1×4096

ReLU (Rectified Linear Units):
◦ Applies the non-saturating activation function f(x) = max(0, x)
◦ Increases the nonlinear properties of the decision function without affecting the receptive
  fields of the convolution layer
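A quick numpy illustration of ReLU and of the FC-to-CONV size bookkeeping described above (shapes only, no real network is built):

```python
# ReLU and the FC -> CONV equivalence, shapes only (numpy sketch).
import numpy as np

# ReLU: element-wise max(0, x)
x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.maximum(0, x))            # [0. 0. 0. 1.5 3.]

# FC -> CONV bookkeeping: an FC layer with K=4096 on a 7x7x512 input behaves
# like 4096 filters of size 7x7x512 applied with stride 1 and no padding,
# producing a 1x1x4096 output.
input_shape = (7, 7, 512)
K, F, P, S = 4096, 7, 0, 1
out_h = (input_shape[0] - F + 2 * P) // S + 1
out_w = (input_shape[1] - F + 2 * P) // S + 1
print((out_h, out_w, K))           # (1, 1, 4096)
```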
Introduction to CNNs: Convolutional Neural Network
CNNs: Learning
CNNs: Transfer learning
Very few people train from scratch (random init):
◦ Rarely does a dataset have sufficient size compared to ImageNet (1.2M images, 1000 classes)
◦ Training on ImageNet takes 2-3 weeks on a modern GPU (e.g. Titan X)

Major transfer learning scenarios:
◦ ConvNet as a fixed feature extractor
  ◦ Use when the new dataset is small and similar/different to the original dataset
  ◦ Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer
  ◦ Output: a 4096-dimensional feature vector
  ◦ Train a linear classifier (SVM/Softmax) on top
◦ Fine-tuning the ConvNet
  ◦ When the new dataset is large and similar to the original dataset
  ◦ Fine-tune the weights of the pretrained network by continuing the backpropagation
  ◦ Most commonly: only fine-tune the higher-level portion of the network (earlier features of a
    ConvNet are more generic: edge detectors or color blob detectors)
◦ Train from scratch
  ◦ When the new dataset is large and very different from the original dataset
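A hedged PyTorch/torchvision sketch of the "ConvNet as a fixed feature extractor" scenario (assumes torch and torchvision are installed; the ResNet-18 backbone and num_classes value are illustrative choices):

```python
# ConvNet as a fixed feature extractor (PyTorch/torchvision sketch; assumes
# both libraries are installed; model choice and num_classes are illustrative).
import torch.nn as nn
from torchvision import models

num_classes = 5                                    # classes in the new, small dataset

# Pretrained on ImageNet (older torchvision versions use pretrained=True instead)
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze all pretrained weights: the backbone becomes a fixed feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the last fully-connected layer with a new, trainable classifier head
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only model.fc's parameters are trained; fine-tuning would instead unfreeze
# (some of) the higher layers and continue backpropagation with a small LR.
```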
CNNs: Milestones
1990s: LeNet by Yann LeCun: digit classification

2012: AlexNet: ImageNet ILSVRC 2012 winner
◦ First work that popularized CNNs
◦ Significantly outperformed the second runner-up

2013: ZF Net (Zeiler & Fergus): ILSVRC 2013 winner

2014: GoogLeNet from Google: ILSVRC 2014 winner
◦ Its Inception module dramatically reduced the number of parameters (4M, compared to AlexNet's 60M)

VGGNet from Karen Simonyan and Andrew Zisserman: runner-up in ILSVRC 2014
◦ Showed that the depth of the network is a critical component for good performance
◦ Two well-known architectures: VGG-16, VGG-19

ResNet (Residual Network) by Kaiming He et al.:
◦ ILSVRC 2015 winner, best paper at CVPR 2016
◦ Currently state-of-the-art in image classification, detection, and captioning
◦ Main idea: increase the number of layers (up to 1000) while keeping the model trainable and
  its complexity manageable through skip connections
CNNs: Applications
(A series of image slides showing example applications)
References
1. CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University
2. Jürgen Schmidhuber (2015). Deep learning in neural networks: An overview
3. Yann LeCun. Unsupervised Learning: The Next Frontier In AI
Useful Links
Understanding LSTM
Yolo – Realtime object detection
Visualize CNN
Google Deepdream

Visual Question & Answering
◦ Challenge
◦ Demo: http://visualqa.csail.mit.edu/
Deep Spreadsheets with ExcelNet
Deep Learning Frameworks
