
Machine learning &

Deep learning
Hoang Van Nam
MICA Institute - HUST
Agenda

• Introduction

• Machine Learning

• Deep Learning

• CNNs

• Discussion

Introduction
Artificial Intelligence (AI)
• What is Artificial Intelligence (AI)?
• Using computers to solve problems
• Or make automated decisions
• For tasks that, when done by humans,
• Typically require intelligence
Timeline Of Intelligent Machines

• 1950: The Learning Machine (Alan Turing)
• 1952: Machine Playing Checkers (Arthur Samuel)
• 1957: Perceptron (Frank Rosenblatt)
• 1979: Stanford Cart
• 1986: Backpropagation (D. Rumelhart, G. Hinton, R. Williams)
• 1997: Deep Blue beats Kasparov
• 2011: Watson wins Jeopardy
• 2012: Google NN recognizing cats in YouTube videos
• 2014: Facebook DeepFace, Amazon Echo
• 2016: DeepMind wins Go
Limits of Artificial Intelligence
• “Strong” Artificial Intelligence ✘
• Computers thinking at a level that meets or surpasses people
• Computers engaging in abstract reasoning & thinking
• This is not what we have today
• There is no evidence that we are close to Strong AI

• “Weak” Pattern-Based Artificial Intelligence ✔
• Computers solve problems by detecting useful patterns
• Pattern-based AI is an extremely powerful tool
• Has been used to automate many processes today
• Driving, language translation
• This is the dominant mode of AI today
Major AI Approaches
Two Major AI Techniques

• Logic and Rules-Based Approach

• Machine Learning (Pattern-Based Approach)


Logic and Rules-Based Approach
• Logic and Rules-Based Approach
• Representing processes or systems using logical rules
• Top-down rules are created for the computer
• Computers reason about those rules
• Can be used to automate processes

• Example within law – Expert Systems


• TurboTax
• Personal income tax laws
• Represented as logical computer rules
• Software computes tax liability
Machine Learning (Pattern-Based)
• Machine Learning (ML)
• Algorithms find patterns in data and infer rules on their own
• “Learn” from data and improve over time
• These patterns can be used for automation or prediction
• ML is the dominant mode of AI today
Hybrid Systems
• Many successful AI systems are hybrids of

• Machine learning & Rules-Based hybrids


• e.g. Self-driving cars employ both approaches

• Human intelligence + AI Hybrids


• Also, many successful AI systems work best when
• They work with human intelligence
• AI systems supply information for humans
What is Machine Learning?
Option 1 - Build A Rule Engine

A human programmer studies the historical data and writes the rules by hand.

Input (historical purchase data):
  Age   Gender   Purchase Date   Items
  30    M        3/1/2017        Toy
  40    M        1/3/2017        Books
  ….    ……       …..             …..

Output (hand-designed rules):
  Rule 1: 15 < age < 30
  Rule 2: Bought Toy = Y, Last Purchase < 30 days
  Rule 3: Gender = ‘M’, Bought Toy = ‘Y’
  Rule 4: ……..
  Rule 5: ……..

Problems with hand-designed rules: Scalability, Adaptability, Closing the Loop
Option 2 - Learn The Business Rules From Data

Historical purchase data (training data):
  Age   Gender   Purchase Date   Items
  30    M        3/20/2017       Toy
  40    M        1/3/2017        Books
  ….    ……       …..             …..

Training data → Learning Algorithm → Model

Input (new, unseen data):
  Age   Gender
  35    F
  39    M

Model → Prediction (output), e.g. Items = Toy
Option 2 - Learn The Business Rules From Data

The same picture, with the columns labeled: the customer attributes are the inputs X and the
purchased items are the outputs Y. The learning algorithm fits a model f on the historical
(training) data such that f(X) = Y’, with Y’ ≈ Y, and that model is then used to predict Y’
for new, unseen X.
Machine learning as programming
We Call This Approach Machine Learning
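As a minimal illustration of learning rules from data (a hypothetical sketch: the table values are made up and scikit-learn is assumed to be available):

```python
# A minimal sketch of "learning business rules from data" (hypothetical data,
# assumes scikit-learn is installed).
from sklearn.tree import DecisionTreeClassifier

# Historical purchase data: [age, gender] -> item bought
# Gender encoded as 0 = F, 1 = M; items are the labels we want to predict.
X_train = [[30, 1], [40, 1], [25, 0], [35, 0]]
y_train = ["Toy", "Books", "Toy", "Books"]

# The learning algorithm infers the decision rules from the data itself.
model = DecisionTreeClassifier().fit(X_train, y_train)

# New, unseen customers: the model supplies f(X) = Y'
X_new = [[35, 0], [39, 1]]
print(model.predict(X_new))  # predicted items, e.g. ['Books' 'Toy']
```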
Why Use Machine Learning?
• Use ML when you can’t code it
• Complex tasks where deterministic solutions don’t suffice
• E.g. Recognizing speech/images
• Use ML when you can’t scale it
• Replace repetitive tasks needing human-like expertise
• E.g. recommendations, spam, fraud detection, machine translation
• Use ML when you have to adapt/personalize
• E.g. Recommendation and personalization

• Use ML when you can’t track it
• E.g. automated driving, fraud detection
Types Of Machine Learning
Reinforcement Learning
Supervised Learning
(Figure: a human corrects the model’s guess: “It is a cat.” / “No, it’s a Labrador.”)
Supervised Learning – How Machines Learn
Human intervention and validation required
(e.g. photo classification and tagging)

Labeled training data → Machine Learning Algorithm → Model → Prediction
The prediction is compared with the label (e.g. the model predicts “Cat” but the label is
“Labrador”), and the model is adjusted until its predictions match the labels.
Unsupervised Learning
No human intervention required
(e.g. customer segmentation)

Input → Machine Learning Algorithm → Prediction
Literature review on ML

• 1943: McCulloch & Pitts: early neural networks (did not learn)
• 1957: Bellman: Dynamic Programming
• 1960s: the visual cortex as inspiration for deep learning
• 1960-1981 and beyond: backpropagation, gradient descent, RNNs
• 1965: first deep feed-forward networks
• 1966: Baum: HMM
• 1977: Dempster: EM
• 1979: convolution + weight replication + subsampling
• Good Old-Fashioned Artificial Intelligence; SVM, Kernel-SVM
• Late 1980s-2000 and beyond: numerous improvements
• 2006/7: improved CNNs / GPU-CNNs / BP for MPCNNs / LSTM stacks
• Learning hierarchical representations through deep SL, UL, RL
Model Training

Model Training – Split the training data

All Labeled Dataset: 70% → Training Data, 30% held out
Model Training – Train with the training data

Training Data (70%) → Training → Trial Model
Model Training – Split off the test data

The remaining 30% of the labeled dataset is kept aside as Test Data.
Model Training – Model evaluation

Trial Model + Test Data → Evaluation Result
Model Training – Performance measurement

Trial Model + Test Data → Evaluation Result → Accuracy
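A minimal sketch of this 70/30 train/evaluate workflow, assuming scikit-learn and using one of its bundled toy datasets purely for illustration:

```python
# Sketch of the 70/30 train/test workflow above (assumes scikit-learn is
# installed; the toy dataset is used only for illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # all labeled data

# Split: 70% training data, 30% held-out test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train a trial model on the training data only
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the unseen test data and measure performance (accuracy)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```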
Deep Learning

What is Deep Learning?
• Deep Learning is a subfield of machine learning concerned with algorithms inspired by the
structure and function of the brain, called artificial neural networks.

• Data is passed through multiple non-linear transformations to generate a prediction.

• Objective: learn the parameters of the transformations that minimize a cost function.
Performance vs. Data
(Figure: deep learning algorithms keep improving as the amount of data grows, while
traditional machine learning algorithms plateau.)
Sample Deep Learning Use Cases

ASR/NLU, Language Translation, Self-Driving Cars,
Playing Go, Financial Risk, Medical Diagnosis
The Advent of Deep Learning

Enablers: Algorithms, Programming Models, Data, GPUs & Acceleration
Applications: image understanding, speech recognition, natural language processing, autonomy
Artificial Neuron/Perceptron

Input: a vector of training data x = (x0, x1, …, xn)
Output: a linear function of the input, ⟨w, x⟩ + b, where w are the weights and b the bias
Nonlinearity σ: transforms the output into the desired range of values
Training: learn the weights w and the bias b by minimizing a loss

f(x) = σ(⟨w, x⟩ + b)
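A small numpy sketch of such a neuron, assuming a sigmoid as the nonlinearity σ (the activation choice is illustrative, not prescribed by the slide):

```python
# Single artificial neuron: f(x) = sigmoid(<w, x> + b)
# (numpy sketch; the sigmoid is one possible choice of nonlinearity.)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Linear function of the input, then squashed into (0, 1)
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.8,  0.1, -0.4])  # learned weights
b = 0.2                          # learned bias
print(neuron(x, w, b))
```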
Human Brain Neuron
(Figure: a biological neuron with its inputs and output)
Neural Network

Input Layer (x0 … xn) → Hidden Layer 1 (neurons 0 … n) → Hidden Layer 2 (neurons 0 … n) → Output Layer (output neuron)
Every connection between layers carries a weight (w10, w11, w12, w13, …).
Neural Network – Forward Propagation

An input (e.g. 5) is fed through the layers from left to right: each neuron computes a weighted
sum of its inputs, applies its nonlinearity, and passes the result on, until the output neuron
produces a prediction.
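A numpy sketch of forward propagation through two hidden layers (layer sizes, random weights, and the sigmoid activation are all illustrative assumptions):

```python
# Forward propagation through a tiny fully-connected network (numpy sketch;
# layer sizes and the sigmoid activation are illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([5.0, 1.0, -2.0])            # input vector

# Randomly initialized weights/biases for 2 hidden layers and an output layer
W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)) * 0.1, np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)) * 0.1, np.zeros(1)

h1 = sigmoid(W1 @ x + b1)                 # hidden layer 1
h2 = sigmoid(W2 @ h1 + b2)                # hidden layer 2
y_hat = W3 @ h2 + b3                      # output (linear output layer)
print(y_hat)
```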


Neural Network – Backpropagation

The prediction (e.g. 5) is compared with the true label (e.g. 4); the difference defines an
error/loss. This error is propagated backwards from the output layer through the hidden layers,
assigning each weight its share of the loss.
Neural Network – Backpropagation

Using the propagated error, the weights are updated (w1’0, w1’1, w1’2, w1’3, …), and the
forward and backward passes are repeated until the loss is sufficiently small.
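A compact numpy sketch of one forward/backward pass with a gradient-descent weight update, for a one-hidden-layer network and a squared-error loss (sizes, values, and the learning rate are illustrative):

```python
# One step of backpropagation for a tiny 1-hidden-layer network with squared
# error loss (numpy sketch; sizes and learning rate are illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = np.array([5.0, 1.0]), 4.0           # input and true label

W1, b1 = rng.normal(size=(3, 2)) * 0.1, np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)) * 0.1, np.zeros(1)
lr = 0.1                                    # learning rate

# Forward pass
h = sigmoid(W1 @ x + b1)
y_hat = (W2 @ h + b2)[0]
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: chain rule from the loss back to each weight
d_yhat = y_hat - y                          # dL/dy_hat
dW2, db2 = d_yhat * h[None, :], np.array([d_yhat])
d_h = d_yhat * W2[0]                        # dL/dh
d_z1 = d_h * h * (1 - h)                    # through the sigmoid
dW1, db1 = np.outer(d_z1, x), d_z1

# Gradient descent update of every weight and bias
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss:", loss)
```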
NNs & DL: Neural Networks

Naming conventions:
◦ N-layer network: the input layer is not counted
◦ “Artificial Neural Networks” (ANN) or “Multi-Layer Perceptrons” (MLP)

Output layer: normally has no activation function (equivalently, a linear identity activation);
it outputs scores (0-1)
Sizing neural networks: by the number of neurons, or more commonly the number of parameters
DL: an NN-like model with many such stages
NNs & DL: Neural Networks Variations
NNs & DL: Neural Networks drawbacks
◦ Local minima: “Who is afraid of non-convex loss functions?” (Yann LeCun)
◦ Unsupervised learning
◦ No memory in plain networks → recurrent nets, LSTM
◦ Computational cost of conv layers → GPUs
◦ Memory bottleneck
◦ Network compression (SqueezeNet)
◦ Model re-designing
NNs & DL: DL vs Traditional
NNs & DL: Use cases
1. Data security: malware prediction, detecting abnormal data-access behaviour
2. Personal security: speeding up screening, spotting things human screeners miss
3. Financial trading: stock market prediction
4. Healthcare: cancer prediction
5. Marketing personalization: targeting audiences
6. Fraud detection: spotting potential cases of fraud
7. Recommendations: Amazon, Netflix
8. Online search
9. NLP
10. Smart cars
… and so on
NNs & DL: DL vs Traditional
NNs & DL: Number and size of layers

More neurons can express more complicated functions.

Overfitting: a model with high capacity fits the noise in the data
→ prefer smaller neural networks, regularization, dropout
However, smaller networks are harder to train with gradient descent: their loss functions have
fewer local minima, which are easy to converge to but often bad.
NNs & DL: Setting up the data and the model

Data preprocessing:
◦ mean subtraction, normalization
◦ PCA, whitening

Weight initialization: pitfall (all-zero initialization); use small random numbers

Regularization: L1, L2, max norm, dropout

Loss function:
◦ Classification: SVM loss, Softmax (cross-entropy loss), Hierarchical Softmax
◦ Regression: L2 or L1 norm
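A short numpy sketch of the preprocessing and initialization steps above (the 0.01 scale for the random initialization is a common heuristic, not something the slides prescribe):

```python
# Data preprocessing and weight initialization (numpy sketch; the 0.01 scale
# for the random init is a common heuristic, not the only choice).
import numpy as np

X = np.random.rand(100, 20)            # toy data: 100 samples, 20 features

# Mean subtraction and normalization (per feature)
X -= X.mean(axis=0)
X /= (X.std(axis=0) + 1e-8)

# Weight initialization: all zeros is a pitfall (every neuron computes the
# same thing); small random numbers break the symmetry.
n_in, n_hidden = 20, 50
W = 0.01 * np.random.randn(n_in, n_hidden)
b = np.zeros(n_hidden)
```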
NNs & DL: Setting up the data and the model

Solver:
◦ Vanilla update
◦ Momentum update
◦ Nesterov momentum update

Per-parameter adaptive learning rate methods:
◦ Adagrad, RMSprop, Adam
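The basic update rules written out as a numpy sketch (each rule is shown in isolation; hyperparameter values are illustrative, and Adam additionally keeps a running average of the gradient itself):

```python
# Parameter update rules from the slide (numpy sketch; hyperparameters are
# illustrative). `dw` is the gradient of the loss w.r.t. the weights `w`.
import numpy as np

lr, mu, decay, eps = 0.01, 0.9, 0.99, 1e-8
w = np.zeros(10); dw = np.random.randn(10)
v = np.zeros_like(w); cache = np.zeros_like(w)

# Vanilla update: step against the gradient
w -= lr * dw

# Momentum update: accumulate a velocity and move along it
v = mu * v - lr * dw
w += v

# RMSprop-style per-parameter adaptive learning rate
cache = decay * cache + (1 - decay) * dw**2
w -= lr * dw / (np.sqrt(cache) + eps)
```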
NNs & DL: Training networks

Before learning: sanity checks
◦ Look for the correct loss at chance performance
  E.g. CIFAR-10 with a Softmax classifier: 10 classes → 0.1 probability per class
  → expected initial loss -ln(0.1) = 2.302 (see the sketch below)
◦ Overfit a tiny subset of data

Babysitting the learning process
◦ Loss function, number of iterations, epochs
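A tiny numpy sketch of the chance-loss sanity check referenced above:

```python
# Sanity check: with random initialization a softmax classifier should give
# roughly uniform probabilities, so the expected cross-entropy loss on
# CIFAR-10 (10 classes) is -ln(0.1) ≈ 2.302 (numpy sketch).
import numpy as np

num_classes = 10
probs = np.full(num_classes, 1.0 / num_classes)   # chance-level prediction
true_class = 3                                     # any label
loss = -np.log(probs[true_class])
print(loss)   # ≈ 2.302; a very different initial loss hints at a bug
```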
NNs & DL: Training networks
◦ Train/Val accuracy

◦ Ratio of weights:updates
◦ First-layer Visualizations

◦ Choose solver
http://cs231n.github.io/neural-networks-3/#loss
NNs & DL: Training networks
Introduction to CNNs: How the brain’s visual system works
Introduction to CNNs: Neural networks
Introduction to CNNs: Image Convolution
Introduction to CNNs: Convolution Layer
Convolution as a neural layer:
◦ Goal: not to use predefined kernels, but instead to learn data-specific kernels
Introduction to CNNs: Convolution Layer
• Convolutional layers are locally connected
  ◦ A filter/kernel/window slides over the image or the previous map
  ◦ The position of the filter explicitly provides information for localizing
• Convolutional layers share weights spatially: translation-invariant
  ◦ Translation-invariant: a translated region will produce the same response at the
    correspondingly translated position
  ◦ A local pattern’s convolutional response can be re-used by different candidate regions
• Convolutional layers can be applied to images of any size, yielding proportionally-sized outputs
Convolution: Principle
Cross-correlation: computing a series of dot-products and putting them into an output vector
Convolution: similar to cross-correlation, but the kernel is flipped
Key feature: shift-invariant, linear → simple
Differences:
◦ Convolution is associative: F*(G*I) = (F*G)*I
◦ Cross-correlation: matches a template to an image
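A 1-D numpy sketch of the difference: cross-correlation slides the kernel as-is, while convolution flips it first (signal and kernel values are arbitrary):

```python
# Cross-correlation vs. convolution in 1-D (numpy sketch; values are arbitrary).
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])

# Cross-correlation: dot-products of the kernel with each window of the signal
xcorr = np.correlate(signal, kernel, mode="valid")

# Convolution: the same operation, but with the kernel flipped
conv = np.convolve(signal, kernel, mode="valid")

print(xcorr)  # [-2. -2. -2.]
print(conv)   # [ 2.  2.  2.]
```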
How to calculate Convolution

Complexity: O(w*h*Fw*Fh) for a w×h image and an Fw×Fh filter

Convolution → matrix multiplication
→ Toeplitz matrix
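A naive sliding-window implementation that makes the O(w*h*Fw*Fh) cost explicit (numpy sketch; flipping the kernel beforehand would turn the cross-correlation into a convolution proper):

```python
# Naive sliding-window 2-D filtering (numpy sketch). The nested loops make the
# O(w * h * Fw * Fh) complexity explicit; flip the kernel first to get a
# convolution rather than a cross-correlation.
import numpy as np

def conv2d_naive(image, kernel):
    Fh, Fw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - Fh + 1, W - Fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot-product of the kernel with one window of the image
            out[i, j] = np.sum(image[i:i + Fh, j:j + Fw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d_naive(image, kernel))
```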
Introduction to CNNs: HOG by Convolutional Layers

• Thinking of HOG as CONV layers [Mahendran & Vedaldi, CVPR 2015]

  Steps of computing HOG                 Convolutional perspective
  Computing image gradients              Horizontal/vertical edge filters
  Binning gradients into 18 directions   Directional filters + gating (non-linearity)
  Computing cell histograms              Sum/average pooling
  Normalizing cell histograms            Local response normalization (LRN)

• HOG, SIFT, and many other “hand-engineered” features are convolutional feature maps.
Introduction to CNNs: Convolution Layer
An image is not just “2D”, it is a volume: 2D × 3 (3 channels of intensity maps, e.g. RGB, HSV)
We want to learn more than one kernel for each layer
CNNs: Pooling layer
- Reduces the spatial size of the representation
- Reduces the number of parameters and the computation in the network → helps control overfitting
- Operates independently on every depth slice of the input and resizes it spatially,
  using the MAX operation (or Average Pooling, L2 pooling)
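A minimal numpy sketch of 2×2 max pooling with stride 2 on a single depth slice (the sizes are illustrative):

```python
# 2x2 max pooling with stride 2 on one depth slice (numpy sketch).
import numpy as np

def max_pool_2x2(x):
    H, W = x.shape
    # Group the slice into non-overlapping 2x2 blocks and take the max of each
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))   # 4x4 input -> 2x2 output, each value the max of a block
```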
CNNs: FC and ReLU layers
Fully-connected (FC) layers:
◦ As in a regular network: full connections to all activations in the previous layer
◦ Common use: predict a label
◦ Converting FC to CONV:
  ◦ Example: an FC layer with K=4096 on a 7×7×512 input can be equivalently expressed as a
    CONV layer with F=7, P=0, S=1, K=4096
  ◦ The filter size is exactly the size of the input volume, so the output is simply 1×1×4096

ReLU (Rectified Linear Units):
◦ Applies the non-saturating activation function f(x) = max(0, x)
◦ Increases the nonlinear properties of the decision function without affecting the receptive
  fields of the convolution layer
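A quick numpy illustration of ReLU and of the FC-to-CONV size bookkeeping described above (shapes only, no real network is built):

```python
# ReLU and the FC -> CONV equivalence, shapes only (numpy sketch).
import numpy as np

# ReLU: element-wise max(0, x)
x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.maximum(0, x))            # [0. 0. 0. 1.5 3.]

# FC -> CONV bookkeeping: an FC layer with K=4096 on a 7x7x512 input behaves
# like 4096 filters of size 7x7x512 applied with stride 1 and no padding,
# producing a 1x1x4096 output.
input_shape = (7, 7, 512)
K, F, P, S = 4096, 7, 0, 1
out_h = (input_shape[0] - F + 2 * P) // S + 1
out_w = (input_shape[1] - F + 2 * P) // S + 1
print((out_h, out_w, K))           # (1, 1, 4096)
```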
Introduction to CNNs: Convolutional Neural Network
CNNs: Learning
CNNs: Transfer learning
Very few people train from scratch (random init):
◦ Rarely does a dataset have sufficient size compared to ImageNet (1.2M images, 1000 classes)
◦ Training on ImageNet takes 2-3 weeks on a modern GPU (e.g. Titan X)

Major transfer learning scenarios:
◦ ConvNet as a fixed feature extractor
  ◦ Use when the new dataset is small and similar/different to the original dataset
  ◦ Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer
  ◦ Output: a 4096-dimensional feature vector
  ◦ Train a linear classifier (SVM/Softmax) on top
◦ Fine-tuning the ConvNet
  ◦ When the new dataset is large and similar to the original dataset
  ◦ Fine-tune the weights of the pretrained network by continuing the backpropagation
  ◦ Most commonly: only fine-tune the higher-level portion of the network (earlier features of a
    ConvNet are more generic: edge detectors or color blob detectors)
◦ Train from scratch
  ◦ When the new dataset is large and very different from the original dataset
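A hedged PyTorch/torchvision sketch of the "ConvNet as a fixed feature extractor" scenario (assumes torch and torchvision are installed; the ResNet-18 backbone and num_classes value are illustrative choices):

```python
# ConvNet as a fixed feature extractor (PyTorch/torchvision sketch; assumes
# both libraries are installed; model choice and num_classes are illustrative).
import torch.nn as nn
from torchvision import models

num_classes = 5                                    # classes in the new, small dataset

# Pretrained on ImageNet (older torchvision versions use pretrained=True instead)
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze all pretrained weights: the backbone becomes a fixed feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the last fully-connected layer with a new, trainable classifier head
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only model.fc's parameters are trained; fine-tuning would instead unfreeze
# (some of) the higher layers and continue backpropagation with a small LR.
```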
CNNs: Milestones
1990s: LeNet by Yann LeCun: digit classification

2012: AlexNet: ImageNet ILSVRC 2012 winner
◦ First work that popularized CNNs
◦ Significantly outperformed the second runner-up

2013: ZF Net (Zeiler & Fergus): ILSVRC 2013 winner

2014: GoogLeNet from Google: ILSVRC 2014 winner
◦ Its Inception module dramatically reduced the number of parameters (4M, compared to AlexNet's 60M)

VGGNet from Karen Simonyan and Andrew Zisserman: runner-up in ILSVRC 2014
◦ Showed that the depth of the network is a critical component for good performance
◦ Two well-known architectures: VGG-16, VGG-19

ResNet (Residual Network) by Kaiming He et al.:
◦ ILSVRC 2015 winner, best paper at CVPR 2016
◦ Currently state-of-the-art in image classification, detection, and captioning
◦ Main idea: increase the number of layers (up to 1000) while keeping the model trainable and
  its complexity manageable through skip connections
CNNs: Applications
(A series of image slides showing example applications)
References
1. CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University
2. Jürgen Schmidhuber (2015). Deep learning in neural networks: An overview
3. Yann LeCun. Unsupervised Learning: The Next Frontier In AI
Useful Links
Understanding LSTM
Yolo – Realtime object detection
Visualize CNN
Google Deepdream

Visual Question & Answering
◦ Challenge
◦ Demo: http://visualqa.csail.mit.edu/
Deep Spreadsheets with ExcelNet
Deep Learning Frameworks
