Lecture1-3 (2) - Merged
❑ Research Interests:
- Research on the applicability of Machine Learning and Deep Learning in the domain of Cryptography and
Quantum Computing.
- Explore Recurrent Neural Network approaches to address 'Time Series Anomaly Detection' in Cloud Platforms.
❑ Research Engagements
- Working as an IBM Centre for Advanced Studies (CAS) researcher since January 2019
How to get in touch?
Send email to
– Expect response within 24 hours
Please
– Include [AML-3104_#] at the beginning of the subject line
– Send email from your college account (college policy)
Who are you?
Required Materials
• Neural Networks and Deep Learning: A Textbook
Author: Charu C. Aggarwal
ISBN-13: 978-3319944623
Publisher: Springer
Published: 26 August 2018
• Deep Learning
Authors: Ian Goodfellow, Yoshua Bengio, and Aaron Courville
ISBN-13: 978-0262035613
Publisher: The MIT Press
Published: 18 November 2016
Learning Outcomes
• Marking scheme
Any questions so far?
Any comments?
Neural Networks
Human Brain

Structure of Neuron
[Figure: neuron details — dendrites, cell body, axon, and presynaptic terminal]
Image: https://miro.medium.com/max/1400/1*jJs3ZVz8xMEjvThtFY4gDA.jpeg
Artificial Neural Networks
❑History
• 1943-McCulloch & Pitts are generally recognised as the designers
of the first neural network
• The Brain
– Pattern Recognition
– Association
– Complexity
– Noise Tolerance
Computing Machines
• The Machine
– Calculation
– Precision
– Logic
The Contrast in Architecture
• Matrix Addition: defined for matrices of equal dimensions (added element-wise); a scalar can also be added to every element
Matrix Operations
• Scaling: Multiply each element by the scalar
• Multiplication: each entry of the matrix product is made by taking the dot product of the corresponding row in the left matrix with the corresponding column in the right one.
• Transpose: flips the matrix so rows and columns swap positions; i.e., row 1 becomes column 1. The transpose of A is written $A^{\mathsf T}$.
Matrix Operations
• Identity matrix I
- Square matrix; 1's along the diagonal, 0's elsewhere
- I · A = A for any compatible matrix A
Matrix Operations
• Diagonal matrix
- Square matrix with numbers along the diagonal, 0's elsewhere
- Multiplying a diagonal matrix into another matrix scales the rows of that matrix by the diagonal entries
Matrix Operations
• Symmetric matrix
- Special matrix with AT = A
Matrix Operation Properties
• p-norm ($L^p$ norm): $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
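To make these operations concrete, here is a minimal NumPy sketch (NumPy and the example matrices are assumptions; the slides do not prescribe a library):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

print(A + B)   # element-wise addition (matrices of equal dimensions)
print(3 * A)   # scaling: multiply each element by the scalar
print(A @ B)   # matrix product: dot products of rows of A with columns of B
print(A.T)     # transpose: rows and columns swap positions

I = np.eye(2)  # identity matrix: 1's on the diagonal, 0's elsewhere
print(I @ A)   # I . A = A

D = np.diag([2., 10.])
print(D @ A)   # a diagonal matrix scales the rows of A

x = np.array([3., 4.])
p = 3
print(np.sum(np.abs(x) ** p) ** (1 / p))  # p-norm, here p = 3
print(np.linalg.norm(x, ord=p))           # same result via NumPy
```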
Any questions so far?
Any comments?
Basic Statistics
Statistical Analysis
• Mean
• Median
• Mode
• Standard deviation (std dev)
• Range
• IQR
• Skewness
• Kurtosis
Mean, median, and mode
• Mean, median, and mode are the main measures of central tendency in a distribution.
• Each of these measures tries to summarize a dataset with a single number that represents
a typical or center data point from the numerical data set.
Mean: The average; found by adding all data points and dividing by the number of data points.
Example: The mean of 4, 1, and 7 is (4 + 1 + 7) / 3 = 4.
Median: The middle number; found by ordering all data points and picking out the one
in the middle (or if there are two middle numbers, taking the mean of those two
numbers).
Example: The median of 4, 1, and 7 is 4 because when the numbers are put in
order (1, 4, 7), the number 4 is in the middle.
Mean, median, and mode – continued
Mode: The most frequent number—that is, the number that occurs the highest number
of times.
Example: The mode of { 4, 2, 4, 3, 2, 2} is 2 because it occurs three times, which
is more than any other number.
Standard Deviation
• The Standard Deviation is a measure of variation or dispersion of a
dataset.
[Figure: probability density functions for selected distributions with mean 0, variance 1, and different excess kurtosis]
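As a quick illustration of the statistics listed above, here is a minimal Python sketch on the slide's mode example {4, 2, 4, 3, 2, 2} (NumPy and SciPy are assumptions; the slides do not prescribe a library):

```python
import numpy as np
from scipy import stats

data = np.array([4, 2, 4, 3, 2, 2])

print("mean:    ", np.mean(data))          # 2.833...
print("median:  ", np.median(data))        # 2.5
vals, counts = np.unique(data, return_counts=True)
print("mode:    ", vals[np.argmax(counts)])  # 2, the most frequent value
print("std dev: ", np.std(data, ddof=1))   # sample standard deviation
print("range:   ", np.ptp(data))           # max - min
q1, q3 = np.percentile(data, [25, 75])
print("IQR:     ", q3 - q1)                # interquartile range
print("skewness:", stats.skew(data))
print("kurtosis:", stats.kurtosis(data))   # excess kurtosis (normal = 0)
```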
Inferential Statistics
✓Inferential statistics makes inferences and predictions about a
population based on a sample of data taken from the population in
question.
✓This infers properties of a population by testing hypotheses and
deriving estimates.
- Descriptive statistics describes data
(for example, a chart or graph)
✓Hypothesis tests: This is where you can use sample data to answer research questions.
For example,
- Interested in knowing if a new cancer drug is effective. Or
- If breakfast helps children perform better in schools.
Inferential Statistics – continued
- The higher the probability of an event, the more likely it is that the event will
occur.
Correlation
• Importance:
- Correlation can help in predicting one quantity from another
- It can indicate the presence of a causal relationship
- Correlation is used as a foundation for many other modeling
techniques
Concept of correlation
Correlation types
credit: displayr.com
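A minimal sketch of measuring correlation between two quantities in Python (NumPy and the synthetic data are assumptions, used only to illustrate the idea):

```python
import numpy as np

# Two quantities with a (noisy) positive linear relationship
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)

r = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient, in [-1, 1]
print(f"Pearson r = {r:.3f}")  # close to +1 here

# A strong correlation supports predicting one quantity from the other,
# but does not by itself establish causation.
```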
Random Sampling
Installing Python and PyCharm
• https://www.python.org/downloads/
• https://www.jetbrains.com/pycharm/download/#section=windows (download the free version)
• Install with pip: https://packaging.python.org/installing/
– python -m pip install -U pip setuptools (installing pip)
– pip install -U pip (updating pip)
– pip install "library_name"
Deep Learning
• Layers are not handcrafted
• Features are learned from raw data via a general-purpose learning
algorithm
https://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/
Applications of Deep Learning
• Domains in science, business and government
• Beat current records in image and speech recognition
• Beaten other machine-learning techniques at
• Predicting activity of potential drug molecules
• Analyzing particle accelerator data
• Reconstructing brain circuits
• Produced promising results in natural language understanding
• Topic classification
• Sentiment analysis
• Question answering
Overview
• Supervised Learning
• Backpropagation to Train Multilayer Architectures
• Convolutional Neural Networks
• Image Understanding with Deep Convolutional Networks
• Distributed Representation and Language Processing
• Recurrent Neural Networks
• Future of Deep Learning
Supervised Learning
• Most common form of machine learning
• Data set
• Labeling
• Training on data set (tuning parameters, gradient descent)
• Testing
•Objective function: measures error between output scores and the
desired pattern of scores
•Modifies internal adjustable parameters (weights) to reduce error
Back-Propagation
Supervised Learning
• Objective function → "hilly landscape" in the high-dimensional space of weight values
• Computes a gradient vector
• Indicates how much the error would increase or decrease if the weight were increased by a tiny amount
• Negative gradient vector indicates the direction of steepest descent in this landscape
• Taking it closer to a minimum, where the output error is low on average
Gradient Descent Algorithm
• Gradient descent is an optimization algorithm used to find the values of a function's parameters that minimize a cost function.
• It is an iterative algorithm.
• We use gradient descent to update the parameters of the model.
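A minimal sketch of the iterative update (the quadratic cost and learning rate below are illustrative assumptions, not from the slides): each step moves the parameter a small amount against the gradient.

```python
# Minimize a simple cost J(w) = (w - 3)^2 by gradient descent.
def grad(w):
    return 2 * (w - 3)               # dJ/dw

w = 0.0                              # initial parameter value
learning_rate = 0.1
for step in range(100):              # iterative updates
    w -= learning_rate * grad(w)     # step against the gradient (steepest descent)

print(w)                             # converges toward the minimizer w = 3
```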
Gradient Descent
• Animation: https://miro.medium.com/max/1400/1*E-5K5rHxCRTPrSWF60XLWw.gif
Multilayer Neural Network
Distort the input space to make the classes of data (e.g., red and blue lines) linearly separable.
Illustrative example with only two input units, two hidden units, and one output unit.
Recurrent Neural Networks
• Unfolded in time, an RNN can be seen as a very deep feed-forward network
• Trained by backpropagation
• Problematic because the backpropagated gradients either grow or shrink at each time step
• Over many time steps they typically explode or vanish
Future of Deep Learning
• Expect unsupervised learning to become more important
• Human and animal learning is largely unsupervised
• Future progress in vision
• Systems trained end-to-end
• Combine ConvNets with RNNs that use reinforcement learning
• Natural language
• RNN systems will become better when they learn strategies for selectively
attending to one part at a time
Discussion
• Deep Learning has already drastically improved the state-of-the-art
in
• image recognition
• speech recognition
• natural language understanding
• Deep Learning requires very little engineering by hand and thus has
the potential to be applied to many fields
Applications

Neural network for OCR
• Feedforward network
[Figure: input layer → hidden layer → output layer, with output units labelled A, B, ...]
OCR for 8x10 characters
[Figure: 8x10 pixel grids for sample characters]
• NNs are able to generalise
• Learning involves generating a partitioning of the input space
• For a single-layer network the input space must be linearly separable
• What is the dimension of this input space?
• How many points in the input space?
• This network is binary (uses binary values)
• Networks may also be continuous
Engine management
ALVINN
• Drives 70 mph on a public highway
• 30x32 pixels as inputs
• 4 hidden units (30x32 weights into one out of four hidden units)
• 30 outputs for steering
Signature recognition

Sonar target recognition

Stock market prediction

Mortgage assessment
References
Y. LeCun, Y. Bengio, G. Hinton (2015). Deep Learning. Nature 521, 436-444.
Decision Tree Learning
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
1. First test all attributes and select the one that would function as the best
root;
2. Break-up the training set into subsets based on the branches of the root
node;
3. Test the remaining attributes to see which ones fit best underneath the
branches of the root node;
4. Continue this process for all other branches until
a. all examples of a subset are of one type
b. there are no examples left (return majority classification of the parent)
c. there are no more attributes left (default value should be majority
classification)
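A sketch of this procedure in Python using scikit-learn (an assumption; the slides describe the ID3-style steps abstractly) on a toy weather dataset consistent with the rule above:

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy weather data: [Outlook, Humidity, Wind] -> Play?
X = [["Sunny", "High", "Weak"], ["Sunny", "Normal", "Strong"],
     ["Overcast", "High", "Weak"], ["Rain", "Normal", "Weak"],
     ["Rain", "High", "Strong"], ["Overcast", "Normal", "Strong"]]
y = ["No", "Yes", "Yes", "Yes", "No", "Yes"]

enc = OrdinalEncoder()
X_enc = enc.fit_transform(X)  # encode categorical attributes as numbers

# criterion="entropy" mirrors ID3's information-gain attribute selection
tree = DecisionTreeClassifier(criterion="entropy").fit(X_enc, y)
print(export_text(tree, feature_names=["Outlook", "Humidity", "Wind"]))
```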
Decision Tree Learning
[Figure: scatter plots in the (x1, x2) plane of a training set (labels known) and a test set (labels unknown), with x, o, and + points illustrating decision boundaries of increasing complexity]
SVMs: Pros and cons
Pros
• Many publicly available SVM packages: http://www.kernel-machines.org/software
• Kernel-based framework is very powerful, flexible
• SVMs work very well in practice, even with very small training sample sizes
Cons
• No "direct" multi-class SVM; must combine two-class SVMs
• Computation, memory: during training, must compute a matrix of kernel values for every pair of examples
• Learning can take a very long time for large-scale problems
Slide credit: D. Hoiem
Artificial Neural Networks
◼The Brain
◼Brain vs. Computers
◼The Perceptron
◼Multilayer networks
◼Some Applications
Artificial Neural Networks
◼ Other terms/names
◼ connectionist
◼ parallel distributed processing
◼ neural computation
◼ adaptive networks..
◼ History
◼ 1943-McCulloch & Pitts are generally recognised as
the designers of the first neural network
◼ 1949-First learning rule
◼ 1969-Minsky & Papert - perceptron limitation - Death
of ANN
◼ 1980’s - Re-emergence of ANN - multi-layer networks
Deep Learning
Brain and Machine
• The Brain
– Pattern Recognition
– Association
– Complexity
– Noise Tolerance
• The Machine
– Calculation
– Precision
– Logic
The Contrast in Architecture
• The Von Neumann architecture uses a single processing unit:
– Tens of millions of operations per second
– Absolute arithmetic precision
Features of the Brain

The biological inspiration

The Structure of Neurons
A neuron has a cell body, a branching input structure (the dendrite), and a branching output structure (the axon).
The Artificial Neuron (Perceptron)
[Figure: inputs a0 = +1, a1, ..., an with weights wj0, wj1, ..., wjn feeding a summation Sj and an activation f(Sj) that produces the output Xj]
A Simple Model of a Neuron (Perceptron)
[Figure: inputs y1, ..., yi with weights w1j, ..., wij feeding output O]
• Each neuron has a threshold value
• Each neuron has weighted inputs from other neurons
• The input signals form a weighted sum
• If the activation level exceeds the threshold, the neuron "fires"
An Artificial Neuron
[Figure: inputs y1, ..., yi with weights w1j, ..., wij, an activation function f(x), and output O]
Perceptron Training

$\text{output} = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > t \\ 0 & \text{otherwise} \end{cases}$

◼ Linear threshold is used.
◼ W - weight value
◼ t - threshold value
Simple Network: AND with a Biased Input

$\text{output} = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > t \\ 0 & \text{otherwise} \end{cases}$

[Figure: bias input -1 with W1 = 1.5, input X with W2 = 1, input Y with W3 = 1, threshold t = 0.0]
Learning algorithm
While epoch produces an error
    Present network with next inputs from epoch
    Error = T - O
    If Error <> 0 Then
        Wj = Wj + LR * Ij * Error
    End If
End While
Learning algorithm
Epoch: Presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).

Error: The error value is the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it output a 1, then Error = -1.
Learning algorithm
Target value, T: When we are training a network we not only present it with the input but also with a value that we require the network to produce. For example, if we present the network with [1,1] for the AND function, the target value will be 1.

Output, O: The output value from the neuron.

Ij: Inputs being presented to the neuron.

Wj: Weight from input neuron (Ij) to the output neuron.

LR: The learning rate. This dictates how quickly the network converges. It is set by a matter of experimentation; it is typically 0.1.
Training Perceptrons
For AND:

A B | Output
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

[Figure: bias input -1 with weight W1 = ?, input x with W2 = ?, input y with W3 = ?, threshold t = 0.0]
Training Perceptrons
For AND, starting from W1 = 0.3, W2 = 0.5, W3 = -0.4 (bias input -1, threshold t = 0.0):

I1 I2 I3 | Summation                              | Output
-1  0  0 | (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3   | 0
-1  0  1 | (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7   | 0
-1  1  0 | (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2    | 1
-1  1  1 | (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2   | 0

The outputs for [1,0] and [1,1] do not match the targets (0 and 1), so the learning rule updates the weights on those rows; a runnable sketch follows below.
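A short Python sketch of the learning rule from the slides, starting from the weights in the table (bias input -1, threshold t = 0, LR = 0.1):

```python
# Perceptron learning for AND, following the rule Wj = Wj + LR * Ij * Error
inputs  = [[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]]  # I1 is the bias input -1
targets = [0, 0, 0, 1]                                      # AND of I2 and I3
weights = [0.3, 0.5, -0.4]                                  # W1, W2, W3 from the table
LR, t = 0.1, 0.0

for epoch in range(100):
    errors = 0
    for I, T in zip(inputs, targets):
        S = sum(w * i for w, i in zip(weights, I))          # weighted sum
        O = 1 if S > t else 0                               # linear threshold
        error = T - O
        if error != 0:
            errors += 1
            weights = [w + LR * i * error for w, i in zip(weights, I)]
    if errors == 0:                                         # an epoch with no error: done
        break

print(weights)  # a weight set that realizes AND
```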
Learning in Neural Networks
Decision boundaries
• In simple cases, divide the feature space by drawing a hyperplane across it.
• Known as a decision boundary.
• Discriminant function: returns different values on opposite sides (straight line).
• Problems which can be thus classified are linearly separable.
Linear Separability
[Figure: classes A and B in the (X1, X2) plane separated by a linear decision boundary]
Rugby Players & Ballet Dancers
[Figure: height (m, 1-2) vs. weight (kg, 50-120) scatter plot; rugby players and ballet dancers form two separable clusters]
Hyperplane partitions
Different Non-Linearly Separable Problems
[Table: network structure vs. types of decision regions, the exclusive-OR problem, classes with meshed regions, and most general region shapes. Surviving row — three-layer network: arbitrary decision regions (complexity limited by the number of nodes)]
Multilayer Perceptron (MLP)
[Figure: input layer → adjustable weights → output layer → output values]
Types of Layers
• The input layer.
– Introduces input values into the network.
– No activation function or other processing.
• The hidden layer(s).
– Perform classification of features
– Two hidden layers are sufficient to solve any problem
– Features imply more layers may be better
• The output layer.
– Functionally just like the hidden layers
– Outputs are passed on to the world outside the neural
network.
Activation functions
Standard activation functions
• The hard-limiting threshold function
– Corresponds to the biological paradigm: either fires or not
• Sigmoid functions ('S'-shaped curves)
– The logistic function $f(x) = \dfrac{1}{1 + e^{-ax}}$
– The hyperbolic tangent (symmetrical)
– Both functions have a simple differential
– Only the shape is important
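A small sketch of these activation functions in Python (NumPy assumed; a is the slope parameter from the logistic formula above):

```python
import numpy as np

def hard_threshold(x, t=0.0):
    return np.where(x > t, 1.0, 0.0)     # fires or not

def logistic(x, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * x))  # f(x) = 1 / (1 + e^{-ax}), output in (0, 1)

def tanh(x):
    return np.tanh(x)                    # symmetrical sigmoid, output in (-1, 1)

x = np.linspace(-4, 4, 9)
print(hard_threshold(x))
print(logistic(x))
print(tanh(x))

# Both sigmoids have simple derivatives, convenient for backpropagation:
# logistic'(x) = a * f(x) * (1 - f(x));  tanh'(x) = 1 - tanh(x)^2
```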
Training Algorithms
• Predictions are fed forward through the network to classify
Regularization
• L1, L2 regularization (weight decay)
• Dropout
• Randomly turn off some neurons
• Allows individual neurons to independently be
responsible for performance
Dropout: A simple way to prevent neural networks from overfitting [Srivastava JMLR 2014]
Adapted from Jia-bin Huang
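A minimal NumPy sketch of the dropout idea (this uses the common "inverted dropout" variant, an assumption; Srivastava et al. instead rescale at test time):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    """Randomly turn off some neurons during training (inverted dropout)."""
    if not training:
        return activations                            # no dropout at test time
    mask = np.random.rand(*activations.shape) >= p_drop  # keep with prob 1 - p_drop
    return activations * mask / (1.0 - p_drop)        # rescale so expectations match

h = np.ones((2, 8))   # a batch of hidden activations
print(dropout(h))     # roughly half the units zeroed, survivors scaled by 2
```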
Back-Propagation

Gradient Descent
• Animation: https://miro.medium.com/max/1400/1*E-5K5rHxCRTPrSWF60XLWw.gif
Gradient descent
• Backprop formula: [shown as a figure]
Applications
Image Classification
Neural network for OCR
◼ Feedforward network (trained by backpropagation)
[Figure: input layer → hidden layer → output layer, with output units labelled A through E]
OCR for 8x10 characters
[Figure: 8x10 pixel grids for sample characters]
◼ NNs are able to generalise
◼ Learning involves generating a partitioning of the input space
◼ For a single-layer network the input space must be linearly separable
◼ What is the dimension of this input space?
◼ How many points in the input space?
◼ This network is binary (uses binary values)
◼ Networks may also be continuous
Engine management
ALVINN
• Drives 70 mph on a public highway
• 30x32 pixels as inputs
• 4 hidden units (30x32 weights into one out of four hidden units)
• 30 outputs for steering
Signature recognition

Sonar target recognition

Stock market prediction

Mortgage assessment
Neural Network Problems
Parameter setting
• Number of layers
• Number of neurons
• Too many neurons require more training time
• Learning rate
• From experience, the value should be small, ~0.1
• Momentum term
• ...
Over-fitting

Training time
References
• https://www.slideserve.com/phelan-cortez/artificial-neural-networks
• https://media.giphy.com/media/4LiMmbAcvgTQs/giphy.gif
• https://miro.medium.com/max/1400/1*E-5K5rHxCRTPrSWF60XLWw.gif
• https://media.giphy.com/media/O9rcZVmRcEGqI/giphy.gif
Deep Neural Networks
Convolutional Neural Networks
What we have learned so far?
The last layer is just a linear model applied to the learned features $\phi(\mathbf{x}_i)$:
• Linear regression: $\Theta^{\mathsf T}\phi(\mathbf{x}_i)$
• Logistic regression, softmax regression: $\Theta^{\mathsf T}\phi(\mathbf{x}_i)$
What we have learned so far?
We must apply nonlinear activation.
For hidden layers, often
● ReLU: $a(x) = \max\{0, x\}$
● Hyperbolic tangent: $a(x) = \tanh(x)$
64x64x3 = 12,288 parameters
What are some specialized deep learning models?
• Convolutional NN (images)
• Recurrent or Transformer NNs (time series)
• Graph NN (networks/relational data)
Convolutional Neural Network
Fully Connected Layer
32x32x3 image -> stretch to 3072 x 1
input: 3072 x 1; weights W: 10 x 3072; activation: 10 x 1
1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product)
Convolution Layer
32x32x3 image -> preserve spatial structure (32 height x 32 width x 3 depth)
Convolution Layer
32x32x3 image, 5x5x3 filter: filters always extend the full depth of the input volume.
Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
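A naive NumPy sketch of this single-filter sliding dot product (shapes follow the slide: 32x32x3 image, 5x5x3 filter; stride 1 and no padding are assumptions here):

```python
import numpy as np

image = np.random.rand(32, 32, 3)   # H x W x depth
filt  = np.random.rand(5, 5, 3)     # the filter extends the full depth of the input
bias  = 0.1

H, W, _ = image.shape
F = filt.shape[0]
out = np.zeros((H - F + 1, W - F + 1))          # 28 x 28 activation map

for r in range(H - F + 1):
    for c in range(W - F + 1):
        chunk = image[r:r+F, c:c+F, :]          # a 5x5x3 chunk of the image
        out[r, c] = np.sum(chunk * filt) + bias  # 75-dimensional dot product + bias

print(out.shape)  # (28, 28): one activation map for this filter
```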
Convolution Layer
Convolving one 5x5x3 filter over the 32x32x3 image produces a 28x28x1 activation map.
Consider a second, green filter: it produces a second 28x28x1 activation map.
For example, if we had 6 5x5 filters, we'll get 6 separate activation maps, stacked into a 28x28x6 output volume.
[Figure: 32x32x3 input → CONV, ReLU with 6 5x5x3 filters → 28x28x6 output]
Preview: ConvNet is a sequence of convolution layers, interspersed with activation functions.
[Figure: 32x32x3 → CONV, ReLU (6 5x5x3 filters) → 28x28x6 → CONV, ReLU (10 5x5x6 filters) → 24x24x10 → ...]
Preview [Zeiler and Fergus 2013]: visualization of VGG-16 by Lane McIntosh; VGG-16 architecture from [Simonyan and Zisserman 2014].
One filter => one activation map (example: 5x5 filters, 32 total).
[Figure: 32x32x3 image, 5x5x3 filter → 28x28x1 activation map]
A closer look at spatial dimensions:
• 7x7 input (spatially), assume 3x3 filter => 5x5 output
• 7x7 input, 3x3 filter applied with stride 2 => 3x3 output!
• 7x7 input, 3x3 filter applied with stride 3? Doesn't fit! Cannot apply a 3x3 filter on a 7x7 input with stride 3.
Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\

[Figure: NxN input, FxF filter]
In practice: common to zero pad the border.
e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?
7x7 output!
(recall: output size = (N + 2P - F) / stride + 1)
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (will preserve size spatially):
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
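The output-size formula above is easy to encode; a small sketch that reproduces the slide's examples:

```python
def conv_output_size(N, F, stride, P=0):
    """Spatial output size (N + 2P - F) / stride + 1; None if the filter doesn't fit evenly."""
    size = (N + 2 * P - F) / stride + 1
    return int(size) if size == int(size) else None

print(conv_output_size(7, 3, 1))       # 5
print(conv_output_size(7, 3, 2))       # 3
print(conv_output_size(7, 3, 3))       # None -- doesn't fit!
print(conv_output_size(7, 3, 1, P=1))  # 7: zero-padding with (F-1)/2 preserves size
```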
Remember back to…
E.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially!
(32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn’t work well.
[Figure: 32x32x3 → CONV, ReLU (6 5x5x3 filters) → 28x28x6 → CONV, ReLU (10 5x5x6 filters) → 24x24x10 → ...]
Examples time:
1x1 CONV with 32 filters
[Figure: 56x56x64 input → 56x56x32 output; each filter has size 1x1x64 and performs a 64-dimensional dot product]
(btw, 1x1 convolution layers make perfect sense)
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
MAX POOLING
[Figure: 2x2 max pooling with stride 2 applied to a 4x4 activation map]
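A minimal NumPy sketch of 2x2 max pooling with stride 2 (the 4x4 example values below are illustrative, not taken verbatim from the slide):

```python
import numpy as np

def max_pool_2x2(a):
    """2x2 max pooling with stride 2 on a single H x W activation map (H, W even)."""
    H, W = a.shape
    # Split into 2x2 blocks and take the max of each block
    return a.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(a))
# [[6 8]
#  [3 4]]  -- each output is the max of one 2x2 region
```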
Preview: properties of CNNs include translational invariance. Let convolutions extract features and let normal NNs decide on them.
Image credit: LeCun et al. (1998)
Summary
- ConvNets stack CONV, POOL, FC layers
- Trend towards smaller filters and deeper architectures
- Trend towards getting rid of POOL/FC layers (just CONV)
- Historically architectures looked like
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX
where N is usually up to ~5, M is large, 0 <= K <= 2 (see the sketch after this list)
- but recent advances such as ResNet/GoogLeNet
have challenged this paradigm
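As a concrete (hypothetical) instance of the historical pattern above, here is a PyTorch sketch with N = 2, M = 2, K = 1 for 32x32x3 inputs; PyTorch itself and all layer sizes are assumptions, not from the slides:

```python
import torch.nn as nn

# [(CONV-RELU)*2 - POOL]*2 - (FC-RELU)*1 - SOFTMAX, for 32x32x3 inputs and 10 classes
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # padding (F-1)/2 preserves size
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),                   # class scores
)

# Training would typically use nn.CrossEntropyLoss(), which applies the softmax
# (as log-softmax) internally, so no explicit softmax layer is needed here.
```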