Deep Learning in Wireless Communications: Prof. Dola Saha
Introduction
Motivation
Materials
Deep Learning
Ian Goodfellow, Yoshua Bengio and Aaron Courville
https://www.deeplearningbook.org/
Dive into Deep Learning
Aston Zhang, Zachary Lipton, Mu Li and Alexander Smola
https://d2l.ai
Machine Learning: A Probabilistic Perspective
Kevin P. Murphy
https://probml.github.io/pml-book/
Several published papers
Channel Modeling
Channel Estimation
Resource Allocation
Beamforming and beam prediction
Path planning for autonomous systems
Antenna tilting
Handover
RF fingerprinting
Wireless user behavior
Spectrum availability prediction
Wireless content prediction
Modulation detection
UAV trajectory prediction
Waveform generation
Channel Coding
Signal processing in Tx-Rx chains has been developed (and is optimal) for Gaussian channels
Often, it is difficult to find a closed-form representation of a problem
Computational complexity of optimal solutions might be high
New areas of research
Signals are complex valued (I, Q), whereas image data is three dimensional (RGB)
Spectrograms can be treated as images, but we lose information
Pixel values range over [0, 255] in images, whereas wireless signal samples lie in [-1, +1]
Challenge: Dataset
Figure: AI, machine learning, and representation learning. Source: M. E. Morocho-Cayamcela, H. Lee and W. Lim, "Machine Learning for 5G/B5G Mobile and Wireless Communications: Potential, Limitations, and Future Directions," IEEE Access, vol. 7, pp. 137184-137206, 2019, doi: 10.1109/ACCESS.2019.2942390.
Advent of GPUs
Faster network connectivity
Better software infrastructure
Figure 1.8 (Goodfellow et al.): Dataset sizes have increased greatly over time. In the early 1900s, statisticians studied datasets using hundreds or thousands of manually compiled measurements (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the pioneers of biologically inspired machine learning often worked with small, synthetic datasets, such as low-resolution bitmaps of letters, that were designed to incur low computational cost and demonstrate that neural networks were able to learn specific kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s, machine learning became more statistical in nature and began to leverage larger datasets containing tens of thousands of examples, such as the MNIST dataset of scans of handwritten numbers (LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009), continued to be produced. Toward the end of that decade and throughout the first half of the 2010s, significantly larger datasets, containing hundreds of thousands to tens of millions of examples, completely changed what was possible with deep learning. These datasets included the public Street View House Numbers dataset (Netzer et al., 2011) and various versions of the ImageNet dataset (Deng et al., 2009, 2010a; Russakovsky et al.).
Figure 1.10 (Goodfellow et al.): Initially, the number of connections between neurons in artificial neural networks was limited by hardware capabilities. Today, the number of connections between neurons is mostly a design consideration; some artificial neural networks have nearly as many connections per neuron as a cat.
Figure 1.11 (Goodfellow et al.): Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015).
Improved Accuracy
Figure 1.12 (Goodfellow et al.): Since deep networks reached the scale necessary to compete in the Large Scale Visual Recognition Challenge, they have consistently won the competition every year and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).
Learning Model
Model: y = f(x)
⁃ Classification
⁃ Domain Transfer
⁃ Prediction
Learning is the process by which we discover the right setting of the knobs yielding the
desired behavior from our model. In other words, we train our model with data.
Wireless Applications: modulation detection, RF fingerprinting, channel estimation, channel modeling, waveform generation
The data that we can learn from: consists of the attributes or features from which the model should learn.
– Training set: the set of examples used to fit the parameters of the model
– Validation/testing set: the set of examples used to evaluate the performance of the model after training
A model of how to transform the data
An objective function that quantifies how well (or badly) the model is doing
– a mathematical formulation to measure performance in each iteration (epoch) of training
– conventionally a minimization, leading to the term loss function
An algorithm to adjust the model's parameters to optimize the objective function (see the sketch below)
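To make these four components concrete, here is a minimal NumPy sketch of a single training iteration for a linear model; the synthetic data, parameter initialization, and learning rate are illustrative assumptions, not taken from the lecture.

import numpy as np
# Data: synthetic features and labels (assumed example)
X_train = np.random.randn(100, 2)
y_train = X_train @ np.array([3.0, -2.0]) + 1.0
# Model: a linear model with parameters w and b
w, b = np.zeros(2), 0.0
eta = 0.1  # learning rate (a hyperparameter)
y_hat = X_train @ w + b
# Objective (loss) function: mean squared error
loss = np.mean(0.5 * (y_hat - y_train) ** 2)
# Algorithm: one gradient-descent step on the parameters
grad_w = X_train.T @ (y_hat - y_train) / len(y_train)
grad_b = np.mean(y_hat - y_train)
w, b = w - eta * grad_w, b - eta * grad_b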
Linear Algebra
Probability
Calculus
Matrix Operations
Matrix Transpose: B = A⊤

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \qquad A^{\top} = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix}$$
Matrix Operations
Matrix Multiplication: C = AB

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \end{bmatrix}, \qquad B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \end{bmatrix}$$

$$C = AB = \begin{bmatrix} \mathbf{a}_1^{\top} \\ \mathbf{a}_2^{\top} \\ \vdots \\ \mathbf{a}_n^{\top} \end{bmatrix} \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1^{\top}\mathbf{b}_1 & \mathbf{a}_1^{\top}\mathbf{b}_2 & \cdots & \mathbf{a}_1^{\top}\mathbf{b}_m \\ \mathbf{a}_2^{\top}\mathbf{b}_1 & \mathbf{a}_2^{\top}\mathbf{b}_2 & \cdots & \mathbf{a}_2^{\top}\mathbf{b}_m \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}_n^{\top}\mathbf{b}_1 & \mathbf{a}_n^{\top}\mathbf{b}_2 & \cdots & \mathbf{a}_n^{\top}\mathbf{b}_m \end{bmatrix}$$

where $\mathbf{a}_i^{\top}$ is the i-th row of A and $\mathbf{b}_j$ is the j-th column of B.
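A quick NumPy illustration of these two operations (the particular matrices are assumed, not from the slides):

import numpy as np
A = np.arange(6).reshape(2, 3)     # A is 2x3
B = np.arange(12).reshape(3, 4)    # B is 3x4
A_T = A.T                          # transpose: 3x2, A_T[i, j] == A[j, i]
C = A @ B                          # matrix product: 2x4, C[i, j] = a_i^T b_j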
Norms
Generalized form (Lp norm): ‖x‖_p = (Σ_{i=1}^{n} |x_i|^p)^{1/p}, where p ∈ R, p ≥ 1
L2 norm is the Euclidean distance
– Suppose the elements of the n-dimensional vector x are x_1, ..., x_n
– Then the L2 norm of x is ‖x‖_2 = (Σ_{i=1}^{n} x_i²)^{1/2}
L1 norm is ‖x‖_1 = Σ_{i=1}^{n} |x_i|
– Compared to the L2 norm, it is less influenced by outliers
Max norm (L∞): the absolute value of the element with the largest magnitude in the vector, ‖x‖_∞ = max_i |x_i|
Frobenius norm of a matrix is similar to the L2 norm of a vector: ‖X‖_F = (Σ_{i=1}^{m} Σ_{j=1}^{n} x_ij²)^{1/2}, where X ∈ R^{m×n}
Objective functions are often described as norms: minimize the distance between predictions and ground-truth observations
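A quick NumPy check of these norms on a small vector and matrix (the values are illustrative):

import numpy as np
x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 2))        # L2 norm: 5.0
print(np.linalg.norm(x, 1))        # L1 norm: 7.0
print(np.linalg.norm(x, np.inf))   # max (L-infinity) norm: 4.0
X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(X, 'fro'))    # Frobenius norm: sqrt(30)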
Machine Learning Basics: Probability
Basic Probability
Sample space or outcome space S = {1, 2, 3, 4, 5, 6}, where the output of a random experiment is an event A.
The probability of an event A in the given sample space S, denoted P(A), satisfies the following properties:
– For any event A, its probability is never negative, i.e., P(A) ≥ 0
– The probability of the entire sample space is 1, i.e., P(S) = 1
– For any countable sequence of events A_1, A_2, ... that are mutually exclusive (A_i ∩ A_j = ∅ for i ≠ j), the probability that any of them happens is equal to the sum of their individual probabilities: P(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i)
Random Variables: A random variable is a variable that can take on different values randomly.
Random variables may be discrete (integer or states) or continuous (real numbers).
Probability Distributions
The variance gives a measure of how much the values of a function of a random variable x
vary as we sample different values of x from its probability distribution
Var[X] = E[(X − E[X])²] = E[X²] − (E[X])²
Covariance describes how much two values are linearly related to each other
Differential Calculus
Optimization in neural networks uses Differential Calculus
If a function f : R → R has scalar input and output
Derivative of f is
f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
If f′(a) exists, f is said to be differentiable at a
The derivative f′(x) is the instantaneous rate of change of f(x) with respect to x
Common notations: f′(x) = y′ = dy/dx = df/dx = (d/dx) f(x) = Df(x) = D_x f(x)
Example: f(x) = x² − 3x, so f′(x) = 2x − 3 gives the slope of the tangent line at x
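A small NumPy check of this definition on the example f(x) = x² − 3x (an illustrative sketch, not from the slides): as h shrinks, the difference quotient approaches f′(1) = 2·1 − 3 = −1.

import numpy as np
f = lambda x: x**2 - 3*x
x = 1.0
for h in [0.1, 0.01, 0.001, 0.0001]:
    # difference quotient (f(x + h) - f(x)) / h approaches f'(x)
    print(h, (f(x + h) - f(x)) / h)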
Rules of Differentiation
Chain Rule
Loss Function
Measure of fitness: quantifies the distance between the real and predicted value of the target
Usually a non-negative number, smaller is better, 0 for perfect prediction
Most popular loss function in regression problems is the squared error
l^{(i)}(w, b) = (1/2) (ŷ^{(i)} − y^{(i)})²
Mean Squared Error (MSE): the average of the losses on the entire dataset quantifies the quality of a model
L(w, b) = (1/n) Σ_{i=1}^{n} l^{(i)}(w, b) = (1/n) Σ_{i=1}^{n} (1/2) (w⊤x^{(i)} + b − y^{(i)})²
Analytic Solution
Optimization refers to the task of either minimizing or maximizing some function f (x) by
altering x
In deep learning, the most popular optimization method is gradient descent [Cauchy, 1847]
– Reduce the error by updating the parameters in the direction that iteratively lowers the
loss function
– Requires taking the derivative of the loss function, which is an average of the losses
computed on entire dataset
– Extremely slow in practice
Minibatch Stochastic Gradient Descent: sample a random minibatch of examples every time
there is a need to compute the update
The values of |B| and η are manually pre-specified and not learned through model training
Linear Neural Networks: Linear Regression
Figure: Data generated from y = 10x + 25 (blue dots), an initial random prediction with w = 23, b = 11 (red dots), and the error (ŷ − y).

import numpy as np
#Generate Data (blue dots)
X = 2 * np.random.rand(100, 1)
Y = 25 + 10 * X + np.random.randn(100, 1)
#Random Prediction (red dots)
weight = 23.0
bias = 11.0
Y_hat = bias + weight * X
The Gradient
Figure: Mean squared error as a function of the bias (with w = 23 fixed) and as a function of the weight (with b = 11 fixed).
Gradient Descent
Figure: MSE vs. iteration for the training run (learning rate η = 0.1); the prediction converges toward the data as the parameters are updated.

eta = 0.1  #Learning rate
for i in range(1000):
    #Calculate the gradient of the MSE loss
    err = Y - Y_hat
    grad_bias = -np.mean(err)
    grad_weight = -np.mean(err * X)
    #Update parameters in the direction that lowers the loss
    bias = bias - eta * grad_bias
    weight = weight - eta * grad_weight
    Y_hat = bias + weight * X
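The slides note that minibatch stochastic gradient descent instead samples a random minibatch for every update. A minimal sketch of that variation, continuing the example above (the batch size |B| is an assumed value, and the code itself is illustrative, not the lecture's):

batch_size = 10  # |B|, a manually chosen hyperparameter
for i in range(1000):
    idx = np.random.choice(len(X), batch_size, replace=False)  # sample a minibatch
    Xb, Yb = X[idx], Y[idx]
    err = Yb - (bias + weight * Xb)
    bias = bias - eta * (-np.mean(err))
    weight = weight - eta * (-np.mean(err * Xb))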
Learning Rate and Batch Size are tunable but not updated in the training loop
They are called Hyperparameters
Hyperparameter tuning is the process by which hyperparameters are chosen
Hyperparameters are adjusted based on the results of the training loop
Choice of Hyperparameters
Learning Rate:
– Too small will be too slow, too large might oscillate
Batch Size:
– Too small: Workload is too small, hard to fully utilize computation resources
– Too large: Memory issues, Wasteful computation when xi are identical
Figure: Gaussian probability density functions p(x) for (mean = 0, SD = 1), (mean = 0, SD = 2), and (mean = 3, SD = 1).
Connection to Neurons
A single artificial neuron computes a weighted sum of its inputs: o₁ = Σ_{i=0}^{M} w_i x_i = w · x
Number of layers of the NN = 1 (the input layer x_i is not counted)
Regression vs Classification
Multi-class Classification
One-hot Encoding: a one-hot encoding is a vector with as many components as the number of categories
The component corresponding to a particular instance's category is set to 1 and all other components are set to 0
Example: QPSK has 4 constellation points

Constellation point    One-hot encoded label
1+j                    (1, 0, 0, 0)
-1+j                   (0, 1, 0, 0)
-1-j                   (0, 0, 1, 0)
1-j                    (0, 0, 0, 1)
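A small NumPy sketch of this encoding for the QPSK example (illustrative; the symbol-to-index mapping is assumed):

import numpy as np
symbols = np.array([1+1j, -1+1j, -1-1j, 1-1j])  # QPSK constellation points
labels = np.array([0, 1, 2, 3])                  # class index of each symbol
one_hot = np.eye(4)[labels]                      # rows: (1,0,0,0), (0,1,0,0), ...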
Network Architecture
Estimate the conditional probabilities associated with all the possible classes
A model with multiple outputs, one per class
With linear models, we need as many affine functions as we have outputs
Each output will correspond to its own affine function
Example: 4 features and 3 output categories
12 scalars to represent the weights, 3 scalars to represent the biases
Softmax Operation
Goals:
– Outputs should not be negative
– The sum of all outputs should equal 1
– The model should be differentiable
Softmax function:
ŷ = softmax(o), where ŷ_j = exp(o_j) / Σ_k exp(o_k)
The exponential function exp(o_j) ensures non-negativity
Dividing by the sum Σ_k exp(o_k) ensures the outputs sum up to 1
Prediction: argmax_j ŷ_j = argmax_j o_j
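A minimal NumPy sketch of the softmax function (illustrative logits; subtracting the maximum is a standard numerical-stability trick that does not change the result):

import numpy as np
def softmax(o):
    e = np.exp(o - np.max(o))   # shift by the max for numerical stability
    return e / e.sum()
o = np.array([2.0, 1.0, 0.1])
y_hat = softmax(o)              # non-negative, sums to 1
pred = np.argmax(y_hat)         # same as np.argmax(o)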
Loss Function
The softmax function outputs a vector ŷ, which can be interpreted as estimated conditional probabilities of each class given any input x, e.g., ŷ₁ = P(y = Category 1 | x)
For the entire dataset, P(Y | X) = ∏_{i=1}^{n} P(y^{(i)} | x^{(i)})
According to maximum likelihood estimation, we maximize P(Y | X)
This is equivalent to minimizing the negative log-likelihood:
−log P(Y | X) = Σ_{i=1}^{n} −log P(y^{(i)} | x^{(i)}) = Σ_{i=1}^{n} l(y^{(i)}, ŷ^{(i)})
For any pair of label y and model prediction ŷ over q classes, the loss function l is
l(y, ŷ) = −Σ_{j=1}^{q} y_j log ŷ_j   (Cross-Entropy Loss)
y is a one-hot vector of length q, so the sum over all its coordinates j vanishes for all but one term
ŷ_j are probabilities, so their logarithms are never greater than 0
Derivative:
∂_{o_j} l(y, ŷ) = exp(o_j) / Σ_{k=1}^{q} exp(o_k) − y_j = softmax(o)_j − y_j
The gradient is the difference between the observation and the estimate
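A small NumPy illustration of this loss and its gradient (the logits and one-hot label are assumed values):

import numpy as np
o = np.array([2.0, 1.0, 0.1])        # logits for q = 3 classes
y = np.array([1.0, 0.0, 0.0])        # one-hot label
y_hat = np.exp(o) / np.exp(o).sum()  # softmax
loss = -np.sum(y * np.log(y_hat))    # cross-entropy loss
grad_o = y_hat - y                   # gradient with respect to the logits o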
Multi Layer Perceptron
Perceptron
An algorithm for supervised learning of binary classifiers
Binary classification outputs 0 or 1
– vs. scalar real value for regression
– vs. probabilities for logistic regression
Given input x, weight w and bias b, the perceptron outputs:
o = σ(⟨w, x⟩ + b), where σ(x) = 1 if x > 0, and 0 otherwise
Binary Classification
initialize w = 0 and b = 0
repeat
  if y_i [⟨w, x_i⟩ + b] ≤ 0 then
    w ← w + y_i x_i and b ← b + y_i
  end if
until all classified correctly
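A minimal NumPy sketch of this update rule (illustrative; assumes labels y_i ∈ {−1, +1} and linearly separable data, otherwise the loop would not terminate):

import numpy as np
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])  # inputs
y = np.array([1, 1, -1, -1])                                        # labels in {-1, +1}
w, b = np.zeros(2), 0.0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
            w, b = w + yi * xi, b + yi      # perceptron update
            converged = False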
Assumption of Linearity
Wireless domain:
– Frequency response of a multipath channel
– Non-linear region of hardware
Outputs O ∈ R^{n×q}
H = XW^{(1)} + b^{(1)}
O = HW^{(2)} + b^{(2)}
Without a nonlinearity, the two affine layers collapse into a single affine transformation:
O = (XW^{(1)} + b^{(1)})W^{(2)} + b^{(2)} = XW^{(1)}W^{(2)} + b^{(1)}W^{(2)} + b^{(2)} = XW + b
Adding a nonlinear activation function σ after the hidden layer prevents this collapse:
H = σ(XW^{(1)} + b^{(1)})
O = HW^{(2)} + b^{(2)}
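A short NumPy check of this collapse (illustrative shapes and random weights): without an activation, the two affine layers produce exactly the same output as a single affine layer with W = W1 W2 and b = b1 W2 + b2.

import numpy as np
X = np.random.randn(5, 4)                     # n = 5 examples, d = 4 features
W1, b1 = np.random.randn(4, 8), np.random.randn(8)
W2, b2 = np.random.randn(8, 3), np.random.randn(3)
H = X @ W1 + b1
O_two_layer = H @ W2 + b2
O_collapsed = X @ (W1 @ W2) + (b1 @ W2 + b2)  # single equivalent affine layer
print(np.allclose(O_two_layer, O_collapsed))  # True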
Activation Functions
ReLU Function
Generalization of ReLU
Generalizations are based on the principle that models are easier to optimize if their behavior
is closer to linear
Add a non-zero slope α_i when x_i < 0:
h_i = g(x, α)_i = max(0, x_i) + α_i min(0, x_i)
Leaky ReLU: fixes α_i to a small value like 0.01 (Maas et al., 2013)
Parameterized ReLU (PReLU): treats α_i as a learnable parameter (He et al., 2015)
Maxout units: divide x into groups of k values (Goodfellow et al., 2013): g(x)_i = max_{j∈G^{(i)}} x_j
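A small NumPy sketch of ReLU and its leaky generalization (illustrative, not from the slides):

import numpy as np
def relu(x):
    return np.maximum(0, x)
def leaky_relu(x, alpha=0.01):
    # max(0, x) + alpha * min(0, x)
    return np.maximum(0, x) + alpha * np.minimum(0, x)
x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x))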
Sigmoid Function
(d/dx) sigmoid(x) = exp(−x) / (1 + exp(−x))² = sigmoid(x) (1 − sigmoid(x))
Multiclass Classification
Outputs: y₁, y₂, ..., y_k = softmax(o₁, o₂, ..., o_k)
h = σ(W₁x + b₁)
o = w₂⊤h + b₂
y = softmax(o)
Multiclass Classification (with multiple hidden layers)
y₁, y₂, ..., y_k = softmax(o₁, o₂, ..., o_k)
h₁ = σ(W₁x + b₁)
h₂ = σ(W₂h₁ + b₂)
h₃ = σ(W₃h₂ + b₃)
o = W₄h₃ + b₄
Figure: The original input space (x₁, x₂) and the learned hidden-layer representation (h₁, h₂).
Architecture Basics
Figure 6.6 (Goodfellow et al.): Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth.
Multi Layer Perceptron: Training and Optimization
Generalization Error
Data Distribution
How can we affect performance on the test set when we get to observe only the training set?
Assumptions:
– training and test sets are not collected arbitrarily
– Independent and identically distributed (i.i.d. or IID)
+ examples in each dataset are independent from each other
+ train and test set are identically distributed (drawn from same probability distribution)
Under these assumptions: expected training set error = expected test set error
Training process: training dataset is used to choose the parameters that minimize the chosen
loss function
Test error ≥ Train error
Goal of ML algorithm:
– Make the training error small
– Make the gap between training and test error small
Underfitting vs Overfitting
Underfitting: the model is not able to obtain a sufficiently low error value on the training set
Overfitting: the gap between the training error and the test error is too large
Capacity: the ability to fit a wide variety of functions
Figure 5.2 (Goodfellow et al.): We fit three models to an example training set. The training data was generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting; it cannot capture the curvature that is present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points; it does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve the underdetermined normal equations. The solution passes through all of the training points exactly, but we have not been lucky enough for it to extract the correct structure; it now has a deep valley between two training points that does not appear in the true underlying function. With the degree-9 polynomial we have more parameters than training examples, so we have little chance of choosing a solution that generalizes well when so many wildly different solutions exist; in this example, the quadratic model is perfectly matched to the true structure of the task, so it generalizes well to new data.
Figure 5.3 (Goodfellow et al.): Typical relationship between capacity and error. Training and test error behave differently. At the left end of the graph, training error and generalization error are both high; this is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error (the generalization gap) increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
No Free Lunch Theorem [Wolpert, 1996]: averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
Regularization
Effect of Regularization
Regularization is any modification we make to a learning algorithm that is intended to reduce its
generalization error but not its training error.
Forward Propagation
Simple Assumptions:
– Input: x ∈ Rd
– Intermediate variable: z ∈ Rh
– Weight parameter of the hidden layer: W(1) ∈ Rh×d
– Before activation: z = W(1) x, when hidden terms do not have a bias
– After passing through activation function (φ):
hidden variable h = φ(z)
– Weight parameter of output layer: W(2) ∈ Rq×h
– Output layer variable: o = W(2) h
– Loss function: L = l(o, y)
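A minimal NumPy sketch of this forward pass under the stated assumptions (no hidden-layer bias; the shapes, the tanh activation, and the squared-error loss below are illustrative choices, not prescribed by the slides):

import numpy as np
d, h, q = 4, 8, 3
x = np.random.randn(d)            # input x in R^d
W1 = np.random.randn(h, d)        # hidden-layer weights W(1) in R^{h x d}
W2 = np.random.randn(q, h)        # output-layer weights W(2) in R^{q x h}
z = W1 @ x                        # before activation
h_var = np.tanh(z)                # hidden variable h = phi(z), with phi = tanh here
o = W2 @ h_var                    # output-layer variable
y = np.zeros(q); y[0] = 1.0       # an example one-hot target
L = 0.5 * np.sum((o - y) ** 2)    # loss L = l(o, y); squared error as an example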
Computational Graph
Each node represents a variable (a scalar, vector, matrix, or tensor)
Figure: Examples of computational graphs, including (c) a ReLU layer and (d) linear regression with weight decay.
Backpropagation
Figure: Symbol-to-symbol approach to computing derivatives (Goodfellow et al., Fig. 6.10). The back-propagation algorithm does not need to access any actual numeric values; instead, it adds nodes to the computational graph describing how to compute the derivatives, and a generic graph evaluation engine can later compute them for any specific numeric values. (Left) We begin with a graph representing z = f(f(f(w))). (Right) We run the back-propagation algorithm on this graph.
Figure: Computational graph for the forward propagation and cross-entropy loss (with weight decay) of a single-layer MLP (Goodfellow et al., Fig. 6.11).
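To make the chain-rule bookkeeping concrete, here is a minimal NumPy sketch (an assumed example, not the lecture's code) of the backward pass for a single-hidden-layer network with softmax cross-entropy, reusing the earlier fact that ∂l/∂o = softmax(o) − y:

import numpy as np
# forward pass (same structure as the forward-propagation sketch above)
d, h, q = 4, 8, 3
x = np.random.randn(d)
W1, W2 = np.random.randn(h, d), np.random.randn(q, h)
z = W1 @ x
h_var = np.tanh(z)
o = W2 @ h_var
y = np.zeros(q); y[0] = 1.0
y_hat = np.exp(o - o.max()); y_hat /= y_hat.sum()
# backward pass: apply the chain rule from the output toward the input
do = y_hat - y                  # dl/do for softmax + cross-entropy
dW2 = np.outer(do, h_var)       # dl/dW2
dh = W2.T @ do                  # dl/dh
dz = dh * (1 - h_var ** 2)      # dl/dz, using d tanh(z)/dz = 1 - tanh(z)^2
dW1 = np.outer(dz, x)           # dl/dW1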
Ideally, we would like to arrive at the global minimum of f(x), but this might not be possible; a local minimum that performs poorly should be avoided.
Figure 4.2 (Goodfellow et al.): Examples of each of the three types of critical points in 1-D. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points, a local maximum, which is higher than the neighboring points, or a saddle point, which has neighbors that are both higher and lower than the point itself.
Figure 4.3 (Goodfellow et al.): Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
Convolutional Neural Network
Two Principles
Cross Correlation:
S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)
Same as convolution, but without flipping the kernel
No difference in practice due to symmetry
Example (3×3 input with entries 0–8, 2×2 kernel with entries 0–3):
0×0 + 1×1 + 3×2 + 4×3 = 19
1×0 + 2×1 + 4×2 + 5×3 = 25
3×0 + 4×1 + 6×2 + 7×3 = 37
4×0 + 5×1 + 7×2 + 8×3 = 43
Figure: Cross Correlation.
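A short NumPy sketch of this cross-correlation, reproducing the numbers above (the input and kernel values are taken from the worked example):

import numpy as np
I = np.arange(9).reshape(3, 3)   # [[0,1,2],[3,4,5],[6,7,8]]
K = np.arange(4).reshape(2, 2)   # [[0,1],[2,3]]
out_h = I.shape[0] - K.shape[0] + 1
out_w = I.shape[1] - K.shape[1] + 1
S = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        S[i, j] = np.sum(I[i:i + 2, j:j + 2] * K)  # slide the kernel over the input
print(S)  # [[19. 25.] [37. 43.]]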
Convolution Layer
Input matrix: I : n_h × n_w
Kernel matrix: K : k_h × k_w
Output matrix: S : (n_h − k_h + 1) × (n_w − k_w + 1)
S = I ⋆ K + b
K and b are learnable parameters
Common choice: k_h and k_w are odd (1, 3, 5, 7)
Figure 9.1 (Goodfellow et al.): An example of 2-D convolution without kernel flipping, restricting the output to positions where the kernel lies entirely within the image, called "valid" convolution in some contexts. For input rows (a b c d), (e f g h), (i j k l) and kernel (w x; y z), the upper-left output element is aw + bx + ey + fz, and so on across the output.
Padding
Stride
Figure 9.12 (Goodfellow et al.): Convolution with a stride. In this example, we use a stride of two. (Top) Convolution with a stride length of two implemented in a single operation. (Bottom) Convolution with a stride greater than one pixel is mathematically equivalent to convolution with unit stride followed by downsampling. Obviously, the two-step approach involving downsampling is computationally wasteful, because it computes many values that are then discarded.
Pooling
Convolutional Layer
Figure 9.7 (Goodfellow et al.): The components of a typical convolutional neural network layer. There are two commonly used sets of terminology for describing these layers; in one, the convolutional net is viewed as a small number of relatively complex layers.
Figure: Connectivity pattern of a convolutional layer compared with a fully connected layer (outputs s₁–s₅ computed from inputs x₁–x₅).
Sequence Modeling
Data
So far . . .
Collect observation pairs (xi , yi ) ∼ p(x, y) for training
Estimate y|x ∼ p(y|x) for unseen x′ ∼ p(x)
Examples:
– Images & objects
– Regression problem
– House & house prices
The order of the data did not matter
Dependence on time
Autoregressive Model
Sequence Model
Input X_t ∈ R^{n×d}
For a minibatch of size n, each row of X_t corresponds to one example at time step t from the sequence
The hidden variable of the current time step depends on the input of the current time step and the hidden variable of the previous time step:
H_t = φ(X_t W_xh + H_{t−1} W_hh + b_h), where the term H_{t−1} W_hh captures the hidden-state dependency
Output:
O_t = H_t W_hq + b_q
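A minimal NumPy sketch of one recurrent step with these equations (the shapes n, d, h, q and the tanh choice for φ are illustrative):

import numpy as np
n, d, h, q = 4, 3, 5, 2                 # batch size, input dim, hidden dim, output dim
X_t = np.random.randn(n, d)
H_prev = np.zeros((n, h))               # H_{t-1}
W_xh, W_hh, b_h = np.random.randn(d, h), np.random.randn(h, h), np.zeros(h)
W_hq, b_q = np.random.randn(h, q), np.zeros(q)
H_t = np.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)  # phi = tanh
O_t = H_t @ W_hq + b_q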
Gradients in RNN
Hidden and output layer: h_t = f(x_t, h_{t−1}, w_h), o_t = g(h_t, w_o)
Chain of values that depend on recurrent computation: {..., (x_{t−1}, h_{t−1}, o_{t−1}), (x_t, h_t, o_t), ...}
Forward propagation: compute (x_t, h_t, o_t) at each time step
The discrepancy between the output o_t and the label y_t over all T time steps is:
L(x_1, ..., x_T, y_1, ..., y_T, w_h, w_o) = (1/T) Σ_{t=1}^{T} l(y_t, o_t)
Backward propagation (by the chain rule):
∂L/∂w_h = (1/T) Σ_{t=1}^{T} ∂l(y_t, o_t)/∂w_h = (1/T) Σ_{t=1}^{T} (∂l(y_t, o_t)/∂o_t) (∂g(h_t, w_o)/∂h_t) (∂h_t/∂w_h)
where the last factor ∂h_t/∂w_h requires recurrent computation.
Define
a_t = ∂h_t/∂w_h,  b_t = ∂f(x_t, h_{t−1}, w_h)/∂w_h,  c_t = ∂f(x_t, h_{t−1}, w_h)/∂h_{t−1}
BPTT:
∂h_t/∂w_h = ∂f(x_t, h_{t−1}, w_h)/∂w_h + Σ_{i=1}^{t−1} ( ∏_{j=i+1}^{t} ∂f(x_j, h_{j−1}, w_h)/∂h_{j−1} ) ∂f(x_i, h_{i−1}, w_h)/∂w_h
Figure: Computational graph showing dependencies for an RNN model with three time steps
Gated units: the GRU uses a reset gate R_t and an update gate Z_t; the LSTM uses input, forget, and output gates I_t, F_t, O_t, a candidate memory cell C̃_t, the memory cell C_t, and the hidden state H_t (⊙ denotes elementwise multiplication).
R_t = σ(X_t W_xr + H_{t−1} W_hr + b_r)
Z_t = σ(X_t W_xz + H_{t−1} W_hz + b_z)
I_t = σ(X_t W_xi + H_{t−1} W_hi + b_i)
F_t = σ(X_t W_xf + H_{t−1} W_hf + b_f)
O_t = σ(X_t W_xo + H_{t−1} W_ho + b_o)
C̃_t = tanh(X_t W_xc + H_{t−1} W_hc + b_c)
C_t = F_t ⊙ C_{t−1} + I_t ⊙ C̃_t
H_t = O_t ⊙ tanh(C_t)
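A minimal NumPy sketch of one LSTM step following these equations (shapes and random weights are illustrative; sigma is the logistic sigmoid):

import numpy as np
def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))
n, d, h = 2, 3, 4
X_t = np.random.randn(n, d)
H_prev, C_prev = np.zeros((n, h)), np.zeros((n, h))
W_xi, W_hi, b_i = np.random.randn(d, h), np.random.randn(h, h), np.zeros(h)
W_xf, W_hf, b_f = np.random.randn(d, h), np.random.randn(h, h), np.zeros(h)
W_xo, W_ho, b_o = np.random.randn(d, h), np.random.randn(h, h), np.zeros(h)
W_xc, W_hc, b_c = np.random.randn(d, h), np.random.randn(h, h), np.zeros(h)
I_t = sigma(X_t @ W_xi + H_prev @ W_hi + b_i)        # input gate
F_t = sigma(X_t @ W_xf + H_prev @ W_hf + b_f)        # forget gate
O_t = sigma(X_t @ W_xo + H_prev @ W_ho + b_o)        # output gate
C_tilde = np.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)  # candidate memory cell
C_t = F_t * C_prev + I_t * C_tilde                   # elementwise (Hadamard) products
H_t = O_t * np.tanh(C_t)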
Other Networks
Training Process
Networks compete with each other
Generator attempts to fool the Discriminator
Discriminator adapts to the new fake/generated data
This information is used to improve the generator
Discriminator:
A binary classifier that outputs a scalar prediction o ∈ R
Applies the sigmoid function to obtain a predicted probability D(x) = 1/(1 + e^{−o})
The label y for true data is 1 and for fake data is 0
Train the discriminator to minimize the cross-entropy loss: min_D {−y log D(x) − (1 − y) log(1 − D(x))}
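As a small illustration of this objective (the prediction values are assumed, not from the slides), the loss for a single real and a single generated sample:

import numpy as np
def d_loss(o, y):
    D_x = 1.0 / (1.0 + np.exp(-o))               # sigmoid of the scalar prediction o
    return -y * np.log(D_x) - (1 - y) * np.log(1 - D_x)
print(d_loss(o=2.0, y=1))   # real sample (label 1): small loss when o is large
print(d_loss(o=2.0, y=0))   # fake sample (label 0): large loss when o is large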
Complex NN
Federated Learning
Wireless Applications
Signal Detection
Channel Encoding and Decoding
Channel Estimation, Prediction, and Compression
End-to-End Communications and Semantic Communications
Distributed and Federated Learning and Communications
Resource Allocation
RF Fingerprinting
Federated Learning
Waveform Generation
https://www.comsoc.org/publications/best-readings/machine-learning-communications