Chapter 4

Chapter 4:
CNNs for Other Supervised

Learning Tasks
Mathematical Foundations of Deep Neural Networks
Fall 2022
Department of Mathematical Sciences
Ernest K. Ryu
Seoul National University
1
Inverse problem model
In inverse problems, we wish to recover a signal 𝑋true given measurements 𝑌. The unknown
and the measurements are related through
𝒜 𝑋true + 𝜀 = 𝑌,
where 𝒜 is often, but not always, linear, and 𝜀 represents small error.
The forward model 𝒜 may or may not be known. In other words, the goal of an inverse
problem is to find an approximation of 𝒜 −1 .
In many cases, 𝒜 is not even be invertible. In such cases, we can still hope to find an
mapping that serves as an approximate inverse in practice.
2
Gaussian denoising
Given 𝑋true ∈ ℝ𝑤×ℎ , we measure
𝑌 = 𝑋true + ε
where ε𝑖𝑗 ∼ 𝒩 0, 𝜎 2 is IID Gaussian noise. For the sake of simplicity, assume we know 𝜎.
Goal is to recover 𝑋true from 𝑌.
Guassian denoising is the simplest setup in which the goal is to remove noise from the
image. In more realistic setups, the noise model will be more complicated and the noise
level 𝜎 will be unknown.
3
DnCNN
In 2017, Zhang et al. presented the denoising convolutional neural networks (DnCNNs).
They trained a 17-layer CNN 𝑓𝜃 to learn the noise with the loss
𝑁
ℒ 𝜃 = ෍ 𝑓𝜃 𝑌𝑖 − 𝑌𝑖 − 𝑋𝑖 2
𝑖=1
so that the clean recovery can be obtained with 𝑌𝑖 − 𝑓𝜃 𝑌𝑖 . (This is equivalent to using a
residual connection from beginning to end.)
17 layers of
3x3 conv.
p=1
K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising, IEEE TIP, 2017. 4
DnCNN
Image denoising is was an area with a large
body of prior work. DnCNN dominated all
prior approaches that were not based on
deep learning.
Nowadays, all state-of-the-art denoising

algorithms are based on deep learning.
K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising, IEEE TIP, 2017. 5
Inverse problems via deep learning
In deep learning, we use a neural network to approximate the inverse mapping
𝑓𝜃 ≈ 𝒜−1
i.e., we want 𝑓𝜃 𝑌 ≈ 𝑋true for the measurements 𝑋 that we care about.
If we have 𝑋1 , … , 𝑋𝑁 and 𝑌1 , … , 𝑌𝑁 (but no direct knowledge of 𝒜), we can solve

𝑁
minimize
𝑝
෍ 𝑓𝜃 𝑌𝑖 − 𝑋𝑖
𝜃∈ℝ
𝑖=1
If we have 𝑋1 , … , 𝑋𝑁 and knowledge of 𝒜, we can solve
𝑁
minimize
𝑝
෍ 𝑓𝜃 𝒜 𝑋𝑖 − 𝑋𝑖
𝜃∈ℝ
𝑖=1
If we have 𝑌1 , … , 𝑌𝑁 and knowledge of 𝒜, we can solve
𝑁
minimize
𝑝
෍ 𝒜 𝑓𝜃 𝑌𝑖 − 𝑌𝑖
𝜃∈ℝ
𝑖=1
6
Image super-resolution
Given 𝑋true ∈ ℝ𝑤×ℎ , we measure
𝑌 = 𝐴 𝑋true
where 𝐴 is a “downsampling” operator. So 𝑌 ∈ ℝ𝑤2 ×ℎ2 with 𝑤2 < 𝑤 and ℎ2 < ℎ. Goal is to
recover 𝑋true from 𝑌.
In the simplest setup, 𝐴 is an average pool operator with 𝑟 × 𝑟 kernel and a stride 𝑟.
7
SRCNN
In 2015, Dong et al. presented super-resolution convolutional neural network (SRCNN).
They trained a 3-layer CNN 𝑓𝜃 to learn the high-resolution reconstruction with the loss
𝑁
2
ℒ 𝜃 = ෍ 𝑓𝜃 𝑌෨𝑖 − 𝑋𝑖
𝑖=1
where 𝑌෨𝑖 ∈ ℝ𝑤×ℎ is an upsampled version of 𝑌𝑖 ∈ ℝ 𝑤/𝑟 × ℎ/𝑟 , i.e., 𝑌෨𝑖 has the same number
of pixels as 𝑋𝑖 , but the image is pixelated or blurry. The goal is to have 𝑓𝜃 𝑌෨𝑖 be a sharp
reconstruction.
C. Dong, C. C. Loy, K. He, and X. Tang, Image super-resolution using deep convolutional networks, IEEE TPAMI, 2015. 8
SRCNN
SRCNN showed that simple
learning based approaches
can match the state-of the-
art performances of super-
resolution task.
C. Dong, C. C. Loy, K. He, and X. Tang, Image super-resolution using deep convolutional networks, IEEE TPAMI, 2015. 9
VDSR
In 2016, Kim et al. presented VDSR. They trained a 20-layer CNN with a residual
connection 𝑓𝜃 to learn the high-resolution reconstruction with the loss
𝑁
2
ℒ 𝜃 = ෍ 𝑓𝜃 𝑌෨𝑖 − 𝑋𝑖
𝑖=1
The residual connection was the key insight that enabled the training of much deeper CNNs.
ILR Conv.1 ReLu.1 Conv.D-1 ReLu.D-1 Conv.D (Residual) HR

x r y
J. Kim, J. K. Lee, and K. M. Lee, Accurate image super-resolution using very deep convolutional networks, CVPR, 2016. 10
VDSR
VDSR dominated all prior approaches
not based on deep learning.
showed that simple learning based
approaches can batch the state-of the-
art performances of super-resolution
task.
J. Kim, J. K. Lee, and K. M. Lee, Accurate image super-resolution using very deep convolutional networks, CVPR, 2016. 11
Other inverse problem tasks and results
There are many other inverse problems. Almost all of them now require deep neural
networks to achieve state-of-the-art results.
We won’t spend more time on inverse problems in this course, but let’s have fun and see a
few other tasks and results. (These results are based on much more complex architectures
and loss functions.)
12
SRGAN
bicubic interpolation SRGAN ground truth

C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, Photo-realistic single image
super-resolution using a generative adversarial network, CVPR, 2017. 13
SRGAN
SRGAN
Image colorization
R. Zhang, P. Isola, and A. A. Efros, Colorful image colorization, ECCV, 2016. 16

Image inpainting
J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, Generative image inpainting with contextual attention, CVPR, 2018. 17
Image inpainting
input output ground truth

J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, Generative image inpainting with contextual attention, CVPR, 2018. 18
Linear operator ≅ matrix
Core tenet of linear algebra: matrices are linear operators and linear operators are matrices.
Let 𝑓 ∶ ℝ𝑛 → ℝ𝑚 be linear, i.e.,

𝑓 𝑥 + 𝑦 = 𝑓 𝑥 + 𝑓 𝑦 and 𝑓 𝛼𝑥 = 𝛼𝑓 𝑥
for all 𝑥, 𝑦 ∈ ℝ𝑛 and 𝛼 ∈ ℝ.
There exists a matrix 𝐴 ∈ ℝ𝑚×𝑛 that represents 𝑓 ∶ ℝ𝑛 → ℝ𝑚 , i.e.,

𝑓 𝑥 = 𝐴𝑥
for all 𝑥 ∈ ℝ𝑛 .
19
Linear operator ≅ matrix
Let 𝑒𝑖 be the 𝑖-th unit vector, i.e., 𝑒𝑖 has all zeros elements except entry 1 in the 𝑖-th
coordinate.
Given a linear 𝑓 ∶ ℝ𝑛 → ℝ𝑚 , we can find the matrix

𝐴 = 𝐴∶,1 𝐴∶,2 ⋯ 𝐴∶,𝑛 ∈ ℝ𝑚×𝑛
representing 𝑓 with
𝑓 𝑒𝑗 = 𝐴𝑒𝑗 = 𝐴∶,𝑗
for all 𝑗 = 1, … , 𝑛, or with
𝑒𝑖⊤ 𝑓 𝑒𝑗 = 𝑒𝑖⊤ 𝐴𝑒𝑗 = 𝐴𝑖,𝑗
for all 𝑖 = 1, … , 𝑚 and 𝑗 = 1, … , 𝑛.
20
Linear operator ≇ matrix
In applied mathematics and machine learning, there are many setups where explicitly
forming the matrix representation 𝐴 ∈ ℝ𝑚×𝑛 is costly, even though the matrix-vector
products 𝐴𝑥 and 𝐴⊤ 𝑦 are efficient to evaluate.
In machine learning, convolutions are the primary example. Other areas, linear operators
based on FFTs are the primary example.
In such setups, the matrix representation is still a useful conceptual tool, even if we never
intend to form the matrix.
21
Transpose (adjoint) of a linear operator
Given a matrix 𝐴, the transpose 𝐴⊤ is obtained by flipping the row and column dimensions,
i.e., 𝐴⊤ 𝑖𝑗 = 𝐴 𝑗𝑖 . However, using this definition is not always the most effective when
understanding the action of 𝐴⊤.
Another approach is to use the adjoint view. Since

𝑦 ⊤ 𝐴𝑥 = 𝐴⊤ 𝑦 ⊤ 𝑥
for any 𝑥 ∈ ℝ𝑛 and 𝑦 ∈ ℝ𝑚 , understand the action of 𝐴⊤ by finding an expression of the form
𝑛
𝑦 ⊤ 𝐴𝑥 = ෍ something 𝑗 𝑥𝑗 = 𝐴⊤ 𝑦 ⊤ 𝑥
𝑗=1
22
Example: 1D transpose convolution
Consider the 1D convolution represented by 𝐴 ∈ ℝ 𝑛−𝑓+1 ×𝑛 defined with a given 𝑤 ∈ ℝ𝑓
and
Then we have
23
Example: 1D transpose convolution
and we have the following formula which
coincides with transposing the matrix 𝐴.
For more complicated linear operators, this is how

you understand the transpose operation.
24
Operations increasing spatial dimensions
In image classification tasks, the spatial dimensions of neural networks often decrease as
the depth progresses.
This is because we are trying to forget location information. (In classification, we care about
what is in the image, but we do not where it is in the image.)
However, there are many networks for which we want to increase the spatial dimension:
• Linear layers
• Upsampling
• Transposed convolution
25
Upsampling: Nearest neighbor
torch.nn.Upsample with mode='nearest'
6 6 8 8
6 8 6 6 8 8
3 4 3 3 4 4
3 3 4 4
26
Upsampling: Bilinear interpolation
Torch.nn.Upsample with mode='bilinear'
(We won’t pay attention to the interpolation formula.) 6.0000 6.5000 7.5000 8.0000
5.2500 5.6875 6.5625 7.0000

6 8
3 4 3.7500 4.0625 4.6875 5.0000
'linear' interpolation is available for 1D data

3.0000 3.2500 3.7500 4.0000
'trilinear' interpolation is available for 3D data
27
Transposed convolution
In transposed convolution, Input neurons
additively distribute values to the output via the
kernel.
Before people noticed that this is the transpose
of convolution, the names backwards
convolution and deconvolution* were used.
*This is a particularly bad name as deconvolution refers to the inverse, rather than transpose, of the convolution in signal processing. 28
Transposed convolution
For each input neuron,
multiply the kernel and add
(accumulate) the value in the
output.
Can accommodate strides,

padding, and multiple
channels.
29
Convolution visualized
Illustration due to: https://medium.com/apache-mxnet/convolutions-explained-with-ms-excel-465d6649831c

D. Mishra, Convolutions explained with… MS Excel!, Medium, 2018. 30
Transpose convolution visualized
Illustration due to: https://medium.com/apache-mxnet/transposed-convolutions-explained-with-ms-excel-52d13030c7e8

D. Mishra, Transposed convolutions explained with… MS Excel!, Medium, 2018. 31
2D trans. Conv. layer: Formal definition
Input tensor: 𝑌 ∈ ℝ𝐵×𝐶in×𝑚×𝑛 , 𝐵 batch size, 𝐶in # of input channels.
Output tensor: 𝑋 ∈ ℝ𝐵×𝐶out× 𝑚+𝑓1−1 × 𝑛+𝑓2 −1
, 𝐵 batch size, 𝐶out # of output channels, 𝑚, 𝑛 #
of vertical and horizontal indices.
Filter 𝑤 ∈ ℝ𝐶in×𝐶out×𝑓1×𝑓2 , bias 𝑏 ∈ ℝ𝐶out . (If bias=False, then 𝑏 = 0.)
def trans_conv(Y, w, b):
c_in, c_out, f1, f2 = w.shape
batch, c_in, m, n = Y.shape
X = torch.zeros(batch, c_out, m + f1 - 1, n + f2 - 1)
for k in range(c_in):
for i in range(Y.shape[2]):
for j in range(Y.shape[3]):
X[:, :, i:i+f1, j:j+f2] += Y[:, k, i, j].view(-1,1,1,1)*w[k, :, :, :].unsqueeze(0)
return X + b.view(1,-1,1,1)
32
Dependency by sparsity pattern
Input for conv Output for conv
In a matrix representation 𝐴 of convolution. The Output for trans.conv Input for trans.conv.
dependencies of the inputs and outputs are represented by
the non-zeros of 𝐴, i.e., the sparsity pattern of 𝐴.
If 𝐴𝑖𝑗 = 0, then input neuron 𝑗 does not affect the output
neuron 𝑖. If 𝐴𝑖𝑗 ≠ 0, then 𝐴⊤ 𝑗𝑖 ≠ 0. So if input neuron 𝑗
affects output neuron 𝑖 in convolution, then input neuron 𝑖
affects output neuron 𝑗 in transposed convolution.
We can combine this reasoning with our visual
understanding of convolution. The diagram simultaneously
illustrates the dependencies for both convolution and
transposed convolution.
33
Semantic segmentation
In semantic
segmentation, the goal is
to segment the image
into semantically
meaningful regions by
classifying each pixel.
34
Other related tasks
Object localization localizes a single object
usually via a bounding box.
Object detection detects many objects, with

the same class often repeated, usually via
bounding boxes.
35
Other related tasks
Instance segmentation distinguishes multiple instances of the same object type.
36
Pascal VOC
We will use PASCAL Visual Object Classes (VOC) dataset for
semantic segmentation.
(Dataset also contains labels for object detection.)
There are 21 classes: 20 main classes and 1 “unlabeled” class.
Data 𝑋1 , … , 𝑋𝑁 ∈ ℝ3×𝑚×𝑛 and labels 𝑌1 , … , 𝑌𝑁 ∈ 0,1, … , 20 𝑚×𝑛 ,
i.e., 𝑌𝑖 provides a class label for every pixel of 𝑋𝑖 .
37
Loss for semantic segmentation
Consider the neural network
𝑓𝜃 ∶ ℝ3×𝑚×𝑛 → ℝ𝑘×𝑚×𝑛
such that 𝜇 𝑓𝜃 𝑋 𝑖𝑗
∈ Δ𝑘 is the probabilities for the 𝑘 classes for pixel (𝑖, 𝑗).
We minimize the sum of pixel-wise cross-entropy losses

𝑁 𝑚 𝑛
ℒ 𝜃 = ෍ ෍ ෍ ℓCE 𝑓𝜃 𝑋𝑙 𝑖𝑗 , 𝑌𝑙 𝑖𝑗
𝑙=1 𝑖=1 𝑗=1
where ℓCE is the cross entropy loss.
38
U-Net
The U-Net architecture:
• Reduce the spatial dimension
to obtain high-level (coarse
scale) features
• Upsample or transpose
convolution to restore spatial
dimension.
• Use residual connections
across each dimension
reduction stage.
O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, Medical Image Computing and Computer-
Assisted Intervention, 2015. 39
Magnetic resonance imaging
Magnetic resonance imaging (MRI) is an inverse problem in which we partially* measure the
Fourier transform of the patient and the goal is to reconstruct the patient’s image.
So 𝑋true ∈ ℝ𝑛 is the true original image (reshaped into a vector) with 𝑛 pixels or voxels and
𝒜 𝑋true ∈ ℂ𝑘 with 𝑘 ≪ 𝑛. (If 𝑘 = 𝑛, MRI scan can take hours.)
Classical reconstruction algorithms rely on Fourier analysis, total variation regularization,

compressed sensing, and optimization.
Recent state-of-the-art use deep neural networks.
*We measure fewer points of the Fourier transform than there are pixels or voxels in the 2D or 3D image. 40
fastMRI dataset
A team of researchers from
Facebook AI Research and
NYU released a large MRI
dataset to stimulate data-
driven deep learning
research for MRI
reconstruction.
J. Zbontar, F. Knoll, A. Sriram, T. Murrell, Z. Huang, M. J. Muckley, A. Defazio, R. Stern, P. Johnson, M. Bruno, M. Parente, K. J. Geras, J. Katsnelson, H.
Chandarana, Z. Zhang, M. Drozdzal, A. Romero, M. Rabbat, P. Vincent, N. Yakubova, J. Pinkerton, D. Wang, E. Owens, C. L. Zitnick, M. P. Recht, D. K. 41
Sodickson, and Y. W. Lui, fastMRI: An open dataset and benchmarks for accelerated MRI, arXiv, 2019.
U-Net for inverse problems
Although U-Net was originally proposed as an architecture for semantic segmentation, it is
also being used widely as one of the default architectures in inverse problems, including
MRI reconstruction.
J. Zbontar, F. Knoll, A. Sriram, T. Murrell, Z. Huang, M. J. Muckley, A. Defazio, R. Stern, P. Johnson, M. Bruno, M. Parente, K. J. Geras, J. Katsnelson, H.
Chandarana, Z. Zhang, M. Drozdzal, A. Romero, M. Rabbat, P. Vincent, N. Yakubova, J. Pinkerton, D. Wang, E. Owens, C. L. Zitnick, M. P. Recht, D. K. 42
Sodickson, and Y. W. Lui, fastMRI: An open dataset and benchmarks for accelerated MRI, arXiv, 2019.
Computational tomography
Computational tomography (CT) is an inverse problem in which we partially* measure the
Radon transform of the patient and the goal is to reconstruct the patient’s image.
So 𝑋true ∈ ℝ𝑛 is the true original image (reshaped into a vector) with 𝑛 pixels or voxels and
𝒜 𝑋true ∈ ℝ𝑘 with 𝑘 ≪ 𝑛. (If 𝑘 = 𝑛, the X-ray exposure to perform the CT scan can be
harmful.)
Recent state-of-the-art use deep neural networks.
*We measure fewer points of the Radon transform than there are pixels or voxels in the 2D or 3D image. 43
U-Net for CT reconstruction
U-Net is also used as one of the default architectures in CT reconstruction
K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, Deep convolutional neural network for inverse problems in imaging, IEEE TIP, 2017. 44

Chapter 4

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chapter 4

Uploaded by

Copyright:

Available Formats

Chapter 4:

CNNs for Other Supervised

Nowadays, all state-of-the-art denoising

If we have 𝑋1 , … , 𝑋𝑁 and 𝑌1 , … , 𝑌𝑁 (but no direct knowledge of 𝒜), we can solve

ILR Conv.1 ReLu.1 Conv.D-1 ReLu.D-1 Conv.D (Residual) HR

bicubic interpolation SRGAN ground truth

bicubic interpolation SRGAN ground truth

bicubic interpolation SRGAN ground truth

R. Zhang, P. Isola, and A. A. Efros, Colorful image colorization, ECCV, 2016. 16

input output ground truth

Let 𝑓 ∶ ℝ𝑛 → ℝ𝑚 be linear, i.e.,

There exists a matrix 𝐴 ∈ ℝ𝑚×𝑛 that represents 𝑓 ∶ ℝ𝑛 → ℝ𝑚 , i.e.,

Given a linear 𝑓 ∶ ℝ𝑛 → ℝ𝑚 , we can find the matrix

Another approach is to use the adjoint view. Since

For more complicated linear operators, this is how

5.2500 5.6875 6.5625 7.0000

3 4 3.7500 4.0625 4.6875 5.0000

'linear' interpolation is available for 1D data

Can accommodate strides,

Illustration due to: https://medium.com/apache-mxnet/convolutions-explained-with-ms-excel-465d6649831c

Illustration due to: https://medium.com/apache-mxnet/transposed-convolutions-explained-with-ms-excel-52d13030c7e8

Object detection detects many objects, with

There are 21 classes: 20 main classes and 1 “unlabeled” class.

Data 𝑋1 , … , 𝑋𝑁 ∈ ℝ3×𝑚×𝑛 and labels 𝑌1 , … , 𝑌𝑁 ∈ 0,1, … , 20 𝑚×𝑛 ,

i.e., 𝑌𝑖 provides a class label for every pixel of 𝑋𝑖 .

We minimize the sum of pixel-wise cross-entropy losses

where ℓCE is the cross entropy loss.

Classical reconstruction algorithms rely on Fourier analysis, total variation regularization,

Recent state-of-the-art use deep neural networks.

Recent state-of-the-art use deep neural networks.

You might also like