Michael Dorkenwald Eml2018 Report PDF

Report
Explainable Machine Learning
Dynamic Routing Between Capsules
Author: Michael Dorkenwald
Supervisor: Dr. Ullrich Köthe
28 June 2018

Contents
1 Introduction
2 Motivation
3 CapsuleNet
3.1 Capsules
3.2 Architecture
3.3 Routing Algorithm
3.4 Example How the Routing Algorithm Works
3.5 Margin Loss
3.6 Reconstruction
4 Results
4.1 MNIST Data Set
4.1.1 Instantiation Parameters
4.1.2 Robustness
4.2 MultiMNIST
4.3 CIFAR10
5 Discussion
1 Introduction
Artificial neural networks are arguably the most active topic in machine learning. In the last few
years, many new developments have enhanced neural networks and made them more accessible.
However, most of them were incremental, such as adding more layers or improving individual layer
types (e.g. Batch Normalization), and did not introduce a new type of architecture. This is why the
paper discussed here [1] is particularly interesting: on the one hand it was published by Geoffrey
Hinton and his team, and on the other it introduces a completely new architecture based on capsules.
Hinton is one of the founders of deep learning and an inventor of various models and algorithms that
are widely used today. He has pursued the idea of capsules for a long time and has now published a
functioning network that achieves state-of-the-art performance on MNIST.
2 Motivation
A Convolutional Neural Network (ConvNet or CNN) is a specific type of deep neural network in which
a model learns to perform classification tasks directly from images or videos. CNNs are useful for
finding patterns in images in order to recognize objects, faces, and scenes. They have been successful
in identifying faces, objects and traffic signs, which is an important component for powering vision in
robots and self-driving cars.
During training, the different layers of a CNN learn different types of features. The convolutional
layers close to the input learn low-level features like edges or color gradients. The convolutional
layers close to the fully connected layers (output) learn high-level features, which are combinations
of low-level features. The dense layers combine these high-level features and produce the
classification. This is shown in the figure below.
Figure 1: Different types of features of a CNN [2]
For example, if you want to classify a ship or a horse, the first layer detects small curves and
edges. The second layer might detect straight lines or smaller shapes, like the mast of a ship or
the curvature of the tail. Higher layers start to capture more complex shapes like the entire tail
or the ship hull. The final layers try to see a more holistic picture like the entire ship or the
entire horse.
In a Convolutional Neural Network, max-pooling is used to reduce the spatial size of the feature maps.
This keeps the computation time reasonable and helps to avoid overfitting, because it decreases the
number of neurons in the network. On the other hand, max-pooling keeps only the dominant feature of a
specific field of view. We thereby lose the spatial information about where this value came from, and
as a result we gain a small invariance to changes in viewpoint. This is illustrated in the following
figure.
Figure 2: Max-pooling [3]
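The loss of position information under max-pooling can be made concrete with a small sketch. The 2x2 non-overlapping pooling window and the toy 4x4 inputs are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max-pooling on a 2D array with even side lengths."""
    h, w = x.shape
    # Index [i, di, j, dj] of the reshaped array corresponds to x[2*i + di, 2*j + dj].
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two toy inputs (hypothetical): the same feature at two different positions
# inside the same 2x2 pooling window.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)

# The pooled maps are identical: the exact position inside the window is lost.
print(np.array_equal(pooled_a, pooled_b))
```

Both inputs yield the same pooled map, which is the small viewpoint invariance described above, bought at the price of discarding where the feature was.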
Max-pooling is a crutch that makes CNNs work really well. Looking at the performance of Convolutional
Neural Networks, the error on nearly all datasets has decreased over the last years, and CNNs achieve
superhuman performance on several benchmarks. But the loss of spatial relations is a thorn in the side
for Geoffrey Hinton. He says: "The pooling operation used in convolutional neural networks is a big
mistake and the fact that it works so well is a disaster" [4]. Hinton is looking for equivariance,
which means that changes in viewpoint lead to corresponding changes in neural activities. Therefore we
need a different architecture which does not use max-pooling layers.
3 CapsuleNet
3.1 Capsules
A capsule is a small group of neurons that learns to detect a particular object (e.g., a rectangle)
within a given region of the image. There are many different ways to implement the basic idea of
capsules. The authors decided that the length of the output vector of a capsule should represent the
probability that a specific pattern or object is present in the input image. The individual dimensions
of the vector encode the instantiation parameters of that entity, which are learned by the network.
These dimensions are examined in the results section. The authors use a 'squashing' function to ensure
that the length of the output can be read as a probability and to introduce a non-linearity into the
network. The difference between this non-linearity and those of a standard neural network, such as
ReLU or sigmoid, is that it is applied to the whole capsule vector instead of to each neuron
separately. The 'squashing' function is defined as:
$$ v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} $$
with $v_j$ the output vector of capsule j and $s_j$ its input.
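A minimal NumPy sketch of the squashing function may help to see its effect on short and long vectors. The function name `squash` and the small constant `eps` (added only to avoid a division by zero for all-zero inputs) are assumptions of this sketch and not part of the definition above.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Squash vectors along `axis`: the direction is kept, the length is mapped into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # ||s||^2 / (1 + ||s||^2)
    return scale * s / np.sqrt(sq_norm + eps)  # times the unit vector s / ||s||

short = squash(np.array([0.1, 0.0]))   # short input vector
long = squash(np.array([10.0, 0.0]))   # long input vector

print(np.linalg.norm(short), np.linalg.norm(long))
```

Short vectors are shrunk to almost zero length and long vectors approach length 1 without reaching it, so the length can be interpreted as a probability.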
3.2 Architecture
Like regular neural networks, the Capsule Network consists of multiple layers, but it is shallow: it
consists of only two convolutional layers and one fully connected layer. A simple CapsNet architecture
is shown in the figure below.
The first layer is a standard convolutional layer with 256 channels and a 9x9 window, followed by a
ReLU activation. The job of the first layer is to detect basic features in the 2D input image. It
therefore transforms the pixel values of the input image into local feature activities, which are the
input for the primary capsules.
The second layer is a convolutional capsule layer with 32 primary capsules, whose job is to combine the
basic features detected in the first layer. Each capsule applies eight convolutional kernels to the
input volume and creates a 6x6x8 output tensor. Since there are 32 such capsules, the output volume has
the shape 6x6x8x32.
The third and last layer consists of the 10 digit capsules, one for each digit (MNIST dataset). Each
digit capsule takes the 6x6x32 8D vectors as input and outputs a 16D vector. Between these two layers
a routing algorithm is applied, which is the core contribution of the paper.
Figure 3: Architecture of the Capsule Network [1]
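The tensor shapes quoted above can be checked with a few lines of arithmetic. The sketch assumes a 28x28 MNIST input, no padding, stride 1 in the first layer, and a 9x9 kernel with stride 2 in the primary capsule layer; the strides and the second kernel size follow the original paper [1] and are not stated in this report. The helper `conv_output_size` is a hypothetical name.

```python
def conv_output_size(in_size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (in_size - kernel) // stride + 1

# Layer 1: 9x9 convolution, stride 1 (stride assumed from the paper [1]).
conv1 = conv_output_size(28, kernel=9, stride=1)

# Layer 2: primary capsules, 9x9 convolution, stride 2 (assumed from the paper [1]).
primary = conv_output_size(conv1, kernel=9, stride=2)

# 32 primary capsule types on a primary x primary grid, each producing an 8D vector.
num_primary_capsules = primary * primary * 32

# One 8x16 weight matrix W_ij for every (primary capsule, digit capsule) pair.
routing_weights = num_primary_capsules * 10 * 8 * 16

print(conv1, primary, num_primary_capsules, routing_weights)
```

Under these assumptions the first layer yields a 20x20 grid, the primary capsule layer a 6x6 grid, and the digit capsules receive 6x6x32 = 1152 input vectors of dimension 8.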
3.3 Routing Algorithm

Between the two capsule layers, the routing algorithm is applied. The algorithm follows a specific
procedure which can be seen in Figure 4. This procedure can also be applied to Capsule Networks with
more capsule layers.
For all capsule layers except the first one, a weight matrix $W_{ij}$ is applied to the output of the
previous capsule layer in order to encode important relationships between lower-level features (e.g.
mouth, nose) and higher-level features (e.g. face).
$$ \hat{u}_{j|i} = W_{ij} u_i $$
Here $\hat{u}_{j|i}$ is the 'prediction vector', i.e. the estimate made by capsule i of what the output
of capsule j in the layer above would be, and $u_i$ is the output of capsule i in the previous layer.
With the help of the prediction vectors we can compute the output of capsule j, which is the squashed
weighted sum over all these vectors:
$$ s_j = \sum_i c_{ij} \hat{u}_{j|i} $$
$$ v_j = \mathrm{squash}(s_j) $$
where the $c_{ij}$ are the coupling coefficients. The final output is $v_j$. At first glance, this
calculation looks similar to the one in fully connected layers, where a neuron weights its inputs
before adding them up. In the neuron case, these weights are updated during backpropagation, but in
the case of capsules, the coupling coefficients are determined by the dynamic routing process. This can
be interpreted as follows: the dynamic routing algorithm determines where each capsule's output goes.
The coupling coefficients are normalized, which means that the coefficients between capsule i and all
the capsules in the layer above sum to 1. They are determined by a 'routing softmax' whose initial
logits $b_{ij}$ are the log prior probabilities that capsule i should be coupled to capsule j.
$$ c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} $$
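As a small illustration, the routing softmax can be written in NumPy as follows. The array layout (rows are lower-level capsules i, columns are upper-level capsules j) and the function name `routing_softmax` are assumptions of this sketch; subtracting the row maximum is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def routing_softmax(b):
    """Softmax over the upper-layer axis j for logits b of shape (num_lower, num_upper)."""
    shifted = b - b.max(axis=1, keepdims=True)  # for numerical stability only
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# With all logits equal, every lower capsule spreads its output evenly.
b = np.zeros((3, 2))
c = routing_softmax(b)
print(c)
```

Each row of `c` sums to 1, which is the normalization between capsule i and all capsules in the layer above.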
The log priors $b_{ij}$ can be learned at the same time as all other weights. They depend on the
location and type of the two capsules but not on the current image. The initial coupling coefficients
are then iteratively refined depending on the agreement, which is defined as the scalar product:
$$ a_{ij} = v_j \cdot \hat{u}_{j|i} $$
The agreement score therefore takes into account both the likelihood and the feature properties,
instead of only the likelihood as in ordinary neurons. In addition, $b_{ij}$ remains low if the
activation $u_i$ of capsule i is low, since the length of $\hat{u}_{j|i}$ is proportional to the length
of $u_i$. For example, $b_{ij}$ should remain low between the mouth capsule and the face capsule if the
mouth capsule is not activated. The agreement can simply be added to the log priors, because at the
beginning of each iteration the $b_{ij}$ are normalized by the 'routing softmax'. On the MNIST dataset,
the authors showed that 1 to 3 iterations are sufficient. Dynamic routing is not a complete replacement
for backpropagation, because the transformation matrices $W_{ij}$ are still trained with it.
The coupling coefficient $c_{ij}$ quantifies the connection between a capsule and its parent capsules;
this value is important but short-lived. The underlying logits $b_{ij}$ are re-initialized to 0 for
every data point before the dynamic routing calculation. To compute a capsule output, whether during
training or testing, the dynamic routing calculation always has to be redone.
Figure 4: Routing Algorithm [1]
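The procedure described in this section can be summarized in a short NumPy sketch for a single input. It is an illustration under assumptions, not the authors' implementation: the logits start at 0 as described above, the array shapes and the names `squash` and `dynamic_routing` are chosen for this example, and the random weights stand in for trained matrices $W_{ij}$.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squashing non-linearity applied along the last axis."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return sq / (1.0 + sq) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """Route prediction vectors u_hat of shape (num_lower, num_upper, dim_upper).

    Returns the outputs v (num_upper, dim_upper) and the coupling
    coefficients c (num_lower, num_upper) used in the last iteration.
    """
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))           # logits start at 0 for every input
    for _ in range(num_iterations):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)       # routing softmax over j
        s = np.einsum("ij,ijd->jd", c, u_hat)      # weighted sum over lower capsules i
        v = squash(s)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)  # add agreement a_ij = v_j . u_hat_j|i
    return v, c

# Toy sizes (hypothetical): 6 lower capsules (8D) routed to 3 upper capsules (16D).
rng = np.random.default_rng(0)
u = rng.normal(size=(6, 8))
W = rng.normal(size=(6, 3, 16, 8)) * 0.1   # random stand-in for trained W_ij
u_hat = np.einsum("ijdk,ik->ijd", W, u)    # prediction vectors u_hat_j|i = W_ij u_i
v, c = dynamic_routing(u_hat)
print(v.shape, c.shape)
```

Only the coupling coefficients are found by this inner loop; the matrices `W` would still be trained by backpropagation through the routed output.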
3.4 Example How the Routing Algorithm Works

In this example, we focus on a rectangle capsule and a triangle capsule, with which we can build a
boat or a house. Furthermore, we reduce the number of instantiation parameters to one, which
represents the rotation. In the figure below, the lowest window shows the input image, which consists
of a boat (left) and a house (right). When the input image is passed into the primary capsule layer,
the corresponding capsules get activated. The length of each vector corresponds to the probability
that a rectangle or a triangle is present in the image, and the orientation of the vector corresponds
to the rotation of the rectangle or triangle.
Figure 5: Two-layer capsule network [6]
For now, we focus only on the boat as an input image and look at how the CapsNet processes it. The
rectangle and the triangle could each be part of a house or a boat. Taking the pose of the rectangle
into account, the output house or boat would have to be slightly rotated (left side of the figure
below). The same holds for the triangle: because of its rotation, the output house would stand almost
upside down. As a result, the triangle and the rectangle strongly agree on the boat but disagree on
the house.
Figure 6: Predicting the presence and pose of objects based on the presence and pose of object parts [6]
Since it is very likely that the object is a boat, it makes sense to send the output of the primary
capsules mostly to the boat capsule and less to the house capsule. The boat capsule thereby receives a
more useful input signal. This is realized by updating the coupling coefficients. Because of the
strong agreement, the coupling coefficients towards the boat capsule become large and those towards
the house capsule very small. Within two to three iterations, the house capsule (top window of Figure
5) shrinks to a tiny vector and the boat capsule becomes long, which corresponds to a high probability
that a boat is present in the input image.
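The boat and house example can be replayed numerically with a toy sketch. All numbers are hypothetical: the prediction vectors are 2D unit vectors whose direction encodes the predicted rotation, and the angles (16 degrees for the agreeing predictions, -165 degrees for the upside-down house) are chosen only for illustration.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squashing non-linearity applied along the last axis."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return sq / (1.0 + sq) * s / np.sqrt(sq + eps)

def unit(angle_deg):
    """2D unit vector whose direction encodes a rotation angle."""
    a = np.deg2rad(angle_deg)
    return np.array([np.cos(a), np.sin(a)])

# u_hat[i, j]: prediction of part i (0 = rectangle, 1 = triangle)
# for the pose of whole j (0 = boat, 1 = house). Angles are made up.
u_hat = np.array([
    [unit(16), unit(16)],     # rectangle: boat and house both slightly rotated
    [unit(16), unit(-165)],   # triangle: same boat pose, house almost upside down
])

b = np.zeros((2, 2))
for _ in range(3):
    c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # routing softmax
    v = squash(np.einsum("ij,ijd->jd", c, u_hat))         # boat and house outputs
    b = b + np.einsum("ijd,jd->ij", u_hat, v)             # add agreement

boat_len, house_len = np.linalg.norm(v, axis=1)
print(c)
print(boat_len, house_len)
```

Because the two predictions for the house nearly cancel, the house vector stays tiny, while the agreeing boat predictions reinforce each other and pull the coupling coefficients towards the boat capsule.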
3.5 Margin Loss

The length of the instantiation vector corresponds to the probability that a specific entity is
present in the input image. We would like the top-level capsule for digit class k to have a long
instantiation vector if and only if that digit is present in the image. To allow images with multiple
labels (more than one digit), the authors use a separate margin loss $L_k$ for each digit capsule k:
$$ L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2 $$
where $T_k = 1$ if a digit of class k is present, $m^+ = 0.9$, $m^- = 0.1$ and $\lambda = 0.5$. The
loss function is very similar to the SVM loss function. During training, one loss value is calculated
for each of the 10 digit capsule vectors of an image according to the formula above, and the 10 values
are then added together to obtain the final loss. This pushes the length of the capsule of the correct
class above 0.9 and the lengths of the remaining capsules below 0.1.
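A minimal NumPy sketch of the margin loss for a single image, using the constants given above. The function name `margin_loss`, the array layout and the toy digit capsule outputs are assumptions made for illustration.

```python
import numpy as np

def margin_loss(v, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of the per-class margin losses L_k for one image.

    v: digit capsule outputs, shape (num_classes, dim).
    targets: T_k as a 0/1 vector of shape (num_classes,).
    """
    lengths = np.linalg.norm(v, axis=-1)
    present = targets * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_minus) ** 2
    return float(np.sum(present + absent))

# Toy outputs (hypothetical): class 3 is correct and long enough,
# class 7 is absent but has length 0.3, all other capsules have length 0.
v = np.zeros((10, 16))
v[3, 0] = 0.95
v[7, 0] = 0.3
t = np.zeros(10)
t[3] = 1.0

loss = margin_loss(v, t)
print(loss)
```

In this toy case only the absent class 7 contributes, with 0.5 * (0.3 - 0.1)^2 = 0.02; the correct class already exceeds 0.9 and the remaining capsules are below 0.1.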
3.6 Reconstruction
The image is reconstructed from the correct digit capsule (a 16D vector); the other digit capsules are
masked out. For this purpose, the authors use a decoder similar to the decoder of a typical
autoencoder. The decoder acts as a regularizer: it takes the output of the correct DigitCaps capsule as
input and learns to recreate the 28 by 28 pixel image, with the loss function being the Euclidean
distance between the reconstructed image and the input image. As a result, the decoder forces the
capsules to learn features that are useful for reconstructing the original image. The reconstruction
loss is scaled down by a factor of 0.0005 so that it does not dominate the margin loss.
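The masking step and the combination of the two loss terms can be sketched as follows. The decoder itself is omitted; the names `mask_digit_caps` and `total_loss` are hypothetical, and measuring the reconstruction error as the sum of squared pixel differences is an assumption of this sketch.

```python
import numpy as np

def mask_digit_caps(v, label):
    """Keep only the capsule of the given class and flatten to the decoder input."""
    masked = np.zeros_like(v)
    masked[label] = v[label]
    return masked.reshape(-1)   # 10 x 16 -> 160 values

def total_loss(margin, reconstruction, image, scale=0.0005):
    """Margin loss plus the scaled-down reconstruction error (sum of squared differences)."""
    recon_err = np.sum((reconstruction - image) ** 2)
    return margin + scale * recon_err

# Toy values (hypothetical): all capsule entries are 1, the correct class is 3.
v = np.ones((10, 16))
decoder_input = mask_digit_caps(v, label=3)

# Worst case for a 28 x 28 image with pixels in [0, 1]: every pixel is off by 1.
loss = total_loss(0.02, np.zeros((28, 28)), np.ones((28, 28)))
print(decoder_input.shape, loss)
```

Even in this worst case the reconstruction term contributes only 0.0005 * 784 = 0.392, which illustrates why the scaling keeps it from dominating the margin loss.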
Figure 7: Decoder Network [1]
4 Results
The authors evaluated their capsule network on different datasets and achieved state-of-the-art
performance on MNIST and MultiMNIST. The test errors for these experiments are shown in the figure
below.
Figure 8: CapsNet classification test accuracy [1]
4.1 MNIST Data Set

First, they evaluated their model on MNIST, a dataset consisting of images with one digit each. As
data augmentation, they only shifted the images by up to 2 pixels in each direction with zero padding.
The dataset consists of 60k training examples and 10k test examples. They compared their model only
to those which used the same type of data augmentation. The figure above (Figure 8) shows the
importance of the routing and of the reconstruction regularizer. Without the reconstruction part, the
capsule net with 3 routing iterations would not be better than the version with one iteration.
As a baseline model, they used a CNN with 3 convolutional layers with 256, 256 and 128 channels, 5x5
kernels and stride 1. The last convolutional layer is followed by two fully connected layers of size
328 and 192. The last fully connected layer is connected with dropout to a 10-class softmax layer with
cross-entropy loss. The baseline is also trained on 2-pixel shifted MNIST with the Adam optimizer. The
capsule network achieved a test error of 0.25% with only 3 layers; similar results had previously only
been achieved by deeper networks.
4.1.1 Instantiation Parameters
They investigated what each individual dimension of a digit capsule represents. To do so, they fed a
perturbed version of the activity vector into the decoder and observed how this influenced the
reconstructed image. This is shown in the figure below. One dimension of the digit capsule almost
always represents the width of the digit. Other dimensions represent, for example, localization,
scale, and thickness. These dimensions are easier to interpret than the layers of a standard
convolutional neural network.
Figure 9: Dimension perturbations [1]
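The perturbation experiment can be sketched without the decoder by generating the perturbed activity vectors only. The range of -0.25 to 0.25 in steps of 0.05 follows the original paper [1] and is not stated in this report; the function name `perturb_dimension`, the toy activity vector and the chosen dimension are hypothetical.

```python
import numpy as np

def perturb_dimension(v, dim, deltas):
    """Return one copy of v per delta, each with delta added to v[dim]."""
    out = np.tile(v, (len(deltas), 1))
    out[:, dim] += deltas
    return out

# 11 steps of 0.05 between -0.25 and 0.25 (range assumed from the paper [1]).
deltas = np.linspace(-0.25, 0.25, 11)

# Toy 16D activity vector of the correct digit capsule (hypothetical values).
v = np.full(16, 0.1)
variants = perturb_dimension(v, dim=5, deltas=deltas)
print(variants.shape)
```

Each row of `variants` would then be passed through the trained decoder, so that the resulting row of reconstructions shows what the perturbed dimension encodes.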
4.1.2 Robustness
They examined the robustness of their model and compared it to a traditional convolutional neural
network. To this end, they applied small affine transformations to the MNIST images and investigated
the impact on the results. The capsule network achieved 79% accuracy, whereas a traditional CNN with
the same number of parameters achieved only 66%. The models were trained on the normal MNIST dataset
and only evaluated on the transformed images.
4.2 MultiMNIST
First of all, they had to create this dataset. To do so, they took every image from the MNIST dataset
and overlaid it with a digit from a different class. Each digit is shifted by up to 4 pixels in each
direction, resulting in a 36 × 36 image. For each digit in the MNIST dataset they generated 1K
MultiMNIST examples. As a result, the training set consists of 60M images and the test set of 10M.
Examples can be seen in the first row of the figure below.
Figure 10: MultiMNIST [1]
For this experiment, they used a capsule network with 3 routing iterations. The two reconstructed
digits are overlaid in green and red in the second row. L:(l1, l2) are the labels of the input image
and R:(r1, r2) are the digits used for the reconstruction. The two rightmost columns show two examples
with a wrong classification, reconstructed from the input label and from the predicted label. The
other columns have correct classifications and show that the model accounts for all the pixels while
being able to assign one pixel to two digits in extremely difficult scenarios. They treated the two
most active digit capsules as the classification produced by the capsule network. For the
reconstruction, they picked one digit at a time and used the activity vector of the chosen digit
capsule to reconstruct the image of the chosen digit.
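The dataset construction described at the start of this section can be sketched in NumPy. This is an illustration under assumptions: the digits are centred on the 36 × 36 canvas before shifting, the two digits are merged with a pixel-wise maximum, and the caller is expected to pick two digits of different classes. The function names and the toy "digits" are hypothetical.

```python
import numpy as np

def place_shifted(digit, rng, canvas_size=36, max_shift=4):
    """Place a 28x28 digit on a larger canvas, shifted by up to max_shift pixels per axis."""
    canvas = np.zeros((canvas_size, canvas_size), dtype=digit.dtype)
    h, w = digit.shape
    centre = (canvas_size - h) // 2                       # 4 for 28 -> 36
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    canvas[centre + dy: centre + dy + h, centre + dx: centre + dx + w] = digit
    return canvas

def make_multimnist_example(digit_a, digit_b, rng):
    """Overlay two shifted digits; the pixel-wise maximum is an assumed merge rule."""
    return np.maximum(place_shifted(digit_a, rng), place_shifted(digit_b, rng))

# Toy stand-ins for two MNIST digits of different classes (hypothetical data).
rng = np.random.default_rng(0)
a = np.zeros((28, 28)); a[10:18, 10:18] = 1.0
b = np.zeros((28, 28)); b[5:9, 20:24] = 1.0

example = make_multimnist_example(a, b, rng)
print(example.shape)
```

With shifts of up to 4 pixels in each direction, a 28 × 28 digit always fits inside the 36 × 36 canvas, which matches the image size stated above.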
Their 3-layer network achieved a better performance than the baseline model. The baseline model
consists of 2 convolutional layers (each followed by a max-pooling layer) and 2 fully connected layers
for classification. The baseline model has 25M parameters, the capsule network only 11M.
4.3 CIFAR10
They evaluated CapsNet on the CIFAR10 dataset and achieved a test error of 10.6% with an ensemble of
7 models, each trained with 3 routing iterations. The architecture is similar to the one used for
MNIST; they only increased the number of primary capsules because the input image has 3 channels. They
stated that standard CNNs had about the same error rate when they were first applied to this dataset.
5 Discussion
The authors of the paper have presented a routing algorithm that defines how capsules interact with
each other. This kind of network could perhaps one day replace convolutional neural networks. CNNs have
become dominant in computer vision tasks such as object detection, but there are signs that they may
be replaced by other networks. One sign is, for instance, their difficulty in generalizing to novel
viewpoints. If the viewpoint changes, the neural activities in a CapsNet vary correspondingly, rather
than the viewpoint variation being eliminated from the neural activity as in a CNN.
On MNIST, CapsNet has achieved state-of-the-art performance, but it has not yet been tested on large
datasets such as ImageNet. The authors have also shown that their model produces better results on
MultiMNIST than a regular convolutional neural network. The reconstructions illustrate that CapsNet is
able to segment the image into the two original digits. This is very promising for later use in object
detection. Furthermore, CapsNet has also shown better robustness to affine transformations than a
regular convolutional neural network with the same number of parameters.
Overall, the activation vectors are easier to interpret, as can be seen in Figure 9, where the
dimensions represent for instance the thickness, scale or rotation. A problem with the routing
algorithm is the time it needs for training, because of the inner loop. However, research on capsule
networks is at an early stage and there are good reasons to believe that it is a better approach than
the current networks, but it will take a lot of effort to outperform a highly developed network.
References
[1] Paper Dynamic Routing Between Capsules: https://arxiv.org/abs/1710.09829
[2] CNN Features Image: http://www.iro.umontreal.ca/~bengioy/talks/DL-Tutorial-NIPS2015.pdf
[3] Max Pooling Image: https://cs231n.github.io/convolutional-networks/#pool
[4] Quote Geoffrey Hinton: https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/clyj4jv/
[5] AlexNet image: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_segmentation.html
[6] Routing Example: https://www.oreilly.com/ideas/introducing-capsule-networks
