The 2018 International Conference On Control Automation & Information Sciences (ICCAIS 2018)
October 24-27, 2018, Hangzhou, China.
The Analysis Between Traditional Convolution Neural Network and CapsuleNet
Feng Yang (1, 2)
1. Key Laboratory of Information Fusion Technology, Ministry of Education, Northwestern Polytechnical University, Xi'an, China
2. Science and Technology on Electro-optic Control Laboratory, Luoyang, China
yangfeng@nwpu.edu.cn

Wentong Li
Key Laboratory of Information Fusion Technology, Ministry of Education, Northwestern Polytechnical University, Xi'an, China
liwentong@mail.nwpu.edu.cn

Weikang Tang
Key Laboratory of Information Fusion Technology, Ministry of Education, Northwestern Polytechnical University, Xi'an, China
tangweikang725@foxmail.com

Xinliang Wu
China National Aeronautical Radio Electronics Research Institute, Shanghai, China
wuxinliang51@163.com

Abstract: Convolutional neural networks (CNNs) have made a series of breakthroughs in fields such as autonomous driving, robotics, and medical treatment. The powerful feature learning and classification abilities of CNNs have attracted wide attention from scholars. With improvements in network structure and increases in network depth, the performance of tasks such as image classification, detection, recognition, and segmentation has improved greatly. Recently, CapsuleNet, which differs from CNNs, has been proposed. In CapsuleNet, vectors replace traditional scalar points to better characterize the relationships between feature information and to capture more attributes of things. In this paper, the basic structure and algorithmic principles of classical CNN models are described first, and then the basic neuron, network structure, parameter update, and distribution process of the CapsuleNet model are analyzed and summarized. Finally, several groups of training experiments are conducted to compare and analyze the image classification and feature expression capabilities of CapsuleNet and CNN models.

Index Terms: CNN, CapsuleNet, VGGNet, GoogleNet, ResNet

978-1-5386-6020-1/18/$31.00 ©2018 IEEE

I. INTRODUCTION

The extraction and classification of image features have always been a fundamental and important research direction in the field of computer vision. The CNN provides an end-to-end learning model. The parameters of the model can be trained to complete feature extraction and classification at different levels of an image. The information processed by CNN models is scalar. Dot-product convolution and local information extraction are achieved through weight sharing and pooling, respectively, which greatly reduces the number of training parameters and improves the performance of the model training process, thereby speeding up image feature extraction and learning. However, this also means that the CNN always activates on local information, has a weak ability to associate global information, and is unable to express more complex and abstract attributes of objects.

In the CapsuleNet [1] model, a capsule is a set of neurons whose input and output vectors implement the encapsulation of an entity. The length of a vector describes the probability of the entity, and the direction of the vector represents the properties of the corresponding entity. For information processing, the CapsuleNet model keeps a convolutional network to obtain the information of the receptive field. The processing and transformation of vector information replaces traditional max-pooling, which preserves the precise feature information of the entity's pose (position, size, orientation), deformation, velocity, albedo, hue, texture, and so on. For parameter updates, the CapsuleNet model uses a dynamic routing algorithm in addition to the traditional back-propagation gradient descent method [2]. This alleviates the vanishing gradient problem during training to a certain degree and improves the expressive ability of the model. In order to accurately learn higher-order abstract features and to solve difficult tasks in artificial intelligence such as image classification [3], object detection [4], and semantic understanding [5], a comparative analysis of traditional CNNs and CapsuleNet is of great significance.

II. TRADITIONAL CNN MODEL

As an important research branch in the field of neural networks, the CNN model has developed rapidly from LeNet-5 [6], AlexNet [7], VGGNet [8], and GoogleNet [9] to the deeper ResNet [10] and DenseNet [11]. CNN models have been greatly improved in depth, network structure, and generalization capability. In this section, the classic CNN
models including VGGNet, Google Inception Net, and ResNet are analyzed.

A. VGGNet

As shown in Fig.1, VGGNet explored the relationship between the depth and the performance of a CNN. By repeatedly stacking 3*3 convolution layers and 2*2 max-pooling layers, VGGNet successfully built deep CNN models with 11 to 19 weight layers. VGGNet has 5 convolution segments. Each segment has 2-3 layers, and a max-pooling layer is connected to the end of each segment to reduce the picture size. The number of kernels is the same within each segment, and across the five segments it is 64 - 128 - 256 - 512 - 512. A very useful design in VGGNet is to stack multiple identical convolutional layers together. Two 3*3 convolutional layers in series have the same receptive field as one 5*5 convolutional layer, i.e. one pixel is associated with the surrounding 5*5 pixels. Three 3*3 convolutional layers in series have the same receptive field as one 7*7 convolutional layer, but they have fewer parameters than a 7*7 convolutional layer and the model has more nonlinear transformations. In this way, the learning ability of the CNN is more robust. VGGNet shows that, within the range of depths it explored, the deeper the network, the better the performance.

Fig. 1. The structure of VGGNet [8].

B. Google Inception Net

The inception module, which is the core network structure of the Google Inception Net model, combines different convolutional layers in parallel. As shown in Fig.2, the inception module has 4 parallel branches. The first branch performs a 1*1 convolution on the input, which is an important structure proposed by NIN [12]. The 1*1 convolution can not only increase the expressive ability of the network by organizing information across channels, but also reduce the dimension of the output channels and thus the number of training parameters. In the second branch, a 1*1 convolution first reduces the channel dimension, and it is followed by a 3*3 convolution with a small receptive field. In this case, two nonlinear transformations of the information are carried out. The third branch is similar to the second branch: a 1*1 convolution first reduces the channel dimension, and it is followed by a 5*5 convolution with a large receptive field. The fourth branch is different. A 3*3 max-pooling first extracts the salient features, and then a 1*1 convolution reduces the channel dimension. Finally, the four branches are merged along the channel dimension. The output of the inception module incorporates multiple types of features extracted through three different sizes of convolutional layers and one max-pooling layer, which increases the adaptability of the network to different scales. In this way, the depth and width of the network can be expanded efficiently, and the accuracy can be increased without overfitting. During the development of the Google Inception Net model, the inception module has also been gradually improved.

Fig. 2. Inception module structure [9]

Google Inception Net includes four versions: Inception V1 [9], Inception V2 [13], Inception V3 [14], and Inception V4 [15]. All of them are assembled from inception modules. The depth of the models gradually increased, and the top-5 error rate dropped from 6.67% to 3.08% [9]. Among them, Inception V2 introduced the Batch Normalization (BN) method to standardize the data inside the network. With BN, the output of each layer is normalized (to zero mean and unit variance), which keeps the input data of the following layer stable. This not only accelerates the convergence of training, but also increases the generalization ability of the model, which provides strong support for the training of large-scale CNN models.

C. ResNet

If layers are simply stacked onto the previous layers to increase the depth of the network, vanishing gradients appear during the training process, which makes
the training process of the deep network more difficult. As the depth of the network increases, its performance gradually saturates and even begins to decline. ResNet addresses the problem that "as the network becomes deeper and deeper, the accuracy rate drops". It successfully trained a 152-layer deep neural network, bringing the top-5 error rate down to 3.57% [10]. At the same time, the number of parameters is lower than that of VGGNet, and the effect of the model is very prominent. The core module of ResNet is the residual unit, and its core idea is to introduce a shortcut connection that can skip one or more layers. As shown in Fig.3, the input of a certain block is x and the desired output is H(x). The input x bypasses the weight layers and is passed directly to the output as the initial result. The learning objective is then only the difference between the output and the input, that is, the residual F(x) = H(x) - x. With this changed learning goal, the stacked layers gradually fit a residual mapping. The shortcut connection sends the input information directly to the output, which protects the integrity of the information. The entire network only needs to learn the difference between the input and the output, so the learning goal is simplified. The residual unit alleviates the vanishing gradient problem to some extent and has made significant contributions to deepening networks.

Fig. 3. The residual unit structure [10]

The number of layers evolved from 18 (ResNet18), 34 (ResNet34), and 50 (ResNet50) to 101 (ResNet101) and 152 (ResNet152). Though the depth has been significantly increased, ResNet152 still has lower complexity than VGG16 and VGG19, and a significant accuracy gain has also been achieved. Because ResNet is simple and practical, many methods for detection, segmentation, and recognition are built on ResNet50 or ResNet101, and ResNet is also applied in AlphaGo Zero [16].

III. CAPSULENET

Compared with the traditional CNN model, CapsuleNet differs considerably in neuron structure, network structure, and the way of propagation and distribution between layers. This section analyzes these aspects of CapsuleNet.

A. The capsule neuron

The key structure in CapsuleNet is the capsule neuron, as shown in Fig.4. Unlike the traditional neuron, its input and output are vectors. The internal parameters correspondingly act on vectors, and the activation function is also different. The input and output vectors of a capsule neuron represent the instantiation parameters of a particular entity type. The length of the vector represents the probability that the entity exists, and the direction of the vector represents the attributes of the entity. The activation function of capsule neurons is the squashing function, by which the vectors are compressed and made nonlinear. Capsules at one level predict the instantiation parameters of higher-level capsules through matrix transformations.

Fig. 4. Parametric connection of capsule neurons

B. CapsuleNet structure

At present, the structure of CapsuleNet is relatively simple. As shown in Fig.5, the CapsuleNet model consists of three layers [1]: Conv1, PrimaryCaps, and DigitCaps. Conv1 is a conventional convolution layer that uses the ReLU activation function to extract local features from the pixels. The second layer, PrimaryCaps, serves as the input to the capsule layer and constructs the corresponding vector structure. The PrimaryCaps layer also uses convolution, and it encapsulates multiple convolution units into a new capsule unit, which is output as a vector and used as the input to the next layer. The third layer, DigitCaps, is the output layer of the capsule network. It is fully connected to the preceding PrimaryCaps layer. The calculation and parameter update across these full connections use the dynamic routing algorithm [1]. The length of the final capsule output vector is used as the probability of the entity.

Fig. 5. Structure of the CapsuleNet model [1]
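The routing between PrimaryCaps and DigitCaps mentioned in Section III-B follows the update rules given as equations (2) to (5) in Section III-C. The block below is a minimal pure-Python sketch of those rules, not the authors' implementation: it assumes the prediction vectors of equation (1) have already been computed and are passed in as a nested list `u_hat[i][j]`, and the function names and the toy input are hypothetical.

```python
import math


def squash(s):
    """Equation (5): keep the direction of s and map its length into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(logits):
    """Equation (3): coupling coefficients of one lower capsule sum to 1."""
    peak = max(logits)
    exps = [math.exp(b - peak) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iterations=3):
    """u_hat[i][j] is the prediction of lower capsule i for upper capsule j."""
    num_lower, num_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_upper for _ in range(num_lower)]  # b_ij initialised to 0
    for _ in range(num_iterations):
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_upper):
            # Equation (4): weighted sum of the predictions for capsule j.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_lower))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(num_lower):
            for j in range(num_upper):
                # Equation (2): reward agreement between prediction and output.
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c


# Toy input: three lower capsules agree on upper capsule 0 and not on capsule 1.
u_hat = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[0.9, 0.1], [0.1, 0.0]],
    [[1.0, 0.1], [-0.1, 0.0]],
]
v, c = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(x * x for x in vec)) for vec in v]
predicted = lengths.index(max(lengths))  # longest output vector wins
print(lengths, predicted)
```

The output vector lengths play the role of class probabilities, so the predicted class is simply the index of the longest vector.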

C. Propagation and distribution between Capsule units

The procedure of propagation and distribution between capsule units is divided into three phases: linear combination, routing parameter update, and nonlinear function activation.

The first stage is a linear combination. As in (1), $\hat{u}_{j|i}$ is a linear transformation of $u_i$. The prediction vector $\hat{u}_{j|i}$ is obtained by multiplying the output vector $u_i$ of the i-th capsule of the previous layer by the corresponding weight matrix $W_{ij}$.

$$\hat{u}_{j|i} = W_{ij} u_i \tag{1}$$

In the second phase, the routing parameters are updated by using the dynamic routing algorithm. The coupling coefficient $c_{ij}$ is iteratively updated and determined by the dynamic routing process, which is as follows. First, $b_{ij}$ is initialized to 0. As in (2), $b_{ij}$ depends on the positions and types of the two capsules, and it is updated iteratively through the agreement between the current output $v_j$ of each capsule j and the prediction vector $\hat{u}_{j|i}$ from capsule i of the previous layer. Equation (3) is the softmax function. It shows that the coupling coefficients between capsule i and all capsules of the next layer sum to 1. According to (4), the prediction vectors $\hat{u}_{j|i}$ are weighted by the corresponding coupling coefficients $c_{ij}$ and summed to obtain the input vector $s_j$ of the capsule unit in the next layer. According to (5), the output vector $v_j$ of that capsule unit is obtained. The output vectors are used to update $b_{ij}$, so the coupling coefficients $c_{ij}$ are also updated, which completes the dynamic routing process. The algorithm converges easily within 3 iterations.

$$b_{ij} = b_{ij} + \hat{u}_{j|i} \cdot v_j \tag{2}$$

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{3}$$

$$s_j = \sum_i c_{ij} \hat{u}_{j|i} \tag{4}$$

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|} \tag{5}$$

The third stage is the nonlinear activation. Since the length of the output vector of the capsule unit represents the probability that an entity exists, its value must be between 0 and 1. To achieve this vector compression and to complete the nonlinear activation of the capsule layer, CapsuleNet uses the squashing function expressed in (5) as its activation function. The function consists of two parts. The first part, $\frac{\|s_j\|^2}{1 + \|s_j\|^2}$, rescales the length of the input vector $s_j$. The second part, $\frac{s_j}{\|s_j\|}$, normalizes the input vector $s_j$ to unit length. This function therefore preserves the direction of the input vector and compresses its length to the interval [0, 1).

In general, CapsuleNet uses vectors instead of the scalars of traditional CNNs to obtain more information links. In terms of parameter update, in addition to the traditional back-propagation algorithm that updates the weight parameters, CapsuleNet adopts a dynamic routing algorithm to iteratively update the coupling coefficients during forward propagation, which makes the network model easier to train and to converge, but brings a larger amount of calculation.

IV. COMPARISON AND ANALYSIS OF RESULTS ON DIFFERENT DATASETS

In this section, the traditional classical CNN models and CapsuleNet are used to conduct three groups of training experiments on different datasets. The training results are then compared and analyzed.

A. Introduction to datasets

The training experiments use three different image datasets: MNIST [17], small NORB [18], and CIFAR-10 [19]. The three datasets are described in detail below.

1) MNIST database of handwritten digits: The MNIST dataset is a database of handwritten digits 0-9 written by about 250 different people. The training set has 60,000 handwritten digit images and the test set has 10,000 handwritten digit images. Each picture in the dataset is a black-and-white picture with 1 channel and a size of 28*28 pixels. An example is shown in Fig.6.

Fig. 6. Example of MNIST [17]

2) V1.0-small NORB: The dataset is a 3D object database containing 50 toys in five categories: four-legged animals, human figures, airplanes, trucks, and cars. The training and test sets contain 48,600 images. Each image has two channels and is composed of two grayscale images taken at different angles. The size of the image is 96*96 pixels. The V1.0-small NORB dataset is commonly used in 3D object recognition experiments. An example image is shown in Fig.7.

Fig. 7. Example of V1.0-small NORB [18]

3) CIFAR-10: The dataset is a color picture set that contains 10 categories: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The training set contains 50,000 images and the test set contains 10,000 images. Each picture has 3 channels and a size of 32*32 pixels. An example image is shown in Fig.8.

Fig. 8. Example of CIFAR-10 [19]

B. Experimental background and results

For the traditional CNN, this section selects the classic 16-layer VGG16, the 42-layer Google Inception V3, and the 152-layer ResNet, and compares them with CapsuleNet in training experiments. In each group of experiments, the four network models are trained on the same dataset with consistent training parameters. The training parameters and results are given below.

1) The training experiment on the MNIST dataset: The MNIST dataset is used for training in this experiment. The parameters are set as follows: the batch size is 64, the number of training steps is 50000, the learning rate is 0.0001, and the optimization algorithm is Adam. The training results are shown in Fig.9.

Fig. 9. Comparison of training results of the four models on the MNIST dataset

2) The training experiment on the V1.0-small NORB dataset: The V1.0-small NORB dataset is used for training in this experiment. The parameters are set as follows: the batch size is 32, the number of training steps is 30000, the learning rate is 0.0001, and the optimization algorithm is Adam. The training results are shown in Fig.10.

Fig. 10. Comparison of training results of the four models on the V1.0-small NORB dataset

3) The training experiment on the CIFAR-10 dataset: The CIFAR-10 dataset is used for training in this experiment. The parameters are set as follows: the batch size is 64, the number of training steps is 150000, the learning rate is 0.0001, and the optimization algorithm is Adam. The training results are shown in Fig.11.

After the training experiments, the corresponding trained models were obtained, and test experiments were then carried out on them. The test results show that the accuracy on the test sets is lower overall, but the overall curve trends and relative performance are consistent with the training results.

C. Analysis of experimental results

According to the above three groups of experimental results, both the traditional CNN models and CapsuleNet train well on the two gray image datasets, MNIST and small NORB. Within 10K steps, the training accuracy can reach 98%. In both convergence speed and training accuracy, CapsuleNet is better than the traditional CNN models. For the CIFAR-10 RGB color picture dataset, the overall convergence speed of the four neural network models is slower than on the gray picture datasets. However, the traditional CNN models show good performance. Within 40K steps, the training accuracy of the traditional CNN models reaches 95%. CapsuleNet needs about 140K steps, and its training accuracy is only about 90%.

The reasons are analyzed as follows. First, CapsuleNet tends to explain everything in the picture, and it can perform well for the MNIST and small NORB datasets with a single background.

For the CIFAR-10 dataset, the backgrounds of the pictures are cluttered and vary too much for a model of fixed size. Second, the CapsuleNet model is in its infancy. Compared with the VGG16, Inception V3, and ResNet CNN models, the CapsuleNet model structure, with only three layers, is much simpler. Although the vectors in the network can express more information and attributes of images, the number of network layers is small. For complex picture information, the features extracted by the model are not sufficient. Therefore, the CapsuleNet model performs poorly on CIFAR-10 and other RGB datasets.

Fig. 11. Comparison of training results of the four models on the CIFAR-10 dataset

V. CONCLUSION

This paper compares the classical CNN models (VGGNet, Google Inception Net, and ResNet) with CapsuleNet in terms of network structure, parameter update and propagation, principle analysis, and training results. The traditional CNN models show powerful feature extraction and learning ability as network depth increases and algorithms improve. In contrast, CapsuleNet discards traditional pooling to preserve precise location information. In the form of vectors, the capsule unit maintains the relationships between feature information and implements the encapsulation of the entity, which compensates for the shortcomings of CNN models. The experimental results show that a simple three-layer CapsuleNet achieves good results on gray images with a single background. However, for RGB images with a complex background, CapsuleNet cannot learn the features of the image completely enough to achieve high-precision classification. Therefore, CapsuleNet needs to be improved in both the model structure and the parameter updating algorithm, and such improvements will also promote the advancement of neural networks.

ACKNOWLEDGMENT

The paper is sponsored by National Natural Science Foundation of China (No. 61374159, No. 61374023), Aviation Science Foundation (No. 20165153034), Natural Science Foundation of Shaanxi province (No. 2018MJ6048) and the Seed Foundation of Innovation and Creation for Graduate Students in Northwestern Polytechnical University (No. ZZ2018149).

REFERENCES

[1] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic Routing Between Capsules," arXiv:1710.09829v2 [cs.CV], 7 Nov. 2017.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323(6088), pp. 533-536, 1986.
[3] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, pp. 512-519, 2014.
[4] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," European Conference on Computer Vision. Springer, Cham, pp. 21-37, 2016.
[5] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, pp. 3431-3440, 2015.
[6] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86(11), pp. 2278-2324, 1998.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification With Deep Convolutional Neural Networks," International Conference on Neural Information Processing Systems. Curran Associates Inc., pp. 1097-1105, 2012.
[8] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[11] G. Huang, Z. Liu, L. V. D. Maaten, and K. Q. Weinberger, "Densely Connected Convolutional Networks," arXiv:1608.06993v5 [cs.CV], 28 Jan. 2018.
[12] M. Lin, Q. Chen, and S. Yan, "Network In Network," Computer Science, 2013.
[13] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," International Conference on Machine Learning, pp. 448-456, 2015.
[14] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," arXiv preprint arXiv:1512.00567, 2015.
[15] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-resnet and the impact of residual connections on learning," arXiv:1602.07261v2 [cs.CV], 23 Aug. 2016.
[16] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton, "Mastering the game of go without human knowledge," Nature, vol. 550(7676), pp. 354-359, 2017.
[17] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits," 1998.
[18] Y. LeCun, F. J. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, vol. 2, pp. II-104, 2004.
[19] M. D. Zeiler and R. Fergus, "Stochastic pooling for regularization of deep convolutional neural networks," arXiv preprint arXiv:1301.3557, 2013.