A Sparsity-Based Stochastic Pooling Mechanism for Deep Convolutional Neural Networks
Z. Song et al., Neural Networks 105 (2018) 340–345
journal homepage: www.elsevier.com/locate/neunet
Article history: Received 21 March 2017; Received in revised form 2 May 2018; Accepted 23 May 2018; Available online 15 June 2018.

Keywords: Deep learning; Pooling mechanism; Degree of sparsity; Representative feature value; Recognition accuracy.

Abstract
A novel sparsity-based stochastic pooling which integrates the advantages of max-pooling, average-pooling and stochastic pooling is introduced. The proposed pooling is designed to balance the advantages and disadvantages of max-pooling and average-pooling by using the degree of sparsity of activations and a control function to obtain an optimized representative feature value ranging from the average value to the maximum value of a pooling region. This optimized representative feature value is employed for the probability weight assignment of activations under a normal distribution. The proposed pooling also adopts weighted random sampling with a reservoir for the sampling process to preserve the advantages of stochastic pooling. The proposed pooling is evaluated on several standard datasets in a deep learning framework and compared with various classic pooling methods. Experimental results show that it performs well in improving recognition accuracy. The influence of changes to the feature parameter on recognition accuracy is also investigated.

© 2018 Elsevier Ltd. All rights reserved.
https://doi.org/10.1016/j.neunet.2018.05.015
these two pooling methods in specific cases, which could promote the generalization ability of pooling.

To this aim, Yu, Wang, Chen, and Wei (2014) proposed a mixed pooling method that consists of randomly choosing between max-pooling and average-pooling to generate the output. The mechanism is realized by adding together the maximum and average values, each multiplied by its own coefficient: one coefficient is randomly either 0 or 1, and the other takes the opposite value (0 and 1 being the opposite values). This mechanism improves the overall performance of pooling, but it fails to reflect the advantages of both pooling methods at the same time, because only max-pooling or average-pooling can be adopted in each pooling operation. Lee, Gallagher, and Tu (2015) improved this mixed pooling by replacing the random binary coefficient with a real number ranging from 0 to 1, namely the mixing proportion; the weights of the maximum and average values are then assigned by this real number. With this mixing proportion, the features of both max-pooling and average-pooling can be reflected in each pooling operation, although the randomness of the sampling process is sacrificed. Later on, stochastic pooling emerged, which gives probability weights to the elements of a feature map according to their numerical values and randomly takes a sample in accordance with these probability weights. Zeiler and Fergus (2013) proposed a classical stochastic pooling method that randomly picks activations in the pooling region on the basis of their activities. It has the advantages of being hyper-parameter free and of combining with other regularization approaches such as dropout and data augmentation, and it presents smaller training and testing errors than max-pooling and average-pooling. Meanwhile, it has also been reported that pooling performance can be improved using Dropout (Iosifidis, Tefas, & Pitas, 2015). However, the performance of classic Dropout depends heavily on experience in selecting the positions for random deletion, which makes it an experience-dependent method and limits its generalization ability (Cao, Li, & Zhang, 2015; Srivastava et al., 2014). Wu and Gu (2015) pointed out that the random sampling process of stochastic pooling over activations obeys a multinomial distribution, the same as that of max-pooling dropout, but that for specific retaining probabilities max-pooling dropout can perform better than stochastic pooling. This reveals that max-pooling dropout and stochastic pooling each have their own advantages with respect to sampling. Therefore, a novel pooling mechanism designed to integrate the advantages of max-pooling, average-pooling and stochastic pooling would be expected not only to improve the diversity of pooling results, by balancing the highlighting of foreground textures against the preservation of background information, but also to improve recognition accuracy.

In this research, a novel sparsity-based stochastic pooling is proposed that integrates the advantages of max-pooling, average-pooling and stochastic pooling: it balances highlighting the foreground against preserving background information while improving the randomness of sampling. The pooling mechanism is built on an optimized representative feature value, which can automatically lean towards the behaviour of max-pooling or of average-pooling in specific applications or databases, thereby promoting the generalization ability of pooling, since it is defined through the degree of sparsity and a special control function that generates a value ranging from the average value to the maximum value of a pooling region. The probability weights of the activations are assigned according to the distance between this feature value and each activation under a normal distribution, which evaluates the contributions of all activations in the pooling region. A method of weighted random sampling (WRS) is employed for the sampling operation to further improve the performance of pooling by increasing the randomness of sampling. The proposed pooling is evaluated in terms of recognition accuracy on several classic datasets, and its experimental test error is compared with those of other classic pooling methods. The influence of changes to the feature parameter on recognition accuracy is also discussed.

2. Pooling mechanism

2.1. Optimized representative feature value

A feature value, such as the maximum or average value, is always employed for the pooling region as a benchmark for the weight assignment and probability distribution of the activations in that region. From this point of view, the weight of the maximum activation can be defined as 1 in max-pooling, whereas in average-pooling every activation has the same weight. In rank-based stochastic pooling, no such feature value exists: activations are arranged in descending order and given probability weights by exponential ranking (Michalewicz, 1994). This method can improve pooling performance by avoiding the mistake of offering equal or highly imbalanced importance to each region, since image features are highly spatially non-stationary (Shi, Ye, & Wu, 2016). Meanwhile, the authors also mentioned that rank-based stochastic pooling inevitably degenerates into max-pooling, with a loss of background information, whenever the maximum activation is much greater than the sum of the others (the probabilities of the others are then dominated by that of the maximum activation).

To remedy the above-mentioned disadvantages and improve the pooling algorithm, an optimized representative feature value R is proposed to replace these common feature values (the maximum and average values) and to seek a reasonable balance between max-pooling and average-pooling, highlighting foreground texture details while preserving enough background information. The feature value R is defined by Eq. (1) (and shown in Fig. 1):

\frac{R - Avg}{Max - Avg} = F_p(\alpha) \qquad (1)

where Max and Avg are the maximum and average values of the activations in the pooling region, respectively, and F_p(\alpha) is the control function for optimizing this feature value, as shown in Eq. (2):

F_p(\alpha) =
\begin{cases}
2^{p-1}\alpha^{p}, & 0 \le \alpha \le \frac{1}{2} \\
1 - 2^{p-1}(1-\alpha)^{p}, & \frac{1}{2} \le \alpha \le 1 .
\end{cases} \qquad (2)

Here, p is a positive integer used as a feature parameter that sets the curved shape of the function F_p(\alpha), and \alpha is the degree of sparsity of the convolved features in a pooling region, since much research has shown that the performance of pooling methods is strongly affected by the sparsity of the pooling region (Boureau, Ponce et al., 2010). For example, taking the maximum value works better than the average value in a sparse region. Thus, a representative feature value designed on the basis of the sparsity of activations in a pooling region is more reasonable.

There are three main advantages of using Eq. (2) to define R. First, if p = +∞ (its value is set to 100 in practice, which is large enough to meet the computing requirement), the value of R tends to either the maximum or the average value of the activations in the pooling region (Fig. 1), so the pooling degenerates into max-pooling or average-pooling and thereby contains the features and functions of these two classic pooling methods; if p = 1, the value of R is linearly distributed between the maximum and average values, which simplifies it for high computational efficiency, as shown in Fig. 1.
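As a concrete illustration of Eqs. (1)–(2), the following minimal NumPy sketch computes F_p(α) and the resulting representative value R for one pooling region. The function names are illustrative (not from the authors' code), the degree of sparsity α is assumed to be supplied by the entropy-based measure of Section 2.2, and the example region is chosen only so that its average (5) and maximum (9) match the values quoted later for the Fig. 2 example.

```python
import numpy as np

def control_function(alpha: float, p: int) -> float:
    """Control function F_p(alpha) of Eq. (2): maps the degree of
    sparsity alpha in [0, 1] to a mixing factor in [0, 1]."""
    if alpha <= 0.5:
        return (2 ** (p - 1)) * alpha ** p
    return 1.0 - (2 ** (p - 1)) * (1.0 - alpha) ** p

def representative_value(pool_region: np.ndarray, alpha: float, p: int = 3) -> float:
    """Optimized representative feature value R from Eq. (1):
    R = Avg + F_p(alpha) * (Max - Avg), i.e. a value between the
    average and the maximum of the pooling region."""
    avg = pool_region.mean()
    mx = pool_region.max()
    return avg + control_function(alpha, p) * (mx - avg)

# Example: a 3x3 pooling region with mean 5 and maximum 9; the value of
# alpha here is a placeholder, standing in for the sparsity measure of Section 2.2.
region = np.array([[1.0, 2.0, 8.0],
                   [5.0, 9.0, 3.0],
                   [4.0, 6.0, 7.0]])
R = representative_value(region, alpha=0.7, p=3)
print(R)  # lies between region.mean() (= 5.0) and region.max() (= 9.0)
```

For large p the mixing factor F_p(α) saturates to 0 or 1, so R snaps to the average or the maximum and the behaviour degenerates into average-pooling or max-pooling, matching the limiting cases described above.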
2.2. The degree of sparsity

The criterion for measuring the degree of sparsity of a convolved feature pooling region can be described using entropy theory (Li, Fan, & Liu, 2015): the smaller the entropy, the sparser the feature. Here we also employ entropy to describe the degree of sparsity. Meanwhile, the entropy of a pooling region (a matrix) can be represented by its singular values, which are obtained by singular value decomposition (SVD). The entropy of a matrix is smaller if the matrix is more singular (Gu, Xiong, & Li, 2015), which means that more of the information about the matrix is contained in fewer singular values. For instance, the singular values λ1, λ2, ..., λn (λ1 > λ2 > ··· > λn) can be obtained by SVD from a non-negative matrix whose elements are the activations a_i (i = 1, 2, ..., n), which in this study is the convolved feature pooling region. Thus, the relationship between the singular values and the degree of sparsity can be established with the aid of the entropy.

w(a_i) = \frac{1}{\sqrt{2\pi \sum_{j=1}^{n}\bigl(a_j - \sum_{j=1}^{n} a_j / n\bigr)^{2}}} \times \exp\left\{-\frac{(a_i - R)^{2}}{2\Bigl[\sum_{j=1}^{n}\bigl(a_j - \sum_{j=1}^{n} a_j / n\bigr)^{2}\Bigr]}\right\} \qquad (7)

where R has been set to be the mathematical expectation of the normal distribution function. Each activation a_i in the convolved feature pooling region then has a corresponding probability weight w(a_i). In the previously mentioned example of Fig. 2, R is calculated as 6.898, and the average and maximum values of the activations are 5 (equal to a_4) and 9 (equal to a_7), respectively. The assignment of probability weights to the activations is shown in Fig. 3.

It should be noticed that although the area covered by the normal distribution function is 1 (100%), in a real case the sum of all the probability weights is not 1, owing to the discrete or repeated distribution and the limited number of activations. Thus, a method of weighted random sampling (WRS) with a reservoir (Efraimidis & Spirakis, 2006) is employed in this study for sampling the activations without weight normalization. WRS with a reservoir is a method of sampling from a data stream without needing to know the size of the stream in advance. Consequently, w(a_i) is only defined
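To make the weight assignment and sampling step concrete, the sketch below pairs Eq. (7)-style normal-density weights (mean R, spread taken from the activations, as reconstructed above) with weighted random sampling using a reservoir of size one in the key form of Efraimidis and Spirakis (2006), where each item receives the key u^(1/w) and the largest key is kept. This is a hedged illustration under those assumptions, not the authors' implementation, and the value R = 6.898 is simply the one quoted for the Fig. 2 example; the activations themselves are hypothetical.

```python
import numpy as np

def normal_weights(acts: np.ndarray, R: float) -> np.ndarray:
    """Eq. (7)-style probability weights: a normal density centred on the
    representative value R, with the spread taken from the activations."""
    spread = np.sum((acts - acts.mean()) ** 2)   # sum of squared deviations, as in Eq. (7)
    spread = max(spread, 1e-12)                  # guard against a constant pooling region
    return np.exp(-(acts - R) ** 2 / (2.0 * spread)) / np.sqrt(2.0 * np.pi * spread)

def wrs_reservoir_pick(acts: np.ndarray, weights: np.ndarray,
                       rng: np.random.Generator) -> float:
    """Weighted random sampling with a reservoir of size 1
    (Efraimidis & Spirakis, 2006): key_i = u_i ** (1 / w_i) with u_i ~ U(0, 1);
    the activation with the largest key is kept, no weight normalization needed."""
    weights = np.maximum(weights, np.finfo(float).tiny)  # avoid division by zero
    keys = rng.random(len(acts)) ** (1.0 / weights)
    return float(acts[np.argmax(keys)])

# Example pooling step for a flattened 3x3 region (hypothetical activations):
rng = np.random.default_rng(0)
acts = np.array([1.0, 2.0, 8.0, 5.0, 9.0, 3.0, 4.0, 6.0, 7.0])
R = 6.898                        # representative value quoted for the Fig. 2 example
w = normal_weights(acts, R)      # the weights need not sum to 1
pooled = wrs_reservoir_pick(acts, w, rng)
print(pooled)
```

Because the keys u^(1/w) depend only on the relative sizes of the weights, the fact that the discrete weights do not sum to 1 causes no difficulty, which is exactly the property that motivates using WRS with a reservoir here.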
3. Experiments
Fig. 4. CT image (a) recognition and extraction with average-pooling (b), max-pooling (c) and this proposed pooling (d).
Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://dx.doi.org/10.1016/j.neunet.2018.05.015.

References

Ahmed, S. E. (1995). A pooling methodology for coefficient of variation. Sankhyā, 57(1), 57–75.
Boureau, Y., Bach, F., LeCun, Y., & Ponce, J. (2010). Learning mid-level features for recognition. Computer Vision and Pattern Recognition, 26(2), 2559–2566.
Boureau, Y. L., Ponce, J., & LeCun, Y. (2010). A theoretical analysis of feature pooling in visual recognition. International Conference on Machine Learning, 32(4), 111–118.
Cao, B., Li, J., & Zhang, B. (2015). Regularizing neural networks with adaptive local drop. In 2015 international joint conference on neural networks (pp. 1–5). IEEE.
Dan, K. (1996). A singularly valuable decomposition: The SVD of a matrix. College Mathematics Journal, 27(1), 2–23.
Efraimidis, P. S., & Spirakis, P. G. (2006). Weighted random sampling with a reservoir. Information Processing Letters, 97(5), 181–185.
Gu, R., Xiong, W., & Li, X. (2015). Does the singular value decomposition entropy have predictive power for stock market? Evidence from the Shenzhen stock market. Physica A: Statistical Mechanics and its Applications, 439, 103–113.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504.
Iosifidis, A., Tefas, A., & Pitas, I. (2015). DropELM: Fast neural network regularization with Dropout and DropConnect. Neurocomputing, 162, 57–66.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. Eprint arXiv, pp. 675–678.
Krizhevsky, A. (2012). Learning multiple layers of features from tiny images. Tech report, pp. 1–60.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D., et al. (1990). Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, 396–404.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lee, C. Y., Gallagher, P. W., & Tu, Z. (2015). Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. Computer Science, 464–472.
Li, Z., Fan, Y., & Liu, W. (2015). The effect of whitening transformation on pooling operations in convolutional autoencoders. EURASIP Journal on Advances in Signal Processing, 37, 1–11.
Lipovetsky, S. (2009). PCA and SVD with nonnegative loadings. Pattern Recognition, 42(1), 68–76.
Michalewicz, Z. (1994). Genetic algorithms + data structures = evolution programs. Computational Statistics & Data Analysis, 24(3), 372–373.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning & unsupervised feature learning (pp. 1–9).
Shi, Z., Ye, Y., & Wu, Y. (2016). Rank-based pooling for deep convolutional neural networks. Neural Networks, 83, 21.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
Sun, M., Song, Z., Jiang, X., Pan, J., & Pang, Y. (2017). Learning pooling for convolutional neural network. Neurocomputing, 224, 96–104.
Wang, L., Gao, C., Liu, J., & Meng, D. (2017). A novel learning-based frame pooling method for event detection. Signal Processing, 140, 45–52.
Wu, H., & Gu, X. (2015). Towards dropout training for convolutional neural networks. Neural Networks, 71(C), 1–10.
Xie, L., Tian, Q., Wang, M., & Zhang, B. (2014). Spatial pooling of heterogeneous features for image classification. IEEE Transactions on Image Processing, 23(5), 1994–2008.
Yang, J., Yu, K., Gong, Y., & Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In IEEE computer society conference on computer vision and pattern recognition (pp. 1794–1801).
Yu, D., Wang, H., Chen, P., & Wei, Z. (2014). Mixed pooling for convolutional neural networks. In 9th international conference on rough sets and knowledge technology, Vol. 8818 (pp. 364–375).
Zeiler, M. D., & Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. Computer Science, 1–9.
Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms, pp. 919–926.