
2021 International Electrical Engineering Congress (iEECON2021) (Full Manuscript)
March 10-12, 2021, Pattaya, THAILAND
DOI: 10.1109/iEECON51072.2021.9440284

Impacts of Kernel Size on Different Resized Images in Object Recognition Based on Convolutional Neural Network

Danupon Chansong
College of Digital Innovation Technology, Rangsit University, Pathum-thani, Thailand
danuponc@gmail.com

Siriporn Supratid
College of Digital Innovation Technology, Rangsit University, Pathum-thani, Thailand
siriporn.su@rsu.ac.th

Abstract— This paper presents a study on the impacts of kernel size on differently resized images, relying on convolutional neural networks (CNNs) for object recognition. Two CNN deep learning models, Conv573 and Conv3, based on a shallow feed-forward architecture, are employed here for feature extraction with a comparative assessment purpose. Conv573 refers to a CNN with kernel sizes of 7×7, 5×5 and 3×3, whilst Conv3 uses only a 3×3 kernel size; three and two convolutional layers of 3×3 kernel size are respectively comparable to one convolutional layer of 7×7 and 5×5 kernel size. The experiments rely on the CIFAR-10 image dataset resized to 50×50, 100×100 and 150×150 pixels for testing with the Conv573 and Conv3 models. For the purpose of bias reduction, recognition performance is assessed by averages of precision, recall, F1 and accuracy rates based on 10-fold cross validation. The results indicate that the greater the size of an image, the better the recognition accuracy for Conv573, and conversely for Conv3. Recognition accuracy improvements of 2.65% and 0.06% for Conv573, and performance decreases of 0.39% and 1.76% for Conv3, are respectively yielded when resizing images from 50×50 to 100×100 and from 100×100 to 150×150. For 50×50 and 100×100 resized images, Conv3 yields 4.74% and 1.64% better averaged accuracy than Conv573; nevertheless, Conv573 generates 0.18% better averaged accuracy than Conv3 on 150×150 resized images.

Keywords— convolutional neural network, object recognition, deep learning, kernel size

I. INTRODUCTION

Object recognition [1] is regarded as one of the image analysis tasks in which the correct category/class of an object is extracted. Using a number of stacked convolutional layers, deep learning convolutional neural networks (CNNs) capture significant features of objects from images without requiring those features to be explicitly defined a priori. LeNet-5 [2], one of the earliest CNNs, has a typical stack of convolutional layers; it was designed for recognizing handwritten and computer-printed characters. AlexNet [1], containing more stacked layers than LeNet-5, employs the rectified linear unit (ReLU) activation to solve the gradient dispersion of the sigmoid used in the LeNet-5 deep network. Unlike LeNet-5 and AlexNet, where large kernel-sized filters are used to obtain similar features in an image, VGGNet [3] replaces those large kernels with several small 3×3 kernel-sized filters that can characterize, at a lower cost, more complex features. Among recent research, the work in [4] employed a deep CNN on CIFAR-10 to obtain an average of 87.57% accuracy during the testing phase. Particular deep CNNs were utilized for medical image classification in terms of anatomy specificity, using 4,298 separate axial 2D key-images to learn 5 anatomical classes [5]; the averaged results showed 5.9% classification error and 0.998 area-under-the-curve (AUC) values on testing data. A comparative study on CIFAR-10 using batch normalization, rectified linear unit (ReLU) and exponential linear unit (ELU) activations in DenseNet, VGGNet, residual network and Inception-v3 CNNs was carried out in [6]; the best recognition accuracy of 89.91% was yielded by VGGNet with the use of ELU. With regard to [7], CIFAR-10 was tested on five CNN models for embedded systems, varying by adding or changing the learning rate, batch normalization and dropout layers; the five models achieved around 85.3% to 85.9% accuracy. A comparative investigation of the impact of kernel size on four CNN architectures with 3×3, 7×7, 5×5 and 9×9 kernel sizes, using histopathological images, was carried out in [8]. In addition, the works [9, 10] applied two and three convolutional layers with 3×3 kernels instead of one convolutional layer with 5×5 and 7×7 kernels, respectively. This relied upon the simple intuition that stacking small kernel-sized convolutional layers reduces parameters and pushes the network deeper, which improves learning results. By using such 3×3 kernel-size CNNs, accuracy rates of 94.34% and 73.65% were yielded on CIFAR-10 and CIFAR-100, respectively [9]; whilst the work [10], employing a smaller number of CNN layers, produced 86.45%. Nevertheless, the impact of kernel size on variously resized images in object recognition has not yet been widely explored.

This paper studies the effects of kernel size on differently resized images based on two shallow feed-forward CNN models, Conv573 and Conv3, for object recognition. Conv573 refers to a CNN with kernel sizes of 7×7, 5×5 and 3×3, whilst Conv3 uses only a 3×3 kernel size, where


three and two convolutional layers of 3×3 kernel size are respectively comparable to one convolutional layer of 7×7 and 5×5 kernel size. The features extracted by the CNNs are fed into a softmax classifier for object recognition. The empirical experiments rely on the CIFAR-10 image dataset resized to 50×50, 100×100 and 150×150 pixels, tested on the Conv573 and Conv3 models. Recognition performance measurements depend on precision, recall, F1 as well as accuracy rates, averaged over 10-fold cross validation of randomly selected training and testing image sets. Such cross validation is carried out to guarantee unbiased experimental results. Confusion matrices are also determined for a more detailed investigation of the results. The rest of this paper is arranged as follows: Section II presents Conv573 as well as Conv3 used in this study; experimental results are demonstrated in Section III; finally, Section IV concludes the paper.
II. CONVOLUTIONAL NEURAL NETWORKS (CNNS): CONV573 AND CONV3

For the comparative assessment purpose, shallow 3-phase CNNs employing a deep learning feed-forward architecture are utilized here for feature extraction, as shown in Fig. 1(a) and (b). A shallow CNN refers to a simple network structure with few network parameters, occupying little computation and memory. Fig. 1(a) displays the architecture of Conv3, where all convolutional layers employ 3×3 receptive-field kernel filters; whereas the convolutional layers of Conv573 in Fig. 1(b) use 7×7, 5×5 and 3×3 kernel filters. As aforementioned, three and two convolutional layers of 3×3 kernel size are respectively comparable to one convolutional layer of 7×7 and 5×5 kernel size. All hidden layers are equipped with the rectified linear unit (ReLU) activation function, y = max(0, x). After each convolutional layer, dropout [11] and batch normalization are applied. For each layer, a dropout probability is applied to avoid over-fitting. Batch normalization normalizes every layer's inputs based on y = γ(x − μ)/σ + β, where μ and σ² are the mean and variance of x; x and y are the input and the normalized feature map, respectively; γ and β, set to 1 and 0 respectively in this study, are learnable parameters. Such normalization forces the input of every layer to have nearly the same distribution across all training steps, helping the network converge faster with a lower error rate.
In addition, after each of the four phases, max pooling with a 2×2 pool size is executed to halve the size of the feature maps. The aim is to extract low-level features from a neighborhood and to decrease variance as well as computational complexity. In the final layer, global average pooling along with a fully connected neural network transforms the feature maps into 512-element feature vectors. For the subsequent object recognition process, such a feature vector is fed into a softmax [12], known as a multinomial logistic classifier, which provides linear classification boundaries. The total numbers of parameters used for the Conv3 and Conv573 models are 5.289 and 5.367 million, respectively.

Fig. 1. Architectures of (a) Conv3; (b) Conv573
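As a rough illustration of the two architectures described above and in Fig. 1, the sketch below builds Keras models in the spirit of Conv3 and Conv573: stacked 3×3 convolutions in the former, single 7×7, 5×5 and 3×3 convolutions in the latter, each convolution followed by batch normalization and dropout, each phase closed by 2×2 max pooling, and the network ending in global average pooling, a 512-unit fully connected layer and a 10-class softmax. The filter counts, dropout rate and exact layer ordering are assumptions for illustration; the paper does not list them, so this sketch will not reproduce the reported 5.289/5.367 million parameter totals exactly.

```python
from tensorflow.keras import layers, models

def conv_block(x, filters, kernel_sizes, dropout=0.25):
    """One phase: convolution(s) + ReLU, batch norm, dropout, then 2x2 max pooling."""
    for k in kernel_sizes:
        x = layers.Conv2D(filters, k, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout)(x)
    return layers.MaxPooling2D(pool_size=2)(x)

def build_model(kernel_plan, input_shape=(50, 50, 3), num_classes=10):
    """kernel_plan is a list of kernel-size lists, one per phase."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Filter counts per phase are illustrative assumptions (not given in the paper).
    for filters, kernels in zip([64, 128, 256], kernel_plan):
        x = conv_block(x, filters, kernels)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(512, activation="relu")(x)                     # 512-element feature vector
    outputs = layers.Dense(num_classes, activation="softmax")(x)    # multinomial logistic classifier
    return models.Model(inputs, outputs)

# Conv573: one 7x7, one 5x5 and one 3x3 convolutional layer (one per phase).
conv573 = build_model([[7], [5], [3]])
# Conv3: comparable receptive fields built from stacked 3x3 layers
# (three 3x3 ~ one 7x7, two 3x3 ~ one 5x5), with fewer weights per phase
# (27*C^2 vs 49*C^2, and 18*C^2 vs 25*C^2, for C input/output channels).
conv3 = build_model([[3, 3, 3], [3, 3], [3]])
conv573.summary()
conv3.summary()
```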


Fig. 2. Confusion matrices with respect to Conv573, based on the best F1 scores.

Fig. 3. Confusion matrices with respect to Conv3, based on the best F1 scores.

III. EXPERIMENTAL RESULTS AND DISCUSSION

In this work, Conv573 and Conv3 are applied to the CIFAR-10 dataset [7], containing 60,000 32×32 color images in 10 classes, approximately 6,000 images per class. In order to investigate the effects of kernel size on differently resized images, the CIFAR-10 images are resized to 50×50, 100×100 and 150×150 pixels. 10-fold cross validation is executed by randomly splitting the 60,000 images of the CIFAR-10 dataset into 80% training and 20% testing over 10 different folds for bias reduction. Performance is assessed by precision, recall, F1 and accuracy rates. Precision refers to the ratio of correctly predicted positive observations (true positives, TP) to the total predicted positive observations (true positives plus false positives, TP+FP); whilst recall is the ratio of correctly predicted positive observations (TP) to all observations in the actual class (true positives plus false negatives, TP+FN). The F1 score, taking false positives as well as false negatives into account, represents the weighted average of precision and recall. The accuracy rate represents the ratio of correctly predicted observations (TP+TN) to the total observations (TP+FP+FN+TN). It is noticed that true negatives (TN) are taken into account in the accuracy computation, whereas only TP is taken into account in the F1 score.

According to the testing sets, the performance results are averaged over the 10-fold cross validation, as exhibited in Table I, where the small, non-significant standard deviation (S.D.) values are shown in parentheses. The results in Table I indicate 2.65% and 0.06% recognition accuracy improvements for Conv573, whereas 0.39% and 1.76% performance decreases for Conv3 are respectively yielded when increasing the image size from 50×50 to 100×100 and from 100×100 to 150×150. For 50×50 and 100×100 resized images, Conv3 yields 4.74% and 1.64% better averaged accuracy than Conv573; however, Conv573 generates 0.18% better averaged accuracy than Conv3 with regard to the 150×150 resized images. Confusion matrices based on the best F1 scores over the 10-fold cross validation for the three image sizes, yielded by Conv573 and Conv3, are respectively shown in Figs. 2 and 3 for a deeper investigation of the results in Table I. Considering those confusion matrices, the largest and second largest numbers of wrong predictions occur where dogs are wrongly predicted as cats, and vice versa, for both Conv573 and Conv3 at all image sizes. This is due to the visual similarity between those two classes.
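A minimal sketch of the evaluation protocol described above (resizing CIFAR-10, ten random 80%/20% splits, and computing precision, recall, F1, accuracy and a confusion matrix per fold) is given below using scikit-learn and TensorFlow, reusing the illustrative build_model helper sketched in Section II. The optimizer, number of epochs and batch size are assumptions for illustration, not values reported in the paper.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, confusion_matrix)

# Load CIFAR-10 (60,000 images in total) and resize, e.g. to 50x50 pixels.
(x1, y1), (x2, y2) = tf.keras.datasets.cifar10.load_data()
x = np.concatenate([x1, x2]).astype("float32") / 255.0
y = np.concatenate([y1, y2]).ravel()
# Resizing the whole set at once keeps the sketch short; batching would use less memory.
x = tf.image.resize(x, (50, 50)).numpy()

scores = []
# Ten random 80%/20% train/test splits, following the paper's cross-validation description.
for train_idx, test_idx in ShuffleSplit(n_splits=10, test_size=0.2, random_state=0).split(x):
    model = build_model([[3, 3, 3], [3, 3], [3]], input_shape=(50, 50, 3))  # Conv3 sketch from Section II
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x[train_idx], y[train_idx], epochs=20, batch_size=128, verbose=0)  # assumed settings
    pred = model.predict(x[test_idx], verbose=0).argmax(axis=1)
    scores.append([precision_score(y[test_idx], pred, average="macro"),
                   recall_score(y[test_idx], pred, average="macro"),
                   f1_score(y[test_idx], pred, average="macro"),
                   accuracy_score(y[test_idx], pred)])
    print(confusion_matrix(y[test_idx], pred))  # e.g. to inspect cat/dog confusions

print("Precision, recall, F1, accuracy (mean over 10 folds):", np.mean(scores, axis=0))
```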


TABLE I. PERFORMANCE RESULTS RELYING ON TESTING SETS, AVERAGED ON THE 10-FOLD CROSS VALIDATION (S.D. IN PARENTHESES)

50×50
Model     Precision         Recall            F1                Acc
Conv573   0.84593 (0.004)   0.84243 (0.009)   0.84169 (0.008)   0.84170 (0.008)
Conv3     0.88170 (0.002)   0.88176 (0.002)   0.88090 (0.002)   0.88158 (0.002)

100×100
Model     Precision         Recall            F1                Acc
Conv573   0.86721 (0.04)    0.86479 (0.017)   0.86436 (0.016)   0.86402 (0.017)
Conv3     0.88015 (0.003)   0.87879 (0.005)   0.87839 (0.004)   0.87820 (0.005)

150×150
Model     Precision         Recall            F1                Acc
Conv573   0.86627 (0.006)   0.86522 (0.005)   0.86422 (0.006)   0.86453 (0.006)
Conv3     0.86596 (0.004)   0.86363 (0.008)   0.86223 (0.008)   0.8630 (0.008)

IV. CONCLUSION

In this paper, the impacts of kernel size on differently resized images are investigated for object recognition, based on shallow 3-phase CNNs utilizing a deep learning feed-forward architecture. The results, depending on 10 folds of cross validation, indicate that the larger the image size, the better the recognition accuracy for Conv573, which uses kernel sizes of 7×7, 5×5 and 3×3, while the converse happens for Conv3, which employs only a 3×3 kernel size. For 50×50 and 100×100 resized images, Conv3 yields better recognition performance than Conv573; whilst Conv573 generates slightly better recognition results than Conv3 on the 150×150 resized images. This may lead to the simple assumption that smaller resized images are better matched to the smaller rather than the larger kernel sizes, and vice versa. It is also noticed that the total number of parameters used for Conv3 is 7.80% less than for Conv573. Future work involves investigating the effects of kernel size and other related CNN parameters on other CNN architectures, where other datasets and a wider variety of kernel sizes would be tested.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 1090-1098.
[2] W. Rawat and Z. Wang, "Deep convolutional neural networks for image classification: A comprehensive review," Neural Computation, vol. 29, no. 9, 2017, pp. 2352-2449.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, 2015, pp. 1-14.
[4] R. Doon, T. K. Rawat, and S. Gautam, "Cifar-10 classification using deep convolutional neural network," IEEE Punecon, Pune, India, 2018, pp. 1-5.
[5] H. R. Roth et al., "Anatomy-specific classification of medical images using deep convolutional nets," 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), New York, NY, 2015, pp. 101-104.
[6] V. Thakkar, S. Tewary and C. Chakraborty, "Batch normalization in convolutional neural networks — A comparative study with CIFAR-10 data," 2018 5th International Conference on Emerging Applications of Information Technology (EAIT), Kolkata, 2018, pp. 1-5.
[7] R. C. Çalik and M. F. Demirci, "Cifar-10 image classification with convolutional neural networks for embedded systems," 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA), Aqaba, 2018, pp. 1-2.
[8] Ş. Öztürk, U. Özkaya, B. Akdemir and L. Seyfi, "Convolution kernel size effect on convolutional neural network in histopathological image processing applications," 2018 International Symposium on Fundamentals of Electrical Engineering (ISFEE), Bucharest, Romania, 2018, pp. 1-5.
[9] T.-D. Truong, V.-T. Nguyen and M.-T. Tran, "Lightweight deep convolutional network for tiny object recognition," 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM), Madeira, Portugal, 2018, pp. 675-682.
[10] N. Siripibal, S. Supratid and C. Sudprasert, "A comparative study of object recognition techniques: Softmax, linear and quadratic discriminant analysis based on convolutional neural network feature extraction," Proceedings of the 2019 International Conference on Management Science and Industrial Engineering (MSIE), Phuket, Thailand, 2019, pp. 209-214.
[11] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, 2014, pp. 1929-1958.
[12] F. E. Harrell and K. L. Lee, "A comparison of the discrimination of discriminant analysis and logistic regression under multivariate normality," in Statistics in Biomedical, Public Health, and Environmental Sciences, P. K. Sen, Ed. North-Holland: Elsevier Science Publishers, 1985, pp. 333-343.

