Impact of Image Resizing on Deep Learning Detectors for Training Time and Model Performance
1 Introduction
2 Related Work
Considerable work on image processing has been carried out in the field of machine learning for object detection. Avidan et al. [7] proposed seam carving, which performs retargeting by inserting and removing streams of pixels, called seams, that pass through the less important features. Pritch et al. [8] presented shift-maps for the rearrangement of pixels, formulated as a graph-labelling problem for editing images in various applications. Luo et al. [9] introduced a method to characterize the effective receptive field, which focuses on different image regions at low, middle, and high levels so that no information is left out during classification. Oquab et al. [10] noted that training large neural network architectures requires more time and resources, since these models must perform heavy computations to process the images. Hashemi [11] proposed zero-padding for resizing images to the same size and compared it with the conventional approach of scaling up images by interpolation. Shorten and Khoshgoftaar [12] showed that data augmentation can enhance the performance of neural network models and expand limited datasets to exploit the benefits of big data, and surveyed various regularization approaches that improve generalization and reduce overfitting. Tan et al. [13] proposed a neural network architecture design technique for object detection and suggested several optimizations to enhance the efficiency of neural network models.
Lin et al. [14] proposed a method that exploits the pyramidal hierarchy of deep learning to build feature pyramids at marginal additional cost. Krahenbuhl et al. [15] proposed a method based on an energy map, composed of many automatic constraints and constraints determined on the key frames, to compute a non-uniform pixel warping on video frames. That work focused on preserving the local structure and on optimizing a warping from the original size to the desired size according to the important and permitted regions of the image.
3 Experimental setup
Object detection is a computer vision technique that involves predicting the presence of one or more objects in an image, together with their confidence scores and bounding boxes. YOLO is a state-of-the-art object detector targeted at real-time applications. There are five versions of this deep learning architecture. The first three YOLO versions (YOLO [16], YOLOv2 [17], and YOLOv3 [4]) were released in 2016, 2017, and 2018, respectively, by Joseph Redmon. In 2020, two further major versions were released (YOLOv4 [18] and YOLOv5), which brought considerable improvements over the previous versions. A YOLO detector uses a single neural network: it takes an image as input, passes it through a convolutional neural network (CNN), and generates a vector of bounding boxes with class predictions as output. The input image is divided into an S x S grid of cells, see Figure 1. Each grid cell is responsible for predicting the objects whose center falls inside it. Each grid cell predicts B bounding boxes with class probabilities C. A bounding box has five components (x, y, w, h, and confidence). The coordinates x and y represent the center of the bounding box, while w and h represent its width and height. These coordinates are normalized to fall between 0 and 1.
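As an illustration of this encoding, the short sketch below converts a normalized (x, y, w, h) prediction back to pixel coordinates; the function name and the example values are ours and not part of any YOLO implementation.

```python
def to_pixel_box(x, y, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x, center y, width, height in [0, 1])
    to pixel corner coordinates (x_min, y_min, x_max, y_max)."""
    cx, cy = x * img_w, y * img_h   # box center in pixels
    bw, bh = w * img_w, h * img_h   # box size in pixels
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

# Example: a normalized prediction on a 416 x 416 input image
print(to_pixel_box(0.5, 0.5, 0.25, 0.4, 416, 416))  # (156.0, 124.8, 260.0, 291.2)
```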
The YOLO models were evaluated on the testing dataset at the three different resolutions of the vehicle images. Only vehicles were considered as objects to be detected. As a result of this study, the vehicle images were detected by YOLO at all resolutions. The YOLO detectors display the vehicle class and the Intersection over Union (IoU) for each detected vehicle together with its bounding box.
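The IoU between a predicted box and a ground-truth box can be computed as in the following sketch (boxes given as (x_min, y_min, x_max, y_max); the helper is our illustration, not the detectors' internal code):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((50, 50, 150, 150), (100, 100, 200, 200)))  # ~0.143
```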
It was observed that, for all YOLO detectors, the training time changes as the image size is varied: the training time decreases when the image resolution is downscaled and increases when the resolution is upscaled.
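Producing the differently sized copies of the training images can be done, for instance, with OpenCV, as in the minimal sketch below; the directory layout, file paths, and interpolation choice are our assumptions, not the authors' actual preprocessing pipeline.

```python
import cv2
from pathlib import Path

# Target resolutions considered in this study: downscaled, original, upscaled
TARGET_SIZES = [(300, 300), (416, 416), (640, 480)]

def resize_dataset(src_dir, dst_root):
    """Write a resized copy of every image in src_dir for each target size."""
    for width, height in TARGET_SIZES:
        out_dir = Path(dst_root) / f"{width}x{height}"
        out_dir.mkdir(parents=True, exist_ok=True)
        for img_path in Path(src_dir).glob("*.jpg"):
            img = cv2.imread(str(img_path))
            resized = cv2.resize(img, (width, height), interpolation=cv2.INTER_LINEAR)
            cv2.imwrite(str(out_dir / img_path.name), resized)

resize_dataset("vehicles/images", "vehicles/resized")
```

Since YOLO annotations are normalized to the image size, the label files do not need to be rescaled when the images are resized.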
Table 1 reports the training computational time with respect to the different image sizes for all YOLO detectors. YOLOv4 required a long training time because of the size of the backbone (CSPDarknet53) used in its architecture. The YOLOv5 detector, however, showed the fastest training time at the three different resolutions of the vehicle images, and it is more intuitive to use than all prior YOLO detectors. In addition to these experiments, we evaluated all YOLO detectors on three standard performance metrics: accuracy, precision, and recall, see Eq. (2).
\[
\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN} \tag{2}
\]
where TP stands for the number of true positives, TN for the number of true negatives, FP for the number of false positives, and FN for the number of false negatives.
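These metrics can be computed directly from the confusion-matrix counts, as in the short sketch below; the function and the example counts are our illustration, not the evaluation code used in this study.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts (Eq. 2)."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Example with illustrative counts
print(detection_metrics(tp=96, tn=2, fp=6, fn=4))  # (0.907..., 0.941..., 0.96)
```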
The detailed results of each performance score for the YOLO models at the different image resolutions are reported in Table 3. For all YOLO detectors, the performance metrics decreased when the images were resized to 300 x 300 or 640 x 480 pixels, compared with the scores obtained at the original 416 x 416 image size. Note that YOLOv4 showed the best performance at every image resolution, which is due to the depth and strength of its architecture across different image sizes; its CSPDarknet53 backbone enables more accurate detection than the other detectors. The inference time was also measured for all YOLO detectors. We examined the execution time of the YOLO models when predicting the vehicle classes in the images and, as can be seen in Table 2, YOLOv5 is the fastest detector at all image resolutions in comparison with the other YOLO detectors. YOLOv5 is built on the Ultralytics PyTorch framework, which makes it intuitive to use and very fast at inference.
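For reference, loading a pretrained YOLOv5 model through the Ultralytics PyTorch Hub interface and running inference can look like the sketch below; the small yolov5s weights and the image path are illustrative, whereas the models in this study were trained on the vehicle dataset.

```python
import torch

# Load a pretrained YOLOv5 model from the Ultralytics repository (downloads weights on first use)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference on a single image and inspect the detections
results = model('vehicle.jpg')
results.print()            # per-detection class, confidence, and box summary
boxes = results.xyxy[0]    # tensor of (x_min, y_min, x_max, y_max, confidence, class)
```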
Table 1. Training time with respect to the different image sizes for all YOLO detectors.
Table 2. Inference time for the YOLO models with respect to the different image sizes.
Table 3. Impact of image resizing on YOLO model performance in terms of accuracy, precision, and recall.
Model     Resolution   Recall   Precision   Accuracy
YOLOv2    300x300      83%      86%         88%
          416x416      86%      88%         88%
          640x480      84%      84.6%       85%
YOLOv3    300x300      85%      87%         88%
          416x416      92%      91.8%       94%
          640x480      87%      88.1%       91.2%
YOLOv4    300x300      94%      93.8%       97%
          416x416      96%      94%         98%
          640x480      95.3%    95.1%       97.8%
YOLOv5    300x300      88%      90.1%       90%
          416x416      92%      94%         95.1%
          640x480      90%      91.5%       92%
Acknowledgments.
We thank the Islamic Development Bank for the support of the Ph.D. work of A.
Elhanashi and the Crosslab MIUR project.
References
[1] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems (pp.
1097-1105).
[2] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., 2017. Densely connected
convolutional networks. IEEE Conf. on Computer Vision and Pattern Recognition (pp. 4700-
4708).
[3] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition.
IEEE conference on computer vision and pattern recognition (pp. 770-778).
[4] Redmon, Joseph & Farhadi, Ali. (2018). YOLOv3: An Incremental Improvement.
[5] Iandola, Forrest & Han, Song & Moskewicz, Matthew & Ashraf, Khalid & Dally, William &
Keutzer, Kurt. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and
<0.5MB model size.
[6] Hashemi M. Webpage classification: a survey of perspectives, gaps, and future directions.
Multimedia Tools Appl. 2019. https://doi.org/10.1007/s11042-019-08373-8.
[7] S. Avidan and A. Shamir. Seam carving for content-aware image resizing. ACM Trans.
Graph., vol. 26, n. 3, July 2007.
[8] Y. Pritch, E. Kav-Venaki, and S. Peleg. Shift-map image editing. In 2009 IEEE 12th
International Conference on Computer Vision, pp. 151–158, Sept 2009
[9] Luo, W., Li, Y., Urtasun, R. and Zemel, R., 2016. Understanding the effective receptive field
in deep convolutional neural networks. In Advances in neural information processing systems,
pp. 4898-4906.
[10] Oquab, M., Bottou, L., Laptev, I. and Sivic, J., 2014. Learning and transferring mid-level
image representations using convolutional neural networks. IEEE conference on computer
vision and pattern recognition, pp. 1717- 1724.
[11] Hashemi, Mahdi. (2019). Enlarging smaller images before inputting into convolutional
neural network: zero-padding vs. interpolation. Journal of Big Data. 6. 10.1186/s40537-019-
0263-7.
[12] Shorten, Connor & Khoshgoftaar, Taghi. (2019). A survey on Image Data Augmentation for
Deep Learning. Journal of Big Data. 6. 10.1186/s40537-019-0197-0.
[13] Tan, Mingxing & Pang, Ruoming & Le, Quoc. (2019). EfficientDet: Scalable and Efficient
Object Detection.
[14] Lin, Tsung-Yi & Dollár, Piotr & Girshick, Ross & He, Kaiming & Hariharan, Bharath &
Belongie, Serge. (2016). Feature Pyramid Networks for Object Detection.
[15] P. Krahenbuhl, M. Lang, A. Hornung, and M. Gross. A system for retargeting of streaming
video. ACM Trans. Graph., vol. 28, n. 5, pp. 1-10, Dec. 2009.
[16] Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., 2016. You only look once: Unified, real-time object detection. IEEE conference on computer vision and pattern recognition (pp. 779-788).
[17] Redmon, Joseph, Farhadi, Ali. (2016). YOLO9000: Better, Faster, Stronger.
[18] Bochkovskiy, Alexey, Wang, Chien-Yao and Liao, Hong-Yuan Mark. (2020). YOLOv4: Optimal Speed and Accuracy of Object Detection, https://arxiv.org/abs/2004.10934
[19] Zenodo, 2021. VehiclesDataset_416x416. [online] Available at: <https://zenodo.org/record/4106511#.YCnQEGgzaUk> [Accessed 3 September 2021].