Journal of Real-Time Image Processing (2021) 18:2319–2329

https://doi.org/10.1007/s11554-021-01124-9

ORIGINAL RESEARCH PAPER

A real-time deep learning forest fire monitoring algorithm based on an improved Pruned + KD model

Shengying Wang1 · Jing Zhao2 · Na Ta1 · Xiaoye Zhao1 · Mingxia Xiao1 · Haicheng Wei1

Received: 3 February 2021 / Accepted: 4 May 2021 / Published online: 15 May 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
To meet the needs of embedded intelligent forest fire monitoring systems using unmanned aerial vehicles (UAVs), a deep learning fire recognition algorithm based on model compression and lightweight requirements is proposed in this study. The lightweight MobileNetV3 model is adopted to reduce the complexity of the conventional YOLOv4 network structure, redundant channels are eliminated through channel-level sparsity-induced regularization, and a knowledge distillation algorithm is used to improve the detection accuracy of the pruned model. The experimental results reveal that the number of model parameters for the proposed architecture is only 2.64 million, a reduction of nearly 95.87% compared with YOLOv4, while the inference time decreased from 153.8 to 37.4 ms, a reduction of nearly 75.68%. Compared with existing algorithms, our approach offers a smaller number of parameters, low memory requirements and fast inference speed. The method presented in this paper is therefore specifically tailored for use as a deep learning forest fire monitoring system on a UAV platform.

Keywords  Fire detection · YOLOv4 · MobileNetV3 · Model compression · Channel-level sparsity

* Corresponding author: Haicheng Wei, wei_hc@nun.edu.cn

1 School of Electrical and Information Engineering, North Minzu University, Yinchuan 750021, China
2 School of Information Engineering, Ningxia University, Yinchuan 750021, China

1 Introduction

As the global climate becomes warmer, the probability and intensity of forest fires will gradually increase; therefore, it is particularly important for people to monitor forest fires intelligently [1]. Consequently, the use of forest fire monitoring platforms based on computer vision technology mounted on an unmanned aerial vehicle (UAV) has been expanding rapidly.

Generally, forest fire detection methods based on computer vision technology fall into two categories. One method adopts a traditional image-processing algorithm that extracts colour characteristics, geometric shapes, and the trajectory of fire or smoke for fire identification. Rudz et al. studied fire colour spaces and segmented fire images with the goal of achieving better fire detection and determining flame characteristics more accurately [2]. Alternatively, Wei et al. proposed a discrimination algorithm based on K-means clustering combined with a threshold method to segment suspected fire areas. To improve the recognition, they used the sample entropy algorithm, which can eliminate interference from red houses, flags, and other objects with colours similar to fire [3]. Additionally, Surit et al. calculated and analysed the static and dynamic characteristics of smoke to accurately and effectively detect smoke from fires [4]. In contrast, Wirth et al. developed a modified histogram backprojection algorithm for flame detection in the YCbCr colour space that had a lower computational cost during detection [5].

Beyond the traditional image-processing algorithms, the other category of forest fire recognition methods is those based on deep learning architectures. For example, Muhammad et al. explored a computationally efficient convolutional neural network (CNN) model based on SqueezeNet architecture that used smaller convolution kernels for fire detection, location, and semantic understanding of a fire scene [6]. In addition, Muhammad et al. modified the GoogLeNet framework and established a fire detection CNN architecture appropriate for video monitoring [7]. Zhang et al. proposed the Faster R-CNN structure for wildland forest fire smoke detection, in which the convolutional layers extracted


various types and numbers of image features, thus avoiding the complex manual feature-extraction process of the traditional surveillance video smoke-detection methods [8].

However, the conventional detection algorithms are unable to accurately accomplish the task of fire monitoring because they are highly dependent on feature selection and suffer from the poor robustness caused by hand-crafted thresholds. Moreover, although fire detection algorithms based on deep learning have higher accuracy and strong robustness, they are difficult to implement in practical applications because their detection speeds are seriously restricted by the enormous numbers of model parameters and the resulting heavy computational burdens. Many studies have optimized deep learning models in efforts to improve their computational efficiency; such optimization can effectively reduce model complexity and model execution times [9, 10].

Therefore, to increase the robustness, decrease the complexity, and develop a miniaturized forest fire monitoring system suitable for a UAV platform, we propose a real-time forest fire monitoring algorithm based on the lightweight and compressed "you only look once" (YOLO) model. Initially, based on the lightweight YOLO model, redundant incoming and outgoing connections and the weights associated with redundant channels are pruned to reduce the number of parameters. Second, this paper designs the model loss function and the dataset for training; the goal is to perform model training using empty-label data containing objects similar to flame and smoke. While this approach increases the model loss during training, it does not increase the complexity of the model structure and can help reduce misidentified forest fires during testing. Finally, we add the knowledge distillation algorithm to allow the streamlined model to further discriminate between positive and negative samples and improve its sensitivity. We deployed the proposed algorithm on an NVIDIA Jetson Xavier NX development kit and achieved real-time forest fire detection from a UAV platform.

2 Lightweight and compressed YOLOv4 model

A standard YOLOv4 model needs only one inference pass to complete its inspection. As the fourth version of the YOLO family, the YOLOv4 network possesses obvious advantages, including high accuracy and short inference times. In detail, the YOLOv4 network incorporates mosaic data enhancement, self-adversarial training, cross mini-batch normalization, and improved SPP and PANet; these all contribute to improvements in inference speed and recognition accuracy. To make this model even more suitable for a UAV forest fire monitoring system, we modify the YOLOv4 network in this paper, resulting in a more compressed and lighter weight model.

2.1 YOLOv4 object detection and the lightweight model

As shown in Fig. 1, the YOLOv4 object detector is composed of feature extraction, feature enhancement, detection heads, and other modules [11]. A cross-stage partial (CSP) Darknet53 functions as its feature extractor, and CSP connections are added into each large partial dense block to enhance the CNN learning capability [12]. The feature maps, which are downsampled by factors of 8, 16, and 32, are used as the output of the feature extractor and connected to the feature enhancement module. Following this, a modified SPP [13] and PANet [14] are employed for feature enhancement to increase the receptive field and enhance feature fusion. Additionally, we apply three detection heads in our proposed algorithm to enhance object detection at different scales. Although the YOLOv4 object detection algorithm achieves an optimal trade-off between inference speed and detection accuracy [11], it has 63.95 million model parameters, which makes it difficult to deploy as an embedded model on a real-time forest fire monitoring platform.

Initially, all the convolutional layers in the YOLOv4 network used standard convolution operations. However, variants of convolution were later developed, allowing lightweight models such as MobileNet [15], SqueezeNet [16], and ShuffleNet [17] to be proposed in follow-up studies to help reduce the high memory requirements and the computational resources needed for inference. Taking MobileNetV3 as an example, the major changes relative to the MobileNetV1 and MobileNetV2 frameworks were the combination of depth-wise separable convolutions with a squeeze-and-excitation (SE) block, a lightweight gating mechanism, and a reduction of the SE channels to 1/4 of the original [18]. These changes lead to obvious advantages, including a smaller number of parameters, lower computational costs, reduced memory requirements and improved accuracy compared with a standard CNN, and they make the model more appropriate for use in a mobile terminal with limited computing power [19, 20].
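To make these building blocks concrete, the following is a minimal PyTorch sketch of a depth-wise separable convolution combined with an SE block using a 1/4 bottleneck and a lightweight hard-sigmoid gate. It is an illustrative simplification under those assumptions, not the exact MobileNetV3 bottleneck layout.

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE block: channel attention with the bottleneck reduced to 1/4."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global average pool
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(),                    # lightweight gating
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))         # excite: rescale channels

class DepthwiseSeparable(nn.Module):
    """Depth-wise separable convolution: per-channel 3x3 + 1x1 pointwise."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, stride, 1,
                                   groups=c_in, bias=False)
        self.bn1 = nn.BatchNorm2d(c_in)
        self.se = SqueezeExcite(c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.act = nn.Hardswish()                # h-swish, as in the CBh module

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        x = self.se(x)
        return self.act(self.bn2(self.pointwise(x)))
```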
As depicted in Fig. 2, we used the YOLOv4 object detection model as the backbone and modified its framework by replacing CSPDarknet53 with a MobileNetV3 module to establish the lightweight YOLO + MobileNet architecture. As a result, the total number of model parameters for our proposed model fell to 23.08 million, a decline of 63.91% compared to the original YOLOv4 network, constituting a drastic reduction in computational cost.


Fig. 1 Block diagram of YOLOv4 object detection. The small modules included are CBM: Convolution + Batch Normalization + Mish; CBL: Convolution + Batch Normalization + Leaky ReLU; and UP: Upsampling

Fig. 2 Block diagram of YOLO + MobileNet object detection. The small modules included are CBh: Convolution + Batch Normalization + h-Swish; CBL: Convolution + Batch Normalization + Leaky ReLU; and UP: Upsampling


2.2 YOLO model compression

A certain amount of redundancy exists in the weights, channels, and network layers even in the lightweight YOLO + MobileNet model; therefore, it is necessary to consider model compression. Under the premise of maintaining the prediction performance of the model, redundant and insignificant items are removed. In addition, weights in the convolution and fully connected layers that have little effect on the recognition accuracy are pruned to improve the application prospects of the model [21].

Typical model compression methods include sparsity, weight decomposition, low-rank decomposition, and weight sharing. To avoid changing the structure of the existing model, the approach proposed in this paper mainly utilizes channel-level sparsity to slim the model; channel-level sparsity is both easier to implement and more flexible than other sparsity methods. Hence, our proposed method chiefly prunes the model channels as follows. A scaling factor is inserted into each channel of the model and adjusted by being trained together with the weights. Finally, the channels with small scaling factors are treated as redundant and are removed. The loss function of the proposed model is:

$$\mathrm{Loss} = \mathrm{loss}_{\mathrm{YOLO}} + \lambda \sum_{\gamma \in \Gamma} g(\gamma), \tag{1}$$

where $g(\gamma) = |\gamma|$ represents an L1 regularization term for channel-level sparsity training.
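In code, Eq. (1) amounts to adding the λ-weighted sum of absolute scaling factors to the detector loss at every training step. A minimal PyTorch sketch follows, assuming (as explained in the next paragraph) that the scaling factors γ are the BN layer weights; `yolo_loss` is a placeholder for the detector's own loss:

```python
import torch.nn as nn

def l1_channel_sparsity(model: nn.Module):
    """g(gamma) summed over all channels: the L1 term of Eq. (1)."""
    return sum(m.weight.abs().sum()
               for m in model.modules()
               if isinstance(m, nn.BatchNorm2d))

# Inside the training loop, with lam = 0.001 as fixed in Sect. 3.3:
# loss = yolo_loss + lam * l1_channel_sparsity(model)
# loss.backward()
```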
The majority of the convolutional layers in the YOLO + MobileNet network structure are followed by batch normalization (BN) layers [22] and an activation function to boost the convergence speed. This approach is easy to implement and carries no additional computational cost because our proposed approach directly leverages the parameters in the BN layers as the scaling factors for sparsity training. The BN layers perform the following operation:

$$y = \gamma \frac{x - \mu_b}{\sqrt{\sigma_b^2 + \varepsilon}} + \beta, \tag{2}$$

where x and y denote the input and output of the BN layers, respectively, and γ and β represent the scale and shift parameters, respectively.

After the model is trained under channel-level sparsity-induced regularization, we obtain a model in which many scaling factors have a near-zero value; thus, we can prune the channels with near-zero scaling factors by removing all their incoming and outgoing connections and corresponding weights. As shown in Fig. 3, after pruning, the resulting narrower network has a more compact model size compared to the initial network.

Fig. 3 Sketch of the model pruning process: a the initial network depicts the YOLO + MobileNet model; b the compact network depicts the pruned model
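A sketch of the pruning step for a plain Conv-BN pair is given below; it keeps only the channels whose |γ| exceeds a threshold and copies the surviving filters and BN statistics into narrower layers. The returned index also selects the input channels of the following layer, which a full implementation must slice as well. This is a hedged illustration of the procedure, not the exact code used in the paper.

```python
import torch.nn as nn

def prune_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d, threshold: float):
    """Remove channels whose BN scaling factor gamma is near zero."""
    keep = bn.weight.detach().abs() > threshold            # surviving channels
    idx = keep.nonzero(as_tuple=True)[0]

    new_conv = nn.Conv2d(conv.in_channels, len(idx), conv.kernel_size,
                         conv.stride, conv.padding, bias=False)
    new_conv.weight.data = conv.weight.data[idx].clone()   # outgoing filters

    new_bn = nn.BatchNorm2d(len(idx))
    for name in ("weight", "bias", "running_mean", "running_var"):
        getattr(new_bn, name).data = getattr(bn, name).data[idx].clone()

    return new_conv, new_bn, idx  # idx slices the next layer's input channels
```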


Because the pruned model eliminates neurons that do not play significant roles during prediction, there is a concomitant reduction in model parameter quantity and computational cost. Inevitably, pruning even unimportant channels will degrade the classification accuracy to some degree. Accordingly, knowledge distillation is often used to recover the classification accuracy of the smaller models produced by compression [23]. As Fig. 4 shows, the lightweight YOLO + MobileNet model is employed as the teacher network, and the small, pruned model is considered the student network; the accuracy of the pruned model is promoted by knowledge distillation. At this stage, the proposed algorithm's loss includes two parts: the error between the predicted output of the student network and the real labels, and the difference between the predicted outputs produced by the student and teacher networks. Ideally, the student model can distinguish similar objects and maintain an accuracy close to that of the teacher network. Using this scheme, we were able to further compress the lightweight YOLO + MobileNet model and implement it in the embedded system to realize the airborne fire identification system.

Fig. 4 Schematic diagram of the knowledge distillation process: the YOLO + MobileNet model acts as the teacher network, while the pruned model functions as the student network
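The two-part loss described above can be sketched in PyTorch in the classic classification form of Hinton et al. [23]: a hard-label term against the real labels plus a temperature-softened term against the teacher's outputs. The weighting factor `alpha` is a hypothetical choice, and distilling a full detector additionally involves the box and objectness outputs, which are omitted here.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      T: float = 3.0, alpha: float = 0.5):
    """Hard-label error plus soft-target error, softened by temperature T."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)  # rescale gradient size
    return alpha * hard + (1.0 - alpha) * soft
```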
3 Experimental methods

3.1 Experimental dataset

Dataset quality is closely linked to the final performance of a model. Therefore, to make the model more effective at accomplishing the forest fire detection task, it is necessary to carefully collect, classify, and label the dataset. Because fire image datasets for forest environments are difficult to collect, the detection model used in this paper was pretrained on the MSCOCO dataset, and the weights were then transferred for additional training. At the point of transfer, the model can already detect 80 kinds of targets, but fire targets are not among them; fire detection can be conducted only after fine-tuning the model on a fire image dataset. A total of 1844 images were collected for this experiment; some samples from the collected dataset are shown in Fig. 5. These collected images included 1069 fire origin images with flames and smoke, as displayed in Fig. 5a. The remaining 775 images do not show fire but contain interference similar to fire and smoke, such as sunsets and clouds; these images were treated as negative samples to boost the robustness of the proposed approach, as indicated in Fig. 5b.

Fig. 5 Some images in the dataset. a Fire origin images with flames and smoke; b non-fire images that contain interference similar to that produced by fire and smoke

3.2 Model training

Misdetection is a serious problem in fire detection. The main reason for misdetection is that objects similar to flames and smoke are misjudged as fire. To reduce the probability of such occurrences, in this study, we added images similar to flames and smoke to the dataset; when these images are misdetected, the penalties are increased. The theoretical basis is that during the YOLO model training process, the loss function includes location loss, class loss, and confidence loss:

$$\mathrm{loss}_{\mathrm{YOLO}} = \mathrm{loss}_{\mathrm{location}} + \mathrm{loss}_{\mathrm{class}} + \mathrm{loss}_{\mathrm{confidence}}. \tag{3}$$


The calculations for location and class loss are based on the occurrence of the target object; the corresponding penalty is not added when there is no target object. In contrast, for the confidence loss, we want a high level of confidence only when a target object exists. This confidence level indicates that a bounding box has a high probability of bounding an object and of indicating its location accurately. On the other hand, we want the confidence to be zero when no target object exists:

$$\begin{aligned} \mathrm{loss}_{\mathrm{confidence}} &= \mathrm{loss}_{\mathrm{conf+obj}} + \mathrm{loss}_{\mathrm{conf+noobj}} \\ &= -\sum_{i=0}^{G \times G} \sum_{j=0}^{B} I_{ij}^{\mathrm{obj}} \left[ C_i \log(\hat{C}_i) + (1 - C_i) \log(1 - \hat{C}_i) \right] \\ &\quad - \sum_{i=0}^{G \times G} \sum_{j=0}^{B} I_{ij}^{\mathrm{noobj}} \left[ C_i \log(\hat{C}_i) + (1 - C_i) \log(1 - \hat{C}_i) \right], \end{aligned} \tag{4}$$

where $G \times G$ is the grid of the YOLO detection head, and $B$ represents the number of bounding boxes per grid cell. If the j-th bounding box of the i-th grid cell contains a target object, $I_{ij}^{\mathrm{obj}}$ has a value of 1; otherwise, it has a value of 0. In contrast, $I_{ij}^{\mathrm{noobj}}$ has a value of 1 when the j-th bounding box of the i-th grid cell does not contain a target object, and a value of 0 when it does. $C$ represents the confidence level ($\mathrm{confidence} = \Pr(\mathrm{Object}) \times \mathrm{IoU}_{\mathrm{pred}}^{\mathrm{truth}}$).
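For illustration, Eq. (4) is a binary cross-entropy over the predicted confidences, split by the object/no-object indicators. A minimal sketch, under the assumption that the predicted confidence has already passed through a sigmoid and with illustrative tensor shapes:

```python
import torch
import torch.nn.functional as F

def confidence_loss(pred_conf: torch.Tensor, target_conf: torch.Tensor,
                    obj_mask: torch.Tensor, noobj_mask: torch.Tensor):
    """Eq. (4): BCE objectness loss over grid cells and boxes.
    pred_conf / target_conf: C_hat and C, shape [G*G, B]
    obj_mask / noobj_mask:   the indicators I_ij^obj and I_ij^noobj"""
    bce = F.binary_cross_entropy(pred_conf, target_conf, reduction="none")
    return (bce * obj_mask).sum() + (bce * noobj_mask).sum()
```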
During YOLO model training, the predicted bounding box with the highest coincidence with a true bounding box is taken as a positive sample; each true bounding box has one and only one positive sample. Based on the location of the bounding box and the contained object, the location loss, the class loss, and the part of the confidence loss that includes the target object are calculated. When the overlap between the predicted bounding box and the true bounding box is below a certain threshold, the prediction is regarded as a negative sample, the location and class losses are not calculated, and only the part of the confidence loss that does not include the target object is calculated. The flame and smoke in a fire image are considered positive samples to improve the recall rate of the algorithm, while other areas are considered negative samples; the latter alone cannot improve the precision rate because objects similar to flame and smoke cause misdetections far more often than does the background environment. In this dataset, 1069 of the 1844 images are marked as having flame or smoke targets. The remaining 775 images do not include flame or smoke targets; when the model confidence level is non-zero in these 775 images, the penalty is increased.

We conducted experiments on an Ubuntu 18.04.3 system equipped with an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz and an NVIDIA GeForce RTX 2080 Ti GPU. We trained the YOLOv4 model in the same manner as the lightweight YOLO + MobileNet network structure. Using the PyTorch deep learning framework, we adopted a stochastic gradient descent (SGD) optimizer with Nesterov momentum. The momentum, initial learning rate and minimum learning rate were set to 0.937, 5e-3, and 5e-4, respectively. We applied mosaic data enhancement in our proposed model, and all the models were trained for 500 epochs. The CIoU loss curves, obtained by calculating the degree of overlap between the predicted boxes and the real boxes, are shown in Fig. 6. First, the CIoU losses of both the YOLOv4 and YOLO + MobileNet models tend to converge after approximately 100 epochs. Subsequently, the loss values fluctuated dramatically from 100 to 300 epochs. However, the loss curves declined and remained stable after 300 epochs, especially for the YOLO + MobileNet network structure. Eventually, the CIoU loss of the YOLOv4 model plateaued at approximately 2.2, while that of YOLO + MobileNet reached approximately 2.7, indicating that the prediction boxes of the YOLOv4 model were more in line with the real boxes than were those predicted by the YOLO + MobileNet network structure.

Fig. 6 The CIoU loss curves of YOLOv4 and YOLO + MobileNet
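The training setup above maps onto PyTorch roughly as follows. The paper does not name the learning-rate schedule, so a cosine decay from the initial rate of 5e-3 to the minimum of 5e-4 is assumed here; `model` and `train_one_epoch` are placeholders.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=5e-3,
                            momentum=0.937, nesterov=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=500, eta_min=5e-4)  # decay towards the minimum lr

for epoch in range(500):                 # all models trained for 500 epochs
    train_one_epoch(model, optimizer)    # hypothetical training-step helper
    scheduler.step()
```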


3.3 Model compression

In this paper, we chose the lightweight YOLO + MobileNet network structure as the backbone for creating the compressed model: we prune the unimportant channels and obtain an object detection model with a smaller number of parameters and a fast inference speed. Based on the channel-level sparsity algorithm proposed by Liu et al., the model first needed to be sparsely trained, then pruned, and finally fine-tuned.

The sparsity factor λ and the percentage of model clipping needed to be coordinated to ensure that, as the number of model parameters was reduced, the precision did not decrease too much. The sparsity factor λ affects the distribution of the scaling factors γ when the model to be clipped is sparsely trained: when λ = 0 (nonsparse training), no sparsity is induced, whereas when the λ value was too large, all the scaling factors γ approached 0 and the sparsity was excessive. Following [19], we fixed λ = 0.001. Our proposed algorithm was first trained under channel-level sparsity-induced regularization for 300 epochs. After sparsity training, the inessential channels were identified based on the scaling factors in the BN layers, as described earlier. Model clipping can then be performed according to the distribution of the scaling factors γ, and the model accuracy decreases as the clipping percentage increases. As shown in Fig. 7a, as the clipping percentage increased from 0% to approximately 80%, the number of parameters decreased rapidly while the mAP decreased smoothly, indicating that the clipped redundant channels had little effect on the detection accuracy. When the clipping percentage increased from 80 to 90%, the precision decreased dramatically. As shown in Fig. 7b, when the percentage of clipping reaches 91%, the mAP decreases significantly; therefore, the clipping percentage was set to 90%.

Fig. 7 The number of parameters and the change in mAP with the pruning percentage

Eventually, the architecture in this paper was fine-tuned. In this experiment, the pruned model was trained for several epochs, but the prediction accuracy improvements were not obvious, which led to the introduction of knowledge distillation to improve accuracy. Therefore, we took YOLO + MobileNet as the teacher network, considered the pruned model as the student network, and capitalized on the knowledge distillation algorithm to boost the accuracy of the pruned model. We selected the knowledge distillation algorithm because the teacher model outputs a "soft target": negative sample labels assigned a high probability indicate that they are similar to the positive sample labels, which helps improve the generalizability of the student models. A higher temperature parameter T (1 < T < 20) is set for knowledge distillation when the clipping ratio is large and greater generalizability is desired. However, the temperature parameter T should not be set too large; otherwise, the teacher model's output distribution is flattened excessively. When the clipping proportion is not large, fine-tuning training with a temperature parameter of T = 1 suffices. Because the 90% clipping ratio set in this study did not decrease the precision significantly, we adopted a relatively small temperature parameter of T = 3. Finally, we named the compact network established in this paper the Pruned + KD model.
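The mapping from a clipping percentage to a concrete pruning threshold can be sketched as taking the corresponding quantile of all scaling factors; with `clip_ratio = 0.9`, as in this study, 90% of the channels fall below the returned threshold. This is an assumed implementation consistent with the procedure described above, not the paper's own code.

```python
import torch
import torch.nn as nn

def prune_threshold(model: nn.Module, clip_ratio: float = 0.9) -> float:
    """Global |gamma| value below which clip_ratio of all channels fall."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    return torch.quantile(gammas, clip_ratio).item()
```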

Table 1 Comparison of parameter quantity and mAP scores among the tested object detection models

Model              Parameters (M)   Fire (AP)   Smoke + Fire (AP)   Smoke (AP)   All (mAP)
YOLOv4             63.95            0.648       0.806               0.556        0.670
YOLO + MobileNet   23.80            0.580       0.815               0.604        0.666
Pruned             2.63             0.424       0.734               0.517        0.558
Pruned + KD        2.63             0.612       0.726               0.554        0.631


4 Experimental results and analysis

First, based on the premise that the input images would all have the same size, we evaluated the influence of the different backbone models, including the YOLOv4, YOLO + MobileNet, Pruned and Pruned + KD models, on the parameter quantity and the mean average precision (mAP), as shown in Table 1. In terms of parameter quantity, the total number of model parameters of the original YOLOv4 was 63.95 million. This number was reduced by 62.78% (to 23.80 million) for the YOLO + MobileNet model. The number of Pruned + KD model parameters was only 2.63 M, constituting a further drop of approximately 90%. Regarding detection accuracy, the YOLOv4 model achieved mAP scores similar to those of the YOLO + MobileNet model for all types of objects; the difference was only 0.004, with mAP values of 0.670 and 0.666 for the YOLOv4 and YOLO + MobileNet models, respectively. In contrast, a considerable drop occurred in the recognition accuracy of the Pruned model; its mAP value declined to 0.558. However, the Pruned + KD model achieved increased recognition accuracy, reaching an mAP value of 0.631, which is a reduction of only 0.035 compared with the YOLO + MobileNet model.

Furthermore, the inference speeds of three models, YOLOv4, YOLO + MobileNet and Pruned + KD, were evaluated on different devices, including an NVIDIA GeForce RTX 2080 Ti GPU and an NVIDIA Jetson Xavier NX, with 416 × 416-pixel input images, as shown in Table 2.

Table 2 Comparison of inference speeds between different models

Development platform   YOLOv4 (ms)   YOLO + MobileNet (ms)   Pruned + KD (ms)
RTX 2080 Ti            7.8           2.9                     2.0
Xavier NX              153.8         50.6                    37.4

Fig. 8 The performances of different detection models: a the detection effect of YOLOv4; b the detection effect of YOLO + MobileNet; c the detection effect of Pruned + KD


On one hand, when the inference speeds were tested on the system with the NVIDIA GeForce RTX 2080 Ti GPU, the inference speed of the lightweight YOLO + MobileNet model was 2.69 times faster than that of YOLOv4, while the Pruned + KD model required only 2 ms, reaching a speed of nearly 500 FPS and a further 1.45-fold increase in inference speed. On the other hand, when the models were tested on the NVIDIA Jetson Xavier NX, the inference speed of the YOLO + MobileNet model was 3.04 times faster than that of the YOLOv4 network structure. Similarly, the inference speed of the Pruned + KD model increased by a further 1.35 times, and its inference time was 37.4 ms, which is approximately 26.74 FPS. In summary, our study demonstrates that a real-time fire detection system running on an NVIDIA Jetson Xavier NX mounted on a UAV platform can be achieved by slightly decreasing the frame rate of the input video.
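For reference, the quoted frame rates follow directly from the measured inference times:

$$\mathrm{FPS} = \frac{1000\ \mathrm{ms}}{t_{\mathrm{inference}}}, \qquad \frac{1000}{2.0} = 500\ \mathrm{FPS}, \qquad \frac{1000}{37.4} \approx 26.74\ \mathrm{FPS}.$$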
Figure 8 shows the three different models, namely YOLOv4, YOLO + MobileNet, and Pruned + KD, detecting forest fires at the same time. We evaluated the detection performance of the different models from two aspects: object location annotations and confidence level. Regarding the first aspect, all the detection models accurately marked the flame position in the image in the first column on the left. Although many similar objects occur in the middle column of images, the YOLOv4 and YOLO + MobileNet models could still identify fires accurately, whereas the Pruned + KD model had one missed detection. The smoke + fire area marked by the YOLOv4 model is more accurate than those of the two compared models. Specifically, the right and bottom borders of the label box marked by the YOLOv4 model are close to the boundary of the object area, but the bounding boxes predicted by the two other models deviate from the object area boundary to a small degree. Finally, the rightmost column of Fig. 8 indicates that all three detection models can accurately locate the position of smoke. As the second aspect, we considered the flame detection confidence level as the criterion for assessing the detection effect of all the models. Due to the high proportion of flame images in our dataset, the confidence level of flame detection was higher than that of smoke detection, mostly reaching more than 90%, as illustrated in the leftmost image. However, as shown in the other two columns, the confidence level of smoke detection was relatively low. In future studies, smoke images should be aggregated into a separate dataset to improve the accuracy of smoke area localization and thereby the confidence level of smoke detection.
Two issues that need further attention for the forest fire detection task are missed and mistaken detections. To reduce the rate of missed detection, this paper labelled and trained on fire targets to ensure the detection accuracy. However, for mistaken detection, model training using only fire image datasets is not a good solution because it is susceptible to interference from objects similar to flames and smoke. This type of interference image was trained as an empty-label dataset in the YOLO model, and the mistaken detection rate was reduced by increasing the confidence penalty. To evaluate the effectiveness of this method, we used 200 images containing fire objects and 300 images containing objects similar to fire to test the models' missed detection and false detection rates. If an image is detected by the YOLO model as containing a target frame of fire or smoke, the image is determined to be a fire image, regardless of the accuracy of the target frame location. The accuracy of the different detection models was calculated based on a confusion matrix.

Table 3 Accuracy comparison of different detection models

Model                      Recall (%)   Precision (%)   Accuracy (%)
YOLOv4 (empty)             100          99.21           99.78
YOLO + MobileNet           98.41        81.05           93.30
YOLO + MobileNet (empty)   97.62        100             99.35
Pruned + KD                98.41        88.57           96.11
Pruned + KD (empty)        99.21        99.21           99.57

Empty represents a model that is trained with an empty label dataset, and the bold portion represents the benchmark model and its detection accuracy

As shown in Table 3, the YOLOv4 model trained with empty labels has a recall of 100%. These results show that the model can overcome the missed detection problem, that its mistaken detection rate is extremely low, and that its overall detection accuracy is high. We used this YOLOv4 (empty) model as the baseline model to evaluate the performance of the improved models in this paper. The precision of the YOLO + MobileNet model is significantly lower than that of the baseline model, indicating that its misdetection problem is serious. However, the YOLO + MobileNet (empty) model achieves a precision of 100%, and its overall accuracy is also high, indicating that after adding an empty-label dataset to model training, the misdetection problem is effectively solved. The precision of the Pruned + KD model is also significantly lower than that of the baseline model but higher than that of the YOLO + MobileNet model, because knowledge distillation increases the generalizability of the model. The recall of the Pruned + KD (empty) model was 99.21%, its precision was 99.21%, and its accuracy was 99.57%. Despite significantly reducing the number of parameters and increasing the inference speed, the missed detection and misdetection rates are effectively balanced: adding an empty-label dataset reduces the misdetection rate, and the knowledge distillation algorithm improves the generalizability of the model.
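As a reminder of how the Table 3 columns are conventionally derived from an image-level confusion matrix (an image counts as a fire image if the model emits any fire or smoke box), a minimal sketch:

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int):
    """Recall, precision, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)                      # fire images found
    precision = tp / (tp + fp)                   # alarms that are real fires
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # overall correct decisions
    return recall, precision, accuracy
```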


5 Conclusions

Forest fire detection from a UAV platform has the advantages of a real-time remote monitoring system. Furthermore, an object detection model suitable for deployment on an embedded development kit is of the utmost importance for implementing UAV fire detection.

In this study, we initially chose the YOLOv4 object detection architecture as the network backbone. However, this model is unsuitable for implementation on devices with limited computing power due to its large number of parameters, heavy computational burden, and substantial memory requirements. Therefore, we replaced the backbone network of the YOLOv4 model with a MobileNetV3 model to establish an initial lightweight YOLO + MobileNet model and decrease the number of model parameters and the computational burden. Then, we further compressed the model by removing redundant parts of the proposed network structure. Finally, we improved the detection accuracy of the compressed model using knowledge distillation and obtained the final Pruned + KD model. During training, images showing fire and smoke were selected as training sets, and images with fire-like objects were added to function as negative samples and improve model robustness.

The total number of model parameters of our proposed Pruned + KD architecture fell to 2.64 million, nearly 95.87% smaller than the number of parameters in the original YOLOv4 model. When running on an NVIDIA Jetson Xavier NX development kit, the inference time of the Pruned + KD model was 37.4 ms, which is 4.11 times faster than the 153.8 ms required by the YOLOv4 model. Furthermore, for the forest fire detection task in this paper, the mAP of the Pruned + KD model reached 0.631, which is only 0.039 below that of the original YOLOv4 network structure. Finally, we deployed our proposed algorithm on an NVIDIA Jetson Xavier NX development kit, which can provide a reference for constructing forest fire detection and other mobile object detection models on UAV platforms.

Acknowledgements Data processing was supported by the Ningxia Technology Innovative Team of advanced intelligent perception and control and the Key Laboratory of Intelligent Perception Control at North Minzu University.

Author contributions S.W. and H.W. designed the experiments; S.W., J.Z., and X.Z. processed the data; S.W., M.X., and N.T. analyzed the data and wrote the original paper. All authors have read and agreed to the published version of the manuscript.

Funding This research was funded by the National Natural Science Foundation of China (No. 61861001) and the Postgraduate Innovation Project of North Minzu University (No. YCX20111).

Data availability statement All data generated or presented in this study are available upon request from the corresponding author.

Declarations

Conflict of interest The authors declare no conflict of interest.

References

1. Donald, M., Littell, J.S.: Climate change and the eco-hydrology of fire: will area burned increase in a warming western USA? Ecol. Appl. 27(1), 26–36 (2017)
2. Rudz, S., Chetehouna, K., Hafiane, A., et al.: On the evaluation of segmentation methods for wildland fire. In: International Conference on Advanced Concepts for Intelligent Vision Systems, pp. 12–23. Springer, Berlin, Heidelberg (2009)
3. Wei, H.C., Wang, S.Y., Xu, Y.J., et al.: Forest fire image recognition algorithm of sample entropy fusion and clustering. J. Electron. Measure. Instrum. 34(01), 171–177 (2020)
4. Surit, S., Chatwiriya, W.: Forest fire smoke detection in video based on digital image processing approach with static and dynamic characteristic analysis. In: 2011 First ACIS/JNU International Conference on Computers, Networks, Systems and Industrial Engineering, pp. 35–39. IEEE (2011)
5. Wirth, M., Zaremba, R.: Flame region detection based on histogram backprojection. In: 2010 Canadian Conference on Computer and Robot Vision, pp. 167–174. IEEE (2010)
6. Muhammad, K., Ahmad, J., Lv, Z., et al.: Efficient deep CNN-based fire detection and localization in video surveillance applications. IEEE Trans. Syst. Man Cybern. Syst. 49(7), 1419–1434 (2019)
7. Muhammad, K., Ahmad, J., Mehmood, I., et al.: Convolutional neural networks based fire detection in surveillance videos. IEEE Access 6, 18174–18183 (2018)
8. Zhang, Q.X., Lin, G.H., Zhang, Y.M., et al.: Wildland forest fire smoke detection based on faster R-CNN using synthetic smoke images. Proc. Eng. 211, 441–446 (2018)
9. Kuanar, S., Rao, K.R., Bilas, M., et al.: Adaptive CU mode selection in HEVC intra prediction: a deep learning approach. Circuits Syst. Signal Process. 38(11), 5081–5102 (2019)
10. Kuanar, S., Athitsos, V., Mahapatra, D., et al.: Low dose abdominal CT image reconstruction: an unsupervised learning based approach. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 1351–1355. IEEE (2019)
11. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, pp. 1–17 (2020)
12. Wang, C.Y., Mark Liao, H.Y., Wu, Y.H., et al.: CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391 (2020)
13. He, K., Zhang, X., Ren, S., et al.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015)
14. Liu, S., Qi, L., Qin, H., et al.: Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768 (2018)
15. Howard, A., Sandler, M., Chu, G., et al.: Searching for MobileNetV3. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324 (2019)


16. Iandola, F.N., Han, S., Moskewicz, M.W., et al.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)
17. Zhang, X., Zhou, X., Lin, M., et al.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018)
18. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
19. Howard, A.G., Zhu, M., Chen, B., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
20. Sandler, M., Howard, A., Zhu, M., et al.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
21. Liu, Z., Li, J., Shen, Z., et al.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744 (2017)
22. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
23. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Shengying Wang received his B.Sc. degree in 2018 from Liaocheng University; he is now a postgraduate student at North Minzu University. His main research interests include image recognition.

Haicheng Wei received his B.Sc. degree in 1997, his M.Sc. degree in 2004, and his Ph.D. degree in 2012, all from Xi'an Jiaotong University. He is now an associate professor at the Basic Experimental Teaching and Engineering Training Center, North Minzu University. His main research interests include image recognition.
