
Real-time Marine Animal Images Classification by Embedded System Based on Mobilenet and Transfer Learning

Xuefeng Liu, Zhenqing Jia, Xiaoke Hou
College of Automation & Electronic Engineering, Qingdao University of Science and Technology, Qingdao, China
nina.xf.liu@hotmail.com, jia101728@163.com, 574741423@qq.com

Min Fu, Li Ma
College of Information Science & Engineering, Ocean University of China, Qingdao, China
fumin@ouc.edu.cn, sugemali@hotmail.com

Qiaoqiao Sun
Institut Fresnel, Ecole Centrale de Marseille (ECM), Marseille, France
qiaoqiao.sunny@foxmail.com

Abstract—In marine aquaculture, the classification and identification of marine animal images is one of the important research topics; it can help monitor the growth and fishing of marine animals, as well as water conditions. Generally, the video and image information of marine animals transmitted by underwater cameras is identified by a human or a computer, which can hardly achieve real-time processing. To classify marine animal images efficiently and in real time, a method combining an embedded system and deep learning, based on MobileNetV2 and transfer learning, is proposed in this paper. Firstly, the marine animal images are collected by an underwater robot equipped with an embedded device. Secondly, a MobileNetV2 model based on the convolutional neural network (CNN) is constructed according to the marine animal images, which can ensure the real-time requirement. Thirdly, transfer learning is used to further improve the classification. Then, the model is trained on the collected marine animal images. Finally, the trained model is downloaded to the embedded device to classify the marine animal images under water in real time. To evaluate the performance of the proposed method, the InceptionV3 and MobileNetV1 models are used for comparison in the experiments, and the accuracy rate and average classification time are calculated. The results show that the MobileNetV2 model plus transfer learning is a better choice for the real-time classification of marine animal images than the other two considered models. Furthermore, the size of the MobileNetV2 model is only about 40M, which is suitable for the embedded device.

Keywords—CNN, parameter transfer, fine tuning, inception, underwater image

I. INTRODUCTION

Efficient classification and identification of marine animals from underwater images can help monitor the growth of marine animals and water conditions [1,2]. Manual recognition is time consuming and not accurate enough. Therefore, a high-precision and real-time method is necessary for marine animal image classification.

At present, some machine learning methods have achieved outstanding results in marine animal image classification. The BP neural network has been used for fish recognition [3]. A maximum probability partial ranking method based on sparse representation (SRC-MP) was proposed for real-world fish recognition and identification [4]. A linear SVM classifier with spatial pyramid pooling (SPP) was used for classification on a real-world fish recognition dataset [5]. These methods can achieve high identification accuracy, but they extract only one or a few features of the image, and their requirements on the data set are high. If the number of images is insufficient, the identification accuracy is reduced.

With the improvement of computer performance, deep learning shows great potential in large-scale data processing. The convolutional neural network (CNN) is one of the most popular neural networks for image classification and target detection [6]. The CNN extracts the features of the original image through convolution layers and pooling layers, and its training improves the recognition accuracy by continuously updating the weights [7]. In recent years, it has been widely used in face recognition [8], text recognition [9], license plate detection [10], medical imaging [11] and other fields.

The work was supported by the National Natural Science Foundation of China (Grant No. 61401244, 61773277).

Transfer learning is a method of transferring weights from a trained network to another, untrained network. It needs only a small number of training samples from the source and target data.

Therefore, we propose a marine animal image classification method based on transfer learning and the CNN-based MobileNet. This method uses MobileNetV2 as the pre-training model. Some parameters are transferred from the deep learning model trained on the source data of the ImageNet dataset; the other parameters are trained on the images captured by the underwater robot. Then, we compare the accuracy and recognition time in the experiments to evaluate the performance of the proposed method.

II. CONVOLUTIONAL NEURAL NETWORK AND TRANSFER LEARNING

A. Convolutional Neural Network

The convolutional neural network (CNN) is a kind of deep feedforward neural network that imitates the hierarchical structure of biological neural networks. The structure of a CNN model consists of an input layer, convolution layers, pooling layers, fully connected layers and an output layer [12]. A CNN extracts image information layer by layer through convolution kernels. In recent years, deep convolutional neural network models trained on massive sample sets have reached an unprecedented level in speed and recognition accuracy.

B. InceptionV3

The InceptionV3 network developed from the GoogleNet network [13]. Before the advent of GoogleNet, network performance was improved by increasing the number of network layers and neurons, with the disadvantages of a huge amount of calculation and over-fitting. Therefore, the main idea of the InceptionV3 model is to replace large convolution kernels with small ones. For example, the 5×5 convolution kernel is decomposed into two layers of 3×3 convolution kernels, and the 3×3 convolution kernels are decomposed into 1×3 and 3×1 convolution kernels [14]. In addition, a batch normalization (BN) layer is added to normalize the output of each layer to a Gaussian distribution with mean 0 and variance 1.

C. MobileNetV1

MobileNetV1 is an efficient model for mobile and embedded devices. Based on a streamlined architecture, MobileNets use depthwise separable convolutions to construct lightweight deep neural networks [15]. A depthwise separable convolution consists of two parts: a depthwise convolution and a pointwise convolution.

The kernel parameters of a traditional convolution are DK×DK×M×N, where DK×DK is the size of the convolution kernel, M is the number of input channels and N is the number of output channels. The kernel parameters of the depthwise convolution are DK×DK×1×M: the M kernels of size DK×DK are convolved with the corresponding channels of the input features. In the pointwise convolution, the kernel parameters are 1×1×M×N.

Standard convolutions have a computational cost of DK×DK×M×N×DF×DF, while depthwise separable convolutions cost DK×DK×M×DF×DF + M×N×DF×DF. We get a reduction in computation of:

$$\frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^2} \qquad (1)$$

When using a 3×3 convolution kernel, depthwise separable convolutions can therefore reduce the computational complexity to about 1/8 to 1/9 of the original. MobileNetV1 is essentially designed on this decomposition hypothesis. MobileNet's separable convolution module compresses both the parameters and the computational complexity. It enables modern mobile devices to give full play to the computing power of the CPU and GPU, and speeds up image recognition without reducing accuracy.
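To make this decomposition concrete, the following minimal Keras sketch (ours, not from the paper) contrasts a standard convolution with its depthwise separable counterpart; the 112×112 feature map, the 32 input channels and the 64 output channels are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative input feature map: DF x DF = 112 x 112, M = 32 channels.
inputs = tf.keras.Input(shape=(112, 112, 32))

# Standard convolution: one DK x DK x M x N kernel,
# costing DK*DK*M*N*DF*DF multiply-adds.
standard = layers.Conv2D(64, kernel_size=3, padding="same")(inputs)

# Depthwise separable convolution as in MobileNetV1:
# (1) depthwise: one DK x DK filter per input channel (DK x DK x 1 x M);
# (2) pointwise: a 1 x 1 convolution mixing the M channels into N outputs.
depthwise = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)
separable = layers.Conv2D(64, kernel_size=1, padding="same")(depthwise)

# With DK = 3 and N = 64, the ratio of eq. (1) is 1/64 + 1/9, about 0.127,
# i.e. roughly the 1/8 reduction mentioned above.
```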
D. MobileNetV2

MobileNetV2 is also a lightweight convolutional neural network. On the one hand, MobileNetV2 introduces the concept of the inverted residual block on the basis of MobileNetV1. The traditional residual block workflow is: (1) use a 1×1 convolution kernel to reduce the dimensionality of the high-dimensional features; (2) apply a 3×3 convolution kernel to filter; (3) use a 1×1 convolution kernel to raise the dimensions again, with rectified linear units (ReLU) included in these convolution operations; (4) add the output features (the input of the next layer) in an element-wise manner [16]. In the workflow of the inverted residual block, by contrast, a 1×1 convolution kernel first increases the dimension of the low-dimensional features, a 3×3 depthwise convolution then filters them, and a final 1×1 convolution kernel reduces the dimensions again. On the other hand, MobileNetV2 uses a linear bottleneck instead of ReLU on this final projection to avoid damaging the features. The purpose of these improvements is to avoid information loss and increase the expressive ability of the model. Fig. 1 shows the structural changes from MobileNetV1 to MobileNetV2 [16].

Fig. 1 Basic structural changes from MobileNetV1 to MobileNetV2.
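As an illustration of the inverted residual structure just described, here is a minimal Keras sketch, reconstructed by us from [16] rather than taken from the paper; the expansion factor of 6 is the default reported in [16], and the channel sizes in the usage line are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, out_channels, expansion=6, stride=1):
    """Sketch of a MobileNetV2 inverted residual block (after [16])."""
    in_channels = x.shape[-1]
    # (1) 1x1 expansion to a higher dimension, with ReLU6.
    h = layers.Conv2D(expansion * in_channels, 1, use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # (2) 3x3 depthwise convolution filters the expanded features.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # (3) 1x1 linear bottleneck projection: no ReLU, to avoid damaging features.
    h = layers.Conv2D(out_channels, 1, use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # (4) element-wise residual connection when the shapes allow it.
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h

# Example usage with assumed shapes:
block_out = inverted_residual(tf.keras.Input(shape=(56, 56, 24)), out_channels=24)
```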
E. Transfer Learning

The research of transfer learning focuses on reusing the knowledge acquired in a source domain to solve new problems in a target domain with better solutions [17]. Transfer learning methods can be divided into sample transfer, feature transfer and parameter transfer [18]. When the data of the source domain and the target domain are very similar, sample migration fuses the source samples and target samples, then adjusts the source domain weights to get the target domain weights. Feature migration finds feature associations between the source and target domains by reconstructing features and minimizing the difference between them. Parameter migration shares the parameters between the source and target domains and automatically adjusts the weights to get the optimal results [19].

Transfer learning can alleviate the problem of insufficient data. Therefore, it has gradually become the preferred technology for artificial intelligence (AI) projects with insufficient data or computational power. As a branch of machine learning, transfer learning is increasingly integrated with neural networks. For image recognition, transfer learning is a method to solve the problems of scarce labeled sample data and the high cost of model training. A pre-trained model is a model that has been trained on large data sets. We find network layers whose feature vectors can be reused, and then transfer these network layers and their parameters to train networks on smaller data sets [20]. In this way, training costs are reduced and resource utilization is improved.

III. IMAGE CLASSIFICATION BASED ON TRANSFER LEARNING

A. Data Enhancement

In this paper, we would like to: (1) train the model from small-scale marine animal data sets; (2) perform real-time image classification by downloading the model to embedded devices. Because the data set is small in scale, we first enhance the data set. Data enhancement is a method to improve the overall performance of the trained network when the original image data set is insufficient. The main methods of data enhancement are rotation, flipping, translation, zooming, noise addition, etc. In this paper, we combine rotation, flipping and translation to expand the sample space.
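A minimal Keras sketch of this combination of rotation, translation and flipping follows; the concrete ranges and the directory layout are our assumptions, since the paper does not state them.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation, translation and flipping, as used to expand the sample space.
# The numeric ranges are assumptions for illustration only.
augmenter = ImageDataGenerator(
    rotation_range=30,       # random rotation
    width_shift_range=0.1,   # horizontal translation
    height_shift_range=0.1,  # vertical translation
    horizontal_flip=True,    # flipping
)

train_flow = augmenter.flow_from_directory(
    "marine_animals/train",  # hypothetical directory of the 7 categories
    target_size=(224, 224),
    batch_size=16,
    class_mode="categorical",
)
```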
B. Transfer Parameter

To solve the problem of insufficient data sets, transfer learning is introduced for the classification in this paper. Among convolutional neural networks, MobileNet is an efficient model for mobile and embedded devices. Therefore, we propose a transfer learning method based on the MobileNet model. To verify the effectiveness of the MobileNet models, the InceptionV3 model combined with transfer learning is also evaluated in the experiments for comparison.

Firstly, we enhance the data of the images collected by the underwater vehicle. Then, the parameter transfer learning method is applied to the small-scale marine animal data sets. We use the model parameters of InceptionV3, MobileNetV1 and MobileNetV2, which have been trained on large samples of source data, to train the network on the target data. To improve the training accuracy, we adopt a fine-tuning method in the parameter transfer: the parameters of the low-level convolution modules are kept, while the high-level convolution modules close to the classifier are set as trainable, including the matrix weights, bias terms and other regularization coefficients. The model can then be applied to the target data, and the optimum is obtained by adjusting the parameters within a small range.

Taking the MobileNetV1 model as an example, the specific transfer learning method is shown in Fig. 2. According to the data set categories, we replace the fully connected layer of the source model with a 7-class Softmax classifier. According to the structure of each network, the weights of the high-level convolution modules are set to be trainable for adaptive adjustment, and then the fully connected layer of the model is modified. Through experiments, it is found that the InceptionV3 model achieves the highest validation set accuracy when training from level 175, MobileNetV1 from level 122 and MobileNetV2 from level 130.

Fig. 2 Classification based on transfer learning.
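A minimal TensorFlow + Keras sketch of this parameter transfer scheme for MobileNetV2 follows (frozen below level 130, trainable above, and a new 7-class softmax head). It is our reconstruction of the described procedure under these assumptions, not code published with the paper.

```python
import tensorflow as tf

# Pre-trained MobileNetV2 body with ImageNet weights, classifier removed.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

# Keep the parameters of the low-level convolution modules;
# make the high-level modules (from level 130 on, as found experimentally)
# trainable for fine tuning.
for layer in base.layers[:130]:
    layer.trainable = False
for layer in base.layers[130:]:
    layer.trainable = True

# Replace the fully connected layer with a 7-class Softmax classifier.
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
outputs = tf.keras.layers.Dense(7, activation="softmax")(x)
model = tf.keras.Model(base.input, outputs)
```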

IV. EXPERIMENT

A. Data Set

Some of the data used in the experiment are obtained by underwater cameras photographing marine animals, and the others are collected from the Internet. The whole data set is divided into seven categories: fish, shrimp, scallop, crab, lobster, abalone and sea cucumber. Each category contains 1000 to 1400 images, totaling 8455 images, of which 80% form the training set and 20% the validation set. We enhanced the training set data: each original image is turned into three deformed images by the three processing methods of rotation, translation and flipping, so the training set is expanded to 27056 pictures. In addition, for the embedded devices, the testing data are seven types of images collected from the network; each type includes five sizes, each size has 10 pictures, for a total of 350. When selecting the training data, some images containing non-target samples are used to simulate random noise in order to improve the generalization ability of the model. Because the resolution of the images taken by the underwater camera differs from that of the images collected from the network, we normalize the original images and, according to the requirements of the different models, reshape them to 224 × 224 or 299 × 299.
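A sketch of this normalization step is shown below, assuming the standard Keras preprocessing function for each backbone; this function choice is ours and is not specified by the paper.

```python
import tensorflow as tf

def preprocess(image, model_name="mobilenet_v2"):
    """Resize and normalize one image for the chosen backbone (a sketch)."""
    if model_name == "inception_v3":
        image = tf.image.resize(image, (299, 299))  # InceptionV3 input size
        return tf.keras.applications.inception_v3.preprocess_input(image)
    image = tf.image.resize(image, (224, 224))      # MobileNetV1/V2 input size
    return tf.keras.applications.mobilenet_v2.preprocess_input(image)
```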
In this paper, the experimental environment is a computer running Ubuntu 16.04 with two GTX 1080Ti video cards, and the experiments are carried out in the TensorFlow + Keras framework. In addition, the embedded device is a Jetson TX2 produced by NVIDIA. The device has an NVIDIA Pascal GPU with 256 NVIDIA CUDA cores, a dual-core Denver 2 and a quad-core ARM Cortex-A57 processor. All these configurations are designed to better suit convolution operations and improve the speed of operations.

B. Experimental Process

Training Convolutional Neural Network

We selected batch_size = 16 and epoch = 200 to train the network models. After the training, the values of the accuracy and loss rate are recorded. The formulas for calculating the accuracy and loss rate are as follows:
$$acc = \frac{n}{N} \qquad (2)$$

where n represents the number of correctly classified images and N represents the total number of images. The loss is calculated by the following formulas, with $y_i$ being the predicted value and $\hat{y}_i$ being the original target value:

$$\mathrm{Softmax}(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}} \qquad (3)$$

$$\mathrm{Loss}(y, \hat{y}) = -\sum_i \hat{y}_i \times \log\big(\mathrm{Softmax}(y_i)\big) \qquad (4)$$
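Putting these pieces together, a minimal training step in the TensorFlow + Keras framework used here might look as follows; `model`, `train_flow` and `val_flow` refer to the earlier hypothetical sketches, and the optimizer choice is our assumption since the paper does not state it.

```python
# The softmax of eq. (3) is the model's final layer, and categorical
# cross-entropy implements the loss of eq. (4); the "accuracy" metric
# is the fraction n/N of eq. (2).
model.compile(
    optimizer="adam",  # optimizer not stated in the paper; an assumption
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# batch_size = 16 is set inside the data generators; epochs = 200 as in the text.
history = model.fit(
    train_flow,                # augmented training data (see the earlier sketch)
    validation_data=val_flow,  # hypothetical validation generator
    epochs=200,
)
```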

The accuracy and loss rate of the training set and the accuracy and loss rate of the validation set are listed in Table 1. The evolution of the accuracy rate and the loss rate of the validation set during training is shown in Fig. 3 and Fig. 4.

TABLE I. ACCURACY AND LOSS OF TRAINING SET AND VALIDATION SET FOR EACH MODEL

| Model       | Accuracy of training data | Loss of training data | Accuracy of validation data | Loss of validation data |
| InceptionV3 | 99.73% | 0.0072 | 91.13% | 0.372  |
| MobileNetV1 | 99.29% | 0.0199 | 89.92% | 0.5113 |
| MobileNetV2 | 99.58% | 0.0085 | 92.89% | 0.3165 |

Fig. 3 Classification accuracy of testing data.
Fig. 4 Classification loss of validation data.

From Fig. 3 and Fig. 4, it can be seen that the accuracy of the validation set increases with the number of iterations during network training, while the loss rate of the validation set gradually stabilizes in a certain interval as the iterations increase. According to Table 1, the InceptionV3 model has the best training set accuracy, but this does not mean that it obtains the best validation set accuracy. Under the same conditions, the MobileNetV2 model obtained the highest validation set accuracy: 92.89%.

C. Classification Time Test in the Embedded Device

The trained model is downloaded into the embedded device Jetson TX2 and the test data are classified. Four classification results are shown in Fig. 5, and the classification accuracy and time of each model are shown in Table 2.

Fig. 5 Some real-time classification results. On the top of each image, the species, accuracy and classification time are shown.

TABLE II. CLASSIFICATION ACCURACY AND TIME OF THE MARINE ANIMAL IMAGES BY EACH MODEL IN THE EMBEDDED SYSTEM

| Model       | Metric   | Abalone | Crab  | Fish  | Lobster | Scallop | Sea_cucumber | Shrimp | Average |
| InceptionV3 | Time (s) | 0.091   | 0.116 | 0.174 | 0.181   | 0.139   | 0.114        | 0.135  | 0.1361  |
| InceptionV3 | Accuracy | 96.8%   | 95.6% | 100%  | 98.2%   | 92.2%   | 92.6%        | 91.6%  | 93.6%   |
| MobileNetV1 | Time (s) | 0.046   | 0.055 | 0.077 | 0.046   | 0.032   | 0.089        | 0.051  | 0.0569  |
| MobileNetV1 | Accuracy | 96.2%   | 94.8% | 100%  | 93.6%   | 96.4%   | 89.6%        | 86.8%  | 92.6%   |
| MobileNetV2 | Time (s) | 0.052   | 0.052 | 0.052 | 0.073   | 0.053   | 0.075        | 0.045  | 0.0578  |
| MobileNetV2 | Accuracy | 96.6%   | 96.4% | 99.2% | 96.8%   | 100%    | 89.6%        | 92.4%  | 95.0%   |

From Table 2, the average classification times per image of the InceptionV3, MobileNetV1 and MobileNetV2 models are 0.1361 s, 0.0569 s and 0.0578 s respectively. The MobileNetV1 model has the best classification runtime, and the speed of MobileNetV2 is close to that of MobileNetV1.
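A minimal sketch of how such per-image classification times can be measured on the device is shown below; it is our illustration, and the model file name is a hypothetical placeholder.

```python
import time
import numpy as np
import tensorflow as tf

# Load the trained model downloaded to the Jetson TX2 (hypothetical path).
model = tf.keras.models.load_model("mobilenet_v2_marine.h5")

image = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in test image

model.predict(image)          # warm-up run, so one-time setup cost is excluded
start = time.time()
probs = model.predict(image)  # probabilities over the 7 species
print("class:", int(np.argmax(probs)),
      "time: %.4fs" % (time.time() - start))
```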
Detection,” IEEE Transactions on Intelligent Transportation Systems,
D. Discussion of the Experimental Results

In this experiment, three models are selected for transfer learning, and their parameters are retrained on the marine animal images so that the images can be classified on the embedded device. The experimental results show that the MobileNetV2 model trained by transfer learning has the best validation set accuracy, 92.89%. In terms of image classification speed, the MobileNetV1 model achieved the smallest average classification time, but MobileNetV2 is only about 0.001 seconds slower than MobileNetV1. Therefore, considering both the classification accuracy and the computing time, it can be concluded that the MobileNetV2 model plus transfer learning is a better choice for the real-time classification of marine animal images than the other two considered models. Furthermore, the size of the MobileNetV2 model is only about 40M, which makes it well suited for the embedded device.
arxiv.org/abs/1704.04861, 2017.
REFERENCES
[1] Yajuan Wei, "Study of Zooplankton Automatic Recognition Method for Dark Field Image," Ocean University of China, 2013.
[2] Xi Qiao, "Sea cucumber identification in real-time based on underwater machine vision technique," China Agricultural University, 2017.
[3] Peng Wan, Hailong Pan, Changjiang Long, et al., "Design of the on-line identification device of freshwater fish species based on machine vision technology," Food and Machinery, vol. 28, pp. 164-167, 2012.
[4] Yihao Hsiao, ChaurChin Chen, Sunin Lin and Fangpang Lin, "Real-world underwater fish recognition and identification, using sparse representation," Ecological Informatics, vol. 23, pp. 13-21, 2014.
[5] Hongwei Qin, Xiu Li, Jian Liang, Yigang Peng and Changshui Zhang, "DeepFish: Accurate Underwater Live Fish Recognition with a Deep Architecture," Neurocomputing, 2015.
[6] Christian Szegedy, Wei Liu, Yangqing Jia, et al., "Going deeper with convolutions," Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[7] A. I. Kukharenko and A. S. Konushin, "Simultaneous classification of several features of a person's appearance using a deep convolutional neural network," Pattern Recognition and Image Analysis, vol. 25, pp. 461-465, 2015.
[8] Siyue Xie and Haifeng Hu, "Facial expression recognition with FRR-CNN," Electronics Letters, vol. 53, pp. 235-237, 2017.
[9] Yichao Wu, Fei Yin and Chenglin Liu, "Improving handwritten Chinese text recognition using neural network language models and convolutional neural network shape models," Pattern Recognition, vol. 65, pp. 251-264, 2017.
[10] Lele Xie, Tasweer Ahmad, Lianwen Jin, Yuliang Liu and Sheng Zhang, "A New CNN-Based Method for Multi-Directional Car License Plate Detection," IEEE Transactions on Intelligent Transportation Systems, vol. 19, pp. 507-517, 2018.
[11] Nima Tajbakhsh, Jae Y. Shin, Suryakanth R. Gurudu, et al., "Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning?" IEEE Transactions on Medical Imaging, vol. 35, pp. 1299-1312, 2016.
[12] Keqing Zhu, Jie Tian and Haining Huang, "Underwater Object Images Classification Based on Convolutional Neural Network," 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP), pp. 301-305, 2018.
[13] Christian Szegedy, Wei Liu, Yangqing Jia, et al., "Going Deeper with Convolutions," Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, 2014.
[14] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, et al., "Rethinking the inception architecture for computer vision," Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.
[15] A. G. Howard, Menglong Zhu, Bo Chen, et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," http://arxiv.org/abs/1704.04861, 2017.
[16] M. Sandler, A. Howard, Menglong Zhu, et al., "MobileNetV2: Inverted Residuals and Linear Bottlenecks," http://arxiv.org/abs/1801.04381, 2018.
[17] Xin Sun, Junyu Shi, Lipeng Liu, et al., "Transferring deep knowledge for object recognition in low-quality underwater videos," Neurocomputing, vol. 275, pp. 897-908, 2017.
[18] Hoo-Chang Shin, Holger R. Roth, Mingchen Gao and Ronald M. Summers, "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning," IEEE Transactions on Medical Imaging, vol. 35, pp. 1285-1298, 2016.
[19] Ling Shao, Fan Zhu and Xuelong Li, "Transfer Learning for Visual Categorization: A Survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, pp. 1019-1034, 2015.
[20] Zhongling Huang, Zongxu Pan and Bin Lei, "Transfer learning with deep convolutional neural network for SAR target classification with limited labeled data," Remote Sensing, vol. 9, pp. 1-21, 2017.

