SPLN Proc 2311

Experimental Design of Bird Recognition Based on Mixed
Convolutional Neural Network
Feiyu Yao and Na Deng*
Hubei University of Technology, Wuhan, China

iamdengna@163.com
Abstract. Bird image recognition is a classic experiment in the fields of artificial
intelligence and machine vision. To better explore the recognition patterns and
performance differences of current mainstream convolutional structures, an experiment
based on a mixed convolutional neural network was designed. In the model design, the
hybrid convolutional neural network is built around dense connections and adopts
residual connections within them so that the network is easier to train. To extract
more feature information, an Inception structure based on depthwise separable
convolution is also introduced. An improved channel attention mechanism then learns
the weights between channels, and an adaptive convolution structure learns spatial
weights. In comparative experiments on a bird dataset with 525 classes, the mixed
convolutional neural network reached an accuracy of 94% with fewer parameters,
outperforming Inception V2 (86.8%), ResNet101 (91.8%), DenseNet264 (89%),
MobileNet V3-large (88.5%), and EfficientNet (92.3%). This indicates that it has
certain performance advantages in bird feature recognition tasks.
Keywords: Convolutional neural networks, Residual link, Depthwise separable
convolution, Channel attention mechanism, Inception.
1 Introduction
The automatic identification of birds provides important methodological support for
understanding and mastering the distribution, quantity, habitat, habits, migration
patterns, and other information of birds, and is of great significance for maintaining
biodiversity and protecting the ecological environment. At present, convolutional
neural networks are the mainstream method for automatic bird recognition [1-3].
Since LeNet [4] was proposed in 1998, convolutional neural networks have made
tremendous progress and evolution in the field of computer vision. With the rise of deep
learning and the improvement of computing resources, researchers are constantly intro-
ducing more complex and deeper network structures, further improving the perfor-
mance and application scope of models.
Afterwards, in 2012, AlexNet [5] used the ReLU activation function in the ImageNet
large-scale visual recognition competition and achieved a significant breakthrough,
verifying its effectiveness; GoogLeNet [6] proposed the Inception module, which
improves the computational efficiency and receptive field of the network through
multi-scale convolution kernels and pooling layers; Reference [7] proposed batch
normalization to reduce internal covariate shift between layers and accelerate network
convergence; ResNet [8] introduced a residual structure that allows very deep networks,
addressing the vanishing gradient problem and greatly improving the efficiency and
performance of training deep networks; DenseNet [9] adopted dense connections to
enhance feature reuse; MobileNet [10] proposed depthwise separable convolution, which
greatly reduces computational cost and suits lightweight devices and real-time
applications; and Squeeze-and-Excitation (SE) Attention [11], a channel attention
mechanism, adaptively adjusts the importance of input feature maps by learning
effective channel weights. These methods have led to the rapid development and
widespread application of deep learning in the field of vision.
Building on these classic convolutional neural networks, more complex improved
networks have been proposed and have achieved good results. Reference [12] introduced
residual connections into CNNs and standardized the residual blocks, so that the ResNet
structure could train deep networks and achieve highly competitive recognition
performance. Reference [13] used convolutional neural networks (CNNs) to locate salient
features in images; on the test dataset, the average sensitivity, specificity, and
accuracy were 93.79%, 96.11%, and 95.37%, respectively. However, a network structure
obtained with a given improvement method may extract features inadequately on a
different dataset, so the convolution size and structure of the network are
particularly important.
2 Design of bird experimental models
2.1 Improved channel attention mechanism
The current mainstream channel attention mechanism is usually based on the idea of
SE Attention. Channel attention mechanisms typically use global average pooling to
process input feature maps and extract features on channels. However, global average
pooling reduces the feature map to a single value and loses spatial information. In order
to extract channel features more effectively, this paper proposes an improved channel
attention mechanism.
[Figure 1: input C×W×H → Global avg pool → Conv3×3, s/2 → BN, ReLU → Conv3×3 → Sigmoid → output C×W×H]
Fig. 1. Schematic diagram of improved channel attention mechanism
As Figure 1 shows, the only change to the network structure is that the original fully
connected layers are replaced with convolutional layers. This change increases the
number of parameters from the original $2 \times C \times \frac{C}{k}$ to
$2 \times C \times \frac{C}{k} \times 3 \times 3$, which allows feature information to
be extracted more effectively. Meanwhile, because global average pooling has a constant
computational cost and reduces the size of the feature maps, the subsequent
convolutional layers do not add significant computational cost.
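The two parameter counts above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function names are hypothetical, k is treated as an integer channel reduction ratio, bias terms are ignored, and the values C = 64 and k = 16 are illustrative only.

```python
def se_fc_params(channels: int, reduction: int) -> int:
    """Weights of two fully connected layers: 2 * C * (C / k)."""
    hidden = channels // reduction  # width of the reduced layer, C / k
    return 2 * channels * hidden


def se_conv_params(channels: int, reduction: int, kernel: int = 3) -> int:
    """Weights when both layers become kernel x kernel convolutions."""
    return se_fc_params(channels, reduction) * kernel * kernel


if __name__ == "__main__":
    c, k = 64, 16  # hypothetical values, not taken from the paper
    print("fully connected:", se_fc_params(c, k))
    print("3x3 convolution:", se_conv_params(c, k))
```

Under these assumptions the convolutional variant always has nine times as many weights as the fully connected one, independent of C and k.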
In the case of dense connections in the network, this increased amount of parameters
and computation is acceptable. In addition, the improved channel attention mechanism
can be more effective in extracting channel features, thereby enhancing the perfor-
mance and representation ability of the model.
2.2 Adaptive Convolutional Structure
In addition to learning attention weights in the channel dimension, this article also
introduces an adaptive attention mechanism in the spatial dimension. The spatial
features of the input feature map are processed through convolution operations to learn
a set of adaptive attention weights that serve as a convolution kernel. Unlike global
attention, this mechanism applies the attention weights to local spatial features to
enhance the model's perceptual ability. This adaptive convolution makes the model more
flexible and expressive in perceiving and understanding input data.
[Figure 2: input (H × W) → Global Avg Pool (C×7×7) → DW, S=1 → BN, ReLU → DW, S=1 (C×3×3) → Mean (batch) → Reshape (C×1×9) → Softmax → Reshape (C×1×3×3) → kernel applied to the input → Output]
Fig. 2. Adaptive Convolutional Structure Diagram
As shown in Figure 2, the input is first reduced to C × 7 × 7 by global average pooling,
and depthwise separable convolution (DW) layers then output feature maps of size
C × 3 × 3. Averaging over the batch dimension yields attention weights of size
C × 1 × 3 × 3, which serve as the generated convolution kernel; this kernel is finally
convolved with the input feature map to produce the output feature map. The kernel is
activated by the Softmax function, applied over its last two dimensions, namely 3 × 3.
The formula is as follows:
$$ w'_{ij} = \frac{e^{w_{ij}}}{\sum_{i} \sum_{j} e^{w_{ij}}} \tag{1} $$
Because the previous convolution operation ends with the activation ReLU(x) =
max(0, x), the input feature maps are non-negative. In this case, using Softmax
weights for a weighted summation is a natural choice. By performing local weighted
averaging on the input feature map, the network's representation ability can be further
improved.
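To make the Softmax normalization and the local weighted averaging concrete, here is a minimal sketch for a single channel. It is an illustration only: the function names are hypothetical, and the use of stride 1 with no padding is an assumption, since the text does not state how borders are handled.

```python
import math


def softmax_kernel(w):
    """Softmax over all entries of one 3x3 kernel, as in Eq. (1)."""
    peak = max(max(row) for row in w)  # subtract the maximum for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in w]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def weighted_average(feature, kernel):
    """Slide the kernel over one channel (assumed: stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(feature) - kh + 1):
        row = []
        for j in range(len(feature[0]) - kw + 1):
            row.append(sum(kernel[a][b] * feature[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


if __name__ == "__main__":
    kernel = softmax_kernel([[0.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 0.0]])
    feature = [[float(i + j) for j in range(5)] for i in range(5)]
    print(weighted_average(feature, kernel))
```

Because the normalized weights are positive and sum to 1, each output value is a convex combination of the 3 × 3 neighbourhood it covers, which is why the operation behaves as a local weighted average.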
From the perspective of parameters, this adaptive convolution structure mainly con-
sists of two layers of depthwise separable convolutions, with a size of 2 × C × 3 × 3.
This means that the structure has relatively fewer parameters and is suitable for appli-
cation in resource constrained environments. From the perspective of computational
complexity, due to the use of global average pooling to compress feature maps and the
use of depthwise separable convolutions, this adaptive convolution structure does not
have high computational complexity. Therefore, without sacrificing perceptual ability,
this adaptive convolutional structure can locally perceive input features in situations
where computing resources are limited.
2.3 Inception
The core idea of the Inception structure is to use multiple convolution kernels of differ-
ent sizes in the same layer, and then concatenate their outputs. By using multiple con-
volution kernels of different sizes, Inception structures can capture image details and
contextual information in different receptive fields.
Figure 3 shows the structure diagram after introducing depthwise separable convo-
lution into the Inception structure. The advantage of using depthwise separable convo-
lution in Inception structure is that it can more effectively capture the spatial infor-
mation of input feature maps and reduce resource consumption while maintaining net-
work performance.
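The reduction in resource consumption can be illustrated by counting weights. The sketch below assumes the usual formulation of depthwise separable convolution (a per-channel k × k depthwise step followed by a 1 × 1 pointwise step) and ignores bias terms; the DW layers in this paper may be configured differently, and the function names are hypothetical.

```python
def standard_conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights of an ordinary convolution: k * k * C_in * C_out."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Depthwise k x k weights plus pointwise 1 x 1 weights."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    c = 64  # illustrative channel count
    print(standard_conv_params(c, c, 3), separable_conv_params(c, c, 3))
```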
[Figure 3: Input (C×W×H) → Conv1×1 → parallel branches built from DW1×1, DW3×3 (three instances), and MaxPool → Concat/BN/ReLU → Channel Attention → Adaptive Conv → Output (C×W×H)]
Fig. 3. Transition structure diagram
Unlike the original Inception structure, this design does not use separate 1 × 1
convolution kernels to reduce the channel dimension of the convolution outputs at each
scale. Instead, this article applies the same 1 × 1 convolution to all of them. This
design can learn features from different receptive fields while keeping the number of
parameters and calculations small. In addition, because depthwise separable convolution
does not learn the relationships between channels, this article introduces the improved
channel attention mechanism and the adaptive convolution structure. The channel
attention mechanism learns the weight relationship between channels, further improving
the representation ability of the features; the adaptive convolutional structure
adjusts itself according to the local information of the input feature map, further
enhancing perceptual ability.
2.4 Hybrid Convolutional Neural Network Architecture
This network (MixConvNet) is a convolutional neural network that combines the char-
acteristics of multiple classical networks and makes appropriate improvements. By in-
tegrating the advantages and features of different networks, this network has achieved
a more powerful and efficient architecture. MixConvNet mainly consists of two parts:
Block and Transition. The stacking of multiple blocks is used for feature extraction,
while Transition is used to reduce the number of channels and scale size of the feature
map.
1) Block
The block structure is shown in Figure 4. The main components of this structure
include a separable convolutional layer, which converts the number of channels in the
input feature map into c. Subsequently, the feature map processed by a two-layer In-
ception structure is concatenated with the input feature map to achieve a dense connec-
tion effect. In the process of dense connections, multi-level residual connections are
used to solve the problem of gradient vanishing and promote information flow, making
the network easier to learn.
[Figure 4: left panel, Block, mapping an input of C×W×H to an output of (C+c)×W×H; right panel, Transition; components shown include Conv1×1, DW1×1, DW3×3 (s=2 in the Transition), BN, ReLU, Inception, Channel Attention, AdaptiveConv, and Concat, grouped into stages T1, T2, and T3]
Fig. 4. Hybrid neural network structure diagram (Left: Block, Right: Transition, T1: Raise the
dimension to 64, T2: Feature extraction, T3: Reduce the dimension to 64)
During the unidirectional flow of feature map information (indicated by black arrows),
the network introduces multi-level residual connections. For example, after the input
feature map passes through the depthwise separable convolution, residual connections
are made to the subsequent three blocks (indicated in light blue). To avoid excessive
gradients during backpropagation, we set the parameters α = 1/3 and β = 1/2, which
limits the coefficient of the residual term to 1 and controls the gradient size. If the
input of this Block is $x_0$ and the results of successive feature extraction are
$x_1$, $x_2$, and $x_3$, the Block can be represented as:
$$ \mathrm{Input} = x_0 \tag{2} $$
$$ \begin{cases} x_1 = f_1(x_0) + \alpha x_0 \\ x_2 = f_1(x_1) + \beta x_1 + \alpha x_0 \\ x_3 = f_1(x_2) + x_2 + \beta x_1 + \alpha x_0 \end{cases} \tag{3} $$
$$ \mathrm{Output} = \mathrm{Concat}(x_0, x_3) \tag{4} $$
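The recurrences in Eq. (2) to Eq. (4) can be traced with scalars standing in for feature maps. This is a minimal sketch under stated assumptions: `block_forward` is a hypothetical name, `f` is an arbitrary stand-in for the feature extraction step written as f1 in Eq. (3), and concatenation is modelled as a Python list.

```python
def block_forward(x0, f, alpha=1.0 / 3.0, beta=1.0 / 2.0):
    """Apply Eq. (3) and return the concatenation of Eq. (4) as a list."""
    x1 = f(x0) + alpha * x0
    x2 = f(x1) + beta * x1 + alpha * x0
    x3 = f(x2) + x2 + beta * x1 + alpha * x0
    return [x0, x3]


if __name__ == "__main__":
    # With a zero feature extractor only the residual paths remain.
    print(block_forward(3.0, lambda x: 0.0))
    # A simple nonzero stand-in for the learned transformation.
    print(block_forward(3.0, lambda x: 0.1 * x))
```

When `f` returns zero, the weighted skip paths alone reproduce the input: x1 = x0/3, x2 = x0/2, and x3 = x0/2 + x0/6 + x0/3 = x0. With α = 1/3 and β = 1/2, the pure residual route therefore carries the input with an overall coefficient of 1.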
By stacking multiple blocks, dense connections are achieved, where the current input
is associated with all previous outputs. Assuming the size of the input feature map is
C × W × H, after n blocks of processing, the size of the output feature map will become
(𝐶 + 𝑛 × 𝑐) × 𝑊 × 𝐻.
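The channel growth from stacking blocks is simple arithmetic, sketched below. The function name and the starting width of 64 channels are illustrative assumptions; only the formula (C + n × c) comes from the text.

```python
def channels_after_blocks(c_in: int, growth: int, n_blocks: int) -> int:
    """Each block concatenates `growth` new channels onto its input."""
    return c_in + n_blocks * growth


if __name__ == "__main__":
    for n in range(4):
        print(n, channels_after_blocks(64, 64, n))
```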
2) Transition
Compared to the Block structure, the Transition structure only includes the previous
feature extraction part and associates the input and output through a layer of residual
connections. In addition, a step size of 2 is set in depthwise separable convolution to
achieve halving of feature map size. The schematic diagram of the Transition structure
is shown in Figure 4.
The function of the Transition structure is to adjust the channels and scale of the
feature map: by reducing its size and number of channels, it lets subsequent network
layers process feature information more effectively. By introducing residual
connections, the Transition structure preserves the information of the input features
and supports information flow and feature transmission throughout the network. In
MixConvNet, the Transition structure balances and controls features between blocks,
helping the network adapt to feature changes at different levels and further improving
its performance and generalization ability.
3 Experimental process
3.1 Experimental data and Preprocessing
This experiment uses the bird dataset on Kaggle. The dataset contains 525 bird species,
with a total of 84635 training images, 2625 validation images (5 images per species),
and 2625 test images (5 images per species). All images are JPG color images with 3
color channels and a height and width of 224 pixels each. Sample images from the
dataset are shown in Figure 5.
Fig. 5. Partial bird experimental data
During the experiment, the bird images fed into the mixed convolutional neural net-
work model will undergo the same preprocessing process, including data augmentation
techniques such as random cropping, random horizontal flipping, and random erasure.
The random probability is set to 0.5, and the random cropping size is [156, 156]. The
flowchart of bird recognition algorithm based on hybrid convolutional neural network
is shown in Figure 6.
[Figure 6: bird image input; training branch: random cropping (156×156), resize (224×224), random horizontal flip (224×224), random erase (224×224); test branch: resize (224×224); both branches feed the hybrid convolutional neural network and classifier, which output the classification result]
Fig. 6. Flowchart of Bird Recognition Algorithm
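The training-time preprocessing can be sketched on a single-channel image stored as a nested list. Several details here are assumptions rather than facts from the paper: the order of the steps, applying the 0.5 probability to the flip and the erase but not to the crop, nearest-neighbour resizing, the maximum erased fraction `max_frac`, and filling the erased region with zeros. All function names are hypothetical.

```python
import random


def random_crop(img, size, rng):
    """Cut a size x size window at a random position."""
    top = rng.randint(0, len(img) - size)
    left = rng.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]


def resize_nearest(img, size):
    """Nearest-neighbour resize to size x size (assumed interpolation)."""
    h, w = len(img), len(img[0])
    return [[img[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]


def random_hflip(img, rng, p=0.5):
    """Mirror the image left to right with probability p."""
    return [row[::-1] for row in img] if rng.random() < p else img


def random_erase(img, rng, p=0.5, max_frac=0.3):
    """With probability p, zero out one random rectangle."""
    if rng.random() >= p:
        return img
    h, w = len(img), len(img[0])
    eh = rng.randint(1, max(1, int(h * max_frac)))
    ew = rng.randint(1, max(1, int(w * max_frac)))
    top, left = rng.randint(0, h - eh), rng.randint(0, w - ew)
    out = [row[:] for row in img]  # copy so the input is not modified
    for i in range(top, top + eh):
        for j in range(left, left + ew):
            out[i][j] = 0
    return out


def augment(img, rng):
    """Crop to 156x156, resize back to 224x224, then flip and erase."""
    img = random_crop(img, 156, rng)
    img = resize_nearest(img, 224)
    img = random_hflip(img, rng)
    return random_erase(img, rng)


if __name__ == "__main__":
    image = [[(i + j) % 256 for j in range(224)] for i in range(224)]
    result = augment(image, random.Random(0))
    print(len(result), len(result[0]))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; test images would only go through the resize step.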
3.2 Experimental model
By combining Block and Transition, we can construct MixConvNet networks of different
sizes. The input of the network is an image of size 3 × 224 × 224. It first passes
through a convolutional layer with a 7 × 7 kernel and a stride of 2, followed by a max
pooling layer with a 3 × 3 window and a stride of 2, for feature extraction. Next, we
set the number of channels added by each block to c = 64 and stack multiple blocks
together through dense connections. Finally, the network generates k classification
outputs through a fully connected layer. Table 1 shows MixConvNet network structures of
three different sizes, including the network layers, output size, parameter count, and
computational complexity.
Table 1. MixConvNet structures with different network sizes
Network layer        Output size   MixConvNet84      MixConvNet224     MixConvNet444
conv_1               112×112       7×7, 64, stride 2
                     56×56         3×3 max pool, stride 2
conv_2               28×28         Block×3           Block×3           Block×7
                                   Transition(128)   Transition(128)   Transition(256)
conv_3               14×14
conv_4               7×7
                     1×1           average pool, 525-d fc, softmax
Parameter quantity                 0.765M            2.578M            6.466M
FLOPs                              0.416G            0.658G            1.826G
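The output sizes listed in Table 1 follow from the standard convolution size formula. The sketch below assumes a padding of 3 for the 7 × 7 convolution and 1 for the 3 × 3 max pooling, neither of which is stated in the paper, and models each later stage as a plain halving, as a stride-2 Transition would produce. The function names are hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """floor((size + 2 * padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_sizes(input_size: int = 224):
    """Spatial size after the stem layers and three further halvings."""
    sizes = []
    size = conv_output_size(input_size, kernel=7, stride=2, padding=3)  # 7x7 conv
    sizes.append(size)
    size = conv_output_size(size, kernel=3, stride=2, padding=1)  # 3x3 max pool
    sizes.append(size)
    for _ in range(3):  # assumed: each later stage halves the size
        size //= 2
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(stage_sizes())
```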
For a fair comparison, all models were trained with the Adam optimization algorithm
and a cross-entropy loss function with a label smoothing parameter of 0.1. The batch
size was set to 32, and the initial learning rate was 0.001. After each training round,
the learning rate was multiplied by a decay rate of 0.95, for a total of 50 training
rounds.
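The learning rate schedule and the smoothed loss can be written out directly. This is a sketch under assumptions: the decay is taken to be multiplicative per training round, and the smoothing mass is spread uniformly over all classes including the true one, which is one common convention and may differ from the authors' implementation. Function names are hypothetical.

```python
import math


def learning_rate(epoch: int, base_lr: float = 0.001, decay: float = 0.95) -> float:
    """Learning rate after `epoch` decay steps."""
    return base_lr * decay ** epoch


def smoothed_cross_entropy(logits, target: int, smoothing: float = 0.1) -> float:
    """Cross entropy against a label-smoothed target distribution."""
    peak = max(logits)  # log-sum-exp with the maximum subtracted for stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    log_probs = [v - log_sum for v in logits]
    k = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Uniform share for every class, plus the remaining mass on the target.
        q = smoothing / k + (1.0 - smoothing if i == target else 0.0)
        loss -= q * lp
    return loss


if __name__ == "__main__":
    for epoch in (0, 10, 49):
        print(epoch, learning_rate(epoch))
    print(smoothed_cross_entropy([2.0, 0.5, -1.0], target=0))
```

With smoothing set to 0 the function reduces to ordinary cross entropy, which gives a quick sanity check.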
4 Analysis and discussion of experimental results
4.1 Performance analysis of attention mechanism
As shown in Figure 7, MixConvNet84 was trained on the 525-class bird dataset with
different attention mechanisms: the channel attention mechanism (SE), the improved
channel attention mechanism (SE/Conv), and the adaptive convolutional structure
combined with the improved mechanism (AdaptiveConv+SE/Conv). The figure compares the
accuracy and loss values during these training processes.
[Figure 7: accuracy and loss versus training step (0 to 50) for MixConvNet84; curves: train-SE, val-SE, train-SE/Conv, val-SE/Conv, train-AdaptiveConv+SE/Conv, val-AdaptiveConv+SE/Conv]
Fig. 7. Comparison chart of accuracy and loss values for MixConvNet84
The results in Figure 7 show that the SE/Conv structure accelerates the convergence of
the network. Combining the AdaptiveConv structure with SE/Conv improves both loss and
accuracy compared with SE/Conv alone: the loss value decreased from 1.90 to 1.83, and
the accuracy increased from 82% to 84.5%. This indicates that the AdaptiveConv
structure contributes substantially to the performance of the model, both accelerating
convergence and improving accuracy.
4.2 Comparative Analysis of Performance of Hybrid Convolutional
Neural Networks
Comparing the training results of the 8 networks over the first 4 iterations reveals
differences in their convergence speeds, as shown in Figure 8.
[Figure 8: training (train) and validation (valid) loss for each of the first four iterations; networks on the horizontal axis: Inception V2, ResNet101, DenseNet264, MobileNet V3-large, EfficientNet b7, MixConvNet84, MixConvNet224, MixConvNet444; vertical axis: Loss (2.0 to 7.0)]
Fig. 8. Training results of the first four iterations of different networks
1) With data augmentation applied to the training set, the training loss of each model
was initially generally higher than its validation loss.
2) EfficientNet b7 and Inception V2 converge relatively slowly. EfficientNet b7
requires more iterations to reach lower loss values because of its large model size
and large number of parameters. Inception V2 lacks skip connections, which makes
gradient propagation and network optimization more difficult.
3) The MixConvNet networks converge fastest, and among the three, MixConvNet444 shows
the clearest advantage on the validation set.
In Table 2, we compared the accuracy of eight different convolutional neural net-
works on the bird dataset. The results showed that MixConvNet444 achieved the best
performance on the test set, with an accuracy rate of 94.04%. At the same time, Res-
Net101 and EfficientNet b7 have a higher number of parameters and FLOPs, and have
also achieved high accuracy.
According to the results in Table 2, the complexity of a model can be measured by its
number of parameters and FLOPs. ResNet101 and EfficientNet b7 have the highest numbers
of parameters, 43.576M and 65.131M respectively, and also the highest FLOPs (7.865G and
5.346G); MixConvNet84 has the fewest parameters (0.765M) and MobileNet V3-large the
fewest FLOPs (0.234G). Furthermore, although MixConvNet84 has only about 1/6 of the
parameters of MobileNet V3-large, its accuracy is almost equivalent, indicating that
MixConvNet84 can learn more feature information with fewer parameters.
However, from the perspective of computational speed, MobileNet V3-large still has
significant advantages. Although the FLOPs of the two models are not significantly
different, due to dense connections, the actual training time difference is greater than
what FLOPs describe. Therefore, MobileNet V3-large is still a faster choice in terms of
speed.
Table 2. Comparison of different network prediction results
Model                Validation set   Test set   Parameter quantity   FLOPs
Inception V2         0.8278           0.8686     12.518M              1.511G
ResNet101            0.8899           0.9188     43.576M              7.865G
DenseNet264          0.8648           0.8903     19.101M              4.389G
MobileNet V3-large   0.8476           0.8853     4.875M               0.234G
EfficientNet b7      0.8857           0.9238     65.131M              5.346G
MixConvNet84         0.8423           0.8873     0.765M               0.416G
MixConvNet224        0.8899           0.9240     2.578M               0.658G
MixConvNet444        0.9074           0.9404     6.466M               1.826G
MixConvNet is more efficient in computation and memory consumption: it achieves
excellent performance while keeping the number of parameters and the computational
complexity relatively small. These results support its effectiveness in visual tasks.
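The comparisons drawn from Table 2 can be reproduced with a short script. The values below are copied from Table 2; the dictionary layout and the helper names `extreme` and `param_ratio` are hypothetical conveniences for this sketch.

```python
# Columns: validation accuracy, test accuracy, parameters (M), FLOPs (G).
TABLE_2 = {
    "Inception V2": (0.8278, 0.8686, 12.518, 1.511),
    "ResNet101": (0.8899, 0.9188, 43.576, 7.865),
    "DenseNet264": (0.8648, 0.8903, 19.101, 4.389),
    "MobileNet V3-large": (0.8476, 0.8853, 4.875, 0.234),
    "EfficientNet b7": (0.8857, 0.9238, 65.131, 5.346),
    "MixConvNet84": (0.8423, 0.8873, 0.765, 0.416),
    "MixConvNet224": (0.8899, 0.9240, 2.578, 0.658),
    "MixConvNet444": (0.9074, 0.9404, 6.466, 1.826),
}


def extreme(column: int, pick=max) -> str:
    """Name of the model with the largest (or smallest) value in a column."""
    return pick(TABLE_2, key=lambda name: TABLE_2[name][column])


def param_ratio(model_a: str, model_b: str) -> float:
    """Parameter count of model_a divided by that of model_b."""
    return TABLE_2[model_a][2] / TABLE_2[model_b][2]


if __name__ == "__main__":
    print("best test accuracy:", extreme(1))
    print("most parameters:", extreme(2), "| fewest:", extreme(2, min))
    print("most FLOPs:", extreme(3), "| fewest:", extreme(3, min))
    print("MixConvNet84 / MobileNet V3-large parameters:",
          round(param_ratio("MixConvNet84", "MobileNet V3-large"), 3))
```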
5 Conclusion
This article proposes an improved hybrid convolutional neural network model that
combines dense connections, residual connections, improved channel attention mecha-
nisms, and adaptive convolutional structures. Through comparative experiments on
bird datasets, the hybrid convolutional neural network demonstrated excellent perfor-
mance with a low number of parameters, achieving an accuracy of 94.04%.
In addition, the experiments in this article were conducted on specific bird datasets,
so the conclusions and performance advantages obtained may differ on other datasets.
Further research and experiments need to be conducted on a wider and more diverse
dataset to verify the generalization ability and robustness of the hybrid convolutional
network. Although the performance of this hybrid convolutional network can be eval-
uated through these comparisons, future research can consider expanding the scope of
comparative experiments to include more network architectures and models.
Overall, the improved convolutional neural network model proposed in this article
demonstrates significant performance advantages on bird datasets. These results indi-
cate that by combining different convolutional structures and attention mechanisms, as
well as appropriate network design and optimization, excellent performance can be
achieved with fewer parameters. This is of great significance for further promoting the
development of convolutional neural networks, and provides valuable reference and
inspiration for solving feature recognition problems in practical applications.
References
1. Xie J, Zhong Y, Zhang J, et al. A review of automatic recognition technology for bird
vocalizations in the deep learning era[J]. Ecological Informatics, 2023, 73: 101927.
2. Huang Y P, Basanta H. Bird image retrieval and recognition using a deep learning plat-
form[J]. IEEE access, 2019, 7: 66980-66989.
3. Ragib K M, Shithi R T, Haq S A, et al. Pakhichini: Automatic bird species identification
using deep learning[C]//2020 Fourth world conference on smart trends in systems, security
and sustainability (WorldS4). IEEE, 2020: 1-6.
4. LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recogni-
tion[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
5. Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional
neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
6. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//Proceedings of 2015
IEEE conference on computer vision and pattern recognition. Piscataway, NJ: IEEE. 2015:
1-9.
7. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing
internal covariate shift[C]//Proceedings of the 32nd International conference on machine
learning. New York: International Machine Learning Society. 2015: 448-456.
8. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings
of 2016 IEEE conference on computer vision and pattern recognition. Piscataway, NJ: IEEE.
2016: 770-778.
9. Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional net-
works[C]//Proceedings of 2017 IEEE conference on computer vision and pattern recogni-
tion. Piscataway, NJ: IEEE. 2017: 4700-4708.
10. Howard A G, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks for
mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
11. Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of 2018 IEEE con-
ference on computer vision and pattern recognition. Piscataway, NJ: IEEE. 2018: 7132-
7141.
12. Zhou T, Zhao Y, Wu J. Resnext and res2net structures for speaker verification[C]//2021
IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021: 301-307.
13. Li W, Qi F, Tang M, et al. Bidirectional LSTM with self-attention mechanism and multi-
channel features for sentiment classification[J]. Neurocomputing, 2020, 387: 63-77.
