SPLN Proc 2311
1 Introduction
verifying its effectiveness; GoogLeNet [6] proposed the Inception module, which improves
the computational efficiency and receptive field of the network through multi-scale
convolution kernels and pooling layers; Reference [7] proposed batch normalization to
reduce internal covariate shift between layers and accelerate network convergence;
ResNet [8] introduced a residual structure that allows very deep networks to be trained,
mitigating the vanishing-gradient problem and greatly improving the efficiency and
performance of training deep networks; DenseNet [9] adopted dense connections to enhance
feature reuse; MobileNet [10] proposed depthwise separable convolution, which greatly
reduces computational cost and suits lightweight devices and real-time applications; and
Squeeze-and-Excitation (SE) attention [11], a channel attention mechanism, adaptively
adjusts the importance of input feature maps by learning effective channel weights.
These methods have led to the rapid development and widespread application of deep
learning in the field of vision.
Building on these classic convolutional neural networks, more complex improved networks
have since been proposed and have achieved good results. Reference [12] introduced
residual connections into CNNs and standardized the residual blocks, and the resulting
ResNet structure could train deep networks to highly competitive recognition
performance. Reference [13] uses Convolutional Neural Networks (CNN) to locate salient
features in images and reports average sensitivity, specificity, and accuracy of 93.79%,
96.11%, and 95.37%, respectively, on the test dataset. However, the network structures
of different improvement methods may extract features inadequately on different
datasets, so the convolution size and structure of the network are particularly
important.
The current mainstream channel attention mechanism is usually based on the idea of
SE Attention. Channel attention mechanisms typically use global average pooling to
process input feature maps and extract features on channels. However, global average
pooling reduces the feature map to a single value and loses spatial information. In order
to extract channel features more effectively, this paper proposes an improved channel
attention mechanism.
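The loss of spatial information under global average pooling can be illustrated with a minimal sketch. The function name `global_average_pool` and the two toy feature maps are illustrative assumptions, not part of the proposed network; the sketch only shows that two maps with different spatial layouts can yield the same channel descriptor.

```python
def global_average_pool(feature_map):
    """Reduce a C x H x W feature map (nested lists) to C channel means."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two single-channel 2x2 maps holding the same values at different positions
# (toy data, chosen only for illustration).
top_left = [[[4.0, 0.0], [0.0, 0.0]]]
bottom_right = [[[0.0, 0.0], [0.0, 4.0]]]

desc_a = global_average_pool(top_left)
desc_b = global_average_pool(bottom_right)

# Both descriptors are identical, so the position of the activation is lost.
print(desc_a, desc_b)
```

Because both maps collapse to the same single value per channel, any later layer that sees only the pooled descriptor cannot tell them apart.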
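[Figure: structure of the improved channel attention mechanism; the input and output
feature maps both have size C×W×H.]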
As the figure above shows, the only change to the network structure is that the
original fully connected layer is replaced with a convolutional layer. This
Contribution Title (shortened if too long) 3
change increases the number of parameters from the original $2 \times C \times \frac{C}{k}$
to $2 \times C \times \frac{C}{k} \times 3 \times 3$, which can extract feature
information more effectively. Meanwhile, the computational cost of the pooling step is
constant, and because pooling reduces the size of the feature map, the subsequent
convolutional layers do not significantly increase the computational cost.
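The two parameter counts above can be checked with a short calculation. This is a sketch under assumptions: the channel count C = 64 and reduction ratio k = 16 are illustrative values that the text does not specify, biases are ignored, and the function names are hypothetical.

```python
def se_fc_params(channels, reduction):
    """Weights in the two fully connected layers: 2 * C * (C / k)."""
    hidden = channels // reduction
    return 2 * channels * hidden


def se_conv_params(channels, reduction, kernel_size=3):
    """Weights when both layers become kernel_size x kernel_size convolutions."""
    hidden = channels // reduction
    return 2 * channels * hidden * kernel_size * kernel_size


# Illustrative values only; C and k are assumptions, not taken from the paper.
fc_count = se_fc_params(64, 16)
conv_count = se_conv_params(64, 16)
print(fc_count, conv_count, conv_count // fc_count)
```

For any C and k the ratio is exactly 3 × 3 = 9, which is the extra factor in the second expression.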
In the case of dense connections in the network, this increased amount of parameters
and computation is acceptable. In addition, the improved channel attention mechanism
can be more effective in extracting channel features, thereby enhancing the perfor-
mance and representation ability of the model.
2.2 Adaptive Convolutional Structure
In addition to learning attention weights in the channel dimension, this article also
introduces an adaptive attention mechanism in the spatial dimension. The spatial
features of the input feature map are processed through convolution operations to learn
a set of adaptive attention weights for convolution. Unlike global attention, this
mechanism applies attention weights to local spatial features to enhance the model's
perceptual ability. This adaptive convolution makes the model more flexible and
expressive in perceiving and understanding input data.
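One possible reading of this mechanism, suggested by the Softmax and C×1×3×3 reshape labels in the structure diagram, is that nine learned scores per channel are normalized into a 3×3 kernel and applied to local neighbourhoods. The sketch below illustrates only that normalization and application step for a single channel. The function names, the uniform scores, and the toy patch are assumptions, and this is not claimed to be the authors' exact implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of scores."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_kernel(scores):
    """Turn nine raw scores for one channel into a 3x3 kernel that sums to 1."""
    weights = softmax(scores)
    return [weights[0:3], weights[3:6], weights[6:9]]


def apply_at(channel, kernel, row, col):
    """Weighted sum of the 3x3 neighbourhood centred on (row, col)."""
    return sum(
        kernel[i][j] * channel[row - 1 + i][col - 1 + j]
        for i in range(3)
        for j in range(3)
    )


# Equal scores give a uniform kernel, i.e. a plain 3x3 average (toy example).
uniform = adaptive_kernel([0.0] * 9)
patch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
center = apply_at(patch, uniform, 1, 1)
kernel_sum = sum(sum(row) for row in uniform)
print(center, kernel_sum)
```

With unequal scores the kernel puts more weight on some neighbours than others, which is how the attention weights act on local rather than global features.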
[Figure: adaptive convolution structure. Labels recovered from the diagram: DW C×7×7
(S=1), BN, ReLU, DW C×3×3 (S=1), Mean (batch), Global Avg Pool, Reshape (C×1×9),
Softmax, Reshape (C×1×3×3), Output.]
The core idea of the Inception structure is to use multiple convolution kernels of differ-
ent sizes in the same layer, and then concatenate their outputs. By using multiple con-
volution kernels of different sizes, Inception structures can capture image details and
contextual information in different receptive fields.
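The link between kernel size and receptive field can be made concrete with the standard receptive-field recurrence. The branch configurations below (a single 1×1, a single 3×3, and two stacked 3×3 convolutions) are illustrative assumptions, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel, stride) pairs."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # stride scales the step of later layers
    return field


# Illustrative branches of a multi-kernel module (assumed, all with stride 1).
branches = {
    "1x1": [(1, 1)],
    "3x3": [(3, 1)],
    "3x3 + 3x3": [(3, 1), (3, 1)],
}
fields = {name: receptive_field(layers) for name, layers in branches.items()}
print(fields)
```

Concatenating the outputs of such branches gives each output position access to several neighbourhood sizes at once.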
Figure 3 shows the structure after introducing depthwise separable convolution into the
Inception structure. The advantage of using depthwise separable convolution in the
Inception structure is that it captures the spatial information of input feature maps
effectively while reducing resource consumption and maintaining network performance.
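The reduction in resource consumption can be seen from the weight counts of a standard convolution and a depthwise separable convolution. The sketch below uses the usual formulas with biases ignored; the channel counts and kernel size are illustrative assumptions.

```python
def standard_conv_params(c_in, c_out, kernel):
    """Weights of a standard convolution: k * k * C_in * C_out."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(c_in, c_out, kernel):
    """Depthwise k x k per input channel, then a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative sizes only (64 input and output channels, 3x3 kernel).
standard = standard_conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
print(standard, separable)
```

Under these assumed sizes the separable form needs roughly one eighth of the weights of the standard convolution.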
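[Figure 3: Inception structure with depthwise separable convolution. Labels recovered
from the diagram: Input (C×W×H), Conv1×1, DW1×1, DW3×3, MaxPool, Concat/BN/ReLU, Channel
Attention, Adaptive Conv, Output (C×W×H).]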
the weight relationship between channels, further improving the representation ability
of features; the adaptive convolutional structure adjusts itself based on the local
information of the input feature map, further enhancing perceptual ability.
2.4 Hybrid Convolutional Neural Network Architecture
This network (MixConvNet) is a convolutional neural network that combines the
characteristics of several classical networks with appropriate improvements. By
integrating the advantages of different networks, it achieves a more powerful and
efficient architecture. MixConvNet mainly consists of two parts: Block and Transition.
Stacked Blocks are used for feature extraction, while Transitions reduce the number of
channels and the spatial size of the feature map.
1) Block
The block structure is shown in Figure 4. The main components of this structure
include a separable convolutional layer, which converts the number of channels in the
input feature map into c. Subsequently, the feature map processed by a two-layer In-
ception structure is concatenated with the input feature map to achieve a dense connec-
tion effect. In the process of dense connections, multi-level residual connections are
used to solve the problem of gradient vanishing and promote information flow, making
the network easier to learn.
[Figure 4 diagram. Left (Block): an input of size C×W×H passes through stages T1, T2
(Inception, Inception, Channel Attention, AdaptiveConv), and T3, and is concatenated
with the input to give an output of size (C+c)×W×H. Right (Transition): the depthwise
convolutions use a stride of 2 (s=2).]
Fig. 4. Hybrid neural network structure diagram (Left: Block, Right: Transition, T1: Raise the
dimension to 64, T2: Feature extraction, T3: Reduce the dimension to 64)
During the unidirectional flow of feature map information (indicated by black arrows),
the network introduces multi-level residual connections. For example, after the input
feature map undergoes depthwise separable convolution, residual connections are made
to the three subsequent feature-extraction stages (indicated in light blue). To avoid
excessive gradients during backpropagation, we set the parameters α = 1/3 and β = 1/2,
limiting the coefficient of the residual term to 1 to control the gradient size. If the
input of this Block is $x_0$ and the results of successive feature extraction are $x_1$,
$x_2$, and $x_3$, the Block can be represented as:
6 Feiyu Yao and Na Deng
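$$\mathit{Input} = x_0 \tag{2}$$

$$\begin{cases}
x_1 = f_1(x_0) + \alpha x_0 \\
x_2 = f_1(x_1) + \beta x_1 + \alpha x_0 \\
x_3 = f_1(x_2) + x_2 + \beta x_1 + \alpha x_0
\end{cases} \tag{3}$$

$$\mathit{Output} = \mathit{Concat}(x_0, x_3) \tag{4}$$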
By stacking multiple blocks, dense connections are achieved, where the current input
is associated with all previous outputs. Assuming the size of the input feature map is
C × W × H, after n blocks of processing, the size of the output feature map will become
(𝐶 + 𝑛 × 𝑐) × 𝑊 × 𝐻.
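The Block recurrences and the channel growth can be traced with a minimal numeric sketch. Scalars stand in for feature maps, the feature-extraction function `f` is a placeholder passed in by the caller, and the channel numbers are illustrative assumptions; only α = 1/3, β = 1/2, and the form of the equations come from the text.

```python
ALPHA = 1 / 3  # coefficient on x0, as stated in the text
BETA = 1 / 2   # coefficient on x1, as stated in the text


def block_forward(x0, f):
    """Evaluate the Block recurrences for a scalar stand-in x0."""
    x1 = f(x0) + ALPHA * x0
    x2 = f(x1) + BETA * x1 + ALPHA * x0
    x3 = f(x2) + x2 + BETA * x1 + ALPHA * x0
    return x1, x2, x3


def channels_after_blocks(c_in, growth, n_blocks):
    """Channel count after n Blocks: C + n * c."""
    return c_in + n_blocks * growth


# With f set to zero, only the residual (skip) terms contribute.
x1, x2, x3 = block_forward(6.0, lambda x: 0.0)
out_channels = channels_after_blocks(64, 32, 4)  # assumed C=64, c=32, n=4
print(x1, x2, x3, out_channels)
```

Setting `f` to zero isolates the skip path, which makes it easy to see how much of the input each stage carries forward through the residual terms alone.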
2) Transition
Compared with the Block structure, the Transition structure contains only the
feature-extraction part described above and links its input and output through a single
residual connection. In addition, a stride of 2 is set in the depthwise separable
convolution to halve the feature map size. The schematic diagram of the Transition
structure is shown in Figure 4.
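The halving effect of a stride of 2 follows from the standard convolution output-size formula. In the sketch below the 3×3 kernel matches the DW3×3 label in Figure 4, while the padding of 1 and the helper name `conv_output_size` are assumptions.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# 3x3 kernel with stride 2; padding of 1 is an assumption.
from_224 = conv_output_size(224, kernel=3, stride=2, padding=1)
from_156 = conv_output_size(156, kernel=3, stride=2, padding=1)
same_size = conv_output_size(224, kernel=3, stride=1, padding=1)
print(from_224, from_156, same_size)
```

With stride 1 and the same padding the size is preserved, so the reduction comes entirely from the stride.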
The Transition structure adjusts the channels and scale of the feature map: by reducing
both its size and its number of channels, it lets subsequent network layers process
feature information more effectively. By introducing a residual connection, the
Transition structure maintains the information integrity of the input features and
supports information flow and feature transmission throughout the network. In
MixConvNet, the Transition structure balances and controls features between Blocks,
helping the network adapt to feature changes at different levels and further improving
its performance and generalization ability.
3 Experimental process
The experiments use a bird dataset from Kaggle. The dataset contains 525 bird species,
with a total of 84,635 training images, 2,625 validation images (5 images per species),
and 2,625 test images (5 images per species). All images are JPG color images with 3
color channels and a height and width of 224 pixels each. Sample images from the
dataset are shown in Figure 5.
During the experiment, all bird images fed into the hybrid convolutional neural network
model undergo the same preprocessing, including data augmentation techniques such as
random cropping, random horizontal flipping, and random erasing. The random probability
is set to 0.5, and the random cropping size is [156, 156]. The flowchart of the bird
recognition algorithm based on the hybrid convolutional neural network is shown in
Figure 6.
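The augmentation steps can be sketched on a single-channel image stored as nested lists. Several details here are assumptions: the order of operations (flip, then crop, then erase), the reading of 0.5 as the probability of applying the flip and the erase, the 32-pixel erased patch, and all function names. Only the 156×156 crop size, the 224×224 input size, and the 0.5 value come from the text.

```python
import random


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror each row with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def random_erase(image, rng, p=0.5, patch=32, fill=0):
    """Overwrite a random patch x patch square with probability p (patch is assumed)."""
    if rng.random() >= p:
        return image
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - patch)
    left = rng.randint(0, width - patch)
    erased = [row[:] for row in image]
    for r in range(top, top + patch):
        for c in range(left, left + patch):
            erased[r][c] = fill
    return erased


def augment(image, rng):
    """Apply flip, crop, and erase in an assumed order."""
    image = random_horizontal_flip(image, rng)
    image = random_crop(image, 156, rng)
    return random_erase(image, rng)


rng = random.Random(0)
image = [[r * 224 + c for c in range(224)] for r in range(224)]
augmented = augment(image, rng)
print(len(augmented), len(augmented[0]))
```

Whatever the random draws, the output is always 156×156, so every training sample reaches the network at the same size.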
[Figure 6: flowchart of the bird recognition algorithm. Labels recovered from the
diagram: Bird image input (224×224); Preprocessing: Random horizontal flip (224×224),
Random cropping (156×156); Train; Hybrid; Classification result output.]
For fair comparison, all models were trained using the Adam optimization algorithm and
a cross-entropy loss function with a label smoothing parameter of 0.1. The batch size
was set to 32, and the initial learning rate was 0.001. After each training round, the
learning rate was updated with a decay rate of 0.95, over a total of 50 training rounds.
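The loss and learning-rate schedule can be sketched as follows. Two assumptions are made: label smoothing follows the common convention of mixing the one-hot target with a uniform distribution over all classes, and the decay is read as multiplying the learning rate by 0.95 after each round. The function names are hypothetical.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross entropy against a one-hot target mixed with a uniform distribution."""
    n = len(logits)
    shift = max(logits)
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    log_probs = [z - log_sum for z in logits]
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        # Target mass: smoothing / n everywhere, plus (1 - smoothing) on the true class.
        q = smoothing / n + (1.0 - smoothing if i == target else 0.0)
        loss -= q * log_p
    return loss


def learning_rate(round_index, initial=0.001, decay=0.95):
    """Learning rate after round_index rounds of multiplicative decay (assumed form)."""
    return initial * decay ** round_index


uniform_loss = smoothed_cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
confident_plain = smoothed_cross_entropy([10.0, 0.0, 0.0], target=0, smoothing=0.0)
confident_smooth = smoothed_cross_entropy([10.0, 0.0, 0.0], target=0, smoothing=0.1)
lr_start, lr_next, lr_end = learning_rate(0), learning_rate(1), learning_rate(50)
print(uniform_loss, confident_plain, confident_smooth, lr_end)
```

Smoothing raises the loss of a very confident correct prediction, which discourages the network from pushing its logits to extremes.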
[Figure: accuracy and loss curves versus Step (0 to 50). Legend: train and val curves
for SE, SE/Conv, and AdaptiveConv+SE/Conv.]
The results in Figure 8 show that the SE/Conv structure accelerates the convergence of
the network. Combining the AdaptiveConv structure with SE/Conv improves on SE/Conv
alone in both loss and accuracy: the loss value decreased from 1.90 to 1.83, and the
accuracy increased from 82% to 84.5%. This indicates that the AdaptiveConv structure
contributes substantially to model performance, both accelerating convergence and
raising accuracy.
4.2 Comparative Analysis of Performance of Hybrid Convolutional Neural Networks
Comparing the training results of the 8 networks over the first 4 iterations reveals
differences in their convergence speeds, as shown in Figure 9.
[Figure 9: loss of the eight networks over the first 4 iterations. Panels: train, valid.
Y-axis: Loss (2.0 to 7.0). X-axis labels include Inception V2, ResNet101, DenseNet,
MobileNet V3-large, EfficientNet b7, and three MixConvNet variants.]
1) With data augmentation applied to the training set, the training loss of each model
was generally higher than its validation loss in the early iterations.
2) EfficientNet b7 and Inception V2 converge relatively slowly. EfficientNet b7 needs
more iterations to reach lower loss values because of its large model size and large
number of parameters. Inception V2 lacks skip connections, which makes gradient
propagation and network optimization more difficult.
3) The three MixConvNet networks converge fastest, and MixConvNet444 shows the clearest
advantage on the validation set.
In Table 2, we compared the accuracy of eight different convolutional neural net-
works on the bird dataset. The results showed that MixConvNet444 achieved the best
performance on the test set, with an accuracy rate of 94.04%. At the same time, Res-
Net101 and EfficientNet b7 have a higher number of parameters and FLOPs, and have
also achieved high accuracy.
According to the results in Table 2, we can measure the complexity of a model by
the number of parameters and FLOPs. ResNet101 and EfficientNet b7 are the models
with the highest number of parameters and FLOPs, with 43.576M and 65.131M, re-
spectively; MixConvNet84 and MobileNet V3-large are the models with the lowest
5 Conclusion
This article proposes an improved hybrid convolutional neural network model that
combines dense connections, residual connections, improved channel attention mecha-
nisms, and adaptive convolutional structures. Through comparative experiments on
bird datasets, the hybrid convolutional neural network demonstrated excellent perfor-
mance with a low number of parameters, achieving an accuracy of 94.04%.
In addition, the experiments in this article were conducted on specific bird datasets,
so the conclusions and performance advantages obtained may differ on other datasets.
Further research and experiments need to be conducted on a wider and more diverse
dataset to verify the generalization ability and robustness of the hybrid convolutional
network. Although the performance of this hybrid convolutional network can be eval-
uated through these comparisons, future research can consider expanding the scope of
comparative experiments to include more network architectures and models.
Overall, the improved convolutional neural network model proposed in this article
demonstrates significant performance advantages on bird datasets. These results indi-
cate that by combining different convolutional structures and attention mechanisms, as
well as appropriate network design and optimization, excellent performance can be
achieved with fewer parameters. This is of great significance for further promoting the
development of convolutional neural networks, and provides valuable reference and
inspiration for solving feature recognition problems in practical applications.
References
1. Xie J, Zhong Y, Zhang J, et al. A review of automatic recognition technology for bird
vocalizations in the deep learning era[J]. Ecological Informatics, 2023, 73: 101927.
2. Huang Y P, Basanta H. Bird image retrieval and recognition using a deep learning plat-
form[J]. IEEE access, 2019, 7: 66980-66989.
3. Ragib K M, Shithi R T, Haq S A, et al. Pakhichini: Automatic bird species identification
using deep learning[C]//2020 Fourth world conference on smart trends in systems, security
and sustainability (WorldS4). IEEE, 2020: 1-6.
4. LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recogni-
tion[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
5. Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional
neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
6. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//Proceedings of 2015
IEEE conference on computer vision and pattern recognition. Piscataway, NJ: IEEE. 2015:
1-9.
7. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing
internal covariate shift[C]//Proceedings of the 32nd International conference on machine
learning. New York: International Machine Learning Society. 2015: 448-456.
8. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings
of 2016 IEEE conference on computer vision and pattern recognition. Piscataway, NJ: IEEE.
2016: 770-778.
9. Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional net-
works[C]//Proceedings of 2017 IEEE conference on computer vision and pattern recogni-
tion. Piscataway, NJ: IEEE. 2017: 4700-4708.
10. Howard A G, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks for
mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
11. Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of 2018 IEEE con-
ference on computer vision and pattern recognition. Piscataway, NJ: IEEE. 2018: 7132-
7141.
12. Zhou T, Zhao Y, Wu J. Resnext and res2net structures for speaker verification[C]//2021
IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021: 301-307.
13. Li W, Qi F, Tang M, et al. Bidirectional LSTM with self-attention mechanism and multi-
channel features for sentiment classification[J]. Neurocomputing, 2020, 387: 63-77.