Xiao et al. - 2022 - Attribute-Based Progressive Fusion Network for RGBT Tracking
Abstract

RGBT tracking usually suffers from various challenging factors such as fast motion, scale variation, illumination variation, thermal crossover and occlusion, to name a few. Existing works often study fusion models that try to solve all challenges simultaneously, which requires the fusion models to be complex enough and the training data to be large enough; both are usually difficult to obtain in real-world scenarios. In this work, we disentangle the fusion process via the challenge attributes, and thus propose a novel Attribute-based Progressive Fusion Network (APFNet) to increase the fusion capacity with a small number of parameters while reducing the dependence on large-scale training data. In particular, we design five attribute-specific fusion branches to integrate RGB and thermal features under the challenges of thermal crossover, illumination variation, scale variation, occlusion and fast motion respectively. By disentangling the fusion process, we can use a small number of parameters for each branch to achieve robust fusion of different modalities, and train each branch using the small training subset with the corresponding attribute annotation. Then, to adaptively fuse the features of all branches, we design an aggregation fusion module based on SKNet. Finally, we also design an enhancement fusion transformer to strengthen the aggregated feature and modality-specific features. Experimental results on benchmark datasets demonstrate the effectiveness of our APFNet against other state-of-the-art methods. Code will be available at https://github.com/mmic-lcl/source-code.

* Chenglong Li is the corresponding author.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 1: Advances of our fusion model over existing ones. (a) Common fusion models. (b) Attribute-aware fusion model (Li et al. 2020; Zhang et al. 2021a), which extracts appearance features under certain attributes and then performs feature fusion. (c) Our attribute-based progressive fusion model, which performs fusion under certain attributes and then aggregates and enhances them in a progressive manner.

Introduction
RGBT tracking is to use the complementary benefits of RGB and thermal infrared data to achieve robust visual tracking, which has various applications such as autonomous driving, surveillance security and robotics. RGBT tracking is a challenging task since it usually suffers from various factors, such as thermal crossover, illumination variation, scale variation, occlusion and fast motion.

Existing works often try to study various fusion models that solve all challenges simultaneously in RGBT tracking. Zhu et al. (2019) fuse features of the RGB and thermal modalities in each layer and recursively aggregate these features in all layers. Li et al. (2019b) mine modality-shared and -specific information by a multi-adapter network which includes modality-shared and modality-specific feature learning modules in all layers. Zhang et al. (2019) design different fusion strategies in the DiMP tracking framework (Bhat et al. 2019), which requires a large-scale training dataset including 9,335 videos with 1,403,359 frames in total. However, these methods need to either design complex fusion models or construct large-scale training data for all challenge factors, but the performance improvement is still limited since challenge factors are numerous in real-world scenarios.
To relieve the burden of model design and data construction, Li et al. (2020) model the target appearance under different challenge attributes separately and then fuse them using an aggregation module. In this way, the target representations under certain attributes can be learnt with a few parameters even in the situation of insufficient training data. However, the fusion is performed in a simple manner; the capacity of the fusion model might thus be limited and the tracking performance would be degraded. Zhang et al. (2021a) design a complex attribute ensemble network for better aggregating multiple attribute features. However, neither of them considers the fusion method under specific attributes.

In this work, we disentangle the fusion process via the challenge attributes, and thus propose a novel Attribute-based Progressive Fusion Network (APFNet) to increase the fusion capacity with a small number of parameters while reducing the dependence on large-scale training data. Advances of our fusion model over existing ones are shown in Fig. 1. In particular, we design a fusion branch for each attribute to learn the attribute-specific fusion parameters, then design an aggregation fusion module to integrate all fused features from each attribute-based fusion branch, and finally design an enhancement fusion transformer to enhance the aggregated feature and modality-specific features. We describe the details as follows.

First, we disentangle the fusion process via five challenge attributes including thermal crossover (TC), illumination variation (IV), scale variation (SV), occlusion (OCC) and fast motion (FM) to achieve the attribute-specific fusion. Note that more attributes could be incorporated in our framework, and we leave this for future work. For each attribute-specific fusion branch, we can employ a small number of parameters for model design since each branch just needs to focus on the feature fusion under a certain attribute. Moreover, these branches can be trained separately using small-scale training data with the corresponding attributes, and the requirement of large-scale data to train fusion models is thus addressed.

Second, we design an attribute-based aggregation fusion module to adaptively aggregate all attribute-specific fusion features. Note that the attribute annotations are available in the training stage but unavailable in the testing stage, and thus we do not know which fusion branches should be activated during tracking. To handle this problem, we design the aggregation model based on SKNet (Li et al. 2019c) to adaptively select effective features from all fusion branches. Different from existing works (Qi et al. 2019; Li et al. 2020), the designed aggregation model can more effectively suppress noisy features from unappeared attributes by predicting the channel attention for each fused feature.

Finally, we design the enhancement fusion transformer to strengthen the aggregated feature and modality-specific features. Different from existing transformers (Vaswani et al. 2017; Wang et al. 2021; Chen et al. 2021), our enhancement fusion transformer is equipped with three encoders and two decoders for self enhancement and interactive enhancement respectively. On one hand, we leverage the three encoders to enhance the aggregated and modality-specific features respectively. On the other hand, we use the two decoders to interactively strengthen the above enhanced features.

We use a dual-stream hierarchical architecture to gradually integrate the modules of the attribute-based progressive fusion, as shown in Fig. 2. In the training phase, we design a three-stage training algorithm to train our network effectively and efficiently. Experimental results on three benchmark datasets, GTOT, RGBT234 and LasHeR, demonstrate the effectiveness of the proposed APFNet against other state-of-the-art methods.

The contributions of this paper are summarized as follows.
• We propose a novel attribute-based progressive fusion network to increase the fusion capacity with a small number of parameters while reducing the dependence on large-scale training data in RGBT tracking.
• We design an attribute-specific fusion strategy to disentangle the fusion process via five challenge attributes. Each fusion branch only requires a small number of parameters which can be trained efficiently using small-scale training data, since it just needs to focus on the feature fusion under a certain attribute.
• We design an attribute-based aggregation fusion model to adaptively aggregate all attribute-specific fusion features. Although we do not know which fusion branches should be activated during tracking, the model can effectively suppress noisy features from unappeared attributes via attention-based weighting.
• We design an enhancement fusion transformer to further strengthen the aggregated feature and modality-specific features. The transformer consists of different numbers of encoders and decoders for self enhancement and interactive enhancement of the aggregated and modality-specific features respectively.

Related Work
In this section, we give a brief introduction to RGBT fusion tracking and transformer tracking.

RGBT Fusion Tracking
Recently, deep learning trackers have dominated this research field, mainly benefiting from the powerful representation ability of CNNs, and have significantly improved tracking performance. Some of them focus on multi-modal information fusion for RGBT tracking. Zhu et al. (2020) fuse features by calculating both layer-wise and modality-wise weights to generate more discriminative features for RGBT tracking. To remove redundant and noisy features of the target, Zhu et al. (2019) propose a recursive strategy to densely aggregate features in each modality and prune the aggregated features of all modalities in a cooperative manner. Zhang et al. (2021b) introduce both appearance and motion cues, and propose a switcher to switch between the appearance and motion cues flexibly. To further analyze the effectiveness of multi-modal fusion, mfDiMP (Zhang et al. 2019) considers several fusion mechanisms including pixel-wise, feature-wise and response-wise fusion at different levels.
[Figure 2 diagram: the RGB and T feature extractors feed the Attribute-specific Fusion Branches (TC, IV, SV, OCC, FM), followed by the Attribute-based Aggregation Fusion and the Attribute-based Enhanced Fusion, which together form the Attribute-based Progressive Fusion (APF) Module embedded in each layer.]
Figure 2: The structure of the Attribute-based Progressive Fusion Network (APFNet). TC, IV, SV, OCC and FM represent the five challenge attribute-specific branches, i.e., thermal crossover, illumination variation, scale variation, occlusion and fast motion. Herein, + and × denote the operations of element-wise addition and multiplication. GAP denotes global average pooling.
Tu et al. (2021) extract a more robust feature representation through a sample-dividing strategy, and then design an attention-based fusion module to fuse the features of the two modalities. Lu et al. (2021) design an instance adapter to predict modality weights for achieving quality-aware fusion of different modalities. However, all of the above methods just design more complex models to handle all challenging situations. To make better use of training data, Li et al. (2020) mine modality-shared and modality-specific information under different challenges, and thus propose a challenge-aware network to enhance the discriminative ability of the weaker modality, achieving excellent tracking performance. Furthermore, Zhang et al. (2021a) design a more complex aggregation module for better aggregating different attributes.

Transformer Tracking
The core of the transformer is the attention mechanism, which was first applied in the field of machine translation (Vaswani et al. 2017) with great success. Transformers have also been applied to visual tracking. Wang et al. (2021) use a transformer to aggregate information from multiple template frames to distinguish the target. Chen et al. (2021) design a self-attention-based self-enhancement module and a cross-attention-based feature fusion module in the Siamese framework to focus on global information. Different from the above transformer models, we design an enhancement fusion transformer which consists of three encoders and two decoders to enhance the aggregated feature and the modality-specific features.

Attribute-based Progressive Fusion Tracking
In this part, we first present the architecture of the Attribute-based Progressive Fusion Network (APFNet), then introduce the three-stage training algorithm, and finally describe the process of online tracking in detail.

Attribute-based Progressive Fusion Network
Overview  The overall architecture of APFNet is shown in Fig. 2, and the major component of APFNet is the attribute-based progressive fusion (APF) module, which consists of five attribute-specific fusion branches, an attribute-based aggregation fusion module and an enhancement fusion transformer. Specifically, we use the first three layers of VGG-M (Simonyan and Zisserman 2015) as the backbone and expand it into a dual-stream architecture. In each layer, we embed an APF module in order to gradually fuse the information of the two modalities. First, we input the visible and thermal images. The backbone network extracts modality-specific features separately, and the five branches perform attribute-specific fusion at the same time. Then we send all attribute-specific fused features to the attribute-based aggregation module to obtain a robust aggregated feature. Next, the two modality-specific features and the aggregated feature are respectively sent to the enhancement fusion transformer to form a more robust feature representation, which is used as the input of the next convolutional layer and APF module. After the last APF module, three fully connected layers are used to extract global features for classification and regression (Nam and Han 2016). We present the details of the three components of the APF module and our training method below.
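To make the data flow described above concrete, the following is a minimal PyTorch-style sketch of how one dual-stream backbone stage with an embedded APF module could be wired. The module and argument names (the attribute branch instances, AggregationFusion, EnhancementTransformer) are placeholder assumptions introduced here for illustration only; they are not taken from the released implementation.

```python
import torch.nn as nn

ATTRIBUTES = ["TC", "IV", "SV", "OCC", "FM"]

class APFStage(nn.Module):
    """One backbone layer of the dual-stream network with an embedded APF module (sketch)."""

    def __init__(self, conv_rgb, conv_t, branches, aggregation, enhancement):
        super().__init__()
        self.conv_rgb = conv_rgb                 # VGG-M conv block of the RGB stream
        self.conv_t = conv_t                     # VGG-M conv block of the thermal stream
        self.branches = nn.ModuleDict(branches)  # five attribute-specific fusion branches (TC, IV, SV, OCC, FM)
        self.aggregation = aggregation           # attribute-based aggregation fusion module
        self.enhancement = enhancement           # enhancement fusion transformer (three encoders, two decoders)

    def forward(self, x_rgb, x_t):
        # 1) modality-specific feature extraction in the two streams
        f_rgb, f_t = self.conv_rgb(x_rgb), self.conv_t(x_t)
        # 2) the five branches fuse RGB and thermal features in parallel
        fused = [self.branches[name](f_rgb, f_t) for name in ATTRIBUTES]
        # 3) adaptively aggregate the five fused features into one robust feature
        f_agg = self.aggregation(fused)
        # 4) self/interactive enhancement; the enhanced features feed the next
        #    convolutional layer and APF module (three FC layers follow the last stage)
        return self.enhancement(f_rgb, f_t, f_agg)
```

Stacking three such stages over the first three VGG-M layers and appending the three fully connected layers of the MDNet-style head would correspond to the overall pipeline sketched in Fig. 2.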
Attribute-specific Fusion Branches  We first disentangle the fusion process via five challenge attributes including thermal crossover (TC), illumination variation (IV), scale variation (SV), occlusion (OCC) and fast motion (FM) to achieve the attribute-specific fusion. In most existing datasets, such as GTOT (Li et al. 2016) and RGBT234 (Li et al. 2019a), each attribute is manually annotated for each video frame. This allows us to train each attribute-specific fusion branch individually. For each attribute-specific fusion branch, we can employ a small number of parameters for model design since each branch just needs to focus on the feature fusion under a certain attribute. Moreover, these branches can be trained separately using small-scale training data with the corresponding attributes, and the requirement of large-scale data to train fusion models is thus addressed.

For simplicity, we set the fusion branches under different attributes to the same structure. Specifically, for each branch, we first extract features from the two modalities using a convolutional layer with a kernel size of 5×5, a rectified linear unit (ReLU) and a convolutional layer with a kernel size of 4×4. Then, we use SKNet (Li et al. 2019c) to adaptively select the channel-wise features from both modalities.

In the enhancement fusion transformer, each input feature is transformed into three vectors, namely query, key and value, through linear layer mapping. The attention weight matrix is generated from the query q and key k, and multiplied by the value v for the i-th layer. Finally, v is added to the original feature vector through a residual connection. We show the details on the left of Fig. 3. In the i-th encoding phase, X^i_vis, X^i_inf and X^i_agg are input to three encoders for self enhancement, and the outputs are denoted as X^{e,i}_vis, X^{e,i}_inf and X^{e,i}_agg respectively, where X^{e,i}_vis and X^{e,i}_inf are the visible and thermal infrared features after the i-th encoder's self enhancement. The self enhancement encoder applied to the aggregated feature and the modality-specific features can be described as follows:

X^{e,i}_{vis} = Encoder(X^{i}_{vis}) ∈ R^{C×H×W}
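A minimal sketch of such a self-enhancement encoder is given below, assuming single-head attention over flattened spatial tokens via nn.MultiheadAttention; the actual head count, normalization and any feed-forward sub-layers are not specified in the text above, so this is only an illustrative realization of the q/k/v attention with a residual connection.

```python
import torch
import torch.nn as nn

class SelfEnhanceEncoder(nn.Module):
    """Self-enhancement encoder (sketch): q, k and v come from the same feature map,
    and the attended values are added back through a residual connection."""

    def __init__(self, channels: int, num_heads: int = 1):
        super().__init__()
        # the linear q/k/v mappings live inside nn.MultiheadAttention
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), e.g. X^i_vis, X^i_inf or X^i_agg
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C): one token per spatial position
        attended, _ = self.attn(tokens, tokens, tokens)  # attention weights from q and k, applied to v
        enhanced = tokens + attended                     # residual addition to the original feature
        return enhanced.transpose(1, 2).reshape(b, c, h, w)  # X^{e,i}, back in (B, C, H, W)
```

The two decoders of the enhancement fusion transformer would follow the same pattern but take the query from one feature and the key/value from another (cross-attention), realizing the interactive enhancement between the aggregated and modality-specific features.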
We adopt the precision rate (PR) and success rate (SR) for evaluation. The SR is the percentage of successfully tracked frames whose overlaps are larger than a given threshold, and the representative SR is computed by the area under the curve. RGBT234 is a larger RGBT tracking dataset that extends the RGBT210 (Li et al. 2017) dataset; it provides more accurately aligned video pairs with about 234K frames in total.
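For reference, the PR/SR protocol mentioned above can be computed as in the sketch below, assuming the per-frame center-location errors and bounding-box overlaps (IoU) have already been collected; the 20-pixel precision threshold is the convention commonly used on RGBT234 and LasHeR (GTOT typically uses 5 pixels because of its small targets), a detail not stated in the surviving text.

```python
import numpy as np

def precision_rate(center_errors, threshold=20.0):
    """PR: fraction of frames whose predicted-to-ground-truth center distance is within the threshold."""
    errors = np.asarray(center_errors, dtype=float)
    return float((errors <= threshold).mean())

def success_rate(overlaps, num_thresholds=21):
    """Representative SR: area under the success curve, i.e. the average fraction of frames whose
    bounding-box overlap exceeds each threshold sampled uniformly in [0, 1]."""
    overlaps = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))
```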
We test our Attribute-based Progressive Fusion Network (APFNet) on three popular RGBT tracking benchmarks and compare its performance with state-of-the-art trackers, including ADRNet (Zhang et al. 2021), HDINet (Mei et al. 2021), M5L (Tu et al. 2021), CMPP (Wang et al. 2020), JMMAC (Zhang et al. 2021b), CAT (Li et al. 2020), MANet (Li et al. 2019b), DAFNet (Gao et al. 2019), mfDiMP (Zhang et al. 2019), DAPNet (Zhu et al. 2019), SGT (Li et al. 2017), FANet (Zhu et al. 2020), MaCNet (Zhang et al. 2020) and MDNet (Nam and Han 2016)+RGBT, to validate the effectiveness of the proposed method.

[Figure 4: precision and success plots on the GTOT dataset; the legend reports the PR/SR scores of the compared trackers.]

Figure 5: The evaluation curves on the RGBT234 dataset. The representative PR/SR scores are presented in the legend.
Evaluation on GTOT Dataset  Comparison results on the GTOT dataset are shown in Fig. 4. We can find that our APFNet outperforms the top traditional algorithm SGT (Li et al. 2017) with performance gains of 5.4%/10.9% in PR/SR, respectively. Moreover, we advance our baseline tracker MDNet+RGBT (Nam and Han 2016) by a large margin, which proves the clear performance advantage of our method. It is noted that we only use the GTOT dataset for training to achieve the best results on RGBT234 and LasHeR, which shows that our model does not depend on large-scale training data.

Evaluation on LasHeR Dataset  The evaluation on the LasHeR testing set is shown in Fig. 6. We can find that our tracker achieves excellent performance compared with state-of-the-art trackers. Compared with mfDiMP (Zhang et al. 2019), which is the champion of VOT2019-RGBT, our PR/SR is 5.3%/1.9% higher. Compared with MaCNet and CAT, we also advance them by 1.8%/1.2% and 5.0%/4.8% in PR/SR, respectively.

Figure 6: The evaluation curves on the LasHeR testing set. The representative PR/SR scores are presented in the legend.

We further verify that each challenge branch is specific and solves the corresponding challenge. In this part, we select five typical sequences from the GTOT and RGBT234 datasets, and each sequence has a corresponding main challenge (Fan et al. 2021; Qi et al. 2019). On the basis of MDNet+RGBT, we only add a specific challenge fusion branch to test each sequence. The corresponding results on GTOT and RGBT234 are shown in Table 2 and Table 3, respectively. The experimental results show that each challenge fusion branch achieves its best results on the sequence with the corresponding challenge as the mainstay, compared with other challenge fusion branches, which verifies the effectiveness of the challenge fusion branches.

Table 2: Tracking results in terms of success rate on five GTOT sequences (A: CarNig, B: Minibus, C: OCCCar-2, D: Otcbvs, E: FastMotoNig). We have marked the dominant attribute of the video after the name of each video sequence. We only add each attribute-specific fusion branch to the backbone to get the result.