Smiletrack: Similarity Learning For Occlusion-Aware Multiple Object Tracking
Wang Yu Hsiang1, ∗Jun-Wei Hsieh1, Ping-Yang Chen2, Ming-Ching Chang3, Hung Hin So4, Xin Li3
1 College of AI and Green Energy and 2 Department of Computer Science, National Yang Ming Chiao Tung University, Taiwan
3 Department of Computer Science, University at Albany, State University of New York, USA
4 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong
j122333221.ai09@nycu.edu.tw, ∗jwhsieh@nycu.edu.tw, pingyang.cs08@nycu.edu.tw, mchang2@albany.edu, 1155157729@link.cuhk.edu.hk, xli48@albany.edu
arXiv:2211.08824v3 [cs.CV] 20 Aug 2023
Abstract
boxes [4, 6, 16, 25–28]. The re-identification model then extracts the embedding features of each object from its bounding box, and these features are used to associate each bounding box with one of the existing trajectories. Despite their flexibility, the efficiency of SDE methods trails behind that of JDE due to the necessity of two separate models.

The motivation behind this work is two-fold. One of the long-standing problems in MOT is occlusion handling, and the other is a principled solution to the speed-accuracy trade-off. Although the TbA method has impressive results on feature attention, its exceptional feature attention results in a high time complexity that reduces inference speed. In addition, occlusions can cause tracked objects to receive less attention, resulting in the failure of MOT. Meanwhile, TbD methods such as ByteTrack [50] enjoy computational efficiency, but their accuracy is not optimized. It is highly desirable to develop a class of MOT methods that can strike an improved trade-off between cost (e.g., running speed measured by FPS) and performance (e.g., tracking accuracy measured by MOTA [2]).

This paper proposes a novel object tracker, Similarity Learning for Multiple Object Tracking (SMILEtrack), which combines an object detector and a Similarity Learning Module (SLM) to address various challenges in MOT, especially occlusion. Fig. 2 shows the architecture of our SMILEtrack, which provides two major contributions to achieving the State-of-the-Art (SoTA) MOT system: (1) an efficient and lightweight self-attention mechanism that learns the similarity between two candidate bounding boxes. Although the SDE model can achieve high accuracy in object tracking, most feature descriptors used in the model cannot differentiate between objects with similar appearances. To solve this problem, we propose a Siamese network-based Similarity Learning Module (SLM) that can calculate the similarity in appearance between two objects. Inspired by the vision Transformer [9], we introduce a Patch Self-Attention (PSA) block in SLM to produce reliable features for similarity matching. (2) a robust tracker with a novel GATE function that can associate each candidate bounding box from video frames, leading to improved MOT performance. To better handle occlusions, we create a Similarity Matching Cascade (SMC) module that takes SLM results and matches multiple objects robustly across frames. The proposed network achieves SoTA performance on the MOT17 and MOT20 datasets. Contributions of our work are summarized as follows.

• We propose SMILEtrack, a separate detection and tracking model, to track multiple objects in frames. SMILEtrack can outperform ByteTrack [50] by 0.4-0.8 MOTA points and over 2.0 HOTA points on the MOT17 and MOT20 datasets; see Fig. 1.
• We introduce a Siamese network-based Similarity Learning Module (SLM) to learn the similarity in appearance between objects for tracking.
• A Patch Self-Attention (PSA) block is proposed that uses a self-attention mechanism to produce reliable features for similarity matching.
• We design a Similarity Matching Cascade (SMC) module to match objects more reliably across frames, which largely improves performance in the presence of occlusions.

2. Related Work

2.1. Tracking-by-Detection

The Tracking-by-Detection (TbD) method has become one of the most popular approaches in the MOT framework. The main tasks of the TbD method can be roughly divided into two parts: object detection and object association.

Object Detection: Mainstream visual object detection models fall into two categories, namely, the two-stage (proposal-driven) and one-stage (direct) detectors. The two-stage methods [28] offer high accuracy but at the cost of speed. On the contrary, one-stage methods are faster but less accurate. YOLO object detection models [4, 25–27] have been widely used in multi-object tracking (MOT) applications due to their speed and accuracy. However, these anchor-based detectors introduce many hyperparameters and consume significant time and memory during training. To mitigate these issues, anchor-free detectors such as CenterNet [56] and YOLOX [12] have emerged. Despite their improvements [1, 50], these trackers still struggle to accurately detect objects of varying sizes. PRB-Net [6] is an effective object detector for MOT tasks, addressing the limitations of anchor-based and anchor-free detectors.

Object Association: SORT [3] is a simple yet effective tracking algorithm that uses Kalman filtering and Hungarian matching for object association. It struggles with challenges such as occlusions and fast-moving objects. DeepSORT [43] alleviates occlusion issues by incorporating CNN-based appearance features; however, this compromises execution speed. To address this efficiency issue, FairMOT [51] employs an anchor-free method based on CenterNet [56], which significantly improves the MOT

Figure 3. Appearance similarity between low-score detection at the current frame and tracks at the previous frame.
3.1.2 The Q-K-V Attention

Since input objects are of different sizes, we resize them to a fixed size $W \times H$, where $W$ and $H$ are, respectively, set to 80 and 224 in this paper. Assume that an object $A$ is divided into $N_P$ patches $\{P_i\}_{i=1,\dots,N_P}$. Each patch $P_i$ has a fixed size $W_P \times H_P$. Then, we use a row-major scanning order to convert each $P_i$ to a column vector. Then, an object can be represented as a sequence of $N_P$ feature vectors: $(P_1, \dots, P_i, \dots, P_{N_P})$, $P_i \in \mathbb{R}^{D_P}$, where $D_P = W_P \times H_P$. Before feature extraction, the values of pixels in $P_i$ are normalized to $[0, 1]$. Since there are geometrical relations between the patches in $A$, their representation should be modified to preserve position-dependent properties. For the $i$th patch $P_i$, its position embedding vector $E_i$ is specified by the standard transformer [36]. It follows that an object $A$ is embedded as $A = (A_1, \dots, A_i, \dots, A_{N_P})$, where $A_i = P_i + E_i$ and $A \in \mathbb{R}^{D_P \times N_P}$. For each $A_i$, we adopt the CSP-Net framework [38] as the backbone to convert it into a feature matrix $F_i$. $F_i$ includes $d_f$ row vectors and $C$ column vectors; that is, $F_i \in \mathbb{R}^{d_f \times C}$, where $C$ is the number of feature channels and $d_f$ is the size of the last layer of

where $Q_i \in \mathbb{R}^{d_f \times d_k}$, $K_i \in \mathbb{R}^{d_f \times d_k}$, and $V_i \in \mathbb{R}^{d_f \times d_v}$. Before matching, the norms of $Q_i$, $K_i$, and $V_i$ will be normalized to be one; that is, $\|Q_i\| = 1$, $\|K_i\| = 1$, and $\|V_i\| = 1$. Let $\otimes$ denote the Hadamard product and $\mathrm{Sum}(M)$ be an element-wise sum on a matrix $M$. For each $A_i$, its attention $\alpha_{i,j}$ to $A_j$ can be calculated according to the following equation:

$$\alpha_{i,j} = \frac{\mathrm{Sum}(Q_i \otimes K_j)}{\sum_{j=1}^{N_P} \mathrm{Sum}(Q_i \otimes K_j)}. \tag{2}$$

With $\alpha_{i,j}$, $A_i$ is converted to a feature vector $\bar{V}_i$ as follows: $\bar{V}_i = \sum_{j=1}^{N_P} \alpha_{i,j} V_j$. After concatenating all $\bar{V}_i$, a new feature vector $\bar{V}^A$ is created from $A$ for object tracking: $\bar{V}^A = (\bar{V}_1, \dots, \bar{V}_i, \dots, \bar{V}_{N_P})$. In Fig. 5, after the PSA block, $\bar{V}^A$ is converted to a new feature vector $Z^A$ by using a fully-connected network. Then, given two objects $A$ and $B$, with SLM, their similarity score can be measured by calculating the cosine similarity between $Z^A$ and $Z^B$.
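The attention weights of Eq. (2), the weighted sum that produces each patch feature, and the final cosine similarity can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names (`normalize`, `hadamard_sum`, `patch_attention`, `cosine_similarity`) and the toy inputs are hypothetical, the norm is assumed to be the Frobenius norm, the fully-connected layer that maps the concatenated features to Z is omitted, and the denominator of Eq. (2) is assumed to be non-zero.

```python
import math

Matrix = list[list[float]]


def normalize(m: Matrix) -> Matrix:
    """Scale a matrix to unit norm (Frobenius norm is an assumption here)."""
    norm = math.sqrt(sum(x * x for row in m for x in row))
    return [[x / norm for x in row] for row in m]


def hadamard_sum(a: Matrix, b: Matrix) -> float:
    """Sum(A ⊗ B): element-wise product, then a sum over all entries."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def patch_attention(Q: list[Matrix], K: list[Matrix], V: list[Matrix]):
    """Return (alpha, v_bar): alpha follows Eq. (2), v_bar[i] = sum_j alpha[i][j] * V[j]."""
    Q, K, V = ([normalize(m) for m in mats] for mats in (Q, K, V))
    n = len(Q)
    n_rows, n_cols = len(V[0]), len(V[0][0])
    alpha, v_bar = [], []
    for i in range(n):
        scores = [hadamard_sum(Q[i], K[j]) for j in range(n)]
        total = sum(scores)  # assumed non-zero in this sketch
        weights = [s / total for s in scores]
        alpha.append(weights)
        v_bar.append([
            [sum(weights[j] * V[j][r][c] for j in range(n)) for c in range(n_cols)]
            for r in range(n_rows)
        ])
    return alpha, v_bar


def flatten(mats: list[Matrix]) -> list[float]:
    """Concatenate all patch features into one vector."""
    return [x for m in mats for row in m for x in row]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy example with two patches and 2x2 matrices (values are made up).
Q = [[[1.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]]]
K = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
V = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
alpha, v_bar = patch_attention(Q, K, V)
similarity = cosine_similarity(flatten(v_bar), flatten(v_bar))
```

Because each row of `alpha` is divided by its own total, the weights for a given patch sum to one, which is the property Eq. (2) encodes.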
3.2. Similarity Matching Cascade (SMC) for Target Tracking

Object association is the crucial step after similarity calculation for MOT. A well-designed association strategy can have a significant impact on tracking results such as HOTA [18]. In the literature, ByteTrack [50] is a simple yet effective association method, where detected boxes are classified by their confidence scores from high to low, and the best match in history is found based on the IOU criterion. Although ByteTrack achieves SoTA performance in some MOT evaluations (i.e., simple motion patterns), relying solely on the IOU distance for data association can result in frequent ID switches when visually similar targets approach each other (e.g., one occludes the other). To address this issue, we designed the SMC association method, shown in Fig. 6, which integrates the advantages of ByteTrack to achieve an improved trade-off between speed and accuracy.

Let $O$ denote the set of objects detected by the PRB-Net from the current frame. All objects $O_i$ in $O$ are sorted according to their detection scores in descending order (the median detection score is $\mu$). Subsequently, all objects $O_i$ in $O$ are divided into two sets, $O^H$ and $O^L$, based on thresholding. Any object in $O$ with a detection score higher than the threshold $\mu$ is placed in $O^H$. If its detection score is lower than $\mu$ but higher than 0.1, it belongs to $O^L$. We treat an object as background or noise if its detection score is below 0.1. Two different association strategies are employed to match elements in $O^H$ and $O^L$, respectively.

Let $T$ represent the track list stored in the previous frame. Before matching, each track in $T$ predicts its new position in the current frame using a Kalman filter. Moreover, $T_i(k)$ denotes the $k$th fragment or tracklet of the $i$th track in $T$, where $T_i(\mathrm{last})$ refers to the last fragment of $T_i$. Furthermore, we use $S^H_{iou}(i, j)$ and $S^H_{app}(i, j)$ to denote the IOU similarity matrix and the appearance similarity matrix, respectively, between $T_i(\mathrm{last})$ and the $j$th object $O_j$ in $O^H$. The value of $S^H_{app}(i, j)$ is obtained using the SLM method as follows: $S^H_{app}(i, j) = \mathrm{SLM}(T_i(\mathrm{last}), O_j)$. By integrating $S^H_{iou}(i, j)$ and $S^H_{app}(i, j)$ together, the similarity between $T_i$ and the $j$th object $O_j$ in $O^H$ is calculated as follows:

$$S^H(i, j) = S^H_{iou}(i, j) + S^H_{app}(i, j). \tag{3}$$

Fig. 7 shows an example of calculating appearance similarity with the multi-templated SLM. Furthermore, we denote $S^L_{iou}(i, j)$ as the IOU similarity matrix between $T_i(\mathrm{last})$ and the $j$th object $O_j$ in $O^L$. Similar to Eq. (3), the integrated similarity between $T_i$ and the $j$th object $O_j$ in $O^L$ is calculated as:

$$S^L(i, j) = S^L_{iou}(i, j) + S^L_{app}(i, j). \tag{4}$$

Using $S^H(i, j)$ and $S^L(i, j)$, we initially associate the objects in $O^H$ with tracklets in $T$. However, due to occlusions or blur, some tracklets remain unmatched. To address this issue, we subsequently associate the objects in $O^L$ with these unmatched tracklets, leading to State-of-the-Art (SoTA) MOT performance. The details of the SMC module are described below:

Stage I: During the first stage of association, our focus is on finding matches between $O^H$ and $T$. We employ the Hungarian algorithm to perform linear assignment using the similarity matrix $S^H(i, j)$. The unmatched objects of $O^H$ and the unmatched tracks of $T$ are then placed in $O^H_{Remain}$ and $T^H_{Remain}$, respectively.

Stage II: In the second matching stage, we match the objects in $O^L$ to the tracklets in $T^H_{Remain}$. We complete the linear assignment by the Hungarian algorithm with the similarity matrix $S^L$. The unmatched objects in $O^L$ and the unmatched tracks in $T^H_{Remain}$ are placed in $O^L_{Remain}$ and $T^L_{Remain}$.

Figure 7. Appearance similarity between low-score detection at the current frame and tracks at the previous frame. Five tracklets compute a similarity score with the low-score detection using SLM. The most similar tracklet is selected, as indicated by the orange arrow in the figure.

Figure 8. The use of a GATE function can better handle the occlusion and ID-switch problems in MOT. (a) Results of MOT without using the GATE function. When the two targets are getting closer and the IOU score is higher than the appearance score, an ID-switch problem happens. (b) Results of MOT using the GATE function.
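The detection split, the summed similarity of Eqs. (3) and (4), and the two matching stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, a single track-by-detection similarity matrix is used for both stages, detections whose score equals the threshold are placed in the low-score set, and an exhaustive search over permutations stands in for the Hungarian algorithm (both return a maximum-similarity one-to-one assignment, but the exhaustive version is only practical for tiny inputs).

```python
from itertools import permutations

LOW_SCORE_FLOOR = 0.1  # detections at or below this are treated as background


def split_detections(scores: list[float], mu: float):
    """Return detection indices for the high-score and low-score sets."""
    high = [j for j, s in enumerate(scores) if s > mu]
    # Tie handling (s == mu goes to the low set) is an assumption of this sketch.
    low = [j for j, s in enumerate(scores) if LOW_SCORE_FLOOR < s <= mu]
    return high, low


def combined_similarity(s_iou, s_app):
    """Eqs. (3)/(4): element-wise sum of IOU and appearance similarity."""
    return [[a + b for a, b in zip(r_iou, r_app)] for r_iou, r_app in zip(s_iou, s_app)]


def best_assignment(sim, tracks: list[int], dets: list[int]):
    """Maximum-similarity one-to-one matching by exhaustive search.

    Stand-in for the Hungarian algorithm; sim[t][d] is the similarity
    between track t and detection d.
    """
    if not tracks or not dets:
        return []
    if len(tracks) <= len(dets):
        candidates = (list(zip(tracks, p)) for p in permutations(dets, len(tracks)))
    else:
        candidates = (list(zip(p, dets)) for p in permutations(tracks, len(dets)))
    return max(candidates, key=lambda pairs: sum(sim[t][d] for t, d in pairs))


def smc_match(sim, scores: list[float], mu: float):
    """Two-stage cascade: high-score detections first, then low-score ones."""
    high, low = split_detections(scores, mu)
    tracks = list(range(len(sim)))

    stage1 = best_assignment(sim, tracks, high)               # Stage I
    tracks_remain_h = [t for t in tracks if t not in {m[0] for m in stage1}]
    dets_remain_h = [d for d in high if d not in {m[1] for m in stage1}]

    stage2 = best_assignment(sim, tracks_remain_h, low)       # Stage II
    tracks_remain_l = [t for t in tracks_remain_h if t not in {m[0] for m in stage2}]
    dets_remain_l = [d for d in low if d not in {m[1] for m in stage2}]

    return stage1 + stage2, dets_remain_h, dets_remain_l, tracks_remain_l


# Toy example: 2 tracks, 3 detections (all values are made up).
s_iou = [[0.8, 0.1, 0.0], [0.1, 0.6, 0.0]]
s_app = [[0.9, 0.2, 0.0], [0.2, 0.7, 0.0]]
sim = combined_similarity(s_iou, s_app)
matches, dets_h, dets_l, tracks_l = smc_match(sim, scores=[0.9, 0.3, 0.05], mu=0.5)
```

In practice a polynomial-time solver would replace `best_assignment`; the surrounding bookkeeping of matched and remaining sets would stay the same.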
Table 1. Comparison against the SoTA MOT methods on the MOT17 [21] test set.
3.3. The SMC GATE Function

To calculate the similarity score, most MOT methods use a weighted sum to combine the IOU and the appearance information to improve the accuracy of data association. However, this approach can cause problems when the IOU score is significantly higher than the appearance similarity score between two distinct pedestrians, as they may only overlap, but are not the same. To address this problem, we introduce a GATE function in the SMC module to reject a target if its appearance similarity score is low, even when it comes with a high IOU score.

Due to occlusions or lighting changes, objects in $O^H_{Remain}$ with higher scores may not be matched in the current frame, but their correspondences may potentially be found in future frames. If a target in $O^H_{Remain}$ passes the GATE function check, the SMC module will generate a new tracklet and add it to $T$ for further matching. The GATE function uses a threshold $\tau$ to select objects from $O^H_{Remain}$ if their detection scores are higher than $\tau$ and include them in $T$ as new tracks for further association. Objects in $O^H_{Remain}$ with detection scores lower than $\tau$, as well as those in $O^L_{Remain}$, are considered background and filtered out. It is important to note that tracks in $T^L_{Remain}$ are deleted if they remain unmatched for more than 30 frames. This GATE function is a novel addition not present in ByteTrack [50], and it aims to re-select potential tracks from $O^H_{Remain}$ to handle challenging scenarios involving severe occlusions. Without this GATE function, ByteTrack cannot determine whether the objects to be matched are seriously occluded or not. Fig. 8 and Table 3 show the advantage of the GATE function.

4. Experimental Results

Implementation Details. Our experiments were conducted on the MOT17 and MOT20 benchmarks [21], with additional training on datasets [8, 11, 21, 30, 31, 45, 49, 54]. For re-ID models, datasets providing both bounding box location and identity information, such as CalTech [8], PRW [54], and CUHK-SYSU [45], were used. Evaluation metrics [21] included MOTA [2], IDF1 [29], and HOTA [18], highlighting detection performance and identity matching. Our detector was initialized on the COCO dataset [15] and fine-tuned on MOT datasets, employing data augmentation and an SGD optimizer with cosine annealing. The SMC module introduced a GATE function to manage new tracklets, with key parameters assessed in an ablation study. Additional details regarding the effects of track buffer, template lengths, and patch layout can be found in the supplementary.
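The track-management rules of Section 3.3 (promoting unmatched high-score detections whose score exceeds τ, and deleting tracks that stay unmatched for more than 30 frames) can be sketched as below. This is a hedged illustration rather than the authors' implementation: the `Track` class, both function names, and the example value of τ are hypothetical, and only the detection-score threshold described in the text is modelled.

```python
from dataclasses import dataclass

MAX_UNMATCHED_FRAMES = 30  # tracks unmatched for more than this are deleted


@dataclass
class Track:
    track_id: int
    frames_unmatched: int = 0


def gate_new_tracks(remain_high: dict[int, float], tau: float) -> list[int]:
    """Return ids of unmatched high-score detections that pass the GATE check.

    remain_high maps a detection id to its detection score; detections with
    a score not above tau are treated as background and dropped.
    """
    return [det_id for det_id, score in remain_high.items() if score > tau]


def prune_tracks(tracks: list[Track], matched_ids: set[int]) -> list[Track]:
    """Update unmatched counters and drop tracks that have expired."""
    kept = []
    for track in tracks:
        if track.track_id in matched_ids:
            track.frames_unmatched = 0
        else:
            track.frames_unmatched += 1
        if track.frames_unmatched <= MAX_UNMATCHED_FRAMES:
            kept.append(track)
    return kept


# Toy example; tau = 0.7 is an arbitrary illustrative value.
new_track_ids = gate_new_tracks({3: 0.9, 5: 0.55}, tau=0.7)
alive = prune_tracks([Track(1, frames_unmatched=30), Track(2, frames_unmatched=5)],
                     matched_ids={2})
```

Keeping the expiry counter on the track itself means the pruning step needs only the set of track ids matched in the current frame.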
Table 2. Comparison against the SoTA methods on the MOT20 [7] test set.
Table 4. Effects of patch layouts on performance improvement evaluated on the MOT17 val set.

Method   A     B     C     D     E     F     G     H
MOTA↑    76.0  76.1  76.2  76.4  76.4  76.3  76.4  76.4
IDF1↑    77.4  77.6  77.7  77.9  78.4  78.2  78.4  78.3
IDs↓     732   705   681   654   624   645   633   630

experiments demonstrate that SMILEtrack achieves high-performance scores in terms of MOTA, IDF1, IDs, and FPS on the MOT17 and MOT20 datasets.

Future work. As SMILEtrack is a Separate Detection and Embedding (SDE) method, it has a slower runtime compared to Joint Detection and Embedding (JDE) methods. In the future, we plan to explore approaches that can improve the efficiency versus accuracy trade-off in MOT tasks.

Table 5. Performance comparisons of similarity scores on the MOT17 validation set.
SMC Stages I and II | Gate Function | SLM-Multiple Templates | MOTA | IDF1 | IDS
SLM + IOU 76.4 78.5 624
√
SLM + IOU 76.5 78.7 601
√
SLM + IOU 76.5 78.9 585
√ √
SLM + IOU 76.6 79.2 545