
A Fine-Grained Vehicle Detection (FGVD) Dataset for Unconstrained Roads

Anonymous Author Submission, Paper Id: 83
ABSTRACT
Previous fine-grained datasets mainly focus on the classification task and are often captured in a controlled setup, with the camera focusing on the objects of interest. We introduce the first Fine-Grained Vehicle Detection (FGVD) dataset in the wild, captured from a moving camera mounted on a car. It contains 5502 images with 210 unique fine-grained labels of multiple vehicle types organized in a three-level hierarchy. While previous classification datasets also include makes for different kinds of cars, the FGVD dataset introduces new class labels for categorizing two-wheelers, autorickshaws, and trucks. The FGVD dataset is challenging, as it has vehicles in complex traffic scenarios with intra-class and inter-class variations in types, scale, pose, occlusion, and lighting conditions. Current object detection models like YOLOv5 and Faster R-CNN perform poorly on this dataset due to a high degree of class imbalance and lack of hierarchical modeling. Along with providing baseline results for existing object detectors on the FGVD dataset, we also present the results of a combination of an existing detector and the recent Hierarchical Residual Network (HRN) classifier for the FGVD task. We also show that the vehicle images from the FGVD dataset are the most challenging among the fine-grained classification datasets.

CCS CONCEPTS
• Computing methodologies → Object detection.

KEYWORDS
ACM proceedings, LATEX, text tagging

ACM Reference Format:
Anonymous Author Submission, Paper Id: 83. 2022. A Fine-Grained Vehicle Detection (FGVD) Dataset for Unconstrained Roads. In Proceedings of 13th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP'22). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

[Figure 1 contrasts vehicle-centric classification samples from Stanford Cars-196 and CompCars (top) with FGVD road-scene detections labeled at three levels (vehicle-type, manufacturer, model), e.g., Car_MarutiSuzuki_Ciaz, Motorcycle_Bajaj_Pulsar150, Scooter_Honda_Activa, and Car_Chevrolet_Tavera.]

Figure 1: Samples from fine-grained datasets. Top: Previous datasets focus only on the classification of cars in vehicle-centric images. Middle: The proposed FGVD dataset enables fine-grained (multi) vehicle detection on unconstrained road scenes captured from vehicle-mounted cameras. Bottom left: GradCAM++ visualizations on predicted crops show that the model focuses on the backlight and blinker at the top of the motorcycle and the bottom (and the right part) of the scooter. For the Tavera, the design on the left and right of the license plates, and for the Ciaz, the radiator and headlight regions are the highlighted classification features.
1 INTRODUCTION
Intelligent traffic monitoring systems are of utmost need in big cities for public security, planning, and surveillance. For tasks like vehicle re-identification and robust detection (e.g., when a vehicle occludes another vehicle that is similar in appearance), the detectors used in surveillance systems should finely classify the vehicle type, manufacturer, and model of the on-road vehicles. Conventionally, detection models like YOLO [12] and Faster R-CNN [14] are trained to classify vehicles based on the coarse categories of on-road datasets like BDD and IDD [18, 22]. Fine-Grained Vehicle Detection (FGVD) models and datasets can enable robust vehicle re-identification and detection in highly dense and occluded traffic scenarios. Therefore, we propose a novel FGVD dataset with multiple hierarchy levels for the fine-grained labels. Fig. 1 depicts a sample scene image from the FGVD dataset and the corresponding labels. As shown in the figure, in addition to enabling the detection task, the FGVD dataset includes complex intra-class and inter-class variations in types, scales, and orientations compared to previous fine-grained classification datasets. The dataset also contains challenging occlusion scenarios and lighting conditions (refer to Fig. 2). FGVD contains three levels of hierarchy, i.e., vehicle type, manufacturer, and model, as shown at the bottom of the vehicles' Regions of Interest (ROIs) in Fig. 1 (a) and in Fig. 1 (b) with three different colors.

Table 1: Related fine-grained datasets. All the previous road-scene works include fine-grained classification only on car types. In contrast, FGVD has fine-grained detection labels for six different types of vehicles (refer Fig. 2 and Table 2).

Dataset                 Source     # Levels  # Vehicle-type  Classification  Detection
BoxCars116k [15]        CCTV       3         1               ✓               ×
CompCars [21]           Web/CCTV   3         1               ✓               ×
THS-10 [10]             CCTV       2         1               ✓               ×
Stanford Cars-196 [7]   Web        2         1               ✓               ×ψ
FGVD (ours)             Dashcam    3         6               ✓               ✓

ψ Stanford Cars-196 is a part-based fine-grained car recognition dataset containing bounding boxes for cars' parts.
While previous classification datasets like Stanford Cars-196 [7] and CompCars [21] also include makes for different kinds of cars (see Fig. 1 top), we introduce complementary hierarchical labels for two-wheelers, autorickshaws, trucks, and buses (refer Sec. 3). The granularity of the FGVD dataset increases as we move from the parent level to the child level (refer Figs. 1 (b) and 4). Every level of the hierarchical FGVD dataset has its own uniqueness. Firstly, different vehicles may look similar at the first level of granularity, e.g., motorcycle and scooter, both being two-wheelers. However, as can be inferred from the bottom of Fig. 1, the overall design of the scooter is very different from that of the motorcycle; e.g., scooters have a backlight at the bottom of the rear view, as compared to the top backlight of the motorcycle. The overall appearance of two vehicles of the same parent may look even more similar; however, minute, subtle, and local differences are present within the same subcategory. Also, some categories, for example, scooter, autorickshaw, truck, and bus, were not present in earlier vehicle datasets and hence needed to be added to the fine-grained dataset. Therefore, we introduce classes unique to the FGVD dataset to facilitate detailed research in fine-grained on-road scenarios. The main contributions of this work are as follows:

• A novel Fine-Grained Vehicle Detection (FGVD) dataset for on-road vehicles in dense and occluded traffic scenarios. To the best of our knowledge, no fine-grained detection dataset exists in the literature, and ours is the first dataset of its kind.
• We present the results of baseline detection and classification models on the proposed dataset. We also show the initial results of a combination of a detector and a recent hierarchical fine-grained classification model.

2 RELATED WORK
Fine-grained Classification Datasets: Multiple fine-grained classification datasets exist in the literature that mainly focus on birds [19], flowers [11], dogs [6, 17], and aircraft [9], but not on vehicles. Datasets like the birds dataset [19] cover appropriate occlusion examples, and the aircraft dataset [9] has objects at low resolution, showcasing sufficient complexity for the classification task. But the objects in these datasets are usually located around the image center, making them unfit for detection in the wild. There are a few public datasets for fine-grained vehicle classification on road scenes, such as [15], [10], and [21], with hierarchical labels of a car's make, model, submodel, and year of manufacture. Sochor et al. [15] publish the BoxCars116k dataset with 3D bounding box annotations of 116,286 vehicle images and their fine-grained classes. Yang et al. [21] released the CompCars dataset, which contains 44,481 frontal-view images of cars taken from surveillance cameras. It also contains web-nature images taken from different viewpoints of 136K vehicles classified into 600 categories. Najeeb et al. [10] released 4250 CCTV car images of 10 different models. Similar to this, the Stanford Cars dataset [7] includes vehicle images along with their part labels to assist in fine-grained recognition tasks. The above-mentioned datasets usually lack the complexity that is required for their application in real-world scenarios. Our detection dataset includes diverse traffic scenarios observed in urban settings, capturing large variations in scale, pose, occlusion, illumination, and density of vehicles. All the vehicle classification datasets mentioned above have fine-grained labels only for cars, whereas in our dataset we introduce these labels for four additional vehicle types: motorcycles, scooters, autorickshaws, and trucks. Also, the proposed FGVD dataset contains images captured from dashboard cameras installed on top of surveillance vehicles instead of static CCTV cameras, making its use economically viable and sustainable for road safety in any remote location of the city. Datasets like BDD100k [22], Waymo [16], and IDD [18] do focus on vehicle detection but do not have fine-grained annotations.

Detection and Fine-grained Classification Models: State-of-the-art object detection models like YOLOv5 [5] and Faster-RCNN [14] consider all object labels as independent from each other. So, in effect, they cannot use the hierarchical relationship that exists between an object's fine-grained labels. The difficulty of detecting fine-grained objects with such models would thus increase with deeper class definitions, as the number of samples per class becomes smaller and the visual cues become more challenging to detect. Chen et al. [2] recently introduced the Label Relation Graphs Enhanced Hierarchical Residual Network (HRN), which gives state-of-the-art performance on fine-grained image classification datasets. Their architecture exploits the parent-child correlation between labels by transferring hierarchical knowledge through residual connections across feature levels. However, it has not been used in the context of object detection. Therefore, we combine object detection and classification models and obtain superior performance on the fine-grained vehicle detection task, as compared to using the object detector directly for it.
[Figure 2 shows sample images for each category: Car / Maruti Suzuki / Dzire; Motorcycle / Hero / Splendor; Scooter / Honda / Activa; Truck / Eicher; Autorickshaw / Bajaj.]

Figure 2: Sample images of different categories in FGVD exhibiting inter-class similarities (bottom two rows), multiple vehicle orientations (all rows), and frequent occlusions.

[Figure 3 shows four consecutive frames: (a) back-propagated label frame, (b) identifiable label frame, (c) forward-propagated label frame, (d) forward-propagated label frame.]

Figure 3: Annotation Strategy: (b) is the only confidently identifiable image frame where the label cue (logo at bottom-left of the car) is visible. As the frames are from a common video sequence, the label is propagated to the vehicle instances in (a), (c) & (d).
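Stated procedurally, the strategy in Fig. 3 propagates a label that is confidently identified in one frame to the same physical vehicle in neighboring frames of the video sequence. A minimal sketch follows; the track representation is our assumption for illustration, since the actual propagation was performed manually by annotators.

```python
def propagate_label(track):
    """track: per-frame records of one physical vehicle across a sequence,
    e.g. [{"frame": 0, "label": None},
          {"frame": 1, "label": "Car_Tata_Sumo"},  # logo visible here
          {"frame": 2, "label": None}].
    Copies the first confidently identified label to all other frames."""
    confident = next((r["label"] for r in track if r["label"] is not None), None)
    if confident is not None:
        for r in track:
            r["label"] = confident
    return track
```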
3 FGVD DATASET AND ANNOTATION
We use images with corresponding coarse labels and bounding boxes from the IDD detection dataset [18]. The vehicles far from the camera are infeasible to annotate. Hence, we remove the bounding boxes with a height-to-width ratio less than a threshold, which differs for different vehicle types. To account for the variability of vehicles in physical dimensions, we keep the threshold for the truck's bounding-box ratio higher than that of the car, which in turn is higher than that of the bike. The thresholding process makes the annotators' work easy, manageable, and quick. The FGVD dataset contains 5502 scene images with around 24450 bounding boxes and 217 (210 unique, and 7 repeated from higher levels) fine-grained labels at the third level.
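The filtering rule above can be summarized as follows; the box format and the numeric thresholds are illustrative assumptions, since the paper only fixes their ordering (truck above car, car above bike), not the actual values.

```python
# Assumed per-type thresholds on the height-to-width ratio; only their
# ordering (truck > car > motorcycle) is stated in the text.
RATIO_THRESHOLDS = {"truck": 0.9, "car": 0.6, "motorcycle": 0.4}

def keep_box(vehicle_type, box):
    """Drop boxes of far-away vehicles whose height-to-width ratio falls
    below the threshold for their vehicle type."""
    _, _, w, h = box  # assumed (x, y, width, height) in pixels
    return (h / w) >= RATIO_THRESHOLDS.get(vehicle_type, 0.5)

boxes = [("truck", (10, 20, 80, 90)), ("car", (200, 150, 40, 12))]
kept = [(t, b) for t, b in boxes if keep_box(t, b)]  # the small car box is dropped
```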

3.1 Annotation Process
We chose 5502 out of 16311 high-quality images from the IDD-Detection dataset based on occlusion, the size of vehicle boxes, and traffic density. The annotation team consists of four highly skilled annotators and two expert reviewers for quality checks. Firstly, we taught the annotators about fine-grained annotation by annotating a few sample images from the FGVD dataset. Secondly, we provided the guidelines, a template, and a list of objects to be annotated. After proper training, the annotators can recognize the popular vehicles in a scene. However, if a vehicle is still not recognizable, they can use Google Lens or image search on the internet. For example, consider the scenario where the annotator can recognize the manufacturer by looking at the brand logo, but the model name is not visible due to occlusion, truncation, or any other complexity. In such scenarios, the annotator can search for similar vehicles on the manufacturer's website. The essential cues for classification are the vehicle's overall design, the design of its components (e.g., some scooters have petrol openers at the back), the brand logo, and the model name. While creating the IDD-Detection dataset [18], many images were taken from continuous video frames; therefore, the images have a temporal connection. For instance, in Fig. 3, the Tata Sumo is not confidently identifiable in the first frame due to truncation, but in the second frame, the brand label is visible, which gives the annotator the confidence to annotate even when they are unable to recognize the design of the car. Similarly, they propagate the label to the next frame, in which there is a lot of occlusion from other vehicles and which would be comparatively hard to annotate without connecting the knowledge from the other frames.

We divided the data creation process into two steps, i.e., the pilot phase and the takeoff phase. In the pilot phase, each annotator is asked to label a small number of images containing low traffic density. We also train the annotators to label vehicles in the dataset images which are confidently identifiable but do not have a bounding box. In the case of cars, motorcycles, and scooters, we have an attribute called new. Whenever annotators encounter a new variant of any model in our dataset, they tick the new attribute in the checkbox. During the pilot phase, we note the average time to annotate one image, from which we estimate the number of days needed to create the whole dataset. Performance in the pilot phase lets us choose the best two annotators as reviewers in the next stage. In the takeoff phase, many images contain high traffic density, which in turn causes massive occlusion (samples shown in Fig. 2). All the labels in the takeoff stage are reviewed, and only the ones with high confidence from the reviewers are chosen to be part of the dataset. The remaining vehicles' bounding boxes, for which any fine-grained level is ambiguous, are labeled as "others". If any material like a vehicle cover or a cloth covers a vehicle, they mark it as "covered". The average time to annotate one image is one minute, and half a minute to perform a quality check. We annotate all the scenes in the FGVD dataset using the Computer Vision Annotation Tool (CVAT)¹.

¹ https://github.com/openvinotoolkit/cvat

Table 2: Levels of Hierarchy for different Vehicles in FGVD.

Vehicle Type   Levels of Hierarchy  L-2 labels  L-3 labels
Car            3                    22          112
Motorcycle     3                    11          67
Scooter        3                    9           23
Truck          2                    7           7
Autorickshaw   2                    6           6
Bus            2                    2           2
Total          3                    57          217

[Figure 4 shows part of the hierarchy tree, e.g., Car → Mahindra → {Bolero, Scorpio, XUV500}; Car → Maruti-Suzuki → {Dzire, Swift, Baleno}; Motorcycle → Hero → {Splendor, HF Deluxe, Glamour}; Scooter → Honda → {Activa, Dio}; Autorickshaw → {Bajaj, TVS, Piaggio}; Truck → {Eicher, Mahindra, Tata}.]

Figure 4: Sample Hierarchy Tree of the FGVD dataset.

[Figure 5 plots the number of vehicles per Level-1 label (vehicle types) and per Level-2 label (manufacturers), both long-tailed.]

Figure 5: Histograms for Level-1 (top) and Level-2 (bottom) labels in FGVD.

3.2 Hierarchy and Long-tailed Distribution
As shown in Table 2, each vehicle type in the proposed Fine-Grained Vehicle Detection (FGVD) dataset has different hierarchical levels. We illustrate the sample vehicle images and the hierarchy tree of the FGVD dataset in Figs. 2 and 4.

As shown, the FGVD contains three different levels of hierarchy, which we detail below²:

• Vehicle-type: The coarsest labels of a vehicle come under the vehicle-type category. We consider it level 1 of the hierarchy. Car, motorcycle, scooter, truck, autorickshaw, and bus are the six categories present at this level.
• Manufacturer: The manufacturer level contains the primary producer of the vehicle. This category has finer detail than the vehicle-type level. A producer may manufacture multiple kinds of vehicles; for example, Bajaj manufactures motorcycles as well as autorickshaws.
• Model: The model level is the last group of the hierarchy. This level comprises highly fine-grained features that are unique to the variant; for example, a car model's design is unique to its manufacturer.

² For the bus class, we do not follow this hierarchy. Small buses, known as "mini-bus", and the "general-bus" are the child classes for the bus.

As illustrated in Fig. 5, the annotation levels exhibit the common challenge of class imbalance, to different degrees; a small illustration of the label structure and such level-wise counts follows.
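For illustration, a three-level FGVD label can be held as a (vehicle type, manufacturer, model) tuple, from which per-level histograms like those of Fig. 5 are simple counts. The class names below come from Fig. 4, while the in-memory form is our own convenience, not a prescribed format.

```python
from collections import Counter

# Three-level labels as (L-1, L-2, L-3) tuples; names taken from Fig. 4.
labels = [
    ("Car", "Mahindra", "Bolero"),
    ("Car", "MarutiSuzuki", "Dzire"),
    ("Motorcycle", "Hero", "Splendor"),
    ("Scooter", "Honda", "Activa"),
    ("Scooter", "Honda", "Activa"),
]

# Level-wise counts; the long tail appears as many rare level-3 classes.
l1_hist = Counter(l1 for l1, _, _ in labels)
l2_hist = Counter((l1, l2) for l1, l2, _ in labels)
l3_hist = Counter(labels)
print(l1_hist)  # Counter({'Car': 2, 'Scooter': 2, 'Motorcycle': 1})
```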
4 METHODOLOGY
Fine-Grained Vehicle Detection (FGVD) aims to localize the vehicles in an on-road scene image and identify their type, manufacturer, and model variant. We accomplish this in two stages: the first stage involves vehicle localization, and the second involves fine-grained classification of the localized object. The entire pipeline is shown in Fig. 6. In the localization stage, we use the YOLOv5 [5] model, which gives us the vehicle bounding boxes. We then crop out the vehicles' Regions Of Interest (ROIs) from the original image using the bounding boxes obtained in the localization stage. The cropped ROIs are then resized before feeding them to the classification module. In the classification module, we use the Label Relation Graphs Enhanced Hierarchical Residual Network (HRN) model [2], which predicts the coarse-to-fine-grained classes for the ROIs. (A minimal sketch of this two-stage pipeline appears after Sec. 4.1.)

[Figure 6 shows the detection stage feeding cropped vehicles to a Hierarchical Residual Network built from a trunk net and Granularity Specific Blocks (GSB, each a 1x1 and 3x3 convolution with ReLU), with sigmoid FC heads for, e.g., Car (L1) and Hyundai (L2), and a softmax FC head for Xcent (L3).]

Figure 6: Architecture for fine-grained vehicle detection. We use YOLO [5] for localization and Label Relation Graphs Enhanced HRN [2] for classification.

4.1 Vehicle Localization
We train the YOLOv5 model to localize the vehicles in the FGVD dataset³. There are various reasons for choosing the YOLOv5 model for vehicle localization. Firstly, YOLOv5 incorporates the Cross Stage Partial Network (CSPNet) [20] into its backbone and neck. CSPNet helps achieve a richer gradient combination while reducing the amount of computation, which keeps the inference speed and accuracy high and reduces the model size. Moreover, the HRN's classification accuracy depends on the localization model's performance; thus, maintaining high accuracy for vehicle localization is essential. Secondly, the head of YOLOv5 generates feature maps at three different sizes (18×18, 36×36, 72×72) to achieve multi-scale [13] prediction, enabling the model to handle small, medium, and large-sized objects. YOLOv5 also auto-learns custom anchor boxes, so the anchors are adapted to our FGVD dataset, which helps improve the detection results. It also incorporates various augmentations, such as mosaic, during training, which significantly helps generalization. Moreover, we experiment with Faster-RCNN [14], but we obtain the best results with YOLOv5 (refer to Sec. 6), while it also takes the least training time.

³ For training the HRN model, we prepare the dataset separately by cropping out vehicle images using the ground truth boxes from our dataset.
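As a rough illustration of the two-stage pipeline described in this section, the sketch below chains a YOLOv5 detector loaded via torch.hub with a placeholder classifier. `hrn_classify` stands in for the actual HRN forward pass, the generic pretrained weights stand in for the FGVD-fine-tuned checkpoint, and the crop/resize size follows the settings reported in Sec. 5.

```python
import torch
from PIL import Image

# Stage 1: localization. The torch.hub entry point is provided by the
# ultralytics/yolov5 repository; the checkpoint here is illustrative.
detector = torch.hub.load("ultralytics/yolov5", "yolov5l", pretrained=True)

def hrn_classify(roi):
    """Placeholder for the HRN forward pass; it would return the predicted
    (vehicle type, manufacturer, model) for one cropped vehicle."""
    raise NotImplementedError

def fgvd_pipeline(image_path):
    image = Image.open(image_path).convert("RGB")
    detections = detector(image)  # rows: x1, y1, x2, y2, confidence, class
    results = []
    for x1, y1, x2, y2, conf, _cls in detections.xyxy[0].tolist():
        # Stage 2: crop the ROI and resize to the HRN input size (Sec. 5).
        roi = image.crop((int(x1), int(y1), int(x2), int(y2))).resize((448, 448))
        results.append(((x1, y1, x2, y2), conf, hrn_classify(roi)))
    return results
```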
4.2 Fine-grained Classification
For several reasons, we chose the Label Relation Graphs Enhanced HRN network [2] for fine-grained vehicle classification. The HRN network focuses on encoding the label hierarchy from coarse to fine levels. The HRN accomplishes this by using the Granularity Specific Blocks (GSB) and residual connections, as shown in Fig. 6. Each GSB block extracts the features of a hierarchical level by processing the feature maps generated from the trunk network, i.e., any common feature-extraction network pre-trained on ImageNet [3]. The residual connections combine the features of coarse-level and fine-level subclasses. This kind of hierarchical modeling primarily benefits the FGVD application on our dataset because there are many similarities between different vehicle model variants corresponding to the same manufacturer or vehicle type. The HRN model incorporates a combinatorial loss which aggregates information from related labels defined in the tree hierarchy. This tree hierarchy uses a sigmoid node for each label, which can be seen in Fig. 6 for the L-1, L-2, and L-3 outputs. The HRN models independent relations using sigmoid instead of softmax, since softmax implies mutual exclusion. But if the training samples at the fine-grained levels are few, the combinatorial loss would fail to separate the skewed classes well. So, an additional multi-class cross-entropy loss is used with the softmax function for the finest labels, depicted for the L-3 class output in Fig. 6. The softmax function increases the weightage of the fine-grained classification loss, ultimately ensuring high classification accuracy, specifically for the fine-grained labels. Moreover, the residual connections in the HRN for hierarchical feature interactions make the architecture effective compared to other models, while demonstrating state-of-the-art performance on standard fine-grained classification datasets.

We also create a hierarchical tree structure of labels for each of our fine-grained classes in the format required by the HRN architecture. It is important to note that we use the softmax output for the fine-grained class instead of the sigmoid output in the HRN model at inference time. As mentioned by Chen et al. [2], the softmax output channel computes a separate cross-entropy loss so that the mutually exclusive fine-grained classes gain more attention during training.

5 EXPERIMENTS
We split our entire dataset into a train:val:test ratio of 64:16:20. We use the YOLOv5l [5] model pre-trained on the COCO dataset [8] and fine-tune it on the FGVD dataset for 100 epochs with a batch size of 8. The input images are pre-processed and resized to 640 × 640 pixels before feeding them to the training pipeline. While training, we observe that the objectness loss on the validation set reaches its lowest point after a few epochs and then starts diverging. To resolve this, we reduce the contribution of the objectness loss to the overall loss function by half. We train the HRN model on the ground-truth Regions of Interest (ROIs) and all three FGVD levels. We use the ResNet-50 [4] model, pre-trained on ImageNet [3], for the trunk net in the HRN architecture. The input image size used here is 448×448 pixels. We train the HRN model for 100 epochs with a batch size of 8 and an initial learning rate of 0.001. At test time, we use the trained YOLOv5l model for predicting the vehicle bounding boxes and then use the predictions to crop the ROIs for HRN's input.

We also experiment with two baseline detectors for the FGVD task. Firstly, we train the Faster-RCNN model on the FGVD's level-3 labels for 100 epochs with a batch size of 8 and an initial learning rate of 0.001. We use the Faster-RCNN with the ResNet-50 backbone, and the model is pre-trained on the COCO detection dataset while taking an input image size of 512×512. Secondly, for comparison with the Faster-RCNN detector, we train a similarly sized YOLOv5-large variant on the level-3 labels for 100 epochs with a batch size of 16. We train the YOLOv5l model with the same hyperparameter configuration and the same pre-trained model as used in the first experiment explained above.

We evaluate the performance of our models using the mean Average Precision (mAP) metric on all three levels. For the baselines, we derive the mAPs for all the levels from the combined labels (sample combined labels are shown below the zoomed-out vehicle crops in Fig. 1 (a)) on which the detectors are trained. We use a GeForce RTX 3080 Ti GPU for all our experiments. We present the results of the above experiments in the next section.
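Concretely, since the baseline detectors are trained on combined labels such as Car_MarutiSuzuki_Ciaz, their per-level predictions for the L-1/L-2/L-3 mAPs can be recovered by splitting on the underscore before scoring each level. The separator is inferred from the labels shown in Fig. 1, so treat this helper as a sketch.

```python
def split_levels(combined):
    """'Car_MarutiSuzuki_Ciaz' -> per-level labels for L-1, L-2, and L-3."""
    parts = combined.split("_")
    l1 = parts[0]                                           # vehicle type
    l2 = "_".join(parts[:2]) if len(parts) >= 2 else None   # + manufacturer
    l3 = combined if len(parts) >= 3 else None              # + model
    return l1, l2, l3

print(split_levels("Car_MarutiSuzuki_Ciaz"))
# ('Car', 'Car_MarutiSuzuki', 'Car_MarutiSuzuki_Ciaz')
```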
Table 3: FGVD detection results: The combination of YOLOv5l and HRN significantly improves the mAPs on all levels compared to existing detectors.

Model           L-1 mAP  L-2 mAP  L-3 mAP
F-RCNN          54.4%    41.4%    31.9%
YOLOv5l         61.7%    42.4%    32.7%
YOLOv5l + HRN   84.0%    45.0%    48.4%

6 RESULTS
We show the results of our detection experiments in Table 3. As shown in its first row, with the Faster-RCNN model, we obtain mAPs of 54.4%, 41.4%, and 31.9% on the level 1 (L-1), level 2 (L-2), and level 3 (L-3) labels, respectively. The YOLOv5l model obtains mAPs of 61.7%, 42.4%, and 32.7% on the three levels. We observe that the YOLOv5l detector performs significantly better than the Faster-RCNN model on all the hierarchical levels. We thus choose the YOLOv5l model with HRN for the final experiment. Using the YOLOv5l+HRN model, the mAP scores at the L-1 and L-2 levels improve by 22.3% and 2.6%, respectively. We obtain an mAP of 48.4% at the L-3 level, which is a substantial performance boost of 15.7% compared to the standard YOLOv5l model.

We showcase the YOLOv5l+HRN model's detection results on sample images from the FGVD test set in Fig. 7. The classified fine-grained labels from the YOLOv5l+HRN model are shown alongside the predicted bounding boxes. The correctly classified predictions are shown in green, and the incorrect predictions in red. The figure depicts our model's performance on images containing vehicles at multiple viewpoints and resolutions, with occlusions and low-visibility scenarios. It can be observed that the model gives correct predictions for occluded vehicles with low visibility, like the Hyundai Santro car in the left image of Fig. 7. Similarly, the Honda Shine motorcycle on the left side of the right image of the same figure is classified correctly at all three levels. We also showcase an occlusion scenario where the model gives an incorrect prediction.

[Figure 7 contains two annotated scenes with predictions such as Scooter Honda Activa, Car Hyundai Santro, Motorcycle Honda Shine, Scooter Suzuki Access, Car Toyota Etios, Motorcycle Bajaj Pulsar150, Car MarutiSuzuki Celerio, Scooter Honda Dio, Scooter TVS Pep, and one error: GT Car Hyundai Creta predicted as Car Hyundai Santro.]

Figure 7: FGVD results of the YOLOv5l+HRN model on sample scene images (crops shown for better visibility) from the test dataset. The detected bounding boxes are marked in green and red, corresponding to correct and incorrect class predictions. The zoomed-in views highlight the vehicles which are occluded or have low visibility.

The right image in Fig. 7 shows that, for the Hyundai Creta car (ground truth), our model gives an incorrect prediction of Hyundai Santro, highlighted in the red box. The zoomed-out views with adjusted brightness show the actual vehicle images clearly. For this incorrect detection, we can observe that the model detects the vehicle type and manufacturer correctly, even in the high-occlusion scenario.

[Figure 8 compares ROIs grouped by L1 (Vehicle Type): Scooter, Motorcycle, Car; L2 (Manufacturer): TVS, Suzuki, RoyalEnfield, Bajaj, TataMotors, Toyota, MarutiSuzuki; L3 (Model): Jupiter, Access, Classic350, ApacheRTR160, Avenger, Indica, Innova, Dzire.]

Figure 8: Rows 1-3: correct predictions/labels obtained from YOLO+HRN, with ROIs in row 4. GradCAM++ visualizations at the L1, L2, and L3 levels (rows 5-8) from the HRN network, when applied on the row 4 ROIs. Multiple orientations of scooters, motorcycles, and cars are compared. From Row 4, Col 2: two front-view images of scooter models, followed by rear-view images of the same models. Row 4, Col 6: two side-view motorcycle model images followed by two rear-view images. Row 4, Col 10: two side-view and two back-view images of different car models. The GradCAM++ visualizations highlight the distinguishing features for each ROI, like the front, side, and rear body structures, backlight/blinkers (most vehicles), silencer (Access rear-view and Classic 350), petrol tank (Jupiter rear-view), door handles, and engine shape for the corresponding vehicles.

In Fig. 8, we showcase the GradCAM++ [1] visualizations of the detected fine-grained vehicles from the YOLO+HRN model. On the figure's left side, we compare two front-view scooter images. The L1 GradCAM++ heatmap of the Jupiter model shows the model's focus on most of the front body parts along with its blinkers. The corresponding heatmap for the Access model also shows focus on the front body part along with the mudguard region. While the L3 visualizations focus on very particular attributes, for the Jupiter model the focus is centered around the air-vent part above the mudguard, and for the Access model, it is on the shape of the front body part. Similarly, we compare the rear views of these vehicle models. The L1 heatmap of the Jupiter model shows a major focus on the rear body part with some highlights (in green) on its speedometer display. In contrast, for the Access sample, the primary focus is on the back body part, with some highlights around the silencer, speedometer display, and the key region.
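Heatmaps like those in Figs. 8 and 9 can be produced with a standard GradCAM++ implementation. The sketch below uses the pytorch-grad-cam package; the target-layer path and the wrapper model (assumed to expose one hierarchy level's logits through forward()) are our assumptions about how the HRN heads would be exposed, not the authors' code.

```python
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

# `model` is assumed to wrap the HRN so forward() returns the logits of one
# level (L1, L2, or L3); `roi_tensor` is a normalized 1x3x448x448 crop and
# `roi_float` the same crop as an HxWx3 float image in [0, 1].
target_layers = [model.trunk.layer4[-1]]  # last ResNet-50 block (assumed path)
cam = GradCAMPlusPlus(model=model, target_layers=target_layers)

grayscale_cam = cam(input_tensor=roi_tensor,
                    targets=[ClassifierOutputTarget(predicted_class_idx)])
overlay = show_cam_on_image(roi_float, grayscale_cam[0], use_rgb=True)
```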
Overall, the L1 heatmaps show the model's focus on most of the crucial body parts of the vehicle. In the L3 visualizations of the Jupiter model, the focus is centered around the petrol-tank lid area just above the backlight, and for the Access, it is primarily centered around the backlight region, with some highlights on the silencer part as well. These features highlighted by the L3 visualizations are the main feature points that help distinguish the two vehicle models.

We have made similar comparisons between motorcycles and cars. We first compare the side views of two models and then the rear views of the other two models. The L1 heatmap of the Classic350 model's side view shows the model's focus primarily around the suspension area just beside the triangular engine boundary, with some focus on the petrol tank, silencer, and the engine area at the bottom, while the L1 visualizations of the ApacheRTR160 TVS motorcycle focus on the vehicle's back body part, the back wheel's central area, and the front wheel. The L1 visualizations cover all the important key points that help identify a motorcycle, along with providing additional feature points for the classification at finer levels. The L2 visualizations highlight regions very similar to the L3 visualizations, which ultimately help identify the correct L3 label. The L3 visualizations of the Classic350 model show the main focus on the suspension region and some on the silencer. For the ApacheRTR160, the main focus is around the back body part. These focus points correspond to the unique feature points which are specific to that vehicle model only. Next, we compare the rear views of the Classic350 and Avenger models. These models' L1 and L3 heatmaps do not show much difference, since only the back region of these vehicles is visible, where the number of critical key points is limited. The main distinguishing key point highlighted in these heatmaps is around the backlight region, and a little bit on the side body part, which is partially visible.

Similarly, we compare the side views of cars. The L1 and L3 visualizations of the side views of the Indica and Innova models look nearly similar. For the Indica, the door handle and the front window pillar regions are primarily focused, while the entire side body part is also highlighted in green. For the Innova, the door handles and the window edges are primarily focused. Similarly, the rear views of the Innova and Dzire show a primary focus on the shape and structure of the back body part, along with an emphasis on the backlight region, as visible in the Dzire model. Again, due to the limited number of key points, we suspect that the highlighted parts at all the levels are similar, highlighting the entire overlapping regions considered for the L3 classification. Overall, we showcase the hierarchical relationship between the coarse-level and fine-level labels and the overlap of similar features among them. Fig. 9 shows the failure cases for similar vehicles in extreme illumination/occlusion scenarios.

[Figure 9 shows three failure cases: GT Scooter Honda Dio predicted as Scooter TVS Wego; GT Car Toyota Innova predicted as Car Maruti-Suzuki Ertiga; GT Scooter TVS Jupiter predicted as Scooter Suzuki Access, each with L1/L2/L3 heatmaps and a sample image of the predicted label.]

Figure 9: Failure cases with their GradCAM++ visualizations for the L1, L2, and L3 levels, along with images of an incorrectly predicted vehicle model for comparison (on the right) with the actual input image (on the left). Top and Bottom: Examples of high similarity in visual cues between the predicted and the actual classes. Mid: extreme illumination with occlusion from multiple objects.

We also evaluate the L1, L2, and L3 labels' classification accuracies using the fine-tuned HRN model on our test dataset and compare them with the corresponding state-of-the-art accuracies on standard fine-grained classification datasets, as reported in the paper [2]. We find that the Level-3 classification accuracy on our dataset is the lowest among them, i.e., 76.35%, as shown in Table 4. The results quantitatively demonstrate the high complexity involved in the FGVD dataset and the challenges involved in the classification and detection of fine-grained vehicles in the wild. Although the YOLOv5l+HRN model performs better than the other baseline detectors, there is still a massive gap of around 35% in mAP between the coarse-level and fine-grained-level detection performances in Table 3. Thus, we propose this dataset for future research works related to FGVD.

Table 4: Recognition results of HRN on standard fine-grained classification datasets and FGVD. For all three levels, the FGVD dataset remains the most challenging.

Dataset             L-1 Acc.  L-2 Acc.  L-3 Acc.
Aircraft [9]        97.45%    95.79%    92.58%
CUB-200-2011 [19]   98.67%    95.51%    86.60%
Stanford Cars [7]   97.41%    94.03%    Not Applicable
FGVD dataset        96.69%    79.44%    76.35%

7 CONCLUSION
This paper presents the first dataset for the Fine-Grained Detection of Vehicles, while also providing baselines for the same. FGVD is also the first dataset in which fine-grained labels for 5 additional vehicle types are available, apart from just cars. Our dataset can be used for vehicle re-identification in on-road surveillance systems, generating alarms for road safety systems, and promoting the development of ADAS products for Indian roads. Specifically, we showcase the uniqueness of our dataset regarding the detection of fine-grained vehicles in the wild compared to other related datasets. We also provide the results of the YOLOv5l+HRN model for the FGVD dataset, with which we obtain a 15.7% gain in mAP over the baseline detectors at the most complex level. For future work, we plan to fuse the architectures of the detection and classification models to reduce the overall time complexity and further improve the current model's performance.
REFERENCES
[1] Aditya Chattopadhyay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. 2017. Grad-CAM++: Generalized Gradient-based Visual Explanations for Deep Convolutional Networks. CoRR abs/1710.11063 (2017). arXiv:1710.11063 http://arxiv.org/abs/1710.11063
[2] Jingzhou Chen, Peng Wang, Jian Liu, and Yuntao Qian. 2022. Label Relation Graphs Enhanced Hierarchical Residual Network for Hierarchical Multi-Granularity Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4858–4867.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248–255. https://doi.org/10.1109/CVPR.2009.5206848
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385 http://arxiv.org/abs/1512.03385
[5] Glenn Jocher, Alex Stoken, Jirka Borovec, NanoCode012, ChristopherSTAN, Liu Changyu, Laughing, tkianai, Adam Hogan, lorenzomammana, yxNONG, AlexWang1900, Laurentiu Diaconu, Marc, wanghaoyang0106, ml5ah, Doug, Francisco Ingham, Frederik, Guilhen, Hatovix, Jake Poznanski, Jiacong Fang, Lijun Yu, changyu98, Mingyu Wang, Naman Gupta, Osama Akhtar, PetrDvoracek, and Prashant Rai. 2020. ultralytics/yolov5: v3.1 - Bug Fixes and Performance Improvements. https://doi.org/10.5281/zenodo.4154370
[6] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. 2011. Novel Dataset for Fine-Grained Image Categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO.
[7] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 2013. 3D Object Representations for Fine-Grained Categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13). Sydney, Australia.
[8] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312 (2014). arXiv:1405.0312 http://arxiv.org/abs/1405.0312
[9] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013).
[10] Syeda Aneeba Najeeb, Rana Hammad Raza, Adeel Yusuf, and Zamra Sultan. 2022. Fine-grained vehicle classification in urban traffic scenes using deep learning. In Proceedings of the 11th International Conference on Robotics, Vision, Signal Processing and Power Applications. Springer, 902–908.
[11] Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 722–729.
[12] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. (04 2018).
[13] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. CoRR abs/1804.02767 (2018). arXiv:1804.02767 http://arxiv.org/abs/1804.02767
[14] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (Montreal, Canada) (NIPS'15). MIT Press, Cambridge, MA, USA, 91–99.
[15] J. Sochor, J. Špaňhel, and A. Herout. 2018. BoxCars: Improving Fine-Grained Recognition of Vehicles Using 3-D Bounding Boxes in Traffic Surveillance. IEEE Transactions on Intelligent Transportation Systems PP, 99 (2018), 1–12. https://doi.org/10.1109/TITS.2018.2799228
[16] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. 2020. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. 2015. Building a Bird Recognition App and Large Scale Dataset With Citizen Scientists: The Fine Print in Fine-Grained Dataset Collection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and CV Jawahar. 2019. IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1743–1751.
[19] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. 2011. The Caltech-UCSD Birds-200-2011 Dataset. (2011).
[20] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, Yueh-Hua Wu, Ping-Yang Chen, and Jun-Wei Hsieh. 2019. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. CoRR abs/1911.11929 (2019). arXiv:1911.11929 http://arxiv.org/abs/1911.11929
[21] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2015. A Large-Scale Car Dataset for Fine-Grained Categorization and Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. 2020. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2636–2645.
