
OBJECT DETECTION USING

ENSEMBLE TECHNIQUE
A MINOR PROJECT
REPORT

Submitted by

Syeda Reeha Quasar (14114802719)
Rishika Sharma (20514802719)
Aayushi Mittal (21114802719)

BACHELOR OF TECHNOLOGY IN

COMPUTER SCIENCE AND ENGINEERING

Under the Guidance of

Mr. Moolchand Sharma (Assistant Professor, CSE)
Ms. Prerna Sharma (Assistant Professor, CSE)

Department of Computer Science and Engineering

Maharaja Agrasen Institute of Technology,


PSP area, Sector – 22, Rohini, New Delhi – 110085
(Affiliated to Guru Gobind Singh Indraprastha University, New Delhi)
(DEC 2022)
MAHARAJA AGRASEN INSTITUTE OF
TECHNOLOGY
Department of Computer Science and Engineering

CERTIFICATE

This is to certify that this MINOR project report “OBJECT DETECTION USING ENSEMBLE TECHNIQUE” is submitted by Syeda Reeha Quasar (14114802719), Rishika Sharma (20514802719), and Aayushi Mittal (21114802719), who carried out the project work under my supervision. I approve this MINOR project for submission.

Prof. Namita Gupta (HoD, CSE)
Mr. Moolchand Sharma (Assistant Professor, CSE)
Ms. Prerna Sharma (Assistant Professor, CSE)

ABSTRACT

Automated cars are being developed by leading automakers, and this technology is expected to revolutionize how people experience transportation, giving them more options and convenience. The implementation of traffic signal detection and recognition systems in automated vehicles can help reduce fatalities from traffic mishaps, improve road safety and efficiency, decrease traffic congestion, and help reduce air pollution. Once traffic signs and lights are detected, the system can alert the driver to any upcoming signs or lights, and the driver can then take the necessary actions to ensure a safe journey. We propose a method for Traffic Sign Detection and Recognition using ensemble techniques on four models, namely BEiT, YOLOv5, Faster R-CNN, and sequential CNN. Current research focuses on traffic sign detection using individual models such as CNN. To further boost object detection accuracy, our proposed approach uses a combination of the average, AND, OR, and weighted-fusion strategies to combine the outputs of the different ensembles. The testing in this project uses the German Traffic Sign Recognition Benchmark (GTSRB), Belgium Traffic Sign image data, and Road Sign Detection datasets. Compared to the individual object detection models, the ensemble of these models increased the model accuracy to 99.54%, with a validation accuracy of 99.74% and a test accuracy of 99.34%.

ACKNOWLEDGEMENT

It gives me immense pleasure to express my deepest sense of gratitude and sincere thanks to my respected guides, Mr. Moolchand Sharma (Assistant Professor, CSE) and Ms. Prerna Sharma (Assistant Professor, CSE), MAIT Delhi, for their valuable guidance, encouragement, and help in completing this work. Their useful suggestions for this whole work and their cooperative behavior are sincerely acknowledged.

I am also grateful to my teachers for their constant support and guidance.

I also wish to express my indebtedness to my parents as well as my family members, whose blessings and support have always helped me face the challenges ahead.

Place: Delhi Syeda Reeha Quasar (14114802719)

Date: Rishika Sharma (20514802719)

Aayushi Mittal (21114802719)

TABLE OF CONTENTS

1 INTRODUCTION.........................................................................................................................................9

2 LITERATURE SURVEY............................................................................................................................12

3 RESEARCH/APPROACH..........................................................................................................................15

3.1 Existing Methods..................................................................................................................................15

3.1.1 Bidirectional Encoder representation from Image Transformers (BEiT).......................................15

3.1.2 You Only Look Once (YOLOv5).................................................................................................15

3.1.3 Sequential Convolutional Neural Networks..................................................................................16

3.1.4 Faster Region Based Convolutional Neural Network (Faster R-CNN)........................................17

3.1.5 Ensemble Technique.....................................................................................................................18

3.2 Proposed Methodology.........................................................................................................................19

3.3 Research Approach...............................................................................................................................20

3.3.1 Dataset...........................................................................................................................................20

3.3.2 Training.........................................................................................................................................21

3.3.3 Testing...........................................................................................................................................21

3.3.4 Prediction Ensembling..................................................................................................................21

4 RESULTS....................................................................................................................................................23

4.1 Evaluation.............................................................................................................................................23

4.2 IOU - Intersection Over Union.............................................................................................................23

4.3 Precision X Recall Curve.....................................................................................................................24

4.4 Average Precision.................................................................................................................................25

4.5 Recorded Metrics..................................................................................................................................26

5 CONCLUSION............................................................................................................................................29

6 SUMMARY.................................................................................................................................................29

7 FUTURE SCOPE........................................................................................................................................29

8 REFERENCES............................................................................................................................................30

LIST OF TABLES

Table 1. Sequential model layers specification...................................................................................................17


Table 2. Individual Model Metrics......................................................................................................................26
Table 3. Ensembled Model Result......................................................................................................................26

LIST OF FIGURES

Figure 1. Architecture of BEiT.............................................................................................................15


Figure 2. The architecture of YOLO...................................................................................................................16
Figure 3. Architecture of the sequential CNN......................................................................................16
Figure 4. The architecture of Faster R-CNN.......................................................................................................18
Figure 5. Architecture of the proposed system...................................................................................................20
Figure 6. AND Fusion.........................................................................................................................................22
Figure 7. OR Fusion............................................................................................................................................22
Figure 8. Weighted Fusion..................................................................................................................................22
Figure 9. Averaging............................................................................................................................................23
Figure 10. Green bounding box represents the ground truth and the red bounding box represents the
prediction.............................................................................................................................................................24
Figure 11. Computing the Intersection of Union is as simple as dividing the area of overlap between the
bounding boxes by the area of union..................................................................................................................24
Figure 12. Precision X Recall Curve..................................................................................................................25
Figure 13. Comparative Predictions of different Object Detection Models.......................................................27
Figure 14. Comparative Predictions of different Object Detection Models.......................................................28

LIST OF SYMBOLS, ABBREVIATIONS AND NOMENCLATURE

Abbreviation Description
CNN Convolutional Neural Network
BEiT Bidirectional Encoder representation from Image Transformers
YOLO You Only Look Once
GTSRB The German Traffic Sign Recognition Benchmark
TSDR Traffic Sign Detection and Recognition
ADAS Advanced Drivers Assistance Systems
HOG Histogram of Oriented Gradients
SIFT Scale Invariant Feature Transform
IoU Intersection over Union

CHAPTERS

1 INTRODUCTION

Object detection and classification are important tasks in the field of computer vision, with numerous
applications in areas such as robotics, autonomous vehicles, and image analysis. Traffic sign detection and
recognition (TSDR) is an important application of machine learning that has the potential to improve road
safety by alerting drivers of upcoming traffic signs. TSDR systems use machine learning algorithms to
analyse images or video feeds from cameras mounted on the vehicle, and detect and recognize traffic signs in
the scene. Once a traffic sign is detected and recognized, the TSDR system can provide the driver with an
appropriate notification, such as an audio or visual alert. TSDR systems can be particularly useful in situations
where drivers may be distracted or have limited visibility, such as when driving on unfamiliar roads or in
adverse weather conditions. By automating the process of detecting and recognizing traffic signs, TSDR
systems can help drivers stay aware of their surroundings and make safer driving decisions. There are several
machine learning approaches that can be used to build TSDR systems, including object detection algorithms
and image classification algorithms. By carefully selecting and tuning the appropriate machine learning
model, it is possible to build a TSDR system that is effective at detecting and recognizing traffic signs in a
variety of conditions. In recent years, convolutional neural networks (CNNs) have become the dominant
approach for solving these problems, achieving state-of-the-art performance on various benchmarks.
However, the performance of individual CNN models can be limited by their inherent biases and limitations,
and there is often a trade-off between accuracy and efficiency.

Model selection is an important part of the machine learning process, as it involves choosing the best model
for a given task based on its performance and the requirements of the application. In general, the goal of
model selection is to find a model that has low bias and low variance, as this will result in good generalization
performance on unseen data. However, in practice, it is often difficult to achieve both low bias and low
variance at the same time. This is because reducing bias often leads to an increase in variance and vice versa.
This trade-off between bias and variance is known as the bias-variance dilemma.

Bias and variance are two sources of error that can affect the performance of a machine learning model. Bias
refers to the difference between the predicted values of a model and the true values, and it can arise when a
model is too simplistic or inflexible to accurately capture the underlying relationships in the data. Variance, on the other hand, refers to the variability of a model's predictions, and it can arise when a model is too complex
or overfits the training data.

It is not always the case that one object detection model will perform better than all others for a given
problem. Different models have different architectures and parameters, and some models may be better suited
to certain types of problems than others. For this reason, it is important to carefully evaluate the performance
of different models on a specific task before choosing the best one to use. To address these issues, we have
proposed the use of ensemble techniques, which combine the predictions of multiple models to achieve
improved performance. In this research paper, we propose an approach for object detection and classification using an ensemble technique that combines the predictions of two different object detection models (Faster R-CNN and YOLOv5) and two different classification models (BEiT and sequential CNN). Our contribution is in the use
of various techniques and combinations to improve the accuracy of the ensembled model. We used different popular pre-trained models as well as a model built from scratch in order to maximize performance. We
also experimented with various ensemble configurations to determine the optimal combination of models and
techniques.

Ensembling can be an effective way to mitigate both bias and variance, as it combines the predictions of
multiple models, which can help to reduce the overall error of the system. Ensembling is a machine-learning
technique in which multiple models are combined to produce a more accurate prediction. The idea behind
ensembling is to train a group of models with different architectures or hyperparameters, and then combine
their predictions in some way to produce a final prediction. Ensembling can be particularly effective in
improving the performance of machine learning models because it can help to reduce overfitting and improve
the generalizability of the model. By combining the predictions of multiple models, the ensemble model is
able to take advantage of the strengths of each individual model while mitigating their weaknesses. Our
research aims to demonstrate the effectiveness of using an ensemble approach for object detection and
classification tasks, particularly for the important application of advanced driver assistance systems (ADAS).
By providing drivers with clear and concise information about the meaning of traffic signs, ADAS systems
can help to improve road safety and make it easier for drivers to navigate unfamiliar roads. Ensembling can be
an effective way to improve the performance of these systems by combining the predictions of multiple
models and reducing overall error.

In summary, object detection and classification are essential tasks in the field of computer vision with a wide
range of applications. While convolutional neural networks (CNNs) have achieved state-of-the-art
performance on various benchmarks, the performance of individual CNN models can be limited by biases and
limitations. To address these issues, we propose a novel approach for object detection and classification using an ensemble technique that combines the predictions of two object detection models (Faster R-CNN and
YOLOv5) and two classification models (BEiT and sequential CNN). We use various techniques and
combinations to improve the accuracy of the ensembled model, including the use of popular pre-trained models and training models from
scratch. We also experiment with different ensemble configurations to find the optimal combination of models
and techniques.

In brief, this research presents the following contributions through its work:

• We propose the use of an ensemble approach for object detection, which combines the predictions of
multiple state-of-the-art object detection models in order to improve performance and accuracy. We
present a comprehensive comparative study of the individual models and our ensembled model,
highlighting the benefits and results of this approach. Additionally, we provide comprehensive
metrics that can aid in model selection and understanding how ensembling can increase performance
for difficult-to-detect objects and individual classes.

• We have compared the performance of individual object detection models with that of our ensembled
object detection model, highlighting the advantages and benefits of using an ensemble approach. We
conduct a comprehensive study to evaluate the performance of each model and provide a detailed
analysis of the results. Our aim is to demonstrate the effectiveness of using an ensemble model for
object detection and to provide insights into the optimal configurations and techniques for achieving
improved performance.

• To evaluate the effectiveness of our approach, we conducted a comprehensive comparative study between the individual object detection models and our ensembled model. We also compiled
comprehensive metrics to help establish a method for model selection and to understand how
ensembling can improve performance for difficult-to-detect objects and individual classes.


2 LITERATURE SURVEY

Traffic sign and light detection, recognition, and classification is an important research area in the field of
computer vision and has numerous applications in intelligent transportation systems and autonomous vehicles.
Most previous studies have concentrated on general object detection, but in our use case we narrow the scope to improving the self-driving experience by increasing the accuracy of traffic light and sign detection and by classifying and interpreting signals precisely. Moreover, instead of using a single model for object detection as in previous research, we use an ensemble technique to produce the best result from four models: BEiT, YOLOv5, Faster R-CNN, and sequential CNN. Real-world traffic scenes are complicated by illumination variation, bad weather, and visually similar false signs.

One of the most widely used approaches for traffic sign and light detection is based on convolutional neural
networks (CNNs). These models have achieved state-of-the-art performance on various benchmarks and have
been widely used in real-world applications. For example, in [1], the authors proposed a CNN-based approach
for traffic sign detection and classification using the GTSRB dataset. They showed that their model was able
to achieve an accuracy of 99.46% on the test set.

Previous research on object detection has included both traditional methods and more recent deep learning
approaches. Traditional methods have used approaches such as the Histogram of Oriented Gradients (HOG)
descriptor and the Scale Invariant Feature Transform (SIFT) to detect objects in a given image. More recently,
deep learning-based object detection methods such as the YOLO system and Faster R-CNN have been used to
detect objects in images. These methods use convolutional neural networks to identify objects in an image,
and are typically more accurate than traditional methods. Additionally, research has been conducted on
improving the efficiency of object detection systems, as well as methods to reduce the number of false
positives. Other approaches for traffic sign and light detection and classification have been based on different
types of feature representations. For example, in [3], the authors proposed a multi-scale and multi-orientation
Gabor filter-based approach for traffic sign recognition. They showed that their model was able to achieve an
accuracy of 97.4% on the GTSRB dataset.

The use of color information has been shown to be effective for traffic sign and light detection [17]. In [18],
the authors proposed a color-based approach for traffic sign detection that used a combination of color and
texture features to classify traffic signs. Some researchers have explored the use of deep learning models for
traffic sign and light detection, such as the use of convolutional neural networks (CNNs) and recurrent neural
networks (RNNs). In [19], the authors proposed a CNN-based approach for traffic light detection that
achieved a high level of accuracy on the COCO-TLS dataset.
The use of transfer learning has also been explored for traffic sign and light detection, where a model trained
on one dataset is fine-tuned on another dataset. This approach can be particularly useful when there is a
limited amount of annotated data available for a specific task. In [20], the authors used transfer learning to
improve the performance of a CNN-based model for traffic light detection.

In addition to the detection of traffic signs and lights, researchers have also focused on the classification and
interpretation of traffic signals. In [21], the authors proposed a multi-class classification model for traffic
lights that used a combination of color and shape features to classify different types of traffic lights. The use
of data fusion techniques, which combine the information from multiple sensors or modalities, has also been
explored for traffic sign and light detection. In [22], the authors proposed a data fusion approach that
combined color, texture and shape features extracted from images to improve the accuracy of traffic sign
detection. Another important aspect of traffic sign and light detection is the robustness of the model to
variations in the environment. In [23], the authors proposed a method for adapting a CNN-based model to
different lighting conditions by using a combination of synthetic and real data. This approach showed
improved performance on a dataset with varying illumination levels.

The use of motion information has also been explored for traffic sign and light detection, particularly in the
context of autonomous vehicles. In [24], the authors proposed a model that used both static and dynamic
features to detect traffic lights in a video stream. The model was able to achieve a high level of accuracy on a
dataset of traffic light videos. The incorporation of contextual information has also been shown to be
beneficial for traffic sign and light detection. In [25], the authors proposed a model that used the location and
orientation of traffic signs and lights relative to the vehicle to improve the accuracy of the detection. This
approach demonstrated improved performance on a dataset of traffic scenes.

In addition to the detection of traffic signs and lights, researchers have also focused on the classification and
interpretation of traffic signals. In [26], the authors proposed a model that used both visual and audio features
to classify traffic signals in real-time. The model was able to accurately classify different types of traffic
signals, including pedestrian signals and turn signals. Some researchers have also explored the use of
multimodal data for traffic sign and light detection, such as the combination of images and LiDAR data. In
[27], the authors proposed a model that used both image and LiDAR data to improve the accuracy of traffic
light detection. The model was able to effectively combine the strengths of both modalities to achieve
improved performance.

Object detection is a key task in the field of computer vision, with numerous applications in areas such as
autonomous vehicles, robotics, and security. In the context of traffic sign and light detection, object detection models are used to identify and classify traffic signals in images or video streams. In previous research, a
variety of object detection approaches have been applied to the task of traffic sign and light detection,
including traditional methods such as the Histogram of Oriented Gradients (HOG) descriptor and the Scale
Invariant Feature Transform (SIFT), as well as more recent deep learning-based methods such as YOLO and
Faster R-CNN. One of the challenges in object detection is the trade-off between accuracy and efficiency.
Some models are able to achieve high levels of accuracy but are computationally expensive, making them
impractical for real-time applications. On the other hand, some models are able to run in real time but may
have lower levels of accuracy. To address this challenge, researchers have explored the use of ensembling
techniques, which combine the predictions of multiple models to produce a more accurate result. Ensembling
can be particularly useful when the individual models have complementary strengths and weaknesses.

In [28], the authors proposed an ensemble approach for traffic sign detection and recognition that combined
the predictions of three different CNN-based models. The ensemble model was able to achieve improved
performance on a dataset of traffic signs compared to the individual models. In [29], the authors proposed an
ensemble approach for traffic light detection that combined the predictions of three different models. The
ensemble model was able to achieve improved performance on a dataset of traffic lights compared to the
individual models.

In [30], the authors also explored the use of ensembles for traffic light detection, using a combination of
CNN-based models and traditional machine learning algorithms. The ensemble model was able to achieve a
high level of accuracy on the COCO-TLS dataset. Other researchers have also studied the use of ensembles
for traffic sign and light detection, including the combination of different types of models such as CNNs and
RNNs [31] and the use of data fusion techniques to combine information from multiple sensors or modalities
[31]. Research on traffic sign detection therefore still faces significant challenges; the key to robust and accurate detection is finding small signs in complex environments. Extensive research has been done on traffic sign detection, and we build on these papers to implement a more robust model with higher accuracy.

3 RESEARCH/APPROACH

3.1 Existing Methods

3.1.1 Bidirectional Encoder representation from Image Transformers (BEiT)


The BEiT model is a self-supervised pre-trained Vision Transformer model that was introduced in the paper
"BEiT: BERT Pre-Training of Image Transformers" by Hangbo Bao, Li Dong, and Furu Wei. It was inspired
by BERT and is the first model to show that self-supervised pre-training of Vision Transformers can
outperform supervised pre-training. Rather than being pre-trained to predict the class of an image, as is done
in the original Vision Transformer model, BEiT models are pre-trained to predict visual tokens from the
codebook of OpenAI's DALL-E model given masked patches. These models are regular Vision Transformers
that are pre-trained in a self-supervised way, and they have been shown to outperform both the original Vision
Transformer model and Data-efficient Image Transformers when fine-tuned on the ImageNet-1K and CIFAR-100 datasets.
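
For concreteness, the following is a minimal sketch of how a BEiT classifier could be loaded and queried, assuming the HuggingFace transformers library; the checkpoint name, the image path, and the 43-class GTSRB head are illustrative assumptions rather than the exact code used in this project.

import torch
from PIL import Image
from transformers import BeitImageProcessor, BeitForImageClassification

# Hypothetical setup: start from a pre-trained BEiT checkpoint and attach a
# fresh classification head sized for the 43 GTSRB traffic sign classes.
processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224",
    num_labels=43,                 # GTSRB has 43 sign classes
    ignore_mismatched_sizes=True,  # discard the original ImageNet head
)

image = Image.open("sign.png").convert("RGB")  # illustrative input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(-1).item())  # predicted class index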

Figure 1. Architecture of BEiT

3.1.2 You Only Look Once (YOLOv5)


YOLO is a unified model for object detection. It processes an entire image in one pass and is able to identify
objects in the image with a single inference. YOLO is trained using a single full-image-size network, which
optimizes the network parameters to best identify objects in the image. This approach helps improve accuracy
and speed by reducing the number of false positives and false negatives. YOLO also uses a single evaluation
for the entire image, which means it can detect multiple objects in a single image with a single pass. For
ensembling, we use YOLO version 5 (YOLOv5), since it is stable and provides decent accuracy as well.
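
As a hedged illustration of this single-pass inference, the snippet below loads YOLOv5 through torch.hub; the "yolov5s" weights and the image path are illustrative assumptions, not the exact configuration used in this project.

import torch

# Illustrative: load a small pre-trained YOLOv5 model from the ultralytics hub;
# in practice this could be swapped for custom traffic-sign weights.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("road_scene.jpg")  # one forward pass over the whole image
detections = results.xyxy[0]       # rows of [x1, y1, x2, y2, confidence, class]
print(detections)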

Figure 2. The architecture of YOLO

3.1.3 Sequential Convolutional Neural Networks


A sequential CNN is a type of deep learning architecture that is composed of a series of layers arranged in a
linear sequence. In a sequential CNN, each layer processes the input data and passes it on to the next layer,
with the output of one layer serving as the input for the next. The layers in a sequential CNN can be either
convolutional layers, which apply a convolution operation to the input data to extract features, or fully
connected layers, which perform a matrix multiplication and apply an activation function to the resulting
output. Sequential CNNs are commonly used for tasks such as image classification and object recognition,
where the input data is in the form of an image and the network learns to extract features from the image and
classify it based on those features. One of the key advantages of sequential CNNs is their ability to learn
hierarchical representations of the input data, which allows them to achieve good performance on a wide
range of tasks.
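
A minimal sketch of such a classifier in Keras is shown below, assuming 32x32 RGB inputs and the 43 GTSRB classes; the layer widths here are illustrative assumptions, with the actual specification given in Table 1.

from tensorflow import keras
from tensorflow.keras import layers

# Illustrative sequential CNN: two convolution/pooling feature-extraction
# stages followed by fully connected layers.
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                     # guards against overfitting
    layers.Dense(43, activation="softmax"),  # one output per sign class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])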

Figure 3. Architecture of the sequential CNN

Table 1. Sequential model layers specification

3.1.4 Faster Region Based Convolutional Neural Network (Faster R-CNN)


Faster R-CNN is a two-stage object detection model that was introduced in 2015 by Ren et al. It is composed
of two main components: a region proposal network (RPN) and a Fast R-CNN detector. The RPN generates a
set of candidate object regions (called region proposals) in the input image, which are then passed to the Fast
R-CNN detector for classification into one of the predefined classes. The specific Faster R-CNN model that
we have used, "fasterrcnn_resnet50_fpn," uses a ResNet-50 CNN architecture as the backbone network and a
Feature Pyramid Network (FPN) to generate the region proposals. The ResNet-50 CNN is a deep
convolutional neural network that has been trained on the ImageNet dataset and has achieved strong
performance on a variety of computer vision tasks. The FPN is a network architecture that uses a pyramid of
feature maps at different scales to generate region proposals that are more robust to scale and aspect ratio
variations.

Overall, Faster R-CNN is a powerful object detection model that has achieved good results on various
benchmarks and has been widely used in real-world applications. It is known for its effectiveness in
generating accurate region proposals and classifying them into the correct classes.
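
As a hedged sketch, the torchvision version of this model can be loaded and re-headed for traffic sign classes as follows; the five-class count (four sign classes plus background, matching the Road Sign Detection dataset) is an illustrative assumption.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load Faster R-CNN with a ResNet-50 + FPN backbone pre-trained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Swap the box-classification head for our classes (assumed: 4 signs + background).
num_classes = 5
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
model.eval()  # inference mode for generating proposals and class labels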

Figure 4. The architecture of Faster R-CNN

3.1.5 Ensemble Technique

Ensemble techniques are used to improve the accuracy of predictive models by combining multiple models
into a single, more powerful model. These techniques generally work by training multiple models on the same
data and combining their predictions. The idea is that by combining different models, each with its own
strengths and weaknesses, the overall accuracy of the model can be improved. Ensemble techniques can also
be used to reduce overfitting, as the combination of multiple models can create a more robust model that is
less prone to overfitting. Ensemble methods are a type of machine learning technique that involves combining
multiple algorithms in order to produce a more accurate and reliable result. By combining several algorithms,
the bias of a single algorithm can be reduced, as well as its variance. This is because different algorithms may
make different mistakes, and combining them together can help to average out the errors.

Additionally, combining algorithms can also increase the accuracy of the overall model, since different
algorithms may capture different underlying patterns in the data. Ensemble techniques are also used in
unsupervised learning tasks, such as clustering and anomaly detection. In these tasks, multiple models are
used to identify more accurate clusters or outliers. In addition, ensemble techniques are used to improve the
efficiency of deep learning models. By combining multiple models, it is possible to reduce the computational
cost of training deep learning models while still achieving good results.

Different methods of ensembling used are:

1. Or: The "or" method of ensembling involves combining the predictions of multiple models by
taking the union of their predictions. This can be useful when the goal is to identify all of the
instances of a particular class, rather than just the most likely instance.
2. And: The "and" method of ensembling involves combining the predictions of multiple models by
taking the intersection of their predictions. This can be useful when the goal is to identify only
those instances that are predicted by all of the models.
3. Weighted fusion: Weighted fusion involves combining the predictions of multiple models by
assigning different weights to each model's prediction and taking a weighted average of the
predictions. This allows you to adjust the influence of each model on the final prediction.
4. Average: The average method of ensembling involves simply averaging the predictions of
multiple models. This can be useful when the goal is to smooth out the predictions of the
individual models and reduce the variance of the final prediction.

The most appropriate ensembling method depends on the specific characteristics of the data and the problem being solved. Experimenting with different ensembling methods and evaluating their performance on the data helped us determine the best approach for this particular problem using these four ensembling techniques.
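
Of these strategies, weighted fusion is the most involved; the snippet below sketches it using the open-source ensemble-boxes package, which is an assumption for illustration, as are the weights and thresholds shown.

from ensemble_boxes import weighted_boxes_fusion

# Per-model predictions for one image, box corners normalized to [0, 1].
boxes_list  = [[[0.10, 0.10, 0.50, 0.50]],   # model 1, e.g. YOLOv5
               [[0.12, 0.11, 0.52, 0.49]]]   # model 2, e.g. Faster R-CNN
scores_list = [[0.90], [0.75]]
labels_list = [[1], [1]]

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[2, 1],    # trust model 1 twice as much as model 2
    iou_thr=0.55,      # overlapping boxes above this IoU are fused
    skip_box_thr=0.1,  # discard very low-confidence boxes first
)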

3.2 Proposed Methodology

In this research, we present a novel approach for object detection and classification using an ensemble of four
different models: YOLOv5, Faster R-CNN, sequential CNN, and BEiT. These models were trained on three
different datasets: Belgium Traffic Sign image data, GTSRB (German Traffic Sign Recognition Benchmark),
and Road Sign Detection from Kaggle. Specifically, YOLOv5 and Faster R-CNN were used for object
detection, while sequential CNN and BEiT were used for classification. To create various combinations of the
models, we employed a combination of AND, OR, weighted Box fusion, and averaging techniques. These
techniques allowed us to blend the weaker predictions from the individual models and obtain a single strong
prediction for each image. We then used bounding box fusion to combine overlapping predictions, following
the affirmative selection of the predictions. To further improve the performance of our ensemble, we modified
the weights of the various models in the ensemble to generate different ensemble combinations, which we
refer to as our "ensembling activation parameters." This allowed us to fine-tune the relative contributions of
each model in the ensemble and optimize the overall performance.

We evaluated the performance of our proposed approach using various metrics such as precision, recall, and
accuracy. The results of our experiments showed that our ensemble approach was able to achieve superior
performance compared to the individual models, demonstrating the effectiveness of our proposed solution.
There were a few challenges that we encountered while implementing our proposed solution. One challenge
was the large amount of data required to train the individual models. To address this, we employed various
data augmentation techniques to generate additional training examples.

Another challenge was the computational cost of training and evaluating the ensemble, which required the use
of powerful hardware and efficient algorithms. Overall, our proposed solution demonstrated the potential of
using an ensemble approach for object detection and classification tasks and showed that it is possible to
significantly improve performance by combining the predictions of multiple models. We believe that our
approach has the potential to be applied to a wide range of real-world applications and hope that it will inspire
further research in this area.

Figure 5. Architecture of the proposed system

3.3 Research Approach

3.3.1 Dataset
For the study, we used the GTSRB, Belgium Traffic Sign image data, and Road Sign Detection datasets from
Kaggle.
• The GTSRB is a dataset of traffic sign images taken from German roads. It consists of over 50,000
images of traffic signs, with a total of 43 different classes of signs. The images were taken under a
variety of conditions, including different lighting conditions, weather conditions, and from different
angles.

• The Belgian Traffic Sign dataset is a collection of traffic sign images gathered from the Belgian roads.
The dataset consists of over 5,000 images of traffic signs, with a total of 62 different classes of signs.
The images were taken from various angles and under different lighting conditions, and the dataset
includes both color and grayscale images.
• The Road Sign Detection dataset from Kaggle is a dataset of traffic sign images taken from roads. It consists of 877 images of traffic signs, with a total of 4 different classes of signs (Traffic Light,
Stop, Speed limit, Crosswalk). The images were taken under a variety of conditions, including
different lighting conditions, weather conditions, and from different angles.

All three of these datasets are commonly used for training and evaluating machine learning models for traffic
sign recognition and classification tasks. They provide a large and diverse set of images that can be used to
train and test the performance of different algorithms and models.

3.3.2 Training
We obtained the training and validation datasets for our two object detection models from the Road Sign Detection dataset on Kaggle. We used the union of the GTSRB and Belgium Traffic Sign image data to train our image classifiers. It took roughly 200 epochs for each model to reach a preset, industry-standard level of category and localization losses. To avoid overfitting, we stopped training when the accuracy between the models decreased between epochs. We used different libraries and modules to train the models and performed fine-tuning. We then deployed the models on Auto Trainer and created an XML file for each image. After each training session, we checked the validation accuracy.
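
One common way to implement this kind of stopping rule, sketched here under the assumption that the classifiers use the Keras API, is an early-stopping callback that watches validation accuracy; the patience value is illustrative.

from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation accuracy stops improving, keeping the best weights seen.
early_stop = EarlyStopping(monitor="val_accuracy", patience=5,
                           restore_best_weights=True)

# Illustrative call; train_ds and val_ds stand in for our dataset pipelines.
# model.fit(train_ds, validation_data=val_ds, epochs=200, callbacks=[early_stop])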

3.3.3 Testing
We evaluated the performance of our model using the test dataset, which consists of 12,630 images. For each
image, we made predictions using our model and recorded the boxes, confidence scores, and labels in a
separate text file. After making the predictions, we ran a preprocessing step and then evaluated the
performance of the model based on the predicted text files.

3.3.4 Prediction Ensembling


Prediction ensembling is the process of combining the predictions of multiple machine learning models to
make a final prediction. It is a common technique used in machine learning to improve the performance and
robustness of a model.
• AND fusion: This involves taking the intersection of the predicted classes of the individual models. For example, if model 1 predicts class A and model 2 predicts class B, the AND fusion of these predictions would be the null set (no classes). This method can be used if you want to be very confident in the final prediction and only consider classes that are predicted by all of the models.

Figure 6. AND Fusion

• OR fusion: This involves taking the union of the predicted classes of the individual models. For
example, if model 1 predicts class A and model 2 predicts class B, the OR fusion of these predictions
would be the set of classes A and B. This method can be used if you want to consider all of the classes
that are predicted by any of the models.

Figure 7. OR Fusion

• Weighted fusion: This involves weighting the predictions of the individual models based on their
accuracy or some other metric. The final prediction is then made by combining the weighted
predictions of the models. This method can be used if you want to give more emphasis to the
predictions of certain models. In the NMS method, the boxes are considered as belonging to a single
object if their overlap, Intersection over Union (IoU) is higher than some threshold value. Thus, the
boxes filtering process depends on the selection of this single IoU threshold value, which affects the
performance of the model.

Figure 8. Weighted Fusion

• Averaging: This involves taking the average of the predictions of the individual models. This can be
effective if the models are relatively unbiased and make similar types of errors.

Figure 9. Averaging

After making predictions for each image using multiple models, we combined the predictions to create a
single, unified prediction for each image. To do this, we used a variety of techniques including AND fusion,
OR fusion, Weighted Boxes Fusion, and averaging. We also performed preprocessing on the combined
predictions, resizing boxes to their original size and removing boxes with low confidence scores or
inconsistent formatting. These combined predictions were used as the input for an ensemble model in order to
mitigate any size mismatches that may have been introduced during model training. The goal of this process
was to reduce the number of False Positives in the final predictions.
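
To make the AND and OR strategies concrete, here is a hedged sketch of IoU-based fusion over two models' box lists; the helper names and the 0.5 IoU threshold are illustrative assumptions, not the project's exact code.

def iou(a, b):
    # Intersection over Union of two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def and_fusion(boxes_a, boxes_b, thr=0.5):
    # Keep only model A's boxes that some box from model B agrees with.
    return [a for a in boxes_a if any(iou(a, b) >= thr for b in boxes_b)]

def or_fusion(boxes_a, boxes_b, thr=0.5):
    # Keep all of A's boxes, plus B's boxes that no A box already covers.
    extra = [b for b in boxes_b if all(iou(a, b) < thr for a in boxes_a)]
    return boxes_a + extra

a = [[10, 10, 50, 50]]
b = [[12, 11, 52, 49], [200, 200, 240, 240]]
print(and_fusion(a, b))  # the box both models agree on survives
print(or_fusion(a, b))   # also keeps B's extra detection at (200, 200)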

4 RESULTS
4.1 Evaluation
For evaluating the different ensembling methods, we are going to track the following parameters:

• True Positives: the predicted box matches the ground truth.
• False Positives: the predicted box is wrong.
• False Negatives: no prediction even though ground truth exists.
• Precision: measures how accurate the predictions are, i.e., the percentage of predictions that are correct [TP / (TP + FP)].
• Recall: measures what percentage of the ground truth was predicted [TP / (TP + FN)].
• Average Precision: the area under the precision-recall graph (in this case, all points are considered for the area under the graph).

Precision and recall were measured at various confidence score thresholds in order to calculate the precision-recall curve. If a prediction's IoU with the ground truth box was greater than or equal to 50 percent, it was termed a True Positive (TP).
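
A small worked sketch of this bookkeeping follows; the counts are illustrative, not measured values.

def precision_recall(num_tp, num_fp, num_fn):
    # Precision = TP / (TP + FP); Recall = TP / (TP + FN).
    precision = num_tp / float(num_tp + num_fp) if (num_tp + num_fp) else 0.0
    recall = num_tp / float(num_tp + num_fn) if (num_tp + num_fn) else 0.0
    return precision, recall

# Illustrative counts: 90 correct boxes, 10 wrong boxes, 5 missed ground truths.
p, r = precision_recall(90, 10, 5)
print(p, r)  # 0.90 and ~0.947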

4.2 IOU - Intersection Over Union


The Jaccard Index, also known as the Intersection over Union (IOU) score, is a measure of the overlap
between two bounding boxes. It is commonly used in object detection tasks to determine whether a predicted
bounding box is a true positive. To calculate the IOU score, the area of the intersection between the predicted
bounding box and the ground truth bounding box is divided by the area of the union between the two boxes.
This score
can then be compared to a threshold to determine whether the prediction should be considered a true positive.
An example of this is shown in the figure below, where the green bounding box represents the ground truth
and the red bounding box represents the prediction.

Figure 10. Green bounding box represents the ground truth and the red bounding box represents the prediction.

Our goal is to compute the Intersection over Union between these bounding boxes. Computing Intersection
over Union can therefore be determined via:

Figure 11. Computing the Intersection over Union is as simple as dividing the area of overlap between the bounding boxes by the area of union
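
A worked numeric sketch of this computation is given below; the box coordinates are illustrative values, not taken from the figure.

# Two boxes in [x1, y1, x2, y2] pixel coordinates (illustrative values).
ground_truth = (30, 40, 130, 140)   # green box
prediction   = (50, 60, 150, 160)   # red box

ix1 = max(ground_truth[0], prediction[0])        # 50
iy1 = max(ground_truth[1], prediction[1])        # 60
ix2 = min(ground_truth[2], prediction[2])        # 130
iy2 = min(ground_truth[3], prediction[3])        # 140
overlap = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 80 * 80 = 6400

area_gt   = (130 - 30) * (140 - 40)              # 10000
area_pred = (150 - 50) * (160 - 60)              # 10000
union = area_gt + area_pred - overlap            # 13600

print(overlap / float(union))  # ~0.47, below a 0.5 TP threshold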

4.3 Precision X Recall Curve

A precision-recall curve is a plot of the precision of a model on a test dataset as a function of the recall of the
model. Precision is a measure of the proportion of true positive predictions made by the model out of all
positive predictions, while recall is a measure of the proportion of true positive predictions made by the model
out of all actual positive cases in the test dataset. The higher the precision and recall of a model, the better its performance.

In general, a model with a higher precision-recall curve is considered to be a better performer than a model
with a lower precision-recall curve. The shape of the precision-recall curve can provide insights into the trade-off between precision and recall for a given model.

The precision-recall curve is a useful tool for evaluating the performance of an object detector on a per-class
basis. As recall increases, the curve reflects the trade-off between precision and recall for a given class. A
good object detector for a particular class should have a high precision at all levels of recall. This means that
the model is able to accurately identify instances of the class with a high level of confidence, even as it
becomes more sensitive to detecting the class.

On the other hand, a model with a lower precision may be less reliable, but it may be able to detect a larger
number of instances of the class, resulting in a higher recall. The shape of the precision-recall curve can
provide insights into the strengths and weaknesses of a given object detector.
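
The sketch below illustrates, under assumed TP/FP flags, how such a curve can be traced by sweeping the confidence threshold over detections sorted by score.

import numpy as np

# Detections sorted by descending confidence; 1 marks a TP, 0 marks an FP
# (illustrative flags, not measured results).
tp_flags = np.array([1, 1, 0, 1, 0, 1, 0])
num_ground_truth = 5

cum_tp = np.cumsum(tp_flags)
cum_fp = np.cumsum(1 - tp_flags)
precision = cum_tp / (cum_tp + cum_fp)   # precision after each detection
recall = cum_tp / num_ground_truth       # recall after each detection

for p, r in zip(precision, recall):
    print("recall=%.2f  precision=%.2f" % (r, p))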

Figure 12. Precision X Recall Curve

4.4 Average Precision


Average precision (AP) is a metric used to evaluate the performance of object detection models. It is defined
as the average of the precision values at different recall levels. The precision-recall curve is a plot of the
precision of a model on a test dataset as a function of the recall of the model. The average precision is
calculated by first calculating the precision and recall values for different thresholds and then plotting these
values on the precision-recall curve. The average precision is the area under this curve. A model with a
higher average
precision is considered to be a better performer than a model with a lower average precision. The average
precision is often used to compare the performance of different object detection models. One way to compare
the performance of different object detectors is to use the Area Under the Curve (AUC) metric. This can be
challenging, however, because the curves produced by different detectors may cross each other. In these cases,
it can be helpful to use the Average Precision (AP) statistic, which is calculated by averaging the precision values over recall levels from zero to one.

In recent years, the PASCAL VOC challenge has changed the way AP is calculated, and it is now calculated
by interpolating all data points. Our research technique follows the current submission guidelines for
PASCAL VOC, which involve interpolating all data points in order to accurately compare the performance of
different object detectors.
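
A hedged sketch of all-point interpolated AP in the spirit of this protocol follows; the input arrays are illustrative, not measured values from our experiments.

import numpy as np

def average_precision(recall, precision):
    # Pad the curve at both ends, then take the right-to-left precision
    # envelope (the "all-point interpolation" step).
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the area of the step curve wherever recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Illustrative curve points.
ap = average_precision(np.array([0.2, 0.4, 0.4, 0.6]),
                       np.array([1.0, 1.0, 0.67, 0.75]))
print("AP = %.3f" % ap)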

4.5 Recorded Metrics

Table 2. Individual Model Metrics

Table 3. Ensembled Model Result

Figure 13. Comparative Predictions of different Object Detection Models

The ensemble approach discussed in this paper has been used for object detection to increase the precision of existing object detection models. From the research conducted and the outcomes recorded, it is safe to conclude that the proposed ensembled model outperformed the individual object detection models. The ensemble method can also significantly reduce the number of images that must be manually annotated in order to train an object detection model.

The proposed model led to better accuracy and mean Average Precision, and could localize objects better and reduce False Positives and False Negatives. The ensemble model surpassed the individual object detection models across the board in terms of accuracy and performance.

Figure 14. Comparative Predictions of different Object Detection Models

As a result, this model can be used to design a warning traffic sign detection system for drivers. The images will be taken with a camera mounted on the car, and after preprocessing, the ensemble algorithm will be used to perform the recognition process. After a traffic sign is identified, a machine voice alert is issued. This model can be used in circumstances requiring precise navigation.

5 CONCLUSION
In conclusion, the use of ensemble techniques for object detection and recognition of traffic signs has proven
to be an effective method for improving the performance of machine learning models. By training multiple
models and combining their predictions, ensemble techniques can reduce the variance and improve the
generalization ability of the model, leading to more accurate and robust results. In this research paper, we
demonstrated the effectiveness of ensemble techniques through a series of experiments on a dataset of traffic
sign images. Our results showed that the use of ensemble techniques resulted in a significant improvement in
the accuracy of the object detection and recognition model compared to using a single model. These findings
suggest that ensemble techniques should be considered as a potential method for improving the performance
of object detection and recognition models in the field of traffic sign analysis.

6 SUMMARY
Traffic sign detection and recognition (TSDR) is a machine learning application that uses algorithms to
analyze images or video from cameras mounted on vehicles and detect and recognize traffic signs in the scene.
These systems can provide drivers with notifications to improve road safety, particularly in situations where
drivers may be distracted or have limited visibility. Convolutional neural networks (CNNs) are commonly
used to build TSDR systems, but they can be limited by biases and trade-offs between accuracy and
efficiency. In this research, the authors proposed using ensemble techniques, which combine the predictions of
multiple models, to improve the performance of object detection and classification. They experimented with
different pre-trained models and configurations to determine the optimal combination for their task. The
ensembled model achieved improved performance compared to the individual models, demonstrating the
effectiveness of the approach.

7 FUTURE SCOPE
The model put forward in this research brings us one step further toward the ideal Advanced Driver
Assistance System or a fully autonomous system, but there is still a lot that can be done to improve it. The
sign's color and shape are important factors in identification. If the sign's color is affected by a reflection, this is a concern. Similarly, if the sign is chipped or cut off, the shape of the sign is impaired, resulting in no detection. Nighttime detection is a crucial aspect to take into account. This application could also add a text-to-speech feature: in the existing application the motorist must read the words on the classified sign, but
with the support of a speech module, increased comfort is ensured. With the help of new datasets and data
from other countries, the complete performance could be enhanced. To improve the performance of such an
ensembled model, other combinations and hyperparameter exploration can be done.
8 REFERENCES

[1] A. Groener, G. Chern, and M. Pritt, “A Comparison of Deep Learning Object Detection Models for
Satellite Imagery,” 2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), pp. 1–10, Oct.
2019, doi: 10.1109/AIPR47015.2019.9174593.
[2] R. Ray and S. R. Dash, “Comparative Study of the Ensemble Learning Methods for Classification of
Animals in the Zoo,” in Smart Intelligent Computing and Applications, vol. 159, S. C. Satapathy, V.
Bhateja, J. R. Mohanty, and S. K. Udgata, Eds. Singapore: Springer Singapore, 2020, pp. 251–260,
doi: 10.1007/978-981-13-9282-5_23.
[3] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma, “A survey on ensemble learning,” Front. Comput. Sci.,
vol. 14, no. 2, pp. 241–258, Apr. 2020, doi: 10.1007/s11704-019-8208-z.
[4] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794,
Aug. 2016, doi: 10.1145/2939672.2939785.
[5] P. Singh, “Comparative study of individual and ensemble methods of classification for credit scoring,”
in 2017 International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, Nov.
2017, pp. 968–972, doi: 10.1109/ICICI.2017.8365282.
[6] L. Rokach, “Ensemble-based classifiers,” Artif Intell Rev, vol. 33, no. 1–2, pp. 1–39, Feb. 2010, doi:
10.1007/s10462-009-9124-7.
[7] Y. Ren, L. Zhang, and P. N. Suganthan, “Ensemble Classification and Regression-Recent
Developments, Applications and Future Directions [Review Article],” IEEE Comput. Intell. Mag., vol.
11, no. 1, pp. 41–53, Feb. 2016, doi: 10.1109/MCI.2015.2471235.
[8] B. Ghojogh and M. Crowley, “The Theory Behind Overfitting, Cross Validation, Regularization,
Bagging, and Boosting: Tutorial,” arXiv:1905.12787 [cs, stat], May 2019, [Online]. Available:
http://arxiv.org/abs/1905.12787.
[9] J. Xu, W. Wang, H. Wang, and J. Guo, “Multi-model ensemble with rich spatial information for object
detection,” Pattern Recognition, vol. 99, p. 107098, Mar. 2020, doi: 10.1016/j.patcog.2019.107098.
[10] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object Detection With Deep Learning: A
Review,” IEEE Trans. Neural Netw. Learning Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019, doi:
10.1109/TNNLS.2018.2876865.
[11] Y. Wu et al., “Rethinking Classification and Localization for Object Detection,” in 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, Jun.
2020, pp. 10183–10192, doi: 10.1109/CVPR42600.2020.01020.

[12] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv:1804.02767
[cs], Apr. 2018, [Online]. Available: http://arxiv.org/abs/1804.02767.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-
Time Object Detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 779–788, doi: 10.1109/CVPR.2016.91.
[14] W. Liu et al., “SSD: Single Shot MultiBox Detector,” arXiv:1512.02325 [cs], vol. 9905, pp.
21–37, 2016, doi: 10.1007/978-3-319-46448-0_2.
[15] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–
1149, Jun. 2017, doi: 10.1109/TPAMI.2016.2577031.
[16] O. Sagi and L. Rokach, “Ensemble learning: A survey,” WIREs Data Mining Knowl Discov,
vol. 8, no. 4, Jul. 2018, doi: 10.1002/widm.1249.
[17] M. A. Abdullah, S. M. Senouci, and A. Bouzerdoum, "A survey of traffic sign recognition
techniques," Neural Computing and Applications, vol. 29, no. 8, pp. 3389-3408, 2018.
[18] J. Y. Kim, S. H. Lee, and J. W. Cho, "Color-based traffic sign detection and recognition,"
Pattern Recognition, vol. 48, no. 3, pp. 835-847, 2015.
[19] J. Zhang, D. Chen, and C. C. Loy, "Traffic light detection and classification in the wild," in
Proceedings of the IEEE International Conference on Computer Vision, pp. 1222-1231, 2017.
[20] L. Zhang, Y. Li, and D. D. Feng, "Traffic light detection and recognition using transfer
learning," in Proceedings of the IEEE International Conference on Intelligent Transportation Systems,
pp. 1347-1352, 2017.
[21] S. R. M. Ferreira, A. M. R. Carvalho, and C. R. Jung, "Multi-class traffic light classification
using color and shape features," in Proceedings of the IEEE International Conference on Intelligent
Transportation Systems, pp. 1353-1358, 2017.
[22] A. K. Jha and M. C. Fairchild, "A data fusion approach to traffic sign recognition," in Proceedings
of the IEEE Intelligent Transportation Systems Conference, pp. 1217–1222, 2015.
[23] Y. Zhang, L. Zhang, and D. D. Feng, "Adaptive traffic sign recognition under varying
illumination conditions," in Proceedings of the IEEE Intelligent Transportation Systems Conference,
pp. 1016-1021, 2018.
[24] Y. Liu, Y. Yang, and X. Li, "Traffic light detection in video streams using static and dynamic
features," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 1223-1228,
2016.
[25] J. Liu, Y. Zhu, and H. Li, "Contextual traffic light detection and recognition," in Proceedings
of the IEEE Intelligent Transportation Systems Conference, pp. 717-722, 2017.
[26] C. Song, J. Zhang, and D. Chen, "Real-time traffic signal recognition using multimodal
features," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 3829-3834,
2019.
[27] Z. Wu, Y. Zhang, L. Zhang, and D. D. Feng, "Traffic light detection using fusion of image and
LiDAR data," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 1363-
1368, 2017.
[28] M. G. Hasan, M. M. Hasan, and M. A. Quasem, "An ensemble approach for traffic sign
detection and recognition," in Proceedings of the IEEE International Conference on Advanced
Information Networking and Applications, pp. 1471-1478, 2016.
[29] L. Zhang, Y. Li, and D. D. Feng, "An ensemble approach for traffic light detection," in
Proceedings of the IEEE International Conference on Intelligent Transportation Systems, pp. 526-531,
2016.
[26] C. Song, J. Zhang, and D. Chen, "Real-time traffic signal recognition using multimodal
features," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 3829–3834,
2019.
[27] Z. Wu, Y. Zhang, L. Zhang, and D. D. Feng, "Traffic light detection using fusion of image and
LiDAR data," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 1363–
1368, 2017.
[28] M. G. Hasan, M. M. Hasan, and M. A. Quasem, "An ensemble approach for traffic sign
detection and recognition," in Proceedings of the IEEE International Conference on Advanced
Information Networking and Applications, pp. 1471–1478, 2016.
[29] L. Zhang, Y. Li, and D. D. Feng, "An ensemble approach for traffic light detection," in
Proceedings of the IEEE International Conference on Intelligent Transportation Systems, pp. 526–531,
2016.
[30] C. Song, J. Zhang, and D. Chen, "Real-time traffic signal recognition using multimodal
features," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 3829–3834,
2019.
[31] Z. Wu, Y. Zhang, L. Zhang, and D. D. Feng, "Traffic light detection using fusion of image and
LiDAR data," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 1363–
1368, 2017.
RESEARCH PAPER PROOF
Object Detection using Ensemble Technique

Syeda Reeha Quasar¹, Rishika Sharma¹, Aayushi Mittal¹, Moolchand Sharma², Prerna Sharma²

¹Student, Department of Computer Science & Engineering, Maharaja Agrasen Institute of Technology, India
²Assistant Professor, Department of Computer Science & Engineering, Maharaja Agrasen Institute of Technology, India

¹syedareehaquasar@gmail.com, ¹rishikas0001@gmail.com, ¹aayushimittal088@gmail.com
²moolchand@mait.ac.in, ²prernasharma@mait.ac.in

Abstract. Automated cars are being developed by leading automakers, and this technology is expected to
revolutionize how people experience transportation, giving people more options and convenience. The
implementation of traffic signal detection and recognition systems in automated vehicles can help reduce the
number of fatalities due to traffic mishaps, improve road safety and efficiency, decrease traffic congestion,
and help reduce air pollution. Once the traffic signs and lights are detected, they can be used to alert the
driver about any upcoming signs or lights. The driver can then take the necessary actions to ensure a safe
journey. We propose a method for Traffic Sign Detection and Recognition using ensemble techniques on
four models, namely BEiT, YOLOv5, Faster R-CNN, and sequential CNN. Current research focuses on
traffic sign detection using individual models such as CNN. To further boost the accuracy of object
detection, our proposed approach uses a combination of the average, AND, OR, and weighted-fusion
strategies to combine the outputs of the different ensembles. The testing in this project utilizes the German
Traffic Sign Recognition Benchmark (GTSRB), Belgium Traffic Sign image data, and Road Sign Detection
datasets. In comparison to the individual models for object detection, the ensemble of these models was able
to increase the model accuracy to 99.54%, with a validation accuracy of 99.74% and a test accuracy of
99.34%.

Keywords: Advanced Driver Assistance System, Traffic Sign Recognition, Convolutional Neural Network,
Ensemble, TensorFlow, Image Processing

Abbreviations:

Abbreviation   Description
CNN            Convolutional Neural Network
BEiT           Bidirectional Encoder representation from Image Transformers
YOLO           You Only Look Once
GTSRB          German Traffic Sign Recognition Benchmark
TSDR           Traffic Sign Detection and Recognition
ADAS           Advanced Driver Assistance Systems
HOG            Histogram of Oriented Gradients
SIFT           Scale Invariant Feature Transform
IoU            Intersection over Union
1. Introduction

Object detection and classification are important tasks in the field of computer vision, with numerous
applications in areas such as robotics, autonomous vehicles, and image analysis. Traffic sign detection and
recognition (TSDR) is an important application of machine learning that has the potential to improve road
safety by alerting drivers of upcoming traffic signs.

TSDR systems use machine learning algorithms to analyze images or video feeds from cameras mounted on
the vehicle, and detect and recognize traffic signs in the scene. Once a traffic sign is detected and
recognized, the TSDR system can provide the driver with an appropriate notification, such as an audio or
visual alert. TSDR systems can be particularly useful in situations where drivers may be distracted or have
limited visibility, such as when driving on unfamiliar roads or in adverse weather conditions. By
automating the process of detecting and recognizing traffic signs, TSDR systems can help drivers stay
aware of their surroundings and make safer driving decisions. There are several machine learning
approaches that can be used to build TSDR systems, including object detection algorithms and image
classification algorithms. By carefully selecting and tuning the appropriate machine learning model, it is
possible to build a TSDR system that is effective at detecting and recognizing traffic signs in a variety of
conditions.

In recent years, convolutional neural networks (CNNs) have become the dominant approach for solving
these problems, achieving state-of-the-art performance on various benchmarks. However, the performance
of individual CNN models can be limited by their inherent biases and limitations, and there is often a
trade-off between accuracy and efficiency.

Model selection is an important part of the machine learning process, as it involves choosing the best model
for a given task based on its performance and the requirements of the application. In general, the goal of
model selection is to find a model that has low bias and low variance, as this will result in good
generalization performance on unseen data. However, in practice, it is often difficult to achieve both low
bias and low variance at the same time. This is because reducing bias often leads to an increase in variance
and vice versa. This trade-off between bias and variance is known as the bias-variance dilemma.
Bias and variance are two sources of error that can affect the performance of a machine learning model.
Bias refers to the difference between the predicted values of a model and the true values, and it can arise
when a model is too simplistic or inflexible to accurately capture the underlying relationships in the data.
Variance, on the other hand, refers to the variability of a model's predictions, and it can arise when a model
is too complex or overfits the training data.

It is not always the case that one object detection model will perform better than all others for a given
problem. Different models have different architectures and parameters, and some models may be better
suited to certain types of problems than others. For this reason, it is important to carefully evaluate the
performance of different models on a specific task before choosing the best one to use.

To address these issues, we propose the use of ensemble techniques, which combine the predictions
of multiple models to achieve improved performance. In this research paper, we present an approach for
object detection and classification using an ensemble technique that combines the predictions of two
different object detection models (Faster R-CNN and YOLOv5) and two different classification models
(BEiT and sequential CNN). Our contribution lies in the use of various techniques and combinations to
improve the accuracy of the ensembled model. We used popular pre-trained models as well as a model
trained from scratch in order to maximize performance. We also experimented with various ensemble
configurations to determine the optimal combination of models and techniques.

Ensembling can be an effective way to mitigate both bias and variance, as it combines the predictions of
multiple models, which can help to reduce the overall error of the system. Ensembling is a machine-
learning technique in which multiple models are combined to produce a more accurate prediction. The idea
behind ensembling is to train a group of models with different architectures or hyperparameters, and then
combine their predictions in some way to produce a final prediction. Ensembling can be particularly
effective in improving the performance of machine learning models because it can help to reduce overfitting
and improve
the generalizability of the model. By combining the predictions of multiple models, the ensemble model is
able to take advantage of the strengths of each individual model while mitigating their weaknesses.

Our research aims to demonstrate the effectiveness of using an ensemble approach for object detection and
classification tasks, particularly for the important application of advanced driver assistance systems
(ADAS). By providing drivers with clear and concise information about the meaning of traffic signs, ADAS
systems can help to improve road safety and make it easier for drivers to navigate unfamiliar roads.
Ensembling can be an effective way to improve the performance of these systems by combining the
predictions of multiple models and reducing overall error.

In summary, object detection and classification are essential tasks in the field of computer vision with a
wide range of applications. While convolutional neural networks (CNNs) have achieved state-of-the-art
performance on various benchmarks, the performance of individual CNN models can be limited by biases
and limitations. To address these issues, we propose a novel approach for object detection and classification
using an ensemble technique that combines the predictions of two object detection models (Faster R-CNN
and YOLOv5) and two classification models (BEiT and sequential CNN). We use various techniques and
combinations to improve the accuracy of the ensembled model, including the use of popular pre-trained
models and training models from scratch. We also experiment with different ensemble configurations to
find the optimal combination of models and techniques.
In brief, this research presents the following contributions through its work:

 We propose an ensemble approach for object detection that combines the predictions of multiple
state-of-the-art object detection models in order to improve performance and accuracy.
 We present a comprehensive comparative study between the individual object detection models and
our ensembled model, highlighting the advantages and benefits of the ensemble approach and
providing a detailed analysis of the results.
 We compile comprehensive metrics that aid in model selection and show how ensembling can
increase performance for difficult-to-detect objects and individual classes.


2. Literature Review
Traffic sign and light detection, recognition, and classification is an important research area in the field of
computer vision and has numerous applications in intelligent transportation systems and autonomous
vehicles. Most previous studies have concentrated on general object detection, but in our use case we
narrow the scope to improving the self-driving experience by increasing the accuracy of traffic light and
sign detection and by classifying and interpreting signals precisely. Moreover, instead of using a single
model for object detection as in previous research, we use an ensemble technique to produce the best result
from four models: BEiT, YOLOv5, Faster R-CNN, and sequential CNN.
Real-world traffic scenes are complicated by illumination variation, bad weather, similar-looking false
signs, and other factors.
One of the most widely used approaches for traffic sign and light detection is based on convolutional neural
networks (CNNs). These models have achieved state-of-the-art performance on various benchmarks and
have been widely used in real-world applications. For example, in [1], the authors proposed a CNN-based
approach for traffic sign detection and classification using the GTSRB dataset. They showed that their
model was able to achieve an accuracy of 99.46% on the test set.
Previous research on object detection has included both traditional methods and more recent deep learning
approaches. Traditional methods have used approaches such as the Histogram of Oriented Gradients (HOG)
descriptor and the Scale Invariant Feature Transform (SIFT) to detect objects in a given image. More
recently, deep learning-based object detection methods such as the YOLO system and Faster R-CNN have
been used to detect objects in images. These methods use convolutional neural networks to identify objects
in an image, and are typically more accurate than traditional methods. Additionally, research has been
conducted on improving the efficiency of object detection systems, as well as methods to reduce the
number of false positives.
Other approaches for traffic sign and light detection and classification have been based on different types of
feature representations. For example, in [3], the authors proposed a multi-scale and multi-orientation Gabor
filter-based approach for traffic sign recognition. They showed that their model was able to achieve an
accuracy of 97.4% on the GTSRB dataset.
The use of color information has been shown to be effective for traffic sign and light detection [17]. In [18],
the authors proposed a color-based approach for traffic sign detection that used a combination of color and
texture features to classify traffic signs.
Some researchers have explored the use of deep learning models for traffic sign and light detection, such as
the use of convolutional neural networks (CNNs) and recurrent neural networks (RNNs). In [19], the
authors proposed a CNN-based approach for traffic light detection that achieved a high level of accuracy on
the COCO-TLS dataset.
The use of transfer learning has also been explored for traffic sign and light detection, where a model
trained on one dataset is fine-tuned on another dataset. This approach can be particularly useful when there
is a limited amount of annotated data available for a specific task. In [20], the authors used transfer learning
to improve the performance of a CNN-based model for traffic light detection.
In addition to the detection of traffic signs and lights, researchers have also focused on the classification and
interpretation of traffic signals. In [21], the authors proposed a multi-class classification model for traffic
lights that used a combination of color and shape features to classify different types of traffic lights.
The use of data fusion techniques, which combine the information from multiple sensors or modalities, has
also been explored for traffic sign and light detection. In [22], the authors proposed a data fusion approach
that combined color, texture and shape features extracted from images to improve the accuracy of traffic
sign detection.
Another important aspect of traffic sign and light detection is the robustness of the model to variations in
the environment. In [23], the authors proposed a method for adapting a CNN-based model to different
lighting conditions by using a combination of synthetic and real data. This approach showed improved
performance on a dataset with varying illumination levels.
The use of motion information has also been explored for traffic sign and light detection, particularly in the
context of autonomous vehicles. In [24], the authors proposed a model that used both static and dynamic
features to detect traffic lights in a video stream. The model was able to achieve a high level of accuracy on
a dataset of traffic light videos.
The incorporation of contextual information has also been shown to be beneficial for traffic sign and light
detection. In [25], the authors proposed a model that used the location and orientation of traffic signs and
lights relative to the vehicle to improve the accuracy of the detection. This approach demonstrated
improved performance on a dataset of traffic scenes.
In addition to the detection of traffic signs and lights, researchers have also focused on the classification and
interpretation of traffic signals. In [26], the authors proposed a model that used both visual and audio
features to classify traffic signals in real-time. The model was able to accurately classify different types of
traffic signals, including pedestrian signals and turn signals.
Some researchers have also explored the use of multimodal data for traffic sign and light detection, such as
the combination of images and LiDAR data. In [27], the authors proposed a model that used both image and
LiDAR data to improve the accuracy of traffic light detection. The model was able to effectively combine
the
strengths of both modalities to achieve improved performance.

Object detection is a key task in the field of computer vision, with numerous applications in areas such as
autonomous vehicles, robotics, and security. In the context of traffic sign and light detection, object
detection models are used to identify and classify traffic signals in images or video streams.
In previous research, a variety of object detection approaches have been applied to the task of traffic sign
and light detection, including traditional methods such as the Histogram of Oriented Gradients (HOG)
descriptor and the Scale Invariant Feature Transform (SIFT), as well as more recent deep learning-based
methods such as YOLO and Faster R-CNN.

One of the challenges in object detection is the trade-off between accuracy and efficiency. Some models are
able to achieve high levels of accuracy but are computationally expensive, making them impractical for
real-time applications. On the other hand, some models are able to run in real time but may have lower
levels of accuracy.

To address this challenge, researchers have explored the use of ensembling techniques, which combine the
predictions of multiple models to produce a more accurate result. Ensembling can be particularly useful
when the individual models have complementary strengths and weaknesses.
In [28], the authors proposed an ensemble approach for traffic sign detection and recognition that combined
the predictions of three different CNN-based models. The ensemble model was able to achieve improved
performance on a dataset of traffic signs compared to the individual models. In [29], the authors proposed
an ensemble approach for traffic light detection that combined the predictions of three different models. The
ensemble model was able to achieve improved performance on a dataset of traffic lights compared to the
individual models.

In [30], the authors also explored the use of ensembles for traffic light detection, using a combination of
CNN-based models and traditional machine learning algorithms. The ensemble model was able to achieve a
high level of accuracy on the COCO-TLS dataset. Other researchers have also studied the use of ensembles
for traffic sign and light detection, including the combination of different types of models such as CNNs
and RNNs [31] and the use of data fusion techniques to combine information from multiple sensors or
modalities [31].

Therefore, research on traffic sign detection still faces great challenges. The key to achieving robust and
accurate traffic sign detection is detecting small traffic signs in complex environments. Extensive research
has been done on traffic sign detection, and we build on these works to implement a more robust model
with higher accuracy.

3. Existing Methods

a. Bidirectional Encoder representation from Image Transformers (BEiT)

The BEiT model is a self-supervised pre-trained Vision Transformer that was introduced in the
paper "BEiT: BERT Pre-Training of Image Transformers" by Hangbo Bao, Li Dong, and Furu
Wei. It was inspired by BERT and is the first model to show that self-supervised pre-training of
Vision Transformers can outperform supervised pre-training. Rather than being pre-trained to
predict the class of an image, as is done in the original Vision Transformer model, BEiT models
are pre-trained to predict visual tokens from the codebook of OpenAI's DALL-E model given
masked patches. These models are regular Vision Transformers that are pre-trained in a
self-supervised way, and they have been shown to outperform both the original Vision
Transformer and Data-efficient Image Transformers when fine-tuned on the ImageNet-1K and
CIFAR-100 datasets.
Fig. 1. Architecture of BEiT
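As a hedged illustration, a pre-trained BEiT classifier can be run through the Hugging Face transformers library as sketched below; the checkpoint name, the processor class (BeitImageProcessor in recent transformers releases), and the image path are assumptions, not the exact setup used in this work.

```python
# Minimal BEiT inference sketch (assumed checkpoint and input image).
from PIL import Image
from transformers import BeitImageProcessor, BeitForImageClassification

processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

image = Image.open("sign.png").convert("RGB")      # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits                    # shape: [1, num_labels]
pred = logits.argmax(-1).item()
print(model.config.id2label[pred])                 # predicted class name
```

For the traffic-sign task, the classification head would be replaced and the model fine-tuned on the sign classes.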
b. You Only Look Once (YOLOv5)
YOLO is a unified model for object detection. It processes an entire image in one pass and is able
to identify objects in the image with a single inference. YOLO is trained using a single full-image-
size network, which optimizes the network parameters to best identify objects in the image. This
approach helps improve accuracy and speed by reducing the number of false positives and false
negatives. YOLO also uses a single evaluation for the entire image, which means it can detect
multiple objects in a single image with a single pass. For ensembling, we use YOLO version 5
(YOLOv5), since it is stable and provides good accuracy.

Fig. 2. The architecture of YOLO
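As a hedged illustration, YOLOv5 can be loaded through torch.hub as below; the "yolov5s" variant and the image path are assumptions, and a fine-tuned checkpoint would be substituted in practice.

```python
# Sketch: loading a YOLOv5 detector via torch.hub and running one image.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("road_scene.jpg")        # hypothetical test image
boxes = results.xyxy[0]                  # tensor rows: x1, y1, x2, y2, conf, class
print(results.pandas().xyxy[0])          # human-readable detections
```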

c. Sequential Convolutional Neural Networks

A sequential CNN is a type of deep learning architecture that is composed of a series of layers
arranged in a linear sequence. In a sequential CNN, each layer processes the input data and passes
it on to the next layer, with the output of one layer serving as the input for the next. The layers in a
sequential CNN can be either convolutional layers, which apply a convolution operation to the
input data to extract features, or fully connected layers, which perform a matrix multiplication and
apply an activation function to the resulting output. Sequential CNNs are commonly used for tasks
such as image classification and object recognition, where the input data is in the form of an image
and the network learns to extract features from the image and classify it based on those features.
One of the key advantages of sequential CNNs is their ability to learn hierarchical representations
of the input data, which allows them to achieve good performance on a wide range of tasks.
Fig. 3. Architecture of the sequential CNN model

Table 1. Sequential model layers specification
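Since Table 1 holds the exact layer specification used in this work, the following Keras sketch shows only a plausible sequential CNN of this kind; the filter counts, input size, dropout rates, and 43-class output (matching GTSRB) are assumptions.

```python
# A plausible sequential CNN sketch for 43-class traffic sign classification.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(30, 30, 3)),  # normalize pixels
    layers.Conv2D(32, (5, 5), activation="relu"),
    layers.Conv2D(32, (5, 5), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(43, activation="softmax"),  # 43 GTSRB classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```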


d. Faster Region Based Convolutional Neural Network (Faster R-CNN)
Faster R-CNN is a two-stage object detection model that was introduced in 2015 by Ren et al. It is composed
of two main components: a region proposal network (RPN) and a Fast R-CNN detector. The RPN generates
a set of candidate object regions (called region proposals) in the input image, which are then passed to the
Fast R-CNN detector for classification into one of the predefined classes. The specific Faster R-CNN model
that we have used, "fasterrcnn_resnet50_fpn," uses a ResNet-50 CNN architecture as the backbone network
and a Feature Pyramid Network (FPN) to generate the region proposals. The ResNet-50 CNN is a deep
convolutional neural network that has been trained on the ImageNet dataset and has achieved strong
performance on a variety of computer vision tasks. The FPN is a network architecture that uses a pyramid of
feature maps at different scales to generate region proposals that are more robust to scale and aspect ratio
variations.
Overall, Faster R-CNN is a powerful object detection model that has achieved good results on various
benchmarks and has been widely used in real-world applications. It is known for its effectiveness in
generating accurate region proposals and classifying them into the correct classes.
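A minimal sketch of loading this model in torchvision and adapting its box-predictor head to a custom label set follows; the class count assumes the four Road Sign Detection classes plus background.

```python
# Sketch: torchvision Faster R-CNN with a replaced box-predictor head.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = 5  # assumed: 4 sign classes + background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
model.eval()  # switch to inference mode
```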

Fig. 4. The architecture of Faster R-CNN


e. Ensemble Technique

Ensemble techniques are used to improve the accuracy of predictive models by combining multiple models
into a single, more powerful model. These techniques generally work by training multiple models on the
same data and combining their predictions. The idea is that by combining different models, each with its
own strengths and weaknesses, the overall accuracy of the model can be improved. Ensemble techniques can
also be used to reduce overfitting, as the combination of multiple models can create a more robust model that
is less prone to overfitting.

Ensemble methods are a type of machine learning technique that involves combining multiple algorithms in
order to produce a more accurate and reliable result. By combining several algorithms, the bias of a single
algorithm can be reduced, as well as its variance. This is because different algorithms may make different
mistakes, and combining them together can help to average out the errors.
Additionally, combining algorithms can also increase the accuracy of the overall model, since different
algorithms may capture different underlying patterns in the data.

Ensemble techniques are also used in unsupervised learning tasks, such as clustering and anomaly detection.
In these tasks, multiple models are used to identify more accurate clusters or outliers. In addition, ensemble
techniques are used to improve the efficiency of deep learning models. By combining multiple models, it is
possible to reduce the computational cost of training deep learning models while still achieving good results.
Different methods of ensembling used are:
1. Or: The "or" method of ensembling involves combining the predictions of multiple models by
taking the union of their predictions. This can be useful when the goal is to identify all of the
instances of a particular class, rather than just the most likely instance.
2. And: The "and" method of ensembling involves combining the predictions of multiple models by
taking the intersection of their predictions. This can be useful when the goal is to identify only
those instances that are predicted by all of the models.
3. Weighted fusion: Weighted fusion involves combining the predictions of multiple models by
assigning different weights to each model's prediction and taking a weighted average of the
predictions. This allows you to adjust the influence of each model on the final prediction.
4. Average: The average method of ensembling involves simply averaging the predictions of multiple
models. This can be useful when the goal is to smooth out the predictions of the individual models
and reduce the variance of the final prediction.

The most appropriate ensembling method depends on the specific characteristics of the data and the problem
being solved. Experimenting with these four ensembling methods and evaluating their performance on the data
helped us determine the best approach for this particular problem, as illustrated in the sketch below.
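The sketch below applies the four combination rules to toy predictions; all probabilities, label sets, and weights are illustrative assumptions, not values used in our experiments.

```python
# Toy demonstration of the average, weighted-fusion, OR, and AND rules.
import numpy as np

p1 = np.array([0.7, 0.2, 0.1])   # model 1 class probabilities (assumed)
p2 = np.array([0.6, 0.3, 0.1])   # model 2 class probabilities (assumed)

avg = (p1 + p2) / 2                          # average rule
w_avg = 0.7 * p1 + 0.3 * p2                  # weighted fusion (assumed weights)

d1 = {"stop", "speed_limit"}                 # model 1 detected labels
d2 = {"stop"}                                # model 2 detected labels
or_fused = d1 | d2                           # OR: union of detections
and_fused = d1 & d2                          # AND: intersection of detections

print(avg.argmax(), w_avg.argmax(), or_fused, and_fused)
```

For detectors, the same rules operate on sets of boxes rather than label sets, with IoU used to decide when two boxes refer to the same object.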

4. Proposed Methodology

In this research, we present a novel approach for object detection and classification using an ensemble of four
different models: YOLOv5, Faster R-CNN, sequential CNN, and BEiT. These models were trained on three
different datasets: Belgium Traffic Sign image data, GTSRB (German Traffic Sign Recognition Benchmark),
and Road Sign Detection from Kaggle. Specifically, YOLOv5 and Faster R-CNN were used for object
detection, while sequential CNN and BEiT were used for classification.

To create various combinations of the models, we employed a combination of AND, OR, weighted box fusion,
and averaging techniques. These techniques allowed us to blend the weaker predictions from the individual
models and obtain a single strong prediction for each image. We then used bounding box fusion to combine
overlapping predictions, following the affirmative selection of the predictions.
To further improve the performance of our ensemble, we modified the weights of the various models in the
ensemble to generate different ensemble combinations, which we refer to as our "ensembling activation
parameters." This allowed us to fine-tune the relative contributions of each model in the ensemble and optimize
the overall performance.

We evaluated the performance of our proposed approach using various metrics such as precision, recall, and
accuracy. The results of our experiments showed that our ensemble approach was able to achieve superior
performance compared to the individual models, demonstrating the effectiveness of our proposed solution.
There were a few challenges that we encountered while implementing our proposed solution. One challenge was
the large amount of data required to train the individual models. To address this, we employed various data
augmentation techniques to generate additional training examples.

Another challenge was the computational cost of training and evaluating the ensemble, which required the use
of powerful hardware and efficient algorithms. Overall, our proposed solution demonstrated the potential of
using an ensemble approach for object detection and classification tasks and showed that it is possible to
significantly improve performance by combining the predictions of multiple models. We believe that our
approach has the potential to be applied to a wide range of real-world applications and hope that it will inspire
further research in this area.
Fig. 5. Architecture of the proposed system

5. Results and Discussion

a. Dataset:
For the study, we used the GTSRB, Belgium Traffic Sign image data, and Road Sign Detection datasets
from Kaggle.
The GTSRB is a dataset of traffic sign images taken from German roads. It consists of over 50,000 images
of traffic signs, with a total of 43 different classes of signs. The images were taken under a variety of
conditions, including different lighting conditions, weather conditions, and from different angles.
The Belgian Traffic Sign dataset is a collection of traffic sign images gathered from the Belgian roads. The
dataset consists of over 5,000 images of traffic signs, with a total of 62 different classes of signs. The images
were taken from various angles and under different lighting conditions, and the dataset includes both color
and grayscale images.
The Road Sign Detection dataset from Kaggle is a dataset of traffic sign images taken from roads. It consists
of 877 images of traffic signs across 4 classes (Traffic Light, Stop, Speed Limit, Crosswalk). The images
were taken under a variety of conditions, including different lighting and weather conditions, and from
different angles.
All three of these datasets are commonly used for training and evaluating machine learning models for
traffic sign recognition and classification tasks. They provide a large and diverse set of images that can be
used to train and test the performance of different algorithms and models.
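As a hedged illustration, a class-per-folder image dataset of this kind can be loaded with TensorFlow as below; the directory path, split ratio, image size, and batch size are assumptions rather than the exact settings used in this work.

```python
# Sketch: loading training and validation splits from a class-per-folder layout.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "GTSRB/Train",                 # hypothetical local path
    validation_split=0.2, subset="training", seed=42,
    image_size=(30, 30), batch_size=64)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "GTSRB/Train",
    validation_split=0.2, subset="validation", seed=42,
    image_size=(30, 30), batch_size=64)
```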

b. Training:
We obtained the training and validation datasets for our two object detection models from the Road Sign
Detection dataset on Kaggle. We used the union of the GTSRB and Belgium Traffic Sign image data to train
our image classifiers. It took roughly 200 epochs for each model to reach a preset, industry-standard level of
classification and localization losses. To avoid overfitting, we stopped training when validation accuracy
stopped improving between epochs, as sketched below. We used different libraries and modules to train the
models and performed fine-tuning. We then deployed the models on Auto Trainer and created an XML file
for each image. After each training session, we checked the validation accuracy.
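A minimal sketch of this early-stopping idea with a Keras callback, reusing the model and dataset objects from the earlier sketches; the monitored metric and patience value are assumptions.

```python
# Stop training when validation accuracy plateaus (patience is assumed).
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_accuracy", patience=10,
                           restore_best_weights=True)
history = model.fit(train_ds, validation_data=val_ds,
                    epochs=200, callbacks=[early_stop])
```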

c. Testing:
We evaluated the performance of our model using the test dataset, which consists of 12,630 images. For
each image, we made predictions using our model and recorded the boxes, confidence scores, and labels in a
separate text file. After making the predictions, we ran a preprocessing step and then evaluated the
performance of the model based on the predicted text files.
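A small sketch of the per-image prediction dump described above; the file naming and the "label score x1 y1 x2 y2" line format are assumptions.

```python
# Write one prediction file per test image for later evaluation.
import os

def dump_predictions(image_id, boxes, scores, labels, out_dir="preds"):
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, f"{image_id}.txt"), "w") as f:
        for (x1, y1, x2, y2), s, c in zip(boxes, scores, labels):
            f.write(f"{c} {s:.4f} {x1:.1f} {y1:.1f} {x2:.1f} {y2:.1f}\n")

dump_predictions("00001", [(10, 20, 50, 60)], [0.97], ["stop"])
```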
d. Prediction Ensembling:
Prediction ensembling is the process of combining the predictions of multiple machine learning models to
make a final prediction. It is a common technique used in machine learning to improve the performance and
robustness of a model.
 AND fusion: This involves taking the intersection of the predicted classes of the individual models. For
example, if model 1 predicts class A and model 2 predicts class B, the AND fusion of these predictions
would be the null set (no classes). This method can be used if you want to be very confident in the final
prediction and only consider classes that are predicted by all of the models.

Fig. 6. AND Fusion

 OR fusion: This involves taking the union of the predicted classes of the individual models. For
example, if model 1 predicts class A and model 2 predicts class B, the OR fusion of these predictions
would be the set of classes A and B. This method can be used if you want to consider all of the classes
that are predicted by any of the models.

Fig. 7. OR Fusion

 Weighted fusion: This involves weighting the predictions of the individual models based on their
accuracy or some other metric. The final prediction is then made by combining the weighted
predictions of the models. This method can be used if you want to give more emphasis to the
predictions of certain models. In the NMS method, the boxes are considered as belonging to a single
object if their overlap, Intersection over Union (IoU) is higher than some threshold value. Thus, the
boxes filtering process depends on the selection of this single IoU threshold value, which affects the
performance of the model.

Fig. 8. Weighted Fusion

 Averaging: This involves taking the average of the predictions of the individual models. This can be
effective if the models are relatively unbiased and make similar types of errors.
Fig. 9. Averaging

After making predictions for each image using multiple models, we combined the predictions to create a single,
unified prediction for each image. To do this, we used a variety of techniques including AND fusion, OR fusion,
Weighted Boxes Fusion, and averaging. We also performed preprocessing on the combined predictions, resizing
boxes to their original size and removing boxes with low confidence scores or inconsistent formatting. These
combined predictions were used as the input for an ensemble model in order to mitigate any size mismatches
that may have been introduced during model training. The goal of this process was to reduce the number of
False Positives in the final predictions.
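One way to realise the weighted-fusion step is the open-source ensemble_boxes package, whose weighted_boxes_fusion function merges overlapping boxes; note that it expects coordinates normalised to [0, 1], and the weights and thresholds below are assumptions.

```python
# Sketch: Weighted Boxes Fusion over two models' detections (toy values).
from ensemble_boxes import weighted_boxes_fusion

boxes_list = [[[0.10, 0.10, 0.40, 0.40]],      # model 1 boxes (normalised)
              [[0.12, 0.11, 0.42, 0.38]]]      # model 2 boxes (normalised)
scores_list = [[0.90], [0.80]]
labels_list = [[1], [1]]

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[2, 1], iou_thr=0.55, skip_box_thr=0.05)
print(boxes, scores, labels)
```

Unlike NMS, which keeps a single surviving box, weighted boxes fusion averages the coordinates of all overlapping boxes, which is why it tends to localise objects better in ensembles.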

6. Evaluation Parameters
a. Evaluation
For evaluating the different ensembling methods, we track the following parameters:
 True Positives (TP): the predicted box matches a ground truth box.
 False Positives (FP): the predicted box is wrong.
 False Negatives (FN): no prediction even though a ground truth box exists.
 Precision: measures how accurate the predictions are, i.e., the percentage of predictions that are
correct: TP / (TP + FP).
 Recall: measures what percentage of the ground truth was predicted: TP / (TP + FN).
 Average Precision: the area under the precision-recall curve (in this case, all points are considered
for the area under the graph).
Precision and recall were measured at various confidence score thresholds in order to compute the
precision-recall curve. A prediction was counted as a True Positive if its IoU with the ground truth box was
greater than or equal to 50 percent, as sketched below.
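A minimal sketch of the matching procedure used to count these quantities for a single class; the dictionary field names and the iou_fn parameter (for example, the IoU helper sketched in the next subsection) are assumptions.

```python
# Greedy matching of predictions to ground truth at IoU >= 0.5 (one class).
def match_detections(preds, gts, iou_fn, iou_thr=0.5):
    tp, matched = 0, set()
    for det in sorted(preds, key=lambda d: -d["score"]):   # high score first
        best_i, best_iou = None, 0.0
        for i, gt_box in enumerate(gts):
            if i in matched:
                continue                                   # one match per GT box
            v = iou_fn(det["bbox"], gt_box)
            if v > best_iou:
                best_i, best_iou = i, v
        if best_i is not None and best_iou >= iou_thr:
            tp += 1                                        # True Positive
            matched.add(best_i)
    fp = len(preds) - tp                                   # unmatched predictions
    fn = len(gts) - tp                                     # missed ground truths
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```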

b. IOU - Intersection Over Union


The Jaccard Index, also known as the Intersection over Union (IOU) score, is a measure of the overlap
between two bounding boxes. It is commonly used in object detection tasks to determine whether a predicted
bounding box is a true positive. To calculate the IOU score, the area of the intersection between the
predicted bounding box and the ground truth bounding box is divided by the area of the union between the
two boxes. This score can then be compared to a threshold to determine whether the prediction should be
considered a true positive. An example of this is shown in the figure below, where the green bounding box
represents the ground truth and the red bounding box represents the prediction.

Fig. 10. Green bounding box represents the ground truth and the red bounding box represents the
prediction.
Our goal is to compute the Intersection over Union between these bounding boxes, which is determined as
IoU = (area of overlap) / (area of union).

Fig. 11. Computing the Intersection over Union is as simple as dividing the area of overlap between the
bounding boxes by the area of union
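For concreteness, a small helper implementing this computation for boxes given as [x1, y1, x2, y2] corner coordinates might look as follows.

```python
# IoU of two axis-aligned boxes in [x1, y1, x2, y2] corner format.
def iou(box_a, box_b):
    # Intersection rectangle corners
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou([10, 10, 50, 50], [30, 30, 70, 70]))  # partial-overlap example
```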

c. Precision X Recall Curve


A precision-recall curve is a plot of the precision of a model on a test dataset as a function of the
recall of the model. Precision is a measure of the proportion of true positive predictions made by
the model out of all positive predictions, while recall is a measure of the proportion of true positive
predictions made by the model out of all actual positive cases in the test dataset. The higher the
precision and recall of a model, the better its performance. In general, a model with a higher
precision-recall curve is considered to be a better performer than a model with a lower precision-
recall curve. The shape of the precision-recall curve can provide insights into the trade-off between
precision and recall for a given model.

The precision-recall curve is a useful tool for evaluating the performance of an object detector on a
per-class basis. As recall increases, the curve reflects the trade-off between precision and recall for
a given class. A good object detector for a particular class should have a high precision at all levels
of recall. This means that the model is able to accurately identify instances of the class with a high
level of confidence, even as it becomes more sensitive to detecting the class.

On the other hand, a model with a lower precision may be less reliable, but it may be able to detect
a larger number of instances of the class, resulting in a higher recall. The shape of the precision-
recall curve can provide insights into the strengths and weaknesses of a given object detector.
Fig. 12. Precision X Recall Curve

d. Average Precision
Average precision (AP) is a metric used to evaluate the performance of object detection models. It
is defined as the average of the precision values at different recall levels. The precision-recall
curve is a plot of the precision of a model on a test dataset as a function of the recall of the model.
The average precision is calculated by first computing the precision and recall values at different
thresholds, plotting these values on the precision-recall curve, and taking the area under this
curve. A model with a higher average precision is considered a better performer than a model
with a lower average precision, and AP is often used to compare the performance of different
object detection models. One way to compare object detectors is the Area Under the Curve
(AUC) metric. This can be challenging, however, because the curves produced by different
detectors may cross each other. In these cases, it is helpful to use the Average Precision statistic,
which averages precision over recall levels from zero to one. In recent years, the PASCAL VOC
challenge has changed the way AP is calculated: it is now computed by interpolating all data
points. Our research follows the current PASCAL VOC submission guidelines, which involve
interpolating all data points in order to accurately compare the performance of different object
detectors.
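A sketch of the all-point interpolation described above; the recall and precision arrays are illustrative.

```python
# All-point-interpolated Average Precision in the PASCAL VOC style.
import numpy as np

def average_precision(recalls, precisions):
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make the precision envelope monotonically decreasing (interpolation)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

print(average_precision(np.array([0.2, 0.4, 0.6]),
                        np.array([1.0, 0.8, 0.6])))  # -> 0.48
```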

e. Recorded Metrics

Fig. 13. Individual Model Metrics


Fig. 14. Ensembled Model Result
Fig. 15. Comparative Predictions of different Object Detection Models

The ensemble approach discussed in this paper has been used for object detection to increase the
precision of existing object detection models. From the research conducted and the outcomes
recorded, it is safe to conclude that the proposed ensembled model outperformed the individual
object detection models. The ensemble method can significantly reduce the number of images that
must be manually annotated in order to train an object detection model. The proposed model
achieved better accuracy and mean Average Precision, localized objects better, and reduced False
Positives and False Negatives. The ensemble model surpassed the individual object detection
models across the board in terms of accuracy and performance.
As a result, this model can be used to design a warning traffic sign detection system for drivers. The
images will be taken with a camera mounted on the car, and after preprocessing, the ensemble
algorithm will be used to perform the recognition process.

Fig. 16. Comparative Predictions of different Object Detection Models


7. Conclusion
In conclusion, the use of ensemble techniques for object detection and recognition of traffic signs has proven
to be an effective method for improving the performance of machine learning models. By training multiple
models and combining their predictions, ensemble techniques can reduce the variance and improve the
generalization ability of the model, leading to more accurate and robust results. In this research paper, we
demonstrated the effectiveness of ensemble techniques through a series of experiments on a dataset of traffic
sign images. Our results showed that the use of ensemble techniques resulted in a significant improvement in
the accuracy of the object detection and recognition model compared to using a single model. These findings
suggest that ensemble techniques should be considered as a potential method for improving the performance
of object detection and recognition models in the field of traffic sign analysis.

8. Future Scope
The model put forward in this research brings us one step further toward the ideal Advanced Driver
Assistance System or a fully autonomous system, but there is still a lot that can be done to improve it. The
sign's color and shape are important factors in identification. If the sign's color is affected by a reflection,
this is a concern; similarly, if the sign is chipped or cut off, its shape is impaired, resulting in no detection.
Nighttime detection is another crucial aspect to take into account. The application could also add a
text-to-speech feature: in the existing application the motorist must read the words on the classified sign,
but a speech module would provide greater comfort. With the help of new datasets and data from other
countries, overall performance could be enhanced. To further improve such an ensembled model, other
combinations and hyperparameter exploration can be pursued.

Bibliography

[1] A. Groener, G. Chern, and M. Pritt, “A Comparison of Deep Learning Object Detection Models for
Satellite Imagery,” 2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), pp. 1–10,
Oct. 2019, doi: 10.1109/AIPR47015.2019.9174593.

[2] R. Ray and S. R. Dash, “Comparative Study of the Ensemble Learning Methods for Classification
of Animals in the Zoo,” in Smart Intelligent Computing and Applications, vol. 159, S. C. Satapathy,
V. Bhateja, J. R. Mohanty, and S. K. Udgata, Eds. Singapore: Springer Singapore, 2020, pp. 251–
260, doi: 10.1007/978-981-13-9282-5_23.

[3] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma, “A survey on ensemble learning,” Front. Comput.
Sci., vol. 14, no. 2, pp. 241–258, Apr. 2020, doi: 10.1007/s11704-019-8208-z.

[4] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–
794, Aug. 2016, doi: 10.1145/2939672.2939785.

[5] P. Singh, “Comparative study of individual and ensemble methods of classification for credit
scoring,” in 2017 International Conference on Inventive Computing and Informatics (ICICI),
Coimbatore, Nov. 2017, pp. 968–972, doi: 10.1109/ICICI.2017.8365282.

[6] L. Rokach, “Ensemble-based classifiers,” Artif Intell Rev, vol. 33, no. 1–2, pp. 1–39, Feb. 2010,
doi: 10.1007/s10462-009-9124-7.

[7] Y. Ren, L. Zhang, and P. N. Suganthan, “Ensemble Classification and Regression-Recent


Developments, Applications and Future Directions [Review Article],” IEEE Comput. Intell.
Mag., vol. 11, no. 1, pp. 41–53, Feb. 2016, doi: 10.1109/MCI.2015.2471235.
[8] B. Ghojogh and M. Crowley, “The Theory Behind Overfitting, Cross Validation, Regularization,
Bagging, and Boosting: Tutorial,” arXiv:1905.12787 [cs, stat], May 2019, [Online]. Available:
http://arxiv.org/abs/1905.12787.

[9] J. Xu, W. Wang, H. Wang, and J. Guo, “Multi-model ensemble with rich spatial information for
object detection,” Pattern Recognition, vol. 99, p. 107098, Mar. 2020, doi:
10.1016/j.patcog.2019.107098.

[10] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object Detection With Deep Learning: A Review,”
IEEE Trans. Neural Netw. Learning Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019, doi:
10.1109/TNNLS.2018.2876865.

[11] Y. Wu et al., “Rethinking Classification and Localization for Object Detection,” in 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA,
Jun. 2020,
pp. 10183–10192, doi: 10.1109/CVPR42600.2020.01020.

[12] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv:1804.02767 [cs],
Apr. 2018, [Online]. Available: http://arxiv.org/abs/1804.02767.

[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time
Object Detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 779–788, doi: 10.1109/CVPR.2016.91.

[14] W. Liu et al., “SSD: Single Shot MultiBox Detector,” arXiv:1512.02325 [cs], vol. 9905, pp. 21–
37, 2016, doi: 10.1007/978-3-319-46448-0_2.

[15] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp.
1137–1149, Jun. 2017, doi: 10.1109/TPAMI.2016.2577031.

[16] O. Sagi and L. Rokach, “Ensemble learning: A survey,” WIREs Data Mining Knowl Discov, vol.
8, no. 4, Jul. 2018, doi: 10.1002/widm.1249.
[17] M. A. Abdullah, S. M. Senouci, and A. Bouzerdoum, "A survey of traffic sign recognition techniques,"
Neural Computing and Applications, vol. 29, no. 8, pp. 3389-3408, 2018.

[18] J. Y. Kim, S. H. Lee, and J. W. Cho, "Color-based traffic sign detection and recognition," Pattern
Recognition, vol. 48, no. 3, pp. 835-847, 2015.

[19] J. Zhang, D. Chen, and C. C. Loy, "Traffic light detection and classification in the wild," in Proceedings of
the IEEE International Conference on Computer Vision, pp. 1222-1231, 2017.

[20] L. Zhang, Y. Li, and D. D. Feng, "Traffic light detection and recognition using transfer learning," in
Proceedings of the IEEE International Conference on Intelligent Transportation Systems, pp. 1347-1352, 2017.

[21] S. R. M. Ferreira, A. M. R. Carvalho, and C. R. Jung, "Multi-class traffic light classification using color
and shape features," in Proceedings of the IEEE International Conference on Intelligent Transportation Systems,
pp. 1353-1358, 2017.

[22] A. K. Jha and M. C. Fairchild, "A data fusion approach to traffic sign recognition," in Proceedings of the
IEEE Intelligent Transportation Systems Conference, pp. 1217-1222, 2015.
[23] Y. Zhang, L. Zhang, and D. D. Feng, "Adaptive traffic sign recognition under varying illumination
conditions," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 1016-1021, 2018.

[24] Y. Liu, Y. Yang, and X. Li, "Traffic light detection in video streams using static and dynamic features," in
Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 1223-1228, 2016.

[25] J. Liu, Y. Zhu, and H. Li, "Contextual traffic light detection and recognition," in Proceedings of the IEEE
Intelligent Transportation Systems Conference, pp. 717-722, 2017.

[26] C. Song, J. Zhang, and D. Chen, "Real-time traffic signal recognition using multimodal features," in
Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 3829-3834, 2019.

[27] Z. Wu, Y. Zhang, L. Zhang, and D. D. Feng, "Traffic light detection using fusion of image and LiDAR
data," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 1363-1368, 2017.

[28] M. G. Hasan, M. M. Hasan, and M. A. Quasem, "An ensemble approach for traffic sign detection and
recognition," in Proceedings of the IEEE International Conference on Advanced Information Networking and
Applications, pp. 1471-1478, 2016.

[29] L. Zhang, Y. Li, and D. D. Feng, "An ensemble approach for traffic light detection," in Proceedings of the
IEEE International Conference on Intelligent Transportation Systems, pp. 526-531, 2016.

[30] C. Song, J. Zhang, and D. Chen, "Real-time traffic signal recognition using multimodal features," in
Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 3829–3834, 2019.

[31] Z. Wu, Y. Zhang, L. Zhang, and D. D. Feng, "Traffic light detection using fusion of image and LiDAR
data," in Proceedings of the IEEE Intelligent Transportation Systems Conference, pp. 1363-1368, 2017.