
REPUBLIC OF TURKEY

ALTINBAŞ UNIVERSITY
Institute of Graduate Studies
Electrical and Computer Engineering

REAL TIME OBJECT DETECTION AND


RECOGNITION BASED ON DEEP LEARNING

Rana Ali Hussein ANI

Master’s Thesis

Supervisor
Asst. Prof. Dr. Sefer KURNAZ

Istanbul, 2022
REAL TIME OBJECT DETECTION AND RECOGNITION BASED ON
DEEP LEARNING

Rana Ali Hussein ANI

Electrical and Computer Engineering

Master’s Thesis

ALTINBAŞ UNIVERSITY

2022
The thesis/dissertation titled REAL TIME OBJECT DETECTION AND RECOGNITION
BASED ON DEEP LEARNING prepared by RANA ALI HUSSEIN ANI and submitted on
12/08/2022 has been accepted unanimously for the degree of Master of Science.

Thesis Defense Committee Members:


Asst.Prof.Dr. Sefer Kurnaz

Supervisor

Asst. Prof. Dr. Sefer Kurnaz, Faculty of Engineering and
Natural Sciences,
Altinbas University __________________

Asst. Prof. Dr. Oguz KARAN, Faculty at C and System
Programmers Association,
Altinbas University __________________

Asst. Prof. Dr. Serdar Kargın, Electronic and
Telecommunication Engineer,
Beykent University __________________

I hereby declare that this thesis meets all format and submission requirements of a Master's
thesis. Submission date of the thesis to the Institute of Graduate Studies: ___/___/___

Rana Ali Hussein ANI

Signature

ABSTRACT

REAL TIME OBJECT DETECTION AND RECOGNITION BASED ON


DEEP LEARNING

ANI, Rana Ali Hussein

Master’s Thesis of Electrical and Computer Engineering, Altınbaş University,

Supervisor: Asst. Prof. Dr. Sefer Kurnaz

Date: 08/2022

Pages: 63

Efficient and accurate object detection has been a hot topic in the advancement of computer
vision systems. The accuracy of object detection has increased dramatically since the
introduction of deep learning techniques. This project aims to combine state-of-the-art object
detection techniques with the goal of achieving high accuracy together with real-time
performance. A significant challenge in many object detection systems is the reliance on other
computer vision techniques to support the deep learning-based approach, which results in slow
and sub-optimal performance. In this project, we use an entirely deep learning-based approach to
solve the problem of object detection end-to-end. The network, built on the Darknet-53
backbone, is trained on one of the most difficult publicly available datasets (PASCAL VOC), on
which an annual object detection challenge is held. The resulting system is fast and accurate,
making it useful for applications that require object detection. We train the network parameters
and compare the mean average precision computed from pre-trained network parameters. In
addition, we propose a post-processing method for performing real-time object tracking in live
video feeds.

Keywords: Object Detection, Deep Learning, Convolutional Neural Network (CNN), YOLO,
Real-Time Recognition.

TABLE OF CONTENTS
Pages

ABSTRACT ................................................................................................................................... v
LIST OF TABLES ..................................................................................................................... viii
LIST OF FIGURES ..................................................................................................................... ix
ABBREVIATIONS ...................................................................................................................... xi
1. INTRODUCTION ............................................................................................................... 1
1.1 OBJECT RECOGNITION ................................................................................................ 3
1.2 MOTIVATION.................................................................................................................. 5
1.3 PROBLEM DEFINITION................................................................................................. 6
1.4 OBJECTIVES.................................................................................................................... 6
1.5 RESEARCH CONTRIBUTIONS ..................................................................................... 6
1.6 THESIS ORGANIZATION .............................................................................................. 7
2. BACKGROUND .................................................................................................................. 9
2.1 MACHINE LEARNING ................................................................................................... 9
2.1.1 Neural Networks ......................................................................................................... 12
2.1.2 Convolutional Neural Networks ................................................................................. 14
2.2 OBJECT DETECTION AND CLASSIFICATION ........................................................ 15
2.2.1 CNN for Object Detection .......................................................................................... 15
2.2.1.1 Convolutional layers ............................................................................................. 16
2.2.1.2 Pooling layers ....................................................................................................... 17
2.2.1.3 Fully connected layers .......................................................................................... 17
2.2.2 Training ...................................................................................................................... 18
2.2.3 Transfer learning......................................................................................................... 19
2.3 RELATED WORK .......................................................................................................... 20
2.4 LIMITATIONS OF EXISTING WORK......................................................................... 21
2.5 ANCHOR BOX ............................................................................................................... 22
3. PROPOSED SYSTEM ...................................................................................................... 24
3.1 PROPOSED SYSTEM .................................................................................................... 24
3.1.1 Design ......................................................................................................................... 25
3.1.2 Bounding Box Prediction ........................................................................................... 26
3.1.3 Class Prediction .......................................................................................................... 27
3.1.4 Prediction Across Scales ............................................................................................. 28
3.1.5 Feature Extractor ........................................................................................................ 29
3.1.6 Training ...................................................................................................................... 30
3.2 THE DATASET .............................................................................................................. 32
4. RESULTS AND DISCUSSION ........................................................................................ 33
4.1 SYSTEM WORKING ..................................................................................................... 33
4.2 IMPLEMENTATION ..................................................................................................... 35
4.2.1 How it works .............................................................................................................. 38
4.2.2 Anchor box ................................................................................................................. 39
4.3 OBJECT DETECTION RESULTS ................................................................................ 41
4.4 COMPARISON TO OTHER REAL-TIME SYSTEMS ................................................. 43
4.5 CONCLUSION ............................................................................................................... 44
5. CONCLUSION AND FUTURE WORKS ...................................................................... 45
5.1 CONCLUSION ............................................................................................................... 45
5.2 FUTURE WORKS .......................................................................................................... 46
REFERENCES ............................................................................................................................ 47

LIST OF TABLES
Pages

Table 3.1: Number of Classes (80 Classes) .................................................................................. 28

Table 4.1: Standard HyperParameters .......................................................................................... 36

Table 4.2: Real-Time Systems on PASCAL VOC 2007 .............................................................. 43

LIST OF FIGURES

Pages

Figure 2.1: Gradient descent, vanilla implementation [51] ............................................................ 10

Figure 2.2: The figure illustrates the neuron and its main operations [51]. .................................. 13

Figure 2.3: A schematic of an MLP with five layers [52] ............................................................. 13

Figure 2.4: CNN architecture illustration from [8] ...................................................................... 16

Figure 2.5: Basic flow in transfer learning [14] ............................................................................ 19

Figure 3.1: The System Model [56]. ............................................................................................. 25

Figure 3.2: The Proposed deep learning architecture ................................................................... 26

Figure 3.3: Bounding boxes process. ............................................................................................ 27

Figure 3.4: Darknet-53 [57] .......................................................................................................... 30

Figure 3.5: PASCAL VOC dataset with annotation [7] ............................................................... 32

Figure 4.1: Definition of the bounding boxes ............................................................................... 33

Figure 4.2: Encoding architecture for YOLO ............................................................................... 34

Figure 4.3: Flattening the last two dimensions .............................................................................. 35

Figure 4.4: Model Architecture..................................................................................................... 37

Figure 4.5: Example cell output for (3 × 3) × 8 output layer ........................................................ 38

Figure 4.6: Example output for CNN with 2 anchor boxes .......................................................... 39

Figure 4.7: Graph between the Accuracies and Epochs ............................................................... 40

Figure 4.8: Loss Function .............................................................................................................. 40

Figure 4.9: Real time object detection 1 ....................................................................................... 41

Figure 4.10: Real time object detection 2 ..................................................................................... 42

ABBREVIATIONS

AI : Artificial Intelligence

AP : Average Precision

CNN : Convolutional Neural Network

IOU : Intersection over Union

MAP : mean Average Precision

1. INTRODUCTION

One of the most powerful types of artificial intelligence is computer vision, which focuses on
mimicking or replicating the human visual system. Using computer vision, the computer first
detects and then processes the objects in images and videos. With advances in machine learning
and deep learning, the field may be able to beat humans at certain tasks such as object detection
and labeling. The enormous amount of data generated, which is used to train and further develop
computer vision, is driving the field's accelerated growth. The number of computer vision
applications has grown exponentially. Before the emergence of machine learning and deep
learning, the performance of computer vision was limited and manual coding was required. For
instance, if facial recognition is to be done, there are three main steps, starting with the creation
of the database, followed by annotation of the images, and finally capturing a new image. In
database creation, the individual images of the subjects are captured and stored in a specific
format. After this, annotation of the images is carried out, wherein for every single image several
key points need to be entered, such as the distance between the nose and upper lip, the distance
between the two eyes, and numerous other measurements that characterize the unique qualities
of each individual. The subsequent stage is to capture new images, either from photographs or
from video content, and then follow the above-mentioned process of marking the key points. In
this case, the angle at which the image was taken also plays an important role. After completion
of all this manual work, the application can compare the measurements obtained in the new
image with those stored in the database, and then report whether it matched any of the images
that one wanted to track. As the work is carried out manually with very little automation
involved, the chances of error are high.

To solve this problem, machine learning came up with a different approach in which no manual
coding had to be carried out. The image's features were learned with the assistance of machine
learning. Smaller applications that could identify specific patterns in images started this trend.
Statistical learning algorithms such as linear regression, logistic regression, and support vector
machines are used to identify patterns and classify images, as well as detect objects in them.
Machine learning could tackle many challenging problems; the best known is the prediction of
cancerous cells. However, the feature building required a lot of effort. Traditional machine
learning approaches included many complicated steps along with the cooperation of various
domain experts, software engineers, and mathematicians.

A fundamentally different approach was taken by deep learning. It relies on a neural network
that can solve problems which can be represented through examples. When the input is given
to the neural network, the network can extract the common patterns between the examples,
and those are transformed into a mathematical equation that helps to predict the information.
Considering the same facial recognition example described earlier, deep learning requires only a
preconstructed algorithm; training is imparted to it in the form of example faces, and the
network can detect faces without further instructions on measurements or features. Deep
learning is viewed as one of the best techniques for performing computer vision. A good deep
learning algorithm accounts for a large number of training examples, and the parameters can be
changed to retrain the model for a variety of different applications. Parameters in this context
include the number of hidden layers, the type of layers, training epochs, and many others. When
compared to classical machine learning, deep learning is relatively faster to develop and deploy.
Popular examples of deep learning include self-driving cars, cancer detection, and face
recognition. Deep learning deals with a huge amount of complicated information and analyzes it
efficiently. Geoff Hinton, Yann LeCun, Andrew Ng, Andrej Karpathy, and Yoshua Bengio are
famous researchers working in the area of deep learning. Companies working with deep learning
include Google, Apple, NVIDIA, Toyota, and many more. The main aim of deep learning is to
mimic the human brain. Deep learning is at the foundation of several extraordinary recent
improvements in image processing, natural language processing, science, autonomous driving,
artificial intelligence, and beyond [1]. This is because, compared to many traditional machine
learning techniques, deep learning excels in four closely related areas:

1. It detects complex nonlinear patterns.
2. It learns the importance of variables automatically.
3. It learns highly complex patterns from a few examples.
4. It learns abstract representations.

With the evolution of the digital era has come a vast amount of data in every possible form,
from every possible region of the world. This is referred to as Big Data, and its sources include
search engines, social media, online cinemas, and many more. This huge amount of data can be
accessed easily and shared using cloud computing. However, the unstructured data is so vast
that it might take decades for humans to extract useful information from it. To overcome this
challenge, companies started using artificial intelligence (AI) systems to support the task
automatically. Machine learning is one of the most common AI techniques used to process big
data. With the help of a self-adaptive algorithm, patterns can be analyzed better, and
accordingly, new data can be incorporated. For example, if fraud needs to be detected by a
digital payments company, it can use machine learning algorithms for this. The machine learning
algorithm can process all the ongoing transactions, identify the patterns in the dataset, and point
out any anomaly that deviates from the pattern.

Deep learning is viewed as a subset of machine learning that uses a hierarchy of layers to carry
out the learning process [2]. Artificial neural networks (ANNs) are built in a way that attempts
to imitate the human brain. In an ANN, all the neurons are connected. In the traditional method,
the analysis of the data was carried out linearly, but with the introduction of deep learning,
machines can process the data in a nonlinear fashion.

The various applications of deep learning are as follows:
i. Self-driving cars
ii. Visual recognition
iii. Object detection and recognition
iv. Fraud detection
v. Healthcare
vi. Natural language processing

1.1 OBJECT RECOGNITION

At the outset, the most difficult task is distinguishing among the related computer vision tasks.
It may be difficult, for example, to tell apart image classification, object localization, and object
detection, because all three terms fall under object recognition. Image classification is the
process of assigning a class label to an image; object localization is the process of drawing a
bounding box around the objects in an image. Beyond these two comes object detection, which
is more challenging than the two tasks mentioned above, as it includes both: drawing a bounding
box around each object in the image along with assigning it a class label. Together, these tasks
are referred to as object recognition. A common example is the identification of objects in
digital photographs. Object segmentation is one of the extensions of these computer vision
tasks. It is also referred to as "semantic segmentation" or "object instance segmentation". In this
task, the recognized instances of the objects are highlighted instead of drawing a bounding box.
So, to summarize: in image classification, algorithms generate the categories of the objects in
the image. In single-object localization, algorithms generate the category list of the objects
present in the image, as well as a bounding box, whereas in object detection, algorithms generate
the list of object categories in the image, as well as a bounding box that indicates the scale and
position of every instance of each object category.

Past work in object detection has involved the extraction of features using algorithms such as
HoG [4, 5], SIFT [5], and SURF [6]. These algorithms use conventional machine learning
approaches, such as feature extraction followed by training the algorithm to produce the desired
output. Deep learning algorithms, on the other hand, have demonstrated a critical advantage over
conventional machine learning approaches by learning directly from the raw data. There are no
manually extracted features. In the traditional machine learning approach, a disjoint pipeline is
used to extract the features, classify the regions, and accordingly predict the bounding boxes for
the regions where the score is high. Deep learning algorithms are distinct in that they employ a
single network to perform feature extraction as well as bounding box prediction and other tasks.
As a result, a more accurate and faster object detection system is obtained. A popular example of
this is vehicle detection and recognition. It is a challenging example, as the images of the
vehicles are affected by several factors and may be distorted.

1.2 MOTIVATION

R-CNN and other recent approaches use region proposal methods to generate potential bounding
boxes in an image before running a classifier on these proposed boxes. Post-processing is used
after classification to refine the bounding box, eliminate duplicate detections, and rescore the
box in light of other objects in the scene. Since every individual component must be trained
separately, these complicated pipelines are slow and hard to optimize. We reframe object
detection as a single regression problem, mapping image pixels directly to bounding box
coordinates and class probabilities. Using our framework, You Only Look Once (YOLO) at an
image to predict what objects are present and where they are. YOLO is refreshingly simple. A
single convolutional network predicts multiple bounding boxes and class probabilities for those
boxes simultaneously. YOLO trains on full images and directly optimizes detection
performance. This unified model has several advantages over conventional object detection
methods. First of all, YOLO is extremely fast. We do not need a complicated pipeline, since we
frame detection as a regression problem. To predict detections, we simply run our neural
network on a new image at test time.

Second, unlike sliding window and region proposal based methods, YOLO sees the entire image
during training and testing, allowing it to encode contextual information as well as the
appearance of classes. Because it cannot see the larger context, Fast R-CNN, a top detection
method, mistakes background patches in an image for objects. Compared to Fast R-CNN,
YOLO makes less than half the number of background errors. Third, YOLO learns
generalizable representations of objects. YOLO outperforms top detection methods such as
DPM and R-CNN when trained on natural images and tested on artwork. Since it is highly
generalizable, YOLO is less likely to break down when applied to new domains or unexpected
input.

1.3 PROBLEM DEFINITION

The objective of this work is to preserve the full detail features of small objects by extracting the
multi-scale features of the object while also guaranteeing the integrity of the large object's
features.

Existing object detection models fail to detect small, low-resolution objects that are influenced
by noise. This is because the features fail to fully represent the fundamental characteristics of the
small objects after repeated convolution operations in existing models. Therefore, there is a need
for a model that better captures the parameters and hyper-parameters that influence the detection
and recognition of objects of various sizes and shapes. Several measures, such as accuracy,
frames per second, epochs, dropout, learning rate, resolution, precision, and recall, are
considered. Our work also uses the model to detect partial faces in a video sequence, with
promising results. The scope of our work is limited to the PASCAL VOC dataset [7] and the
FDDB dataset [8], respectively.

1.4 OBJECTIVES

1. To study deep learning algorithms for object detection and identify the most suitable
architecture for real-time object detection.
2. To identify research problems, hyper-parameters, and datasets from existing work that give
direction to the proposed work.
3. To implement the proposed architecture and fine-tune it with hyper-parameters.
4. To implement an efficient method for multi-scale anchor boxes to achieve a better FPS rate
for the enhancement of accurate real-time object detection.
5. To analyze and evaluate the experimental results of the proposed work and the existing
algorithms, considering mAP and FPS as the evaluation measures.

1.5 RESEARCH CONTRIBUTIONS

The key contributions of this work are summarized as follows:

• The design and implementation of a model for the accurate detection and recognition of
objects from a video sequence.
  o A study of computer vision and machine learning, gradually moving to deep learning. The
proposed work enables object detection and recognition across varying object sizes.

• The design of a model to better understand the parameters and the hyper-parameters that
influence the detection and the recognition of objects of varying sizes and shapes. Several ideas
from existing deep learning models are combined to construct a convolutional neural network
for object detection and recognition, improving the existing model to overcome the limitation of
detecting and recognizing small objects. The underlying issue is that the feature map used has
very low resolution, and the small object features become too small to be detectable.

The work is carried out on a benchmark dataset, the PASCAL VOC dataset. The results obtained
on the dataset are compared and justified against the expected output. The technical aspects of
the recognition framework are highlighted in our work by giving a number of implementation
details and attempts. The complexity of the proposed work is also highlighted by reporting the
BFLOPS calculation and the number of learnable parameters.

1.6 THESIS ORGANIZATION

The thesis describes the model used to detect small objects from a video sequence using deep
learning. The entire thesis is divided into six chapters. The work focuses on improving the
accuracy and the execution time of detecting objects of different sizes in a video sequence. We
have also worked on partial face detection using deep learning. The initial contribution of the
research focuses on objects that are very challenging to detect, as they are strongly influenced by
the surrounding environment and have low resolution. By extracting the multi-scale features of
the image, the proposed model preserves the full detail features of small objects while also
guaranteeing the integrity of the features of large objects. Therefore, the detection accuracy of
small objects improves.

Chapter 2 (Literature Review): This chapter presents a survey of various approaches to machine
learning, machine learning algorithms, the limitations of machine learning, and deep learning
along with its various models and applications. However, the literature specifically related to the
different applications and models of object detection is discussed as the corresponding chapter is
elaborated.

Chapter 4 (Object Detection and Recognition using an Efficient Multi-Scale Anchor Box): This
chapter focuses on the implementation of the existing work carried out in object detection and
recognition, and the results are compared with the proposed approach. The proposed model is
explained in detail along with its implementation. The comparison of the results clearly shows
how effective our proposed model is in detecting and recognizing objects from a video sequence.
Chapter 5 (Partial Face Detection using Deep Learning): This chapter deals with work that
detects partially covered faces to a great extent using the proposed approach. This chapter serves
as an application of our proposed work. Chapter 6 (Conclusion and Future Work): In this
chapter, conclusions are drawn from the various implementations discussed above. Future
research directions are also given in this chapter.

2. BACKGROUND

In this chapter, an introduction to Convolutional Neural Networks is given, with a thorough
description of the architecture.

2.1 MACHINE LEARNING

Machine learning is the use of algorithms to make sense of data. Mitchell (1997) characterizes
learning as "a computer program learning from an experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E." The goal of supervised machine learning is to learn a model from labeled
training data that will allow us to make predictions about unseen data. One could, for example,
try to predict whether an email is spam or not spam. Following Mitchell's definition, predicting
whether an email is spam is the task T. To do that, we could train a model on a dataset of emails
previously classified as spam or not spam (this dataset is the experience E in Mitchell's
definition) to predict whether a new email belongs to either of the two categories. This
relationship is normally modeled by a function that estimates the response variable y as a
function of the input variables x and tunable parameters w that are adjusted to accurately
describe the relationship between the input variables and the prediction:

ŷ = f(x, w) (2.1)

The parameters w are learned from data. One of the key ingredients of machine learning algorithms
is the objective function (or loss function) J(w) that is to be minimized during the learning process.
For a given dataset, we want to find a w that minimizes this cost function:

w = arg min_w J(w) (2.2)

A typical loss function for continuous variables is the quadratic or L2 loss:

J(w) = (1 / 2m) Σ_{(x_i, y_i) ∈ D_train} (ŷ_i − y_i)² (2.3)

where |D_train| = m. The 1/2 term in the equation above is useful to simplify the derivations and
does not alter the final result of the optimization. Another fundamental ingredient of machine
learning is the optimization algorithm. The aim of the optimization is to find the weights of the
learning algorithm that minimize the loss function. Gradient descent is a versatile tool for
optimizing functions for which the gradient can be computed analytically. It is one of the most
popular methods for optimizing neural networks. The weights to optimize are initialized
randomly and iteratively updated by following the rule:

w_k = w_{k−1} − α ∇J(w_{k−1}), ∀k ∈ ℕ (2.4)

The α (also known as learning rate) determines the size of steps to reach a (local) minimum. The
vanilla implementation of the gradient descent is described below:

Figure 2.1: Gradient descent, vanilla implementation [51]
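Figure 2.1 reproduces pseudocode from [51]. As a concrete companion, below is a minimal
sketch of the update rule of Equation 2.4 in Python with NumPy, applied to the quadratic loss of
Equation 2.3 on a toy linear model; the data, model, and learning rate are illustrative
assumptions, not values taken from this thesis:

import numpy as np

# Toy training set: a noisy line y = 2x + 1 (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=x.shape)

def grad(w):
    # Analytic gradient of the quadratic loss J(w) of Equation 2.3
    # for the linear model f(x, w) = w[0] * x + w[1].
    residual = (w[0] * x + w[1]) - y
    return np.array([np.mean(residual * x), np.mean(residual)])

w = rng.normal(size=2)   # random initialization of the weights
alpha = 0.5              # learning rate (step size)
for _ in range(200):     # Equation 2.4: w_k = w_{k-1} - alpha * grad J(w_{k-1})
    w = w - alpha * grad(w)

print(w)  # approaches (2, 1)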

Other optimization algorithms have been widely used by the deep learning community to
overcome several challenges found when training deep architectures, such as accelerating
convergence, choosing the proper learning rate and learning rate schedule, optimizing highly
non-convex functions (very common in neural networks), and avoiding getting trapped in their
numerous suboptimal local minima. Mini-batch gradient descent, in particular, is frequently
used. It iteratively applies gradient-based updates to a small subset of the dataset. The batch size
refers to the number of data points in these subsets. Momentum is a gradient descent extension
that can speed up convergence by applying exponential smoothing to the gradient update.
The momentum method is useful when the optimization is close to a local minimum. The surface
of this region usually curves much more steeply in one dimension than in another. Hence, gradient
descent presents an oscillatory behavior, making it difficult for the training process. The
momentum term increases when the gradients of two dimensions point in the same direction and
decreases when the gradients of two dimensions change directions. As a result, we achieve faster
convergence.
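
A minimal sketch of the momentum update described above, reusing the grad function from the
previous sketch; the coefficient values are conventional illustrative choices, not taken from this
thesis:

import numpy as np

w = np.zeros(2)       # weights to optimize (same toy model as above)
v = np.zeros_like(w)  # velocity: exponentially smoothed gradient
alpha, beta = 0.1, 0.9

for _ in range(200):
    g = grad(w)               # gradient at the current point
    v = beta * v + alpha * g  # grows while gradients keep the same direction
    w = w - v                 # damps oscillating dimensions, speeding convergence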

Other useful optimization algorithms are either momentum variants or use other tricks to speed
up training. The ADAM method is also widely used, and it is the method we use in this work. To
assess a machine learning method's capabilities, we must first define a quantitative measure of
performance. Typically, this performance P is specific to the task T. We frequently measure the
accuracy of the model, that is, the proportion of correct results that a classifier achieves, for tasks
such as classification (as in categorizing an email as spam or not spam). For tasks such as
regression, we often measure the Mean Absolute Error (MAE) or the Mean Squared Error
(MSE). In general, it is often difficult to choose a performance measure that corresponds well to
the desired behaviour of the system. In a later section we present the main performance measures
applied in this work and the difficulty of choosing the measure that best fits the problem.

Finally, supervised machine learning tasks can be classified as either classification or regression.
Classification refers to supervised learning with discrete class labels, such as the previous spam
or not-spam example. Regression, on the other hand, is the subcategory of supervised learning in
which the outcome is a continuous value. We are dealing with a regression task in this work.

2.1.1 Neural Networks

Artificial Neural Networks (ANNs) are a type of model that can have a very large number of
parameters and have proven to be extremely effective at capturing patterns in complex data. The
fundamental idea behind artificial neural networks was inspired by theories and models of how
the human brain solves complex problems. Although ANNs have grown in popularity in recent
years, the first studies of neural networks date back to the 1940s, when the McCulloch-Pitts
neuron was described. A neural network is composed of a fundamental unit called the neuron.
This unit, represented in Figure 2.2, combines the product of the inputs with the weight vector
and applies a non-linear activation function:

y_i = f(b + Σ_{j=1}^{n} x_j w_{ij}) (2.5)

where y_i is the neuron output, w is the neuron weight vector, x is the neuron input, and b is the
bias term. Also, f(·) is the activation function, which guarantees that the composition of several
neurons can potentially give a non-linear model. The most popular activation functions are
softplus, SELU, ELU, the sigmoid function, the rectified linear unit (ReLU), and the hyperbolic
tangent. Equation 2.5 can be rewritten by treating the bias term as an additional weight w_0 with
x_0 = 1 [51]:

y_i = f(Σ_{j=0}^{n} x_j w_{ij}) = f(x, w) (2.6)

The main limitation of this type of architecture is that it is only capable of learning linearly
separable boundaries, which seriously restricts its applications.
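
As a concrete illustration of Equations 2.5 and 2.6, a minimal sketch of a single neuron in
Python with NumPy, using ReLU as the activation; the input and weight values are arbitrary
assumptions:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b):
    # Equation 2.5: y = f(b + sum_j x_j * w_j), with f = ReLU here.
    return relu(b + np.dot(x, w))

x = np.array([0.5, -1.0, 2.0])   # neuron inputs
w = np.array([0.1, 0.4, -0.2])   # neuron weights
b = 0.05                         # bias term
print(neuron(x, w, b))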

Figure 2.2: The figure illustrates the neuron and its main operations [51].

We can connect multiple single neurons to obtain a Multilayer Perceptron (MLP), shown in
Figure 2.3. This special type of fully connected network is also known as a multilayer
feedforward neural network. The advantage of the MLP over the single-layer networks
presented before is that the MLP is capable of learning both linearly separable and non-linearly
separable decision boundaries. The layers of a neural network can be interpreted as feature
extractors. A layer combines the previous layer's features into 'higher level', more meaningful
features that, in the end (in the output layer), enable a more powerful prediction based on the
representation built in the previous layers [52].

Figure 2.3: A schematic of an MLP with five layers [52]

We can add an arbitrary number of hidden layers to the MLP to create deeper network
architectures. However, the error gradients calculated by the backpropagation algorithm
(ZHANG, 2000) in order to obtain the neuron weights become smaller and smaller as more
layers are added to the network. As a result, model learning becomes extremely difficult. This is
known as the vanishing gradient problem in the literature, and several different approaches have
been developed to aid the training of such deep neural networks (SZEGEDY et al. (2017); HE et
al. (2016)).

2.1.2 Convolutional Neural Networks

In recent years, Convolutional Neural Networks (CNNs) (LECUN; BENGIO; HINTON, 2015)
have been extremely successful in several practical applications, accomplishing state-of-the-art
performance in a diverse range of tasks such as image classification (KRIZHEVSKY;
SUTSKEVER; HINTON, 2012), speech processing (VAN DEN OORD et al., 2016), and many
others. CNNs are composed of a series of layers that apply a set of convolution kernels to the
previous layer's output, followed by a non-linear mapping defined as [53]:

a_i^l = σ(Σ_j a_j^{l−1} ∗ k_{i,j}^l + b_i^l) (2.7)

where ∗ denotes a convolution, a_i^l is the i-th activation on the l-th layer, k_{i,j}^l and b_i^l are
the learnable j-th 2D convolution kernel and the bias term associated with output activation i at
level l, respectively, and σ is the activation function, which is typically a rectified linear unit,
ReLU, defined as max(0, x).

The fundamental difference between an MLP (a stack of multiple densely connected layers) and
a CNN (a stack of multiple convolution layers) is that the dense layers learn global patterns in
their input feature space, whereas convolution layers learn local patterns in small 2D windows
(the kernels) of the inputs. Thus, CNNs make neural networks for image processing more
tractable by making the connectivity of neurons between two adjacent layers sparse. Therefore,
convolution layers have fewer parameters than fully connected layers.

These key characteristics give CNNs two properties: 1) the patterns they learn are translation
invariant. This makes CNNs data efficient, so CNNs need fewer images to learn good
representations. 2) They can learn hierarchical patterns. This means that the first convolutions
learn small local patterns, the later convolutions combine these local representations into local
objects, and finally the last convolutions combine these objects into high-level concepts such as
cat, dog, car, etc. CNNs have also found applications in a large range of low-level image
processing tasks such as denoising and image colorization [53].

2.2 OBJECT DETECTION AND CLASSIFICATION

The task of finding and processing one or more objects in a digital image or video stream in
order to act on them according to the developer's desired outcome is known as object
classification. This could include validation, edge detection, color transformation, face
recognition, and many other functions. With ever more powerful technology, this is even
becoming a standard feature on many everyday devices used all over the world, such as the
Artificial Intelligence camera feature of Huawei smartphones [5], which is just one of many
examples where object detection and classification are used as an integrated tool.

2.2.1 CNN for Object Detection

Yann LeCun popularized them in the 1980s. He created LeNet [6], a multi-layer neural network
that combines convolutional neural networks with backpropagation algorithms to allow a
computer to classify handwritten digits. Splitting the problem across multiple layers instead of
single-layer networks resulted in a significant performance gain. AlexNet [7] won the ILSVRC-
2012 competition in 2012, using the ImageNet dataset, which contained 1.2 million images from
1000 distinct classes. It achieved an impressive 15.3 percent error rate, while the runner-up had a
26.2 percent error rate. In terms of processing human visual input, this revolutionized image
classification in computer vision. Since then, CNNs have been at the forefront of machine
learning and artificial intelligence with regard to human visual inputs (objects, handwriting,
facial recognition, image classification, and many others), and they are widely used in
smartphones and AI tools around the world.

Convolutional Neural Networks (CNNs) are a subcategory of Deep Neural Networks (DNNs).
Like DNNs, they are composed of neurons with learnable weights and biases. In contrast to the
hidden layers of plain neural networks, CNNs stack multiple feature-extraction layers that are
fully connected, with the classification stage at the end of the chain, as displayed in Figure 2.4.
Convolutional, pooling, and fully connected layers are the three main components of the
architecture.

Figure 2.4: CNN architecture illustration from [8]

With regard to image classification, the image is submitted as input. It is processed and
classified according to the classes defined in the CNN model specified for the given task. From
the standpoint of a computer, the image is processed as an array of pixels defined by Height ×
Width × Dimension.

Assuming the input image is a standard color image, the array is 640 × 480 × 3, where the
dimension describes the RGB channels. The image is processed by passing through the various
layers.

2.2.1.1 Convolutional layers

This is the first layer of the CNN design. It extracts features from images by iterating over them
and applying filters (known as kernels). This is achieved by partitioning the image matrix into
smaller squares while preserving the relationship between pixels. Each split is referred to as a
feature map [54]. The arithmetic operation is:

• Image matrix (h × w × d)
• Filter (f_h × f_w × d)
• Combined output ((h − f_h + 1) × (w − f_w + 1) × 1).

Each feature map includes a filter that performs operations such as edge detection, line
detection, intensity, sharpening, blurring, and shaping. The filter size is normally small, for
example 3 × 3 pixels. To keep the pixels "connected", the filter moves a few pixels at a time
across the image matrix, causing an overlap; these steps are known as strides. Padding is
required if a filter exceeds the image matrix. This is achieved by either adding zeros to the image
(zero-padding) or valid padding, which keeps only the valid parts of the image. The Rectified
Linear Unit (ReLU) is an additional step for the convolutional operation that adds non-linearity
to the CNN. With ReLU, the data consist of non-negative values: the output function iterates
through the matrix, setting all negative values to 0 [54].

2.2.1.2 Pooling layers

Pooling layers are placed between successive convolutional layers to reduce variance, reduce
computational complexity, and extract features. This is referred to as spatial pooling. It keeps the
significant features while shrinking the size of the feature maps. By reducing the number of
parameters, the network's memory usage decreases, allowing more features to be added [54].

2.2.1.3 Fully connected layers

The last layer in a CNN evaluates the output of the convolutional/pooling layers by flattening it
from a matrix into vectors and feeding them forward to the next fully connected layer, which
performs predictions using weights. To classify the image, the output layer then assigns final
probabilities to each label. To establish the probabilities, the last step is often combined with
softmax, which normalizes the neural network output to fit between 0 and 1 [55].
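
To illustrate how the convolutional, pooling, and fully connected layers described above fit
together, a minimal sketch in Python with Keras follows; this toy architecture and its layer sizes
are illustrative assumptions, not the network used in this thesis:

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # Convolutional layer: 32 kernels of size 3 x 3 with ReLU non-linearity.
    # With no padding, a 64 x 64 x 3 input yields (64-3+1) x (64-3+1) x 32,
    # i.e. 62 x 62 x 32, matching the (h - fh + 1) x (w - fw + 1) formula above.
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    # Pooling layer: spatial pooling shrinks the feature maps to 31 x 31 x 32.
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Flatten the feature maps from matrices into a single vector.
    layers.Flatten(),
    # Fully connected layers; the softmax output normalizes the scores
    # into probabilities between 0 and 1, one per label.
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.summary()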

2.2.2 Training

The backpropagation algorithm is used to train a CNN in the same way that it is used to train
other feedforward neural networks [9]. It uses two stages, which are repeated several times for
each batch. The first step is to feed a batch of images through the entire network, where the
softmax layer yields a probability estimate. This is referred to as a forward pass. The network's
accuracy is then measured using a loss function, which computes the error difference between
the ground-truth training labels and the softmax estimates. The objective is to train the network
to produce a low loss (low loss equals high accuracy) [56].

The results of the loss function are then examined by an optimizer (adaptive learning method)
[10], which traverses back through the network and computes how the network's weights and
biases should be changed to decrease the loss, a process known as the backward pass. An epoch
is a completed cycle over the entire batch (all images) with both forward and backward passes.

• Training problems

While training a model to deliver a classifier that makes good and correct predictions, we may
encounter two problems with the given data set. The first is underfitting (high bias, low
variance). This problem usually occurs when the model cannot build an accurate model,
typically because of a too-small or non-linear data set, making the model produce inaccurate
predictions. The other problem is overfitting (high variance, low bias). This means that the
model was over-trained on the given data set or on images taken in a static environment, making
the model focus on image noise or details that are irrelevant to the task at hand. There are
multiple ways to avoid this, including using annotated images to augment the data set, bounding
boxes around the object of interest, and early stopping, which monitors the loss during training
and stops training if it increases. The simplest way to monitor this is to divide our data set into
two parts: training and testing (validation). Then we can see when the accuracy on the test set is
peaking and know when the model is in a good-fit state. The accuracy of an underfitting model
would be poor for both the training and testing data sets. Follow the TensorFlow tutorial for
good practical models and a detailed explanation of how to handle over/underfitting [11].
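
A minimal sketch of the early stopping and train/validation split described above, in Keras,
reusing the toy model from the previous sketch; the random placeholder data and the patience
value are illustrative assumptions:

import numpy as np
import tensorflow as tf

# Random placeholder data standing in for a real image data set (assumption).
x_train = np.random.rand(500, 64, 64, 3).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 10, 500), 10)

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Hold out 20% of the data as the testing (validation) split; stop training
# when the validation loss stops improving and keep the best-epoch weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)
model.fit(x_train, y_train, validation_split=0.2, epochs=50,
          callbacks=[early_stop])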

2.2.3 Transfer learning

Training a model from scratch requires huge data sets and is considered a costly operation in
terms of hardware requirements and actual time spent. Instead, we use a pre-trained CNN model
that has already been trained on a large-scale data set, for example the well-known ImageNet
[12] or COCO (Common Objects in Context) [13]. To retrain the pre-trained model for the new
task at hand, the standard approach is to initialize the model and present a new and smaller data
set. It reuses and fine-tunes the existing model's features and weights to make probability
identifications of new objects based on the labels submitted with the data set.

Figure 2.5: Basic flow in transfer learning [14]
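
As a sketch of the transfer learning flow in Figure 2.5: a pre-trained backbone is loaded, its
weights are frozen, and a new classification head is fine-tuned on the smaller data set. The
backbone choice (MobileNetV2) and the class count are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models

# Backbone pre-trained on ImageNet, without its original classifier head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained features and weights

# New head fine-tuned on the smaller, task-specific data set.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation="softmax"),  # 5 new object classes (assumed)
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])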

2.3 RELATED WORK

According to a review of relevant literature, there are various projects and systems in the field of
object classification with a focus on animal detection. Since the images produced come from a
static environment inside a box, it was hard to find comparable research using the COAT
camera-trap data set. We focused on projects that use small or embedded DNNs or camera traps
as their data set source.

Nguyen et al. introduced a solution for automated wildlife monitoring using DNNs [15]. The
research is based on a single-labeled data set from the Wildlife Spotter project, a grayscale data
set with 107,000 images divided into 15 different animal species plus images without animals.
The data set was split 80/20 for training and validation. Because of the size of the data set, they
could conduct CNN training from scratch, implemented in Keras [16] and trained on four Nvidia
Titan X GPUs. The objective was to perform two types of classification: distinguishing animal
from non-animal images, and identifying the three most common species (bird, rat, and
bandicoot). For the model architecture, they chose to implement the lightweight simplified
version of AlexNet [17], the medium-sized VGG-16 [18], and ResNet-50 [19]. Each model's
training time ranged from 3 to 5 days. As a result, the three models could handle the first task
with 92.68 percent accuracy for AlexNet, 95.88 percent for VGG-16, and 95.65 percent for
ResNet-50. On the same models, the accuracies were 87.80 percent, 88.03 percent, and 87.97
percent for the second task of classifying the three species. We can conclude from our model's
performance in animal classification that using a pre-trained model is a good approach to
handling the main problem, as we beat this approach with a training time of just 3 hours on a
lightweight model. In terms of the static environment and grayscale images, the data set
complexity is in our favor.

In 2017, H. Thom introduced an animal identification and classification system [4]. It is based
on the COAT [1] camera-trap data set, which contains 8000 images of 9 distinct species. Using
CNNs, the system unifies three different object detection methods. One of the models used in
our thesis is YOLOv2 [20], the second-generation YOLO architecture. The model trained on the
camera-trap data set reaches 92.4 percent accuracy. He encountered the same problems with data
set variety and unbalanced classes in his research that we did with our data set. He was likewise
exposed to the importance of data set preparation. The addition of bounding boxes and data set
augmentations resulted in a significant increase in precision. The overall objective of this
research is very similar to our approach of reducing manual work when handling large amounts
of image data in classification processes.

In his master's thesis, S. Thommasen used the same data set as H. Thom. The aim was to build a
small embedded computer that could classify the animals using a small neural network. The
model architecture trained for the task comprised several variants of the small MobileNet [22],
with 61 percent accuracy. He described the difficulties with data set preparation and model
conversion on small embedded edge computers in his thesis. In light of this, we wanted to
approach this task using a state-of-the-art object classifier design from 2020, and the results
show that edge computing on small embedded devices can handle the problems stated in our
thesis.

2.4 LIMITATIONS OF EXISTING WORK

In Section 2.2, various methods of object detection were discussed. In methods like R-CNN,
Fast R-CNN, and Faster R-CNN, the algorithm first identifies the bounding boxes and then
classifies each of the bounding boxes separately. This is referred to as the two-step object
detection process. The Region Proposal Network (RPN) is the first step, providing several
regions that are passed to the deep learning architecture. Moving from R-CNN to Fast R-CNN
and Faster R-CNN, various strategies and variations have been applied to these region proposal
networks (RPNs). Such algorithms have shown better detection performance than their
single-step counterparts, but are slower in real-time object detection. The bottleneck observed in
two-step object detection is latency. Many one-step object detection architectures, such as
YOLO, YOLOv2, SSD, RetinaNet, and others, have been proposed for real-time object
detection, with the detection and classification steps combined. These changes allow one-step
detectors to run more quickly. However, because they do not work on each bounding box
separately, their limitation is that performance is not good for detecting small objects. In the
existing work, one-step object detection architectures suffered from the problem of detecting
small objects accurately. The problem is the large grid size and fixed-size anchor boxes, due to
which the small object features become too small to be detectable. Also, each of the anchor
boxes goes through n classifications (where n denotes the number of classes), which results in a
greater number of comparisons, thereby taking a large amount of execution time.

2.5 ANCHOR BOX

The idea of the anchor box plays an important role when a Convolutional Neural Network is
used for object detection. It can be tuned to improve the performance of the algorithm on the
dataset. If the anchor boxes are not tuned properly, the neural network model will never learn
that there are irregular, large, or small objects in the image/video, and they will go undetected.

The state-of-the-art object detection systems use the following concept of anchor boxes in
practice:

A. "Anchor boxes" are created for each predictor. Each predictor is specialized in predicting
objects of a particular shape, size, and location.

B. For every anchor box, the IoU is calculated (a minimal IoU computation is sketched after this
list). The three possible cases are as follows:

i. If the highest IoU is greater than 50%, then the anchor box should detect the object that gave
the highest IoU.

ii. Otherwise, if the IoU is greater than 40%, the neural network should be trained in such a way
that the true detection is treated as ambiguous and is not learned from that example.

iii. If the highest IoU is less than 40%, then the anchor box should predict that there is no object.
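
For reference, a minimal IoU computation for two axis-aligned boxes in (x1, y1, x2, y2) corner
format, as used when matching anchor boxes to ground truth in the cases above:

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 32, 32), (16, 16, 48, 48)))  # ~0.143: below the 40% threshold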

For example, in the RetinaNet configuration, the size of the anchor box is 32 × 32. This means
that objects smaller than 32 × 32 will go undetected. So, the general questions that come to mind
are what size the anchor boxes should be so as to detect both small and large objects, and what
shape of anchor box should be considered. A rough estimate can be obtained by calculating the
dataset's most extreme sizes and aspect ratios. The k-means clustering method is used to work
out the ideal bounding box priors (a simplified sketch follows this section). Another option is to
learn how to configure the anchor box. Recall that the IoU decreases when the centers of the
bounding box and the anchor box differ. If the stride between the anchor boxes is wide, then
small anchor boxes may miss some ground truth boxes. One of the solutions is to lower the IoU
threshold from 50% to 40%. To apply object detectors to a specific domain, the shape of the
anchor box needs to be manually modified to improve the accuracy. For example, in text
detection, the text can be taller or wider than a typical object, so the aspect ratios can be 1:5 and
5:1. Similarly, as the face is roughly square, the aspect ratio can be 1:1. So, soon after
determining the shapes of the anchors, their size can be fixed during training. If the size of the
anchor box is not proper, it can hamper the performance in specific domains. To solve this issue,
anchor box optimization can be carried out. The next section describes the approach of anchor
box optimization.
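
A simplified sketch of estimating anchor sizes by clustering the ground truth box dimensions of
a dataset; plain k-means over (width, height) pairs is used here for brevity, whereas YOLO-style
pipelines typically cluster with a 1 - IoU distance, so this is an illustrative assumption:

import numpy as np

def kmeans_anchors(wh, k=5, iters=50, seed=0):
    """Cluster (width, height) pairs of ground truth boxes into k anchors."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the nearest anchor (Euclidean in w-h space).
        d = np.linalg.norm(wh[:, None, :] - anchors[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors

# Toy box dimensions standing in for a real dataset's ground truth boxes.
wh = np.abs(np.random.default_rng(1).normal(50, 20, size=(500, 2)))
print(kmeans_anchors(wh, k=3))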

3. PROPOSED SYSTEM

The purpose of this work is to retain the complete detail features of small objects by extracting
the multi-scale features of the object while at the same time guaranteeing the integrity of the
large object's features. Existing object detection models struggle to distinguish small,
low-resolution objects that are impacted by noise. This is due to the features failing to accurately
reflect the key qualities of the small objects after repeated convolution operations in existing
models. As a result, there is a need for a model that better captures the qualities and
hyper-parameters that influence the detection and recognition of objects of different sizes and
structures. Several measures, including accuracy, frames per second, epochs, dropout, learning
rate, resolution, precision, and recall, are considered. Our work additionally uses the model to
recognize partial faces in a video sequence, with promising results [7].

3.1 PROPOSED SYSTEM

The separate components of object detection are unified into a single neural network. To predict each bounding box, our network uses features from the entire image, and it predicts all of an image's bounding boxes simultaneously. This means that our network reasons globally about the whole image and all of the objects in it. The YOLO design enables end-to-end training and real-time speed while maintaining high average precision. Our system divides the input image into an S x S grid; if the center of an object falls into a grid cell, that grid cell is responsible for detecting it. Formally, we define confidence as Pr(Object) ∗ IOU^truth_pred. If no object exists in a cell, the confidence score should be zero; otherwise, we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth. Each bounding box consists of five predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell, while the width and height are predicted relative to the whole image. Finally, the confidence prediction represents the IOU between the predicted box and any ground truth box. In addition, each grid cell predicts C conditional class probabilities, Pr(Class_i | Object), which are conditioned on the grid cell containing an object. Regardless of the number of boxes B, we predict only one set of class probabilities per grid cell. At test time we multiply the conditional class probabilities by the individual box confidence predictions,

Pr(Class_i | Object) ∗ Pr(Object) ∗ IOU^truth_pred = Pr(Class_i) ∗ IOU^truth_pred        (3.1)

This gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
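In code, Eq. (3.1) reduces to a broadcasted multiplication. The following NumPy sketch is illustrative; the array names and the sizes S = 7, B = 2, C = 20 are demonstration assumptions:

```python
import numpy as np

# box_confidence: (S, S, B, 1) -- Pr(Object) * IOU for each predicted box
# class_probs:    (S, S, 1, C) -- Pr(Class_i | Object) for each grid cell
box_confidence = np.random.rand(7, 7, 2, 1)
class_probs = np.random.rand(7, 7, 1, 20)

# Class-specific confidence scores of Eq. (3.1); result has shape (S, S, B, C)
class_scores = box_confidence * class_probs
```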

Figure 3.1: The System Model [56].

3.1.1 Design

This model is implemented as a convolutional neural network and evaluated on the PASCAL VOC detection dataset. The network's initial convolutional layers extract image features, while the fully connected layers predict the output probabilities and coordinates. The GoogLeNet image classification model inspired our network design; however, instead of GoogLeNet's inception modules, we simply use 1x1 reduction layers followed by 3x3 convolutional layers, as Lin et al. do. Figure 3.2 depicts the whole network.

Figure 3.2: The Proposed deep learning architecture

We also train a fast version of YOLO to push the limits of fast object detection.

3.1.2 Bounding Box Prediction

We predict the width and height of the box as offsets from cluster centroids, and we use a sigmoid function to predict the center coordinates of the box relative to the location of the filter application. Unlike most systems, ours assigns only one bounding box prior to each ground truth object:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)        (3.2)

Using logistic regression, YOLOv3 predicts an objectness score for each bounding box. This score should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold, we ignore the prediction:

Figure 3.3: Bounding boxes process.
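To make the parameterization of Eq. (3.2) concrete, here is a minimal decoding sketch; the function name and tensor layout are illustrative assumptions rather than any framework's API:

```python
import numpy as np

def decode_box(t, cell_xy, prior_wh):
    """Decode raw predictions (tx, ty, tw, th) into a box, following Eq. (3.2)."""
    tx, ty, tw, th = t
    cx, cy = cell_xy              # offset of the responsible grid cell
    pw, ph = prior_wh             # width/height of the matched anchor prior
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    bx = sigmoid(tx) + cx         # sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)          # size is a multiplicative offset from the prior
    bh = ph * np.exp(th)
    return bx, by, bw, bh
```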

3.1.3 Class Prediction

Each box predicts the classes that the bounding box may contain using multilabel classification. We do not use a softmax, as we have found it unnecessary for good performance; instead we use independent logistic classifiers. For the class predictions we use binary cross-entropy loss during training. This formulation helps when moving to more complex domains such as the Open Images Dataset, which contains many overlapping labels (e.g., Woman and Person). Using a softmax imposes the assumption that each box has exactly one class, which is often incorrect; a multilabel approach models the data better, as the sketch below illustrates.
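The difference between the two classification heads can be shown in a few lines of Keras; this fragment is illustrative, not the exact configuration of the thesis model, and the feature size of 256 is an assumption:

```python
from tensorflow import keras

features = keras.Input(shape=(256,))          # per-box feature vector (illustrative)

# Softmax head: forces exactly one class per box (the assumption we avoid)
softmax_head = keras.layers.Dense(80, activation="softmax")(features)

# Independent logistic classifiers: each class is scored on its own, trained
# with binary cross-entropy, so "Woman" and "Person" can both be active
multilabel_head = keras.layers.Dense(80, activation="sigmoid")(features)

model = keras.Model(features, multilabel_head)
model.compile(optimizer="adam", loss="binary_crossentropy")
```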
The Darknet-53-based detector distinguishes the 80 classes listed below:

Table 3.1: Number of Classes (80 Classes)

Person            Bicycle          Car              Motorbike        Aeroplane
Bus               Train            Truck            Boat             Traffic Light
Fire Hydrant      Stop Sign        Parking Meter    Bench            Bird
Cat               Dog              Horse            Sheep            Cow
Elephant          Bear             Zebra            Giraffe          Backpack
Umbrella          Handbag          Tie              Suitcase         Frisbee
Skis              Snowboard        Sports Ball      Kite             Baseball Bat
Baseball Glove    Skateboard       Surfboard        Tennis Racket    Bottle
Wine Glass        Cup              Fork             Knife            Spoon
Bowl              Banana           Apple            Sandwich         Orange
Broccoli          Carrot           Hot Dog          Pizza            Donut
Cake              Chair            Sofa             Potted Plant     Bed
Dining Table      Toilet           TV Monitor       Laptop           Mouse
Remote            Keyboard         Cell Phone       Microwave        Oven
Toaster           Sink             Refrigerator     Book             Clock
Vase              Scissors         Teddy Bear       Hair Drier       Toothbrush

3.1.4 Prediction Across Scales

YOLOv3 predicts boxes at three different scales. Using a concept similar to feature pyramid networks, our system extracts features from those scales. We add several convolutional layers to our base feature extractor. Then we take the feature map from two layers earlier and upsample it by a factor of two. We also use concatenation to merge a feature map from earlier in the network with our upsampled features. This method gives us more meaningful semantic information from the upsampled features as well as finer-grained information from the earlier feature map. We then apply a few more convolutional layers to this combined feature map, eventually predicting a tensor that is twice the size. We repeat the same design once more to predict boxes for the final scale. As a result, our predictions for the third scale benefit from all of the prior computation as well as fine-grained features from the early stages of the network. We still use k-means clustering to determine our bounding box priors.
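A minimal Keras sketch of the upsample-and-concatenate step described above; all layer sizes are illustrative assumptions rather than the exact YOLOv3 configuration:

```python
from tensorflow import keras

coarse = keras.Input(shape=(13, 13, 512))    # deep, semantically rich feature map
fine = keras.Input(shape=(26, 26, 256))      # earlier, finer-grained feature map

x = keras.layers.Conv2D(256, 1, padding="same", activation="relu")(coarse)
x = keras.layers.UpSampling2D(2)(x)          # 13x13 -> 26x26
x = keras.layers.Concatenate()([x, fine])    # merge semantics with fine detail
x = keras.layers.Conv2D(256, 3, padding="same", activation="relu")(x)
head = keras.layers.Conv2D(255, 1)(x)        # e.g. 3 anchors x (5 + 80) = 255

model = keras.Model([coarse, fine], head)
```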

3.1.5 Feature Extractor

For feature extraction we use a new network, a hybrid of the network used in YOLOv2, Darknet-19, and the newer residual-network ideas. The network uses successive 3x3 and 1x1 convolutional layers, but now also includes shortcut connections and is significantly larger. We call it Darknet-53 because it has 53 convolutional layers. This new network outperforms Darknet-19 while remaining more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

Figure 3.4: Darknet-53 [57]

3.1.6 Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset, using the first 20 convolutional layers from Figure 3.4 followed by an average pooling layer and a fully connected layer. The model is then converted to perform detection. Ren et al. show that pretrained networks can benefit from added convolutional and connected layers, so, following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Since detection often requires fine-grained visual information, we increase the input resolution of the network from 224x224 to 448x448. Our final layer predicts both class probabilities and bounding box coordinates.

The bounding box width and height are normalized by the image width and height so that they fall between 0 and 1, and the bounding box x and y coordinates are parameterized as offsets from a particular grid cell location, so they also lie between 0 and 1. We use a linear activation function for the final layer, and all other layers use the leaky rectified linear activation shown below:

φ(x) = x,      if x > 0
φ(x) = 0.1x,   otherwise

We optimize for sum-squared error in the output of our model. Although sum-squared error is easy to optimize, it does not perfectly align with our goal of maximizing average precision: it weights localization error equally with classification error, which may not be ideal, and many grid cells in each image contain no object.

To remedy this, we use two parameters, λ_coord and λ_noobj, set to λ_coord = 5 and λ_noobj = 0.5. Sum-squared error also weights errors in large and small boxes equally, whereas our error metric should reflect that small deviations in large boxes matter less than small deviations in small boxes. To partially address this, we predict the square root of the bounding box width and height instead of the width and height directly. YOLO predicts multiple bounding boxes per grid cell, but at training time we want only one bounding box predictor to be responsible for each object. We therefore assign one predictor as "responsible" for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization among the bounding box predictors: each predictor becomes better at predicting particular sizes, aspect ratios, or object classes, improving overall recall.
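For reference, the full sum-squared objective sketched above can be written out; this is the loss from the original YOLO formulation, which the description here follows, with 1_ij^obj indicating that predictor j in cell i is responsible for an object:

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2
   + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```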

We train the network for roughly 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012, including the VOC 2007 test data in our 2012 testing. Throughout training we use a batch size of 64, a momentum of 0.9, and a decay of 0.0005. Our learning rate schedule is as follows: for the first epochs we slowly raise the learning rate from 10^-3 to 10^-2; if we start at a high learning rate, our model often diverges due to unstable gradients. We continue training with 10^-2 for 75 epochs, then decrease to 10^-3 for 30 epochs, and finally to 10^-4 for 30 epochs.
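This schedule can be expressed as a simple Keras callback. The warm-up length below is an assumption, since the text does not state exactly how many epochs the initial ramp takes:

```python
from tensorflow import keras

def yolo_lr(epoch):
    """Piecewise schedule: ramp 1e-3 -> 1e-2, hold, then step down twice."""
    warmup = 5                                   # assumed ramp length in epochs
    if epoch < warmup:
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup
    if epoch < warmup + 75:
        return 1e-2
    if epoch < warmup + 105:
        return 1e-3
    return 1e-4

schedule = keras.callbacks.LearningRateScheduler(yolo_lr)
# model.fit(x, y, epochs=135, batch_size=64, callbacks=[schedule])
```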

We use dropout and extensive data augmentation to avoid overfitting. A dropout layer with rate 0.5 after the first connected layer prevents co-adaptation between layers. For data augmentation we apply random scaling and translations of up to 20% of the original image size, and we also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
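A hedged sketch of this augmentation using OpenCV for the color-space conversion; the interpolation, bound handling, and uniform sampling ranges are illustrative choices, not the exact pipeline used here:

```python
import numpy as np
import cv2  # assumed available for color-space conversion

def augment(img, rng=np.random.default_rng()):
    """Random scale/translate up to 20%, plus HSV saturation/exposure jitter up to 1.5x."""
    h, w = img.shape[:2]
    scale = rng.uniform(0.8, 1.2)
    tx, ty = rng.uniform(-0.2, 0.2, size=2) * (w, h)
    m = np.float32([[scale, 0, tx], [0, scale, ty]])
    img = cv2.warpAffine(img, m, (w, h))
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= rng.uniform(1 / 1.5, 1.5)     # saturation jitter
    hsv[..., 2] *= rng.uniform(1 / 1.5, 1.5)     # exposure (value channel) jitter
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```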

3.2 THE DATASET

The PASCAL VOC 2007 dataset [7] is the dataset under consideration. It contains 20 object classes and standardized sets of annotated images, together with a common set of tools for accessing the datasets and annotations. The benchmark consists mainly of two competitions, classification and detection; the classification objective is to predict the absence or presence of an example of each class in the test image.

Figure 3.5: PASCAL VOC dataset with annotation [7]

4. RESULTS AND DISCUSSION

This section outlines the dataset used to carry out the planned study and compares the results to previous work. The PASCAL VOC 2007 dataset [7] is the dataset under consideration. It contains 20 object classes and standardized sets of annotated images, and a common set of tools is provided for accessing the datasets and annotations.

4.1 SYSTEM WORKING

As a working scenario, suppose we are developing a self-driving car, and a vehicle detection system is a critical component of the project. To collect data, a camera mounted on the hood (front) of the car takes pictures of the road ahead every few seconds while the car drives around. We gathered these images into a folder and labeled them by drawing bounding boxes around every car found. Figure 4.1 shows an example of what the bounding boxes look like.

Figure 4.1: Definition of the bounding boxes

Assuming there are 80 classes that we want YOLO to recognize, we can represent the class label c either as an integer from 1 to 80 or as an 80-dimensional one-hot vector (80 numbers, one of which is 1 and the rest 0). Here we look at how YOLO works before applying it to car detection. Since training the YOLO model is computationally expensive, we load pre-trained weights.

First things to know:

i- The input is a batch of images of shape (m, 608, 608, 3).

ii- The output is a list of bounding boxes together with the recognized classes. Each bounding box is represented by 6 numbers (pc, bx, by, bh, bw, c) as explained above. If we expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

Figure 4.2: Encoding architecture for YOLO

If an object's center or midpoint falls into a grid cell, that grid cell is responsible for detecting the object. Since we are using 5 anchor boxes, each of the 19x19 cells encodes information for 5 boxes. Anchor boxes are defined only by their width and height. For simplicity, we flatten the last two dimensions of the (19, 19, 5, 85) encoding, so the deep CNN output is (19, 19, 425).

Figure 4.3: Flattening the last two dimensions
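The flattening step amounts to a single reshape; a minimal NumPy sketch:

```python
import numpy as np

# Raw head output: 19x19 grid, 5 anchors, 85 numbers per anchor
# (pc, bx, by, bh, bw, plus 80 class scores)
y = np.zeros((19, 19, 5, 85))

flat = y.reshape(19, 19, 5 * 85)   # -> (19, 19, 425), the deep CNN output shape
assert flat.shape == (19, 19, 425)
```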

4.2 IMPLEMENTATION

The implementation starts by downloading the dataset from the drive and then importing the relevant libraries such as TensorFlow Keras, NumPy, matplotlib.pyplot, and others. For data preprocessing, the images that go into the convnet are 150x150 color images; we add handling to resize all of the images to 150x150 before feeding them into the neural network. We stack three modules of convolution, ReLU, and max-pooling. Our convolutions operate on 3x3 windows and our max-pooling layers on 2x2 windows. The first convolution extracts 16 filters, the next one 32 filters, and the last one 64 filters. Two fully connected layers are added before the output layer, which uses softmax activation to classify the images. A minimal sketch of this stack is given below.
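The following Keras fragment mirrors the stack just described. The dense layer widths and the number of output classes are placeholder assumptions, since the text does not state them:

```python
from tensorflow import keras

num_classes = 3  # placeholder: set to the number of categories in the dataset

model = keras.Sequential([
    keras.Input(shape=(150, 150, 3)),
    keras.layers.Conv2D(16, 3, activation="relu"),   # 3x3 convolutions, 16 filters
    keras.layers.MaxPooling2D(2),                    # 2x2 max-pooling
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Flatten(),
    keras.layers.Dense(512, activation="relu"),      # two fully connected layers
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```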

Table 4.1: Standard Hyperparameters

Figure 4.4: Model Architecture

4.2.1 How it works

In the YOLO algorithm we apply the convolutional implementation of sliding windows to the input image, which effectively divides the image into a grid of cells. Each cell combines information about itself and the surrounding cells, and each cell outputs a vector similar to the one known from object localization. It contains the probability that the cell holds the central point of some object, the coordinates of that central point (to avoid problems with size, we assume the top-left corner of each grid cell has coordinates (0, 0) and the bottom-right corner has coordinates (1, 1)), the object's width and height (which may be greater than one, because the cell also has knowledge of the cells surrounding it), and the class probabilities (Figure 4.5 shows an example output for a single cell).

After propagating our image through a CNN we have a prediction for every cell. To get the final predictions we need to perform two steps (a code sketch of both steps follows this list):

i- Remove all predictions where pc is smaller than some threshold value, i.e., all predictions where the probability that the cell contains the central point of an object is below, for example, 50%.

ii- Use non-max suppression to get rid of the remaining redundant predictions.
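A minimal sketch of these two post-processing steps. The greedy suppression loop and the 0.5 thresholds below are illustrative defaults; boxes are assumed to be NumPy arrays in (x1, y1, x2, y2) corner format:

```python
import numpy as np

def filter_and_nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Step 1: drop boxes with low objectness. Step 2: greedy non-max suppression."""
    keep = scores >= score_thresh
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(scores)[::-1]                 # highest score first
    selected = []
    while order.size > 0:
        best = order[0]
        selected.append(best)
        rest = order[1:]
        # IoU of the best box against all remaining candidates
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        areas = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (areas(boxes[[best]])[0] + areas(boxes[rest]) - inter)
        order = rest[iou < iou_thresh]               # discard overlapping duplicates
    return boxes[selected], scores[selected]
```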

Figure 4.5: Example cell output for (3 × 3) × 8 output layer

4.2.2 Anchor box

The idea of the anchor box was introduced in the YOLO9000 paper, and the concept is relatively simple. What happens when two objects of different shapes or sizes have their central point in the same cell? In that case, the original YOLO could detect only one of them. In the improved version of the algorithm, the last layer of the CNN (the one that divides the image into grid cells) is "deeper": it contains multiple predictions instead of just one. The size of each grid cell's output becomes (number of anchor boxes) x (size of the original prediction), which makes it possible to detect multiple objects in one grid cell. The YOLO9000 authors used 5 anchor boxes, and YOLOv3 uses 9, three per detection scale (an example output appears in Figure 4.6).

Figure 4.6: Example output for CNN with 2 anchor boxes

YOLO is a real-time, universal object detection algorithm. It combines high performance with high accuracy (see Figure 4.7), so it can be used to solve real-world problems; in the future it could be used in self-driving cars to create a safer future for all of us. The standard results are those obtained after removing all additional hyperparameters (i.e., using the standard hyperparameters).

Figure 4.7: Accuracy versus training epochs

Figure 4.8: Loss Function

By changing the model's hyperparameters, we can observe how the results change and search for a better model. Changes, however, may affect different datasets differently, so they must be selected carefully. The accuracy changes across epochs are shown below in the order in which the experiments were performed; all other hyperparameters in each experiment are the model's standard parameters.

According to the above results, adding or removing L2 regularization, changing the layer count, or changing the filters has little effect by the final epoch: the accuracy remains in the [0.83, 0.89] range. However, adding a dropout of 0.2 raises the model's training accuracy to 0.945 and its validation accuracy to 0.885 in the final epoch.

4.3 OBJECT DETECTION RESULTS

YOLO is a fast and accurate object detector that is well suited to computer vision applications. We connect YOLO to a webcam and verify its real-time performance, including the time it takes to fetch images from the camera and display the detections. The result is an interactive and engaging system. While YOLO processes images individually, when attached to a webcam it acts like a tracking system, detecting objects as they move around and change in appearance.

Figure 4.9: Real time object detection 1

Figure 4.10: Real time object detection 2

YOLO, a unified model for object detection, has been presented. The model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that corresponds directly to detection performance, and the entire model is trained jointly. Fast YOLO is the fastest general-purpose object detector in the literature, pushing the limits of real-time object detection. YOLO also generalizes well to new domains, making it ideal for applications that require fast, reliable object detection.

We demonstrated and validated a working port of YOLO from Darknet to TensorFlow. More work is needed to improve the robustness of the image classification for real-time video tracking. The current implementation, for example, is limited in the scenarios it can successfully track: image fidelity and frame rate must be high, and objects cannot move at high speeds. More sophisticated image-similarity algorithms, as discussed in the literature, would further improve tracking ability. In future work, we intend to contact the original authors and to report the mAP across all data set splits of our self-trained model in order to resolve the ambiguities in the original loss function.

4.4 COMPARISON TO OTHER REAL-TIME SYSTEMS

Many research efforts in object detection aim to make standard detection pipelines as fast as possible. Only Sadeghi et al. have produced a real-time detection system (30 frames per second or better), so we compare YOLO to their GPU-based DPM implementation. Fast YOLO is the fastest object detection method on PASCAL and, as far as we know, the fastest object detector in existence. With 52.7 percent mAP, it is more than twice as accurate as prior work on real-time detection, and YOLO pushes mAP 10% higher while still maintaining real-time performance. The fastest DPM effectively accelerates DPM without sacrificing much mAP, but it still falls short of real-time performance by a factor of two, and it is limited by DPM's relatively low detection accuracy compared with neural network approaches. R-CNN minus R replaces Selective Search with static bounding box proposals; while much faster than R-CNN, it still falls short of real time and suffers a significant accuracy loss due to the lack of good proposals. Fast R-CNN speeds up the classification stage of R-CNN, but it still relies on Selective Search, which can take around 2 seconds per image to generate bounding box proposals.

Table 4.2: Real-Time Systems on PASCAL VOC 2007

Fast R-CNN makes far more background errors than YOLO. By using YOLO to eliminate background detections from Fast R-CNN, we obtain a significant performance boost: for every bounding box that R-CNN predicts, we check whether YOLO predicts a similar box, and if it does, we boost that prediction based on YOLO's prediction and the overlap between the two boxes. On the VOC 2007 test set, the best Fast R-CNN model achieves a mAP of 71.8 percent. Unfortunately, because we run each model separately and then combine the results, this combination does not benefit from YOLO's speed. However, since YOLO is so fast, it adds no significant computation time compared with Fast R-CNN alone.

4.5 CONCLUSION

In this chapter, a deep learning model based on a convolutional neural network was developed with the goal of accurately detecting and recognizing objects of varied sizes in a video sequence. Deep learning-based object detectors have been a popular research area in recent years because of their excellent learning capabilities. An effective multi-scale anchor box approach outperforms others in the detection and recognition of objects of different sizes and shapes. Several ideas from existing deep learning models are combined to construct a convolutional neural network for object detection and recognition, improving the existing model to overcome its limitations in detecting and recognizing small objects.

5. CONCLUSION AND FUTURE WORKS

5.1 CONCLUSION

In this thesis, a deep learning model based on a convolutional neural network is proposed, with the
goal of accurately detecting and recognizing objects of varying sizes in a video sequence. Deep
learning-based object detectors have been a popular research topic in recent years due to their
powerful learning abilities. The thesis begins with a thorough examination of deep learning
algorithms, followed by a description of the architectures required for object detection and recognition. Existing models did not perform well in real-time object detection, particularly for small objects. The problem was the large grid size and the fixed-size anchor box, because of which the small-object features become too small to be detectable. Not all anchor boxes receive sufficient information about the object when most of the informative pixel values lie close to one another. Furthermore, the prediction score was calculated for each anchor box, resulting in an increase in object detection execution (training and testing) time. Our proposed system consists primarily of the following modules:

• Proposing an architecture that can accurately detect and recognize the static and the moving
objects of varying sizes from a video sequence.

• Designing a multi-scale anchor box to capture the fine details of the object in the image, improving detection over state-of-the-art object detectors.

• Developing an efficient approach to reduce the search space with the use of an efficient
multi-scale anchor box by ignoring the anchors which do not contain any information
thereby resulting in an improvement in the execution time.

• Integrating the proposed modules into the system thereby leading to high object detection
and recognition accuracy with a better FPS rate in the videos.

• Applying the proposed model to work well with a face detection application.

5.2 FUTURE WORKS

The proposed model preserves the full detail of small objects by extracting the image's multi-scale features while also ensuring the integrity of large-object features. The PASCAL VOC 2007 dataset was used to train and test the model. The effectiveness of the proposed work relative to existing work is measured mainly by the evaluation metrics of accuracy and FPS rate. For a network input size of 608x608, a mean average precision of roughly 84.49 percent at 11 FPS was obtained when detecting objects of varying sizes. The proposed model is also effective at detecting and recognizing objects against complex backgrounds. In our work, it is observed that after 130 epochs the IoU accuracy stabilizes at 84.49%; a learning rate of 0.0001 is used for better results. The higher the resolution of the image, the better the results in terms of mAP. The hardware configuration is also important for the computation; if the configuration does not change significantly, the mAP is unaffected. To evaluate the convolutional neural network's complexity, the number of parameters and the number of floating-point operations (FLOPS) of the architecture are considered, giving 23.39 BFLOPS and 23.31 M parameters for the proposed model.

Partial face detection is also carried out with the proposed methodology using the FDDB dataset, and the model achieved good accuracy on this task: after 20 epochs, the best IoU accuracy obtained is 92%. The resolution of the image is found to be directly proportional to the IoU accuracy and inversely proportional to the frames per second. An inverse relation between precision and recall can also be concluded, i.e., the higher the precision, the lower the recall.

REFERENCES

[1] Akula, A., Ghosh, R., & Sardana, H. K. (October). Thermal imaging and its application in
defence systems. In AIP conference proceedings (Vol. 1391, No. 1, pp. 333-335).
American Institute of Physics. 2011.

[2] Akula, A., Khanna, N., Ghosh, R., Kumar, S., Das, A., & Sardana, H. K. Adaptive contour-
based statistical background subtraction method for moving target detection in infrared
video sequences. Infrared Physics & Technology, 63, 103-109.2014.

[3] Sanin, A., Sanderson, C., & Lovell, B. C. Shadow detection: A survey and comparative
evaluation of recent methods. Pattern recognition, 45(4), 1684-1695. 2012.

[4] Ali, I., Mille, J., & Tougne, L. Space–time spectral model for object detection in dynamic
textured background. Pattern Recognition Letters, 33(13), 1710-1716.2012.

[5] Ali, I., Mille, J., & Tougne, L. Adding a rigid motion model to foreground detection:
application to moving object detection in rivers. Pattern Analysis and Applications, 17(3),
567-585. 2014.

[6] Abo-Eleneen, Z. A. Thresholding based on Fisher linear discriminant. Journal of Pattern Recognition Research, 2, 326-334. 2011.

[7] El Baf, F., Bouwmans, T., & Vachon, B. (June). Comparison of background subtraction
methods for a multimedia application. In 14th International Workshop on Systems, Signals
and Image Processing and 6th EURASIP Conference focused on Speech and Image
Processing, Multimedia Communications and Services (pp. 385-388). IEEE. 2007.

[8] Bai, X., & Zhou, F. Hit-or-miss transform based infrared dim small target
enhancement. Optics & Laser Technology, 43(7), 1084-1090.2011.

[9] Barnich, O., & Van Droogenbroeck, M.ViBe: A universal background subtraction
algorithm for video sequences. IEEE Transactions on Image processing, 20(6), 1709-
1724. 2010.

[10] Barnich, O., & Van Droogenbroeck, M. (April). ViBe: a powerful random technique to
estimate the background in video sequences. In 2009 IEEE international conference on
acoustics, speech and signal processing (pp. 945-948). IEEE. 2009.

[11] Ammar, S., Bouwmans, T., Zaghden, N., & Neji, M. (October). Moving objects
segmentation based on deepsphere in video surveillance. In International Symposium on
Visual Computing (pp. 307-319). Springer, Cham. 2019.

[12] Blauensteiner, P., & Kampel, M. Visual surveillance of an airport's apron-An overview of
the AVITRACK project (pp. 213-220). na. 2004.

[13] Bradski, G., & Kaehler, A. Learning OpenCV: Computer vision with the OpenCV library.
" O'Reilly Media, Inc.". 2008.

[14] Butler, D. E., Bove, V. M., & Sridharan, S. Real-time adaptive foreground/background
segmentation. EURASIP Journal on Advances in Signal Processing, (14), 1-13. 2005.

[15] Bouwmans, T. Recent advanced statistical background modeling for foreground detection-
a systematic survey. Recent Patents on Computer Science, 4(3), 147-176. 2011.

[16] Bouwmans, T. Background subtraction for visual surveillance: A fuzzy approach. Handbook on Soft Computing for Video Surveillance, 5, 103-138. 2012.

[17] Bouwmans, T., El Baf, F., & Vachon, B. Background modeling using mixture of gaussians
for foreground detection-a survey. Recent patents on computer science, 1(3), 219-237.
2008.

[18] Bouwmans, T., Porikli, F., Höferlin, B., & Vacavant, A. Handbook on "Background Modeling and Foreground Detection for Video Surveillance". 2014.

[19] Rashmi, M., Ashwin, T. S., & Guddeti, R. M. R. Surveillance video analysis for student
action recognition and localization inside computer laboratories of a smart
campus. Multimedia Tools and Applications, 80(2), 2907-2929. 2021.

[20] Sharma, L., Yadav, D. K., & Bharti, S. K. (February). An improved method for visual
surveillance using background subtraction technique. In 2nd International Conference on
Signal Processing and Integrated Networks (SPIN) (pp. 421-426). IEEE. 2015.

[21] Campbell, J., Mummert, L., & Sukthankar, R. Video monitoring of honey bee colonies at
the hive entrance. Visual observation & analysis of animal & insect behavior, ICPR, 8, 1-
4. 2008.

[22] Casares, M., & Velipasalar, S. (September). Resource-efficient salient foreground detection for embedded smart cameras by tracking feedback. In 7th IEEE International Conference on Advanced Video and Signal Based Surveillance (pp. 369-375). IEEE. 2010.

[23] St-Charles, P. L., Bilodeau, G. A., & Bergevin, R. SuBSENSE: A universal change
detection method with local adaptive sensitivity. IEEE Transactions on Image
Processing, 24(1), 359-373. 2014.

[24] Bovik, A. C. Basic binary image processing. In The Essential Guide to Image
Processing (pp. 69-96). Academic Press. 2009.

[25] Chien, S. Y., Ma, S. Y., & Chen, L. G. Efficient moving object segmentation algorithm using background registration technique. IEEE Transactions on Circuits and Systems for Video Technology, 12(7), 577-586. 2002.

[26] Chien, S. Y., Ma, S. Y., & Chen, L. G. Efficient moving object segmentation algorithm using background registration technique. IEEE Transactions on Circuits and Systems for Video Technology, 12(7), 577-586. 2002.

[27] Cheung, S. C. S., & Kamath, C. Robust background subtraction with foreground validation
for urban traffic video. EURASIP Journal on Advances in Signal Processing, (14), 1-11.
2005.

[28] Chitade, A. Z., & Katiyar, S. K. Colour based image segmentation using k-means
clustering. International Journal of Engineering Science and Technology, 2(10), 5319-
5325.2010.

[29] Ajiboye, S. O., Birch, P., Chatwin, C., & Young, R. (March). Hierarchical video
surveillance architecture: A chassis for video big data analytics and exploration. In Video
Surveillance and Transportation Imaging Applications (Vol. 9407, pp. 160-169). SPIE.
2015.

[30] Cristani, M., Farenzena, M., Bloisi, D., & Murino, V. Background subtraction for
automated multisensor surveillance: a comprehensive review. EURASIP Journal on
Advances in signal Processing, 1-24.2010.

[31] Kupade, R. N., Meenpal, T., & Nath, P. K. (February). Fpga implementation of moving
object segmentation using abpd and background model. In IEEE International Conference
on Electrical, Computer and Communication Technologies (ICECCT) (pp. 1-5). IEEE.
2019.

[32] Chu, K. Y., Kuo, Y. H., & Hsu, W. H. (October). Real-time privacy-preserving moving
object detection in the cloud. In Proceedings of the 21st ACM international conference on
Multimedia (pp. 597-600). 2013.

[33] Chiranjeevi, P., & Sengupta, S. Neighborhood supported model level fuzzy aggregation
for moving object segmentation. IEEE Transactions on Image Processing, 23(2), 645-657.
2013.

[34] Akula, A., Khanna, N., Ghosh, R., Kumar, S., Das, A., & Sardana, H. K. Adaptive contour-
based statistical background subtraction method for moving target detection in infrared
video sequences. Infrared Physics & Technology, 63, 103-109. 2014.

[35] Cucchiara, R., Grana, C., Piccardi, M., & Prati, A. Detecting moving objects, ghosts, and
shadows in video streams. IEEE transactions on pattern analysis and machine
intelligence, 25(10), 1337-1342. 2003.

[36] Sharma, L., Yadav, D. K., & Singh, A. Fisher’s linear discriminant ratio based threshold
for moving human detection in thermal video. Infrared Physics & Technology, 78, 118-
128. 2016.

[37] Deshpande, S. D., Er, M. H., Venkateswarlu, R., & Chan, P. (October). Max-mean and
max-median filters for detection of small targets. In Signal and Data Processing of Small
Targets (Vol. 3809, pp. 74-83). SPIE. 1999.

[38] Dey, B., & Kundu, M. K. Robust background subtraction for network surveillance in H.
264 streaming video. IEEE Transactions on Circuits and Systems for Video
Technology, 23(10), 1695-1703.2013.

[39] Dou, J., & Li, J. Moving object detection based on improved VIBE and graph cut
optimization. Optik, 124(23), 6081-6088. 2013.

[40] Ðorđević, Z. P., Graovac, S. G., & Mitrović, S. T. Suboptimal threshold estimation for
detection of point-like objects in radar images. EURASIP Journal on Image and Video
Processing, (1), 1-12. 2015.

[41] Hart, P. E., Stork, D. G., & Duda, R. O. Pattern classification. Hoboken: Wiley. 2000.

[42] Guerra-Filho, G. Optical Motion Capture: Theory and Implementation. RITA, 12(2), 61-
90. 2005.

[43] Fleming, B. Sensors - a forecast [automotive electronics]. IEEE Vehicular Technology Magazine, 8(3), 4-12. 2013.

[44] Gao, G., Wang, X., & Lai, T. Detection of moving ships based on a combination of
magnitude and phase in along-track interferometric SAR—Part I: SIMP metric and its
performance. IEEE Transactions on Geoscience and Remote Sensing, 53(7), 3565-3581.
2015.

[45] Ghanshala, K. K., & Pant, D. M-Governance Model To Accentuate Governance Framework In Governance Architecture To Empower People In Himalayan Villages Using Mobile Technology. International Journal of Advanced Research in Computer Science, 7(4). 2016.

[46] Ghosh, R., Akula, A., Kumar, S., & Sardana, H. K. Time–frequency analysis based robust
vehicle detection using seismic sensor. Journal of Sound and Vibration, 346, 424-434.
2015.

[47] Ghojogh, B., Karray, F., & Crowley, M. Fisher and kernel Fisher discriminant analysis:
Tutorial. arXiv preprint arXiv:1906.09436.2019.

[48] Goyette, N., Jodoin, P. M., Porikli, F., Konrad, J., & Ishwar, P. (June). Changedetection.net: A new change detection benchmark dataset. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (pp. 1-8). IEEE. 2012.

[49] Gupta, R. D., & Bariar, A. Modelling of Gaseous Effluents by Implementing Gaussian
Model under GIS Environment. Journal of Environmental Science & Engineering, 48(1),
21-26. 2006.

[50] Haque, M., Murshed, M., & Paul, M. (September). On stable dynamic background
generation technique using Gaussian mixture models for robust object detection. In IEEE
fifth international conference on advanced video and signal based surveillance (pp. 41-
48). IEEE.2008.

[51] Abiodun, O. I., Jantan, A., Omolara, A. E., Dada, K. V., Mohamed, N. A., & Arshad, H.
State-of-the-art in artificial neural network applications: A survey. Heliyon, 4(11),
e00938.2018.

[52] Liu, B., Ding, Z., & Lv, C. Distributed training for multi-layer neural networks by
consensus. IEEE transactions on neural networks and learning systems, 31(5), 1771-
1778.2019.

[53] Elhassouny, A., & Smarandache, F. (July). Trends in deep convolutional neural Networks
architectures: A review. In International Conference of Computer Science and Renewable
Energies (ICCSRE) (pp. 1-8). IEEE.2019.

[54] Albawi, S., Mohammed, T. A., & Al-Zawi, S. (August). Understanding of a convolutional
neural network. In international conference on engineering and technology (ICET) (pp. 1-
6). Ieee.2017.

[55] Basha, S. S., Dubey, S. R., Pulabaigari, V., & Mukherjee, S. Impact of fully connected
layers on performance of convolutional neural networks for image
classification. Neurocomputing, 378, 112-119.2020.

[56] Jiang, P., Ergu, D., Liu, F., Cai, Y., & Ma, B. A Review of Yolo algorithm
developments. Procedia Computer Science, 199, 1066-1073.2022.

[57] Vinh, T. Q., & Anh, N. T. N. (November). Real-time face mask detector using YOLOv3
algorithm and Haar cascade classifier. In International Conference on Advanced
Computing and Applications (ACOMP) (pp. 146-149). IEEE.2020.
