
OBJECT DETECTION USING ARTIFICIAL INTELLIGENCE

A PROJECT WORK SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF THE M.SC. DEGREE

IN COMPUTER SCIENCE

DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF BENIN, BENIN CITY, EDO STATE, NIGERIA

BY

MPAMUGO Onyedikachi Victor

(PG2019_2121701)

DECLARATION

I, Mpamugo Onyedikachi Victor, with registration number PG2019_2121701, hereby affirm that the project
work titled "Object Detection Using Artificial Intelligence", which I am submitting as a requirement for the award of
the Master of Science degree in Computer Science, is original work that I carried out during the 2019–2020 academic year.

I further declare that the material contained in this work has never before been submitted to any university or
institution for the award of a degree or diploma.
____________________________ _________________

MPAMUGO Onyedikachi Victor      Date

PG2019_2121701

CERTIFICATION

This is to certify that the project work "Object Detection Using Artificial Intelligence" was duly
carried out by Mr. Mpamugo Onyedikachi Victor of the Department of Computer Science, Faculty of Physical
Sciences, University of Benin, Benin City, Nigeria, with matriculation number PG2019_2121701. It is
acknowledged that the report contains all the corrections and recommendations made during the presentation
assessment.

__________________________ _____________________

Dr. Chete F.      Date

M.Sc. Programme Coordinator


APPROVAL

The project work is approved, as it complies with the academic standards set forth for project work for the
Master of Science degree in Computer Science in the Department of Computer Science, Faculty of Physical
Sciences, University of Benin.

_______________________ _________________

Professor Amadin I. Frank Date

Head of Department
DEDICATION

To my cherished Wife, my kids, and my family.

I'm grateful.
ACKNOWLEDGEMENTS
TABLE OF CONTENTS

DECLARATION

CERTIFICATION

APPROVAL

DEDICATION

ACKNOWLEDGEMENT

ABSTRACT

CHAPTER ONE: INTRODUCTION

BACKGROUND

MOTIVATION FOR THE STUDY

DEEP LEARNING AND ITS ALGORITHMS

OBJECT DETECTION: WHAT IT IS AND HOW TO USE IT.

OBJECT DETECTION MODEL ARCHITECTURE

YOLO OBJECT DETECTION

CHAPTER TWO: LITERATURE REVIEW


ABSTRACT

This two-part project work on object detection using artificial intelligence uses a deep learning algorithm built
on the YOLO framework for real-time object tracking and detection. In computer science, object detection is the
computer vision technique used to detect and track objects in media such as digital images and videos.

This work demonstrated the deep learning algorithm using the YOLO framework.

The first section of this work examined the theory of this deep learning technology in relation to the literature,
and the second section demonstrates practically, using a small training data set, how an object detection system
can be applied in practice.
CHAPTER ONE

INTRODUCTION

Object detection is a computer vision technique used to identify and locate objects, such as those in images or
videos. It combines object localisation, which tracks (i.e., precisely points to) the object, with object
classification, which assigns the detected object to its pre-defined data classes or types. In taking these two
distinct approaches, it differs from image recognition, another computer vision technique that only recognises
images and does not track them.

Object detection therefore involves tracking or localising the object, detecting it, and classifying it into the
appropriate classes according to the data (object classification). The objective of object detection is to precisely
identify an object within a given image or video and classify it accordingly. Consider data classes for houses,
cars, and buses: using object detection, a house can be located (by determining where the object is) and
classified (by identifying the type of object it is).

1.1 BACKGROUND

Even though it is still far from perfect, image recognition and image processing form one area of artificial
intelligence that has grown significantly.

The terms "image recognition" and "image processing" refer to a set of algorithms and technologies that analyse
images, learn hidden representations of features, and apply these learned representations to a variety of tasks,
such as automatically classifying images into different categories and determining which objects are present in a
scene and where.

According to Kinza (2018), image recognition in the context of machine vision is the capacity of software to
recognise objects, places, people, writings, and actions in images. To recognise images, computers can use
machine vision technologies in conjunction with a camera and artificial intelligence software. In order to
produce the necessary results for solving such problems, these technologies make use of a variety of
conventional computer vision techniques as well as machine learning and deep learning algorithms.

Sagar (2021) claims that the human eye performs significantly better than computer vision in terms of visual
performance, likely as a result of superior high-level image understanding, contextual awareness, and massively
parallel processing. However, human capabilities drastically deteriorate after prolonged surveillance, and some
working environments are either inaccessible or too dangerous. These factors prompt the development of
automatic recognition systems for numerous uses. Computer mimicry of human vision has recently gained
ground in a number of real-world applications, driven by advancements in computing power and image
processing technology. Computer vision processes images in two dimensions, considering the placement and
the spatial orientation of the objects in the image. Spatial orientation refers to the capability of identifying the
position or direction of objects or points in space, whereas placement concerns what position the object in the
data assumes. This suggests that object detection results are more easily obtained in a controlled environment
than in an uncontrolled environment where objects lie unorganised in arbitrary positions.

As a result, training an algorithm to recognise a house, for instance, is much simpler than training one to
recognise the same image among many houses.

There are a number of object detection algorithms, but You Only Look Once (YOLO) is chosen in this study
because it is very quick compared to the real-time detection algorithms that came before it: it uses a unified
model in which detection is treated as a single regression problem, with no complex pipeline, just a neural
network run on the image. Because it reasons about predictions using the image as a whole, it makes fewer
mistakes. YOLO sees the whole picture and encodes some of the context regarding all classes and how they
look. Additionally, YOLO has developed generalised representations of objects and can distinguish between
images of the natural world and works of art. Released in 2016, YOLO performs real-time object detection
faster than its predecessors. In this study, the YOLO principle is used to create an AI algorithm that can identify
particular data, such as cars and people.
YOLO is a good object detection algorithm, but it still has some weaknesses. Due to the spatial constraint on
bounding boxes in YOLO, where each cell can predict only two boxes and have one class, it is difficult to detect
small objects, such as a flock of birds. When generalising objects with unusual aspect ratios and configurations,
YOLO runs into some issues. Additionally, for YOLO, the loss function will treat small and large bounding box
errors equally.

1.2 MOTIVATION FOR THE STUDY

The field of artificial intelligence is fascinating and offers vast solutions to a wide range of human needs. Its
research is endless because it is a field of computer science that is constantly developing. A byproduct of
artificial intelligence, object detection is still developing and uses a variety of algorithms. Although there are
many object detection algorithms, You Only Look Once (YOLO) is currently the fastest.

The desire to learn more about state-of-the-art techniques for object detection and how they work is the
driving force behind this project. This project work was carried out with the goals of exploring, learning, and
practically demonstrating the YOLO object detection algorithm, in order to determine whether or not it is a
good model for general object detection.

1.3 STATEMENT OF THE PROBLEM

The dual goals of object detection are object classification and localisation. These present a challenge because
the algorithm must not only categorise an object but also locate (localise) its spatial position within the image.
Additional issues with object detection include multiple spatial scales and aspect ratios: it can be challenging to
capture objects at various scales and viewpoints, because objects of interest may appear in a wide range of sizes
and aspect ratios.

Another issue in object detection is the generation of object regions of interest (ROIs) using anchor boxes rather
than selective search, where multiple ROIs may be predicted at each position and are described in relation to
reference anchor boxes. Unbalanced classes are a further issue, complicating the classification stage of object
detection.

Another issue with object detection is the lack of information. Despite numerous efforts to collect data, some
detection datasets tend to have a smaller vocabulary and scale.

Speed for real-time detection is one of the main issues with object detection, though. The algorithm must
perform the tasks of object location and classification quickly and accurately. This work examines the problem
of speed for real-time detection using the YOLO object detection algorithm as a better option for speed and
accuracy.

This project designs and implements a You Only Look Once (YOLO)-based object detection algorithm that is
incredibly quick and precise.

1.4 AIM AND OBJECTIVES

The aim of this project work is to design and implement an artificial intelligence algorithm that can quickly
and accurately identify objects in image scenes using YOLO.

The objectives for achieving this aim are as follows:

Data collection

Data augmentation

Data labelling

Model design

Model training

Model evaluation and validation

Model testing

Model deployment

1.5 SCOPE OF THE STUDY

This study is limited to the design and implementation of a quick and precise object detection system using
YOLO.

1.6 SIGNIFICANCE OF THE STUDY

A fascinating and ever-evolving field of computer science, artificial intelligence has enormous potential to meet
a wide range of human needs.

Not only is object detection a product of the supercomputer era, but it also heralds a safe technological future.

Object detection has made significant advances in a number of important fields, including biometric and facial
recognition, parking management, emergency response, security, and transportation.

The significance of this work is to further the research on object detection and, in light of its enormous benefits
to humans, to demonstrate more of its advantages in resolving some of man's problems.
CHAPTER TWO

LITERATURE REVIEW

In the area of computer vision, object detection has proven to be a viable technique for identifying objects with a
certain level of accuracy. One of the earliest and best-known computer vision tasks is the extraction of higher-
level semantic information from images. Numerous studies have been conducted in this area to highlight its
limitless potential and advance its development.

According to Weibo et al. (2017), one of the most well-known sub-domains of computer vision is detection,
which aims to precisely locate and categorise the target objects in an image. During detection tasks, the image is
scanned for specific items of interest. According to them, image detection can be used in medicine, for instance,
to identify potentially abnormal tissues or cells in images of the body. They pointed out that computer-aided
diagnosis (CAD) systems use Deep Belief Networks (DBN), a deep learning architecture, for the early detection
of medical conditions like breast cancer and glaucoma. It was also stated that, due to object detection and its
computational efficiency, the segmentation task and the magnetic resonance imaging (MRI) method for clinical
brain tumour detection received more attention in recent years. However, little was said about the fact that the
MRI-based brain tumour detection method suffers from some discrepancies between the two methods.

They emphasised once more that ship detection on spaceborne images, for example, has been widely used for
traffic control and for monitoring maritime security, and that, due to their visualised contents and high-resolution
properties, spaceborne images are superior to other remote sensing images for object detection. The difficulty of
image processing rises, however, as larger databases are processed at higher resolution, and, compared to
infrared and synthetic aperture radar images, spaceborne images are more easily affected by the weather.

According to Alina (2015), the Convolutional Neural Network's output for image classification is a 1000-
dimensional vector that contains the likelihood of each potential class for the most prominent object. The
network can easily be expanded to generate bounding box coordinates (either just one bounding box, or one
bounding box for each potential object class) if only the most prominent object needs to be localised.

The main disadvantage of this method is that it cannot be used if the number of bounding boxes that must be
predicted, i.e. the number of objects that can be detected in the scene, is unknown. But in object detection tasks
this is exactly the case: the number of bounding boxes that must be predicted is not known in advance. A
straightforward solution would be to apply a network for classification and localisation to every area of the
image (at various scales) successively and accumulate the boxes with the best results along the way.

This supports the earlier claim made in this work that, because object detection uses a pure regressor model, we
must first define the number of boxes in accordance with the number of objects we intend to detect.

An Artificial Neural Network (ANN) is a paradigm for information processing that draws inspiration from the
way biological nervous systems, like the brain, process information, according to Swetha et al. (2018). The
innovative structure of the information processing system is the fundamental component of this paradigm. It is
made up of numerous, intricately linked processing units called neurons that cooperate to address particular
issues. ANNs learn by imitation just like people do. Through a learning process, an ANN is tailored for a
particular application, such as pattern recognition or data classification. The synaptic connections between the
neurons in biological systems change as a result of learning.

Additionally, they described the differences between feed-forward and feedback neural networks. By allowing
signals to travel in only one direction, from input to output, feed-forward ANNs eliminate feedback loops,
meaning that the output of one layer does not feed back into an earlier layer. Feed-forward ANNs are
frequently simple networks that link inputs to outputs, and they play a significant role in pattern recognition.
This style of organisation is also known as top-down or bottom-up.
Little was said, however, about the fact that feed-forward neural networks are not always effective or ideal, in
that they only permit a one-way flow of data; in other words, there is no connection or feedback between
layers.

Feedback networks, it was claimed, add loops so that signals can travel in both directions. Feedback networks
are undoubtedly very effective, but they can also become very complex. They are dynamic; their "state" changes
continuously until they come to an equilibrium, and they stay at the equilibrium point until the input changes
and a new equilibrium needs to be found. It is this constant change of states on the way to new equilibrium
points that makes feedback neural networks complex. Feedback architectures are also known as recurrent or
interactive.

Using YOLO (You Only Look Once), Joseph et al. (2016) presented a quick and easy method for detecting
images in real time. The model was created to detect images quickly and accurately and to distinguish between
natural and artistic images.

In contrast to the object detection techniques that came before it, such as R-CNN (Region Convolutional Neural
Network), YOLO introduced a single unified architecture for regression that places bounding boxes on images
and finds class probabilities for each box. As a result, YOLO operates much more quickly and accurately, and it
can even predict correctly on artwork.

In their paper, "Object Detection Based on YOLO Network," Chengji et al. (2018) described the development of
a generalised object detection network using training sets that had undergone noise, blurring, rotation, and
cropping. The model's generalisation and robustness were improved by using degraded training sets during
training.

The experiment, however, demonstrated that a model trained only on standard sets has subpar robustness and
generalisation ability on degraded images. The average precision increased when the model was trained on
damaged images: in comparison to the standard model, the general degenerative model had better average
precision for degraded images.

Rekha et al. (2020) list YOLO's advantages as follows:

Due to its use of a Unified Model, which views detection as a single regression problem and eliminates the need
for a complicated pipeline in favour of a simple neural network run on the image, it is incredibly fast when
compared to other real-time detection algorithms that came before it.

Because YOLO, unlike Fast R-CNN, can reason globally about the image when making predictions, it makes
fewer errors than Fast R-CNN, since it sees the larger context. YOLO sees the whole picture and encodes some
of the context regarding all classes and how they look.

YOLO has learned generalised representations of objects, and it successfully distinguishes between natural and
artistic images.

However, it should be noted that YOLO does have some disadvantages:

The detection of small objects, such as a group of birds, is challenging due to the spatial restriction placed on
bounding boxes, which only allows each cell to predict two boxes and have one class.

Generalising objects with unusual aspect ratios and configurations can be challenging.

The errors of small or large bounding boxes will be treated equally by the loss function.

Dnyanisha et al. (2020) proposed a YOLO-based neural network system for image acquisition, licence plate
detection, licence plate extraction, licence plate segmentation, and character recognition. In practice, it offered
10% more accuracy and 12% more speed than other object detection systems.

Vehicle licence plate recognition systems primarily use Python libraries to detect and recognise licence plates.
Under numerous conditions, such as rain, fog, inclined car images, and poor image quality, these systems are
unable to provide high accuracy. Most results from other vehicle number plate recognition systems occasionally
confuse the letter D with the number 0, the letter B with the number 8, and the letter I with the number 1. The
YOLO vehicle licence plate recognition system is based on OCR (optical character recognition), which is better
for intelligent vehicular systems, and convolutional neural networks, and it offers greater precision and accuracy
than other systems. The YOLO vehicle licence plate recognition system is used to find the exact location of the
licence plate on the vehicle body; once the licence plate has been found, the details of the plate are then quickly
and successfully identified.

Dan-Sebastian et al. (2022) state that single-stage detectors directly predict the classification and localisation in
a single step. Although two-stage models typically outperform single-stage models in terms of accuracy, they
cannot be used on devices with limited processing power. Single-stage object detectors like those in the YOLO
series strike a balance between speed and precision; the most recent model runs more quickly and more
accurately when using Graphics Processing Units (GPUs).

However, it is also important to be aware that when deployed on edge devices, high performing YOLO models
execute much more slowly due to the high computational load. Higher versions of YOLO were suggested as a
solution to this problem. One of the fastest object detection networks, the lightweight YOLOv4-tiny was
created, but due to its shallow architecture, the extracted features are largely homogeneous, the accuracy is
constrained, and there is room for improvement in real-time performance on low-power devices.

According to Fiza et al. (2022), YOLO has more sophisticated applications than Faster R-CNN and offers the
best balance of speed and accuracy that a given application will need. Since YOLO offers end-to-end training,
it proves to be a cleaner and more effective method for performing object detection. Although the accuracy of
the two algorithms, Faster R-CNN and YOLO, is comparable, YOLO occasionally outperforms Faster R-CNN
in terms of accuracy, speed, and efficiency. Since YOLO uses single-shot algorithms, it is better suited to
real-time object detection in both images and videos. It is easy to build and works with full images for training.
Compared to Faster R-CNN, YOLO provides a better generalising representation of objects, making it a more
reliable, quick, and strong algorithm to rely on.

This opinion holds true for YOLO because the architecture is akin to a fully connected convolutional neural
network: the image only needs to go through the network once before the prediction appears at the output.
YOLO performs detection and classification simultaneously, increasing speed. Compared to Faster R-CNN,
YOLO makes less than half as many background errors, because its architecture allows for end-to-end training
and real-time speed while maintaining high average precision. It is true that YOLO outperforms Faster R-CNN
in terms of speed and accuracy, but it is important to note that both algorithms still have limitations in real-time
performance.

Additionally, this object detection algorithm continues to acquire distinctive characteristics as a result of its
continued growth and rapid upgrades. Given YOLO's rapid development, there is no doubt that it will continue
to dominate the object detection field for a long time.

Many industries, including the military, manufacturing, security, medicine, transportation, retail, advertising,
and marketing, use object detection. These industries require a very fast and accurate object detection algorithm
for their real-time operations; a detection architecture that achieves the right speed, memory, and accuracy
balance for their specific application and platform.

For example, vehicle assembly industries and self-driving vehicle manufacturers use high-speed, accurate object
detection algorithms in their businesses because of the many components involved in vehicle assembly. A
self-driving vehicle should be able to think for itself, make wise decisions like slowing down when approaching
a junction or negotiating a bend, drive itself, sense dangers on the road, and take precautions to avoid them with
speed and accuracy. In order to navigate safely on the roads and prevent collisions and accidents, self-driving
vehicles should apply their brakes at the precise calculated time needed to bring themselves to a stop, and they
should also take into account the distance of the vehicle(s) behind them.

Real-time surveillance for forensic, security, and military purposes also uses an efficient and quick object
detection algorithm. As airport security uses facial recognition close to the departure gates to verify travellers'
identities, it is also used for biometric and facial recognition. Object detection is also used in a variety of other
fields, including sports and medicine.
Contrary to region-based object detection techniques like the Faster Region-Convolutional Neural Network,
which has two distinct phases (proposing regions and then processing them), YOLO is a unified one-stage
method that locates and categorises objects in a single pass over the image, significantly reducing test time. All
localisation and classification take place within a single process, governed by a multi-term loss function. YOLO
has attained detection rates of up to 155 frames per second.

CHAPTER THREE

METHODOLOGY

3.1 OBJECT DETECTION

Humans are remarkably quick and accurate at detecting objects; the human brain recognises them reflexively
through its patterned visual system. With the rise of emerging technologies like the Internet of Things (IoT),
quantum computing, and artificial intelligence, computer scientists have been able to replicate aspects of
human thought processes in machines.

The normal power of sight is enhanced by object detection. It assists machines in navigating our visual world
similarly to the human brain.

Similar to object recognition, object detection works. The distinction is that while object detection identifies the
presence and location of an object in an image, object recognition focuses on identifying the correct category of
an object.

There are two different types of data analysis techniques that can be used to perform object detection tasks.

In image processing, models that don't need big datasets or powerful graphics cards can self-train on the input
images and generate feature maps to make predictions.
Deep neural networks are supervised learning algorithms that can predict object classes using large datasets and
powerful graphics processing units. This method of classification is more accurate for objects that are complex
or partially hidden in an image with unknown backgrounds. Deep neural network training is a time-consuming
and expensive process. There are, however, some sizable datasets that make labelled data accessible.

Machine learning and deep learning are the two most preferred approaches to object detection. These
techniques extract the features, train the algorithm, and categorise objects, for example using a support vector
machine (SVM).

A machine learning algorithm relies on manually supplied data for classification rather than automatically
generated training data, which makes the algorithm overall less prone to error and more stable.

One machine learning approach uses illumination intensity against a background to identify and categorise
objects.

Another machine learning technique is aggregate channel features (ACF), which uses a training image dataset
and the ground positions of the objects to identify specific objects in the image.

The deformable parts model (DPM) is another machine learning method for object detection. It consists of
four main parts: a coarse root filter that creates multiple bounding boxes in an image to capture the objects; part
filters that cover fragments of the object and turn them into regions of darker pixels; a spatial model that stores
the locations of all the object fragments relative to the root filter's bounding boxes; and a regressor used to
refine the distance between the predicted boxes and the objects.

Convolutional neural network models and other deep learning techniques are used in object detection using a
deep learning approach, which results in faster and more accurate object predictions.

To locate and identify objects using deep learning, an encoder first passes an image through a number of blocks
and layers to extract statistical data or features. A decoder then receives the results and forecasts bounding boxes
and labels for each object. The encoder's output is connected to a pure regressor, which forecasts the precise
location and dimensions of each bounding box.

The output X, Y coordinate values describe each object's position and extent in the image.

Popular deep learning architectures include recurrent neural networks, convolutional neural networks, and deep
belief networks. Deep belief networks are frequently employed to solve general classification problems.
Convolutional neural networks are among the most widely used deep learning architectures for the classification
of images, texts, and sounds, while recurrent neural networks are used for data that is presented sequentially.

3.2 ARTIFICIAL NEURAL NETWORKS

Artificial neural networks, also known as neural networks or neural nets, are computer architectures that draw
inspiration from the biological neural networks that make up animal brains. Artificial Neural Networks are made
up of a network of interconnected nodes or units known as artificial neurons, which are made to roughly
resemble the neurons in a biological brain and have the ability to communicate with one another.

After receiving signals, an artificial neuron processes them and can signal the neurons connected to it. The
"signal" at a connection is a real number, and each neuron's output is computed by some non-linear function of
the sum of its inputs. The connections are referred to as edges. Neurons and edges typically have a weight that
is adjusted as learning progresses, increasing or decreasing the strength of the signal at a connection.

The division of neurons into various layers allows for the possibility of various input transformations. Signals
may pass through the layers more than once as they move from the first layer, also known as the input layer, to
the last layer, also known as the output layer.

Neural networks learn, or are trained, by processing examples, each of which has a known "input" and "result",
creating probability-weighted associations between the two that are stored within the data structure of the
network itself. Such training is typically carried out by calculating the difference between the target output and
the network's processed output, which is frequently a prediction. This discrepancy is the error. Using this error
value, the network adjusts its weighted associations in accordance with a learning rule. Subsequent adjustments
cause the neural network to produce output that gets steadily closer to the desired output. After a sufficient
number of these adjustments, this type of supervised training may be terminated based on specific criteria.

These systems typically learn to perform tasks without being programmed with task-specific rules and by taking
into account examples.

Artificial Neural Networks, which started out as a way to use the human brain's architecture to perform tasks
that conventional algorithms couldn't, quickly moved on to improving empirical results. Simulated neurons are
connected to one another in different ways to form an artificial neural network, where some neurons' outputs
become other neurons' inputs.

Neurons are arranged in layers, and each layer only connects to the layers that come before and after it. External
data is received by the input layer, and an output is created by the output layer. Recurrent networks are those
that enable connections between neurons in the same or earlier layers.

A network learns by adapting to better handle a task through the consideration of sample observations. To
improve the accuracy of the output, the network's weights (and optional thresholds) are adjusted. The learning
rate determines the magnitude of the corrective steps the model takes to account for the error in each
observation. A high learning rate shortens training time at the cost of some accuracy, while a low learning rate
takes longer but may achieve higher accuracy.
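To make the roles of the error and the learning rate concrete, here is a minimal sketch (not drawn from the project itself) of error-driven weight updates for a single linear neuron; the samples and learning rate are illustrative.

```python
# A minimal sketch of error-driven learning for one linear neuron.
# The samples and learning rate are illustrative, not from the project.
learning_rate = 0.1  # magnitude of each corrective step

w, b = 0.0, 0.0
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs

for epoch in range(100):
    for x, target in samples:
        output = w * x + b
        error = target - output            # the discrepancy is "the error"
        w += learning_rate * error * x     # adjust the weighted association
        b += learning_rate * error

print(round(w, 2), round(b, 2))  # converges towards w = 2, b = 0
```

A smaller learning rate would need more epochs to reach the same weights, illustrating the time/accuracy trade-off described above.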

3.3 TYPES OF ARTIFICIAL NEURAL NETWORKS

Artificial Neural Networks have developed into a large family of methods that have raised the bar in many
fields. Artificial neural networks include, for example:

CONVOLUTIONAL NEURAL NETWORK (CNN)

The Convolutional Neural Network (CNN) is one of the most well-known feed-forward deep neural network
models. It is capable of learning features automatically from input data and does not require manual feature
extraction from images, which means that it can complete such tasks without human intervention.

Repeated application of the same filters increases training effectiveness, since sharing filters during learning
reduces the number of parameters. Each pooling layer performs maximum or average subsampling of non-
overlapping regions in the feature maps, letting later layers handle more complex features, and the fully
connected layers then function like a typical neural network: they form the final learning phase that maps the
features to the predicted outputs. The CNN is an effective architecture that can process text, images, and even
sound.
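As an illustration of these ideas (stacked convolution filters, non-overlapping pooling, and a final fully connected layer), here is a minimal CNN sketch in PyTorch; the layer sizes and class count are arbitrary assumptions, not the network used in this work.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A minimal CNN: convolution filters -> pooling -> fully connected layer."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learnable shared filters
            nn.ReLU(),
            nn.MaxPool2d(2),  # max subsampling of non-overlapping 2x2 regions
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected layer: the final phase mapping features to outputs.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)       # automatic feature extraction
        x = torch.flatten(x, 1)    # flatten the feature maps
        return self.classifier(x)  # map features to predicted outputs

model = TinyCNN()
out = model(torch.randn(1, 3, 64, 64))  # one 64x64 RGB image
print(out.shape)  # torch.Size([1, 3])
```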

DEEP BELIEF NETWORK (DBN)

The deep belief network is a graphical model with many layers of hidden variables, which have binary values
and are known as hidden units or feature detectors; it is typically represented as a stack of restricted Boltzmann
machines (RBMs). The symmetric connections between the top two layers form an associative memory, while
the lower layers use top-down, directed connections. The DBN thus combines two different types of neural
networks, belief networks and RBMs. It is trained in two stages: an unsupervised pre-training phase with
unlabelled samples, followed by supervised fine-tuning with labelled samples.

Deep Belief Networks (DBN) are used in Natural Language Processing (NLP) because they precisely extract
attributes from entities with little manual intervention.

RECURRENT NEURAL NETWORK (RNN)

Recurrent Neural Networks (RNN) are deep learning architectures that are primarily used to process sequential
data. Unlike feed-forward neural networks, RNNs form directed cycles between their units. Recurrent neural
networks perform better for natural language processing and for text and speech recognition, because they can
be built as compactly as the size of their training data allows and because their memory can retain details from
previous training data.
3.4 YOLO OBJECT DETECTION ALGORITHM

YOLO (You Only Look Once) is an efficient real-time object detection algorithm that uses convolutional neural
networks to treat object detection in images as a regression problem: it divides the image into regions and
predicts spatially separated bounding boxes and associated class probabilities for each region. YOLO was first
developed by Joseph Redmon and is well-known for its speed, accuracy, and learning prowess.

The YOLO model is made up of three main parts: the backbone, the neck, and the head. The backbone, which
consists of convolutional layers that detect important features in images and process them, is first pre-trained at
a lower resolution on a classification dataset such as ImageNet, since detection requires finer detail than
classification.

The neck uses fully connected layers together with the convolutional-layer information from the backbone to
make predictions about probabilities and bounding box coordinates. The head is the network's final output
layer; it is used for transfer learning and is interchangeable with other layers that have the same input shape.

The YOLO algorithm uses only a single forward propagation through a neural network to detect objects and to
predict multiple class probabilities and bounding boxes. The YOLO algorithm employs the three methods listed
below:

Residual blocks

Bounding box regression

Intersection Over Union (IOU)

RESIDUAL BLOCKS

The image is first divided into an S x S grid of equal-sized cells. The example image below demonstrates how
grids are created from an input image.

Figure 3.1: YOLO grid (Source: www.guidetomlandai.com)

There are numerous equal-sized grid cells, as shown in Figure 3.1. Every grid cell detects the objects that fall
within it. For instance, if an object's centre appears within a particular grid cell, that cell is responsible for
detecting it.
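A small sketch of that responsibility rule, assuming an S x S grid and an illustrative input resolution (the centre coordinates are hypothetical):

```python
# Which grid cell is responsible for an object? The cell containing its centre.
S = 3
img_w, img_h = 448, 448   # illustrative input resolution
cx, cy = 300, 150         # hypothetical object centre in pixel coordinates

cell_col = int(cx / img_w * S)  # 0-based column of the responsible cell
cell_row = int(cy / img_h * S)  # 0-based row
print(cell_row, cell_col)       # -> 1 2: this cell predicts the object
```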

BOUNDING BOX REGRESSION

A bounding box is an outline that draws attention to an object in an image. Each bounding box in the image has
the following attributes:

Width (bw)

Height (bh)

Class (c), e.g. person, car, traffic light

Bounding box centre (bx, by)

The picture below displays a bounding box in action; a yellow outline serves as a representation of the bounding box.
Figure 3.2: YOLO bounding box (Source: www.section.io)

Figure 3.2 shows that YOLO predicts the height, width, centre, and class of objects using a single bounding box
regression. A value pc represents the likelihood of an object appearing inside the bounding box.

IOU: INTERSECTION OVER UNION

Intersection over union (IOU) is the object detection measure that describes how boxes overlap. YOLO uses
IOU to produce an output box that encircles each object as tightly as possible.

Each grid cell is responsible for predicting bounding boxes and their confidence scores. If a predicted bounding
box matches the actual box exactly, the IOU equals 1. This mechanism eliminates bounding boxes that do not
match the actual box.

The illustration offers a clear illustration of how IOU functions.

Figure 3.3: The YOLO IOU (Source: miro.medium.com)

Two bounding boxes, one green and one blue, can be seen in Figure 3.3. The green box is the actual box, while
the blue box represents the predicted box. YOLO drives the predicted box to coincide with the actual box.
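A minimal sketch of the IOU computation for two boxes in (x1, y1, x2, y2) corner format; the example boxes are illustrative:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping region.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes  -> 0.0
```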

COMBINING THE THREE TECHNIQUES

The three techniques are applied to produce the final detection results, as shown in the following image.

Figure 3.4: How the YOLO algorithm works (Source: www.guidetomlandai.com)

Figure 3.4 shows how the image is initially divided into grid cells. B bounding boxes are predicted in each grid
cell, along with confidence ratings. To determine the class of each object, the cells forecast the class
probabilities.

For instance, a bicycle, a dog, and a car are examples of at least three different classes of objects. A single
convolutional neural network is used to make all of the predictions simultaneously.

Using intersection over union, unneeded bounding boxes that don't match the properties of the objects (such as
height and width) are removed, ensuring that the predicted bounding boxes match the actual boxes of the
objects. The final detection consists of unique bounding boxes that fit the objects precisely.

For instance, the bicycle is enclosed by the yellow bounding box, while the car is enclosed by the pink bounding
box. The blue bounding box has been used to highlight the dog.
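The filtering step described above is commonly implemented as greedy non-maximum suppression. A self-contained sketch, with illustrative boxes and confidence scores:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    # Visit boxes from most to least confident; drop any box that
    # overlaps an already-kept box too strongly.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[i], boxes[best]) < iou_threshold]
    return keep

# Two overlapping candidates and one separate box: only two survive.
print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.6, 0.8]))
# -> [0, 2]
```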

3.5 EVOLUTION OF YOLO

The starting point is YOLO or YOLOv1.

With its quick and accurate object recognition, this version revolutionised object detection. It is important to
remember that this initial iteration of YOLO has its own limitations, just like many other solutions. Because
every grid cell in the architecture is built for a single detection, it is difficult to identify small objects that
appear in groups, such as individual people in a very large gathering. Additionally, this version struggles to
identify novel or unusual shapes, and the loss function used to approximate detection performance produces
incorrect localisations, because it treats errors for small and large bounding boxes as identical.

YOLO9000 or YOLOv2
This improved version, which was developed in 2016 to make the YOLO model better, faster, and stronger, uses
Darknet-19 as a new architecture, higher input resolution, convolution layers with anchors, batch normalisation,
fine-grained features, and dimensionality clustering, among other things.

The performance of YOLOv2 was improved by 2% mAP through the addition of a batch normalisation layer,
whose regularisation effect prevents overfitting.

Additionally, YOLOv2 simplifies the problem by replacing the fully connected layers with anchor boxes, rather
than directly predicting the precise coordinates of the bounding boxes of objects as YOLOv1 does; this slightly
decreases accuracy but improves recall by 7%.

YOLOv3 — An Incremental Improvement

New network architecture: YOLOv3 improved on its forerunners with Darknet-53, a 106-layer neural network
with upsampling layers and residual blocks that is faster, more accurate, and larger than Darknet-19, the
backbone of YOLOv2. YOLOv3 makes more accurate class predictions, as well as more accurate predictions at
various scales, and it has better and more accurate bounding box prediction. YOLOv3 makes three predictions at
different scales for each location within the input image in order to obtain fine-grained and more meaningful
semantic information, resulting in a better-quality output.

YOLOv4 — Optimal Object Detection Speed and Accuracy

YOLOv4 is specifically created for production systems and optimised for parallel computations, offering the
best speed and accuracy of object detection when compared to previous versions and other object detectors.

CSPDarknet53, a network with 29 convolutional layers, 3 × 3 filters, and roughly 27.6 million parameters,
serves as the backbone of the architecture.

In contrast to YOLO v3, which used a feature pyramid network (FPN) to aggregate parameters from various
detection levels, YOLO v4 uses PANet and includes a Spatial Pyramid Pooling (SPP) block that expands the
receptive field, separates the most important context features, and doesn't slow down the network.

YOLOv5

The release includes four model sizes: YOLOv5s (smallest), YOLOv5m, YOLOv5l, and YOLOv5x (largest).
YOLOv5 is the first iteration of YOLO to be implemented in PyTorch rather than Darknet. Glenn Jocher
released YOLOv5 in June 2020; like YOLOv4, it uses CSPDarknet53 as the foundation of its architecture. A
standout feature of YOLOv5 is the integration of a focus layer, which decreased the number of layers and
parameters while increasing forward and backward speed without significantly affecting the mean average
precision (mAP).

YOLOv6 — A Single-Stage Object Detection Framework for Industrial Applications

The YOLOv6 (MT-YOLOv6) framework, released by the Chinese e-commerce giant Meituan, was written in
PyTorch and is intended for industrial applications, with a hardware-friendly, efficient design and high
performance. The framework made three significant upgrades over the previous YOLOv5: a hardware-friendly
backbone and neck design, an efficient decoupled head, and a more potent training strategy. YOLOv6 is quicker
and more accurate than earlier iterations.

YOLOv7 — Trainable Bag of Freebies Sets the New State of the Art for Real-Time Object Detectors

This version, released in July 2022, significantly advances the field of object detection by outperforming all
earlier models in terms of speed and accuracy. The Extended Efficient Layer Aggregation Network (E-ELAN)
was integrated into the YOLOv7 architecture to allow the model to learn more diverse features for better
learning. Additionally, YOLOv7 scales its architecture to accommodate variations in inference speed by
combining the architectures of the models from which it derives, such as YOLOv4, and the model's accuracy
and speed improve without affecting its training cost.

To increase model accuracy without raising training costs, a technique known as "bag of freebies" is used, and
it is for this reason that YOLOv7 improved both inference speed and detection accuracy.

The growth and improvement of the YOLO object detection algorithm can be seen from the foregoing, along
with its evolution and the rate at which upgrades are released. YOLO proves to be a very viable algorithm for
object detection in terms of speed and accuracy.

3.6 RESEARCH PROCEDURE

A significant amount of knowledge about our world must be somehow stored in the computer, either explicitly
or implicitly, for computers to model it accurately enough to demonstrate intelligence. Below is a breakdown of
how the project's research procedures worked.

First of all, object detection is a very data-hungry task that needs a graphics processing unit (GPU) to speed up
training and shorten run-time, as a CPU alone cannot handle such a large amount of data or efficiently run
computation-heavy frameworks like TensorFlow and PyTorch. Setting up the GPU was therefore the first step
in developing our models, and Google Colab's free Tesla K80 GPU was utilised.
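A quick way to confirm the GPU runtime is active, for example in a Colab notebook, is a check along these lines (a minimal sketch, assuming PyTorch is installed):

```python
import torch

# Confirm that a CUDA GPU (e.g. Colab's Tesla K80) is visible before training.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU found; training will fall back to the (much slower) CPU.")
```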

The next step was to collect data. The popularity of object detection has increased, and there are pre-trained
models available for detecting objects like faces and vehicles. However, if there is little to no data available for a
given object detection task, one must create one's own data. Data can be sourced from publicly labelled datasets
like ImageNet, Common Objects in Context (COCO), and Google's Open Images for cases involving pre-trained
data models; these repositories store tens of thousands or even millions of images for particular object detection
tasks. Additionally, images can be downloaded from the internet, and photos can even be taken with a camera to
represent each category of object that needs to be detected. Some of the data images for this work were taken
from these repositories (ImageNet, Google's Open Images, and Common Objects in Context (COCO)), while
others were taken with a camera.

The following step was data augmentation. Deep learning models need massive amounts of data, so it may not
be possible to train a generalisable model from a small data set alone. To address this, the previously collected
data were augmented by applying photometric and geometric distortions, two data augmentation techniques that
unquestionably aid the object detection task. For photometric distortion, the brightness, contrast, hue, opacity,
saturation, and noise of the data images were adjusted, and the light intensity was increased or decreased
depending on the requirements of each image. Geometric distortion was created through random scaling,
cropping, flipping, enlarging, mirroring, and rotating. All of the data augmentation methods involved pixel-level
adjustments, preserving the original pixel information in the adjusted area.
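For illustration, a minimal sketch of both kinds of distortion using OpenCV; the file name and parameter values are hypothetical, and in a real pipeline the geometric transforms must also be applied to the bounding-box labels:

```python
import cv2
import numpy as np

img = cv2.imread("sample.jpg")  # hypothetical training image

# Photometric distortions: adjust brightness/contrast and add noise,
# leaving the geometry of the image untouched.
brighter = cv2.convertScaleAbs(img, alpha=1.2, beta=30)  # contrast x1.2, brightness +30
noise = np.random.normal(0, 10, img.shape).astype(np.int16)
noisy = np.clip(img.astype(np.int16) + noise, 0, 255).astype(np.uint8)

# Geometric distortions: flips, rotations and scaling change the spatial
# layout, so the box annotations would need the same transform.
flipped = cv2.flip(img, 1)  # horizontal mirror
h, w = img.shape[:2]
rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle=15, scale=1.0)
rotated = cv2.warpAffine(img, rot, (w, h))
```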

Following the aforementioned steps, data labelling is the next crucial step in a supervised machine learning task.
Except for open datasets that have already been labelled, such as the one the algorithm partly learns from in this
case, the images must be labelled manually. Our image dataset was resized so that our data could be annotated
correctly, and we then used the Quishi bounding box tool to create labels for our training images. This process
entails capturing the key features required for training by superimposing a bounding box. For instance, when
training a model to learn and recognise faces, one bounds and labels the faces, and when training a model to
learn houses or vehicles, one labels the whole object. Labelling our data with bounding boxes also served the
additional purpose of producing the required YOLO labelling format: for each image file, a corresponding .txt
file with the same name was created in the same directory, containing the annotations for that image.

After installing the YOLO environment, the YOLO repository was cloned, and dependencies were added. This
prepared the programming environment for the execution of commands for object detection training and
inference.

Next, a custom YOLO object detection dataset is downloaded, followed by the definition of the YOLO model
configuration and architecture, training of the algorithm, evaluation and validation of the algorithm's
performance, use of the trained model to draw inferences from test images, and finally deployment of the
saved model on a real-world computer vision task for future inference. A sketch of the training step follows.
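The sketch below covers the training step only, assuming the Ultralytics YOLOv5 repository was cloned as described above; the dataset file data.yaml and all hyperparameters are illustrative, not the exact values used in this work.

```python
import subprocess

# Launch YOLOv5 training from the cloned repository directory.
# data.yaml (hypothetical) lists the class names and train/val image paths.
subprocess.run(
    [
        "python", "train.py",
        "--img", "640",             # input image size
        "--batch", "16",            # batch size
        "--epochs", "100",          # number of training epochs
        "--data", "data.yaml",      # dataset configuration
        "--weights", "yolov5s.pt",  # start from the pretrained small model
    ],
    cwd="yolov5",
    check=True,
)
```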
CHAPTER FOUR

DESIGN AND IMPLEMENTATION

4.1 SYSTEM SPECIFICATION

Implementing object recognition in Python requires a number of things, including suitable hardware, software
libraries, and tools. The system requirements to take into account when implementing object detection are
listed below:

HARDWARE REQUIREMENTS:

1. CPU: For effective object recognition, a multi-core processor is a must, especially when working with big
datasets or real-time applications. A modern CPU with four or more cores (at least 2.5 GHz) is advised.

2. GPU: Having a dedicated GPU (Graphics Processing Unit) can greatly speed up training and inference for
deep learning models used for object recognition.

For deep learning tasks, NVIDIA GPUs, such as the GeForce RTX or Tesla series, are frequently used.

3. MEMORY (RAM): To handle data processing and model training, one will need a sufficient amount of
RAM. It is advised to have at least 16GB of RAM, but more is preferable, particularly for complex models and
larger datasets.

4. STORAGE: For quicker data loading and model training, high-speed storage, like SSDs (Solid State Drives),
is preferred over conventional Hard Disc Drives. You will require adequate storage space to keep datasets,
models, and preliminary findings.

5. CAMERA: A compatible camera or webcam is required for real-time object recognition from live video
feeds.

SOFTWARE REQUIREMENTS:

1. OPERATING SYSTEM: Python object recognition projects can be created on Windows, Linux, and macOS,
among other operating systems. The Linux operating system was used to complete this project's work.

2. PYTHON: Python must be installed on the system because it is the main programming language for object
recognition. Python 2 is no longer supported, so Python 3.x (such as Python 3.7, 3.8, or 3.9) is advised.
3. IDE: A code editor or integrated development environment (IDE) should be used for Python development.
Popular options include Sublime Text, PyCharm, Jupyter Notebook, and Visual Studio Code.

4. DEEP LEARNING FRAMEWORKS: You'll need deep learning frameworks like TensorFlow, PyTorch, or
Keras to work with deep learning models for object recognition. Install the framework that best meets the needs
of your project.

5. COMPUTER VISION LIBRARIES: For image preprocessing, feature extraction, and image manipulation,
libraries like OpenCV (Open Source Computer Vision Library) are crucial.

6. SUPPORTING LIBRARIES: Additional Python libraries may be required for data handling (e.g., NumPy,
pandas), visualisation (e.g., Matplotlib, Seaborn), and deployment (e.g., Flask for building web applications).

7. DATA: To train and test your object recognition models, you'll need labelled image datasets. You can either
make your own or find publicly accessible datasets.

OPTIONAL DEPLOYMENT SOFTWARE:

1. WEB FRAMEWORK: A web framework like Flask or Django may be required if you intend to deploy your
object recognition system as a web application.

2. CLOUD SERVICES: To host your models and applications for scalable deployments, you can use cloud
services like AWS, Azure, or Google Cloud.

3. DOCKER: For simpler deployment and management, you can containerize your application using Docker.

4.2 JUSTIFICATION OF OBJECT RECOGNITION IN PYTHON

Due to its adaptability, extensive libraries, and community support, Python offers several compelling
justifications for using it for object recognition. Here are the primary reasons:

1. RICH LIBRARY ECOSYSTEM: Python has a rich ecosystem of libraries, including OpenCV, TensorFlow,
PyTorch, and scikit-image, which were created specifically for computer vision and object recognition. The
development of object recognition systems is made simpler by these libraries, which offer pre-built functions
and models.

2. MACHINE LEARNING AND DEEP LEARNING SUPPORT: Python is widely used in machine learning
and deep learning. Object recognition models such as convolutional neural networks (CNNs) and recurrent
neural networks (RNNs) can be trained and used with libraries like TensorFlow and PyTorch.
3. INTEGRATION OF OPENCV: The OpenCV (Open Source Computer Vision Library) library is a popular
tool for computer vision tasks. It has Python bindings, enabling programmers to use Python to access a variety
of image processing and computer vision functions. Preprocessing tasks and feature extraction for object
recognition are made simple by this.

4. COMMUNITY AND DOCUMENTATION: The developer community for Python is sizable and vibrant.
This indicates that there is a wealth of information available, including tutorials, online forums, and
documentation that can be used to support and direct object recognition projects. The abundance of resources
from the community can greatly speed up development.

5. COMPATIBILITY ACROSS PLATFORMS: Python is renowned for its cross-platform adaptability. Code
written for one platform, such as Windows, can be easily run on another, such as Linux or macOS, without
requiring significant changes. This is crucial when deploying object recognition systems across a range of
platforms and settings.

6. RAPID PROTOTYPING: Python's straightforward and readable syntax makes it perfect for quick prototyping
and experimentation. This is essential when developing models, experimenting with different algorithms, or
changing parameters.

7. EASY INTEGRATION: Python integrates with other technologies used in object recognition systems, such
as databases, web frameworks, and IoT platforms, without much difficulty. This enables you to create
end-to-end solutions that seamlessly incorporate object recognition.

8. EXTENSIVE VISUALISATION CAPABILITIES: Matplotlib, Seaborn, and Plotly are just a few of the data
visualisation libraries available in Python. These tools assist in model evaluation and debugging by making it
simpler to visualise and analyse the outcomes of object recognition.

9. ROBUST COMMUNITY SUPPORT: Python has a large user base and a vibrant community, which makes it
possible to find packages, tools, and solutions to a variety of problems that can arise when working on object
recognition projects.

10. INDUSTRY AND ACADEMIC ADOPTION: Python is widely used by both academia and industry in the
fields of computer vision and object recognition. It is a tried-and-true option, and many businesses use it as
their main language for creating object recognition applications.

In conclusion, Python is an excellent choice for object recognition projects due to its extensive libraries, support
for machine learning, community resources, and simplicity of use. Its adaptability and strong ecosystem give
developers the resources and assistance they need to build powerful and effective object recognition systems.

4.3 DATA INPUT IMAGES

The You Only Look Once (YOLO) framework approaches object detection in a different way: it predicts the
bounding box coordinates and class probabilities for these boxes using the entire image in a single pass. The
main benefit of using YOLO is that it is quick, easy, and straightforward, and it also learns generalised object
representations.

After setting up the YOLO environment, cloning the YOLO repository, and installing the necessary
dependencies, we feed data to our algorithm.
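As a minimal illustration of this workflow, a pretrained YOLOv5 model can be loaded from PyTorch Hub and run on a single image; "test.jpg" is a hypothetical file name, and this sketch stands in for, rather than reproduces, the exact commands used in the project.

```python
import torch

# Load a small pretrained YOLOv5 model from PyTorch Hub.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# One forward pass over the whole image yields boxes, classes and scores.
results = model("test.jpg")
results.print()  # summary of detected classes and confidences
results.save()   # writes an annotated copy of the image to disk
```

One of the test images entered into the algorithm is shown below: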


Figure 4.1: Test image for the object detection algorithm.

Figure 4.1 shows the input image, which the framework divided into a 3 x 3 grid before the algorithm identified
the presence of objects and predicted their bounding box coordinates and class probabilities.


Figure 4.2: The test image divided into a 3 x 3 grid.

Figure 4.2 illustrates how our test image has been divided into smaller pictures with various pictorial details.
The algorithm applies image classification and localisation to each grid cell and then predicts the bounding
boxes and corresponding class probabilities for any objects found. We have now divided our image into a grid
of three by three cells, with a total of three classes (Bus, Car, and Person) to which we want the objects to be
assigned. The label for each grid cell is therefore an eight-dimensional vector, as shown below:

Table 4.1: The eight-dimensional label vector for each grid cell of the divided test image.

y = (pc, bx, by, bh, bw, c1, c2, c3)ᵀ

where:

pc is the probability that a predefined object is present in the grid cell;

bx, by, bh, and bw specify the bounding box, if an object lies inside the cell;

c1, c2, and c3 indicate the classes.

Our focus here is on the smallest object in the test image, which in this instance is a car. Therefore class 2 (car) is 1 while class 1 (bus) and class 3 (person) are 0, even though the image contains additional data values for buses and people.

For instance, selecting the final cell of the 3 × 3 grid gives the result shown in the figure below: this cell contains no objects.

Figure 4.3: The final cell of the 3 × 3 test image grid, which contains no objects.

As is evident, Figure 4.3 contains no predefined object. Consequently pc is zero, the values of bx, by, bh, bw, c1, c2, and c3 are irrelevant, and the y label for this cell is as follows:

Table 4.2: The label vector for the empty final cell of the 3 × 3 grid test image.

y = (0, ?, ?, ?, ?, ?, ?, ?)ᵀ

where "?" denotes a don't-care value. Table 4.2 shows that for an empty cell with no predefined object, the algorithm returns pc = 0 and the remaining values are ignored. It is important to remember that, while the cell is unquestionably an image in its own right, its content does not belong to the predefined classes the algorithm has been trained to detect.

Next, consider a cell that contains a person (c3 = 1):

Figure 4.4: A cell from the test image containing a person.

Because the YOLO algorithm determined that there is an object in the selected centre-right cell of Figure 4.4, its y label is as shown in the table below:
Table 4.3: The label vector for the centre-right cell of the 3 × 3 test image, which contains a person.

y = (1, bx, by, bh, bw, 0, 0, 1)ᵀ

As the table shows, pc equals 1 because there is an object in this cell, and bx, by, bh, and bw are calculated relative to that particular cell. Since person is the third class, c1 and c2 equal 0 while c3 equals 1. We therefore have an eight-dimensional output vector for each of the nine cells, so the final target is a 3 × 3 × 8 volume. As a result, we now have an input image and its target vector; a minimal sketch of this target volume follows.
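For illustration only, the snippet below constructs the 3 × 3 × 8 target volume just described; the cell indices and box coordinates are hypothetical placeholders chosen for this example.

python

import numpy as np

# One 8-dimensional label vector per cell: [pc, bx, by, bh, bw, c1, c2, c3].
# Classes: c1 = bus, c2 = car, c3 = person.
y = np.zeros((3, 3, 8))

# Cell containing the car (illustrative indices and coordinates):
# pc = 1, box centre and size relative to the cell, one-hot class c2 = 1.
y[2, 1] = [1, 0.4, 0.6, 0.3, 0.5, 0, 1, 0]

# Centre-right cell containing the person: c3 = 1.
y[1, 2] = [1, 0.5, 0.5, 0.8, 0.4, 0, 0, 1]

# Empty cells keep pc = 0; their remaining seven values are don't-cares.
print(y.shape)  # (3, 3, 8)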

4.4 TRAINING

Training an object detection and recognition model in Python with YOLOv3 involves several steps. YOLO is a real-time object detection system that can identify and categorise objects in images or videos. A step-by-step guide to training a YOLOv3 model is provided below:

Step 1: Prepare Your Environment

Make sure you have the following prerequisites before you begin:

1. Python and pip installed.

2. A GPU (optional, but highly recommended for accelerating training).

3. CUDA and cuDNN installed for GPU support.

4. TensorFlow or PyTorch installed.

5. The YOLOv3 model architecture and weights file.

Step 2: Creating the Dataset

For object detection, you need a labelled dataset, arranged so that each image has a corresponding annotation file. COCO and YOLO are two popular annotation formats.

Step 3: Install Dependencies

Using pip, install the required Python packages:

pip install numpy opencv-python tensorflow

(or install PyTorch instead: pip install torch torchvision).

Step 4: Download Pretrained Weights

Visit https://pjreddie.com/darknet/yolo/ to download the YOLOv3 pretrained weights file (yolov3.weights) from
the official YOLO website.

Step 5: Convert the Pretrained Weights

To use TensorFlow, you must convert the pretrained Darknet weights into a TensorFlow-compatible format using the provided conversion script. If you are using PyTorch, pre-converted weights are available.

Step 6: Build the YOLOv3 Model

Define the YOLOv3 model architecture. In TensorFlow, this is done by constructing the model from the YOLOv3 architecture and loading the converted weights.

Step 7: Preprocess and Load Data

Prepare your dataset for training by loading it, preprocessing the images, and creating training-ready
annotations.

Step 8: Train the Model

Train the YOLOv3 model on your dataset. Mean Squared Error (MSE) and cross-entropy loss functions are typically used for bounding box regression and object classification, respectively, together with an optimiser such as Adam; a minimal sketch of this setup follows.
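As a schematic only, the snippet below shows this training setup in TensorFlow. The build_yolov3() helper is hypothetical, standing in for whatever model constructor Step 6 produced, and the loss shown omits the objectness term of the full YOLOv3 loss.

python

import tensorflow as tf

# Hypothetical helper from Step 6: returns a Keras model that outputs
# bounding-box predictions and class predictions.
model = build_yolov3(num_classes=3)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
mse = tf.keras.losses.MeanSquaredError()         # bounding-box regression
cce = tf.keras.losses.CategoricalCrossentropy()  # object classification

@tf.function
def train_step(images, box_targets, class_targets):
    with tf.GradientTape() as tape:
        box_preds, class_preds = model(images, training=True)
        loss = mse(box_targets, box_preds) + cce(class_targets, class_preds)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss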
Step 9: Tune Hyperparameters

To improve the performance of the model, experiment with hyperparameters like learning rate, batch size, and
training epochs. To enhance training, employ strategies like learning rate scheduling and data augmentation.

Step 10: Evaluate the Model

After training, assess the model's performance with metrics such as mean Average Precision (mAP) and Intersection over Union (IoU), depending on your dataset and requirements.

Step 11: Post-Processing

Apply non-maximum suppression (NMS) to get rid of redundant and low-confidence detections; a sketch of IoU and NMS is given below.
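For illustration, the snippet below implements IoU (Step 10) and a simple greedy NMS (Step 11) in plain Python; the (x, y, w, h) box format and the 0.4 overlap threshold are assumptions made for this sketch.

python

def iou(a, b):
    # Boxes given as (x, y, w, h); returns the Intersection over Union.
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.4):
    # Greedy NMS: keep the highest-scoring box, discard any remaining box
    # that overlaps it too strongly, and repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep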

Step 12: Inference

Use the trained YOLOv3 model to identify objects in fresh images. Because training a YOLOv3 model from scratch can be resource- and time-intensive, it is frequently a good idea to start from a pre-trained model and fine-tune it for your particular task. Additionally, TensorFlow and PyTorch have prebuilt YOLOv3 implementations that can simplify the process.

This approach of starting from pre-trained weights and fine-tuning them is known as transfer learning, and it is the standard way of training YOLO on custom data.

4.5 ALGORITHM DESIGN

YOLOv3 (You Only Look Once, version 3) is a popular option for real-time computer vision tasks. The steps to implement object detection and recognition with YOLOv3 in Python are as follows:

1. Configuration and Dependencies:

• Install the necessary libraries and dependencies: TensorFlow (which includes Keras), OpenCV, and NumPy.

• From the YOLO website or other reputable sources, download the YOLOv3 pre-trained weights and
configuration files.

2. Load the YOLOv3 Model:

• Load the pre-trained YOLOv3 model from the weights and configuration files using OpenCV's cv2.dnn.readNet function.

python

import cv2

net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")


3. Load the Class Names:

• Load the categories or labels the YOLO model can identify. These are usually stored in a text file.

python

with open("coco.names", "r") as f:
    classes = f.read().strip().split("\n")

4. Load an Image or Video:

• Using OpenCV, capture an image or video stream.

python

cap = cv2.VideoCapture("input.mp4")

5. Preprocessing:

• Preprocess the input image or frame to the input size the YOLOv3 model expects.

python

blob = cv2.dnn.blobFromImage(frame, 1/255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

6. Forward Pass:

• Run the YOLOv3 network on the preprocessed input to obtain predictions for object detection.

python

outs = net.forward(net.getUnconnectedOutLayersNames())

7. Post-processing:

• Extract bounding boxes, confidence scores, and class IDs from the YOLO model's output.

• Use non-maximum suppression to get rid of bounding boxes that overlap and have low confidence.
python

import numpy as np

conf_threshold = 0.5
nms_threshold = 0.4
height, width = frame.shape[:2]

boxes = []
confidences = []
class_ids = []

for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > conf_threshold:
            # Box coordinates are predicted relative to the image size.
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)

indices = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)


8. Draw Bounding Boxes:

• Mark objects that have been detected with bounding boxes and class labels.

python

for i in indices:
    i = i[0]  # older OpenCV versions return nested indices
    x, y, w, h = boxes[i]
    label = str(classes[class_ids[i]])
    confidence = confidences[i]
    colour = (0, 255, 0)
    cv2.rectangle(frame, (x, y), (x + w, y + h), colour, 2)
    cv2.putText(frame, f"{label}: {confidence:.2f}", (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, colour, 2)

9. Show or Save the Output:

• You can either display or save the annotated frame as a file.

python

cv2.imshow("Object Detection", frame)
cv2.imwrite("output.jpg", frame)  # alternatively, save the annotated frame

10. Cleanup:

• Release the video capture and close all OpenCV windows when finished.

python

cap.release()

cv2.destroyAllWindows()
4.6 IMPLEMENTATION

Step 1: Import the necessary libraries.

import cv2
import numpy as np

Step 2

# Load the YOLO model and the COCO class names.

net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")
classes = []
with open("coco.names", "r") as f:
    classes = f.read().splitlines()

Step 3

# Load the image

image = cv2.imread("image.jpg")
height, width = image.shape[:2]

Step 4

# Image preprocessing

blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416), swapRB=True, crop=False)

net.setInput(blob)

Step 5

# Gather detection results

layer_names = net.getUnconnectedOutLayersNames()
detections = net.forward(layer_names)

Step 6

# Process detections

for detection in detections:
    for obj in detection:
        scores = obj[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5:
            center_x = int(obj[0] * width)
            center_y = int(obj[1] * height)
            w = int(obj[2] * width)
            h = int(obj[3] * height)
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            label = f"{classes[class_id]}: {confidence:.2f}"
            cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(image, label, (x, y - 15),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

Step 7

# Show result

"Object Detection", image; cv2.imshow;

cv2.waitKey(0)

cv2.destroyAllWindows()

4.7 TESTING
Testing our object detection algorithm involves a few steps, listed as follows:

1. Install the required libraries: For this task you need OpenCV, NumPy, and the YOLOv3 weights and configuration files. The libraries can be installed using pip:

bash

pip install opencv-python numpy

2. Download the YOLOv3 weights and configuration files: These can be downloaded from a pre-trained model repository or the official YOLO website. For the purposes of this example, assume you have downloaded the three files listed below:

• yolov3.cfg: the YOLOv3 configuration file.

• yolov3.weights: the YOLOv3 pre-trained weights.

• coco.names: the list of class names from the COCO dataset.

3. Load the YOLOv3 model: Use OpenCV to load the model from the weights and configuration files:

python

import cv2

# Load the YOLOv3 model.
net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")

4. Load the class names: Open the coco.names file and read the class names.

python

with open("coco.names", "r") as file:
    classes = file.read().strip().split("\n")

5. Capture an image for testing: Use OpenCV to load an image for object detection and recognition, or to capture a video stream. Here is an example with a test image:

python

image = cv2.imread("image.jpg")


6. Perform object detection with YOLOv3 to look for objects in the image:

python

blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
detections = net.forward(net.getUnconnectedOutLayersNames())

7. Analyse the detections: Iterate over the detections and draw bounding boxes with class labels:

python

import numpy as np

for detection in detections:
    for obj in detection:
        scores = obj[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5:  # adjust the confidence threshold as necessary
            center_x, center_y, w, h = (obj[:4] * np.array(
                [image.shape[1], image.shape[0],
                 image.shape[1], image.shape[0]])).astype(int)
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(image, f"{classes[class_id]}: {confidence:.2f}",
                        (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                        (0, 255, 0), 2)

8. Show the outcome: Display the image with its bounding boxes and class labels:

python

cv2.imshow("Object Detection", image)
cv2.waitKey(0)
cv2.destroyAllWindows()

As training progressed, the best-performing and most recent checkpoint of the model, stored in the backup directory, was selected and used to test new image inputs. Images in our data root directory serve as test data, with the data divided into 90% for training and 10% for validation; a sketch of such a split follows.
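As an illustration, the snippet below performs a 90/10 split and writes Darknet-style image list files; the directory layout and file names are assumptions made for this sketch.

python

import glob
import random

# Collect the image paths (assumed layout) and shuffle them reproducibly.
images = glob.glob("data/obj/*.jpg")
random.seed(42)
random.shuffle(images)

# 90% of the images go to training, the remaining 10% to validation.
split = int(0.9 * len(images))
with open("data/train.txt", "w") as f:
    f.write("\n".join(images[:split]))
with open("data/valid.txt", "w") as f:
    f.write("\n".join(images[split:]))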

The final mAP score obtained after training is 70%, meaning the model correctly detects and locates objects of our classes roughly 7 times out of 10; this is a good score considering the limited amount of data used.

The test-me.jpg file in our data directory is used as the test image, and a threshold of 30% is chosen so that only bounding boxes with a probability score of 30% or above are displayed. Testing our test image yielded the following results:

!./darknet detector test data/combo.data cfg/yolov3-custom.cfg /mydrive/obj_det/darknet/backup/yolov3-custom_best.weights data/test-me.jpg -thresh 0.3

A truncated version of our typical log output follows:

CUDA-version: 11000 (11020)
cuDNN: 7.6.5, CUDNN_HALF=1
OpenCV version: 3.2.0
GPU: Tesla K80, compute_capability=370, cudnn_half=0, GPU count: 1
net.optimized_memory=0, mini_batch=1, batch=1, time_steps=1, train=0
...
avg_outputs = 490041, allocating additional workspace_size = 12.46 MB
Loading weights from /mydrive/obj_det/darknet/backup/yolov3-custom_best.weights... seen 64, trained: 16 K-images (64 Kilo-batches)
Done! Loaded 139 layers from weights-file
Detection layer: 150 - type = 28
Detection layer: 161 - type = 28
data/test-me.jpg: Predicted in 101.298000 milli-seconds.
person: 86%
car: 70%

4.8 RESULTS AND DISCUSSION

Object detection with Python and YOLOv3 involves several steps, from data preparation and model training to inference and evaluation. Our test outcomes were largely determined by the calibre of our training data, the model design, and fine-tuning. YOLOv3 is a flexible and effective tool for a range of computer vision tasks, and careful implementation and optimisation are crucial to its success.

Like other YOLO variants, our algorithm had some difficulty handling small or crowded objects, overcoming occlusions, and achieving real-time performance on resource-limited devices.

The result of our analysis is a list of bounding boxes, each accompanied by a confidence level and a class label; these bounding boxes represent the objects found in the input image.

Our algorithm's speed and accuracy were evaluated on the three test images displayed below.
Figure 4.5: Output of a test image detection

Figure 4.5 shows that the algorithm correctly classifies an umbrella as belonging to the "Umbrella" category with a 92% confidence level, followed by a person and a car at 86% and 70% confidence, respectively. As can be seen, our YOLOv3 algorithm successfully detected the objects, running in 22 ms at 28.2 mAP, which is three times faster than Single Shot MultiBox Detection (SSD).

Figure 4.6: Output of a test image detection

Figure 4.6 shows that our algorithm successfully identified and classified the object as a bus with a 94% confidence level. It also detected the bus driver, classified as a "person", with a confidence value of 55%.
Figure 4.7: An additional test image detection result.

Figure 4.7 shows that our algorithm successfully identified and classified the object as a person with a 99%
confidence level.

According to our findings, RetinaNet reached 57.5 AP50 in 198 ms, while our YOLOv3 object detection algorithm achieved a comparable score in 51 ms. Since 198/51 ≈ 3.9, YOLOv3 performed similarly to RetinaNet while being roughly 3.8 times faster.

In conclusion, our YOLOv3 object detection algorithm is able to accurately and quickly identify a variety of
objects in real-time. Additionally, it is very adaptable and can be trained using the users' own datasets for
particular object detection tasks. The combination of YOLOv3's speed and accuracy has made it a well-liked
option in the computer vision community.
CHAPTER 5

CONCLUSION

It is clear from the foregoing that the YOLO object detection model is fast and precise, and that it is simple to export and use in mobile and/or web applications.

This project work, which focused on data collection, training, and testing, can be applied in many different contexts and significantly improved by gathering more data and continuing the training phase from one of the checkpoints in the backup directory; this would also raise the mAP score of 66.7%. The model shows that a large amount of previously unseen data can be detected, and it can be applied to captured images to quickly find and detect objects in a crowded scene. A sketch of resuming training from a checkpoint is shown below.
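As a sketch only, assuming the Darknet directory layout used during testing and Darknet's convention of saving a *_last.weights checkpoint, training can be resumed with a command of the following form:

!./darknet detector train data/combo.data cfg/yolov3-custom.cfg backup/yolov3-custom_last.weights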

SUGGESTION

YOLO is an effective algorithm for carrying out detection tasks, and object detection is an intriguing area of
computer science. However, there is always a tradeoff between speed and accuracy, which has been the main
topic of discussion since the first version. To lessen the speed/accuracy tradeoff seen in YOLO, it is suggested
that more research be conducted and more data be used in detection tasks.

RECOMMENDATION

The techniques used in this project can be replicated to create other real-time related object detection projects in
the application of robotics, medical imaging, shape recognition, etc. YOLO is a viable object detection
algorithm.
Check out the links below for a full implementation of the YOLO algorithm.

SUMMARY:

YOLO is a cutting-edge object detection algorithm that is incredibly fast and precise.

We feed an input image to a CNN, which returns a volume of dimensions 19 × 19 × 5 × 85: the image is divided into a 19 × 19 grid, each cell predicts 5 boxes, and each box is described by 85 values (an objectness score, four box coordinates, and 80 class probabilities).

Non-max suppression is then used to filter all of the boxes, keeping only the accurate ones and removing any overlapping boxes.

REFERENCES

GitHub (full implementation): https://github.com/sandhyareddy2001/Step-by-step-YOLO

https://www.analyticsvidhya.com/blog/2018/12/practical-guide-object-detection-yolo-framewor-python/

https://youtu.be/ag3DLKsl2vk

https://towardsdatascience.com/yolo-you-only-look-once-3dbdbb608ec4

https://en.wikipedia.org/wiki/Yolo

Zoumana, K. (2022, September 7). Introduction to Object Detection. DataCamp. https://www.datacamp.com/blog/yolo-object-detection-explained
