
PROFESSIONAL INDUSTRY TRAINING/INTERNSHIP REPORT


ON
“Object Recognition”

AT

Azure Skynet
Gurgaon, Haryana

AN INDUSTRY INTERNSHIP REPORT SUBMITTED


IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF DEGREE OF

BACHELOR OF ENGINEERING
In
Computer Science and Engineering

SUBMITTED BY:
Dikshant Sharma
Roll Number: 191203040

SUBMITTED TO

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


Model Institute of Engineering and Technology (Autonomous)
2023

CANDIDATE'S DECLARATION
I, Dikshant Sharma, 191203040, hereby declare that the work being presented in the
Industry Internship Report entitled "Object Recognition", in partial fulfillment of the
requirement for the award of the degree of B.E. (Computer Science and Engineering) and
submitted to the Department of Computer Science and Engineering, Model Institute of
Engineering and Technology (Autonomous), Jammu, is an authentic record of my own work,
carried out by me at Azure Skynet, Gurgaon (Haryana) under the supervision and mentorship
of Mr. Ashish Saini (Azure Skynet, Gurgaon, Haryana) and Mr. Navin Mani Upadhyay
(Department of Computer Science and Engineering, MIET) respectively. The matter presented
in this report has not been submitted in this or any other University / Institute for the award
of a B.E. degree.

Signature of the Student Dated: 07/10/2022

(Dikshant Sharma)
191203040

INTERNSHIP CERTIFICATE

ACKNOWLEDGEMENTS

An Industry Internship is an important aspect in the field of engineering, where

contribution is made by many persons and organizations. The present shape of this
work has come forth after contributions from different spheres.
I would also like to thank my parents, friends and others who helped me in my
Industry Internship.
I express my sincere gratitude to Azure Skynet, Gurgaon (Haryana) for
giving me the opportunity to work on the Industry Internship during my final year of B.E.
I express my sincere gratitude to Model Institute of Engineering and
Technology (Autonomous), Jammu for giving me the opportunity to work on the Industry
Internship during my final year of B.E.
At the end, thanks to the Almighty.

(Dikshant Sharma)
191203040

Department of Computer Science and Engineering
Model Institute of Engineering and Technology (Autonomous) Kot
Bhalwal, Jammu, India
(NAAC “A” Grade Accredited)

Ref. No.: 191203040 Date: 07/10/2022

CERTIFICATE
Certified that this Industry Internship Report entitled "Object Recognition using OpenCV"

is the bonafide work of "Dikshant Sharma, 191203040, of 7th Semester, Branch Computer

Science and Engineering, Model Institute of Engineering and Technology

(Autonomous), Jammu", who carried out the Industry Internship at "Azure Skynet,

Gurgaon (Haryana)" under my mentorship during July 2022 - August 2022.

Mr. Navin Mani Upadhyay


Mentor - Internal Supervisor
Department of Computer Science and Engineering, MIET

This is to certify that the above statement is correct to the best of my knowledge.

Prof. (Dr.) Ashok Kumar


Dean, Academic Affairs
Department Name, MIET

ABSTRACT

Object detection is a well-known computer technology connected with

computer vision and image processing that focuses on detecting objects or
instances of a certain class (such as humans, flowers, animals) in digital images
and videos. Several applications of object detection have been well researched,
including face detection, character recognition, and vehicle counting. Object
detection can be used for various purposes, including retrieval and surveillance.
In this study, the basic concepts used in object detection with the OpenCV
library of Python, along with ways of improving the efficiency and accuracy of
object detection, are presented.

Contents
Candidate's Declaration
Internship Certificate
Acknowledgements
Abstract
Contents
List of Figures
Chapter 1 Introduction
1.1 Preamble
1.2 Object Detection
1.3 Approaches
1.4 Problem Description and Research Questions
Chapter 2 Background
2.1 Machine Learning
2.2 Computer Vision
2.3 Why Python
2.4 Libraries & Modules
2.4.1 NumPy
2.4.2 Argparse
2.4.3 OpenCV
2.5 Unified Detection Model – YOLO
2.6 Training YOLO on COCO
2.7 Image Processing Technique
2.8 Object Classification in Moving Object Detection
Chapter 3 Workflow
3.1 Steps Involved in Object Detection in Python 3.7
3.1.1 Install OpenCV-Python
3.1.2 Read an Image
3.1.3 Feature Detection and Description
3.2 Architecture of the Proposed Model
3.3 Implementation
3.4 Results and Analysis
3.4.1 Test Cases
3.4.2 Results
Chapter 4 Conclusions
4.1 Applications and Future Scope
4.2 Challenges
4.3 Conclusions

REFERENCES

APPENDIX A
APPENDIX B

LIST OF FIGURES

Figure No.  Caption
1.1   Image Recognition vs. Object Recognition
2.1   Figure 2.1
2.2   YOLO Structure
2.3   YOLO Network Architecture
2.4   Bounding Box Prediction Formula
3.1   YOLO Architecture
3.2   Data Flow Diagram of the System
3.3   Image with Detected Object
3.4   Image with Overlapping Objects
3.5   Objects Detected Within a Black and White Picture

Chapter 1 Introduction

1.1 Preamble
The main aim of this project is to build a system that detects objects in an image or a
stream of images given to the system in the form of previously recorded video or real-time
input from a camera. Bounding boxes are drawn around the objects detected by the system,
and the system also classifies each object into the class it belongs to. Python programming
and a machine learning technique named YOLO (You Only Look Once), built on a
convolutional neural network, are used for the object detection.

1.2 Object Detection


Object detection is a computer vision technique that works to identify and locate objects
within an image or video. Specifically, object detection draws bounding boxes around these
detected objects, which allow us to locate where said objects are in (or how they move through)
a given scene.
Object detection is commonly confused with image recognition, so before we proceed, it’s
important that we clarify the distinctions between them.
Image recognition assigns a label to an image. A picture of a dog receives the label “dog”. A
picture of two dogs, still receives the label “dog”. Object detection, on the other hand, draws a
box around each dog and labels the box “dog”. The model predicts where each object is and
what label should be applied. In that way, object detection provides more information about an
image than recognition.
Here’s an example of how this distinction looks in practice:

Fig 1.1: Image Recognition vs. Object Recognition



1.3 Approaches
Broadly speaking, object detection can be broken down into machine learning-based
approaches and deep learning-based approaches.

In more traditional ML-based approaches, computer vision techniques are used to look at
various features of an image, such as the color histogram or edges, to identify groups of pixels
that may belong to an object. These features are then fed into a regression model that predicts
the location of the object along with its label.

On the other hand, deep learning-based approaches employ convolutional neural networks
(CNNs) to perform end-to-end object detection, in which features are learned automatically
rather than being defined and extracted separately.

1.4 Problem description and Research Questions


The project "Object Detection System using Machine Learning Technique" detects
objects efficiently using the YOLO algorithm, applying it to image data and video data.
The first research question of this study is to verify whether YOLO is a good model for
general object detection.
The second research question is to evaluate the accuracy of our YOLO detection model.

Chapter 2 Background

2.1 Machine Learning


Machine learning is an area within Artificial Intelligence (AI), which focuses on decision
making and predictions. The primary aim is to allow the computer to improve itself without
human intervention; it is trained based on observations and data. Machine
learning algorithms are often categorized as supervised or unsupervised, where supervised
algorithms apply what has been learned in the past to new data using labeled examples. This
type of problem can be either predicting a continuous quantity (Regression) or discrete class
labels (Classification). Unsupervised learning includes grouping a set of uncategorized data by
finding structures or patterns (Clustering) or finding relationships between variables in big data
(Association).
A dataset is a pool of samples compiled with the intention of educating a machine on
how to go about a task. These pools can sometimes be numbers, images, texts, or any other kind
of data. It can sometimes be quite strenuous and expensive to compile a dataset depending on
how specialized the task is expected to be.
Features are important pieces of data that act as the key to the solution of the task. They
show the machine what to look for in an image. The way features are defined varies based on
what the machine is expected to look out for. For instance, to predict the price of an apartment,
it would prove too difficult to predict, by way of linear regression, the cost of the property
based on the combination of its length and width. However, it may be easier to predict a price
based on the correlation between the price and location of a building. Therefore, there is no
gainsaying the fact that the accuracy of the features provided affects the eventual performance
of the machine. This is what occurs in supervised learning, where the machine is trained with
labeled data which contain the "right solutions", along with a validation set. During the
learning process, the program is expected to get the "right" solution, and the validation set is
then used to tune hyperparameters to avoid overfitting. However, in unsupervised learning,
features are learned from unlabeled input data. In this case the machine is not given any
features to familiarize itself with; it learns to notice the patterns by itself.

2.2 Computer Vision


Computer vision is the area of study in which computers are empowered to visualize, recognize
and process what they see in a similar way as that of humans. The main aim of computer vision
is to generate relevant information from image and video data in order to deduce something
about the world. It can be classified as a sub-field of artificial intelligence and machine learning.

This is quite different from image processing, which involves manipulating or enhancing visual
information and is not concerned with the contents of the image. Applications of computer
vision include image classification, visual detection, 3D scene reconstruction from 2D images,
image retrieval, augmented reality, machine vision and traffic automation.

Today, machine learning is a necessary component of many computer vision algorithms.


These algorithms are typically a combination of image processing and machine learning
techniques. The major requirement of these algorithms is to handle large amounts of
image/video data and to be able to perform computation in real time for a wide range of
applications, for example real-time detection and tracking.

2.3 Why Python


Object detection is a domain-specific variation of the machine learning prediction problem.
Intel's OpenCV library, implemented in C/C++, has its interfaces available in a number of
programming environments such as C#, MATLAB, Octave, R, Python, etc. Some of the benefits
of using Python over other languages for object detection are:
• More compact and readable code.
• Python uses zero-based indexing.
• Dictionary (hashes) support is offered.

• Simple and elegant Object-oriented programming.


• Free and open.
• Multiple functions can be packaged in one module.
• More choices in graphics packages and toolsets.

2.4 Libraries & Modules

2.4.1 NumPy
NumPy is the fundamental package for scientific computing with Python [11]. It can be
treated as an extension of the Python programming language with support for multidimensional
matrices and arrays. It is open source software with many contributors. It contains, among
other things:
• A powerful N-dimensional array object.
• Broadcasting functions.
• Tools for integrating C/C++ and FORTRAN code.
• Useful linear algebra, Fourier transform, and random number capabilities.

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data types can be defined. This allows NumPy to seamlessly
and speedily integrate with a wide variety of databases. NumPy is licensed under the BSD
license, enabling reuse with few restrictions.
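As a brief illustration of the N-dimensional array object and broadcasting mentioned above, the following minimal sketch (all values are illustrative) builds a small two-dimensional array and scales it without an explicit loop:

```python
import numpy as np

# A small 2-D array (e.g., a tiny grayscale image) built from nested lists.
img = np.array([[0, 64, 128],
                [192, 255, 32]], dtype=np.uint8)

print(img.shape)   # (2, 3) -> rows x columns
print(img.dtype)   # uint8

# Broadcasting: scale every element at once, without writing an explicit loop.
normalized = img / 255.0
print(normalized.max())   # 1.0
```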

2.4.2 Argparse
The argparse module in Python helps create programs for a command-line environment in a
way that is not only easy to code but also improves user interaction. The argparse module
automatically generates help and usage messages and issues errors when users give the
program invalid arguments.

You may already be familiar with CLIs: programs like git, ls, grep, and find all expose command-
line interfaces that allow you to call an underlying program with specific inputs and options.
argparse allows you to call your own custom Python code with command-line arguments
similar to how you might invoke git, ls, grep, or find using the command line. You might find this
useful if you want to allow other developers to run your code from the command line.

The argparse standard library module makes it possible to write command-line interfaces
that accept positional and optional arguments to control the underlying program's behavior.
A CLI can also be self-documented by providing help text that is displayed to the developers
who use it.
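A minimal sketch of what such a command-line interface for this project might look like is shown below; the flag names and default values are illustrative assumptions rather than the project's actual options:

```python
import argparse

# Hypothetical CLI for the detector; flag names and defaults are illustrative.
parser = argparse.ArgumentParser(
    description="Run object detection on an image or a video file.")
parser.add_argument("input", help="path to an image or video file")
parser.add_argument("--confidence", type=float, default=0.3,
                    help="minimum probability required to keep a detection")
parser.add_argument("--show", action="store_true",
                    help="display the annotated output in a window")

args = parser.parse_args()
print(args.input, args.confidence, args.show)
```

Running the script as, for example, python detect.py im.jpg --confidence 0.5 would populate args accordingly, and argparse generates the -h/--help text automatically.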

2.4.3 OpenCV
OpenCV (Open Source Computer Vision) is an open source computer vision and machine
learning software library. OpenCV was initially built to provide a common infrastructure for
applications related to computer vision and to increase the use of machine perception in
commercial products. As it is a BSD-licensed product, it is easy for businesses to utilize
and modify the existing code in OpenCV.

Around 3000 algorithms are currently embedded inside the OpenCV library, all of them
efficiently optimized. It supports real-time vision applications. These algorithms are
categorized into classic algorithms, state-of-the-art computer vision algorithms and machine
learning algorithms. They are easily used from Java, MATLAB, Python, C, C++, etc. and are
well supported by operating systems such as Windows, macOS, Linux and Android. Full-featured
CUDA and OpenCL interfaces are being actively developed for the betterment of the technology.
There are more than 500 different algorithms and even more functions that compose or support
those algorithms. OpenCV is written natively in C++ and has a templated interface that works
seamlessly with STL containers.

For OpenCV to work efficiently with Python, we need to install the NumPy package first.
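The reason for this dependency is that OpenCV exposes images to Python as NumPy arrays. A quick sketch of this interplay is given below; the output file name is an illustrative assumption:

```python
import cv2
import numpy as np

print(cv2.__version__)   # confirms the OpenCV installation

# OpenCV images are plain NumPy arrays: create a black BGR image and draw on it.
canvas = np.zeros((240, 320, 3), dtype=np.uint8)
cv2.rectangle(canvas, (50, 50), (270, 190), (0, 255, 0), 2)   # green box
cv2.imwrite("canvas_with_box.png", canvas)                    # illustrative output file
```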

2.5 Unified Detection Model – YOLO
All previous object detection algorithms used regions to localize the object within the
image; the network does not look at the complete image but only at parts of the image that
have high probabilities of containing the object. YOLO, or You Only Look Once, is an object
detection algorithm that differs from these region-based algorithms: in YOLO a single
convolutional network predicts the bounding boxes and the class probabilities for those boxes.

Fig 2.1

YOLO works by taking an image and splitting it into an S x S grid; within each grid cell it
takes m bounding boxes. For each bounding box, the network outputs a class probability and
offset values for the box. Bounding boxes with a class probability above a threshold value are
selected and used to locate the object within the image.
YOLO is orders of magnitude faster (45 frames per second) than other object detection
algorithms. The limitation of the YOLO algorithm is that it struggles with small objects within
the image; for example, it might have difficulty identifying a flock of birds. This is due to the
spatial constraints of the algorithm.
The YOLO model was first introduced by Joseph Redmon in his paper "You Only Look Once:
Unified, Real-Time Object Detection". The algorithm employs a single neural network that
takes a photograph as input and predicts bounding boxes and class labels for each bounding
box directly. Although this offered lower predictive accuracy, mostly due to more localization
errors, it boasted speeds of up to 45 frames per second, and up to 155 frames per second on
speed-optimized versions of the model [6].

“Our unified architecture is extremely fast. Our base YOLO model processes images in real-time
at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding
155 frames per second …” [7]

To begin with, the model operates by splitting the input image into a grid of cells, where each
cell is responsible for predicting a bounding box if the center of a bounding box falls within it.
Each grid cell predicts a bounding box involving the x, y coordinates, the width and height,
and a metric of quality known as a confidence score. A class prediction is also made for each
cell. For example, an image may be divided into a 7 × 7 grid and each cell in the grid may
predict 2 bounding boxes, resulting in 98 proposed bounding box predictions. The class
probability map and the bounding boxes with confidences are then combined into a final set
of bounding boxes and class labels. YOLO was not without shortcomings; the algorithm had a
number of limitations because of the number of grid cells it could run on, as well as some
other issues which are addressed subsequently. Firstly, the model uses a 7 × 7 grid and, since
each grid cell can only identify one object, the model restricts the maximum number of
detectable objects to 49. Secondly, the model suffers from what is known as the close
detection problem: since each grid cell is only capable of detecting one object, if a grid cell
contains more than one object it will be unable to detect them all. Thirdly, a problem might
arise because an object may span more than one grid cell, so there is a possibility that the
model detects the object more than once [8]. Due to the aforementioned problems encountered
when running YOLO, it was fairly obvious that localization error and other problems of the
system needed to be addressed. As a result, YOLOv2 was created as an improvement to deal
with the issues and questions posed by its predecessor. Localization errors as well as recall
were significantly improved in the new version. The model was updated by Joseph Redmon and
Ali Farhadi to further revamp model performance in their 2016 paper named "YOLO9000:
Better, Faster, Stronger" [6].

Structure of YOLO
YOLO is implemented as a convolutional neural network and has been evaluated on the PASCAL VOC
detection dataset. It consists of a total of 24 convolutional layers followed by 2 fully connected layers.

The layers are separated by their functionality in the following manner:


1. The first 20 convolutional layers, followed by an average pooling layer and a fully
connected layer, are pre-trained on the ImageNet 1000-class classification dataset.
2. The pre-training for classification is performed on the dataset at a resolution of
224 × 224.
3. The layers comprise 1 × 1 reduction layers and 3 × 3 convolutional layers.
4. The last 4 convolutional layers followed by 2 fully connected layers are added to
train the network for object detection.
5. Object detection requires more granular detail, hence the resolution of the
dataset is bumped to 448 × 448.
6. The final layer predicts the class probabilities and bounding boxes.

Figure 2.2. YOLO Structure [10].



Network Structure

Figure 2.3. YOLO Network Architecture


The whole system can be divided into two major components: the feature extractor and the
detector; both are multi-scale. When a new image comes in, it first goes through the feature
extractor so that feature embeddings can be obtained at three (or more) different scales. Then,
these features are fed into three (or more) branches of the detector to get bounding boxes and
class information.

Intersection over Union (IoU)


YOLO's default metric for measuring overlap between two bounding boxes or masks is
Intersection over Union (IoU). Any algorithm that provides predicted bounding boxes as output
can be evaluated using IoU [8]. If the prediction is completely correct, IoU = 1; the lower the
IoU, the worse the prediction result. To apply Intersection over Union, certain inputs must be
available to evaluate an (arbitrary) object detector: the ground-truth bounding boxes, which
are the hand-labeled bounding boxes from the testing set that specify where in the image our
object is, and the predicted bounding boxes from our model. If these two sets of bounding
boxes are present, it is possible to apply Intersection over Union, which is calculated as the
area of overlap divided by the area of union [9].
Plainly put, Intersection over Union is a ratio expressing the similarity of the ground-truth
bounding box and the predicted bounding box, giving a rough estimate of how much an
artificial intelligence can rely on the predictions made by the algorithm. This is an improvement
over binary models which label predictions as either correct or incorrect. Also, due to the
variance in parameters between the model and the object, it is quite unrealistic to have a 100%
match between the (x, y) coordinates of a predicted bounding box and the (x, y) coordinates of
the ground-truth box [9]. This measure ensures that boxes with a larger area of overlap get
higher scores than those with smaller areas, cementing Intersection over Union as an excellent
metric for evaluating custom object detectors. Generally, any score greater than 0.8 is a good
score [11]. This application of Intersection over Union is used during the testing phase at the
completion of training [12].
Another interesting application of this metric occurs where there is more than one bounding
box for the same object. The metric helps to eliminate bounding boxes with lower scores: if
there are two bounding boxes with very high confidence scores, the chances are that the boxes
are detecting the same object, so the box with the higher confidence rating is selected. However,
where the confidence rating of the two boxes is low, it is likely that they are predicting separate
objects of the same class, such as two different cars in the same picture. This application is
used after the model has completed training and is being deployed [12]. As mentioned earlier,
Intersection over Union is a good metric for evaluating the quality of a task. Before further
elucidation, it is pertinent to define some terminology: true positives, true negatives, false
negatives, and false positives. Simply put, a true positive is any correctly drawn annotation
with an IoU value greater than 0.5, while a true negative is when the model correctly refuses to
draw any annotation because there simply is not one to be drawn; no value is produced here
because no annotation is drawn, so there is no way to count true negatives directly. A false
negative occurs where an annotation is missing, while a false positive occurs where an
incorrectly drawn annotation has an IoU score of less than 0.5. A measure known as accuracy
is usually used to quantify the performance of a task because it is a straightforward
measurement: it is simply the ratio of correctly drawn annotations to the total expected
annotations (ground truth). It is calculated as the sum of true positives and true negatives
divided by the sum of true positives, false positives, true negatives and false negatives.
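As a concrete illustration of the overlap-over-union computation described above, here is a small self-contained sketch for axis-aligned boxes given as (x1, y1, x2, y2); the coordinate values in the call are made up purely for the example:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A hand-labeled ground-truth box against a predicted box (illustrative values).
print(iou((50, 50, 200, 200), (60, 60, 210, 220)))   # about 0.73
```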

Direct Location Prediction


In the early versions of the YOLO model, there were no constraints on the location prediction.
The predicted bounding box was not tethered, so it could occur far from the original grid
location. This anomaly resulted in a very unstable model [13]. The bulk of the instability
resulted from predicting the (x, y) locations for the box. In region proposal networks the
network predicts values t_x and t_y and the (x, y) center coordinates are calculated as:

x = (t_x * w_a) - x_a
y = (t_y * h_a) - y_a

For example, a prediction of t_x = 1 would shift the box to the right by the width of the anchor
box, and a prediction of t_x = -1 would shift it to the left by the same amount [25]. This
formulation was unbounded, so any anchor box could end up at any point in the image
regardless of which location predicted the box. With random initialization it took the model a
very long time to stabilize and predict sensible offsets [13]. Subsequent versions of YOLO
diverged from this approach and devised a means to properly tackle the situation: YOLOv2
bounds the location using the logistic activation sigma (σ), which ensures that the value
remains between 0 and 1 [14]. Given the anchor box of size (p_w, p_h) at the grid cell with its
top-left corner at (c_x, c_y), the model predicts the offsets and scales (t_x, t_y, t_w, t_h), and
the corresponding predicted bounding box has center (b_x, b_y) and size (b_w, b_h). The
confidence score is the sigmoid (σ) of another output, t_o. Since the location prediction is
constrained, the parameterization is easier to learn, making the network more stable. The use
of dimension clusters along with directly predicting the bounding box's center location
improves YOLO by almost 5% over the version with anchor boxes.

Bounding Box Prediction


A bounding box is a relatively closed space within which points, objects, or a group of objects
may be contained. For YOLO, the cell in which the center of an object resides is the cell
responsible for detecting that object. Each cell will predict B bounding boxes and a confidence
score for each box. The confidence score ranges from 0.0 to 1.0, with 0.0 being the lowest
confidence level and 1.0 being the highest; if no object exists in that cell, the confidence score
should be 0.0, and if the model is completely certain of its prediction, the score should be 1.0.
These confidence levels capture the model's certainty that there exists an object in that cell
and that the bounding box is accurate. Each of these bounding boxes is made up of 5 numbers:
the x position, the y position, the width, the height, and the confidence. The coordinates (x, y)
represent the location of the center of the predicted bounding box, and the width and height
are fractions relative to the entire image size. The confidence represents the Intersection over
Union (IoU) between the predicted bounding box and the actual bounding box, referred to as
the ground-truth box [15]. In general, there are five types of bounding boxes: a surrounding
sphere (SS), an axis-aligned bounding box (AABB), an oriented bounding box (OBB), a fixed-
direction hull (FDH), and a convex hull (CH) [16]. The AABB refers to a box whose axes are
parallel to the coordinate axes. It is the rectangle formed by selecting the maximum and
minimum horizontal and vertical coordinates among the vertices of the two-dimensional shape
and is one of the most commonly used bounding box types [16]. The bounding box serves to
mark the object area that has been detected [28].
In digital image processing, the bounding box is the set of coordinates of a rectangle within
which an object may be contained when it is placed over a page, a canvas, a screen, or any
other similar two-dimensional background [30]. In the field of object detection, a bounding box
is usually used to describe an object's location. The bounding box is a rectangular box that can
be determined by the x and y axis coordinates of the upper-left corner and the x and y axis
coordinates of the lower-right corner of the rectangle [31]. The first version of YOLO directly
predicted all four values which describe a bounding box. The x and y coordinates of each
bounding box are defined relative to the top-left corner of each grid cell and normalized by the
cell dimensions such that the coordinate values are bounded between 0 and 1. In YOLOv2,
however, there was a shift in paradigm: the algorithm employed dimension clusters to obtain
anchor-box priors, and 4 coordinates are predicted for each bounding box, t_x, t_y, t_w, t_h.
If the cell is offset from the top-left corner of the image by (c_x, c_y) and the bounding box
prior has width and height p_w, p_h, then the predictions correspond to [32]:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w * e^(t_w)
b_h = p_h * e^(t_h)

Figure 2.4. Bounding Box prediction formula.
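To make the formula concrete, the following small sketch applies it to one predicted offset vector; the grid-cell index and prior dimensions used in the call are illustrative values only:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply the formula above to turn raw offsets into a predicted box."""
    bx = sigmoid(tx) + cx      # centre x, in grid-cell units
    by = sigmoid(ty) + cy      # centre y, in grid-cell units
    bw = pw * math.exp(tw)     # width, scaled from the prior
    bh = ph * math.exp(th)     # height, scaled from the prior
    return bx, by, bw, bh

# A cell at (3, 4) with a 2.5 x 1.8 prior; all numbers are illustrative.
print(decode_box(0.2, -0.1, 0.4, 0.1, 3, 4, 2.5, 1.8))
```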

During training, the sum of squared error loss is used. Assuming that the ground truth for some
coordinate prediction is t̂*, the gradient is the ground-truth value (computed from the
ground-truth box) minus the prediction: t̂* − t*. This ground-truth value can be easily computed
by inverting the equations above. YOLOv3 predicts an objectness score for each bounding box
using logistic regression. This should be 1 if the bounding box prior overlaps a ground-truth
object by more than any other bounding box prior does. If the bounding box prior is not the
best but does overlap a ground-truth object by more than some threshold, the prediction is
ignored; a threshold of 0.5 is used. Usually, the system assigns only one bounding box prior to
each ground-truth object. If a bounding box prior is not assigned to a ground-truth object, it
incurs no loss for coordinate or class predictions, only for objectness [33].

The concept of a bounding box prior was introduced in YOLOv2. Previously, the model was
expected to produce unique bounding box descriptors for each new image; instead, a collection
of bounding boxes with varying aspect ratios is defined, which embeds some prior information
about the shape of the objects we expect to detect. Redmon offers an approach towards
discovering the best aspect ratios by doing k-means clustering (with a custom distance metric)
on all of the bounding boxes in the training dataset [33].
Thus, instead of predicting the bounding box dimensions directly, the task is reformulated to
simply predicting the offset from the bounding box prior in order to fine-tune the predicted
bounding box dimensions, which makes the prediction task easier to learn.

Objectness (and assigning labeled objects to a bounding box)


The "objectness" score p_obj is trained to approximate the Intersection over Union between
the predicted box and the ground-truth label. When the loss during training is calculated,
objects are matched to whichever bounding box prediction on the same grid cell produced the
highest IoU score. For unmatched boxes, the only descriptor included in the loss function is
p_obj.
Upon the introduction of additional bounding box priors in YOLOv2, it became possible to
assign objects to whichever anchor box on the same grid cell has the highest IoU score with
the labeled object.
YOLO (version 3) redefined the "objectness" target score p_obj to be 1 for the bounding
box with the highest IoU score for each given target, and 0 for all remaining boxes.
However, bounding boxes which have an IoU score above a defined threshold but not the
highest score are not included when calculating the loss. This simply means that such a box is
not penalized, because it is not the best possible prediction [33].

Class Labels
YOLO (version 3) uses sigmoid activations for multi-label classification, noting that
SoftMax (from the previous versions) is not necessary for good performance. This choice will
depend on your dataset and whether or not your labels overlap (e.g., "golden retriever" and
"dog").

Output Layer
YOLO (version 3) has 3 output layers. These output layers predict box coordinates at 3
different scales. The output prediction is of the form width × height × filters.
YOLO (version 3) replaced the skip-connection splitting with a more standard feature
pyramid network output structure. With this method, the network alternates between outputting
a prediction and upsampling the feature maps (with skip connections). This allows for
predictions that can take advantage of finer-grained information from earlier in the network,
which helps with detecting small objects in the image [33].

2.6 Training YOLO on COCO


Common Objects in Context (COCO) is a large-scale object detection, segmentation,
and captioning dataset that aims to enable future research in object detection, instance
segmentation, image captioning, and person keypoint localization.
YOLOv3 is extremely fast and accurate: in mAP measured at 0.5 IoU, YOLOv3 is on
par with Focal Loss-based detectors but about 4x faster.

2.7 Image Processing Technique


Every object class has its own special features that help in classifying the object.
Object recognition is the sub-domain of computer vision that helps in identifying objects in an
image or video sequence. With more efficient algorithms, objects can be recognized even when
they are partially obstructed from direct view. Various approaches to this task have been
implemented over the past years.

Various terms related to object detection are:


2.7.1 Edge matching
• Uses edge detection techniques to find the edges.
• Considers the effect of changes in lighting and color.
• Counts the number of overlapping edges.
2.7.2 Divide and Conquer search
• All positions are considered as a set.
• The lower bound is determined at the best position in the cell.
• The cell is pruned if the bound is too large.
• The process stops when a cell becomes small enough.
2.7.3 Grayscale matching
• Edges give a lot of information while being robust to illumination changes.
• Pixel distance is computed as a function of both pixel intensity and position. The same
can be computed with color too.
2.7.4 Gradient matching
• Comparing image gradients can also help in making matching robust to illumination
changes.
• Matching is performed like matching greyscale images.
The ease of dealing with an image comes from the fact that it is made up of pixels, so in most
cases the location of the next point is easily found and can be connected with the current pixel.
Consider the following example for calculating the Euclidean distance between the center of a
circle and the connected points. Take an image containing a circle, convert it to a grayscale
image, detect edges, move along the edges, and draw normals which will intersect at the center.
Now repeat this process for the entire circle, or find connected edges and then calculate the
Euclidean distance between the center of the circle and the connected points.

2.8 Object Classification in Moving Object Detection


The object classification approach is based on shape, motion, color and texture. The
classification can be done under various classes such as trees, animals, humans, objects,
etc. Tracking objects and analyzing their features is a key concept of object
classification.

2.8.1 Shape Based


A mixture of image-based and scene-based object parameters such as image blob
(binary large object) area, the aspect ratio of blob bounding box and camera zoom is
given as input to this detection system. Classification is performed on the basis of the
blob at each and every frame. The results are kept in the histogram.

2.8.2 Motion Based


When a simple image is given as an input with no objects in motion, this
classification is not needed. In general, non-rigid articulated human motion shows a
periodic property, hence this has been used as a strong clue for classification of moving
objects. Based on this useful clue, human motion can be distinguished from other
objects motion.

2.8.3 Color Based


Though color alone is not an appropriate measure for detecting and tracking
objects, the low computational cost of color-based algorithms makes color a
very good feature to exploit. For example, the color-histogram-based technique is
used for detection of vehicles in real time. A color histogram describes the color
distribution in a given region and is robust against partial occlusions.

2.8.4 Texture-Based
The texture-based approaches with the help of texture pattern recognition work
similar to motion-based approaches. It provides better accuracy, by using overlapping
local contrast normalization but may require more time, which can be improved using
some fast techniques.
P a g e | 26

Chapter 3 Workflow

3.1 STEPS INVOLVED IN OBJECT DETECTION IN PYTHON 3.7


Let us start with an image (im.jpg) and detect the various objects in it.

3.1.1 Install OpenCV-Python.


The following Python packages are to be downloaded and installed to their default locations:
Python 3.7.x, NumPy and Matplotlib. Install all packages into their default locations; Python
will be installed to C:/Python37/. Open the Python IDLE, enter import numpy and make sure
NumPy is working fine. Download OpenCV from SourceForge, go to the opencv/build/python/3.7
folder and copy cv2.pyd to C:/Python37/Lib/site-packages. A quick verification sketch is
shown below.
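Alternatively, the same packages can be installed with pip (pip install numpy matplotlib opencv-python). Either way, the following lines should run without errors once the installation is complete:

```python
# Both imports must succeed, and both packages should report a version string.
import numpy as np
import cv2

print("NumPy:", np.__version__)
print("OpenCV:", cv2.__version__)
```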

3.1.2 Read an Image


Use the function cv2.imread() to read an image. The image should be in the current
working directory; otherwise, we need to specify the full path of the image as the first argument.
The second argument is a flag which specifies the way the image should be read (a short usage
example follows the list):
1. cv2.IMREAD_COLOR: Loads a color image. Any transparency in the image is neglected.
It is the default flag.
2. cv2.IMREAD_GRAYSCALE: Loads the image in grayscale mode.
3. cv2.IMREAD_UNCHANGED: Loads the image as such, including the alpha channel.
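Here is the minimal usage example referred to above, reading the same im.jpg (the file used throughout this chapter) in the three modes:

```python
import cv2

# Read the same image in the three modes listed above.
color = cv2.imread("im.jpg", cv2.IMREAD_COLOR)       # default: 3-channel BGR
gray = cv2.imread("im.jpg", cv2.IMREAD_GRAYSCALE)    # single channel
raw = cv2.imread("im.jpg", cv2.IMREAD_UNCHANGED)     # keeps the alpha channel if present

if color is None:
    raise FileNotFoundError("im.jpg was not found in the working directory")

print(color.shape, gray.shape)
```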

3.1.3 Feature detection and description


• Understanding features (What are the main features in an image? How can finding
those features be useful to us?)
• Corner detection (Corners are good features, but how do we find them?)
• Feature matching (Knowing about feature detectors and descriptors, we can match
different descriptors. OpenCV provides two techniques: the Brute-Force matcher and the
FLANN-based matcher.)
• Homography (With feature matching in hand, it can be blended with camera calibration
and 3D reconstruction (the calib3d module) to find and describe objects in a complex image.)
A small feature-matching sketch is shown after this list.
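The sketch below illustrates the Brute-Force matching route with ORB features; the two file names are illustrative placeholders for a template of the object and a scene image:

```python
import cv2

# File names are illustrative: a template of the object and a scene image.
query = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)
scene = cv2.imread("im.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()                                # detector + descriptor
kp1, des1 = orb.detectAndCompute(query, None)
kp2, des2 = orb.detectAndCompute(scene, None)

# Brute-Force matching with the Hamming norm (suited to ORB's binary descriptors).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Draw the 20 best matches side by side and save the visualization.
result = cv2.drawMatches(query, kp1, scene, kp2, matches[:20], None, flags=2)
cv2.imwrite("matches.png", result)
```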

3.2 Architecture of the Proposed Model


Fig-3.1 shows the architecture diagram of the proposed YOLO model. Images are given as the
input to the system; video can also be taken as input, as it is nothing but a stream of images.
As the name You Only Look Once suggests, the input goes through the network only once and
the result, the detected objects with bounding boxes and labels, is obtained.

Fig-3.1: YOLO Architecture

The images are divided into S×S grid cells before being sent to the Convolutional Neural
Network (CNN). B bounding boxes per grid cell are generated around all the detected objects
in the image as the result of the Convolutional Neural Network. The classes to which the
objects belong are also classified by the Convolutional Neural Network, giving C classes per
grid cell. Then a threshold is applied to the detections; in this project we have used a threshold
of 0.3. The lower the threshold value, the more bounding boxes appear in the output, resulting
in a cluttered result.

Fig -3.2: Data Flow Diagram of the System

Fig-3.2 illustrates the flow of data in the system. Initially the user is given the option to
choose the type of file to be given to the system as input. Thus, the user can either choose
the file selection option or start the camera. In the former, the user can choose either an
image file or a video file and, in the latter, the user can start the camera module. Once the
input is selected, preprocessing is done, where the S×S grid is formed. The result thus formed
with the grid is sent to the bounding box prediction process, where bounding boxes are drawn
around the detected objects. Next, the result from the previous process is sent to class
prediction, where the class to which each object belongs is predicted. Then it is sent to the
detection process, where a threshold is set in order to reduce clutter from too many bounding
boxes and labels in the final output. At the end, an image or a stream of images (for image and
video or camera input respectively) with bounding boxes and labels is obtained as the output.

3.3 Implementation
This chapter describes the methodology for implementing this project. Following is the
algorithm for detecting objects in the Object Detection System; a minimal sketch of these
steps using OpenCV's DNN module is shown after the list.

3.3.1 Algorithm for Object Detection System

1. The input image is divided into an SxS grid.
2. For each cell, the model predicts B bounding boxes; each bounding box contains five
elements: (x, y, w, h) and a box confidence score.
3. YOLO detects one object per grid cell only, regardless of the number of bounding boxes.
4. It predicts C conditional class probabilities.
5. If no object exists, the confidence score is zero; otherwise the confidence score should be
greater than or equal to the threshold value.
6. YOLO then draws bounding boxes around the detected objects and predicts the class to
which each object belongs.
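The following sketch shows one way these steps could be realized with OpenCV's DNN module and publicly available YOLOv3 weights trained on COCO; the configuration, weights and class-name file paths are assumptions (they must be downloaded separately), and the 0.3 threshold matches the value used in this project:

```python
import cv2
import numpy as np

# Assumed files: standard Darknet YOLOv3 config, weights and COCO class names.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
classes = open("coco.names").read().strip().split("\n")

img = cv2.imread("im.jpg")
h, w = img.shape[:2]

# Steps 1-4: the network forms the grid and predicts boxes, confidences and classes.
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

boxes, confidences, class_ids = [], [], []
for output in outputs:
    for det in output:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence >= 0.3:                                  # step 5: threshold of 0.3
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(confidence)
            class_ids.append(class_id)

# Step 6: keep the best box per object and draw boxes with class labels.
for i in np.array(cv2.dnn.NMSBoxes(boxes, confidences, 0.3, 0.4)).flatten():
    x, y, bw, bh = boxes[i]
    label = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
    cv2.rectangle(img, (x, y), (x + bw, y + bh), (0, 255, 0), 2)
    cv2.putText(img, label, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

cv2.imwrite("result.jpg", img)
```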

3.4 Results And Analysis


This chapter describes the results obtained by the system and the different test cases used
while testing it. We used a model pretrained on the COCO dataset, which has 80 classes; a
greater number of classes resulted in overfitting of the data. The following section describes
the different test cases and the results obtained.

3.4.1 Test Cases


Table-1 shows the different test cases with the expected and actual results.

Table -1: Test Cases with Results


Test Case ID | Test Conditions                                          | Expected Result                                                     | Test Result
TC1          | When an image is chosen as input                         | Image with bounding boxes around the objects and predicted classes | SUCCESSFUL
TC2          | When a video is chosen as input                          | Video with bounding boxes around the objects and predicted classes | SUCCESSFUL
TC3          | When a black and white image is taken as input           | Image with bounding boxes around the objects and predicted classes | SUCCESSFUL
TC4          | When an image with overlapping objects is taken as input | Image with bounding boxes around the objects and predicted classes | SUCCESSFUL

3.4.2 Results
This section describes the different results obtained for the various test cases described above.

Fig -3.3: Image with Detected Object



Fig -3.4: Image with Overlapping Objects

Fig-3.3 illustrates the output of the Object Detection System; bounding boxes are drawn
around the objects detected. Fig-3.4 illustrates the output obtained when objects are
overlapping, showing that partially visible objects are also detected, with a bounding box drawn
around each along with the label indicating the class to which it belongs.

Fig-3.5: Objects are detected within black and white picture



Chapter 4 CONCLUSIONS

4.1 APPLICATIONS AND FUTURE SCOPE

Computer vision is still a developing discipline; it has not matured to the level where it can be
applied directly to real-life problems. In a few years, computer vision, and particularly object
detection, will no longer be futuristic and will be ubiquitous. For now, we can consider object
detection as a sub-branch of machine learning. Some common and widely used applications of
object detection are:

4.1.1 Face Detection


Have you ever wondered how Facebook detects your face when you upload a photo? Not only
does it detect the face, it remembers it too. This is a simple application of object detection that
we see in our daily life.

4.1.2 Counting objects/people


Object detection can also be used for counting: keeping a count of a particular object, or of all
objects, in an image or a frame. For example, from a group photograph it can count the number
of persons, and if implemented smartly it may also distinguish people wearing different dresses.

4.1.3 Vehicle detection


Similarly, when the object is a vehicle, object detection along with tracking can be used for
finding the type of vehicle, this application may be extended to even make a traffic calculator.

4.1.4 Industries
Object detection is also used in industrial processes for the identification of different products.
Say you want your machine to detect only objects of a particular shape; you can achieve this
very easily. For example, the Hough circle transform can be used for detecting circular objects.

4.1.5 Security
Object detection techniques are used for the identification of unwanted or suspicious objects in
a particular area, and more specifically for detecting bombs/explosives. It is even used for
personal security purposes.

4.1.6 Biometric recognition
Biometric recognition uses physical or behavioral traits of humans to recognize individuals for
security and authentication purposes. It uses distinct biological traits like fingerprints, hand
geometry, retina and iris patterns, etc.

4.1.7 Surveillance
Objects can be recognized and tracked in videos for security purpose. Object recognition is
required so that the suspected person or vehicle can be tracked.

4.1.8 Medical analysis


Object detection is used to detect diseases such as tumors, stones, and cancer in MRI images.

4.1.9 Optical character recognition


Characters in scanned documents can be recognized using object recognition.

4.1.10 Human computer interaction


Human gestures can be stored in the system and used by computers for recognition in a
dynamic environment to interact with humans.

Object detection's scope is not limited to these examples; it can be used for almost any purpose,
for example solving number puzzles by giving their images as input and applying suitable
algorithms after detecting the different numbers and their positions in the input image.

4.2 CHALLENGES
The main purpose is to recognize a specific object in real time from among a large number of
objects. Most recognition systems scale poorly with the number of recognizable objects:
computational cost rises as the number of objects increases. Comparing and querying images
using color, texture, and shape alone is not enough because two objects might have the same
attributes. Designing a recognition system with the ability to work in a dynamic environment
and behave like a human is difficult. Some of the main challenges in designing an object
recognition system are lighting, dynamic backgrounds, the presence of shadows, the motion of
the camera, the speed of the moving objects, intermittent object motion, weather conditions, etc.

4.3 CONCLUSION
The project is developed with the objective of detecting objects in real time in images, videos
and camera streams. Bounding boxes are drawn around the detected objects along with a label
indicating the class to which each object belongs. We have used the CPU for processing in this
project; future enhancements can focus on implementing the project on a system with a GPU
for faster results and better accuracy.
The possibilities of using computer vision to solve real-world problems are immense. The basics
of object detection, along with various ways of achieving it and its scope, have been discussed.
Python has been preferred over MATLAB for integrating with OpenCV because a MATLAB
program must be interpreted at run time, which adds overhead, whereas OpenCV is essentially
a library of functions written in C/C++. Additionally, OpenCV is easier to use for someone with
little programming background, so it is a good starting point for research on any concept of
object detection using OpenCV-Python. Feature understanding and matching are the major
steps in object detection and should be performed well and with high accuracy. DeepFace is an
effective face detection method that is preferred over Haar cascades by most social applications
like Facebook, Snapchat and Instagram. In the coming days, OpenCV will be immensely popular
among coders and will also be a prime requirement of IT companies. To improve the
performance of object detection, IoU measures are used.
P a g e | 35

REFERENCES
[1] Lele Xie, Tasweer Ahmad, Lianwen Jin, Yuliang Liu, and Sheng Zhang, "A New CNN-Based
Method for Multi-Directional Car License Plate Detection", IEEE Transactions on Intelligent
Transportation Systems, ISSN (e): 1524-9050, Vol. 19, Issue 2, 2018.
[2] L. Carminati, J. Benois-Pineau and C. Jennewein, "Knowledge-Based Supervised Learning
Methods in a Classical Problem of Video Object Tracking", 2006 International Conference on
Image Processing, Atlanta, GA, USA, ISSN (e): 2381-8549, 2006.
[3] Jinsu Lee, Junseong Bang and Seong-Il Yang, "Object Detection with Sliding Window in
Images including Multiple Similar Objects", 2017 IEEE International Conference on Information
and Communication Technology Convergence (ICTC), Jeju, South Korea.
[4] Qichang Hu, Sakrapee Paisitkriangkrai, Chunhua Shen, Anton van den Hengel and Fatih
Porikli, "Fast Detection of Multiple Objects in Traffic Scenes with a Common Detection
Framework", IEEE Transactions on Intelligent Transportation Systems.
[5] Haihui Xie, Qingxiang Wu and Binshu Chen, "Vehicle Detection in Open Parks Using a
Convolutional Neural Network", 2015 6th International Conference on Intelligent Systems
Design and Engineering Applications (ISDEA), Guiyang, China.
[6] J. Brownlee, "A Gentle Introduction to Object Recognition with Deep Learning" (2018).
Available at: https://machinelearningmastery.com/object-recognition-with-deep-learning/
[Accessed: 01/10/20].
[7] "You Only Look Once: Unified, Real-Time Object Detection", 2015. Available at:
https://arxiv.org/abs/1506.02640 [Accessed: 04/10/20].
[8] A. Kamal, "YOLO, YOLOv2 and YOLOv3: All You Want to Know" (2019). Available at:
https://medium.com/@amrokamal_47691/yolo-yolov2-and-yolov3-all-you-want-toknow-7e3e92dc4899
[Accessed: 04/10/20].
[9] A. Rosebrock, "Intersection over Union (IoU) for Object Detection" (2016). Available at:
https://pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
[10] "YOLO: You Only Look Once – Real Time Object Detection". Available at:
https://www.geeksforgeeks.org/yolo-you-only-look-once-real-time-object-detection/
[11] I. Tan, "Measuring Labelling Quality with IOU and F1 Score". Available at:
https://medium.com/supahands-techblog/measuring-labelling-quality-with-iou-and-f1-score-1717e29e492f
[12] Stack Overflow, "Intersection Over Union (IoU) ground truth in YOLO". Available at:
https://stackoverflow.com/questions/61758075/intersection-over-union-iou-ground-truth-in-yolo
[13] J. Redmon & A. Farhadi (University of Washington), "YOLO9000: Better, Faster,
Stronger". Available at: https://pjreddie.com/media/files/papers/YOLO9000.pdf
[14] "YOLO v2 – Object Detection". Available at:
https://www.geeksforgeeks.org/yolo-v2-object-detection/
[15] A. Aggarwal, "YOLO Explained". Available at:
https://medium.com/analytics-vidhya/yolo-explained-5b6f4564f31
[16] L. Cai, F. Jiang, W. Zhou, and K. Li, "Design and Application of An Attractiveness Index
for Urban Hotspots Based on GPS Trajectory Data", IEEE, p. 4.
[17] K. Mahesh Babu, M. V. Raghunadh, "Vehicle number plate detection and recognition using
bounding box method", May 2016, pp. 106–110.

APPENDIX A

APPENDIX B
