
PROJECT REPORT ON:

OBJECT DETECTION USING DEEP LEARNING

Submitted by:
Hemant Dadhich (1604352, ETC-6)
Parbonee Sen (1604364, ETC-6)
Rounak Mittal (1604373, ETC-6)
Saumyajit Roy (1604384, ETC-6)
Abhigyan Nath (1604402, ETC-6)

Mentored by:
Madam Debolina Dey,
(Credentials)

DEPARTMENT OF ELECTRONICS ENGINEERING


KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY
BHUBANESWAR, ODISHA
CERTIFICATE

This is to certify that the project report titled “Object Detection using Deep
Learning”,
submitted by:
Hemant Dadhich 1604352
Parbonee Sen 1604364
Rounak Mittal 1604373
Saumyajit Roy 1604384
Abhigyan Nath 1604402
in partial fulfillment of the requirements for the award of the Degree of Bachelor of
Technology in Electronics and Telecommunications Engineering is a bonafide
record of the work carried out under supervision and guidance at the School of
Electronics Engineering, Kalinga Institute of Industrial Technology.

Signature of Supervisor

Madam Debolina Dey


School Of Electronics Engineering

The project is evaluated on __________

Examiner 1 Examiner 2

Examiner 3 Examiner 4
ACKNOWLEDGEMENT

We are immensely grateful and deeply thankful for the support provided by our
mentor, Madam Debolina Dey, who guided us from scratch throughout our project work.
Without her help, the project would not have been completed efficiently.
Abstract:

Efficient object detection has been an important topic in the advancement of
computer vision systems. With the advent of deep learning techniques, the accuracy
of object detection has increased. This project aims to incorporate the latest
techniques for object detection with the goal of achieving high accuracy. A major
challenge is the dependency on other computer vision techniques to support the deep
learning-based approach, which leads to slow and non-optimal performance. In this
project, using the TensorFlow Object Detection model, we have developed a fast
system that aids in efficient object detection.
Table of Contents

Section A: Introduction
  A.1 What is Object Detection?
  A.2 Background
  A.3 Applications
  A.4 Limitations

Section B: Background Theory of TensorFlow
  B.1 Abstract
  B.2 Introduction
  B.3 Design Principles
  B.4 TensorFlow Execution Model
  B.5 Application: TensorFlow in Image Classification

Section C: Background Theory on Haar Cascade
  C.1 Abstract
  C.2 Introduction
  C.3 Object Detection Using Haar Cascade Classifier

Section D: What is Image Classification?

Section E: Steps Involved in Object Detection

Section F: Code for Object Detection Using OpenCV and TensorFlow

Section G: Examples of Images Detected by Our System

Section H: Conclusion

Section I: References
List of Figures

Single and multiple objects
Facial recognition
People counting
Self-driving cars
Security
Image retrieval
Object detection then vs. now
Lighting limitation
Positioning limitation
Rotation, mirroring & occlusion
TensorFlow dataflow schematic
Cascade classifier
Image classification samples
Classification & detection: the difference
Detection of a bottle
Detection of a cell phone
Detection of multiple faces
A. Introduction:

A.1 What is Object Detection?


Object detection is the process of finding real-world object instances, such as
cars, bikes, TVs, flowers, and humans, in still images or videos. It allows for the
recognition, localization, and detection of multiple objects within an image, which
provides us with a much better understanding of the image as a whole. It is commonly
used in applications such as image retrieval, security, surveillance, and advanced
driver assistance systems (ADAS).

A.2 Background:
The goal of object detection is to detect all instances of objects from a known
class, such as people, cars or faces in an image. Typically only a small number of
instances of the object are present in the image, but there is a very large number of
possible locations and scales at which they can occur and that need to somehow be
explored. Each detection is reported with some form of pose information. This could
be as simple as the location of the object, a location and scale, or the extent of the
object dined in terms of a bounding box. In other situations the pose information is
more detailed and contains the parameters of a linear or non-linear transformation.
For example a face detector may compute the locations of the eyes, nose and mouth,
in addition to the bounding box of the face. Object detection systems construct a
model for an object class from a set of training examples. In the case of axed rigid
object only one example may be needed, but more generally multiple training
examples are necessary to capture certain aspects of class variability.

A.3 Applications:
A. Facial Recognition: A deep learning facial recognition system called "DeepFace"
has been developed by a group of researchers at Facebook, which identifies human
faces in a digital image very effectively. Google uses its own facial recognition
system in Google Photos, which automatically segregates all the photos based on the
person in the image. There are various components involved in facial recognition,
such as the eyes, nose, mouth and eyebrows.

B. People Counting: Object detection can also be used for people counting, for
example to analyze store performance or crowd statistics during festivals. This
tends to be more difficult, as people move out of the frame quickly.

C. Self-Driving Cars: Self-driving cars are the future; there is no doubt about
that. But the working behind them is very tricky, as they combine a variety of
techniques to perceive their surroundings, including radar, laser light, GPS,
odometry, and computer vision.

D. Security: Object detection plays a very important role in security, be it
Apple's Face ID or the retina scans seen in sci-fi movies. It is also used by
governments to scan security feeds and match them against existing databases to
find criminals or to detect a robber's vehicle.

E. Image Retrieval: Computer-based image retrieval has become an important research
area in computer vision, as digital image collections are rapidly growing and are
made available to a multitude of users through the World Wide Web.
Figure: Object detection then vs. now
A.4 Limitations

Lighting: The lighting conditions may differ during the course of the day, and
weather conditions may also affect the lighting in an image. Indoor and outdoor
images of the same object can have varying lighting conditions, and shadows in the
image can affect the image light. Whatever the lighting may be, the system must be
able to recognize the object in any image.

Positioning: The position of the object in the image can change. If template
matching is used, the system must handle such images uniformly.

Rotation: The image can be in rotated form, and the system must be capable of
handling such difficulty. The character 'R' can appear in any orientation, but the
orientation of the letter or image must not affect the recognition of the character
'R' or of any image of an object.

Mirroring: The mirrored image of any object must be recognized by the object
recognition system.

Occlusion: The condition where an object in an image is not completely visible is
referred to as occlusion.
B. Background Theory of TensorFlow:

B.1 Abstract
TensorFlow is a machine learning system that operates at large scale and in
heterogeneous environments. TensorFlow uses dataflow graphs to represent
computation, shared state, and the operations that mutate that state. It maps the
nodes of a dataflow graph across many machines in a cluster, and within a
machine across multiple computational devices, including multicore CPUs,
general-purpose GPUs, and custom-designed ASICs known as Tensor Processing
Units (TPUs). This architecture gives flexibility to the application developer:
whereas in previous “parameter server” designs the management of shared state is
built into the system, TensorFlow enables developers to experiment with novel
optimizations and training algorithms. TensorFlow supports a variety of
applications, with a focus on training and inference on deep neural networks.
Several Google services use TensorFlow in production.

It is an interface for expressing machine learning algorithms, and an


implementation for executing such algorithms. A computation expressed using
TensorFlow can be executed with little or no change on a wide variety of
heterogeneous systems, ranging from mobile devices such as phones and tablets
up to large-scale distributed systems of hundreds of machines and thousands of
computational devices such as GPU cards. The system is flexible and can be used
to express a wide variety of algorithms, including training and inference
algorithms for deep neural network models, and it has been used for conducting
research and for deploying machine learning systems into production across more
than a dozen areas of computer science and other fields, including speech
recognition, computer vision, robotics, information retrieval, natural language
processing, geographic information extraction, and computational drug discovery.
B.2 Introduction
In recent years, machine learning has driven advances in many different
fields. This success has been attributed to the invention of more sophisticated
machine learning models, the availability of large datasets for tackling problems
in these fields, and the development of software platforms that enable the easy
use of large amounts of computational resources for training such models on
these large datasets. The TensorFlow system was developed for experimenting with
new models, training them on large datasets, and moving them into production.
TensorFlow is based on many years of experience with our first-generation
system, DistBelief, both simplifying and generalizing it to enable researchers to
explore a wider variety of ideas with relative ease. TensorFlow supports both
large-scale training and inference: it efficiently uses hundreds of powerful
(GPU-enabled) servers for fast training, and it runs trained models for inference
in production on various platforms, ranging from large distributed clusters in a
datacenter, down to running locally on mobile devices. At the same time, it is
flexible enough to support experimentation and research into new machine
learning models and system-level optimizations. TensorFlow uses a unified
dataflow graph to represent both the computation in an algorithm and the state on
which the algorithm operates. We draw inspiration from the high-level
programming models of dataflow systems and the low-level efficiency of
parameter servers. Unlike traditional dataflow systems, in which graph vertices
represent functional computation on immutable data, TensorFlow allows vertices
to represent computations that own or update mutable state. Edges carry tensors
(multi-dimensional arrays) between nodes, and TensorFlow transparently inserts
the appropriate communication between distributed subcomputations. By
unifying the computation and state management in a single programming model,
TensorFlow allows programmers to experiment with different parallelization
schemes that, for example, offload computation onto the servers that hold the
shared state to reduce the amount of network traffic. We have also built various
coordination protocols, and achieved encouraging results with synchronous
replication, echoing recent results that contradict the commonly held belief that
asynchronous replication is required for scalable learning.

B.3 Design Principles

TensorFlow provides a simple dataflow-based programming abstraction that


allows users to deploy applications on distributed clusters, local workstations,
mobile devices, and custom-designed accelerators. A high-level scripting interface
wraps the construction of dataflow graphs and enables users to experiment with
different model architectures and optimization algorithms without modifying the
core system.

1. Dataflow Graphs of Primitive Operators:


The TensorFlow model represents individual mathematical operators (such as
matrix multiplication, convolution, etc.) as nodes in the dataflow graph. This
approach makes it easier for users to compose novel layers using a high-level
scripting interface. Many optimization algorithms require each layer to have
defined gradients, and building layers out of simple operators makes it easy
to differentiate these models automatically. In addition to the functional
operators, we represent mutable state, and the operations that update it, as
nodes in the dataflow graph, thus enabling experimentation with different
update rules.

2. Deferred execution
A typical TensorFlow application has two distinct phases: the first phase
defines the program (e.g., a neural network to be trained and the update rules)
as a symbolic dataflow graph with placeholders for the input data and the
variables that represent the state; the second phase executes an optimized
version of the program on the set of available devices. By deferring the
execution until the entire program is available, TensorFlow can optimize the
execution phase by using global information about the computation. For example,
TensorFlow achieves high GPU utilization by using the graph's dependency
structure to issue a sequence of kernels to the GPU without waiting for
intermediate results. While this design choice makes execution more efficient,
it requires pushing more complex features, such as dynamic control flow, into
the dataflow graph, so that models using these features enjoy the same
optimizations.
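
As a minimal sketch of this two-phase model, assuming the TensorFlow 1.x API used
throughout this section, the following program builds a symbolic graph first and
computes values only when the session runs:

import numpy as np
import tensorflow as tf

# Phase 1: define the program as a symbolic dataflow graph.
# Nothing is computed yet; these calls only add nodes to the graph.
x = tf.placeholder(tf.float32, shape=(None, 3), name='x')     # placeholder for input data
w = tf.constant(np.ones((3, 2), dtype=np.float32), name='w')  # a Const node
y = tf.matmul(x, w, name='y')                                 # a MatMul node

# Phase 2: execute an optimized version of the graph on the available devices.
with tf.Session() as sess:
    result = sess.run(y, feed_dict={x: np.random.rand(4, 3)})
    print(result.shape)  # (4, 2)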

3. Common abstraction for heterogeneous accelerators


In addition to general-purpose devices such as multicore CPUs and GPUs,
special-purpose accelerators for deep learning can achieve significant performance
improvements and power savings. The Tensor Processing Unit (TPU) yields an order-of-
magnitude improvement in performance-per-watt compared to alternative
state-of-the-art technology. To support these accelerators in TensorFlow, we define a
common abstraction for devices. At a minimum, a device must implement methods
for (i) issuing a kernel for execution, (ii) allocating memory for inputs and outputs,
and (iii) transferring buffers to and from host memory. Each operator (e.g.,
matrix multiplication) can have multiple specialized implementations for different
devices. As a result, the same program can easily target GPUs, TPUs, or mobile CPUs
as required for training, serving, and offline inference. TensorFlow uses tensors of
primitive values as a common interchange format that all devices understand. At the
lowest level, all tensors in TensorFlow are dense; sparse tensors can be represented in
terms of dense ones. This decision ensures that the lowest levels of the
system have simple implementations for memory allocation and serialization, thus
reducing the framework overhead. Tensors also enable other optimizations for
memory management and communication, such as RDMA and direct
GPU-to-GPU transfer. The main consequence of these principles is that in
TensorFlow there is no such thing as a parameter server. On a cluster, we deploy
TensorFlow as a set of tasks (named processes that can communicate over a network)
that each export the same graph execution API and contain one or more devices.
Typically a subset of those tasks assumes the role that a parameter server plays in
other systems, and we therefore call them PS tasks; the others are worker tasks.
However, since a PS task is capable of running arbitrary TensorFlow graphs, it is more
flexible than a conventional parameter server: users can program it with the same
scripting interface that they use to define models. This flexibility is the key difference
between TensorFlow and contemporary systems, and in the rest of this section we will
discuss some of the applications that this flexibility enables.

Figure: A schematic TensorFlow dataflow graph for a training pipeline, containing
subgraphs for reading input data, pre-processing, training and checkpointing stages.

B.4 TensorFlow Execution Model


TensorFlow uses a single dataflow graph to represent all computation and state in
a machine learning algorithm, including the individual mathematical operations,
the parameters and their update rules, and the input preprocessing. The dataflow graph
expresses the communication between subcomputations explicitly, thus
making it easy to execute independent computations in parallel and to partition
computations across multiple devices. TensorFlow differs from batch dataflow
systems in two respects:
• The model supports multiple concurrent executions
on overlapping subgraphs of the overall graph.
• Individual vertices may have mutable state that can be shared between different
executions of the graph.
The key observation in the parameter server architecture is that mutable state is
crucial when training very large models, because it becomes possible to make in-place
updates to very large parameters, and propagate those updates to parallel training
steps as quickly as possible. Dataflow with mutable state enables TensorFlow to
mimic the functionality of a parameter server, but with additional flexibility, because
it becomes possible to execute arbitrary dataflow subgraphs on the machines
that host the shared model parameters. As a result, our users have been able to
experiment with different optimization algorithms, consistency schemes, and
parallelization strategies.
In a TensorFlow graph, each vertex represents a unit of local computation, and each
edge represents the output from, or input to, a vertex. We refer to the computation
at vertices as operations, and the values that flow along edges as tensors. In this
subsection, we describe the common types of operations and tensors.

Tensors: In TensorFlow, we model all data as tensors (n-dimensional arrays) with


the elements having one of a small number of primitive types, such as int32,
float32, or string (where string can represent arbitrary binary data). Tensors naturally
represent the inputs to and results of the common mathematical operations in many
machine learning algorithms: for example, a matrix multiplication takes two 2-D
tensors and produces a 2-D tensor; and a batch 2-D convolution takes two 4-D tensors
and produces another 4-D tensor. At the lowest level, all TensorFlow tensors are
dense. TensorFlow offers two alternatives for representing sparse data: either encode
the data into variable-length string elements of a dense tensor, or use a tuple of dense
tensors (e.g., an n-D sparse tensor with m non-zero elements can be represented
in coordinate-list format as an m × n matrix of coordinates and a length-m vector of
values). The shape of a tensor can vary in one or more of its dimensions, which makes
it possible to represent sparse tensors with differing numbers of elements.
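
As a small illustrative sketch of the coordinate-list encoding just described
(plain NumPy, with made-up values), a 2-D sparse tensor with m = 3 non-zero
elements can be stored as a matrix of coordinates plus a vector of values:

import numpy as np

# A 2-D sparse tensor with m = 3 non-zero elements, in coordinate-list form:
indices = np.array([[0, 1], [2, 0], [2, 3]])  # m x n matrix of coordinates (n = 2 here)
values = np.array([4.0, 5.0, 6.0])            # length-m vector of values
dense_shape = (3, 4)

# Reconstructing the equivalent dense tensor:
dense = np.zeros(dense_shape, dtype=np.float32)
dense[indices[:, 0], indices[:, 1]] = values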

Operations: An operation takes m>=0 tensors as input and produces n>=0 tensors
as output. An operation has a named “type” (such as Const, MatMul, or Assign)
and may have zero or more compile-time attributes that determine its behavior. An
operation can be polymorphic and variadic at compile-time: its attributes determine
both the expected types and arity of its inputs and outputs. For example, the simplest
operation Const has no inputs and a single output; its value is a compile-time attribute.
As another example, AddN sums multiple tensors of the same element type; it has a
type attribute T and an integer attribute N that define its type signature.

Stateful operations: Variables: An operation can contain mutable state that is read
and/or written each time it executes. A Variable operation owns a mutable
buffer that may be used to store the shared parameters of a model as it is trained. A
variable has no inputs, and produces a reference handle, which acts as a typed
capability for reading and writing the buffer. A Read operation takes a reference
handle r as input, and outputs the value of the variable (State[r]) as a dense tensor.
Other operations modify the underlying buffer: for example, AssignAdd takes a
reference handle r and a tensor value x, and when executed performs the update
State'[r] ← State[r] + x. Subsequent Read(r) operations produce the value State'[r].
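
A minimal sketch of these stateful operations, assuming the TensorFlow 1.x API,
where tf.Variable and tf.assign_add play the roles of Variable and AssignAdd:

import tensorflow as tf

v = tf.Variable(tf.zeros([3]), name='v')   # Variable: owns a mutable buffer
update = tf.assign_add(v, tf.ones([3]))    # AssignAdd: State'[r] <- State[r] + x

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update)
    print(sess.run(v))  # a subsequent read produces the updated state: [1. 1. 1.]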

Stateful operations: Queues: TensorFlow includes several queue implementations,


which support more advanced forms of coordination. The simplest queue is
FIFOQueue, which owns an internal queue of tensors and allows concurrent access
in first-in-first-out order. Other types of queues dequeue tensors in random and
priority orders, which ensures that input data are sampled appropriately. Like a
Variable, the FIFOQueue operation produces a reference handle that can be
consumed by one of the standard queue operations, such as Enqueue and Dequeue.
These operations push their input onto the tail of the queue and, respectively, pop the
head element and output it. Enqueue will block if its given queue is full, and Dequeue
will block if its given queue is empty. When queues are used in an input
preprocessing pipeline, this blocking provides backpressure; it also supports
synchronization. The combination of queues and dynamic control flow can also
implement a form of streaming computation between subgraphs.
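
A sketch of the FIFOQueue behaviour described above, again assuming the
TensorFlow 1.x API:

import tensorflow as tf

q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32])
enqueue = q.enqueue([1.0])   # blocks if the queue is full
dequeue = q.dequeue()        # blocks if the queue is empty

with tf.Session() as sess:
    sess.run(enqueue)
    print(sess.run(dequeue))  # 1.0, in first-in-first-out order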
B.5 Application: TensorFlow in Image Classification:
Deep neural networks have achieved breakthrough performance on computer
vision tasks such as recognizing objects in photographs, and these tasks are a key
application for TensorFlow at Google. Training a network to high accuracy requires a
large amount of computation, and we use TensorFlow to scale out this computation
across a cluster of GPU-enabled servers. In these experiments, we focus on Google’s
Inception-v3 model, which achieves 78.8% accuracy in the ILSVRC 2012 image
classification challenge; the same techniques apply to other deep convolutional
models—such as ResNet— implemented on TensorFlow. We investigate the
scalability of training Inception-v3 using multiple replicas. We configure TensorFlow
with 7 PS tasks, and vary the number of worker tasks using two different clusters.
For the first experiment, we compare the performance of training Inception using
asynchronous SGD on TensorFlow and MXNet, a contemporary system using a
parameter server architecture. For this experiment we use Google Compute Engine
virtual machines running on Intel Xeon E5 servers with NVIDIA K80 GPUs,
configured with 8 vCPUs, 16Gbps of network bandwidth, and one GPU per VM. Both
systems use 7 PS tasks running on separate VMs with no GPU. Figure 8(a) shows that
TensorFlow achieves performance that is marginally better than MXNet. As expected,
the results are largely determined by single-GPU performance, and both systems use
cuDNN version 5.1, so they have access to the same optimized GPU kernels.
Using a larger internal cluster (with NVIDIA K40 GPUs, and a shared datacenter
network), we investigate the effect of coordination on training performance.
Ideally, with efficient synchronous training, a model such as Inception-v3 will train in
fewer steps, and converge to a higher accuracy than with asynchronous training.
Training throughput improves to 2,300 images per second as we increase the number
of workers to 200, but with diminishing returns. As we add more workers, the step
time increases, because there is more contention on the PS tasks, both at the network
interface and in the aggregation of updates. As expected, for all configurations,
synchronous steps are longer than asynchronous steps, because all workers must wait
for the slowest worker to catch up before starting the next step. While the median
synchronous step is approximately 10% longer than an asynchronous step with the
same workers, above the 90th percentile the synchronous performance degrades
sharply, because stragglers disproportionately impact tail latency. To mitigate tail
latency, we add backup workers so that a step completes when the first m of n tasks
produce gradients. Each additional backup worker up to and including the fourth
reduces the median step time, because the probability of a straggler affecting the step
decreases. Adding a fifth backup worker slightly degrades performance, because
the 51st worker (i.e., the first whose result is discarded) is more likely to be a
non-straggler that generates more incoming traffic for the PS tasks. Figure 8(c) also
plots the normalized speedup for each configuration, defined as t(0)/t(b) × 50/(50 + b)
(where t(b) is the median step time with b backup workers), which discounts the
speedup by the fraction of additional resources consumed.

C. Background Theory on Haar Cascade:

C.1 Abstract
Object detection is an important feature of computer science. The benefits of
object detection are, however, not limited to someone with a doctorate in informatics.
Instead, object detection is growing deeper and deeper into the common parts of the
information society, lending a helping hand wherever needed. This section addresses
one such possibility, namely the help of a Haar cascade classifier. The main focus
will be on the case study of a vehicle detection and counting system and the
possibilities it provides in a semi-enclosed area, both of the statistical kind and
for the common man. The goal of the system to be developed is to further ease and
augment the everyday parts of our lives.

C.2 Introduction
1.1 Computer vision: Computer vision is a field of informatics which teaches
computers to see. It is the way computers gather and interpret visual information from
the surrounding environment. Usually the image is first processed at a lower level to
enhance picture quality, for example to remove noise. Then the picture is processed at a
higher level, for example by detecting patterns and shapes, thereby trying to
determine what is in the picture.

1.2 Object detection: Object detection commonly refers to a method that is
responsible for discovering and identifying the existence of objects of a certain
class. An extension of this can be considered a method of image processing to
identify objects from digital images.

1.3 Simple detection by colour: One way to detect objects is to simply classify
objects in images according to colour. This is the main variant used in, for example,
robotic soccer, where different teams assemble their robots and go head to head with
other teams. However, this colour-coded approach has its downsides. Experiments in
the International RoboCup competition have shown that lighting conditions are
extremely detrimental to the outcome of the game, and even the slightest ambient light
change can prove fatal to the success of one or the other team. Participants need to
recalibrate their systems multiple times even on the same field because of the minor
ambient light changes that occur with the time of day. This type of detection is
therefore not suitable for most real-world applications, simply because of the constant
need for recalibration and maintenance.

1.4 Introduction of Haar-like features: A more sophisticated method is


therefore required. One such method would be the detection of objects from images
using features or specific structures of the object in question. However, there was a
problem. Working with only image intensities, meaning the RGB pixel values in
every single pixel of the image, made feature calculation rather computationally
expensive and therefore slow on most platforms. This problem was addressed by the
so-called Haar-like features, developed by Viola and Jones on the basis of the proposal
by Papageorgiou et al. in 1998. A Haar-like feature considers neighbouring
rectangular regions at a specific location in a detection window, sums up the pixel
intensities in each region and calculates the difference between these sums. This
difference is then used to categorize subsections of an image. An example of this
would be the detection of human faces. Commonly, the areas around the eyes are
darker than the areas on the cheeks. One example of a Haar-like feature for face
detection is therefore a set of two neighbouring rectangular areas above the eye and
cheek regions.
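
What makes Haar-like features cheap to evaluate is the integral image, which
reduces any rectangle sum to four lookups. Below is a minimal sketch of the
eye/cheek feature described above; the file name and rectangle coordinates are
purely illustrative:

import cv2

img = cv2.imread('face.jpg', cv2.IMREAD_GRAYSCALE)  # illustrative input image
ii = cv2.integral(img)  # integral image of shape (h + 1, w + 1)

def rect_sum(ii, x, y, w, h):
    # Sum of pixel intensities inside a rectangle via four integral-image lookups.
    return int(ii[y + h, x + w]) - int(ii[y, x + w]) - int(ii[y + h, x]) + int(ii[y, x])

eyes = rect_sum(ii, 20, 30, 40, 10)     # darker region around the eyes
cheeks = rect_sum(ii, 20, 40, 40, 10)   # brighter region on the cheeks below
feature = cheeks - eyes                 # the two-rectangle Haar-like feature value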

1.5 Cascade classifier: The cascade classifier consists of a list of stages, where
each stage consists of a list of weak learners. The system detects objects in question
by moving a window over the image. Each stage of the classifier labels the specific
region defined by the current location of the window as either positive or negative:
positive meaning that an object was found, negative meaning that the specified object
was not found in the image. If the labelling yields a negative result, then the
classification of this specific region is complete and the location of the
window is moved to the next location. If the labelling gives a positive result, then the
region moves on to the next stage of classification. The classifier yields a final verdict
of positive when all the stages, including the last one, report that the object is
found in the image. A true positive means that the object in question is
indeed in the image and the classifier labels it as such, a positive result. A false
positive means that the labelling process falsely determines that the object is located
in the image although it is not. A false negative occurs when the classifier is unable
to detect the actual object in the image, and a true negative means that a non-object
was correctly classified as not being the object in question. In order to work well, each
stage of the cascade must have a low false negative rate, because if the actual object is
classified as a non-object, then the classification of that branch stops, with no way to
correct the mistake. However, each stage can have a relatively high false
positive rate, because even if the n-th stage classifies a non-object as actually being
the object, this mistake can be fixed in the (n+1)-th and subsequent stages of the
classifier.
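
In practice, this stage-by-stage window scan is available through OpenCV's
CascadeClassifier. The following sketch uses one of the pretrained cascades that
ships with OpenCV; the input file name is illustrative:

import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
img = cv2.imread('scene.jpg')                  # illustrative input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# detectMultiScale moves the window over the image at multiple scales; each
# returned rectangle is a region that passed every stage of the cascade.
boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)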

C.3 Object Detection Using Haar Cascade Classifier

This section highlights the work conducted in the author's research in the field of
object detection using a Haar cascade classifier. The experiments were conducted
mainly on the parking lot located in Campus-15. The location was chosen mainly for
the ease of access and security for the hardware required to gather information.

3.1 Hardware: Initial testing was conducted with the webcam of a Lenovo
ThinkPad 460. The device was chosen due to its alleged high capabilities,
especially the 30 MP camera. The camera was programmed to take pictures every
five minutes, to minimize the impact on storage capacity and the number of
duplicate images, since the changes during five minutes in the parking lot were
observed to be minimal. If the object is not detected for a certain number of
frames, the hypothesis is discarded. This method can thus eliminate false positives
that do not last long enough and still keep track of objects that are missing for
only a short period in a detection step. [8]

3.2 Software: Several programs were developed in the course of this work, ranging
from a simple convert-to-grayscale and get-size-of-picture utility to the recorder,
detector and PosCreator.

3.2.1 Recorder: The recorder application was a simple application which tries to take
a picture every 5 minutes. If it can, a picture is saved to a folder of the
corresponding date with a filename of the corresponding time. If it cannot, it
simply cuts the connection within 30 seconds and waits for the next 5 minutes. This
ensures that if there is a problem with taking a picture which would cause the
program to "freeze", the program simply stops and tries again later, instead of
potentially waiting until the power runs out or someone manually stops the program.
This is a must-have feature in such an application, because several hours' worth of
image gathering could be wasted by any simple problem that halts the execution.

The detector first loads the classifier and determines that it is not empty. If it is
empty, then it simply exits with an error message. Then the image in question is
loaded and the same procedure is followed. Then the classifier is applied to the
image, which outputs an array of rectangles corresponding to the detected positions
of the objects, in this case automobiles. The program then draws bright red
rectangles at the locations of the detections and also adds text to the image, which
could for example identify the classifier used, since one classifier would usually
detect one thing.
3.2.2 Background Subtraction: However, as shown by the testing process and the
literature, the classifiers trained can produce errors, either false positives or false
negatives, as described above. In order to minimise the false positive rate originating
from the imperfections of the classifier, an additional layer was added to the algorithm
before the classifier is applied to the image. This layer has additional knowledge of
the complete background. In this case it would be an image of only the parking lot
and everything that would normally be in the parking lot, except for the cars
themselves. This knowledge can be applied to attempt to filter the background out of
the image in which we would like to detect vehicles. The background
subtraction type used was MOG. MOG (abbreviated from Mixture of Gaussians) is a
technique used to model the background pixels of an image as a mixture of
Gaussians of different weights that represent the pixel intensity. If a match is found for
the pixel of the new image with one of the Gaussians, then the point is classified as a
background pixel. If no match is found, then the pixel is classified as a foreground
pixel. Other algorithms, such as MOG2, were considered, but MOG was finally
chosen due to the simple fact that clearer results were obtained by using MOG. MOG
gives us the background mask, so in order to apply it to the original picture, one
would simply need to compute the bitwise AND of the original image and the
mask provided. MOG is, however, not perfect. If we were to just take the mask
provided by the default MOG background extractor, then the output for one image of
the parking lot would be of rather low quality. Although a person may differentiate the
regions of cars in the image, a cascade classifier proved unable to properly
comprehend the regions of cars in a similar image.

Image 3: Output using MOG with default parameters
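
A sketch of the MOG masking step described above; cv2.bgsegm.createBackgroundSubtractorMOG
comes from the opencv-contrib-python package, and the video file name is illustrative:

import cv2

subtractor = cv2.bgsegm.createBackgroundSubtractorMOG()
cap = cv2.VideoCapture('parking_lot.avi')  # illustrative recording of the scene

while True:
    ret, frame = cap.read()
    if not ret:
        break
    mask = subtractor.apply(frame)  # foreground mask from the Gaussian mixture model
    # Bitwise AND of the original image and the mask keeps only the foreground.
    foreground = cv2.bitwise_and(frame, frame, mask=mask)
    cv2.imshow('foreground', foreground)
    if cv2.waitKey(30) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()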

3.2.3 Background Subtraction Augmentation: In order to amend this issue,
different augmenting operations had to be used. The ones chosen were eroding and
dilating. Dilation is a way to make the bright regions of an image "grow". As the
kernel (a small matrix used for image processing) is scanned over the image, the
maximal pixel value overlapped by the kernel is calculated and the image pixel at the
anchor point of the kernel (usually at the centre of the kernel) is replaced by that
maximal value. Erosion works similarly to dilation, but instead of the maximum it
computes the local minimum over the area of the kernel, thus making the dark areas
of the image larger. If one were to apply dilation to the mask provided by MOG, then
the areas of the mask which are not zeros would get larger, thus improving the overall
quality of the mask. This can however raise a new issue, namely that the small noisy
areas present in the original mask could grow larger and have a negative effect on the
provided mask. For this reason, the dilated mask is eroded with a kernel of a smaller
size, so that the erosion does not nullify the result provided by the dilation but still
reduces the amount of noise produced by the dilation process, thus providing a
symbiotic relation between the two operations. The results provided by this sort of
background filtering were improved. Since a lot of the false positives provided by the
original detections were in fact on the background part of the image, such as the
trees, pavement etc., which is always there, the algorithm discarded these areas
before the Haar cascade classifier was applied. However, the regions created by the
background removal created additional problems, such as the classifier mistaking
the grey-to-black regions for positive images.
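
A sketch of this dilate-then-erode cleanup; the kernel sizes are illustrative
stand-ins for the values that were tuned experimentally in this work:

import cv2
import numpy as np

mask = cv2.imread('mog_mask.png', cv2.IMREAD_GRAYSCALE)  # illustrative MOG mask

dilate_kernel = np.ones((5, 5), np.uint8)  # illustrative sizes; the dilation kernel
erode_kernel = np.ones((3, 3), np.uint8)   # is larger than the erosion kernel

grown = cv2.dilate(mask, dilate_kernel)    # grow the bright (foreground) regions
cleaned = cv2.erode(grown, erode_kernel)   # shrink the amplified noise back down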

3.2.4 Training the cascade: The training of the cascade proved to be no easy task. The
first necessary step was to gather the images, then create samples based on them, and
finally start the training process. The OpenCV traincascade utility is an
improvement over its predecessor in several respects, one of them being that
traincascade allows the training process to be multithreaded, which reduces the time it
takes to finish the training of the classifier. This multithreaded approach is only
applied during the precalculation step, however, so the overall time to train is still
quite significant, resulting in hours, days and even weeks of training time. Since the
training process needs a lot of positive and negative input images, which may not
always be present, a way to circumvent this is to use a tool for the creation of such
positive images. OpenCV's built-in mode allows creating more positive images by
distorting the original positive image and applying a background image. However, it
does not allow doing this for multiple images. By using the Perl script createsamples
to apply distortions in batch and the mergevec tool, it is possible to create the
necessary files for each positive input file and then merge the outputted files
together into one input file that OpenCV can understand. Another important aspect to
consider is the number of positives and negatives. When executing the command to
start training, it is required to enter the number of positive and negative images that
will be used. Special care should be taken with these variables, since the number of
positive images here denotes the number of positive images to be used on each stage
of the classifier training, which means that if one were to specify to use all images on
every stage, then at some point the training process would end in an error. This is due
to the way the training process is set up. The process needs to use many different
images on every stage of the classification, and if one were to give all of them to the
first stage, then there would be no images left over for the second stage, thus
resulting in an error message. The training can result in many types of unwanted
behaviour. The most common of these is either overtraining or undertraining of the
classifier. An undertrained classifier will most likely output too many false positives,
since the training process has not had time to properly determine what actually is
positive and what is not. An output may look similar to image XYZ.

The opposite effect may be observed if too many stages are trained, which could
mean that the classification process determines that even the positive objects in
the picture are actually negative ones, resulting in an empty result set. Fairly
undefined behaviour can occur if the number of input images is too low, since the
training program cannot get enough information on the actual object to be able to
classify it correctly. One of the best results obtained in the course of this work is
depicted in image XYZ. As one can observe, the classifier does detect some vehicles
without any problems, but unfortunately some areas of the pavement and some
parts of grass are also classified as cars. Also, some cars are not detected as
standalone cars.

The time taken to train the classifier to detect at this level can be measured in days
and weeks, rather than hours. Since the training process is fairly probabilistic, a
lot of work also went into testing the various parameters used in this work, from the
number of input images to subtle changes in the structuring element of the
background removal, and verifying whether the output improved, worsened or
remained unchanged. For the same reason, the author of this work was unfortunately
unable to produce a proper classifier which would give minimal false positives and
maximal true positives.
D. What is Image Classification?
Image classification takes an image and predicts the object in the image. For example,
when we build a cat-dog classifier, we take an image of a cat or a dog and predict its
class. But what do we do if both a cat and a dog are present in the image? What would
our model predict? To solve this problem we can train a multi-label classifier which
will predict both classes (dog as well as cat). However, we still won't know the
location of the cat or the dog. The problem of identifying the location of an object
(given the class) in an image is called localization. If the object class is not
known, we have to not only determine the location but also predict the class of
each object.

Predicting the location of an object along with its class is called object
detection.

The difference between image classification and object detection:

Figure : The difference between classification (left) and object detection (right) is intuitive and straightforward.
For image classification, the entire image is classified with a single label. In the case of object detection, our
neural network localizes (potentially multiple) objects within the image.

When performing standard image classification, given an input image, we present it


to our neural network, and we obtain a single class label and perhaps a probability
associated with the class label as well.

This class label is meant to characterize the contents of the entire image, or at least
the most dominant, visible contents of the image.

For example, given the input image in Figure 1 above (left) our CNN has labeled the
image as “beagle”.

We can thus think of image classification as:

 One image in
 And one class label out
Object detection, regardless of whether performed via deep learning or other
computer vision techniques, builds on image classification and seeks
to localize exactly where in the image each object appears.

When performing object detection, given an input image, we wish to obtain:

 A list of bounding boxes, or the (x, y)-coordinates for each object in an image
 The class label associated with each bounding box
 The probability/confidence score associated with each bounding box and
class label

Figure (right) demonstrates an example of performing deep learning object detection.


Notice how both the person and the dog are localized with their bounding boxes and
class labels predicted.

Therefore, object detection allows us to:

 Present one image to the network


 And obtain multiple bounding boxes and class labels out

Can a deep learning image classifier be used for object detection?

Figure 2: A non-end-to-end deep learning object detector uses a sliding window (left) + image pyramid (right)
approach combined with classification.

Okay, so at this point you understand the fundamental difference between image
classification and object detection:

 When performing image classification, we present one input image to the


network and obtain one class label out.
 But when performing object detection, we can present one input image and
obtain multiple bounding boxes and class labels out.
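
The sliding-window approach of Figure 2 can be sketched in a few lines;
classify(patch) below is a hypothetical stand-in for any image classifier, and the
window size and step are illustrative:

import numpy as np

def sliding_windows(image, window=(64, 128), step=16):
    # Yield every window position and the image patch under it.
    h, w = image.shape[:2]
    for y in range(0, h - window[1], step):
        for x in range(0, w - window[0], step):
            yield x, y, image[y:y + window[1], x:x + window[0]]

def detect(image, classify, window=(64, 128)):
    # Every positively classified window becomes a candidate bounding box.
    return [(x, y, x + window[0], y + window[1])
            for x, y, patch in sliding_windows(image, window)
            if classify(patch)]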
E. Steps Involved in Object Detection:

Step 1: Preprocessing
Often an input image is pre-processed to normalize contrast and brightness effects. A
very common preprocessing step is to subtract the mean of the image intensities and
divide by the standard deviation. Sometimes gamma correction produces slightly
better results. When dealing with color images, a color space transformation (e.g.
RGB to LAB color space) may help obtain better results.

Note that we are not prescribing which pre-processing steps are good. The reason is
that nobody knows in advance which of these preprocessing steps will produce good
results. You try a few different ones, and some might give slightly better results. Here
is a paragraph from Dalal and Triggs:

“We evaluated several input pixel representations including grayscale, RGB and LAB
colour spaces optionally with power law (gamma) equalization. These normalizations
have only a modest effect on performance, perhaps because the subsequent descriptor
normalization achieves similar results. We do use colour information when available.
RGB and LAB colour spaces give comparable results, but restricting to grayscale
reduces performance by 1.5% at 10−4 FPPW. Square root gamma compression of
each colour channel improves performance at low FPPW (by 1% at 10−4 FPPW) but
log compression is too strong and worsens it by 2% at 10−4 FPPW.”

As you can see, they did not know in advance what pre-processing to use. They made
reasonable guesses and used trial and error.

As part of pre-processing, an input image or patch of an image is also cropped and
resized to a fixed size. This is essential because the next step, feature extraction, is
performed on a fixed-size image.
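
A short sketch of the normalization just described, in plain NumPy with a stand-in
image:

import numpy as np

img = np.random.rand(128, 128).astype(np.float32)  # stand-in for a grayscale image

# Subtract the mean of the image intensities and divide by the standard deviation.
normalized = (img - img.mean()) / (img.std() + 1e-8)

# Square-root gamma compression of the kind evaluated by Dalal and Triggs.
gamma_corrected = np.sqrt(img)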

Step 2: Feature Extraction

The input image contains too much extra information that is not necessary for
classification. Therefore, the first step in image classification is to simplify the image
by extracting the important information it contains and leaving out the rest. For
example, if you want to find shirt and coat buttons in images, you will notice a
significant variation in RGB pixel values. However, by running an edge detector on
the image we can simplify it. You can still easily discern the circular shape of the
buttons in these edge images, so we can conclude that edge detection retains the
essential information while throwing away the non-essential. This step is called
feature extraction. In traditional computer vision approaches, designing these
features is crucial to the performance of the algorithm. It turns out that we can do
much better than simple edge detection and find features that are much more reliable.
In our example of shirt and coat buttons, a good feature detector will not only capture
the circular shape of the buttons but also information about how buttons differ from
other circular objects like car tires.

Some well-known features used in computer vision are the Haar-like features
introduced by Viola and Jones, the Histogram of Oriented Gradients (HOG), the
Scale-Invariant Feature Transform (SIFT), the Speeded Up Robust Feature (SURF), etc.

As a concrete example, let us look at feature extraction using the Histogram of
Oriented Gradients (HOG).

Histogram of Oriented Gradients (HOG)

A feature extraction algorithm converts an image of fixed size to a feature vector of
fixed size. In the case of pedestrian detection, the HOG feature descriptor is
calculated for a 64×128 patch of an image and it returns a vector of size 3780. Notice
that the original dimension of this image patch was 64 × 128 × 3 = 24,576, which is
reduced to 3780 by the HOG descriptor.

HOG is based on the idea that local object appearance can be effectively described by
the distribution (histogram) of edge directions (oriented gradients). The steps for
calculating the HOG descriptor for a 64×128 image are listed below.

1. Gradient calculation: Calculate the x and y gradient images, g_x and g_y, from
the original image. This can be done by filtering the original image with the 1-D
kernels [-1, 0, 1] (horizontal) and its transpose (vertical). Using the gradient
images g_x and g_y, we can calculate the magnitude and orientation of the gradient
using the following equations:

g = √(g_x² + g_y²),  θ = arctan(g_y / g_x).

The calculated gradients are "unsigned", and therefore θ is in the range 0 to 180
degrees.
2. Cells: Divide the image into 8×8 cells.
3. Calculate histogram of gradients in these 8×8 cells: At each pixel in an 8×8
cell we know the gradient (magnitude and direction), and therefore we have 64
magnitudes and 64 directions, i.e. 128 numbers. A histogram of these gradients
provides a more useful and compact representation. We next convert these 128
numbers into a 9-bin histogram (i.e. 9 numbers). The bins of the histogram
correspond to gradient directions 0, 20, 40 … 160 degrees. Every pixel votes for
either one or two bins in the histogram. If the direction of the gradient at a
pixel is exactly 0, 20, 40 … or 160 degrees, a vote equal to the magnitude of the
gradient is cast by the pixel into the bin. A pixel where the direction of the
gradient is not exactly 0, 20, 40 … 160 degrees splits its vote among the two
nearest bins based on the distance from the bin. E.g. a pixel where the magnitude
of the gradient is 2 and the angle is 20 degrees will vote for the second bin with
value 2. On the other hand, a pixel with gradient magnitude 2 and angle 30 will
vote 1 for both the second bin (corresponding to angle 20) and the third bin
(corresponding to angle 40).
4. Block normalization: The histogram calculated in the previous step is not
very robust to lighting changes. Multiplying image intensities by a constant factor
scales the histogram bin values as well. To counter these effects we can normalize
the histogram, i.e. think of the histogram as a vector of 9 elements and divide
each element by the magnitude of this vector. In the original HOG paper, this
normalization is not done over the 8×8 cell that produced the histogram, but over
16×16 blocks. The idea is the same, but now instead of a 9-element vector you
have a 36-element vector.
5. Feature vector: In the previous steps we figured out how to calculate a
histogram over an 8×8 cell and then normalize it over a 16×16 block. To
calculate the final feature vector for the entire image, the 16×16 block is
moved in steps of 8 pixels (i.e. 50% overlap with the previous block) and the 36
numbers (corresponding to the 4 histograms in a 16×16 block) calculated at
each step are concatenated to produce the final feature vector. What is the
length of the final vector?

The input image is 64×128 pixels in size, and we are moving 8 pixels at a time.
Therefore, we can make 7 steps in the horizontal direction and 15 steps in the
vertical direction, which adds up to 7 × 15 = 105 steps. At each step we calculate
36 numbers, which makes the length of the final vector 105 × 36 = 3780.
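
OpenCV's default HOGDescriptor uses exactly these parameters (a 64×128 window,
16×16 blocks, an 8-pixel block stride, 8×8 cells and 9 bins), so a quick sketch can
confirm the arithmetic; the image file name is illustrative:

import cv2

hog = cv2.HOGDescriptor()             # defaults match the parameters above
patch = cv2.imread('pedestrian.jpg')  # illustrative input
patch = cv2.resize(patch, (64, 128))  # HOG expects a fixed-size 64x128 patch
descriptor = hog.compute(patch)
print(descriptor.size)                # 3780 = 105 blocks x 36 numbers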

Step 3: Learning Algorithm for Classification

In the previous section, we learned how to convert an image to a feature vector. In this
section, we will learn how a classification algorithm takes this feature vector as input
and outputs a class label (e.g. cat or background).
Before a classification algorithm can do its magic, we need to train it by showing
thousands of examples of cats and backgrounds. Different learning algorithms learn
differently, but the general principle is that learning algorithms treat feature vectors as
points in a higher-dimensional space and try to find planes or surfaces that partition
the higher-dimensional space in such a way that all examples belonging to the same
class are on one side of the plane or surface.
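
As a sketch of this training step, the following trains a linear SVM, assuming
scikit-learn is available; the feature vectors and labels are random stand-ins,
not real HOG data:

import numpy as np
from sklearn.svm import LinearSVC

# Stand-in data: rows are 3780-D HOG vectors, labels are 1 (cat) or 0 (background).
X_train = np.random.rand(1000, 3780)
y_train = np.random.randint(0, 2, size=1000)

clf = LinearSVC()            # finds a separating hyperplane in the feature space
clf.fit(X_train, y_train)
label = clf.predict(np.random.rand(1, 3780))  # class label for a new feature vector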

F. Code for Object Detection Using OpenCV and TensorFlow:

# -*- coding: utf-8 -*-

# # Object Detection Demo


# Welcome to the object detection inference walkthrough! This notebook will walk you
# step by step through the process of using a pre-trained model to detect objects
# in an image. Make sure to follow the [installation instructions](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md)
# before you start.

# # Imports

# In[ ]:
import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile

from distutils.version import StrictVersion


from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image

# This is needed since the notebook is stored in the object_detection folder.


sys.path.append("..")
from object_detection.utils import ops as utils_ops

if StrictVersion(tf.__version__) < StrictVersion('1.12.0'):
    raise ImportError('Please upgrade your TensorFlow installation to v1.12.*.')

# ## Env setup

# In[ ]:

# This is needed to display the images.


get_ipython().run_line_magic('matplotlib', 'inline')

# ## Object detection imports


# Here are the imports from the object detection module.

# In[ ]:

from utils import label_map_util

from utils import visualization_utils as vis_util

# # Model preparation

# ## Variables
#
# Any model exported using the `export_inference_graph.py` tool can be loaded here
# simply by changing `PATH_TO_FROZEN_GRAPH` to point to a new .pb file.
#
# By default we use an "SSD with Mobilenet" model here. See the
# [detection model zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md)
# for a list of other models that can be run out-of-the-box with varying speeds
# and accuracies.

# In[ ]:

# What model to download.


MODEL_NAME = 'ssd_mobilenet_v1_coco_2017_11_17'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'

# Path to the frozen detection graph. This is the actual model that is used for
# the object detection.
PATH_TO_FROZEN_GRAPH = MODEL_NAME + '/frozen_inference_graph.pb'

# List of the strings that is used to add correct label for each box.
PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')
NUM_CLASSES = 90
# ## Download Model

# In[ ]:

opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
tar_file = tarfile.open(MODEL_FILE)
for file in tar_file.getmembers():
    file_name = os.path.basename(file.name)
    if 'frozen_inference_graph.pb' in file_name:
        tar_file.extract(file, os.getcwd())

# ## Load a (frozen) Tensorflow model into memory.

# In[ ]:

detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')

# ## Loading label map


# Label maps map indices to category names, so that when our convolution network
# predicts `5`, we know that this corresponds to `airplane`. Here we use internal
# utility functions, but anything that returns a dictionary mapping integers to
# appropriate string labels would be fine.

# In[ ]:

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

# ## Helper code

# In[ ]:

def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(image.getdata()).reshape(
        (im_height, im_width, 3)).astype(np.uint8)

# # Detection

# In[ ]:

# For the sake of simplicity we will use only 2 images:
# image1.jpg
# image2.jpg
# If you want to test the code with your images, just add the paths to the images
# to TEST_IMAGE_PATHS.
PATH_TO_TEST_IMAGES_DIR = 'test_images'
TEST_IMAGE_PATHS = [os.path.join(PATH_TO_TEST_IMAGES_DIR, 'image{}.jpg'.format(i))
                    for i in range(1, 3)]

# Size, in inches, of the output images.


IMAGE_SIZE = (12, 8)

# In[ ]:
import cv2
cap = cv2.VideoCapture(0)

def run_inference_for_single_image(image, graph):
    with graph.as_default():
        with tf.Session() as sess:
            # Get handles to input and output tensors
            ops = tf.get_default_graph().get_operations()
            all_tensor_names = {output.name for op in ops for output in op.outputs}
            tensor_dict = {}
            for key in [
                'num_detections', 'detection_boxes', 'detection_scores',
                'detection_classes', 'detection_masks'
            ]:
                tensor_name = key + ':0'
                if tensor_name in all_tensor_names:
                    tensor_dict[key] = tf.get_default_graph().get_tensor_by_name(
                        tensor_name)
            if 'detection_masks' in tensor_dict:
                # The following processing is only for a single image
                detection_boxes = tf.squeeze(tensor_dict['detection_boxes'], [0])
                detection_masks = tf.squeeze(tensor_dict['detection_masks'], [0])
                # Reframing is required to translate the mask from box coordinates
                # to image coordinates and fit the image size.
                real_num_detection = tf.cast(tensor_dict['num_detections'][0], tf.int32)
                detection_boxes = tf.slice(detection_boxes, [0, 0],
                                           [real_num_detection, -1])
                detection_masks = tf.slice(detection_masks, [0, 0, 0],
                                           [real_num_detection, -1, -1])
                detection_masks_reframed = utils_ops.reframe_box_masks_to_image_masks(
                    detection_masks, detection_boxes, image.shape[0], image.shape[1])
                detection_masks_reframed = tf.cast(
                    tf.greater(detection_masks_reframed, 0.5), tf.uint8)
                # Follow the convention by adding back the batch dimension
                tensor_dict['detection_masks'] = tf.expand_dims(
                    detection_masks_reframed, 0)
            image_tensor = tf.get_default_graph().get_tensor_by_name('image_tensor:0')

            # Run inference
            output_dict = sess.run(tensor_dict,
                                   feed_dict={image_tensor: np.expand_dims(image, 0)})
            # All outputs are float32 numpy arrays, so convert types as appropriate
            output_dict['num_detections'] = int(output_dict['num_detections'][0])
            output_dict['detection_classes'] = output_dict[
                'detection_classes'][0].astype(np.uint8)
            output_dict['detection_boxes'] = output_dict['detection_boxes'][0]
            output_dict['detection_scores'] = output_dict['detection_scores'][0]
            if 'detection_masks' in output_dict:
                output_dict['detection_masks'] = output_dict['detection_masks'][0]
            return output_dict

# In[ ]:
while True:
    ret, image_np = cap.read()
    if not ret:
        break
    # To run on the static test images instead of the webcam, iterate over
    # TEST_IMAGE_PATHS and load each image:
    # for image_path in TEST_IMAGE_PATHS:
    #     image = Image.open(image_path)
    #     # the array-based representation of the image will be used later in order
    #     # to prepare the result image with boxes and labels on it.
    #     image_np = load_image_into_numpy_array(image)
    # Expand dimensions since the model expects images to have shape:
    # [1, None, None, 3]
    image_np_expanded = np.expand_dims(image_np, axis=0)
    # Actual detection.
    output_dict = run_inference_for_single_image(image_np, detection_graph)
    # Visualization of the results of a detection.
    vis_util.visualize_boxes_and_labels_on_image_array(
        image_np,
        output_dict['detection_boxes'],
        output_dict['detection_classes'],
        output_dict['detection_scores'],
        category_index,
        instance_masks=output_dict.get('detection_masks'),
        use_normalized_coordinates=True,
        line_thickness=8)
    cv2.imshow('image', cv2.resize(image_np, (1280, 960)))
    if cv2.waitKey(25) & 0xFF == ord('q'):
        break
cv2.destroyAllWindows()
cap.release()
G. Examples of Images Detected by Our System:

Image: Detection of a bottle

Image: Detection of a cell phone

Image: Detection of persons


H. Conclusion:

Object detection methods have a wide range of applications in a variety of areas,
including robotics, medical image analysis, surveillance and human-computer
interaction. Current methods work reasonably well in constrained domains but are
quite sensitive to clutter and occlusion.

These challenges have attracted significant attention in the computer vision
community over the last few years, and the performance of the best systems has been
steadily increasing by a significant amount on a yearly basis.

We have also tried to describe the TensorFlow model used in our project.
The code has successfully been debugged and has accurately detected certain objects,
as shown in the examples. This involves a very high-level form of image classification
as well as detection. Image detection also helps in crowd management and CCTV
applications.

In the future, object detection aims to achieve accuracy in motion analysis: the
segmented moving object from tracking can be further analyzed with the statistics of
each motion to verify whether a car is speeding or not, or whether a person is walking,
running or jumping. Processing time needs to be reduced by searching only in some
parts of the image. A searching algorithm such as hierarchical search or block
matching might be able to make this program faster because it reduces the number of
pixels to be searched.
I. References:

 https://www.cse.iitb.ac.in/~pratikm/projectPages/objectDetection/
 https://www.edureka.co/blog/tensorflow-object-detection-tutorial/
 https://medium.com/@WuStangDan/step-by-step-tensorflow-object-detection-api-tutorial-part-1-selecting-a-model-a02b6aabe39e
 https://www.oreilly.com/ideas/object-detection-with-tensorflow
 https://www.slideshare.net/Brodmann17/introduction-to-object-detection
 https://pdfs.semanticscholar.org/0f1e/866c3acb8a10f96b432e86f8a61be5eb6799.pdf
 https://cv-tricks.com/object-detection/faster-r-cnn-yolo-ssd/
 https://www.learnopencv.com/tag/object-detection/
 https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/
 https://github.com/tensorflow/models/tree/master/research/object_detection
 https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_features_meaning/py_features_meaning.html
