
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/326272144

Multi-layered Deep Convolutional Neural Network for Object detection

Article · July 2018


1 author:

Hema Rubesh
Madurai Kamaraj University

All content following this page was uploaded by Hema Rubesh on 09 July 2018.



JASC: Journal of Applied Science and Computations ISSN NO: 0076-5131

Multi-layered Deep Convolutional Neural Network for Object detection

D. Hema (1), S. Kannan (2)
(1) Assistant Professor, Department of Computer Science, Lady Doak College, Madurai-02, India
(2) Professor, Department of Computer Applications, Madurai Kamaraj University, Madurai-21, India
hemamku17@gmail.com
skannanmku@gmail.com

Abstract--Image classification and object recognition using deep learning are state-of-the-art technologies in computer
vision and robotics. Deep Convolutional Neural Networks are used for image detection and classification. This
research work builds an efficient and robust multi-layer Deep Convolutional Neural Network (DCNN) to classify images
into two classes. The stages of a DCNN (convolution, activation functions, pooling, flattening and full connection)
and their operation are discussed in detail. This work also compares different values of hyper-parameters such as the
number of epochs and the number of hidden units of the neural network. These values are chosen to increase the accuracy
and decrease the logarithmic loss of the deep learning model.

Keywords-- Convolution, activation function, loss function, image augmentation

I. DEEP LEARNING AND OBJECT DETECTION


Large-scale image classification has been a challenge in recent decades. Researchers have developed state-of-the-art
techniques in image processing and computer vision to raise the accuracy of detecting objects of a
particular class. Deep learning is one such field; it can train on and detect objects in far larger
numbers of images than traditional machine learning. Traditional machine learning requires hand-crafted features
extracted manually from images, whereas deep learning learns the features from the images itself.
Deep learning is a category of machine learning algorithms that uses a cascade of multiple layers
of nonlinear processing units for feature extraction. Each successive layer uses the output of
the previous layer as its input. Learning can be supervised (e.g., classification) and/or unsupervised (e.g.,
pattern analysis). Deep learning models learn multiple levels of representation that correspond
to different levels of abstraction. Most deep learning models are based on Artificial Neural
Networks (ANN), Deep Belief Networks or Deep Boltzmann Machines. Deep learning is expected to
advance the field of Artificial Intelligence considerably. It has also been
applied to speech recognition, social network filtering, natural language processing, bioinformatics
[1] and drug design.
II. CONVOLUTIONAL NEURAL NETWORK
Yann LeCun, often called the father of CNNs, applied deep learning to applications such as digit
recognition on the MNIST dataset, document recognition [2] using 7 layers, image recognition and handwritten
zip code recognition. AlexNet [3] brought a drastic improvement in deep learning for image
classification over 1000 image classes. It was trained on two GTX 580 GPUs for five to six days. A
slightly modified AlexNet model called ZFNet [4] followed in 2013 and was trained on a GTX 580 GPU
for twelve days. In [4], feature activations are visualized with a technique named Deconvolutional
Network (DeConvNet). The idea is to examine what type of structure stimulates a given
feature map. VGGNet [5] appeared in 2014. VGGNet used 3x3 filters, whereas AlexNet
used 11x11 filters in the first layer and ZFNet used 7x7 filters. VGGNet was trained on 4 Nvidia Titan
Black GPUs for two to three weeks and performed both classification and localization.

Volume 5, Issue 6, June /2018 Page No:93



GoogLeNet [6], a 22-layer CNN presented at CVPR 2015, achieved an error rate of 6.7%. GoogLeNet used
average and max pooling layers along with an inception module. In 2015, ResNet [7], a 152-layer
network architecture, set new records in classification, detection and localization with an
error rate of 3.6%.
III. ARCHITECTURE: MULTI-LAYERED DEEP CONVOLUTIONAL NEURAL NETWORK
Convolutional Neural Networks are the most popular deep learning architecture for large-scale image
recognition. This paper describes a multi-layer DCNN implemented for image detection in two
classes. A Deep Convolutional Neural Network (DCNN) performs a sequence of operations across its
layers, from the convolutional layers to the dense (fully connected) layers. The DCNN
implemented in this work has, in order, 2 convolutional layers, 1 pooling layer, 1 convolutional layer, 1
pooling layer, a flatten layer and 2 full connection layers. The architecture of this 8-layered DCNN is given in
Fig. 1. The network can be made deeper to increase its efficiency; a deeper
model may raise the accuracy of object detection.

Fig. 1. 8 layered DCNN architecture

A. Convolution

The first convolutional layer filters the 64X64X3 input image with 20 kernels of size 5X5X3. The convolution
operation given in equation (1) is performed over the entire input image, and the 20 different kernels produce
20 feature maps. A feature map is the output of a single filter applied to the preceding layer. The filter is slid across
the entire preceding layer, one pixel at a time, and convolved with the input from that
layer.

$(f * g)(t) = \int f(\tau)\, g(t - \tau)\, d\tau$ (1)
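
Equation (1) is the continuous form; on images the integral becomes a sum over pixel positions. The following is a minimal sketch of that discrete 2D case, assuming stride 1 and no padding. The function name `conv2d_valid` and the toy 3X3 image are illustrative assumptions, not the paper's implementation.

```python
def conv2d_valid(image, kernel):
    """Discrete 2D convolution with stride 1 and no padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    # Flip the kernel along both axes; this mirrors g(t - tau) in equation (1)
    # and is what separates convolution from plain cross-correlation.
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * flipped[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out


# Toy input (hypothetical values) and a 2x2 diagonal-difference kernel.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[4, 4], [4, 4]]
```

Many deep learning libraries skip the kernel flip and compute cross-correlation instead; since the kernel weights are learned, the two are equivalent for training purposes.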


Multiple filters can be applied to create multiple feature maps in a convolutional layer. The second convolutional layer
convolves the output of the first convolutional layer with 30 kernels of size 5X5X3 to produce 30
feature maps. The third convolutional layer follows the first pooling layer and uses
50 kernels of size 5X5X3 to produce 50 feature maps. The kernels of a DCNN, which can act as blur, emboss, edge, smoothing or sharpening
filters, are learned by the model itself. The most prominent kernel is the edge filter. While performing the convolution, the DCNN
preserves the spatial relations between pixels, so fine features are retained rather than eliminated. The output size (O) of
an image after convolution is given by formula (2).

O = ((W-K+2P)/S) +1 (2)

where W is the input height/width of the image, K is the kernel size, P is the padding and S is the stride. For instance, if the
input image height (W) is 64 and a 3X3 kernel (K = 3) is applied with no padding (P = 0) and a stride (S) of 1, the output image size
(O) is 62X62.
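
Formula (2) can be checked with a few lines of code. This is a small sketch; the function name `conv_output_size` is hypothetical, and rounding down when the stride does not divide evenly is an assumption that the paper does not state.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output height/width from formula (2): O = ((W - K + 2P) / S) + 1."""
    # Floor division is an assumption for strides that do not divide evenly.
    return (w - k + 2 * p) // s + 1


# The worked example from the text: 64-pixel input, 3x3 kernel, no padding, stride 1.
print(conv_output_size(64, 3))  # 62
# The 5x5 kernels of the first layer under the same assumptions.
print(conv_output_size(64, 5))  # 60
```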
B. ReLU-Rectified Linear Unit
The feature map output by a convolutional layer can contain negative values. To remove all negative
values and keep only the positive ones, an activation function called ReLU is applied to the feature maps. This also increases
the non-linearity of the model.
$Y = \phi(x) = \max(x, 0)$

Fig. 2. Rectifier Function (input shown in the figure: the weighted sum $\sum_{i=1} w_i x_i$)

ReLU is applied after every convolutional layer to increase non-linearity while retaining the precise
features of the image. In [8], ReLU is applied to restricted Boltzmann machines to improve their performance. ReLU can
likewise be applied in DCNNs to improve model efficiency.
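
As an illustration of the rectifier, the sketch below applies max(x, 0) element-wise to a small feature map. The name `relu` and the sample values are illustrative assumptions.

```python
def relu(feature_map):
    """Apply max(x, 0) to every value of a 2D feature map."""
    return [[max(value, 0) for value in row] for row in feature_map]


# Hypothetical feature map containing negative responses.
rectified = relu([[-2, 3],
                  [0, -1]])
print(rectified)  # [[0, 3], [0, 0]]
```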
C. Average Pooling/Subsampling
Objects in the feature maps may appear in different orientations, occluded or rotated. Pooling reduces the size of the feature
maps and hence the number of parameters in later layers, which reduces overfitting and the processing time of the network.
Pooling provides spatial invariance: features are still detected when they are slightly rotated, scaled or occluded.
The important features are preserved even though the feature map shrinks.
Different pooling operations in convolutional networks are evaluated in [9]. Max
pooling (down-sampling) finds the maximum in each neighborhood and replaces the neighborhood with it. The research work discussed
here uses average pooling (subsampling), as in LeNet [2], which replaces each neighborhood by its weighted average.
A few models combine both max pooling and average pooling. Two pooling layers are added, one after the second and
one after the third convolutional layer. A 2X2 average pooling filter is applied to the output of each of these
convolutional layers to produce a reduced feature map. The output size (O) of an image after pooling is given by
formula (3).
O = ((W-K)/S) +1 (3)
where W is the height/width of the image from the convolutional layer, K is the kernel size and S is the stride. For
instance, if the image height (W) from the convolutional layer is 62 and a 2X2 kernel (K = 2) is applied with a pool stride (S) of 2,
the output image size (O) is 31X31.
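
Average pooling and formula (3) can be sketched together as follows. The sketch assumes a plain (unweighted) mean over each 2X2 window with a stride of 2, and the names `pool_output_size` and `average_pool` are hypothetical. With those assumptions a side of 62 pixels reduces to 31.

```python
def pool_output_size(w, k, s):
    """Output height/width from formula (3): O = ((W - K) / S) + 1."""
    return (w - k) // s + 1


def average_pool(feature_map, k=2, s=2):
    """Replace each k x k window with its unweighted mean (an assumption)."""
    out_h = pool_output_size(len(feature_map), k, s)
    out_w = pool_output_size(len(feature_map[0]), k, s)
    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            window = [feature_map[r * s + i][c * s + j]
                      for i in range(k) for j in range(k)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map reduced to 2x2.
pooled = average_pool([[1, 3, 5, 7],
                       [1, 3, 5, 7],
                       [2, 2, 6, 6],
                       [4, 4, 8, 8]])
print(pooled)                      # [[2.0, 6.0], [3.0, 7.0]]
print(pool_output_size(62, 2, 2))  # 31
```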
D. Flattening
The output of the last pooling layer is a set of reduced feature maps, which must be converted into a single one-dimensional vector.
Flattening performs this task and feeds the resulting vector to the fully connected neural network.
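
To see how large the flattened vector becomes, the sketch below traces the feature map size through the five convolution and pooling layers described in this section. The paper does not state the stride or padding of these layers, so stride 1 with no padding for convolutions and a 2X2 window with stride 2 for pooling are assumptions, and the layer table itself is hypothetical.

```python
# Hypothetical layer table: (name, kernel size, stride, output channels).
# Stride 1 / no padding for convolutions and stride 2 for pooling are assumptions.
LAYERS = [
    ("conv1", 5, 1, 20),
    ("conv2", 5, 1, 30),
    ("pool1", 2, 2, 30),
    ("conv3", 5, 1, 50),
    ("pool2", 2, 2, 50),
]


def trace_shapes(side=64):
    """Return (name, height, width, channels) after each layer."""
    shapes = []
    for name, k, s, channels in LAYERS:
        side = (side - k) // s + 1  # formulas (2) and (3) with P = 0
        shapes.append((name, side, side, channels))
    return shapes


def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat list of values."""
    return [v for fmap in feature_maps for row in fmap for v in row]


shapes = trace_shapes()
_, h, w, c = shapes[-1]
flat_len = h * w * c
print(shapes[-1], flat_len)  # ('pool2', 12, 12, 50) 7200
```

Under these assumptions the side length shrinks as 64, 60, 56, 28, 24, 12, so the first full connection layer would receive a vector of 7200 values.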
E. Full /Dense Connection


Full connection is the neural network layer in which each neuron is connected to all activations of the previous
layer. The activations are computed by a matrix multiplication followed by a bias offset. There are two full connection layers in this
model. The model calculates the loss function given in equation (4) and performs backward propagation to adjust
the weights, training the model on the training images. The test images are then fed into the model to compute the loss/cost
function and the accuracy of the multi-layered architecture. The output layer implements a
sigmoid activation function to assign the object to one of two classes. For multi-class object detection and classification, a
softmax activation function can be used instead.
$H(p, q) = -\sum_{x} p(x) \log q(x)$ (4)
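
For two classes, the cross-entropy of equation (4) reduces to the binary form sketched below, with the sigmoid supplying the predicted probability q. The function names, the clipping constant `eps` and the sample values are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -[y log q + (1 - y) log(1 - q)] over all samples."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        q = min(max(q, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumption
        total += -(y * math.log(q) + (1 - y) * math.log(1.0 - q))
    return total / len(y_true)


# An undecided classifier (q = 0.5 for every image) has a loss of log 2.
loss = binary_cross_entropy([1, 0], [sigmoid(0.0), sigmoid(0.0)])
```

A confident and correct prediction drives the loss toward zero, while a confident wrong prediction is penalized heavily.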
IV. IMAGE AUGMENTATION
DCNNs require a huge amount of training data to achieve good performance. To build a powerful image
classifier from very little training data, image augmentation is commonly used to boost the performance of deep networks.
Image augmentation artificially creates training images by applying one or more processing operations,
such as random rotation, shifts, shear and flips. In this model, rescaling, zoom, shear and
flipping are used for image augmentation.
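
Two of the listed operations, rescaling and flipping, can be sketched without any image library as shown below. This is only an illustration under stated assumptions: pixel values in the range 0-255, a flip probability of 0.5, and hypothetical function names. Zoom and shear need interpolation and are omitted here.

```python
import random


def rescale(image, scale=255.0):
    """Map pixel values from [0, 255] to [0, 1]."""
    return [[value / scale for value in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image, rng):
    """Rescale, then flip with probability 0.5 (the probability is an assumption)."""
    out = rescale(image)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return out


# Hypothetical 2x3 grayscale image.
image = [[0, 51, 255],
         [255, 51, 0]]
augmented = augment(image, random.Random(0))
```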

V. TRAINING AND TEST DATASET


The training and testing data should be in the ratio 4:1, and the two sets must not overlap.
In this research work, 3000 training images (1500 per class) and 750 testing images (375 per class) are
used. The dataset is drawn from the Caltech-UCSD Birds 200 and INRIA Person datasets. The training and
testing images should be of the same size; the images used in this model are 64X64.
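
A 4:1 split without overlap can be sketched per class as follows. The function name, the fixed seed and the use of integer identifiers in place of image files are all assumptions made for illustration.

```python
import random


def split_train_test(items, train_parts=4, test_parts=1, seed=0):
    """Shuffle once, then cut so that train:test is train_parts:test_parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = len(items) * test_parts // (train_parts + test_parts)
    return items[n_test:], items[:n_test]


# One class with 1875 images (hypothetical ids) yields 1500 training and 375 testing images.
train, test = split_train_test(range(1875))
print(len(train), len(test))  # 1500 375
```

Because both sets are slices of a single shuffled list, no image can appear in both.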

VI. COMPILING AND FITTING


The DCNN is compiled with the Adam optimizer and binary cross-entropy as the loss function. The model is fitted for
hyper-parameters such as the number of epochs, the batch size and the number of hidden units. The model is trained for
1, 5, 10, 15, 20 and 25 epochs with a batch size of 32. The 3000 training images are therefore split into batches of 32, and
93 full batches (3000/32 = 93.75) are fed in one epoch. Different image sets of 93 batches are passed in each epoch.
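
The batch count follows from simple arithmetic, sketched below. The function name and the `drop_last` flag are hypothetical; the flag only makes explicit whether the final partial batch of 24 images is counted.

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last=True):
    """Number of batches in one pass over the training data."""
    if drop_last:
        return n_samples // batch_size          # full batches only
    return math.ceil(n_samples / batch_size)    # include the final partial batch


print(batches_per_epoch(3000, 32))                   # 93
print(batches_per_epoch(3000, 32, drop_last=False))  # 94
```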

VII. RESULTS
Two models, with 64 and 128 hidden units, are each trained for 25 epochs. The models are trained on
a dual-core CPU running at 2.3-2.8 GHz. In [10], CNNs are optimized on an embedded FPGA (Field Programmable Gate
Array) for object detection. The training and validation loss of the 64 and 128 hidden unit models (25
epochs) is given in Fig. 3 (a) and (b) respectively. The training and validation accuracy of the 64 and
128 hidden unit models (25 epochs) is given in Fig. 4 (a) and (b) respectively.

Fig. 3. (a) Loss for training and validation of the 64 hidden unit model
Fig. 3. (b) Loss for training and validation of the 128 hidden unit model


Fig. 4. (a) Accuracy for training and validation of the 64 hidden unit model
Fig. 4. (b) Accuracy for training and validation of the 128 hidden unit model

As the model is trained for more epochs, the loss should decrease and the accuracy should increase. After a
certain number of epochs the accuracy stabilizes, and training can be terminated at this point. Underfitting
and overfitting may occur while training a DCNN model. Overfitting is indicated when the training loss is lower than
the validation loss, and underfitting when the training loss is greater than the validation loss. A good fit is when the
training loss is approximately equal to the validation loss. Overfitting can be prevented by providing a sufficiently large dataset, by
designing a smaller neural network model or by introducing regularization techniques in the network. To avoid underfitting,
the opposite measures apply, such as weaker regularization or a larger neural network. A model between
underfitting and overfitting, where the error is close to zero, is said to be a good fit.
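
The rule of thumb in this paragraph can be written as a small helper. This is a sketch of that rule only; the function name, the tolerance of 0.05 and the sample loss values are assumptions, and the paper's own results are not reproduced here.

```python
def diagnose_fit(train_loss, val_loss, tolerance=0.05):
    """Classify the fit by comparing training and validation loss.

    The tolerance that counts as 'approximately equal' is an assumption.
    """
    gap = val_loss - train_loss
    if abs(gap) <= tolerance:
        return "good fit"
    return "overfitting" if gap > 0 else "underfitting"


# Hypothetical loss values, not taken from Fig. 3.
print(diagnose_fit(0.30, 0.60))  # overfitting
print(diagnose_fit(0.60, 0.30))  # underfitting
print(diagnose_fit(0.40, 0.42))  # good fit
```
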
From Fig. 3 and Fig. 4 it is clear that increasing the number of hidden units to 128 lowers the accuracy below 80%, whereas the
model with 64 hidden units has an accuracy above 80%. The model with 64 hidden units has a good fit
because its training and validation accuracies are nearly equal. Increasing the number of epochs to 25 also yields an
accuracy above 80%, whereas at 15-20 epochs it is below 80%. Hence optimizing hyper-parameters such as the number of epochs and hidden units
increases the DCNN's efficiency and robustness.

REFERENCES
[1] Seonwoo Min, Byunghan Lee, Sungroh Yoon, 2016, "Deep Learning in Bioinformatics", Briefings in Bioinformatics, Volume 18, Issue 5, p. 851-869.
[2] Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner, 1998, "Gradient-based learning applied to document recognition", Proceedings of the
IEEE, 86 (11), p. 2278-2324.
[3] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012, "ImageNet Classification with Deep Convolutional Neural Networks".
[4] Matthew D. Zeiler, Rob Fergus, 2013, "Visualizing and Understanding Convolutional Networks".
[5] Karen Simonyan, Andrew Zisserman, 2014, "Very Deep Convolutional Networks for Large-Scale Image Recognition".
[6] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich,
2015, "Going Deeper with Convolutions", CVPR.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2015, "Deep Residual Learning for Image Recognition".
[8] V. Nair and G. E. Hinton, 2010, "Rectified linear units improve restricted Boltzmann machines", In Proc. 27th International Conference on Machine
Learning.
[9] Dominik Scherer, Andreas Muller and Sven Behnke, September 2010, "Evaluation of Pooling Operations in Convolutional Architectures for Object
Recognition", 20th International Conference on Artificial Neural Networks (ICANN).
[10] Ruizhe Zhao, Xinyu Niu, Yajie Wu, Wayne Luk, Qiang Liu, March 2017, "Optimizing CNN-Based Object Detection Algorithms on Embedded FPGA
Platforms", International Symposium on Applied Reconfigurable Computing.
