CNN Algorithms For Detection of Human Face Attributes - A Survey
Abstract: In recent years, CNN algorithms have been increasingly applied to computer vision applications such as disaster management systems that use crowd-sourced images. Flooding is one such frequent natural disaster that threatens human life and property. Research is in progress to find the extent of damage in flood-hit areas by calculating the depth of the water from flood images containing humans captured by smartphone cameras. Algorithms that can detect a human face and its attributes such as age, gender and ethnicity in these crowd-sourced images can provide valuable information during such situations. A multitude of CNN algorithms is available for these tasks. Each differs in its architecture, which in turn influences the accuracy of the results. In this survey, we compare the state-of-the-art CNN algorithms that perform each of these tasks, namely face detection, age and gender classification, and ethnicity classification. We compare these algorithms with respect to their performance and accuracy so that an appropriate algorithm can be selected for the above application.

The performance of the system varies with the deep learning mechanisms, the CNN architecture used and the size of the dataset. The accuracy of the system can be further improved by using a better training dataset. This work compares different CNN architectures trained for classification of face attributes such as age-group, gender and ethnicity. The comparison gives the accuracy achieved by the different algorithms, which helps in choosing an architecture for different applications.

This paper is organised as follows. Section 2 explains the flood monitoring system. Section 3 starts with an introduction to CNN architectures and then describes each architecture. Section 4 compares different face detection algorithms. Section 5 compares different algorithms for face classification based on age-group, gender and ethnicity. Sections 6 and 7 discuss the comparison and conclude this work, including the potential options for further enhancement of this work.

II. FLOOD MONITORING

A flood monitoring system is proposed to aid rescuers with information such as water depth. Estimating water depth helps to continuously monitor the flooded region. Our current research on the flood monitoring system focuses on images with humans. Humans are used as the reference object, and the average height of a human is used as the reference value to estimate the water depth.

Our flood monitoring system [7] includes five modules, namely face detection, age-group and gender classification, ethnicity classification, semantic segmentation, and water depth estimation.

The face detection algorithm is applied to the input image and gives a bounding box around each human face. According to the standard ‘golden ratio’, human height is 8 times the height of the face. With this concept, the average height of the human is determined in pixels in the image. Using the age-group, gender and ethnicity classification of the detected faces, the average height of the human is determined in feet. These two values, along with the output of semantic segmentation, are used in estimating the water depth.

Hence, CNN architectures have to be applied in the face detection, age-group and gender classification, and ethnicity classification modules of the flood monitoring system.

III. CNN ARCHITECTURES

ANN (Artificial Neural Network) is one of the most dominant tools for machine learning. It is inspired by the features of the human neural system. It consists of three layers: input layer, hidden layer, and output layer. Each layer has numerous nodes. Depending on the requirements of different applications such as pattern recognition in images and speech recognition, different ANN algorithms with a larger number of hidden layers have been developed; these are called deep learning algorithms.

[...] size of the dataset increases and helps to learn high-level features from the training dataset. Also, a traditional machine learning system needs two different algorithms for feature extraction and classification [4], whereas deep learning uses only one network. The three fundamental deep learning architectures are Convolutional Neural Networks (CNN), Recurrent Neural Networks and Recursive Neural Networks, among which CNN is designed for image processing.

TABLE I. TYPES OF MACHINE LEARNING ALGORITHMS

Supervised Learning
• Decision Trees
• Naive Bayes Classification
• Ordinary Least Squares Regression
• Logistic Regression
• Support Vector Machines
• Ensemble Methods

CNN [8] is a type of Artificial Neural Network that performs feature extraction and classification within a single deep neural network. When the CNN employs deep learning mechanisms such as Back Propagation, Stochastic Gradient Descent, Learning Rate Decay, Max Pooling, Batch Normalization, Long Short-Term Memory and Transfer Learning, chosen according to the requirements of the architecture and the application, the system provides better results.

Layers [9] in CNN:
● Convolutional Layer - Filters are applied for feature extraction
● Pooling Layer - Max pooling for dimensionality reduction
● Fully Connected Layer - For classification of objects

Stride and padding are two important filter parameters that decide the size of the output image. The stride is the number of pixels the filter shifts between successive convolution operations. Padding adds extra pixels on all sides of the image. Max pooling and activation functions are two further important mechanisms of CNN. Max pooling downsamples the image (reduces its dimension), and an activation function [10] determines the output of a node. Sigmoid, Tanh, ReLU and SoftMax are the most commonly used activation functions.

One of the earliest CNN architectures is LeNet [8], designed for recognizing handwritten digits from images of size 32*32. It is not efficient for high dimensional images: when large images need to be processed, this architecture falls behind because of the computational resources required. Next, a deeper CNN architecture with more layers and a greater number of filters per layer, called AlexNet [8], was developed. It uses 11*11, 5*5 and 3*3 convolutions. The architecture includes 5 convolutional layers and 3 fully connected layers, and it uses the ReLU (Rectified Linear Unit) activation unit. Then ZFNet [8], an improved version of AlexNet with 7*7 convolutions, was developed by varying the parameters of AlexNet while maintaining the same architecture. It uses the ReLU activation unit and batch stochastic gradient descent. Amongst the existing CNN architectures, Inception, VGGNet and ResNet are widely accepted and used. Our survey focuses on face classification algorithms based on these architectures.

A. Inception

As the region of information in each image varies largely in size, it is difficult to choose the right filter size. So, multiple filters of different sizes (1*1, 3*3, 5*5) are applied in parallel alongside max pooling, and their outputs are concatenated. The filters of different sizes make the network computationally expensive, so a 1*1 convolution is added before the larger filters. This forms the inception module. Inception v1 (GoogLeNet) [11] has 9 inception modules and is 22 layers deep. Two auxiliary classifiers and SoftMax activation units are used [12]. To increase the accuracy and reduce the computational complexity, Inception v2 was developed: all 5*5 filters are replaced with two 3*3 filters, and then all n*n filters are replaced with 1*n and n*1 filters. Filter banks are made wider for easy representation and to reduce the loss of information [13]. Inception v3 adds the RMSprop optimizer, factorized convolutions, batch normalization, and label smoothing. TABLE II summarises the functionality of each deep learning mechanism.

TABLE II. DEEP LEARNING MECHANISMS AND THEIR FUNCTIONALITIES

Mechanism                 Functionality
Optimizer                 Tweak and change parameters to minimize the loss function
Factorized Convolution    Reduce dimension
Normalization             Adjust (normalize) the input values to improve performance and stability
Label Smoothing           Prevent overfitting

In Inception v4, the operations that come before the inception modules are modified and new reduction blocks are introduced. Reduction blocks reduce the dimension of the input image (from 35*35 to 17*17 to 8*8) [14]. Then a hybrid model of Inception and ResNet was developed by adding a 1*1 convolution after all the operations to make the dimensions of the input and output images the same.

B. VGGNet

VGG [15] was developed by the Visual Geometry Group at Oxford as an upgrade of the AlexNet architecture. It has three models: VGG-16 (16 layers), VGG-19 (19 layers) and model fusion. As smaller filters stacked together can extract more features than large filters, the 11*11 and 5*5 filters in AlexNet are replaced with 3*3 filters in VGG, thereby making the architecture deeper. It is therefore widely preferred for feature extraction and can be fine-tuned with Transfer Learning for different applications. The VGG architecture starts with a set of 3*3 convolutional layers with a stride and padding of 1 pixel, followed by three fully connected layers and a soft-max layer. Max pooling and ReLU activation functions are used throughout.

C. ResNet

The ResNet (Residual Neural Network) architecture has up to 152 layers. It is built from a stack of residual blocks [16]. Each residual block comprises a neural network segment and an identity loop (shortcut link); stacking these blocks forms the residual network. As an architecture becomes deeper, the gradient signal needed to update the weights shrinks as it propagates back through the layers (the vanishing gradient problem), and optimizing a large number of parameters decreases the performance of the architecture (the degradation problem). Both challenges are addressed by the identity loop in the residual blocks.

IV. FACE DETECTION

The face is a unique identifier for an individual. Hence, face detection is the primary step in face classification and face recognition algorithms.

Wang et al., 2018 [17] compared the traditional machine learning algorithms for face detection and concluded that the Viola-Jones method (Haar Cascade) is the best method. This section compares Haar Cascade with other state-of-the-art techniques.

A. Haar Cascade

Haar filters [18] are used to extract features from the images. Feature extraction involves a large number of calculations, so an integral image [19] is used. AdaBoost [20] is used to select the best features from a large set of features. A cascade of classifiers is then used to locate the region of the face in the image.

The classifier is accurate for images with frontal orientation, but it does not work well for faces at other angles. Also, it is difficult to tune the parameters of these classifiers.

Fig. 1. Face Detection using Haar Cascade

B. Dlib Detector

Dlib is a toolkit that can be used for face detection. Dlib can perform face detection either with Histogram of Oriented Gradients (HOG) features and a Support Vector Machine (SVM), or with a CNN. Dlib with HOG and SVM works well for frontal faces, and in many cases it can also detect faces that are not perfectly frontal. Dlib with CNN can detect faces at all angles. However, when the results of the two models are compared, there is no significant difference in accuracy.
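A face detector returns a bounding box, which Section II turns into a height estimate. The following is a minimal Python sketch of that arithmetic, not the system's actual implementation. The 8x face-to-body ratio comes from Section II; the `FaceBox` type, the 5.5 ft average height, the `visible_px` input (assumed here to come from semantic segmentation) and the depth formula itself are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass

# From Section II: body height is taken as 8 times the face height.
FACE_TO_BODY_RATIO = 8


@dataclass
class FaceBox:
    """Axis-aligned face bounding box in pixel coordinates (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def height_px(self) -> int:
        return self.bottom - self.top


def body_height_px(face: FaceBox) -> int:
    """Estimated full body height in pixels, derived from the face box height."""
    return FACE_TO_BODY_RATIO * face.height_px


def feet_per_pixel(face: FaceBox, avg_height_ft: float) -> float:
    """Image scale implied by an assumed real-world average height in feet."""
    return avg_height_ft / body_height_px(face)


def water_depth_ft(face: FaceBox, avg_height_ft: float, visible_px: int) -> float:
    """Hypothetical depth estimate: the part of the body hidden below the water.

    visible_px is an assumed input: the pixel height of the person that is
    visible above the water line, for example measured on a segmentation mask.
    """
    hidden_px = max(body_height_px(face) - visible_px, 0)
    return hidden_px * feet_per_pixel(face, avg_height_ft)


if __name__ == "__main__":
    face = FaceBox(left=100, top=50, right=140, bottom=100)  # face is 50 px tall
    print(body_height_px(face))  # 400
    print(round(water_depth_ft(face, avg_height_ft=5.5, visible_px=300), 3))
```

In this sketch the average height in feet would be looked up from the age-group, gender and ethnicity predicted for the face; the value 5.5 is a placeholder, not a figure from the survey.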
[...] Extraction of skin color and normalized forehead area calculation using the Sobel Edge Detection method. The FERET Database with 447 samples (357 for training, 90 for testing) is used. The experimental results show 82% accuracy.

Ethnicity classification in [29] includes gender identification using PCA, face shape recognition using the active appearance model (AAM) and active shape model (ASM), feature and key point extraction, Euclidean distance calculation and final classification using SVM. The experiment was tested on the Android platform and the accuracy is 86.4%.

b) CNN

Ethnicity classification in [28] uses the FERET Database with 357 samples and the VGG-16 architecture with 13 convolution layers, 3 fully connected layers, and pooling with a 2*2 window and stride 2. The ReLU activation function and categorical cross entropy as the loss function are used. The accuracy on testing is 98.6%. TABLE V compares the ANN and CNN architectures based on ethnicity.

CNN (VGG)    FERET    98.6%

In ethnicity classification, the VGG architecture of CNN has achieved a better accuracy of 98.6%, as compared to 82% using ANN (TABLE V). Both architectures used the FERET dataset for ethnicity classification. Therefore, in addition to age-group and gender classification, the VGG architecture performs accurately for ethnicity classification as well.

VI. DISCUSSION

From TABLE III, it is clear that YOLO performs well for face detection. In our flood monitoring system, not all the detected human faces can be used for water depth estimation. In cases such as a child carried over the shoulder, the child’s face cannot be considered. So, the algorithm that can detect the maximum number of faces has to be used. Hence, YOLO is chosen for face detection in the flood monitoring system.

From TABLE IV and TABLE V, it is clear that VGG performs well in all facial feature-based human classification. For age-group and gender classification, GoogLeNet has also achieved 98% accuracy. An increase in the accuracy of the flood monitoring system helps to continuously monitor a flooded region over time: the system can determine whether the water level is increasing or decreasing with time. Hence, GoogLeNet is used for age-group and gender classification and the VGG architecture is used for ethnicity classification in the flood monitoring system.

VII. CONCLUSION

We have compared state-of-the-art algorithms for human face detection and facial feature based human classification and have identified the best algorithm for each task based on performance and accuracy.

As part of the future work, an extensive survey of the different layers in each architecture can help in identifying the scope for improving performance. The hybrid models of Inception and ResNet have, in general, proved to give better results than VGG. Hence, there is more scope for hybrid models in different applications.

ACKNOWLEDGMENT

We are extremely grateful to our beloved Chancellor, Dr. Mata Amritanandamayi Devi, also known as Amma, for providing us the guidance, motivation and a supportive environment to work on this project.

REFERENCES

[2] Kdnuggets.com, ‘The 10 Algorithms Machine Learning Engineers Need to Know’ [Online] Available on: https://www.kdnuggets.com/2016/08/10-algorithms-machine-learning-engineers.html [Accessed on: Jan 2019]
[3] Towardsdatascience.com, ‘Introduction to Various Reinforcement Learning Algorithms. Part I (Q-Learning, SARSA, DQN, DDPG)’ [Online] Available on: https://towardsdatascience.com/introduction-to-various-reinforcement-learning-algorithms-i-q-learning-sarsa-dqn-ddpg-72a5e0cb6287 [Accessed on: Jan 2019]
[4] Towardsdatascience.com, ‘Why Deep Learning over Traditional Machine Learning?’ [Online] Available on: https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063 [Accessed on: Jan 2019]
[5] Nehru, Mangayarkarasi, and S. Padmavathi. "Illumination invariant face detection using viola jones algorithm." In 2017 4th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 1-4. IEEE, 2017.
[6] Narayanan, RamKumar, V. M. Lekshmy, Sethuraman Rao, and Kalyan Sasidhar. "A novel approach to urban flood monitoring using computer vision." In Fifth International Conference on Computing, Communications and Networking Technologies (ICCCNT), pp. 1-7. IEEE, 2014.
[7] Nair, Bhavana B., and Sethuraman N. Rao. "Poster: Flood Monitoring using Computer Vision." In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pp. 165-165. ACM, 2017.
[8] medium.com, ‘CNN Architectures: LeNet, AlexNet, VGG, GoogLeNet, ResNet and more ....’ [Online] Available on: https://medium.com/@sidereal/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5 [Accessed on: Jan 2019]