Chapter - 1: Introduction: Dept. of Electronics and Communication
Chapter – 1: Introduction
Artificial intelligence (AI) is being incorporated into all walks of life and is automating procedures in the new era. Computer vision, an important branch of AI, provides the computer with a human-like view of the world. AI is applied in fields such as face detection, face generation, face identification, agriculture, supermarkets and many more. Tasks that were previously considered difficult or even impossible are now performed very easily with the help of AI, yielding precise and highly accurate automated systems.
Deep learning is a subfield of machine learning in which the model mimics the human brain and its capabilities. A deep learning model is a type of artificial neural network (ANN) that imitates the human brain in performing tasks such as face detection, face generation, speech recognition, language translation and decision making. Deep learning training can take place with human interference, with minimal human interference or with no human interference; these modes are known as supervised, semi-supervised and unsupervised learning respectively. With the help of deep learning models, computers can be trained to perform accurate image detection, classification and identification. In deep learning, instead of giving the models predefined equations to organize the data, the models evolve from basic parameters: the computer learns those parameters on its own, extracting features and forming many layers for processing the data.
Face detection and recognition are performed in order to identify an individual in a digital image. Various features such as skin tone, age, gender and hairstyle are considered when a face has to be recognized. Face detection and recognition can be done using a CNN, with the data pre-processing performed by the Local Gradient Number Pattern, which is coded with the help of LDP-based methods.
Face detection is simply finding a face in an image: a technology based on artificial intelligence for identifying faces in an image. Face detection and recognition have grown from a basic computer vision technique into highly advanced artificial neural networks and play an important role in face tracking and face analysis.
The main function of face detection is to determine whether a face is present in a given image. Face detection, also called facial detection, is usually performed on a grayscale image. In face detection, the return value is the location of the face and the area occupied by the face in the digital image. After the detection process, information such as age, expression, gender and pose is extracted from the faces for the recognition process. Identifying a specific person from the face is known as face recognition. In the beginning, face detection and recognition worked only for upright, frontal faces, but now many new techniques can detect and recognize a face from any angle.
In the face recognition process, feature description and extraction play important roles. The two common methods for feature extraction are geometric-based and appearance-based methods. The former uses geometric information to specify facial features by encoding and combining the shapes and locations of facial components into a feature vector. The latter uses descriptors to analyze facial information. However, geometric-feature-based methods generally require accurate and reliable facial feature detection and tracking, which is difficult to achieve in several situations. Appearance-based methods apply image filters either to the whole face, to produce holistic features, or to specific face regions, to produce local features, in order to capture changes in the appearance of the face image. The performance of appearance-based methods is much better in relatively controlled environments, but it degrades under environmental changes.
In the past decades, even though great progress has been made in the development of face analysis and recognition, challenges such as illumination, background and varying expressions remain difficult problems in the face recognition field. The key to face recognition is developing an effective and robust descriptor that is capable of extracting the face features while removing the influence of noise, light and expression.
Many approaches for face analysis and recognition have been developed such as, collaborative
preserving fisher discriminant analysis (CPFDA), discriminative multi-scale sparse coding
(DMSC), two-dimensional quaternion principal component analysis (2D-QPCA), and real-time
face recognition using deep learning and visual tracking (RFR-DLVT). These methods use the
global facial information for recognition and they are sensitive to the changes of illumination,
partial occlusions, expression and poses. Therefore, local descriptors have been studied deeply in the past decades. As a typical representative, the local binary pattern (LBP) was presented, which is simple in principle and low in computational complexity; in addition, it combines structural as well as statistical features. Because of these advantages of LBP, numerous improved operators have been introduced in recent years, such as extended LBP, local ternary pattern (LTP) and modified gradient pattern (MGP). It is generally known that edge and direction information is important for image analysis. These features were also incorporated into extended methods of LBP, such as local maximum edge binary patterns (LMEBP), local edge binary pattern (LEBP) and local tetra patterns (LTrP).
Further, the local directional pattern (LDP) was proposed to reduce the influence of the random
noise and monotonic illumination changes. However, LDP still produces unreliable codes for the
uniform and smooth neighborhoods. Based on LDP, enhanced local directional pattern (ELDP),
local direction number (LDN), local direction texture pattern (LDTP), and local edge direction
pattern (LEDP) were put forward to overcome the shortcomings.
Besides the methods based on local features, deep learning approaches based on the Convolutional Neural Network (CNN) have emerged in recent years. These methods can learn high-quality information from face images by training models on large amounts of data. Although CNN-based methods have attracted considerable attention, research on local features is still continuing.
There are many advantages of facial recognition in many aspects of life and some of
them are as follows:
i. Processing is quicker: The face recognition process takes little time, usually a second or two, which is very useful. For cyber security, companies need a tool that is quick and safe; a recognition system solves this problem, and another added advantage is that it is very difficult to fool the system.
ii. Enhanced Security: Facial recognition is used in surveillance systems, with whose help it is very easy to find thieves and other trespassers. On a higher level, it is used to track terrorists and other criminals by scanning. The main benefit is that there is nothing to steal, as in the case of passwords or PIN numbers. It can also be used to unlock phones and other devices.
iii. Banking Purpose: Facial recognition can be used in the banking system because, despite much care being taken, banking frauds still happen. With face recognition, only the face is scanned instead of sending a one-time password. Sending one-time passwords can be stopped, as many frauds happen due to their duplication or theft; with face recognition, duplicating and stealing are not possible.
iv. Seamless Integration: This is one of the biggest advantages in industry, as facial recognition is very easily integrated with any system. It does not require any additional expense and integrates easily with existing security software, thereby improving safety.
The main limitation is the analysis of the face, because facial appearance varies due to changes in pose, expression, illumination and other factors such as age and make-up. Face descriptors are sensitive to local intensity variations, which occur commonly along edge components such as the eyes, eyebrows, nose, mouth, whiskers, beard or chin due to internal factors (eyeglasses, contact lenses or makeup) and external factors (different backgrounds). This sensitivity of the existing face representation methods generates different patterns of local intensity variations and makes learning a face detector difficult. Selecting the right features for classification is also difficult.
Another drawback is the difficulty of extracting all the features in the face, which requires a huge number of images. The sub-optimal performance of most face recognition systems (FRSs) in unconstrained environments is another drawback. The lack of highly effective image pre-processing approaches, which are typically required before the feature extraction and classification stages, might be one of the reasons for poor performance. Further, the wide applicability of most FRSs in real-life scenarios is limited, since only minimal face recognition issues are considered in most FRSs.
One of the main drawbacks is that very little corresponding training data is available, and such data is essential for training a deep learning model; without it, the model may not function as expected. The datasets for training a deep learning model are mainly obtained in two ways: from already available datasets or from self-created datasets. These datasets have to be maximized in terms of both quality and quantity, and a lack of such data may cause the deep learning model to function abnormally. Creating new datasets for training deep learning models requires human involvement and a very large amount of time, and self-created datasets are small, so data augmentation is performed. Another drawback is the domain gap, where a computer-generated image may not be similar to real-world images. Due to these drawbacks, some method has to be incorporated into the face recognition system that does not require creating more datasets manually.
1.4. Motivation
In CNN-based face recognition methods, the image itself is the input to the CNN, and the CNN learns from the pixel level. In the learning process of a CNN, feature extraction starts from the lowest-level features, i.e., raw data, which makes learning feature extraction more difficult. To overcome this, a new method has to be proposed such that the input images given to the CNN make it learn the characteristics of the image better than pixel-level learning.
1.5. Objectives
Face recognition has become a blooming topic, and all industries are trying to implement it in their systems for security reasons. Due to this, there is a high demand for the recognition process, and research has been carried out at the industry level as well as the educational level. As a result, deep learning, a sub-branch of machine learning, is gaining much popularity in the field of facial recognition.
LGNP is a face descriptor used to generate LGNP images for face feature extraction in face analysis. These LGNP images are then used for the training and testing process, using a CNN classifier for face recognition and classification.
A Convolutional Neural Network (CNN) is a deep learning algorithm that consists of neurons with learnable weights and biases, like any other neural network. A CNN can be implemented in the Python programming language, which is free and open source. Python is used here because it is a high-level language with many inbuilt modules for deep learning, such as TensorFlow and OpenCV (cv2). CNNs are designed using the Keras module, which provides an interface to TensorFlow. In this project, a Convolutional Neural Network model is proposed for the classification and recognition of faces. A CNN involves less preprocessing in extracting the features and provides better accuracy. It employs Convolution+ReLU and max pooling for feature extraction. The extracted features are then fed to a feed-forward neural network for training and validation. At the output layer, the softmax activation function is used, which provides the labeling in terms of probability.
The main objectives of this project are: i] to generate Local Gradient Number Pattern (LGNP) images from already existing datasets using the LGNP code based on LDP-based methods, and ii] to classify and recognize faces using a Convolutional Neural Network (CNN).
LGNP is used as a facial feature descriptor that is coded in gradient space, i.e., the gray value distributions are considered in the proposed coding scheme. It is used to extract the gradient features in an image, and it can be calculated with the help of the LGNP code. The gradient features describe the spatial gray changes in the local neighborhood: a smaller gradient value shows that the point has a uniform local distribution, while a bigger gradient value shows that the point is in a region where the gray level changes rapidly. In the proposed local gradient number pattern (LGNP) descriptor, gradients are computed in a local neighborhood and describe the distributions of the gray values. Inherently, this sharpens the local neighborhood, emphasizes the transitional parts in the gray scale and reveals the structural relationship between the pixels. The gradient is used to improve the edges in face images, to remove shadows in the pattern and to enhance small mutations in flat areas.
Let the image function be f(x, y), and let gx and gy be the partial derivatives at each pixel in the neighborhood. The gradient magnitude is given as follows:

G(x, y) = √(gx² + gy²)        (2)

To reduce complexity and simplify the calculation, the square root operation is replaced by the sum of absolute values:

G(x, y) = |gx| + |gy|        (3)

In the proposed method, the Sobel operator is used to compute the gradient values in the horizontal and vertical directions in the local neighborhood. The 3 × 3 Sobel kernel traverses the entire image, and the edges are covered since zero padding is done.
Let N represent a local 3 × 3 neighborhood centered on (x, y). The gradients in the horizontal and vertical directions, Gx and Gy respectively, are calculated as follows:

     [ −1  0  1 ]              [  1  2  1 ]
Gx = [ −2  0  2 ] * N     Gy = [  0  0  0 ] * N        (4)
     [ −1  0  1 ]              [ −1 −2 −1 ]

where the two matrices are the 3 × 3 Sobel kernels.
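As a sketch, the zero-padded Sobel gradients and the |Gx| + |Gy| combination can be computed in plain NumPy (using correlation without kernel flipping, as is conventional in image processing; the report's implementation uses cv2 instead):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)

def correlate3x3(img, kernel):
    # Zero-pad by one pixel so border pixels keep a full 3x3 neighborhood
    padded = np.pad(img.astype(float), 1)
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def gradient_image(gray):
    gx = correlate3x3(gray, SOBEL_X)  # horizontal gradient Gx
    gy = correlate3x3(gray, SOBEL_Y)  # vertical gradient Gy
    return np.abs(gx) + np.abs(gy)    # the |gx| + |gy| simplification of Eq. (2)
```

Note that on a constant region both kernels sum to zero, so the gradient image is zero there, matching the observation that small gradients indicate uniform local distributions.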
Further, the gradient values in the eight directions are sorted in ascending order, i.e., {gi} (i = 1, 2, …, 8), and the directions are numbered from 0 to 7.

LGNP(x, y) = 8 × d_ga + d_gb        (6)

where '+' implies concatenation, d_ga and d_gb are the direction numbers corresponding to ga and gb, and a and b are the sorted indexes (a, b ∈ [1, 8]). The final LGNP value, LGNP(x, y), is converted into the corresponding decimal number.
For coding in LGNP, gradient values in two directions are coded. The gray transitions reflected in the gradient values provide prominent detail features, such as the lip border, eye border, wrinkles and nose bridge. The directions that reflect gray transitions are selected in the LGNP coding scheme, since the detailed features play a major role in face recognition. Obviously, there are many ways of selecting the two gradient values. The gradient values {gi} (i = 1, 2, …, 8) are divided into two parts: the first part contains the bigger gradient values {g5, g6, g7, g8}, and the second part the smaller gradient values {g1, g2, g3, g4}. In general, smaller gradient values mean more prominent details of the edge. Based on this, the combination of the smallest value g5 in the first part and the smallest value g1 in the second part is selected, i.e., (a, b) = (5, 1), to form the final LGNP code.
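Under the convention above (directions numbered 0 to 7, gradients sorted in ascending order), the code for one pixel can be sketched as follows. This is a minimal NumPy sketch; the stable sort and the mapping of neighbours to direction numbers are assumptions, since the report does not fix them:

```python
import numpy as np

def lgnp_code(neighbourhood_gradients):
    # neighbourhood_gradients: the 8 gradient values around a pixel,
    # indexed by direction number 0..7
    order = np.argsort(neighbourhood_gradients, kind="stable")  # ascending g1..g8
    d_gb = int(order[0])    # direction of g1, smallest of the smaller part
    d_ga = int(order[4])    # direction of g5, smallest of the bigger part
    return 8 * d_ga + d_gb  # Eq. (6): concatenate two 3-bit direction codes

lgnp_code(np.array([3, 1, 4, 1, 5, 9, 2, 6]))  # 8*2 + 1 = 17
```

Because each direction number fits in 3 bits, the multiplication by 8 is exactly the concatenation described above, and every LGNP code lies in [0, 63].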
There has been tremendous development in the field of Artificial Intelligence (AI) aimed at reducing the gap between the abilities of man and machine. It finds a wide range of applications in the fields of electronics and mechanics, making amazing things happen. One of the domains of AI is computer vision.
Computer vision is capable of making machines view things as a human does, interpret the situation in the same manner, and use this knowledge to take decisions and initiate dependent tasks. Its applications include image recognition, image classification, media recreation, language processing, speech recognition, etc. The Convolutional Neural Network (CNN) is a deep learning algorithm developed as an advancement in computer vision.
The architecture of a ConvNet is inspired by the human visual cortex and resembles the connection pattern of neurons in the human brain. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. A collection of such receptive fields combines to cover the entire visual area.
By applying appropriate filters, a CNN is capable of successfully capturing the spatial and temporal dependencies in an image. The CNN algorithm achieves a better fit to the image dataset due to the reduction in the number of parameters involved and the reusability of weights; in other words, the network can be trained to understand the sophistication of the image better. A ConvNet reduces the input images to a form that is easier to process, by employing filters, without losing the features that are critical for making the decision. A Convolutional Neural Network is composed of two parts, a feature extractor and a classifier, as shown in figure 3.1. Feature extraction is achieved using a number of convolution and pooling layers; a pair of convolution and pooling layers forms a single layer. The number of such layers to be used depends on the types of features to be extracted.
Convolution layers contain a set of filters whose weights have to be learned. The dimensions of a filter are much smaller than those of the input image. These filters are called kernels. Each kernel is convolved with the input volume to compute an activation map made of neurons: the filter is slid across the width and height of the input, and the dot products between the input and the filter are computed at every spatial position. The output volume of the convolutional layer is obtained by stacking the activation maps of all the filters along the depth dimension. Since the width and height of each filter are designed to be smaller than the input, each neuron in the activation map is connected only to a small local region of the input volume.
Figure 3.2 depicts the convolution process. Assume an input image of dimension 5x5x1, as shown in figure 3.2.1, denoted by I and shown in green. The filter used for convolving the input image in the convolutional layer is called the kernel (or filter) and is denoted by K, depicted in yellow. An example of a 3x3x1 kernel matrix is

K = 0 0 0
    0 1 0
    0 0 0

With the stride value kept at 1, the kernel shifts 9 times, performing a dot product between the kernel and the portion of the input image over which it is hovering.
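The sliding dot product above can be sketched as a valid convolution with stride 1; the 5x5 input values used here are illustrative:

```python
import numpy as np

def conv2d_valid(img, kernel, stride=1):
    kh, kw = kernel.shape
    out_h = (img.shape[0] - kh) // stride + 1
    out_w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Dot product of the kernel with the patch it currently covers
            out[i, j] = np.sum(img[r:r + kh, c:c + kw] * kernel)
    return out

I = np.arange(25).reshape(5, 5)                  # a 5x5x1 input (example values)
K = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])  # the kernel K from the text
conv2d_valid(I, K).shape                         # (3, 3): the kernel visits 9 positions
```

With this particular K, each output value is simply the centre pixel of the covered patch, which makes it easy to verify the 9 kernel positions by hand.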
If the image has 3 channels, such as RGB, the convolution kernel has the same depth as the input image. Matrix multiplication is performed between each (Kn, In) pair in the stack ([K1, I1]; [K2, I2]; [K3, I3]), and then all the results are summed with the bias to give a squashed one-depth-channel convolved feature output, as shown in figure 3.3.
The convolution layer determines features such as edges, colors and shapes. The number of layers used in a CNN depends on the level of features to be extracted. The purpose of the first convolution layer is to capture low-level features such as edges, color and gradient orientation. With more layers, the CNN extracts high-level features, making the network capable of understanding the images in the dataset as humans would.
Like the convolutional layer, the pooling layer is responsible for reducing the spatial size of the convolved feature. This decreases the computational power required to process the data through dimensionality reduction. Furthermore, it is useful for extracting dominant features that are rotationally and positionally invariant, thus maintaining the effective training of the model.
Pooling can be performed in two ways: maximum pooling and average pooling. Maximum pooling returns the maximum of all the values from the portion of the input image covered by the kernel; figure 3.4 shows maximum pooling. Average pooling, on the other hand, returns the average of all the values from the portion of the image covered by the kernel.
Figure 3.5 shows the difference between maximum pooling and average pooling. Max pooling also acts as a noise suppressant: it discards the noisy activations altogether, performing de-noising along with dimensionality reduction. Average pooling, on the other hand, simply performs dimensionality reduction as its noise-suppressing mechanism. Thus max pooling performs a lot better than average pooling.
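The two pooling operations can be sketched as follows; a 2x2 window with stride 2 is assumed, and the 4x4 input values are illustrative:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            # Max pooling keeps the strongest activation; average pooling smooths
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1, 3, 2, 9],
              [5, 6, 1, 7],
              [4, 2, 8, 6],
              [3, 1, 5, 4]], dtype=float)
pool2d(x, mode="max")  # [[6., 9.], [4., 8.]]
pool2d(x, mode="avg")  # [[3.75, 4.75], [2.5, 5.75]]
```

Either way the 4x4 map is reduced to 2x2, which is the dimensionality reduction described above.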
The convolutional layer and the pooling layer together form the i-th layer of a Convolutional Neural Network. The number of such layers depends on the complexity of the images. The number of layers can be increased to capture low-level features in even more detail, but at the cost of more computational power.
The extracted features are converted into a one-dimensional vector by the flatten layer and fed to the neural network, which works as the classifier.
Figure 3.6: Fully Connected layer with 2 dense layers and 3 classes.
The flattened output is fed to a feed-forward neural network. Over a series of epochs, the model becomes able to distinguish between dominating and certain low-level features in images and to classify the images using the softmax/sigmoid activation function.
There are different types of CNNs which powers AI: LeNet, AlexNet, VGGNet and so on.
Weights and bias: Weight is the parameter within a neural network that transforms input data
within the network’s hidden layers. When the inputs are transmitted between neurons, the
weights are applied to the inputs and passed into an activation function along with the bias.
Weights and bias are both learnable parameters.
Stride: The stride size defines the number of steps the kernel moves while traversing the input image. It is one by default, i.e., if stride = 1, the kernel moves one step at a time over the image.
Padding: In order to maintain the same dimensionality between the input image and output
image, the image matrix is padded with zeros along the width and height of the image.
Sigmoid: The sigmoid function is similar to softmax; the only difference is that sigmoid is used only for binary classification.
Learning rate: The rate at which the weights are updated during training is referred to as the
step size or learning rate.
Adam Solver: Adam is an optimization algorithm used as a substitute for the classical stochastic gradient descent method for updating the weights of the network during the training process. The configuration parameters considered for the optimization are alpha, beta1, beta2 and epsilon. Alpha represents the learning rate, beta1 is the exponential decay rate for the first moment estimate, beta2 is the exponential decay rate for the second moment estimate, and epsilon is a very small number used to prevent division by zero. The learning rate decay of the optimizer is given by the ratio of alpha to the square root of the epoch t.
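One Adam update can be sketched as follows; the standard update rule with bias correction is assumed, since the report only names the parameters:

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponentially decayed average of gradients
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponentially decayed average of squared gradients
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)  # bias correction for the second moment
    # epsilon prevents division by zero in the weight update
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

On the very first step (t = 1) the bias-corrected moments equal the raw gradient statistics, so the weight moves by roughly alpha in the direction opposite the gradient.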
Rectified Linear Unit (ReLU): ReLU is an activation function used to overcome the vanishing gradient problem. The ReLU layer acts differently for positive and negative inputs: a positive input passes through the layer without any change, while a negative input gives zero as the output. ReLU has become the default activation function for almost all neural networks, as a model that uses it is easier to train and often achieves better performance. The ReLU activation function is shown in figure 3.7, where the X axis represents the input to the function and the Y axis its output. ReLU can be expressed as

f(x) = max(0, x)
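A one-line sketch of this behaviour:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): positive inputs pass through, negatives become zero
    return np.maximum(0, x)

relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0]))  # [0., 0., 0., 1.5, 3.]
```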
Flatten: Flatten is the function that converts the pooled feature map into a single column, which is passed to the fully connected layer. Dense adds fully connected layers to the neural network.
Softmax: The softmax function turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero or greater than one, but softmax transforms them into values between 0 and 1 so that they can be interpreted as probabilities. Softmax is used for multi-class classification.
Chapter 4: Implementation
4.1. Problem Statement
The main objective of this project is to develop a descriptor for face recognition:
i] Generate Local Gradient Number Pattern (LGNP) images from already existing datasets using the LGNP code based on LDP-based methods.
ii] Classify and recognize faces using a Convolutional Neural Network (CNN).
Figure 4.1: Block diagram Face Recognition based on LGNP and CNN
The block diagram shown in figure 4.1 consists of two parts: the Local Gradient Number Pattern (LGNP), which is used to extract local gradient information describing the distribution of gray values in the local neighborhood of face images, and the Convolutional Neural Network (CNN), which is used for the classification and recognition of faces.
LGNP is a face feature descriptor used to extract the gradient features in an image; it can be calculated with the help of the LGNP code. The input image is first converted into a grayscale image, and then kernel convolution is applied to this grayscale image using the Sobel operator kernels. A 3 x 3 kernel is used as an overlapping window so that all the pixel values in the image are covered as the kernel traverses the entire image. Zero-padded convolution is performed in order to retain the boundary elements. The gradient image in the horizontal direction (gradient X image) is obtained by convolving the X-direction Sobel kernel with the grayscale image, and the gradient image in the vertical direction (gradient Y image) by convolving the Y-direction Sobel kernel with the grayscale image. The final gradient image is obtained by combining the gradient X and gradient Y images, adding the corresponding absolute values of these images.
The LGNP descriptor is then used to obtain the LGNP image from the gradient image. LGNP values are calculated for every pixel in the gradient image. Let M represent a local 3 x 3 neighborhood centered on (p, q) in the gradient image; the gradient values of the eight directions are sorted in ascending order and the directions are numbered from 0 to 7. The respective direction numbers are assigned to the chosen two gradient values, and these direction numbers are used to generate the LGNP value. LGNP values are generated for every 3 x 3 overlapping window by traversing the entire gradient image, and thus the LGNP image is formed.
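The traversal just described can be sketched as follows; as before, the assignment of direction numbers to the eight neighbours is an assumption not fixed by the report:

```python
import numpy as np

# Clockwise neighbour offsets, taken here as directions 0..7 (an assumption)
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lgnp_image(grad):
    # Zero-pad so border pixels also have a full 3x3 neighbourhood
    padded = np.pad(grad.astype(float), 1)
    out = np.zeros(grad.shape, dtype=np.uint8)
    for p in range(grad.shape[0]):
        for q in range(grad.shape[1]):
            neigh = np.array([padded[p + 1 + dp, q + 1 + dq] for dp, dq in OFFSETS])
            order = np.argsort(neigh, kind="stable")  # ascending g1..g8
            out[p, q] = 8 * order[4] + order[0]       # code with (a, b) = (5, 1)
    return out
```

Every code is at most 8 × 7 + 7 = 63, so the LGNP image fits comfortably in an 8-bit image and can be fed to the CNN like any grayscale input.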
The generated LGNP images are then used for face recognition and classification, with a Convolutional Neural Network as the classifier. For training and testing purposes, the classifier needs training, validation and testing images; therefore, the LGNP images are divided into training, validation and testing sets. A labelled dataset is prepared from the LGNP images to train and to validate the Convolutional Neural Network for the classification of faces.
Face recognition and classification are implemented using a Convolutional Neural Network. The CNN is trained using the LGNP face images as its input, the classifier is generated, and the classification result is tested with the test dataset, i.e., test images in LGNP form are passed into the CNN classifier to predict the faces. The layers of the proposed CNN model are shown in figure 4.2.
The CNN consists of a number of sequential layers: i) the input layer, ii) the hidden layers and iii) the output layer. The hidden layers consist of one or more convolutional layers and pooling layers, depending on the level of features to be extracted; a convolution layer and a pooling layer together form a single layer. In the proposed model there are 3 hidden layers: two convolution + pooling layers and one dense layer. The model works in two phases: i) training and validation, and ii) testing.
The Convolutional Neural Network has to be trained with the training dataset from the database; here, the dataset is the generated LGNP images of the ORL dataset. In Keras, the CNN is created as a stack of layers added one by one. First, we add a convolution layer that performs the dot product of the input image with the kernels to learn the filters and optimize the filter weights. Then we add the ReLU (rectified linear unit) activation function, which helps the CNN learn non-linear decision boundaries; this is required because convolution is a linear operation on the spatial properties, which makes it difficult to distinguish colors, edges and borders. This is followed by a pooling layer that reduces the dimension of the features, depending on the stride size used, without eliminating the spatial features. Max pooling is found to be highly efficient and hence is employed. The proposed model has two layers of convolution + pooling. A dropout strategy is adopted to improve the classification effect.
A flatten layer is then used to convert the multidimensional features into one-dimensional features to ease the training of the fully connected network (FCN). The fully connected part contains two dense layers. In an FCN, each neuron in a layer is connected to all the neurons in the previous layer, and these connections have weights. The features extracted by the previous layers are applied to the FCN.
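The stack described above can be sketched in Keras as follows. The filter counts, kernel sizes, dense width and dropout rate are illustrative assumptions (the report specifies the layer types but not these values), as are the 112 x 92 ORL input size and the 40 output classes:

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(112, 92, 1), num_classes=40):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),    # convolution + ReLU
        layers.MaxPooling2D((2, 2)),                     # first conv + pool layer
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),                     # second conv + pool layer
        layers.Dropout(0.25),                            # dropout to curb overfitting
        layers.Flatten(),                                # to a one-dimensional vector
        layers.Dense(128, activation="relu"),            # first dense layer
        layers.Dense(num_classes, activation="softmax")  # class probabilities
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

`model.summary()` then prints the per-layer output dimensions and parameter counts of the kind shown in figure 4.3.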
The dimensions at the output of each layer and the parameters (weights and biases) of the CNN compiled in Python are shown in figure 4.3.
Testing
Once the CNN is trained and validated, we can test the working of the classifier with appropriate test images. In this project, the softmax activation function is used at the output layer, which provides labels in terms of probability. There is one output for each class; the CNN outputs a high probability if the input image belongs to the respective class, and the output probabilities of all classes sum to one.
The accuracy of the classification depends predominantly on the training of the neural network. Training a neural network model simply means learning (determining) good values for all the weights and biases from labeled examples (different classes). The number of times the CNN goes through the entire training and validation dataset is called the number of epochs; the sequence in which the CNN goes through the training dataset differs in each epoch. In each epoch, the weights of the fully connected network (FCN) of the CNN are optimized to yield accurate classification; that is, the larger the number of epochs, the higher the accuracy. In practice this number will necessarily be large for implementing a highly efficient neural network.
The accuracy metric is used to measure the algorithm's performance in an interpretable way.
How accurately the model classifies the training dataset and the validation dataset is called the
training accuracy and the validation accuracy respectively. For training, a hold-out validation
technique is used: the data is split into two parts, the training set and the validation set. The
training dataset is used to train the model, while the validation dataset is used only to evaluate
the model's performance.
Metrics on the training set show how the model is progressing in terms of its training, but it is
the metrics on the validation set that give a measure of the quality of the model: how well it is
able to make new predictions on data it has not seen before.
A loss function quantifies how good or bad a given predictor is at classifying the input data
points in a dataset: the smaller the loss, the better a job the classifier does at modeling the
relationship between the input data and the output targets.
The error made on the training dataset during training is called the training loss, whereas the
validation loss is the error obtained by running the validation set through the already trained
network. These losses are not percentages; each is the sum of the errors that the neural network
makes over the samples in the training and validation datasets respectively.
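The categorical cross-entropy loss used later in the project can be illustrated with a small NumPy sketch; the label and prediction values here are made up:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Per-sample loss for one-hot labels y_true and predicted probabilities y_pred."""
    y_pred = np.clip(y_pred, eps, 1.0)         # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=1)

# Two illustrative samples from a 3-class problem (values are made up).
y_true = np.array([[0, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[0.1, 0.8, 0.1],    # confident and correct -> small loss
                   [0.3, 0.4, 0.3]])   # unsure -> larger loss
per_sample = categorical_cross_entropy(y_true, y_pred)
print(per_sample)            # loss for each sample
print(per_sample.sum())      # summed loss over this small "dataset"
```

Summing the per-sample losses over a dataset gives the (unnormalized) training or validation loss described above.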
Chapter 5: Results
The Computer Vision module (cv2) is used to generate the LGNP images. The results of
the Local Gradient Number Pattern (LGNP) model and the Convolutional Neural Network
(CNN) model are shown below. The CNN makes use of Keras, an open-source neural-network
library, in the Python code to train the model. Keras runs on top of the numerical library
TensorFlow and is intended to make training neural networks faster and easier.
The ORL database is composed of 400 face images of 40 different volunteers; there are 10
different images of each of the 40 distinct subjects. For some of the subjects, the images were
taken at different times, with varying lighting, facial expressions (open/closed eyes,
smiling/non-smiling) and facial details (glasses/no-glasses). All the images are taken against a
dark homogeneous background and the subjects are in up-right, frontal position (with tolerance
for some side movement). The files are in PGM format and can be conveniently viewed using
the 'xv' program. The size of each image is 92x112, 8-bit grey levels. The images are organised
in 40 directories (one for each subject) named as: sX , where X indicates the subject number
(between 1 and 40). In each directory there are 10 different images of the selected subject named
as: Y.pgm where Y indicates which image for the specific subject (between 1 and 10).
The features of the dataset are as given in the table 5.1 below. Sample images of a
volunteer from the dataset are as shown in the figure 5.1.
The original image is first converted into a grayscale image; the obtained grayscale image
is shown in figure 5.2.
The grayscale image is padded with zeros to retain the boundary elements. A 3 x 3 window
then traverses the entire grayscale image. A sample 3 x 3 window and the corresponding
calculations are shown below.
        [50  48  44]
    N = [44  50  43]
        [43  53  41]
The gradient images in the horizontal and vertical directions are calculated using the Sobel
operator masks.
         [-1   0   1]         [-146   13    9]
    Gx = [-2   0   2] * N  =  [-200   10   11]
         [-1   0   1]         [-205   12    8]
The same method is applied in an overlapping manner so that all the pixels are converted, and
the corresponding gradient-X image is obtained as shown in figure 5.3.
         [ 1   2   1]         [ ...  ...  ...]
    Gy = [ 0   0   0] * N  =  [ -10   -2    2]
         [-1  -2  -1]         [   7    2   -2]
The same method is applied in an overlapping manner so that all the pixels are converted, and
the corresponding gradient-Y image is obtained as shown in figure 5.4.
G = |Gx| + |Gy|, taking absolute values element-wise:

        [146  13   9]   [...  ...  ...]   [...  ...  ...]
    G = [200  10  11] + [ 10    2    2] = [210   12   13]
        [205  12   8]   [  7    2    2]   [212   14   10]

The same method is applied in an overlapping manner so that all the pixels are converted, and
the corresponding final gradient image is obtained as shown in figure 5.5.
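The padded Sobel gradient computation can be sketched in NumPy as below. This sketch correlates the masks directly (no kernel flip) over a zero-padded image, so the border values it produces may differ from the worked example above depending on the convolution convention used there.

```python
import numpy as np

def sobel_total_gradient(gray):
    """|Gx| + |Gy| over a zero-padded image, using the 3x3 Sobel masks."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])   # horizontal mask
    ky = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])   # vertical mask
    padded = np.pad(gray.astype(int), 1)   # zero padding retains border pixels
    h, w = gray.shape
    gx = np.zeros((h, w), dtype=int)
    gy = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 3, j:j + 3]   # overlapping 3x3 windows
            gx[i, j] = np.sum(kx * window)
            gy[i, j] = np.sum(ky * window)
    return np.abs(gx) + np.abs(gy)   # total gradient image G

N = np.array([[50, 48, 44],
              [44, 50, 43],
              [43, 53, 41]])
result = sobel_total_gradient(N)
print(result)
```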
The window N, its total-gradient window G, and the gradient values in the eight directions
(calculated by Eq. (4) and Eq. (5)) are:

    N = [50  48  44]    G = [...  ...  ...]    Eight directions: 0, 1, 26, 7, 3, ...
        [44  50  43]        [210   12   13]
        [43  53  41]        [212   14   10]
The gradient values in the eight directions are sorted in ascending order, i.e. {g_i}
(i = 1, 2, ..., 8), and numbered from 0 to 7. An example of computing LGNP is shown in
figure 5.6.

    Directions: 7  4  6  2  1  3  5  0
The final LGNP value obtained, i.e. LGNP(x, y), is converted into the corresponding decimal
number. The centre pixel value in the 3 x 3 total-gradient window is then replaced by the
calculated LGNP value:
    [...  ...  ...]      [...  ...  ...]
    [210   12   13]  ->  [210   15   13]
    [212   14   10]      [212   14   10]
The same method is applied in an overlapping manner over the complete gradient image so that
an LGNP value is calculated for every pixel, thus generating the LGNP images. The LGNP
image obtained is shown in figure 5.7.
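The sorting-and-numbering step can be sketched as follows. The bottom row of the sample window here is illustrative, and the final encoding of the ranks into a single LGNP code (Eq. (4) and Eq. (5)) is not reproduced in this sketch.

```python
import numpy as np

def direction_ranks(g_window):
    """Rank the 8 neighbouring gradient values of a 3x3 window.

    The neighbours are sorted in ascending order and numbered 0..7, as in the
    LGNP description above; turning these ranks into the final LGNP code
    follows Eq. (4) and Eq. (5) of the report and is not reproduced here.
    """
    g = np.asarray(g_window)
    neighbours = np.delete(g.flatten(), 4)         # drop the centre pixel
    order = np.argsort(neighbours, kind='stable')  # indices, smallest first
    ranks = np.empty(8, dtype=int)
    ranks[order] = np.arange(8)                    # rank 0 = smallest value
    return ranks

window = np.array([[210, 12, 13],
                   [212, 14, 10],
                   [205, 11, 16]])   # bottom row is an illustrative assumption
ranks = direction_ranks(window)
print(ranks)   # rank of each of the eight neighbours
```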
The features used here are edge features, line features and centre-surround features. The image
properties of the LGNP image are shown in table 2 below.
The face classifier model is trained with the LGNP images generated for the ORL database. The
set used here contains 9 volunteers, with 10 images per volunteer. From each volunteer, 6
images are selected for training, 2 for validation and 2 for testing. There are therefore
54 images for training, 18 for validation and 18 for testing.
From Keras, ModelCheckpoint is imported to save the model by monitoring a particular model
parameter, and EarlyStopping to stop the training early when the monitored parameter stops
improving. Objects of these two classes are constructed and passed to fit_generator as callback
functions. In this analysis, validation accuracy is monitored by passing val_acc to
ModelCheckpoint: the model is saved only when the validation accuracy in the present epoch is
greater than in the previous epoch. The same quantity, val_acc, is also passed to EarlyStopping.
The patience parameter is set to 3, which means the model will cease to train if the validation
accuracy does not increase for 3 epochs. The model.fit_generator function is used to train the
model since ImageDataGenerator is used for passing data to it. The training and validation data
are passed to fit_generator; steps_per_epoch sets the number of batches for the training data,
and validation_steps does the same for the validation data.
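A minimal sketch of this callback setup is shown below; the saved file name matches the training log further down, while the generator objects in the commented-out training call are placeholders for the actual ImageDataGenerator flows used in the project.

```python
# Sketch of the described callback setup. 't1.h5' matches the training log;
# train_gen and val_gen are placeholders for the project's data generators.
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

checkpoint = ModelCheckpoint('t1.h5', monitor='val_accuracy',
                             save_best_only=True, verbose=1)   # save only improving epochs
early_stop = EarlyStopping(monitor='val_accuracy', patience=3,
                           verbose=1)                          # stop after 3 stagnant epochs

# model.fit_generator(train_gen, steps_per_epoch=len(train_gen),
#                     validation_data=val_gen, validation_steps=len(val_gen),
#                     epochs=20, callbacks=[checkpoint, early_stop])
```

Note that recent Keras versions report the monitored quantity as val_accuracy (as in the log below) rather than val_acc.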
On executing the code the model starts to train, and the training and validation accuracy and
loss are obtained. The graph of accuracy, validation accuracy, loss and validation loss versus
the number of epochs is plotted as shown in figure 5.9.
Epoch 00001: val_accuracy improved from -inf to 0.27778, saving model to t1.h5
Epoch 2/20
Epoch 00002: val_accuracy improved from 0.27778 to 0.44444, saving model to t1.h5
Epoch 3/20
Epoch 00003: val_accuracy improved from 0.44444 to 0.66667, saving model to t1.h5
Epoch 4/20
Epoch 5/20
Figure 5.9: Graph of accuracy, validation accuracy, loss and validation loss vs number of
epochs
Testing part:
The best trained model is saved in a .h5 file after training and validation. In the testing part, the
unlabelled images stored in the testing folder are loaded into the CNN classifier, which holds
the best trained model, and these images are converted to NumPy arrays. Each test image
passed through the CNN classifier yields a vector of length 9, each element being a probability
value for one class; these probability values show how well the test image's features match the
features of the 9 classes in the trained model. The CNN outputs a high probability if the input
test image belongs to the respective class. The classifier also outputs the class label
corresponding to the highest probability value. The process is repeated for all the test images in
the test folder, and their corresponding probability values and labels are shown below in
figure 5.10.
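The labelling step, picking the class with the highest probability, can be sketched with a made-up probability vector (the values below are illustrative, not taken from figure 5.10):

```python
import numpy as np

# Hypothetical class labels and an illustrative 9-element probability vector
# of the kind a softmax output layer produces (values are made up).
labels = ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9']
probs = np.array([0.01, 0.02, 0.01, 0.02, 0.85, 0.04, 0.02, 0.02, 0.01])

predicted = labels[int(np.argmax(probs))]   # class with the highest probability
print(predicted)
# In the project this vector comes from the saved model for each test image:
#   probs = model.predict(x)[0]   # model = load_model('t1.h5')
```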
s4
[0.00399778550490737, 0.007950114086270332, 0.006784763652831316,
0.0018150405958294868, 0.6553345918655396, 0.07012850791215897,
0.012728000059723854, 0.00039862864650785923,
0.24086247384548187]
s5
[0.00332050072029233, 0.008696336299180984, 0.011834756471216679,
0.004824170842766762, 0.6621164083480835, 0.056516166776418686,
0.03386259824037552, 0.0008811790030449629, 0.21794790029525757]
s5
[2.9639310014317743e-05, 0.00034859354491345584, 0.0008452803594991565,
0.000176896239281632, 0.00047038032789714634, 0.9925413131713867,
0.0038705384358763695, 0.00032607425237074494, 0.0013913377188146114]
s6
[0.0008361845975741744, 0.01328081451356411, 0.015331235714256763,
0.0006310901953838766, 0.013530968688428402, 0.8209946155548096,
0.0946493148803711, 0.003510827897116542, 0.03723505139350891]
s6
[8.051218901528046e-05, 0.0012897439301013947, 0.0008837956702336669,
0.0004185155557934195, 0.00013006202061660588, 0.0007197260856628418,
0.9896787405014038, 0.0057305688969790936, 0.0010682501597329974]
s7
[1.311535743298009e-05, 0.003005495760589838, 0.0011181056033819914,
0.0021446323953568935, 7.558979268651456e-05, 0.0007242822903208435,
0.9921547770500183, 0.00011474672646727413, 0.0006493180408142507]
s7
[0.0071134306490421295, 0.0025696419179439545, 0.009996820241212845,
0.00970382895320654, 5.1145139877917245e-05, 0.008949404582381248,
0.01436237245798111, 0.9435909986495972, 0.0036624432541429996]
s8
[0.0024990406818687916, 0.0009316701325587928, 0.018764814361929893,
0.013309920206665993, 9.139371104538441e-05, 0.010013804771006107,
0.034046001732349396, 0.9160909056663513,
0.004252416081726551]
s8
Figure 5.10: Output of CNN classifier (probability values and predicted labels of 18 test
images)
The accuracy obtained is the fraction of images predicted correctly. To test the classifier,
2 samples from each person are taken for testing. Since there are 9 volunteers, there are 18 test
images in the testing folder. The CNN classifier outputs the predicted label for every test
image. Here, the classifier has predicted 17 labels correctly out of 18 test images. Hence, the
accuracy of the classifier is calculated as 17/18 x 100% = 94.44%.
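The calculation is simply the fraction of correct predictions:

```python
# Accuracy as the fraction of correctly predicted test images, using the
# figures quoted above (17 correct out of 18).
correct, total = 17, 18
accuracy = correct / total * 100
print(round(accuracy, 2))   # -> 94.44
```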
The classifier also outputs the test images along with their corresponding predicted labels and
actual labels, which are shown in figure 5.11.
In this project the LGNP descriptor is proposed for image preprocessing and is used to prepare
the processed data of the dataset. The proposed scheme encodes the gradient values instead of
the edge responses to achieve robustness to grey-level and noise changes. The Sobel operators
are used instead of the Kirsch templates to reduce the computational complexity. LGNP images
are generated for the dataset to be classified. A CNN-based deep learning model is proposed to
perform face recognition and classification. Since the generated LGNP images are the input to
the CNN, what the CNN learns is the edge knowledge of the processed image. This processed
knowledge is easier for the CNN to learn and understand, and the face recognition is better;
therefore, the processed images are chosen when training the CNN. The CNNs are developed in
Python using the deep learning library Keras, which works on the numerical library
TensorFlow. Both CNNs are composed of three hidden layers: two convolution+pooling layers
and one dense layer. The output dense layer uses the softmax activation function with the
categorical_crossentropy loss function and provides the classification in terms of a
one-dimensional array with as many elements as there are classes. Since the number of images
in the face database is limited, random flips and random shifts are used for data augmentation.
The order of the training and test sets is shuffled to avoid over-fitting. To improve the
classification effect, the rectified linear unit (ReLU) and a dropout strategy are used, and batch
normalization (BN) is used to accelerate convergence. The model classifies the faces with a
maximum accuracy of 94.44% using the Convolutional Neural Network classifier.
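The random-flip and random-shift augmentation mentioned above can be sketched with Keras' ImageDataGenerator; the shift fractions here are illustrative assumptions, and a synthetic array stands in for an LGNP image.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Sketch of the random-flip / random-shift augmentation; the 10% shift
# fractions are illustrative assumptions, not the project's exact values.
datagen = ImageDataGenerator(horizontal_flip=True,   # random horizontal flips
                             width_shift_range=0.1,  # random shifts up to 10%
                             height_shift_range=0.1,
                             rescale=1.0 / 255)      # scale pixels into [0, 1]

# One synthetic 112x92 single-channel "image" standing in for an LGNP image.
x = np.random.randint(0, 256, (1, 112, 92, 1)).astype('float32')
batch = next(datagen.flow(x, batch_size=1))   # one randomly augmented image
print(batch.shape)
```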
LGNP may produce the same code for neighbourhoods with entirely different visual
perception, as LBP- and LDP-like approaches do. To overcome this, a simple convex-concave
partition (CCP) strategy called Fuzzy CCP can be used to enhance the LGNP descriptor.
Different sets of images can be used for the training, validation and testing parts to obtain
accurate transfer learning in the CNN model.
Further, FCCP_LGNP can be combined with deep learning technology to enhance its accuracy
and expand its scope of application.