Face Recognition With PCA+MLP MATLAB Code
BELGAUM-590014

Project Report on
Face Recognition Techniques using Neural Networks

Namit Chauhan
1PI05EC057
CERTIFICATE
Certified that the project work entitled "Face Recognition Techniques using Neural Networks" is a
bona fide work carried out by Prashanth H and Namit Chauhan, bearing USNs 1PI05EC072 and
1PI05EC043, in partial fulfilment for the award of the degree of BACHELOR OF ENGINEERING in
ELECTRONICS & COMMUNICATION of Visvesvaraya Technological University, during the
academic year 2008-2009. It is certified that all corrections/suggestions indicated for internal assessment
have been incorporated in the report. The project report has been approved as it satisfies the academic
requirements with respect to project work prescribed for the mentioned degree.
Dr Koshy George
PES Centre for Intelligent Systems;
Dr A. Vijaykumar
HOD,
Dept. of E&C Engg., PESIT
Dr K. N. B. Murthy
Principal
PESIT
Prashanth
Namit Chauhan
1PI05EC072
1PI05EC057
External Viva
1._____________________
_____________________
2. _____________________
_____________________
ACKNOWLEDGEMENTS
We would like to express our sincere thanks to Dr. Koshy George, Professor, TE
Department, PESIT, Bangalore, for his kind and constant support and guidance during the
course of this project, without which the project would not have been a success.
We would like to thank and extend our heart-felt gratitude to Dr. A Vijaya Kumar,
HOD of the Department of Electronics and Communication, for his kind, inspiring and
illustrious guidance and ample encouragement.
We would like to thank Prof. D Jawahar, CEO, PES Group of Institutions, and
Dr. K N Balasubramanya Murthy, Director & Principal, PESIT, for the valuable
resources provided for the completion of the project.
We would like to express our sincere thanks and deep sense of gratitude to our parents
for their everlasting support.
Finally, it gives us immense pleasure to thank our friends and everyone who has
been instrumental in the successful completion of the project.
Table of Contents:
1.1 Classification
1.1.2 Adopted approach
1.2 Literature survey
1.2.1 Existing algorithms
1.2.2 Existing techniques
2.1 Multilayer Perceptrons
2.2 The Back-Propagation Algorithm
2.3 Activation function
2.4 Rate of learning
2.5 Modes of Training
2.9 Conclusions
3.4 Conclusions
4.1 Basics
4.6.2 Mercer's Theorem
4.10 Conclusions
5 Image Preprocessing
5.1 Histogram Equalization
5.2 Median Filtering
5.3 Bi-Cubic Interpolation
5.4 Conclusions
6 Results
6.1 MLP
6.2 PCA
6.3 SVMs
6.4 PCA+MLP
References
Appendix
List of Figures
Figure 3.1: (a) The 1st PC z1 is a minimum distance fit to a line in X space. (b) The 2nd
PC z2 is a minimum distance fit to a line in the plane perpendicular to the 1st PC.
Figure 4.7: Problem encountered when one SVM is used for multi-classification
Figure 6.1: Face recognition accuracy with multilayer feedforward networks with
histogram equalization and median filtering.
Figure 6.2: Face recognition accuracy with multilayer feedforward networks with
only median filtering.
Figure 6.3: Face recognition accuracy with multilayer feedforward networks with no
pre-processing
Figure 6.4: Face recognition accuracy with PCA with histogram equalization and
median filtering.
Figure 6.5: Face recognition accuracy with PCA with only median filtering.
Figure 6.7: Face recognition accuracy with PCA+MLP with histogram equalization
and median filtering.
Figure 6.8: Face recognition accuracy with PCA+MLP with only median filtering.
CHAPTER I
Human beings have recognition capabilities that are unparalleled in the modern
computing era. These are mainly due to the high degree of interconnectivity, adaptive nature,
learning skills and generalization capabilities of the nervous system. The human brain has
numerous highly interconnected biological neurons which, on some specific tasks, can
outperform super computers. A child can accurately identify a face, but for a computer it is a
cumbersome task. Therefore, the main idea is to engineer a system which can emulate what a
child can do. Advancements in computing capability over the past few decades have enabled
comparable recognition capabilities from such engineered systems quite successfully. Early
face recognition algorithms used simple geometric models, but the recognition
process has since matured into a science of sophisticated mathematical representations and
matching processes. Major advancements and initiatives have propelled face recognition
technology into the spotlight.
In geometric or feature-based methods, facial features such as eyes, nose, mouth, and chin
are detected. Properties and relations such as areas, distances, and angles between the
features are used as descriptors of faces. Although this class of methods is economical and
efficient in achieving data reduction and is insensitive to variations in illumination and
viewpoint, it relies heavily on the extraction and measurement of facial features.
Unfortunately, feature extraction and measurement techniques and algorithms developed to
date have not been reliable enough to cater to this need.
1.1 Classification
Face recognition is the ability of a machine to successfully categorize a set
of images based on certain discriminatory features. Classification or pattern recognition can
be a very difficult problem and is still a very active field of research. The aim of the current
project can be considered as an attempt to simulate the recognition capability of the human
brain. A human being has the ability to put a certain scenario in a context and identify its
components. Of course it is not entirely obvious how to make the machine discriminate
between elements of different objects of different classes. The classification task may be
mathematically represented as the mapping:
f : A x B → C
(1.1)
where A represents the elements that have to be classified, i.e. the pixel intensity vectors,
using B as the function parameters. The output C has to discriminate between images of
different classes so that the class of each element can be determined. Classification by
machines requires two steps: First, the properties which distinguish an element of one class
from that of another class have to be identified. Secondly, the machine has to be trained to
know how to discriminate between the classes by defining a learning model. The learning
model describes the procedure that has to be used for the actual training. Basically there are
two types of training often referred to as supervised and unsupervised learning.
Supervised Learning: When dealing with supervised learning, the trainer feeds the
machine a sample and the machine classifies it. The trainer then tells the machine
whether it has classified the sample correctly or not. In the case that it has
misclassified the sample, the machine has to adjust its classifier parameters to better
classify the given sample. Supervised learning is illustrated in Fig. 1.2. Examples of
these include Bayes Classifier and Neural Networks. Another important aspect
involved is the method of training imparted to these learning machines. It is important
to have a training set which is representative for the given classes so that future
classification can be successful.
It is to be noted that over-training leads to poor generalization. For example, an over-trained machine would return different classes for the same image subjected to different
modifications. As illustrated in Fig. 1.4, over-training is essentially memorization of the data,
and leads to poor generalization.
The intrinsic 2D structure of an image matrix is more often than not removed.
Consequently, the spatial information stored therein is discarded and not effectively
utilized.
Curse of Dimensionality: Each image sample is typically modeled as a point in a high-dimensional space. Consequently, a large number of training samples are often
needed to obtain a reliable and robust estimate of the characteristics of the data
distribution.
Usually very limited amounts of data are available in real applications such as face
recognition, image retrieval, and image classification.
A number of ways have been proposed to solve these problems. Finding an effective means
to reduce the dimensionality is the first step in face recognition.
Finding a face within a large database of faces: in this approach the system
returns a list of possible matches from the database. Useful applications include
crowd surveillance, video content indexing, personal identification (for example, driver's
licenses), mug-shot matching, etc.
Real-time face recognition: Here, face recognition is used to identify a person on the
spot and grant access to a building or a compound, thus avoiding security hassles. In
this case the face is compared against multiple training samples of a person.
Elastic Bunch Graph Matching: All human faces share a similar topological structure.
Faces are represented as graphs, with nodes positioned at fiducial points (eyes, nose, etc.)
and edges labeled with 2-D distance vectors. Each node contains a set of 40 complex Gabor
wavelet coefficients at different scales and orientations (phase, amplitude). They are called
"jets". Recognition is based on labeled graphs. A labeled graph is a set of nodes connected by
edges; each node is labeled as jets, edges are labeled as distances [20].
Kernel Methods: The face manifold in subspace need not be linear. Kernel methods are a
generalization of linear methods. Direct non-linear manifold schemes are explored to learn
this non-linear manifold [4],[5],[10],[11].
Trace Transform: The Trace transform, a generalization of the Radon transform, is a new
tool for image processing which can be used for recognizing objects under transformations,
e.g. rotation, translation and scaling. To produce the Trace transform one computes a
functional along tracing lines of an image. Different Trace transforms can be produced from
an image using different trace functionals [21],[22].
Range Imaging: Range imaging is the name for a collection of techniques, used to produce
a 2D image showing the distance to a set of points in a scene from a specific point, normally
associated with some type of sensor device. The resulting image, the range image, has pixel
values which correspond to the distance, e.g., brighter values mean shorter distance, or vice
versa. If the sensor which is used to produce the range image is properly calibrated, the pixel
values can be given directly in physical units such as centimeters [23].
Line edge map: An image-based face recognition algorithm that uses a set of random rectilinear
line segments of 2D face image views as the underlying image representation, together with a
nearest-neighbor classifier as the line-matching scheme. The combination of 1D line
segments exploits the inherent coherence in one or more 2D face image views in the viewing
sphere [24].
Neural Network based Face Recognition Techniques: Neural networks are used to create
the face database and recognize the face. A separate network for each person is built. The
input face is projected onto the eigenface space first to get a new descriptor. This descriptor
is used as network input and applied to each person's network. The one with maximum
output is selected and reported as the host if it is larger than a predefined recognition
threshold [3],[1],[13],[12].
Gabor wavelet networks (GWN): The Gabor wavelet network is used for an effective
object representation. The Gabor wavelet network has several advantages such as invariance
to some degree with respect to translation, rotation and dilation. Furthermore, it has the
ability to generalize and to abstract from the training data and to assure, for a given network.
Sparse Representation: The input image is decomposed using L1-minimization, and the resulting
sparse representation is compared with the sparse representations of the training data. Many other
techniques have been tried for face recognition, but among them the eigenface technique has shown
the fastest and most accurate results. In recent times, researchers have also begun to focus on the
Human Vision System (HVS) for face recognition; it has not yet been applied in practical face
recognition systems, but it has laid the foundation for future studies [25].
CHAPTER II
Recurrent Networks
These structures are shown in Fig. 2.2, where each node represents a mathematical model
of a neuron.
In its most general form, a neural network is a machine that is designed to model the
way in which the brain performs a particular task or function of interest. It resembles the
brain in the following two respects:
1. Knowledge is acquired by the network from its environment through a learning
process.
2. Interneuron connection strengths, known as synaptic weights, are used to store the
acquired knowledge.
The procedure used to perform the learning process is called a learning algorithm, the
function of which is to modify the synaptic weights of the network in an orderly fashion to
attain a desired design objective. The modification of synaptic weights provides the
traditional basis for the design of neural networks. Further, it is also possible for a neural
network to modify its own topology, which is motivated by the fact that neurons in the
human brain can die and that new synaptic connections can grow.
In the case of supervised learning, this output signal is subtracted from the desired
response, to yield the error signal. In the classification problem, the desired response
corresponds to the classes of images. For this specific problem, the number of output neurons
equals the number of classes of images. The backward propagation of the estimate of the
error gradient vector, that is, the gradients of the error surface with respect to the weights
connected to the inputs of a neuron, forms the core of the supervised learning process. In the
training phase, the optimum weights of the hidden and output layers are obtained, which
minimize the error estimate for desired results. The optimum weights are learnt by
propagating the error signal backwards, against the synaptic connections. This supervised
learning algorithm is known as error back propagation algorithm, which is described in detail
in Section 2.2.
The error signal at the output of neuron j at iteration n is defined by:
ej(n) = dj(n) − yj(n)    (2.1)
The instantaneous value of the error energy for neuron j is defined as 0.5 ej²(n).
Correspondingly, the instantaneous value E(n) of the total error energy is
obtained by summing 0.5 ej²(n) over all neurons in the output layer. Thus:
E(n) = 0.5 Σ_{j∈C} ej²(n)    (2.2)
where the set C includes all the neurons in the output layer of the network. Let N denote the
total number of patterns contained in the training set. The average squared error energy is
obtained by summing E(n) over all n and then normalizing with respect to the set size N, as
shown by:
Eav = (1/N) Σ_{n=1}^{N} E(n)    (2.3)
The instantaneous error energy E(n), and therefore the average error energy Eav, is a function
of all the free parameters (i.e. synaptic weights and bias levels) of the network. For a given
training set, Eav represents the cost function as a measure of learning performance. The
objective of the learning process is to adjust the free parameters of the network to minimize
Eav. A simple method of training is considered in which the weights are updated on a pattern-by-pattern basis until one epoch, that is, one complete presentation of the entire training set, has
been dealt with. The adjustments to the weights are made in accordance with the respective
errors computed for each pattern presented to the network.
The arithmetic average of these individual weight changes over the training set is
therefore an estimate of the true change that would result from modifying the weights
based on minimizing the cost function Eav over the entire training set. Consider Fig. 2.5
which depicts neuron j being fed by a set of function signals produced by a layer of neurons
to its left. The induced local field vj(n) produced at the input of the activation function
associated with neuron j is therefore:
vj(n) = Σ_{i=0}^{m} wji(n) yi(n)    (2.4)
and m is the total number of inputs (excluding the bias) applied to neuron j. The synaptic
weight wj0 (corresponding to the fixed input y0=+1) equals the bias bj applied to neuron j.
Hence the function signal yj(n) appearing at the output of neuron j at iteration n is:
yj(n) = φj(vj(n))
(2.5)
The back-propagation algorithm applies a correction Δwji(n) to the synaptic weight wji(n),
which is proportional to the partial derivative ∂E(n)/∂wji(n). According to the chain rule of
calculus, one can express this gradient as:
∂E(n)/∂wji(n) = [∂E(n)/∂ej(n)] [∂ej(n)/∂yj(n)] [∂yj(n)/∂vj(n)] [∂vj(n)/∂wji(n)]    (2.6)
The derivative ∂E(n)/∂wji(n) represents the sensitivity factor, determining the direction of
search in weight space for the synaptic weight wji. Differentiating equation (2.2) with respect to ej(n),
equation (2.1) with respect to yj(n), equation (2.5) with respect to vj(n), and equation (2.4) with
respect to wji(n), one respectively gets:
∂E(n)/∂ej(n) = ej(n)    (2.7)
∂ej(n)/∂yj(n) = −1    (2.8)
∂yj(n)/∂vj(n) = φ′j(vj(n))    (2.9)
∂vj(n)/∂wji(n) = yi(n)    (2.10)
Using equations (2.7) to (2.10) in (2.6) gives:
∂E(n)/∂wji(n) = −ej(n) φ′j(vj(n)) yi(n)    (2.11)
The correction Δwji(n) applied to wji(n) is defined by the delta rule:
Δwji(n) = −η ∂E(n)/∂wji(n)    (2.12)
where η is the learning-rate parameter of the back-propagation algorithm. The use of the
minus sign accounts for the gradient descent in weight space. Accordingly, the use of (2.11)
in (2.12) yields:
Δwji(n) = η δj(n) yi(n)    (2.13)
where the local gradient δj(n) is defined by:
δj(n) = ej(n) φ′j(vj(n))    (2.14)
The local gradient points to the required changes in the synaptic weights. According to (2.14), the
local gradient δj(n) for output neuron j is equal to the product of the corresponding error
signal ej(n) for that neuron and the derivative φ′j(vj(n)) of the associated activation function.
Case 1: Neuron j is an output node
When neuron j is located in the output layer of the network, it is supplied with a
desired response of its own. Equation (2.1) is used to compute the error signal ej(n)
associated with this neuron.
Case 2: Neuron j is a hidden node
When neuron j is located in a hidden layer of the network, there is no specified desired
response for that neuron; its local gradient is defined in terms of the error signals of all the
neurons to which it is directly connected:
δj(n) = −[∂E(n)/∂yj(n)] [∂yj(n)/∂vj(n)]
      = −[∂E(n)/∂yj(n)] φ′j(vj(n))    (2.15)
where in the second line equation (2.9) is used. To calculate the partial derivative
∂E(n)/∂yj(n), one proceeds as follows. From equation (2.2):
E(n) = 0.5 Σ_{k∈C} ek²(n)    (2.16)
Differentiating with respect to yj(n):
∂E(n)/∂yj(n) = Σ_k ek(n) [∂ek(n)/∂yj(n)]    (2.17)
Using the chain rule for the partial derivative ∂ek(n)/∂yj(n), and rewriting in the equivalent
form, the following is obtained:
∂E(n)/∂yj(n) = Σ_k ek(n) [∂ek(n)/∂vk(n)] [∂vk(n)/∂yj(n)]    (2.18)
However,
ek(n) = dk(n) − yk(n)
      = dk(n) − φk(vk(n)),  neuron k is an output node    (2.19)
Hence:
∂ek(n)/∂vk(n) = −φ′k(vk(n))
(2.20)
The induced local field for neuron k is:
vk(n) = Σ_{j=0}^{m} wkj(n) yj(n)    (2.21)
Figure 2.6: Signal-flow graph highlighting the details of output neuron k connected to hidden neuron j
where m is the total number of inputs (excluding the bias) applied to neuron k. Here again,
the synaptic weight wko(n) is equal to the bias bk(n) applied to neuron k, and the
corresponding input is fixed at the value +1. Differentiating equation (2.21) with respect to
yj(n) yields:
∂vk(n)/∂yj(n) = wkj(n)
(2.22)
By using equation (2.20) and equation (2.22) in equation (2.18), the desired partial derivative
is obtained:
∂E(n)/∂yj(n) = −Σ_k ek(n) φ′k(vk(n)) wkj(n)
             = −Σ_k δk(n) wkj(n)
(2.23)
where in the second line the definition of the local gradient δk(n) given in equation (2.14),
with the index k substituted for j, is used. Finally, the back-propagation formula for the local
gradient δj(n) is obtained:
δj(n) = φ′j(vj(n)) Σ_k δk(n) wkj(n)    (2.24)
The factor φ′j(vj(n)) involved in the computation of the local gradient δj(n) in (2.24) depends
solely on the activation function associated with hidden neuron j. The remaining factor
involved in this computation, namely the summation over k, depends on two sets of terms.
The first set of terms, the δk(n), requires knowledge of the error signals ek(n), for all neurons
that lie in the layer to the immediate right of hidden neuron j, and that are directly connected
to neuron j. The second set of terms, the wkj(n), consists of synaptic weights associated with
these connections. Figure 2.7 summarizes back-propagation learning.
Figure 2.7: Signal-flow graphical summary of back-propagation learning. Top part of the graph: forward
pass. Bottom part of the graph: backward pass.
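To make the update rules concrete, the following MATLAB sketch performs one sequential-mode epoch of back-propagation for a network with a single hidden layer and tanh activations. It is a minimal illustration of equations (2.4), (2.5), (2.13), (2.14) and (2.24); the variable names (X, D, W1, W2, eta) are illustrative assumptions and are not taken from the project code in the Appendix.

```matlab
% One sequential-mode epoch of back-propagation (single hidden layer, tanh units).
% X : inputs, one pattern per column (m x N); D : desired outputs (c x N).
function [W1, W2] = backprop_epoch(X, D, W1, W2, eta)
    N = size(X, 2);
    for n = 1:N
        x = [1; X(:, n)];            % input with bias term y0 = +1
        d = D(:, n);

        % Forward pass: equations (2.4) and (2.5) with phi = tanh
        v1  = W1 * x;                % induced local fields, hidden layer
        y1  = tanh(v1);
        y1b = [1; y1];               % append bias input for the output layer
        v2  = W2 * y1b;
        o   = tanh(v2);              % network outputs

        % Backward pass
        e      = d - o;                                     % error signal, eq (2.1)
        delta2 = e .* (1 - o.^2);                           % output local gradients, eq (2.14)
        delta1 = (1 - y1.^2) .* (W2(:, 2:end)' * delta2);   % hidden local gradients, eq (2.24)

        % Weight updates, eq (2.13)
        W2 = W2 + eta * delta2 * y1b';
        W1 = W1 + eta * delta1 * x';
    end
end
```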
The activation function commonly used is the hyperbolic tangent:
φj(vj(n)) = a tanh(b vj(n))    (2.25)
where a and b are constants. Its derivative with respect to vj(n) is given by:
φ′j(vj(n)) = ab sech²(b vj(n))
           = ab (1 − tanh²(b vj(n)))
           = (b/a) [a − yj(n)] [a + yj(n)]    (2.26)
For a neuron j located in the output layer, yj(n) = oj(n), and the local gradient is therefore:
δj(n) = ej(n) φ′j(vj(n))
      = (b/a) [dj(n) − oj(n)] [a − oj(n)] [a + oj(n)]    (2.27)
For a neuron j in a hidden layer, the local gradient is:
δj(n) = φ′j(vj(n)) Σ_k δk(n) wkj(n)
      = (b/a) [a − yj(n)] [a + yj(n)] Σ_k δk(n) wkj(n)    (2.28)
A simple method of increasing the rate of learning while avoiding the danger of instability is to
include a momentum term in the delta rule:
Δwji(n) = α Δwji(n−1) + η δj(n) yi(n)    (2.29)
where α is a positive number called the momentum constant. Rewriting the above equation
(2.29) as a time series with index t, where t goes from the initial time 0 to the current time n,
it may be viewed as a first-order difference equation in the weight correction Δwji(n):
Δwji(n) = η Σ_{t=0}^{n} α^{n−t} δj(t) yi(t)
        = −η Σ_{t=0}^{n} α^{n−t} ∂E(t)/∂wji(t)    (2.30)
For a given training set, back-propagation learning may proceed in one of two basic
ways:
1. Sequential mode: The sequential mode of back-propagation learning is also referred
to as on-line, pattern or stochastic mode. In this mode of operation weight updating is
performed after the presentation of each training example.
2. Batch mode: In the batch mode of back-propagation learning, weight updating is
performed after the presentation of all training examples that constitute an epoch. For
a particular epoch, the cost function is defined as the averaged squared error of equations
(2.2) and (2.3):
Eav = (1/2N) Σ_{n=1}^{N} Σ_{j∈C} ej²(n)    (2.31)
where the error signal ej(n) pertains to the output neuron j for the training example n.
From an on-line operational point of view, the sequential mode of training is
preferred over the batch mode because it requires less local storage for each synaptic
connection. Moreover, given that the patterns are presented to the network in a random
manner, the use of pattern by pattern updating of weights makes the search in weight space
stochastic in nature. This in turn makes it less likely for the back-propagation algorithm to be
trapped in a local minimum.
On the other hand, the stochastic nature of the sequential mode makes it difficult to
establish theoretical conditions for convergence of the algorithm. In contrast, the use of the batch
mode of training provides an accurate estimate of the gradient vector; convergence to a local
minimum is thereby guaranteed under simple conditions.
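A minimal MATLAB sketch of the difference between the two modes is given below, assuming a hypothetical helper gradient_for_pattern that returns the error gradient for a single training example; all names here are illustrative.

```matlab
% Sequential (on-line) mode: update the weights after every pattern.
for n = 1:N
    g = gradient_for_pattern(w, n);   % hypothetical per-pattern gradient
    w = w - eta * g;                  % immediate update
end

% Batch mode: accumulate over the whole epoch, then update once.
g_sum = 0;
for n = 1:N
    g_sum = g_sum + gradient_for_pattern(w, n);
end
w = w - eta * g_sum / N;              % one update per epoch, cf. eq (2.31)
```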
In second-order training methods, the cost function is locally approximated by a quadratic:
Eav(w(n) + Δw(n)) ≈ Eav(w(n)) + gᵀ(n) Δw(n) + 0.5 Δwᵀ(n) H(n) Δw(n)    (2.32)
where g(n) is the gradient vector of Eav evaluated at w(n),
g(n) = ∂Eav/∂w evaluated at w = w(n)    (2.33)
and H(n) is the corresponding Hessian matrix,
H(n) = ∂²Eav/∂w² evaluated at w = w(n)    (2.34)
In most of the training algorithms, a learning rate is used to determine the length of
the weight update (step size). In most of the conjugate gradient algorithms, the step size is
adjusted at the end of each iteration. A search is made along the conjugate gradient direction
to determine the step size, which minimizes the performance function along that line.
Fletcher-Reeves Update: All of the conjugate gradient algorithms start out by searching in
the steepest descent direction (negative of the gradient) on the first iteration:
p0 = −g0
(2.35)
A line search is then performed to determine the optimal distance to move along the current
search direction:
xk+1 = xk + αk pk
(2.36)
Then the next search direction is determined so that it is conjugate to previous search
directions. The general procedure for determining the new search direction is to combine the
new steepest descent direction with the previous search direction:
pk = −gk + βk pk−1    (2.37)
The various versions of conjugate gradient are distinguished by the manner in which the
constant βk is computed. For the Fletcher-Reeves update the procedure is:
βk = gkᵀ gk / gk−1ᵀ gk−1
(2.38)
This is the ratio of the norm squared of the current gradient to the norm squared of the
previous gradient. Once the MLP has been trained, it can be tested on a database of images.
The output neuron which fires corresponds to the class of image.
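As a practical note, the Fletcher-Reeves version of conjugate gradient is available in MATLAB's Neural Network Toolbox as the training function traincgf. A minimal sketch of training the face classifier this way is given below; the hidden-layer size, the number of epochs, and the variables X (pixel-intensity vectors, one image per column) and T (one-hot class targets) are illustrative assumptions, and function names vary slightly across toolbox releases.

```matlab
% Sketch: train a feedforward classifier with Fletcher-Reeves conjugate gradient.
net = feedforwardnet(40, 'traincgf');   % 40 hidden neurons, CG (Fletcher-Reeves)
net.trainParam.epochs = 500;
net = train(net, X, T);

Y = net(X);                             % network outputs
[~, predicted] = max(Y, [], 1);         % the winning output neuron gives the class
```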
2.9 Conclusions
This chapter dealt with artificial neural networks. Specifically, the architecture of
multilayered feedforward neural networks is introduced. Supervised learning via the back
propagation algorithm is discussed along with the modes of training, the choice of learning
rate, and the potential pitfalls of getting trapped in local minima. Finally, the use of artificial
neural networks for face recognition is also introduced. The results are presented in Chapter 6.
A technique that is frequently used for analysis of images is that of principal components.
This is the topic of the next chapter.
CHAPTER III
Feature selection refers to a process whereby the given data is compressed to features
or patterns fewer in number than the given data; thus, the data space is transformed into a
feature space and undergoes a dimensionality reduction. Evidently, this data compression is
lossy, and it can be shown that in the case of principal component analysis, the mean-square
of the resulting error equals the sum of the variances of the elements of that part of the data
vector that is eventually eliminated. Therefore one seeks a transformation that is optimum in
the mean squared sense; i.e., the eliminated component must have total variance that is less
than a predefined threshold. Principal components analysis computes the basis of a space
which represents the training vectors. These basis vectors are eigenvectors of a related
covariance matrix.
When applied to face recognition, one determines the principal components of the
distribution of faces treating the image as a point in a high dimensional space; i.e., the
eigenvectors of the covariance matrix of the set of face images are computed. These
eigenvectors referred to as eigenfaces in this context contain relevant discriminatory
information extracted from the images. Thus, characteristic features representing the
variation in the collection of faces are captured, and this information is used to encode and
compare individual faces. It is to be noted that the features (i.e., the eigenvectors) are ordered
with respect to the corresponding eigenvalues, and only those features that are significant (in
the sense of the value of the eigenvalues) are considered. Recognition is performed by
projecting a new image on to the subspace spanned by the eigenfaces and then classifying the
face by comparing its position in the face space with the position of known individuals.
Figure 3.1: (a) The 1st PC z1 is a minimum distance fit to a line in X space. (b) The 2nd PC z2 is a
minimum distance fit to a line in the plane perpendicular to the 1st PC.
Given a sample of n observations on a vector of p variables
x = (x1, x2, …, xp)ᵀ    (3.1)
define the first principal component of the sample by the linear transformation:
z1 = a1ᵀ x = Σ_{i=1}^{p} ai1 xi    (3.2)
where the vector a1 = (a11, a21, …, ap1)ᵀ is chosen such that var[z1] is maximum. Likewise, define
the kth PC of the sample by the linear transformation:
zk = akᵀ x,  k = 1, …, p
(3.3)
where ak is chosen such that var[zk] is maximum, subject to zk being uncorrelated with the earlier
PCs. For the first PC,
var[z1] = a1ᵀ S a1    (3.4)
where S is the sample covariance matrix of x. Maximizing var[z1] subject to a1ᵀ a1 = 1 leads to the
Lagrangian
a1ᵀ S a1 − λ (a1ᵀ a1 − 1)    (3.5)
whose derivative with respect to a1 vanishes when
S a1 = λ a1    (3.6)
so a1 is an eigenvector of S with eigenvalue λ, and the variance it captures is
var[z1] = a1ᵀ S a1 = λ    (3.7)
So λ1 is the largest eigenvalue of S. The first PC z1 retains the greatest amount of variation in
the sample. An example is illustrated in Fig. 3.1(a). To find the next coefficient vector a2,
maximize var[z2] subject to:
cov[z2, z1] = 0    (3.8)
a2ᵀ a2 = 1    (3.9)
then let λ and φ be Lagrange multipliers, and maximize a2ᵀ S a2 − λ(a2ᵀ a2 − 1) − φ a2ᵀ a1. It is
found that a2 is also an eigenvector of S whose eigenvalue λ = λ2 is the second
largest, as illustrated in Fig. 3.1(b) in the two dimensional case. In general:
var[zk] = akᵀ S ak = λk    (3.10)
The kth PC zk retains the kth greatest fraction of the variation in the sample.
Given a sample of n observations on a vector of p variables x = (x1, x2, …, xp)ᵀ, define a vector
of p PCs z = (z1, z2, …, zp)ᵀ according to z = Aᵀx, where A is an orthogonal p x p matrix whose
kth column is the kth eigenvector ak of S. Then Λ = AᵀSA is the covariance matrix of the PCs;
the matrix Λ is diagonal with elements Λkk = λk.
When applied to face images, let the training set consist of M face images Γ1, Γ2, …, ΓM, each
rearranged into a column vector. The average face of the set is defined by
Ψ = (1/M) Σ_{n=1}^{M} Γn
and each face differs from the average by the vector Φn = Γn − Ψ. This set of very large vectors is then subject to principal component
analysis, which seeks a set of M orthonormal vectors uk which best describes the distribution
of the data. The kth vector uk is chosen such that:
λk = (1/M) Σ_{n=1}^{M} (ukᵀ Φn)²    (3.11)
is a maximum, subject to the orthonormality condition
ulᵀ uk = δlk    (3.12)
The vectors uk and scalars λk are the eigenvectors and eigenvalues, respectively, of the
covariance matrix:
C = (1/M) Σ_{n=1}^{M} Φn Φnᵀ = A Aᵀ    (3.13)
where the matrix A = [Φ1 Φ2 … ΦM].
If the number of data points in the image space is less than the dimension of the space
(M < N²), there will be only M, rather than N², meaningful eigenvectors, in the sense that
the remaining eigenvectors are associated with zero eigenvalues. Accordingly, it is
computationally better to determine the eigenvalues and eigenvectors of a much smaller
matrix. For example, in the situation outlined earlier, one computes the eigenvalues and the
corresponding eigenvectors of an M x M matrix as opposed to a 65,536 x 65,536 matrix.
Consider the eigenvectors vn of AᵀA such that:
AᵀA vn = μn vn    (3.14)
Premultiplying both sides by A gives:
AAᵀ A vn = μn A vn    (3.15)
from which it can be inferred that the A vn are the eigenvectors of C = AAᵀ. Following this
analysis, construct the M by M matrix L = AᵀA, where Lmn = Φmᵀ Φn, and find the M
eigenvectors vn of L. These vectors determine linear combinations of the M training set face
images to form the eigenfaces un:
un = Σ_{k=1}^{M} vnk Φk = A vn,  n = 1, …, M    (3.16)
With this analysis the calculations are greatly reduced, from the order of the number of pixels
in the images (N²) to the order of the number of images in the training set (M). In practice,
the training set of face images will be relatively small (M < N²), and the calculations
become quite manageable. The value of the associated eigenvalues allows the ordering of the
eigenvectors according to their usefulness in characterizing the variation among the images.
It is to be emphasized at this juncture that such an ordering is possible as all the eigenvalues of
AᵀA (and hence AAᵀ) are nonnegative, these matrices being positive semi-definite.
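A minimal MATLAB sketch of the computation just described (equations (3.13) to (3.16)) is given below. It assumes the M training images have already been reshaped into the columns of a matrix Gamma of size N² x M; all variable names and the number of retained eigenfaces are illustrative assumptions.

```matlab
% Eigenfaces via the small M x M matrix L = A'A (M training images as columns of Gamma).
Psi = mean(Gamma, 2);                               % average face
A   = Gamma - repmat(Psi, 1, size(Gamma, 2));       % columns are Phi_n = Gamma_n - Psi
L   = A' * A;                                       % M x M instead of N^2 x N^2
[V, Dm] = eig(L);                                   % eigenvectors v_n, eigenvalues mu_n

[mu, order] = sort(diag(Dm), 'descend');            % order by decreasing eigenvalue
V = V(:, order);

U = A * V;                                          % eigenfaces u_n = A v_n, eq (3.16)
U = U ./ repmat(sqrt(sum(U.^2, 1)), size(U, 1), 1); % normalize each eigenface

K = 30;                                             % keep the K most significant eigenfaces
Omega_train = U(:, 1:K)' * A;                       % face-space coordinates of training images
```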
The purpose of PCA is to reduce the large dimensionality of the data space (observed
variables) to the smaller intrinsic dimensionality of feature space (independent variables),
which are needed to describe the data economically. This is the case when there is a strong
correlation between observed variables. Therefore, the goal is to find a set of eis which have
the largest possible projection onto each of the wis. The eigenvectors corresponding to
nonzero eigenvalues of the covariance matrix produce an orthonormal basis for the subspace
within which most image data can be represented with a small amount of error. The
eigenvectors are sorted from high to low according to their corresponding eigenvalues. The
eigenvector associated with the largest eigenvalue is one that reflects the greatest variance in
the image. That is, the smallest eigenvalue is associated with the eigenvector that finds the
least variance. It has been the experience that the usefulness of the eigenvectors decreases in
exponential fashion; i.e., roughly 90% of the total variance is contained in the first 5% to
10% of the dimensions.
A new face is classified by projecting it onto the face space to obtain its weight vector Ω and
computing the distance εk = ||Ω − Ωk||, where Ωk is a vector describing the kth face class. If εk is less than some predefined
threshold θε, the face is classified as belonging to the class k.
Note: Eigenfaces, once obtained, can also be used to train a multilayer perceptron for
classification purposes. This type of training incorporates both supervised and unsupervised
learning. Once trained, it can be tested on a database of images.
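As a sketch of this hybrid scheme, and continuing from the eigenface code above, the projections can be fed to a multilayer perceptron built with the Neural Network Toolbox; the patternnet hidden-layer size and the variable names (labels, Gamma_new) are assumptions, not values taken from the project.

```matlab
% PCA+MLP: eigenface projections become inputs to a multilayer perceptron.
% Omega_train : K x M matrix of projections; labels : 1 x M vector of class indices.
T   = full(ind2vec(labels));          % one-hot targets, one column per training image
net = patternnet(20);                 % MLP with 20 hidden neurons
net = train(net, Omega_train, T);

% Classify a new image (column vector Gamma_new of raw pixel intensities).
Omega_new  = U(:, 1:K)' * (Gamma_new - Psi);
[~, class] = max(net(Omega_new));
```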
3.4 Conclusions
In this chapter the eigenface approach to face recognition is dealt with. This is the
first statistics-based face recognition technique to have been proposed by researchers. A
principal advantage of this is that it is amenable to real-time face recognition. The eigenface
approach followed in this project uses principal components and nearest-neighbour
classification technique. In the next chapter, face recognition using support vector machines
is presented.
CHAPTER IV
machine with some suitable properties. To explain how it works, it is perhaps easiest to start
with the case of separable patterns that could arise in the context of pattern classification. In
this context, the main idea of a support vector machine is to construct a decision surface,
called a hyperplane, in such a way that the margin of separation between positive and
negative examples is maximized.
(VC) dimension; in the case of separable patterns, a support vector machine produces a value
of zero for the first term and minimizes the second term. Accordingly, the support vector
machine can provide a good generalization on pattern classification problems despite the fact
that it does not incorporate problem-domain knowledge. This attribute is unique to support
vector machines.
A notion that is central to the construction of the support vector learning algorithm is
the inner-product kernel between a support vector xi and the vector x drawn from the input
space. The support vectors consist of a small subset of the training data extracted by the
algorithm. Depending on how this inner-product kernel is generated, construction of different
learning machines is characterized by their respective nonlinear decision surfaces. In
particular, the support vector learning algorithm is used to construct the following three types
of learning machines:
That is, for each of these feedforward networks the support vector learning algorithm is used
to implement the learning process using a given set of training data, automatically
determining the required number of hidden units. Stated in another way, whereas the backpropagation algorithm is devised specifically to train a multilayer perceptron, the support
vector learning algorithm is of a more generic nature because it has wider applicability.
Support Vector machines are rooted in statistical learning theory which gives a family
of bounds which govern the learning capacity of the machine:
R(α) = ∫ 0.5 |y − f(x, α)| dP(x, y)
(4.1)
4.1 Basics
Let X ⊆ Rⁿ denote the possible inputs to the Support Vector Machine. These inputs are pixel
intensity vectors obtained after image pre-processing. An SVM is trained with images
belonging to two classes at a time. A single machine is not efficient at encoding multiple
classes. Assume that a point x ∈ X is associated with one of two possible classes, denoted -1
and +1. This means that it is sufficient to have the output Yestimate = {-1, +1} from the classifier.
Consider two stochastic variables: X, representing the point, and Y, representing the class
label. These variables then give rise to the following conditional distributions, p(X=x|Y=1),
p(X=x|Y=-1). There are two phases concerning support vector machines, first training and
then classifying. By letting f: X -> Yestimate represent the discriminating function, the task
during training is to minimize the probability of returning wrong classification for all
elements included in the training set. This means that for the training set Yestimate should be
identical to Y in as many cases as possible.
During training, SVMs solve this problem by simply finding a function f which, for
every point xk = [xk1, xk2, …, xkn]ᵀ with corresponding class label yk in the training set
S = ((x1, y1), …, (xl, yl)), has the following property: f(xk) ≥ 0 if yk = +1 and f(xk) < 0 if yk = -1. This is
only possible if the training set S is such that there exists a separating hyperplane. This
assumption is relaxed later on. By presenting the SVM with a number of training examples
and optimizing the function parameters so that the sign of the output corresponds as well as
possible to the true classes of the training examples, the property mentioned can be forced to
be valid. When the training has been completed the sign-function is applied to the function f
and the resulting classifier has the desired output, that is {-1,1}.
Yestimate = sign(f(x))    (4.2)
Suppose that there is a hyperplane, which separates the positive from the negative examples.
The points x which lie on the hyperplane satisfy w.x + b = 0, where w is normal to the
hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is
the Euclidean norm of w. Let d+ (d-) be the shortest distance from the separating hyperplane
to the closest positive (negative) example. Define the margin of a separating hyperplane to be
d+ + d-. For the linearly separable case, the support vector algorithm simply looks for the
separating hyperplane with largest margin.
To formulate this, suppose that all the training data satisfy the following constraints:
w.xi + b ≥ +1 for yi = +1    (4.3.a)
w.xi + b ≤ −1 for yi = −1    (4.3.b)
These can be combined into a single set of inequalities:
yi(w.xi + b) − 1 ≥ 0,  i = 1, …, N
(4.3)
Now consider the points for which the equality in (4.3.a) holds. These points are the support
vectors which lie on the hyperplane H1: w.xi + b = +1, with normal w and perpendicular
distance from the origin |1 - b|/||w||. Similarly, the points for which the equality in (4.3.b)
holds are also the support vectors which lie on the hyperplane H2: w.xi + b = -1 with
normal again w and perpendicular distance from the origin |-1-b|/||w||. Hence d+ = d- = 1/||w||
and the margin is simply 2/||w||. Note that H1 and H2 are parallel as they have the same
normal and that no training points fall between them. Thus, the pair of hyperplanes which
gives the maximum margin can be found by minimizing the objective function 0.5||w||² subject to
the constraints in equation (4.3). Thus the solution for a typical two dimensional case is
expected to have the form shown in Figure 4.2. Those training points for which the equality
in (4.3) holds (i.e. those which wind up lying on one of the hyperplanes H1, H2) and whose
removal would change the solution found, are called support vectors; they are indicated in
Figure 4.2 by the extra circles.
Fig. 4.2: Linear separating hyperplanes for the separable case. The support vectors are circled.
Thus, positive Lagrange multipliers αi, i = 1, …, N, are introduced, one for each of the
inequality constraints in equation (4.3). Recall that the rule is that for constraints of the form ci ≥ 0,
the constraint equations are multiplied by positive Lagrange multipliers and subtracted from
the objective function, to form the Lagrangian. For equality constraints, the Lagrange
multipliers are unconstrained. This gives the Lagrangian:
LP = 0.5||w||² − Σ_i αi yi(w.xi + b) + Σ_i αi    (4.4)
The task is to find the saddle point of LP: LP must be minimized with respect to w and b, while
simultaneously requiring that the derivatives of LP with respect to all the αi vanish, all subject to
the constraints αi ≥ 0.
Suppose that this set of constraints is referred to as C1. Now this is a convex quadratic
programming problem, since the objective function is itself convex, and those points which
satisfy the constraints also form a convex set. This means that the following dual problem can
be equivalently solved: maximize LP, subject to the constraints that the gradient of LP with
respect to w and b vanish, and subject also to the constraints αi ≥ 0. Suppose that this set of
constraints is referred to as C2. This particular dual formulation of the problem is called the
Wolfe dual. It has the property that the maximum of LP, subject to constraints C2, occurs at
the same values of w, b and α, as the minimum of LP, subject to constraints C1.
Requiring that the gradient of LP with respect to w and b vanish gives the conditions:
1. w = Σ_i αi yi xi    (4.5)
2. Σ_i αi yi = 0    (4.6)
Since these are equality constraints in the dual formulation, they can be substituted into (4.4) to give:
LD = Σ_i αi − 0.5 Σ_i Σ_j αi αj yi yj xi.xj    (4.7)
subject to the constraints:
1. Σ_i αi yi = 0    (4.8)
2. αi ≥ 0    (4.9)
Note that there are different labels for Lagrangian (LP for primal, LD for dual), to emphasize
that the two formulations are different: LP and LD arise from the same objective function but
with different constraints, and the solution is found by minimizing LP or by maximizing LD.
Note also that if the problem is formulated with b=0, which amounts to requiring that all
hyperplanes contain the origin, the constraint (4.8) does not appear.
Support vector training (for the separable, linear case) therefore amounts to
maximizing LD with respect to the αi, subject to constraint (4.8) and positivity of the αi, with the
solution given by (4.5). Notice that there is a Lagrange multiplier αi for every training point.
In the solution, those points for which αi > 0 are called support vectors, and lie on one of the
hyperplanes H1, H2. All other training points have αi = 0 and lie on that side of H1 or H2 such
that the strict inequality in Equation (4.3) holds. For these machines, the support vectors are
the critical elements of the training set: they lie closest to the decision boundary, and if all other
training points were removed (or moved around without crossing H1 or H2) and training
repeated, the same separating hyperplane would be obtained.
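For the linearly separable case, maximizing LD under constraints (4.8) and (4.9) is a standard quadratic program. A minimal sketch using MATLAB's quadprog (Optimization Toolbox) is shown below; X holds one training vector per row, y the labels in {−1, +1}, and the tolerance used to pick out support vectors is an arbitrary choice.

```matlab
% Solve the SVM dual: max sum(alpha) - 0.5*sum_ij alpha_i alpha_j y_i y_j (xi.xj).
% quadprog minimizes, so the sign of the objective is flipped.
N = size(X, 1);
H = (y * y') .* (X * X');             % N x N Gram matrix weighted by the labels
f = -ones(N, 1);
Aeq = y'; beq = 0;                    % constraint (4.8): sum(alpha_i * y_i) = 0
lb = zeros(N, 1);                     % constraint (4.9): alpha_i >= 0
ub = [];                              % (use C*ones(N,1) for the non-separable case)

alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);

w  = X' * (alpha .* y);               % eq (4.5): w = sum alpha_i y_i x_i
sv = alpha > 1e-6;                    % support vectors have alpha_i > 0
b  = mean(y(sv) - X(sv, :) * w);      % bias averaged over the support vectors
```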
αi [yi(w.xi + b) − 1] = 0,  i = 1, …, N    (4.10)
The solution vector w is defined in terms of an expansion that involves the N training
examples. Note, however, that although this solution is unique by virtue of convexity of the
Lagrangian, the same cannot be said about the Lagrange coefficients αi. It is important to
note that at the saddle point, for each multiplier αi, the product of that multiplier with its
corresponding constraint vanishes, as shown by (4.10). Therefore, only those multipliers exactly
meeting equation (4.10) can assume nonzero values.
Having determined the optimal Lagrange multipliers αo,i, the optimum
weight vector is computed as:
wo = Σ_i αo,i yi xi    (4.11)
The optimum bias then follows from (4.10): for any support vector xi,
bo = yi − wo.xi    (4.12)
In the non-separable case, some data points will violate the constraint
yi(w.xi + b) ≥ +1,  i = 1, …, N
This violation can arise in one of two ways, as shown in Figure 4.3:
The data point falls inside the region of separation but on the correct side of the decision
surface.
The data point falls on the wrong side of the decision surface, resulting in
misclassification.
This can be rectified by introducing positive slack variables ξi, i = 1, …, N, in the constraints,
which then become:
w.xi + b ≥ +1 − ξi for yi = +1
w.xi + b ≤ −1 + ξi for yi = −1    (4.13)
Thus, for an error to occur, the corresponding ξi must exceed unity, so Σ_i ξi is an upper bound
on the number of training errors. Hence a natural way to assign an extra cost for errors is to
change the objective function to be minimized from 0.5||w||² to 0.5||w||² + C(Σ_i ξi)^k, where C is a
parameter to be chosen by the user, a larger C corresponding to assigning a higher penalty to
errors. As before, minimizing the first term is related to minimizing the VC dimension of the
support vector machine. As for the second term, it is an upper bound on the number of test
errors. Formulation of the cost function is therefore in perfect accord with the principle of
structural risk minimization.
The parameter C controls the trade-off between the complexity of the machine and
the number of non-separable points; it may therefore be viewed as a form of a
regularization parameter. The parameter C has to be selected by the user. This can be done
in one of two ways:
Maximize:
LD = Σ_i αi − 0.5 Σ_i Σ_j αi αj yi yj xi.xj    (4.14)
subject to:
1. 0 ≤ αi ≤ C    (4.15)
2. Σ_i αi yi = 0    (4.16)
Having determined the optimal Lagrange multipliers αo,i, the optimum weight
vector is computed as before:
wo = Σ_i αo,i yi xi    (4.17)
The Karush-Kuhn-Tucker conditions are needed for the primal problem. The primal
Lagrangian is:
LP = 0.5||w||² + C Σ_i ξi − Σ_i αi {yi(w.xi + b) − 1 + ξi} − Σ_i μi ξi    (4.19)
where the μi are the Lagrange multipliers introduced to enforce positivity of the ξi. The KKT
conditions for the primal problem are therefore:
The optimum bias b0 is determined by taking any data point in the training set for which
0 < αo,i < C and therefore ξi = 0, and using that data point in the KKT conditions. However, from a
numerical perspective it is better to take the mean value of b0 resulting from all such data
points in the sample.
An important question at this juncture is whether or not the methods discussed in earlier
sections can be generalized to the case where the decision function is not a linear function of
the data, or, in other words, when the input space is made up of nonlinearly separable
patterns. Cover's theorem states that such a multi-dimensional space may be transformed into
a new feature space where the patterns are linearly separable with high probability, provided
two conditions are satisfied. First, the transformation is nonlinear. Second, the dimensionality
of the feature space is high enough. The next operation exploits the idea of building an
optimal separating hyperplane in accordance with the theory described, but with a
fundamental difference: the separating hyperplane is now defined as a linear function of
vectors drawn from the feature space rather than the original input space.
Suppose the data is first mapped to some other (possibly infinite-dimensional) Euclidean space H,
using a mapping Φ: Rᵈ → H. The training algorithm would then
only depend on the data through dot products in H, i.e. on functions of the form Φ(xi).Φ(xj).
Now if there were a kernel function K such that K(xi,xj) = Φ(xi).Φ(xj), only K would need to
be used in the training algorithm, and one would never need to know explicitly what Φ is. One
example is K(xi,xj) = exp(−||xi − xj||²/2σ²). In this particular example, H is infinite
dimensional, so it would not be very easy to work with Φ explicitly. However, if one replaces
xi.xj by K(xi,xj) everywhere in the training algorithm, the algorithm produces a support vector
machine in an infinite dimensional space, and furthermore does so in roughly the same amount
of time it would take to train on the unmapped data.
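A small, toolbox-free MATLAB sketch of this kernel substitution is given below: the Gram matrix used in the quadprog sketch earlier is simply rebuilt from K(xi, xj) instead of raw dot products. The width sigma is a user-chosen parameter and the helper name rbf_kernel is illustrative.

```matlab
% RBF (Gaussian) kernel matrix: K(i,j) = exp(-||x1_i - x2_j||^2 / (2*sigma^2)).
function K = rbf_kernel(X1, X2, sigma)
    % X1: n1 x d, X2: n2 x d (one sample per row)
    sq1 = sum(X1.^2, 2);                          % n1 x 1
    sq2 = sum(X2.^2, 2);                          % n2 x 1
    d2  = repmat(sq1, 1, size(X2, 1)) ...         % pairwise squared distances
        + repmat(sq2', size(X1, 1), 1) ...
        - 2 * (X1 * X2');
    K = exp(-d2 / (2 * sigma^2));
end

% Usage in the dual: replace the linear Gram matrix with the kernel matrix,
%   H = (y * y') .* rbf_kernel(X, X, sigma);
% and classify a new point x (1 x d) with
%   sign( sum(alpha .* y .* rbf_kernel(X, x, sigma)) + b ).
```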
Suppose that the data belongs to a space denoted L. (The notation L is a mnemonic
for low-dimensional, as is the notation for the range space H for high-dimensional.)
Evidently, the map Φ is not necessarily onto; i.e., there need not exist an element in L that is
mapped onto a specific element in H.
Let φj(x), j = 1, …, M, where M is the number of hidden units, denote a set of nonlinear
transformations, and define a hyperplane acting as the decision surface as follows:
Σ_{j=1}^{M} wj φj(x) + b = 0
where w defines the vector of linear weights connecting the feature space to the
output space, and b is the bias. The quantity φj(x) represents the input supplied to the weight
wj via the feature space. In effect, the vector φ(x) represents the image induced in the
feature space due to the input vector x. Thus, in terms of this image, the decision
surface is defined in the compact form, as shown in Figure 4.4:
wᵀ φ(x) = 0    (4.22)
Fig. 4.4: Image in H of the square [-1, 1] x [-1, 1] under the mapping φ.
Adapting (4.5) to the feature space, the optimum weight vector is:
w = Σ_{i=1}^{N} αi yi φ(xi)    (4.23)
where the feature vector φ(xi) corresponds to the input pattern xi in the ith example.
Substituting (4.23) into (4.22), the decision surface is obtained as:
Σ_i αi yi φᵀ(xi) φ(x) = 0    (4.24)
The term φᵀ(xi) φ(x) represents the inner product of two vectors induced in the
feature space by the input vector x and the input pattern xi pertaining to the ith example.
The inner-product kernel is introduced, denoted by K(x, xi) and defined by K(x, xi) = φᵀ(x) φ(xi).
This kernel is a symmetric function of its arguments. Therefore, the decision surface can be
defined as:
Σ_i αi yi K(x, xi) = 0    (4.25)
Mercer's theorem states that such a kernel can be expanded in the series
K(x, x′) = Σ_{i=1}^{∞} λi φi(x) φi(x′)    (4.26)
with λi > 0 for all i. For this expansion to be valid and for it to converge absolutely and uniformly,
it is necessary and sufficient that the condition
∫∫ K(x, x′) ψ(x) ψ(x′) dx dx′ ≥ 0    (4.27)
holds for all ψ(·) for which:
∫ ψ²(x) dx < ∞    (4.28)
The functions φi(x) are the eigenfunctions of the expansion and the λi are the associated
eigenvalues; since all the eigenvalues are positive, the kernel is positive definite. Mercer's
theorem only tells whether or not a candidate kernel is actually an inner-product kernel in
some space and therefore admissible for use in a support vector machine.
The inner-product kernels for three common types of support vector machine are summarized below:

Type of support vector machine | Inner-product kernel K(x, xi), i = 1, …, N | Comments
Polynomial learning machine | (xᵀxi + 1)^p | Power p is specified a priori by the user
Radial-basis function network | exp(−||x − xi||²/2σ²) | The width σ² is specified a priori by the user
Two-layer perceptron | tanh(β0 xᵀxi + β1) | Mercer's theorem is satisfied only for some values of β0 and β1
Figure 4.5: Decision surfaces of the polynomial kernel and the radial-basis kernel.
hyperplane gives the best classification for the given element. This approach gives rise to the
following problem. The output of the SVM in this case is based upon which hyperplane
derives the greatest value, due to the discriminating function, when evaluated for the current
sample. The question is then what happens when two different classes are very similar and
the other class is well distinguished from these two. Although discriminating between class 1
and class 2 is a much harder task than discriminating between class 1 and class 3, or class 2
and class 3, the training algorithm will be prevented from deriving an optimal solution for just
classifying between class 1 and class 2, because the whole training set has to be taken into
consideration during training. By just considering the first two classes a better classifier for
discriminating between these could be derived.
It can be seen from Fig. 4.7 that the hyperplane w1,2 clearly discriminates better
between class 1 and class 2 than the hyperplane w2, although this hyperplane has a better
overall performance. By using some sort of a hierarchical structure shown in Fig. 4.8, the
strength of the binary classifier could be exploited.
Figure 4.7: Problem encountered when one SVM is used for multi-classification
The number of classes c can be decomposed as c = 2^n1 + 2^n2 + … + 2^nI, where
n1 ≥ n2 ≥ … ≥ nI, because any natural number (even or odd) can be decomposed into a finite
sum of powers of 2. If c is an odd number, nI = 0; if c is even, nI > 0. Note that the
decomposition is not unique, but the number of comparisons in the test stage is always c − 1.
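The following MATLAB sketch illustrates the tournament-style evaluation implied by this decomposition. It assumes a cell array svm{i}{j} of trained pairwise classifiers (hypothetical helpers, not part of the report's code) whose output is positive when the test vector x is assigned to class i rather than class j; exactly c − 1 pairwise comparisons are performed.

```matlab
% Knockout tournament over c classes using pairwise SVMs: c-1 comparisons in total.
function winner = svm_tree_classify(svm, x, c)
    survivors = 1:c;
    while numel(survivors) > 1
        next = [];
        for k = 1:2:numel(survivors)-1
            i = survivors(k); j = survivors(k+1);
            if svm{i}{j}(x) > 0            % hypothetical pairwise classifier i vs j
                next(end+1) = i;           %#ok<AGROW>
            else
                next(end+1) = j;           %#ok<AGROW>
            end
        end
        if mod(numel(survivors), 2) == 1   % an odd survivor gets a bye to the next round
            next(end+1) = survivors(end);  %#ok<AGROW>
        end
        survivors = next;
    end
    winner = survivors;
end
```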
4.10 Conclusions
Support Vector Machines are dealt with in this chapter. Both linearly separable and
nonlinearly separable data are considered. The Karush-Kuhn-Tucker conditions are also
discussed. Face recognition is generally a multi-classification problem. Since an SVM is
inherently a binary classifier, the solution to the multi-classification problem using
SVMs is also described; this uses a binary tree approach. Essential pre-processing of images
is discussed in the next chapter.
CHAPTER V
Image Preprocessing
5.1 Histogram Equalization
Let the variable r represent the gray levels of the image to be enhanced. Assume that r has
been normalized to the interval [0,1], with r=0 representing black and r=1 representing white.
Consider transformation of the form:
s = T(r),  0 ≤ r ≤ 1    (5.1)
that produces a level s for every pixel value r in the original image and satisfies the following
conditions: (a) T(r) is single-valued and monotonically increasing in the interval 0 ≤ r ≤ 1; and
(b) 0 ≤ T(r) ≤ 1 for 0 ≤ r ≤ 1.
The first requirement is needed to guarantee that the inverse transformation will exist, and
the monotonicity condition preserves the increasing order from black to white in the output
image. A transformation that is not monotonically increasing could result in at least a section
of the intensity range being inverted, thus producing some inverted gray levels in the output
image. The other condition guarantees that the output gray levels will be in the same range as
the input levels. The inverse transformation from s back to r is denoted:
r = T⁻¹(s),  0 ≤ s ≤ 1    (5.2)
The gray levels in an image may be viewed as random variables in the interval [0,1]. Let
pr(r) and ps(s) denote the probability density functions of the random variables r and s. A basic result
is that, if pr(r) and T(r) are known and T⁻¹(s) satisfies condition (a), then the PDF ps(s) of the
transformed variable s can be obtained by:
ps(s)=pr(r)|dr/ds|
(5.3)
Thus, the probability density function of the transformed variable s is determined by the
gray-level PDF of the input image and by the chosen transformation function.
A transformation function of particular importance in image processing is:
s = T(r) = ∫₀ʳ pr(w) dw    (5.4)
where w is a dummy variable of integration. The right-hand side of this equation is the
cumulative distribution function (CDF) of the random variable r. Since probability density
functions are always positive, and recalling that the integral of a function is the area under
the function, it follows that this transformation is single-valued and monotonically
increasing. Similarly the integral of a probability density function for variables in the range
[0,1] is also in the range [0,1].
Given the transformation function T(r), ps(s) is found by applying equation (5.3). It is known
from basic calculus (Leibniz's rule) that the derivative of a definite integral with respect to its
upper limit is simply the integrand evaluated at that limit. In other words:
ds/dr = dT(r)/dr = d[∫₀ʳ pr(w) dw]/dr = pr(r)    (5.5)
Substituting this result for dr/ds into equation (5.3), and keeping in mind that all probability values
are positive, yields:
ps(s) = pr(r) |dr/ds|
      = pr(r) |1/pr(r)|
      = 1,  0 ≤ s ≤ 1    (5.6)
Because ps(s) is a probability density function, it must be zero outside the
interval [0,1], since its integral over all values of s must equal 1. For discrete gray levels, probabilities
and summations are used instead of probability density functions and integrals. The probability of
occurrence of gray level rk in an image is approximated by:
pr(rk) = nk/n,  k = 0, 1, 2, …, L−1
(5.7)
where n is the total number of pixels in the image, nk is the number of pixels that have gray
level rk, and L is the total number of possible gray levels in the image. The discrete version of
the transformation function is given as:
sk = T(rk) = Σ_{j=0}^{k} pr(rj) = Σ_{j=0}^{k} nj/n,  k = 0, 1, 2, …, L−1    (5.8)
Thus, a processed output image is obtained by mapping each pixel with level rk in the input
image into a corresponding pixel with level sk in the output image. A plot of pr(rk) versus rk is
called a histogram. The transformation given is called histogram equalization. An example is
shown in Fig. 5.1.
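Equation (5.8) maps directly onto a few lines of MATLAB. The sketch below equalizes an 8-bit grayscale image by hand; the Image Processing Toolbox function histeq performs the equivalent operation. The variable names are illustrative.

```matlab
% Histogram equalization of an 8-bit grayscale image I (uint8), eq (5.8).
L  = 256;
n  = numel(I);
nk = histc(double(I(:)), 0:L-1);      % nk: number of pixels at each gray level rk
pr = nk / n;                          % pr(rk) = nk/n, eq (5.7)
sk = cumsum(pr);                      % sk = sum_{j=0}^{k} pr(rj), eq (5.8)
map = uint8(round((L-1) * sk));       % scale the mapping back to [0, 255]
Ieq = map(double(I) + 1);             % apply the mapping pixel by pixel

% Equivalent toolbox call: Ieq = histeq(I);
```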
A simple smoothing (averaging) filter replaces each pixel by the mean of the gray levels in an
m x n neighbourhood Sxy:
f(x,y) = (1/mn) Σ_{(s,t)∈Sxy} g(s,t)
This operation can be implemented using a convolution mask in which all coefficients
have value 1/mn. Median filtering instead replaces each pixel by the median of the gray levels in its
neighbourhood, which suppresses impulse noise while preserving edges better than averaging.
Figure 5.2 illustrates median filtering in a 2-D space.
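The pre-processing steps discussed in this chapter all have Image Processing Toolbox equivalents. The sketch below shows one possible pipeline; the file name, the 3x3 neighbourhood and the 50x40 target size are illustrative choices, not values confirmed by this section.

```matlab
% Pre-processing pipeline sketch: equalize, median-filter, then resample.
I  = imread('face.pgm');               % assumed input image path
I1 = histeq(I);                        % histogram equalization (Section 5.1)
I2 = medfilt2(I1, [3 3]);              % median filtering over a 3x3 window
I3 = imresize(I2, [50 40], 'bicubic'); % bi-cubic interpolation to 50x40 pixels
x  = double(I3(:));                    % pixel-intensity vector for the classifier
```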
5.4 Conclusions
Pre-processing of images is discussed in this chapter. This project uses histogram
equalization, median filtering and bi-cubic interpolation to process the images before face
recognition. In our experience, this is found to improve the recognition accuracy. After pre-processing of the images, the different techniques discussed in Chapters 2, 3 and 4 are used for
face recognition, and the results are presented in Chapter 6.
CHAPTER VI
Results
The algorithms were tested on the Cambridge ORL database, which consists of 40
subjects with 10 images each, under varying illumination, lighting conditions and pose. Of
these, 5 images of each subject are used for training and the remaining 5 for testing.
Variations are visible in the images before the pre-processing stages, and these are noted.
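A sketch of this training/testing split is given below, assuming the standard ORL directory layout (subdirectories s1 to s40, each containing images 1.pgm to 10.pgm); the path and layout are assumptions, not details given in the report.

```matlab
% Split the ORL database: images 1-5 per subject for training, 6-10 for testing.
nSubjects = 40; nPerSubject = 10;
train = {}; test = {}; trainLabel = []; testLabel = [];
for s = 1:nSubjects
    for k = 1:nPerSubject
        img = imread(fullfile('orl_faces', sprintf('s%d', s), sprintf('%d.pgm', k)));
        if k <= 5
            train{end+1} = img;  trainLabel(end+1) = s;   %#ok<AGROW>
        else
            test{end+1}  = img;  testLabel(end+1)  = s;   %#ok<AGROW>
        end
    end
end
```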
Case A:
Histogram Equalization ON
Median Filtering ON
The face recognition accuracy for varying numbers of hidden neurons is summarized in
Table 6.1, and the same data is pictured in Fig. 6.1. It can be observed that for larger image
sizes, there is little change in the recognition accuracy once there are more than 40 hidden
neurons, and for smaller images this plateau is reached when the number of neurons is 20.
These results show that neural networks need more neurons to extract the hidden features
from larger images. Further, it can be easily seen that with artificial neural networks a face
recognition accuracy of better than 90% can be achieved.
Resolution / Hidden neurons | 10 | 20 | 30 | 40 | 50 | 60
50x40 | 14 | 32 | 58 | 87.5 | 88 | 91
20x20 | 10 | 79 | 87.5 | 89 | 92.5 | 93.5
10x10 | – | 91.5 | 89.5 | 88 | 90 | 88.5
Table 6.1: Face recognition accuracy with multilayer feedforward networks with histogram equalization
and median filtering.
Fig. 6.1: Face recognition accuracy with multilayer feedforward networks with histogram equalization
and median filtering.
Case B:
Median Filtering ON
The face recognition accuracy for varying numbers of hidden neurons is summarized in
Table 6.2 and shown in Fig. 6.2. From the data, conclusions similar to those of Case A can be
drawn.
Resolution / Hidden neurons | 10 | 20 | 30 | 40 | 50 | 60
50x40 | 44 | 67 | 71.5 | 91.5 | 93.5 | 92
20x20 | 10 | 79 | 87.5 | 89 | 92.5 | 93.5
10x10 | – | 91.5 | 89.5 | 88 | 90 | 88.5
Table 6.2: Face recognition accuracy with multilayer feedforward networks with only median filtering.
Fig. 6.2: Face recognition accuracy with multilayer feedforward networks with only median filtering.
Case C:
The face recognition accuracy for a varying number of hidden neurons is summarized in Table 6.3 and Fig. 6.3. Again, the conclusions drawn are similar to those of Cases A and B.
Resolution / hidden neurons    10     20     30     40     50     60
50x40                          29     63.5   70.5   90     90     93
20x20                          36.5   88.5   90.5   91.5   95.5   95
10x10                          32     45.5   89     91.5   91.5   90

Table 6.3: Face recognition accuracy with multilayer feedforward networks with no preprocessing.
Fig. 6.3: Face recognition accuracy with multilayer feedforward networks with no preprocessing.
Discussion:
From the results for Cases A, B and C, it can be observed that there is no significant increase in the recognition accuracy when the number of hidden neurons is increased beyond 50. Large differences in recognition accuracy are seen between the 50x40 and 20x20 resolutions, with the accuracy decreasing for the larger images. This may be attributed to the following: the neural network learns unnecessary features when presented with higher-resolution images, eventually leading to poor generalization. It can also be observed that the best recognition accuracy is achieved when there is no pre-processing. Thus, it can be concluded that pre-processing by median filtering and histogram equalization removes features that are important for the neural networks.
6.2 PCA

Case A:
Histogram Equalization ON
Median Filtering ON

The face recognition accuracy for a varying number of eigenfaces is summarized in Table 6.4 and shown in Fig. 6.4.
Resolution / No. of eigenfaces    1      6      11     16     21     26     31     36     41     46
50x40                             –      74     80     83.5   84.5   85     85     85     85     86
30x20                             11.5   73.5   79.5   84     84.5   85     85     85     85     85.5
20x20                             11     75     81     83.5   84     85.5   86.5   87     87.5   87.5
10x10                             12.5   77.5   85.5   87.5   87.5   89.5   89.5   89.5   89.5   89.5
5x5                               11.5   79.5   84     85     87.5   87.5   88     88     86     87.5

Table 6.4: Face recognition accuracy with principal component analysis with histogram equalization and median filtering.
Fig. 6.4: Face recognition accuracy with principal component analysis with histogram equalization and
median filtering.
Case B:
Median Filtering ON

The face recognition accuracy for a varying number of eigenfaces is summarized in Table 6.5 and shown in Fig. 6.5. From the data, conclusions similar to those of Case A can be drawn.

Resolution / No. of eigenfaces    1      6      11     16     21     26     31     36     41     46
50x40                             8.5    84.5   86.5   87.5   89     89.5   89.5   90     90     90
30x20                             –      85     88     89     89.5   89.5   90     90.5   90.5   90.5
20x20                             13.5   86     88.5   89     89     90.5   90.5   90.5   90.5   90.5
10x10                             15.5   87.5   90     91     91.5   91.5   92     92     92     92
5x5                               14     86     91.5   91.5   91     91     91     91     91     91.5

Table 6.5: Face recognition accuracy with PCA with only median filtering.
Fig. 6.5: Face recognition accuracy with PCA with only median filtering.
Case C:
The face recognition accuracy for a varying number of eigenfaces is summarized in Table 6.6 and Fig. 6.6. From the data, conclusions similar to those of Cases A and B can be drawn.
Resolution / No. of eigenfaces    1      6      11     16     21     26     31     36     41     46
50x40                             12.5   84.5   87.5   88.5   88.5   89.5   90     90     90     90
30x20                             12.5   55.5   87     88.5   89.5   89.5   90     90.5   91     91
20x20                             15     86     88.5   89     89.5   90.5   90.5   90.5   90.5   90.5
10x10                             13     87.5   90.5   91     93     93     92.5   92.5   92.5   92.5
5x5                               14     88     89     90     90     90     90     90.5   90.5   90.5

Table 6.6: Face recognition accuracy with PCA with no preprocessing.
Discussion:
From Tables 6.4, 6.5 and 6.6, it can be observed that the recognition accuracy changes very little once the number of eigenfaces exceeds about 25. The best accuracy of 93% is obtained with 26 eigenfaces at a resolution of 10x10.
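For completeness, a minimal sketch of classifying the projected test images by minimum Euclidean distance in the eigenface space is given below. The projection matrices follow the Appendix pca.m script, while the nearest-neighbour decision and the label vector train_labels are illustrative assumptions rather than the report's exact classifier.

% Nearest-neighbour classification in the eigenface space
% ProjectedImages: K x 200 training projections, ProjectedtestImages: K x 200 test projections
predicted = zeros(1, size(ProjectedtestImages,2));
for j = 1:size(ProjectedtestImages,2)
    d = ProjectedImages - repmat(ProjectedtestImages(:,j), 1, size(ProjectedImages,2));
    [dummy, idx] = min(sum(d.^2, 1));    % index of the closest training projection
    predicted(j) = train_labels(idx);    % train_labels: assumed 1 x 200 vector of subject labels
end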
6.3 SVMs

Case A:
Histogram Equalization ON
Median Filtering ON

The face recognition accuracy for varying resolution is summarized in Table 6.7, and the same data are pictured in Fig. 6.7. It can be observed that there is very little change in the recognition accuracy; it is practically independent of the image size. Moreover, the accuracies achieved are better than those obtained when artificial neural networks are used.
Resolution     Accuracy (%)
50x40          96
30x20          96
20x20          96
10x10          97
5x5            94

Table 6.7: Face recognition accuracy with SVM with histogram equalization and median filtering.
Case B:
Median Filtering ON
The face recognition accuracy for varying resolution is summarized in Table 6.8. From the data, conclusions similar to those of Case A can be drawn.
Resolution     Accuracy (%)
50x40          97
30x20          97
20x20          97.5
10x10          97
5x5            97

Table 6.8: Face recognition accuracy with SVM with only median filtering.
Case C:
The face recognition accuracy for varying resolution is summarized in Table 6.9. From the data, conclusions similar to those of Cases A and B can be drawn.
Resolution     Accuracy (%)
50x40          97
30x20          97
20x20          97
10x10          97
5x5            97.5

Table 6.9: Face recognition accuracy with SVM with no preprocessing.
Discussion:
From Tables 6.7, 6.8 and 6.9, it can be concluded that the performance of support vector machines is almost invariant to changes in resolution. Further, pre-processing does not appear to have any effect on the recognition accuracy, which suggests that SVMs are more robust with respect to pre-processing: the required features are extracted whether or not the images are pre-processed, unlike what is observed with the ANN and PCA techniques. Furthermore, the SVMs outperform both ANNs and PCA, yielding the best recognition accuracies.
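The project itself uses the SVM-KM toolbox (see the Appendix). Purely as an illustration of the same one-against-one multi-class scheme, the sketch below uses MATLAB's built-in fitcecoc as a stand-in, with an illustrative kernel choice; Xtrain and Xtest hold one image (or feature vector) per row and ytrain, ytest the subject labels.

% One-against-one multi-class SVM (stand-in for the SVM-KM toolbox calls)
t = templateSVM('KernelFunction','polynomial','PolynomialOrder',2);   % illustrative kernel
mdl = fitcecoc(Xtrain, ytrain, 'Learners', t, 'Coding', 'onevsone');
ypred = predict(mdl, Xtest);
accuracy = 100*mean(ypred == ytest);   % recognition accuracy in percent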
6.4 PCA+MLP
Case A:
Histogram Equalization ON
Median Filtering ON
The face recognition accuracy for a varying number of hidden neurons is summarized in Table 6.10, and the same data are pictured in Fig. 6.7. It can be observed that for larger image sizes there is little change in the recognition accuracy once the number of hidden neurons exceeds 40, whereas for smaller images this plateau is reached at about 35 neurons. These results show that neural networks need more neurons to extract the hidden features from larger images. Further, it can be seen that with PCA followed by an artificial neural network, the recognition accuracy achieved is intermediate between those of PCA and the MLP alone.
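A brief sketch of the combination is given below: the eigenface projections from the Appendix pca.m script replace the raw pixel vectors as the MLP inputs. The target matrix Y and the network settings are illustrative assumptions.

% PCA+MLP: eigenface weights as inputs to a multilayer feedforward network
% ProjectedImages/ProjectedtestImages: K x 200 training/test projections, Y: 40 x 200 targets
hidden = 45;                                   % number of hidden neurons (varied in Table 6.10)
net = newff(minmax(ProjectedImages), [hidden 40], {'tansig','logsig'}, 'traingdx');
net = train(net, ProjectedImages, Y);
scores = sim(net, ProjectedtestImages);
[maxval, predicted] = max(scores);             % predicted subject for each test image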
Resolution / hidden neurons    5      15     25     35     45     55     65     75
50x40                          7.5    37     86.5   80.5   82.5   87.5   93     92.5
20x20                          2.5    46     78.5   82.5   89     89.5   88     89.5
10x10                          2.5    54.5   75     80.5   86     86.5   89     88.5

Table 6.10: Face recognition accuracy with PCA+MLP with histogram equalization and median filtering.
Case B:
Median Filtering ON
The face recognition accuracy for a varying number of hidden neurons is summarized in Table 6.11. From the data, conclusions similar to those of Case A can be drawn.
Accuracy (%):

Resolution / hidden neurons    5      15     25     35     45     55     65     75
20x20                          6.5    40     76.5   86.5   2.5    88     93.5   92
10x10                          2.5    55     70.5   2.5    87.5   86.5   90     88

Table 6.11: Face recognition accuracy with PCA+MLP with only median filtering.
Fig. 6.8: Face recognition accuracy with PCA+MLP with only median filtering.
Case C:
The face recognition accuracy for a varying number of hidden neurons is summarized in Table 6.12. From the data, conclusions similar to those of Cases A and B can be drawn.
Accuracy (%):

Resolution / hidden neurons    5      15     25     35     45     55     65     75
50x40                          2.5    41     78.5   78     88     87     89     92.5
20x20                          2.5    37     82     86.5   89.5   89     91     91
10x10                          2.5    24     57     78     83.5   82     88     87

Table 6.12: Face recognition accuracy with PCA+MLP with no preprocessing.
Discussion:
Intuitively, this combined method was expected to give better accuracy than PCA or MLP alone. However, the results do not bear out this expectation: the accuracies obtained are intermediate between those of PCA and MLP, suggesting that the eigenface weights obtained from the PCA stage are not optimal features for classification using MLPs.
CHAPTER VII
Conclusions and Future Work
7.1 Conclusions
This project has successfully demonstrated the recognition of faces using Principal Component Analysis, Neural Networks and Support Vector Machines. Face images are the inputs to the face recognition system, and the pixel intensities are used as inputs to the respective methods. Pre-processing of the face images, using histogram equalization and median filtering, is also carried out for each method and the corresponding results are obtained.
Chapter 2 introduced neural networks and the backpropagation algorithm and used them for recognizing faces. Recognizing faces using neural networks can be done in two ways: the first is a direct neural-network-based approach, and the other obtains the weights of the eigenfaces and then classifies these weights using a neural network. The network parameter varied in each simulation is the number of hidden neurons. The change in recognition rate is not appreciable when the number of hidden neurons exceeds 40. Also, the neural network does not perform well at a resolution of 50x40, but performs very well at lower resolutions. This suggests that the network learns unwanted parameters, leading to poor generalization.
Chapter 3 introduced an information-theoretic technique, the eigenface approach, which uses principal components. For this method too, different combinations of pre-processing are applied and the resolution of the face images is varied. The performance is very commendable when the number of eigenfaces is more than 25. The highest accuracy of 93% is obtained when the number of eigenfaces is 26 and the resolution is 10x10.
Chapter 4 discussed support vector machines and the method of face recognition based on them. It is concluded that the performance of support vector machines is almost invariant to changes in resolution. Further, pre-processing does not appear to have any effect on the recognition accuracy, which suggests that SVMs are more robust with respect to pre-processing. The required features are extracted whether or not the images are pre-processed, unlike what is observed with the ANN and PCA techniques. Furthermore, the SVMs outperform both ANNs and PCA to yield the best recognition accuracies, of around 96%.
APPENDIX
MATLAB SCRIPTS
PRINCIPAL COMPONENT ANALYSIS
pca.m
clear
clc
close all
databasepath=uigetdir('C:\Program Files\MATLAB7\work','select database path');
% Read the training images (odd-numbered files) into the columns of T
T=[];
for i=1:2:400
str=int2str(i);
str=strcat('\',str,'.sgi');
str=strcat(databasepath,str);
img=sgiread(str);
B = imresize(img,[50 40],'bicubic');
temp = reshape(B',50*40,1);   % reshape 2-D image into a 1-D column vector
T = [T temp];   % 'T' grows after each turn
end
T=double(T)/256;   % normalise pixel intensities
m = mean(T,2);   % mean face
Train_Number = size(T,2);
% Compute the difference (mean-centred) image for each training image, Ai = Ti - m
A = [];
for i = 1 : Train_Number
temp = double(T(:,i)) - m;
A = [A temp];   % merging all centered images
end
T=A;
C=[];
% Read the test images (even-numbered files) into the columns of C
for i=2:2:400
str=int2str(i);
str=strcat('\',str,'.sgi');
str=strcat(databasepath,str);
img=sgiread(str);
B = imresize(img,[50 40],'bicubic');
temp = reshape(B',50*40,1);   % reshape 2-D image into a 1-D column vector
C = [C temp];
end
C=double(C)/256;
% Mean-centre the test images using the training mean
E = [];
for i = 1:size(C,2)
temp = double(C(:,i)) - m;
E = [E temp];
end
% Eigen-decomposition of the small matrix L = A'*A
L=A'*A;
[V D]=eig(L);
l_eig_vectors=[];
threshold=max(D);   % diagonal of D (the eigenvalues) as a row vector
for i=1:size(D,2)
if (D(i,i)>threshold(190))   % keep only the eigenvectors with the largest eigenvalues
l_eig_vectors=[l_eig_vectors V(:,i)];
end
end
eigenfaces=(A*l_eig_vectors)/256;   % eigenfaces in image space
for i=1:size(eigenfaces,2)
img=reshape(eigenfaces(:,i),40,50);
img=img';
subplot(5,2,i);
imshow(img);
drawnow;
end
% Project the centred training images into the face space
ProjectedImages = [];
Train_Number = size(eigenfaces,2);
for i = 1 : 200
temp = eigenfaces'*A(:,i);   % projection of a centred training image into face space
ProjectedImages = [ProjectedImages temp];
end
% Project the centred test images into the face space
ProjectedtestImages = [];
Testrain_Number = size(E,2);
for i = 1 : Testrain_Number
temp = eigenfaces'*E(:,i);   % projection of a centred test image into face space
ProjectedtestImages = [ProjectedtestImages temp];
end
Z=zeros(200,40);
MULTILAYER PERCEPTRON
mlp.m
clear
clc
close all
databasepath=uigetdir('C:\Program Files\MATLAB7\work','select database path');
% Read the training images (odd-numbered files) into the columns of T
T=[];
for i=1:2:400
str=int2str(i);
str=strcat('\',str,'.sgi');
str=strcat(databasepath,str);
img=sgiread(str);
B = imresize(img,[50 40],'bicubic');
temp = reshape(B',50*40,1);   % reshape 2-D image into a 1-D column vector
T = [T temp];
end
T=double(T)/256;
m = mean(T,2);   % mean face
Train_Number = size(T,2);
A = [];
for i = 1 : Train_Number
temp = double(T(:,i)) - m;   % difference image for each training image, Ai = Ti - m
A = [A temp];   % merging all centered images
end
T=A;
C=[];
for i=2:2:400
str=int2str(i);
str=strcat('\',str,'.sgi');
str=strcat(databasepath,str);
img=sgiread(str);
[irow icol] = size(img);
B = imresize(img,[50 40],'bicubic');
temp = reshape(B',50*40,1);   % reshaping 2-D image into a 1-D column vector
C = [C temp];   % 'C' grows after each turn
end
C=double(C);
C=C/256;
E = [];
for i = 1 : Train_Number
temp = double(C(:,i)) - m;   % difference image for each test image
E = [E temp];   % merging all centered test images
end
L=A'*A;
[V D]=eig(L);
l_eig_vectors=[];
threshold=max(D);   % diagonal of D (the eigenvalues) as a row vector
for i=1:size(D,2)
if (D(i,i)>threshold(190))   % keep only the eigenvectors with the largest eigenvalues
l_eig_vectors=[l_eig_vectors V(:,i)];
end
end
SUPPORT VECTOR MACHINES
clear
clc
close all
databasepath=uigetdir('C:\Program Files\MATLAB7\work','select database path');
T=[];
for i=1:2:400
str=int2str(i);
str=strcat('\',str,'.sgi');
str=strcat(databasepath,str);
img=sgiread(str);
img = medfilt2(img,[3 3]);
img=histeq(img);
[irow icol] = size(img);
B = imresize(img,[50 40],'bicubic');
temp = reshape(B',50*40,1);   % reshaping 2-D image into a 1-D column vector
T = [T temp];   % 'T' grows after each turn
end
T=double(T);
T=T/256;
m = mean(T,2);
Train_Number = size(T,2);
A = [];
for i = 1 : Train_Number
temp = double(T(:,i)) - m;   % difference image for each training image, Ai = Ti - m
A = [A temp];   % merging all centered images
end
T=A;
C=[];
for i=2:2:400
str=int2str(i);
str=strcat('\',str,'.sgi');
str=strcat(databasepath,str);
img=sgiread(str);
img = medfilt2(img,[3 3]);
img=histeq(img);
B = imresize(img,[50 40],'bicubic');
temp = reshape(B',50*40,1);   % reshape 2-D image into a 1-D column vector
C = [C temp];
end
C=double(C);
C=C/256;
% Mean-centre the test images using the training mean
E = [];
for i = 1 : Train_Number
temp = double(C(:,i)) - m;
E = [E temp];
end

% Classification of the test images with the trained multi-class SVM.
% xsup, w, b, nbsv, xapp and pos are produced by the SVM training step (not shown here).
% [ypred,maxi] = svmmultivaloneagainstone(xtest,xsup,w,b,nbsv,kernel,kerneloption);
% kerneloptionm.matrix=svmkernel(xtest,kernel,kerneloption,xapp(pos,:));
kerneloptionm.matrix=svmkernel(E',kernel,kerneloption,xapp(pos,:));
[ypred,maxi] = svmmultivaloneagainstone(E',xsup,w,b,nbsv,kernel,kerneloptionm);   % classify the centred test images
Important Note:
The MATLAB scripts can run on any basic version of MATLAB that has the Neural Network Toolbox; the PNM and SVM-KM toolboxes must be added to the work folder. These toolboxes are provided on the CD.
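For instance, assuming the toolboxes are unpacked into folders named SVM-KM and pnm inside the work directory (illustrative paths), they can be placed on the MATLAB path as follows:

% Add the SVM-KM and PNM toolboxes to the MATLAB path (illustrative folder names)
addpath(genpath(fullfile('C:\Program Files\MATLAB7\work','SVM-KM')));
addpath(genpath(fullfile('C:\Program Files\MATLAB7\work','pnm')));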