Eye Detection Using Optimal Wavelet Packets and Radial Basis Functions (RBFs)
Jeffrey Huang and Harry Wechsler
Department of Computer Science
George Mason University
Fairfax, VA 22030
{rhuang, wechsler}@cs.gmu.edu
http://cs.gmu.edu/~wechsler
Abstract: The eyes are important facial landmarks, both for image normalization, owing to their relatively
constant interocular distance, and for post processing, because they anchor model-based schemes. This
paper introduces a novel approach for the eye detection task using optimal wavelet packets for eye
representation and Radial Basis Functions (RBFs) for subsequent classification (labeling) of facial areas
as eye vs. non-eye regions. Entropy minimization is the driving force behind the derivation of optimal
wavelet packets. It decreases the degree of data dispersion and thus facilitates clustering (prototyping)
and capturing the most significant characteristics of the underlying (eye region) data. Entropy
minimization is thus functionally compatible with the first operational stage of the RBF classifier, that of
clustering, and this explains the improved RBF performance on eye detection. Our experiments on the
eye detection task support the merit of this approach: eye images compressed using
optimal wavelet packets lead to improved and robust performance of the RBF classifier compared to the
case where the original raw images are used by the RBF classifier.
Key Words: eye detection, face recognition, neural network performance, pattern classification, RBFs
(Radial Basis Functions), optimal sampling, wavelet packets.
International Journal of Pattern Recognition and Artificial Intelligence, Vol 13 No 7, 1999
1. Introduction
There are several related (face recognition) sub problems: (i) detection of a pattern as a face (in the
crowd) and its pose, (ii) detection of facial landmarks, (iii) face recognition - identification and/or
verification, and (iv) analysis of facial expressions (Samal et al., 1992). Face recognition starts with the
detection of face patterns in sometimes cluttered scenes, proceeds by normalizing the face images to
account for geometrical and illumination changes, possibly using information about the location and
appearance of facial landmarks, then identifies the faces using appropriate classification algorithms, and
post processes the results using model-based schemes, facial landmarks and logistic feedback (Chellappa
et al., 1995). The eyes are important facial landmarks, both for normalization, owing to their relatively
constant interocular distance, and for post processing, because they anchor model-based schemes.
Eye detection is approached within the framework of predictive learning, where the goal is to develop a
computational relationship for estimating the values for the output variables given only the values of the
input variables. Different taxonomies are available for predictive learning, among them one comprising
regression and density estimation, and classification, corresponding to the output variables being
continuous and discrete / categorical, respectively, even though any classification problem can be reduced to
a regression problem. Predictive learning can be thought of as a relationship y = f(x) + error, where the
error is due both to (measurement) noise and possibly to unobserved input variables. The main issues
one has to address are related to prediction (generalization) ability, data and dimensionality reduction
(complexity), explanation / interpretation capability, and possibly to biological plausibility. In the
framework of predictive learning, estimating (learning) a model (classifier) from finite data requires
specification of three concepts: a set of approximating functions (i.e., a class of models: dictionary), an
inductive principle and an optimization (parameter estimation) procedure. The notion of inductive
principle is fundamental to all learning methods. Essentially, an inductive principle provides a general
prescription for what to do with the training data in order to obtain (learn) the model. In contrast, a
learning method is a constructive implementation of an inductive principle (i.e., an optimization or
parameter estimation procedure) for a given set of approximating functions in which the model is sought,
such as feed forward nets with sigmoid units, radial basis function networks, etc. (Cherkassky and Mulier,
1998). There is just a handful of known inductive principles (Empirical Risk Minimization (ERM),
Regularization, Structural Risk Minimization (SRM), Bayesian Inference, Minimum Description Length),
but there are infinitely many learning methods based on these principles. It is the prediction risk, the
expected performance of an estimator (classifier) for new (future) samples, which determines to what
degree adaptation has been successful so far. Based on the above discussion our approach for eye
detection involves wavelets as the set of approximating functions (dictionary), ERM as the inductive
principle, and RBFs as the learning and optimization method implementing the inductive principle. Our
approach also displays adaptiveness with respect to the dictionary aspect as it seeks an optimal subset of
wavelet kernels to approximate eye regions. This paper introduces an approach for the eye detection task
using optimal wavelet packets for compressed eye representations and Radial Basis Functions (RBFs) for
classification (labeling) of facial areas as eye vs. non-eye regions.
2. Signal Representation and Optimal Sampling
One of the goals for signal processing (SP) is to restore - recover - reconstruct signals using appropriate
sampling procedures. Signals, corresponding to sensory inputs, provide the information link connecting
intelligent systems to their environment. Signals have to be properly sampled and processed to facilitate
optimal behavior and sensory-motor control. Sampling criteria are determined by examining the spectral
contents of the signal and by setting the sampling rate to match the rate of change observed in the given
signal. Intuitively, fast changing signals contain more information and thus should be sampled more
often. If sampling fails to keep track of the fast changing signals, information (detail) is lost, and signal
reconstruction is beset by interference (aliasing) effects. To avoid aliasing effects the Shannon sampling
theorem sets the minimum (Nyquist) rate for sampling a bandwidth-limited continuous signal so it can be
uniquely recovered (restored) from its samples.
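The aliasing effect described above can be illustrated with a short Python sketch. It is only an illustration under assumed values: the 9 Hz sine, the two sampling rates, and the helper name `dominant_frequency` are choices made for this example and do not come from the experiments reported here.

```python
import numpy as np

def dominant_frequency(signal_hz, sampling_hz, duration_s=1.0):
    """Return the strongest frequency (Hz) seen after sampling a sine wave."""
    n_samples = int(round(sampling_hz * duration_s))
    t = np.arange(n_samples) / sampling_hz
    samples = np.sin(2.0 * np.pi * signal_hz * t)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sampling_hz)
    return float(freqs[np.argmax(spectrum)])

# A 9 Hz sine needs more than 18 samples per second (the Nyquist rate).
print(dominant_frequency(9.0, 50.0))  # sampled fast enough: the peak stays at 9 Hz
print(dominant_frequency(9.0, 10.0))  # undersampled: the peak folds down to 1 Hz
```

When the sampling rate falls below the Nyquist rate, the 9 Hz component is indistinguishable from a 1 Hz component, which is the loss of detail referred to above.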
Approximation and estimation bounds for artificial networks, when used as universal approximators,
have been derived by Barron (1991) using complexity regularization and statistical risk. Barron has
specifically shown that the mean squared error between the estimated network and a target function f is
bound by
$$ O\left(\frac{C_f^2}{n}\right) + O\left(\frac{n d}{N} \log N\right) \tag{1} $$
where n is the number of nodes, d is the input dimension of the function, N is the number of training
observations, and $C_f$ is the first absolute moment of the Fourier magnitude distribution of f. The two
contributions to the total risk are the approximation error and the estimation error. The approximation
error (bias) refers to the distance between the target function and the closest neural network function of a
given architecture, while the estimation error (variance) refers to the distance between this ideal network
function and an estimated network function. For the case $n \sim C_f \left(\frac{N}{d \log N}\right)^{1/2}$ the optimized order of the bound
on the mean integrated squared error is $O\left(C_f \left(\frac{d}{N} \log N\right)^{1/2}\right)$. The approximation bounds thus estimate the
performance of the network in terms of the spectral characteristics of the data being analyzed. The
preceding result reconfirms known concepts from signal analysis and information theory referred to
earlier, and establishes strong links between optimal sampling and neural network performance. As an
example, Malinowski and Zurada (1994) developed a new approach to the problem of n-dimensional
function approximation using two-layer neural networks using the Nyquist rate to determine the optimum
number of learning patterns. Choosing the smallest but still sufficient set of training vectors resulted in a
reduced number of hidden neurons and learning time, and it confirmed the dependence of network
performance on proper pattern sampling.
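As a worked step, offered here as a sketch rather than as part of Barron's statement, the order of n quoted above follows from balancing the approximation and estimation terms of the bound, with all constants taken as 1:

```latex
\begin{align*}
\frac{C_f^2}{n} = \frac{n d}{N}\log N
\;&\Longrightarrow\;
n^2 = \frac{C_f^2\, N}{d \log N}
\;\Longrightarrow\;
n = C_f \left(\frac{N}{d \log N}\right)^{1/2}, \\
\left.\frac{C_f^2}{n}\right|_{n = C_f \left(N/(d \log N)\right)^{1/2}}
&= C_f \left(\frac{d}{N}\log N\right)^{1/2}
= \left.\frac{n d}{N}\log N\right|_{n = C_f \left(N/(d \log N)\right)^{1/2}} .
\end{align*}
```

Both terms are then of the same order, so their sum is of order $C_f \left(\frac{d}{N}\log N\right)^{1/2}$, in agreement with the optimized bound stated above.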
The next fundamental concept bearing on optimal signal recovery comes from information theory and it is
due to Gabor (1946). The now famous concept, known as the uncertainty principle, revolves around the
simultaneous resolution of information in (2D) spatial and spectral terms, and it is closely related to the
choice of optimal lattices of receptive fields (kernel bases) for representing (decomposing) some
underlying signal. As recounted by Daugman (1990), Gabor pointed out that there exists a "quantum
principle" for information, which he illustrated through the construction of a spectrogram-like information
diagram. The information diagram, a plane whose axes correspond to time (or space) and frequency,
must necessarily be grainy, or quantized, in the sense that it is not possible for any signal or filter (and
hence any carrier or detector of information) to occupy less than a certain minimal area in this plane. This
minimal or quantal area reflects the inevitable trade-off between time (space) resolution $\Delta t$ ($\Delta s$) and
frequency resolution $\Delta\omega$, and equals the lower bound on their product. Gabor further noted that
Gaussian-modulated complex exponentials offer the optimal way to encode signals (of arbitrary spectral
content) or to represent them, if one wished the code (basis functions) primitives to have both a well-
defined epoch of occurrence in time (space) and a well-defined spectral identity. The code primitives are
also referred to as kernels or RFs (receptive fields) in analogy to biological vision. The conclusion to be
drawn from the above discussion is better understood and visualized if one were to associate time (space)
occurrence with localization, while frequency change is associated with the speed at which a signal would
change. One thus cannot achieve at the same time both optimal signal localization and optimal tracking
of details related to the changes taking place in spatiotemporal signal patterns. Note that not all signal
changes are intrinsic to the signal being analyzed. As an example, both rapid signal changes (high
frequency) corresponding to noise and slow signal changes (close to dc) corresponding to illumination
changes should be discounted. The Human Visual System (HVS) has adapted so that it implements a strategy
as outlined above by properly weighting the frequency contents through the Modulation Transfer
Function (MTF).
3. Wavelet Representation
The wavelet basis functions (Mallat, 1989), a self-similar and spatially localized code, are spatial
frequency/orientation tuned kernels, and provide one possible tessellation of the conjoint spatial/spectral
signal domain considered above. The corresponding wavelet hierarchy (pyramid) is obtained as the
result of orientation-tuned decompositions at a dyadic (powers of two) sequence of scales. The 2D
wavelet representation W of the function f(x, y) is
$$ W(a_x, a_y, s_x, s_y) = f(x, y) \otimes (a_x a_y)^{-1/2}\, \psi^{*}\left[(x - s_x)(a_x)^{-1},\ (y - s_y)(a_y)^{-1}\right] \tag{2} $$
where $\psi(x, y)$ is the "mother" wavelet, $a_x$ and $a_y$ are scale parameters, and $s_x$ and $s_y$ are shift parameters.
The wavelet representation provides for multi-resolution analysis (MRA) through the orthogonal
decomposition of a function along basis functions consisting of appropriate translations and dilations of
the mother wavelet function. Continuous wavelets, defined using a pair of functions $\phi$ (scaling function)
and $\psi$ (mother wavelet function), satisfy
$$ \phi_{a,b}(x) = |a|^{1/2}\, \phi(ax - b) \tag{3} $$
$$ \psi_{a,b}(x) = |a|^{1/2}\, \psi(ax - b) \tag{4} $$
where a, b are real numbers, and $\phi_{a,b}$ and $\psi_{a,b}$ correspond to the scaling and mother wavelet functions
dilated by a, and translated by b. The discrete wavelet transform (DWT) is obtained for the choice of
$a = a_0^{-m}$, $b = n b_0$, where m, n are integer numbers, with the scaling and mother wavelet functions becoming
$$ \phi_{m,n}(x) = |a_0|^{-m/2}\, \phi(a_0^{-m} x - n b_0) \tag{5} $$
$$ \psi_{m,n}(x) = |a_0|^{-m/2}\, \psi(a_0^{-m} x - n b_0) \tag{6} $$
For the choice $a_0 = 2$, $b_0 = 1$ one obtains
$$ \phi_{m,n}(x) = 2^{-m/2}\, \phi(2^{-m} x - n) \tag{7} $$
$$ \psi_{m,n}(x) = 2^{-m/2}\, \psi(2^{-m} x - n) \tag{8} $$
The dilation equation, relating the mother wavelet to the scaling function, is:
$$ \psi(x) = \sqrt{2} \sum_k h_1(k)\, \phi(2x - k), \quad \text{where } h_1(k) = (-1)^k h_0(1 - k) \tag{9} $$
The wavelet coefficients and the corresponding wavelet decomposition are given as
$$ c_{m,n} = \int_{-\infty}^{+\infty} f(x)\, \psi_{m,n}(x)\, dx \tag{10} $$
$$ f(x) = \sum_{m,n} c_{m,n}\, \psi_{m,n}(x) \tag{11} $$
Daubechies (1998) has shown how one can derive the corresponding low (h0) and high (h1) pass filters
for designing appropriate families of scaling and mother wavelet functions. She obtained a complete
parameterization of all finite-length sequences $h_0(k)$ that satisfy the following linear and quadratic conditions:
$$ \sum_n h_0(n) = \sqrt{2} \tag{12} $$
and
$$ \sum_n h_0(n)\, h_0(n + 2l) = \delta(l) \tag{13} $$
Defining $h_1(k) = (-1)^k h_0(N-1-k)$, with the filter length N even, i.e. N = 2K, Daubechies writes
$H_0(z) = (1 + z^{-1})^K P(z)$, and parameterizes the autocorrelation sequence P(z) for various values of K.
That P(z) is bounded on the unit circle by $2^{K - 1/2}$ is sufficient for the wavelet basis to be orthonormal.
Using the sequences h0 and h1, one then computes the Discrete Wavelet Transform (DWT) using the
structure shown in Fig. 1. Mallat (1989) has shown that for any orthonormal wavelet basis, the sequences
of two-channel filter banks h and g can be utilized to compute the DWT with perfect reconstruction.
Fig. 1 also shows the implementation of Eqns. (14) and (15), derived from Eqn. (8):
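$$ \langle x, \phi_{i,j+1,k} \rangle = \sum_n h_i(n - 2k)\, \langle x, \phi_{0,j,n} \rangle \tag{14} $$
and
$$ \langle x, \phi_{0,j,n} \rangle = \sum_{i,k} h_i(n - 2k)\, \langle x, \phi_{i,j+1,k} \rangle \tag{15} $$
where $i \in \{0, 1\}$.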
Figure 1: Computation of DWT Using a Filter Bank
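As an illustrative sketch (not the authors' implementation), one analysis step in the spirit of Eqn. (14) and the matching synthesis step of Eqn. (15) can be written in Python for a periodic signal. The four-tap Daubechies low-pass filter is assumed, with h1(k) = (-1)^k h0(N-1-k); the function names and the periodic boundary handling are choices made for this example.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# Four-tap Daubechies low-pass filter h0 and its high-pass companion h1.
H0 = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
H1 = np.array([(-1) ** k * H0[len(H0) - 1 - k] for k in range(len(H0))])

def analysis_step(x):
    """Filter with h0 and h1 and keep every second output (periodic signal)."""
    n = len(x)
    approx = np.zeros(n // 2)
    detail = np.zeros(n // 2)
    for k in range(n // 2):
        for m in range(len(H0)):          # m plays the role of n - 2k
            sample = x[(2 * k + m) % n]
            approx[k] += H0[m] * sample
            detail[k] += H1[m] * sample
    return approx, detail

def synthesis_step(approx, detail):
    """Rebuild the finer-level samples from the two half-length channels."""
    n = 2 * len(approx)
    x = np.zeros(n)
    for k in range(len(approx)):
        for m in range(len(H0)):
            x[(2 * k + m) % n] += H0[m] * approx[k] + H1[m] * detail[k]
    return x

rng = np.random.default_rng(0)
signal = rng.standard_normal(32)
a, d = analysis_step(signal)
print(np.allclose(synthesis_step(a, d), signal))  # checks perfect reconstruction
```

Because the shifted filters form an orthonormal set, the synthesis step is simply the transpose of the analysis step, and the signal energy is split between the two channels without loss.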
By using the polyphase-domain analysis, the perfect reconstruction filter banks, having finite impulse
response (FIR), must satisfy a necessary and sufficient condition for aliasing cancellation (anti-aliasing):
$$ G_0(z) H_0(-z) + G_1(z) H_1(-z) = 0 \tag{16} $$
where $H_0$, $H_1$ are the analysis filters and $G_0$, $G_1$ are the synthesis filters used in the forward and inverse
polyphase transformation. For aliasing cancellation in the above two-channel filter bank, one can design
Quadrature Mirror Filters (QMF) that satisfy the constraints:
$$ H_1(z) = H_0(-z), \tag{17} $$
$$ G_0(z) = H_0(z), \tag{18} $$
$$ G_1(z) = -H_1(z) = -H_0(-z). \tag{19} $$
Substituting the above into Eqn. (16) leads to $H_0(z) H_0(-z) - H_0(-z) H_0(z) = 0$, i.e. aliasing is cancelled. For
perfect reconstruction, the filter bank must also satisfy:
$$ G_0(z) H_0(z) + G_1(z) H_1(z) = 2 z^{-l} \tag{20} $$
Using the QMF constraints, Eqn. (20) becomes
$$ H_0^2(z) - H_0^2(-z) = 2 z^{-l} \tag{21} $$
This means that on the unit circle $H_0(-z) = H_0(e^{j(\omega + \pi)})$ is the mirror image of $H_0(z)$, and the above relation
explains the name QMF. The QMF design is a useful way to implement perfect reconstruction (PR) filter banks for multirate
signal analysis and links directly to the multi-resolution analysis (MRA) supported by wavelet theory.
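The QMF constraints of Eqns. (17)-(19) can be checked numerically. The sketch below is a minimal illustration, assuming filters are stored as coefficient arrays in increasing powers of z^-1 and using the two-tap Haar filter as the example; the helper names are hypothetical and introduced only for this example.

```python
import numpy as np

def negate_z(h):
    """Coefficients of H(-z), given those of H(z) in powers of z^-1."""
    return h * (-1.0) ** np.arange(len(h))

def qmf_bank(h0):
    """Build H0, H1, G0, G1 from h0 following Eqns. (17)-(19)."""
    h1 = negate_z(h0)   # H1(z) = H0(-z)
    g0 = h0.copy()      # G0(z) = H0(z)
    g1 = -h1            # G1(z) = -H1(z)
    return h0, h1, g0, g1

def alias_term(h0, h1, g0, g1):
    """Left-hand side of Eqn. (16); polynomial products are convolutions."""
    return np.convolve(g0, negate_z(h0)) + np.convolve(g1, negate_z(h1))

def distortion_term(h0, h1, g0, g1):
    """Left-hand side of Eqn. (20)."""
    return np.convolve(g0, h0) + np.convolve(g1, h1)

haar = np.array([1.0, 1.0]) / np.sqrt(2.0)
bank = qmf_bank(haar)
print(alias_term(*bank))       # numerically zero: aliasing is cancelled
print(distortion_term(*bank))  # close to [0, 2, 0], i.e. 2 z^-1 (l = 1)
```

The alias term vanishes for any choice of h0 under these constraints, since the two products in Eqn. (16) cancel term by term; the Haar filter additionally meets Eqn. (21) with l = 1.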
The self-similar Gabor basis functions g, a special case of non-orthogonal wavelets, correspond to
sinusoids modulated by a Gaussian and can be easily tuned to any bandwidth (scale) and orientation. Note
that the self-similar Gabor wavelets are redundant because they are non-orthogonal; they lead to an overcomplete
dictionary of basis functions and are thus expected to provide better performance than a dictionary
consisting of orthogonal bases (Daugman, 1990).
4. Optimal Wavelet Packets
The concept of optimal sampling can be expanded using different fitness criteria. As an example, Wilson
(1995) mentions the requirement for more complex approaches to signal representation, whose common
feature is adaptivity. It should be possible to automatically adjust the resolution of the representation, i.e.
its reconstruction ability, in order to provide the best fit to a given data set, rather than using a fixed
representation, whose resolution is only bound to be a compromise between space / time and frequency.
Examples of such an approach include wavelet packets (Coifman and Wickerhauser, 1992), where the
wavelet dictionary is drawn using maximal energy concentration and/or least Shannon entropy.
Let H0 and H1 be a pair of QMFs for one-dimensional signal analysis. The wavelet packets produced by
iterating these filters are products of one-dimensional wavelet packets. The wavelet coefficients in
homogeneous tree structure are generated by applying the QMF convolutions recursively. Suppose the
periodic signal is of size $N = 2^L$. This decomposition can be developed down to level L, at which point
each subspace will have a single element. Each level requires O(N) operations and it is proportional to
the length of the QMFs. The total complexity for calculating the wavelet packet coefficients, i.e.
correlating with the entire collection of wavelet coefficients, is O(NlogN). Once the coefficients of the
discrete wavelet transform (DWT) are derived one becomes interested in choosing an optimal subset, with
respect to some reconstruction criteria, for data compression purposes. Towards that end, Coifman and
Wickerhauser (1992) define the Shannon entropy $\mu(v)$ as
$$ \mu(v) = -\sum_i v_i^2 \ln v_i^2 \tag{22} $$
where $v = \{v_i\}$ is the corresponding set of wavelet coefficients. The Shannon entropy measure is then
used as a cost function for finding the best subset of wavelet coefficients. Note that minimum entropy
corresponds to less randomness (dispersion) and it thus leads to clustering. If one generates the complete
wavelet representations (wavelet packets) as a binary tree, the selection of the best coefficients is done by
comparing the entropy of wavelet packets corresponding to adjacent tree levels (father - son
relationships). One compares the entropy of each adjacent pair of nodes to the entropy of their union and
the subtree is expanded further only if it results in lower entropy. For a signal of size n the DWT
yields n coefficients, and the search for optimal coefficients yields the set (still of size n) for which the
Shannon entropy is minimized. Data compression, subject to the same entropy criterion, ranks the optimal
coefficients according to their magnitude, and picks subsets consisting of m coefficients, where m
is less than n.
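The best-basis search described above can be sketched in Python as a recursive comparison between the entropy of a node and the summed entropy of its two children. This is a minimal sketch under stated assumptions, not the authors' code: it assumes a 1-D periodic signal whose length is a power of two and normalized to unit energy, the four-tap Daubechies QMF pair, and hypothetical helper names (`split`, `shannon_cost`, `best_basis`).

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
H0 = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
H1 = np.array([(-1) ** k * H0[len(H0) - 1 - k] for k in range(len(H0))])

def split(v):
    """One periodic QMF step: the low- and high-pass children of node v."""
    n = len(v)
    idx = (2 * np.arange(n // 2)[:, None] + np.arange(len(H0))[None, :]) % n
    return v[idx] @ H0, v[idx] @ H1

def shannon_cost(v):
    """Entropy of Eqn. (22), -sum v_i^2 ln v_i^2, with 0 ln 0 taken as 0."""
    p = v ** 2
    p = p[p > 0.0]
    return float(-np.sum(p * np.log(p)))

def best_basis(v, max_level, path=""):
    """Return (cost, leaves); leaves is a list of (path, coefficients)."""
    own_cost = shannon_cost(v)
    if max_level == 0 or len(v) < 2:
        return own_cost, [(path, v)]
    low, high = split(v)
    low_cost, low_leaves = best_basis(low, max_level - 1, path + "L")
    high_cost, high_leaves = best_basis(high, max_level - 1, path + "H")
    if low_cost + high_cost < own_cost:      # keep the split only if entropy drops
        return low_cost + high_cost, low_leaves + high_leaves
    return own_cost, [(path, v)]

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)
signal /= np.linalg.norm(signal)             # unit energy, so the v_i^2 sum to 1
cost, leaves = best_basis(signal, max_level=8)
print(cost <= shannon_cost(signal), sum(len(c) for _, c in leaves))
```

The selected leaves always hold as many coefficients as the input signal, which matches the observation above that the optimal set is still of size n; compression then keeps only the m largest of them.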
5. Functional Approximation and Pattern Classification
Signal representation is a particular case of functional approximation, and one could use the wavelet
kernels introduced earlier as universal approximators. Wavelet networks (wavenets) using stochastic
gradient descent akin to back propagation (BP) (Zhang and Benveniste, 1992) are such an example.
Signal denoising is closely related to function estimation from noisy samples. Cherkassky and Shao
(1999) have recently proposed a novel signal estimation/denoising approach using the Vapnik-
Chervonenkis (VC) theory and Structural Risk Minimization (SRM) as the inductive principle. The
proposed approach emphasizes the role of an appropriate ordering of wavelet functions for defining the
structuring SRM elements, for modeling complexity and for reducing the predicted estimation risk.
Cherkassky and Shao further showed that under finite sample settings, the choice of appropriate ordering
(of the basis functions) and sound complexity control are more important than choosing the right basis,
whereas under large sample settings the type of basis functions becomes more important. Functional
approximation in terms of dynamic receptive fields (RFs) using Radial Basis Functions (RBFs)
(Lippmann and Ng, 1991) and/or Gaussian bar architectures consisting of semilocal and overlapping units
(Hartman and Keeler, 1991) have been considered as well. Pattern classification is a special case of
functional approximation (regression) when the only possible outputs are categorical and requires special
consideration.
Pattern recognition, a difficult but fundamental task for intelligent systems, depends heavily on the
particular choice of the features used by the (pattern) classifier. Feature selection in pattern recognition
involves the derivation of salient features from the raw input data in order to reduce the amount of data
used for classification and simultaneously provide enhanced discriminatory power. The selection of an
appropriate set of features is one of the most difficult tasks in the design of a pattern classification system.
At the lowest level, the raw features are derived from noisy sensory data, the characteristics of which are
complex and difficult to characterize. In addition, there is considerable interaction among the low-level
features which must be identified and exploited. The typical number of possible features, however, is so
large as to prohibit any systematic exploration of all but a few possible interaction types (e.g., pairwise
interactions). In addition, any sort of performance oriented evaluation of feature subsets involves
building and testing the associated classifier, resulting in additional overhead costs. As a consequence, a
fairly standard approach is to attempt to pre-select a subset of features using some abstract measures
related to important properties of good feature sets such as orthogonality, infomax, large variance,
multimodality of marginal distributions, and/or low entropy. This approach has been recently dubbed the
"filter" approach (Kohavi and John, 1995) and it is generally much less resource intensive than building
and testing the associated classifiers, but it may result in sub optimal performance if the abstract measures
do not correlate well with actual performance.
It is still difficult, however, to develop abstract feature space measures of optimality which would
guarantee optimality for actual classification tasks. In practice this is best achieved by including some
form of performance evaluation of the feature subsets while searching for good subsets. This approach,
recently dubbed the "wrapper" approach (Kohavi and John, 1995), typically involves building a classifier
based on the feature subset being evaluated, and using the performance of the classifier as a component of
the overall evaluation. The wrapper approach should produce better classification performance than the
filter approach, but it adds considerable overhead to an already expensive search process. This in turn
usually restricts the number of alternative feature subsets one can afford to evaluate, and thus it may also
produce sub optimal results. Our hybrid approach for eye detection integrates (in an independent fashion)
and takes advantage of both the filter and the wrapper approach. The filter approach seeks optimal
wavelet packets for best features, while the wrapper approach seeks the best classifier in terms of Radial
Basis Functions (RBFs).
The RBFs, related to kernel (potential) functions, can perform functional approximation characteristic of
sparse surface reconstruction while they can also support a general hybrid scheme connecting signal
approximation and pattern classification. Signal approximation, characteristic of self-organization using
adaptive cluster prototypes, provides the needed bases for defining the classifier. As an example, for
some function f(x), its approximation (decomposition in terms of basis functions), given as $\langle c_i, T_i, R \rangle$, is
found as
$$ f(x) = \sum_i c_i\, N\left(\| x - T_i \|_R^2\right) \tag{23} $$
where $T_i$ are the kernels (dictionary) of the function to be learned, given in terms of parameterized
clusters across the space of geometrical (and/or topological) variability and weighted by coefficients $c_i$
learned through straightforward linear algebra (matrix inversion via SVD), N is the Gaussian distribution,
R is the normalization matrix, and the corresponding norm is defined as
$$ \| x - T_i \|_R^2 = (x - T_i)^T R^T R\, (x - T_i) \tag{24} $$
The wavenets, referred to earlier, implement the same general methodology as that used by RBFs.
(Universal) approximation takes place by establishing formal links between the network coefficients and
some signal transform, such as the wavelet transform. Training for the set {x, f(x)} is accomplished using
stochastic gradient descent such as Back Propagation (BP) and the approximation found, similar to the
one corresponding to RBFs, is $g(x) = \sum_i w_i\, \psi[D_i R_i (x - t_i)]$, where $w_i$ are the network weights and $\psi$ stands for the wavelet bases.
6. Eye Detection
We now address the task of eye detection in the context of the face recognition problem. The ability to
detect salient facial features is an important component of any face recognition system. Among the many
facial features available it appears that the eyes play the most important role in both face recognition and
social interaction. Detecting the eyes serves an important role in face normalization and thus facilitates
further localization of other facial landmarks. It is eye detection that allows one to focus attention on
salient facial configurations, to filter out structural noise, and to achieve eventual face recognition. The
hybrid approach for eye detection involves deriving first optimal (reconstructed) candidate eye regions
(windows) in terms of (best basis) wavelet packets followed by their classification (as eye vs non-eye)
using Radial Basis Functions (RBFs). It is shown that an optimal choice of a subset of wavelet bases
provides for improved RBF performance on the eye detection task when compared against using raw eye
images.
6.1 Experimental Setup
The experimental data for the eye detection task come from the FERET database (Phillips et al, 1998).
The FERET database consists of 1,564 sets comprising 14,126 face images, corresponding to 1,199
individuals. Since many of the face images were acquired during different photo sessions, the
lighting conditions, the size of the facial images and their orientation can vary. In particular, as the
frontal poses are slightly rotated to the right or left, the results reported later on show that the eye
detection technique is robust to slight rotations. The diversity of the FERET database is across gender,
race, and age, and it includes 365 duplicate sets taken at different times.
The facial data set drawn from FERET consists of 45 frontal facial images taken at the resolution of 128 x
128. Windows of size 8 x 32 were manually cut from different regions of the face corresponding to eye
and non-eye images as shown in Fig.2. The training data consists of 50 eye (+) images and 20 non-eye (-)
images, while the testing data consists of 40 eye (+) and 10 non-eye (-) images. The eye images are raster
scanned to form 1-D vectors consisting of 256 elements.
Figure 2. Eye (+) and Non-Eye (-) Examples
6.2 Eye Representation Using Optimal Wavelet Packets
Wavelet packets, corresponding to the eye and non-eye images, are derived using the Daubechies family
of order two, as shown in Fig. 3. The QMF low pass h(i) and high pass g(i) coefficients for filter size 4 are:
i    h(i)      g(i)
1    0.4830   -0.1294
2    0.8365   -0.2241
3    0.2241    0.8365
4   -0.1294   -0.4830
Figure 3. Daubechies (Scale and Wavelet) Family of Order Two and its FFT (panels: scale function $\phi_2(x)$ versus x; wavelet function $\psi_2(x)$ versus x; FFT of the scale function versus frequency in radians; FFT of the wavelet function versus frequency in radians)
Optimal (wavelet packet) decompositions are then found minimizing the Shannon entropy as described in
Sect. 4. The quality of the reconstructed images, using the whole set of 256 coefficients or a reduced set
consisting of only the 16 largest coefficients from the optimal (minimal entropy) induced tree, is shown in
Fig. 4. The root mean square error ($e_{rms}$) between the reconstructed image $f_R$ and the original image $f_O$ is computed
as:
$$ e_{rms} = \left[ \frac{1}{256} \sum_{i=1}^{256} \left( f_R(x_i) - f_O(x_i) \right)^2 \right]^{1/2} \tag{25} $$
Note that perfect reconstruction is achieved when one uses the whole set of 256 wavelet coefficients. The
reconstructed images capture relevant characteristics (features) and discard structural noise. The
reconstructed images (here using 16 or 64 largest coefficients) and raw original images are then passed to
the Radial Basis Function (RBF) classifier to assess the relevance, if any, of optimal signal
decompositions (functional approximation) for classification tasks.
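The compression step and the error measure of Eqn. (25) can be illustrated with a short Python sketch. It is only a stand-in under stated assumptions: an orthonormal Haar transform replaces the best-basis wavelet packet transform for brevity, the input is a random 256-element vector rather than an eye image, and the helper names are hypothetical.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix, for n a power of two."""
    if n == 1:
        return np.array([[1.0]])
    half = haar_matrix(n // 2)
    top = np.kron(half, [1.0, 1.0])                 # coarser-scale rows
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # finest-scale differences
    return np.vstack([top, bottom]) / np.sqrt(2.0)

def keep_largest(coeffs, m):
    """Zero all but the m largest-magnitude coefficients."""
    kept = np.zeros_like(coeffs)
    if m <= 0:
        return kept
    top = np.argsort(np.abs(coeffs))[-m:]
    kept[top] = coeffs[top]
    return kept

def compress(signal, m):
    """Transform, keep m coefficients, and reconstruct the signal."""
    w = haar_matrix(len(signal))
    return w.T @ keep_largest(w @ signal, m)

def rms_error(reconstructed, original):
    """Root mean square error between two equally sized vectors, as in Eqn. (25)."""
    return float(np.sqrt(np.mean((reconstructed - original) ** 2)))

rng = np.random.default_rng(0)
window = rng.standard_normal(256)   # stands in for a raster-scanned 8 x 32 window
for m in (16, 64, 256):
    print(m, rms_error(compress(window, m), window))
```

Because the transform is orthonormal, the squared error equals the energy of the discarded coefficients, so the error cannot increase as m grows and vanishes (up to rounding) when all 256 coefficients are kept.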
Number of best wavelet bases    e_rms
16                              29.36
256                             0
(the columns showing the reconstructed image and its 1-D signal for each row are omitted)
Figure 4. Reconstructed Eye Images Using Best Wavelet Packet Bases
6.3 Classification Using Radial Basis Functions (RBFs)
The construction of the RBF network involves three different layers (Fig. 5). The input layer is made up
of source nodes (sensory units), here fed with optimal wavelet packets. The second layer is a hidden layer
whose goal is to cluster the data and further reduce its dimensionality. The output layer supplies the
response of the network to the activation patterns applied to the input layer, corresponding here to the eye
and non-eye classes. The transformation from the input space to the hidden-unit space is non-linear,
whereas the transformation from the hidden-unit space to the output space is linear. Connections between
the input and middle layers have unit weights and, as a result, do not have to be trained. Nodes in the
middle layer, called basis function (BF) nodes (kernels), are similar to the clusters derived using k-
means training. The BF kernels produce a localized response to the input using Gaussian kernels and
each hidden unit can further be viewed as a localized receptive field (RF).
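Figure 5. Radial Basis Function (RBF) Architecture (layers, from input to output: input nodes (k), connected through unit weights to Gaussian (G) basis function nodes (i), connected through linear weights to output nodes (j), followed by selection of the maximum output)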
The basis functions (BFs) used are Gaussians, and the activation level $y_i$ of the hidden unit i is given by:
$$ y_i = \Phi_i(\| X - \mu_i \|) = \exp\left[ -\sum_{k=1}^{D} \frac{(x_k - \mu_{ik})^2}{2 h \sigma_{ik}^2 o} \right] \tag{26} $$
where h is a proportionality constant for the variance, $x_k$ is the kth component of the input vector $X = [x_1, x_2, ..., x_D]$,
$\mu_{ik}$ and $\sigma_{ik}^2$ are the kth components of the mean and variance vectors defining the BF, and
o is the overlap factor between BFs. The outputs of the hidden units lie between 0 and 1 and can be
interpreted as degrees of fuzzy membership; the closer the input to the center of the Gaussian, the larger
the response of the BF node. The activation level $Z_j$ of an output unit is given by:
j i
i
ij j
w y w Z
0
+
(27)
where $Z_j$ is the output of the jth output node, $y_i$ is the activation of the ith BF node, $w_{ij}$ is
the weight connecting the ith BF node to the jth output node, and $w_{0j}$ is the bias or the threshold of
the jth output node. The bias comes from the weights associated with a BF node that has a constant unit
output regardless of the input. An unknown vector X is classified as belonging to the class associated with
the output node j which yields the largest output $Z_j$.
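
As an illustration of Eqs. (26) and (27) and the select-maximum rule, the following is a minimal sketch in plain Python. The function names (`bf_activation`, `rbf_classify`), the list-based data layout, and the toy numbers are assumptions made for this example only; they are not taken from the authors' implementation.

```python
import math

def bf_activation(x, mu, var, h=1.0, o=1.0):
    """Eq. (26): Gaussian activation of one BF node for input vector x.

    mu and var are the mean and variance vectors of the BF node;
    h (proportionality constant) and o (overlap factor) scale the variance.
    """
    s = sum((xk - mk) ** 2 / (2.0 * h * vk * o)
            for xk, mk, vk in zip(x, mu, var))
    return math.exp(-s)

def rbf_classify(x, centers, variances, weights, biases, h=1.0, o=1.0):
    """Eq. (27) followed by the select-maximum rule.

    weights[i][j] connects BF node i to output node j; biases[j] is w_0j.
    Returns (index of the winning output node, list of outputs Z_j).
    """
    y = [bf_activation(x, mu, var, h, o)
         for mu, var in zip(centers, variances)]
    z = [biases[j] + sum(y[i] * weights[i][j] for i in range(len(y)))
         for j in range(len(biases))]
    winner = max(range(len(z)), key=z.__getitem__)
    return winner, z

# Toy, made-up numbers: two BF nodes in a 2-D input space, two output nodes.
centers = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]   # BF 0 feeds class 0, BF 1 feeds class 1
biases = [0.0, 0.0]
winner, z = rbf_classify([0.2, -0.1], centers, variances, weights, biases)
```

Because the toy input lies close to the first center, the first BF node responds most strongly and the first output node wins; an input placed exactly at a center gives that node an activation of 1.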
The RBF input consists of normalized optimal wavelet packet values fed to the network as 1-D vectors.
The width of the Gaussian for each cluster is set to the maximum of {the distance between the center of the
cluster and its farthest member (the within-class diameter), the distance between the center of the
cluster and the closest pattern from all the other clusters}, multiplied by the overlap factor o. The width is
dynamically refined using different proportionality constants h. The hidden layer yields the equivalent of
a functional eye basis, where each cluster node encodes some common characteristics across the eye
space.
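
The width rule above can be sketched as follows. This is a hedged illustration only: the helper names (`euclid`, `cluster_width`) and the toy points are hypothetical, Euclidean distance is assumed, and how the resulting width is mapped onto the variances of Eq. (26) is not specified here.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length vectors (an assumption)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cluster_width(center, members, others, o=1.0):
    """Width = o * max(within-class distance, distance to nearest foreign pattern).

    members: patterns assigned to this cluster.
    others:  patterns belonging to all the other clusters.
    """
    within = max(euclid(center, m) for m in members)
    nearest_other = min(euclid(center, p) for p in others)
    return o * max(within, nearest_other)

# Toy, made-up data: farthest member is 2 away, nearest foreign pattern is 5 away.
center = [0.0, 0.0]
members = [[1.0, 0.0], [0.0, -2.0]]
others = [[3.0, 4.0], [6.0, 8.0]]
width = cluster_width(center, members, others, o=2.0)
```

With these numbers the nearest foreign pattern dominates, so the width is the overlap factor times that distance.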
The output (supervised) layer maps training examples to their corresponding classes (eye and non-eye)
and finds the corresponding (hidden layer) expansion (weight) coefficients using SVD techniques. Note
that the configuration (number of clusters and specific proportionality constant h) is frozen at the
setting that yields 100% accuracy when validated on the same training images. The experiments
carried out involved 50 eye and 20 non-eye images for training, and a different set consisting of 40 eye
and 20 non-eye images for testing. The same experiment was performed for the case when original
images were used and for the cases when the images used were reconstructed using the best 16 and 64
wavelet coefficients. Table 1 shows the rates of correct eye and non-eye classification
for the original and reconstructed images.
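
One common way to obtain output weights "using SVD techniques" is a least-squares fit through the SVD-based pseudo-inverse of the hidden-layer activation matrix. The sketch below assumes that reading; the function name `fit_output_weights`, the one-hot target encoding, the `rcond` cutoff, and the toy activations are all assumptions for illustration, not the authors' code. It uses NumPy's `numpy.linalg.svd`.

```python
import numpy as np

def fit_output_weights(Y, T, rcond=1e-10):
    """Least-squares output weights via an SVD-based pseudo-inverse.

    Y: (n_samples, n_bf) hidden-layer activations y_i.
    T: (n_samples, n_classes) one-hot class targets (assumed encoding).
    Returns W of shape (n_bf + 1, n_classes); the last row holds the biases w_0j.
    """
    # Append the constant unit-output BF node that carries the bias.
    Y1 = np.hstack([Y, np.ones((Y.shape[0], 1))])
    U, s, Vt = np.linalg.svd(Y1, full_matrices=False)
    # Invert only the singular values that are not negligibly small.
    s_inv = np.zeros_like(s)
    keep = s > rcond * s.max()
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv[:, None] * (U.T @ T))

# Toy, made-up activations of two BF nodes for four training patterns.
Y = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
W = fit_output_weights(Y, T)

# Eq. (27) for all patterns at once, then select the maximum output.
Z = np.hstack([Y, np.ones((Y.shape[0], 1))]) @ W
pred = np.argmax(Z, axis=1)
```

Truncating very small singular values keeps the fit stable when BF activations are nearly collinear, which is a design choice of this sketch rather than something stated in the text.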
Another experiment was carried out to assess how robust the eye detection approach is to miscentered
eye images. Twenty-eight test images were produced by shifting the center of the eye from its original
position to new positions using two-pixel upward/downward shifts and/or two- or four-pixel left/right
shifts. The classification rates obtained for original vs. reconstructed images using the best 64
wavelet coefficients were 50% and 80%, respectively. From the above experiments, one concludes that
the performance of the Radial Basis Function (RBF) classifier improves when the original (raw) images
are replaced by optimally reconstructed images using the best wavelet coefficients. As entropy
minimization drives the derivation of optimal wavelet packets, clustering (prototyping) captures the most
significant characteristics of the underlying data and it is thus more compatible with the 1st operational
stage of the RBF classifier. Note that the best RBF classifier based on optimal wavelet packets can be
applied exhaustively across the whole facial landscape in order to detect the eyes or alternatively it can be
used only on salient locations likely to cover the eye regions. In the latter case, one would possibly
evolve a homing routine to first guide an animat toward the eye(s) and then use the RBF classifier derived
earlier to confirm that the eye region has been indeed reached (Huang, Liu and Wechsler, 1998; Huang
and Wechsler, 1999).
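
The miscentering test described above can be mimicked with a small sketch that enumerates window offsets and crops displaced windows. Everything here is an assumption for illustration: the helper names, the reading of the shifts as vertical offsets of 0 or 2 pixels combined with horizontal offsets of 0, 2 or 4 pixels, and the interpretation of the 8 x 32 window as 8 rows by 32 columns. Under this reading there are 14 displaced windows per eye location; how the 28 test images were actually assembled is not stated in the text.

```python
def shift_offsets(dys=(-2, 0, 2), dxs=(-4, -2, 0, 2, 4)):
    """All (dy, dx) combinations except the centered (0, 0) window."""
    return [(dy, dx) for dy in dys for dx in dxs if (dy, dx) != (0, 0)]

def crop(image, top, left, height=8, width=32):
    """Cut a height x width window out of a 2-D list-of-rows image."""
    return [row[left:left + width] for row in image[top:top + height]]

def miscentered_windows(image, top, left):
    """Windows whose origin is displaced from (top, left) by each offset."""
    return [crop(image, top + dy, left + dx) for dy, dx in shift_offsets()]

# Synthetic 16 x 48 image whose pixel value encodes its (row, column) position.
image = [[r * 100 + c for c in range(48)] for r in range(16)]
windows = miscentered_windows(image, top=4, left=8)
```

Each displaced window would then be passed through the same representation and RBF classifier as a centered one, so that the classification rate can be compared across offsets.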
Input data                               Eye       Non-eye
Original (8 x 32) data                   70%       70%
Compressed, 16 wavelet coefficients      82.5%     80.0%
Compressed, 64 wavelet coefficients      85.0%     100.0%

Table 1. Eye vs. Non-Eye Classification Results
7. Conclusions
We have shown in this paper that eye images compressed using optimal wavelet packets (bases) lead to
enhanced performance on the eye detection task when compared to the case where original raw images
are used. As entropy minimization drives the derivation of optimal wavelet packets, clustering
(prototyping) captures the most significant characteristics of the underlying data and is more compatible
with the 1st operational stage of the RBF classifier. Our hybrid approach is related to the probabilistic
visual learning (PVL) method for object recognition (Moghaddam and Pentland, 1997). PVL measures
similarity for recognition purposes in terms of distance-from-feature-space (DFFS) (representation
adequacy) and distance-in-feature-space (DIFS) (classification accuracy). While DFFS measures
representation adequacy using the reconstruction error as its criterion, the filter component of our hybrid
approach seeks representation adequacy in terms of entropy minimization. Both PVL and our hybrid
approach address representation and classification accuracy independently of each other. An alternative
would be to develop a hybrid approach where fitness and success are defined in terms of a combined
fitness measure evolved using Genetic Algorithms (GAs). The suggested measure would take into
account the tradeoffs between representational adequacy and classification accuracy, possibly including
the compactness of the representation as well (Liu and Wechsler, 1998). The updated scheme would
then substitute regularization for empirical risk minimization (ERM) as the inductive principle, since
representation adequacy is traded off against classification accuracy in terms of compact,
well-separated clusters. Our approach is also similar to
the evolution of lattices of receptive fields (Linsker, 1988) suitable for both adaptive and optimal
functional decomposition and signal classification. Such kernel bases amount to optimal dictionaries
consisting of those (kernel) primitives able to express best and in a compact form a wide range of
meaningful inputs characteristic of some sensory environment leading eventually to both pattern
classification and robust sensory-motor schemes.
One additional direction for future research would integrate the concepts of optimal signal sampling,
characteristic of signal restoration, and optimal pattern sampling, characteristic of learning machines, with
the joint goals of compact signal approximation and improved classification accuracy. Examples of
optimal pattern sampling include active learning (Krogh and Vedelsby, 1995) and support vector
machines (SVM) (Vapnik, 1995). During the early stages of training most of the examples are
misclassified, whereas the set of misclassified examples diminishes as training proceeds. Active learning,
as a form of corrective training, seeks to determine what the training (bases) set should consist of in
order to improve the guaranteed prediction risk. Specifically, active learning seeks to determine the
relevant sample patterns, or what SVM would label as support vectors (SV), for the purpose of defining the
actual classifier. As both signals and patterns have to be approximated using optimal (kernel) bases
(receptive fields and support vectors), the computational quest should now be for such conjoint bases, with
the expectation of enhanced generalization ability and correspondingly lower predictive risk.
Acknowledgments: This work was partially supported by the U.S. Army Research Laboratory under
contract DAAL01-97-K-0118.
References
Barron, A. R. (1991), Approximation and Estimation Bounds for Artificial Neural Networks. Proc. of the
4th Workshop on Computational Learning Theory, Eds. Warmuth, M. K. and L.G. Valiant, 243-249,
Morgan Kaufmann.
Chellappa, R., C. L. Wilson and S. Sirohey (1995), Human and Machine Recognition of Faces: A Survey, Proc.
IEEE, 83, 705-740.
Cherkassky, V. and F. Mulier (1998), Learning from Data: Concepts, Theory and Methods, Wiley.
Cherkassky, V. and X. Shao (1999), Signal Estimation and Denoising Using VC-Theory, Neural
Computation (under review).
Coifman, R. and V. Wickerhauser (1992), Entropy-Based Algorithms for Best Basis Selection, IEEE
Trans. on Information Theory, 38(2), 713-718.
Daubechies, I. (1988), Orthonormal Bases of Compactly Supported Wavelets, Comm. on Pure and Appl.
Math., 41, 909-996.
Daugman, J. G. (1990), An Information-Theoretic View of Analog Representation in Striate Cortex, in
Computational Neuroscience, Ed. E. L. Schwartz, 403-424, MIT Press.
Gabor, D. (1946), Theory of Communication, Journal of the IEE, 93, 429-459.
Hartman, E. and J. Keeler (1991), Predicting the Future: Advantages of Semilocal Units, Neural
Computation, 3, 566-578.
Huang, J., C. Liu and H. Wechsler (1998), Evolutionary Computation and Face Recognition, in Face
Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulie
and T. Huang (Eds.), Springer-Verlag.
Huang, J. and H. Wechsler (1999), Visual Routines for Eye Location Using Learning and Evolution,
IEEE Trans. on Evolutionary Computation (in press).
Kohavi, R. and G. John (1995), Wrappers for Feature Subset Selection, Technical Report, Computer
Science Department, Stanford University.
Krogh, A. and J. Vedelsby (1995), Neural Network Ensembles, Cross Validation and Active Learning,
in Advances in Neural Information Processing Systems 7, Eds. G. Tesauro, D. S. Touretzky and
T. K. Leen, 231-238, MIT Press.
Linsker, R. (1988), Self-Organization in a Perceptual Neural Network, Computer, 21, 105-117.
Lippmann, R. P. and K. Ng (1991), A Comparative Study of the Practical Characteristics of Neural
Networks and Pattern Classifiers, Technical Report 894, Lincoln Labs., MIT.
Liu, C. and H. Wechsler (1998), Face Recognition Using Evolutionary Pursuit, 5th European Conf. on
Computer Vision (ECCV), Freiburg, Germany.
Malinowski, A. and J. M. Zurada (1994), Minimal Training Set Size Estimation for Sampled-Data Function
Encoding, World Congress on Neural Networks (WCNN), Vol. II, 366-371, San Diego, CA.
Mallat, S. (1989), A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE
Trans. on Pattern Analysis and Machine Intelligence, 11(7), 674-693.
Moghaddam, B. and A. Pentland (1997), Probabilistic Visual Learning for Object Representation, IEEE
Trans. on Pattern Analysis and Machine Intelligence, 19(7), 696-710.
Phillips, P. J., H. Wechsler, J. Huang and P. Rauss (1998), The FERET Database and Evaluation
Procedure for Face Recognition, Image and Vision Computing, 16(5), 295-306.
Samal, A. and P. Iyengar (1992), Automatic Recognition and Analysis of Human Faces and Facial
Expressions: A Survey, Pattern Recognition, 25, 65-77.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, Springer-Verlag.
Wilson, R. (1995), Wavelets: Why so Many Varieties?, UK Symposium on Applications of
Time-Frequency and Time-Scale Methods, University of Warwick, Coventry, UK.
Zhang, Q. and A. Benveniste (1992), Wavelet Networks, IEEE Trans. on Neural Networks, 3(6), 889-898.
JEFFREY R. J. HUANG received the Ph.D. in Information Technology from George Mason
University, Fairfax, Virginia, and he is now a researcher with the Biometrics and Computer Vision
Lab in the Department of Computer Science at George Mason University. Dr. Huang's research
interests include image processing, pattern recognition, computer vision, and evolutionary
computation and genetic algorithms. The range of applications covers biometrics and face
recognition, human-computer interaction, video processing and multimedia, and automated
target/object detection and recognition. Since 1995, he has published more than 25 scientific
papers including 2 book chapters. Jeffrey Huang is a member of the IEEE Computer Society and the
Engineering in Medicine and Biology Society, and a member of AAAI, ISAB, and SPIE.
HARRY WECHSLER received the Ph.D. in Computer Science from the University of California, Irvine, in
1975, and he is presently Professor of Computer Science at George Mason University. His research, in the
field of intelligent systems, has been in the areas of PERCEPTION: Computer Vision (CV), Automatic
Target Recognition (ATR), Signal and Image Processing (SIP), MACHINE INTELLIGENCE and
LEARNING: Pattern Recognition (PR), Neural Networks (NN), Information Retrieval, Data Mining and
Knowledge Discovery, EVOLUTIONARY COMPUTATION: Genetic Algorithms (GAs) and Animats,
MULTIMEDIA and VIDEO PROCESSING: Large Image Data Bases, Document Processing, and
HUMAN-COMPUTER INTELLIGENT INTERACTION (HCII): Face and Hand Gesture Recognition, Biometrics
and Forensics. He was Director for the NATO Advanced Study Institutes (ASI) on "Active Perception and
Robot Vision" (Maratea, Italy, 1989), "From Statistics to Neural Networks" (Les Arcs, France, 1993) and "Face Recognition:
From Theory to Applications" (Stirling, UK, 1997), and he served as co-Chair for the International Conference on Pattern
Recognition held in Vienna, Austria, in 1996. He authored over 200 scientific papers, his book "Computational Vision" was
published by Academic Press in 1990, and he was the editor for "Neural Networks for Perception" (Vol 1 & 2), published by
Academic Press in 1991. He was elected as an IEEE Fellow in 1992 and as an Int. Association of Pattern Recognition (IAPR)
Fellow in 1998.
