
ARTICLE IN PRESS
Pattern Recognition
www.elsevier.com/locate/patcog

Algorithms and networks for accelerated convergence of adaptive LDA

H. Abrishami Moghaddam^{a,∗}, M. Matinfar^a, S.M. Sajad Sadough^a, Kh. Amiri Zadeh^b

^a K.N. Toosi University of Technology, Seyed Khandan, P.O. Box 16315-1355, Tehran, Iran
^b AZAD University, Science & Research Unit, P.O. Box 19395-3755, Tehran, Iran

Received 25 December 2003; accepted 23 July 2004

Abstract
We introduce and discuss new accelerated algorithms for linear discriminant analysis (LDA) in unimodal multiclass Gaussian
data. These algorithms use a variable step size, optimally computed in each iteration using (i) the steepest descent, (ii) conjugate
direction, and (iii) Newton–Raphson methods in order to accelerate the convergence of the algorithm. Current adaptive methods
based on the gradient descent optimization technique use a fixed or a monotonically decreasing step size in each iteration,
which results in a slow convergence rate. Furthermore, the convergence of these algorithms depends on appropriate choices
of the step sizes. The new algorithms have the advantage of automatic optimal selection of the step size using the current
data samples. Based on the new adaptive algorithms, we present self-organizing neural networks for adaptive computation of
Σ^{-1/2} and use them in cascaded form with a PCA network for LDA. Experimental results demonstrate fast convergence and
robustness of the new algorithms and justify their advantages for on-line pattern recognition applications with stationary and
non-stationary multidimensional input data.
© 2004 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
Keywords: Adaptive linear discriminant analysis; Adaptive principal component analysis; Gradient descent optimization; Steepest descent
optimization; Conjugate direction optimization; Newton–Raphson optimization; Self-organizing neural network; Convergence analysis

∗ Corresponding author. Tel.: +98 21 8461025; fax: +98 21 8462066. E-mail addresses: moghaddam@saba.kntu.ac.ir (H.A. Moghaddam), matinfar3@yahoo.com (M. Matinfar), seyedos@yahoo.fr (S.M. Sajad Sadough), kh_amiri@yahoo.com (Kh. Amiri Zadeh).

1. Introduction

PCA and LDA have been widely used in signal and image processing, especially in pattern recognition applications such as feature extraction, face and gesture recognition, and hyperspectral image analysis [1–4]. Adaptive PCA and LDA algorithms have been used in on-line applications, particularly for feature space dimensionality reduction [5]. Unlike the PCA, which encodes the information in an orthogonal linear space, the LDA encodes the discriminatory information in a (not necessarily orthogonal) linearly separable space. For example, the LDA has been widely used for dimensionality reduction in speech recognition [6]. Miao and Hua [7] used an objective function and presented gradient descent and recursive least squares (RLS) algorithms [8] for adaptive principal subspace analysis (PSA). Xu [9] used a different objective function to derive an algorithm for adaptive PSA by applying the gradient descent optimization method. Mao and Jain [10] proposed a two-layer network for LDA, each layer of which was a PCA network. Chatterjee and Roychowdhury [2] presented an adaptive algorithm and a self-organizing LDA network for feature extraction
from Gaussian data using the gradient descent optimization technique. They described algorithms and networks for (i) feature extraction from unimodal and multicluster Gaussian data in the multiclass case and (ii) multivariate linear discriminant analysis in the multiclass case. In a later work, they also developed accelerated algorithms for PCA using the steepest descent, conjugate direction and Newton–Raphson methods [11]. Recently, Abrishami Moghaddam and Amiri Zadeh [12] derived an accelerated convergence adaptive algorithm for LDA, based on the steepest descent optimization method.

Most adaptive PCA and LDA implementations are based on minimizing an objective function using the gradient descent technique. It is well known [11,13–15] that such a technique has a slow convergence rate. Furthermore, both analytical and experimental studies show that the convergence of these algorithms depends on an appropriate selection of the gain sequence. For on-line applications, determining an appropriate gain value is a difficult task. Hence, for wider applicability of these algorithms, it is important to speed up the convergence rate and automatically determine the gain sequence based on current data samples. In this article, we present new accelerated adaptive algorithms for the computation of the square root of the inverse covariance matrix (Σ^{-1/2}). For this purpose, we use: (i) the steepest descent, (ii) conjugate direction, and (iii) Newton–Raphson methods to determine the step size of the adaptive algorithm in each iteration. Adaptive computation of the step size has two major advantages. Firstly, it considerably accelerates the convergence of the Σ^{-1/2} computation. Secondly, dynamic values for the step size can overcome the instability problems encountered when using a fixed step size with non-stationary input data. Based on the new adaptive algorithms, we present a modified version of a self-organizing neural network for Σ^{-1/2} computation [2]. For LDA we use the new Σ^{-1/2} computation network and cascade it with a PCA network [13] trained by Sanger's algorithm [16]. Experimental results with Gaussian data demonstrate the high performance of the presented algorithms, particularly for on-line applications with non-stationary processes.

The next section describes the fundamentals of LDA. In Section 3, the adaptive computation of the square root of the inverse covariance matrix Σ^{-1/2} based on the gradient descent method is presented and its convergence is proved using the stochastic approximation theory [17]. Section 4 is devoted to our new accelerated adaptive Σ^{-1/2} algorithms based on (i) the steepest descent, (ii) conjugate direction, and (iii) Newton–Raphson methods. A new recursive equation for on-line estimation of the covariance matrix is also presented in this section. Self-organizing networks for LDA using our accelerated Σ^{-1/2} algorithms are introduced in Section 5. Finally, simulation and experimental results in the last section demonstrate the superior performance of the proposed methods.

Notations used in this paper are fairly standard. Boldface symbols are used for vectors (in lower case letters) and matrices (in upper case letters). We also have the following notations:

(·)^T    transpose;
E(·)     expectation;
tr(·)    trace;
|·|      determinant;
A        vectorized form of the matrix A;
A · B    inner product of matrices A and B in vector form;
A ⊗ B    Kronecker product of matrices A and B.

2. Linear discriminant analysis

Different objective functions have been used as LDA criteria, mainly based on a family of functions of scatter matrices. For example, the maximization of the following objective functions has been previously proposed [18]:

J_1 = tr(Σ_w^{-1} Σ_b),     (1)
J_2 = ln|Σ_w^{-1} Σ_b| = ln|Σ_b| − ln|Σ_w|,     (2)
J_3 = tr(Σ_b) − μ[tr(Σ_w) − c],     (3)
J_4 = tr(Σ_b)/tr(Σ_w),     (4)

where Σ_w and Σ_b are the within-class and between-class scatter matrices, respectively. The following remarks pertain to these criteria:

1. The optimization of J_1 is equivalent to the optimization of tr(A^T Σ_b A) with respect to A under the constraint A^T Σ_w A = I, where A is an n × m transformation matrix. The same is true for J_2.
2. The optimization of J_1 and J_2 results in the same linear features.
3. Many references use |Σ_w^{-1} Σ_b| instead of the logarithm of the determinant for J_2. By using the logarithm, J_2 in an n-dimensional space can be computed by adding the J_2 values of individual features, if the features are independent. This property is called the additive property of independent features.
4. When J_3 is used, tr(Σ_b) is optimized subject to the constraint tr(Σ_w) = c. That is, μ is a Lagrange multiplier and c is a constant.
5. J_1 and J_2 are invariant under any nonsingular linear transformation, while J_3 and J_4 are dependent on the coordinate system.

In LDA, the optimum linear transform is composed of the p (< n) eigenvectors of Σ_w^{-1} Σ_b corresponding to its p largest eigenvalues. Alternatively, Σ_w^{-1} Σ_m can be used for LDA, where Σ_m represents the mixture scatter matrix (Σ_m = Σ_b + Σ_w). A simple analysis shows that both Σ_w^{-1} Σ_b and Σ_w^{-1} Σ_m have the same eigenvector matrix Φ [18]. In general, Σ_b is not full rank and therefore not a covariance matrix, hence we shall use Σ_m in place of Σ_b.

The computation of the eigenvector matrix Φ from Σ_w^{-1} Σ_m is equivalent to the solution of the generalized eigenvalue problem Σ_m Φ = Σ_w Φ Λ, where Λ is the generalized eigenvalue matrix. Under the assumption of a positive definite matrix Σ_w, there is a symmetric matrix Σ_w^{-1/2} such that the problem can be reduced to a symmetric eigenvalue problem:

Σ_w^{-1/2} Σ_m Σ_w^{-1/2} Ψ = Ψ Λ,     (5)

where Ψ = Σ_w^{1/2} Φ is real and Σ_w^{-1/2} Σ_m Σ_w^{-1/2} is symmetric and real. If Ψ is orthonormal, then Ψ^T Ψ = Φ^T Σ_w Φ = I, therefore Φ is real and orthonormal with respect to Σ_w. Furthermore, Φ^T Σ_m Φ = Λ, which is diagonal, real and positive definite [18].

There are two major training algorithms used in class-separability feature extraction networks. These are: (i) the algorithms for the computation of Σ^{-1/2}, where Σ is the positive definite covariance matrix of a random multidimensional sequence {x_k ∈ R^n}, and (ii) an algorithm for the computation of the eigenvectors of Σ. Note that there is no unique solution for Σ^{-1/2}. Let Φ and Λ = diag(λ_1, . . . , λ_n) be the eigenvector and eigenvalue matrices, respectively, of Σ. Then a solution for Σ^{-1/2} is ΦD, where D = diag(±λ_1^{-1/2}, . . . , ±λ_n^{-1/2}). However, in general this is not a symmetric solution. It can be shown that Σ^{-1/2} is symmetric if and only if it is of the form ΦDΦ^T, and there are 2^n symmetric solutions for Σ^{-1/2}. When D is positive definite, we obtain the unique symmetric positive definite solution for Σ^{-1/2} as ΦΛ^{-1/2}Φ^T, where Λ^{-1/2} = diag(λ_1^{-1/2}, . . . , λ_n^{-1/2}).
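This construction can be illustrated with a short numpy sketch (illustrative only; the random covariance is an arbitrary example). It forms Σ^{-1/2} = ΦΛ^{-1/2}Φ^T and checks the whitening property Σ^{-1/2} Σ Σ^{-1/2} = I, which is the fixed point that the adaptive algorithms below converge to:

import numpy as np

def inv_sqrt_spd(Sigma):
    # symmetric positive definite square root of the inverse of Sigma
    lam, Phi = np.linalg.eigh(Sigma)            # Sigma = Phi diag(lam) Phi^T
    return Phi @ np.diag(lam ** -0.5) @ Phi.T   # Phi Lambda^{-1/2} Phi^T

rng = np.random.default_rng(1)
B = rng.normal(size=(6, 6))
Sigma = B @ B.T + 6 * np.eye(6)                 # a random SPD covariance
W = inv_sqrt_spd(Sigma)
print(np.allclose(W @ Sigma @ W, np.eye(6)))    # True: W = Sigma^{-1/2}
print(np.allclose(W, W.T))                      # the symmetric solution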
Chatterjee and Roychowdhury [2] introduced two different networks to solve the generalized symmetric eigenvalue problem (Eq. (5)). In the first network, they developed an adaptive algorithm for the computation of the Σ^{-1/2} matrix and used it to produce Σ_w^{-1/2}. In the second network, they used a PCA network and a learning algorithm such as Sanger's rule [16] to generate Ψ = Σ_w^{1/2} Φ. When the two networks operate simultaneously, the eigenvector matrix Φ is extracted by multiplying the two weight matrices, as Σ_w^{-1/2} Σ_w^{1/2} Φ = Φ. Abrishami Moghaddam and Amiri Zadeh presented an accelerated adaptive algorithm for Σ^{-1/2} computation based on the steepest descent method [19]. They also introduced modified networks for fast LDA and class separability feature extraction [12].

3. Adaptive Σ^{-1/2} computation algorithm

The following algorithm has been proposed for the adaptive computation of Σ^{-1/2} [2]:

W_{k+1} = W_k + η_k G_k,     (6)
G_k = I − W_k x_k x_k^T W_k,     (7)

where W_0 ∈ R^{n×n} is symmetric and non-negative definite, and {η_k} is a scalar gain sequence. According to the general form of adaptive algorithms [17] we have:

W_{k+1} = W_k + η_k G(W_k, x_k),     (8)

where the update function G(W_k, x_k) is the gradient of an objective function J(W_k). Using the stochastic approximation theory and the convergence analysis by Ljung [20], we may write:

Ḡ(W) = ∂J(W)/∂W = E[G(W_k, x_k)] = I − WΣW.     (9)

The gain sequence {η_k} has an important role in the convergence of the algorithm and can be a constant or a decreasing sequence, satisfying the following conditions:

(a) ∑_{k=0}^{∞} η_k = ∞,
(b) ∑_{k=0}^{∞} η_k^r < ∞  (r > 1),
(c) lim_{k→∞} η_k = 0.

For example, we can generate η_k as follows [11]:

η_k = β/k^α,   β > 0, 1/2 < α ≤ 1,     (10)

where α and β are selected such that η_k satisfies the above stated conditions. The convergence of the algorithm has been proved [2] using the stochastic approximation theory [20]. That means:

lim_{k→∞} W_k = Σ^{-1/2} with probability one,     (11)

where Σ is the correlation or covariance matrix of the random sequence {x_k}.

4. New adaptive Σ^{-1/2} algorithms

The adaptive computation of Σ^{-1/2} using Eq. (6) suffers from a very slow convergence rate. Increasing η_k can accelerate the convergence of the algorithm, but large gain sequences may cause it to diverge or converge to a false solution. Choosing η_k as a monotonically decreasing function of the iteration number k may improve the convergence rate. However, this cannot be considered an optimal solution to the convergence problem. Noting that Eq. (6) is based on the gradient descent method, we developed three new algorithms based on different optimization techniques, including (i) the steepest descent [12], (ii) conjugate direction and (iii) Newton–Raphson methods, in order to optimally determine the gain sequence in each iteration.
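For reference, the baseline iteration of Eqs. (6), (7) and (10) can be sketched as follows (an illustration, not the authors' implementation; β = 0.05 and α = 0.6 are arbitrary values satisfying conditions (a)–(c)):

import numpy as np

def adaptive_inv_sqrt_gd(samples, beta=0.05, alpha=0.6):
    # gradient-descent iteration W_{k+1} = W_k + eta_k (I - W_k x_k x_k^T W_k)
    n = samples.shape[1]
    W = np.eye(n)                              # symmetric, non-negative definite W_0
    I = np.eye(n)
    for k, x in enumerate(samples, start=1):
        eta = beta / k ** alpha                # decreasing gain sequence, Eq. (10)
        G = I - W @ np.outer(x, x) @ W         # Eq. (7)
        W = W + eta * G                        # Eq. (6)
    return W                                   # -> Sigma^{-1/2} with probability one, Eq. (11)

rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(np.zeros(2), Sigma, size=10000)
W = adaptive_inv_sqrt_gd(X)
print(W @ Sigma @ W)                           # should approach the 2 x 2 identity

With a small enough β the iteration is stable but, as discussed above, convergence is slow; the algorithms of this section replace the fixed schedule by a step size computed from the current data.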

4.1. Steepest descent Σ^{-1/2} algorithm

The steepest descent method uses the following updating equations:

W_{k+1} = W_k + η_k G_k,     (12)
G_k = I − W_k Σ_k W_k,     (13)

where x_k x_k^T in Eq. (7) has been replaced by Σ_k, which will be introduced later in this section. In the steepest descent method, instead of using a fixed gain sequence, η_k is calculated by setting the derivative of the cost function J with respect to η_k to zero [12]:

∂J(W_{k+1})/∂η_k = 0.     (14)

Using the chain rule:

∂J(W_{k+1})/∂η_k = ∂J(W_{k+1})/∂W_{k+1} · ∂W_{k+1}/∂η_k,     (15)
∂J(W_{k+1})/∂W_{k+1} = I − W_{k+1} Σ_k W_{k+1},     (16)
∂W_{k+1}/∂η_k = I − W_k Σ_k W_k,     (17)

therefore,

(I − W_{k+1} Σ_k W_{k+1}) · (I − W_k Σ_k W_k) = 0.     (18)

Replacing W_{k+1} with Eq. (12) and doing some mathematical operations, we obtain the following quadratic equation (Appendix A):

a_k η_k^2 + b_k η_k + c_k = 0,     (19)

where

a_k = (G_k Σ_k G_k) · G_k,
b_k = (G_k Σ_k W_k + W_k Σ_k G_k) · G_k,
c_k = −G_k · G_k,

and η_k is obtained as

η_k = (−b_k + √(b_k^2 − 4 a_k c_k)) / (2 a_k).     (20)

In Eq. (20), we use the positive sign in order to minimize the objective function J(W_{k+1}) (see Appendix A). As will be shown in the experimental results, the computation of η_k according to Eq. (20) accelerates the convergence of the adaptive algorithm. Furthermore, in the adaptive computation of Σ^{-1/2} a fixed gain sequence may cause divergence problems in the case of non-stationary input data. Dynamic determination of η_k can overcome the problem while the convergence is guaranteed under different conditions. The on-line estimation of the covariance matrix Σ_k is obtained using the following recursive equation [12]:

Σ_{k+1} = (1 − γk/(k + 1)) x_{k+1} x_{k+1}^T + (γk/(k + 1)) Σ_k,     (21)

where γ ∈ ]0, 1] is a scalar forgetting factor. If {x_k} comes from a stationary process, γ = 1 is used. On the other hand, if {x_k} comes from a non-stationary process, 0 < γ < 1 is selected. This equation is applied to obtain an effective window of size 1/(1 − γ). This effective window ensures that the past data samples are down-weighted with an exponentially fading window. The exact value of γ depends on the specific application. In general, for slowly time-varying {x_k}, γ is chosen close to one to obtain a large effective window, whereas for fast time-varying {x_k}, γ is chosen near zero for a small effective window [17].
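The step-size computation of Eqs. (13) and (19)–(21) maps almost directly onto code. In the sketch below (illustrative, not from the paper), the matrix inner product "A · B" of the notation list is implemented as the Frobenius inner product tr(A^T B), and the default forgetting factor is arbitrary:

import numpy as np

def frob(A, B):
    # inner product of matrices in vector form: A . B = tr(A^T B)
    return float(np.sum(A * B))

def update_cov(Sigma, x, k, gamma=1.0):
    # recursive covariance estimate of Eq. (21); gamma in ]0, 1] is the forgetting factor
    w = gamma * k / (k + 1.0)
    return (1.0 - w) * np.outer(x, x) + w * Sigma

def steepest_descent_step(W, Sigma):
    # one iteration of Eqs. (12)-(20) with the optimal gain from the quadratic (19)
    I = np.eye(W.shape[0])
    G = I - W @ Sigma @ W                                    # Eq. (13)
    a = frob(G @ Sigma @ G, G)
    b = frob(G @ Sigma @ W + W @ Sigma @ G, G)
    c = -frob(G, G)
    eta = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)    # Eq. (20), positive root
    return W + eta * G, eta

In an on-line loop, update_cov refreshes Σ_k from each new sample and steepest_descent_step then advances W_k with the optimal gain of Eq. (20); the same frob helper is reused for the conjugate direction and Newton–Raphson coefficients below.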


4.2. Conjugate direction Σ^{-1/2} algorithm

The adaptive conjugate direction algorithm for LDA can be obtained as follows:

W_{k+1} = W_k + η_k D_k,     (22)
D_{k+1} = G_k + β_k D_k,     (23)

where G_k is obtained using Eq. (13). The step size η_k is chosen as the η that minimizes J(W_{k+1} + η D_{k+1}). Similar to the steepest descent case, we obtain the same quadratic equation as in Eq. (19), where

a_k = (D_k Σ_k D_k) · D_k,
b_k = (D_k Σ_k W_k + W_k Σ_k D_k) · D_k,
c_k = −G_k · D_k,

and η_k is obtained using the same equation as Eq. (20). For the choice of β_k, we can use a number of methods [21], as described below:

Hestenes–Stiefel:
β_k = G_{k+1} · (G_{k+1} − G_k) / D_k · (G_{k+1} − G_k).     (24)

Polak–Ribiere:
β_k = G_{k+1} · (G_{k+1} − G_k) / G_k · G_k.     (25)

Fletcher–Reeves:
β_k = G_{k+1} · G_{k+1} / G_k · G_k.     (26)

Powell:
β_k = max[0, G_{k+1} · (G_{k+1} − G_k) / G_k · G_k].     (27)

For the simulation results presented in this paper, we used the Polak–Ribiere method.
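A corresponding sketch of one conjugate direction iteration (illustrative only; Polak–Ribiere is used for β_k as in the paper's simulations, and the recursion can be started from D_0 = G_0 = I − W_0 Σ_0 W_0):

import numpy as np

def conjugate_direction_step(W, D, G_prev, Sigma):
    # one iteration of Eqs. (22)-(25); returns the updated W, D and gradient
    dot = lambda A, B: float(np.sum(A * B))        # matrix inner product in vector form
    I = np.eye(W.shape[0])
    a = dot(D @ Sigma @ D, D)
    b = dot(D @ Sigma @ W + W @ Sigma @ D, D)
    c = -dot(G_prev, D)
    eta = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)    # same root as Eq. (20)
    W_new = W + eta * D                                      # Eq. (22)
    G_new = I - W_new @ Sigma @ W_new                        # gradient at the new point, Eq. (13)
    beta = dot(G_new, G_new - G_prev) / dot(G_prev, G_prev)  # Polak-Ribiere, Eq. (25)
    D_new = G_new + beta * D                                 # Eq. (23)
    return W_new, D_new, G_new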

4.3. Newton–Raphson Σ^{-1/2} algorithm

The adaptive NR algorithm for LDA is

W_{k+1} = W_k + η_k D_k,     (28)
D_k = (H_k)^{-1} G_k,     (29)

where H_k is the on-line Hessian matrix defined as (Appendix B)

H_k = ∂G_k/∂W_k = −2(I ⊗ Σ_k W_k).     (30)

The step size parameter η_k can be selected by minimizing J(W_k + η_k D_k) with respect to η_k. This will result in the same quadratic equation as Eq. (19), where

a_k = (D_k Σ_k D_k) · D_k,
b_k = (D_k Σ_k W_k + W_k Σ_k D_k) · D_k,
c_k = −G_k · D_k.

In the above relations, G_k and D_k are calculated using Eqs. (13) and (29), respectively.
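Because H_k acts on the vectorized form of a matrix, Eq. (29) is a linear solve in vec notation, vec(D_k) = H_k^{-1} vec(G_k). A small illustrative sketch (not the authors' code, and assuming Σ_k W_k is nonsingular):

import numpy as np

def newton_raphson_direction(W, Sigma):
    # D_k = H_k^{-1} G_k with H_k = -2 (I kron Sigma_k W_k), Eqs. (29)-(30)
    n = W.shape[0]
    I = np.eye(n)
    G = I - W @ Sigma @ W                               # Eq. (13)
    H = -2.0 * np.kron(I, Sigma @ W)                    # Eq. (30), acts on column-stacked vec(.)
    d = np.linalg.solve(H, G.reshape(-1, order="F"))    # vec(D) = H^{-1} vec(G)
    return d.reshape(n, n, order="F")

The step size η_k is then obtained from the quadratic (19) with the coefficients a_k, b_k, c_k given above, exactly as in the steepest descent case. By the identity (I ⊗ A) vec(X) = vec(AX), the same direction could also be obtained without forming the n² × n² Hessian by solving (Σ_k W_k) D_k = −G_k/2; the explicit Kronecker form above mirrors Eq. (30) and the O(n⁴) cost noted in Section 6.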
5. Neural network implementation

Based on the η_k computation algorithm, we introduce a self-organizing neural network for the fast adaptive computation of Σ^{-1/2}, as depicted in Fig. 1. The major difference between this new scheme and the network presented in [2] is in the η_k computation block. Here, instead of using a fixed step size, we use Eq. (19) in order to optimally determine the step size according to one of the steepest descent, conjugate direction and Newton–Raphson methods.

[Fig. 1 — block diagram with elements: random vector x_k, update term G(x_k, W_k), gain η_k, output W_{k+1}, and a unit delay z^{-1} feeding W_k back.]
Fig. 1. Σ^{-1/2} computation algorithm.

For LDA, we need to produce the eigenvector matrix Φ of Σ_w^{-1} Σ_b. Here, two networks for the computation of the eigenvector matrix Φ using our new Σ^{-1/2} algorithm are presented. The first network is trained by samples with known classes and computes Σ_w^{-1/2}. The second network uses a PCA algorithm to generate the eigenvector matrix of Σ_w^{-1/2} Σ_m Σ_w^{-1/2} [7]. It is trained by samples irrespective of class assignments. When the two trained networks work together in cascade form, the eigenvector matrix Φ is obtained (Fig. 2). In the LDA network, the on-line estimate of the class mean m_i for class i is obtained by the following adaptive equation:

g_{k+1} = g_k + η_k (x_{k+1} − g_k),     (31)

where g_k converges to m_i. In the above equation η_k is selected as a constant. The same equation may be used for on-line estimation of the mixture mean m.

[Fig. 2 — block diagram: the class-centered input x_k − m_i feeds the Σ_w^{-1/2} network; the mixture-centered input x_k − m, transformed by Σ_w^{-1/2}, feeds the PCA network.]
Fig. 2. Network for the fast LDA.
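The cascade of Fig. 2 can be sketched end to end as follows. This is an illustration only: the learning rates are arbitrary, the class and mixture means are tracked with Eq. (31), the within-class whitening stage is written with the plain update of Eqs. (6)–(7) for brevity (an accelerated step from Section 4 would be used instead), and the PCA stage uses Sanger's rule [16]:

import numpy as np

def sanger_step(V, z, lr=1e-3):
    # Sanger's generalized Hebbian rule for a linear PCA layer:
    # y = V z, delta V = lr (y z^T - LT[y y^T] V), LT = lower-triangular part
    y = V @ z
    V += lr * (np.outer(y, z) - np.tril(np.outer(y, y)) @ V)
    return V

def lda_network_step(x, label, means, mix_mean, W, V, eta_mean=1e-2, eta_w=1e-3):
    # one on-line step of the cascaded LDA network of Fig. 2 (illustrative)
    means[label] += eta_mean * (x - means[label])      # class mean, Eq. (31)
    mix_mean += eta_mean * (x - mix_mean)              # mixture mean, same rule
    xc = x - means[label]                              # class-centered input
    G = np.eye(len(x)) - W @ np.outer(xc, xc) @ W      # Eq. (7) with x_k - m_i
    W += eta_w * G                                     # fixed gain here for brevity
    V = sanger_step(V, W @ (x - mix_mean))             # PCA stage fed with Sigma_w^{-1/2}(x_k - m)
    return means, mix_mean, W, V

# after training, the LDA eigenvector matrix is approximated by W @ V.T,
# i.e. the product of the two weight matrices, as described in Section 3

W can be initialized to a small symmetric matrix and V to small random values; the per-class means dictionary starts from the first sample of each class.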

6. Results and discussion

In this section, we use the networks and the learning rules presented in the last two sections for linear discriminant analysis in pattern recognition applications, and compare the results with those obtained by the gradient descent method. In the following experiments, we use samples generated from unimodal Gaussian distributions, since they are common in many pattern recognition problems [18].

6.1. Σ^{-1/2} algorithms

Two sets of experiments were carried out to test the performance of the new adaptive Σ^{-1/2} algorithms. The first set of experiments was made on stationary Gaussian data, whose main purpose was to demonstrate the convergence speed of the new algorithms. The second experiment was made on non-stationary Gaussian data in order to show the tracking ability of the new adaptive algorithms to follow changes in the statistical characteristics of the input data.

6.1.1. Experiment with stationary data
Choosing the incoming sequence in R^10 and the covariance matrix for the training network shown in Table 1, we generated 500 samples of zero-mean Gaussian data and calculated the Σ^{-1/2} matrix. The actual value of Σ^{-1/2} was obtained from the sample correlation matrix using a standard eigenvector computation method. The L2-norm of the error e_k between the estimated and the actual Σ^{-1/2} matrices was computed by

e_k = ‖W_k − Σ_actual^{-1/2}‖_2.     (32)

The convergence of the new Σ^{-1/2} algorithms is illustrated in Fig. 3. For the gradient descent algorithm, we used η_k = 1/(400 + k), which demonstrated a better convergence speed compared to a fixed η_k. We also used γ = 1.0 in Eq. (21), because of the stationarity of the input data.

Table 1
The covariance matrix of the training input data

0.182 0.076 −0.106 −0.010 0.020 −0.272 0.310 0.060 0.004 0.064
0.076 0.746 0.036 −0.056 −0.022 −0.734 0.308 −0.114 −0.062 −0.130
−0.106 0.036 2.860 0.034 0.110 −0.900 −0.076 −0.596 −0.082 −0.060
−0.010 −0.056 0.034 0.168 −0.010 0.032 0.084 −0.044 0.002 0.010
0.020 −0.022 0.110 −0.010 0.142 0.176 0.116 −0.138 −0.016 0.006
−0.272 −0.734 −0.900 0.032 0.176 11.44 −1.088 −0.496 0.010 0.190
0.310 0.308 −0.076 0.084 0.116 −1.088 5.500 −0.686 −0.022 −0.240
0.060 −0.114 −0.596 −0.044 −0.138 −0.496 −0.686 2.900 0.156 0.056
0.004 −0.062 −0.082 0.002 −0.016 0.010 −0.022 0.156 0.134 0.030
0.064 −0.130 −0.060 0.010 0.006 0.190 −0.240 0.056 0.030 0.682

Fig. 3. Convergence behavior of the new Σ^{-1/2} algorithms compared to the gradient descent method using stationary input data: (a) steepest descent, (b) conjugate direction, (c) Newton–Raphson, (d) superposition of four curves.

A comparison of the gradient descent Σ^{-1/2} computation diagram with the new adaptive steepest descent (Fig. 3a), conjugate direction (Fig. 3b) and Newton–Raphson (Fig. 3c) methods shows a significant increase in the convergence rate. Moreover, the new algorithms do not require η_k to be specified explicitly, as was done for the gradient descent method. Instead, the gain sequence is automatically computed from the input data sequence.
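The stationary experiment can be outlined in code as follows (an illustrative protocol sketch, not the authors' software): a random stand-in covariance replaces Table 1, the initial W_0 and the fallback to the decreasing gain whenever the quadratic (19) has no admissible root are safeguards chosen for this sketch, and the error is the norm of Eq. (32):

import numpy as np

rng = np.random.default_rng(3)
n, N = 10, 500
A = rng.normal(size=(n, n))
Sigma_true = A @ A.T / n + np.eye(n)              # random stand-in for the Table 1 covariance
X = rng.multivariate_normal(np.zeros(n), Sigma_true, size=N)

S = X.T @ X / N                                   # sample correlation matrix of the zero-mean data
lam, Phi = np.linalg.eigh(S)
W_actual = Phi @ np.diag(lam ** -0.5) @ Phi.T     # "actual" Sigma^{-1/2}

def run(optimal_step):
    W = 0.1 * np.eye(n)                           # arbitrary symmetric non-negative definite start
    Sigma = np.outer(X[0], X[0])
    err = []
    for k in range(1, N):
        x = X[k]
        Sigma = np.outer(x, x) / (k + 1) + Sigma * k / (k + 1)   # Eq. (21) with gamma = 1
        G = np.eye(n) - W @ Sigma @ W
        if optimal_step:
            a = np.sum((G @ Sigma @ G) * G)
            b = np.sum((G @ Sigma @ W + W @ Sigma @ G) * G)
            c = -np.sum(G * G)
            disc = b * b - 4 * a * c
            eta = (-b + np.sqrt(disc)) / (2 * a) if (a > 1e-12 and disc >= 0) else 1.0 / (400 + k)
        else:
            eta = 1.0 / (400 + k)                 # gradient descent baseline gain
        W = W + eta * G
        err.append(np.linalg.norm(W - W_actual, 2))   # e_k of Eq. (32)
    return err

print(run(True)[-1], run(False)[-1])              # accelerated vs. baseline final error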

6.1.2. Experiment with non-stationary data

In order to demonstrate the tracking ability of the algorithms with non-stationary data, we generated 250 samples of zero-mean Gaussian data in R^10 with the covariance matrix stated as before. We then drastically changed the data sequence by generating 250 samples of zero-mean ten-dimensional Gaussian data with the covariance matrix shown in Table 2 [22]. We generated {Σ_k} from {x_k} by using Eq. (21) with γ = 0.99. The tracking ability of the new algorithms compared to the gradient descent method is illustrated in Fig. 4(a–c). We initiated all algorithms with W_0 = 0.1 ∗ ONE, where ONE is an n × n matrix whose elements are all one. Once again, it is clear from Fig. 4 that the steepest descent (Fig. 4a), conjugate direction (Fig. 4b) and Newton–Raphson (Fig. 4c) algorithms converge faster and track the changes in the data much better than the gradient descent algorithm.

Further consideration should be given to the computational complexity of the algorithms. The gradient descent method is of order n^2 complexity (per sample), the steepest descent and conjugate gradient methods are of order n^3 complexity and, finally, the Newton–Raphson method has computational complexity of O(n^4). If we use an effective window of size p (≪ n) for the computation of Σ_k, the computational complexity of the steepest descent and conjugate gradient methods will be reduced to O(pn^2). However, it should be noted that the convergence of the gradient descent is slower than the steepest descent, conjugate gradient and Newton–Raphson methods, as shown in Figs. 3 and 4. Comparison between the three new algorithms shows small differences between them in convergence speed. Among the three faster converging algorithms, since the steepest descent algorithm (Eqs. (12) and (13)) requires the smallest amount of computation per iteration, it is the most suitable for optimum speed and computation. Furthermore, the Hessian matrix inversion in the Newton–Raphson method may cause some instability problems. A solution to this problem may be obtained using an algorithm for adaptive estimation of the approximate inverse of the Hessian matrix [11]. The adaptive estimation of the inverse of the Hessian matrix is also computationally more efficient than the direct method.

6.2. LDA algorithm

Finally, we tested the LDA network. For this purpose, we (i) generated 1000 samples of 2-D Gaussian data, each from two classes with different mean vectors and the same covariance matrix; (ii) used the LDA network to extract the relevant features for classification, and show the classification results for the two algorithms; (iii) compared these features with their actual values computed from the sample scatter matrices. The covariance matrices and mean vectors for the two classes were:

m_1 = [−2, 2]^T,   Σ_1 = [3 2; 2 3],
m_2 = [2, −2]^T,   Σ_2 = [3 2; 2 3].

Here, Σ_w = [3 2; 2 3], Σ_b = [4 −4; −4 4], and the eigenvector matrix Φ of Σ_w^{-1} Σ_b is

Φ = [0.7071 0.3162; −0.7071 0.3162],

corresponding to eigenvalues 8 and 0. Note that the above results are the exact values. After training the LDA network, the first eigenvector of Σ_w^{-1} Σ_b was estimated as φ̂_1 = [0.7018, −0.6936]^T. Defining the normalized error E as E = ‖φ − φ̂‖/‖φ‖, where φ is computed from the sample scatter matrices and φ̂ is estimated from the LDA network, the final result is shown in Fig. 5. As can be seen, the estimation error vanishes much more rapidly using the steepest descent (Fig. 5a), conjugate direction (Fig. 5b) and Newton–Raphson (Fig. 5c) LDA algorithms, compared to the gradient descent method.
scent method is of order n2 complexity (per sample),
the steepest descent and conjugate gradient methods are Summary
of order n3 complexity and finally the Newton–Raphson
method has computational complexity of O(n4 ). If we New adaptive algorithms and self-organizing neural net-
use an effective window of size p(  n) for the computa- works for linear discriminant analysis have been presented.
tion of k , the computational complexity of the steepest These algorithms use (i) the steepest descent, (ii) conjugate
descent and conjugate gradient methods will be reduced direction and (iii) Newton–Raphson adaptive techniques for
to O(pn2 ). However, it should be noted that the conver- optimal computation of the step size in each iteration. Cur-
gence of the gradient descent is slower than the steepest rent adaptive methods based on the gradient descent opti-
descent, conjugate gradient and Newton–Raphson meth- mization use a fixed or a monotonically decreasing step size.
ods as shown in Figs. 3 and 4. Comparison between the The main advantage of the new algorithms is their fast con-
three new algorithms shows small differences between vergence rate, which distinguishes them from the existing
them in convergence speed. Among the three faster con- on-line methods. It is experimentally shown that an optimal
verging algorithms, since the steepest descent algorithm variable step size significantly improves the convergence
(Eqs. (12) and (13)) requires the smallest amount compu- rate of the algorithm. Furthermore, if the fixed step size ex-
tation per iteration, it is most suitable for optimum speed ceeds an upper bound, the adaptive algorithms may diverge
and computation. Furthermore, the Hessian matrix inver- or converge to a false solution. Dynamic determination of
sion in the Newton–Raphson method may cause some the step size based on the current data samples can prevent
instability problems. A solution to this problem may be the divergence and improve the robustness of the adaptive
obtained using an algorithm for adaptive estimation of algorithm against hazardous inputs. The convergence speed
the approximate inverse of the Hessian matrix [11]. The and robustness of the new adaptive algorithms, make them
adaptive estimation of the inverse of the Hessian ma- appropriate for on-line applications where we deal with non-
trix is also computationally more efficient than the direct stationary input data.
method. It has been shown that the first step for adaptive linear
discriminant analysis is the computation of the square root
of the inverse covariance matrix −1/2 . Three new algo-
6.2. LDA algorithm rithms have been introduced for fast adaptive computation
of −1/2 . This matrix is then used for linear discriminant
Finally, we tested the LDA network. For this purpose, we analysis of multiclass multidimensional Gaussian input data.
(i) generated 1000 samples of 2-D Gaussian data, each from We also introduce modified self-organizing neural networks,
two classes with the different mean vectors and the same in order to efficiently implement the developed algorithms
covariance matrices; (ii) used the LDA network to extract and accelerate their convergence. A new recursive method
the relevant features for classification, and show the classi- for on-line estimation of  has also been used in these net-
fication results for two algorithms; (iii) compared these fea- works. The new recursive estimation algorithm has shown
tures with their actual values computed from sample scatter a good performance in both stationary and non-stationary
matrices. The covariance matrices and mean vectors for the input sequences.

Table 2
The second covariance matrix for non-stationary experimental input data

0.360 0.004 −0.032 −0.764 −0.028 0.164 −0.120 −0.232 0.088 0.128
0.004 0.368 −0.044 0.032 −0.056 −0.008 0.048 −0.040 −0.084 −0.008
−0.032 −0.044 0.328 0.328 0.056 −0.080 −0.232 0.420 0.016 0.092
−0.764 0.032 0.328 22.72 −0.384 −0.060 2.584 0.876 −0.952 0.872
−0.028 −0.056 0.056 −0.384 0.304 −0.140 −0.160 −0.092 0.108 −0.056
0.164 −0.008 −0.080 −0.060 −0.140 1.832 0.552 −1.004 0.0480 0.156
−0.120 0.048 −0.232 2.584 −0.160 0.552 7.280 −0.732 −0.008 0.468
−0.230 −0.040 0.420 0.876 −0.092 −1.004 −0.732 16.28 −1.856 0.588
0.088 −0.084 0.016 −0.952 0.108 0.048 −0.008 −1.856 1.052 0.216
0.128 −0.008 0.092 0.872 −0.056 0.156 0.468 0.588 0.216 1.548

Fig. 4. Tracking ability of the new Σ^{-1/2} algorithms compared to the gradient descent method using non-stationary input data: (a) steepest descent, (b) conjugate direction, (c) Newton–Raphson, (d) superposition of four curves.

The developed networks have been tested under different conditions using multidimensional input data. First, the new Σ^{-1/2} computation algorithms have been tested with stationary and non-stationary processes. Experimental results show significant improvement in convergence speed and tracking ability of the new (i) steepest descent, (ii) conjugate direction and (iii) Newton–Raphson adaptive methods. Finally, for linear discriminant analysis, we combined
the adaptive Σ^{-1/2} computation algorithm with a PCA network. Experimental results with multidimensional input data demonstrate better performance of the new algorithms for on-line linear discriminant analysis.

Fig. 5. Convergence of the new LDA networks compared to the gradient-descent-based LDA with stationary input data: (a) steepest descent, (b) conjugate direction, (c) Newton–Raphson, (d) superposition of four curves.

Appendix A. Computation of η_k for steepest descent

We compute the η_k that minimizes J(W_{k+1}), where

W_{k+1} = W_k + η_k G_k,     (A.1)
G_k = I − W_k Σ_k W_k.     (A.2)

By differentiating J(W_{k+1}) with respect to η_k and using the chain rule, we will have:

dJ(W_{k+1})/dη_k = dJ(W_{k+1})/dW_{k+1} · dW_{k+1}/dη_k,     (A.3)
dJ(W_{k+1})/dW_{k+1} = I − W_{k+1} Σ_k W_{k+1},     (A.4)
dW_{k+1}/dη_k = I − W_k Σ_k W_k.     (A.5)

Therefore

(I − W_{k+1} Σ_k W_{k+1}) · (I − W_k Σ_k W_k) = 0.     (A.6)

Replacing W_{k+1} with W_k + η_k G_k, we obtain

[I − (W_k + η_k G_k) Σ_k (W_k + η_k G_k)] · (I − W_k Σ_k W_k) = 0.     (A.7)

Simplifying Eq. (A.7), we obtain the following quadratic equation:

a_k η_k^2 + b_k η_k + c_k = 0,     (A.8)

where

a_k = (G_k Σ_k G_k) · G_k,
b_k = (G_k Σ_k W_k + W_k Σ_k G_k) · G_k,
c_k = −G_k · G_k.

The roots of the above equation can be computed as

η_k = (−b_k ± √(b_k^2 − 4 a_k c_k)) / (2 a_k).     (A.9)

To select η_k, we note that η_k should be selected such that the second derivative of the objective function is positive. That means

d^2 J(W_{k+1})/dη_k^2 = 2 a_k η_k + b_k ≥ 0.     (A.10)

Hence we choose η_k as

η_k = (−b_k + √(b_k^2 − 4 a_k c_k)) / (2 a_k).     (A.11)

Appendix B. Computation of the Hessian matrix

The Hessian matrix is the second derivative of the objective function with respect to W_k. It can be computed as the first derivative of G_k as follows:

H_k = dG_k/dW_k.     (B.1)

Since

G_k = I − W_k Σ_k W_k,     (B.2)

we may write [23]:

H_k = d(I − W_k Σ_k W_k)/dW_k
    = −d(W_k Σ_k W_k)/dW_k
    = −(dW_k Σ_k W_k + Σ_k W_k dW_k)/dW_k
    = −([(Σ_k W_k)^T ⊗ I] dW_k + [I^T ⊗ (Σ_k W_k)] dW_k)/dW_k.     (B.3)

Therefore, the Hessian matrix may be computed using the following equation:

H_k = −2[I ⊗ (Σ_k W_k)].     (B.4)

References

[1] C. Chang, H. Ren, An experiment-based quantitative and comparative analysis of target detection and image classification algorithms for hyperspectral imagery, IEEE Trans. Geosci. Remote Sensing 38 (2) (2000) 1044–1063.
[2] C. Chatterjee, V.P. Roychowdhury, On self-organizing algorithms and networks for class-separability features, IEEE Trans. Neural Networks 8 (3) (1997) 663–678.
[3] L. Chen, H. Mark Liao, J. Lin, M. Ko, G. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recog. 33 (10) (2000) 1713–1726.
[4] W. Zhao, A. Krishnaswamy, R. Chellappa, D.L. Swets, J. Weng, Discriminant analysis of principal components for face recognition, in: H. Wechsler, P.J. Phillips, V. Bruce, F. Fogelman Soulie, T.S. Huang (Eds.), Face Recognition: From Theory to Applications, Springer, Berlin, 1998, pp. 73–85.
[5] C. Lee, J. Hong, Optimizing feature extraction for multiclass cases, in: IEEE International Conference on Computational Cybernetics and Simulations, 1997, pp. 2545–2548.
[6] X.D. Huang, Y. Ariki, M.A. Jack, Hidden Markov Models for Speech Recognition, Edinburgh University Press, UK, 1990.
[7] Y. Miao, Y. Hua, Fast subspace tracking and neural-network learning by a novel information criterion, IEEE Trans. Signal Process. 46 (1998) 1964–1979.
[8] S. Bannour, M.R. Azimi-Sadjadi, Principal component extraction using recursive least squares learning, IEEE Trans. Neural Networks 6 (1995) 457–469.
[9] L. Xu, Least mean square error reconstruction principle for self-organizing neural-nets, Neural Networks 6 (1993) 627–648.
[10] J. Mao, A.K. Jain, Discriminant analysis neural networks, in: IEEE International Conference on Neural Networks, San Francisco, CA, 1993, pp. 300–305.
[11] C. Chatterjee, Z. Kang, V.P. Roychowdhury, Algorithms for accelerated convergence of adaptive PCA, IEEE Trans. Neural Networks 11 (2) (2000) 338–355.
[12] H. Abrishami Moghaddam, Kh. Amiri-Zadeh, Fast adaptive algorithms and networks for class-separability features, Pattern Recog. 36 (2003) 1695–1702.
[13] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, 1994.
[14] T.K. Sarkar, X. Yang, Application of conjugate gradient and steepest descent for computing the eigenvalues of an operator, Signal Processing 17 (1989) 31–37.
[15] B. Widrow, S. Stearns, Adaptive Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1985.
[16] T.D. Sanger, Optimal unsupervised learning in a single-layer linear feed-forward neural network, Neural Networks 2 (1989) 459–473.
[17] A. Benveniste, M. Metivier, P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer, Berlin, 1990.
[18] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edition, Academic Press, New York, 1990.
[19] H. Abrishami Moghaddam, Kh. Amiri-Zadeh, Fast linear discriminant analysis for on-line pattern recognition applications, in: Proceedings of the 16th International Conference on Pattern Recognition (ICPR2002), Quebec, Canada, 2002.
[20] L. Ljung, Analysis of recursive stochastic algorithms, IEEE Trans. Automat. Control 22 (4) (1977) 551–575.
[21] D. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1984.
[22] T. Okada, S. Tomita, An optimal orthonormal system for discriminant analysis, Pattern Recog. 18 (2) (1985) 139–144.
[23] J.R. Magnus, H. Neudecker, Matrix Differential Calculus, Wiley, New York, 1999.

About the Author—HAMID ABRISHAMI MOGHADDAM was born in Iran in 1964. He received his B.S. degree in electrical engineering
from Amirkabir (1988) and his M.S. degree in biomedical engineering from Sharif (1991) Universities of Technology, Tehran, Iran. He
also received his Ph.D. degree in biomedical engineering from Université de Technologie de Compiègne, Compiègne, France (1998). He is
currently working as an assistant professor of biomedical engineering at K.N. Toosi University of Technology, Tehran, Iran. His major fields
of interest are image processing, pattern recognition and machine vision.

About the Author—MEHDI MATINFAR was born in Iran in 1980. He received his B.S. degree in electrical engineering from Sharif
University of Technology, Tehran, Iran (2002). Currently, he is a graduate student in biomedical engineering at K.N. Toosi University of
Technology, Tehran, Iran. His fields of interest are pattern recognition, image and biomedical signal processing.

About the Author—SEYED MOHAMMAD SAJAD SADOUGH was born in France in 1979. He received his B.S. degree in electrical
engineering from Shahid Beheshti University, Tehran, Iran in 2002 and his M.S. degree in Signal Processing from Université Paris Sud,
Orsay, France in 2004. His fields of interest are semi-blind channel estimation and ultra wide band telecommunication.

About the Author—KHOSROW AMIRI ZADEH was born in Iran in 1966. He received his B.S. degree in computer engineering from
Shiraz University in 1989 and his M.S. degree in Machine Intelligence from Azad University in 1998, Tehran, Iran. Currently, he is working
at Azad University as a lecturer in computer science. His fields of interest are pattern recognition and its applications.
