
An Introduction to Independent Component Analysis (ICA)

Yu-Te Wu
Institute of Radiological Sciences, National Yang-Ming University
Integrated Brain Function Laboratory, Taipei Veterans General Hospital
The Principle of ICA: a cocktail-party problem

$x_1(t) = a_{11} s_1(t) + a_{12} s_2(t) + a_{13} s_3(t)$

$x_2(t) = a_{21} s_1(t) + a_{22} s_2(t) + a_{23} s_3(t)$

Independent Component Analysis
Reference: A. Hyvärinen, J. Karhunen, E. Oja (2001). Independent Component Analysis. John Wiley & Sons.
Central limit theorem
• The distribution of a sum of independent random variables tends toward a Gaussian distribution.

Observed signal = $m_1 \mathrm{IC}_1 + m_2 \mathrm{IC}_2 + \dots + m_n \mathrm{IC}_n$
(the sum tends toward Gaussian; each IC is non-Gaussian)


Central Limit Theorem

Consider the partial sums of a sequence $\{z_i\}$ of independent and identically distributed random variables:

$x_k = \sum_{i=1}^{k} z_i$

Since the mean and variance of $x_k$ can grow without bound as $k \to \infty$, consider instead of $x_k$ the standardized variable

$y_k = \dfrac{x_k - m_{x_k}}{\sigma_{x_k}}$

The distribution of $y_k$ converges to a Gaussian distribution with zero mean and unit variance as $k \to \infty$.
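A quick numerical check of this statement (a sketch, not part of the slides): standardized sums of k i.i.d. uniform variables have excess kurtosis approaching the Gaussian value 0 as k grows.

import numpy as np

rng = np.random.default_rng(1)

def excess_kurtosis(y):
    # Sample estimate of E{(y - mu)^4} / (E{(y - mu)^2})^2 - 3.
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3

for k in (1, 2, 10, 100):
    z = rng.uniform(-1, 1, size=(k, 100_000))  # k i.i.d. uniform sequences z_i
    y = z.sum(axis=0)                          # partial sums x_k
    y = (y - y.mean()) / y.std()               # standardized variables y_k
    print(k, round(excess_kurtosis(y), 3))     # approaches 0 (Gaussian) as k grows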
How to estimate the ICA model
• Principle for estimating the ICA model: maximization of non-Gaussianity

Measures of non-Gaussianity
• Kurtosis

$\mathrm{kurt}(x) = E\{(x-\mu)^4\} - 3\,[E\{(x-\mu)^2\}]^2$

Super-Gaussian: kurtosis > 0
Gaussian: kurtosis = 0
Sub-Gaussian: kurtosis < 0

For independent $x_1, x_2$ and a scalar $\alpha$:
$\mathrm{kurt}(x_1 + x_2) = \mathrm{kurt}(x_1) + \mathrm{kurt}(x_2)$
$\mathrm{kurt}(\alpha x_1) = \alpha^4\, \mathrm{kurt}(x_1)$
Whitening process

• Assume the measurement $x = As$ is zero mean and $E\{ss^T\} = I$.

• Let $D$ and $E$ be the eigenvalue and eigenvector matrices of the covariance matrix of $x$, i.e. $E\{xx^T\} = EDE^T$.

• Then $V = D^{-1/2}E^T$ is a whitening matrix:

$z = Vx = D^{-1/2}E^T x$

$E\{zz^T\} = V E\{xx^T\} V^T = D^{-1/2}E^T (EDE^T) E D^{-1/2} = I$
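A direct NumPy transcription of this step (a sketch; the function name and the use of the sample covariance are my own choices):

import numpy as np

def whiten(x):
    # x: zero-mean data of shape (n, T). Returns z = V x with E{z z^T} ~ I.
    cov = np.cov(x)                # sample covariance of x
    d, E = np.linalg.eigh(cov)     # eigenvalues (vector d) and eigenvectors E
    V = np.diag(d ** -0.5) @ E.T   # whitening matrix V = D^(-1/2) E^T
    return V @ x, V

After z, V = whiten(x), the sample covariance np.cov(z) is the identity up to sampling error.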
Importance of whitening
For the whitened data $z$, find a vector $w$ such that the linear combination $y = w^T z$ has maximum non-Gaussianity under the constraint $E\{y^2\} = 1$.

Then
$1 = E\{y^2\} = E\{w^T z\, z^T w\} = w^T E\{z z^T\} w = w^T w$

so we can maximize $|\mathrm{kurt}(w^T z)|$ under the simpler constraint $\|w\| = 1$.


Constrained Optimization
max $F(w)$ subject to $\|w\|^2 = 1$

$L(w, \lambda) = F(w) - \lambda(\|w\|^2 - 1), \quad \|w\|^2 = w^T w = 1$

$\dfrac{\partial L(w, \lambda)}{\partial w} = 0$

$\Rightarrow \dfrac{\partial F(w)}{\partial w} - \lambda\, 2w = 0$

$\Rightarrow \dfrac{\partial F(w)}{\partial w} = 2\lambda w$

At the stationary point, the gradient of $F(w)$ must point in the direction of $w$, i.e. be equal to $w$ multiplied by a scalar.
Gradient of kurtosis

F (w )  kurt (w z )  E{(w z ) }  3[ E{(w z ) ] }


T T 4 T 2 2

F (w )  E{(w z ) }  3E{(w z ) }}
T 4 T 2


w w

 1 t 1 (w T z (t )) 4  3(w T w ) 2
T
1 T

T E{y}   y(t )
T t 1
w

4 T

T
t 1
z (t )[ w T
z (t )]3
 3 * 2( w T
w)(w  w )

2
 4sign(kurt(w z))[ E{z[w z] }  3w w ]
T T 3
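A sample-based version of this gradient (a sketch; it assumes whitened data z of shape (n, T) and a weight vector w of length n):

import numpy as np

def kurtosis_gradient(w, z):
    # 4 * sign(kurt(y)) * ( E{z y^3} - 3 ||w||^2 w ),  with y = w^T z.
    y = w @ z
    kurt_y = np.mean(y**4) - 3 * np.mean(y**2) ** 2
    return 4 * np.sign(kurt_y) * ((z * y**3).mean(axis=1) - 3 * (w @ w) * w)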
Fixed-point algorithm using kurtosis
Gradient ascent on $F$ would use
$w_{k+1} = w_k + \mu_k \left.\dfrac{\partial F(w)}{\partial w}\right|_{w_k}$

At a stationary point the gradient is proportional to $w$ itself ($\partial F/\partial w = 2\lambda w$), so instead of taking small gradient steps we can set $w$ directly to (a multiple of) the gradient:

$w \leftarrow E\{z (w^T z)^3\} - 3\|w\|^2 w$  (equal to $E\{z (w^T z)^3\} - 3w$, since $\|w\| = 1$)
$w \leftarrow w / \|w\|$

Note that adding a gradient proportional to $w_k$ would not change its direction, since

$w_{k+1} = w_k - \mu(2\lambda w_k) = (1 - 2\mu\lambda)\, w_k$

Convergence criterion: $|\langle w_{k+1}, w_k \rangle| = 1$, since $w_k$ and $w_{k+1}$ are unit vectors.


Fixed-point algorithm using kurtosis

1. Centering x~
x  m ~x

2. Whitening z  Vx, E{zz T }  I


3. Choose m, No. of ICs to estimate. Set counter p  1
One-by-one Estimation

4. Choose an initial guess of unit norm for wp, eg. randomly.


Fixed-point iteration

T 2
5. Let w p  E{z[ w p z ] }  3w p w p
3

6. Do deflation decorrelation p 1
w p  w p   (w Tp w j )w j
j 1

7. Let wp  wp/||wp||

8. If wp has not converged (|<wpk+1 , wpk>| 1), go to step 5.


9. Set p  p+1. If p  m, go back to step 4.
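Steps 1-9 translate into a short NumPy routine (a sketch under these slides' assumptions; the function and variable names are my own):

import numpy as np

def fastica_kurtosis(x, m, max_iter=200, tol=1e-8, seed=0):
    # x: observed data of shape (n, T); m: number of ICs to estimate.
    rng = np.random.default_rng(seed)
    x = x - x.mean(axis=1, keepdims=True)            # 1. centering
    d, E = np.linalg.eigh(np.cov(x))
    z = np.diag(d ** -0.5) @ E.T @ x                 # 2. whitening
    W = np.zeros((m, x.shape[0]))
    for p in range(m):                               # 3./9. one-by-one estimation
        w = rng.standard_normal(x.shape[0])          # 4. random unit-norm start
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = w @ z
            w_new = (z * y**3).mean(axis=1) - 3 * w  # 5. kurtosis fixed-point step
            w_new -= W[:p].T @ (W[:p] @ w_new)       # 6. deflation decorrelation
            w_new /= np.linalg.norm(w_new)           # 7. normalization
            converged = abs(w_new @ w) > 1 - tol     # 8. |<w_new, w_old>| ~ 1
            w = w_new
            if converged:
                break
        W[p] = w
    return W, W @ z                                  # unmixing vectors and estimated ICs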
Fixed-point algorithm using negentropy
The kurtosis is very sensitive to outliers, which may be erroneous or irrelevant observations.

Example: a random variable with sample size 1000, mean 0 and variance 1 that contains one value equal to 10 has kurtosis at least $10^4/1000 - 3 = 7$.

kurtosis: $E\{x^4\} - 3$  $\Rightarrow$ we need a more robust measure of non-Gaussianity

$\Rightarrow$ approximation of negentropy
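The slide's example is easy to reproduce (a sketch): a single value of 10 in a standardized sample of 1000 dominates the fourth moment.

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
x[0] = 10.0                   # one erroneous / irrelevant observation
x = (x - x.mean()) / x.std()  # mean 0, variance 1
print(np.mean(x**4) - 3)      # around 7-8 here (at least 7 per the bound above), versus ~0 without the outlier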
Fixed-point algorithm using negentropy
Entropy
$H(y) = -\int p_y(\eta) \log p_y(\eta)\, d\eta$

Negentropy
$J(y) = H(y_{\mathrm{gauss}}) - H(y) \geq 0$

Approximation of negentropy
$J(y) \propto [E\{G(y)\} - E\{G(\nu)\}]^2$

$G_1(y) = \frac{1}{a_1}\log\cosh(a_1 y),\ 1 \leq a_1 \leq 2$;  $g_1(y) = \tanh(a_1 y)$
$G_2(y) = -\exp(-y^2/2)$;  $g_2(y) = y\exp(-y^2/2)$
$G_3(y) = y^4/4$;  $g_3(y) = y^3$
Fixed-point algorithm using negentropy
Maximizing $J(y)$ leads to the update
$w \leftarrow E\{z\, g(w^T z)\} - E\{g'(w^T z)\}\, w$
$w \leftarrow w / \|w\|$

Convergence: $|\langle w_{k+1}, w_k \rangle| = 1$

$G_1(y) = \frac{1}{a_1}\log\cosh(a_1 y),\ 1 \leq a_1 \leq 2$;  $g_1(y) = \tanh(a_1 y)$;  $g_1'(y) = a_1[1 - \tanh^2(a_1 y)]$
$G_2(y) = -\exp(-y^2/2)$;  $g_2(y) = y\exp(-y^2/2)$;  $g_2'(y) = (1 - y^2)\exp(-y^2/2)$
$G_3(y) = y^4/4$;  $g_3(y) = y^3$;  $g_3'(y) = 3y^2$
Fixed-point algorithm using negentropy

1. Centering x~
x  m ~x

2. Whitening z  Vx, E{zz T }  I


3. Choose m, No. of ICs to estimate. Set counter p  1
One-by-one Estimation

4. Choose an initial guess of unit norm for wp, eg. randomly.


Fixed-point iteration

5. Let w p  E{z g ( w T
p z )}  E{ g ( w T
p z )}w p

6. Do deflation decorrelation p 1
w p  w p   (w Tp w j )w j
j 1

7. Let wp  wp/||wp||
8. If wp has not converged, go back to step 5.
9. Set p  p+1. If p  m, go back to step 4.
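The same loop with the negentropy update in step 5, using g = tanh (G1 with a1 = 1); a sketch assuming already-whitened data z:

import numpy as np

def fastica_negentropy(z, m, max_iter=200, tol=1e-8, seed=0):
    # z: whitened data of shape (n, T); m: number of ICs to estimate.
    rng = np.random.default_rng(seed)
    n = z.shape[0]
    W = np.zeros((m, n))
    for p in range(m):
        w = rng.standard_normal(n)                          # 4. random unit-norm start
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = np.tanh(w @ z)                              # g(w^T z)
            g_prime = 1 - y**2                              # g'(w^T z) = 1 - tanh^2
            w_new = (z * y).mean(axis=1) - g_prime.mean() * w   # 5. negentropy step
            w_new -= W[:p].T @ (W[:p] @ w_new)              # 6. deflation decorrelation
            w_new /= np.linalg.norm(w_new)                  # 7. normalization
            converged = abs(w_new @ w) > 1 - tol            # 8. convergence check
            w = w_new
            if converged:
                break
        W[p] = w                                            # 9. next component
    return W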
Implementations
• Create two uniform sources
• Two mixed observed signals
• Centering
• Whitening
• Fixed-point iteration using kurtosis
• Fixed-point iteration using negentropy
(these slides show the corresponding plots; see the sketch below)
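An end-to-end run matching these slides (a sketch; it reuses the fastica_negentropy sketch above, and the 2 x 2 mixing matrix is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(4)

# Create two uniform (sub-Gaussian) sources with zero mean and unit variance.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))

# Two mixed observed signals.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s

# Centering and whitening.
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
z = np.diag(d ** -0.5) @ E.T @ x

# Fixed-point iteration (negentropy version; the kurtosis version is analogous).
W = fastica_negentropy(z, m=2)
s_hat = W @ z   # estimated sources, recovered up to permutation, sign and scale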
Fixed-point algorithm using negentropy
Entropy
$H(y) = -\int p_y(\eta) \log p_y(\eta)\, d\eta$

A Gaussian variable has the largest entropy among all random variables of equal variance.

Negentropy
$J(y) = H(y_{\mathrm{gauss}}) - H(y)$
Gaussian $\Rightarrow J = 0$
non-Gaussian $\Rightarrow J > 0$

Negentropy is the optimal measure of non-Gaussianity, but it is computationally difficult since it requires an estimate of the pdf.

$\Rightarrow$ Approximation of negentropy
Fixed-point algorithm using negentropy
High-order cumulant approximation
$J(y) \approx \frac{1}{12} E\{y^3\}^2 + \frac{1}{48}\, \mathrm{kurt}(y)^2$

Replace the polynomial functions by non-polynomial functions $G_i$, e.g. $G_1$ odd and $G_2$ even:
$J(y) \approx k_1 (E\{G_1(y)\})^2 + k_2 (E\{G_2(y)\} - E\{G_2(\nu)\})^2$

Most random variables have approximately symmetric distributions, so it is common that $E\{G_1(y)\} \approx 0$, giving

$J(y) \propto [E\{G(y)\} - E\{G(\nu)\}]^2$

$G_1(y) = \frac{1}{a_1}\log\cosh(a_1 y),\ 1 \leq a_1 \leq 2$;  $g_1(y) = \tanh(a_1 y)$
$G_2(y) = -\exp(-y^2/2)$;  $g_2(y) = y\exp(-y^2/2)$
$G_3(y) = y^4/4$;  $g_3(y) = y^3$
Fixed-point algorithm using negentropy
According to the Lagrange multiplier condition, the gradient must point in the direction of $w$:

$\dfrac{\partial J(w)}{\partial w} = \dfrac{\partial}{\partial w}\,[E\{G(w^T z)\} - E\{G(\nu)\}]^2 = 2\,[E\{G(w^T z)\} - E\{G(\nu)\}]\, E\{z\, g(w^T z)\}$

where $\nu$ is a standardized Gaussian r.v. and $\gamma = 2\,[E\{G(w^T z)\} - E\{G(\nu)\}]$ is a constant, so

$w \propto \gamma\, E\{z\, g(w^T z)\}$
$w \leftarrow E\{z\, g(w^T z)\}$
$w \leftarrow w / \|w\|$
Fixed-point algorithm using negentropy
w  [ E{zg (w T z )}]
w w/ w

The iteration doesn't have the good convergence properties because the
nonpolynomial moments don't have the same nice algebraic properties.

(1   )w  [ E{zg (w T z )}]  w
w w/ w
Finding  by approximative Newton method

Real Newton method is fast.  small steps for convergence


It requires a matrix inversion at every step. large computational load

Special properties of the ICA problem  approximative Newton method


No need a matrix inversion but converge roughly with the same steps as real Newton method
Fixed-point algorithm using negentropy
According to the Lagrange multiplier condition, the gradient must point in the direction of $w$:
$\dfrac{\partial J(w)}{\partial w} - \beta w = 0$

Define $F(w) = E\{z\, g(w^T z)\} - \beta w = 0$ and solve this equation by Newton's method:

$JF(w)\,\Delta w = -F(w) \;\Rightarrow\; \Delta w = [JF(w)]^{-1}[-F(w)]$

$JF(w) = \dfrac{\partial F(w)}{\partial w} = E\{zz^T g'(w^T z)\} - \beta I$

Since the data is whitened, approximate
$E\{zz^T g'(w^T z)\} \approx E\{zz^T\}\, E\{g'(w^T z)\} = E\{g'(w^T z)\}\, I$

which diagonalizes the Jacobian:
$JF(w) \approx [E\{g'(w^T z)\} - \beta]\, I, \qquad [JF(w)]^{-1} \approx \dfrac{1}{E\{g'(w^T z)\} - \beta}\, I$

The Newton step is then
$w \leftarrow w - \dfrac{E\{z\, g(w^T z)\} - \beta w}{E\{g'(w^T z)\} - \beta}$

Multiplying both sides by $-[E\{g'(w^T z)\} - \beta]$ (which only rescales $w$ and is removed by the subsequent normalization) gives

$w \leftarrow E\{z\, g(w^T z)\} - E\{g'(w^T z)\}\, w$
$w \leftarrow w / \|w\|$
Fixed-point algorithm using negentropy
w  E{zg(wTz)}  E{g '(wTz)} w
w  w/||w||

Convergence : |<wk+1 , wk>|=1


Because ICs can be defined only up to a multiplication sign

1
G1 ( y )  log cosh a1 y, 1  a1  2 g1 ( y )  tanh a1 y g1 ( y )  a1[1  tanh 2 (a1 y )]
a1
G2 ( y )   exp( y 2 / 2) g 2 ( y )  y exp( y 2 / 2) g 2 ( y )  (1  y 2 ) exp( y 2 / 2)
y4
G3 ( y )  g3 ( y)  y 3 g 3 ( y )  3 y 2
4
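For reference, the three contrast functions and their derivatives in code (a sketch; a1 is fixed to 1 here, while the slides allow any value in [1, 2]):

import numpy as np

a1 = 1.0  # slides allow 1 <= a1 <= 2

def G1(y): return np.log(np.cosh(a1 * y)) / a1       # G1(y) = (1/a1) log cosh(a1 y)
def g1(y): return np.tanh(a1 * y)                    # g1 = G1'
def dg1(y): return a1 * (1 - np.tanh(a1 * y) ** 2)   # g1'

def G2(y): return -np.exp(-y**2 / 2)                 # G2(y) = -exp(-y^2/2)
def g2(y): return y * np.exp(-y**2 / 2)              # g2 = G2'
def dg2(y): return (1 - y**2) * np.exp(-y**2 / 2)    # g2'

def G3(y): return y**4 / 4                           # G3(y) = y^4 / 4
def g3(y): return y**3                               # g3 = G3'
def dg3(y): return 3 * y**2                          # g3'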
