Review of Prerequisites in
Math and Statistics
Prepared by
Li Yang
Based on
Appendix chapters of
Pattern Recognition, 4th Ed.
by S. Theodoridis and K. Koutroumbas
and figures from
Wikipedia.org
Probability and Statistics
Probability P(A) of an event A: a real number between 0 and 1.
Joint probability P(A ∩ B): probability that both A and B occur in a
single experiment.
P(A ∩ B) = P(A)P(B) if A and B are independent.
Probability P(A ∪ B) of union of A and B: either A or B occurs in a
single experiment.
P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive.
Conditional probability:
P(A | B) = P(A ∩ B) / P(B)
Therefore, the Bayes rule:
P(A | B) P(B) = P(B | A) P(A), and therefore P(A | B) = P(B | A) P(A) / P(B)
Total probability: let A_1, ..., A_m be events such that Σ_{i=1}^m P(A_i) = 1; then
P(B) = Σ_{i=1}^m P(B | A_i) P(A_i)
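As a quick illustration, here is a minimal Python sketch (with made-up numbers) that applies the total probability theorem and the Bayes rule to two events A1 and A2:

# Hypothetical priors and likelihoods for two mutually exclusive events A1, A2.
priors = [0.3, 0.7]          # P(A1), P(A2); they sum to 1
likelihoods = [0.9, 0.2]     # P(B|A1), P(B|A2)

p_b = sum(l * p for l, p in zip(likelihoods, priors))            # total probability P(B)
posteriors = [l * p / p_b for l, p in zip(likelihoods, priors)]  # Bayes rule: P(Ai|B)

print(p_b)         # 0.41
print(posteriors)  # about [0.659, 0.341]; the posteriors sum to 1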
Probability density function (pdf): p(x) for a continuous random variable x
P(a ≤ x ≤ b) = ∫_a^b p(x) dx
Statistical independence:
p(x, y) = p_x(x) p_y(y)
A measure comparing two pdfs p(x) and p'(x):
L(p(x), p'(x)) = ∫ p(x) ln( p'(x) / p(x) ) dx
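A minimal Python check of the independence condition on a small, made-up discrete joint distribution (the continuous case is analogous):

# Sketch: a discrete joint pmf that factorizes into its marginals.
import numpy as np

p_xy = np.array([[0.06, 0.14],      # rows index x, columns index y
                 [0.24, 0.56]])
p_x = p_xy.sum(axis=1)              # marginal of x: [0.2, 0.8]
p_y = p_xy.sum(axis=0)              # marginal of y: [0.3, 0.7]
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # True: x and y are independent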
Characteristic function of a pdf:
Φ(Ω) = ∫ p(x) exp(jΩ^T x) dx = E[exp(jΩ^T x)]
Moment generating function:
Φ(s) = ∫ p(x) exp(s x) dx = E[exp(s x)]
Binomial distribution B(n, p):
P(k) = C(n, k) p^k (1 − p)^(n−k), where C(n, k) is the binomial coefficient
Poisson distribution:
probability of the number of events occurring in a fixed period of time if these events occur with a known average rate.
P(k; λ) = λ^k e^(−λ) / k!
When n → ∞ and np remains constant,
B(n, p) → Poisson(np)
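A short sketch of this limit, assuming SciPy is available: the binomial pmf with n large and np = λ fixed approaches the Poisson(λ) pmf.

from scipy.stats import binom, poisson

lam = 3.0
for n in (10, 100, 1000):
    p = lam / n
    max_diff = max(abs(binom.pmf(k, n, p) - poisson.pmf(k, lam)) for k in range(20))
    print(n, max_diff)   # the maximum difference over k shrinks as n grows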
Normal (Gaussian) Distribution
Univariate N(μ, σ^2):
p(x) = (1 / (√(2π) σ)) exp( −(x − μ)^2 / (2σ^2) )
Assume x ~ χ^2(k), the chi-square distribution with k degrees of freedom.
Then (x − k) / √(2k) ~ N(0, 1) as k → ∞, by the central limit theorem.
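A quick numerical sanity check of this limit with NumPy (degrees of freedom, sample size, and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
k = 200                                  # degrees of freedom
x = rng.chisquare(k, size=100_000)       # draws from chi-square(k)
z = (x - k) / np.sqrt(2 * k)             # standardized values
print(z.mean(), z.std())                 # close to 0 and 1, as for N(0, 1)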
Other Continuous Distributions
t-distribution: estimating mean of a normal distribution when sample
size is small.
A t-distributed variable q = x / √(z / k), where x ~ N(0, 1) and z ~ χ^2(k).
p(q) = [ Γ((k + 1)/2) / (√(kπ) Γ(k/2)) ] (1 + q^2/k)^(−(k + 1)/2)
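The density above can be checked against a library implementation; a minimal sketch assuming SciPy is installed:

from math import gamma, pi, sqrt
from scipy.stats import t

k, q = 5, 1.3
pdf = gamma((k + 1) / 2) / (sqrt(k * pi) * gamma(k / 2)) * (1 + q**2 / k) ** (-(k + 1) / 2)
print(pdf, t.pdf(q, df=k))   # the two values agree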
Linear Algebra
Eigenvalues and eigenvectors:
there exist a scalar λ and a nonzero vector v such that Av = λv
A real matrix A is called positive semidefinite if x^T A x ≥ 0 for every
vector x;
A is called positive definite if x^T A x > 0 for every nonzero vector x.
Positive definite matrices behave like positive numbers:
all of their eigenvalues are positive.
If A is symmetric, A^T = A,
then eigenvectors corresponding to distinct eigenvalues are orthogonal, v_i^T v_j = 0 for i ≠ j.
Therefore, a symmetric A can be diagonalized as
A = Φ Λ Φ^T and Φ^T A Φ = Λ
where Φ = [v_1, v_2, ..., v_l] and Λ = diag(λ_1, λ_2, ..., λ_l)
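A small NumPy illustration of the diagonalization of a symmetric matrix (the example matrix is arbitrary):

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                          # symmetric, positive definite
lam, Phi = np.linalg.eigh(A)                        # eigenvalues and orthonormal eigenvectors
print(np.allclose(Phi.T @ Phi, np.eye(2)))          # True: eigenvectors are orthogonal
print(np.allclose(Phi @ np.diag(lam) @ Phi.T, A))   # True: A = Phi Lambda Phi^T
print(np.allclose(Phi.T @ A @ Phi, np.diag(lam)))   # True: Phi^T A Phi = Lambda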
Correlation Matrix and Inner Product Matrix
Principal component analysis (PCA)
Let x be a random variable in R^l; its correlation matrix Σ = E[x x^T] is positive
semidefinite and thus can be diagonalized as
Σ = Φ Λ Φ^T
Assign x' = Φ^T x; then Σ' = E[x' x'^T] = Λ.
Further assign x'' = Λ^(−1/2) Φ^T x; then Σ'' = E[x'' x''^T] = I.
Classical multidimensional scaling (classical MDS)
Given a distance matrix D={dij}, the inner product matrix G={xiTxj} can be
computed by a bidirectional centering process
G = −(1/2) (I − (1/n) e e^T) D (I − (1/n) e e^T), where e = [1, 1, ..., 1]^T
(here the entries of D are squared Euclidean distances, d_ij = ||x_i − x_j||^2).
G can be diagonalized as G = Φ' Λ' Φ'^T.
Actually, nΛ and Λ' share the same set of eigenvalues, and Φ' = X Φ Λ'^(−1/2), where X = [x_1, ..., x_n]^T.
Because G = X X^T, X can then be recovered as X = Φ' Λ'^(1/2).
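A minimal classical MDS sketch in NumPy on made-up data: starting from squared pairwise distances, double centering gives G, and the coordinates are recovered up to a rotation/reflection.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                           # n = 6 points in R^2
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared distance matrix

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n                   # centering matrix I - (1/n) e e^T
G = -0.5 * J @ D @ J                                  # inner product (Gram) matrix

lam, Phi = np.linalg.eigh(G)
lam, Phi = lam[::-1], Phi[:, ::-1]                    # eigenvalues in decreasing order
X_rec = Phi[:, :2] * np.sqrt(np.maximum(lam[:2], 0))  # X = Phi' Lambda'^(1/2)

# The recovered points reproduce the original squared distances.
print(np.allclose(np.linalg.norm(X_rec[:, None] - X_rec[None, :], axis=-1) ** 2, D))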
Cost Function Optimization
Find θ so that a differentiable function J(θ) is minimized.
Gradient descent method
Starts with an initial estimate θ(0)
Adjust θ iteratively by
Δθ = θ_new − θ_old = −μ ∂J(θ)/∂θ |_{θ = θ_old}, where μ > 0
Taylor expansion of J(θ) at a stationary point θ0
J(θ) = J(θ_0) + (θ − θ_0)^T g + (1/2) (θ − θ_0)^T H (θ − θ_0) + O(||θ − θ_0||^3)
where g = ∂J(θ)/∂θ |_{θ = θ_0} and H(i, j) = ∂^2 J(θ) / (∂θ_i ∂θ_j) |_{θ = θ_0}
Ignore higher order terms within a neighborhood of θ0
Since θ_0 is a stationary point, g = 0, and the gradient descent update becomes
θ_new − θ_0 = (I − μH)(θ_old − θ_0)
H is positive semidefinite and symmetric, so H = Φ Λ Φ^T, and we get
Φ^T (θ_new − θ_0) = (I − μΛ) Φ^T (θ_old − θ_0)
which will converge if every |1 − μλ_i| < 1, i.e., μ < 2/λ_max.
Therefore, the convergence speed is determined by the ratio λ_min / λ_max.
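A toy NumPy sketch of these convergence conditions on a quadratic cost (matrix and step size are made up):

import numpy as np

H = np.diag([1.0, 10.0])              # Hessian: lambda_min = 1, lambda_max = 10
mu = 0.15                             # convergent, since mu < 2 / lambda_max = 0.2
theta = np.array([5.0, 5.0])          # initial estimate
for _ in range(200):
    theta = theta - mu * (H @ theta)  # gradient of J(theta) = 1/2 theta^T H theta
print(theta)                          # close to the minimizer [0, 0]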
Newton’s method
Adjust θ iteratively by
Δθ = −H^(−1) ∂J(θ)/∂θ |_{θ = θ_old}, with H also evaluated at θ_old
Converges much faster than gradient descent.
In fact, from the Taylor expansion, we have
∂J(θ)/∂θ = H (θ − θ_0), so
θ_new = θ_old − H^(−1) (H (θ_old − θ_0)) = θ_0
The minimum is found in one iteration.
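A small NumPy check that one Newton step reaches the minimizer of a quadratic cost (all numbers are arbitrary):

import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                     # Hessian of J = 1/2 theta^T H theta - b^T theta
b = np.array([1.0, -1.0])
theta = np.array([10.0, -7.0])                 # arbitrary starting point
grad = H @ theta - b                           # gradient at theta
theta_new = theta - np.linalg.solve(H, grad)   # Newton step: theta - H^{-1} grad
print(theta_new, np.linalg.solve(H, b))        # identical: the exact minimizer H^{-1} b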
Conjugate gradient method
Adjust θ iteratively by Δθ_t = −g_t + β_t Δθ_{t−1}
where g_t = ∂J(θ)/∂θ |_{θ = θ_t}
and β_t = (g_t^T g_t) / (g_{t−1}^T g_{t−1}) (Fletcher–Reeves) or β_t = (g_t^T (g_t − g_{t−1})) / (g_{t−1}^T g_{t−1}) (Polak–Ribière)
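A minimal conjugate-gradient sketch with the Fletcher–Reeves beta on a quadratic cost, using an exact line search (all numbers are made up):

import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
theta = np.zeros(2)
g = H @ theta - b                     # gradient of J = 1/2 theta^T H theta - b^T theta
d = -g                                # initial search direction
for _ in range(2):                    # at most dim(theta) steps on a quadratic
    alpha = (g @ g) / (d @ H @ d)     # exact line-search step length
    theta = theta + alpha * d
    g_new = H @ theta - b
    beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves
    d = -g_new + beta * d
    g = g_new
print(theta, np.linalg.solve(H, b))   # theta matches the exact minimizer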
Constrained Optimization with
Equality Constraints
Minimize J(θ)
subject to fi(θ)=0 for i=1, 2, …, m
Minimization happens at a point where ∇J(θ) is a linear combination of the
constraint gradients ∇f_i(θ).
Lagrange multipliers: construct
L(θ, λ) = J(θ) + Σ_{i=1}^m λ_i f_i(θ)
and solve ∂L(θ, λ)/∂θ = 0 and ∂L(θ, λ)/∂λ = 0
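A worked Python sketch on an example problem chosen for illustration: minimize J(θ) = θ1^2 + θ2^2 subject to f(θ) = θ1 + θ2 − 1 = 0. Setting all partial derivatives of L to zero gives a linear system.

import numpy as np

# dL/dtheta1 = 2*theta1 + lambda = 0
# dL/dtheta2 = 2*theta2 + lambda = 0
# dL/dlambda = theta1 + theta2 - 1 = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
theta1, theta2, lam = np.linalg.solve(A, rhs)
print(theta1, theta2, lam)   # 0.5, 0.5, -1.0: the closest point on the line to the origin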
Constrained Optimization with
Inequality Constraints
Min-Max duality
Lagrange duality
Recall the optimization problem:
Minimize J(θ) s.t. f_i(θ) ≥ 0 for i = 1, 2, ..., m
Lagrange function: L(θ, λ) = J(θ) − Σ_{i=1}^m λ_i f_i(θ), with λ_i ≥ 0
Convex Programming
For a large class of applications, J(θ) is convex and the f_i(θ)'s are concave;
then the minimization solution (θ*, λ*) is a saddle point of L(θ, λ):
L(θ*, λ) ≤ L(θ*, λ*) ≤ L(θ, λ*)
L(θ*, λ*) = min_θ max_{λ≥0} L(θ, λ) = max_{λ≥0} min_θ L(θ, λ)
The dual problem: max_{λ≥0} L(θ, λ) subject to ∂L(θ, λ)/∂θ = 0
MUCH SIMPLER!
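A tiny worked example of this duality (the problem is chosen just for illustration): minimize J(θ) = θ^2 subject to f(θ) = θ − 1 ≥ 0, with L(θ, λ) = θ^2 − λ(θ − 1).

# Minimizing L over theta gives theta = lambda/2, so the dual function is
# q(lambda) = -(lambda**2)/4 + lambda, maximized over lambda >= 0 at lambda = 2.
lam_star = 2.0
theta_star = lam_star / 2                   # = 1.0, feasible: theta - 1 >= 0 holds with equality
primal_value = theta_star ** 2              # = 1.0
dual_value = -lam_star ** 2 / 4 + lam_star  # = 1.0, same value: no duality gap
print(theta_star, primal_value, dual_value)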
Mercer’s Theorem and the Kernel Method
Mercer’s theorem:
Let x ∈ R^l and let a mapping φ take x to φ(x) ∈ H
(H denotes a Hilbert space, i.e., a finite- or infinite-dimensional Euclidean space);
then the inner product <φ(x), φ(y)> can be expressed as a kernel function
<φ(x), φ(y)> = K(x, y)
where K(x, y) is symmetric, continuous, and positive semidefinite.
The converse is also true.
The kernel method can transform any algorithm that depends solely on
the dot product between two vectors into a kernelized version, by
replacing the dot product with the kernel function. The kernelized version
is equivalent to the original algorithm operating in the range space of φ.
Because kernels are used, however, φ is never explicitly computed.
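A concrete mini-example of the kernel trick: the polynomial kernel K(x, y) = (x^T y)^2 on R^2 has the explicit feature map φ(x) = [x1^2, x2^2, √2 x1 x2] (the vectors below are chosen just for illustration).

import numpy as np

def phi(x):
    # Explicit feature map for the kernel K(x, y) = (x^T y)^2 on R^2.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # inner product computed explicitly in H
print((x @ y) ** 2)      # same value via the kernel, without forming phi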