Review of Prerequisites in
Math and Statistics
Prepared by
Li Yang
Based on
Appendix chapters of
Pattern Recognition, 4th Ed.
by S. Theodoridis and K. Koutroumbas
and figures from
Wikipedia.org
Probability and Statistics
Probability P(A) of an event A: a real number between 0 and 1.
Joint probability P(A ∩ B): probability that both A and B occur in a
single experiment.
P(A ∩ B) = P(A)P(B) if A and B are independent.
Probability P(A ∪ B) of union of A and B: either A or B occurs in a
single experiment.
P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive.
Conditional probability:
P(A | B) = P(A ∩ B) / P(B)
Therefore, the Bayes rule:
P(A | B) P(B) = P(B | A) P(A), and therefore P(A | B) = P(B | A) P(A) / P(B)
Total probability: let A_1, ..., A_m be events such that Σ_{i=1}^m P(A_i) = 1; then
P(B) = Σ_{i=1}^m P(B | A_i) P(A_i)
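As a quick illustration, here is a minimal Python sketch (with made-up numbers) that applies the total probability theorem and the Bayes rule to two events A1 and A2:

# Hypothetical priors and likelihoods for two mutually exclusive events A1, A2.
priors = [0.3, 0.7]          # P(A1), P(A2); they sum to 1
likelihoods = [0.9, 0.2]     # P(B|A1), P(B|A2)

p_b = sum(l * p for l, p in zip(likelihoods, priors))            # total probability P(B)
posteriors = [l * p / p_b for l, p in zip(likelihoods, priors)]  # Bayes rule: P(Ai|B)

print(p_b)         # 0.41
print(posteriors)  # about [0.659, 0.341]; the posteriors sum to 1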
Probability density function (pdf): p(x) for a continuous random variable x
P(a ≤ x ≤ b) = ∫_a^b p(x) dx
Statistical independence:
p(x, y) = p_x(x) p_y(y)
A measure comparing two pdfs p(x) and p'(x):
L(p(x), p'(x)) = ∫ p(x) ln( p'(x) / p(x) ) dx
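A minimal Python check of the independence condition on a small, made-up discrete joint distribution (the continuous case is analogous):

# Sketch: a discrete joint pmf that factorizes into its marginals.
import numpy as np

p_xy = np.array([[0.06, 0.14],      # rows index x, columns index y
                 [0.24, 0.56]])
p_x = p_xy.sum(axis=1)              # marginal of x: [0.2, 0.8]
p_y = p_xy.sum(axis=0)              # marginal of y: [0.3, 0.7]
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # True: x and y are independent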
Characteristic function of a pdf:
Φ(Ω) = ∫ p(x) exp(jΩ^T x) dx = E[exp(jΩ^T x)]
Moment generating function:
Φ(s) = ∫ p(x) exp(s x) dx = E[exp(s x)]
Binomial distribution B(n, p):
P(k) = C(n, k) p^k (1 − p)^(n−k), where C(n, k) is the binomial coefficient
Poisson distribution:
probability of the number of events occurring in a fixed period of time if these events occur with a known average rate.
P(k; λ) = λ^k e^(−λ) / k!
When n → ∞ and np remains constant,
B(n, p) → Poisson(np)
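A short sketch of this limit, assuming SciPy is available: the binomial pmf with n large and np = λ fixed approaches the Poisson(λ) pmf.

from scipy.stats import binom, poisson

lam = 3.0
for n in (10, 100, 1000):
    p = lam / n
    max_diff = max(abs(binom.pmf(k, n, p) - poisson.pmf(k, lam)) for k in range(20))
    print(n, max_diff)   # the maximum difference over k shrinks as n grows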
Normal (Gaussian) Distribution
Univariate N(μ, σ^2):
p(x) = (1 / (√(2π) σ)) exp( −(x − μ)^2 / (2σ^2) )
Assume x ~ χ^2(k), the chi-square distribution with k degrees of freedom.
Then (x − k) / √(2k) ~ N(0, 1) as k → ∞, by the central limit theorem.
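A quick numerical sanity check of this limit with NumPy (degrees of freedom, sample size, and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
k = 200                                  # degrees of freedom
x = rng.chisquare(k, size=100_000)       # draws from chi-square(k)
z = (x - k) / np.sqrt(2 * k)             # standardized values
print(z.mean(), z.std())                 # close to 0 and 1, as for N(0, 1)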
Other Continuous Distributions
t-distribution: estimating mean of a normal distribution when sample
size is small.
A t-distributed variable q = x / √(z / k), where x ~ N(0, 1) and z ~ χ^2(k).
p(q) = [ Γ((k + 1)/2) / (√(kπ) Γ(k/2)) ] (1 + q^2/k)^(−(k + 1)/2)
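The density above can be checked against a library implementation; a minimal sketch assuming SciPy is installed:

from math import gamma, pi, sqrt
from scipy.stats import t

k, q = 5, 1.3
pdf = gamma((k + 1) / 2) / (sqrt(k * pi) * gamma(k / 2)) * (1 + q**2 / k) ** (-(k + 1) / 2)
print(pdf, t.pdf(q, df=k))   # the two values agree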
Linear Algebra
Eigenvalues and eigenvectors:
there exist a scalar λ and a nonzero vector v such that Av = λv
A real matrix A is called positive semidefinite if x^T A x ≥ 0 for every
vector x;
A is called positive definite if x^T A x > 0 for every nonzero vector x.
Positive definite matrices behave like positive numbers:
all of their eigenvalues are positive.
If A is symmetric, A^T = A,
then eigenvectors corresponding to distinct eigenvalues are orthogonal, v_i^T v_j = 0 for i ≠ j.
Therefore, a symmetric A can be diagonalized as
A = Φ Λ Φ^T and Φ^T A Φ = Λ
where Φ = [v_1, v_2, ..., v_l] and Λ = diag(λ_1, λ_2, ..., λ_l)
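A small NumPy illustration of the diagonalization of a symmetric matrix (the example matrix is arbitrary):

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                          # symmetric, positive definite
lam, Phi = np.linalg.eigh(A)                        # eigenvalues and orthonormal eigenvectors
print(np.allclose(Phi.T @ Phi, np.eye(2)))          # True: eigenvectors are orthogonal
print(np.allclose(Phi @ np.diag(lam) @ Phi.T, A))   # True: A = Phi Lambda Phi^T
print(np.allclose(Phi.T @ A @ Phi, np.diag(lam)))   # True: Phi^T A Phi = Lambda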
Correlation Matrix and Inner Product Matrix
Principal component analysis (PCA)
Let x be a random variable in R^l; its correlation matrix Σ = E[x x^T] is positive
semidefinite and thus can be diagonalized as
Σ = Φ Λ Φ^T
Assign x' = Φ^T x; then Σ' = E[x' x'^T] = Λ.
Further assign x'' = Λ^(−1/2) Φ^T x; then Σ'' = E[x'' x''^T] = I.
Classical multidimensional scaling (classical MDS)
Given a distance matrix D={dij}, the inner product matrix G={xiTxj} can be
computed by a bidirectional centering process
G = −(1/2) (I − (1/n) e e^T) D (I − (1/n) e e^T), where e = [1, 1, ..., 1]^T
(here the entries of D are squared Euclidean distances, d_ij = ||x_i − x_j||^2).
G can be diagonalized as G = Φ' Λ' Φ'^T.
Actually, nΛ and Λ' share the same set of eigenvalues, and Φ' = X Φ Λ'^(−1/2), where X = [x_1, ..., x_n]^T.
Because G = X X^T, X can then be recovered as X = Φ' Λ'^(1/2).
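A minimal classical MDS sketch in NumPy on made-up data: starting from squared pairwise distances, double centering gives G, and the coordinates are recovered up to a rotation/reflection.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                           # n = 6 points in R^2
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared distance matrix

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n                   # centering matrix I - (1/n) e e^T
G = -0.5 * J @ D @ J                                  # inner product (Gram) matrix

lam, Phi = np.linalg.eigh(G)
lam, Phi = lam[::-1], Phi[:, ::-1]                    # eigenvalues in decreasing order
X_rec = Phi[:, :2] * np.sqrt(np.maximum(lam[:2], 0))  # X = Phi' Lambda'^(1/2)

# The recovered points reproduce the original squared distances.
print(np.allclose(np.linalg.norm(X_rec[:, None] - X_rec[None, :], axis=-1) ** 2, D))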
Cost Function Optimization
Find θ so that a differentiable function J(θ) is minimized.
Gradient descent method
Starts with an initial estimate θ(0)
Adjust θ iteratively by
Δθ = θ_new − θ_old = −μ ∂J(θ)/∂θ |_{θ = θ_old}, where μ > 0
Taylor expansion of J(θ) at a stationary point θ0
J(θ) = J(θ_0) + (θ − θ_0)^T g + (1/2) (θ − θ_0)^T H (θ − θ_0) + O(||θ − θ_0||^3)
where g = ∂J(θ)/∂θ |_{θ = θ_0} and H(i, j) = ∂^2 J(θ) / (∂θ_i ∂θ_j) |_{θ = θ_0}
Ignore higher order terms within a neighborhood of θ0
Since θ_0 is a stationary point, g = 0, and the gradient descent update becomes
θ_new − θ_0 = (I − μH)(θ_old − θ_0)
H is positive semidefinite and symmetric, so H = Φ Λ Φ^T, and we get
Φ^T (θ_new − θ_0) = (I − μΛ) Φ^T (θ_old − θ_0)
which will converge if every |1 − μλ_i| < 1, i.e., μ < 2/λ_max.
Therefore, the convergence speed is determined by the ratio λ_min / λ_max.
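A toy NumPy sketch of these convergence conditions on a quadratic cost (matrix and step size are made up):

import numpy as np

H = np.diag([1.0, 10.0])              # Hessian: lambda_min = 1, lambda_max = 10
mu = 0.15                             # convergent, since mu < 2 / lambda_max = 0.2
theta = np.array([5.0, 5.0])          # initial estimate
for _ in range(200):
    theta = theta - mu * (H @ theta)  # gradient of J(theta) = 1/2 theta^T H theta
print(theta)                          # close to the minimizer [0, 0]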
Newton’s method
Adjust θ iteratively by
Δθ = −H^(−1) ∂J(θ)/∂θ |_{θ = θ_old}, with H also evaluated at θ_old
Converges much faster than gradient descent.
In fact, from the Taylor expansion, we have
∂J(θ)/∂θ = H (θ − θ_0), so
θ_new = θ_old − H^(−1) (H (θ_old − θ_0)) = θ_0
The minimum is found in one iteration.
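A small NumPy check that one Newton step reaches the minimizer of a quadratic cost (all numbers are arbitrary):

import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                     # Hessian of J = 1/2 theta^T H theta - b^T theta
b = np.array([1.0, -1.0])
theta = np.array([10.0, -7.0])                 # arbitrary starting point
grad = H @ theta - b                           # gradient at theta
theta_new = theta - np.linalg.solve(H, grad)   # Newton step: theta - H^{-1} grad
print(theta_new, np.linalg.solve(H, b))        # identical: the exact minimizer H^{-1} b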
Conjugate gradient method
Adjust θ iteratively by Δθ_t = −g_t + β_t Δθ_{t−1}
where g_t = ∂J(θ)/∂θ |_{θ = θ_t}
and β_t = (g_t^T g_t) / (g_{t−1}^T g_{t−1}) (Fletcher–Reeves) or β_t = (g_t^T (g_t − g_{t−1})) / (g_{t−1}^T g_{t−1}) (Polak–Ribière)
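A minimal conjugate-gradient sketch with the Fletcher–Reeves beta on a quadratic cost, using an exact line search (all numbers are made up):

import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
theta = np.zeros(2)
g = H @ theta - b                     # gradient of J = 1/2 theta^T H theta - b^T theta
d = -g                                # initial search direction
for _ in range(2):                    # at most dim(theta) steps on a quadratic
    alpha = (g @ g) / (d @ H @ d)     # exact line-search step length
    theta = theta + alpha * d
    g_new = H @ theta - b
    beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves
    d = -g_new + beta * d
    g = g_new
print(theta, np.linalg.solve(H, b))   # theta matches the exact minimizer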
Constrained Optimization with
Equality Constraints
Minimize J(θ)
subject to fi(θ)=0 for i=1, 2, …, m
Minimization happens at a point where ∇J(θ) is a linear combination of the
constraint gradients ∇f_i(θ).
Lagrange multipliers: construct
L(θ, λ) = J(θ) + Σ_{i=1}^m λ_i f_i(θ)
and solve ∂L(θ, λ)/∂θ = 0 and ∂L(θ, λ)/∂λ = 0
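A worked Python sketch on an example problem chosen for illustration: minimize J(θ) = θ1^2 + θ2^2 subject to f(θ) = θ1 + θ2 − 1 = 0. Setting all partial derivatives of L to zero gives a linear system.

import numpy as np

# dL/dtheta1 = 2*theta1 + lambda = 0
# dL/dtheta2 = 2*theta2 + lambda = 0
# dL/dlambda = theta1 + theta2 - 1 = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
theta1, theta2, lam = np.linalg.solve(A, rhs)
print(theta1, theta2, lam)   # 0.5, 0.5, -1.0: the closest point on the line to the origin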
Constrained Optimization with
Inequality Constraints
Min-Max duality
Lagrange duality
Recall the optimization problem:
Minimize J(θ) s.t. f_i(θ) ≥ 0 for i = 1, 2, ..., m
Lagrange function: L(θ, λ) = J(θ) − Σ_{i=1}^m λ_i f_i(θ), with λ_i ≥ 0
Convex Programming
For a large class of applications, J(θ) is convex and the f_i(θ)'s are concave;
then the minimization solution (θ*, λ*) is a saddle point of L(θ, λ):
L(θ*, λ) ≤ L(θ*, λ*) ≤ L(θ, λ*)
L(θ*, λ*) = min_θ max_{λ≥0} L(θ, λ) = max_{λ≥0} min_θ L(θ, λ)
The dual problem: max_{λ≥0} L(θ, λ) subject to ∂L(θ, λ)/∂θ = 0
MUCH SIMPLER!
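A tiny worked example of this duality (the problem is chosen just for illustration): minimize J(θ) = θ^2 subject to f(θ) = θ − 1 ≥ 0, with L(θ, λ) = θ^2 − λ(θ − 1).

# Minimizing L over theta gives theta = lambda/2, so the dual function is
# q(lambda) = -(lambda**2)/4 + lambda, maximized over lambda >= 0 at lambda = 2.
lam_star = 2.0
theta_star = lam_star / 2                   # = 1.0, feasible: theta - 1 >= 0 holds with equality
primal_value = theta_star ** 2              # = 1.0
dual_value = -lam_star ** 2 / 4 + lam_star  # = 1.0, same value: no duality gap
print(theta_star, primal_value, dual_value)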
Mercer’s Theorem and the Kernel Method
Mercer’s theorem:
Let x ∈ R^l and let a mapping φ take x to φ(x) ∈ H
(H denotes a Hilbert space, i.e., a finite- or infinite-dimensional Euclidean space);
then the inner product <φ(x), φ(y)> can be expressed as a kernel function
<φ(x), φ(y)> = K(x, y)
where K(x, y) is symmetric, continuous, and positive semidefinite.
The converse is also true.
The kernel method can transform any algorithm that depends solely on
the dot product between two vectors into a kernelized version, by
replacing the dot product with the kernel function. The kernelized version
is equivalent to the original algorithm operating in the range space of φ.
Because kernels are used, however, φ is never explicitly computed.
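A concrete mini-example of the kernel trick: the polynomial kernel K(x, y) = (x^T y)^2 on R^2 has the explicit feature map φ(x) = [x1^2, x2^2, √2 x1 x2] (the vectors below are chosen just for illustration).

import numpy as np

def phi(x):
    # Explicit feature map for the kernel K(x, y) = (x^T y)^2 on R^2.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # inner product computed explicitly in H
print((x @ y) ** 2)      # same value via the kernel, without forming phi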