
Orthogonality

Orthogonality 1 / 116
Goals

1 Orthogonality and linear independence

2 Orthogonalize a basis of a vector (sub)space with the Gram-Schmidt
orthogonalization algorithm
3 Orthogonal diagonalization and applications

Orthogonality 2 / 116
Orthogonality

Table of contents

1 Orthogonality
Orthogonality
Orthogonalization
Orthogonal Complement
Projection

2 Orthogonal Diagonalization

3 Singular Value Decomposition (SVD)

4 Positive Definite Matrices

5 An Application to Quadratic Forms

Orthogonality 3 / 116
Orthogonality Orthogonality

Dot product, length


   
Two vectors x = [x1 , . . . , xn ]^T and y = [y1 , . . . , yn ]^T in Rn

Dot product
Their dot product is

x · y = x1 y1 + · · · + xn yn

which is the matrix product xT y.

Length of a vector
The length of x is ∥x∥ = √(x1^2 + · · · + xn^2 )

Orthogonality 4 / 116
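A quick numerical illustration of the definitions above, written as a small Python/numpy sketch (an addition of this note, not part of the original slides); the vectors x and y are arbitrary examples:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 2.0])

dot = x @ y                    # x . y = x1*y1 + ... + xn*yn
length = np.sqrt(x @ x)        # ||x|| = sqrt(x1^2 + ... + xn^2)

print(dot)                          # 8.0, also the matrix product x^T y
print(length, np.linalg.norm(x))    # both give sqrt(14)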
Orthogonality Orthogonality

Properties of dot product and length

Let x, y and z be vectors in Rn . Then


1 x · y = y · x

2 x · (y + z) = x · y + x · z

3 (ax) · y = a(x · y) = x · (ay) for all scalars a ∈ R

4 ∥x∥^2 = x · x

5 ∥x∥ ≥ 0, and ∥x∥ = 0 if and only if x = 0

6 ∥ax∥ = |a| ∥x∥ for all scalars a ∈ R

Orthogonality 5 / 116
Orthogonality Orthogonality

Distance
Definition (Euclidean distance)

The distance between two vectors x and y in Rn is

d(x, y) = ∥x − y∥

Properties
Let x, y, z be three vectors in Rn . Then
1 d(x, y) ≥ 0
2 d(x, y) = 0 if and only if x = y
3 d(x, y) = d(y, x)
4 d(x, z) ≤ d(x, y) + d(y, z)

Orthogonality 6 / 116
Orthogonality Orthogonality

Orthogonal and Orthogonal Sets

Orthogonal
Two vectors x and y in Rn are orthogonal if x · y = 0

Orthogonal sets
A set of vectors x1 , x2 , · · · , xk in Rn is called an orthogonal set if

xi · xj = 0 ∀i ̸= j and xi ̸= 0 ∀i

A set of vectors x1 , x2 , · · · , xk in Rn is called orthonormal if it is
orthogonal and each xi is a unit vector, that is

∥xi ∥ = 1 ∀i

Orthogonality 7 / 116
Orthogonality Orthogonality

Example
The standard basis {e1 , . . . , en } is an orthonormal set in Rn

Example
If {x1 , x2 , · · · , xk } is an orthogonal set then so is {a1 x1 , a2 x2 , · · · , ak xk }
for all nonzero scalars ai

Normalizing an Orthogonal Set


If {x1 , x2 , · · · , xk } is an orthogonal set then { (1/∥x1 ∥) x1 , (1/∥x2 ∥) x2 , · · · , (1/∥xk ∥) xk }
is an orthonormal set.

Orthogonality 8 / 116
Orthogonality Orthogonality

Example

     
If f1 = [1, 1, 1, −1]^T , f2 = [1, 0, 1, 2]^T and f3 = [−1, 0, 1, 0]^T then {f1 , f2 , f3 } is an orthogonal
set in R4 . After normalizing, the orthonormal set is

{ (1/2) f1 , (1/√6) f2 , (1/√2) f3 }

Orthogonality 9 / 116
Orthogonality Orthogonality

Orthogonality implies Linear Independence


Suppose that {x1 , x2 , · · · , xk } is an orthogonal set. Consider

t1 x1 + t2 x2 + · · · + tk xk = 0

Taking the dot product of both sides with x1 , we have

t1 ∥x1 ∥^2 + t2 (x2 · x1 ) + · · · + tk (xk · x1 ) = 0

Since the set is orthogonal, x2 · x1 = · · · = xk · x1 = 0. Hence

t1 ∥x1 ∥^2 = 0 ⇒ t1 = 0 (because ∥x1 ∥^2 ̸= 0)

Similarly t2 = · · · = tk = 0. Hence {x1 , x2 , · · · , xk } is linearly independent

Theorem
Every orthogonal set in Rn is linearly independent

Orthogonality 10 / 116
Orthogonality Orthogonality

Theorem (Expansion Theorem)


If f1 , f2 , . . . , fm is an orthogonal basis of a subspace U of Rn then for any
vector x ∈ U , we have
x = (x · f1 / ∥f1 ∥^2) f1 + (x · f2 / ∥f2 ∥^2) f2 + · · · + (x · fm / ∥fm ∥^2) fm

The expansion of x as a linear combination of the orthogonal basis
{f1 , f2 , . . . , fm } is called the Fourier expansion of x, and the coefficients
ti = (x · fi )/∥fi ∥^2 are called the Fourier coefficients.

Proof.
Suppose that
x = t1 f1 + · · · + tm fm
then
x · f1 = t1 (f1 · f1 ) = t1 ∥f1 ∥^2

So t1 = (x · f1 )/∥f1 ∥^2 . Similarly ti = (x · fi )/∥fi ∥^2 for i = 2, . . . , m.
Orthogonality 11 / 116
Orthogonality Orthogonality

Example
Let U = span{f1 , f2 , f3 } where f1 = [1, 1, 1, −1]^T , f2 = [1, 0, 1, 2]^T and
f3 = [−1, 0, 1, 0]^T . Then {f1 , f2 , f3 } is an orthogonal set and hence a basis of U .
Every vector x = (a, b, c, d) in U can be expanded as a linear combination of
{f1 , f2 , f3 } with Fourier coefficients

t1 = (x · f1 )/∥f1 ∥^2 = (a + b + c − d)/4
t2 = (x · f2 )/∥f2 ∥^2 = (a + c + 2d)/6
t3 = (x · f3 )/∥f3 ∥^2 = (−a + c)/2

That is
x = t1 f1 + t2 f2 + t3 f3 = . . .
Orthogonality 12 / 116
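A minimal numerical check of the Fourier expansion in this example, assuming Python with numpy; the vector x below is an arbitrary element of U built from the basis:

import numpy as np

f1 = np.array([1.0, 1.0, 1.0, -1.0])
f2 = np.array([1.0, 0.0, 1.0, 2.0])
f3 = np.array([-1.0, 0.0, 1.0, 0.0])

x = 2*f1 - f2 + 3*f3          # a vector known to lie in U = span{f1, f2, f3}

# Fourier coefficients t_i = (x . f_i) / ||f_i||^2
t = [(x @ f) / (f @ f) for f in (f1, f2, f3)]

print(t)                                              # [2.0, -1.0, 3.0]
print(np.allclose(x, t[0]*f1 + t[1]*f2 + t[2]*f3))    # True: x is recovered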
Orthogonality Orthogonalization

Recall: if {v1 , ..., vm } is linearly independent, and if vm+1 is not in
span{v1 , ..., vm }, then {v1 , ..., vm , vm+1 } is linearly independent

Orthogonal Lemma
Let {f1 , . . . , fm } be an orthogonal set in Rn . Given x ∈ Rn , write

fm+1 = x − (x · f1 / ∥f1 ∥^2) f1 − · · · − (x · fm / ∥fm ∥^2) fm

then
1 fm+1 · fk = 0 for k = 1, · · · , m
2 If x ∈/ span{f1 , . . . , fm } then fm+1 ̸= 0 and {f1 , . . . , fm , fm+1 } is an
orthogonal set

Orthogonality 13 / 116
Orthogonality Orthogonalization

One important consequence of the orthogonal lemma is an extension, for
orthogonal sets, of the fundamental fact that any independent set is part of
a basis
Theorem
Let U be a subspace in Rn
1 Every orthogonal set in U is a subset of an orthogonal basis of U
2 U has an orthogonal basis

Orthogonality 14 / 116
Orthogonality Orthogonalization

The second consequence of the orthogonal lemma is a procedure by which


any basis of a subspace U of Rn can be systematically modified to yield an
orthogonal basis of U
Gram-Schmidt Orthogonalization Algorithm
If {x1 , · · · , xm } is any basis of a subspace U of Rn , construct
f1 , f2 , · · · , fm in U as follows:

f1 = x1
f2 = x2 − (x2 · f1 / ∥f1 ∥^2) f1
...
fk = xk − (xk · f1 / ∥f1 ∥^2) f1 − (xk · f2 / ∥f2 ∥^2) f2 − · · · − (xk · fk−1 / ∥fk−1 ∥^2) fk−1

for k = 2, . . . , m. Then
1 f1 , f2 , · · · , fm is an orthogonal basis of U
2 span(f1 , · · · , fk ) = span(x1 , · · · , xk ) for all k = 1, . . . , m
Orthogonality 15 / 116
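The algorithm translates directly into code. Below is a sketch in Python/numpy (an assumption of this note, not from the slides); gram_schmidt is a hypothetical helper name, applied here to the basis of the example on the next slide:

import numpy as np

def gram_schmidt(xs):
    # orthogonalize a list of linearly independent vectors
    fs = []
    for x in xs:
        f = x.astype(float)
        for g in fs:
            f = f - (x @ g) / (g @ g) * g   # subtract the projection onto each earlier f
        fs.append(f)
    return fs

x1 = np.array([1, 1, -1, -1]); x2 = np.array([3, 2, 0, 1]); x3 = np.array([1, 0, 1, 0])
f1, f2, f3 = gram_schmidt([x1, x2, x3])
print(f1, f2, f3)    # [1 1 -1 -1], [2 1 1 2], [0.4 -0.3 0.7 -0.6] up to float formatting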
Orthogonality Orthogonalization

Example

Find an orthogonal basis of U = span{x1 , x2 , x3 } with


     
x1 = [1, 1, −1, −1]^T , x2 = [3, 2, 0, 1]^T , x3 = [1, 0, 1, 0]^T

Orthogonality 17 / 116
Orthogonality Orthogonalization

Solution
Observe that x1 , x2 and x3 are independent. The algorithm gives

f1 = x1 = [1, 1, −1, −1]^T

f2 = x2 − (x2 · f1 / ∥f1 ∥^2) f1 = x2 − (4/4) f1 = [2, 1, 1, 2]^T

f3 = x3 − (x3 · f1 / ∥f1 ∥^2) f1 − (x3 · f2 / ∥f2 ∥^2) f2 = x3 − (0/4) f1 − (3/10) f2 = (1/10) [4, −3, 7, −6]^T

Hence {f1 , f2 , f3 } is an orthogonal basis.
Remark
The orthogonality property does not change if a vector in the basis is multiplied by a
nonzero scalar. It may be convenient to eliminate fractions and use {f1 , f2 , 10f3 } as an
orthogonal basis of U
Orthogonality 18 / 116
Orthogonality Orthogonalization

Remark

In order to prove that {x1 , x2 , x3 } is an independent set, we can solve the
system of equations
sx1 + tx2 + ux3 = 0
to obtain the unique solution s = t = u = 0
Another possible procedure is as follows:
1 x1 ̸= 0, so {x1 } is an independent set
2 f2 ̸= 0, so x2 ∈/ span{f1 } = span{x1 }. That is, x1 and x2 are
independent
3 f3 ̸= 0, so x3 ∈/ span{f1 , f2 } = span{x1 , x2 }. Hence x1 , x2 , x3 are
independent.
The second approach is a consequence of the Gram-Schmidt orthogonalization
algorithm. It can be used to find an orthogonal basis of a subspace
spanned by a set of vectors

Orthogonality 19 / 116
Orthogonality Orthogonalization

Example

 
Find an orthogonal basis of U = span{x1 , x2 , x3 , x4 } where x1 = [0, 1, 0]^T ,
x2 = [1, 0, 1]^T , x3 = [1, 1, 1]^T , x4 = [1, 1, 3]^T

Orthogonality 20 / 116
Orthogonality Orthogonalization

Solution
The algorithm gives
f1 = x1
f2 = x2 − (x2 · f1 / ∥f1 ∥^2) f1 = x2 − 0 f1 = [1, 0, 1]^T ̸= 0
So f2 ∈/ span{f1 } = span{x1 }. It implies that f1 and f2 are
independent and span{f1 , f2 } = span{x1 , x2 }
f3 = x3 − (x3 · f1 / ∥f1 ∥^2) f1 − (x3 · f2 / ∥f2 ∥^2) f2 = x3 − (1/1) f1 − (2/2) f2 = [0, 0, 0]^T = 0
So x3 ∈ span{f1 , f2 } = span{x1 , x2 }. Hence
span{x1 , x2 , x3 } = span{x1 , x2 } = span{f1 , f2 }
and we do not need to pay attention to f3
f4 = x4 − (x4 · f1 / ∥f1 ∥^2) f1 − (x4 · f2 / ∥f2 ∥^2) f2 = x4 − (1/1) f1 − (4/2) f2 = [−1, 0, 1]^T ̸= 0
So x4 ∈/ span{f1 , f2 } = span{x1 , x2 , x3 }. Hence
span{x1 , x2 , x3 , x4 } = span{x1 , x2 , x4 } = span{f1 , f2 , f4 }
An orthogonal basis of U = span{x1 , x2 , x3 , x4 } is {f1 , f2 , f4 }
Orthogonality 21 / 116
Orthogonality Orthogonal Complement

Problem motivation
Suppose a point x and a plane U through the origin in R3 are given, and
we want to find the point p in the plane that is closest to x. Our
geometric intuition assures us that such a point p exists. In fact, p must
be chosen in such a way that x − p is perpendicular to the plane.

Orthogonal Complement
If U is a subspace of Rn , define the orthogonal complement U ⊥ of U
(pronounced ”U -perp”) by

U ⊥ = {x ∈ Rn | x · y = 0 ∀y ∈ U }

Orthogonality 22 / 116
Orthogonality Orthogonal Complement

Properties of the orthogonal complement

Lemma
Let U be a subspace of Rn
1 U ⊥ is a subspace of Rn
2 {0}⊥ = Rn and (Rn )⊥ = {0}
3 If U = span{x1 , . . . , xm } then
U ⊥ = {x ∈ Rn | x · xi = 0 for i = 1, . . . , m}

Orthogonality 23 / 116
Orthogonality Orthogonal Complement

Example
Find U ⊥ if U = span{[1, −1, 2, 0]^T , [1, 0, −2, 3]^T } in R4 .

Solution
x = [x, y, z, w]^T is in U ⊥ if and only if it is orthogonal to both
[1, −1, 2, 0]^T and [1, 0, −2, 3]^T , that is

x − y + 2z = 0 (1)
x − 2z + 3w = 0 (2)

Gaussian elimination gives U ⊥ = span{[2, 4, 1, 0]^T , [3, 3, 0, −1]^T }

Orthogonality 24 / 116
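A numerical way to obtain U ⊥ for this example, assuming numpy: U ⊥ is the null space of the matrix whose rows are the spanning vectors, and the null space can be read off from the SVD (a sketch, not the only method):

import numpy as np

M = np.array([[1.0, -1.0,  2.0, 0.0],
              [1.0,  0.0, -2.0, 3.0]])      # rows span U

_, s, Vt = np.linalg.svd(M)
rank = int(np.sum(s > 1e-10))
basis_Uperp = Vt[rank:]                     # rows of Vt beyond the rank span the null space

print(np.allclose(M @ basis_Uperp.T, 0))    # True: orthogonal to both spanning vectors
print(basis_Uperp.shape)                    # (2, 4): dim U + dim U-perp = 4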
Orthogonality Projection

Projection onto a Subspace of Rn

Let U be a subspace of Rn with orthogonal basis {f1 , . . . , fm }. For x in Rn ,
the vector

projU x = (x · f1 / ∥f1 ∥^2) f1 + · · · + (x · fm / ∥fm ∥^2) fm

is called the orthogonal projection of x on U .

For the zero subspace U = {0}, we define

proj{0} x = 0

Orthogonality 26 / 116
Orthogonality Projection

Projection onto a line in R3

Consider vectors x and d ̸= 0 in R3 . The projection p = projd x is defined
as

p = projd x = (x · d / ∥d∥^2) d

where the error e = x − p is orthogonal (perpendicular) to d.

Orthogonality 27 / 116
Orthogonality Projection

Theorem (Projection Theorem)


If U is a subspace of Rn and p = projU x, then
1 p ∈ U and x − p ∈ U ⊥
2 p is the vector in U closest to x in the sense that

∥x − p∥ < ∥x − y∥ ∀y ∈ U, y ̸= p

Orthogonality 28 / 116
Orthogonality Projection

Example
Let U = span{x1 , x2 } where x1 = [1, 1, 0, 1]^T and x2 = [0, 1, 1, 2]^T . If
x = [3, −1, 0, 2]^T , find the vector in U closest to x and express x as the
sum of a vector in U and a vector orthogonal to U .

Orthogonality 29 / 116
Orthogonality Projection

Solution
{x1 , x2 } are independent but not orthogonal. The Gram-Schmidt algorithm
gives an orthogonal basis {f1 , f2 } of U where f1 = x1 = [1, 1, 0, 1]^T and
f2 = x2 − (x2 · f1 / ∥f1 ∥^2) f1 = x2 − (3/3) f1 = [−1, 0, 1, 1]^T
Compute the projection using the orthogonal basis {f1 , f2 }:

p = projU x = (x · f1 / ∥f1 ∥^2) f1 + (x · f2 / ∥f2 ∥^2) f2 = (4/3) f1 − (1/3) f2 = (1/3) [5, 4, −1, 3]^T

Thus p is the vector in U closest to x and x − p = (1/3) [4, −7, 1, 3]^T is
orthogonal to every vector in U . The decomposition of x is

x = p + (x − p) = . . .

Orthogonality 30 / 116
Orthogonality Projection

Projection Onto a Subspace


Let {a1 , . . . , an } be a basis of a subspace U in Rm

Problem: find the combination p = x̂1 a1 + · · · + x̂n an closest to a
vector b ∈ Rm .

We need to find p = Ax̂, the vector in the column space of
A = [a1 . . . an ] closest to b. The error vector e = b − Ax̂ is
perpendicular to that space:

a1^T (b − Ax̂) = 0
. . .
an^T (b − Ax̂) = 0

or equivalently AT (b − Ax̂) = 0. Hence

AT Ax̂ = AT b

Orthogonality 31 / 116
Orthogonality Projection

If the ai are linearly independent then AT A is symmetric and invertible, so

x̂ = (AT A)−1 AT b

So
p = Ax̂ = A(AT A)−1 AT b
The matrix
P = A(AT A)−1 AT
is the projection matrix such that p = Pb

Orthogonality 32 / 116
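A short numpy sketch of the normal equations and the projection matrix P = A(AT A)−1 AT , using the A and b that appear in the exercise and example later in this section:

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # solve A^T A x_hat = A^T b
P = A @ np.linalg.inv(A.T @ A) @ A.T        # projection matrix onto col(A)
p = P @ b                                   # projection of b, equal to A x_hat

print(x_hat)                                # [ 5. -3.]
print(np.allclose(p, A @ x_hat))            # True
print(np.allclose(A.T @ (b - p), 0))        # error b - p is perpendicular to col(A)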
Orthogonality Projection

Visualization of projection onto a line and onto S =


column space of A

Let θ be the angle between b and the line through the vector a. Then the
projection p of b on the line has length

∥p∥ = (∥a∥ ∥b∥ cos θ / ∥a∥^2) ∥a∥ = ∥b∥ cos θ

and the length of the error is

∥e∥ = ∥b∥ sin θ
Orthogonality 33 / 116
Orthogonality Projection

Projection in R3

p1 is projection of b on the line Oz - axis = span{e3 } and p2 is the


projection of b onto the plane Oxy = span{e1 , e2 }
Orthogonality 34 / 116
Orthogonality Projection

Projection on Oz
Matrix of the subspace: A = e3 = [0, 0, 1]^T

So AT A = [0 0 1] [0, 0, 1]^T = 1. Hence the projection matrix is

P1 = A(AT A)−1 AT = [0, 0, 1]^T [0 0 1] =
[0 0 0]
[0 0 0]
[0 0 1]

The projection of b = [x, y, z]^T is

p1 = P1 b = [0, 0, z]^T
Orthogonality 35 / 116
Orthogonality Projection

Projection onto Oxy

Matrix of the subspace:
A = [e1 e2 ] =
[1 0]
[0 1]
[0 0]

So AT A =
[1 0]
[0 1]
(the 2 × 2 identity). Hence the projection matrix is

P2 = A(AT A)−1 AT =
[1 0 0]
[0 1 0]
[0 0 0]

The projection of b = [x, y, z]^T is

p2 = P2 b = [x, y, 0]^T
Orthogonality 36 / 116
Orthogonality Projection

Exercise

Let U = span{x1 , x2 } where x1 = [1, 1, 0, 1]^T and x2 = [0, 1, 1, 2]^T . Find
the projection of b = [3, −1, 0, 2]^T on U .

Orthogonality 37 / 116
Orthogonality Projection

Exercise

Let
A =
[1 0]
[1 1]
[1 2]
and b = [6, 0, 0]^T .

Find the projection of b on the column space of A.

Orthogonality 38 / 116
Orthogonality Projection

Application to Least Squares Approximation - Linear Regression

1 Solving AT Ax̂ = AT b gives the projection p = Ax̂ of b onto the
column space of A
2 When Ax = b has no solution then x̂ is the ”least-squares solution”:
∥b − Ax̂∥^2 = minimum
3 Setting the partial derivatives of the squared length of the error
E = ∥b − Ax̂∥^2 to zero (∂E/∂ x̂i = 0) also produces AT Ax̂ = AT b
4 To fit points (t1 , b1 ), . . . , (tm , bm ) by a straight line y = k + at, we need
to solve k + at1 = b1 , . . . , k + atm = bm , that is Ax = b with
A =
[1 t1 ]
[.  . ]
[1 tm ]
and x = [k, a]^T

Orthogonality 39 / 116
Orthogonality Projection

Example
Find the closest line to the points (0, 6), (1, 0) and (2, 0)

Solution
We have
A =
[1 0]
[1 1]
[1 2]
and b = [6, 0, 0]^T .

The coefficients k, a of the fitted line are given by

[k, a]^T = x̂ = (AT A)−1 AT b = [5, −3]^T

So the fitted line is y = 5 − 3t

Orthogonality 40 / 116
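The same fit can be done with numpy's built-in least-squares routine (an assumed illustration, not part of the slides):

import numpy as np

t = np.array([0.0, 1.0, 2.0])
b = np.array([6.0, 0.0, 0.0])

A = np.column_stack([np.ones_like(t), t])        # columns [1, t] for y = k + a t
(k, a), *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizes ||b - Ax||^2

print(k, a)                                      # 5.0 -3.0, the line y = 5 - 3t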
Orthogonality Projection

Fitting by a Parabola
Galileo dropping a stone from the Leaning Tower of Pisa

Fit heights b1 , . . . , bm at times t1 , . . . , tm by a parabola C + Dt + Et^2

An exact fit would solve the system
C + Dt1 + Et1^2 = b1
. . .
C + Dtm + Etm^2 = bm
that is Ax = b with
A =
[1 t1 t1^2 ]
[.  .   .  ]
[1 tm tm^2 ]
which is generally unsolvable

Least squares: the closest parabola C + Dt + Et^2 chooses x̂ = [C, D, E]^T to
satisfy AT Ax̂ = AT b

Orthogonality 41 / 116
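A sketch of the parabola fit in numpy; the times and heights below are made-up illustrative values, not Galileo's data:

import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])          # hypothetical measurement times
b = np.array([100.0, 95.2, 80.5, 56.1, 21.9])    # hypothetical measured heights

A = np.column_stack([np.ones_like(t), t, t**2])  # columns [1, t, t^2]
C, D, E = np.linalg.lstsq(A, b, rcond=None)[0]   # least-squares solution of A^T A x_hat = A^T b

print(C, D, E)                                   # closest parabola C + D t + E t^2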
Orthogonal Diagonalization

Table of contents

1 Orthogonality
Orthogonality
Orthogonalization
Orthogonal Complement
Projection

2 Orthogonal Diagonalization

3 Singular Value Decomposition (SVD)

4 Positive Definite Matrices

5 An Application to Quadratic Forms

Orthogonality 42 / 116
Orthogonal Diagonalization

Question

An n × n matrix A is diagonalizable if and only if it has n independent
eigenvectors. The matrix P with these eigenvectors as columns is a
diagonalizing matrix for A, that is

P −1 AP is diagonal

The really nice bases of Rn are the orthogonal ones. So which matrices
have an orthogonal basis of eigenvectors?

The answer to this question is the main result of this section: exactly the
symmetric matrices

Orthogonality 43 / 116
Orthogonal Diagonalization

Normalize an orthogonal set

Recall that an orthogonal set of vectors is orthonormal if every vector has
length 1. Any orthogonal set {v1 , . . . , vk } can be normalized, i.e. converted
to an orthonormal set { (1/∥v1 ∥) v1 , . . . , (1/∥vk ∥) vk }

Orthogonality 44 / 116
Orthogonal Diagonalization

Orthogonal matrix

Theorem
The following conditions are equivalent for a square matrix P
1 P is invertible and P −1 = P T
2 The rows of P are orthonormal
3 The columns of P are orthonormal

Definition
A square matrix P is called an orthogonal matrix if it satisfies one of
the above conditions

Orthogonality 45 / 116
Orthogonal Diagonalization

Example
The rotation matrix
[cos θ  − sin θ]
[sin θ    cos θ]
is orthogonal for any angle θ

Orthogonality 46 / 116
Orthogonal Diagonalization

It is not enough that the rows of a matrix A are merely orthogonal for A
to be an orthogonal matrix.
Example
The matrix
[ 2  1  1]
[−1  1  1]
[ 0 −1  1]
has orthogonal rows, but its columns are not orthogonal.

However, if the rows are normalized, then the resulting matrix
[ 2/√6   1/√6   1/√6]
[−1/√3   1/√3   1/√3]
[ 0     −1/√2   1/√2]
is orthogonal

Orthogonality 47 / 116
Orthogonal Diagonalization

Example
if P and Q are orthogonal matrices then P Q and P are also orthogonal

Orthogonality 48 / 116
Orthogonal Diagonalization

Example
If P and Q are orthogonal matrices then P Q and P −1 are also orthogonal

Solution
Prove that P Q is orthogonal
P and Q are invertible and so is P Q with

(P Q)−1 = Q−1 P −1

Because P and Q are orthogonal, we have P −1 = P T and Q−1 = QT .


Hence
(P Q)−1 = QT P T = (P Q)T
Thus P Q is orthogonal

Orthogonality 48 / 116
Orthogonal Diagonalization

Solution (Cont)
Prove that P −1 is orthogonal
It is clear that P −1 is invertible with

(P −1 )−1 = P

Moreover P is orthogonal, so P −1 = P T . Hence

(P −1 )T = (P T )T = P

Thus
(P −1 )−1 = (P −1 )T
So P −1 is orthogonal

Orthogonality 49 / 116
Orthogonal Diagonalization

Definition (Orthogonally Diagonalizable Matrices)


An n × n matrix A is said to be orthogonally diagonalizable if there exists
an orthogonal matrix P such that P −1 AP = P T AP is diagonal

This condition turns out to characterize the symmetric matrices

Theorem (Principal Axes Theorem)

The following conditions are equivalent for a square matrix A
1 A has an orthonormal set of n eigenvectors
2 A is orthogonally diagonalizable
3 A is symmetric

A set of orthonormal eigenvectors of a symmetric matrix A is called a set
of principal axes for A. The name comes from geometry and this is
discussed in the application to quadratic forms later

Orthogonality 50 / 116
Orthogonal Diagonalization

Theorem
If A is symmetric then
(Ax) · y = x · (Ay)
for all column vectors x, y ∈ Rn

Theorem
If A is a symmetric matrix, then eigenvectors of A corresponding to
distinct eigenvalues are orthogonal.

Orthogonality 51 / 116
Orthogonal Diagonalization

Example
Find an orthogonal matrix P such that P −1 AP is diagonal, where
A =
[ 1  0 −1]
[ 0  1  2]
[−1  2  5]

Orthogonality 52 / 116
Orthogonal Diagonalization

Solution
The characteristic polynomial of A is

cA (x) = det(A − xI) = −x(x − 1)(x − 6)

Thus the eigenvalues are λ1 = 0, λ2 = 1 and λ3 = 6. The corresponding
eigenvectors are

x1 = [1, −2, 1]^T , x2 = [2, 1, 0]^T , x3 = [−1, 2, 5]^T

Orthogonality 53 / 116
Orthogonal Diagonalization

These vectors are orthogonal, so they only need to be normalized to create an
orthogonal diagonalizing matrix

P = [x1 /∥x1 ∥  x2 /∥x2 ∥  x3 /∥x3 ∥] = (1/√30)
[ √5   2√6  −1]
[−2√5   √6   2]
[ √5     0   5]

Thus P −1 = P T and

P T AP =
[0 0 0]
[0 1 0]
[0 0 6]

Orthogonality 54 / 116
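A numerical check of this diagonalization, assuming numpy (eigh is designed for symmetric matrices and returns orthonormal eigenvectors as columns):

import numpy as np

A = np.array([[ 1.0, 0.0, -1.0],
              [ 0.0, 1.0,  2.0],
              [-1.0, 2.0,  5.0]])

lam, P = np.linalg.eigh(A)                      # eigenvalues in ascending order

print(lam)                                      # approximately [0, 1, 6]
print(np.allclose(P.T @ P, np.eye(3)))          # True: P is orthogonal
print(np.allclose(P.T @ A @ P, np.diag(lam)))   # True: P^T A P is diagonal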
Singular Value Decomposition (SVD)

Table of contents

1 Orthogonality
Orthogonality
Orthogonalization
Orthogonal Complement
Projection

2 Orthogonal Diagonalization

3 Singular Value Decomposition (SVD)

4 Positive Definite Matrices

5 An Application to Quadratic Forms

Orthogonality 55 / 116
Singular Value Decomposition (SVD)

Image processing by linear algebra

1 An image is a large matrix of grayscale values, one for each pixel and
color
2 When nearby pixels are correlated (not random), the image can be
compressed
3 SVD separates any matrix A into rank one pieces (simple pieces)
uvT = (column)(row). This is useful for image compression
4 The rows and columns are eigenvectors of AT A and AAT respectively

Orthogonality 56 / 116
Singular Value Decomposition (SVD)

Example - Low Rank Images


Consider an image represented by the 6 × 6 matrix

A =
[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]

If you send A entry by entry then you need to use 36 numbers (36
pixels - each pixel requires 8 bits of information).
Orthogonality 57 / 116
Singular Value Decomposition (SVD)

Observe that

A = [1, 1, 1, 1, 1, 1]^T [1 1 1 1 1 1]

Instead of sending all the elements of the matrix A, we can send one column
[1, 1, 1, 1, 1, 1]^T and one row [1 1 1 1 1 1], which requires only 12 numbers

Orthogonality 58 / 116
Singular Value Decomposition (SVD)

With a 300 by 300 image, 90000 numbers become 600

Orthogonality 59 / 116
Singular Value Decomposition (SVD)

Rank 1 pattern A = uvT

Orthogonality 60 / 116
Singular Value Decomposition (SVD)

Rank 2 pattern A = c1 u1 v1^T + c2 u2 v2^T or higher

A =
[1 0]
[1 1]
is equal to A = [1, 1]^T [1 1] − [1, 0]^T [0 1]

If the rank of A is much higher than 2 (as for real images) then A adds up
many rank one pieces

A = Σ_{i=1}^{n} σi ui vi^T for n ≥ 2

We want the small pieces to be discarded with no loss of
visual quality - image compression.

Orthogonality 61 / 116
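A small numpy sketch of this compression idea: keep only the largest rank-one pieces of the SVD (truncated SVD). The matrix B below is just the 2 × 2 pattern above, used as a toy example:

import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(B)                 # B = sum_i s_i u_i v_i^T, s sorted descending
B1 = s[0] * np.outer(U[:, 0], Vt[0])        # best rank-1 approximation: keep the largest piece

print(s)                                    # singular values, largest first
print(B1)
print(np.linalg.norm(B - B1))               # error equals the discarded singular value s[1]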
Singular Value Decomposition (SVD)

In other words, we want to decompose an m × n matrix A into

A = U ΣV T

where U, V are orthogonal matrices and the matrix Σ is diagonal

Starting from the fact that AT A is symmetric, there exists an orthogonal
matrix V = [v1 . . . vn ] such that

V T (AT A)V = D

where D is a diagonal matrix whose diagonal entries λ1 , . . . , λn are the
eigenvalues of AT A, that is

AT Avi = λi vi

The decomposition A = U ΣV T leads to AV = U Σ, which requires Avi = σi ui .
Hence AAT ui = (1/σi ) AAT Avi = (1/σi ) A(λi vi ) = (λi /σi ) Avi = λi ui . That is, λi
is also an eigenvalue of AAT and ui is a corresponding eigenvector.
Remark that ∥Avi ∥^2 = λi ∥vi ∥^2 ≥ 0 ⇒ λi ≥ 0, and ∥Avi ∥^2 = ∥σi ui ∥^2 = σi^2 .
Hence σi = √λi
Orthogonality 62 / 116
Singular Value Decomposition (SVD)

Lemma
1 All eigenvalues of AT A and AAT are non-negative
2 AT A and AAT have the same set of positive eigenvalues {λi }

Definition

The real numbers σi = √λi are called the singular values of the matrix A

Theorem (SVD theorem)


Suppose that A is a matrix of rank r and let σ1 ≥ σ2 ≥ · · · ≥ σr > 0 be the
positive singular values of A. Then

A = Σ_{i=1}^{r} σi ui vi^T

where ui and vi are orthonormal eigenvectors corresponding to the eigenvalues
λi = σi^2 of AAT and AT A, called the left singular vectors and right
singular vectors respectively
Orthogonality 63 / 116
Singular Value Decomposition (SVD)

Theorem (SVD theorem (cont))


In other words

A = U ΣV T

where the columns of U and V are orthonormal eigenvectors of AAT and AT A
respectively, and
Σ =
[diag(σ1 , . . . , σr )  0]
[0                      0]
in block form, which is called the singular matrix of A

Geometric meaning: (rotation) × (stretching) × (rotation)


Orthogonality 64 / 116
Singular Value Decomposition (SVD)

SVD algorithm

In order to obtain the SVD of A:

1 Compute AT A and AAT

2 Find the eigenvalues λi of AT A and then the singular values σi = √λi of
A to create the singular matrix Σ
3 Find orthogonal matrices V and U of eigenvectors of AT A and AAT

4 Decompose A = U ΣV T

Orthogonality 65 / 116
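These steps can be checked against numpy's built-in SVD; the matrix below is an arbitrary small example (an assumption, not from the slides):

import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

U, s, Vt = np.linalg.svd(A)                 # A = U diag(s) V^T with s sorted descending

print(s)                                                    # singular values of A
print(np.allclose(U @ np.diag(s) @ Vt, A))                  # True: the factorization reproduces A
print(np.allclose(np.sort(s**2),
                  np.sort(np.linalg.eigvalsh(A.T @ A))))    # True: s_i^2 are eigenvalues of A^T A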
Singular Value Decomposition (SVD)

Example
Find an SVD for A =
[ 0  1]
[−1  0]

Orthogonality 66 / 116
Positive Definite Matrices

Table of contents

1 Orthogonality
Orthogonality
Orthogonalization
Orthogonal Complement
Projection

2 Orthogonal Diagonalization

3 Singular Value Decomposition (SVD)

4 Positive Definite Matrices

5 An Application to Quadratic Forms

Orthogonality 67 / 116
Positive Definite Matrices

Definition (Positive Definite Matrices)


A square matrix is called positive definite if it is symmetric and all its
eigenvalues λ are positive, that is λ > 0.

Because these matrices are symmetric, the principal axes theorem plays a
central role in the theory.
Theorem
If A is positive definite then it is invertible and det(A) > 0

We have the following characterization of positive definite matrices


Theorem
A symmetric matrix A is positive definite if and only if xT Ax > 0 for every
column x ̸= 0 in Rn

The proof of both theorems is based on the orthogonal diagonalization of
symmetric matrices
Orthogonality 68 / 116
Positive Definite Matrices

Example
If U is any invertible n × n matrix then A = U T U is positive definite

Solution
If x ̸= 0 in Rn then

xT Ax = xT U T U x = (U x)T (U x) = ∥U x∥2

Because x ̸= 0 and U is invertible, the vector U x ̸= 0 and then


∥U x∥2 > 0. Thus
xT Ax > 0
Hence A is positive definite.

Orthogonality 69 / 116
Positive Definite Matrices

Principal submatrices

Definition
If A is an n × n matrix, let (r) A denote the r × r submatrix in the upper
left corner of A. The matrices (1) A, (2) A, . . . , (n) A = A are called the
principal submatrices of A

Example
If A =
[10 5 2]
[ 5 3 2]
[ 2 2 3]
then (1) A = [10] , (2) A =
[10 5]
[ 5 3]
and (3) A = A

Orthogonality 70 / 116
Positive Definite Matrices

Theorem
If A is positive definite, so is each principal submatrix (r) A for
r = 1, 2, . . . , n

Proof.
Write
A =
[(r) A  P]
[Q      R]
in block form. For y ̸= 0 in Rr , consider x = [y, 0]^T ∈ Rn . Then x ̸= 0. So

0 < xT Ax = [yT 0T ] [(r) A  P ; Q  R] [y ; 0] = yT ((r) A) y

Hence (r) A is positive definite

Orthogonality 71 / 116
Positive Definite Matrices

Theorem
The following conditions are equivalent for a symmetric n × n matrix A
1 A is positive definite
2 det((r) A) > 0 for each r = 1, 2, . . . , n
3 A = U T U where U is an upper triangular matrix with positive entries
on the main diagonal
Furthermore, the factorization in (3) is unique (called the Cholesky
factorization of A)

Algorithm for the Cholesky Factorization A = U T U for positive


definite matrix A
Step 1 Carry A to an upper triangular matrix U1 with positive diagonal
entries using row operations each of which adds a multiple of a row to
a lower row
Step 2 Obtain U from U1 by dividing each row of U1 by the square root of
the diagonal entry in that row.
Orthogonality 72 / 116
Positive Definite Matrices

Example
 
Find the Cholesky factorization of A =
[10 5 2]
[ 5 3 2]
[ 2 2 3]

Solution
We have (1) A = [10] , (2) A =
[10 5]
[ 5 3]
and (3) A = A. It is easy to verify
that det((1) A) = 10 > 0, det((2) A) = 5 > 0 and det((3) A) = 3 > 0. So A
is positive definite and it has a Cholesky factorization.
Step 1: using the row operations r2 − (1/2)r1 , r3 − (1/5)r1 and then r3 − 2r2 ,

A =
[10 5 2]
[ 5 3 2]
[ 2 2 3]
→
[10  5    2 ]
[ 0 1/2   1 ]
[ 0  1  13/5]
→
[10  5   2 ]
[ 0 1/2  1 ]
[ 0  0  3/5]
= U1

Orthogonality 73 / 116
Positive Definite Matrices

Solution (cont)
Step 2: divide each row of U1 by the square root of its diagonal entry:

U =
[√10  5/√10  2/√10]
[ 0    1/√2    √2 ]
[ 0     0   √(3/5)]

One can verify that A = U T U

Orthogonality 74 / 116
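A numerical check with numpy: np.linalg.cholesky returns the lower triangular factor L with A = L LT , so U = LT is the upper triangular factor found above:

import numpy as np

A = np.array([[10.0, 5.0, 2.0],
              [ 5.0, 3.0, 2.0],
              [ 2.0, 2.0, 3.0]])

L = np.linalg.cholesky(A)        # lower triangular, A = L L^T
U = L.T                          # upper triangular with positive diagonal, A = U^T U

print(U)                         # first row approximately [3.162, 1.581, 0.632] = [sqrt(10), 5/sqrt(10), 2/sqrt(10)]
print(np.allclose(U.T @ U, A))   # True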
An Application to Quadratic Forms

Table of contents

1 Orthogonality
Orthogonality
Orthogonalization
Orthogonal Complement
Projection

2 Orthogonal Diagonalization

3 Singular Value Decomposition (SVD)

4 Positive Definite Matrices

5 An Application to Quadratic Forms

Orthogonality 75 / 116
An Application to Quadratic Forms

Quadratic form
Definition
A quadratic form q in n variables x1 , . . . , xn is a linear combination of
x1^2 , . . . , xn^2 and cross terms x1 x2 , x1 x3 , x2 x3 , . . .

q = Σ_{i=1}^{n} aii xi^2 + Σ_{i<j} (aij + aji ) xi xj = Σ_{i=1}^{n} Σ_{j=1}^{n} aij xi xj

This sum can be written compactly as a matrix product

q(x) = xT Ax

where x = [x1 . . . xn ]^T and A = [aij ].

There is no loss of generality in assuming that xi xj and xj xi have the
same coefficients in the sum for q, so we may assume that A is
symmetric
Orthogonality 76 / 116
An Application to Quadratic Forms

Example
Write q = x1^2 + 3x3^2 + 2x1 x2 − x1 x3 in the form q(x) = xT Ax where A
is a symmetric 3 × 3 matrix.

Solution
The cross terms are 2x1 x2 = x1 x2 + x2 x1 and −x1 x3 = −(1/2) x1 x3 − (1/2) x3 x1 ,
and both x2 x3 and x3 x2 have coefficient zero, as does x2^2 . Hence the required
form is q(x) = xT Ax with

A =
[ 1    1   −1/2]
[ 1    0     0 ]
[−1/2  0     3 ]

Orthogonality 77 / 116
An Application to Quadratic Forms

Problem

Given a symmetric matrix A and the quadratic form

q(x) = xT Ax

The problem is to find new variables y1 , . . . , yn related to x1 , . . . , xn such


that when q is expressed in terms of y1 , . . . , yn , there are no cross terms,
that is
q = b11 y1^2 + b22 y2^2 + · · · + bnn yn^2

If we write y = [y1 . . . yn ]^T then

q = yT Dy where D is a diagonal matrix

Orthogonality 78 / 116
An Application to Quadratic Forms

Solution

The symmetric matrix A can be orthogonally diagonalized: there exists an
orthogonal matrix P (that is, P −1 = P T ) such that

P T AP = D = diag(λ1 , λ2 , . . . , λn )

Define y by
x = P y, equivalently y = P T x
and substitution in q(x) = xT Ax gives

q = (P y)T A(P y) = yT P T AP y = yT Dy = λ1 y1^2 + λ2 y2^2 + · · · + λn yn^2

Orthogonality 79 / 116
An Application to Quadratic Forms

Principal axes
Let λ1 , . . . , λn be the eigenvalues of A (repeated according to their multiplicities)
and let {f1 , . . . , fn } be a corresponding set of orthonormal eigenvectors of A,
called a set of principal axes. Then the orthogonally diagonalizing matrix is

P = [f1 f2 . . . fn ]

and

x = P y = [f1 f2 . . . fn ] [y1 , y2 , . . . , yn ]^T = y1 f1 + y2 f2 + · · · + yn fn

The new variables yi are the coefficients when x is expanded in terms of
the orthonormal basis {f1 , . . . , fn } of Rn . Hence

q = q(x) = λ1 (x · f1 )^2 + · · · + λn (x · fn )^2

Orthogonality 80 / 116
An Application to Quadratic Forms

Example
Find new variables y1 , y2 such that

q = x1^2 + x1 x2 + x2^2

has diagonal form, and find the corresponding principal axes.

Solution
The form can be written as q = xT Ax where

x = [x1 , x2 ]^T and A =
[ 1   1/2]
[1/2   1 ]

The eigenvalues of A are the solutions of

cA (x) = det(xI − A) = x^2 − 2x + 3/4 = 0

which are λ1 = 0.5 and λ2 = 1.5
Orthogonality 81 / 116
An Application to Quadratic Forms

The corresponding orthonormal eigenvectors are the principal axes

f1 = [−1/√2, 1/√2]^T , f2 = [1/√2, 1/√2]^T

The diagonalizing matrix is

P = [f1 f2 ] =
[−1/√2  1/√2]
[ 1/√2  1/√2]

Introduce y = [y1 , y2 ]^T = P T x = (1/√2) [−x1 + x2 , x1 + x2 ]^T ; then

q = (1/2) y1^2 + (3/2) y2^2

Orthogonality 82 / 116
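A quick numerical check of this example, assuming numpy: orthogonally diagonalize A and confirm that q has no cross term in the new variables:

import numpy as np

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])              # matrix of q = x1^2 + x1 x2 + x2^2

lam, P = np.linalg.eigh(A)              # eigenvalues [0.5, 1.5]; columns of P are principal axes
print(lam)

x = np.array([2.0, -1.0])               # an arbitrary test point
y = P.T @ x                             # new variables y = P^T x

print(np.isclose(x @ A @ x, lam[0]*y[0]**2 + lam[1]*y[1]**2))   # True: q = 0.5 y1^2 + 1.5 y2^2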
An Application to Quadratic Forms

Quadratic form of two variables

In the case of two variables x1 and x2 , consider the quadratic form

q = ax1^2 + bx1 x2 + cx2^2 where a, c, and b^2 − 4ac are all nonzero

1 There is a counterclockwise rotation of the x1 - and x2 -axes about the
origin that yields the principal axes
2 The graph of the equation

ax1^2 + bx1 x2 + cx2^2 = 1

is an ellipse if b^2 − 4ac < 0 and a hyperbola if b^2 − 4ac > 0

Orthogonality 83 / 116
An Application to Quadratic Forms

Proof
If b = 0 then q already has no cross term and (1), (2) are clear. So
assume b ̸= 0. Then
A =
[ a   b/2]
[b/2   c ]
and A has characteristic polynomial

cA (x) = x^2 − (a + c)x − (1/4)(b^2 − 4ac)

Denote d = √(b^2 + (a − c)^2 ). Then the eigenvalues of A are

λ1 = (1/2)(a + c − d) and λ2 = (1/2)(a + c + d)

with the corresponding principal axes

f1 = (1/√(b^2 + (a − c − d)^2 )) [a − c − d, b]^T , f2 = (1/√(b^2 + (a − c − d)^2 )) [−b, a − c − d]^T

Orthogonality 84 / 116
An Application to Quadratic Forms

For part 1:
Because ∥f1 ∥ = 1, there exists an angle θ such that

cos θ = (a − c − d)/√(b^2 + (a − c − d)^2 ), sin θ = b/√(b^2 + (a − c − d)^2 )

Then
P = [f1 f2 ] =
[cos θ  − sin θ]
[sin θ    cos θ]
diagonalizes A, and the principal axes

f1 = [cos θ, sin θ]^T = P e1 and f2 = [− sin θ, cos θ]^T = P e2

can always be found by rotating the x1 and x2 axes around the origin
through the angle θ

Orthogonality 85 / 116
An Application to Quadratic Forms

rotating the x1 and x2 axes around the origin through an angle θ to obtain
principal axes
Orthogonality 86 / 116
An Application to Quadratic Forms

For part 2:
We have

det(A) = det
[λ1  0]
[ 0 λ2]
⇒ λ1 λ2 = (1/4)(4ac − b^2 )

In terms of y1 , y2 , the equation becomes

λ1 y1^2 + λ2 y2^2 = 1

whose graph is an ellipse if b^2 < 4ac and a hyperbola if b^2 > 4ac

Orthogonality 87 / 116
An Application to Quadratic Forms

Example
The notation in the previous result for the equation x^2 + xy + y^2 = 1
becomes a = b = c = 1. So the rotation angle θ is found by

cos θ = −1/√2 , sin θ = 1/√2

Hence θ = 3π/4. Thus the principal axes are

f1 = [−1/√2, 1/√2]^T , f2 = [−1/√2, −1/√2]^T

and then

y1 = (1/√2)(−x1 + x2 ), y2 = −(1/√2)(x1 + x2 )

In y1 y2 -coordinates, the equation becomes

(1/2) y1^2 + (3/2) y2^2 = 1
Orthogonality 88 / 116
An Application to Quadratic Forms

The angle θ is chosen such that the new y1 and y2 axes are the axes of
symmetry of the ellipse. The eigenvectors f1 and f2 point along these axes
of symmetry. For this reason, they are called principal axes

Orthogonality 89 / 116
An Application to Constrained Optimization

Table of contents

6 An Application to Constrained Optimization

7 Principle Component Analysis

Orthogonality 90 / 116
An Application to Constrained Optimization

Constrained Optimization

It is a frequent occurrence in applications that a function


q = q(x1 , x2 , ..., xn ) of n variables, called an objective function, is to be
made as large or as small as possible among all vectors x = (x1 , x2 , ..., xn )
lying in a certain region of Rn called the feasible region. A wide variety
of objective functions q arise in practice; our primary concern here is to
examine one important situation where q is a quadratic form.

Orthogonality 91 / 116
An Application to Constrained Optimization

Example
A politician proposes to spend x1 dollars annually on health care and x2
dollars annually on education. She is constrained in her spending by
various budget pressures, and one model of this is that the expenditures x1
and x2 should satisfy a constraint like

5x21 + 3x22 ≤ 15

Since xi ≥ 0 for each i, the feasible region is the shaded area

Orthogonality 92 / 116
An Application to Constrained Optimization

These choices have different effects on voters, and the politician wants to
choose x = (x1 , x2 ) to maximize some measure q = q(x1 , x2 ) of voter
satisfaction. Assume that for any value of c, all points on the graph of
q(x1 , x2 ) = c have the same appeal to voters. Hence the goal is to find the
largest value of c for which the graph of q(x1 , x2 ) = c contains a feasible
point.

Remark that the constraint can be put in the standard form ∥y∥ ≤ 1 with
y1 = x1 /√3 , y2 = x2 /√5 . So we can convert the above problem into finding the
maximum of a quadratic form subject to ∥y∥ ≤ 1 (the unit ball)
Orthogonality 93 / 116
An Application to Constrained Optimization

Theorem
Consider the quadratic form q = xT Ax where A is an n × n symmetric
matrix and let λ1 and λn denote the largest and smallest eigenvalues of
A. Then
1 max{q(x) | ∥x∥ ≤ 1} = λ1 and q(f1 ) = λ1 where f1 is any unit
λ1 -eigenvector
2 min{q(x) | ∥x∥ ≤ 1} = λn and q(fn ) = λn where fn is any unit
λn -eigenvector

Orthogonality 94 / 116
An Application to Constrained Optimization

Proof of (1)
Since A is symmetric, let the real eigenvalues of A be ordered as
λ1 ≥ λ2 ≥ · · · ≥ λn
By the principal axes theorem, let P be an orthogonal matrix such that
P T AP = D = diag(λ1 , λ2 , . . . , λn ) and define y = P T x, equivalently
x = P y; then ∥y∥ = ∥x∥ because ∥y∥2 = yT y = xT P P T x = xT x = ∥x∥2 .
Express q in terms of y, we have
q(x) = q(P y) = (P y)T A(P y) = yT P T AP y = yT Dy = λ1 y12 + · · · + λn yn2
Assume that ∥x∥ ≤ 1, then ∥y∥ = ∥x∥ ≤ 1. Since λi ≤ λ1 for all i, we
have
q(x) = λ1 y12 + · · · + λn yn2 ≤ λ1 y12 + · · · + λ1 yn2 = λ1 (y12 + · · · + yn2 )
= λ1 ∥y∥2 = λ1
Hence λ1 is the maximum value of q(x) when ∥x∥ ≤ 1.

The proof of (2) is analogous


Orthogonality 95 / 116
An Application to Constrained Optimization

Let f1 be a unit eigenvector corresponding to λ1 ; then

q(f1 ) = fT1 Af1 = fT1 (λ1 f1 ) = λ1 fT1 f1 = λ1 ∥f1 ∥2 = λ1

Orthogonality 96 / 116
An Application to Constrained Optimization

Example
Maximize and minimize the form q(x) = 3x21 + 14x1 x2 + 3x22 subject to
∥x∥ ≤ 1

Solution
The matrix of q is A =
[3 7]
[7 3]
with eigenvalues λ1 = 10 , λ2 = −4 and
the corresponding unit eigenvectors

f1 = (1/√2) [1, 1]^T , f2 = (1/√2) [1, −1]^T

Hence q(x) takes its maximum value 10 at x = f1 and its minimum value
−4 at x = f2

Orthogonality 97 / 116
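A numpy check of this example, relying on the theorem above (the extreme values of q on the unit ball are the extreme eigenvalues of A):

import numpy as np

A = np.array([[3.0, 7.0],
              [7.0, 3.0]])

lam, F = np.linalg.eigh(A)       # ascending: lam = [-4, 10]; columns of F are unit eigenvectors
f_min, f_max = F[:, 0], F[:, 1]

print(lam)                       # [-4. 10.]
print(f_max @ A @ f_max)         # 10, the maximum of q on the unit ball
print(f_min @ A @ f_min)         # -4, the minimum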
Principle Component Analysis

Table of contents

6 An Application to Constrained Optimization

7 Principle Component Analysis

Orthogonality 98 / 116
Principle Component Analysis

Sample data - Statistics inference

Suppose the heights h1 , h2 , ..., hn of n men are measured. Such a data set
is called a sample of the heights of all the men in the population under
study, and various questions are often asked about such a sample: What is
the average height in the sample? How much variation is there in the
sample heights, and how can it be measured? What can be inferred from
the sample about the heights of all men in the population? How do these
heights compare to heights of men in neighbouring countries? Does the
prevalence of smoking affect the height of a man?

Orthogonality 99 / 116
Principle Component Analysis

Principal Component Analysis (PCA) by SVD

1 Data often comes in a matrix: n samples and m measurements per
sample
2 Center each row of the matrix by subtracting the mean from each
measurement
3 The SVD finds the combinations of the data that contain the most
information
4 Largest singular value σ1 ↔ greatest variance ↔ most information in
u1

Orthogonality 100 / 116


Principle Component Analysis

Example

For m = 2 variables like age and height, the points lie in the plane R2 .
Subtract the average age and height to center the data. If the n recentered
points cluster along a line, how will linear algebra find that line?

Orthogonality 101 / 116


Principle Component Analysis

Sample mean
Represent a sample {x1 , . . . , xn } as a sample vector
x = [x1 , . . . , xn ]
The most widely known statistic for describing a data set is the
sample mean x̄ defined by x̄ = (1/n)(x1 + · · · + xn )

Figure 1: x = [−1, 0, 1, 4, 6] with sample mean x̄ = 2

The difference xi − x̄ is the deviation of xi from x̄, which can be
negative or positive, but the sum of these deviations is zero:

Σ_{i=1}^{n} (xi − x̄) = (Σ_{i=1}^{n} xi ) − nx̄ = nx̄ − nx̄ = 0
Orthogonality 102 / 116
Principle Component Analysis

Centred sample
If the mean x̄ is subtracted from each data value xi , the resulting data
xi − x̄ is said to be centred. The corresponding data vector

xc = [x1 − x̄, . . . , xn − x̄]

has mean x̄c = 0

Figure 2: Centred sample xc = [−3, −2, −1, 2, 4]

The effect of centring is to shift the data by an amount x̄ so that the
mean moves to 0
Orthogonality 103 / 116
Principle Component Analysis

Sample variance
To answer the question of how much variability is in the sample
x = [x1 , . . . , xn ]
that is, how widely the data are ”spread out” around the sample mean,
use the square (xi − x̄)^2 as a measure of variability
sample variance

s_x^2 = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)^2 = (1/(n − 1)) ∥x − x̄1∥^2

The sample variance will be large if there are many xi at a large
distance from the sample mean x̄ and it will be small if all the xi are
tightly clustered about the mean
The square root of the sample variance is the sample standard deviation
Orthogonality 104 / 116
Principle Component Analysis

Sample Covariance Matrix

Start with the measurements in A0 : the sample data. Find the mean
µ1 , . . . , µm of each row. Subtract each mean µi from row i to
center the data and obtain the centered matrix A.

The ”sample covariance matrix” is defined by S = AAT /(n − 1)

A shows the distance aij − µi from each measurement to its row
average µi
(AAT )11 and (AAT )22 show the sums of squared distances (which give
the sample variances s1^2 , s2^2 after dividing by n − 1)
(AAT )12 , divided by n − 1, gives the sample covariance s12 of
row 1 and row 2 of A

Orthogonality 105 / 116


Principle Component Analysis

Interpretation example

An average exam score of 75 tells you that it was a decent exam
A variance s^2 = 25 (standard deviation s = 5) means that most
grades were in the 70’s: closely packed
A sample variance s^2 = 225 (standard deviation s = 15) means that
grades were widely scattered
The covariance of the scores in two different subjects tells how one score
varies with the other in a linear relationship. A covariance close to
zero means the two scores are essentially unrelated; a high positive
covariance means students tend to be strong in both or weak in both, and a
negative covariance means strong in one tends to go with weak in the other

Orthogonality 106 / 116


Principle Component Analysis

Example - Six math and history scores (notice the zero mean in each row)

A =
[3 −4 7  1 −4 −3]
[7 −6 8 −1 −1 −7]
has sample covariance matrix

S = AAT /5 =
[20 25]
[25 40]

The rows of A are highly correlated: s12 = 25. Above average math
went with above average history.
Notice that S has positive trace and determinant. AAT is positive
definite.

Orthogonality 107 / 116
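A sketch of the covariance computation for this example, assuming numpy (the rows of A are already centered, as noted above):

import numpy as np

A = np.array([[3.0, -4.0, 7.0,  1.0, -4.0, -3.0],    # centered math scores
              [7.0, -6.0, 8.0, -1.0, -1.0, -7.0]])   # centered history scores

n = A.shape[1]                     # n = 6 samples
S = A @ A.T / (n - 1)              # sample covariance matrix

print(S)                           # [[20. 25.] [25. 40.]]
print(np.allclose(S, np.cov(A)))   # True: np.cov uses the same (n - 1) convention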


Principle Component Analysis

The Essentials of Principal Component Analysis (PCA)


PCA gives a way to understand data plot in dimension m - number of
variables. The crucial connection to linear algebra is in the singular values
and singular vecter in the centered data matrix A. Those come from the
eigenvalues and eigenvectors of the sample covariance matrix
S = AAT /(n − 1)
The total variance in the data is the sum of all eigenvalues and of sample
variances
total variance T = σ12 + · · · + σm
2
= s21 + · · · + s2m = trace

The first eigenvector u1 of S points in the most significant direction of the


σ2
data. This direction accounts for (or explain) a fraction T1 of total variance
σ22
the next eigenvector u2 (orthogonal to u1 ) accounts for smaller fraction T
Stop when those fractions are small. You have R dimensions that explain
most of the data. The n data points are very near a R-dimension subspace
with basis u1 to uR . These u’s are the principal components in
m-dimension space
Orthogonality 108 / 116
Principle Component Analysis

PCA procedure

PCA is a tool for dimension reduction in machine learning when the data
consists of many variables (features). It aims to reduce the number of
features (dimensions) while keeping the important information, by discarding
the components that contribute smaller variance
1 Center the data and compute the sample covariance matrix S

2 Compute the eigenvalues and eigenvectors of S

3 Pick the K largest eigenvalues and the corresponding eigenvectors
(principal components), which are orthonormal

4 Project the data onto the subspace spanned by the selected eigenvectors
to obtain projected points in lower dimension (the coordinates of the data
in the basis of selected eigenvectors)

Orthogonality 109 / 116


Principle Component Analysis

Example - six math and history scores


Eigenvalues of S are near 57 and 3. The unit eigenvectors are the principal
components:

                                                          σi^2   σi^2 /T
Principal component u1 = [0.56062881, 0.82806723]^T        57     0.95
Principal component u2 = [−0.82806723, 0.56062881]^T        3     0.05
Total                                                  T = 60     1

The leading vector u1 shows the dominant direction in the scatter plot
Orthogonality 110 / 116
Principle Component Analysis

Remark that {u1 , u2 } is an orthonormal set.

If we choose only the largest eigenvalue and the corresponding principal
component, then we express the raw score data in terms of u1 . For
example, the coordinate of the scores s1 = [3, 7]^T of the first student along
u1 is

(s1 · u1 )/∥u1 ∥^2 = s1 · u1 ≈ 7.5

That is, the 2-dimensional data point [3, 7]^T is reduced to a 1-dimensional
value ≈ 7.5
Orthogonality 111 / 116
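A short numerical check of the last two slides, assuming numpy; it recomputes the eigenvalues of S, the leading principal component, and the reduced 1-dimensional coordinate of the first student:

import numpy as np

A = np.array([[3.0, -4.0, 7.0,  1.0, -4.0, -3.0],
              [7.0, -6.0, 8.0, -1.0, -1.0, -7.0]])
S = A @ A.T / (A.shape[1] - 1)

lam, U = np.linalg.eigh(S)         # eigenvalues in ascending order
u1 = U[:, -1]                      # principal component for the largest eigenvalue (about 57)
if u1[0] < 0:
    u1 = -u1                       # fix the sign so both components are positive

print(lam)                         # approximately [3.07, 56.93]
print(u1)                          # approximately [0.56, 0.83]
print(A[:, 0] @ u1)                # coordinate of the first student's scores [3, 7] along u1, about 7.5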


Principle Component Analysis

Application of PCA

Eigenfaces to recognize faces


Searching the Web
Dynamics of Interest rate in Finance
...

Orthogonality 112 / 116


Principle Component Analysis

Example - Interest rate

Figure 3: U.S. Treasury Yields: 6 Days and 5 Centered Daily Differences

Orthogonality 113 / 116


Principle Component Analysis

The fractions σi^2 /T drop quickly to zero. The first three principal components
contain almost all of the information

Orthogonality 114 / 116


Principle Component Analysis

The principal components ui are orthogonal.


u1 measures a weighted average of the daily changes in the 9 yields
u2 gauges the daily change in the yield spread between long and short
bonds
u3 shows daily changes in the curvature (short and long bond versus
medium)
Orthogonality 115 / 116
Principle Component Analysis

Figure 4: The nine loadings on u1 , u2 , u3 from 3 months to 20 years

Orthogonality 116 / 116
