
Quick Reference of Linear Algebra

Hao Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China
zhangh0214@gmail.com

Abstract

In linear algebra, concepts are much more important than computations. Computers can do the calculations, but you have to choose the calculations and interpret the results. The heart of linear algebra is the linear combination of vectors.

In this note, we give a quick reference for some of the key concepts and applications of linear algebra. The section on inner product spaces is omitted, since it rarely appears in computer science; please refer to [1] if you are interested in it. For further reading, you may consult [3, 4, 5, 6, 7, 8, 9].

1 Introduction: Vectors and Matrices

1.1 Vector and Matrix Norms

Definition 1 (ℓ𝑝-norm ‖𝒙‖𝑝). ‖𝒙‖𝑝 ∶= (∑_𝑖 |𝑥𝑖|^𝑝)^{1/𝑝}. In particular,
• ‖𝒙‖0 ∶= ∑_𝑖 𝕀(𝑥𝑖 ≠ 0).
• ‖𝒙‖1 ∶= ∑_𝑖 |𝑥𝑖|.
• ‖𝒙‖2 ∶= √(∑_𝑖 𝑥𝑖²) = √(𝒙⊤𝒙).
• ‖𝒙‖∞ ∶= max_𝑖 |𝑥𝑖|.

Definition 2 (Mahalanobis Distance). Given a set of instances {𝒙𝑖}_{𝑖=1}^{𝑚} with empirical mean 𝝁 ∈ ℝ^𝑑 and empirical covariance 𝚺 ∈ ℝ^{𝑑×𝑑}, the Mahalanobis distance of 𝒙 is defined as

√((𝒙 − 𝝁)⊤𝚺^{−1}(𝒙 − 𝝁)) = ‖𝚲^{−1/2}𝑸⊤(𝒙 − 𝝁)‖ ,  (1)

where 𝚺 = 𝑸𝚲𝑸⊤.

Definition 3 (Matrix Norm). ‖𝑨‖𝑝 ∶= max_{𝒙≠𝟎} ‖𝑨𝒙‖𝑝 / ‖𝒙‖𝑝. In particular,
• ‖𝑨‖1 ∶= max_𝑗 ∑_{𝑖=1}^{𝑚} |𝑎𝑖𝑗|.
• ‖𝑨‖∞ ∶= max_𝑖 ∑_{𝑗=1}^{𝑛} |𝑎𝑖𝑗|.
• ‖𝑨‖∗ ∶= ‖𝝈‖1 = ∑_{𝑖=1}^{𝑟} 𝜎𝑖.
• ‖𝑨‖2 ∶= ‖𝝈‖∞ = max_𝑖 𝜎𝑖.
• ‖𝑨‖F ∶= ‖𝝈‖2 = √(tr 𝑨⊤𝑨) = √(tr 𝑨𝑨⊤) = ‖vec 𝑨‖2,
where 𝝈 contains the singular values of 𝑨 = 𝑼𝚺𝑽⊤.

Lemma 1 (Cosine Formula). If 𝒖 ≠ 𝟎 and 𝒗 ≠ 𝟎, then cos⟨𝒖, 𝒗⟩ = 𝒖⊤𝒗 / (‖𝒖‖‖𝒗‖).

Lemma 2 (Schwarz Inequality). |𝒖⊤𝒗| ≤ ‖𝒖‖‖𝒗‖. The equality holds when ∃𝑐 ∈ ℝ. 𝒖 = 𝑐𝒗.

Lemma 3 (Triangle Inequality). ‖𝒖 + 𝒗‖ ≤ ‖𝒖‖ + ‖𝒗‖. The equality holds when ∃𝑐 ≥ 0. 𝒖 = 𝑐𝒗.

1.2 Vector and Matrix Multiplications

Algorithm 1 (Computing 𝑨𝒙). 𝑇(𝑚, 𝑛) ∼ 𝑚𝑛. In practice, the fastest way to compute 𝑨𝒙 depends on how the data is stored in memory. Fortran (column-major order) computes 𝑨𝒙 as a linear combination of the columns of 𝑨, while C (row-major order) computes 𝑨𝒙 using the rows of 𝑨.

𝑨𝒙 = ∑_{𝑗=1}^{𝑛} 𝑥𝑗𝒂𝑗 = [𝒂1⊤𝒙; 𝒂2⊤𝒙; ⋮; 𝒂𝑚⊤𝒙] .  (2)

Algorithm 2 (Computing 𝒚⊤𝑨). 𝑇(𝑚, 𝑛) ∼ 𝑚𝑛.

𝒚⊤𝑨 = ∑_{𝑖=1}^{𝑚} 𝑦𝑖𝒂𝑖⊤ = [𝒚⊤𝒂1  𝒚⊤𝒂2  ⋯  𝒚⊤𝒂𝑛] .  (3)

Algorithm 3 (Computing 𝑨𝑩). Suppose 𝑨 ∈ ℝ^{𝑚×𝑝} and 𝑩 ∈ ℝ^{𝑝×𝑛}. 𝑇(𝑚, 𝑛, 𝑝) ∼ 𝑚𝑛𝑝. In practice, Fortran (column-major order) computes 𝑨𝑩 by columns in parallel, while C (row-major order) computes 𝑨𝑩 by rows in parallel.

𝑨𝑩 = [𝑨𝒃1 𝑨𝒃2 ⋯ 𝑨𝒃𝑛] = [𝒂1⊤𝑩; 𝒂2⊤𝑩; ⋮; 𝒂𝑚⊤𝑩] = [𝒂𝑖⊤𝒃𝑗]_{𝑚×𝑛} = ∑_{𝑘=1}^{𝑝} 𝒂𝑘𝒃𝑘⊤ .  (4)

Lemma 4 (Warnings of Matrix Multiplications). The following are warnings about matrix multiplication.
• In general, 𝑨𝑩 ≠ 𝑩𝑨.
• In general, 𝑨𝑩 = 𝑨𝑪 ̸⇒ 𝑩 = 𝑪.
• In general, 𝑨𝑩 = 𝟎 ̸⇒ 𝑨 = 𝟎 ∨ 𝑩 = 𝟎.

1.3 Common Matrix Operations

Tables 1 and 2 summarize the properties of common matrix operations.

Lemma 5. Any square matrix 𝑨 ∈ ℝ^{𝑛×𝑛} can be written as the sum of a symmetric matrix and an anti-symmetric matrix: 𝑨 = ½(𝑨 + 𝑨⊤) + ½(𝑨 − 𝑨⊤).

Lemma 6. diag 𝑨𝑨⊤ = (𝑨 ⊙ 𝑨)𝟏.

Lemma 7. The following are properties of the vec operation.
• (vec 𝑨)⊤(vec 𝑩) = tr 𝑨⊤𝑩.
• (vec 𝑨𝑨⊤)⊤(vec 𝑩𝑩⊤) = tr(𝑨⊤𝑩)⊤(𝑨⊤𝑩) = ‖𝑨⊤𝑩‖F² = ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} (𝒂𝑖⊤𝒃𝑗)², where 𝑨 ∶= [𝒂1 𝒂2 ⋯ 𝒂𝑛] and 𝑩 ∶= [𝒃1 𝒃2 ⋯ 𝒃𝑛].

Algorithm 4 (Computing the Rank). rank 𝑨 is determined by the SVD of 𝑨: the rank is the number of nonzero singular values. In practice, extremely small nonzero singular values are treated as zero. In general, rank estimation is not a simple problem.

Algorithm 5 (Computing the Determinant). Get the REF 𝑼 of 𝑨. If there are 𝑝 row interchanges, det 𝑨 = (−1)^𝑝 ⋅ (product of pivots in 𝑼). Although 𝑼 is not unique and the pivots are not unique, the product of the pivots is unique. 𝑇(𝑛) ∼ (1/3)𝑛³.

2 Theory: Vector Spaces and Subspaces

2.1 The Four Fundamental Subspaces

Definition 4 (Orthogonal Subspaces 𝑆1 ⟂ 𝑆2). Two subspaces 𝑆1 and 𝑆2 are orthogonal if ∀𝒗1 ∈ 𝑆1, ∀𝒗2 ∈ 𝑆2. 𝒗1⊤𝒗2 = 0.

Definition 5 (Orthogonal Complement 𝑆⟂). The orthogonal complement of a subspace 𝑆 contains every vector that is perpendicular to 𝑆. 𝒞(𝑨)⟂ = 𝒩(𝑨⊤) and 𝒞(𝑨⊤)⟂ = 𝒩(𝑨).

Theorem 8. 𝒩(𝑨⊤𝑨) = 𝒩(𝑨). That is to say, if 𝑨 has independent columns, 𝑨⊤𝑨 is invertible.

Proof.

𝑨𝒙 = 𝟎 ⇒ 𝑨⊤𝑨𝒙 = 𝟎 ⇒ 𝒩(𝑨) ⊆ 𝒩(𝑨⊤𝑨) ,  (5)
𝑨⊤𝑨𝒙 = 𝟎 ⇒ 𝒙⊤𝑨⊤𝑨𝒙 = 0 ⇒ 𝑨𝒙 = 𝟎 ⇒ 𝒩(𝑨⊤𝑨) ⊆ 𝒩(𝑨) .  (6)

Definition 6 (Vector Space). A set of “vectors” together with rules for vector addition and for multiplication by real numbers. The addition and multiplication must produce vectors that are in the space. The space ℝ^𝑛 consists of all column vectors 𝒙 with 𝑛 components. The zero-dimensional space 𝟘 consists only of a zero vector 𝟎.

Definition 7 (Subspace). A subspace of a vector space is a set of vectors (including 𝟎) where all linear combinations stay in that subspace.

Definition 8 (Span). A set of vectors spans a space if their linear combinations fill the space. The span of a set of vectors is the smallest subspace containing those vectors.

Definition 9 (Linear Independence). The columns of 𝑨 are linearly independent iff 𝒩(𝑨) = 𝟘, or rank 𝑨 = 𝑛.

Definition 10 (Basis). A basis for a vector space is a set of linearly independent vectors that span the space. Every vector in the space is a unique combination of the basis vectors. The columns of 𝑨 ∈ ℝ^{𝑛×𝑛} are a basis for ℝ^𝑛 iff 𝑨 is invertible.

Definition 11 (Dimension). The dimension of a space is the number of vectors in every basis. dim 𝟘 = 0, since 𝟎 itself forms a linearly dependent set.

Definition 12 (Rank). rank 𝑨 ∶= dim 𝒞(𝑨), which also equals the number of pivots of 𝑨.

There are four fundamental subspaces for a matrix 𝑨, as illustrated in Table 3.

2.2 Matrix Inverses

Definition 13 (Invertible Matrix). A square matrix 𝑨 ∈ ℝ^{𝑛×𝑛} is invertible if there exists a matrix 𝑨⁻¹ such that 𝑨⁻¹𝑨 = 𝑰 or 𝑨𝑨⁻¹ = 𝑰 (if one holds, the other holds with the same 𝑨⁻¹). A singular matrix is a square matrix that is not invertible. The comparison between invertible and singular matrices is illustrated in Table 4.

Lemma 9. A square matrix 𝑨 cannot have two different inverses. This shows that a left-inverse and a right-inverse of a square matrix must be the same matrix.

Proof. 𝑨𝑙⁻¹ = 𝑨𝑙⁻¹𝑰 = 𝑨𝑙⁻¹𝑨𝑨𝑟⁻¹ = 𝑰𝑨𝑟⁻¹ = 𝑨𝑟⁻¹.

Lemma 10. The following are inverses of common matrices.
• A diagonal matrix has an inverse iff no diagonal entries are 0. If 𝑨 = diag(𝑑1, 𝑑2, …, 𝑑𝑛), then 𝑨⁻¹ = diag(𝑑1⁻¹, 𝑑2⁻¹, …, 𝑑𝑛⁻¹).
• A triangular matrix has an inverse iff no diagonal entries are 0.
• An elimination matrix 𝑬𝑖𝑗 has an inverse 𝑬𝑖𝑗⁻¹ that is the same as 𝑬𝑖𝑗, except that the (𝑖, 𝑗) entry flips sign.
• The inverse of a symmetric matrix is also symmetric.
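The norm definitions of Section 1.1 can be checked numerically. A minimal sketch with NumPy; the vector and matrix below are illustrative, not taken from the text:

```python
import numpy as np

# Vector norms of Definition 1 on an illustrative vector.
x = np.array([3.0, -4.0, 0.0])
assert np.linalg.norm(x, 0) == 2          # l0: number of nonzeros
assert np.linalg.norm(x, 1) == 7.0        # l1: sum of |x_i|
assert np.linalg.norm(x, 2) == 5.0        # l2: sqrt(x^T x)
assert np.linalg.norm(x, np.inf) == 4.0   # l-inf: max |x_i|

# Matrix norms of Definition 3, expressed through the singular values sigma.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
sigma = np.linalg.svd(A, compute_uv=False)         # descending order
assert np.isclose(np.linalg.norm(A, 2), sigma[0])  # spectral norm = max sigma_i
assert np.isclose(np.linalg.norm(A, "fro"), np.linalg.norm(sigma))
# ||A||_F^2 = tr(A^T A) = ||vec A||_2^2.
assert np.isclose(np.linalg.norm(A, "fro") ** 2, np.trace(A.T @ A))
```

`np.linalg.norm` covers every case in Definitions 1 and 3 except the nuclear norm ‖𝑨‖∗, which is `sigma.sum()`.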
Table 1: Properties of common matrix operations (I).

| 𝑓 | Inverse | Transpose | Rank |
| 𝑓(𝑨⊤) | (𝑨⊤)⁻¹ = (𝑨⁻¹)⊤ exists iff 𝑨 is invertible | (𝑨⊤)⊤ = 𝑨 | rank 𝑨⊤ = rank 𝑨 |
| 𝑓(𝑨⁻¹) | (𝑨⁻¹)⁻¹ = 𝑨 | (𝑨⁻¹)⊤ = (𝑨⊤)⁻¹ | - |
| 𝑓(𝑐𝑨) | (𝑐𝑨)⁻¹ = (1/𝑐)𝑨⁻¹ | (𝑐𝑨)⊤ = 𝑐𝑨⊤ | rank 𝑐𝑨 = rank 𝑨 (𝑐 ≠ 0) |
| 𝑓(𝑨 + 𝑩) | (𝑨 + 𝑩)⁻¹ ≠ 𝑨⁻¹ + 𝑩⁻¹ in general | (𝑨 + 𝑩)⊤ = 𝑨⊤ + 𝑩⊤ | rank(𝑨 + 𝑩) ≤ rank 𝑨 + rank 𝑩 |
| 𝑓(𝑨𝑩) | (𝑨𝑩)⁻¹ = 𝑩⁻¹𝑨⁻¹ exists if 𝑨 and 𝑩 are invertible | (𝑨𝑩)⊤ = 𝑩⊤𝑨⊤ | rank 𝑨𝑩 ≤ min(rank 𝑨, rank 𝑩) |
| 𝑓(𝑨⊤𝑨) | (𝑨⊤𝑨)⁻¹ exists iff 𝑨 has independent columns | (𝑨⊤𝑨)⊤ = 𝑨⊤𝑨 | rank 𝑨⊤𝑨 = rank 𝑨𝑨⊤ = rank 𝑨 |
| Others | 𝑨 and 𝑩 invertible ̸⇒ 𝑨 + 𝑩 invertible | (𝑨𝒙)⊤𝒚 = 𝒙⊤(𝑨⊤𝒚) | - |

Table 2: Properties of common matrix operations (II).

| 𝑓 | Determinant | Trace | Eigenvalue |
| 𝑓(𝑨) | det 𝑨 = ∏_{𝑖=1}^{𝑛} 𝜆𝑖 | tr 𝑨 = ∑_{𝑖=1}^{𝑛} 𝜆𝑖 | - |
| 𝑓(𝑨⊤) | det 𝑨⊤ = det 𝑨 | tr 𝑨⊤ = tr 𝑨 | 𝜆(𝑨⊤) = 𝜆(𝑨) |
| 𝑓(𝑨⁻¹) | det 𝑨⁻¹ = 1/det 𝑨 | - | 𝜆(𝑨⁻¹) = 1/𝜆(𝑨) |
| 𝑓(𝑐𝑨) | det 𝑐𝑨 = 𝑐^𝑛 det 𝑨 | tr 𝑐𝑨 = 𝑐 tr 𝑨 | - |
| 𝑓(𝑨 + 𝑩) | - | tr(𝑨 + 𝑩) = tr 𝑨 + tr 𝑩 | - |
| 𝑓(𝑨𝑩) | det 𝑨𝑩 = det 𝑨 det 𝑩 | tr 𝑨𝑩 = tr 𝑩𝑨 | - |
| 𝑓(𝑨⊤𝑨) | - | - | 𝜆(𝑨⊤𝑨) = 𝜆(𝑨𝑨⊤) = 𝜎(𝑨)² |
| 𝑓(𝑨^𝑘) | det 𝑨^𝑘 = (det 𝑨)^𝑘 | - | 𝜆(𝑨^𝑘) = 𝜆(𝑨)^𝑘 (if 𝑘 ≥ 1) |
| Others | det(𝑰 + 𝒖𝒗⊤) = 1 + 𝒖⊤𝒗 | 𝒙⊤𝒚 = tr 𝒙𝒚⊤ | 𝜆(𝑐𝑰 + 𝑨) = 𝑐 + 𝜆(𝑨) |

Table 3: Vector spaces for 𝑨 ∈ ℝ^{𝑚×𝑛}, where 𝑹 = 𝑬𝑨 is the RREF of 𝑨.

| Subspace | Definition | Basis | Dimension |
| Column space 𝒞(𝑨) | {𝒗 ∈ ℝ^𝑚 ∣ ∃𝒙. 𝑨𝒙 = 𝒗} | Pivot columns of 𝑨 | rank 𝑨 |
| Row space 𝒞(𝑨⊤) | {𝒗 ∈ ℝ^𝑛 ∣ ∃𝒙. 𝑨⊤𝒙 = 𝒗} | Pivot rows of 𝑨 or 𝑹 | rank 𝑨 |
| Nullspace 𝒩(𝑨) | {𝒙 ∈ ℝ^𝑛 ∣ 𝑨𝒙 = 𝟎} | Special solutions of 𝑨 or 𝑹 | 𝑛 − rank 𝑨 |
| Left nullspace 𝒩(𝑨⊤) | {𝒙 ∈ ℝ^𝑚 ∣ 𝑨⊤𝒙 = 𝟎} | Last rows of 𝑬 | 𝑚 − rank 𝑨 |

Table 4: Comparison of invertible and singular matrices (the matrix 𝑨 ∈ ℝ^{𝑛×𝑛}).

|  | Invertible matrices | Singular matrices |
| Number of pivots | 𝑛 | < 𝑛 |
| rank 𝑨 | 𝑛 | < 𝑛 |
| RREF | 𝑰 | [𝑰 𝑭; 𝟎 𝟎] (with zero rows) |
| Columns | Independent | Dependent |
| Rows | Independent | Dependent |
| Solution to 𝑨𝒙 = 𝟎 | Only 𝟎 | Infinitely many solutions |
| Solution to 𝑨𝒙 = 𝒃 | Only 𝑨⁻¹𝒃 | No or infinitely many solutions |
| Eigenvalues | All 𝜆 ≠ 0 | Some eigenvalue is 0 |
| det 𝑨 | ≠ 0 | 0 |
| 𝑨⊤𝑨 | PD | PSD |
| Linear transformation 𝒙 ↦ 𝑨𝒙 | One-to-one and onto | - |

Algorithm 6 (The 𝑨⁻¹ Algorithm). When 𝑨 is square and invertible, Gaussian elimination on [𝑨 𝑰] produces [𝑹 𝑬]. Since 𝑹 = 𝑰, 𝑬𝑨 = 𝑹 becomes 𝑬𝑨 = 𝑰, so the elimination result is [𝑰 𝑨⁻¹]. 𝑇(𝑛) ∼ 𝑛³. In practice, 𝑨⁻¹ is seldom computed, unless the entries of 𝑨⁻¹ are explicitly needed.
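Algorithm 6 can be sketched in a few lines of NumPy. This is a minimal Gauss–Jordan pass without pivoting, so it assumes the pivots it meets are nonzero; the 2×2 matrix is illustrative:

```python
import numpy as np

# Row-reduce the augmented matrix [A I] until the left block becomes I;
# the right block is then A^{-1}.
A = np.array([[2.0, 1.0], [5.0, 3.0]])
n = A.shape[0]
M = np.hstack([A, np.eye(n)])        # the augmented matrix [A I]

for j in range(n):
    M[j] = M[j] / M[j, j]            # scale the pivot row so the pivot is 1
    for i in range(n):
        if i != j:
            M[i] = M[i] - M[i, j] * M[j]   # eliminate above and below

A_inv = M[:, n:]
assert np.allclose(M[:, :n], np.eye(n))    # left block became I
assert np.allclose(A @ A_inv, np.eye(n))
assert np.allclose(A_inv, [[3.0, -1.0], [-5.0, 2.0]])
```

In a library one would call `np.linalg.solve(A, b)` instead, consistent with the remark that 𝑨⁻¹ is seldom formed explicitly.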
2.3 Ill-conditioned Matrices

Definition 14 (Ill-conditioned Matrix). An invertible matrix that can become singular if some of its entries are changed ever so slightly. In this case, row reduction may produce fewer than 𝑛 pivots as a result of roundoff error. Also, roundoff error can sometimes make a singular matrix appear to be invertible.

Definition 15 (Condition Number cond 𝑨). cond 𝑨 ∶= 𝜎1/𝜎𝑟 for a matrix 𝑨 ∈ ℝ^{𝑛×𝑛}. The larger the condition number, the closer the matrix is to being singular. cond 𝑰 = 1, and cond(singular matrix) = ∞.

2.4 Least-squares and Projections

It is often the case that 𝑨𝒙 = 𝒃 is overdetermined: 𝑚 > 𝑛. The 𝑛 columns span a small part of ℝ^𝑚. Typically 𝒃 is outside 𝒞(𝑨) and there is no solution. One approach is least-squares.

Theorem 11 (Least-squares Approximation). The projection of 𝒃 ∈ ℝ^𝑚 onto 𝒞(𝑨) is

𝒃̂ = 𝑨(𝑨⊤𝑨)⁻¹𝑨⊤𝒃 ,  (7)

where the projection matrix is 𝑷 = 𝑨(𝑨⊤𝑨)⁻¹𝑨⊤, and

arg min_𝒙 (1/𝑚)‖𝑨𝒙 − 𝒃‖² = (𝑨⊤𝑨)⁻¹𝑨⊤𝒃 .  (8)

In particular, the projection of 𝒃 onto the line 𝒂 ∈ ℝ^𝑚 is

𝒃̂ = (𝒂𝒂⊤ / 𝒂⊤𝒂) 𝒃 .  (9)

Proof. Let 𝒙⋆ ∶= arg min_𝒙 (1/𝑚)‖𝑨𝒙 − 𝒃‖² and let the approximation error be 𝒆 ∶= 𝒃 − 𝑨𝒙⋆. 𝒆 ⟂ 𝒞(𝑨) ⇒ 𝒆 ∈ 𝒩(𝑨⊤) ⇒ 𝑨⊤𝒆 = 𝑨⊤(𝒃 − 𝑨𝒙⋆) = 𝟎 ⇒ 𝑨⊤𝑨𝒙⋆ = 𝑨⊤𝒃. Another proof is by setting the gradient of the objective ℓ(𝒙) ∶= (1/𝑚)‖𝑨𝒙 − 𝒃‖² to zero: ∂ℓ(𝒙)/∂𝒙 = (2/𝑚)𝑨⊤𝑨𝒙 − (2/𝑚)𝑨⊤𝒃 = 𝟎.

Lemma 12. 𝑨𝒙 = 𝒃 has a unique least-squares solution for each 𝒃 when the columns of 𝑨 are linearly independent.

Algorithm 7 (Least-squares Approximation). Since cond 𝑨⊤𝑨 = (cond 𝑨)², least-squares is solved by QR factorization:

𝑨⊤𝑨𝒙 = 𝑨⊤𝒃 ⇒ 𝑹⊤𝑹𝒙 = 𝑹⊤𝑸⊤𝒃 ⇒ 𝑹𝒙 = 𝑸⊤𝒃 .  (10)

𝑇(𝑚, 𝑛) ∼ 𝑚𝑛².

Another common case is that 𝑨𝒙 = 𝒃 is underdetermined: 𝑚 < 𝑛, or 𝑨 has dependent columns. Typically there are infinitely many solutions. One approach is using regularization.

Theorem 13 (Least-squares Approximation with Regularization).

arg min_𝒙 (1/𝑚)‖𝑨𝒙 − 𝒃‖² + 𝜆‖𝒙‖² = ((1/𝑚)𝑨⊤𝑨 + 𝜆𝑰)⁻¹ (1/𝑚)𝑨⊤𝒃 .  (11)

Proof. By setting ∂ℓ(𝒙)/∂𝒙 = (2/𝑚)𝑨⊤𝑨𝒙 − (2/𝑚)𝑨⊤𝒃 + 2𝜆𝒙 = 𝟎.

Lemma 14 (Weighted Least-squares Approximation). Suppose 𝑪 ∈ ℝ^{𝑚×𝑚} is a diagonal matrix specifying the weight of each equation. Then

arg min_𝒙 (1/𝑚)(𝑨𝒙 − 𝒃)⊤𝑪(𝑨𝒙 − 𝒃) = (𝑨⊤𝑪𝑨)⁻¹𝑨⊤𝑪𝒃 .  (12)

Proof. By setting ∂ℓ(𝒙)/∂𝒙 = (2/𝑚)𝑨⊤𝑪𝑨𝒙 − (2/𝑚)𝑨⊤𝑪𝒃 = 𝟎.

2.5 Orthogonality

Lemma 15 (Plane in Point-normal Form). The equation of a hyperplane through a point 𝒙0 in the plane, with a normal vector 𝒘 orthogonal to the plane, is 𝒘⊤(𝒙 − 𝒙0) = 0.

Lemma 16. The distance from a point 𝒙 to a plane with a point 𝒙0 on the plane and a normal vector 𝒘 orthogonal to the plane is |𝒘⊤(𝒙 − 𝒙0)| / ‖𝒘‖.

Definition 16 (Orthogonal Vectors). Two vectors 𝒖 and 𝒗 are orthogonal if 𝒖⊤𝒗 = 0.

Definition 17 (Orthonormal Vectors 𝒒). The columns of 𝑸 are orthonormal if 𝑸⊤𝑸 = [𝒒𝑖⊤𝒒𝑗]_{𝑛×𝑛} = 𝑰. If 𝑸 is square, it is called an orthogonal matrix.

Lemma 17. The following are orthogonal matrices.
• Every permutation matrix 𝑷.
• The reflection matrix 𝑰 − 2𝒆𝒆⊤, where 𝒆 is any unit vector.

Lemma 18. Orthogonal matrices 𝑸 preserve certain norms.
• ‖𝑸𝒙‖2 = ‖𝒙‖2.
• ‖𝑸1𝑨𝑸2⊤‖2 = ‖𝑨‖2.
• ‖𝑸1𝑨𝑸2⊤‖F = ‖𝑨‖F.

Theorem 19. The projection of 𝒃 ∈ ℝ^𝑚 onto 𝒞(𝑸) is

𝒃̂ = 𝑸𝑸⊤𝒃 = ∑_{𝑗=1}^{𝑛} 𝒒𝑗𝒒𝑗⊤𝒃 .  (13)

If 𝑸 is square, 𝒃 = ∑_{𝑗=1}^{𝑛} 𝒒𝑗𝒒𝑗⊤𝒃.

Definition 18 (QR Factorization). 𝑨 ∈ ℝ^{𝑛×𝑛} can be written as 𝑨 = 𝑸𝑹, where the columns of 𝑸 ∈ ℝ^{𝑛×𝑛} are orthonormal, and 𝑹 ∈ ℝ^{𝑛×𝑛} is an upper triangular matrix.

Algorithm 8 (Gram–Schmidt Process). The idea is to subtract from every new vector its projections in the directions already set, and to divide the resulting vectors by their lengths, such that

𝑨 = [𝒂1 𝒂2 ⋯ 𝒂𝑛] = [𝒒1 𝒒2 ⋯ 𝒒𝑛] [𝒒1⊤𝒂1 𝒒1⊤𝒂2 ⋯ 𝒒1⊤𝒂𝑛; 𝒒2⊤𝒂2 ⋯ 𝒒2⊤𝒂𝑛; ⋱ ⋮; 𝒒𝑛⊤𝒂𝑛] = 𝑸𝑸⊤𝑨 = 𝑸𝑹 .  (14)
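The Gram–Schmidt process and the QR route to least squares (eq. (10)) can be sketched together in NumPy. The function name and the 3×2 test matrix are illustrative; the loop follows the subtract-projections-then-normalize idea directly:

```python
import numpy as np

def gram_schmidt_qr(A):
    """QR factorization by the classical Gram-Schmidt process (a sketch)."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        q = A[:, j].astype(float)
        for i in range(j):                 # subtract projections already set
            R[i, j] = Q[:, i] @ A[:, j]
            q -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(q)        # divide by the length
        Q[:, j] = q / R[j, j]
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 1.0, 1.0])
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q.T @ Q, np.eye(2)) and np.allclose(Q @ R, A)

# Least squares via R x = Q^T b instead of the normal equations.
x = np.linalg.solve(R, Q.T @ b)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)
```

This avoids forming 𝑨⊤𝑨, whose condition number is the square of that of 𝑨.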
The algorithm is illustrated in Alg. 1. 𝑇(𝑛) = ∑_{𝑗=1}^{𝑛} ∑_{𝑖=1}^{𝑗} 2𝑛 ∼ 𝑛³. In practice, the roundoff error can build up.

Algorithm 1 QR Factorization.
Input: 𝑨 ∈ ℝ^{𝑛×𝑛}
Output: 𝑸 ∈ ℝ^{𝑛×𝑛}, 𝑹 ∈ ℝ^{𝑛×𝑛}
1: 𝑸 ← 𝑹 ← 𝟎
2: for 𝑗 ← 0 to 𝑛 − 1 do
3:   𝒒𝑗 ← 𝒂𝑗
4:   for 𝑖 ← 0 to 𝑗 − 1 do
5:     𝑟𝑖𝑗 ← 𝒒𝑖⊤𝒂𝑗
6:     𝒒𝑗 ← 𝒒𝑗 − 𝑟𝑖𝑗𝒒𝑖
7:   𝑟𝑗𝑗 ← ‖𝒒𝑗‖
8:   𝒒𝑗 ← 𝒒𝑗 / ‖𝒒𝑗‖
9: return 𝑸, 𝑹

Algorithm 9 (Householder Reflections). In practice, Householder reflections are often used instead of the Gram–Schmidt process, even though the factorization requires about twice as much arithmetic.

3 Application: Solving Linear Systems

Understanding the linear system 𝑨𝒙 = 𝒃:
• Row picture: 𝑚 hyperplanes meet at a single point (if possible).
• Column picture: 𝑛 vectors are combined to produce 𝒃.

Definition 19 (Consistent Linear System). A linear system is said to be consistent if it has either one solution or infinitely many solutions, and it is said to be inconsistent if it has no solution.

The idea of Gaussian elimination is to replace one linear system with an equivalent linear system (i.e., one with the same solution set) that is easier to solve.

3.1 Elementary Row Operations

Definition 20 (Elementary Row Operations). The following are the three types of elementary row operations.
• (Replacement) Replace one row by the subtraction of itself and a multiple of another row of the matrix.
• (Interchange) Interchange any two rows.
• (Scaling) Multiply all entries of a row by a nonzero constant.

Definition 21 (Elementary Matrix 𝑬𝑖𝑗). The identity matrix with an extra nonzero entry −𝑙𝑖𝑗 in the (𝑖, 𝑗) position, where the multiplier 𝑙𝑖𝑗 ∶= (entry to eliminate in row 𝑖) / (pivot in row 𝑗). 𝑬𝑖𝑗𝑨 means that we perform (row 𝑖) − 𝑙𝑖𝑗 ⋅ (row 𝑗) to make the (𝑖, 𝑗) entry zero.

Definition 22 (Block Elimination). We perform (row 2) − 𝑪𝑨⁻¹ ⋅ (row 1) to get a zero block in the first column:

[𝑰 𝟎; −𝑪𝑨⁻¹ 𝑰] [𝑨 𝑩; 𝑪 𝑫] = [𝑨 𝑩; 𝟎 𝑫 − 𝑪𝑨⁻¹𝑩] .  (15)

The final block 𝑫 − 𝑪𝑨⁻¹𝑩 is called the Schur complement.

Definition 23 (Row Exchange Matrix 𝑷𝑖𝑗). The identity matrix with row 𝑖 and row 𝑗 exchanged. 𝑷𝑖𝑗𝑨 means that we exchange row 𝑖 and row 𝑗.

Definition 24 (Permutation Matrix 𝑷). A permutation matrix has the rows of the identity 𝑰 in any order. Such a matrix has a single 1 in every row and every column. The simplest permutation matrix is 𝑰; the next simplest are the row exchange matrices 𝑷𝑖𝑗. There are 𝑛! permutation matrices of order 𝑛, half of which have determinant 1, and the other half determinant −1. If 𝑷 is a permutation matrix, then 𝑷⁻¹ = 𝑷⊤, which is also a permutation matrix.

Definition 25 (Augmented Matrix [𝑨 𝒃]). Elimination does the same row operations to 𝑨 and to 𝒃. We can include 𝒃 as an extra column and let elimination act on whole rows of this matrix.

Definition 26 (Row Equivalent). Two matrices are called row equivalent if there is a sequence of elementary row operations that transforms one matrix into the other.

Lemma 20. If the augmented matrices of two linear systems are row equivalent, then the two linear systems have the same solution set.

3.2 Row Reduction and Echelon Forms

Definition 27 (Row Echelon Form (REF) 𝑼). A rectangular matrix is in row echelon form if it has the following properties.
• All nonzero rows are above any rows of all zeros.
• Each leading entry of a row is in a column to the right of the leading entry of the row above it.
• All entries in a column below a leading entry are zeros.

Definition 28 (Reduced Row Echelon Form (RREF) 𝑹). If a matrix in row echelon form satisfies the following additional conditions, then it is in reduced row echelon form.
• The leading entry in each nonzero row is 1.
• Each leading 1 is the only nonzero entry in its column.
The general form is 𝑹 = [𝑰 𝑭; 𝟎 𝟎], which has 𝑟 pivot rows and pivot columns, 𝑚 − 𝑟 zero rows, and 𝑛 − 𝑟 free columns. Every free column is a combination of earlier pivot columns. Free variables can be given any values whatsoever.

Definition 29 (Pivot). A pivot position in a matrix 𝑨 is a location in 𝑨 that corresponds to a leading 1 in the RREF of 𝑨.
A pivot row/column is a row/column of 𝑨 that contains a pivot position. A pivot is a nonzero number in a pivot position. A zero in the pivot position can be repaired if there is a nonzero below it.

Table 5: The four possibilities for the steady state problem 𝑨𝒙 = 𝒃, where 𝑨 ∈ ℝ^{𝑚×𝑛} and 𝑟 ∶= rank 𝑨. Gaussian elimination on [𝑨 𝒃] gives 𝑹𝒙 = 𝒅, where 𝑹 ∶= 𝑬𝑨 and 𝒅 ∶= 𝑬𝒃.

| Case | Shape of 𝑨 | RREF 𝑹 | Particular solution 𝒙𝑝 | Nullspace matrix | # solutions | Left inverse | Right inverse |
| 𝑟 = 𝑚 = 𝑛 | Square and invertible | 𝑰 | 𝑨⁻¹𝒃 | [𝟎] | 1 | 𝑨⁻¹ | 𝑨⁻¹ |
| 𝑟 = 𝑚 < 𝑛 | Short and wide | [𝑰 𝑭] | [𝒅; 𝟎] | [−𝑭; 𝑰] | ∞ | - | 𝑨⊤(𝑨𝑨⊤)⁻¹ |
| 𝑟 = 𝑛 < 𝑚 | Tall and thin | [𝑰; 𝟎] | [𝒅] or none | [𝟎] | 0 or 1 | (𝑨⊤𝑨)⁻¹𝑨⊤ | - |
| 𝑟 < 𝑚, 𝑟 < 𝑛 | Not full rank | [𝑰 𝑭; 𝟎 𝟎] | [𝒅; 𝟎] or none | [−𝑭; 𝑰] | 0 or ∞ | - | - |

Algorithm 10 (RREF). The algorithm to get the RREF of 𝑨 is illustrated in Alg. 2. We use partial pivoting to reduce the roundoff errors in the calculations. The FLOP count (number of multiplications/divisions of two floating point numbers) is 𝑇(𝑚, 𝑛) = ∑_{𝑖=1}^{𝑚−1} (𝑛 − 𝑖 + 1)(𝑚 − 𝑖) ∼ (1/3)𝑚³ + (1/3)𝑚²𝑛.

Algorithm 2 The RREF algorithm.
Input: 𝑨 ∈ ℝ^{𝑚×𝑛}
Output: 𝑹 ∈ ℝ^{𝑚×𝑛}
1: ⊳ Elimination downwards to produce zeros below the pivots.
2: while there is a nonzero row to modify do
3:   Pick the leftmost nonzero column. This is a pivot column. The pivot position is at the top.
4:   Use an elementary row interchange to select the entry with the largest absolute value in the pivot column as the pivot.
5:   Use elementary row replacements to create zeros in all positions below the pivot.
6:   Cover/ignore that pivot row.
7: Use elementary row scaling to produce 1 in all pivot positions.
8: ⊳ Elimination upwards to produce zeros above the pivots.
9: while there is a nonzero row to modify do
10:   Pick the rightmost nonzero column.
11:   Use elementary row replacements to create zeros in all positions above the pivot.
12:   Cover/ignore that pivot row.
13: return the result

Lemma 21. Any nonzero matrix is row equivalent to more than one matrix in REF, by using different sequences of elementary row operations. However, each matrix is row equivalent to one and only one matrix in RREF.

Lemma 22. An elimination matrix 𝑬 ∈ ℝ^{𝑚×𝑚}, which is a product of elementary matrices 𝑬𝑖𝑗, row exchange matrices 𝑷𝑖𝑗, and a diagonal matrix 𝑫⁻¹ (which divides rows by their pivots to produce 1's), puts the original 𝑨 into its RREF, i.e., 𝑬𝑨 = 𝑹. If we want 𝑬, we can apply row reduction to the matrix [𝑨 𝑰], namely, 𝑬[𝑨 𝑰] = [𝑹 𝑬].

3.3 Solution of a Linear System 𝑨𝒙 = 𝒃

Algorithm 11 (Solving 𝑨𝒙 = 𝒃). The algorithm is illustrated in Alg. 3. There are four possibilities for 𝑨𝒙 = 𝒃 depending on rank 𝑨, as illustrated in Table 5. 𝑇(𝑚, 𝑛) ∼ (1/3)𝑚³ + (1/3)𝑚²𝑛.

Algorithm 3 Solving a linear system 𝑨𝒙 = 𝒃.
Input: 𝑨 ∈ ℝ^{𝑚×𝑛}, 𝒃 ∈ ℝ^𝑚
Output: 𝒙 ∈ ℝ^𝑛
1: Use elementary row operations to transform the augmented matrix [𝑨 𝒃] to its RREF [𝑹 𝒅], where 𝑹 = [𝑰 𝑭; 𝟎 𝟎].
2: if there is a row in [𝑹 𝒅] whose entries are all zeros except the last one on the right then
3:   return “The system is inconsistent and has no solution”
4: Find one particular solution 𝒙𝑝 ∶= [𝒅; 𝟎], which solves 𝑨𝒙𝑝 = 𝒃, by setting the free variables to 0.
5: Find the special solutions, which are the columns of the nullspace matrix 𝑵 ∶= [−𝑭; 𝑰]; each solves 𝑨𝒙𝑛 = 𝟎. Every free column leads to a special solution. The complete null solution 𝒙𝑛 is a linear combination of the special solutions.
6: return 𝒙𝑝 + 𝒙𝑛

3.4 The LU Factorization

The LU factorization is motivated by the problem of solving a set of linear systems all with the same coefficient matrix: {𝑨𝒙𝑘 = 𝒃𝑘}_{𝑘=1}^{𝐾}.

Definition 30 (LU Factorization). Assuming no row exchanges, 𝑨 ∈ ℝ^{𝑚×𝑛} can be written as 𝑨 = 𝑳𝑼, where 𝑳 ∈ ℝ^{𝑚×𝑚} is an invertible lower triangular matrix with 1's on the diagonal and the multipliers 𝑙𝑖𝑗 below the diagonal, and 𝑼 ∈ ℝ^{𝑚×𝑛} is an REF of 𝑨.

Besides, we can extract from 𝑼 a diagonal matrix 𝑫 ∈ ℝ^{𝑚×𝑚} containing the pivots: 𝑨 = 𝑳𝑫𝑼. The new 𝑼 matrix has 1's in the pivot positions. When 𝑨 is symmetric, the usual LU factorization becomes 𝑨 = 𝑳𝑫𝑳⊤. Sometimes row exchanges are needed to produce pivots: 𝑷𝑨 = 𝑳𝑼.
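A minimal LU sketch in NumPy, without row exchanges (so it assumes nonzero pivots appear naturally, as in Definition 30), followed by the forward/backward-substitution solves that reuse the factorization for several right-hand sides. The 3×3 matrix is illustrative:

```python
import numpy as np

def lu(A):
    """Doolittle LU without pivoting: A = L U, 1's on the diagonal of L."""
    U = A.astype(float).copy()
    m = U.shape[0]
    L = np.eye(m)
    for k in range(m - 1):
        for i in range(k + 1, m):
            L[i, k] = U[i, k] / U[k, k]          # multiplier l_ik
            U[i, k:] -= L[i, k] * U[k, k:]       # eliminate row i
    return L, U

A = np.array([[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]])
L, U = lu(A)
assert np.allclose(L @ U, A)

# Solve A x = b_k for several right-hand sides with one factorization.
for b in (np.array([5.0, -2.0, 9.0]), np.array([1.0, 0.0, 0.0])):
    y = np.linalg.solve(L, b)    # forward substitution (L is triangular)
    x = np.linalg.solve(U, y)    # backward substitution
    assert np.allclose(A @ x, b)
```

Production code would use a pivoted factorization (e.g. `scipy.linalg.lu_factor` / `lu_solve`) rather than this no-exchange sketch.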
Algorithm 12 (The LU Factorization). The algorithm is illustrated in Alg. 4. 𝑇(𝑚, 𝑛) ∼ (1/3)𝑚³ + (1/3)𝑚²𝑛. For a band matrix 𝑩 with 𝑤 nonzero diagonals below and above its main diagonal, 𝑇(𝑚, 𝑛, 𝑤) ∼ 𝑚𝑤².

Algorithm 4 LU factorization on 𝑨.
Input: 𝑨 ∈ ℝ^{𝑚×𝑛}
Output: 𝑳, 𝑼
1: 𝑳 ← 𝟎 ∈ ℝ^{𝑚×𝑚}
2: 𝑘 ← 0
3: for 𝑗 ← 0 to 𝑛 − 1 do
4:   if the 𝑗-th column is not a pivot column then
5:     continue
6:   Row exchange to make 𝑎𝑘𝑗 the largest available pivot.
7:   𝑙𝑘𝑘 ← 1
8:   for 𝑖 ← 𝑘 + 1 to 𝑚 − 1 do
9:     𝑙𝑖𝑘 ← 𝑎𝑖𝑗 / 𝑎𝑘𝑗  ⊳ Multiplier
10:     (row 𝑖 of 𝑨) ← (row 𝑖 of 𝑨) − 𝑙𝑖𝑘 (row 𝑘 of 𝑨)  ⊳ Eliminate row 𝑖 beyond row 𝑘
11:   𝑘 ← 𝑘 + 1
12: return 𝑳, 𝑨

Lemma 23. Assuming no row exchanges, when a row of 𝑨 starts with zeros, so does that row of 𝑳. When a column of 𝑨 starts with zeros, so does that column of 𝑼.

Algorithm 13 (Solving {𝑨𝒙𝑘 = 𝒃𝑘}_{𝑘=1}^{𝐾}). The algorithm is illustrated in Alg. 5. 𝑇(𝑚, 𝑛, 𝐾) ∼ (1/3)𝑚³ + (1/3)𝑚²𝑛 + 𝑛²𝐾. For a band matrix 𝑩 with 𝑤 nonzero diagonals below and above its main diagonal, 𝑇(𝑚, 𝑛, 𝑤, 𝐾) ∼ 𝑚𝑤² + 2𝑛𝑤𝐾.

Algorithm 5 Solving {𝑨𝒙𝑘 = 𝒃𝑘}_{𝑘=1}^{𝐾}.
Input: 𝑨, {𝒃𝑘}_{𝑘=1}^{𝐾}
Output: {𝒙𝑘}_{𝑘=1}^{𝐾}
1: LU factorization 𝑨 = 𝑳𝑼.
2: for 𝑘 ← 1 to 𝐾 do
3:   Solve 𝑳𝒚𝑘 = 𝒃𝑘 by forward substitution.
4:   Solve 𝑼𝒙𝑘 = 𝒚𝑘 by backward substitution.
5: return {𝒙𝑘}_{𝑘=1}^{𝐾}

3.5 Matrices in Engineering

Suppose there are 𝑛 masses vertically connected by a line of 𝑚 springs. We define
• 𝒖 ∈ ℝ^𝑛: the movement of the masses, where 𝑢𝑗 > 0 when a mass moves downward.
• 𝒚 ∈ ℝ^𝑚: the internal force of each spring, where 𝑦𝑖 > 0 when a spring is stretched.
• 𝒇 ∶= [𝑚𝑗𝑔]_𝑛 ∈ ℝ^𝑛: the external force coming from gravity.
• 𝑪 ∶= diag 𝒄 ∈ ℝ^{𝑚×𝑚}: the spring constant of each spring.
• 𝒆 ∈ ℝ^𝑚: the stretching distance of each spring.
By Hooke's Law, 𝑦𝑖 = 𝑐𝑖𝑒𝑖; in matrix form, 𝒚 = 𝑪𝒆.

There are three different cases for these springs, as illustrated in Fig. 1.

Figure 1: Three different cases (fixed-fixed, fixed-free, and free-free) for a spring system.

Fixed-fixed Case. In this case, 𝑚 = 𝑛 + 1 and the top and bottom springs are fixed. Originally there is no stretching. Then gravity acts to move the masses down by 𝒖. Each spring is stretched by the difference in the displacements of its ends: 𝑒𝑖 = 𝑢𝑖 − 𝑢𝑖−1. Besides, 𝑒1 = 𝑢1 since the top is fixed, and 𝑒𝑚 = −𝑢𝑛 since the bottom is fixed. In matrix form,

𝒆 = 𝑨𝒖 ∶= [1; −1 1; ⋱ ⋱; −1 1; −1] 𝒖 ,  (16)

where 𝑨 ∈ ℝ^{(𝑛+1)×𝑛} has 1's on the diagonal and −1's on the subdiagonal. Finally comes the balance equation: the internal forces from the springs balance the external forces on the masses, 𝑓𝑖 = 𝑦𝑖 − 𝑦𝑖+1. In matrix form,

𝒇 = 𝑨⊤𝒚 ∶= [1 −1; 1 −1; ⋱ ⋱; 1 −1] 𝒚 .  (17)

Combining the three equations gives

𝑨⊤𝑪𝑨𝒖 = 𝒇 .  (18)

When 𝑪 = 𝑰,

𝑲 = 𝑨⊤𝑨 = [2 −1; −1 2 ⋱; ⋱ ⋱ −1; −1 2] ∈ ℝ^{𝑛×𝑛} .  (19)

Lemma 24. The following are properties of the matrix 𝑲.
• 𝑲 is symmetric.
• 𝑲 is tridiagonal.
• The 𝑖-th pivot of 𝑲 is (𝑖 + 1)/𝑖, and it converges to 1 as 𝑛 → ∞.
• 𝑲 is PD.
• det 𝑲 = 𝑛 + 1.
• 𝑲⁻¹ is a full matrix with all positive entries. 𝑲⁻¹ is also PD.

Fixed-free Case. In this case, 𝑚 = 𝑛 and only the top spring is fixed. When 𝑪 = 𝑰,

𝑨 = [1; −1 1; ⋱ ⋱; −1 1] , 𝑲 = [2 −1; −1 2 ⋱; ⋱ ⋱ −1; −1 1] .  (20)

Free-free Case. In this case, 𝑚 = 𝑛 − 1 and both ends are free. When 𝑪 = 𝑰,

𝑨 = [−1 1; ⋱ ⋱; −1 1] , 𝑲 = [1 −1; −1 2 ⋱; ⋱ ⋱ −1; −1 1] .  (21)

There is a nonzero solution to 𝑨𝒖 = 𝟎: the masses can move by 𝒖 = 𝟏 with no stretching of the springs (𝒆 = 𝟎). 𝑲 is only PSD, and 𝑲𝒖 = 𝒇 is solvable only for special 𝒇, i.e., 𝟏⊤𝒇 = 0, or the whole line of springs (with both ends free) will take off like a rocket.

3.6 Graphs and Networks

Definition 31 (Adjacency Matrix). Given a directed graph 𝐺 = (𝑉, 𝐸) where |𝑉| = 𝑛, the adjacency matrix is 𝑨 = [𝕀((𝑖, 𝑗) ∈ 𝐸)]_{𝑛×𝑛}, i.e., 𝑎𝑖𝑗 = 1 if there exists an edge from vertex 𝑖 to vertex 𝑗. The (𝑖, 𝑗) entry of 𝑨^𝑘 counts the number of 𝑘-step paths from vertex 𝑖 to vertex 𝑗. If 𝐺 is undirected, 𝑨 is symmetric.

Definition 32 (Incidence Matrix). Given a directed graph 𝐺 = (𝑉, 𝐸) where |𝑉| = 𝑛 and |𝐸| = 𝑚, the incidence matrix is 𝑨 ∈ ℝ^{𝑚×𝑛}, where 𝑎𝑖𝑗 = −1 if edge 𝑖 starts from vertex 𝑗, 𝑎𝑖𝑗 = 1 if edge 𝑖 ends at vertex 𝑗, and 𝑎𝑖𝑗 = 0 otherwise.

For the circuit in Fig. 2, the incidence matrix is

𝑨 ∶= [−1 1 0 0; −1 0 1 0; 0 −1 1 0; −1 0 0 1; 0 −1 0 1; 0 0 −1 1] .  (22)

Figure 2: A circuit with a current source into vertex 1.

We define
• 𝒖 ∈ ℝ^𝑛: the potentials (the voltages) at the 𝑛 nodes.
• 𝒚 ∈ ℝ^𝑚: the currents flowing along the 𝑚 edges.
• 𝒇 ∈ ℝ^𝑛: the current sources into the 𝑛 nodes.
• 𝑪 ∶= diag(𝑐1, 𝑐2, …, 𝑐𝑚) ∈ ℝ^{𝑚×𝑚}: the conductance of each edge.

𝑨𝒖 gives the potential differences across the 𝑚 edges. Ohm's law says that the current 𝑦𝑖 through a resistor is proportional to the potential difference: 𝒚 = 𝑪𝑨𝒖. Kirchhoff's current law says that the net current into every node is zero, which is expressed as

𝑨⊤𝒚 = 𝑨⊤𝑪𝑨𝒖 = 𝒇 .  (23)

Kirchhoff's voltage law says that the sum of potential differences around a loop must be zero.

Lemma 25. The following are properties of an incidence matrix 𝑨.
• dim 𝒞(𝑨) = dim 𝒞(𝑨⊤) = 𝑛 − 1.
• dim 𝒩(𝑨) = 1.
• dim 𝒩(𝑨⊤) = 𝑚 − 𝑛 + 1.

Proof. Since we can raise or lower all the potentials by the same constant, 𝟏 ∈ 𝒩(𝑨). Rows of 𝑨 are dependent if the corresponding edges contain a loop. At the end of elimination we have a full set of 𝑟 independent rows. Those 𝑟 edges form a spanning tree of the graph, which has 𝑛 − 1 edges if the graph is connected.

3.7 Two-point Boundary-value Problems

Solving

−d²𝑢(𝑥)/d𝑥² = 𝑓(𝑥), 𝑥 ∈ [0, 1] ,  (24)

with boundary conditions 𝑢(0) = 0 and 𝑢(1) = 0. This equation describes a steady state system, e.g., the temperature distribution of a rod with a heat source 𝑓(𝑥) and both ends fixed at 0 °C.

Since a computer cannot solve a differential equation exactly, we have to approximate the differential equation with a difference equation. For that reason we can only accept a finite amount of information, at 𝑛 equally spaced points:

𝑢1 ∶= 𝑢(ℎ), 𝑢2 ∶= 𝑢(2ℎ), …, 𝑢𝑛 ∶= 𝑢(𝑛ℎ) ,  (25)
𝑓1 ∶= 𝑓(ℎ), 𝑓2 ∶= 𝑓(2ℎ), …, 𝑓𝑛 ∶= 𝑓(𝑛ℎ) ,  (26)

where ℎ ∶= 1/(𝑛 + 1). The boundary conditions become 𝑢0 ∶= 0 and 𝑢𝑛+1 ∶= 0.
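Discretizing the second derivative turns the boundary-value problem into the tridiagonal system 𝑲𝒖 = ℎ²𝒇 with 𝑲 = tridiag(−1, 2, −1), as eq. (28) shows. A numerical check with an illustrative constant source 𝑓(𝑥) = 1, whose exact solution 𝑢(𝑥) = 𝑥(1 − 𝑥)/2 is quadratic, so the second-difference approximation is exact at the grid points:

```python
import numpy as np

n = 9
h = 1.0 / (n + 1)
x = h * np.arange(1, n + 1)          # interior grid points x_1, ..., x_n

# K is the tridiagonal (-1, 2, -1) matrix of the fixed-fixed case.
K = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)                        # f(x) = 1 on the grid
u = np.linalg.solve(K, h**2 * f)      # solve K u = h^2 f
assert np.allclose(u, x * (1 - x) / 2)

# Two properties of K from Lemma 24:
assert np.isclose(np.linalg.det(K), n + 1)
assert np.all(np.linalg.eigvalsh(K) > 0)       # K is PD
```

A tridiagonal solver would exploit the band structure in O(𝑛) operations; `np.linalg.solve` is used here only for brevity.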
We approximate the second-order derivative by

−d²𝑢(𝑥)/d𝑥² ≈ −(𝑢(𝑥 + ℎ) − 2𝑢(𝑥) + 𝑢(𝑥 − ℎ))/ℎ² = (−𝑢𝑗+1 + 2𝑢𝑗 − 𝑢𝑗−1)/ℎ² .  (27)

Therefore, the differential equation −d²𝑢(𝑥)/d𝑥² = 𝑓(𝑥) becomes

𝑲𝒖 = [2 −1; −1 2 ⋱; ⋱ ⋱ −1; −1 2] [𝑢1; 𝑢2; ⋮; 𝑢𝑛] = ℎ² [𝑓1; 𝑓2; ⋮; 𝑓𝑛] = ℎ²𝒇 .  (28)

Lemma 26. The FLOP count for solving 𝑲𝒖 = ℎ²𝒇 is 𝑇(𝑛) ∼ 3𝑛.

4 Theory: Eigenvalues and Eigenvectors

4.1 Eigenvalues and Eigenvectors

Definition 33 (Eigenvalue 𝜆 and Eigenvector 𝒙). 𝜆 and 𝒙 ≠ 𝟎 are an eigenvalue and eigenvector of 𝑨 if 𝑨𝒙 = 𝜆𝒙.

Algorithm 14 (Solving Eigenvalues and Eigenvectors). 𝑨𝒙 = 𝜆𝒙 ⇒ (𝑨 − 𝜆𝑰)𝒙 = 𝟎. Since 𝒙 ≠ 𝟎, 𝒩(𝑨 − 𝜆𝑰) ≠ 𝟘, which requires det(𝑨 − 𝜆𝑰) = 0. The algorithm is illustrated in Alg. 6. In practice, the best way to compute eigenvalues is to compute similar matrices 𝑨1, 𝑨2, … that approach a triangular matrix.

Algorithm 6 Solve the eigenvalues and eigenvectors of 𝑨.
Input: 𝑨 ∈ ℝ^{𝑛×𝑛}
Output: 𝜆𝑖, 𝒙𝑖
1: Solve det(𝑨 − 𝜆𝑰) = 0, which is a polynomial in 𝜆 of degree 𝑛, for the eigenvalues 𝜆.
2: For each eigenvalue 𝜆, solve (𝑨 − 𝜆𝑰)𝒙 = 𝟎 for the eigenvectors 𝒙.
3: return 𝜆𝑖, 𝒙𝑖

Definition 34 (Geometric Multiplicity (GM)). The number of independent eigenvectors for 𝜆, which is dim 𝒩(𝑨 − 𝜆𝑰).

Definition 35 (Algebraic Multiplicity (AM)). The number of repetitions of 𝜆 among the eigenvalues. Look at the 𝑛 roots of det(𝑨 − 𝜆𝑰) = 0.

Lemma 27. The following are properties of eigenvalues and eigenvectors.
• For each eigenvalue, GM ≤ AM. A matrix is diagonalizable iff every eigenvalue has GM = AM.
• Each eigenvalue has ≥ 1 eigenvector.
• All eigenvalues are different ⇒ all eigenvectors are independent, which means the matrix can be diagonalized.
• There is no connection between invertibility and diagonalizability. Invertibility is concerned with the eigenvalues (𝜆 = 0 or 𝜆 ≠ 0). Diagonalizability is concerned with the eigenvectors (too few or enough for 𝑺).
• Suppose both 𝑨 and 𝑩 can be diagonalized; they share the same eigenvector matrix 𝑺 iff 𝑨𝑩 = 𝑩𝑨.

4.2 Diagonalization

Theorem 28 (Diagonalizable). If 𝑨 ∈ ℝ^{𝑛×𝑛} has 𝑛 independent eigenvectors, 𝑨 is diagonalizable:

𝑨 = 𝑺𝚲𝑺⁻¹ ,  (29)

where 𝑺 ∶= [𝒙1 𝒙2 ⋯ 𝒙𝑛] and 𝚲 ∶= diag(𝜆1, 𝜆2, …, 𝜆𝑛). In other words, 𝑨 is similar to 𝚲.

Proof. 𝑨𝑺 = [𝜆1𝒙1 𝜆2𝒙2 ⋯ 𝜆𝑛𝒙𝑛] = 𝑺𝚲.

Definition 36 (Normal Matrix). A square matrix 𝑨 is normal when 𝑨⊤𝑨 = 𝑨𝑨⊤. That includes symmetric, antisymmetric, and orthogonal matrices. In this case, 𝜎𝑖 = |𝜆𝑖|.

Lemma 29. The eigenvectors of 𝑨 can be chosen orthonormal when 𝑨 is normal.

Theorem 30 (Spectral Theorem). Every symmetric matrix 𝑨 has the factorization

𝑨 = 𝑸𝚲𝑸⊤ = ∑_{𝑖=1}^{𝑛} 𝜆𝑖𝒙𝑖𝒙𝑖⊤ .  (30)

Properties of eigenvalues and eigenvectors of special matrices are illustrated in Table 10.

5 Application: Solving Dynamic Problems

5.1 Solving Difference and Differential Equations

The algorithms for solving first-order difference and differential equations are illustrated in Alg. 7 and Alg. 8, respectively.

Algorithm 7 Solving 𝒖𝑘+1 = 𝑨𝒖𝑘.
Input: 𝑨 ∈ ℝ^{𝑛×𝑛}, 𝒖0
Output: 𝒖𝑘
1: Diagonalize 𝑨 = 𝑺𝚲𝑺⁻¹.
2: Solve 𝑺𝒄 = 𝒖0 to write 𝒖0 as a linear combination of eigenvectors.
3: The solution is 𝒖𝑘 = 𝑨^𝑘𝒖0 = 𝑺𝚲^𝑘𝑺⁻¹𝒖0 = 𝑺𝚲^𝑘𝒄 = ∑_{𝑖=1}^{𝑛} 𝑐𝑖𝜆𝑖^𝑘𝒙𝑖.
4: return 𝒖𝑘
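The steps of Alg. 7 can be run directly on the Fibonacci recurrence, which Example 1 below also works through by hand (with 𝒖𝑘 = (𝐹𝑘+1, 𝐹𝑘)⊤). A NumPy sketch; the choice 𝑘 = 20 is illustrative:

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 0.0]])    # u_{k+1} = A u_k
lam, S = np.linalg.eig(A)                 # A = S Lambda S^{-1}
u0 = np.array([1.0, 0.0])                 # (F_1, F_0)
c = np.linalg.solve(S, u0)                # write u0 in the eigenbasis

k = 20
u_k = S @ (lam**k * c)                    # u_k = S Lambda^k c

# Compare with the recurrence itself.
F = [0, 1]
for _ in range(k):
    F.append(F[-1] + F[-2])
assert np.allclose(u_k, [F[k + 1], F[k]])
```

Note that `np.linalg.eig` returns unit-length eigenvectors, so the coefficients 𝒄 absorb the scaling; the product 𝑺𝚲^𝑘𝒄 is unchanged.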
Algorithm 8 Solving d𝒖(𝑡)
/d𝑡 = 𝑨𝒖(𝑡), where 𝑨 is a constant coefficient matrix.)
Input: 𝑨 ∈ ℝ𝑛×𝑛 , 𝒖(0)
Output: 𝒖(𝑡)
1: Diagonalize 𝑨 = 𝑺𝚲𝑺⁻¹.
2: Solve 𝑺𝒄 = 𝒖(0) to write 𝒖(0) as a linear combination of eigenvectors.
3: The solution is 𝒖(𝑡) = exp(𝑨𝑡)𝒖(0) = 𝑺 exp(𝚲𝑡)𝑺⁻¹𝒖(0) = 𝑺 exp(𝚲𝑡)𝒄 = ∑_{𝑖=1}^{𝑛} 𝑐𝑖 exp(𝜆𝑖 𝑡)𝒙𝑖 .
4: If two 𝜆’s are equal, with only one eigenvector, another solution 𝑡 exp(𝜆𝑡)𝒙 is needed.
5: return 𝒖(𝑡)

In the stable case, the powers 𝑨^𝑘 approach zero and so does 𝒖𝑘 = 𝑨^𝑘 𝒖0 .

Lemma 33. The differential equation d𝒖(𝑡)/d𝑡 = 𝑨𝒖(𝑡) is
• stable and exp 𝑨𝑡 → 𝟎 if ∀𝑖. Re 𝜆𝑖 < 0.
• neutrally stable if ∃𝑖. Re 𝜆𝑖 = 0, and all the other Re 𝜆𝑖 < 0.
• unstable and exp 𝑨𝑡 is unbounded if ∃𝑖. Re 𝜆𝑖 > 0.

Example 1 (Fibonacci Numbers). Find the 𝑘-th Fibonacci number, where the sequence is defined as 𝐹0 = 0, 𝐹1 = 1, and 𝐹𝑘+2 = 𝐹𝑘+1 + 𝐹𝑘 .

Solution. Let 𝒖𝑘 ∶= [𝐹𝑘+1 , 𝐹𝑘 ]⊤ , then 𝒖0 = [1, 0]⊤ and 𝒖𝑘+1 = 𝑨𝒖𝑘 with 𝑨 ∶= [1 1; 1 0]. 𝜆1 = (1 + √5)/2, 𝜆2 = (1 − √5)/2. 𝒙1 = [𝜆1 , 1]⊤ , 𝒙2 = [𝜆2 , 1]⊤ . 𝒄 = 1/(𝜆1 − 𝜆2 ) [1, −1]⊤ . 𝒖𝑘 = 1/(𝜆1 − 𝜆2 ) (𝜆1^𝑘 𝒙1 − 𝜆2^𝑘 𝒙2 ). 𝐹𝑘 = (1/√5)(𝜆1^𝑘 − 𝜆2^𝑘 ) = nearest integer to (1/√5)((1 + √5)/2)^𝑘 .

Example 2 (Simple Harmonic Vibration). Solve d²𝑥(𝑡)/d𝑡² + 𝑥(𝑡) = 0, where 𝑥(0) = 1, d𝑥(0)/d𝑡 = 0. This is 𝑚𝑎 = −𝑘𝑥 with 𝑚 = 1, 𝑘 = 1.

Solution. Let 𝒖(𝑡) ∶= [𝑥, d𝑥/d𝑡]⊤ , then 𝒖(0) = [1, 0]⊤ and d𝒖/d𝑡 = 𝑨𝒖(𝑡) with 𝑨 ∶= [0 1; −1 0]. 𝜆1 = i, 𝜆2 = −i. 𝒙1 = [1, i]⊤ , 𝒙2 = [1, −i]⊤ . 𝒄 = [1/2, 1/2]⊤ . 𝒖(𝑡) = (1/2)(exp(i𝑡)𝒙1 + exp(−i𝑡)𝒙2 ). 𝑥(𝑡) = (1/2)(exp(i𝑡) + exp(−i𝑡)) = cos 𝑡.

5.2 Singular Value Decomposition

Theorem 34 (SVD Factorization). For matrix 𝑨 ∈ ℝ𝑚×𝑛 with 𝑟 ∶= rank 𝑨, choose 𝑼 ∈ ℝ𝑚×𝑚 to contain orthonormal eigenvectors of 𝑨𝑨⊤ , and 𝑽 ∈ ℝ𝑛×𝑛 to contain orthonormal eigenvectors of 𝑨⊤ 𝑨. The shared eigenvalues are 𝜎1², 𝜎2², … , 𝜎𝑟². Then

𝑨 = 𝑼 𝚺𝑽 ⊤ = ∑_{𝑖=1}^{𝑟} 𝜎𝑖 𝒖𝑖 𝒗𝑖⊤ . (31)

𝑼 and 𝑽 satisfy the following.
• The first 𝑟 columns of 𝑼 contain orthonormal bases for 𝒞(𝑨).
• The last 𝑚 − 𝑟 columns of 𝑼 contain orthonormal bases for 𝒩(𝑨⊤ ).
• The first 𝑟 columns of 𝑽 contain orthonormal bases for 𝒞(𝑨⊤ ).
• The last 𝑛 − 𝑟 columns of 𝑽 contain orthonormal bases for 𝒩(𝑨).

Proof. Start from 𝑨⊤ 𝑨𝒗𝑖 = 𝜎𝑖²𝒗𝑖 . Multiplying both sides by 𝑨 gives 𝑨𝑨⊤ (𝑨𝒗𝑖 ) = 𝜎𝑖²(𝑨𝒗𝑖 ), which shows that 𝑨𝒗𝑖 is an eigenvector of 𝑨𝑨⊤ with shared eigenvalue 𝜎𝑖². Since ‖𝑨𝒗𝑖 ‖ = √(𝒗𝑖⊤ 𝑨⊤ 𝑨𝒗𝑖 ) = √(𝜎𝑖² 𝒗𝑖⊤ 𝒗𝑖 ) = 𝜎𝑖 , we denote 𝒖𝑖 ∶= 𝑨𝒗𝑖 /‖𝑨𝒗𝑖 ‖ = 𝑨𝒗𝑖 /𝜎𝑖 , namely, 𝑨𝒗𝑖 = 𝜎𝑖 𝒖𝑖 . This shows column by column that 𝑨𝑽 = 𝑼 𝚺. Since 𝑽 is orthogonal, 𝑨 = 𝑼 𝚺𝑽 ⊤ .
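The factorization and its properties can be checked numerically; a minimal NumPy sketch (the 3 × 2 test matrix and the rank threshold are assumed examples, not from the text):

```python
import numpy as np

# An arbitrary example matrix with m = 3, n = 2, rank r = 2.
A = np.array([[3.0, 0.0],
              [4.0, 5.0],
              [0.0, 0.0]])
U, s, Vt = np.linalg.svd(A)               # A = U @ diag(s) @ Vt, s descending
r = int(np.sum(s > 1e-10))                # numerical rank

# Eqn. 31: A equals the rank-r sum of outer products sigma_i u_i v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
assert np.allclose(A, A_rebuilt)

# The shared eigenvalues of A^T A are the squared singular values.
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(eig[:r], s[:r] ** 2)

# The last m - r columns of U span N(A^T): A^T u = 0 for those columns.
assert np.allclose(A.T @ U[:, r:], 0.0)
```

The same assertions hold for any real matrix; only the rank threshold 1e-10 is a tunable choice.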
Lemma 35. The largest singular value dominates all eigenvalues and all entries of 𝑨. That is, 𝜎1 ≥ max𝑖 |𝜆𝑖 | and 𝜎1 ≥ max𝑖,𝑗 |𝑎𝑖𝑗 |.

Lemma 36. For a square matrix 𝑨, the spectral factorization and the SVD factorization give the same result when 𝑨 is PSD.

Proof. We need orthonormal eigenvectors (𝑨 should be symmetric), and nonnegative eigenvalues (𝑨 is PSD).

Definition 37 (Markov Matrices). An 𝑛 × 𝑛 matrix is a Markov matrix if all entries are nonnegative and each column of the matrix adds up to 1.

Lemma 31. A Markov matrix 𝑨 has the following properties.
• 𝜆1 = 1 is an eigenvalue of 𝑨.
• Its eigenvector 𝒙1 is nonnegative, and it is a steady state since 𝑨𝒙1 = 𝒙1 .
• The other eigenvalues satisfy |𝜆𝑖 | ≤ 1.
• If 𝑨 or any power of 𝑨 has all positive entries, these other |𝜆𝑖 | < 1. The solution 𝑨^𝑘 𝒖0 approaches a multiple of 𝒙1 , which is the steady state 𝒖∞ .

Lemma 32. The difference equation 𝒖𝑘+1 = 𝑨𝒖𝑘 is
• stable if ∀𝑖. |𝜆𝑖 | < 1.
• neutrally stable if ∃𝑖. |𝜆𝑖 | = 1, and all the other |𝜆𝑖 | < 1.
• unstable if ∃𝑖. |𝜆𝑖 | > 1.

5.3 Leontief’s Input-output Model

Leontief divided the US economy into 𝑛 sectors that produce goods or services (e.g., coal, automotive, and communication), and other sectors that only consume goods or services (e.g., consumer and government).
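The steady-state behavior of Lemma 31 is easy to verify on a toy example; the 2 × 2 Markov matrix below is an assumption for illustration:

```python
import numpy as np

# A toy Markov matrix: entries nonnegative, each column sums to 1.
A = np.array([[0.8, 0.3],
              [0.2, 0.7]])

u = np.array([1.0, 0.0])                  # any starting distribution u_0
for _ in range(100):                      # u_k = A^k u_0
    u = A @ u

# The eigenvector of lambda_1 = 1, scaled to sum to 1, is the steady state.
w, X = np.linalg.eig(A)
x1 = X[:, np.argmin(np.abs(w - 1.0))].real
x1 /= x1.sum()
assert np.allclose(u, x1)                 # A^k u_0 approached the steady state
assert abs(np.max(w.real) - 1.0) < 1e-12  # lambda_1 = 1 is an eigenvalue
```

Since the other eigenvalue here is 0.5, the powers converge quickly, matching Lemma 32's stability condition on the remaining |𝜆𝑖 |.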
Table 6: Comparison of PD and PSD matrices.

                                               PD matrices                    PSD matrices
𝒙⊤ 𝑨𝒙, if 𝒙 ≠ 𝟎                                > 0                            ≥ 0
𝑨⊤ 𝑨                                           If 𝑨 has independent columns   If 𝑨 has dependent columns
𝑨⊤ 𝑪𝑨 (𝑪 is diagonal with positive elements)   If 𝑨 has independent columns   If 𝑨 has dependent columns
Upper left determinants                        All positive                   All nonnegative
Pivots                                         All positive                   All nonnegative
Eigenvalues                                    All positive                   All nonnegative
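The eigenvalue row of Table 6 can be exercised numerically, together with a Cholesky-based PD test (cf. Algorithm 15); the two 2 × 2 test matrices are assumed examples:

```python
import numpy as np

def is_pd(A: np.ndarray) -> bool:
    """Cholesky succeeds iff the symmetric matrix A is positive definite."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # eigenvalues 1 and 3: PD
B = np.array([[1.0, 1.0], [1.0, 1.0]])     # eigenvalues 0 and 2: PSD but not PD
assert is_pd(A) and not is_pd(B)
assert np.all(np.linalg.eigvalsh(A) > 0)       # all eigenvalues positive
assert np.all(np.linalg.eigvalsh(B) >= -1e-12) # all eigenvalues nonnegative
```

Attempting the factorization is usually cheaper and more robust than computing all eigenvalues when only a yes/no answer is needed.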
• Production vector 𝒙 ∈ ℝ𝑛 . Output of each producer for one year.
• Final demand vector 𝒃 ∈ ℝ𝑛 . Demand for each producer by the consumer for a year.
• Intermediate demand vector 𝒖 ∶= 𝑨𝒙 ∈ ℝ𝑛 . Demand for each producer by the producer for a year. 𝑨 ∈ ℝ𝑛×𝑛 is the consumption matrix.

Theorem 37 (Leontief Input-output Model). A production level 𝒙 exactly balances the total demand for that production when

𝒙 = 𝑨𝒙 + 𝒃 . (32)

If 𝑨 and 𝒃 have nonnegative entries and the largest eigenvalue of 𝑨 is less than 1, then the solution exists and has nonnegative entries

𝒙 = (𝑰 − 𝑨)⁻¹ 𝒃 = (𝑰 + ∑_{𝑘=1}^{∞} 𝑨^𝑘 ) 𝒃 . (33)

6 Theory: Positive Definite Matrices and Optimizations

In many cases, the signs of the eigenvalues are crucial.

6.1 Positive Definite Matrices

The comparison of PD and PSD matrices is illustrated in Table 6.

Algorithm 15 (Determine Whether 𝑨 is PD). Try to factor 𝑨 = 𝑹⊤ 𝑹 where 𝑹 is upper triangular with positive diagonal entries (i.e., Cholesky factorization).

Lemma 38. For symmetric matrices, the pivots and the eigenvalues have the same signs.

Lemma 39. For a PD matrix 𝑨, 𝒙⊤ 𝑨𝒙 = 1 is an ellipsoid in 𝑛 dimensions. The axes of the ellipsoid point toward the eigenvectors of 𝑨.

6.2 Unconstrained Optimization

The goal is to solve

arg min𝒖 𝑓 (𝒖) . (34)

Definition 38 (Stationary Point). A point where 𝜕𝑓/𝜕𝒖 = 𝟎. Such a point can be a local minimum, a local maximum, or a saddle point.

Lemma 40. 𝑓 (𝒖) has a local minimum when 𝜕²𝑓/𝜕𝒖² is PD. Similarly, 𝑓 (𝒖) has a local maximum when 𝜕²𝑓/𝜕𝒖² is ND. If some eigenvalues are positive and some are negative, 𝑓 (𝒖) has a saddle point. If 𝜕²𝑓/𝜕𝒖² has eigenvalue 0, the test is inconclusive.

In some cases, we can directly get the stationary point by solving 𝜕𝑓/𝜕𝒖 = 𝟎. In other cases, we iteratively approach the stationary point.

Algorithm 16 (Gradient Descent). 𝒖 ← 𝒖 − 𝜂 𝜕𝑓/𝜕𝒖 .

Algorithm 17 (Newton’s Method). 𝒖 ← 𝒖 − (𝜕²𝑓/𝜕𝒖²)⁻¹ 𝜕𝑓/𝜕𝒖 .

Lemma 41 (Taylor Series).

𝑓 (𝒖) ≈ 𝑓 (𝒖0 ) + (𝒖 − 𝒖0 )⊤ 𝜕𝑓/𝜕𝒖|𝒖0 + (1/2)(𝒖 − 𝒖0 )⊤ 𝜕²𝑓/𝜕𝒖²|𝒖0 (𝒖 − 𝒖0 ) . (35)

6.3 Constrained Optimization

Constructing the Lagrange function is an important method for solving constrained optimization problems.

Definition 39 (Lagrange Function). For a constrained optimization problem

min𝒖 𝑓 (𝒖) (36)
s. t. 𝑔𝑖 (𝒖) ≤ 0, 𝑖 = 1, 2, … , 𝑚 ,
ℎ𝑗 (𝒖) = 0, 𝑗 = 1, 2, … , 𝑛 ,

the Lagrange function is defined as

ℒ(𝒖, 𝜶, 𝜷) ≔ 𝑓 (𝒖) + ∑_{𝑖=1}^{𝑚} 𝛼𝑖 𝑔𝑖 (𝒖) + ∑_{𝑗=1}^{𝑛} 𝛽𝑗 ℎ𝑗 (𝒖) , (37)

where 𝛼𝑖 ≥ 0.

Lemma 42. The optimization problem of Eqn. 36 is equivalent to

min𝒖 max𝜶,𝜷 ℒ(𝒖, 𝜶, 𝜷) (38)
s. t. 𝛼𝑖 ≥ 0, 𝑖 = 1, 2, … , 𝑚 .
Proof.

min𝒖 max𝜶,𝜷 ℒ(𝒖, 𝜶, 𝜷)
= min𝒖 ( 𝑓 (𝒖) + max𝜶,𝜷 ( ∑_{𝑖=1}^{𝑚} 𝛼𝑖 𝑔𝑖 (𝒖) + ∑_{𝑗=1}^{𝑛} 𝛽𝑗 ℎ𝑗 (𝒖) ) )
= min𝒖 ( 𝑓 (𝒖) + { 0 if 𝒖 feasible ; ∞ otherwise } )
= min𝒖 𝑓 (𝒖), and 𝒖 feasible , (39)

When 𝑔𝑖 is infeasible, i.e., 𝑔𝑖 (𝒖) > 0, we can let 𝛼𝑖 = ∞, such that 𝛼𝑖 𝑔𝑖 (𝒖) = ∞; when ℎ𝑗 is infeasible, i.e., ℎ𝑗 (𝒖) ≠ 0, we can let 𝛽𝑗 = sign(ℎ𝑗 (𝒖))∞, such that 𝛽𝑗 ℎ𝑗 (𝒖) = ∞. When 𝒖 is feasible, since 𝛼𝑖 ≥ 0 and 𝑔𝑖 (𝒖) ≤ 0, we have 𝛼𝑖 𝑔𝑖 (𝒖) ≤ 0. Therefore, the maximum of 𝛼𝑖 𝑔𝑖 (𝒖) is 0.

Corollary 43 (KKT Condition). The optimization problem of Eqn. 38 should satisfy the following at the optimum.
• Primal feasibility: 𝑔𝑖 (𝒖) ≤ 0, ℎ𝑗 (𝒖) = 0;
• Dual feasibility: 𝛼𝑖 ≥ 0;
• Complementary slackness: 𝛼𝑖 𝑔𝑖 (𝒖) = 0.

Definition 40 (Dual Problem). The dual problem of Eqn. 36 is

max𝜶,𝜷 min𝒖 ℒ(𝒖, 𝜶, 𝜷) (40)
s. t. 𝛼𝑖 ≥ 0, 𝑖 = 1, 2, … , 𝑚 .

Lemma 44. The dual problem is a lower bound of the primal problem.

max𝜶,𝜷 min𝒖 ℒ(𝒖, 𝜶, 𝜷) ≤ min𝒖 max𝜶,𝜷 ℒ(𝒖, 𝜶, 𝜷) . (41)

Proof. For any (𝜶′, 𝜷′), min𝒖 ℒ(𝒖, 𝜶′, 𝜷′) ≤ min𝒖 max𝜶,𝜷 ℒ(𝒖, 𝜶, 𝜷). When (𝜶′, 𝜷′) = arg max𝜶′,𝜷′ min𝒖 ℒ(𝒖, 𝜶′, 𝜷′), it still holds true, i.e., max𝜶′,𝜷′ min𝒖 ℒ(𝒖, 𝜶′, 𝜷′) ≤ min𝒖 max𝜶,𝜷 ℒ(𝒖, 𝜶, 𝜷).

Definition 41 (Convex Function). A function 𝑓 is convex if

∀𝛼 ∈ [0, 1]. 𝑓 (𝛼𝒙 + (1 − 𝛼)𝒚) ≤ 𝛼𝑓 (𝒙) + (1 − 𝛼)𝑓 (𝒚) , (42)

which means that if we pick two points on the graph of a convex function and draw a straight line segment between them, the portion of the function between these two points will lie below this straight line.

Lemma 45. A function 𝑓 is convex if every point on a tangent line lies below the corresponding point on 𝑓 , i.e., 𝑓 (𝒚) ≥ 𝑓 (𝒙) + (𝒚 − 𝒙)⊤ 𝜕𝑓 (𝒙)/𝜕𝒙, or 𝜕²𝑓/𝜕𝒙² is PSD.

Definition 42 (Affine Function). A function 𝑓 of the form 𝑓 (𝒙) = 𝒄⊤ 𝒙 + 𝑑.

Lemma 46 (Slater Condition). When the primal problem is convex, i.e., 𝑓 and 𝑔𝑖 are convex and ℎ𝑗 is affine, and there exists at least one point in the feasible region at which the inequalities strictly hold, the dual problem is equivalent to the primal problem.

Proof. The proof is out of the range of this note. Please refer to [2] if you are interested.

7 Application: Solving Optimization Problems

7.1 Removing Non-differentiability

Lemma 47. The optimization problem

arg min𝒖 |𝑓 (𝒖)| (43)

is equivalent to

arg min𝒖,𝑥 𝑥 (44)
s. t. 𝑓 (𝒖) − 𝑥 ≤ 0 ,
−𝑓 (𝒖) − 𝑥 ≤ 0 .

Proof. arg min𝒖 |𝑓 (𝒖)| is equivalent to arg min𝒖 max(𝑓 (𝒖), −𝑓 (𝒖)). Let 𝑥 be an upper bound of max(𝑓 (𝒖), −𝑓 (𝒖)).

7.2 Linear Programming

7.3 Support Vector Machine

Definition 43 (Support Vector Machine (SVM)). Given a set of training examples {(𝒙𝑖 , 𝑦𝑖 )}_{𝑖=1}^{𝑚} , the goal of SVM is to find a hyperplane that separates examples with different classes, and that hyperplane is farthest from the training examples.

arg max𝒘,𝑏 (2/‖𝒘‖) min𝑖 |𝒘⊤ 𝒙𝑖 + 𝑏| (45)
s. t. 𝑦𝑖 (𝒘⊤ 𝒙𝑖 + 𝑏) > 0, 𝑖 = 1, 2, … , 𝑚 .

Since scaling of (𝒘, 𝑏) does not change the solution, for simplicity, we add a constraint that

min𝑖 |𝒘⊤ 𝒙𝑖 + 𝑏| = 1 . (46)

Theorem 48 (Standard Form of SVM). The optimization problem of SVM is equivalent to

arg min𝒘,𝑏 (1/2) 𝒘⊤ 𝒘 (47)
s. t. 𝑦𝑖 (𝒘⊤ 𝒙𝑖 + 𝑏) ≥ 1, 𝑖 = 1, 2, … , 𝑚 .
Proof. By contradiction. Suppose the equality of the constraint does not hold at the optimum (𝒘⋆, 𝑏⋆), i.e., min𝑖 𝑦𝑖 (𝒘⋆⊤ 𝒙𝑖 + 𝑏⋆) > 1. There exists (𝑟𝒘⋆, 𝑟𝑏⋆) where 0 < 𝑟 < 1 such that min𝑖 𝑦𝑖 ((𝑟𝒘⋆)⊤ 𝒙𝑖 + 𝑟𝑏⋆) = 1, and (1/2)‖𝑟𝒘⋆‖² < (1/2)‖𝒘⋆‖². That implies (𝒘⋆, 𝑏⋆) is not an optimum, which contradicts the assumption. Therefore, Eqn. 47 is equivalent to

arg min𝒘,𝑏 (1/2) 𝒘⊤ 𝒘 (48)
s. t. min𝑖 𝑦𝑖 (𝒘⊤ 𝒙𝑖 + 𝑏) = 1 .

The objective function is equivalent to

arg min𝒘,𝑏 (1/2) 𝒘⊤ 𝒘 = arg max𝒘,𝑏 (2/‖𝒘‖) ⋅ 1
= arg max𝒘,𝑏 (2/‖𝒘‖) min𝑖 𝑦𝑖 (𝒘⊤ 𝒙𝑖 + 𝑏)
= arg max𝒘,𝑏 (2/‖𝒘‖) min𝑖 |𝒘⊤ 𝒙𝑖 + 𝑏| . (49)

Theorem 49 (Dual Problem of SVM). The dual problem of SVM is

arg min𝜶 (1/2) ∑_{𝑖=1}^{𝑚} ∑_{𝑗=1}^{𝑚} 𝛼𝑖 𝛼𝑗 𝑦𝑖 𝑦𝑗 𝒙𝑖⊤ 𝒙𝑗 − ∑_{𝑖=1}^{𝑚} 𝛼𝑖 (50)
s. t. ∑_{𝑖=1}^{𝑚} 𝛼𝑖 𝑦𝑖 = 0 ,
𝛼𝑖 ≥ 0, 𝑖 = 1, 2, … , 𝑚 .

Proof. The Lagrange function is

ℒ(𝒘, 𝑏, 𝜶) ≔ (1/2) 𝒘⊤ 𝒘 + ∑_{𝑖=1}^{𝑚} 𝛼𝑖 (1 − 𝑦𝑖 (𝒘⊤ 𝒙𝑖 + 𝑏)) . (51)

Its dual problem is

arg max𝜶 min𝒘,𝑏 (1/2) 𝒘⊤ 𝒘 + ∑_{𝑖=1}^{𝑚} 𝛼𝑖 (1 − 𝑦𝑖 (𝒘⊤ 𝒙𝑖 + 𝑏)) (52)
s. t. 𝛼𝑖 ≥ 0, 𝑖 = 1, 2, … , 𝑚 .

Since the inner optimization problem is unconstrained, we can get the optimum by

𝜕ℒ/𝜕𝒘 = 𝟎 ⇒ 𝒘 = ∑_{𝑖=1}^{𝑚} 𝛼𝑖 𝑦𝑖 𝒙𝑖 , (53)
𝜕ℒ/𝜕𝑏 = 0 ⇒ ∑_{𝑖=1}^{𝑚} 𝛼𝑖 𝑦𝑖 = 0 . (54)

Substituting them into Eqn. 52 gives Eqn. 50.

Table 7: Analogy of real-valued functions with linear transformations.

                  Function                Linear transformation
Definition        𝑓 ∶ ℝ → ℝ               𝑇 ∶ ℝ𝑛 → ℝ𝑚
Domain, Codomain  ℝ, ℝ                    ℝ𝑛 , ℝ𝑚
Image of 𝑥        𝑓 (𝑥)                   𝑇 (𝒙) ∶= 𝑨𝒙
Range             {𝑦 ∣ ∃𝑥. 𝑦 = 𝑓 (𝑥)}     𝒞(𝑨)
Zero              {𝑥 ∣ 𝑓 (𝑥) = 0}         𝒩(𝑨)
Inverse           𝑓 ⁻¹(𝑦)                 𝑇 ⁻¹(𝒚) = 𝑨⁻¹ 𝒚
Composition       𝑔◦𝑓 = 𝑔(𝑓 (𝑥))          𝑇𝐵 ◦𝑇𝐴 = 𝑩𝑨𝒙

Table 8: Terminologies of linear transformation 𝑇 (𝒙) = 𝑨𝒙.

Terminology              Meaning                 Property
Onto (surjective)        ≥ 1 arrow into each 𝒚   𝒞(𝑨) = ℝ𝑚
One-to-one (injective)   ≤ 1 arrow into each 𝒚   𝒩(𝑨) = 𝟘

Definition 44 (Support Vector). Example with dual variable 𝛼𝑖 > 0, which has 𝑦𝑖 (𝒘⊤ 𝒙𝑖 + 𝑏) = 1.

Lemma 50. The hypothesis function of linear SVM is

ℎ(𝒙) = sign( ∑_{𝑖∈𝑆𝑉} 𝛼𝑖 𝑦𝑖 𝒙𝑖⊤ 𝒙 + 𝑏 ) . (55)

8 Theory: Linear Transformations

8.1 Linear Transformations

Definition 45 (Transformation 𝑇 ). A transformation 𝑇 ∶ ℝ𝑛 → ℝ𝑚 is a rule that assigns to each vector 𝒙 ∈ ℝ𝑛 a vector 𝑇 (𝒙) ∈ ℝ𝑚 .

Definition 46 (Linear Transformation). A transformation 𝑇 is linear if 𝑇 (𝑐1 𝒗1 + 𝑐2 𝒗2 ) = 𝑐1 𝑇 (𝒗1 ) + 𝑐2 𝑇 (𝒗2 ). It is always the case that 𝑇 (𝟎) = 𝟎. The comparison between real-valued functions and linear transformations is illustrated in Table 7, and terminologies of transformations are illustrated in Table 8.

Theorem 51 (Standard Matrix for a Linear Transformation). Let 𝑇 ∶ ℝ𝑛 → ℝ𝑚 be a linear transformation. There exists a unique matrix 𝑨 ∶= [𝑇 (𝒆1 ) 𝑇 (𝒆2 ) ⋯ 𝑇 (𝒆𝑛 )] ∈ ℝ𝑚×𝑛 such that 𝑇 (𝒙) = 𝑨𝒙, where 𝒆𝑗 is the 𝑗-th column of 𝑰 ∈ ℝ𝑛×𝑛 .

Proof. 𝑇 (𝒙) = 𝑇 ( ∑_{𝑗=1}^{𝑛} 𝑥𝑗 𝒆𝑗 ) = ∑_{𝑗=1}^{𝑛} 𝑥𝑗 𝑇 (𝒆𝑗 ) = 𝑨𝒙.

Lemma 52 (General Matrix for a Linear Transformation). Let 𝑇 ∶ 𝒱 → 𝒰 be a linear transformation, where 𝑽 ∈ ℝ𝑛×𝑛 is the input basis and 𝑼 ∈ ℝ𝑚×𝑚 is the output basis. There exists a unique matrix 𝑨 ∈ ℝ𝑚×𝑛 that gives the coordinate
𝑇 (𝒄) = 𝑨𝒄 in the output space when the coordinate of the input space is 𝒄. The 𝑗-th column of 𝑨 is found by solving 𝑇 (𝒗𝑗 ) = 𝑼 𝒂𝑗 .

8.2 Identity Transformations = Change of Basis

Definition 47 (Coordinate). The coordinate of a vector 𝒙 ∈ ℝ𝑛 relative to the basis matrix 𝑾 ∈ ℝ𝑛×𝑛 is the coefficient 𝒄 such that 𝒙 = 𝑾 𝒄, or equivalently 𝒄 = 𝑾 ⁻¹𝒙.

Example 3 (Wavelet Transform). Wavelets are little waves. They have different lengths and they are localized at different places. The basis matrix is

     ⎡1  1  1  0⎤
𝑾 ∶= ⎢1  1 −1  0⎥ . (56)
     ⎢1 −1  0  1⎥
     ⎣1 −1  0 −1⎦

Those bases are orthogonal. The wavelet transform finds the coefficients 𝒄 when the input signal 𝒙 is expressed in the wavelet basis 𝒙 = 𝑾 𝒄.

Example 4 (Discrete Fourier Transform). The Fourier transform decomposes the signal into waves at equally spaced frequencies. The basis matrix is

     ⎡1  1     1      1    ⎤
𝑭 ∶= ⎢1  i     i²     i³   ⎥ . (57)
     ⎢1  i²   (i²)²  (i²)³ ⎥
     ⎣1  i³   (i³)²  (i³)³ ⎦

Those bases are orthogonal. The discrete Fourier transform finds the coefficients 𝒄 when the input signal 𝒙 is expressed in the Fourier basis 𝒙 = 𝑭 𝒄.

Lemma 53 (Change of Basis). Suppose we want to change the basis from 𝑽 ∶= [𝒗1 𝒗2 ⋯ 𝒗𝑛 ] to 𝑼 ∶= [𝒖1 𝒖2 ⋯ 𝒖𝑛 ]. The coordinate of a vector 𝒙 is 𝒄 in 𝑽 , and is 𝒃 in 𝑼 . Then 𝒃 = 𝑼 ⁻¹𝑽 𝒄, where 𝑨 ∶= 𝑼 ⁻¹𝑽 is called the change of basis matrix.

Proof. 𝒙 = 𝑽 𝒄 = 𝑼 𝒃 ⇒ 𝒃 = 𝑼 ⁻¹𝑽 𝒄.

Algorithm 18 (Solving the Change of Basis Matrix). Perform elementary row operations on [𝑼 𝑽 ] to get [𝑰 𝑨]. 𝑇 (𝑛) ∼ 𝑛³.

Example 5 (Diagonalization). 𝑇 (𝒙) ∶= 𝑨𝒙 = 𝑺𝚲𝑺⁻¹𝒙 defines a linear transformation which changes the basis from 𝑰 to 𝑺, then transforms 𝒙 in the space of 𝑺, and last changes the basis from 𝑺 back to 𝑰.

Example 6 (SVD Factorization). 𝑇 (𝒙) ∶= 𝑨𝒙 = 𝑼 𝚺𝑽 ⊤ 𝒙 defines a linear transformation which changes the basis from 𝑰 to 𝑽 , then transforms 𝒙 from space 𝑽 to space 𝑼 , and last changes the basis from 𝑼 back to 𝑰.

Table 9: Transformation using homogeneous coordinates.

Transformation       Result
Scaling              [𝑐𝑥 0 0; 0 𝑐𝑦 0; 0 0 1] [𝑥; 𝑦; 1] = [𝑐𝑥 𝑥; 𝑐𝑦 𝑦; 1]
Translation          [1 0 𝑥0 ; 0 1 𝑦0 ; 0 0 1] [𝑥; 𝑦; 1] = [𝑥 + 𝑥0 ; 𝑦 + 𝑦0 ; 1]
Reflection           [0 1 0; 1 0 0; 0 0 1] [𝑥; 𝑦; 1] = [𝑦; 𝑥; 1]
Clockwise rotation   [cos 𝛼 −sin 𝛼 0; sin 𝛼 cos 𝛼 0; 0 0 1] [𝑥; 𝑦; 1]

9 Applications: Linear Transformations

9.1 Computer Graphics

Definition 48 (Homogeneous Coordinates). Each point [𝑥, 𝑦]⊤ ∈ ℝ² can be identified with the point [𝑥, 𝑦, 1]⊤ ∈ ℝ³. Homogeneous coordinates can be transformed via multiplication by 3 × 3 matrices, as illustrated in Table 9. By analogy, each point [𝑥, 𝑦, 𝑧]⊤ ∈ ℝ³ can be identified with the point [𝑥, 𝑦, 𝑧, 1]⊤ ∈ ℝ⁴.

Theorem 54 (Perspective Projections). A 3D object is represented on the 2D computer screen by projecting the object onto a viewing plane at 𝑧 = 0. Suppose the eye of a viewer is at the point [0, 0, 𝑑]⊤. A perspective projection maps each point [𝑥, 𝑦, 𝑧]⊤ onto an image point [𝑥𝑝 , 𝑦𝑝 , 0]⊤ such that those two points and the eye position (the center of projection) are on a line.

𝑥𝑝 = 𝑥/(1 − 𝑧/𝑑) , 𝑦𝑝 = 𝑦/(1 − 𝑧/𝑑) . (58)

9.2 Principal Component Analysis

Definition 49 (Principal Component Analysis, PCA). Given a set of instances {𝒙𝑖 }_{𝑖=1}^{𝑚} with empirical mean 𝝁 ∈ ℝ𝑑 and empirical covariance 𝚺 ∈ ℝ𝑑×𝑑 , PCA wants to find a set of orthonormal bases 𝑾 ∶= [𝒘1 𝒘2 ⋯ 𝒘𝑑′ ] such that the sum of variance of the projected data along each component
Table 10: Properties of eigenvalues and eigenvectors of special matrices.

Matrix                                        Eigenvalues                    Eigenvectors
Symmetric 𝑨⊤ = 𝑨                              All real                       Orthonormal
Anti-symmetric 𝑨⊤ = −𝑨                        All imaginary                  Orthonormal
Orthogonal 𝑸⁻¹ = 𝑸⊤                           All |𝜆| = 1                    Orthonormal
PD                                            All 𝜆 > 0                      Orthonormal
PSD                                           All 𝜆 ≥ 0                      Orthonormal
Diagonalizable 𝑨 = 𝑺𝚲𝑺⁻¹                      diag 𝚲                         Columns of 𝑺 are independent
Rectangular 𝑨 = 𝑼 𝚺𝑽 ⊤                        rank 𝑨 = rank 𝚺                Eigenvectors of 𝑨⊤ 𝑨, 𝑨𝑨⊤ in 𝑽 , 𝑼
Stable powers 𝑨^𝑛 → 𝟎                         All |𝜆| < 1                    Any
Markov 𝐴𝑖𝑗 > 0, ∑_{𝑖=1}^{𝑛} 𝐴𝑖𝑗 = 1           max𝑖 𝜆𝑖 = 1                    Steady state 𝒙 > 0
Stable exponential exp 𝑨𝑡 → 𝟎                 All Re 𝜆 < 0                   Any
Projection 𝑷 = 𝑷 ² = 𝑷 ⊤                      1, 0                           𝒞(𝑷 ), 𝒩(𝑷 )
Rank-1 𝒖𝒗⊤                                    𝒗⊤ 𝒖, 0, … , 0                 𝒖, whole plane 𝒗⟂
Reflection 𝑰 − 2𝒆𝒆⊤                           −1, 1, … , 1                   𝒆, whole plane 𝒆⟂
Plane rotation                                exp(i𝜃), exp(−i𝜃)              [1, i]⊤ , [1, −i]⊤
Cyclic permutation: row 1 of 𝑰 last           𝜆𝑘 = exp(2𝜋i𝑘/𝑛)               𝒙𝑘 = [1, 𝜆𝑘 , ⋯ , 𝜆𝑘^{𝑛−1} ]⊤
Tridiagonal: −1, 2, −1 on diagonals           𝜆𝑘 = 2 − 2 cos(𝑘𝜋/(𝑛+1))       𝒙𝑘 = [sin(𝑘𝜋/(𝑛+1)), sin(2𝑘𝜋/(𝑛+1)), ⋯]⊤
is maximized.

arg max𝑾 tr cov 𝑾 ⊤ (𝒙 − 𝝁) (59)
s. t. 𝑾 ⊤ 𝑾 = 𝑰 .

Theorem 55. The optimum 𝑾 of Eqn. 61 is given by the top 𝑑′ eigenvectors of 𝚺.

Proof. Since 𝔼[𝑾 ⊤ (𝒙 − 𝝁)] = 𝑾 ⊤ 𝔼[𝒙 − 𝝁] = 𝟎,

tr cov 𝑾 ⊤ (𝒙 − 𝝁) = tr 𝔼[(𝑾 ⊤ (𝒙 − 𝝁) − 𝟎)(𝑾 ⊤ (𝒙 − 𝝁) − 𝟎)⊤ ]
= tr 𝑾 ⊤ 𝔼[(𝒙 − 𝝁)(𝒙 − 𝝁)⊤ ]𝑾
= tr 𝑾 ⊤ 𝚺𝑾 . (60)

The optimization problem is equivalent to

arg min𝑾 − tr 𝑾 ⊤ 𝚺𝑾 (61)
s. t. 𝑾 ⊤ 𝑾 = 𝑰 .

The Lagrange function is

ℒ(𝑾 , 𝑩) ∶= − tr 𝑾 ⊤ 𝚺𝑾 + (vec 𝑩)⊤ (vec(𝑾 ⊤ 𝑾 − 𝑰))
= − tr 𝑾 ⊤ 𝚺𝑾 + tr 𝑩⊤ (𝑾 ⊤ 𝑾 − 𝑰) . (62)

We can get the optimum by

𝜕ℒ/𝜕𝑾 = 𝟎 ⇒ 𝚺𝑾 = 𝑾 𝑩 . (63)

Corollary 56. 𝒙̂ ∶= 𝑸⊤ (𝒙 − 𝝁) has 𝔼[𝒙̂] = 𝟎 and cov 𝒙̂ = 𝚲, where 𝚺 = 𝑸𝚲𝑸⊤ .

Definition 50 (PCA Whitening). 𝒙̂ ∶= 𝚲^{−1/2} 𝑸⊤ (𝒙 − 𝝁) has 𝔼[𝒙̂] = 𝟎 and cov 𝒙̂ = 𝑰.

Definition 51 (ZCA Whitening). 𝒙̂ ∶= 𝑸𝚲^{−1/2} 𝑸⊤ (𝒙 − 𝝁) has 𝔼[𝒙̂] = 𝟎 and cov 𝒙̂ = 𝑰.

10 Appendix

Lemma 57 (Sum of Series).

∑_{𝑖=1}^{𝑛} 𝑖 = 𝑛(𝑛 + 1)/2 ∼ (1/2) 𝑛² , (64)
∑_{𝑖=1}^{𝑛} 𝑖² = 𝑛(𝑛 + 1)(2𝑛 + 1)/6 ∼ (1/3) 𝑛³ . (65)

References

[1] S. Axler. Linear algebra done right. Springer, 1997.
[2] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[3] S. Boyd and L. Vandenberghe. Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares. Cambridge University Press, 2018.
[4] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge University Press, 1990.
[5] D. C. Lay, S. R. Lay, and J. J. McDonald. Linear Algebra and Its Applications (Fifth Edition). Pearson, 2014.
[6] K. B. Petersen, M. S. Pedersen, et al. The matrix cookbook. Technical University of Denmark, 2008.
[7] G. Strang. Linear algebra and its applications (Fourth Edition). Academic Press, 2006.
[8] G. Strang. Computational science and engineering. Wellesley-Cambridge Press, 2007.
[9] G. Strang. Introduction to linear algebra (Fourth Edition). Wellesley-Cambridge Press, 2009.