ST208 - Chapters 1-6


ST208 Mathematical Methods


Contents

Module Introduction and Information

1 Preliminaries
1.1 Subsets of R²
1.2 Inverse image of a function
1.3 Partial derivatives
1.4 Calculating determinants
1.5 Chapter 1 summary

2 Multiple integration
2.1 Introduction: Fubini’s theorem
2.2 Elementary regions of integration
  2.2.1 Finding the limits of integration 1: 2-d Cartesian
2.3 Change of variables
2.4 Common coordinate changes
  2.4.1 Finding the limits of integration 2: 2-d polar coordinates
2.5 Applications of multiple integrals to statistics
  2.5.1 Geometric Probability
2.6 Chapter 2 Consolidation Questions
2.7 Chapter 2 summary

3 Symmetric matrices, quadratic forms, positive definiteness
3.1 Some linear algebra
3.2 Quadratic forms and positive definiteness
3.3 Covariance matrices
3.4 A glimpse of the future: Monte Carlo simulation of stock portfolio
3.5 Chapter 3 Consolidation Questions
3.6 Chapter 3 summary

4 Differentiation in R𝑛
4.1 Generalising the single variable case
4.2 Basic properties of the derivative
4.3 Finding and classifying critical points
  4.3.1 Critical points
  4.3.2 Classifying critical points
4.4 Constrained optimisation
4.5 Chapter 4 Consolidation Questions
4.6 Chapter 4 summary
4.7 Appendix

5 Linear algebra
5.1 Vector spaces
5.2 Change of basis and linear maps
5.3 Inner products
5.4 Orthogonal/orthonormal bases
5.5 Gram-Schmidt orthogonalisation
5.6 Projections
5.7 The spectral decomposition theorem
5.8 Application to simple linear regression
5.9 Chapter 5 Consolidation Questions
5.10 Chapter 5 summary

6 Metric spaces
6.1 Introduction to metric spaces
6.2 Open and closed sets
6.3 Convergence
6.4 Compactness
6.5 Continuity
6.6 Chapter 6 Consolidation Questions
6.7 Chapter 6 summary
Module Introduction and Information

Image Yang Hui triangle using rod numerals, as depicted in a publication of Zhu Shijie in 1303 AD. Available
via https://commons.wikimedia.org/wiki/File:Yanghui_triangle.gif. Image licence: This work is in the public
domain and is free for reuse.

Historical note Yang Hui’s triangle will look familiar to many as it is the so-called ‘Pascal’s triangle,’ named after French mathematician Blaise Pascal. Although named after Pascal, the triangle had been studied centuries earlier in India, Persia, China, and parts of Europe. It is impossible to give a definitive answer as to the first occurrence of this mathematical object. Nevertheless, there is evidence to suggest knowledge of particular properties of the triangle as early as the second century BC.

Further reading C.F. Boyer, A History of Mathematics, Wiley, 1968.

Information The module web-page https://go.warwick.ac.uk/st208 contains full module details and information. (Note, you may be asked for your Warwick sign-on to view this page.) These notes are designed to be a self-contained study resource that you must use in conjunction with the lectures, tutorials and online resources. The notes contain gaps that you must complete. The nature of the gaps indicates how you will typically complete them.

Examples We will complete these together during class.

Questions We may complete some or all of these during class. Those questions that are not completed during class you must complete as part of your independent study.

Proofs and other gaps We will typically complete these together during class.

Each chapter contains a summary and consolidation questions. The consolidation questions are designed
for when you have finished working through a chapter and typically require deeper thought to complete.

Acknowledgements These lecture notes are closely based on those of the module’s previous lecturers,
Martyn Parker, Adam Johansen, David Croydon, Heather Humphries (which were typeset by Iain Carson)
and Paul Jenkins. Responsibility for any errors is mine.
Chapter 1

Preliminaries

Chapter outcomes

At the end of the chapter you should be familiar with and be able to:

• Sketch subsets of R² and select appropriate techniques to determine if subsets of R² are closed, bounded or compact.
• Determine the pre-image (or pullback) of a given set.
• Compute partial derivatives in a range of situations.
• Apply appropriate theory to calculate determinants.

Assumed background knowledge

Before starting this chapter it is assumed that you know the following material. (References are made to the
2019-20 printed notes.)

• MAT106: Linear Algebra. Chapter 11: The determinant of a matrix

– Definition of the determinant


– The effect of matrix operations on the determinant
– The determinant of a product
– Minors and cofactors

• MA137: Mathematical Analysis. Chapter 1: Functions

– Definition of intervals.
– Standard functions.

• ST116: Mathematical Techniques. The mathematical thinking, language and writing developed is
fundamental to everything presented in this module. This includes:

– Language of sets
– Function notation, including image and pre-image.

• General mathematical knowledge. That you can fluently apply pre-university differentiation techniques.


Chapter Statistician: Abraham Manie (Abe) Adelstein

Abe was a South African doctor who became the UK’s Chief Medical Statistician. He held posts at Manchester and the London School of Hygiene and Tropical Medicine. His work included improving cancer research, and he played leading parts in World Health Organisation meetings that developed recommendations for health information systems across Europe. He was a Fellow of the Royal College of Physicians and held several other honours.

Figure 1.1: Abraham Manie (Abe) Adelstein.

Image available via: https://imgix.ranker.com/user_node_img/20/392275/original/abraham-manie-adelstein-all-people-photo-u1?w=650/&q=50/&fm=pjpg/&fit=crop/&crop=faces Image licence: Creative commons licence.

1.1 Subsets of R²

“The numbers have no way of speaking for themselves. We speak for them. We imbue them with
meaning…”
— Nate Silver, The Signal and the Noise, Penguin UK, 2012

In order to be able to define properly the concepts of integration and differentiation in higher dimensions,
we need to gather and revise some ideas about sets and functions. The purpose of this chapter is to revise
these topics.

We will need to be able to sketch, quickly and efficiently, simple subsets of R2 defined by constraints such
as simple bounding curves and simple inequalities. We also need to be able to decide whether our sets are
closed, bounded or compact. We will meet the formal definitions of these terms in Chapter 6 (on metric spaces); for now, it will be enough to say that a set is:

• closed if it contains its own boundary, or if any point in its complement can be surrounded by a disc
of suitably small radius (𝜀 > 0) which also lies in the complement;
• bounded if it could be contained completely within a circle, centre at the origin, of sufficiently large
radius;
• compact if it is both closed and bounded. (This is adequate if we restrict our attention to ℝ𝑛 ; it is not
correct in more general metric spaces.)

There is an equivalent statement for closed sets. A set 𝑋 ⊆ ℝ𝑛 is closed if, and only if, the limit of every
convergent sequence in 𝑋 belongs to 𝑋 . Please remember that once we see these definitions formally in
Chapter 6 you will need to utilise the updated definitions.

Example 1.1
Sketch the following subsets of R2 , and decide if they are closed, if they are bounded and if they
are compact.

1. {(𝑥, 𝑦) ∈ R2 ∶ 𝑥 + 𝑦 ≥ 1, 𝑥2 + 𝑦2 < 1},


2. {(𝑥, 𝑦) ∈ R2 ∶ 𝑥2 + 𝑦 2 ≥ 1, |𝑥| + |𝑦| < 2},
3. {(𝑥, 𝑦) ∈ R2 ∶ |𝑥| + |𝑦| ≥ 1, max{|𝑥|, |𝑦|} ≤ 1}.
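The sketching itself we complete in class; as a quick computational cross-check, the following minimal sketch (assuming numpy and matplotlib are available) shades set 1 by evaluating its defining inequalities on a grid.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over a window containing the set.
x, y = np.meshgrid(np.linspace(-1.5, 1.5, 400), np.linspace(-1.5, 1.5, 400))

# Boolean mask for set 1: x + y >= 1 and x^2 + y^2 < 1.
mask = (x + y >= 1) & (x**2 + y**2 < 1)

plt.contourf(x, y, mask.astype(int), levels=[0.5, 1.5])  # shade where mask holds
plt.gca().set_aspect("equal")
plt.title("x + y >= 1 and x^2 + y^2 < 1")
plt.show()
```

The same grid-and-mask pattern works for the other two sets by changing the boolean expression.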

Question 1.2
Sketch the following subsets of R2 , and decide if they are closed, if they are bounded and if they
are compact.

1. 𝐴1 = {(𝑥, 𝑦) ∈ R2 ∶ 𝑥 + 𝑦 ≥ 1},
2. 𝐴2 = {(𝑥, 𝑦) ∈ R2 ∶ 𝑥 + 𝑦 ≥ 1, 𝑥2 + 𝑦2 ≤ 1},
3. 𝐴3 = {(𝑥, 𝑦) ∈ R2 ∶ 𝑦 = 0, 𝑥 ≤ 0},
4. 𝐴4 = {(𝑥, 𝑦) ∈ R2 ∶ 𝑥2 + 𝑦2 ≥ 3, 𝑥2 + 4𝑦2 ≤ 4},
5. 𝐴5 = ([0, 1] × [0, 1]) ∩ (Q × Q),
6. 𝐴6 = {(𝑥, 𝑦) ∈ R2 ∶ 𝑥2 = 𝑦2 },
7. 𝐴7 = {(𝑥, 𝑦) ∈ R2 ∶ 𝑥2 ≤ 𝑦},
8. 𝐴8 = {(𝑥, 𝑦) ∈ R2 ∶ 𝑥2 ≤ 𝑦 ≤ 1},
9. 𝐴9 = {(𝑥, 𝑦) ∈ R2 ∶ |𝑥| ≤ 1, 0 ≤ 𝑦2 < 1},
10. 𝐴10 = {(𝑥, 𝑦) ∈ R2 ∶ |𝑥| + |𝑦| ≥ 1, |𝑦| ≤ 2},
11. 𝐴11 = {(𝑥, 𝑦) ∈ R2 ∶ 𝑥2 + 𝑦2 ≤ 1, 𝑥 ∈ Q}.

1.2 Inverse image of a function


Consider a function 𝑓 ∶ 𝐴 → 𝐵, 𝑎 ↦ 𝑓(𝑎) = 𝑏. This may be injective (into), surjective (onto), neither
or both. A bijective function 𝑓 (i.e. one that is both injective and surjective and hence mapping exactly one
element of 𝐴 to each element of 𝐵 ) has a related inverse function 𝑓 −1 , which is defined as the function
𝑓 −1 such that if 𝑓(𝑎) = 𝑏, then 𝑓 −1 (𝑏) = 𝑎.
If 𝑓 is not bijective, then we might still be interested in which point(s) of 𝐴 map to a particular point 𝑏 ∈ 𝑓(𝐴) ⊆ 𝐵, where 𝑓(𝐴) is the image of 𝐴 under 𝑓. The preimage, inverse image or pull-back of the subset 𝑋 of 𝐵 is defined to be

𝑓 −1 (𝑋) = {𝑎 ∈ 𝐴 ∶ 𝑓(𝑎) ∈ 𝑋}.

In particular,
𝑓 −1 ({𝑏}) = {𝑎 ∈ 𝐴 ∶ 𝑓(𝑎) = 𝑏},
which is often referred to as the fibre of 𝑏.
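Preimages can also be explored numerically. The following minimal sketch (assuming numpy and matplotlib are available) shades a preimage by testing, pointwise on a grid, whether the image lands in the target set; it uses the function of Example 1.3 below.

```python
import numpy as np
import matplotlib.pyplot as plt

f = lambda x, y: 1 + x**2 + y**2          # the function from Example 1.3

x, y = np.meshgrid(np.linspace(-4, 4, 600), np.linspace(-4, 4, 600))
vals = f(x, y)

# A point (x, y) belongs to the preimage of X = [0, 2] ∪ [5, 10]
# exactly when f(x, y) lands in X.
in_X = ((vals >= 0) & (vals <= 2)) | ((vals >= 5) & (vals <= 10))

plt.contourf(x, y, in_X.astype(int), levels=[0.5, 1.5])
plt.gca().set_aspect("equal")
plt.title("Preimage of [0,2] ∪ [5,10] under f(x,y) = 1 + x^2 + y^2")
plt.show()
```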

Example 1.3
Let 𝑓 ∶ R2 → R be defined by (𝑥, 𝑦) ↦ 1 + 𝑥2 + 𝑦2 . Find 𝑓 −1 ((0, 1)), and sketch 𝑓 −1 ([0, 2] ∪
[5, 10]).

Question 1.4

1. Let 𝑓 ∶ R2 → R, (𝑥, 𝑦) ↦ 𝑥2 + 𝑦2 . Find 𝑓 −1 ([1, 4]), 𝑓 −1 ({−4}), 𝑓 −1 ([−4, 1]), 𝑓 −1 (Z).


2. Let 𝑓 ∶ R2 → R2 , (𝑥, 𝑦) ↦ (𝑥2 + 𝑦2 , 𝑥𝑦). Sketch 𝑓 −1 ([1, 4] × [0, 1]).
3. Let 𝑓 ∶ R2 → R2 , (𝑥, 𝑦) ↦ (𝑦, 𝑥2 + 𝑦2 ). Find 𝑓 −1 (R × {−1}), and sketch 𝑓 −1 ([0, 1] ×
[1, 2]).
4. Let 𝑓 ∶ R → R2 , 𝑥 ↦ (|𝑥|, 𝑥). Show 𝑓 −1 ({1} × R) = {−1, 1}, and find 𝑓 −1 ([0, 2] ×
[−1, 1]), 𝑓 −1 ([−1, 1] × [0, 2]).
5. Let 𝑓 ∶ R2 → R, (𝑥, 𝑦) ↦ max{|𝑥|, |𝑦|}. Sketch 𝑓 −1 ([−1, 1] ∪ [2, 4]).
6. Let f : R² → R, (x, y) ↦ |x| + 2|y|. Find f⁻¹({−1}), and sketch f⁻¹({1}), f⁻¹(f({1} × R)).

1.3 Partial derivatives


A function f : R → R is differentiable at x if the limit

lim_{h→0} ( f(x + h) − f(x) ) / h

exists. Write f′(x) for the limit and call this the derivative of f at x. For functions of one variable, the derivative gives the slope of the tangent to the graph.
Consider f : U (⊆ Rⁿ) → Rᵐ; what is the ‘derivative of f’? For the moment, we specialise to Rⁿ → R, and initially consider n = 2. This simplification permits graphical motivation; in particular, the question is: what is the derivative of f at a point (a, b)? The general question must wait until later in the module. Graphically, near (a, b) the graph of the function is a surface sitting over the (x, y)-plane.

Observe that if we slice through (a, b) parallel to the x- and y-axes, the result is the graph of two related functions which map R → R. The figure illustrates a slice of the function parallel to the x-axis. The graph of x ↦ f(x, b) is single-valued; that is, a map R → R. Differentiating the function x ↦ f(x, b) in the normal way one obtains ∂f/∂x (a, b), which we abbreviate as f_x(a, b). In a similar manner, parallel to the y-axis the graph of y ↦ f(a, y) is single-valued; that is, a map R → R. Once again, differentiating the function y ↦ f(a, y) in the normal way one obtains ∂f/∂y (a, b), which we abbreviate as f_y(a, b).
In the case 𝑓 ∶ R𝑛 → R, define

∂f/∂x_i (x) = lim_{h→0} ( f(x₁, …, x_{i−1}, x_i + h, x_{i+1}, …, x_n) − f(x) ) / h,

where x = (x₁, …, x_n). Finally, in the general case, where f(x) = (f₁(x), …, f_m(x)), this approach yields a matrix, J, of partial derivatives whose entries are

J_ij = ∂f_i / ∂x_j.

Returning to the general question: finding partial derivatives is only the first step in determining the derivative of f : U (⊆ Rⁿ) → Rᵐ.
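Partial derivatives of explicit formulas can be checked symbolically. The sketch below (assuming sympy is available) differentiates the function of Example 1.5 below and solves f_x = f_y = 0, so you can verify your hand computation.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)
f = x / (1 + x**2 + y**2)          # the function of Example 1.5

fx = sp.simplify(sp.diff(f, x))    # ∂f/∂x
fy = sp.simplify(sp.diff(f, y))    # ∂f/∂y
print(fx, fy)

# Points where both partial derivatives vanish:
print(sp.solve([sp.Eq(fx, 0), sp.Eq(fy, 0)], [x, y], dict=True))
```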

Example 1.5
Let f : R² → R, (x, y) ↦ f(x, y) be given by f(x, y) = x / (1 + x² + y²). Find f_x, f_y, and any points where f_x = f_y = 0.

Question 1.6

1. Let f : R² → R, (x, y) ↦ f(x, y) be as given below. Find f_x, f_y, and any points where f_x = f_y = 0.

a) f(x, y) = x³ + 6 − y³ − 2xy,
b) f(x, y) = 6x² − 2x³ + 3y² + 6xy,
c) f(x, y) = x³ + y³ + 3x² − 3y² − 8,
d) f(x, y) = x⁴ − 8x² + 3y² − 6y,
e) f(x, y) = 6xy e^{−(2x+3y)},
f) f(x, y) = x + 8y + 1/(xy). Note, here, f : (R \ {0})² → R,
g) f(x, y) = xy e^{−x²−y²}.

2. Let f : R³ → R, (x, y, z) ↦ f(x, y, z) be as given below. Find f_x, f_y, f_z, and any points where f_x = f_y = f_z = 0.

a) f(x, y, z) = x²y + y²z + z² − 2x,
b) f(x, y, z) = xyz − x² − y² − z²,
c) f(x, y, z) = 4xyz − x⁴ − y⁴ − z⁴.

1.4 Calculating determinants


The purpose of this section is to apply theoretical properties to simplify determinant computations. The determinant of a matrix is a function from the space of square matrices to R; that is, the determinant of a matrix A, written det A or |A|, is a number, or expression, that depends on the elements of A. The determinant possesses properties that simplify its computation, for example,

1. subtracting a multiple of one row (column) from another,


2. taking out a common factor from any row (column),
3. recognising upper triangular, lower triangular or diagonal matrices,
4. choosing the ‘best’ row (column) to expand along,
5. using ‘block’ matrices where appropriate,
6. identifying rows or columns which can be expressed as linear combinations of other rows or columns,
respectively.
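Properties 1–3 are easy to illustrate numerically; a minimal sketch (assuming numpy is available; the random matrix is just a stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(3, 3)).astype(float)

# Property 1: subtracting a multiple of one row from another leaves det unchanged.
B = A.copy()
B[1] -= 2 * B[0]
print(np.isclose(np.linalg.det(A), np.linalg.det(B)))       # True

# Property 2: a common factor c in one row scales the determinant by c.
C = A.copy()
C[2] *= 5
print(np.isclose(np.linalg.det(C), 5 * np.linalg.det(A)))   # True

# Property 3: the determinant of a triangular matrix is the product of its diagonal.
T = np.triu(A)
print(np.isclose(np.linalg.det(T), np.prod(np.diag(T))))    # True
```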

Example 1.7
Compute the determinants of the following matrices

1.
⎛ 2 1 1 ⎞
⎜ 1 1 1 ⎟
⎝ 2 2 2 ⎠

2.
⎛ 2 1 1 ⎞
⎜ 0 1 2 ⎟
⎝ 0 0 1 ⎠

3.
⎛ 2 1 2 ⎞
⎜ 0 1 0 ⎟
⎝ 2 0 1 ⎠

4.
⎛ 2 1 0 ⎞
⎜ 2 2 0 ⎟
⎝ 0 0 1 ⎠

5.
⎛ 0 1 0 0 ⎞
⎜ 0 1 0 5 ⎟
⎜ 0 2 1 6 ⎟
⎝ 1 2 6 1 ⎠

Question 1.8

1. Compute the determinant of


⎛ 4 2 4 ⎞
⎜ 0 1 0 ⎟
⎝ 2 0 1 ⎠
2. Let
    ⎛ 1 2 −2 ⎞
B = ⎜ 1 5  3 ⎟ .
    ⎝ 2 6 −1 ⎠
Compute det(𝐵) by

a) expanding along row 1.


b) expanding down column 3.
c) row reducing the matrix.

3. Let
    ⎛  1    3   −2  ⎞
C = ⎜  4  𝑘 + 3  5  ⎟ .
    ⎝ −2   −6  𝑘 + 2 ⎠
Determine the value(s) of 𝑘 for which det 𝐶 = 0.

1.5 Chapter 1 summary

Chapter outcomes review

Having reviewed the material and completed the assessment (formative and summative) material,
you should be able to:

• Sketch subsets of R² and select appropriate techniques to determine if subsets of R² are closed, bounded or compact.
• Determine the pre-image (or pullback) of a given set.
• Compute partial derivatives in a range of situations.
• Apply appropriate theory to simplify determinant calculations.

This chapter covers four topics.

Subsets of R². This section examines sketching (quickly and efficiently) subsets of R² defined by constraints such as bounding curves and inequalities, and deciding whether sets are closed, bounded or compact in R². Specifically, a set is:

• closed if it contains its own boundary, or if any point in its complement can be surrounded by a disc
of suitably small radius (𝜀 > 0) which also lies in the complement;

• bounded if it could be contained completely within a circle, centre at the origin, of sufficiently large
radius;
• compact if it is both closed and bounded.

Inverse images. Consider a function 𝑓 ∶ 𝐴 → 𝐵, where 𝑎 maps to 𝑓(𝑎) = 𝑏. This may be injective
(into), surjective (onto), neither or both. A bijective function 𝑓 (i.e. one that is both injective and surjective
and hence mapping exactly one element of 𝐴 to each element of 𝐵 ) has a related inverse function 𝑓 −1 ,
which is defined as the function 𝑓 −1 such that if 𝑓(𝑎) = 𝑏, then 𝑓 −1 (𝑏) = 𝑎. If 𝑓 is not bijective, then
we might still be interested in which point(s) of 𝐴 map to a particular point 𝑏 ∈ 𝑓(𝐴) ⊆ 𝐵 , where 𝑓(𝐴) is
the image of 𝐴 under 𝑓 . The preimage, inverse image or pull-back of the subset 𝑋 of 𝐵 is defined to be
𝑓 −1 (𝑋) = {𝑎 ∈ 𝐴 ∶ 𝑓(𝑎) ∈ 𝑋}.
Partial derivatives. A function f : R → R is differentiable at x if

lim_{h→0} ( f(x + h) − f(x) ) / h

exists. The limit is written f′(x) and is called the derivative of f at x.

Consider f : U (⊆ Rⁿ) → Rᵐ, with particular attention to the cases when (i) m = 1 and (ii) n = 2 and m = 1. Consider the specific case in (ii); that is, f : R² → R: what is the derivative of f at (a, b) ∈ R²? In this case, there are two natural functions to consider. For x ↦ f(x, b), since b is constant, the standard approach permits differentiation with respect to x to give ∂f/∂x (a, b) or f_x(a, b). In a similar manner, the standard approach applied to y ↦ f(a, y) gives ∂f/∂y (a, b) or f_y(a, b). In the case f : Rⁿ → R, define

∂f/∂x_i (x) = lim_{h→0} ( f(x₁, …, x_{i−1}, x_i + h, x_{i+1}, …, x_n) − f(x) ) / h,

where x = (x₁, …, x_n). Finally, in the general case, where f(x) = (f₁(x), …, f_m(x)), this approach yields a matrix, J, of partial derivatives whose entries are

J_ij = ∂f_i / ∂x_j.

Calculating determinants. Theoretical considerations simplify determinant computation through the following approaches.

• subtracting a multiple of one row (column) from another,


• taking out a common factor from any row (column),
• recognising upper triangular, lower triangular or diagonal matrices,
• choosing the best row (column) to expand along,
• using block matrices where appropriate,
• identifying rows or columns which can be expressed as linear combinations of other rows or columns,
respectively.
Chapter 2

Multiple integration

Chapter outcomes

At the end of the chapter you should be familiar with and be able to:

• apply, where appropriate, a special case of Fubini’s theorem;


• solve problems that require reversing the order of integration;
• select and apply change of variables to suitable integrals.
• solve using appropriate integration techniques a range of problems, including some involving
statistics.

Assumed background knowledge

Before starting this chapter it is assumed that you know the following material. (References are made to the
2019-20 printed notes.)

• MAT106: Linear Algebra. Chapter 11: The determinant of a matrix

– Definition of the determinant.


– Calculation of the determinant

• MA137: Mathematical Analysis. Chapter 1: Functions

– Definition of intervals.
– Standard functions.

• ST116: Mathematical Techniques. The mathematical thinking, language and writing developed is
fundamental to everything presented in this module. This includes:

– Language of sets
– Function notation, including image and pre-image.

• General mathematical knowledge. It is assumed that you have differentiation proficiency; for example, that you can differentiate standard functions, utilising rules such as the product, quotient and chain rules. It is also assumed that you can fluently integrate functions, including applying standard integration rules.


Chapter Statistician: Kiyoshi Ito

Kiyoshi Ito was a Japanese mathematician and a major contributor to the theory of probability. Ito developed the work of others to apply the techniques of differential and integral calculus to stochastic processes (random phenomena that evolve over time), such as Brownian motion. This work became known as the Ito stochastic calculus. Ito calculus has applications in several fields, including engineering, population genetics, and mathematical finance.

Figure 2.1: Kiyoshi Ito.

Image available via: https://upload.wikimedia.org/wikipedia/commons/c/c1/Kiyosi_Ito.jpg Image licence:


Creative Commons Attribution-Share Alike 2.0 Germany.

2.1 Introduction: Fubini’s theorem

We know how to define the (Riemann) integral of a function f : R → R using a limit of (Riemann) sums (by partitioning the x-axis using tiny intervals), and we have seen that this can be used to obtain the area under the graph of f. It is clear that a similar limiting process can be carried out by partitioning the rectangle beneath a bivariate function into ‘tiny’ rectangles, and we can thus define the integral of a function f(x, y) over a rectangle R, denoted:

∫_R f(x, y) dA = ∬_R f(x, y) dx dy,

and of course we could extend this to cuboids and their higher-dimensional equivalents.

Figure 2.2: Surface approximated by cuboids.

Being able to define an integral is not the same as being able to evaluate it, and we do not wish to be
restricted to integrating over rectangular regions, so we now develop some ideas to allow us to address
these problems.

Fubini’s theorem is a mathematical result which makes precise when we can expect to change the order of integration (or of integration and other operations in which limits are taken) without affecting the result. For the avoidance of later confusion, note that the general and formal form of Fubini’s theorem normally considers the Lebesgue rather than the Riemann integral; we’ll consider only functions sufficiently regular that the distinction is unimportant. You’ll see formal versions of this result in modules such as MA359: Measure Theory, ST342: Mathematics of Random Events and ST318: Probability Theory.

We will make use of the following simple special case of Fubini’s theorem in this module:

Fubini’s Theorem

If 𝑓 is a continuous function with domain 𝑅 = [𝑎, 𝑏] × [𝑐, 𝑑], then


∫_a^b ( ∫_c^d f(x, y) dy ) dx = ∫_R f(x, y) dA = ∫_c^d ( ∫_a^b f(x, y) dx ) dy.

Cavalieri’s principle states that each of the above integrals will find the volume under the graph of f over the rectangle R (by cutting into slices parallel to one axis and integrating to obtain the cross-sectional area). The next figure illustrates this principle for f : R² → R defined by f(x, y) = ½ y sin(x) + 2.

Figure 2.3: Illustration of Cavalieri’s principle

Example 2.1
Evaluate the integrals

(a) ∫_{x=0}^{3} ( ∫_{y=1}^{2} x²y dy ) dx    (b) ∫_{y=1}^{2} ( ∫_{x=0}^{3} x²y dx ) dy.
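A numerical cross-check of Fubini’s theorem on this example, as a minimal sketch assuming scipy is available (note that dblquad expects the integrand with the inner variable first):

```python
from scipy.integrate import dblquad

f = lambda y, x: x**2 * y          # dblquad expects f(inner, outer)

# (a): inner integral over y ∈ [1, 2], outer over x ∈ [0, 3].
I_a, _ = dblquad(f, 0, 3, lambda x: 1, lambda x: 2)

# (b): swap the roles, inner over x ∈ [0, 3], outer over y ∈ [1, 2].
g = lambda x, y: x**2 * y
I_b, _ = dblquad(g, 1, 2, lambda y: 0, lambda y: 3)

print(I_a, I_b)   # the two iterated integrals agree, as Fubini guarantees
```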

The conditions of Fubini’s theorem cannot be relaxed

For example,

∫_{x=0}^{2} ( ∫_{y=0}^{1} xy(x² − y²)/(x² + y²)³ dy ) dx = 1/5,

whereas

∫_{y=0}^{1} ( ∫_{x=0}^{2} xy(x² − y²)/(x² + y²)³ dx ) dy = −1/20,

where (x, y) ≠ (0, 0). The exercises provide a method for you to evaluate these integrals.

• With a bit of luck, we will be able to integrate one of the iterated integrals using familiar integrals of
functions R → R.
• It is not necessary for 𝑓 to be continuous for it to be integrable, and Fubini’s theorem does generalise
to bounded functions with ‘nice’ discontinuities, but the statement is messy, and so we will restrict to
continuous 𝑓 .
• ∫𝑅 𝑓(𝑥, 𝑦)𝑑𝐴 inherits all the properties we like from univariate integrals: linearity, monotonicity and
so on.

Example 2.2
Suppose f : R² → R satisfies f(x, y) = g(x)h(y) for some real-valued functions g and h. Determine ∫_a^b ( ∫_c^d f(x, y) dy ) dx.

Question 2.3
Evaluate the integral

∫_{y=1}^{2} ∫_{x=0}^{3} (1 + 8xy) dx dy.

2.2 Elementary regions of integration


The question arises: what happens with non-rectangular domains? The ideas described above can be extended to less restricted regions of integration. For example, consider integrating the function f(x, y) = y²x² over the region bounded below by y = −x, above by y = x, and to the right by the line x = 1. Graphically we have

Figure 2.4: Example type 1 region.

In general, we can classify certain regions of integration by type. Let φ₁ and φ₂ mapping [a, b] to R be continuous functions satisfying φ₂(t) ≤ φ₁(t), t ∈ [a, b]. Let R = {(x, y) : x ∈ [a, b], y ∈ [φ₂(x), φ₁(x)]}; then R is a region of type 1.

Let φ₃ and φ₄ mapping [c, d] to R be continuous functions satisfying φ₄(t) ≤ φ₃(t), t ∈ [c, d]. Let S = {(x, y) : y ∈ [c, d], x ∈ [φ₄(y), φ₃(y)]}; then S is a region of type 2.

A region of type 3 is one that is both type 1 and type 2.

These are our elementary regions, and their boundaries are just the sorts of sets which allow integrability; what’s more:

• if R is type 1, then ∫_R f(x, y) dA = ∫_a^b ∫_{φ₂(x)}^{φ₁(x)} f(x, y) dy dx;
• if R is type 2, then ∫_R f(x, y) dA = ∫_c^d ∫_{φ₄(y)}^{φ₃(y)} f(x, y) dx dy;
• if R is type 3, then ∫_R f(x, y) dA = ∫_a^b ∫_{φ₂(x)}^{φ₁(x)} f(x, y) dy dx = ∫_c^d ∫_{φ₄(y)}^{φ₃(y)} f(x, y) dx dy, and we may use whichever version we wish.

Higher dimensions We can generalise the regions over which we can integrate to 3-d and higher dimensions with similar, but much more technical, definitions.

It suffices to say that we may be able to ‘spear’ the region in a co-ordinate direction and move into the region,
pass through, and exit the region in the same simple fashion with nicely defined boundary curves, allowing
us to produce the iterated integral(s) of the type,

∫_W f(x, y, z) dV = ∫_a^b ∫_{φ₂(x)}^{φ₁(x)} ∫_{γ₂(x,y)}^{γ₁(x,y)} f(x, y, z) dz dy dx.

We will restrict our attention to very simple regions in 3-d or higher. Of course, even when we have an
elementary region and a free choice of the order in which we carry out the iterated integration, the function
to be integrated might be too complicated to work with!

If we consider f : C ⊆ R³ → R, where C is a cuboid, we can make analogous definitions for ∫_C f(x, y, z) dx dy dz, and Fubini’s theorem tells us that we may, if f is ‘nice’, evaluate this integral via one of the (six distinct sequences of three) iterated integrals for f over C. And similarly for higher dimensions.

2.2.1 Finding the limits of integration 1: 2-d Cartesian


Suppose we wish to evaluate

I = ∫_{x=0}^{3} ∫_{y=√(x/3)}^{1} e^{y³} dy dx.

At present, we cannot evaluate the integral; the integral of e^{y³} with respect to y cannot be expressed in terms of elementary functions. If, however, we were to integrate e^{y³} with respect to x (so e^{y³} is a constant) then the integration is straightforward.

Suppose we wish to reverse the order of integration; that is, given ∬_R f dx dy, we want to perform ∬_R f dy dx. We illustrate the process for

∫∫_R f dA = ∫_{y=0}^{1} ∫_{x=1−y}^{√(1−y²)} f dx dy.

1. Sketch and label the region 𝑅 and its bounding curves.



2. Introduce a cursor pointing in the direction of the first desired integration, and mark where it enters
and leaves 𝑅.

3. Move the cursor back and forth through R parallel to the second desired integration direction, and mark the values where it enters and leaves R.

4. Write the integral with the desired order of integration:



Note that

∬_R dx dy = Area(R),    ∭_R dx dy dz = Volume(R).

Example 2.4
Evaluate I = ∬_R (x² + y²) dy dx where R is the region {(x, y) ∈ R² : 0 ≤ x ≤ 1, x² ≤ y ≤ x}. Verify the answer by reversing the order of integration.
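A symbolic cross-check of this example is possible; the sketch below (assuming sympy is available) computes the integral over the type 1 description and over the reversed, type 2 description, and compares them.

```python
import sympy as sp

x, y = sp.symbols("x y", nonnegative=True)
f = x**2 + y**2

# Type 1 description: 0 <= x <= 1, x**2 <= y <= x (integrate y first).
I1 = sp.integrate(f, (y, x**2, x), (x, 0, 1))

# Type 2 description of the same region: 0 <= y <= 1, y <= x <= sqrt(y).
I2 = sp.integrate(f, (x, y, sp.sqrt(y)), (y, 0, 1))

print(I1, I2, sp.simplify(I1 - I2) == 0)   # same value either way
```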

Returning to the motivating example.

Example 2.5
Evaluate
I = ∫_{x=0}^{3} ∫_{y=√(x/3)}^{1} e^{y³} dy dx.

Question 2.6

1. Sketch the regions of integration:

a) ∫_{x=0}^{3} ∫_{y=0}^{2} f(x, y) dy dx,
b) ∫_{x=0}^{3} ∫_{y=−2}^{0} f(x, y) dy dx,
c) ∫_{y=1}^{ln 8} ∫_{x=0}^{ln y} f(x, y) dx dy,
d) ∫_{y=1}^{2} ∫_{x=y}^{y²} f(x, y) dx dy.

2. For the integrals below, write the integral equivalent with the order of integration reversed:

a) ∫_{y=0}^{1} ∫_{x=y}^{√y} dx dy,
b) ∫_{y=0}^{1} ∫_{x=−√(1−y²)}^{√(1−y²)} dx dy,
c) ∫_{y=0}^{ln 2} ∫_{x=e^y}^{2} dx dy.

3. Evaluate:

a) ∫_{x=0}^{2} ∫_{y=x}^{2} 2y² sin(xy) dy dx,
b) ∫_{y=0}^{√((ln 3)/2)} ∫_{x=y√2}^{√(ln 3)} e^{x²} dx dy.

Question 2.7
Determine the volume of the solid bounded above by the plane z = 4 − x − y and below by the rectangle R = {(x, y) ∈ R² : 0 ≤ x ≤ 1, 0 ≤ y ≤ 2}.

2.3 Change of variables

If a change in the order of integration is unhelpful or impossible, perhaps a change of variables can help. We motivate the approach using a 2-d integral; that is,

∫_R f(x, y) dA = ∬_R f(x, y) dx dy.

Suppose (x, y) = T(u, v) = (g(u, v), h(u, v)), where T is C¹ and bijective. Figure 2.5 provides a schematic representation.

Figure 2.5: Representation of 2d transformation.



So combining everything together gives

∫_R f(x, y) dx dy = ∫_G f*(u, v) ∣∂(x, y)/∂(u, v)∣ du dv,

where f*(u, v) = f(g(u, v), h(u, v)), G = T⁻¹(R) and

∣∂(x, y)/∂(u, v)∣ = ∣det J∣,  where  J = ⎛ ∂g/∂u  ∂g/∂v ⎞
                                         ⎝ ∂h/∂u  ∂h/∂v ⎠ .

Do not forget the modulus of the determinant of the matrix.

We can produce a corresponding (more complicated) formula for higher dimensions. In particular, for 3-d suppose we have a transformation T : (u, v, w) ↦ (x, y, z) that is C¹ and one-to-one (except possibly on a ‘nice’ set); then:

∫_W f(x, y, z) dx dy dz = ∫_{W*} f*(u, v, w) ∣∂(x, y, z)/∂(u, v, w)∣ du dv dw.

Here:

1. 𝑓 ∗ (𝑢, 𝑣, 𝑤) = 𝑓(𝑇 (𝑢, 𝑣, 𝑤)) is the function 𝑓 rewritten in the coordinates 𝑢, 𝑣, 𝑤, and hopefully
has a simpler form than 𝑓 ;
2. 𝑊 ∗ = 𝑇 −1 (𝑊 ) is the region in (𝑢, 𝑣, 𝑤)-space corresponding to 𝑊 in (𝑥, 𝑦, 𝑧)-space, see Figure
2.6.

Figure 2.6: Transformation 𝑇 mapping between the set 𝑊 ∗ and 𝑊 .



3. ∣∂(x, y, z)/∂(u, v, w)∣ denotes the modulus of the determinant of the Jacobian matrix, i.e. |det J(u, v, w)|, where

J(u, v, w) = ⎛ ∂x/∂u  ∂x/∂v  ∂x/∂w ⎞
             ⎜ ∂y/∂u  ∂y/∂v  ∂y/∂w ⎟
             ⎝ ∂z/∂u  ∂z/∂v  ∂z/∂w ⎠ .
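Jacobian determinants of explicit transformations can be computed symbolically. The sketch below (assuming sympy is available) does this for the polar and spherical changes of variables that appear in the next section.

```python
import sympy as sp

r, theta = sp.symbols("r theta", positive=True)

# Polar change of variables (x, y) = (r cos θ, r sin θ).
x = r * sp.cos(theta)
y = r * sp.sin(theta)
J = sp.Matrix([x, y]).jacobian([r, theta])
print(sp.simplify(J.det()))        # r

# Spherical change of variables from Section 2.4.
rho, phi = sp.symbols("rho phi", positive=True)
xs = rho * sp.sin(phi) * sp.cos(theta)
ys = rho * sp.sin(phi) * sp.sin(theta)
zs = rho * sp.cos(phi)
Js = sp.Matrix([xs, ys, zs]).jacobian([rho, phi, theta])
print(sp.simplify(Js.det()))       # rho**2 * sin(phi)
```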

Example 2.8
Find

I = ∬_R [ √(y/x) + √(xy) ] dx dy,

where R is illustrated below and given by the shaded area between y = 4x, y = x, xy = 1 and xy = 9.

The purpose of the next example is to illustrate how to approach change of variables when the region is not
bounded by contour curves.

Example 2.9
Evaluate
∫_{x=0}^{1} ∫_{y=0}^{x} dy dx

using the transformation 𝑢 = 𝑥 + 𝑦 and 𝑣 = 𝑥 − 𝑦.



Useful result

The following result is often useful. It is an application of the chain rule to multivariable functions. We
will see the chain rule in Chapter 4, but for the moment we can apply the result without verification.

( ∂(x, y)/∂(u, v) ) · ( ∂(u, v)/∂(x, y) ) = 1.

This result generalises.

Question 2.10
Using a suitable change of variables, evaluate ∬_R (x² + y²) dx dy, where R is the square region in R² bounded by the lines connecting (0, 0), (1, 1), (2, 0) and (1, −1), illustrated below.

Question 2.11

1. By using the change of variables u = x + y, v = y − 2x, or otherwise, evaluate

∫_{x=0}^{1} ∫_{y=0}^{1−x} [ x + y(y − 2x)² ] dy dx.

2. Evaluate

I = ∫_{z=0}^{3} ∫_{y=0}^{4} ∫_{x=y/2}^{(y/2)+1} ( (2x − y)/2 + z/3 ) dx dy dz

using the transformation x = u + v, y = 2v, z = 3w.



2.4 Common coordinate changes

A number of situations utilise the same or related coordinate changes.

2-d polar coordinates: The 2d polar coordinate system utilises a distance 𝑟 from the origin and an angle
𝜃 taken anti-clockwise from the 𝑥 axis. A point 𝑃 whose Cartesian coordinate is (𝑥, 𝑦) is represented by
the ordered pair (𝑟, 𝜃), where 𝑥 = 𝑟 cos 𝜃, 𝑦 = 𝑟 sin 𝜃. A computation, which you should do, shows

∣∂(x, y)/∂(r, θ)∣ = r.

Figure 2.7: Representation of a point in 2d polar coordinates

Example 2.12
By using 2-d polar coordinates evaluate

I = ∬_R e^{−(x²+y²)} dx dy,

where R denotes the region x ≥ 0, y ≥ 0 in the xy-plane.
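A numerical sanity check of this example, as a minimal sketch assuming scipy is available (the quarter-plane is truncated at a radius beyond which the integrand is negligible; π/4 is the classical value the polar computation produces):

```python
from scipy.integrate import dblquad
import numpy as np

f = lambda y, x: np.exp(-(x**2 + y**2))

# Truncate the quarter-plane at 10; exp(-100) is negligible.
I, _ = dblquad(f, 0, 10, lambda x: 0, lambda x: 10)

print(I, np.pi / 4)
```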



3-d cylindrical coordinates: The cylindrical coordinate system is a combination of the polar coordinate system in the xy-plane with an additional z-coordinate vertically. A point P whose Cartesian coordinate is (x, y, z) is represented by the ordered triple (r, θ, z), where (r, θ) is the polar coordinate of (x, y) and z is the (signed) height of P above the xy-plane.

3-d spherical coordinates: The 3-d spherical coordinate system is a coordinate system based on spherical geometry. A point P whose Cartesian coordinate is (x, y, z) is represented by the ordered triple (ρ, θ, φ), where x = ρ sin φ cos θ, y = ρ sin φ sin θ, z = ρ cos φ.

Note, there are different conventions regarding spherical polar coordinates. For example, some other conventions interchange θ and φ; you must remember this if you start looking on the web or in textbooks.

2.4.1 Finding the limits of integration 2: 2-d polar coordinates

Consider a region R in 2-d polar coordinates. We illustrate finding the limits of integration for ∬_R dr dθ where R is the region outside the circle r = 1, inside the cardioid r = 1 + cos θ, and with x ≥ 0.

Example 2.13
Determine

∬_R y dA

where R is the part of the first quadrant inside x² + y² = 4 and outside x² + y² = 1.

Example 2.14
Determine the volume of the region in R³ that is bounded above by the hemisphere z = √(a² − x² − y²) and below by the cone z = √(x² + y²).

Question 2.15

1. Evaluate ∭_D exp((x² + y² + z²)^{3/2}) dV, where D is the unit ball in R³.

2. Evaluate ∬_D log(x² + y²) dx dy, where D is the region below.

2.5 Applications of multiple integrals to statistics

The joint distribution of a pair of random variables X and Y is said to have probability density f : R² → [0, ∞) if

ℙ((X, Y) ∈ A) = ∬_A f(x, y) dx dy,

where ℙ((X, Y) ∈ A) is the probability that (X, Y) belongs to A.

The expected value of g(X, Y) is given by

𝔼(g(X, Y)) = ∬_{R²} g(x, y) f(x, y) dx dy

(provided this integral is well-defined and finite).
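Expectations of this kind can be approximated by Monte Carlo simulation. A minimal sketch (assuming numpy is available; the uniform density on the unit square and g(x, y) = xy are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative density: (X, Y) uniform on the unit square, so f(x, y) = 1 there.
n = 200_000
xs = rng.uniform(0, 1, n)
ys = rng.uniform(0, 1, n)

g = lambda x, y: x * y
print(g(xs, ys).mean())   # ≈ E(XY) = ∬ xy · 1 dx dy = 1/4
```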

Example 2.16
Suppose a random point is chosen uniformly from a subset 𝐷 ⊆ R2 of finite area. Determine the
probability density of (𝑋, 𝑌 ).

Question 2.17
If (X, Y) is chosen uniformly from D = {(x, y) : x² + y² ≤ 1}, determine 𝔼(XY). Determine 𝔼(√(X² + Y²)).

2.5.1 Geometric Probability


This is a practical experiment. Drop a needle of length l at random on a striped floor, with stripes a distance L apart. Suppose l < L. What is the probability that the needle does not lie across any line?

Figure 2.8: Striped floor, stripes a distance L apart, with needles of length l

A needle is described by its midpoint and the angle with the horizontal.

Figure 2.9: Striped floor, stripes a distance L apart, with needles of length l
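The drop can also be simulated. The sketch below (assuming numpy is available; the values of L and l are illustrative) parametrises each needle by its midpoint's distance to the nearest line and its angle, and compares the estimate with the classical Buffon's-needle value 1 − 2l/(πL).

```python
import numpy as np

rng = np.random.default_rng(2)
L, l = 1.0, 0.6            # stripe spacing L and needle length l < L
n = 1_000_000

# Midpoint distance to the nearest line is uniform on [0, L/2];
# the needle's angle to the lines is uniform on [0, π/2] by symmetry.
d = rng.uniform(0, L / 2, n)
theta = rng.uniform(0, np.pi / 2, n)

crosses = d <= (l / 2) * np.sin(theta)
print(1 - crosses.mean())           # estimated P(no crossing)
print(1 - 2 * l / (np.pi * L))      # classical value
```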



2.6 Chapter 2 Consolidation Questions

Question 2.18

1. Evaluate

∬_R ( (x − y)/(x + y + 2) )² dx dy

where R is the square region in the xy-plane with vertices at (1, 0), (0, 1), (−1, 0) and (0, −1).

2. Using the transformation u = x² − y² and v = y/x, evaluate ∬_D (y/x) dx dy, where D is the region between x² − y² = 1, x² − y² = 4 and y = x/2 in the positive quadrant, illustrated below.

Question 2.19
A random variable X is said to have a standard (one-dimensional) Normal distribution if it has density

f₁(x) = (1/√(2π)) e^{−x²/2}.

A pair (X, Y) is said to have a standard (two-dimensional) Normal distribution if each member of the pair has a standard one-dimensional Normal distribution and they are independent; their joint density is given by

f₂(x, y) = (1/√(2π)) e^{−x²/2} × (1/√(2π)) e^{−y²/2}.

1. Check that ∬_{R²} f₂(x, y) dx dy = 1 by:

a. using 2-d polar coordinates;

b. writing the integral as 4 ∫₀^∞ ∫₀^∞ f₂(x, y) dx dy, and then using the change of variables (u, v) ↦ (x, y) = (u, uv);

c. expressing the integral as a 3-d integral (i.e. the volume between z = 0 and the surface z = (1/(2π)) e^{−(x²+y²)/2}), and then evaluating with 3-d cylindrical coordinates.

Applying the result ∬_{R²} f₂(x, y) dx dy = 1 and Fubini’s theorem, check further that ∫_R f₁(x) dx = 1.

2. Suppose (X, Y) has a standard two-dimensional Normal distribution. Using the substitution u = (x + y)/√2, v = (x − y)/√2, show that

ℙ( (X + Y)/√2 ∈ [a, b] ) = ∫_a^b (1/√(2π)) e^{−x²/2} dx.

(Note that this implies that (X + Y)/√2 has a standard one-dimensional Normal distribution.)

Question 2.20
[This question is for interest.] Recall

1. ∫_{x=0}^{2} ( ∫_{y=0}^{1} xy(x² − y²)/(x² + y²)³ dy ) dx = 1/5,

2. ∫_{y=0}^{1} ( ∫_{x=0}^{2} xy(x² − y²)/(x² + y²)³ dx ) dy = −1/20.

To evaluate (1) consider the first integral

∫_{y=0}^{1} xy(x² − y²)/(x² + y²)³ dy

and make the substitution u = x² + y². Using the final result, evaluate the integral in (1).

To evaluate the integral in (2) consider the first integral

∫_{x=0}^{2} xy(x² − y²)/(x² + y²)³ dx

and once again make the substitution u = x² + y². Using the final result, evaluate the integral in (2).

2.7 Chapter 2 summary

Chapter outcomes review

Having reviewed the material and completed the assessment (formative and summative) material,
you should be able to:

• apply, where appropriate, a special case of Fubini’s theorem;


• solve problems that require reversing the order of integration;
• select and apply change of variables to suitable integrals.
• solve using appropriate integration techniques a range of problems, including some involving
statistics.

This chapter covers six topics.

Fubini’s theorem. Consider a continuous function on domain 𝑅 = [𝑎, 𝑏] × [𝑐, 𝑑] ⊆ R2 ; that is, 𝑅 is a
rectangle. Then, a special case of the general Fubini’s theorem states:

∫_a^b ( ∫_c^d f(x, y) dy ) dx = ∫_R f(x, y) dA = ∫_c^d ( ∫_a^b f(x, y) dx ) dy.

Cavalieri’s principle states that each of the above integrals will find the volume under the graph of 𝑓 over
the rectangle 𝑅.

Elementary regions of integration. The previous case generalises to less restricted regions.

• Let φ₁, φ₂ : [a, b] → R be continuous functions. Suppose φ₂(t) ≤ φ₁(t) for t ∈ [a, b]. Let

R = {(x, y) : x ∈ [a, b], y ∈ [φ₂(x), φ₁(x)]};

then R is a region of type 1.

• Let φ₃, φ₄ : [c, d] → R be continuous functions satisfying φ₄(t) ≤ φ₃(t) for t ∈ [c, d]. Let

S = {(x, y) : y ∈ [c, d], x ∈ [φ₄(y), φ₃(y)]};

then S is a region of type 2.

• A region of type 3 is a region that is both of type 1 and of type 2.

Regions of type 1, 2, 3 are elementary regions; their boundaries permit integrability, and moreover:

• if R is type 1, then ∫_R f(x, y) dA = ∫_a^b ∫_{φ₂(x)}^{φ₁(x)} f(x, y) dy dx;
• if R is type 2, then ∫_R f(x, y) dA = ∫_c^d ∫_{φ₄(y)}^{φ₃(y)} f(x, y) dx dy;
• if R is type 3, then ∫_R f(x, y) dA = ∫_a^b ∫_{φ₂(x)}^{φ₁(x)} f(x, y) dy dx = ∫_c^d ∫_{φ₄(y)}^{φ₃(y)} f(x, y) dx dy.

Generalisations to suitable regions in three and higher dimensions exist, but consideration is given only to simple regions in Rⁿ, n ≥ 3.

Finding the limits of integration 1: 2-d Cartesian. Consider the integral ∬_R f dx dy; what is the process to determine ∬_R f dy dx, that is, to reverse the order of integration? Note, this process is important since such transformations can greatly simplify the computations. The approach is as follows:

1. Sketch and label the region 𝑅 and its bounding curves.


2. Introduce a cursor pointing in the direction of the first desired integration, and mark where it enters
and leaves 𝑅.
3. Move the cursor back and forth through 𝑅 parallel to the second desired integration direction, and
mark the values where it enters and leaves 𝑅.
4. Write the integral with the desired order of integration.

Change of variables. Suppose we have a one-to-one C¹ transformation T : R³ → R³ defined by T(u, v, w) = (x, y, z). (Note, we can replace one-to-one with one-to-one on a ‘nice’ set.) Then

∫_W f(x, y, z) dx dy dz = ∫_{W*} f*(u, v, w) ∣∂(x, y, z)/∂(u, v, w)∣ du dv dw.

Here

• 𝑓 ∗ (𝑢, 𝑣, 𝑤) = 𝑓(𝑇 (𝑢, 𝑣, 𝑤)) is the function 𝑓 rewritten in the coordinates 𝑢, 𝑣, 𝑤. The aim is for
𝑓 ∗ to be simpler than 𝑓 .
• 𝑊 ∗ = 𝑇 −1 (𝑊 ) is the region in the (𝑢, 𝑣, 𝑤)-space corresponding to 𝑊 in (𝑥, 𝑦, 𝑧)-space.
• ∣∂(x, y, z)/∂(u, v, w)∣ denotes the modulus of the determinant of the Jacobian matrix; that is, |det J(u, v, w)|, where

J(u, v, w) = ⎛ ∂x/∂u  ∂x/∂v  ∂x/∂w ⎞
             ⎜ ∂y/∂u  ∂y/∂v  ∂y/∂w ⎟
             ⎝ ∂z/∂u  ∂z/∂v  ∂z/∂w ⎠ .

Common coordinate changes In many cases utilisation of alternative coordinate systems results in simplified integrals. Common coordinate changes are:

• 2-d polar coordinates, where x = r cos θ and y = r sin θ. In this case,

∣∂(x, y)/∂(r, θ)∣ = r.

• 3-d cylindrical coordinates, where x = r cos θ, y = r sin θ and z = z. In this case,

∣∂(x, y, z)/∂(r, θ, z)∣ = r.

• 3-d spherical coordinates, where x = ρ sin φ cos θ, y = ρ sin φ sin θ and z = ρ cos φ. In this case,

∣∂(x, y, z)/∂(ρ, φ, θ)∣ = ρ² sin φ.

Applications of multiple integrals to statistics. The joint distribution of a pair of random variables X and Y is said to have probability density f : R² → [0, ∞) if

ℙ((X, Y) ∈ A) = ∬_A f(x, y) dx dy,

where ℙ((X, Y) ∈ A) is the probability that (X, Y) belongs to A. The expected value of g(X, Y) is given by

𝔼(g(X, Y)) = ∬_{R²} g(x, y) f(x, y) dx dy,

provided the integral is well-defined and finite.

ST218: Mathematical Statistics provides further coverage of these applications.


Chapter 3

Symmetric matrices, quadratic forms, positive definiteness

Chapter outcomes

At the end of the chapter you should be familiar with and be able to:

• apply key theoretical results about symmetric matrices to solve problems;


• select and apply appropriate theoretical results regarding quadratic forms;
• solve problems involving covariance matrices;
• prove results involving symmetric matrices and associated concepts.

Assumed background knowledge

Before starting this chapter it is assumed that you know the following material. (References are made to the
2019-20 printed notes.)

• MAT106: Linear Algebra. Chapter 12 and 13

– Inverse matrices
– Determinants of matrices
– Similar matrices
– Eigenvalues and eigenvectors
– Symmetric matrices
– Orthogonal matrices

• MA137: Mathematical Analysis. None.


• ST116: Mathematical Techniques. The mathematical thinking, language and writing developed is
fundamental to everything presented in this module. This includes:

– Language of sets
– Function notation, including image and pre-image.


• General mathematical knowledge.

Chapter Statistician: Catherine Ann Sugar

Catherine Ann Sugar is an American biostatistician at the University of California, Los Angeles; her research focuses on clustering, functional data analysis, classification and patterns of covariation in data. Her work spans a wide range of applied areas, including mental health, dentistry, nephrology, and particularly health services research, where she has developed a set of cluster-analytic methods for defining and analysing health state models. She has also worked with the NHS on projects in cancer and multiple sclerosis.

Figure 3.1: Catherine Ann Sugar.

Image available via: https://ph.ucla.edu/faculty/sugar Image licence: Public



3.1 Some linear algebra


In this chapter, we collect together some results from linear algebra regarding symmetric matrices, quadratic
forms and positive definiteness.

We begin by recalling some definitions and results from Year 1 Linear Algebra. Recall that a set of vectors
in R𝑛 is orthonormal if for any a = (𝑎1 , … , 𝑎𝑛 ) and b = (𝑏1 , … , 𝑏𝑛 ) in the set, we have:

• a ⋅ b = a₁b₁ + ⋯ + aₙbₙ = 0, whenever a ≠ b,
• a ⋅ a = a₁² + ⋯ + aₙ² = 1; that is, the vectors are unit vectors.

Definition 3.1. Consider an 𝑛 × 𝑛 matrix, A. The matrix A is:

• symmetric if A = AT ;
• anti-symmetric/skew-symmetric if A = − AT ;
• orthogonal if AT A = I. Equivalently, the columns (or rows) of A form an orthonormal set of vectors
in R𝑛 .
Some useful properties of real 𝑛 × 𝑛 symmetric matrices are as follows:

1. A has a ‘full set’ of 𝑛, possibly repeated, real eigenvalues;


2. the eigenvectors corresponding to distinct eigenvalues are orthogonal;
3. there is an orthonormal basis of eigenvectors;
4. if P is a matrix constructed with such an orthonormal basis as its columns, then P−1 AP = D, where
D is a diagonal matrix whose diagonal entries are the eigenvalues of A (in order corresponding to
the order of the columns of P). Actually P−1 = PT (since the inverse of any orthogonal matrix is its
transpose), and so PT AP = D.

Theorem 3.1. Any 𝑛 × 𝑛 matrix A can always be written

A = S + C, (3.1)

where S is symmetric and C is antisymmetric. Moreover, this decomposition is unique.
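The proof is a gap we complete in class; as a numerical illustration (a minimal sketch assuming numpy is available), the decomposition uses the symmetric part S = ½(A + Aᵀ), which reappears in Theorem 3.2 below, and the antisymmetric part C = ½(A − Aᵀ).

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

S = (A + A.T) / 2        # symmetric part
C = (A - A.T) / 2        # antisymmetric part

print(np.allclose(A, S + C))    # True: A = S + C
print(np.allclose(S, S.T))      # True: S is symmetric
print(np.allclose(C, -C.T))     # True: C is antisymmetric
```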



Example 3.1
Let

A = ⎛  19 20 −16 ⎞
    ⎜  20 13   4 ⎟
    ⎝ −16  4  31 ⎠ .

Express A in the form RDRᵀ, where R is an orthogonal matrix and D is a diagonal matrix.
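Decompositions of this kind can be checked numerically; the sketch below (assuming numpy is available) uses numpy.linalg.eigh, which returns the eigenvalues of a symmetric matrix together with an orthonormal basis of eigenvectors, so you can verify a hand computation.

```python
import numpy as np

A = np.array([[ 19., 20., -16.],
              [ 20., 13.,   4.],
              [-16.,  4.,  31.]])

# eigh returns eigenvalues (ascending) and orthonormal eigenvectors as columns.
eigvals, R = np.linalg.eigh(A)
D = np.diag(eigvals)

print(np.allclose(R @ D @ R.T, A))          # A = R D Rᵀ
print(np.allclose(R.T @ R, np.eye(3)))      # R is orthogonal
```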

Question 3.2
Let

S = ⎛  3 −2 ⎞
    ⎝ −2  3 ⎠ .

Write S in the form S = RDRᵀ, where R is an orthogonal matrix and D is a diagonal matrix.

3.2 Quadratic forms and positive definiteness


Let A be an 𝑛 × 𝑛 matrix. The quadratic form generated by A is the map:

q : Rⁿ → R,  x ↦ xᵀAx.

Example 3.3
Find the quadratic form generated by

A = ⎛  3  12   4 ⎞
    ⎜ 12  −4  −8 ⎟
    ⎝  4  −8   7 ⎠ .

Theorem 3.2. The quadratic form generated by A is the same as that generated by its symmetric part S = ½(A + Aᵀ).
It is for this reason that we normally use symmetric matrices when discussing quadratic forms.
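A one-line numerical check of Theorem 3.2 (a sketch assuming numpy is available; the random matrix and vector are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
S = (A + A.T) / 2                 # symmetric part of A

x = rng.standard_normal(3)
print(np.isclose(x @ A @ x, x @ S @ x))   # True: A and S generate the same form
```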

Question 3.4

1. Show that for any x, we have that

xᵀ ⎛  0 1 −3 ⎞
   ⎜ −1 0 −2 ⎟ x = 0.
   ⎝  3 2  0 ⎠

2. Find the symmetric matrices which generate:

a) q(x, y) = 4x² + 5xy − 7y²;
b) q(x, y, z) = 4xy + 5y².

Definition 3.2. A quadratic form (or its associated symmetric matrix) is said to be positive definite (p.d.)
if and only if xT Ax > 0 for every x ∈ R𝑛 \{0}.
A quadratic form (or its associated symmetric matrix) is said to be non-negative definite (n.n.d.), or
positive semi-definite, if and only if xT Ax ≥ 0 for every x ∈ R𝑛 \{0}.
We would like to be able to tell if a symmetric matrix is positive definite or not. The following gives one useful
check.
Theorem 3.3. Let S be an 𝑛 × 𝑛 symmetric matrix. Then:

1. S is p.d. if and only if every eigenvalue of S is strictly positive (> 0);


2. S is n.n.d. if and only if every eigenvalue of S is non-negative (≥ 0).

Theorem 3.4. Let S be an 𝑛 × 𝑛 symmetric matrix. If S is p.d., then S is invertible.



Whilst the above results can be useful, finding the eigenvalues of even a 3 × 3 matrix can be time-consuming (although for certain matrices, like diagonal matrices, there are short cuts). Is there a simpler way to decide whether a matrix is positive definite?
Definition 3.3. A minor is the determinant of a submatrix.
A principal minor is the determinant of a submatrix obtained by deleting corresponding pairs of rows
and columns from the original matrix (so the diagonal entries of the submatrix were diagonal entries in
the original matrix).
The leading principal minors are the determinants of the submatrices of A (sometimes called the prin-
cipal submatrices) obtained by deleting the last 𝑘 rows and columns.
Example leading principal minors are:

M₁ = ∣ a₁₁ ∣ ,   M₂ = ∣ a₁₁ a₁₂ ∣ ,   M₃ = ∣ a₁₁ a₁₂ a₁₃ ∣ , …
                      ∣ a₂₁ a₂₂ ∣         ∣ a₂₁ a₂₂ a₂₃ ∣
                                          ∣ a₃₁ a₃₂ a₃₃ ∣

with the pattern continuing down the leading diagonal. So, a 4 × 4 matrix has 4 leading principal minors, and 1 principal minor of order 4 (the matrix itself), 4 principal minors of order 3, 6 principal minors of order 2, and 4 principal minors of order 1, for a total of 15 principal minors; that is, a matrix of order n has n choose p principal minors of order p.
Theorem 3.5. Let A be an 𝑛 × 𝑛 symmetric matrix. Then A is p.d. if and only if all leading principal
minors of A are strictly positive (> 0).
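Theorem 3.5 (often called Sylvester's criterion) translates directly into a computational test; a minimal sketch (assuming numpy is available) checks the leading principal minors and compares with the eigenvalue test of Theorem 3.3, using the matrix of Example 3.5(2) below.

```python
import numpy as np

def is_pd_by_minors(S):
    """S (symmetric) is p.d. iff every leading principal minor is > 0."""
    n = S.shape[0]
    return all(np.linalg.det(S[:k, :k]) > 0 for k in range(1, n + 1))

A = np.array([[3., 1., 1.],
              [1., 2., 0.],
              [1., 0., 2.]])     # the matrix of Example 3.5(2)

print(is_pd_by_minors(A))                    # True
print(np.all(np.linalg.eigvalsh(A) > 0))     # agrees with the eigenvalue test
```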

The previous proof does not adapt to the n.n.d. case. For example, consider the matrix

A = ⎛ 1 0 1 ⎞
    ⎜ 0 0 0 ⎟
    ⎝ 1 0 0 ⎠ .

This has leading principal minors 1, 0, 0 ≥ 0. However, if x = (1, 1, −2), then xᵀAx = −3. Alternatively, consider

⎛ 0  0 ⎞
⎝ 0 −1 ⎠ .

The eigenvalues are clearly 0 and −1, so this matrix is neither p.d. nor n.n.d.; nevertheless both leading principal minors are 0.
Theorem 3.6. Let A be an 𝑛 × 𝑛 symmetric matrix. Then A is n.n.d. if and only if all principal minors of
A are non-negative (≥ 0).
It is vital that you recognise this proposition requires all principal minors of A and not just the leading principal
minors. A symmetric 2×2 matrix is p.d. if and only if its first diagonal entry and its determinant strictly exceed
0.

Example 3.5

1. Show that q(x, y, z) = 12x² + 12xy + 8xz + 9y² − 4yz + 4z² is n.n.d.

2. Show that

A = ⎛ 3 1 1 ⎞
    ⎜ 1 2 0 ⎟
    ⎝ 1 0 2 ⎠

is p.d.

Question 3.6
Let

A = ⎛ 1 0 1 ⎞
    ⎜ 0 1 2 ⎟
    ⎝ 1 2 3 ⎠ .

Is A p.d.?

3.3 Covariance matrices

Suppose that Y is a random variable with mean 𝔼(Y) = μ_Y, and Z is a random variable with mean 𝔼(Z) = μ_Z. Recall that the covariance between Y and Z is

cov(Y, Z) = 𝔼[(Y − μ_Y)(Z − μ_Z)] = 𝔼(YZ) − μ_Y μ_Z.

This can be generalised to a collection Z = (Z₁, Z₂, …, Zₙ) of random variables (i.e. a random vector) by defining the n × n variance-covariance matrix (or just covariance matrix), Σ = (Σ_ij), as:

Σ_ij = cov(Z_i, Z_j).

Since cov(𝑍𝑖 , 𝑍𝑗 ) = cov(𝑍𝑗 , 𝑍𝑖 ), Σ is a symmetric matrix, and in this section we will also assume each
Σ𝑖𝑗 is finite. Furthermore:
Theorem 3.7. Let Σ be a covariance matrix. Then Σ is n.n.d.

A covariance matrix may or may not be p.d. If Σ is n.n.d. but not p.d., then there exists a nonzero vector x
such that

var( x₁Z₁ + ⋯ + xₙZₙ ) = 0.

In other words, one of the 𝑍𝑖 is degenerate in the sense that either it has zero variance or it is an affine
combination of some or all of the other 𝑍𝑗 .

Question: Given a symmetric, n.n.d. matrix A, does there exist a collection (𝑍1 , … , 𝑍𝑛 ) of random
variables whose covariance matrix is A? The answer is affirmative, and we can even construct the
collection explicitly, as follows.
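We complete the construction in class; as a hedged numerical sketch of one standard route (assuming numpy is available), take a matrix square root B with BBᵀ = A via the spectral decomposition, and set Z = BZ′ for a vector Z′ of independent standard normals, so that cov(Z) = BBᵀ = A.

```python
import numpy as np

rng = np.random.default_rng(5)

A = np.array([[2., 1.],
              [1., 2.]])        # a symmetric n.n.d. target covariance

# Matrix square root via the spectral decomposition: B Bᵀ = A.
eigvals, P = np.linalg.eigh(A)
B = P @ np.diag(np.sqrt(eigvals)) @ P.T

n = 500_000
Zp = rng.standard_normal((2, n))   # independent standard normals
Z = B @ Zp

print(np.cov(Z))   # ≈ A
```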

Summarising the previous discussion and Theorem 3.7 gives the following result.

Theorem 3.8. A matrix Σ is a covariance matrix if, and only if, it is n.n.d.

Example 3.7
Suppose Z₁′ ∼ N(0, 1), Z₂′ ∼ N(0, 1), with Z₁′ and Z₂′ independent. Let

Σ = ⎛ 1 ρ ⎞
    ⎝ ρ 1 ⎠ ,

where −1 ≤ ρ ≤ 1. Determine Z₁ and Z₂ so the corresponding covariance matrix is Σ.



3.4 A glimpse of the future: Monte Carlo simulation of stock portfolio


(Note, this section is provided as an example statistical application; none of this material will be assessed.)

A particular feature of a positive definite matrix is the Cholesky decomposition. The Cholesky decomposition decomposes a positive definite matrix into the product of a lower triangular matrix and its transpose; that is, A = LLᵀ, for some lower triangular matrix L which can be constructed.

⎛ 25 15 −5 ⎞   ⎛  5 0 0 ⎞ ⎛ 5 3 −1 ⎞
⎜ 15 18  0 ⎟ = ⎜  3 3 0 ⎟ ⎜ 0 3  1 ⎟
⎝ −5  0 11 ⎠   ⎝ −1 1 3 ⎠ ⎝ 0 0  3 ⎠
This decomposition can be used to generate correlated random variables by multiplying the lower triangular matrix from the decomposed covariance matrix by standard normals. An example application is Monte Carlo simulation of stock prices, where the portfolio return is dependent on an array of underlying assets. Assume the daily returns are distributed according to a multivariate normal with mean vector μ and covariance matrix Σ. Use the Cholesky decomposition to find a lower triangular matrix L such that Σ = LLᵀ. The returns can then be generated by R_t = μ + L Z_t, where Z_t ∼ N(0, I).

Example of simulated portfolio return.
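A minimal end-to-end sketch of this simulation (assuming numpy is available; the mean vector, covariance matrix and portfolio weights are illustrative numbers, not real market data):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical daily-return model for three assets: R_t = mu + L Z_t.
mu = np.array([0.0005, 0.0003, 0.0004])
Sigma = np.array([[1.0e-4, 4.0e-5, 2.0e-5],
                  [4.0e-5, 9.0e-5, 3.0e-5],
                  [2.0e-5, 3.0e-5, 1.6e-4]])

L = np.linalg.cholesky(Sigma)        # lower triangular, Sigma = L Lᵀ

days, weights = 250, np.array([0.4, 0.3, 0.3])
Z = rng.standard_normal((3, days))   # Z_t ~ N(0, I) for each day
returns = mu[:, None] + L @ Z        # one simulated year of daily returns

portfolio = weights @ returns        # daily portfolio returns
print((1 + portfolio).prod() - 1)    # simulated annual portfolio return
```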



3.5 Chapter 3 Consolidation Questions

Question 3.8

1. Decide whether 𝑞 is p.d., n.n.d., or neither for:

a) 𝑞(𝑥, 𝑦) = 𝑥2 + 𝑦2 ,
b) 𝑞(𝑥, 𝑦) = (𝑥 − 𝑦)2 ,
c) 𝑞(𝑥, 𝑦) = 𝑥2 + 6𝑥𝑦 + 7𝑦 2 .

2. a) Show that 𝑞(𝑥, 𝑦, 𝑧) = 4𝑥2 + 10𝑦2 + 2𝑧 2 − 8𝑦𝑧 − 4𝑥𝑧 + 4𝑥𝑦 is n.n.d.


b) Deduce that
        ⎛  4   2  −2 ⎞
    A = ⎜  2  10  −4 ⎟
        ⎝ −2  −4   3 ⎠
has three strictly positive eigenvalues.

Question 3.9

1. Are the following valid covariance matrices?

        ⎛ 2 2 1 ⎞           ⎛  1 1 −1 ⎞
    A = ⎜ 2 3 0 ⎟,      B = ⎜  1 1  1 ⎟.
        ⎝ 1 0 4 ⎠           ⎝ −1 1  1 ⎠

2. Determine if there exists a collection (𝑍1 , 𝑍2 ) of random variables whose covariance matrix
is
    Σ = ⎛ 1 2 ⎞
        ⎝ 2 4 ⎠.
If such a collection exists, then determine such a collection.
3.6 Chapter 3 summary

Chapter outcomes review

Having reviewed the material and completed the assessment (formative and summative) material,
you should be able to:

• apply key symmetric matrices theoretical results to solve problems;


• select and apply appropriate theoretical results regarding quadratic forms;
• solve problems involving covariance matrices;
• prove results involving symmetric matrices and associated concepts.

This chapter contains three main topics.

Symmetric matrices. Consider an 𝑛 × 𝑛 matrix, A. The matrix A is:

• symmetric if A = Aᵀ;
• anti-symmetric or skew-symmetric if A = −Aᵀ;
• orthogonal if the columns (or rows) of A form an orthonormal set of vectors in R𝑛 .

Any 𝑛 × 𝑛 matrix A can always be written uniquely as A = S + C, where S is symmetric and C is
antisymmetric. Key properties of a symmetric matrix A include:

1. A has a ‘full set’ of 𝑛, possibly repeated, real eigenvalues;


2. the eigenvectors corresponding to distinct eigenvalues are orthogonal;
3. there is an orthonormal basis of eigenvectors;
4. if P is a matrix constructed with the orthonormal basis of eigenvectors, then P⁻¹AP = D, where D is a
diagonal matrix whose diagonal entries are the eigenvalues of A. The order of the eigenvalues corresponds
to the order of the columns of P.

Quadratic forms and positive definiteness. The quadratic form generated by A is the map 𝑞 ∶ R𝑛 → R
defined by x ↦ xᵀAx. The quadratic form generated by A is the same as that generated by its symmetric
part S = ½(A + Aᵀ).
A quadratic form (or its associated symmetric matrix) is said to be positive definite (p.d.) if and only if
xᵀAx > 0 for every x ∈ R𝑛 \{0}.
A quadratic form (or its associated symmetric matrix) is said to be non-negative definite (n.n.d.) if and only
if xᵀAx ≥ 0 for every x ∈ R𝑛 \{0}.
Let S be an 𝑛 × 𝑛 symmetric matrix, then

• S is p.d. if and only if every eigenvalue of S is strictly positive;


• S is n.n.d. if and only if every eigenvalue of S is non-negative;
• if S is p.d. then S is invertible.
A minor is the determinant of a submatrix. A principal minor is the determinant of a submatrix obtained
by deleting corresponding pairs of rows and columns from the original matrix. (So the diagonal entries of the
submatrix are diagonal entries in the original matrix.) The leading principal minors are the determinants
of the submatrices of A:

    A1 = ( 𝑎11 ),

    A2 = ⎛ 𝑎11 𝑎12 ⎞
         ⎝ 𝑎21 𝑎22 ⎠,

    A3 = ⎛ 𝑎11 𝑎12 𝑎13 ⎞
         ⎜ 𝑎21 𝑎22 𝑎23 ⎟
         ⎝ 𝑎31 𝑎32 𝑎33 ⎠,
with the pattern continuing down the leading diagonal. (Sometimes these are called principal submatrices.)

Let S be an 𝑛 × 𝑛 symmetric matrix. Further key results are

• The matrix S is p.d. if and only if all leading principal minors of S are strictly positive.
• The matrix S is n.n.d. if and only if all principal minors of S are non-negative.

Covariance matrices

Suppose that 𝑌 is a random variable with mean 𝔼(𝑌) = 𝜇𝑌, and 𝑍 is a random variable with mean
𝔼(𝑍) = 𝜇𝑍 . The covariance between 𝑌 and 𝑍 is

Cov(𝑌 , 𝑍) = 𝔼[(𝑌 − 𝜇𝑌 )(𝑍 − 𝜇𝑍 )] = 𝔼(𝑌 𝑍) − 𝜇𝑌 𝜇𝑍 .

This definition generalises to a collection Z = (𝑍1, 𝑍2, … , 𝑍𝑛) of random variables by defining the 𝑛 × 𝑛
covariance matrix (or variance-covariance matrix), Σ = (Σ𝑖𝑗), as Σ𝑖𝑗 = Cov(𝑍𝑖, 𝑍𝑗). The matrix Σ is
symmetric. Properties include

• If Σ is a covariance matrix, then Σ is n.n.d.


• Given a symmetric, n.n.d. matrix A, there exists a random vector Z whose covariance matrix is A.

A covariance matrix may or may not be p.d.

ST218: Mathematical Statistics provides further coverage of this topic.


Chapter 4

Differentiation in R𝑛

Chapter outcomes

At the end of the chapter you should be familiar with and be able to:

• Determine derivatives of functions defined R𝑛 → R𝑚 .


𝑛
• Find and classify critical points of functions R → R.
• Solve problems involving differentiation of functions R𝑛 → R𝑚 , including constrained optimi-
sation.
• Prove results involving differentiation of functions R𝑛 → R𝑚 .

Assumed background knowledge

Before starting this chapter it is assumed that you know the following material. (References are made to the
2019-20 printed notes.)

• MAT106: Linear Algebra. Chapters 11 and 13

– The determinant of a matrix


– The scalar product.

• MA137: Mathematical Analysis. Chapter 4

– Definition of the derivative


– Properties of derivatives

• ST116: Mathematical Techniques. The mathematical thinking, language and writing developed is
fundamental to everything presented in this module. This includes:

– Language of sets
– Function notation, including image and pre-image.

• General mathematical knowledge. That you can fluently apply pre-university differentiation techniques.

Chapter Statistician: Kimiko (Kim) Osada Bowman

Kimiko (Kim) Osada Bowman is a Japanese-American statistician. Her work focused on the distributional
properties of estimators based on non-normal data. Kim was a member of the scientific staff of Oak Ridge
National Laboratory for 50 years. She authored or coauthored 3 books and more than 200 articles during
her career. She was a fellow of the American Statistical Association and the American Association for the
Advancement of Science, and was an elected fellow of the International Statistical Institute and the Institute
of Mathematical Statistics.

Figure 4.1: Kimiko Osada Bowman.

Image available via: https://charterfuneral.info/wp-content/uploads/2019/01/Bowman-Kimiko-Osada.jpg


Image licence: Unknown.
4.1 Generalising the single variable case


Recall that a function 𝑓 ∶ R → R is differentiable at 𝑥0 if

    lim_{ℎ→0} (𝑓(𝑥0 + ℎ) − 𝑓(𝑥0)) / ℎ
exists, in which case we denote the limit by 𝑓 ′ (𝑥0 ); the function 𝑓 ′ is the derivative of 𝑓 . This allows us to
define a line 𝐿 with equation

𝑦 = 𝑡(𝑥) = 𝑓(𝑥0 ) + 𝑓 ′ (𝑥0 )(𝑥 − 𝑥0 )

with the property


    lim_{𝑥→𝑥0} (𝑓(𝑥) − 𝑡(𝑥)) / (𝑥 − 𝑥0) = 0,
so that 𝑡 is a ‘good linear approximation’ for 𝑓 at 𝑥0 . The line 𝐿 is the tangent to the curve 𝑦 = 𝑓(𝑥) at
𝑥0 .
We now consider the situation in higher dimensions. Write x = (𝑥1, … , 𝑥𝑛)ᵀ. If 𝑓 ∶ R𝑛 → R𝑚 with
𝑓(x) = (𝑓1(x), … , 𝑓𝑚(x))ᵀ, then we have seen how to define its partial derivatives:

    (𝜕𝑓𝑖/𝜕𝑥𝑗)_{𝑖=1,…,𝑚, 𝑗=1,…,𝑛}.

It would be convenient to say that 𝑓 is differentiable at x0 if all of its partial derivatives exist there. How-
ever, our choice of coordinate axes is somewhat arbitrary, and doing this might mean we miss some ‘bad’
behaviour that happens away from the axes.
Partial derivatives are not sufficient. Consider 𝑓(𝑥, 𝑦) = (𝑥3 − 𝑥𝑦2)/(𝑥2 + 𝑦2), with 𝑓(0, 0) = 0.
Then 𝑓𝑥(0, 0) = 1 and 𝑓𝑦(0, 0) = 0. The function is not differentiable at (0, 0) due to the function's
behaviour near (0, 0). For example, 𝑓(𝑥, 𝑥) ≡ 0.

A picture can never capture the complete 3d behaviour and you may find it useful to manipulate this function
on a computer. Nevertheless, a rendering of 𝑓(𝑥, 𝑦) is given below.

Figure 4.2: Surface approximated by cuboids.


A more restrictive definition would be to say that 𝑓 is differentiable at x0 if 𝑓 is differentiable along a slice in
any direction. This is often referred to as Gateaux differentiability.

However, this definition is still somewhat unsatisfactory, as there are functions for which derivatives exist in
any direction, but for which there is still no tangent plane (i.e. good linear approximation). Even worse, the
usual rules of differentiation (e.g. chain rule, sum rule) are not guaranteed to hold in this situation.

So, we adopt a different approach, and consider the question: what would we need to ensure the existence
of a tangent plane? This is often referred to as the Fréchet derivative.

In R3, a (non-vertical) plane has equation 𝑧 = 𝛼𝑥 + 𝛽𝑦 + 𝛾. Clearly, for the plane to be a tangent to the
surface 𝑧 = 𝑓(𝑥, 𝑦), we would want the slopes in the 𝑥- and 𝑦-directions to be

    𝛼 = (𝜕𝑓/𝜕𝑥)(𝑥0, 𝑦0),      𝛽 = (𝜕𝑓/𝜕𝑦)(𝑥0, 𝑦0).

At x0 = (𝑥0, 𝑦0), we also require that 𝑧 = 𝑓(𝑥0, 𝑦0), which implies

    𝛾 = 𝑓(𝑥0, 𝑦0) − (𝜕𝑓/𝜕𝑥)(𝑥0, 𝑦0) × 𝑥0 − (𝜕𝑓/𝜕𝑦)(𝑥0, 𝑦0) × 𝑦0.

Putting this together, a tangent plane must have the equation

    𝑧 = 𝑇(𝑥, 𝑦) = 𝑓(𝑥0, 𝑦0) + (𝜕𝑓/𝜕𝑥)(𝑥0, 𝑦0) × (𝑥 − 𝑥0) + (𝜕𝑓/𝜕𝑦)(𝑥0, 𝑦0) × (𝑦 − 𝑦0).
We consequently define 𝑓 ∶ R2 → R to be differentiable at (𝑥0, 𝑦0) if 𝜕𝑓/𝜕𝑥 and 𝜕𝑓/𝜕𝑦 both exist
there, and also

    lim_{(𝑥,𝑦)→(𝑥0,𝑦0)} |𝑓(𝑥, 𝑦) − 𝑇(𝑥, 𝑦)| / ‖(𝑥, 𝑦) − (𝑥0, 𝑦0)‖ = 0;

that is, 𝑇 is a good linear approximation for 𝑓 .

To summarise, when 𝑓 is differentiable at (𝑥0 , 𝑦0 ), we can write

𝑓(𝑥, 𝑦) = 𝑇 (𝑥, 𝑦) + 𝑅((𝑥, 𝑦), (𝑥0 , 𝑦0 )),

where:
• the linear approximation can be expressed

      𝑇(𝑥, 𝑦) = 𝑓(𝑥0, 𝑦0) + 𝐷𝑓(𝑥0, 𝑦0) ⎛ 𝑥 − 𝑥0 ⎞
                                        ⎝ 𝑦 − 𝑦0 ⎠,

  with
      𝐷𝑓(𝑥0, 𝑦0) = ( (𝜕𝑓/𝜕𝑥)(𝑥0, 𝑦0)   (𝜕𝑓/𝜕𝑦)(𝑥0, 𝑦0) )
  being the derivative of 𝑓 at (𝑥0, 𝑦0);
• the remainder term satisfies

      lim_{(𝑥,𝑦)→(𝑥0,𝑦0)} |𝑅((𝑥, 𝑦), (𝑥0, 𝑦0))| / ‖(𝑥, 𝑦) − (𝑥0, 𝑦0)‖ = 0.

More generally, a function 𝑓 ∶ 𝑈 ⊆ R𝑛 → R𝑚 is differentiable at x0 ∈ 𝑈 if the partial derivatives exist at
x0 and

    lim_{x→x0} ‖𝑓(x) − 𝑓(x0) − 𝐷𝑓(x0)(x − x0)‖ / ‖x − x0‖ = 0,

where 𝐷𝑓(x0) is the linear transformation represented by the Jacobian matrix

    ( (𝜕𝑓𝑖/𝜕𝑥𝑗)(x0) )_{𝑖=1,…,𝑚, 𝑗=1,…,𝑛}.

We say the function x0 ↦ 𝐷𝑓(x0 ) is the derivative of 𝑓 .

In the special case 𝑓 ∶ 𝑈 ⊆ R𝑛 → R, we have that 𝐷𝑓(x0) is a 1 × 𝑛 matrix

    ( (𝜕𝑓/𝜕𝑥1)(x0)   …   (𝜕𝑓/𝜕𝑥𝑛)(x0) ).

The corresponding column vector

    ( (𝜕𝑓/𝜕𝑥1)(x0), … , (𝜕𝑓/𝜕𝑥𝑛)(x0) )ᵀ

is said to be the gradient of 𝑓 at x0, and is abbreviated as grad 𝑓(x0) or ∇𝑓(x0).

Directional derivative. For a unit vector u ∈ R𝑛, the directional derivative of 𝑓 at x0 in the direction u is
∇𝑓(x0) ⋅ u; it is largest when u points in the direction of ∇𝑓(x0), with maximum value ‖∇𝑓(x0)‖.
Example 4.1
Show that 𝑓(𝑥, 𝑦) = 2𝑥2 − 4𝑦 is differentiable at (2, −3).
Example 4.2

1. Determine the maximum rate of change of 𝑓 ∶ R2 → R defined by 𝑓(𝑥, 𝑦) = 𝑦𝑒𝑥𝑦 at (0, 2),
and the direction in which it occurs.
2. You stand on the surface of a planet at the coordinate (𝑥, 𝑦) = (0, 1). The temperature of
the planet at the point (𝑥, 𝑦) is given by 𝑇 (𝑥, 𝑦) = 𝑥𝑦 .

a. In which direction should you move to increase your temperature as quickly as possible?
b. You decide to move along the line 𝑥 + 𝑦 − 1 = 0 in the direction (−1, 1). What is the
maximum temperature you will experience along this path?
Showing that the limit defining the derivative exists can be difficult, but finding the partial derivatives and the
equation of the tangent is often simple! Moreover, the following theorem tells us that this is nearly enough.

Theorem 4.1. Let 𝑓 ∶ 𝑈 ⊆ R𝑛 → R𝑚 . If the partial derivatives exist and are continuous in a neigh-
bourhood of x, then 𝑓 is differentiable at x.

Example 4.3
Determine the tangent plane to 𝑓(𝑥, 𝑦) = 𝑥𝑒2𝑦 at (1, 0).
Example 4.4
Let 𝑓(𝑥, 𝑦) = √(41 − 4𝑥2 − 𝑦2). Approximate 𝑓(2.1, 2.9) using 𝐷𝑓(2, 3).
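
A numerical sketch of the approximation this example asks for (the partial derivatives in the code are computed by hand):

```python
import numpy as np

# Linear approximation: f(x, y) ≈ f(x0, y0) + Df(x0, y0) (x - x0, y - y0)^T
# for f(x, y) = sqrt(41 - 4x^2 - y^2) near (x0, y0) = (2, 3).
def f(x, y):
    return np.sqrt(41 - 4 * x**2 - y**2)

x0, y0 = 2.0, 3.0
# Hand-computed partials: f_x = -4x/f and f_y = -y/f, so Df(2, 3) = (-2, -3/4).
Df = np.array([-4 * x0 / f(x0, y0), -y0 / f(x0, y0)])

h = np.array([2.1 - x0, 2.9 - y0])
print(f(x0, y0) + Df @ h)   # 3.875
print(f(2.1, 2.9))          # the exact value, approximately 3.8665
```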
Question 4.5

1. Find the tangent plane to 𝑓(𝑥, 𝑦) = 2𝑥2 − 3𝑥𝑦 + 8𝑦2 + 2𝑥 − 4𝑦 + 4 at (2, −1).
2. Find the tangent plane to 𝑓(𝑥, 𝑦) = 𝑥3 − 𝑥2 𝑦 + 𝑦 2 − 2𝑥 + 3𝑦 − 2 at (−1, 3).
3. Let 𝑓(𝑥, 𝑦) = 𝑒5−2𝑥+3𝑦 . Determine an approximation to 𝑓(4.1, 0.9) using 𝐷𝑓(4, 1).
4. A ball sits at coordinate (1, 2) in the (𝑥, 𝑦)-plane on a slope whose height at (𝑥, 𝑦) is given
by

    ℎ(𝑥, 𝑦) = 𝑒^(−𝑥3 − 𝑦2).

In which direction will the ball roll?
4.2 Basic properties of the derivative


The derivative of 𝑓 , 𝐷𝑓 , has all of the properties that we expect of the derivative. These include the following.
Sum rule. If 𝑓 ∶ R𝑛 → R𝑚 , 𝑔 ∶ R𝑛 → R𝑚 , then

𝐷(𝑓 + 𝑔)(x) = 𝐷𝑓(x) + 𝐷𝑔(x).

Product rule. If 𝑓 ∶ R𝑛 → R, 𝑔 ∶ R𝑛 → R, then

𝐷(𝑓𝑔)(x) = 𝑓(x)𝐷𝑔(x) + 𝑔(x)𝐷𝑓(x).

(A similar identity holds when 𝑓 ∶ R𝑛 → R𝑚 .)
Quotient rule. If 𝑓 ∶ R𝑛 → R, 𝑔 ∶ R𝑛 → R, then

    𝐷(𝑓/𝑔)(x) = (𝑔(x)𝐷𝑓(x) − 𝑓(x)𝐷𝑔(x)) / 𝑔(x)2.

Dot product rule. If 𝑓 ∶ R𝑛 → R𝑚 , 𝑔 ∶ R𝑛 → R𝑚 , then

𝐷(𝑓 ⋅ 𝑔)(x) = 𝑓(x) ⋅ 𝐷𝑔(x) + 𝑔(x) ⋅ 𝐷𝑓(x),

where 𝑓 ⋅ 𝑔 ∶ R𝑛 → R is the function defined by setting

(𝑓 ⋅ 𝑔)(x) = 𝑓(x) ⋅ 𝑔(x).

Chain rule. Let 𝑔 ∶ 𝑈 ⊆ R𝑛 → R𝑚 and 𝑓 ∶ 𝑉 ⊆ R𝑚 → R𝑝 , with 𝑈 mapping into 𝑉 , so that 𝑓 ∘ 𝑔 is
defined. Let 𝑔 be differentiable at x0 , and 𝑓 be differentiable at 𝑔(x0 ). Then 𝑓 ∘ 𝑔 is differentiable at x0 , and

𝐷(𝑓 ∘ 𝑔)(x0 ) = (𝐷𝑓)(𝑔(x0 ))𝐷𝑔(x0 ).

Example 4.6
Let 𝑓∶ R2 → R2 be defined by (𝑦1 , 𝑦2 ) ↦ (2𝑦1 + 𝑦22 , 3𝑦12 − 𝑦2 ). Let 𝑔 ∶ R3 → R2 be defined
by (𝑥1 , 𝑥2 , 𝑥3 ) ↦ (𝑥1 𝑥2 , 𝑥2 𝑥3 ).

1. Determine 𝐷𝑔(𝑥1 , 𝑥2 , 𝑥3 ).
2. Find explicitly the map 𝑓 ∘ 𝑔 ∶ R3 → R2 , and hence find 𝐷(𝑓 ∘ 𝑔).
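
A numeric sketch checking the chain rule for this example at one illustrative point (the Jacobians in the code are the hand-computed ones; the finite-difference Jacobian is an independent check):

```python
import numpy as np

# g(x1, x2, x3) = (x1*x2, x2*x3) and f(y1, y2) = (2y1 + y2^2, 3y1^2 - y2).
def g(x):
    x1, x2, x3 = x
    return np.array([x1 * x2, x2 * x3])

def f(y):
    y1, y2 = y
    return np.array([2 * y1 + y2**2, 3 * y1**2 - y2])

def Dg(x):                      # hand-computed Jacobian of g
    x1, x2, x3 = x
    return np.array([[x2, x1, 0.0],
                     [0.0, x3, x2]])

def Df(y):                      # hand-computed Jacobian of f
    y1, y2 = y
    return np.array([[2.0, 2 * y2],
                     [6 * y1, -1.0]])

x0 = np.array([1.0, 2.0, 3.0])
chain = Df(g(x0)) @ Dg(x0)      # D(f∘g)(x0) = (Df)(g(x0)) Dg(x0)

eps = 1e-6                      # finite-difference check of the same Jacobian
fd = np.column_stack([(f(g(x0 + eps * e)) - f(g(x0 - eps * e))) / (2 * eps)
                      for e in np.eye(3)])
print(np.max(np.abs(chain - fd)))   # should be close to zero
```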
Question 4.7
Let 𝑓 ∶ R2 → R3 and 𝑔 ∶ R2 → R3 be defined by:

              ⎛ 𝑥𝑦    ⎞              ⎛ sin 𝑥  ⎞
    𝑓(𝑥, 𝑦) = ⎜ 𝑦𝑒𝑥   ⎟ ,   𝑔(𝑥, 𝑦) = ⎜ cos 2𝑦 ⎟ .
              ⎝ log 𝑥 ⎠              ⎝ 1      ⎠

1. Find 𝐷𝑓 and 𝐷𝑔 .
2. By using the dot product rule, find the derivative of the function 𝑓 ⋅ 𝑔 ∶ R2 → R, where
(𝑓 ⋅ 𝑔)(x) ∶= 𝑓(x) ⋅ 𝑔(x).
Question 4.8
Let 𝑔 ∶ ℝ2 → ℝ3 and 𝑓 ∶ ℝ3 → ℝ be defined by:
    𝑔(𝑥, 𝑦) = (sin(𝜋𝑥/2) + sin(𝜋𝑦), 𝑥𝑦, 2 + 𝑥2 + 𝑦2),      𝑓(𝑢, 𝑣, 𝑤) = 𝑣/(𝑢 + 𝑤).
1. Find 𝐷𝑓 and 𝐷𝑔 .
2. Hence, find the derivative of the composite function 𝑓 ∘ 𝑔 at the point (1, 1).
4.3 Finding and classifying critical points

4.3.1 Critical points


Let 𝑓 ∶ 𝑈 ⊆ ℝ𝑛 → ℝ. We say that x0 ∈ 𝑈 is a:

• local maximum of 𝑓 if there is a neighbourhood 𝑉 of x0 such that 𝑓(x) ≤ 𝑓(x0 ) for every x ∈ 𝑉 ;
• local minimum of 𝑓 if there is a neighbourhood 𝑉 of x0 such that 𝑓(x) ≥ 𝑓(x0 ) for every x ∈ 𝑉 .

A point x0 is a critical point of 𝑓 if 𝐷𝑓(x0 ) = 0. Note that in the case that 𝑓 ∶ ℝ → ℝ, we have that
𝐷𝑓(𝑥) = 0 if and only if 𝑥 is a local maximum, a local minimum or a point of inflection. In higher dimensions,
the situation is more complicated. Appendix 4.7 contains renderings that illustrate the complexity.

Theorem 4.2. Let 𝑓 ∶ ℝ𝑛 → ℝ be differentiable. If x0 is a local extremum (maximum or minimum),


then 𝐷𝑓(x0 ) = 0.
Example 4.9
Find the critical points of 𝑓(𝑥, 𝑦) = 𝑥2 + 2𝑥𝑦 − 4𝑦2 + 4𝑥 − 6𝑦 + 4.
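
Since ∇𝑓 is linear here, the critical points solve a linear system; a numerical sketch (the system below is derived by hand from 𝑓𝑥 = 2𝑥 + 2𝑦 + 4 and 𝑓𝑦 = 2𝑥 − 8𝑦 − 6):

```python
import numpy as np

# grad f(x, y) = (2x + 2y + 4, 2x - 8y - 6) = 0 is a 2x2 linear system.
A = np.array([[2.0, 2.0],
              [2.0, -8.0]])
b = np.array([-4.0, 6.0])
print(np.linalg.solve(A, b))   # the unique critical point, (-1, -1)
```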
Question 4.10
Find the critical points of each of the following functions, 𝑓 ∶ 𝑈 ⊆ R2 → R, where 𝑈 is the largest
subset of R2 such that 𝑓 is defined.

1. 𝑓(𝑥, 𝑦) = 𝑥3 + 2𝑥𝑦 − 2𝑥 − 4𝑦.


2. 𝑓(𝑥, 𝑦) = √(4𝑦2 − 9𝑥2 + 24𝑦 + 36𝑥 + 36), where 4𝑦2 − 9𝑥2 + 24𝑦 + 36𝑥 + 36 ≥ 0.

4.3.2 Classifying critical points

We start with some revision of the single-variable case. Recall that under suitable conditions we may write
𝑓 ∶ R → R as

    𝑓(𝑥0 + ℎ) = 𝑓(𝑥0) + 𝑓′(𝑥0)ℎ + ½𝑓′′(𝑥0)ℎ2 + 𝑅(𝑥0, ℎ),
where 𝑅(𝑥0 , ℎ) is the remainder term. If 𝑥0 is a critical point, then 𝑓 ′ (𝑥0 ) = 0. Therefore,

    𝑓(𝑥0 + ℎ) − 𝑓(𝑥0) = ½𝑓′′(𝑥0)ℎ2 + 𝑅(𝑥0, ℎ).

Therefore, for small ℎ, if 𝑓′′(𝑥0) > 0 then 𝑓(𝑥0 + ℎ) − 𝑓(𝑥0) > 0, so 𝑥0 is a minimum. The case
𝑓′′(𝑥0) < 0 similarly makes 𝑥0 a maximum. When 𝑓′′(𝑥0) = 0 the second-order term is insufficient
to determine the nature of 𝑥0. Standard examples are 𝑥3, 𝑥4 and so on, at 𝑥0 = 0.

As noted earlier, critical points are not necessarily maxima, minima or points of inflection. For example,
consider the function 𝑓(𝑥, 𝑦) = 𝑥2 − 𝑦2 .

At (0, 0), we have 𝑓𝑥 = 𝑓𝑦 = 0, i.e. (0, 0) is a critical point. However, this is neither a local maximum nor
a local minimum. Indeed, if we slice parallel to the 𝑥-axis, it appears to be a minimum. Whereas, if we slice
parallel to the 𝑦 -axis, it appears to be a local maximum. We call such a point a saddle point.

We now introduce a general technique for classifying critical points; as with the procedure you know for the
one-dimensional case, this will involve second derivatives. Let 𝑓 ∶ ℝ𝑛 → ℝ. Then

𝐷𝑓 ∶ ℝ𝑛 → ℝ1×𝑛
x ↦ 𝐷𝑓(x).

As already noted, this map can be represented by a 1 × 𝑛 matrix, corresponding to the 𝑛 × 1 vector

    ∇𝑓(x) = ( (𝜕𝑓/𝜕𝑥1)(x), … , (𝜕𝑓/𝜕𝑥𝑛)(x) )ᵀ,

to which we can apply the operator 𝐷 again. We define the second derivative of 𝑓 by setting

𝐷2 𝑓(x) = 𝐷(∇𝑓)(x).

Since ∇𝑓 ∶ ℝ𝑛 → ℝ𝑛 , 𝐷2 𝑓 can be represented by an 𝑛 × 𝑛 matrix, specifically:

        ⎛ 𝜕2𝑓/𝜕𝑥12     …   𝜕2𝑓/𝜕𝑥𝑛𝜕𝑥1 ⎞
    H = ⎜     ⋮        ⋱       ⋮      ⎟
        ⎝ 𝜕2𝑓/𝜕𝑥1𝜕𝑥𝑛   …   𝜕2𝑓/𝜕𝑥𝑛2   ⎠

where 𝜕2𝑓/𝜕𝑥𝑖𝜕𝑥𝑗 = (𝜕/𝜕𝑥𝑖)(𝜕𝑓/𝜕𝑥𝑗). The symbol H is used as the above matrix is often referred to as
the Hessian matrix, or simply the Hessian. (In some literature, the term Hessian is used to refer to the
determinant of the above matrix.) It will be convenient to use the shorthand notation 𝑓𝑥𝑖𝑥𝑗 ∶= 𝜕2𝑓/𝜕𝑥𝑖𝜕𝑥𝑗 .

Theorem 4.3. If 𝑓𝑥𝑖𝑥𝑗 , 𝑖, 𝑗 = 1, … , 𝑛, are defined and continuous throughout an open region containing
x, then
    𝑓𝑥𝑖𝑥𝑗 (x) = 𝑓𝑥𝑗𝑥𝑖 (x).

Note, the conditions in this theorem are required. Consider 𝑓 ∶ R2 → R defined by

    𝑓(𝑥, 𝑦) = 𝑥𝑦(𝑥2 − 𝑦2)/(𝑥2 + 𝑦2)  for (𝑥, 𝑦) ≠ (0, 0),      𝑓(0, 0) = 0.

In this case, 𝑓𝑥𝑦 (0, 0) = 1 and 𝑓𝑦𝑥 (0, 0) = −1.

An immediate consequence of the above result is that if the second order partial derivatives are continuous
in an open region containing x0 , then the Hessian H is symmetric at x0 . In this situation, 𝑓 has a second
order Taylor expansion:

    𝑓(x0 + h) = 𝑓(x0) + 𝐷𝑓(x0)h + ½hᵀ𝐷2𝑓(x0)h + 𝑅(x0, h),

where 𝑅(x0, h) = 𝑜(‖h‖2); that is, lim_{h→0} 𝑅(x0, h)/‖h‖2 = 0.
This leads to the following criteria for classifying local maxima and minima.

Theorem 4.4 (Classifying critical points). Suppose 𝑓 ∶ ℝ𝑛 → ℝ is twice differentiable at x0 , and its
second order partial derivatives are continuous in an open region containing x0 . If x0 is a critical point,
then:

• 𝐷2 𝑓(x0 ) being positive definite implies that x0 is a local minimum;


• 𝐷2 𝑓(x0 ) being negative definite implies that x0 is a local maximum.

Recall, a symmetric matrix 𝑀 is positive definite if and only if its leading principal minors are strictly positive.
There is a similar result for negative definite matrices.

Theorem 4.5. A symmetric matrix 𝑀 is negative definite if and only if its leading principal minors alter-
nate in sign, with the first being strictly negative.

If 𝐷2 𝑓(x0 ) is positive definite or negative definite, then it must be invertible. If 𝐷2 𝑓(x0 ) is neither positive
definite nor negative definite, but still invertible, then x0 is a non-degenerate saddle point.

If 𝐷2 𝑓(x0 ) is not invertible (including the semi-definite cases), then this is a degenerate case, and further
work is needed to decide on the type of critical point. We will not study the general theory of degenerate
critical points in this module.

Special case: If 𝑓 ∶ ℝ2 → ℝ, then the above results can be summarised as follows.
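
One common way to state this special case (a sketch of the standard statement; writing 𝐷 ∶= 𝑓𝑥𝑥𝑓𝑦𝑦 − 𝑓𝑥𝑦2 at the critical point): 𝐷 > 0 with 𝑓𝑥𝑥 > 0 gives a local minimum; 𝐷 > 0 with 𝑓𝑥𝑥 < 0 gives a local maximum; 𝐷 < 0 gives a saddle point; and 𝐷 = 0 is inconclusive. A small code sketch of this test:

```python
# A sketch of the standard two-variable second derivative test, assuming the
# usual formulation: D = fxx*fyy - fxy**2, evaluated at a critical point.
def classify_2d(fxx, fxy, fyy):
    D = fxx * fyy - fxy**2
    if D > 0:
        return "local minimum" if fxx > 0 else "local maximum"
    if D < 0:
        return "saddle point"
    return "inconclusive (degenerate)"

# Example: f(x, y) = x^2 - y^2 at (0, 0) has fxx = 2, fxy = 0, fyy = -2.
print(classify_2d(2, 0, -2))   # saddle point
```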


Example 4.11
Find the critical points for each of the following functions and classify their nature.

1. 𝑓(𝑥, 𝑦) = 4𝑥2 + 9𝑦2 + 8𝑥 − 36𝑦 + 24.


2. 𝑓(𝑥, 𝑦, 𝑧) = 𝑥2 + 𝑦2 + 7𝑧2 − 𝑥𝑦 − 3𝑦𝑧. In each case 𝑓 's domain is the largest subset of R𝑛
on which 𝑓 is defined. (Note, this is our standing assumption for the functions we consider.)
Question 4.12
Find and classify the critical points of:

1. 𝑓(𝑥, 𝑦) = (1/3)𝑥3 + 𝑦2 + 2𝑥𝑦 − 6𝑥 − 3𝑦 + 4;
2. 𝑓(𝑥, 𝑦) = 8𝑥3 + 𝑦3 + 6𝑥𝑦;
3. 𝑓(𝑥, 𝑦) = 𝑥𝑦2 − 4𝑥2 − 𝑦2 + 4𝑥;
4. 𝑓(𝑥, 𝑦) = 𝑥4 + 𝑦4 ;
5. 𝑓(𝑥, 𝑦, 𝑧) = 4𝑥𝑦𝑧 − 𝑥4 − 𝑦4 − 𝑧 4 .
4.4 Constrained optimisation


Once we have found local extrema, it is natural to ask whether there exist absolute extrema: we may be able
to decide this using our knowledge of the function and the set on which it is defined (for example, if we have
a continuous function on a compact set, then the image will be compact, so we will have absolute extrema).
Another natural question is whether or not there is a maximum (or minimum) value when we consider 𝑓
subject to a constraint of the form 𝑔(x) = b, where 𝑓 ∶ ℝ𝑛 → ℝ and 𝑔 ∶ ℝ𝑛 → ℝ𝑚 are smooth functions
(we will mainly focus on the case 𝑛 = 2, 𝑚 = 1, but the technique for other 𝑛, 𝑚 is analogous). Note that
we can assume b = 0 by defining 𝑔∗ (x) = 𝑔(x) − b. Moreover, more than one constraint can be dealt
with by adjusting the value of 𝑚.

Theorem 4.6 (Lagrange multiplier theorem). Suppose 𝑓 ∶ ℝ𝑛 → ℝ has a local extremum at x0 when
restricted to the constraint set 𝐶 = {x ∶ 𝑔(x) = 0}, where 𝑔 ∶ ℝ𝑛 → ℝ. Suppose also that x0 is
not an end-point of 𝐶 , and ∇𝑔(x0 ) ≠ 0. Then there exists a number 𝜆0 such that (x0 , 𝜆0 ) is a critical
point of
𝐿(x, 𝜆) = 𝑓(x) − 𝜆𝑔(x).
If 𝑔∶ ℝ𝑛 → ℝ𝑚 , then we should take the Lagrange multiplier 𝜆 ∈ ℝ𝑚 , and take the Lagrangian to
be 𝐿(x, 𝜆) = 𝑓(x) − 𝜆 ⋅ 𝑔(x).

The above theorem tells us that we should search for the extrema of the constrained 𝑓 among the critical
points of the Lagrangian function. However, note that it does not guarantee that a solution exists. We will
not prove the theorem, but explain intuitively why it makes sense.

At every point in the domain, ∇𝑓 is normal to the level curve through the point (naively, ∇𝑓 measures the
change in 𝑓 , and there is no change in 𝑓 along the level curve, and so all change must be normal to it). Of
course, 𝐶 = {x ∶ 𝑔(x) = 0} is a level curve for the function 𝑔, so ∇𝑔 is normal to 𝐶 . Now, at a local
extremum x0 , the level curves for 𝑓 must be tangent to 𝐶 (if 𝐶 crossed a level curve, it would move from a
lower value to a greater, or vice versa). Hence ∇𝑓 and ∇𝑔 lie along the same line; they could point in the
same or opposite directions.

Figure 4.3: Geometry of Lagrange multipliers

To write this mathematically, we must have

∇𝑓(x0 ) = 𝜆0 ∇𝑔(x0 )

for some 𝜆0 . This implies 𝐷𝐿(x0 , 𝜆0 ) = 0, i.e. (x0 , 𝜆0 ) is a critical point of 𝐿.


Figure 4.4: Geometry of Lagrange multipliers

We can find the critical points of 𝐿 by finding all the partial derivatives and equating them to 0. We can
classify our extrema either by considering the geometry of the situation or by using a specialised second
derivative test. In particular, let 𝑓 ∶ ℝ𝑛 → ℝ be constrained by 𝑔(x) = 0 (where 𝑔 ∶ ℝ𝑛 → ℝ𝑚 ) with
constraint set 𝐶 and Lagrangian 𝐿. Define the bordered Hessian matrix to be:

         ⎛   0          𝐷𝑔                         ⎞
         ⎜           𝜕2𝐿/𝜕𝑥12     𝜕2𝐿/𝜕𝑥2𝜕𝑥1   …   ⎟
    H̄ =  ⎜ (𝐷𝑔)ᵀ     𝜕2𝐿/𝜕𝑥1𝜕𝑥2   𝜕2𝐿/𝜕𝑥22     …   ⎟,
         ⎝   ⋮           ⋮            ⋮          ⋱  ⎠

where 0 is 𝑚 × 𝑚.

Important note The bordered Hessian matrix is actually just the Hessian matrix of 𝐿 differentiated with
respect to 𝜆 first; that is, with 𝐿(x, 𝜆) = 𝑓(x) − 𝜆𝑔(x) we have

    ⎛ 𝜕2𝐿/𝜕𝜆2         𝜕2𝐿/𝜕𝜆𝜕x ⎞
    ⎝ (𝜕2𝐿/𝜕𝜆𝜕x)ᵀ     𝜕2𝐿/𝜕x2  ⎠

A direct computation of this matrix reveals that the upper right and bottom left entries are −(𝐷𝑔) and
−(𝐷𝑔)ᵀ. So why does our bordered Hessian omit the minus signs? We use the bordered Hessian to
compute leading principal minors; these minors are not affected by a simultaneous change of sign of both
(𝐷𝑔) and (𝐷𝑔)ᵀ, and hence the signs are set to be positive. This helps reduce the chances of sign
errors.

Calculate the leading principal minors of H̄ (x0 , 𝜆0 ) of order ≥ 2𝑚 + 1. If these:


• start with the sign of (−1)𝑚+1 and alternate, then we have a local maximum for 𝑓|𝐶 ;
• all have the sign of (−1)𝑚 , then we have a local minimum for 𝑓|𝐶 ;
• do not fall into either of the above categories, then the test is inconclusive.

Note that this recovers the second derivative test we met earlier when there are 𝑚 = 0 constraints.
In the case 𝑓 ∶ ℝ2 → ℝ, 𝑔 ∶ ℝ2 → ℝ, the bordered Hessian matrix is:

         ⎛   0        𝜕𝑔/𝜕𝑥      𝜕𝑔/𝜕𝑦    ⎞
    H̄ =  ⎜ 𝜕𝑔/𝜕𝑥    𝜕2𝐿/𝜕𝑥2    𝜕2𝐿/𝜕𝑦𝜕𝑥  ⎟.
         ⎝ 𝜕𝑔/𝜕𝑦    𝜕2𝐿/𝜕𝑥𝜕𝑦   𝜕2𝐿/𝜕𝑦2   ⎠
So, for a critical point x0 , the test becomes

• if det H̄ (x0 , 𝜆0 ) > 0, then we have a local maximum for 𝑓|𝐶 ;


• if det H̄ (x0 , 𝜆0 ) < 0, then we have a local minimum for 𝑓|𝐶 ;
• if det H̄ (x0 , 𝜆0 ) = 0, then the test is inconclusive.

Example 4.13
Let 𝑓(𝑥, 𝑦)= 𝑥2 + 4𝑦2 and 𝑔(𝑥, 𝑦) = 𝑥2 + 𝑦2 and let 𝐶 = {(𝑥, 𝑦) ∶ 𝑔(𝑥, 𝑦) = 1}. Use the
method of Lagrange multipliers to find the extrema for 𝑓 on 𝐶 .
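
A sketch of this example's computation using the sympy library (assumed available; the classification of the resulting points via the bordered Hessian is left to the hand calculation):

```python
import sympy as sp

# Critical points of the Lagrangian L(x, y, lam) = f - lam*g for
# f = x^2 + 4y^2 constrained to the circle x^2 + y^2 = 1.
x, y, lam = sp.symbols("x y lam", real=True)
f = x**2 + 4 * y**2
g = x**2 + y**2 - 1
L = f - lam * g

sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
for s in sols:
    print(s, "f =", f.subs(s))
# Expected output: (+-1, 0) with lam = 1 (f = 1, the minima) and (0, +-1)
# with lam = 4 (f = 4, the maxima).
```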
Question 4.14

1. Let 𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 and 𝑔(𝑥, 𝑦) = 𝑦 − 𝑥 − 1, and let 𝐶 = {(𝑥, 𝑦) ∶ 𝑔(𝑥, 𝑦) = 0},


i.e. we are constrained to lie on the line 𝑦 = 𝑥 + 1. At what point is 𝐶 tangent to a level curve
of 𝑓 ? Is this a local extrema of 𝑓 on 𝐶 ? If so, what type?

2. Let 𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 and 𝑔(𝑥, 𝑦) = 𝑥2𝑦 − 16, and let 𝐶 = {(𝑥, 𝑦) ∶ 𝑔(𝑥, 𝑦) = 0}.
Use the method of Lagrange multipliers to show that (±2√2, 2) are both local minima for 𝑓
on 𝐶 .

3. Use the method of Lagrange multipliers to find the maximum value of 𝑓(𝑥, 𝑦) = 2𝑥 + 𝑦 + 6
subject to 𝑥2 + 2𝑦 = 3.

4. An ant is on a metal plate with temperature at (𝑥, 𝑦) given by 𝑇 (𝑥, 𝑦) = 4𝑥2 − 4𝑥𝑦 + 𝑦2 .
He walks around a circle of centre 0 and radius 5. Where on the circular path followed by
the ant is the temperature highest? And where is it lowest? (You should use the method of
Lagrange multipliers and the second derivative test.)
5. Use the method of Lagrange multipliers to find the extrema of 𝑓(𝑥, 𝑦) = 3𝑥 + 2𝑦 subject to
2𝑥2 + 3𝑦2 = 3.
4.5 Chapter 4 Consolidation Questions

Question 4.15
Define 𝑓 ∶ R2 → R by

    𝑓(𝑥, 𝑦) = 𝑥𝑦/√(𝑥2 + 𝑦2)  if (𝑥, 𝑦) ≠ (0, 0),      𝑓(0, 0) = 0.

Show that 𝑓(0, 0), 𝑓𝑥 (0, 0) and 𝑓𝑦 (0, 0) all exist, but that there is no tangent plane to 𝑓 at (0, 0).
Why is Theorem 4.1 not contradicted?
[Hint: look at how 𝑓 behaves on different lines through the origin.]

Question 4.16
Let 𝑓(𝑥, 𝑦) = (𝑦 − 𝑥2)(𝑦 − 3𝑥2).

1. Show that (0, 0) is a critical point.
2. Show also that along any straight line through (0, 0), 𝑓 has a local minimum at (0, 0). (Consider
   the function 𝑡 ↦ 𝑓(𝑎𝑡, 𝑏𝑡) for different values of 𝑎, 𝑏.)
3. Is (0, 0) a local minimum? (Hint: Consider the curve 𝑦 = 2𝑥2 .) Check your conclusion does not
   conflict with the second derivative criterion for a local minimum.

Question 4.17
Let 𝑓(𝑥, 𝑦, 𝑧) = 𝑥2 + 𝑦2 + 𝑧2 . Find the extreme points of 𝑓 subject to the constraint that 𝑧 − 𝑥𝑦 = 2.

Question 4.18
Let 𝑓(𝑥, 𝑦, 𝑧) = 𝑧. Find the extrema of 𝑓 subject to the constraints that 𝑥 + 𝑦 + 𝑧 = 12 and
𝑥2 + 𝑦2 − 𝑧 = 0.
4.6 Chapter 4 summary

Chapter outcomes review

Having reviewed the material and completed the assessment (formative and summative) material,
you should be able to:

• Determine derivatives of functions defined R𝑛 → R𝑚 .


• Find and classify critical points of functions R𝑛 → R.
• Solve problems involving differentiation of functions R𝑛 → R𝑚 , including constrained optimi-
sation.
• Prove results involving differentiation of functions R𝑛 → R𝑚 .

Differentiability in R𝑛 Consider 𝑓 ∶ R2 → R. Let 𝑇 be the tangent plane to 𝑓 at (𝑥0 , 𝑦0 ). Define 𝑓 to be
differentiable at (𝑥0 , 𝑦0 ) if 𝜕𝑓/𝜕𝑥 and 𝜕𝑓/𝜕𝑦 both exist at (𝑥0 , 𝑦0 ) and also

    lim_{(𝑥,𝑦)→(𝑥0,𝑦0)} |𝑓(𝑥, 𝑦) − 𝑇(𝑥, 𝑦)| / ‖(𝑥, 𝑦) − (𝑥0, 𝑦0)‖ = 0.

That is, 𝑇 is a ‘good’ linear approximation to 𝑓 at (𝑥0 , 𝑦0 ).

In general, consider 𝑓 ∶ 𝑈 ⊆ R𝑛 → R𝑚 . Then 𝑓 is differentiable at x0 ∈ 𝑈 if the partial derivatives exist
at x0 and

    lim_{x→x0} ‖𝑓(x) − 𝑓(x0) − 𝐷𝑓(x0)(x − x0)‖ / ‖x − x0‖ = 0,

where 𝐷𝑓(x0) is the linear transformation represented by the Jacobian matrix

    ( (𝜕𝑓𝑖/𝜕𝑥𝑗)(x0) )_{𝑖=1,…,𝑚, 𝑗=1,…,𝑛}.

The function x0 ↦ 𝐷𝑓(x0 ) is the derivative of 𝑓 .


The special case 𝑓 ∶ 𝑈 ⊂ R𝑛 → R yields a 1 × 𝑛 matrix

    𝐷𝑓(x0) = ( (𝜕𝑓/𝜕𝑥1)(x0)   …   (𝜕𝑓/𝜕𝑥𝑛)(x0) ).

The corresponding column vector 𝐷𝑓(x0 )T is called the gradient of 𝑓 at x0 . The gradient is also ab-
breviated as grad𝑓(x0 ) or ∇𝑓(x0 ). So, ∇𝑓(x0 ) ∈ R𝑛 . The gradient is used to determine directional
derivatives.

Key properties are:

• If the partial derivatives exist and are continuous in a neighbourhood of x, then 𝑓 is differentiable at x.
• The gradient of 𝑓 points in the direction in which 𝑓 is increasing fastest.
• Sum rule. If 𝑓 ∶ R𝑛 → R𝑚 and 𝑔 ∶ R𝑛 → R𝑚 , then

𝐷(𝑓 + 𝑔)(x) = 𝐷𝑓(x) + 𝐷𝑔(x).

• Product rule. If 𝑓 ∶ R𝑛 → R and 𝑔 ∶ R𝑛 → R, then

𝐷(𝑓𝑔)(x) = 𝑓(x)𝐷𝑔(x) + 𝑔(x)𝐷𝑓(x).

• Quotient rule. If 𝑓 ∶ R𝑛 → R and 𝑔 ∶ R𝑛 → R, then

    𝐷(𝑓/𝑔)(x) = (𝑔(x)𝐷𝑓(x) − 𝑓(x)𝐷𝑔(x)) / 𝑔(x)2
provided all the expressions are defined at x.
• Dot product rule. If 𝑓 ∶ R𝑛 → R𝑚 and 𝑔 ∶ R𝑛 → R𝑚 , then

𝐷(𝑓 ⋅ 𝑔)(x) = 𝑓(x)T 𝐷𝑔(x) + 𝑔(x)T 𝐷𝑓(x),

where 𝑓 ⋅ 𝑔 ∶ R𝑛 → R is the function defined by (𝑓 ⋅ 𝑔)(x) = 𝑓(x) ⋅ 𝑔(x).


• Chain rule. Let 𝑔 ∶ 𝑈 ⊆ R𝑛 → R𝑚 and 𝑓 ∶ 𝑉 ⊆ R𝑚 → R𝑝 , with 𝑈 mapping into 𝑉 , so that 𝑓 ∘ 𝑔
is defined. Let 𝑔 be differentiable at x0 , and 𝑓 be differentiable at 𝑔(x0 ). Then 𝑓 ∘ 𝑔 is differentiable
at x0 , and
𝐷(𝑓 ∘ 𝑔)(x0 ) = (𝐷𝑓)(𝑔(x0 ))𝐷𝑔(x0 ).

Finding and classifying critical points Let 𝑓 ∶ 𝑈 ⊂ R𝑛 → R. A point x0 ∈ 𝑈 is a:

• local maximum of 𝑓 if there is a neighbourhood 𝑉 of x0 such that 𝑓(x) ≤ 𝑓(x0 ) for every x ∈ 𝑉 ;
• local minimum of 𝑓 if there is a neighbourhood 𝑉 of x0 such that 𝑓(x) ≥ 𝑓(x0 ) for every x ∈ 𝑉 .

If the inequality is strict, then an extremum is non-degenerate. A point x0 is a critical point of 𝑓 if


𝐷𝑓(x0 ) = 0.

• Let 𝑓 ∶ R𝑛 → R be differentiable. If x0 is a local extremum (so a local maximum or minimum), then


𝐷𝑓(x0 ) = 0.
• Consider 𝑓 ∶ R2 → R defined by 𝑓(𝑥, 𝑦) = 𝑥2 − 𝑦 2 for all (𝑥, 𝑦) ∈ R2 . At (0, 0), 𝑓𝑥 = 𝑓𝑦 = 0,
so (0, 0) is a critical point. The point (0, 0) is neither a local maximum nor a local minimum. This
point is a saddle point.

Define the second derivative of 𝑓 by setting 𝐷2 𝑓(x) = 𝐷(∇𝑓)(x). Since ∇𝑓 ∶ R𝑛 → R𝑛 , 𝐷2 𝑓 can
be represented by an 𝑛 × 𝑛 matrix,

        ⎛ 𝜕2𝑓/𝜕𝑥12     ⋯   𝜕2𝑓/𝜕𝑥𝑛𝜕𝑥1 ⎞
    H = ⎜     ⋮        ⋱       ⋮      ⎟,
        ⎝ 𝜕2𝑓/𝜕𝑥1𝜕𝑥𝑛   ⋯   𝜕2𝑓/𝜕𝑥𝑛2   ⎠

where 𝜕2𝑓/𝜕𝑥𝑖𝜕𝑥𝑗 = (𝜕/𝜕𝑥𝑖)(𝜕𝑓/𝜕𝑥𝑗). The symbol H is used as the above matrix is often referred to as
the Hessian matrix, or simply the Hessian. The shorthand notation 𝑓𝑥𝑖𝑥𝑗 ∶= 𝜕2𝑓/𝜕𝑥𝑖𝜕𝑥𝑗 is useful. Key results
include

• If 𝑓𝑥𝑖𝑥𝑗 , 𝑖, 𝑗 = 1, … , 𝑛, are defined and continuous throughout an open region containing x, then
  𝑓𝑥𝑖𝑥𝑗 (x) = 𝑓𝑥𝑗𝑥𝑖 (x).
• Suppose 𝑓 ∶ R𝑛 → R is twice differentiable at x0 , and its second order derivatives are continuous in
a region containing x0 . If x0 is a critical point, then

– 𝐷2 𝑓(x0 ) being positive definite implies that x0 is a local minimum;


– 𝐷2 𝑓(x0 ) being negative definite implies that x0 is a local maximum;

• If 𝐷2 𝑓(x0 ) is positive definite or negative definite, then it must be invertible. If 𝐷2 𝑓(x0 ) is neither
positive definite nor negative definite, but still invertible, then x0 is a non-degenerate saddle point.
If 𝐷2 𝑓(x0 ) is not invertible (including the semi-definite cases), then this is a degenerate case and
further analysis is required to determine the critical points type.

Constrained optimisation Suppose 𝑓 ∶ R𝑛 → R has a local extremum at x0 when restricted to the


constraint set 𝐶 = {x ∶ 𝑔(x) = 0}, where 𝑔 ∶ R𝑛 → R. Suppose also that x0 is not an end-point of 𝐶 ,
and ∇𝑔(x0 ) ≠ 0. Then there exists a number 𝜆0 such that (x0 , 𝜆0 ) is a critical point of

𝐿(x, 𝜆) = 𝑓(x) − 𝜆𝑔(x).

In the general case, take the Lagrange multiplier 𝜆 ∈ R𝑚 , and take the Lagrangian to be 𝐿(x, 𝜆) =
𝑓(x) − 𝜆 ⋅ 𝑔(x).
Classification follows by geometric considerations or use of a specialised second derivative test. Let 𝑓 ∶
R𝑛 → R be constrained by 𝑔(x) = 0 (where 𝑔 ∶ R𝑛 → R𝑚 ), with constraint set 𝐶 and Lagrangian 𝐿.
Define the bordered Hessian matrix to be:

         ⎛   0          𝐷𝑔                         ⎞
         ⎜           𝜕2𝐿/𝜕𝑥12     𝜕2𝐿/𝜕𝑥2𝜕𝑥1   …   ⎟
    H̄ =  ⎜ (𝐷𝑔)ᵀ     𝜕2𝐿/𝜕𝑥1𝜕𝑥2   𝜕2𝐿/𝜕𝑥22     …   ⎟
         ⎝   ⋮           ⋮            ⋮          ⋱  ⎠

where 0 is 𝑚 × 𝑚.

Calculate the leading principal minors of H̄ (x0 , 𝜆0 ) of order greater than or equal to 2𝑚 + 1. If these

• start with the sign of (−1)𝑚+1 and alternate, then the point is a local maximum for 𝑓|𝐶 .
• all have the sign of (−1)𝑚 , then the point is a local minimum for 𝑓|𝐶 .
• do not fall into either of the above two categories, then the test is inconclusive.
4.7 Appendix

The following renderings illustrate various functions R2 → R and their behaviour near critical points.

𝑓(𝑥, 𝑦) = cos(𝑥) cos(𝑦) 𝑒^(−√(𝑥2 + 𝑦2))

𝑓(𝑥, 𝑦) = (1/2)(||𝑥| − |𝑦|| − |𝑥| − |𝑦|)


𝑓(𝑥, 𝑦) = 𝑥2 𝑦/(𝑥2 + 𝑦2 )

𝑓(𝑥, 𝑦) = −6𝑦/(2 + 𝑥2 + 𝑦2 )
𝑓(𝑥, 𝑦) = 𝑒^(−𝑥2/2 − 𝑦2/5)

𝑓(𝑥, 𝑦) = 𝑥3 − 3𝑥𝑦2

𝑓(𝑥, 𝑦) = 𝑥2 − 𝑦2
𝑓(𝑥, 𝑦) = 1/(4𝑥2 + 𝑦2 )
𝑓(𝑥, 𝑦) = 𝑒−𝑦 cos(𝑥)


𝑓(𝑥, 𝑦) = 𝑥𝑦(𝑥2 − 𝑦2 )/(𝑥2 + 𝑦2 )


𝑓(𝑥, 𝑦) = 𝑦2 − 𝑦4 − 𝑥2
𝑓(𝑥, 𝑦) = sin(𝑥2 + 𝑦2 )/(𝑥2 + 𝑦2 )


𝑓(𝑥, 𝑦) = cos(𝑥) + 𝑦2
Chapter 5

Linear algebra

Chapter outcomes

At the end of the chapter you should be familiar with and be able to:

• Determine if a particular function defines an inner product or not.


• Select and apply appropriate theory to determine properties, including closest points, orthog-
onality, of vectors and vector spaces.
• Apply the Gram–Schmidt orthogonalisation process.
• Prove results involving inner product, orthogonality and projections.
• Analyse problems and select appropriate strategies to solve problems involving orthogonal
vectors, orthonormal vectors, vector spaces, inner products and projections.

Assumed background knowledge

Before starting this chapter it is assumed that you know the following material. (References are made to the
2019-20 printed notes.)

• MAT106: Linear Algebra. This chapter builds on significant parts of last year's Linear Algebra. Of
particular importance are

– Chapter 5: Vector spaces;


– Chapter 6: Linear independence, spanning and bases of vector spaces;
– Chapter 7: Subspaces;
– Chapter 8: Linear Transformations;
– Chapter 9: Kernels and Images;
– Chapter 10: Inverse of a linear transformation of a matrix;
– Chapter 12: Change of basis and equivalent matrices
– Chapter 13: Eigenvectors and eigenvalues.

• MA137: Mathematical Analysis. Elements of this module arise, for example, the triangle inequality.


• ST116: Mathematical Techniques. The mathematical thinking, language and writing developed is
fundamental to everything presented in this module. This includes:

– Language of sets
– Function notation, including image and pre-image.

Chapter Statistician: Calyampudi Radhakrishna Rao

Calyampudi Radhakrishna Rao, FRS was born in the state Karnataka in southern India. Perhaps his most
famous results are the Cramér–Rao bound, the Rao–Blackwell theorem, Fisher–Rao theorem and Rao dis-
tance. In 1991, The Journal of Quantitative Economics published a special issue in Rao’s honour, including
“Dr. Rao is a very distinguished scientist and a highly eminent statistician of our time. His contributions to
statistical theory and applications are well known, and many of his results, which bear his name, are included
in the curriculum of courses in statistics at bachelor’s and master’s level all over the world. He is an inspiring
teacher and has guided the research work of numerous students in all areas of statistics. His early work
had greatly influenced the course of statistical research during the last four decades. One of the purposes
of this special issue is to recognize Dr. Rao’s own contributions to econometrics and acknowledge his major
role in the development of econometric research in India.”

Figure 5.1: Calyampudi Radhakrishna Rao.

Image available via: https://upload.wikimedia.org/wikipedia/commons/9/97/Calyampudi_Radhakrishna_Rao_at_ISI_Chennai_%28cropped%29.JPG
Image licence: Creative Commons.
5.1 Vector spaces

We begin by recalling the definitions of vector space and some standard terminology from last year.

Definition 5.1 (Real Vector Space). A real vector space is a set 𝑉 equipped with two operations + ∶
𝑉 × 𝑉 → 𝑉 (addition) and ⋅ ∶ R × 𝑉 → 𝑉 (multiplication by scalars); that is, for every pair u, v ∈ 𝑉
and 𝑎 ∈ R, u + v and 𝑎v are defined and themselves in 𝑉 , that satisfies the following axioms:

1. (associativity of addition) u + (v + w) = (u + v) + w
2. (commutativity of addition) u + v = v + u
3. (existence of additive identity element) there exists an element 0 ∈ 𝑉 , called the zero vector, such
that v + 0 = v for all v ∈ 𝑉
4. (existence of additive inverse elements) for every v ∈ 𝑉 , there exists an element −v ∈ 𝑉 , called
the additive inverse of v, such that v + (−v) =0
5. (distributivity of scalar multiplication with respect to vector addition) 𝑎(u + v) = 𝑎 u + 𝑎v
6. (distributivity of scalar multiplication with respect to field addition) (𝑎 + 𝑏)v = 𝑎v + 𝑏v
7. (compatibility of scalar multiplication with field multiplication) 𝑎(𝑏v) = (𝑎𝑏)v
8. (identity element of scalar multiplication) 1v = v.

Note, in the above u, v, w are elements of 𝑉 and 𝑎, 𝑏 ∈ R. We can similarly define a complex vector
space by supposing 𝑎, 𝑏 ∈ C.

Example 5.1

1. R𝑛 is an 𝑛-dimensional vector space.


2. 𝑃𝑛 , the set of polynomials of degree ≤ 𝑛, is an (𝑛 + 1)-dimensional vector space.
3. The set of all functions 𝑓 ∶ R𝑛 → R𝑚 is an infinite-dimensional vector space. The subset of
all linear functions is a (finite-dimensional) subspace of this.
4. It is convenient to view {0}, i.e. the set containing only the zero vector, as a 0-dimensional
vector space.

Definition 5.2 (Subspace). A subset 𝑊 of a vector space 𝑉 is a subspace of 𝑉 if:

1. it is non-empty;
2. it is closed under addition;
3. it is closed under scalar multiplication.

Example 5.2

1. Let 𝑊 = {(𝑎, 0, 𝑏)T ∣ 𝑎, 𝑏 ∈ R}. Then 𝑊 is a subspace of R3 .


2. Let 𝑊 = {(𝑥, 𝑥 + 1)T ∣ 𝑥 ∈ R}. Is 𝑊 a subspace of R2 ?
Definition 5.3 (Span). For 𝑟 ≥ 1, given a set of vectors {v1 , … , v𝑟 } ⊆ 𝑉 , the set of all linear combi-
nations 𝜆1 v1 + ⋯ + 𝜆𝑟 v𝑟 forms a subspace of 𝑉 ; we call this the span of {v1 , … , v𝑟 }, and denote it
sp({v1 , … , v𝑟 }).
If sp({v1 , … , v𝑟 }) = 𝑉 , then we say that the vectors span 𝑉 .

Definition 5.4 (linearly (in)dependent). If 𝑆 = {v1 , … , v𝑟 } is such that 𝜆1 v1 + ⋯ + 𝜆𝑟 v𝑟 = 0 if and
only if 𝜆1 = ⋯ = 𝜆𝑟 = 0, then 𝑆 is said to be a linearly independent set. If there are non-zero
solutions, then we say that 𝑆 is linearly dependent.

Definition 5.5 (Basis). A basis is a set 𝑆 that is linearly independent and spans 𝑉 . A non-zero vector
space is finite-dimensional if there exists a finite set of vectors which forms a basis for that space,
otherwise it is infinite-dimensional. Any two bases for a finite-dimensional vector space have the same
number of vectors; we call that number the dimension.

Example 5.3
Find a basis of the subspace of R3 given by 𝑊 = {(𝑥, 𝑦, 𝑧)T ∣ 𝑥 + 𝑦 − 3𝑧 = 0}.
Question 5.4

1. Determine sp{(1, 2)T , (2, −3)T }.


2. Let 𝑊 = {(𝑎 + 2𝑏, 2𝑎 − 3𝑏)T ∣ 𝑎, 𝑏 ∈ R}. Is 𝑊 a subspace of R2 ?
3. Let 𝑊 = {(𝑎 + 2𝑏, 𝑎 + 1, 𝑎)T ∣ 𝑎, 𝑏 ∈ R}. Is 𝑊 a subspace of R3 ?
4. Let 𝑊 = { ⎛   2𝑎     𝑏  ⎞ ∣ 𝑎, 𝑏 ∈ R }. Is 𝑊 a subspace of the space of 2 × 2 matrices?
             ⎝ 3𝑎 + 𝑏   3𝑏 ⎠
5.2 Change of basis and linear maps


The canonical basis of R𝑛 is {e1 , … , e𝑛 }, where e𝑖 = (0, … , 0, 1, 0, … , 0)ᵀ (with the 1 being in the 𝑖th
place). For any x ∈ R𝑛 there are unique coefficients (𝑥𝑖)_{𝑖=1}^{𝑛} such that x = ∑_{𝑖=1}^{𝑛} 𝑥𝑖 e𝑖 .

Change of basis. Sometimes it is convenient to express vectors in R𝑛 in terms of another basis. In
particular, suppose we have a vector x = ∑_{𝑖=1}^{𝑛} 𝛼𝑖 e𝑖 , where a = (𝛼1 , … , 𝛼𝑛 )ᵀ is known, that we wish to
write in terms of the basis {e′1 , … , e′𝑛 }. We know that there exist unique coefficients b = (𝛽1 , … , 𝛽𝑛 )ᵀ
such that x = ∑_{𝑖=1}^{𝑛} 𝛽𝑖 e′𝑖 . But how do we compute (𝛽𝑖)_{𝑖=1}^{𝑛} ?

• First, since the e𝑖 form a basis and the e′𝑗 are vectors, we can write e′𝑗 = ∑_{𝑖=1}^{𝑛} 𝑏𝑖𝑗 e𝑖 for each 𝑗.
• Let B = (𝑏𝑖𝑗)_{𝑖,𝑗=1,…,𝑛} ; that is, the matrix whose columns are given by the vectors {e′1 , … , e′𝑛 }.
• Thus,

    x = ∑_{𝑖=1}^{𝑛} 𝛼𝑖 e𝑖 = ∑_{𝑗=1}^{𝑛} 𝛽𝑗 e′𝑗 = ∑_{𝑗=1}^{𝑛} 𝛽𝑗 ∑_{𝑖=1}^{𝑛} 𝑏𝑖𝑗 e𝑖 = ∑_{𝑖=1}^{𝑛} [∑_{𝑗=1}^{𝑛} 𝑏𝑖𝑗 𝛽𝑗] e𝑖 = ∑_{𝑖=1}^{𝑛} (Bb)𝑖 e𝑖 ,

  and as this holds for arbitrary x it follows that a = Bb.
• Multiplying by B⁻¹ yields b = B⁻¹a, and so we can read off the coefficients (𝛽𝑖)_{𝑖=1}^{𝑛} .

The same procedure can be applied to make a change of basis in any finite-dimensional vector space.
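
A minimal numeric sketch of this computation in R2 (the basis below is a made-up example):

```python
import numpy as np

# The columns of B are the new basis vectors e'_j written in the old basis,
# and the new coordinates are b = B^{-1} a.
B = np.array([[1.0, 1.0],
              [1.0, -1.0]])          # hypothetical new basis {(1,1), (1,-1)}
a = np.array([3.0, 1.0])             # coordinates of x in the old basis

b = np.linalg.solve(B, a)            # coordinates of x in the new basis
print(b)                             # here (2, 1): x = 2*(1,1) + 1*(1,-1)
print(B @ b)                         # recovers a, since a = B b
```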

Example 5.5
A basis for 𝑃2 is 𝐴 = {1, 𝑥, 𝑥2 } and another basis is 𝐵 = {𝑥 + 1, 𝑥 − 1, 2𝑥2 }.

1. Determine the change of basis matrix from the basis 𝐴 to the basis 𝐵 .
2. Consider 𝑓 ∈ 𝑃2 given by 𝑓(𝑥) = 𝑎 + 𝑏𝑥 + 𝑐𝑥2 . Write 𝑓 in terms of the basis 𝐴 and then
in terms of the basis 𝐵 .
In this example, the e′𝑖 is the basis 𝐴 and the e𝑖 is the basis 𝐵 .

Linear maps. If 𝐿 ∶ 𝑉 → 𝑉 is a linear map on a finite-dimensional vector space 𝑉 , then we can


represent 𝐿 by a matrix whose columns are the images under 𝐿 of the basis vectors of 𝑉 . Thus 𝐿 is
actually represented by a family of matrices, whose entries depend on the choice of basis. If 𝑉 = R𝑛 and
𝐿 is represented by A with respect to {e1 , … , e𝑛 }, then what matrix à represents 𝐿 with respect to the
alternative basis {e′1 , … , e′𝑛 }? Let a denote the representation of a vector v in basis (e𝑖)_{𝑖=1}^{𝑛} and b the
representation of v in basis (e′𝑖)_{𝑖=1}^{𝑛} .

• We know that a ↦ Aa (where vectors are written with respect to {e1 , … , e𝑛 }).
• We also have that b ↦ Ãb (where vectors are written with respect to {e′1 , … , e′𝑛 }).
• Using that b = B−1 a, the previous two lines imply b ↦ B−1 Aa, which yields b ↦ B−1 ABb.
• We conclude that à = B−1 AB.

Note that if A is symmetric matrix, and we take {e′1 , … , e′𝑛 } to be an orthonormal basis of eigenvectors,
then B−1 AB is diagonal, with entries given by the eigenvalues of A. Furthermore, if {e′1 , … , e′𝑛 } forms an
orthonormal basis, then B is an orthogonal matrix and B−1 = BT .

Example 5.6
Consider the linear transformation 𝐷 ∶ 𝑃2 → 𝑃2 that maps 𝑓 ∈ 𝑃2 to 𝑓𝑥 ; that is, the derivative of
𝑓 with respect to its variable which is denoted by 𝑥.

• Determine the matrix of 𝐷 with respect to the basis {1, 𝑥, 𝑥2 }.


• Determine the matrix of 𝐷 with respect to the basis {𝑥 + 1, 𝑥 − 1, 2𝑥2 }.
Question 5.7

1. Let 𝑇 ∶ R3 → R3 be defined by (𝑥, 𝑦, 𝑧)ᵀ ↦ (𝑥 + 𝑦 − 𝑧, 2𝑥 + 𝑧, 𝑥)ᵀ for all (𝑥, 𝑦, 𝑧)ᵀ ∈ R3 .


Find the matrix of 𝑇 with respect to the canonical basis of R3 .
2. Let 𝑉 = 𝑃2 , 𝐵 = {1, 1 + 𝑥, 1 + 𝑥 + 𝑥2 } and 𝐶 = {2 + 𝑥 + 𝑥2 , 𝑥 + 𝑥2 , 𝑥}. Both 𝐵
and 𝐶 are bases for 𝑉 . Determine the change of basis matrix from 𝐵 to 𝐶 .
3. Let 𝑇 ∶ R2 → R2 be defined by (𝑥, 𝑦)T ↦ (𝑥 − 𝑦, 2𝑥 + 𝑦)T . Determine the matrix of the
map 𝑇 with respect to

i) the canonical basis in the domain and codomain;


ii) the basis {(2, 1)T , (1, 2)T } in the domain and codomain.
Where does (6, 2)T map to? Ensure you check this using your earlier working.
5.3 Inner products


Let 𝑉 be a real vector space, and 𝑓 ∶ 𝑉 × 𝑉 → R a function. Write 𝑓(u, v) = ⟨u, v⟩. The function 𝑓 is an
inner product on 𝑉 if it satisfies:

linearity: ⟨𝑎u1 + 𝑏u2 , v⟩ = 𝑎⟨u1 , v⟩ + 𝑏⟨u2 , v⟩;


symmetry: ⟨u, v⟩ = ⟨v, u⟩;
positive definiteness: ⟨u, u⟩ > 0 for all u ≠ 0 and ⟨u, u⟩ = 0 if and only if u = 0.
A vector space 𝑉 equipped with an inner product is said to be an inner product space. We can associate
a norm or length with any inner product. We write ‖u‖ = √⟨u, u⟩; this is the norm of u. Similarly,
‖u − v‖ = √⟨u − v, u − v⟩ is the distance between u and v.

Example 5.8

1. In R𝑛 , the dot or scalar product ⟨u, v⟩ = 𝑢1 𝑣1 + ⋯ + 𝑢𝑛 𝑣𝑛 is an inner product; ‖u‖ =
   √(𝑢12 + ⋯ + 𝑢𝑛2). Since the latter is the usual definition of the length of a vector in R𝑛 , we call
   it the usual or Euclidean inner product. If we are discussing R𝑛 and do not explicitly mention
   another inner product, we mean this one!
2. If 𝑉 is the collection of continuous functions 𝑓 ∶ [𝑎, 𝑏] → R, then
    ⟨𝑓, 𝑔⟩ = ∫_𝑎^𝑏 𝑓(𝑥)𝑔(𝑥) 𝑑𝑥

defines an inner product on 𝑉 .


3. If 𝑉 is the collection of random variables 𝑋 such that E(𝑋) = 0 and Var(𝑋) < ∞, then
⟨𝑋1 , 𝑋2 ⟩ = Cov(𝑋1 , 𝑋2 ) is an inner product on 𝑉 .
4. If 𝑉 is the collection of 𝑛 × 𝑛 real-valued matrices, then ⟨A, B⟩ = trace(BT A) is an inner
product on 𝑉 .
Theorem 5.1 (Cauchy-Schwarz inequality). For any u, v ∈ 𝑉 , it holds that

⟨u, v⟩2 ≤ ⟨u, u⟩⟨v, v⟩.

Note that in R𝑛 , this says that

    (∑_{𝑖=1}^{𝑛} 𝑎𝑖 𝑏𝑖)2 ≤ (∑_{𝑖=1}^{𝑛} 𝑎𝑖2)(∑_{𝑖=1}^{𝑛} 𝑏𝑖2).
• We have equality in the Cauchy-Schwarz inequality if and only if u and v are linearly dependent, i.e. u =
  0 or v = 0 or there exists 𝜆 ∈ R such that u = 𝜆v.
• A consequence is the triangle inequality: ‖u + v‖ ≤ ‖u‖ + ‖v‖.

Parallelogram law We also have the parallelogram law: ‖u + v‖2 + ‖u − v‖2 = 2‖u‖2 + 2‖v‖2

For u, v ∈ 𝑉 \{0}, define 𝜃u,v ∈ [0, 𝜋] by

    cos(𝜃u,v) = ⟨u, v⟩ / (‖u‖ × ‖v‖).

This definition generalises the usual cosine rule for vectors in R𝑛 :


Definition 5.6.

1. We say that u, v are orthogonal (u ⟂ v) if and only if ⟨u, v⟩ = 0.
2. If 𝐸 ⊆ 𝑉 , then we say that u is orthogonal to 𝐸 (u ⟂ 𝐸 ) if and only if u is orthogonal to every v ∈ 𝐸 .
3. If 𝐸1 and 𝐸2 are subsets of 𝑉 , then they are said to be orthogonal (𝐸1 ⟂ 𝐸2 ) if ⟨v, w⟩ = 0 for all
   v ∈ 𝐸1 , w ∈ 𝐸2 .
4. Define 𝐸 ⟂ , the orthogonal complement of 𝐸 , by setting

    𝐸 ⟂ = {v ∈ 𝑉 ∶ ⟨v, w⟩ = 0, ∀w ∈ 𝐸} .

We note that 𝐸 ⟂ is a subspace of 𝑉 . Moreover, if 𝐸 is a subspace of 𝑉 , then 𝑉 = 𝐸 ⊕ 𝐸 ⟂.


Theorem 5.2. If u, v ∈ 𝑉 \{0} are orthogonal, then u and v are linearly independent.
Question 5.9

1. Show ⟨u, v⟩ = 3𝑢1 𝑣1 + 5𝑢2 𝑣2 defines an inner product on R2 .


2. Let
        U = ⎛ 𝑢1 𝑢2 ⎞ ,      V = ⎛ 𝑣1 𝑣2 ⎞
            ⎝ 𝑢3 𝑢4 ⎠            ⎝ 𝑣3 𝑣4 ⎠
be elements of 𝑀2 : the space of 2 × 2 matrices. Does ⟨U, V⟩ = 𝑢1 𝑣1 + 𝑢2 𝑣3 + 𝑢3 𝑣2 + 𝑢4 𝑣4
define an inner product on 𝑀2 ? Hint: Consider ⟨U, U⟩ when

        U = ⎛ 1 −1 ⎞          or          U = ⎛ 1/2   −1  ⎞ .
            ⎝ 1  1 ⎠                          ⎝  1   1/2  ⎠

3. Show ⟨𝑝, 𝑞⟩ = 𝑝(0)𝑞(0) + 𝑝(1/2)𝑞(1/2) + 𝑝(1)𝑞(1) defines an inner product on 𝑃2 , and find
   ‖1 + 𝑥 + 𝑥2 ‖.
5.4 Orthogonal/orthonormal bases

Recall, from Linear Algebra that a set 𝑆 of vectors is orthogonal if its elements are pairwise orthogonal.
The set is orthonormal if, in addition, ‖s‖ = 1 for all s ∈ 𝑆 . For example, the canonical basis for R𝑛 forms
an orthonormal basis.

Example 5.10
Show that if {u1 , … , u𝑛 } is an orthogonal set, then

‖u1 + ⋯ + u𝑛 ‖2 = ‖u1 ‖2 + ⋯ + ‖u𝑛 ‖2 . (5.1)

Note, this result is a generalisation of Pythagoras’ theorem.


If we have an orthogonal basis {v1 , … , v𝑛 }, then we can write x = ∑_{𝑖=1}^{𝑛} 𝑥𝑖 v𝑖 . So,

    ⟨x, v𝑖⟩ = ⟨∑_{𝑗=1}^{𝑛} 𝑥𝑗 v𝑗 , v𝑖⟩ = ∑_{𝑗=1}^{𝑛} 𝑥𝑗 ⟨v𝑗 , v𝑖⟩ = 𝑥𝑖 ⟨v𝑖 , v𝑖⟩.

Thus we obtain

    x = ∑_{𝑖=1}^{𝑛} (⟨x, v𝑖⟩ / ⟨v𝑖 , v𝑖⟩) v𝑖 .

We call 𝑐𝑖 = ⟨x, v𝑖⟩/⟨v𝑖 , v𝑖⟩ the (generalised) Fourier coefficient of x with respect to v𝑖 . Note that if ‖v𝑖‖ = 1
for every 𝑖 (so that the basis is orthonormal), then 𝑐𝑖 = ⟨x, v𝑖⟩. We say that 𝑐𝑖 v𝑖 is the component of x in
the direction v𝑖 .

From (5.1), we have that

    ‖x‖2 = ∑_{𝑖=1}^{𝑛} ‖𝑐𝑖 v𝑖‖2 = ∑_{𝑖=1}^{𝑛} 𝑐𝑖2 ‖v𝑖‖2 .      (5.2)

Theorem 5.3. Let {v1 , … , v𝑟 } be an orthogonal set (with each vector non-zero), and set 𝑆 =
sp({v1 , … , v𝑟 }). Then

    ‖ x − ∑_{𝑖=1}^{𝑟} 𝑎𝑖 v𝑖 ‖

is minimised by setting 𝑎𝑖 = 𝑐𝑖 ; i.e. ∑_{𝑖=1}^{𝑟} 𝑐𝑖 v𝑖 is the 'closest' vector in 𝑆 to x.
Example 5.11
Let 𝑊 = sp({(3, 1, −1, 1), (1, −1, 1, −1)}) ⊆ R4 . Find w ∈ 𝑊 closest to v = (3, 1, 5, 1).
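
A numeric sketch of this example (the two spanning vectors are orthogonal, so the Fourier coefficients give the closest point directly):

```python
import numpy as np

# Closest point of W = sp{w1, w2} to v, via c_i = <v, w_i>/<w_i, w_i>.
w1 = np.array([3.0, 1.0, -1.0, 1.0])
w2 = np.array([1.0, -1.0, 1.0, -1.0])
v = np.array([3.0, 1.0, 5.0, 1.0])

assert w1 @ w2 == 0                      # the given basis of W is orthogonal

w = (v @ w1) / (w1 @ w1) * w1 + (v @ w2) / (w2 @ w2) * w2
print(w)                                 # the closest vector in W to v
```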
The advantage of this approach is the relative ease of computing each 𝑐𝑖 , which requires only an inner
product. Nevertheless, if the provided vectors do not form an orthogonal set, further work is required.

Question 5.12
Let 𝑉 = sp{(1/2, 1/2, 1/2, 1/2), (−1/√6, −1/√6, 0, 2/√6), (1/√50, 3/√50, −6/√50, 2/√50)} ⊆ R4 .
Let x = (4, 6, 1, 1). Determine the vector v ∈ 𝑉 that minimises ‖x − v‖.
5.5 Gram-Schmidt orthogonalisation

Let 𝑉 be an inner product space, and let {v1 , … , v𝑛 } be a basis for 𝑉 . Can we define an orthogonal
basis {w1 , … , w𝑛 } with the property that 𝑊𝑟 = sp({w1 , … , w𝑟 }) is equal to sp({v1 , … , v𝑟 }) for each
𝑟 = 1, … 𝑛? (If we can, we can then of course produce an orthonormal basis with the same properties.)

Step 1: First, we need to take w1 ∈ sp({v1 }). We choose

w1 = v1 .

Step 2: We next want to find w2 ∈ sp({v1 , v2 }) such that ⟨w2 , w1⟩ = 0. We can do this by setting

    w2 = v2 − (⟨v2 , w1⟩ / ⟨w1 , w1⟩) w1 .

Generally: More generally, we need to find w𝑘 ∈ sp({v1 , … , v𝑘 }) such that ⟨w𝑘 , w𝑖 ⟩ = 0 for each
𝑖 = 1, … , 𝑘 − 1. We can do this by setting

w𝑘 = v𝑘 − Proj𝑊𝑘−1 (v𝑘 ),

where

    Proj_{𝑊𝑘−1}(v𝑘) = ∑_{𝑖=1}^{𝑘−1} (⟨v𝑘 , w𝑖⟩ / ⟨w𝑖 , w𝑖⟩) w𝑖
is the projection of v𝑘 onto the subspace 𝑊𝑘−1 . The following figure shows the situation for 𝑘 = 3.
It is possible to check that:

• we indeed have sp({w1 , … , w𝑟 }) = sp({v1 , … , v𝑟 });
• the set {w1 , … , w𝑛 } is orthogonal;
• each of the vectors in {w1 , … , w𝑛 } is non-zero.
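
A minimal sketch of the process for vectors in R𝑛 with the usual inner product (the input below uses the vectors of Example 5.13, which follows):

```python
import numpy as np

# Gram-Schmidt: w_k = v_k - sum_{i<k} (<v_k, w_i>/<w_i, w_i>) w_i.
def gram_schmidt(vectors):
    ws = []
    for v in vectors:
        w = v.astype(float)
        for u in ws:
            w = w - (v @ u) / (u @ u) * u   # subtract the projection onto u
        ws.append(w)                        # assumes the input is linearly
    return ws                               # independent, so no w is zero

vs = [np.array([1.0, 2.0, 3.0, 0.0]),
      np.array([1.0, 2.0, 0.0, 0.0]),
      np.array([1.0, 0.0, 0.0, 1.0])]
for w in gram_schmidt(vs):
    print(w)
```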

Example 5.13
Let
         ⎛ 1 ⎞          ⎛ 1 ⎞          ⎛ 1 ⎞
         ⎜ 2 ⎟          ⎜ 2 ⎟          ⎜ 0 ⎟
    v1 = ⎜ 3 ⎟ ,   v2 = ⎜ 0 ⎟ ,   v3 = ⎜ 0 ⎟ .
         ⎝ 0 ⎠          ⎝ 0 ⎠          ⎝ 1 ⎠
Find an orthogonal basis for the subspace 𝑊 spanned by {v1 , v2 , v3 }.
Critical point Gram-Schmidt orthogonalisation defines an orthogonal basis; this basis must be normalised
if one requires an orthonormal basis.

Question 5.14

1. Let
        u1 = (1/√2) ⎛ 1 ⎞ ,    u2 = (1/√2) ⎛ −1 ⎞ ,    x = ⎛ 3 ⎞ .
                    ⎝ 1 ⎠                  ⎝  1 ⎠          ⎝ 4 ⎠
Check that u1 , u2 is an orthonormal basis of R2 . Find 𝑎, 𝑏 such that x = 𝑎u1 + 𝑏u2 . Hence
deduce that ‖x‖2 = 𝑎2 + 𝑏2 .
2. What is the component of v = (1, 2, 3, 4) in the direction of w = (1, −3, 4, −2)?
3. Find an orthonormal basis for the subspace of R4 spanned by

         ⎛ 1 ⎞          ⎛ 1 ⎞          ⎛  1 ⎞
         ⎜ 1 ⎟          ⎜ 2 ⎟          ⎜ −3 ⎟
    v1 = ⎜ 1 ⎟ ,   v2 = ⎜ 4 ⎟ ,   v3 = ⎜  4 ⎟ .
         ⎝ 1 ⎠          ⎝ 5 ⎠          ⎝ −2 ⎠

4. Consider the vector space 𝑃3 ; that is, the collection of polynomials of the form 𝑎3 𝑡3 +
𝑎2 𝑡2 + 𝑎1 𝑡 + 𝑎0 . Endow this vector space with the inner product

    ⟨𝑓, 𝑔⟩ = ∫_0^1 𝑓(𝑥)𝑔(𝑥) 𝑑𝑥.

Let 𝑊 = sp({𝑡, 𝑡2 }). Find an orthonormal basis for 𝑊 .


5. Consider the vector space 𝑃2 , endowed with the inner product given by:

    ⟨𝑎2 𝑡2 + 𝑎1 𝑡 + 𝑎0 , 𝑏2 𝑡2 + 𝑏1 𝑡 + 𝑏0⟩ = 𝑎2 𝑏2 + 𝑎1 𝑏1 + 𝑎0 𝑏0 .

a. Find the cosine of the angle between p = 𝑥 + 𝑥2 and q = 7 + 3𝑥 + 3𝑥2 .
b. Show that u = 1 − 𝑥 + 2𝑥2 and v = 2𝑥 + 𝑥2 are orthogonal. Check also that these
   two vectors are not orthogonal with respect to the inner product introduced in part 4.
c. Find an orthonormal basis for sp({1, 𝑥 + 𝑥2 }).

6. Let 𝑊 = sp({(1, −1, 0, 0), (1, 2, 0, −1), (1, 0, 0, 1)}) ⊆ R4 . Find w ∈ 𝑊 closest to
   v = (0, 2, 1, 0).
5.6 Projections

A linear map P ∶ 𝑉 → 𝑉 is a projection if and only if P2 = P. Note, the map P is not invertible unless
P = I.

Example 5.15
Let {u1 , … , u𝑟 } be an orthonormal set, and let 𝑈 = sp({u1 , … , u𝑟 }). If we define
    P(x) = Proj𝑈(x) = ∑_{𝑖=1}^{𝑟} ⟨x, u𝑖⟩ u𝑖 ,

then P is a projection.
Recall that the kernel of a map, P ∶ 𝑉 → 𝑉 , is ker(P) = {x ∈ 𝑉 ∶ P(x) = 0} and its image is
im(P) = {x ∈ 𝑉 ∶ there exists v ∈ 𝑉 such that x = P(v)}.
Theorem 5.4. If P ∶ 𝑉 → 𝑉 is a projection, then the following statements hold.

1. ker(P) = im(I − P).

2. 𝑉 = im(P) ⊕ ker(P).
Note, if 𝑉 = 𝑈 ⊕ 𝑊 , then there exists a projection P with 𝑈 = im(P), 𝑊 = ker(P). Indeed, such a
projection is defined by noting v ∈ 𝑉 can be uniquely written as v = u + w, where u ∈ 𝑈 and w ∈ 𝑊 ,
and then setting P(v) = u.
Example 5.16
Check that the matrix
    P = ⎛ 1 0 ⎞
        ⎝ 4 0 ⎠
defines a projection. Compute its image and kernel, and show that R2 = im(P) ⊕ ker(P).
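
A quick numeric sketch of the checks this example asks for:

```python
import numpy as np

# Check that P is idempotent, and that R^2 = im(P) ⊕ ker(P).
P = np.array([[1.0, 0.0],
              [4.0, 0.0]])

print(np.allclose(P @ P, P))     # True: P is a projection
# im(P) is spanned by (1, 4)^T (the non-zero column); ker(P) by (0, 1)^T,
# since P (0, 1)^T = 0. These two vectors are linearly independent, so
# together they span R^2 (though not orthogonally: P^T != P here).
print(P @ np.array([0.0, 1.0]))  # the zero vector
```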
Definition 5.7. An orthogonal projection is a projection with im(P) ⟂ ker(P).


Note, the projection
𝑟
P(x) = Proj𝑈 (x) = ∑⟨x, u𝑖 ⟩u𝑖 ,
𝑖=1
264 CHAPTER 5. LINEAR ALGEBRA

is orthogonal.

Critical point An orthogonal projection is not the same as an orthogonal linear map. An orthogonal linear
map preserves an inner product on an inner product space. For example, in R2 with the standard Euclidean
inner product, a rotation about the origin is an orthogonal linear map.

Theorem 5.5. Let P be an 𝑛 × 𝑛 matrix representing a projection R𝑛 → R𝑛 , i.e. P2 = P. The following


three statements are equivalent:

1. The projection is orthogonal;


2. PᵀP = P;
3. Pᵀ = P.
Question 5.17

1. By appropriately applying Theorem 5.5 deduce that

        P = ⎛ 1/17    4/17 ⎞
            ⎝ 4/17   16/17 ⎠

   describes an orthogonal projection.

2. Show that

        P = ⎛ 4/13   6/13 ⎞
            ⎝ 6/13   9/13 ⎠

   represents an orthogonal projection.

a) Find imP and kerP, and verify that they are orthogonal.
b) The matrix represents an orthogonal projection from R2 onto a line 𝐿 in R2 . Write down
the equation of 𝐿.
5.7 The spectral decomposition theorem

Let {u1 , … , u𝑟 } be an orthonormal subset of R𝑛 . What matrix represents Proj𝑈 , the orthogonal projection
onto 𝑈 = sp({u1 , … , u𝑟 })?
In summary,

    Proj𝑈(x) = (∑_{𝑖=1}^{𝑟} u𝑖 u𝑖ᵀ) x.      (5.3)

Theorem 5.6 (The spectral decomposition theorem). A symmetric matrix A can be written as

A = 𝜆1 E1 + ⋯ + 𝜆𝑟 E𝑟 ,
where the 𝜆𝑖 are the distinct eigenvalues and the E𝑖 are the projections onto the corresponding
eigenspaces, with the following properties:

• E𝑖ᵀ = E𝑖 ;
• E1 + ⋯ + E𝑟 = I;
• E𝑖 E𝑗 = 0 for 𝑖 ≠ 𝑗.

The proof is accessible, but relegated to the Consolidation Questions.

Example 5.18
Let
    A = ⎛ 1 2 ⎞ .
        ⎝ 2 1 ⎠
Determine the spectral decomposition of A, verifying the conditions of the spectral decomposition
theorem.
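
A numeric sketch verifying the theorem's three properties for this matrix:

```python
import numpy as np

# Spectral decomposition of A = [[1, 2], [2, 1]]: eigenvalues -1 and 3,
# with E_i = u_i u_i^T for the orthonormal eigenvectors u_i.
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])
eigvals, U = np.linalg.eigh(A)           # columns of U are orthonormal

Es = [np.outer(U[:, i], U[:, i]) for i in range(2)]
recon = sum(lam * E for lam, E in zip(eigvals, Es))

print(np.allclose(recon, A))             # A = lambda_1 E_1 + lambda_2 E_2
print(np.allclose(sum(Es), np.eye(2)))   # E_1 + E_2 = I
print(np.allclose(Es[0] @ Es[1], 0))     # E_1 E_2 = 0
```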
Question 5.19

1. Find the matrix P which represents an orthogonal projection of R3 onto 𝑊 ∶= {(𝑥, 𝑦, 𝑧)T ∶
2𝑥 − 3𝑦 + 𝑧 = 0}. Find the image of (1, 2, 3)T under this projection. How far is the point
(1, 2, 3)T from 𝑊 ?
2. Determine the 3 × 3 matrix P which represents orthogonal projection of R3 onto the subspace
𝑈 generated by the orthonormal vectors
         ⎛ 1/3 ⎞          ⎛ −2/√5 ⎞
    u1 = ⎜ 2/3 ⎟ ,   u2 = ⎜  1/√5 ⎟ .
         ⎝ 2/3 ⎠          ⎝   0   ⎠

Find imP and kerP, and the point in 𝑈 closest to (8, 1, −5)T .
3. Let v1 = (1, 4, 2)𝑇 and v2 = (4, 9, 1)𝑇 . Let 𝑃 be the plane in R3 spanned by v1 and v2 .
a) Determine the matrix 𝑀 that represents orthogonal projection of R3 onto the plane 𝑃 .
b) Determine ker(𝑀 ) and the equation of the plane 𝑃 .

4. Apply the spectral decomposition theorem to find a decomposition for the matrix

        ⎛ 2 1  0 ⎞
    A = ⎜ 1 2  0 ⎟
        ⎝ 0 0 −1 ⎠
as a linear combination of orthogonal projections.
5.8 Application to simple linear regression


The problem of simple linear regression is as follows: given data (𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1, … , 𝑛, (for simplicity,
we suppose that the 𝑥𝑖 are distinct, but it is not difficult to drop this restriction,) find the linear function
𝑓(𝑥) = 𝛼 + 𝛽𝑥 that minimises:

    ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑓(𝑥𝑖))2 .
𝑖=1

We will show how this can be solved using linear algebra.
Let 𝑉 be the collection of functions from {𝑥1 , … , 𝑥𝑛 } to R. This is a vector space, and can be equipped
with an inner product defined by
⟨𝑓, 𝑔⟩ = ∑_{𝑖=1}^{𝑛} 𝑓(𝑥𝑖)𝑔(𝑥𝑖).
Note that we can define a function 𝑦 ∈ 𝑉 by setting 𝑦(𝑥𝑖 ) = 𝑦𝑖 . Moreover, if we let 𝑆 be the subspace of
linear functions, then the problem of simple linear regression is to find a function 𝑓 ∈ 𝑆 such that ‖𝑓 − 𝑦‖
is minimised. From the results of the section so far, we deduce that the solution is given by the orthogonal
projection, i.e. 𝑓 = Proj𝑆 (𝑦). To compute this, we need to find an orthogonal basis for 𝑆 .
To find such a basis, we start by noting that 𝑆 = sp({𝑓1, 𝑓2}), where 𝑓1(𝑥) = 1 and 𝑓2(𝑥) = 𝑥. We can then use Gram–Schmidt to find an orthogonal basis. In particular, set

𝑔1 = 𝑓1,
𝑔2 = 𝑓2 − (⟨𝑔1, 𝑓2⟩/⟨𝑔1, 𝑔1⟩) 𝑔1 = 𝑓2 − ((∑_{𝑖=1}^{𝑛} 𝑥𝑖)/𝑛) 𝑔1 = 𝑓2 − x̄𝑔1,

i.e. 𝑔1(𝑥) = 1, 𝑔2(𝑥) = 𝑥 − x̄.
Consequently, we deduce that

𝑓(𝑥) = ∑_{𝑖=1}^{2} (⟨𝑦, 𝑔𝑖⟩/⟨𝑔𝑖, 𝑔𝑖⟩) 𝑔𝑖(𝑥)
     = ((∑_{𝑖=1}^{𝑛} 𝑦𝑖)/𝑛) 𝑔1(𝑥) + ((∑_{𝑖=1}^{𝑛} 𝑦𝑖(𝑥𝑖 − x̄))/(∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)²)) 𝑔2(𝑥)
     = ȳ + ((∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ))/(∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)²)) (𝑥 − x̄),

giving the estimators

𝛽̂ = (∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ))/(∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)²),    𝛼̂ = ȳ − 𝛽̂x̄.
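These estimators can be checked against a generic least-squares solver; the sketch below uses numpy with simulated data (the sample size and noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=20)

# The projection-derived formulas for beta-hat and alpha-hat.
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Cross-check with numpy's least-squares solver on the design matrix (1, x).
design = np.column_stack([np.ones_like(x), x])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
assert np.allclose([alpha_hat, beta_hat], coeffs)
```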
5.9 Chapter 5 Consolidation Questions
Question 5.20
Write
A = ⎛ 0 1 1 ⎞
    ⎜ 1 0 1 ⎟
    ⎝ 1 1 0 ⎠
in the form A = RDRᵀ, where R is an orthogonal matrix, and D is a diagonal matrix.
Question 5.21
Suppose 𝑉 is the vector space of continuous functions [0, 1] → R, equipped with the inner product given by
⟨𝑓, 𝑔⟩ = ∫₀¹ 𝑓(𝑥)𝑔(𝑥) 𝑑𝑥.
Define ℎ ∈ 𝑉 by setting ℎ(𝑥) = 1 for every 𝑥 ∈ [0, 1], and let 𝑆 = sp({ℎ}). Given a function 𝑓 ∈ 𝑉, determine 𝑔 ∈ 𝑆⟂ and 𝑐 ∈ R such that 𝑓 = 𝑔 + 𝑐ℎ.
Question 5.22
Let A be an 𝑛 × 𝑛 symmetric matrix. Suppose A's eigenvalues are 𝜆1, …, 𝜆𝑛, with corresponding orthonormal eigenvectors u1, …, u𝑛. Let x ∈ R𝑛 and write x = ∑_{𝑖=1}^{𝑛} 𝑥𝑖u𝑖.
1. Prove that A = 𝜆1E1 + ⋯ + 𝜆𝑛E𝑛, where E𝑖 = u𝑖u𝑖ᵀ.
2. Prove that E𝑖E𝑗 = 0 for 𝑖 ≠ 𝑗.
3. Prove that (∑ E𝑖)x = Ix for all x.
5.10 Chapter 5 summary
Chapter outcomes review

Having reviewed the material and completed the assessment (formative and summative) material, you should be able to:

• Determine if a particular function defines an inner product or not.
• Select and apply appropriate theory to determine properties, including closest points and orthogonality, of vectors and vector spaces.
• Apply the Gram–Schmidt orthogonalisation process.
• Prove results involving inner products, orthogonality and projections.
• Analyse problems and select appropriate strategies to solve problems involving orthogonal vectors, orthonormal vectors, vector spaces, inner products and projections.

This chapter covers five topics.
Vector spaces

• Let 𝑉 be a real vector space. For 𝑟 ≥ 1, given a set of vectors {v1, …, v𝑟} ⊆ 𝑉, the set of all linear combinations 𝜆1v1 + ⋯ + 𝜆𝑟v𝑟 forms a subspace of 𝑉 called the span of {v1, …, v𝑟}.
• Denote the span of {v1, …, v𝑟} by sp({v1, …, v𝑟}). If sp({v1, …, v𝑟}) = 𝑉, then the vectors span 𝑉.
• If 𝑆 = {v1, …, v𝑟} is such that 𝜆1v1 + ⋯ + 𝜆𝑟v𝑟 = 0 if, and only if, 𝜆1 = ⋯ = 𝜆𝑟 = 0, then 𝑆 is said to be a linearly independent set. If there are non-zero solutions, then 𝑆 is linearly dependent.
• A basis is a set 𝑆 that is linearly independent and spans 𝑉.
• A non-zero vector space is finite-dimensional if there exists a finite set of vectors which forms a basis for that space; otherwise it is infinite-dimensional.
• Any two bases for a finite-dimensional vector space have the same number of vectors; this number is the dimension of the vector space.
• The canonical basis of R𝑛 is {e1, …, e𝑛}, where e𝑖 = (0, …, 0, 1, 0, …, 0)ᵀ, with the 1 in the 𝑖th place. Given any x ∈ R𝑛, there are unique coefficients 𝛼1, …, 𝛼𝑛 such that x = ∑_{𝑖=1}^{𝑛} 𝛼𝑖e𝑖.
Sometimes it is convenient to express a vector x = ∑_{𝑖=1}^{𝑛} 𝛼𝑖e𝑖 in terms of another basis {e′1, …, e′𝑛} such that x = ∑_{𝑖=1}^{𝑛} 𝛽𝑖e′𝑖. Key points are:

• Each e′𝑗 can be written e′𝑗 = ∑_{𝑖=1}^{𝑛} 𝑏𝑖𝑗e𝑖.
• Let B = (𝑏𝑖𝑗)_{𝑖,𝑗=1,…,𝑛}; that is, the matrix whose columns are given by the vectors {e′1, …, e′𝑛}. Then

  x = ∑_{𝑖=1}^{𝑛} 𝛼𝑖e𝑖 = ∑_{𝑗=1}^{𝑛} 𝛽𝑗e′𝑗 = ∑_{𝑗=1}^{𝑛} 𝛽𝑗 ∑_{𝑖=1}^{𝑛} 𝑏𝑖𝑗e𝑖 = ∑_{𝑖=1}^{𝑛} [∑_{𝑗=1}^{𝑛} 𝑏𝑖𝑗𝛽𝑗] e𝑖 = ∑_{𝑖=1}^{𝑛} (Bb)𝑖 e𝑖.

  Since this holds for arbitrary x, it follows that a = Bb. The matrix B is invertible, so b = B⁻¹a. This result permits reading off the coefficients 𝛽1, …, 𝛽𝑛, and the process works in any finite-dimensional vector space.
• If 𝐿 ∶ 𝑉 → 𝑉 is a linear map on a finite-dimensional vector space 𝑉, then we can represent 𝐿 by a matrix whose columns are the images under 𝐿 of the basis vectors of 𝑉. Thus 𝐿 is actually represented by a family of matrices, whose entries depend on the choice of basis. If 𝑉 = R𝑛 and 𝐿 is represented by A with respect to {e1, …, e𝑛}, then the matrix representing 𝐿 with respect to the alternative basis {e′1, …, e′𝑛} is given by A′ = B⁻¹AB.
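A small numerical sketch of this recipe (the basis B and map A below are illustrative choices, not taken from the notes):

```python
import numpy as np

B = np.array([[1.0, 1.0],
              [0.0, 1.0]])   # columns are the new basis vectors e'_1, e'_2
a = np.array([3.0, 2.0])     # coordinates of x in the canonical basis

b = np.linalg.solve(B, a)    # b = B^{-1} a: coordinates of x in the new basis
assert np.allclose(B @ b, a)

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])            # a linear map in the canonical basis
A_prime = np.linalg.solve(B, A @ B)   # A' = B^{-1} A B, the same map in the new basis
print(b, A_prime, sep="\n")
```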
Inner products Let 𝑉 be a real vector space, 𝑓 ∶ 𝑉 × 𝑉 → R a function. Write 𝑓(u, v) = ⟨u, v⟩. The function 𝑓 is an inner product on 𝑉 if it satisfies:

• linearity: ⟨𝑎u1 + 𝑏u2, v⟩ = 𝑎⟨u1, v⟩ + 𝑏⟨u2, v⟩;
• symmetry: ⟨u, v⟩ = ⟨v, u⟩;
• non-negativity: ⟨u, u⟩ ≥ 0, with equality if and only if u = 0.

A vector space 𝑉 equipped with an inner product is said to be an inner product space. Write ‖u‖ = √⟨u, u⟩; this is the norm of u and associates a ‘length’ with the inner product. Similarly, ‖u − v‖ = √⟨u − v, u − v⟩ is the distance between u and v.

• For any u, v ∈ 𝑉, the following inequality holds:

  ⟨u, v⟩² ≤ ⟨u, u⟩⟨v, v⟩.

  This is the Cauchy–Schwarz inequality. Equality holds in the Cauchy–Schwarz inequality if, and only if, u and v are linearly dependent; that is, u = 0 or v = 0 or there exists 𝜆 ∈ R such that u = 𝜆v.
• The Cauchy–Schwarz inequality implies the triangle inequality: ‖u + v‖ ≤ ‖u‖ + ‖v‖.
• The parallelogram law is: ‖u + v‖² + ‖u − v‖² = 2‖u‖² + 2‖v‖².
• For u, v ∈ 𝑉 ∖ {0}, define 𝜃_{u,v} ∈ [0, 𝜋] by

  cos(𝜃_{u,v}) = ⟨u, v⟩ / (‖u‖ ‖v‖).

  This generalises the usual cosine rule for vectors in R𝑛.
Orthogonal/orthonormal bases We say that u, v are orthogonal, and write u ⟂ v, if, and only if, ⟨u, v⟩ = 0. If 𝐸 ⊆ 𝑉, then u is orthogonal to 𝐸, written u ⟂ 𝐸, if, and only if, u is orthogonal to every v ∈ 𝐸. If 𝐸1 and 𝐸2 are subsets of 𝑉, then they are said to be orthogonal, written 𝐸1 ⟂ 𝐸2, if ⟨v, w⟩ = 0 for all v ∈ 𝐸1, w ∈ 𝐸2. Define 𝐸⟂, the orthogonal complement of 𝐸, by setting 𝐸⟂ = {v ∈ 𝑉 ∶ ⟨v, w⟩ = 0 for all w ∈ 𝐸}. Note that 𝐸⟂ is a subspace of 𝑉. Moreover, if 𝐸 is a subspace of 𝑉, then 𝑉 = 𝐸 ⊕ 𝐸⟂.

• If u, v ∈ 𝑉 ∖ {0} are orthogonal, then u and v are linearly independent.
A set 𝑆 of vectors is orthogonal if its elements are pairwise orthogonal. The set is orthonormal if, in
addition, ‖s‖ = 1 for all s ∈ 𝑆 . The canonical basis for R𝑛 is an example of an orthonormal basis.
• If we have an orthogonal basis {v1, …, v𝑛}, then we can write x = ∑_{𝑖=1}^{𝑛} 𝑥𝑖v𝑖 and show that 𝑥𝑖 = ⟨x, v𝑖⟩/⟨v𝑖, v𝑖⟩, the (generalised) Fourier coefficient of x with respect to v𝑖. The vector 𝑥𝑖v𝑖 is the component of x in the direction v𝑖.
Gram–Schmidt orthogonalisation Let 𝑉 be an inner product space, and let {v1, …, v𝑛} be a basis for 𝑉. A question is: does there exist an orthogonal basis {w1, …, w𝑛} with the property that 𝑊𝑟 = sp({w1, …, w𝑟}) is equal to sp({v1, …, v𝑟}) for each 𝑟 = 1, …, 𝑛? The answer is yes, and the steps are as follows.

• Take w1 ∈ sp({v1}). For convenience select w1 = v1.
• Now find w2 ∈ sp({v1, v2}) such that ⟨w2, w1⟩ = 0. Setting

  w2 = v2 − (⟨v2, w1⟩/⟨w1, w1⟩) w1

  yields an appropriate w2.
• In general, find w𝑘 ∈ sp({v1, …, v𝑘}) such that ⟨w𝑘, w𝑖⟩ = 0 for each 𝑖 = 1, …, 𝑘 − 1. Setting w𝑘 = v𝑘 − Proj_{𝑊𝑘−1}(v𝑘) works, where

  Proj_{𝑊𝑘−1}(v𝑘) = ∑_{𝑖=1}^{𝑘−1} (⟨v𝑘, w𝑖⟩/⟨w𝑖, w𝑖⟩) w𝑖

  is the projection of v𝑘 onto the subspace 𝑊𝑘−1.
• Let {v1, …, v𝑟} be an orthogonal set (with each vector non-zero), and set 𝑆 = sp({v1, …, v𝑟}). Then

  ‖x − ∑_{𝑖=1}^{𝑟} 𝑎𝑖v𝑖‖

  is minimised by setting 𝑎𝑖 = 𝑐𝑖, where 𝑐𝑖 = ⟨x, v𝑖⟩/⟨v𝑖, v𝑖⟩.
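The Gram–Schmidt steps above translate directly into code; a minimal sketch for R^n with the Euclidean inner product, tested on the regression basis from Section 5.8:

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthogonalise a list of linearly independent vectors in R^n."""
    ws = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        w = v.copy()
        for u in ws:
            w = w - (v @ u) / (u @ u) * u   # subtract the component of v along each earlier w_i
        ws.append(w)
    return ws

x = np.array([1.0, 2.0, 4.0])
g1, g2 = gram_schmidt([np.ones_like(x), x])   # f_1 = 1, f_2 = x
assert np.isclose(g1 @ g2, 0.0)
print(g2)   # x - x.mean(), i.e. [-4/3, -1/3, 5/3]
```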
A linear map P ∶ 𝑉 → 𝑉 is a projection if, and only if, P² = P. If P ∶ 𝑉 → 𝑉 is a projection, then ker(P) = im(I − P) and 𝑉 = im(P) ⊕ ker(P). Moreover, if 𝑉 = 𝑈 ⊕ 𝑊, then there exists a projection P with 𝑈 = im(P), 𝑊 = ker(P). An orthogonal projection is a projection with im(P) ⟂ ker(P). Some key results:
• Let P be an 𝑛 × 𝑛 matrix representing a projection R𝑛 → R𝑛; that is, P² = P. The following are equivalent:

  – The projection is orthogonal;
  – PᵀP = P;
  – Pᵀ = P.
• Let {u1, …, u𝑟} be an orthonormal subset of R𝑛. The matrix representing Proj𝑈, the orthogonal projection onto 𝑈 = sp({u1, …, u𝑟}), is given by

  Proj𝑈(x) = (∑_{𝑖=1}^{𝑟} u𝑖u𝑖ᵀ) x.
• Spectral decomposition theorem A symmetric matrix A can be written as A = 𝜆1E1 + ⋯ + 𝜆𝑟E𝑟, where the 𝜆𝑖 are the distinct eigenvalues and the E𝑖 are the projections onto the corresponding eigenspaces, with the following properties:

  – E𝑖ᵀ = E𝑖;
  – E1 + ⋯ + E𝑟 = I;
  – E𝑖E𝑗 = 0 for 𝑖 ≠ 𝑗.
Application to simple linear regression Let 𝑉 be the collection of functions from {𝑥1, …, 𝑥𝑛} to R. This is a vector space, and can be equipped with an inner product defined by ⟨𝑓, 𝑔⟩ = ∑_{𝑖=1}^{𝑛} 𝑓(𝑥𝑖)𝑔(𝑥𝑖). Define a function 𝑦 ∈ 𝑉 by setting 𝑦(𝑥𝑖) = 𝑦𝑖. Moreover, if we let 𝑆 be the subspace of linear functions, then the problem of simple linear regression is to find a function 𝑓 ∈ 𝑆 such that ‖𝑓 − 𝑦‖ is minimised. The solution is given by the orthogonal projection, i.e. 𝑓 = Proj𝑆(𝑦). To compute this, find an orthogonal basis for 𝑆 = sp({𝑓1, 𝑓2}), where 𝑓1(𝑥) = 1 and 𝑓2(𝑥) = 𝑥. Use Gram–Schmidt to show 𝑔1(𝑥) = 1 and 𝑔2(𝑥) = 𝑥 − x̄. Thus

𝑓(𝑥) = ∑_{𝑖=1}^{2} (⟨𝑦, 𝑔𝑖⟩/⟨𝑔𝑖, 𝑔𝑖⟩) 𝑔𝑖(𝑥) = ȳ + ((∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ))/(∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)²)) (𝑥 − x̄),

giving the estimators

𝛼̂ = ȳ − 𝛽̂x̄, where 𝛽̂ = (∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ))/(∑_{𝑖=1}^{𝑛} (𝑥𝑖 − x̄)²).
Chapter 6
Metric spaces
Chapter outcomes

At the end of the chapter you should be familiar with and be able to:

• recall and apply appropriate definitions to problems involving metric spaces;
• state and prove theorems involving metric spaces;
• evaluate a problem and then apply appropriate techniques to solve problems involving metric spaces.
Assumed background knowledge

Before starting this chapter it is assumed that you know the following material.

• MA137: Mathematical Analysis. The notation, definitions and theory of
  – Convergence, particularly limits of sequences and limits of functions.
  – Continuity of functions.
• ST116: Mathematical Techniques. The mathematical thinking, language and writing developed is fundamental to everything presented in this module. This includes:
  – Language of sets.
  – Function notation.
• General mathematical knowledge. We will also use some calculus, for example, finding maxima.
Chapter Statistician: Florence Nightingale
Although Florence Nightingale is remembered for her role in founding the modern nursing profession, she was a gifted mathematician, described as ‘a true pioneer in the graphical representation of statistics’. She used graphical representations in presentations to Parliament and civil servants. Her work improved public health through, for example, the Public Health Acts of 1874 and 1875. In 1859, she was elected a member of the Royal Statistical Society. In 1874, she became an honorary member of the American Statistical Association.
Figure 6.1: Florence Nightingale.

Image available via: https://en.wikipedia.org/wiki/File:Florence_Nightingale_(H_Hering_NPG_x82368).jpg
Image licence: The image is in the public domain.
6.1 Introduction to metric spaces
We have been considering inner products on vector spaces, which give us a concept of the distance between two vectors in the space. Is it possible to have a notion of distance without all the structure of a vector space?
Definition 6.1 (Metric Space). Let 𝑆 be a non-empty set, and let 𝑑 ∶ 𝑆 × 𝑆 → [0, ∞) be such that, for all 𝑥, 𝑦, 𝑧 ∈ 𝑆:

• (M1) 𝑑(𝑥, 𝑦) = 0 if and only if 𝑥 = 𝑦;
• (M2) 𝑑(𝑥, 𝑦) = 𝑑(𝑦, 𝑥);
• (M3) 𝑑(𝑥, 𝑧) ≤ 𝑑(𝑥, 𝑦) + 𝑑(𝑦, 𝑧).

We then say that 𝑑 is a metric and (𝑆, 𝑑) is a metric space.
Notice that the codomain of 𝑑 is [0, ∞), so:

• 𝑑(𝑥, 𝑦) ≥ 0 for all 𝑥, 𝑦 ∈ 𝑆.

Some texts make this an explicit requirement as a fourth axiom (but you can in fact deduce it from (M1)–(M3)).
Example 6.1

1. If 𝑆 = 𝑉, a vector space equipped with an inner product, then 𝑑(𝑥, 𝑦) = ‖𝑥 − 𝑦‖ is a metric.
2. For any non-empty set 𝑆, define 𝑑(𝑥, 𝑦) = 0 if 𝑥 = 𝑦, and 𝑑(𝑥, 𝑦) = 1 otherwise; then 𝑑 is a metric. (This is the discrete metric.)
3. Let 𝑆 be the collection of functions 𝑓 ∶ [𝑎, 𝑏] → R that are bounded; then 𝑑(𝑓, 𝑔) = sup_{𝑥∈[𝑎,𝑏]} |𝑓(𝑥) − 𝑔(𝑥)| is a metric. (This is the uniform metric.)
Important metrics on R𝑛 If 𝑆 = R𝑛, then the Euclidean distance

𝑑(x, y) = (∑_{𝑖=1}^{𝑛} |𝑥𝑖 − 𝑦𝑖|²)^{1/2}

is a metric. More generally, if 𝑝 ≥ 1, then

𝑑𝑝(x, y) = (∑_{𝑖=1}^{𝑛} |𝑥𝑖 − 𝑦𝑖|^𝑝)^{1/𝑝}

is a metric. Furthermore, so is

𝑑∞(x, y) = max_{𝑖=1,…,𝑛} |𝑥𝑖 − 𝑦𝑖|.

The Metric Spaces module (if taken) discusses these in much more detail. Metric spaces can be found in many different areas of mathematics. For example, let 𝑆 be the vertices of a connected graph; then

𝑑(𝑥, 𝑦) = #{edges on the shortest path between 𝑥 and 𝑦}

is a metric.
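These distances are straightforward to compute; a short sketch (numpy assumed) for the 𝑑𝑝 family and its max-metric limit:

```python
import numpy as np

def d_p(x, y, p):
    """The l_p metric on R^n, for p >= 1."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def d_inf(x, y):
    """The max metric, the limit of d_p as p grows."""
    return np.max(np.abs(x - y))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(d_p(x, y, 1))   # 7.0 (taxicab)
print(d_p(x, y, 2))   # 5.0 (Euclidean)
print(d_inf(x, y))    # 4.0
```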
Definition 6.2 (Bounded). A subset 𝐴 ⊆ 𝑆 is bounded if its diameter, diam(𝐴) = sup_{𝑥,𝑦∈𝐴} 𝑑(𝑥, 𝑦), is finite.
Example 6.2
Equip R with the metric 𝑑(𝑥, 𝑦) = |𝑥 − 𝑦|. Show that [0, 2] and [0, 2) are bounded. Is [0, ∞)
bounded?
Question 6.3

1. Explain why the following are not metrics on R:
   a. 𝑑(𝑥, 𝑦) = |𝑥² − 𝑦²|;
   b. 𝑑(𝑥, 𝑦) = 𝑥 − 𝑦 if 𝑥 ≥ 𝑦, and 𝑑(𝑥, 𝑦) = (𝑦 − 𝑥)/2 otherwise;
   c. 𝑑(𝑥, 𝑦) = ||𝑥| − |𝑦||.
2. Show that 𝑑(𝑥, 𝑦) = 2|𝑥| + 2|𝑦| if 𝑥 ≠ 𝑦, and 𝑑(𝑥, 𝑥) = 0, is a metric on R.
3. Prove that diam(𝐴) = 0 if and only if 𝐴 consists of a single point.
6.2 Open and closed sets
Let (𝑆, 𝑑) be a metric space, 𝑥 ∈ 𝑆, 𝑟 > 0. The open ball centred at 𝑥 of radius 𝑟 is 𝐵(𝑥, 𝑟) = {𝑦 ∈ 𝑆 ∶ 𝑑(𝑥, 𝑦) < 𝑟}. The closed ball centred at 𝑥 of radius 𝑟 is

B̄(𝑥, 𝑟) = {𝑦 ∈ 𝑆 ∶ 𝑑(𝑥, 𝑦) ≤ 𝑟}.
Example 6.4

1. In (R, 𝑑𝑝), we have 𝐵(𝑥, 𝜀) = (𝑥 − 𝜀, 𝑥 + 𝜀).
2. In (R², 𝑑2), 𝐵(x, 𝜀) is the inside of a disc centred at x, radius 𝜀.
3. In (R², 𝑑1), 𝐵(x, 𝜀) = {y ∶ |𝑥1 − 𝑦1| + |𝑥2 − 𝑦2| < 𝜀} – this is a diamond.
4. In (R², 𝑑∞), 𝐵(x, 𝜀) = {y ∶ max{|𝑥1 − 𝑦1|, |𝑥2 − 𝑦2|} < 𝜀} – this is a square.
5. If 𝑑 is the discrete metric on 𝑆, then 𝐵(𝑥, 𝜀) = {𝑥} if 𝜀 ≤ 1, and 𝐵(𝑥, 𝜀) = 𝑆 otherwise.
Example 6.5

Assume that 𝑑, defined as follows, is a metric on R:

𝑑(𝑥, 𝑦) = 0 if 𝑥 = 𝑦, and 𝑑(𝑥, 𝑦) = |𝑥 − 1| + |𝑦 − 1| + 2|𝑥 − 𝑦| otherwise.

Find 𝐵(0, 𝛿), 𝐵(0, 1) and 𝐵(1, 𝛿).
Definition 6.3 (Open and Closed sets). A subset 𝐴 ⊆ 𝑆 is open if for every 𝑥 ∈ 𝐴, there exists an 𝜀 > 0 such that 𝐵(𝑥, 𝜀) ⊆ 𝐴. A subset is closed if it is the complement of an open set.
Example 6.6

As a subset of (R, 𝑑1), is {1/𝑛 ∶ 𝑛 ∈ N} open or closed?

Recall,

𝑑𝑝(x, y) = (∑_{𝑖=1}^{𝑛} |𝑥𝑖 − 𝑦𝑖|^𝑝)^{1/𝑝}.
Question 6.7

1. As a subset of (R, 𝑑1), is Q open or closed?
2. Let 𝑆 = [0, ∞) and 𝑑 ∶ 𝑆 × 𝑆 → R be defined by

   𝑑(𝑥, 𝑦) = 0 if 𝑥 = 𝑦, and 𝑑(𝑥, 𝑦) = 1 + |𝑥 − 𝑦| + 𝑥 + 𝑦 if 𝑥 ≠ 𝑦.

   a) Show that 𝑑 is a metric on 𝑆.
   b) Determine the open balls 𝐵(𝑥, 1), 𝐵(0, 2) and 𝐵(1, 𝛿), for 𝑥 ∈ 𝑆 and 𝛿 > 0.
3. Let 𝑑 ∶ R² × R² → [0, ∞) be defined by

   𝑑(x, y) = max{‖x‖, ‖y‖} if x ≠ y, and 𝑑(x, y) = 0 otherwise,

   where ‖x‖ is the Euclidean norm of x = (𝑥1, 𝑥2).

   a) Show that 𝑑 is a metric on R².
   b) Determine the balls 𝐵((0, 0), 1) and 𝐵((1, 0), 1).
   c) Is the set 𝐴 = {x ∈ R² ∶ 𝑑((0, 0), x) ≤ 1} open in (R², 𝑑)?
6.3 Convergence
Let (𝑥𝑛)_{𝑛≥1} be a sequence in a metric space (𝑆, 𝑑). We say that (𝑥𝑛) converges to 𝑥 ∈ 𝑆 if and only if 𝑑(𝑥𝑛, 𝑥) → 0; that is,

∀𝜀 > 0, ∃𝑛0 ∈ N such that 𝑥𝑛 ∈ 𝐵(𝑥, 𝜀), ∀𝑛 ≥ 𝑛0.

We do not assume that (𝑥𝑛) is a sequence of real numbers in R. This definition is general, as the next example illustrates.
Example 6.8

Let 𝑆 be the collection of functions 𝑓 ∶ [0, 1] → R that are bounded and 𝑑(𝑓, 𝑔) = sup_{𝑥∈[0,1]} |𝑓(𝑥) − 𝑔(𝑥)|. Define a sequence 𝑓𝑛 ∶ [0, 1] → R by 𝑓𝑛(𝑥) = 𝑛²𝑥(1 − 𝑥)ⁿ for all 𝑥 ∈ [0, 1].

1. Find a function 𝑓 ∶ [0, 1] → R such that 𝑓𝑛(𝑥) → 𝑓(𝑥) for every 𝑥 ∈ [0, 1].
2. Determine if (𝑓𝑛) converges to 𝑓 in (𝑆, 𝑑).
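A numerical probe of this example is sketched below. It is not a proof, but it suggests the answer to part 2: the pointwise limit is 0, yet the sup distance grows.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 100_001)

def f_n(n):
    return n**2 * xs * (1 - xs) ** n

for n in (10, 100, 1000):
    # For each fixed x, f_n(x) -> 0, but sup_x |f_n(x)| is roughly n/e,
    # attained near x = 1/(n + 1): no convergence in the uniform metric d.
    print(n, f_n(n).max())
```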
Question 6.9

Let 𝑆 be the collection of functions 𝑓 ∶ [0, 1] → R that are bounded and 𝑑(𝑓, 𝑔) = sup_{𝑥∈[0,1]} |𝑓(𝑥) − 𝑔(𝑥)|. Consider the sequence of functions 𝑓𝑛 ∶ [0, 1] → R defined by

𝑓𝑛(𝑥) = 𝑛𝑥 / (1 + 𝑛²𝑥²).

1. Sketch the graphs of 𝑓1, 𝑓2 and 𝑓3 on the interval [0, 1].
2. Find 𝑓 ∶ [0, 1] → R such that 𝑓𝑛(𝑥) → 𝑓(𝑥) for every 𝑥 ∈ [0, 1].
3. Determine if (𝑓𝑛) converges to 𝑓 in (𝑆, 𝑑).
4. Let 𝑎 ∈ (0, 1) be fixed. Does (𝑓𝑛) converge to 𝑓 in (𝑆, 𝑑) on the interval [𝑎, 1]?
6.4 Compactness
Chapter 1 introduced the concept of compactness in R². For this module, you may take the following as the definition of compactness for a metric space. (Note, the name ‘compact’ occurs in more general spaces, where the definition is different; in a metric space that alternative definition is equivalent to sequential compactness.)
Definition 6.4 (Sequentially compact). A subset 𝐴 ⊆ 𝑆 is compact if and only if it is sequentially compact, i.e. every sequence (𝑥𝑛)_{𝑛≥1} in 𝐴 has a subsequence (𝑥_{𝑛𝑖})_{𝑖≥1} converging to a point of 𝐴.
The real line R with the metric 𝑑(𝑥, 𝑦) = |𝑥 − 𝑦| is not compact. For example, the sequence 𝑎𝑛 = 𝑛 for all 𝑛 ∈ N does not have a convergent subsequence; every subsequence is unbounded and hence not convergent.
Determining if a general metric space is compact can be difficult. In R𝑛 with the Euclidean metric, there is a complete answer. This is the result utilised in Chapter 1.
Theorem 6.1 (Heine–Borel theorem). A subset of R𝑛 (equipped with the Euclidean metric) is compact if and only if it is closed and bounded.
Chapter 1 provides further examples of compact and non-compact subsets of R². In particular, [0, 1] is compact and [0, 1) is not compact. The next example is similar to those in Chapter 1; you will find it useful to review the examples presented there.

Example 6.10
Is the set {(𝑥, 𝑦) ∈ R² ∶ 𝑥 − 𝑦 ≥ 0, 𝑥² + 𝑦² ≤ 1} compact?
Question 6.11
Define subsets of R² given by

𝐴 = {(𝑥, 𝑦) ∈ R² ∶ 𝑥² + 𝑦² ≥ 1, |𝑥| + |𝑦| ≤ 2},
𝐵 = {(𝑥, 𝑦) ∈ R² ∶ 2|𝑥| + |𝑦| < 1, 𝑥𝑦 < 0},
𝐶 = {(𝑥, 𝑦) ∈ R² ∶ 0 < 𝑥 < 2𝜋, 𝑦 = sin(𝑥)},

where 𝑑2 is the usual (Euclidean) metric given by

𝑑2((𝑥1, 𝑦1), (𝑥2, 𝑦2)) = √((𝑥1 − 𝑥2)² + (𝑦1 − 𝑦2)²).

Decide whether 𝐴, 𝐵 or 𝐶 are compact in (R², 𝑑2).
6.5 Continuity
Recall from Year 1 analysis that:

• A function 𝑓 ∶ 𝐴(⊆ R) → R is continuous at 𝑎 ∈ 𝐴 if given 𝜀 > 0, there exists 𝛿 > 0 such that |𝑥 − 𝑎| < 𝛿 implies |𝑓(𝑥) − 𝑓(𝑎)| < 𝜀.
• A function 𝑓 ∶ 𝐴(⊆ R) → R is sequentially continuous at 𝑎 ∈ 𝐴 if given any sequence (𝑥𝑛) ⊆ 𝐴 that converges to 𝑎, (𝑓(𝑥𝑛)) → 𝑓(𝑎).
• In R, a function is sequentially continuous at 𝑎 ∈ 𝐴 if and only if it is continuous at 𝑎.
This section generalises these ideas to a metric space. A function 𝑓 ∶ (𝑆, 𝑑) → (𝑆′, 𝑑′) is (𝑑, 𝑑′)-continuous at 𝑥 ∈ 𝑆 if and only if: ∀𝜀 > 0, ∃𝛿 > 0 such that

𝑑(𝑥, 𝑦) < 𝛿 ⇒ 𝑑′(𝑓(𝑥), 𝑓(𝑦)) < 𝜀.
Theorem 6.2 (Continuous). A function 𝑓 ∶ (𝑆, 𝑑) → (𝑆′, 𝑑′) is (𝑑, 𝑑′)-continuous at 𝑥 ∈ 𝑆 if and only if, for every sequence (𝑥𝑛) in 𝑆,

𝑑(𝑥𝑛, 𝑥) → 0 ⇒ 𝑑′(𝑓(𝑥𝑛), 𝑓(𝑥)) → 0.

The proof of this result is omitted and not examinable.
Example 6.12

1. On R, the following defines a metric:

   𝑑(𝑥, 𝑦) = 0 if 𝑥 = 𝑦, and 𝑑(𝑥, 𝑦) = |𝑥| + |𝑦| otherwise.

   Show that 𝑓 ∶ (R, 𝑑) → (R, 𝑑) defined by 𝑥 ↦ 1 + 𝑥 is not continuous at 0.
2. Let 𝑆 be the collection of continuous functions 𝑓 ∶ [𝑎, 𝑏] → R and 𝑑(𝑓, 𝑔) = sup_{𝑥∈[𝑎,𝑏]} |𝑓(𝑥) − 𝑔(𝑥)|. Show that the function 𝐼 ∶ (𝑆, 𝑑) → (R, 𝑑1) defined by

   𝐼(𝑓) = ∫_𝑎^𝑏 𝑓(𝑥) 𝑑𝑥

   is continuous. (Note, it will be helpful to recall |∫_𝑎^𝑏 𝑓(𝑥) 𝑑𝑥| ≤ ∫_𝑎^𝑏 |𝑓(𝑥)| 𝑑𝑥.)
Question 6.13

1. On R, the following defines a metric:

   𝑑(𝑥, 𝑦) = 0 if 𝑥 = 𝑦, and 𝑑(𝑥, 𝑦) = 2|𝑥| + 2|𝑦| + |𝑥 − 𝑦| otherwise.

   Show that 𝑓 ∶ (R, 𝑑) → (R, 𝑑) defined by 𝑥 ↦ 1 + 𝑥 is not continuous at 0.
2. On R, the following defines a metric:

   𝑑(𝑥, 𝑦) = |𝑥 − 𝑦| + 1 if 𝑥 < 0 ≤ 𝑦 or 𝑦 < 0 ≤ 𝑥, and 𝑑(𝑥, 𝑦) = |𝑥 − 𝑦| otherwise.

   Show that 𝑓 ∶ (R, 𝑑) → (R, 𝑑) defined by 𝑥 ↦ −𝑥 is not continuous at 0.
6.6 Chapter 6 Consolidation Questions
Question 6.14

1. Explain why the following are not metrics on R:
   a. 𝑑(𝑥, 𝑦) = |𝑥 − 𝑦|³;
   b. 𝑑(𝑥, 𝑦) = 2|𝑥| + 3|𝑦| if 𝑥 ≠ 𝑦, and 𝑑(𝑥, 𝑥) = 0.
2. Show that 𝑑(𝑥, 𝑦) = |𝑥 − 𝑦| + 1 if 𝑥 < 0 ≤ 𝑦 or 𝑦 < 0 ≤ 𝑥, and 𝑑(𝑥, 𝑦) = |𝑥 − 𝑦| otherwise, is a metric on R.
3. Is 𝑑(x, y) = min_{𝑖=1,…,𝑛} |𝑥𝑖 − 𝑦𝑖| a metric on R𝑛?
Question 6.15

Prove that the union of two bounded sets 𝐴 and 𝐵 in a metric space, with 𝐴 ∩ 𝐵 ≠ ∅, is a bounded set.
(Note, this result holds in the case when 𝐴 ∩ 𝐵 = ∅, but then we need to define the distance between two sets, hence the weaker result presented here.)
Question 6.16

1. Assume that 𝑑, defined as follows, is a metric on R:

   𝑑(𝑥, 𝑦) = 0 if 𝑥 = 𝑦, and 𝑑(𝑥, 𝑦) = 2|𝑥| + 2|𝑦| + |𝑥 − 𝑦| otherwise.

   Find 𝐵(0, 𝛿), 𝐵(0, 1) and 𝐵(1, 𝛿).
2. Assume that 𝑑, defined as follows, is a metric on R²:

   𝑑(x, y) = 0 if x = y; 𝑑(x, y) = 𝑑2(x, y) if x, y and 0 are collinear (lie on a line) with x ≠ y; and 𝑑(x, y) = 𝑑2(x, 0) + 𝑑2(0, y) otherwise.

   Sketch 𝐵(0, 1), 𝐵((2, 2), 1) and 𝐵((1, 2), 4).
6.7 Chapter 6 summary
Chapter outcomes review

Having reviewed the material and completed the assessment (formative and summative) material, you should be able to:

• recall and apply appropriate definitions to problems involving metric spaces;
• state and prove theorems involving metric spaces;
• evaluate a problem and then apply appropriate techniques to solve problems involving metric spaces.
This chapter covers key topics in metric spaces. The Metric Spaces module provides greater coverage.
Metric spaces. Let 𝑆 be a non-empty set and let 𝑑 ∶ 𝑆 × 𝑆 → [0, ∞) be such that, for all 𝑥, 𝑦, 𝑧 ∈ 𝑆:

• (M1) 𝑑(𝑥, 𝑦) = 0 if, and only if, 𝑥 = 𝑦;
• (M2) 𝑑(𝑥, 𝑦) = 𝑑(𝑦, 𝑥);
• (M3) 𝑑(𝑥, 𝑧) ≤ 𝑑(𝑥, 𝑦) + 𝑑(𝑦, 𝑧).

The function 𝑑 is a metric and (𝑆, 𝑑) is a metric space.

A subset 𝐴 ⊆ 𝑆 is bounded if its diameter, diam(𝐴) = sup_{𝑥,𝑦∈𝐴} 𝑑(𝑥, 𝑦), is finite.
Open and closed sets. Let (𝑆, 𝑑) be a metric space, 𝑥 ∈ 𝑆, 𝑟 > 0. The open ball centred at 𝑥 of radius 𝑟 is 𝐵(𝑥, 𝑟) = {𝑦 ∈ 𝑆 ∶ 𝑑(𝑥, 𝑦) < 𝑟}. A subset 𝐴 ⊆ 𝑆 is open in 𝑆 if for every 𝑥 ∈ 𝐴, there exists an 𝜀 > 0 such that 𝐵(𝑥, 𝜀) ⊆ 𝐴. A subset is closed if it is the complement of an open set in 𝑆.

• A set can be both open and closed.
• Every open ball is open and every closed ball is closed.
• The empty set ∅ and the entire set 𝑆 are both open, and hence both closed in 𝑆.
Convergence and compactness. Let (𝑥𝑛)_{𝑛≥1} be a sequence in a metric space (𝑆, 𝑑). We say that 𝑥𝑛 converges to 𝑥 ∈ 𝑆 if, and only if, 𝑑(𝑥𝑛, 𝑥) → 0; that is, for all 𝜀 > 0, there exists 𝑁 ∈ N such that

𝑥𝑛 ∈ 𝐵(𝑥, 𝜀) for all 𝑛 ≥ 𝑁.
• A subset 𝐴 ⊆ 𝑆 is sequentially compact if every sequence (𝑥𝑛)_{𝑛≥1} has a convergent subsequence (𝑥_{𝑛𝑖})_{𝑖≥1}. (Convergence is to a point in 𝐴.)
• A subset 𝐴 ⊆ 𝑆 is compact if, and only if, it is sequentially compact.
• Heine–Borel Theorem. A subset of R𝑛 (equipped with the Euclidean metric) is compact if, and only if, it is closed and bounded.
Continuity. Let 𝑓 ∶ (𝑆, 𝑑) → (𝑆′, 𝑑′); then 𝑓 is (𝑑, 𝑑′)-continuous at 𝑥 ∈ 𝑆 if, and only if, for all 𝜀 > 0, there exists 𝛿 > 0 such that

𝑑(𝑥, 𝑦) < 𝛿 ⟹ 𝑑′(𝑓(𝑥), 𝑓(𝑦)) < 𝜀.
• A function 𝑓 ∶ (𝑆, 𝑑) → (𝑆′, 𝑑′) is (𝑑, 𝑑′)-continuous at 𝑥 ∈ 𝑆 if, and only if, for every sequence (𝑥𝑛) in 𝑆 converging to 𝑥,

  𝑑(𝑥𝑛, 𝑥) → 0 ⟹ 𝑑′(𝑓(𝑥𝑛), 𝑓(𝑥)) → 0.