
ADVANCED MATHEMATICS

DLMDSAM01
ADVANCED MATHEMATICS
MASTHEAD

Publisher:
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt

Mailing address:
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
media@iu.org
www.iu.de

DLMDSAM01
Version No.: 002-2023-0901
Shiela Miller

© 2023 IU Internationale Hochschule GmbH


This course book is protected by copyright. All rights reserved.
This course book may not be reproduced and/or electronically edited, duplicated, or distributed in any form without the written permission of IU Internationale Hochschule GmbH (hereinafter referred to as IU).
The authors/publishers have identified the authors and sources of all graphics to the best
of their abilities. However, if any erroneous information has been provided, please notify
us accordingly.

TABLE OF CONTENTS
ADVANCED MATHEMATICS

Introduction
Signposts Throughout the Course Book
Suggested Readings
Required Reading
Learning Objectives

Unit 1
Calculus

1.1 Differentiation and Integration
1.2 Partial Differentiation
1.3 Multiple Integrals
1.4 Calculus of Variations

Unit 2
Integral Transformations

2.1 Convolutions
2.2 Fourier Transformation

Unit 3
Vector Algebra

3.1 Scalars and Vectors
3.2 Addition and Subtraction of Vectors
3.3 Multiplication of Vectors: Dot Product and Cross Product

Unit 4
Vector Calculus

4.1 Differentiation of Vectors
4.2 Integration of Vectors
4.3 Scalar and Vector Fields
4.4 Vector Operations

Unit 5
Matrices and Vector Spaces

5.1 Basic Matrix Algebra
5.2 Determinant, Trace, Transpose, Complex, and Hermitian Conjugates
5.3 Diagonalization
5.4 Tensors

Unit 6
Information Theory

6.1 Mean Squared Error (MSE)
6.2 Gini Index
6.3 Entropy, Shannon Entropy, Kullback-Leibler Divergence
6.4 Cross Entropy

Appendix
List of References
List of Tables and Figures

INTRODUCTION
WELCOME
SIGNPOSTS THROUGHOUT THE COURSE BOOK

This course book contains the core content for this course. Additional learning materials
can be found on the learning platform, but this course book should form the basis for your
learning.

The content of this course book is divided into units, which are divided further into sections. Each section contains only one new key concept to allow you to quickly and efficiently add new learning material to your existing knowledge.

At the end of each section of the digital course book, you will find self-check questions. These questions are designed to help you check whether you have understood the concepts in each section.

For all modules with a final exam, you must complete the knowledge tests on the learning
platform. You will pass the knowledge test for each unit when you answer at least 80% of
the questions correctly.

When you have passed the knowledge tests for all the units, the course is considered finished and you will be able to register for the final assessment. Please ensure that you complete the evaluation prior to registering for the assessment.

Good luck!

SUGGESTED READINGS
GENERAL SUGGESTIONS

Riley, K. F., Hobson, M. P., & Bence, S. J. (2006). Mathematical methods for physics and engineering (2nd ed.). Cambridge University Press.

Strang, G. (2016). Introduction to linear algebra. Wellesley-Cambridge Press.

UNIT 1

Boelkins, M., Austin, D., & Schlicker, S. (2018). Active calculus 2.1. Grand Valley State University Libraries.

Zakon, E. (2004). Mathematical analysis I. The Trillia Group. (Available online).

Zakon, E. (2009). Mathematical analysis II. The Trillia Group. (Available online).

UNIT 3

Zakon, E. (2004). Mathematical analysis I. The Trillia Group. (Available online).

Zakon, E. (2009). Mathematical analysis II. The Trillia Group. (Available online).

UNIT 6

MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge University Press. (Available online).

REQUIRED READING
UNIT 1

Guichard, R. D. (2020). Single and multivariable calculus. David Guichard. Chapters 3, 8.1-8.4, and 14.3. (Available online).

UNIT 2

Olson, T. (2017). Applied Fourier analysis. Birkhäuser Springer. Chapters 1.1, 1.2, 2.1, 2.8, and 4.1-4.4.

UNIT 3

Guichard, R. D. (2020). Single and multivariable calculus. David Guichard. Chapters 12.1-12.4. (Available online).

Cherney, D., Denton, T., Thomas, R., & Waldron, A. (2016). Linear algebra. University of California, Davis. Chapter 5.1. (Available online).

UNIT 4

Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning. Chapters 5.2 and 5.3. (Available online).

UNIT 5

Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning. Chapters 2.2 and 4.2. (Available online).

LEARNING OBJECTIVES
The course Advanced Mathematics aims to provide students with the mathematical
background knowledge to use and understand current methods and approaches from
engineering and the sciences.

To this end, the course starts with an exposition of the fundamentals of calculus. The notions of differentiation and integration are introduced together with important generalizations to multiple dimensions. Moreover, the widely used optimization technique of the calculus of variations is explained. Integral transformations, which play a vital role in scientific and engineering applications, are also covered.

These analytical techniques are complemented by a thorough introduction to mathematical methods associated with linear algebra. Here, you will learn about the important concepts of vectors, matrices, and their algebraic manipulation. Furthermore, the concept of a tensor, which plays a crucial role in popular approaches to machine learning, is introduced.

The subject domains of linear algebra and calculus are brought together in the explanation of vector calculus. The course concludes with explanations of important concepts from the field of information theory that underpins virtually all aspects of our contemporary communication systems.

UNIT 1
CALCULUS

STUDY GOALS

On completion of this unit, you will have learned ...

– how to differentiate and integrate functions of a single variable.
– how to perform partial differentiation and evaluate multiple integrals for functions of multiple variables.
– how to approximate a function in a Taylor series.
– the basic concepts of the calculus of variations.
1. CALCULUS

Introduction
Functions express relationships between variables. For example, in standard notation, the
function y = f(x) formalizes how a value y, called the dependent variable, varies with
respect to another value x, called the independent variable. The letter f is the name of the
function.

The rate at which the dependent variable changes with respect to the independent varia-
ble is of particular interest, both mathematically and for applications. One example of
such a rate of change is the change in distance with respect to time, also known as velocity. The method for finding this rate when given a function is called differentiation.

Differentiation is a powerful tool that is used to understand the relationship between variables. In the case of multiple independent variables, for example, z = f(x, y), partial differentiation allows us to explore the rate of change of the function with respect to each independent variable.

The operation of differentiation can often be “undone” via an operation called integration.
Integration can be seen as the inverse of differentiation and therefore, it is often called the
anti-derivative. Intuitively, this means that if we have a formula that expresses the speed
of a particle with respect to time, we can often construct a formula for the displacement of
the particle (i.e., the distance it has traveled).

In calculus of variations, the concept of differentiation is extended to functionals, which are maps from a set of functions to the set of real numbers. In this sense, functionals are
functions of functions. The calculus of variations looks for which input function maximizes
or minimizes the dependent (output) variable.

Good textbooks that cover this subject area further include Deisenroth, Faisal, and Ong (2020, Chapter 5); Strang (2017, Chapters 2-5, 7, 8, 13, and 14); and Loomis and Sternberg (2014, Chapters 3 and 8).

1.1 Differentiation and Integration


Derivatives of Functions of a Single Variable

Function: A function is a relation between two sets that associates every element of one set to exactly one element of the other set.

We are often interested in how a function changes with respect to its argument. For example, we could imagine a travelling car with position s at a given time t. We know from our everyday experience that at each time, t, a car has a velocity, v(t), which measures how fast the car is travelling at time t. Over a given time interval, Δt, the average velocity describes the rate at which the car travels the distance for that interval of time

$$v(t) = \frac{\Delta s}{\Delta t}, \tag{1.1}$$

where Δs is the change in position of the car, or the distance covered by the car. In equation 1.1, the time t is the argument of the function v, which establishes a relationship
between the distance covered by the car and the time it takes to cover that distance.

More generally, we often wish to find the rate of change of a general function f(x), where f depends on some argument x. We will begin by considering functions that depend only on a single variable, such as f(x) = x², which is shown in the graphic that follows this explanation. Fix a given value of x and let us consider the value of the function, f(x), as we change the input to a slightly different value, where we assume that the function is continuous and doesn't have any “kinks” or “jumps.” For example, if we start at x₀ = 1 and move to x₁ = 1.1, the value of the function f(x) = x² will change from f(x₀) = 1 to f(x₁) = 1.21. Let us denote this change in x by Δx and write x → x + Δx to indicate that x changes from x to x + Δx. Then, the change in the value of the function f is Δf = f(x + Δx) − f(x). By making this increment Δx arbitrarily small, we can work out the rate of change of f at a single instant. That is the idea of a derivative, an instantaneous rate of change, and the limit operation allows us to formally capture this intuition. As we make the change in x smaller and smaller, written Δx → 0, we can define the gradient or first derivative of the function f as

$$f'(x) \equiv \frac{df(x)}{dx} \equiv \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}. \tag{1.2}$$

The function is differentiable at xₐ if, and only if, this limit exists at the point x = xₐ. Note that if the limit does not exist at x = xₐ, the function is not differentiable at this xₐ. Definition 1.2 does not specify whether we approach x from smaller values (so Δx is negative) or vice versa (in which case Δx is positive). This is because, in order for the limit to exist, the definition of the limit requires that the quotient $\frac{f(x + \Delta x) - f(x)}{\Delta x}$ approaches the same value, f′(x), from both the left and right of x.

Figure 1: A Parabola: f(x) = x²

Source: Shiela Miller (2020).

Geometrically, the derivative fʹ(x) can be interpreted as the slope of the line tangent to the
function f(x) at the point x.

EXAMPLE
Find the first derivative of f(x) = x².

Using definition 1.2,

$$\begin{aligned}
f'(x) &= \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{(x + \Delta x)^2 - x^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{x^2 + 2x\,\Delta x + (\Delta x)^2 - x^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{2x\,\Delta x + (\Delta x)^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{\Delta x\,(2x + \Delta x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} (2x + \Delta x) \\
&= 2x
\end{aligned}$$

Here, we have observed that Δx becomes infinitesimally small as it approaches zero, but is always non-zero and can therefore be cancelled from the numerator and denominator.
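The limit in definition 1.2 can also be explored numerically. The following minimal Python sketch (an illustration added for this section, not part of the original text; the function and step sizes are arbitrary choices) evaluates the difference quotient for f(x) = x² at decreasing values of Δx and compares it with the exact derivative 2x derived above.

```python
# A numerical look at definition 1.2 for f(x) = x**2: the difference
# quotient approaches the exact derivative 2x as dx shrinks.

def f(x):
    return x ** 2

def difference_quotient(f, x, dx):
    """(f(x + dx) - f(x)) / dx, the quotient from definition 1.2."""
    return (f(x + dx) - f(x)) / dx

x = 1.0
for dx in (0.1, 0.01, 0.001, 0.0001):
    q = difference_quotient(f, x, dx)
    print(f"dx = {dx:8.4f}: quotient = {q:.6f} (exact: {2 * x:.6f})")
```

As the output shows, the quotient tends to 2x = 2 as Δx → 0, mirroring the limit computed symbolically above.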

Be aware that to be differentiable at xₐ, a function must be continuous at xₐ (or else the limit will not exist at that point), but merely being continuous everywhere does not necessarily mean that a function is differentiable everywhere, as shown in the following figure. As x approaches 0 from the left, the derivative (which is the slope, in this case) is −1, but if we approach x = 0 from the right, the limit of the quotient $\frac{f(x + \Delta x) - f(x)}{\Delta x}$ is +1. The right and left hand limits do not agree, so the limit, and therefore the derivative of f(x) = |x|, is not defined at x = 0.

Using definition 1.2 in combination with the laws of limits, one can find derivatives of
many fundamental functions. For reference, here are the derivatives of some important
functions where n>0 is a natural number and a is a real-valued constant.

Figure 2: f(x) = |x|

Source: Shiela Miller (2020).

$$\frac{d}{dx}x^n = n x^{n-1} \qquad \frac{d}{dx}e^{ax} = a\,e^{ax}$$

$$\frac{d}{dx}\ln(ax) = \frac{d}{dx}\left(\ln a + \ln x\right) = \frac{1}{x} \qquad \frac{d}{dx}\sin(ax) = a\cos(ax)$$

$$\frac{d}{dx}\cos(ax) = -a\sin(ax) \qquad \frac{d}{dx}\tan(ax) = \frac{a}{\cos^2(ax)}$$
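The entries in this reference list can be spot-checked with a computer algebra system. Below is a minimal sketch using the SymPy library (our own choice of tool; the course book does not prescribe any software): each claimed derivative is compared against what symbolic differentiation produces.

```python
import sympy as sp

x, a, n = sp.symbols("x a n", positive=True)

# Pairs of (function, derivative claimed in the table above).
table = [
    (x**n, n * x**(n - 1)),
    (sp.exp(a * x), a * sp.exp(a * x)),
    (sp.log(a * x), 1 / x),
    (sp.sin(a * x), a * sp.cos(a * x)),
    (sp.cos(a * x), -a * sp.sin(a * x)),
    (sp.tan(a * x), a / sp.cos(a * x) ** 2),
]

for func, claimed in table:
    # The difference must simplify to zero if the table entry is correct.
    assert sp.simplify(sp.diff(func, x) - claimed) == 0
print("All entries of the derivative table check out.")
```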

Higher order derivatives

Derivatives are themselves functions, therefore we can consider their rates of change. We
call derivatives of derivatives of a function f higher order derivatives of f, and they are
obtained using the definition of the derivative in the same way. For the second derivative,
we use definition 1.2 but replace the function f(x) with the first derivative fʹ(x) as follows:

$$f''(x) \equiv \frac{df'(x)}{dx} \equiv \lim_{\Delta x \to 0} \frac{f'(x + \Delta x) - f'(x)}{\Delta x}, \tag{1.3}$$

where, again, fʹʹ is defined if, and only if, the limit exists. More generally, we can define the
nth derivative of f(x) to be

$$f^{(n)}(x) \equiv \frac{d f^{(n-1)}}{dx} \equiv \lim_{\Delta x \to 0} \frac{f^{(n-1)}(x + \Delta x) - f^{(n-1)}(x)}{\Delta x}, \tag{1.4}$$

whenever the limit exists.

Stationary points

Looking again at the first graphic depicting a parabola, we notice that the point (0, 0) is special; the value of the function on either side of the point x = 0 is greater than at x = 0. In other words, at x = 0, f achieves a local minimum. Graphically, we observe that the line tangent to the graph of f at this point is horizontal; its slope is equal to zero. To reiterate: the slope of the line tangent to f at x = 0, which is f′(0), is zero.

Points where the derivative is equal to zero, such as the point described above, are called stationary points. After examining a number of examples, we see that f′ is often, but not always, equal to zero at a local minimum. The other possibility, illustrated by the previous graphic, is that the slope of the tangent line, the derivative, is undefined at the local extremum. For f(x) = |x|, (0, 0) is a critical point, defined to be a place where the derivative is zero or does not exist.

Note that there are three different types of stationary points. They are as follows:

• the function f has a maximum at a stationary point at x = a if f′(a) = 0 and f″(a) < 0,
• the function f has a minimum at a stationary point at x = a if f′(a) = 0 and f″(a) > 0, and
• a stationary point at x = a is called a saddle point if f′(a) = 0 and f″ changes sign at this point.

Note that the maximum and minimum found this way may not be the global maximum or
minimum of the function, but rather a local extremum at the stationary point.

Rules of Differentiation

Differentiation of functions with a constant

Some functions are composed of a constant and a variable part, e.g. f(x) = a · g(x) where
a is an arbitrary constant and g(x) is some function that depends on x. The derivative is
given by

$$\frac{d}{dx}f(x) = f'(x) = a\,\frac{d}{dx}g(x) = a\,g'(x).$$

Differentiation of products

Previously in this section, differentiation rules for some functions with simple structures
were discussed. However, in many cases, we are interested in the rates of change of functions that are more complicated.

17
As a first example, we will investigate how to differentiate functions that can be written as
products of two other functions, namely functions of the form f(x) = u(x) · v(x). The idea
is that if we know how to differentiate u and v, and how to use that information together
with the product structure to find a derivative of f, we can avoid applying the definition of
the derivative. We could, from this perspective, reexamine f(x) = x², noting that we could write it as f(x) = x · x. A slightly more complicated example is g(x) = x² · sin(x), which we could decompose as g(x) = u(x) · v(x), where u(x) = x² and v(x) = sin(x). Such a decomposition is not unique; we could consider any functions u and v whose product is x² · sin(x). However, the idea behind decomposing the original function f(x) into two functions, u and v, is to choose u and v that are easier to differentiate than f. Then, if we have a general method to calculate the derivative of a product, we can apply that method to f in order to make taking the derivative easier than if we were to calculate it using the definition in equation 1.2. This general method, called the product rule, is obtained from the definition (see equation 1.2) as follows. First, let's simplify the difference f(x+Δx) − f(x). This results in

$$\begin{aligned}
f(x + \Delta x) - f(x) &= u(x + \Delta x)\,v(x + \Delta x) - u(x)\,v(x) \\
&= u(x + \Delta x)\left[v(x + \Delta x) - v(x)\right] + v(x)\left[u(x + \Delta x) - u(x)\right].
\end{aligned}$$

Note that we added and subtracted v(x)u(x+Δx) in order to be able to factor. Substituting the result of our simplification into the definition of the derivative, we obtain

$$\begin{aligned}
\frac{df}{dx} &= \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \left[ u(x + \Delta x)\,\frac{v(x + \Delta x) - v(x)}{\Delta x} + v(x)\,\frac{u(x + \Delta x) - u(x)}{\Delta x} \right].
\end{aligned}$$

As Δx approaches zero, u(x+Δx) approaches u(x) and the terms in the square brackets
become the derivatives of the functions u and v respectively. Hence, the formula for the
derivative of a product of functions, called the product rule, is given by

$$f' \equiv \frac{df}{dx} \equiv \frac{d}{dx}\left[u(x)\,v(x)\right] = u(x)\,\frac{dv(x)}{dx} + v(x)\,\frac{du(x)}{dx} = uv' + vu'. \tag{1.5}$$

Using this rule repeatedly, the derivative of products of three or more differentiable functions can be obtained as follows. Given f(x) = u(x)v(x)w(x),

$$f'(x) = \frac{df}{dx} = u\,\frac{d}{dx}(vw) + vw\,\frac{du}{dx} = uv\,\frac{dw}{dx} + uw\,\frac{dv}{dx} + vw\,\frac{du}{dx}.$$

EXAMPLE
Find the derivative of f(x) = x² sin(x).

Using equation 1.5 with u(x) = x² and v(x) = sin(x), we get

$$\frac{d}{dx}\left[x^2 \sin(x)\right] = x^2\,\frac{d}{dx}\sin(x) + \sin(x)\,\frac{d}{dx}x^2 = x^2\cos(x) + 2x\sin(x).$$

The chain rule

Many functions can be written as compositions of functions, namely as functions whose inputs are functions themselves. For example, f(x) = (x − 1)² can be written as f(x) = u²(x), where u(x) = x − 1. We write this as f(u(x)).

The essential idea of the chain rule is that we differentiate the outer function f with respect to the inner function u to get f′(u), leaving the inner function alone. Then differentiate the inner function u with respect to x to get u′(x) and multiply the two together:

$$\frac{df}{dx} = \frac{df}{du} \cdot \frac{du}{dx}. \tag{1.6}$$

This is known as the chain rule because we “chain” the derivatives together. The concept
can be easily extended to functions of functions of functions, and so on. We only need to
repeatedly apply the chain rule until we reach the independent variable.

EXAMPLE
Find the derivative of f(x) = (x − 1)².

We can write this as f(x) = u²(x), where u(x) = x − 1. Using equation 1.6, we obtain

$$\frac{df}{dx} = \frac{df}{du} \cdot \frac{du}{dx} = 2u\,\frac{du}{dx} = 2u \cdot 1 = 2(x - 1).$$

The chain rule can also be used to calculate the derivative of functions of the form f(x) = 1/v(x). Rather than writing this as a quotient, we can express it as a composition of functions, $f(x) = [v(x)]^{-1}$ (noting that this is the −1 power, not the inverse function), and then apply the chain rule

$$\frac{df}{dx} = \frac{df}{dv}\,\frac{dv}{dx} = -v^{-2}\,\frac{dv}{dx} = -\frac{1}{v^2(x)}\,\frac{dv}{dx},$$

where we have used the elementary derivative $\frac{d}{dx}x^n = nx^{n-1}$.

Differentiation of quotients

In some cases, the function we want to take the derivative of can be written in the form of a quotient of two functions, such as $f(x) = \frac{u(x)}{v(x)}$. One way to create a rule in order to calculate derivatives for such functions is to combine the product rule in equation 1.5 with the chain rule, and write the function as the product f(x) = u(x) · [1/v(x)]. Applying the product rule, we get

$$\frac{df}{dx} = \frac{d}{dx}\left[\frac{u(x)}{v(x)}\right] = u(x)\,\frac{d}{dx}\left[\frac{1}{v(x)}\right] + \frac{1}{v(x)}\,\frac{d}{dx}u(x).$$

Using the chain rule to evaluate $\frac{d}{dx}\left[\frac{1}{v(x)}\right]$ as above, we obtain

$$\frac{df}{dx} = -u(x)\,\frac{dv(x)/dx}{v(x)^2} + \frac{du(x)/dx}{v(x)}.$$

The “prime” notation for the derivative yields an expression that is easier to read:

$$f' = \left(\frac{u}{v}\right)' = \frac{vu' - uv'}{v^2}, \tag{1.7}$$

where u = u(x) and v = v(x).
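Equation 1.7 can be confirmed symbolically for arbitrary differentiable functions u and v. A short sketch with SymPy (a tool choice we make purely for illustration):

```python
import sympy as sp

x = sp.symbols("x")
u = sp.Function("u")(x)
v = sp.Function("v")(x)

# Left side: derivative of the quotient. Right side: (v u' - u v') / v**2.
lhs = sp.diff(u / v, x)
rhs = (v * sp.diff(u, x) - u * sp.diff(v, x)) / v**2

assert sp.simplify(lhs - rhs) == 0
print("Quotient rule confirmed for generic u(x) and v(x).")
```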

Integrals of Functions of a Single Variable

Integrals as area under the curve

In the first part of this unit, we focused on rates of change of functions of a single variable,
and developed the first derivative, or gradient, as a tool to mathematically investigate
rates of change. Returning to the example of the car traveling on a road, we found that we
could express the average velocity as the distance traveled by the car in a given amount of
time,

$$v = \frac{\Delta s}{\Delta t},$$

where Δs is the change in position (the distance) over time interval Δt. We considered progressively shorter time intervals in order to investigate the instantaneous rate of change

$$v(t) = \frac{ds}{dt}.$$

Figure 3: Constant Velocity Against Time

Source: Shiela Miller (2020).

The first derivative of the position with respect to time is the instantaneous velocity.

If we know a function for the velocity v(t), it is natural to wonder if it is possible to calculate the distance Δs the car has traveled over a given time interval, Δt.

If the velocity is constant, this is intuitively clear. We have Δs = vΔt, as illustrated in the
figure above. Note that the area of the rectangle v · Δt corresponds to the change in the position Δs. In most cases, the velocity of a car will not be constant, and therefore we would
like a more general way to express the distance as a (continuous) function of velocity and
time.

Informally, we can do this by looking at very small sub-intervals rather than considering
the whole interval Δt at once. We will then assume that the velocity is constant over these small sub-intervals in order to get an approximation of the distance the car travels over each small interval. Each time, we will use our constant approximation of the velocity multiplied by the length of time to get the area of the small rectangle, as shown in the following
figure. We know that we have a small error in each interval, but the smaller the intervals,
the smaller the error becomes. To find the total distance the car has traveled, we combine
the total areas of all the rectangles that represent the contributions from each small time
interval.

Figure 4: Variable Velocity Against Time

Source: Shiela Miller (2020).

More formally, consider an arbitrary function f(x) of a single variable x that is defined over
the interval a ≤ x ≤ b. Following the approach above, we divide the interval [a, b] into
many sub-intervals by introducing intermediate points ξᵢ so that a = ξ₀ < ξ₁ < ... < ξₙ = b. The lengths of the intervals (ξᵢ − ξᵢ₋₁) are the widths of the rectangles on the x-axis and the f(ξᵢ) are the heights of the rectangles. The sum S,

$$S = \sum_{i=1}^{n} f(\xi_i)\left(\xi_i - \xi_{i-1}\right), \tag{1.8}$$

is the total area of all of the rectangles. The area under some curves over certain intervals is not finite; consider f(x) = 1/x over the interval [0, 1]. Therefore, as we take more and more intervals, i.e., as we consider the limit of S as n approaches ∞, the sum S may or may not converge to a finite limit. If this limit exists, the limit of the sum is the definite integral I of the function f(x) in the interval [a, b],

$$I = \int_a^b f(x)\,dx. \tag{1.9}$$

If the limit does not exist, the integral is undefined.

For closed, finite intervals, the question of whether this limit exists (whether the function f is integrable over the given interval) hinges on whether the function f is continuous over that interval.

For continuous functions over a finite interval [a, b], this limit, the integral, always exists.

EXAMPLE
Evaluate the integral $I = \int_0^b x^2\,dx$.

The function f(x) = x², called the integrand, is shown below.

Figure 5: Parabola for x > 0

Source: Shiela Miller (2020).

The first step toward computing this integral, or determining whether this limit exists, is to divide the interval [0, b] into n rectangles of uniform width w. Next, we evaluate the function f(x) = x² at the right hand endpoint of each sub-interval to determine the height of each rectangle. We could also have taken the value at the left hand endpoint or any point in the middle; the limit does not depend on this choice. The area of the ith rectangle is then w · (iw)² = i²w³. The total area of our approximation, A, is then given by

$$A = \sum_{i=1}^{n} i^2 w^3.$$

The term w³ is a constant with respect to the index of summation, i, so we can factor it out of the sum operator as follows:

$$A = w^3 \sum_{i=1}^{n} i^2.$$

Recall that the sum $\sum_{i=1}^{n} i^2$ has the closed form

$$\sum_{i=1}^{n} i^2 = \frac{1}{6}\,n(n+1)(2n+1),$$

and hence the area of our approximation is

$$A = w^3\,\frac{1}{6}\,n(n+1)(2n+1).$$

When constructing the rectangles, we divided the interval [0, b] into intervals of the same length, namely w = b/n. Therefore, we can substitute into our expression for A and reduce to get

$$\begin{aligned}
A &= \left(\frac{b}{n}\right)^3 \frac{1}{6}\,n(n+1)(2n+1) \\
&= \frac{b^3}{6}\,\frac{n(n+1)(2n+1)}{n^3} \\
&= \frac{b^3}{6}\,\frac{(n+1)(2n+1)}{n^2} \\
&= \frac{b^3}{6}\,\frac{2n^2 + 3n + 1}{n^2} \\
&= \frac{b^3}{6}\left(2 + \frac{3}{n} + \frac{1}{n^2}\right).
\end{aligned}$$

As we increase the number of intervals without bound, i.e., as n → ∞, the term in parentheses approaches 2, so A approaches $\frac{b^3}{3}$, which is thus the value of the definite integral $I = \int_0^b x^2\,dx = \frac{1}{3}b^3$.
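The rectangle construction from this example is straightforward to reproduce numerically. The following Python sketch (our own illustration; the value of b and the choices of n are arbitrary) computes the Riemann sum with right-endpoint heights, as in the derivation above, and shows it approaching b³/3 as n grows.

```python
# Riemann sum for f(x) = x**2 on [0, b] with n rectangles of equal width,
# using right-endpoint heights as in the example above.

def riemann_sum(f, a, b, n):
    w = (b - a) / n                      # uniform rectangle width
    return sum(f(a + i * w) * w for i in range(1, n + 1))

b = 2.0
exact = b ** 3 / 3                       # the limit derived above
for n in (10, 100, 1000, 10000):
    approx = riemann_sum(lambda x: x * x, 0.0, b, n)
    print(f"n = {n:6d}: sum = {approx:.6f} (exact: {exact:.6f})")
```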

Using the properties of limits and finite sums, as above, one can see that the following
properties hold:

$$\int_a^b 0\,dx = 0 \tag{1.10}$$

$$\int_a^a f(x)\,dx = 0 \tag{1.11}$$

$$\int_a^b \left[f(x) + g(x)\right] dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx \tag{1.12}$$

$$\int_a^c f(x)\,dx = \int_a^b f(x)\,dx + \int_b^c f(x)\,dx, \quad \text{for all } b \in (a, c) \tag{1.13}$$

If we set c = a in the last expression, we can derive the identity

$$\int_a^b f(x)\,dx = -\int_b^a f(x)\,dx.$$

Integrals as the inverse of differentiation

So far we have introduced integrals over finite intervals [a, b], where the bounds a and b
are fixed. We can formally define the function F(x) to be

$$F(x) = \int_a^x f(u)\,du. \tag{1.14}$$

To see how integration is related to differentiation, we evaluate the function F at position x + Δx and apply equation 1.13 to get

$$\begin{aligned}
F(x + \Delta x) &= \int_a^{x + \Delta x} f(u)\,du \\
&= \int_a^x f(u)\,du + \int_x^{x + \Delta x} f(u)\,du \\
&= F(x) + \int_x^{x + \Delta x} f(u)\,du.
\end{aligned}$$

If we divide both sides by Δx and bring F(x) to the left side, the equation reads

$$\frac{F(x + \Delta x) - F(x)}{\Delta x} = \frac{1}{\Delta x}\int_x^{x + \Delta x} f(u)\,du.$$

Considering the limit as Δx approaches zero of both sides, this becomes

$$\frac{dF(x)}{dx} = f(x), \tag{1.15}$$

or, written with the definition of F(x) substituted in,

$$\frac{d}{dx}\int_a^x f(u)\,du = f(x). \tag{1.16}$$

This says that the derivative of the integral gives back the original integrand. This very
important result is called the Fundamental Theorem of Calculus, and it has a second part,
which relates the definite integral to the antiderivative. Let’s explore it now.

The above discussion did not depend on any attribute of the arbitrary constant a. Hence, the inverse of differentiation is not unique. However, any two inverse functions F₁(x) and F₂(x) differ at most by a constant, so we write

$$\int f(x)\,dx = F(x) + c \tag{1.17}$$

for the family of functions with derivative f(x). Recall that $\frac{d}{dx}c = 0$. This is the indefinite integral of f(x), and c is called the constant of integration.

The antiderivative F(x) can also be used to evaluate definite integrals. Let x₀ be an arbitrary fixed point in (a, b) and consider equation 1.13 to obtain

$$\int_a^b f(x)\,dx = \int_a^{x_0} f(x)\,dx + \int_{x_0}^b f(x)\,dx \tag{1.18}$$

$$= \int_{x_0}^b f(x)\,dx + \int_a^{x_0} f(x)\,dx \tag{1.19}$$

$$= \int_{x_0}^b f(x)\,dx - \int_{x_0}^a f(x)\,dx \tag{1.20}$$

$$= F(b) - F(a). \tag{1.21}$$

Integrals with infinite bounds of integration

The somewhat intuitive definition of the integral as the area under a curve or as an inverse
function does not allow for bounds of integration that are infinite. However, we can extend
the definition to include these cases with the observation that

$$\int_a^{\infty} f(x)\,dx = \lim_{b \to \infty} \int_a^b f(x)\,dx = \lim_{b \to \infty}\left[F(b) - F(a)\right], \tag{1.22}$$

where the limit as b approaches ∞ is evaluated after the integral is calculated.

Evaluation of integrals

Unfortunately, unlike differentiation, many integrals cannot be evaluated easily and there
are few simple rules which can be used. Some examples of indefinite integrals are given
below. Note that u is typically a function u(x), and du = u’(x)dx.

$$\int u^n\,du = \frac{u^{n+1}}{n+1} + c \qquad (n \neq -1)$$

$$\int \frac{du}{u} = \ln|u| + c$$

$$\int a^u\,du = \frac{a^u}{\ln a} + c$$

$$\int e^u\,du = e^u + c$$

$$\int \cos u\,du = \sin u + c$$

$$\int \sin u\,du = -\cos u + c$$

$$\int \cosh u\,du = \sinh u + c$$

$$\int \sinh u\,du = \cosh u + c$$

$$\int \frac{du}{\cos^2 u} = \tan u + c$$

$$\int \frac{du}{\sin^2 u} = -\cot u + c$$

$$\int \frac{du}{u^2 + a^2} = \frac{1}{a}\arctan\frac{u}{a} + c$$

$$\int \frac{du}{\sqrt{a^2 - u^2}} = \arcsin\frac{u}{a} + c$$
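A few rows of this table can be verified by differentiating the claimed antiderivative and comparing with the integrand; a minimal SymPy sketch (our own addition):

```python
import sympy as sp

u, a = sp.symbols("u a", positive=True)

# Pairs of (integrand, claimed antiderivative) from the table above.
rows = [
    (sp.cos(u), sp.sin(u)),
    (sp.sin(u), -sp.cos(u)),
    (sp.exp(u), sp.exp(u)),
    (a**u, a**u / sp.log(a)),
    (1 / (u**2 + a**2), sp.atan(u / a) / a),
]

for integrand, antiderivative in rows:
    # d/du of the antiderivative must reproduce the integrand.
    assert sp.simplify(sp.diff(antiderivative, u) - integrand) == 0
print("Spot-checked antiderivatives are consistent.")
```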

Formulae for a large number of integrals can be found in tables of integrals. In order to
evaluate unknown integrals, we generally try to transform integrals into forms that are
easier to evaluate. For reference, here are a few “techniques of integration” that might
help.

• Logarithmic integration: Integrals for which the integrand can be written as the quotient of the derivative of a function and that same function can be evaluated as

$$\int \frac{f'(x)}{f(x)}\,dx = \ln|f(x)| + c$$

• Decomposition: When the integrand is a linear combination of integrable functions, we can split the integral of the sum into a sum of simpler integrals:

$$\int \sum_{i=1}^{n} a_i f_i(x)\,dx = \sum_{i=1}^{n} a_i \int f_i(x)\,dx$$

• Substitution: If the integrand can be parameterized in terms of a different variable or function x = u(t), we can often utilize the substitution:

$$\int_{u(b)}^{u(a)} f(x)\,dx = \int_b^a f(u(t))\,\frac{du(t)}{dt}\,dt$$

The key to identifying integrals of this form is to find a suitable substitution function.

• Integration by parts: Recall the product rule:

$$\frac{d}{dx}(u \cdot v) = uv' + u'v$$

Integration by parts enables us to split the integral into parts which are easier to solve.
Rearranging equation 1.5 (the product rule)

$$\frac{d}{dx}(uv) = u\,\frac{dv}{dx} + v\,\frac{du}{dx}$$

to

$$u\,\frac{dv}{dx} = \frac{d}{dx}(uv) - v\,\frac{du}{dx}.$$

Integrating both sides we obtain

$$\int u v'\,dx = uv - \int v u'\,dx.$$

The “art” of solving the integral is to choose the functions u and v so that the remaining
integral becomes easier to solve.

EXAMPLE
Evaluate the integral $\int_a^b x\cos x\,dx$.

Noting that the integrand is a product of x and cos x, we solve this using integration by parts and choose u = x and v′ = cos x, and thus get the result that v = sin x and du = dx. Substituting, we get

$$\begin{aligned}
\int_a^b x\cos x\,dx &= \int_a^b x\,(\sin x)'\,dx \\
&= \left[x\sin x\right]_a^b - \int_a^b \sin x\,dx \\
&= \left[x\sin x + \cos x\right]_a^b \\
&= b\sin b + \cos b - \left(a\sin a + \cos a\right).
\end{aligned}$$

EXAMPLE
Evaluate the integral $\int \frac{1}{x^2 + x}\,dx$.

First, we note that the denominator x² + x can be factored as x(x+1). Using partial fraction decomposition, we get

$$\begin{aligned}
\int \frac{1}{x^2 + x}\,dx &= \int \frac{1}{x(x+1)}\,dx \\
&= \int \left(\frac{1}{x} - \frac{1}{x+1}\right) dx \\
&= \ln x - \ln(x+1) + c \\
&= \ln\frac{x}{x+1} + c
\end{aligned}$$

where we have split the difference inside the integral into a sum of integrals and used the fact that $\ln\frac{a}{b} = \ln a - \ln b$. However, in general, we need to be careful to consider that the argument of the logarithm is not defined for negative numbers.
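The partial fraction step used in this example can be reproduced mechanically. A small SymPy sketch (added here for illustration):

```python
import sympy as sp

x = sp.symbols("x")
integrand = 1 / (x**2 + x)

# Decompose into partial fractions, then integrate.
print(sp.apart(integrand))         # -> -1/(x + 1) + 1/x
print(sp.integrate(integrand, x))  # -> log(x) - log(x + 1)
```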

Taylor approximation

Taylor's Theorem: The theorem is named after Brook Taylor, who expressed this relationship in 1712.

A very useful application of derivatives and integrals is Taylor's theorem. Taylor's theorem provides an approximation to a function in the vicinity of a given point x₀ as a sum. Taylor's theorem requires that the function f(x) is continuous and that all of the derivatives f′(x), f″(x), ..., up to order f⁽ⁿ⁾(x) exist in order to generate an nth degree polynomial approximation to f(x) near x₀. Using equation 1.21, we can express f(x) as

$$\int_a^{a+\epsilon} f'(x)\,dx = f(a+\epsilon) - f(a) \tag{1.23}$$

where ε is small, so that a + ε is in the vicinity of a. This can be written as

$$f(a+\epsilon) = f(a) + \int_a^{a+\epsilon} f'(x)\,dx. \tag{1.24}$$

Assuming that ε is very small, f′(x) is approximately equal to f′(a) over the interval of integration, and hence

$$f(a+\epsilon) \approx f(a) + \epsilon f'(a) \tag{1.25}$$

holds. We can express this in terms of x and a, assuming that we stay close to the point a,
to get the approximation

$$f(x) \approx f(a) + (x-a)\,f'(a). \tag{1.26}$$

The approximation given by equation 1.26 is called the linear approximation to f(x) near x = a. It is the tangent line approximation to the function f. By using more information about f, namely by constructing a function that also agrees with f on higher order derivatives at x = a, we can obtain an even better approximation. That is the general idea of the Taylor approximation of degree n. Because f is n-differentiable, we can apply the approximation to each of the derivatives of f to obtain

$$f'(x) \approx f'(a) + (x-a)\,f''(a),$$

$$f''(x) \approx f''(a) + (x-a)\,f'''(a),$$

and similarly,

$$f^{(n-1)}(x) \approx f^{(n-1)}(a) + (x-a)\,f^{(n)}(a).$$

We can now substitute the estimate of f'(x) into equation 1.24 and obtain

$$\begin{aligned}
f(a+\epsilon) &\approx f(a) + \int_a^{a+\epsilon} \left[f'(a) + (x-a)\,f''(a)\right] dx \\
&= f(a) + \epsilon f'(a) + \frac{\epsilon^2}{2}\,f''(a).
\end{aligned}$$

This process can be repeated iteratively as long as the higher order derivatives exist, which
yields the nth-degree Taylor polynomial approximation. Expressing again in terms of x and
a, we can write:

$$f(x) \approx f(a) + (x-a)\,f'(a) + \frac{(x-a)^2}{2!}\,f''(a) + \cdots + \frac{(x-a)^n}{n!}\,f^{(n)}(a). \tag{1.27}$$

1.2 Partial Differentiation
In the previous section, we considered derivatives of functions of a single variable, i.e., $f^{(n)}(x) = \frac{d^n}{dx^n}f(x)$. More generally, we can consider rates of change of functions that depend on more than one variable. We can write f(x₁, x₂, ..., xₙ) for a function that depends on n variables x₁, x₂, ..., xₙ. An example of a function that depends on two variables x and y, f(x, y) = x² + y², is shown in the following figure. The function is well-defined for each pair (x, y). For example, f(1, 1) = 2.

Figure 6: A Function Depending on Two Variables f(x, y) = x² + y²

Source: Shiela Miller (2020).

Previously, we discussed that the derivative of a function of a single variable is related to the change or gradient of that function. As we consider more variables, we want to know how the function changes as each of the variables changes individually, imagining, for example, how f changes with x as y is held constant. Considering again the function f(x, y) = x² + y², it has a specific gradient in all directions of the xy plane. As a special case, we consider it when we move in either the x or the y direction, for example, along the x or y axis. We move along one direction, for example the x axis, and keep the value of the other variable(s), in this case y, constant as we observe the change of the function. These derivatives are called partial derivatives, indicating that we only observe the “partial” change of the function along one of the variables. Similar to the definition of the derivative with respect to a single variable (equation 1.2), we define the partial derivatives to be:

$$\frac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x, y) - f(x, y)}{\Delta x}, \tag{1.28}$$

and

$$\frac{\partial f}{\partial y} = \lim_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x, y)}{\Delta y}, \tag{1.29}$$

where the symbol ∂ indicates that this differentiation is performed partially with respect
to a single variable while the other variables are kept constant. To make this explicit, it is
often written as

$$\left(\frac{\partial f}{\partial x}\right)_y \quad\text{and}\quad \left(\frac{\partial f}{\partial y}\right)_x \tag{1.30}$$

in order to indicate which variable is considered in the derivative (the one in the partial
derivative expression) and which is kept constant (the one outside the parentheses). Just
as there are many notations for the derivative in the single variable case, there are also
many ways to indicate partial derivatives. The following are some common short-hand
notations for the partial derivative of f with respect to x:

$$\frac{\partial f}{\partial x} = f_x = \partial_x f. \tag{1.31}$$

One can calculate higher order partial derivatives, provided that the relevant limits exist; they are calculated in the same way. Some possibilities in the case of two variables are

$$\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right) = \frac{\partial^2 f}{\partial x^2} = f_{xx}, \qquad \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial y}\right) = \frac{\partial^2 f}{\partial y^2} = f_{yy},$$

$$\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right) = \frac{\partial^2 f}{\partial x\,\partial y} = f_{xy}, \quad\text{and}\quad \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right) = \frac{\partial^2 f}{\partial y\,\partial x} = f_{yx}.$$

Note that under sufficient continuity conditions, the relation

$$\frac{\partial^2 f}{\partial x\,\partial y} = \frac{\partial^2 f}{\partial y\,\partial x}$$

holds.

EXAMPLE
Find $f_x$ and $f_y$, the first partial derivatives of f(x, y) = 3x²y² + y.

First, we calculate the partial derivative with respect to x, treating y as a constant, to obtain

$$\frac{\partial f}{\partial x} = 6xy^2.$$

For the partial derivative with respect to y, we now treat x as a constant and find

$$\frac{\partial f}{\partial y} = 6x^2y + 1.$$
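The same computation can be carried out symbolically. A brief SymPy sketch (our own addition), which also confirms that the mixed second partial derivatives agree, as noted above:

```python
import sympy as sp

x, y = sp.symbols("x y")
f = 3 * x**2 * y**2 + y

print(sp.diff(f, x))                         # -> 6*x*y**2
print(sp.diff(f, y))                         # -> 6*x**2*y + 1
print(sp.diff(f, x, y) == sp.diff(f, y, x))  # mixed partials agree -> True
```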

Total Differential

The definition of the partial derivatives allows us to examine the rate of change of a function along, for example, the x or y axes. We now want to investigate the rate of change if
we move in any direction in the domain.

In a case where we have functions of two variables x and y, we move Δx in the x direction and Δy in the y direction. Following the approach we have taken previously, we can evaluate

$$\begin{aligned}
\Delta f &= f(x + \Delta x, y + \Delta y) - f(x, y) \\
&= f(x + \Delta x, y + \Delta y) - f(x, y + \Delta y) + f(x, y + \Delta y) - f(x, y) \\
&= \left[\frac{f(x + \Delta x, y + \Delta y) - f(x, y + \Delta y)}{\Delta x}\right]\Delta x + \left[\frac{f(x, y + \Delta y) - f(x, y)}{\Delta y}\right]\Delta y
\end{aligned}$$

where we have performed the algebraic trick of adding and subtracting the same term, namely −f(x, y+Δy) + f(x, y+Δy) = 0, in the middle step in order to factor into the desired quotients, and we have also multiplied by Δx/Δx = 1 and Δy/Δy = 1. The term in the first square brackets describes the change of the function f(x, y) if we move a step Δx in the x direction; the term in the second square brackets corresponds to a step Δy in the y direction. If we let Δx → 0 and Δy → 0 on both sides, the terms in the square brackets become the partial derivatives defined in equations 1.28 and 1.29, and we obtain the total differential of a function f(x, y), which is then given by

$$df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy. \tag{1.32}$$

For functions of n variables, the formula above is extended accordingly to

$$df = \frac{\partial f}{\partial x_1}\,dx_1 + \frac{\partial f}{\partial x_2}\,dx_2 + \cdots + \frac{\partial f}{\partial x_n}\,dx_n. \tag{1.33}$$
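A quick numerical sanity check of equation 1.32 (a sketch we add here, with f(x, y) = x² + y² and arbitrarily chosen small steps) confirms that the total differential reproduces the actual change of the function to first order:

```python
# Total differential check for f(x, y) = x**2 + y**2, where the partial
# derivatives are f_x = 2x and f_y = 2y.

def f(x, y):
    return x**2 + y**2

x, y = 1.0, 2.0
dx, dy = 1e-4, -2e-4

exact_change = f(x + dx, y + dy) - f(x, y)
total_differential = 2 * x * dx + 2 * y * dy

print(f"Delta f = {exact_change:.10f}")
print(f"df      = {total_differential:.10f}")  # agrees to first order
```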

Chain Rule

When a function of a single variable could be expressed as a composition of functions, we used the chain rule (recall equation 1.6) to differentiate it. The same approach can be applied to functions with several variables.

For example, in the case of a function f(x, y), the variables x and y are now functions of another variable u, and we wish to find the derivative with respect to u, i.e., df/du. Starting from the total differential in equation 1.32, we obtain

$$\frac{df}{du} = \frac{\partial f}{\partial x}\,\frac{dx}{du} + \frac{\partial f}{\partial y}\,\frac{dy}{du}.$$

The same approach can be taken if the functions are nested in more than one level, i.e., instead of f(u(x)) one might have f(u(v(x))), and the chain rule can be used to calculate the derivative, e.g.,

$$\frac{df(u(v(x)))}{dx} = \frac{\partial f}{\partial u}\,\frac{\partial u}{\partial v}\,\frac{dv}{dx}. \tag{1.34}$$

1.3 Multiple Integrals


Previously in this unit, we introduced integrals over functions of a single variable. Recall that the definite integral measures the area under the curve given by the function f(x) over the interval [a, b]. We later extended this explanation to encompass indefinite integration. Furthermore, we interpreted integration as the inverse of differentiation.

Figure 7: Schematic View of a Function f(x, y) which is to be Integrated

Source: Shiela Miller (2020).

Again, in the multivariate case, we will approach integration as a limit of approximations, focusing on the case of two variables x and y first. We wish to find the volume enclosed by the x, y-plane and the function f(x, y) with specific bounds in the x- and y-directions, represented by a region R enclosed by a contour C. Following the approach described previously in this unit, we divide the area R inside the curve into N areas of ΔAₚ with p = 1, 2, ..., N and define the sum

$$S = \sum_{p=1}^{N} f(x_p, y_p)\,\Delta A_p$$

to express the approximate volume, where ΔAₚ is the area of the base and f(xₚ, yₚ) is the height of cell p. Again, we consider many areas like this, i.e., we let N → ∞, implying ΔAₚ → 0. Similar to the case of a single variable, if the above sum has a finite limit or value, we say that this limit is the value of the double integral of f(x, y) over some region R:

$$I = \int_R f(x, y)\,dA, \tag{1.35}$$

where dA is an infinitesimally small area in the x, y plane where the function f(x, y) is evaluated. So far, we have not made any assumption about the small area ΔA considered in the above sum. If we choose small rectangles in the x and y directions, we can write ΔA = ΔxΔy, and when Δx → 0 and Δy → 0, we can write

$$I = \iint_R f(x, y)\,dx\,dy \tag{1.36}$$

as the double integral. For such integrals, it sometimes matters whether we integrate with respect to x or to y first. It is frequently helpful to draw a picture to see which variable could be taken more easily to depend on the other. If x can be easily expressed as a function of y, we might choose to take small areas in the direction of width dy first. That gives us

$$I = \int_{y=c}^{y=d}\left[\int_{x=x_1(y)}^{x=x_2(y)} f(x, y)\,dx\right] dy. \tag{1.37}$$

In this case, the bounds of the inner integral are the parametrization of the boundary curve C, expressed as x = x₁(y) and x = x₂(y). In the first step, y is treated as a constant as the inner integral over x is evaluated. The next step of the computation, the outer integral, is evaluated between the bounds y = c and y = d just as in the single variable case, as there are no x's left in the expression.

Alternatively, we can first evaluate the integral over y and then over x, as

$$I = \int_{x=a}^{x=b}\left[\int_{y=y_1(x)}^{y=y_2(x)} f(x, y)\,dy\right] dx. \tag{1.38}$$

EXAMPLE
Evaluate the integral $I = \iint_R x^2 y\,dx\,dy$ where R is given by a triangular area bounded by x = 0, y = 0, and x + y = 1.

First, we carry out the integration over y, which means that we keep x fixed. In this case, the limits on y are y = 0 and y = 1 − x. Given the constraint x + y = 1, the maximum value of x is x = 1 for y = 0. The integral is then written as

$$I = \int_{x=0}^{x=1}\left[\int_{y=0}^{y=1-x} x^2 y\,dy\right] dx.$$

We evaluate the inner integral first, treating x as a constant, to get

$$\int_{y=0}^{y=1-x} x^2 y\,dy = \left[\tfrac{1}{2}x^2 y^2\right]_{y=0}^{y=1-x} = \tfrac{1}{2}x^2(1-x)^2.$$

This result is now inserted into the outer integral as follows:

$$\begin{aligned}
\int_{x=0}^{x=1} \tfrac{1}{2}x^2(1-x)^2\,dx &= \tfrac{1}{2}\int_{x=0}^{x=1} x^2\,dx - \tfrac{1}{2}\int_{x=0}^{x=1} 2x^3\,dx + \tfrac{1}{2}\int_{x=0}^{x=1} x^4\,dx \\
&= \tfrac{1}{2}\cdot\tfrac{1}{3}\left[x^3\right]_0^1 - \tfrac{1}{4}\left[x^4\right]_0^1 + \tfrac{1}{2}\cdot\tfrac{1}{5}\left[x^5\right]_0^1 \\
&= \tfrac{1}{6} - \tfrac{1}{4} + \tfrac{1}{10} \\
&= \tfrac{1}{60}.
\end{aligned}$$

In the case of more than two variables, the same notation can be extended accordingly, such as

$$\iiint_V f(x, y, z)\,dx\,dy\,dz \tag{1.39}$$

where, in the case of three variables, we integrate over a specific volume rather than an
area.

1.4 Calculus of Variations


Previously, we introduced the idea of local extrema and how to use stationary points to
find them. We can apply the same ideas to more than one variable. In fact, we can even
extend this idea to look for input functions that give extrema (maxima and minima), rather
than input values that give extrema.

This is the idea behind calculus of variations. In most cases, we want to minimize or maximize a given quantity that depends on a family of input functions; the calculus of variations provides a method for finding a function f(x) which yields the extreme value.

Gravity: Gravity is one of the natural forces caused by the mass of objects, resulting in them being pulled towards each other.

As a concrete example, we could imagine a rope that is attached to two points, A and B, as shown in the following figure, but otherwise hangs freely under the influence of gravity. We expect that the rope will hang down in a shape such as the one indicated by the solid line, and not take any other shape (e.g., those suggested by the two dotted lines), at least as long as there is no external force other than gravity and all initial motion has come to a rest. In this example, the rope is fixed at the points A and B, so we have two constraints, not including the length of the rope, which we take as constant. As the gravitational force acts on each part of the rope, the rope will take the shape where the total potential energy, expressed by the integral over all small segments of the rope, is minimal. We wish to find the function y(x) that describes the shape of the hanging rope with the minimal potential energy.

Figure 8: Illustration of the Concept of Calculus of Variations with a Rope Hanging From
Points A and B

Source: Shiela Miller (2020).

To introduce the calculus of variations, we start with the integral

$$I = \int_a^b F(y, y', x)\,dx, \tag{1.40}$$

where a, b, and F are given by the nature of the problem we wish to consider. This integral
depends on the function y(x). In the example of the rope, the limits a and b of the integral
are fixed: they correspond to the endpoints of the rope at which the rope is attached, for
example, to two poles.

We call such functions (ones that take other functions as their input and result in a scalar
as their output) functionals. Here, I is a functional of y(x), which we denote by

$$I = I[y(x)]. \tag{1.41}$$

We use square brackets to indicate that I is a functional rather than a function of ℝⁿ. We then look for the curves y(x) that yield the stationary value(s) of the integral I, and determine whether such curves are extrema of the integral. The integral may have one or more stationary points.

A stationary point y(x) of the functional I[y(x)] is a point where the functional I does not
change if the y(x) is perturbed by a small amount. In the case of the rope, this would be
the function that describes the physical shape the rope takes if we fix it at two points and
let it hang under the influence of gravity. Because y(x) is a stationary point of the integral
I[y(x)], if we change

$$y(x) \to y(x) + \epsilon\,\eta(x) \tag{1.42}$$

by a small amount ε using any (sufficiently well-behaved) function η(x), we require that
the value of I does not change, i.e.,

$$\left.\frac{dI}{d\epsilon}\right|_{\epsilon=0} = 0 \quad \forall\,\eta(x). \tag{1.43}$$

We now insert the above equation 1.42 into the integral in equation 1.40:

$$I[y(x), \epsilon] = \int_a^b F(y + \epsilon\eta,\, y' + \epsilon\eta',\, x)\,dx.$$

We generally assume that all functions are well behaved, especially when considering sit-
uations related to physical examples.

TAYLOR SERIES WITH MULTIPLE VARIABLES


We have already encountered the Taylor series for the case of a single variable in equation 1.27 and used it to expand a function into a series around some point. This approach can be generalized to several variables. For example, for a function that depends on two variables x and y, we can write the corresponding second degree Taylor polynomial as:

$$f(x, y) \approx f(x_0, y_0) + \frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y}\Delta y + \frac{1}{2!}\left[\frac{\partial^2 f}{\partial x^2}(\Delta x)^2 + 2\,\frac{\partial^2 f}{\partial x\,\partial y}\Delta x\,\Delta y + \frac{\partial^2 f}{\partial y^2}(\Delta y)^2\right]$$

where we evaluate the derivatives around some point (x₀, y₀) and Δx = x − x₀ and Δy = y − y₀.

We can write this as:

$$f(x, y) \approx f(x_0, y_0) + \left(\Delta x\,\frac{\partial}{\partial x} + \Delta y\,\frac{\partial}{\partial y}\right) f(x_0, y_0) + \frac{1}{2!}\left(\Delta x\,\frac{\partial}{\partial x} + \Delta y\,\frac{\partial}{\partial y}\right)^2 f(x_0, y_0)$$

Extending to higher derivatives, we can write the Taylor series for a function of two variables as:

$$f(x, y) = \sum_{n=0}^{\infty} \frac{1}{n!}\left(\Delta x\,\frac{\partial}{\partial x} + \Delta y\,\frac{\partial}{\partial y}\right)^n f(x, y)\,\bigg|_{(x_0, y_0)}$$

We can further generalize this to any number of variables denoted by the vector x:

$$f(\mathbf{x}) = f(\mathbf{x}_0) + \sum_i \frac{\partial f}{\partial x_i}\,\Delta x_i + \frac{1}{2!}\sum_{i, j} \frac{\partial^2 f}{\partial x_i\,\partial x_j}\,\Delta x_i\,\Delta x_j + \cdots$$

Returning to the calculus of variations and the integral I[y(x), ε], we can use the Taylor series with Δy = εη and Δy′ = εη′ and write the integral in the following way:

$$I[y(x), \epsilon] = \int_a^b F(y + \epsilon\eta,\, y' + \epsilon\eta',\, x)\,dx = \int_a^b F(y, y', x)\,dx + \int_a^b \left(\frac{\partial F}{\partial y}\,\epsilon\eta + \frac{\partial F}{\partial y'}\,\epsilon\eta'\right) dx + O(\epsilon^2). \tag{1.44}$$

In the following, we ignore all terms of order ε² and higher because ε is assumed to be a very small number. This means we consider the equation

$$I[y(x), \epsilon] = \int_a^b F(y, y', x)\,dx + \int_a^b \left(\frac{\partial F}{\partial y}\,\epsilon\eta + \frac{\partial F}{\partial y'}\,\epsilon\eta'\right) dx.$$

Now, recall that when we introduced the small perturbation in y(x) in Equation 1.42, we
said that this should not change the integral because we are at a stationary point. We
expressed this more formally in Equation 1.43, where we demand that the integral I does
not change if we change y a little bit by the term εη(x) for any choice of η.

This then implies that the second term must be equal to zero for any choice of η(x), because ε is a small (but non-zero) number and we do not make any demands of the function η(x) except that it be sufficiently well behaved, so we can take its derivative, integrate it, and so on. Then, because we demand that this holds for any small perturbation, the second part in the equation above must vanish, which we can write as

$$\delta I = \int_a^b \left(\frac{\partial F}{\partial y}\,\eta + \frac{\partial F}{\partial y'}\,\eta'\right) dx = 0, \tag{1.45}$$

where the notation δI is used to indicate the variation in the functional I[y(x)] due to the
change in y(x) → y(x) + εη(x). Furthermore, ε is a small but non-zero number and can
therefore be omitted from the above equation.

We now integrate the second part of the integral by parts, resulting in

$$\int_a^b \frac{\partial F}{\partial y'}\,\eta'\,dx = \left[\eta\,\frac{\partial F}{\partial y'}\right]_a^b - \int_a^b \eta\,\frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right) dx, \tag{1.46}$$

so the integral equation becomes

$$\left[\eta\,\frac{\partial F}{\partial y'}\right]_a^b + \int_a^b \left[\frac{\partial F}{\partial y} - \frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right)\right]\eta(x)\,dx = 0. \tag{1.47}$$

We now impose the constraint that the endpoints a and b are fixed, as are y(a) and y(b) –
recalling our initial example of the freely hanging rope under the influence of gravity,
where the rope is fixed at its two attached points.

Since y(a) and y(b) are fixed, we also require that, at these points, η(a) = 0 and η(b) = 0: if we “wiggle” the rope a bit, i.e., change y(x), the endpoints remain unchanged. It follows that the first term in the above equation vanishes. Since equation 1.47 must be equal to zero for any choice of η(x), this implies that the function in the integral must be zero, namely that

$$\frac{\partial F}{\partial y} = \frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right). \tag{1.48}$$

Equation 1.48 is known as the Euler-Lagrange equation.

EXAMPLE
Show that the shortest path between two points is a straight line.

We start by specifying the initial and final points that will be connected with an
arbitrary path; initial point A is given by the coordinates (a, y(a)) and the final
point B, is given by the coordinates (b, y(b)), as shown below:

Figure 9: Two Points A, B Connected By a Path

Source: Shiela Miller (2020).

For any small segment of the path, the length can be approximated by a straight
line using the distance formula

$$ds = \sqrt{(dx)^2 + (dy)^2},$$

where we assume that dx and dy are small enough to justify the approximation
of the small triangle for ds. Factoring out dx, the equation above can be written
as

$$ds = \sqrt{1 + (y')^2}\,dx. \tag{1.49}$$

The total length of the line is given by the integral

$$L = \int_a^b ds = \int_a^b \sqrt{1 + (y')^2}\,dx, \tag{1.50}$$

where the integration takes place along the path between the two points. We now calculate the path which leads to a stationary point for L, in this case a minimum, which gives the shortest connection between the points A and B. We start from the Euler-Lagrange equation 1.48 and note that the function in the integral L does not depend on y explicitly. This implies that

$$\frac{\partial F}{\partial y} = 0,$$

so the Euler-Lagrange equation can be written as

$$\frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right) = 0, \tag{1.51}$$

which in turn implies that

$$\frac{\partial F}{\partial y'} = c, \tag{1.52}$$

for some constant c. We now take the derivative of the function $\sqrt{1 + (y')^2}$ with respect to y′ and obtain

$$c = \frac{\partial F}{\partial y'} = \frac{y'}{\sqrt{1 + (y')^2}}, \tag{1.53}$$

recalling that $\sqrt{w}$ can be written as $w^{1/2}$.

We now solve the equation

$$c = \frac{y'}{\sqrt{1 + (y')^2}}$$

for dy so that we can integrate both sides and obtain an explicit formula for y as
follows:

$$c = \frac{y'}{\sqrt{1 + (y')^2}} \tag{1.54}$$

$$c^2 = \frac{(y')^2}{1 + (y')^2} \tag{1.55}$$

$$c^2\left(1 + (y')^2\right) = (y')^2 \tag{1.56}$$

$$c^2 = (y')^2 - c^2 (y')^2 \tag{1.57}$$

$$c^2 = \left(1 - c^2\right)(y')^2 \tag{1.58}$$

$$c = \sqrt{1 - c^2}\;\frac{dy}{dx} \tag{1.59}$$

$$\frac{c}{\sqrt{1 - c^2}}\,dx = dy. \tag{1.60}$$

Integrating both sides yields

$$y = \frac{c}{\sqrt{1 - c^2}}\,x + k \tag{1.61}$$

for some constant k, by noting that the term $\frac{c}{\sqrt{1 - c^2}}$ is constant and ∫dx = x.

As expected, the above equation is indeed a straight line of the form y = mx + b with $m = \frac{c}{\sqrt{1 - c^2}}$ and constant k = b.
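The variational result can also be checked numerically by discretizing the length functional of equation 1.50 and shortening an initially wiggly path by gradient descent. The sketch below is our own construction (grid size, step size, and iteration count are ad hoc choices); the endpoints are held fixed, mirroring the condition η(a) = η(b) = 0.

```python
import numpy as np

# Discretize a path y(x) between fixed endpoints A = (0, 0) and B = (1, 1)
# and minimize its total length L = sum_i sqrt(dx_i**2 + (y_{i+1} - y_i)**2),
# a discrete version of equation 1.50.

n = 50
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + x            # wiggly start with y(0) = 0, y(1) = 1

def length(y):
    return np.sum(np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2))

step = 1e-3
for _ in range(20000):
    dy = np.diff(y)
    seg = np.sqrt(np.diff(x) ** 2 + dy ** 2)
    grad = np.zeros_like(y)
    grad[1:] += dy / seg                 # d(length)/dy_i from the segment to the left
    grad[:-1] -= dy / seg                # ... and from the segment to the right
    grad[0] = grad[-1] = 0.0             # endpoints stay fixed
    y -= step * grad

print(f"final length {length(y):.5f} vs straight line {np.sqrt(2.0):.5f}")
```

The descent flattens the wiggles, and the path length approaches √2, the length of the straight line between A and B.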

SUMMARY
In this unit, we have seen functions of a single variable f(x), as well as multivariate functions, such as f(x, y). Differentiation is a tool for studying the rate of change of a function with respect to a given variable. In the case of multivariate functions, the partial derivatives indicate how much the function changes along the x- or y-axis, for example, while the total differential extends this idea to the rate of change of a function in any arbitrary direction. Integration of functions of one variable was introduced as the area enclosed by the function and can be interpreted as the inverse of the derivative. The integral is therefore often called the antiderivative. The Taylor expansion can be used to approximate a given function at a specific point. Finally, the calculus of variations extends the concepts of differentiation and integration to functions whose inputs are themselves functions.

UNIT 2
INTEGRAL TRANSFORMATIONS

STUDY GOALS

On completion of this unit, you will have learned ...

– what integral transformations are.
– how to combine the effects of two functions using a convolution integral.
– how to use convolutions to describe real-life applications such as finite sensor resolution or image manipulation.
– how to express periodic signals as a Fourier series.
– how to express time domain and frequency domain functions using Fourier transformations.
2. INTEGRAL TRANSFORMATIONS

Introduction
Integral transformations play an important role in analyzing, manipulating, and transforming signals. This unit focuses on two transformations which are of great importance in practical applications: convolutions and Fourier transformations. Convolutions describe how two functions interact with each other, for example, how the finite resolution of a sensor or measuring device impacts the value of the quantity measured by this device. Fourier series and Fourier transformations are used to analyze and describe periodic signals. This formalism allows us to express the observed signals as a superposition of signals of different frequencies and intensities and allows us to switch between equivalent descriptions in the (observed) time domain and in the frequency domain. This can make the treatment of signals much easier, as some transformations or filters are more easily applied in one domain than in the other.

A good textbook that covers this subject area is Signals & Systems (Oppenheim et al.,
1997).

2.1 Convolutions
Definition of the Convolution

Resolution: Detectors are not infinitely precise but can only measure a quantity up to a certain precision determined by the resolution of the detector.

In order to measure any quantity, we must rely on a measurement device. For example, to measure a temperature, we use a thermometer which tells us the temperature of the substance we want to investigate. This simple picture is not quite correct, however. The measurement does not reflect the actual “true” physical quantity (such as the temperature) but is distorted by the intrinsic resolution of the measuring instrument. It is important to keep in mind that the measured quantity, for example the temperature, does not exist as an abstract quantity but is always related to a real, physical system. As such, there is ultimately no single value associated with this property in the mathematical sense; it is always governed by probabilities and probability distributions. This implies that if we keep repeating the same measurement, we will get slightly different numeric values for the same “true” physical values, determined by the intrinsic resolution of the device. Examples illustrating such resolutions are shown in the following figure.

Figure 10: Three Examples of Biased and Unbiased Resolution Functions

Source: Shiela Miller (2020).

A good thermometer may have a high resolution and return an unbiased measurement.
This means that the thermometer does not shift the “true” value, but rather returns a
value which randomly fluctuates slightly around the true value. The measurements of a
thermometer with this property are indicated by the red dashed line. By repeating the
measurement many times, it is possible to determine the resolution of the instrument and
hence the intrinsic volatility of the measurements made with this thermometer.

The black solid line, on the other hand, illustrates the readings of a thermometer with a
lower resolution; the values still fluctuate randomly around the “true” value, but due to
the lower resolution of the instrument, the fluctuations are stronger.

Finally, the blue dash-dotted line illustrates what happens if the measuring instrument
itself introduces a bias. In this case, the measured values are no longer “faithful” to the
“true” ones; the resolution of the instrument is asymmetric with long tails, indicating that
the intrinsic fluctuations become biased towards higher values.

To express the ideas of "faithful" and "true" more mathematically, we need a function f(x) that represents the true values of the substance being measured. It is important to note that the vast majority of systems, objects, and processes in our world are stochastic. This means that any value or number we observe is random, but it follows a distinct probability distribution that is governed by a specific process relevant for this system. There are, of course, notable exceptions to this, otherwise it would be difficult

to implement a clock. However, from the atomic scale to everyday situations such as the speed of wind or the shopping behavior of customers in a supermarket, everything needs to be calculated in terms of probabilities and probability distributions. In the example above, f(x) would describe the distribution of the actual temperature of the underlying physical process relevant for the substance we want to examine.

Stochastic
Stochastic systems are governed by the laws of probability, as opposed to deterministic systems that can be calculated precisely.

The instrument itself is represented by a resolution function g(y) that determines how the
“true” values are observed, e.g. biased or unbiased with high or low resolution. The meas-
urements themselves are represented by some function h(z), which depends on both the
actual underlying state as described by f(x) and the resolution function g(y). The varia-
bles x, y, and z all describe the same quantity, which, in our example, is the temperature.
However, they each enter the consideration at a different point: x is the “true” value, y the
resolution, and z what we finally observe. Effectively, the resolution introduces a system-
atic error in the observation of the “true” value. If the measurement device is biased, this
error will include a shift that is more probable in one direction compared to the other. In
the example shown in the previous figure, the resolution function is biased towards larger,
more positive values and hence, more positive values will be observed compared to nega-
tive values.

We now build a more detailed intuition about how convolutions work. Since, in general,
the true value x is a random number taken from a distinct probability distribution, we
know that the probability of getting x precisely is zero. Instead, we can calculate the prob-
ability of getting the true value in the interval (x, x+dx) using the distribution f(x). As f(x)
defines the probability for any x, this is given by f(x)dx. Next, because we need a device to
measure this value, we need to include the resolution function g(y) so that we finally
observe the value z. The measurement instrument will generally shift the true value to the
observed value; hence we do not observe x, but rather z, which is shifted by the amount z − x, i.e., the difference between them. Note that, in general, we cannot know the true value of x. In the case of an unbiased instrument, the resolution will "smear" the true values so that the resulting distribution is broader than the true, physical one. In the case of a biased instrument, this step will also include a further shift in one direction. Hence, the original interval dx gets transformed into the interval dz, and the overall effect of the instrument is given by g(z − x)dz. We can then combine the probabilities for the true value and the instrument and obtain f(x)dx g(z − x)dz for a particular observation. To express this for any value, we need to integrate over all possible values of x to obtain the distribution for all observable values z:

$$(f * g)(z) = h(z) = \int_{-\infty}^{\infty} f(x)\, g(z - x)\, dx. \qquad (2.1)$$

Equation 2.1 is called the convolution of the functions f and g and is typically denoted by f
∗ g. Two examples of convolutions of two simple functions are shown in the following fig-
ure. In both cases, two uniform functions are nonzero over a small interval and are con-
volved with each other. In the top instance, the two functions are separated from each
other, meaning that there is no portion of the domain where they are both nonzero. In the
other case, the intervals at which the two functions are nonzero overlap. The convolutions

are shown in the right-hand column of graphs. Note that the shape of the convolved func-
tions are not the same as the shape of the original functions; the two uniform functions
gain a “triangular” shape when convolved.
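To make this behavior concrete, the following short numerical sketch (assuming Python with the numpy library, neither of which is part of this course book) convolves two such uniform "boxcar" functions and recovers the triangular shape:

```python
import numpy as np

# A minimal numerical sketch of equation 2.1: the convolution of two
# uniform ("boxcar") functions has a triangular shape.
dx = 0.01
x = np.arange(-2.0, 2.0, dx)

f = np.where(np.abs(x) < 0.5, 1.0, 0.0)          # uniform on (-0.5, 0.5)
g = np.where(np.abs(x - 0.25) < 0.25, 1.0, 0.0)  # uniform on (0.0, 0.5)

# Discretized convolution integral; the factor dx approximates the measure.
h = np.convolve(f, g, mode="same") * dx

print(h.max())          # peak of the triangle, approximately 0.5
print(x[np.argmax(h)])  # peak location, approximately 0.25
```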

Figure 11: Convolution of Two Overlapping and Non-Overlapping Uniform Functions

Source: Shiela Miller (2020).

Most functions are not uniform functions. An example with functions more like those we
are interested in studying is shown in the following figure. In this case, the “true” values
are distributed according to a Γ function (black, solid). The measurement device is repre-
sented by an unbiased Gaussian resolution function with mean zero and standard devia-
tion one. The observed values are then distributed according to the convolution of both
functions as shown in the graph on the right.

Figure 12: Convolution of a Gaussian and a Γ Function

Source: Shiela Miller (2020).

This illustrates a typical pitfall when dealing with measured values. Although we know
that the “true” values are strictly positive (as illustrated by the black solid curve in the left
figure), the observed values can also be negative due to the finite resolution of the meas-
urement instrument. Depending on the concrete problem, these values need to be treated
with extra consideration as they may violate physical boundaries.

Applications in Image Processing

Convolutions play an important part in image processing and are at the core of (conven-
tional) image filters and convolutional neural networks. Instead of interpreting f and g as a
(true) signal and resolution function, one of the functions, for example f, represents the
image and the other (g) represents a kernel that is used to operate on the image. Given a
suitable kernel, this defines a filter that can be used for a wide range of applications such
as blurring, sharpening, and edge detection. The kernel is often denoted by K or ω.

In this section, we will discuss how convolution can be used to blur an image. To do so,
each part of the image is convolved with a Gaussian filter in the x and y directions, in par-
ticular we apply a two-dimensional (or multivariate) Gaussian to each part of the image.
The two-dimensional function is illustrated in the following figure. In the case of images,
we work with discrete data as the images can be represented by collections of individual
pixels of the form (x, y, r, g, b), where x and y give the position of the pixel in the image and r, g, and b give the relative levels of red, green, and blue in that pixel. The continuous convolution integral in equation 2.1 then becomes, in the discrete case, the summation

$$(K * f)(x, y) = \sum_{i=-a}^{a} \sum_{j=-b}^{b} K(i, j)\, f(x - i, y - j), \qquad (2.2)$$

where f(x, y) is the original image and K is the appropriate kernel.

Figure 13: Multivariate Gaussian Function

Source: Shiela Miller (2020).

In the case of Gaussian blurring, the kernel is given by the matrix in equation 2.3, which is
applied to each part of the image. In this way, the value to which each pixel is transformed
is influenced by its neighboring pixels where their relative weight is given by the kernel.
The following figure shows the effect of applying such a kernel to an image.

Figure 14: Illustration of Gaussian Blurring: Original, Medium, and Strong Blurring

Source: Pexels, n.d.

Special care needs to be taken at the edges of the image where the kernel can potentially
exceed the image boundaries. In these cases one can either extend the image by, for
example, repeating the outermost pixels, or one can crop the image so that the kernel
always fits inside the original image:

$$K = \frac{1}{16} \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}. \qquad (2.3)$$
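As an illustration of equations 2.2 and 2.3, the following sketch (assuming a grayscale image stored as a numpy array; real images would have one such channel per color) applies the Gaussian blurring kernel to an image, extending the outermost pixels at the edges:

```python
import numpy as np

# The Gaussian blurring kernel from equation 2.3.
K = np.array([[1, 2, 1],
              [2, 4, 2],
              [1, 2, 1]]) / 16.0

def convolve2d(image, kernel):
    """Discrete 2D convolution (equation 2.2), extending the edge pixels."""
    a, b = kernel.shape[0] // 2, kernel.shape[1] // 2
    padded = np.pad(image, ((a, a), (b, b)), mode="edge")
    out = np.zeros_like(image, dtype=float)
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            window = padded[x:x + 2 * a + 1, y:y + 2 * b + 1]
            # Flip the kernel to match f(x - i, y - j) in equation 2.2;
            # for this symmetric kernel the flip makes no difference.
            out[x, y] = np.sum(kernel[::-1, ::-1] * window)
    return out

image = np.zeros((9, 9))
image[4, 4] = 1.0                 # a single bright pixel
blurred = convolve2d(image, K)
print(blurred[3:6, 3:6])          # the intensity spreads to the neighbors
```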

2.2 Fourier Transformation


Fourier Series

We have previously seen how the Taylor expansion can be used to approximate a signal or
a function. However, Taylor expansions are not the only way to do this — there is another
way to look at functions, the Fourier series, that is particularly well suited to periodic sig-
nals such as those found in a wide range of natural and engineering systems. The main
idea of the Fourier series is to express the signal as a sum of sine and cosine components
of varying strength and frequency. In the following we only consider univariate functions,
i.e., functions that depend on one variable. However, the concept of Fourier series can also
be extended to multivariate functions. In order to create a Fourier series for a function, the
signal must satisfy the following Dirichlet conditions:

• the function must be periodic,


• the function must have at most a finite number of discontinuities within a period (such
as a sawtooth function),
• within each period, the function must have a finite number of maxima or minima, and
• the integral over the function over a single period must exist and be finite.

The requirement that the function is periodic can be stated as the condition f(x + L) = f(x), which implies f(x + nL) = f(x) for any integer n; in other words, the function repeats itself after the full period L has passed. This also implies that the signal has neither a beginning nor an end. Note that in practice, using 2L instead of L for a single full period is also common notation. In practical applications, one has to consider how to deal with finite signals, e.g., by assuming that the signal follows the same structure beyond the observed interval. The Fourier series, the promised decomposition of a signal into a summation of sine and cosine waves, is given by

$$f(x) = \frac{1}{2} a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos\left(\frac{2\pi n x}{L}\right) + b_n \sin\left(\frac{2\pi n x}{L}\right) \right]. \qquad (2.4)$$

When the period L is equal to 2π, this simplifies to


$$f(x) = \frac{1}{2} a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos(nx) + b_n \sin(nx) \right]. \qquad (2.5)$$

The figure below shows a periodic sinusoidal signal to which only one frequency contrib-
utes, namely f(t) = sin(ωt). The left part of the figure shows the signal in the time
domain, i.e. what we would observe if we measured the signal at different points in time.
The right part of the figure shows which frequencies contribute to the signal, telling us
which coefficients in the Fourier series are nonzero.

Figure 15: Periodic Sine Signal with One Contributing Frequency

Source: Shiela Miller (2020).

The next figure shows a slightly more complicated signal in which the signal with the
lower frequency (the same as in the previous figure) is overlaid with a signal that is ten
times faster. The resulting periodic signal as we would measure it, namely in the time
domain, is shown in the left part of the figure. The right part of the figure shows the two

frequencies contributing to this signal. The graphic titled “Periodic Sawtooth Signal”
shows a more realistic signal that is frequently encountered in electrical engineering. Dur-
ing each period, the sawtooth signal rises linearly between the minimal and maximal val-
ues, and drops to the minimum when the signal reaches the maximum. The left part of the
figure shows the measured signal in the time domain, and the right part of the figure illustrates that many coefficients in the Fourier series are needed to build up this more complex signal, and that the weight of the contribution of each component decreases with increasing frequency.

Figure 16: Periodic Sine Signal with Two Contributing Frequencies

Source: Shiela Miller (2020).

Figure 17: Periodic Sawtooth Signal

Source: Shiela Miller (2020).
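The coefficients behind such figures can also be computed numerically. The following minimal sketch (assuming numpy, a sawtooth f(x) = x on (−π, π), and the standard real-coefficient formulas, which are equivalent to the complex form given in equation 2.7 below) builds a partial sum of the series:

```python
import numpy as np

# One period of an assumed sawtooth signal, f(x) = x on (-pi, pi).
x = np.linspace(-np.pi, np.pi, 20001)
f = x

def coeff(n):
    # a_n and b_n for equation 2.5, integrated numerically.
    a_n = np.trapz(f * np.cos(n * x), x) / np.pi
    b_n = np.trapz(f * np.sin(n * x), x) / np.pi
    return a_n, b_n

# A partial sum with N terms approximates the signal ever more closely.
N = 10
approx = np.full_like(x, coeff(0)[0] / 2)
for n in range(1, N + 1):
    a_n, b_n = coeff(n)
    approx += a_n * np.cos(n * x) + b_n * np.sin(n * x)

print(np.max(np.abs(f - approx)))  # the error is largest near the jump
```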

Recall Euler's formula, $e^{i\Phi} = \cos\Phi + i\sin\Phi$, which establishes the fundamental relationship between trigonometric functions and complex-valued exponential functions. In many applications it is helpful to use Euler's insight to express the sine and cosine terms in the Fourier series by exponential functions with imaginary arguments. To do so, we can use the identities

$$\sin(nx) = \frac{1}{2i}\left(e^{inx} - e^{-inx}\right) \quad \text{and} \quad \cos(nx) = \frac{1}{2}\left(e^{inx} + e^{-inx}\right). \qquad (2.6)$$

Inserting these into equation 2.5, the Fourier series can be expressed as


$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{inx}, \quad \text{with} \quad c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\, e^{-inx}\, dx. \qquad (2.7)$$

Fourier Transformations

In equation 2.7, we expressed the Fourier series as an infinite sum of complex functions,
still assuming for simplicity that the period L is equal to 2π. If we relax this assumption,
the Fourier series can be written as


$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{i 2\pi n x / L}, \qquad (2.8)$$

which, using the frequencies $\omega_n = 2\pi n / L$, can be expressed as

$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{i\omega_n x}. \qquad (2.9)$$

Previously, we always interpreted the period L as some finite interval after which the func-
tion repeats itself — indeed, this was one of our core assumptions when we discussed the
Fourier series. We now consider the limit of large periods, i.e. L → ∞ where the signal
only repeats after a very long time. Instead of fixed frequencies in the sine and cosine
terms that we have considered so far, the difference in frequencies Δω=2π/L becomes
infinitesimally small and the frequencies become a continuum rather than the discrete
values we saw in the first two examples in this section. We recall from equation 2.7 that
the coefficients cn are given by

$$c_n = \frac{1}{L} \int_{x_0}^{x_0 + L} f(x)\, e^{-i 2\pi n x / L}\, dx = \frac{\Delta\omega}{2\pi} \int_{x_0}^{x_0 + L} f(x)\, e^{-i\omega_n x}\, dx, \qquad (2.10)$$

where we have again allowed an arbitrary period L, and $\omega_n$ is a discrete function of the coefficient index n, given by $\omega_n = 2\pi n / L$. In the integral limits, $x_0$ is an arbitrary constant
which is often taken as -L/2. In fact, this is the reason that often 2L is used to indicate a
full period so that the limits of the integral can be written as -L for the lower, and L for the
upper limit instead of -L/2 and L/2. Substituting this expression for the coefficients into
the complex Fourier series, we get


$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{i\omega_n x} = \sum_{n=-\infty}^{\infty} \frac{\Delta\omega}{2\pi} \left( \int_{x_0}^{x_0 + L} f(u)\, e^{-i\omega_n u}\, du \right) e^{i\omega_n x},$$

where we have used u instead of x in the expression for $c_n$ to avoid confusion between the integration variable and the argument of the function.

We now consider the limit of long periods, i.e. L → ∞ which implies that Δω → 0. Then
the sum


$$\sum_{n=-\infty}^{\infty} \frac{\Delta\omega}{2\pi}\, g(\omega_n)\, e^{i\omega_n x} \qquad (2.11)$$

becomes, in this limit, the following integral

$$\frac{1}{2\pi} \int_{-\infty}^{\infty} g(\omega)\, e^{i\omega x}\, d\omega, \qquad (2.12)$$

where, in our case, the function g(ω) is given by

$$g(\omega_n) = \int_{-L/2}^{L/2} f(u)\, e^{-i\omega_n u}\, du, \qquad (2.13)$$

where we have chosen the constant x0 now to be -L/2. Putting this together, the integral
becomes

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i\omega x}\, d\omega \int_{-\infty}^{\infty} f(u)\, e^{-i\omega u}\, du. \qquad (2.14)$$

From this, we define the Fourier transform of the function f(x) to be

$$\tilde{f}(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x)\, e^{-i\omega x}\, dx, \qquad (2.15)$$

where we have changed back from u to x. We can also define the inverse Fourier transfor-
mation

$$f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \tilde{f}(\omega)\, e^{+i\omega x}\, d\omega. \qquad (2.16)$$

The normalization factor $\frac{1}{2\pi}$ is split equally between the Fourier transformation and its inverse as $\frac{1}{\sqrt{2\pi}}$, which avoids having to remember in which equation the factor needs to be inserted. Note that the tilde in $\tilde{f}(\omega)$ is often dropped if there is no risk of confusing the Fourier transformation and its inverse.

Using the Fourier transformation, we can switch between two equivalent views of analy-
sing a signal, either in the "observable" domain f(x) or in the frequency domain $\tilde{f}(\omega)$.
Since most functions are observed as time dependent functions or signals, we typically
use f(t) instead of f(x) to indicate the dependency of time. In fact, the left and right parts
of the previous three figures correspond to either the time domain (left part) or the fre-
quency domain (right part). The main advantage of being able to switch between these
equivalent representations of a signal is that some operations might be very complicated
in one domain but very easy in the other. For example, applying a frequency filter is very
difficult in the time domain but easy in the frequency domain. Concretely, we consider the
periodic signal in the figure titled “Periodic Sine Signal with Two Contributing Frequen-
cies”. We know from the frequency spectrum that this is a simple signal where a low fre-
quency sine function is combined with a high frequency sine function. Using a low-pass
filter (a filter that will only let low frequencies pass), we can extract the part of the signal
with the low frequency. To do so in the time-domain where we observe the signal would
be very difficult, but it is easily achieved in the frequency domain by applying the dampening function shown in the figure titled "Definition of a Low-Pass Filter". Rather than using a simple rectangle as one might assume, we use a "softened" rectangular function. In this simple example, this would make little difference; however, in more complex examples, the smooth attenuation avoids artifacts that can be induced by a sharp cut-off. The figure titled "Extracted Signal in the Time Domain Using a Low-Pass Filter" shows that we can then extract the low-frequency signal by switching back to the time domain as desired.
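The following sketch (assuming a discrete signal built from two sine waves, similar to the figures referenced above) shows this procedure with numpy's fast Fourier transform: switch to the frequency domain, apply a softened low-pass filter, and switch back:

```python
import numpy as np

# An assumed test signal: a slow sine wave overlaid with a faster one.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)

# Switch to the frequency domain.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=t[1] - t[0])

# A "softened" rectangular low-pass filter: a smooth roll-off around the
# cut-off frequency avoids the artifacts of a sharp rectangular cut.
cutoff, softness = 20.0, 5.0
lowpass = 1.0 / (1.0 + np.exp((freqs - cutoff) / softness))

# Apply the filter and switch back to the time domain.
filtered = np.fft.irfft(spectrum * lowpass, n=len(signal))
print(np.allclose(filtered, np.sin(2 * np.pi * 5 * t), atol=0.1))  # True
```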

Figure 18: Definition of a Low-Pass Filter

Source: Shiela Miller (2020).

Figure 19: Extracted Signal in the Time Domain Using a Low-Pass Filter

Source: Shiela Miller (2020).

EXAMPLE
Find the Fourier transformation of the function $f(t) = Ae^{-\lambda t}$ for t ≥ 0, assuming f(t) = 0 for t < 0.

We use equation 2.15 and write

$$\tilde{f}(\omega) = \frac{A}{\sqrt{2\pi}} \int_0^{\infty} e^{-\lambda t}\, e^{-i\omega t}\, dt = \frac{A}{\sqrt{2\pi}} \int_0^{\infty} e^{-(\lambda + i\omega)t}\, dt, \qquad (2.17)$$

where we note that the bounds of integration range from zero to ∞ as we assume that f(t) = 0 for t < 0. Remembering $\int e^{ax}\, dx = \frac{1}{a} e^{ax}$, the integral is evaluated as


$$\tilde{f}(\omega) = \frac{A}{\sqrt{2\pi}} \left[ \frac{e^{-(\lambda + i\omega)t}}{-(\lambda + i\omega)} \right]_{t=0}^{\infty} = \frac{A}{\sqrt{2\pi}}\, \frac{1}{\lambda + i\omega}.$$

SUMMARY
Convolutions mathematically express how the effects of two functions
can be combined. For example, a measurement of a physical quantity
taken by some measurement device can be expressed as a resolution
function that is unique to the specific details of the device and the “true”
value. The observed value is then the convolution of the “true” value
with the finite sensor resolution. In image processing, convolutions with
a given kernel are used to manipulate images by blurring, sharpening,
detecting edges, or transforming the image. Periodic signals can be
expressed as a Fourier series which is a series of sine and cosine func-
tions of fixed frequencies. The relative weight of the coefficients of the
terms in the series determines the shape of the final signal. Fourier
transformations can be derived as the limit of increasingly long periods
over which the fixed frequencies become a continuous spectrum. Using
the Fourier transform, a function can be expressed either in its original
form or as a Fourier transformation. Some operations are much simpler
to perform in one form compared to the other, so the Fourier transform
is very useful in practice.

UNIT 3
VECTOR ALGEBRA

STUDY GOALS

On completion of this unit, you will have learned ...

– the difference between scalars and vectors.


– how to perform basic operations with vectors.
– geometric and physical interpretations of vectors.
– examples of applications that use basic vector operations.
3. VECTOR ALGEBRA

Introduction
In this unit, we introduce techniques that allow us to operate on more complex mathe-
matical objects called vectors that contain one piece of information in each coordinate.
For example, a vector can describe the speed of a particle and the direction in which it is
moving. This is an example of a two- or three-dimensional vector. Likewise, an n-dimen-
sional vector could be used to record n pieces of information such as time, temperature,
and the x-, y-, and z-coordinates of the direction of movement or force.

Given the numerous and important applications of vectors to physics, we will focus on
developing an intuition for vectors in two and three dimensions, in particular in ℝ2 and ℝ3 — the familiar Cartesian plane and space. Vector operations on these spaces can be
generalized to the broader contexts seen in information science, machine learning, and
computer science in general.

Good textbooks that cover the subject area are chapters 11 and 12 of Calculus (Strang,
2017) and chapters one and two of Advanced Calculus (Loomis & Sternberg, 2014).

3.1 Scalars and Vectors


Numbers as we are accustomed to them are scalars, and they measure one-dimensional
quantities like temperature or weight. A familiar example of a scalar-valued function is
$h(t) = -4.9t^2 + v_0 t + h_0$, the parabola describing the height of an object t seconds after being thrown into the air with initial height $h_0$, initial velocity $v_0$, and acceleration due to gravity of −9.8 meters per second squared (the coefficient of $t^2$ is half the acceleration).

Acceleration Due to Gravity
Note that this is only an average value, as the exact value depends on the location and local topology of the earth.

Technically, a vector is an element of a mathematical structure called a vector space. For our purposes, we can consider a vector to be a mathematical object that has both magnitude and direction. This way of thinking of vectors works very well for two- and three-
dimensional space. For more general applications to machine learning and information
science, vectors can have many more than three coordinates, and might have nothing to
do with physical space — however, they will still satisfy the same basic rules as our more
familiar examples from physical space.

Regardless of their dimension, vectors can be analyzed by breaking them into their com-
ponent parts. For example, in the two-dimensional case, the components would be hori-
zontal and vertical. For n-dimensional vectors, there is no physical analogue, but the ith-
coordinate of one vector will measure the same quantity as the ith-coordinate of another
vector in the same space. Let’s begin by building an intuition for vectors in the familiar,
two-dimensional plane.

Two-Dimensional Vectors

One of the most common applications of vectors is the study of objects moving in two-
dimensional space, though there are many other applications of two-dimensional vectors,
too.

In this context, we want a two-dimensional vector to tell us which direction to travel and
how far to go. Note that a vector doesn’t tell us where to start. Also, the “how far to go” is
usually over a fixed time increment, and therefore tells us a speed, not a position. There
are several ways to indicate that a quantity is a vector rather than a scalar: The element
may be written in underline, in bold font, or with an arrow over the symbol. In the following, we will use the latter notation and denote vectors like $\vec{v}$.

Let's consider an initial point, P = (p1, p2), in the x, y-coordinate plane and a terminal point, Q = (q1, q2). Then the vector representing the travel from P to Q can be written $\vec{PQ}$. The magnitude of $\vec{PQ}$ is the length of the line segment connecting P and Q, obtained from the distance formula. Namely, the magnitude of $\vec{PQ}$ is given by

$$|\vec{PQ}| = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}.$$

If initial point P is the origin (0, 0), we say that the vector is in standard position and we
call it a position vector. Note that the ordered pair at Q=(q1,q2) uniquely specifies a posi-
tion vector as there is only one distance and one path from (0,0) to (q1,q2). The following
figure shows an example of a vector v in standard position. Similarly, for vectors in n-

dimensions, we write $\vec{v} = (v_1, v_2, \ldots, v_n)$ for a vector whose i-th component is $v_i$, where the $v_i$ come from an underlying field (such as the real or complex numbers).

Figure 20: Example of a Position Vector

Source: Math 10, n.d.

Fundamental definitions

1. Two vectors $\vec{a} = (a_1, \ldots, a_n)$ and $\vec{b} = (b_1, \ldots, b_m)$ are equal, $\vec{a} = \vec{b}$, if, and only if, n = m and $a_i = b_i$ for all 1 ≤ i ≤ n. This means that they have the same magnitude and direction.
2. Given $\vec{a}$ as above, the negation of the vector, denoted by $-\vec{a}$, is $(-a_1, \ldots, -a_n)$, a vector having the same magnitude but opposite direction to that of $\vec{a}$.
3. If $|\vec{v}| = 1$, we say that the vector $\vec{v}$ is a unit vector.

EXAMPLE
Use a vector to graphically represent a force of 10 Newtons in a direction of 30° North East.

In two dimensions, this force can be represented as shown in the following fig-
ure.

Figure 21: Graphical Representation of a Force of 10N in a Direction of
30° North East

Source: Shiela Miller (2020).

EXAMPLE

Suppose that the vector $\vec{v}$ is given by the directed line segment extending from the point (0,0) to the point (3,2) and that the vector $\vec{u}$ is given by the directed line segment from the point (1,2) to the point (4,4). Is it true that $\vec{u} = \vec{v}$?

Let P=(0,0), Q=(3,2), R=(1,2), and S=(4,4) as shown in the following figure.

Figure 22: Illustration of Vectors $\vec{PQ}$ and $\vec{RS}$

Source: Shiela Miller (2020).

We need to determine whether the directed line segments have the same direc-
tion and magnitude. We find the magnitude of both line segments to be

$$|\vec{PQ}| = |\vec{v}| = \sqrt{(3-0)^2 + (2-0)^2} = \sqrt{13} \quad \text{and} \quad |\vec{RS}| = |\vec{u}| = \sqrt{(4-1)^2 + (4-2)^2} = \sqrt{13}.$$

In the two-dimensional case, the slope of a line describes its direction; finding
the slope of both line segments, we see that they agree:

$$\text{Slope of } \vec{PQ} = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{2 - 0}{3 - 0} = \frac{2}{3}, \qquad \text{Slope of } \vec{RS} = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{4 - 2}{4 - 1} = \frac{2}{3}.$$

As $\vec{u}$ and $\vec{v}$ have the same direction and the same magnitude, we have $\vec{u} = \vec{v}$.

An easier way to determine whether two vectors are the same is to use the component form of vectors: For a vector $\vec{AC}$ determined by the points A = (x1, y1) and C = (x2, y2), the component form is defined as $\vec{AC} = (x_2 - x_1, y_2 - y_1)$. The coordinates of this vector describe the position vector $\vec{OP}$ from the origin (0, 0) to the point P = (x2 − x1, y2 − y1).

Note that two position vectors $\vec{u} = (u_1, u_2)$ and $\vec{v} = (v_1, v_2)$ are equal if their respective coordinates are equal, so $\vec{u} = \vec{v}$ if, and only if, $u_1 = v_1$ and $u_2 = v_2$. Also, note that the initial and final points are not part of the vector itself. The vectors $\vec{AC}$ and $\vec{OP}$ are the same vector.

3.2 Addition and Subtraction of Vectors


Properties of Addition and Subtraction of Vectors

Computationally, the addition and subtraction of vectors is as easy as the addition and subtraction of their components in the underlying field, which, for us, is either the real or complex numbers.

Field
A mathematical field is a set on which addition, multiplication, subtraction, and division are defined and behave the same way as for rational and real numbers.

Specifically, for $\vec{v} = (v_1, \ldots, v_n)$ and $\vec{u} = (u_1, \ldots, u_n)$,

$$\vec{v} + \vec{u} = (v_1 + u_1, \ldots, v_n + u_n) \quad \text{and} \quad \vec{v} - \vec{u} = (v_1 - u_1, \ldots, v_n - u_n).$$

Notice that because the components of the vectors are elements of the field of real numbers ℝ (or complex numbers ℂ), addition of these entries, and thus of vectors, is commutative: namely, $\vec{v} + \vec{u} = \vec{u} + \vec{v}$. Subtraction, of course, is not commutative.

Vectors also share other properties with the real or complex numbers: Vector addition is associative, there exists an additive identity $\vec{0}$, and every element $\vec{v}$ has an additive inverse, $-\vec{v}$.

We summarize these properties below:

Let $\vec{u} = (u_1, u_2)$, $\vec{v} = (v_1, v_2)$, $\vec{w} = (w_1, w_2)$, and let k and c be scalars.

1. $\vec{u} + \vec{v} = \vec{v} + \vec{u}$ (commutative property)
2. $(\vec{u} + \vec{v}) + \vec{w} = \vec{u} + (\vec{v} + \vec{w})$ (associative property)
3. $\vec{u} + \vec{0} = \vec{u} = \vec{0} + \vec{u}$ (additive identity)
4. $\vec{u} + (-\vec{u}) = \vec{0}$
5. $k(c\vec{u}) = (kc)\vec{u}$
6. $k(\vec{v} + \vec{w}) = k\vec{v} + k\vec{w}$
7. $(k + c)\vec{u} = k\vec{u} + c\vec{u}$

EXAMPLE
Let $\vec{u}$ = (7, 2) and $\vec{v}$ = (−3, 5). Find $\vec{u} + \vec{v}$, $\vec{u} - 6\vec{v}$, $3\vec{u} + 4\vec{v}$, and $|5\vec{v} - 2\vec{u}|$.

We have:

$$\vec{u} + \vec{v} = (7, 2) + (-3, 5) = (4, 7);$$
$$\vec{u} - 6\vec{v} = (7, 2) - 6(-3, 5) = (7, 2) - (-18, 30) = (25, -28);$$
$$3\vec{u} + 4\vec{v} = 3(7, 2) + 4(-3, 5) = (21, 6) + (-12, 20) = (9, 26);$$
$$|5\vec{v} - 2\vec{u}| = |(-15, 25) - (14, 4)| = |(-29, 21)| = \sqrt{(-29)^2 + 21^2} \approx 35.8.$$
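These computations map directly onto numpy arrays, which is how such vector arithmetic is typically done in practice (a minimal sketch, assuming numpy):

```python
import numpy as np

# The example above, computed numerically.
u = np.array([7, 2])
v = np.array([-3, 5])

print(u + v)                          # [4 7]
print(u - 6 * v)                      # [ 25 -28]
print(3 * u + 4 * v)                  # [ 9 26]
print(np.linalg.norm(5 * v - 2 * u))  # approximately 35.8
```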

Vectors in Two Dimensions

The computational way of adding vectors gives little intuition for their physical applica-
tions. Let us now turn our attention to the graphical addition and subtraction of vectors,
restricting ourselves, for now, to the case of ℝ2, which we will visualize as the Cartesian coordinate plane.

ℝ2
ℝ2 is the two-dimensional space of real numbers.

Let $\vec{x}$ and $\vec{y}$ be vectors in ℝ2. Conceptually, the sum of these two vectors should be the net effect of the two vectors together. For ease of explanation, suppose that $\vec{x}$ and $\vec{y}$ represent forces applied in the plane.

One way to think about this is as the two vectors in sequence, namely "doing" or "applying" one and then the other. Graphically, that would mean placing the tail of $\vec{y}$ on the head of $\vec{x}$; if we then draw a new vector, indicated by the red dashed line, from the tail of $\vec{x}$ to the head of $\vec{y}$, that vector is the sum $\vec{x} + \vec{y}$, as shown in the following figure. We can denote this as a new vector, such as $\vec{z} = \vec{x} + \vec{y}$.
can denote this a new vector such as z = x + y .

Figure 23: Addition of Two Vectors

Source: Shiela Miller (2020).


The difference of two vectors $\vec{x}$ and $\vec{y}$ is the sum of $\vec{x}$ and $-\vec{y}$. The difference $\vec{x} + (-\vec{y}) = \vec{x} - \vec{y}$ is shown in the following figure. We can denote this as a new vector, such as $\vec{z} = \vec{x} - \vec{y}$.

Figure 24: Difference of Two Vectors

Source: Shiela Miller (2020).

The next figure illustrates that performing addition graphically is equivalent to performing
addition using the component form of the vectors. The parallelogram also shows that
addition of vectors is commutative — we get the same diagonal regardless of which vector
we apply first.

Figure 25: Addition of Two Vectors in Component Form

Source: Math 10, n.d.

EXAMPLE
Let $\vec{v}$ = (−3, 2) and $\vec{w}$ = (5, −9). Find $\vec{v} + \vec{w}$ both graphically and by using the component form of the vectors.

We have

$$\vec{v} + \vec{w} = (-3 + 5,\; 2 + (-9)) = (2, -7).$$

Introduction to Bases

A basis for a vector space is a minimal spanning set of vectors in that vector space. Minimal means that if we took any of the vectors away, the set would no longer span the space — there would be some vectors that cannot be formed as linear combinations of the remaining elements in the set. Recall that a linear combination of elements is a sum of those elements, possibly each multiplied by a scalar. For example, a two-dimensional space can be spanned by the vectors $\vec{i}$ and $\vec{j}$ pointing along the x and y axes. Any element of this two-dimensional space (or plane) can be expressed by a linear combination of these two vectors, such as $\vec{v} = 2\vec{i} + 3\vec{j}$. If we were to take one of them away, we could no longer reach all points in the plane.

Spanning Set
A spanning set is one that allows us to express every vector in the space as a linear combination of elements in that spanning set.

Bases (the plural of basis) are fundamental to the study of vector spaces, and they are not
unique. Though we will typically use the familiar unit vectors in the coordinate-axis direc-
tions as our bases for two- and three-dimensional space, there are many other bases, and
how to move from one basis to another is an important question.

Unit vectors

Recall that a vector of magnitude (or length) one is called a unit vector. For example, the

vector $\vec{v} = (-3/5, 4/5)$ is a unit vector because

$$|\vec{v}| = \sqrt{(-3/5)^2 + (4/5)^2} = \sqrt{\tfrac{9}{25} + \tfrac{16}{25}} = \sqrt{\tfrac{25}{25}} = 1.$$

Unit vectors are particularly useful because they all have the same length, namely length
one, so they differ from one another only in direction. In particular, there is only one unit
vector that points in a given direction.

Let us denote the unit vector in the direction of vector $\vec{v}$ by $\vec{u}$. To find the unit vector in the direction of a given vector $\vec{v}$, we scale $\vec{v}$ to make a vector of length one as follows:

$$\vec{u} = \frac{\vec{v}}{|\vec{v}|} = \frac{1}{|\vec{v}|}\, \vec{v}.$$

EXAMPLE
Find a unit vector in the direction of $\vec{v}$ = (−1, 2).

First we find the magnitude: $|\vec{v}| = \sqrt{(-1)^2 + 2^2} = \sqrt{5}$.

Then the unit vector in the direction of $\vec{v}$ is

$$\vec{u} = \frac{\vec{v}}{|\vec{v}|} = \frac{(-1, 2)}{\sqrt{5}} = \frac{1}{\sqrt{5}}(-1, 2).$$

The unit vectors parallel to the x- and y-axes, sometimes called standard unit vectors, are particularly useful because they form what is called a basis of the two-dimensional vector space. In particular, letting

$$\vec{i} = (1, 0); \quad \vec{j} = (0, 1),$$

any vector in two-dimensional space can be written as a linear combination of these two vectors $\vec{i}$ and $\vec{j}$. For example, we can express an arbitrary vector $\vec{v}$ as

$$\vec{v} = (v_1, v_2) = (v_1, 0) + (0, v_2) = v_1(1, 0) + v_2(0, 1) = v_1\vec{i} + v_2\vec{j}.$$

EXAMPLE
Express $\vec{r}$ = (2, −6) as a linear combination of $\vec{i}$ and $\vec{j}$.

We have

$$\vec{r} = (2, -6) = 2\vec{i} + (-6)\vec{j} = 2\vec{i} - 6\vec{j}.$$

It is important to note that the set $\{\vec{i}, \vec{j}\}$ is only one of many possible bases for ℝ2; there are many other bases for this same space. For example, we could express each point in terms of a distance r from the origin and an angle θ between the positive x-axis and the vector to the point. Indeed, in the two-dimensional case, any two vectors that are not scalar multiples of one another will form a basis. This simple criterion suffices in the two-dimensional case, but not in higher dimensions, where pairwise independence of a set of vectors no longer guarantees that the whole set spans the space.

Vectors in Three-Dimensional Space

The definitions and properties we observed in two dimensions can be extended to three
dimensions in a straightforward way. As the basis of standard unit vectors for three dimensions, we define the set $\{\vec{i}, \vec{j}, \vec{k}\}$, where

$$\vec{i} = (1, 0, 0); \quad \vec{j} = (0, 1, 0); \quad \text{and} \quad \vec{k} = (0, 0, 1).$$

The three standard unit vectors are directed along the positive x-, y-, and z-axes of the
three-dimensional rectangular coordinate system.


For this basis, it is easy to see that any vector $\vec{v}$ can be expressed as a linear combination of the standard basis vectors because this is how we tend to think about position in three-dimensional space. In particular,

$$\vec{v} = \vec{v}_x + \vec{v}_y + \vec{v}_z = v_1\vec{i} + v_2\vec{j} + v_3\vec{k},$$

where for i = 1, 2, 3, the $v_i$ are the components (orthogonal projections) of $\vec{v}$ in the x-, y-, and z-directions, respectively, as illustrated in the following figure.

Figure 26: Components of a Three-Dimensional Vector

Source: Shiela Miller (2020).

Position vectors

In the three-dimensional space of real numbers ℝ3, a position vector extends from the ori-

gin (0, 0, 0) to the point (x, y, z) and is written $\vec{r}$ = (x, y, z). This can be expressed as a linear combination of the standard unit vectors,

$$\vec{r} = x\vec{i} + y\vec{j} + z\vec{k},$$

with magnitude

$$|\vec{r}| = \sqrt{x^2 + y^2 + z^2}.$$

EXAMPLE
Given $\vec{r}_1 = 3\vec{i} - 2\vec{j} + \vec{k}$, $\vec{r}_2 = 2\vec{i} - 4\vec{j} - 3\vec{k}$, and $\vec{r}_3 = -\vec{i} + 2\vec{j} + 2\vec{k}$, find the magnitude of $\vec{M} = \vec{r}_1 + \vec{r}_2 + \vec{r}_3$.

We have

$$\vec{M} = \vec{r}_1 + \vec{r}_2 + \vec{r}_3 = \vec{i}(3 + 2 - 1) + \vec{j}(-2 - 4 + 2) + \vec{k}(1 - 3 + 2) = 4\vec{i} - 4\vec{j} = (4, -4, 0)$$

and thus,

$$|\vec{M}| = \sqrt{4^2 + 4^2 + 0^2} = 4\sqrt{2}.$$

EXAMPLE
Find the position vector $\vec{r}$ corresponding to the vector $\vec{v}$ with initial point (−2, 3, 1) and terminal point (0, −4, 4). Next, find the unit vector $\vec{u}$ in the direction of $\vec{r}$.

The position vector is

$$\vec{r} = (0 - (-2),\; -4 - 3,\; 4 - 1) = (2, -7, 3).$$

The magnitude of the position vector is

$$|\vec{r}| = \sqrt{2^2 + (-7)^2 + 3^2} = \sqrt{62},$$

and therefore the unit vector in the direction of $\vec{r}$ is

$$\vec{u} = \frac{\vec{r}}{|\vec{r}|} = \frac{1}{\sqrt{62}} (2, -7, 3).$$

Collinear vectors

Two vectors $\vec{u}$ and $\vec{v}$ are collinear (a generalization of parallel) if there exists a real-valued constant c such that $\vec{u} = c\vec{v}$. We use the following notation to express that two vectors are collinear: $\vec{u} \parallel \vec{v}$. Note that this definition holds for vector spaces of any dimension.

EXAMPLE
Suppose vector $\vec{w}$ has the initial point (2, −1, 3) and the terminal point (−4, 7, 5). Are $\vec{w}$ and $\vec{u}$ = (3, −4, −1) collinear? What about $\vec{w}$ and $\vec{v}$ = (12, 16, 4)?

First let us write $\vec{w}$ in the component form,

$$\vec{w} = (-4 - 2,\; 7 - (-1),\; 5 - 3) = (-6, 8, 2).$$

Observing that it is possible to write $\vec{u}$ as

$$\vec{u} = (3, -4, -1) = -\frac{1}{2}(-6, 8, 2) = -\frac{1}{2}\vec{w},$$

we conclude that $\vec{w} \parallel \vec{u}$ holds.

To check whether $\vec{w}$ is collinear with $\vec{v}$, we must determine whether there is a constant c satisfying $\vec{v} = c\vec{w}$, namely

$$(12, 16, 4) = c(-6, 8, 2).$$

This generates three equations in c that must be simultaneously satisfied. Such a c would have to be −2 for the first coordinate but 2 for the last coordinate. Therefore, there is no such c and $\vec{w}$ is not collinear with $\vec{v}$: $\vec{w} \nparallel \vec{v}$.

EXAMPLE
Determine whether the points P = (1, −2, 3), Q = (2, 1, 0), and R = (4, 7, −6) lie on the same line.

The vectors $\vec{PQ}$ and $\vec{PR}$ are

$$\vec{PQ} = (2 - 1,\; 1 - (-2),\; 0 - 3) = (1, 3, -3) \quad \text{and} \quad \vec{PR} = (4 - 1,\; 7 - (-2),\; -6 - 3) = (3, 9, -9).$$

Note that $\vec{PQ}$ and $\vec{PR}$ have the same initial point by construction, so the three points lie on the same line if, and only if, the two vectors are collinear. By inspection, we see that $\vec{PR} = 3\vec{PQ}$, and therefore $\vec{PQ} \parallel \vec{PR}$ holds and the three points are on the same line.
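A minimal numerical version of this test (assuming numpy) checks whether the component-wise ratios of the two vectors agree:

```python
import numpy as np

# The points from the example above.
P, Q, R = np.array([1, -2, 3]), np.array([2, 1, 0]), np.array([4, 7, -6])
PQ, PR = Q - P, R - P

# PQ and PR are collinear if one is a scalar multiple of the other, i.e.,
# the component-wise ratios agree (no zero components occur here).
ratios = PR / PQ
print(ratios)                          # [3. 3. 3.]
print(np.allclose(ratios, ratios[0]))  # True: PR = 3 * PQ
```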

3.3 Multiplication of Vectors: Dot Product
and Scalar Product
We have covered addition and subtraction of vectors and multiplication of vectors by sca-
lars, which scales vectors, changing their length but not their direction. While there is not
a single “multiplication” of vectors, there are two important products that we will intro-
duce in this section: the scalar or dot product of two vectors, which has a scalar output,
and the cross product of two vectors, the output of which is a vector.

Scalar or Dot Product

Let $\vec{u} = (u_1, \ldots, u_n)$ and $\vec{v} = (v_1, \ldots, v_n)$ be vectors. The dot product of $\vec{u}$ with $\vec{v}$ is the scalar

$$\vec{u} \cdot \vec{v} = u_1 v_1 + u_2 v_2 + \ldots + u_n v_n.$$

Note that $\vec{u} \cdot \vec{v}$ is a scalar. By considering the properties of the real-valued components, we can verify the following properties.

Properties of the dot product

For all real vectors $\vec{u}$, $\vec{v}$, and $\vec{w}$ in ℝn and every scalar c, the following properties hold:

1. $\vec{u} \cdot \vec{v} = \vec{v} \cdot \vec{u}$
2. $\vec{u} \cdot (\vec{v} + \vec{w}) = \vec{u} \cdot \vec{v} + \vec{u} \cdot \vec{w}$
3. $\vec{u} \cdot \vec{u} = |\vec{u}|^2$
4. $\vec{u} \cdot (c\vec{v}) = c(\vec{u} \cdot \vec{v}) = (c\vec{u}) \cdot \vec{v}$

EXAMPLE
Let $\vec{u}$ = (2, −2) and $\vec{v}$ = (5, 8). Find $\vec{u} \cdot \vec{v}$ and $\vec{u} \cdot 2\vec{v}$.

We have:

$$\vec{u} \cdot \vec{v} = 2 \cdot 5 + (-2) \cdot 8 = -6;$$
$$\vec{u} \cdot 2\vec{v} = 2(\vec{u} \cdot \vec{v}) = 2 \cdot (-6) = -12.$$

The scalar or dot product can be extended so that we can handle vectors in ℂn with complex numbers as well. We define the scalar or dot product as

$$\vec{u} \cdot \vec{v} = u_1^* v_1 + u_2^* v_2 + \ldots + u_n^* v_n,$$

where the asterisk in $u_i^*$ indicates the complex conjugate. We then find that some of the properties above change, in particular, the following:

• $\vec{u} \cdot \vec{v} = (\vec{v} \cdot \vec{u})^*$
• $(c\vec{u}) \cdot \vec{v} = c^*(\vec{u} \cdot \vec{v})$
• $\vec{u} \cdot (c\vec{v}) = c(\vec{u} \cdot \vec{v})$

The magnitude of the vector remains unchanged, as $\vec{u} \cdot \vec{u}$ is real.

The angle between two vectors

In two or three dimensions, the scalar or dot product can be interpreted geometrically via the angle between two vectors. In two dimensions (ℝ2), we let θ denote the angle between two vectors $\vec{a} = (a_1, a_2)$ and $\vec{b} = (b_1, b_2)$ as shown in the following figure. The cosine of the angle θ between these two vectors is given by

$$\cos\theta = \frac{a_1 b_1 + a_2 b_2}{|\vec{a}|\,|\vec{b}|}. \qquad (3.1)$$

We note without proof that the same formula works for vectors in three dimensions. In particular, for $\vec{a} = (a_1, a_2, a_3)$ and $\vec{b} = (b_1, b_2, b_3)$,

$$\theta = \arccos\left( \frac{a_1 b_1 + a_2 b_2 + a_3 b_3}{|\vec{a}|\,|\vec{b}|} \right). \qquad (3.2)$$

Figure 27: Angle Between Vectors

Source: Shiela Miller (2020).

PROOF
As we can see from panel (a) of the figure above,

$$\cos\theta = \frac{\xi_1}{|\vec{b}|}; \qquad \sin\theta = \frac{\xi_2}{|\vec{b}|}; \qquad \xi_3 = |\vec{a}| - \xi_1. \qquad (3.3)$$

It follows that $\xi_1 = |\vec{b}|\cos\theta$ and $\xi_2 = |\vec{b}|\sin\theta$, as shown in panel (b) of the figure.

We also have

$$|\vec{a} - \vec{b}|^2 = (a_1 - b_1)^2 + (a_2 - b_2)^2 = |\vec{a}|^2 + |\vec{b}|^2 - 2(a_1 b_1 + a_2 b_2). \qquad (3.4)$$

On the other hand,

$$|\vec{a} - \vec{b}|^2 = (|\vec{b}|\sin\theta)^2 + (|\vec{a}| - |\vec{b}|\cos\theta)^2 = |\vec{b}|^2(\sin^2\theta + \cos^2\theta) + |\vec{a}|^2 - 2|\vec{a}||\vec{b}|\cos\theta = |\vec{a}|^2 + |\vec{b}|^2 - 2|\vec{a}||\vec{b}|\cos\theta. \qquad (3.5)$$

Comparing equation 3.4 with equation 3.5, we conclude that

$$|\vec{a}||\vec{b}|\cos\theta = a_1 b_1 + a_2 b_2, \qquad (3.6)$$

and from this, we can immediately obtain equation 3.1 by solving for cos θ.

A more conventional way to write equation 3.1 is

$$\vec{a} \cdot \vec{b} = |\vec{a}||\vec{b}|\cos\theta. \qquad (3.7)$$

Observe that this gives a particularly easy test for orthogonality. In particular, if $\vec{u}$ is perpendicular to $\vec{v}$, we have

$$\vec{u} \cdot \vec{v} = 0 \qquad (3.8)$$

since $\cos\theta = \cos(\pi/2) = 0$.

EXAMPLE
Let $\vec{u}$ = (3, −1, 2) and $\vec{v}$ = (−4, 0, 2). Find the angle θ between $\vec{u}$ and $\vec{v}$.

We have

$$\vec{u} \cdot \vec{v} = |\vec{u}|\,|\vec{v}|\cos\theta,$$

from which we obtain

$$\cos\theta = \frac{\vec{u} \cdot \vec{v}}{|\vec{u}|\,|\vec{v}|} = \frac{-12 + 4}{\sqrt{14}\sqrt{20}} = \frac{-4}{\sqrt{70}}.$$

Since $\vec{u} \cdot \vec{v} < 0$, the angle is obtuse:

$$\theta = \arccos\left( \frac{-4}{\sqrt{70}} \right) \approx 2.069 \text{ rad}.$$
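A short numerical check of this example (a minimal sketch, assuming numpy):

```python
import numpy as np

u = np.array([3, -1, 2])
v = np.array([-4, 0, 2])

# cos(theta) = (u . v) / (|u| |v|), as in equations 3.1 and 3.2.
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(np.arccos(cos_theta))  # approximately 2.069 rad
```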

Vector projection

One important application of the dot product is in finding the extent to which a given vec-
tor is “in the same direction” as a second vector. One example of this is the horizontal and
vertical components of the velocity of a projectile thrown into the air. The vertical compo-
nent is the projection of the initial velocity vector onto the y-axis and reflects how much of
that initial velocity is going in the “up” direction. The following discussion allows us to
consider this question for any two directions, and more formally.

Let $\vec{a}$ and $\vec{b}$ be two-dimensional real vectors in ℝ2. Imagine that we shine a light onto vector $\vec{a}$ from a light source perpendicular to $\vec{b}$. We can think of the projection of $\vec{a}$ onto $\vec{b}$ as the shadow that $\vec{a}$ casts onto vector $\vec{b}$. If we think only of the length of the shadow, this is called the scalar projection. If we consider the length and direction of the shadow, we get the vector projection.

It is worth noting that this shadow, this projection, let's call it $\mathrm{Proj}_{\vec{b}}\,\vec{a}$, might be longer than the vector $\vec{b}$. Recall that one way to define the line containing $\vec{b}$ is $L = \{c\vec{b} : c \in \mathbb{R}\}$, namely all scalar multiples of $\vec{b}$ by a real number c. This means that, so far, we know that for some c, $\mathrm{Proj}_{\vec{b}}\,\vec{a} = c\vec{b}$.

The other thing we know about the projection of $\vec{a}$ onto $\vec{b}$ is that it is orthogonal to the "light ray", which, in this case, has the direction $\vec{a} - c\vec{b}$. This means that $\vec{b} \cdot (\vec{a} - c\vec{b}) = 0$. Using our knowledge of dot products, we can find a particular value for c. We know that

$$\vec{b} \cdot (\vec{a} - c\vec{b}) = 0 = \vec{b} \cdot \vec{a} - c\,\vec{b} \cdot \vec{b}. \qquad (3.9)$$

Solving this for c, we find that $c = \frac{\vec{b} \cdot \vec{a}}{|\vec{b}|^2}$ and therefore, that the vector projection of $\vec{a}$ onto $\vec{b}$ is

$$\mathrm{Proj}_{\vec{b}}\,\vec{a} = \frac{\vec{b} \cdot \vec{a}}{|\vec{b}|^2}\, \vec{b}. \qquad (3.10)$$

The scalar projection of $\vec{a}$ onto $\vec{b}$ is the magnitude of the vector projection of $\vec{a}$ onto $\vec{b}$.

EXAMPLE
Find the vector projection of $\vec{u} = 3\vec{i} - 5\vec{j} + 2\vec{k}$ onto $\vec{v} = 7\vec{i} + \vec{j} - 2\vec{k}$.

By formula (3.10), the vector projection of $\vec{u}$ onto $\vec{v}$ is the vector $\vec{w}$, given by

$$\vec{w} = \frac{\vec{v} \cdot \vec{u}}{|\vec{v}|^2}\, \vec{v} = \frac{12}{54}\, (7, 1, -2) = \left( \frac{14}{9}, \frac{2}{9}, -\frac{4}{9} \right).$$
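Formula (3.10) translates directly into code (a minimal sketch, assuming numpy):

```python
import numpy as np

u = np.array([3, -5, 2])
v = np.array([7, 1, -2])

# Vector projection of u onto v, following formula (3.10).
proj = (np.dot(v, u) / np.dot(v, v)) * v
print(proj)  # approximately [ 1.556  0.222 -0.444 ], i.e., (14/9, 2/9, -4/9)
```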

Inner product

The inner product can be seen as a generalization of the scalar or dot product discussed so far. While the dot or scalar product was specifically defined as $\vec{u} \cdot \vec{v} = u_1 v_1 + u_2 v_2 + \ldots + u_n v_n$, the inner product is a function that takes two vectors and assigns a real (or complex) number to them. We generally assume that we use complex numbers when working with the inner product. The inner product of two vectors, $\vec{u}$ and $\vec{v}$, is written as

$$\langle \vec{u} \mid \vec{v} \rangle,$$

or alternatively, $\langle \vec{u}, \vec{v} \rangle$, $(\vec{u}, \vec{v})$, or just $\vec{u} \cdot \vec{v}$.

The following properties hold:

1. $\langle \vec{u} \mid \vec{v} \rangle = \langle \vec{v} \mid \vec{u} \rangle^*$
2. $\langle \vec{u} \mid c\vec{v} + d\vec{w} \rangle = c\langle \vec{u} \mid \vec{v} \rangle + d\langle \vec{u} \mid \vec{w} \rangle$
3. $\langle c\vec{u} + d\vec{v} \mid \vec{w} \rangle = c^*\langle \vec{u} \mid \vec{w} \rangle + d^*\langle \vec{v} \mid \vec{w} \rangle$
4. $\langle c\vec{u} \mid d\vec{v} \rangle = c^* d\, \langle \vec{u} \mid \vec{v} \rangle$
5. Vectors are defined to be orthogonal if $\langle \vec{u} \mid \vec{v} \rangle = 0$.

The norm of a vector is defined as $\|\vec{u}\| = \langle \vec{u} \mid \vec{u} \rangle^{1/2}$, a generalization of the magnitude of a vector we have encountered so far. Generally, $\langle \vec{u} \mid \vec{u} \rangle$ can be both positive and negative. However, in most cases we will encounter vector spaces with $\langle \vec{u} \mid \vec{u} \rangle \geq 0$, and hence we say that the norm is positive semi-definite.

Cross product

The cross product (or vector product) is generally only defined in the three-dimensional space ℝ3. Let $\vec{a} = a_1\vec{i} + a_2\vec{j} + a_3\vec{k}$ and $\vec{b} = b_1\vec{i} + b_2\vec{j} + b_3\vec{k}$. For nonzero vectors in ℝ3, the cross product of $\vec{a}$ and $\vec{b}$ is a vector that is perpendicular to both of the given vectors.

The cross or vector product is defined as

$$\vec{a} \times \vec{b} = (a_2 b_3 - b_2 a_3)\vec{i} - (a_1 b_3 - b_1 a_3)\vec{j} + (a_1 b_2 - b_1 a_2)\vec{k}.$$

Note that there is a more elegant definition using determinants of matrices. The vector $\vec{a} \times \vec{b}$ is orthogonal to the plane spanned by the vectors $\vec{a}$ and $\vec{b}$. In particular, as we have noted, for nonzero vectors, $\vec{a} \times \vec{b}$ is perpendicular to both $\vec{a}$ and $\vec{b}$.

Algebraic properties of the cross product

For vectors $\vec{u}$, $\vec{v}$, and $\vec{w}$ in ℝ3, the following properties hold:

1. $\vec{u} \times \vec{v} = -\vec{v} \times \vec{u}$
2. $\vec{u} \times (\vec{v} + \vec{w}) = \vec{u} \times \vec{v} + \vec{u} \times \vec{w}$
3. $\vec{u} \times \vec{u} = \vec{0}$
4. $\vec{u} \cdot (\vec{v} \times \vec{w}) = (\vec{u} \times \vec{v}) \cdot \vec{w}$
5. The cross product is not associative, i.e., in general $\vec{u} \times (\vec{v} \times \vec{w}) \neq (\vec{u} \times \vec{v}) \times \vec{w}$

EXAMPLE
Let $\vec{u} = \vec{i} - 2\vec{j} + \vec{k}$ and $\vec{v} = 3\vec{i} + \vec{j} - 2\vec{k}$. Find $\vec{u} \times \vec{v}$.

We have

$$\vec{u} \times \vec{v} = \vec{i}(4 - 1) - \vec{j}(-2 - 3) + \vec{k}(1 + 6) = 3\vec{i} + 5\vec{j} + 7\vec{k}.$$

Geometric properties of the cross product

For vectors $\vec{u}$ and $\vec{v}$, the following properties hold:

1. $\vec{u} \times \vec{v} = \vec{0}$ if, and only if, $\vec{u} = k\vec{v}$ (namely, if the vectors are scalar multiples of one another).
2. The vector $\vec{u} \times \vec{v}$ is orthogonal to both $\vec{u}$ and $\vec{v}$, as shown in the following figure.
3. The magnitude is given by $|\vec{u} \times \vec{v}| = |\vec{u}|\,|\vec{v}| \sin\theta$ and is equal to the area of the parallelogram with adjacent sides $\vec{u}$ and $\vec{v}$.

Figure 28: Geometric Interpretation of the Cross-Product Between Vectors

Source: Johnson & Johnson, 2012.

The above figure illustrates the last property: There, we have two vectors $\vec{a}$ and $\vec{b}$. Let θ be the angle between them and let h be the height of the parallelogram with adjacent sides of lengths $|\vec{a}|$ and $|\vec{b}|$. The height of the parallelogram is

$$h = |\vec{b}| \sin\theta,$$

and the area of the parallelogram is

$$\text{Area} = h\,|\vec{a}| = |\vec{b}|\,|\vec{a}| \sin\theta.$$

EXAMPLE
Find a unit vector $\vec{w}$ that is orthogonal to both vectors $\vec{u} = \vec{i} - 4\vec{j} + \vec{k}$ and $\vec{v} = 2\vec{i} + 3\vec{j}$.

We first find the cross product of $\vec{u}$ and $\vec{v}$,

$$\vec{u} \times \vec{v} = -3\vec{i} + 2\vec{j} + 11\vec{k},$$

which has magnitude

$$|\vec{u} \times \vec{v}| = \sqrt{(-3)^2 + 2^2 + 11^2} = \sqrt{134}.$$

The cross product gives us the desired direction; we normalize the vector to be a unit vector (length one) by dividing the cross product by its magnitude:

$$\vec{w} = \frac{\vec{u} \times \vec{v}}{|\vec{u} \times \vec{v}|} = -\frac{3}{\sqrt{134}}\vec{i} + \frac{2}{\sqrt{134}}\vec{j} + \frac{11}{\sqrt{134}}\vec{k}.$$
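The same computation in numpy (a minimal sketch):

```python
import numpy as np

u = np.array([1, -4, 1])
v = np.array([2, 3, 0])

w = np.cross(u, v)
w_unit = w / np.linalg.norm(w)
print(w)                                     # [-3  2 11]
print(np.dot(w_unit, u), np.dot(w_unit, v))  # both are (numerically) zero
```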

SUMMARY
In this unit, we learned the basics of how to interpret and compute with
vectors — mathematical objects that encode more than one measure-
ment such as speed and direction or even n-many attributes of the state
of a system. We learned to add and subtract vectors, to multiply vectors by scalars, and how to compute and use the dot product (also called the
scalar product because the output is a scalar) and the cross product
(whose output is a vector perpendicular to the two vectors forming the
product). The geometrical interpretation of vectors is especially impor-
tant in two- and three-dimensional space. Unit vectors are vectors of
length one, and the standard unit vectors for the Cartesian coordinate
system are parallel to the coordinate axes and pointing in the positive
direction. The concept of a basis plays a central role in vector calculus; a
basis is a minimal spanning set of vectors, a set of the smallest size so
that every vector in the space can be formed as a linear combination of
the vectors in the basis.

UNIT 4
VECTOR CALCULUS

STUDY GOALS

On completion of this unit, you will have learned ...

– how to differentiate and integrate vector functions.


– how to differentiate along an arbitrary line.
– how to integrate over an arbitrary surface.
– what scalar fields and vector fields are and how to visualize them.
– how to use and interpret vector operators on scalar and vector fields.
4. VECTOR CALCULUS

Introduction
This unit combines the concepts of differentiation, integration, and vector functions. It
introduces the mathematical tools used to study how objects behave in arbitrary coordi-
nate systems; for example, we will learn how to determine the rate of change of a function that describes an object moving through three-dimensional space. As a concrete
example, we can imagine a plane flying through the sky where the position of the plane
relative to the observer or a fixed point is described as a time-dependent vector (x(t),
y(t), z(t)). The rate of change of this position vector with respect to time gives the veloc-
ity of the plane at time t and the rate of change of the velocity gives the acceleration of the
plane at time t.

Two important concepts of vector calculus are scalar and vector fields. One example of a
scalar field from our common physical experience is a function that gives the temperature
at each point in a room: Given a position, such a function outputs a scalar. To visualize a
vector field, we can consider the speed and direction of water as it flows down a drain; the
vector field associates the vector that describes the velocity (speed and direction) of the
water with each point in space.

For further explanation of this topic, see chapter 15 of the textbook Calculus (Strang,
2017).

4.1 Differentiation of Vectors


Differentiation of Vector Functions

When discussing the concept of the derivative, we examined the rate of change of a function with respect to changes in its arguments. We can use the example of a car to see that the rate of change in the distance traveled is the velocity. In this simple example, we implicitly assume that the car travels along a long, straight road, i.e., we are not concerned with the direction of the car. In general, the car can not only change its speed, but also the direction in which it is travelling. One natural way to describe the position of the car and its velocity is with vectors. In particular, we can consider the position of the car as a vector function with a scalar argument, time.

More generally, let $\vec{a} = \vec{a}(u)$ be a vector function with scalar argument u. In three-dimensional Cartesian space, we can express $\vec{a}$ as

$$\vec{a}(u) = a_x(u)\vec{i} + a_y(u)\vec{j} + a_z(u)\vec{k},$$

where $\vec{i}$, $\vec{j}$, and $\vec{k}$ are the unit vectors in the x-, y-, and z-directions, and the scalar functions $a_x(u)$, $a_y(u)$, and $a_z(u)$ are the components of the vector in each of these directions.

The general idea of the derivative as a limit applies to vector functions as well. We can define the derivative of the vector $\vec{a}(u)$ as

$$\frac{d\vec{a}}{du} = \lim_{\Delta u \to 0} \frac{\vec{a}(u + \Delta u) - \vec{a}(u)}{\Delta u}. \qquad (4.1)$$


The following figure illustrates a small change in the vector $\vec{a}(u)$ caused by a small change in the argument u. Note that the derivative of a vector function is also a vector function. However, the two vectors are not necessarily parallel, but can point in different directions. In Cartesian coordinates, equation 4.1 can be written as

$$\frac{d\vec{a}}{du} = \frac{da_x}{du}\vec{i} + \frac{da_y}{du}\vec{j} + \frac{da_z}{du}\vec{k}, \qquad (4.2)$$

which means that we can differentiate each component of the vector function $\vec{a}(u)$ separately.

Figure 29: Illustration of the Rate of Change of the Vector Function $\vec{a}(u)$

Source: Shiela Miller (2020).

EXAMPLE
The position of a car at time t is given by $\vec{x}(t) = t^2\vec{i} + 3t\vec{j} + t\vec{k}$ in Cartesian coordinates. Find the velocity $\vec{v}(t)$ (and its magnitude, i.e., the speed $|\vec{v}(t)|$) of the car at time t = 1.

First, we find the derivative of $\vec{x}(t)$. The result is

$$\vec{v}(t) = \frac{d\vec{x}(t)}{dt} = 2t\vec{i} + 3\vec{j} + 1\vec{k}. \qquad (4.3)$$

At time t = 1, the velocity is $\vec{v}(1) = 2\vec{i} + 3\vec{j} + 1\vec{k}$. Remembering that speed is the magnitude of the velocity vector, we obtain the norm of the vector: $|\vec{v}(1)| = \sqrt{2^2 + 3^2 + 1^2} = \sqrt{14}$.
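The derivative can also be checked numerically, component by component, using a central difference (a minimal sketch, assuming numpy; the step size h is an assumption of the approximation):

```python
import numpy as np

def x(t):
    # The position vector of the car, x(t) = (t^2, 3t, t).
    return np.array([t**2, 3 * t, t])

t, h = 1.0, 1e-6
v = (x(t + h) - x(t - h)) / (2 * h)  # central difference approximation
print(v)                  # approximately [2. 3. 1.]
print(np.linalg.norm(v))  # approximately sqrt(14) = 3.742
```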

Rules of Differentiation of Vector Functions

Just as for scalar functions, it is useful to make note of rules of differentiation that allow us
to avoid using the definition whenever they apply. Assume that $\vec{a}$ and $\vec{b}$ are differentiable vector functions and that $\phi$ is a differentiable scalar function. Then we can use equation 4.1 and prove the following useful rules:

$$\frac{d(\phi\vec{a})}{du} = \phi\frac{d\vec{a}}{du} + \frac{d\phi}{du}\vec{a} \qquad (4.4)$$

$$\frac{d(\vec{a} \cdot \vec{b})}{du} = \vec{a} \cdot \frac{d\vec{b}}{du} + \frac{d\vec{a}}{du} \cdot \vec{b} \qquad (4.5)$$

$$\frac{d(\vec{a} \times \vec{b})}{du} = \vec{a} \times \frac{d\vec{b}}{du} + \frac{d\vec{a}}{du} \times \vec{b}. \qquad (4.6)$$

EXAMPLE
Given a point particle circling a center with constant speed and fixed radius, show that for any time t, the velocity vector is perpendicular to the position vector.

Let $\vec{r}(t)$ denote the position function and $\vec{v}(t) = d\vec{r}/dt$ the velocity. The point particle is always the same distance from the center of the circle, so $\vec{r} \cdot \vec{r} = r^2$ is a constant, and the derivative of this constant is zero. Hence, using the product rule (4.5),

$$\frac{d}{dt}(\vec{r} \cdot \vec{r}) = \vec{r} \cdot \vec{v} + \vec{v} \cdot \vec{r} = 2\vec{r} \cdot \vec{v} = 0,$$

which implies $\vec{r} \perp \vec{v}$.

Vector functions with multiple scalar arguments

In the case of multivariable scalar functions, we used partial derivatives to express the rate of change of a function with respect to a single variable. Analogously, we can extend the idea of partial derivatives to vector functions that depend on more than one variable. Suppose that $\vec{a}(u_1, u_2, \ldots, u_n)$ is a vector function with scalar arguments $u_1, \ldots, u_n$. Then, to find

$$\frac{\partial \vec{a}}{\partial u_i}, \qquad (4.7)$$

we treat all variables $u_j$ where j ≠ i as constant and differentiate $\vec{a}$ as we vary only $u_i$.

Using partial derivatives, one can prove a version of the chain rule in order to compute derivatives of vector functions $\vec{a}$ whose arguments $u_1, u_2, \ldots, u_n$ are themselves functions of some variables $v_i$, namely $u_i(v_1, v_2, \ldots, v_m)$, and get

$$\frac{\partial \vec{a}}{\partial v_i} = \frac{\partial \vec{a}}{\partial u_1}\frac{\partial u_1}{\partial v_i} + \frac{\partial \vec{a}}{\partial u_2}\frac{\partial u_2}{\partial v_i} + \cdots + \frac{\partial \vec{a}}{\partial u_n}\frac{\partial u_n}{\partial v_i}. \qquad (4.8)$$

4.2 Integration of Vectors


Integration of Vector Functions

We can view integration of vector functions analogously to integration of functions of a single variable. If we consider the vector function $\vec{a}(u)$ as the derivative of some function $\vec{A}(u)$, namely $\vec{a} = d\vec{A}(u)/du$, the integral, viewed as the antiderivative, can be expressed as

$$\int \vec{a}(u)\, du = \vec{A}(u) + \vec{b}, \qquad (4.9)$$


where $\vec{b}$ is an arbitrary constant vector.

The definite integral is given by

$$\int_{u_1}^{u_2} \vec{a}(u)\, du = \vec{A}(u_2) - \vec{A}(u_1). \qquad (4.10)$$

Naturally, just as the antiderivative of a scalar function is a scalar function, the antideriva-
tive of a vector function is a vector function and its constant of integration is a vector con-
stant.

Integration along paths

Previously, we integrated functions along the axes, for example $\int f(x)\, dx$ along the x-axis or, in the multivariate case, $\iint f(x, y)\, dx\, dy$ along first the x-axis and then the y-axis. In
general, however, integrals can be performed along an arbitrary path, not only along one
of the coordinate axes. An intuitive example is the physical definition of the work per-
formed when applying a force along a path. In the simplest case, the work is given as $W = \vec{F} \cdot \vec{r}$, where $\vec{F}$ is the force along a defined path. The work W itself is a scalar; however, the path $\vec{r}$ (not only along the axes) and the force $\vec{F}$ are in general vectors, as we can apply a force of some strength $|\vec{F}|$ in any direction as well as move in any direction. Hence, the definition of the work W done by a force $\vec{F}$ as a particle travels along path C becomes

$$W = \int_C \vec{F} \cdot d\vec{r}, \qquad (4.11)$$

as illustrated in the following figure. Only the component of the force that is parallel to the
line tangent to the curve contributes to the work done moving an object along the curve
C, hence the work W is given by the scalar product of the vectors for the force and the

parametrization of the curve. The path $\vec{r}(t)$ parameterizes the way the force is applied, e.g., in Cartesian coordinates,

$$\vec{r}(t) = (x(t), y(t), z(t)), \qquad (4.12)$$

where we have included a dependency on the time t to indicate where the object on

which the force is applied is at a given moment. The differential $d\vec{r}$ is then given by

$$d\vec{r}(t) = (dx, dy, dz) = \left( \frac{dx}{dt}\,dt,\; \frac{dy}{dt}\,dt,\; \frac{dz}{dt}\,dt \right), \qquad (4.13)$$

as x, y, and z implicitly depend on t and are typically parameterized as functions of t as shown in equation 4.12.

Figure 30: A Force F Applied Along a Curve r

Source: Shiela Miller (2020).
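As a numerical illustration of equation 4.11, the following sketch (assuming a force field F(x, y, z) = (y, x, 1) and the path r(t) = (cos t, sin t, t) for t in [0, π]; both are chosen for this example only) approximates the line integral by discretizing the path:

```python
import numpy as np

# Parameterize the assumed path r(t) = (cos t, sin t, t), t in [0, pi].
t = np.linspace(0.0, np.pi, 100001)
r = np.stack([np.cos(t), np.sin(t), t], axis=1)

# The assumed force field F(x, y, z) = (y, x, 1), evaluated along the path.
F = np.stack([r[:, 1], r[:, 0], np.ones_like(t)], axis=1)

dr_dt = np.gradient(r, t, axis=0)            # dr/dt, component-wise
W = np.trapz(np.sum(F * dr_dt, axis=1), t)   # W = integral of F . dr
print(W)  # approximately pi (the exact value for this field and path)
```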

Integration over surfaces

Just as we extended integration along a coordinate axis to integration over arbitrary


curves in space, we can also extend double integration to integration over arbitrary surfa-
ces. For a fixed curve, a single free parameter can be used to describe the movement along
the curve. In the case of a surface, we need two free variables in order to parametrize a
surface. For example, we could parameterize by

$$\vec{r}(u, v) = \vec{r}_0 + u\vec{a} + v\vec{b}, \qquad (4.14)$$


where $\vec{r}_0$ is a fixed point in the surface, "anchoring" the surface in space. Linear combinations of the vectors $\vec{a}$ and $\vec{b}$ span the surface. In the Cartesian coordinate system, $\vec{i}$ and $\vec{j}$ are orthogonal, so they create a rectangular area over which to integrate. In general, the small surface area generated by $\vec{a}$ and $\vec{b}$ will be a parallelogram, and

$$dA = \left| \frac{\partial \vec{r}(u, v)}{\partial u} \times \frac{\partial \vec{r}(u, v)}{\partial v} \right| du\, dv. \qquad (4.15)$$

Therefore, the integral over an arbitrary surface is given by

$$\iint_A d\vec{A} = \iint_A \frac{\partial \vec{r}(u, v)}{\partial u} \times \frac{\partial \vec{r}(u, v)}{\partial v}\, du\, dv. \tag{4.16}$$
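As a small numerical illustration of equations 4.14 to 4.16, the sketch below (assuming NumPy; the spanning vectors are hypothetical) computes the parallelogram area element: for a planar patch, the integrand is constant, so the surface integral over the unit patch reduces to the magnitude of the cross product.

```python
import numpy as np

# Hypothetical spanning vectors a and b of the plane in equation 4.14
a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0])  # deliberately not orthogonal to a

# |∂r/∂u × ∂r/∂v| = |a × b| for the planar parametrization r(u, v)
area_element = np.linalg.norm(np.cross(a, b))

# For the patch 0 <= u, v <= 1, the surface area equals |a × b|
print(area_element)  # 1.0
```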

4.3 Scalar and Vector Fields


Although the phrases “scalar field” and “vector field” may sound intimidating, we are quite familiar with these concepts from our everyday life even if we don’t know them by these names. A scalar field $\phi(x, y)$ is a function that assigns a scalar value to each point in a two-dimensional space. Note that we can define a scalar field on a space of any dimension, not only dimension two.

Scalar Field
A scalar field relates a single (scalar) value to every point in a given area A.

As an example, consider a room with a radiator in a corner as shown in the following fig-
ure. As the radiator heats up the room, the temperature is different at each point of the
room. If we had a thermometer fixed at a wall, we would measure a particular value, a sca-
lar, that changes with time. That is the value of the scalar field at that one point. If we take
the thermometer in our hand and walk around the room, the values change as we walk.
We can potentially obtain a different measurement of the temperature at each position (x,
y), meaning that the function, the scalar field, that represents the temperature in the
room, can have a different value at each position. In this example, we could consider the
temperature in the room at more than one time, suggesting that our scalar field would be
a function ϕ(x, y, z, t) that depends on both position and time.

Figure 31: Visualization of a Scalar Field

Source: HiClipart, n.d.

From our everyday experiences, we are also familiar with situations where each point in a
given region is associated with a movement or force in a given direction with some
strength. The following figure shows two examples. On the left, a paddle is pushed through the water and, as the paddle leaves the water, the water moves with both direction and magnitude (speed). The picture on the right shows water draining from a sink. As the water drains out of the sink, it forms a vortex.

Figure 32: Illustrations of Vector Fields in Everyday Life

Source: Pexels, n.d.

Intuitively, it is clear that the direction and speed of the water change with position. We can introduce the vector field $\vec{V} = \vec{V}(x, y, z, t)$, which assigns a vector representing the velocity of the water to each position (x, y, z) at a given time t. This particular example is dependent on four variables related to physical quantities such as position and time, but we can generalize the ideas of both scalar and vector fields to arbitrarily many variables, such as $\vec{V} = \vec{V}(x_1, x_2, \dots, x_n)$.

Vector Field
A vector field relates a vector to every point in a given area A.

Compared to the visualizations of a scalar field as shown in the right part of the figure “Visualization of a Scalar Field”, the visualization of vector fields becomes more complex. In the case of scalar fields, we only needed to associate a single value of the scalar field with each coordinate. In the two-dimensional example, we used a single number illustrated by an appropriate color scale to visualize the scalar field. In the case of a vector field, we need to add information about strength and direction at each point. For example, we can place little arrows on a regular grid as shown in the next figure. At each point, the arrow points in the direction of the vector field and its length indicates the strength at this point.

Figure 33: Visualization of a Vector Field Using Arrows to Indicate Direction and
Strength of the Field at Each Position

Source: Shiela Miller (2020).

Additionally, we can use color information to highlight the strength of the vector field.
Alternatively, we can use a stream plot as shown in the following figure where the lines
show how the vector field changes as a function of position and how we might observe a
particle as it follows the vector field. The direction is indicated by little arrows on the field
lines or streams and we can either vary the density or the color of the stream lines to indi-
cate the strength of the vector field (or both).

Figure 34: Visualization of a Vector Field as a Stream Plot

Source: Shiela Miller (2020).

4.4 Vector Operations
The ∇ Operator

We now introduce the vector operator ∇, which is often called nabla or del, and is used in
applications of derivatives of vector fields as well as many applications in physical and
information sciences. In Cartesian coordinates, the operator is defined to take partial
derivatives coordinate-wise, namely

$$\nabla \equiv \frac{\partial}{\partial x}\vec{i} + \frac{\partial}{\partial y}\vec{j} + \frac{\partial}{\partial z}\vec{k} \tag{4.17}$$

$$\equiv \left(\frac{\partial}{\partial x}, \frac{\partial}{\partial y}, \frac{\partial}{\partial z}\right) \tag{4.18}$$

where x, y, and z are the Cartesian coordinates, and $\vec{i}$, $\vec{j}$, and $\vec{k}$ are the standard unit vectors along the coordinate axes. Correspondingly, the coordinate-wise second partial derivatives can be obtained by repeated application of the del operator

$$\Delta = \nabla^2 = \nabla \cdot \nabla = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}. \tag{4.19}$$

The resulting operator is called the Delta or Laplace operator.

Gradient of a Scalar Field

One of the most important applications of the nabla operator is determining the rate of change of a scalar field $\phi$ in a given direction, called the directional derivative. In the example of the figure “Visualization of a Scalar Field”, we discussed the scalar field of the temperature in a room with a radiator in a corner. We now want to determine how much the temperature changes if we move from one point to another. Starting at a point $P(x_0, y_0, z_0)$, we move a small distance away from $P$ along the line $\vec{g}(s, \vec{a}) = \vec{x} + s\vec{a}$ in the direction of the vector $\vec{a}$, where $\vec{x}$ is the position vector of $P$ and $s$ is a scalar. The value of the field at the new point is then $\phi(x + s a_x,\ y + s a_y,\ z + s a_z)$. The rate of change of $\phi$ in the direction of $\vec{a}$, which is called the directional derivative and is frequently denoted $\nabla_a$, is then

$$\frac{d\phi(s)}{ds} = \frac{\partial \phi}{\partial x}\frac{dx}{ds} + \frac{\partial \phi}{\partial y}\frac{dy}{ds} + \frac{\partial \phi}{\partial z}\frac{dz}{ds} \tag{4.20}$$

$$= \frac{\partial \phi}{\partial x}a_x + \frac{\partial \phi}{\partial y}a_y + \frac{\partial \phi}{\partial z}a_z \tag{4.21}$$

$$= \vec{a} \cdot \nabla\phi. \tag{4.22}$$

The quantity $\nabla\phi$ is called the gradient of the scalar field $\phi$ and describes the direction of steepest ascent from any point in the field. The quantity $\frac{d\phi(s)}{ds} = \vec{a} \cdot \nabla\phi$ describes the rate of change of the field $\phi$ for some distance $s$ in a given direction $\vec{a}$.


A vector field $\vec{V}$ that is the gradient of some scalar field $\phi$ is called conservative, and the corresponding scalar field $\phi$ is called the potential of this conservative field.

EXAMPLE
Find the gradient of the scalar field $\phi = x^2 y z^4$.

Applying the nabla operator, we obtain

$$\nabla\phi = 2xyz^4\, \vec{i} + x^2 z^4\, \vec{j} + 4x^2 y z^3\, \vec{k}.$$
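The result can be checked symbolically; a minimal sketch, assuming SymPy:

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
phi = x**2 * y * z**4

# The gradient: coordinate-wise partial derivatives of the scalar field
grad_phi = [sp.diff(phi, v) for v in (x, y, z)]
print(grad_phi)  # [2*x*y*z**4, x**2*z**4, 4*x**2*y*z**3]
```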

Divergence of a Vector Field

The scalar product of the nabla operator with a vector field is called the divergence of the vector field $\vec{V}$:

$$\nabla \cdot \vec{V} = \operatorname{div} \vec{V}. \tag{4.23}$$

The divergence is a measure of the flux of a vector field at any given point and has important applications in physics. To illustrate this, imagine water flowing into one end of a pipe and out the other. The flow of the water can be described by a vector field. In the simplest case, there are no sources of additional water nor any drains that would alter the total volume of water. In this case, the divergence of the vector field is zero. If, however, another pipe is attached between the entrance and exit of the original pipe, the total volume of water that exits the pipe could change. If the additional pipe adds water to the system, the divergence of the vector field is greater than zero. If the additional pipe drains water from the system, the divergence will be less than zero. The figure “Visualization of a Vector Field Using Arrows to Indicate Direction and Strength of the Field at Each Position” is a representation of positive divergence. We can imagine a source in the center and the field flows outward.

Flux
The flux of a vector field can be interpreted as how much the field acts like a “source” or “drain” at a given point.

Another physical example is an electric point charge from which field lines extend to infin-
ity.

Curl of a Vector Field

The curl is the cross product of the nabla operator and a vector field

$$\nabla \times \vec{V} = \operatorname{curl} \vec{V}. \tag{4.24}$$


The curl describes the “whirliness” of a vector field. For example, if the vector field $\vec{V}$ describes the flow of water after a paddle leaves the water as seen in the left part of the figure “Illustrations of Vector Fields in Everyday Life”, the curl of the vector field $\vec{V}$ is related to the vortices left behind by the paddle. More specifically, the curl describes the angular velocity of the water in the area around any point. If we were to insert a small probe such as a sheet of plastic at various points around the vortex left behind by the paddle, this probe would tend to rotate in those regions with non-vanishing curl, i.e., where $\nabla \times \vec{V} \neq 0$.
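Both the divergence and the curl can be computed symbolically. The sketch below, assuming SymPy's vector module, uses the hypothetical field $\vec{V} = (-y, x, 0)$, a rigid rotation about the z-axis: its divergence vanishes, while its curl is a constant vector along $\vec{k}$.

```python
from sympy.vector import CoordSys3D, divergence, curl

N = CoordSys3D('N')

# Hypothetical vector field describing a rigid rotation about the z-axis
V = -N.y * N.i + N.x * N.j

print(divergence(V))  # 0: the flow has no sources or drains
print(curl(V))        # 2*N.k: constant, non-vanishing "whirliness"
```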

SUMMARY
The concepts of differentiation and integration can be extended to vec-
tor functions in a straightforward way. In addition to the familiar integra-
tion along the coordinate axis, line integrals can be used to integrate
along arbitrary curves. In a similar way, integrals can be defined over
arbitrary surfaces. Scalar and vector fields play an important role in
many applications and vector operations based on the nabla operator
(∇) are used to define the gradient, divergence, and curl.

UNIT 5
MATRICES AND VECTOR SPACES

STUDY GOALS

On completion of this unit, you will have learned ...

– what matrices and special matrices are.


– how to perform calculations with matrices.
– how to compute the determinant, trace, and transpose of matrices, as well as the com-
plex and Hermitian conjugates of matrices.
– how to determine eigenvalues and eigenvectors of matrices.
– how to diagonalize matrices and change bases.
– what tensors are and how to perform basic calculations with them.
5. MATRICES AND VECTOR SPACES

Introduction
Matrices are arrays of numbers that play an important role in many applications of mathe-
matics, from solving systems of linear equations to quantum mechanics. Many problem
settings can be reformulated efficiently as matrix equations and then solved in a system-
atic way. This unit introduces basic matrix algebra and operators. Diagonalization of
matrices is an important skill that facilitates the changes of a coordinate system or the
transformation of one set of variables into another. Choosing a specific set of new varia-
bles can make the solution to the original problem much easier.

Tensors formalize and extend the concepts of scalars, vectors, and matrices. This unit
introduces tensors and the basic rules of how to work with them.

To gain further understanding of this topic, see chapters 2—4 of Mathematics for Machine
Learning (Deisenroth et al., 2020), chapter 11 of Calculus (Strang, 2017), and chapters one
and two of Advanced Calculus (Loomis & Sternberg, 2014).

5.1 Basic Matrix Algebra


Scalars that represent single numbers are used in many physical applications. They can, for example, represent a single measurement such as the temperature at a given point. We have also previously encountered vectors that are characterized by a magnitude and direction in a single mathematical object, such as the velocity of a car. Compared to scalars, vectors have several components, and we can represent a vector as follows: $\vec{v} = (v_x, v_y, v_z) = (v_i)$. In other words, we can use a single index, $i$, to identify the components. Matrices are rectangular arrays of numbers. Each entry can be identified by its row (numbered top to bottom) and column (numbered left to right). Using these two indices, we write

$$A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} = (a_{ij}). \tag{5.1}$$

This n × m matrix has n rows and m columns. Each entry can be identified using the two
subscripts i and j. Matrices play an important part in many applications, for example, to
describe a rotation of a vector or properties of materials such as elasticity. In many cases,
matrices are a convenient way to perform operations on vectors. Some important types of
matrices are as follows:

• a square matrix in which m = n,


• a symmetric matrix for which aij = aji,
• a diagonal matrix in which only the diagonal elements are nonzero, and

• the identity matrix is a square diagonal matrix in which aii = 1 for all i, and all other
elements are zero.

EXAMPLES
Matrix A is symmetric

$$A = \begin{pmatrix} 9 & 4 & 5 \\ 4 & 6 & 3 \\ 5 & 3 & 7 \end{pmatrix},$$

matrix B is diagonal

$$B = \begin{pmatrix} 9 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$

and matrix $I_3$ is the 3 × 3 identity matrix

$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Calculating with Matrices

We adopt the convention that capital letters such as A denote matrices, small letters such as $a_{ij}$ denote the entries of a matrix, and Greek letters such as α denote scalars. The following rules apply when calculating with matrices:

1. $A = B \Leftrightarrow \forall i, j: a_{ij} = b_{ij}$
   Two matrices are equal if, and only if, the matrices have the same dimension and identical entries.
2. $A + B = C \Leftrightarrow \forall i, j: c_{ij} = a_{ij} + b_{ij}$
   Matrix addition and subtraction are performed entry-wise. Matrix addition is commutative and associative: $A + B = B + A$ and $A + (B + C) = (A + B) + C$.
3. $B = \alpha A \Leftrightarrow \forall i, j: b_{ij} = \alpha a_{ij}$
   Multiplying a matrix by a scalar multiplies each entry by said scalar.
4. Matrix multiplication is only defined when the number of columns in the left matrix is equal to the number of rows in the right matrix. The resulting matrix, the product, has the same number of rows as the left matrix and the same number of columns as the right matrix:

$$\underset{n \times k}{A}\ \underset{k \times m}{B} = \underset{n \times m}{C}. \tag{5.2}$$

The calculation is performed as row $i$ multiplied by column $j$:

$$AB = C \quad \Leftrightarrow \quad c_{ij} = \sum_k a_{ik} b_{kj}.$$

Matrix multiplication is distributive, $A(B + C) = AB + AC$ and $(B + C)A = BA + CA$, but is not, in general, commutative. In particular, there are many cases in which $AB \neq BA$ holds. Indeed, not only are the products frequently unequal, but just because $AB$ is defined, this doesn’t mean that $BA$ is, except when $A$ and $B$ are square matrices of the same dimension. The quantity $[A, B] \equiv AB - BA$ is called the commutator.

Commutator
The commutator between matrices becomes very important in many applications such as quantum mechanics.
EXAMPLE
Evaluate the matrix products AB and BA where

$$A = \begin{pmatrix} 3 & 2 & -1 \\ 0 & 3 & 2 \\ 1 & -3 & 4 \end{pmatrix}$$

and

$$B = \begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix}.$$

For AB,

$$\begin{pmatrix} 3 & 2 & -1 \\ 0 & 3 & 2 \\ 1 & -3 & 4 \end{pmatrix} \begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix} = \begin{pmatrix} 5 & -6 & 8 \\ 9 & 7 & 2 \\ 11 & 3 & 7 \end{pmatrix},$$

and for BA,

$$\begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix} \begin{pmatrix} 3 & 2 & -1 \\ 0 & 3 & 2 \\ 1 & -3 & 4 \end{pmatrix} = \begin{pmatrix} 9 & -11 & 6 \\ 3 & 5 & 1 \\ 10 & 9 & 5 \end{pmatrix}.$$

Note that AB ≠ BA.
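Both products can be verified numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

A = np.array([[3, 2, -1],
              [0, 3, 2],
              [1, -3, 4]])
B = np.array([[2, -2, 3],
              [1, 1, 0],
              [3, 2, 1]])

print(A @ B)  # matches the result for AB above
print(B @ A)  # matches the result for BA above
print(np.array_equal(A @ B, B @ A))  # False: matrix products do not commute
```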

5.2 Determinant, Trace, Transpose,
Complex, and Hermitian Conjugates
The Transpose of a Matrix and the Hermitian Conjugate Matrix

In the following section, we will discuss matrix operations that are common in applications that involve calculating with matrices. Let $A_{n \times m}$ be a matrix consisting of $n$ rows and $m$ columns. In some calculations, it is useful to interchange the rows and columns of a matrix. The resulting matrix is called the transpose of A, which is denoted by $A^T$. Note that $A^T$ is an $m \times n$ matrix.

EXAMPLE
Find the transpose of

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}.$$

The transpose is

$$A^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}.$$

The transpose of a product of two matrices is the product of the transposed matrices in reverse order, namely

$$(AB)^T = B^T A^T.$$

In convincing ourselves of this, it is useful to consider the dimensions of the matrices involved.

For matrices with complex number entries $a \pm bi$, we find the complex conjugate matrix, $A^*$, by taking the complex conjugate of each entry of A,

$$(A^*)_{ij} = (a_{ij})^*, \tag{5.3}$$

where $a_{ij}$ is the element of matrix A in row i and column j.

Complex Conjugate
The complex conjugate of $a \pm bi$ is $a \mp bi$.

EXAMPLE
Find the complex conjugate matrix of

$$A = \begin{pmatrix} 1 & 2 & 3i \\ 4+i & 5 & 6 \end{pmatrix}.$$

The complex conjugate matrix is

$$A^* = \begin{pmatrix} 1 & 2 & -3i \\ 4-i & 5 & 6 \end{pmatrix}.$$

The Hermitian conjugate of a matrix A is the transpose of the complex conjugate and is
denoted by A†.

EXAMPLE
Find the Hermitian conjugate of

$$A = \begin{pmatrix} 1 & 2 & 3i \\ 4+i & 5 & 6 \end{pmatrix}.$$

The Hermitian conjugate is

$$A^{\dagger} = \begin{pmatrix} 1 & 4-i \\ 2 & 5 \\ -3i & 6 \end{pmatrix}.$$

Note that the Hermitian conjugate (or the transpose matrix in the case of real-valued matrices) is related to the inner (or dot) product of vectors. Let the two vectors $\vec{a}$ and $\vec{b}$ be given by

$$\vec{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} \quad \text{and} \quad \vec{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}.$$

Here, the vectors are represented by column matrices. If we take the Hermitian conjugate of $\vec{a}$ (resulting in a row matrix) and multiply it with $\vec{b}$, we obtain

$$(a_1^*, a_2^*, \dots, a_n^*) \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} = \sum_{i=1}^{n} a_i^* b_i = a_i^* b_i = \vec{a}^{\dagger}\vec{b}, \tag{5.4}$$

which is the inner product $\vec{a}^{\dagger}\vec{b}$. In the case of real numbers, $\vec{a}^{\dagger}$ becomes $\vec{a}^T$ and we use the more familiar notation for the inner product $\vec{a} \cdot \vec{b}$.

In this derivation, we have used the summation convention that all indices occurring twice are summed over without having to write the summation sign ($\sum$) explicitly. In this case, we use the notation $a_i^* b_i$ as a short-hand version for $\sum_{i=1}^{n} a_i^* b_i$.

Trace of a Matrix

The trace of a square matrix is one of several characteristics associated with square matrices where we express properties of the matrix with a single number. The trace of a square matrix is the sum of the diagonal elements:

$$\operatorname{Tr}(A) = a_{11} + a_{22} + \cdots + a_{nn} = \sum_{i=1}^{n} a_{ii}. \tag{5.5}$$

Some properties of the trace and the sum, difference, and product of matrices are

$$\operatorname{Tr}(A \pm B) = \operatorname{Tr}(A) \pm \operatorname{Tr}(B) \tag{5.6}$$

and

$$\operatorname{Tr}(AB) = \operatorname{Tr}(BA). \tag{5.7}$$

EXAMPLE
Find the trace of

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}.$$

The trace is $\operatorname{Tr}(A) = 1 + 5 + 9 = 15$.

Determinant of a Matrix

The determinant of a matrix is also a single number, which is only defined for square matrices and is denoted as

$$\det(A) = |A| = \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix}. \tag{5.8}$$

The determinant is calculated as

$$\det(A) = \sum_{P[\alpha\beta\dots\omega]} \epsilon_{\alpha\beta\dots\omega}\, a_{1\alpha} a_{2\beta} \cdots a_{n\omega}. \tag{5.9}$$

The above sum runs over all permutations of the indices indicated by $P[\alpha\beta\dots\omega]$. For example, two different permutations of the indices $i$ and $j$ are $ij$ and $ji$. For $n$ indices, $n!$ permutations can be calculated, i.e., the sum runs over $n!$ terms. The quantity $\epsilon_{\alpha\beta\dots\omega}$ is called the anti-symmetric tensor, which takes the values +1 and −1 depending on how often the indices are changed:

$$\epsilon_{\alpha\beta\dots\omega} = \begin{cases} +1 & \text{for an even permutation of } 1, \dots, n \\ -1 & \text{for an odd permutation of } 1, \dots, n \\ 0 & \text{if two indices are the same.} \end{cases} \tag{5.10}$$

This means that $\epsilon_{\alpha\beta\gamma\dots\omega} = -\epsilon_{\beta\alpha\gamma\dots\omega} = +\epsilon_{\beta\gamma\alpha\dots\omega}$.

Permutation
A permutation is one of several ways that a number of items can be arranged.

EXAMPLE
Calculate the determinant of an arbitrary 2 × 2 matrix.

For an arbitrary 2 × 2 matrix we obtain

$$|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = \epsilon_{12} a_{11} a_{22} + \epsilon_{21} a_{12} a_{21} = a_{11} a_{22} - a_{12} a_{21}.$$

Notice that the determinant is the product of the diagonal elements minus the product of the off-diagonal elements. This is always the case for 2 × 2 matrices, but this simple rule does not carry over to matrices of other dimensions.

EXAMPLE
Calculate the determinant of a 3 × 3 matrix.

For a 3 × 3 matrix we obtain

$$|A| = \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11}a_{22}a_{33} - a_{11}a_{23}a_{32} + a_{12}a_{23}a_{31} - a_{12}a_{21}a_{33} + a_{13}a_{21}a_{32} - a_{13}a_{22}a_{31}.$$

Another way to calculate the determinant of a matrix is to use the Laplace expansion, also called the cofactor expansion. The cofactors are

$$C_{ij} = (-1)^{i+j} M_{ij} \tag{5.11}$$

where the minor $M_{ij}$ is the determinant of the matrix of size $(n-1) \times (n-1)$, which is obtained by removing all elements of the $i$th row and $j$th column of the original matrix A.

For future use, we record several useful properties of determinants.

1. $|A^T| = |A|$
2. $|A^{\dagger}| = |(A^*)^T| = |A^*| = |A|^*$
3. $|AB| = |A||B| = |BA|$
4. $|\lambda A| = \lambda^n |A|$
5. If two rows or columns of a matrix are interchanged, the determinant changes its sign but not its absolute value.
6. If two rows or columns of a matrix are identical, the determinant is zero.

Inverse of a Matrix

Just as in the familiar cases of operations on the real numbers and applications of func-
tions on real numbers, it is sometimes possible to multiply two matrices and obtain the
identity matrix. When

$$AB = BA = I, \tag{5.12}$$

we call B the inverse of A, denoted B = A−1. Note that AB = I = BA can only be satis-
fied when A is a square matrix, and that it is possible that BA = I but AB ≠ I, and vice
versa.

The entries of the inverse matrix can be calculated using

$$\left(A^{-1}\right)_{ij} = \frac{C_{ji}}{\det(A)} \tag{5.13}$$

where $C_{ji}$ are the cofactors defined in equation 5.11 with the indices swapped, namely $i, j$ becomes $j, i$.

Matrices with $\det(A) = 0$ are called singular and cannot be inverted.

For the case of an invertible 2 × 2 matrix A,

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},$$

the inverse of A is

$$A^{-1} = \frac{C^T}{\det(A)} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}. \tag{5.14}$$

EXAMPLE
Find the inverse of the matrix

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 1 & 0 & 6 \end{pmatrix}.$$

We first find the determinant by cofactor expansion:

$$\det(A) = 1(4 \cdot 6 - 0 \cdot 5) - 2(0 \cdot 6 - 1 \cdot 5) + 3(0 \cdot 0 - 1 \cdot 4) = 24 + 10 - 12 = 22.$$

We then find the cofactors for each entry:

$$\begin{aligned}
C_{11} &= \begin{vmatrix} 4 & 5 \\ 0 & 6 \end{vmatrix} = 24, & C_{12} &= -\begin{vmatrix} 0 & 5 \\ 1 & 6 \end{vmatrix} = 5, & C_{13} &= \begin{vmatrix} 0 & 4 \\ 1 & 0 \end{vmatrix} = -4, \\
C_{21} &= -\begin{vmatrix} 2 & 3 \\ 0 & 6 \end{vmatrix} = -12, & C_{22} &= \begin{vmatrix} 1 & 3 \\ 1 & 6 \end{vmatrix} = 3, & C_{23} &= -\begin{vmatrix} 1 & 2 \\ 1 & 0 \end{vmatrix} = 2, \\
C_{31} &= \begin{vmatrix} 2 & 3 \\ 4 & 5 \end{vmatrix} = -2, & C_{32} &= -\begin{vmatrix} 1 & 3 \\ 0 & 5 \end{vmatrix} = -5, & C_{33} &= \begin{vmatrix} 1 & 2 \\ 0 & 4 \end{vmatrix} = 4.
\end{aligned}$$

The cofactor matrix is

$$C = \begin{pmatrix} 24 & 5 & -4 \\ -12 & 3 & 2 \\ -2 & -5 & 4 \end{pmatrix}.$$

Hence, the inverse matrix is

$$A^{-1} = \frac{C^T}{\det(A)} = \frac{1}{22} \begin{pmatrix} 24 & 5 & -4 \\ -12 & 3 & 2 \\ -2 & -5 & 4 \end{pmatrix}^T = \frac{1}{22} \begin{pmatrix} 24 & -12 & -2 \\ 5 & 3 & -5 \\ -4 & 2 & 4 \end{pmatrix}.$$
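This result can be verified numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [0, 4, 5],
              [1, 0, 6]])

A_inv = np.linalg.inv(A)
print(np.round(A_inv * 22))               # 22 * A^-1 matches the cofactor result
print(np.allclose(A @ A_inv, np.eye(3)))  # True: A A^-1 = I
```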

Eigenvalues and Eigenvectors of a Matrix

Observe that a column vector is an n × 1 matrix, and therefore, it is possible to perform certain matrix operations on vectors. This means that we can use matrices to modify vectors or express certain operations as the product of a matrix and a vector. For example, we can express a rotation as a matrix applied to a vector, or a shearing transformation that moves one axis to a different angle but only stretches or compresses another axis.

In such an example, we may find that the matrix operation applied to a vector only changes the magnitude or length of the vector:

$$A\vec{x} = \lambda\vec{x}. \tag{5.15}$$

This equation is called the eigenvalue problem for matrix A, where the λ are either real- or complex-valued numbers called the eigenvalues of the matrix, and the solution $\vec{x}$ is an eigenvector associated with λ. This equation can be written as

$$(A - \lambda I)\vec{x} = \vec{0}, \tag{5.16}$$

which gives rise to the characteristic equation. It can be viewed as a homogeneous system of linear equations of the form $B\vec{x} = \vec{0}$ where $B = A - \lambda I$. If the determinant of B is zero, the characteristic equation has a nontrivial (nonzero) solution and we can determine the eigenvalues of A. Note that $\vec{x} = \vec{0}$ is always a solution to this equation, so the nontrivial requirement is important.

EXAMPLE
Calculate the eigenvalues and associated eigenvectors of the matrix

$$A = \begin{pmatrix} 10 & -3 \\ -3 & 2 \end{pmatrix}.$$

First, we form the equation

$$(A - \lambda I)\vec{x} = \begin{pmatrix} 10-\lambda & -3 \\ -3 & 2-\lambda \end{pmatrix} \vec{x} = \vec{0}.$$

Next, we find

$$\begin{aligned}
\det(A - \lambda I) &= (10-\lambda)(2-\lambda) - (-3)(-3) \\
&= 20 - 10\lambda - 2\lambda + \lambda^2 - 9 \\
&= \lambda^2 - 12\lambda + 11 \\
&= 0
\end{aligned}$$

and then solve the characteristic equation

$$\begin{aligned}
\lambda^2 - 12\lambda + 11 &= 0 \\
\lambda^2 - 12\lambda &= -11 \\
(\lambda - 6)^2 &= -11 + 36 \\
(\lambda - 6)^2 &= 25,
\end{aligned}$$

which has solutions $\lambda_1 = 1$ and $\lambda_2 = 11$.

Now that we have the eigenvalues, we can find the associated eigenvectors. For $\lambda_1 = 1$, we get

$$(A - 1I)\vec{x} = \begin{pmatrix} 10-1 & -3 \\ -3 & 2-1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \vec{0},$$

which leads to the system of equations

$$\begin{aligned}
9x_1 - 3x_2 &= 0 \\
-3x_1 + x_2 &= 0
\end{aligned}$$

and is solved for $x_1 = t$, $x_2 = 3t$ where $t$ is a free variable. The same approach is taken for $\lambda_2 = 11$. The equation

$$(A - 11I)\vec{x} = \begin{pmatrix} 10-11 & -3 \\ -3 & 2-11 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \vec{0}$$

leads to the system of equations

$$\begin{aligned}
-x_1 - 3x_2 &= 0 \\
-3x_1 - 9x_2 &= 0.
\end{aligned}$$

Using the parameter $t$, we are able to describe the infinitely many solutions as those satisfying $x_1 = t$, $x_2 = -t/3$. Typically, we choose eigenvectors that are unit vectors; in this case

$$\vec{x}_1 = \frac{1}{\sqrt{10}} \begin{pmatrix} 1 \\ 3 \end{pmatrix} \quad \text{and} \quad \vec{x}_2 = \frac{1}{\sqrt{10}} \begin{pmatrix} 3 \\ -1 \end{pmatrix}.$$
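The eigenvalues and unit eigenvectors can be cross-checked numerically; a minimal sketch assuming NumPy (the ordering and signs returned by the library may differ from the hand calculation):

```python
import numpy as np

A = np.array([[10, -3],
              [-3, 2]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # [11., 1.] (order not guaranteed)
print(eigenvectors)  # unit eigenvectors as columns, possibly up to sign
```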

5.3 Diagonalization
Change of Basis


A basis $\{\vec{e}_i : i = 1, 2, \dots, N\}$ is a minimal spanning set of linearly independent vectors. One example is the Cartesian coordinate system, which forms a basis of $\mathbb{R}^3$ where $\vec{i} = \vec{e}_1$ points along the positive x-axis, $\vec{j} = \vec{e}_2$ points along the positive y-axis, and $\vec{k} = \vec{e}_3$ points along the positive z-axis.

Linearly Independent
Vectors are linearly independent if they cannot be expressed as linear combinations of each other.


Considering an n-dimensional vector space with basis $\vec{e}_1, \dots, \vec{e}_n$, every vector $\vec{x}$ in the vector space can be expressed as a linear combination of the basis vectors, namely,

$$\vec{x} = x_1\vec{e}_1 + x_2\vec{e}_2 + \cdots + x_n\vec{e}_n, \tag{5.17}$$

or $\vec{x}$ can be written in vector form as

$$\vec{x} = (x_1, x_2, \dots, x_n)^T. \tag{5.18}$$

However, the familiar $\{\vec{i}, \vec{j}, \vec{k}\}$ is not the only basis for $\mathbb{R}^3$. For example, sometimes it is easier to consider a problem in spherical coordinates rather than in Cartesian coordinates. The new base vectors $\vec{e}_j'$ can be expressed as

$$\vec{e}_j' = \sum_{i=1}^{N} S_{ij} \vec{e}_i \tag{5.19}$$

where $S_{ij}$ is a matrix that transforms from the old base $\vec{e}_i$ to the new base $\vec{e}_j'$. In the following, we denote the original base with $\vec{e}_i$ and the new base that we want to change to with $\vec{e}_i'$. Intuitively, it is clear that any object represented by a vector in either base does not change its properties if we express it using $\vec{e}_i$ or $\vec{e}_i'$. For example, we can imagine that the vector $\vec{x}$ refers to the position of an object. Irrespective of how we choose the base to refer to the position of the object, its position remains unchanged. In general, an arbitrary vector can be expressed as

$$\vec{x} = \sum_{i=1}^{N} x_i \vec{e}_i = \sum_{i=1}^{N} x_i' \vec{e}_i' = \sum_{j=1}^{N} x_j' \sum_{i=1}^{N} S_{ij} \vec{e}_i. \tag{5.20}$$
i=1 i=1 j=1 i=1


This means that we can express the components of the vector $\vec{x}$ in the original base (or coordinate system) $\vec{e}_i$. We denote these components of the vector $\vec{x}$ with $x_i$. We can also express the components of the vector $\vec{x}$ in the new base (or coordinate system) defined by $\vec{e}_i'$ — in this case, we denote the components by $x_i'$. The matrix $S_{ij}$ connects the two representations

$$x_i = \sum_{j=1}^{N} S_{ij} x_j', \tag{5.21}$$

or in vector notation,

$$\vec{x} = S\vec{x}' \quad \Longleftrightarrow \quad \vec{x}' = S^{-1}\vec{x}. \tag{5.22}$$


In the case of equations where an arbitrary matrix $A$ is applied to a vector $\vec{x}$, this can be expressed in both bases as

$$\vec{y} = A\vec{x}, \quad \vec{y}' = A'\vec{x}' \tag{5.23}$$

and using equation 5.22, the first equation in the line above can be written as

$$S\vec{y}' = AS\vec{x}' \tag{5.24}$$

expressing $\vec{x}$ and $\vec{y}$ in the new “primed” coordinate system using the matrix $S$. Hence

$$\vec{y}' = S^{-1}AS\vec{x}', \tag{5.25}$$

which implies that

$$A' = S^{-1}AS. \tag{5.26}$$

Matrix Diagonalization

Consider the matrix $S$ so that each column of the matrix corresponds to an eigenvector of some matrix $A$, so

$$S = \begin{pmatrix} \uparrow & \uparrow & & \uparrow \\ \vec{x}^1 & \vec{x}^2 & \cdots & \vec{x}^N \\ \downarrow & \downarrow & & \downarrow \end{pmatrix}. \tag{5.27}$$

In the above matrix, $\vec{x}^1, \vec{x}^2, \dots, \vec{x}^N$ indicate the eigenvectors. Note that the superscript does not denote an exponent in this case. They fulfill the equation for eigenvectors

$$A\vec{x}^j = \lambda_j \vec{x}^j.$$

Then, the entry $(A')_{ij}$ of matrix $A'$, which is the matrix $A$ expressed in the new base or coordinate system consisting of the eigenvectors, is given by

$$\begin{aligned}
(S^{-1}AS)_{ij} &= \sum_k \sum_l (S^{-1})_{ik} A_{kl} S_{lj} \\
&= \sum_k \sum_l (S^{-1})_{ik} A_{kl} (\vec{x}^j)_l && \text{(eigenvectors in matrix } S\text{)} \\
&= \sum_k (S^{-1})_{ik} \lambda_j (\vec{x}^j)_k && \text{(eigenvalues of matrix } A \text{ applied)} \\
&= \sum_k \lambda_j (S^{-1})_{ik} S_{kj} && \text{(eigenvectors in matrix } S\text{)},
\end{aligned}$$

which means that the resulting matrix is diagonal with the eigenvalues on the diagonal

$$A' = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_N \end{pmatrix}. \tag{5.28}$$

In this derivation we have used the following:

• the matrix S was chosen so that the columns of S are the eigenvectors of A,
• applying A to an eigenvector of A scales that eigenvector by the associated eigenvalue λ (recall that $A\vec{x} = \lambda\vec{x}$ when $\vec{x}$ is an eigenvector of A), and
• applying the inverse of a matrix to that matrix gives the identity matrix, namely, $S^{-1}S = I$.

EXAMPLE
Diagonalize the following matrix:

$$A = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 2 & 1 \\ -1 & 0 & 1 \end{pmatrix}.$$

As a first step, we need to find the eigenvalues of A via $\det(A - \lambda I) = 0$:

$$\det(A - \lambda I) = \begin{vmatrix} 2-\lambda & 0 & 0 \\ 1 & 2-\lambda & 1 \\ -1 & 0 & 1-\lambda \end{vmatrix} = (2-\lambda)^2(1-\lambda) = 0.$$

Next, we need to solve the characteristic equation $(A - \lambda I)\vec{x} = \vec{0}$ for each value of λ. For $\lambda_1 = 1$ we obtain

$$(A - 1I)\vec{x} = \begin{pmatrix} 2-1 & 0 & 0 \\ 1 & 2-1 & 1 \\ -1 & 0 & 1-1 \end{pmatrix} \vec{x} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ -1 & 0 & 0 \end{pmatrix} \vec{x}.$$

This leads to the following set of equations:

$$\begin{aligned}
x_1 &= 0 \\
x_1 + x_2 + x_3 &= 0 \\
-x_1 &= 0.
\end{aligned}$$

The first and third equation imply $x_1 = 0$, which means that $x_2 = -x_3$. The eigenvector is then given by

$$\vec{v}_1 = a_1 \begin{pmatrix} 0 \\ -1 \\ 1 \end{pmatrix}$$

where $a_1$ is an arbitrary constant. We can choose $a_1$ so that the eigenvector becomes a unit vector. However, as we want to use the matrix that we construct from the eigenvectors in the later process of changing from one base or coordinate system to another, we choose $a_1 = 1$ to make the resulting matrix of eigenvectors as simple as possible. We find the eigenvectors for $\lambda_2 = 2$ in the same way and obtain

$$\vec{v}_2 = a_2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \quad \text{and} \quad \vec{v}_3 = a_3 \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}.$$

Again, we choose $a_2 = 1$ and $a_3 = 1$ to simplify the matrix of eigenvectors. The matrix of eigenvectors is then

$$S = \begin{pmatrix} 0 & 0 & -1 \\ -1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}$$

and the diagonal matrix (with the eigenvalues on the diagonal) is

$$D = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix},$$

which can be verified using the expression $D = S^{-1}AS$.
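This verification is easily carried out numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

A = np.array([[2, 0, 0],
              [1, 2, 1],
              [-1, 0, 1]])
S = np.array([[0, 0, -1],
              [-1, 1, 0],
              [1, 0, 1]])

# Change of basis into the eigenvector basis: D = S^-1 A S
D = np.linalg.inv(S) @ A @ S
print(np.round(D))  # diagonal matrix with entries 1, 2, 2
```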

5.4 Tensors
Introduction to Tensors

So far, we have encountered scalars, vectors, and matrices. In many examples, we have
seen how these objects can be applied, for example, to describe physical systems. In the
following section, we will discover how we can treat these objects in a more unified way
and how we can use this to extend these objects by building an intuition using examples
from our physical world. Note that this is only a first introduction and not a fully formal
treatment of the underlying mathematical theory.

In some scenarios, a simple number is sufficient, e.g., “how many pieces of cake are there?”, which can be answered with a plain number such as “five”. However, in many situations, a simple number is not sufficient, e.g., “how far is it to your home?” Just answering “3” is not sufficient; we also need a unit, e.g., km: “my house is 3 km from here”. In applications and physics, these objects are called denominate numbers or scalars. Other examples are temperature, the (inner) energy of a system, and pressure. When calculating with scalars, we can add, subtract, multiply, or divide them, but we always get a scalar when operating with scalars.

In other situations, more information is required than denominate numbers, for example, answering the question “how do I get to your house?” In addition to a number, a direction is needed: “walk 3 km due north”. These objects are vectors and are characterized by a direction and a value, the magnitude of the vector. Other examples include velocity, acceleration, and (angular) momentum. Vectors are often represented by their components $\vec{v} = a\vec{i} + b\vec{j} + c\vec{k}$, where $\vec{i}$, $\vec{j}$, $\vec{k}$ are the unit vectors (for example, the x-, y-, or z-direction in a Cartesian coordinate system) and a, b, c are scalars denoting how far one has to go in each direction. We have already encountered how we can add, subtract, or multiply vectors. Depending on the operation, the result can either be a scalar or a vector.

• Sum: $\vec{w} = \vec{u} \pm \vec{v}$. The sum (or difference) of two vectors is a vector.
• Inner product: $\vec{u} \cdot \vec{v} = \eta$. The inner product is a scalar, and the inner product of a vector with itself is the square of its magnitude (length).
• Cross product: $\vec{s} = \vec{u} \times \vec{v}$. The cross product of two vectors $\vec{u}$ and $\vec{v}$ is a vector orthogonal to both $\vec{u}$ and $\vec{v}$. In the case of three-dimensional space, the cross product $\vec{s}$ is perpendicular to the plane spanned by $\vec{u}$ and $\vec{v}$.
• Multiplication by a scalar: A vector can be multiplied by a scalar to change its magnitude.

We have also used matrices to operate on vectors, which can involve rotating vectors, or expressing a change of basis or coordinate system. However, many physical systems cannot be described by scalars and vectors alone. A familiar example is the inertia matrix (or tensor) of an object: If we rotate a three-dimensional object such as a gyroscope, we find that the rotation around some axis is stable (at least, as long as the gyroscope does not lose too much energy due to friction in the course of our investigation). However, if we apply a force (a vector with both a direction and a magnitude) from outside, the gyroscope will generally re-orient itself along some direction that differs from that of the applied force. We need the nine elements of a 3-by-3 matrix (or tensor) to describe this behavior. Another example is the magnetic susceptibility of a material. When an external magnetic field $\vec{H}$ is applied, any material will, in general, show a magnetic response $\vec{M}$. However, in many materials, the external field and the magnetic response are not aligned. This means that the magnetic susceptibility χ is not a single number or a vector, and we need 3 · 3 = 9 elements to describe the response of the object: $\vec{M} = \chi\vec{H}$. To describe other physical systems, such as understanding our universe in the context of general relativity, or stationary or rotating black holes, we need objects with even more elements. We typically call these metrics in the context of general relativity.

Tensors extend and generalize what we have seen so far of scalars, vectors, and matrices in 3D space:

1. Scalar: Tensor of rank 0 (1 component).
2. Vector: Tensor of rank 1 (3 components).
3. Dyad: Tensor of rank 2 (3² = 9 components).
4. Triad: Tensor of rank 3 (3³ = 27 components).

This list can be extended to higher ranks, though these are then generally just called “tensor of rank k”. Here, we can understand the rank of a tensor intuitively as the number of indices we need to express the tensor. A scalar that has no index can be represented by a single number. The components of a vector can be identified by a single index, for example $a_i$, where we understand that the $a_i$ are the elements of the vector $\vec{a}$ and the index $i$ takes all values $i = 1, 2, \dots, n$ required to address all elements of the vector of dimension n. In the same way, we can denote the elements of some matrix A by two indices: $a_{ij}$.

If $\vec{a}$ and $\vec{b}$ are vectors and $T$ is a tensor of rank 2, the linear equation $\vec{b} = T\vec{a}$ can be expressed as the following system of linear equations, which is equivalent to our previous use of matrix operations:

$$\begin{aligned}
b_1 &= t_{11}a_1 + t_{12}a_2 + t_{13}a_3 \\
b_2 &= t_{21}a_1 + t_{22}a_2 + t_{23}a_3 \\
b_3 &= t_{31}a_1 + t_{32}a_2 + t_{33}a_3.
\end{aligned}$$

Tensors of rank 2 do behave like square matrices, but tensors generalize the concepts we have encountered so far.

We use this example to introduce the following summation convention: Without writing the summation sign explicitly, we always sum over indices that occur twice. In the above example, we can write each component $b_i$ of the vector $\vec{b}$ as a sum

$$b_i = \sum_{j=1}^{3} t_{ij} a_j = t_{ij} a_j \tag{5.29}$$

where we have used the convention to avoid writing the summation sign $\sum_{j=1}^{3}$. Note that this also implies that the range of the indices $i, j$ is no longer made explicit but is understood from the context. This notation becomes useful if many indices are used.
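Numerically, this summation convention maps directly onto NumPy's einsum, where a repeated index in the subscript string is summed over; a minimal sketch with hypothetical values for $t_{ij}$ and $a_j$:

```python
import numpy as np

t = np.arange(9).reshape(3, 3)  # hypothetical rank-2 tensor t_ij
a = np.array([1.0, 2.0, 3.0])   # hypothetical rank-1 tensor a_j

# b_i = t_ij a_j: the repeated index j is summed over
b = np.einsum('ij,j->i', t, a)
print(b)                      # [ 8. 26. 44.]
print(np.allclose(b, t @ a))  # True: equivalent to the matrix-vector product
```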

Tensors invariant under a permutation of their indices (they do not change sign when two indices are swapped) are called symmetric; if the sign flips, they are called anti-symmetric. Specifically,

if $t_{ijk \dots l} = t_{jik \dots l}$ when $i$ and $j$ are swapped: $T$ is symmetric;
if $t_{ijk \dots l} = -t_{jik \dots l}$ when $i$ and $j$ are swapped: $T$ is anti-symmetric.

Calculating with Tensors

• Addition: Adding or subtracting tensors is only defined for tensors of the same rank: $c_{ij} = a_{ij} + b_{ij}$. Addition is commutative, so $a_{ij} + b_{ij} = b_{ij} + a_{ij}$.
• Dyad product: Multiplication is very different to what we have seen so far with scalars or the inner product of vectors. It is obtained by multiplying each of the components term by term. The dyad product always leads to a tensor of higher rank: $r_{iklm} = a_{ik} \otimes b_{lm}$. The dyadic product is neither the inner nor the cross product, but a new entity. The dyad product is generally not commutative, so $a_{ik} \otimes b_{lm} \neq b_{lm} \otimes a_{ik}$. In the case of two tensors of rank 1, $\vec{a} = (a_1, a_2, a_3)^T$ and $\vec{b} = (b_1, b_2, b_3)^T$, the dyadic product is

$$\vec{a} \otimes \vec{b} = \begin{pmatrix} a_1 b_1 & a_1 b_2 & a_1 b_3 \\ a_2 b_1 & a_2 b_2 & a_2 b_3 \\ a_3 b_1 & a_3 b_2 & a_3 b_3 \end{pmatrix}. \tag{5.30}$$

• Contraction: Tensor contraction is used when we sum over the indices of a tensor, that is, if an index occurs twice. Recall that we can view vectors as tensors of rank one. The dot product of two vectors $\vec{a} \cdot \vec{b}$ is given by the sum of the components $a_1 b_1 + a_2 b_2 + a_3 b_3$ for three-dimensional vectors. This can also be expressed as $\vec{a} \cdot \vec{b} = \sum_{i=1}^{3} a_i b_i = a_i b_i$ if we apply the summation convention that we introduced earlier. The result is a number, i.e., a tensor of rank 0. In other words, we have contracted the index $i$ of the vectors (tensors of rank 1) $a$ and $b$.
• Trace: The trace is an example of a contraction and is calculated in the same way as we have seen in the case of a square matrix, namely as a summation over the diagonal: $\operatorname{Tr}(r_{ik}) = r_{ii}$.
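The dyad product, contraction, and trace can all be written in the same index notation with einsum; a minimal sketch assuming NumPy and hypothetical vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Dyad product (a ⊗ b)_ij = a_i b_j: two rank-1 tensors give a rank-2 tensor
dyad = np.einsum('i,j->ij', a, b)  # same result as np.outer(a, b)

# Contraction over i: a_i b_i is the dot product, a rank-0 tensor
dot = np.einsum('i,i->', a, b)     # 32

# Trace as a contraction over the diagonal: Tr(r_ik) = r_ii
trace = np.einsum('ii->', dyad)    # also 32 here, since r_ii = a_i b_i
print(dyad, dot, trace)
```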

Co- and Contravariant Tensors

In the following explanation, we use an intuition built by examples from our physical
world to introduce the concept of co- and contravariance. These concepts play an impor-
tant role in the description of physical system and, if we use tensors to describe these sys-
tems, a further convention discussed below affects how we use the indices of a tensor.

The quantitative description of a physical process is independent of the coordinate system in which we describe it. For example, we can describe a physical process in Cartesian coordinates using x, y, and z. If we were to use a spherical coordinate system with coordinates r, θ, ϕ or cylindrical coordinates with parameters r, θ, z to describe the same system, we must obtain the same physical relationships. However, describing a given physical system may be more convenient in one coordinate system than in another as, for example, we can exploit some form of symmetry. An object rotating at a fixed distance from an origin in a plane is easily described in spherical or cylindrical coordinates where the angle θ describing the rotation varies with time and the other coordinates are constant. To express the same system in Cartesian coordinates requires a more complex description involving both x and y. It is therefore often desirable to change the coordinate system or base. As we do so, some quantities will change but others will remain invariant.

Symmetry
If a system is symmetric under some operation, we mean that it looks the same whether we apply the operation or not.

An example of an invariant is the mass m of a particle, which is described as a scalar. Irrespective of the coordinate system used, the mass will stay the same. Using the example above, if we express the rotation of the object around a fixed point in Cartesian coordinates, we will obtain the same value of its mass as if we used cylindrical coordinates.

A normal vector such as the direction vector $\vec{r}$ or the velocity vector $\vec{v}$ will generally change if we change the coordinate system or base. For a vector to remain constant under such a change, its components have to contra-vary to compensate. For the vector to be independent of the change of base or coordinate system, the matrix that transforms the vector components has to be the inverse of the matrix expressing the change in base or coordinate system. We call this property contravariant. As an example, we consider the displacement vector $\vec{r}$ pointing to an object. If the object is at a distance of 1 m in the x-direction, we can express this as (1 m, 0 m, 0 m). However, if we change the units from meters to millimeters in the x-direction, the same object is now referred to as (1000 mm, 0 m, 0 m). To change the base, in this case going from meters to millimeters, we divide the base by 1000 but multiply the components of the vector by 1000.

A covector, on the other hand, transforms the same way as the change of base or coordinate system does; its components co-vary and hence, we call this property covariant. An example of a covariant vector is the gradient of a field. For example, if we measure the strength of an electric field in V/m and change the unit of length to mm, we need to multiply the coordinate system or base by 1000. The resulting covector also needs to be multiplied by 1000 in order to stay invariant under this change of coordinate system or base.

Generally, contravariant tensors are denoted with superscript indices, i.e., $r^i$, and covariant tensors are denoted with subscript indices, $r_i$.

We can now connect this to our discussion of tensors. Tensors can have both co- and contravariant properties. This means that, in general, a tensor will have both superscript and subscript indices, for example $g^k_{ij}$, depending on how it behaves during a change of base or coordinate transformation.

SUMMARY
Matrices and vector spaces are key ingredients in many complex calcula-
tions. For example, matrices can be used to change coordinate systems.
Basic skills in calculating with matrices include adding and subtracting
matrices, performing matrix/vector operations, and finding the eigen-
vectors and eigenvalues of matrices. An important step in changing
bases is the diagonalization of a matrix. Tensors extend the concepts
and calculations of vectors and matrices to a more general setting,
allowing for linear and non-linear coordinate systems.

UNIT 6
INFORMATION THEORY

STUDY GOALS

On completion of this unit, you will have learned...

– how to express the difference between a prediction and observed events using the
MSE.
– what the Gini index is and how to determine it.
– the concepts of information entropy, Shannon entropy, and Kullback-Leibler divergence.
– how to use the cross-entropy to compare two probability functions.
6. INFORMATION THEORY

Introduction
The field of information theory focuses on the quantification, processing, and communi-
cation of information. The concept of entropy was introduced in the field of information
science by Claude Shannon and has many important applications, including the quantifi-
cation of information contained in a data stream. We will also develop related tools for
measuring the degree of similarity between probability distributions and membership in
data classes. Such techniques are useful for classification tasks where one is concerned
with assigning an event to one or more pre-defined classes as well as in regression tasks,
which focus on the extrapolation or prediction of a quantity in a given situation.

Various metrics such as the mean squared error, Kullback-Leibler divergence, and cross-entropy can be derived from information theoretic principles and play a leading role in algorithmic approaches to the extrapolation and prediction of new or unknown events.

6.1 Mean Squared Error (MSE)


In many cases, we need to predict a quantity from observed values. One way to do this is a regression, where we associate one or more independent variables with a target or dependent variable we wish to predict. The independent variables are also often called features and the dependent variable is called a label. If we denote the independent variable(s) by $X$ and the dependent variable by $y$, we wish to establish a relationship so that $y = f(X, a_i)$, where $f$ is some functional form that maps the independent variable(s) $X$ to the dependent variable $y$. This mapping may depend on one or more parameters $a_i$.

The class $X$ is a collection of variables in a given dataset and $\vec{x}$ corresponds to a specific set of values of these ordered variables. The domain of the vector $\vec{x}$ does not necessarily fulfill all requirements of a vector space that we have encountered so far; it is often merely a convenient notation to express that we consider all observations or values of the independent variables for a specific event. For any concrete set of observations $\vec{x}$, this mapping f(.) results in a prediction ŷ. The obvious task in modeling is to estimate $f(X, a_i)$, which could be done, for example, by building an explicit statistical or causal model, or by using a machine learning algorithm to learn from past observations.

The simplest nontrivial example of such a model is linear regression, where the form of the function used to estimate the true values of the data is the line y = mx + b, for a single variable x. In this case, we have two free parameters, namely $a_1 = m$ and $a_2 = b$. To develop such a model, we not only need the observed values of the variables $X$, but also the corresponding observed true values $y$ in order to choose the parameters $a_i$ appropriately. During the development of the model, we obtain an intermediate estimate of the predicted value ŷ based on the current values of the parameters $a_i$, and we need a metric to determine how to optimize the parameters further. Once we have optimized the final parameter values of the model, we need to assess the accuracy of the model.

Free Parameters
Free parameters are those parameters in a model that need to be determined using the data.

One of the simplest and most popular metrics is the mean squared error

$$MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2. \tag{6.1}$$

The MSE is symmetric in the true observed value y and the model’s prediction ŷ. It puts a strong penalty on predictions ŷ that are far from the observed value y. Although this sounds like a desirable property, it also means that the metric can be dominated by a few extreme values, even if the bulk of the predicted values are close to the observed ones.
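A minimal sketch of equation 6.1, assuming NumPy and purely hypothetical observed and predicted values:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical observed values y_i
y_pred = np.array([1.1, 1.9, 3.2, 3.6])  # hypothetical predictions y_hat_i

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.055
```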

The MSE and other metrics can be used in two different ways in the evaluation of a predic-
tion model. One use is during model construction and the other is during model testing
and assessment.

1. Loss function: A loss function is used during model building while the parameters ai
are optimized.
2. Score function: A score function is used to compare the values predicted by the model
with the observed values after the model has been built.

In the case of the loss function, the metric is directly used to optimize the model parame-
ters. The final value(s) of the model parameters will depend on the loss function in the
sense that a different loss function will lead to different optimal parameters. The final
evaluation of the predicted values ŷ compared to the true values y is done using a score
function, which may or may not be different from the loss function.

6.2 Gini Index


The Gini index, or Gini coefficient, is a statistical measure of the degree of inequality of
values in frequency distributions. It is commonly applied to measure the distribution of
the income (or wealth) within a country. The index was developed by the Italian statistician Corrado Gini (Gini, 1912, 1921). The coefficient ranges from zero, corresponding to zero per-
cent difference and therefore total equality, to one, which corresponds to 100 percent dif-
ference and therefore perfect inequality. Values over one are only possible if there is nega-
tive income or wealth, namely people with debts. The higher the value of the Gini index,
the more pronounced the inequality is.

To determine the Gini coefficient of the income distribution of people in a given country,
we find the income of all the people in the country and present this data as a cumulative
percentage of population against the cumulative share of income earned. An example of
the resulting Lorenz Curve, is shown in the following figure:

123
Figure 35: Example of a General Lorenz Curve

Source: Shiela Miller (2020).

The main idea behind the Gini index is to show the extent to which wealth is or is not
evenly distributed throughout the population of a country. Researchers are often also
interested in whether the government of said country is trying to keep this ratio as low as
possible, namely whether they are striving for income equality. As we already know, the
Gini coefficient 0 ≤ G ≤ 1 is the ratio of the areas

$$G = \frac{A}{A+B}$$

depicted in the Lorenz Curve.

A few cases of interest are examined below.

1. If A = 0, the Lorenz Curve coincides with Line of Equality.


2. If G = 0, there is a “perfect” distribution of income, meaning a perfectly uniform dis-
tribution of income; all people in the country possess the same regular influx of
money.
3. If A is very large, the area B becomes very small. In this case, G ≈ 1 (the Gini coeffi-
cient is large) and there is very uneven distribution of income.

124
EXAMPLE
Suppose we live in a very small country with only ten people. Let’s call the residents $a_1, a_2, \dots, a_{10}$ and suppose that the total income for the country is $100 per day and this income is distributed evenly among the population so that the income of each resident $a_i$ is $10 per day ($i = 1, \dots, 10$). Evaluate A and G.

The total income distribution is shown in the table below, where the proportion
and cumulative proportion refer to the population considered in this example.
So, in this case, A=0 and G=0; the Lorenz Curve and the line of equality coin-
cide.

Figure 36: Lorenz Curve for an Even Income Distribution

Source: Shiela Miller (2020).

The cumulative proportions (“prop.”) of population versus the cumulative percentage of income (“inc.”) are shown below:

Table 1: Cumulative Proportions of Population versus Cumulative Percentage of Income

Citizen | Prop. | Cumul. prop. | Inc. | Cumul. inc.
a1      | 10%   | 10%          | 10%  | 10%
a2      | 10%   | 20%          | 10%  | 20%
a3      | 10%   | 30%          | 10%  | 30%
a4      | 10%   | 40%          | 10%  | 40%
a5      | 10%   | 50%          | 10%  | 50%
a6      | 10%   | 60%          | 10%  | 60%
a7      | 10%   | 70%          | 10%  | 70%
a8      | 10%   | 80%          | 10%  | 80%
a9      | 10%   | 90%          | 10%  | 90%
a10     | 10%   | 100%         | 10%  | 100%

Source: Shiela Miller (2020).

EXAMPLE
Suppose that the graph of the cumulative proportion of population (on the horizontal axis) against the cumulative percentage of income (on the vertical axis) is as shown below, where the Lorenz curve is defined by $y = x^5$.

Figure 37: Lorenz Curve for the Second Exercise

Source: Shiela Miller (2020).

Determine the Gini index.

We need to find the area of the region A, between the line of equality (in green) and the Lorenz curve (in red). One way to do this is to find the area of the triangular region below the line of equality and subtract the area under the Lorenz curve. The area of the triangle below the line of equality is half the area of the whole square, and is therefore 0.5. Let $I_B$ denote the area of the region B under the Lorenz curve. $I_B$ is then

$$I_B = \int_0^1 x^5\, dx = \frac{1}{6} \approx 0.1667$$

and therefore, the area of the region A is $0.5 - 0.1667 = 0.3333$. We can now find the Gini index, which is

$$G = \frac{A}{A+B} = \frac{0.3333}{0.5} = 0.667.$$
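The same calculation can be reproduced symbolically; a minimal sketch, assuming SymPy:

```python
import sympy as sp

x = sp.symbols('x')
lorenz = x**5  # Lorenz curve from the example

B = sp.integrate(lorenz, (x, 0, 1))  # area under the Lorenz curve: 1/6
A = sp.Rational(1, 2) - B            # area between line of equality and curve
G = A / (A + B)                      # Gini index
print(G)  # 2/3, i.e. approximately 0.667
```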

Gini Impurity

The Gini index should not be confused with the Gini impurity. Unfortunately, in practice the terminology is used interchangeably: the term “Gini index” is often used for the Gini impurity, and we need to check the context carefully to avoid further confusion. Similar to the Gini index discussed above, the Gini impurity is a measure of the homogeneity of a distribution of elements in a set and is related to the probability of incorrectly classifying an object in a data set. Suppose that we have N classification groups or classes in a given dataset and let $p_i$ be the probability of a random instance belonging to class $i$. Then we have the following cases for two subsequent experiments where we assign a class to an element of the dataset:

1. We obtain the identical output for the same category $i$ with probability $p_i^2$.
2. We obtain the identical output, irrespective of the category, with probability $\sum_{i=1}^{N} p_i^2$.
3. Using the above, we obtain two different outputs with probability $1 - \sum_{i=1}^{N} p_i^2$.

Therefore, to find the Gini impurity, we need to find the probability of being wrong about any given classification and then sum over all classifications. The Gini impurity is

$$G = \sum_{i=1}^{N} \sum_{j \neq i} p_i p_j. \tag{6.2}$$

It is sometimes computationally useful to write this formula in other ways. Recall that $\sum_{i=1}^{N} p_i = 1$, which means that we must assign each item to one of the available classes, and therefore, $p_i = 1 - \sum_{j \neq i} p_j$. Observe that

$$\begin{aligned}
G &= \sum_{i=1}^{N} p_i \sum_{j \neq i} p_j \\
&= \sum_{i=1}^{N} p_i (1 - p_i) \\
&= \sum_{i=1}^{N} \left( p_i - p_i^2 \right) \\
&= \sum_{i=1}^{N} p_i - \sum_{i=1}^{N} p_i^2 \\
&= 1 - \sum_{i=1}^{N} p_i^2
\end{aligned}$$

where we used the fact that $\sum_{i=1}^{N} p_i = 1$ in the last step, which is a result of the fact that there are no other possible outcomes except the N classifications.
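A minimal sketch of the Gini impurity in this final form, assuming NumPy and a hypothetical class distribution:

```python
import numpy as np

def gini_impurity(p):
    """Gini impurity 1 - sum(p_i^2) for class probabilities p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([1.0, 0.0]))       # 0.0: a pure set
print(gini_impurity([0.5, 0.5]))       # 0.5: maximal impurity for two classes
print(gini_impurity([0.7, 0.2, 0.1]))  # approximately 0.46
```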

6.3 Entropy, Shannon Entropy, Kullback-Leibler Divergence
What is Entropy?

Apart from the laws of quantum mechanics, entropy is perhaps the most confusing physical quantity. In our everyday language, it is associated with the degree of randomness in a system — for example, we would say that a cube of sugar dissolved in tea has a higher level of randomness, as it is natural for the sugar to dissolve, but we have never observed that a sweet tea spontaneously separates into tea and a cube of sugar at the bottom. We also often hear that “entropy defines the arrow of time.” The origin of these analogies is understandable, but they do not quite capture the concept of entropy. Furthermore, entropy was historically introduced first in thermodynamics and then in statistical physics. At first glance, both appear very different but, after careful consideration, they are equivalent to each other. Before turning to information science, it is therefore useful to understand entropy on a more fundamental level.

Arrow of Time
The arrow of time is a concept indicating that time always moves forward (and not backward) and that reactions follow this direction.

We start with the thermodynamic understanding of entropy and remind ourselves that
while some reactions occur spontaneously, others do not. A good textbook that covers this
subject in more detail is Physical Chemistry (Atkins & de Paula, 2006, p. 573 ff). For exam-
ple, a hot drink such as tea cools down to ambient temperature, a gas expands into the
available volume and a ball bounces a bit lower each time it hits the ground until it comes
to rest. In the case of the ball, we can intuitively understand this as with each bounce, the
ball transfers some of its kinetic energy to the ground, which is transformed into the ran-
dom thermal motion of the atoms in the ground, i.e. the ground heats up a little bit. How-
ever, we have never observed that a ball resting on a warm ground spontaneously jumps
into the air. This could only happen if all the atoms in the ground would act together and
push the ball away. We can then identify spontaneous reactions such as the bouncing ball
or expanding gas by looking for changes that lead to the dispersal of energy of the system:
As the ball bounces a bit less each time, it loses energy which is transferred into the ran-
dom motion of atoms into the ground.

The thermodynamic definition of entropy is then centered on the idea that a change in a system is related to the energy it loses in the process, which in turn can be expressed by the amount of energy that is transferred by heat. This sounds quite complicated, but in thermodynamics, the (inner) energy of a system is a measure of how much work a given system can do. For example, a compressed gas can turn a turbine, whereas a gas filling the available space cannot. As we have seen in the example of the bouncing ball, heat is related to the random motion of the atoms as opposed to uniform motion in the case of work. We can then conclude that the ability of a system to perform “useful” work is reduced in proportion to the amount of heat transferred into random motion. Furthermore, it is intuitively reasonable that this depends on the temperature: The effect of adding a bit more heat to an already hot system is much less than to a cold system. This is indeed then the thermodynamic definition of entropy:

$$dS = \frac{\delta Q}{T} \tag{6.3}$$

where $S$ denotes the entropy, $\delta Q$ the incremental amount of heat exchanged, and $T$ the temperature of the system. This definition gives us an understanding of why we associate entropy with randomness if we think back to the example of the bouncing ball. As a bit of heat is transferred to the ground, the atoms in the ground move around a bit more and their motion becomes more random or more unordered.

Inner Energy
The inner energy can be changed by either transferring energy as heat or performing work: $dU = \delta Q + \delta W$

Unfortunately, while the thermodynamic understanding of entropy explains our everyday experience well, it does not really help us to relate the entropy to any concept in information science. We therefore need to turn to statistical physics to gain a deeper understanding. In statistical physics, we are concerned with the emergent properties of large ensembles and not primarily with a detailed description of the interaction of two or maybe a few atoms or molecules on a quantum level. Instead, we analyze how a large number of molecules behave and treat them as little hard balls hitting each other. This simplification allows us to analyze macroscopic quantities: For example, one mole of water consists of approximately 10²³ water molecules (a mole is the base unit for the amount of a substance and contains exactly 6.02214076 · 10²³ particles, such as atoms or molecules). It would be nearly impossible to calculate all precise effects of this large number of molecules, nor is it necessary to do so in order to describe, for example, how a small amount of water heats up when we put it on a stove. In this approach, we neglect the contribution to the total energy of a system that arises from the interaction of molecules and instead assume that the molecules fly around as tiny "billiard balls", hitting each other constantly and thus not only exchanging energy but also changing modes of motion. Our large ensemble consequently consists of N of these billiard balls or molecules, and each molecule is in a specific state of energy ϵi. This concept of a "state of energy" is important, as the different levels of energy ϵi are not continuous but discrete: a number of molecules are in the ground state ϵ0 (the lowest energy state of a particle, e.g., an atom or molecule), others in the next level ϵ1, and so on. We can imagine that in the ground state ϵ0, all molecules are at rest in an ordered lattice and no longer move; this is not exactly correct, but it serves as a useful analogy. As we are only concerned with large numbers of molecules, we say that, on average, ni molecules occupy the energy state ϵi. The laws of statistical physics then tell us that the distribution of molecules across the possible states is governed by a single parameter: the temperature. We can visualize this in the following way: The hotter a system is, i.e.
the higher its temperature, the more energy states are accessible and the more the mole-
cules can move about and hit each other. In each collision, some molecules will lose a bit
of energy and go into a lower state, and other molecules will gain that energy and go into
a higher state but, on average, the population of states remains the same. At very low tem-
peratures, only a few energy states are accessible. This dependency already brings us closer
to our statistical understanding of entropy and how it is related to the randomness or
orderedness we have discussed above. As the temperature gets lower and lower, only the
ground state ϵ0 of the system is accessible and all particles are in that state. In this case, we
can write {N, 0, 0, …}. If the temperature is a bit higher, more states are accessible and
another configuration of the system might be {N−2, 2, 0, …}, where the first state ϵ1
above the ground state is now accessible. In general, the population of the system’s
energy states is described by {n0, n1, n2, …}, which can be achieved in W different ways
depending on which molecule is in which state. If we imagine the system as lots of identi-
cal balls, we can see that we can obtain the same configuration of states by many different
choices of which ball goes into which state as we cannot tell them apart. W is called the
“weight” of the configuration and is given by

W = N! / (n0! · n1! · n2! · …) .    (6.4)
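
To illustrate equation 6.4 with a deliberately small example: for N = 4 molecules distributed as {n0, n1, n2} = {2, 1, 1}, the weight is W = 4!/(2! · 1! · 1!) = 24/2 = 12, i.e., there are twelve distinct ways to assign the four molecules to this configuration of occupation numbers.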

With these tools, we can define the Boltzmann entropy:

S = kB ln W    (6.5)

where kB is the Boltzmann constant (kB = 1.38 · 10⁻²³ m² kg s⁻² K⁻¹) and W is the weight of the configuration. From the reasoning above, we can see that this quantity behaves in the same way as the thermodynamic definition we have seen earlier (in fact, one can show that they are equivalent). The only parameter that defines the system is the temperature T, which we can modify by the exchange of some heat q. In the limit T → 0, only the ground state is accessible, which means that only one configuration is possible, leading to W = 1 and hence S = 0, as ln 1 = 0. As only one state is accessible, the amount of "randomness" is minimal, and it increases as we raise the temperature (by adding some heat q) because more states become accessible.

There are, of course, exceptions to this. One example is carbon monoxide (CO). The ground state is such that a carbon atom C is followed by an oxygen atom O and, as the temperature gets lower and lower, the only accessible state should be CO CO CO… as the system slowly "freezes" into a regular lattice. However, the state OC is not much different from CO in terms of energy, and hence it can happen that the configuration OC is "trapped" because not enough energy is available to flip into CO; as T → 0, our lattice might look like this: COCOOC…. This gives us a first glimpse of how entropy can be related to information science: if we denote the configuration CO with 0 and OC with 1, the above sequence expressed as a bit stream reads 001….

For later use, it is convenient to rewrite the Boltzmann entropy as

ln W = ln [N! / (n0! n1! n2! …)]    (6.6)

     = ln N! − (ln n0! + ln n1! + ln n2! + …)    (6.7)

     = ln N! − ∑i ln ni! .    (6.8)

We can simplify the factorials using Stirling's approximation, ln x! ≈ x ln x − x:

ln W = ln N! − ∑i ln ni!    (6.9)

     ≈ N ln N − N − ∑i (ni ln ni − ni)    (6.10)

     = N ln N − ∑i ni ln ni ,    (6.11)

where the last step uses N = ∑i ni to cancel −N against +∑i ni.

Using N = ∑i ni, we can express the entropy as

S = kB ∑i (ni ln N − ni ln ni)    (6.12)

  = −kB ∑i ni ln (ni/N)    (6.13)

  = −N kB ∑i pi ln pi ,    (6.14)

where pi = ni/N is the fraction of molecules in state i or, equivalently, the probability that a randomly chosen molecule is in state i.
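
As a quick numerical sketch (our own illustration; the occupation numbers are made up for demonstration), the following Python snippet compares the exact value of ln W with its Stirling approximation and evaluates the corresponding Boltzmann entropy:

import math

k_B = 1.380649e-23  # Boltzmann constant in J/K

def log_weight_exact(ns):
    # Exact ln W = ln N! - sum_i ln ni!, computed via the log-gamma function.
    N = sum(ns)
    return math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in ns)

def log_weight_stirling(ns):
    # Stirling approximation: ln W ≈ N ln N - sum_i ni ln ni.
    N = sum(ns)
    return N * math.log(N) - sum(n * math.log(n) for n in ns if n > 0)

ns = [600, 300, 100]  # hypothetical occupations of the states ϵ0, ϵ1, ϵ2
print(log_weight_exact(ns))        # exact ln W
print(log_weight_stirling(ns))     # close to the exact value for large N
print(k_B * log_weight_exact(ns))  # Boltzmann entropy S = kB ln W in J/K

For realistic particle numbers (N of the order of 10²³), the factorials themselves are far too large to evaluate directly, which is exactly why Stirling's approximation is so useful.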

Shannon Entropy

The father of information theory, Claude Shannon, introduced the term entropy to describe the minimum encoding size necessary to send a message without information loss (Shannon, 1948). This has two components. The first is the maximal compression rate we can achieve to transmit the information; this is related to the entropy. The second is concerned with the technical implementation and is related to the maximal capacity of a transmission channel. The latter is part of electrical engineering and, for the remainder, we focus on the first part.

In information theory, we are concerned with the amount of information that we can
obtain from a system and the information content of some event A is defined as

I(A) = −log2 p(A) = log2 (1/p(A))    (6.15)

where p(A) is the probability that the event occurs. We notice that the information content decreases when the probability of an event increases: the more likely an event becomes, the less "surprised" we are about it and the more we expect it, which implies that it merely confirms the information we already had. In the extreme case that p(A) = 1, where the event always occurs, no further information is added. We also note that the information due to independent events is additive, I(A1 ∩ A2) = I(A1) + I(A2), since the probabilities of independent events multiply and the logarithm turns the product into a sum.
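
For example, observing the outcome of a fair six-sided die roll yields I = log2 6 ≈ 2.58 bits of information, while observing one of two equally likely outcomes yields exactly one bit.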

We now turn to larger systems that are described by some discrete variable X that can
take the values {x1, x2, …, xn} according to some probability distribution p(X). The Shan-
non entropy is then defined as the average information content of an outcome

H = E[I(X)] = E[−log2 p(X)]    (6.16)

  = −∑i p(xi) log2 p(xi) ,    (6.17)

where E[·] is the expectation value we use to calculate the average (for a continuous variable, E[X] = ∫ x p(x) dx; for a discrete one, E[X] = ∑i xi p(xi)). Comparing this definition to equation 6.14, we find that they are the same, apart from the change in base from the natural logarithm to base two and the fact that the Shannon entropy does not have the constants N kB, as they are not directly related to a physical system. Since we have already studied the entropy of ensembles in the context of statistical physics, this connection is not surprising. In both cases, we are concerned with large systems that are described in terms of some probability function p that determines the likelihood that a possible discrete value or state is occupied.

So far, the event space was discrete. By considering a probability density function rather than discrete probabilities, it is possible to meaningfully discuss Shannon's entropy for underlying variables that have infinitely many possible values. The underlying topology of the variable we are measuring becomes important. If the possible values are the real numbers, equation 6.17 can be written in integral form

H(x) = −∫ p(x) log2 p(x) dx    (6.18)

where p(x) represents the probability density function.
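
For instance, for a uniform density p(x) = 1/a on the interval [0, a], the integrand is constant and we obtain H(x) = log2 a. Note that, unlike its discrete counterpart, this differential entropy can be negative, namely whenever a < 1.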

EXAMPLE
We can compute the entropy of a coin toss. For a fair coin, heads and tails come up with equal probability of 50 percent. The Shannon entropy is

H = −∑_{i=1}^{n} p(xi) log2 p(xi)

  = −∑_{i=1}^{2} (1/2) log2 (1/2)

  = −∑_{i=1}^{2} (1/2) · (−1)

  = 1 bit.

In this case, the entropy is maximal as we cannot predict the outcome of the next coin toss from what we have observed so far. Hence, we need one bit per coin toss to encode whether heads or tails comes up. However, if the coin is not fair, so that heads comes up with a higher probability p than the probability q = 1 − p for tails, our entropy would be different: H = −p log2(p) − q log2(q). This number is smaller than one, as the probability of heads is now higher and we are less "surprised" if heads comes up.

EXAMPLE
Calculate the Shannon entropy of the string 00100010.

First, we note that our system only has two states, zero and one. Counting the number of each, we find that we have six zeros and two ones out of eight characters. Hence, the probability of obtaining a zero is p(0) = 6/8 = 3/4 and the probability of obtaining a one is p(1) = 2/8 = 1/4. The Shannon entropy is then H = −0.75 log2(0.75) − 0.25 log2(0.25) ≈ 0.811 bits. We can compare this to the coin toss above: If the zeros (heads) occurred as often as the ones (tails), we would have H = 1. However, the zeros occur much more frequently than the ones; hence, encountering a zero conveys less information, as we can guess with a probability of 3/4 that the next character will be a zero.
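
Both results can be checked with a few lines of Python (a sketch of our own; the helper function shannon_entropy is not part of any standard library):

import math
from collections import Counter

def shannon_entropy(probs):
    # H = -sum_i p_i log2(p_i) in bits; terms with p_i = 0 contribute nothing.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Fair coin: two outcomes with probability 1/2 each.
print(shannon_entropy([0.5, 0.5]))  # 1.0 bit

# The string 00100010: estimate probabilities from character frequencies.
s = "00100010"
probs = [count / len(s) for count in Counter(s).values()]
print(round(shannon_entropy(probs), 3))  # 0.811 bits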

The Kullback-Leibler Divergence

We can use the concept of entropy to determine how different two probability distributions p(x) and q(x) are. For each of them, we can define the Shannon entropy, and we can define the relative entropy or Kullback-Leibler (KL) divergence between p(x) and q(x) as

DKL(p ∥ q) = ∑i p(xi) log2 [p(xi) / q(xi)]    (6.19)

for discrete distributions p and q, and

DKL(p ∥ q) = ∫ p(x) log2 [p(x) / q(x)] dx    (6.20)

for the continuous case.

The relative entropy between two probability distributions over the same random variable
is a measure of how different the two distributions are. It satisfies Gibbs’ inequality

DKL(p ∥ q) ≥ 0    (6.21)

where DKL(p ∥ q) = 0 only if p(x) = q(x). The Kullback-Leibler divergence is sometimes also called the KL "distance", although it is not, strictly speaking, a distance, since it is not symmetric in p and q, i.e., its value changes if p and q are interchanged.
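
As a small numerical sketch (our own illustration, reusing the distribution of the bit string 00100010 against a fair-coin model q):

import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i log2(p_i / q_i) in bits.
    # Assumes q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.75, 0.25]  # distribution of the bit string 00100010
q = [0.50, 0.50]  # fair-coin model

print(round(kl_divergence(p, q), 3))  # 0.189 bits
print(round(kl_divergence(q, p), 3))  # 0.208 bits: not symmetric in p and q
print(kl_divergence(p, p))            # 0.0, consistent with Gibbs' inequality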

6.4 Cross Entropy


Unsurprisingly, we are not always right when we infer a probability distribution from a data set. We want to be able to discuss this possibility more formally, and it is for this purpose that we introduce the cross entropy. Given two probability distributions on the same underlying set of variables, call them p and q, suppose p is the true distribution but q is the distribution we have optimized for. How hard is it, measured in the number of bits we need, to identify an event in the space? The cross entropy of p and q is an effort to answer this question.

To define the cross entropy, we will use the entropy of the random variable x and the Kullback-Leibler divergence between the true probability distribution p and the one we use to estimate it, q. In essence, "how hard it is" to identify an event in the space is the natural difficulty (uncertainty), which is measured by the entropy, plus the added difficulty induced by estimating p with q, which is quantified by the Kullback-Leibler divergence. Recall that the Kullback-Leibler divergence DKL(p ∥ q) defined by equation 6.19 is

DKL(p ∥ q) = ∑i p(xi) log2 [p(xi) / q(xi)] .

Using the properties of logarithms, this can be rewritten as

DKL(p ∥ q) = ∑_{i=1}^{n} p(xi) log2 [p(xi) / q(xi)]

           = ∑_{i=1}^{n} p(xi) log2 p(xi) − ∑_{i=1}^{n} p(xi) log2 q(xi)    (6.22)

           = −H(p) + H(p, q) ,

where H(p) is the Shannon entropy of the distribution p, defined by equation 6.17, and H(p, q) is defined to be

H(p, q) = −∑_{i=1}^{n} p(xi) log2 q(xi)    (6.23)

and is called the cross entropy of p and q. The cross entropy H(p, q) represents the average number of bits required to identify an event, given that we have coded our scheme using the distribution q while the true distribution is p. Due to the asymmetry of the Kullback-Leibler divergence, the cross entropy is also generally asymmetric; in particular, H(p, q) ≠ H(q, p). Another basic attribute of the cross entropy is that it is bounded below by the entropy of the true distribution: since DKL(p ∥ q) ≥ 0, equation 6.22 gives H(p, q) = H(p) + DKL(p ∥ q) ≥ H(p). The smallest possible cross entropy is therefore obtained when we use the true distribution in our coding scheme. Namely, setting q = p in equation 6.23, we see

H(p, p) = −∑_{i=1}^{n} p(xi) log2 p(xi) = H(p) ,

which is the Shannon entropy.

In machine learning applications, the cross-entropy is also often used as a loss function during the optimization of a model, in particular for classification tasks where events are sorted into two or more categories. (In machine learning, algorithms are not explicitly programmed, but use data to learn specific relationships.) During model building and training, we know the true category that the event is in; this is our p, namely pk = 1 for the true category k and pl = 0 for all others. The prediction model returns a probability for each possible category, for example q1 = 0.1, q2 = 0.7, q3 = 0.01, …, where the sum ∑i qi = 1, as the event has to belong to exactly one of the categories. Hence, the cross-entropy determines how well the model q describes the true p.
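
As a minimal sketch (our own illustration; the predicted probabilities are invented for demonstration), the cross-entropy of a single classified event could be computed as follows:

import math

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i log2(q_i); assumes q_i > 0 wherever p_i > 0.
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# One-hot true distribution: the event belongs to the second category.
p_true = [0.0, 1.0, 0.0]
q_pred = [0.1, 0.7, 0.2]  # model output, summing to one

print(round(cross_entropy(p_true, q_pred), 3))  # 0.515 bits = -log2(0.7)
print(cross_entropy(p_true, p_true))            # 0.0 = H(p) for a one-hot p

Note that most machine learning libraries use the natural logarithm instead of base two; this rescales the loss by a constant factor and does not change the optimization.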

SUMMARY
Information science is a multidisciplinary field that studies the collection, classification, storage, processing, and dissemination of information. The field is concerned both with the underlying theoretical framework and theories, as well as with practical applications. Information science incorporates aspects from a wide range of fields such as computer science, cognitive science, and social science. One key aspect of the underlying theory in information science is to understand how much information is contained in a data stream and how to transmit this information in the smallest possible lossless encoding. An important tool used for this is entropy. By understanding how entropy describes the emergent properties of large ensembles of physical systems, we can also understand how the concept is used in information science.

We also discuss how to evaluate predictive models, starting with the widespread method of the Mean Squared Error (MSE), which is used to quantify the quadratic deviation between a model and the observed data. The Gini index is often confused with the Gini impurity. The Gini index is a statistical measure of distribution, commonly applied to study the income (or wealth) distribution of a country. The Gini impurity, in contrast, is an impurity measure used in machine learning, in particular in the construction of decision trees, to determine the probability of being wrong about any given classification. The Kullback-Leibler divergence between two probability distributions over the same variable is used to describe the degree of similarity between the two distributions. We conclude this unit by investigating what it actually means to estimate the entropy and how accurately we are able to do so, and we develop cross-entropy as a tool to compare probability distributions.

BACKMATTER
LIST OF REFERENCES
Atkins, P. W., & de Paula, J. (2006). Physical chemistry (8th ed.). W. H. Freeman.

Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning. Cambridge University Press. http://doi.org/10.1017/9781108679930

Gini, C. (1912). Variabilita e mutabilita. Memorie di metodologica statistica.

Gini, C. (1921). Measurement of inequality of incomes. The Economics Journal, 31(121), 124–126. http://dx.doi.org/10.2307/2223319

HiClipart. (n.d.). White panel sliding door illustration. https://www.hiclipart.com/free-transparent-background-png-clipart-iozib

Johnson, B., & Johnson, J. (2012, April 28). Cross-product in vector algebra. https://www.thunderbolts.info/wp/2012/05/02/appendix-i-vector-algebra/cross-product-in-vector-algebra/

Loomis, L., & Sternberg, S. (2014). Advanced calculus. World Scientific Publishing Company.

Math 10. (n.d.). Vector operations. https://www.math10.com/en/geometry/vectors-operations/vectors-operations.html

Oppenheim, A., Willsky, A. S., & Nawab, S. H. (1997). Signals & systems (2nd ed.). Prentice Hall.

Pexels. (n.d.). Shallow focus photography of a cavalier king charles spaniel. https://www.pexels.com/photo/shallow-focus-photography-of-a-cavalier-king-charles-spaniel-1390361/

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

Strang, G. (2017). Calculus (3rd ed.). Wellesley-Cambridge Press.

LIST OF TABLES AND FIGURES

Figure 1: A Parabola: f(x) = x² . . . 14

Figure 2: f(x) = x . . . 16

Figure 3: Constant Velocity Against Time . . . 21

Figure 4: Variable Velocity Against Time . . . 22

Figure 5: Parabola for x > 0 . . . 23

Figure 6: A Function Depending on Two Variables: f(x, y) = x² + y² . . . 30

Figure 7: Schematic View of a Function f(x, y) Which Is to Be Integrated . . . 34

Figure 8: Illustration of the Concept of Calculus of Variations with a Rope Hanging from Points A and B . . . 37

Figure 9: Two Points A, B Connected by a Path . . . 40

Figure 10: Three Examples of Biased and Unbiased Resolution Functions . . . 47

Figure 11: Convolution of Two Overlapping and Non-Overlapping Uniform Functions . . . 49

Figure 12: Convolution of a Gaussian and a Γ Function . . . 50

Figure 13: Multivariate Gaussian Function . . . 51

Figure 14: Illustration of Gaussian Blurring: Original, Medium, and Strong Blurring . . . 52

Figure 15: Periodic Sine Signal with One Contributing Frequency . . . 53

Figure 16: Periodic Sine Signal with Two Contributing Frequencies . . . 54

Figure 17: Periodic Sawtooth Signal . . . 54

Figure 18: Definition of a Low-Pass Filter . . . 58

Figure 19: Extracted Signal in the Time Domain Using a Low-Pass Filter . . . 59

Figure 20: Example of a Position Vector . . . 64

Figure 21: Graphical Representation of a Force of 10 N in a Direction of 30° North East . . . 65

Figure 22: Illustration of Vectors PQ and RS . . . 66

Figure 23: Addition of Two Vectors . . . 69

Figure 24: Difference of Two Vectors . . . 69

Figure 25: Addition of Two Vectors in Component Form . . . 70

Figure 26: Components of a Three-Dimensional Vector . . . 73

Figure 27: Angle Between Vectors . . . 77

Figure 28: Geometric Interpretation of the Cross-Product Between Vectors . . . 82

Figure 29: Illustration of the Rate of Change of the Vector Function a(u) . . . 87

Figure 30: A Force F Applied Along a Curve r . . . 91

Figure 31: Visualization of a Scalar Field . . . 92

Figure 32: Illustrations of Vector Fields in Everyday Life . . . 93

Figure 33: Visualization of a Vector Field Using Arrows to Indicate Direction and Strength of the Field at Each Position . . . 94

Figure 34: Visualization of a Vector Field as a Stream Plot . . . 94

Figure 35: Example of a General Lorenz Curve . . . 124

Figure 36: Lorenz Curve for an Even Income Distribution . . . 125

Table 1: Cumulative Proportions of Population versus Cumulative Percentage of Income . . . 125

Figure 37: Lorenz Curve for the Second Exercise . . . 127