
Chapter 5.

Vector Calculus

January 3, 2021

1 5.1 Differentiation of Univariate Functions


2 5.2 Partial Differentiation and Gradients
3 5.3 Gradients of Vector-Valued Functions
4 5.4 Gradients of Matrices
5 5.6 Backpropagation and Automatic Differentiation
5.6.1 Gradients in a Deep Network
5.6.2 Automatic Differentiation
5.1 Differentiation of Univariate Functions

Definition
Let D be a domain of R, and let f : D → R be a function.
The Taylor polynomial of degree n of a function f at x0 ∈ D is
defined as
T_n(x) := \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k .

For a smooth function f ∈ C^∞(D), the Taylor series of f at x_0 is
defined as

T_\infty(x) := \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k .

For x0 = 0, we obtain the Maclaurin series.


f is called analytic at x0 if T∞ (x) = f (x).
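
Illustration (added, not part of the original slides): a minimal SymPy sketch that builds T_n directly from the definition above. The helper taylor_poly and the choice f = cos(x), x_0 = 1 are assumptions for demonstration only.

```python
# Minimal sketch (not from the slides): build the degree-n Taylor polynomial
# of a function at x0 with SymPy and compare it to f near the expansion point.
import sympy as sp

x = sp.symbols("x")

def taylor_poly(f, x, x0, n):
    """Return T_n(x) = sum_{k=0}^n f^(k)(x0)/k! * (x - x0)^k."""
    return sum(sp.diff(f, x, k).subs(x, x0) / sp.factorial(k) * (x - x0)**k
               for k in range(n + 1))

f = sp.cos(x)
T4 = taylor_poly(f, x, sp.Integer(1), 4)              # expansion point x0 = 1
print(sp.simplify(T4))
print(float(f.subs(x, 1.2)), float(T4.subs(x, 1.2)))  # close to each other near x0
```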



5.1 Differentiation of Univariate Functions

Note.

1. The Taylor polynomial of degree n is an approximation of a function. For
   n = 1, we obtain the linear approximation.
2. The Taylor polynomial is similar to f in a neighborhood around x_0.
3. If f is a polynomial of degree k ≤ n, then the Taylor polynomial of degree n
   is an exact representation of f.

Figure: the function f(x) = cos(x) is approximated by Taylor polynomials
around x_0 = 1.



5.1 Differentiation of Univariate Functions

Example
Consider the function f(x) = x^3 + x - 3 and find the Taylor polynomial T_4
at x_0 = 1.
Answer. We have

f'(x) = 3x^2 + 1,  f''(x) = 6x,  f'''(x) = 6,  f^{(4)}(x) = 0.

Evaluating them at x_0 = 1, we get

f(1) = -1,  f'(1) = 4,  f''(1) = 6,  f'''(1) = 6,  f^{(4)}(1) = 0.

Thus, the Taylor polynomial T_4 of f at x_0 = 1 is

T_4(x) = -1 + 4(x - 1) + \frac{6}{2!}(x - 1)^2 + \frac{6}{3!}(x - 1)^3
       = -1 + 4(x - 1) + 3(x - 1)^2 + (x - 1)^3
       = f(x).
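
As a quick check (an added illustration, not from the slides), SymPy's series expansion confirms that T_4 reproduces f exactly:

```python
# Verify the worked example with SymPy (illustrative, not from the slides).
import sympy as sp

x = sp.symbols("x")
f = x**3 + x - 3
T4 = f.series(x, 1, 5).removeO()   # expansion around x0 = 1, terms up to degree 4
print(sp.expand(T4))               # x**3 + x - 3
print(sp.simplify(T4 - f))         # 0, so T4 represents f exactly
```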



5.1 Differentiation of Univariate Functions

Maclaurin series of some basic functions


e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{x^k}{k!}

\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!}

\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k}}{(2k)!}

\frac{1}{1-x} = 1 + x + x^2 + \cdots = \sum_{k=0}^{\infty} x^k \quad (|x| < 1).
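
Added illustration (assumed example values, not from the slides): partial sums of the Maclaurin series of e^x converge to the true value.

```python
# Partial sums of the Maclaurin series of exp(x) approach math.exp(x).
import math

def exp_partial_sum(x: float, n: int) -> float:
    """Sum the first n+1 terms of the Maclaurin series of e^x."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 1.5
for n in (2, 5, 10):
    print(n, exp_partial_sum(x, n))
print("math.exp:", math.exp(x))   # the partial sums approach this value
```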



5.2 Partial Differentiation and Gradients

Definition
A function f : R^n → R is a rule that assigns to each point x = (x_1, ..., x_n) ∈ R^n
a real value f(x) = f(x_1, ..., x_n).

Definition
For a function f : Rn → R, the partial derivatives of f are defined by

\frac{\partial f}{\partial x_1} := \lim_{h \to 0} \frac{f(x_1 + h, x_2, \ldots, x_n) - f(x_1, x_2, \ldots, x_n)}{h},
\quad \vdots
\frac{\partial f}{\partial x_n} := \lim_{h \to 0} \frac{f(x_1, x_2, \ldots, x_n + h) - f(x_1, x_2, \ldots, x_n)}{h}.

Note. When finding the partial derivative ∂f/∂x_i, we let only x_i vary and keep
the other variables constant.
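
Added sketch (not from the slides): the limit definition suggests a simple forward-difference approximation of ∂f/∂x_i. The helper and the example function below are assumptions for illustration.

```python
# Approximate partial derivatives via the limit definition with a small step h.
import numpy as np

def partial_derivative(f, x, i, h=1e-6):
    """Forward-difference estimate of df/dx_i at the point x."""
    x = np.asarray(x, dtype=float)
    e = np.zeros_like(x)
    e[i] = h
    return (f(x + e) - f(x)) / h

f = lambda x: x[0]**2 * np.sin(x[1])            # example function of two variables
print(partial_derivative(f, [1.0, 2.0], 0))     # ~ 2*1*sin(2)
print(partial_derivative(f, [1.0, 2.0], 1))     # ~ 1*cos(2)
```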
5.2 Partial Differentiation and Gradients

Example
Find the partial derivatives of the function f(x, y) = \frac{1}{1 + x^2 + 3y^4}.
Answer.

\frac{\partial f(x, y)}{\partial x} = -\frac{1}{(1 + x^2 + 3y^4)^2} \frac{\partial}{\partial x}(1 + x^2 + 3y^4) = -\frac{2x}{(1 + x^2 + 3y^4)^2},

\frac{\partial f(x, y)}{\partial y} = -\frac{1}{(1 + x^2 + 3y^4)^2} \frac{\partial}{\partial y}(1 + x^2 + 3y^4) = -\frac{12y^3}{(1 + x^2 + 3y^4)^2}.



5.2 Partial Differentiation and Gradients

Definition
For a function f : R^n → R of n variables x_1, ..., x_n, the gradient of f is
defined as

\nabla_x f = \operatorname{grad} f = \frac{df}{dx} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{1 \times n}.

Example
Find the gradient of the function f(x_1, x_2, x_3) = x_1^2 x_2^3 - 4x_2^2 x_3.
Answer.

\nabla_x f = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \frac{\partial f}{\partial x_3} \end{bmatrix} = \begin{bmatrix} 2x_1 x_2^3 & 3x_1^2 x_2^2 - 8x_2 x_3 & -4x_2^2 \end{bmatrix} \in \mathbb{R}^{1 \times 3}.
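
A numerical spot-check of this gradient (added illustration, not from the slides; the test point is arbitrary):

```python
# Compare the analytic gradient of f(x1, x2, x3) = x1^2 x2^3 - 4 x2^2 x3
# with central finite differences at one point.
import numpy as np

f = lambda x: x[0]**2 * x[1]**3 - 4 * x[1]**2 * x[2]
grad = lambda x: np.array([2 * x[0] * x[1]**3,
                           3 * x[0]**2 * x[1]**2 - 8 * x[1] * x[2],
                           -4 * x[1]**2])

x, h = np.array([1.0, 2.0, -1.0]), 1e-6
fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(fd, grad(x), atol=1e-4))   # True
```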


 



5.2 Partial Differentiation and Gradients

Basic rules of Partial Differentiation

For functions f, g : R^n → R of the variable x ∈ R^n:

Sum rule: ∇x [f + g ] = ∇x f + ∇x g .
Product rule: ∇x [fg ] = g (x)∇x f + f (x)∇x g .

Consider a function f : R^n → R of n variables x = (x_1, ..., x_n), where the
x_i(t) are themselves functions of t. Then we have the

Chain rule:  f'(t) = \frac{df}{dt} = \frac{\partial f}{\partial x_1}\frac{dx_1}{dt} + \cdots + \frac{\partial f}{\partial x_n}\frac{dx_n}{dt} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\frac{dx_i}{dt}.



5.2 Partial Differentiation and Gradients

Example
Consider f (x1 , x2 ) = x12 + x1 x2 , where x1 = sin t, x2 = cos t. Find the
derivative of f with respect to t.
Answer. We have

\frac{\partial f}{\partial x_1} = 2x_1 + x_2, \quad \frac{\partial f}{\partial x_2} = x_1, \quad \frac{dx_1}{dt} = \cos t, \quad \frac{dx_2}{dt} = -\sin t.

Hence

\frac{df}{dt} = (2x_1 + x_2)\cos t + x_1(-\sin t)
             = (2\sin t + \cos t)\cos t - \sin t \sin t
             = 2\sin t \cos t + \cos^2 t - \sin^2 t
             = \sin(2t) + \cos(2t).
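
Added numerical check (illustrative, not from the slides) that df/dt = sin(2t) + cos(2t):

```python
# Central-difference estimate of df/dt versus the analytic result.
import numpy as np

def f(t):
    x1, x2 = np.sin(t), np.cos(t)
    return x1**2 + x1 * x2

t, h = 0.7, 1e-6
numeric = (f(t + h) - f(t - h)) / (2 * h)
analytic = np.sin(2 * t) + np.cos(2 * t)
print(numeric, analytic)   # the two values agree closely
```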



5.3 Gradients of Vector-Valued Functions

Definition
A vector-valued function of n variables x = (x_1, ..., x_n), f : R^n → R^m, is
given as

f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{bmatrix} = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{bmatrix},

where each f_i : R^n → R is a function of x.

Definition
The partial derivative of a vector-valued function f : R^n → R^m with respect
to x_i, i = 1, ..., n, is given as the vector

\frac{\partial f}{\partial x_i} = \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix} \in \mathbb{R}^m.



5.3 Gradients of Vector-Valued Functions

Example
Consider a linear transformation (operator) f : R^n → R^m,

f(x) = Ax,  where A = [a_{ij}] ∈ R^{m×n}.

It is easy to see that the partial derivatives of f are

\frac{\partial f}{\partial x_j} = A_j = \begin{bmatrix} a_{1j} \\ \vdots \\ a_{mj} \end{bmatrix} \in \mathbb{R}^m, \quad j = 1, \ldots, n.
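
Added illustration (assumed random A and x, not from the slides): for f(x) = Ax, a finite difference in the x_j direction recovers the j-th column of A.

```python
# For a linear map f(x) = A x, df/dx_j equals the j-th column of A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
x = rng.normal(size=3)

j, h = 1, 1e-6
e = np.zeros(3); e[j] = h
fd = (A @ (x + e) - A @ x) / h        # finite-difference estimate of df/dx_j
print(np.allclose(fd, A[:, j]))       # True: it matches column j of A
```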



5.3 Gradients of Vector-Valued Functions

Example
Consider a vector-valued function

f(x, y) = \begin{bmatrix} xy^2 & y^3 & x^2 - y^2 \end{bmatrix}^T.

Then

\frac{\partial f}{\partial x} = \begin{bmatrix} y^2 \\ 0 \\ 2x \end{bmatrix}, \qquad \frac{\partial f}{\partial y} = \begin{bmatrix} 2xy \\ 3y^2 \\ -2y \end{bmatrix}.



5.3 Gradients of Vector-Valued Functions

Definition
The collection of all first-order partial derivatives of a vector-valued
function f : R^n → R^m is called the Jacobian. The Jacobian J is an m × n
matrix, which we define and arrange as follows:

J = \nabla_x f = \frac{df}{dx} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}.



5.3 Gradients of Vector-Valued Functions

Example
If f : R^n → R^m is a linear operator given by

f(x) = Ax,

where A is a matrix in R^{m×n}, then it is clear that the Jacobian of f is

\nabla_x f = A.

Example
Consider the vector-valued function f : R^2 → R^3,

f(x_1, x_2) = \begin{bmatrix} e^{-x_1 + x_2} \\ x_1 x_2^2 \\ \sin(x_1) \end{bmatrix}.

The Jacobian of f is

J = \nabla_x f = \begin{bmatrix} -e^{-x_1 + x_2} & e^{-x_1 + x_2} \\ x_2^2 & 2x_1 x_2 \\ \cos x_1 & 0 \end{bmatrix}.
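
Added sketch (not from the slides): a finite-difference Jacobian of this example agrees with the analytic J. The helper jacobian_fd and the test point are assumptions for illustration.

```python
# Finite-difference Jacobian of f(x1, x2) = (exp(-x1+x2), x1*x2**2, sin(x1)),
# compared against the analytic Jacobian above.
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([np.exp(-x1 + x2), x1 * x2**2, np.sin(x1)])

def jacobian_fd(f, x, h=1e-6):
    """Approximate the m x n Jacobian column by column."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = h
        J[:, j] = (f(x + e) - fx) / h
    return J

x = np.array([0.5, -1.0])
x1, x2 = x
J_analytic = np.array([[-np.exp(-x1 + x2), np.exp(-x1 + x2)],
                       [x2**2,             2 * x1 * x2     ],
                       [np.cos(x1),        0.0             ]])
print(np.allclose(jacobian_fd(f, x), J_analytic, atol=1e-4))   # True
```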



5.3 Gradients of Vector-Valued Functions

Chain rule

Consider a vector-valued function f : R^n → R^m given by

f(x) = f(x_1, \ldots, x_n) = \begin{bmatrix} f_1(x) & \cdots & f_m(x) \end{bmatrix}^T,

where the x_i = x_i(t_1, \ldots, t_l) are themselves functions of l variables
t = (t_1, ..., t_l). That is, x : R^l → R^n with

x(t) = x(t_1, \ldots, t_l) = \begin{bmatrix} x_1(t) & \cdots & x_n(t) \end{bmatrix}^T.

Then f ∘ x : R^l → R^m is given by (f ∘ x)(t) = f(x(t)), and

\frac{\partial f_j}{\partial t_k} = \sum_{i=1}^{n} \frac{\partial f_j}{\partial x_i}\frac{\partial x_i}{\partial t_k}, \quad \text{for all } j = 1, \ldots, m \text{ and } k = 1, \ldots, l,

or, in matrix form, \nabla_t f = \nabla_x f \, \nabla_t x.


5.3 Gradients of Vector-Valued Functions

Example
Consider the function f(x_1, x_2) = x_1^2 + 2x_1 x_2 with x_1(s, t) = s \cos t and
x_2(s, t) = s \sin t. Find the partial derivatives of f with respect to s and t.
Answer.

\frac{\partial f}{\partial s} = \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial s} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial s}
= (2x_1 + 2x_2)\cos t + 2x_1 \sin t
= (2s\cos t + 2s\sin t)\cos t + 2s\cos t \sin t
= 2s\cos^2 t + 2s\sin(2t),

\frac{\partial f}{\partial t} = \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t}
= (2x_1 + 2x_2)(-s\sin t) + 2x_1(s\cos t)
= -(2s\cos t + 2s\sin t)s\sin t + 2s^2\cos^2 t
= 2s^2\cos(2t) - s^2\sin(2t).
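
Added numerical sanity check (illustrative, not from the slides) of the partial derivatives above:

```python
# Compare finite-difference estimates of dF/ds and dF/dt with the closed forms.
import numpy as np

def F(s, t):
    x1, x2 = s * np.cos(t), s * np.sin(t)
    return x1**2 + 2 * x1 * x2

s, t, h = 1.3, 0.4, 1e-6
dF_ds = (F(s + h, t) - F(s - h, t)) / (2 * h)
dF_dt = (F(s, t + h) - F(s, t - h)) / (2 * h)
print(dF_ds, 2 * s * np.cos(t)**2 + 2 * s * np.sin(2 * t))        # match
print(dF_dt, 2 * s**2 * np.cos(2 * t) - s**2 * np.sin(2 * t))     # match
```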



5.4 Gradients of Matrices

Let A ∈ R^{4×2} and let x ∈ R^3. How do we define dA/dx? There are two
equivalent approaches:

1. Compute the partial derivatives ∂A/∂x_1, ∂A/∂x_2, ∂A/∂x_3, each of which is
   a 4 × 2 matrix, and collate them into a 4 × 2 × 3 tensor.
2. Re-shape A into a vector Ã ∈ R^8, then compute the gradient dÃ/dx ∈ R^{8×3},
   and re-shape this gradient to obtain the gradient tensor.
5.4 Gradients of Matrices

The gradient of an m × n matrix A with respect to a p × q matrix B is a
four-dimensional tensor J, whose entries are given as

J_{ijkl} = \frac{\partial A_{ij}}{\partial B_{kl}}.

We can also identify the space R^{m×n} of m × n matrices with the space R^{mn}.
Hence, we can re-shape A and B into vectors of lengths mn and pq, respectively.
The resulting Jacobian then has size mn × pq.
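
Added sketch (assumed example, not from the slides): computing such a Jacobian via the flattening identification, here for the matrix-valued map F(B) = AB with an arbitrary fixed A.

```python
# Jacobian of F: R^{p x q} -> R^{m x n} obtained by flattening output and input,
# so the result is an (m*n) x (p*q) matrix.
import numpy as np

def matrix_jacobian_fd(F, B, h=1e-6):
    """Finite-difference Jacobian of a matrix-valued function, flattened."""
    FB = F(B)
    J = np.zeros((FB.size, B.size))
    for k in range(B.size):
        dB = np.zeros(B.size); dB[k] = h
        J[:, k] = ((F(B + dB.reshape(B.shape)) - FB) / h).ravel()
    return J

A = np.arange(6.0).reshape(2, 3)           # fixed 2 x 3 matrix (illustrative)
F = lambda B: A @ B                        # F(B) = A B, maps R^{3 x 2} -> R^{2 x 2}
B = np.ones((3, 2))
print(matrix_jacobian_fd(F, B).shape)      # (4, 6) = (m*n, p*q)
```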



5.4 Gradients of Matrices

Example
Let f : R^{m×n} → R^{n×n} be given by f(A) = A^T A =: K ∈ R^{n×n}. Find the
gradient dK/dA.
Answer. The gradient dK/dA ∈ R^{(n×n)×(m×n)} is a tensor, and each entry K_{pq}
satisfies

\frac{dK_{pq}}{dA} \in \mathbb{R}^{1 \times m \times n}.

Denote by A_i the i-th column of A and by K_{pq} the (p, q)-entry of K,
p, q = 1, ..., n. Since

K_{pq} = A_p^T A_q = \sum_{k=1}^{m} A_{kp} A_{kq},

we obtain

\frac{\partial K_{pq}}{\partial A_{ij}} = \sum_{k=1}^{m} \frac{\partial}{\partial A_{ij}} (A_{kp} A_{kq}) =
\begin{cases}
A_{iq} & \text{if } j = p,\ p \neq q, \\
A_{ip} & \text{if } j = q,\ p \neq q, \\
2A_{iq} & \text{if } j = p = q, \\
0 & \text{otherwise.}
\end{cases}
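
Added numerical check (illustrative, not from the slides) of the case analysis above, for one choice of indices:

```python
# Verify d(A^T A)_pq / dA_ij against a finite-difference estimate.
import numpy as np

def dK_dA_entry(A, p, q, i, j):
    """Analytic partial derivative of K_pq = (A^T A)_pq with respect to A_ij."""
    if j == p and j == q:
        return 2 * A[i, q]
    if j == p:
        return A[i, q]
    if j == q:
        return A[i, p]
    return 0.0

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))
p, q, i, j, h = 0, 2, 1, 0, 1e-6

E = np.zeros_like(A); E[i, j] = h
fd = ((A + E).T @ (A + E) - A.T @ A)[p, q] / h     # finite-difference estimate
print(fd, dK_dA_entry(A, p, q, i, j))              # the two values agree
```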
5.6 Backpropagation and Automatic Differentiation 5.6.1 Gradients in a Deep Network

In deep learning, the function value y is computed as a many-level function
composition

y = (f_K ∘ f_{K-1} ∘ \cdots ∘ f_1)(x) = f_K(f_{K-1}(\cdots (f_1(x)) \cdots)),

where x are the inputs (e.g., images), y are the observations (e.g., class
labels), and every function f_i, i = 1, ..., K, possesses its own parameters.



5.6 Backpropagation and Automatic Differentiation 5.6.1 Gradients in a Deep Network

Given a neural network with multiple layers, the i-th layer computes

f_i(x_{i-1}) = σ(A_{i-1} x_{i-1} + b_{i-1}),

where x_{i-1} is the output of layer i - 1 and σ is an activation function
(sigmoid, ReLU, tanh, ...).
Training such a model requires us to compute the gradient of a loss function L
w.r.t. all model parameters θ_j = {A_j, b_j}, j = 0, ..., K - 1.



5.6 Backpropagation and Automatic Differentiation 5.6.1 Gradients in a Deep Network

Suppose we have inputs x and observations y, and a network structure

f_0 := x,
f_i := σ_i(A_{i-1} f_{i-1} + b_{i-1}), i = 1, ..., K.

We need to find θ = {θ_j} = {A_j, b_j}, j = 0, ..., K - 1, that minimizes the
loss function

L(θ) = ‖y - f_K(θ, x)‖^2.


5.6 Backpropagation and Automatic Differentiation 5.6.1 Gradients in a Deep Network

The gradients of L w.r.t. θ:

\frac{\partial L}{\partial \theta_{K-1}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial \theta_{K-1}},
\frac{\partial L}{\partial \theta_{K-2}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial \theta_{K-2}}, \quad \cdots
\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial \theta_i}.

Most of the computation of ∂L/∂θ_{i+1} can be reused to compute ∂L/∂θ_i.

Gradients are passed backward through the network.
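
Added sketch (an assumed two-layer instance of the structure above, with a tanh hidden layer, a linear output layer, and the squared-error loss; not the slides' exact model): the backward pass reuses the upstream gradient at every step.

```python
# Manual backpropagation for a tiny two-layer network, with a finite-difference check.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)            # input and observation
A0, b0 = rng.normal(size=(4, 3)), rng.normal(size=4)     # theta_0
A1, b1 = rng.normal(size=(2, 4)), rng.normal(size=2)     # theta_1

# Forward pass, storing intermediate layer outputs.
f1 = np.tanh(A0 @ x + b0)
f2 = A1 @ f1 + b1
L = np.sum((y - f2)**2)

# Backward pass: gradients flow from the loss back through the layers,
# and dL/df1 is reused when computing the gradients for theta_0.
dL_df2 = 2 * (f2 - y)
dL_dA1 = np.outer(dL_df2, f1)
dL_db1 = dL_df2
dL_df1 = A1.T @ dL_df2
dL_dz0 = (1 - f1**2) * dL_df1          # tanh'(z) = 1 - tanh(z)^2
dL_dA0 = np.outer(dL_dz0, x)
dL_db0 = dL_dz0

# Finite-difference check of one entry of dL/db0.
h = 1e-6
b0h = b0.copy(); b0h[0] += h
Lh = np.sum((y - (A1 @ np.tanh(A0 @ x + b0h) + b1))**2)
print((Lh - L) / h, dL_db0[0])         # the two values agree closely
```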



5.6 Backpropagation and Automatic Differentiation 5.6.2 Automatic Differentiation

Backpropagation is a special case of a general technique in numerical analysis
called automatic differentiation.
Automatic differentiation refers to a set of techniques to numerically evaluate
the exact gradient of a function by working with intermediate variables and
applying the chain rule.



5.6 Backpropagation and Automatic Differentiation 5.6.2 Automatic Differentiation

Given a simple graph representing the data flow from inputs x to outputs y:

x → a → b → y.

To compute the derivative dy/dx, the chain rule gives dy/dx = (dy/db)(db/da)(da/dx),
which can be evaluated in two orders:

\frac{dy}{dx} = \left(\frac{dy}{db}\frac{db}{da}\right)\frac{da}{dx} \quad \text{(reverse mode)}, \qquad
\frac{dy}{dx} = \frac{dy}{db}\left(\frac{db}{da}\frac{da}{dx}\right) \quad \text{(forward mode)}.

Reverse mode: gradients are propagated backward through the graph, i.e., in
reverse of the data flow.
Forward mode: the gradients flow with the data, from left to right through the
graph.
Definition
Reverse mode automatic differentiation is called backpropagation.



5.6 Backpropagation and Automatic Differentiation 5.6.2 Automatic Differentiation

Example
Consider the function

f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\left(x^2 + \exp(x^2)\right).

Using intermediate variables:

a = x^2, \quad b = \exp(a), \quad c = a + b, \quad d = \sqrt{c}, \quad e = \cos(c), \quad f = d + e.

We get

\frac{\partial a}{\partial x} = 2x, \quad \frac{\partial b}{\partial a} = \exp(a), \quad \frac{\partial c}{\partial a} = \frac{\partial c}{\partial b} = 1, \quad \frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}}, \quad \frac{\partial e}{\partial c} = -\sin c, \quad \frac{\partial f}{\partial d} = \frac{\partial f}{\partial e} = 1.



5.6 Backpropagation and Automatic Differentiation 5.6.2 Automatic Differentiation

We can compute ∂f/∂x using the backpropagation method:

\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d}\frac{\partial d}{\partial c} + \frac{\partial f}{\partial e}\frac{\partial e}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin c),
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c}\frac{\partial c}{\partial b} = \frac{\partial f}{\partial c},
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b}\frac{\partial b}{\partial a} + \frac{\partial f}{\partial c}\frac{\partial c}{\partial a} = \frac{\partial f}{\partial b}\exp(a) + \frac{\partial f}{\partial c},
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x} = \frac{\partial f}{\partial a} \cdot 2x.
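
Added sketch (illustrative, not from the slides): the same reverse-mode computation in NumPy, checked against a finite-difference estimate of df/dx.

```python
# Forward pass through the intermediate variables, then the backward pass above.
import numpy as np

def f_and_grad(x):
    # Forward pass.
    a = x**2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e
    # Backward pass, in the order derived above.
    df_dc = 1.0 / (2 * np.sqrt(c)) - np.sin(c)
    df_db = df_dc
    df_da = df_db * np.exp(a) + df_dc
    df_dx = df_da * 2 * x
    return f, df_dx

x, h = 0.8, 1e-6
f0, g = f_and_grad(x)
fd = (f_and_grad(x + h)[0] - f0) / h
print(g, fd)     # the analytic reverse-mode gradient matches the finite difference
```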

