
Vector Calculus

Contents
• ex11-12, bt11-12
Differentiation of Univariate Functions
Partial Differentiation and Gradients
Gradients of Matrices
Backpropagation
Higher-Order Derivatives
Linearization and Multivariate Taylor Series



The Chain Rule

x f f(x) g (gf)(x)

(gf)(x) = g(f(x))f(x) # gf means g after f


dg dg df
=
dx df dx



Chain rule – Ex
• Use the chain rule to compute the derivative of
h(x) = (2x + 1)^4

x 2() 2x () + 1 2x+ 1 ()4 (2x+1)4

• h can be expressed as the composition h(x) = (g∘f∘u)(x)


u(x) = 2x,   f(u) = u + 1,   g(f) = f^4
h'(x) = g'(f)·f'(u)·u'(x) = 4f^3 · 1 · 2 = 8(2x + 1)^3

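A quick symbolic check of this result (a sketch, assuming SymPy is available):

```python
# Verify the chain-rule answer 8*(2x+1)^3 symbolically.
import sympy as sp

x = sp.symbols('x')
h = (2*x + 1)**4

dh = sp.diff(h, x)                            # differentiate directly
print(sp.simplify(dh - 8*(2*x + 1)**3))       # prints 0: both expressions agree
```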


Partial Derivative
Definition (Partial Derivative). For a function f : ℝ^n → ℝ,  x ↦ f(x),
of n variables x1, . . . , xn, we define the partial derivatives as

∂f/∂xk = lim_{h→0} [ f(x1, …, xk + h, xk+1, …, xn) − f(x1, …, xk, …, xn) ] / h

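In code, this definition translates directly into a finite-difference approximation. Below is a minimal sketch (NumPy); the helper name partial_derivative and the test function are made up for illustration:

```python
# Forward-difference approximation of the partial derivative df/dx_k,
# i.e. the limit above evaluated with a small but finite h.
import numpy as np

def partial_derivative(f, x, k, h=1e-6):
    x = np.asarray(x, dtype=float)
    x_shifted = x.copy()
    x_shifted[k] += h
    return (f(x_shifted) - f(x)) / h

# Example: f(x1, x2) = x1^2 * x2, so df/dx1 = 2*x1*x2 = 12 at (2, 3).
f = lambda x: x[0]**2 * x[1]
print(partial_derivative(f, [2.0, 3.0], k=0))   # approximately 12
```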


Gradients of f : n → 

• We collect all partial derivatives of f in the row vector to


form the gradient of f f f … f
x1 x2 xn
df
• Notation. xf gradf
dx
• Ex. For f : 2 → , f(x1, x2) = x13 – x1x2
f f
• Partial derivatives = 3x12 – x2, = x 1 3 – x1
x1 x2
• The Gradient of f
xf = [3x12x2 – x2 x13 – x1]  12 (1 row, 2 columns)

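A symbolic check of this example (a sketch, assuming SymPy):

```python
# Gradient of f(x1, x2) = x1^3*x2 - x1*x2 as a 1x2 row vector.
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**3 * x2 - x1*x2

grad = sp.Matrix([[sp.diff(f, x1), sp.diff(f, x2)]])
print(grad)   # Matrix([[3*x1**2*x2 - x2, x1**3 - x1]])
```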


Gradients of f : n →  x  3 f  1
df
gradf = = xf  1n
dx
Ex. For f(x, y) = (x3 + 2y)2, xf  13
we obtain the partial derivatives
f 
• = 2(x3 + 2y) (x3 + 2y) = 6x2(x3 + 2y)
x x

f 
• = 2(x3 + 2y) (x3 + 2y) = 4(x3 + 2y)
y y
 The gradient of f is [6x2(x3 + 2y) 4(x3 + 2y)]



Gradients/Jacobian of Vector-Valued Functions f : ℝ^n → ℝ^m

• For a vector-valued function f : ℝ^n → ℝ^m,

f(x) = [f1(x)  f2(x)  …  fm(x)]^T,   a vector with m components,
where each fi : ℝ^n → ℝ.

Gradient (or Jacobian) of f:

J = ∇_x f = df/dx = [ ∇_x f1
                         ⋮
                      ∇_x fm ]    Dimension: m×n



Jacobian of f: n  m – size
x3 f4 J  43
x1 x2 x3
f1
f2
f2 x3

f3

fm



Jacobian of f: n  m – Ex

Ex. Find the Jacobian of f: 3  2


f1(x1, x2, x3) = 2x1 + x2x3, f2(x1, x2, x3) = x1x3 - x22

Jacobian of f: J 23

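The matrix itself appears on the slide figure; as a sketch (assuming SymPy), it can be reproduced as follows, with each row being the gradient of one component function:

```python
# Jacobian of f(x) = (2*x1 + x2*x3, x1*x3 - x2^2) with respect to (x1, x2, x3).
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
f = sp.Matrix([2*x1 + x2*x3, x1*x3 - x2**2])

J = f.jacobian([x1, x2, x3])
print(J)   # Matrix([[2, x3, x2], [x3, -2*x2, x1]]), a 2x3 matrix
```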


Gradient of f: n  m – Ex

• We are given f(x) = Ax, f(x) ∈ m, A ∈ mn, x ∈ n. Compute


the gradient ∇xf
fi
• ∇xf = its size is mn
xj
mn

𝑛
fi(x) = 𝑗=1 aij xj
fi
 = aij  ∇xf = A
xj

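A quick numerical confirmation (a sketch using NumPy forward differences; A and x are random values made up for illustration):

```python
# The finite-difference Jacobian of f(x) = A @ x should reproduce A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))            # A in R^{4x3}
x = rng.normal(size=3)                 # x in R^3
f = lambda x: A @ x

h = 1e-6
J = np.zeros((4, 3))
for j in range(3):
    e = np.zeros(3); e[j] = h
    J[:, j] = (f(x + e) - f(x)) / h    # column j holds df/dx_j

print(np.allclose(J, A, atol=1e-4))    # True
```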


Gradient of f: n  m – Ex2

• Given h :  → , h(t) = (fx)(t)


where f : 2 → , f(x) = exp(x1 + x22),
x1(t) t
x :  →  , x(t) =
2 =
x2(t) sint
dh
Compute , the gradient of h with respect to t.
dt
• Use the chain rule (matrix version) for h = fx
dh d df dx
= (fx) =
dt dt dx dt



Gradient of f: n  m – Ex2
x1
dh
= f f t = f x1 + f x2
dt x1 x2 x2 x1 t x2 t
t
= exp(x1 + x22) + 2x2exp(x1 + x22)cost,
where x1 = t, x2 = sint.

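A symbolic check of this result (a sketch, assuming SymPy):

```python
# Compare direct differentiation of h(t) = exp(t + sin(t)^2)
# with the chain-rule expression derived above.
import sympy as sp

t = sp.symbols('t')
x1, x2 = t, sp.sin(t)
h = sp.exp(x1 + x2**2)

dh_direct = sp.diff(h, t)
dh_chain = sp.exp(x1 + x2**2) * (1 + 2*x2*sp.cos(t))
print(sp.simplify(dh_direct - dh_chain))   # 0
```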


Gradient of f: n  m – Exercise
• y  N, θ  D, Φ  ND
e: D  N, e(θ) = y − Φθ,
L: N  , L(e) = e2 = eTe
dL de dL
Find , , and
de dθ dθ

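A possible worked solution (not shown on the slide), keeping gradients as row vectors as above:

dL/de = 2e^T ∈ ℝ^{1×N},
de/dθ = −Φ ∈ ℝ^{N×D},
dL/dθ = (dL/de)·(de/dθ) = −2e^T·Φ = −2(y − Φθ)^T·Φ ∈ ℝ^{1×D}.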


Gradient of A : m  pq
Approach 1

4×2×3 tensor



Gradient of A : m  pq
Approach 2: Re-shape matrices into vectors

4×2×3 tensor



Gradients of A : m  pq – Ex
• Ex. Consider A: 3  32
𝑥1 − 𝑥2 𝑥1 + 𝑥3
• A(x1, x2, x3) = 𝑥1 2 + 𝑥3 2𝑥1
𝑥3 − 𝑥2 𝑥1 + 𝑥2 + 𝑥3
dA
• The dimension of : (32)3
dx
• Approach 1
1 1 −1 0 0 1
A A A
= 2𝑥1 2 , = 0 0, = 1 0
x 1 x2 x3 (32)3 tensor
0 1 −1 1 1 1

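These partial-derivative matrices can be spot-checked numerically; a sketch (NumPy forward differences, evaluated at a made-up point):

```python
import numpy as np

def A(x):
    x1, x2, x3 = x
    return np.array([[x1 - x2,    x1 + x3],
                     [x1**2 + x3, 2*x1],
                     [x3 - x2,    x1 + x2 + x3]])

x = np.array([1.0, 2.0, 3.0])
h = 1e-6
for i in range(3):
    e = np.zeros(3); e[i] = h
    dA_dxi = (A(x + e) - A(x)) / h            # a 3x2 matrix for each x_i
    print(f"dA/dx{i+1} ≈\n{np.round(dA_dxi, 3)}")
```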


Gradient of f : mn  p – Ex
fM AMN xN
… … … … x1 f: MN  M, f(A) = Ax
fj Aj1 … AjN … fi = Ai1x1 +… + Aikxk +…+ AiNxN
fi Ai1 … AiN xN
fi fi
… … … …  = xk, = 0 (j  i)
Aik Ajk
df
 M(MN)
dA … … …
… … …
0 … 0
x1 … xN
x1 … xN
0 … 0
… … …
… … …



Gradient of Matrices with Respect to Matrices, ℝ^{m×n} → ℝ^{p×q}

For R ∈ ℝ^{M×N} and f : ℝ^{M×N} → ℝ^{N×N} with f(R) = R^T·R = K ∈ ℝ^{N×N},
compute the gradient dK/dR.
The gradient has the dimensions

dK/dR ∈ ℝ^{(N×N)×(M×N)},

and for a single entry Kpq,  dKpq/dR ∈ ℝ^{1×(M×N)}.



Gradient of Matrices with Respect to Matrices, ℝ^{m×n} → ℝ^{p×q}

dK/dR ∈ ℝ^{(N×N)×(M×N)},   K = R^T·R
R = [r1 r2 … rN], where ri is the i-th column of R

dKpq/dR ∈ ℝ^{1×(M×N)},   dKpq/dRij ∈ ℝ

Kpq = rp^T·rq = Σ_{m=1}^{M} Rmp·Rmq

dKpq/dRij = ∂pqij = { Riq    if j = p, p ≠ q
                      Rip    if j = q, p ≠ q
                      2·Riq  if j = p = q
                      0      otherwise }

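The four cases can be spot-checked with forward differences; a sketch (NumPy, with made-up sizes), probing one entry K_pq against one entry R_ij at a time:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 3
R = rng.normal(size=(M, N))
K = lambda R: R.T @ R

def dKpq_dRij(p, q, i, j, h=1e-6):
    E = np.zeros((M, N)); E[i, j] = h
    return (K(R + E)[p, q] - K(R)[p, q]) / h

p, q, i = 0, 1, 2
print(dKpq_dRij(p, q, i, j=p), R[i, q])     # j = p != q: equals R_iq
print(dKpq_dRij(p, q, i, j=q), R[i, p])     # j = q != p: equals R_ip
print(dKpq_dRij(p, p, i, j=p), 2*R[i, p])   # j = p = q: equals 2*R_ip
print(dKpq_dRij(p, q, i, j=2), 0.0)         # otherwise: zero
```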


Backpropagation - Introduction

• Probably the single most important algorithm in all of Deep Learning


• In many machine learning applications, we find good model parameters by performing gradient descent, which requires computing the gradient of a learning objective with respect to the parameters of the model. For example, an ANN with a single hidden layer of 150 nodes applied to a 128×128×3 color image already needs at least 128×128×3×150 = 7,372,800 weights.
• The backpropagation algorithm is an efficient way to compute the gradient of an error function with respect to the parameters of the model.



ML Needs Gradients
• Given training data
  {(x1, y1), (x2, y2), …, (xm, ym)}

• Choose decision and cost functions
  ŷi = f_θ(xi)
  C(ŷi, yi)

• Define the goal
  Find θ* that minimizes (1/m)·Σi C(ŷi, yi)

• Train the model with (stochastic) gradient descent to update θ:
  θ(t+1) = θ(t) − γ·(∂C/∂θ(t))(xi, yi)

! The backpropagation algorithm is an efficient way to compute the gradient.

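A minimal (stochastic) gradient-descent loop matching the update rule above; the quadratic cost, toy data, and learning rate γ below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy inputs x_i
y = X @ np.array([1.0, -2.0, 0.5])       # toy targets y_i
theta = np.zeros(3)
gamma = 0.1                              # learning rate

for epoch in range(50):
    for i in rng.permutation(len(X)):
        y_hat = X[i] @ theta             # decision function f_theta(x_i)
        grad = (y_hat - y[i]) * X[i]     # dC/dtheta for C = 0.5*(y_hat - y_i)^2
        theta = theta - gamma * grad     # theta(t+1) = theta(t) - gamma * dC/dtheta

print(theta)                             # approaches [1, -2, 0.5]
```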


Epochs
• The backpropagation algorithm consists of many cycles; each cycle is called an epoch and has two phases:

Forward phase:
a(0) → z(1), a(1) → z(2), a(2) → … → C

Backward phase:
∂C/∂θ(1) ← ∂C/∂θ(2) ← … ← ∂C/∂θ(N)



Deep Network (ANN with hidden layers)

Activation equations (matrix version)

Layer (1) = hidden layer
z(1) = W(1)·a(0) + b(1)
a(1) = σ1(z(1))
Layer (2) = output layer
z(2) = W(2)·a(1) + b(2)
a(2) = σ2(z(2))

The cost for example number k:

Ck = (1/2)·Σi (ai(2) − yi)^2 = (1/2)·‖a(2) − y‖^2



Forward phase
For L = 1..N, with a(0) = x:
z(L) = W(L)·a(L−1) + b(L)
a(L) = σL(z(L))

C: cost function (e.g., C = (1/2)·‖a(N) − y‖^2)

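A minimal sketch of the forward phase (NumPy); the logistic-sigmoid activation and the layer sizes below are made-up choices:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations a(0), ..., a(N) for one input x."""
    a = [x]
    for W, b in zip(weights, biases):
        z = W @ a[-1] + b        # z(L) = W(L) a(L-1) + b(L)
        a.append(sigma(z))       # a(L) = sigma_L(z(L))
    return a

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]   # layers 1 and 2
biases  = [np.zeros(4), np.zeros(2)]

a = forward(rng.normal(size=3), weights, biases)
y = np.array([0.0, 1.0])
C = 0.5 * np.sum((a[-1] - y)**2)   # C = 1/2 * ||a(N) - y||^2
print(C)
```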


Backpropagation
Layer 1, Layer 2, …, Layer K−1, Layer K, …, Layer N−1, Layer N

∂C/∂W(N) = (∂C/∂a(N))·(∂a(N)/∂W(N))
∂C/∂b(N) = (∂C/∂a(N))·(∂a(N)/∂b(N))

∂C/∂W(N−1) = (∂C/∂a(N))·(∂a(N)/∂a(N−1))·(∂a(N−1)/∂W(N−1))
∂C/∂b(N−1) = (∂C/∂a(N))·(∂a(N)/∂a(N−1))·(∂a(N−1)/∂b(N−1))

∂C/∂W(N−2) = (∂C/∂a(N))·(∂a(N)/∂a(N−1))·(∂a(N−1)/∂a(N−2))·(∂a(N−2)/∂W(N−2))
∂C/∂b(N−2) = (∂C/∂a(N))·(∂a(N)/∂a(N−1))·(∂a(N−1)/∂a(N−2))·(∂a(N−2)/∂b(N−2))

Benefit of backpropagation: the terms outside the box are reused from one layer to the next.



Backpropagation
Layer 1, Layer 2, …, Layer K−1, Layer K, …, Layer N−1, Layer N

Activation equations:
z(L) = W(L)·a(L−1) + b(L)
a(L) = σL(z(L))
C: cost function

∂C/∂W(L+1) = (∂C/∂a(N))·(∂a(N)/∂a(N−1))·(∂a(N−1)/∂a(N−2)) … (∂a(L+3)/∂a(L+2))·(∂a(L+2)/∂a(L+1))·(∂a(L+1)/∂W(L+1))

∂C/∂W(L) = (∂C/∂a(N))·(∂a(N)/∂a(N−1))·(∂a(N−1)/∂a(N−2)) … (∂a(L+2)/∂a(L+1))·(∂a(L+1)/∂a(L))·(∂a(L)/∂W(L))

∂C/∂b(L) = (∂C/∂a(N))·(∂a(N)/∂a(N−1))·(∂a(N−1)/∂a(N−2)) … (∂a(L+2)/∂a(L+1))·(∂a(L+1)/∂a(L))·(∂a(L)/∂b(L))

At each layer (L), we need the common factor

e(L) := ∂C/∂a(L) = (∂C/∂a(N))·(∂a(N)/∂a(N−1)) … (∂a(L+2)/∂a(L+1))·(∂a(L+1)/∂a(L)) = e(L+1)·(∂a(L+1)/∂a(L)),

so backpropagation computes e(L+1) (at layer L+1) before computing e(L) (at layer L).
Backpropagation algorithm
For each example in the training examples:
1. Feed forward
2. Backpropagation
   At the output layer (N), compute and store:
   e(N) = ∂C/∂a(N)
   ∂C/∂W(N) = e(N)·(∂a(N)/∂W(N)),   ∂C/∂b(N) = e(N)·(∂a(N)/∂b(N))
   For layer (L) from N−1 down to 1:
   • Compute e(L) using e(L) = e(L+1)·(∂a(L+1)/∂a(L))
   • Compute ∂C/∂W(L) = e(L)·(∂a(L)/∂W(L)),   ∂C/∂b(L) = e(L)·(∂a(L)/∂b(L))

Activation equations:
z(L) = W(L)·a(L−1) + b(L)
a(L) = σL(z(L))
C: cost function

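A compact sketch of this algorithm for the squared-error cost and sigmoid activations (NumPy; layer sizes and inputs are made up). It propagates dC/dz(L), often written delta, which is equivalent to carrying e(L) = dC/da(L) and folding in the activation derivative:

```python
import numpy as np

sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))

def backprop(x, y, weights, biases):
    # 1. Feed forward, storing every z(L) and a(L)
    a, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ a[-1] + b)
        a.append(sigma(zs[-1]))

    grads_W, grads_b = [], []
    e = a[-1] - y                                 # e(N) = dC/da(N) for C = 0.5*||a(N) - y||^2
    # 2. Walk backwards from layer N to layer 1, reusing e at every step
    for L in reversed(range(len(weights))):
        delta = e * dsigma(zs[L])                 # dC/dz(L)
        grads_W.insert(0, np.outer(delta, a[L]))  # dC/dW(L) = delta * a(L-1)^T
        grads_b.insert(0, delta)                  # dC/db(L) = delta
        e = weights[L].T @ delta                  # dC/da(L-1), passed to the next layer down
    return grads_W, grads_b

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases  = [np.zeros(4), np.zeros(2)]
gW, gb = backprop(rng.normal(size=3), np.array([0.0, 1.0]), weights, biases)
print([g.shape for g in gW])   # [(4, 3), (2, 4)]
```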


Higher-order partial derivatives

Consider a function f : ℝ^2 → ℝ of two variables x, y.
Second-order partial derivatives: ∂²f/∂x², ∂²f/∂x∂y, ∂²f/∂y∂x, ∂²f/∂y²

Ex. f : ℝ^2 → ℝ, f(x, y) = x^3·y − 3x·y^2 + 5y
∂f/∂x = 3x^2·y − 3y^2,   ∂f/∂y = x^3 − 6xy + 5
∂²f/∂x² = 6xy,            ∂²f/∂x∂y = 3x^2 − 6y
∂²f/∂y∂x = 3x^2 − 6y,     ∂²f/∂y² = −6x

∂^n f/∂x^n is the n-th partial derivative of f with respect to x.

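The table of second-order partials above can be checked symbolically; a sketch (assuming SymPy), which also shows that the two mixed partials agree:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3*y - 3*x*y**2 + 5*y

print(sp.diff(f, x, x))   # 6*x*y
print(sp.diff(f, x, y))   # 3*x**2 - 6*y
print(sp.diff(f, y, x))   # 3*x**2 - 6*y
print(sp.diff(f, y, y))   # -6*x
```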


The Hessian
• The Hessian is the collection of all second-order partial derivatives.

The Hessian matrix is symmetric for twice continuously differentiable functions, that is,

∂²f/∂x∂y = ∂²f/∂y∂x



Gradient vs Hessian of f : ℝ^n → ℝ
Consider a function f : ℝ^n → ℝ.

Gradient:
∇f = [∂f/∂x1   ∂f/∂x2   …   ∂f/∂xn]
Dimension: 1 × n

Hessian:
∇²f = [ ∂²f/∂x1²      ∂²f/∂x1∂x2   …   ∂²f/∂x1∂xn
        ∂²f/∂x2∂x1    ∂²f/∂x2²     …   ∂²f/∂x2∂xn
        …             …            …   …
        ∂²f/∂xn∂x1    ∂²f/∂xn∂x2   …   ∂²f/∂xn²  ]
Dimension: n × n



Gradient vs Hessian of f : ℝ^n → ℝ^m
Consider a (vector-valued) function f : ℝ^n → ℝ^m.

Gradient: an m × n matrix.
Hessian: an m × (n × n) tensor.

[Figure: for x ∈ ℝ^3 and f ∈ ℝ^2, the gradient has dimension 2 × 3 and the Hessian has dimension 2 × (3 × 3).]



Example
• Compute the Hessian of the function z = f(x, y) = x^2 + 6xy − y^3 and evaluate it at the point (x = 1, y = 2, z = 5).

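One possible way to work the example (a sketch, assuming SymPy):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 6*x*y - y**3

H = sp.hessian(f, (x, y))
print(H)                       # Matrix([[2, 6], [6, -6*y]])
print(H.subs({x: 1, y: 2}))    # Matrix([[2, 6], [6, -12]]) at the point (1, 2)
```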


Taylor series for f : ℝ → ℝ
Taylor polynomials

Approximation problems



Taylor series for f: D  
Consider a function f (smooth at x0)

multivariate Taylor series of f at x0 is defined as

where

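The formula itself sits on the slide figure; for reference, one standard way to write it (with δ := x − x0) is

f(x) = Σ_{k=0}^{∞} (1/k!) · D_x^k f(x0) · δ^k,

where D_x^k f(x0) denotes the k-th (total) derivative of f with respect to x, evaluated at x0.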


Example
Find the Taylor series for the function
f(x, y) = x^2 + 2xy + y^3 at x0 = 1, y0 = 2.



Taylor series of f(x, y) = x^2 + 2xy + y^3



Taylor series of f(x, y) = x^2 + 2xy + y^3




Taylor series of f(x, y) = x^2 + 2xy + y^3

The Taylor series expansion of f at (x0, y0) = (1, 2) is

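The expansion can be reproduced with the sketch below (assuming SymPy); because f is a polynomial, the Taylor series terminates and is exactly equal to f:

```python
# Expand f(1 + dx, 2 + dy) in powers of dx = x - 1 and dy = y - 2.
import sympy as sp

x, y, dx, dy = sp.symbols('x y dx dy')
f = x**2 + 2*x*y + y**3

expansion = sp.expand(f.subs({x: 1 + dx, y: 2 + dy}))
print(expansion)   # equals 13 + 6*dx + 14*dy + dx**2 + 2*dx*dy + 6*dy**2 + dy**3
```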


Summary

Differentiation of Univariate Functions


Partial Differentiation and Gradients
Gradients of Matrices
Backpropagation
Higher-Order Derivatives
Linearization and Multivariate Taylor Series



THANKS

