
Vector/Matrix Calculus

In neural networks, we often encounter problems involving functions of several variables. Vector/matrix calculus extends the calculus of one variable to that of a vector or a matrix of variables.

Vector Gradient: Let $g(\mathbf{w})$ be a differentiable scalar function of $m$ variables, where

$$\mathbf{w} = [w_1, \ldots, w_m]^T.$$

Then the vector gradient of $g(\mathbf{w})$ w.r.t. $\mathbf{w}$ is the $m$-dimensional vector of partial derivatives of $g$:

$$\nabla g = \nabla_{\mathbf{w}} g = \frac{\partial g}{\partial \mathbf{w}} = \begin{bmatrix} \dfrac{\partial g}{\partial w_1} \\ \vdots \\ \dfrac{\partial g}{\partial w_m} \end{bmatrix}$$

Similarly, we can define the second-order gradient, or Hessian matrix.



The Hessian matrix is defined as

$$\frac{\partial^2 g}{\partial \mathbf{w}^2} = \begin{bmatrix} \dfrac{\partial^2 g}{\partial w_1^2} & \cdots & \dfrac{\partial^2 g}{\partial w_1 \partial w_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 g}{\partial w_m \partial w_1} & \cdots & \dfrac{\partial^2 g}{\partial w_m^2} \end{bmatrix}$$
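To make these definitions concrete, here is a minimal numerical sketch using NumPy. The helper names (num_gradient, num_hessian) and the test function are our illustrative choices, not part of the notes; central differences only approximate the analytic derivatives.

```python
import numpy as np

def num_gradient(g, w, eps=1e-6):
    """Central-difference approximation of the vector gradient of g at w."""
    return np.array([(g(w + eps*e) - g(w - eps*e)) / (2*eps)
                     for e in np.eye(w.size)])

def num_hessian(g, w, eps=1e-4):
    """Central-difference approximation of the Hessian of g at w."""
    return np.array([(num_gradient(g, w + eps*e) - num_gradient(g, w - eps*e)) / (2*eps)
                     for e in np.eye(w.size)])

# Illustrative function: g(w) = w1^2 + 3*w1*w2
g = lambda w: w[0]**2 + 3*w[0]*w[1]
w = np.array([1.0, 2.0])
print(num_gradient(g, w))  # ~ [2*w1 + 3*w2, 3*w1] = [8, 3]
print(num_hessian(g, w))   # ~ [[2, 3], [3, 0]]
```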

Jacobian matrix: Generalization to vector-valued functions

$$\mathbf{g}(\mathbf{w}) = [g_1(\mathbf{w}), \ldots, g_n(\mathbf{w})]^T$$

leads to the definition of the Jacobian matrix of $\mathbf{g}$ w.r.t. $\mathbf{w}$:

$$\frac{\partial \mathbf{g}}{\partial \mathbf{w}} = \begin{bmatrix} \dfrac{\partial g_1}{\partial w_1} & \cdots & \dfrac{\partial g_n}{\partial w_1} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial g_1}{\partial w_m} & \cdots & \dfrac{\partial g_n}{\partial w_m} \end{bmatrix}$$

In this vector convention the columns of the Jacobian matrix are the gradients of the corresponding component functions $g_i(\mathbf{w})$ w.r.t. the vector $\mathbf{w}$.
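A minimal numerical sketch of this convention (NumPy; the helper num_jacobian and the example function are our illustrative choices):

```python
import numpy as np

def num_jacobian(g, w, eps=1e-6):
    """m x n Jacobian in the convention above: column i is the gradient of g_i."""
    g0 = np.asarray(g(w))
    J = np.zeros((w.size, g0.size))
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        # Row j holds the partial derivatives of all components w.r.t. w_j.
        J[j, :] = (np.asarray(g(w + e)) - np.asarray(g(w - e))) / (2*eps)
    return J

# Illustrative function: g(w) = [w1*w2, w1 + w2^2]
g = lambda w: np.array([w[0]*w[1], w[0] + w[1]**2])
w = np.array([2.0, 3.0])
print(num_jacobian(g, w))
# ~ [[w2, 1], [w1, 2*w2]] = [[3, 1], [2, 6]]; column i is the gradient of g_i
```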

Differentiation Rules: The differentiation rules are analogous to the case of ordinary functions.

Product rule:

$$\frac{\partial f(\mathbf{w})g(\mathbf{w})}{\partial \mathbf{w}} = \frac{\partial f(\mathbf{w})}{\partial \mathbf{w}}\, g(\mathbf{w}) + f(\mathbf{w})\, \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}$$

Quotient rule:

$$\frac{\partial f(\mathbf{w})/g(\mathbf{w})}{\partial \mathbf{w}} = \frac{\dfrac{\partial f(\mathbf{w})}{\partial \mathbf{w}}\, g(\mathbf{w}) - f(\mathbf{w})\, \dfrac{\partial g(\mathbf{w})}{\partial \mathbf{w}}}{g^2(\mathbf{w})}$$

Chain rule:

$$\frac{\partial f(g(\mathbf{w}))}{\partial \mathbf{w}} = f'(g(\mathbf{w}))\, \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}$$
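As a quick sanity check, here is a minimal NumPy sketch verifying the product and chain rules numerically; the functions f, g, the outer function h, and the helper num_gradient are illustrative choices of ours, not from the notes.

```python
import numpy as np

def num_gradient(g, w, eps=1e-6):
    # Central-difference gradient, as in the earlier sketch.
    return np.array([(g(w + eps*e) - g(w - eps*e)) / (2*eps)
                     for e in np.eye(w.size)])

# Two scalar fields on R^2 with hand-computed gradients.
f = lambda w: w[0]**2 + w[1]
g = lambda w: 3*w[0] + w[1]**2
grad_f = lambda w: np.array([2*w[0], 1.0])
grad_g = lambda w: np.array([3.0, 2*w[1]])

w = np.array([1.0, 2.0])

# Product rule: d(fg)/dw = (df/dw) g + f (dg/dw)
print(np.allclose(num_gradient(lambda v: f(v)*g(v), w),
                  grad_f(w)*g(w) + f(w)*grad_g(w), atol=1e-5))   # True

# Chain rule with outer function h(t) = t^3: d h(g(w))/dw = h'(g(w)) dg/dw
print(np.allclose(num_gradient(lambda v: g(v)**3, w),
                  3*g(w)**2 * grad_g(w), atol=1e-4))             # True
```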

Example: Consider

$$g(\mathbf{w}) = \sum_{i=1}^{m} a_i w_i = \mathbf{a}^T \mathbf{w},$$

where $\mathbf{a}$ is a constant vector. Thus,

$$\frac{\partial g}{\partial \mathbf{w}} = \begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix} = \mathbf{a}, \qquad \text{i.e.} \qquad \frac{\partial\, \mathbf{a}^T \mathbf{w}}{\partial \mathbf{w}} = \mathbf{a}.$$

Example: Let $\mathbf{w} = (w_1, w_2, w_3)^T \in \mathbb{R}^3$ and

$$g(\mathbf{w}) = 2w_1 + 5w_2 + 12w_3 = \begin{bmatrix} 2 & 5 & 12 \end{bmatrix} \mathbf{w}$$

in vector notation. Because

$$\frac{\partial g}{\partial w_1} = \frac{\partial}{\partial w_1}(2w_1 + 5w_2 + 12w_3) = 2 + 0 + 0 = 2,$$

and similarly for the other components, we have

$$\frac{\partial g}{\partial \mathbf{w}} = \begin{bmatrix} 2 \\ 5 \\ 12 \end{bmatrix}.$$
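A minimal numerical check of this result (NumPy; the evaluation point w is an arbitrary choice of ours, since the gradient of a linear function is constant):

```python
import numpy as np

# Check that the gradient of a^T w is a, with a = (2, 5, 12)^T as in the example.
a = np.array([2.0, 5.0, 12.0])
g = lambda w: a @ w

w = np.array([1.0, -1.0, 0.5])   # any point works
eps = 1e-6
grad = np.array([(g(w + eps*e) - g(w - eps*e)) / (2*eps) for e in np.eye(3)])
print(grad)                      # ~ [2, 5, 12] = a
```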


Example: Consider

$$g(\mathbf{w}) = \sum_{i=1}^{m} \sum_{j=1}^{m} a_{ij} w_i w_j = \mathbf{w}^T \mathbf{A} \mathbf{w},$$

where $\mathbf{A}$ is a constant square matrix. Thus,

$$\frac{\partial g}{\partial \mathbf{w}} = \begin{bmatrix} \sum_{j=1}^{m} a_{1j} w_j + \sum_{i=1}^{m} a_{i1} w_i \\ \vdots \\ \sum_{j=1}^{m} a_{mj} w_j + \sum_{i=1}^{m} a_{im} w_i \end{bmatrix},$$

and so in vector notation

$$\frac{\partial\, \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = \mathbf{A}\mathbf{w} + \mathbf{A}^T \mathbf{w}.$$

The Hessian of $g(\mathbf{w})$ is

$$\frac{\partial^2\, \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}^2} = \begin{bmatrix} 2a_{11} & \cdots & a_{1m} + a_{m1} \\ \vdots & \ddots & \vdots \\ a_{m1} + a_{1m} & \cdots & 2a_{mm} \end{bmatrix},$$

which equals $\mathbf{A} + \mathbf{A}^T$.
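Both identities can be checked numerically. A minimal sketch, assuming nothing beyond NumPy (the random A, the point w, and the seed are our illustrative choices):

```python
import numpy as np

# Numerical check of d(w^T A w)/dw = A w + A^T w and of the Hessian A + A^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # any constant square matrix
w = rng.standard_normal(4)
g = lambda v: v @ A @ v

eps = 1e-6
grad = np.array([(g(w + eps*e) - g(w - eps*e)) / (2*eps) for e in np.eye(4)])
print(np.allclose(grad, A @ w + A.T @ w, atol=1e-6))    # True

# Mixed central differences for the (constant) Hessian of the quadratic form.
eps = 1e-4
H = np.array([[(g(w + eps*(ei+ej)) - g(w + eps*(ei-ej))
                - g(w - eps*(ei-ej)) + g(w - eps*(ei+ej))) / (4*eps**2)
               for ej in np.eye(4)] for ei in np.eye(4)])
print(np.allclose(H, A + A.T, atol=1e-4))               # True
```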

Example: Let $\mathbf{w} = (w_1, w_2)^T \in \mathbb{R}^2$ and

$$g(\mathbf{w}) = 3w_1 w_1 + 2w_1 w_2 + 6w_2 w_1 + 5w_2 w_2 = \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} 3 & 2 \\ 6 & 5 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.$$

Then

$$\frac{\partial g}{\partial w_1} = \frac{\partial}{\partial w_1}(3w_1 w_1 + 2w_1 w_2 + 6w_2 w_1 + 5w_2 w_2) = 3 \cdot 2w_1 + 2w_2 + 6w_2 + 0 = 6w_1 + 8w_2,$$

$$\frac{\partial g}{\partial w_2} = 0 + 2w_1 + 6w_1 + 5 \cdot 2w_2 = 8w_1 + 10w_2,$$

and so in vector notation

$$\frac{\partial\, \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = \begin{bmatrix} 3 & 2 \\ 6 & 5 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} + \begin{bmatrix} 3 & 6 \\ 2 & 5 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 8 & 10 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.$$

The Hessian of $g(\mathbf{w})$ is

$$\frac{\partial^2\, \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}^2} = \frac{\partial}{\partial \mathbf{w}} \left( \frac{\partial\, \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} \right) = \begin{bmatrix} 2 \cdot 3 & 2 + 6 \\ 6 + 2 & 2 \cdot 5 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 8 & 10 \end{bmatrix}.$$

Matrix Gradient: Consider a scalar-valued function $g(\mathbf{W})$ of the $m \times n$ matrix $\mathbf{W} = \{w_{ij}\}$ (e.g. the determinant of a matrix). The matrix gradient w.r.t. $\mathbf{W}$ is a matrix of the same dimensions as $\mathbf{W}$, consisting of the partial derivatives of $g(\mathbf{W})$ w.r.t. the components of $\mathbf{W}$:

$$\frac{\partial g}{\partial \mathbf{W}} = \begin{bmatrix} \dfrac{\partial g}{\partial w_{11}} & \cdots & \dfrac{\partial g}{\partial w_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial g}{\partial w_{m1}} & \cdots & \dfrac{\partial g}{\partial w_{mn}} \end{bmatrix}$$

Example: If $g(\mathbf{W}) = \mathrm{tr}(\mathbf{W})$, then

$$\frac{\partial g}{\partial \mathbf{W}} = \mathbf{I}.$$

Example: Consider the matrix function

$$g(\mathbf{W}) = \sum_{i=1}^{m} \sum_{j=1}^{m} w_{ij}\, a_i a_j = \mathbf{a}^T \mathbf{W} \mathbf{a},$$

i.e. assume that $\mathbf{a}$ is a constant vector, whereas $\mathbf{W}$ is a matrix of variables. Taking the gradient w.r.t. $\mathbf{W}$ yields $\frac{\partial\, \mathbf{a}^T \mathbf{W} \mathbf{a}}{\partial w_{ij}} = a_i a_j$. Thus, in matrix form,

$$\frac{\partial\, \mathbf{a}^T \mathbf{W} \mathbf{a}}{\partial \mathbf{W}} = \mathbf{a}\mathbf{a}^T.$$
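Both matrix-gradient examples can be verified numerically. A minimal sketch (the helper num_matrix_gradient is our name for illustration; the random W and a are arbitrary test data):

```python
import numpy as np

def num_matrix_gradient(g, W, eps=1e-6):
    """Entry (i, j) is the central-difference estimate of dg/dw_ij."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        E = np.zeros_like(W)
        E[idx] = eps
        G[idx] = (g(W + E) - g(W - E)) / (2*eps)
    return G

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
a = rng.standard_normal(3)

# d tr(W)/dW = I
print(np.allclose(num_matrix_gradient(np.trace, W), np.eye(3), atol=1e-6))       # True
# d(a^T W a)/dW = a a^T
print(np.allclose(num_matrix_gradient(lambda M: a @ M @ a, W),
                  np.outer(a, a), atol=1e-6))                                    # True
```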

Example: Let

$$g(\mathbf{W}) = 9w_{11} + 6w_{21} + 6w_{12} + 4w_{22} = \begin{bmatrix} 3 & 2 \end{bmatrix} \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \begin{bmatrix} 3 \\ 2 \end{bmatrix}.$$

Thus we have

$$\frac{\partial g}{\partial w_{11}} = \frac{\partial}{\partial w_{11}}(9w_{11} + 6w_{21} + 6w_{12} + 4w_{22}) = 9,$$

$$\frac{\partial g}{\partial w_{12}} = \frac{\partial}{\partial w_{12}}(9w_{11} + 6w_{21} + 6w_{12} + 4w_{22}) = 6,$$

$$\frac{\partial g}{\partial w_{21}} = \frac{\partial}{\partial w_{21}}(9w_{11} + 6w_{21} + 6w_{12} + 4w_{22}) = 6,$$

$$\frac{\partial g}{\partial w_{22}} = \frac{\partial}{\partial w_{22}}(9w_{11} + 6w_{21} + 6w_{12} + 4w_{22}) = 4,$$

hence

$$\frac{\partial g}{\partial \mathbf{W}} = \begin{bmatrix} 9 & 6 \\ 6 & 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \end{bmatrix} \begin{bmatrix} 3 & 2 \end{bmatrix} = \mathbf{a}\mathbf{a}^T.$$

Example: Let $\mathbf{W}$ be an invertible square matrix of dimension $m$ with determinant $\det(\mathbf{W})$. Then

$$\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \det \mathbf{W}.$$

Recall that

$$\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}}\, \mathrm{adj}(\mathbf{W}).$$

In the above, $\mathrm{adj}(\mathbf{W})$ is the adjugate (classical adjoint) of $\mathbf{W}$:

$$\mathrm{adj}(\mathbf{W}) = \begin{bmatrix} W_{11} & \cdots & W_{m1} \\ \vdots & \ddots & \vdots \\ W_{1m} & \cdots & W_{mm} \end{bmatrix},$$

where $W_{ij}$ is a cofactor obtained by multiplying the term $(-1)^{i+j}$ by the determinant of the matrix obtained from $\mathbf{W}$ by removing the $i$th row and the $j$th column. Recall that the determinant of $\mathbf{W}$ can also be obtained using cofactors:

$$\det \mathbf{W} = \sum_{k=1}^{m} w_{ik} W_{ik},$$

where $i$ denotes an arbitrary row. Now taking the derivative of $\det(\mathbf{W})$ w.r.t. $w_{ij}$ gives

$$\frac{\partial \det(\mathbf{W})}{\partial w_{ij}} = W_{ij},$$

but from the definition of the matrix gradient it follows that

$$\frac{\partial \det(\mathbf{W})}{\partial \mathbf{W}} = \mathrm{adj}(\mathbf{W})^T.$$

Using the formula for the inverse of $\mathbf{W}$,

$$\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}}\, \mathrm{adj}(\mathbf{W}),$$

we have

$$\frac{\partial \det(\mathbf{W})}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \det \mathbf{W}.$$

Homework: Prove that

$$\frac{\partial \log |\det(\mathbf{W})|}{\partial \mathbf{W}} = \frac{1}{|\det \mathbf{W}|} \frac{\partial\, |\det \mathbf{W}|}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1}.$$
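A minimal numerical check of the determinant identity and of the homework identity (NumPy; the helper, the seed, and the random test matrix are our illustrative choices):

```python
import numpy as np

def num_matrix_gradient(g, W, eps=1e-6):
    # Same central-difference helper as in the matrix-gradient sketch above.
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        E = np.zeros_like(W)
        E[idx] = eps
        G[idx] = (g(W + E) - g(W - E)) / (2*eps)
    return G

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 3))   # almost surely invertible

# d det(W)/dW = (W^T)^{-1} det(W)
target = np.linalg.inv(W.T) * np.linalg.det(W)
print(np.allclose(num_matrix_gradient(np.linalg.det, W), target, atol=1e-6))     # True

# Homework identity: d log|det(W)|/dW = (W^T)^{-1}
logabsdet = lambda M: np.linalg.slogdet(M)[1]   # log|det(M)|, computed stably
print(np.allclose(num_matrix_gradient(logabsdet, W),
                  np.linalg.inv(W.T), atol=1e-6))                                # True
```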
