
Matrix Calculus

Massimiliano Tomassoli
23/7/2013
Matrix Calculus is a set of techniques which let us differentiate functions of matrices without computing the individual partial derivatives by hand. What this means will be clear in a moment.
1 The derivative
Let's talk about notation a little bit. If $f(X)$ is a scalar function of an $m \times n$ matrix $X$, then
$$\frac{\partial f(X)}{\partial X} = \begin{pmatrix} \frac{\partial f(X)}{\partial x_{11}} & \cdots & \frac{\partial f(X)}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(X)}{\partial x_{m1}} & \cdots & \frac{\partial f(X)}{\partial x_{mn}} \end{pmatrix}.$$
The definition above is valid even if m = 1 or n = 1, that is if f is a function of a row vector, a column vector or a scalar, in which case the result is a row vector, a column vector or a scalar, respectively.
If $f(X)$ is an $m \times n$ matrix function of a matrix, then
$$\frac{\partial f(X)}{\partial X} = \begin{pmatrix} \frac{\partial f_{11}(X)}{\partial X} & \cdots & \frac{\partial f_{1n}(X)}{\partial X} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_{m1}(X)}{\partial X} & \cdots & \frac{\partial f_{mn}(X)}{\partial X} \end{pmatrix}$$
where the matrix above is a block matrix. The definition above is valid even when m = 1 or n = 1, that is when f is a row vector function, a column vector function or a scalar function, in which case the block matrix is a row of blocks, a column of blocks or just a single block, respectively.
If f is
a scalar function of a scalar, vector or matrix, or
a vector function of a scalar or vector, or
a matrix function of a scalar,
then the derivative, also called the Jacobian matrix, of $f$ is
$$Df(x) = \frac{\partial f(x)}{\partial x^T}.$$
For instance, if $f(x)$ is a vector function of a vector, then
$$Df(x) := \frac{\partial f(x)}{\partial x^T} = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x^T} \\ \vdots \\ \frac{\partial f_m(x)}{\partial x^T} \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{pmatrix}.$$
As another example, if $f(X)$ is a scalar function of a matrix, then
$$Df(X) := \frac{\partial f(X)}{\partial X^T} = \begin{pmatrix} \frac{\partial f(X)}{\partial x_{11}} & \cdots & \frac{\partial f(X)}{\partial x_{m1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(X)}{\partial x_{1n}} & \cdots & \frac{\partial f(X)}{\partial x_{mn}} \end{pmatrix}.$$
Some authors prefer to find the gradient of a function, which is defined as the transpose of the Jacobian matrix.
Now, how do we define the derivative of an $m \times n$ matrix function $f$ of a $p \times q$ matrix? It's clear that we have $mnpq$ partial derivatives. We could just define the derivative of $f$ as we did above for the other cases, but we'll follow Magnus's way [1, 2] and give the following definition:
$$Df(X) = \frac{\partial \operatorname{vec} f(X)}{\partial (\operatorname{vec} X)^T}$$
The result is an $mn \times pq$ matrix. We'll talk about the vec operator in a moment. For now let's just say that it vectorizes a matrix by stacking its columns on top of one another. More formally,
$$\operatorname{vec}(A) = \begin{pmatrix} a_{11} & \cdots & a_{m1} & a_{12} & \cdots & a_{m2} & \cdots & a_{1n} & \cdots & a_{mn} \end{pmatrix}^T$$
This means that vec f(X) is a vector function and vec X is a vector.
So what about a scalar function of a matrix? According to this last definition, the derivative should be a row vector, while according to our first definition of derivative it should be a matrix. Both are viable options and we'll see how easy it is to go from one to the other.
The second derivative in the multidimensional case is the Hessian matrix (or just Hessian), a square
matrix of second-order partial derivatives of a scalar function. More precisely, if f is a scalar function
of a vector, then the Hessian is defined as
$$Hf(x) = \frac{\partial^2 f(x)}{\partial x\,\partial x^T} = \begin{pmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1\partial x_n} \\ \frac{\partial^2 f(x)}{\partial x_2\partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n\partial x_1} & \frac{\partial^2 f(x)}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{pmatrix}$$
Note that the Hessian is the Jacobian of the gradient (i.e. the transpose of the Jacobian) of $f$:
$$Hf(x) = \frac{\partial^2 f(x)}{\partial x\,\partial x^T} = \frac{\partial}{\partial x^T}\frac{\partial f(x)}{\partial x} = \mathrm{J}(\nabla f)(x)$$
If $f$ is a scalar function of a matrix (rather than of a vector), we vectorize the matrix:
$$Hf(X) = \frac{\partial^2 f(X)}{\partial(\operatorname{vec}X)\,\partial(\operatorname{vec}X)^T}$$
In this article we assume that the second-order partial derivatives are continuous, so the order of differentiation is irrelevant and the Hessian is always symmetric.
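These layout conventions can be sanity-checked numerically. Here is a minimal sketch (ours, not part of the original text, assuming NumPy) that approximates the Jacobian $Df(x) = \partial f(x)/\partial x^T$ by forward differences for a small illustrative vector function:

```python
# Finite-difference Jacobian in the Df(x) = ∂f/∂x^T layout (a sketch; the test
# function and step size are illustrative choices, not from the article).
import numpy as np

def f(x):
    # a small R^2 -> R^2 test function
    return np.array([x[0] * x[1], x[0] ** 2])

def numerical_jacobian(f, x, h=1e-6):
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - fx) / h   # entry (i, j) approximates ∂f_i/∂x_j
    return J

x = np.array([2.0, 3.0])
print(numerical_jacobian(f, x))
# analytic Jacobian: [[x2, x1], [2*x1, 0]] = [[3, 2], [4, 0]]
```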
2 The differential
The method described here is based on differentials, therefore let's recall what a differential is.
In the one-dimensional case, the derivative of $f$ at $x$ is defined as
$$\lim_{u \to 0} \frac{f(x+u) - f(x)}{u} = f'(x)$$
This can be rewritten as
$$f(x+u) = f(x) + f'(x)u + r_x(u)$$
where $r_x(u)$ is $o(u)$, i.e. $r_x(u)/u \to 0$ as $u \to 0$. The differential of $f$ at $x$ with increment $u$ is
$$df(x; u) = f'(x)u.$$
In the vector case, we have
$$f(x+u) = f(x) + (Df(x))u + r_x(u)$$
and the differential of $f$ at $x$ with increment $u$ is $df(x; u) = (Df(x))u$. As we can see, the differential is the best linear approximation of $f(x+u) - f(x)$ at $x$. In practice, we write $dx$ instead of $u$, so, for instance, $df(x; u) = f'(x)\,dx$. We'll justify this notation in a moment, but first let's introduce the so-called identification results. They are needed to get the derivatives from the differentials.
The first result says that, if $f$ is a vector function of a vector,
$$df(x) = A(x)\,dx \implies Df(x) = A(x).$$
More generally, if $f$ is a matrix function of a matrix,
$$d\operatorname{vec}f(X) = A(X)\,d\operatorname{vec}X \implies Df(X) = A(X).$$
The second result is about the second differential and says that if $f$ is a scalar function of a vector, then
$$d^2f(x) = (dx)^TB(x)\,dx \implies Hf(x) = \frac{1}{2}\big(B(x) + B(x)^T\big)$$
where $Hf(x)$ denotes the Hessian matrix.
Another important result is Cauchy's rule of invariance, which says that if $h(x) = g(f(x))$, then
$$dh(x; u) = dg(f(x); df(x; u))$$
This is related to the chain rule for derivatives; in fact, it can be proved by making use of the chain rule:
$$\begin{aligned} dh(x; u) &= Dh(x)u \\ &= D(g \circ f)(x)u \\ &= Dg(f(x))Df(x)u \\ &= Dg(f(x))\,df(x; u) \\ &= dg(f(x); df(x; u)) \end{aligned}$$
Now we can justify the abbreviated notation $dx$. If $y = f(x)$, then we write
$$dy = df(x; dx)$$
where $x$ and $y$ are variables. Basically, we name the differential after the variables rather than after the functions. But now suppose that $x = g(t)$ and $dx = dg(t; dt)$. We now have $y = f(g(t)) = h(t)$ and, therefore,
$$dy = dh(t; dt).$$
For our abbreviated notation to be consistent, it must be the case that $df(x; dx) = dh(t; dt)$. Fortunately, thanks to Cauchy's rule of invariance, we can see that it is so:
$$dh(t; dt) = d(f \circ g)(t; dt) = df(g(t); dg(t; dt)) = df(x; dx).$$
3 Two important operators
Before proceeding with the actual computation of differentials and derivatives, we need to introduce two important operators: the Kronecker product and the vec operator. We've already talked a little about the vec operator, but here we'll see and prove some useful results for manipulating expressions involving these two operators.
As we said before, the vec operator is defined as follows:
$$\operatorname{vec}(A) = \begin{pmatrix} a_{11} & \cdots & a_{m1} & a_{12} & \cdots & a_{m2} & \cdots & a_{1n} & \cdots & a_{mn} \end{pmatrix}^T$$
For instance,
$$\operatorname{vec}\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} = \begin{pmatrix} 1 & 4 & 2 & 5 & 3 & 6 \end{pmatrix}^T$$
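As a side note of ours (the article itself does not use code), the column-stacking vec corresponds in NumPy to a column-major reshape:

```python
# vec as a column-major (Fortran-order) reshape; a sketch, not from the article.
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
vecA = A.reshape(-1, order='F')   # stack the columns on top of one another
print(vecA)                       # [1 4 2 5 3 6]
```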
The Kronecker product between two matrices $A$ and $B$ is defined as follows:
$$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}$$
For instance,
$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \otimes \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 2 & 2 \\ 1 & 1 & 2 & 2 \\ 3 & 3 & 4 & 4 \\ 3 & 3 & 4 & 4 \end{pmatrix}$$
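The same example can be reproduced with NumPy's `np.kron` (a quick check of ours, not part of the original text):

```python
# Kronecker product of the worked example above, computed with NumPy.
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.ones((2, 2))
print(np.kron(A, B))
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
```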
Here's a list of properties of the Kronecker product and the vec operator:
1. $A \otimes (B \otimes C) = (A \otimes B) \otimes C$ (associativity)
2. $A \otimes (B + C) = (A \otimes B) + (A \otimes C)$ and $(A + B) \otimes C = (A \otimes C) + (B \otimes C)$ (distributivity)
3. For $a \in \mathbb{R}$, $a \otimes A = A \otimes a = aA$
4. For $a, b \in \mathbb{R}$, $aA \otimes bB = ab(A \otimes B)$
5. For conforming matrices, $(A \otimes B)(C \otimes D) = AC \otimes BD$
6. $(A \otimes B)^T = A^T \otimes B^T$ and $(A \otimes B)^H = A^H \otimes B^H$
7. For all vectors $a$ and $b$, $a^T \otimes b = ba^T = b \otimes a^T$
8. $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$
9. The vec operator is linear.
10. For all vectors $a$ and $b$, $\operatorname{vec}(ab^T) = b \otimes a$
11. $\operatorname{vec}(AXC) = (C^T \otimes A)\operatorname{vec}(X)$
12. $\operatorname{tr}(AB) = \operatorname{vec}(A^T)^T\operatorname{vec}(B)$
We'll denote by $\{f(i,j)\}$ the matrix whose generic $(i,j)$ element is $f(i,j)$. Note that if $A$ is a block matrix with blocks $B_{ij}$, then
$$A \otimes C = \begin{pmatrix} B_{11} \otimes C & \cdots & B_{1n} \otimes C \\ \vdots & \ddots & \vdots \\ B_{m1} \otimes C & \cdots & B_{mn} \otimes C \end{pmatrix}$$
Point (3) follows directly from the definition, while point (4) can be proved as follows:
$$xA \otimes yB = \{xa_{ij}(yB)\} = \{xy\,a_{ij}B\} = xy\{a_{ij}B\} = xy(A \otimes B)$$
Now let's prove the other points one by one.
1. $A \otimes (B \otimes C) = \{a_{ij}(B \otimes C)\} = \{a_{ij}B \otimes C\} = \{a_{ij}B\} \otimes C = (A \otimes B) \otimes C$
2. $A \otimes (B + C) = \{a_{ij}(B + C)\} = \{a_{ij}B + a_{ij}C\} = \{a_{ij}B\} + \{a_{ij}C\} = A \otimes B + A \otimes C$
   The other case is analogous.
3. See above.
4. See above.
5. $(A \otimes B)(C \otimes D) = \{a_{ij}B\}\{c_{ij}D\} = \left\{\sum_k a_{ik}B\,c_{kj}D\right\} = \left\{\left(\sum_k a_{ik}c_{kj}\right)BD\right\} = \left\{\sum_k a_{ik}c_{kj}\right\} \otimes BD = AC \otimes BD$
6. $(A \otimes B)^T = \{a_{ij}B\}^T = \{a_{ji}B^T\} = A^T \otimes B^T$
   The other case is analogous.
7. This is very easy.
8. By (5), $(A \otimes B)(A^{-1} \otimes B^{-1}) = AA^{-1} \otimes BB^{-1} = I \otimes I = I$
9. This is very easy.
10. $\operatorname{vec}(ab^T) = \operatorname{vec}\begin{pmatrix} a_1b_1 & \cdots & a_1b_n \\ \vdots & \ddots & \vdots \\ a_mb_1 & \cdots & a_mb_n \end{pmatrix} = \operatorname{vec}\begin{pmatrix} b_1a & \cdots & b_na \end{pmatrix} = \begin{pmatrix} b_1a \\ \vdots \\ b_na \end{pmatrix} = b \otimes a$
11. Let $x_1, \ldots, x_n$ be the columns of the matrix $X$ and $e_1, \ldots, e_n$ the columns of the identity matrix of order $n$. You should convince yourself that $X = \sum_k x_ke_k^T$. Here we go:
$$\begin{aligned}
\operatorname{vec}(AXC) &= \operatorname{vec}\left(A\left(\sum_k x_ke_k^T\right)C\right) = \operatorname{vec}\left(\sum_k (Ax_k)(e_k^TC)\right) \\
&= \sum_k \operatorname{vec}\big((Ax_k)(e_k^TC)\big) && \text{(by (9))} \\
&= \sum_k \big((e_k^TC)^T \otimes (Ax_k)\big) && \text{(by (10))} \\
&= \sum_k \big((C^Te_k) \otimes (Ax_k)\big) \\
&= \sum_k (C^T \otimes A)(e_k \otimes x_k) && \text{(by (5))} \\
&= (C^T \otimes A)\sum_k (e_k \otimes x_k) \\
&= (C^T \otimes A)\sum_k \operatorname{vec}(x_ke_k^T) && \text{(by (10))} \\
&= (C^T \otimes A)\operatorname{vec}\sum_k x_ke_k^T && \text{(by (9))} \\
&= (C^T \otimes A)\operatorname{vec}(X)
\end{aligned}$$
12. The trace of the product $AB$ is just the sum of all the products $a_{ij}b_{ji}$:
$$\operatorname{tr}(AB) = \sum_i [AB]_{ii} = \sum_i\sum_k a_{ik}b_{ki} = \sum_i\sum_k [A^T]_{ki}b_{ki} = \operatorname{vec}(A^T)^T\operatorname{vec}(B)$$
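Properties (11) and (12) are the ones we will lean on most, so here is a small numerical sanity check (a sketch of ours, assuming NumPy; the shapes are arbitrary):

```python
# Numerical check of vec(AXC) = (C^T ⊗ A) vec(X) and tr(AB) = vec(A^T)^T vec(B).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

vec = lambda M: M.reshape(-1, 1, order='F')   # column-stacking vec as a column vector

# property (11)
lhs = vec(A @ X @ C)
rhs = np.kron(C.T, A) @ vec(X)
print(np.allclose(lhs, rhs))                  # True

# property (12)
B = rng.standard_normal((4, 3))
print(np.isclose(np.trace(A @ B), (vec(A.T).T @ vec(B)).item()))   # True
```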
Since we'll be using the trace operator quite a bit, recall that:
1. tr is linear.
2. $\operatorname{tr}(A) = \operatorname{tr}(A^T)$
3. $\operatorname{tr}(AB) = \operatorname{tr}(BA)$
The first two properties are trivial (but still very useful). Let's see why the last one is also true:
$$\operatorname{tr}(AB) = \sum_i [AB]_{ii} = \sum_i\sum_k a_{ik}b_{ki} = \sum_k\sum_i b_{ki}a_{ik} = \sum_k [BA]_{kk} = \operatorname{tr}(BA)$$
Also note that the trace of a scalar is the scalar itself. This means that when we have a scalar function
we can add a trace. This simple observation will be very useful.
4 Basic rules of differentiation
In order to be able to differentiate expressions, we'll need a set of simple rules:
1. $dA = 0$, where $A$ is constant
2. $d(\alpha X) = \alpha\,dX$, where $\alpha$ is a scalar
3. $d(X + Y) = dX + dY$
4. $d(\operatorname{tr}(X)) = \operatorname{tr}(dX)$
5. $d(XY) = (dX)Y + X\,dY$
6. $d(X \otimes Y) = (dX) \otimes Y + X \otimes dY$
7. $d(X^{-1}) = -X^{-1}(dX)X^{-1}$
8. $d|X| = |X|\operatorname{tr}(X^{-1}dX)$
9. $d\log|X| = \operatorname{tr}(X^{-1}dX)$
10. $d(X^\star) = (dX)^\star$, where $\star$ is any operator which rearranges elements, such as transpose and vec
A matrix function is really a matrix of scalar functions, and the differential of a matrix is the matrix of the differentials of the single scalar functions. More formally, $[dX]_{ij} = d(X_{ij})$. Remember that if $f$ is a scalar function of a vector, then $df(x; u) = \sum_i D_if(x)u_i$, where the $D_if(x)$ are the partial derivatives of $f$ at $x$. If $f$ is, instead, a function of a matrix, we just generalize the previous relation: $df(x; u) = \sum_{ij} D_{ij}f(x)u_{ij}$. That said, many of the rules above can be readily proved.
As an example, let's prove the product rule (5). Let $f$ and $g$ be two scalar functions of a matrix. Then
$$\begin{aligned}
d(fg)(x; u) &= \sum_{i,j} D_{ij}(fg)(x)u_{ij} \\
&= \sum_{i,j} \big((D_{ij}f(x))g(x) + f(x)D_{ij}g(x)\big)u_{ij} \\
&= \sum_{i,j} (D_{ij}f(x))u_{ij}\,g(x) + \sum_{i,j} f(x)D_{ij}g(x)u_{ij} \\
&= df(x; u)g(x) + f(x)dg(x; u)
\end{aligned}$$
where we used the usual product rule for derivatives. Now we can use this result to prove the general one about matrices of scalar functions:
$$\begin{aligned}
[d(XY)]_{ij} &= d[XY]_{ij} = d\left(\sum_k x_{ik}y_{kj}\right) = \sum_k d(x_{ik}y_{kj}) \\
&= \sum_k (dx_{ik})y_{kj} + \sum_k x_{ik}\,dy_{kj} \\
&= \sum_k [dX]_{ik}y_{kj} + \sum_k x_{ik}[dY]_{kj} \\
&= [(dX)Y]_{ij} + [X\,dY]_{ij} = [(dX)Y + X\,dY]_{ij}
\end{aligned}$$
Now we can prove (7) by using the product rule:
$$0 = dI = d(XX^{-1}) = (dX)X^{-1} + X\,d(X^{-1}) \implies dX^{-1} = -X^{-1}(dX)X^{-1}$$
To prove (8) we observe that, for any $i = 1, \ldots, n$, we have $|X| = \sum_j x_{ij}C_{ij}$, where $C_{ij}$ is the cofactor of the element $x_{ij}$, i.e. $(-1)^{i+j}$ times the determinant of the matrix obtained by removing from $X$ the $i$-th row and the $j$-th column. Because none of the cofactors $C_{i1}, \ldots, C_{in}$ depends on $x_{ij}$, we have
$$\frac{\partial|X|}{\partial x_{ij}} = \frac{\partial}{\partial x_{ij}}\sum_k x_{ik}C_{ik} = C_{ij}.$$
Now note that $C$ is the matrix of the cofactors and recall that $X^{-1} = \frac{1}{|X|}C^T$ and thus $C^T = |X|X^{-1}$.
This last result will be used in the following derivation:
$$d|X| = d|{\cdot}|(X; dX) = \sum_{i,j} C_{ij}[dX]_{ij} = \sum_{i,j} [C^T]_{ji}[dX]_{ij} = \sum_j [C^TdX]_{jj} = \operatorname{tr}(C^TdX) = |X|\operatorname{tr}(X^{-1}dX)$$
Note that (9) follows directly from (8).
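As a quick numerical illustration (ours, not part of the proof), rule (8) can be checked to first order by perturbing a matrix with a small increment:

```python
# First-order check of d|X| = |X| tr(X^{-1} dX); a sketch, the sizes and the
# magnitude of the increment are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 4))
dX = 1e-6 * rng.standard_normal((4, 4))          # small increment

exact_change = np.linalg.det(X + dX) - np.linalg.det(X)
differential = np.linalg.det(X) * np.trace(np.linalg.solve(X, dX))  # |X| tr(X^{-1} dX)
print(exact_change, differential)                # agree up to second-order terms
```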
5 The special form tr(AdX)
At the beginning of this article we gave two definitions of the derivative of a scalar function of a matrix. The first is
$$Df(X) = \frac{\partial f(X)}{\partial X^T}$$
and the second is
$$Df(X) = \frac{\partial f(X)}{\partial(\operatorname{vec}X)^T}.$$
In the first case the result is a matrix, whereas in the second the result is a row vector. Magnus suggests using the second definition, but we'll opt for the first one.
Let's consider the differential $\operatorname{tr}(A\,dX)$. We can find the derivative according to the second definition above by using property (12) in the section "Two important operators". We'll also use rule of differentiation (10) with $\star = \operatorname{vec}$. We get
$$\operatorname{tr}(A\,dX) = \operatorname{vec}(A^T)^T\operatorname{vec}dX = \operatorname{vec}(A^T)^T d\operatorname{vec}X$$
and, therefore,
$$\frac{\partial\operatorname{tr}(A\,dX)}{\partial(\operatorname{vec}X)^T} = \operatorname{vec}(A^T)^T$$
To get the result in matrix form (first definition) we need to unvectorize the result above. We'll proceed step by step:
$$\frac{\partial\operatorname{tr}(A\,dX)}{\partial(\operatorname{vec}X)^T} = \operatorname{vec}(A^T)^T \implies \frac{\partial\operatorname{tr}(A\,dX)}{\partial\operatorname{vec}X} = \operatorname{vec}(A^T) \implies \frac{\partial\operatorname{tr}(A\,dX)}{\partial X} = A^T \implies \frac{\partial\operatorname{tr}(A\,dX)}{\partial X^T} = A.$$
So, the result is simply $A$. As we saw in the previous section,
$$d|X| = |X|\operatorname{tr}(X^{-1}dX) = \operatorname{tr}(|X|X^{-1}dX).$$
This means that the derivative is $|X|X^{-1} = C^T$, which agrees with the result
$$\frac{\partial|X|}{\partial x_{ij}} = C_{ij}$$
that we derived in the previous section.
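The recipe "read off $A$ from $\operatorname{tr}(A\,dX)$" can also be verified by finite differences. The sketch below is ours (the test function, step size and tolerance are illustrative); it checks that for $f(X) = \log|X|$, where $df = \operatorname{tr}(X^{-1}dX)$, the matrix-form derivative $\partial f/\partial X^T$ is indeed $X^{-1}$:

```python
# Finite-difference check that ∂(log|X|)/∂X^T = X^{-1}, in the layout used above.
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
X = M @ M.T + 3 * np.eye(3)           # positive definite, so |X| > 0
f = lambda A: np.log(np.linalg.det(A))

h = 1e-6
G = np.zeros_like(X)                  # G[i, j] approximates ∂f/∂x_ij, i.e. ∂f/∂X
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = h
        G[i, j] = (f(X + E) - f(X)) / h

print(np.allclose(G.T, np.linalg.inv(X), atol=1e-4))   # ∂f/∂X^T ≈ X^{-1}
```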
6 Examples
In practice, the derivative of an expression involving matrices can be computed by using the rules of differentiation to get a result of the form $A(X)\,dX$, $\operatorname{tr}(A(X)\,dX)$ or $A(X)\,d\operatorname{vec}X$. After having done that, we can read off the derivative, which is simply $A(X)$.
To find the Hessian we must differentiate a second time, that is, we must differentiate $f(x)$ to find $df(x; dx)$, and then differentiate $df(x; dx)$ itself, keeping in mind that $dx$ is a constant increment and thus, for instance, $d(a^Tdx) = 0$ and $d(x^TA\,dx) = dx^TA\,dx$.
According to the second identification result, we must get a result of the form $dx^TB(x)\,dx$ or of the form $(d\operatorname{vec}X)^TB(X)\,d\operatorname{vec}X$, and the Hessian is $\frac{1}{2}\big(B(x) + B(x)^T\big)$. Note that if $B(x)$ is symmetric then the Hessian is just $B(x)$.
From time to time we'll need to use the commutation matrix $K_{mn}$, which is the permutation matrix satisfying
$$K_{mn}\operatorname{vec}X = \operatorname{vec}(X^T)$$
where $X$ is $m \times n$. It can be shown that
$$K_{mn}^T = K_{mn}^{-1} = K_{nm}.$$
In particular, $K_{nn} = K_n$ is symmetric. Another useful fact is the following. If $A$ is an $m \times n$ matrix, $B$ a $p \times q$ matrix and $X$ a $q \times n$ matrix, then
$$(B \otimes A)K_{qn} = K_{pm}(A \otimes B)$$
Let's prove that:
$$\begin{aligned}
K_{pm}(A \otimes B)\operatorname{vec}X &= K_{pm}\operatorname{vec}(BXA^T) \\
&= \operatorname{vec}\big((BXA^T)^T\big) \\
&= \operatorname{vec}(AX^TB^T) \\
&= (B \otimes A)\operatorname{vec}(X^T) \\
&= (B \otimes A)K_{qn}\operatorname{vec}X
\end{aligned}$$
which is true for all $\operatorname{vec}X \in \mathbb{R}^{qn \times 1}$ and hence the proof is complete.
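For experimentation it is handy to build $K_{mn}$ explicitly. The helper below is our own sketch (not from the article); it constructs the permutation matrix from the definition and checks both $K_{mn}\operatorname{vec}X = \operatorname{vec}(X^T)$ and the identity just proved:

```python
# Explicit commutation matrix and two sanity checks; shapes are arbitrary.
import numpy as np

def commutation_matrix(m, n):
    # K_mn maps vec(X) to vec(X^T) for an m x n matrix X:
    # entry x_ij sits at position j*m + i in vec(X) and i*n + j in vec(X^T).
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

vec = lambda M: M.reshape(-1, order='F')

m, n = 3, 4
X = np.arange(m * n, dtype=float).reshape(m, n)
print(np.allclose(commutation_matrix(m, n) @ vec(X), vec(X.T)))   # True

rng = np.random.default_rng(7)
A = rng.standard_normal((2, 3))                      # m x n
B = rng.standard_normal((4, 5))                      # p x q
lhs = np.kron(B, A) @ commutation_matrix(5, 3)       # (B ⊗ A) K_qn
rhs = commutation_matrix(4, 2) @ np.kron(A, B)       # K_pm (A ⊗ B)
print(np.allclose(lhs, rhs))                         # True
```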
In this section we'll see some examples of computation of the derivative and of the Hessian, and then an example of maximum likelihood estimation with a multivariate Gaussian distribution. Matrices will be written in uppercase and vectors in lowercase.
Let's get started!
$$f(x) = a^Tx: \qquad d(a^Tx) = a^Tdx \implies Df(x) = a^T$$
$$f(x) = x^TAx: \qquad \begin{aligned} d(x^TAx) &= d(x^T)Ax + x^Td(Ax) \\ &= (dx)^TAx + x^TA\,dx \\ &= x^TA^Tdx + x^TA\,dx \\ &= x^T(A^T + A)\,dx \end{aligned} \implies Df(x) = x^T(A^T + A)$$
$$f(X) = a^TXb: \qquad \begin{aligned} d(a^TXb) &= d\operatorname{tr}(a^TXb) \\ &= \operatorname{tr}(a^Td(X)b) \\ &= \operatorname{tr}(ba^TdX) \end{aligned} \implies Df(X) = ba^T$$
$$f(X) = a^TXX^Ta: \qquad \begin{aligned} d(a^TXX^Ta) &= \operatorname{tr}(a^Td(XX^T)a) \\ &= \operatorname{tr}(aa^Td(XX^T)) \\ &= \operatorname{tr}\big(aa^T((dX)X^T + X\,dX^T)\big) \\ &= \operatorname{tr}(aa^T(dX)X^T) + \operatorname{tr}((dX)X^Taa^T) \\ &= \operatorname{tr}(X^Taa^TdX) + \operatorname{tr}(X^Taa^TdX) \\ &= 2\operatorname{tr}(X^Taa^TdX) \end{aligned} \implies Df(X) = 2X^Taa^T$$
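A finite-difference check of this result (ours, with arbitrary test data):

```python
# Check Df(X) = 2 X^T a a^T for f(X) = a^T X X^T a, in the ∂f/∂X^T layout.
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal(3)
X = rng.standard_normal((3, 4))
f = lambda M: a @ M @ M.T @ a

h = 1e-6
G = np.zeros_like(X)                  # G[i, j] approximates ∂f/∂x_ij
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = h
        G[i, j] = (f(X + E) - f(X)) / h

print(np.allclose(G.T, 2 * X.T @ np.outer(a, a), atol=1e-4))   # True
```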
$$f(X) = \operatorname{tr}(AX^TBXC): \qquad \begin{aligned} d\operatorname{tr}(AX^TBXC) &= \operatorname{tr}(A\,d(X^T)BXC) + \operatorname{tr}(AX^TB(dX)C) \\ &= \operatorname{tr}(C^TX^TB^T(dX)A^T) + \operatorname{tr}(CAX^TB\,dX) \\ &= \operatorname{tr}\big((A^TC^TX^TB^T + CAX^TB)\,dX\big) \end{aligned} \implies Df(X) = A^TC^TX^TB^T + CAX^TB$$
$$f(X) = \operatorname{tr}(AX^{-1}B): \qquad \begin{aligned} d\operatorname{tr}(AX^{-1}B) &= \operatorname{tr}(BA\,d(X^{-1})) \\ &= -\operatorname{tr}(BAX^{-1}(dX)X^{-1}) \\ &= -\operatorname{tr}(X^{-1}BAX^{-1}dX) \end{aligned} \implies Df(X) = -X^{-1}BAX^{-1}$$
$$f(X) = |X^TX|: \qquad \begin{aligned} d|X^TX| &= |X^TX|\operatorname{tr}\big((X^TX)^{-1}d(X^TX)\big) \\ &= |X^TX|\operatorname{tr}\big((X^TX)^{-1}(d(X^T)X + X^TdX)\big) \\ &= |X^TX|\Big(\operatorname{tr}\big((X^TX)^{-1}d(X^T)X\big) + \operatorname{tr}\big((X^TX)^{-1}X^TdX\big)\Big) \\ &= |X^TX|\Big(\operatorname{tr}\big(X^T(dX)(X^TX)^{-1}\big) + \operatorname{tr}\big((X^TX)^{-1}X^TdX\big)\Big) \\ &= 2|X^TX|\operatorname{tr}\big((X^TX)^{-1}X^TdX\big) \end{aligned} \implies Df(X) = 2|X^TX|(X^TX)^{-1}X^T$$
$$f(X) = \operatorname{tr}(X^p): \qquad \begin{aligned} d\operatorname{tr}(X^p) &= \operatorname{tr}\big((dX)X^{p-1} + X(dX)X^{p-2} + \cdots + X^{p-1}dX\big) \\ &= \operatorname{tr}(X^{p-1}dX + X^{p-1}dX + \cdots + X^{p-1}dX) \\ &= p\operatorname{tr}(X^{p-1}dX) \end{aligned} \implies Df(X) = pX^{p-1}$$
$$f(X) = Xa: \qquad \begin{aligned} d(Xa) &= (dX)a \\ &= \operatorname{vec}((dX)a) && \text{(the vec of a vector is the vector itself)} \\ &= \operatorname{vec}(I_n(dX)a) \\ &= (a^T \otimes I_n)\,d\operatorname{vec}X \end{aligned} \implies Df(X) = a^T \otimes I_n$$
To differentiate a matrix function we need to vectorize it:
$$f(x) = xx^T: \qquad \begin{aligned} d\operatorname{vec}(xx^T) &= \operatorname{vec}\big((dx)x^T + x\,dx^T\big) \\ &= \operatorname{vec}(I_n(dx)x^T) + \operatorname{vec}(x(dx)^TI_n) \\ &= (x \otimes I_n)\,d\operatorname{vec}x + (I_n \otimes x)\,d\operatorname{vec}(x^T) \\ &= (x \otimes I_n)\,d\operatorname{vec}x + (I_n \otimes x)\,d\operatorname{vec}x \end{aligned} \implies Df(x) = x \otimes I_n + I_n \otimes x$$
$$f(X) = X^2: \qquad \begin{aligned} d\operatorname{vec}(X^2) &= \operatorname{vec}((dX)X + X\,dX) \\ &= \operatorname{vec}(I_n(dX)X + X(dX)I_n) \\ &= (X^T \otimes I_n + I_n \otimes X)\,d\operatorname{vec}X \end{aligned} \implies Df(X) = X^T \otimes I_n + I_n \otimes X$$
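The vectorized Jacobians can be checked the same way; here is a sketch of ours comparing $X^T \otimes I_n + I_n \otimes X$ against a finite-difference Jacobian of $\operatorname{vec}(X^2)$ (sizes and tolerances are arbitrary):

```python
# Finite-difference check of D(X^2) = X^T ⊗ I_n + I_n ⊗ X (vec-based Jacobian).
import numpy as np

rng = np.random.default_rng(4)
n = 3
X = rng.standard_normal((n, n))
vec = lambda M: M.reshape(-1, order='F')

h = 1e-6
J = np.zeros((n * n, n * n))
for k in range(n * n):
    dx = np.zeros(n * n)
    dx[k] = h
    Xp = X + dx.reshape(n, n, order='F')
    J[:, k] = (vec(Xp @ Xp) - vec(X @ X)) / h   # column k: ∂ vec(X^2)/∂(vec X)_k

analytic = np.kron(X.T, np.eye(n)) + np.kron(np.eye(n), X)
print(np.allclose(J, analytic, atol=1e-4))      # True
```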
Now let's compute some Hessians.
$$f(x) = a^Tx: \qquad \begin{aligned} d(a^Tx) &= a^Tdx \\ d(a^Tdx) &= 0 \end{aligned} \implies Hf(x) = 0$$
$$f(X) = \operatorname{tr}(AXB): \qquad \begin{aligned} d\operatorname{tr}(AXB) &= \operatorname{tr}(A(dX)B) = \operatorname{tr}(BA\,dX) \\ d\operatorname{tr}(BA\,dX) &= 0 \end{aligned} \implies Hf(X) = 0$$
$$f(x) = x^TAx: \qquad \begin{aligned} d(x^TAx) &= dx^TAx + x^TA\,dx \\ &= x^TA^Tdx + x^TA\,dx \\ &= x^T(A^T + A)\,dx \\ d\big(x^T(A^T + A)\,dx\big) &= dx^T(A^T + A)\,dx \end{aligned} \implies Hf(x) = A^T + A$$
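A quick second-difference check of this Hessian (ours; step size and tolerance are illustrative):

```python
# Finite-difference check that Hf(x) = A^T + A for f(x) = x^T A x.
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
f = lambda v: v @ A @ v

h = 1e-5
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei, ej = np.zeros(n), np.zeros(n)
        ei[i], ej[j] = h, h
        # second-order difference approximating ∂²f/∂x_i∂x_j
        H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h**2

print(np.allclose(H, A.T + A, atol=1e-3))   # True
```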
$$f(X) = \operatorname{tr}(X^TX): \qquad \begin{aligned} d\operatorname{tr}(X^TX) &= \operatorname{tr}(dX^TX + X^TdX) = \operatorname{tr}(X^TdX + X^TdX) = 2\operatorname{tr}(X^TdX) \\ d\big(2\operatorname{tr}(X^TdX)\big) &= 2\operatorname{tr}(dX^TdX) = 2(d\operatorname{vec}X)^Td\operatorname{vec}X \end{aligned} \implies Hf(X) = 2I_{mn} \quad (X \text{ is } m \times n)$$
$$f(X) = \operatorname{tr}(AX^TBX): \qquad \begin{aligned} d\operatorname{tr}(AX^TBX) &= \operatorname{tr}(A\,dX^TBX + AX^TB\,dX) \\ &= \operatorname{tr}(X^TB^T(dX)A^T + AX^TB\,dX) \\ &= \operatorname{tr}(A^TX^TB^TdX + AX^TB\,dX) \\ d\big(\operatorname{tr}(A^TX^TB^TdX + AX^TB\,dX)\big) &= \operatorname{tr}(A^TdX^TB^TdX + A\,dX^TB\,dX) \\ &= \operatorname{tr}(dX^TB(dX)A + A\,dX^TB\,dX) \\ &= 2\operatorname{tr}(A\,dX^TB\,dX) = 2\operatorname{tr}\big(dX^TB(dX)A\big) \\ &= 2(d\operatorname{vec}X)^T\operatorname{vec}(B(dX)A) \\ &= 2(d\operatorname{vec}X)^T(A^T \otimes B)\,d\operatorname{vec}X \end{aligned}$$
$$\implies Hf(X) = \frac{1}{2}\big(2A^T \otimes B + 2A \otimes B^T\big) = A^T \otimes B + A \otimes B^T$$
Here we'll make use of the commutation matrix $K_{mn}$. Remember that $K_{nn} = K_n$ is symmetric.
$$f(X) = \operatorname{tr}(X^2): \qquad \begin{aligned} d\operatorname{tr}(X^2) &= \operatorname{tr}((dX)X + X\,dX) = 2\operatorname{tr}(X\,dX) && (X \text{ is } n \times n) \\ d\big(2\operatorname{tr}(X\,dX)\big) &= 2\operatorname{tr}(dX\,dX) \\ &= 2\big(\operatorname{vec}(dX^T)\big)^Td\operatorname{vec}X \\ &= 2\big(K_n\operatorname{vec}(dX)\big)^Td\operatorname{vec}X \\ &= 2(d\operatorname{vec}X)^TK_n\,d\operatorname{vec}X \end{aligned} \implies Hf(X) = 2K_n$$
$$f(X) = \operatorname{tr}(AXBX): \qquad \begin{aligned} d\operatorname{tr}(AXBX) &= \operatorname{tr}(A(dX)BX + AXB\,dX) && (X \text{ is } m \times n) \\ &= \operatorname{tr}(BXA\,dX + AXB\,dX) \\ d\operatorname{tr}(BXA\,dX + AXB\,dX) &= \operatorname{tr}(B(dX)A\,dX + A(dX)B\,dX) \\ &= 2\operatorname{tr}(A(dX)B\,dX) = 2\operatorname{tr}\big((dX)B(dX)A\big) \\ &= 2\big(\operatorname{vec}(dX^T)\big)^T\operatorname{vec}(B(dX)A) \\ &= 2\big(\operatorname{vec}(dX^T)\big)^T(A^T \otimes B)\,d\operatorname{vec}X \\ &= 2\big(K_{mn}\operatorname{vec}(dX)\big)^T(A^T \otimes B)\,d\operatorname{vec}X \\ &= 2(d\operatorname{vec}X)^TK_{nm}(A^T \otimes B)\,d\operatorname{vec}X \end{aligned}$$
$$\implies Hf(X) = K_{nm}(A^T \otimes B) + (A \otimes B^T)K_{mn} = K_{nm}(A^T \otimes B + B^T \otimes A)$$
In the last step above we used the fact that $(A \otimes B^T)K_{mn} = K_{nm}(B^T \otimes A)$.
$$f(X) = a^TXX^Ta: \qquad \begin{aligned} d(a^TXX^Ta) &= a^T(dX)X^Ta + a^TX(dX)^Ta \\ &= a^T(dX)X^Ta + a^T(dX)X^Ta \\ &= 2a^T(dX)X^Ta \\ d\big(2a^T(dX)X^Ta\big) &= 2a^T(dX)(dX)^Ta \\ &= 2\operatorname{tr}\big(a^T(dX)(dX)^Ta\big) \\ &= 2\operatorname{tr}\big((dX)^Taa^TdX\big) \\ &= 2(d\operatorname{vec}X)^T\operatorname{vec}(aa^TdX) \\ &= 2(d\operatorname{vec}X)^T\operatorname{vec}(aa^T(dX)I) \\ &= 2(d\operatorname{vec}X)^T(I \otimes aa^T)\,d\operatorname{vec}X \end{aligned} \implies Hf(X) = 2(I \otimes aa^T)$$
$$f(X) = \operatorname{tr}(X^{-1}): \qquad \begin{aligned} d\operatorname{tr}(X^{-1}) &= \operatorname{tr}\big(d(X^{-1})\big) = -\operatorname{tr}\big(X^{-1}(dX)X^{-1}\big) \\ d\big({-\operatorname{tr}}(X^{-1}(dX)X^{-1})\big) &= -\operatorname{tr}\big(d(X^{-1})(dX)X^{-1} + X^{-1}(dX)\,d(X^{-1})\big) \\ &= \operatorname{tr}\big(X^{-1}(dX)X^{-1}(dX)X^{-1} + X^{-1}(dX)X^{-1}(dX)X^{-1}\big) \\ &= 2\operatorname{tr}\big((dX)X^{-1}(dX)X^{-2}\big) \\ &= 2\big(\operatorname{vec}(dX^T)\big)^T\operatorname{vec}\big(X^{-1}(dX)X^{-2}\big) \\ &= 2\big(K_n\operatorname{vec}(dX)\big)^T\operatorname{vec}\big(X^{-1}(dX)X^{-2}\big) \\ &= 2(d\operatorname{vec}X)^TK_n(X^{-2T} \otimes X^{-1})\,d\operatorname{vec}X \end{aligned}$$
$$\implies Hf(X) = K_n(X^{-2T} \otimes X^{-1}) + (X^{-2} \otimes X^{-T})K_n = K_n(X^{-2T} \otimes X^{-1} + X^{-T} \otimes X^{-2})$$
$$f(X) = |X|: \qquad \begin{aligned} d|X| &= |X|\operatorname{tr}(X^{-1}dX) \\ d\big(|X|\operatorname{tr}(X^{-1}dX)\big) &= d|X|\operatorname{tr}(X^{-1}dX) + |X|\operatorname{tr}\big(d(X^{-1})dX\big) \\ &= |X|\big(\operatorname{tr}(X^{-1}dX)\big)^2 - |X|\operatorname{tr}\big(X^{-1}(dX)X^{-1}dX\big) \\ &= |X|\operatorname{tr}\big((dX)^TX^{-T}\big)\operatorname{tr}(X^{-1}dX) - |X|\operatorname{tr}\big((dX)X^{-1}(dX)X^{-1}\big) \\ &= |X|(d\operatorname{vec}X)^T\operatorname{vec}(X^{-T})\big(\operatorname{vec}(X^{-T})\big)^Td\operatorname{vec}X - |X|\big(\operatorname{vec}(dX^T)\big)^T(X^{-T} \otimes X^{-1})\,d\operatorname{vec}X \\ &= |X|(d\operatorname{vec}X)^T\Big(\operatorname{vec}(X^{-T})\big(\operatorname{vec}(X^{-T})\big)^T - K_n(X^{-T} \otimes X^{-1})\Big)\,d\operatorname{vec}X \end{aligned}$$
$$\implies Hf(X) = |X|\Big(\operatorname{vec}(X^{-T})\big(\operatorname{vec}(X^{-T})\big)^T - K_n(X^{-T} \otimes X^{-1})\Big)$$
Note that the Hessian above (like all the others) is symmetric; in fact,
$$\big(K_n(X^{-T} \otimes X^{-1})\big)^T = (X^{-1} \otimes X^{-T})K_n = K_n(X^{-T} \otimes X^{-1})$$
We conclude this section with an example about MLE.
Given a set of vectors $x_1, \ldots, x_N$ drawn independently from a multivariate Gaussian distribution, we want to estimate the parameters of the distribution by maximum likelihood. Note that the covariance matrix $\Sigma$, and thus $\Sigma^{-1}$, is symmetric. The log-likelihood function is
$$\ln p(x_1, \ldots, x_N \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^N (x_n - \mu)^T\Sigma^{-1}(x_n - \mu).$$
We first find the derivative with respect to $\mu$:
$$\begin{aligned}
d\left(-\frac{1}{2}\sum_{n=1}^N (x_n - \mu)^T\Sigma^{-1}(x_n - \mu)\right) &= -\frac{1}{2}\sum_{n=1}^N \Big(d(x_n - \mu)^T\,\Sigma^{-1}(x_n - \mu) + (x_n - \mu)^T\Sigma^{-1}\,d(x_n - \mu)\Big) \\
&= -\frac{1}{2}\sum_{n=1}^N \Big({-d\mu^T}\,\Sigma^{-1}(x_n - \mu) - (x_n - \mu)^T\Sigma^{-1}\,d\mu\Big) \\
&= \frac{1}{2}\sum_{n=1}^N \Big((x_n - \mu)^T\Sigma^{-1}\,d\mu + (x_n - \mu)^T\Sigma^{-1}\,d\mu\Big) \\
&= \left(\sum_{n=1}^N (x_n - \mu)^T\Sigma^{-1}\right)d\mu
\end{aligned}$$
$$\implies \frac{\partial\ln p}{\partial\mu^T} = \sum_{n=1}^N (x_n - \mu)^T\Sigma^{-1}$$
Therefore,
$$\sum_{n=1}^N (x_n - \mu)^T\Sigma^{-1} = 0 \implies \sum_{n=1}^N x_n^T = N\mu^T \implies \mu = \frac{1}{N}\sum_{n=1}^N x_n$$
Finally, we find the derivative with respect to $\Sigma$:
$$\begin{aligned}
d\left(-\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^N (x_n - \mu)^T\Sigma^{-1}(x_n - \mu)\right) &= -\frac{N}{2}\operatorname{tr}(\Sigma^{-1}d\Sigma) - \frac{1}{2}\sum_{n=1}^N \operatorname{tr}\Big((x_n - \mu)^T\,d(\Sigma^{-1})(x_n - \mu)\Big) \\
&= -\frac{N}{2}\operatorname{tr}(\Sigma^{-1}d\Sigma) + \frac{1}{2}\sum_{n=1}^N \operatorname{tr}\Big((x_n - \mu)^T\Sigma^{-1}(d\Sigma)\Sigma^{-1}(x_n - \mu)\Big) \\
&= -\frac{N}{2}\operatorname{tr}(\Sigma^{-1}d\Sigma) + \frac{1}{2}\sum_{n=1}^N \operatorname{tr}\Big(\Sigma^{-1}(x_n - \mu)(x_n - \mu)^T\Sigma^{-1}\,d\Sigma\Big)
\end{aligned}$$
$$\implies \frac{\partial\ln p}{\partial\Sigma^T} = -\frac{N}{2}\Sigma^{-1} + \frac{1}{2}\sum_{n=1}^N \Sigma^{-1}(x_n - \mu)(x_n - \mu)^T\Sigma^{-1}$$
Therefore,
$$-\frac{N}{2}\Sigma^{-1} + \frac{1}{2}\sum_{n=1}^N \Sigma^{-1}(x_n - \mu)(x_n - \mu)^T\Sigma^{-1} = 0 \implies \frac{N}{2}\Sigma^{-1} = \frac{1}{2}\sum_{n=1}^N \Sigma^{-1}(x_n - \mu)(x_n - \mu)^T\Sigma^{-1}$$
$$\implies N\Sigma = \sum_{n=1}^N (x_n - \mu)(x_n - \mu)^T \implies \Sigma = \frac{1}{N}\sum_{n=1}^N (x_n - \mu)(x_n - \mu)^T$$
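In code the two estimates are one line each. A minimal NumPy sketch of ours (the sample size and the "true" parameters below are made up for the demonstration):

```python
# Closed-form MLE of a multivariate Gaussian: sample mean and (biased) sample covariance.
import numpy as np

rng = np.random.default_rng(6)
N, D = 500, 3
true_mu = np.array([1.0, -2.0, 0.5])
true_Sigma = np.array([[2.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=N)   # rows are the x_n

mu_hat = X.mean(axis=0)                   # mu = (1/N) sum_n x_n
centered = X - mu_hat
Sigma_hat = centered.T @ centered / N     # note the 1/N (not 1/(N-1)) factor

print(mu_hat)
print(Sigma_hat)
```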
References
[1] Abadir, K. M., & Magnus, J. R. (2005). Matrix algebra. Cambridge, UK: Cambridge University
Press.
[2] Magnus, J. R., & Neudecker, H. (1999). Matrix differential calculus with applications in statistics and economics. New York, NY: John Wiley & Sons.
[3] Nydick, S. W. (2012). A Different(ial) Way: Matrix Derivatives Again.
[4] Minka, T. (2000). Old and New Matrix Algebra Useful for Statistics.