MPRA Paper 3917
Pawel Kowal
December 2006
Online at http://mpra.ub.uni-muenchen.de/3917/
MPRA Paper No. 3917, posted 9. July 2007
A note on matrix differentiation
Paweł Kowal
July 9, 2007
Abstract
This paper presents a set of rules for matrix differentiation with
respect to a vector of parameters, using the flattened representation of
derivatives, i.e. in the form of a matrix. We also introduce a new set of
Kronecker tensor products of matrices. Finally, we consider the problem
of differentiating the matrix determinant, trace, and inverse.
1 Introduction
Derivatives of matrices with respect to a vector of parameters can be expressed
as a concatenation of derivatives with respect to the scalar parameters.
However, such a representation of derivatives is very inconvenient in some
applications, e.g. if higher order derivatives are considered, and is not even
applicable if matrix functions (like the determinant or the inverse) are present.
For example, finding an explicit derivative of det(∂X/∂θ) would be quite a
complicated task. Such problems arise naturally in many applications, e.g.
in the maximum likelihood approach to estimating model parameters.
The same problem emerges in the case of a tensor representation of deriva-
tives. Additionally, in this case extra effort is required to find the flattened
representation of the resulting tensors, which is needed because numerical
computations can be run efficiently only on two dimensional data structures.
In this paper we derive formulas for differentiating matrices with respect
to a vector of parameters when the flattened form of the resulting derivatives,
i.e. their representation as matrices, is required. To do this we introduce a
new set of generalized Kronecker matrix products as well as a generalized
matrix transposition. First order and higher order derivatives of functions
built from primitive functions by elementary matrix operations, like summation,
multiplication, transposition and the Kronecker product, can then be expressed
in closed form in terms of the primitive matrix functions and their derivatives,
using these elementary operations, the generalized Kronecker products and
the generalized transpositions.
We also consider more general expressions containing matrix functions
(the inverse, trace and determinant). Defining a generalized trace function,
we are able to express derivatives of such expressions in closed form.
∂(X × Y)/∂θ = [∂X/∂θ1 . . . ∂X/∂θk] × diag(Y, . . . , Y) + X × [∂Y/∂θ1 . . . ∂Y/∂θk]
            = ∂X/∂θ × (Ik ⊗ Y) + X × ∂Y/∂θ

where diag(Y, . . . , Y) = Ik ⊗ Y denotes the block diagonal matrix with k copies of Y on the diagonal.
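As a quick numerical sanity check of the flattened product rule above, the sketch below compares both sides against central finite differences; the matrices X(θ) and Y(θ) are arbitrary smooth test functions made up for illustration, not taken from the paper.

```python
import numpy as np

def num_jac(F, theta, h=1e-6):
    # Flattened derivative [dF/dθ1 ... dF/dθk] via central differences.
    return np.hstack([(F(theta + h * e) - F(theta - h * e)) / (2 * h)
                      for e in np.eye(len(theta))])

# Arbitrary smooth matrix-valued test functions of θ ∈ R^2.
X = lambda t: np.array([[t[0], t[1]], [t[0] * t[1], 1.0]])
Y = lambda t: np.array([[np.sin(t[0]), 1.0], [t[1] ** 2, t[0]]])

theta = np.array([0.7, -0.3])
k = len(theta)

lhs = num_jac(lambda t: X(t) @ Y(t), theta)   # d(X·Y)/dθ, a 2 × 2k matrix
rhs = num_jac(X, theta) @ np.kron(np.eye(k), Y(theta)) \
      + X(theta) @ num_jac(Y, theta)
print(np.allclose(lhs, rhs, atol=1e-5))  # True
```

Note that Ik ⊗ Y is exactly the block diagonal matrix that lines each ∂X/∂θi up with its own copy of Y.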
Differentiating the matrix transposition is a little more complicated. Let
us first define a generalized matrix transposition.
Definition 2.2. Let X = [X1, X2, . . . , Xn], where Xi ∈ R^{p×q}, i = 1, 2, . . . , n,
be a column partition of the p × nq dimensional matrix X into p × q blocks. Then

Tn(X) := [X1′, X2′, . . . , Xn′]
Proposition 2.3. The following equations hold

1. ∂(X′)/∂θ = Tk(∂X/∂θ)

2. ∂(Tn(X))/∂θ = Tk×n(∂X/∂θ)
Proof. The first condition is a special case of the second condition for n = 1.
We have

∂(Tn(X))/∂θ = [Tn(∂X/∂θ1) . . . Tn(∂X/∂θk)]
            = [∂X1′/∂θ1, . . . , ∂Xn′/∂θ1 . . . ∂X1′/∂θk, . . . , ∂Xn′/∂θk]
            = Tk×n(∂X/∂θ)

since

∂X/∂θ = [∂X1/∂θ1, . . . , ∂Xn/∂θ1 . . . ∂X1/∂θk, . . . , ∂Xn/∂θk]
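The generalized transposition can be implemented directly from Definition 2.2. A small sketch, with made-up test data, that also checks the identity ∂(Tn(X))/∂θ = Tk×n(∂X/∂θ) numerically:

```python
import numpy as np

def T(X, n):
    # T_n(X): split X column-wise into n equal blocks and transpose each block.
    return np.hstack([B.T for B in np.hsplit(X, n)])

def num_jac(F, theta, h=1e-6):
    # Flattened derivative [dF/dθ1 ... dF/dθk] via central differences.
    return np.hstack([(F(theta + h * e) - F(theta - h * e)) / (2 * h)
                      for e in np.eye(len(theta))])

rng = np.random.default_rng(0)
A1, A2, A3 = rng.standard_normal((3, 2, 3))   # fixed coefficient matrices

# X(θ) is 2 × 6, i.e. n = 2 blocks of size p × q = 2 × 3.
X = lambda t: np.hstack([t[0] * A1 + t[1] ** 2 * A2, np.sin(t[0] * t[1]) * A3])
theta = np.array([0.4, 1.2])
k, n = len(theta), 2

lhs = num_jac(lambda t: T(X(t), n), theta)   # ∂T_n(X)/∂θ
rhs = T(num_jac(X, theta), k * n)            # T_{k×n}(∂X/∂θ)
print(np.allclose(lhs, rhs, atol=1e-5))  # True
```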
Proposition 2.5. The following equations hold

1. ∂(X ⊗ Y)/∂θ = ∂X/∂θ ⊗ Y + X ⊗^1_k ∂Y/∂θ

2. ∂(X ⊗^{m1,...,ms}_{n1,...,ns} Y)/∂θ = ∂X/∂θ ⊗^{k,m1,...,ms}_{1,n1,...,ns} Y + X ⊗^{1,m1,...,ms}_{k,n1,...,ns} ∂Y/∂θ
Proof. We have

∂(X ⊗^{m1,...,ms}_{n1,...,ns} Y)/∂θ
  = [∂(X ⊗^{m1,...,ms}_{n1,...,ns} Y)/∂θ1 · · · ∂(X ⊗^{m1,...,ms}_{n1,...,ns} Y)/∂θk]
  = [∂X/∂θ1 ⊗^{m1,...,ms}_{n1,...,ns} Y · · · ∂X/∂θk ⊗^{m1,...,ms}_{n1,...,ns} Y]
    + [X ⊗^{m1,...,ms}_{n1,...,ns} ∂Y/∂θ1 · · · X ⊗^{m1,...,ms}_{n1,...,ns} ∂Y/∂θk]
  = ∂X/∂θ ⊗^{k,m1,...,ms}_{1,n1,...,ns} Y + X ⊗^{1,m1,...,ms}_{k,n1,...,ns} ∂Y/∂θ

Since X ⊗ Y = X ⊗^1_1 Y, in the case of the standard Kronecker product we obtain

∂(X ⊗ Y)/∂θ = ∂X/∂θ ⊗^k_1 Y + X ⊗^1_k ∂Y/∂θ = ∂X/∂θ ⊗ Y + X ⊗^1_k ∂Y/∂θ
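In the standard-product case, the product X ⊗^1_k Z concatenates X ⊗ Zi over the k column blocks Zi of Z (this reading follows the partition recursions used in the proofs of Section 5). A hedged numerical check of the rule, with made-up test matrices:

```python
import numpy as np

def kron_1k(X, Z, k):
    # X ⊗^1_k Z: split Z into k column blocks Z_i and concatenate X ⊗ Z_i.
    return np.hstack([np.kron(X, Zi) for Zi in np.hsplit(Z, k)])

def num_jac(F, theta, h=1e-6):
    # Flattened derivative [dF/dθ1 ... dF/dθk] via central differences.
    return np.hstack([(F(theta + h * e) - F(theta - h * e)) / (2 * h)
                      for e in np.eye(len(theta))])

X = lambda t: np.array([[t[0], 1.0], [t[1], t[0] * t[1]]])
Y = lambda t: np.array([[np.cos(t[1]), t[0]], [t[0] ** 2, 1.0]])
theta = np.array([0.5, 0.9])
k = len(theta)

lhs = num_jac(lambda t: np.kron(X(t), Y(t)), theta)
# The ∂X/∂θ term needs no special product: the Kronecker product distributes
# over the column blocks of its LEFT factor, so plain kron already yields
# [∂X/∂θ1 ⊗ Y ... ∂X/∂θk ⊗ Y].  The ∂Y/∂θ term is where ⊗^1_k is needed.
rhs = np.kron(num_jac(X, theta), Y(theta)) + kron_1k(X(theta), num_jac(Y, theta), k)
print(np.allclose(lhs, rhs, atol=1e-5))  # True
```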
∂(αX)/∂θ = α × ∂X/∂θ + ∂α/∂θ ⊗ X
Proof. The expression αX can be represented as αX = (α ⊗ Ip) × X, where Ip
is a p × p dimensional identity matrix. Hence

∂(αX)/∂θ = ∂((α ⊗ Ip) × X)/∂θ = ∂(α ⊗ Ip)/∂θ × (Ik ⊗ X) + (α ⊗ Ip) × ∂X/∂θ
         = (∂α/∂θ ⊗ Ip) × (Ik ⊗ X) + α × ∂X/∂θ = ∂α/∂θ ⊗ X + α × ∂X/∂θ
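A quick finite-difference check of this scalar-multiple rule; α(θ) and X(θ) below are arbitrary smooth test functions chosen for illustration only.

```python
import numpy as np

def num_jac(F, theta, h=1e-6):
    # Flattened derivative [dF/dθ1 ... dF/dθk] via central differences.
    return np.hstack([(F(theta + h * e) - F(theta - h * e)) / (2 * h)
                      for e in np.eye(len(theta))])

alpha = lambda t: np.array([[t[0] * t[1] + 1.0]])   # scalar, kept as a 1 × 1 matrix
X = lambda t: np.array([[t[0], np.sin(t[1])], [1.0, t[0] ** 2]])
theta = np.array([0.3, -1.1])

lhs = num_jac(lambda t: alpha(t)[0, 0] * X(t), theta)
# α × ∂X/∂θ + ∂α/∂θ ⊗ X, with ∂α/∂θ the 1 × k row of partial derivatives.
rhs = alpha(theta)[0, 0] * num_jac(X, theta) + np.kron(num_jac(alpha, theta), X(theta))
print(np.allclose(lhs, rhs, atol=1e-5))  # True
```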
Let ext(S) be the set of functions obtained by applying elementary matrix
operations to the set S, i.e. ext(S) is the smallest set such that if X, Y ∈
ext(S), then the matrix valued functions X + Y, X × Y, Tn(X), and
X ⊗^{m1,...,ms}_{n1,...,ns} Y, if they exist, belong to ext(S).
Proof. We have

∂ det(X)/∂θ = [∂ det(X)/∂θ1 . . . ∂ det(X)/∂θk]
            = [det(X) tr(X⁻¹ × ∂X/∂θ1) . . . det(X) × tr(X⁻¹ × ∂X/∂θk)]
            = det(X) × trk(X⁻¹ × ∂X/∂θ)

∂ trn(X)/∂θ = [∂ trn(X)/∂θ1 . . . ∂ trn(X)/∂θk] = [trn(∂X/∂θ1) . . . trn(∂X/∂θk)]
            = trk×n(∂X/∂θ)
Similarly

∂X⁻¹/∂θ = [∂X⁻¹/∂θ1 . . . ∂X⁻¹/∂θk] = −[X⁻¹ (∂X/∂θ1) X⁻¹ . . . X⁻¹ (∂X/∂θk) X⁻¹]
        = −X⁻¹ × [∂X/∂θ1 . . . ∂X/∂θk] × (Ik ⊗ X⁻¹) = −X⁻¹ × ∂X/∂θ × (Ik ⊗ X⁻¹)
since in the case of a scalar parameter θ ∈ R, ∂ det(X)/∂θ = det(X) tr(X⁻¹ ∂X/∂θ),
∂ tr(X)/∂θ = tr(∂X/∂θ), and ∂X⁻¹/∂θ = −X⁻¹ (∂X/∂θ) X⁻¹ (see for exam-
ple Petersen and Pedersen (2006)).
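All three flattened identities are easy to confirm numerically; tr_n below is a direct implementation of the generalized trace, and X(θ) is a made-up invertible parametrization used only for the check.

```python
import numpy as np

def tr_n(Z, n):
    # Generalized trace: the 1 × n row of traces of the n column blocks of Z.
    return np.array([[np.trace(B) for B in np.hsplit(Z, n)]])

def num_jac(F, theta, h=1e-6):
    # Flattened derivative [dF/dθ1 ... dF/dθk] via central differences.
    return np.hstack([(F(theta + h * e) - F(theta - h * e)) / (2 * h)
                      for e in np.eye(len(theta))])

X = lambda t: np.array([[2.0 + t[0], t[1]], [t[0] * t[1], 3.0 + t[1]]])
theta = np.array([0.5, -0.2])
k = len(theta)

Xv, dX = X(theta), num_jac(X, theta)
Xi = np.linalg.inv(Xv)

d_det = num_jac(lambda t: np.array([[np.linalg.det(X(t))]]), theta)
d_tr = num_jac(lambda t: np.array([[np.trace(X(t))]]), theta)
d_inv = num_jac(lambda t: np.linalg.inv(X(t)), theta)

print(np.allclose(d_det, np.linalg.det(Xv) * tr_n(Xi @ dX, k), atol=1e-4))  # True
print(np.allclose(d_tr, tr_n(dX, k), atol=1e-4))                            # True
print(np.allclose(d_inv, -Xi @ dX @ np.kron(np.eye(k), Xi), atol=1e-4))     # True
```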
Let the set S and the operation dif be defined as in the previous section.
Let ext2(S) be the set of functions obtained by applying elementary matrix
operations and the matrix determinant, trace and inverse to the set S, i.e. ext2(S)
is the smallest set such that if X, Y ∈ ext2(S), then the matrix valued functions
X + Y, X × Y, Tn(X), X ⊗^{m1,...,ms}_{n1,...,ns} Y, det(X), trn(X), and X⁻¹,
if they exist, belong to ext2(S), where n, n1, . . . , ns, m1, . . . , ms are any positive integers.
Theorem 3.3. dif(ext2 (S)) = ext2 (S ∪ dif(S)).
Proof. By induction, using Propositions 2.1, 2.3, 2.5, 2.6, and 3.2.
Proof. Let

f(x) = [ f11(x) . . . f1n(x)
          ...   . . .  ...
         fm1(x) . . . fmn(x) ]

be an m × n matrix valued function. Finally

∂f(x)/∂θ = ∂f(x)/∂x × [∂x/∂θ1 ⊗ In . . . ∂x/∂θk ⊗ In] = ∂f(x)/∂x × (∂x/∂θ ⊗ In)
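With ∂f/∂x flattened in the same column-blocked way as every other derivative in the paper, this chain rule can be checked numerically. The functions f and x(θ) below are made up for illustration:

```python
import numpy as np

def num_jac(F, theta, h=1e-6):
    # Flattened derivative [dF/dθ1 ... dF/dθk] via central differences.
    return np.hstack([(F(theta + h * e) - F(theta - h * e)) / (2 * h)
                      for e in np.eye(len(theta))])

n = 2   # number of columns of f(x)

f = lambda x: np.array([[x[0] * x[1], x[2]], [np.sin(x[0]), x[1] ** 2]])  # 2 × 2
x = lambda t: np.array([t[0] + t[1], t[0] * t[1], t[1] ** 2])             # R^2 → R^3
theta = np.array([0.3, 0.8])

df_dx = num_jac(f, x(theta))                           # ∂f/∂x, 2 × (3·n)
dx_dt = num_jac(lambda t: x(t).reshape(-1, 1), theta)  # ∂x/∂θ, 3 × 2

lhs = num_jac(lambda t: f(x(t)), theta)
rhs = df_dx @ np.kron(dx_dt, np.eye(n))
print(np.allclose(lhs, rhs, atol=1e-4))  # True
```

Column block j of ∂x/∂θ ⊗ In stacks ∂xi/∂θj · In over i, so the product reproduces the sum Σi (∂f/∂xi)(∂xi/∂θj) in each block.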
3. A ⊗^{...,1,1,...}_{...,nk,nk+1,...} B = A ⊗^{...,1,...}_{...,nk×nk+1,...} B.
Proposition 5.2. For any matrices A, B, C

1. A ⊗^{m1,...,mk}_{n1,...,nk} (B + C) = A ⊗^{m1,...,mk}_{n1,...,nk} B + A ⊗^{m1,...,mk}_{n1,...,nk} C,

assuming that the Kronecker products exist and matrix dimensions coincide.
Proposition 5.3. For any matrices A, B, C, D

(AB) ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} (CD) = (A ⊗ C) × (B ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} D)

Similarly

(AB) ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} (CD)
  = [(AB1) ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} (CD) . . . , (ABmk+1) ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} (CD)]
  = [(A ⊗ C)(B1 ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} D) . . . , (A ⊗ C)(Bmk+1 ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} D)]
  = (A ⊗ C) × (B ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} D)
A ⊗^{m1,...,ms}_{n1,...,ns} B = (A ⊗ B) × (Iq1 ⊗^{m1,...,ms}_{n1,...,ns} Iq2)

A ⊗^{m1,...,ms}_{n1,...,ns} (B ⊗ C) = (A ⊗^{m1,...,ms}_{n1,...,ns} B) ⊗ C
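Both identities can be exercised with a small recursive implementation of the generalized product, built from the partition recursions used in these proofs. This is a sketch: we read the leftmost index pair as the outermost column partition, and we read q1 and q2 as the column counts of A and B (their definition is not restated in this excerpt).

```python
import numpy as np

def gkron(A, B, ms, ns):
    # A ⊗^{m1,...,ms}_{n1,...,ns} B: split A into ms[0] column blocks, B into
    # ns[0] column blocks, and recurse; the empty index list is the plain ⊗.
    if not ms:
        return np.kron(A, B)
    return np.hstack([gkron(Ai, Bj, ms[1:], ns[1:])
                      for Ai in np.hsplit(A, ms[0])
                      for Bj in np.hsplit(B, ns[0])])

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 4))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((2, 2))
ms, ns = [2, 1], [2, 2]

# A ⊗^{m...}_{n...} B = (A ⊗ B) × (I_{q1} ⊗^{m...}_{n...} I_{q2}),
# i.e. the generalized product is a column reordering of the standard one.
q1, q2 = A.shape[1], B.shape[1]
sel = gkron(np.eye(q1), np.eye(q2), ms, ns)
print(np.allclose(gkron(A, B, ms, ns), np.kron(A, B) @ sel))   # True

# A ⊗^{m...}_{n...} (B ⊗ C) = (A ⊗^{m...}_{n...} B) ⊗ C
print(np.allclose(gkron(A, np.kron(B, C), ms, ns),
                  np.kron(gkron(A, B, ms, ns), C)))            # True
```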
Proof. Observe that X ⊗^{m1,...,ms}_{n1,...,ns} Y = X ⊗^{m1,...,ms,1}_{n1,...,ns,1} Y, and A ⊗^1_1 (B ⊗ C) =
(A ⊗^1_1 B) ⊗ C, since A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C. Let A ⊗^{mk,...,m1,1}_{nk,...,n1,1} (B ⊗ C) =
(A ⊗^{mk,...,m1,1}_{nk,...,n1,1} B) ⊗ C for some k ≥ 0. Then

A ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} (B ⊗ C)
  = [A ⊗^{mk,...,m1,1}_{nk,...,n1,1} (B1 ⊗ C) . . . , A ⊗^{mk,...,m1,1}_{nk,...,n1,1} (Bnk+1 ⊗ C)]
  = [(A ⊗^{mk,...,m1,1}_{nk,...,n1,1} B1) ⊗ C . . . , (A ⊗^{mk,...,m1,1}_{nk,...,n1,1} Bnk+1) ⊗ C]
  = (A ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} B) ⊗ C
Similarly

A ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} (B ⊗ C)
  = [A1 ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} (B ⊗ C) . . . , Amk+1 ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} (B ⊗ C)]
  = [(A1 ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} B) ⊗ C . . . , (Amk+1 ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} B) ⊗ C]
  = (A ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} B) ⊗ C
Ai ⊗ (B ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C)
  = [Ai ⊗ (B ⊗^{mk,...,m1,1}_{nk,...,n1,1} C1) . . . , Ai ⊗ (B ⊗^{mk,...,m1,1}_{nk,...,n1,1} Cnk+1)]
  = [(Ai ⊗ B) ⊗^{mk,...,m1,1}_{nk,...,n1,1} C1 . . . , (Ai ⊗ B) ⊗^{mk,...,m1,1}_{nk,...,n1,1} Cnk+1]
  = (Ai ⊗ B) ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C

Similarly

Ai ⊗ (B ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C)
  = [Ai ⊗ (B1 ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C) . . . , Ai ⊗ (Bmk+1 ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C)]
  = [(Ai ⊗ B1) ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C . . . , (Ai ⊗ Bmk+1) ⊗^{1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C]
  = (Ai ⊗ B) ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C
Finally,

A ⊗ (B ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C)
  = [A1 ⊗ (B ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C) . . . , Aq ⊗ (B ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C)]
  = [(A1 ⊗ B) ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C . . . , (Aq ⊗ B) ⊗^{mk+1,mk,...,m1,1}_{nk+1,nk,...,n1,1} C]
  = (A ⊗ B) ⊗^{q,mk+1,mk,...,m1,1}_{1,nk+1,nk,...,n1,1} C
Proof. Let A^i be the i-th column of A and B^j the j-th column of B. Let Ip^k denote
the k-th column of the p × p identity matrix and let B^i_j denote the element of B in the j-th
row and i-th column. Then

(Im ⊗^1_p Ip) × (B^j ⊗ A^i) = [Im ⊗ Ip^1 . . . Im ⊗ Ip^p] × [B^j_1 A^i; . . . ; B^j_p A^i]
  = Σ_{r=1..p} (Im ⊗ Ip^r) × (A^i ⊗ B^j_r) = Σ_{r=1..p} A^i ⊗ (Ip^r × B^j_r) = A^i ⊗ B^j
Further

(Im ⊗^1_p Ip) × (B^j ⊗ A) = (Im ⊗^1_p Ip) × [B^j ⊗ A^1 . . . B^j ⊗ A^n]
  = [A^1 ⊗ B^j . . . A^n ⊗ B^j] = A ⊗ B^j

(Im ⊗^1_p Ip) × (B ⊗ A) = (Im ⊗^1_p Ip) × [B^1 ⊗ A . . . B^q ⊗ A]
  = [A ⊗ B^1 . . . A ⊗ B^q] = A ⊗^1_q B
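The matrix Im ⊗^1_p Ip thus acts as a permutation that swaps the roles of the two Kronecker factors. A numerical sketch of (Im ⊗^1_p Ip) × (B ⊗ A) = A ⊗^1_q B, with made-up A and B:

```python
import numpy as np

def kron_1k(X, Z, k):
    # X ⊗^1_k Z: split Z into k column blocks Z_i and concatenate X ⊗ Z_i.
    return np.hstack([np.kron(X, Zi) for Zi in np.hsplit(Z, k)])

rng = np.random.default_rng(3)
m, n, p, q = 2, 3, 4, 2
A = rng.standard_normal((m, n))   # m × n
B = rng.standard_normal((p, q))   # p × q

P = kron_1k(np.eye(m), np.eye(p), p)   # I_m ⊗^1_p I_p, an mp × mp permutation
print(np.allclose(P @ np.kron(B, A), kron_1k(A, B, q)))  # True
```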
Proof. The proposition holds for s = 0. Let, for any s ≥ 0,

A ⊗^{1,ns,...,1,n1}_{ms,1,...,m1,1,q/m̄s} B = (Im ⊗^1_p Ip) × (B ⊗^{ms,...,m1}_{ns,...,n1} A),

where m̄s = m1 × · · · × ms. Then
and
Proposition 5.9.
Proof. Observe that (Im ⊗^1_q Iq) is an orthogonal matrix, since it can
be obtained by permuting the columns of the matrix Imq. Hence (Im ⊗^1_q Iq)⁻¹ =
(Im ⊗^1_q Iq)ᵀ. Further

(Im ⊗^1_q Iq)ᵀ = [Im ⊗ (Iq^1)ᵀ; . . . ; Im ⊗ (Iq^q)ᵀ] = [(Iq^1)ᵀ ⊗^1_m Im; . . . ; (Iq^q)ᵀ ⊗^1_m Im] = Iq ⊗^1_m Im

where the semicolons separate stacked row blocks. The second equality can be shown using, for example, Proposition 5.7.
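Both the orthogonality and the transpose identity (Im ⊗^1_q Iq)ᵀ = Iq ⊗^1_m Im can be confirmed directly:

```python
import numpy as np

def kron_1k(X, Z, k):
    # X ⊗^1_k Z: split Z into k column blocks Z_i and concatenate X ⊗ Z_i.
    return np.hstack([np.kron(X, Zi) for Zi in np.hsplit(Z, k)])

m, q = 3, 4
P = kron_1k(np.eye(m), np.eye(q), q)   # I_m ⊗^1_q I_q

print(np.allclose(P @ P.T, np.eye(m * q)))                 # orthogonal: True
print(np.allclose(P.T, kron_1k(np.eye(q), np.eye(m), m)))  # = I_q ⊗^1_m I_m: True
```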
Proposition 5.10.
Proof. Observe that (Im ⊗^1_n Iq) is an orthogonal matrix, since it can
be obtained by permuting the columns of the matrix Imq. Hence (Im ⊗^1_n Iq)⁻¹ =
(Im ⊗^1_n Iq)ᵀ. Further

(In ⊗ (I_{q/n} ⊗^1_m Im)) × (Im ⊗^1_n Iq)ᵀ
  = [(I_{q/n} ⊗^1_m Im) × (Im ⊗ (Iq^1)ᵀ); . . . ; (I_{q/n} ⊗^1_m Im) × (Im ⊗ (Iq^n)ᵀ)]
  = [(Iq^1)ᵀ ⊗^1_m Im; . . . ; (Iq^n)ᵀ ⊗^1_m Im] = Iq ⊗^1_m Im

The second equality can be shown using, for example, Proposition 5.7. Hence

(Im ⊗^1_n Iq)⁻¹ = (In ⊗ (I_{q/n} ⊗^1_m Im)⁻¹) × (Iq ⊗^1_m Im)
  = (In ⊗ (Im ⊗^1_{q/n} I_{q/n})) × (Iq ⊗^1_m Im)
  = (I_{nm} ⊗^n_{q/n} I_{q/n}) × (Iq ⊗^1_m Im)

Further

(Im ⊗^k_n Iq)⁻¹ = ((Ik ⊗ I_{m/k}) ⊗^{k,1}_{1,n} Iq)⁻¹ = (Ik ⊗ (I_{m/k} ⊗^1_n Iq))⁻¹
  = Ik ⊗ ((I_{nm/k} ⊗^n_{q/n} I_{q/n}) × (Iq ⊗^1_{m/k} I_{m/k}))
  = (Ik ⊗ (I_{nm/k} ⊗^n_{q/n} I_{q/n})) × (Ik ⊗ (Iq ⊗^1_{m/k} I_{m/k}))
  = (I_{nm} ⊗^{nk}_{q/n} I_{q/n}) × (I_{kq} ⊗^k_{m/k} I_{m/k})
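The closed form for (Im ⊗^k_n Iq)⁻¹ can be verified numerically with a recursive implementation of the generalized product. This is a sketch: gkron reads the leftmost index pair as the outermost column partition, which is how we read the recursions used in the proofs above.

```python
import numpy as np

def gkron(A, B, ms, ns):
    # A ⊗^{m1,...}_{n1,...} B via the column-partition recursion; [] is plain ⊗.
    if not ms:
        return np.kron(A, B)
    return np.hstack([gkron(Ai, Bj, ms[1:], ns[1:])
                      for Ai in np.hsplit(A, ms[0])
                      for Bj in np.hsplit(B, ns[0])])

m, k, n, q = 4, 2, 3, 6   # k divides m, n divides q
P = gkron(np.eye(m), np.eye(q), [k], [n])                 # I_m ⊗^k_n I_q
rhs = gkron(np.eye(n * m), np.eye(q // n), [n * k], [q // n]) \
      @ gkron(np.eye(k * q), np.eye(m // k), [k], [m // k])
print(np.allclose(np.linalg.inv(P), rhs))  # True
```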
6 Concluding remarks
The derived formulas require matrix tensor products, which are absent when
derivatives are represented as a concatenation of derivatives with respect to
scalar parameters. Hence, this approach may decrease numerical efficiency.
This problem, however, can be resolved using appropriate data structures.
References
[1] T. P. Minka. Old and new matrix algebra useful for statistics. Notes,
December 2000.
[2] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. Notes, 2006.