Matrix Prop
John Duchi
Contents
1 Notation
2 Matrix multiplication
3 Gradient of linear function
4 Derivative in a trace
5 Derivative of product in trace
6 Derivative of function of a matrix
7 Derivative of linear transformed input to function
8 Funky trace derivative
9 Symmetric Matrices and Eigenvectors
1 Notation

A few things on notation (which may not be very consistent, actually): the columns of a matrix $A \in \mathbb{R}^{m \times n}$ are $a_1$ through $a_n$, while the rows are given (as vectors) by $\tilde{a}_1^T$ through $\tilde{a}_m^T$.
2 Matrix multiplication

We begin with a simple claim:

    AA^T = \sum_{i=1}^n a_i a_i^T,

that is, the product $AA^T$ is the sum of the outer products of the columns of $A$. To see this, consider that

    (AA^T)_{ij} = \sum_{p=1}^n a_{pi} a_{pj}

because the $i,j$ element is the $i$th row of $A$, which is the vector $(a_{1i}, a_{2i}, \ldots, a_{ni})$, dotted with the $j$th column of $A^T$, which is $(a_{1j}, \ldots, a_{nj})$ (here $a_{pi}$ denotes the $i$th entry of the column $a_p$). If we look at the full matrix $AA^T$, we see that

    AA^T = \begin{pmatrix}
        \sum_{p=1}^n a_{p1} a_{p1} & \cdots & \sum_{p=1}^n a_{p1} a_{pn} \\
        \vdots & \ddots & \vdots \\
        \sum_{p=1}^n a_{pn} a_{p1} & \cdots & \sum_{p=1}^n a_{pn} a_{pn}
    \end{pmatrix}
    = \sum_{i=1}^n a_i a_i^T.
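The outer-product claim above is easy to sanity-check numerically. The following snippet (a numpy sketch, not part of the original notes) verifies it for a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # A in R^{m x n}; columns are a_1, ..., a_n

# Sum of outer products of the columns of A
outer_sum = sum(np.outer(A[:, i], A[:, i]) for i in range(A.shape[1]))

# Agrees with the matrix product A A^T
assert np.allclose(A @ A.T, outer_sum)
```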
3 Gradient of linear function

Now let us consider $x^T A x$ for $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$. Writing the rows of $A$ as $a_1^T, \ldots, a_n^T$, we have

    x^T A x = x^T [a_1^T x \;\; a_2^T x \;\; \cdots \;\; a_n^T x]^T
            = x_1 a_1^T x + x_2 a_2^T x + \cdots + x_n a_n^T x.

If we take the derivative with respect to one of the $x_l$'s, we have the $l$th component of each $a_i$, which is to say $a_{il}$, and additionally the term for $x_l a_l^T x$, which gives us

    \frac{\partial}{\partial x_l} x^T A x = \sum_{i=1}^n x_i a_{il} + a_l^T x = (A^T x)_l + (A x)_l.

In the end, we see that

    \nabla_x \, x^T A x = A x + A^T x.
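As a numerical sanity check (a numpy sketch, not part of the original notes), we can compare a central-difference gradient of $x^T A x$ against $Ax + A^T x$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda v: v @ A @ v  # f(x) = x^T A x

# Central-difference approximation of the gradient
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])

assert np.allclose(num_grad, A @ x + A.T @ x, atol=1e-5)
```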
4 Derivative in a trace

Recall (as in Old and New Matrix Algebra Useful for Statistics) that we can define the differential of a function $f(x)$ to be the part of $f(x + dx) - f(x)$ that is linear in $dx$, i.e. a constant times $dx$. Then, for example, for a vector-valued function $f$, we can have

    f(x + dx) = f(x) + f'(x) \, dx + \text{(higher order terms)}.

In the above, $f'$ is the derivative (or Jacobian). Note that the gradient is the transpose of the Jacobian.

Consider an arbitrary matrix $A$. Writing the rows of $A$ as $a_1^T, \ldots, a_n^T$ and the columns of $dX$ as $dx_1, \ldots, dx_n$, we see that

    \mathrm{tr}(A \, dX) = \mathrm{tr} \begin{pmatrix}
        a_1^T dx_1 & & \\
        & \ddots & \\
        & & a_n^T dx_n
    \end{pmatrix} = \sum_{i=1}^n a_i^T dx_i

(only the diagonal of $A \, dX$ matters for the trace). The coefficient of $(dX)_{ij}$ in this sum is $a_{ji}$, so that

    \frac{\partial \, \mathrm{tr}(A \, dX)}{\partial \, dX} = A^T.
5 Derivative of product in trace

In this section we prove that

    \nabla_A \, \mathrm{tr} \, AB = B^T.

Writing the rows of $A \in \mathbb{R}^{n \times m}$ as $a_1^T, \ldots, a_n^T$ and the columns of $B \in \mathbb{R}^{m \times n}$ as $b_1, \ldots, b_n$, we have

    \mathrm{tr} \, AB
    = \mathrm{tr} \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{pmatrix}
      \begin{pmatrix} b_1 & b_2 & \cdots & b_n \end{pmatrix}
    = \mathrm{tr} \begin{pmatrix}
        a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_n \\
        a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_n \\
        \vdots & \vdots & \ddots & \vdots \\
        a_n^T b_1 & a_n^T b_2 & \cdots & a_n^T b_n
    \end{pmatrix}
    = \sum_{i=1}^m a_{1i} b_{i1} + \sum_{i=1}^m a_{2i} b_{i2} + \cdots + \sum_{i=1}^m a_{ni} b_{in}.

Thus

    \frac{\partial \, \mathrm{tr} \, AB}{\partial a_{ij}} = b_{ji},
    \qquad \text{so} \qquad
    \nabla_A \, \mathrm{tr} \, AB = B^T.
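The identity $\nabla_A \, \mathrm{tr} \, AB = B^T$ can likewise be verified entrywise by finite differences (a numpy sketch, not from the original notes):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

f = lambda M: np.trace(M @ B)  # f(A) = tr(AB)

# Entrywise central differences for the gradient with respect to A
eps = 1e-6
num_grad = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        num_grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

assert np.allclose(num_grad, B.T, atol=1e-5)
```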
6 Derivative of function of a matrix

Here we note that the gradient of a function $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ with respect to $A^T$ is the transpose of its gradient with respect to $A$:

    \nabla_{A^T} f(A) = \begin{pmatrix}
        \frac{\partial f(A)}{\partial A_{11}} & \cdots & \frac{\partial f(A)}{\partial A_{n1}} \\
        \vdots & \ddots & \vdots \\
        \frac{\partial f(A)}{\partial A_{1n}} & \cdots & \frac{\partial f(A)}{\partial A_{nn}}
    \end{pmatrix}
    = \left( \nabla_A f(A) \right)^T.
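To illustrate the transpose relation, take the test function $f(A) = x^T A y$ (my choice for illustration, not from the notes), whose gradient with respect to $A$ is $x y^T$. A finite-difference check in numpy:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
x, y = rng.standard_normal(n), rng.standard_normal(n)

f = lambda M: x @ M @ y  # f(A) = x^T A y, so grad_A f(A) = x y^T

eps = 1e-6
grad_A = np.zeros((n, n))   # entries d f / d A_ij
grad_AT = np.zeros((n, n))  # entries d f / d (A^T)_ij
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad_A[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
        # perturbing (A^T)_ij means perturbing A_ji, i.e. adding E.T to A
        grad_AT[i, j] = (f(A + E.T) - f(A - E.T)) / (2 * eps)

assert np.allclose(grad_A, np.outer(x, y), atol=1e-5)
assert np.allclose(grad_AT, grad_A.T, atol=1e-5)  # grad wrt A^T is the transpose
```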
7 Derivative of linear transformed input to function

Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. Suppose we have a matrix $A \in \mathbb{R}^{n \times m}$ and a vector $x \in \mathbb{R}^m$. We wish to compute $\nabla_x f(Ax)$. Writing the rows of $A$ as $a_k^T$, by the chain rule we have

    \frac{\partial f(Ax)}{\partial x_i}
    = \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \frac{\partial (a_k^T x)}{\partial x_i}
    = \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \, a_{ki}
    = a_i^T \nabla f(Ax),

where here $a_i$ denotes the $i$th column of $A$. As such,

    \nabla_x f(Ax) = A^T \nabla f(Ax).

Now, if we would like to get the second derivative of this function (third derivatives would require tensors, which I do not like), we have

    \frac{\partial^2 f(Ax)}{\partial x_i \, \partial x_j}
    = \frac{\partial}{\partial x_j} \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \, a_{ki}
    = \sum_{l=1}^n \sum_{k=1}^n \frac{\partial^2 f(Ax)}{\partial (Ax)_k \, \partial (Ax)_l} \, a_{ki} \, a_{lj}
    = a_i^T \, \nabla^2 f(Ax) \, a_j.

From this, it is easy to see that

    \nabla_x^2 f(Ax) = A^T \, \nabla^2 f(Ax) \, A.
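Both identities can be checked numerically. Below is a numpy sketch (the quartic test function $f(u) = \sum_k u_k^4$ is my choice, not from the notes); its gradient is $4u^3$ and its Hessian is $\mathrm{diag}(12u^2)$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 3, 5
A = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# Test function f(u) = sum_k u_k^4 with known gradient and Hessian
g = lambda v: np.sum((A @ v) ** 4)  # g(x) = f(Ax)
u = A @ x
grad_f = 4 * u ** 3                 # nabla f(Ax)
hess_f = np.diag(12 * u ** 2)       # nabla^2 f(Ax)

# Gradient check: nabla_x f(Ax) = A^T nabla f(Ax)
eps = 1e-5
num_grad = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                     for e in np.eye(m)])
assert np.allclose(num_grad, A.T @ grad_f, atol=1e-4)

# Hessian check: nabla_x^2 f(Ax) = A^T nabla^2 f(Ax) A,
# using second-order mixed central differences
h = 1e-4
I = np.eye(m)
num_hess = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        num_hess[i, j] = (g(x + h * (I[i] + I[j])) - g(x + h * (I[i] - I[j]))
                          - g(x - h * (I[i] - I[j])) + g(x - h * (I[i] + I[j]))) / (4 * h * h)
assert np.allclose(num_hess, A.T @ hess_f @ A, rtol=1e-3, atol=1e-3)
```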
9 Symmetric Matrices and Eigenvectors

In this section we prove that for a symmetric matrix $A \in \mathbb{R}^{n \times n}$, all the eigenvalues are real, and that the eigenvectors of $A$ form an orthonormal basis of $\mathbb{R}^n$.

First, we prove that the eigenvalues are real. Suppose one is complex: for an eigenvalue $\lambda$ with (possibly complex) eigenvector $x \neq 0$, we have $A \bar{x} = \bar{\lambda} \bar{x}$ since $A$ is real, so

    \bar{\lambda} \, \bar{x}^T x = (A \bar{x})^T x = \bar{x}^T A^T x = \bar{x}^T A x = \lambda \, \bar{x}^T x.

Since $\bar{x}^T x > 0$ for $x \neq 0$, this forces $\bar{\lambda} = \lambda$. Thus, all the eigenvalues are real.

Now, we suppose we have at least one eigenvector $v \neq 0$ of $A$, with eigenvalue $\lambda$. Consider the space $W$ of vectors orthogonal to $v$. We then have that, for $w \in W$,

    (Aw)^T v = w^T A^T v = w^T A v = \lambda \, w^T v = 0.

Thus, we have a set of vectors $W$ that, when transformed by $A$, are still orthogonal to $v$; so if we have an original eigenvector $v$ of $A$, restricting $A$ to $W$ and a simple inductive argument shows that there is an orthonormal set of eigenvectors.

To see that there is at least one eigenvector, consider the characteristic polynomial of $A$: $\chi_A(\lambda) = \det(A - \lambda I)$. The field $\mathbb{C}$ is algebraically closed, so there is at least one complex root $r$; hence $A - rI$ is singular and there is a vector $v \neq 0$ that is an eigenvector of $A$ with eigenvalue $r$. By the argument above, $r$ is in fact real. Thus we have the base case for our induction, and the proof is complete.
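A quick numerical illustration of the theorem (numpy, not part of the notes): for a random symmetric matrix, a general eigensolver returns real eigenvalues and orthonormal eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2  # symmetrize

# Use the general (non-symmetric) eigensolver, so realness is not assumed up front
evals, evecs = np.linalg.eig(A)

# All eigenvalues of the symmetric A are real
assert np.allclose(np.imag(evals), 0)

# The eigenvectors form an orthonormal basis: V^T V = I
V = np.real(evecs)
assert np.allclose(V.T @ V, np.eye(4), atol=1e-8)
```

(For symmetric matrices in practice one would call `np.linalg.eigh`, which exploits symmetry and guarantees an orthonormal eigenbasis.)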