
IEEE SIGNAL PROCESSING LETTERS, VOL. 21, NO. 4, APRIL 2014

Computing Jacobian and Hessian of Estimators and Their Application to Risk Approximation

Stefan Uhlich

Manuscript received December 23, 2013; revised February 02, 2014; accepted February 03, 2014. Date of publication February 05, 2014; date of current version February 21, 2014. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Richard K. Martin. The author was with Universität Stuttgart, 70550 Stuttgart, Germany (e-mail: uhlich.s@gmail.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2014.2304693

Abstract—This letter gives formulas to compute the Jacobian and Hessian of an estimator that can be written as the maximum of a given scoring function, which includes the important cases of maximum likelihood (ML) and least squares (LS) estimation. We use the knowledge about these derivatives to compute two approximations of the estimator risk and show that the linear risk approximation of an ML estimator coincides with the Cramér–Rao bound for the case of a Gaussian signal model where the underlying loss function that is used for the risk computation is the squared error loss.

Index Terms—Hessian of estimator, implicit function theorem, Jacobian of estimator, risk approximation, scoring function.

I. INTRODUCTION
Many estimators can be written as

$\hat{\theta}(y) = \arg\max_{\theta}\; s(y, \theta)$   (1)

i.e., they are the solution to a maximization¹ of a scoring function $s(y,\theta)$ which depends on the observation $y$ and the unknown parameter $\theta$. Examples are the maximum likelihood (ML) estimator with $s(y,\theta) = p(y;\theta)$ in the frequentist case where $p(y;\theta)$ is the parametrized probability density function (PDF) of the observation and the maximum a posteriori (MAP) estimator with $s(y,\theta) = p(y \mid \theta)\, p(\theta)$ in the Bayesian case where $p(\theta)$ is the a priori PDF of the parameter and $p(y \mid \theta)$ denotes the likelihood density, see [1].

¹Please note that this also includes the case of a minimization as we have $\arg\min_{\theta} s(y,\theta) = \arg\max_{\theta}\{-s(y,\theta)\}$.
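As a small, concrete illustration of (1), the following sketch realizes an ML estimator by numerically maximizing its scoring function. The Gaussian-mean model and all numbers are assumptions made for this illustration and are not taken from the letter.

# Minimal sketch of (1): an estimator defined as the maximizer of a scoring
# function s(y, theta). Here s is the Gaussian log-likelihood for an unknown
# mean theta (a toy choice for illustration, not the letter's example).
import numpy as np
from scipy.optimize import minimize_scalar

def score(theta, y, sigma2=1.0):
    # log-likelihood of i.i.d. N(theta, sigma2) observations y (up to constants)
    return -0.5 * np.sum((y - theta) ** 2) / sigma2

def estimator(y):
    # theta_hat(y) = argmax_theta s(y, theta), realized by minimizing -s
    return minimize_scalar(lambda th: -score(th, y)).x

rng = np.random.default_rng(0)
y = 2.0 + rng.standard_normal(100)     # observations with true mean 2
print(estimator(y), y.mean())          # the ML solution coincides with the sample mean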
The contribution of this letter is two-fold: First, we will study the first- and second-order derivatives, i.e., the Jacobian matrix $J_{\hat{\theta}}(y)$ and Hessian tensor $H_{\hat{\theta}}(y)$, of such estimators with respect to $y$ and give formulas to compute them in terms of the partial derivatives of $s(y,\theta)$ with respect to $y$ and $\theta$ at $\theta = \hat{\theta}(y)$. This has many applications, for example in the bagging of estimators [2], [3] and in the approximation of the estimator risk as we will show in this letter. To our knowledge, the Jacobian and Hessian have not been explicitly derived before in the literature, although they have already been used "implicitly" as in the theoretical analysis of bagging [2]. A second contribution of the letter is the approximation of the estimator risk $R(\theta)$ in the frequentist setting using the Jacobian and Hessian of the estimator. We will derive an approximation of $R(\theta)$ for an arbitrary, differentiable loss function which allows us to assess the risk in the general case of parameter estimation with non-standard² loss functions.

²"Non-standard" refers here to the case of parameter estimation with a loss function that is not the squared error loss, see [4]–[6] for such estimation problems in the Bayesian context.

The letter is organized as follows: In Sections II and III, we will derive the Jacobian matrix and Hessian tensor of an estimator that is defined as in (1). It will turn out that this can be done using the chain rule on the scoring function $s(y,\theta)$. We then use these results in Section IV to compute a Taylor series approximation of the estimator risk which results in two risk approximations (10) and (12). Section V gives an explicit example for the scalar Gaussian signal model where we calculate the gradient and Hessian as well as the two risk approximations. We will show that the linear risk approximation (10) is equivalent to the Cramér–Rao bound for this signal model. Finally, Section VI provides concluding remarks.

The following notation is used throughout this letter: $x$ denotes a column vector, $X$ a matrix where in particular $I$ is the identity matrix, and $\mathcal{X}$ an $N$-way tensor. The trace operator, column stacking operator, determinant, matrix transpose and Euclidean norm are denoted by $\operatorname{tr}\{\cdot\}$, $\operatorname{vec}\{\cdot\}$, $\det\{\cdot\}$, $(\cdot)^{T}$ and $\|\cdot\|$, respectively. Furthermore, $\circ$ denotes the vector outer product, $\otimes$ the Kronecker product, $\times_{n}$ the $n$-mode matrix product, $\bar{\times}_{n}$ the $n$-mode vector product and $X_{(n)}$ the mode-$n$ matricization of the tensor $\mathcal{X}$. Please note that we use in this letter the same notation for tensors and their operations as in [7]. Finally, $\nabla_{x}$ denotes the del operator with respect to $x$ such that $\nabla_{x} f(x)$ is the gradient of the scalar-valued function $f(x)$ and $\nabla_{x}\, g^{T}(x)$ is the transpose of the Jacobian matrix of the vector-valued function $g(x)$.
II. DERIVATION OF THE JACOBIAN

Assume that $s(y,\theta)$ in (1) is at least twice continuously differentiable. A necessary condition³ for $\hat{\theta}(y)$ to be the solution of (1) is that the so-called estimating equation $\nabla_{\theta}\, s(y, \hat{\theta}(y)) = 0$ is fulfilled (see [2]), i.e., the gradient of $s(y,\theta)$ with respect to $\theta$ is vanishing at $\theta = \hat{\theta}(y)$. Using the chain rule, it is straightforward to derive the Jacobian $J_{\hat{\theta}}(y)$ of the estimator $\hat{\theta}(y)$ with respect to $y$ as we have

$\nabla_{\theta}\nabla_{y}^{T}\, s(y,\theta)\big|_{\theta=\hat{\theta}(y)} + \nabla_{\theta}\nabla_{\theta}^{T}\, s(y,\theta)\big|_{\theta=\hat{\theta}(y)}\; J_{\hat{\theta}}(y) = 0$   (2)

for all $y$ and $\hat{\theta}(y)$, and therefore we obtain in matrix notation

$J_{\hat{\theta}}(y) = -\Big[\nabla_{\theta}\nabla_{\theta}^{T}\, s(y,\theta)\Big]^{-1} \nabla_{\theta}\nabla_{y}^{T}\, s(y,\theta)\,\Big|_{\theta=\hat{\theta}(y)}.$   (3)

Hence, we see that we can compute the Jacobian of $\hat{\theta}(y)$ from $\nabla_{\theta}\nabla_{\theta}^{T}\, s(y,\theta)$ and $\nabla_{\theta}\nabla_{y}^{T}\, s(y,\theta)$. However, we can use (3) only if

$\det\big\{\nabla_{\theta}\nabla_{\theta}^{T}\, s(y,\theta)\big|_{\theta=\hat{\theta}(y)}\big\} \neq 0.$   (4)

This fact is also known from the implicit function theorem (IFT) which shows that (4) is a sufficient condition that $\nabla_{\theta}\, s(y,\hat{\theta}(y)) = 0$ implicitly defines locally a unique and continuously differentiable estimator $\hat{\theta}(y)$ and that its derivative is given by (3), see [8] and [9, Sec. 3.6]. It is now interesting to see under which circumstances (4) might fail for our estimator (1), and we will therefore look in the following at three particular cases where it is not fulfilled:

• In the first example, it is directly obvious that $s(y,\theta)$ has a unique maximum $\hat{\theta}(y)$, but (4) is not fulfilled and hence we can not use (3) to compute the derivative of $\hat{\theta}(y)$. The reason is that (4) is only a sufficient condition but not a necessary one, i.e., from its violation, we can not conclude that there is (locally) no unique and continuously differentiable estimator $\hat{\theta}(y)$.

• In the second example, we obtain a unique solution $\hat{\theta}(y)$ which is, however, not differentiable at one value of $y$. Therefore, it is obvious that the condition (4) has to fail there due to the IFT.

• In the last case, we have the problem that $\hat{\theta}(y)$ is not unique (which is closely related to the identifiability of the parameters $\theta_{1}$ and $\theta_{2}$) as all $\theta$ which fulfill the corresponding constraint are a solution to (1), and therefore again (4) fails.

However, despite these special cases, (3) can often be successfully applied and is useful in many practical applications.

³We assume here that the parameter space is open. Otherwise, $\hat{\theta}(y)$ could also lie on the boundary of the parameter space and $\nabla_{\theta}\, s(y,\hat{\theta}(y)) = 0$ is not necessary anymore.
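A short numerical sketch of (3): for an assumed toy scoring function $s(y,\theta) = -\frac{1}{2}\|y - h(\theta)\|^{2}$ with $h(\theta) = [\theta, \theta^{2}]^{T}$ (an illustrative choice, not an example from the letter), the estimating equation is solved by Newton's method and the Jacobian from (3) is compared against a finite-difference derivative of the maximizer.

# Numerical check of (3) for a toy nonlinear least-squares score with a scalar
# parameter. The model h(theta) = [theta, theta^2] is an illustrative assumption.
import numpy as np

def h(t):   return np.array([t, t**2])
def dh(t):  return np.array([1.0, 2.0 * t])      # h'(theta)
def d2h(t): return np.array([0.0, 2.0])          # h''(theta)

def s_t(y, t):                                   # ds/dtheta of s = -0.5*||y - h(theta)||^2
    return dh(t) @ (y - h(t))

def s_tt(y, t):                                  # d^2 s / dtheta^2
    return d2h(t) @ (y - h(t)) - dh(t) @ dh(t)

def theta_hat(y, t0=1.0):
    t = t0                                       # Newton's method on the estimating equation s_t = 0
    for _ in range(50):
        t -= s_t(y, t) / s_tt(y, t)
    return t

y = np.array([1.1, 0.9])
t = theta_hat(y)

J_formula = -dh(t) / s_tt(y, t)                  # (3): d^2 s/(dtheta dy^T) equals h'(theta)^T here

eps = 1e-6                                       # finite-difference Jacobian of the maximizer
J_fd = np.array([(theta_hat(y + eps * e) - theta_hat(y - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(J_formula, J_fd)                           # the two should agree to many digits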
III. DERIVATION OF THE HESSIAN

Assuming furthermore that $s(y,\theta)$ is at least three times continuously differentiable, we can again use the chain rule to derive the Hessian. It holds

$\dfrac{\partial^{3} s}{\partial\theta_{i}\partial y_{j}\partial y_{k}} + \sum_{l}\dfrac{\partial^{3} s}{\partial\theta_{i}\partial y_{j}\partial\theta_{l}}\dfrac{\partial\hat{\theta}_{l}}{\partial y_{k}} + \sum_{l}\dfrac{\partial^{3} s}{\partial\theta_{i}\partial\theta_{l}\partial y_{k}}\dfrac{\partial\hat{\theta}_{l}}{\partial y_{j}} + \sum_{l,m}\dfrac{\partial^{3} s}{\partial\theta_{i}\partial\theta_{l}\partial\theta_{m}}\dfrac{\partial\hat{\theta}_{l}}{\partial y_{j}}\dfrac{\partial\hat{\theta}_{m}}{\partial y_{k}} + \sum_{l}\dfrac{\partial^{2} s}{\partial\theta_{i}\partial\theta_{l}}\dfrac{\partial^{2}\hat{\theta}_{l}}{\partial y_{j}\partial y_{k}} = 0$   (5)

for all $i$, $j$ and $k$, where all partial derivatives of $s$ are evaluated at $\theta = \hat{\theta}(y)$, which can be written in tensor notation as (6). Assuming again that $\nabla_{\theta}\nabla_{\theta}^{T}\, s(y,\theta)\big|_{\theta=\hat{\theta}(y)}$ is invertible, we can solve this linear equation for the Hessian tensor $H_{\hat{\theta}}(y)$ using matricization as, in general, the mode-$n$ matricization of $\mathcal{X} \times_{n} A$ is $A X_{(n)}$, see [7]. After matricization of (6) and using the Jacobian from (3), we obtain the mode-3 matricization of $H_{\hat{\theta}}(y)$ as given in (7).

For the special case that we have a scalar parameter $\theta$, (6) turns into

$\nabla_{y}\nabla_{y}^{T}\,\hat{\theta}(y) = -\dfrac{1}{s_{\theta\theta}}\Big[\nabla_{y}\nabla_{y}^{T}\, s_{\theta} + (\nabla_{y}\, s_{\theta\theta})(\nabla_{y}\hat{\theta})^{T} + (\nabla_{y}\hat{\theta})(\nabla_{y}\, s_{\theta\theta})^{T} + s_{\theta\theta\theta}\,(\nabla_{y}\hat{\theta})(\nabla_{y}\hat{\theta})^{T}\Big]$   (8)

where we used the shorthand notation $s_{\theta} = \partial s/\partial\theta$, $s_{\theta\theta} = \partial^{2} s/\partial\theta^{2}$ and $s_{\theta\theta\theta} = \partial^{3} s/\partial\theta^{3}$, all evaluated at $\theta = \hat{\theta}(y)$, and $\nabla_{y}\hat{\theta}$ denotes the gradient of $\hat{\theta}(y)$, i.e., it is the transpose of the Jacobian which reduces to a row vector in the scalar case.
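The scalar-parameter formula (8) can be checked numerically in the same spirit; the sketch below reuses the toy score $s(y,\theta) = -\frac{1}{2}\|y - h(\theta)\|^{2}$ with $h(\theta) = [\theta, \theta^{2}]^{T}$ (again an assumption made for illustration) and compares (8) against a finite-difference Hessian of the maximizer.

# Numerical check of the scalar-parameter Hessian formula (8) for the same toy
# least-squares score (illustrative model, not an example from the letter).
import numpy as np

def h(t):   return np.array([t, t**2])
def dh(t):  return np.array([1.0, 2.0 * t])
def d2h(t): return np.array([0.0, 2.0])
d3h = np.array([0.0, 0.0])                        # h'''(theta) for this model

def s_t(y, t):   return dh(t) @ (y - h(t))        # ds/dtheta
def s_tt(y, t):  return d2h(t) @ (y - h(t)) - dh(t) @ dh(t)
def s_ttt(y, t): return d3h @ (y - h(t)) - 3.0 * d2h(t) @ dh(t)

def theta_hat(y, t0=1.0):
    t = t0
    for _ in range(50):                           # Newton on the estimating equation
        t -= s_t(y, t) / s_tt(y, t)
    return t

y = np.array([1.1, 0.9])
t = theta_hat(y)

g = -dh(t) / s_tt(y, t)                           # gradient of theta_hat from (3)
grad_s_tt = d2h(t)                                # nabla_y (d^2 s/dtheta^2); here = h''(theta)
# (8): the nabla_y nabla_y^T (ds/dtheta) term vanishes because ds/dtheta is linear in y here
H_formula = -(np.outer(grad_s_tt, g) + np.outer(g, grad_s_tt)
              + s_ttt(y, t) * np.outer(g, g)) / s_tt(y, t)

eps = 1e-4                                        # finite-difference Hessian of the maximizer
E = np.eye(2)
H_fd = np.array([[(theta_hat(y + eps*(E[i]+E[j])) - theta_hat(y + eps*(E[i]-E[j]))
                   - theta_hat(y - eps*(E[i]-E[j])) + theta_hat(y - eps*(E[i]+E[j])))
                  / (4 * eps**2) for j in range(2)] for i in range(2)])
print(np.round(H_formula, 6))
print(np.round(H_fd, 6))                          # the two matrices should closely agree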
IV. APPROXIMATION OF THE ESTIMATOR RISK

Using the Jacobian and Hessian from above, we can now derive two approximations for the risk of an estimator in the frequentist case⁴. Let $\ell(\hat{\theta}, \theta)$ be the loss that is assigned to estimating $\hat{\theta}(y)$ from the observation $y$ although the true value is $\theta$. Then, the (frequentist) risk of an estimator is defined as $R(\theta) = \mathrm{E}_{y}\{\ell(\hat{\theta}(y), \theta)\}$ and a well known example is the mean squared error (MSE) which results from the squared error loss $\ell(\hat{\theta},\theta) = \|\hat{\theta} - \theta\|^{2}$, see [1]. Using these definitions, we can now state two approximations of the risk $R(\theta)$, which differ in their Taylor expansion of $\hat{\theta}(y)$.

⁴In the following, we restrict ourselves to the frequentist case due to space limitations. Please note that the used Taylor expansion can be easily extended to the Bayesian setting by taking into account that the expectation is now also with respect to the random vector $\theta$.
A. Risk Approximation using the First-Order Taylor Expansion of $\hat{\theta}(y)$

Let $y_{0}$ denote the observation that we would have about the parameter in the case of no noise and $n = y - y_{0}$ the (additive) distortion of the observation due to noise⁵. The linear approximation of the estimator at $y_{0}$ is given by $\hat{\theta}(y) \approx \hat{\theta}(y_{0}) + J_{\hat{\theta}}(y_{0})\, n$ and the Taylor expansion of the loss $\ell(\hat{\theta},\theta)$ up to the quadratic term at $\hat{\theta}(y_{0})$ is $\ell(\hat{\theta},\theta) \approx \ell(\hat{\theta}(y_{0}),\theta) + \nabla_{\hat{\theta}}\ell(\hat{\theta}(y_{0}),\theta)^{T}\big(\hat{\theta}-\hat{\theta}(y_{0})\big) + \frac{1}{2}\big(\hat{\theta}-\hat{\theta}(y_{0})\big)^{T}\nabla_{\hat{\theta}}\nabla_{\hat{\theta}}^{T}\ell(\hat{\theta}(y_{0}),\theta)\big(\hat{\theta}-\hat{\theta}(y_{0})\big)$. Combining both expansions with $R(\theta) = \mathrm{E}_{y}\{\ell(\hat{\theta}(y),\theta)\}$, we obtain the expression (9). For the special case that the PDF of $n$ is symmetric, i.e., $p(n) = p(-n)$, and therefore all odd moments are zero, we can finally write for the linear risk approximation (LRA):

$R(\theta) \approx \ell(\hat{\theta}(y_{0}),\theta) + \frac{1}{2}\operatorname{tr}\Big\{ J_{\hat{\theta}}^{T}(y_{0})\, \nabla_{\hat{\theta}}\nabla_{\hat{\theta}}^{T}\ell(\hat{\theta}(y_{0}),\theta)\, J_{\hat{\theta}}(y_{0})\, C_{n} \Big\}$   (10)

where $C_{n}$ denotes the covariance matrix of the noise. If $\ell(\hat{\theta},\theta)$ is the squared error loss, then we can directly view this approximation as the superposition of a squared-bias and a variance term. In Section V, we will show that (10) is equivalent to the Cramér–Rao bound for the special case that $\hat{\theta}(y)$ is the ML estimator (i.e., $s(y,\theta) = p(y;\theta)$), is bias-free (i.e., $\hat{\theta}(y_{0}) = \theta$) and $\ell(\hat{\theta},\theta)$ is the squared error loss.

⁵We implicitly assume here that such a $y_{0}$ exists, i.e., we know how the noiseless observation looks like. For example, in the important case of a signal model with additive noise as in (13), $y_{0}$ is $h(\theta)$. However, there are estimation problems where there is no noiseless observation $y_{0}$, for example in the variance estimation problem.
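As an illustration of (10), the following sketch compares the linear risk approximation with a Monte Carlo estimate of the risk for the squared error loss; the toy model $h(\theta) = [\theta, \theta^{2}]^{T}$ and the noise covariance are assumptions made for this example, not the letter's.

# Monte Carlo check of the linear risk approximation (10) for a toy model
# y = h(theta) + n with squared error loss and small Gaussian noise.
import numpy as np

def h(t):   return np.array([t, t**2])
def dh(t):  return np.array([1.0, 2.0 * t])
def d2h(t): return np.array([0.0, 2.0])

def s_t(y, t):  return dh(t) @ (y - h(t))
def s_tt(y, t): return d2h(t) @ (y - h(t)) - dh(t) @ dh(t)

def theta_hat(y, t0):
    t = t0
    for _ in range(30):
        t -= s_t(y, t) / s_tt(y, t)
    return t

theta_true = 0.8
y0 = h(theta_true)                                # noise-free observation
C_n = 0.005 * np.array([[1.0, 0.3], [0.3, 1.0]])  # assumed noise covariance

# Linear risk approximation (10): loss at y0 plus 0.5 * tr(J^T Hl J C_n);
# here theta_hat(y0) = theta_true, so the loss term at y0 is zero.
J = -dh(theta_true) / s_tt(y0, theta_true)        # Jacobian (3) at the noise-free point
Hl = 2.0                                          # Hessian of the squared error loss
lra = 0.0 + 0.5 * Hl * (J @ C_n @ J)

# Monte Carlo estimate of the true risk E[(theta_hat(y) - theta)^2]
rng = np.random.default_rng(1)
noise = rng.multivariate_normal(np.zeros(2), C_n, size=10_000)
mc = np.mean([(theta_hat(y0 + n, theta_true) - theta_true) ** 2 for n in noise])
print(lra, mc)                                    # close for small noise levels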

B. Risk Approximation using the Second-Order Taylor Expansion of $\hat{\theta}(y)$

If we expand the estimator up to its quadratic term in $n$, i.e., including the term formed with the Hessian tensor $H_{\hat{\theta}}(y_{0})$ from (7), we obtain the expression (11). Considering again the special case that $p(n)$ is symmetric, we finally obtain the quadratic risk approximation (QRA) given in (12), where we used the shorthand notation $C_{n}$ and $M_{n}$ to denote the second-order (i.e., covariance) and fourth-order moment matrices of the noise.
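For a scalar parameter, the quadratic expansion of the estimator used in this subsection can be formed directly from (3) and (8); the sketch below (the same illustrative toy model as before, an assumption for this example) compares it with the exact maximizer for a small perturbation of the noise-free observation.

# Sketch of the second-order Taylor expansion of the estimator around a
# noise-free observation, for a scalar parameter and the toy model h(theta) =
# [theta, theta^2] (an illustrative assumption, not the letter's example).
import numpy as np

def h(t):   return np.array([t, t**2])
def dh(t):  return np.array([1.0, 2.0 * t])
def d2h(t): return np.array([0.0, 2.0])

def s_t(y, t):   return dh(t) @ (y - h(t))
def s_tt(y, t):  return d2h(t) @ (y - h(t)) - dh(t) @ dh(t)
def s_ttt(y, t): return -3.0 * d2h(t) @ dh(t)     # h''' = 0 for this model

def theta_hat(y, t0):
    t = t0
    for _ in range(50):
        t -= s_t(y, t) / s_tt(y, t)
    return t

theta0 = 0.8
y0 = h(theta0)
g = -dh(theta0) / s_tt(y0, theta0)                # gradient from (3)
# Hessian from (8); the nabla_y nabla_y^T (ds/dtheta) term of (8) vanishes here
H = -(np.outer(d2h(theta0), g) + np.outer(g, d2h(theta0))
      + s_ttt(y0, theta0) * np.outer(g, g)) / s_tt(y0, theta0)

n = np.array([0.05, -0.03])                       # a small perturbation
quadratic = theta0 + g @ n + 0.5 * n @ H @ n      # second-order expansion of theta_hat
print(theta_hat(y0 + n, theta0), quadratic)       # close for small n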
V. ML ESTIMATOR IN A GAUSSIAN SIGNAL MODEL

In the following, we will now consider the case of a Gaussian signal model, i.e., the observation is given by

$y = h(\theta) + n$   (13)

where $n$ is Gaussian with zero-mean and covariance matrix $C_{n}$ and the unknown parameter that we want to estimate is a scalar, i.e., $\theta \in \mathbb{R}$. This model is widely used in signal processing as many problems can be recast into its form. The log-likelihood function, which is the scoring function for (13), is $s(y,\theta) = \ln p(y;\theta) = -\frac{1}{2}\big(y - h(\theta)\big)^{T} C_{n}^{-1} \big(y - h(\theta)\big) + \text{const.}$ and the corresponding ML estimator is given by

$\hat{\theta}_{\mathrm{ML}}(y) = \arg\min_{\theta}\; \big(y - h(\theta)\big)^{T} C_{n}^{-1} \big(y - h(\theta)\big).$   (14)

The required derivatives to compute the Jacobian (3) and Hessian (8) can be computed straightforwardly for our scoring function in (14) and we obtain the expressions (15a) and (15b). In the following, we will now derive the risk approximations (10) and (12) for our Gaussian signal model (13). We will assume that the estimator has no bias and therefore $\hat{\theta}(y_{0}) = \theta$. Using the gradient (15a) and Hessian (15b), we have the expressions (16a) and (16b) after introducing suitable shorthand notation. The linear MSE approximation follows from (10) and is

$\Big[h'(\theta)^{T}\, C_{n}^{-1}\, h'(\theta)\Big]^{-1}$, with $h'(\theta) = \partial h(\theta)/\partial\theta$,   (17)
as the MSE is the special case of the risk $R(\theta)$ with the squared error loss $\ell(\hat{\theta},\theta) = (\hat{\theta}-\theta)^{2}$ and therefore we have $\nabla_{\hat{\theta}}\nabla_{\hat{\theta}}^{T}\ell(\hat{\theta},\theta) = 2$. Comparing (17) to the CRB for this signal model, we immediately see that both are identical as the CRB is given by $\big[h'(\theta)^{T} C_{n}^{-1} h'(\theta)\big]^{-1}$, see [1]. Due to this interesting equivalence, we can expect that (10) is also a good approximation of the estimator risk for other loss functions. Please note that this equivalence is also true for the case that $\theta$ is a parameter vector. This can be shown by the same reasoning as we used for the scalar case here and it turns out that our risk approximation yields the trace of the Cramér–Rao bound for a vector parameter.
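The equivalence of (17) and the CRB can also be verified numerically; in the sketch below, the cosine signal model, the noise covariance and all numbers are assumptions made for this illustration.

# Numerical illustration of the Section V equivalence: for the Gaussian model
# y = h(theta) + n with squared error loss, the linear risk approximation (10)
# of the ML estimator reproduces the Cramer-Rao bound.
import numpy as np

N = 32
k = np.arange(N)
omega = 0.9                                  # assumed "true" angular frequency
h_prime = -k * np.sin(omega * k)             # derivative of h(omega) = cos(omega * k)

rng = np.random.default_rng(2)
A = rng.standard_normal((N, N))
C_n = 0.1 * (A @ A.T / N + np.eye(N))        # some positive definite noise covariance
C_inv = np.linalg.inv(C_n)

# Jacobian (3) of the ML estimator at the noise-free observation y0 = h(omega):
# J = (h'^T C_n^{-1} h')^{-1} h'^T C_n^{-1}
J = (h_prime @ C_inv) / (h_prime @ C_inv @ h_prime)

lra = J @ C_n @ J                            # (10) with squared error loss and no bias
crb = 1.0 / (h_prime @ C_inv @ h_prime)      # CRB for this signal model, see [1]
print(lra, crb)                              # identical up to rounding errors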
Furthermore, the quadratic MSE approximation using (12) is given by (18), where the elements of $M_{n}$ can be computed using the formula for the fourth-order moments of a Gaussian random vector [10], which is $\mathrm{E}\{n_{i} n_{j} n_{k} n_{l}\} = [C_{n}]_{ij}[C_{n}]_{kl} + [C_{n}]_{ik}[C_{n}]_{jl} + [C_{n}]_{il}[C_{n}]_{jk}$.
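A quick Monte Carlo sanity check of this fourth-order moment formula (the covariance matrix, index choice and sample size below are arbitrary choices made for the check):

# Sanity check of the Gaussian fourth-moment formula quoted from [10]:
# E[n_i n_j n_k n_l] = C_ij C_kl + C_ik C_jl + C_il C_jk
import numpy as np

C = np.array([[1.0, 0.4], [0.4, 0.5]])
rng = np.random.default_rng(3)
n = rng.multivariate_normal(np.zeros(2), C, size=500_000)

i, j, k, l = 0, 1, 1, 0
formula = C[i, j] * C[k, l] + C[i, k] * C[j, l] + C[i, l] * C[j, k]
sample = np.mean(n[:, i] * n[:, j] * n[:, k] * n[:, l])
print(formula, sample)     # should agree up to Monte Carlo error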
Fig. 1. Comparison of different MSE bounds/approximations. The right plot shows a zoomed-in version of the left plot.

Finally, Fig. 1 gives a comparison of different risk approximations and bounds for the special case of frequency estimation, in particular the model (13) where $h(\theta)$ contains samples of a cosine and $\theta$ is the unknown (deterministic) angular frequency which we want to estimate. The MSE of the ML estimator is computed from averaging the results of a large number of Monte Carlo runs and it is compared to our linear/quadratic risk approximation from Section IV, the Bhattacharyya bound of order three, which was computed in [11] for the scalar Gaussian case, and the method of interval error (MIE) approximation⁶ [12] of the MSE. We can make the following observations:
• The quadratic risk approximation (12) is different from the Bhattacharyya bound: One could falsely assume that the QRA is equal to the Bhattacharyya bound from the fact that the LRA is equivalent to the CRB. However, this is not the case as one can see from Fig. 1. In particular, it is interesting to observe for this example that the QRA is closer to the MSE of the ML estimator than the Bhattacharyya bound (cf. the right hand side plot of Fig. 1).

• Only the MIE is capable of predicting the threshold region: From Fig. 1, we can see that only the MIE follows the MSE of the ML estimator for low SNR values. This is expected as of all MSE bounds/approximations that we consider for this example, only the MIE takes into account the case of "outliers" which can explain the sudden increase in the MSE of the ML estimator. The computation of the outlier probability, which is required to derive the MIE, is application-dependent and often difficult. Hence, it is often not possible to use the MIE for a specific problem. Interestingly, the QRA could be combined with the MIE to tighten the approximation by replacing the CRB with the QRA. This would yield a new risk approximation which is better in the SNR region shown at the right-hand side plot of Fig. 1.

⁶For this example, the MIE approximation is obtained from assuming that $\hat{\theta}$ is uniformly distributed for the case of a large error ("outlier") and the probability of such an event is given by the probability that any other frequency bin of the Cosine transform is larger than the expected frequency bin, see [12].

Please note that we used in this example the squared error loss as we wanted to compare our risk approximations LRA and QRA to other well known bounds and approximations from the literature. However, our approximations are more universal in the sense that they can be used for more general, differentiable loss functions which are not the squared error loss.

VI. CONCLUSIONS

In this letter, we showed how to compute the Jacobian and Hessian of an estimator that maximizes a suitable scoring function. We used this information to compute two risk approximations which can be used for a general loss function and showed that one of them for the special case of ML estimation in a Gaussian signal model is equivalent to the Cramér–Rao bound. Finally, we studied the example of frequency estimation and showed that both bounds yield a good MSE approximation.

REFERENCES

[1] L. L. Scharf, Statistical Signal Processing: Detection, Estimation and Time Series Analysis. Reading, MA, USA: Addison-Wesley, 1990.
[2] J. H. Friedman and P. Hall, "On bagging and nonlinear estimation," Stanford Univ., Stanford, CA, USA, Tech. Rep., 2000.
[3] S. X. Chen and P. Hall, "Effects of bagging and bias correction on estimators defined by estimating equations," Statist. Sinica, vol. 13, pp. 97–109, 2003.
[4] H. Rue, "New loss functions in Bayesian imaging," J. Amer. Statist. Assoc., vol. 90, no. 431, pp. 900–908, 1995.
[5] A. Zellner, "Bayesian estimation and prediction using asymmetric loss functions," J. Amer. Statist. Assoc., vol. 81, no. 394, pp. 446–451, 1986.
[6] S. Uhlich and B. Yang, "Bayesian estimation for non-standard loss functions using a parametric family of estimators," IEEE Trans. Signal Process., vol. 60, no. 3, pp. 1022–1031, 2012.
[7] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Rev., vol. 51, no. 3, pp. 455–500, 2009.
[8] S. G. Krantz and H. R. Parks, The Implicit Function Theorem: History, Theory, and Applications. Boston, MA, USA: Birkhäuser, 2002.
[9] B. Porat, Digital Processing of Random Signals. Upper Saddle River, NJ, USA: Prentice-Hall, 1993.
[10] A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes. New York, NY, USA: McGraw-Hill, 2002.
[11] P. Forster and P. Larzabal, "On lower bounds for deterministic parameter estimation," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 2, pp. 1137–1140.
[12] H. L. Van Trees and K. L. Bell, Detection Estimation and Modulation Theory, Part I, 2nd ed. Hoboken, NJ, USA: Wiley, 2013.
