Fundamentals of Statistical Signal Processing - Estimation Theory

Library of Congress Cataloging-in-Publication Data

Kay, Steven M.
Fundamentals of statistical signal processing : estimation theory / Steven M. Kay.
p. cm. — (PH signal processing series)
Includes bibliographical references and index.
ISBN 0-13-345711-7
1. Signal processing—Statistical methods. 2. Estimation theory. I. Title. II. Series: Prentice-Hall signal processing series.
TK5102.5.K379 1993
621.382'2—dc20    92-29495 CIP

Acquisitions Editor: Karen Gettman
Editorial Assistant: Barbara Alfieri
Prepress and Manufacturing Buyer: Mary E. McCartney
Cover Design: Wanda Lubelska
Cover Design Director: Eloise Starkweather

© 1993 by Prentice Hall PTR, Prentice-Hall, Inc., A Pearson Education Company, Upper Saddle River, New Jersey 07458

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

ISBN 0-13-345711-7

Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro

Fundamentals of Statistical Signal Processing: Estimation Theory
Steven M. Kay, University of Rhode Island
Prentice Hall PTR, Upper Saddle River, New Jersey 07458

Contents

Preface

1 Introduction
   1.1 Estimation in Signal Processing
   1.2 The Mathematical Estimation Problem
   1.3 Assessing Estimator Performance
   1.4 Some Notes to the Reader

2 Minimum Variance Unbiased Estimation
   2.1 Introduction
   2.2 Summary
   2.3 Unbiased Estimators
   2.4 Minimum Variance Criterion
   2.5 Existence of the Minimum Variance Unbiased Estimator
   2.6 Finding the Minimum Variance Unbiased Estimator
   2.7 Extension to a Vector Parameter

3 Cramer-Rao Lower Bound
   3.1 Introduction
   3.2 Summary
   3.3 Estimator Accuracy Considerations
   3.4 Cramer-Rao Lower Bound
   3.5 General CRLB for Signals in White Gaussian Noise
   3.6 Transformation of Parameters
   3.7 Extension to a Vector Parameter
   3.8 Vector Parameter CRLB for Transformations
   3.9 CRLB for the General Gaussian Case
   3.10 Asymptotic CRLB for WSS Gaussian Random Processes
   3.11 Signal Processing Examples
   3A Derivation of Scalar Parameter CRLB
   3B Derivation of Vector Parameter CRLB
   3C Derivation of General Gaussian CRLB
   3D Derivation of Asymptotic CRLB

4 Linear Models
   4.1 Introduction
   4.2 Summary
   4.3 Definition and Properties
   4.4 Linear Model Examples
   4.5 Extension to the Linear Model

5 General Minimum Variance Unbiased Estimation
   5.1 Introduction
   5.2 Summary
   5.3 Sufficient Statistics
   5.4 Finding Sufficient Statistics
   5.5 Using Sufficiency to Find the MVU Estimator
   5.6 Extension to a Vector Parameter
   5A Proof of Neyman-Fisher Factorization Theorem (Scalar Parameter)
   5B Proof of Rao-Blackwell-Lehmann-Scheffe Theorem (Scalar Parameter)

6 Best Linear Unbiased Estimators
   6.1 Introduction
   6.2 Summary
   6.3 Definition of the BLUE
   6.4 Finding the BLUE
   6.5 Extension to a Vector Parameter
   6.6 Signal Processing Example
   6A Derivation of Scalar BLUE
   6B Derivation of Vector BLUE

7 Maximum Likelihood Estimation
   7.1 Introduction
   7.2 Summary
   7.3 An Example
   7.4 Finding the MLE
   7.5 Properties of the MLE
   7.6 MLE for Transformed Parameters
   7.7 Numerical Determination of the MLE
   7.8 Extension to a Vector Parameter
   7.9 Asymptotic MLE
   7.10 Signal Processing Examples
   7A Monte Carlo Methods
   7B Asymptotic PDF of MLE for a Scalar Parameter
   7C Derivation of Conditional Log-Likelihood for EM Algorithm Example

8 Least Squares
   8.1 Introduction
   8.2 Summary
   8.3 The Least Squares Approach
   8.4 Linear Least Squares
   8.5 Geometrical Interpretations
   8.6 Order-Recursive Least Squares
   8.7 Sequential Least Squares
   8.8 Constrained Least Squares
   8.9 Nonlinear Least Squares
   8.10 Signal Processing Examples
   8A Derivation of Order-Recursive Least Squares
   8B Derivation of Recursive Projection Matrix
   8C Derivation of Sequential Least Squares

9 Method of Moments
   9.1 Introduction
   9.2 Summary
   9.3 Method of Moments
   9.4 Extension to a Vector Parameter
   9.5 Statistical Evaluation of Estimators
   9.6 Signal Processing Example

10 The Bayesian Philosophy
   10.1 Introduction
   10.2 Summary
   10.3 Prior Knowledge and Estimation
   10.4 Choosing a Prior PDF
   10.5 Properties of the Gaussian PDF
   10.6 Bayesian Linear Model
   10.7 Nuisance Parameters
   10.8 Bayesian Estimation for Deterministic Parameters
   10A Derivation of Conditional Gaussian PDF

11 General Bayesian Estimators
   11.1 Introduction
   11.2 Summary
   11.3 Risk Functions
   11.4 Minimum Mean Square Error Estimators
   11.5 Maximum A Posteriori Estimators
   11.6 Performance Description
   11.7 Signal Processing Example
   11A Conversion of Continuous-Time System to Discrete-Time System

12 Linear Bayesian Estimators
   12.1 Introduction
   12.2 Summary
   12.3 Linear MMSE Estimation
   12.4 Geometrical Interpretations
   12.5 The Vector LMMSE Estimator
   12.6 Sequential LMMSE Estimation
   12.7 Signal Processing Examples - Wiener Filtering
   12A Derivation of Sequential LMMSE Estimator

13 Kalman Filters
   13.1 Introduction
   13.2 Summary
   13.3 Dynamical Signal Models
   13.4 Scalar Kalman Filter
   13.5 Kalman Versus Wiener Filters
   13.6 Vector Kalman Filter
   13.7 Extended Kalman Filter
   13.8 Signal Processing Examples
   13A Vector Kalman Filter Derivation
   13B Extended Kalman Filter Derivation

14 Summary of Estimators
   14.1 Introduction
   14.2 Estimation Approaches
   14.3 Linear Model

15 Extensions for Complex Data and Parameters
   15.1 Introduction
   15.2 Summary
   15.3 Complex Data and Parameters
   15.4 Complex Random Variables and PDFs
   15.5 Complex WSS Random Processes
   15.6 Derivatives, Gradients, and Optimization
   15.7 Classical Estimation with Complex Data
   15.8 Bayesian Estimation
   15.9 Asymptotic Complex Gaussian PDF
   15.10 Signal Processing Examples
   15A Derivation of Properties of Complex Covariance Matrices
   15B Derivation of Properties of Complex Gaussian PDF
   15C Derivation of CRLB and MLE Formulas

A1 Review of Important Concepts
   A1.1 Linear and Matrix Algebra
   A1.2 Probability, Random Processes, and Time Series Models
A2 Glossary of Symbols and Abbreviations

Index

Preface

Parameter estimation is a subject that is standard fare in the many books available on statistics. These books range from the highly theoretical expositions written by statisticians to the more practical treatments contributed by the many users of applied statistics. This text is an attempt to strike a balance between these two extremes. The particular audience we have in mind is the community involved in the design and implementation of signal processing algorithms. As such, the primary focus is on obtaining optimal estimation algorithms that may be implemented on a digital computer. The data sets are therefore assumed to be samples of a continuous-time waveform or a sequence of data points. The choice of topics reflects what we believe to be the important approaches to obtaining an optimal estimator and analyzing its performance. As a consequence, some of the deeper theoretical issues have been omitted with references given instead.

It is the author's opinion that the best way to assimilate the material on parameter estimation is by exposure to and working with good examples. Consequently, there are numerous examples that illustrate the theory and others that apply the theory to actual signal processing problems of current interest. Additionally, an abundance of homework problems have been included. They range from simple applications of the theory to extensions of the basic concepts. A solutions manual is available from the publisher.

To aid the reader, summary sections have been provided at the beginning of each chapter. Also, an overview of all the principal estimation approaches and the rationale for choosing a particular estimator can be found in Chapter 14. Classical estimation is first discussed in Chapters 2-9, followed by Bayesian estimation in Chapters 10-13. This delineation will, hopefully, help to clarify the basic differences between these two principal approaches. Finally, again in the interest of clarity, we present the estimation principles for scalar parameters first, followed by their vector extensions. This is because the matrix algebra required for the vector estimators can sometimes obscure the main concepts.

This book is an outgrowth of a one-semester graduate level course on estimation theory given at the University of Rhode Island. It includes somewhat more material than can actually be covered in one semester. We typically cover most of Chapters 1-12, leaving the subjects of Kalman filtering and complex data/parameter extensions to the student. The necessary background that has been assumed is an exposure to the basic theory of digital signal processing, probability and random processes, and linear and matrix algebra. This book can also be used for self-study and so should be useful to the practicing engineer as well as the student.
The author would like to acknowledge the contributions of the many people who over the years have provided stimulating discussions of research problems, opportuni- ties to apply the results of that research, and support for conducting research. Thanks are due to my colleagues L. Jackson, R. Kumaresan, L. Pakula, and D. Tufts of the University of Rhode Island, and L. Scharf of the University of Colorado. Exposure to practical problems, leading to new research directions, has been provided by H. Wood- sum of Sonetech, Bedford, New Hampshire, and by D. Mook, S. Lang, C. Myers, and D. Morgan of Lockheed-Sanders, Nashua, New Hampshire. The opportunity to apply estimation theory to sonar and the research support of J. Kelly of the Naval Under- sea Warfare Center, Newport, Rhode Island, J. Salisbury of Analysis and Technology, Middletown, Rhode Island (formerly of the Naval Undersea Warfare Center), and D. Sheldon of the Naval Undersea Warfare Center, New London, Connecticut, are also greatly appreciated. Thanks are due to J. Sjogren of the Air Force Office of Scientific Research, whose continued support has allowed the author to investigate the field of statistical estimation. A debt of gratitude is owed to all my current and former grad- uate students. They have contributed to the final manuscript through many hours of pedagogical and research discussions as well as by their specific comments and ques- tions. In particular, P. Djuri¢ of the State University of New York proofread much of the manuscript, and V. Nagesha of the University of Rhode Island proofread the manuscript and helped with the problem solutions. Steven M. Kay University of Rhode Island Kingston, RI 02881 Chapter 1 Introduction 1.1 Estimation in Signal Processing Modern estimation theory can be found at the heart of many electronic signal processing systems designed to extract information. These systems include 1, Radar 2. Sonar 3. Speech 4. Image analysis 5. Biomedicine 6. Communications 7. Control 8. Seismology, and all share the common problem of needing to estimate the values of a group of pa- rameters. We briefly describe the first three of these systems. In radar we are interested in determining the position of an aircraft, as for example, in airport surveillance radar [Skolnik 1980]. To determine the range R we transmit an electromagnetic pulse that is reflected by the aircraft, causing an echo to be received by the antenna 79 seconds later, as shown in Figure 1.la. The range is determined by the equation 7 = 2R/c, where c is the speed of electromagnetic propagation. Clearly, if the round trip delay 7) can be measured, then so can the range. A typical transmit pulse and received waveform are shown in Figure 1.1b. The received echo is decreased in amplitude due to propaga- tion losses and hence may be obscured by environmental noise. Its onset may also be perturbed by time delays introduced by the electronics of the receiver. Determination of the round trip delay can therefore require more than just a means of detecting a jump in the power level at the receiver. It is important to note-that a typical modern 1 2 CHAPTER 1. INTRODUCTION Radar processing system (a) Radar Transmit pulse Time (b) Transmit and received waveforms Figure 1.1 Radar system radar system will input the received continuous-time waveform into a digital computer by taking samples via an analog-to-digital convertor. Once the waveform has been sampled, the data compose a time series. 
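To get a feel for what the sampled radar data look like and how the round-trip delay might be pulled out of them, the following minimal sketch simulates a delayed, attenuated echo in noise and locates it by cross-correlating with the known transmit pulse. It is only an illustration: the rectangular pulse, the sampling rate, the attenuation, the noise level, and the use of a peak-of-correlation estimator are assumptions made here for concreteness, not values or methods taken from the text.

```python
import numpy as np

# Illustrative parameters (arbitrary choices)
fs = 1e6            # sampling rate (Hz)
tau0 = 2.37e-4      # true round-trip delay (s)
c = 3e8             # propagation speed (m/s)
pulse = np.ones(50) # simple rectangular transmit pulse, 50 samples long

# Build the received waveform: attenuated, delayed pulse plus noise
N = 1000
x = np.zeros(N)
n0 = int(round(tau0 * fs))            # delay in samples
x[n0:n0 + len(pulse)] += 0.1 * pulse  # echo, reduced in amplitude by propagation loss
x += 0.02 * np.random.randn(N)        # environmental noise

# Estimate the delay by cross-correlating with the known pulse
corr = np.correlate(x, pulse, mode="valid")
n0_hat = np.argmax(corr)
tau0_hat = n0_hat / fs
R_hat = c * tau0_hat / 2              # range estimate from tau0 = 2R/c

print(f"true delay {tau0:.6e} s, estimated {tau0_hat:.6e} s, range {R_hat:.1f} m")
```

Note that once the waveform is sampled, the delay estimate is quantized to the sampling grid; finer resolution requires more careful estimators than this peak-picking sketch.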
(See also Examples 3.13 and 7.15 for a more detailed description of this problem and optimal estimation procedures.) Another common application is in sonar, in which we are also interested in the position of a target, such as a submarine [Knight et al. 1981, Burdic 1984] . A typical passive sonar is shown in Figure 1.2a. The target radiates noise due to machinery on board, propellor action, etc. This noise, which is actually the signal of interest, propagates through the water and is received by an array of sensors. The sensor outputs 1.1. ESTIMATION IN SIGNAL PROCESSING 3 Sea surface ‘Towed array (“<= Noisy target Sea bottom. (a) Passive sonar Sensor 1 output Time Time Time (b) Received signals at array sensors Figure 1.2 Passive sonar system are then transmitted to a tow ship for input to a digital computer. Because of the positions of the sensors relative to the arrival angle of the target signal, we receive the signals shown in Figure 1.2b. By measuring 79, the delay between sensors, we can determine the bearing 3 from the expression B = arccos (=) (1.1) where c is the speed of sound in water and d is the distance between sensors (see Examples 3.15 and 7.17 for a more detailed description). Again, however, the received 4 CHAPTER 1. INTRODUCTION Vowel /a/ o 4 Time (ms) Vowel /e/ oO t t t t t t t 0 2 4 6 8 10 12 14 16 18 20 Time (ms) Figure 1.3 Examples of speech sounds waveforms are not “clean” as shown in Figure 1.2b but are embedded in noise, making the determination of 7) more difficult. The value of 8 obtained from (1.1) is then only an estimate. Another application is in speech processing systems [Rabiner and Schafer 1978]. A particularly important problem is speech recognition, which is the recognition of speech by a machine (digital computer). The simplest example of this is in recognizing individual speech sounds or phonemes. Phonemes are the vowels, consonants, etc., or the fundamental sounds of speech. As an example, the vowels /a/ and /e/ are shown in Figure 1.3. Note that they are periodic waveforms whose period is called the pitch. To recognize whether a sound is an /a/ or an /e/ the following simple strategy might be employed. Have the person whose speech is to be recognized say each vowel three times and store the waveforms. To recognize the spoken vowel, compare it to the stored vowels and choose the one that is closest to the spoken vowel or the one that 1.1. ESTIMATION IN SIGNAL PROCESSING 5 305 wx Periodogram t t t t t 0 500 1000 1500 2000 2500 Frequency (Hz) wo t t t t t Periodogram and LPC spectra /e/ (dB) Periodogram and LPC spectra Jaf (dB) 0 500 1000 1500 2000 2500 Frequency (Hz) Figure 1.4 LPC spectral modeling minimizes some distance measure. Difficulties arise if the pitch of the speaker’s voice changes from the time he or she records the sounds (the training session) to the time when the speech recognizer is used. This is a natural variability due to the nature of human speech. In practice, attributes, other than the waveforms themselves, are used to measure distance. Attributes are chosen that are less susceptible to variation. For example, the spectral envelope will not change with pitch since the Fourier transform of a periodic signal is a sampled version of the Fourier transform of one period of the signal. The period affects only the spacing between frequency samples, not the values. To extract the spectral envelope we employ a model of speech called linear predictive coding (LPC). The parameters of the model determine the spectral envelope. 
For the speech sounds in Figure 1.3 the power spectrum (magnitude-squared Fourier transform divided by the number of time samples) or periodogram and the estimated LPC spectral envelope are shown in Figure 1.4. (See Examples 3.16 and 7.18 for a description of how 6 CHAPTER 1. INTRODUCTION the parameters of the model are estimated and used to find the spectral envelope.) It is interesting that in this example a human interpreter can easily discern the spoken vowel. The real problem then is to design a machine that is able to do the same. In the radar/sonar problem a human interpreter would be unable to determine the target position from the received waveforms, so that the machine acts as an indispensable tool. In all these systems we are faced with the problem of extracting values of parameters based on continuous-time waveforms. Due to the use of digital computers to sample and store the continuous-time waveform, we have the equivalent problem of extracting parameter values from a discrete-time waveform or a data set. Mathematically, we have the N-point data set {z{0], z[1],...,2[N —1]} which depends on an unknown parameter 0. We wish to determine 0 based on the data or to define an estimator 6 = g(z(0], 2[1],...,2{N — 1]) (1.2) where g is some function. This is the problem of parameter estimation, which is the subject of this book. Although electrical engineers at one time designed systems based on analog signals and analog circuits, the current and future trend is based on discrete- time signals or sequences and digital circuitry. With this transition the estimation problem has evolved into one of estimating a parameter based on a time series, which is just a discrete-time process. Furthermore, because the amount of data is necessarily finite, we are faced with the determination of g as in (1.2). Therefore, our problem has now evolved into one which has a long and glorious history, dating back to Gauss who in 1795 used least squares data analysis to predict planetary movements [Gauss 1963 (English translation)]. All the theory and techniques of statistical estimation are at our disposal [Cox and Hinkley 1974, Kendall and Stuart 1976-1979, Rao 1973, Zacks 1981]. Before concluding our discussion of application areas we complete the previous list. 4. Image analysis - estimate the position and orientation of an object from a camera image, necessary when using a robot to pick up an object [Jain 1989] 5. Biomedicine - estimate the heart rate of a fetus [Widrow and Stearns 1985] 6. Communications - estimate the carrier frequency of a signal so that the signal can be demodulated to baseband [Proakis 1983] 7. Control - estimate the position of a powerboat so that corrective navigational action can be taken, as in a LORAN system [Dabbous 1988] 8. Seismology - estimate the underground distance of an oil deposit based on sound reflections due to the different densities of oil and rock layers [Justice 1985]. Finally, the multitude of applications stemming from analysis of data from physical experiments, economics, etc., should also be mentioned [Box and Jenkins 1970, Holm and Hovem 1979, Schuster 1898, Taylor 1986]. 1.2. THE MATHEMATICAL ESTIMATION PROBLEM 7 P(x{0}; 6) z(0] o 02 43 Figure 1.5 Dependence of PDF on unknown parameter 1.2 The Mathematical Estimation Problem In determining good estimators the first step is to mathematically model the data. Because the data are inherently random, we describe it by its probability density func- tion (PDF) or p(z(0], z[1],...,2[N — 1];0). 
The PDF is parameterized by the unknown parameter θ, i.e., we have a class of PDFs where each one is different due to a different value of θ. We will use a semicolon to denote this dependence. As an example, if N = 1 and θ denotes the mean, then the PDF of the data might be

p(x[0];\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2}\left( x[0] - \theta \right)^2 \right]

which is shown in Figure 1.5 for various values of θ. It should be intuitively clear that because the value of θ affects the probability of x[0], we should be able to infer the value of θ from the observed value of x[0]. For example, if the value of x[0] is negative, it is doubtful that θ = θ₂. The value θ = θ₁ might be more reasonable. This specification of the PDF is critical in determining a good estimator. In an actual problem we are not given a PDF but must choose one that is not only consistent with the problem constraints and any prior knowledge, but one that is also mathematically tractable.

To illustrate the approach consider the hypothetical Dow-Jones industrial average shown in Figure 1.6. It might be conjectured that this data, although appearing to fluctuate wildly, actually is "on the average" increasing. To determine if this is true we could assume that the data actually consist of a straight line embedded in random noise or

x[n] = A + Bn + w[n]    n = 0, 1, ..., N−1.

A reasonable model for the noise is that w[n] is white Gaussian noise (WGN), or each sample of w[n] has the PDF N(0, σ²) (which denotes a Gaussian distribution with a mean of 0 and a variance of σ²) and is uncorrelated with all the other samples. Then, the unknown parameters are A and B, which arranged as a vector become the vector parameter θ = [A B]ᵀ. Letting x = [x[0] x[1] ... x[N−1]]ᵀ, the PDF is

p(\mathbf{x};\boldsymbol{\theta}) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} \left( x[n] - A - Bn \right)^2 \right].    (1.3)

Figure 1.6 Hypothetical Dow-Jones average

The choice of a straight line for the signal component is consistent with the knowledge that the Dow-Jones average is hovering around 3000 (A models this) and the conjecture that it is increasing (B > 0 models this). The assumption of WGN is justified by the need to formulate a mathematically tractable model so that closed form estimators can be found. Also, it is reasonable unless there is strong evidence to the contrary, such as highly correlated noise. Of course, the performance of any estimator obtained will be critically dependent on the PDF assumptions. We can only hope the estimator obtained is robust, in that slight changes in the PDF do not severely affect the performance of the estimator. More conservative approaches utilize robust statistical procedures [Huber 1981]. Estimation based on PDFs such as (1.3) is termed classical estimation in that the parameters of interest are assumed to be deterministic but unknown.
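To make the classical model in (1.3) concrete, the following minimal sketch generates one realization of the straight line in WGN and evaluates the log of the PDF in (1.3) for two candidate parameter vectors. Python with NumPy is assumed, and the values A = 3000, B = 2, σ = 50, and N = 100 are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

N = 100
A_true, B_true, sigma = 3000.0, 2.0, 50.0   # illustrative values only
n = np.arange(N)

# One realization of the hypothetical Dow-Jones model x[n] = A + B n + w[n]
x = A_true + B_true * n + sigma * np.random.randn(N)

def log_pdf(x, A, B, sigma):
    """Log of the classical PDF p(x; theta) in (1.3) for theta = [A, B]^T."""
    resid = x - (A + B * np.arange(len(x)))
    return (-0.5 * len(x) * np.log(2 * np.pi * sigma**2)
            - np.sum(resid**2) / (2 * sigma**2))

# The log PDF is larger for parameter values close to the true ones
print(log_pdf(x, 3000.0, 2.0, sigma), log_pdf(x, 2900.0, 0.0, sigma))
```

Evaluating the log PDF over a grid of (A, B) values shows directly how the data favor parameters near the true ones, which is the idea exploited by the estimators developed in later chapters.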
In the Dow-Jones average example we know a priori that the mean is somewhere around 3000. It seems inconsistent with reality, then, to choose an estimator of A that can result in values as low as 2000 or as high as 4000. We might be more willing to constrain the estimator to produce values of A in the range [2800, 3200]. To incorporate this prior knowledge we can assume that A is no longer deterministic but a random variable and assign it a PDF, possibly uniform over the [2800, 3200] interval. Then, any subsequent estimator will yield values in this range. Such an approach is termed Bayesian estimation. The parameter we are attempting to estimate is then viewed as a realization of the random variable θ. As such, the data are described by the joint PDF

p(\mathbf{x},\theta) = p(\mathbf{x}|\theta)\, p(\theta)

where p(θ) is the prior PDF, summarizing our knowledge about θ before any data are observed, and p(x|θ) is a conditional PDF, summarizing our knowledge provided by the data x conditioned on knowing θ. The reader should compare the notational differences between p(x; θ) (a family of PDFs) and p(x|θ) (a conditional PDF), as well as the implied interpretations (see also Problem 1.3).

Once the PDF has been specified, the problem becomes one of determining an optimal estimator or function of the data, as in (1.2). Note that an estimator may depend on other parameters, but only if they are known. An estimator may be thought of as a rule that assigns a value to θ for each realization of x. The estimate of θ is the value of θ obtained for a given realization of x. This distinction is analogous to a random variable (which is a function defined on the sample space) and the value it takes on. Although some authors distinguish between the two by using capital and lowercase letters, we will not do so. The meaning will, hopefully, be clear from the context.

1.3 Assessing Estimator Performance

Figure 1.7 Realization of DC level in noise

Consider the data set shown in Figure 1.7. From a cursory inspection it appears that x[n] consists of a DC level A in noise. (The use of the term DC is in reference to direct current, which is equivalent to the constant function.) We could model the data as

x[n] = A + w[n]

where w[n] denotes some zero mean noise process. Based on the data set {x[0], x[1], ..., x[N−1]}, we would like to estimate A. Intuitively, since A is the average level of x[n] (w[n] is zero mean), it would be reasonable to estimate A as

\hat{A} = \frac{1}{N} \sum_{n=0}^{N-1} x[n]

or by the sample mean of the data. Several questions come to mind:

1. How close will Â be to A?
2. Are there better estimators than the sample mean?

For the data set in Figure 1.7 it turns out that Â = 0.9, which is close to the true value of A = 1. Another estimator might be

Ǎ = x[0].

Intuitively, we would not expect this estimator to perform as well since it does not make use of all the data. There is no averaging to reduce the noise effects. However, for the data set in Figure 1.7, Ǎ = 0.95, which is closer to the true value of A than the sample mean estimate. Can we conclude that Ǎ is a better estimator than Â? The answer is of course no. Because an estimator is a function of the data, which are random variables, it too is a random variable, subject to many possible outcomes. The fact that Ǎ is closer to the true value only means that for the given realization of data, as shown in Figure 1.7, the estimate Ǎ = 0.95 (or realization of Ǎ) is closer to the true value than the estimate Â = 0.9 (or realization of Â). To assess performance we must do so statistically. One possibility would be to repeat the experiment that generated the data and apply each estimator to every data set. Then, we could ask which estimator produces a better estimate in the majority of the cases. Suppose we repeat the experiment by fixing A = 1 and adding different noise realizations of w[n] to generate an ensemble of realizations of x[n]. Then, we determine the values of the two estimators for each data set and finally plot the histograms.
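A minimal version of this repeated experiment is sketched below. Python with NumPy is assumed; the choices A = 1, N = 100 data points, unit-variance WGN, and 100 realizations simply mirror the discussion, and the random seed is arbitrary.

```python
import numpy as np

A, N, sigma, n_trials = 1.0, 100, 1.0, 100
rng = np.random.default_rng(0)

A_hat = np.empty(n_trials)    # sample mean estimator
A_chk = np.empty(n_trials)    # first-sample estimator
for k in range(n_trials):
    x = A + sigma * rng.standard_normal(N)   # x[n] = A + w[n]
    A_hat[k] = x.mean()
    A_chk[k] = x[0]

# Histogram counts approximate the PDFs of the two estimators
print("sample mean :  avg = %.3f, var = %.4f" % (A_hat.mean(), A_hat.var()))
print("first sample:  avg = %.3f, var = %.4f" % (A_chk.mean(), A_chk.var()))
print(np.histogram(A_hat, bins=20, range=(-2, 4))[0])
print(np.histogram(A_chk, bins=20, range=(-2, 4))[0])
```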
(A histogram describes the number of times the estimator produces a given range of values and is an approximation to the PDF.) For 100 realizations the histograms are shown in Figure 1.8. It should now be evident that Â is a better estimator than Ǎ because the values obtained are more concentrated about the true value of A = 1. Hence, Â will usually produce a value closer to the true one than Ǎ. The skeptic, however, might argue that if we repeat the experiment 1000 times instead, then the histogram of Ǎ will be more concentrated. To dispel this notion, we cannot repeat the experiment 1000 times, for surely the skeptic would then reassert his or her conjecture for 10,000 experiments. To prove that Â is better we could establish that the variance is less. The modeling assumptions that we must employ are that the w[n]'s, in addition to being zero mean, are uncorrelated and have equal variance σ². Then, we first show that the mean of each estimator is the true value or

E(\hat{A}) = E\left( \frac{1}{N} \sum_{n=0}^{N-1} x[n] \right) = \frac{1}{N} \sum_{n=0}^{N-1} E(x[n]) = A

E(\check{A}) = E(x[0]) = A

so that on the average the estimators produce the true value. Second, the variances are

var(\hat{A}) = var\left( \frac{1}{N} \sum_{n=0}^{N-1} x[n] \right) = \frac{1}{N^2} \sum_{n=0}^{N-1} var(x[n]) = \frac{1}{N^2}\, N\sigma^2 = \frac{\sigma^2}{N}

since the w[n]'s are uncorrelated, and

var(\check{A}) = var(x[0]) = \sigma^2

and thus var(Â) < var(Ǎ).

Figure 1.8 Histograms for sample mean and first sample estimator

Furthermore, if we could assume that w[n] is Gaussian, we could also conclude that the probability of a given magnitude error is less for Â than for Ǎ (see Problem 2.7).

Several important points are illustrated by the previous example, which should always be kept in mind.

1. An estimator is a random variable. As such, its performance can only be completely described statistically or by its PDF.

2. The use of computer simulations for assessing estimation performance, although quite valuable for gaining insight and motivating conjectures, is never conclusive. At best, the true performance may be obtained to the desired degree of accuracy. At worst, for an insufficient number of experiments and/or errors in the simulation techniques employed, erroneous results may be obtained (see Appendix 7A for a further discussion of Monte Carlo computer techniques).

Another theme that we will repeatedly encounter is the tradeoff between performance and computational complexity. As in the previous example, even though Â has better performance, it also requires more computation. We will see that optimal estimators can sometimes be difficult to implement, requiring a multidimensional optimization or integration. In these situations, alternative estimators that are suboptimal, but which can be implemented on a digital computer, may be preferred. For any particular application, the user must determine whether the loss in performance is offset by the reduced computational complexity of a suboptimal estimator.

1.4 Some Notes to the Reader

Our philosophy in presenting a theory of estimation is to provide the user with the main ideas necessary for determining optimal estimators. We have included results that we deem to be most useful in practice, omitting some important theoretical issues. The latter can be found in many books on statistical estimation theory which have been written from a more theoretical viewpoint [Cox and Hinkley 1974, Kendall and Stuart 1976-1979, Rao 1973, Zacks 1981].
As mentioned previously, our goal is to obtain an optimal estimator, and we resort to a suboptimal one if the former cannot be found or is not implementable. The sequence of chapters in this book follows this approach, so that optimal estimators are discussed first, followed by approximately optimal estimators, and finally suboptimal estimators. In Chapter 14 a “road map” for finding a good estimator is presented along with a summary of the various estimators and their properties. The reader may wish to read this chapter first to obtain an overview. We have tried to maximize insight by including many examples and minimizing long mathematical expositions, although much of the tedious algebra and proofs have been included as appendices. The DC level in noise described earlier will serve as a standard example in introducing almost all the estimation approaches. It is hoped that in doing so the reader will be able to develop his or her own intuition by building upon previously assimilated concepts. Also, where possible, the scalar estimator is REFERENCES 13 presented first followed by the vector estimator. This approach reduces the tendency of vector/matrix algebra to obscure the main ideas. Finally, classical estimation is described first, followed by Bayesian estimation, again in the interest of not obscuring the main issues. The estimators obtained using the two approaches, although similar in appearance, are fundamentally different. The mathematical notation for all common symbols is summarized in Appendix 2. The distinction between a continuous-time waveform and a discrete-time waveform or sequence is made through the symbolism z(t) for continuous-time and z{n] for discrete- time. Plots of z[n], however, appear continuous in time, the points having been con- nected by straight lines for easier viewing. All vectors and matrices are boldface with all vectors being column vectors. All other symbolism is defined within the context of the discussion. References Box, G.E.P., G.M. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco, 1970. Burdic, W.S., Underwater Acoustic System Analysis, Prentice-Hall, Englewood Cliffs, N.J., 1984. Cox, D.R., D.V. Hinkley, Theoretical Statistics, Chapman and Hall, New York, 1974. Dabbous, T.E., N.U. Ahmed. J.C. McMillan, D.F. Liang, “Filtering of Discontinuous Processes ‘Arising in Marine Integrated Navigation.” IEEE Trans. Aerosp. Electron. Syst., Vol. 24, pp. 85-100, 1988. Gauss, K.G., Theory of Motion of Heavenly Bodies, Dover, New York, 1963. Holm, S., J.M. Hovem, “Estimation of Scalar Ocean Wave Spectra by the Maximum Entropy Method,” IEEE J. Ocean Eng., Vol. 4, pp. 76-83, 1979. Huber, P.J., Robust Statistics, J. Wiley, New York, 1981. Jain, A.K., Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, N.J., 1989. Justice, J.H., “Array Processing in Exploration Seismology,” in Array Signal Processing, S. Haykin, ed., Prentice-Hall, Englewood Cliffs, N.J., 1985. Kendall, Sir M., A. Stuart, The Advanced Theory of Statistics, Vols. 1-3, Macmillan, New York, 1976-1979. Knight, W.S., R.G. Pridham, S.M. Kay, “Digital Signal Processing for Sonar,” Proc. IEEE, Vol. 69, pp. 1451-1506. Nov. 1981. Proakis, J.G., Digital Communications, McGraw-Hill, New York, 1983. Rabiner, L.R., R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978. Rao, C.R., Linear Statistical Inference and Its Applications, J. Wiley, New York, 1973. 
Schuster, A., “On the Investigation of Hidden Periodicities with Application to a Supposed 26 Day Period of Meterological Phenomena,” Terrestrial Magnetism, Vol. 3, pp. 13-41, March 1898. Skolnik, M.I., Introduction to Radar Systems, McGraw-Hill, New York, 1980. Taylor, S., Modeling Financial Time Series, J. Wiley, New York, 1986. Widrow, B., Stearns, S.D., Adaptive Signal Processing, Prentice-Hall, Englewood Cliffs, N.J., 1985. Zacks, S., Parametric Statistical Inference, Pergamon, New York, 1981. 14 CHAPTER 1. INTRODUCTION Problems 1. Ina radar system an estimator of round trip delay 7 has the PDF ~ N(t,03,), where Tp is the true value. If the range is to be estimated, propose an estimator R and find its PDF. Next determine the standard deviation o;, so that 99% of the time the range estimate will be within 100 m of the true value. Use c = 3 x 10° m/s for the speed of electromagnetic propagation. . An unknown parameter @ influences the outcome of an experiment which is mod- eled by the random variable x. The PDF of z is 1 1 ;0) = ——e —he-oy|. v(e;0) = Geen |-5E 8) A series of experiments is performed, and z is found to always be in the interval [97, 103]. As a result, the investigator concludes that @ must have been 100. Is this assertion correct? Let x = 9+, where w is a random variable with PDF p.,(w). If @ is a determin- istic parameter, find the PDF of z in terms of p, and denote it by p(z; 4). Next assume that @ is a random variable independent of w and find the conditional PDF p(2|). Finally, do not assume that @ and w are independent and determine p(a|6). What can you say about p(x; 0) versus p(z|9)? |. It is desired to estimate the value of a DC level A in WGN or z{n] = A+ wn] n=0,1,..,.N-E where w[n] is zero mean and uncorrelated, and each sample has variance e=l. Consider the two estimators . 1 N-1 A= H Lael N-2 A= yo (20 + Zain +201 - n) . Which one is better? Does it depend on the value of A? . For the same data set as in Problem 1.4 the following estimator is proposed: (0) a} = A? > 1000 A= 1 N-1 2 W Sozin] 4 = A? < 1000. n=O The rationale for this estimator is that for a high enough signal-to-noise ratio (SNR) or A?/o, we do not need to reduce the effect of noise by averaging and hence can avoid the added computation. Comment on this approach. Chapter 2 Minimum Variance Unbiased Estimation 2.1 Introduction In this chapter we will begin our search for good estimators of unknown deterministic parameters. We will restrict our attention to estimators which on the average yield the true parameter value. Then, within this class of estimators the goal will be to find the one that exhibits the least variability. Hopefully, the estimator thus obtained will produce values close to the true value most of the time. The notion of a minimum variance unbiased estimator is examined within this chapter, but the means to find it will require some more theory. Succeeding chapters will provide that theory as well as apply it to many of the typical problems encountered in signal processing. 2.2 Summary An unbiased estimator is defined by (2.1), with the important proviso that this holds for all possible values of the unknown parameter. Within this class of estimators the one with the minimum variance is sought. The unbiased constraint is shown by example to be desirable from a practical viewpoint since the more natural error criterion, the minimum mean square error, defined in (2.5), generally leads to unrealizable estimators. Minimum variance unbiased estimators do not, in general, exist. 
When they do, several methods can be used to find them. The methods rely on the Cramer-Rao lower bound and the concept of a sufficient statistic. If a minimum variance unbiased estimator does not exist or if both of the previous two approaches fail, a further constraint on the estimator, to being linear in the data, leads to an easily implemented, but suboptimal, estimator. 15 16 CHAPTER 2. MINIMUM VARIANCE UNBIASED ESTIMATION 2.3. Unbiased Estimators For an estimator to be unbiased we mean that on the average the estimator will yield the true value of the unknown parameter. Since the parameter value may in general be anywhere in the interval a < @ < b, unbiasedness asserts that no matter what the true value of #, our estimator will yield it on the average. Mathematically, an estimator is unbiased if E(6)=@ a<@ a 0 (a) Unbiased estimator (6) (8) n increases Seat 6 6 E(6) 8 E(6) 0 (b) Biased estimator Figure 2.2 Effect of combining estimators and var(6) so that as more estimates are averaged, the variance will decrease. Ultimately, as n-+ 00, 6 + 8. However, if the estimators are biased or E(0;) = 6 + b(8), then = wt, = 0+4(6) lI E(@) and no matter how many estimators are averaged, 6 will not converge to the true value. This is depicted in Figure 2.2. Note that, in general, (0) = E(6) - is defined as the bias of the estimator. 2.4. MINIMUM VARIANCE CRITERION 19 2.4 Minimum Variance Criterion In searching for optimal estimators we need to adopt some optimality criterion. A natural one is the mean square error (MSE), defined as mse(6) = E [@ - 6)?| . (2.5) This measures the average mean squared deviation of the estimator from the true value. Unfortunately, adoption of this natural criterion leads to unrealizable estimators, ones that cannot be written solely as a function of the data. To understand the problem which arises we first rewrite the MSE as mse(6) = E {[(@ ~ 5) + (2) - |} var(6) + [B(6) - ol” var(6) + b?(8) (2.6) which shows that the MSE is composed of errors due to the variance of the estimator as well as the bias. As an example, for the problem in Example 2.1 consider the modified estimator Azax > 2[n] N n=0 for some constant a. We will attempt to find the a which results in the minimum MSE. Since E(A) = aA and var(A) = a’c?/N, we have, from (2.6), 2 2 mse(A) = ae +(a-1)?4?. Differentiating the MSE with respect to a yields dmse(A) _ 2a0? _1)A2 de WN +2(a-1)A which upon setting to zero and solving yields the optimum value A? It is seen that, unfortunately, the optimal value of a depends upon the unknown param- eter A. The estimator is therefore not realizable. In retrospect the estimator depends upon A since the bias term in (2.6) is a function of A. It would seem that any criterion which depends on the bias will lead to an unrealizable estimator. Although this is generally true, on occasion realizable minimum MSE estimators can be found [Bibby and Toutenburg 1977, Rao 1973, Stoica and Moses 1990]. 20 CHAPTER 2. MINIMUM VARIANCE UNBIASED ESTIMATION 4 6, 6, NoMVU - . estimatot 6, = MVU estimator mater @ @ (a) (b) Figure 2.3 Possible dependence of estimator variance with 0 From a practical viewpoint the minimum MSE estimator needs to be abandoned. An alternative approach is to constrain the bias to be zero and find the estimator which minimizes the variance. Such an estimator is termed the minimum variance unbiased (MVU) estimator. Note that from (2.6) that the MSE of an unbiased estimator is just the variance. 
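The difficulty can also be seen numerically. The sketch below (Python with NumPy assumed; the values of N, σ², and the trial values of A are arbitrary) evaluates the MSE expression a²σ²/N + (a − 1)²A² over a grid of a and confirms that the minimizing value a = A²/(A² + σ²/N) shifts with the unknown A, so the minimum MSE estimator cannot be built from the data alone.

```python
import numpy as np

N, sigma2 = 25, 1.0
a = np.linspace(0.0, 1.5, 301)

for A in (0.5, 1.0, 2.0):                            # different true values of A
    mse = a**2 * sigma2 / N + (a - 1.0)**2 * A**2    # variance + bias^2, as in (2.6)
    a_opt_grid = a[np.argmin(mse)]
    a_opt_formula = A**2 / (A**2 + sigma2 / N)
    print(f"A = {A}: grid minimum at a = {a_opt_grid:.3f}, "
          f"A^2/(A^2 + sigma^2/N) = {a_opt_formula:.3f}")
```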
Minimizing the variance of an unbiased estimator also has the effect of concentrating the PDF of the estimation error, 9 — 6, about zero (see Problem 2.7). The estimation error will therefore be less likely to be large. 2.5 Existence of the Minimum Variance Unbiased Estimator The question arises as to whether a MVU estimator exists, i.e., an unbiased estimator with minimum variance for all 6. Two possible situations are described in Figure 2.3. If there are three unbiased estimators that exist and whose variances are shown in Figure 2.3a, then clearly 63 is the MVU estimator. If the situation in Figure 2.3b exists, however, then there is no MVU estimator since for 0 < 9, 62 is better, while for 6 > 0, 93 is better. In the former case 6, is sometimes referred to as the uniformly minimum variance unbiased estimator to emphasize that the variance is smallest for all 6. In general, the MVU estimator does not always ezist, as the following example illustrates. Example 2.3 - Counterexample to Existence of MVU Estimator If the form of the PDF changes with 6, then it would be expected that the best estimator would also change with 9. Assume that we have two independent observations z{0] and afl] with PDF afo] ~ N(6,1) N(@,1) if@>0 afl] ~ { wen) if 0 <0. 2.6. FINDING THE MVU ESTIMATOR 21 Figure 2.4 Illustration of nonex- 9 istence of minimum variance unbi- ased estimator The two estimators 1 6 = 5 (eld) +211) 2 1 6 = = z 1” 370) + gall can easily be shown to be unbiased. To compute the variances we have that var(d) = 5 (var((o)) + var(z{t)) a 4 1 var(62) = gvar(z[0]) + gvar(alt}) so that a) #% ifo>0 var(6,) = , x ifa<0 and . 2 if6>0 var(@2) = on 4 f0<0. The variances are shown in Figure 2.4. Clearly, between these two estimators no MVU estimator exists. It is shown in Problem 3.6 that for 6 > 0 the minimum possible variance of an unbiased estimator is 18/36, while that for @ < 0 is 24/36. Hence, no single estimator can have a variance uniformly less than or equal to the minima shown in Figure 2.4. ° To conclude our discussion of existence we should note that it is also possible that there may not exist even a single unbiased estimator (see Problem 2.11). In this case any search for a MVU estimator is fruitless. 2.6 Finding the Minimum Variance Unbiased Estimator Even if a MVU estimator exists, we may not be able to find it. There is no known “turn-the-crank” procedure which will always produce the estimator. In the next few chapters we shall discuss several possible approaches. They are: 22 CHAPTER 2. MINIMUM VARIANCE UNBIASED ESTIMATION var(6) Figure 2.5 Cramer-Rao @ — lower bound on variance of unbiased estimator 1. Determine the Cramer-Rao lower bound (CRLB) and check to see if some estimator satisfies it (Chapters 3 and 4). 2. Apply the Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem (Chapter 5). 3. Further restrict the class of estimators to be not only unbiased but also linear. Then, find the minimum variance estimator within this restricted class (Chapter 6). Approaches 1 and 2 may produce the MVU estimator, while 3 will yield it only if the MVU estimator is linear in the data. The CRLB allows us to determine that for any unbiased estimator the variance must be greater than or equal to a given value, as shown in Figure 2.5. Tf an estimator exists whose variance equals the CRLB for each value of 0, then it must be the MVU estimator. In this case, the theory of the CRLB immediately yields the estimator. It may happen that no estimator exists whose variance equals the bound. 
Yet, a MVU estimator may still exist, as for instance in the case of 6 in Figure 2.5. Then, we must resort to the Rao-Blackwell-Lehmann-Scheffe theorem. This procedure first finds a sufficient statistic, one which uses all the data efficiently, and then finds a function of the sufficient statistic which is an unbiased estimator of 6. With a slight restriction of the PDF of the data this procedure will then be guaranteed to produce the MVU estimator. The third approach requires the estimator to be linear, a sometimes severe restriction, and chooses the best linear estimator. Of course, only for particular data sets can this approach produce the MVU estimator. 2.7 Extension to a Vector Parameter If 6 = [A 02. . 6,|7 is a vector of unknown parameters, then we say that an estimator 6 = [6, 6,...4,]” is unbiased if E(6)=0: a <0:» anz([n] n=0 is proposed. Find the a,’s so that the estimator is unbiased and the variance is minimized. Hint: Use Lagrangian multipliers with unbiasedness as the constraint equation. 2.7 Two unbiased estimators are Proposed whose variances satisfy var(6) < var(6). If both estimators are Gaussian, prove that Pr {|6- 9].> hc Pr {jé— 9] > eb for any € > 0. This says that the estimator with less variance is to be preferred since its PDF is more concentrated about the true value. 2.8 For the problem described in Example 2.1 show that as N 0, A+ A by using the results of Problem 2.3. To do so prove that gim,Pr{|A—Al>e} =0 for any € > 0. In this case the estimator A is said to be consistent. Investigate what happens if the alternative estimator A = wo z{n] is used instead. 2.9 This problem illustrates what happens to an unbiased esttimator when it undergoes a nonlinear transformation. In Example 2.1, if we choose to estimate the unknown Parameter 6 = A? by - 1 Nol O= (5 xu vn) can we say that the estimator is unbiased? What happens as N -+ oo? 2 > 2.10 In Example 2.1 assume now that in addition to A, the value of ¢? is also unknown. We wish to estimate the vector parameter Is the estimator unbiased? PROBLEMS 5 2.11 Given a single observation [0] from the distribution 1/(0, 1/6], it is desired to estimate 9. It is assumed that 6 > 0. Show that for an estimator 6= g{z[0]) to be unbiased we must have ; [ g(u) du =1. 0 Next prove that a function g cannot be found to satisfy this condition for all @ > 0. Chapter 3 Cramer-Rao Lower Bound 3.1 Introduction Being able to place a lower bound on the variance of any unbiased estimator proves to be extremely useful in practice. At best, it allows us to assert that an estimator is the MVU estimator. This will be the case if the estimator attains the bound for all values of the unknown parameter. At worst, it provides a benchmark against which we can compare the performance of any unbiased estimator. Furthermore, it alerts us to the physical impossibility of finding an unbiased estimator whose variance is less than the bound. The latter is often useful in signal processing feasibility studies. Although many such variance bounds exist [McAulay and Hofstetter 1971, Kendall and Stuart 1979, Seidman 1970, Ziv and Zakai 1969], the Cramer-Rao lower bound (CRLB) is by far the easiest to determine. Also, the theory allows us to immediately determine if an estimator exists that attains the bound. If no such estimator exists, then all is not lost since estimators can be found that attain the bound in an approximate sense, as described in Chapter 7. For these reasons we restrict our discussion to the CRLB. 
3.2 Summary

The CRLB for a scalar parameter is given by (3.6). If the condition (3.7) is satisfied, then the bound will be attained and the estimator that attains it is readily found. An alternative means of determining the CRLB is given by (3.12). For a signal with an unknown parameter in WGN, (3.14) provides a convenient means to evaluate the bound. When a function of a parameter is to be estimated, the CRLB is given by (3.16). Even though an efficient estimator may exist for θ, in general there will not be one for a function of θ (unless the function is linear). For a vector parameter the CRLB is determined using (3.20) and (3.21). As in the scalar parameter case, if condition (3.25) holds, then the bound is attained and the estimator that attains the bound is easily found. For a function of a vector parameter (3.30) provides the bound. A general formula for the Fisher information matrix (used to determine the vector CRLB) for a multivariate Gaussian PDF is given by (3.31). Finally, if the data set comes from a WSS Gaussian random process, then an approximate CRLB, which depends on the PSD, is given by (3.34). It is valid asymptotically, or as the data record length becomes large.

3.3 Estimator Accuracy Considerations

Before stating the CRLB theorem, it is worthwhile to expose the hidden factors that determine how well we can estimate a parameter. Since all our information is embodied in the observed data and the underlying PDF for that data, it is not surprising that the estimation accuracy depends directly on the PDF. For instance, we should not expect to be able to estimate a parameter with any degree of accuracy if the PDF depends only weakly upon that parameter, or in the extreme case, if the PDF does not depend on it at all. In general, the more the PDF is influenced by the unknown parameter, the better we should be able to estimate it.

Example 3.1 - PDF Dependence on Unknown Parameter

If a single sample is observed as

x[0] = A + w[0]

where w[0] ~ N(0, σ²), and it is desired to estimate A, then we expect a better estimate if σ² is small. Indeed, a good unbiased estimator is Â = x[0]. The variance is, of course, just σ², so that the estimator accuracy improves as σ² decreases. An alternative way of viewing this is shown in Figure 3.1, where the PDFs for two different variances are shown.

Figure 3.1 PDF dependence on unknown parameter: p_i(x[0] = 3; A) versus A for (a) σ₁ = 1/3 and (b) σ₂ = 1

They are

p_i(x[0];A) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left[ -\frac{1}{2\sigma_i^2}\left( x[0] - A \right)^2 \right]    (3.1)

for i = 1, 2. The PDF has been plotted versus the unknown parameter A for a given value of x[0]. If σ₁² < σ₂², then we should be able to estimate A more accurately based on p₁(x[0]; A). We may interpret this result by referring to Figure 3.1. If x[0] = 3 and σ₁ = 1/3, then as shown in Figure 3.1a, the values of A > 4 are highly unlikely. To see this we determine the probability of observing x[0] in the interval [x[0] − δ/2, x[0] + δ/2] = [3 − δ/2, 3 + δ/2] when A takes on a given value, or

Pr\left\{ 3 - \frac{\delta}{2} \le x[0] \le 3 + \frac{\delta}{2} \right\} = \int_{3-\delta/2}^{3+\delta/2} p_i(u;A)\, du

which for δ small is approximately δ p_i(x[0] = 3; A). This is, apart from the factor δ, just the PDF of Figure 3.1 viewed as a function of A. For σ₁ = 1/3 it is essentially zero for A > 4, so that those values of A can be eliminated from consideration. It might be argued that values of A in the interval 3 ± 3σ₁ = [2, 4] are viable candidates. For the PDF in Figure 3.1b there is a much weaker dependence on A. Here our viable candidates are in the much wider interval 3 ± 3σ₂ = [0, 6]. ◊
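The interval argument in Example 3.1 is easy to check numerically. The sketch below computes Pr{3 − δ/2 ≤ x[0] ≤ 3 + δ/2} for several candidate values of A under the two standard deviations σ₁ = 1/3 and σ₂ = 1; Python with SciPy's normal CDF is assumed, and δ = 0.1 is an arbitrary small interval width chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

delta, x0 = 0.1, 3.0
for sigma in (1.0 / 3.0, 1.0):
    print(f"sigma = {sigma:.3f}")
    for A in (3.0, 3.5, 4.0, 4.5, 5.0):
        # Pr{ x0 - delta/2 <= x[0] <= x0 + delta/2 } when the mean is A
        p = norm.cdf(x0 + delta / 2, loc=A, scale=sigma) - \
            norm.cdf(x0 - delta / 2, loc=A, scale=sigma)
        print(f"  A = {A:3.1f}: probability of observing x[0] near 3 is {p:.2e}")
```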
When the PDF is viewed as a function of the unknown parameter (with x fixed), it is termed the likelihood function. Two examples of likelihood functions were shown in Figure 3.1. Intuitively, the "sharpness" of the likelihood function determines how accurately we can estimate the unknown parameter. To quantify this notion observe that the sharpness is effectively measured by the negative of the second derivative of the logarithm of the likelihood function at its peak. This is the curvature of the log-likelihood function. In Example 3.1, if we consider the natural logarithm of the PDF

\ln p(x[0];A) = -\ln\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}\left( x[0] - A \right)^2

then the first derivative is

\frac{\partial \ln p(x[0];A)}{\partial A} = \frac{1}{\sigma^2}\left( x[0] - A \right)    (3.2)

and the negative of the second derivative becomes

-\frac{\partial^2 \ln p(x[0];A)}{\partial A^2} = \frac{1}{\sigma^2}.    (3.3)

The curvature increases as σ² decreases. Since we already know that the estimator Â = x[0] has variance σ², then for this example

var(\hat{A}) = \frac{1}{\; -\dfrac{\partial^2 \ln p(x[0];A)}{\partial A^2} \;}    (3.4)

and the variance decreases as the curvature increases. Although in this example the second derivative does not depend on x[0], in general it will. Thus, a more appropriate measure of curvature is

-E\left[ \frac{\partial^2 \ln p(x[0];A)}{\partial A^2} \right]    (3.5)

which measures the average curvature of the log-likelihood function. The expectation is taken with respect to p(x[0]; A), resulting in a function of A only. The expectation acknowledges the fact that the likelihood function, which depends on x[0], is itself a random variable. The larger the quantity in (3.5), the smaller the variance of the estimator.

3.4 Cramer-Rao Lower Bound

We are now ready to state the CRLB theorem.

Theorem 3.1 (Cramer-Rao Lower Bound - Scalar Parameter) It is assumed that the PDF p(x; θ) satisfies the "regularity" condition

E\left[ \frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta} \right] = 0    for all θ

where the expectation is taken with respect to p(x; θ). Then, the variance of any unbiased estimator θ̂ must satisfy

var(\hat{\theta}) \ge \frac{1}{\; -E\left[ \dfrac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2} \right] \;}    (3.6)

where the derivative is evaluated at the true value of θ and the expectation is taken with respect to p(x; θ). Furthermore, an unbiased estimator may be found that attains the bound for all θ if and only if

\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta} = I(\theta)\left( g(\mathbf{x}) - \theta \right)    (3.7)

for some functions g and I. That estimator, which is the MVU estimator, is θ̂ = g(x), and the minimum variance is 1/I(θ).

The expectation in (3.6) is explicitly given by

E\left[ \frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2} \right] = \int \frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\, p(\mathbf{x};\theta)\, d\mathbf{x}

since the second derivative is a random variable dependent on x. Also, the bound will depend on θ in general, so that it is displayed as in Figure 2.5 (dashed curve). An example of a PDF that does not satisfy the regularity condition is given in Problem 3.1. For a proof of the theorem see Appendix 3A. Some examples are now given to illustrate the evaluation of the CRLB.

Example 3.2 - CRLB for Example 3.1

For Example 3.1 we see that from (3.3) and (3.6)

var(\hat{A}) \ge \sigma^2    for all A.

Thus, no unbiased estimator can exist whose variance is lower than σ² for even a single value of A. But in fact we know that if Â = x[0], then var(Â) = σ² for all A. Since x[0] is unbiased and attains the CRLB, it must therefore be the MVU estimator. Had we been unable to guess that x[0] would be a good estimator, we could have used (3.7). From (3.2) and (3.7) we make the identification

θ = A
I(θ) = 1/σ²
g(x[0]) = x[0]

so that (3.7) is satisfied. Hence, Â = g(x[0]) = x[0] is the MVU estimator. Also, note that var(Â) = σ² = 1/I(θ), so that according to (3.6) we must have

I(\theta) = -E\left[ \frac{\partial^2 \ln p(x[0];A)}{\partial A^2} \right].

We will return to this after the next example. See also Problem 3.2 for a generalization to the non-Gaussian case. ◊
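Condition (3.7) can be verified symbolically for the single-observation Gaussian problem of Examples 3.1 and 3.2. The sketch below (Python with SymPy assumed) differentiates the log PDF and shows that the score factors as I(A)(g(x[0]) − A) with I(A) = 1/σ² and g(x[0]) = x[0], and that the negative second derivative is the constant 1/σ².

```python
import sympy as sp

x0, A = sp.symbols('x0 A', real=True)
sigma2 = sp.symbols('sigma2', positive=True)

# Log of the Gaussian PDF p(x[0]; A) with mean A and variance sigma2
log_p = -sp.log(sp.sqrt(2 * sp.pi * sigma2)) - (x0 - A)**2 / (2 * sigma2)

score = sp.diff(log_p, A)           # d ln p / dA
curvature = -sp.diff(log_p, A, 2)   # -d^2 ln p / dA^2

print(sp.simplify(score))       # (x0 - A)/sigma2  ->  I(A) * (g(x0) - A)
print(sp.simplify(curvature))   # 1/sigma2         ->  Fisher information I(A)
```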
Example 3.3 - DC Level in White Gaussian Noise

Generalizing Example 3.1, consider the multiple observations

x[n] = A + w[n]    n = 0, 1, ..., N−1

where w[n] is WGN with variance σ². To determine the CRLB for A,

p(\mathbf{x};A) = \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2}\left( x[n] - A \right)^2 \right] = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1}\left( x[n] - A \right)^2 \right].

Taking the first derivative,

\frac{\partial \ln p(\mathbf{x};A)}{\partial A} = \frac{\partial}{\partial A}\left[ -\ln\left[ (2\pi\sigma^2)^{N/2} \right] - \frac{1}{2\sigma^2} \sum_{n=0}^{N-1}\left( x[n] - A \right)^2 \right] = \frac{1}{\sigma^2} \sum_{n=0}^{N-1}\left( x[n] - A \right) = \frac{N}{\sigma^2}\left( \bar{x} - A \right)    (3.8)

where x̄ is the sample mean. Differentiating again,

\frac{\partial^2 \ln p(\mathbf{x};A)}{\partial A^2} = -\frac{N}{\sigma^2}

and noting that the second derivative is a constant, we have from (3.6)

var(\hat{A}) \ge \frac{\sigma^2}{N}    (3.9)

as the CRLB. Also, by comparing (3.7) and (3.8) we see that the sample mean estimator attains the bound and must therefore be the MVU estimator. Also, once again the minimum variance is given by the reciprocal of the constant N/σ² in (3.8). (See also Problems 3.3-3.5 for variations on this example.) ◊

We now prove that when the CRLB is attained

var(\hat{\theta}) = \frac{1}{I(\theta)}

where

I(\theta) = -E\left[ \frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2} \right].

From (3.6) and (3.7)

var(\hat{\theta}) = \frac{1}{\; -E\left[ \dfrac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2} \right] \;}

and

\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta} = I(\theta)\left( \hat{\theta} - \theta \right).

Differentiating the latter produces

\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2} = \frac{\partial I(\theta)}{\partial \theta}\left( \hat{\theta} - \theta \right) - I(\theta)

and taking the negative expected value yields

-E\left[ \frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2} \right] = -\frac{\partial I(\theta)}{\partial \theta}\left( E(\hat{\theta}) - \theta \right) + I(\theta) = I(\theta)

and therefore

var(\hat{\theta}) = \frac{1}{I(\theta)}.    (3.10)

In the next example we will see that the CRLB is not always satisfied.

Example 3.4 - Phase Estimation

Assume that we wish to estimate the phase φ of a sinusoid embedded in WGN or

x[n] = A cos(2πf₀n + φ) + w[n]    n = 0, 1, ..., N−1.

The amplitude A and frequency f₀ are assumed known (see Example 3.14 for the case when they are unknown). The PDF is

p(\mathbf{x};\phi) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1}\left[ x[n] - A\cos(2\pi f_0 n + \phi) \right]^2 \right\}.

Differentiating the log-likelihood function produces

\frac{\partial \ln p(\mathbf{x};\phi)}{\partial \phi} = -\frac{1}{\sigma^2} \sum_{n=0}^{N-1}\left[ x[n] - A\cos(2\pi f_0 n + \phi) \right] A\sin(2\pi f_0 n + \phi) = -\frac{A}{\sigma^2} \sum_{n=0}^{N-1}\left[ x[n]\sin(2\pi f_0 n + \phi) - \frac{A}{2}\sin(4\pi f_0 n + 2\phi) \right]

and

\frac{\partial^2 \ln p(\mathbf{x};\phi)}{\partial \phi^2} = -\frac{A}{\sigma^2} \sum_{n=0}^{N-1}\left[ x[n]\cos(2\pi f_0 n + \phi) - A\cos(4\pi f_0 n + 2\phi) \right].

Upon taking the negative expected value we have

-E\left[ \frac{\partial^2 \ln p(\mathbf{x};\phi)}{\partial \phi^2} \right] = \frac{A}{\sigma^2} \sum_{n=0}^{N-1}\left[ A\cos^2(2\pi f_0 n + \phi) - A\cos(4\pi f_0 n + 2\phi) \right] = \frac{A^2}{\sigma^2} \sum_{n=0}^{N-1}\left[ \frac{1}{2} + \frac{1}{2}\cos(4\pi f_0 n + 2\phi) - \cos(4\pi f_0 n + 2\phi) \right] \approx \frac{NA^2}{2\sigma^2}

since

\frac{1}{N} \sum_{n=0}^{N-1} \cos(4\pi f_0 n + 2\phi) \approx 0

for f₀ not near 0 or 1/2 (see Problem 3.7). Therefore,

var(\hat{\phi}) \ge \frac{2\sigma^2}{NA^2}.

In this example the condition for the bound to hold is not satisfied. Hence, a phase estimator does not exist which is unbiased and attains the CRLB. It is still possible, however, that an MVU estimator may exist. At this point we do not know how to determine whether an MVU estimator exists, and if it does, how to find it. The theory of sufficient statistics presented in Chapter 5 will allow us to answer these questions. ◊

Figure 3.2 Efficiency vs. minimum variance: (a) θ̂₁ efficient and MVU, (b) θ̂₁ MVU but not efficient

An estimator which is unbiased and attains the CRLB, as the sample mean estimator in Example 3.3 does, is said to be efficient in that it efficiently uses the data. An MVU estimator may or may not be efficient. For instance, in Figure 3.2 the variances of all possible estimators (for purposes of illustration there are three unbiased estimators) are displayed. In Figure 3.2a, θ̂₁ is efficient in that it attains the CRLB. Therefore, it is also the MVU estimator. On the other hand, in Figure 3.2b, θ̂₁ does not attain the CRLB, and hence it is not efficient. However, since its variance is uniformly less than that of all other unbiased estimators, it is the MVU estimator.
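The approximation used in Example 3.4, in which the sum of the cos(4πf₀n + 2φ) terms is neglected, is easy to check numerically. The sketch below (Python with NumPy assumed; the parameter values are arbitrary, with f₀ kept away from 0 and 1/2) compares the exact negative expected second derivative, which equals (A²/σ²) Σ sin²(2πf₀n + φ) when the cosine terms are retained, with the approximate Fisher information NA²/(2σ²), and the corresponding CRLBs.

```python
import numpy as np

A, sigma2, N, f0, phi = 1.0, 1.0, 20, 0.2, 0.4   # arbitrary illustrative values
n = np.arange(N)

I_exact = (A**2 / sigma2) * np.sum(np.sin(2 * np.pi * f0 * n + phi)**2)
I_approx = N * A**2 / (2 * sigma2)   # neglects the sum of cos(4*pi*f0*n + 2*phi) terms

print("exact CRLB   :", 1.0 / I_exact)
print("approx CRLB  :", 2 * sigma2 / (N * A**2))
print("neglected sum:", np.sum(np.cos(4 * np.pi * f0 * n + 2 * phi)))
```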
Al- though (3.6) is usually more convenient for evaluation, the alternative form is sometimes useful for theoretical work. It follows from the identity (see Appendix 3A) Alnp(x;0)\7] _ 8? In p(x; 8) oar) |= ee] Gay so that 1 var(6) > (3.12) B (eae | (see Problem 3.8). The denominator in (3.6) is referred to as the Fisher information I(6) for the data x or 8 inplx:8) in p(x; =-E—|—— |}. 3.13 1@) = 6 [AEBS (3.13) As we saw previously, when the CRLB is attained, the variance is the reciprocal of the Fisher information. Intuitively, the more information, the lower the bound. It has the essential properties of an information measure in that it is 3.5. GENERAL CRLB FOR SIGNALS IN WGN 35 1, nonnegative due to (3.11) 2. additive for independent observations. The latter property leads to the result that the CRLB for N IID observations is 1/N times that for one observation. To verify this, note that for independent observations N-1 In p(x; 8) = > In p(z{n]; 6). n=0 This results in oe 0g and finally for identically distributed observations 1(8) = Ni(@) -2 ("ae x; iO) -¥ E [* In p(a[nJ; 6) n=0 where 362 is the Fisher information for one sample. For nonindependent samples we might expect that the information will be less than Ni(6), as Problem 3.9 illustrates. For completely dependent samples, as for example, z{0] = z[1] = --- = 2[N—1], we will have I(@) = (0) (see also Problem 3.9). Therefore, additional observations carry no information, and the CRLB will not decrease with increasing data record length. (0) = -2|* In p(z{n]; my 3.5 General CRLB for Signals in White Gaussian Noise Since it is common to assume white Gaussian noise, it is worthwhile to derive the CRLB for this case. Later, we will extend this to nonwhite Gaussian noise and a vector parameter as given by (3.31). Assume that a deterministic signal with an unknown parameter @ is observed in WGN as g[n] = s(n; 0] + w[n] n=0,1,...,.N—1. The dependence of the signal on @ is explicitly noted. The likelihood function is 1 ia 2 p(x; 0) = Qroy {as dee — s(n; 6]) | : Differentiating once produces N-1 ae ga Dalen ~ stay eae 36 CHAPTER 3. CRAMER-RAO LOWER BOUND and a second differentiation results in ? In p(x; RS 2 sin; in; ]\? Mingle 0) 2S? {Col ~ aia = seeel — (25) \. 062 00 Taking the expected value yields E a Inp(x,6)\ _ 1 LS (As{n34] 2 00? ~ o? £4\ 80 ne =z so that finally 2 var(8) > =~ —. (3.14) as{n; 6] x (Me) The form of the bound demonstrates the importance of the signal dependence on 0. Signals that change rapidly as the unknown parameter changes result in accurate esti- mators. A simple application of (3.14) to Example 3.3, in which s[n;6] = 0, produces a CRLB of o?/N. The reader should also verify the results of Example 3.4. As a final example we examine the problem of frequency estimation. Example 3.5 - Sinusoidal Frequency Estimation We assume that the signal is sinusoidal and is represented as s[n; fo] = Acos(2m fon + $) 0 —wo 2 . (3.15) A? > [2nnsin(2z fon + ¢)]? n=0 The CRLB is plotted in Figure 3.3 versus frequency for an SNR of A?/o? = 1, a data record length of N = 10, and a phase of ¢ = 0. It is interesting to note that there appear to be preferred frequencies (see also Example 3.14) for an approximation to (3.15)). Also, as fo + 0, the CRLB goes to infinity. This is because for fo close to zero a slight change in frequency will not alter the signal significantly. °° 3.6. 
TRANSFORMATION OF PARAMETERS 37 5.0 4.5 4.0 3.5 3.0 2.5 2.0 ‘Cramer-Rao lower bound 15 1.0 t t t t T T t t 0.00 0.05 0.10 015 0.20 0.25 0.30 0.35 0.40 045 0.50 Frequency Figure 3.3 Cramer-Rao lower bound for sinusoidal frequency estimation 3.6 Transformation of Parameters It frequently occurs in practice that the parameter we wish to estimate is a function of some more fundamental parameter. For instance, in Example 3.3 we may not be interested in the sign of A but instead may wish to estimate A? or the power of the signal. Knowing the CRLB for A, we can easily obtain it for A? or in general for any function of A. As shown in Appendix 3A, if it is desired to estimate a = g(0), then the CRLB is ( ) 00 é) > —_———_- .- . var(a) 2 - [* In p(x; 4 (8.18) 66? For the present example this becomes a = g(A) = A? and a. (2A? 4A2o? 2)\>—= : . var(A?) > Njo? W (3.17) Note that in using (3.16) the CRLB is expressed in terms of @. We saw in Example 3.3 that the sample mean estimator was efficient for A. It might be supposed that 2? is efficient for A?. To quickly dispel this notion we first show that 2? is not even an unbiased estimator. Since # ~ N(A,0?/N) o E(@) = E*(#) + var(#) = A? + W x A’. (3.18) Hence, we immediately conclude that the efficiency of an estimator is destroyed by a nonlinear transformation. That it is maintained for linear (actually affine) transfor- mations is easily verified. Assume that an efficient estimator for @ exists and is given 38 CHAPTER 3. CRAMER-RAO LOWER BOUND by 6. It is desired to estimate g(@) = a8 +}. As our estimator of g(@), we choose g(6) = 9(0) =a6+. Then, E(ab+b) = aE(6)+b=a0+6 = g(9) so that 9(6) is unbiased. The CRLB for 9(6), is from (3.16), var(gi@) > Zr 2 (20) var(6) = a’var(6). But var(g(@)) = var(a + b) = a?var(§), so that the CRLB is achieved. Although efficiency is preserved only over linear transformations, it is approximately maintained over nonlinear transformations if the data record is large enough. This has great practical significance in that we are frequently interested in estimating functions of parameters. To see why this property holds, we return to the previous example of estimating A? by £”. Although 2? is biased, we note from (3.18) that Z? is asymptotically unbiased or unbiased as N -» oo. Furthermore, since ~ N(A,o/N), we can evaluate the variance var (2) = E(&*) — E°(2”) by using the result that if € ~ M/(u,07), then E(e’) we t+o? E(é*) = pt + 6u?0? + 30% and therefore var(é?) = E(€*) — E°(¢’) 420? + 20%. For our problem we have then - 4A%o? 204 var(#?) = Nt ONE: (3.19) Hence, as N — 00, the variance approaches 4A?o?/N, the last term in (3.19) converging to zero faster than the first. But this is just the CRLB as given by (3.17). Our assertion that 2? is an asymptotically efficient estimator of A? is verified. Intuitively, this situation occurs due to the statistical linearity of the transformation, as illustrated in Figure 3.4. As N increases, the PDF of becomes more concentrated about the mean A. Therefore, 3.7. EXTENSION TO A VECTOR PARAMETER 39 A (a) Small N (b) Large N Figure 3.4 Statistical linearity of nonlinear transformations the values of Z that are observed lie in a small interval about Z = A (the +3 standard deviation interval is displayed). Over this small interval the nonlinear transformation is approximately linear. Therefore, the transformation may be replaced by a linear one since a value of & in the nonlinear region rarely occurs. 
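This statistical linearity is easy to observe numerically. The sketch below (assumed values; NumPy) draws the sample mean from its exact N(A, sigma^2/N) distribution and compares the Monte Carlo variance of its square with the exact expression (3.19) and with the transformed CRLB (3.17); the gap closes as N grows.

```python
# Minimal sketch (assumed values): variance of xbar^2 versus (3.19) and (3.17).
import numpy as np

rng = np.random.default_rng(2)
A, sigma2, n_trials = 2.0, 1.0, 200_000

for N in (10, 100, 1000):
    xbar = A + rng.normal(scale=np.sqrt(sigma2 / N), size=n_trials)  # xbar ~ N(A, sigma2/N)
    mc_var = (xbar ** 2).var()
    exact = 4 * A ** 2 * sigma2 / N + 2 * sigma2 ** 2 / N ** 2       # (3.19)
    crlb = 4 * A ** 2 * sigma2 / N                                   # (3.17)
    print(N, mc_var, exact, crlb)
```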
In fact, if we linearize g about A, we have the approximation a(2) = 9(A) + A) a — A), It follows that, to within this approximation, Elg(@)] = g(A) = 4? or the estimator is unbiased (asymptotically). Also, dg(A)]? [we (2.A)?o? N 4A%o? N var[9(z)] so that the estimator achieves the CRLB (asymptotically). Therefore, it is asymp- totically efficient. This result also yields insight into the form of the CRLB given by (3.16). 3.7 Extension to a Vector Parameter We now extend the results of the previous sections to the case where we wish to estimate a vector parameter @ = [6 02...4,]7. We will assume that the estimator 6 is unbiased 40 CHAPTER 3. CRAMER-RAO LOWER BOUND as defined in Section 2.7. The vector parameter CRLB will allow us to place a bound on the variance of each element. As derived in Appendix 3B, the CRLB is found as the i, i] element of the inverse of a matrix or var(6,) > (6), (3.20) where I(@) is the p x p Fisher information matriz. The latter is defined by _ 8" In p(x; 6) WO), =~ [SR | (3.21) for i = 1,2,...,p;j7 = 1,2,...,p. In evaluating (3.21) the true value of @ is used. Note that in the scalar case (p = 1), 1(@) = (6) and we have the scalar CRLB. Some examples follow. Example 3.6 - DC Level in White Gaussian Noise (Revisited) We now extend Example 3.3 to the case where in addition to A the noise variance o? is also unknown. The parameter vector is @ = [Ao?]7, and hence p = 2. The 2 x 2 Fisher information matrix is _E |= In p(x; )) _E [* | 0A? Ada? 1(6) = _B [7 In p(x; 2) _B [- In p(x; 8) 00?0A 80? It is clear from (3.21) that the matrix is symmetric since the order of partial differenti- ation may be interchanged and can also be shown to be positive definite (see Problem 3.10). The log-likelihood function is, from Example 3.3, N-) In p(x; 0) = -3 In2n- om a a Yin} — A). n=0 The derivatives are easily found as Oinp(x@) R _ a OT 2 ln A) 8m p(x; 6) N 1 BaF Og? * Fo De eln| - AY” @inp(x;0) ON aA? ~ Ga @ In p(x; 8) 1% BAda? ~~ at Dy (elnl— A) 8? In p(x; 8) N 1 Get gt ~ 8 DCI ~ AY. n=0 3.7. EXTENSION TO A VECTOR PARAMETER 41 Upon taking the negative expectations, the Fisher information matrix becomes N B 0 1(6) = N os 2o4 Although not true in general, for this example the Fisher information matrix is diagonal and hence easily inverted to yield var(A) lv Vv ‘2[¥ 21% var(o?) Note that the CRLB for A is the same as for the case when o? is known due to the diagonal nature of the matrix. Again this is not true in general, as the next example illustrates. ° Example 3.7 - Line Fitting Consider the problem of line fitting or given the observations a{n] = A+ Bn+ vn] n=0,1,...,.N—-1 where wn] is WGN, determine the CRLB for the slope B and the intercept A. The parameter vector in this case is @ = [A BJ. We need to first compute the 2 x 2 Fisher information matrix, _F [* ome) -F (= In p(x; 2) 16) OA? OAdB - _E In p(x; 6) _B eer) ~ BOA OB? The likelihood function is 1 1 -9)=- — _ _t — A— Bn)? p(x) ane ge3 Lin - A | from which the derivatives follow as . N-1 Sinnts:6) 9) _ 5 So (a{n] - A- Bn) . N-1 Sine 8) 9) = = y (z[n] — A- Bn)n n=0 42 CHAPTER 3. CRAMER-RAO LOWER BOUND and PF inp(x,0) _N 6A? ~ 0? In p(x; @) _ iS dAOB ao? 
8 In p(x; 8) i OB ge » n Since the second-order derivatives do not depend on x, we have immediately that N-1 N 1 dn (0) = g2 | Nal N=I don ow n=0 n=0 Ny N(N -1) _ 1 2 #1 N(N-1) N(N-1)(2N -1) 2 6 where we have used the identities in N(N -1) n=0 2 N-1 > ne = NWW-DQN = 1) (3.22) n=0 6 Inverting the matrix yields 2(2N — 1) 6 N(IN+1) — N(N41 11) <9: NAD “NWSI 6 12 “N(N+1) N(W?-1 It follows from (3.20) that the CRLB is ; 2(2N — 1)o? ver(A) 2 “Na var(B) > 120? N(N?-1)" 3.7, EXTENSION TO A VECTOR PARAMETER 43 z[n] (a) A=0,B=0toA=1,B=0 (b) A=0,B=0toA=0,B=1 Figure 3.5 Sensitivity of observations to parameter changes—no noise Some interesting observations follow from examination of the CRLB. Note first that the CRLB for A has increased over that obtained when B is known, for in the latter case we have var(A) > -—_ te : — BE @ inp(x;A)] = N oA? and for N > 2, 2(2N — 1)/(N +1) > 1. This is a quite general result that asserts that the CRLB always increases as we estimate more parameters (see Problems 3.11 and 3.12). A second point is that CRLB(A) _ (2N-1)(N=1) . | CRLB(B) 6 for N > 3. Hence, B is easier to estimate, its CRLB decreasing as 1/N® as opposed to the 1/N dependence for the CRLB of A. These differing dependences indicate that x[n] is more sensitive to changes in B than to changes in A. A simple calculation reveals Aga] rte 4 = AA ox tn Az{n| AB =nAB. Changes in B are magnified by n, as illustrated in Figure 3.5. This effect is reminiscent of (3.14), and indeed a similar type of relationship is obtained in the vector parameter case (see (3.33)). See Problem 3.13 for a generalization of this example. ° As an alternative means of computing the CRLB we can use the identity On p(x; 8) ors 8) ome) 0? In p(x; 8) »| 06, 00,30; (8.28) 44 CHAPTER 3. CRAMER-RAO LOWER BOUND as shown in Appendix 3B. The form given on the right-hand side is usually easier to evaluate, however. ‘We now formally state the CRLB theorem for a vector parameter. Included in the theorem are conditions for equality. The bound is stated in terms of the covariance matrix of 6, denoted by Cj, from which (3.20) follows. Theorem 3.2 (Cramer-Rao Lower Bound - Vector Parameter) It is assumed that the PDF p(x;@) satisfies the “regularity” conditions On p(x; 8) = 190 where the expectation is taken with respect to p(x;8). Then, the covariance matrix of any unbiased estimator 8 satisfies | =0 for all @ c; -11(@) 20 (3.24) where > 0 is interpreted as meaning that the matric is positive semidefinite. The Fisher information matrix 1(8) is given as N@)],; =-# "ie | where the derivatives are evaluated at the true value of @ and the expectation is taken with respect to p(x;@). Furthermore, an unbiased estimator may be found that attains the bound in that Cg =1-1(@) af and only if ln p(x; 8 Otnwss8) _ 1(6)(a(x) ~ 9) (8:25) for some p-dimensional function § and some p x p matrix 1. That estimator, which is the MVU estimator, is @ = g(x), and tts covariance matrix is 1" (@). The proof is given in Appendix 3B. That (3.20) follows from (3.24) is shown by noting that for a positive semidefinite matrix the diagonal elements are nonnegative. Hence, [Cs ~ ro. 20 and therefore . var(6,) = [Ca], 2 1). (3.26) Additionally, when equality holds or Cz = 1-*(6), then (3.26) holds with equality also. The conditions for the CRLB to be attained are of particular interest since then 6 = g(x) is efficient and hence is the MVU estimator. An example of equality occurs in Example 3.7. There we found that On p(x; 8) Olnp(x;@) aA 00 i On p(x; 8) (3.27) oB 3.8. 
VECTOR PARAMETER CRLB FOR TRANSFORMATIONS 45 182 a Yh -—A- Bn) = \ nop . (3.28) =3 (eln| - A- Bn)n 2 o n=0 Although not obvious, this may be rewritten as N N(N ~1) On p(x; 8) - o 20? A-A 3 00 N(N-1) N(N-1)(2N ~1) [3-3 02) 20? 60? where j= 20N =) 6 oR = wore Sahn - nora 8 = —yarpy D y sh) + aay 5 nal Hence, the conditions for equality are satisfied and [A BY? is an efficient and therefore MVU estimator. Furthermore, the matrix in (3.29) is the inverse of the covariance matrix. If the equality conditions hold, the reader may ask whether we can be assured that 6 is unbiased. Because the regularity conditions E [eneee)) =0 are always assumed to hold, we can apply them to (3.25). This then yields E[g(x)| = E(0) = In finding MVU estimators for a vector parameter the CRLB theorem provides a powerful tool. In particular, it allows us to find the MVU estimator for an important class of data models. This class is the linear model and is described in detail in Chap- ter 4. The line fitting example just discussed is a special case. Suffice it to say that if we can model our data in the linear model form, then the MVU estimator and its performance are easily found. 3.8 Vector Parameter CRLB for Transformations The discussion in Section 3.6 extends readily to the vector case. Assume that it is desired to estimate a = g(@) for g, an r-dimensional function. Then, as shown in Appendix 3B r 28) 8" 5 g (3.30) 46 CHAPTER 3, CRAMER-RAO LOWER BOUND where, as before, > 0 is to be interpreted as positive semidefinite. In (3.30), 0g(@)/08 is the r x p Jacobian matrix defined as (0) On (0) 8g: (8) a0; 00, °° ~~ «OB, 8g2(8) 0g2(8) 8g2(8) 58) =| dA 30, °° OB, 80,0) Oge(0) A9-(8) 00; 30, °° = OD, Example 3.8 - CRLB for Signal-to-Noise Ratio Consider a DC level in WGN with A and o? unknown. We wish to estimate Az a= which can be considered to be the SNR for a single sample. Here @ = [Ao?]” and 9(@) = 63/0. = A?/o?. Then, as shown in Example 3.6, No 1@)=|° WN 0 — 204 The Jacobian is 89(6) 36 [ af) | — | 29) 010) | so that dg() 1 oe 1 ) 208 A? vg = [ -] eo 2|% % | | Finally, since @ is a scalar 3.9. CRLB FOR THE GENERAL GAUSSIAN CASE 47 As discussed in Section 3.6, efficiency is maintained over linear transformations =g(0)=A0+b where A is an r x p matrix and b is an r x 1 vector. If &@ = AO +b, and 6 is efficient or Cj =1-1(0), then E(&) = Ad+b=a so that & is unbiased and Ca = AC;A™ = AI“*(@)AT 2800) 2800 the latter being the CRLB. For nonlinear transformations efficiency is maintained only as N + oo. (This assumes that the PDF of 6 becomes concentrated about the true value of @ as N — oo or that 6 is consistent.) Again this is due to the statistical linearity of g(@) about the true value of @. 3.9 CRLB for the General Gaussian Case It is quite convenient at times to have a general expression for the CRLB. In the case of Gaussian observations we can derive the CRLB that generalizes (3.14). Assume that x ~ N (u(8), C(8)) so that both the mean and covariance may depend on @. Then, as shown in Appendix 3C, the Fisher information matrix is given by won, = [2 cro) [RO =] + st ey (3.31) where a{u()h a6, apa) _ | AlH()l2 36, | 8 Als(8)Lw 06; 48 CHAPTER 3. CRAMER-RAO LOWER BOUND and ACO)n ACO)h2 a[C(®)hin 06; 00; ~ 06; ac(e) ACO) A[CO)z2—— A{C(®)aw 00; = 36: 36: ; 06; ACOlwr ACO)Iw2 ACB) ww 00; 00; ~ 06; For the scalar parameter case in which x~N(u(6), C(6)) this reduces to 10) = [mer cre [2H +3((¢ (6) 0) (3.32) which generalizes (3.14). We now illustrate the computation with some examples. 
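Before these examples, it is worth noting that transformation results such as Example 3.8 are easily verified numerically. The sketch below (arbitrary A, sigma^2, and N; NumPy) applies (3.30) to the Fisher information matrix of Example 3.6; the closed form printed for comparison, (4*alpha + 2*alpha^2)/N with alpha = A^2/sigma^2, follows by multiplying out the matrix product and is included here only as a check, not quoted from the text.

```python
# Minimal sketch (assumed values) of the transformation CRLB for the SNR of Example 3.8.
import numpy as np

A, sigma2, N = 1.5, 0.8, 100
I = np.diag([N / sigma2, N / (2 * sigma2 ** 2)])           # I(theta), theta = [A, sigma^2]
J = np.array([[2 * A / sigma2, -A ** 2 / sigma2 ** 2]])    # Jacobian of g(theta) = A^2/sigma^2

crlb = (J @ np.linalg.inv(I) @ J.T)[0, 0]                  # (3.30) for a scalar alpha
alpha = A ** 2 / sigma2
print(crlb, (4 * alpha + 2 * alpha ** 2) / N)              # same value
```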
Example 3.9 - Parameters of a Signal in White Gaussian Noise Assume that we wish to estimate a scalar signal parameter @ for the data set z[n] = s[n;0]+u[n] n=0,1,...,.N-1 where w{n] is WGN. The covariance matrix is C = o?I and does not depend on 6. The second term in (3.32) is therefore zero. The first term yields co = 2 eael Pa oe EC) (2a ay: which agrees with (3.14). ° it ° 1% 2 n ll Met oa Generalizing to a vector signal parameter estimated in the presence of WGN, we have from (3.31) ° op(9)]" op(@) wr, = [Re | sat [ | 3.9. CRLB FOR THE GENERAL GAUSSIAN CASE 49 which yields 1 As(n; 6] As{n; 6 wae 8B, 80, (3:33) WO); = as the elements of the Fisher information matrix. Example 3.10 - Parameter of Noise Assume that we observe n=0,1,...,.N-1 where w[n] is WGN with unknown variance 6 = 0”. Then, according to (3.32), since C(o?) = 071, we have I(o?) = I NiP NIK Ne on 3 SH a “ons aH wo a ~S os 2 204 which agrees with the results in Example 3.6. A slightly more complicated example follows. ° Example 3.11 - Random DC Level in WGN Consider the data z[n] = A+ w[n] n=0,1,...,.N-1 where w[n] is WGN and A, the DC level, is a Gaussian random variable with zero mean and variance 04. Also, A is independent of w[n]. The power of the signal or variance o4, is the unknown parameter. Then, x = [z[0] 2[1]...2[N —1]]” is Gaussian with zero mean and an N x N covariance matrix whose [i,j] element is (Coals = Flet- tel — i) = BlA+wli- (A+ vl -1)) = 044075. Therefore, C(o4,) = 04117 +0°E 50 CHAPTER 3. CRAMER-RAO LOWER BOUND where 1 =([11...1]7. Using Woodbury’s identity (see Appendix 1), we have 1 2 CO) =a (1 ain’) . ~ a2 + Noy Also, since aC(o4) T =11 80%, we have that ac(o3) 1 C02 TA) Tr (ea) 002, oF+ Noa! Substituting this in (3.32) produces 1 1 ? 2) — = ft T1T I(o3) att l(a ; wat) 1711 N 1 2 aT = 3 (aera) ee = 1(_N_Y ~ 2\0?4+No% so that the CRLB is 2 2 var(o2) > 2 (4 + 7) Note that even as N + co, the CRLB does not decrease below 204. This is because each additional data sample yields the same value of A (see Problem 3.14). ° 3.10 Asymptotic CRLB for WSS Gaussian Random Processes At times it is difficult to analytically compute the CRLB using (3.31) due to the need to invert the covariance matrix. Of course, we can always resort to a computer evaluation. An alternative form which can be applied to Gaussian processes that are WSS is very useful. It is easily computed and provides much insight due to its simplified form. The principal drawback is that strictly speaking it is valid only as N + o0 or asymptotically. In practice it provides a good approximation to the true CRLB if the data record length N is much greater than the correlation time of the process. The correlation time is defined as the maximum lag k of the ACF rz2[k] = Elz[n]z[n + k]] for which the ACF is essentially nonzero. Hence, for processes with broad PSDs the approximation will be good for moderate length data records, while for narrowband processes longer length data records are required. 3.10. ASYMPTOTIC CRLB 51 ats) hh fa 1 we we Paal fi fe) ford 4 2 Femin = fr femax = ; — fa Figure 3.6 Signal PSD for center frequency estimation As shown in Appendix 3D, the elements of the Fisher information are approximately (as N - 00) _N ? dn P,.(f;0) On Pro(f;8) HO); = = [ oC df (3.34) where P,(f;0) is the PSD of the process with the explicit dependence on @ shown. It is assumed that the mean of z[n] is zero. 
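A quick sanity check of (3.34): for white noise with P_xx(f; sigma^2) = sigma^2, the asymptotic bound should reduce to the exact CRLB 2*sigma^4/N obtained below in Example 3.10. The sketch (NumPy; the finite-difference step and the frequency grid are implementation choices, not part of the text) evaluates the integral numerically.

```python
# Minimal numerical sketch of (3.34) for a scalar PSD parameter, checked on white noise.
import numpy as np

def asymptotic_fisher(psd, theta, N, d=1e-6):
    f = np.linspace(-0.5, 0.5, 4001)
    dlnP = (np.log(psd(f, theta + d)) - np.log(psd(f, theta - d))) / (2 * d)
    # the grid spans an interval of length one, so the mean approximates the integral
    return N / 2 * np.mean(dlnP ** 2)

N, sigma2 = 100, 1.5
I = asymptotic_fisher(lambda f, s2: s2 * np.ones_like(f), sigma2, N)
print("asymptotic CRLB:", 1 / I)               # ~ 2*sigma2^2/N
print("exact CRLB     :", 2 * sigma2 ** 2 / N) # Example 3.10
```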
This form, which is somewhat reminiscent of (3.33), allows us to examine the accuracy with which PSD, or equivalently, covariance parameters may be estimated. Example 3.12 - Center Frequency of Process A typical problem is to estimate the center frequency f. of a PSD which otherwise is known. Given Paoli fe) = QF — fe) + Q(-f - fe) +07 we wish to determine the CRLB for f, assuming that Q(f) and o? are known. We tay view the process as consisting of a random signal embedded in WGN. The center frequency of the signal PSD is to be estimated. The real function Q(f) and the signal PSD P,,(f; f-) are shown in Figure 3.6. Note that the possible center frequencies are constrained to be in the interval [f;,1/2 — f.]. For these center frequencies the signal PSD for f > 0 will be contained in the {0,1/2] interval. Then, since 8 = f. is a scalar, 52 CHAPTER 3. CRAMER-RAO LOWER BOUND we have from (3.34) Fy ays But Ain Peo(fife) _ AIn[Q(F - fe) + O(-f - fe) +07) Ofe of OQ(F = fo) , 8Q(-f ~ fe) Of. Of ~ OF =f) + QCF = f+ This is an odd function of f, so that [Cape anal Cet a 2 Also, for f > 0 we have that Q(—f — fc) = 0, and thus its derivative is zero due to the assumption illustrated in Figure 3.6. It follows that 1 eat 2 vf (a= ie) 4 1 ( BEBE) Y vi (aS a ee sf, (ates) a where we have let f’ = f—f-. But 1/2—f. > 1/2—femae = f2 and -fe < fem waa so that we may change the limits of integration to the interval [—1/2, 1/2]. Thus, 1 y/ 7p xf, (ane) a 1 = vf (amet soy var(f.) > var(f.) > 3.11. SIGNAL PROCESSING EXAMPLES 53 QUf) = exp [-3 (5) where 7 < 1/2, so that Q(f) is bandlimited as shown in Figure 3.6. Then, if Q(f) > o”, we have approximately As an example, consider 5 1 1204 var(f.) > rp =W Nf oa -4 oF Narrower bandwidth (smaller 07) spectra yield lower bounds for the center frequency since the PSD changes more rapidly as f, changes. See also Problem 3.16 for another example. ° 3.11 Signal Processing Examples We now apply the theory of the CRLB to several signal processing problems of interest. The problems to be considered and some of their areas of application are: 1. Range estimation - sonar, radar, robotics 2. Frequency estimation - sonar, radar, econometrics, spectrometry 3. Bearing estimation - sonar, radar 4. Autoregressive parameter estimation - speech, econometrics. These examples will be revisited in Chapter 7, in which actual estimators that asymp- totically attain the CRLB will be studied. Example 3.13 - Range Estimation In radar or active sonar a signal pulse is transmitted. The round trip delay 7 from the transmitter to the target and back is related to the range R as T = 2R/c, where c is the speed of propagation. Estimation of range is therefore equivalent to estimation of the time delay, assuming that c is known. If s(t) is the transmitted signal, a simple ~ model for the received continuous waveform is z(t) =s(t—7)+u(t) O Rosin nal \? ~( or ) - “ye (ae - m)) n=no 87 oe dt ana) n=no 3.11. SIGNAL PROCESSING EXAMPLES 55 o Ee 2 n=0 sean) since T = moA. Assuming that A is small enough to approximate the sum by an integral, we have 2 var (fo) > —— 7 —_, T, 2° Ly Finally, noting that A = 1/(2B) and o? = NoB, we have No var(fo) > ————2___ (3.37) T, 2 rey 0 dt An alternative form observes that the energy € is Ty é= [ 8(t) dt 0 which results in var(7o) > z (3.38) Nope where nm 4 () 2 . S| nh Ge) # (Tr) ‘ [ * 2 t)dt oO It can be shown that E/(No/2) is a SNR [Van Trees 1968]. Also, F? 
is a measure of the bandwidth of the signal since, using standard Fourier transform properties, / " QnFY|S(F)PaF a (3.39) | |S(F) Par 00 where F' denotes continuous-time frequency, and S(F) is the Fourier transform of s(t). In this form it becomes clear that F? is the mean square bandwidth of the signal. From (3.38) and (3.39), the larger the mean square bandwidth, the lower the CRLB. For instance, assume that the signal is a Gaussian pulse given by s(t) = exp(—du2(t — T,/2)*) and that s(t) is essentially nonzero over the interval (0,7,]. Then |S(#)| = 56 CHAPTER 3. CRAMER-RAO LOWER BOUND (op/V2n) exp(—2n?F?/o2) and F? = 03/2. As the mean square bandwidth increases, the signal pulse becomes narrower and it becomes easier to estimate the time delay. Finally, by noting that R = cro/2 and using (3.16), the CRLB for range is 2 7/4 var(R) >. (3.40) 2 Noe 2 Example 3.14 - Sinusoidal Parameter Estimation In many fields we are confronted with the problem of estimating the parameters ofa sinusoidal signal. Economic data which are cyclical in nature may naturally fit such a model, while in sonar and radar physical mechanisms cause the observed signal to be sinusoidal. Hence, we examine the determination of the CRLB for the amplitude A, frequency fo, and phase ¢ of a sinusoid embedded in WGN. This example generalizes Examples 3.4 and 3.5. The data are assumed to be z[n] = Acos(2mfon + ¢) + w[n] n=0,1,...,.N-1 where A > 0 and 0 < fy < 1/2 (otherwise the parameters are not identifiable, as is verified by considering A = 1,¢ = 0 versus A = —1,¢ = mor fo = O with A= 1/2,¢=0 versus A = 1/2, ¢ = 7/4). Since multiple parameters are unknown, we use (3.33) 1 As{n; 6) Os[n; 9] a? 00; 00; n=0 (1()).; = for 9 = [A fod]". In evaluating the CRLB it is assumed that fo is not near 0 or 1/2, which allows us to make certain simplifications based on the approximations [Stoica 1989] (see also Problem 3.7): N-} am > n'sin(4afon +26) ~ 0 n=0 1 X22 Nai s n'cos(4nfon+2¢) ~ 0 n=0 for i= 0,1,2. Using these approximations and letting a = 27 fon + , we have 1S, 1871,1 N HO) = 2 0m Fe (9+ 90") © G8 182 ANZ TO@he = -= Yo Atmncosasina = -— Y) nsin2a = 0 n=0 n=0 3.11. SIGNAL PROCESSING EXAMPLES 57 N-1 W@hs = —gr Ly Acosasina = 35 AS singe x0 n=0 N-1 = 2 2 _ (QnA)? 2f1_1 (1(@)}2 = a LA (2nn)? sin? a = 2 de a7 3 00820 (20 A)? <2 ~ "O62 L” 2N- 1@))23 = = Ly A?2nnsin? a — oun n=0 n=0 N-} NA? = 2 ((@)|33 = 32 2 A’sin?a & Qo? The Fisher information matrix becomes N 4 9 N-) N-1 _i 0 2A? n? A? n 10) = 35 aa N-1 NA 2 0 TA n 2 Using (3.22), we have upon inversion “ 20? > var(A) > W 7 12 > ——__*_ var(Jo) 2 GaygN(N? = 1) ; 2(2N —1) => = . var(¢) 2 aN(N +1) (3.41) where 7 = A?/(207) is the SNR. Frequency estimation of a sinusoid is of considerable interest. Note that the CRLB for the frequency decreases as the SNR increases and that the bound decreases as 1/N*, making it quite sensitive to data record length. See also Problem 3.17 for a variation of this example. 2 Example 3.15 - Bearing Estimation In sonar it is of interest to estimate the bearing to a target as shown in Figure 3.8. To do so the acoustic pressure field is observed by an array of equally spaced sensors in a 58 CHAPTER 3. CRAMER-RAO LOWER BOUND Planar wavefronts o Target 7 _ Figure 3.8 Geometry of array for 0 1 2 M-1 bearing estimation line. Assuming that the target radiates a sinusoidal signal A cos(27Fot + ¢), then the received signal at the nth sensor is Acos(27Fo(t—tn) +), where tn is the propagation time to the nth sensor. 
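Returning briefly to Example 3.14, the approximations leading to (3.41) can be checked by evaluating the exact Fisher information matrix (3.33) numerically and inverting it. The sketch below (arbitrary A, f0, phi, sigma^2, N; NumPy) shows the agreement when f0 is not near 0 or 1/2.

```python
# Minimal sketch (assumed values): exact Fisher matrix (3.33) versus the bounds (3.41).
import numpy as np

A, f0, phi, sigma2, N = 1.0, 0.2, 0.3, 0.1, 100
n = np.arange(N)
alpha = 2 * np.pi * f0 * n + phi

D = np.column_stack([np.cos(alpha),                       # ds[n]/dA
                     -A * 2 * np.pi * n * np.sin(alpha),  # ds[n]/df0
                     -A * np.sin(alpha)])                 # ds[n]/dphi
I = D.T @ D / sigma2                                      # (3.33)
crlb_exact = np.diag(np.linalg.inv(I))

eta = A ** 2 / (2 * sigma2)                               # SNR
crlb_approx = np.array([2 * sigma2 / N,
                        12 / ((2 * np.pi) ** 2 * eta * N * (N ** 2 - 1)),
                        2 * (2 * N - 1) / (eta * N * (N + 1))])
print(crlb_exact)
print(crlb_approx)                                        # close for f0 away from 0 and 1/2
```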
If the array is located far from the target, then the circular wavefronts can be considered to be planar at the array. As shown in Figure 3.8, the wavefront at the (n — 1)st sensor lags that at the nth sensor by dcos B/c due to the extra propagation distance. Thus, the propagation time to the nth sensor is tn = ty — m5 cos n=0,1,...,.M-1 where to is the propagation time to the zeroth sensor, and the observed signal at the nth sensor is d 8,(t) = Acos [anrat —tot+ ne cos 3) + d| : If a single “snapshot” of data is taken or the array element outputs are sampled at a given time ¢,, then d Sn(ts) = Acos[2m(Fo~ cos Z)n + ¢'] (3.42) where ¢’ = ¢+2nFo(t, —to). In this form it becomes clear that the spatial observations are sinusoidal with frequency f, = Fo(d/c)cos 8. To complete the description of the data we assume that the sensor outputs are corrupted by Gaussian noise with zero mean and variance o? which is independent from sensor to sensor. The data are modeled as 2[n] = sn(t,)+uln] n=0,1,...,.M-1 where w[n] is WGN. Since typically A, ¢ are unknown, as well as B, we have the problem of estimating {A, f., ¢’} based on (3.42) as in Example 3.14. Once the CRLB for these parameters is determined, we can use the transformation of parameters formula. The transformation is for @ = [A f. ¢’]" -go)-| 3 |=] () a=g(0)= y = COS Fod 3.11. SIGNAL PROCESSING EXAMPLES 59 The Jacobian is 0 0 c ~ Fodsin B 0 og(8) 0 00 oor 1 so that from (3.30) >0. 22 Because of the diagonal Jacobian this yields var(8) > [789] la But from (3.41) we have 12 -1 2 [i (9)] 2 ~ (27)2nM(M? _ 1) and therefore = (Qn)?nM (M? — 1) F2d? sin? 8 or final, y - 12 var(8) > —— ‘Wa 1 (2n)@Mne (5) sin? B where \ = c/Fo is the wavelength of the propagating plane wave and L = (M —1)d is the length of the array. Note that it is easiest to estimate bearing if 8 = 90°, and impossible if 8 = 0°. Also, the bound depends critically on the array length in wavelengths or L/A, as well as the SNR at the array output or Mn. The use of the CRLB for feasibility studies is examined in Problem 3.19. > (3.43) Example 3.16 - Autoregressive Parameter Estimation In speech processing an important model for speech production is the autoregressive (AR) process. As shown in Figure 3.9, the data are modeled as the output of a causal all-pole discrete filter excited at the input by WGN u[n]. The excitation noise u[n] is an inherent part of the model, necessary to ensure that z[n] is a WSS random process. The all-pole filter acts to model the vocal tract, while the excitation noise models the forcing of air through a constriction in the throat necessary to produce an unvoiced sound such as an “s.” The effect of the filter is to color the white noise so as to model PSDs with several resonances. This model is also referred to as a linear predictive coding (LPC) mode! [Makhoul 1975]. Since the AR model is capable of producing a variety of PSDs, depending on the choice of the AR filter parameters {a[1}, a[2],...,a[p]} and 60 CHAPTER 3. CRAMER-RAO LOWER BOUND 1 " —| a /- a Puu(f) Prz(f) o nies Se Ie vies Ss Figure 3.9 Autoregressive model A(z) =1+ Salm] m=1 excitation white noise variance o2, it has also been successfully used for high-resolution spectral estimation. Based on observed data {z(0], z[1],...,2[N — 1]}, the parameters are estimated (see Example 7.18), and hence the PSD is estimated as [Kay 1988] . o Pro(f) = ‘ z: 1+ xy 4[m] exp(—j27 fm) m=1 The derivation of the CRLB proves to be a difficult task. 
The interested reader may consult [Box and Jenkins 1970, Porat and Friedlander 1987] for details. In practice the asymptotic CRLB given by (3.34) is quite accurate, even for short data records of about N = 100 points if the poles are not too close to the unit circle in the z plane. Therefore, we now determine the asymptotic CRLB. The PSD implied by the AR model is o lACAP where @ = [a{1] a[2]...a[p]o2]? and A(f) = 1+ D7... alm] exp(—j2a fm). The partial derivatives are Pra(f;9) = Ain Prx(fs0) _ _ alnjA(f)? dak) alk] = —aCAR [A(/) exp(j2nyk) + A*(f) exp(—J2n/R)] Oln Pra(f;) 1 002 o u 3.11. SIGNAL PROCESSING EXAMPLES 61 For k = 1,2,...,p;/=1,2,...,p we have from (3.34) N ptt . * . Bn = FL are Aer 2esh) + 4°) exo(— 2074) [AC A)expli2nfl) + AY f)exp(—J2n 0] df -% ef an a expla f(k+D]+ Ging fa ye Plans (k — 0) arid exp[j2mf(l - b+ za exp[—j2mf( ‘+0 df. Noting that I 2 4 | po . [ PoP exp[j2zf(k + l)] df i PO exp[—j2mf(k +0] df : a L appomlee—oia =f) aeppewbianre- me which follows from the Hermitian property of the integrand (due to A(—f) = A*(f)), we have BO = Nf aepperizase vid Poa . +N Bp? Plans +) a. 4 The second integral is the inverse Fourier transform of 1/A*(f)? evaluated at n = k+l > 0. This term is zero since the sequence is the convolution of two anticausal sequences, that is, if _ h[n] n>0 Pano" neo then {aaa} = h-n|xh{—n] At(f? = 0 for n > 0. Therefore, LO) = ayreelh =I. 62 CHAPTER 3. CRAMER-RAO LOWER BOUND For k=1,2,...,.p,l=pt+1 Nii ol HO = —z [, SELAH MU exPli2nsh) + A°(/) exp 32 F) df 4 = -F iy FH exp(j2afk) df = 0 where again we have used the Hermitian property of the integrand and the anticausality of F~'{1/A*(f)}- Finally, for k =p + 1ll=p+1 N pl N (1) = al, wd = g4 so that Mae = u 4 1) oC (3.44) 204 where [Rezlij = Tzeli— j] isap xP Toeplitz autocorrelation matrix and 0 is a p x 1 vector of zeros. Upon inverting the Fisher information matrix we have that var(4[k]) Iv Oe tp-1 ay [Reed k=1,2,...,7 var(o2) x, (3.45) IV As an illustration, if p = 1, 2 var(G{ll) > yo" Nrzx[0]" But 2 an rll Tr orh] so that var(@{il) > 55 (1-H) indicating that it is easier to estimate the filter parameter when |a[1j| is closer to one than to zero. Since the pole of the filter is at —a[1], this means that the filter parameters of processes with PSDs having sharp peaks are more easily estimated (see also Problem 3.20). REFERENCES 63 References Box, G.E.P., G.M. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco, 1970. Brockwell, P.J., R.A. Davis, Time Series: Theory and Methods, Springer-Verlag, New York, 1987. Kay, S.M., Modern Spectral Estimation: Theory and Application, Prentice-Hall, Englewood Cliffs, N.J., 1988. Kendall, Sir M., A. Stuart, The Advanced Theory of Statistics, Vol. 2, Macmillan, New York, 1979. Makhoul, J., “Linear Prediction: A Tutorial Review,” IEEE Proc., Vol. 63, pp. 561-580, April 1975. McAulay, R.J., E.M. Hofstetter, “Barankin Bounds on Parameter Estimation,” IEEE Trans. In- form. Theory, Vol. 17, pp. 669-676, Nov. 1971. Porat, B., B. Friedlander, “Computation of the Exact Information Matrix of Gaussian Time Series With Stationary Random Components,” IEEE Trans. Acoust., Speech, Signal Process., Vol. 34, pp. 118-130, Feb. 1986. Porat, B., B. Friedlander, “The Exact Cramer-Rao Bound for Gaussian Autoregressive Processes,” IEEE Trans. Aerosp. Electron. Syst., Vol. 23, pp. 537-541, July 1987. Seidman, L.P., “Performance Limitations and Error Calculations for Parameter Estimation,” Proc. IEEE, Vol. 58, pp. 644-652, May 1970. Stoica, P., R.L. 
Moses, B. Friedlander, T. Soderstrom, “Maximum Likelihood Estimation of the Parameters of Multiple Sinusoids from Noisy Measurements,” IEEE Trans. Acoust., Speech, Signal Process., Vol. 37, pp. 378-392, March 1989. Van Trees, H.L., Detection, Estimation, and Modulation Theory, Part I, J. Wiley, New York, 1968. Ziv, J., M. Zakai, “Some Lower Bounds on Signal Parameter Estimation,” IEEE Trans. Inform. Theory, Vol. 15, pp. 386-391, May 1969. Problems 3.1 If 2[n] for n =0,1,..., N—1 are IID according to U(0, 6], show that the regularity condition does not hold or that 2 [ers 20 for all 0 > 0. Hence, the CRLB cannot be applied to this problem. 3.2 In Example 3.1 assume that w(0] has the PDF p(w[0]) which can now be arbitrary. Show that the CRLB for A is ap(u) 2 Evaluate this for the Laplacian PDF p(w[0]) = Se exp (ee) 64 CHAPTER 3. CRAMER-RAO LOWER BOUND and compare the result to the Gaussian case. 3.3 The data z[n| = Ar” + w[n] for n = 0,1,...,N — 1 are observed, where w(n] i WGN with variance o? and r > 0 is known. Find the CRLB for A. Show that an efficient estimator exists and find its variance. What happens to the variance as N -+ co for various values of r? 3.4 If a[n| =r" + w[n] for n =0,1,...,N —1 are observed, where w{n] is WGN with variance o? and r is to be estimated, find the CRLB. Does an efficient estimator exist and if so find its variance? 3.5 If {n] = A+v[n] forn =0,1,...,N—1are observed and w = (w(0] w[1]}...w[N — 1]? ~ N(0,C), find the CRLB for A. Does an efficient estimator exist and if so what is its variance? 3.6 For Example 2.3 compute the CRLB. Does it agree with the results given? 3.7 Prove that in Example 3.4 18a WN > cos(4m fon + 26) © 0. n=0 What conditions on fy are required for this to hold? Hint: Note that N-1 N-1 z > cos(an + 8) = Re (= exp[j(an + al) n=0 n=0 and use the geometric progression sum formula. 3.8 Repeat the computation of the CRLB for Example 3.3 by using the alternative expression (3.12). 3.9 We observe two samples of a DC level in correlated Gaussian noise z(0] x(] A+vu(0] A+vf[l] where w = [w[0] w[1]|? is zero mean with covariance matrix =oii ? cao] | °|- The parameter p is the correlation coefficient between w([0] and w[1]. Compute the CRLB for A and compare it to the case when w[n| is WGN or p = 0. Also, explain what happens when p > +1. Finally, comment on the additivity property of the Fisher information for nonindependent observations. PROBLEMS 65 3.10 By using (3.23) prove that the Fisher information matrix is positive semidefinite for all @. In practice, we assume it to be positive definite and hence invertible, although this is not always the case. Consider the data model in Problem 3.3 with the modification that r is unknown. Find the Fisher information matrix for @=[Ar]?. Are there any values of @ for which 1(8) is not positive definite? 3.11 For a 2 x 2 Fisher information matrix which is positive definite, show that 1 HO) What does this say about estimating a parameter when a second parameter is either known or unknown? When does equality hold and why? c n= pts a 3.12 Prove that 1 “1 Diez a 1") TO} This generalizes the result of Problem 3.11. Additionally, it provides another lower bound on the variance, although it is usually not attainable. Under what conditions will the new bound be achieved? Hint: Apply the Cauchy-Schwarz inequality to e? \/1(6) /T- '(@)e;, where e; is the vectors of all zeros except for a 1 as the ith element. 
The square root of a positive definite matrix A is defined to be the matrix with the same eigenvectors as A but whose eigenvalues are the square roots of those of A. 3.13 Consider a generalization of the line fitting problem as described in Example 3.7. termed polynomial or curve fitting. The data model is pol z(n] = Ayn* + win] k=0 forn =0,1,...,N—1. As before, w[n] is WGN with variance o*. It is desired to estimate { Ao, Ai,..., Ap-1}. Find the Fisher information matrix for this problem. 3.14 For the data model in Example 3.11 consider the estimator 07, = (A)?, where A is the sample mean. Assume we observe a given data set in which the realization of the random variable A is the value Ap. Show that A — Ag as N -+ oo by verifying that E(A|A = Av) " ala > var(A|A = Ao) 66 CHAPTER 3. CRAMER-RAO LOWER BOUND Hence, o4 — A? as N -+ oo for the given realization A = Ao. Next find the variance of 0% as N - oo by determining var(A”), where A ~ N(0,0%), and compare it to the CRLB. Explain why 0}, cannot be estimated without error even for N + o. 3.15 Independent bivariate Gaussian samples {x(0],x(1],...,x[N — 1]} are observed. Each observation is a 2 x 1 vector which is distributed as x[n] ~ N(0,C) and e-[5 ft]. Find the CRLB for the correlation coefficient p. Hint: Use (3.32). 3.16 It is desired to estimate the total power Py of a WSS random process, whose PSD is given as P.z(f) = PoQ(f) where 4 [ana = and Q(f) is known. If N observations are available, find the CRLB for the total power using the exact form (3.32) as well as the asymptotic approximation (3.34) and compare. 3.17 Ifin Example 3.14 the data are observed over the interval n= —M,...,0,...,M, find the Fisher information matrix. What is the CRLB for the sinusoidal param- eters? You can use the same approximations as in the example by assuming M is large. Compare the results to that of the example. 3.18 Using the results of Example 3.13, determine the best range estimation accuracy of a sonar if j= 1-100jé-0.01| O0 0 for all x. Now let w(x) = p(x;0) g(x) = @-a h(x) = Smee) and apply the Cauchy-Schwarz inequality to (34.4) to produce (222) < [a artotaioyax f (eB) ax cy E (ee) =_B [Pinzon or var(&) > Now note that This follows from (3A.2) as O\n p(x; 6) p[etnmn) [2858 as) dx = 0 APPENDIX 3A. DERIVATION OF SCALAR PARAMETER CRLB 69 5 f Re ; 0) p(x;6)dx = 0 / SU (x0) + 2BECEO Ontsi8) ix = 0 or a) 0 ol 39) Al 30 “(Eig = fgets _ On p(x; 6)\? - [(2nee )]; Thus, cp) - 06 var (&) > : 2 in p(x 6) ~ | oe which is (3.16). If a = g(@) = 0, we have (3.6). Note that the condition for equality is Olnp(x;6) 1, a *~%) where c can depend on @ but not on x. If a = g(6) = 0, we have Olnp(x;0) 1 06-8) (0-8). The possible dependence of c on @ is noted. To determine c(0) a 1 Pine) e(6)) ae = yt ape 9) E\& In p(x;0)] _ 2 rn nr ()) or finally 1 (8) = ~5 [> hts ; 0) 06 = ~ I(6) which agrees with (3.7). Appendix 3B Derivation of Vector Parameter CRLB In this appendix the CRLB for a vector parameter a = g(@) is derived. The PDF is characterized by 8. We consider unbiased estimators such that E(a) =a; =(g(@)); = §=1,2,...,7. The regularity conditions are E [Paes =o so that [uw _ 0.) 22 B®) a 6) dx = eh (3B.1) Now consider for j #1 fe — a;) oe p(x; @) dx Kt - Op(x; 8) &; — a;)——— dx foo =- 2 / p(x; 8) dx = 30; iP(X; pfoln p(x; 2) -—a,E [eae @, 80% 06; Ae(9)\: 00; ~ Combining (3B.1) and (3B.2) into matrix form, we have .a\r Jeo) ERED (a5 0) dx = BO). (3B.2) 70 APPENDIX 3B. 
DERIVATION OF VECTOR PARAMETER CRLB 71 Now premultiply by a7 and postmultiply by b, where a and b are arbitrary r x 1 and p x 1 vectors, respectively, to yield / aT (a — 0) PO) boca, 8) dx = a7 2619), Now let w(x) = p(x;6) g(x) = a7(&—-a) _ Olnp(x 8)”, Mx) = ag» and apply the Cauchy-Schwarz inequality of (34.5) 2 (27 ee) < / al (de — a)(& — a) ap(x;8) dx » fur ne) 6) )Ptnpts:6) 0)" bp(x; 8) dx = a™C,ab71(0)b since as in the scalar case On p(x; 8) Aln p(x; 0)] _ (7 In p(x; | _ ; EB 0; 08; | =-E 00,00, } HOD]es- Since b was arbitrary, let —1179)08(8)" b=1"'(@) ; (22s) a) < acon (st 2h 116) 2800)" *). Since I(@) is positive definite, so is I-1(@), and 28(0)-1(g) 29(0)7 is at least positive semidefinite. The term inside the parentheses is therefore nonnegative, and we have at (co- e080) ‘) a>o. Recall that a was arbitrary, so that (3.30) follows. If a = g(8) = 0, then 28 — J and (3.24) follows. The conditions for equality are g(x) = ch(x), where cisa constant not dependent on x. This condition becomes <2lnp(xs8)” 06 Ain p(x; 4)" 1g 28(8)” ag Om ag & to yield b 72 APPENDIX 3B. DERIVATION OF VECTOR PARAMETER CRLB Since a was arbitrary 2alO) 1g) 2O) _ 1g — a, Consider the case when a = g(0) = 9, so that 2) =I. Then, On p(x; 8) 06 Noting that c may depend on @, we have 1 ~ = =1(6)(6 — 8). om On p(x; 8) 60) (1(@) 6, -¥ rae = (Bx — 6) and differentiating once more BP inp(xi@) eo | WO in (Na) ? In p(x; 0 P it c(0) a“ 90,00; i “e(8) —dej) + 26, (8% — 9) Finally, we have 2 * wos = —2[=eree)) (C9) lis c(@) since E(x) = 6,. Clearly, c(@) = 1 and the condition for equality follows. Appendix 3C Derivation of General Gaussian CRLB Assume that x ~ N(y(8),C(8)), where 42(@) is the N x 1 mean vector and C(@) is the N x N covariance matrix, both of which depend on @. Then the PDF is y= —— + nen [ba pie)? 0100) x — Pla 8) = a cTporay om? [oa MON ENO) ml) We will make use of the following identities Pmneeie) _ ir(c7@) 5) (3c.1) where 0C(@)/00, is the N x N matrix with [i, j] element 3[C(6)],;/00, and oO = 010) cr), (30.2) k To establish (3C.1) we first note JIndet[C(B)]_ 1S det (C(@)] a0, ~ del(C(@|) 08, (8C.3) Since det[C(@)] depends on all the elements of C(6) ddet[C(@)) _ A det[C(9)] a[C(4)],; a, >> ACO); OO, _ ._ f Adet[C()] ACT (a) = Sc 9a.) where @det[C(@)|/OC(@) is an N x N_ matrix with [i,j] element 8 det[C(8)]/A[C(@)|.; and the identity (3C.4) N N tr(AB™) = 57> S“/A).; [Bl i=] j=l 73 74 APPENDIX 3C. DERIVATION OF GENERAL GAUSSIAN CRLB has been used. Now by the definition of the determinant N det[C(8)] = }-[C(@) (Mk; i=1 where M is the N x N cofactor matrix and j can take on any value from 1 to N. Thus, ddet[C(6)) _ acca), ~ Mls or Adet[C()} _ acta) It is well known, however, that c M? C8) = saIC@) so that adet{C(6)] te = C-1(6) det[C(4)]. Using this in (3C.3) and (3C.4), we have the desired result. dindet(C(@)] _ 1 _ 2010) od “a ) ll u(c (62000). The second identity (3C.2) is easily established as follows. Consider C7 (0)C(0) =1L. Differentiating each element of the matrices and expressing in matrix form, we have ©1224 Mee) =o which leads to the desired result. We are now ready to evaluate the CRLB. Taking the first derivative Alnp(x;6) __-:1 Alndet(C(@)}_ 1 8 TrH1 ee nae ba (— MLO)ITE"(0)(— HO). The first term has already been evaluated using (3C.1), so that we consider the second term: He (ex— m(6))7C-1(0)(x - (0) APPENDIX 3C. 
DERIVATION OF GENERAL GAUSSIAN CRLB 75 a N ON = FH LLM MOE Ols Cab ~ (OL) N N = ES zis] — W(@))) (Ios (2M) + Tels ai — wt) 4 (-4 (6) oe) ie 2(0)):5(2[3] - \u(o),,)} = (= w(ayPo2(o) HO) + (x — wi? =O (x — w(0) - HY “oye 00) T 1 = 2 2HEY 6 1(0)(x ~ w(0)) + (x wl) Ox — (9). Using (3C.1) and the last result, we have opt) = -le (om) an C-1(6)(x — (6) — 5 — n(0))? (x - (0). (305) Let y= x— (0). Evaluating p[2nnl x; 0) O1n p(x; | MO) = El, OO, which is equivalent to (3.23), yields Oe = gt (1) (c 16°) + tt (c 10) ) (2 y) + 2H) onto)" © (0) Blyy"}0-1(6) HO) OC*(8)_ 0C~1(8) i T T sh 30,” 0H | 76 APPENDIX 3C. DERIVATION OF GENERAL GAUSSIAN CRLB where we note that all odd order moments are zero. Continuing, we have {1(@)}er = Jer (710) 2S) te r(c (9 25®)) —5ir(c (c 12) a 2 (co) 26@) 8p (8)" 49,948) | 1, [ rdC7'(8) 7 AC7(8) +5, © 19) 36, +5Bly 70, ¥ y| 28, (3C.6) where E(y?z) = tr[E(2y7)] for y,2 N x 1 vectors and (3C.2) have been used. To evaluate the last term we use [Porat and Friedlander 1986] Ely? Ayy” By] = tr(AC)tr(BC) + 2tr(ACBC) where C = E(yy?) and A and B are symmetric matrices. Thus, this term becomes 1, (2S ©) (0) (eo 10) c)) +3 (GO 20) 10) Oc). Next, using the relationship (3C.2), this term becomes ber (or) te (oS) +5tr (cre Gre- 0) (3C.7) and finally, using (3C.7) in (3C.6), produces the desired result. Appendix 3D Derivation of Asymptotic CRLB It can be proven that almost any WSS Gaussian random process z[n] may be repre- sented as the output of a causal linear shift invariant filter driven at the input by white Gaussian noise u{n] [Brockwell and Davis 1987] or zn] = So hlklan — ke] (3D.1) k=0 where h[0] = 1. The only condition is that the PSD must satisfy i [ In Pref) df > —o0. ai . a With this representation the PSD of z[n] is Pas(f) = |A(f)P ou where o? is the variance of u[n] and H(f) = 22, h[k] exp(—j2xfk) is the filter fre- quency response. If the observations are {z(0], x[1],...,2[N —1]} and N is large, then” the representation is approximated by z[n] = So alkluln —k+ > h[k]u[n — k] k=n4+1 ~ Soh[kluln — kl. (3D.2) k=0 This is equivalent to setting u[n] = 0 for n < 0. As n -+ 00, the approximate repre- sentation becomes better for x(n]. It is clear, however, that the beginning samples will be poorly represented unless the impulse response A[k] is small for k > n. For large N most of the samples will be accurately represented if N is much greater than the impulse response length. Since reall] = 02 5 Nndafn + n=0 17 78 APPENDIX 3D. DERIVATION OF ASYMPTOTIC CRLB the correlation time or effective duration of Tz2{k] is the same as the impulse response length. Hence, because the CRLB to be derived is based on (3D.2), the asymptotic CRLB will be a good approximation if the data record length is much greater than the correlation time. To find the PDF of x we use (3D.2), which is a transformation from u = {ufO] u{1]... u[N — I]7 to x = [z{0] 2[1]...2[N — I] or A(O| 0 0 .. 0 Ali] h(0) 0 0 x= : : : tt rIN-1] AIN-2] AIN-3] ... h(O] H Note that H has a determinant of (h{0])” = 1 and hence is invertible. Since u ~ N(0,021), the PDF of x is N(0,02HH") or p(x;6) = aovaatoama [-5 *(o2HH")"s| . But det(o2HH") = 02" det?(H) = 02”. Also, x" (o?HH") 3x = et x) = awe so that 1 1 p(x; 9) = Gro? exp (-za8") : (3D.3) From (3D.2) we have approximately X(f) = H(AUL/) where N-1 X(f) = > 2l{njexp(—J27fn) n=0 N-1 Uf) = Y ulnlexp(—J27 fn) n=0 are the Fourier transforms of the truncated sequences. By Parseval’s theorem ly 1M, zu = zl u uU n=0 APPENDIX 3D. 
DERIVATION OF ASYMPTOTIC CRLB 79 1 st az | unre + XP I, one 2 IXCAP [ Pat: (3D.4) Also, i Ino® = ‘ Ino? df -i a Ly (inte) # } 4 | jinPae( fa ~ [ nnrar. But Il [ 1 migra =f ; InH(f) + In H*(f) af ane | nt(saf = are in We) = 2Re[Z~ {in H(z)}|, 6] where C is the unit circle in the z plane. Since H(z) corresponds to the system function of a causal filter, it converges outside a circle of radius r < 1 (since H(z) is assumed to exist on the unit circle for the frequency response to exist). Hence, In H(z) also converges outside a circle of radius r < 1, so that the corresponding sequence is causal. By the initial value theorem which is valid for a causal sequence Z7'{inH(z)Hn-0 = jim In X(z) = In lim H(z) 200 = Inh(0] =0. Therefore, $ [ {mlH(N Pat =0 80 APPENDIX 3D, DERIVATION OF ASYMPTOTIC CRLB and finally 4 Ino? = | InP,.(f) df. 3D.5 nota [Pad (30.5) Substituting (3D.4) and (3D.5) into (3D.3) produces for the log PDF $ 4 2 inptaso)= Finan F fm Peal a5 ane Hence, the asymptotic log PDF is 2 In p(x;6) = a Inn — as [ 1 [1 Peel f) + ee gy. (D8) To determine the CRLB ainp(x@) _ oN st (_ 1 _ xX) OPse(f) 20, "f'n 3, a Inp(x30) _ _N fr 1 #IX(A)P\) & Prof) 30:00, 2J5-4\Pes(f) P2(f) } 86:00; 1 2|X(f)P \ OPra(f) P22 (f) + (- PQ) * PRU) ) 36, 00, (7) In taking the expected value we encounter the term E(|X(f)|?/N). For large N this is now shown to be P,,(f). Note that |X(f)|?/N is termed the periodogram spectral estimator. 5 (5Ix(NF) m=0 n=0 N-1N-1 E (i 3 > 2x[m]z[n] exp[—j2af(m — mn) z y Nobuo W Tr2[m — n] exp[—j2mf(m — n)] n=0 " 2 3 Lod Oo In| . = 1-—— )} ree[k] exp(—j2afk) (3D.8) XI i) Tr, exp(—jem. where we have used the identity N-1N-1 N-1 SS Voalm-nj= YO (N—IalgIAl m=0 n=0 k=—(N-1) APPENDIX 3D. DERIVATION OF ASYMPTOTIC CRLB 81 As N - oo, | (= FY rath raat assuming that the ACF dies out sufficiently rapidly. Hence, 1 BUF] = Pea). Upon taking expectations in (3D.7), the first term is zero, and finally, _ ON ft 1_ aPaolf) OPaal) mom = > [ype eae il N fs} Ain P..(f) On Pref) zl, 20, 30, which is (3.34) without the explicit dependence of the PSD on @ shown. Chapter 4 Linear Models 4.1 Introduction The determination of the MVU estimator is in general a difficult task. It is fortunate, however, that a large number of signal processing estimation problems can be repre- sented by a data model that allows us to easily determine this estimator. This class of models is the linear model. Not only is the MVU estimator immediately evident once the linear model has been identified, but in addition, the statistical performance follows naturally. The key, then, to finding the optimal estimator is in structuring the problem in the linear model form to take advantage of its unique properties. 4.2 Summary The linear model is defined by (4.8). When this data model can be assumed, the MVU (and also efficient) estimator is given by (4.9), and the covariance matrix by (4.10). A more general model, termed the general linear model, allows the noise to have an arbitrary covariance matrix, as opposed to 071 for the linear model. The MVU (and also efficient) estimator for this model is given by (4.25), and its corresponding covariance matrix by (4.26). A final extension allows for known signal components in the data to yield the MVU (and also efficient) estimator of (4.31). The covariance matrix is the same as for the general linear model. 4.3 Definition and Properties The linear model has already been encountered in the line fitting problem discussed in Example 3.7. 
Recall that the problem was to fit a straight line through noise corrupted data. As our model of the data we chose

x[n] = A + Bn + w[n]        n = 0, 1, \ldots, N-1

where w[n] is WGN and the slope B and intercept A were to be estimated. In matrix notation the model is written more compactly as

x = H\theta + w        (4.1)

where

x = [x[0] \; x[1] \ldots x[N-1]]^T
w = [w[0] \; w[1] \ldots w[N-1]]^T
\theta = [A \; B]^T

and

H = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & N-1 \end{bmatrix}.

The matrix H is a known matrix of dimension N \times 2 and is referred to as the observation matrix. The data x are observed after \theta is operated upon by H. Note also that the noise vector has the statistical characterization w \sim \mathcal{N}(0, \sigma^2 I). The data model in (4.1) is termed the linear model. In defining the linear model we assume that the noise vector is Gaussian, although other authors use the term more generally for any noise PDF (Graybill 1976).

As discussed in Chapter 3, it is sometimes possible to determine the MVU estimator if the equality constraints of the CRLB theorem are satisfied. From Theorem 3.2, \hat{\theta} = g(x) will be the MVU estimator if

\frac{\partial \ln p(x;\theta)}{\partial \theta} = I(\theta)(g(x) - \theta)        (4.2)

for some function g. Furthermore, the covariance matrix of \hat{\theta} will be I^{-1}(\theta). To determine if this condition is satisfied for the linear model of (4.1), we have

\frac{\partial \ln p(x;\theta)}{\partial \theta}
  = \frac{\partial}{\partial \theta}\left[-\ln(2\pi\sigma^2)^{N/2} - \frac{1}{2\sigma^2}(x - H\theta)^T(x - H\theta)\right]
  = -\frac{1}{2\sigma^2}\frac{\partial}{\partial \theta}\left[x^T x - 2x^T H\theta + \theta^T H^T H\theta\right].

Using the identities

\frac{\partial b^T\theta}{\partial \theta} = b, \qquad
\frac{\partial \theta^T A\theta}{\partial \theta} = 2A\theta        (4.3)

for A a symmetric matrix, we have

\frac{\partial \ln p(x;\theta)}{\partial \theta} = \frac{1}{\sigma^2}\left[H^T x - H^T H\theta\right].

Assuming that H^T H is invertible,

\frac{\partial \ln p(x;\theta)}{\partial \theta} = \frac{H^T H}{\sigma^2}\left[(H^T H)^{-1}H^T x - \theta\right]        (4.4)

which is exactly in the form of (4.2) with

\hat{\theta} = (H^T H)^{-1}H^T x        (4.5)

I(\theta) = \frac{H^T H}{\sigma^2}.        (4.6)

Hence, the MVU estimator of \theta is given by (4.5), and its covariance matrix is

C_{\hat{\theta}} = I^{-1}(\theta) = \sigma^2(H^T H)^{-1}.        (4.7)

Additionally, the MVU estimator for the linear model is efficient in that it attains the CRLB. The reader may now verify the result of (3.29) by substituting H for the line fitting problem into (4.4).

The only detail that requires closer scrutiny is the invertibility of H^T H. For the line fitting example a direct calculation will verify that the inverse exists (compute the determinant of the matrix given in (3.29)). Alternatively, this follows from the linear independence of the columns of H (see Problem 4.2). If the columns of H are not linearly independent, as for example,

H = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \end{bmatrix}

and x = [2 \; 2 \ldots 2]^T so that x lies in the range space of H, then even in the absence of noise the model parameters will not be identifiable. For then

x = H\theta

and for this choice of H we will have for x[n]

2 = A + B        n = 0, 1, \ldots, N-1.

As illustrated in Figure 4.1 it is clear that an infinite number of choices can be made for A and B that will result in the same observations, or given a noiseless x, \theta is not unique. The situation can hardly hope to improve when the observations are corrupted by noise. Although rarely occurring in practice, this degeneracy sometimes nearly occurs when H^T H is ill-conditioned (see Problem 4.3).

The previous discussion, although illustrated by the line fitting example, is completely general, as summarized by the following theorem.

Theorem 4.1 (Minimum Variance Unbiased Estimator for the Linear Model) If the data observed can be modeled as

x = H\theta + w        (4.8)
LINEAR MODELS an] = A+ Btwn] =AtB 2=A+B A All @ on this line produce the same observations Figure 4.1 Nonidentifiability of linear model parameters where x is an N x 1 vector of observations, H is a known N x p observation matrit (with N > p) and rank p, @ is a p x 1 vector of parameters to be estimated, and w is an N x1 noise vector with PDF N(0,071), then the MVU estimator is 6 = (HH) Hx (4.9) and the covariance matrix of 6 is C; = 0?(H7H)"'. (4.10) For the linear model the MVU estimator is efficient in that it attains the CRLB. That 6 is unbiased easily follows by substituting (4.8) into (4.9). Also, the statistical performance of 6 is completely specified (not just the mean and covariance) because 6 is a linear transformation of a Gaussian vector x and hence 6 ~N(0,0?(HH)~). (4.11) The Gaussian nature of the MVU estimator for the linear model allows us to determine the exact statistical performance if desired (see Problem 4.4). In the next section we present some examples illustrating the use of the linear model. 4.4 Linear Model Examples We have already seen how the problem of line fitting is easily handled once we recog- nize it as a linear model. A simple extension is to the problem of fitting a curve to experimental data. Example 4.1 - Curve Fitting In many experimental situations we seek to determine an empirical relationship between a pair of variables. For instance, in Figure 4.2 we present the results of a experiment 4.4. LINEAR MODEL EXAMPLES 87 Voltage J @ <— Measured voltage, (tn) “<< Hypothesized relationship, s(t) Time, t to ti tz ts tn-1 Figure 4.2 Experimental data in which voltage measurements are taken at the time instants t = to,t),...,tn_1. By plotting the measurements it appears that the underlying voltage may be a quadratic function of time. That the points do not lie exactly on a curve is attributed to experi- mental error or noise. Hence, a reasonable model for the data is 2(tn) = 0, + Oatn + Ost, + w(tn) = 2 =0,1,...,N—1. To avail ourselves of the utility of the linear model we assume that w(t,) are IID Gaussian random variables with zero mean and variance o? or that they are WGN samples. Then, we have the usual linear model form x=Hé+w where x = [2(to) 2(t1)...2(tv—1)]” 8 = (0, 0,0,)" and 2 1 to to u-|, " 4 1 tyr tha In general, if we seek to fit a (p — 1)st-order polynomial to experimental data we will have 2(tn) =O, + Oatn ++--+Opt2 + (tr) n=0,1,...,N—1. The MVU estimator for @ = (0; 6, ...0,]” follows from (4.9) as 6 = (H7H)"'H7x 88 CHAPTER 4. LINEAR MODELS where x = [x(to)2(t)..-2(tw—)]” 1 t ... 1 ot... H=|. . (N x p). 1 twa... ty The observation matrix for this example has the special form of a Vandermonde matrix. Note that the resultant curve fit is P a(t) =) 6,t" i=l where s(t) denotes the underlying curve or signal. °° Example 4.2 - Fourier Analysis Many signals exhibit cyclical behavior. It is common practice to determine the presence of strong cyclical components by employing a Fourier analysis. Large Fourier coefficients are indicative of strong components. In this example we show that a Fourier analysis is really just an estimation of the linear model parameters. Consider a data model consisting of sinusoids in white Gaussian noise: M M g{[n] = Sax cos (FF) +b sin (Fe) +w[n] n=0,1,...,.N—-1 (4.12) k=1 k=1 where w[n] is WGN. The frequencies are assumed to be harmonically related or multi- ples of the fundamental f, = 1/N as f; = k/N. The amplitudes ax, by of the cosines and sines are to be estimated. 
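As a brief aside, the curve-fitting estimator of Example 4.1 is immediate to compute once H is formed. The sketch below (assumed quadratic parameters, noise level, and time grid; NumPy) applies (4.9) and reports the covariance (4.10).

```python
# Minimal sketch (assumed values) of the curve-fitting MVU estimator of Example 4.1.
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 40)
theta_true = np.array([1.0, -2.0, 3.0])             # theta1 + theta2*t + theta3*t^2
sigma = 0.1

H = np.column_stack([np.ones_like(t), t, t ** 2])   # N x 3 observation matrix
x = H @ theta_true + sigma * rng.normal(size=t.size)

theta_hat = np.linalg.inv(H.T @ H) @ H.T @ x        # (4.9)
print(theta_hat)                                    # close to theta_true
print(sigma ** 2 * np.linalg.inv(H.T @ H))          # covariance (4.10)
```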
Example 4.2 - Fourier Analysis

Many signals exhibit cyclical behavior. It is common practice to determine the presence of strong cyclical components by employing a Fourier analysis. Large Fourier coefficients are indicative of strong components. In this example we show that a Fourier analysis is really just an estimation of the linear model parameters. Consider a data model consisting of sinusoids in white Gaussian noise:

x[n] = Σ_{k=1}^{M} a_k cos(2πkn/N) + Σ_{k=1}^{M} b_k sin(2πkn/N) + w[n]    n = 0, 1, ..., N−1    (4.12)

where w[n] is WGN. The frequencies are assumed to be harmonically related or multiples of the fundamental f_1 = 1/N as f_k = k/N. The amplitudes a_k, b_k of the cosines and sines are to be estimated. To reformulate the problem in terms of the linear model we let

θ = [a_1 a_2 ... a_M b_1 b_2 ... b_M]^T

and

H = [ 1                ...  1                 0                ...  0
      cos(2π/N)        ...  cos(2πM/N)        sin(2π/N)        ...  sin(2πM/N)
      .                     .                 .                     .
      cos(2π(N−1)/N)   ...  cos(2πM(N−1)/N)   sin(2π(N−1)/N)   ...  sin(2πM(N−1)/N) ].

Note that H has dimension N × 2M, where p = 2M. Hence, for H to satisfy N > p we require M < N/2. In determining the MVU estimator we can simplify the computations by noting that the columns of H are orthogonal. Let H be represented in column form as

H = [h_1 h_2 ... h_{2M}]

where h_i denotes the ith column of H. Then, it follows that

h_i^T h_j = 0    for i ≠ j.

This property is quite useful in that

H^T H = [h_1 h_2 ... h_{2M}]^T [h_1 h_2 ... h_{2M}]
      = [ h_1^T h_1      h_1^T h_2      ...  h_1^T h_{2M}
          h_2^T h_1      h_2^T h_2      ...  h_2^T h_{2M}
          .              .                   .
          h_{2M}^T h_1   h_{2M}^T h_2   ...  h_{2M}^T h_{2M} ]

becomes a diagonal matrix which is easily inverted. The orthogonality of the columns results from the discrete Fourier transform (DFT) relationships for i, j = 1, 2, ..., M < N/2:

Σ_{n=0}^{N−1} cos(2πin/N) cos(2πjn/N) = N/2 for i = j, 0 for i ≠ j
Σ_{n=0}^{N−1} sin(2πin/N) sin(2πjn/N) = N/2 for i = j, 0 for i ≠ j
Σ_{n=0}^{N−1} cos(2πin/N) sin(2πjn/N) = 0 for all i, j.    (4.13)

An outline of the orthogonality proof is given in Problem 4.5. Using this property, we have

H^T H = (N/2) I

so that the MVU estimator of the amplitudes is

θ̂ = (H^T H)^{−1} H^T x

or

â_k = (2/N) Σ_{n=0}^{N−1} x[n] cos(2πkn/N)
b̂_k = (2/N) Σ_{n=0}^{N−1} x[n] sin(2πkn/N).    (4.14)

These are recognized as the discrete Fourier transform coefficients. From the properties of the linear model we can immediately conclude that the means are

E(â_k) = a_k
E(b̂_k) = b_k    (4.15)

and the covariance matrix is

C_θ̂ = σ² (H^T H)^{−1} = σ² ((N/2) I)^{−1} = (2σ²/N) I.    (4.16)

Because θ̂ is Gaussian and the covariance matrix is diagonal, the amplitude estimates are independent (see Problem 4.6 for an application to sinusoidal detection). It is seen from this example that a key ingredient in simplifying the computation of the MVU estimator and its covariance matrix is the orthogonality of the columns of H. Note that this property does not hold if the frequencies are arbitrarily chosen.
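As a quick numerical check of the orthogonality property and of (4.14), the sketch below builds the N × 2M observation matrix, verifies that H^T H = (N/2)I, and forms the amplitude estimates as scaled DFT-style sums. NumPy is assumed; the number of harmonics, the true amplitudes, and the noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 64, 3                                    # p = 2M < N harmonically related components
n = np.arange(N)
a_true = np.array([1.0, 0.0, 0.5])              # hypothetical cosine amplitudes a_k
b_true = np.array([0.0, 2.0, -1.0])             # hypothetical sine amplitudes b_k
sigma = 0.1

C = np.column_stack([np.cos(2*np.pi*k*n/N) for k in range(1, M+1)])
S = np.column_stack([np.sin(2*np.pi*k*n/N) for k in range(1, M+1)])
H = np.hstack((C, S))                           # N x 2M observation matrix
x = H @ np.concatenate((a_true, b_true)) + sigma * rng.standard_normal(N)

# Orthogonal columns: H^T H = (N/2) I, so no matrix inversion is needed
print(np.allclose(H.T @ H, (N/2) * np.eye(2*M)))   # True

# MVU amplitude estimates (4.14), i.e., scaled DFT coefficients
a_hat = (2/N) * C.T @ x
b_hat = (2/N) * S.T @ x
print(a_hat, b_hat)
```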
Example 4.3 - System Identification

It is frequently of interest to be able to identify the model of a system from input/output data. A common model is the tapped delay line (TDL) or finite impulse response (FIR) filter shown in Figure 4.3a. The input u[n] is known and is provided to "probe" the system. Ideally, at the output the sequence Σ_{k=0}^{p−1} h[k] u[n−k] is observed, from which we would like to estimate the TDL weights h[k], or equivalently, the impulse response of the FIR filter. In practice, however, the output is corrupted by noise, so that the model in Figure 4.3b is more appropriate.

[Figure 4.3 System identification model: (a) tapped delay line producing Σ_{k=0}^{p−1} h[k] u[n−k]; (b) model for noise-corrupted output data, with system function H(z) = Σ_{k=0}^{p−1} h[k] z^{−k} and additive noise w[n].]

Assume that u[n] is provided for n = 0, 1, ..., N−1 and that the output is observed over the same interval. We then have

x[n] = Σ_{k=0}^{p−1} h[k] u[n−k] + w[n]    n = 0, 1, ..., N−1    (4.17)

where it is assumed that u[n] = 0 for n < 0. In matrix form we have

x = [ u[0]     0        ...  0
      u[1]     u[0]     ...  0
      .        .             .
      u[N−1]   u[N−2]   ...  u[N−p] ] [ h[0]
                                        h[1]
                                        .
                                        h[p−1] ] + w    (4.18)

that is, x = Hθ + w with H the N × p matrix above and θ = [h[0] h[1] ... h[p−1]]^T. Assuming w[n] is WGN, (4.18) is in the form of the linear model, and so the MVU estimator of the impulse response is

θ̂ = (H^T H)^{−1} H^T x.

The covariance matrix is

C_θ̂ = σ² (H^T H)^{−1}.

A key question in system identification is how to choose the probing signal u[n]. It is now shown that the signal should be chosen to be pseudorandom noise (PRN) [MacWilliams and Sloane 1976]. Since the variance of θ̂_i is

var(θ̂_i) = e_i^T C_θ̂ e_i

where e_i = [0 0 ... 0 1 0 ... 0]^T with the 1 occupying the ith place, and C_θ̂^{−1} can be factored as D^T D with D an invertible p × p matrix, we can use the Cauchy-Schwarz inequality as follows. Noting that

1 = (e_i^T D^T D^{−T} e_i)²

we can let ξ_1 = D e_i and ξ_2 = D^{−T} e_i to yield the inequality

(ξ_1^T ξ_2)² ≤ ξ_1^T ξ_1 ξ_2^T ξ_2.

Because ξ_1^T ξ_2 = 1, we have

1 ≤ (e_i^T D^T D e_i)(e_i^T D^{−1} D^{−T} e_i)
  = (e_i^T C_θ̂^{−1} e_i)(e_i^T C_θ̂ e_i)

or finally

var(θ̂_i) ≥ 1/(e_i^T C_θ̂^{−1} e_i) = σ²/[H^T H]_{ii}.

Equality holds, or the minimum variance is attained, if and only if ξ_1 = c ξ_2 for c a constant, or D e_i = c D^{−T} e_i. Equivalently, the conditions for all the variances to be minimized are

D^T D e_i = c_i e_i    i = 1, 2, ..., p.

Noting that

D^T D = C_θ̂^{−1} = H^T H/σ²

we have

(H^T H/σ²) e_i = c_i e_i.

If we combine these equations in matrix form, then the conditions for achieving the minimum possible variances are

H^T H = σ² [ c_1  0    ...  0
             0    c_2  ...  0
             .    .         .
             0    0    ...  c_p ].

It is now clear that to minimize the variance of the MVU estimator, u[n] should be chosen to make H^T H diagonal. Since [H]_{ij} = u[i − j],

[H^T H]_{ij} = Σ_{n=1}^{N} u[n − i] u[n − j]    i = 1, 2, ..., p;  j = 1, 2, ..., p    (4.19)

and for N large we have (see Problem 4.7)

[H^T H]_{ij} ≈ Σ_{n=0}^{N−1−|i−j|} u[n] u[n + |i − j|]    (4.20)

which can be recognized as a correlation function of the deterministic sequence u[n]. Also, with this approximation H^T H becomes a symmetric Toeplitz autocorrelation matrix

H^T H ≈ N [ r_uu[0]     r_uu[1]     ...  r_uu[p−1]
            r_uu[1]     r_uu[0]     ...  r_uu[p−2]
            .           .                .
            r_uu[p−1]   r_uu[p−2]   ...  r_uu[0] ]

where

r_uu[k] = (1/N) Σ_{n=0}^{N−1−k} u[n] u[n + k]

may be viewed as an autocorrelation function of u[n]. For H^T H to be diagonal we require

r_uu[k] = 0    k ≠ 0

which is approximately realized if we use a PRN sequence as our input signal. Finally, under these conditions H^T H = N r_uu[0] I, and hence

var(ĥ[i]) = σ²/(N r_uu[0])    i = 0, 1, ..., p−1    (4.21)

and the TDL weight estimators are independent. As a final consequence of choosing a PRN sequence, we obtain as our MVU estimator

θ̂ = (H^T H)^{−1} H^T x

where H^T H = N r_uu[0] I. Hence, we have

ĥ[i] = (1/(N r_uu[0])) Σ_{n=0}^{N−1} u[n − i] x[n]
     = (1/(N r_uu[0])) Σ_{n=0}^{N−1−i} u[n] x[n + i]    (4.22)

since u[n] = 0 for n < 0. The numerator in (4.22) is just the crosscorrelation r_ux[i] between the input and output sequences. Hence, if a PRN input is used to identify the system, then the approximate (for large N) MVU estimator is

ĥ[i] = r_ux[i]/r_uu[0]    i = 0, 1, ..., p−1    (4.23)

where

r_ux[i] = (1/N) Σ_{n=0}^{N−1−i} u[n] x[n + i]
r_uu[0] = (1/N) Σ_{n=0}^{N−1} u²[n].

See also Problem 4.8 for a spectral interpretation of the system identification problem.
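The large-N correlation estimator (4.23) is easy to check numerically against the exact MVU estimator. In the sketch below a random ±1 sequence is used as a stand-in for a true PRN sequence, NumPy is assumed, and the TDL weights and noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 1024, 4
h_true = np.array([1.0, 0.5, -0.25, 0.125])     # hypothetical TDL weights h[0..p-1]
sigma = 0.1

# Random +/-1 probe as a stand-in for a PRN sequence (r_uu[k] ~ 0 for k != 0)
u = rng.choice([-1.0, 1.0], size=N)

# N x p observation matrix of (4.18), with u[n] = 0 for n < 0
H = np.column_stack([np.concatenate((np.zeros(k), u[:N-k])) for k in range(p)])
x = H @ h_true + sigma * rng.standard_normal(N)

# Exact MVU estimator and the large-N correlation approximation (4.23)
h_mvu = np.linalg.solve(H.T @ H, H.T @ x)
r_uu0 = np.sum(u**2) / N
h_corr = np.array([np.sum(u[:N-i] * x[i:]) / N for i in range(p)]) / r_uu0

print("exact MVU   :", h_mvu)
print("correlation :", h_corr)                  # close to h_mvu for large N
```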
4.5 Extension to the Linear Model

A more general form of the linear model allows for noise that is not white. The general linear model assumes that

w ~ N(0, C)

where C is not necessarily a scaled identity matrix. To determine the MVU estimator, we can repeat the derivation in Section 4.3 (see Problem 4.9). Alternatively, we can use a whitening approach as follows. Since C is assumed to be positive definite, C^{−1} is positive definite and so can be factored as

C^{−1} = D^T D    (4.24)

where D is an N × N invertible matrix. The matrix D acts as a whitening transformation when applied to w since

E[(Dw)(Dw)^T] = D C D^T = D D^{−1} D^{−T} D^T = I.

As a consequence, if we transform our generalized model

x = Hθ + w

to

x' = Dx = DHθ + Dw = H'θ + w'

the noise will be whitened since w' = Dw ~ N(0, I), and the usual linear model will result. The MVU estimator of θ is, from (4.9),

θ̂ = (H'^T H')^{−1} H'^T x'
  = (H^T D^T D H)^{−1} H^T D^T D x

so that

θ̂ = (H^T C^{−1} H)^{−1} H^T C^{−1} x.    (4.25)

In a similar fashion we find that

C_θ̂ = (H'^T H')^{−1}

or finally

C_θ̂ = (H^T C^{−1} H)^{−1}.    (4.26)

Of course, if C = σ²I, we have our previous results. The use of the general linear model is illustrated by an example.

Example 4.4 - DC Level in Colored Noise

We now extend Example 3.3 to the colored noise case. If x[n] = A + w[n] for n = 0, 1, ..., N−1, where w[n] is colored Gaussian noise with N × N covariance matrix C, it immediately follows from (4.25) that with H = 1 = [1 1 ... 1]^T, the MVU estimator of the DC level is

Â = (H^T C^{−1} H)^{−1} H^T C^{−1} x = (1^T C^{−1} x)/(1^T C^{−1} 1)

and the variance is, from (4.26),

var(Â) = (H^T C^{−1} H)^{−1} = 1/(1^T C^{−1} 1).

If C = σ²I, we have as our MVU estimator the sample mean with a variance of σ²/N. An interesting interpretation of the MVU estimator follows by considering the factorization of C^{−1} as D^T D. We noted previously that D is a whitening matrix. The MVU estimator is expressed as

Â = (1^T D^T D x)/(1^T D^T D 1)
  = ((D1)^T x')/(1^T D^T D 1)
  = Σ_{n=0}^{N−1} d_n x'[n]    (4.27)

where x' = Dx and d_n = [D1]_n/(1^T D^T D 1). According to (4.27), the data are first prewhitened to form x'[n] and then "averaged" using the prewhitened averaging weights d_n. The prewhitening has the effect of decorrelating and equalizing the variances of the noises at each observation time before averaging (see Problems 4.10 and 4.11).

Another extension to the linear model allows for signal components that are known. Assume that s represents a known signal contained in the data. Then, a linear model that incorporates this signal is

x = Hθ + s + w.

To determine the MVU estimator let x' = x − s, so that

x' = Hθ + w

which is now in the form of the linear model. The MVU estimator follows as

θ̂ = (H^T H)^{−1} H^T (x − s)    (4.28)

with covariance

C_θ̂ = σ² (H^T H)^{−1}.    (4.29)

Example 4.5 - DC Level and Exponential in White Noise

If x[n] = A + r^n + w[n] for n = 0, 1, ..., N−1, where r is known, A is to be estimated, and w[n] is WGN, the model is

x = 1A + s + w

where s = [1 r ... r^{N−1}]^T. The MVU estimator is, from (4.28),

Â = (1/N) Σ_{n=0}^{N−1} (x[n] − r^n)

with variance, from (4.29),

var(Â) = σ²/N.

It should be clear that the two extensions described can be combined to produce the general linear model summarized by the following theorem.

Theorem 4.2 (Minimum Variance Unbiased Estimator for General Linear Model) If the data can be modeled as

x = Hθ + s + w    (4.30)

where x is an N × 1 vector of observations, H is a known N × p observation matrix (N > p) of rank p, θ is a p × 1 vector of parameters to be estimated, s is an N × 1 vector of known signal samples, and w is an N × 1 noise vector with PDF N(0, C), then the MVU estimator is

θ̂ = (H^T C^{−1} H)^{−1} H^T C^{−1} (x − s)    (4.31)

and the covariance matrix is

C_θ̂ = (H^T C^{−1} H)^{−1}.    (4.32)

For the general linear model the MVU estimator is efficient in that it attains the CRLB.

This theorem is quite powerful in practice since many signal processing problems can be modeled by (4.30).
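As a sketch of how Theorem 4.2 is applied numerically, the example below estimates a DC level in the presence of a known exponential signal and correlated noise. NumPy is assumed, and the covariance matrix (a simple exponentially decaying correlation chosen only for illustration), the value of r, and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
A_true, r = 2.0, 0.9                                # hypothetical DC level and known exponential base
n = np.arange(N)

H = np.ones((N, 1))                                 # observation matrix for the DC level
s = r**n                                            # known signal samples s[n] = r^n

# Hypothetical colored-noise covariance C[i, j] = rho^|i - j|, used only for illustration
rho = 0.8
C = rho ** np.abs(np.subtract.outer(n, n))
w = rng.multivariate_normal(np.zeros(N), C)
x = H @ np.array([A_true]) + s + w

# General linear model MVU estimator (4.31) and covariance (4.32)
Cinv = np.linalg.inv(C)
F = H.T @ Cinv @ H
theta_hat = np.linalg.solve(F, H.T @ Cinv @ (x - s))
C_theta_hat = np.linalg.inv(F)

print("A_hat      =", theta_hat[0])
print("var(A_hat) =", C_theta_hat[0, 0])
```

In practice one would factor C (for example, by a Cholesky decomposition) and prewhiten the data rather than invert C explicitly, mirroring the whitening derivation that led to (4.25).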
References

Graybill, F.A., Theory and Application of the Linear Model, Duxbury Press, North Scituate, Mass., 1976.

MacWilliams, F.J., N.J. Sloane, "Pseudo-Random Sequences and Arrays," Proc. IEEE, Vol. 64, pp. 1715-1729, Dec. 1976.

Problems

4.1 We wish to estimate the amplitudes of exponentials in noise. The observed data are

x[n] = Σ_{i=1}^{p} A_i r_i^n + w[n]    n = 0, 1, ..., N−1

where w[n] is WGN with variance σ². Find the MVU estimator of the amplitudes and also their covariance. Evaluate your results for the case when p = 2, r_1 = 1, r_2 = −1, and N is even.

4.2 Prove that the inverse of H^T H exists if and only if the columns of H are linearly independent. Hint: The problem is equivalent to proving that H^T H is positive definite, and hence invertible, if and only if the columns are linearly independent.

4.3 Consider the observation matrix

H = [ 1  1
      1  1
      1  1+ε ]

where ε is small. Compute (H^T H)^{−1} and examine what happens as ε → 0. If x = [2 2 2]^T, find the MVU estimator and describe what happens as ε → 0.

4.4 In the linear model it is desired to estimate the signal s = Hθ. If an MVU estimator of θ is found, then the signal may be estimated as ŝ = Hθ̂. What is the PDF of ŝ? Apply your results to the linear model in Example 4.2.

4.5 Prove that for k, l = 1, 2, ..., M < N/2

Σ_{n=0}^{N−1} cos(2πkn/N) cos(2πln/N) = (N/2) δ_kl

by using the trigonometric identity

cos ω_1 cos ω_2 = (1/2) cos(ω_1 + ω_2) + (1/2) cos(ω_1 − ω_2)

and noting that

Σ_{n=0}^{N−1} cos(αn) = Re( Σ_{n=0}^{N−1} exp(jαn) ).

4.6 Assume that in Example 4.2 we have a single sinusoidal component at f_k = k/N. The model as given by (4.12) is

x[n] = a_k cos(2πf_k n) + b_k sin(2πf_k n) + w[n]    n = 0, 1, ..., N−1.

Using the identity A cos ω + B sin ω = √(A² + B²) cos(ω − φ), where φ = arctan(B/A), we can rewrite the model as

x[n] = √(a_k² + b_k²) cos(2πf_k n − φ) + w[n].

An MVU estimator is used for a_k, b_k, so that the estimated power of the sinusoid is

P̂ = (â_k² + b̂_k²)/2.
