
Lessons in Digital Estimation Theory

Jerry M. Mendel
Department of Electrical Engineering
University of Southern California
Los Angeles, California

Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632

PRENTICE-HALL SIGNAL PROCESSING SERIES
Alan V. Oppenheim, Editor

ANDREWS and HUNT  Digital Image Restoration
BRIGHAM  The Fast Fourier Transform
BURDIC  Underwater Acoustic System Analysis
CASTLEMAN  Digital Image Processing
COWAN and GRANT  Adaptive Filters
CROCHIERE and RABINER  Multirate Digital Signal Processing
DUDGEON and MERSEREAU  Multidimensional Digital Signal Processing
HAMMING  Digital Filters, 2nd ed.
HAYKIN, ED.  Array Signal Processing
JAYANT and NOLL  Digital Coding of Waveforms
KINO  Acoustic Waves: Devices, Imaging, and Analog Signal Processing
LEA, ED.  Trends in Speech Recognition
LIM, ED.  Speech Enhancement
MARPLE  Digital Spectral Analysis with Applications
McCLELLAN and RADER  Number Theory in Digital Signal Processing
MENDEL  Lessons in Digital Estimation Theory
OPPENHEIM, ED.  Applications of Digital Signal Processing
OPPENHEIM, WILLSKY, with YOUNG  Signals and Systems
OPPENHEIM and SCHAFER  Digital Signal Processing
RABINER and GOLD  Theory and Applications of Digital Signal Processing
RABINER and SCHAFER  Digital Processing of Speech Signals
ROBINSON and TREITEL  Geophysical Signal Analysis
STEARNS and DAVID  Signal Processing Algorithms
TRIBOLET  Seismic Applications of Homomorphic Signal Processing
WIDROW and STEARNS  Adaptive Signal Processing


Lessons in Digital Estimation Theory

Library of Congress Cataloging-in-Publication Data

MENDEL, JERRY M., (date)
Lessons in digital estimation theory.
Bibliography: p.
Includes index.
1. Estimation theory. I. Title.
QA276.8.M46 1986    511'.4    86-9365
ISBN 0-13-530809-7

To my parents, Eleanor and Alfred Mendel


and
my wife, Letty Mendel

Editorial/production supervision: Gretchen K. Chenenko
Cover design: Lundgren Graphics
Manufacturing buyer: Gordon Osbourne

© 1987 by Prentice-Hall, Inc.
A Division of Simon & Schuster
Englewood Cliffs, New Jersey 07632

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

ISBN 0-13-530809-7

PRENTICE-HALL INTERNATIONAL (UK) LIMITED, London
PRENTICE-HALL OF AUSTRALIA PTY. LIMITED, Sydney
PRENTICE-HALL CANADA INC., Toronto
PRENTICE-HALL HISPANOAMERICANA, S.A., Mexico
PRENTICE-HALL OF INDIA PRIVATE LIMITED, New Delhi
PRENTICE-HALL OF JAPAN, INC., Tokyo
PRENTICE-HALL OF SOUTHEAST ASIA PTE. LTD., Singapore
EDITORA PRENTICE-HALL DO BRASIL, LTDA., Rio de Janeiro

CONTENTS

PREFACE xiii

LESSON 1  INTRODUCTION, COVERAGE, AND PHILOSOPHY 1
    Introduction 1
    Coverage 2
    Philosophy 5

LESSON 2  THE LINEAR MODEL 7
    Introduction 7
    Examples 7
    Notational Preliminaries 14
    Problems 15

LESSON 3  LEAST-SQUARES ESTIMATION: BATCH PROCESSING 17
    Introduction 17
    Number of Measurements 18
    Objective Function and Problem Statement 18
    Derivation of Estimator 19
    Fixed and Expanding Memory Estimators 23
    Scale Changes 23
    Problems 25

LESSON 4  LEAST-SQUARES ESTIMATION: RECURSIVE PROCESSING 26
    Introduction 26
    Recursive Least-Squares: Information Form 27
    Matrix Inversion Lemma 30
    Recursive Least-Squares: Covariance Form 30
    Which Form to Use 31
    Problems 32

LESSON 5  LEAST-SQUARES ESTIMATION: RECURSIVE PROCESSING (continued) 35
    Introduction 35
    Generalization to Vector Measurements 36
    Cross-Sectional Processing 37
    Multistage Least-Squares Estimators 38
    Problems 42

LESSON 6  SMALL SAMPLE PROPERTIES OF ESTIMATORS 43
    Introduction 43
    Unbiasedness 44
    Efficiency 46
    Problems 52

LESSON 7  LARGE SAMPLE PROPERTIES OF ESTIMATORS 54
    Introduction 54
    Asymptotic Distributions 54
    Asymptotic Unbiasedness 57
    Consistency 57
    Asymptotic Efficiency 60
    Problems 61

LESSON 8  PROPERTIES OF LEAST-SQUARES ESTIMATORS 63
    Introduction 63
    Small Sample Properties of Least-Squares Estimators 63
    Large Sample Properties of Least-Squares Estimators 68
    Problems 70

LESSON 9  BEST LINEAR UNBIASED ESTIMATION 71
    Introduction 71
    Problem Statement and Objective Function 72
    Derivation of Estimator 73
    Comparison of θ̂_BLU(k) and θ̂_WLS(k) 74
    Some Properties of θ̂_BLU(k) 75
    Recursive BLUEs 78
    Problems 79

LESSON 10  LIKELIHOOD 81
    Introduction 81
    Likelihood Defined 81
    Likelihood Ratio 84
    Results Described by Continuous Distributions 85
    Multiple Hypotheses 85
    Problems 87

LESSON 11  MAXIMUM-LIKELIHOOD ESTIMATION 88
    Likelihood 88
    Maximum-Likelihood Method and Estimates 89
    Properties of Maximum-Likelihood Estimates 91
    The Linear Model (H(k) deterministic) 92
    A Log-Likelihood Function for an Important Dynamical System 94
    Problems 97

LESSON 12  ELEMENTS OF MULTIVARIATE GAUSSIAN RANDOM VARIABLES 100
    Introduction 100
    Univariate Gaussian Density Function 100
    Multivariate Gaussian Density Function 101
    Jointly Gaussian Random Vectors 101
    The Conditional Density Function 102
    Properties of Multivariate Gaussian Random Variables 104
    Properties of Conditional Mean 104
    Problems 106

LESSON 13  ESTIMATION OF RANDOM PARAMETERS: GENERAL RESULTS 108
    Introduction 108
    Mean-Squared Estimation 109
    Maximum a Posteriori Estimation 114
    Problems 116

LESSON 14  ESTIMATION OF RANDOM PARAMETERS: THE LINEAR AND GAUSSIAN MODEL 118
    Introduction 118
    Mean-Squared Estimator 118
    Best Linear Unbiased Estimation, Revisited 121
    Maximum a Posteriori Estimator 123
    Problems 126

LESSON 15  ELEMENTS OF DISCRETE-TIME GAUSS-MARKOV RANDOM PROCESSES 128
    Introduction 128
    Definitions and Properties of Discrete-Time Gauss-Markov Random Processes 128
    A Basic State-Variable Model 131
    Properties of the Basic State-Variable Model 133
    Signal-to-Noise Ratio 137
    Problems 138

LESSON 16  STATE ESTIMATION: PREDICTION 140
    Introduction 140
    Single-Stage Predictor 140
    A General State Predictor 142
    The Innovations Process 146
    Problems 147

LESSON 17  STATE ESTIMATION: FILTERING (THE KALMAN FILTER) 149
    Introduction 149
    A Preliminary Result 150
    The Kalman Filter 151
    Observations About the Kalman Filter 153
    Problems 158

LESSON 18  STATE ESTIMATION: FILTERING EXAMPLES 160
    Introduction 160
    Examples 160
    Problems 169

LESSON 19  STATE ESTIMATION: STEADY-STATE KALMAN FILTER AND ITS RELATIONSHIP TO A DIGITAL WIENER FILTER 170
    Introduction 170
    Steady-State Kalman Filter 170
    Single-Channel Steady-State Kalman Filter 173
    Relationships Between the Steady-State Kalman Filter and a Finite Impulse Response Digital Wiener Filter 176
    Comparisons of Kalman and Wiener Filters 181
    Problems 182

LESSON 20  STATE ESTIMATION: SMOOTHING 183
    Three Types of Smoothers 183
    Approaches for Deriving Smoothers 184
    A Summary of Important Formulas 184
    Single-Stage Smoother 184
    Double-Stage Smoother 187
    Single- and Double-Stage Smoothers as General Smoothers 189
    Problems 192

LESSON 21  STATE ESTIMATION: SMOOTHING (GENERAL RESULTS) 193
    Introduction 193
    Fixed-Interval Smoothers 193
    Fixed-Point Smoothing 199
    Fixed-Lag Smoothing 201
    Problems 202

LESSON 22  STATE ESTIMATION: SMOOTHING APPLICATIONS 204
    Introduction 204
    Minimum-Variance Deconvolution (MVD) 204
    Steady-State MVD Filter 207
    Relationship Between Steady-State MVD Filter and an Infinite Impulse Response Digital Wiener Deconvolution Filter 213
    Maximum-Likelihood Deconvolution 215
    Recursive Waveshaping 216
    Problems 222

LESSON 23  STATE ESTIMATION FOR THE NOT-SO-BASIC STATE-VARIABLE MODEL 223
    Introduction 223
    Biases 224
    Correlated Noises 225
    Colored Noises 227
    Perfect Measurements: Reduced-Order Estimators 230
    Final Remark 233
    Problems 233

LESSON 24  LINEARIZATION AND DISCRETIZATION OF NONLINEAR SYSTEMS 236
    Introduction 236
    A Dynamical Model 237
    Linear Perturbation Equations 239
    Discretization of a Linear Time-Varying State-Variable Model 242
    Discretized Perturbation State-Variable Model 245
    Problems 246

LESSON 25  ITERATED LEAST SQUARES AND EXTENDED KALMAN FILTERING 248
    Introduction 248
    Iterated Least Squares 248
    Extended Kalman Filter 249
    Application to Parameter Estimation 255
    Problems 256

LESSON 26  MAXIMUM-LIKELIHOOD STATE AND PARAMETER ESTIMATION 258
    Introduction 258
    A Log-Likelihood Function for the Basic State-Variable Model 259
    On Computing θ̂_ML 261
    A Steady-State Approximation 264
    Problems 269

LESSON 27  KALMAN-BUCY FILTERING 270
    Introduction 270
    System Description 271
    Notation and Problem Statement 271
    The Kalman-Bucy Filter 272
    Derivation of KBF Using a Formal Limiting Procedure 273
    Derivation of KBF When Structure of the Filter Is Prespecified 275
    Steady-State KBF 278
    An Important Application for the KBF 280
    Problems 281

LESSON A  SUFFICIENT STATISTICS AND STATISTICAL ESTIMATION OF PARAMETERS 282
    Introduction 282
    Concept of Sufficient Statistics 282
    Exponential Families of Distributions 284
    Exponential Families and Maximum-Likelihood Estimation 287
    Sufficient Statistics and Uniformly Minimum-Variance Unbiased Estimation 290
    Problems 294

APPENDIX A  GLOSSARY OF MAJOR RESULTS

REFERENCES 300

INDEX

PREFACE

Estimation theory is widely used in many branches of science and engineering. No doubt, one could trace its origin back to ancient times, but Karl Friedrich Gauss is generally acknowledged to be the progenitor of what we now refer to as estimation theory. R. A. Fisher, Norbert Wiener, Rudolph E. Kalman, and scores of others have expanded upon Gauss's legacy, and have given us a rich collection of estimation methods and algorithms from which to choose. This book describes many of the important estimation methods and shows how they are interrelated.

Estimation theory is a product of need and technology. Gauss, for example, needed to predict the motions of planets and comets from telescopic measurements. This need led to the method of least squares. Digital computer technology has revolutionized our lives. It created the need for recursive estimation algorithms, one of the most important ones being the Kalman filter. Because of the importance of digital technology, this book presents estimation theory from a digital viewpoint. In fact, it is this author's viewpoint that estimation theory is a natural adjunct to classical digital signal processing. It produces time-varying digital filter designs that operate on random data in an optimal manner.

This book has been written as a collection of lessons. It is meant to be an introduction to the general field of estimation theory, and, as such, is not encyclopedic in content or in references. It can be used for self-study or in a one-semester course. At the University of Southern California, we have covered all of its contents in such a course, at the rate of two lessons a week. We have been doing this since 1978.

Approximately one half of the book is devoted to parameter estimation and the other half to state estimation. For many years there has been a tendency to treat state estimation as a stand-alone subject and even to treat parameter estimation as a special case of state estimation. Historically, this is incorrect. In the musical Fiddler on the Roof, Tevye argues on behalf of "tradition . . . Tradition!" Estimation theory also has its tradition, and it begins with Gauss and parameter estimation. In Lesson 2 we show that state estimation is a special case of parameter estimation, i.e., it is the problem of estimating random parameters, when these parameters change from one time to the next. Consequently, the subject of state estimation flows quite naturally from the subject of parameter estimation.

Most of the book's important results are summarized in theorems and corollaries. In order to guide the reader to these results, they have been summarized for easy reference in Appendix A.

Problems are included for most lessons, because this book is meant to be used as a textbook. The problems fall into two groups. The first group contains problems that ask the reader to fill in details which have been left to the reader as an exercise. The second group contains problems that are related to the material in the lesson. They range from theoretical to computational problems.

This book is an outgrowth of a one-semester course on estimation theory, taught at the University of Southern California. Since 1978 it has been taught by four different people, who have encouraged me to convert the lecture notes into a book. I wish to thank Mostafa Shiva, Alan Laub, George Papavassilopoulos, and Rama Chellappa for their encouragement. Special thanks goes to Rama Chellappa, who provided supplementary Lesson A on the subject of sufficient statistics and statistical estimation of parameters. This lesson fits in very nicely just after Lesson 14.

While writing this text, the author had the benefit of comments and suggestions from many of his colleagues and students. I especially wish to acknowledge the help of Guan-Zhong Dai, Chong-Yung Chi, Phil Burns, Youngby Kim, Chung-Chin Lu, and Tom Hebert. Special thanks goes to Georgios Giannakis. The book would not be in its present form without their contributions.

Additionally, the author wishes to thank Marcel Dekker, Inc. for permitting him to include material from Mendel, J. M., 1973, Discrete Techniques of Parameter Estimation: the Equation-Error Formulation, in Lessons 1-9, 11, 18, and 24; and Academic Press, Inc. for permitting him to include material from Mendel, J. M., Optimal Seismic Deconvolution: an Estimation-Based Approach, copyright © 1983 by Academic Press, Inc., in Lessons 11-17, 19-22, and 26.

JERRY M. MENDEL
Los Angeles, California

Lesson 1
Introduction, Coverage, and Philosophy

INTRODUCTION

This book is all about estimation theory. It is useful, therefore, for us to understand the role of estimation in relation to the more global problem of modeling. Figure 1-1 decomposes modeling into four problems: representation, measurement, estimation, and validation. As Mendel (1973, pp. 2-4) states, "The representation problem deals with how something should be modeled. We shall be interested only in mathematical models. Within this class of models we need to know whether the model should be static or dynamic, linear or nonlinear, deterministic or random, continuous or discretized, fixed or varying, lumped or distributed, . . . , in the time-domain or in the frequency-domain, . . . , etc.

"In order to verify a model, physical quantities must be measured. We distinguish between two types of physical quantities, signals and parameters. Parameters express a relation between signals. . . .

"Not all signals and parameters are measurable. The measurement problem deals with which physical quantities should be measured and how they should be measured.

"The estimation problem deals with the determination of those physical quantities that cannot be measured from those that can be measured. We shall distinguish between the estimation of signals (i.e., states) and the estimation of parameters. Because a subjective decision must sometimes be made to classify a physical quantity as a signal or a parameter, there is some overlap between signal estimation and parameter estimation. . . ."

[Figure 1-1  Modeling Problem (reprinted from Mendel, 1973, p. 4, by courtesy of Marcel Dekker, Inc.)]

After a model has been completely specified, through choice of an appropriate mathematical representation, measurement of measurable signals, estimation of nonmeasurable signals, and estimation of its parameters, the model must be checked out. The validation problem deals with demonstrating confidence in the model. Often, statistical tests involving confidence limits are used to validate a model.

In this book we shall be interested in parameter estimation, state estimation, and combined state and parameter estimation. In Lesson 2 we provide six examples, each of which can be categorized either as a parameter or state estimation problem. Here we just mention that: the problem of identifying the sampled values of a linear and time-invariant system's impulse response from input/output data is one of parameter estimation; the problem of reconstructing a state vector associated with a dynamical system, from noisy measurements, is one of state estimation (state estimates might be needed to implement a linear-quadratic-Gaussian optimal control law, or to perform postflight data analysis, or signal processing such as deconvolution).

COVERAGE

This book focuses on a wide range of estimation techniques that can be applied either to linear or nonlinear models. Both parameter and state estimation techniques are treated. Some parameter estimation techniques are for deterministic parameters, whereas others are for random parameters; however, state estimation techniques are for random states. Table 1-1 summarizes the book's coverage.

TABLE 1-1  Estimation Techniques

I. LINEAR MODELS
   A. Parameter Estimation
      1. Deterministic parameters
         a. Least-squares (batch and recursive processing)
         b. Best linear unbiased estimation (BLUE)
         c. Maximum-likelihood
      2. Random parameters
         a. Mean-squared
         b. Maximum a posteriori
         c. BLUE
         d. Weighted least squares
   B. State Estimation
      1. Mean-squared prediction
      2. Mean-squared filtering (Kalman filter/Kalman-Bucy filter)
      3. Mean-squared smoothing
II. NONLINEAR MODELS
   A. Parameter Estimation
      Iterated least squares for deterministic parameters
   B. State Estimation
      Extended Kalman filter
   C. Combined State and Parameter Estimation
      1. Extended Kalman filter
      2. Maximum-likelihood

Four lessons (Lessons 3, 4, 5, and 8) are devoted to least-squares estimation because it is a very basic and important technique, and, under certain often-occurring conditions, other techniques reduce to it. Consequently, once we understand the nuances of least squares and have established that a different technique has reduced to least squares, we do not have to restudy the nuances of that technique.

In order to fully study least-squares estimators we must establish their small and large sample properties. What we mean by such properties is the subject of Lessons 6 and 7.

Having spent four lessons on least-squares estimation, we cover best linear unbiased estimation (BLUE) in one lesson, Lesson 9. We are able to do this because BLUE is a special case of least squares.

In order to set the stage for maximum-likelihood estimation, which is covered in Lesson 11, we describe the concept of likelihood and its relationship to probability in Lesson 10.

Lesson 12 provides a transition from our study of estimation of deterministic parameters to our study of estimation of random parameters. It provides much useful information about elements of Gaussian random variables.

To some readers, this lesson may be a review of material already known to them.

General results for both mean-squared and maximum a posteriori estimation of random parameters are covered in Lesson 13. These results are specialized to the important case of the linear and Gaussian model in Lesson 14. Best linear unbiased and weighted least-squares estimation are also revisited in Lesson 14. Lesson 14 is quite important, because it gives conditions under which mean-squared, maximum a posteriori, best linear unbiased, and weighted least-squares estimates of random parameters are identical. Lesson A, which is a supplemental one, is on the subject of sufficient statistics and statistical estimation of parameters. It fits in very nicely after Lesson 14.

Lesson 15 provides a transition from our study of parameter estimation to our study of state estimation. It provides much useful information about elements of discrete-time Gauss-Markov random processes, and also establishes the basic state-variable model, and its statistical properties, for which we derive a wide variety of state estimators. To some readers, this lesson may be a review of material already known to them.

Lessons 16 through 22 cover state estimation for the Lesson 15 basic state-variable model. Prediction is treated in Lesson 16. The important innovations process is also covered in that lesson. Filtering is the subject of Lessons 17, 18, and 19. The mean-squared state filter, commonly known as the Kalman filter, is developed in Lesson 17. Five examples which illustrate some interesting numerical and theoretical aspects of Kalman filtering are presented in Lesson 18. Lesson 19 establishes a bridge between mean-squared estimation and mean-squared digital signal processing. It shows how the steady-state Kalman filter is related to a digital Wiener filter. The latter is widely used in digital signal processing. Smoothing is the subject of Lessons 20, 21, and 22. Fixed-interval, fixed-point, and fixed-lag smoothers are developed in Lessons 20 and 21. Lesson 22 presents some applications which illustrate interesting numerical and theoretical aspects of fixed-interval smoothing. These applications are taken from the field of digital signal processing and include minimum-variance deconvolution, maximum-likelihood deconvolution, and recursive waveshaping.

Lesson 23 shows how to modify results given in Lessons 16, 17, 19, 20, and 21 from the basic state-variable model to a state-variable model that includes the following effects:

1. nonzero-mean noise processes and/or known bias function in the measurement equation,
2. correlated noise processes,
3. colored noise processes, and
4. perfect measurements.

Lesson 24 provides a transition from our study of estimation for linear models to estimation for nonlinear models. Because many real-world systems are continuous-time in nature and nonlinear, this lesson explains how to linearize and discretize a nonlinear differential equation model.

Lesson 25 is devoted primarily to the extended Kalman filter (EKF), which is a form of the Kalman filter that has been extended to nonlinear dynamical systems of the type described in Lesson 24. The EKF is related to the method of iterated least squares (ILS), the major difference between the two being that the EKF is for dynamical systems whereas ILS is not. This lesson also shows how to apply the EKF to parameter estimation, in which case states and parameters can be estimated simultaneously, and in real time.

The problem of obtaining maximum-likelihood estimates of a collection of parameters that appears in the basic state-variable model is treated in Lesson 26. The solution involves state and parameter estimation, but calculations can only be performed off-line, after data from an experiment has been collected.

The Kalman-Bucy filter, which is the continuous-time counterpart to the Kalman filter, is derived from two different viewpoints in Lesson 27. We include this lesson because the Kalman-Bucy filter is widely used in linear stochastic optimal control theory.

PHILOSOPHY

The digital viewpoint is emphasized throughout this book. Our estimation algorithms are digital in nature; many are recursive. The reasons for the digital viewpoint are:

1. much real data is collected in a digitized manner, so it is in a form ready to be processed by digital estimation algorithms, and
2. the mathematics associated with digital estimation theory are simpler than those associated with continuous estimation theory.

Regarding (2), we mention that very little knowledge about random processes is needed to derive digital estimation algorithms, because digital (i.e., discrete-time) random processes can be treated as vectors of random variables. Much more knowledge about random processes is needed to design continuous-time estimation algorithms.

Suppose our underlying model is continuous-time in nature. We are faced with two choices: develop a continuous-time estimation theory and then implement the resulting estimators on a digital computer (i.e., discretize the continuous-time estimation algorithm), or discretize the model and develop a discrete-time (i.e., digital) estimation theory that leads to estimation algorithms readily implemented on a digital computer.

If both approaches lead to algorithms that are implemented digitally, then we advocate the principle of simplicity for their development, and this leads us to adopt the second choice. For estimation, our modeling philosophy is, therefore, discretize the model at the front end of the problem.

Estimation theory has a long and glorious history (e.g., see Sorenson, 1970); however, it has been greatly influenced by technology, especially the computer. Although much of estimation theory was developed in the mathematics, statistics, and control theory literatures, we shall adopt the following viewpoint towards that theory: estimation theory is the extension of classical signal processing to the design of digital filters that process uncertain data in an optimal manner. In fact, estimation algorithms are just filters that transform input streams of numbers into output streams of numbers.

Most of classical digital filter design (e.g., Oppenheim and Schafer, 1975; Hamming, 1983; Peled and Liu, 1976) is concerned with designs associated with deterministic signals, e.g., low-pass and bandpass filters, and, over the years, specific techniques have been developed for such designs. The resulting filters are usually fixed in the sense that their coefficients do not change as a function of time. Estimation theory, on the other hand, leads to filter structures that are time-varying. These filters are designed (i.e., derived) using time-domain performance specifications (e.g., smallest error variance), and, as mentioned above, process random data in an optimal manner. Our philosophy about estimation theory is that it can be viewed as a natural adjunct to digital signal processing theory.

Example 1-1
At one time or another we have all used the sample mean to compute an average. Suppose we are given a collection of k measured values of quantity x, namely x(1), x(2), . . . , x(k). The sample mean of these measurements, x̄(k), is

x̄(k) = (1/k) Σ_{j=1}^{k} x(j)   (1-1)

A recursive formula for the sample mean is obtained from (1-1), as follows:

x̄(k+1) = [1/(k+1)] Σ_{j=1}^{k+1} x(j) = [1/(k+1)] [Σ_{j=1}^{k} x(j) + x(k+1)]
        = [k/(k+1)] x̄(k) + [1/(k+1)] x(k+1)   (1-2)

This recursive version of the sample mean is used for k = 0, 1, . . . by setting x̄(0) = 0. Observe that the sample mean, as expressed in (1-2), is a time-varying recursive digital filter whose input is measurement x(k). In later lessons we show that the sample mean is also an optimal estimation algorithm; thus, although the reader may not have been aware of it, the sample mean, which he or she has been using since early school days, is an estimation algorithm. □
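The recursion (1-2) is easy to check numerically. The following short sketch (in Python with NumPy; it is ours, not the book's, and the variable names are hypothetical) updates the sample mean one measurement at a time and confirms that it reproduces the batch average (1-1).

    import numpy as np

    def recursive_sample_mean(measurements):
        """Run the recursion (1-2): xbar(k+1) = k/(k+1)*xbar(k) + x(k+1)/(k+1)."""
        xbar = 0.0                      # xbar(0) = 0, as stated in Example 1-1
        history = []
        for k, x_next in enumerate(measurements):   # k = 0, 1, 2, ...
            xbar = (k / (k + 1)) * xbar + x_next / (k + 1)
            history.append(xbar)
        return history

    x = np.random.default_rng(0).normal(loc=5.0, scale=1.0, size=100)
    xbar_recursive = recursive_sample_mean(x)
    # The final recursive value matches the batch sample mean (1-1).
    assert np.isclose(xbar_recursive[-1], x.mean())

The point of the sketch is the one made in the text: the recursion is a time-varying digital filter, because its coefficients k/(k+1) and 1/(k+1) change with k.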


Lesson 2
The Linear Model

In order to estimate unknown quantities (i.e., parameters or signals) from measurements and other a priori information, we must begin with model representations and express them in such a way that attention is focused on the explicit relationship between the unknown quantities and the measurements. Many familiar models are linear in the unknown quantities (denoted θ), and can be expressed as

Z(k) = H(k)θ + V(k)   (2-1)

In this model, Z(k), which is N × 1, is called the measurement vector; θ, which is n × 1, is called the parameter vector; H(k), which is N × n, is called the observation matrix; and V(k), which is N × 1, is called the measurement noise vector. Usually, V(k) is random. By convention, the argument k of Z(k), H(k), and V(k) denotes the fact that the last measurement used to construct (2-1) is the kth. All other measurements occur before the kth.

Strictly speaking, (2-1) represents an affine transformation of parameter vector θ rather than a linear transformation. We shall, however, adhere to traditional estimation-theory literature, by calling (2-1) a "linear" model.

EXAMPLES

Some examples that illustrate the formation of (2-1) are given in this section. What distinguishes these examples from one another are the nature of and interrelationships between θ, H(k), and V(k). The following situations can occur.

A. θ is deterministic
   1. H(k) is deterministic.
   2. H(k) is random.
      a. H(k) and V(k) are statistically independent.
      b. H(k) and V(k) are statistically dependent.
B. θ is random
   1. H(k) is deterministic.
   2. H(k) is random.
      a. H(k) and V(k) are statistically independent.
      b. H(k) and V(k) are statistically dependent.

Example 2-1  Impulse Response Identification
It is well known that the output of a single-input single-output, linear, time-invariant, discrete-time system is given by the following convolution-sum relationship

y(k) = Σ_i h(i) u(k − i)   (2-2)

where k = 1, 2, . . . , N, h(i) is the system's impulse response (IR), u(k) is its input, and y(k) its output. If u(k) = 0 for k < 0, and the system is causal, so that h(i) = 0 for i ≤ 0, and h(i) = 0 for i > n, then

y(k) = Σ_{i=1}^{n} h(i) u(k − i)   (2-3)

Signal y(k) is measured by a sensor which is corrupted by additive measurement noise v(k), i.e., we only have access to measurement z(k), where

z(k) = y(k) + v(k)   (2-4)

and k = 1, 2, . . . , N. We now collect these N measurements as follows:

col(z(N), z(N−1), . . . , z(1)) =
   [ u(N−1)  u(N−2)  ...  u(N−n)
     u(N−2)  u(N−3)  ...  u(N−n−1)
       ...
     u(1)    u(0)    ...  0
     u(0)    0       ...  0        ] col(h(1), h(2), . . . , h(n)) + col(v(N), v(N−1), . . . , v(1))   (2-5)

i.e., Z(N) = H(N−1)θ + V(N). Clearly, (2-5) is in the form of (2-1).

In this application the n sampled values of the IR, h(i), play the role of unknown parameters, i.e., θ1 = h(1), θ2 = h(2), . . . , θn = h(n), and these parameters are deterministic. If input u(k) is deterministic and is known ahead of time (or can be measured) without error, then H(N−1) is deterministic, so that we are in case A.1. Often, however, u(k) is random, so that H(N−1) is random; but u(k) is in no way related to measurement noise v(k), so we are in case A.2.a. □

Example 2-2  Identification of the Coefficients of a Finite-Difference Equation
Suppose a linear, time-invariant, discrete-time system is described by the following nth-order finite-difference equation

y(k) + α1 y(k−1) + · · · + αn y(k−n) = u(k−1)   (2-6)

This model is often referred to as an all-pole or autoregressive (AR) model. It occurs in many branches of engineering and science, including speech modeling, geophysical modeling, etc. Suppose, also, that N perfect measurements of signal y(k) are available. Parameters α1, α2, . . . , αn are unknown and are to be estimated from the data. To do this, we can rewrite (2-6) as

y(k) = −α1 y(k−1) − · · · − αn y(k−n) + u(k−1)   (2-7)

and collect y(1), y(2), . . . , y(N) as we did in Example 2-1. Doing this, we obtain

col(y(N), y(N−1), . . . , y(1)) =
   [ y(N−1)  y(N−2)  ...  y(N−n)
     y(N−2)  y(N−3)  ...  y(N−n−1)
       ...
     y(1)    y(0)    ...  0
     y(0)    0       ...  0        ] col(−α1, −α2, . . . , −αn) + col(u(N−1), u(N−2), . . . , u(0))   (2-8)

i.e., Z(N) = H(N−1)θ + V(N−1), which, again, is in the form of (2-1).

In this example θ = col(−α1, −α2, . . . , −αn), and these parameters are deterministic. If input u(k−1) is deterministic, then the system's output y(k) will also be deterministic, so that both H(N−1) and V(N−1) are deterministic. This is a very special case of case A.1, because usually V is random. If, however, u(k−1) is random, then y(k) will also be random; but the elements of H(N−1) will now depend on those in V(N−1), because y(k) depends upon u(0), u(1), . . . , u(k−1). In this situation we are in case A.2.b. □
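The stacking in (2-5) is mechanical, and a short numerical sketch makes it concrete. The code below (ours, not the book's; the function names and data are hypothetical) builds the observation matrix of Example 2-1 from a known input and then recovers the impulse-response samples; the closing line uses the least-squares estimator that Lesson 3 derives. The same construction, with lagged outputs in place of lagged inputs, produces the matrix in (2-8) of Example 2-2.

    import numpy as np

    def build_H(u, n):
        """Observation matrix for z(k) = sum_{i=1..n} h(i) u(k-i) + v(k), k = 1..N.
        The row for z(k) is [u(k-1), u(k-2), ..., u(k-n)], with u(j) = 0 for j < 0."""
        N = len(u)
        H = np.zeros((N, n))
        for k in range(1, N + 1):
            for i in range(1, n + 1):
                if k - i >= 0:
                    H[k - 1, i - 1] = u[k - i]
        return H

    rng = np.random.default_rng(1)
    n, N = 4, 200
    h_true = np.array([1.0, -0.5, 0.25, 0.1])   # the unknown IR samples (theta)
    u = rng.normal(size=N)                      # known input u(0), ..., u(N-1)
    z = build_H(u, n) @ h_true + 0.05 * rng.normal(size=N)   # noisy measurements (2-4)

    # Least-squares estimate of theta = col(h(1), ..., h(n)); lstsq avoids forming inverses.
    h_hat, *_ = np.linalg.lstsq(build_H(u, n), z, rcond=None)
    print(h_hat)   # close to h_true

Because the rows of this H matrix are built from the (random) input, the sketch is an instance of case A.2.a in the list above.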
Example 2-3  Identification of the Initial Condition in an Unforced State Equation Model
Consider the problem of identifying the n × 1 initial condition vector x(0) of the linear, time-invariant, discrete-time system

x(k+1) = Φ x(k)   (2-9)

from the N measurements z(1), z(2), . . . , z(N), where

z(k) = h'x(k) + v(k)   (2-10)

The solution to (2-9) is

x(k) = Φ^k x(0)   (2-11)

so that

z(k) = h'Φ^k x(0) + v(k)   (2-12)

Collecting the N measurements, as before, we obtain

col(z(N), z(N−1), . . . , z(1)) = col(h'Φ^N, h'Φ^{N−1}, . . . , h'Φ) x(0) + col(v(N), v(N−1), . . . , v(1))   (2-13)

Once again, we have been led to (2-1), and we are in case A.1. □

Example 2-4  State Estimation
State-variable models are widely used in control and communication theory, and in signal processing. Often, we need the entire state vector of a dynamical system in order to implement an optimal control law for it, or to implement a digital signal processor. Usually, we cannot measure the entire state vector, and our measurements are corrupted by noise. In state estimation, our objective is to estimate the entire state vector from a limited collection of noisy measurements.

Here we consider the problem of estimating the n × 1 state vector x(k), at k = 1, 2, . . . , N, from a scalar measurement z(k), where k = 1, 2, . . . , N. The model for this example is

x(k+1) = Φ x(k) + γ u(k)   (2-14)
z(k) = h'x(k) + v(k)   (2-15)

We are keeping this example simple by assuming that the system is time-invariant and has only one input and one output; however, the results obtained in this example are easily generalized to time-varying and multichannel systems.

If we try to collect our N measurements as before, we obtain

z(N) = h'x(N) + v(N)
z(N−1) = h'x(N−1) + v(N−1)
. . .   (2-16)
z(1) = h'x(1) + v(1)

Observe that a different (unknown) state vector appears in each of the N measurement equations; thus, there does not appear to be a common θ for the collected measurements. Appearances can sometimes be deceiving.

So far, we have not made use of the state equation. Its solution can be expressed as

x(k) = Φ^{k−j} x(j) + Σ_{i=j+1}^{k} Φ^{k−i} γ u(i−1)   (2-17)

where k ≥ j + 1. We now focus our attention on the value of x(k) at k = k1, where 1 ≤ k1 ≤ N. Using (2-17), we can express x(N), x(N−1), . . . , x(k1+1) as an explicit function of x(k1), i.e.,

x(k) = Φ^{k−k1} x(k1) + Σ_{i=k1+1}^{k} Φ^{k−i} γ u(i−1)   (2-18)

where k = k1+1, k1+2, . . . , N. In order to do the same for x(1), x(2), . . . , x(k1−1), we solve (2-17) for x(j) and set k = k1,

x(j) = Φ^{j−k1} x(k1) − Σ_{i=j+1}^{k1} Φ^{j−i} γ u(i−1)   (2-19)

where j = k1−1, k1−2, . . . , 2, 1. Using (2-18) and (2-19), we can reexpress (2-16) as

z(k) = h'Φ^{k−k1} x(k1) + h' Σ_{i=k1+1}^{k} Φ^{k−i} γ u(i−1) + v(k),   k = N, N−1, . . . , k1+1
z(k1) = h'x(k1) + v(k1)   (2-20)
z(l) = h'Φ^{l−k1} x(k1) − h' Σ_{i=l+1}^{k1} Φ^{l−i} γ u(i−1) + v(l),   l = k1−1, k1−2, . . . , 1

These N equations can now be collected together, to give

col(z(N), z(N−1), . . . , z(1)) = col(h'Φ^{N−k1}, h'Φ^{N−1−k1}, . . . , h'Φ^{1−k1}) x(k1)
    + M(N, k1) col(u(N−1), u(N−2), . . . , u(0)) + col(v(N), v(N−1), . . . , v(1))   (2-21)

where the exact structure of matrix M(N, k1) is not important to us at this point. Observe that the state at k = k1 plays the role of parameter vector θ and that both H and V are different for different values of k1.

If x(0) and the system input u(k) are deterministic, then x(k) is deterministic for all k.

In this case θ is deterministic, but V(N, k1) is a superposition of deterministic and random components. On the other hand, if either x(0) or u(k) is random, then θ is a vector of random parameters. This latter situation is the more usual one in state estimation. It corresponds to case B.1. □

Example 2-5  A Nonlinear Model
Many of the estimation techniques that are described in this book in the context of linear model (2-1) can also be applied to the estimation of unknown signals or parameters in nonlinear models, when such models are suitably linearized. Suppose, for example, that

z(k) = f(θ, k) + v(k)   (2-22)

where k = 1, 2, . . . , N, and the structure of nonlinear function f(θ, k) is known explicitly. To see the forest from the trees in this example, we assume θ is a scalar parameter.

Let θ* denote a nominal value of θ, δθ = θ − θ*, and δz = z − z*, where

z*(k) = f(θ*, k)   (2-23)

Observe that the nominal measurements, z*(k), can be computed once θ* is specified, because f(·, k) is assumed to be known.

Using a first-order Taylor series expansion of f(θ, k) about θ = θ*, it is easy to show that

δz(k) ≈ [∂f(θ*, k)/∂θ*] δθ + v(k)   (2-24)

where k = 1, 2, . . . , N. It is easy to see how to collect these N equations, to give

δZ(N) = H(N) δθ + V(N)   (2-25)

in which ∂f(θ*, k)/∂θ* is short for ∂f(θ, k)/∂θ evaluated at θ = θ*. Observe that H depends on θ*. We will discuss different ways for specifying θ* in Lesson 25. □

Example 2-6  Deconvolution (Mendel, 1983b)
In Example 2-1 we showed how a convolutional model could be expressed as the linear model Z = Hθ + V. In that example we assumed that both input and output measurements were available, and we wanted to estimate the sampled values of the system's impulse response. Here we begin with the same convolutional model, written as

z(k) = Σ_{i=1}^{k} μ(i) h(k−i) + v(k)   (2-26)

where k = 1, 2, . . . , N. Noisy measurements z(1), z(2), . . . , z(N) are available to us, and we assume that we know the system's impulse response h(i), for all i. What is not known is the input to the system, μ(1), μ(2), . . . , μ(N). Deconvolution is the signal processing procedure for removing the effects of h(j) and v(j) from the measurements so that one is left with an estimate of μ(j).

In deconvolution we often assume that input μ(k) is white noise, but is not necessarily Gaussian. This type of deconvolution problem occurs in reflection seismology. We assume further that

μ(k) = r(k) q(k)   (2-27)

where r(k) is white Gaussian noise with variance σ_r², and q(k) is a random event location sequence of zeros and ones (a Bernoulli sequence). Sequences r(k) and q(k) are assumed to be statistically independent.

We now collect the N measurements, but in such a way that μ(1), μ(2), . . . , μ(N) are treated as the unknown parameters. Doing this, we obtain the following linear deconvolution model:

col(z(N), z(N−1), . . . , z(1)) =
   [ h(N−1)  h(N−2)  ...  h(1)  h(0)
     h(N−2)  h(N−3)  ...  h(0)  0
       ...
     h(0)    0       ...  0     0    ] col(μ(1), μ(2), . . . , μ(N)) + col(v(N), v(N−1), . . . , v(1))   (2-28)

i.e., Z(N) = H(N−1)μ + V(N). We shall often refer to θ as μ.

Using (2-27), we can also express θ = μ as

μ = Q_q r   (2-29)

where

r = col(r(1), r(2), . . . , r(N))   (2-30)

and

Q_q = diag(q(1), q(2), . . . , q(N))   (2-31)

In this case (2-28) can be expressed as

Z(N) = H(N−1) Q_q r + V(N)   (2-32)

When event locations q(1), q(2), . . . , q(N) are known, then we can view (2-32) as a linear model for determining r.

Regardless of which linear deconvolution model we use as our starting point for determining μ̂, we see that deconvolution corresponds to case B.1. Put another way, we have shown that the design of a deconvolution signal processing filter is isomorphic to the problem of estimating random parameters in a linear model. Note, however, that the dimension of θ, which is N × 1, increases as the number of measurements increases. In all other examples θ was n × 1, where n is a fixed integer. We return to this point in Lesson 14 where we discuss convergence of estimates of μ to their true values.

In Lesson 14, we shall develop minimum-variance and maximum-likelihood deconvolution filters. Equation (2-28) is the starting point for derivation of the former filter, whereas Equation (2-32) is the starting point for derivation of the latter filter. □

NOTATIONAL PRELIMINARIES

Equation (2-1) can be interpreted as a data generating model; it is a mathematical representation that is associated with the data. Parameter vector θ is assumed to be unknown and is to be estimated using Z(k), H(k), and possibly other a priori information. We use θ̂(k) to denote the estimate of constant parameter vector θ. Argument k in θ̂(k) denotes the fact that the estimate is based on measurements up to and including the kth. In our preceding examples, we would use the following notation for θ̂(k):

Example 2-1 [see (2-5)]: ĥ(N) with components ĥ(i | N)
Example 2-2 [see (2-8)]: θ̂(N)
Example 2-3 [see (2-13)]: x̂(0 | N)
Example 2-4 [see (2-21)]: x̂(k1 | N)
Example 2-5 [see (2-25)]: δθ̂(N)
Example 2-6 [see (2-28)]: μ̂(N) with components μ̂(i | N)

The notation used in Examples 1, 3, 4, and 6 is a bit more complicated than that used in the other examples, because we must indicate the time point at which we are estimating the quantity of interest (e.g., k1 or i) as well as the last data point used to obtain this estimate (e.g., N). We often read x̂(k1 | N) as "the estimate of x(k1) conditioned on N" or as "x hat at k1 conditioned on N."

In state estimation (or deconvolution) three situations are possible, depending upon the relative relationship of N to k1. For example, when N < k1 we are estimating a future value of x(k1), and we refer to this as a predicted estimate. When N = k1 we are using all past measurements and the most recent measurement to estimate x(k1). The result is referred to as a filtered estimate. Finally, when N > k1 we are estimating an earlier value of x(k1) using past, present, and future measurements. Such an estimate is referred to as a smoothed or interpolated estimate. Prediction and filtering can be done in real time, whereas smoothing can never be done in real time. We will see that the impulse responses of predictors and filters are causal, whereas the impulse response of a smoother is noncausal.

We use θ̃(k) to denote estimation error, i.e.,

θ̃(k) = θ − θ̂(k)   (2-33)

In state estimation, x̃(k1 | N) denotes state estimation error, and x̃(k1 | N) = x(k1) − x̂(k1 | N). In deconvolution, μ̃(i | N) is defined in a similar manner.

Very often we use the following estimation model for Z(k),

Ẑ(k) = H(k)θ̂(k)   (2-34)

To obtain (2-34) from (2-1), we assume that V(k) is zero-mean random noise that cannot be measured. In some applications (e.g., Example 2-2), Ẑ(k) represents a predicted value of Z(k). Associated with Ẑ(k) is the error Z̃(k), where

Z̃(k) = Z(k) − Ẑ(k)   (2-35)

which satisfies the equation

Z̃(k) = H(k)θ̃(k) + V(k)   (2-36)

In those applications where Ẑ(k) is a predicted value of Z(k), Z̃(k) is known as a prediction error. Other names for Z̃(k) are equation error and measurement residual.

In the rest of this book we develop specific structures for θ̂(k). These structures are referred to as estimators. Estimates are obtained whenever data is processed by an estimator. Estimator structures are associated with specific estimation techniques, and these techniques can be classified according to the natures of θ and H(k), and what a priori information is assumed known about noise vector V(k). See Lesson 1 for an overview of all the different estimation techniques that are covered in this book.
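A tiny numerical sketch (ours, not the book's; the data are made up) can make the notational distinctions of (2-33) through (2-36) concrete: given data generated by (2-1) with a true θ, any candidate estimate θ̂ yields the estimation model Ẑ = Hθ̂ of (2-34) and the residual Z̃ = Z − Ẑ of (2-35), and the identity (2-36) holds exactly.

    import numpy as np

    rng = np.random.default_rng(2)
    N, n = 50, 3
    H = rng.normal(size=(N, n))            # observation matrix H(k)
    theta_true = np.array([1.0, -2.0, 0.5])
    V = 0.1 * rng.normal(size=N)           # zero-mean measurement noise
    Z = H @ theta_true + V                 # data generating model (2-1)

    theta_hat, *_ = np.linalg.lstsq(H, Z, rcond=None)   # some estimate of theta
    Z_hat = H @ theta_hat                  # estimation model (2-34)
    Z_tilde = Z - Z_hat                    # measurement residual / equation error (2-35)
    theta_tilde = theta_true - theta_hat   # estimation error (2-33), unknown in practice

    # (2-36): Z_tilde equals H @ theta_tilde + V, up to floating-point round-off.
    assert np.allclose(Z_tilde, H @ theta_tilde + V)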
PROBLEMS

2-1. Suppose z(k) = θ1 + θ2 k + v(k), where z(1) = 0.2, z(2) = 1.4, z(3) = 3.6, z(4) = 7.5, and z(5) = 10.2. What are the explicit structures of Z(5) and H(5)?
2-2. According to thermodynamic principles, pressure P and volume V of a given mass of gas are related by PV^γ = C, where γ and C are constants. Assume that N measurements of P and V are available. Explain how to obtain a linear model for estimation of parameters γ and ln C.
2-3. (Mendel, 1973, Exercise 1-16(a), pg. 46). Suppose we know that a relationship exists between y and x1, x2, . . . , xn of the form

y = exp(a1 x1 + a2 x2 + · · · + an xn)

We desire to estimate a1, a2, . . . , an from measurements of y and x = col(x1, x2, . . . , xn). Explain how to do this.
2-4. (Mendel, 1973, Exercise 1-17, pp. 46-47). The efficiency E(t) of a jet engine may be viewed as a linear combination of functions of inlet pressure p(t) and operating temperature T(t); that is to say, E(t) is a linear combination, with constant coefficients C1, C2, C3, and C4, of prespecified functions f1, f2, and f3, plus v(t), where the structures of f1, f2, and f3 are known a priori and v(t) represents modeling error of known mean and variance. From tests on the engine, a table of values of E(t), p(t), and T(t) is given at discrete values of t. Explain how C1, C2, C3, and C4 are estimated from these data.

Lesson 3
Least-Squares Estimation: Batch Processing

INTRODUCTION

The method of least squares dates back to Karl Gauss around 1795, and is the cornerstone for most estimation theory, both classical and modern. It was invented by Gauss at a time when he was interested in predicting the motion of planets and comets using telescopic measurements. The motions of these bodies can be completely characterized by six parameters. The estimation problem that Gauss considered was one of inferring the values of these parameters from the measurement data.

We shall study least-squares estimation from two points of view: the classical batch-processing approach, in which all the measurements are processed together at one time, and the more modern recursive processing approach, in which measurements are processed only a few (or even one) at a time. The recursive approach has been motivated by today's high-speed digital computers; however, as we shall see, the recursive algorithms are outgrowths of the batch algorithms. In fact, as we enter the era of very large scale integration (VLSI) technology, it may well be that VLSI implementations of the batch algorithms are faster than digital computer implementations of the recursive algorithms.

The starting point for the method of least squares is the linear model

Z(k) = H(k)θ + V(k)   (3-1)

where Z(k) = col(z(k), z(k−1), . . . , z(k−N+1)), z(k) = h'(k)θ + v(k), and the estimation model for Z(k) is

Ẑ(k) = H(k)θ̂(k)   (3-2)

We denote the least-squares (weighted least-squares) estimator of θ as θ̂_LS(k) [θ̂_WLS(k)]. In this lesson and the next two we shall determine explicit structures for this estimator.

NUMBER OF MEASUREMENTS

Suppose that θ contains n parameters and Z(k) contains N measurements. If N < n we have fewer measurements than unknowns, and (3-1) is an underdetermined system of equations that does not lead to unique or very meaningful values for θ1, θ2, . . . , θn. If N = n, we have exactly as many measurements as unknowns, and as long as the n measurements are linearly independent, so that H^{-1}(k) exists, we can solve (3-1) for θ, as

θ = H^{-1}(k)Z(k) − H^{-1}(k)V(k)   (3-3)

Because we cannot measure V(k), it is usually neglected in the calculation of (3-3). For small amounts of noise this may not be a bad thing to do, but for even moderate amounts of noise this will be quite bad. Finally, if N > n we have more measurements than unknowns, so that (3-1) is an overdetermined system of equations. The extra measurements can be used to offset the effects of the noise; i.e., they let us "filter" the data. Only this last case is of real interest to us.

OBJECTIVE FUNCTION AND PROBLEM STATEMENT

Our method for obtaining θ̂(k) is based on minimizing the objective function

J[θ̂(k)] = Z̃'(k)W(k)Z̃(k)   (3-4)

where

Z̃(k) = col[z̃(k), z̃(k−1), . . . , z̃(k−N+1)]   (3-5)

and weighting matrix W(k) must be symmetric and positive definite, for reasons explained below.

No general rules exist for how to choose W(k). The most common choice is a diagonal matrix such as

W(k) = diag[μ^{k−N+1}, μ^{k−N+2}, . . . , μ^k]

When |μ| < 1, recent measurements are weighted more heavily than past ones. Such a choice for W(k) provides the weighted least-squares estimator with an aging or forgetting factor. When |μ| > 1, recent measurements are weighted less heavily than past ones. Finally, if μ = 1, so that W(k) = I, then all measurements are weighted by the same amount. When W(k) = I, θ̂(k) = θ̂_LS(k), whereas for all other W(k), θ̂(k) = θ̂_WLS(k).

Our objective is to determine the θ̂_WLS(k) that minimizes J[θ̂(k)].

DERIVATION OF ESTIMATOR

To begin, we express (3-4) as an explicit function of θ̂(k), using (3-2):

J[θ̂(k)] = Z̃'(k)W(k)Z̃(k) = [Z(k) − Ẑ(k)]'W(k)[Z(k) − Ẑ(k)]
        = [Z(k) − H(k)θ̂(k)]'W(k)[Z(k) − H(k)θ̂(k)]
        = Z'(k)W(k)Z(k) − 2Z'(k)W(k)H(k)θ̂(k) + θ̂'(k)H'(k)W(k)H(k)θ̂(k)   (3-6)

Next, we take the vector derivative of J[θ̂(k)] with respect to θ̂(k), but before doing this recall from vector calculus that: if m and b are two n × 1 nonzero vectors, and A is an n × n symmetric matrix, then

d(b'm)/dm = b   (3-7)

and

d(m'Am)/dm = 2Am   (3-8)

Using these formulas, we find that

dJ[θ̂(k)]/dθ̂(k) = −2H'(k)W(k)Z(k) + 2H'(k)W(k)H(k)θ̂(k)   (3-9)

Setting dJ[θ̂(k)]/dθ̂(k) = 0, we obtain the following formula for θ̂_WLS(k),

θ̂_WLS(k) = [H'(k)W(k)H(k)]^{-1} H'(k)W(k)Z(k)   (3-10)

Note, also, that

θ̂_LS(k) = [H'(k)H(k)]^{-1} H'(k)Z(k)   (3-11)

Comments

1. For (3-10) to be valid, H'(k)W(k)H(k) must be invertible. This is always true when W(k) is positive definite, as assumed, and H(k) is of maximum rank.
2. How do we know that θ̂_WLS(k) minimizes J[θ̂(k)]? We compute d²J[θ̂(k)]/dθ̂²(k) and see if it is positive definite [which is the vector calculus analog of the scalar calculus requirement that θ̂ minimizes J(θ̂) if dJ(θ̂)/dθ̂ = 0 and d²J(θ̂)/dθ̂² is positive]. Doing this, we see that

   d²J[θ̂(k)]/dθ̂²(k) = 2H'(k)W(k)H(k) > 0   (3-12)

   because H'(k)W(k)H(k) is invertible.
3. Estimator θ̂_WLS(k) processes the measurements Z(k) linearly; thus, it is referred to as a linear estimator. It processes the data contained in H(k) in a very complicated and nonlinear manner.
4. When (3-9) is set equal to zero, we obtain the following system of normal equations

   [H'(k)W(k)H(k)] θ̂_WLS(k) = H'(k)W(k)Z(k)   (3-13)

   This is a system of n linear equations in the n components of θ̂_WLS(k). In practice, one does not compute θ̂_WLS(k) using (3-10), because computing the inverse of H'(k)W(k)H(k) is fraught with numerical difficulties. Instead, the normal equations are solved using stable algorithms from numerical linear algebra that involve orthogonal transformations (see, e.g., Stewart, 1973; Bierman, 1977; and Dongarra et al., 1979). Because it is not the purpose of this book to go into details of numerical linear algebra, we leave it to the reader to pursue this important subject.

   Based on this discussion, we must view (3-10) as a useful theoretical formula and not as a useful computational formula. Remember that this is a book on estimation theory, so for our purposes, theoretical formulas are just fine.
5. Equation (3-13) can also be reexpressed as

   H'(k)W(k)Z̃(k) = 0   (3-14)

   which can be viewed as an orthogonality condition between Z̃(k) and W(k)H(k). Orthogonality conditions play an important role in estimation theory. We shall see many more examples of such conditions throughout this book.
6. Estimates obtained from (3-10) will be random! This is because Z(k) is random, and, in some applications, even H(k) is random. It is therefore instructive to view (3-10) as a complicated transformation of vectors or matrices of random variables into the vector of random variables θ̂_WLS(k). In later lessons, when we examine the properties of θ̂_WLS(k), these will be statistical properties, because of the random nature of θ̂_WLS(k).

3. Estimator 6I&$ processesthe measurements%(Az)Iinearly; thus it is


referred to as a Zinearestimator. It processesthe data contained in X(k) Clearly, X = co1(1, 1, . . . , I); hence;
in a very complicated and nonlinear manner.
(3-17)
4. When (3-9) is set equal to zero we obtain the following system of normal
equutions which is the sample mean of the N measurements. We see, therefore, sample
l--l
~~(k)W(k)~(k)l~~~(k) = ~W+YW(W (13-13) mean is a least-squares estimator. u

This is a systemof n hnear equations in the n components of &&). Example 3-2 (Mendel, 1973)
In practice, one does not compute &&J$ using (3-N)), because Figure 3-l depicts simphfied third-order pitch-plane dynamics for a typical, high-
computing the inverse of X(k)W(k)X(k) is fraught with numerical diffi- performance, aerodynamically controlled aerospacevehicle. Cross-coupling and body-
culties. Instead, the normal equationsare solved using stable algorithms bending effects are neglected. Normal acceleration control is considered with feedback
from numerical linear algebra that involve orthogonal transformations on normal acceleration and angle-of-attack rate. Stefani (1967) showsthat if the system
(see, e.g., Stewart, 1973; Bierman, 1977; and Dongarra, et al., 1979) gains are chosen as
Becauseit is not the purpose of this book to go into detaiIs of numerica c-2
linear algebra,we leaveit to the reader to pursuethis important subject. (3-18)
KNi= loo
Based on this discussion, we must view (3-10) as a useful Me-
oretical formula and not as a useful computationalformula. Remember
that this is a book on estimation theory, SOfor our purposes, theoretical K&= 100Ma (3-19)
formulas are just fine. and
5. Equation [3-13) can also be reexpressedas
C~f1ooM,
f!ff(k)W(k)2C(k) = U (3-14) K No= (3-20)
IOOMJ,
which can be viewed as an urthogona2it-ycondition between $5(k) and
W&)X(k). CMhogonality conditions pIay an important role in esti-
rnatiun theory. We shall see many more examples of such conditions (3-21)
throughout this book.
6. Estimates obtained from (3-N)) will be random! This is becauseT(k) is Stefani assumes2, 1845/~ is relatively small, and chooses C1 = 1400 and C, =
random, and, in some applications even X(k) is random. It is therefore 14,000. The closed-loop response resembles that of a second-order system with a
bandwidth of 2 Hz and a damping ratio of 0.6 that respondsto a step command of input
instructive to view (3-10) as a complicated transformation of vectors or acceleration with zero steady-state error.
matrices of random variables into the vector of random variables In general, M,, Mg, and Z, are dynamic parameters and all vary through a large
&&). In later lessons, when we examine the properties of &,&k), range of values. Also, M, may be positive (unstable vehicle) or negative (stable
these will be statistical properties, becauseof the random nature of vehicle). Systemresponsemust remain the samefor al1values of M,, Mg, and Z,; thus,
kvLs(~~. it is necessaryto estimate these parameters so that I&,, &, and KNacan be adapted to
keep C, and CZinvariant at their designed values. For present purposeswe shah assume
Example 3-l (Mendel, 1973, pp. 8647) that M,, Mb, and Z, are frozen at specific values.
Suppose we wish to calibrate an instrumentby making a series of uncorreIated mea- From Fig. 3-1,
surementson a constantquantity, Denoting the constantquantity as 0, our mea-
8(t) = M,a(t) + M&(t) (3-22)
surement equationbecomes
z(k) = 0 + v(k) (3-15)
where /c = 1,X . . . 9N. Collecting these N measurements,we have N4(f) = Zna((f) (3-23)
Our attention is directed at the estimation of M, and Ma in (3-22). We leave it as an
(3-16) exercise for the reader to explore the estimation of 2, in (3-23)
Our approach will be to estimate M, and Ms from the equation
e,(k) = M,a(k) -I- M&k) + vi(k) (3-24)

where θ̈_m(k) denotes the measured value of θ̈(k), which is corrupted by measurement noise v_θ̈(k). We shall assume (somewhat unrealistically) that α(k) and δ(k) can both be measured perfectly. The concatenated measurement equation for N measurements is

col(θ̈_m(k), θ̈_m(k−1), . . . , θ̈_m(k−N+1)) =
   [ α(k)       δ(k)
     α(k−1)     δ(k−1)
       ...
     α(k−N+1)   δ(k−N+1) ] col(Mα, Mδ) + col(v_θ̈(k), v_θ̈(k−1), . . . , v_θ̈(k−N+1))   (3-25)

[Figure 3-1  Pitch-plane dynamics and nomenclature: Ni, input normal acceleration along the negative Z axis; K_Ni, gain on Ni; δ, control-surface deflection; Mδ, control-surface effectiveness; θ̈, rigid-body acceleration; α, angle of attack; Mα, aerodynamic moment effectiveness; K_α̇, control gain on α̇; Zα, normal acceleration force coefficient; ẋ, axial velocity; NA, system-achieved normal acceleration along the negative Z axis; K_Nα, control gain on NA. (Reprinted from Mendel, 1973, p. 33, by courtesy of Marcel Dekker, Inc.)]

Hence, the least-squares estimates of Mα and Mδ are

col(M̂α,LS(k), M̂δ,LS(k)) =
   [ Σ_{j=0}^{N−1} α²(k−j)         Σ_{j=0}^{N−1} α(k−j)δ(k−j)
     Σ_{j=0}^{N−1} α(k−j)δ(k−j)    Σ_{j=0}^{N−1} δ²(k−j)       ]^{-1}
   col( Σ_{j=0}^{N−1} α(k−j)θ̈_m(k−j),  Σ_{j=0}^{N−1} δ(k−j)θ̈_m(k−j) )   (3-26)   □

FIXED AND EXPANDING MEMORY ESTIMATORS

Estimator θ̂_WLS(k) uses the measurements z(k−N+1), z(k−N+2), . . . , z(k). When N is fixed ahead of time, θ̂_WLS(k) uses a fixed window of measurements, a window of length N, and θ̂_WLS(k) is then referred to as a fixed-memory estimator. A second approach for choosing N is to set it equal to k; then, θ̂_WLS(k) uses the measurements z(1), z(2), . . . , z(k). In this case θ̂_WLS(k) uses an expanding window of measurements, a window of length k, and θ̂_WLS(k) is then referred to as an expanding-memory estimator.

SCALE CHANGES

Least-squares (LS) estimates may not be invariant under changes of scale. One way to circumvent this difficulty is to use normalized data.

For example, assume that observers A and B are observing a process; but observer A reads the measurements in one set of units and B in another.

Let M be a symmetric matrix of scale factors relating A to B; Z_A(k) and Z_B(k) denote the total measurement vectors of A and B, respectively. Then

Z_B(k) = H_B(k)θ + V_B(k) = M Z_A(k) = M H_A(k)θ + M V_A(k)   (3-27)

which means that

H_B(k) = M H_A(k)   (3-28)

Let θ̂_A,WLS(k) and θ̂_B,WLS(k) denote the WLSEs associated with observers A and B, respectively; then θ̂_B,WLS(k) = θ̂_A,WLS(k) if

W_B(k) = M^{-1} W_A(k) M^{-1}   (3-29)

It seems a bit peculiar, though, to have different weighting matrices for the two WLSEs. In fact, if we begin with θ̂_A,LS(k), then it is impossible to obtain θ̂_B,LS(k) such that θ̂_B,LS(k) = θ̂_A,LS(k). The reason for this is simple. To obtain θ̂_A,LS(k), we set W_A(k) = I, in which case (3-29) reduces to W_B(k) = (M^{-1})² ≠ I.

Next, let N_A and N_B denote symmetric normalization matrices for Z_A(k) and Z_B(k), respectively. We shall assume that our data is always normalized to the same set of numbers, i.e., that

N_A Z_A(k) = N_B Z_B(k)   (3-30)

Observe that

N_A Z_A(k) = N_A H_A(k)θ + N_A V_A(k)   (3-31)

and

N_B Z_B(k) = N_B M Z_A(k) = N_B M H_A(k)θ + N_B M V_A(k)   (3-32)

From (3-30), (3-31), and (3-32), we see that

N_A = N_B M   (3-33)

We now find that

θ̂_A,WLS(k) = (H_A' N_A W_A N_A H_A)^{-1} H_A' N_A W_A N_A Z_A(k)   (3-34)

and

θ̂_B,WLS(k) = (H_A' M N_B W_B N_B M H_A)^{-1} H_A' M N_B W_B N_B M Z_A(k)   (3-35)

Substituting (3-33) into (3-35), we then find

θ̂_B,WLS(k) = (H_A' N_A W_B N_A H_A)^{-1} H_A' N_A W_B N_A Z_A(k)   (3-36)

Comparing (3-36) and (3-34), we conclude that θ̂_B,WLS(k) = θ̂_A,WLS(k) if W_B(k) = W_A(k). This is precisely the result we were looking for. It means that, under proper normalization, θ̂_B,WLS(k) = θ̂_A,WLS(k), and, as a special case, θ̂_B,LS(k) = θ̂_A,LS(k).

PROBLEMS

3-1. Derive the formula for θ̂_WLS(k) by completing the square on the right-hand side of the expression for J[θ̂(k)] in (3-6).
3-2. Here we explore the estimation of Zα in (3-23). Assume that N noisy measurements of NA(k) are available, i.e., NA_m(k) = Zα α(k) + v_NA(k). What is the formula for the least-squares estimator of Zα?
3-3. Here we explore the simultaneous estimation of Mα, Mδ, and Zα in (3-22) and (3-23). Assume that N noisy measurements of θ̈(k) and NA(k) are available, i.e., θ̈_m(k) = Mα α(k) + Mδ δ(k) + v_θ̈(k) and NA_m(k) = Zα α(k) + v_NA(k). Determine the least-squares estimator of Mα, Mδ, and Zα. Is this estimator different from M̂α and M̂δ obtained just from θ̈_m(k) measurements, and Ẑα obtained just from NA_m(k) measurements?
3-4. In a curve-fitting problem we wish to fit a given set of data z(1), z(2), . . . , z(N) by the approximating function

ẑ(k) = Σ_{j=1}^{n} θj φj(k)

where φj(k) (j = 1, 2, . . . , n) are a set of prespecified basis functions.
(a) Obtain a formula for θ̂_LS(N) that is valid for any set of basis functions.
(b) The simplest approximating function to a set of data is the straight line. In this case ẑ(k) = θ1 + θ2 k, which is known as the least-squares or regression line. Obtain closed-form formulas for θ̂_1,LS(N) and θ̂_2,LS(N).
3-5. Suppose z(k) = θ1 + θ2 k, where z(1) = 3 miles per hour and z(2) = 7 miles per hour. Determine θ̂_1,LS and θ̂_2,LS based on these two measurements. Next, redo these calculations by scaling z(1) and z(2) to the units of feet per second. Are the least-squares estimates obtained from these two calculations the same? Use the results developed in the section entitled Scale Changes to explain what has happened here.
3-6. (a) Under what conditions on scaling matrix M is scale invariance preserved for a least-squares estimator?
(b) If our original model is nonlinear in the measurements [e.g., z(k) = θ z(k−1) + v(k)], can anything be done to obtain invariant WLSEs under scaling?

Lesson 4
Least-Squares Estimation: Recursive Processing

INTRODUCTION

In Lesson 3 we assumed that Z(k) contained N elements, where N > dim θ = n. Suppose we decide to add more measurements, increasing the total number of them from N to N'. Formula (3-10) in Lesson 3 would not make use of the previously calculated value of θ̂ that is based on N measurements during the calculation of θ̂ that is based on N' measurements. This seems quite wasteful. We intuitively feel that it should be possible to compute the estimate based on N' measurements from the estimate based on N measurements, and a modification of this earlier estimate to account for the N' − N new measurements. In this lesson we shall justify our intuition.

In Lesson 3 we also assumed that θ̂ is determined for a fixed value of n. In many system modeling problems one is interested in a preliminary model in which dimension n is a variable. This is becoming increasingly more important as we begin to model large-scale societal, energy, economic, etc. systems, in which it may not be clear at the onset what effects are most important. One approach is to recompute θ̂ by means of Formula (3-10) in Lesson 3 for different values of n. This may be very costly, especially for large-scale systems, since the number of flops to compute θ̂ is on the order of n³. A second approach is to obtain θ̂ for n = n1, and to use that estimate in a computationally effective manner to obtain θ̂ for n = n2, where n2 > n1. These estimators are recursive in the dimension of θ. We shall also examine these estimators.

RECURSIVE LEAST-SQUARES: INFORMATION FORM

To begin, we consider the case when one additional measurement z(k+1), made at t_{k+1}, becomes available:

z(k+1) = h'(k+1)θ + v(k+1)   (4-1)

When this equation is combined with our earlier linear model, we obtain a new linear model,

Z(k+1) = H(k+1)θ + V(k+1)   (4-2)

where

Z(k+1) = col(z(k+1) | Z(k))   (4-3)

H(k+1) = [ h'(k+1) ; H(k) ]   (4-4)

and

V(k+1) = col(v(k+1) | V(k))   (4-5)

Using (3-10) from Lesson 3 and (4-2), it is clear that

θ̂_WLS(k+1) = [H'(k+1)W(k+1)H(k+1)]^{-1} H'(k+1)W(k+1)Z(k+1)   (4-6)

To proceed further we must assume that W is diagonal, i.e.,

W(k+1) = diag(w(k+1) | W(k))   (4-7)

We shall now show that it is possible to determine θ̂(k+1) from θ̂(k) and z(k+1).

Theorem 4-1 (Information Form of Recursive LSE).  A recursive structure for θ̂_WLS(k) is

θ̂_WLS(k+1) = θ̂_WLS(k) + K_W(k+1)[z(k+1) − h'(k+1)θ̂_WLS(k)]   (4-8)

where

K_W(k+1) = P(k+1)h(k+1)w(k+1)   (4-9)

and

P^{-1}(k+1) = P^{-1}(k) + h(k+1)w(k+1)h'(k+1)   (4-10)

These equations are initialized by θ̂_WLS(n) and P^{-1}(n), where P(k) is defined below in (4-13), and are used for k = n, n+1, . . . , N−1.
Proof. Substitute (4-3), (4-4), and (4-7) into (4-6) (sometimes dropping the dependence upon k and k + 1, for notational simplicity) to see that

θ̂_WLS(k + 1) = [H'(k + 1)W(k + 1)H(k + 1)]⁻¹[h'wz + H'WZ]    (4-11)

Express θ̂_WLS(k) as

θ̂_WLS(k) = P(k)H'(k)W(k)Z(k)    (4-12)

where

P(k) = [H'(k)W(k)H(k)]⁻¹    (4-13)

From (4-12) and (4-13) it is straightforward to show that

H'(k)W(k)Z(k) = P⁻¹(k)θ̂_WLS(k)    (4-14)

and

P⁻¹(k + 1) = P⁻¹(k) + h'(k + 1)w(k + 1)h(k + 1)    (4-15)

It now follows that

θ̂_WLS(k + 1) = P(k + 1)[h'wz + P⁻¹(k)θ̂_WLS(k)]
             = P(k + 1){h'wz + [P⁻¹(k + 1) − h'wh]θ̂_WLS(k)}
             = θ̂_WLS(k) + P(k + 1)h'w[z − hθ̂_WLS(k)]    (4-16)
             = θ̂_WLS(k) + K_W(k + 1)[z − hθ̂_WLS(k)]

which is (4-8) when gain matrix K_W is defined as in (4-9).

Based on preceding discussions about dim θ = n and dim Z(k) = N, we know that the first value of N for which (3-10) in Lesson 3 can be used is N = n; thus, (4-8) must be initialized by θ̂_WLS(n), which is computed using (3-10) in Lesson 3. Equation (4-10) is also a recursive equation, for P⁻¹(k + 1), which is initialized by P⁻¹(n) = H'(n)W(n)H(n).  □

Comments

1. Equation (4-8) can also be expressed as

θ̂_WLS(k + 1) = [I − K_W(k + 1)h(k + 1)]θ̂_WLS(k) + K_W(k + 1)z(k + 1)    (4-17)

which demonstrates that the recursive least-squares estimator (LSE) is a time-varying digital filter that is excited by random inputs (i.e., the measurements), one whose plant matrix may itself be random, because K_W and h(k + 1) may be random. The random natures of K_W and (I − K_W h) make the analysis of this filter exceedingly difficult. If K_W and h are deterministic, then stability of this filter can be studied using Lyapunov stability theory.

2. In (4-8) the term h(k + 1)θ̂_WLS(k) is a prediction of the actual measurement z(k + 1). Because θ̂_WLS(k) is based on Z(k), we express this predicted value as ẑ(k + 1|k), i.e.,

ẑ(k + 1|k) = h(k + 1)θ̂_WLS(k)    (4-18)

so that

θ̂_WLS(k + 1) = θ̂_WLS(k) + K_W(k + 1)[z(k + 1) − ẑ(k + 1|k)]    (4-19)

3. Two recursions are present in our recursive LSE. The first is the vector recursion for θ̂_WLS given by (4-8). Clearly, θ̂_WLS(k + 1) cannot be computed from this expression until measurement z(k + 1) is available. The second is the matrix recursion for P⁻¹ given by (4-10). Observe that values for P⁻¹ (and, subsequently, K_W) can be precomputed before measurements are made.

4. A digital computer implementation of (4-8)-(4-10) proceeds as follows: P⁻¹(k + 1) → P(k + 1) → K_W(k + 1) → θ̂_WLS(k + 1).

5. Equations (4-8)-(4-10) can also be used for k = 0, 1, ..., N − 1 using the following values for P⁻¹(0) and θ̂_WLS(0):

P⁻¹(0) = (1/a²)I + h'(0)w(0)h(0)    (4-20)

and

θ̂_WLS(0) = P(0)[ε̄/a + h'(0)w(0)z(0)]    (4-21)

In these equations (which are derived in Mendel, 1973, pp. 101-106; see, also, Problem 4-1) a is a very large number, ε is a very small number, ε̄ is n × 1, and ε̄ = col(ε, ε, ..., ε). When these initial values are used in (4-8)-(4-10) for k = 0, 1, ..., n − 1, then the resulting values obtained for θ̂_WLS(n) and P⁻¹(n) are the very same ones that are obtained from the batch formulas for θ̂_WLS(n) and P⁻¹(n).

Often z(0) = 0, or there is no measurement made at k = 0, so that we can set z(0) = 0. In this case we can set w(0) = 0, so that P⁻¹(0) = I_n/a² and θ̂(0) = aε̄. By choosing ε sufficiently small (on the order of 1/a²), we see that (4-8)-(4-10) can be initialized by setting θ̂(0) = 0 and P(0) equal to a diagonal matrix of very large numbers.

6. The reason why the results in Theorem 4-1 are referred to as the information form of the recursive LSE is deferred until Lesson 11, where connections are made between least-squares and maximum-likelihood estimators (see the section entitled The Linear Model (H(k) deterministic), Lesson 11).
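The information-form recursion of Theorem 4-1 is simple to prototype. The following sketch is my own illustrative Python/NumPy code, not an excerpt from the text; the model, data, and variable names are hypothetical, and it simply runs (4-8)-(4-10) for scalar measurements, using the Comment 5 startup of a very small P⁻¹(0) and θ̂(0) = 0.

```python
# Hedged sketch of the information form of the recursive (W)LSE, eqs. (4-8)-(4-10).
# Hypothetical example, not from the text: theta is 2x1, measurements are scalar.
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -0.5])
N, n = 50, 2
a2 = 1e12                      # "a^2": a very large number (Comment 5 startup)
P_inv = np.eye(n) / a2         # P^{-1}(0) with w(0) = 0
theta_hat = np.zeros(n)        # theta_hat(0) = 0

for k in range(N):
    h = np.array([1.0, k])     # 1 x n observation row h(k+1)
    z = h @ theta_true + 0.1 * rng.standard_normal()
    w = 1.0                    # scalar weight w(k+1)
    P_inv = P_inv + w * np.outer(h, h)                 # eq. (4-10)
    P = np.linalg.inv(P_inv)                           # n x n inversion each step
    K = P @ h * w                                      # eq. (4-9)
    theta_hat = theta_hat + K * (z - h @ theta_hat)    # eq. (4-8)

print(theta_hat)               # should be close to theta_true
```

Note how the n × n inversion of P⁻¹ appears at every step; the covariance form derived next removes exactly this inversion.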
MATRIX INVERSION LEMMA

Equations (4-10) and (4-9) require the inversion of the n × n matrix P. If n is large, then this will be a costly computation. Fortunately, an alternative is available, one that is based on the following matrix inversion lemma.

Lemma 4-1. If the matrices A, B, C, and D satisfy the equation

B⁻¹ = A⁻¹ + C'D⁻¹C    (4-22)

where all matrix inverses are assumed to exist, then

B = A − AC'(CAC' + D)⁻¹CA    (4-23)

Proof. Multiply B by B⁻¹ using (4-23) and (4-22) to show that BB⁻¹ = I. For a constructive proof of this lemma see Mendel (1973), pp. 96-97.  □

Observe that if A and B are n × n matrices, C is m × n, and D is m × m, then to compute B from (4-23) requires the inversion of one m × m matrix. On the other hand, to compute B from (4-22) requires the inversion of one m × m matrix and two n × n matrices [A⁻¹ and (B⁻¹)⁻¹]. When m < n it is definitely advantageous to compute B using (4-23) instead of (4-22). Observe, also, that in the special case when m = 1, matrix inversion in (4-23) is replaced by division.

RECURSIVE LEAST-SQUARES: COVARIANCE FORM

Theorem 4-2 (Covariance Form of Recursive LSE). Another recursive structure for θ̂_WLS(k) is

θ̂_WLS(k + 1) = θ̂_WLS(k) + K_w(k + 1)[z(k + 1) − h(k + 1)θ̂_WLS(k)]    (4-24)

where

K_w(k + 1) = P(k)h'(k + 1)[h(k + 1)P(k)h'(k + 1) + 1/w(k + 1)]⁻¹    (4-25)

and

P(k + 1) = [I − K_w(k + 1)h(k + 1)]P(k)    (4-26)

These equations are initialized by θ̂_WLS(n) and P(n) and are used for k = n, n + 1, ..., N − 1.

Proof. We obtain the results in (4-25) and (4-26) by applying the matrix inversion lemma to (4-10), after which our new formula for P(k + 1) is substituted into (4-9). In order to accomplish the first part of this, let A = P(k), B = P(k + 1), C = h(k + 1), and D = 1/w(k + 1). Then (4-10) looks like (4-22), so, using (4-23), we see that

P(k + 1) = P(k) − P(k)h'(k + 1)[h(k + 1)P(k)h'(k + 1) + w⁻¹(k + 1)]⁻¹h(k + 1)P(k)    (4-27)

Consequently,

K_W(k + 1) = P(k + 1)h'(k + 1)w(k + 1)
           = [P − Ph'(hPh' + w⁻¹)⁻¹hP]h'w
           = Ph'[I − (hPh' + w⁻¹)⁻¹hPh']w
           = Ph'(hPh' + w⁻¹)⁻¹(hPh' + w⁻¹ − hPh')w
           = Ph'(hPh' + w⁻¹)⁻¹

which is (4-25). In order to obtain (4-26), express (4-27) as

P(k + 1) = P(k) − K_W(k + 1)h(k + 1)P(k)
         = [I − K_W(k + 1)h(k + 1)]P(k)   □

Comments

The recursive formula for θ̂_WLS, (4-24), is unchanged from (4-8). Only the matrix recursion for P, leading to gain matrix K_W, has changed. A digital computer implementation of (4-24)-(4-26) proceeds as follows: P(k) → K_W(k + 1) → θ̂_WLS(k + 1) → P(k + 1). This order of computations differs from the preceding one.

When z(k) is a scalar, the covariance form of the recursive LSE requires no matrix inversions, and only one division.

Equations (4-24)-(4-26) can also be used for k = 0, 1, ..., N − 1 using the values for P⁻¹(0) and θ̂_WLS(0) given in (4-20) and (4-21).

The reason why the results in Theorem 4-2 are referred to as the covariance form of the recursive LSE is deferred to Lesson 9, where connections are made between least-squares and best linear unbiased minimum-variance estimators (see p. 79).

WHICH FORM TO USE

We have derived two formulations for a recursive least-squares estimator, the information and covariance forms. In on-line applications, where speed of computation is often the most important consideration, the covariance form is preferable to the information form. This is because a smaller matrix needs to be inverted in the covariance form, namely an m × m matrix rather than an n × n matrix (m is often much smaller than n).
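As a rough illustration of the point just made, here is my own Python/NumPy sketch (hypothetical data and names, not from the text) of the covariance-form recursion (4-24)-(4-26) for scalar measurements, where the "inversion" in (4-25) is a single division; a comment also checks the produced P(k + 1) against the information-form relation (4-10), i.e., against Lemma 4-1.

```python
# Hedged sketch of the covariance form of the recursive WLSE, eqs. (4-24)-(4-26).
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([2.0, 0.3])
n = 2
P = 1e6 * np.eye(n)            # P(0): diagonal matrix of very large numbers
theta_hat = np.zeros(n)

for k in range(50):
    h = np.array([1.0, np.sin(0.1 * k)])     # hypothetical 1 x n regressor
    z = h @ theta_true + 0.05 * rng.standard_normal()
    w = 1.0
    denom = h @ P @ h + 1.0 / w                         # scalar h P h' + w^{-1}
    K = P @ h / denom                                   # eq. (4-25): division only
    theta_hat = theta_hat + K * (z - h @ theta_hat)     # eq. (4-24)
    P_new = (np.eye(n) - np.outer(K, h)) @ P            # eq. (4-26)
    # Consistency check with the information form (Lemma 4-1 / eq. (4-10)):
    assert np.allclose(np.linalg.inv(P_new),
                       np.linalg.inv(P) + w * np.outer(h, h), atol=1e-6)
    P = P_new

print(theta_hat)               # close to theta_true
```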
The information form is often more useful than the covariance form in analytical studies. For example, it is used to derive the initial conditions for P⁻¹(0) and θ̂_WLS(0), which are given in (4-20) and (4-21) (see Mendel, 1973, pp. 101-106). The information form is also to be preferred over the covariance form during the startup of recursive least squares. We demonstrate why this is so next.

We consider the case when

P(0) = a²I_n    (4-28)

where a² is a very, very large number. Using the information form, we find that, for k = 0, P⁻¹(1) = h'(1)w(1)h(1) + I_n/a², and, therefore, K_W(1) = [h'(1)w(1)h(1) + I_n/a²]⁻¹h'(1)w(1). No difficulties are encountered when we compute K_W(1) using the information form.

Using the covariance form we find, first, that K_w(1) = a²h'(1)[h(1)a²h'(1)]⁻¹ = h'(1)[h(1)h'(1)]⁻¹, and then that

P(1) = {I − h'(1)[h(1)h'(1)]⁻¹h(1)}a²    (4-29)

however, this matrix is singular. To see this, postmultiply both sides of (4-29) by h'(1), to obtain

P(1)h'(1) = {h'(1) − h'(1)[h(1)h'(1)]⁻¹h(1)h'(1)}a² = 0    (4-30)

Neither P(1) nor h(1) equal zero; hence, P(1) must be a singular matrix for P(1)h'(1) to equal zero. In fact, once P(1) becomes singular, all other P(j), j ≥ 2, will be singular.

In Lesson 9 we shall show that when W⁻¹(k) = E{V(k)V'(k)} = R(k), then P(k) is the covariance matrix of the estimation error, θ̃(k). This matrix must be positive definite, and it will be quite difficult to maintain this property if P(k) is singular; hence, it is advisable to initialize the recursive least-squares estimator using the information form. However, it is also advisable to switch to the covariance formulation as soon after initialization as possible, in order to reduce computing time.

PROBLEMS

4-1. In order to derive the formulas for P⁻¹(0) and θ̂_WLS(0), given in (4-20) and (4-21), respectively, one proceeds as follows. Introduce n artificial measurements z_a(−1), z_a(−2), ..., z_a(−n), where z_a(−j) ≜ ε, in which ε is a very small number. Then assume that the model for z_a(−j) is z_a(−j) = θ_j/a (j = 1, 2, ..., n), where a is a very large number.
(a) Show that θ̂_WLS(−1) = aε̄, where the n × 1 vector ε̄ = col(ε, ε, ..., ε). Additionally, show that P(−1) = a²I_n.
(b) Show that θ̂_WLS(0) = P(0)[ε̄/a + h'(0)w(0)z(0)] and P(0) = [I_n/a² + h'(0)w(0)h(0)]⁻¹.
(c) Show that when the measurements z_a(−n), ..., z_a(−1), z(0), z(1), ..., z(l + 1) are used, then

θ̂_WLS(l + 1) = P(l + 1)[ε̄/a + Σ_{j=0}^{l+1} h'(j)w(j)z(j)]

and

P⁻¹(l + 1) = I_n/a² + Σ_{j=0}^{l+1} h'(j)w(j)h(j)

(d) Show that when the measurements z(0), z(1), ..., z(l + 1) are used (i.e., the artificial measurements are not used), then

θ̂*_WLS(l + 1) = P*(l + 1) Σ_{j=0}^{l+1} h'(j)w(j)z(j)

where

P*(l + 1) = [Σ_{j=0}^{l+1} h'(j)w(j)h(j)]⁻¹

(e) Finally, show that for a >>> 1 and ε <<< 1, P(l + 1) → P*(l + 1) and θ̂_WLS(l + 1) → θ̂*_WLS(l + 1).

4-2. Prove that once P(1) becomes singular, all other P(j), j ≥ 2, will be singular.

4-3. (Mendel, 1973, Exercise 2-12, pg. 138). The following weighting matrix weights past measurements less heavily than the most recent measurements,

W(k + 1) = diag(w(k + 1), w(k), ..., w(1))

in which successive weights are related by a constant factor c, 0 < c ≤ 1.
(a) Show that w(j) = c^{k+1−j} w(k + 1) for j = 1, ..., k + 1.
(b) How must the equations for the recursive weighted least-squares estimator be modified for this weighting matrix? The estimator thus obtained is known as a fading memory estimator (Morrison, 1969).

4-4. We showed, in Lesson 3, that it doesn't matter how one chooses the weights, w(j), in the method of least squares, because the weights cancel out in the batch formula for θ̂_LS(k). On the other hand, w(k + 1) appears explicitly in the recursive WLSE, but only in the formula for K_w(k + 1), i.e.,

K_w(k + 1) = P(k)h'(k + 1)[h(k + 1)P(k)h'(k + 1) + w⁻¹(k + 1)]⁻¹

It would appear that if we set w(k + 1) = w₁ for all k, we would obtain a K_w(k + 1) value that would be different from that obtained by setting w(k + 1) = w₂ for all k. Of course, this cannot be true. Show, using the
formulas for the recursive WLSE, how they can be made independent of w(k + 1), when w(k + 1) = w for all k.

4-5. For the data in the accompanying table, do the following:
(a) Obtain the least-squares line y(t) = a + bt by means of the batch processing least-squares algorithm;
(b) Obtain the least-squares line by means of the recursive least-squares algorithm, using the recursive startup technique (let a = 10 and ε = 10⁻¹⁶).

t     y(t)
0     1
1     5
2     9
3     11

Lesson 5

Least-Squares Estimation: Recursive Processing (continued)

Example 5-1
In order to illustrate some of the Lesson 4 results we shall obtain a recursive algorithm for the least-squares estimator of the scalar θ in the instrument calibration example (Lesson 3). Gain K_W(k + 1) is computed using (4-9) of Lesson 4, and P(k + 1) is computed using (4-13) of Lesson 4. Generally, we do not compute P(k + 1) using (4-13); but the simplicity of our example allows us to use this formula to obtain a closed-form expression for P(k + 1) in the most direct way. Recall that H(k) = col(1, 1, ..., 1), which is a k × 1 vector, and h(k + 1) = 1; thus, setting W(k) = I and w(k + 1) = 1 (in order to obtain the recursive LSE of θ) in the preceding formulas, we find that

P(k + 1) = [H'(k + 1)H(k + 1)]⁻¹ = 1/(k + 1)    (5-1)

and

K_W(k + 1) = P(k + 1) = 1/(k + 1)    (5-2)

Substituting these results into (4-8) of Lesson 4, we then find

θ̂_LS(k + 1) = θ̂_LS(k) + [1/(k + 1)][z(k + 1) − θ̂_LS(k)]

or

θ̂_LS(k + 1) = [kθ̂_LS(k) + z(k + 1)]/(k + 1)    (5-3)
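A small numerical sketch of this recursion may help; the Python code below is my own illustrative example (hypothetical numbers, not from the text). It shows that the recursion (5-3), started at θ̂_LS(0) = 0, reproduces the running sample mean exactly, which is one way of seeing the sample mean as a first-order time-varying digital filter.

```python
# Hedged sketch: the Example 5-1 recursion (5-3) is the running sample mean.
import numpy as np

rng = np.random.default_rng(2)
theta = 4.7                                   # true calibration constant
z = theta + 0.2 * rng.standard_normal(100)    # z(k) = theta + v(k)

theta_hat = 0.0                               # theta_hat(0) = 0 (recursive startup)
for k, zk in enumerate(z):                    # k = 0, 1, ...; zk plays z(k+1)
    theta_hat = (k * theta_hat + zk) / (k + 1)    # eq. (5-3)

print(theta_hat, z.mean())                    # identical up to roundoff
```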
Formula (5-3), which can be used for k = 0, 1, ..., N − 1 by setting θ̂_LS(0) = 0, lets us reinterpret the well-known sample mean estimator as a time-varying digital filter [see, also, (1-2) of Lesson 1]. We leave it to the reader to study the stability properties of this first-order filter.

Usually it is in only the simplest of cases that we can obtain closed-form expressions for K_W and P and, subsequently, θ̂_WLS [or θ̂_LS]; however, we can always obtain values for K_W(k + 1) and P(k + 1) at successive time points using the results in Theorems 4-1 or 4-2.  □

GENERALIZATION TO VECTOR MEASUREMENTS

A vector of measurements can occur in any application where it is possible to use more than one sensor; however, it is also possible to obtain a vector of measurements from certain types of individual sensors. In spacecraft applications, it is not unusual to be able to measure attitude, rate, and acceleration. In electrical systems applications, it is not uncommon to be able to measure voltages, currents, and power. Radar measurements often provide information about range, azimuth, and elevation. Radar is an example of a single sensor that provides a vector of measurements.

In the vector measurement case, (4-1) of Lesson 4 is changed from z(k + 1) = h(k + 1)θ + v(k + 1) to

z(k + 1) = H(k + 1)θ + v(k + 1)    (5-4)

where z is now an m × 1 vector, H is m × n, and v is m × 1.

We leave it to the reader to show that all of the results in Lessons 3 and 4 are unchanged in the vector measurement case; but some notation must be altered (see Table 5-1).

TABLE 5-1 Transformations from Scalar to Vector Measurement Situations, and Vice-Versa

Scalar Measurement               Vector of Measurements
z(k + 1)                         z(k + 1), an m × 1 vector
v(k + 1)                         v(k + 1), an m × 1 vector
w(k + 1)                         W(k + 1), an m × m matrix
h(k + 1), a 1 × n matrix         H(k + 1), an m × n matrix
Z(k), an N × 1 vector            Z(k), an Nm × 1 vector
V(k), an N × 1 vector            V(k), an Nm × 1 vector
W(k), an N × N matrix            W(k), an Nm × Nm matrix
H(k), an N × n matrix            H(k), an Nm × n matrix

Source: Reprinted from Mendel, 1973, p. 110. Courtesy of Marcel Dekker, Inc., NY.

CROSS-SECTIONAL PROCESSING

Suppose that at each sampling time t_{k+1} there are q sensors or groups of sensors that provide our vector measurement data. These sensors are corrupted by noise that is uncorrelated from one sensor group to another. The m-dimensional vector z(k + 1) can be represented as

z(k + 1) = col(z₁(k + 1), z₂(k + 1), ..., z_q(k + 1))    (5-5)

where

z_i(k + 1) = H_i(k + 1)θ + v_i(k + 1)    (5-6)

dim z_i(k + 1) = m_i × 1,

Σ_{i=1}^{q} m_i = m    (5-7)

E{v_i(k + 1)} = 0, and

E{v_i(k + 1)v_j'(k + 1)} = R_i(k + 1)δ_{ij}    (5-8)

An alternative to processing all m measurements in one batch (i.e., simultaneously) is available, and is one in which we freeze time at t_{k+1} and recursively process the q batches of measurements one batch at a time. Data z₁(k + 1) are used to obtain an estimate [for notational simplicity, in this section we omit the subscript WLS or LS on θ̂] θ̂₁(k + 1), with θ̂₁(k) ≜ θ̂(k) and z(k + 1) ≜ z₁(k + 1). When these calculations are completed, z₂(k + 1) is processed to obtain the estimate θ̂₂(k + 1). Estimate θ̂₁(k + 1) is used to initialize θ̂₂(k + 1). Each set of data is processed in this manner until the final set z_q(k + 1) has been included. Then time is advanced to t_{k+2} and the cycle is repeated. This type of processing is known as cross-sectional or sequential processing. It is summarized in Figure 5-1 and is contrasted with more usual simultaneous recursive processing in Figure 5-2.

[Figure 5-1 Cross-sectional processing of m measurements; all of this processing is done at the frozen time point t_{k+1}.]
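A rough sketch of the idea in code may be useful here. The Python/NumPy example below is my own illustration with hypothetical sensor groups (it is not from the text, and it uses the covariance-form update with each group's noise covariance R_i playing the role of w⁻¹): at one frozen time point, the q batches are folded in one at a time, each stage initialized by the previous stage's estimate.

```python
# Hedged sketch of cross-sectional (sequential) processing at one frozen time point.
# Hypothetical example: theta is 2x1, q = 3 sensor groups, each with one measurement.
import numpy as np

def covariance_update(theta_hat, P, H_i, z_i, R_i):
    """One stage: fold in sensor group i using covariance-form equations."""
    S = H_i @ P @ H_i.T + R_i                      # innovation covariance (m_i x m_i)
    K = P @ H_i.T @ np.linalg.inv(S)
    theta_hat = theta_hat + K @ (z_i - H_i @ theta_hat)
    P = (np.eye(len(theta_hat)) - K @ H_i) @ P
    return theta_hat, P

rng = np.random.default_rng(3)
theta_true = np.array([1.0, 2.0])
H_groups = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]), np.array([[1.0, 1.0]])]
R_groups = [np.array([[0.01]])] * 3

theta_hat, P = np.zeros(2), 1e6 * np.eye(2)        # prior carried over from time t_k
for H_i, R_i in zip(H_groups, R_groups):           # freeze time, loop over the q groups
    z_i = H_i @ theta_true + rng.multivariate_normal(np.zeros(1), R_i)
    theta_hat, P = covariance_update(theta_hat, P, H_i, z_i, R_i)

print(theta_hat)   # same answer as processing all three groups in one batch
```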
[Figure 5-2 Two ways to reach θ̂(k + 1), θ̂(k + 2), ...: (a) simultaneous recursive processing performed along the line dim z = m, and (b) cross-sectional recursive processing where, for example, at t_{k+1} the processing is performed along the line TIME = t_{k+1} and stops when that line intersects the line dim z = m.]

The remarkable property about cross-sectional processing is that

θ̂_q(k + 1) = θ̂(k + 1)    (5-9)

where θ̂(k + 1) is obtained from simultaneous recursive processing.

A very large computational advantage exists for cross-sectional processing if m_i = 1. In this case the matrix inverse [H(k + 1)P(k)H'(k + 1) + W⁻¹(k + 1)]⁻¹ needed in (4-25) of Lesson 4 is replaced by the division [h_i(k + 1)P_i(k)h_i'(k + 1) + 1/w_i(k + 1)]⁻¹. See Mendel, 1973, pp. 113-118, for a proof of (5-9).

MULTISTAGE LEAST-SQUARES ESTIMATORS

Suppose we are given a linear model Z(k) = H₁(k)θ₁ + V(k) with n unknown parameters θ₁, datum {Z(k), H₁(k)}, and the LSE of θ₁, θ̂*₁,LS(k), where

θ̂*₁,LS(k) = [H₁'(k)H₁(k)]⁻¹H₁'(k)Z(k)    (5-10)

We extend this model to include l additional parameters, θ₂, so that our model is given by

Z(k) = H₁(k)θ₁ + H₂(k)θ₂ + V(k)    (5-11)

For this model, datum {Z(k), H₁(k), H₂(k)} is available. We wish to compute the least-squares estimates of θ₁ and θ₂ for the n + l parameter model using the previously computed θ̂*₁,LS(k).

Theorem 5-1. Given the linear model in (5-11), where θ₁ is n × 1 and θ₂ is l × 1. The LSEs of θ₁ and θ₂ based on datum {Z(k), H₁(k), H₂(k)} are found from the following equations:

θ̂₁,LS(k) = θ̂*₁,LS(k) − G(k)C(k)H₂'(k)[Z(k) − H₁(k)θ̂*₁,LS(k)]    (5-12)

and

θ̂₂,LS(k) = C(k)H₂'(k)[Z(k) − H₁(k)θ̂*₁,LS(k)]    (5-13)

where

G(k) = [H₁'(k)H₁(k)]⁻¹H₁'(k)H₂(k)    (5-14)

and

C(k) = [H₂'(k)H₂(k) − H₂'(k)H₁(k)G(k)]⁻¹    (5-15)

The results in this theorem were worked out by Astrom (1968) and emphasize operations which are performed on the vector of residuals for the n-parameter model, namely Z(k) − H₁(k)θ̂*₁,LS(k). Other forms for θ̂₁,LS(k) and θ̂₂,LS(k) appear in Mendel (1975).

Proof. The derivation of (5-12) and (5-13) is based primarily on the block decomposition method for inverting H'H, where H = (H₁ | H₂). See Mendel (1975) for the details.  □

Similar results to those in Theorem 5-1 can be developed for the removal of l parameters from a model, for adding or removing parameters one at a time, and for recursive-in-time versions of all these results (see Mendel, 1975). All of these results are referred to as multistage LSEs.

We conclude this section and lesson with some examples which illustrate problems for which multistage algorithms can be quite useful.

Example 5-2 Identification of a Sampled Impulse Response: Zoom-In Algorithm
A test signal, u(t), is applied at t = t₀ to a linear, time-invariant, causal, but unknown system whose output, y(t), and input are measured. The unknown impulse response, w(t), is to be identified using sampled values of u(t) and y(t). For such a system

y(t_k) = ∫_{t₀}^{t_k} w(τ)u(t_k − τ)dτ    (5-16)
One approach to identifying w(t) is to discretize (5-16) and to only identify w(t) at discrete values of time. If we assume that (1) w(t) ≅ 0 for all t ≥ t_n, (2) [t₀, t_n] is divided into n equal intervals, each of width T, so that n = (t_n − t₀)/T, and (3) for τ ∈ [t_{i−1}, t_i], w(τ) ≅ w(t_{i−1}) and u(t − τ) ≅ u(t − t_{i−1}), then

y(t_k) = Σ_{i=1}^{n} w₁(t_{i−1})u(t_k − t_{i−1})    (5-17)

where

w₁(t_{i−1}) = T w(t_{i−1})    (5-18)

It is straightforward to identify the n unknown parameters w₁(t₀), w₁(t₁), ..., w₁(t_{n−1}) via least squares (see Example 2-1 of Lesson 2); however, for n to be known, t_n must be accurately known and T must be chosen most judiciously. In actual practice t_n is not known that accurately, so that n may have to be varied. Multistage LSEs can be used to handle this situation. Sometimes T can be too coarse for certain regions of time, in which case significant features of w(t), such as a ripple, may be obscured. In this situation, we would like to zoom in on those intervals of time and rediscretize y(t) just over those intervals, thereby adding more terms to (5-17). Multistage LSEs can also be used to handle this situation, as we demonstrate next.

For illustrative purposes, we present this procedure for the case when the interval of interest equals T, i.e., when t ∈ [t_x, t_{x+1}], whose width equals T and which is further divided into q equal intervals, each of width ΔT_x, so that q = (t_{x+1} − t_x)/ΔT_x = T/ΔT_x. Observe that

∫_{t_x}^{t_{x+1}} w(τ)u(t_k − τ)dτ = Σ_{j=0}^{q−1} ∫_{t_x+jΔT_x}^{t_x+(j+1)ΔT_x} w(τ)u(t_k − τ)dτ
                                   ≅ Σ_{j=0}^{q−1} w(t_x + jΔT_x)u(t_k − t_x − jΔT_x)ΔT_x    (5-19)

When the approximation in (5-19) replaces the single term associated with t_x in (5-17), we obtain a rediscretized model for y(t_k); call it (5-20). Equation (5-20) contains n + q − 1 parameters.

Let

θ₁ = col(w₁(t₀), w₁(t₁), ..., w₁(t_x), ..., w₁(t_{n−1}))    (5-22)

and assume that a least-squares estimate θ̂*₁,LS, which is based on (5-17), is available. Let

θ₂ = col(w₂(t_x + ΔT_x), w₂(t_x + 2ΔT_x), ..., w₂(t_x + (q − 1)ΔT_x))    (5-23)

To obtain θ̂_LS = col(θ̂₁,LS, θ̂₂,LS), proceed as follows: (1) modify θ̂*₁,LS by scaling θ̂*₁,LS(t_x) to θ̂*₁,LS(t_x)ΔT_x/T and call the modified result θ̂*₁,LS; and (2) apply Theorem 5-1 to obtain

[TABLE 5-2 Impulse Response Estimates (Small Sample). The table lists, for models containing from one to ten parameters, the average estimates of the parameters θ̂_i (values shown in parentheses are true values; blank spaces denote no value for a parameter) and the resulting average performance. The individual numerical entries are not reproduced here. Source: Reprinted from Mendel, 1975, pg. 782, © 1975 IEEE.]
θ̂_LS. Note that the scaling of θ̂*₁,LS(t_x) in Step 1 is due to the appearance of w₁(t_x)ΔT_x/T instead of w₁(t_x) in (5-20).

It is straightforward to extend the approach of this example to regions that include more than one sampling interval, T.  □

Example 5-3 Impulse Response Identification
An n-stage batch LSE was used to estimate impulse response models having from one to ten parameters [i.e., n in (5-17) ranged from 1-10] for a second-order system (natural frequency of 10.47 rad/sec, damping ratio of 0.12, and unity gain). The system was forced with an input of randomly chosen ±1's, each of which was equally likely to occur. The system's output was corrupted by zero-mean pseudo-random Gaussian white noise of unity variance. Fifty consecutive samples of the noisy output and noise-free input were then processed by the n-stage batch LSE, and this procedure was repeated ten times, from which the average values for the parameter estimates given in Table 5-2 (page 41) were obtained.

The ten models were tested using ten input sequences, each containing 50 samples of randomly chosen ±1's. The last column in Table 5-2 gives values for a normalized average performance index, computed as a ratio of sums over the 50 samples in which each summed quantity denotes an average over the ten runs. Not too surprisingly, we see that average predicted performance improves as n increases.

All the θ̂_i results were obtained in one pass through the n-stage LSE using approximately 3150 flops. The same results could have been obtained using 10 LSEs (Lesson 3); but that would have required approximately 5220 flops.  □

PROBLEMS

5-1. Show that, in the vector measurement case, the results given in Lessons 3 and 4 only need to be modified using the transformations listed in Table 5-1.
5-2. Prove that, using cross-sectional processing, θ̂_q(k + 1) = θ̂(k + 1).
5-3. Prove the multistage least-squares estimation Theorem 5-1.
5-4. Extend Example 5-2 to regions of interest equal to mT, where m is a positive integer and T is the original data sampling time.

Lesson 6

Small Sample Properties of Estimators

INTRODUCTION

How do we know whether or not the results obtained from the LSE, or for that matter any estimator, are good? To answer this question, we make use of the fact that all estimators represent transformations of random data. For example, our LSE, [H'(k)W(k)H(k)]⁻¹H'(k)W(k)Z(k), represents a linear transformation on Z(k). Other estimators may represent nonlinear transformations of Z(k). The consequence of this is that θ̂(k) is itself random. Its properties must therefore be studied from a statistical viewpoint.

In the estimation literature, it is common to distinguish between small-sample and large-sample properties of estimators. The term "sample" refers to the number of measurements used to obtain θ̂, i.e., the dimension of Z. The phrase "small-sample" means any number of measurements (e.g., 1, 2, 100, 10⁴, or even an infinite number), whereas the phrase "large-sample" means an infinite number of measurements. Large-sample properties are also referred to as asymptotic properties. It should be obvious that if an estimator possesses a small-sample property, it also possesses the associated large-sample property; but the converse is not always true.

Why bother studying large-sample properties of estimators if these properties are included in their small-sample properties? Put another way, why not just study small-sample properties of estimators? For many estimators it is relatively easy to study their large-sample properties and virtually impossible to learn about their small-sample properties. An analogous situation occurs in stability theory, where most effort is directed at infinite-time stability behavior rather than at finite-time behavior.
Although "large-sample" means an infinite number of measurements, estimators begin to enjoy their large-sample properties for much fewer than an infinite number of measurements. How few usually depends on the dimension of θ, n.

A thorough study into θ̂ would mean determining its probability density function p(θ̂). Usually, it is too difficult to obtain p(θ̂) for most estimators (unless θ̂ is multivariate Gaussian); thus, it is customary to emphasize the first- and second-order statistics of θ̂ (or its associated error θ̃ = θ − θ̂), namely the mean and covariance.

We shall examine the following small- and large-sample properties of estimators: unbiasedness and efficiency (small-sample), and asymptotic unbiasedness, consistency, and asymptotic efficiency (large-sample). Small-sample properties are the subject of this lesson, whereas large-sample properties are studied in Lesson 7.

UNBIASEDNESS

Definition 6-1. θ̂(k) is an unbiased estimator of deterministic θ if E{θ̂(k)} = θ for all k; if θ is random, then θ̂(k) is an unbiased estimator of θ if E{θ̂(k)} = E{θ} for all k.

In terms of estimation error, θ̃(k), unbiasedness means that

E{θ̃(k)} = 0 for all k

Example 6-1
In the instrument calibration example of Lesson 3, we determined the following LSE of θ:

θ̂_LS(N) = (1/N) Σ_{i=1}^{N} z(i)

where

z(i) = θ + v(i)

Suppose E{v(i)} = 0 for i = 1, 2, ..., N; then E{θ̂_LS(N)} = θ, which means that θ̂_LS(N) is an unbiased estimator of θ.  □

Many estimators are linear transformations of the measurements, i.e.,

θ̂(k) = F(k)Z(k)    (6-6)

In least-squares, we obtained this linear structure for θ̂(k) by solving an optimization problem. Sometimes, we begin by assuming that (6-6) is the desired structure for θ̂(k). We now address the question: when is F(k)Z(k) an unbiased estimator of deterministic θ?

Theorem 6-1. When Z(k) = H(k)θ + V(k), E{V(k)} = 0, and H(k) is deterministic, then θ̂(k) = F(k)Z(k) [where F(k) is deterministic] is an unbiased estimator of θ if and only if

F(k)H(k) = I for all k    (6-7)

Note that this is the first place where we have had to assume any a priori knowledge about noise V(k).

Proof
a. (Necessity). From the model for Z(k) and the assumed structure for θ̂(k), we see that

θ̂(k) = F(k)H(k)θ + F(k)V(k)    (6-8)

If θ̂(k) is an unbiased estimator of θ, F(k) and H(k) are deterministic, and E{V(k)} = 0, then

E{θ̂(k)} = θ = F(k)H(k)θ

or

[I − F(k)H(k)]θ = 0    (6-9)

Obviously, for θ ≠ 0, (6-7) is the solution to this equation.
b. (Sufficiency). From (6-8) and the nonrandomness of F(k) and H(k), we have

E{θ̂(k)} = F(k)H(k)θ    (6-10)

Assuming the truth of (6-7), it must be that

E{θ̂(k)} = θ

which, of course, means that θ̂(k) is an unbiased estimator of θ.  □

Example 6-2
Matrix F(k) for the WLSE of θ is [H'(k)W(k)H(k)]⁻¹H'(k)W(k). Observe that this F(k) matrix satisfies (6-7); thus, when H(k) is deterministic the WLSE of θ is unbiased.

Unfortunately, in many interesting applications H(k) is random, and we cannot apply Theorem 6-1 to study the unbiasedness of the WLSE. We return to this issue in Lesson 8.  □
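A quick numerical illustration of Theorem 6-1 and Example 6-2 may be useful; the Python sketch below is my own (hypothetical H, W, and θ, not from the text). For a deterministic H(k), the WLSE's F(k) = [H'WH]⁻¹H'W satisfies F(k)H(k) = I, and a Monte Carlo average of the estimates is correspondingly indistinguishable from the true θ.

```python
# Hedged sketch: check F(k)H(k) = I for the WLSE and estimate its bias by Monte Carlo.
import numpy as np

rng = np.random.default_rng(4)
N, n = 30, 3
H = rng.standard_normal((N, n))            # drawn once, then treated as deterministic
W = np.diag(rng.uniform(0.5, 2.0, N))      # arbitrary positive weights
F = np.linalg.solve(H.T @ W @ H, H.T @ W)  # F(k) = [H'WH]^{-1} H'W
print(np.allclose(F @ H, np.eye(n)))       # True: the unbiasedness condition (6-7)

theta = np.array([1.0, -2.0, 0.5])
est = [F @ (H @ theta + rng.standard_normal(N)) for _ in range(20000)]
print(np.mean(est, axis=0))                # close to theta: no bias, as predicted
```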

Supposethat we begin by assuming a Iinear recursive structure for 6, 8(k) with the smallest error variance that can ever be attained by any unbiased
namely estimator. The following theorem provides a lower bound for E{ h2(k)} when 8
is a scalar deterministic parameter. Theorem 6-4 generalizes these results to
b(k + 1) = A(k + 1)6(k) t b(k + l)z(k + 1) (6-11) the case of a vector of deterministic parameters.
We then have the following counterpart to Theorem 6-l.
Theorem 6-3 (Cramer-Rao Inequality). Let z denote a set of data
Theorem 6-2. When z(k + 1) = h(k + 1)0 + v(k + l), E{v(k + 1)) = [ i.e., z = co1 (Zi, z2, . . . , zk); z is also short for Z(k)]. If i(k) is an unbiased
0, and h(k + 1) is deterministic, then &k + 1) given by (6-11) is an unbiased estimator of deterministic 8, then
estimator of 8 if
E{@(k)} 2 (6-15)
A(k + 1) = I - b(k + l)h(k + 1) (6-12)
E{[$Q:p(z)r} forazzk
where A(k + 1) and b(k + 1) are deterministic. 0
Two other ways for exPressing (6-15) are
We leave the proof of this result to the reader. Unbiasednessmeans that
our recursive estimator does not have two independent design matrices (de- E{@(k)} > for all k (6-16)
greesof freedom), A(k + 1) and b(k + 1). UnbiasednessconstraintsA(k + 1) Ldz
to be a function of b(k + 1). When (6-12) is substitutedinto (6-H), we obtain PCZ >
the following important structure for an unbiasedlinear recursive estimator
of&
E{p(k)} 2 for all k (6-17)
6(k + 1) = i(k) + b(k + l)[z(k + 1) - h(k + 1)&k)] (6-13) -E i; (,)
Our recursive WLSE of 8 has this structure; thus, as long as h(k + 1) is de2
deterministic, it producesunbiasedestimatesof 8. Many other estimators that where dz is short for dq, dt2, . . . , dzk . cl
we shall study will also have this structure.
Inequalities (6-19, (6-16) and (6-17) are named after Cramer and Rao,
who discovered them. They are functions of k because z is.
Before proving this theorem it is instructive to illustrate its use by means
of an example.
Did you hear the story about the conventioning statisticianswho all drowned
in a lake that was on the average 6 in. deep? The point of this rhetorical Example 6-3
question is that unbiasednessby itself is not terribly meaningful. We must also We are given M statistically independent observations of a random variable z that is
study the dispersionabout the mean, namely the variance. If the statisticians known to have a Cauchy distribution, i.e.,
had known that the variance about the 6 in. averagedepth was 120 ft, they
1
might not havedrowned! p(a) = (6-18)
7T[l + (Zi - e)]
Ideally, we would like our estimator to be unbiased and to have the
smallest possible error variance. We consider the caseof a scalar parameter Parameter 6 is unknown and will be estimated using zl, z2, . . . , ZM. We shall determine
first. the lower bound for the error-variance of any unbiased estimator of 8 using (6-15).
Observe that we are able to do this without having to specify an estimator structure for
Definition 6-2. An unbiased estimator, i(k) of 0 is said to be more 6. Without further explanation, we calculate:
eficient than any other unbiased estimator, 8(k), of 9, if
p(Z) = fi p (Zi) = 1 fl fi [l + (Zi - e)2] (6-19)
Var (6(k)) 5 Var (8(k)) for all k 0 (6-14) i==l i( i=l

Very often, it is of interest to know if 6(k) satisfies(6-14) for all other lnP(z) = -Mlnrr - 2 ln[l + (Zi - Ql (6-20)
;=1
unbiased estimators, i(k). This can be verified by comparing the variance of

and thus, when (6-23) is substituted into (622), and that result is substituted into (615), we
find that
(6-21)
for all M (6-33)
so that
Observe th at the Cramer-Rao bound depends on the number of measurements
used to estimate 8. For large numbers of measurements,this bound equals zero. Cl
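A numerical cross-check of Example 6-3 may help fix ideas; the Python sketch below is my own (hypothetical numbers, not from the text). It estimates the per-sample Fisher information for the Cauchy location parameter by Monte Carlo, confirming the value 1/2 used above (so the bound for M samples is 2/M), and then shows that one reasonable unbiased estimator, the sample median, has a variance that sits above that bound, as it must.

```python
# Hedged sketch: numerically check the Cauchy Cramer-Rao bound of Example 6-3.
import numpy as np

rng = np.random.default_rng(5)

# Per-sample Fisher information E{[d ln p/d theta]^2}, estimated by Monte Carlo
# at theta = 0; the score for the Cauchy location parameter is 2z/(1 + z^2).
z = rng.standard_cauchy(1_000_000)
I1 = np.mean((2.0 * z / (1.0 + z**2)) ** 2)
print(I1)                          # approximately 0.5, so J = M/2 and the bound is 2/M

# Variance of the sample median (unbiased here by symmetry) versus the bound.
M, runs, theta = 101, 20000, 1.3
med = np.median(theta + rng.standard_cauchy((runs, M)), axis=1)
print(np.var(med), 2.0 / M)        # roughly pi^2/(4M) versus 2/M: above the bound
```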

Next, we must evaluate the right-hand side of (6-22). This is tedious to do, but Proof uf Theorem 6-3. Because b(k) is an unbiased estimator of t?,
can be accomplished,as follows+Xote that
E{@k)} = /= [i(k) - 01 p(z)dz = 0 (4-34)
-m
Differentiating (6-34) with respect to 8, we find that
r
Consider TA first, i.e.,
&k) - $1 %p dz - f p(z) dz = 0
J --a -cc
TA = E : 2(zj - @/[l + (zj - O)*] 5 Z(Zj - 8)/[1 + (pi - 0)2] (6-24)
i=l j-1 which can be rewritten as
where use of statistical independence of the measurements. observe
that (6-35)

(6-25) As an aside, we note that

where y = zi - 8. The integral is zero because the integrand is an odd function of y. (6-36)
Consequently,
TA = 0 (6-26) so that
Next, consider TEL,i.e.,
I (6-37)
TB = E 5 4(Zi - 0)2/[1 + (pi - tQ212 (6-27)
Substitute (6-37) into (6-35) to obtain
which can also be written as

TB = 4 2 TC (6-23)
i-1

where
TC = E{(Zi - q2/[l + (Zj - ej2J2) (6-29)
Or

where equality is achieved when b(z) = ca(z) in which c is an arbitrary con-


(6-30) stant. Next, square both sides of (6-38) and apply (6-39) to the new right-hand
side, to see that
Integrating (6-30) by parts twice, we fmd that

(6-31) 1I [6(k) - Bjp(z)dz 1-1 [$lnp(z)]:p(z)dz


I
becausethe integral in (6-31) is the area under the Cauchyprobability density function, or
which equals v. Substituting (6-31) into (6-281,we determine that
11 E(b(l;)jE{ [$lnp(z)]?j (6-40)
TB = M/2 (6-32)

Finally, to obtain (6-Q) solve (6-40) for E{@(k)}. For a vector of parameters, we seethat a more efficient estimator hasthe
In order to obtain (6-16) from (6-H), observethat smallesterror covariance among all unbiasedestimators ot8, smallest in the
sense that E{[8 - b(k)][8 - 6(k)]} - E{[6 - e(k)][e - e(k)]} is negative
semi-definite.
The generalization of the Cramer-Rao inequality to a vector of par;lm-
eters is given next.

To obtain (6-41) we have also used (6-36). Theorem 6-4 (Cramer-Rao inequality for a vector of parameters).
In order to obtain (6-17), we begin with the identitvw Let z denote a set of data as in Theorem 6-3, and 8(k) be any unbiased estimator
x of deterministic 8 based on z. Then
p(z)dz= 1
I -X E{ij(k)h(k)} 2 J- for all k (6-45)
and differentiate it twice with respect to 0, using (6-37) after each differ- where J is the Fisher information matrix,
entiation, to show that
J = E [$np(z)][-$ (6-46)
i
which can also be expressed as
which can also be expressedas d2
J = -E ,zlnp(z) (6-47)
E[ [-$lnp(z)r] = -E[ n!/} (6-42) i I
Equality holds in (6-45), if and only if
Substitute (6-42) into (6-15) to obtain (6-17). Cl
[$ ln P(Z)] = c(e)ii(k) 0 (6-48)
It is sometimeseasierto compute the Cramer-Rao bound using one form
1i.e., (6-15) or (6-16) or (6-17)] than another. The logarithmic forms are A complete proof of this result is given by Sorenson (1980, pp. 94-96).
usually used when p (z) is exponential (e.g., Gaussian). Although the proof is similar to our proof of Theorem 6-3, it is a bit more
CoroNary 64. If the lower bound is achieved in Theorm 6-3, then intricate becauseof the vector nature of 8.
Inequality (6-45) demonstrates that any unbiased estimator can have a
1 d lnp(z) covariance no smaller than J-. Unfortunately, J-l is not a greatest lower
8(k) =--c de (6-43)
bound for the error covariance. Other bounds exist which are tighter than
where c is an arbitrary constant. (6-45) [e.g., the Bhattacharyya bound (Van Trees, 1968)], but they are even
more difficult to compute than J?
Proof. In deriving the Cramer-Rao bound we used the Schwarz in-
equality (6-39) for which equality is achievedwhen b (z) = c ti (z). In our case Corollary 6-2. Let z denote a set of data as in Theorem 6-3, and &(kj be
a(z) = [6(k) - tFJm and b(z) = m (~?/a@lnp(z)- Setting b(z) = c any unbiased estimator of deterministic 0i based on z. Then
a(z), we obtain (6-43). Cl
E{@(k)} 2 (J-)ii i = 1,2, . . . ,n and all k (6-49)
Equation (6-43) links the structure of an estimator tc the property of where (J-)ii is the i-ith element in matrix J-.
efficiency, becausethe left-hand side of (6-43) depends eAxpli,itly on e(k).
We turn next to the general caseof a vector of paramzxrrs. Proof. Inequality (6-45) means that E{@k)&(k)} - J-l is a positive
semi-definite matrix, i.e.,
Definition 6-3. An unbiased estimator, i(k), of vecmy 8 is said to be a[E{b(k)b(k)} - J-]a 2 0 (6-50)
more efficient than any other unbiased estimator, 8(k), of&
s i-s where a is an arbitrary nonzero vector. Choosinga = ei (the ith unit vector) we
E{[B - i(k)][O - i(k)]} 5 E{[6 - $k)][e - & W-- - z] (6-9 obtain (6-49). 0

Results similar tu those in Theorem 6-4 and Corollary 6-2 are also 6-9. Suppose 6(k) is a biased estimator of deterministic 8, with bias B( IT?).
Show that
available for a vector of random parameters (e.g., Sorenson, 1980, pp.
99-100). Let p (z, 9) denote the joint probability density function between z
and 9. The Cramer-Rao inequality for random parameters is obtained from [I +T12
Theorems 6-3 and 6-4 by replacingp (z) by p (2, 9). Uf course,the expectation E{$(k)} 2
is now with respectto z and t3+ E{ [-$ In p(z)]) for a11k

PROBLEMS

6-l* Prove Theorem 6-2, which provides an unbiasedness constraint for the two
design matrices that appear in a linear recur.siveestimator.
6-2* Prove Theorem 6-4, which provides the Cramer-Rao bound for a vector of
parameters,
6-3. Random variabIe X - N(x; P7 4, and we are given a random sample
{Xl, 12, * * , x,~}. Consider the following estimator for by

where a 2 0. For what value(s) of u is b(N) an unbiased estimator of p?


6-4. Suppose zl, z2, . . . , zAtare random samples from a Gaussiandistribution with
unknown mean, p, and variance, 0. Reasonable estimators of p and Q* are the
sample mean and sample variance,

and

S2

Is s2 an unbiased estimator of u? [Hint: Show that I?&) = (N - l)c?/Nj


6-S. (Mendel, 1973, first part of Exercise 2-9, pg. 137). Show that if 4 is an unbiased
estimate of 0, a 6 + b is an unbiased estimate of a 8 + b.
6-6. suppose that N independent observations xN} are made of a random
variable X that is Gaussian,i.e.,

In this problem only p is unknown. Derive the Cramer-Rao lower bound of


E{F(N)} for an unbiased estimator of P*
6-7. Repeat Problem 6-6, but in this case assumethat only cris unknown, i.,e,,derive
the Cramer-Rao lower bound of E{[&(N)]] for an unbiasedestimator of
6-S. Repeat Problem 6-6, but in this case assume both p and U are unknown,
compute J-r when 9 = co1(p , ~7.

7
Asymptotic Distributions

lesson be degenerate, but the form that the distribution tends to put on in the last
part of its journey to the final collapse(if this occurs). Consider the situation
depicted in Figure 7-1, where pi( 6) denotes the probability density function
associated with estimator 6 of the scalar parameter 8, based on i mea-
Large Sample surements.As the number of measurementsincreases,pi (6) changesits shape
(although, in this example, eachone of the density functions is Gaussian).The
density function eventually centers itself about the true parameter value 0,
Properties of and the variance associatedwith pi( 6) tends to get smaller as i increases.
Ultimately, the variance will become so small that in all probability 8 = 6.
The asymptotic distribution refers to pi (6) as it evolvesfrom i = 1,2, . . . , etc.,
Estimators especiallyfor large values of i.
The preceding example illustrates one of the three possible casesthat
can occur for an asymptotic distribution, namely the casewhen an estimator
has a distribution of the sameform regardlessof the sample size, and this form
is known (e.g., Gaussian). Someestimators have a distribution that, although
not necessarilyalways of the sameform, is also known for every samplesize.
For example, p5(8) may be uniform, pZO( 8) may be Rayleigh, andpzoo(6) may
be Gaussian. Finally, for some estimators the distribution is not necessarily
known for every sample size, but is known only for k + 30.
Asymptotic distributions, like other distributions, are characterizedby
their moments. We are especially interested in their first two moments,
INTRODUCTION
namely, the asymptotic mean and variance.
To begin, we reiterate the fact that if an estimator possessesa small-sample
Definition 7-l. The asymptotic mean is equal to the asymptotic expecta-
property it also possessesthe associatedlarge-sampleproperty; but, the con-
tion, namely t@mE{i(k)}. IJ
verse is not always true. In this lessonwe shall examine the following large-
sample properties of estimators: asymptotic unbiasedness,consistency, and
asymptotic efficiency. The first and third properties are natural extensionsof
the small-sample properties of unbiasednessand efficiency in the limiting
situation of an infinite number of measurements. The second property is
about convergenceof B(k) to 6.
Before embarking on a discussionof these three large-sampleproper-
ties, we digressa bit to introducethe concept of asymptotic distribution and its
associatedasymptotic mean and variance (or covariance, in the vector situ-
ation). Doing this will help us better understand these large-sampleproper-
ties.

ASYMPTOTIC DISTRIBUTIONS

According to Kmenta (1971,pg. 163), . . . if the distribution of an estimator


tends to become more and more similar in form to some specific distribution
as the sample size increases,then such a specific distribution is called the %o~~n-h.Kl m20 m5

asymptotic distribution of the estimator in question. . . . What is meant by the Figure 7-1 Probability density function for estimate of scalar 8, as a function of number of
asymptotic distribution is not the ultimate form of the distribution, which may measurements, e.g., pm is the p.d.f. for 20 measurements.

As noted in Goldberger (1944, pg. 116), if E{@$ = m for al/ k, then ASYMPTOTIC UNBIASEDNESS
p?m E{t!(k)j = linn m = RL Alternatively, supposethat
Definition 7-3. Estimator 6(k) is an asymptotically unbiased estimator
E{@k)} = m + k-cI + k -c2 + l l 9 V-1) of deterministic 8, if
where the cs are finite constants;then, ,!@=E&k)) = 9 (7-6)
or of random 9, if
p9z E{i(k)] = 1$X{m + k-q + ke2c2 + . . l} = m P-9
dinn E&k)) = E(9) 0 P-7)
Thus if E{@c)} is expressible as a power series in k, k-l, k-, . . - , the
asymptotic mean of 6(k) is the leading term of this power series; ask + zcthe Figure 7-l depicts an example in which the asymptotic mean of 6 has
terms of higher order of smallness in k vanish. convergedto 8 (note that pzoois centered about mm = 0).
Note that for the calculation on the left-haqd side of either (7-6) or (7-7)
to be performed, the asymptotic distribution of B(k) must exist, because
The usymptutic variurxe, wh@ is short fur variance of
Definition 7-Z.
the asymptotic distribution is not equal to lirnzvar [@)]. It is defined as
E{ii(k)) = J= - - .I= l!qk)p(Cqk))&(k)
-70 --m
U-8)
asymptotic var [8(k)] = Lk lee E{k[&k) - limm@(k)}]j II (7-3) Example 7-1
RecalI our linear model S.(k) = Z(k)0 + T(k) in which E(Sr(k)} = 0. Let us assume
Kmenta (1971, pg. 164) statesThe asymptotic variance . . . is not equal that each component of Y(k) is uncorrelated and has the same variance d. In Lesson
to j@= var(@. The reason is that in the case of estimators whose variance 8, we determine an unbiased estimator for CT:.Here, on the other hand, we just assume
decreaseswith an increasein k, the variancewill approach zero as k + m. This that
will happen when the distribution collapseson a point. But, as we explained,
the asymptotic distribution is not the same as the collapsed (degenerate)
distribution, and its variance is WCzero.
where 2(k) = Z(k) - X(k)&,(k). We leave it to the reader to show that (seeLesson 8)
Goldberger (1944, pg. 116) notes that if E{[@k) - lim E{&k)}]} = u /k
for aZI values of k, then asymptotic var [6(k)] = v /k. A&Fnatively, suppose
that
E{z (k)}= (q) CT?

E{[&k) - j@= E{i(k)}12j = k--Iv + k-2cI + k-cj -+ 0. Observe that I$?(k) is not an unbiased estimator of d; but, it is an asymptotically
unbiased estimator, because ,II- [(k - n)/k] & = CJ-?. q
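A small simulation of Example 7-1's point may help; the Python sketch below is my own (hypothetical linear model, not from the text). Dividing the residual sum of squares by k gives an estimator of σ_v² whose Monte Carlo mean tracks [(k − n)/k]σ_v², so the bias is visible for small k and disappears as k grows, which is exactly what asymptotic unbiasedness means here.

```python
# Hedged sketch: the residual-variance estimator of Example 7-1 is biased for
# finite k but asymptotically unbiased.
import numpy as np

rng = np.random.default_rng(6)
n, sigma2 = 4, 2.0
theta = rng.standard_normal(n)

def sigma2_hat(k):
    H = rng.standard_normal((k, n))
    Z = H @ theta + np.sqrt(sigma2) * rng.standard_normal(k)
    resid = Z - H @ np.linalg.lstsq(H, Z, rcond=None)[0]
    return resid @ resid / k            # divide by k, not k - n

for k in (8, 40, 400):
    est = np.mean([sigma2_hat(k) for _ in range(4000)])
    print(k, est, (k - n) / k * sigma2)  # Monte Carlo mean tracks [(k-n)/k]*sigma2
```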
l

where the the cs are finite constants;then,


CONSISTENCY
asymptotic var [t?(k)] = i Lirn
*a (V + ke1c2 + ke2c3 + -0 a)
We now direct our attention to the issueof stochasticconvergence.The reader
should review the different modes of stochastic convergence, especially con-
vergence in probability and mean-squaredconvergence (see, for example,
Thus if the variance of each 6(k) is expressibleas a power series in k-l, k -2, Papoulis, 1965).
. lthe asymptotic variance of 6(k) is the leading term of this power series;as
* ,

k goesto infinity the terms of higher order of smallnessin k vanish. Observe Definition 7-4. Theprobability limit of 6(k) is the point 0* on which the
that if (7-5) is true then the asymptotic variance of 6(k) decreases as distribution of our estimator collapses. We abbreviate probability limit of
k+m.. . which correspondsto the situation depicted in Figure 7-l. 6(k) by plim 6(k). Mathematically speaking,
Extensions of our definitions of asymptotic mean and variance to se- (7-11)
plim 6(k) = 8* -i& Pr [Ii(k) - 8*1 1 e]-,O
quencesof random vectors [e.g.! 6(k), k = lq 2? . . .] are straightforward and
can be found in Goldberger (1964, pg. 117). where E is a small positive number. El

Definition 7-5. 6(k) is a consistent estimator of 8 if do not know this to be true, then there is no guarantee that 8 = (6). In
plim i(k) = 8 0 (7-12) Lesson 11 we show that maximum-likelihood estimators are consistent; thus
ML= (&L)2. we mention this property about maximum-likelihood esti-
Note that consistency meansthe samething asconvergencein proba- mators here, becauseone must know whether or not an estimator is consistent
bility. For an estimator to be consistent, its probability limit 8* must equal its before applying the consistency carry-over property. Not all estimators are
true value 0. Note, also, that a consistent estimator need not be unbiased or consistent!
asymptotically unbiased. Finally, this carry-over property for consistency does not necessarily
Why is convergencein probability so popular and widely used in the apply to other properties. For example, if b(k) is an unbiased estimator of 0,
estimation field? One reason is that plim ( 0) can be treated as an operator. then A&) + b will be an unbiased estimator of A8 + b; but, 8(k) will not be
For example, supposeXk and Yk are two random sequences,for which plim an unbiased estimator of 82.
xk = X and plim Yk = Y, then (see Tucker, 1962,for simple proofs of these How do you determine whether or not an estimator is consistent?Often,
and other facts), the direct approach, which makes heavy use of plim ( 0) operator algebra, is
ph &Yk = (plim xk)(plim Yk) = fl (7-13) possible. Sometimes, an indirect approach is used, one which examines
whether both the bias in i(k) and variance of t?(k) approachzero as k * 0~).In
and order to understand the validity of this indirect approach, we digressto discuss
=- plim xk =-x (7-14)
mean-squaredconvergenceand its relationship to convergencein probability.
plim Yk Y
Definition 7-6. Estimator 6 (k) convergesto 8 in a mean-squaredsense,
Additionally, supposeAk and Bk are two commensuratematrix sequences,for
which plim Ak = A and plim Bk = B [note that plim Ak, for example, means
if
CnlE{[&k) - 6]}-+0 q (7-18)
(plim aii(k))ii] ; then
plim AkBk = (plim Ak)(Plirn Bk) = AB (7-15) Theorem 7-1. If i(k) convergesto 8 in mean-square,then it convergesto
8 in probability.
plim Ai1 = (plim AJ = A-l (7-16)
Proof (Papoulis, 1965,pg. 151). Recall the Inequality of Bienayme
WI x- al 2 E] 5 E{lx - a12}/8 (7-19)
plim Ai1 Bk = A-B (7-17)
Let a =Oandx = 6(k) - 6 in (7-19), and take the limit ask + 00on both sides
The treatment of plim ( .) as an operator often makes the study of of (7.19), to seethat
consistency quite easy. We shall demonstrate the truth of this in Lesson 8 hrl Pr[@(k) - 612 E] 5 llm E{@(k) - S]*}/E? (7-20)
when we examine the consistencyof the least-squaresestimator.
A second reason for the importance of consistencyis the property that Using the fact that 6(k) convergesto 6 in mean-square,we seethat
consistency carries over; i.e., any continuous function of a consistent esti- liir Pr[li(k) - 61=>E]-+O (7-21)
mator is itself a consistent estimator [see Tucker, 1967, for a proof of this
property, which relies heavily on the preceding treatment of plim ( ) as an l
thus, 6(k) convergesto 6 in probability. cl
operator]. Recall, from probability theory, that although mean-squared con-
Example 7-2 vergence implies convergencein probability, the converseis not true.
Suppose6 is a consistent estimator of 8. Then 1/ 6 is a consistentestimator of l/ 8, (6) Example 7-3 (Kmenta, 1971, pg. 166)
is a consistent estimator of 8, and In 6 is a consistent estimator of in 8. These facts are Let 6(k) be an estimatorof 8, and let the probability density function of 6(k) be
all due to the consistencycarry-over property. [7

The reader may be scratchinghis or her headat this point and wondering 1
about the emphasisplaced on these illustrative examples.Isnt, for example, 8 1-z
using 6 to estimate 8?by (6) the natural thing to do? The answer is Yes, 1
but only if we know aheadof time that 6 is a consistentestimator of 8 . If you k
k

In this example t?(k) can only assumetwo different values, 6 and k. Obviously, 6 (k) is PROBLEMS
consistent, becauseas k -+CCthe probability that 6(k) equals 8 approaches unity, i.e.,
plim 6(k) = 0. Observe, also, that E{i(k)] = 1 + G(l - l/k), which means that 6(k) is 7-l. Random variable X - N(x ; p, d), and we are given a random sample (x,,
biasede x2, * * *, xH). Consider the following estimator for p,
Now let us investigate the mean-squared error between i and 0; i.e.,

where a 2 0.
(a) For what value(s) of a is r;(N) an asymptotically unbiased estimator of p?
(b) Prove that F;(N) is a consistent estimator of ~1for all a 2 0.
In this pathological example, the mean-squared error is diverging to infinity; but, 6(k) (c) Compare the results obtained in (a) with those obtained in Problem 6-3.
converges to 0 in probability. 0 7-2. Suppose zl, z2, , . . : zN are random samples from a Gaussian distribution with
unknown mean, p, and variance, 2. Reasonable estimators of p and 2 are the
Theorem 7-L Let 6(k) denote an estimator uf 0. If bias 6(k) und vari- sample mean and sample variance
ance 6(k) both approach zero as k* x, then the mean-squared error between
6(k) and 8 appruuches zero, and, therefore, 6(k) i.sa cunsistent estimatur of 0.
Proc$ From elementary probability theory, we know that

E@(k) - Oj21= [bias ~(I+z)]~


+ variance I!@?) (7-22) S*= $ .i (Zi - if)
1=1

If, as assumed,bias I@) and variance I both approachzero as k + =, then (a) Is sz an asymptotically unbiased estimator of d? [Hint: Show that
E(s*} = (N - 1)&N].
(7-23) (b) Compare the result in (a) with that from Problem 6-4.
(c) One can show that the variance of s2 is
which means that 6(k) convergesto 0 in mean-square.Thus, by Theorem 7-l
- u 2(p4 - 2$) - 3u4
var(s2) =- ct4
+ p4
@c) also convergesto 8 in probability. III -
N N2 N3
The importance of Theorem 7-2 is that it provides a constructive way to Explain whether or not s* is a consistent estimator of 02.
test for consistency. 7-3. Random variable X - N(x; p ,g). Consider the following estimator of the
population mean obtained from a random sample of N observations of X,

/.i(N)=x +;
ASYMPTOTiC EFFICIENCY where a is a finite constant, and f is the sample mean.
(a) What are the asymptotic mean and variance of @(IV)?
Definition 7-7. i(k) is un asymptoticaily efficient estimator of scalar (b) Is r;(N) a consistent estimator of p?
parameter 0, ifi I. 6(k) has an asymptotic distribution with finite mean and (c) Is I;(N) asymptotically efficient?
variance, 2. 6(k) ti cunsistent, and 3. the variance uf the asymptotic distribu- 7-4. Let Xbe a Gaussianvariable with mean p and variance 2. Consider the problem
tion equals of estimating p from a random sample of observations xl, x2, . . . , xN. Three
estimators are proposed:

For the case of a vector of parameters conditions 1. and 2. in this


definition are unchanged;however, condition 3%is changedto read if the
covariance matrix of the asymptotic distribution equals J-l (see Theorem
6-4).
62 Large Sample Properties of Estimators Lesson 7

You are to study unbiasedness,efficiency, asymptotic unbiasedness,consistency,


and asymptotic efficiency for these estimators. Show the analysis that allows you
Lesson 8
to complete the following table.

Estimator
Properties
Properties 61 fi2 c;3

Small sample
Unbiasedness
Efficiency
of L eust-Squares
Large sample
Un biasedness
Consistency
Efficiency
Estimators
Entries to this table are yes or no.
7-S. If plim Xk = X and plim Yk = Y where X and Y are constants, then plim
(Xk + Yk) = X + YandplimcXk = cX, where c is a constant. Prove that plim
XkYk = XY{Hint: XkYk = a [(Xk + Yk)2 - (Xk - yk )2])*
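Before leaving Lesson 7, here is a small numerical illustration (my own Python sketch, with hypothetical numbers, not from the text) of the constructive consistency test provided by Theorem 7-2: for the sample mean, both the bias and the variance shrink toward zero as the number of measurements grows, which is the pattern the theorem converts into a proof of consistency.

```python
# Hedged sketch of the Theorem 7-2 consistency test applied to the sample mean.
import numpy as np

rng = np.random.default_rng(7)
theta, runs = 3.0, 5000
for k in (10, 100, 1000):
    est = theta + rng.standard_normal((runs, k)).mean(axis=1)
    bias = est.mean() - theta
    var = est.var()
    print(k, bias, var)   # both shrink toward 0 as k grows, so the sample mean is consistent
```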

In this lesson we study some small- and large-sample properties of least-


squaresestimators. Recall that in least-squareswe estimate the n X 1 param-
eter vector 8 of the linear model Z(k) = X(k)8 + V(k). We will see that most
of the results in this lessonrequire X(k) to be deterministic, or, X(k) and V(k)
to be statistically independent. In some applicationsone or the other of these
requirements is met; however, there are many important applications where
neither is met.

SMALL SAMPLE PROPERTIES OF LEAST-SQUARES ESTIMATORS

In this section (parts of which are taken from Mendel, 1973, pp. 75-86) we examine the bias and variance of weighted least-squares and least-squares estimators.
To begin, we recall Example 6-2, in which we showed that, when X(k) is deterministic, the WLSE of θ is unbiased. We also showed [after the statement of Theorem 6-2 and Equation (6-13)] that our recursive WLSE of θ has the requisite structure of an unbiased estimator; but, that unbiasedness of the recursive WLSE of θ also requires h(k + 1) to be deterministic.
When X(k) is random, we have the following important result:
Theorem 8-1. The WLSE of θ,
θ̂_WLS(k) = [X'(k)W(k)X(k)]⁻¹ X'(k)W(k)Z(k)    (8-1)

is unbiased if E{V(k)} = 0 and if V(k) and X(k) are statistically independent.

number generator. This random sequence is in no way related to the measurement noise process, which means that X(N - 1) and V(N) are statistically independent, and again θ̂_WLS(N) will be an unbiased estimate of the impulse response coefficients. We conclude, therefore, that WLSEs of impulse response coefficients are unbiased. □
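A small Monte Carlo sketch (our own construction, with synthetic data and hypothetical coefficient values) that is consistent with this conclusion: the input is random but statistically independent of the measurement noise, and the average of the least-squares estimates (the WLSE with W = I) of the impulse response coefficients stays close to their true values:

```python
# Sketch: average LSE of FIR impulse-response coefficients over many runs
# (random input, independent of the measurement noise), illustrating unbiasedness.
import numpy as np

rng = np.random.default_rng(1)
h_true = np.array([0.8, 0.4, -0.2])          # hypothetical impulse response h(1), h(2), h(3)
n, N, runs = len(h_true), 100, 2000
estimates = np.zeros((runs, n))
for r in range(runs):
    u = rng.normal(size=N)                   # random input, independent of the noise below
    X = np.column_stack([np.r_[np.zeros(i), u[:N - i]] for i in range(n)])
    z = X @ h_true + 0.5 * rng.normal(size=N)
    estimates[r] = np.linalg.lstsq(X, z, rcond=None)[0]
print(estimates.mean(axis=0))                # sample mean of the estimates is close to h_true
```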
Note that this is the first place where, in connection with least squares,
web lavehad to assumeany a priori knowledge about noise?F(k), Example 8-2
As a further illustration of an application of Theorem 8-1, let us take a look at the
Proof. From (8-l) and %(Ic)= %?(I?)8
+ T(k), we find that weighted least-squares estimates of the n a-coefficients in the Example 2-2 AR model.
i&~(k) = (XwYey X~W(m3 + V) We shall now demonstrate that X(N - 1) and V(N - l), which are defined in Equa-
tion (2-8), are dependent, which meansof course that we cannot apply Theorem 8-l to
= 0 + (xw3e)-WW for all k study the unbiasedness of the WLSEs of the a-coefficients.
where, for notational simplification? we have omitted the functional de- We represent the explicit dependencesof %eand V on their elements in the
pendencesof X, W, and V on k. Taking the expectation on both sidesof (8-2), following manner:
it follows that
E{&&)J = 0 + E{XWX)- X]WE{=V] for all k and
w9
sr = sr[u(N - l), u(N - 2), . . . ) u(O)] P-8)
In deriving (8-3) we have used the fact that X(k) and T(k) are statistically
independent [recall that if two random variables, a and b, are statistically Direct iteration of difference equation (7) in Lesson 2 for k = 1, 2, . . . , N - 1,
reveals that y (1) depends on u (0), y (2) depends on u (1) and u (0) and finally, that
independent p(a, b) = p(ajp(bj; thus, E{ab] = E{@{b] and E&(a)h (b)] =
E-&@)}E{!2(b)}]. The second term in (8-3) is zero, becauseE{V} = 0, and W - 1) depends on u(N - 2) . . . , u (0); thus,
therefore %e[jl(N-l),y(N -2), . . . . Y (011= Xb W - 21, u 0 - 3), - . . , u (01,y ((91 (8-9)
Comparing (8-S) and (B-9), we see that X and 9r depend on similar vaiues of random
input U; hence, they are statistically dependent. 0
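A short simulation (ours, with an assumed coefficient value and Gaussian white u(k)) illustrating the consequence of this dependence for a first-order AR model: the sample mean of the least-squares estimates of the coefficient differs from the true value for short data records, i.e., the estimator is biased, although the bias shrinks as the record length grows:

```python
# Sketch: finite-sample bias of the LSE of a first-order AR coefficient,
# model y(k+1) = -a*y(k) + u(k), u(k) zero-mean white Gaussian; a = 0.5 assumed.
import numpy as np

rng = np.random.default_rng(2)
a, N, runs = 0.5, 30, 5000
slope_hat = np.zeros(runs)
for r in range(runs):
    u = rng.normal(size=N)
    y = np.zeros(N + 1)
    for k in range(N):
        y[k + 1] = -a * y[k] + u[k]
    # LS regression of y(k+1) on y(k); the true slope is -a
    slope_hat[r] = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])
print(slope_hat.mean(), -a)   # the averaged estimate differs from -a for small N
```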
This theorem only states suficient condi~iom for unbiasedness of Example 8-3
i&&k), which means that, if we do not satisfy these conditions, we cannut We are interested in estimating the parameter a in the following first-order system:
conclude anything about whether 6=(k) is unbiased or biased. In order to
obtain necessuryccmdifions for unbiasedness,assumethat E{&&k)} = 0 and y(k +- 1) = -ay(k) + u(k) (8-10)
take the expectation on both sidesof (8-2). Doing this, we seethat where u(k) is a zero-mean white noise sequence. One approach to doing this is to
collect y (k + l), y(k), . . . , y (1) as follows,
E{(XWX)-XWV~ = 0 WI
Letting M = (%?W%)- XW and rni denote the ith row of matrix M, (8-5) can
be expressedas the following collection of orthogonaii~ cmdihms,
(4 u(k)
-1) u(k
-1)
Ye + 1)
Y(k)
-Y
-Y(k
Y@ - 1) -y(k - 2) n + u(k -2) (8-l 1)
E{m/V] = 0 for i =1,2,...,N WI
orthogonality [recall that two random variables u and b are orthogonal if
E{ab} = O] is a weaker condition than statistical independence, but is often
I-1 : 1 : 1
...
Y(l) $0)
...
u (0)

Z:(k + 1) W)
more difficult to verify aheadof time than independence,especiallysincern: is
a very nonlinear transformation of the random elements of X. and, to obtain dts. In order to study the bias of ciLsI we use (8-2) in which W(k) is set
equal to I, and X(k) and T(k) are defined in (8-11). We also set 6,(k) = CiLS(k + 1).
Example 8-l The argument of ii u is k 4 1 instead of k, becausethe argument of 5 in (S-11)is k + 1.
Recall the imp&e response identification Example 2-1, in which 8 = co1\h (1), h (2), Doing this, we find that
. . . , h(n)], where h (i) is the value of the sampled impulse responseat time li. System
input u(k) may be deterministic or random. Ii WvG>
&(k+l)=a-O, (S-12)
If{U(k),k =o, 1, .*., *N - l} is deterministic, then X(!V - 1) isee Equation
(Z-5)] is deterministic, so that 0WU(AJ)is an unbiased estimator of 0. Often, one usesa c Y(i)
random input sequencefor {u(k), k = U: I, . . . , A - I}, such as from a random 1-0

thus, where we have made use of the fact that W is a symmetric matrix and the
transpose and inverse symbols may be permuted. From probability theory
, (8-13) (e.g., Papoulis, 1965), recall that
E{&(k + 1)) = a -
Ex,v 1 l I = Ex mvpc 1 l I%>
(8-22)
Note, from @-lo), that y(j) dependsat most on u (j - 1); therefore, Applying (8-22) to (8-21), we obtain (8-18). ci

As it stands, Theorem 8-2 is not too useful becauseit is virtually impos-


E[c} = E{u(k))E[f$] =0 (8-14) sible to compute the expectation in (8-18), due to the highly nonlinear de-
pendence of (%e,W~)-l%e,W~~T~(%eW~)- on %e.The following special
case of Theorem 8-2 is important in practical applications in which X(k) is
because E(u(k)} = 0. Unfortunately, all of the remaining terms in (8-13), i.e., for
i=O,l, . . . . k - 1, will not be equal to zero; consequently, deterministic and C&(/C)= d I.

E&s(k + 1)) = a + 0 + k nonzero terms (8-15)


Unless we are very lucky so that the k nonzero terms sum identically to zero, E{â_LS(k + 1)} ≠ a, which means, of course, that â_LS is biased.
The results in this example generalize to higher-order difference equations, so that we can conclude that least-squares estimates of coefficients in an AR model are biased. □

Corollary 8-1. Given the conditions in Theorem 8-2, and that X(k) is deterministic, and the components of V(k) are independent and identically distributed with zero mean and constant variance σ_v², then
cov[θ̂_LS(k)] = σ_v² [X'(k)X(k)]⁻¹    (8-23)
Proof. When X is deterministic, cov [bWLS(k)] is obtained from (8-18) by
In the method of instrumental variables X(k) is replacedby X*(k) where deleting the expectation on its right-hand side. To obtain cov [b,(k)] when cov
X*(k) is chosen so that it is statistically independent of V(k). There can be, [V(k)] = d I, set W = I and a(k) = d I in (8-18). The result is (8-23). q
and in general there will be, many choices of X*(k) that may qualify as
instrumental variables. It is often difficult to check that X*(k) is statistically Usually, when we use a least-squaresestimation algorithm we do not
independent of V(k). know the numerical value of d. If d is known ahead of time, it can be used
Next we proceed to computethe covariancematrix of &&k), where directly in the estimate of 0. We showhow to do this in Lesson 9. Where do we
obtain dVin order to compute (8-23)?We can estimate it!
&&k) = 8 - &q&k) (8-16)
Theorem 8-3. An unbiased estimator of d is
Theorem 8-2. If E(gr(k)} = 0, V(k) and X(k) are statistically independ-
ent, and z(k) = !%(k)3(k)/(k - n) (8-24)
www)~ = W) (8-17) where
then s(k) = Z(k) - X(k)&(k) (8-25)
cov [i&&k)] = EX{(%eW%e)+
%?W%WX(XWX)-l} (8-18)
Proof. We shall proceed by computing E{%(k)%(k)} and then approxi-
Proof. Because E(gr(k)} = 0 and V(k) and X(k) are statistically inde- mating it asg(k)%(k), becausethe latter quantity can be computed from Z(k)
pendent, E{&&k)} = 0, so that and h&k), as in (8-25).
First, we compute an expressionfor g(k). Substituting both the linear
cov [&r&k)] = E{6,,(k)i&&k)} (8-19)
model for Z(k) and the least-squaresformula for i=(k) into (8-25), we find
Using (8-2) in (8016),we seethat that
ii,(k) = -(XW%e)-Xwsr (8-20)
hence
cov [i&&k)] = EX,V{(3t?W~)-1%%VVWX(XWX)-) (8-21) (8-26)

where I_k is the k × k identity matrix. Let
M = I_k - X(X'X)⁻¹X'    (8-27)
Matrix M is idempotent, i.e., M' = M and M² = M; therefore,
tions under which the LSE of θ, θ̂_LS(k), is the same as the maximum-likelihood estimator of θ, θ̂_ML(k). Because θ̂_ML(k) is consistent, asymptotically efficient, and asymptotically Gaussian, θ̂_LS(k) inherits all these properties.

Theorem 8-4. If
plim [X'(k)X(k)/k] = Σ_X    (8-33)
Σ_X⁻¹ exists, and
plim [X'(k)V(k)/k] = 0    (8-34)
then
plim θ̂_LS(k) = θ    (8-35)

Recall the following well-known facts about the trace of a matrix:
1. E{tr A} = tr E{A}
2. tr cA = c tr A, where c is a scalar
3. tr (A + B) = tr A + tr B
4. tr I_N = N
5. tr AB = tr BA
Using thesefacts, we now continue the developmentof (g-28), as follows: Note that the probability limit of a matrix equalsa matrix each of whose
elements is the probability limit of the respectivematrix element. Assumption
E{fk%} = tr [ME{VVl] = tr M% = tr MCT~ (8-33) postulates the existence of a probability limit for the second-order
= d tr M = 0: tr [Ik - %Y(WX)-%T] moments of the variables in X(k), as given by Z&. Assumption (8-34) postu-
lates a zero probability limit for the correlation between X(k) and T(k).
= &k - & tr X(XX)- X (8-29) %e(k)Sr(k)can be thought of as a filtered version of noise vector V(k). For
= uz k - CT:tr (XX)(XX)-l (8-34) to be true filter X(k) must be stable. If, for example, X(k) is
= dk - &trIn = uz(k -n) deterministic and CT&) < a, then (S-34) will be true.
Solving this equation for d, we find that Proof. Beginning with (8-2), but for i&k) instead of &,&k), we see
that
(8-30) b,,(k) = 8 + (me-XT (8-36)
Although this is an exact result for uz, it is not one that can be evaluated, Operating on both sides of this equation with plim, and using properties
becausewe cannotcompute E{&$. (7-19, (7-16): and (7-17), we find that
Using the structure of d as a starting point, we estimate & by the simple
formula plim i,(k) = 9 + plim [(XX/k)- (WV/k)]
&; (k) = t!t(k)~(k)/(k - II) (8-31)
= 9 + plim (RX/k)-plim (WV/k)
= 9 + & - 0
To show that x (k) is an unbiased estimator of c$, we obsewe that
= 9
(8-32) which demonstrates that, under the given conditions, &(k) is a consistent
estimator of 0. q
where we have used (8-29) for E{~%]. III
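A brief numerical sketch (synthetic data; names ours) of the residual-based noise-variance estimator of Theorem 8-3 for the case of deterministic X(k):

```python
# Sketch: estimating the noise variance from least-squares residuals,
# sigma_hat^2(k) = Z_tilde'(k) Z_tilde(k) / (k - n)  (Theorem 8-3 setting).
import numpy as np

rng = np.random.default_rng(3)
k, n, sigma_v = 500, 4, 0.3
X = rng.normal(size=(k, n))               # generated once, then treated as deterministic
theta = rng.normal(size=n)
Z = X @ theta + sigma_v * rng.normal(size=k)

theta_ls = np.linalg.lstsq(X, Z, rcond=None)[0]
Z_tilde = Z - X @ theta_ls                # residual vector Z_tilde(k) = Z(k) - X(k)*theta_ls
sigma2_hat = Z_tilde @ Z_tilde / (k - n)
print(sigma2_hat, sigma_v**2)             # sigma2_hat is close to the true noise variance
```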
In some important applications Eq. (8-34) doesnot apply, e.g., Example
8-2. Theorem S-4 then does not apply, and, the study of consistency is often
LARGE SAMPLE PROPERTIES OF LEAST-SQUARES ESTIMATORS quite complicated in these cases.
Many large sampleproperties of LSEs are determined by establishing that the
LSE is equivalent to another estimator for which it is known that the large Theorem 8-5. If (8-33) and (8-34) are me, C, exists, and
sampIeproperty holds true. In Lesson 11, for example,we will provide condi- plim [gr(k)Sr(k)/k] = IT: (S-37)
Lesson 9
then
plim σ̂_v²(k) = σ_v²    (8-38)
where σ̂_v²(k) is given by (8-24).
Proof. From (8-26),we find that
Best Linear Unbiased
Consequently,
fzi% = VT - Olc3q~~)-%e~ (8-39)
Estimation
plim 8 (k) = plim %t%/(k - n)
= plim VV/(k - n) - plim ~X(XX)-lXV/(k - n)
= d - plim vXl(k - n) . plim [X?tT/(k - n)]-1
. plim %V/(k - n)

INTRODUCTION
PROBLEMS
Least-squaresestimation, as described in Lessons 3, 4 and 5, is for the linear
8-1. Suppose that iLs is an unbiased estimator of 0. Is & an unbiased estimator of model
e? (Hint: Use the least-squaresbatch algorithm to study this question.)
3!(k) = X(k)0 + V(k) (9-l)
8-2. For 6WLs(Q to be an unbiased estimator of 8 we required ET(k)] = 0. This
problem considers the casewhen ET(k)} f 0. where 9 is a deterministic, but unknown vector of parameters, X(k) can be
(a) Assume that E(gr(k)} = V o where Sr, is known to us. How is the deterministic or random, and we do not know anything about V(k) ahead of
concatenated measurementequation Z(k) = %e(k)e+ V(k) modified in this time. By minimizing &(k)W(k)?%I(k), where Z%(k) = Z(k) - X(k)&&k), we
case so we can use the results derived in this lesson to obtain bw,,(k) or determined that 6-(k) is a linear transformation of Z(k), i.e., &&k) =
ii,(k)? F,&k)%(k). After establishingthe structure of &&k), we studied its small-
(b) Assume that E(Sr(k)} = m%Tlwhere m6tpis constant but is unknown. How is and large-sample properties. Unfortunately, 6-(k) is not alwaysunbiased or
the concatenated measurementequation S!(k) = X(k)8 + V(k) modified in
efficient. These properties were not built into 6-(k) during its design.
this case so that we can obtain least-squaresestimatesof both 8 and my?
In this lesson we develop our second estimator. It will be both unbiased
8-3. Consider the stable autoregressive model y(k) = &y(k - 1) + . . + &y l

and efficient, by design. In addition, we want the estimator to be a linear


(k - Q + c(k) in which the e(k) are identically distributed random variables
with mean zero and finite variance a. Prove that the least-squaresestimates of function of the measurements%(k). This estimator is called a best linear
8l,***, & are consistent(see also, Ljung, 1976). unbiased estimator (BLUE) or an unbiased minimum-variance estimator
8-4. In this lesson we have assumed that the X(k) variables have been measured (UMVE). To keep notation relatively simple, we will use i&,,(k) to denote the
without error. Here we examine the situation when 3&(k) = X(k) + N(k) in BLUE of 8.
which X(k) denotes a matrix of true values and N(k) a matrix of measurement As in least-squares,we begin with the linear model in (9-l) where 8 is
errors. The basic linear model is now deterministic. Now, however, X(k) must be deterministic and V(k) is assumed
to be zero mean with positive definite known covariance matrix S(k). An
Z(k) = X(k)0 + V(k) = X,(k)8 + v(k) - N(k)e]. example of such a covariance matrix occurs for white noise. In the case of
Prove that iLS(k) is not a consistent estimator of 8. scalar measurements, z(k), this means that scalar noise v(k) is white, i.e.,
P-2)

where &, is the Kronecker 6 (i.e., &, = 0 for k # j and 6kJ= I for k = j) thus, where ei is the ith unit vector,
3(k) = E{3-(A-)=V-(k)} e; = co1 (0, 0, * . . ,o, l,O, . . . 70) (9-10)
= diag [d(k)- d(k - l), . . . , &(I? - IV + 1)] P-9
in which the nonzero element occurs in the ith position. Equating respective
In the case of vector measurements,z(k), this meansthat vector noise, v(k), is elements on both sides of (9-9), we find that
white, i.e.,
X(k)f,(k) = ei i = 1,2, . . ..n (9-11)
E{v( k)v( j)l = R(k)Skj P-4)
Our single unbiasedness constraint on matrix F(k) is now a set of n constraints
thus, on the rows of F(k).
3(k) = diag [R(k), R(k - l), . . . ! R(k - N + 1)] Next, we express E([& - 6i.BLU(k)]2>in terms of f, (i = 1,2, . . . , IV). We
shall make use of (9-ll), (9-l), and the following equivalent representation of
(9-6)
PROBLEM STATEMENT AND OWECTfVE FUNCTION &HlJ(~) = f:(k)%(k) i = 1,2,. . , ,n (9-12)
Proceeding, we find that
We begin by assumingthe following linear structure for &&k),
E{[& - tir,BLU(k)]2} = E(( ei - fi%)*} = E{( Oi - SYf,)}
k,(k) = F(QW) WI
= E{aj! - 2&%fi + (EfJ*}
where, for notational simplicity, we have omitted subscriptingF(k) as FBLU(k).
= E{@ - 28,(%33+ T)f, + [(%I + T)fi]l)
We shall design F(k) suchthat (9-13)
= E{$ - 2Oi9Xfi - ZOiTfi + [6Xf; + Vff])
a, &&k) is an unbiasedestimator of 0, and = E(B; - 20i8ei - 2OiVfi + [Bej + Vfj])
b. the error variance for eachone of the n parametersis minimized- In this = E{filTfi) = fi9tf;
way, &&k) wiZ1be unbiased afzd eficierzt, by design.
Observe that the error-variance for the ith parameter dependsonly on the ith
Recall, from Theorem 6-1, that mbiasedness cmzs~raim design mutrix row of design matrix F(k). We, therefore, establish the following objective
F(k), such Ihut function:.

F(k)X(k) = I for aI k Ji(f;, hi) = E{[B; - ii.BLU(k)]Z) + A;(Xf, - eJ


P-7 (9-14)
= fi$Rf; + A,(Xfi - e,)
Our objective now is to choosethe elements of F(k), subject to the constraint
where Ai is the ith vector of the Lagrange multipliers, that is associatedwith
of (g-7), in such a way that the error variance for eachone of the ti parameters
the ith unbiasedness constraint. Our objective now is to minimize Ji with
is minunized.
respect to fi and hi (i = 1,2, . . . , iV).
In solving for FBLU(k),it will be convenient to partition matrix F(k), as

DERIVATION OF ESTIMATOR
F(k) =
A necessary condition for minimizing Ji(fi ,Ai) is 81,(f;,Ai)/ df; = 0 (i =
1,2, ,..) n); hence,
Equation (9-7) can now be expressedin terms of the vector components of 29ifi + XAi = 0 (9-E)
F(k). For our purposes, it iseasier to work with the transpose of (9-7),
XF = 1, which can be expressedas from which we determine fj, as
fl = -;s-X& (9-16)

Corollary 9-1. All results obtained in Lessons 3, 4, and 5 for θ̂_WLS(k) can be applied to θ̂_BLU(k) by setting W(k) = R⁻¹(k). □
For (9-16) to be valid, R⁻¹ must exist. Any noise V(k) whose covariance matrix R is positive definite qualifies. Of course, if V(k) is white, then R is
diagonal (or block diagonal) and 3-l exists. This may also be true if V(k) is
We leaveit to the reader to explore the full implications of this important
not white. A second necessary condition for minimizing Ji (fi,hi) is dJi
corollary, by reexamining the wide range of topics, which were discussedin
(fiyX;)/aXj = 0 (i = 172, . 7n), which gives us the unbiasednessconstraints
Lessons3,4 and 5.
l l

Xfi=ei i=1,2,...,n (9-17)


To determine Ai, substitute (9-16) into (9-17). Doing this, we find that
Theorem 9-2 (Gauss-Markov Theorem). If R(k) = σ_v² I, then θ̂_BLU(k) = θ̂_LS(k).
Xi = -2(X%Ye)- ei (9-18)
Proof. Using (9-22) and the fact that a(k) = a: I, we find that
whereupon
fi = 9i- X(X%.- X)-l ei (9-19)
(i = 1,2, . . . , n). Matrix F(k) is reconstructed from f,(k), asfollows: Why is this a very important result? We have connected two seemingly
different estimators, one of which-&&)-has the properties of unbiased
and minimum variance by design; hence, in this case &(k) inherits these
properties. Remember though that the derivation of &&) required X(k) to
be deterministic; thus, Theorem 9-2 is conditioned on X(k) being
deterministic.
Hence
F_BLU(k) = [X'(k)R⁻¹(k)X(k)]⁻¹ X'(k)R⁻¹(k)    (9-21)
which means that
θ̂_BLU(k) = F_BLU(k)Z(k) = [X'(k)R⁻¹(k)X(k)]⁻¹ X'(k)R⁻¹(k)Z(k)    (9-22)

SOME PROPERTIES OF θ̂_BLU(k)

To begin, we direct our attention at the covariance matrix of parameter estimation error θ̃_BLU(k).
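As a purely illustrative sketch (synthetic data; the noise covariance is assumed known and diagonal, and the names are ours), equation (9-22) can be evaluated directly:

```python
# Sketch: BLUE of theta for Z = X*theta + V with known noise covariance R (eq. 9-22).
import numpy as np

rng = np.random.default_rng(4)
k, n = 100, 2
X = rng.normal(size=(k, n))                     # deterministic observation matrix
theta_true = np.array([2.0, -1.0])
r_diag = rng.uniform(0.1, 2.0, size=k)          # known (diagonal) noise covariance R
Z = X @ theta_true + np.sqrt(r_diag) * rng.normal(size=k)

Rinv = np.diag(1.0 / r_diag)
theta_blu = np.linalg.solve(X.T @ Rinv @ X, X.T @ Rinv @ Z)
cov_blu = np.linalg.inv(X.T @ Rinv @ X)         # covariance of the estimation error
print(theta_blu, np.diag(cov_blu))
```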
COMPARISON OF θ̂_BLU(k) AND θ̂_WLS(k)

Theorem 9-3. If V(k) is zero mean, then
cov[θ̂_BLU(k)] = [X'(k)R⁻¹(k)X(k)]⁻¹    (9-24)
Proof. We apply Corollary 9-1 to cov [6,(k)] [given in (18) of Lesson
We are struck by the close similarity between i,,,(k) and &&k). S] for the casewhen X(k) is deterministic, to seethat
cov[θ̂_BLU(k)] = cov[θ̂_WLS(k)] with W(k) = R⁻¹(k)
= (X'R⁻¹X)⁻¹ X'R⁻¹ R R⁻¹ X (X'R⁻¹X)⁻¹
= (X'R⁻¹X)⁻¹    □
Theorem 9-1. The BLUE of θ is the special case of the WLSE of θ when
W(k) = R⁻¹(k)    (9-23)
If W(k) is diagonal, then (9-23) requires V(k) to be white.
Proof. Compare the formulas for 6&k) in (9-22) and 6-(k) in (10) of Observe the great simplification of the expression for cov [6%&k)],
Lesson 3. If W(k) is a diagonal matrix, then 9t(k) is a diagonal matrix only if when W(k) = S-(k). Note, also, that the error varianceof ij,BLu(k) is given by
V(k) is white. Cl the ith diagonal element of cov [bBLU(k)].
Matrix %-l(k) weights the contributions of precisemeasurementsheav- Corollary 9-2. When W(k) = S-(k) then matrix P(k), which appears
ily and deemphasizesthe contributions of imprecise measurements.The best in the recursive WLSE of 9 equals cov [6BLu(k)],i.e.,
linear unbiased estimation design technique hasled to a weighting matrix that
is quite sensible. SeeProblem 9-2. P(k) = cov [6,,,(k)] (9-25)

Proof. Recall Equation (13) of Lesson4, that Substituting (9-32) and (9-33) into (9-29), and making repeated use of the
P(k) = [%?(k)W(k)X(k)J- (9-26) unbiasednessconstraints XF = %eFd= FOX0 = I, we find that

When W(k) = %(A$, then 2 = F&RF; - IGRF


= F,9tF; - F%F + 2(X%-ste)- - (%%-%)-
P(k) = [X(k)CR-l(k)X(k)]-l (9-27j
- (Ye%-1x)-*
hence, P(k) = cov [6&k)] becauseof (9-24) . 0 (9-34)
= F,%F; - F9tF + 2(%e9t-%)-1(%e9t-1%F)
Soon we will examine a recursive BLUE. Matrix P(k) WU have to be - (XW%e)-(W!R-%F;)
calculated, just as it has to be calculatedfor the recursive WLSE. Every time - (F$Eft-X)(X%-%)-
P(k) is calculated in our recursiveBLUE, we obtain a quantitative measureof
Making use of the structure of F(k). given in (9-21), we see that Z can also be
how well we are estimating 9+Just look at the diagonal elements of P(k),
written as
k = 1,2,. . . . The same statementcannot be made for the meaning of P(k) in
the recursive WLSE. In the r?cursiveWLSE, P(k) has no special meaning. C = F,%F; - F%F + 2F%F - F%F; - F&RF
(9-35)
Next, we examine cov [9&k)] in more detail. = (Fa - F)%(F, - F)
In order to investigate the definitenessof 2, consider the definitenessof
Theorem9-4* iBLu(k) k a most efficient estimutm of 0 within the classof
aGa, where a is an arbitrary nonzero vector,
at/ unbiasedestirnutorsthat m-elinearly reiatedto the mwsurements %(Ic).
aza = [(Fn - F)a)9t[(Fp - F)a] (9-36)
proof (Mendel, 1973, pp. 155-156). According to Definition 6-3, we
must show that Matrix F (i.e., FBLU) is unique; therefore (F, - F)a is a nonzero vector, unless
F, - F and a are orthogonal, which is a possibility that cannot be excluded.
E = cov [&(k)] - cov [b&k)] (9-28) Becausematrix CRis positive definite, axa 2 0, which means that I: is positive
is positive semidefinite. In (g-28), 6=(k) is the error associatedwith an arbi- semidefinite, IJ
trary hnear unbiased estimate of 0. For convenience,we write ZZas These results serve as further confirmation that designing F(k) as we
(9-29) have done, by minimizing only the diagonal elements of cov[8&k)], is
sound.
In order to compute X0we use the facts that
Corollary 9-3. If C!&(k)= IZt I, then 6&k) is a most efficient estimator of 8.

The proof of this result is a direct consequenceof Theorems 9-2 and 9-4. 0
F4(k)X(k) = I (9-31) At the end of Lesson 3 we noted that iwLs(Ic)may not be invariant under
thus, changes.We demonstrate next that 4&k) is invariant to such changes.
Theorem 9-5. 6BLU(k)is invariant under changes of scale.
Prc@(Mendel, 1973,pp. 15157). Assume that observers A and B are
observing a process; but, observer A reads the measurements in one set of
units and B in another. Let M be a symmetric matrix of scale factors relating A
to B (e.g., 5,280 ft/mile, 454 g/lb, etc.), and C&.(k)and Z&(k) denote the total
Because6BLu(k)= F(k)%(k) and F(k)?@) = 1, measurementvectors of A and B, respectively.Then
x BLU= FCRF (9-33) S&(k) = X,(k)B + VB(k) = Mf& (k) = M%e, (k)8 + MT/,(k) (9-37)

which meansthat and


X,(k) = MXA(k) (9-38) P(k + 1) = [I - &(k + l)h(k + l)]P(k) (9-46)
VB(k> = Msr, (k) (9-39) These equations are initialized by and P(n), and are used for k = n,
n+l, . . . . N - 1. They can also be used for k = O,l, . . . , N - 1 as long as
and &3LU(O)and P(O) are chosen using Equations (4-21) and (4-20), respectivezy, in
SB(k) = M%,(k)M = MaA( (9-40) which w(0) is replaced by r-(O). Cl

Let i,,,,(k) and 6BBLU(k) denote the BLUEs associatedwith observersA Recall that, in best-linear unbiased estimation, P(k) = COV[&&C)].
and B, respectively; then, Observe, in Theorem 9-7, that we compute P(k) recursively, and not P-l(k).
This is why the results in Theorem 9-7 (and, subsequently, Theorem 4-2) are
referred to as the covariance form of recursive BLUE.

PROBLEMS

9-l. (Mendel, 1973, Exercise 3-2, pg. 175). Assume X(k) is random, and that
RECURSIVE BLUES km,(k) = F(k)~(k).
(a) Show that unbiasedness of the estimate is attained when E{F(k)%e(k)} = I.
Becauseof Corollary 9-1, we obtain recursiveformulas for the BLUE of 8 by (b) At what point in the derivation of 6BLU(k)do the computations break down
because X(k) is random?
setting l/w@ + 1) = r(k + 1) in the recursive formulas for the WLSEs of 8
which are given in Lesson 4. In the caseof a vector of measurements,we set 9-2. Here we examine the situation when V(k) is colored noise and how to use a
model to compute wj* Now our linear model is
(seeTable 5-1) w-(k + 1) = R(k + 1).
z(k + 1) = X(k + lje + v(k + 1)
where v(k) is colored noise modeled as
v(k + 1) = A₁v(k) + ξ(k)
We assume that deterministic matrix A₁ is known and that ξ(k) is zero-mean white noise with covariance R₁(k). Working with the measurement difference z*(k + 1) = z(k + 1) - A₁z(k), write down the formula for θ̂_BLU(k) in batch form. Be sure to define all concatenated quantities.

Theorem 9-6 (Information Form of Recursive BLUE). A recursive structure for θ̂_BLU(k) is:
θ̂_BLU(k + 1) = θ̂_BLU(k) + K_B(k + 1)[z(k + 1) - h'(k + 1)θ̂_BLU(k)]    (9-42)
K_B(k + 1) = P(k + 1)h(k + 1)r⁻¹(k + 1)    (9-43)
P⁻¹(k + 1) = P⁻¹(k) + h(k + 1)r⁻¹(k + 1)h'(k + 1)    (9-44)
These equations are initialized by θ̂_BLU(n) and P⁻¹(n) [where P(k) is cov[θ̃_BLU(k)], given in (9-24)], and are used for k = n, n + 1, ..., N - 1. These equations can also be used for k = 0, 1, ..., N - 1 as long as θ̂_BLU(0) and P⁻¹(0) are chosen using Equations (21) and (20) in Lesson 4, respectively, in which w(0) is replaced by r⁻¹(0). □

9-3. (Sorenson, 1980, Exercise 3-15, pg. 130). Suppose θ̂₁ and θ̂₂ are unbiased estimators of θ with var(θ̂₁) = σ₁² and var(θ̂₂) = σ₂². Let θ̂₃ = αθ̂₁ + (1 - α)θ̂₂.
(a) Prove that θ̂₃ is unbiased.
(b) Assume that θ̂₁ and θ̂₂ are statistically independent, and find the mean-squared error of θ̂₃.
(c) What choice of α minimizes the mean-squared error?
9-4. (Mendel, 1973, Exercise 3-12, pp. 176-177). A series of measurements z(k) are made, where z(k) = Hθ + v(k), H is an m × n constant matrix, E{v(k)} = 0, and cov[v(k)] = R is a constant matrix.
(a) Using the two formulations of the recursive BLUE show that (Ho, 1963, pp.
Theorem 9-7 (Covariance Form of Recursive BLUE). Another recur-
152-154):
sive structure for 6&k) is (9-42) in which (i) P(k + 1jH = P(k)H[HP(k)H + RI-R, and
KB(k + 1) = P(k)h(k + 1) [h(k + l)P(k)h(k + 1) + r(k + l)]- (9-45) (ii) HP(k) = R[HP(k - 1)H + RI-HP(k - 1).

(b) Next, show that


(i) P(k)H = P[k - 2)Hf[2HP(k - 2)H + RI-R;
Lesson 10
(ii) P(k)H = P(k - 3)H[3HP(k - 3)H + R]-R; and
(iii) P(k)H = P(O)H[kHP(O)H + R]-R.
(c) Finally, show that the asymptotic form (k + m] for the BLUE of 9 is (Ho,
1963, pp. 152-154) Likelihood

This equation, with its l/(k + 1) weighting function, represents a form of


multidimensional stochastic approximation.
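Before leaving this lesson, here is a compact sketch (ours; scalar measurements, synthetic data, and an assumed small-information initialization) of the information-form recursion (9-42) through (9-44) of Theorem 9-6, which several of the preceding problems exercise:

```python
# Sketch: information-form recursive BLUE for scalar measurements
# z(k) = h'(k)*theta + v(k), following (9-42)-(9-44).
import numpy as np

rng = np.random.default_rng(5)
n, N = 2, 300
theta_true = np.array([1.5, -0.7])
r = 0.25                                          # known noise variance r(k) = r
theta_hat = np.zeros(n)
P_inv = 1e-6 * np.eye(n)                          # assumed initialization: tiny information
for k in range(N):
    h = rng.normal(size=n)
    z = h @ theta_true + np.sqrt(r) * rng.normal()
    P_inv = P_inv + np.outer(h, h) / r            # (9-44)
    P = np.linalg.inv(P_inv)
    K = P @ h / r                                 # (9-43)
    theta_hat = theta_hat + K * (z - h @ theta_hat)   # (9-42)
print(theta_hat)                                  # approaches theta_true as k grows
```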

INTRODUCTION

This lesson provides background material for the method of maximum-


likelihood. It explains the relationship of likelihood to probability, and when
the terms Zikelihood and likelihood ratio can be used interchangeably. The
major reference for this lessonis Edwards (1972), a most delightful book.

LIKELIHOOD DEFINED

To begin, we define what is meant by an hypothesis, H, and results (of an


experiment), R. SupposescaIarparameter 8 can assumeonly two values, 0 or
1; then, we say that there are two hypothesesassociatedwith 8, namely If0 and
H,, where, for I&, 8 = 0, and for HI, 8 = 1. This is the situation of a binary
hypothesis. Supposenext that scalar parameter 0 can assumeten values, a, b,
c, d, e, f, g, h, i, i; then, we say there are ten hypothesesassociatedwith 8,
namely IfI, HZ, H3, . . . , HIo, where, for If,, 0 = a, for Hz, 8 = b, . . . , and for
Hlo, 8 = i. Parameter 8 may also assume values from an interval, i.e.,
a I 8 I b. In this case, we have an infinite, uncountable number of hypo-
theses about 0, each one associatedwith a real number in the interval [a, b].
Finally, we may have a vector of parameters eachone of whoseelements has a
coliection of hypotheses associatedwith it. For example, suppose that each
one of the II elements of 9 is either 0 or 1. Vector 6 is then characterized by 2
hypotheses.

81

Results, R, are the outputs of an experiment. In our work on parameter Example 10-3 (Edwards, 1972, pg. 10)
estimation for the linear model Z(k) = X(k)6 + V(X-), the results are the To further illustrate the difference between P(RjH) and L(HIR), we consider the
data in S(k) and X(k). following binomial model which we assumedescribesthe occurrence of boys and girls
We let P(RIH) denote the probability of obtaining results R given hy- in a family of two children:
pothesis H according to some probability model, e.g., p[z(k)lO]. In proba-
P(R[p) = wpm(l -p) (10-2)
bility, P(R IH) is always viewed as a function of R for fixed values of H. . .
Usually the explicit dependenceof P on H is not shown. In order to under- wherep denotes the probability of a male child, uz equalsthe number of male children,
stand the differences between probability and likelihood, it is important to f equals the number of female children, and, in this example,
show the explicit dependenceof P on H.
m+f=2 (10-3)
Example 10-l
Our objective is to determine p; but, to do this we need some results. Knocking on
Random number generators are often used to generate a sequenceof random numbers neighbors doors and conducting a simple survey, we establish two data sets:
that can then be used as the input sequence to a dynamical system, or as an additive
measurement noise sequence.To run a random number generator, you must choose a RI = (1 boy and 1 girl} +rn = 1 and f = 1 (10-4)
probability model. The Gaussian model is often used; however, it is characterized by R2 = (2 boys) +rn = 2 and f = 0 (10-5)
two parameters, mean p and variance u*. In order to obtain a stream of Gaussian
random.numbers from the random number generator, you must fix p and u*. Let m In order to keep the determination of p simple for this meager collection of data, we
and a$- denote (true) values chosen for p and c*. The Gaussianprobability density shall only consider two values for p, i.e., two hypotheses,
function for the generator is ~[r (k)lp T, c&l, and the numbers we obtain at its output,
are of course quite dependent on the hypothesisHT = (m, a$). 0 H,:p = l/4
z(l), z(2), - l ?
(10-6)
Hz:p = l/2 I
For fixed H we can apply the axioms of probability (e.g., seePapoulis, To begin, we create a table of probabilities, in which the entries are P(RilHj
1965). If, for example, results RI and R2 are mutually exclusive,then P(R1 or fixed), where this is computed using (10-2). For HI (i.e., p = l/4), P(R@/4) = 3/8 and
R,IH) = P(RlIH) + P(R,IH). P(R&4) = l/16; for & (i.e., p = l/2), P(R1]1/2) = l/2 and P(R211/2)= l/4. These
results are collected together in Table 10-l.
Definition 10-l (Edwards, 1972, pg. 9). Likelihood, L(HIR), of the TABLE 10-l P(Ri 1Hjfixed)
hypothesis H given the results R and a specific probability model is proportional
Rl R2
to P(RIH), the constant of proportionality being arbitrary, i.e.,
H* 318 l/l6
L(HIR) = cP(RIH) q (10-l) HZ l/2 l/4

For likelihood R is fixed (i.e., given ahead of time) and H is variable. Next, we create a table of likelihoods, using (lo-l). In this table (Table 10-2) the
There are no axioms of likelihood. Likelihood cannot be compared using entries are L (HilRj fixed). Constants cl and c2 are arbitrary and cl, for example,
different data sets (i.e., different results, say, RI and R2) unless the data sets appears in each one of the table entries in the RI column.
are statistically independent.
TABLE 10-2 L(H,(R, fixed)
Example 10-2 Rl R2

Suppose we are given a sequenceof Gaussian random numbers, using the random Hl 3/8 cl l/16 C2
number generator that was describedin Example 10-1, but, we do not know PT and a:. H2 l/2 Cl l/4 c2
Is it possible to infer (i.e., estimate) what the values of p and u* were that most likely
generated the given sequence?The method of maximum-likelihood, which we study in
Lesson 11, will show us how to do this. The starting point for the estimation of p and C* What can we conclude from the table of likelihoods? First, for data R 1, the
will bep [z (Ic&, u*], where now z (k) is fixed and p and u2 are treated as variables. 0 likelihood of HI is 3/4 the likelihood of I&. The number 3/4 was obtained by taking the

ratio of likehhoods L (Hl!R!) and L (H$?l). Second, on data R2, the likelihood of IfI is
RI &I R2) = 314 x l/4 = 3/l& This reinforces our intuition that p = 112is much more
l/4 the likelihood of Hz [note that l/4 = l/l6 c2/ l/4 c2]. Finally, we conclude that, even
likely thanp = l/4. El
from our two meager results, the vaIuep = l/Z appears to be more likely than the value
p = l/4? which, of course, agreeswith our intuition. q
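A tiny computational sketch (ours) of the entries of Tables 10-1 and 10-2: the binomial probabilities P(R|p) of (10-2) and the resulting likelihood ratios for the two hypotheses:

```python
# Sketch: probabilities and likelihood ratios for the two-child example (eq. 10-2),
# hypotheses p = 1/4 and p = 1/2, results R1 = {1 boy, 1 girl}, R2 = {2 boys}.
from math import comb

def prob(m, f, p):
    # P(R|p) for m boys and f girls out of m + f children
    return comb(m + f, m) * p**m * (1 - p)**f

for label, (m, f) in [("R1", (1, 1)), ("R2", (2, 0))]:
    p1, p2 = prob(m, f, 0.25), prob(m, f, 0.5)
    print(label, p1, p2, "likelihood ratio L(1/4, 1/2 | R) =", p1 / p2)
# R1: 3/8 vs 1/2 -> ratio 3/4;  R2: 1/16 vs 1/4 -> ratio 1/4, as in the text.
```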

RESULTS DESCRIBED BY CONTINUOUS DISTRIBUTIONS


LIKELIHOOD RATIO
Supposethe results R have a continuous distribution; then, we know from
In the preceding example we were able to draw conchis~onsabout the like- probability theory that the probabilty of obtaining a result that lies in the
lihood of one hypothesis versus a second hypothesis by comparing ratios of interval (R, R + dR) is P(R lH)dR, as dR + 0. P(R IH) is then a probability
likelihood, defined, of course, on the same set of data. Forming the ratios of density. In this case, L (HIR) = cP(R lH)dR ; but, cdR can be defined asa new
Likelihood, we obtain the Zikeiihood ratio. arbitrary constant, cl, so that L (HIR) = clP(R iH)* In likelihood ratio state-
ments cl = cdR disappears entirely; thus, likelihood and likelihood ratio are
Definition 10-2 (Edwards, 1972, pg. 10). The Zikelihuud ratiu of two unaffected by the nature of the distribution of R .
hypotheses un the sum data is the ratio of the likeNmods un the duta. Let L(HI, Recall, that a transformation of variables greatly affects probability be-
&\R) denote likelihood ratio; then, cause of the dR which appears in the probability formula. Likelihood and
likelihood ratiu, on the other hand, are unaffected by transformations of
(10-7) variables, becauseof the absorption of dR into cl.

Observe that likelihood ratio statementsdo not depend on the arbitrary


constant c which appearsin the definition of Iikehhood, becausec cancelsout MULTIPLE HYPOTHESES
of the ratio c~(R/HJ/cP(R~&).
Thus far, all of our attention has been directed at the caseof two hypotheses.
Theorem 10-l (Edwards, 1972, pg. 11). LikeIihoud ratios of two lzy- In order to apply likelihood and likelihood ratio concepts to parameter esti-
potheses on stutisticully independent sets of duta may be multiplied tugether to mation problems, where parameterstake on more than two values, we must
form the likelihood rutiu uf the combined data. extend our preceding results to the caseof multiple hypotheses. As stated by
Edwards (1972, pg. ll), Instead of forming all the pairwise likelihood ratios
Proof. Let L (HI, H2iRl & I&) denote the likelihood ratio of HI and I& it is simpler to present the same information in terms of the likelihood ratios
on the combined data RI & &, i.e., for the several hypotheses versusone of their number, which may be chosen
quite arbitrarily for this purpose.
Our extensions rely very heavily on the results for the two hypotheses
case,becauseof the convenient introduction of an arbitrary comparison hy-
Because RI & Rz are statistically independent data, P(Rl & R2iHJ = pothesis, H * . Let Hi denote the ith hypothesis; then,
fW$W (&iW hence,
L(Hi ]R)
L(H,,HC/R) = (10-10)
L w *lw
Observe, also, that
LwH*IR) uH,IR) L(H, HjR)
=I---= (1041)
Example 10-4 L(H,,H*lR) L(HjIR)
We can now state the conclusions that are given at the end of Example 10-3 more which means that we can compute the likelihood-ratio between any two
formally. From Table 10-2, we see that L (l/4, l/2iRI) = 3/4 and L (l/4,1/21&) = l/4. hypotheses H, and Hi if we can compute the likelihood-ratio function
Additionally, because RI and R2are data from independent experiments, L (l/4, 1121
L(Hk,H*IR).

L(H,/R) W,, H*)R) PROBLEMS


I
10-l. A test is said to be a likelihood-ratio testif there is a number c such that this test
leads to (dj refers to the ith decision)
dl if L(Hl,H,IR) > c
& if L(H1,H21R) < c
dl or d2 if L (HI,H2[R) = c
Consider a sequence of IZ tosses of a coin of which m are heads, i.e.,
P(R[p) = ~(1 - p) --. Let HI = pl and H2 = p2 where p1 > p2 and p
I c I represents the probability of a head. Show that L(H1,H21R) increases as m
H k H, H H, increases.Then show that L(HI,H21R) increaseswhenfi = m In increases,and
that the likelihood-ratio test which consists of accepting HI if L (HI ,H2/R) is
(a> W larger than some constant is equivalent to accepting HI if @is larger than some
Figure 10-l Multiple hypotheses case: (a) likelihood L (Hk (R) versus Hk , and (b) related constant.
likelihood ratio L (I& ,H *II?) versusI&. Comparison hypothesis H * has been cho- 10-2. Consider it independent observations x1, x2, . . . , X, which are normally
sen to be the hypothesis associatedwith the maximum value of L (& (R), L *. distributed with mean p and known standard deviation (T. Our two hypotheses
are: HI: p = p1 and H2 : p = p2. Using the sample mean as an estimator of EL,
show that the likelihood-ratio test (defined in Problem 10-l) consists of
Figure 10-l(a) depicts a likelihood function L (H$?). Any value of Hk accepting HI if the sample mean exceedssome constant.
canbe chosenas the comparison hypothesis.We chooseH* as the hypothesis
10-3. Suppose that the data consists of a single observation z with Cauchy density
associatedwith the maximum value of L (Hk IR), so that the maximum value of given by
L(Hk ,H*IR) will be normalized to unity. The likelihood ratio function
L (Hk ,H *(R), depicted in Figure lo-l(b), wasobtained from Figure lo-l(a) and 1
PW) = rr[l + (2 - e)2]
(10-10).In order to compute L(Hi,HjiR), we determine from Figure lo-l(b),
that L(Hi ,H*IR) = a and L(H,,H*IR) = b, so that L(H&IR) = a/b. Test (see Problem 10-I) HI : 8 = 61= 1versusH2:6 = e2= -1whenc = l/2,
Is it really necessaryto know H in order to carry out the normalization i.e., show that for c = l/2, we acceptHi when z > -0.35 or when z < -5.65.
of L (Hk ,H*(R) depicted in Figure lo-l(b)? No, becauseH * can be eliminated
by a clever conceptual choice of constantc. We can choosec, such that
L(H*IR) = L f 1 (10-12)
According to (lo-l), this meansthat
c = l/P(RIH*) (10-13)
If c is chosen in this manner, then
L cHklR) = LcHkIR)
L(Hk,H*lR) = L(H*IR) (10-14)

which meansthat, in the case of multiple hypotheses, likelihood and likelihood


ratio can be used interchangeably. This helps to explain why authors use
different names for the function that is the starting point in the method of
maximum likelihood, including likelihood function and ZikeZihood-ratio
.Ifunction,

lesson 11 (1965). and Stepner and Mehra (1973), for example], or the conditional like-
lihood function (Nahi, 1969).We shall use these terms interchangeably.

Maximum4ikelihood MAXIMUM-LIKELIHOOD METHOD AND ESTIMATES*

The maximum-iikeiihood method is based on the relatively simple idea that


Estimation different populations generate different samplesand that any given sample is
more likely to have come from some populations than from others. The
maximum-likelihood estimate (MLE) i,, is the value of 6 that maximizes 2or
L for a particular set of measurements5. The logarithm of I is a monotonic
transformation of 1 (i.e., whenever I is decreasingor increasing, in I is also
decreasingor increasing); therefore, the point correspondingto the maximum
of 1is also the point corresponding to the maximum of In 1 = L.
Obtaining an MLE involves specifying the likelihood function and find-
ing those valuesof the parameters that give this function its maximum value. It
is required that, if L is differentiable, the partial derivative of L (or I) with
respect to each of the unknown parameters O,,&, . . . , 0, equal zero:
∂L(θ|Z)/∂θ_i |_(θ = θ̂_ML) = 0    for all i = 1, 2, ..., n    (11-5)
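For a concrete and deliberately simple illustration of (11-5), assume (our choice, not the text's) an exponential density p(z|θ) = θ e^(-θz); the likelihood equation can then be solved numerically and compared with the closed-form answer θ̂_ML = 1/z̄:

```python
# Sketch: solving the likelihood equation dL/d(theta) = 0 of (11-5) numerically
# for an exponential sample, p(z|theta) = theta * exp(-theta*z), theta > 0.
import numpy as np

rng = np.random.default_rng(8)
z = rng.exponential(scale=1.0 / 2.5, size=1000)   # synthetic sample, true theta = 2.5

def dL(theta):                                    # derivative of the log-likelihood
    return len(z) / theta - np.sum(z)

theta = 1.0                                       # Newton iteration on dL(theta) = 0
for _ in range(50):
    d2L = -len(z) / theta**2                      # second derivative of the log-likelihood
    theta = theta - dL(theta) / d2L
print(theta, 1.0 / z.mean())                      # the numerical root matches 1/(sample mean)
```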
LIKELIHOOD* To be sure that the soIution of (11-5) gives, in fact, a maximum value of
L(9\%), certain second-order con$itions must be fulfilled. Consider a Taylor
Let us consider a vector of unknown parameters8 that describesa collection series expansionof L(O/%)about OML,i.e.,
of N independent identically distributed observationsz (& k = 1,2, . . . , ZV.
We collect thesemeasurementsinto an IV x 1 vector %(A),(2 for short),
SE= col(z(l),z(2), . . . ,z(N)) (11-l)
The likelihood of 0, given the observations3, is defined to be proportional to (ei - ii,ML)(J - bj,ML) + l l * (11-6)
the value of the probability density function of the observations given the
parameters where aL(&alZ)/ &Ii, for example, is short for [aL(t$?E)/aei]le = &. Be-
causeiML is the MLE of 8, the second term on the right-hand side of (11-6) is
l Fw9 a P cm9 (H-2)
zero [by virtue of (11-S)]; hence,
where Z is the likehhood function and J?is the conditional joint probability
density function. Becausez(i) are independent and identically distributed, Wl~) = L(~M4f-Q

In many applicationsp (!i@3)is exponential (e.g.: Gaussian).It is easier Recognizing that the secondterm in (11-7) is a quadratic form, we can write it
then to work with the natural logarithm of @~CF,)than with &@iZ). Let in a more compact notation as l/2(9 - ~~~~)lJO(~hlL~~)(~ - && where
L(@E) = ln 1(01%) (11-4) J,(&,,&), the observed Fisher irzformation matrix [seeEquation (646)], is
Quantity L is sometimesreferred to as the log-likelihood function, the support
function (Kmenta, 1971), the likelihood function [Mehra (1971), Schweppe J~(L&)=(~)!,=g i,j=LL..,n (H-8)
I ML

* The material in this section is taken from Mendel (1983b, pp* 94-95). * The material in this section is taken from Mended(1983b, pp+95-98).

There are two unknown parameters in L, p and (T. Differentiating L with respect to
each of them gives
dL (1147a)
- A- 2 [z(i) - ~1
-&I- cr2jz,
Now let us examine sufficient conditions for the likelihood function to be
maximum. We assumethat, closeto 0, L(el%) is approximately quadratic, in dL Nl (11-17b)
-=----T+--T
a@*) 2t7 21 $ Lm - PI2
q I-1
which case
Equating these partial derivatives to zero, we obtain
(11-10)
(11-18a)
oMLi=l
From vector calculus, it is well known that a sufficient condition for ij;function
of ytvariables to be maximum is that the matrix of secondpartial derivatives of (1148b)
that function, evaluated at the extremum, must be negative definite. For
L(el%) in (ll-lo), this meansthat a sufficient condition for L(elS) to be max- For &$rLdifferent from zero (1 l-18a) reduces to
imized is
(1141) CL zi0 -: bML] = 0
Jo(i,,le> < 0 j z 1

Example 11-l giving


This is a continuation of Example 10-2. We observe a random sample {z(l), z (2), . . . , i r(i)=2 (11-19)
t(N)} at the output of a Gaussian random number generator and wish to find the
bML=Njcl
maximum-likelihood estimators of p and 02.
Thus the MLE of the mean of a Gaussianpopulation is equal to the sample mean 2.
The Gaussian density function p (z 1~,02) is
Once again we see that the sample mean is an optimal estimator.
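A quick numerical check (synthetic data, our own script) that the sample mean and sample variance do indeed maximize the Gaussian log-likelihood:

```python
# Sketch: the Gaussian log-likelihood L(mu, sigma^2 | Z) evaluated at the
# sample mean/variance versus at slightly perturbed values.
import numpy as np

rng = np.random.default_rng(6)
z = rng.normal(loc=3.0, scale=2.0, size=400)

def loglike(mu, var):
    return -0.5 * len(z) * np.log(2 * np.pi * var) - 0.5 * np.sum((z - mu) ** 2) / var

mu_ml, var_ml = z.mean(), z.var()        # sample mean and (biased) sample variance
print(loglike(mu_ml, var_ml))
print(loglike(mu_ml + 0.1, var_ml))      # any perturbation gives a smaller log-likelihood
print(loglike(mu_ml, 1.1 * var_ml))
```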
Observe that (11-18a) and (ll-18b) can be solved for G,,, the MLE of u*.
p (2 1~,a) = (2m2)- exp { -; Kf - PYd] (11-12) Multiplying Eq. (11-18b) by 26hL leads to

Its natural logarithm is --Iv& + li[ zi0 - ,&fd* = 0


i = 1
1
In p (2 1~,u2) = -z In (2rc2) - $(z - p)ld2 (11-13) Substituting z for bML and solving for &,, gives

σ̂²_ML = (1/N) Σ_{i=1}^{N} [z(i) - z̄]²    (11-20)

Thus, the MLE of the variance of a Gaussian population is simply equal to the sample variance. □

The likelihood function is
l(μ, σ²|Z) = p(z(1)|μ, σ²) p(z(2)|μ, σ²) ··· p(z(N)|μ, σ²)    (11-14)
and its logarithm is
L(μ, σ²|Z) = Σ_{i=1}^{N} ln p(z(i)|μ, σ²)    (11-15)


PROPERTIES OF MAXIMUM-LIKELIHOOD ESTIMATES

The importance of maximum-likelihood estimation is that it produces estimates that have very desirable properties.
Substituting for ln p(z(i)|μ, σ²) gives

Theorem 1l-1. Maximum-likelihood


L(μ, σ²|Z) = -(N/2) ln(2πσ²) - (1/(2σ²)) Σ_{i=1}^{N} [z(i) - μ]²    (11-16)
estimates are: (1) consistent, (2) asymptotically Gaussian with mean θ and covariance matrix (1/N)J⁻¹, in which J

is the Fisher lnfur-matiun Matrix [Equation (64)]. and, (3) asymptotically objectives in thisparagraph are twofold: (1) to derive the MLE of 9, iML(k),
efficient. and (2) to relate &,&I?) to &u(k) and &(k).
In order to derive the MLE of 9, we need to determine a formula for
Pr~uf. For proofs of consistency, asymptotic normality and asymptotic
p(S(k)j0). To proceed, we assume that V(k) is Gaussian, with multivariate
efficiency, see Sorenson (1980, pp. 187-190; 190-192; and 192-193, re-
density function p (T(k)), where
specGvely). Tnese proofs, though somewhat heuristic, convey the ideas
needed to prove the three parts of this theorem. More detailed analyses can be
found in Cramer (1946) and Zacks (1971). See a!so, Problems U-13, H-14>
Pwo~ = g (27i-ypIi(k)i
l exp
[ -+ V(k)W(k)V(k)] (11-24)
and 11-15. q Recall (e.g., Papoulis, 1965)that linear transformations on, and linear combi-
nations of, Gaussian random vectors are themselves Gaussian random vec-
Theorem 11-2 (Invariance-Froperty of MLEs). Lel g(8) be a vect~~r
tors. For this reason, it is clear that when V(k) is Gaussian,s(k) is aswell. The
functiun mapping 0 into an itlterval in r-dimensi~~~al Euclidean space. Let 6ML
multivariate Gaussian density function of s(k), derived from p (V(k)), is
be a ML E of 0; then g(&) is a A4L E of g(8); i.e.,
(H-21) Pcw$J~
= v(2?;)Npi(k)[
l exp
-$ EW)
B-oaf (See Zacks, 1971). Note that Zacks points out that in many
- X(k)9]%-(k)[%(k) - X(k)e]j (11-25)
books, this theorem is cited only for the case of one-to-one mappings, g(0).
His proof doesnot require g(9) to be one-to-one. Note, also, that the proof of Theorem 11-3. When p(ZlO) is multivariate Gaussian and X&j is
this theorem is related to the consistency carry-over property of a consistent deterministic, then the principle of ML leads to the BLUE of 0, i.e.,
estimator, which wasdiscussedin Lesson 7. El
kdk) = ~m# (11-26)
Example 1l-2
We wish to obtain a MLE of the variance CT?in our linear model z(k) = h(k)9 + v(k). Proof We must maximize p (3!]6) with respect to 0. This can be ac-
One approach is to let $ = D:, establish the log-likelihood function for & and max- complished by minimizing the argument of the exponential in (11-25); hence,
. . b,,(k) is the solution of
imue 11,to determine &,ML. Usual!y, mathematical programming (i.e., search tech-
niques) rnust be used to determine @1.ML. Here is where a difficulty canoccur,because
& (a variance) is known to be positive; thus, 6IVMLmust be consfrained to be positive. =0 (11-27)
ML
Unfortunately, constrained mathematical programming techniques are more difficult
than unconstrained ones. This equatio-n can also be expressed ,as dJ[&,]ldb = 0 where J[&L] =
A secondapproachis to let & = CT,,. establish the log-likelihood function for O2(it Y(k)%-(k)%(k) and Y,(k) = Z(k) - XeML(k). Comparing this version of
will be the sameas the one for & , except that 19~
will be replaced by &), and maximize it (11-27) with Equation (3-4) and the subsequentderivation of the WLSE of 9,
to determine I&,~~. Because & is a standard deviation, which can be positive or
we conclude that
negative, unconstrained mathematica1programming can be used to determine 62,ML,
Finally, we use the Invariance lroperty of MLEs to compute &,ML,as
a (U-28)
θ̂₁,ML = (θ̂₂,ML)²    (11-22) □
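A small numerical illustration (ours; the residuals are synthetic) of the invariance property used in this example: maximizing the likelihood over σ and then squaring gives the same answer as maximizing over σ² directly:

```python
# Sketch: invariance of MLEs (Theorem 11-2) for the noise variance of residuals v(k).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
v = 0.8 * rng.normal(size=500)                 # synthetic residuals, variance 0.64

def neg_loglike_var(var):                      # parameterized by sigma^2 (> 0)
    return 0.5 * len(v) * np.log(2 * np.pi * var) + 0.5 * np.sum(v**2) / var

def neg_loglike_std(s):                        # parameterized by sigma
    return neg_loglike_var(s**2)

var_ml = minimize_scalar(neg_loglike_var, bounds=(1e-6, 10), method="bounded").x
std_ml = minimize_scalar(neg_loglike_std, bounds=(1e-6, 10), method="bounded").x
print(var_ml, std_ml**2, np.mean(v**2))        # all three agree (up to solver tolerance)
```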
but, we also know, from Lesson 9, that
THE LINEAR MODEL (X(k) deterministic)
(13-29)
We return now to the linear model From (11-28) and (ll-29), we conclude that
3(k) = X(k)0 + V(k) (H-23)
(U-30)
in which 9 is an tz x 1 vector of deterministic parameters, X(k) is deferministic,
and V(k) is zero mean white noise, with covariance matrix a(k). This is We now suggest a reason why Theorem 9-6 (and, subsequently, The-
precisely the samemodel that was used to derive the BLUE of 0, &#). Our orem 4-l) is referred to as the information form of recursiveBLUE. From

Theorems
A 11-1 and 11-3 we know that, when X(k) is deterministic, additive white Gaussian noise. This system is described by the following
eBLU- A@; 8, l/N J-l). This meansthat P(k) = cov [&u(k)] is proportional state-equation model:
to J-l. Observe, in Theorem 9-6, that we compute P- recursively, and not P.
x(k + 1) = @x(k) + Pu(k) (11-32)
BecauseP- is proportional to the Fisher information matrix J, the results in
Theorem 9-6 (and 4-l) are therefore referred to as the information form of
recursive BLUE.
z(k + 1) = Hx(k + 1) + v(k + 1) k =O,l,... ,N - 1 (11-33)
A secondmore pragmaticreasonis due to the fact that the inverse of any
covariance matrix is known as an information matrix (e.g., Anderson and In this model, u(k) is known ahead of time, x(0) is deterministic, E{v(k)} = 0,
Moore, 1979,pg. 138).Consequently, any algorithm that is in terms of infor- v(k) is Gaussian, and E{v(k)v( j)} = R6kj.
mation matrices is known as an information form algorithm. To begin, we must establish the parameters that constitute 9. In theory 8
could contain all of the elementsin @,!P, H and R. In practice, however, these
Corollary 11-l. If p[%(k)le] is multivariate Gaussian, X(k) is deter- matrices are never completely unknown. State equations are either derived
ministic, and a(k) = a: I, then from physical principles (e.g., Newtons laws) or associatedwith a canonical
model (e.g., controllability canonical form); hence, we usually know that
&L(k) = hJ(~) = L(k) (11-31)
certain elements in @, V and H are identically zero or are known constants.
These estimators are: unbiased, most efficient (within the class of linear Even though all of the elements in 0, !P, H and R will not be unknown in
estimators),consistent,and Gaussian. an application, there still may be more unknowns present than can possibly be
identified. How many parameterscan be identified, and which parameters can
Proof. To obtain (H-31), combine the resultsin Theorems 11-3 and 9-2.
be identified by maximum-likelihood estimation (or, for that matter, by any
The estimators are:
type of estimation) is the subject of identifiability of system (Stepner and
Mehra, 1973). We shall assume that 8 is identifiable. Identifiability is akin to
1. unbiased, becausei,,(k) is unbiased;
existence. When we assume 8 is identifiable, we assume that it is possible to
2. most efficient, because&,(k) is most efficient; identify 8 by ML methods. This means that all of our statements are predi-
3. consistent, because&IC) is consistent; and cated by the statement: If 8 is identifiable, then . . . .
4. Gaussian, because they depend linearly upon Z(k) which is Gaus- We let
sian. 0
0 = co1 (elements of a, TV, H, and R) (11-34)
Observe that, when Z(k) is Gaussian, this corollary permits us to make Example 11-3
statementsabout small-sampleproperties of MLEs. Usually, we cannot make
such statements. The controllablecanonicalform state-variablerepresentation for the discrete-time
autoregressive moving average (ARMA) model

H(z) = (11-35)
A LOG-LIKELIHOOD FUNCTION
FOR AN IMPORTANT DYNAMICAL SYSTEM -
which implies the ARMA difference equation (z denotes the unit advance operator)
In practice there are two major problems in obtaining MLEs of parameters in y (k + n) + aly(k + n - 1) + l l l + any(k)
models of dynamical systems =P*u(k+n -1)+.*.+&u(k) (11-36)
1. obtaining an expression for L(θ|Z), and
2. maximizing L(θ|Z) with respect to θ.

is

x(k + 1) = [ 0    1    0   ···  0
             0    0    1   ···  0
             ⋮                   ⋮
             0    0    0   ···  1
            -a_n -a_(n-1)  ··· -a_1 ] x(k) + [0 0 ··· 0 1]' u(k)    (11-37)

In this section we direct our attention to the first problem for a linear, time-invariant dynamical system that is excited by a known forcing function, has

deterministic initial conditions, and has measurementsthat are corrupted by



and In essence, then, stute equation (11-43) is a comtraint that is associated with the
computatim of the log-likelihood function.
How do we determine bML for L(9[%) given in (11-42) [subject to its
constraint in (ll-43)]? No simple closed-form solution is possible, because 8
enters into L($?!I) in 2 complicated nunlinear manner. The only way presently
For this model there are no unknown parameters in matrix W, and, @ and H each known for obtaining 9ML is by means of mathematical programming, which is
contain exactly n (unknown) parameters. Matrix @contains the n a-parameters which beyond the scope of this course (e.g., see Mendel, 1983, pp. 141-142).
are associatedwith the poles of H(z), whereas matrix H contains the PIp-parameters
which are associated with the zero of H(z). In genera/, an &-or&r, single-input Comment. This completes our studies of methods for estimating un-
single-output system k compkteiy charucterized by 2 n parameters0 0
known deterministic parameters. Prior to studying methods for estimating
unknown random parameters, we pause to review an important body of mate-
Our objective now is to determine L(C$E)for the system in (U-32) and
rial about muhivariate Gaussian random variables.
(H-33). To begin, we must determine p (910) = p (z(l), z(2), . . . , z(N)/@).This
is easy to do, becauseof the whiteness of noise v(k)? i.e., p (z(l), z(2), . . . ,
eN = P W~P MW l l l P won thw PROBLEMS

(11-39) 11-I. If i is the MLE of a, what is the MLE of aO?Explain your answer.
Wl~~ = h [IJPMM~]
11-2. (Sorenson, 1980, Theorem 5.1, pg. 185). If an estimator exists such that
From the Gaussian nature of v(k) and the linear measurement model in equality is satisfied in the Cramer-Rao inequality, prove that it can be
(U-33), we know that determined as the solution of the likelihood equation.
11-3. Consider a sequence of independently distributed random variables
X1,X2 , . . . ,xN , having the probability density function #xie-~, where 8 > 0.
[ - ; [z(i) - Hx(i)]R- [z(i) - Hx(i)]} (11-40)
(a) Derive &&1v).
(b) You want to study whether or not &&N) is an unbiased estimator of 8.
thus, Explain (without working out all the details) how you would do this.
11-4, Consider a random variable z which can take on the values z = 0, 1,2, . . . . This
L(IqE) = - i i [z(i) - Hx(i)]R- [z(i) - Hx(i)] variable is Poisson distributed, i.e., its probability function P(z) is P(t) =
i=l
pze p/z! Let 2 (l), z(2), . . . , z(N) denote N independent observations of t.
-$hiR/ -$mln27r (11-41) Find i&&N).
11-5. If p(z) = Be+=, z > 0 and p(z) = 0, otherwise, find I&, given a sample of N
The log-likelihood function L#!E) is a function of 0. To indicate which independent observations.
quantities on the right-hand side of (U-41) may depend on 6, we subscript all 11-6. Find the maximum-likelihood estimator of 8 from a sample of N independent
N observations that are uniformly distributed over the interval (0, 0).
such quantities with 9. Additionally, becauseT~H in 27rdoes not depend on 8
11-7. Derive the maximum-likelihood estimate of the signal power, defined as e, in
we negIect it in subsequentdiscussions.Our final log-Iikelihood function is: the signal z(k) = o(k), k = 0, 1,. . . , N, where s(k) is a scalar, stationary,
zero-mean Gaussian random sequence with autocorrelation E{s (9s (j)} =
L(@E) = -; 2 [z(i) - &x&)]Ri [z(i) - Hex&)] - $ In /Rot (11-42) 9(j - i>*
i=l 11-8. Supposex is a binary variable that assumesa value of 1 with probability a and a
value of 0 with probability (1 - a). The probability distribution of x can be
observe that 8 occurs explicitly and implicitly in L@l%). Matrices I&, and described by
R. contain the explicit dependenceof L@lS!) on 8, whereasstate vector x@(i)
contains the implicit dependence of L@i$!E)on 9. In order to numerically P(x) = (1 - a) - *a
calculate the right-hand side of (1 l-42), we must solve state equation (11-32). Supposewe draw a random sample of N values (x1,x2, . . . , x.~}.Find the MLE
This can be done when vaIuesof the unknown parameterswhich appear in @ of a.
and V are given specific values; for, then (11-32)becomes 11-9. (Mendel, 1973, Exercise 3-15, pg. 177). Consider the linear model for which
x@(O)
known (1l-43) V(k) is Gaussian with zero mean and %(A-)= (~1.
x@(k+ 1) = @G@(~)
+ %Q4

(a) Show that the maximum-likelihood estimator of σ², denoted σ̂²_ML, is

σ̂²_ML = (Z - Hθ̂_ML)'(Z - Hθ̂_ML)/N

where θ̂_ML is the maximum-likelihood estimator of θ.
(b) Show that σ̂²_ML is biased, but that it is asymptotically unbiased.
11-10. We are given N independent samples z(1), z(2), ..., z(N) of the identically distributed two-dimensional random vector z(i) = col[z1(i), z2(i)], with the Gaussian density function

p[z1(i), z2(i)|ρ] = [1/(2π√(1 - ρ²))] exp{ -[z1²(i) - 2ρ z1(i) z2(i) + z2²(i)] / [2(1 - ρ²)] }

where ρ is the correlation coefficient between z1 and z2.
(a) Determine ρ̂_ML (Hint: You will obtain a cubic equation for ρ̂_ML and must show that ρ̂_ML = r, where r equals the sample correlation coefficient).
(b) Determine the Cramer-Rao bound for ρ̂_ML.
11-11. It is well known, from linear system theory, that the choice of state variables used to describe a dynamical system is not unique. Consider the nth-order difference equation

y(k + n) + a1 y(k + n - 1) + ··· + an y(k) = b1 m(k + n - 1) + b2 m(k + n - 2) + ··· + bn m(k).

A state-variable model for this system is a companion-form state equation, driven by m(k), whose state matrix is built from the a_i-parameters, together with the output equation

y(k) = (1  0 ... 0) x(k)

where the input-distribution parameters are functions of the a_i- and b_i-parameters. Suppose maximum-likelihood estimates have been determined for the a_i- and b_i-parameters. How does one compute the maximum-likelihood estimates of the input-distribution parameters?
11-12. Prove that, for a random sample of measurements, a maximum-likelihood estimator is a consistent estimator. [Hints: (1) Show that E{∂ ln p(Z|θ)/∂θ | θ} = 0; (2) Expand ∂ ln p(Z|θ̂_ML)/∂θ in a Taylor series about θ, and show that ∂ ln p(Z|θ)/∂θ = -(θ̂_ML - θ) ∂² ln p(Z|θ*)/∂θ², where θ* = λθ + (1 - λ)θ̂_ML and 0 ≤ λ ≤ 1; (3) Show that ∂ ln p(Z|θ)/∂θ = Σ_{i=1}^{N} ∂ ln p(z(i)|θ)/∂θ and ∂² ln p(Z|θ)/∂θ² = Σ_{i=1}^{N} ∂² ln p(z(i)|θ)/∂θ²; (4) Using the strong law of large numbers to assert that, with probability one, sample averages converge to ensemble averages, and assuming that E{∂² ln p(z(i)|θ)/∂θ²} is negative definite, show that θ̂_ML → θ with probability one; thus, θ̂_ML is a consistent estimator of θ. The steps in this proof have been taken from Sorenson, 1980, pp. 187-191.]
11-13. Prove that, for a random sample of measurements, a maximum-likelihood estimator is asymptotically Gaussian with mean value θ and covariance matrix (NJ)^{-1}, where J is the Fisher information matrix for a single measurement z(i). [Hints: (1) Expand ∂ ln p(Z|θ̂_ML)/∂θ in a Taylor series about θ and neglect second- and higher-order terms; (2) Show that ...; (3) Let s(θ) = ∂ ln p(Z|θ)/∂θ and show that s(θ) = Σ_{i=1}^{N} s_i(θ), where s_i(θ) = ∂ ln p(z(i)|θ)/∂θ; (4) Let S̄ denote the sample mean of the s_i(θ), and show that the distribution of S̄ asymptotically converges to a Gaussian distribution having mean zero and covariance J/N; (5) Using the strong law of large numbers, we know that ... → E{∂ ln p(z(i)|θ)/∂θ | θ}; consequently, show that (θ̂_ML - θ)J is asymptotically Gaussian with zero mean and covariance J/N; (6) Complete the proof of the theorem. The steps in this proof have been taken from Sorenson, 1980, pg. 192.]
11-14. Prove that, for a random sample of measurements, a maximum-likelihood estimator is asymptotically efficient. [Hints: (1) Let S(θ) = -E{∂² ln p(Z|θ)/∂θ² | θ} and show that S(θ) = NJ, where J is the Fisher information matrix for a single measurement z(i); (2) Use the result stated in Problem 11-13 to complete the proof. The steps in this proof have been taken from Sorenson, 1980, pg. 193.]
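Problems 11-12 and 11-13 can also be explored by simulation before they are proved. The following is a minimal Monte Carlo sketch (not from the text) using the exponential density of Problem 11-5, for which θ̂_ML = 1/z̄; every numerical value, and the use of NumPy, is an assumption for illustration only. It shows the estimation error shrinking with N (consistency) and the √N-scaled error variance settling near 1/J = θ², where J = 1/θ² is the per-sample Fisher information.

import numpy as np

rng = np.random.default_rng(5)

# For p(z) = theta * exp(-theta * z), z > 0 (Problem 11-5), the MLE from
# N independent observations is theta_hat = 1 / z_bar.
theta_true = 2.0
trials = 2000

for N in (10, 100, 1000):
    z = rng.exponential(scale=1.0 / theta_true, size=(trials, N))
    theta_hat = 1.0 / z.mean(axis=1)
    scaled_err = np.sqrt(N) * (theta_hat - theta_true)
    # Consistency (Problem 11-12): the raw error shrinks with N.
    # Asymptotic normality (Problem 11-13): the scaled-error variance
    # settles near 1/J = theta_true**2.
    print(N, round(np.mean(theta_hat - theta_true), 4),
          round(scaled_err.var(), 3), round(theta_true**2, 3))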
Lesson 12
Elements of Multivariate Gaussian Random Variables

INTRODUCTION

Gaussian random variables are important and widely used for at least two reasons. First, they often provide a model that is a reasonable approximation to observed random behavior. Second, if the random phenomenon that we observe at the macroscopic level is the superposition of an arbitrarily large number of independent random phenomena, which occur at the microscopic level, the macroscopic description is justifiably Gaussian.
Most (if not all) of the material in this lesson should be a review for a reader who has had a course in probability theory. We collect a wide range of facts about multivariate Gaussian random variables here, in one place, because they are often needed in the remaining lessons.

UNIVARIATE GAUSSIAN DENSITY FUNCTION

A random variable y is said to be distributed as the univariate Gaussian distribution with mean m_y and variance σ_y² [i.e., y ~ N(y; m_y, σ_y²)] if the density function of y is given as
p(y) = (2πσ_y²)^{-1/2} exp{ -(y - m_y)²/(2σ_y²) }   (12-1)
For notational simplicity throughout this chapter, we do not condition density functions on their parameters; this conditioning is understood. Density p(y) is the familiar bell-shaped curve, centered at y = m_y.

MULTIVARIATE GAUSSIAN DENSITY FUNCTION

Let y1, y2, ..., ym be random variables, and y = col(y1, y2, ..., ym). The function
p(y1, ..., ym) = p(y) = (2π)^{-m/2} |P_y|^{-1/2} exp{ -(1/2)(y - m_y)' P_y^{-1}(y - m_y) }   (12-2)
is said to be a multivariate (m-variate) Gaussian density function [i.e., y ~ N(y; m_y, P_y)]. In (12-2),
m_y = E{y}   (12-3)
and
P_y = E{(y - m_y)(y - m_y)'}   (12-4)
Note that, although we refer to p(y) as a density function, it is actually a joint density function between the random variables y1, y2, ..., ym. If P_y is not positive definite, it is more convenient to define the multivariate Gaussian distribution by its characteristic function. We will not need to do this.

JOINTLY GAUSSIAN RANDOM VECTORS

Let x and y individually be n- and m-dimensional Gaussian random vectors, i.e., x ~ N(x; m_x, P_x) and y ~ N(y; m_y, P_y). Let P_xy and P_yx denote the cross-covariance matrices between x and y, i.e.,
P_xy = E{(x - m_x)(y - m_y)'}   (12-5)
and
P_yx = E{(y - m_y)(x - m_x)'}   (12-6)
We are interested in the joint density between x and y, p(x, y). Vectors x and y are jointly Gaussian if
p(x, y) = p(z) = (2π)^{-(n+m)/2} |P_z|^{-1/2} exp{ -(1/2)(z - m_z)' P_z^{-1}(z - m_z) }   (12-7)
where
z = col(x, y)   (12-8)
m_z = col(m_x, m_y)   (12-9)
and
P_z = [P_x  P_xy; P_yx  P_y]   (12-10)
Note that if x and y are jointly Gaussian then they are marginally (i.e., individually) Gaussian. The converse is true if x and y are independent, but it is not necessarily true if they are not independent (Papoulis, 1965, pg. 184).
In order to evaluate p(x, y) in (12-7), we need |P_z| and P_z^{-1}. It is straightforward to compute |P_z| once the given values for P_x, P_y, P_xy, and P_yx are substituted into (12-10). It is often useful to be able to express the components of P_z^{-1} directly in terms of the components of P_z. It is a straightforward exercise in algebra (just form P_z P_z^{-1} = I and equate elements on both sides) to show that
P_z^{-1} = [A  B; B'  C]   (12-11)
where
A = (P_x - P_xy P_y^{-1} P_yx)^{-1} = P_x^{-1} + P_x^{-1} P_xy C P_yx P_x^{-1}   (12-12)
B = -A P_xy P_y^{-1} = -P_x^{-1} P_xy C   (12-13)
and
C = (P_y - P_yx P_x^{-1} P_xy)^{-1} = P_y^{-1} + P_y^{-1} P_yx A P_xy P_y^{-1}   (12-14)

THE CONDITIONAL DENSITY FUNCTION

One of the most important density functions we will be interested in is the conditional density function p(x|y). Recall, from probability theory (e.g., Papoulis, 1965), that
p(x|y) = p(x, y)/p(y)   (12-15)

Theorem 12-1. Let x and y be n- and m-dimensional vectors that are jointly Gaussian. Then
p(x|y) = [(2π)^n |Σ|]^{-1/2} exp{ -(1/2)(x - m)' Σ^{-1}(x - m) }   (12-16)
where
m = E{x|y} = m_x + P_xy P_y^{-1}(y - m_y)   (12-17)
and
Σ = A^{-1} = P_x - P_xy P_y^{-1} P_yx   (12-18)
This means that p(x|y) is also multivariate Gaussian with (conditional) mean m and covariance Σ.

Proof. From (12-15), (12-7), and (12-2), we find that
p(x|y) = [|P_y| / ((2π)^n |P_z|)]^{1/2} exp{ -(1/2)[ (z - m_z)' [A  B; B'  C] (z - m_z) - (y - m_y)' P_y^{-1}(y - m_y) ] }   (12-19)
Taking a closer look at the quadratic exponent, which we denote E(x, y), we find that
E(x, y) = (x - m_x)'A(x - m_x) + 2(x - m_x)'B(y - m_y) + (y - m_y)'(C - P_y^{-1})(y - m_y)
        = (x - m_x)'A(x - m_x) - 2(x - m_x)'A P_xy P_y^{-1}(y - m_y) + (y - m_y)' P_y^{-1} P_yx A P_xy P_y^{-1}(y - m_y)   (12-20)
In obtaining (12-20) we have used (12-13) and (12-14). We now recognize that (12-20) looks like a quadratic expression in x - m_x and P_xy P_y^{-1}(y - m_y), and express it in factored form, as
E(x, y) = [(x - m_x) - P_xy P_y^{-1}(y - m_y)]' A [(x - m_x) - P_xy P_y^{-1}(y - m_y)]   (12-21)
Defining m and Σ as in (12-17) and (12-18), we see that
E(x, y) = (x - m)' Σ^{-1}(x - m)   (12-22)
If we can show that |P_z|/|P_y| = |Σ|, then we will have shown that p(x|y) is given by (12-16), which will mean that p(x|y) is multivariate Gaussian with mean m and covariance Σ.
It is not at all obvious that |Σ| = |P_z|/|P_y|. We shall reexpress matrix P_z so that |P_z| can be determined by appealing to the following theorems (Graybill, 1961, pg. 6):
i. If V and G are n × n matrices, then |VG| = |V||G|.
ii. If M is a square matrix such that
M = [M11  M12; M21  M22]
where M11 and M22 are square matrices, and if M12 = 0 or M21 = 0, then |M| = |M11||M22|.
We now show that two matrices, L and N, can be found so that
P_z = [I  P_xy; 0  P_y][L  0; N  I]   (12-23)
Multiplying the two matrices on the right-hand side of (12-23), and equating the 1-1 and 2-1 components from both sides of the resulting equation, we find that
P_x = L + P_xy N   (12-24)
and
P_yx = P_y N   (12-25)
from which it follows that
N = P_y^{-1} P_yx   (12-26)
and
L = P_x - P_xy P_y^{-1} P_yx   (12-27)
From (12-23), we see that
|P_z| = |L||P_y|
or,
|L| = |P_z|/|P_y|   (12-28)
Comparing the equations for L and Σ, we find they are the same; thus, we have proven that
|Σ| = |P_z|/|P_y|   (12-29)
which completes the proof of this theorem. □
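The pieces of Theorem 12-1 are easy to check numerically. The following is a minimal sketch (not from the text); the dimensions, the synthetic joint covariance, and the use of NumPy are assumptions for illustration only. It forms Σ from (12-18), verifies the determinant identity (12-29) used in the proof, and evaluates the conditional mean (12-17) for one realization of y.

import numpy as np

rng = np.random.default_rng(1)

# Assumed toy dimensions (n = 2, m = 3) and an arbitrary valid joint covariance.
n, m = 2, 3
A = rng.standard_normal((n + m, n + m))
P_z = A @ A.T + (n + m) * np.eye(n + m)        # P_z = [P_x  P_xy; P_yx  P_y]
P_x, P_xy = P_z[:n, :n], P_z[:n, n:]
P_yx, P_y = P_z[n:, :n], P_z[n:, n:]

# Conditional covariance, Eq. (12-18): Sigma = P_x - P_xy P_y^{-1} P_yx
Sigma = P_x - P_xy @ np.linalg.solve(P_y, P_yx)

# Determinant identity from the proof, Eq. (12-29): |Sigma| = |P_z| / |P_y|
print(np.linalg.det(Sigma), np.linalg.det(P_z) / np.linalg.det(P_y))

# Conditional mean, Eq. (12-17), for one draw of y (m_x and m_y taken as zero here)
y = rng.standard_normal(m)
print(P_xy @ np.linalg.solve(P_y, y))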

PROPERTIES OF MULTIVARIATE GAUSSIAN RANDOM VARIABLES

From the preceding formulas for p(y), p(x, y) and p(x|y), we see that multivariate Gaussian probability density functions are completely characterized by their first two moments, i.e., their mean vector and covariance matrix. All other moments can be expressed in terms of their first two moments (see, e.g., Papoulis, 1965).
From probability theory, we also recall the following two important facts about Gaussian random variables:
1. Statistically independent Gaussian random variables are uncorrelated and vice-versa; thus, P_xy and P_yx are both zero matrices.
2. Linear (or affine) transformations on and linear (or affine) combinations of Gaussian random variables are themselves Gaussian random variables; thus, if x and y are jointly Gaussian, then z = Ax + By + c is also Gaussian. We refer to this property as the linearity property.

PROPERTIES OF CONDITIONAL MEAN

We learned, in Theorem 12-1, that
E{x|y} = m_x + P_xy P_y^{-1}(y - m_y)   (12-30)
Because E{x|y} depends on y, which is random, it is also random.

Theorem 12-2. When x and y are jointly Gaussian, E{x|y} is multivariate Gaussian, and is an affine combination of the elements of y.
Proof. That E{x|y} is Gaussian follows from the linearity property applied to (12-30). An affine transformation of y has the structure Ty + f. E{x|y} has this structure; thus, it is an affine transformation. Note that if m_x = 0 and m_y = 0, then E{x|y} is a linear transformation of y. □

Theorem 12-3. Let x, y, and z be n × 1, m × 1 and r × 1 jointly Gaussian random vectors. If y and z are statistically independent, then
E{x|y, z} = E{x|y} + E{x|z} - m_x   (12-31)
Proof. Let ζ = col(y, z); then
E{x|ζ} = m_x + P_xζ P_ζ^{-1}(ζ - m_ζ)   (12-32)
We leave it to the reader to show that
P_xζ = (P_xy  P_xz)   (12-33)
and
P_ζ = [P_y  0; 0  P_z]   (12-34)
(the off-diagonal elements in P_ζ are zero if y and z are statistically independent; this is also true if y and z are uncorrelated, because y and z are jointly Gaussian); thus,
E{x|ζ} = m_x + P_xy P_y^{-1}(y - m_y) + P_xz P_z^{-1}(z - m_z)
       = E{x|y} + E{x|z} - m_x  □

In our developments of recursive estimators, y and z (which will be associated with a partitioning of measurement vectors Z) are not necessarily independent. The following important generalization of Theorem 12-3 will be needed.

Theorem 12-4. Let x, y, and z be n × 1, m × 1 and r × 1 jointly Gaussian random vectors. If y and z are not necessarily statistically independent, then
E{x|y, z} = E{x|y, z̃}   (12-35)
where
z̃ = z - E{z|y}   (12-36)
so that
E{x|y, z} = E{x|y} + E{x|z̃} - m_x   (12-37)
Proof (Mendel, 1983b, pg. 53). The proof proceeds in two stages: (a) assume (12-35) is true and demonstrate the truth of (12-37), and (b) demonstrate the truth of (12-35).
a. If we can show that y and z̃ are statistically independent, then (12-37) follows from Theorem 12-3. For Gaussian random vectors, however, uncorrelatedness implies independence.
To begin, we assert that y and z̃ are jointly Gaussian, because z̃ = z - E{z|y} = z - m_z - P_zy P_y^{-1}(y - m_y) depends on y and z, which are jointly Gaussian.
Next, we show that z̃ is zero mean. This follows from the calculation
m_z̃ = E{z - E{z|y}} = E{z} - E{E{z|y}}   (12-38)
where the outer expectation in the second term on the right-hand side of (12-38) is with respect to y. From probability theory (Papoulis, 1965, pg. 208), E{z} can be expressed as
E{z} = E{E{z|y}}   (12-39)
From (12-38) and (12-39), we see that m_z̃ = 0.
Finally, we show that y and z̃ are uncorrelated. This follows from the calculation
E{(y - m_y)(z̃ - m_z̃)'} = E{(y - m_y)z̃'} = E{y z̃'}
  = E{yz'} - E{y E{z'|y}}   (12-40)
  = E{yz'} - E{yz'} = 0
b. A detailed proof is given by Meditch (1969, pp. 101-102). The idea is to (1) compute E{x|y, z} in expanded form, (2) compute E{x|y, z̃} in expanded form, using z̃ given in (12-36), and (3) compare the results from (1) and (2) to prove the truth of (12-35). □

Equation (12-35) is very important. It states that, when z and y are dependent, conditioning on z can always be replaced by conditioning on another Gaussian random vector z̃, where z̃ and y are statistically independent.
The results in Theorems 12-3 and 12-4 depend on all random vectors being jointly Gaussian. Very similar results, which are distribution free, are described in Problem 13-4; however, these results are restricted in yet another way.
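Theorem 12-4 is also easy to check numerically. The following minimal sketch (not from the text; the dimensions, the arbitrary valid joint covariance, and the zero means are all assumptions) conditions x on col(y, z) directly and then via y and z̃ = z - E{z|y}, and confirms that the two computations of the conditional mean agree, as (12-35) and (12-37) state.

import numpy as np

rng = np.random.default_rng(2)

# Assumed sizes for x, y, z (2, 2, 2) and an arbitrary valid joint covariance.
A = rng.standard_normal((6, 6))
P = A @ A.T + 6 * np.eye(6)              # covariance of col(x, y, z)
ix, iy, iz = slice(0, 2), slice(2, 4), slice(4, 6)

def cond_mean(P, i_out, i_in, v):
    # E{out | in} for zero-mean jointly Gaussian vectors, Eq. (12-17).
    return P[i_out, i_in] @ np.linalg.solve(P[i_in, i_in], v)

y = rng.standard_normal(2)
z = rng.standard_normal(2)

# Direct conditioning of x on col(y, z).
iyz = slice(2, 6)
lhs = cond_mean(P, ix, iyz, np.concatenate([y, z]))

# Theorem 12-4: replace z by z_tilde = z - E{z|y}, which is uncorrelated with y.
z_tilde = z - cond_mean(P, iz, iy, y)
P_zt = P[iz, iz] - P[iz, iy] @ np.linalg.solve(P[iy, iy], P[iy, iz])    # cov of z_tilde
P_x_zt = P[ix, iz] - P[ix, iy] @ np.linalg.solve(P[iy, iy], P[iy, iz])  # cross-cov of x, z_tilde
rhs = cond_mean(P, ix, iy, y) + P_x_zt @ np.linalg.solve(P_zt, z_tilde)

print(lhs, rhs)   # the two decompositions agree, Eq. (12-37)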

PROBLEMS

12-1. Fill in all of the details required to prove part b of Theorem 12-4.
12-2. Let x, y and z be jointly distributed random vectors; c and h fixed constants; and g(·) a scalar-valued function. Assume E{x}, E{z}, and E{g(y)x} exist. Prove the following useful properties of conditional expectation:
(a) E{x|y} = E{x} if x and y are independent
(b) E{g(y)x|y} = g(y)E{x|y}
(c) E{c|y} = c
(d) E{g(y)|y} = g(y)
(e) E{cx + hz|y} = cE{x|y} + hE{z|y}
(f) E{x} = E{E{x|y}}, where the outer expectation is with respect to y
(g) E{g(y)x} = E{g(y)E{x|y}}, where the outer expectation is with respect to y.
12-3. Prove that the cross-covariance matrix of two uncorrelated random vectors is zero.
Lesson 13
Estimation of Random Parameters: General Results

INTRODUCTION

In Lesson 2 we showed that state estimation and deconvolution can be viewed as problems in which we are interested in estimating a vector of random parameters. For us, state estimation and deconvolution serve as the primary motivation for studying methods for estimating random parameters; however, the statistics literature is filled with other applications for these methods.
We now view θ as an n × 1 vector of random unknown parameters. The information available to us are measurements z(1), z(2), ..., z(k), which are assumed to depend upon θ. In this lesson, we do not begin by assuming a specific structural dependency between z(i) and θ. This is quite different than what we did in WLSE and BLUE. Those methods were studied for the linear model z(i) = H(i)θ + v(i), and closed-form solutions for θ̂_WLS(k) and θ̂_BLU(k) could not have been obtained had we not begun by assuming the linear model. We shall study the estimation of random θ for the linear model in Lesson 14.
In this lesson we examine two methods for estimating a vector of random parameters. The first method is based upon minimizing the mean-squared error between θ and θ̂(k). The resulting estimator is called a mean-squared estimator, and is denoted θ̂_MS(k). The second method is based upon maximizing an unconditional likelihood function, one that not only requires knowledge of p(Z|θ) but also of p(θ). The resulting estimator is called a maximum a posteriori estimator, and is denoted θ̂_MAP(k).

MEAN-SQUARED ESTIMATION

Many different measures of parameter estimation error can be minimized in order to obtain an estimate of θ [see Jazwinski (1970), Meditch (1969), Van Trees (1968), and Sorenson (1980), for example]; but, by far the most widely studied measure is the mean-squared error.

Objective Function and Problem Statement

Given measurements z(1), ..., z(k), we shall determine an estimator of θ, namely
θ̂_MS(k) = φ[z(i), i = 1, 2, ..., k]   (13-1)
such that the mean-squared error
J[θ̂_MS(k)] = E{θ̃'_MS(k) θ̃_MS(k)}   (13-2)
is minimized. In (13-2), θ̃_MS(k) = θ - θ̂_MS(k).
The right-hand side of (13-1) means that we have some arbitrary and as yet unknown function of all the measurements. The n components of θ̂_MS(k) may each depend differently on the measurements. The function φ[z(i), i = 1, 2, ..., k] may be nonlinear or linear. Its exact structure will be determined by minimizing J[θ̂_MS(k)]. If perchance θ̂_MS(k) is a linear estimator, then
θ̂_MS(k) = Σ_{i=1}^{k} A(i) z(i)   (13-3)
We now show that the notion of conditional expectation is central to the calculation of θ̂_MS(k). As usual, we let
Z(k) = col[z(k), z(k - 1), ..., z(1)]   (13-4)
The underlying random quantities in our estimation problem are θ and Z(k). We assume that their joint density function p[θ, Z(k)] exists, so that
J[θ̂_MS(k)] = ∫∫ θ̃'_MS(k) θ̃_MS(k) p[θ, Z(k)] dθ dZ(k)   (13-5)
where dθ = dθ1 dθ2 ... dθn, dZ(k) = dz1(1) ... dz1(k) dz2(1) ... dz2(k) ... dzm(1) ... dzm(k), and there are n + km integrals. Using the fact that
p[θ, Z(k)] = p[θ|Z(k)] p[Z(k)]   (13-6)
we rewrite (13-5) as
J[θ̂_MS(k)] = ∫ { ∫ θ̃'_MS(k) θ̃_MS(k) p[θ|Z(k)] dθ } p[Z(k)] dZ(k)   (13-7)

From this equation we see that minimizing the conditional expectation E{θ̃'_MS(k)θ̃_MS(k)|Z(k)} with respect to θ̂_MS(k) is equivalent to our original objective of minimizing the total expectation E{θ̃'_MS(k)θ̃_MS(k)}. Note that the integrals on the right-hand side of (13-7) remove the dependency of the integrand on the data Z(k).
In summary, we have the following mean-squared estimation problem: Given the measurements z(1), z(2), ..., z(k), determine an estimator of θ, namely,
θ̂_MS(k) = φ[z(i), i = 1, 2, ..., k]
such that the conditional mean-squared error
J1[θ̂_MS(k)] = E{θ̃'_MS(k) θ̃_MS(k) | Z(k)}   (13-8)
is minimized.

Derivation of Estimator

The solution to the mean-squared estimation problem is given in Theorem 13-1, which is known as the Fundamental Theorem of Estimation Theory.

Theorem 13-1. The estimator that minimizes the mean-squared error is
θ̂_MS(k) = E{θ|Z(k)}   (13-9)
Proof (Mendel, 1983b). In this proof we omit all functional dependences on k, for notational simplicity. Our approach is to substitute θ̃_MS(k) = θ - θ̂_MS(k) into (13-8) and to complete the square, as follows:
J1[θ̂_MS(k)] = E{(θ - θ̂_MS)'(θ - θ̂_MS)|Z}
  = E{θ'θ - θ'θ̂_MS - θ̂'_MS θ + θ̂'_MS θ̂_MS|Z}   (13-10)
  = E{θ'θ|Z} - E{θ'|Z}θ̂_MS - θ̂'_MS E{θ|Z} + θ̂'_MS θ̂_MS
  = E{θ'θ|Z} + [θ̂_MS - E{θ|Z}]'[θ̂_MS - E{θ|Z}] - E{θ'|Z}E{θ|Z}
To obtain the third line we used the fact that θ̂_MS, by definition, is a function of Z; hence, E{θ̂_MS|Z} = θ̂_MS. The first and last terms in (13-10) do not depend on θ̂_MS; hence, the smallest value of J1[θ̂_MS(k)] is obviously attained by setting the bracketed terms equal to zero. This means that θ̂_MS must be chosen as in (13-9). □
Let J1*[θ̂_MS(k)] denote the minimum value of J1[θ̂_MS(k)]. We see, from (13-10) and (13-9), that
J1*[θ̂_MS(k)] = E{θ'θ|Z} - θ̂'_MS(k)θ̂_MS(k)   (13-11)
As it stands, (13-9) is not terribly useful for computing θ̂_MS(k). In general, we must first compute p[θ|Z(k)] and then perform the requisite number of integrations of θ p[θ|Z(k)] to obtain θ̂_MS(k). In the special but important case when θ and Z(k) are jointly Gaussian, we have a very important and practical corollary to Theorem 13-1.

Corollary 13-1. When θ and Z(k) are jointly Gaussian, the estimator that minimizes the mean-squared error is
θ̂_MS(k) = m_θ + P_θZ(k) P_Z^{-1}(k)[Z(k) - m_Z(k)]   (13-12)
Proof. When θ and Z(k) are jointly Gaussian then E{θ|Z(k)} can be evaluated using (12-17) of Lesson 12. Doing this we obtain (13-12). □
Corollary 13-1 gives us an explicit structure for θ̂_MS(k). We see that θ̂_MS(k) is an affine transformation of Z(k). If m_θ = 0 and m_Z(k) = 0, then θ̂_MS(k) is a linear transformation of Z(k).
In order to compute θ̂_MS(k) using (13-12), we must know m_θ and m_Z(k), and we must first compute P_θZ(k) and P_Z(k). We perform these computations in Lesson 14 for the linear model, Z(k) = H(k)θ + V(k).
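A minimal numerical sketch of Corollary 13-1 follows (not from the text; all dimensions and statistics below are assumed, and NumPy is used for the linear algebra): it draws a jointly Gaussian (θ, Z), forms θ̂_MS from (13-12), and confirms by Monte Carlo that the estimator is unbiased and that its error covariance matches P_θ - P_θZ P_Z^{-1} P_Zθ.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: theta is 2 x 1, Z is 3 x 1, jointly Gaussian.
m_theta = np.array([1.0, -2.0])
m_Z = np.array([0.5, 0.0, 1.5])
A = rng.standard_normal((5, 5))
P_joint = A @ A.T + 5 * np.eye(5)          # covariance of col(theta, Z)
P_theta = P_joint[:2, :2]
P_thetaZ = P_joint[:2, 2:]
P_Z = P_joint[2:, 2:]

def theta_hat_MS(Z):
    # Mean-squared (conditional-mean) estimate, Eq. (13-12).
    return m_theta + P_thetaZ @ np.linalg.solve(P_Z, Z - m_Z)

# Monte Carlo check: draw (theta, Z) jointly, estimate theta from Z.
L = np.linalg.cholesky(P_joint)
samples = np.concatenate([m_theta, m_Z]) + (L @ rng.standard_normal((5, 5000))).T
errors = samples[:, :2] - np.array([theta_hat_MS(s[2:]) for s in samples])
print("mean error  :", errors.mean(axis=0))    # ~ 0 (unbiased)
print("error covar :", np.cov(errors.T))
print("theory covar:", P_theta - P_thetaZ @ np.linalg.solve(P_Z, P_thetaZ.T))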
Corollary 13-2. Suppose θ and Z(k) are not necessarily jointly Gaussian, and that we know m_θ, m_Z(k), P_Z(k) and P_θZ(k). In this case, the estimator that is constrained to be an affine transformation of Z(k), and that minimizes the mean-squared error, is also given by (13-12).
Proof. This corollary can be proved in a number of ways. A direct proof begins by assuming that θ̂_MS(k) = A(k)Z(k) + b(k) and choosing A(k) and b(k) so that θ̂_MS(k) is an unbiased estimator of θ and E{θ̃'_MS(k)θ̃_MS(k)} = trace E{θ̃_MS(k)θ̃'_MS(k)} is minimized. We leave the details of this direct proof to the reader.
A less direct proof is based upon the following Gedanken experiment. Using known first and second moments of θ and Z(k), we can conceptualize unique Gaussian random vectors that have these same first and second moments. For these statistically-equivalent (through second-order moments) Gaussian vectors, we know, from Corollary 13-1, that the mean-squared estimator is given by the affine transformation of Z(k) in (13-12). □

Corollaries 13-1 and 13-2, as well as Theorem 13-1, provide us with the answer to the following important question: When is the linear (affine) mean-squared estimator the same as the mean-squared estimator? The answer is, when θ and Z(k) are jointly Gaussian. If θ and Z(k) are not jointly Gaussian, then θ̂_MS(k) = E{θ|Z(k)}, which, in general, is a nonlinear function of measurements Z(k), i.e., it is a nonlinear estimator.

Corollary 13-3 (Orthogonality Principle). Suppose f[Z(k)] is any function of the data Z(k). Then the error in the mean-squared estimator is orthogonal to f[Z(k)] in the sense that
E{[θ - θ̂_MS(k)] f'[Z(k)]} = 0   (13-13)
Proof (Mendel, 1983b, pp. 46-47). We use the following result from Becausevariancesare aIwayspositive the minimum value of J{&,(k)] must be
probability theory (Papoulis, 1945;see,also: Problem 12-2(g)). Let CKand @be achievedwhen each of the n variancesis minimized; hence, our mean-squared
jointly distributed random vectors and g (fi) be a scalar-valuedfunction; then estimator is equivalent to an MVE. [7

Property 3 (Linearit)). 6&k) in (13-12) is a linear (i.e., affine)


where the outer expectation on the right-hand side is with respect to f!. We estimator.
proceed as follows (again, omitti ng the argument k): Proof. This is obvious from the form of (13-12). 0

Linearity of i&k) permits us to infer the foIlowing very important


property about both i,,(k) and &s(k).
where we have used the facts that bMSis no longer random when !I!,is specified
and E{@%~= iMS. q Property 4 (Gaussian). Both 6&k) and 6&k> are multivariate Gaus-
Sian.
A frequently encountered specialcaseof (13-13) occurswhen f [2(k)] =
d&k); then Corollary 13-13can be written as Proof. We use the linearity property of jointly Gaussian random vectors
stated in Lesson 12. Estimator 6&k) in (13-12)is an affine transformation of
~oLs~~~k@~l = 0 (13-S) Gaussian random vector Z(k); hence, &,&k) is multivariate Gaussian. Esti-
mation error 8&k) = 9 - 6&k) is an affine transformation of jointly Gaus-
PropertIes of Mean-Squared Estimates When 9 and 3%) sian vectors 8 and 5!.(k);hence, 6&k) is also Gaussian. 0
are Gaussian
Estimate 6&k) in (13-12) is itself random, becausemeasurementsZ(k)
In this section we present a collection of important and useful properties are random. To characterize it completely in a statistical sense, we must
associatedwith 9&k) for the casewhen 0 and g(k) are jointly Gaussian.In specify its probability density function, Generally, this is very difficult to do,
this case6&k) is given by (1342). and often requires that the probability density function of 6&) be approxi-
mated using many moments (in theory an infinite number are required). In the
Property 1 (Unbiasedness). The mean-squared estimator, &s(k) in Gaussian case, we have just le_arnedthat the structure of the probability
(13-12), is unbiased. density function for &.&k) [and 6,,(k)] is k nown. Additionally, we know that
a Gaussiandensity function is completely specified by exactly two moments,
?vof. Taking the expectedvalue of (13-12),we seethat E&&k)} = Q; its mean and covariance; thus, tremendous simplifications occur when 6 and
thus, O&k) is an unbiasedestimator of 0. Cl Z(k) are jointly Gaussian.

Property 2 (Minimum Variance). Dispersion about the mean value of Property 5 (Uniqueness). Mean-squaredestimator 6&k), in (13-12),is
t&&k) is measured by the error variance c$ Ms(k), where i = 1, 2, . . . , n. An unique.
estimator that has the smallest error varianceisa minimum-variance estimator
(an MVE). Th e mean-squaredestimator in (13-12)is an MVE. The proof of this property is not central to our developments; hence,it is
Proof. From Property 1 and the definition of error variance, we seethat omitted.

.Ms(k) = E{@&(k)}
c-r;8 i = 1,2,. . . , II (13-X) Generahations
Our mean-squaredestimator was obtained by minimizing .J[&&k)] in (13-2),
which can now be expressedas Many of the results presented in this section are applicable to objective
functions other than the mean-squaredobjective function in (13-2). See Me-
(13-17) ditch (1969) for discussionson a wide number of objective functions that lead
to E{elS(k)> as the optimal estimator of 8.
j.&, which is unknown to us in this example, is transferred from the first random
MAXIMUM A POSTERIOR1 ESTIMATION number generator to the second one before we can obtain the given random sample

Recall Bat.ess
. rule (Papoulis, 1965,pg. 39): Using the facts that

p (z (i)lp) = (2m+ exp { -$ [z(i) - @j/d} (13-20)


in which density functionp@j%(k)) 1sk nown as the a posteriori (or posterior)
l

conditional density function, andp (0) is the prior probability density function p(j~) = (2m~)+* exp {+w} (13-21)
for 8. Observe that p(f$!E(k)) is related to likelihood function Z(ele(k)}, be-
cause Z{C$!Z@)}ap (%(k)le). Additionally, becausep (%(k)) does not depend
on 8,
(13-22)
(13-19)
In maximum a posteriori (MAP) estimation, values of 8 are found that
maximize p (e/%(k))in (13-19); such estimates are known as MAP estimates,
and will be denoted as &&j.
If 01, 62,. . . ) & are uniformly distributed, then p (f$E(k)) ap (Zf(k)lO),
and the MAP estimator of 8 equals the ML estimator of 8. Generally, MAP (13-23)
estimatesare quite different from ML estimates. For example, the invariance
property of MLEs usually doesnot carry over to MAP estimates.One reason Taking the logarithm of (13-23) and neglecting the terms which do not depend upon p,
for this can be seenfrom (13-19).Suppose,for example, that + = g(8) and we we obtain
want to determine C$ MAPby first computing 6MAP.Becausep (0) dependson the
Jacobian matrix of g-l(+), 4 MAP# @MAP).Kashyap and Rao (1976, pg. 137) LMAP(#E(RJ)) = -i j? {[z(i) - pl*1(+3- i CL214 (13-24)
1=1
note, the two estimatesare usually asymptotically identical to one another
Setting ~LMAP/@ = 0, and solving for bMA&V), we find that
since in the large samplecasethe knowledge of the observationsswampsthat
of theprior distribution. For additional discussionson the asymptotic proper-
bMAP(R? (13-25)
ties of MAP estimators, seeZacks (1971).
Quantity p (elZ(k)) in (13-19) is sometimescalled an unconditional Zike- Next, we compare bMAP(lV)and b&v), where [see (19) from Lesson 111
Zihoodfunction, becausethe random nature of 8 has been accounted for by
p (0). Density p (%(k)le>is then called a conditional likelihood function (Nahi, l f z(i) (13-24)
bdRJ) = jQ,
1969). I= 1

Obtaining a MAP estimate involves specifying both p (%(#3) and p (0)


In general GMAp(N) # &,&V). If, however, no a priori information about p is avail-
and finding those valuesof 8 that maximize p (Ol%(k)),or In p @l%(k)).Gener- able, then we let &+a, in which case fiMAp(IV)= fi&V). Observe, also that, as
ally speaking, mathematical programming must be used to compute &&&k). N+ 00cMAP(lV)= bML(1V),which implies (Sorenson, 1980) that the influence of the
When s(k) is related to 8 by our linear model, %(k) = X(li)B + V(k), then it prior knowledge about p [i.e., p - A@; 0, cri)] diminishes as the number of mea-
may be possible to obtain &A#) in closed form. We examinethis situation in surements increase. D
some detail in Lesson 14.
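The MAP-versus-ML comparison for the Gaussian-mean example can be checked numerically. The following is a minimal sketch (not from the text): it uses the standard posterior-mean formula for a zero-mean Gaussian prior on μ with prior variance σ_μ² and measurement variance σ_z², and all numerical values are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(3)

# Assumed numbers: true mean, measurement variance sz2, prior variance smu2.
mu_true, sz2, smu2 = 2.0, 4.0, 1.0

for N in (5, 50, 5000):
    z = mu_true + np.sqrt(sz2) * rng.standard_normal(N)
    mu_ml = z.mean()
    # Standard posterior-mean (MAP) formula for a zero-mean Gaussian prior:
    mu_map = (N * smu2 / (N * smu2 + sz2)) * mu_ml
    print(N, round(mu_ml, 3), round(mu_map, 3))

# As N grows, the shrinkage factor N*smu2/(N*smu2 + sz2) tends to 1, so the MAP
# estimate approaches the ML estimate; letting smu2 grow without bound (no prior
# information) has the same effect, consistent with the discussion above.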
Example 13-l Theorem 13-2. If Z(k) and 8 are jointly Gaussian, then i&&k) =
This is a continuation of Example 11-l. We observe a random sample {I (l), t (2), . . . , kdk)~
z(N)} at the output of a Gaussianrandom number generator, i.e., z(i) - N(z (i>; p, Proof. If SC(k)and 0 are jointly Gaussian, then (see Theorem 12-1)
CT). Now, however, p is a random variable with prior distribution N(p; 0, 0:). Both a:
and oc are assumedknown, and we wish to determine the MAP estimator of p. P @lw))
We canview this random number generator as a cascadeof two random number 1 1
generators. The first is characterizedby N(p; 0, o:> and provides at its output a single
i
= ~(271.)lZ(k)l exp -2
[e - m(k)]S!?(k)[8 - mwl] (13-27)
realization for p, say pR. The secondis characterized by N@(i); pR, 0:). Observe that
Estimation of Random Parameters: General Results Lesson 13
Lesson 13 Problems If7

where S(k) is uniquely equal to the linear (i.e., affine), unbiased mean-squared
III(~) = E@/%(k)] (13-28) estimator of 0,4(k).
(c) For random vectors x, y, z, where y and z are uncorrelated. prove that
&&k) is found by maximizingJJ(@I!@)]~or equivalentiy by minimizing the
k(x/y.z} = E{xly) + IQXIZ) - m.
argument of the exponential in (13-27). The minimum value of
[O - m(k)]LT!F(k)[6 - III(~)] is zero, and this occurswhen (d) For random vectors x, y, z, where y and z are correlated, prove that
ii{xly3z} = i{x/y.Z}
where
z = z - i{zly}
so that
&ly,z} = k{xly} + i&Ii} - m,

The result in Theorems13-2is true regardlessof the nature of the model


Parts (c) and (d) show that the results given in Theorems 12-3 and 12-4 are
relating 0 to Z(k). Of course, in order to use it, we must first establish that
distribution free within the class of linear (i.e., affine), unbiased, mean-
%(A$ and 0 are jointly Gaussian. Except for the linear model, which we squared estimators.
examine in Lesson 14, this is very diffkuh to do. Consider the linear model z(k) = 28 + n(k), where

PROBLEMS

13-l. Prove that 6&k), given in (13-12). is unique.


13-2. ProveCorollary 13-Zby meansof a direct proof.
13-3. Let 0 and %(N) be zero mean n x 1 and IV X 1 random vectors, respectively,
to, otherwise
A random sample of N measurements is available. Explain how to find the ML
with known second-orderstatistics, P8, P5, Pm, and Pz@+ View Z(N) as a vector and MAP estimators of 8, and be sure to list all of the assumptions needed to
of measurements. It is desired to determine a linear estimator of 6, obtain a solution (they are not all given).

where KL(v is an n x N matrix that is chosen to minimize the mean-squared


error Ei b - & Q+fE(NLl[fl - &. (NEW)]}.
[a) Show that the gain matrix, which minimizes the mean-squared error, is
KL(w = E~e~(~~~E~~(~~(~}]- = Pa PG.
(b) Shoy that the covariance matrix, P(N), of the estimation error, 6(N) =
9 - e(N), is

(c) Relate the resultsobtained in this problem to those in Corollary 13-2.


13-4, For random vectors 0 and Z(k), the hear projection, 9*(k), of 0 on a Hilbert
space spanned by s(k) is defmed as 9*(k) = a + B%(k), where E{0*(k)} = E{O]
end E{[O - O*(k)B(k)} = U. We denote the linear projection, 0*(k), as
Jo MW
(a} rrove that the linear (i*e., affine), unbiased mean-squaredestimator of 9,
6(k), is the linear projection of 0 on 9(k).
tb} Prove that the linear projection, O*(k), of 9 on the Hilbert spacespanned by
Lesson 14 It is straightforward, using (14-1) and (14-2) to show that


m%(k)= W)m* (14-4)
P%(k)= X(k)P,X(k) + a(k) (14-5)
Estimation of and
Pa(k) = P&?(k) (14-6)

Random Parameters: consequently,

h(k) = me + P&W(k)[%e(k)PJt(k) + S(k)]-[Z(k) - X(k)m,l (14-7)


The Linear , Observe that &&k) dependson all the given information, namely, %(k), X(k),
meand Pg.
and Gaussian Model Next, we compute the error-covariance matrix, P&k), that is associated
with i&&k) in (14-7). Because6&k) is unbiased, we know that E{&,(k)} = 0
for all k; thus,
P&k) = E{&,&)&(k)) (14-8)
From (14.7), and the fact that 8,,(k) = 8 - 6&k), we seethat
INTRODUCTION bMs(k) = (0 - m,) - P,X(XP,X + a)-(% - Xrn& (14-9)
From (14-9), (14-8) and (14-l) it is a straightforward exerciseto show that
In this lessonwe begin with the linear model
9!(k) = X(k)6 + V(k) (14-1) P&k) = Ps - PJt(k)[X(k)PJt(k) + St(k)]-X(k)P, (14-10)
where 6 is an y2x 1 vector of random unknown parameters,X(k) is deter-
ministic, and V(k) is white Gaussian noise with known covariance matrix Applying Matrix Inversion Lemma 4-l to (14-lo), we obtain the following
CR(k). We also assumethat 8 is multivariate Gaussianwith known mean, me, alternate formula for P&k),
and covariance, PO,i.e.,
(14-2) P&k) = [Pi + X(k)%-(k)X(k)]- (14-11)
0 - WC me, Pe)
and, that 8 and V(k) are Gaussian and mutually uncorrelated. Our main Next, we expressh&k) as an explicit function of P&k). To do this, we
objectives are to compute 6,,(k) and 6MAP(k) for this linear Gaussian model, note that
and to see how these estimators are related. We shall also illustrate many of
our results for the deconvolution and state estimation exampleswhich were PJe(XPJe + %)- = P$e(SePJe + %)-
described in Lesson2. (ZePJe + 9k - XPJe)W
= P,X[I - (XPJe + %)-%P&W]W
(14-12)
= [Pe - PJe(XP,X + %)-%P,]WW
MEAN-SQUARED ESTIMATOR
= P&!eW
Because 8 and V(k) are Gaussianand mutually uncorrelated,they are jointly hence,
Gaussian. Consequently,Z(k) and 0 are also jointlywGaussian;thus (Corollarvd I I
13-l), b&k) = me + PMs(k)X(k)W1(k)[%(k) - X(k)m] (14-13)
&s(k) = me + ~e@)P,(k)~~(k) - m&l (14-3)


Theorem lb 1. Although bMSis a mean-squared estimator, so that it enjoys all of the properties
of such an estimator (e.g., unbiased?minimum-variance. etc.), bMSis not a consistent
estimator of )I. Consistency is a large sample property of an estimator: however. as N
Proof. Set Pi = 0 in (14-ll), to see that increases,the dimension of p increases. becausep is Iv X 1. Consequently, we cannot
prove consistency of kMS(recall that, in all other problems, 9 is n x 1, where n is data
P&k) = [x(k)W(k)x(k)J-l (14-H) independent; in these problems we can study consistencyof 6).
Equations (14-18) and (14-23) are not very practical for actually computing kMSr
and, therefore,
becauseboth require the inversion of an N x N matrix, and N can become quite large
(14-16) (it equals the number of measurements). We return to a more practical way for
computing gMSin Lesson 22. 0
Compare (14-16) and (9-ZZ), to concludethat &IS(~) = &(Q q
One of the most startIing aspects of Theorem 14-I is that it showsus that
I3LL-J estirnathn upplies to random parameters as we!! us tu deterministic
BEST LINEAR UNBIASED ESTIMATION, REVfSITED
parameters. We return to a reexamination of BLUE below.
What does the condition Pi1 = 0, given in Theorem 14-1, mean? Sup- In Lesson 9 we derived the BLUE of 9 for the linear model (14-l), under the
pose, for example, that the elements of 0 are uncorrelated; then, POis a following assumptions about this model:
diagonal matrix! with diagonal elements & When all of these variancesare
very large, then Pi* = 0. A large variancefor 8,meanswe have no idea where
1. 8 is a deterministic but unknown vector of parameters,
19~
is located about its mean value.
2. X(k) is deterministic, and
Example 14-l [Minimum-Vkance Deconvolution) 3. V(k) is zero-mean noise with covariancematrix a(k).
In Example 2-6 we showed that, for the application of deconvolution, our linear modei
is We assumedthat 6&k) = F(k)Z(k) and chose FBLU(k)so that iBLv(k) is an
qlv) = X(N - l)p + qly) (14-17) unbiasedestimator of 9, and the error variance for each one of the n elements
We shall assumethat p and V(N) are jointly Gaussian, and, that rn@= 0 and rn=+-
= 0; of 9 is minimized. The reader should return to the derivation of 6&k) to see
hence, rn%= 0, Additionally, we assumethat cove] = pI. From (l4-7), we deter- that the assumption 0 is deterministic is never needed, either m the deri-
mine the following formula for bMS(N), vation of the unbiasednessconstraint [seethe proof of Theorem 6-1, in which
Equation (6-9) becomes [I - F(k)X(k)]E(9) = 0, if 8 is random], or in the
~&v) = FJe(N - l)[X(N - l)PJe(N - 1) + p1]-?qN) (14-18)
derivation of Ji[fj, hi) in Equation (9-14) (due to some remarkable cancel-
Recall, from Example 2-6, that when g(k) is described by the product model lations); thus, 9,Lu(k), given in (g-22), is applicable to random us well us
A4 = q W W, then deterministic parameters in our linear model (14-I); an<, because the BLUE of
P = QG (14-19) 9 is the special case of the WISE when W(k) = 91-(k), B-(k), given in (3-lo),
is also applicable to random as well as deterministic parameters in our linear
where
model.
(14-20) Theorem 14-I relates bMS(k) and i,,,(k) under some very stringent con-
and ditions that are needed in order to remove the dependence of iMs on the a
r = col(r(l), r(2), . . . , r(N)) (14-21) priori statistical information about 9 (i.e.. m, and PO),becausethis information
was never used in the derivation of &&k).
In the product model, r(k) is white Gaussian noise with variance OF, and q (k) is a
Next, we derive a different BLUE of 9, one that incorporates the a priori
Bernoulli sequence. 0bviously, if we know Qq then p is Gaussian, in which case
statistical information about 0. To do this (Sorenson, 1980, pg. 210), we treat
m as an additional measurement which will be augmented to Z(k). Our
where we have used the fact that Qi = Qq, becauseq(k) = 0 or 1, When known, additional measurement equation is obtained by adding and subtracting 8 in
(14-18) becomes the identity me = Q, i.e.,
m, = 9 + (me - 6) (14-24)

Quantity me - 6 is now treated as zero-meannoise with covariancematrix PO. To conclude this section, we note that the weighted-least squares objec-
Our augmentedlinear model is tive function that is associatedwith 6,,(k) is
J,[&(k)] = %;(k)%;l(k)%,(k)
fw) = (Ep)() + (lw) (14-25) (14-35)
= (me - O)P;(me - 6) + %(k)%-(k)%(k)
\ / L, L J
The first term in (14-35) contains all the a priori information about 8. Quantity
%a xc7 Yl me - 8 is treated as the difference between measurement me and its noise-
which can be written, as free model, 6.
f&(k) = K(k)6 + X(k) (14-26)
where
3,(k),%e,
(k),and V0(k) are defined in (14-25).Additionally, MAXIMUM A POSTERIOR1 ESTIMATOR

(14-27) In order to determine 6MAP(k) for the linear model in (14-1) we first need to
E&z(k)=C(k)} 2 R(k) = (,%)
determinep @l%(k)). Using the facts that 8 - N(0; me, PO)andV(k) - N(V(k);
0, a(k)), it follows that
We now treat (14-26) as the starting point for derivation of a BLUE of 8,
which we denote 6:&k). Obviously, 1 1
P 60 = vm exp { - 2 (0 - m&W(e - me)I (14-36)
6&,(k) = [%e,(k)~,(k)~,(k)]-~,,(k)~,(k)~,(k) (14-28)

Theorem 14-2. For the linear Gaussianmodel, when X(k) is deter-


ministic it is alwaystrue that exp { - $ [S(k) - %e(k)e]%-(k)[Z(k) - X(k)B]} (14-37)
&4S(k)= &w(k) (14-29)
hence,
Proof. To begin, substitute the definitions of %e,,a,, and %, into (14-28),
and multiply out all the partitioned matrices, to show that mJPi(B - me) - i (z - %ee)w~(a - xe) (14-38)
i,,,(k) = (Pi1 + X%%e)-(P~me + %e%?E) (14-30) To find &&k) we must maximize In p (e[%) in (14-38). Note that to find
6&(k) we had to maximize &[$(k)] in (14-35); but,
From (14.11), we recognize (Pi1 + X%?X)-l as PMs; hence,
In P WV a -J@(k)] (14-39)
6&(k) = PMs(k)[P;me + X(k)9P(k)ZE(k)] (14-31)
This means that maximizing in p(O/fE) leads to the same value of 6 as does
Next, we rewrite 6&k) in (14-13),as minimizing J,[&(k)]. We have, therefore, shown that &,,(k) = &&k).
6&k) = [I - PMs%eW%]me + PMS%er%-lZ (14-32)
Theorem 14-3. For the linear Gaussian model, when X(k) is deter-
however, ministic it is always true that
(14-33) L&) = &w(k) El (14-40)
where we have again used (14-11). Substitute (14-33) into (14-32) to see that Combining the results in Theorems 14-2 and 14-3, we have the very
&k) = PMs(k)[P,m, + X(k)%-(k)%(k)] (14-34) important result, that for the linear Gaussian model
hence, (14-29)follows when we compare (14-31)and (14-34). q k&) = &b,(k) = bw(k) (14-41)

Put another way, for the linear Gaussian model. all roads lead to the same
and
estimator.
Of course, the fact that &&k) = &&c) should not come as any sur- P, = E{xx/q} (14-51)
prise, becausewe already established it (in a model-free environment) in then
Theorem 13-2.
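The agreement among these estimators is easy to verify numerically for a small instance of the linear Gaussian model. The following sketch (not from the text; H, R, m_θ, P_θ and all other numbers are assumed for illustration) computes the estimate in the "covariance" form of (14-7) and in the "information" form via (14-11) and (14-13), and confirms that the two coincide.

import numpy as np

rng = np.random.default_rng(4)

# Assumed small linear Gaussian model Z = H theta + V.
n, N = 3, 8
H = rng.standard_normal((N, n))
R = 0.5 * np.eye(N)                     # covariance of V
P_theta = np.diag([4.0, 1.0, 9.0])      # prior covariance of theta
m_theta = np.array([1.0, 0.0, -1.0])    # prior mean of theta

theta = m_theta + np.linalg.cholesky(P_theta) @ rng.standard_normal(n)
Z = H @ theta + np.sqrt(0.5) * rng.standard_normal(N)

# "Covariance" form, Eq. (14-7): m + P H'(H P H' + R)^{-1}(Z - H m)
S = H @ P_theta @ H.T + R
est_cov_form = m_theta + P_theta @ H.T @ np.linalg.solve(S, Z - H @ m_theta)

# "Information" form via Eqs. (14-11) and (14-13):
# P_MS = (P_theta^{-1} + H' R^{-1} H)^{-1},  est = m + P_MS H' R^{-1}(Z - H m)
P_MS = np.linalg.inv(np.linalg.inv(P_theta) + H.T @ np.linalg.solve(R, H))
est_info_form = m_theta + P_MS @ H.T @ np.linalg.solve(R, Z - H @ m_theta)

print(np.allclose(est_cov_form, est_info_form))   # True: the two forms coincide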
Example 14-2 (Maxirrtum-Likelihood Deconvolution] p(r.CIC(N)jq) = (23--KjP,)-*/2 exp (- i xP;x) (13-52)
As in Example 14-1,we begin with the deconvolution linear model
We leave, as an exercise for the reader, the maximization of p (r,qlCX(N)), from
Z(N) = ?e(lv - l)p + V(N) (14-42) which it follows (Mended, 1983b, pp. 112-114) that
Nuw, however, we use the product model for F, given in (l&19), to express 5(N) as
r*(Nlq) = u?Q,%e(N - l)[c&(N - l)Q,X(N - 1) + pII-%(N) (14-53)
fw) = 3e(N - 1)&r + %yN) (14-43)
&,P can be found by maximizing
For notational convenience,let
q=col(q(l),q(2),...,q(~) (14-44)
Our objectives in this example are to obtain MAP estimators for both q and r. In the (14-54)
literature on maximum-likelihood deconvolution (e.g., Mend& 1983) these estimators
are referred to as unconditional ML estimators, and are denoted i and G* We denote where
these estimators as eMApand 4MAP,in order to be consistent with this books notation.
The starting point for determining GMApand &p is the joint density function Pr(q) = fi Pr[q (k)] = Amq(l - A)N - mq (14-55)
Pk @W% th-e k-1

(14-45)
but
(14%)
P @dll = P (Mwl) ( 14-46)
Equation (14-46) usesa probability function for q rather than a probability density and finally,
function, becauseq (Ic) takes on only two discrete values, 0 and 1. Substituting (14-46) i,,, = r*(NIhM*P) (14-57)
into (14-45), we find that
(14-47) Equations (14-54) and (14-57) are quite interesting results. Observe that we are
Note, also, that permitted first to direct our attention to finding iWAPand then to finding iMAP. Ob-
n
serve, also, that rM,+p= j&&V]QJ [compare (14-53) and (14-23)].
There is no simple solution for determining GMAP.Because the elements of q in
(14-48) p(r*,q[%(N)) have nonlinear interactions, and because the elements of q are con-
strained to take on binary values, it is necessary to evaluate p(r*,qlS(N)) for every
where we have used the fact that r and q are statistically independent* Substituting possible q sequenceto find the q for which p (r * ,ql%(N)) is a global maximum. Because
(14-48) for p (Z(N)~r,qJ into (14-47), we see that q(k) can take on one of two possible values, there are 2Npossible sequenceswhere N is
(14-49) the number of elements in q. For reasonable values of N (such as N = 400), finding the
P wGw9~ =P wwwwl~
global maximum of p (r * ,qj%(N)) would require several centuries of computer time.
Observe that the r dependence of the MAP-likelihood function is completely con- We can always design a method for detecting significant values of p(k) so that
tained in p(r,%(N)iq). Additionally, when q is given, the only remaining random the resulting 4 will be nearly as likely as the unconditional maximum-likelihood esti-
quantities are the zero-mean Gaussian quantities r and Z(w; hence, p(r,C!Z(~iq) is mate, &,P. Two MAP detectors for accomplishing this are described in MendeI
muitivariate Gaussian.Letting (1983b, pp. 127-137). [Note: The reader who is interested in more details about these
MAP detectors should first review Chapter 5 in Mendel, 1983b, because they are
x = co1(r,%(N)) (14-50)
designed using a slightly different likelihood function than the one in (14-54).] III
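The combinatorial nature of the search over q is easy to see numerically. The sketch below is a simplified stand-in (not the likelihood of (14-54)): given q, it scores the data by the marginal Gaussian likelihood under the product model μ(k) = q(k)r(k) plus a Bernoulli prior on q, and enumerates all 2^N binary sequences for a tiny N. The channel h, λ, the variances, and the use of NumPy are assumptions for illustration only.

import numpy as np
from itertools import product

rng = np.random.default_rng(6)

# Toy Bernoulli-Gaussian deconvolution, exhaustively maximized over q.
N, lam, var_r, var_v = 8, 0.3, 1.0, 0.05
h = np.array([1.0, -0.6, 0.2])                       # assumed channel impulse response
q_true = (rng.random(N) < lam).astype(float)
mu = q_true * rng.normal(0.0, np.sqrt(var_r), N)     # product model mu(k) = q(k) r(k)
z = np.convolve(mu, h)[:N] + rng.normal(0.0, np.sqrt(var_v), N)

H = np.array([[h[i - j] if 0 <= i - j < len(h) else 0.0 for j in range(N)]
              for i in range(N)])                    # convolution matrix

def log_post(q):
    # log p(z | q) + log Pr(q) for this stand-in model, up to a constant.
    q = np.array(q, dtype=float)
    C = var_r * H @ np.diag(q) @ H.T + var_v * np.eye(N)   # covariance of z given q
    sign, logdet = np.linalg.slogdet(C)
    return (-0.5 * logdet - 0.5 * z @ np.linalg.solve(C, z)
            + q.sum() * np.log(lam) + (N - q.sum()) * np.log(1 - lam))

best_q = max(product((0, 1), repeat=N), key=log_post)   # 2**N evaluations
print("true q:", q_true.astype(int), "  detected q:", np.array(best_q))
# For N = 8 this is only 256 evaluations; for N = 400 it would be 2**400,
# which is why suboptimal MAP detectors are needed in practice.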

14-3. x and v are independent Gaussian random variables with zero means and
Example 14-3 (State-Estimation)
variances O$ and d, respectively. We observe the single measurement
In Example 2-4 we that, for the application of state estimation, our linear 2 =x+v=l.
model is (a) Find &.
(b) Find iMAP.
ffV9 = X(N, kl)x(kl) + V(N, k,) (14-58)
14-4. For the linear Gaussian model in which X(k) is deterministic, prove that iMAP
From (2-17), we see that is a most efficient estimator of 8. Do this in two different ways. Is &,&) a most
efficient estimator of 6?
x(k,) = Cpklx(0) + $ @ - i yu(i - 1) = @lx(O) + Lu (14-59)
I= 1

u = co1(u(O), u(l), . . . ,u (Iv)) (14-60)


and L is an n x (N + 1) matrix, the exact structure of which is not important for this
example. Additionally, from (2-21), we see that
Sr(N, kl) = M(N, kl)u + v (14-61)

v = col(v(l), v(2), . . . ) v(N)) (14-62)


Observe that both x(k,) andSr(N, kl) can be viewed as linear functions of x(O), u and v.
We now assumethat x(O), u and v are jointly Gaussian. Reasonsfor doing this
are discussedin Lesson 15. Then, becausex(kl) and V(N, kl) are linear functions of
x(O), u and v, x(k,) and V(N, kl) are jointly Gaussian (Papoulis, 1965,pg. 253); hence,

(14-63)
and

In order to evaluate &(klIw we must first compute mxckI1and PxckI) We show how to
do this in Lesson 15.
Formula (14-63) is very cumbersome. It appears that its right-hand side changes
as a function of kl (and N). We conjecture, however, that it ought to be possible to express
&(k$V) as an affine transformation of &(kl - l(N), because x(kl) is an affine
transformation of x(kl - l), i.e., x(kl) = cPx(kl - 1) + yu(kl - 1).
Because of the importance of state estimation in many different fields (e.g.,
control theory, communication theory, signal processing, etc.) we shall examine it in
great detail in many of our succeedinglessons. 0

PROBLEMS

14-l. Derive Eq. (14-10) for P&k).


14-2. Show that r*(Nlq) is given by Eq. (14-53) and that GMAPcan be found by
maximizing (14-54).

Lesson 15 Let Y(I) be defined as

Elements then, Definition 15-2 means that

of DiscreteJiime
in which
GaussmMarkov my(l) = EiW)} (15-3)

and PY(l) is the n/ x nl matrix E([Y(I) - m,(l)][Y(I) - my(/)]} with elements

Random Processes Py(i, f, where

Pdi 311= EMi) - mdh)JLs(~j) - m&>l~ (15-4)


ki = 1,2, . . . , 1.

Defmition 15-3 (Meditch, 1969, pg. 118). A vector random Process


(s(t), k.9) is a Murkuv process, if, for any m time points tl < f2 < . . . < t, in 9,
where m is any integer, it is true that

Lessons 13 and 14 have demonstrated the importance of Gaussianrandom


variables in estimation theory. In this lesson we extend some of the basic
conceptsthat were introduced in Lesson 12, for Gaussianrandom variables, to
indexed random variables?namely random processes.These extensions are For continuous random variables, this means Ihat
needed in order to develop state estimators.

DEFlNlTH3NS AND PROPERTIES UF DISCRETE-TIME Note that, in (U-5), s(tm) 5 S(t,,,) means si(tm) I Si(tm) for i =
GAUSS-MARKOV RANDOM PROCESSES 1, 2,. . . , n. If we view time point t,,, as the present time and time points
t . . . . tl as the past, then a Markov process is one whose probability law
Recall that a random processis a coIIection of random variables in which the Git, probability density function) dependsonly on the immediate past value,
notion of time plays a role. t,,,- 1. This is often referred to as the Markuv property for a vector random
process. Because the probability law depends only on the immediate past
Definition 15-l (Meditch, 1969,pg. 106). A vector random prucess is a value we often refer to such a process as a first-order Markov process (if it
family of randurn vecturs (s(t), ~4) indexed by a parameter t aU uf whuse values depended on the immediate two past values it would be a second-orderMar-
lie in some appropriate index set 9. When 9 = {k: k = 0, I, . . . ] we have a kov process).
discrete4me randurn process. El
Theorem 15-l. Let (s(t), tE.%}be a first-order Markav process, and
Definition U-2 (Meditch, 1969, pg. 117). A vector randurn prucess tl < t* < , . . < t, be any time points in 9, where m is an integer. Then
{s(t), k%} Ls defined tu be multivariate Gaasian if, fur any t time points
t1, t2, . t[ in $ where t zk an integer, [he set uf e randum n vecturs s@J,
. l ,

w7 * * , s(Q is juintly Gaussian distributed. I3

128

Theorem 15-3. A vector Gaussian white process s(t) can be viewed as a


PRX$ From probability theory (e.g., Papoulis, 1965) and the Markov
first-order Gauss-Markov process for which
property of s(t), we know that
pb(G?l>, a?i - I), l l l 7s(4)l Pcsw~>l = P w (15-13)
= PEGz>lS(fm 7s(4)lp[s(L - I), *9s(4)]
- I>7 l l l l l for all t, ~e.9 and t =f T.
= p bh?l>lS(f*
- I)lP[s(L - I>, * 7s(h)1 l l
(15-8) Proof. For a Gaussianwhite process, we know, from (H-12), that
In a similar manner, we find that
P b(t)7 WI = p[s(t)lp[s(+I (15-14)
Pb(Ll), l l l ? $I)1 = p Cs(tm
- I)IG?l- *)Ip [s(tm- 2), 7s(t1)l l l l
but, we also know that
s(h)1= p [s(G?l
- ?)Is(fm
- $1p b(Ll - 31, 7s(h)] (15-9)
I SW1= P b(olw P M7)l (15-15)
pb(t,-*), l l l 9 l l

... P b(t),
Equating (15-14) and (15-15), we obtain (15-13). 0
Equation (15-7) is obtained by successivelysubstituting each one of the equa-
tions in (15-9) into (15-8). 0 Theorem 15-3 means that past and future values of s(t) in no way help
determine present values of s(t). For Gaussianwhite processes,the transition
Theorem 15-l demonstrates that a first-order Markov process is com- probability density function equals the marginal density function, p [s(t)],
pletely characterizedby two probability density functions, namely, the transi- which is multivariate Gaussian. Additionally,
tion probability density function, p [s(ti)ls(ti _ JJ, and the initial (prior) proba- E{s(t,)&l - I), l l l ? s(h)> = WL 1) (15-16)
bility density function p[s(t,)]. Note that generally the transition probability
density functions can all be different, in which casethey should be subscripted
[e.g.,pmb(tm)ls(L
- 111
andpm- I[& - I>I& - JII A BASIC STATE-VARIABLE MODEL

Theorem 15-2. For a first-order Markov process, In succeedinglessonswe shall develop a variety of state estimators for the
following basic linear, (possibly) time-varying, discrete-time dynamical sys-
Ew?l >Is(tm - I), l l l 9 s(4>) = ~~s(t,)Is(L - I>> 0
(15-10) tem (our basic state-variable modeo, which is characterized by YI x 1 state
vector x(k) and m x 1 measurement vector z(k):
We leavethe proof of this useful result as an exercise.
A vector random processthat is both Gaussianand a first-order Markov
x(k + 1) = @(k + l,k)x(k)
processwill be referred to in the sequel as a Gauss-Markovprocess.
+ I-(k + l?k)w(k) + If(k + l,k)u(k) (15-17)
Definition 15-4. A vector random process {s(t), te.9) is said to be a and
Gaussian white process if, for any m time points tl, t2, . . . , t, in 9, where m is
any integer, the m random vectors s(tr), s(t& . . . , s(t,,,)are uncorrelated Gaus- z(k + 1) = H(k + l)x(k + 1) + v(k + 1) (15-M)
sian random vectors. 0

White noise is zero mean, or else it cannot have a flat spectrum. For where k = 0, 1, . . . . In this model w(k) and v(k) are p x 1 and m x 1 mutu-
. ally uncorrelated (possibly nonstationary) jointly Gaussian white noise se-
white noise
quences;i.e.,
E{s(ti)s(t,)) = 0 for all i # j (15-11)
Additionally, for Gaussianwhite noise /Elrou.(j))=a(ns,

P tw1 = P [WI P [S(h)1 P [s(tJl l l l


(15-12)
[because uncorrelatednessimplies statistical independence(see Lesson 12)] E{v(i)v( i)} = R(i)&
where p [s(ti)] is a multivariate Gaussianprobability density function.

PROPERTIES OF THE BASIC STATE-VARIABLE MODEL


and
I I In this section we state and prove a number of important statistical properties
1 E{w(+(j)j = S= 0 for all i and j
I
(15-21)
for our basic state-variablemodel

Covariancematrix Q(i) is positive semidefinite and R(i) is positive definite [so Theorem 15-4. When x(O) and w(k) are jointly Gaussian then (x(k),
that R-*(i) exists]. Additionally, u(k) is an 2 x 1 vector of known system k= 0, l,...) is a Gauss-Markov sequence.
inputs, and initial state vector x(O) is multivariate Gaussian,with mean m&l)
and covariancePK(0),i.e., Note that if x(O) and w(k) are individually Gaussian and statistically
independent (or uncorrelated), then they will be jointly Gaussian (Papoulis,
1965).
(15-22)
Proof
a. Gaussian Property [assuming u(k) nonrandom]. Because u(k) is non-
and9x(O) is not correlated with w(k) and v(k). The dimensionsof matrices @, random, it has no effect on determining whether x(k) is Gaussian;
I, 9, H, Q and R are n X n, n x p, fl x 2, m X n, p x p, and m x m, hence, for this part of the proof we assumeu(k) = 0. The solution to
respectively. (15-17) is
Disturbance w(k) is often used to model the following types of uncer-
tainty: x(k) = @(k,O)x(O) + i @(k,i)r(i,i - l)w(i - 1) (15-23)
i=l

1. disturbance forces acting on the system (e,g., wind that buffets an air- where
plane); Q&i) = @(k,k - l)@(k - 1,k - 2). . . @,(i + 1,i) (15-24)
2. errors in modeling the system (e.g., neglectedeffects); and
Observe that x(k) is a Iinear transformation of jointly Gaussianrandom
3. errors, due to actuators,in the translation of the known input, u(k), into
vectors x(O), w(O),w(l), . . . , w(k - 1); hence, x(k) is Gaussian,
physical signals.
b. Markov Property. This property does not require x(k) or w(k) to be
Gaussian. Because x satisfies state equation (1517), we see that x(k)
Vector v(k) is often usedto model the following types of uncertainty: depends only on its immediate past value; hence, x(k) is Markov. 0
1. errors in measurementsmade by sensinginstruments; We havebeen able to show that our dynamical systemis Markov because
2. unavoidabledisturbancesthat act directly on the sensors;and we specified a model for it. Without such a specification, it would be quite
3. errors in the realization of feedback compensatorsusing physical com- difficult (or impossible) to test for the Markov nature of a random process.
ponents [this is valid only when the measurementequation contains a By stacking up x(l), x(2), , . . into a supervectorit is easily seen that this
direct throughput of the input u(k), i.e., when z(k + 1) = I-I(k + 1) supervector is just a linear transformation of jointly Gaussianquantities x(O),
x(k + 1) + G(k + l)u(k + 1) + v(k + 1); we shall examine this situ- w(O), w(l), - - - ; hence, x(l), x(2), . . . are themselvesjointly Gaussian.
ation in Lesson231. A Gauss-Markov sequencecan be completeIy characterizedin two ways:
1. specify the marginal density of the initial state vector, P Ex(O)l and the
0f course, not all dynamical systemsare describedby this basic model. transition densityP[x(k + 1)1x(k)], or
In general, w(k) and v(k) may be correlated, some measurements may be
2. specify the mean and covariance of the state vector sequence.The sec-
made so accurate that, for all practical purposes, they are perfect (Le.,
there is no measurementnoise associatedwith them), and either u(k) or v(k), ond characterization is a complete one becauseGaussian random vec-
or both, may be colored noiseprocesses.We shall considerthe modification of tors are completely characterized by their meansand covariances (Les-
our basicstate-variablemodel for each of theseimportant situations in Lesson son 12). We shall find the second characterization more useful than the
23.
first.

The Gaussiandensity function for state vector x(k) is mx(k)E(w(k)l = 0, and E{w(k)m#)} = 0. State vector x(k) dependsat
most on random input w(k - 1) [see(B-17)]; hence,
p[x(k)] = [(2n)lP,(k)j]-* exp
E{x(k)w(k)} = E{x(k)}E{w(k)} = 0 (15-32)
- m,(k>lK(k)[x(k) - mxWI} (15-25)
and E{w(k)x(k)} = 0 as well. The last two terms in (15-31) are therefore
where equal to zero, and the equation reducesto (15-29).
m,(k)= EW)l (15-26) c. We leave the proof of (H-30) as an exercise.Observe that once we know
covariance matrix P,(k) it is an easy matter to determine any cross-
covariancematrix between statex(k) andx(i)@ # k). The Markov nature
P,(k)= Eew - mxwmw - mxvw (15-27) of our basic state-variable model is responsiblefor this. Cl
We now demonstrate that m,(k) and P,(k) can be computed by means of
recursive equations. Observe that mean vector m,(k) satisfies a deterministic vector state
equation, (U-28), covariance matrix P,(k) satisfiesa deterministic matrix state
equation, (15-29), and (15-28) and (15-29) are easily programmed for digital
Theorem 15-5. For our basic state-variable model,
computation.
a. m,(k) can be computed from the vector recursive equation Next we direct our attention to the statisticsof measurementvector z(k).
m,(k + 1) = @(k + l,k)m,(k) + 3!(k + l,k)u(k) (15-28)
Theorem 15-6. For our basic state-variable model, when x(O), w(k) and
where k = 0, 1, . . . , and mx(0) initializes (O-28), v(k) are jointly Gaussian, then {z(k), k = 1, 2, . . . } is Gaussian, and
b. P,(k) can be computed from the matrix recursive equation
m,(k + 1) = H(k + l)m,(k + 1) (15-33)
P,(k + 1) = iP(k + l,k)P,(k)@(k + 1,k)
+ r(k + l,k)Q(k)r(k + 1,k) (15-29)
P,(k + 1) = H(k + l)P,(k + l)H(k + 1) + R(k + 1) (15-34)
wherek=O,l,..., and P,(O)initializes (U-29), and
c. E{ [x(i) - mJi)][x(j> - m,(j)]} 4 P,(i, j) can be computed from where m,(k + 1) and P,(k + 1) are computed from (15-28) and (1529),
respectively. Cl
W, j)Px(j) wheni > j
Next we direct our attention to the statistics of measurement vector z(k).

Theorem 15-6. For our basic state-variable model, when x(0), w(k) and v(k) are jointly Gaussian, then {z(k), k = 1, 2, . . . } is Gaussian, and

m_z(k + 1) = H(k + 1)m_x(k + 1)   (15-33)

and

P_z(k + 1) = H(k + 1)P_x(k + 1)H'(k + 1) + R(k + 1)   (15-34)

where m_x(k + 1) and P_x(k + 1) are computed from (15-28) and (15-29), respectively. □

We leave the proof as an exercise for the reader. Note that if x(0), w(k) and v(k) are statistically independent and Gaussian they will be jointly Gaussian.

Example 15-1
Consider the simple single-input single-output first-order system
x(k + 1) = (1/2)x(k) + w(k)   (15-35)
z(k + 1) = x(k + 1) + v(k + 1)   (15-36)
where w(k) and v(k) are wide-sense stationary white noise processes, for which q = 20 and r = 5. Additionally, m_x(0) = 4 and p_x(0) = 10.
The mean of x(k) is computed from the following homogeneous equation
m_x(k + 1) = (1/2)m_x(k),   m_x(0) = 4   (15-37)
and the variance of x(k) is computed from Equation (15-29), which in this case simplifies to
p_x(k + 1) = (1/4)p_x(k) + 20,   p_x(0) = 10   (15-38)
Additionally, the mean and variance of z(k) are computed from
m_z(k + 1) = m_x(k + 1)   (15-39)
and
p_z(k + 1) = p_x(k + 1) + 5   (15-40)
Figure 15-1 depicts m_x(k) and p_x^(1/2)(k). Observe that m_x(k) decays to zero very rapidly and that p_x^(1/2)(k) approaches a steady-state value, p̄_x^(1/2) = 5.163. This steady-state value can be computed from equation (15-38) by setting p_x(k) = p_x(k + 1) = p̄_x. The existence of p̄_x is guaranteed by our first-order system being stable.
Although m_x(k) → 0, there is a lot of uncertainty about x(k), as evidenced by the large value of p̄_x. There will be an even larger uncertainty about z(k), because p̄_z → 31.66. These large values for p̄_x and p̄_z are due to the large values of q and r. In many practical applications, both q and r will be much less than unity, in which case p̄_x and p̄_z will be quite small. □

Figure 15-1 Mean (dashed) and standard deviation (bars) for first-order system (15-35) and (15-36).

If our basic state-variable model is time-invariant and stationary, and if Φ is associated with an asymptotically stable system (i.e., one whose poles all lie within the unit circle), then (Anderson and Moore, 1979) matrix P_x(k) reaches a limiting (steady-state) solution P̄_x, i.e.,
lim_{k→∞} P_x(k) = P̄_x   (15-41)
Matrix P̄_x is the solution of the following steady-state version of (15-29),
P̄_x = ΦP̄_xΦ' + ΓQΓ'   (15-42)
This equation is called a discrete-time Lyapunov equation. See Laub (1979) for an excellent numerical method that can be used to solve (15-42) for P̄_x.

SIGNAL-TO-NOISE RATIO

In this section we simplify our basic state-variable model (15-17) and (15-18) to a time-invariant, stationary, single-input single-output model:
x(k + 1) = Φx(k) + γw(k) + ψu(k)   (15-43)
z(k + 1) = h'x(k + 1) + v(k + 1)   (15-44)
Measurement z(k) is of the classical form of signal plus noise, where signal s(k) = h'x(k).
The signal-to-noise ratio is an often-used measure of quality of measurement z(k). Here we define that ratio, denoted by SNR(k), as
SNR(k) = E{s²(k)}/E{v²(k)}   (15-45)
From preceding analyses, we see that
SNR(k) = h'P_x(k)h / r   (15-46)
Because P_x(k) is in general a function of time, SNR(k) is also a function of time. If, however, Φ is associated with an asymptotically stable system, then (15-41) is true. In this case we can use P̄_x in (15-46) to provide us with a single number, SNR, for the signal-to-noise ratio, i.e.,
SNR = h'P̄_xh / r   (15-47)
Finally, we demonstrate that SNR(k) (or SNR) can be computed without knowing q and r explicitly; all that is needed is the ratio q/r. Multiplying and dividing the right-hand side of (15-46) by q, we find that
SNR(k) = [h'(P_x(k)/q)h](q/r)   (15-48)
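The following minimal sketch (plain Python, illustrative names only) iterates the scalar version of (15-29) to its steady state and then evaluates (15-47), using the numbers of Example 15-1 by way of illustration.

phi, q = 0.5, 20.0          # scalar model of Example 15-1 (assumed values from that example)
p = 10.0                    # p_x(0)
for _ in range(50):         # iterate (15-29): p <- phi^2 * p + q
    p = phi**2 * p + q
h, r = 1.0, 5.0
snr = h * p * h / r         # (15-47)
print(p, p**0.5, snr)       # p converges to 80/3 = 26.67, so sqrt(p) = 5.16, consistent with Figure 15-1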
Scaled covariance matrix P_x(k)/q is computed from the following version of (15-29):
P_x(k + 1)/q = Φ[P_x(k)/q]Φ' + γγ'   (15-49)
One of the most useful ways for using (15-48) is to compute q/r for a given signal-to-noise ratio SNR, i.e.,
q/r = SNR / [h'(P̄_x/q)h]   (15-50)
In Lesson 18 we show that q/r can be viewed as an estimator tuning parameter; hence, signal-to-noise ratio, SNR, can also be treated as such a parameter.

Example 15-2 (Mendel, 1981)
Consider the first-order system
x(k + 1) = Φx(k) + γw(k)   (15-51)
z(k + 1) = hx(k + 1) + v(k + 1)   (15-52)
In this case, it is easy to solve (15-49) to show that
p̄_x/q = γ²/(1 - Φ²)   (15-53)
hence,
SNR = [h²γ²/(1 - Φ²)](q/r)   (15-54)
Observe that, if h²γ² = 1 - Φ², then SNR = q/r. The condition h²γ² = 1 - Φ² is satisfied if, for example, γ = 1, Φ = 1/√2 and h = 1/√2. □
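A few lines of Python make the check at the end of Example 15-2 explicit and also show how (15-50) is used to pick q/r for a desired SNR. The desired SNR of 4 below is an arbitrary illustrative value, not one from the text.

gamma, Phi, h = 1.0, 2**-0.5, 2**-0.5
p_bar_over_q = gamma**2 / (1.0 - Phi**2)     # (15-53): equals 2
snr_per_unit_q_over_r = h**2 * p_bar_over_q  # bracketed factor in (15-54): equals 1, so SNR = q/r
q_over_r = 4.0 / snr_per_unit_q_over_r       # (15-50) with a desired SNR of 4
print(snr_per_unit_q_over_r, q_over_r)       # 1.0 and 4.0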

PROBLEMS

15-1. Prove Theorem 15-2, and then show that for Gaussian white noise E{s(t_k)|s(t_{k-1}), . . . , s(t_1)} = E{s(t_k)}.
15-2. Derive the formula for the cross-covariance of x(k), P_x(i, j), given in (15-30).
15-3. Derive the first- and second-order statistics of measurement vector z(k) that are summarized in Theorem 15-6.
15-4. Reconsider the basic state-variable model when x(0) is correlated with w(0), and w(k) and v(k) are correlated [E{w(k)v'(k)} = S(k)].
(a) Show that the covariance equation for z(k) remains unchanged.
(b) Show that the covariance equation for x(k) is changed, but only at k = 1.
(c) Compute E{z(k + 1)z'(k)}.
15-5. A system with impulse response h is driven by input u(k), and its output is observed in additive noise v(k) to give measurement z(k). In this problem, assume that u(k) and v(k) are individually Gaussian and uncorrelated. Impulse response h depends on parameter a, where a is a Gaussian random variable that is statistically independent of u(k) and v(k).
(a) Evaluate E{z(k)}.
(b) Explain whether or not z(k) is Gaussian.
Lesson 16

State Estimation: Prediction

INTRODUCTION

We have mentioned, a number of times in this book, that in state estimation three situations are possible depending upon the relative relationship of total number of available measurements, N, and the time point, k, at which we estimate state vector x(k), namely: prediction (N < k), filtering (N = k), and smoothing (N > k). In this lesson we develop algorithms for mean-squared predicted estimates, x̂_MS(k|j), of state x(k). In order to simplify our notation, we shall abbreviate x̂_MS(k|j) as x̂(k|j). (Just in case you have forgotten what the notation x̂(k|j) stands for, see Lesson 2.) Note that, in prediction, k > j.

SINGLE-STAGE PREDICTOR

The most important predictor of x(k) for our future work on filtering and smoothing is the single-stage predictor x̂(k|k - 1). From the Fundamental Theorem of Estimation Theory (Theorem 13-1), we know that
x̂(k|k - 1) = E{x(k)|Z(k - 1)}   (16-1)
where
Z(k - 1) = col(z(1), z(2), . . . , z(k - 1))   (16-2)
It is very easy to derive a formula for x̂(k|k - 1) by operating on both sides of the state equation
x(k) = Φ(k, k - 1)x(k - 1) + Γ(k, k - 1)w(k - 1) + Ψ(k, k - 1)u(k - 1)   (16-3)
with the linear expectation operator E{·|Z(k - 1)}. Doing this, we find that
x̂(k|k - 1) = Φ(k, k - 1)x̂(k - 1|k - 1) + Ψ(k, k - 1)u(k - 1)   (16-4)
where k = 1, 2, . . . . To obtain (16-4) we have used the facts that E{w(k - 1)} = 0 and u(k - 1) is deterministic.
Observe, from (16-4), that the single-stage predicted estimate, x̂(k|k - 1), depends on the filtered estimate, x̂(k - 1|k - 1), of the preceding state vector x(k - 1). At this point, (16-4) is an interesting theoretical result; but there is nothing much we can do with it, because we do not as yet know how to compute filtered state estimates. In Lesson 17 we shall begin our study of filtered state estimates, and shall learn that such estimates of x(k) depend on predicted estimates of x(k), just as predicted estimates of x(k) depend on filtered estimates of x(k - 1); thus, filtered and predicted state estimates are very tightly coupled together.
Let P(k|k - 1) denote the error-covariance matrix that is associated with x̃(k|k - 1), i.e.,
P(k|k - 1) = E{[x̃(k|k - 1) - m_x̃(k|k - 1)][x̃(k|k - 1) - m_x̃(k|k - 1)]'}   (16-5)
where
x̃(k|k - 1) = x(k) - x̂(k|k - 1)   (16-6)
Additionally, let P(k - 1|k - 1) denote the error-covariance matrix that is associated with x̃(k - 1|k - 1), i.e.,
P(k - 1|k - 1) = E{[x̃(k - 1|k - 1) - m_x̃(k - 1|k - 1)][x̃(k - 1|k - 1) - m_x̃(k - 1|k - 1)]'}   (16-7)
For our basic state-variable model (see Property 1 of Lesson 13), m_x̃(k|k - 1) = 0 and m_x̃(k - 1|k - 1) = 0, so that
P(k|k - 1) = E{x̃(k|k - 1)x̃'(k|k - 1)}   (16-8)
and
P(k - 1|k - 1) = E{x̃(k - 1|k - 1)x̃'(k - 1|k - 1)}   (16-9)
Combining (16-3) and (16-4), we see that
x̃(k|k - 1) = Φ(k, k - 1)x̃(k - 1|k - 1) + Γ(k, k - 1)w(k - 1)   (16-10)
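Since (16-4) is the computational core of the single-stage predictor, a minimal sketch of it may be helpful; it assumes NumPy, and the function name is illustrative only.

import numpy as np

def predict_state(x_filt_prev, Phi, Psi, u_prev):
    """Single-stage state prediction, equation (16-4)."""
    return Phi @ x_filt_prev + Psi @ u_prev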
A straightforward calculation leads to the following formula for P(k|k - 1),
P(k|k - 1) = Φ(k, k - 1)P(k - 1|k - 1)Φ'(k, k - 1) + Γ(k, k - 1)Q(k - 1)Γ'(k, k - 1)   (16-11)
where k = 1, 2, . . . .
Observe, from (16-4) and (16-11), that x̂(0|0) and P(0|0) initialize the single-stage predictor and its error-covariance. Additionally,
x̂(0|0) = E{x(0)|no measurements} = m_x(0)   (16-12)
and
P(0|0) = E{x̃(0|0)x̃'(0|0)} = E{[x(0) - m_x(0)][x(0) - m_x(0)]'} = P_x(0)   (16-13)
Finally, recall (Property 4 of Lesson 13) that both x̂(k|k - 1) and x̃(k|k - 1) are Gaussian.

A GENERAL STATE PREDICTOR

In this section we generalize the results of the preceding section so as to obtain predicted values of x(k) that look further into the future than just one step. We shall determine x̂(k|j), where k > j, under the assumption that filtered state estimate x̂(j|j) and its error-covariance matrix E{x̃(j|j)x̃'(j|j)} = P(j|j) are known for some j = 0, 1, . . . .

Theorem 16-1
a. If input u(k) is deterministic, or does not depend on any measurements, then the mean-squared predicted estimator of x(k), x̂(k|j), is given by the expression
x̂(k|j) = Φ(k, j)x̂(j|j) + Σ_{i=j+1}^{k} Φ(k, i)Ψ(i, i - 1)u(i - 1),   k > j   (16-14)
b. The vector random process {x̃(k|j), k = j + 1, j + 2, . . . } is:
i. zero mean,
ii. Gaussian,
iii. first-order Markov, and
iv. its covariance matrix is governed by
P(k|j) = Φ(k, k - 1)P(k - 1|j)Φ'(k, k - 1) + Γ(k, k - 1)Q(k - 1)Γ'(k, k - 1)   (16-15)

Before proving this theorem, let us observe that the prediction formula (16-14) is intuitively what one would expect. Why is this so? Suppose we have processed all of the measurements z(1), z(2), . . . , z(j) to obtain x̂(j|j) and are asked to predict the value of x(k), where k > j. No additional measurements can be used during prediction. All that we can therefore use is our dynamical state equation. When that equation is used for purposes of prediction we neglect the random disturbance term, because the disturbances are not measurable. We can only use measured quantities to assist our prediction efforts. The simplified state equation is
x(k + 1) = Φ(k + 1, k)x(k) + Ψ(k + 1, k)u(k)   (16-16)
a solution of which is
x(k) = Φ(k, j)x(j) + Σ_{i=j+1}^{k} Φ(k, i)Ψ(i, i - 1)u(i - 1)   (16-17)
Substituting x̂(j|j) for x(j), we obtain the predictor in (16-14). In our proof of Theorem 16-1 we establish (16-14) in a more rigorous manner.

Proof
a. The solution to state equation (16-3), for x(k), can be expressed in terms of x(j), where j < k, as
x(k) = Φ(k, j)x(j) + Σ_{i=j+1}^{k} Φ(k, i)[Γ(i, i - 1)w(i - 1) + Ψ(i, i - 1)u(i - 1)]   (16-18)
We apply the Fundamental Theorem of Estimation Theory to (16-18) by taking the conditional expectation with respect to Z(j) on both sides of it. Doing this, we find that
x̂(k|j) = Φ(k, j)x̂(j|j) + Σ_{i=j+1}^{k} Φ(k, i)[Γ(i, i - 1)E{w(i - 1)|Z(j)} + Ψ(i, i - 1)E{u(i - 1)|Z(j)}]   (16-19)
Note that Z(j) depends at most on x(j), which, in turn, depends at most on w(j - 1). Consequently,
E{w(i - 1)|Z(j)} = E{w(i - 1)|w(0), w(1), . . . , w(j - 1)}   (16-20)
where i = j + 1, j + 2, . . . , k. Because of this range of values on argument i, w(i - 1) is never included in the conditioning set of values w(0), w(1), . . . , w(j - 1); hence,
E{w(i - 1)|Z(j)} = E{w(i - 1)} = 0   (16-21)
for all i = j + 1, j + 2, . . . , k.
Note, also, that
E{u(i - 1)|Z(j)} = E{u(i - 1)} = u(i - 1)   (16-22)
because we have assumed that u(i - 1) does not depend on any of the measurements. Substituting (16-21) and (16-22) into (16-19), we obtain the prediction formula (16-14).
b-i. and b-ii. have already been proved in Properties 1 and 4 of Lesson 13.
b-iii. Starting with x̃(k|j) = x(k) - x̂(k|j), and substituting (16-18) and (16-14) into this relation, we find that
x̃(k|j) = Φ(k, j)x̃(j|j) + Σ_{i=j+1}^{k} Φ(k, i)Γ(i, i - 1)w(i - 1)   (16-23)
This equation looks quite similar to the solution of state equation (16-3) when u(k) = 0 (for all k), e.g., see (16-18). In fact, x̃(k|j) also satisfies the state equation
x̃(k|j) = Φ(k, k - 1)x̃(k - 1|j) + Γ(k, k - 1)w(k - 1)   (16-24)
Because x̃(k|j) depends only upon its previous value, x̃(k - 1|j), it is first-order Markov.
b-iv. We derived a recursive covariance equation for x(k) in Theorem 15-5. That equation is (15-29). Because x̃(k|j) satisfies the state equation for x(k), its covariance P(k|j) is also given by (15-29). We have rewritten this equation as in (16-15). □

Observe that by setting j = k - 1 in (16-14) we obtain our previously derived single-stage predictor x̂(k|k - 1).
Theorem 16-1 is quite limited because presently the only values of x̂(j|j) and P(j|j) that we know are those at j = 0. For j = 0, (16-14) becomes
x̂(k|0) = Φ(k, 0)m_x(0) + Σ_{i=1}^{k} Φ(k, i)Ψ(i, i - 1)u(i - 1)   (16-25)
The reader might feel that this predictor of x(k) becomes poorer and poorer as k gets farther and farther away from zero. The following example demonstrates that this is not necessarily true.

Example 16-1
Let us examine prediction performance, as measured by p(k|0), for the first-order system
x(k + 1) = (1/√2)x(k) + w(k)   (16-26)
where q = 25 and p(0) is variable. Quantity p(k|0), which in the case of a scalar state vector is a variance, is easily computed from the recursive equation
p(k|0) = (1/2)p(k - 1|0) + 25   (16-27)
for k = 1, 2, . . . . Two cases are summarized in Figure 16-1.

Figure 16-1 Prediction error variance p(k|0).

When p(0) = 6 we have relatively small uncertainty about x̂(0|0), and, as we expected, our predictions of x(k) for k ≥ 1 do become worse, because p(k|0) > 6 for all k ≥ 1. After a while p(k|0) reaches a limiting value equal to 50. When this occurs we are estimating x̂(k|0) by a number that is very close to zero, because x̂(k|0) = (1/√2)^k x̂(0|0), and (1/√2)^k approaches zero for large values of k.
When p(0) = 100 we have large uncertainty about x̂(0|0), and, perhaps to our surprise, our predictions of x(k) for k ≥ 1 improve in performance, because p(k|0) < 100 for all k ≥ 1. In this case the predictor discounts the large initial uncertainty; however, as in the former case, p(k|0) again reaches the limiting value of 50.
For suitably large values of k, the predictor is completely insensitive to p(0). It reaches a steady-state level of performance equal to 50, which can be predetermined by setting p(k|0) and p(k - 1|0) equal to p̄ in (16-27), and solving the resulting equation for p̄. □

Prediction is possible only because we have a known dynamical model, namely, our state-variable model. Without a model, prediction is dubious at best (e.g., try predicting tomorrow's price of a stock listed on any stock exchange using today's closing price).
THE INNOVATIONS PROCESS

Suppose we have just computed the single-stage predicted estimate of x(k + 1), x̂(k + 1|k). Then the single-stage predicted estimate of z(k + 1), ẑ(k + 1|k), is
ẑ(k + 1|k) = E{z(k + 1)|Z(k)} = E{[H(k + 1)x(k + 1) + v(k + 1)]|Z(k)}
so that
ẑ(k + 1|k) = H(k + 1)x̂(k + 1|k)   (16-28)
The error between z(k + 1) and ẑ(k + 1|k) is z̃(k + 1|k), i.e.,
z̃(k + 1|k) = z(k + 1) - ẑ(k + 1|k)   (16-29)
Signal z̃(k + 1|k) is often referred to either as the innovations process, prediction error process, or measurement residual process. We shall refer to it as the innovations process, because this is most commonly done in the estimation theory literature (e.g., Kailath, 1968). The innovations process plays a very important role in mean-squared filtering and smoothing. We summarize important facts about it in the following:

Theorem 16-2 (Innovations)
a. The following representations of the innovations process z̃(k + 1|k) are equivalent:
z̃(k + 1|k) = z(k + 1) - ẑ(k + 1|k)   (16-30)
z̃(k + 1|k) = z(k + 1) - H(k + 1)x̂(k + 1|k)   (16-31)
z̃(k + 1|k) = H(k + 1)x̃(k + 1|k) + v(k + 1)   (16-32)
b. The innovations is a zero-mean Gaussian white noise sequence, with
E{z̃(k + 1|k)z̃'(k + 1|k)} = P_z̃(k + 1|k) = H(k + 1)P(k + 1|k)H'(k + 1) + R(k + 1)   (16-33)

Proof (Mendel, 1983b)
a. Substitute (16-28) into (16-29) in order to obtain (16-31). Next, substitute the measurement equation z(k + 1) = H(k + 1)x(k + 1) + v(k + 1) into (16-31), and use the fact that x̃(k + 1|k) = x(k + 1) - x̂(k + 1|k), to obtain (16-32).
b. Because x̃(k + 1|k) and v(k + 1) are both zero mean, E{z̃(k + 1|k)} = 0. The innovations is Gaussian because z(k + 1) and ẑ(k + 1|k) are Gaussian, and, therefore, z̃(k + 1|k) is a linear transformation of Gaussian random vectors. To prove that z̃(k + 1|k) is white noise we must show that
E{z̃(i + 1|i)z̃'(j + 1|j)} = P_z̃(i + 1|i)δ_ij   (16-34)
We shall consider the cases i > j and i = j, leaving the case i < j as an exercise for the reader. When i > j,
E{z̃(i + 1|i)z̃'(j + 1|j)} = E{[H(i + 1)x̃(i + 1|i) + v(i + 1)][H(j + 1)x̃(j + 1|j) + v(j + 1)]'}
= E{H(i + 1)x̃(i + 1|i)[H(j + 1)x̃(j + 1|j) + v(j + 1)]'}
because E{v(i + 1)v'(j + 1)} = 0 and E{v(i + 1)x̃'(j + 1|j)} = 0. The latter is true because, for i > j, x̃(j + 1|j) does not depend on measurement z(i + 1); hence, for i > j, v(i + 1) and x̃(j + 1|j) are independent, so that E{v(i + 1)x̃'(j + 1|j)} = E{v(i + 1)}E{x̃'(j + 1|j)} = 0. We continue, as follows:
E{z̃(i + 1|i)z̃'(j + 1|j)} = H(i + 1)E{x̃(i + 1|i)[z(j + 1) - H(j + 1)x̂(j + 1|j)]'} = 0
by repeated application of the orthogonality principle (Corollary 13-3). When i = j,
P_z̃(i + 1|i) = E{[H(i + 1)x̃(i + 1|i) + v(i + 1)][H(i + 1)x̃(i + 1|i) + v(i + 1)]'} = H(i + 1)P(i + 1|i)H'(i + 1) + R(i + 1)
because, once again, E{v(i + 1)x̃'(i + 1|i)} = 0, and P(i + 1|i) = E{x̃(i + 1|i)x̃'(i + 1|i)}. □

In Lesson 17 the inverse of P_z̃(k + 1|k) is needed; hence, we shall assume that H(k + 1)P(k + 1|k)H'(k + 1) + R(k + 1) is nonsingular. This is usually true and will always be true if, as in our basic state-variable model, R(k + 1) is positive definite.
E{%(k + llk)Z(k + Ilk)} = PC(k + ljk) R(k + 1) is positive definite.
= H(k + l)P(k + llk)H(k + 1) + R(k + 1) (16-33)
Proof (Mendel, 1983b) PROBLEMS
a. Substitute (16-28) into (16-29) in order to obtain (16-31). Next, substi-
16-l. Develop the counterpart to Theorem 16-1for the casewhen input u(k) is random
tute the measurement equation z(k + 1) = H(k + l)x(k + 1) +
and independent of Z( 1). What happens if u(k) is random and dependent upon
v(k + 1) into (16.31), and use the fact that ji(k + Ilk) = x(k + 1) -
i(k + l(k), to obtain (16-32). w=P
16-2. For the innovations process i(k + Ilk), prove that E{Z(i + lli)Z(j + 11~))= 0
b. Becauseg(k + Ilk) and v(k + 1) are both zero mean, E{Z(k + Ilk)} = 0. when i < j.
The innovations is Gaussianbecausez(k + 1) and S(k + l/k) are Gaus- 16-3. In the proof of part (b) of Theorem 16-2 we make repeated use of the
sian, and, therefore, 2(k + l(k) is a linear transformation of Gaussian orthogonality principle, stated in Corollary 13-3. In the latter corollary f[S(k)]
appears to be a function of all of the measurements used in x̂(k|k). In the expression E{x̃(i + 1|i)z'(j + 1)}, i > j, z(j + 1) certainly is not a function of all the measurements used in x̃(i + 1|i). What is f[·], when we apply the orthogonality principle to E{x̃(i + 1|i)z'(j + 1)}, i > j, to conclude that this expectation is zero?
16-4. Refer to Problem 15-5. Assume that u(k) can be measured [e.g., u(k) might be the output of a random number generator], and that 𝒰 = col[u(1), u(2), . . . , u(k - 1)]. What is ẑ(k|k - 1)?
16-5. Consider the following autoregressive model, z(k + n) = a_1 z(k + n - 1) - a_2 z(k + n - 2) - ⋯ - a_n z(k) + w(k + n), in which w(k) is white noise. Measurements z(k), z(k + 1), . . . , z(k + n - 1) are available.
(a) Compute ẑ(k + n|k + n - 1).
(b) Explain why the result in (a) is the overall mean-squared prediction of z(k + n) even if w(k + n) is non-Gaussian.

Lesson 17

State Estimation: Filtering (the Kalman Filter)

INTRODUCTION

In this lesson we shall develop the Kalman filter, which is a recursive mean-squared error filter for computing x̂(k + 1|k + 1), k = 0, 1, 2, . . . . As its name implies, this filter was developed by Kalman [circa 1959 (Kalman, 1960)].
From the Fundamental Theorem of Estimation Theory, Theorem 13-1, we know that
x̂(k + 1|k + 1) = E{x(k + 1)|Z(k + 1)}   (17-1)
Our approach to developing the Kalman filter is to partition Z(k + 1) into two sets of measurements, Z(k) and z(k + 1), and to then expand the conditional expectation in terms of data sets Z(k) and z(k + 1), i.e.,
x̂(k + 1|k + 1) = E{x(k + 1)|Z(k), z(k + 1)}   (17-2)
What complicates this expansion is the fact that Z(k) and z(k + 1) are statistically dependent. Measurement vector Z(k) depends on state vectors x(1), x(2), . . . , x(k), because z(j) = H(j)x(j) + v(j) (j = 1, 2, . . . , k). Measurement vector z(k + 1) also depends on state vector x(k), because z(k + 1) = H(k + 1)x(k + 1) + v(k + 1) and x(k + 1) = Φ(k + 1, k)x(k) + Γ(k + 1, k)w(k) + Ψ(k + 1, k)u(k). Hence Z(k) and z(k + 1) both depend on x(k) and are, therefore, dependent.
Recall that x(k + 1), Z(k) and z(k + 1) are jointly Gaussian random vectors; hence, we can use Theorem 12-4 to express (17-2) as
x̂(k + 1|k + 1) = E{x(k + 1)|Z(k), z̃}   (17-3)
where
z̃ = z(k + 1) - E{z(k + 1)|Z(k)}   (17-4)
We immediately recognize z̃ as the innovations process z̃(k + 1|k) [see (16-29)]; thus, we rewrite (17-3) as
x̂(k + 1|k + 1) = E{x(k + 1)|Z(k), z̃(k + 1|k)}   (17-5)
Applying (12-37) to (17-5), we find that
x̂(k + 1|k + 1) = E{x(k + 1)|Z(k)} + E{x(k + 1)|z̃(k + 1|k)} - m_x(k + 1)   (17-6)
We recognize the first term on the right-hand side of (17-6) as the single-stage predicted estimator of x(k + 1), x̂(k + 1|k); hence,
x̂(k + 1|k + 1) = x̂(k + 1|k) + E{x(k + 1)|z̃(k + 1|k)} - m_x(k + 1)   (17-7)
This equation is the starting point for our derivation of the Kalman filter.
Before proceeding further, we observe, upon comparison of (17-2) and (17-5), that our original conditioning on z(k + 1) has been replaced by conditioning on the innovations process z̃(k + 1|k). One can show that z̃(k + 1|k) is computable from z(k + 1), and that z(k + 1) is computable from z̃(k + 1|k); hence, it is said that z(k + 1) and z̃(k + 1|k) are causally invertible (Anderson and Moore, 1979). We explain this statement more carefully at the end of this lesson.

A PRELIMINARY RESULT

In our derivation of the Kalman filter, we shall determine that
x̂(k + 1|k + 1) = x̂(k + 1|k) + K(k + 1)z̃(k + 1|k)   (17-8)
where K(k + 1) is an n × m (Kalman) gain matrix. We will calculate the optimal gain matrix in the next section.
Here let us view (17-8) as the structure of an arbitrary recursive linear filter, which is written in so-called predictor-corrector format; i.e., the filtered estimate of x(k + 1) is obtained by a predictor step, x̂(k + 1|k), and a corrector step, K(k + 1)z̃(k + 1|k). The predictor step uses information from the state equation, because x̂(k + 1|k) = Φ(k + 1, k)x̂(k|k) + Ψ(k + 1, k)u(k). The corrector step uses the new measurement available at t_{k+1}. The correction is proportional to the difference between that measurement and its best predicted value, ẑ(k + 1|k). The following result provides us with the means for evaluating x̃(k + 1|k + 1) in terms of its error-covariance matrix P(k + 1|k + 1).

Preliminary Result. Filtering error-covariance matrix P(k + 1|k + 1) for the arbitrary linear recursive filter (17-8) is computed from the following equation:
P(k + 1|k + 1) = [I - K(k + 1)H(k + 1)]P(k + 1|k)[I - K(k + 1)H(k + 1)]' + K(k + 1)R(k + 1)K'(k + 1)   (17-9)

Proof. Substitute (16-32) into (17-8) and then subtract the resulting equation from x(k + 1) in order to obtain
x̃(k + 1|k + 1) = [I - K(k + 1)H(k + 1)]x̃(k + 1|k) - K(k + 1)v(k + 1)   (17-10)
Substitute this equation into P(k + 1|k + 1) = E{x̃(k + 1|k + 1)x̃'(k + 1|k + 1)} to obtain equation (17-9). As in the proof of Theorem 16-2, we have used the fact that x̃(k + 1|k) and v(k + 1) are independent to show that E{x̃(k + 1|k)v'(k + 1)} = 0. □

The state prediction-error covariance matrix P(k + 1|k) is given by equation (16-11). Observe that (17-9) and (16-11) can be computed recursively, once gain matrix K(k + 1) is specified, as follows: P(0|0) → P(1|0) → P(1|1) → P(2|1) → P(2|2) → ⋯, etc.
It is important to reiterate the fact that (17-9) is true for any gain matrix, including the optimal gain matrix given next in Theorem 17-1.

THE KALMAN FILTER

Theorem 17-1
a. The mean-squared filtered estimator of x(k + 1), x̂(k + 1|k + 1), written in predictor-corrector format, is
x̂(k + 1|k + 1) = x̂(k + 1|k) + K(k + 1)z̃(k + 1|k)   (17-11)
for k = 0, 1, . . . , where x̂(0|0) = m_x(0), and z̃(k + 1|k) is the innovations process [z̃(k + 1|k) = z(k + 1) - H(k + 1)x̂(k + 1|k)].
b. K(k + 1) is an n × m matrix (commonly referred to as the Kalman gain matrix or weighting matrix) which is specified by the set of relations
K(k + 1) = P(k + 1|k)H'(k + 1)[H(k + 1)P(k + 1|k)H'(k + 1) + R(k + 1)]^(-1)   (17-12)
P(k + 1|k) = Φ(k + 1, k)P(k|k)Φ'(k + 1, k) + Γ(k + 1, k)Q(k)Γ'(k + 1, k)   (17-13)
State filtering-error covariance matrix P(k + 1|k + 1) is obtained from
P(k + 1|k + 1) = [I - K(k + 1)H(k + 1)]P(k + 1|k)   (17-14)
for k = 0, 1, . . . , where I is the n × n identity matrix, and P(0|0) = P_x(0).
c. The stochastic process {x̃(k + 1|k + 1), k = 0, 1, . . . }, which is defined by the filtering error relation
x̃(k + 1|k + 1) = x(k + 1) - x̂(k + 1|k + 1)   (17-15)
k = 0, 1, . . . , is a zero-mean Gauss-Markov sequence whose covariance matrix is given by (17-14).

Proof (Mendel, 1983b, pp. 56-57)
a. We begin with the formula for x̂(k + 1|k + 1) in (17-7). Recall that x(k + 1) and z(k + 1) are jointly Gaussian. Because z(k + 1) and z̃(k + 1|k) are causally invertible, x(k + 1) and z̃(k + 1|k) are also jointly Gaussian. Additionally, E{z̃(k + 1|k)} = 0; hence,
E{x(k + 1)|z̃(k + 1|k)} = m_x(k + 1) + P_xz̃(k + 1, k + 1|k)P_z̃^(-1)(k + 1|k)z̃(k + 1|k)   (17-16)
We define gain matrix K(k + 1) as
K(k + 1) = P_xz̃(k + 1, k + 1|k)P_z̃^(-1)(k + 1|k)   (17-17)
Substituting (17-16) and (17-17) into (17-7), we obtain the Kalman filter equation (17-11). Because x̂(k + 1|k) = Φ(k + 1, k)x̂(k|k) + Ψ(k + 1, k)u(k), equation (17-11) must be initialized by x̂(0|0), which we have shown must equal m_x(0) [see Equation (16-12)].
b. In order to evaluate K(k + 1) we must evaluate P_xz̃ and P_z̃. Matrix P_z̃ has been computed in (16-33). By definition of cross-covariance,
P_xz̃ = E{[x(k + 1) - m_x(k + 1)]z̃'(k + 1|k)} = E{x(k + 1)z̃'(k + 1|k)}   (17-18)
because z̃(k + 1|k) is zero-mean. Substituting (16-32) into this expression, we find that
P_xz̃ = E{x(k + 1)x̃'(k + 1|k)}H'(k + 1)   (17-19)
because E{x(k + 1)v'(k + 1)} = 0. Finally, expressing x(k + 1) as x̂(k + 1|k) + x̃(k + 1|k) and applying the orthogonality principle (13-15), we find that
P_xz̃ = P(k + 1|k)H'(k + 1)   (17-20)
Combining equations (17-20) and (16-33) into (17-17), we obtain equation (17-12) for the Kalman gain matrix.
State prediction-error covariance matrix P(k + 1|k) was derived in Lesson 16. State filtering-error covariance matrix P(k + 1|k + 1) is obtained by substituting (17-12) for K(k + 1) into (17-9), as follows (arguments are omitted for brevity, and P denotes P(k + 1|k)):
P(k + 1|k + 1) = (I - KH)P(I - KH)' + KRK'
= (I - KH)P - PH'K' + KHPH'K' + KRK'
= (I - KH)P - PH'K' + K(HPH' + R)K'   (17-21)
= (I - KH)P - PH'K' + PH'K'
= (I - KH)P
c. The proof that x̃(k + 1|k + 1) is zero-mean, Gaussian and Markov is so similar to the proof of part b of Theorem 16-1 that we omit its details [see Meditch (1969, pp. 181-182)]. □
OBSERVATIONS ABOUT THE KALMAN FILTER

1. Figure 17-1 depicts the interconnection of our basic dynamical system [equations (15-17) and (15-18)] and the Kalman filter system. The feedback nature of the Kalman filter is quite evident. Observe, also, that the Kalman filter contains within its structure a model of the plant.
The feedback nature of the Kalman filter manifests itself in two different ways, namely in the calculation of x̂(k + 1|k + 1) and also in the calculation of the matrix of gains, K(k + 1), both of which we shall explore below.
2. The predictor-corrector form of the Kalman filter is illuminating from an information-usage viewpoint. Observe that the predictor equations, which compute x̂(k + 1|k) and P(k + 1|k), use information only from the state equation, whereas the corrector equations, which compute K(k + 1), x̂(k + 1|k + 1) and P(k + 1|k + 1), use information only from the measurement equation.
3. Once the gain matrix is computed, then (17-11) represents a time-varying recursive digital filter. This is seen more clearly when equations (16-4) and (16-31) are substituted into (17-11). The resulting equation can be rewritten as
x̂(k + 1|k + 1) = [I - K(k + 1)H(k + 1)]Φ(k + 1, k)x̂(k|k) + K(k + 1)z(k + 1) + [I - K(k + 1)H(k + 1)]Ψ(k + 1, k)u(k)   (17-22)
for k = 0, 1, . . . . This is a state equation for state vector x̂, whose time-varying plant matrix is [I - K(k + 1)H(k + 1)]Φ(k + 1, k).
Equation (17-22) is time-varying even if our dynamical system in equations (15-17) and (15-18) is time-invariant and stationary, because gain matrix K(k + 1) is still time-varying in that case. It is possible,
however, for K(k + 1) to reach a limiting value (i.e., a steady-state value, K̄), in which case (17-22) reduces to a recursive constant-coefficient filter. We will have more to say about this important steady-state case in Lesson 19.
Equation (17-22) is in a recursive filter form, in that it relates the filtered estimate of x(k + 1), x̂(k + 1|k + 1), to the filtered estimate of x(k), x̂(k|k). Using substitutions similar to those used in the derivation of (17-22), one can also obtain the following recursive predictor form of the Kalman filter (details left as an exercise),
x̂(k + 1|k) = Φ(k + 1, k)[I - K(k)H(k)]x̂(k|k - 1) + Φ(k + 1, k)K(k)z(k) + Ψ(k + 1, k)u(k)   (17-23)
Observe that in (17-23) the predicted estimate of x(k + 1), x̂(k + 1|k), is related to the predicted estimate of x(k), x̂(k|k - 1). Interestingly enough, the recursive predictor (17-23), and not the recursive filter (17-22), plays an important role in mean-squared smoothing, as we shall see in Lesson 21.
The structures of (17-22) and (17-23) are summarized in Figure 17-2. This figure supports the claim made in Lesson 1 that our recursive estimators are nothing more than time-varying digital filters that operate on random (and also deterministic) inputs.

Figure 17-2 Input-output interpretation of the Kalman filter.

4. Embedded within the recursive Kalman filter equations is another set of recursive equations—(17-12), (17-13) and (17-14). Because P(0|0) initializes these calculations, these equations must be ordered as follows: P(k|k) → P(k + 1|k) → K(k + 1) → P(k + 1|k + 1) → ⋯, etc.
By combining these three equations it is possible to get a matrix recursive equation for P(k + 1|k) as a function of P(k|k - 1), or a similar equation for P(k + 1|k + 1) as a function of P(k|k). These equations are nonlinear and are known as matrix Riccati equations. For example, the matrix Riccati equation for P(k + 1|k) is
P(k + 1|k) = ΦP(k|k - 1){I - H'[HP(k|k - 1)H' + R]^(-1)HP(k|k - 1)}Φ' + ΓQΓ'   (17-24)

where we have omitted the temporal arguments on Φ, Γ, H, Q and R for notational simplicity [their correct arguments are Φ(k + 1, k), Γ(k + 1, k), H(k), Q(k) and R(k)]. The matrix Riccati equation for P(k + 1|k + 1) is obtained by substituting (17-13) into (17-14), and then (17-12) into the resulting equation. We leave its derivation as an exercise.
5. A measure of recursive predictor performance is provided by matrix P(k + 1|k). This covariance matrix can be calculated prior to any processing of real data, using its matrix Riccati Equation (17-24) or Equations (17-13), (17-14), and (17-12). A measure of recursive filter performance is provided by matrix P(k + 1|k + 1), and this covariance matrix can also be calculated prior to any processing of real data. Note that P(k + 1|k + 1) ≠ P(k + 1|k). These calculations are often referred to as performance studies. It is indeed interesting that the Kalman filter utilizes a measure of its mean-squared error during its real-time operation.
6. Two formulas are available for computing P(k + 1|k + 1), namely (17-14), which is known as the standard form, and (17-9), which is known as the stabilized form. Although the stabilized form requires more computations than the standard form, it is much less sensitive to numerical errors from the prior calculation of gain matrix K(k + 1) than is the standard form. In fact, one can show that first-order errors in the calculation of K(k + 1) propagate as first-order errors in the calculation of P(k + 1|k + 1) when the standard form is used, but only as second-order errors in the calculation of P(k + 1|k + 1) when the stabilized form is used. This is why (17-9) is called the stabilized form [for detailed derivations, see Aoki (1967), Jazwinski (1970), or Mendel (1973)].
7. On the subject of computation, the calculation of P(k + 1|k) is the most costly one for the Kalman filter, because of the term Φ(k + 1, k)P(k|k)Φ'(k + 1, k), which entails two multiplications of two n × n matrices [i.e., P(k|k)Φ'(k + 1, k) and Φ(k + 1, k)(P(k|k)Φ'(k + 1, k))]. Total computation for the two matrix multiplications is on the order of 2n³ multiplications and 2n³ additions [for more detailed computation counts, including storage requirements, for all of the Kalman filter equations, see Mendel (1971), Gura and Bierman (1971), and Bierman (1973a)].
One must be very careful to code the standard or stabilized forms for P(k + 1|k + 1) so that n × n matrices are never multiplied. We leave it to the reader to show that the standard algorithm can be coded in such a manner that it only requires on the order of mn² multiplications, whereas the stabilized algorithm can be coded in such a manner that it requires only a small constant multiple of mn² multiplications. In many applications, system order n is larger than the number of measurements m, so that mn² << n³. Usually, computation is most sensitive to system order; so, whenever possible, use low-order (but adequate) models.
8. Because of the equivalence between mean-squared, best-linear-unbiased, and weighted-least-squares filtered estimates of our state vector x(k) (see Lesson 14), we must realize that our Kalman filter equations are just a recursive solution to a system of normal equations (see Lesson 3). Other implementations of the Kalman filter that solve the normal equations using stable algorithms from numerical linear algebra (see, e.g., Bierman, 1977) and involve orthogonal transformations have better numerical properties than (17-11)-(17-14). We reiterate, however, that because this is a book on estimation theory, theoretical formulas such as those in (17-11)-(17-14) are appropriate.
9. In Lesson 4 we developed two forms for a recursive least-squares estimator, namely, the covariance and information forms. Compare K(k + 1) in (17-12) with the gain matrix in (4-25) to see that they have the same structure; hence, our formulation of the Kalman filter is often known as the covariance formulation. We leave it to the reader to show that K(k + 1) can also be computed as
K(k + 1) = P(k + 1|k + 1)H'(k + 1)R^(-1)(k + 1)   (17-25)
where P(k + 1|k + 1) is computed from
P^(-1)(k + 1|k + 1) = P^(-1)(k + 1|k) + H'(k + 1)R^(-1)(k + 1)H(k + 1)   (17-26)
When these equations are used along with (17-11) and (17-13) we have the information formulation of the Kalman filter. Of course, the ordering of the computations in these two formulations of the Kalman filter is different. See Lessons 4 and 9 for related discussions.
10. In Lesson 14 we showed that x̂_MAP(k|N) = x̂_MS(k|N); hence,
x̂_MAP(k|k) = x̂_MS(k|k)   (17-27)
This means that the Kalman filter also gives MAP estimates of the state vector x(k) for our basic state-variable model.
11. At the end of the introduction section in this lesson we mentioned that z(k + 1) and z̃(k + 1|k) are causally invertible. This means that we can compute one from the other using a causal (i.e., realizable) system. For example, when the measurements are available, then z̃(k + 1|k) can be obtained from Equations (17-23) and (16-31), which we repeat here for the convenience of the reader:
x̂(k + 1|k) = Φ(k + 1, k)[I - K(k)H(k)]x̂(k|k - 1) + Ψ(k + 1, k)u(k) + Φ(k + 1, k)K(k)z(k)   (17-28)
and
z̃(k + 1|k) = -H(k + 1)x̂(k + 1|k) + z(k + 1),   k = 0, 1, . . .   (17-29)
We refer to (17-28) and (17-29) in the rest of this book as the Kalman innovations system. It is initialized by x̂(0|-1) = 0.
On the other hand, if the innovations are given a priori, then z(k + 1) can be obtained from
x̂(k + 1|k) = Φ(k + 1, k)x̂(k|k - 1) + Ψ(k + 1, k)u(k) + Φ(k + 1, k)K(k)z̃(k|k - 1)   (17-30)
and
z(k + 1) = H(k + 1)x̂(k + 1|k) + z̃(k + 1|k),   k = 0, 1, . . .   (17-31)
Equation (17-30) was obtained by rearranging the terms in (17-28), and (17-31) was obtained by solving (17-29) for z(k + 1). Equation (17-30) is again initialized by x̂(0|-1) = 0.
Note that (17-30) and (17-31) are equivalent to our basic state-variable model in (15-17) and (15-18), from an input-output point of view. Consequently, model (17-30) and (17-31) is often the starting point for important problems such as the stochastic realization problem (e.g., Faurre, 1976).

PROBLEMS

17-1. Prove that x̃(k + 1|k + 1) is zero mean, Gaussian and first-order Markov.
17-2. Derive the recursive predictor form of the Kalman filter, given in (17-23).
17-3. Derive the matrix Riccati equation for P(k + 1|k + 1).
17-4. Show that gain matrix K(k + 1) can also be computed using (17-25).
17-5. Suppose a small error δK is made in the computation of the Kalman filter gain K(k + 1).
(a) Show that when P(k + 1|k + 1) is computed from the standard form equation, then, to first-order terms,
δP(k + 1|k + 1) = -δK(k + 1)H(k + 1)P(k + 1|k)
(b) Show that when P(k + 1|k + 1) is computed from the stabilized form equation, then, to first-order terms,
δP(k + 1|k + 1) ≈ 0
17-6. Consider the basic scalar system x(k + 1) = Φx(k) + w(k) and z(k + 1) = x(k + 1) + v(k + 1).
(a) Show that p(k + 1|k) ≥ q, which means that the variance of the system disturbance sets the performance limit on prediction accuracy.
(b) Show that 0 ≤ K(k + 1) ≤ 1.
(c) Show that 0 ≤ p(k + 1|k + 1) ≤ r.
17-7. An RC filter with time constant τ is excited by Gaussian white noise, and the output is measured every T seconds. The output at the sample times obeys the equation x(k) = e^(-T/τ)x(k - 1) + w(k - 1), where E{x(0)} = 1, E{x²(0)} = 2, q = 2, and T = τ = 0.1 sec. The measurements are described by z(k) = x(k) + v(k), where v(k) is a white but non-Gaussian noise sequence for which E{v(k)} = 0 and E{v²(k)} = 4. Additionally, w(k) and v(k) are uncorrelated. Measurements z(1) = 1.5 and z(2) = 3.0.
(a) Find the best linear estimate of x(1) based on z(1).
(b) Find the best linear estimate of x(2) based on z(1) and z(2).
17-8. Table 17-1 lists multiplications and additions for all the basic matrix operations used in a Kalman filter. Using the formulas for the Kalman filter given in Theorem 17-1, establish the number of multiplications and additions required to compute x̂(k + 1|k), P(k + 1|k), K(k + 1), x̂(k + 1|k + 1), and P(k + 1|k + 1).

TABLE 17-1 Operation Characteristics
Name                       Function                    Multiplications   Additions
Matrix addition            C_MN = A_MN + B_MN          —                 MN
Matrix subtraction         C_MN = A_MN - B_MN          —                 MN
Matrix multiply            C_ML = A_MN B_NL            MNL               ML(N - 1)
Matrix transpose multiply  C_MN = A_ML (B_NL)'         MNL               ML(N - 1)
Matrix inversion           A_NN → A_NN^(-1)            ≈N³               ≈N³
Scalar-vector product      c_N1 = βa_N1                N                 —

17-9. Show that the standard algorithm for computing P(k + 1|k + 1) can be coded to require only on the order of mn² multiplications, whereas the stabilized algorithm requires a small constant multiple of mn² multiplications (use Table 17-1). Note that this last result requires a very clever coding of the stabilized algorithm.
Lesson 18

State Estimation: Filtering Examples

INTRODUCTION

In this lesson (which is an excellent one for self-study) we present five examples, which illustrate some interesting numerical and theoretical aspects of Kalman filtering.

EXAMPLES

Example 18-1
In Lesson 17 we learned that the Kalman filter is a dynamical feedback system. Its gain matrix and predicted- and filtering-error covariance matrices comprise a matrix feedback system operating within the Kalman filter. Of course, these matrices can be calculated prior to processing of data, and such calculations constitute a performance analysis of the Kalman filter. Here we examine the results of these calculations for two second-order systems, H_1(z) = 1/(z² - 1.32z + 0.875) and H_2(z) = 1/(z² - 1.36z + 0.923). The second system is less damped than the first. Impulse responses of both systems are depicted in Figure 18-1.
In Figure 18-1 we also depict p_11(k|k), p_22(k|k), K_1(k) and K_2(k) versus k for both systems. In both cases P(0|0) was set equal to the zero matrix. For system 1, q = 1 and r = 5, whereas for system 2, q = 1 and r = 20. Observe that the error variances and Kalman gains exhibit a transient response as well as a steady-state response; i.e., after a certain value of k (k ≈ 10 for system 1 and k ≈ 15 for system 2) p_11(k|k), p_22(k|k), K_1(k), and K_2(k) reach limiting values. These limiting values do not depend on P(0|0), as can be seen from Figure 18-2. The Kalman filter is initially influenced by its initial conditions, but eventually ignores them, paying much greater attention to model parameters and the measurements. The relatively large steady-state values for p_11 and p_22 are due to the large values of r. □
Example 18-2
A state estimate is implicitly conditioned on knowing the true values for all system parameters (i.e., Φ, Γ, Ψ, H, Q, and R). Sometimes we do not know these values exactly; hence, it is important to learn how sensitive (i.e., robust) the Kalman filter is to parameter errors. Many references which treat Kalman filter sensitivity issues can be found in Mendel and Gieseking (1971) under category 2f, "State Estimation: Sensitivity Considerations."
Let θ denote any parameter that may appear in Φ, Γ, Ψ, H, Q, or R. In order to determine the sensitivity of the Kalman filter to small variations in θ one computes ∂x̂(k + 1|k + 1)/∂θ, whereas, for large variations in θ one computes Δx̂(k + 1|k + 1)/Δθ. An analysis of Δx̂(k + 1|k + 1)/Δθ, for example, reveals the interesting chain of events that: Δx̂(k + 1|k + 1)/Δθ depends on ΔK(k + 1)/Δθ, which in turn depends on ΔP(k + 1|k)/Δθ, which in turn depends on ΔP(k|k)/Δθ. Hence, for each variable parameter, θ, we have a Kalman filter sensitivity system comprised of equations from which we compute Δx̂(k + 1|k + 1)/Δθ (see also Lesson 26). An alternative to using these equations is to perform a computer perturbation study. For example,
ΔK(k + 1)/Δθ = [K(k + 1)|_(θ = θ_N + Δθ) - K(k + 1)|_(θ_N)]/Δθ   (18-1)
where θ_N denotes a nominal value of θ.
We define sensitivity as the ratio of the percentage change in a function [e.g., K(k + 1)] to the percentage change in a parameter, θ. For example,
S_θ^(K_ij(k + 1)) = [ΔK_ij(k + 1)/K_ij(k + 1)] / [Δθ/θ_N]   (18-2)
denotes the sensitivity of the ijth element of matrix K(k + 1) with respect to parameter θ. All other sensitivities, such as S_θ^(p(k|k)), are defined similarly.
Here we present some numerical sensitivity results for the simple first-order system
x(k + 1) = a x(k) + b w(k)   (18-3)
z(k) = h x(k) + n(k)   (18-4)
where a_N = 0.7, b_N = 1.0, h_N = 0.5, q_N = 0.2, and r_N = 0.1. Figure 18-3 depicts S_a^(K(k + 1)), S_b^(K(k + 1)), S_h^(K(k + 1)), S_q^(K(k + 1)), and S_r^(K(k + 1)) for parameter variations of ±5%, ±10%, ±20%, and ±50% about nominal values a_N, b_N, h_N, q_N, and r_N.
Observe that the sensitivity functions vary with time and that they all reach steady-state values. Table 18-1 summarizes the steady-state sensitivity coefficients S_θ^(K(k + 1)), S_θ^(P(k|k)) and S_θ^(P(k + 1|k)).
Some conclusions that can be drawn from these numerical results are: (1) K(k + 1), P(k|k) and P(k + 1|k) are most sensitive to changes in parameter b, and
(2) S_θ^(K(k + 1)) = S_θ^(P(k|k)) for θ = a, b, and q. This last observation could have been foreseen, because of our alternate equation for K(k + 1) [Equation (17-25)],
K(k + 1) = P(k + 1|k + 1)H'(k + 1)R^(-1)(k + 1)   (18-5)
This expression shows that if h and r are fixed, then K(k + 1) varies exactly the same way as P(k + 1|k + 1).

Figure 18-3 Sensitivity plots (a)-(e).

TABLE 18-1 Steady-State Sensitivity Coefficients

(a) S_θ^(K̄(k + 1))            Percentage change in θ
θ     +50     +20     +10     +5      -5      -10     -20     -50
a     .506    .451    .430    .419    .396    .384    .361    .294
b     .838    .937    .972    .989    1.025   1.042   1.077   1.158
h     -.108   -.053   -.026   -.010   .026    .047    .096    .317
q     .417    .465    .483    .493    .515    .526    .551    .648
r     -.393   -.452   -.476   -.490   -.518   -.534   -.570   -.717

(b) S_θ^(p̄(k|k))              Percentage change in θ
θ     +50     +20     +10     +5      -5      -10     -20     -50
a     .506    .451    .430    .419    .396    .384    .361    .294
b     .838    .937    .972    .989    1.025   1.042   1.077   1.158
h     -.739   -.877   -.932   -.962   -1.025  -1.059  -1.130  -1.367
q     .417    .465    .483    .493    .515    .526    .551    .648
r     -.410   -.457   -.476   -.486   -.507   -.519   -.544   -.642

(c) S_θ^(p̄(k + 1|k))          Percentage change in θ
θ     +50     +20     +10     +5      -5      -10     -20     -50
a     1.047   .820    .754    .723    .664    .637    .585    .453
b     2.022   1.836   1.775   1.745   1.684   1.653   1.591   1.402
h     -.213   -.253   -.268   -.277   -.295   -.305   -.325   -.393
q     .832    .846    .851    .854    .860    .864    .871    .899
r     .118    .131    .137    .140    .146    .149    .157    .185

Here is how to use the results in Table 18-1. From Equation (18-2), for example, we see that
ΔK_ij(k + 1)/K_ij(k + 1) = (Δθ/θ_N) S_θ^(K_ij(k + 1))   (18-6)
or
% change in K_ij(k + 1) = (% change in θ) × S_θ^(K_ij(k + 1))   (18-7)
From Table 18-1 and this formula we see, for example, that a 20% change in a produces a (20)(.451) = 9.02% change in K̄, whereas a 20% change in h only produces a (20)(-.053) = -1.06% change in K̄, etc. □

Example 18-3
In the single-channel case, when w(k), z(k) and v(k) are scalars, then K(k + 1), P(k + 1|k) and P(k + 1|k + 1) do not depend on q and r separately. Instead, as we demonstrate next, they depend only on the ratio q/r. In this case, H = h', Γ = γ, Q = q and R = r, and Equations (17-12), (17-13) and (17-14) can be expressed as
K(k + 1) = [P(k + 1|k)/r] h {h'[P(k + 1|k)/r]h + 1}^(-1)   (18-8)
P(k + 1|k)/r = Φ[P(k|k)/r]Φ' + γ(q/r)γ'   (18-9)
P(k + 1|k + 1)/r = [I - K(k + 1)h'][P(k + 1|k)/r]   (18-10)
Observe that, given ratio q/r, we can compute P(k + 1|k)/r, K(k + 1), and P(k + 1|k + 1)/r. We refer to Equations (18-8), (18-9), and (18-10) as the scaled Kalman filter equations.
Ratio q/r can be viewed as a filter tuning parameter. Recall that, in Lesson 15, we showed q/r is related to signal-to-noise ratio; thus, using (15-50), for example, we can also view signal-to-noise ratio as a (single-channel) Kalman filter tuning parameter. Suppose, for example, that data quality is quite poor, so that signal-to-noise ratio, as measured by SNR, is very small. Then q/r will also be very small, because q/r = SNR/[h'(P̄_x/q)h]. In this case the Kalman filter rejects the low-quality data, i.e., the Kalman gain matrix approaches the zero matrix, because
P(k + 1|k)/r ≈ Φ[P(k|k)/r]Φ'   (18-11)
The Kalman filter is therefore quite sensitive to signal-to-noise ratio, as indeed are most digital filters. Its dependence on signal-to-noise ratio is complicated and nonlinear. Although signal-to-noise ratio (or q/r) enters quite simply into the equation for P(k + 1|k)/r, it is transformed in a nonlinear manner via (18-8) and (18-10). □
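A minimal sketch of the scaled Kalman filter equations (18-8)-(18-10), assuming NumPy and a single-channel model; names and the calling convention are illustrative only.

import numpy as np

def scaled_kalman_recursion(Phi, gamma, h, q_over_r, P0_over_r, n_steps):
    """Iterate the scaled equations (18-8)-(18-10); covariances are divided by r."""
    P_filt = np.asarray(P0_over_r, float)       # P(k|k)/r
    gains = []
    for _ in range(n_steps):
        P_pred = Phi @ P_filt @ Phi.T + q_over_r * np.outer(gamma, gamma)   # (18-9)
        K = (P_pred @ h) / (h @ P_pred @ h + 1.0)                           # (18-8)
        P_filt = (np.eye(len(h)) - np.outer(K, h)) @ P_pred                 # (18-10)
        gains.append(K)
    return gains, P_filt

Running this for several values of q_over_r is one way to see the tuning behavior described above: as q/r shrinks toward zero, the computed gains shrink toward zero as well.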
Example 18-4
A recursive unbiased minimum-variance estimator (BLUE) of a random parameter vector θ can be obtained from the Kalman filter equations in Theorem 17-1 by setting x(k) = θ, Φ(k + 1, k) = I, Γ(k + 1, k) = 0, Ψ(k + 1, k) = 0, and Q(k) = 0. Under these conditions we see that w(k) = 0 for all k, and
x(k + 1) = x(k)
which means, of course, that x(k) is a vector of constants, namely θ. The Kalman equations reduce to
x̂(k + 1|k + 1) = x̂(k|k) + K(k + 1)[z(k + 1) - H(k + 1)x̂(k|k)]   (18-12)
K(k + 1) = P(k|k)H'(k + 1)[H(k + 1)P(k|k)H'(k + 1) + R(k + 1)]^(-1)   (18-13)
P(k + 1|k + 1) = [I - K(k + 1)H(k + 1)]P(k|k)   (18-14)
Note that it is no longer necessary to distinguish between filtered and predicted quantities, because x̂(k + 1|k) = x̂(k|k) and P(k + 1|k) = P(k|k); hence, the notation x̂(k|k) can be simplified to θ̂(k), for example. Equations (18-12), (18-13), and (18-14) were obtained earlier in Lesson 9 (see Theorem 9-7) for the case of scalar measurements. □
Example 18-5
This example illustrates the divergence phenomenon, which often occurs when either process noise or measurement noise, or both, are small. We shall see that the Kalman filter locks onto wrong values for the state, but believes them to be the true values, i.e., it learns the wrong state too well.
Our example is adapted from Jazwinski (1970, pp. 302-303). We begin with the following simple first-order system
x(k + 1) = x(k) + b   (18-15)
z(k + 1) = x(k + 1) + v(k + 1)   (18-16)
where b is a very small bias, so small that, when we design our Kalman filter, we choose to neglect it. Our Kalman filter is based on the following model
x_m(k + 1) = x_m(k)   (18-17)
z(k + 1) = x_m(k + 1) + v(k + 1)   (18-18)
Using this model we compute x̂_m(k|k) and p_m(k|k), where it is straightforward to show that
x̂_m(k + 1|k + 1) = x̂_m(k|k) + K(k + 1)[z(k + 1) - x̂_m(k|k)],   K(k + 1) = p_m(0)/[(k + 1)p_m(0) + r]   (18-19)
Observe that, as k → ∞, K(k + 1) → 0, so that x̂_m(k + 1|k + 1) → x̂_m(k|k). The Kalman filter is rejecting the new measurements because it believes (18-17) to be the true model for x(k); but, of course, it is not the true model.
The Kalman filter computes the error variance, p_m(k|k), between x̂_m(k|k) and x_m(k). The true error variance is associated with x̃(k|k), where
x̃(k|k) = x(k) - x̂_m(k|k)   (18-20)
We leave it to the reader to derive the corresponding expression for x̃(k|k), given in (18-21). As k → ∞, x̃(k|k) → ∞, because the third term on the right-hand side of (18-21) diverges to infinity. This term contains the bias b that was neglected in the model used by the Kalman filter. Note also that x̃_m(k|k) = x_m(k) - x̂_m(k|k) → 0 as k → ∞; thus, the Kalman filter has locked onto the wrong state and is unaware that the true error variance is diverging.
A number of different remedies have been proposed for controlling divergence effects, including:
1. adding fictitious process noise,
2. finite-memory filtering, and
3. fading-memory filtering.
Fictitious process noise, which appears in the state equation, can be used to account for neglected modeling effects that enter into the state equation (e.g., truncation of second- and higher-order effects when a nonlinear state equation is linearized, as described in Lesson 24). This process noise introduces Q into the Kalman filter equations; observe, in our first-order example, that Q does not appear in the equations for x̂_m(k + 1|k + 1) or x̃(k|k), because state equation (18-17) contains no process noise.
Divergence is a large-sample property of the Kalman filter. Finite-memory and fading-memory filtering control divergence by not letting the Kalman filter get into its large-sample regime. Finite-memory filtering (Jazwinski, 1970) uses a finite window of measurements (of fixed length W) to estimate x(k). As we move from t = k_1 to t = k_1 + 1, we must account for two effects, namely, the new measurement at t = k_1 + 1 and a discarded measurement at t = k_1 - W. Fading-memory filtering, due to Sorenson and Sacks (1971), exponentially ages the measurements, weighting the recent measurement most heavily and past measurements much less heavily. It is analogous to weighted least squares, as described in Lesson 3.
Fading-memory filtering seems to be the most successful and popular way to control divergence effects. □

PROBLEMS

18-1. Derive the equations for x̂_m(k + 1|k + 1) and x̃(k|k), in (18-19) and (18-21), respectively.
18-2. In Lesson 5 we described cross-sectional processing for weighted least-squares estimates. Cross-sectional (also known as sequential) processing can be performed in Kalman filtering. Suppose z(k + 1) = col(z_1(k + 1), z_2(k + 1), . . . , z_q(k + 1)), where z_i(k + 1) = H_i x(k + 1) + v_i(k + 1), the v_i(k + 1) are mutually uncorrelated for i = 1, 2, . . . , q, z_i(k + 1) is m_i × 1, and m_1 + m_2 + ⋯ + m_q = m. Let x̂_i(k + 1|k + 1) be a corrected estimate of x(k + 1) that is associated with processing z_i(k + 1).
(a) Using the Fundamental Theorem of Estimation Theory, prove that a cross-sectional structure for the corrector equation of the Kalman filter is:
x̂_1(k + 1|k + 1) = x̂(k + 1|k) + E{x(k + 1)|z̃_1(k + 1|k)}
x̂_2(k + 1|k + 1) = x̂_1(k + 1|k + 1) + E{x(k + 1)|z̃_2(k + 1|k)}
⋮
x̂_q(k + 1|k + 1) = x̂_(q-1)(k + 1|k + 1) + E{x(k + 1)|z̃_q(k + 1|k)} = x̂(k + 1|k + 1)
(b) Provide equations for computing E{x(k + 1)|z̃_i(k + 1|k)}.
18-3. (Project) Choose a second-order system and perform a thorough sensitivity study of its associated Kalman filter. Do this for various nominal values and for both small and large variations of the system's parameters. You will need a computer for this project. Present the results both graphically and tabularly, as in Example 18-2. Draw as many conclusions as possible.
Lesson 19

State Estimation: Steady-State Kalman Filter and Its Relationship to a Digital Wiener Filter

INTRODUCTION

In this lesson we study the steady-state Kalman filter from different points of view, and we then show how it is related to a digital Wiener filter.

STEADY-STATE KALMAN FILTER

For time-invariant and stationary systems, if lim_{k→∞} P(k + 1|k) = P̄ exists, then lim_{k→∞} K(k) = K̄, and the Kalman filter (17-11) becomes a constant-coefficient filter. Because P(k + 1|k) and P(k|k) are intimately related, then, if P̄ exists, lim_{k→∞} P(k|k) = P̄_1 also exists. We have already observed limiting behaviors for K(k + 1) and P(k|k) in Example 18-1. The following theorem, which is adopted from Anderson and Moore (1979, pg. 77), tells us when P̄ exists, and assures us that the steady-state Kalman filter will be asymptotically stable.

Theorem 19-1 (Steady-State Kalman Filter). If our dynamical model in Equations (15-17) and (15-18) is time-invariant, stationary, and asymptotically stable (i.e., all the eigenvalues of Φ lie inside the unit circle), then:
a. For any nonnegative-definite symmetric initial condition P(0|-1), one has lim_{k→∞} P(k + 1|k) = P̄, with P̄ independent of P(0|-1) and satisfying the following steady-state version of Equation (17-24):
P̄ = Φ[P̄ - P̄H'(HP̄H' + R)^(-1)HP̄]Φ' + ΓQΓ'   (19-1)
Equation (19-1) is often referred to either as a steady-state or algebraic Riccati equation.
b. The eigenvalues of the steady-state Kalman filter, λ_i[Φ - K̄HΦ], all lie within the unit circle, so that the filter is asymptotically stable; i.e.,
|λ_i[Φ - K̄HΦ]| < 1   (19-2)
If our dynamical model in Equations (15-17) and (15-18) is time-invariant and stationary, but is not necessarily asymptotically stable, then points (a) and (b) still hold as long as the system is completely stabilizable and detectable. □

A proof of this theorem is beyond the scope of this textbook. It can be found in Anderson and Moore, pp. 78-82 (1979). For definitions of the system-theoretic terms stabilizable and detectable, the reader should consult a textbook on linear systems, such as Kailath (1980) or Chen (1970). By completely detectable and completely stabilizable, we mean that (Φ, H) is completely detectable and (Φ, ΓQ_1) is completely stabilizable, where Q = Q_1Q_1'. Additionally, any asymptotically stable model is always completely stabilizable and detectable.
Probably the most interesting case of a system that is not asymptotically stable, for which we want to design a steady-state Kalman filter, is one that has a pole on the unit circle.

Example 19-1
In this example (which is similar to Example 5.4 in Meditch, 1969, pp. 189-190) we consider the scalar system
x(k + 1) = x(k) + w(k)   (19-3)
z(k + 1) = x(k + 1) + v(k + 1)   (19-4)
It has a pole on the unit circle. When q = 20 and r = 5, Equations (17-13), (17-12) and (17-14) reduce to:
p(k + 1|k) = p(k|k) + 20   (19-5)
K(k + 1) = [p(k|k) + 20]/[p(k|k) + 25]   (19-6)
p(k + 1|k + 1) = 5[p(k|k) + 20]/[p(k|k) + 25]   (19-7)
Starting with p(0|0) = p_x(0) = 50, it is a relatively simple matter to compute p(k + 1|k), K(k + 1) and p(k + 1|k + 1) for k = 0, 1, . . . . The results for the first few iterations are given in Table 19-1.
TABLE 19-1 Kalman Filter Quantities
k     p(k|k - 1)     K(k)     p(k|k)
0     —              —        50
1     70             0.933    4.67
2     24.67          0.831    4.16
3     24.16          0.829    4.14
4     24.14          0.828    4.14

The steady-state value of p̄_1 is obtained by setting p(k + 1|k + 1) = p(k|k) = p̄_1 in the last of the above three relations, to obtain
p̄_1 = 5(p̄_1 + 20)/(p̄_1 + 25)   (19-8)
Because p̄_1 is a variance it must be nonnegative; hence, only the solution p̄_1 = 4.14 is valid. Comparing this result with p(3|3) we see that the filter is in the steady state, to within the indicated computational accuracy, after processing just three measurements.
In steady state, the Kalman filter equation is
x̂(k + 1|k + 1) = x̂(k|k) + 0.828[z(k + 1) - x̂(k|k)] = 0.172x̂(k|k) + 0.828z(k + 1)   (19-9)
Observe that the filter's pole at 0.172 lies inside the unit circle. □
Many ways have been reported for solving the algebraicRiccati equation
(19-l) [see Laub (1979), for example], ranging from direct iteration of the where
matrix Riccati Equation (17-24)until P(k + l!k) does not changeappreciably (19-16)
from P(k jk - I), to solving the nonlinear algebraic Riccati equation via an
iterative Newton-Raphson procedure, to solving that equation in one shot by Equation (19-15) can also be written as
the Schur method. Iterative methods are quite sensitive to error accumu- Hf(Z) = h,(O) + h,(l)z- + hf(2)2-? + *.* (19-17)
lation. The one-shot Schur method possessesa high degree of numerical
integrity, and appearsto be one of the most successfulwaysfor obtaining F. where the filters coefficients (i.e., Markov parameters), h, (j), are
For details about this method, seeLaub (1979). .-
h,(j) = h@! K (19-18)
In summary, then, to design a steady-stateKalman filter:
forj =O,l,....
1. given (@,r,V,H,Q,R), compute F, the positive definite solution of In our study of mean-squaredsmoothing, in Lessons 20,21, and 22, we
(19-l); will see that the steady-state predictor system plays an important role. The
steady-state predictor, obtained from (17-23), is given by
2. computez, as
ii(k + Ilk) = @&kIk - 1) + y,z(k) (19-19)
ii=Fw(~F~r + R)-l (19-10)
3. use (19-10) in
(I$, = @(I - Kh) (19-20)
k(k + lik + 1) = Wi(kik) + h(k) + %(k + Ilk) and
= (I - EH)@k(k !k) + itz(k + 1) (19-11)
y,=@E (19-21)
+ (I - KH)h(k]
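To make the three-step design procedure concrete, here is a minimal numerical sketch (Python; the variable names are ours, not the text's). It iterates the Riccati recursion of Example 19-1 until p(k + 1|k) stops changing, and then forms the constant-coefficient filter of (19-11); the printed values agree with Table 19-1 (p̄₁ ≈ 4.14, K̄ ≈ 0.828, filter pole ≈ 0.172).

# Scalar model of Example 19-1: x(k+1) = x(k) + w(k), z(k+1) = x(k+1) + v(k+1),
# with q = 20, r = 5 and p(0|0) = 50.  Direct iteration of (19-5)-(19-7).
phi, h, q, r = 1.0, 1.0, 20.0, 5.0
p = 50.0
for k in range(25):
    p_pred = phi * p * phi + q                # p(k+1|k)
    K = p_pred * h / (h * p_pred * h + r)     # K(k+1)
    p = (1.0 - K * h) * p_pred                # p(k+1|k+1)
print(p, K, 1.0 - K)                          # ~4.142, ~0.828, filter pole ~0.172

# Steady-state filter (19-11) with u(k) = 0:
# xhat(k+1|k+1) = (1 - K*h)*phi*xhat(k|k) + K*z(k+1)
def steady_state_filter(z, K=K, a=(1.0 - K * h) * phi):
    xhat, out = 0.0, []
    for zk in z:
        xhat = a * xhat + K * zk
        out.append(xhat)
    return out

The same loop, with matrices in place of scalars, implements step 1 of the design procedure for a multichannel model; the Schur-based one-shot solution mentioned above would replace the loop.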

Let H_p(z) denote the z-transform of the impulse response of the steady-state predictor, i.e.,

H_p(z) = Z{ẑ(k|k - 1) when z(k) = δ(k)}  (19-22)

This transfer function is found from the following steady-state predictor system:

x̂(k + 1|k) = Φ_p x̂(k|k - 1) + γ_p z(k)  (19-23)
ẑ(k|k - 1) = h'x̂(k|k - 1)  (19-24)

Taking the z-transform of these equations, we find that

H_p(z) = h'(I - Φ_p z⁻¹)⁻¹z⁻¹γ_p  (19-25)

which can also be written as

H_p(z) = h_p(1)z⁻¹ + h_p(2)z⁻² + · · ·  (19-26)

where the predictor's coefficients, h_p(j), are

h_p(j) = h'Φ_p^(j-1)γ_p  (19-27)

for j = 1, 2, . . . .

Although H_f(z) and H_p(z) contain an infinite number of terms, they can usually be truncated, because both are associated with asymptotically stable filters.

Example 19-2
In this example we examine the response of the steady-state predictor for the first-order system

x(k + 1) = (1/√2)x(k) + w(k)  (19-28)
z(k + 1) = (1/√2)x(k + 1) + v(k + 1)  (19-29)

In Example 15-2 of Lesson 15, we showed that ratio q/r is proportional to signal-to-noise ratio, SNR. For this example, SNR = q/r. Our numerical results, given below, are for SNR = 20, 5, 1 and use the scaled steady-state prediction-error variance p̄/r, which can be solved for from (19-1) when it is written as:

p̄/r = (p̄/r) / [(p̄/r) + 2] + SNR  (19-30)

Note that we could just as well have solved (19-1) for p̄/q, but the structure of its equation is more complicated than (19-30). The positive solution of (19-30) is

p̄/r = {(SNR - 1) + √[(SNR - 1)² + 8 SNR]} / 2  (19-31)

Additionally,

K̄ = √2(p̄/r) / [(p̄/r) + 2]  (19-32)

Table 19-2 summarizes p̄/r, K̄, Φ_p and γ_p, quantities which are needed to compute the impulse response h_p(k) for k ≥ 0, which is depicted in Figure 19-1. Observe that all three responses peak at j = 1; however, the decay time for SNR = 20 is quicker than the decay times for lower SNR values, and the peak amplitude is larger for higher SNR values. The steady-state predictor tends to reject measurements which have a low signal-to-noise ratio (the same is true for the steady-state filter). □

TABLE 19-2 Steady-State Predictor Quantities

SNR    p̄/r      K̄       Φ_p     γ_p
20    20.913   1.287   0.064   0.910
 5     5.742   1.049   0.183   0.742
 1     1.414   0.586   0.414   0.414

[Figure 19-1 Impulse response of the steady-state predictor for SNR = 20, 5, and 1; H_p(z) is shown to three significant figures (Mendel, 1981, © 1981, IEEE). The plotted expansions are: SNR = 20: H_p(z) = 0.643z⁻¹ + 0.041z⁻² + 0.003z⁻³; SNR = 5: H_p(z) = 0.524z⁻¹ + 0.096z⁻² + 0.008z⁻³ + 0.001z⁻⁴; SNR = 1: H_p(z) = 0.293z⁻¹ + 0.121z⁻² + 0.050z⁻³ + 0.021z⁻⁴ + 0.009z⁻⁵ + 0.004z⁻⁶ + 0.001z⁻⁷ + 0.001z⁻⁸.]
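The entries of Table 19-2 and the impulse responses of Figure 19-1 can be checked with a few lines of code. The sketch below is based on our reading of the (partly illegible) Example 19-2 parameters, φ = h = 1/√2 with r = 1 and q = SNR; it solves (19-30) via (19-31) and then forms Φ_p, γ_p, and h_p(j) from (19-20), (19-21), and (19-27). It essentially reproduces Table 19-2 (small last-digit differences come from rounding).

import math

phi = h = 1.0 / math.sqrt(2.0)
r = 1.0
for snr in (20.0, 5.0, 1.0):
    # positive root of the scaled steady-state equation, (19-31)
    s = 0.5 * ((snr - 1.0) + math.sqrt((snr - 1.0) ** 2 + 8.0 * snr))   # p_bar / r
    p_bar = s * r
    K_bar = p_bar * h / (h * h * p_bar + r)          # steady-state Kalman gain
    phi_p = phi * (1.0 - K_bar * h)                  # (19-20)
    gamma_p = phi * K_bar                            # (19-21)
    h_p = [h * phi_p ** (j - 1) * gamma_p for j in range(1, 9)]   # (19-27)
    print(snr, round(s, 3), round(K_bar, 3), round(phi_p, 3), round(gamma_p, 3),
          [round(c, 3) for c in h_p[:4]])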
Example 19-3
In this example we present impulse response and frequency response plots for the steady-state predictors associated with the systems

H₁(z)  (19-33)

and

H₂(z) = [-0.688z³ + 1.651z² - 1.221z + 0.25] / [z⁴ - 2.586z³ + 2.489z² - 1.033z + 0.168]  (19-34)

Matrices Φ, γ and h for these systems are obtained directly from the transfer functions (19-33) and (19-34) (e.g., in companion form).

Figure 19-2 depicts h_P1(k), |H_P1(jω)| (in dB) and ∠H_P1(jω), as well as h_P2(k), |H_P2(jω)| and ∠H_P2(jω), for SNR = 1, 5, and 20. Figure 19-3 depicts comparable quantities for the second system. Observe that, as signal-to-noise ratio decreases, the steady-state predictor rejects the measurements, for the amplitudes of h_P1(k) and h_P2(k) become smaller as SNR becomes smaller. It also appears, from examination of |H_P1(jω)| and |H_P2(jω)|, that at high signal-to-noise ratios the steady-state predictor behaves like a high-pass filter for system 1 and a bandpass filter for system 2. On the other hand, at low signal-to-noise ratios it appears to behave like a band-pass filter.

The steady-state predictor appears to be quite dependent on system dynamics at high signal-to-noise ratios, but is much less dependent on the system dynamics at low signal-to-noise ratios. At low signal-to-noise ratios the predictor is rejecting the measurements regardless of system dynamics.

Just because the steady-state predictor behaves like a high-pass filter at high signal-to-noise ratios does not mean that it passes a lot of noise through it, because high signal-to-noise ratio means that measurement noise level is quite low. Of course, a spurious burst of noise would pass through this filter quite easily. □

[Figure 19-2 Impulse and frequency responses for system H₁(z) and its associated steady-state predictor (impulse response, amplitude of frequency response, and phase of frequency response, for SNR = 1, 5, and 20).]
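The quantities plotted in Figures 19-2 and 19-3 can be generated numerically. The following sketch (Python with NumPy/SciPy; our variable names) does this for H₂(z) of (19-34) — the coefficients of (19-33) are not legible in this copy — by realizing the transfer function in companion form, iterating the Riccati equation directly, and evaluating the predictor frequency response from (19-19)-(19-21) and (19-25). The tf2ss realization differs from the book's (Φ, γ, h) only by a similarity transformation, which leaves H_p unchanged.

import numpy as np
from scipy.signal import tf2ss

num = [-0.688, 1.651, -1.221, 0.25]
den = [1.0, -2.586, 2.489, -1.033, 0.168]
Phi, gam, h, _ = tf2ss(num, den)               # companion-form realization of (19-34)
r = 1.0
for snr in (1.0, 5.0, 20.0):
    q = snr * r
    P = np.eye(4)                              # direct iteration of the Riccati equation (19-1)
    for _ in range(500):
        S = h @ P @ h.T + r
        P = Phi @ (P - P @ h.T @ np.linalg.inv(S) @ h @ P) @ Phi.T + q * (gam @ gam.T)
    K = P @ h.T @ np.linalg.inv(h @ P @ h.T + r)      # steady-state Kalman gain, (19-10)
    Phi_p = Phi @ (np.eye(4) - K @ h)                 # (19-20)
    gam_p = Phi @ K                                   # (19-21)
    w = np.linspace(0.0, np.pi, 256)
    Hp = np.array([(h @ np.linalg.inv(np.exp(1j * wi) * np.eye(4) - Phi_p) @ gam_p)[0, 0]
                   for wi in w])                      # H_p(e^{jw}), cf. (19-25)
    print(snr, np.max(np.abs(Hp)))                    # peak amplitude grows with SNR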
[Figure 19-3 Impulse and frequency responses for system H₂(z) and its associated steady-state predictor (impulse response, amplitude of frequency response, and phase of frequency response, for SNR = 1, 5, and 20).]

RELATIONSHIPS BETWEEN THE STEADY-STATE KALMAN FILTER AND A FINITE IMPULSE RESPONSE DIGITAL WIENER FILTER

The steady-state Kalman filter is a recursive digital filter with filter coefficients equal to h_f(j), j = 0, 1, . . . [see (19-18)]. Quite often h_f(j) ≈ 0 for j ≥ J, so that H_f(z) can be truncated, i.e.,

H_f(z) ≈ h_f(0) + h_f(1)z⁻¹ + · · · + h_f(J)z⁻ᴶ  (19-35)

The truncated steady-state Kalman filter can then be implemented as a finite-impulse response (FIR) digital filter.

There is a more direct way for designing a FIR minimum mean-squared error filter, i.e., a digital Wiener filter, as we describe next.

Consider the situation depicted in Figure 19-4. We wish to design digital filter F(z)'s coefficients f(0), f(1), . . . , f(η) so that the filter's output, y(k), is close, in some sense, to a desired signal d(k). In a digital Wiener filter design, f(0), f(1), . . . , f(η) are obtained by minimizing the following mean-squared error:

I(f) = E{[d(k) - y(k)]²} = E{ẽ²(k)}  (19-36)

[Figure 19-4 Starting point for design of a FIR digital Wiener filter: measurement z(k) drives FIR filter F(z), whose output is y(k).]

Using the fact that

y(k) = f(k)*z(k) = Σ_{i=0}^{η} f(i)z(k - i)  (19-37)

we see that

I(f) = E{[d(k) - Σ_{i=0}^{η} f(i)z(k - i)]²}  (19-38)

The filter coefficients that minimize I(f) are found by setting ∂I(f)/∂f(j) = 0 for j = 0, 1, . . . , η. We leave it as an exercise for the reader to show that doing this results in the following linear system of η + 1 equations in the η + 1 unknown filter coefficients:

Σ_{i=0}^{η} f(i)φ_zz(i - j) = φ_zd(j),   j = 0, 1, . . . , η  (19-39)

where φ_zz(·) is the auto-correlation function of filter input z(k), and φ_zd(·) is the cross-correlation function between z(k) and d(k). Equations (19-39) are known as the discrete-time Wiener-Hopf equations. They are a system of normal equations, and can be solved in many different ways, the fastest of which is by the Levinson algorithm (Treitel and Robinson, 1966).

The minimum mean-squared error, I*(f), can be shown to be given by

I*(f) = φ_dd(0) - Σ_{i=0}^{η} f(i)φ_zd(i)  (19-40)

One property of the digital Wiener filter is that I*(f) becomes smaller as η, the number of filter coefficients, increases. In general, I*(f) approaches a nonzero limiting value, a value that is often reached for modest values of η.
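Equations (19-39) and (19-40) translate directly into code. The sketch below (Python; the sample-correlation estimates and names are ours) estimates φ_zz and φ_zd from data records z(k) and d(k), solves the η + 1 normal equations with a Toeplitz solver, and evaluates the minimum mean-squared error. SciPy's solve_toeplitz uses a Levinson-type recursion, i.e., the fast route mentioned above.

import numpy as np
from scipy.linalg import solve_toeplitz

def fir_wiener(z, d, eta):
    """Solve the discrete-time Wiener-Hopf equations (19-39) for f(0), ..., f(eta)."""
    z, d = np.asarray(z, float), np.asarray(d, float)
    N = len(z)
    # biased sample correlations phi_zz(m) and phi_zd(m), m = 0, ..., eta
    phi_zz = np.array([np.dot(z[:N - m], z[m:]) / N for m in range(eta + 1)])
    phi_zd = np.array([np.dot(z[:N - m], d[m:]) / N for m in range(eta + 1)])
    f = solve_toeplitz(phi_zz, phi_zd)           # normal equations, Levinson recursion
    I_star = np.dot(d, d) / N - np.dot(f, phi_zd)   # minimum MSE, cf. (19-40)
    return f, I_star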
In order to relate this FIR Wiener filter to the truncated steady-state Kalman filter, we must first assume a signal-plus-noise model for z(k), i.e. (see Figure 19-5),

z(k) = s(k) + v(k) = h(k)*w(k) + v(k)  (19-41)

where h(k) is the impulse response of a linear time-invariant system, and, as in our basic state-variable model (Lesson 15), w(k) and v(k) are mutually uncorrelated (stationary) white noise processes with variances q and r, respectively. We must also specify an explicit form for desired signal d(k). We shall require that

d(k) = s(k) = h(k)*w(k)  (19-42)

which means, of course, that we want the output of the FIR digital Wiener filter to be as close as possible to signal s(k).

Using (19-41) and (19-42), it is another straightforward exercise to compute φ_zz(i - j) and φ_zd(j). The resulting discrete-time Wiener-Hopf equations are

Σ_{i=0}^{η} f(i)[φ_ss(i - j) + rδ(i - j)] = φ_ss(j),   j = 0, 1, . . . , η  (19-43)

where

φ_ss(m) = q Σ_k h(k)h(k + m)  (19-44)

Observe that, as in the case of a single-channel Kalman filter, our single-channel Wiener filter only depends on the ratio q/r (and subsequently on SNR).

Theorem 19-2. The steady-state Kalman filter is an infinite-length digital Wiener filter. The truncated steady-state Kalman filter is a FIR digital Wiener filter.

Proof (heuristic). The digital Wiener filter has constant coefficients, as does the steady-state Kalman filter. Both filters minimize error variances. The infinite-length Wiener filter has the smallest error variance of all Wiener filters, as does the steady-state Kalman filter; hence, the steady-state Kalman filter is an infinite-length digital Wiener filter.
The second part of the theorem is proved in a similar manner. □

Using Theorem 19-2, we suggest the following procedure for designing a recursive Wiener filter (i.e., a steady-state Kalman filter):

1. obtain a state-variable representation of h(k);
2. determine the steady-state Kalman gain matrix, K̄, as described above;
3. implement the steady-state Kalman filter (19-11);
4. compute the estimate of desired signal s(k), as

ŝ(k|k) = h'x̂(k|k)  (19-45)

[Figure 19-5 Signal-plus-noise model incorporated into the design of a FIR digital Wiener filter. The dashed lines denote paths that exist only during the design stage; these paths do not exist in the implementation of the filter.]

COMPARISONS OF KALMAN AND WIENER FILTERS

We conclude this lesson with a brief comparison of Kalman and Wiener filters. First of all, the filter designs use different types of modeling information. Auto-correlation information is used for FIR digital Wiener filtering, whereas a combination of auto-correlation and difference equation information is used in Kalman filtering. In order to compute H_f(z), h(k) must be known; it is always possible to go directly from h(k) to the parameters in a state-variable model [i.e., (Φ, γ, h)], using an approximate realization procedure (e.g., Kung, 1978).

Because the Kalman filter is recursive, it is an infinite-length filter; hence, unlike the FIR Wiener filter, where filter length is a design variable, the filter's length is not a design variable in the Kalman filter.

Our derivation of the digital Wiener filter was for a single-channel model, whereas the derivation of the Kalman filter was for a more general multichannel model (i.e., the case of vector inputs and outputs). For a derivation of a multichannel digital Wiener filter, see Treitel (1970).

There is a conceptual difference between Kalman and FIR Wiener filtering. A Wiener filter acts directly on the measurements to provide a signal ŝ(k|k) that is close to s(k). The Kalman filter does not do this directly. It provides us with such a signal in two steps. In the first step, a signal is obtained that is close to x(k), and in the second step, a signal is obtained that is close to s(k). The first step provides the optimal filtered value of x(k), x̂(k|k); the second step provides the optimal filtered value of s(k), because ŝ(k|k) = h'x̂(k|k).

Finally, we can picture a diagram similar to the one in Figure 19-5 for the Kalman filter, except that the dashed lines are solid ones. In the Kalman filter the filter coefficients are usually time-varying and are affected in real time by a measure of error ẽ(k) (i.e., by error-covariance matrices), whereas Wiener filter coefficients are constant and have only been affected by a measure of ẽ(k) during their design.

In short, a Kalman filter is a generalization of the FIR digital Wiener filter to include time-varying filter coefficients which incorporate the effects of error in an active manner. Additionally, a Kalman filter is applicable to either time-varying or nonstationary systems or both, whereas the digital Wiener filter is not.
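As a small illustration of steps 3 and 4 of the recursive Wiener filter design procedure, the following sketch (Python; our names; K̄ is assumed to have been computed in step 2, e.g., by the Riccati iterations shown earlier) runs the constant-coefficient filter (19-11) with u(k) = 0 and outputs the signal estimate of (19-45).

import numpy as np

def recursive_wiener_filter(Phi, h, K, z):
    """Steps 3-4: run the steady-state Kalman filter (19-11) with u(k) = 0 and
    return s_hat(k|k) = h' x_hat(k|k), cf. (19-45).  Single-channel case."""
    n = Phi.shape[0]
    A = (np.eye(n) - K @ h) @ Phi        # constant filter matrix (I - K_bar h') Phi
    xhat = np.zeros((n, 1))
    s_hat = []
    for zk in z:
        xhat = A @ xhat + K * zk         # x_hat(k|k)
        s_hat.append((h @ xhat)[0, 0])   # s_hat(k|k)
    return np.array(s_hat)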
PROBLEMS

19-1. Derive the discrete-time Wiener-Hopf equations given in (19-39).
19-2. Prove that the minimum mean-squared error, I*(f), given in (19-40), becomes smaller as η, the number of filter coefficients, increases.
19-3. Consider the basic state-variable model x(k + 1) = ½x(k) + w(k) and z(k + 1) = x(k + 1) + v(k + 1), where q = 2, r = 1, m_x(0) = 0 and E{x²(0)} = 2.
   (a) Specify x̂(0|0) and p(0|0).
   (b) Give the recursive predictor for this system.
   (c) Obtain the steady-state predictor.
   (d) Suppose z(5) is not provided (i.e., there is a gap in the measurements at k = 5). How does this affect p(6|5) and p(100|99)?
19-4. Consider the basic scalar system x(k + 1) = φx(k) + w(k) and z(k + 1) = x(k + 1) + v(k + 1). Assume that q = 0 and let lim p(k|k) = p̄₁.
   (a) Show that p̄₁ = 0 and p̄₁ = (φ² - 1)r/φ² are the two solutions for p̄₁.
   (b) Which of the two solutions in (a) is the correct one when |φ| < 1?

Lesson 20
State Estimation: Smoothing

THREE TYPES OF SMOOTHERS

Recall that in smoothing we obtain mean-squared estimates of state vector x(k), x̂(k|j), for which k < j. From the Fundamental Theorem of Estimation Theory, Theorem 13-1, we know that the structure of the mean-squared smoother is

x̂(k|j) = E{x(k)|Z(j)},   where k < j  (20-1)

In this lesson we shall develop recursive smoothing algorithms that are comparable to our recursive Kalman filter algorithm.

Smoothing is much more complicated than filtering, primarily because we are using future measurements to obtain estimates at earlier time points. Because there is some flexibility associated with how we choose to process future measurements, it is convenient to distinguish between three types of smoothing (Meditch, 1969, pp. 204-208): fixed-interval, fixed-point, and fixed-lag.

The fixed-interval smoother is x̂(k|N), k = 0, 1, . . . , N - 1, where N is a fixed positive integer. The situation here is as follows: with an experiment completed, we have measurements available over the fixed interval 1 ≤ k ≤ N. For each time point within this interval we wish to obtain the optimal estimate of state vector x(k), which is based on all the available measurement data {z(j), j = 1, 2, . . . , N}. Fixed-interval smoothing is very useful in signal processing, where the processing is done after all the data is

collected. It cannot be carried out on-line during an experiment, as filtering


i(k + Ilk) = z(k + 1) - H(k + l)ii(k + Ilk) (20-2)
can be. BecausealI the availabledata is used,one cannot hope to do better (by
other forms of smoothing) than by fixed-interval smoothing. i(k + Ilk) = H(k + l)X(k -t- Ilk) + v(k + 1) (20-3)
The +&@n? smoothedestimateis i(k 1j), j = k + 1, k + 2, . . . where
k is a fixed positive integer. Supposewe want to improve our estimate of a P;;(k + ljk) = H(k + l)P(k + llk)H(k + 1) + R(k + 1) (20-4)
state at a specifictime by making useof future measurements.Let this time be $k + l/k) = QP(k + l,k)%(klk) + r(k + l,k)w(k) (20-5)
z. Then fixed-point smoothed estimates of x(z) will be $$ + l),
k(iqZ + 2),..., etc. The last possible fixed-point estimate is I@$V), which is and
the same asthe fixed-interval estimate of x(x). Fixed-point smoothing can be ,(k + ljk -t 1) = [I - K(k + l)H(k + l))ji(k + Ilk)
carried out on-line, if desired, but the calculationof k(%iz + dj is subject to a - K(k + l)v(k + 1) (20-6)
delay of Osec.
TheftieGzg smoothedestimate is i(# + I,), k = 0, 1, , . . , where L is
a fixed positive integer. In this case,the point at which we seek the estimate of
SINGLE-STAGE SMOOTHER
the systemsstate lags the time point of the most recent measurement by a
fried interval of time, ~5;i.e., lk +L - I& = L which is a positive constant for al1
As in our study of prediction, it is useful first to develop a single-stage
k =O,l,e.. . This type of estimator canbe usedwhere a constant lag between
smoother and then to obtain more general smoothers.
measurementsand state estimatesis permissible-Fixed-lag smoothing can be
carried out on-line, if desired, but the calculation of %(kik + L) is subject to
Theorem 20-l. The single-stage mean-squared smoothed estimator of
an LT set delay.
X(k)?i(k[k + 1), is given by the expression

i(kjk + 1) = i(klk) + M(klk + l)Z(k + Ilk) (20-7)


APPROACHES FOR DERWING SMOOTHERS where single-stage smoother gain matrix, M(klk + I), is

The literature on smoothing is f3led with many different approaches for M(kIk + 1) = P(kIk)W(k + l,k)H(k + 1)
deriving recursivesmoothers.By augmentingsuitably defined statesto (1517) [H(k -+ l)P(k + llk)H(k + 1) + R(k + l)]- (20-S)
and (15-l@, one can reduce the derivation of smoothing formulas to a Kalman
filter for the augmented state-variable model (Anderson and Moore, 1979). Proof. From the Fundamental Theorem of Estimation Theory and The-
The filtered estimatesof the newly-introduced states turn out to be equiv- orem 12-4, we know that
alent to smoothed values of x(k). We shall examine this augmentation ap- G(klk + 1) = E{x(k)/%(k + 1))
proach in Lesson21. A secondapproach is to use the orthogonality principle
to derive a discrete-time Wiener-Hopf equation, which can then be used to = %WkWMk + 1)) (20-9)
establish the smoothing formulas. We do not treat this approach in this book. = E(x(k)l%(k),Z(k + l(k)}
A third approach, the one we shall follow in this lesson, is based on the = E(x(k)lZ(k)} + E{x(k)li(k + Ilk)) - m,(k)
causal invertibility between the innovations process i(k -t ~IJc+ j - 1) and
measurementz(k + j), and repeated applicationsof Theorem 12-4. which can also be expressedas (see Corollary 13-l)
i(klk + 1) = fr(klk) + P;(k ,k + l]k)P;(k + llk)Z(k + Ilk) (20-10)
Defining single-stagesmoother gain matrix M(k Jk + 1) as
A SUMMARY OF IMPORTANT FORMULAS
M(kIk + 1) = P,(k .k + llk)P;;(k + ljk) (20-l 1)
The following formulas, which have been derived in earlier lessons, are used (20-10) reducesto (20-7).
so frequently in this lesson, as well as in Lesson21, that we collect them here Next, we must show that M(k\k + 1) can also be expressed as in (20-g).
for the convenienceof the reader: We already have an expression for Pii(k + l(k), namely (20-4); hence, we

must compute P&k ,k + Ilk). To do this, we make useof (20-3), (20-5) and the k(klk + 1) = %(kjk) + A(k)[%(k + Ilk + 1) - i(k + Ilk)] (20-18)
orthogonahty principle, asfoliows:
Proof. Substitute (20-14) into (20-7) to seethat
P,(k,k + Ilk) = E{x(k)Z(k + Illi)}
k(k lk + 1) = i(klk) + A(k)K(k + l)Z(k + Ilk) (20-19)
= E{x(k)[H(k + l)%(k + l[k) + v(k + l)]}
= E{x(k)i(k + llk)}H(k + 1) but [see (l7-ll)]
= E{x(k)[@(k + l,k)x(k[k) K(k + l)Z(k + ilk) = S(k + Ilk + 1) - ri(k + Ilk) (20-20)
(20-12)
+ r(k + l,k)w(k)]}H(k + 1) Substitute (20-20) into (20-19) to obtain the desired result in (20-18). 0
= E{x(k)?(klk)}@(k + Z,k)H(k + 1) Formula (20-7) is useful for computational purposes, whereas(20-18) is
= E{[f(klk) + ?(klk)]ji(k Ik)}@(k + l,k)H(k + 1) most useful for theoretical purposes.These facts will become more clear when
we examine double-stage smoothing in our next section.
= P(kIk)Q,(k + J,k)H(k + 1) Whereas the structure of the single-stagesmoother is similar to that of
Substituting (20-12) and (20-4) into (20-19, we obtain (20-8). q the Kalman filter, we see that M(kIk + 1) does not depend on single-stage
smoothing error-covariance matrix P(klk + 1). Kalman gain K(k + 1), of
For future reference,we record the following fact, course, does depend on P(k + Ilk) [or P(kIk)]. In fact, P(klk + 1) does not
appear at all in the smoothing equations and must be computed (if one desires
E{x(k)%(k + Ilk)} = P(k)k)@(k + 1,k) (20-13) to do so) separately. We addressthis calculation in Lesson 21.
This is obtained by comparingthe third and last lines of (20-12).
Observethat the structure of the single-stagesmootheris quite similar to
that of the Kalman filter. The Kalman filter obtains r2(k + Ilk + 1) by adding DOUBLE-STAGE SMOOTHER
a correction that dependson the most recent innovations, i(k + Ilk), to the
predicted value of x(k + 1). The single-stagesmoother, on the other hand, Instead of immediately generalizing the single-stagesmoother to an hTstage
obtains i(klk + 1) by adding a correction that also dependson i(k + ljk), to smoother, we first present results for the double-stagesmoother. We will then
the filtered value of x(k). We seethat filtered estimatesare required to obtain be able to write down the general results (almost) by inspection of the single-
smoothed estimates. and double-stage results.

Coroliary 20-l. Kalman gain matrix K(k + 1) is a factor in M(klk + 1), Theorem 20-2. The double-stage mean-squared smoothed estimator of
i.e., x(k), %(k)k + 2), is given by the expression
M(kIk + 1) = A(k)K(k + 1) (20-14) ri(k(k + 2) = f(k)k + 1) + M(kIk + 2)2(k + 21k + 1) (20-21)
where where double-stage smoother gain matrix, M(klk + 2), is
A(k) A P(kIk)W(k + l,k)P-(k + Ilk) (20-15) M(kIk + 2) = P(kIk)W(k + l,k)[I - K(k + 1)
Proof. Using the fact that [Equation (17-12)j H(k + l)]@(k + 2,k + 1)
(20-22)
K(k + 1) = P(k + llk)H(k + 1) H(k + 2)[H(k + 2)P(k + 21k + 1)
[H(k + l)P(k + llk)H(k + 1) + R(k + l)]- (20-16) H(k + 2) + R(k + 2)]-
we seethat Proof. From the Fundamental Theorem of Estimation Theory, The-
H(k + l)[H(k + l)P(k + llk)H(k + 1) + R(k + l)]- orem 12-4, and Corollary 13-1,we know that
= P-(k + llk)K(k + 1) (20-17) i(klk + 2) = E{x(k)lS(k + 2))
When (20-17) is substitutedinto (20.8), we obtain (20-14). 0 = E{x(k)l%(k + l), z(k + 2))
(20-23)
\ ,
= E{x(k)lZ(k + l), i(k + 21k + 1))
Corollary 20-2. Another way to express ri(klk + 1) is = E{x(k)lZ(k + 1)) + E{x(k)li(k + 2Ik + 1)) - m,(k)

which can also be expressedas Using the definition of matrix .4(k) in (20-151,we see that (20-30) can be
?(kjk -+ 2) = k(kik + 1) expressedas in (20-26). [7
+ PG(k,k + 2ik + l)P&(k + 2ik + l)i(k + 2ik + I) (20-24)
Corollary 20-4. Trvo other ways to express %(klk + 2) are:
Defining double-stagesmoother gain matrix M(k\k + 2) as
i(klk + 2) = i(klk + 1)
M(kik + 2) = Pti(k ,k + 2/k + l)Pg(k + 2ik + 1) (20-25) + A(k)A(k + l)[i(k + 21k + 2) - i(k + 21k + I)] (20-31)
(20-24) reduces to (20-21). and
In order to show that M(k/k + 2) in (20-25) can be expressed as in
i(klk + 2) = i(klk) + A(k)[%(k + l/k + 2) - i(k + Ilk)] (20-32)
(2O-22),one proceedsas in our derivation of M(k !k + 1) in (20-12); however,
the details are lengthier becauseii(k + 2ik + 1) involves quantities that are Procf. The derivation of (20-31) follows exactly the same path as the
two time units awayfrom x(k), whereas i(k + lik) involves quantities that are derivation of (20-H), and is therefore left as an exercise for the reader.
only one time unit awayfrom x(k). Equation (20-13) is used during the deri- Equation (20-31) is the starting place for the derivation of (20-32). Observe,
vation. We leave the detailed derivation of (20-22) as an exercise for the from (20-18), that
reader. q
A(k + l)[i(k + 21k + 2) - ji(k + 2/k + I)]
Whereas (20-22) is a computationally useful formula, it is not useful = G(k + Ilk + 2) - i(k + Ilk + 1) (20-33)
from a theoretical viewpoint; i.e., when we examine M(k jk + 1) in (20-8) and thus, (20-31) can be written as
M(k!k + 2) in (20-22),it is not at al1obvious how to generalizethese formulas
to hS(k[k + N), or evento M(kik + 3) The following result for M(kik + 2) is ji(klk + 2) = i(k jk + 1)
easily generalized. + A(k)[ri(k + Ilk + 2) - i(k + Ilk + l)] (20-34)
Substituting (20-18) into @O-34),we obtain the desired result in (20-32). III
Corollary 20-3, Kalman pin mcztrix K(k + 2) isa factor in M(kik + 2),
i.e., The alternate forms we haveobtained for both i(klk + 1) and f(klk + 2)
M(kik + 2) = A(k)A(k + l)K(k -+2) (20-26) will suggesthow we can generalizeour single- and double-stagesmoothers to
N-stage smoothers.
where A(k) is defined in (20-15).
Pruuf. Increment k to k + 1 in (2047) to seethat
SINGLE- AND DOUBLE-STAGE SMOOTHERS
H(k + 2)[H(k + 2)P(k + 2ik + l)H(k + 2) + R(k + 2)]- AS GENERAL SMOOTHERS
= P-l (k + 2lk + I)K(k + 2) (2U-27)
At the beginning of this lessonwe describedthree types of smoothers, namely:
Next, note from (17-14),that
fixed-interval, fixed-point, and fixed-lag smoothers. Table 20-l showshow our
I - K(k + l)H(k + 1) = P(k + l/k + l)P- (k + lik) (20-28) single- and double-stage smoothersfit into these three categories.
In order to obtain the fixed-interval smoother formulas, given for both
hence,
the single- and double-stagesmoothers, set k + 1 = N in (20-18)and k + 2 =
[I - K(k + l)H(k + l)] = I-(k + llk)P(k + l[k + 1) (20-29) N in (20-32), respectively. Doing this forces the left-hand side of both equa-
tions to be conditioned on data length N. Observe that, before we can com-
Substitute (2U-27)and (20-29)into (20-22) to seethat pute %(N - IIN) or i(N - 2jN), we must run a Kalman filter on all of the data
M(kik + 2) = P(klk)@(k + 1,k)P- (k + lik) in order to obtain li(NIN), This last filtered state estimate initializes the back-
ward running fixed-interval smoother. Observe, also, that we must compute
P(k + ilk + l)W(k + 2,k -+ 1)
i(N - l/N) before we can compute Ei(N - 21N).Clearly, the limitation of our
P-* (k + 2lk + l)K(k + 2) (20-30) results so far is that we can only perform fixed-interval smoothing for
TABLE 20-1 Smoothing Interrelationships

Fixed-Interval:
  Single-stage:  x̂(N - 1|N) = x̂(N - 1|N - 1) + A(N - 1)[x̂(N|N) - x̂(N|N - 1)]
                 Solution proceeds in reverse time, from N to N - 1, where N is fixed.
  Double-stage:  x̂(N - 2|N) = x̂(N - 2|N - 2) + A(N - 2)[x̂(N - 1|N) - x̂(N - 1|N - 2)]
                 Solution proceeds in reverse time, from N to N - 1, to N - 2, where N is fixed.

Fixed-Point (k fixed at k̄):
  Single-stage:  x̂(k̄|k̄ + 1) = x̂(k̄|k̄) + A(k̄)[x̂(k̄ + 1|k̄ + 1) - x̂(k̄ + 1|k̄)]
                 Solution proceeds in forward time from k̄ to k̄ + 1 on the filtering time scale and then back to k̄ on the smoothing time scale. A one-unit time delay is present.
  Double-stage:  x̂(k̄|k̄ + 2) = x̂(k̄|k̄ + 1) + A(k̄)A(k̄ + 1)[x̂(k̄ + 2|k̄ + 2) - x̂(k̄ + 2|k̄ + 1)]
                 Solution proceeds in forward time. Results from the single-stage smoother as well as the optimal filter are required at k = k̄ + 1, whereas at k = k̄ only results from the optimal filter are required. A two-unit time delay is present.

Fixed-Lag (k variable):
  Single-stage:  x̂(k|k + 1) = x̂(k|k) + A(k)[x̂(k + 1|k + 1) - x̂(k + 1|k)]
                 Picture and discussion same as in the fixed-point case, replacing k̄ by k. Here we have a one-unit window of measurements from k to k + 1 [successive windows, e.g., those for x̂(1|2) and x̂(2|3), do not overlap].
  Double-stage:  x̂(k|k + 2) = x̂(k|k + 1) + A(k)A(k + 1)[x̂(k + 2|k + 2) - x̂(k + 2|k + 1)]
                 Picture and discussion same as in the fixed-point case, replacing k̄ by k. Here we have a two-unit window of measurements from k to k + 2 [successive windows, e.g., those for x̂(1|3) and x̂(2|4), overlap].
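As a small numerical illustration of the single-stage fixed-point entry of the table, the following check (Python; our own example, reusing the scalar system of Example 19-1, and using the error-variance counterpart of the estimate formula, which is derived in general form in Lesson 21) shows how much one future measurement improves the estimate of the initial state.

# Scalar system of Example 19-1: phi = h = 1, q = 20, r = 5, p(0|0) = 50.
p00 = 50.0
p10 = p00 + 20.0                      # p(1|0)
K1 = p10 / (p10 + 5.0)                # K(1) = 0.933
p11 = (1.0 - K1) * p10                # p(1|1) = 4.67
A0 = p00 / p10                        # A(0) = 0.714
p01 = p00 + A0 ** 2 * (p11 - p10)     # single-stage fixed-point variance p(0|1)
print(p01)                            # ~16.7: one future measurement cuts p(0|0) by about 2/3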

k=N-land?+- 2. More general results, that will permit us to perform


fixed-interval smoothing for any k c N, are describedin Lesson 21.
Lesson 21
In order tu obtain the fixed-point smoother formulas, given for both the
single- and double-stage smoothers, set k = z in (20-18) and (20-31J7re-
spectively. Time puint x representsthe fixed point in our smoother formulas.
As nuted at the beginning of this lesson,fixed-point smoothing can be carried State=Estimation:
out on-line, but it is subject to a delay. From an information availability point
of view, a one-unit time delay is present in the single-stage fried-point
smoother, becauseour smoothed estimate at k = x usesthe filtered estimate
computed at k = z + 1. A two-unit time delay is present in the double-stage
Smoothing
fixed-point smoother, becauseour smoothed estimate at 3 = % + 1 usesthe
filtered estimate computed at k = x +- 1 andz + 2.
The fixed-lag smoother formulas look just like the fixed-point formulas,
(General Results)
except that ICis a variable in the former and is fixed in the latter* Observe that
the windows of measurementsused in the single-stagefixed-lag smoother are
nonoverlapping, whereasthey are overlapping in the caseof the double-stage
fixed-lag smoother. As in the case of fixed-point smoothing, fixed-interval
smoothing can be carried out on line, subject, of course, to a fixed deiay equal
to the lag of the smoother*

PROBLEMS INTRODUCTION

20-l. Derive the formula for the double-stagesmoother gain matrix M(k ik + 2), given In Lesson 20 we introduced three general types of smoothers, namely, fixed-
in (20-22). interval, fixed-point, and fixed-lag smoothers,We also developed formulas for
20-2, Derive the alternative expression for k(kik + 2) given in (20-31). single-stage and double-stage smoothers and showed how these specialized
20-3. Using the FundamentaI Theorem of Estimation Theory, derive expressionsfor smoothers fit into the three general categories.In this lesson we shall develop
G(k jk + I) and G(k ik + 2). These represent single- and double-stage estimators general formulas for fixed-interval, fixed-point, and fixed-lag smoothers.
of disturbance w(k).

FIXED-INTERVAL SMOOTHERS

In this section we develop a number of algorithms for the fixed-interval


smoothed estimator of x(k), i(k]N). Not all of them will be useful, becausewe
will not be able to compute with some. Only thosewith which we can compute
will be considered useful.
Recall that

It is straightforward to proceed as we did in Lesson 20, by showing first that

i(kIN) = E{x(k)l%(N - I).Z(N/N - I))


= E(x(k)l%(N - I)} + E{x(k)Ii(NIN - I)} - m,(k) (21-2)
= i(klN - 1) + M(kIN)i(NIZV - 1)

where and
M(klN) = P,(k,NIN - l)P,(NlN - 1) (21-3) jz(N - 2/N) = i(N - 21N - 2)
and then that + A(N - 2)[i(N - l[N) - f(N - 1iN - 2)] (21-9)
M(kjN) = A(k)A(k + 1). . .A(N - l)K(N) (21-4) Observe how i(N - 11N)is used in the calculation of i(N - 21N).
Equation (21-6) was developed by Rauch, et al. (1965).
and, finally, that other algorithms for ri(klN) are

ri(klN) = i(k(N
[ 1
- 1) + ijA(i)
i=k
[i(N(N) - rZ(NIN - l)] (21-5)
Theorem 21-l. A useful mean-squared fixed-interval
mator of x(k), i(klN), is given by the expression
?(klN) = i(k)k) + A(k)[k(k
smoothed esti-

+ 1IIV)- i(k + Ilk)] (21-10)


i(klN) = i(k]k) + A(k)[ri(k + l[N) - i(k + Ilk)] (21-6) where matrix A(k) is defined in (20~I5), and k = N - 1, N - 2, . . . , 0.
Additionally, the error-covariance matrix associated with i(klN), P(klN), is
A detailed derivation of these resultsthat usesa mathematical induction given by
proof, can be found in Meditch (1969,pp. 2X-220). The proof relies, in part, .
on our previously derived results in Lesson20, for the single- and double-stage P(klN) = P(klk) + A(k)[P(k + 1IN)- P(k + lIk)]A(k) (21-11)
smoothers.The reader, however, ought to be able to obtain (21.4), (21-Q and where k = N - 1, N - 2,. . . , 0.
(21-6) directly from the rules of formation that can be inferred from the
Lesson 20 formulas for M(klk + 1) and M(kIk + 2), and G(klk + 1) and Proof. We have already derived (21-10); hence, we direct our attention
i(kjk + 2). at the derivation of the algorithm for P(k IN). Our derivation of (21-11) follows
In order to compute r2(k IN) for specificvaluesof k, all of the terms that the one given in Meditch, 1969, pp. 222-224. To begin, we know that
appear on the right-hand side of an algorithm must be available. We have %(klN) = x(k) - ir(klN) (21-12)
three possiblealgorithms for r2(k IN); however,aswe explain next, only one is
useful. Substitute (21-10) into (21-12) to seethat
The first value of k for which the right-hand side of (21-2) is fully X(klN) = 2(kjk) - A(k)[i(k + 1IN) - f(k + l(k)] (21-13)
available is k = N - 1. By running a Kalman filter over all of the mea-
surements,we are able to compute ?(N - 1lN - 1) and Z(NIN - l), so that Collecting terms conditioned on N on the left-hand side, and terms condi-
we can computei(N - 1IN). We can now try to iterate (21-2) in the backward tioned on k on the right-hand side, (21-13) becomes
direction, to see if we can compute i(N - 2/N), jz(N - 31N), etc. Setting i(klN) + A(k)i(k + 1IN) = i(k)k) + A(k)rZ(k + Ilk) (21-14)
k = N - 2 in (21.2), we obtain
Treating (21-14) as an identity, we seethat the covariance of its left-hand side
i(N - 21N) = i(N - 21N - 1) + M(N - 21N)i(NIN - 1) (21-7) must equal the covariance of its right-hand side; hence, [in order to obtain
(21-15) we use the orthogonality principle to eliminate all the cross-product
Unfortunately, f(N - 21N - l), which is a single-stagesmoothed estimate of terms]
x(N - 2), has not been computed; hence, (21-7) is not useful for computing
i(N - 21IV).We, therefore, reject (21-2) as a useful fixed-interval smoother. P(klN) + A(k)P;;(k + lIN)A(k)
A similar argument can be made against (21-5); thus, we also reject 7 P(klk) + A(k)P&k + llk)A(k) (21-15)
(21-5) as a useful fixed-interval smoother.
Equation (214 is a useful fixed-inerval smoother, becauseall of its terms
on its right-hand side are available when we iterate it in the backward direc- P(klN) = P(klk) + A(k)[PG(k + ilk) - P;,(k + lIN)]A(k) (21-16)
tion. For example,
We must now evaluate the two covariancesPG(k + Ilk) and PG(k + 1JN).
i(N - 1IN) = ri(N - 1IN - 1) Recall that x(k + 1) = r2(k + Ilk) + ii(k + ilk); thus,
+ A(N - l)[ri(NIN) - k(N]N - l)] (21-8) P,(k + 1) = P;;(k + Ilk) + P(k + Ilk) (21-17)
196 State Estimation: Smoothing (General Results) Lc?ssan 21 Fixed-Interval Smoothers

Additionally, x(k + 1) = i(k + IIN) + G(k + l[ZV).so that ~(010) is an initial condirion that is data independent, whereas ~(014) is a result of
processing r(l), z(2), f (3), and z (4). In essence, fixed-interval smoothing has let us
Px(k + 1) = IQk + 1iN) + P(k + l/N) (21-B) look into the future and reflect the future back to time zero.
Equating (2147) and (2148) we find that Finally, note that, for large values of k , A(k) reaches a steady-statevalue, x,
where in this example
Pti(k + Ilk) - Pti(k + liw = P(k + 1iA) - P(k + l/k) (2149)
x = g@h + 20) = 0.171 (21-23)
Finally, substituting (2149) into (21~16),we obtain the desired expression for
This steady-state value is achieved for k = 3. 0
P(qQ (21-11). EI
Equation (21-10) requires the multiplication of 3 n X n matrices as well
We leave proof of the fact that {%(k~N), k = N, N - 1, . . . , O} is a as a matrix inversion at each iteration; hence, it is somewhat limited for
zero-mean secund-order Gauss-Markov process as an exercise for the reader.
practical computing purposes.The following results, which are due to Bryson
Example 21-l and Frazier (1963) and Bierman (1973b), represent the most practical way for
In order to illustrate fixed-interval smoothing and obtain at the same time a corn- computing x(kIN) and also P(klN).
parison of the relative accuraciesof smoothing and filtering, we return to Example
19-l. To review briefly, we have the scaIar system x(k + 1) = x(k) +- IV(~) with the Theorem 21-2. (a) A useful mean-squared fixed-interval smoothed
scalar measurement z(k + I) = x(k + 1) + v(k + 1) and P(0) = Xl, q = 20. and estimator @x(k), %(k]N), is
r = 5. In this example (which is similar to Example 6.1 in Meditch 1969, pg. 225) we
choose N = 4 and compute quantities that are associated with i(ki4), where from i(kpv) = i(k lk - 1) + P(k ik - l)r(k jN) (21-24)
(21-10) where k = N - 1, N-2,..., 1, and n x 1 vector r satisfies the backward-
Z(k[4) = i(kik) -+ A (k)[i (k + 114) - i(k + l/k)] (21-20) recursive equation
k = ?, 2, 1,O. Because @ = 1 andp{k + ltk) = P(kik) + 20 r(jlN) = @Ji + L.i)r(j + 1IN)
P (k IkI + H(j)[H(j)P(jIj - l)H(i) + R(j)]-Z(jb - 1) (21-25)
A(k) = P(kfk)@p -(k + ljk) = (21-21)
p(kik) t 20 where j = N, N - 1,. . . ,1 and r(N + lb9
and, therefore, (b) The smoothing error-covariance matrix, $iN), is
P(k IN) = P(k/k - 1) - P(klk - l)S(k [N)P(k Ik - 1) (21-26)
where k = N - 1, N - 2, . . . , 1, and n X n matrix S( jjN), which is the covar-
Utilizing these last two expressions,we compute A (k) andP (k 14)for k = 3,&l, iunce matrix of r( j IN), satisfies the backward-recursive equation
0 and present them, along with the results summarized in Table 19-1,in Table 21-l. The
three estimation error variancesare given in adjacent cohunns for easein comparison. S(jIN) = Qp(j + l,j)S(j + lIN)$ (j + 1,j)
+ wmwP(~Ij - W(i) + wH-a(i) (21-27)
TAl3LE 21-l Kalman Filter and Fixed-interval Smoother Quantities
where j = N, N - 1, . . . , 1 and S(N + l/N) = 0. Matrix Qp is defined in
(21-33).
0 50 16.31 0.714
1 iI- 4.67 3.92 i9.3-3 0.189 Proof. (Mendel, 1983b, pp. 64-65). (a) Substitute the Kalman filter
2 24.67 4.16 3.56 0.831 0.172 equation (1741) for i(klk) into Equation (21~lo), to show that
3 24.14 4.14 3.74 0.329 0.171
4 24.14 4.14 4.14 0.828 0.171 i2(kIN) = i(klk - 1) + K(k)Z(klk - I>
+ A(k)[i(k + 1IN) - i(k + Ilk)] (21-28)
Observe the large improvement (percentage wise) of JJ(ki4) over p (k !k). Im- Residual state vector r(k IN) is defined as
provement seemsto get larger the farther away we get from the end of our data; thus,
P (014) is more than three times as small as p(O/Oj.Of course. it should be, because r(k IN) = P-(klk - l)[i(k IN) - i(klk - I)] (21-29)
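Before continuing, it is worth noting that the recursions of Theorem 21-1 (and the numbers in Table 21-1) are easy to reproduce numerically. The sketch below (Python; our variable names) runs the scalar Kalman filter of Example 21-1 forward for N = 4 and then iterates (21-21) and (21-22) backward; the fast two-pass form of Theorem 21-2 would simply replace the backward loop by the r(k|N) and S(k|N) recursions (21-25) and (21-27).

# Scalar fixed-interval smoother variances for Example 21-1 (phi = h = 1, q = 20, r = 5).
N = 4
p_filt = [50.0]                    # p(0|0)
p_pred = [None]                    # p(k|k-1), k >= 1
for k in range(1, N + 1):          # forward Kalman filter pass
    pp = p_filt[-1] + 20.0
    K = pp / (pp + 5.0)
    p_pred.append(pp)
    p_filt.append((1.0 - K) * pp)

A = [p_filt[k] / p_pred[k + 1] for k in range(N)]      # A(k), cf. (21-21)
p_smooth = [0.0] * (N + 1)
p_smooth[N] = p_filt[N]            # p(N|N) initializes the backward pass
for k in range(N - 1, -1, -1):     # backward pass, (21-22)
    p_smooth[k] = p_filt[k] + A[k] ** 2 * (p_smooth[k + 1] - p_pred[k + 1])

print(A)                           # 0.714, 0.189, 0.172, 0.171
print(p_smooth)                    # compare with the p(k|4) column of Table 21-1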

Next, substitute r(kIN) and r(k + l[N), using (21-29) into (21-28) to show tor which is excited by the innovations-one which is running in a backward
that direction.
r(kIN) = P-(k/k - l)[K(k)i(k[k - 1) Finally, note that (21-24) can also be used for k = N, in which case its
+ P(k Ik)W(k + l,k)r(k + l/N)] (21-30) right-hand side reduces to %(NlN - 1) + K(N)Z(NIN - l), which, of course,
is i((NIN).
From (17-12) and (17-13)and the symmetry of covariancematrices, we
find that
FlXED-POINT SMOOTHING
I-(klk - l)K(k) = H(k)[H(k)P(klk - l)H(k) + R(k)]- (21-31)
and A fixed-point smoother, i(kl j) where j = k + 1, k + 2, . . . , can be obtained
P-(klk - l)P(k)k) = [I - K(k)H(k)] (21-32) in exactly the same manner aswe obtained fixed-interval smoother (21-5). It is
obtained from this equation by setting N equal to j and then letting j = k + 1,
Substituting (21-31) and (21-32)into Equation (21-30), and defining k + 2,. . . ; thus,
Op(k + 1,k) = <P(k+ l,k)[I - K(k)H(k)] (21-33) i(klj) = r2(klj - 1)+ WMjlj) - %jlj - l>l (21-38)
we obtain Equation (21-25). Setting k = N + 1 in (21.29), we establish where
r(N + 1IN) = 0. Finally, solving (21-29) for i(kIN) we obtain Equation j-l
(21-24). B(j) =
A(i) II (21-39)
(b) The orthogonality principle in Corollary 13-3leads us to conclude that =k j

E{%(kIN)r(k IN)} = 0 (21-34) and j = k + 1, k + 2,. . . . Additionally, one can show that the fixed-point
smoothing error-covariance matrix, P(kl j), is computed from
because r(kIN) is simply a linear combination of all the observations z(l),
z(2),
z(N). From (21-24)we find that
l . 9
Wlj) = Wlj - 1)+ BbW(A.0- Wlj - WW) (21-40)

ii(klk - 1) = ii(kliV) - P(klk - l)r(kIN) (21-35) where j = k + 1, k + 2,. . . .


Equation (21-38) is impractical from a computational viewpoint, be-
and, therefore, using (21-34),we find that causeof the many multiplications of n x n matrices required first to form the
P(klk - 1) = P(kIN) - P(kfk - l)S(klN)P(klk - 1) (21-36) A(i) matrices and then to form the B(j) matrices. Additionally, the inverse of
matrix P(i + lli) is needed in order to compute matrix A(i). The following
results present a fast algorithm for computing r2(kl j). It is fast in the sense
S(klN) = E{r(klN)r(klN)} (21-37) that no multiplications of yt x y1matrices are needed to implement it.

is the covariance-matrix of r(klN) [note that r(kIN) is zero mean]. Equation Theorem 21-3. A most useful mean-squared fixed-point smoothed esti-
(21-36) is solved for P(klN) to give the desired result in Equation (21-26). mator of x(k), %(klk + 4 where 1 = 1,L**, is given by the expression
Because the innovations process is uncorrelated, (21-27) follows
from substitution of (21-25) into (21-37). Finally, S(N + 1IN) = 0 because i(klk + I) = i(klk + Z - 1)
r(N + 11N) = 0. 0 + N,(klk + Z)[z(k + I) - H(k + Z)r2(k + Ilk + Z - l)] (21-41)
Equations (21-24) and (21-25) are very efficient; they require no matrix where
inversions or multiplications of 12x y1matrices. The calculation of P(k IN>
does require multiplications of n x n.matrices. N,(klk + I) = S,(k,Z)H(k + I)
Matrix (I+-,(k + 1,k) in (21-33)is the plant matrix of the recursivepredic- [H(k + I)P(k + Ilk + Z - l)H(k + 2) + R(k + Z)]- (21-42)
tor (Lesson 16). It is interesting that the recursive predictor and not the
and
recursive filter plays the predominant role in fixed-interval smoothing. This is
further borne out by the appearanceof predictor quantities on the right-hand 9,(k,Z) = 9,(k,Z - 1)
side of (21-24). Observe that (21-25) looks quite similar to a recursive predic- [I - K(k + Z - ljH(k + I - l)]W(k + Z,k + 2 - 1) (21-43)

Equutions (21-#I) and (21-43) are initialized by %(kik) and !Z&(k,l) = variable model. Anderson and Moore (1979)give these equations for the
P(kik)@(k + 1,k), respe&ely. Additio&ly, recursive predictor; i.e., they find
P(klk + 2) = P(kjk + 1 - 1) - N#/k + I)[H(k + Z)P(k + Zik + 2 - 1)
H(k + Z) + R(k + Z)]N;(kik + Z) (21-44)
which is initialized by P(kik). III
where k ~j. The last equality in (21-49) makes use of (21-46) and
We leavethe proof of this useful theorem, which is similar to a result (21-45). Observe that i(jlk), the fixed-point smoother of x(j), has been
given by Fraser (1967), as an exercisefor the reader.
found as the second component of the recursive predictor for the aug-
mented model.
Example 21-2 2. The Kalman filter (or recursive predictor) equations are partitioned in
Here we consider the problem of fixed-point smoothing to obtain a refined estimate of order to obtain the explicit structure of the algorithm for ri(jlk). We
the initial condition for the system described in Example 21-l. Recall that JJ(010)= 50 leave the details of this two-step procedure as an exercise for the reader.
and that by fixed-interval smoothing we had obtained the result p (014) = 16.31, which
is a significant reduction in the uncertainty associatedwith the initial condition.
Using Equation (21-40) or (21-44) we compute ~1(0/1),~1(0/2),and ~(013) to be FIXED-l&G SMOOTHING
16.69, l&32? and 16.31, respectively. Observe that a major reduction in the smoothing
error variance occurs as soon as the first measurementis incorporated, and that the The earliest attempts to obtain a fixed-lag smoother i(kjk + L) led to an
improvement in accuracy thereafter is relatively modest. This seems to be a general algorithm (e.g., Meditch, 1969), which was later shown to be unstable (Kelly
trait of fixed-point smoothing. IJ
and Anderson, 1971).The following state augmentation procedure leads to a
stable fixed-interval smoother for li(k - L Ik).
Another way to derive fixed-point smoothing formulas is by the follow- We introduce t + 1 state vectors, as follows: xl(k + 1) = x(k),
ing state augmentatiun procedure (Anderson and Moore, 1979). We assume xz(k + 1) = x(k - l), xj(k + 1) = x(k - 2), . . . , xL + I(k + 1) = x(k - L)
that for IL 2 j [i.e., xi(k + 1) = x(k + 1 - i), i = 1, 2,. . . , L + 13.The state equations for
W) A x(j) (21-45) these L + 1 state vectors are
The state equation for state vector x0(k) is xl(k + 1) = x(k)
xz(k + 1) = x,(k)
x,,(k + 1) = xG(k) k + (21-46)
xj(k + 1) = x,(k) (X-50)
It is initialized at k = j by (21-45). Augmenting (21-46) to our basic state-
variable model in (15-17) and (S-18), we obtain the following augmented xL+l(k + ;; = XL(k) I
basic state-variable mod&
Augmenting (21-50) to our basic state-variable model in (15-17) and (15-18),
we obtain yet another augmented basic state-variable model:

z(k + 1) = (H(k + 1) 0) [;:$I + v(k + 1) (21-48)


Q
The foIlowing two-step procedure can be usedto obtain an algorithm for (21-51)
+ u(k) + w(k)
S( jik):

1. Write down the KaIman filter equationsfor the augmented basic state-

The following two-stepprocedure can be usedto obtain an algorithm for Fixed-Lag Smoothing, derive the resulting fixed-lag smoother equations. Show,
k(k - 1 IL\* by means of a block diagram, that this smoother is stable.
21-7. (Meditch, 1969, Exercise 6-13, pg. 245). Consider the scalar systemx(k + 1) =
1. Write down the Kalman filter equations for the augmentedbasic state- 2-kx(k)+w(k),z(k+1)=x(k+1),k=O,l,...,wherex(0)hasmeanzero
variable model. Anderson and Moore (1979)give theseequations for the and variance U& and w(k), k = 0, 1, . . . is a zero mean Gaussianwhite sequence
recursive predictor; i.e., they find which is independent of x(0) and has a variance equal to 4.
(a) Assuming that optimal fixed-point smoothing is to be employed to determine
E{col (x(k + 1),x1@+ l), . . . ,xL + I@ + l)f%(k)}
x(Olj), j = 1, 2, . . . , what is the equation for the appropriate smoothing
= co1(r2(k + l(k),i@ + l[k), . . . ) IiL +I(k + 1Jk)) (21-52) filter?
= co1(f(k + llk),i(klk),i(k - l/k), . . . f i(k - Llk)) (b) What is the limiting value of p (01j) asj + w?
The last equality in (21-52) makes use of the fact that xi(k + 1) = (c) How does this value compare with p (OIO)?
x(k + 1 - i), i = 1,2,. . . , L + 1.
2. The Kaiman filter (or recursive predictor) equationsare partitioned in
order to obtain the explicit structure of the algorithm for ji(k - Llk).
The detailed derivation of the algorithm for i(k - LIk) is left as an
exercise for the reader (it can be found in Anderson and Moore, 1979,
pp. 177-181).

Some aspectsof this fixed-lag smoother are:

1. It is numerically stable,becauseits stability is determinedby the stability


of the recursive predictor (i.e., no new feedback loops are introduced
into the predictor as a result of the augmentationprocedure);
2. In order to computei(k - L (k), we must also computethe L - 1 fixed-
lag estimates, i(k - Ilk), i(k - 21k), . . . , g(k - L + Ilk); this may be
costly to do from a computational point of view; and
3. Computation can be reducedby careful coding of the partitioned recur-
sive predictor equations.

PROBLEMS

21-1. Derive the formula for x(kIN) in (21-5) using mathematical induction. Then
derive i(klN) in (21-6).
21-2. Prove that {%(kIN), k = N, IV - 1, . . . , 0) is a zero-mean,second-order Gauss-
Markov process.
21-3. Derive the formula for the fixed-point smoothing error-covariance matrix,
P(kl j), given in (21-40).
21-4. Prove Theorem 21-3, which gives formulas for a most useful mean-squared
fixed-point smoother of x(k), i(klk + l), I = 1,2,. . . .
21-5. Using the two-step procedure described at the end of the section entitled
Fixed-Point Smoothing, derive the resulting fixed-point smoother equations.
21-6. Using the two-step procedure described at the end of the section entitled

Lesson 22
State Estimation: Smoothing Applications

INTRODUCTION

In this lesson we present some applications that illustrate interesting numerical and theoretical aspects of fixed-interval smoothing. These applications are taken from the field of digital signal processing.

MINIMUM-VARIANCE DECONVOLUTION (MVD)

Here, as in Examples 2-6 and 14-1, we begin with the convolutional model

z(k) = Σ_{i=1}^{k} μ(i)h(k - i) + v(k),   k = 1, 2, . . . , N  (22-1)

Recall that deconvolution is the signal processing procedure for removing the effects of h(j) and v(j) from the measurements so that one is left with an estimate of μ(j). Here we shall obtain a useful algorithm for a mean-squared fixed-interval estimator of μ(j).

To begin, we must convert (22-1) into an equivalent state-variable model.

Theorem 22-1 (Mendel, 1983a, pp. 13-14). The single-channel state-variable model

x(k + 1) = Φx(k) + γμ(k)  (22-2)
z(k) = h'x(k) + v(k)  (22-3)

is equivalent to the convolutional sum model in (22-1) when x(0) = 0, μ(0) = 0, h(0) = 0, and

h(l) = h'Φ^(l-1)γ,   l = 1, 2, . . .  (22-4)

Proof. Iterate (22-2) and substitute the results into (22-3). Compare the resulting equation with (22-1) to see that, under the conditions x(0) = 0, μ(0) = 0 and h(0) = 0, they are the same. □

The condition x(0) = 0 merely initializes our state-variable model. The condition μ(0) = 0 means there is no input at time zero. The coefficients in (22-4) represent sampled values of the impulse response. If we are given impulse response data {h(1), h(2), . . . , h(L)} then we can determine matrices Φ, γ and h as well as system order n by applying an approximate realization procedure, such as Kung's (1978), to {h(1), h(2), . . . , h(L)}. Additionally, if h(0) ≠ 0 it is simple to modify Theorem 22-1.

In Example 14-1 we obtained a rather unwieldy formula for μ̂_MS(N). Note that, in terms of our conditioning notation, the elements of μ̂_MS(N) are μ̂_MS(k|N), k = 1, 2, . . . , N. We now obtain a very useful algorithm for μ̂_MS(k|N). For notational convenience, we shorten μ̂_MS to μ̂.

Theorem 22-2 (Mendel, 1983a, pp. 68-70)

a. A two-pass fixed-interval smoother for μ̂(k|N) is

μ̂(k|N) = q(k)γ'r(k + 1|N)  (22-5)

where k = N - 1, N - 2, . . . , 1.
b. The smoothing error variance, σ²_μ̃(k|N), is

σ²_μ̃(k|N) = q(k) - q(k)γ'S(k + 1|N)γq(k)  (22-6)

where k = N - 1, N - 2, . . . , 1. In these formulas r(k|N) and S(k|N) are computed using (21-25) and (21-27), respectively, and E{μ²(k)} = q(k) [here q(k) denotes the variance of μ(k), and should not be confused with the event sequence, which appears in the product model for μ(k)].

Proof
a. To begin, we apply the fundamental theorem of estimation theory, Theorem 13-1, to (22-2). We operate on both sides of that equation with E{·|Z(N)}, to show that

γμ̂(k|N) = x̂(k + 1|N) - Φx̂(k|N)  (22-7)

By performing appropriate manipulations on this equation we can derive (22-5) as follows. Substitute x̂(k|N) and x̂(k + 1|N) from Equation (21-24) into Equation (22-7), to see that

yji(kllV) = i(k + Ilk) + P(k + llk)r(k + liIV>


we see that
- @[i(klk - 1) + P(k jk - l)r(klN)]
(2243) (2248)
= %(k + l]k) - <P#k - 1)
+ P(k + llk)r(k + l[N) - @P(k(k - l)r(k IN) which is equivalent to (22-6). c]
Applying (17-11) and (16-4) to the state-variable model in (22-2) and Observe, from (22-5) and (224, that iIi(kllV) and c$ (klN) are easily
(22-3), it is straightforward to show that computed once r(k IN) and S(k [A/)have been computed.
%(k + l/k) = @x(kIk - 1) + @K(k)Z(klk - 1) (22-9) The extension of Theorem 22-2 to a time-varying state-variable model is
straightforward and can be found in Mendel(1983).
hence, (22-8) reducesto
Example 22-1
y$(klN) = QK(k)i(kIk - 1) + P(k + llk)r(k + 1IN)
In this example we compute b(kIN), first for a broadband channel IR, h#), and then
- @P(kIk - l)r(k N) (22-10) for a narrower-band channel IR, h*(k). The transfer functions of these channel models
Next, substitute (21-25) into (22-lo), to show that are
-0.76286~~ + 1.58842 - 0.823562 + 0.000222419
yb(k]N) = @K(k)t(kIk - 1) + P(k + #)r(k + l\NJ HI(Z) = (22-19)
z4 - 2.2633~~ + 1.777342 - 0.498032 + 0.045546
- @P(k[k - l)@Jk + l,k)r(k + l!IV) (22-11)
- (PP(kIk - l)h[hP(klk - 1)h + r]-z(k[k - 1)
0.0378417~~- 0.0306517~
Hz(z) = (22-20)
Making use of Equation (17-12) for K(k), we find that the first and last Z4 - 3.4016497~~+ 4.5113732~~- 2.7553363~ + 0.6561
terms in Equation (22-11) are identical; hence,
respectively. Plots of these IRs and their squared amplitude spectra are depicted in
yji(kIN) = P(k + llk)r(k + l!N) - @P(k(k - 1) Figures 22-l and 22-2.
@i(k + l,k)r(k + l!IV) (22-12) Measurements, z(k) (k = 1,2,. . . ,250, where T = 3 msec), were generated by
convolving each of these IRs with a sparse spike train (i.e., a Bernoulli-Gaussian
Combine Equations (17-13) and (17-14)to seethat sequence) and then adding measurement noise to the results. These measurements,
which, of course, represent the starting point for deconvolution, are depicted in Figure
P(k + Ilk) = <PP(k[k - l)a$(k + 1,k) + yq(k)y (22-13) 22-3.
Finally, substitute (22-13) into Equation (22-12)to observe that Figure 22-4 depicts fi(k IN). Observe that much better results are obtained for the
broadband channel than for the narrower-band channel, even though data quality, as
YWIN) = Y4oYr(k + 1IN) measured by SNR, is much lower in the former case. The MVD results for the
narrower-band channel appear smeared out, whereas the MVD results for the
which has the unique solution given by broadband channel are quite sharp. We provide a theoretical explanation for this effect
WIN) = 4ww(k + WI (22-14) below.
Observe, also, that fi(k IN) tends to undershoot p(k). See Chi and Mendel(l984)
which is Equation (22-5). for a theoretical explanation about why this occurs. 0
b. To derive Equation (22-6) we use (22-S)and the definition of estimation
error b(k IN),
iwYJ = P(k)- fiwv (22-15) STEADY-STATE MVD FILTER

to form For a time-invariant IR and stationary noises,the Kalman gain matrix, as well
P(k)= WIN) + 4Wr(k + l!N) (22-16) as the error-covariance matrices, will reach steady-state values. When this
Taking the variance of both sidesof (22-16),and using the orthogonality occurs, both the Kalman innovations filter and anticausal p-filter [(22-5) and
condition (21.25)] become time invariant, and we then refer to the MVD filter as a
steady-state MVD filter. In this section we examine an important property of
wwN)r(k + WH = 0 (22-17) this steady-statefilter.
208 State Estimation: Smoothing Applications Lesson 22 Steady-State MVD filter

0.60

0.30

0.00

-0.30

- 0.60

- 0.90
0.00 60.0 120.0 ISO* 240.0 3ou.o
(msecs)
(4

20.0 [

16.0

12.0

8.0

4.0

I 1 I
0.0
0.00 0.80 I.60 2.40 3.20 4.00 0.00 0.80 1.60 2.40 3.20 4.00

(radians)
(b)

Figure 22-l (a) Fourth-order broad-band channel IR, and (b) its squared ampli- Figure 22-2 (a) Fourth-order narrower-band channel, IR, and (b) its squared
tude spectrum (Chi, 1983). amplitude spectrum (Chi and Mendel, 1984, 0 1984,IEEE).
210 State Estimation: Smoothing Applications Lesson 22
Steady-State MVD Filter 211

0.20 -
0

0.06 0.10 -
0 0

- 0.06 0

-0.12
0

450 600 750 0.00 150 300 450 750


(msecs) (msecs)
(a) (a)

.
0.10
0.20
0
0

0.05 0 0
0.10
0
0
0 0
0
0.00
0.00

0 0
-0.05 ,.,,,,,
-0.10 - 0

-0.10 -0.20 -
0 0
1 I I I
0.00 150 300 450 750 0.00 150 300 450 750
(msecs) (msecs)
(b) (b)
Figure 22-3 Measurements associated with (a) broadband channel (SNR = 10) and (b) narrower-band channel (SNR = 100) (Chi and Mendel, 1984, © 1984 IEEE).
Figure 22-4 μ̂(k|N) for (a) broadband channel (SNR = 10) and (b) narrower-band channel (SNR = 100). Circles depict true μ(k) and bars depict estimates of μ(k) (Chi and Mendel, 1984, © 1984 IEEE).

Let h_K(k) and h_μ(k) denote the IRs of the steady-state Kalman innovations and anticausal μ-filters, respectively. Then,

μ̂(k|N) = h_μ(k)*z̃(k|k − 1)
= h_μ(k)*h_K(k)*z(k)
= h_μ(k)*h_K(k)*h(k)*μ(k) + h_μ(k)*h_K(k)*v(k)    (22-21)

which can also be expressed as

μ̂(k|N) = μ̂_S(k|N) + n(k|N)    (22-22)

where the signal component of μ̂(k|N), μ̂_S(k|N), is

μ̂_S(k|N) = h_μ(k)*h_K(k)*h(k)*μ(k)    (22-23)

and the noise component of μ̂(k|N), n(k|N), is

n(k|N) = h_μ(k)*h_K(k)*v(k)    (22-24)

We shall refer to h_μ(k)*h_K(k) as the IR of the MVD filter, h_MV(k), i.e.,

h_MV(k) = h_μ(k)*h_K(k)    (22-25)

The following result has been proven by Chi and Mendel (1984) for the slightly modified model x(k + 1) = Φx(k) + γμ(k + 1) and z(k) = h'x(k) + v(k) [because of the μ(k + 1) input instead of the μ(k) input, h(0) ≠ 0].

Theorem 22-3.
a. The Fourier transform of h_MV(k) is

H_MV(ω) = qH*(ω) / (q|H(ω)|² + r)    (22-26)

where H*(ω) denotes the complex conjugate of H(ω); and
b. the signal component of μ̂(k|N), μ̂_S(k|N), is given by

μ̂_S(k|N) = R(k)*μ(k)    (22-27)

where R(k) is the auto-correlation function

R(k) = (q/ν)[h(k)*h_K(k)]*[h(−k)*h_K(−k)]    (22-28)

in which

ν = h'P̄h + r    (22-29)

additionally,

R(ω) = q|H(ω)|² / (q|H(ω)|² + r)    (22-30)

We leave the proof of this theorem as an exercise for the reader. Observe that part (b) of the theorem means that μ̂_S(k|N) is a zero-phase wave-shaped version of μ(k). Observe, also, that R(ω) can be written as

R(ω) = [|H(ω)|²q/r] / [1 + |H(ω)|²q/r]    (22-31)

which demonstrates that q/r, and subsequently SNR, is an MVD filter tuning parameter. As q/r → ∞, R(ω) → 1 so that R(k) → δ(k); thus, for high signal-to-noise ratios μ̂_S(k|N) → μ(k). Additionally, when |H(ω)|²q/r >> 1, R(ω) → 1, and once again R(k) → δ(k). Broadband IRs often satisfy this condition. In general, however, μ̂_S(k|N) is a smeared-out version of μ(k); the nature of the smearing is quite dependent on the bandwidth of h(k) and SNR.

Example 22-2
This example is a continuation of Example 22-1. Figure 22-5 depicts R(k) for both the broadband and narrower-band IRs, h₁(k) and h₂(k), respectively. As predicted by (22-31), R₁(k) is much spikier than R₂(k), which explains why the MVD results for the broadband IR are quite sharp, whereas the MVD results for the narrower-band IR are smeared out. Note, also, the difference in peak amplitudes for R₁(k) and R₂(k). This explains why μ̂(k|N) underestimates the true values of μ(k) by such large amounts in the narrower-band case (see Figs. 22-4a and b). □

RELATIONSHIP BETWEEN STEADY-STATE MVD FILTER AND AN INFINITE IMPULSE RESPONSE DIGITAL WIENER DECONVOLUTION FILTER

We have seen that an MVD filter is a cascade of a causal Kalman innovations filter and an anticausal μ-filter; hence, it is a noncausal filter. Its impulse response extends from k = −∞ to k = +∞, and the IR of the steady-state MVD filter is given in the time domain by h_MV(k) in (22-25), or in the frequency domain by H_MV(ω) in (22-26).
There is a more direct way for designing an IIR minimum mean-squared error deconvolution filter, i.e., an IIR digital Wiener deconvolution filter, as we describe next.
We return to the situation depicted in Figure 19-4, but now we assume that: filter F(z) is an IIR filter, with coefficients {f(j), j = 0, ±1, ±2, ...};

d(k) = μ(k)    (22-32)

where μ(k) is a white noise sequence; μ(k), v(k), and n(k) are stationary; and μ(k) and v(k) are uncorrelated.
In this case, (19-39) becomes

Σ_{i = −∞}^{∞} f(i)φ_z(i − j) = φ_dz(j),    j = 0, ±1, ±2, ...    (22-33)

Using (22-1), the whiteness of μ(k), and the assumptions that μ(k) and v(k) are uncorrelated and stationary, it is straightforward to show that

φ_dz(j) = qh(−j)    (22-34)

Substituting (22-34) into (22-33), we have

Σ_{i = −∞}^{∞} f(i)φ_z(i − j) = qh(−j),    j = 0, ±1, ±2, ...    (22-35)

Taking the discrete-time Fourier transform of (22-35), we see that

F(ω)Φ_z(ω) = qH*(ω)    (22-36)

but, from (22-1), we also know that

Φ_z(ω) = q|H(ω)|² + r    (22-37)

Substituting (22-37) into (22-36), we determine F(ω) as

F(ω) = qH*(ω) / (q|H(ω)|² + r)    (22-38)

This IIR digital Wiener deconvolution filter (i.e., two-sided least-squares inverse filter) was, to the best of our knowledge, first derived by Berkhout (1977).

Theorem 22-4 (Chi and Mendel, 1984). The steady-state MVD filter, whose IR is given by h_MV(k), is exactly the same as Berkhout's IIR digital Wiener deconvolution filter. □

The steady-state MVD filter is a recursive implementation of Berkhout's infinite-length filter. Of course, the MVD filter is also applicable to time-varying and nonstationary systems, whereas his filter is not.

Figure 22-5 R(k) for (a) broadband channel (SNR = 100) and (b) narrower-band channel (SNR = 100) (Chi and Mendel, 1984, © 1984 IEEE).

MAXIMUM-LIKELIHOOD DECONVOLUTION

In Example 14-2 we began with the deconvolution linear model Z(N) = X(N − 1)μ + V(N), used the product model for μ (i.e., μ = Q_q r), and showed that a separation principle exists for the determination of r̂_MAP and q̂_MAP. We showed that first one must determine q̂_MAP, after which r̂_MAP can be computed using (14-57). We repeat (14-57) here for convenience,

r̂_MAP(N|q̂) = σ_r²Q_q̂X'(N − 1)[σ_r²X(N − 1)Q_q̂X'(N − 1) + ρI]⁻¹Z(N)    (22-39)

where q̂ is short for q̂_MAP. In terms of our conditioning notation used in state estimation, the elements of r̂_MAP(N|q̂) are r̂_MAP(k|N; q̂), k = 1, 2, ..., N.
Equation (22-39) is terribly unwieldy because of the N × N matrix,
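The frequency-domain form (22-38) is easy to exercise numerically. The following is a minimal sketch (not the text's software) of this two-sided Wiener deconvolution filter, implemented with FFTs; the channel IR, noise levels, and the sparse input used below are made-up illustrations, and circular-convolution end effects are ignored.

```python
import numpy as np

def wiener_deconvolve(z, h, q, r, n_fft=None):
    """Two-sided (noncausal) Wiener deconvolution, Eq. (22-38):
    F(w) = q H*(w) / (q |H(w)|^2 + r).  Returns an estimate of the
    white input mu(k) from measurements z(k) = h(k)*mu(k) + v(k)."""
    n_fft = n_fft or len(z)
    H = np.fft.fft(h, n_fft)                         # channel response H(w)
    F = q * np.conj(H) / (q * np.abs(H) ** 2 + r)    # deconvolution filter F(w)
    Z = np.fft.fft(z, n_fft)
    return np.real(np.fft.ifft(F * Z))               # mu_hat(k|N)

# --- illustrative (made-up) data, not the book's example ---
rng = np.random.default_rng(0)
n, q, r = 250, 1.0, 0.01
mu = rng.normal(0.0, np.sqrt(q), n) * (rng.random(n) < 0.05)  # sparse spike train
h = np.array([0.0, 1.0, -0.6, 0.2, -0.05])                    # toy channel IR
z = np.convolve(mu, h)[:n] + rng.normal(0.0, np.sqrt(r), n)
mu_hat = wiener_deconvolve(z, h, q, r)
```

Note that the smearing function R(ω) of (22-31) is simply F(ω)H(ω), so the same code can be used to study the effect of the tuning ratio q/r.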

σ_r²X(N − 1)Q_q̂X'(N − 1) + ρI, that must be inverted. The following theorem provides a more practical way to compute r̂_MAP(k|N; q̂).

Theorem 22-5 (Mendel, 1983b). Unconditional maximum-likelihood (i.e., MAP) estimates of r can be obtained by applying MVD formulas to the state-variable model

x(k + 1) = Φx(k) + γq_MAP(k)r(k)    (22-40)
z(k) = h'x(k) + v(k)    (22-41)

where q_MAP(k) is a MAP estimate of q(k).

Proof. Example 14-2 showed that a MAP estimate of q can be obtained prior to finding a MAP estimate of r. By using the product model for μ(k), and q̂_MAP, our state-variable model in (22-2) and (22-3) can be expressed as in (22-40) and (22-41). Applying (14-41) to this system, we see that

r̂_MAP(k|N) = r̂_MS(k|N)    (22-42)

but, by comparing (22-40) and (22-2), and (22-41) and (22-3), we see that r̂_MS(k|N) can be found from the MVD algorithm in Theorem 22-2, in which we replace μ(k) by r(k) and set q(k) = σ_r²q̂_MAP(k). □

RECURSIVE WAVESHAPING

In Lesson 19 we described the design of an FIR waveshaping filter (e.g., see Figure 19-4). In this section we shall develop a recursive waveshaping filter in the framework of state-variable models and mean-squared estimation theory. Other approaches to the design of recursive waveshaping filters have been given by Shanks (1967) and Aguilera et al. (1970).
We direct our attention at the situation depicted in Figure 22-6.

Figure 22-6 Waveshaping problem studied in this section. Information about z(k), not necessarily z(k) itself, is used to drive the (time-varying) recursive waveshaping filter (Mendel, 1983a, © 1983 IEEE).

Figure 22-7 State-variable formulation of the recursive waveshaping problem (Mendel, 1983a, © 1983 IEEE).

To begin, we must obtain the following state-variable models for h(k) and d(k) [i.e., we must use an approximate realization procedure (Kung, 1978) or any other viable technique to map {h(i), i = 0, 1, ..., I₁} into {Φ₁, γ₁, h₁}, and {d(i), i = 0, 1, ..., I₂} into {Φ₂, γ₂, h₂}]:

x₁(k + 1) = Φ₁x₁(k) + γ₁δ(k)    (22-43)
h(k) = h₁'x₁(k)    (22-44)

and

x₂(k + 1) = Φ₂x₂(k) + γ₂δ(k)    (22-45)
d(k) = h₂'x₂(k)    (22-46)

State vectors x₁ and x₂ are n₁ × 1 and n₂ × 1, respectively. Signal δ(k) is the unit spike.
In the stochastic situation depicted in Figure 22-6, where h(k) is excited by the white sequence w(k) and noise, v(k), corrupts s(k), the best we can possibly hope to achieve by waveshaping is to make z(k) = w(k)*h(k) + v(k) look like w(k)*d(k) (Figure 22-7). This is because both h(k) and d(k) must be excited by the same random input, w(k), for the waveshaping problem in this situation to be well posed. The state-variable model for this situation is

𝒮₁: x₁(k + 1) = Φ₁x₁(k) + γ₁w(k)    (22-47)
z(k) = h₁'x₁(k) + v(k)    (22-48)

𝒮₂: x₂(k + 1) = Φ₂x₂(k) + γ₂w(k)    (22-49)
d₁(k) = h₂'x₂(k)    (22-50)

Observe that both 𝒮₁ and 𝒮₂ are excited by the same input, w(k). Additionally, w(k) and v(k) are zero-mean mutually uncorrelated white noise sequences, for which

E{w²(k)} = q    (22-51)

and

E{v²(k)} = r    (22-52)

We now proceed to formulate the recursive waveshaping filter design problem in the context of mean-squared estimation theory. Our filter design problem is: Given the measurements z(1), z(2), ..., z(j), determine an estimator d̂₁(k|j) such that the mean-squared error

J[d̃₁(k|j)] = E{[d̃₁(k|j)]²}    (22-53)

is minimized. The solution to this problem is given next.

Theorem 22-6 (Structure of Minimum-Variance Waveshaper, Dai and Mendel, 1986). The minimum-variance waveshaping filter consists of two components: 1. stochastic inversion, and 2. waveshaping.

Proof. According to the Fundamental Theorem of Estimation Theory, Theorem 13-1, the unbiased, minimum-variance estimator of d₁(k) based on the measurements {z(1), z(2), ..., z(j)} is

d̂₁(k|j) = E{d₁(k)|ℨ(j)}    (22-54)

where ℨ(j) = col (z(1), z(2), ..., z(j)). Observe, from Figure 22-7, that

d₁(k) = w(k)*d(k)    (22-55)

hence,

d̂₁(k|j) = E{w(k)|ℨ(j)}*d(k) = ŵ(k|j)*d(k)    (22-56)

Equation (22-56) tells us that there are two steps to obtain d̂₁(k|j):
1. first obtain ŵ(k|j), and
2. then convolve the desired signal with ŵ(k|j).
Step 1 removes the effects of the original wavelet and noise from the measurements. It is the problem of stochastic inversion and can be performed by means of minimum-variance deconvolution. □

(In this book we have only discussed fixed-interval MVD, from which we obtain ŵ(k|N). Fixed-point algorithms are also available; e.g., Mendel, 1983.)

Theorem 22-7 (Recursive waveshaping, Mendel, 1983a, pg. 600). Let ŵ(k|N) denote the fixed-interval estimate of w(k), which can be obtained from 𝒮₁ via minimum-variance deconvolution (MVD) (Theorem 22-2). Then d̂₁(k|N) is obtained from the waveshaping filter

d̂₁(k|N) = h₂'x̂₂(k|N)    (22-57)
x̂₂(k + 1|N) = Φ₂x̂₂(k|N) + γ₂ŵ(k|N)    (22-58)

where k = 0, 1, ..., N − 1, and x̂₂(0|N) = 0. □

We leave the proof of this theorem as an exercise for the reader.
Some observations about Theorems 22-6 and 22-7 are in order. First, MVD is analogous to solving a stochastic inverse problem; hence, an MVD filter can be thought of as an optimal inverse filter. Second, if w(k) were deterministic and v(k) = 0, then our intuition tells us that the recursive waveshaping filter should consist of the following two distinct components: an inverse filter, to remove the effects of H(z), followed by a waveshaping filter whose transfer function is D(z). The transfer function of the recursive waveshaping filter would then be [1/H(z)]D(z), so that H(z)·[1/H(z)]·D(z) = D(z). Finally, the results in these theorems support our intuition even in the stochastic case; for, d̂₁(k|N), for example, is also obtained in two steps. As shown in Figure 22-8, first ŵ(k|N) is obtained via MVD. Then this signal is reshaped to give d̂₁(k|N).
Note, also, that in order to compute d̂₁(k|N) it is not really necessary to use the state-variable model for x̂₂(k|N). Observe, from (22-56), that

d̂₁(k|N) = d(k)*ŵ(k|N)    (22-59)

Example 22-3
In this example we describe a simulation study for the Bernoulli-Gaussian input sequence [i.e., w(k)] depicted in Figure 22-9. When this sequence is convolved with the fourth-order IR depicted in Figure 22-1a and noise is added to the result, we obtain measurements z(k), k = 1, 2, ..., 1000, depicted in Figure 22-10.
In Figure 22-11 we see ŵ(k|N). Observe that the large spikes in w(k) have been estimated quite well. When ŵ(k|N) is convolved with a first-order decaying exponential [d₁(t) = e^(−300t)], we obtain the shaped signal depicted in Figure 22-12. Some smoothing of the data occurs when ŵ(k|N) is convolved with the exponential d₁(k).
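Equation (22-59) suggests an equally simple numerical sketch of the two-step waveshaper of Theorems 22-6 and 22-7: deconvolve, then reshape. The sketch below reuses the wiener_deconvolve helper and the toy signals z, h, q, r from the earlier sketch; the desired wavelet d(k) and the sampling interval T are assumed values in the spirit of Example 22-3, not the text's data.

```python
import numpy as np

def waveshape(z, h, d, q, r):
    """Two-step waveshaper of Eq. (22-59): d1_hat(k|N) = d(k) * w_hat(k|N).
    Step 1 (stochastic inversion) uses the frequency-domain Wiener
    deconvolver sketched earlier; step 2 reshapes with the desired IR d(k)."""
    w_hat = wiener_deconvolve(z, h, q, r)     # step 1: w_hat(k|N)
    return np.convolve(w_hat, d)[:len(z)]     # step 2: reshape with d(k)

# desired wavelet: first-order decaying exponential (cf. Example 22-3)
T = 0.001                                     # assumed sampling interval, sec
d = np.exp(-300.0 * T * np.arange(50))
d1_hat = waveshape(z, h, d, q, r)
```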

Figure 22-8 Details of recursive linear waveshaping filter (Mendel, 1983a, © 1983 IEEE).
Figure 22-9 Bernoulli-Gaussian input sequence (Mendel, 1983a, © 1983 IEEE).
Figure 22-10 Noise-corrupted signal z(k). Signal-to-noise ratio chosen equal to ten (Mendel, 1983a, © 1983 IEEE).
Figure 22-11 ŵ(k|N) (Mendel, 1983a, © 1983 IEEE).
Figure 22-12 d̂₁(k|N) when d₁(t) = e^(−300t) (Mendel, 1983a, © 1983 IEEE).

More smoothing is achieved when ŵ(k|N) is convolved with a zero-phase waveform. □

Finally, note that, because of Theorem 22-3, we know that perfect waveshaping is not possible. For example, the signal component of d̂₁(k|N), d̂₁S(k|N), is given by the expression

d̂₁S(k|N) = d(k)*R(k)*w(k)    (22-60)

How much the auto-correlation function R(k) will distort d̂₁S(k|N) from d(k)*w(k) depends, of course, on bandwidth and signal-to-noise ratio considerations.

PROBLEMS

22-1. Rederive the MVD algorithm for μ̂(k|N), which is given in (22-5), from the Fundamental Theorem of Estimation Theory, i.e., μ̂(k|N) = E{μ(k)|ℨ(N)}.
22-2. Prove Theorem 22-3. Explain why part (b) of the theorem means that μ̂_S(k|N) is a zero-phase waveshaped version of μ(k).
22-3. This problem is a memory refresher. You probably have either seen or carried out the calculations asked for in a course on random processes.
(a) Derive Equation (22-34);
(b) Derive Equation (22-37).
22-4. Prove the recursive waveshaping Theorem 22-7.

Lesson 23
State Estimation for the Not-So-Basic State-Variable Model

INTRODUCTION

In deriving all of our state estimators we assumed that our dynamical system could be modeled as in Lesson 15, i.e., as our basic state-variable model. The results so obtained are applicable only for systems that satisfy all the conditions of that model: the noise processes w(k) and v(k) are both zero mean, white, and mutually uncorrelated, no known bias functions appear in the state or measurement equations, and no measurements are noise-free (i.e., perfect). The following cases frequently occur in practice:
1. either nonzero-mean noise processes or known bias functions or both in the state or measurement equations,
2. correlated noise processes,
3. colored noise processes, and
4. some perfect measurements.
In this lesson we show how to modify some of our earlier results in order to treat these important special cases. In order to see the forest from the trees, we consider each of these four cases separately. In practice, some or all of them may occur together.

BIASES

Here we assume that our basic state-variable model, given in (15-17) and (15-18), has been modified to

x(k + 1) = Φ(k + 1, k)x(k) + Γ(k + 1, k)w₁(k) + Ψ(k + 1, k)u(k)    (23-1)
z(k + 1) = H(k + 1)x(k + 1) + G(k + 1)u(k + 1) + v₁(k + 1)    (23-2)

where w₁(k) and v₁(k) are nonzero mean, individually and mutually uncorrelated Gaussian noise sequences, i.e.,

E{w₁(k)} = m_w₁(k) ≠ 0, m_w₁(k) known    (23-3)
E{v₁(k)} = m_v₁(k) ≠ 0, m_v₁(k) known    (23-4)

E{[w₁(i) − m_w₁(i)][w₁(j) − m_w₁(j)]'} = Q(i)δ_ij, E{[v₁(i) − m_v₁(i)][v₁(j) − m_v₁(j)]'} = R(i)δ_ij, and E{[w₁(i) − m_w₁(i)][v₁(j) − m_v₁(j)]'} = 0.
This case is handled by reducing (23-1) and (23-2) to our previous basic state-variable model, using the following simple transformations. Let

w(k) ≜ w₁(k) − m_w₁(k)    (23-5)

and

v(k) ≜ v₁(k) − m_v₁(k)    (23-6)

Observe that both w(k) and v(k) are zero-mean white noise processes, with covariances Q(k) and R(k), respectively. Adding and subtracting Γ(k + 1, k)m_w₁(k) in state equation (23-1) and m_v₁(k + 1) in measurement equation (23-2), these equations can be expressed as

x(k + 1) = Φ(k + 1, k)x(k) + Γ(k + 1, k)w(k) + u₁(k)    (23-7)
z₁(k + 1) = H(k + 1)x(k + 1) + v(k + 1)    (23-8)

where

u₁(k) = Ψ(k + 1, k)u(k) + Γ(k + 1, k)m_w₁(k)    (23-9)

and

z₁(k + 1) = z(k + 1) − G(k + 1)u(k + 1) − m_v₁(k + 1)    (23-10)

Clearly, (23-7) and (23-8) is once again a basic state-variable model, one in which u₁(k) plays the role of Ψ(k + 1, k)u(k) and z₁(k + 1) plays the role of z(k + 1).

Theorem 23-1. When biases are present in a state-variable model, then that model can always be reduced to a basic state-variable model [e.g., (23-7) to (23-10)]. All of our previous state estimators can be applied to this basic state-variable model by replacing z(k) by z₁(k) and Ψ(k + 1, k)u(k) by u₁(k). □

CORRELATED NOISES

Here we assume that our basic state-variable model is given by (15-17) and (15-18), except that now w(k) and v(k) are correlated, i.e.,

E{w(k)v'(k)} = S(k) ≠ 0    (23-11)

There are many approaches for treating correlated process and measurement noises, some leading to a recursive predictor, some to a recursive filter, and others to a filter in predictor-corrector form, as in the following:

Theorem 23-2. When w(k) and v(k) are correlated, then a predictor-corrector form of the Kalman filter is

x̂(k + 1|k) = Φ(k + 1, k)x̂(k|k) + Ψ(k + 1, k)u(k)
+ Γ(k + 1, k)S(k)[H(k)P(k|k − 1)H'(k) + R(k)]⁻¹z̃(k|k − 1)    (23-12)

and

x̂(k + 1|k + 1) = x̂(k + 1|k) + K(k + 1)z̃(k + 1|k)    (23-13)

where Kalman gain matrix K(k + 1) is given by (17-12), filtering-error covariance matrix P(k + 1|k + 1) is given by (17-14), and prediction-error covariance matrix P(k + 1|k) is given by

P(k + 1|k) = Φ₁(k + 1, k)P(k|k)Φ₁'(k + 1, k) + Q₁(k)    (23-14)

in which

Φ₁(k + 1, k) = Φ(k + 1, k) − Γ(k + 1, k)S(k)R⁻¹(k)H(k)    (23-15)

and

Q₁(k) = Γ(k + 1, k)Q(k)Γ'(k + 1, k) − Γ(k + 1, k)S(k)R⁻¹(k)S'(k)Γ'(k + 1, k)    (23-16)

Observe that, if S(k) = 0, then (23-12) reduces to the more familiar predictor equation (16-4), and (23-14) reduces to the more familiar (17-13).

Proof. The derivation of correction equation (23-13) is exactly the same, when w(k) and v(k) are correlated, as it was when w(k) and v(k) were assumed uncorrelated. See the proof of part (a) of Theorem 17-1 for the details.
In order to derive predictor equation (23-12), we begin with the Fundamental Theorem of Estimation Theory, i.e.,

x̂(k + 1|k) = E{x(k + 1)|ℨ(k)}    (23-17)

Substitute state equation (15-17) into (23-17), to show that

x̂(k + 1|k) = Φ(k + 1, k)x̂(k|k) + Ψ(k + 1, k)u(k) + Γ(k + 1, k)E{w(k)|ℨ(k)}    (23-18)
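Theorem 23-2 is straightforward to exercise numerically once it is written in the decorrelated form of (23-15), (23-16), and the gain D(k) = Γ(k + 1, k)S(k)R⁻¹(k) of Corollary 23-2. The following is a minimal sketch, not the book's software, of one predict/correct cycle for a time-invariant system; the function name and all array shapes are assumptions for illustration.

```python
import numpy as np

def kf_cycle_correlated(x_f, P_f, z_k, z_k1, u_k, Phi, Psi, Gamma, H, Q, R, S):
    """One predict/correct cycle when E{w(k) v'(k)} = S (cf. Theorem 23-2).
    Uses Phi1 = Phi - Gamma S R^{-1} H, Q1 = Gamma Q Gamma' - Gamma S R^{-1} S' Gamma',
    and D = Gamma S R^{-1} (Eqs. 23-15, 23-16, 23-27)."""
    Rinv = np.linalg.inv(R)
    D = Gamma @ S @ Rinv
    Phi1 = Phi - D @ H
    Q1 = Gamma @ Q @ Gamma.T - D @ S.T @ Gamma.T
    # prediction: decorrelated form, driven by the previous measurement z(k)
    x_p = Phi1 @ x_f + Psi @ u_k + D @ z_k
    P_p = Phi1 @ P_f @ Phi1.T + Q1
    # correction with the new measurement z(k+1), as in (23-13)
    K = P_p @ H.T @ np.linalg.inv(H @ P_p @ H.T + R)
    x_f1 = x_p + K @ (z_k1 - H @ x_p)
    P_f1 = (np.eye(len(x_f)) - K @ H) @ P_p
    return x_f1, P_f1, x_p, P_p
```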

and
Next, we develop an expressionfor E{w(k)lZ(k)}.
Let Z(k) = co1(%(k - l), z(k)); then, P(k + lik) = [@(k + 1,k) - L(k)H(k)]P(klk - 1)
[QP(k+ 1,k) - L(k)H(k)]
E{w(k)l~(k)}= E(w(k)l~(k- l),z(k))
+ r(k + l,k)Q(k)I-(k + l,k)
= E{w(k)l%(k - l),i(kJk - 1)) (23-25)
- r(k + l,k)S(k)L(k)
= E{w(k)l%(k - 1)) + E{w(k)fi(kIk - 1)) (23-19)
- E(w(k)l - L(k)S(k)Y(k + 1,k)
= E{w(k)[i(kIk - 1)) + L(k)R(k)L(k)

In deriving (23-19) we used the facts that w(k) is zero mean, and w(k) and Proof. These results follow directly from Theorem 23-2; or, they can be
%(k - 1) are statistically independent. Because w(k) and i(klk - 1) are derived in an independent manner, as explained in Problem 23-2. Cl
jointly Gaussian,
Corollary 23-2. When w(k) and v(k) are correlated, then a recursive
E{w(k)li(klk - 1)) = P,;(k,klk - l)P;i(klk - l)g(klk - 1) (23-20)
filter for x(k + 1) is
where Pi; is given by (16-33), and
i(k + Ilk + 1) = (I?l(k + l,k)i(k\k) + *(k + l,k)u(k)
P,(k,kIk - 1) = E{w(k)Z(klk - 1)) + D(k)z(k) + K(k + l)Z(k + Ilk) (23-26)
= E{w(k)[H(k)%(klk - 1) + v(k)]} (23-21)
where
= S(k)
D(k) = I(k + l,k)S(k)R-(k) (23-27)
In deriving (23-21) we used the facts that i(klk - 1) and w(k) are statistically
independent, and w(k) is zero mean. Substituting (23-21) and (16-33) into and all other quantities have been defined above.
(23-20),we find that Proof. Again, these results follow directly from Theorem 23-2; how-
E{w(k)lZ(kIk - 1)) = S(k)[H(k)P(klk - l)H(k) + R(k)]- (23-22) ever, they can also be derived, in a much more elegant and independent
manner, as described in Problem 23-3. q
Substituting (23-22) into (23-19),and the resulting equation into (23-18)com-
pletes our derivation of the recursivepredictor equation (23-12).
We leave the derivation of (23-14) as an exercise. It is straightforward
COLORED NOISES
but algebraically tedious. 0
Quite often, some or all of the elements of either v(k) or w(k) or both are
Recall that the recursivepredictor playsthe predominant role in smooth- colored (i.e., have finite bandwidth). The following three-step procedure is
ing; hence, we present used in these cases:

Corollary 23-l. When w(k) and v(k) are correlated, then a recursive 1. model each colored noise by a low-order difference equation that is
predictor for x(k + l), is excited by white Gaussian noise;
2. augment the states associatedwith the step 1 colored noise models to the
i(k + ilk) = 4P(k + l,k)i(klk - 1)
original state-variable model;
+ V(k + l,k)u(k) + L(k)s(k[k - 1) (23-23)
3. apply the recursive filter or predictor to the augmented system.
where

L(k) = [@(k + l,k)P(klk - l)H(k) We try to model colored noiseprocessesby low-order Markov processes,
-1
i,e,) low-order difference equations. Usually, first- or second-order models
+ r(k + l,k)S(k)l[H(k)P(kjk - l)H(k) + R(k)]- (23-24)

Equations (23-35) and (23-36) constitute the augmented state-variable model.


are quite adequate.Consider the following first-order model for colored noise We observe that. when the original process noise is colored and the measurement noise
process w(k), is white, the state augmentation procedure leads us once again to a basic (augmented)
u(k + 1) = cm(k) + n(k) (23-28) state-variable model. one that is of higher dimension than the original model because
of the modeled colored processnoise. Hence, in this casewe can apply all of our state
In this model n(k) is white noise with variance & ; thus, this model contains estimation algorithms to the augmented state-variable model. E
two parameters: ctand d, which must be determined from a priori knowledge
abut u(k). We may know the amplitude spectrum of m(k), correlation infor- Example 23-3
mation about w(k), steady-statevarknce of u(k), etc. Two independent pieces Here we consider the situation where the process noise is white but the measurement
of information are needed in order to uniquely identify a and dn. noise is colored, again for a first-order system,
Example 23-l X(k + 1) = a,X(k) t w(k) (23-37)
We are given the facts that scalar noise w(k) is stationary with the properties z(k + 1) = hx(k + 1) + ~(k + 1) (23-38)
E{w(k)} = 0 and E{w(i)w(~]] = e -b- 4. A first-order Markov model for w(k) can
easily be obtained as As in the preceding example, we model v(k) by the following first-order Markov
process
t(k + I) = e-2((k) +-Vl - cd n(k) (23-29)
v(k + 1) = ap(k) + n(k) (23-39)
4kJ = w (23-30)
where n(k) is white noise. Augmenting (23-39) to (23-37) and reexpressing (23-38) in
where E{t(O)] = 0, E{[(O)] = I, E{n (k)] = tl and E{n (~]n(I]} = &. 0 terms of the augmented state vector x(k), where
Example 23-2 x(k) = co1 (X(k), v(k)) (23-40)
Here we illustrate the state augmentation procedure for the first-order system we obtain the following augmented state-variable model,
Xck + 1) = a,X(k) + w(k) (23-31)
(23-4 1)
z (k + 1) = hX(k + 1) + v(k + 1) (23-32)
where w(k) is a first-order Markov process, i.e.,
u(k + 1) = ulw(k) + n(k) (23-33) and
and v(k) and n(k) are white noise processes.We augment (23-33) to (23-31), as (23-42)
follows. Let
x(k) = co1(X(k), u(k)) (23-34) H x(k + 1)
then (23-31) and (23-33) can be combined, to give Observe that a vector process noise now excites the augmented state equation
and that there is no measurement noise in the measurement equation. This second
(23-35) observation can lead to serious numerical problems in our state estimators, becauseit
\ -u -
means that we must set R = 0 in those estimators, and, when we do this, covariance
x(k + 1) (D matrices become and remain singular. 0
w Y
Equation (23-25) is our augmentedstute equutioa Observe that it is once again excited Let us examinewhat happens to P(k + Ilk + 1) when covariance matrix
by a white noise process,just as our basic state equation (15-17) is. R is set equal to zero. From (17-14) and (17-12) (in which we set R = O>,we
In order to complete the description of the augmented state-variable model, we find that
must expressmeasurementz (Ic + 1) in terms of the augmented state vector, x(k + l),
i.e., P(k + Ilk + 1) = P(k + Ilk) - P(k + llk)H(k + 1)
[H(k + l)P(k + llk)H(k + l)]-H(k + l)P(k + Ilk) (23-43)
z(k + 1) = (h 0) ($ ; ;;) + v(k + 1) (23-36)
Multiplying both sidesof (23-43) on the right by H(k + l), we find that
-w
H x(k + 1)
P(k + l[k + l)H(k + 1) = 0 (23-44)
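The state-augmentation procedure for colored noise described above (cf. Examples 23-2 and 23-3) amounts to bookkeeping on the model matrices. A minimal sketch for a scalar system with a first-order colored process noise follows; the numerical values are assumptions for illustration, not the text's.

```python
import numpy as np

# Original scalar system x(k+1) = a1*x(k) + w(k), z(k+1) = h*x(k+1) + v(k+1),
# with colored w(k):  w(k+1) = alpha*w(k) + n(k), n(k) white.
a1, h, alpha = 0.9, 1.0, 0.8          # assumed values for illustration
sigma_n2, r = 0.5, 1.0                # variances of n(k) and v(k)

# Augmented state x_a = col(x, w); augmented model matrices:
Phi_a = np.array([[a1, 1.0],
                  [0.0, alpha]])       # state matrix of augmented system
Gamma_a = np.array([[0.0],
                    [1.0]])            # white n(k) drives only the noise state
H_a = np.array([[h, 0.0]])             # measurement still sees only x(k)
Q_a = Gamma_a @ np.array([[sigma_n2]]) @ Gamma_a.T
# A standard Kalman filter can now be run on (Phi_a, Gamma_a, H_a, Q_a, r).
```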

BecauseH(k + 1) is a nonzero matrix, (23-44) implies that P(k + Ifk + 1) where L1 is n x I and L2 is n x (n - 1); thus,
must be a singular matrix. We leave it to the reader to show that once
x(k) = LlYW) + LP(k) (23-50)
P(k + Ilk + 1) becomessingular it remains singular for all other values of k.
In order to obtain a filtered estimate of x(k), we operate on both sidesof
(23-50)with E{ I%(k)}, where
l

PERFECT MEASUREMENTS: REDUCED-ORDER ESTIMATORS


9(k) = co1 (Y(l), Y(2), l l l 9 Y(k))
(23-51)
We have just seen that when R = 0 (or, in fact, even if some, but not all, Doing this, we find that
measurementsare perfect) numerical problems can occur in the Kalman filter.
One way to circumvent theseproblems is ad hoc, and that is to usesmall values
for the elements of covariancematrix R, even through measurements are wqk) = L,Y(k) + L2i-q lk) (23-52)
thought to be perfect. Doing this has a stabilizing effect on the numerics of the
Kalman filter. which is a reduced-order estimator for x(k). Of course, in order to evaluate
A secondway to circumvent these problems is to recognizethat a set of ii(k lk) we must develop a reduced-order Kalman filter to estimate p(k). Know-
perfect measurementsreducesthe number of states that have to be esti- ing fi(k ]k) and y(k), it is then a simple matter to compute jZ(k ]k), using (23-52).
mated. Suppose,for example,that there are 2perfect measurementsand that In order to obtain fi(klk), using our previously-derived Kalman filter
state vector x(k) is n X 1. Then, we conjecture that we ought to be able to algorithm, we first must establish a state-variable model for p(k). A state
estimatex(k) by a Kalmanfilter whosedimension is no greaterthan n - e. Such equation for p is easily obtained, as follows:
an estimator will be referred to as a reduced-order estimator.The payoff for p(k + 1) = Cx(k + 1) = C[@x(k) + rw(k)]
using a reduced-order estimatoris fewer computations and lessstorage. = C@[L,y(k) + Lzp(k)] + CTw(k) (23-53)
In order to illustrate an approach to designing a reduced-order esti-
mator, we limit our discussionsin this section to the following time-invariant = CQLzp(k) + C@Lly(k) + Ww(k)
and stationary basic state-variable model in which u(k) A 0 and all mea- Observe that this state equation is driven by white noise w(k) and the known
surementsare perfect, forcing function, y(k).
x(k + 1) = @x(k) + Tw(k) (23-45) A measurement equation is obtained from (23-46), as
y(k + 1) = Hx(k + 1) (23-46) y(k + 1) = Hx(k + 1) = H[@x(k) + Tw(k)]

In this model y is I x 1. What makes the design of a reduced-orderestimator = H@[Lly(k) + L*p(k)] + HTw(k) (23-54)
challenging is the fact that the I perfect measurementsare linearly related to = H@Lzp(k) + H@Lly(k) + HTw(k)
the n states, i.e., H is rectangular. At time k + 1 we know y(k); hence, we can reexpress (23-54) as
To begin we introduce a reduced-orderstate vector, p(k), whose dimen-
sion is (n - 2) x 1; p(k) is assumedto be a linear transformation of x(k), i.e., yl(k + 1) = H@L*p(k) + HIw(k) (23-55)
p(k) 4 Cx(k) (23-47)
Augmenting (23-47) to (23-46),we obtain yl(k + 1) 4 y(k + 1) - HQLy(k) (23-56)
Before proceeding any farther, we make some important observations
(23-48) about our state-variable model in (23-53) and (23-55). First, the new mea-
surement yl(k + 1) representsa weighted difference between measurements
H y(k + 1) and y(k). The technique for obtaining our reduced-order state-
Design matrix C is chosenso that c is invertible. Of course, many differ-
( > variable model is, therefore, sometimes referred to as a measurement-
ent choices of C are possible;thus, this first step of our reduced-order esti- diflerencing technique (e.g., Bryson and Johansen, 1965). Becausewe have
mator design procedure is nonunique. Let already used y(k) to reduce the dimension of x(k) from n to n - 1,we cannot
again use y(k) alone as the measurementsin our reduced-order state-variable
= &IL,) (23-49) model. As we have just seen,we must use both y(k) and y(k + 1).

we see that
second, measurementequation (23-55) appears to be a combination of
signal and noise. Unless HI? = U, the term Hrw(k) will act asthe measurement &,(k + Ilk + 1) = &(k + ilk) (23-64)
noise in our reduced-order state-variable model. Its covariance matrix is
Equation (23-64) tells us to obtain a recursive filter for our reduced-order
HrQrE#. Unfortunately, it is possiblefor Hr to equal the zero matrix. From
model, that is in terms of data set 9,:(k + l), we must first obtain a recursive
linear system theory, we know that IV is the matrix of first IvIarkov param-
predictor for that model, which is in terms of data set B{(k). Then, wherever
eters for our originai systemin (23-45) and (23-46), and Hr may equal zero. If
t(k) appears in the recursive predictor, it can be replaced by yl(k + 1).
this occurs, then we must repeat all of the above until we obtain a reduced-
Using Corollary 23-1, applied to the reduced-order model in (23-53) and
order state vector whosemeasurementequation is excited by white noise. We
(23-58): we find that
see, therefore, that dependingupon systemdynamics, it is possibleto obtain a
reduced-order estimator of x(k) that uses a reduced-order Kalman filter of @,(k + l/k) = C@L&(@ - 1) + C@L,y(k)
dimension less than n - 1,
Third, the noises, which appear in state equation (23-53) and mea-
surement equation (23-S) are the same, namely w(k); hence, the reduced- thus,
order state-variable model involves the correlated noise casethat we described fi,,(k + l(k + 1) = C~L&,(k~k) + CQLly(k)
before in this chapter in the section entitled Correlated Noises.
+ L(~)[YI(~ + 1) - H@bfi,,(kh)l (23-66)
Finally, and most important, measurement equation (23-55) is non-
standard, in that it expressesyl at k + I in terms of p at k rather than p at Equation (23-66) is our reduced-order Kalman filter. It provides filtered esti-
k + 1. Recall that the measurementequation in our basicstate-variablemodel mates of p(k + 1) and is only of dimension (n - I) x 1. Of course, when L(k)
is z(k + 1) = Hx(k + 1) -t v(k + 1). We cannot immediately apply our Kal- and P,,(k + Ilk + 1) are computed using (23-13) and (23-14), respectively, we
man filter equations to (23-53) and (23-S) until we express (23-55) in the must make the folIowing substitutions: @(k + l,k)* C@L?, H(k)-+ H@Lz,
standard way. f(k + 1 ,k)-* CT, Q(k) --f Q, S(k) + QTH, and R(k)-+ HTQTH.
To proceed, we let

FINAL REMARK
so that
In order to see the forest from the trees, we have considered each of our
c(k) = H@L?p(k) + Hrw(k) (23-58)
special cases separately. In actual practice, some or all of them may occur
Measurement equation (23-58)is now in the standard form; however, because simultaneously. The exercises at the end of this lesson will permit the reader to
g(k) equals a future value of yI, namely yl(k + l), we must be very careful in gain experience with such cases.
applying our estimator formulas to our reduced-order model (23-53) and
(23-58).
In order to see this more clearly, we define the foIlowing two data sets, PROBLEMS
By@ + 1) = -h(l), y,(2), . . . , ydk + I), . . .I (23-59)
23-l. Derive the prediction-error covariance equation (23-14).
and 23-2. Derive the recursive predictor, given in (23-23), by expressing ri(k f l(k) as
(2340) E(x(k t- l)~~(k)) = E(x(k + l)jZE(k - l),i(kjk - 1)).
23-3. Here we derive the recursive filter, given in (23-26), by first adding a convenient
Obviously, form of zero to state equation (1517): in order to decorrelate the processnoise
(23-61) in this modified basic state-variable model from the measurement noise v(k).
Letting Add D(k)[z(k) - H(k)x(k) - v(k)] to-(15-17). The process noise, w,(k), in the
modified basic state-variable model, is equal to r(k f l,k)w(k) - D(k)v(k).
(23-62) Choose decorrelation matrix D(k) so that E{w,(k)v(k)] = 0. Then complete
and the derivation of (23-26). Observe that (23-14) can be obtained by inspection,
via this derivation.
(23-63)

23-11, Obtain the equations from which we can find &(k + Ilk + 1) i,(k + ilk + 1)
23-4. In solving Problem 23-3, one arrives at the following predictor equation, and c(k + l]k + 1) for the following system:
i(k + l(k) = QI(k + l,k)i(k(k) + P(k + l,k)u(k) + D(k)@)
Xl(k + 1) = --xl(k) + x*(k)
Beginning with this predictor equation, and corrector equation (Z-13) derive X2(k + 1) = x2(k) + w(k)
the recursive predictor given in (23-23). z(k + 1) = xI(k + 1) + v(k + 1)
23-5. Show that once P(k + l/k + 1) becomes singular it remains singular for all
other values of k. where v(k) is a colored noise process, i.e.,
23-6. Assume that R = 0, HF = 0, and HOPIf 0. Obtain the reduced-order v(k + 1) = - $ v(k) + n(k)
estimator and its associated reduced-order Kalman filter for this situation.
Contrast this situation with the case given in the text, for which NT f 0. Assume that w(k) and n (k) are white processesand are mutually uncorrelated,
and, c&(k) = 4 and $,(k) = 2. Include a block diagram of the interconnected
23-7. Develop a reduced-order estimator and its associatedreduced-order Kalman
system and reduced-order KF.
filter for the case when 2 measurements are perfect and m - I measurements
are noisy. 23-12. Consider the system x(k + 1) = @x(k) + yp(k) and z(k + 1) = hx(k + 1) +
v (k + 1), where p(k) is a colored noise sequenceand v(k) is zero-mean white
23-8. Consider the first-order system x(k + 1) = ix(k) + w*(k) and z (k + 1) = noise. What are the formulas for computing fi(kIk + l)? Filter &i(k Ik + 1) is a
x(k + 1) + v(k + 1), where E{wi(k)} = 3, E{v(k)} = 0, w,(k) and v(k) are deconvolution filter.
both white and Gaussian,E{w:(k)} = 10, E{v2(k)} = 2, and, w,(k) and v(k) are
correlated, i.e., E{wl(k)v (k)} = 1. 23-13. Consider the scalar moving average (MA) time-series model
(a) Obtain the steady-staterecursive Kalman filter for this system. z(k) = r(k) + r(k - 1)
(b) What is the steady-statefilter error variance, and how does it compare with
the steady-statepredictor error variance? where r(k) is a unit variance, white Gaussian sequence.Show that the optimal
one-step predictor for this model is [assume P(OI0) = I]
23-9. Consider the first-order system x(k + 1) = $x(k) + w(k) and z(k + 1) =
x (k + 1) + v(k + l), where w(k) is a first-order Markov process and v(k) is
i(k + l/k) = & [z(k) - i(klk - I)]
Gaussian white noise with E{v (k)} = 4 and r = 1.
(a) Let the model for w(k) be w(k + 1) = w(k) + u(k), where u(k) is a
zero-mean white Gaussian noise sequence for which E{u (k)} = d. (Hint: Express the MA model in state-spaceform.)
Additionally, E{w(k)} = 0. What value must cyhave if E(w(k)} = W for all 23-14. Consider the basic state variable model for the stationary time-invariant case.
k? assume also that w(k) and v(k) are correlated, i.e., E{w(k)v(k)} = S.
(b) Suppose W2 = 2 and 02, = 1. What are the Kalman filter equations for (a) Show, from first principles, that the single-stage smoother of x(k), i.e.,
estimation of x (k) and w(k)? %(k jk + 1) is given by
2340. Consider the first-order system x(k + 1) = - $x(k) + w(k) and t (k + 1) = i(klk + 1) = i(klk) + M(klk + l)i(k + Ilk)
x(k + 1) + v(k + l), where w(k) is white and Gaussian[w(k) - N(w(k); 0,
1)] and v (k) is also a noise process. The model for v(k) is summarized in Figure where M(kIk + 1) is an appropriate smoother gain matrix.
P23-10. (b) Derive a closed form solution for M(kIk + 1) as a function of the
(a) Verify that a correct state-variable model for v(k) is, correlation matrix S and other quantities of the basic state-variable model.

xl@ + 1) = -&xl(k) + $n(k)


v(k) = xl(k) + n(k)
(b) Show that v(k) is also a white process.
(c) Noise n(k) is white and Gaussian [n(k) - N(n (k); 0, l/4)].
What are the Kalman fiIter equations for finding i(k + l(k + l)?

Figure P23-10

Lesson 24
Linearization and Discretization of Nonlinear Systems

Figure 24-1 Coordinate system for an angular measurement between two objects A and B.

The purpose of this lesson is to explain how to linearize and discretize a


nonlinear differential equation model. We do this so that we will be able to
apply our digital estimators to the resulting discrete-time system.

A DYNAMICAL MODEL
INTRODUCTION
The starting point for this lesson is the nonlinear state-variable model
Many real-world systemsare continuous-time in nature and quite a few are g(f) = f[x(&u(t),t] + G(t)w(l) (24-l)
also nonlinear. For example, the state equationsassociatedwith the motion of
a satellite of massnz about a spherical planet of massA!, in a planet-centered z(f) = hb(+.+),d + ~0) (24-Z)
coordinate system, are nonlinear, becausethe planets force field obeys an We shall assumethat measurements are only available at specific values of
inverse square law. Figure 24-l depicts a situation where the measurement time, namely at t = ti, i = 1, 2, . . . ; thus, our measurement equation will be
equation is nonlinear. The measurement is angle 4, and is expressed in a treated as a discrete-time equation, whereasour state equation will be treated
rectangular coordinate system, i.e., +i = tan- [y /(x - ii)], Sometimes the asa continuous-time equation. State vector x(i) is n X 1; u(t) is an 1 X I vector
state equation may be nonlinear and the measurement equation linear, or of known inputs; measurement vector z(t) is m x 1; ii(f) is short for dx(t)/dt;
vice-versa, or they may both be nonlinear. Occasionally, the coordinate sys- nonlinear functions f and h may depend both implicitly and explicitly on l, and
tern in which one chooses to work causesthe two former situations. For we assumethat both f and h are continuous and continuously differentiable
example, equations of motion in a polar coordinate system are nonlinear, with respect to all the elements of x and u; w(t) is a continuous-time white
whereasthe measurementequations are linear. In a polar coordinate system, noise process,i.e., E{w(t)) =. 0 and
where # is a state-variable, the measurementequation for the situation de-
picted in Figure 24-1 is zj = &, which is linear. In a rectangular coordinate E(w(~)w(T)) = Q(r)?+ - 7); (24-3)
system, on the other hand, equations of motion are linear, but the mea-
surement equations are nonlinear. v(ti) is a discrete-time white noise process,i.e., E(v(t,)) = 0 for t = E,, i = 1,
Finally, we may begin with a linear systemthat contains some unknown and
parameters. When these parameters are modeled as first-order Markov E(v(~~)v(~j))= R(ti)6, ; (24-4)
processes?and these models are augmentedto the original system, the aug-
mented model is nonlinear, because the parameters that appeared in the and, w(f) and V(Q)are mutually uncorrelated at all E = t,, i.e.:
original linear model are treated as states.We shall describethis situation in
much more detail in Lesson25. E(w(r)v(~J} = 0 fort = t, i = 1,2,. . . (24-5)


Example 24 1 Comparing (24-8) and (24-l), and (24-9) and (24-2), we conclude that
Here we expand upon the previously mentioned satellite-planet example. Our example 1 2.w4
is taken from Meditch (1969, pp. 60-61), who states. . . Assuming that the planets
force field obeys an inverse square law, and that the only other forces present are the
fb(t),u(t>Jl= co1x2,x1x:- Xl$+ mU1=X4
-- Xl +lmu2 1 (24-10)

satellites two thrust forces u,(t) and u@(t)(see Figure 24-2), and that the satellites and
initial position and velocity vectors lie in the plane, we know from elementary particle
mechanicsthat the satellites motion is confined to the plane and is governed by the two h[x(t),u(t),t]= x1 - ro (24-11)
equations
Observe that, in this example, only the state equation is nonlinear. 0
.. 1
r = r(j2 - - y+
; u40 (24-6)
r2

and LINEAR PERTURBATION EQUATIONS

_ li = 2G + 1
; ue(0 (24-7) In this section we shall linearize our nonlinear dynamical model in (24-l) and
r
(24-2) about nominal values of x(t) and u(t), x*(t) and u*(t), respectively. If we
where y = GM and G is the universal gravitational constant. are given a nominal input, u*(t), then x*(t) satisfies the following nonlinear
Definingxl=r,xz= i,x~=9,xq=8,ul=ur,anduz=u,,wehave differential equation,
g*(t) = f [x*(t),u*(t),t] (24-12)
(24-8) and associatedwith x*(t) and u*(t) is the following nominal measurement,
z*(t), where
Z*(t) = h [x*(t),u*(t)] t = ti i = 1, 2, . . . (24-13)
which is of the form in (24-l). . . . Assuming . . . that the measurement made on the
satellite during its motion is simply its distance from the surface of the planet, we have Throughout this lesson, we shall assumethat x*(t) exists. We discusstwo
the scalar measurement equation methodsfor choosingx*(t) in Lesson25. Obviously, one is just to solve (24-12)
for x*(t).
z(t) = r(t) - r() + v(t) = x1(t) - ro + v(t) (24-9) Note that x*(t) must provide a good approximation to the actual behav-
where r. is the planets radius. ior of the system. The approximation is considered good if the difference
between the nominal and actual solutions can be described by a system of
linear differential equations, called linear perturbation equations. We derive
these equations next.
Let
6x(t) = x(t) - x*(t) (24-14)

h(t) = u(t) - u*(t) (24-15)


_ Reference then,
Axis
Planet, M -$8x(t) = M(t) = k(t) - g*(t) = f[x(t)g(t),t]

Figure 24-2 Schematic for satellite-planet system (Copyright 1969,McGraw-Hill). + wb@) - f Lx*(t),u*(WI (24-16)

Fact 1. When f [x(&u(j),f] is expandedin a Taylor seriesabout x*(t) and Observe that, even if our original nonlinear differential equation is not an
u*(t), we obtain explicit function of time (i.e., f [x(r>.u(t),f] = f [x(f),u(t)]), our perturbation
state equation is alwaystime-varying becauseJacobian matrices F, and F, vary
f[x(t)u(t),t] = f[x*(t),u*($f] + ~[x*(~),u*(~),~]~x(t) with time, becausex and u* vary with time.
+ FU[x*(t),u*(&@u(t) + higher-order terms (24-17) Next, let
where FXaud FUare n x n and n x f Jacobian matrices, i.e., Sz(r) = z(t) - z*(t) (24-24)

(24-18) Fact 2. When h[x(t).u(t).t] is expanded in a Tuylor series about x*(t)


and u*(t), we obtain

h[x(t),u($f] = h[x*(t),u*(r),t3 + H, [x*(t>,u*(t)~]~x(~)


+ H, [x*(t),u*(t),fjSu(~) + higher-order terms (24-25)
(24-19) c where H, and H, are m x n and m x P Jacobian matrices, i.e.,
ah, f ax,* 4- - dh,/dx,*
H,[x*(t),u*(r).t] = ; *-. ; (24-26)
( ah,/&,* .*a ah&x,* 1
(24-20)
and
8h,l&i;r ..- dhIf&
(24-21) H,[x*(f),u*(t),f] = ; -nI ; (24-27)
( dh, /au,* .,- dh, / du/*1
Proof. The Taylor seriesexpansion of the ith componentof f [x(t),u(t),t] In these expressions dhi/axT and ahi/au; are short for
iS
dhi dhi Cx(~)*u(t),tI1
(24-28)
&j* - % (0 ix(t) = x*(f),u(t) = u*(t)
and
+$ tl [x&)- x;(t)]+3 1 [u#)- u?(t)] + -*. (24-22)
dhi dhi[x(l).u(t)J]
(24-29)
x(f) = x*(t),u(t) = u*(t) 0
+$LI*i u,t
0 - ul*(r)] + higher-order terms aLdj*- aui (0

wherei = 1,2,..., n. Collecting these II equations together in vector-matrix We leave the derivation of this fact to the reader, becauseit is analogous
format, we obtain (24-17), in which FXand FUare defined in (24-18) and to the derivation of the Taylor seriesexpansion off [x(f),u(t),/].
(24-19), respectively. El Substituting (24-25) into (24-24) and neglecting the higher-order
terms , we obtain the following perturbation measurement equation
Substituting (24-17) into (24-X) and neglecting the higher-order
terms, we obtain the following perturbation sttiie equatim 6z(r) = H, [x*(t>,u*(r),r]Sx(r)
+ H~[X*(t),U*(t)yt]SU(t) + V(f) t = Ii, i = 2,2,... (24-30)
G(t) = FX[x*(f),u*(t),#x(t)
+ FU[x*(r},u*(t)~]&~(t) + G(+v(f) (24-23) Equations (24-23) and (24-30) constitute our linear perturbation equa-
1 tions, or our linear perturbation stare-variable model.

Example 24-2 and


Returning to our satellite-planet Example 24-1, we find that
E{W(t)V(ti)} = 0 fort=& i=l,2,... (24-35)
/ 0 1 0 0 \
Our approach to discretizing state equation (24-30) begins with the
solution of that equation.
FX[x*(t>,u*(t),t] = FX[x*(t)] =
Theorem 24-l. The solution to state equation (24-31) can be expressed

0 0
x(t)= @(t,to)x(to)
+ ~@(w)[W)u(r)+ G(++)ld~ (24-36)
; 0 t0

F,[x*(t),u*(t),t] = FU= where state transition matrix @(t,T) is the solution to the following matrix
0 0
homogeneous differential equation,
0 -l
m
&(t,r) = F(t)<P(t,T) (24-37)
Hx[x*(t),u*(t),t] = H, = (1 0 0 0) @(t,t) = I
and
This result should be a familiar one to the readers of this book; hence, .
H, [x*(t),u*(t),t] = 0 we omit its proof.
In the equation for FX[x*(t)], the notation ( )* means that all xi(t) within the matrix Next, we assume that u(t) is a piecewise constant function of time for
are nominal values, i.e., xi(t) = xi*(t). t E [tk ,tk+ J, and set to = t&and t = t&+1in (24-36), to obtain
Observe that the linearized satellite-planet system is time-varying, because its
+ 1
linearized plant matrix, F,[x*(t)], dependsupon the nominal trajectory x*(t). 0 @(tk + l,$(+r u(fk)
x(tk + 1) = @(tk + l,fk)X(tk) +
I

I
rk + 1
+ @(tk + It7 (24-38)
DISCRETlZATlON OF A LINEAR TIME-VARYING k
STATE-VARIABLE MODEL
can also be written as
In this section we describehow one discretizesthe generallinear, time-varying x(k + 1) = @(k + 1, k)x(k) + P(k + l,k)u(k) + Wdck) (24-39)
state-variable model
k(t) = F(t)x(t) + C(r)u(t) + G(t)w(t) (24-31)
@(k + 1,k) = @(tk + dk) (24-40)
Z(t) = H(t)x(t) + v(t) t = ti, i = 1729 . . l (24-32)
I
tk + 1

The application of this sections results to the perturbation state-variable P(k + 1,k) =
@(tk + (24-41)
tk
model is given in the following section.
In (24-31) and (24-32),x(t) is n X 1, control input u(t) is I X 1, process and wd(k) is a discrete-time white Gaussiansequencethat is statistically
noise w(t) is p x 1, and z(t) and v(t) are each wt x 1. Additionally, w(t) is a alent to
continuous-time white noise process,V(ti> is a discrete-time white noise pro-
I
tk + 1
cess, and, w(t) and v(ti) are mutually uncorrelated at all t = ti, i = 1, 2, . . . , @(tk + 1J )G(T)w(7)dT

i.e., E{w(?)} = 0 for all t, E{v(ti)} = 0 for all ti, and tk

E{w(t)w(T)} = Q(t)s(t - T) (24-33) The mean and covariance matrices of wd(k) are

E{v(ti)v(tj)} = R(ti)ajj (24-34) @(t, + l,T)G(+V(T)dT = 0 (24-42)



Substituting (24-46) into (24-41), we find that

Vr(k + 1,k) = I* - @(tk + t,7)Ckd7 = 1 + eFncrk+ l -r Ckd7


lk tk

lk + 1
z
[I + F&k+ t - ;>Jc,d~
I fk (24-50)
tk + 1

= CkT + FkCktk+ IT - FkCk 7dr


I Ik

respectively.
Ubserve, from the right-hand side of Equations (24-40), (24-41), and
(24-43), that these quantities can be computed from knowIedge about F(l), where we have truncated *(k + 1,k) to its first-order term in T, Proceeding in a similar
C(f), G(t), and Q(l). In general,we must compute@@ + l,kj, q(k + l,k), and manner for Qd(k), it is straightforward to show that
Qd(k) using numerical integration, and, these matrices change from one time
interval to the next becauseF(l), C(t), G(l), and Q(r) usually changefrom one Q&j = GkQ&iET (24-51)
time interval to the next. Note that (24-47), (24~49), (24-50), and (24-51), while much simpIer than their
Becauseour measurementshave been assumedto be available only at original expressions, can change in values from one time-interval to another, because
sampled values of t, namely at t = fi, i = 1, 2, . . . , we can express (24-32) as of their dependenceupon k. Cl

z(k -+ I) = H(k + l)x(k + 1) + v(k + 1) (24-44)


DISCRET12ED PERTURBATION STATE-VARIABLE MODEL
Equations (24-39) and (24-44) constitute our discretized state-variablemodel.
Applying the results of the preceding sectionto the perturbation state-variable
Example 24-3
model in (24-23) and (24-30), we obtain the following d&refired perturbation
Great simplifications of the calculations in (24-40), (24-41) and (24-43) occur if F(I), state-variable model
C(l), G(r)? and Q(t) are approximateIy constant during the time interval [rk, tk+ I], i.e.,
if Sx(k + I) = @(k + l,k;*)ax(k) + P(k -t l.k;*)h(k) + m(k) (24-52)
(24-45) Sz(k + 1) = H,(k + l;*)Sx(k + 1)
+ H,(k + l;*)Su(k + 1) + v(k + 1) (2443)
To begin, (24-37) is easily integrated to yield The notation @(k + l,k;*), for example, denotes the fact that this matrix
dependson x* (t) and u*(t). More specifically,
hence, @(k + l,k;*) = @(t, + I&;*) (24-54)
@(k + l,k) = eFAT (24-47) where
where we have assumed that lk + l - fk = T. The matrix exponentia1
is givenby the &(r,r;*) = Fx[x*(r~~u*(t),r]~(t,T;*)
(24-55)
infinite series @(t,t;*) = I
(24-43) Additionally,

and, for sufficiently small values of T I(k + l,k;*) = 1 +I @(tk+ I,f;*)FU[x*(r),u*(r),~]dT (24-56)
II,
e FkT= I -I-FkT (24-49)
We use this approximation for eFkTin deriving simpler expressionsfor q(k + 1,k) and
Qd(k). Comparable results can be obtained for higher-order truncations of eFkT . Qd(k;*) = i + @(t, + I,T;*)G(~)Q(~)G(r)W(t~ + l,q*)dr (24-57)
A
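The small-T approximations (24-47) and (24-49) through (24-51) reduce discretization to a few matrix products. Here is a minimal sketch under the assumption that F, C, G, and Q are constant over each sampling interval and that only leading terms in T are kept; the continuous-time matrices shown are illustrative only, not from the text.

```python
import numpy as np

def discretize_first_order(F, C, G, Q, T):
    """First-order (small-T) discretization of xdot = F x + C u + G w
    (cf. Eqs. 24-47, 24-49 through 24-51):
        Phi ~ I + F*T,   Psi ~ C*T,   Qd ~ G Q G' * T."""
    n = F.shape[0]
    Phi = np.eye(n) + F * T
    Psi = C * T
    Qd = G @ Q @ G.T * T
    return Phi, Psi, Qd

# assumed continuous-time model, for illustration only
F = np.array([[0.0, 1.0], [-2.0, -0.7]])
C = np.array([[0.0], [1.0]])
G = np.array([[0.0], [1.0]])
Q = np.array([[0.1]])
Phi, Psi, Qd = discretize_first_order(F, C, G, Q, T=0.01)
```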

PROBLEMS (b) Duffings equation:


if(t) + Ci(f) + ax(f) + )&x3(r)= F cos wt
24-l. Derive the Taylor seriesexpansion of h[x(t), u(t), t] given in (24-25).
24-2. Derive the formula for Q&C) given in (24-51). (c) Van der PoIs equation:
24-3. Derive formulas for !P(k + 1,k) and Q&k) that include first- and second-order
effects of T, using the first three terms in the expansionof eFkT. X(f) - ci(i)[l - $2(f)] +x(t) = m(f)
24-4. Let a zero-mean stationary Gaussian random process v(f) have the auto-
(d) Hills equation:
correlation function +(7) given by 4 (7) = e-l + ee2!
(a) Showthat this colored-noise processcan be generatedby passing white noise X(t) - ux (r) + bp (t)x(t) = m(t)
p(f) through the linear system whose transfer function is
where p (t> is a known periodic function.
lf5
6 s+
(s + l)(s + 2)
(b) Obtain a discrete-time state-variable model for this colored noise process
(assumeT = 1 msec).
24-5. This problem presents a model for estimation of the altitude, velocity, and
constant ballistic coefficient of a vertically falling body (Athans, et al., 1968).
The measurementsare made at discrete instantsof time by a radar that measures
range in the presenceof discrete-time white Gaussiannoise. The state equations
for the falling body are
il = -x2
i2 = -e -yxlxG&3
A!3= 0
where y = 5 x lo-, x1@)is altitude, x2(f) is downward velocity, and x3 is a
constant ballistic parameter. Measured range is given by
z(k) = <M + [x,(k) - H]* + v(k) k = 1,2,. . . ,
where M is horizontal distance and H is radar altitude. Obtain the discretized
perturbation state-variable model for this system.
24-6. Normalized equations of a stirred reactor are
= -(cl + c4)x1+ c3(1+ x4)2exp [&xl/(1 + x1)] + cd2 + cl&
i2 = -(cg + c(j)& + c5x) + &x3
-f3 = -(c7 + 4x3 + CgX2+ c7u2
i4 = - ClX4 - ~~(1+ x4)* exp [K&(1 + x1)] + clu3
inwhich ul, u2, and u3 are control inputs. Measurementsare
zi(k) = xi(k) + vi(k) i = 1,2, 3,
where vi(k) are zero-mean white Gaussian noiseswith variances ri. Obtain the
discretized perturbation state-variable model for this system.
24-7. Obtain the discretized perturbation-state equation for each of the following
nonlinear systems:
(a) Equation for the unsteady operation of a svnchronous
M motor:
X(t) + Ci(t) + p sin x(t) = L (1)
Lesson 25
Iterated Least Squares and Extended Kalman Filtering

This lesson is primarily devoted to the extended Kalman filter (EKF), which is a form of the Kalman filter extended to nonlinear dynamical systems of the type described in Lesson 24. We shall show that the EKF is related to the method of iterated least squares (ILS), the major difference being that the EKF is for dynamical systems whereas ILS is not.

ITERATED LEAST SQUARES

We shall illustrate the method of ILS for the nonlinear model described in Example 2-5 of Lesson 2, i.e., for the model

z(k) = f(θ, k) + v(k)    (25-1)

where k = 1, 2, ..., N.
Iterated least squares is basically a four-step procedure.
1. Linearize f(θ, k) about a nominal value of θ, θ*. Doing this, we obtain the perturbation measurement equation

δz(k) = F_θ(k; θ*)δθ + v(k),    k = 1, 2, ..., N    (25-2)

where

δz(k) = z(k) − z*(k) = z(k) − f(θ*, k)    (25-3)

δθ = θ − θ*    (25-4)

and

F_θ(k; θ*) = ∂f(θ, k)/∂θ evaluated at θ = θ*    (25-5)

2. Concatenate (25-2) and compute δθ̂_LS(N) using our Lesson 3 formulas.
3. Solve the equation

δθ̂_LS(N) = θ̂_ILS(N) − θ*    (25-6)

for θ̂_ILS(N), i.e.,

θ̂_ILS(N) = θ* + δθ̂_LS(N)    (25-7)

4. Replace θ* with θ̂_ILS(N) and return to Step 1. Iterate through these steps until convergence occurs. Let θ̂ᵢ_ILS(N) and θ̂ᵢ₊₁_ILS(N) denote estimates of θ obtained at the ith and (i + 1)st iterations, respectively. Convergence of the ILS method occurs when

‖θ̂ᵢ₊₁_ILS(N) − θ̂ᵢ_ILS(N)‖ < ε    (25-8)

where ε is a prespecified small positive number.

We observe, from this four-step procedure, that ILS uses the estimate obtained from the linearized model to generate the nominal value of θ about which the nonlinear model is relinearized. Additionally, in each complete cycle of this procedure, we use both the nonlinear and linearized models. The nonlinear model is used to compute z*(k), and subsequently δz(k), using (25-3).
The notions of relinearizing about a filter output and using both the nonlinear and linearized models are also at the very heart of the EKF.

EXTENDED KALMAN FILTER

The nonlinear dynamical system of interest to us is the one described in Lesson 24. For convenience to the reader, we summarize aspects of that system next. The nonlinear state-variable model is

ẋ(t) = f[x(t), u(t), t] + G(t)w(t)    (25-9)

z(t) = h[x(t), u(t), t] + v(t),    t = tᵢ, i = 1, 2, ...    (25-10)

Given a nominal input, u*(t), and assuming that a nominal trajectory, x*(t), exists, x*(t) and its associated nominal measurement satisfy the following nominal system model,

ẋ*(t) = f[x*(t), u*(t), t]    (25-11)

z*(t) = h[x*(t), u*(t), t],    t = tᵢ, i = 1, 2, ...    (25-12)
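Returning to the four-step ILS procedure above, it can be sketched in a few lines for a scalar parameter. The model f(θ, k) = e^(−θk) used below is a made-up illustration (it is not Example 2-5), and the function name ils is an assumption.

```python
import numpy as np

def ils(z, f, dfdtheta, theta0, eps=1e-6, max_iter=50):
    """Iterated least squares for z(k) = f(theta, k) + v(k)
    (the four-step procedure of Eqs. 25-1 through 25-8), scalar theta."""
    k = np.arange(1, len(z) + 1)
    theta = float(theta0)                          # nominal value theta*
    for _ in range(max_iter):
        dz = z - f(theta, k)                       # (25-3): delta-z about theta*
        Ft = dfdtheta(theta, k)                    # (25-5): linearized model
        dtheta = Ft @ dz / (Ft @ Ft)               # step 2: LS estimate of delta-theta
        theta_new = theta + dtheta                 # (25-7)
        if abs(theta_new - theta) < eps:           # (25-8): convergence test
            return theta_new
        theta = theta_new                          # step 4: relinearize
    return theta

# made-up example: f(theta, k) = exp(-theta * k)
rng = np.random.default_rng(1)
theta_true, kk = 0.1, np.arange(1, 51)
z = np.exp(-theta_true * kk) + 0.01 * rng.standard_normal(50)
theta_hat = ils(z, lambda th, k: np.exp(-th * k),
                lambda th, k: -k * np.exp(-th * k), theta0=0.3)
```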

Letting δx(t) = x(t) − x*(t), δu(t) = u(t) − u*(t), and δz(t) = z(t) − z*(t), we also have the following discretized perturbation state-variable model that is associated with a linearized version of the original nonlinear state-variable model,

δx(k + 1) = Φ(k + 1, k; *)δx(k) + Ψ(k + 1, k; *)δu(k) + w_d(k)    (25-13)
δz(k + 1) = H_x(k + 1; *)δx(k + 1) + H_u(k + 1; *)δu(k + 1) + v(k + 1)    (25-14)

In deriving (25-13) and (25-14), we made the important assumption that higher-order terms in the Taylor series expansions of f[x(t), u(t), t] and h[x(t), u(t), t] could be neglected. Of course, this is only correct as long as x(t) is close to x*(t) and u(t) is close to u*(t).
If u(t) is an input derived from a feedback control law, so that u(t) = u[x(t), t], then u(t) can differ from u*(t), because x(t) will differ from x*(t). On the other hand, if u(t) does not depend on x(t), then usually u(t) is the same as u*(t), in which case δu(t) = 0. We see, therefore, that x*(t) is the critical quantity in the calculation of our discretized perturbation state-variable model.
Suppose x*(t) is given a priori; then we can compute predicted, filtered, or smoothed estimates of δx(k) by applying all of our previously derived estimators to the discretized perturbation state-variable model in (25-13) and (25-14). We can precompute x*(t) by solving the nominal differential equation (25-11). The Kalman filter associated with using a precomputed x*(t) is known as a relinearized KF.
A relinearized KF usually gives poor results, because it relies on an open-loop strategy for choosing x*(t). When x*(t) is precomputed there is no way of forcing x*(t) to remain close to x(t), and this must be done or else the perturbation state-variable model is invalid. Divergence of the relinearized KF often occurs; hence, we do not recommend the relinearized KF.
The relinearized KF is based only on the discretized perturbation state-variable model. It does not use the nonlinear nature of the original system in an active manner. The extended Kalman filter relinearizes the nonlinear system about each new estimate as it becomes available; i.e., at k = 0, the system is linearized about x̂(0|0). Once z(1) is processed by the EKF, so that x̂(1|1) is obtained, the system is linearized about x̂(1|1). By "linearize about x̂(1|1)," we mean x̂(1|1) is used to calculate all the quantities needed to make the transition from x̂(1|1) to x̂(2|1), and subsequently x̂(2|2). This phrase will become clear below. The purpose of relinearizing about the filter's output is to use a better reference trajectory for x*(t). Doing this, δx = x − x̂ will be held as small as possible, so that our linearization assumptions are less likely to be violated than in the case of the relinearized KF.
The EKF is developed below in predictor-corrector format (Jazwinski, 1970). Its prediction equation is obtained by integrating the nominal differential equation for x*(t), from tₖ to tₖ₊₁. In order to do this, we need to know how to choose x*(t) for the entire interval of time t ∈ [tₖ, tₖ₊₁]. Thus far, we have only mentioned how x*(t) is chosen at tₖ, i.e., as x̂(k|k).

Theorem 25-1. As a consequence of relinearizing about x̂(k|k) (k = 0, 1, ...),

δx̂(t|tₖ) = 0 for all t ∈ [tₖ, tₖ₊₁]    (25-15)

This means that

x*(t) = x̂(t|tₖ) for all t ∈ [tₖ, tₖ₊₁]    (25-16)

Before proving this important result, we observe that it provides us with a choice of x*(t) over the entire interval of time t ∈ [tₖ, tₖ₊₁], and it states that at the left-hand side of this time interval x*(tₖ) = x̂(k|k), whereas at the right-hand side of this time interval x*(tₖ₊₁) = x̂(k + 1|k). The transition from x̂(k + 1|k) to x̂(k + 1|k + 1) will be made using the EKF's correction equation.

Proof. Let t₁ be an arbitrary value of t lying in the interval between tₖ and tₖ₊₁ (see Figure 25-1). For the purposes of this derivation, we can assume that δu(k) = 0 [i.e., perturbation input δu(k) takes on no new values in the interval from tₖ to tₖ₊₁; recall the piecewise-constant assumption made about u(t) in the derivation of (24-37)], i.e.,

δx(k + 1) = Φ(k + 1, k; *)δx(k) + w_d(k)    (25-17)

Using our general state-predictor results given in (16-14), we see that (remember that k is short for tₖ, and that tₖ₊₁ − tₖ does not have to be a constant; this is true in all of our predictor, filter, and smoother formulas)

δx̂(t₁|tₖ) = Φ(t₁, tₖ; *)δx̂(tₖ|tₖ)
= Φ(t₁, tₖ; *)[x̂(k|k) − x*(k)]    (25-18)

Figure 25-1 Nominal state trajectory x*(t).
252 Iterated Least Squares and Extended Kalman Filtering lesson 25
Extended Kalman Filter

In the EKF we set x*(k) = G(#); thus, when this is done,


Substituting these three equations into (25-23)we obtain

i(k -I- Ilk + 1) = i(k + Ilk) + K(k + l:*){z(k + 1)


(25-27)
sk(t&) = 0 for & $1E [lk ,t&&1-j (X-20) - h[i(k + l)k),u(k + l),k + I] - H,(k + 1;*)6u(k + 1))
which is (25-U). Equation (25-16) follows from (25-20) and the fact that
iG(t//t&) = %(f$&) - x*(t,). cl which is the EKFcorrection equation. Observethat the nonlinear nature of the
systems measurement equation is used to determine i(k + l/k + 1). One
We are now able to derive the EKF. As mentioned above, the EKF must usually seesthis equation for the casewhen 6u = 0, in which casethe last term
be obtained in predictor-corrector format. We begin the derivation by obtain- on the right-hand side of (25-27) is not present.
ing the predictur equation for g(k + l/k). In order to compute G(k + lik + l), we must compute the EKF gain
Recall that X*(I) is the solution of the nominal state equation (2541). matrix K(k + I;*). This matrix, as well as its associated P(k + Ilk;*) and
Using (25-X) in (Z-H), we find that P(k + Ilk + l;*) matrices, dependsupon the nominal x*(r) that results from
prediction, namely k(k + l(k). Observe, from (25-16), that x*(k + 1) =
i(k + Ilk), and that the argument of K in the correction equation is k + 1;
hence, we are indeed justified to use ri(k + Ilk) as the nominal value of x*
Integrating this equaticmfrom l = ik to l = fk+ I, we obtain during the calculations of K(k + l;*), P(k + l]k;*), and P(k + Ilk + I;*).
These three quantities are computed from
I 1
K(k + l;*> = P(k + llk;*)H:(k + l;*)[H,(k + l;*)
P(k + ljk;*)H;(k + l;*) + R(k + l)]- (25-28)
P(k + Ilk;*) = G(k + l,k;*)P(klk;*)@(k + l,k;*) + Q&;*) (25-29)
which is the EKFpredicbz equa&im. Observethat the nonlinear nature of the
systemsstate equation is usedto determine S(k -+ Ilk). The integral in (25.22) P(k + Ilk + I;*) = [I - K(k + l;*)H,(k + l;*)]P(k + ilk;*) (25-30)
is evaluatedby meansof numerical integration formulas that are initialized by
Remember that in these three equations * denotes the use of i(k + l(k).
f[~(t~l~~~,u~(~~),~~l. The EKF is very widely used, especially in the aerospace industry;
The corrector equation for g(k + ljk + 1) is obtained from the Kalman
however, it doesnot provide an optima1estimateof x(k). The optimal estimate
filter associated with the discretized perturbation state-variable model in
of x(k) is still E{x(k)f%(k)}, regardlessof the linear or nonlinear nature of the
(25-13) and (25-14), and is
systemsmodel. The EKF is a first-order approximation of E{x(k)l%(k)) that
8%(k + lik + 1) = G(k + l\k) + K(k + l;*)[az(k + 1) sometimesworks quite well, but cannot be guaranteed always to work well.
- &(k + l;*)Z(k + l/k) - &(k + l;*)&(k + I)] (25-23) No convergence results are known for the EKF; hence, the EKF must be
viewed as an ad hoc filter. Alternatives to the EKF, which are based on
As a consequenceof relinearizing about #k), we know that nonlinear filtering, are quite complicated and are rarely used.
G(k + l/k) = 0 (25-24) The EKF is designed to work well as long as 6x(k) is small. The
iterated EKF (Jazwinski, 1970), depicted in Figure 25-2, is designed to keep
G(k + l[k + 1) = g(k + Ilk + 1) - x*(k + 1) 6x(k) as small as possible. The iterated EKF differs from the EKF in that
= G(k + l\k + 1) - %(k + l/k) (2525) it iterates the correction equation L times until llkL(k + Ilk + 1) -
jiLsl(k + Ilk + 1)11 5 E. Corrector Rl computes K(k + l;*), P(k -I- Ilk;),
and and P(k + l/k + l;*) using x* = li(k + Ilk); corrector #2 computes these
ih(k -+ 1) = z(k + 1) - z*(k + 1) quantities using x* = gl(k + Ilk + 1); corrector #3 computes thesequantities
= z(k + 1) - h[x*(k + l), u*(k + I), k + 1-j using x* = iz(k + l!k + 1): etc.
Often, just adding one additiona corrector (i.e., L = 2) leads to sub-
= z(k + 1) - h [i(k -t Ilk). u*(k + l), k + I] (X-26)
stantially better results for i(k + Ilk + 1) than are obtained using the EKF.
Application to Parameter Estimation

APPLICATION TO PARAMETER ESTIMATION

One of the earliest applications of the extended Kalman filter was to param-
eter estimation (Kopp and Orford, 1963).Consider the continuous-time linear

fi(k+ Ilk+ I)
system

S,jk-t Ilk+ I)?i(k+IIk+I)


2(t) = Ax(t) + w(t) (2531a)
Z(t) = Hx(t) + V(t) t = ti i = 1, 2,. . . (25-m)

a
tktl

Figure 252 Iterated EKF. All of the calculations provide us with a refined estimate of
Matrices A and H contain some unknown parameters, and our objective is to
estimate these parameters from the measurementsZ(ti> as they become avail-
able.
To begin, we assume differential equation models for the unknown
parameters, i.e., either
#L
corrector

t&(t) = 0 I = 1,2,. . . $z* (2532a)


b-
I(k + Ilk + 1)

t;j(t)=O i=l,2,...,j* (2532b)

4(t) = w(t) + n&) I = 1,2,. . .) I (25-33a)


4.-

hj(t) = djhj(t) + qi(t) j = 1, 2, . ,i*


l l (25-33b)
+ooo

x(k + l), ii(k + Ilk + l), starting with r2(kIk).


+ Ilk + 1)

In the latter models n/(t) and qj(t) are white noise processes,and one often
choosescl = 0 and dj = 0. The noises n,(t) and qj(t) introduce uncertainty
about the constancy of the al and hj parameters.
g,(k

Next, we augment the parameter differential equations to (25-31a) and


(25031b).The resulting system is nonlinear, because it contains products of
states [e.g., al(t)xi(t)]. The augmented system can be expressedas in (25-9)
Corrector

#I

and (25-lo), which means we have reduced the problem of parameter esti-
mation in a linear system to state estimation in a nonlinear system.
+

Finally, we apply the EKF to the augmentedstate-variable model to ob-


+ Ilk)

tain cil(klk) and hj(klk).


W

Ljung (1979) has studied the convergenceproperties of the EKF applied


=

to parameter estimation, and has shown that parameter estimates do not


W Predictor

converge to their true values. He shows that another term must be added to
the EKF corrector equation in order to guarantee convergence. For details,
seehis paper.
W+)

4
k
Wlk)

Example 25-l
Considerthe satellite and planet Example24-1, in which the satellitesmotion is
governedby the two equati.ons
..
r=$-- 1
*y+ m uro (25-34)
Iterated Least Squares and Extended Kalman Filtering Lesson 25 Lesson 25 Pfoblems

and Noise u (t, > is white and Gaussian, and &(f,) is given. The control signal u(t) is
the sum of a desired control signal u(r) and additive noise, i.e.,
(2535)

We shall assumethat m and y are unknown constants, and shall model them as The additive noise SU(f) is a normally distributed random variable modulated bY
a function of the desired control signal. i.e.,
h+(r) = 0 (25-36)
j(r) = 0 (2537)
su(f) = S[u*(r)Jwo(r)
where we(t) is zero-mean white noise with intensity &. Parameters
and as(t) may be unknown and are modeled as
h(r) = ar(r)[a,(t)- i%(t)]+ w,(t) i = 1, 2,3
In this model the parameters Q,(t) are assumedgiven, as are the a priori values of
a,(t) and e,(l), and, r+pi(t)are zero-mean white noises with intensities cr$,.
(2533)
(a) What are the EKF formulas for estimation of x1, x2, al, and ~2,assumingthat
us is known?
(b) Repeat (a) but now assumethat u3is unknown.
25-4. Suppose we begin with the nonlinear discrete-time system,
NOW, x(k + 1) = f[x(k),k] + w(k)
z(k) = h[x(k),k] + v(k) k = 1,2,...
Develop the EKF for this system [Hint: expand f [x(k),k] and h [x(k),k] in Taylor
series about ri(k jk) and i(klk - l), respectively].
We note, finally, that the modeling and augmentation approach to 25-S. Refer to Problem 24-7. Obtain the EKF for
parameter estimation?described above, is not restricted to continuous-time (a) Equation for the unsteady operation of a synchronousmotor, in which C and
linear systems.Additiona situations are describedin the exercises. p are unknown;
(b) Duffings equation, in which C, cy, and p are unknown;
(c) Van der Pals equation, in which Eis unknown; and
PROBLEMS (d) Hills equation, in which a and b are unknown.

25-L In the first-order system, x(k + 1) = ax(k) + w(k), and z (k + 1) =


x(k + 1) + v(k + 11,k = 1, 2, . . . , N, a is an unknown parameter that is to be
estimated. Sequencesw(k) and v(k) are, as usual, mutually uncorrelated and
white, and, w(k) - jV(w(k);O,l) and v(k) - N(v(k);O,h). Explain, using
equations and a flowchart, how parameter a can be estimated using an EKF.
25-2. Repeat the preceding problem where all conditions are the same except that now
w(k) andv(k) arecorrelated,andE{w(k)v (k)] = ~4.
25-3. The system of differential equations describing the motion of an aerospace
vehicle about its pitch axis can be written as (Kopp and Orford 91963)
ii 1(t)= x2(t)
i2(@ = Q)k(q -I- h(t)%(r) + 03O)W)

where xl = i(f), which is the actual pitch rate. Sampledmeasurements are made
of the pitch rate, i.e.,
z(r) = x,(r) + v(r) 1 = f, i = 1,2,...,N
26
A Log-Likelihood Function for the Basic State-Variable Model 259

Lesson estimates of a collection of parameters, also denoted 8, that appear in our


basic state-variable model,
x(k + 1) = @x(k) + rw(k) + *u(k) (26-5)
Maximum-Likelihood z(k -t 1) = Hx(k + 1) + v(k + 1) k=O,l,...,N-1 (26-6)
Now, however,

State and Parameter 8 = co1(elements of @, r, q? H, Q, and R)


and, we assumethat 6 is d x 1. As in Lesson 11, we shall assume that 6 is
(26-7)

identifiable.
Estimation Before we can determine 4MLwe must establishthe log-likelihood func-
tion for our basic state-variable model.

A LOG-LIKELIHOOD FUNCTION FOR THE BASIC


STATE-VARIABLE MODEL

As always, we must compute p (3!@) = p (z(l), z(2), . . . , z(N#). This is diffi-


cult to do for the basic state-variable model, because

INTRODUCTION
The measurementsare all correlated due to the presenceof either the process
In Lesson 11 we studied the problem of obtaining maximum-likelihood esti- noise, w(k), or random initial conditions or both. This represents the major
mates of a collection of parameters, 0 = co1(elementsof Qp,T, H, and R), that difference between our basic state-variable model, (26-5) and (26-6), and the
appear in the state-variablemodel state-variable model studied earlier, in (26-l) and (26-2). Fortunately, the
measurementsand innovations are causally invertible, and the innovations are
x(k + 1) = @x(k) + Vu(k) (26-l) all uncorrelated, so that it is still relatively easyto determine the log-likelihood
z(k + 1) = Hx(k + 1) + v(k + 1) k =O,l,...,N- 1 (26-2) function for the basic state-variable model.

We determined the log-likelihood function to be Theorem 264. The log-likelihood function for our basic state-variable
model in (26-5) and (26-6) is

L(8pfJ = - i 5 [z(i) - HBxg(i)]Ril [z(i) - Hex0(i)] - $ In lRel (26-3) L(813) = - i 5 [Zi (jlj - l)JV,l (jlj - l)&(j(j - 1)
i =1
j=l

where quantities that are subscripted 8 denote a dependenceon 8. Finally, we + WWli - 1)ll (26-9)
pointed out that the state equation (26-l), written as where &,( j(j - 1) is the innovations process, and & ( jl j - 1) is the covariance
xe(k + 1) = Qexe(k) + I!,u(k) xe(0) known (26-4) of that process [in Lesson 16 we used the symbol Pti (jlj - 1) for this covar-
iance],
acts as a constraint that is associated with the computation of the log- &(jlj - 1) = HftPe( jlj - l)Hi + RB (26-10)
likelihood function. Parameter vector 0 must be determined by maximizing
L(f@!l) subject to the constraint (26-4). This can only be done using mathe- This theorem is also applicable to either time-varying or nonstationary
matical programming techniques (i.e., an optimization algorithm such as systemsor both. Within the structure of these more complicated systemsthere
steepest descentor Marquardt-Levenberg). must still be a collection of unknown but constant parameters. It is these
In this lessonwe study the problem of obtaining maximum-likelihood parameters that are estimated by maximizing L (61%).
Maximum-Likelihood State and Parameter Estimation Lesson 26
On Computing &, 261

Pro@ (Mended, 1983b, pp. lOl-1031. We must first obtain the joint
In the present situation, where the true values of 9, 8~~are not known
density function p (S!@)= p (z(l), . . . , z(N)~9).In Lesson 17 we saw that the
but are being estimated, the estimate of x(i) obtained from a Kalman filter
innovations process i(+ - 1) and measurement z(i) are causaIiy invertible;
will be suboptimal due to wrong values of 8 being used by that filter. In fact,
thus, the density function
we must use iML in the implementation of the Kalman filter, becausehMLwill
jqi(liO), qq), . . . , qqN - l#) be the best information available about OTat tj. If bML-+OTas Iv-+ x, then
contains the same data informatiop as-p (z(l), . . . , z(N@) does. Conse- &,,Mi - I>+ G,(jIj - 1) as iV--+ x, and the suboptimal KaIman filter will
approach the optimal Ratman filter. This result is about the best that one can
quently, L (91%j can be replacedby I,@/?E),where
hope for in maximum-likelihood estimation of parameters in our basic state-
2 = co1(i( liO), . * . , i(Npv - 1)) (26-l 1) variable model.
and Note also, that although we beganwith a parameter estimation problem,
we wound up with a simultaneous state and parameter estimation problem.
, i(AqN - 1)/e) (2642)
This is due to the uncertainties present in our state equation, which necessi-
L@@) = inJqql~o), . l l

Now, however, we use the fact that the innovations processis Gaussian tated state estimation using a Kalman filter.
white noise to expressA$!$$) as

ON COMPUTING ii,,

For our basic state-variablemodel, the innovations are Gaussiandistributed, How do we determine 6MLfor L (@X>given in (26-9) (subject to the constraint
which means that pj (i(jlj - 1)19)= ~?(i(jij - 1)10)forj = 1, . . . , W, hence, of the Kalman filter)? No simple closed-form solution is possible, because8
entersinto L (elfq in a comphcatednonlinear manner. The only way presently
L(@Q = ln fi p(i(jij - 1)/e) (26-14) known to obtain 8MLis by meansof mathematical programming.
j=l
The most effective optimization methods to determine &$Lrequire the
From part (b) of Theorem 16-2in Lesson 16 we know that computation of the gradient of L(C@E)as well as the Hessian matrix, or a
pseudo-Hessianmatrix of ,5(01%).The Marquardt-Levenberg algorithm (also
known as the Levenberg-Marquardt algorithm [Bard, 1970; Marquardt,
1963]),for example, has the form
*i+ 1 = A.
Substitute (26-H) into (26-14)to show that 8 ML &vi, - (6 + Q)-lg, i = 0, 1, , . . (26-17)
where &-denotes the gradient
L(0pt) = -; $ [Z(jij - 1)N-1 (j[j - l)Z(jlj - 1)
j-1

+ wwl~ - w (2646) (2648)


where by convention we haveneglectedthe constant term -ln(2r)& because H, denotes a pseudo-Hessian
it does not depend on 0.
Becausejj( # 19)and JJ( # 10)contain the same information about the Hi -- *1 (2649)
data, L(@) and L(@Z) must also contain the same information about the 8 = eML
data; hence, we can use ~(f$E) to denote the right-hand side of (26~16),as in and Di is a diagonal matrix chosento force Hi + D, to be positive definite, so
(26-9). To indicate which quantities on the right-hand side of (26-9) may that (H, + D)- will always be computable.
depend on 8, we have subscriptedall such quantities with 9. IJ
We do not propose to discussdetails of the Marquardt-Levenberg algo-
rithm. The interested reader should consult the preceding referencesfor gen-
The innovations process&,(j /j - 1) can be generatedby a Kalman filter;
eral discussions, and Mendel (1983) or Gupta and Mehra (1974) for dis-
hence, the Kalman filter acts as a cunstraint that is asuciated with the corn-
cussions directly related to the application of this algorithm to the present
put&on uf the log-likelihuud function fur the basic stute-variable mudel.
problem, maximization of L (9[%).
On Computing 6,, 263
Maximum-Likelihood State and Parameter Estimation Lesson 26

We direct our attention to the calculationsof gi and Hi. The gradient of are used by the sensitivity equations; hence, the Kalman filter must be run
L(O~%) will require the calculationsof together with the d setsof sensitivity equations. This procedure for recursively
calculating the gradient dL (Ofa)l d0 therefore requires about as much com-
putation asd + 1 Kalman filters. The sensitivity systemsare totally uncoupled
and lend themselves quite naturally to parallel processing (see Figure 26-l).
The Hessian matrix of L (01%)is quite complicated, involving not only
The innovations depend upon i, (jlj - 1); hence, in order to compute first derivatives of Ze(j 1j - 1) and N,( j 1j - l), but also their second deriva-
Wjlj - l)/a9, we must compute a&( jlj - 1)/J& A Kalmanfilter must be tives. The pseudo-Hessianmatrix of L (Ol%) i gnores all the second derivative
usedto compute f, (jlj - 1); but this filter requires the following sequenceof terms; hence, it is relatively easy to compute because all the first derivative
calculations: Pg(klk)+PB(k + llk)+K,(k + l)+&(k + llk)+&(k + 11 terms have already been computed in order to calculate the gradient of
k + 1). Taking the partial derivative of the prediction equation with respectto L(Ol%). Justification for neglecting the second derivative terms is given by
6iwe find that Gupta and Mehra (1974), who show that as iML approaches &-, the expected
value of the dropped terms goesto zero.
The estimation literature is filled with many applications of maximum-
likelihood state and parameter estimation. For example, Mendel (1983b) ap-
mi
+xu(k) i = 1,2,...,d (26-20) plies it to seismic data processing,Mehra and Tyler (1973) apply it to aircraft
i
parameter identification and McLaughlin (1980) applies it to groundwater
We seethat to compute &(k + Ilk) / a6i, we must also compute &(k Ik) / aei. flow.
Taking the partial derivative of the correction equation with respect to 6i, we
find that
d&(k + Ilk + 1) = &,(k + ilk) + dK*(k + 1)
[z(k + 1) - H&k + Ilk)]
89i 86i 30i
-&(k + 1) [$+(k + Ilk)
i 113111v rry
1-q dLld6,
her a
+H h(k + ilk)
8 tMi 1 i = 1,2, . . . , d (26-21)

Observe that to compute &(k + Ilk + l)/dei, we must also compute i G


L Sensitwty , =- aLldeZ - R
&(k + l)/ aei. We leave it to the reader to show that the calculation of Filter 0, A
&(k + l)/ a0i requires the calculation of dP@(k+ Ilk)/ 80i, which in turn z(k) ( Filter D gi
requires the calculation of dPO(k+ 1jk + l)/ Joi. I
0 E
The systemof equations N
0
0 T
a&(k + Ilk)/aOi &(k + 1Jk + l)/Hi
a&(k + l)/aOi dPe(k + l)k)/dOi dPo(k + Ilk + 1)/J&
1
is called a Kalman filter sensitivity system. It is a linear system of equations, :nsitiviry
llter 0,
just as the Kalman filter, which is not only driven by measurementsz(k + 1)
[e.g., see (2621)] but is also driven by the Kalman filter [e.g., see(26-20)and
(2621)]. We need a total of d such sensitivity systems, one for each of the d
unknown parameters in 0.
Each system of sensitivity equations requires about as much com- Figure 26-l Calculations needed to compute gradient vector gi. Note that 0j de-
putation as a Kalman filter, Observe, however, that Kalman filter quantities notes the jth component of 8 (Mendel, 1983b, 0 1983, Academic Press, Inc.).
Maximum-Likelihood State and Parameter Estimation Lesson 26
A Steady-State Approximation 265

A STEADY-STATE APPROX!MATlON
and
Supposeour basicstate variabIe mode1is time-invariant and stationary so that i+(k f l/k) = z(k + 1) - H,&,(k + ilk) (26-32)
F = ht$ P(j\j- - 1) exists. Let
Once we have computed +ML we can compute &IL by inverting the trans-
x=E@H+R (26-22) formations in (26-27). Of course, when we do this, we are also using the
and invariance property of maximum-likelihood estimates.
Observe that z(+j%) in (26-29) and the filter in (2630)-(26-32) do not
depend on r; hence, we haye not included r in any definition of $. We explain
how to reconstruct r from $ MLfollowing Equation (26-44).
Because maximum-likelihood estimates are asymptoticahy efficient
Log-likelihood function ~(9~5!!)is a steady-stateapproximation of I. (C$iC).The
steady-stateKalman filter used to compute i(j[j - 1) and x is (Lesson ll), once we have determined ?*iML,the filter in (26-30) and (26-31)
will be the steady-stateKalman filter.
i(k + l/k) = cD%(k\k)-I-ml(k) (X-24) The major advantageof this steady-stateapproximation is that the filter
sensitivity equations are greatly simplified. When K and x are treated as
i(k + Ilk + I) = i(k + lik) + K[z(k + 1)
matrices of unknown parameters we do not need the predicted and corrected
- Hi(k + l\k)] (26-25) error-covariance matrices to compute K and x. The sensitivity equations
in which E is the steady-stateKalman gain matrix. for (26-32), (26-30), and (26-31) are
Recall that
Wk + w = _ dG(k + Ilk) - % %+(k + ilk) (26-33)
0 = co1(elements of Q, IY,q, H, Q, and R) (26-26) d#i IG w 1
We now make the following transformations of variables: d%(k + Ilk) = Qr &&Ik) a@, * J%
+- + 3 Gw + x 44 (26-34)
d#i Wi I 1

d%+(k + Ilk + 1) = d%+(k + Ilk) -


+ $[z(k + 1) - H&k tk + l)]
Wi Wi
a&(k + l/i)
We ignore r (initially) for reasonsthat are explainedbelow Equation (26-32). -G&J
dd+
Let
-&,2 &(k + Ilk) (26-35)
+ = co1(elements of @, V, H, p, and z) (26-28) 1

where + isp x I, and view E as a function of 4, i-e,, where i = 1, 2, . . . , p. Note that d&,/d#i is zero for all & not in q and is a
matrix filled with zeros and a single unity value for & in 5.
There are more elements in + than in 9, because.K and K have more
m@l = - ; g i;( j/j - 1)x&( j[j - I) - ; IV In ix+1 (26-29)
unknown elementsin them than do Q and R, i.e., p > d. Additionally, x does
j-l
not appear in the filter equations; it only appears
1 in t(t$l~). It is, therefore,
Instead of finding 6MLthat maximizes ~(O~~], subject to the constraints of a
fulI-blown Kalman filter, we now propose to find &L that maximizes ~(c#X), possible to obtain a closed-form solution for xML,
subject ?othe cunstraints of the foIlowing filter:
Theorem 26-Z. A closed-form solution for matrix gMLis
&,(k + lik) = @&,(k/k) + P&k) (26-30)
&(k + lik + 1) = &(k + Ilk) + ii&&k + lik) (26-31) (26-36)
266 Maximum-Likelihood State and Parameter Estimation Lesson 26 A Steady-State Approximation

Proof. To determine %hlLwe must set dz:(+[%)/%& = 0 and solve the These equations must be solved simultaneously for p and @Q&L using
resulting equation for I&,. This is most easily accomplished by applying iterative numerical techniques. For details, seeMehra (1970a).
gradient matrix formulas to (26-29) that are given in Schweppe(1974).Doing Note finally, that the best for which we can hope by this approachis not
this, we obtain I?& and QML,but only= ML.This is due to the fact that, when I and Q are
both unknown, there will be an ambieuitv in their determination, i.e., the
(26-37) term rw(k) which appears in our basic-state-variable model [for which
EbWw(~)~ = Ql cannot be distinguishedfrom the term w@), for which
whosesolution is gM, in (26-36). 0 E{w,(k)w;(k)} = Q1 = IQI (26-47)

Observe that & is the samplesteady-state-covariancematrix of &; i.e., This observation is also applicable to the original problem formulation
wherein we obtained & directly; i.e.? when both I and Q are unknown, we
A should really choose
cb ML'& ZML+ j-32
lim cov [Z(j lj - l)]
A A 8 = co1(elementsof @, V, H, rQr, and R) (26-48)
Supposewe are also interested
a in determining QMLand RML.How do we
obtain these quantities from $ML? In summary, when our basic state-variable model is time-invariant and
As in Lesson 19, we let K, i, and PI denote the steady-statevalues of stationary, we can first obtain 4 MLby maximizing Z($la) given in (26-29),
K(k + l), P(k + Ilk), and P(k lk), respectively, where subject to the constraints of the simple filter in (26-30), (26-31), and (26-32).
A mathematical programming method must be used to obtain those elements
K = PH(H~@ + R)-1 = jQfp-1 (26-38) of +MLassociated with &ML, $i&, tiMLYand &. The closed-form solution,
p = @i!@ + IQI (26-39) given in (26-36), is used for x ML.Finally, if we want to reconstruct RMLand
and (IQI?)ML, we use (26-44) for the former and must solve (26-45) and (26-46)for
the Iatter.
i!, = (I - KH)i! (26-40)
Example 26-1 (Mehra, 1971)
Additionally, we know that
The following fourth-order system, which representsthe short period dynamics and the
x = HFH + R ((26-41) first bending mode of a missile, was simulated:
By the invariance property of maximum-likelihood estimates,we know that 0 0 0 0
0 1 0 0
0 0 1 0
-a!1 -a2 -cY3 -a4

2ML (26-43) z(k + 1) = xl(k + I) + v(k + 1) (26-50)


For this model, it was assumed that x(O) = 0, 4 = 1.0, r = 0.25, cyl = -0.656, cy2=
Solving (26-43) for -irkMLand substitu$ngthe resulting expressioninto (26-42),
0.784, a3 = -0.18, and cy4= 1.0.
we obtain the following solution for RML, Using measurements generated from the simulation, maximum-likelihood esti-
A

RML = (1 - fiM,KML)~ML (26-44) mates were obtained for $, where


No closed-form solution existsfor QML.Substituting (aML,&, &, and + = co1
(al, cY2, a!3, cY4, x, L L k.77 7;4> (26-51)
-
&r+ into (26.38)-(26-40) and combining (26-39) and (26-40),we obtain the In (2651), SITis a scalar because, in this example, t (k) is a scalar. Additionally, it was
followi ng two equations assumedthat x(0) was known exactly. According to Mehra (1971, pg. 30) The starting
valuesfor the maximum likelihood schemewere obtained using a correlation technique
given in Mehra (1970b). The results of successiveiterations are shown in Table 26-l.
and The variances of the estimates obtained from the matrix of second partial derivatives
(i.e., the Hessi.an matrix of E) are also given. For comparison purposes, results ob-
(26-46) tained by using 1000 data points and 100 data points are given. Cl
Lesson 26 Problems

PROBLEMS

26-l. Obtain the sensitivity equations for dKe(k f l)/ Jfl,, aP,(k + llk)/&$ and
JPe(k + Ilk i- l)/ &I,. Explain why the sensitivity system for &(k + l/k)/ at?,
and d&+(k + I/k + Q/&3, is linear.
26-2. Compute formulas for g, and H,. Then simplify H, to a pseudo-Hessian.
26-3, In the first-order systemx (k + 1) = ax(k) + w(k). and z (k + 1) = x(k + 1) +
v(k + l), k = 1, 2, . . a) N, a is an unknown parameter that is to be estimated.
TABLE 26-I Parameter Estimates for Missile Example
Sequencesw(k) and v(k) are, as usual, mutually uncorreiated and white, and,
w(k) - N(rr*(k); 0,l) and v(k) - N(v(k); 0, l/z). Explain, using equations and a
flowchart, how parameter a can be estimated using a MLE.
0
ults from correlation technique, Mehra (197Ob). Repeat the preceding problem where all conditions are except that now
1 - I.0706 2.3800 -0.5965 (I.8029 -0.1360 0.8696 0.6830 0.2837 0.4191 0.8207 w(k) and v (k) are correlated, and E{w(k)v(k)} = V4.
estimates using 1000points 26-S. We are interested in estimating the parameters a and r in the following first-order
2 -1.06fkl 2.3811 -O-S938 0.8029 -0.1338 0.8759 0.6803 0.2840 0.42oU 0.8312 system:
3 - 1.0085 2.4026 -0.6054 0.7452 -0.1494 0.9380 0.6304 0.2888 0+4392 I.U3lk x(k + 1) + ax(k) = w(k)
4 -0.9798 2.4409 -0*6u36 0.8161 -0,140s 0.8540 0.6801 0.3210 OA 108 1.1831
5 -0.9785 2.4412 -0.5999 0.8196 -OS1370 0.8%0 0.6803 0.3214 U.6107 1.1835 z(k) = x(k) -t v(k) k = 1,2,. . . , N
6 -0.9771 2.4637 -0.6014 0.8086 -0.1503 0.8841 0.7068 0.3479 0.6059 1.2200
7 -0.9769 2.4603 -Oh023 0.8 130 -0.1470 0.8773 0.7045 0.3429 0.6lCI6 1.2104 Signals w(k) and v (k) are mutualIy uncorrelated, white, and Gaussian, and,
8 - 0.9744 2.5240 -0.6313 0.8105 -0.1631 0.9279 0.7990 0.37% 0.6484 1.2589 E{w(k)) = 1 and E{n(k)) = r.
9 -0.9743 2.5241 -0.6306 0.8 108 -Om1622 O-9296 0.7989 0.3749 0.6480 1.2588 (a) Let 9 = co1(a,$. What is the equation for the log-likelihood function?
10 -0.9734 2 S270 -0.6374 0.7961 -0.1630 0.9so5 0*7974 0.3568 0.6378 1.2577 (b) Prepare a macro flow chart that depicts the sequenceof calculations required
11 -0.9728 2.5313 -0.6482 0.7987 -Owl620 0.9577 0.8103 0.3443 0.6403 1,235i to maximize ,5(9/%), Assume an optimization algorithm is used which re-
12 -0.9720 2.5444 -0.66U2 0.7995 -0.1783 0.9866 0.8487 0.3303 O.m33 1.2053 quires gradient information about L (9/Z.).
13 -0.9714 2.5600 -0.6634 0.7919 -0.2036 1.0280 0.8924 0.3143 II.6014 1.2OS4 (c) Write out the Kalman filter sensitivity equations for parameters a and r.
14 -0.9711 2.5657 -0.6624 0.7808 -0.2148 1.0491 0.9073 0.325i 0.6122 I .2200
26-6. Develop the sensitivity equations for the case considered in Lesson 11, i.e., for
estimates using 100 points
30 -0.9659 2.620 -0Ao94 O-7663 -0.1987 1,0156 1.24 0.136 0.454 1.103
the case where the only uncertainty present in the state-variable model is mea-
surement noise. Begin with t(1319) in (11-42).
al values
-0.94 2.557 -o.twlo 0.7840 -0,180O 1.00 0.8937 0.2957 0.6239 1.2510 26-7. Refer to Problem 24-7. Explain. using equations and a flowchart, how to obtain
mates of standard deviation using 1WOpoints MLEs of the unknown parameters for:
0.0317 0.0277 OS0247 0.0275 0.0261 0.0302 0.0323 o.K+02 0.029 (a) equation for the unsteady operation of a synchronous motor, in which C and
mates of standard deviation using 100 points p are unknown;
0.149 0.104 0.131 0.084 0.184 0,303 0.092 0.082 0.09 (c) Duffings equation, in which C, ct, and /3 are unknown;
(c) Van der Pals equation, in which e is unknown; and
rce: Mehra (1971, pg. 301, Q 1971, AlAA.
(d) Hills equation, in which a and b are unknown.
Notation and Problem Statement 271

Lesson 27 SYSTEM DESCRIPTION

Our continuous-time system is described by the following state-variable


model,
Kalmam-Bucy Filtering 2(t) = F(t)x(t) + G(t)w(t) (27-l)
z(t) = H(t)x(t) + v(t) (27-2)
where x(t) is y1X 1, w(t) is p X 1, z(t) is m X 1, and v(t) is m X 1. For sim-
plicity, we have omitted a known forcing function term in state equation
(27-l). Matrices F(t), G(t), and H(t) have dimensions which conform to the
dimensions of the vector quantities in this state-variable model. Disturbance
w(t) and measurement noise v(t) are zero-mean white noise processes,which
are assumedto be uncorrelated, i.e., E{w(t)} = 0, E{v(t)} = 0,
E{w(t)w(r)} = Q(t)s(t - T) (27-3)
E{v(t)v(T)} = R(t)S(t - 7) (27-4)

INTRODUCTION
E{w(t)v(r)} = 0 (27-5)
The Kalman-Bucy filter is the continuous-time counterpart to the Kalman
filter. It is a continuous-time minimum-variance filter that provides state Equations (27.3), (27,4), and (27-5) apply for t 2 to. Additionally, R(t) is
estimatesfor continuous-timedynamical systemsthat are describedby linear, continuous and positive definite, whereas Q(t) is continuous and positive
(possibly) time-varying, and (possibly) nonstationary ordinary differential semidefinite. Finally, we asume that the initial state vector x(to) may be ran-
equations. dom, and if it is, it is uncorrelated with both w(t) and v(t). The statistics of a
The Kalman-Bucy filter (KBF) can be derived in a number of different random x(to) are
ways, including the following three: Wt,)~ = m&o) (27-6)
and
1. Use a formal limiting procedure to obtain the KBF from the KF (e.g.,
Meditch, 1969). covb(to)l = R&o) (27-7)
2. Begin by assumingthe optimal estimator is a linear transformation of all Measurements z(t) are assumedto be made for tos t s 7.
measurements. Use a calculus of variations argument or the ortho- If x(to), w(t), and v(t) are jointly Gaussianfor all t t [to,?], then the KBF
gonality principle to obtain the Wiener-Hopf integral equation. Em- will be the optimal estimator of state vector x(t). We will not make any
bedded within this equationis the filter kernal. Take the derivative of the distributional assumptions about x(to), w(t), and v(t) in this lesson, being
Wiener-Hopf equation to obtain a differential equation which is the content to establish the linear optimal estimator of x(t).
KBF (Meditch, 1969).
3. Begin by assuminga linear differential equation structure for the KBF,
one that containsan unknown time-varying gain matrix that weights the NOTATION AND PROBLEM STATEMENT
difference betweenthe measurementmade at time t and the estimate of
that measurement. Choose the gain matrix that minimizes the. mean- Our notation for a continuous-time estimate of x(t) and its associatedesti-
squared error (Athans and Tse, 1967). mation error parallels our notation for the comparable discrete-time quan-
tities, i.e., ?(t It) denotes the optimal estimate of x(t) which uses all the mea-
We shall briefly describethe first and third approaches,but first we must surements z(t), where t 2 to, and
define our continuous-time model and formally state the problem we wish to
solve. qt It) = x(t) - k(tlt) (27-8)

270
272 KalmwI-Bucy Filtering Lesson 27 Derivation of KBF Using a Formal Limiting Procedure 273

The mean-squaredstate estimation error is If Ei(@,) = 0, then


J [i(t ll)] = E{%(tit)i(t It)] (27-9)
g(tlf) = j- @(t,~)K(~)z(~)d7 = jr A(~J)z(T)& (27-17)
We shall determine Qit) that minimizes J [%(tit)], subject to the constraints of ID IO
our state-variablemodel and data set. where the filter kernel A(t,r) is
A(~,T) = @(!,r)K(r) (27-18)

THE KALMAWWCY FILTER The second approach to deriving the KBF, mentioned in the intro-
duction to this chapter. begins by assumingthat i(rir) can be expressed as in
The solution to the problem stated in the preceding section is the Kalman- (27-17) where A(f,7) is unknown. The mean-squaredestimation error is min-
Bucy Filter, the structure of which is summarizedin the following: imized to obtain the following Wiener-ISopf integral equation:

Theorem 27-l. The KBF is describedby the vector di#erentia/ equation E{x(t)z(~)} - 1 A(t,~)E{z(r)z(a)}d7 = 0 (27-19)
*o
i(tit) = F(t)g(t[t) + K(t)[z(t) - H(t)i(t It)] (27-N) where lo I (7 5 t. When this equation is convertedinto a differential equation,
wheret 2 to,k(t&) = rn&), one obtains the KBF described in Theorem 27-l. For the details of this
derivation of Theorem 27-1 see Meditch, 1969,Chapter 8.
K(t) = P(t ir)H(t)R-l(t) (27-U)
and
DERIVATION OF KBF USING A FORMAL LIMITING PROCEDURE
I$]t) = F(t)P(t/t) + P(tlt)F(t) - P(tlt)H(t)R-(t)H(t)P(tlt)
+ wNw~~~ (27-12) Kalman filter Equation (17-ll), expressedas
Epatiun (27-L& which is a matrix Riccati differential equatiun, is initialized i(k + l]k + 1) = @(k + l,k)ir(kjk)
by qt&J = R(k?). El + K(k + l)[z(k + 1) - H(k + l)@(k + l,k)ri(klk)] (27-20)

Matrix K(t) is the Kaiman-Bucy gain matrix, and P(t/t) is the state- can also be written as
estimation-error covariancematrix, i.e+, G(t + Atlt + At) = @(t + At,t)i(tIr>
P(tlt) = E{%(tlt)S(tlt)} (27-13) + K(r + At)[z(t + At) - H(f + At)@(t + At,r)k(rir>] (27-21)
Equation (27-10)can be rewritten as where we have let fk = t and tk + r = t + AL In Example 24-3 we showed that
i(tlt) = [F(t) - K(t)H(t)]i(t[t) + K(t)z(t) (27-14) @(t + At,t) = I + F(f)At + O(Ar) (27-22)
which makes it very clear that the KBF is a time-varying filter that processes and
the measurementslinearly to produce k(tlt). Q&) = G(r)Q(t)G(t)At + O(Ar) (27-23)
The solution to (27-14)is
Observe that Qd(f) can also be written as

Q&) = [G(t)Af] [y ] [G(t)Aij + O(Ar) (27-24)


where the state transition matrix @(t,r) is the solution to the matrix differ-
ential equation and, if we expresswd(k) as r(k + l,k)w(k), so that Qd(k) = T(k + l?k)Q(k)
r(k + Z,k), then
&(t,r) = [F(t) - K(t)H(t)]@(t,T)
(27-16)
@(t,t) = I r(f + At,?) = G(f)Ar + O(At) (27-25)
274 Kalman-Bucy Filtering Lesson 27 Derivation of KBF When Structure of the Filter is Prespecified 275

Then, we substitute Q(t)/dt for Q(k = t>in (17-13),to obtain


Q(t> (27-26)
P(t + At It) = @(t + dt,t)P(t It)@@ + At,t)
Q(k =t>+ At
Q(t) ryt + dt,?)
+ r(t + At,t) --&- (27-33)
Equation (27-26) means that we replace Q(k = t) in the KF by Q(t)/At. Note
that we have encountered a bit of a notational problem here?becausewe have We leave it to the reader to show that
used w(k) [and its associatedcovariance, Q(k)] to denote the disturbance in lim P(t + AtIt) = P(@) (27-34)
our discrete-time model, and w(t) [and its associatedintensity, Q(t)] to denote At+ 0

the disturbance in our continuous-time model. hence,


Without going into technical details, we shall also replace R(t + At) in lim K(t + dt) = P(tlt)H(t)R-(t) g K(t) (27-35)
the KF by R(t + At)/At, i.e., At-0 At
R(k + 1 = t + At)+R(t + At)lAt (27-27) Combining (27-29), (27.30), (27-31), and (27-35),we obtain the KBF in (27-10)
and the KB gain matrix in (27-11).
SeeMeditch (1969, pp. 139-142) for an explanation.
In order to derive the matrix differential equation for P(t It), we begin
Substituting (27-22) into (27-21), and omitting all higher-order terms in
with (17-14), substitute (27-33) along with the expansionsof @(t + dt,t) and
At, we find that
r(t + dt,t) into that equation, and use the fact that K(t + 4t) has no zero-
ii(t + At It + At) = [I + F(t)At]%(t(t)+ K(t + At){z(t + At) order terms in dt, to show that
- H(t + At)[I + F(t)At]i(t It)} (27-28) P(t + At It + At) = P(t + 4tlt) - K(t + At)H(t + 4?)P(t + AtIt)
from which it follows that = P(tlt) + [F(t)P(tIt) + P(tlt)F(t)
lim i;(t + At It + At) - i(tIt) + W~Q(OG(t)lAt
= F(t)ri(t It)
Ar-*O At - K(t + dt)H(t + At)P(t + 4tlt) (27-36)
+ lim K(t + At){z(t + At) - H(t + At)[I + F(t)At]i(tlt)}/At Consequently,
At-*0
or lim P(t + At It + 49 - P@(r)
= ti(tlt) = F(t)P(tIt) + P(tjt)F(t)
Jr+0 At
i(tlt) = F(t)i(tIt) + lim K(t + At){z(t + At)
Al-r0 K(t + At)H(t + dt)P(t + 4tlt)
- H(t + At)[I + F(t)At]Ei(tlt)}/At (27-29) + G(t)Q(t)G(t) - lim (27-37)
At-0 At
Under suitable regularity conditions, which we shall assumeare satisfiedhere,
we can replace the limit of a product of functions by the product of limits, i.e., or finally, using (27-35)
*(tit) = F(t)P(tIt) + P(t]t)F(t) + G(t)Q(t)G(t)
lim K(t + At){z(t + At) - H(t + At)[I + F(t)At]rZ(t]t)}/At
At-0 - P(t It)H(t)R-(t)H(t)P(t It) (27-38)
K(t + At)
= lim At lim {z(t + At) - H(t + At)[I + F(t)At]ii(t[t)} (27-30) This completes the derivation of the KBF using a formal limiting pro-
Ar-0 Al-+0
cedure. It is also possible to obtain continuous-time smoothers by means of
The second limit on the right-hand side of (27-30) is easy to evaluate, this procedure (e.g., see Meditch, 1969,Chapter 7).
i.e.,
lim {z(t + At) - H(t + At)[I + F(t)At]zZ(tIt)}= z(t) - H(t)jZ(tIt) (27-31) DERIVATION OF KBF WHEN STRUCTURE
At+0 OF THE FILTER IS PRESPECIFIED
In order to evaluate the first limit on the right-hand side of (27-30),we
first substitute R(t + At)/At for R(k + 1 = t + A ) in (17~12),to obtain in this derivation of the KBF we begin by assuming that the filter has the
K(t + At) = P(t + Atlt)H(t + At) following structure,
[H(t + At)P(t + Atlt)H(t + At)At + R(t + At)]-At (27-32) i(t It) = F(t)i(t It) + K(t)[z(t) - Hk( t It)I (27-39)
Kaiman-Bucy Filtering Lesson 27 277
Derivation of KBF When Structure of the Filter is Prespecified

Our objective is to find the m&x function K(T), to5 f 5 7:that minimizes the Substituting (27-45) into (27-46),we seethat
following mean-squarederror,
X(K.P.C) = tr [F(t)P(t!@(t)]
J[K(r)] = E{e(r)e(r)} (27-40) - tr [K(t)H(t)P(t I@(r)]
+ tr [P(tjt)F(t)C(r)]
e(7) = x(7) - ?(T~T) (27-41) - tr [P(tlt)H(QK(t)X(t)l (27-47)

This optimization problem is a fixed-time, free-end-point (i.e., 7 is fixed but + tr ~W)QW(O~(Ol


e(T) is not fixed) problem in the calculus of variations (e.g., Kwakernaak and + tr [~(l)~(t)~(t)X(t)]
Sivan, 1972;Athans and Falb, 1965;and Bryson and Ho, 1969). The Euler-Lagrange equations for our optimization problem are:
It is straightforward to show that @e(r)] = 0, so that E{ [e(T) - E{e(r)}]
[e(r) - E{e(T)l]] = E{e(T)e(r)J. Letting (27-48)
P(f It) = E{e(l)e(f)} (27-42)
we know that J[K@)] can be reexpressedas Z;(f)= .- m(Kap3,X:) * (27-49)
(27-43)
(27-50)
We leave it to the reader to derive the following state equation for e(f),
and i ts associat.edcovarianceequation, and
4(f) = [F(t) - K(l)H(t)]e(f) + G(l)w(t) - K(f)v(t) (27-44) x*(7) = -& tr P(+) (27-51)

In these equations starred quantities denote optimal quantities, and I* denotes


I$!r) = [F(f) - K(t)H(r)]P(rit) + P(+)[F(t) - K(l)H(r)] the replacement of K, P, and X by K*, P*, and Z* afier the appropriate
+ G(t)Q(f)G(l) + K@)R(l)K(f) (27-45) derivative has been calculated. Note, also, that the derivatives of X(K,P,Z)
are derivatives of a scalar quantity with respect to a matrix (e.g., K, P, or X).
where e(lO)= 0 and P(loitO)= PX(tO). The calculus of gradient matrices (e.g., Schweppe, 1974; or Athans and
Our optimization problem for determining K(f) is: given & matrix dv- Schweppe, 1965) can be usedto evaluate these derivatives.The results are:
ferential equation (27-45), sutisfied by the error-cuvariunce matrix P(tit), u
terminal lime T, and the cost functiunal ~[K(T)] in (27-43),determine the rnutrix -x*p*qjf - s*p*H + BIKER + ~*K*R = 0 (27-52)
K(t), to 5 t 5 T that minimizes J[K(t)]. %* = -S*(F - K*H) - (F - K*H)s* (27-53)
The elementsJ+(#) of P(rit) may be viewed as the state variables of a
dynamical system,and the elementsk;j(t) of K(l) may be viewed as the control Ij* = (F - K*H)P* + P*(F - K*H) + GQG + K*RK* (27-54)
variables in an optimal control problem. The cost functional is then a terminal and
time penalty function on the state variables plj(tlt)+ Euler-Lagrange equations
associatedwith a free end-point problem can be usedto determine the optimal P(r) = I (27-55)
gain matrix K(t). Our immediate objective is to obtain an expression for K*(t).
To do this we define a set of costate variables u&f) that correspond to the
p&it), i,j = 1, 2, . . . , n. Let X(f) be an PI x n costatematrix that is associated Fact. Matrix X*(t) is symmetric and positive definite. a
with P(tlf), i.e., Z(t) = (uo(Q)q. Next, we introduce the Hamiltonian function
X(K,P?Z) where for notational conveniencewe have omitted the dependence We leave the proof of this fact as an exercise for the reader, Using this
of K? P, and Z on 1,and, fact, and the fact that covariance matrix P*(t) is symmetric, we are able to
express (27-52) as
(27-46)
2X*(K*R - PH) = 0 (27-56)
Kalman-Bucy Filtering Lesson 27 Steady-State KBF 279

BecauseC* > 0, (C*)- existsso that (27-56) has for its only solution which leads to the following three algebraic equations:

K*(t) = P*(tlt)H(t)R-(t) (27-57) (27-65a)


which is the Kalman-Bucy gain matrix stated in Theorem 27-l. In order to 1 --
obtain the covariance equation associatedwith K*(t), substitute (27-57) into F22 - ;p11p12 = 0 (27-65b)
(27-54).The result is (27-12).
This completesthe derivation of the KBF when the structure of the filter
is prespecified. l-2
-712 + q = 0 (27-65~)

STEADY-STATE KBF It is straightforward to show that the unique solution of these nonlinear algebraic
equations, for which P > 0, is
If our continuous-time systemis time-invariant and stationary, then, when l/2
certain system-theoretic conditions are satisfied (see, e.g., Kwakernaak and 7l2= (4r ) (27-66a)
Sivan, 1972),P(tlt)+ 0 in which caseP(tlt) hasa steady-statevalue, denotedi? 711 =
fi q l/4r3/4
(27-66b)
In this case,K(t) + K, where fi q 3/4r 114
F22 = (27-66~)
jf = jQl~--* (27-58)
The steady-state KB gain matrix is computed from (27~58),as
i? is the solution of the algebraicRiccati equation
F-i?+ i?F - FHR-HF + GQG = 0 (27-59) (27-67)

and the steady-state KBF is asymptotically stable, i.e., the eigenvalues of


F - %3 all lie in the left-half of the complex s-plane. Observe that, just as in the discrete-time case,the single-channel KBF dependsonly on
the ratio q /r.
Example 27-1 Although we only neededF1, and F12to compute K, F22is an important quantity,
Here we examine the steady-stateKBF for the simplest second-order system, the because
double integrator, 722= I--+=
lim E{ [i(t) - i(tlt)32) (27-68)
X(t) = w(t) (27-60)
Additionally,
and
z(t) = x(t) + v(t) (27-61) Fll = lim E{ [X (t) - Z(tIt)]} (27-69)
t+=J
in which w(t) and v (t) are mutually uncorrelated white noise processes,with intensities
Using (27-66b) and (27-66c), we find that
4 and r, respectively.With xl(t) = x (t) and x2(t)= i(t), this system is expressed in
state-variable format as p-22 = (q wjh (27-70)
(27-62) If q /r > 1 (i.e., SNR possibly greater than unity), we will always have larger errors in
estimation of i(t) than in estimation of x(t). This is not too surprising because our
measurement depends only on x(t), and both w(t) and v(t) affect the calculation of
Z = (1 0) (;:) + v (27-63)
i(t It).
The steady-state KBF is characterized by the eigenvalues matrix F-ii%,
The algebraic Riccati equation for this system is where

(27-71)

(27-64)
These eigenvalues are solutions of the equation
s2 + V5 (q /r)"4 s + (q /r)1'2 = 0 (27-72)
280 l Kalman-Bucy Filtering Lesson27 Problems
Lesson 27

When this equation is expressedin the normaiized form


The controlled variabk can be expressedas

we find that
The stochastic linear optimal output feedback regula tar problem problem
(43 = (q /r)14 (27-73) of finding the functional
u(t) = f[Z(T) to 5 I 5 t] (27-78)
6 = 0.707 (27-74) for to I t 5 f1such that the objective function
thus, the steady-state KBF for the simpk duuble integrator system is damped at 0.707.
The filters pules lie on the 45 line depicted in Figure 27-I. They can be moved along q-u] = E (; x(tl)Wlx(tl) + ; 1; HW&(4 + ~Wbu(~)ld~ ) (27-79)
this line by adjusting the ratio q /r; hence, once again. we may view q /r as a filter
tuning parameter. l is minimized. Here W1, WZ, and W3 are symmetric weighting matrices, and,
W1 I 0, W, > 0, and W3 > 0 for toI t I tl.
In the control theory literature, this problem is also known asthe linear-
quadratic-Gaussianregulator problem (i.e., the LQG problem; see, Athans,
1971, for example), We state the structure of the solution to this problem,
without proof, next.
The optimal control, u*(t), which minimizes J[u] in (27-79) is
u*(t) = -F(t)i(tlt) (27-80)
where p(t> is an optimal gain matrix, computed as
P(t) = W,B(r)P,(t) (27-81)
where PC(t)is the solution of the control Riccati equation
--Ii=(t) = F(r)P,(t) + P,(t)F(f) - P,(t)B(t)W;B(r)P,(t)
+ D(t)W,D(t) (2742)
Figure 27-l Eigenvalues of steady-state Kl3F Lie along 245 degree lines+ In- PC(t ,) given I
creasing q /r moves them farther away from the origin. whereas decreasing q /r
moves them doser to the origin. and i(t/t) is the output of a KBF, properly modified to account for the control
term in the state equation, i.e.,
AN MPORTANT APPLtCATtON FOR THE K8F i(t It) = F(t)?(+) + B(t)u*(r) + K(f)[z(t) - H(t)?(+)] (27-83)

Consider the system We see that the KBF plays an essentialrole in the solution of the LQG
problem.
k(r) = F(t)x(r) + B(I)u(~) + G(f)w(~) (27-75)
QJ = x0
for l 2 ffl where x0 is a random initial condition vector with mean m&J and PROBLEMS
covariancematrix P&J. Measurementsare given by
27-l. Explain the replacement of covariance matrix R(k + 1 = t + df) by
z(f) = H(f)x(f) + v(r) (27-76) R(t + At)/Ar in (27-27).
27-2. Show that GIO P(t + AtIt) = P(tIt).
for t 2 lo. The joint random processco1[w(~),v(I)] is a white noise processwith
intensity 27-3. Derive the state equation for error e(r). given in (27-44), and its associated
covariance equation (27-45).
27-4. Prove that matrix Z*(t) is symmetric and positive definite.
A
Concept of Sufficient Statistics 283

Lesson z (IV)], where z(i) = 0 if the ith car is not defective and z(i) = 1 if the ith car is
defective. The total number of observed defective cars is

T(Z) = 5 z(i)
Sufficient Statistics i=l
This is a statistic that maps many different values of z (l), . . . , z (IV) into the samevalue
of T(%). It is intuitively clear that, if one is interested in estimating the proportion 0 of

and Statistical defective cars, nothing is lost by simply recording and using T(3) in place of z (l), . . . ,
z (IV). The particular sequenceof ones and zeros is irrelevant. Thus, as far as estimating
the proportion of defective cars, T(Z) contains all the information contained in %. 0

Estimation An advantage associated with the concept of a sufficient statistic is


dimensionality reduction. In Example A-l, the dimensionality reduction is
fromNt0 1.
of Parameters Definition A-l. A statistic T(Z) is sufficient for vector parameter 8, if
and only if the distribution of %, conditioned on T(S) = t, does not involve
8. cl
Example A-2
This example illustrates the application of Definition A-l to identify a sufficient statis-
INTRODUCTION tic for the model in Example A-l. Let 8 be the probability that a car is defective. Then
z(l), 42),.*., z(N) is a record of IV Bernoulli trials with probability 8; thus,
In this lesson,* we discussthe usefulnessof the notion of sufficient statistics in Wt) = ey1 - e)- f, 0 < 8 c 1, where t = xr 1 z(z), and t(i) = 1 or 0. The condi-
statistical estimation of parameters. Specifically,we discussthe role played by tional distribution of z (l), . . . , z (N), given xy= 1z (i) = t, is
sufficient statistics and exponential families in maximum-likelihood and uni- P[S,T = t] eyi - e)N-r
formly minimum-variance unbiased (UMVU) parameter estimation. P[%/T = t] = =- 1

CONCEPT OF SUFFICIENT STATISTICS which is independent of 8; hence, T(S) = Cy= 1 z(i) is sufficient. Any one-to-one
function of T (3) is also sufficient. 0
The notion of a sufficient statistic can be explained intuitively (Ferguson,
1967), as follows. We observe3(N) (3 for short), where 55= co1[z(l), z(2), This example illustrates that deriving a sufficient statistic using Defini-
z(N)], in which z(l), . . . , z(N) are independentand identically distrib- tion A-l can be quite difficult. An equivalent definition of sufficiency, which is
ute; random vectors, each having a density function p(z(i)l8), where 8 is easy to apply, is given in the following:
unknown. Often the information in % can be representedequivalently in a
statistic, T(Z), whosedimension is independentof N, such that T(%) contains Theorem A-l (Factorization Theorem). A necessary and sufficient
all of the information about 0 that is originally in %. Sucha statistic is known as condition for T(Z) to be sufficient for 8 is that there exists a factorization
a sufficient statistic.
Example A-l
Consider a sampled sequenceof N manufactured cars. For each car we record whether where the first factor in (A-l) may depend on 8, but depends on 9 only through
it is defective or not. The observed sample can be represented as % = co1 [z(l), . . . , T(Z), whereas the second factor is independent of 8. Cl

* This lesson was written by Dr. Rama Chellappa, Department of Electrical Engineering- The proof of this theorem is given in Ferguson(1967) for the continuous
Systems,Unversity of Southern California. Los Angeles: CA 9089. caseand Duda and Hart (1973) for the discrete case.

282
Sutiicient Statistics and Statistical Estimation of Parameters Lesson A
Exponental Families of Distributions
Example A-3 (Continuation of Example A-2)
In Example A-2, the probability distribution of samplesz [ l), . . . , z (NJ is pie, the family of normal distributions N(~,c?), with 2 known and p
unknown, is an exponential family which, as we have seen in Example A-4,
has a one-dimensional sufficient statistic for CL,that is equal to xr 1z(i). As
where the total number of defective cars is t = either 0 or 1. Bickel and Doksum (1977) state,
Ecluation (A-2) can be written quivalently as
Definition A-2 (Bickel and Doksum, 1977). If there exist real-valued
functions a@), and b(9) on parameter space 0, and real-valued functions T(z)
[z is short for z(i)] and h(z) on R, such that the density function ~(~10)can be
Comparing (A-3) with (A-I), we conclude that
written as
h(S) = 1
p(zl0) = exp [a@)T(z) + b(0) + h(z)] (A-4)
+e + N ln (1 - 0)
1 then p(zjO), 8 E 9, is said to be a one-parameter exponential family
dktributions. 0
of

The Gaussian, Binomial, Beta, Rayleigh, and Gamma distributions are


examples of such one-parameter exponential families.
Using the Factorization Theorem, it was easy to determine T(9). 0 In a one parameter exponential family, T(z) is sufficient for 8. The
Exztmple A-4
family of distributions obtained by sampling from one-parameterexponential
families is also a one-parameter exponential family. For example, suppose
LetS=col[z(l), . . ..z(N)]be a random sample drawn from a univariate G.aussian that z(l), . . . , z(N) are independent and identically distributed with common .
distribution, with unknown mean p, and, known variance 42 > 0. Then
density p (~10);then,

p(fEj6) = exp a(9) 5 T(z(i)) + Nb(9) + E h(z(i))


r=l i=l 1 (A-5)
The sufficient statistic T(Z) for this situation is
h (3) = exp

Based on the Factorization Theorem we identify z(i) as a sufficient T(S) = 5 T(z(i))


i-l
statistic for p. 0
Example A-5
Becausethe concept of sufficient statistics involves reduction of data, it Let z(l), . , . , z(N) be a randomsamplefrom a multivariateGaussiandistribution with
is \vorthwhile to know how far such a reduction can be done for a given unknown d X 1 mean vector p and known covariance matrix P&.Then [z is short for
problem The dimension of the smallest set of statistics that is still sufficient z(i)3
for the parameters is caIled a minimu/ sufficienf aaktic. SeeBarankin (1959) ~(414 = =P WbJW + bbL) + h WI
and Datz (1959) for techniques useful in identifying a minimal sufficient
statistic. where
a(p) = pP,
b(p) = -$.dP; )L
EXP0NENTiAL FAMILIES OF DISTRIBUTIONS
T(z) = z
It is of interest to study families of distributions, ~(z(#) for which, irre-
spective of the sample size N, there exists a sufficient statistic of fixed dimen-
sion. The exponential families of distributions have this property. For exam- - i ln det P&
I
286 Sufficient Statistics and Statistical Estimation of Parameters Lesson A Exponental Families and Maximum-Likelihood Estimation 287

Example A-5
Let z(1), ..., z(N) be a random sample from a multivariate Gaussian distribution with unknown d × 1 mean vector μ and known covariance matrix P_μ. Then [z is short for z(i)]

   p(z|μ) = exp[a(μ)T(z) + b(μ) + h(z)]

where

   a(μ) = μ′P_μ^{-1}
   b(μ) = −½ μ′P_μ^{-1}μ − ½ ln det P_μ
   T(z) = z
   h(z) = −½ z′P_μ^{-1}z − (d/2) ln 2π

Additionally,

   p(Z|μ) = exp[a(μ)T(Z) + N b(μ)] h(Z)

where

   T(Z) = Σ_{i=1}^N z(i)

and

   h(Z) = exp[−½ Σ_{i=1}^N z′(i)P_μ^{-1}z(i) − (Nd/2) ln 2π]   □

The notion of a one-parameter exponential family of distributions, as stated in Bickel and Doksum (1977), can easily be extended to m parameters and vector observations in a straightforward manner.

Definition A-3. If there exist real matrices A_1(θ), ..., A_m(θ), a real function b of θ, where θ ∈ Θ, real matrices T_1(z), ..., T_m(z), and a real function h(z), such that the density function p(z|θ) can be written as

   p(z|θ) = exp{ Σ_{i=1}^m tr[A_i(θ)T_i′(z)] + b(θ) + h(z) }        (A-6)

then p(z|θ), θ ∈ Θ, is said to be an m-parameter exponential family of distributions. □

Example A-6
The family of d-variate normal distributions N(μ, P_μ), where both μ and P_μ are unknown, is an example of a 2-parameter exponential family in which θ contains μ and the elements of P_μ. In this case

   A_1(θ) = a_1(θ) = P_μ^{-1}μ
   T_1(z) = z
   A_2(θ) = −½ P_μ^{-1}
   T_2(z) = zz′

and

   h(z) = 0   □

As is true for a one-parameter exponential family of distributions, if z(1), ..., z(N) are drawn randomly from an m-parameter exponential family, then p[z(1), ..., z(N)|θ] forms an m-parameter exponential family with sufficient statistics T_1(Z), ..., T_m(Z), where T_i(Z) = Σ_{j=1}^N T_i[z(j)].

EXPONENTIAL FAMILIES AND MAXIMUM-LIKELIHOOD ESTIMATION

Let us consider a vector of unknown parameters θ that describes a collection of N independent and identically distributed observations Z = col[z(1), ..., z(N)]. The maximum-likelihood estimate (MLE) of θ is obtained by maximizing the likelihood of θ given the observations Z. Likelihood is defined in Lesson 11 to be proportional to the value of the probability density of the observations, given the parameters, i.e.,

   l(θ|Z) ∝ p(Z|θ)

As discussed in Lesson 11, a sufficient condition for l(θ|Z) to be maximized is

   ∂²L(θ|Z)/∂θ² < 0        (A-7)

where L(θ|Z) = ln l(θ|Z). Maximum-likelihood estimates of θ are obtained by solving the system of n equations

   ∂L(θ|Z)/∂θ_i = 0,   i = 1, 2, ..., n        (A-8)

for θ̂_ML and checking whether the solution to (A-8) satisfies (A-7).

When this technique is applied to members of exponential families, θ̂_ML can be obtained by solving a set of algebraic equations. The following theorem, paraphrased from Bickel and Doksum (1977), formalizes this technique for vector observations.

Theorem A-2 (Bickel and Doksum, 1977). Let p(z|θ) = exp[a(θ)T(z) + b(θ) + h(z)], and let A denote the interior of the range of a(θ). If the equation

   E_θ{T(z)} = T(z)        (A-9)

has a solution θ̂(z) for which a[θ̂(z)] ∈ A, then θ̂(z) is the unique MLE of θ. □

The proof of this theorem can be found in Bickel and Doksum (1977).
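As a small illustration of Theorem A-2 (a Python/NumPy sketch under assumed simulated data, not an example from the text), consider the one-parameter exponential density p(z|θ) = θ e^{−θz}, z ≥ 0, for which a(θ) = −θ and T(z) = z, so that for a random sample (A-9) reads N/θ = Σ z(i) and θ̂_ML equals one over the sample mean. The fragment below checks this against a brute-force search over the log-likelihood.

# Theorem A-2 for one assumed one-parameter family, the exponential density
# p(z|theta) = theta * exp(-theta * z), z >= 0, with a(theta) = -theta, T(z) = z.
# For a random sample, (A-9) reads E_theta{sum z(i)} = N/theta = sum z(i),
# so theta_ML = 1 / (sample mean).  The grid search below is only a check.
import numpy as np

rng = np.random.default_rng(1)
N, theta_true = 200, 0.8
z = rng.exponential(scale=1.0 / theta_true, size=N)

theta_from_A9 = 1.0 / z.mean()                 # closed form obtained from (A-9)

grid = np.linspace(0.05, 5.0, 20000)           # direct maximization, as a check
loglik = N * np.log(grid) - grid * z.sum()     # log-likelihood N ln(theta) - theta * sum z(i)
theta_grid = grid[np.argmax(loglik)]

print(theta_from_A9, theta_grid)               # the two agree to grid resolution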
Example A-7 (Continuation of Example A-5)
In this case

   T(Z) = Σ_{i=1}^N z(i)

and

   E_μ{T(Z)} = Nμ

hence (A-9) becomes

   Nμ = Σ_{i=1}^N z(i)

whose solution, μ̂, is

   μ̂_ML = (1/N) Σ_{i=1}^N z(i)

which is the well-known MLE of μ. □

Theorem A-2 can be extended to the m-parameter exponential family case by using Definition A-3. We illustrate the applicability of this extension using the example given below.

Example A-8 (see, also, Example A-6)
Let Z = col[z(1), ..., z(N)] be randomly drawn from p(z|θ) = N(μ, P_μ), where both μ and P_μ are unknown, so that θ contains μ and the elements of P_μ. Vector μ is d × 1 and matrix P_μ is d × d, symmetric and positive definite. We express p(Z|θ) as

   p(Z|θ) = (2π)^{−Nd/2} (det P_μ)^{−N/2} exp[−½ Σ_{i=1}^N (z(i) − μ)′P_μ^{-1}(z(i) − μ)]

Using Theorem A-1 or Definition A-3, it can be seen that Σ_{i=1}^N z(i) and Σ_{i=1}^N z(i)z′(i) are sufficient for (μ, P_μ). Letting

   T_1(Z) = Σ_{i=1}^N z(i)   and   T_2(Z) = Σ_{i=1}^N z(i)z′(i)

we find that

   E_θ{T_1(Z)} = Nμ   and   E_θ{T_2(Z)} = N(P_μ + μμ′)

Applying (A-9) to both T_1(Z) and T_2(Z), we obtain

   Nμ = Σ_{i=1}^N z(i)   and   N(P_μ + μμ′) = Σ_{i=1}^N z(i)z′(i)

whose solutions, μ̂ and P̂_μ, are

   μ̂ = (1/N) Σ_{i=1}^N z(i)   and   P̂_μ = (1/N) Σ_{i=1}^N z(i)z′(i) − μ̂μ̂′

which are the MLEs of μ and P_μ. □

Example A-9 (Linear Model)
Consider the linear model

   Z(k) = H(k)θ + V(k)

in which θ is an n × 1 vector of deterministic parameters, H(k) is deterministic, and V(k) is a zero-mean white noise sequence with known covariance matrix R(k). From (11-25) in Lesson 11, we can express p(Z(k)|θ) as

   p(Z(k)|θ) = exp[a(θ)T(Z(k)) + b(θ) + h(Z(k))]

where

   a(θ) = θ′
   T(Z(k)) = H′(k)R^{-1}(k)Z(k)
   b(θ) = −(N/2) ln 2π − ½ ln det R(k) − ½ θ′H′(k)R^{-1}(k)H(k)θ

and

   h(Z(k)) = −½ Z′(k)R^{-1}(k)Z(k)

Observe that

   E_θ{H′(k)R^{-1}(k)Z(k)} = H′(k)R^{-1}(k)H(k)θ

hence, applying (A-9), we obtain

   H′(k)R^{-1}(k)H(k)θ = H′(k)R^{-1}(k)Z(k)

whose solution, θ̂(k), is

   θ̂(k) = [H′(k)R^{-1}(k)H(k)]^{-1} H′(k)R^{-1}(k)Z(k)

which is the well-known expression for the MLE of θ (see Theorem 11-3). The case when R(k) = σ²I, where σ² is unknown, can be handled in a manner very similar to that in Example A-8. □
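A minimal numerical sketch of the estimator in Example A-9 follows, assuming a simulated linear model with a known diagonal noise covariance; the matrix H, the true θ, and the noise variances are made-up illustration values, not from the text.

# Maximum-likelihood (weighted least-squares) estimate for Z = H*theta + V,
# V ~ N(0, R) with R known and diagonal.  All numbers below are illustrative.
import numpy as np

rng = np.random.default_rng(2)
N, n = 50, 3
H = rng.normal(size=(N, n))                       # deterministic H(k)
theta_true = np.array([1.0, -2.0, 0.5])
r = rng.uniform(0.5, 2.0, size=N)                 # known noise variances (diagonal of R)
Z = H @ theta_true + rng.normal(scale=np.sqrt(r))

R_inv = np.diag(1.0 / r)
# theta_hat = [H' R^-1 H]^-1 H' R^-1 Z; solve the normal equations instead of inverting
theta_hat = np.linalg.solve(H.T @ R_inv @ H, H.T @ R_inv @ Z)
print(theta_hat)                                  # close to theta_true for moderate N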

SUFFICIENT STATISTICS AND UNIFORMLY MINIMUM-VARIANCE UNBIASED ESTIMATION

In this section we discuss how sufficient statistics can be used to obtain uniformly minimum-variance unbiased (UMVU) estimates. Recall, from Lesson 6, that an estimate θ̂ of parameter θ is said to be unbiased if

   E{θ̂} = θ        (A-10)

Among such unbiased estimates, we can often find one estimate, denoted θ*, which improves all other estimates in the sense that

   var(θ*) ≤ var(θ̂)        (A-11)

When (A-11) is true for all (admissible) values of θ, θ* is known as the UMVU estimate of θ. The UMVU estimator is obtained by choosing the estimator which has the minimum variance among the class of unbiased estimators. If the estimator is constrained further to be a linear function of the observations, then it becomes the BLUE, which was discussed in Lesson 9.

Suppose we have an estimate, θ̂(Z), of parameter θ that is based on observations Z = col[z(1), ..., z(N)]. Assume further that p(Z|θ) has a finite-dimensional sufficient statistic, T(Z), for θ. Using T(Z), we can construct an estimate θ*(Z) which is at least as good as, or even better than, θ̂ by the celebrated Rao-Blackwell Theorem (Bickel and Doksum, 1977). We do this by computing the conditional expectation of θ̂(Z), i.e.,

   θ*(Z) = E{θ̂(Z)|T(Z)}        (A-12)

Estimate θ*(Z) is better than θ̂ in the sense that E{[θ*(Z) − θ]²} ≤ E{[θ̂(Z) − θ]²}. Because T(Z) is sufficient, the conditional expectation E{θ̂(Z)|T(Z)} will not depend on θ; hence, θ*(Z) is a function of Z only. Application of this conditioning technique can only improve an estimate such as θ̂(Z); it does not guarantee that θ*(Z) will be the UMVU estimate. To obtain the UMVU estimate using this conditioning technique, we need the additional concept of completeness.

Definition A-4 [Lehmann (1959; 1980); Bickel and Doksum (1977)]. A sufficient statistic T(Z) is said to be complete if the only real-valued function, g, defined on the range of T(Z), which satisfies E_θ{g(T)} = 0 for all θ, is the function g(T) = 0. □

Completeness is a property of the family of distributions of T(Z) generated as θ varies over its range. The concept of a complete sufficient statistic, as stated by Lehmann (1983), can be viewed as an extension of the notion of sufficient statistics in reducing the amount of useful information required for the estimation of θ. Although a sufficient statistic achieves data reduction, it may contain some additional information not required for the estimation of θ. For instance, it may be that E_θ{g(T(Z))} is a constant independent of θ for some nonconstant function g. If so, we would like to have E_θ{g(T(Z))} = c (a constant independent of θ) imply that g(T(Z)) = c. By subtracting c from E_θ{g(T(Z))}, one arrives at Definition A-4. Proving completeness using Definition A-4 can be cumbersome. In the special case when p(z(k)|θ) is a one-parameter exponential family, i.e., when

   p(z(k)|θ) = exp[a(θ)T(z(k)) + b(θ) + h(z(k))]        (A-13)

the completeness of T(z(k)) can be verified by checking whether the range of a(θ) contains an open interval (Lehmann, 1959).

Example A-10
Let Z = col[z(1), ..., z(N)] be a random sample drawn from a univariate Gaussian distribution whose mean μ is unknown, and whose variance σ² > 0 is known. From Example A-5, we know that the distribution of Z forms a one-parameter exponential family, with T(Z) = Σ_{i=1}^N z(i) and a(μ) = μ/σ². Because a(μ) ranges over an open interval as μ varies from −∞ to +∞, T(Z) = Σ_{i=1}^N z(i) is complete and sufficient.

The same conclusion can be obtained using Definition A-4, as follows. We must show that the Gaussian family of probability distributions (with μ unknown and σ² fixed) is complete. Note that the sufficient statistic T(Z) = Σ_{i=1}^N z(i) (see Example A-5) is Gaussian with mean Nμ and variance Nσ². Suppose g is a function such that E_μ{g(T)} = 0 for all −∞ < μ < ∞; then,

   E_μ{g(T)} = ∫_{−∞}^{∞} (1/√(2π)) g(σ√N v + Nμ) exp(−v²/2) dv = 0        (A-14)

implies g(·) = 0 for all values of the argument of g. □

Other interesting examples that prove completeness for families of distributions are found in Lehmann (1959).

Once a complete and sufficient statistic T(Z) is known for a given parameter estimation problem, the Lehmann-Scheffe Theorem, given next, can be used to obtain a unique UMVU estimate. This theorem is paraphrased from Bickel and Doksum (1977).

Theorem A-3 [Lehmann-Scheffe Theorem (e.g., Bickel and Doksum, 1977)]. If a complete and sufficient statistic, T(Z), exists for θ, and θ̂ is an unbiased estimator of θ, then θ*(Z) = E{θ̂|T(Z)} is an UMVU estimator of θ. If var[θ*(Z)] < ∞ for all θ, then θ*(Z) is the unique UMVU estimate of θ. □

A proof of this theorem can be found in Bickel and Doksum (1977).
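The conditioning step (A-12) can be seen at work in a small Monte Carlo experiment (a Python/NumPy sketch using the Gaussian setting of Example A-10; the sample sizes, seed, and number of trials are illustration values only): the crude unbiased estimator z(1) is Rao-Blackwellized by conditioning on the complete sufficient statistic Σ z(i), which by symmetry yields the sample mean, and the variance drops from about σ² to about σ²/N.

# Rao-Blackwell idea of (A-12) in the setting of Example A-10 (assumed values).
import numpy as np

rng = np.random.default_rng(3)
N, mu, sigma = 10, 2.0, 1.0
trials = 20000
Z = rng.normal(mu, sigma, size=(trials, N))

crude = Z[:, 0]                  # theta_hat(Z) = z(1): unbiased but high variance
rao_blackwell = Z.mean(axis=1)   # E{z(1) | sum z(i)} = sample mean, by symmetry

print(crude.mean(), rao_blackwell.mean())   # both are near mu (unbiased)
print(crude.var(), rao_blackwell.var())     # variance drops from ~sigma^2 to ~sigma^2/N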

Theorem A-3 can be applied in two ways to determine an UMVU estimator [Bickel and Doksum (1977), and Lehmann (1959)]:

Method 1. Find a statistic of the form h(T(Z)), where T(Z) is a complete and sufficient statistic for θ, such that

   E{h(T(Z))} = θ

Then h(T(Z)) is an UMVU estimator of θ. This follows from the fact that E{h(T(Z))|T(Z)} = h(T(Z)).

Method 2. Find an unbiased estimator, θ̂, of θ; then, E{θ̂|T(Z)} is an UMVU estimator of θ for a complete and sufficient statistic T(Z).

Example A-11 (Continuation of Example A-10)
We know that T(Z) = Σ_{i=1}^N z(i) is a complete and sufficient statistic for μ. Furthermore, (1/N) Σ_{i=1}^N z(i) is an unbiased estimator of μ; hence, we obtain the well-known result, from Method 1, that the sample mean, (1/N) Σ_{i=1}^N z(i), is an UMVU estimate of μ. Because this estimator is linear, it is also the BLUE of μ. □

Example A-12 (Linear Model)
As in Example A-9, consider the linear model

   Z(k) = H(k)θ + V(k)        (A-16)

where θ is a deterministic but unknown n × 1 vector of parameters, H(k) is deterministic, and E{V(k)} = 0. Additionally, assume that V(k) is Gaussian with known covariance matrix R(k). Then the statistic T(Z(k)) = H′(k)R^{-1}(k)Z(k) is sufficient (see Example A-9). That it is also complete can be seen by using Theorem A-4. To obtain the UMVU estimate of θ, we need to identify a function h[T(Z(k))] such that E{h[T(Z(k))]} = θ. The structure of h[T(Z(k))] is obtained by observing that

   E{T(Z(k))} = E{H′(k)R^{-1}(k)Z(k)} = H′(k)R^{-1}(k)H(k)θ

hence,

   h[T(Z(k))] = [H′(k)R^{-1}(k)H(k)]^{-1} T(Z(k))

Consequently, the UMVU estimator of θ is

   θ*(Z(k)) = [H′(k)R^{-1}(k)H(k)]^{-1} H′(k)R^{-1}(k)Z(k)

which agrees with Equation (9-26). □

We now generalize the discussions given above to the case of an m-parameter exponential family and scalar observations. This theorem is paraphrased from Bickel and Doksum (1977).

Theorem A-4 [Bickel and Doksum (1977) and Lehmann (1959)]. Let p(z|θ) be an m-parameter exponential family given by

   p(z|θ) = exp[ Σ_{i=1}^m a_i(θ)T_i(z) + b(θ) + h(z) ]

where a_1, ..., a_m and b are real-valued functions of θ, and T_1, ..., T_m and h are real-valued functions of z. Suppose that the range of a = col[a_1(θ), ..., a_m(θ)] contains an open m-rectangle [if (x_1, y_1), ..., (x_m, y_m) are m open intervals, the set {(s_1, ..., s_m): x_i < s_i < y_i, 1 ≤ i ≤ m} is called an open m-rectangle]; then T(z) = col[T_1(z), ..., T_m(z)] is complete as well as sufficient. □

Example A-13 (This example is taken from Bickel and Doksum, 1977, pp. 123-124)
As in Example A-4, let Z = col[z(1), ..., z(N)] be a sample from a N(μ, σ²) population where both μ and σ² are unknown. As a special case of Example A-6, we observe that the distribution of Z forms a two-parameter exponential family where θ = col(μ, σ²). Because col[a_1(θ), a_2(θ)] = col(μ/σ², −1/(2σ²)) ranges over the lower half-plane as θ ranges over col[(−∞, ∞), (0, ∞)], the conditions of Theorem A-4 are satisfied. As a result, T(Z) = col[Σ_{i=1}^N z(i), Σ_{i=1}^N z²(i)] is complete and sufficient. □

Theorem A-3 also generalizes in a straightforward manner to:

Theorem A-5. If a complete and sufficient statistic T(Z) = col[T_1(Z), ..., T_m(Z)] exists for θ, and θ̂ is an unbiased estimator of θ, then θ*(Z) = E{θ̂|T(Z)} is an UMVU estimator of θ. If the elements of the covariance matrix of θ*(Z) are < ∞ for all θ, then θ*(Z) is the unique UMVU estimate of θ. □

The proof of this theorem is a straightforward extension of the proof of Theorem A-3, which can be found in Bickel and Doksum (1977).

Example A-14 (Continuation of Example A-13)
In Example A-13 we saw that col[T_1(Z), T_2(Z)] = col[Σ_{i=1}^N z(i), Σ_{i=1}^N z²(i)] is sufficient and complete for both μ and σ². Furthermore, since

   μ̂ = (1/N) Σ_{i=1}^N z(i)

and

   σ̂² = [1/(N − 1)] Σ_{i=1}^N [z(i) − μ̂]²

are unbiased estimators of μ and σ², respectively, we use the extension of Method 1 to the vector-parameter case to conclude that μ̂ and σ̂² are UMVU estimators of μ and σ². □

It is not always possible to identify a function h(T(Z)) that is an unbiased estimator of θ. Examples that use the conditioning Method 2 to obtain UMVU estimators are found, for example, in Bickel and Doksum (1977) and Lehmann (1980).
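A quick Monte Carlo check of the unbiasedness invoked in Example A-14 (a Python/NumPy sketch with assumed simulation values, not from the text): the sample mean and the 1/(N − 1) sample variance are functions of the complete sufficient statistic col[Σ z(i), Σ z²(i)] and are unbiased, hence UMVU by Theorem A-5.

# Unbiasedness of the estimators used in Example A-14 (assumed simulation values).
import numpy as np

rng = np.random.default_rng(4)
N, mu, sigma2 = 8, -1.0, 3.0
trials = 50000
Z = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))

mu_hat = Z.mean(axis=1)
var_hat = Z.var(axis=1, ddof=1)   # divide by N-1 so that E{var_hat} = sigma2

print(mu_hat.mean())              # approximately mu
print(var_hat.mean())             # approximately sigma2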

PROBLEMS

A-1. Suppose z(1), ..., z(N) are independent random variables, each uniform on [0, θ], where θ > 0 is unknown. Find a sufficient statistic for θ.

A-2. Suppose we have two independent observations from the Cauchy distribution,

   p(z) = (1/π) · 1/[1 + (z − θ)²],   −∞ < z < ∞

Show that no sufficient statistic exists for θ.

A-3. Let z(1), z(2), ..., z(N) be generated by the first-order autoregressive process

   z(i) = θ z(i − 1) + β w(i)

where {w(i), i = 1, ..., N} is an independent and identically distributed Gaussian noise sequence with zero mean and unit variance. Find a sufficient statistic for θ and β.

A-4. Suppose that T(Z) is sufficient for θ, and that θ̂(Z) is a maximum-likelihood estimate of θ. Show that θ̂(Z) depends on Z only through T(Z).

A-5. Using Theorem A-2, derive the maximum-likelihood estimator of θ when observations z(1), ..., z(N) denote a sample from

   p(z(i)|θ) = θ e^{−θz(i)},   z(i) ≥ 0, θ > 0

A-6. Show that the family of Bernoulli distributions, with unknown probability of success p (0 ≤ p ≤ 1), is complete.

A-7. Show that the family of uniform distributions on (0, θ), where θ > 0 is unknown, is complete.

A-8. Let z(1), ..., z(N) be independent and identically distributed samples, where p(z(i)|p) is a Bernoulli distribution with unknown probability of success p (0 ≤ p ≤ 1). Find a complete sufficient statistic, T; the UMVU estimate q(T) of p; and the variance of q(T).

A-9. [Taken from Bickel and Doksum (1977)]. Let z(1), z(2), ..., z(N) be an independent and identically distributed sample from N(μ, 1). Find the UMVU estimator of P_μ[z(1) ≤ 0].

A-10. [Taken from Bickel and Doksum (1977)]. Suppose that T_1 and T_2 are two UMVU estimates of θ with finite variances. Show that T_1 = T_2.

A-11. In Example A-12 prove that T(Z(k)) is complete.
Appendix A

Glossary of Major Results

Equations (3-10) and (3-11): Batch formulas for θ̂_WLS(k) and θ̂_LS(k).
Theorem 4-1: Information form of recursive LSE.
Lemma 4-1: Matrix inversion lemma.
Theorem 4-2: Covariance form of recursive LSE.
Theorem 5-1: Multistage LSE.
Theorem 6-1: Necessary and sufficient conditions for a linear batch estimator to be unbiased.
Theorem 6-2: Sufficient condition for a linear recursive estimator to be unbiased.
Theorem 6-3: Cramer-Rao inequality for a scalar parameter.
Corollary 6-1: Achieving the Cramer-Rao lower bound.
Theorem 6-4: Cramer-Rao inequality for a vector of parameters.
Corollary 6-2: Inequality for error-variance of the ith parameter.
Theorem 7-1: Mean-squared convergence implies convergence in probability.
Theorem 7-2: Conditions under which θ̂(k) is a consistent estimator of θ.
Theorem 8-1: Sufficient conditions for θ̂_LS(k) to be an unbiased estimator of θ.
Theorem 8-2: A formula for cov[θ̂_LS(k)].
Corollary 8-1: A formula for cov[θ̂_LS(k)] under special conditions on the measurement noise.
Theorem 8-3: An unbiased estimator of σ².
Theorem 8-4: Sufficient conditions for θ̂_LS(k) to be a consistent estimator of θ.
Theorem 8-5: Sufficient conditions for σ̂²(k) to be a consistent estimator of σ².
Equation (9-22): Batch formula for θ̂_BLU(k).
Theorem 9-1: The relationship between θ̂_BLU(k) and θ̂_WLS(k).
Corollary 9-1: When all the results obtained in Lessons 3, 4, and 5 for θ̂_WLS(k) can be applied to θ̂_BLU(k).
Theorem 9-2: When θ̂_BLU(k) equals θ̂_LS(k) (Gauss-Markov Theorem).
Theorem 9-3: A formula for cov[θ̂_BLU(k)].
Corollary 9-2: The equivalence between P(k) and cov[θ̂_BLU(k)].
Theorem 9-4: Most efficient estimator property of θ̂_BLU(k).
Corollary 9-3: When θ̂_BLU(k) is a most efficient estimator of θ.
Theorem 9-5: Invariance of θ̂_BLU(k) to scale changes.
Theorem 9-6: Information form of recursive BLUE.
Theorem 9-7: Covariance form of recursive BLUE.
Definition 10-1: Likelihood defined.
Theorem 10-1: Likelihood ratio of combined data from statistically independent sets of data.
Theorem 11-1: Large-sample properties of maximum-likelihood estimates.
Theorem 11-2: Invariance property of MLEs.
Theorem 11-3: Condition under which θ̂_ML(k) = θ̂_BLU(k).
Corollary 11-1: Conditions under which θ̂_ML(k) = θ̂_BLU(k) = θ̂_LS(k), and resulting estimator properties.
Theorem 12-1: A formula for p(x|y) when x and y are jointly Gaussian.
Theorem 12-2: Properties of E{x|y} when x and y are jointly Gaussian.
Theorem 12-3: Expansion formula for E{x|y, z} when x, y, and z are jointly Gaussian, and y and z are statistically independent.
Theorem 12-4: Expansion formula for E{x|y, z} when x, y, and z are jointly Gaussian and y and z are not necessarily statistically independent.
Theorem 13-1: A formula for θ̂_MS(k) (The Fundamental Theorem of Estimation Theory).
Corollary 13-1: A formula for θ̂_MS(k) when θ and Z(k) are jointly Gaussian.
Corollary 13-2: A linear mean-squared estimator of θ in the non-Gaussian case.
Corollary 13-3: Orthogonality principle.
Theorem 13-2: When θ̂_MAP(k) = θ̂_MS(k).
Theorem 14-1: Conditions under which θ̂_MS(k) = θ̂_BLU(k).
Theorem 14-2: Condition under which θ̂_MAP(k) = θ̂_ML(k).
Theorem 14-3: Condition under which θ̂_MAP(k) = θ̂_BLU(k).
Theorem 15-1: Expansion of a joint probability density function for a first-order Markov process.
Theorem 15-2: Calculation of conditional expectation for a first-order Markov process.
Theorem 15-3: Interpretation of Gaussian white noise as a special first-order Markov process.
Equations (15-17) and (15-18): The basic state-variable model.
Theorem 15-4: Conditions under which x(k) is a Gauss-Markov sequence.
Theorem 15-5: Recursive equations for computing m_x(k) and P_x(k).
Theorem 15-6: Formulas for computing m_z(k) and P_z(k).
Equations (16-4) and (16-11): Single-stage predictor formulas for x̂(k|k − 1) and P(k|k − 1).
Theorem 16-1: Formula for and properties of general state predictor, x̂(k|j), k > j.
Theorem 16-2: Representations and properties of the innovations process.
Theorem 17-1: Kalman filter formulas and properties of resulting estimates and estimation error.
Theorem 19-1: Steady-state Kalman filter.
Theorem 19-2: Equivalence of steady-state Kalman filter and infinite-length digital Wiener filter.
Theorem 20-1: Single-stage smoother formula for x̂(k|k + 1).
Corollary 20-1: Relationship between single-stage smoothing gain matrix and Kalman gain matrix.
Corollary 20-2: Another way to express x̂(k|k + 1).
Theorem 20-2: Double-stage smoother formula for x̂(k|k + 2).
Corollary 20-3: Relationship between double-stage smoothing gain matrix and Kalman gain matrix.
Corollary 20-4: Two other ways to express x̂(k|k + 2).
Theorem 21-1: Formulas for a useful fixed-interval smoother of x(k), x̂(k|N), and its error-covariance matrix, P(k|N).
Theorem 21-2: Formulas for a most useful two-pass fixed-interval smoother of x(k) and its associated error-covariance matrix.
Theorem 21-3: Formulas for a most useful fixed-point smoothed estimator of x(k), x̂(k|k + l), where l = 1, 2, ..., and its associated error-covariance matrix P(k|k + l).
Theorem 22-1: Conditions under which a single-channel state-variable model is equivalent to a convolutional sum model.
Theorem 22-2: Recursive minimum-variance deconvolution formulas.
Theorem 22-3: Steady-state MVD filter, and its zero-phase nature.
Theorem 22-4: Equivalence between steady-state MVD filter and Berkhout's infinite impulse response digital Wiener deconvolution filter.
Theorem 22-5: Maximum-likelihood deconvolution results.
Theorem 22-6: Structure of minimum-variance waveshaper.
Theorem 22-7: Recursive fixed-interval waveshaping results.
Theorem 23-1: How to handle biases that may be present in a state-variable model.
Theorem 23-2: Predictor-corrector Kalman filter for the correlated noise case.
Corollary 23-1: Recursive predictor formulas for the correlated noise case.
Corollary 23-2: Recursive filter formulas for the correlated noise case.
Equations (24-1) and (24-2): Nonlinear state-variable model.
Equations (24-23) and (24-30): Perturbation state-variable model.
Theorem 24-1: Solution to a time-varying continuous-time state equation.
Equations (24-39) and (24-44): Discretized state-variable model.
Theorem 25-1: A consequence of relinearizing about x̂(k|k).
Equations (25-22) and (25-27): Extended Kalman filter prediction and correction equations.
Theorem 26-1: Formula for the log-likelihood function of the basic state-variable model.
Theorem 26-2: Closed-form formula for the maximum-likelihood estimate of the steady-state value of the innovations covariance matrix.
Theorem 27-1: Kalman-Bucy filter equations.
Definition A-1: Sufficient statistic defined.
Theorem A-1: Factorization theorem.
Theorem A-2: A method for computing the unique maximum-likelihood estimator of θ that is associated with exponential families of distributions.
Theorem A-3: Lehmann-Scheffe Theorem. Provides a uniformly minimum-variance unbiased estimator of θ.
Theorem A-4: Method for determining whether or not T(z) is complete as well as sufficient when p(z|θ) is an m-parameter exponential family.
Theorem A-5: Provides a uniformly minimum-variance unbiased estimator of vector θ.
References

AGUILERA, R., J. A. DEBREMAECKER, and S. HERNANDEZ. 1970. Design of recursive filters. Geophysics, Vol. 35, pp. 247-253.
ANDERSON, B. D. O., and J. B. MOORE. 1979. Optimal Filtering. Englewood Cliffs, NJ: Prentice-Hall.
AOKI, M. 1967. Optimization of Stochastic Systems: Topics in Discrete-Time Systems. NY: Academic Press.
ÅSTRÖM, K. J. 1968. Lectures on the identification problem: the least squares method. Rept. No. 6806, Lund Institute of Technology, Division of Automatic Control.
ATHANS, M. 1971. The role and use of the stochastic linear-quadratic-Gaussian problem in control system design. IEEE Trans. on Automatic Control, Vol. AC-16, pp. 529-552.
ATHANS, M., and P. L. FALB. 1965. Optimal Control: An Introduction to the Theory and Its Applications. NY: McGraw-Hill.
ATHANS, M., and F. SCHWEPPE. 1965. Gradient matrices and matrix calculations. MIT Lincoln Labs., Lexington, MA, Tech. Note 1965-53.
ATHANS, M., and E. TSE. 1967. A direct derivation of the optimal linear filter using the maximum principle. IEEE Trans. on Automatic Control, Vol. AC-12, pp. 690-698.
ATHANS, M., R. P. WISHNER, and A. BERTOLINI. 1968. Suboptimal state estimation for continuous-time nonlinear systems from discrete noisy measurements. IEEE Trans. on Automatic Control, Vol. AC-13, pp. 504-514.
BARANKIN, E. W., and M. KATZ, JR. 1959. Sufficient statistics of minimal dimension. Sankhya, Vol. 21, pp. 217-246.
BARANKIN, E. W. 1961. Application to exponential families of the solution to the minimal dimensionality problem for sufficient statistics. Bull. Inst. Internat. Stat., Vol. 38, pp. 141-150.
BARD, Y. 1970. Comparison of gradient methods for the solution of nonlinear parameter estimation problems. SIAM J. Numerical Analysis, Vol. 7, pp. 157-186.
BERKHOUT, A. J. 1977. Least-squares inverse filtering and wavelet deconvolution. Geophysics, Vol. 42, pp. 1369-1383.
BICKEL, P. J., and K. A. DOKSUM. 1977. Mathematical Statistics: Basic Ideas and Selected Topics. San Francisco: Holden-Day, Inc.
BIERMAN, G. J. 1973a. A comparison of discrete linear filtering algorithms. IEEE Trans. on Aerospace and Electronic Systems, Vol. AES-9, pp. 28-37.
BIERMAN, G. J. 1973b. Fixed-interval smoothing with discrete measurements. Int. J. Control, Vol. 18, pp. 65-75.
BIERMAN, G. J. 1977. Factorization Methods for Discrete Sequential Estimation. NY: Academic Press.
BRYSON, A. E., JR., and M. FRAZIER. 1963. Smoothing for linear and nonlinear dynamic systems. TDR 63-119, pp. 353-364, Aero. Sys. Div., Wright-Patterson Air Force Base, Ohio.
BRYSON, A. E., JR., and D. E. JOHANSEN. 1965. Linear filtering for time-varying systems using measurements containing colored noise. IEEE Trans. on Automatic Control, Vol. AC-10, pp. 4-10.
BRYSON, A. E., JR., and Y. C. HO. 1969. Applied Optimal Control. Waltham, MA: Blaisdell.
CHEN, C. T. 1970. Introduction to Linear System Theory. NY: Holt.
CHI, C. Y. 1983. Single-channel and multichannel deconvolution. Ph.D. dissertation, Univ. of Southern California, Los Angeles, CA.
CHI, C. Y., and J. M. MENDEL. 1984. Performance of minimum-variance deconvolution filter. IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-32, pp. 1145-1153.
CRAMER, H. 1946. Mathematical Methods of Statistics. Princeton, NJ: Princeton Univ. Press.
DAI, G-Z., and J. M. MENDEL. 1986. General problems of minimum-variance recursive waveshaping. IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-34.
DONGARRA, J. J., J. R. BUNCH, C. B. MOLER, and G. W. STEWART. 1979. LINPACK User's Guide. Philadelphia: SIAM.
DUDA, R. O., and P. E. HART. 1973. Pattern Classification and Scene Analysis. NY: Wiley Interscience.
EDWARDS, A. W. F. 1972. Likelihood. London: Cambridge Univ. Press.
FAURRE, P. L. 1976. Stochastic realization algorithms. In System Identification: Advances and Case Studies (eds., R. K. Mehra and D. G. Lainiotis), pp. 1-25. NY: Academic Press.
FERGUSON, T. S. 1967. Mathematical Statistics: A Decision Theoretic Approach. NY: Academic Press.
FRASER, D. 1967. Discussion of optimal fixed-point continuous linear smoothing (by J. S. Meditch). Proc. 1967 Joint Automatic Control Conf., p. 249, Univ. of PA, Philadelphia.

GOLDBERGER, A. S. 1964. Econometric Theory. NY: John Wiley.
GRAYBILL, F. A. 1961. An Introduction to Linear Statistical Models, Vol. 1. NY: McGraw-Hill.
GUPTA, N. K., and R. K. MEHRA. 1974. Computational aspects of maximum likelihood estimation and reduction of sensitivity function calculations. IEEE Trans. on Automatic Control, Vol. AC-19, pp. 774-783.
GURA, I. A., and A. B. BIERMAN. 1971. On computational efficiency of linear filtering algorithms. Automatica, Vol. 7, pp. 299-314.
HAMMING, R. W. 1983. Digital Filters, 2nd Edition. Englewood Cliffs, NJ: Prentice-Hall.
HO, Y. C. 1963. On the stochastic approximation and optimal filtering. J. of Math. Anal. and Appl., Vol. 6, pp. 152-154.
JAZWINSKI, A. H. 1970. Stochastic Processes and Filtering Theory. NY: Academic Press.
KAILATH, T. 1968. An innovations approach to least-squares estimation, Part 1: Linear filtering in additive white noise. IEEE Trans. on Automatic Control, Vol. AC-13, pp. 646-655.
KAILATH, T. 1980. Linear Systems. Englewood Cliffs, NJ: Prentice-Hall.
KALMAN, R. E. 1960. A new approach to linear filtering and prediction problems. Trans. ASME J. Basic Eng., Series D, Vol. 82, pp. 35-46.
KALMAN, R. E., and R. BUCY. 1961. New results in linear filtering and prediction theory. Trans. ASME, J. Basic Eng., Series D, Vol. 83, pp. 95-108.
KASHYAP, R. L., and A. R. RAO. 1976. Dynamic Stochastic Models from Empirical Data. NY: Academic Press.
KELLY, C. N., and B. D. O. ANDERSON. 1971. On the stability of fixed-lag smoothing algorithms. J. Franklin Inst., Vol. 291, pp. 271-281.
KMENTA, J. 1971. Elements of Econometrics. NY: MacMillan.
KOPP, R. E., and R. J. ORFORD. 1963. Linear regression applied to system identification for adaptive control systems. AIAA J., Vol. 1, p. 2300.
KUNG, S. Y. 1978. A new identification and model reduction algorithm via singular value decomposition. Paper presented at the 12th Annual Asilomar Conference on Circuits, Systems, and Computers, Pacific Grove, CA.
KWAKERNAAK, H., and R. SIVAN. 1972. Linear Optimal Control Systems. NY: Wiley-Interscience.
LAUB, A. J. 1979. A Schur method for solving algebraic Riccati equations. IEEE Trans. on Automatic Control, Vol. AC-24, pp. 913-921.
LEHMANN, E. L. 1959. Testing Statistical Hypotheses. NY: John Wiley.
LEHMANN, E. L. 1980. Theory of Point Estimation. NY: John Wiley.
LJUNG, L. 1976. Consistency of the least-squares identification method. IEEE Trans. on Automatic Control, Vol. AC-21, pp. 779-781.
LJUNG, L. 1979. Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems. IEEE Trans. on Automatic Control, Vol. AC-24, pp. 36-50.
MARQUARDT, D. W. 1963. An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Indust. Appl. Math., Vol. 11, pp. 431-441.
MCLOUGHLIN, D. B. 1980. Distributed systems notes. Proc. 1980 Pre-JACC Tutorial Workshop on Maximum-Likelihood Identification, San Francisco, CA.
MEDITCH, J. S. 1969. Stochastic Optimal Linear Estimation and Control. NY: McGraw-Hill.
MEHRA, R. K. 1970a. An algorithm to solve matrix equations PH' = G and P = ΦPΦ' + ΓΓ'. IEEE Trans. on Automatic Control, Vol. AC-15.
MEHRA, R. K. 1970b. On-line identification of linear dynamic systems with applications to Kalman filtering. Proc. Joint Automatic Control Conference, Atlanta, GA, pp. 373-382.
MEHRA, R. K. 1971. Identification of stochastic linear dynamic systems using Kalman filter representation. AIAA J., Vol. 9, pp. 28-31.
MEHRA, R. K., and J. S. TYLER. 1973. Case studies in aircraft parameter identification. Proc. 3rd IFAC Symposium on Identification and System Parameter Estimation, North Holland, Amsterdam.
MENDEL, J. M. 1971. Computational requirements for a discrete Kalman filter. IEEE Trans. on Automatic Control, Vol. AC-16, pp. 748-758.
MENDEL, J. M. 1973. Discrete Techniques of Parameter Estimation: the Equation Error Formulation. NY: Marcel Dekker.
MENDEL, J. M. 1975. Multi-stage least squares parameter estimators. IEEE Trans. on Automatic Control, Vol. AC-20, pp. 775-782.
MENDEL, J. M. 1981. Minimum-variance deconvolution. IEEE Trans. on Geoscience and Remote Sensing, Vol. GE-19, pp. 161-171.
MENDEL, J. M. 1983a. Minimum-variance and maximum-likelihood recursive waveshaping. IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-31, pp. 599-604.
MENDEL, J. M. 1983b. Optimal Seismic Deconvolution: an Estimation Based Approach. NY: Academic Press.
MENDEL, J. M., and D. L. GIESEKING. 1971. Bibliography on the linear-quadratic-gaussian problem. IEEE Trans. on Automatic Control, Vol. AC-16, pp. 847-869.
MORRISON, N. 1969. Introduction to Sequential Smoothing and Prediction. NY: McGraw-Hill.
NAHI, N. E. 1969. Estimation Theory and Applications. NY: John Wiley.
OPPENHEIM, A. V., and R. W. SCHAFER. 1975. Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
PAPOULIS, A. 1965. Probability, Random Variables, and Stochastic Processes. NY: McGraw-Hill.
PELED, A., and B. LIU. 1976. Digital Signal Processing: Theory, Design, and Implementation. NY: John Wiley.
RAUCH, H. E., F. TUNG, and C. T. STRIEBEL. 1965. Maximum-likelihood estimates of linear dynamical systems. AIAA J., Vol. 3, pp. 1445-1450.
SCHWEPPE, F. C. 1965. Evaluation of likelihood functions for gaussian signals. IEEE Trans. on Information Theory, Vol. IT-11, pp. 61-70.
SCHWEPPE, F. C. 1973. Uncertain Dynamic Systems. Englewood Cliffs, NJ: Prentice-Hall.
SHANKS, J. L. 1967. Recursion filters for digital processing. Geophysics, Vol. 32, pp. 33-51.
SORENSON, H. W. 1970. Least-squares estimation: from Gauss to Kalman. IEEE Spectrum, Vol. 7, pp. 63-68.
SORENSON, H. W. 1980. Parameter Estimation: Principles and Problems. NY: Marcel Dekker.
SORENSON, H. W., and J. E. SACKS. 1971. Recursive fading memory filtering. Information Sciences, Vol. 3, pp. 101-119.
STEFANI, R. T. 1967. Design and simulation of a high performance, digital, adaptive, normal acceleration control system using modern parameter estimation techniques. Rept. No. DAC-60637, Douglas Aircraft Co., Santa Monica, CA.
STEPNER, D. E., and R. K. MEHRA. 1973. Maximum likelihood identification and optimal input design for identifying aircraft stability and control derivatives. Ch. IV, NASA-CR-2200.
STEWART, G. W. 1973. Introduction to Matrix Computations. NY: Academic Press.
TREITEL, S. 1970. Principles of digital multichannel filtering. Geophysics, Vol. 35, pp. 785-811.
TREITEL, S., and E. A. ROBINSON. 1966. The design of high-resolution digital filters. IEEE Trans. on Geoscience and Electronics, Vol. GE-4, pp. 25-38.
TUCKER, H. G. 1962. An Introduction to Probability and Mathematical Statistics. NY: Academic Press.
TUCKER, H. G. 1967. A Graduate Course in Probability. NY: Academic Press.
VAN TREES, H. L. 1968. Detection, Estimation and Modulation Theory, Vol. I. NY: John Wiley.
ZACKS, S. 1971. The Theory of Statistical Inference. NY: John Wiley.