
Notes on Estimation Theory

P. Dewilde and G. Leus

September 25, 2007

1 Stochastic Variable

Stochastic (or random) variables are (mathematical) functions defined on the outcome of a random
experiment. The experiments considered belong to a given class (e.g. dice throwing). The variable
takes a real or complex value for each outcome, often called an event. For example, if you are
throwing a die and you win two points every time the outcome is even, while losing a point when
the outcome is odd, then the corresponding stochastic variable takes the value ’2’ for an experiment
with even outcome and ’-1’ otherwise. Suppose that every outcome of the die throwing is equally
likely; then we assign to these outcomes a probability p of 1/6 (so that the overall probability of
any event happening is 1), and the average for our even/odd stochastic variable becomes
-1/6 + 2/6 - 1/6 + 2/6 - 1/6 + 2/6 = 1/2. You would expect that when you do many such
experiments, the overall outcome will approximate 1/2. We shall make this fact, which is known
as the law of large numbers, a little more precise soon.
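As a quick illustration (a minimal Python sketch, not part of the original notes), one can simulate this die experiment and watch the sample average approach 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n die throws; the stochastic variable maps even outcomes to +2
# and odd outcomes to -1.
n = 100_000
throws = rng.integers(1, 7, size=n)
x = np.where(throws % 2 == 0, 2.0, -1.0)

print(x.mean())  # approaches the mean 1/2 as n grows
```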

More formally, we say that the experiments belong to a ’probability space’ which is often denoted
by the symbol Ω, while a single event is denoted as ω ∈ Ω. Ω must be viewed as a set, and ω as an
element of the set. If the set Ω has a finite number of elements (such as our dice throwing), then
one can assign a probability p(ω) to each event. A stochastic variable X is a function X : Ω → R,
where R stands for the set of real numbers (it could also be the set of complex numbers); for each
event ω it takes a real (or complex) value X(ω). The mean of the stochastic variable X living on
the space Ω is defined as the expectation of the variable:

mX = E(X) = Σ_{ω∈Ω} p(ω)X(ω). (1)

If X is a stochastic variable, then any real or complex function of X is itself a stochastic variable.
One important such function is |X|² (allowing for complex variables). The variance of X is then
defined as the expectation of the squared distance of the variable from its mean:

varX = E(|X − E(X)|²) = E(|X|²) − |E(X)|². (2)
This quantity is always nonnegative (as is obvious from the first form of the relation given). This
fact can also be seen as a special case of an inequality known as ’Jensen’s inequality’, which states
that if h(x) is a convex function of a real variable then it always holds that

h[E(X)] ≤ E[h(X)], (3)

or in words: ’h of the average is always smaller than or equal to the average of the h’s’. In the
present case we take h(x) = x². (A word on notation: we adhere to the custom of indicating a
stochastic variable with a capital letter such as A or X. Specific instances of it will be indicated
by the corresponding lower case letter, e.g. x or a.)
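As a small numerical check (a hypothetical sketch, not part of the notes) of (3) for the die variable above with h(x) = x²: h(E(X)) = (1/2)² = 1/4, while E(h(X)) = (1 + 4)/2 = 5/2.

```python
import numpy as np

p = np.array([0.5, 0.5])   # probabilities of the two outcomes
x = np.array([-1.0, 2.0])  # values the die variable takes

h = lambda v: v**2         # a convex function

print(h(np.dot(p, x)))     # h(E(X)) = 0.25
print(np.dot(p, h(x)))     # E(h(X)) = 2.5 >= 0.25, as Jensen predicts
```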

In many cases the underlying stochastic space Ω (the space of all events) is not explicitly given
or known. That does not prevent one from working with stochastic variables and assigning
probability distributions to the values they can take. E.g. the die variable mentioned above takes
the value 2 with probability 1/2 and the value −1 also with probability 1/2. This also allows one
to work with stochastic variables defined on infinite spaces. E.g. let X be a stochastic variable
that can take any real value when an event occurs. One may characterize its probability
distribution through its cumulative probability distribution function FX (x), defined as

FX (x) = p(X ≤ x). (4)

It is the probability that the outcome of an event produces a value for X that is less than or equal
to x. The total probability must of course sum up to 1, hence lim_{x→∞} FX(x) = 1. Often FX(x) is
differentiable. In that case the derivative is known as the probability density function and denoted
as
fX(x) = dFX(x)/dx.
The fact that the total mass is one translates to
∫_{−∞}^{∞} fX(x) dx = 1 (5)

and FX can of course be recovered from fX as


FX(x) = ∫_{−∞}^{x} fX(ξ) dξ. (6)

Probably the most common probability density function is the Gaussian. A scalar Gaussian density
with mean µ and variance σ² is given by

fX(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}. (7)
It has the remarkable property that the distribution of a sum of a large collection of small,
independent stochastic variables with arbitrary distributions tends to a Gaussian (the central limit
theorem).
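This is easy to observe numerically. The following sketch (illustrative; the uniform distribution and the sizes are arbitrary choices, not from the notes) compares the empirical density of a sum of independent uniform variables with the Gaussian density (7):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sum k small independent uniform variables on [-0.5, 0.5]; by the central
# limit theorem the sum is approximately N(0, k/12).
k, trials = 30, 200_000
sums = rng.uniform(-0.5, 0.5, size=(trials, k)).sum(axis=1)

mu, sigma = 0.0, np.sqrt(k / 12.0)
gauss = lambda t: np.exp(-(t - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Compare the empirical density with the Gaussian density at the bin centers.
hist, edges = np.histogram(sums, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - gauss(centers))))  # small for large k and trials
```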

In the exposition above, we have taken a ’probabilistic viewpoint’. Another, very practical viewpoint
that leads to the same general results is known as the ’Bayesian approach’. In the Bayesian approach
we do not assume that the set Ω under consideration stands for a collection of random events that
will happen due to some unexplained random mechanism (such as throwing a die), but that it
stands for a collection of possible events, whose likelihood of occurrence (known as the ’belief’) is
given by a function p(ω) (that plays the same role as the probability distribution in the stochastic
context). The belief only reflects the current knowledge concerning the likelihood of an event. For
example, suppose you are talking about meeting a person belonging to a certain community, and
that as far as you are concerned there is no reason to believe that you would meet one rather than
another; then you would assign the same ’belief’ to each person - and of course the assignment
must be such that it sums up to one, just like a probability. It may be that the person you will
meet has been fixed for a long time and you simply do not know who it will be, so the assignment
only reflects the current state of your knowledge. Calculating with beliefs goes exactly the same
way as calculating with probabilities - only the philosophical flavor is different. Henceforth we
adopt a Bayesian viewpoint, but utilize a probabilistic terminology.

Given a probabilistic (or Bayesian) environment, one often has more than one stochastic variable
at one’s disposal, and questions arise about what information the value of one variable may provide
about another. Let us look e.g. at variables A and B taking Boolean values: on one subset of Ω a
certain proposition is considered true, which is reflected by assigning A = 1, while in the
complementary subset A = 0. Similarly for B, but with different subsets. Our space Ω now looks
as follows:

Figure 1: Subdivision of Ω into subsets for which certain propositions (a, b, ā, b̄) are true

There are four domains, which we may indicate as {ab, āb, ab̄, āb̄}, where we write a where A = 1,
ā where A = 0, etc. Each of these domains has a probability; this is the joint probability
distribution for the joint variables A and B, indicated as p(A, B) for the domain ab, p(A, B̄) for the
domain ab̄, etc. It can be represented as the surface of the corresponding domain, assuming the
total surface to be 1. The conditional distribution written as p(A|B) is the collection of distributions
when the value of B is known. To be precise, when B = 1, p(A|B = 1) takes two values: when
A = 1 it equals p(A, B)/(p(A, B) + p(Ā, B)), and when A = 0 it equals p(Ā, B)/(p(A, B) + p(Ā, B))
- it is now a function of A alone, as it relates only to the regions for which B = 1, and a
’re-normalization’ is called for. Likewise, of course, when B = 0. This leads to the famous Bayes rule:

p(A, B) = p(A|B)p(B) (8)

in which one has to plug in the correct distribution p(A|B) depending on the value of B. In many
cases, p(A, B) is not known, but conditional probabilities are. The classical Bayesian case goes
as follows. Suppose we are interested in studying the causes of a certain effect E - say having a
headache - and we have discovered that there are two possibilities: the flu and brain cancer (assume
them to be disjoint: it is either one or the other, for simplicity). Somebody has a headache; what is
the likelihood of cancer? Let us indicate the cause by the variable C, where C can take two values:
f for flu and b for brain cancer. It is very hard to relate the headache to the probable cause; all
the medical studies go the other way: they describe specific illnesses and what effects they have,
or to put it differently, they document the effects given the possible causes. So we can say that we
have the information p(E|C) at our disposal while we are really interested in p(C|E) - to know
what the ’chance’ is that we have a specific illness given a certain effect. Bayes’ rule tells us the
following:
p(C|E) = p(E|C)p(C)/p(E) (9)
p(C|E) is known as the posterior probability given the effect E, p(C) is the prior probability of a
cause (irrespective of any effect observation), and p(E|C) is known as the likelihood function; it
describes how likely an effect would be given the cause. The denominator is a normalizing factor
to ensure that the p(C|E) sum up to one. Both the likelihood function and the a priori probability
are extremely important data; many mistakes in interpreting statistical data can be traced to a
wrong judgement of one or the other. An attractive example is provided by the following story.

In sport circles people are very interested in combating the use of drugs. The means they use for
that is to punish very harshly sportspeople who are caught having used stimulating drugs. A good
estimate of the use of drugs in the population indicates that about 1% of the sporting population
indeed resorts to the use of drugs. Unfortunately, the test on drug abuse produces errors as well
- there is a chance of a faulty diagnosis. This chance is small and estimated at 1% as well. A
sportsman is caught with a positive test. What is the probability that he is indeed guilty? The
accuser says it is 99/100, because the chance that the test is wrong is a mere 1%. The sportsman
defends himself stating that the chance that he is guilty is just 1/100, because the likelihood that
he is a member of the non-drug population is so much larger; after all, the non-drug population
is 100 times larger than the drug-using population. A Bayesian statistician comes by and solves
the problem stating that, given the data, the chance of guilt is precisely 1/2. How does he arrive
at this result (see the sketch below)? And how large would the chance be if there were a second,
independently taken positive test for the same sportsman with the same probability of error? And
what if the various probabilities (beliefs) differ, what is then the result?
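The statistician’s reasoning is a direct application of Bayes’ rule (9); here is a minimal sketch of the computation (the function name and structure are our own illustration):

```python
def posterior_guilt(prior_drug, error_rate, positives):
    """Posterior probability of drug use after a number of independent
    positive tests, computed with Bayes' rule (9)."""
    # Likelihoods of the observed positives under each cause.
    like_drug = (1 - error_rate) ** positives   # true positives
    like_clean = error_rate ** positives        # false positives
    num = like_drug * prior_drug
    return num / (num + like_clean * (1 - prior_drug))

print(posterior_guilt(0.01, 0.01, 1))  # 0.5: both sides of the dispute are wrong
print(posterior_guilt(0.01, 0.01, 2))  # 0.99 after a second independent positive
```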

Bayesian calculus is nothing but calculus on the sizes of subsets of a given base set; as long as large
numbers are involved and the available information is carefully charted, it gives precise estimates
of the sizes of the subsets for which certain properties are true. As probability theory can also
be interpreted as a calculus on subsets of the probability space, it is not surprising that the two
theories actually coincide for all practical purposes!

The (weak) law of large numbers

Let X1 , X2 , · · · , Xn be a sequence of stochastic variables with the properties that (1) they are
independent and (2) they all have the same mean µ and the same variance σ² (for example: they
all have the same probability distribution - such variables are called ’iid’ = independent, identically
distributed). Let

X̄n = (1/n) Σ_{i=1}^{n} Xi (10)

be their average value - known as the ’sample mean’ (in contrast to the mean which is the average
over all possible events). Then it is true that

p(|X̄n − µ| > ε) ≤ σ²/(nε²), (11)

and, in particular, the probability p(|X̄n − µ| > ε) → 0 as n → ∞.

Proof

The proof goes via a sequence of (fairly weak) inequalities which are interesting in their own right.
We state them as lemmas.

Lemma 1 (Markov’s inequality) Let X be a positive but otherwise arbitrary stochastic variable
and ε any positive number. Then

p(X ≥ ε) ≤ E(X)/ε (12)

The proof of the lemma is by evaluation of

E(X) ≥ ∫_{ε}^{∞} x dFX(x) ≥ ε p(X ≥ ε), (13)

in which FX(x) is the probability distribution of X, which has mass only for positive values of x.

Lemma 2 (Chebyshev’s inequality) Let Y be a stochastic variable with mean µ and variance
σ²; then for any positive ε it is true that

p(|Y − µ| > ε) ≤ σ²/ε² (14)

The proof is by applying Markov’s inequality to the positive stochastic variable |Y − µ|² (with ε replaced by ε²).

The law of large numbers is now a direct consequence of this last lemma. It is not hard to see that
for the iid variables the following is true:

• mX̄n = E(X̄n) = µ,

• varX̄n = E(|X̄n − mX̄n|²) = σ²/n,

and the property follows by applying the Chebyshev inequality to X̄n. Note that the variance of
the sample average decreases only linearly in n: the sum of the n individual variances, divided by
n², leaves σ²/n.
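As an illustration (a hypothetical numerical sketch, with an arbitrarily chosen uniform distribution), one can compare the empirical probability in (11) with the Chebyshev bound:

```python
import numpy as np

rng = np.random.default_rng(2)

# iid uniform samples on [0, 1]: mu = 0.5, sigma^2 = 1/12.
n, trials, eps = 50, 100_000, 0.05
samples = rng.uniform(0.0, 1.0, size=(trials, n))
xbar = samples.mean(axis=1)

empirical = np.mean(np.abs(xbar - 0.5) > eps)
bound = (1.0 / 12.0) / (n * eps**2)
print(empirical, bound)  # the empirical probability stays below the bound (11)
```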

2 Problem statement

To begin with, we assume here that the parameter θ to be estimated is scalar, and that the
observations follow the linear model X = aθ + N treated below. The noise space can be interpreted
as consisting of ’physical noise’ generated, say, in a transmission system, or else as the disturbance
caused by other (non-stochastic, but unknown) sources. To achieve good estimation we depend on
a model that describes in generic terms the relation between θ, N and X. The linear model we
treat here is the most common one, and it can be applied in many telecommunication situations.
The model should, of course, reflect the actual situation one has to handle. Here we assume that
a is an n-dimensional, known (constant) vector, while N = [N1 N2 · · · Nn]ᵀ is assumed to be a
stochastic vector; it consists of n stochastic variables Nk, all defined on the same stochastic base
space (this is indicated by qualifying the expectation operator EN with the index N).

The covariance is an n × n matrix whose (i, j)th element is given by (the overline indicates complex
conjugation)

[covN]i,j = EN[(Ni − EN(Ni))(N̄j − EN(N̄j))] (15)

Note the vector/matrix notation we are using: we write EN(X) for the vector whose k-th component
is EN(Xk), and use the ’Hermitian transpose’ for vectors and matrices, defined (for a matrix A) as

[Aᴴ]i,j = [Ā]j,i. (16)
When one works with real quantities, one may replace the ’Hermitian conjugate’ with the ’transpose’
and use the superscript ’T’ instead of ’H’, as is done in the sheets. For generality, we keep to
the superscript ’H’. The definition of the covariance then consists of the expectation of an ’outer
product’, i.e. a product of a column vector with a row vector, of the type

[∗ ∗ · · · ∗]ᴴ [∗ ∗ · · · ∗]. (17)

Although an outer product produces a rank one matrix, taking the expectation amounts to taking
the average of a large number of such products and will usually result in a full rank matrix. A
covariance matrix C will always be positive semi-definite, meaning that for any n-dimensional
vector y it will be true that

yᴴCy ≥ 0 (18)

We say that it is strictly positive definite iff yᴴCy = 0 ⇒ y = 0. This will be the case iff no
non-zero linear combination of the underlying stochastic variables is deterministic (i.e. ’has
variance zero’, or equivalently, takes a fixed value with probability one).
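As a numerical aside (a sketch with made-up data, not part of the notes), an empirical covariance matrix built as an average of outer products is positive semi-definite, in agreement with (18):

```python
import numpy as np

rng = np.random.default_rng(3)

# n = 4 correlated stochastic variables observed over many trials.
n, trials = 4, 10_000
samples = rng.standard_normal((trials, n)) @ rng.standard_normal((n, n))

# Empirical covariance: average of outer products of the centered samples.
centered = samples - samples.mean(axis=0)
C = (centered.conj().T @ centered) / trials

# All eigenvalues are (numerically) nonnegative, confirming (18).
print(np.linalg.eigvalsh(C).min() >= -1e-10)
```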

Most stochastic variables (in fact all stochastic variables we consider) have finite means and
covariances. We say that a vector N of stochastic variables with mean µ (each component Nk has
mean µk) and covariance C is Gaussian if its probability distribution is given by

fN(n) = (1/((2π)^{n/2} det(C)^{1/2})) e^{−(1/2)(n−µ)ᴴC⁻¹(n−µ)} (19)

A distribution with such a density is indicated by N(µ, C), as it depends on those parameters. It
is a probability distribution with very special properties; we refer to treatises on probability theory
for them. (It is not too hard to show that C in the formula is indeed the covariance matrix - we
have assumed it to be non-singular; the singular case can be expressed as well, but we avoid this
complication here.)

We are now going to derive a number of possible estimation strategies for the unknown parameter
θ, assuming the linear model X = aθ + N. The strategies differ according to the assumptions
made. Either we take the viewpoint that θ is just an algebraically unknown quantity and derive
an equation that it should satisfy to achieve a desired result, namely the optimization of a relevant
criterion, or we look at θ as a realization of a stochastic variable Θ over a stochastic space with
certain characteristics and again optimize a relevant criterion (the second approach is more
general than the first, which may be considered a special limiting case of it). In the first category
(deterministic θ), we distinguish:

1. the MVU Estimator: derive an expression for unbiased estimation and minimize the variance
of the error;

2. the BLUE: choose the best possible linear model for the estimator (unbiased and with minimal
covariance);

3. the MLE: here we define a new characteristic called the ’likelihood’ and optimize on it;

4. the LSE: given the linear model we minimize the square error (a potentially simpler function
than the likelihood).

The distinction between these estimators may seem subtle, but the differences are interesting. The
following facts are true (a numerical comparison is sketched after this list):

• the MVU is presumably the best possible deterministic estimator, but it will in general be
non-linear, depending on the probability distribution of N. In case the probability distribution
for N (in the linear model) is zero mean and Gaussian, the MVU turns out to be linear as
well and given by

Θ̂ = (aᴴC⁻¹a)⁻¹aᴴC⁻¹X. (20)

We refer to the literature for a derivation of this result. We have assumed that C is non-
singular, i.e. the noise N does not contain a deterministic linear combination, hence the
inverse exists.

• instead of looking directly for the MVU, we can restrict the class of estimators under
consideration to linear ones. So, we assume Θ̂ = bᴴX - i.e. a linear combination of the
stochastic variables Xk given by a vector b to be determined. It turns out that in this case
b is given by

b = C⁻¹a(aᴴC⁻¹a)⁻¹ (21)

no matter what the original distribution may be - only the covariance (and of course the
mean) play a role. We now see that in the Gaussian case the MVU and the BLUE coincide.
This is a satisfactory result, characteristic of the Gaussian case.
• Another interesting viewpoint is ’maximum likelihood’. It states that the best possible
estimate θ̂ is such that the likelihood fN(x − aθ̂) is maximized over all fN(x − aθ), often
written as

θ̂ = arg max_θ fN(x − aθ). (22)

Just as with the MVU, the ML estimator will be non-linear in general and dependent on
the distribution. In the Gaussian case, the general estimation procedure is easy and leads,
remarkably enough, to the same expression as for the BLUE and MVU. However, for other
distributions, all these expressions will be different.
• A more down to earth approach would consider the least squares error computed directly on
the samples. In the most naive version it just attempts to find the θ that minimizes the error
‖x − aθ‖². In that case, the LSE is equivalent to the BLUE when the noise is white, and to
the MVU and the ML when the noise is Gaussian and white. In a slightly more sophisticated
version, the LSE can be computed using a weight given by the inverse of the covariance C:

‖x − aθ‖²_{C⁻¹} = (x − aθ)ᴴC⁻¹(x − aθ) (23)

If done so, the LSE is equal to the BLUE, and in the Gaussian case to the MVU and the ML
as well.
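The following sketch (illustrative, with randomly generated a and C; none of the numbers come from the notes) implements formula (20)/(21) and compares it with the naive unweighted least squares estimate:

```python
import numpy as np

rng = np.random.default_rng(4)

n, theta_true = 8, 3.0
a = rng.standard_normal(n)

# Colored Gaussian noise with a strictly positive definite covariance C.
B = rng.standard_normal((n, n)) * 0.3
C = B @ B.T + 0.1 * np.eye(n)

trials = 20_000
noise = rng.multivariate_normal(np.zeros(n), C, size=trials)
x = theta_true * a + noise            # each row is one observation vector

Cinv = np.linalg.inv(C)
kappa = 1.0 / (a @ Cinv @ a)

blue = kappa * (x @ Cinv @ a)         # (a^H C^-1 a)^-1 a^H C^-1 x, per (20)
ls = (x @ a) / (a @ a)                # naive unweighted least squares

print(blue.mean(), ls.mean())         # both unbiased, close to theta_true
print(blue.var(), ls.var(), kappa)    # BLUE variance matches (a^H C^-1 a)^-1
```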

If zero mean stochastic variables are handled and only their ’second order statistics’ count (i.e. their
covariances), then a Euclidean vector space structure can be defined on those variables. Suppose
we are handling n stochastic variables Yk; then we can construct a space Y consisting of all possible
linear combinations with complex coefficients αk:

Y = {U : U = Σ_{k=1}^{n} αk Yk} (24)

If the vectors {Yk } are second order independent (i.e. if their covariance matrix is non-singular),
then Y is a vector space of dimension n, meaning that one can perform vector space operations on
its components, namely adding two of them and multiplying one with a scalar. One can go a step
further and define an inner product on elements of Y. Let U1, U2 ∈ Y; then

(U1, U2) = E(U1Ū2). (25)

The original base vectors Yk may be considered to form a basis for Y. The Gramian of the basis is
defined as the matrix G for which [G]i,j = (Yi, Yj). Using the definition of the inner product, we
find

G = E([Y1, Y2, · · · , Yn]ᵀ [Ȳ1, Ȳ2, · · · , Ȳn]) = covY (26)

(a column of the Yk times a row of their conjugates), and we have the interesting result that the
Gramian of the basis is actually nothing else than its covariance.
Linear least squares estimation is particularly easy to do in such an inner product space
(sometimes called a ’Euclidean space’, and when it is infinite dimensional a ’Hilbert space’ - modulo
some extra continuity conditions). In particular, the linear least squares estimator will be such
that the estimation error is orthogonal to the data. This property is the basis of some of the
basic filtering methods (Wiener and Kalman filtering). We shall discuss some examples in other
sections.
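A brief numerical illustration (a sketch under assumed jointly distributed data) of this orthogonality principle: the error of the linear least squares estimate is orthogonal, in the inner product (25), to every variable it was built from.

```python
import numpy as np

rng = np.random.default_rng(5)

# Zero-mean data vector Y (n variables) and a target variable T, jointly sampled.
n, trials = 3, 200_000
A = rng.standard_normal((n + 1, n + 1))
z = rng.standard_normal((trials, n + 1)) @ A.T
Y, T = z[:, :n], z[:, n]

# Linear LSE of T from Y: solve the normal equations G alpha = E(Y T).
G = (Y.T @ Y) / trials           # empirical Gramian = covariance, per (26)
r = (Y.T @ T) / trials
alpha = np.linalg.solve(G, r)

err = T - Y @ alpha
print((Y.T @ err) / trials)      # ~0: the error is orthogonal to the data
```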

A note on the optimization procedure

The optimization procedures considered so far are of the type: minimize bᴴCb, with C strictly
positive definite, under the (linear) constraint bᴴa = 1. Such an optimization problem can easily
be solved directly, without recourse to Lagrange multipliers, as follows. We start out by remarking
that all b that satisfy the linear constraint have the form

b = κa1 + b1 (27)

in which κ = (aᴴC⁻¹a)⁻¹, a1 = C⁻¹a and b1 is C-orthogonal to a1 (i.e. a1ᴴCb1 = 0) but
otherwise arbitrary. The proof of this statement is by writing any vector b as the sum of a vector
in the direction a1 and a vector C-orthogonal to a1: b = κa1 + b1, and introducing the condition
aᴴb = a1ᴴCb = 1 (note that a1ᴴ = aᴴC⁻¹), which leads to the necessary and sufficient condition
κ = (a1ᴴCa1)⁻¹ = (aᴴC⁻¹a)⁻¹. Next, one remarks that

bᴴCb = (κa1 + b1)ᴴC(κa1 + b1) = |κ|²a1ᴴCa1 + b1ᴴCb1. (28)

Because C is positive definite, the smallest possible value of b1ᴴCb1 is attained at b1 = 0, and this
minimizer is unique, since C is strictly positive definite. Hence the optimal b is given by

b = κa1 = (aᴴC⁻¹a)⁻¹C⁻¹a (29)

and the minimal value is

(bᴴCb)min = (aᴴC⁻¹a)⁻¹. (30)
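A quick numerical check of (29) and (30) (an illustrative sketch with random data; not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(6)

n = 5
B = rng.standard_normal((n, n))
C = B @ B.T + 0.5 * np.eye(n)    # strictly positive definite
a = rng.standard_normal(n)

Cinv = np.linalg.inv(C)
kappa = 1.0 / (a @ Cinv @ a)
b_opt = kappa * (Cinv @ a)       # closed form (29)

print(b_opt @ a)                 # 1.0: the constraint b^H a = 1 holds
print(b_opt @ C @ b_opt, kappa)  # minimal value equals (a^H C^-1 a)^-1, per (30)

# Any perturbation that stays on the constraint set can only increase b^H C b.
d = rng.standard_normal(n)
d -= (d @ a) / (a @ a) * a       # project so that d^H a = 0
b_other = b_opt + 0.1 * d
print(b_other @ C @ b_other >= b_opt @ C @ b_opt)  # True
```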

A note on the Cramér-Rao lower bound

We follow the treatment given in Cover-Thomas, Elements of Information Theory, Wiley 1991.
The problem is the following: suppose we have conducted a number of experiments with the aim
of determining the mean of a stochastic variable X of which it is known that it has a Gaussian
distribution N(θ, 1), but in which θ is not known - our aim is to determine θ with the best possible
estimator. We have made n independent experiments, all for the same random variable, with
results X = {X1, X2, · · · , Xn}. It turns out that in this case the best possible estimator is actually
the average X̄n = (1/n) Σ_{i=1}^{n} Xi. It is not hard to compute the variance of this estimator (see
above), but is the estimator indeed optimal? The Cramér-Rao bound gives a lower bound on the
variance of an estimator, whatever the distribution may be (assuming it to exist). So, if an estimator
achieves the bound, it is indeed optimal (and some argument can also be given that there is no
other optimal estimator). Although the Cramér-Rao bound is not hard to prove, it is not that easy
to interpret - it seems a very abstract result, yet it has a very concrete interpretation. It turns out
that it is extremely useful and often used - hence worthy of study! Before we embark on the
presentation of the theorem, first a couple of definitions.

We assume that we have at our disposal an indexed family of probability densities f(x; θ) in which
θ is an unknown parameter (we do not assume θ stochastic here, as we do in the Bayesian case; it
is just a fixed parameter to be estimated). Thus: f(x; θ) ≥ 0 and ∫_{−∞}^{∞} f(x; θ)dx = 1. An
estimator Θ̂ for θ is a function from the sample set X to the set of possible θ’s. We assume that
connected to any θ there is a density function f(x; θ) and hence an expectation operator
E(·) = ∫ · f(x; θ)dx (θ here is just an index!). We shall say that the estimator is unbiased if for any
sample set E(Θ̂) = θ (otherwise the bias would be E(Θ̂ − θ)). Finally, an important quantity that
plays a central role in the Cramér-Rao theory is the Fisher information. It is defined as follows:

Definition 1 The Fisher information J(θ) of the given parametrized set of distributions is given
by

J(θ) = E[(∂/∂θ ln f(x; θ))²]. (31)

Here is now the theorem:

Theorem 1 (Cramér-Rao) The variance of the error of any unbiased estimator Θ̂(X) of the
parameter θ cannot be lower than the reciprocal of the Fisher information; more specifically, for
any such estimator it is true that

varΘ̂ ≥ 1/J(θ). (32)

Proof


f (X;θ)
Let V = ∂θf (X;θ) (notice: a stochastic variable). Then it is easy to see that E(V ) = 0 and
varV = J(θ). Next an invocation of the famous Cauchy-Schwartz lemma (which states that for any
inner product space it holds in general that |(x, y)|2 ≤ kxk2 kyk2 ) gives
|E[(V − E(V ))(Θ̂ − E(Θ̂))]|2 ≤ E[|V − E(V )|2 ]E[|Θ̂ − E(Θ̂)|2 ]. (33)
Using E(V ) = 0 and varV = J(θ) the inequality translates to
|E(V Θ̂)|2 ≤ J(θ)varΘ̂ . (34)
It remains to evaluate E(V Θ̂). We obtain in sequence

E(V Θ̂) = ∫ [(∂/∂θ f(x; θ))/f(x; θ)] θ̂(x) f(x; θ) dx
        = ∫ (∂/∂θ f(x; θ)) θ̂(x) dx
        = ∂/∂θ ∫ f(x; θ) θ̂(x) dx (35)
        = ∂/∂θ E(Θ̂)
        = ∂/∂θ θ
        = 1.

The inequality follows, and that proves the theorem. □

It turns out that it is not necessarily that bad to have a biased estimator (in fact, in a number of
cases a biased estimator can be better than an unbiased one; see the chapter on non-parametric
estimation). A modified Cramér-Rao lower bound for biased estimators, proved in a similar way
as for the unbiased case, is as follows. Let b(θ) = E(Θ̂) − θ and b′(θ) the derivative of b(θ) with
respect to θ.

Theorem 2 Any (biased or unbiased) estimator has a mean square error satisfying the lower bound

E[(Θ̂ − θ)²] ≥ [1 + b′(θ)]²/J(θ) + b²(θ) (36)

In the case of n i.i.d. random samples with Gaussian distribution N(θ, σ²) one has

J(θ) = n/σ². (37)

If Θ̂ = X̄n = (1/n) Σ_{i=1}^{n} Xi, it is easy to compute the variance of X̄n, because the variance of
each Xi is known and, thanks to the independence of the samples, the variance of the sum follows
from the individual variances: we have

E|X̄n − θ|² = E|(1/n) Σ_{i=1}^{n} (Xi − θ)|² = (1/n²) Σ_{i=1}^{n} E|Xi − θ|² = nσ²/n² = σ²/n (38)

where dropping the cross terms inside the expectation is justified by independence - e.g. if X
and Y are independent, zero mean stochastic variables, then E(X + Y)² = E(X² + 2XY + Y²) =
EX² + EY². The variance of X̄n is thus precisely equal to 1/J(θ). Hence the sample mean, which is
also the ML estimator, is optimal in the Gaussian i.i.d. case. The usefulness of the Cramér-Rao
lower bound has been much debated. Although it is valid for any probability distribution, it only
evaluates second order behavior and hence appears to be of limited use for distributions that are
very skew, where the variance is a poor measure of quality. In a telecommunication setting, however,
it is usually a very useful indication of quality, as many disturbances can be accounted for as
(almost) Gaussian noise.
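A Monte Carlo sketch (with arbitrarily chosen θ, σ and n) confirming that the sample mean attains the bound 1/J(θ) = σ²/n of (37) and (38):

```python
import numpy as np

rng = np.random.default_rng(7)

theta, sigma, n, trials = 2.0, 1.5, 25, 100_000

# Each row holds n i.i.d. N(theta, sigma^2) samples; estimate theta by the mean.
samples = theta + sigma * rng.standard_normal((trials, n))
estimates = samples.mean(axis=1)

cramer_rao = sigma**2 / n           # 1/J(theta), per (37)
print(estimates.var(), cramer_rao)  # empirical variance matches the bound
```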

3 The Bayesian Philosophy

In the Bayesian philosophy we take the viewpoint that the parameter Θ is a stochastic variable,
which has to be estimated statistically. We may then assume that we have some statistical knowl-
edge about Θ. In the previous section we assumed that θ is a fixed but otherwise unknown pa-
rameter. This would amount to assuming its statistics to be restricted to a single value with zero
variance. We now relax this condition and assume some distribution on Θ. Alternatively, following
the Bayesian viewpoint, we can take the data uncertainty on Θ into account by positing a dis-
tribution for it (see our discussion of the Bayesian ’belief’ approach earlier). We now dispose of
two (unrelated) distributions, one for Θ which we write as fΘ (θ) and one for N which we write as

11
fN (n), adopting common usage. If the two stochastic spaces are indeed unrelated, then they can be
joined together to form a common space of all possible combinations, with distribution fN (n)fΘ (θ).
If they are not unrelated, the join can still be done, but then more information would have to be
provided on the common distribution function fN,Θ (x, θ), there is no reason why this case should
be considered).

We consider two cases:

1. The MMSE, which minimizes the mean square error of the estimation for any density fN,Θ(n, θ)
and produces the general result

Θ̂ = EΘ|X{Θ|X} (39)

thanks to usage of Bayes’ rule. The problem with this very general expression is the lack of
knowledge one normally has of the density fΘ|X(θ|x) - known as the posterior probability
density function for Θ. True to the Bayesian philosophy, we revert to prior information (all
densities relate to the probability space on which the joint variables N and Θ are defined):

f(θ|x) = f(x|θ)f(θ)/f(x) (40)

Assuming the linear model (X = aΘ + N), independence between Θ and N and a Gaussian
model N(0, σΘ²) for Θ (as well as a Gaussian model for N as before), we obtain, very much
as before,

Θ̂ = σΘ²aᴴ(σΘ²aaᴴ + C)⁻¹X (41)
This expression can be simplified, invoking the ’matrix inversion lemma’ which can be stated
as follows:

Lemma 3 Let A be a square n × n matrix, and let a and b be n-dimensional vectors. Assume A
invertible. Then A + baᴴ will be invertible iff 1 + aᴴA⁻¹b is non-zero, and the inverse is

(A + baᴴ)⁻¹ = A⁻¹ − A⁻¹b(1 + aᴴA⁻¹b)⁻¹aᴴA⁻¹. (42)

The proof is by direct verification. The importance of the lemma is due to the simplification
of the update for A⁻¹: it involves just a scalar inverse, provided A⁻¹ is known. An important
remark that is relevant here is that if A is strictly positive definite, then A⁻¹ certainly exists,
and for any vector a, A + aaᴴ is even more positive definite, so that its inverse certainly exists
as well. (A numerical sketch of the MMSE estimator and this lemma follows after this list.)

2. Alternatively, if one does not know the statistics for Θ, one may constrain the estimator to
be linear and ask for a minimal variance of the error. Just as in the previous case, this
automatically leads to the same expression (41). Again, in the Gaussian case the optimal and
the linear estimator coincide.
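To close, a sketch (with randomly generated data; the setup is our own illustration) of the Bayesian linear MMSE estimator (41), together with a direct check of the matrix inversion lemma (42):

```python
import numpy as np

rng = np.random.default_rng(8)

n, sigma_theta = 6, 2.0
a = rng.standard_normal(n)
B = rng.standard_normal((n, n)) * 0.4
C = B @ B.T + 0.2 * np.eye(n)      # noise covariance

# Check the matrix inversion lemma (42) for A = C and b = sigma_theta^2 * a.
b = sigma_theta**2 * a
Cinv = np.linalg.inv(C)
lhs = np.linalg.inv(C + np.outer(b, a))
rhs = Cinv - np.outer(Cinv @ b, a @ Cinv) / (1 + a @ Cinv @ b)
print(np.allclose(lhs, rhs))       # True

# Monte Carlo: apply the MMSE estimate (41) to many realizations of the model.
trials = 50_000
theta = sigma_theta * rng.standard_normal(trials)
noise = rng.multivariate_normal(np.zeros(n), C, size=trials)
x = np.outer(theta, a) + noise

W = sigma_theta**2 * a @ np.linalg.inv(sigma_theta**2 * np.outer(a, a) + C)
theta_hat = x @ W
print(np.mean((theta_hat - theta)**2))  # well below the prior variance sigma_theta^2
```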
