Point estimation

Suppose we are interested in the value of a parameter θ, for example the unknown bias of a coin.
We have already seen how one may use the Bayesian method to reason about θ; namely, we select a
likelihood function p(D | θ), explaining how the observed data D are expected to be generated given
the value of θ. Then we select a prior distribution p(θ) reflecting our initial beliefs about θ. Finally,
we conduct an experiment to gather data and use Bayes' theorem to derive the posterior p(θ | D).
In a sense, the posterior contains all the information about θ that we care about. However, the process
of inference will often require us to use this posterior to answer various questions. For example,
we might be compelled to choose a single value θ̂ to serve as a point estimate of θ. To a Bayesian,
the selection of θ̂ is a decision, and in different contexts we might want to select different values to
report.
In general, we should not expect to be able to select the true value of θ, unless we have somehow
observed data that unambiguously determine it. Instead, we can only hope to select an estimate θ̂
that is close to the true value. Different definitions of closeness can naturally lead to different
estimates. The Bayesian approach to point estimation is to analyze the impact of our choice
in terms of a loss function, which describes how bad different types of mistakes can be. We then
select the estimate that appears to be the least bad according to our current beliefs about θ.

Decision theory
As mentioned above, the selection of an estimate can be seen as a decision. It turns out that the
Bayesian approach to decision theory is rather simple and consistent, so we will introduce it in an
abstract form here.
A decision problem will typically have three components. First, we have a parameter space (also
called a state space) Θ, containing the unknown value θ. We will also have a sample space X
representing the potential observations we could theoretically make. Finally, we have an action
space A representing the potential actions we may select from. In the point estimation problem, the
potential actions are exactly those in the parameter space, so A = Θ. This might not always be the
case, however. We will also need a likelihood function p(D | θ) linking potential observations to
the parameter space.
After conducting an experiment and observing data D, we are compelled to select an action a ∈ A.
We define a (deterministic) decision rule as a function δ: X → A that selects an action a given the
observations D.¹ In general, this decision rule can be an arbitrary function. How do we select
which decision rule to use?
To guide our selection, we will define a loss function, which is a function L: Θ × A → R. The value
L(θ, a) summarizes how bad an action a was if the true value of the parameter was revealed to
be θ; larger losses represent worse outcomes. Ideally, we would select the action that minimizes
this loss, but unfortunately we will never know the exact value of θ, complicating our decision.
As usual, there are two main approaches to designing decision rules. We begin with the Bayesian
approach.
¹ We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over
A, from which we select a sample. This can be useful, for example, when facing an intelligent adversary. Most of the
expressions below can be derived analogously for this case; however, we will not do so here.

Bayesian decision theory


The Bayesian approach to decision theory is straightforward. Given our observed data D, we find
the posterior p(θ | D), which represents our current belief about the unknown parameter θ. Given
a potential action a, we may define the posterior expected loss of a by averaging the loss function
over the unknown parameter:

  ρ(p(θ | D), a) = E[L(θ, a) | D] = ∫ L(θ, a) p(θ | D) dθ.

Because the posterior expected loss of each action a is a scalar value, it defines a total order on the
action space A. When there is an action minimizing the posterior expected loss, it is the natural
choice to make:

  δ(D) = arg min_{a ∈ A} ρ(p(θ | D), a),

representing the action with the lowest expected loss, given our current beliefs about θ. Note that
δ(D) may not be unique, in which case we can select any action attaining the minimal value. Any
minimizer of the posterior expected loss is called a Bayes action. The value δ(D) may also be found
by solving the equivalent minimization problem
  δ(D) = arg min_{a ∈ A} ∫ L(θ, a) p(D | θ) p(θ) dθ;

the advantage of this formulation is that it avoids computing the normalization constant
p(D) = ∫ p(D | θ) p(θ) dθ in the posterior.
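To make this concrete, here is a minimal numerical sketch of computing a Bayes action for the coin-bias example, assuming a Beta(2, 2) prior, squared loss, and a hypothetical dataset of 7 heads and 3 tails; the prior, data, and grid resolution are illustrative choices, not taken from these notes.

    import numpy as np
    from scipy import stats

    heads, tails = 7, 3                        # hypothetical observed data D
    prior = stats.beta(2, 2)                   # an assumed prior p(theta)

    theta = np.linspace(0.001, 0.999, 2000)    # grid over the parameter space
    dtheta = theta[1] - theta[0]
    unnorm = theta**heads * (1 - theta)**tails * prior.pdf(theta)   # p(D | theta) p(theta)
    posterior = unnorm / (unnorm.sum() * dtheta)                    # p(theta | D) on the grid

    def expected_loss(a):
        """Posterior expected squared loss rho(p(theta | D), a)."""
        return np.sum((theta - a)**2 * posterior) * dtheta

    # Here A = Theta, so the candidate actions are just the grid points.
    bayes_action = theta[np.argmin([expected_loss(a) for a in theta])]
    print(bayes_action)                        # approx. 0.643, the posterior mean of Beta(9, 5)

Minimizing against the unnormalized product p(D | θ) p(θ) instead of the normalized posterior would select the same action, since the normalization constant does not depend on a.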
Notice that a Bayes action is tied to a particular set of observed data D. This does not limit its
utility terribly; after all, we will always have a particular set of observed data at hand when making
a decision. However, we may extend the notion of Bayes actions in a natural way to define an
entire decision rule. We define the Bayes rule, a decision rule, by simply always selecting a (not
necessarily unique) Bayes action given the observed data. Note that the second formulation above
can often be minimized even when p(θ) is not a probability distribution (such priors are
called improper but are often encountered in practice). A decision rule derived in this way from an
improper prior is called a generalized Bayes rule.

In the case of point estimation, the decision rule may be more naturally written θ̂(D). Then, as
above, the point estimation problem reduces to selecting a loss function and deriving the decision
rule that minimizes the expected loss at every point. A decision rule that minimizes the posterior
expected loss for every possible set of observations D is called a Bayes estimator.
We may derive Bayes estimators for some common loss functions. As an example, consider the
common loss function

  L(θ, θ̂) = (θ − θ̂)²,
which represents the squared distance between our estimate and the true value. For the squared
loss, we may compute:
  E[L(θ, θ̂) | D] = ∫ (θ − θ̂)² p(θ | D) dθ
                 = ∫ θ² p(θ | D) dθ − 2θ̂ ∫ θ p(θ | D) dθ + θ̂² ∫ p(θ | D) dθ
                 = ∫ θ² p(θ | D) dθ − 2θ̂ E[θ | D] + θ̂².

We may minimize this expression by differentiating with respect to θ̂ and equating to zero:


  ∂E[L(θ, θ̂) | D] / ∂θ̂ = −2 E[θ | D] + 2θ̂ = 0,

from which we may derive θ̂ = E[θ | D]. Examining the second derivative, we see that
∂²E[L(θ, θ̂) | D] / ∂θ̂² = 2 > 0, so this is indeed a minimum. Therefore we have shown that the
Bayes estimator in the case of squared loss is the posterior mean, θ̂(D) = E[θ | D].
A similar analysis shows that the Bayes estimator for the absolute deviation loss L(θ, θ̂) = |θ − θ̂|
is the posterior median, and the Bayes estimator for a relaxed 0–1 loss

  L(θ, θ̂; ε) = 0 if |θ − θ̂| < ε, and 1 if |θ − θ̂| ≥ ε,

converges to the posterior mode for small ε.
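As a sanity check (not part of the original notes), the following sketch verifies these three results numerically for a hypothetical Beta(9, 5) posterior, minimizing each posterior expected loss over candidate estimates; the posterior, the grid, and the value of ε are all illustrative assumptions.

    import numpy as np
    from scipy import stats, optimize

    posterior = stats.beta(9, 5)               # a hypothetical p(theta | D)
    grid = np.linspace(0.001, 0.999, 4000)
    dtheta = grid[1] - grid[0]
    eps = 0.01

    losses = {
        "squared":     lambda t, e: (t - e)**2,
        "absolute":    lambda t, e: np.abs(t - e),
        "relaxed 0-1": lambda t, e: (np.abs(t - e) >= eps).astype(float),
    }

    def expected_loss(estimate, loss):
        # posterior expected loss, approximated on the grid
        return np.sum(loss(grid, estimate) * posterior.pdf(grid)) * dtheta

    for name, loss in losses.items():
        best = optimize.minimize_scalar(lambda e: expected_loss(e, loss),
                                        bounds=(0, 1), method="bounded").x
        print(name, round(best, 3))

    # For comparison: the posterior mean, median, and mode of Beta(9, 5).
    print(posterior.mean(), posterior.median(), 8 / 12)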
The posterior mode, also called the maximum a posteriori (MAP) estimate of θ and written θ̂_MAP, is a
rather common estimator used in practice. The reason is that optimization is almost always easier
than integration. In particular, we may find the MAP estimate by maximizing the unnormalized
posterior

  θ̂_MAP = arg max_θ p(D | θ) p(θ),

where we have avoided computing the normalization constant p(D) = ∫ p(D | θ) p(θ) dθ. An
important caveat about the MAP estimator is that it is not invariant to nonlinear transformations of
θ!
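A minimal sketch of this idea for the beta-Bernoulli coin model used above, maximizing the unnormalized log posterior with an off-the-shelf optimizer; the prior, the data, and the log-odds reparameterization used to illustrate the non-invariance caveat are all assumptions made for the example.

    import numpy as np
    from scipy import optimize, special

    heads, tails = 7, 3                          # hypothetical observed data D
    a, b = 2.0, 2.0                              # an assumed Beta(a, b) prior

    def neg_log_unnorm_post(theta):
        # -[log p(D | theta) + log p(theta)], dropping terms constant in theta
        return -((heads + a - 1) * np.log(theta)
                 + (tails + b - 1) * np.log(1 - theta))

    map_theta = optimize.minimize_scalar(neg_log_unnorm_post,
                                         bounds=(1e-6, 1 - 1e-6),
                                         method="bounded").x
    print(map_theta)                             # 8/12, the mode of the Beta(9, 5) posterior

    # Non-invariance: reparameterize by the log-odds phi = logit(theta). The
    # density of phi picks up a Jacobian factor theta (1 - theta), so its mode
    # sits at a different theta than the MAP estimate above.
    def neg_log_post_phi(phi):
        theta = special.expit(phi)
        return neg_log_unnorm_post(theta) - np.log(theta * (1 - theta))

    map_phi = optimize.minimize_scalar(neg_log_post_phi).x
    print(special.expit(map_phi), map_theta)     # approx. 0.643 vs. 0.667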
Frequentist decision theory
The frequentist approach to decision theory is somewhat different. As usual, in classical statistics
we are not allowed to place a prior distribution on a parameter such as θ; rather, it is much more
common to use the likelihood p(D | θ) to hypothesize what data might look like for different values
of θ, and to use this analysis to guide our actions.
The frequentist approach to decision theory involves the notion of risk functions. The frequentist
risk of a decision rule δ is defined by

  R(θ, δ) = ∫_X L(θ, δ(D)) p(D | θ) dD,

that is, it represents the expected loss incurred when repeatedly using the decision rule δ on different
datasets D, as a function of the unknown parameter θ.
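The sketch below, a purely illustrative Monte Carlo estimate, computes this risk under squared loss for two decision rules for the coin-bias problem (the maximum likelihood rule and the posterior mean under an assumed Beta(2, 2) prior), averaging over simulated datasets of n = 10 flips.

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 10, 20000

    def risk(theta, rule):
        k = rng.binomial(n, theta, size=trials)   # simulated datasets D ~ p(D | theta)
        return np.mean((rule(k, n) - theta)**2)   # average squared loss over datasets

    mle       = lambda k, n: k / n                # maximum likelihood rule
    post_mean = lambda k, n: (k + 2) / (n + 4)    # posterior mean under a Beta(2, 2) prior

    for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
        print(theta, round(risk(theta, mle), 4), round(risk(theta, post_mean), 4))

In this hypothetical comparison neither rule has uniformly smaller risk, which is exactly the difficulty discussed next.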
To a Bayesian, the frequentist risk is a very strange notion: we know the exact value of our data D
when we make our decision, so why should we average over other datasets that we haven't seen?
The frequentist counterargument is typically that we might know D, but we can't know p(θ)!
Notice that whereas the posterior expected loss was a scalar defining a total order on the action
space A, which could be extended naturally to define an entire decision rule, the frequentist risk is
a function of θ and the entire decision rule δ. It is very unusual for there to be a single decision
rule that works best for every potential value of θ. For this reason, we must decide on some
mechanism for using the frequentist risk to select a good decision rule. Many mechanisms have
been proposed; we will quickly describe two below.

Bayes risk
One solution to the problem of comparing decision rules is to place a prior distribution p(θ) on the
unknown θ and compute the average risk:

  r(p(θ), δ) = E[R(θ, δ)] = ∫ R(θ, δ) p(θ) dθ = ∫∫ L(θ, δ(D)) p(D | θ) p(θ) dθ dD.

The function r(p(θ), δ) is called the Bayes risk of δ under the prior p(θ). Again, the Bayes risk is
scalar-valued, so it induces a total order on all decision rules, making it easier to identify a unique
decision rule. Any δ minimizing the Bayes risk is called a Bayes rule. We have seen this term before! It
turns out that, given a prior p(θ), the Bayesian procedure described above for defining a decision
rule by selecting an action with minimum posterior expected loss is guaranteed to minimize the Bayes
risk and therefore produces a Bayes rule with respect to p(θ). Note, however, that in the
Bayesian perspective it is unusual to first find an entire decision rule and then apply it to a particular
dataset D. Instead, it is almost always easier to minimize the posterior expected loss only at the actual
observed data. After all, why would we want to make decisions with other data?


With these definitions, we may revisit estimation under squared loss: the posterior expected loss of
the Bayes estimator (the posterior mean) is the posterior variance var[θ | D], and the corresponding
Bayes risk is its expectation over datasets, E[var[θ | D]].
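Continuing the illustrative Monte Carlo sketch from the frequentist risk section, we can estimate the Bayes risk of the two rules under the assumed Beta(2, 2) prior by also sampling θ from the prior; as expected, the posterior-mean rule (the Bayes rule for this prior and squared loss) attains the smaller Bayes risk.

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials = 10, 200_000

    theta = rng.beta(2, 2, size=trials)           # theta ~ p(theta)
    k = rng.binomial(n, theta)                    # D | theta

    mle       = k / n
    post_mean = (k + 2) / (n + 4)                 # the Bayes rule for this prior and loss

    print("Bayes risk, MLE:           ", np.mean((mle - theta)**2))
    print("Bayes risk, posterior mean:", np.mean((post_mean - theta)**2))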
Admissibility
Another criterion for selecting between decision rules in the frequentist framework is called
admissibility. In short, it is often difficult to identify a single best decision rule, but it can sometimes
be easy to discard some bad ones, for example if they can be shown to always be no better than
(and sometimes worse than) another rule.
Let δ₁ and δ₂ be two decision rules. We say that δ₁ dominates δ₂ if R(θ, δ₁) ≤ R(θ, δ₂) for all
θ ∈ Θ, and there exists at least one θ for which R(θ, δ₁) < R(θ, δ₂).
A decision rule that is not dominated by any other rule is called admissible. One interesting result
tying Bayesian and frequentist decision theory together is the following: every Bayes rule is
admissible, and every admissible decision rule is a generalized Bayes rule for some (possibly
improper) prior p(θ).
So, in a sense, all admissible frequentist decision rules can be equivalently derived from a Bayesian
perspective.
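As a toy illustration (my own, not from the notes), the Monte Carlo risk function sketched earlier can be reused to show that the deliberately "backwards" rule 1 − k/n is dominated by k/n for the coin-bias problem under squared loss: its risk is never smaller and is strictly larger away from θ = 0.5, so the backwards rule is inadmissible.

    import numpy as np

    rng = np.random.default_rng(2)
    n, trials = 10, 100_000

    def risk(theta, rule):
        k = rng.binomial(n, theta, size=trials)
        return np.mean((rule(k, n) - theta)**2)

    delta1 = lambda k, n: k / n                   # sample mean
    delta2 = lambda k, n: 1 - k / n               # deliberately backwards rule

    for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
        print(theta, round(risk(theta, delta1), 4), round(risk(theta, delta2), 4))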

Examples
Here we give two quick examples of applying Bayesian decision theory.
Classification with 0–1 loss
Suppose our observations are of the form (x, y), where x is an arbitrary input and y ∈ {0, 1} is a
binary label associated with x. In classification, our goal is to predict the label y′ associated with a
new input x′. The Bayesian approach is to derive a model giving probabilities Pr(y′ = 1 | x′, D).
Suppose this model is provided for you. Notice that this model is not conditioned on any additional
parameters θ.
Given a new datapoint x′, which label a should we predict? Notice that the prediction of a label is
actually a decision. Here our action space is A = {0, 1}, enumerating the two labels we can predict.
Our parameter space is the same: the only uncertainty we have is the unknown label y′.
Let us suppose a simple loss function for this problem:

  L(y′, a) = 0 if a = y′, and 1 if a ≠ y′.
This loss function, called the 0–1 loss, is common in classification problems. We pay a constant
loss for every mistake we make. In this case, the expected loss of each possible action is simple to
compute:

  E[L(y′, a = 1) | x′, D] = Pr(y′ = 0 | x′, D);
  E[L(y′, a = 0) | x′, D] = Pr(y′ = 1 | x′, D).
The Bayes action is then to predict the class with the highest probability. This is not so surprising.
Notice that if we change the loss to have different costs for the two types of mistakes (so that
L(0, 1) ≠ L(1, 0)), then the Bayes action might compel us to select the less likely class to avoid a
potentially high loss for misclassification!
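A tiny sketch of this computation, using a hypothetical predictive probability Pr(y′ = 1 | x′, D) = 0.2 and a made-up asymmetric loss in which a false negative costs ten times a false positive.

    def bayes_action(p1, loss):
        """p1 = Pr(y' = 1 | x', D); loss[(y, a)] = loss of action a when the truth is y."""
        expected = {a: (1 - p1) * loss[(0, a)] + p1 * loss[(1, a)] for a in (0, 1)}
        return min(expected, key=expected.get)

    zero_one   = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    asymmetric = {(0, 0): 0, (0, 1): 1, (1, 0): 10, (1, 1): 0}

    print(bayes_action(0.2, zero_one))     # 0: simply the more probable class
    print(bayes_action(0.2, asymmetric))   # 1: the costly mistake shifts the decision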
Hypothesis testing
The Bayesian approach to hypothesis testing is also rather straightforward. A hypothesis is simply a
subset of the parameter space, H ⊆ Θ. The Bayesian approach allows one to compute the posterior
probability of a hypothesis directly:

  Pr(θ ∈ H | D) = ∫_H p(θ | D) dθ.

Now, equipped with a loss function, the posterior expected loss framework above can be applied
directly.
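For instance, with the hypothetical Beta(9, 5) coin posterior used earlier and the hypothesis H = {θ : θ > 1/2} ("the coin is biased toward heads"), this probability is a one-line computation.

    from scipy import stats

    posterior = stats.beta(9, 5)           # hypothetical p(theta | D) for a coin's bias
    pr_H = 1 - posterior.cdf(0.5)          # Pr(theta in H | D), integrating the posterior over H
    print(pr_H)                            # approx. 0.87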
See last week's notes for a brief discussion of frequentist hypothesis testing.
