Lecture 1.2
Note: These notes are for your personal educational use only. Please do not distribute them.
1 Introduction
This section initiates our study of statistical decision theory. Our focus within this broad topic will be on the
classical theory of Bayesian and non-Bayesian hypothesis testing and estimation, culminating in a discussion
of minimax estimation problems (which have become very popular in the current literature). We begin with
(perhaps) the simplest setting, binary hypothesis testing, and next introduce three canonical formulations of
this problem.
\[
R_B := \min_{\hat{H}} \mathbb{E}\big[L(H, \hat{H}(X))\big], \tag{2}
\]
where the minimum is over all randomized estimators Ĥ of H based on X, E[L(H, Ĥ(X))] is known as the
risk (where the expectation is with respect to the joint distribution of (H, X) as well as the randomness in
Ĥ), and R_B is known as the Bayes risk. Moreover, Ĥ_B : R → {0, 1} is often called the Bayes decision rule
in the literature. We will analyze this formulation in Section 2.
\[
R_m := \min_{\hat{H}} \max_{p \in [0,1]} \mathbb{E}\big[L(H, \hat{H}(X))\big], \tag{3}
\]
where the expectation is with respect to the joint distribution of (H, X) with P(H = 0) = p and the
randomness in Ĥ, and R_m is known as the minimax risk. We only state this formulation for completeness,
and do not analyze it here.
Spring 2022 Part 1 Statistical Inference, Lecture 2 Binary Hypothesis Testing
where the probabilities are computed with respect to the likelihoods and the randomness of Ĥ. Here, Q(Ĥ(X) = 1)
is known as the detection probability in the computer science and electrical engineering literature, or the power
of Ĥ in statistics, while P(Ĥ(X) = 1) is known as the false-alarm probability in the computer science and
electrical engineering literature, or the size of Ĥ in statistics. (Alternatively, statisticians also refer to Q(Ĥ(X) = 0)
as the probability of a Type II error, and P(Ĥ(X) = 1) as the probability of a Type I error. Note that we
abuse notation here and treat P and Q as both densities and the corresponding measures.)
It is clearly desirable to have large Q(Ĥ(X) = 1) and small P (Ĥ(X) = 1). Observe that using the
deterministic decision rule Ĥ(x) = 0 for all x ∈ R achieves P (Ĥ(X) = 1) = 0, but Q(Ĥ(X) = 1) = 0 as well.
On the other hand, the deterministic decision rule Ĥ(x) = 1 for all x ∈ R achieves Q(Ĥ(X) = 1) = 1, but
P (Ĥ(X) = 1) = 1 as well. This illustrates that Q(Ĥ(X) = 1) cannot be increased without also increasing
P (Ĥ(X) = 1). To characterize the trade-off between Q(Ĥ(X) = 1) and P (Ĥ(X) = 1), the Neyman-Pearson
criterion maximizes Q(Ĥ(X) = 1) over all decision rules subject to a constraint on P (Ĥ(X) = 1). Formally,
we define the Neyman-Pearson function (or receiver operating characteristic curve) as:
\[
\beta(\alpha) := \max_{\hat{H} \,:\, P(\hat{H}(X) = 1) \le \alpha} Q\big(\hat{H}(X) = 1\big), \tag{4}
\]
where the maximum is over all randomized decision rules Ĥ such that P(Ĥ(X) = 1) ≤ α, and we suppress
the dependence of β on (P, Q). We will analyze this formulation in Section 3.
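To make the size-power trade-off concrete, here is a small numerical sketch. The Gaussian pair and the one-sided tests are illustrative assumptions (not part of the notes): under H = 0 we take X ∼ P = N(0, 1), under H = 1 we take X ∼ Q = N(2, 1), and we examine tests of the form Ĥ(x) = 1{x ≥ t}.

```python
from statistics import NormalDist

# Assumed Gaussian pair: P = N(0,1) under H=0, Q = N(2,1) under H=1.
P, Q = NormalDist(0.0, 1.0), NormalDist(2.0, 1.0)

def size(t):
    """False-alarm probability P(Hhat(X) = 1) of the test Hhat(x) = 1{x >= t}."""
    return 1.0 - P.cdf(t)

def power(t):
    """Detection probability Q(Hhat(X) = 1) of the same test."""
    return 1.0 - Q.cdf(t)

# Raising the threshold t shrinks the false-alarm probability,
# but the detection probability shrinks with it: the trade-off in the text.
thresholds = [0.5, 1.0, 1.5, 2.0]
sizes = [size(t) for t in thresholds]
powers = [power(t) for t in thresholds]
```

Since Q is shifted to the right of P, every such test has power strictly larger than its size, but neither quantity can be improved without degrading the other.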
As we will see, the optimal decision rules for these formulations take the form of likelihood ratio tests:
\[
\frac{P(X)}{Q(X)} \;\underset{\hat{H}(X)=1}{\overset{\hat{H}(X)=0}{\gtrless}}\; \eta, \tag{5}
\]
where P/Q is known as the likelihood ratio, and η ∈ (0, ∞) is some threshold.
2 Bayesian Hypothesis Testing
Theorem 1 (Bayes Decision Rule). The Bayes decision rule is given by the likelihood ratio test:
\[
\forall x \in \mathbb{R}, \quad \hat{H}_B(x) = \begin{cases} 0, & \dfrac{P(x)}{Q(x)} \ge \dfrac{(1-p)(L(1,0) - L(1,1))}{p(L(0,1) - L(0,0))} \\[2ex] 1, & \dfrac{P(x)}{Q(x)} < \dfrac{(1-p)(L(1,0) - L(1,1))}{p(L(0,1) - L(0,0))} \end{cases}
\]
Proof. First, observe that the Bayes risk in (2) can be achieved using deterministic decision rules, cf. [3,
Section 2.1]. To see this, consider any randomized decision rule Ĥ. Rigorously, such a randomized decision
rule is a Markov kernel PĤ|X that assigns to each realization X = x ∈ R a probability PĤ|X(0|x) ∈ [0, 1]
of inferring 0. Equivalently, we may write this rule in its functional representation as Ĥ : R × [0, 1] → {0, 1},
CS 57800 Statistical Machine Learning, Prof. Anuran Makur Spring 2022
Ĥ(X, U ), where U is an independent uniform random variable over [0, 1]. Indeed, we may define Ĥ(x, U ) =
1{U > PĤ|X (0|x)} for all x ∈ R. Hence, the risk of Ĥ is lower bounded by
\[
\mathbb{E}\big[L(H, \hat{H}(X, U))\big] = \int_0^1 \mathbb{E}_{P_{H,X}}\big[L(H, \hat{H}(X, u))\big]\, du \ge \inf_{u \in [0,1]} \mathbb{E}_{P_{H,X}}\big[L(H, \hat{H}(X, u))\big],
\]
where the first expectation is with respect to the joint law of U , H, and X. This implies that we can
minimize the Bayes risk
\[
R_B = \min_{\hat{H} : \mathbb{R} \to \{0,1\}} \mathbb{E}\big[L(H, \hat{H}(X))\big]
\]
over all deterministic estimators Ĥ of H based on X. The remainder of the proof is standard [1, Section 2].
Now consider any deterministic decision rule Ĥ : R → {0, 1}. We may decompose its associated risk as
\[
\mathbb{E}\big[L(H, \hat{H}(X))\big] = \mathbb{E}\Big[\mathbb{E}\big[L(H, \hat{H}(X)) \,\big|\, X\big]\Big] = \mathbb{E}\big[P_{H|X}(0|X)\, L(0, \hat{H}(X)) + P_{H|X}(1|X)\, L(1, \hat{H}(X))\big]
\]
using the tower property. So, to minimize E[L(H, Ĥ(X))] over all Ĥ, it suffices to minimize
\[
P_{H|X}(0|x)\, L(0, \hat{H}(x)) + P_{H|X}(1|x)\, L(1, \hat{H}(x))
\]
over all Ĥ(x) ∈ {0, 1}, for each x ∈ R. To this end, for any x ∈ R, we set Ĥ(x) = 0 if and only if
\[
\begin{aligned}
& P_{H|X}(0|x)\, L(0,0) + P_{H|X}(1|x)\, L(1,0) \le P_{H|X}(0|x)\, L(0,1) + P_{H|X}(1|x)\, L(1,1) \\
\Leftrightarrow\;& \frac{L(1,0) - L(1,1)}{L(0,1) - L(0,0)} \le \frac{P_{H|X}(0|x)}{P_{H|X}(1|x)} \\
\Leftrightarrow\;& \frac{(1-p)(L(1,0) - L(1,1))}{p(L(0,1) - L(0,0))} \le \frac{P(x)}{Q(x)},
\end{aligned}
\]
and Ĥ(x) = 1 otherwise, where L(1, 0) − L(1, 1) > 0 and L(0, 1) − L(0, 0) > 0 by assumption. Hence, ĤB in
the theorem statement is the Bayes decision rule that achieves Bayes risk. This completes the proof.
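As a sanity check on Theorem 1, the sketch below computes the exact risk of every threshold rule and verifies that the Bayes threshold minimizes it. The Gaussian pair, prior, and loss values are assumptions for illustration: with P = N(0, 1) and Q = N(1, 1), the likelihood ratio is P(x)/Q(x) = exp(1/2 − x), so every likelihood ratio test is a threshold test on x itself.

```python
from math import log
from statistics import NormalDist

# Assumed setup: P = N(0,1) under H=0, Q = N(1,1) under H=1, prior p = P(H=0),
# and a loss with L(1,0) > L(1,1) and L(0,1) > L(0,0), as in Theorem 1.
P, Q = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)
p = 0.7
L = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 4.0, (1, 1): 0.0}

def risk(t):
    """Exact risk of the rule Hhat(x) = 0 iff x <= t (decide 1 otherwise)."""
    return (p * (L[(0, 0)] * P.cdf(t) + L[(0, 1)] * (1.0 - P.cdf(t)))
            + (1 - p) * (L[(1, 0)] * Q.cdf(t) + L[(1, 1)] * (1.0 - Q.cdf(t))))

# Theorem 1 threshold on the likelihood ratio, mapped to a threshold on x via
# P(x)/Q(x) >= c  <=>  exp(1/2 - x) >= c  <=>  x <= 1/2 - ln(c).
c = (1 - p) * (L[(1, 0)] - L[(1, 1)]) / (p * (L[(0, 1)] - L[(0, 0)]))
t_bayes = 0.5 - log(c)

# The Bayes rule should do at least as well as every other threshold rule.
grid = [i / 100.0 for i in range(-500, 501)]
best_on_grid = min(risk(t) for t in grid)
```

The asymmetric loss L(1, 0) = 4 penalizes missed detections, and the resulting Bayes threshold shifts accordingly; changing p or L moves the threshold exactly as the theorem's formula predicts.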
In particular, consider the Hamming loss L(h, ĥ) = 1{h ≠ ĥ} (so that L(0, 0) = L(1, 1) = 0 and
L(0, 1) = L(1, 0) = 1), for which the Bayes risk (2) becomes the minimum probability of error:
\[
R_B = \min_{\hat{H}} \mathbb{P}\big(\hat{H}(X) \ne H\big), \tag{6}
\]
where the minimum is over all randomized decision rules Ĥ : R → {0, 1} (as before). In this case, the Bayes
decision rule in Theorem 1 can be written as the maximum a posteriori (MAP) rule:
\[
\underbrace{p\, P(X)}_{\propto\, P_{H|X}(0|X)} \;\underset{\hat{H}_B(X)=1}{\overset{\hat{H}_B(X)=0}{\gtrless}}\; \underbrace{(1-p)\, Q(X)}_{\propto\, P_{H|X}(1|X)}. \tag{7}
\]
Furthermore, if we additionally know that p = 1 − p = 1/2, i.e., the prior is uniform, then we can simplify the
MAP rule and obtain the well-known maximum likelihood (ML) rule:
\[
P(X) \;\underset{\hat{H}_B(X)=1}{\overset{\hat{H}_B(X)=0}{\gtrless}}\; Q(X). \tag{8}
\]
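A quick sketch of the MAP/ML distinction (the Gaussian pair and the prior values are assumed for illustration, not from the notes): with a uniform prior the two rules agree, while a skewed prior moves the MAP boundary toward the a priori likelier hypothesis.

```python
from statistics import NormalDist

# Assumed example: P = N(0,1) under H=0, Q = N(1,1) under H=1.
P, Q = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)

def map_rule(x, p):
    """MAP rule (7): decide 0 iff p*P(x) >= (1-p)*Q(x)."""
    return 0 if p * P.pdf(x) >= (1 - p) * Q.pdf(x) else 1

def ml_rule(x):
    """ML rule (8): decide 0 iff P(x) >= Q(x); the boundary here is x = 1/2."""
    return 0 if P.pdf(x) >= Q.pdf(x) else 1

# With a uniform prior (p = 1/2), MAP and ML coincide on a grid of inputs.
xs = [i / 10.0 for i in range(-30, 41)]
agree_uniform = all(map_rule(x, 0.5) == ml_rule(x) for x in xs)

# x = 0.7 lies past the ML boundary, but a strong prior on H=0 overrides it.
decisions = (ml_rule(0.7), map_rule(0.7, 0.9))
```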
3 Neyman-Pearson Theory
We next consider the formulation in Section 1.3. Suppose that the likelihood ratio function P/Q : R → (0, ∞)
has the property that for every t > 0 in its range, the pre-image (P/Q)⁻¹(t) = {x ∈ R : P(x)/Q(x) =
t} is a null set, i.e., has Lebesgue measure 0. We impose this somewhat peculiar condition to simplify the
exposition of the results in this section. However, Theorem 2, the main result of this section, can be
generalized to remove this condition at the expense of the optimal decision rules being randomized.
To derive Theorem 2, we first need to study the power and size of the likelihood ratio tests in (5). For
any η ∈ [0, +∞], define the likelihood ratio test:
\[
\forall x \in \mathbb{R}, \quad \hat{H}_\eta(x) := \begin{cases} 0, & \dfrac{P(x)}{Q(x)} > \eta \\[1ex] 1, & \dfrac{P(x)}{Q(x)} \le \eta \end{cases} \tag{9}
\]
Lemma 1 (Monotonicity and Continuity of Size and Power). The maps [0, +∞] ∋ η ↦ P(Ĥη(X) = 1) and
[0, +∞] ∋ η ↦ Q(Ĥη(X) = 1) are monotone non-decreasing and continuous.

Proof. We prove the claim for the map corresponding to Q; the argument for P is identical. Note from (9)
that Ĥη(X) = 1 if and only if P(X)/Q(X) ≤ η.
So, the map η ↦ Q(Ĥη(X) = 1) is the cumulative distribution function (CDF) of the random variable
P(X)/Q(X) when X ∼ Q. Hence, by definition of CDFs, we have that [0, +∞) ∋ η ↦ Q(Ĥη(X) = 1) is
monotone non-decreasing, right-continuous, and satisfies
\[
\lim_{\eta \to +\infty} Q\!\left( \frac{P(X)}{Q(X)} \le \eta \right) = 1.
\]
Since P/Q ∈ (0, ∞), we also have
\[
Q\!\left( \frac{P(X)}{Q(X)} \le 0 \right) = 0, \qquad Q\!\left( \frac{P(X)}{Q(X)} \le +\infty \right) = 1,
\]
where the latter equality implies that the map [0, +∞] ∋ η ↦ Q(Ĥη(X) = 1) is (left-)continuous at +∞.
It remains to show that the map (0, +∞) ∋ η ↦ Q(Ĥη(X) = 1) is left-continuous. To prove this, for any
η ∈ (0, +∞), notice that
\[
\begin{aligned}
\lim_{\tau \to \eta^-} \underbrace{Q\!\left( \frac{P(X)}{Q(X)} \le \eta \right) - Q\!\left( \frac{P(X)}{Q(X)} \le \tau \right)}_{\ge\, 0} &= \lim_{\tau \to \eta^-} Q\!\left( \tau < \frac{P(X)}{Q(X)} \le \eta \right) \\
&= Q\!\left( \forall \tau < \eta, \; \tau < \frac{P(X)}{Q(X)} \le \eta \right) \\
&= Q\!\left( \frac{P(X)}{Q(X)} = \eta \right) \\
&= 0,
\end{aligned}
\]
where the second equality follows from the continuity of the probability measure corresponding to Q, and
the last equality holds because {x ∈ R : P (x)/Q(x) = η} is a null set by assumption. This completes the
proof.
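For intuition, the lemma can be checked in closed form for an assumed Gaussian pair (my example, not the notes'): with P = N(0, 1) and Q = N(1, 1), P(x)/Q(x) = exp(1/2 − x), so Q(P(X)/Q(X) ≤ η) = Q(X ≥ 1/2 − ln η) = Φ(ln η + 1/2), which is visibly continuous and non-decreasing in η.

```python
from math import log
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

def F(eta):
    """CDF of the likelihood ratio P(X)/Q(X) under X ~ Q, for P = N(0,1), Q = N(1,1)."""
    return Phi(log(eta) + 0.5)

# Evaluate on a wide logarithmic grid of thresholds, from 1e-5 up to 1e5.
etas = [10.0 ** (k / 4.0) for k in range(-20, 21)]
values = [F(eta) for eta in etas]
```

The grid values rise monotonically from (essentially) 0 to (essentially) 1, matching the CDF properties established in the proof.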
Using Lemma 1, we can finally establish the key result of Neyman-Pearson theory.
Theorem 2 (Neyman-Pearson “Lemma”). For any α ∈ [0, 1], the Neyman-Pearson function β(α) is achieved
by the likelihood ratio test Ĥη : R → {0, 1} in (9), i.e.,
\[
\beta(\alpha) = \max_{g : \mathbb{R} \to [0,1] \,:\, \mathbb{E}_P[g(X)] \le \alpha} \mathbb{E}_Q[g(X)] = Q\big(\hat{H}_\eta(X) = 1\big), \tag{10}
\]
where η ∈ [0, +∞] is chosen so that P(Ĥη(X) = 1) = α.

Proof. Fix any η ∈ (0, +∞) and any randomized decision rule given by g : R → [0, 1], where g(x) is the
probability of deciding 1 upon observing X = x. Since Ĥη(x) = 1 exactly when Q(x) − (1/η)P(x) ≥ 0, we have
\[
\mathbb{E}_Q[g(X)] - \frac{1}{\eta} \mathbb{E}_P[g(X)] = \int g(x) \left( Q(x) - \frac{1}{\eta} P(x) \right) dx \le \int \hat{H}_\eta(x) \left( Q(x) - \frac{1}{\eta} P(x) \right) dx = \mathbb{E}_Q\big[\hat{H}_\eta(X)\big] - \frac{1}{\eta} \mathbb{E}_P\big[\hat{H}_\eta(X)\big],
\]
where the inequality holds because we assign the maximum possible value of 1 to g on the set where
Q(x) − (1/η)P(x) is non-negative and the minimum possible value of 0 where it is strictly negative, and the
subsequent equality follows from (9). (Note the subtlety here; the function Ĥη defines the Markov kernel of a
randomized decision rule that turns out to be precisely the deterministic decision rule Ĥη. In general, if the
range of g is {0, 1}, then g defines a randomized decision rule Ĥ that is actually the deterministic decision
rule g.) This implies that
\[
\mathbb{E}_Q[g(X)] - \frac{1}{\eta} \mathbb{E}_P[g(X)] \le \mathbb{E}_Q\big[\hat{H}_\eta(X)\big] - \frac{1}{\eta} \mathbb{E}_P\big[\hat{H}_\eta(X)\big]
\]
\[
\Rightarrow \quad \mathbb{E}_Q[g(X)] + \frac{1}{\eta} \Big( \mathbb{E}_P\big[\hat{H}_\eta(X)\big] - \mathbb{E}_P[g(X)] \Big) \le \mathbb{E}_Q\big[\hat{H}_\eta(X)\big].
\]
Thus, for any η ∈ [0, +∞] and any g : R → [0, 1] with EP [g(X)] ≤ EP [Ĥη (X)], we have
\[
\mathbb{E}_Q[g(X)] \le \mathbb{E}_Q\big[\hat{H}_\eta(X)\big].
\]
Finally, choose any parameter η ∈ [0, +∞] such that EP [Ĥη (X)] = P (Ĥη (X) = 1) = α. This is possible
because the map
\[
[0, +\infty] \ni \eta \mapsto P\big(\hat{H}_\eta(X) = 1\big) = P\!\left( \frac{P(X)}{Q(X)} \le \eta \right)
\]
is continuous by Lemma 1, and satisfies P (Ĥ0 (X) = 1) = 0 and P (Ĥ+∞ (X) = 1) = 1. Hence, since
α ∈ [0, 1], there exists η ∈ [0, +∞] such that P (Ĥη (X) = 1) = α using the intermediate value theorem. For
this choice of η, we have shown that for every g : R → [0, 1] with EP [g(X)] ≤ α, we have
\[
\mathbb{E}_Q[g(X)] \le \mathbb{E}_Q\big[\hat{H}_\eta(X)\big].
\]
(Note that if there exist 0 ≤ η1 < η2 ≤ +∞ such that EP [Ĥηi (X)] = α for i = 1, 2, then EP [Ĥη2 (X)] −
EP [Ĥη1 (X)] = P (η1 < P (X)/Q(X) ≤ η2 ) = 0, which implies that EQ [Ĥη2 (X)] − EQ [Ĥη1 (X)] = Q(η1 <
P (X)/Q(X) ≤ η2 ) = 0 since Q is absolutely continuous with respect to P . So, it does not matter which
threshold ηi we choose.) Therefore, β(α) = EQ [Ĥη (X)] = Q(Ĥη (X) = 1), which completes the proof.
We remark that a variant of Theorem 2 also holds for general pairs of probability measures (P, Q), but
the equality case of the likelihood ratio test must be carefully randomized to achieve the Neyman-Pearson
function in many cases, e.g., for discrete random variables.
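The need for randomization in the discrete case can be seen in a tiny sketch (the Bernoulli pair is an assumed example, not from the notes): a binary observation admits only four deterministic tests, so only four (size, power) operating points are achievable, and intermediate sizes require randomizing at a boundary point.

```python
from itertools import product

# Assumed discrete example: X in {0,1}, with P(X=1) = 0.2 under H=0
# and Q(X=1) = 0.8 under H=1.
P = {0: 0.8, 1: 0.2}
Q = {0: 0.2, 1: 0.8}

# Enumerate all deterministic decision rules Hhat : {0,1} -> {0,1}.
operating_points = []
for h0, h1 in product([0, 1], repeat=2):
    rule = {0: h0, 1: h1}
    size = sum(P[x] for x in (0, 1) if rule[x] == 1)
    power = sum(Q[x] for x in (0, 1) if rule[x] == 1)
    operating_points.append((size, power))

# No deterministic test has size 0.1, but declaring 1 with probability 1/2
# when x = 1 (and never when x = 0) achieves size 0.1 with power 0.4.
size_rand = 0.5 * P[1]
power_rand = 0.5 * Q[1]
```

Randomizing between deterministic tests traces out the line segments joining the four achievable points, which is exactly how the Neyman-Pearson function is attained at constraint levels a deterministic test cannot hit.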
We conclude this section by collecting some basic properties of the Neyman-Pearson function.

Proposition 1 (Properties of the Neyman-Pearson Function). The Neyman-Pearson function β : [0, 1] → [0, 1]
satisfies the following properties:
1. β is continuous.
2. β(α) ≥ α for every α ∈ [0, 1].
3. β is monotone non-decreasing.
4. β is concave.
5. max_{α ∈ [0,1]} β(α) − α = ‖P − Q‖_TV.
Proof.
Part 1: This follows from Lemma 1 and Theorem 2.
Part 2: For any α ∈ [0, 1], consider the randomized decision rule Ĥ ∼ Bernoulli(α), i.e., P(Ĥ = 1) = α =
1 − P(Ĥ = 0), which is independent of X. Then, Q(Ĥ = 1) = P (Ĥ = 1) = α. Hence, by definition (4),
β(α) ≥ α.
Part 3: Consider any α1 , α2 ∈ [0, 1] such that α1 < α2 , and choose corresponding thresholds η1 , η2 ∈
[0, +∞] such that P (Ĥη1 (X) = 1) = α1 and P (Ĥη2 (X) = 1) = α2 . (Such η1 , η2 exist by the argument in the
proof of Theorem 2.) Then, we must have η1 < η2 . Indeed, if we had η1 ≥ η2 , then we would get α1 ≥ α2
using Lemma 1, which is a contradiction. This implies that Q(Ĥη1 (X) = 1) ≤ Q(Ĥη2 (X) = 1) using Lemma
1. Hence, we have β(α1 ) ≤ β(α2 ) using Theorem 2.
Part 4: Consider any α1 , α2 ∈ [0, 1] and choose corresponding thresholds η1 , η2 ∈ [0, +∞] such that
P(Ĥη1(X) = 1) = α1 and P(Ĥη2(X) = 1) = α2. Then, for any λ ∈ [0, 1], construct the randomized decision
rule:
\[
\forall x \in \mathbb{R}, \quad \hat{H}(x) = \begin{cases} \hat{H}_{\eta_1}(x) & \text{with probability } \lambda \\ \hat{H}_{\eta_2}(x) & \text{with probability } 1 - \lambda \end{cases}
\]
where the randomness in Ĥ is independent of X. The power and size of this rule are
\[
Q\big(\hat{H}(X) = 1\big) = \lambda\, \beta(\alpha_1) + (1 - \lambda)\, \beta(\alpha_2), \qquad P\big(\hat{H}(X) = 1\big) = \lambda\, \alpha_1 + (1 - \lambda)\, \alpha_2,
\]
where we use Theorem 2 to write Q(Ĥηi(X) = 1) = β(αi) for i = 1, 2. Hence, by definition (4), we have
β(λα1 + (1 − λ)α2) ≥ λβ(α1) + (1 − λ)β(α2), which establishes concavity.
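For the assumed Gaussian pair used earlier (P = N(0, 1) under H = 0, Q = N(μ, 1) under H = 1; my running example, not the notes'), the Neyman-Pearson function has the closed form β(α) = Φ(Φ⁻¹(α) + μ), which lets us spot-check the properties above numerically.

```python
from statistics import NormalDist

N = NormalDist()  # standard normal: cdf = Phi, inv_cdf = Phi^{-1}
mu = 1.0

def beta(alpha):
    """Closed-form ROC for P = N(0,1) vs Q = N(mu,1): beta(a) = Phi(Phi^{-1}(a) + mu)."""
    return N.cdf(N.inv_cdf(alpha) + mu)

alphas = [i / 100.0 for i in range(1, 100)]
betas = [beta(a) for a in alphas]

# beta lies above the diagonal, is non-decreasing, and is midpoint-concave.
above_diagonal = all(b >= a for a, b in zip(alphas, betas))
nondecreasing = all(b1 <= b2 for b1, b2 in zip(betas, betas[1:]))
midpoint_concave = all(
    beta((alphas[i] + alphas[j]) / 2) >= (betas[i] + betas[j]) / 2 - 1e-12
    for i, j in [(0, 98), (10, 60), (30, 90)]
)
```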
Part 5: This result is stated as an exercise in [2, Section 12], so we provide a short proof. Recall from
(10) that
= ‖P − Q‖_TV,
where the second equality follows from realizing that the optimal g that achieves (10) satisfies EP[g(X)] = α
(see the proof of Theorem 2), the fourth equality holds because the objective function in the third equality
is invariant to translations of g, and the fifth equality follows from the Kantorovich-Rubinstein dual
characterization of the TV distance. This completes the proof.
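Part 5 can be checked numerically for an assumed Gaussian pair: the identity reads max_{α ∈ [0,1]} β(α) − α = ‖P − Q‖_TV, and for P = N(0, 1), Q = N(μ, 1) the TV distance has the closed form 2Φ(μ/2) − 1.

```python
from statistics import NormalDist

N = NormalDist()
mu = 1.0

def beta(alpha):
    # Closed-form Neyman-Pearson function for P = N(0,1) vs Q = N(mu,1).
    return N.cdf(N.inv_cdf(alpha) + mu)

# Maximize beta(alpha) - alpha over a fine grid; the smooth maximizer is
# alpha* = Phi(-mu/2), so a grid of step 1e-4 suffices.
alphas = [i / 10000.0 for i in range(1, 10000)]
max_gap = max(beta(a) - a for a in alphas)

tv = 2 * N.cdf(mu / 2) - 1  # ||P - Q||_TV for two unit-variance Gaussians
```

The maximal gap between the ROC and the diagonal matches the TV distance to grid accuracy, as Part 5 asserts.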
References
[1] G. W. Wornell, “Inference and information,” May 2017, Department of Electrical Engineering and Com-
puter Science, MIT, Cambridge, MA, USA, Lecture Notes 6.437.
[2] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” May 2019, Department of Electrical
Engineering and Computer Science, MIT, Cambridge, MA, USA, Lecture Notes 6.441.
[3] Y. Wu, “Information-theoretic methods for high-dimensional statistics,” January 2020, Department of
Statistics and Data Science, Yale University, New Haven, CT, USA, Lecture Notes S&DS 677.