
CS 57800 Statistical Machine Learning, Prof. Anuran Makur, Spring 2022

Lecture 1.2: Binary Hypothesis Testing

Anuran Makur

Note: These notes are for your personal educational use only. Please do not distribute them.

1 Introduction
This section initiates our study of statistical decision theory. Our focus within this broad topic will be on the
classical understanding of Bayesian and non-Bayesian hypothesis testing and estimation theory, culminating
in a discussion of minimax estimation problems (which have become very popular in the current literature).
In the sequel, we begin with (perhaps) the simplest such setting, that of binary hypothesis testing problems. We next introduce three canonical formulations of binary hypothesis testing.

1.1 Bayesian Formulation


The setup of Bayesian binary hypothesis testing is as follows, cf. [1, 2]. Let H ∈ {0, 1} be the hypothesis
random variable, and X ∈ R be an observation random variable. The likelihoods, or conditional
probability distributions, of X given H are denoted PX|H (·|0) = P and PX|H (·|1) = Q, where P and Q are
given probability density functions that have support R (for simplicity). We write this observation model
as:
$$H = 0 : X \sim P, \qquad H = 1 : X \sim Q, \tag{1}$$
where H = 0 is called the null hypothesis and H = 1 is called the alternative hypothesis. Furthermore,
in the Bayesian regime, we are also given a prior probability distribution of H. For convenience, we let
p = P(H = 0) = 1 − P(H = 1) ∈ (0, 1). The goal of binary hypothesis testing is to infer the value of H
(which is unknown) after observing a realization of X.
To formally pose this question, consider a simple loss function L : {0, 1}² → R, where L(i, j) represents the “loss” in guessing j when the true hypothesis is i, and which satisfies the conditions L(0, 0) < L(0, 1) and L(1, 1) < L(1, 0). We seek to find the optimal estimator ĤB : R → {0, 1} that solves the following minimization problem:

$$R_B \triangleq \min_{\hat{H} : \mathbb{R} \to \{0,1\}} \mathbb{E}\big[L(H, \hat{H}(X))\big], \tag{2}$$

where the minimum is over all randomized estimators Ĥ of H based on X, E[L(H, Ĥ(X))] is known as the
risk (where the expectation is with respect to the joint distribution of (H, X) as well as the randomness in
Ĥ), and RB is known as the Bayes risk. Moreover, ĤB : R → {0, 1} is often called the Bayes decision rule
in the literature. We will analyze this formulation in Section 2.
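
Before turning to the other formulations, it may help to see the risk of one concrete decision rule. The following is a minimal Monte Carlo sketch, not from the notes, under an assumed model: P = N(0, 1), Q = N(1, 1), the Hamming loss, and a simple threshold rule; the model, prior, and threshold are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model for this sketch: P = N(0, 1) under H = 0, Q = N(1, 1) under
# H = 1, with prior p = P(H = 0). None of these choices come from the notes.
p, n = 0.5, 200_000

H = (rng.random(n) > p).astype(int)             # H = 0 w.p. p, H = 1 w.p. 1 - p
X = rng.normal(loc=H.astype(float), scale=1.0)  # X ~ P if H = 0, X ~ Q if H = 1

def H_hat(x, t=0.5):
    """A candidate threshold decision rule (not necessarily optimal)."""
    return (x > t).astype(int)

# Under the Hamming loss L(i, j) = 1{i != j}, the risk E[L(H, H_hat(X))] is the
# probability of error, which we estimate by Monte Carlo.
risk = np.mean(H_hat(X) != H)
print(f"estimated risk of the threshold rule: {risk:.4f}")
```

Section 2 shows that, for this loss, the optimal rule is itself a threshold test, so sweeping t here would locate the Bayes risk numerically.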

1.2 Minimax Formulation


When we do not know the prior distribution p of H, we cannot define the Bayes risk. In this case, we often
resort to the minimax (or adversarial) paradigm. As before, suppose that we are given likelihoods as in (1)
and the loss function L, and our goal is to find a good decision rule for H based on X. Then, it is reasonable
to minimize the worst possible risk obtained by maximizing over all priors p ∈ (0, 1). Formally, we write this
as:

$$R_m \triangleq \inf_{\hat{H} : \mathbb{R} \to \{0,1\}} \ \sup_{p \in (0,1)} \mathbb{E}\big[L(H, \hat{H}(X))\big], \tag{3}$$

where the expectation is with respect to the joint distribution of (H, X) with P(H = 0) = p and the
randomness in Ĥ, and Rm is known as the minimax risk. We only state this formulation for completeness,
and do not analyze it here.


1.3 Neyman-Pearson Formulation


Now suppose that we neither know the prior distribution of H nor the loss function L. In this case, we
cannot even define risk, and statisticians often fall back to the classical Neyman-Pearson theory of binary
hypothesis testing. Suppose that we are only given likelihoods as in (1), and our goal is again to find a good
decision rule for H based on X, where H ∈ {0, 1} is an unknown deterministic quantity. Neyman-Pearson
theory assesses the quality of any randomized decision rule Ĥ : R → {0, 1} by analyzing the trade-off between
the following conditional probabilities:

$$P(\hat{H}(X) = 1 \,|\, H = 1) = Q(\hat{H}(X) = 1),$$

$$P(\hat{H}(X) = 1 \,|\, H = 0) = P(\hat{H}(X) = 1),$$

where the probabilities are computed with respect to the likelihoods and the randomness of Ĥ. Here, Q(Ĥ(X) = 1) is known as the detection probability in the computer science and electrical engineering literature, or the power of Ĥ in statistics, while P(Ĥ(X) = 1) is known as the false-alarm probability in the computer science and electrical engineering literature, or the size of Ĥ in statistics. (Alternatively, statisticians also refer to Q(Ĥ(X) = 0) as the probability of a Type II error, and P(Ĥ(X) = 1) as the probability of a Type I error. Note that we abuse notation here and treat P and Q as both densities as well as the corresponding measures.)
It is clearly desirable to have large Q(Ĥ(X) = 1) and small P (Ĥ(X) = 1). Observe that using the
deterministic decision rule Ĥ(x) = 0 for all x ∈ R achieves P (Ĥ(X) = 1) = 0, but Q(Ĥ(X) = 1) = 0 as well.
On the other hand, the deterministic decision rule Ĥ(x) = 1 for all x ∈ R achieves Q(Ĥ(X) = 1) = 1, but
P (Ĥ(X) = 1) = 1 as well. This illustrates that Q(Ĥ(X) = 1) cannot be increased without also increasing
P (Ĥ(X) = 1). To characterize the trade-off between Q(Ĥ(X) = 1) and P (Ĥ(X) = 1), the Neyman-Pearson
criterion maximizes Q(Ĥ(X) = 1) over all decision rules subject to a constraint on P (Ĥ(X) = 1). Formally,
we define the Neyman-Pearson function (or receiver operating characteristic curve) as:

$$\forall \alpha \in [0, 1], \quad \beta(\alpha) \triangleq \max_{\substack{\hat{H} : \mathbb{R} \to \{0,1\} \,:\\ P(\hat{H}(X) = 1) \le \alpha}} Q(\hat{H}(X) = 1), \tag{4}$$

where the maximum is over all randomized decision rules Ĥ such that P (Ĥ(X) = 1) ≤ α, and we suppress
the dependence of β on (P, Q). We will analyze this formulation in Section 3.
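
For intuition about this trade-off, the snippet below evaluates the size and power of the simple deterministic tests Ĥ(x) = 1{x > t} under an assumed Gaussian pair P = N(0, 1), Q = N(1, 1) (our own illustrative choice, not from the notes; for this pair such threshold tests turn out to coincide with likelihood ratio tests, as Section 3 makes precise).

```python
import numpy as np
from scipy.stats import norm

# Assumed likelihoods for illustration: P = N(0, 1), Q = N(1, 1).
# For the deterministic test H_hat(x) = 1{x > t}:
#   size  (false-alarm) = P(X > t) with X ~ P,
#   power (detection)   = Q(X > t) with X ~ Q.
for t in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    size = 1.0 - norm.cdf(t, loc=0.0, scale=1.0)
    power = 1.0 - norm.cdf(t, loc=1.0, scale=1.0)
    print(f"t = {t:+.1f}: size = {size:.3f}, power = {power:.3f}")
```

Raising the threshold t drives the size down but drags the power down with it, which is exactly the tension that the function β(α) quantifies.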

2 Bayesian Hypothesis Testing


Recall the setup in Section 1.1. The next theorem characterizes the Bayes risk in (2) and the Bayes decision
rule ĤB : R → {0, 1} using the (deterministic) likelihood ratio test:

$$\frac{P(X)}{Q(X)} \ \underset{\hat{H}(X) = 1}{\overset{\hat{H}(X) = 0}{\gtrless}} \ \eta, \tag{5}$$

where P/Q is known as the likelihood ratio, and η ∈ (0, ∞) is some threshold.

Theorem 1 (Bayes Decision Rule). The Bayes decision rule is given by the likelihood ratio test:
$$\forall x \in \mathbb{R}, \quad \hat{H}_B(x) = \begin{cases} 0, & \dfrac{P(x)}{Q(x)} \ge \dfrac{(1-p)(L(1,0) - L(1,1))}{p(L(0,1) - L(0,0))} \\[2ex] 1, & \dfrac{P(x)}{Q(x)} < \dfrac{(1-p)(L(1,0) - L(1,1))}{p(L(0,1) - L(0,0))} \end{cases}$$

where the equality case is arbitrarily assigned to the null hypothesis.

Proof. First, observe that the Bayes risk in (2) can be achieved using deterministic decision rules, cf. [3,
Section 2.1]. To see this, consider any randomized decision rule Ĥ. Rigorously, such a randomized decision
rule is a Markov kernel PĤ|X that assigns to each realization X = x ∈ R a probability PĤ|X (0|x) ∈ [0, 1]
of inferring 0. Equivalently, we may write this rule in its functional representation as Ĥ : R × [0, 1] → {0, 1},


Ĥ(X, U ), where U is an independent uniform random variable over [0, 1]. Indeed, we may define Ĥ(x, U ) =
1{U > PĤ|X (0|x)} for all x ∈ R. Hence, the risk of Ĥ is lower bounded by
$$\mathbb{E}\big[L(H, \hat{H}(X, U))\big] = \int_0^1 \mathbb{E}_{P_{H,X}}\big[L(H, \hat{H}(X, u))\big]\, du \ \ge\ \inf_{u \in [0,1]} \mathbb{E}_{P_{H,X}}\big[L(H, \hat{H}(X, u))\big],$$

where the first expectation is with respect to the joint law of U , H, and X. This implies that we can
minimize the Bayes risk

$$R_B = \min_{\hat{H} : \mathbb{R} \to \{0,1\}} \mathbb{E}\big[L(H, \hat{H}(X))\big]$$

over all deterministic estimators Ĥ of H based on X. The remainder of the proof is standard [1, Section 2].
Now consider any deterministic decision rule Ĥ : R → {0, 1}. We may decompose its associated risk as
$$\mathbb{E}\big[L(H, \hat{H}(X))\big] = \mathbb{E}\Big[\mathbb{E}\big[L(H, \hat{H}(X)) \,\big|\, X\big]\Big] = \mathbb{E}\big[P_{H|X}(0|X)\, L(0, \hat{H}(X)) + P_{H|X}(1|X)\, L(1, \hat{H}(X))\big]$$
using the tower property. So, to minimize E[L(H, Ĥ(X))] over all Ĥ, it suffices to minimize
$$P_{H|X}(0|x)\, L(0, \hat{H}(x)) + P_{H|X}(1|x)\, L(1, \hat{H}(x))$$

over all Ĥ(x) ∈ {0, 1}, for each x ∈ R. To this end, for any x ∈ R, we set Ĥ(x) = 0 if and only if
$$\begin{aligned} & P_{H|X}(0|x)\, L(0, 0) + P_{H|X}(1|x)\, L(1, 0) \ \le\ P_{H|X}(0|x)\, L(0, 1) + P_{H|X}(1|x)\, L(1, 1) \\ \Leftrightarrow \ & \frac{L(1, 0) - L(1, 1)}{L(0, 1) - L(0, 0)} \ \le\ \frac{P_{H|X}(0|x)}{P_{H|X}(1|x)} \\ \Leftrightarrow \ & \frac{(1-p)(L(1, 0) - L(1, 1))}{p(L(0, 1) - L(0, 0))} \ \le\ \frac{P(x)}{Q(x)}, \end{aligned}$$
and Ĥ(x) = 1 otherwise, where L(1, 0) − L(1, 1) > 0 and L(0, 1) − L(0, 0) > 0 by assumption. Hence, ĤB in
the theorem statement is the Bayes decision rule that achieves Bayes risk. This completes the proof.

We remark that the threshold $\eta = \frac{(1-p)(L(1,0) - L(1,1))}{p(L(0,1) - L(0,0))}$ of the likelihood ratio test (5) that yields the Bayes decision rule depends only on the loss function and the prior probabilities. So, it can be pre-computed in statistical inference problems.
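
As a concrete worked example (ours, not from the notes), suppose the likelihoods are unit-variance Gaussians, P = N(μ₀, 1) and Q = N(μ₁, 1) with μ₁ > μ₀. Then the likelihood ratio is monotone in x, and the test (5) reduces to comparing the observation itself to a scalar threshold:

$$\frac{P(x)}{Q(x)} = \exp\!\left( (\mu_0 - \mu_1)\, x + \frac{\mu_1^2 - \mu_0^2}{2} \right) \ \ge\ \eta \quad \Longleftrightarrow \quad x \ \le\ \frac{\mu_0 + \mu_1}{2} - \frac{\ln \eta}{\mu_1 - \mu_0},$$

so ĤB(x) = 0 exactly when x falls below the scalar threshold on the right. In particular, for η = 1 (e.g., the Hamming loss with a uniform prior), this is the familiar midpoint rule that compares x to (μ₀ + μ₁)/2.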

2.1 Maximum A Posteriori and Maximum Likelihood


There are two common specializations of the Bayes decision rule [1, Section 2]. Suppose that L(0, 0) =
L(1, 1) = 0 and L(0, 1) = L(1, 0) = 1, i.e., the loss function is the Hamming metric. Then, the Bayes risk is
the minimum probability of error:
 
$$R_B = \min_{\hat{H} : \mathbb{R} \to \{0,1\}} \mathbb{P}\big(\hat{H}(X) \ne H\big), \tag{6}$$

where the minimum is over all randomized decision rules Ĥ : R → {0, 1} (as before). In this case, the Bayes
decision rule in Theorem 1 can be written as the maximum a posteriori (MAP) rule:
$$\underbrace{p\, P(X)}_{\propto\, P_{H|X}(0|X)} \ \underset{\hat{H}_B(X) = 1}{\overset{\hat{H}_B(X) = 0}{\gtrless}} \ \underbrace{(1 - p)\, Q(X)}_{\propto\, P_{H|X}(1|X)}. \tag{7}$$

Furthermore, if we additionally know that p = 1 − p = 1/2, i.e., the prior is uniform, then we can simplify the
MAP rule and obtain the well-known maximum likelihood (ML) rule:
$$P(X) \ \underset{\hat{H}_B(X) = 1}{\overset{\hat{H}_B(X) = 0}{\gtrless}} \ Q(X). \tag{8}$$
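
As a numerical sanity check on (7) and (8) (a sketch under the same assumed Gaussian model as above, with a deliberately non-uniform prior), the MAP rule should achieve a lower empirical error rate than the ML rule whenever p ≠ 1/2:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, n = 0.7, 200_000                              # assumed prior and sample size

H = (rng.random(n) > p).astype(int)              # H = 0 w.p. p
X = rng.normal(loc=H.astype(float), scale=1.0)   # assumed P = N(0,1), Q = N(1,1)

# MAP rule (7): guess H = 1 iff p * P(x) < (1 - p) * Q(x).
map_guess = (p * norm.pdf(X, 0.0, 1.0) < (1 - p) * norm.pdf(X, 1.0, 1.0)).astype(int)

# ML rule (8): the uniform-prior special case, guess H = 1 iff P(x) < Q(x).
ml_guess = (norm.pdf(X, 0.0, 1.0) < norm.pdf(X, 1.0, 1.0)).astype(int)

print(f"empirical error of MAP rule: {np.mean(map_guess != H):.4f}")
print(f"empirical error of ML rule:  {np.mean(ml_guess != H):.4f}")
```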


3 Neyman-Pearson Theory
We next consider the formulation in Section 1.3. Suppose that the likelihood ratio function P/Q : R → (0, ∞) has the property that for every t > 0 in the range of P/Q, its pre-image (P/Q)⁻¹(t) = {x ∈ R : P(x)/Q(x) = t} is a null set, i.e., has Lebesgue measure 0. We impose this somewhat peculiar condition to simplify the exposition of the results in this section. However, Theorem 2, the main result of this section, can be generalized to remove this condition at the expense of the optimal decision rules being randomized.
To derive Theorem 2, we first need to study the power and size of the likelihood ratio tests in (5). For
any η ∈ [0, +∞], define the likelihood ratio test:
$$\forall x \in \mathbb{R}, \quad \hat{H}_\eta(x) \triangleq \begin{cases} 0, & \dfrac{P(x)}{Q(x)} > \eta \\[1.5ex] 1, & \dfrac{P(x)}{Q(x)} \le \eta \end{cases} \tag{9}$$

where the equality case is arbitrarily assigned to the alternative hypothesis.


Lemma 1 (Power and Size of Likelihood Ratio Tests). The power [0, +∞] ∋ η ↦ Q(Ĥη(X) = 1) and size [0, +∞] ∋ η ↦ P(Ĥη(X) = 1) of the likelihood ratio test (9) are continuous and monotone non-decreasing functions that satisfy

$$Q(\hat{H}_0(X) = 1) = P(\hat{H}_0(X) = 1) = 0,$$

$$Q(\hat{H}_{+\infty}(X) = 1) = P(\hat{H}_{+\infty}(X) = 1) = 1.$$
Proof. We prove this for the power; the proofs for size are analogous. Fix any η ∈ [0, +∞]. Clearly, we have
 
$$Q(\hat{H}_\eta(X) = 1) = Q\!\left( \frac{P(X)}{Q(X)} \le \eta \right).$$

So, the map R ∋ η ↦ Q(Ĥη(X) = 1) is the cumulative distribution function (CDF) of the random variable P(X)/Q(X) when X ∼ Q. Hence, by definition of CDFs, we have that [0, +∞) ∋ η ↦ Q(Ĥη(X) = 1) is monotone non-decreasing, right-continuous, and satisfies

$$\lim_{\eta \to +\infty} Q\!\left( \frac{P(X)}{Q(X)} \le \eta \right) = 1.$$
Since P/Q ∈ (0, ∞), we also have

$$Q\!\left( \frac{P(X)}{Q(X)} \le 0 \right) = 0, \qquad Q\!\left( \frac{P(X)}{Q(X)} \le +\infty \right) = 1,$$

where the latter equality implies that the map [0, +∞] 3 η 7→ Q(Ĥη (X) = 1) is (left-)continuous at +∞.
It remains to show that the map (0, +∞) 3 η 7→ Q(Ĥη (X) = 1) is left-continuous. To prove this, for any
η ∈ (0, +∞), notice that
     
$$\begin{aligned} \lim_{\tau \to \eta^-} \underbrace{Q\!\left( \frac{P(X)}{Q(X)} \le \eta \right) - Q\!\left( \frac{P(X)}{Q(X)} \le \tau \right)}_{\ge\, 0} &= \lim_{\tau \to \eta^-} Q\!\left( \tau < \frac{P(X)}{Q(X)} \le \eta \right) \\ &= Q\!\left( \forall \tau < \eta, \ \tau < \frac{P(X)}{Q(X)} \le \eta \right) \\ &= Q\!\left( \frac{P(X)}{Q(X)} = \eta \right) \\ &= 0, \end{aligned}$$
where the second equality follows from the continuity of the probability measure corresponding to Q, and
the last equality holds because {x ∈ R : P (x)/Q(x) = η} is a null set by assumption. This completes the
proof.
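
The CDF interpretation at the heart of this proof is easy to check by simulation. The sketch below (ours, for the assumed Gaussian pair P = N(0, 1), Q = N(1, 1)) estimates the size and power of Ĥη as empirical CDFs of the likelihood ratio P(X)/Q(X) under P and under Q, respectively.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 200_000

def lr(x):
    # Likelihood ratio P(x)/Q(x) for the assumed pair P = N(0,1), Q = N(1,1).
    return norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 1.0, 1.0)

xP = rng.normal(0.0, 1.0, n)   # samples from P
xQ = rng.normal(1.0, 1.0, n)   # samples from Q

for eta in [0.5, 1.0, 2.0]:
    size = np.mean(lr(xP) <= eta)   # P(H_hat_eta(X) = 1): CDF of P/Q under P
    power = np.mean(lr(xQ) <= eta)  # Q(H_hat_eta(X) = 1): CDF of P/Q under Q
    print(f"eta = {eta}: size ~ {size:.3f}, power ~ {power:.3f}")
```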


Using Lemma 1, we can finally establish the key result of Neyman-Pearson theory.
Theorem 2 (Neyman-Pearson “Lemma”). For any α ∈ [0, 1], the Neyman-Pearson function β(α) is achieved
by the likelihood ratio test Ĥη : R → {0, 1} in (9), i.e.,

$$\beta(\alpha) = Q(\hat{H}_\eta(X) = 1),$$

where η ∈ [0, +∞] is chosen such that P (Ĥη (X) = 1) = α.


Proof. We combine pieces of the expositions in [1, Sections 3 and 4] and [2, Section 12] with the classical
Lagrangian approach found in various texts, and fill in the gaps as necessary. Notice that any randomized
decision rule Ĥ, which is a Markov kernel PĤ|X , can be equivalently represented by a function g : R → [0, 1],
g(x) = PĤ|X(1|x), because the statistic Ĥ ∈ {0, 1} is binary. Moreover, the power and size of Ĥ are given by

$$Q(\hat{H}(X) = 1) = \int_{-\infty}^{+\infty} P_{\hat{H}|X}(1|x)\, Q(x)\, dx = \mathbb{E}_Q[g(X)],$$

$$P(\hat{H}(X) = 1) = \int_{-\infty}^{+\infty} P_{\hat{H}|X}(1|x)\, P(x)\, dx = \mathbb{E}_P[g(X)],$$

respectively. So, we seek to solve the variational problem:

$$\beta(\alpha) = \max_{\substack{g : \mathbb{R} \to [0,1] \,:\\ \mathbb{E}_P[g(X)] \le \alpha}} \mathbb{E}_Q[g(X)]. \tag{10}$$

To this end, construct the Lagrangian


$$\mathcal{L}(g, \eta) = \mathbb{E}_Q[g(X)] - \frac{1}{\eta}\, \mathbb{E}_P[g(X)]$$
for any function g : R → [0, 1] and any inverse Lagrange multiplier η ∈ [0, +∞] (with the conventions that
1/∞ = 0, 1/0 = +∞, and ∞ × 0 = 0). For all η ∈ [0, +∞] and all functions g : R → [0, 1], observe that
$$\begin{aligned} \mathcal{L}(g, \eta) &= \int_{-\infty}^{+\infty} g(x) \left( Q(x) - \frac{1}{\eta} P(x) \right) dx \\ &\le \int_{-\infty}^{+\infty} \mathbb{1}\!\left\{ Q(x) - \frac{1}{\eta} P(x) \ge 0 \right\} \left( Q(x) - \frac{1}{\eta} P(x) \right) dx \\ &= \int_{-\infty}^{+\infty} \hat{H}_\eta(x) \left( Q(x) - \frac{1}{\eta} P(x) \right) dx \\ &= \mathcal{L}(\hat{H}_\eta, \eta), \end{aligned}$$

where the inequality holds because we assign the maximum possible value of 1 to g at non-negative portions of Q(x) − (1/η)P(x) and the minimum possible value of 0 to g at strictly negative portions, and the subsequent equality follows from (9). (Note the subtlety here; the function Ĥη defines the Markov kernel of a randomized decision rule that turns out to be precisely the deterministic decision rule Ĥη. In general, if the range of g is {0, 1}, then g defines a randomized decision rule Ĥ that is actually the deterministic decision rule g.) This
implies that
$$\begin{aligned} & \mathbb{E}_Q[g(X)] - \frac{1}{\eta}\, \mathbb{E}_P[g(X)] \ \le\ \mathbb{E}_Q\big[\hat{H}_\eta(X)\big] - \frac{1}{\eta}\, \mathbb{E}_P\big[\hat{H}_\eta(X)\big] \\ \Rightarrow \ & \mathbb{E}_Q[g(X)] + \frac{1}{\eta} \Big( \mathbb{E}_P\big[\hat{H}_\eta(X)\big] - \mathbb{E}_P[g(X)] \Big) \ \le\ \mathbb{E}_Q\big[\hat{H}_\eta(X)\big]. \end{aligned}$$

Thus, for any η ∈ [0, +∞] and any g : R → [0, 1] with EP [g(X)] ≤ EP [Ĥη (X)], we have
$$\mathbb{E}_Q[g(X)] \ \le\ \mathbb{E}_Q\big[\hat{H}_\eta(X)\big].$$


Finally, choose any parameter η ∈ [0, +∞] such that EP [Ĥη (X)] = P (Ĥη (X) = 1) = α. This is possible
because the map

$$[0, +\infty] \ni \eta \mapsto P(\hat{H}_\eta(X) = 1) = P\!\left( \frac{P(X)}{Q(X)} \le \eta \right)$$
is continuous by Lemma 1, and satisfies P (Ĥ0 (X) = 1) = 0 and P (Ĥ+∞ (X) = 1) = 1. Hence, since
α ∈ [0, 1], there exists η ∈ [0, +∞] such that P (Ĥη (X) = 1) = α using the intermediate value theorem. For
this choice of η, we have shown that for every g : R → [0, 1] with EP [g(X)] ≤ α, we have
$$\mathbb{E}_Q[g(X)] \ \le\ \mathbb{E}_Q\big[\hat{H}_\eta(X)\big].$$

(Note that if there exist 0 ≤ η1 < η2 ≤ +∞ such that EP [Ĥηi (X)] = α for i = 1, 2, then EP [Ĥη2 (X)] −
EP [Ĥη1 (X)] = P (η1 < P (X)/Q(X) ≤ η2 ) = 0, which implies that EQ [Ĥη2 (X)] − EQ [Ĥη1 (X)] = Q(η1 <
P (X)/Q(X) ≤ η2 ) = 0 since Q is absolutely continuous with respect to P . So, it does not matter which
threshold ηi we choose.) Therefore, β(α) = EQ [Ĥη (X)] = Q(Ĥη (X) = 1), which completes the proof.

We remark that a variant of Theorem 2 also holds for general pairs of probability measures (P, Q), but
the equality case of the likelihood ratio test must be carefully randomized to achieve the Neyman-Pearson
function in many cases, e.g., for discrete random variables.
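
The intermediate-value argument in the proof also suggests a direct numerical recipe for β(α): bisect on the threshold η until the size of Ĥη hits the target α, then report the power. Below is a sketch (ours) for the assumed Gaussian pair P = N(0, 1), Q = N(1, 1), for which P(x)/Q(x) = exp(1/2 − x) is strictly decreasing, so Ĥη(x) = 1 exactly when x ≥ 1/2 − ln η.

```python
import numpy as np
from scipy.stats import norm

def size(eta):
    # P(H_hat_eta(X) = 1) = P(X >= 1/2 - ln eta) under the assumed P = N(0, 1).
    return 1.0 - norm.cdf(0.5 - np.log(eta))

def power(eta):
    # Q(H_hat_eta(X) = 1) = Q(X >= 1/2 - ln eta) under the assumed Q = N(1, 1).
    return 1.0 - norm.cdf(0.5 - np.log(eta) - 1.0)

def beta(alpha, lo=1e-9, hi=1e9, iters=100):
    # Bisection on a log scale: size(eta) is continuous and non-decreasing
    # (Lemma 1), so the intermediate value theorem guarantees a solution.
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if size(mid) < alpha else (lo, mid)
    return power(np.sqrt(lo * hi))

for alpha in [0.01, 0.05, 0.10, 0.50]:
    print(f"alpha = {alpha:.2f}: beta(alpha) = {beta(alpha):.4f}")
```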

3.1 Properties of Neyman-Pearson Function


Finally, we present several properties of the Neyman-Pearson function to better describe its behavior, cf. [1,
Sections 3 and 4] and [2, Section 12].

Proposition 1 (Properties of Neyman-Pearson Function). Let β : [0, 1] → [0, 1] be the Neyman-Pearson function defined in (4). Then, the following are true:

1. β(0) = 0 and β(1) = 1.

2. β(α) ≥ α for all α ∈ [0, 1].

3. β is monotone non-decreasing.

4. β is concave.

5. The TV distance between the likelihoods satisfies

$$\|P - Q\|_{\mathsf{TV}} = \sup_{\alpha \in [0,1]} \beta(\alpha) - \alpha.$$

Proof.
Part 1: This follows from Lemma 1 and Theorem 2.
Part 2: For any α ∈ [0, 1], consider the randomized decision rule Ĥ ∼ Bernoulli(α), i.e., P(Ĥ = 1) = α =
1 − P(Ĥ = 0), which is independent of X. Then, Q(Ĥ = 1) = P (Ĥ = 1) = α. Hence, by definition (4),
β(α) ≥ α.
Part 3: Consider any α1 , α2 ∈ [0, 1] such that α1 < α2 , and choose corresponding thresholds η1 , η2 ∈
[0, +∞] such that P (Ĥη1 (X) = 1) = α1 and P (Ĥη2 (X) = 1) = α2 . (Such η1 , η2 exist by the argument in the
proof of Theorem 2.) Then, we must have η1 < η2 . Indeed, if we had η1 ≥ η2 , then we would get α1 ≥ α2
using Lemma 1, which is a contradiction. This implies that Q(Ĥη1 (X) = 1) ≤ Q(Ĥη2 (X) = 1) using Lemma
1. Hence, we have β(α1 ) ≤ β(α2 ) using Theorem 2.
Part 4: Consider any α1 , α2 ∈ [0, 1] and choose corresponding thresholds η1 , η2 ∈ [0, +∞] such that
P (Ĥη1 (X) = 1) = α1 and P (Ĥη2 (X) = 1) = α2 . Then, for any λ ∈ [0, 1], construct the randomized decision
rule:

$$\forall x \in \mathbb{R}, \quad \hat{H}(x) = \begin{cases} \hat{H}_{\eta_1}(x) & \text{with probability } \lambda \\[0.5ex] \hat{H}_{\eta_2}(x) & \text{with probability } 1 - \lambda \end{cases}$$

6
CS 57800 Statistical Machine Learning, Prof. Anuran Makur Spring 2022

where the randomness in Ĥ is independent of X. The power and size of this rule are

$$Q(\hat{H}(X) = 1) = \lambda\, Q(\hat{H}_{\eta_1}(X) = 1) + (1 - \lambda)\, Q(\hat{H}_{\eta_2}(X) = 1) = \lambda \beta(\alpha_1) + (1 - \lambda) \beta(\alpha_2),$$

$$P(\hat{H}(X) = 1) = \lambda\, P(\hat{H}_{\eta_1}(X) = 1) + (1 - \lambda)\, P(\hat{H}_{\eta_2}(X) = 1) = \lambda \alpha_1 + (1 - \lambda) \alpha_2,$$

respectively, where we use Theorem 2. So, by definition (4), we must have

β(λα1 + (1 − λ)α2 ) ≥ λβ(α1 ) + (1 − λ)β(α2 ) .
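
The mixing construction in this part is straightforward to verify by simulation; the sketch below (ours, same assumed Gaussian pair as before) checks that the size and power of the λ-mixture of two threshold tests are the corresponding convex combinations.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 500_000, 0.3                   # sample size and mixing weight (assumed)
t1, t2 = 0.0, 1.0                       # two threshold tests H_i(x) = 1{x > t_i}

xP = rng.normal(0.0, 1.0, n)            # assumed P = N(0, 1)
xQ = rng.normal(1.0, 1.0, n)            # assumed Q = N(1, 1)
pick1 = rng.random(n) < lam             # apply test 1 w.p. lambda, else test 2

def mixed_rule(x):
    # Randomized rule: the coin flips are independent of the observation x.
    return np.where(pick1, x > t1, x > t2)

print("size: ", np.mean(mixed_rule(xP)), "vs",
      lam * np.mean(xP > t1) + (1 - lam) * np.mean(xP > t2))
print("power:", np.mean(mixed_rule(xQ)), "vs",
      lam * np.mean(xQ > t1) + (1 - lam) * np.mean(xQ > t2))
```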

Part 5: This result is stated as an exercise in [2, Section 12], so we provide a short proof. Recall from
(10) that

$$\begin{aligned} \sup_{\alpha \in [0,1]} \beta(\alpha) - \alpha &= \sup_{\alpha \in [0,1]} \ \max_{\substack{g : \mathbb{R} \to [0,1] \,:\\ \mathbb{E}_P[g(X)] \le \alpha}} \mathbb{E}_Q[g(X)] - \alpha \\ &= \sup_{\alpha \in [0,1]} \ \max_{\substack{g : \mathbb{R} \to [0,1] \,:\\ \mathbb{E}_P[g(X)] = \alpha}} \mathbb{E}_Q[g(X)] - \alpha \\ &= \sup_{g : \mathbb{R} \to [0,1]} \mathbb{E}_Q[g(X)] - \mathbb{E}_P[g(X)] \\ &= \sup_{h : \mathbb{R} \to [-\frac{1}{2}, \frac{1}{2}]} \mathbb{E}_Q[h(X)] - \mathbb{E}_P[h(X)] \\ &= \|P - Q\|_{\mathsf{TV}}, \end{aligned}$$

where the second equality follows from realizing that the optimal g that achieves (10) satisfies EP [g(X)] = α
(see the proof of Theorem 2), the fourth equality holds because the objective function in the third equal-
ity is invariant to translations of g, and the fifth equality follows from the Kantorovich-Rubinstein dual
characterization of TV distance. This completes the proof.
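
Part 5 can also be verified numerically for the assumed Gaussian pair P = N(0, 1), Q = N(1, 1): the sketch below compares ‖P − Q‖_TV = ½∫|P(x) − Q(x)|dx, computed by quadrature, with the largest gap β(α) − α over likelihood ratio tests; both should be close to 2Φ(1/2) − 1 ≈ 0.3829 for this pair.

```python
import numpy as np
from scipy.stats import norm

# Assumed pair P = N(0, 1), Q = N(1, 1).
x = np.linspace(-10.0, 11.0, 200_001)
dx = x[1] - x[0]
tv = 0.5 * np.sum(np.abs(norm.pdf(x, 0.0, 1.0) - norm.pdf(x, 1.0, 1.0))) * dx

# For this pair, H_hat_eta(x) = 1 iff x >= t with t = 1/2 - ln(eta), so we may
# sweep the threshold t directly: alpha is the size and beta the power.
t = np.linspace(-10.0, 10.0, 100_001)
alpha = 1.0 - norm.cdf(t)            # size under P = N(0, 1)
beta = 1.0 - norm.cdf(t - 1.0)       # power under Q = N(1, 1)

print(f"TV distance (quadrature):      {tv:.6f}")
print(f"sup_alpha beta(alpha) - alpha: {np.max(beta - alpha):.6f}")
```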

References
[1] G. W. Wornell, “Inference and information,” May 2017, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA, Lecture Notes 6.437.

[2] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” May 2019, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA, Lecture Notes 6.441.

[3] Y. Wu, “Information-theoretic methods for high-dimensional statistics,” January 2020, Department of Statistics and Data Science, Yale University, New Haven, CT, USA, Lecture Notes S&DS 677.
