
Forty-Eighth Annual Allerton Conference

Allerton House, UIUC, Illinois, USA


September 29 - October 1, 2010

Arimoto Channel Coding Converse and Rényi Divergence

Yury Polyanskiy and Sergio Verdú

Abstract—Arimoto [1] proved a non-asymptotic upper bound on the probability of successful decoding achievable by any code on a given discrete memoryless channel. In this paper we present a simple derivation of the Arimoto converse based on the data-processing inequality for Rényi divergence. The method has two benefits. First, it generalizes to codes with feedback and gives the simplest proof of the strong converse for the DMC with feedback. Second, it demonstrates that the sphere-packing bound is strictly tighter than the Arimoto converse for all channels, blocklengths and rates, since in fact we derive the latter from the former. Finally, we prove similar results for other (non-Rényi) divergence measures.

Index Terms—Shannon theory, strong converse, information measures, Rényi divergence, feedback.

The authors are with the Department of Electrical Engineering, Princeton University, Princeton, NJ, 08544 USA. E-mail: {ypolyans,verdu}@princeton.edu. The research was supported by the National Science Foundation under Grants CCF-06-35154 and CCF-07-28445.

I. INTRODUCTION

In [1], Arimoto has shown a simple non-asymptotic bound that implies a (strengthening of the) strong converse to the channel coding theorem for the DMC. Moreover, his bound is exponentially tight for rates above capacity.

To state Arimoto's bound, recall that Gallager's E_0(ρ, P_X, P_{Y|X}), ρ ≠ −1, function is defined for a pair of random variables X ∈ A and Y ∈ B as follows:

E_0(ρ, P_X, P_{Y|X}) = −log Σ_{y∈B} ( Σ_{x∈A} P_X(x) P_{Y|X}(y|x)^{1/(1+ρ)} )^{1+ρ}    (1)
= −log E[ ( E[ exp{ i(X̄;Y)/(1+ρ) } | Y ] )^{1+ρ} ] ,    (2)

where the second expression is a generalization to the case of infinite alphabets, where

i(x;y) ≜ log dP_{XY}/d(P_X × P_Y) (x,y) ,    (3)

and the joint distribution of (X̄, Y) is given by

P_{X̄Y}(x̄, y) = P_X(x̄) P_Y(y) .    (4)
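As a numerical illustration, which is not part of the original paper, the following Python/NumPy sketch evaluates (1) directly for a finite-alphabet channel; the BSC crossover probability and the input distribution are assumptions chosen purely for the example.

```python
import numpy as np

def gallager_E0(rho, P_X, W):
    """Gallager's E_0(rho, P_X, P_{Y|X}) of (1), in nats.

    P_X : input distribution, shape (|A|,)
    W   : channel matrix W[x, y] = P_{Y|X}(y|x), shape (|A|, |B|)
    """
    inner = P_X @ W ** (1.0 / (1.0 + rho))      # sum over x, one entry per output y
    return -np.log(np.sum(inner ** (1.0 + rho)))

# illustrative example: BSC with crossover 0.11 and equiprobable input
W = np.array([[0.89, 0.11],
              [0.11, 0.89]])
print(gallager_E0(0.5, np.array([0.5, 0.5]), W))
```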


A random transformation is defined by a pair of measurable spaces of inputs A and outputs B and a conditional probability measure P_{Y|X}: A → B. An (M, ε) code for the random transformation (A, B, P_{Y|X}) is a pair of (possibly randomized) maps f: {1, ..., M} → A (the encoder) and g: B → {1, ..., M} (the decoder), satisfying

(1/M) Σ_{m=1}^{M} P[g(Y) ≠ m | X = f(m)] ≤ ε .    (5)

Without loss of generality we assume that ε ≤ 1 − 1/M. In applications, we will take A and B to be n-fold Cartesian products of alphabets A and B, and a channel to be a sequence of random transformations {P_{Y^n|X^n}: A^n → B^n} [2]. An (M, ε) code for {A^n, B^n, P_{Y^n|X^n}} is called an (n, M, ε) code. For the statement and proof of the main converse bounds, it is preferable not to assume that A and B have any structure such as a Cartesian product. This has the advantage of avoiding the notational clutter that results from explicitly showing the dimension (n) of the random variables taking values on A and B.

Arimoto has shown the following result:

Theorem 1 ([1]): The probability of error ε of any (M, ε) code for the random transformation (A, B, P_{Y|X}) satisfies, for any −1 < ρ < 0,

ε ≥ 1 − M^ρ exp{−E_0(ρ, P_X, P_{Y|X})} ,    (6)

where P_X is the distribution induced on A by the encoder.

Note that the bound (6) applies to an arbitrary codebook. To obtain a universal bound (i.e., one whose right side depends only on M) one needs to take the infimum over all distributions P_X. When the blocklength of the code is large, a direct optimization becomes prohibitively complex. However, the following result resolves this difficulty and makes Theorem 1 especially useful:

Theorem 2 (Gallager-Arimoto): Consider the product channel P_{Y²|X²} given by

P_{Y²|X²}(y_1 y_2 | x_1 x_2) = P_{Y_1|X_1}(y_1|x_1) P_{Y_2|X_2}(y_2|x_2) .    (7)

Then for all −1 < ρ < 0 we have [1]

min_{P_{X²}} E_0(ρ, P_{X²}, P_{Y²|X²}) = min_{P_{X_1}} E_0(ρ, P_{X_1}, P_{Y_1|X_1}) + min_{P_{X_2}} E_0(ρ, P_{X_2}, P_{Y_2|X_2}) .    (8)

Similarly, for ρ > 0 we have [3]

max_{P_{X²}} E_0(ρ, P_{X²}, P_{Y²|X²}) = max_{P_{X_1}} E_0(ρ, P_{X_1}, P_{Y_1|X_1}) + max_{P_{X_2}} E_0(ρ, P_{X_2}, P_{Y_2|X_2}) .    (9)

In other words, the extremum in the left-hand sides of (8) and (9) is achieved by product distributions.
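As an illustration of how Theorem 1 can be evaluated for a multi-letter channel, the following sketch (not part of the paper; the channel, blocklength, rate and the i.i.d. equiprobable input are all assumptions made for the example) computes the right side of (6) for the n-fold extension of a BSC. For a product input on a memoryless channel E_0 is additive over channel uses, which is also the idea behind Theorem 2, so the n-letter E_0 is n times the single-letter value.

```python
import numpy as np

def gallager_E0(rho, P_X, W):
    inner = P_X @ W ** (1.0 / (1.0 + rho))
    return -np.log(np.sum(inner ** (1.0 + rho)))

# hypothetical setup: BSC(0.11), n = 500, rate 0.6 bit/use (capacity is about 0.5 bit/use)
W = np.array([[0.89, 0.11], [0.11, 0.89]])
P_X = np.array([0.5, 0.5])            # assumed per-letter input induced by the codebook
n, R = 500, 0.6                       # R in bits per channel use
logM = n * R * np.log(2)              # log M in nats

best = 0.0
for rho in np.linspace(-0.95, -0.05, 91):
    # n-letter E_0 with an i.i.d. input equals n times the single-letter value
    bound = 1.0 - np.exp(rho * logM - n * gallager_E0(rho, P_X, W))
    best = max(best, bound)
print(best)   # lower bound (6) on the error probability; nontrivial since R > C
```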

Applying Theorems 1 and 2 to the DMC of blocklength n, i.e. the channel (A^n, B^n, (P_{Y|X})^n), one obtains that any (n, exp{nR}, ε) code over the DMC satisfies

ε ≥ 1 − exp{ −n [ sup_{−1<ρ<0} ( min_{P_X} E_0(ρ, P_X, P_{Y|X}) − ρR ) ] } .    (10)

The bound (10) has a number of very useful properties:
1) it is non-asymptotic (i.e. valid for any n ≥ 1),
2) it is universal (i.e., the only data about the code appearing in the right-hand side is the code rate R),
3) it is single-letter (i.e., its computational complexity is independent of the blocklength n),
4) a further analysis, see [1], shows that the exponent is negative for all R > C, thus proving a (strengthening of the) strong converse, which shows that above capacity the minimum probability of error goes to 1 exponentially fast with the blocklength. Moreover, it is known that this lower bound is exponentially tight in the sense that there exists a sequence of codes of rate R achieving the exponent [4, Problem 2.5.16b]. The counterpart in data compression is given by [4, Problem 1.2.6].
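To make item 4) concrete, one can evaluate the bracketed quantity sup_{−1<ρ<0}[min_{P_X} E_0(ρ, P_X, P_{Y|X}) − ρR] numerically; when it is positive, the exponent in (10) is negative and the bound forces ε → 1 exponentially fast. The sketch below is only an illustration under assumed parameters (a BSC, where the inner minimization over P_X reduces to a one-dimensional grid because the input is binary); it is not from the paper.

```python
import numpy as np

def E0(rho, p, W):                                  # Gallager's E_0 of (1), in nats
    return -np.log(np.sum((p @ W ** (1/(1+rho))) ** (1+rho)))

delta = 0.11
W = np.array([[1-delta, delta], [delta, 1-delta]])
R = 0.6 * np.log(2)                                 # 0.6 bit/use > C ~ 0.5 bit/use, in nats
inputs = [np.array([q, 1-q]) for q in np.linspace(0.0, 1.0, 201)]

E_star = max(
    min(E0(rho, p, W) for p in inputs) - rho * R    # min over P_X, then sup over rho
    for rho in np.linspace(-0.99, -0.01, 99)
)
print(E_star)   # should be positive here since R > C, cf. item 4) above
```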
A drawback of the bound (10), severely limiting its use for finite blocklength analysis, is that the right-hand side vanishes for any R ≤ C. In this paper we present a strengthening of the Arimoto bound which overcomes this drawback while retaining all the mentioned advantages. In particular, for R < C it yields a non-trivial exponential lower bound on the probability of error which, although weaker as a bound on the error exponent than the Shannon-Gallager-Berlekamp sphere-packing bound, is much simpler and is applicable to non-discrete channels. We give two different proofs of this result, each having its own benefits. The first proof demonstrates that Arimoto's result is implied by the minimax converse shown in [5]. In particular this implies that for symmetric channels the sphere-packing bound is always tighter than (10), for all rates and blocklengths. The second proof demonstrates that Arimoto's result is a simple consequence of the data-processing inequality for an asymmetric information measure introduced by Sibson [6] and Csiszár [7]. The proof parallels the standard derivation of Fano's inequality and appears to be the simplest known proof of the strong converse for memoryless channels. In particular, no measure concentration inequalities are employed.

The second proof admits an important generalization to the case of codes with feedback. Namely, we show that (10) holds in this exact form for (block) codes with feedback. Although this result is known asymptotically [4, Problem 2.5.16c], the non-asymptotic bound appears to be proven here for the first time.¹ A converse bound valid for all DMCs might prove to be helpful in the ongoing effort of establishing the validity of the sphere-packing exponent for codes with feedback over a general DMC [9], [10].

¹ A similar result can be extracted by a circuitous route from [8], whose proof contains gaps, as pointed out in [9].

Finally, we conclude by showing which of the results generalize to other divergence measures, and which are special to Rényi divergence. A family of bounds obtained by fixing an arbitrary f-divergence includes Fano's inequality (corresponding to relative entropy), the Arimoto converse (corresponding to Rényi divergence) and the Wolfowitz strong converse (e.g., [5, Theorem 9]).

II. ARIMOTO CONVERSE: A PROOF VIA META-CONVERSE

A. Preliminaries

One of the main tools in our treatment [5] is the performance of an optimal binary hypothesis test, defined as follows. Consider a W-valued random variable W which can be distributed according to either of the probability measures P or Q. A randomized test between those two distributions is defined by a random transformation P_{Z|W}: W → {0,1}, where 0 indicates that the test chooses Q. The best performance achievable among those randomized tests is given by²

β_α(P, Q) = min Σ_{w∈W} Q(w) P_{Z|W}(1|w) ,    (11)

where the minimum is over all probability distributions P_{Z|W} satisfying

P_{Z|W}: Σ_{w∈W} P(w) P_{Z|W}(1|w) ≥ α .    (12)

² We write summations over alphabets for simplicity; however, all of our general results hold for arbitrary probability spaces.

The minimum in (11) is guaranteed to be achieved by the Neyman-Pearson lemma. Thus, β_α(P, Q) gives the minimum probability of error under hypothesis Q if the probability of error under hypothesis P is not larger than 1 − α.
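For finite alphabets, β_α(P, Q) in (11)-(12) can be computed exactly by the Neyman-Pearson construction: include the symbols with the largest likelihood ratio P/Q first and randomize on the threshold symbol so that the constraint (12) holds with equality. The following sketch is an illustration in Python/NumPy, not part of the paper.

```python
import numpy as np

def beta_alpha(alpha, P, Q):
    """Minimum Q[Z=1] over tests with P[Z=1] >= alpha, cf. (11)-(12)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    ratio = np.where(Q > 0, P / np.where(Q > 0, Q, 1.0), np.inf)
    order = np.argsort(-ratio)                 # most favorable symbols first
    p_acc, beta = 0.0, 0.0
    for w in order:
        if p_acc + P[w] < alpha:
            p_acc += P[w]
            beta += Q[w]
        else:                                  # randomize on the threshold symbol
            if P[w] > 0:
                beta += max(alpha - p_acc, 0.0) / P[w] * Q[w]
            return beta
    return beta

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
print(beta_alpha(0.9, P, Q))
```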
In [5] we have shown that a number of classical converse bounds, including Fano's inequality, the Shannon-Gallager-Berlekamp bound, the Wolfowitz strong converse and the Verdú-Han information-spectrum converse, can be obtained in a unified manner as consequences of the meta-converse theorem [5, Theorem 26]. One such consequence is the following minimax converse [5]:

Theorem 3 (minimax converse): Every (M, ε) code satisfies

M ≤ sup_{P_X} inf_{Q_Y} 1 / β_{1−ε}(P_{XY}, P_X × Q_Y) ,    (13)

where P_X ranges over all input distributions on A, and Q_Y ranges over all output distributions on B.

The traditional sphere-packing bound for symmetric channels follows from Theorem 3 by choosing Q_Y to be equiprobable on the finite output alphabet. For this reason, Theorem 3 can be viewed as a natural generalization of the sphere-packing bound.

The Rényi divergence for λ > 0, λ ≠ 1 is [11]

D_λ(P||Q) = 1/(λ−1) log E_Q[ (dP/dQ)^λ ] .    (14)
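For finite distributions with Q > 0 everywhere, the expectation in (14) reduces to Σ_w P(w)^λ Q(w)^{1−λ}. A small illustrative sketch (not from the paper; the distributions are arbitrary examples):

```python
import numpy as np

def renyi_div(lam, P, Q):
    """D_lambda(P||Q) of (14) for finite P, Q with Q > 0 (natural logarithm)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.log(np.sum(P ** lam * Q ** (1 - lam))) / (lam - 1)

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.3, 0.3, 0.4])
print(renyi_div(2.0, P, Q))
# as lam -> 1 the value approaches the relative entropy D(P||Q)
print(renyi_div(1.0001, P, Q), np.sum(P * np.log(P / Q)))
```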

Normalization ensures that

lim_{λ→1} D_λ(P||Q) = D(P||Q) ,    (15)

where D(P||Q) is the relative entropy:

D(P||Q) = E_Q[ (dP/dQ) log (dP/dQ) ] .    (16)

Although Rényi divergence is not an f-divergence in the sense of [12], it is a monotone transformation of the Hellinger divergence of order λ, see [13]. Since Hellinger divergence is an f-divergence for all λ > 0, λ ≠ 1, the data-processing inequality automatically follows for Rényi divergence as well. Additionally, we define a conditional Rényi divergence as follows:

D_λ(P_{A|B} || Q_{A|B} | P_B) ≜ 1/(λ−1) log Σ_{b∈B} P_B(b) exp{ (λ−1) D_λ(P_{A|B=b} || Q_{A|B=b}) }    (17)
= 1/(λ−1) log Σ_{b∈B} Σ_{a∈A} P_B(b) P_{A|B}^λ(a|b) Q_{A|B}^{1−λ}(a|b)    (18)
= D_λ(P_B × P_{A|B} || P_B × Q_{A|B}) .    (19)

Two obvious consequences of the definition are the identities

sup_{P_B} D_λ(P_{A|B} || Q_{A|B} | P_B) = sup_{b∈B} D_λ(P_{A|B=b} || Q_{A|B=b})    (20)

and

D_λ(P_{AB} || Q_{AB}) = D_λ(P_B || Q_B) + D_λ(P_{A|B} || Q_{A|B} | P_B^{(λ)}) ,    (21)

where P_B^{(λ)} is the λ-tilting of P_B towards Q_B, given by

P_B^{(λ)}(b) ≜ P_B^λ(b) Q_B^{1−λ}(b) exp{ −(λ−1) D_λ(P_B || Q_B) } .    (22)

The binary Rényi divergence is given by

d_λ(p||q) = D_λ([p 1−p] || [q 1−q])    (23)
= 1/(λ−1) log( p^λ q^{1−λ} + (1−p)^λ (1−q)^{1−λ} ) .    (24)
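The identities above are easy to check numerically. The sketch below, an illustration that is not part of the paper, evaluates the binary divergence (24) and verifies the chain rule (21) together with the λ-tilting (22) on randomly generated finite distributions.

```python
import numpy as np

def renyi(lam, P, Q):                               # D_lambda(P||Q), cf. (14)
    return np.log(np.sum(P**lam * Q**(1 - lam))) / (lam - 1)

def d_binary(lam, p, q):                            # d_lambda(p||q), cf. (23)-(24)
    return renyi(lam, np.array([p, 1 - p]), np.array([q, 1 - q]))

print(d_binary(1.5, 0.9, 0.25))

rng = np.random.default_rng(0)
lam = 1.7
P_AB = rng.random((3, 4)); P_AB /= P_AB.sum()       # joint distributions on a 3x4 alphabet,
Q_AB = rng.random((3, 4)); Q_AB /= Q_AB.sum()       # rows index A, columns index B
P_B, Q_B = P_AB.sum(axis=0), Q_AB.sum(axis=0)
P_AgB, Q_AgB = P_AB / P_B, Q_AB / Q_B               # conditionals P_{A|B}, Q_{A|B}

tilted = P_B**lam * Q_B**(1 - lam); tilted /= tilted.sum()                # (22)
cond = np.log(np.sum(tilted * np.exp((lam - 1) * np.array(
    [renyi(lam, P_AgB[:, b], Q_AgB[:, b]) for b in range(4)])))) / (lam - 1)   # (17)
print(renyi(lam, P_AB.ravel(), Q_AB.ravel()),       # D_lambda(P_AB||Q_AB)
      renyi(lam, P_B, Q_B) + cond)                  # right side of (21): the two agree
```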
B. Main result

Theorem 4: Any (M, ε) code satisfies, for λ > 0, λ ≠ 1,

d_λ(1−ε || 1/M) ≤ λ/(1−λ) E_0(λ^{−1} − 1, P_X, P_{Y|X})    (25)

and in particular for any −1 < ρ < 0 we obtain (6) (letting λ = 1/(1+ρ) > 1).

Proof: The key observation is that Gallager's E_0(ρ, P_X, P_{Y|X}) for ρ > −1 is given by a Rényi divergence of order λ = 1/(1+ρ):

λ/(1−λ) E_0(λ^{−1} − 1, P_X, P_{Y|X}) = D_λ(P_{XY} || P_X × Q*_Y) ,    (26)

where the auxiliary output distribution Q*_Y is defined implicitly via

dQ*_Y/dP_Y (Y) ≜ E[ exp{λ i(X̄;Y)} | Y ]^{1/λ} exp{ E_0((1−λ)/λ, P_X, P_{Y|X}) } ,    (27)

where we adopted the convention (3), (4).

Now from Theorem 3 we know that any (M, ε) code satisfies

β_{1−ε}(P_{XY}, P_X × Q*_Y) ≤ 1/M .    (28)

Applying the data-processing inequality for Rényi divergence we get

d_λ(1−ε || β_{1−ε}(P_{XY}, P_X × Q*_Y)) ≤ D_λ(P_{XY}, P_X × Q*_Y) .    (29)

In view of (28) and 1−ε ≥ 1/M, (29) implies

d_λ(1−ε || 1/M) ≤ D_λ(P_{XY}, P_X × Q*_Y) .    (30)

Application of (26) completes the proof of (25).

To obtain (6) observe the simple inequality

d_λ(1−ε || 1/M) ≥ 1/(λ−1) log( (1−ε)^λ M^{λ−1} + ε^λ ) .    (31)

If we take −1 < ρ < 0 and let λ = 1/(1+ρ) > 1 we can further lower-bound d_λ:

d_λ(1−ε || 1/M) ≥ λ/(λ−1) log(1−ε) + log M ,    (32)

which together with (25) implies (6).

As already mentioned, Theorem 4 extends Arimoto's result. First, as shown, inequality (25) is stronger than (6). Moreover, the family of bounds (25) includes Fano's inequality:

(1−ε) log M − h(ε) ≤ I(X;Y) ,    (33)

which is obtained by taking λ → 1 (see (15)). This does not happen with (6) when ρ → 0. Second, inequality (25) extends (6) to ρ > 0 as follows:

ε ≥ [ exp{ −(1/(1+ρ)) E_0(ρ, P_X, P_{Y|X}) } − M^{−ρ/(1+ρ)} ]^{1+ρ} .    (34)

If we now apply (34) to codes over the DMC of blocklength n and take the infimum over all P_{X^n} via Theorem 2 (similar to the derivation of (10)) we get an exponential lower bound on ε valid for all (n, exp{nR}, ε) codes:

ε ≥ exp{ −n ρ* R + o(n) } ,    (35)

where ρ* is found as a solution to

max_{P_X} E_0(ρ, P_X, P_{Y|X}) = ρR .    (36)

Compared to the derivation of the sphere-packing bound in [14], the bound (35) is much easier to obtain, but, alas, ρ*R is always larger than the sphere-packing exponent. Note also that for R ≥ C the solution is ρ* = 0 and (35) shows that exponentially small probabilities are impossible for such rates, a fact also clear from (10).
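The bound (34) is explicit enough to evaluate directly. The following sketch (an illustration with assumed parameters, not from the paper) computes it for a BSC at a rate below capacity; the maximization over P_X is a one-dimensional grid because the input is binary, and the bracket is clamped at zero where the bound is vacuous.

```python
import numpy as np

def E0(rho, p, W):                                   # Gallager's E_0 of (1), in nats
    return -np.log(np.sum((p @ W ** (1/(1+rho))) ** (1+rho)))

delta, n = 0.11, 200
W = np.array([[1-delta, delta], [delta, 1-delta]])
R = 0.25 * np.log(2)                                 # 0.25 bit/use < C ~ 0.5 bit/use
logM = n * R
inputs = [np.array([q, 1-q]) for q in np.linspace(0.0, 1.0, 201)]

best = 0.0
for rho in np.linspace(0.05, 4.0, 80):
    e0 = max(E0(rho, p, W) for p in inputs)          # universal bound: worst case over P_X
    bracket = np.exp(-n * e0 / (1+rho)) - np.exp(-rho * logM / (1+rho))
    best = max(best, max(bracket, 0.0) ** (1+rho))
print(best)   # nontrivial exponential lower bound on eps for R < C, cf. (34)-(35)
```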


III. A SECOND PROOF OF THEOREM 4

Observe that in the proof of Theorem 4 the inequality (30) holds even if Q*_Y is replaced with an arbitrary measure Q_Y:

d_λ(1−ε || 1/M) ≤ D_λ(P_{XY}, P_X × Q_Y) .    (37)

To obtain the best bound we may minimize the right-hand side over the choice of Q_Y. However, as noted in [7], the identity of Sibson [6]

D_λ(P_{XY} || P_X Q_Y) = D_λ(P_{XY} || P_X Q*_Y) + D_λ(Q*_Y || Q_Y)    (38)

shows that such a minimization does not lead to any improvement, since Q*_Y, defined in (27), is in fact the minimizer:

inf_{Q_Y} D_λ(P_{XY} || P_X Q_Y) = D_λ(P_{XY} || P_X Q*_Y) .    (39)

This leads us naturally to the following asymmetric information measure, introduced by Csiszár in [7]:

K_λ(X;Y) ≜ inf_{Q_Y} D_λ(P_{XY} || P_X Q_Y) .    (40)

In the special case of discrete P_X, (40) was introduced by Sibson in [6]. Using the Sibson identity (38) we obtain the following equivalent expressions for K_λ(X;Y):

K_λ(X;Y) = D_λ(P_{XY} || P_X Q*_Y)    (41)
= λ/(1−λ) E_0(λ^{−1} − 1, P_X, P_{Y|X})    (42)
= λ/(λ−1) log Σ_{y∈B} ( Σ_{x∈A} P_X(x) P_{Y|X}^λ(y|x) )^{1/λ} .    (43)
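The equivalence of (42) and (43) is immediate to verify numerically; the sketch below (an illustration on a randomly generated channel, not from the paper) computes K_λ(X;Y) both ways.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.6
P_X = rng.random(3); P_X /= P_X.sum()
W = rng.random((3, 5)); W /= W.sum(axis=1, keepdims=True)   # random 3-input, 5-output channel

K_closed = lam/(lam - 1) * np.log(np.sum((P_X @ W**lam) ** (1/lam)))     # (43)
rho = 1/lam - 1
E0 = -np.log(np.sum((P_X @ W ** (1/(1+rho))) ** (1+rho)))                # (1) at rho = 1/lam - 1
print(K_closed, lam/(1 - lam) * E0)     # the two expressions in (42)-(43) coincide
```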
Notice that in [15], for the purpose of finding an efficient algorithm for computing sup_{P_X} E_0(ρ, P_X, P_{Y|X}), Arimoto has shown a variational representation for E_0 (and therefore for K_λ; see (42)) different from (40).

An important property of K_λ(X;Y), shown by Csiszár [7], is the following:

sup_{P_X} K_λ(X;Y) = inf_{Q_Y} sup_x D_λ(P_{Y|X=x} || Q_Y) .    (44)

One application of (44) is a direct proof of Theorem 2:

sup_{P_{X_1 X_2}} K_λ(X_1 X_2; Y_1 Y_2)
= inf_{Q_{Y_1 Y_2}} sup_{x_1,x_2} D_λ(P_{Y_1|X_1=x_1} P_{Y_2|X_2=x_2} || Q_{Y_1 Y_2})    (45)
≤ inf_{Q_{Y_1} Q_{Y_2}} sup_{x_1,x_2} D_λ(P_{Y_1|X_1=x_1} P_{Y_2|X_2=x_2} || Q_{Y_1} Q_{Y_2})    (46)
= inf_{Q_{Y_1}} sup_{x_1} D_λ(P_{Y_1|X_1=x_1} || Q_{Y_1}) + inf_{Q_{Y_2}} sup_{x_2} D_λ(P_{Y_2|X_2=x_2} || Q_{Y_2})    (47)
= sup_{P_{X_1}} K_λ(X_1;Y_1) + sup_{P_{X_2}} K_λ(X_2;Y_2) .    (48)

Note that the more cumbersome original proofs of Theorem 2 relied on the Karush-Kuhn-Tucker conditions, which require additional justification in non-discrete settings.

We notice as a side remark that the maximum of K_λ(X;Y) over P_X is known as the capacity of order λ. It was defined in [16] as a maximization of a different information measure (based on Rényi entropy). A simple algorithm for its computation is derived in [15]. In [7] it was shown that the same value for the capacity of order λ is obtained by maximizing two other information measures (based on Rényi divergence), one of them K_λ(X;Y); see also [17].

The next result relates K_λ(X;Y) to a proof of Theorem 4 and also demonstrates a remarkable resemblance between the properties of K_λ(X;Y) and the mutual information I(X;Y).

Theorem 5: For λ > 0, λ ≠ 1 the following holds.
1) The function f_λ(K_λ(X;Y)) is convex in P_X and concave in P_{Y|X}, where f_λ(x) = 1/(λ−1) exp{(1 − λ^{−1})x} is monotonically increasing.
2) For random variables W − X − Y − Z forming a Markov chain the following holds:

K_λ(W;Z) ≤ K_λ(X;Y) .    (49)

3) If X and Y take values in the same set {1, ..., M} and X is equiprobable, then

min_{P_{Y|X}: P[X≠Y]≤ε} K_λ(X;Y) = d_λ(1−ε || 1/M)    (50)

if ε ≤ 1 − 1/M, and the minimum is equal to zero otherwise.

Proof: Property 1 follows by noticing that

f_λ(K_λ(X;Y)) = 1/(λ−1) Σ_{y∈B} ( Σ_{x∈A} P_X(x) P_{Y|X}^λ(y|x) )^{1/λ}    (51)

and applying convexity (concavity) of x^{1/λ} for λ < 1 (λ > 1). The concavity in P_{Y|X} follows from the Minkowski inequality.

To show Property 3, consider an arbitrary P_{XY} with P[X ≠ Y] = s. Then the data-processing inequality for Rényi divergence applied to the transformation (X,Y) → 1{X ≠ Y} shows

D_λ(P_{XY} || P_X Q_Y) ≥ d_λ(1−s || 1/M) .    (52)

Since d_λ(1−s || 1/M) is decreasing in s for all s ≤ 1 − 1/M, we find that

min_{P_{Y|X}: P[X≠Y]≤ε} K_λ(X;Y) ≥ d_λ(1−ε || 1/M)    (53)

provided that ε ≤ 1 − 1/M. On the other hand, the lower bound is achieved by the kernel P_{Y|X} defined as

P_{Y|X}(y|x) = 1−ε if x = y, and ε/(M−1) if x ≠ y .    (54)

The proof of Property 2 is the key step. Notice that because of the asymmetric nature of K_λ(X;Y) we must prove two statements separately:

• "data post-processing": if X − Y − Z form a Markov chain, then

K_λ(X;Z) ≤ K_λ(X;Y) .    (55)

This inequality follows from the following argument. For an arbitrary Q_Y denote

Q_Z(b) = Σ_{y∈B} Q_Y(y) P_{Z|Y}(b|y) .    (56)

Then by the data-processing inequality for Rényi divergence we have

D_λ(P_{XZ} || P_X Q_Z) ≤ D_λ(P_{XY} || P_X Q_Y) .    (57)

Taking the infimum over Q_Y and using the definition of K_λ(X;Z) shows (55).

• "data pre-processing": if W − X − Y form a Markov chain, then

K_λ(W;Y) ≤ K_λ(X;Y) .    (58)

Consider the computation of D_λ(P_{XY} || P_X Q_Y). For a fixed Q_Y the random variable (X,Y) is distributed either as P_{XY} or as P_X Q_Y. Observe that applying the random transformation P_{WY|XY} to (X,Y) we obtain (W,Y) distributed either as P_{WY} or as P_W Q_Y (the Markov property is needed to see that the distribution of W is P_W in the alternative hypothesis). Then by the data-processing inequality for Rényi divergence:

D_λ(P_{WY} || P_W Q_Y) ≤ D_λ(P_{XY} || P_X Q_Y) ,    (59)

which implies (58) after taking the infimum over Q_Y.

Proof of Theorem 4: Notice that an (M, ε) code defines four random variables forming a Markov chain W − X − Y − Ŵ, where W is the message (equiprobable on {1, ..., M}), X is the channel input, Y is the channel output and Ŵ is the decoder estimate of the message W. Then Properties 2 and 3 (Theorem 5) together imply Theorem 4.

Inequality (25) applied to an arbitrary (n, M, ε) code for the channel P_{Y^n|X^n} states that

d_λ(1−ε || 1/M) ≤ K_λ(X^n; Y^n) ,    (60)

where X^n has the distribution induced by the encoder. Maximizing the right-hand side of (60) over all P_{X^n} is particularly simple for memoryless channels, since when P_{Y^n|X^n} = (P_{Y|X})^n, then by (48) we have

sup_{P_{X^n}} K_λ(X^n; Y^n) = n sup_{P_X} K_λ(X;Y)    (61)

and hence from (60) we get the following result:

Corollary 6: Every (n, M, ε) code for a memoryless channel (A^n, B^n, (P_{Y|X})^n) satisfies

d_λ(1−ε || 1/M) ≤ n sup_{P_X} K_λ(X;Y) .    (62)

As explained in Section II, inequality (62) further simplifies to either (10) when λ > 1 or to (35) when λ < 1.

IV. CODES WITH FEEDBACK

In [18] Shannon showed that the capacity of a DMC does not increase even if we allow the encoder to use full noiseless instantaneous feedback. In this section we demonstrate that, moreover, the non-asymptotic bound in Corollary 6 continues to hold even in the setting of Shannon feedback. A precise definition of the feedback code can be found in [4, Problem 2.1.27], for example.

Theorem 7: Every (n, M, ε) feedback code for a memoryless channel (A^n, B^n, (P_{Y|X})^n) satisfies (62) and in particular (10).

Proof: Take an arbitrary (n, M, ε) feedback code. Then it induces a certain joint distribution on (W, Y^n) according to

P_{WY^n}(w, y^n) = (1/M) Π_{i=1}^{n} P_{Y|X}(y_i | f_i(w, y^{i−1})) ,    (63)

where f_i: {1, ..., M} × B^{i−1} → A, i = 1, ..., n, are the encoder maps. The decoder estimate Ŵ is obtained as a (possibly randomized) function of Y^n and therefore W − Y^n − Ŵ form a Markov chain. By Theorem 5 (Property 2) we have

K_λ(W; Ŵ) ≤ K_λ(W; Y^n) ,    (64)

and by Theorem 5 (Property 3) we have

K_λ(W; Ŵ) ≥ 1/(λ−1) log( (1−ε)^λ M^{λ−1} + ε^λ ) .    (65)

To conclude the proof we need to show that

K_λ(W; Y^n) ≤ n sup_{P_X} K_λ(X;Y) .    (66)

To that end consider the following chain:

K_λ(W; Y^n) ≜ inf_{Q_{Y^n}} D_λ(P_{WY^n} || P_W Q_{Y^n})    (67)
= inf_{Q_{Y^n}} [ D_λ(P_{WY^{n−1}} || P_W Q_{Y^{n−1}}) + D_λ(P_{Y_n|Y^{n−1}W} || Q_{Y_n|Y^{n−1}} | P^{(λ)}_{WY^{n−1}}) ]    (68)
= inf_{Q_{Y^{n−1}}} D_λ(P_{WY^{n−1}} || P_W Q_{Y^{n−1}}) + inf_{Q_{Y_n|Y^{n−1}}} D_λ(P_{Y_n|Y^{n−1}W} || Q_{Y_n|Y^{n−1}} | P^{(λ)}_{WY^{n−1}})    (69)
≤ inf_{Q_{Y^{n−1}}} D_λ(P_{WY^{n−1}} || P_W Q_{Y^{n−1}}) + inf_{Q_{Y_n}} D_λ(P_{Y_n|Y^{n−1}W} || Q_{Y_n} | P^{(λ)}_{WY^{n−1}})    (70)
≤ inf_{Q_{Y^{n−1}}} D_λ(P_{WY^{n−1}} || P_W Q_{Y^{n−1}}) + inf_{Q_{Y_n}} sup_{x∈A} D_λ(P_{Y_n|X_n=x} || Q_{Y_n})    (71)
= inf_{Q_{Y^{n−1}}} D_λ(P_{WY^{n−1}} || P_W Q_{Y^{n−1}}) + sup_{P_X} K_λ(X;Y)    (72)
= K_λ(W; Y^{n−1}) + sup_{P_X} K_λ(X;Y) ,    (73)

where (68) is by (21), (69) follows since the first term does not depend on Q_{Y_n|Y^{n−1}}, (70) follows by restricting the infimum to Q_{Y_n|Y^{n−1}} = Q_{Y_n}, (71) is by (20), (72) is by (44), and (73) is by the definition of K_λ in (40). The proof of (66) now follows from (73) by induction.

V. GENERALIZATION TO OTHER DIVERGENCE MEASURES

Notice that the key Properties 2 and 3 of K_λ needed for the proof of Theorem 4 also hold (with the same proof) if the Rényi divergence D_λ in (40) is replaced by any other function of a pair of distributions satisfying the data-processing inequality; for example, any f-divergence works as well. This section formalizes this idea.

First, consider a measurable space W, a pair of distributions P and Q on it, and a transition probability kernel P_{W'|W} from W to W. Applying P_{W'|W} to P and Q we obtain a pair of distributions P' and Q':

P'(w') = Σ_{w∈W} P_{W'|W}(w'|w) P(w)    (74)
Q'(w') = Σ_{w∈W} P_{W'|W}(w'|w) Q(w) .    (75)

Definition 1: A function D(P||Q) assigning an extended real number to a pair of distributions is called a generalized divergence, or a g-divergence, if for any P_{W'|W} we have

D(P'||Q') ≤ D(P||Q) .    (76)

Note that restricting transformations to those mapping W to W is made without loss of generality, as we can consider the space W to be rich enough to contain copies of any A and B considered in the given problem; therefore, the function D satisfies the data-processing inequality with respect to transformations from A to B as well.

Examples of g-divergences:
• All f-divergences [12], [17], in particular total variation, relative entropy and Hellinger divergence [13].
• Rényi divergence; note that it is a non-decreasing function of the Hellinger divergence.
• −β_α(P, Q) for any 0 ≤ α ≤ 1. This example shows that the class of g-divergences is larger than just non-decreasing functions of f-divergences, since −β_α(P, Q) cannot be obtained from any f-divergence.³

³ Assume otherwise; then we would have (see Theorem 8, Property 4) that
inf_{P_X} β_α(P_{XY} || P_X Q_Y) = inf_{x∈A} β_α(P_{Y|X=x}, Q_Y) ,    (77)
but it is easy to construct a counter-example where this does not hold.

For any g-divergence D(P||Q) we define a binary g-divergence δ(p||q) as the divergence between the distributions on {0,1} given by P(1) = p and Q(1) = q; formally,

δ(p||q) = D([p 1−p] || [q 1−q]) .    (78)

Following the approach of Sibson [6] and Csiszár [7], for any g-divergence we define an information measure

K(X;Y) ≜ inf_{Q_Y} D(P_{XY} || P_X Q_Y) .    (79)

The following theorem summarizes the results that can be obtained by the same methods as above.

Theorem 8: Consider a g-divergence D(P||Q). Then all of the following hold:
1) Any (M, ε) code for the random transformation (A, B, P_{Y|X}) satisfies

δ(1−ε || 1/M) ≤ sup_{P_X} inf_{Q_Y} D(P_{XY} || P_X Q_Y)    (80)
= sup_{P_X} K(X;Y)    (81)
≤ inf_{Q_Y} sup_{P_X} D(P_{XY} || P_X Q_Y) .    (82)

2) For random variables W − X − Y − Z forming a Markov chain the following holds:

K(W;Z) ≤ K(X;Y) .    (83)

3) If X and Y take values in the same set {1, ..., M} and X is equiprobable, then

min_{P_{Y|X}: P[X≠Y]≤ε} K(X;Y) = δ(1−ε || 1/M)    (84)

if ε ≤ 1 − 1/M, and the minimum is equal to δ(1/M || 1/M) otherwise.

4) If D(P||Q) is an f-divergence, then we have equality in (82) and

sup_{P_X} D(P_{XY} || P_X Q_Y) = sup_{x∈A} D(P_{Y|X=x} || Q_Y) .    (85)

In particular, for K(X;Y) we have

sup_{P_X} K(X;Y) = inf_{Q_Y} sup_{x∈A} D(P_{Y|X=x} || Q_Y) .    (86)

Remark: What this theorem shows is that many of the properties of D_λ are common to all g-divergences. However, what makes D_λ special is additivity under products:

D_λ(P_1 P_2 || Q_1 Q_2) = D_λ(P_1 || Q_1) + D_λ(P_2 || Q_2) ,    (87)

which results in identities like (38) and (21), and in turn in single-letter bounds like (10).

Proof: Notice that any hypothesis test between P_{XY} and P_X Q_Y is a random transformation from A × B to {0,1}. Applying the data-processing property for D we get that any test attaining probabilities of success 1−ε and 1−β over P_{XY} and P_X × Q_Y, respectively, must satisfy

δ(1−ε || β_{1−ε}) ≤ D(P_{XY}, P_X × Q_Y) .    (88)

Note that the data-processing property implies that whenever p ≤ p' ≤ q we have

δ(p'||q) ≤ δ(p||q)    (89)

and a similar monotonicity holds in the second argument. Since by Theorem 3, β_{1−ε} ≤ 1/M and 1/M ≤ 1−ε by assumption, we have from (88):

δ(1−ε || 1/M) ≤ D(P_{XY}, P_X × Q_Y) .    (90)

Therefore, taking first the infimum over all Q_Y and then the supremum over all P_X we get (80). Then (81) is by definition (79), and (82) is obvious.

1332

Authorized licensed use limited to: ETH BIBLIOTHEK ZURICH. Downloaded on June 28,2022 at 10:06:58 UTC from IEEE Xplore. Restrictions apply.
Proofs of (83) and (84) are exact repetitions of the proofs of Properties 2 and 3 in Theorem 5, since there we have not used any special properties of the Rényi divergence except the data-processing property.

Finally, when D(P||Q) is an f-divergence, D(P_{XY} || P_X Q_Y) is linear in P_X and convex in Q_Y. Thus the equality in (82) follows from the minimax theorem by interchanging sup and inf, exactly as explained by Csiszár [7] in the proof of (44). (85) follows from the linearity of D(P_{XY} || P_X Q_Y) in P_X. Finally, (86) follows from (85) and the equality in (82).

Remark: Examples of the application of Theorem 8 (Property 1) include:
• Fano's inequality: take D to be the relative entropy.
• Theorem 4: take D to be the Rényi divergence D_λ.
• The Wolfowitz strong converse, e.g. [5, Theorem 9]: take D to be the f-divergence appearing in the DT bound [5, (78)], f = |x − γ|⁺.

If we apply Theorem 8 with the g-divergence −β_α(P, Q), 0 ≤ α ≤ 1, we get the following (equivalent) form of Theorem 3:

Corollary 9: Every (M, ε) code satisfies for all 0 ≤ α ≤ 1:

inf_{P_X} sup_{Q_Y} β_α(P_{XY} || P_X Q_Y) ≤ α/(M(1−ε)) + (1/ε)( 1 − 1/(M(1−ε)) ) |α − 1 + ε|⁺ ,    (91)

where P_X ranges over all input distributions on A, and Q_Y ranges over all output distributions on B.

Taking α = 1−ε in (91) one recovers Theorem 3. The additional benefit of stating the minimax problem in this form is that it demonstrates that, to bound the cardinality of a code for a given ε, it is not required to evaluate β_α at α = 1−ε. In fact, determining the value of β_α for any α sufficiently close to 1−ε also works. This is useful when β_α is computed via the Neyman-Pearson lemma.
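The right side of (91) is elementary to evaluate; the short sketch below (with illustrative numbers that are not from the paper) shows in particular that at α = 1−ε it reduces to 1/M, recovering Theorem 3.

```python
M, eps = 64, 0.1
for alpha in (0.85, 1 - eps, 0.95):
    rhs = alpha / (M * (1 - eps)) \
          + (1 / eps) * (1 - 1 / (M * (1 - eps))) * max(alpha - 1 + eps, 0.0)
    print(alpha, rhs, 1 / M)     # at alpha = 1 - eps the right side of (91) equals 1/M
```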

REFERENCES

[1] S. Arimoto, "On the converse to the coding theorem for discrete memoryless channels," IEEE Trans. Inf. Theory, vol. 19, no. 3, pp. 357–359, May 1973.
[2] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. Inf. Theory, vol. 40, no. 4, pp. 1147–1157, 1994.
[3] R. G. Gallager, "A simple derivation of the coding theorem and some applications," IEEE Trans. Inf. Theory, vol. 11, no. 1, pp. 3–18, 1965.
[4] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.
[5] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[6] R. Sibson, "Information radius," Z. Wahrscheinlichkeitstheorie und Verw. Geb., vol. 14, pp. 149–161, 1969.
[7] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," IEEE Trans. Inf. Theory, vol. 41, no. 1, pp. 26–34, Jan. 1995.
[8] A. Y. Sheverdyaev, "Lower bound for error probability in a discrete memoryless channel with feedback," Prob. Peredachi Inform., vol. 18, no. 4, pp. 5–15, 1982.
[9] G. Como and B. Nakiboğlu, "Sphere-packing bound for block-codes with feedback and finite memory," in Proc. 2010 IEEE Int. Symp. Inf. Theory (ISIT), Austin, TX, USA, Jun. 2010.
[10] H. Palaiyanur and A. Sahai, "An upper bound for the block coding error exponent with delayed feedback," in Proc. 2010 IEEE Int. Symp. Inf. Theory (ISIT), Austin, TX, USA, Jun. 2010.
[11] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symp. Mathematics, Statistics, and Probability, vol. 1, Berkeley, CA, USA, 1961, pp. 547–561.
[12] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observation," Studia Sci. Math. Hungar., vol. 2, pp. 229–318, 1967.
[13] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4394–4412, 2006.
[14] C. E. Shannon, R. G. Gallager, and E. R. Berlekamp, "Lower bounds to error probability for coding on discrete memoryless channels I," Inf. Contr., vol. 10, pp. 65–103, 1967.
[15] S. Arimoto, "Computation of random coding exponent functions," IEEE Trans. Inf. Theory, vol. 22, no. 6, pp. 665–671, Nov. 1976.
[16] S. Arimoto, "Information measures and capacity of order α for discrete memoryless channels," in Topics in Information Theory, ser. Colloq. Math. Soc. J. Bolyai 16, I. Csiszár and P. Elias, Eds. Amsterdam: North Holland, 1977, pp. 41–52.
[17] I. Csiszár, "Axiomatic characterizations of information measures," Entropy, vol. 10, no. 3, pp. 261–273, 2008.
[18] C. E. Shannon, "The zero error capacity of a noisy channel," IRE Trans. Inform. Theory, vol. 2, no. 3, pp. 8–19, Sep. 1956.

