Cooperative Channel Capacity Learning


Nunzio A. Letizia, Andrea M. Tonello, Senior Member, IEEE, and H. Vincent Poor, Life Fellow, IEEE

arXiv:2305.13493v1 [cs.IT] 22 May 2023

N. A. Letizia and A. M. Tonello are with Universität Klagenfurt, Institute of Networked and Embedded Systems, Klagenfurt, 9020, Austria (e-mail: nunzio.letizia@aau.at, andrea.tonello@aau.at). H. V. Poor is with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, USA (e-mail: poor@princeton.edu).
The work of N. A. Letizia was supported in part by the mobility funding scheme of the University of Klagenfurt. The work of H. V. Poor was supported in part by the U.S. National Science Foundation under Grants CCF-1908308 and CNS-2128448.

Abstract—In this paper, the problem of determining the capacity of a communication channel is formulated as a cooperative game, between a generator and a discriminator, that is solved via deep learning techniques. The task of the generator is to produce channel input samples for which the discriminator ideally distinguishes conditional from unconditional channel output samples. The learning approach, referred to as cooperative channel capacity learning (CORTICAL), provides both the optimal input signal distribution and the channel capacity estimate. Numerical results demonstrate that the proposed framework learns the capacity-achieving input distribution under challenging non-Shannon settings.

Index Terms—channel capacity, capacity-achieving distribution, deep learning, capacity learning.

I. INTRODUCTION

Data-driven models for physical layer communications [1] have recently seen a surge of interest. While the benefit of a deep learning (DL) approach is evident for tasks like signal detection [2], decoding [3] and channel estimation [4], [5], the fundamental problem of estimating the capacity of a channel remains elusive. Despite recent attempts via neural mutual information estimation [6]–[8], it is not clear yet whether DL can provide novel insights.

For a discrete-time continuous memoryless vector channel, the capacity is defined as

$$C = \max_{p_X(x)} I(X;Y), \quad (1)$$

where pX(x) is the input signal probability density function (PDF), X and Y are the channel input and output random vectors, respectively, and I(X;Y) is the mutual information (MI) between X and Y. The channel capacity problem involves both determining the capacity-achieving distribution and evaluating the maximum achievable rate. Only a few special cases, e.g., additive noise channels with specific noise distributions under input power constraints, have been solved so far. When the channel is not an additive noise channel, analytical approaches become mostly intractable, leading to numerical solutions, relaxations [9], capacity lower and upper bounds [10], and considerations on the support of the capacity-achieving distribution [11], [12]. It is known that the capacity of a discrete memoryless channel can be computed using the Blahut-Arimoto (BA) algorithm [13], whereas a particle-based BA method was proposed in [14] to tackle the continuous case, although it fails to scale to high-dimension vector channels.

In the following section, we show that a data-driven approach can be pursued to obtain the channel capacity. In particular, we propose to learn the capacity and the capacity-achieving distribution of any discrete-time continuous memoryless vector channel via a cooperative framework referred to as CORTICAL. The framework is inspired by generative adversarial networks (GANs) [15] but it can be interpreted as a dual version using an appropriately defined value function. In fact, CORTICAL comprises two blocks cooperating with each other: a generator that learns to sample from the capacity-achieving distribution, and a discriminator that learns to differentiate paired channel input-output samples from unpaired ones, i.e., it distinguishes the joint PDF pXY(x, y) from the product of marginals pX(x)pY(y), so as to estimate the MI.

The paper is organized as follows: Sec. II describes the main idea and contribution. The parametric implementation and training algorithm are discussed in Sec. III. Sec. IV presents numerical results for different types of channels. Sec. V concludes the paper.

II. COOPERATIVE PRINCIPLE FOR CAPACITY LEARNING

To understand the working principle of CORTICAL and why it can be interpreted as a cooperative approach, it is useful to briefly review how GANs operate.

In the GAN framework, the adversarial training procedure for the generator G consists of maximizing the probability of the discriminator D making a mistake. If x ∼ pdata(x) are the data samples and z ∼ pZ(z) are the latent samples, the Nash equilibrium is reached when the minimization over G and the maximization over D of the value function

$$\mathcal{V}(G,D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr] + \mathbb{E}_{z \sim p_Z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr] \quad (2)$$

is attained, so that G(z) ∼ pdata(x). Concisely, the minimization over G forces the generator to implicitly learn the distribution that minimizes the given statistical distance.

Conversely to GANs, the channel capacity estimation problem requires the generator to learn the distribution maximizing the mutual information measure. Therefore, the generator and discriminator need to play a cooperative max-max game with the objective for G to produce channel input samples for which D exhibits the best performance in distinguishing (in the Kullback-Leibler sense) paired and unpaired channel input-output samples. Thus, the discriminator of CORTICAL is fed with both the generator and channel's output samples, x and y, respectively. Fig. 1 illustrates the proposed cooperative framework and the value function used for the cooperative training, as discussed next.
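To make the discriminator's two input streams concrete, the following minimal NumPy sketch builds paired samples (drawn from the joint pXY) and unpaired samples (drawn from the product of marginals) via a random derangement; the scalar AWGN channel, batch size, and helper names are illustrative assumptions, not taken from the paper or its released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def derangement(m, rng):
    """Random permutation of 0..m-1 with no fixed points, so x and the shuffled y are unpaired."""
    while True:
        perm = rng.permutation(m)
        if not np.any(perm == np.arange(m)):
            return perm

m = 512                                        # batch size (illustrative)
x = rng.normal(size=(m, 1))                    # channel input samples, x ~ p_X
y = x + rng.normal(size=(m, 1))                # channel output samples; here an AWGN channel y = x + n

paired = np.concatenate([x, y], axis=1)                         # samples of the joint p_XY(x, y)
unpaired = np.concatenate([x, y[derangement(m, rng)]], axis=1)  # samples of p_X(x) p_Y(y)

# The discriminator receives both streams and must tell them apart, which amounts to
# estimating the density ratio p_XY / (p_X p_Y) that carries the mutual information.
print(paired.shape, unpaired.shape)
```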

Fig. 1: CORTICAL, Cooperative framework for capacity learning: a generator produces input samples with distribution pX (x) and a
discriminator attempts to distinguish between paired and unpaired channel input-output samples.

Theorem 1. Let X ∼ pX(x) and Y|X ∼ pY|X(y|x) be the vector channel input and conditional output, respectively. Let Y = H(X), with H(·) being the stochastic channel model, and let π(·) be a permutation function¹ over the realizations of Y such that pπ(Y)(π(y)|x) = pY(y). In addition, let G and D be two functions in the non-parametric limit such that X = G(Z), with Z ∼ pZ(z) being a latent random vector. If Jα(G, D), α > 0, is the value function defined as

$$J_\alpha(G,D) = \alpha \cdot \mathbb{E}_{z \sim p_Z(z)}\bigl[\log D\bigl(G(z), H(G(z))\bigr)\bigr] - \mathbb{E}_{z \sim p_Z(z)}\bigl[D\bigl(G(z), \pi(H(G(z)))\bigr)\bigr], \quad (3)$$

then the channel capacity C is the solution of

$$C = \max_G \max_D \frac{J_\alpha(G,D)}{\alpha} + 1 - \log(\alpha), \quad (4)$$

and x = G*(z) are samples from the capacity-achieving distribution, where

$$G^* = \arg\max_G \max_D J_\alpha(G,D). \quad (5)$$

Proof. The value function in (3) can be written as

$$J_\alpha(G,D) = \alpha \cdot \mathbb{E}_{(x,y) \sim p_{XY}(x,y)}\bigl[\log D(x,y)\bigr] - \mathbb{E}_{(x,y) \sim p_X(x)p_Y(y)}\bigl[D(x,y)\bigr]. \quad (6)$$

Given a generator G that maps the latent (noise) vector Z into X, we first need to prove that Jα(G, D) is maximized for

$$D(x,y) = D^*(x,y) = \alpha\, \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}. \quad (7)$$

Using the Lebesgue integral to compute the expectation,

$$J_\alpha(G,D) = \int_{y}\int_{x}\Bigl[\alpha\, p_{XY}(x,y)\log\bigl(D(x,y)\bigr) - p_X(x)p_Y(y)\, D(x,y)\Bigr]\,\mathrm{d}x\,\mathrm{d}y, \quad (8)$$

taking the derivative of the integrand with respect to D and setting it to 0 yields the following equation in D:

$$\alpha\, \frac{p_{XY}(x,y)}{D(x,y)} - p_X(x)p_Y(y) = 0, \quad (9)$$

whose solution is the optimum discriminator

$$D^*(x,y) = \alpha\, \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}. \quad (10)$$

In particular, Jα(D*) is a maximum since the second derivative of the integrand, $-\alpha\, p_{XY}(x,y)/D^2(x,y)$, is a non-positive function. Therefore, substituting D*(x, y) in (8) yields

$$J_\alpha(G,D^*) = \int_{y}\int_{x}\Bigl[\alpha\, p_{XY}(x,y)\log\Bigl(\alpha\,\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\Bigr) - p_X(x)p_Y(y)\,\alpha\,\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\Bigr]\,\mathrm{d}x\,\mathrm{d}y. \quad (11)$$

Now, it is simple to recognize that the second term on the right-hand side of the above equation is equal to −α. Thus,

$$J_\alpha(G,D^*) = \alpha \cdot \mathbb{E}_{(x,y)\sim p_{XY}(x,y)}\Bigl[\log\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\Bigr] + \alpha\log(\alpha) - \alpha = \alpha\bigl(I(X;Y) + \log(\alpha) - 1\bigr). \quad (12)$$

Finally, the maximization over the generator G results in the mutual information maximization, since α is a positive constant. We therefore obtain

$$\max_G J_\alpha(G,D^*) + \alpha - \alpha\log(\alpha) = \alpha \max_{p_X(x)} I(X;Y). \quad (13)$$

¹The permutation takes place over the temporal realizations of the vector Y so that π(Y) and X can be considered statistically independent vectors.

Theorem 1 states that at the equilibrium, the generator of CORTICAL has implicitly learned the capacity-achieving distribution with the mapping x = G*(z). In contrast to the BA algorithm, here the generator samples directly from the optimal input distribution pX(x) rather than explicitly modelling it. No assumptions on the input distribution's nature are made. Moreover, we have access to the channel capacity directly from the value function used for training as follows:

$$C = \frac{J_\alpha(G^*,D^*)}{\alpha} + 1 - \log(\alpha). \quad (14)$$
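As an independent sanity check of (7)–(14), one can bypass the learning problem and plug the closed-form optimal discriminator into the value function for a channel whose capacity is known. The sketch below does this by Monte Carlo for a scalar Gaussian channel with a Gaussian input of power P and unit noise variance, for which I(X;Y) = ½ log(1 + P) nats; the channel, the value of α, and the sample size are our own illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
P, alpha, m = 2.0, 1.0, 200_000                  # input power, alpha, Monte Carlo sample size (illustrative)

x = rng.normal(scale=np.sqrt(P), size=m)         # X ~ N(0, P)
y = x + rng.normal(size=m)                       # Y = X + N with N ~ N(0, 1)
y_tilde = y[rng.permutation(m)]                  # unpaired outputs, distributed as p_Y

def log_ratio(x, y):
    """log p_XY/(p_X p_Y) = log p_{Y|X}(y|x) - log p_Y(y) for this Gaussian channel."""
    log_cond = -0.5 * (y - x) ** 2 - 0.5 * np.log(2 * np.pi)
    log_marg = -0.5 * y ** 2 / (1 + P) - 0.5 * np.log(2 * np.pi * (1 + P))
    return log_cond - log_marg

D_paired = alpha * np.exp(log_ratio(x, y))           # optimal discriminator (7)/(10) on joint samples
D_unpaired = alpha * np.exp(log_ratio(x, y_tilde))   # ... on product-of-marginals samples

J = alpha * np.mean(np.log(D_paired)) - np.mean(D_unpaired)   # value function (3)/(6) at (G, D*)
C_hat = J / alpha + 1 - np.log(alpha)                          # capacity estimate from (14)
print(C_hat, 0.5 * np.log(1 + P))                              # both in nats; they should nearly coincide
```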
In the following, we propose to parametrize G and D with neural networks and we explain how to train CORTICAL.

III. PARAMETRIC IMPLEMENTATION

It can be shown (see Sec. 4.2 in [15]) that the alternating training strategy described in Alg. 1 converges to the optimal G and D under the assumption of having enough capacity and training time. Practically, instead of optimizing over the space of functions G and D, it is reasonable to model both the generator and the discriminator with neural networks (G, D) = (GθG, DθD) and optimize over their parameters θG and θD (see Alg. 1). We consider the distribution of the source pZ(z) to be a multivariate normal distribution with independent components. The function π(·) implements a random derangement of the batch y and it is used to obtain unpaired samples. We use simple neural network architectures and we execute K = 10 discriminator training steps for every generator training step. Details of the architecture and implementation are reported in GitHub [16]. It should be noted that in Th. 1, pX(x) can be subject to certain constraints, e.g., peak and/or average power. Such constraints are met by the design of G, for instance using a batch normalization layer. Alternatively, we can impose peak and/or average power constraints in the estimation of capacity by adding regularization terms (hinge loss) in the generator part of the value function (3), as in constrained optimization problems, e.g., Sec. III of [17]. Specifically, the value function becomes

$$J_\alpha(G,D) = \alpha \cdot \mathbb{E}_{z\sim p_Z(z)}\bigl[\log D\bigl(G(z),H(G(z))\bigr)\bigr] - \mathbb{E}_{z\sim p_Z(z)}\bigl[D\bigl(G(z),\pi(H(G(z)))\bigr)\bigr] - \lambda_A \max\bigl(\|G(z)\|_2^2 - A^2, 0\bigr) - \lambda_P \max\bigl(\mathbb{E}\bigl[\|G(z)\|_2^2\bigr] - P, 0\bigr), \quad (15)$$

with λA and λP equal to 0 or 1.

Algorithm 1 Cooperative Channel Capacity Learning
1: Inputs: N training steps, K discriminator steps, α.
2: for n = 1 to N do
3:   for k = 1 to K do
4:     Sample a batch of m noise samples {z(1), ..., z(m)} from pZ(z);
5:     Produce a batch of m channel input/output paired samples {(x(1), y(1)), ..., (x(m), y(m))} using the generator GθG and the channel model H;
6:     Shuffle (derangement) y and get input/output unpaired samples {(x(1), ỹ(1)), ..., (x(m), ỹ(m))};
7:     Update the discriminator by ascending its stochastic gradient:
       $$\nabla_{\theta_D} \frac{1}{m}\sum_{i=1}^{m}\Bigl[\alpha \log D_{\theta_D}\bigl(x^{(i)}, y^{(i)}\bigr) - D_{\theta_D}\bigl(x^{(i)}, \tilde{y}^{(i)}\bigr)\Bigr];$$
8:   Sample a batch of m noise samples {z(1), ..., z(m)} from pZ(z);
9:   Update the generator by ascending its stochastic gradient:
       $$\nabla_{\theta_G} \frac{1}{m}\sum_{i=1}^{m}\Bigl[\alpha \log D_{\theta_D}\bigl(G_{\theta_G}(z^{(i)}), H(G_{\theta_G}(z^{(i)}))\bigr) - D_{\theta_D}\bigl(G_{\theta_G}(z^{(i)}), \pi\bigl(H(G_{\theta_G}(z^{(i)}))\bigr)\bigr)\Bigr].$$
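A compact PyTorch sketch of Alg. 1 for a scalar AWGN channel is given below. Here the peak-power constraint is met by the design of G (a tanh output scaled by A), i.e., the first option mentioned above, so λA = λP = 0 in (15); the network sizes, optimizer settings, softplus output of D, and the use of a plain random permutation in place of a strict derangement are illustrative assumptions and do not reproduce the released implementation in [16].

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
alpha, A, N, K, m = 1.0, 2.0, 5000, 10, 256       # value-function alpha, peak amplitude, steps, discr. steps, batch

G = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())      # generator, output in (-1, 1)
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())  # positive discriminator output
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

def channel(x):                                   # stochastic channel model H: here AWGN with unit noise variance
    return x + torch.randn_like(x)

def generate(mb):                                 # x = G(z), scaled so that |x| <= A (peak power met by design of G)
    return A * G(torch.randn(mb, 1))

def value_fn(x, y):                               # empirical J_alpha of (3), with a random shuffle in place of pi
    y_tilde = y[torch.randperm(y.shape[0])]
    return (alpha * torch.log(D(torch.cat([x, y], dim=1))).mean()
            - D(torch.cat([x, y_tilde], dim=1)).mean())

for n in range(N):
    for _ in range(K):                            # K discriminator updates (ascend J_alpha w.r.t. theta_D)
        x = generate(m).detach()
        J = value_fn(x, channel(x))
        opt_D.zero_grad(); (-J).backward(); opt_D.step()
    x = generate(m)                               # one generator update (ascend J_alpha w.r.t. theta_G)
    J = value_fn(x, channel(x))
    opt_G.zero_grad(); (-J).backward(); opt_G.step()
    if n % 500 == 0:                              # running capacity estimate from (14), in nats
        print(n, J.item() / alpha + 1 - float(torch.log(torch.tensor(alpha))))
```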
IV. NUMERICAL RESULTS

To demonstrate the ability of CORTICAL to learn the optimal channel input distribution, we evaluate its performance in three non-standard scenarios: 1) the additive white Gaussian noise (AWGN) channel subject to a peak-power constrained input; 2) an additive non-Gaussian noise channel subject to two different input power constraints; and 3) the Rayleigh fading channel known at both the transmitter and the receiver subject to an average power constraint. These are among the few scenarios for which analysis has been carried out in the literature and thus offer a baseline to benchmark CORTICAL. For both the first and third scenarios, it is known that the capacity-achieving distribution is discrete with a finite set of mass points [11], [12], [18]. For the second scenario, the nature of the input distribution depends on the type of input power constraint [19]. Additional results, including the study of the classical AWGN channel with an average power constraint, are reported in GitHub [16].

Fig. 2: AWGN scalar peak-power constrained channel: a) capacity-achieving distribution learned by CORTICAL, where the marker's radius is proportional to the PMF; b) capacity estimate and comparisons.

A. Peak Power-Limited Gaussian Channels

The capacity of the discrete-time memoryless vector Gaussian noise channel with unit noise variance under a peak-power constraint on the input is defined as [20]

$$C(A) = \sup_{p_X(x): \|X\|_2 \le A} I(X;Y), \quad (16)$$

where pX(x) is the channel input PDF and A² is the upper bound for the input signal peak power. The AWGN Shannon capacity constitutes a trivial upper bound for C(A),

$$C(A) \le \frac{d}{2}\log_2\Bigl(1 + \frac{A^2}{d}\Bigr), \quad (17)$$

while a tighter upper bound for the scalar channel is provided in [10],

$$C(A) \le \min\Bigl\{\log_2\Bigl(1 + \frac{2A}{\sqrt{2\pi e}}\Bigr),\; \frac{1}{2}\log_2\bigl(1 + A^2\bigr)\Bigr\}. \quad (18)$$

For convenience of comparison and as a proof of concept, we focus on the scalar d = 1 and d = 2 channels. The first findings on the capacity-achieving discrete distributions were reported in [11]. For a scalar channel, it was shown in [21] that the input has alphabet {−A, A} with equiprobable values if 0 < A ⪅ 1.6, while it has ternary alphabet {−A, 0, A} if 1.6 ⪅ A ⪅ 2.8. Those results are confirmed by CORTICAL, which is capable of both understanding the discrete nature of the input and learning the support and probability mass function (PMF) for any value of A. Notice that no hypothesis on the input distribution is provided during training.
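These transition points can also be cross-checked independently of CORTICAL with a finely discretized Blahut–Arimoto computation on [−A, A]; this is a separate numerical tool, not part of the proposed framework, and the grid sizes, iteration count, and mass threshold below are arbitrary choices. For A = 1 the mass is expected to cluster near ±A, with an additional cluster near 0 appearing for A = 2.

```python
import numpy as np

def ba_peak_power(A, nx=201, ny=801, iters=3000):
    """Discretized Blahut-Arimoto for the scalar AWGN channel with |X| <= A (unit noise variance)."""
    x = np.linspace(-A, A, nx)                            # candidate input mass points on [-A, A]
    y = np.linspace(-A - 5.0, A + 5.0, ny)                # output grid wide enough to cover the noise
    W = np.exp(-0.5 * (y[None, :] - x[:, None]) ** 2)     # Gaussian kernel proportional to p(y|x)
    W /= W.sum(axis=1, keepdims=True)                     # row-normalize the discretized channel law
    p = np.full(nx, 1.0 / nx)                             # start from a uniform input distribution
    for _ in range(iters):
        q = p[:, None] * W
        q /= q.sum(axis=0, keepdims=True)                 # posterior q(x|y)
        logc = (W * np.log(np.clip(q, 1e-300, None))).sum(axis=1)
        p = np.exp(logc - logc.max())
        p /= p.sum()                                      # BA update of p(x)
    qy = p @ W
    capacity = float((p[:, None] * W * (np.log(W) - np.log(qy[None, :]))).sum())
    return x, p, capacity

for A in (1.0, 2.0):
    x, p, c = ba_peak_power(A)
    print(A, round(c / np.log(2), 3), np.round(x[p > 1e-2], 2))   # capacity in bits and the heaviest mass points
```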
Fig. 2a reports the capacity-achieving input distribution as a function of the peak-power constraint A². The figure illustrates the typical bifurcation structure of the distribution. Fig. 2b shows the channel capacity estimated by CORTICAL with (14) and compares it with the upper bounds known in the literature.

Fig. 3: AWGN d = 2 channel input distributions learned by CORTICAL under a peak-power constraint: a) A = √10; b) A = 5.

Similar considerations can be made for the d = 2 channel under peak-power constraints [20], and in general for the vector Gaussian channel of dimension d [22]. The case d = 2 is considered in Fig. 3a (A = √10) and Fig. 3b (A = 5), where CORTICAL learns optimal bi-dimensional distributions, matching the results in [20]. In this case the amplitude is discrete.

Fig. 4: MIMO channel input distributions learned by CORTICAL for different channels H: a) r2 = 1; b) r2 = 3. The locus of points satisfying ||HX||2 = 1 is also shown.

We now consider the MIMO channel, where the analytical characterization of the capacity-achieving distribution under a peak-power constraint remains mostly an open problem [23]. We analyze the particular case of d = 2:

$$C(H, r) = \sup_{p_X(x): \|HX\|_2 \le r} I(X; HX + N), \quad (19)$$

where N ∼ N(0, I2) and H ∈ R^{2×2} is a given MIMO channel matrix known to both transmitter and receiver. We impose r = 1, and we also impose a diagonal structure H = diag(1, r2) without loss of generality, since diagonalization of the system can be performed. We study two cases: r2 = 1, which is equivalent to a d = 2 Gaussian channel with a unitary peak-power constraint; and r2 = 3, which forces an elliptical peak-power constraint. The former set-up produces an input distribution with discrete magnitude and continuous uniform phase, as shown in Fig. 4a. The latter case produces a binary distribution, as shown in Fig. 4b. To the best of our knowledge, no analytical results are available for 1 < r2 < 2, and thus CORTICAL offers a guiding tool for the identification of capacity-achieving distributions.

B. Additive Non-Gaussian Channels

The nature of the capacity-achieving input distribution remains an open challenge for channels affected by additive non-Gaussian noise. In fact, it is known that under an average power constraint, the capacity-achieving input distribution is discrete, the AWGN channel being the only exception [24]. Results concerning the number of mass points of the discrete optimal input have been obtained in [18], [25]. If the transmitter is subject to an average power constraint, the support is bounded or unbounded depending on the decay rate (slower or faster, respectively) of the noise PDF tail.

In this section, we consider a scalar additive independent Cauchy noise (AICN) channel with scale factor γ, under a specific type of logarithmic power constraint [19]. In particular, given the Cauchy noise PDF C(0, γ),

$$p_N(n) = \frac{1}{\pi\gamma}\,\frac{1}{1 + (n/\gamma)^2}, \quad (20)$$

we are interested in showing that CORTICAL learns the capacity-achieving distribution that solves

$$C(A,\gamma) = \sup_{p_X(x):\, \mathbb{E}\bigl[\log\bigl(\bigl(\frac{A+\gamma}{A}\bigr)^2 + \bigl(\frac{X}{A}\bigr)^2\bigr)\bigr] \le \log(4)} I(X;Y), \quad (21)$$

for a given A ≥ γ. From [19], it is known that under such a power constraint the channel capacity is C(A, γ) = log(A/γ) and the optimal input distribution is continuous with Cauchy PDF C(0, A − γ).

Fig. 5: AICN channel input distributions learned by CORTICAL at different training steps under: a) logarithmic power constraint; b) peak-power constraint.

For illustration purposes, we study the case of γ = 1 and A = 2 and report in Fig. 5a the input distribution obtained by CORTICAL after 100 and 10000 training steps. For the same channel, we also investigate the capacity-achieving distribution under a unitary peak-power constraint. Fig. 5b shows that the capacity-achieving distribution is binary.
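The closed-form benchmark for the logarithmic-constraint case, C(A, γ) = log(A/γ) achieved by a Cauchy C(0, A − γ) input, can be reproduced by direct Monte Carlo; the sketch below (sample size chosen arbitrarily) checks both the constraint in (21) and the mutual information, in nats.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, A, m = 1.0, 2.0, 1_000_000                # scale factor, constraint parameter, sample size (illustrative)

def cauchy_logpdf(z, scale):
    return -np.log(np.pi * scale) - np.log1p((z / scale) ** 2)

x = (A - gamma) * rng.standard_cauchy(m)         # candidate capacity-achieving input C(0, A - gamma) from [19]
n = gamma * rng.standard_cauchy(m)               # Cauchy noise C(0, gamma) as in (20)
y = x + n                                        # the output is then Cauchy C(0, A)

constraint = np.mean(np.log(((A + gamma) / A) ** 2 + (x / A) ** 2))   # left-hand side of the constraint in (21)
mi = np.mean(cauchy_logpdf(y - x, gamma) - cauchy_logpdf(y, A))       # I(X;Y) = E[log p_{Y|X} - log p_Y]

print(constraint, np.log(4.0))                   # the constraint is met with equality (both ~ 1.386)
print(mi, np.log(A / gamma))                     # the rate approaches log(A/gamma) ~ 0.693 nats
```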
C. Fading Channels

Fig. 6: Optimal input U learned by CORTICAL at different training steps.

As the last representative scenario, we study the Rayleigh fading channel subject to an average power constraint P, with input-output relation given by

$$Y = \alpha X + N, \quad (22)$$

where α and N are independent circular complex Gaussian random variables, so that the amplitude of α is Rayleigh-distributed. It is easy to prove that an equivalent model for deriving the amplitude distribution of the capacity-achieving signal is obtained by defining U = |X|σα/σN and V = |Y|²/σN², which are non-negative random variables such that E[U²] ≤ P σα²/σN² ≜ a and whose conditional distribution is

$$p_{V|U}(v|u) = \frac{1}{1+u^2}\exp\Bigl(-\frac{v}{1+u^2}\Bigr). \quad (23)$$

A simplified expression for the conditional PDF is

$$p_{V|S}(v|s) = s \cdot \exp(-sv), \quad (24)$$

where S = 1/(1 + U²) ∈ (0, 1] and E[1/S − 1] ≤ a. From [17], it is known that the optimal input signal amplitude S (and thus U) is discrete and possesses an accumulation point at S = 1 (U = 0). We use CORTICAL to learn the capacity-achieving distribution and verify that it matches (24) for the case a = 1 reported in [17]. Fig. 6 illustrates the evolution of the input distribution for different CORTICAL training steps. Also in this scenario, the generator learns the discrete nature of the input distribution and the values of the PMF.
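The equivalent model (22)–(23) can be verified numerically: for a fixed input amplitude, V = |Y|²/σN² should be exponential with mean 1 + u². The following sketch does a quick Monte Carlo check; the variances and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_a, sigma_n, m = 1.0, 1.0, 1_000_000        # fading and noise standard deviations, sample size (illustrative)

def circular_complex_gaussian(std, size):
    """Zero-mean circular complex Gaussian samples with E[|Z|^2] = std^2."""
    return (std / np.sqrt(2.0)) * (rng.normal(size=size) + 1j * rng.normal(size=size))

for u in (0.0, 0.5, 2.0):
    x_amp = u * sigma_n / sigma_a                # input amplitude giving U = |X| sigma_a / sigma_n = u
    a = circular_complex_gaussian(sigma_a, m)    # Rayleigh-amplitude fading coefficient alpha
    w = circular_complex_gaussian(sigma_n, m)    # additive noise N
    v = np.abs(a * x_amp + w) ** 2 / sigma_n**2  # V = |Y|^2 / sigma_N^2 for the channel in (22)
    print(u, v.mean(), 1 + u**2)                 # the empirical mean matches the 1 + u^2 implied by (23)
```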
V. CONCLUSIONS

In this paper, we have presented CORTICAL, a novel deep learning-based framework that learns the capacity-achieving distribution for any discrete-time continuous memoryless channel. The proposed approach consists of a cooperative max-max game between two neural networks. The algorithm has been validated in non-Shannon channel settings. The proposed approach offers a novel tool for studying the channel capacity in non-trivial channels and paves the way for the holistic and data-driven analysis/design of communication systems.

VI. ACKNOWLEDGEMENT

The authors would like to thank Nir Weinberger and Alex Dytso for the valuable discussions about capacity-achieving distributions provided while Nunzio A. Letizia was visiting Princeton University.

REFERENCES

[1] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Trans. on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, Dec. 2017.
[2] H. Ye, G. Y. Li, and B.-H. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Communications Letters, vol. 7, no. 1, pp. 114–117, 2018.
[3] A. M. Tonello and N. A. Letizia, "MIND: Maximum mutual information based neural decoder," IEEE Communications Letters, vol. 26, no. 12, pp. 2954–2958, 2022.
[4] H. Ye, G. Y. Li, B.-H. F. Juang, and K. Sivanesan, "Channel agnostic end-to-end learning based communication systems with conditional GAN," in 2018 IEEE Globecom Workshops (GC Wkshps), 2018, pp. 1–5.
[5] M. Soltani, V. Pourahmadi, A. Mirzaei, and H. Sheikhzadeh, "Deep learning-based channel estimation," IEEE Communications Letters, vol. 23, no. 4, pp. 652–655, 2019.
[6] Z. Aharoni, D. Tsur, Z. Goldfeld, and H. H. Permuter, "Capacity of continuous channels with memory via directed information neural estimator," in Proc. of the 2020 IEEE International Symposium on Inf. Theory (ISIT), 2020, pp. 2014–2019.
[7] N. A. Letizia and A. M. Tonello, "Capacity-driven autoencoders for communications," IEEE Open Journal of the Communications Society, vol. 2, pp. 1366–1378, 2021.
[8] F. Mirkarimi, S. Rini, and N. Farsad, "Neural capacity estimators: How reliable are they?" in Proc. of the ICC 2022 - IEEE International Conference on Communications, 2022, pp. 3868–3873.
[9] J. Huang and S. Meyn, "Characterization and computation of optimal distributions for channel coding," IEEE Trans. on Inf. Theory, vol. 51, no. 7, pp. 2336–2351, 2005.
[10] A. McKellips, "Simple tight bounds on capacity for the peak-limited discrete-time channel," in Proc. of the 2004 IEEE International Symposium on Inf. Theory, 2004, pp. 348–348.
[11] J. G. Smith, "The information capacity of amplitude- and variance-constrained scalar Gaussian channels," Inf. Control., vol. 18, pp. 203–219, 1971.
[12] A. Dytso, S. Yagli, H. V. Poor, and S. Shamai Shitz, "The capacity achieving distribution for the amplitude constrained additive Gaussian channel: An upper bound on the number of mass points," IEEE Trans. on Inf. Theory, vol. 66, no. 4, pp. 2006–2022, 2020.
[13] R. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Trans. on Inf. Theory, vol. 18, no. 4, pp. 460–473, 1972.
[14] J. Dauwels, "Numerical computation of the capacity of continuous memoryless channels," in Proc. of the 26th Symposium on Inf. Theory in the Benelux, 2005.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, vol. 27, 2014.
[16] N. A. Letizia and A. M. Tonello, "CORTICAL: Cooperative channel capacity learning," https://github.com/tonellolab/CORTICAL, 2022.
[17] I. Abou-Faycal, M. Trott, and S. Shamai, "The capacity of discrete-time memoryless Rayleigh-fading channels," IEEE Trans. on Inf. Theory, vol. 47, no. 4, pp. 1290–1301, 2001.
[18] A. Tchamkerten, "On the discreteness of capacity-achieving distributions," IEEE Trans. on Inf. Theory, vol. 50, no. 11, pp. 2773–2778, 2004.
[19] J. Fahs and I. Abou-Faycal, "A Cauchy input achieves the capacity of a Cauchy channel under a logarithmic constraint," in Proc. of the 2014 IEEE International Symposium on Inf. Theory, 2014, pp. 3077–3081.
[20] B. Rassouli and B. Clerckx, "On the capacity of vector Gaussian channels with bounded inputs," IEEE Trans. on Inf. Theory, vol. 62, no. 12, pp. 6884–6903, 2016.
[21] N. Sharma and S. Shamai, "Transition points in the capacity-achieving distribution for free-space optical intensity channels," in Proc. of the 2010 IEEE Inf. Theory Workshop, 2010, pp. 1–5.
[22] A. Dytso, M. Al, H. V. Poor, and S. Shamai Shitz, "On the capacity of the peak power constrained vector Gaussian channel: An estimation theoretic perspective," IEEE Trans. on Inf. Theory, vol. 65, no. 6, pp. 3907–3921, 2019.
[23] A. Dytso, M. Goldenbaum, H. V. Poor, and S. Shamai (Shitz), "Amplitude constrained MIMO channels: Properties of optimal input distributions and bounds on the capacity," Entropy, vol. 21, no. 2, 2019.
[24] J. Fahs, N. Ajeeb, and I. Abou-Faycal, "The capacity of average power constrained additive non-Gaussian noise channels," in Proc. of the 19th International Conference on Telecommunications, 2012, pp. 1–6.
[25] A. Das, "Capacity-achieving distributions for non-Gaussian additive noise channels," in Proc. of the 2000 IEEE International Symposium on Inf. Theory, 2000, p. 432.
