


This paper has been accepted for presentation at the 2023 IEEE International Symposium on Information Theory. ©2022 IEEE.

Semantic Communication of Learnable Concepts


Francesco Pase⋆, Szymon Kobus†, Deniz Gündüz†, Michele Zorzi⋆
⋆ University of Padova, Italy. Email: {pasefrance, zorzi}@dei.unipd.it
† Imperial College London, London, UK. Email: {szymon.kobus17, d.gunduz}@imperial.ac.uk

arXiv:2305.08126v1 [cs.IT] 14 May 2023

Abstract—We consider the problem of communicating a sequence of concepts, i.e., unknown and potentially stochastic maps, which can be observed only through examples, i.e., the mapping rules are unknown. The transmitter applies a learning algorithm to the available examples, and extracts knowledge from the data by optimizing a probability distribution over a set of models, i.e., known functions, which can better describe the observed data, and so potentially the underlying concepts. The transmitter then needs to communicate the learned models to a remote receiver through a rate-limited channel, to allow the receiver to decode the models that can describe the underlying sampled concepts as accurately as possible in their semantic space. After motivating our analysis, we propose the formal problem of communicating concepts, and provide its rate-distortion characterization, pointing out its connection with the concepts of empirical and strong coordination in a network. We also provide a bound for the distortion-rate function.

[Fig. 1. The problem of communicating concepts. At the semantic level, Alice observes datasets S_i = {z_j}_{j=1}^m drawn from concepts C_1, ..., C_n ∼ P_C and computes the beliefs {Q_i(h)}_{i=1}^n = {A(S_i)}_{i=1}^n; at the technical level, she conveys them to Bob over a noisy channel with messages m ∈ {1, ..., 2^{nR}}, and Bob samples h ∼ Q_i(h) to predict through p_h(Y|X) with X ∼ C_i.]

I. INTRODUCTION AND MOTIVATION

With the growing number of mobile devices and sensors, massive amounts of data are collected today at the edge of communication networks. On the one hand, this data is the fuel for training large learning models like deep neural networks (DNNs); on the other hand, these models need to be stored, compressed, and communicated over bandwidth-limited channels to the cloud, and protected against security and privacy risks [1]. These issues are increasingly limiting the application of typical centralized training approaches. Various federated/distributed learning paradigms have emerged as potential solutions to mitigate these limitations, which allow the models to be locally trained, and then aggregated in a cloud or edge server without moving local private data [2]. The main paradigm shift in distributed learning is to move the models, rather than the data, throughout the network, providing better privacy guarantees and reducing the communication load. However, today even the sizes of such learned models are becoming a concern, as transmitting huge models back and forth for training or inference purposes can easily congest wireless networks, specifically when considering that edge devices like mobile phones, cars, robots, etc., are usually wirelessly connected to the network, and thus have limited bandwidth [3].

Consequently, it is time to investigate, with the proper information-theoretic models and tools, the fundamental limits of communicating models over rate-limited channels, and not just raw data. To this end, semantic communication, which concerns the semantic aspect of the message, maps naturally to the transmission of these learning models [4]. The communication fidelity of these models can be judged by how close the behavior of the reconstructed model at the receiver is to the desired one, rather than by the accuracy of the reconstruction in the parameter space [3].

This work received funding from the UKRI (EP/X030806/1) for the project AIR (ERC-CoG). For the purpose of open access, the authors have applied a Creative Commons Attribution (CCBY) license to any Author Accepted Manuscript version arising from this submission.

II. RELATED WORK

The goal of this paper is to introduce the problem of communicating concepts, which is translated to that of conveying functions that better approximate them by learning their semantic aspects from data. The closest reference to this work is [5], in which the goal is to compress neural networks by applying bits-back coding [6] to represent probability distributions over models with the minimum number of bits. The problem studied in [5] is the single-shot version of our problem, focusing on the design of a practical coding scheme, called MIRACLE, to efficiently compress neural networks. In [7], the authors study the connections between compressibility and learnability in the context of probably approximately correct (PAC) learning, and show that the two concepts are equivalent when the zero/one loss is considered, but not in the case of general loss functions. Another line of research investigates the connections between the generalization capabilities of learning algorithms and the mutual information between the data and the model [8]–[10]. The logic behind these results is to provide a bound on the generalization gap, i.e., the difference between the expected error and the training one, given some information-theoretic properties of the learning algorithm. However, if the environment imposes a constraint on such quantities, e.g., the mutual information between the input and output of the learning rule, for example by introducing a rate-limited communication channel between the data and the final model, this influences not only the generalization gap, but also the training error itself (the output of the learning rule is now constrained by the environment), and so it is not clear how the gap to the best achievable test error changes as a function of the mutual information. This is the scenario studied in this work, where the constraint on the mutual information is not a property of the learning rule, but rather a physical limit imposed by the system. In [11], a similar study is performed on the specific case of contextual multi-armed bandits, where the fundamental quantity is the mutual information I(S; A) between the system states and the actions taken by the agents, which is a property of the specific policy adopted. This work generalizes that idea to the supervised learning framework, and considers the effect of the communication rate R on the final performance. It is also interesting to highlight the connections between this work and the study in [12], where the authors quantify the complexity of a learning algorithm output Q with its Kullback–Leibler divergence from a prior model distribution P, which, in our system model, represents the minimum achievable rate to convey Q when P is set as the prior distribution.

To conclude, this work is partially built on top of the results in [13]–[17], which generalize rate-distortion theory [18] from standard data communication to probability distributions, where the fidelity requirement at the receiver is not to exactly reconstruct the input data, but rather to generate samples according to some input distribution. Indeed, here the semantic aspect of communication is captured by the fact that there is no need to convey the exact dataset sampled by the transmitter, but rather to represent with high fidelity the belief on the underlying concepts acquired after observing the data, which is the post-data probability distribution over the class of feasible models.

III. SYSTEM MODEL

Let E denote the environment, i.e., the source, that generates a sequence of n concepts, e.g., tasks, {c_i}_{i=1}^n, c_i ∈ C, sampled with probability P_C in an independent and identically distributed (i.i.d.) fashion. While P_C is known by both Alice and Bob, neither of them can observe the sampled concepts directly. Alice has access to a sequence of m samples {z_{i,j}}_{j=1}^m, where z_{i,j} = (x_{i,j}, y_{i,j}) ∈ Z, sampled according to each of the concept distributions p_{c_i}(Y|X) p_{c_i}(X), ∀i = 1, ..., n. Alice and Bob agree on a hypothesis class, i.e., the model class H, and on a pre-data coding probability distribution P_h, ∀h ∈ H. We call the sequence {z_{i,j}}_{j=1}^m of samples the dataset s_i. Alice applies a learning algorithm A : Z^m → Φ(H) on s_i, which is a possibly stochastic function mapping a dataset to a probability distribution Q_{h|s_i} = A(s_i) over the set of models H, and so, over the subset of all possible probability mappings h : X → Φ(Y), representing Alice's concept belief. With Φ(X) we denote the set of all possible probability distributions over the set X. Consequently, the models are functions used by Alice and Bob to represent (or, more precisely, to approximate) the concepts' relation among data. To measure how well a model h approximates a concept c, a per-sample loss ℓ_c(h, z) : H × Z → [0, 1] is defined, which compares the discrepancy between p_c(y|x) and h(x). In this work, we assume a loss bounded within [0, 1] for the sake of clarity of exposition. Then, ℓ_c(Q, z) : Φ(H) × Z → [0, 1] is the performance of the model belief Q, which is defined as ℓ_c(Q, z) = E_{h∼Q}[ℓ_c(h, z)]. Upon observing the data, Alice can compute her empirical performance by using the empirical loss on her dataset S as

    L̃_C(Q, S) = (1/|S|) Σ_{z∈S} ℓ_C(Q, z),    (1)

where Q = Q_{H|S} = A(S) is the post-data distribution inferred by Alice, given the data. We assume that, for any sequences of datasets s^n, s'^n, Alice's distribution can be factorized as Q^n_{h^n|s^n} = ∏_{i=1}^n Q_{h_i|s_i}, such that s_i = s'_j ⇒ Q_{h_i|s_i} = Q_{h_j|s'_j}. However, to assess how well the belief Q represents the concept C, in machine learning we are usually interested in the true loss

    L_C(Q) = E_{Z∼C}[ℓ_C(Q, Z)],    (2)

i.e., the expected performance on a new unseen sample. Given the realization of the datasets {s_i}_{i=1}^n, the problem for Alice is then to convey a message to Bob through a constrained communication channel, which limits the maximum number of bits she can convey per model, so that Bob can use the received information to reconstruct models {ĥ_i}_{i=1}^n that can approximate the concepts {c_i}_{i=1}^n by minimizing the loss on random samples {z_i}_{i=1}^n distributed according to the sequence of concepts, i.e., the true loss in Equation (2). We can observe that the task for Bob is not to exactly reconstruct the sequence {h_i}_{i=1}^n sampled by Alice, but rather to obtain samples {ĥ_i}_{i=1}^n whose probability distributions are close to the target ones, i.e., {Q_{h_i|s_i}}_{i=1}^n.
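As a concrete illustration of the quantities in Equations (1) and (2), the following Python sketch builds a toy version of this setup: a finite hypothesis class, a learning rule standing in for A(S), and the empirical and true losses of the resulting belief Q_{h|S}. The concept, the hypothesis grid, and the softmax-over-empirical-loss rule are illustrative assumptions and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy concept: labels y ~ Bern(theta_c), observed only through samples.
# Hypothesis class H: a small grid of candidate parameters; h predicts P(y=1|x) = h.
H = np.linspace(0.05, 0.95, 10)
theta_c = 0.7                            # true (unknown) concept parameter

def sample_dataset(m):
    """Draw m labelled samples z = (x, y) from the concept."""
    x = rng.integers(0, 2, size=m)       # inputs are irrelevant for this toy concept
    y = (rng.random(m) < theta_c).astype(float)
    return x, y

def per_sample_loss(h, y):
    """Bounded loss in [0, 1]: absolute error between the prediction h and the label."""
    return np.abs(h - y)

def learning_rule(y, beta=20.0):
    """A stand-in for A(S): softmax of the negative empirical loss of each h in H."""
    emp = np.array([per_sample_loss(h, y).mean() for h in H])
    w = np.exp(-beta * emp)
    return w / w.sum()                   # belief Q_{h|S} over H

def empirical_loss(Q, y):
    """Eq. (1): dataset average of the belief loss E_{h~Q}[loss(h, z)]."""
    return float(np.mean([np.dot(Q, per_sample_loss(H, yi)) for yi in y]))

def true_loss(Q):
    """Eq. (2): expected belief loss on a fresh sample Z ~ C."""
    per_h = H * (1 - theta_c) + (1 - H) * theta_c
    return float(np.dot(Q, per_h))

x, y = sample_dataset(m=50)
Q = learning_rule(y)
print("empirical loss (1):", round(empirical_loss(Q, y), 3))
print("true loss      (2):", round(true_loss(Q), 3))
```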
Remark. We now briefly discuss why we are interested in learning rules A(S) that output model distributions, rather than single-point solutions:

• First of all, the case in which Alice finds a point-wise estimate of the best model h* is included as a special case Q_{h|S} = δ_{h*}.
• Alice may want to express her uncertainty around the best choice h*, which may be intrinsic in the learning algorithm A, through the distribution Q_{h|S}.
• Usually, the optimization algorithms used to train DNNs, like stochastic gradient descent (SGD), are stochastic algorithms.
• When H is the set of all DNNs h_ω with a specific architecture parameterized by the parameter vector ω, there exist many vectors ω performing in the same way. Moreover, small perturbations to the parameters usually do not reduce the final performance. This means that Bob is not required to reconstruct the exact value of the network parameters, but rather a nearby or an equivalent solution, and this variability is represented by Q_{h_ω|S}. More importantly, Q_{h_ω|S} can be exploited to reduce the rate needed to convey the models, thus saving network resources [5]. This is the semantic aspect of communication captured by our framework, as the meaning of a concept c, i.e., the real unknown mapping, is conveyed through the model belief Q_{h|S}, whose loss expressed in Equation (2) quantifies its fidelity with respect to the real concept c.

IV. THE RATE-DISTORTION CHARACTERIZATION

In this section, we first characterize the limit of the problem when n = 1, i.e., one-shot concept communication, and then generalize the problem to the n-sequence formulation. For the latter, two kinds of performance metrics are defined: the first one provides average performance guarantees, while the second one ensures the same performance guarantee for each sample ĥ. We will show that the minimum achievable communication rate that can guarantee a certain distortion level is the same in both cases, as long as sufficient common randomness between Alice and Bob is available.

A. Single-Shot Problem

The single-shot version of the problem has been studied in [5], where the authors propose MIRACLE, a neural network compression framework based on bits-back coding [6], providing an efficient single-shot model compression scheme and showing empirically that, with enough common randomness, it is possible to convey the model with an average of K ≃ D_KL(Q||P) bits with very good performance, where P is the pre-data coding model distribution, and Q is the optimized post-data distribution, providing a belief over well-performing neural networks, or, equivalently, over a set of parameter vectors, as explained in Section III. However, from Lemma 1.5 in [15], the average number of bits E_{C,S}[K] needed to exactly code Q with P, when sufficient common randomness is available, can be bounded by

    R ≤ E_{C,S}[K] ≤ R + 2 log(R + 1) + O(1),    (3)

where R = E_{C,S}[D_KL(Q||P)], while using exactly R bits may lead to samples which are distributed according to Q̃, slightly different from the target Q [15]. More recent results [16] provide even stronger guarantees (Corollary 3.4 in [17]) for this relationship:

    E_{C,S}[K] ≤ R + log(R + 1) + 4.    (4)
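The bound in Equation (4) is easy to evaluate numerically for a discrete hypothesis class. The sketch below is a toy illustration rather than an implementation of MIRACLE or of the coding scheme in [17]: it compares the "ideal" rate D_KL(Q||P) with the single-shot guarantee R + log(R + 1) + 4, using an arbitrary example prior P and posterior Q.

```python
import numpy as np

def kl_bits(q, p):
    """Kullback-Leibler divergence D_KL(q||p) in bits."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log2(q[mask] / p[mask])))

# Example pre-data coding distribution P and post-data belief Q over |H| = 8 models.
P = np.full(8, 1 / 8)                                  # agreed prior over the model class
Q = np.array([0.55, 0.20, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01])

R = kl_bits(Q, P)                                      # lower bound in Eq. (3)
upper = R + np.log2(R + 1) + 4                         # single-shot guarantee, Eq. (4)
print(f"D_KL(Q||P) = {R:.2f} bits  ->  E[K] <= {upper:.2f} bits")
```

Amortized over a sequence of n models, the additive overhead log(R + 1) + 4 vanishes per model, which is what the n-length analysis in Appendix C exploits.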
B. n-Length Formulation

We now study the problem depicted in Figure 1, where we let Alice code a sequence of n concept realizations, i.e., datasets, and study the information-theoretic limit of the system as n → ∞. Specifically, we are interested in the trade-off between the rate R, which is defined as the average number of bits consumed per model by Alice to convey the concept process to Bob, and the performance, which is the true loss that can be obtained by Bob (see Eq. (2)). We start by defining the proper quantities involved.

Definition IV.1 (Rate-Distortion Coding Scheme). A (2^{nR}, n) coding scheme consists of an alphabet X, a reconstruction alphabet X̂, an encoding function f_n : X^n → {1, 2, ..., 2^{nR}}, a decoding function g_n : {1, 2, ..., 2^{nR}} → X̂^n, and a distortion measure d : X^n × X̂^n → R^+, comparing the fidelity between x^n and x̂^n. Specifically, we are interested in the expected distortion E[d(x^n, x̂^n)] = Σ_{x^n ∈ X^n} p(x^n) d(x^n, g_n(f_n(x^n))).

Definition IV.2 (Rate-Distortion). A rate-distortion pair (R, ε) is said to be achievable for a source p(x) and a distortion measure d, if there exists a sequence of (2^{nR}, n) rate-distortion coding schemes with

    lim_{n→∞} E[d(x^n, g_n(f_n(x^n)))] < ε.    (5)

However, as specified in the remark in Section III, we are interested in conveying beliefs Q ∈ Φ(H), i.e., samples drawn according to the probability Q, obtained from the datasets s^n ∈ S^n, where S = Z^m. In our case, the coding function f_n : S^n → {1, 2, ..., 2^{nR}} maps the sequence s^n = {s_i}_{i=1}^n to a message f_n(s^n) from which Bob can obtain the models ĥ^n = g_n(f_n(s^n)). Our distortion then considers the difference between the distribution Q̂^n of ĥ^n and the one achievable by Alice, Q^n = {A(s_i)}_{i=1}^n, by comparing their samples ĥ^n and h^n. Consequently, we denote by Q̂_{S^n,Ĥ^n} (or simply Q̂^n) the joint distribution between the datasets and models induced by a (2^{nR}, n) coding scheme, whose marginals are Q̂_{S_i,Ĥ_i} for i = 1, ..., n.

Definition IV.3 (Concept Distortion). For the problem of communicating concepts, we define the following distortion on the model beliefs Q and Q̂:

    d_sem(Q, Q̂) = E_{C,S}[L_C(Q̂) − L_C(Q)].    (6)

The rationale behind this definition is that Q, which is the target distribution at the transmitter, is optimized on a given dataset S without any constraint. Therefore, it is reasonable to assume L_C(Q̂) − L_C(Q) to be always non-negative. We notice that d_sem quantifies the gap between the concept reconstruction at the receiver and the one at the transmitter, which is a semantic measure on the unknown true loss L(Q).

Definition IV.4 (n-Sequence Concept Distortion). For the problem of communicating concepts, we define the following distortions on the sequence of the model beliefs Q^n and Q̂^n:

    d_avg(Q^n, Q̂^n) = (1/n) Σ_{i=1}^n d_sem(Q_i, Q̂_i),    (7)

and

    d_max(Q^n, Q̂^n) = max_{i∈{1,...,n}} d_sem(Q_i, Q̂_i).    (8)

In practice, d_avg defines a constraint on the true loss achievable by Bob averaged over the performance of the sequence ĥ^n ∼ Q̂^n, whereas d_max imposes a constraint on the loss of every marginal ĥ_i ∼ Q̂^n_i, i ∈ {1, ..., n}.

Remark. The distortion, easily defined for one-shot communications, can be generalized to a sequence of n concepts in multiple ways. Specifically, if one is interested in a system-level loss, then satisfying the constraint on d_avg could be enough. However, to provide a per-model guarantee on the performance, d_max is the distortion to use.

Definition IV.5 (Rate-Distortion Region). The rate-distortion region for a source is the closure of the set of achievable rate-distortion pairs (R, ε).

Definition IV.6 (Rate-Distortion Function). The rate-distortion function R(ε) is the infimum of rates R such that (R, ε) is in the rate-distortion region.

C. Average Distortion d_avg

We first analyze the problem with the average distortion d_avg, as defined in Equation (7).

Theorem IV.1 (Rate-Distortion Theorem for d_avg). For the problem of communicating concepts with distortion d_avg, the rate-distortion function satisfies

    R(ε) = min_{Q̃_{H|S} : d_sem(Q, Q̃) ≤ ε} I(S; H),    (9)

where I(S; H) is the mutual information between the data S and the model H [18].

Proof. See Appendix A.
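When S and H take values in finite alphabets and the semantic distortion of Equation (6) is written as a per-letter distortion matrix d(s, ĥ), the minimization in Equation (9) can be computed numerically with the classical Blahut–Arimoto algorithm for rate-distortion [18]. The sketch below is a generic Blahut–Arimoto implementation under that assumption; the source distribution and the distortion values are placeholder toy numbers, and the trade-off is traced by sweeping a Lagrange multiplier rather than by fixing ε directly.

```python
import numpy as np

def blahut_arimoto(p_s, d, beta, n_iter=500):
    """Minimize I(S;H) + beta * E[d(S,H)] over test channels q(h|s).

    p_s : (S,) source probabilities; d : (S,H) distortion matrix;
    beta: Lagrange multiplier trading rate against distortion.
    Returns (rate in bits, expected distortion).
    """
    S, H = d.shape
    q_h = np.full(H, 1.0 / H)                      # output marginal, initialized uniform
    for _ in range(n_iter):
        w = q_h[None, :] * np.exp(-beta * d)       # unnormalized q(h|s)
        q_hs = w / w.sum(axis=1, keepdims=True)
        q_h = p_s @ q_hs                           # update output marginal
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(q_hs > 0, q_hs / q_h[None, :], 1.0)
        rate = float(np.sum(p_s[:, None] * q_hs * np.log2(ratio)))
    dist = float(np.sum(p_s[:, None] * q_hs * d))
    return rate, dist

# Toy example: 3 dataset symbols, 4 candidate models, placeholder distortion values.
p_s = np.array([0.5, 0.3, 0.2])
d = np.array([[0.0, 0.2, 0.6, 0.9],
              [0.5, 0.0, 0.3, 0.8],
              [0.7, 0.4, 0.0, 0.2]])

for beta in (0.5, 2.0, 8.0, 32.0):
    R, eps = blahut_arimoto(p_s, d, beta)
    print(f"beta={beta:5.1f}  rate R = {R:.3f} bits  distortion = {eps:.3f}")
```

Sweeping beta traces the (R, ε) curve; each reported rate is the minimum of Equation (9) for the corresponding distortion level of the toy example.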
D. Maximum Distortion d_max

In this case the distortion function implies a constraint on the performance of each symbol, i.e., each model realization. First of all, we provide a simple scheme in which lim_{n→∞} d_avg(Q^n, Q̂^n) = 0 does not imply lim_{n→∞} d_max(Q^n, Q̂^n) = 0, meaning that, in general, a code that achieves 0 distortion on average may not achieve 0 distortion model-wise, i.e., we cannot guarantee single-model performance.

Example IV.1. Let H = {h_0, h_1}, with performance ℓ_c(h_0, z) = 0 and ℓ_c(h_1, z) = 1, ∀z ∈ Z, ∀c ∈ C. Let, ∀c^n ∈ C^n, Alice's distribution be Q_i(h_j|S) = 1/2, where j ∈ {0, 1}, and Q^n = ∏_{i=1}^n Q_i, while Bob's distribution is deterministic: Q̂_{2i}(h_0|S) = 1, Q̂_{2i+1}(h_1|S) = 1. Then

    d_avg(Q^n, Q̂^n) =
      (1/n) Σ_{i=1}^{n/2} [(1 − 1/2) + (0 − 1/2)] = 0,                          if n is even,
      (1/n) [Σ_{i=1}^{(n−1)/2} ((1 − 1/2) + (0 − 1/2)) + (1 − 1/2)] = 1/(2n),   otherwise.

Thus, lim_{n→∞} d_avg(Q^n, Q̂^n) = 0, but

    d_max(Q^n, Q̂^n) = max{1 − 1/2, 0 − 1/2} = 1/2  ⟹  lim_{n→∞} d_max(Q^n, Q̂^n) ≠ 0.
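The gap between the two criteria in Example IV.1 can be checked directly. The short sketch below is only a sanity check: it implements d_sem, d_avg, and d_max for the example's two-model class, where the true loss of a belief is simply the probability it assigns to h_1, and reproduces d_avg ∈ {0, 1/(2n)} versus d_max = 1/2.

```python
import numpy as np

def example_distortions(n):
    """Reproduce Example IV.1 for a sequence of n models (indices 1, ..., n).

    Alice's belief puts probability 1/2 on each of {h0, h1}, so her true loss is 1/2.
    Bob deterministically outputs h1 at odd indices and h0 at even ones (losses 1, 0, 1, 0, ...).
    d_sem for index i is Bob's true loss minus Alice's true loss, as in Eq. (6).
    """
    loss_alice = np.full(n, 0.5)                 # L_C(Q_i) = 1/2 for every i
    loss_bob = np.arange(1, n + 1) % 2           # L_C(Q-hat_i): 1 at odd i, 0 at even i
    d_sem = loss_bob - loss_alice
    return d_sem.mean(), d_sem.max()             # Eq. (7) and Eq. (8)

for n in (2, 3, 10, 11, 1000, 1001):
    d_avg, d_max = example_distortions(n)
    print(f"n={n:5d}  d_avg={d_avg:+.4f}  d_max={d_max:.1f}")
```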
The question now is what is needed to ensure the same distortion ε for the single-model performance, i.e., to ensure d_max < ε, which is of particular interest in the semantic communication of concepts like the one considered here. We now distinguish between the two ways in which Q̂^n can converge to a target Q^n – empirical and strong. These two notions, introduced below, map precisely onto the difference between convergence of d_avg and d_max. The focus is changed from determining which distortion is achievable to which joint distributions of H and S are feasible under some rate constraint.

First, we extend the encoding and decoding functions f_n and g_n in Definition IV.1 to accept an additional common input ω ∈ Ω, which is generated by a source of common randomness p(ω). We define a (2^{nR}, 2^{nR_0}, n) stochastic code consisting of functions f_n : S^n × {1, ..., 2^{nR_0}} → {1, 2, ..., 2^{nR}} and g_n : {1, 2, ..., 2^{nR}} × {1, ..., 2^{nR_0}} → Ĥ^n, which consumes on average R_0 bits of common randomness per sample.

Definition IV.7. A desired distribution Q_{S,H} is achievable for empirical coordination with rate pair (R, R_0) if there exists a sequence of (2^{nR}, 2^{nR_0}, n) codes and a choice of common randomness distribution p(ω) such that

    TV(Q̂_{s^n ĥ^n}, Q_{S,H}) → 0,    (10)

where TV(Q̂, Q) indicates the total variation between distributions Q̂ and Q, Q̂_{s^n ĥ^n}(s, ĥ) = (1/n) Σ_{i=1}^n 1_{(s_i, ĥ_i)=(s, ĥ)}, and 1_I = 1 if the condition I is true, and 0 otherwise.

In other words, the empirical coordination property requires that the joint empirical distribution of the pairs (s_i, g_n(f_n(s^n))_i) converges, in total variation, to the desired distribution. Notice how, in Example IV.1, the joint distributions Q^n and Q̂^n converge in their empirical distributions, while differing letter-wise.

As already pointed out in [13], introducing common randomness does not improve the performance of empirical coordination schemes, meaning that any distribution achievable for empirical coordination by a (2^{nR}, 2^{nR_0}, n) coding scheme is also achievable with R_0 = 0. Moreover, empirical coordination schemes can be used to construct rate-distortion schemes for d_avg [13]. However, as observed in Example IV.1, they do not equate to the same per-symbol performance requirements.

Definition IV.8. A desired distribution Q_{S,H} is achievable for strong coordination with rate pair (R, R_0) if there exists a sequence of (2^{nR}, 2^{nR_0}, n) coordination codes and a choice of common randomness distribution p(ω) such that

    TV(Q̂_{s^n ĥ^n}, ∏_{i=1}^n Q_{s_i, h_i}) → 0,    (11)

where Q̂_{s^n ĥ^n} is the joint distribution induced by the stochastic coding scheme.

Lemma IV.1 (d_max Achievability). With sufficient common randomness, the rate-distortion region (R, ε) for d_max is the same as the one for d_avg.

Proof. See Appendix B.

Given a constraint ε on the distortion for d_avg achievable with minimum rate R_ε, if Alice and Bob can use common randomness, then it is possible to satisfy, at the same rate R_ε, the same level of distortion ε for d_max. To translate it into machine learning parlance, we showed that the communication rate needed to provide some performance guarantees on the expected average system test error, and on the expected single-model performance, is the same, as long as sufficient common randomness is available. In both cases, the characterization is over the expected performance, thus for any one realization the k-th model might have a higher than desired loss.

E. Coding Without the Marginal Q_H

We notice that all the previous achievability results assume knowledge of the exact marginal Q_H = Σ_{c∈C, s∈Z^m} Q_{H|s} P_{s|c} P_c to be used as the pre-data coding model distribution (see Section III), which is usually not known and difficult to obtain. Consequently, we are interested in studying the minimum achievable rate when a generic coding distribution P_H is used to code Q_{H|S}.

Theorem IV.2 (Achievability with General P_H). For the problem of communicating concepts, the minimum achievable rate for both d_avg and d_max with pre-data coding distribution P_H is

    R(ε) = min_{Q̃_{H|S} : d_sem(Q, Q̃) ≤ ε} E_{C,S}[D_KL(Q̃_{H|S} || P_H)],    (12)

assuming sufficient common randomness.

Proof. See Appendix C.

It is known that when P_H = Q_H, i.e., when using the marginal, Equation (12) is minimized, and so when the marginal is not known we pay an additional penalty given by E_{C,S}[D_KL(Q_{H|S} || P_H)] − I(S; H) = E_{C,S}[D_KL(Q_H || P_H)].
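The mismatch penalty above is the standard decomposition E_S[D_KL(Q_{H|S} || P_H)] = I(S; H) + D_KL(Q_H || P_H), which is easy to verify numerically. The sketch below does so for an arbitrary toy joint distribution of (S, H) and an arbitrary mismatched prior; all numerical values are placeholders.

```python
import numpy as np

def kl(p, q):
    """D_KL(p||q) in bits for discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Toy joint distribution over 3 datasets and 4 models (placeholder values, sums to 1).
joint = np.array([[0.20, 0.10, 0.05, 0.05],
                  [0.05, 0.20, 0.10, 0.05],
                  [0.02, 0.03, 0.05, 0.10]])
p_s = joint.sum(axis=1)                       # marginal over datasets S
q_h = joint.sum(axis=0)                       # marginal Q_H over models
q_h_given_s = joint / p_s[:, None]            # post-data beliefs Q_{H|S=s}

p_mismatched = np.array([0.4, 0.3, 0.2, 0.1]) # generic pre-data coding prior P_H

rate_with_P = sum(p_s[i] * kl(q_h_given_s[i], p_mismatched) for i in range(len(p_s)))
mutual_info = sum(p_s[i] * kl(q_h_given_s[i], q_h) for i in range(len(p_s)))
penalty = kl(q_h, p_mismatched)

print(f"E_S[D_KL(Q_H|S || P_H)]  = {rate_with_P:.4f} bits")
print(f"I(S;H) + D_KL(Q_H||P_H) = {mutual_info + penalty:.4f} bits")
```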
V. COMMUNICATING THE DATA VS COMMUNICATING THE MODEL

To motivate our research problem, we comment on the advantages for Alice of first compressing a trained model and then sending it to Bob (scheme 1), versus a second framework (scheme 2), in which Alice communicates a compressed version of the dataset, Ŝ^2 = ρ(S), using which Bob trains his models. Given a quantity X, we indicate with X^i the same quantity in the i-th scheme. First of all, we see that in scheme 1 the corresponding Markov chain is C → S →_R Ĥ^1, where the R above the arrow indicates the information bottleneck between the two random variables. However, the same chain for the second scheme reads C → S →_R Ŝ^2 → Ĥ^2. By the data processing inequality [18], the rate constraint imposes I(S; Ĥ) ≤ R in both cases, limiting the set of all feasible beliefs Q ∈ Φ(Ĥ).

Now, we reasonably assume that the optimal solution Q*(S, H) constrained to I(S; H) ≤ R lies on the boundary of the constraint, i.e., I_{Q*}(S; H) = R, where I_{Q*}(S; H) indicates that the mutual information is computed using Q*. In this case we can see that for scheme 1, I(S; Ĥ^1) = R, whereas for scheme 2, I(S; (Ŝ^2, Ĥ^2)) = R. Indeed, in the former scheme Alice conveys just the random variable Ĥ^1, i.e., the model, to Bob, whereas in the latter the pair (Ŝ^2, Ĥ^2) is communicated. However, the information bottleneck is the same. Consequently, we obtain

    I(S; Ĥ^1) = I(S; Ĥ^2) + I(S; Ŝ^2 | Ĥ^2).    (13)

By the non-negativity of the mutual information, we always have I(S; Ĥ^1) ≥ I(S; Ĥ^2), meaning that the rate constraint on the communication channel translates differently into a model constraint for the two schemes, being stricter for the second one. In particular, the gap between the two schemes is exactly I(S; Ŝ^2 | Ĥ^2). Consequently, for scheme 2 the optimal compression function ρ(S) must achieve I(S; Ŝ^2 | Ĥ^2) = 0, and so the only way to match the performance of scheme 1 is to account for the optimal distribution Q_{Ĥ|S} when computing Ŝ.

A. Distortion Rate Bound

In this section, we bound the distortion-rate function for d_max, which is useful to translate the rate constraint into a performance gap. We now define ∆_R = R* − R to be the difference between the rate R* of the optimal distribution and the rate R of the channel imposed by the problem. In general, we assume R* ≥ R and E_{C,S}[L(Q*)] ≤ E_{C,S}[L(Q)], where Q* is achievable with rate R*, and Q with rate R.

Lemma V.1. Assuming that ℓ_C(z, h) is upper bounded by L_max, ∀h ∈ H, ∀z ∈ Z, ∀C ∈ C, and that distortion d_max is considered, the distortion-rate function for the problem of communicating concepts when using scheme 1 can be upper bounded by

    ε^1(∆_R) ≤ L_max · min{ √((1/2) ∆_R), √(1 − e^{−∆_R}) },    (14)

and when using scheme 2 by

    ε^2(∆_R) ≤ L_max · min{ √((1/2)(∆_R + I(S; Ŝ^2|Ĥ^2))), √(1 − e^{−(∆_R + I(S; Ŝ^2|Ĥ^2))}) }.    (15)

Proof. See Appendix D.
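To see what the bound of Equation (14) implies in practice, the following sketch evaluates its right-hand side over a range of rate gaps ∆_R, taken in nats since the exponential form of the Bretagnolle–Huber term assumes natural logarithms; L_max and the ∆_R grid are arbitrary example values.

```python
import numpy as np

def distortion_rate_bound(delta_R, L_max=1.0):
    """Right-hand side of Eq. (14): L_max * min{ sqrt(delta_R/2), sqrt(1 - exp(-delta_R)) }.

    delta_R is the rate gap in nats (the exponential term assumes natural logarithms).
    """
    pinsker = np.sqrt(0.5 * delta_R)
    bretagnolle_huber = np.sqrt(1.0 - np.exp(-delta_R))
    return L_max * np.minimum(pinsker, bretagnolle_huber)

for dR in (0.01, 0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"delta_R = {dR:4.2f} nats  ->  eps^1 <= {distortion_rate_bound(dR):.3f}")
```

For small ∆_R the Pinsker term is the tighter one, while for large ∆_R the Bretagnolle–Huber term caps the bound at L_max.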
(scheme 2), in which Alice communicates a compressed ver-
sion of the dataset Ŝ 2 = ρ(S), using which Bob trains his VI. C ONCLUSION
models. Given a quantity X, we indicate with X i the same We introduced the problem of conveying concepts, where
quantity in the i-th scheme. First of all, we see that in scheme concepts naturally appear as sequences of samples and can be
R
1, the corresponding Markov chain is C → S − → Ĥ 1 , where approximated by learnable models, as in standard statistical
the R above the arrow indicates the information bottleneck learning. We study the framework by applying information-
between the two random variables. However, the same chain theoretic tools to the problem of communicating many mod-
R
for the second scheme reads C → S − → Ŝ 2 → Ĥ 2 . By the els jointly. We characterized its rate-distortion function for
data processing inequality [18], the rate constraint imposes two different notions of system-level distortion, provided a
I(S; Ĥ) ≤ R in both cases, limiting the set of all feasible bound for the distortion-rate function, and argued why jointly
beliefs Q ∈ Φ(Ĥ). learning, compressing, and communicating models should be
Now, we reasonably assume that the optimal solution preferred over compressing and sending the datasets. For
Q∗ (S, H) constrained to I(S; H) ≤ R lies on the boundary future investigations, the plan is to study in detail the relations
of the constraint, i.e., IQ∗ (S; H) = R, where IQ∗ (S; H) between model compression and test accuracy, and to design
indicates that the mutual information is computed using Q∗ . practical model communications schemes.
In this case we can see that for scheme 1, I(S; Ĥ 1 ) = R,
whereas for scheme 2, I(S; (Ŝ 2 , Ĥ 2 )) = R. Indeed, in the
former scheme Alice conveys just the random variable Ĥ 1 ,
i.e., the model, to Bob, whereas in the latter, the pair (Ŝ 2 , Ĥ 2 )
REFERENCES
[1] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, “Information security in
big data: Privacy and data mining,” IEEE Access, vol. 2, pp. 1149–1176,
October 2014.
[2] H. Brendan McMahan, E. Moore, D. Ramage, S. Hampson, and
B. Agüera y Arcas, “Communication-efficient learning of deep networks
from decentralized data,” Proceedings of the 20th International Confer-
ence on Artificial Intelligence and Statistics, AISTATS, 2017.
[3] M. Jankowski, D. Gündüz, and K. Mikolajczyk, “Airnet: Neural net-
work transmission over the air,” in IEEE International Symposium on
Information Theory (ISIT), 2022, pp. 2451–2456.
[4] D. Gündüz, Z. Qin, I. E. Aguerri, H. S. Dhillon, Z. Yang, A. Yener, K. K.
Wong, and C.-B. Chae, “Beyond transmitting bits: Context, semantics,
and task-oriented communications,” IEEE Journal on Selected Areas in
Communications, vol. 41, no. 1, pp. 5–41, November 2023.
[5] M. Havasi, R. Peharz, and J. M. Hernandez-Lobato, “Minimal random
code learning: Getting bits back from compressed model parameters,”
NeurIPS Workshop on Compact Deep Neural Network Representation
with Industrial Applications, 2018.
[6] B. J. Frey and G. E. Hinton, “Efficient stochastic source coding and an
application to a bayesian network source model,” The Computer Journal,
vol. 40, pp. 157–165, January 1997.
[7] O. David, S. Moran, and A. Yehudayoff, “On statistical learning via
the lens of compression,” in Proceedings of the 30th International
Conference on Neural Information Processing Systems. Red Hook,
NY, USA: Curran Associates Inc., 2016, p. 2792–2800.
[8] A. Xu and M. Raginsky, “Information-theoretic analysis of general-
ization capability of learning algorithms,” in Proceedings of the 31th
International Conference on Neural Information Processing Systems,
2017.
[9] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff, “Learn-
ers that use little information,” in Proceedings of Algorithmic Learning
Theory, vol. 83, April 2018, pp. 25–55.
[10] T. Steinke and L. Zakynthinou, “Reasoning About Generalization via
Conditional Mutual Information,” in Proceedings of Thirty Third Con-
ference on Learning Theory, vol. 125. PMLR, July 2020, pp. 3437–
3452.
[11] F. Pase, D. Gündüz, and M. Zorzi, “Rate-constrained remote contextual
bandits,” IEEE Journal on Selected Areas in Information Theory (Early
Access), December 2022.
[12] A. Achille, G. Paolini, G. Mbeng, and S. Soatto, “The information com-
plexity of learning tasks, their structure and their distance,” Information
and Inference: A Journal of the IMA, vol. 10, no. 1, pp. 51–72, January
2021.
[13] P. W. Cuff, H. H. Permuter, and T. M. Cover, “Coordination capacity,”
IEEE Transactions on Information Theory, vol. 56, no. 9, pp. 4181–
4206, September 2010.
[14] G. Kramer and S. A. Savari, “Communicating probability distributions,”
IEEE Transactions on Information Theory, vol. 53, no. 2, pp. 518–525,
January 2007.
[15] P. Harsha, R. Jain, D. McAllester, and J. Radhakrishnan, “The commu-
nication complexity of correlation,” IEEE Transactions on Information
Theory, vol. 56, no. 1, pp. 438–449, January 2010.
[16] C. T. Li and A. E. Gamal, “Strong functional representation lemma and
applications to coding theorems,” IEEE Transactions on Information
Theory, vol. 64, no. 11, pp. 6967–6978, November 2018.
[17] L. Theis and N. Y. Ahmed, “Algorithms for the communication of
samples,” in Proceedings of the 39th International Conference on
Machine Learning, vol. 162. PMLR, July 2022, pp. 21 308–21 328.
[18] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley
Series in Telecommunications and Signal Processing). USA: Wiley-
Interscience, 2006.
APPENDIX A
PROOF OF THEOREM IV.1

To prove Theorem IV.1, we need to show that we can translate the average distortion d_avg requirement into a constraint on the empirical distribution Q̂_{s^n ĥ^n}(s, h) = (1/n) Σ_{i=1}^n 1_{(s_i,h_i)=(s,h)}, where 1_I = 1 if the condition I is true, and 0 otherwise. We can write

    d_avg(Q^n, Q̂^n) = (1/n) Σ_{i=1}^n d(Q_i, Q̂_i)
                    = E_{C,S}[ (1/n) Σ_{i=1}^n ( L_C(Q̂_i) − L_C(Q_i) ) ]
                (a) = E_{C,S}[ (1/n) Σ_{i=1}^n Σ_{h∈H} L_C(h) ( 1_{(S,ĥ_i)=(S,h)} − Q_{h|S} ) ]
                (b) = E_{C,S}[ Σ_{h∈H} L_C(h) ( (1/n) Σ_{i=1}^n 1_{(S,ĥ_i)=(S,h)} − Q_{h|S} ) ]
                    = E_{C,S}[ Σ_{h∈H} L_C(h) ( Q̂_{s^n ĥ^n}(s, h) − Q_{h|S} ) ]
                    = E_{C,S}[ L(Q̂_{s^n ĥ^n}) − L(Q) ]
                    = d_sem(Q, Q̂_{s^n ĥ^n}),

where the indicator function in (a) depends on the chosen (2^{nR}, n) coding scheme, and (b) comes from the fact that Alice's sampling scheme does not depend on the model index i. We then showed that the average distortion requirement translates into a distortion between the empirical distribution Q̂_{s^n ĥ^n} and the target Q. Given this, Theorem 1 in [14] applies, and our Theorem IV.1 follows.

APPENDIX B
PROOF OF LEMMA IV.1

Proof. Theorem IV.1 provides the minimum achievable rate to satisfy d_avg ≤ ε, which is attained by an optimized distribution Q*_{S,H} that minimizes the mutual information I(S; H) and satisfies d_avg ≤ ε. Now, we impose strong coordination between Alice and Bob, by using as target joint distribution Q*_{S^n,H^n} = ∏_{i=1}^n Q*_{S,H}, and by Theorem 10 of [13] the same rate I(S; H) of empirical coordination can be achieved, as long as enough common randomness is available. Now, given that Bob's distribution Q̂_{S^n,Ĥ^n} converges in total variation to Q*_{S^n,H^n}, the same happens for the marginals, i.e., Q̂_{S^n,Ĥ^n}(s_i, ĥ_i) →_TV Q*_{S,H}. However, by construction, Q*_{S,H} satisfies the distortion constraint symbol-wise, and so satisfies d_max.

APPENDIX C
PROOF OF THEOREM IV.2

To prove Theorem IV.2, we observe that Alice now needs to convey her probability distribution Q^n_{H^n|S^n} to Bob using P^n_H = (P_H)^n to code it. If we indicate with E_{C,S}[K] the average number of bits spent to convey the model belief Q using distribution P, Corollary 3.4 in [17] provides the single-shot bounds

    R(Q, P) ≤ E_{C,S}[K] ≤ R(Q, P) + log(R(Q, P) + 1) + 4,

where R(Q, P) = E_{C,S}[D_KL(Q||P)]. We now translate the results to the n-length sequence, and compute its limit as n grows indefinitely:

    R(Q^n, P^n) = E_{C^n,S^n}[ D_KL(Q^n || P^n) ]
                = E_{C^n}[ E_{S^n|C^n}[ E_{H^n|S^n}[ log( Q^n_{H^n|S^n} / P^n_H ) ] ] ]
            (a) = E_{C^n}[ E_{S^n|C^n}[ E_{H^n|S^n}[ log( ∏_{i=1}^n Q_{H|S_i} / (P_H)^n ) ] ] ]
            (b) = E_{C^n}[ E_{S^n|C^n}[ E_{H^n|S^n}[ log( (Q_{H|S})^n / (P_H)^n ) ] ] ]
                = n R(Q, P),

where (a) is because samples are independent given the nature of the problem, and (b) is because, given the dataset realization, they are also identically distributed by our assumption. Consequently, we can upper bound the total number of bits with

    E_{C,S}[K] ≤ n R(Q, P) + log(n R(Q, P) + 1) + 4,

obtaining an average rate of

    lim_{n→∞} E_{C,S}[K] / n = R(Q, P).

APPENDIX D
PROOF OF LEMMA V.1

We now prove the result in Lemma V.1. Let Q be the distribution at the sender, and Q̂ the one at the receiver, i.e., the distribution Q̂ = Q*_{S,H} in the proof of Lemma IV.1. We first bound, for each C ∈ C, d_max by noticing that, for each i ∈ {1, ..., n}, the following holds:

    d_sem(Q^n_i, Q̂^n_i)
      (a) = E_{C,S}[ L_C(Q̂) − L_C(Q) ]
          = E_{C,S}[ Σ_{h∈H} E_{z∼C}[ℓ(h, z)] ( Q̂_{ĥ|S} − Q_{h|S} ) ]
          ≤ L_max · E_{C,S}[ Σ_{h∈H} | Q̂_{ĥ|S} − Q_{h|S} | ]
          ≤ L_max · E_{C,S}[ TV(Q̂, Q) ]
      (b) ≤ L_max · E_{C,S}[ min{ √((1/2) D_KL(Q̂||Q)), √(1 − e^{−D_KL(Q̂||Q)}) } ]
      (c) ≤ L_max · min{ √((1/2) E_{C,S}[D_KL(Q̂||Q)]), √(1 − e^{−E_{C,S}[D_KL(Q̂||Q)]}) },

where (a) is given by strong coordination, (b) follows from the combination of the inequalities due to Pinsker and Bretagnolle–Huber, and (c) is Jensen's inequality. We now proceed to bound E_{C,S}[D_KL(Q||Q̂)] by observing that the average distortion between Q and Q̂ is a linear function of the probabilities Q̂, that the set {Q̂ ∈ Φ(H) : d_sem(Q, Q̂) ≤ ε} satisfying the constraint is convex, and that, by the definition of the distortion-rate function, Q̂ is the distribution in this set minimizing E_{C,S}[D_KL(Q̂||P)], where P is the pre-data coding distribution (see Theorem IV.2). Then, Theorem 11.6.1 in [18], also known as the Pythagorean theorem for the Kullback–Leibler divergence, applies, and we can bound

    E_{C,S}[D_KL(Q||Q̂)] ≤ E_{C,S}[D_KL(Q||P)] − E_{C,S}[D_KL(Q̂||P)]
                     (a) = R* − R
                         = ∆_R,

where (a) is given by Theorem IV.2. Combined together, the two inequalities provide Equation (14). Regarding scheme 2, it is sufficient to notice that R = I(S; Ĥ^2) + I(S; Ŝ^2|Ĥ^2) (see Section V), from which Equation (15) follows.

We notice that this does not directly apply to d_avg, as the distribution Q* solving Equation (9) and providing the rate is not, in general, the one used by Bob to sample actions. This is true for the scheme for d_max, in which common randomness is introduced.
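Step (b) of the proof of Lemma V.1 combines the Pinsker and Bretagnolle–Huber inequalities to bound the total variation by the KL divergence. The short numerical check below, run on randomly generated distribution pairs, confirms that TV(Q̂, Q) ≤ min{√(D_KL(Q̂||Q)/2), √(1 − e^{−D_KL(Q̂||Q)})} with the KL divergence in nats; it is only a sanity check of the inequality, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())

def kl_nats(p, q):
    """D_KL(p||q) in nats."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

worst_slack = np.inf
for _ in range(10_000):
    p = rng.dirichlet(np.ones(6))            # random pair of distributions on 6 atoms
    q = rng.dirichlet(np.ones(6))
    kl = kl_nats(p, q)
    bound = min(np.sqrt(0.5 * kl), np.sqrt(1.0 - np.exp(-kl)))
    worst_slack = min(worst_slack, bound - tv(p, q))

print(f"minimum slack (bound - TV) over 10000 trials: {worst_slack:.4f}  (never negative)")
```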
