
EE 376A/Stat 376A Handout #7

Information Theory Thursday, January 20, 2011


Prof. T. Cover

Solutions to Homework Set #1

1. Entropy of Hamming Code.


Consider information bits X1 , X2 , X3 , X4 ∈ {0, 1} chosen at random, and check bits
X5 , X6 , X7 chosen to make the parity of the circles even.

[Figure: three-circle Venn diagram with X1 in the triple overlap; X2, X3, X4 in the pairwise overlaps; and the check bits X5, X6, X7 each alone in one circle.]

Thus, for example, 1011 becomes 1011010.

(a) What is the entropy H(X1, X2, ..., X7)?
Now we make an error (or not) in one of the bits (or none). Let Y = X ⊕ e, where
e is equally likely to be (1, 0, . . . , 0), (0, 1, 0, . . . , 0), . . . , (0, 0, . . . , 0, 1), or (0, 0, . . . , 0),
and e is independent of X.
(b) What is the entropy of Y?
(c) What is H(X|Y)?
(d) What is I(X; Y)?

Solution: Entropy of Hamming Code.

(a) By the chain rule,


H(X1 , X2 , ..., X7 ) = H(X1 , X2 , X3 , X4 ) + H(X5 , X6 , X7 |X1 , X2 , X3 , X4 ).
Since X5 , X6 , X7 are all deterministic functions of X1 , X2 , X3 , X4 , we have
H(X5 , X6 , X7 |X1 , X2 , X3 , X4 ) = 0.
And since X1, X2, X3, X4 are independent Bernoulli(1/2) random variables,
H(X1, X2, ..., X7) = H(X1, X2, X3, X4) = H(X1) + H(X2) + H(X3) + H(X4) = 4 bits.

(b) We will expand H(X ⊕ e, X) in two different ways, using the chain rule. On one
hand, we can write
H(X ⊕ e, X) = H(X ⊕ e) + H(X|X ⊕ e)
= H(X ⊕ e).
In the last step, H(X|X ⊕ e) = 0 because X is a deterministic function of X ⊕
e, since the (7,4) Hamming code can correctly decode X when there is at most
one error. (You can check this by experimenting with all possible error patterns
satisfying this constraint.)
On the other hand, we can also expand H(X ⊕ e, X) as follows:
H(X ⊕ e, X) = H(X) + H(X ⊕ e|X)
= H(X) + H(X ⊕ e ⊕ X|X)
= H(X) + H(e|X)
= H(X) + H(e)
= 4 + H(e)
= 4 + log2 8
= 7.

The second equality follows since XORing with X is a one-to-one function. The
third equality follows from the well-known property of XOR that y ⊕ y = 0. The
fourth equality follows since the error vector e is independent of X. The fifth
equality follows since from part (a), we know that H(X) = 4. The sixth equality
follows since e is uniformly distributed over eight possible values: either there is
an error in one of seven positions, or no error at all.
Equating our two different expansions for H(X ⊕ e, X), we have

H(X ⊕ e, X) = H(X ⊕ e) = 7.

The entropy of Y = X ⊕ e is 7 bits.


(c) As mentioned before, X is a deterministic function of X ⊕ e, since the (7,4) Ham-
ming code can correctly decode X when there is at most one error. So H(X|Y) = 0.
(d) I(X; Y) = H(X) − H(X|Y) = H(X) = 4.
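
As a numerical sanity check on parts (a), (b), and (d), the code can be enumerated by brute force. The following is a minimal Python sketch; the specific parity rules X5 = X1 ⊕ X2 ⊕ X3, X6 = X1 ⊕ X3 ⊕ X4, X7 = X1 ⊕ X2 ⊕ X4 are an assumption consistent with the stated example 1011 → 1011010.

    # Brute-force check of the entropies in Problem 1.
    from itertools import product
    from collections import Counter
    from math import log2

    def entropy(counts):
        """Entropy in bits of an empirical distribution given as a Counter."""
        total = sum(counts.values())
        return -sum(c / total * log2(c / total) for c in counts.values())

    # The 16 equally likely codewords X = (X1, ..., X7), assuming the parity
    # rules X5 = X1^X2^X3, X6 = X1^X3^X4, X7 = X1^X2^X4.
    codewords = [
        (x1, x2, x3, x4, x1 ^ x2 ^ x3, x1 ^ x3 ^ x4, x1 ^ x2 ^ x4)
        for x1, x2, x3, x4 in product((0, 1), repeat=4)
    ]

    # The 8 equally likely error patterns: one bit flipped, or no flip at all.
    errors = [tuple(int(i == j) for i in range(7)) for j in range(7)] + [(0,) * 7]

    # Joint distribution of (X, Y) with Y = X xor e, X and e independent.
    joint = Counter()
    for x in codewords:
        for e in errors:
            y = tuple(xi ^ ei for xi, ei in zip(x, e))
            joint[(x, y)] += 1

    H_X = entropy(Counter(x for (x, y) in joint.elements()))
    H_Y = entropy(Counter(y for (x, y) in joint.elements()))
    H_XY = entropy(joint)
    print(H_X, H_Y, H_XY)                # 4.0 7.0 7.0
    print("H(X|Y) =", H_XY - H_Y)        # 0.0
    print("I(X;Y) =", H_X + H_Y - H_XY)  # 4.0

The output confirms H(X) = 4, H(Y) = 7, H(X|Y) = 0, and I(X;Y) = 4 bits.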

Problem asked in class:


(Note: This problem was not part of the homework set and will not be graded.)
Using the (7,4) Hamming code, solve the 7 hat puzzle:

There are 7 prisoners who enter a room. Each is assigned a hat, Red or Blue,
with probability 1/2, independently of the others. Each prisoner knows the
color of the hats the other prisoners are wearing, but no prisoner knows the
color of his own hat. The jailor wants the prisoners to guess the color of
their hats. If a prisoner guesses incorrectly, all prisoners are put to death.
Prisoners have the option to refrain from guessing, but if none guesses his hat
correctly, all of them will be put to death. What is the optimal strategy that
they should follow to maximize their chance of survival? No communication
is allowed between the players except for fixing a strategy before entering
the room.

Solution:
The hint is to use the Hamming (7,4) code described in Problem 1. To start, we must
convert hat assignments to bits. Before entering the room, we number the prisoners 1
through 7. The color of the hat that the ith prisoner receives can then be interpreted
as the value of the ith bit in a binary string of length 7. So for example, if Prisoner i
receives a red hat, then Xi = 1; if he receives a blue hat, then Xi = 0.
There are 2^7 = 128 different possible binary strings (hat assignments) that can be
assigned. Of these strings, there are only 2^4 = 16 binary strings (X1, X2, ..., X7)
which, when plugged into the three-bubble Venn diagram of Problem 1, do not have
any parity check violations. To see that there are 16, notice that if we insist on no
errors, there are only 4 degrees of freedom: X1, X2, X3, and X4. We will call these
particular strings codewords for the Hamming (7,4) code.

The Hamming (7,4) code is what is known as a perfect code. To understand what this
means, imagine the 16 codewords as points in space. Now imagine drawing spheres of
radius 1 around each of these codewords, where the distance metric used is Hamming
distance, the number of bits where two binary strings differ. It so happens that for
this code, the spheres do not overlap, and they also fill up all the Hamming space –
that is, any of the 128 possible bit strings lies in exactly one of these 16 spheres. Or
in other words, any 7-digit binary string has exactly one codeword that it differs from
by at most one bit. Hence, by using minimum distance decoding, the Hamming (7,4)
code can correct any single error, as mentioned in class.
Now, to save the prisoners. Here is the scheme that every prisoner follows:

(a) Look at the other six bits that you can see. (For example, prisoner 3 would look
at X1 , X2 , X4 , X5 , X6 , X7 .)
(b) If the six bits you see can be completed to a codeword by one of the two possible
values of your own bit, assert that your own hat must be the error: guess the other
color, the one that makes the resulting 7-digit binary string differ from that codeword
in exactly your position.
(c) If neither value of your own bit yields a codeword, then say nothing.

What is the probability of success? First, suppose the hat sequence that the prisoners
are assigned corresponds with one of the 16 codewords. Then every prisoner will see
six valid digits, and each will incorrectly assert that his own hat has an error. So all the
prisoners will make a mistake simultaneously, and the prisoners all die. The probability
of this occurring is 16/128 = 1/8. On the other hand, consider the more likely situation where
the prisoners do not end up with a codeword, which occurs with probability 112/128 = 7/8.
In this case, since all such bit strings are exactly one bit away from a unique codeword,
only one of the seven prisoners will assert that his own hat has the color of the error,
and he will be right. So the prisoners will be set free with probability 7/8.
One may wonder, how is it possible that we can pull this off? If we consider any
individual person, the probability that he correctly states the color of his own hat
must be 50%, since his own hat is assigned at random. This is still true. Most of
the time, any individual prisoner actually says nothing, since non-codewords are more
likely than codewords. However, when he does open his mouth, then 50% of the time
he will be correct and no one else will open his mouth. On the other hand, during the
other 50% of the time when he is wrong, we do not have a codeword, and everyone else
will also open his mouth and also be wrong. So the key trick is to concentrate all the
mistakes that could be made amongst all seven people into an event that occurs with
relatively low probability – namely, with probability 1/8.
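
The 7/8 success probability can also be confirmed by exhaustively checking all 128 hat assignments. Below is a minimal Python sketch of the strategy (using the same assumed parity rules as in the Problem 1 sketch): each prisoner tries both values of his own bit, and if one of them completes the visible bits to a codeword, he guesses the other value.

    # Exhaustive check of the hat strategy over all 128 assignments.
    from itertools import product

    codewords = {
        (x1, x2, x3, x4, x1 ^ x2 ^ x3, x1 ^ x3 ^ x4, x1 ^ x2 ^ x4)
        for x1, x2, x3, x4 in product((0, 1), repeat=4)
    }

    survive = 0
    for hats in product((0, 1), repeat=7):
        guesses = {}                       # prisoner index -> guessed bit
        for i in range(7):
            for b in (0, 1):
                candidate = hats[:i] + (b,) + hats[i + 1:]
                if candidate in codewords:
                    guesses[i] = 1 - b     # guess the value that is NOT a codeword
        if guesses and all(g == hats[i] for i, g in guesses.items()):
            survive += 1

    print(survive, "of 128 assignments survive:", survive / 128)  # 112 of 128 = 0.875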

2. Entropy of functions of a random variable.


Let X be a discrete random variable. Show that the entropy of a function of X is less

than or equal to the entropy of X by justifying the following steps:
H(X, g(X)) = H(X) + H(g(X)|X)        (a)
           = H(X).                   (b)
H(X, g(X)) = H(g(X)) + H(X|g(X))     (c)
           ≥ H(g(X)).                (d)
Thus H(g(X)) ≤ H(X).

Solution: Entropy of functions of a random variable.

(a) H(X, g(X)) = H(X) + H(g(X)|X) by the chain rule for entropies.
(b) H(g(X)|X) = 0, since for any particular value of X, g(X) is determined, and
hence H(g(X)|X) = Σ_x p(x)H(g(X)|X = x) = Σ_x p(x) · 0 = 0.
(c) H(X, g(X)) = H(g(X)) + H(X|g(X)) again by the chain rule.
(d) H(X|g(X)) ≥ 0, with equality iff X is a function of g(X), i.e., g(·) is one-to-one.
Hence H(X, g(X)) ≥ H(g(X)).

Combining parts (b) and (d), we obtain H(X) ≥ H(g(X)).

3. Functions.

(a) Let Y = X^5, where X is a random variable taking on positive and negative integer
values. What is the relationship of H(X) and H(Y)?
(b) What if Y = X^2?
(c) What if Y = tan X?

Solution: Functions
By the previous problem, we know that passing a random variable through a function
can only reduce the entropy or leave it unchanged, but never increase it. That is,
H(g(X)) ≤ H(X), for any function g. The reason for this is simply that if the function
g is not one-to-one, then it will merge some states, reducing entropy.
The trick, then, for this problem, is simply to determine whether or not the mappings
are one-to-one. If so, then entropy is unchanged. If the mappings are not one-to-one,
then the entropy is necessarily decreased. Note that whether the function is one-to-one
or not is only meaningful for the support of X, i.e. for all x with p(x) > 0.

(a) Y = X^5 is one-to-one, and hence the entropy, which is just a function of the
probabilities (and not the values of the outcomes), does not change, i.e., H(X) =
H(Y).
(b) Y = X^2 is even, and hence not one-to-one unless the support of X does not
contain both x and −x for any x.
(c) Like part (a), Y = tan(X) is also one-to-one (as a function on the integers), and
hence H(X) = H(Y).

In general, H(Y ) ≤ H(X) for (b). For this case, it is also possible to give an upper-
bound on H(X): Since X can take only two values given Y , H(X|Y ) ≤ log 2 = 1.
Hence,
H(Y ) ≤ H(X) = H(X, Y ) = H(Y ) + H(X|Y ) ≤ H(Y ) + 1.
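
As a small numerical illustration of part (b) and the bound above, take X uniform on {−1, 1, 2} (an arbitrary choice), so that Y = X^2 merges the two states ±1; a Python sketch:

    # Y = X^2 merges +1 and -1, so H(Y) < H(X), but also H(X) <= H(Y) + 1.
    from collections import Counter
    from math import log2

    def entropy(pmf):
        return -sum(p * log2(p) for p in pmf.values() if p > 0)

    p_X = {-1: 1/3, 1: 1/3, 2: 1/3}        # arbitrary example distribution

    p_Y = Counter()                        # push p_X through g(x) = x^2
    for x, p in p_X.items():
        p_Y[x * x] += p

    H_X, H_Y = entropy(p_X), entropy(p_Y)
    print(H_X, H_Y)                        # about 1.585 and 0.918 bits
    print(H_Y <= H_X <= H_Y + 1)           # the two-sided bound holds: True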

4. Mutual information of heads and tails.

(a) Consider a fair coin flip. What is the mutual information between the top side
and the bottom side of the coin?
(b) A 6-sided fair die is rolled. What is the mutual information between the top side
and the bottom side?
(c) What is the mutual information between the top side and the front face (the side
most facing you)?

Solution: Mutual information of heads and tails.


For (a), observe that

I(T ; B) = H(B) − H(B|T )


= log 2 = 1

since B ∼ Ber(1/2), and B is a function of T . Here B and T stand for Bottom and
Top respectively.
For (b), observe that the bottom side B is again a function of the top side T , and there
are six equally probable possibilities for B. (In fact, B + T = 7.) Hence,

I(T ; B) = H(B) − H(B|T )


= log 6
= log 3 + 1.

For (c), note that having observed a side F of the cube facing us, there are four equally
probable possibilities for the top T . Thus,

I(T ; F ) = H(T ) − H(T |F )


= log 6 − log 4
= log 3 − 1,

since T has uniform distribution on {1, 2, . . . , 6}.
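
A quick numerical check of part (c), assuming the usual die convention that opposite faces sum to 7 and that, given the top face, each of the four remaining faces is equally likely to be the one facing you (as in the argument above):

    # Mutual information between the top face T and the front face F.
    from collections import defaultdict
    from math import log2

    joint = defaultdict(float)
    for t in range(1, 7):
        for f in range(1, 7):
            if f not in (t, 7 - t):                  # F is neither top nor bottom
                joint[(t, f)] += (1 / 6) * (1 / 4)

    def entropy(pmf):
        return -sum(p * log2(p) for p in pmf.values() if p > 0)

    def marginal(idx):
        m = defaultdict(float)
        for key, p in joint.items():
            m[key[idx]] += p
        return m

    H_T, H_F, H_TF = entropy(marginal(0)), entropy(marginal(1)), entropy(joint)
    print(H_T + H_F - H_TF, log2(3) - 1)             # both about 0.585 bits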

5. Bytes.
The entropy Ha(X) = − Σ_x p(x) log_a p(x) is expressed in bits if the logarithm is to the
base 2 and in bytes if the logarithm is to the base 256. What is the relationship of
H2(X) to H256(X)?

Solution: Bytes.


H2(X) = − Σ_x p(x) log2 p(x)
      = − Σ_x p(x) [log2 p(x) · log256(2)] / log256(2)
  (a) = − Σ_x p(x) log256 p(x) / log256(2)
      = (−1 / log256(2)) Σ_x p(x) log256 p(x)
  (b) = H256(X) / log256(2),

where (a) comes from the change-of-base property of logarithms, log2 p(x) · log256(2) = log256 p(x),
and (b) follows from the definition of H256(X). Since log256(2) = 1/8, we get
H2(X) = 8 H256(X).
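
The relationship is easy to check numerically for any particular distribution (the one below is an arbitrary example); a short Python sketch:

    # Entropy in bits is 8 times entropy in bytes.
    from math import log

    p = [0.5, 0.25, 0.125, 0.125]          # arbitrary example pmf

    def entropy(pmf, base):
        return -sum(q * log(q, base) for q in pmf if q > 0)

    H2, H256 = entropy(p, 2), entropy(p, 256)
    print(H2, 8 * H256)                    # both equal 1.75 bits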

6. Coin weighing.
Suppose one has n coins, among which there may or may not be one counterfeit coin.
If there is a counterfeit coin, it may be either heavier or lighter than the other coins.
The coins are to be weighed by a balance.

(a) Count the number of states the n coins can be in, and count the number of
outcomes of k weighings. By comparing them, find an upper bound on the number
of coins n so that k weighings will find the counterfeit coin (if any) and correctly
declare it to be heavier or lighter.

(b) (Difficult and optional.) What is the coin weighing strategy for k = 3 weighings
and 12 coins?

Solution: Coin weighing.

(a) For n coins, there are 2n + 1 possible situations or “states”.


• One of the n coins is heavier.
• One of the n coins is lighter.
• They are all of equal weight.
Each weighing has three possible outcomes — equal, left pan heavier, or right
pan heavier. Hence with k weighings, there are 3^k possible outcomes, and hence
we can distinguish between at most 3^k different “states”. Hence 2n + 1 ≤ 3^k, or
n ≤ (3^k − 1)/2.
Looking at it from an information theoretic viewpoint, each weighing gives at most
log2 3 bits of information. There are 2n + 1 possible “states”, with a maximum
entropy of log2 (2n + 1) bits. Hence in this situation, one would require at least
log2 (2n + 1)/ log2 3 weighings to extract enough information for determination of
the odd coin, which gives the same result as above.
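
A quick numeric evaluation of the counting bound from part (a) (a trivial sketch):

    # k weighings give at most 3^k outcomes, so they can handle n coins
    # only if 2n + 1 <= 3^k, i.e. n <= (3^k - 1)/2.
    for k in (1, 2, 3, 4):
        print(k, "weighings: at most", (3 ** k - 1) // 2, "coins")  # 1, 4, 13, 40

    print(2 * 12 + 1 <= 3 ** 3)    # 12 coins (25 states) fit within 3 weighings: True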
(b) There are many different solutions to this problem, and we will give two. But per-
haps more useful than knowing the solution is knowing how to go about thinking
through the problem, and how this relates to information theory. So let’s say a
few words about the thinking process first, or at least how to start it.
Perhaps the first question we are faced with when trying to solve the problem is:
“How many coins should we put on each pan in the first weighing?” We can use
an information theoretic analysis to answer this question. Firstly, it’s clear that
putting a different number of coins on each pan isn’t very helpful, so let’s put the
same number on both. Thus, the most number of coins we could put on one pan
is six, since there are 12 coins in total. Now let’s suppose we choose to weigh six
coins against six coins. On one hand, it may happen that the pans are balanced,
in which case, we have solved the problem with merely one weighing. However,
it is also possible that the left pan goes down, and the right pan goes up. In this
case, what information has the weighing given us? We can only conclude that
either one of the six coins in the left pan is too heavy, or one of the six coins in
the right pan is too light. Thus, whereas there were originally 2n + 1 = 25 states
for the counterfeit coin, there are now only 6 + 6 = 12. Since we have used up
one weighing, there are only two weighings left. However, since each weighing has
three possible outcomes, there will only be 3 × 3 = 9 different possible outcomes
for the remaining two weighings – and since this is less than 12, we won’t be able
to determine the counterfeit coin! Thus, in general we cannot solve the problem

if we weigh six coins against six coins.[1]
Similarly, suppose we weighed five coins against five coins. If the pans are im-
balanced, then by the same argument, the counterfeit coin has one of 10 possible
states. But 3 × 3 = 9 < 10, so this idea is also no good.
On the other hand, if we weighed four coins against four coins, and the pans are
imbalanced, then there are eight possible states for the counterfeit coin, which
is less than 9. Also, if the pans are balanced, then we can determine an off-
weight counterfeit amongst the remaining four coins by weighing two against two,
and then swapping a pair of coins between these two groups and weighing again
(verify this). So there is hope for this approach. In mathematics, optimality is
often associated with symmetry, so having three weighings of four coins against
four coins sounds plausible. After this point, we can try various combinations to
get toward the answer.
The above analysis can also be framed in terms of entropy and mutual information.
Let X be a random variable describing the state of the counterfeit coin, and let Y
be a random variable describing the outcome of the first weighing, where Y = B
means balanced, Y = L means the pans tilt left, and Y = R means the pans tilt
right. Assume a uniform distribution over the 25 possible values of X. Recall
that the mutual information I(X; Y ) describes the amount of information that
Y tells us about X (and vice versa). Thus, we should design the first weighing
to maximize I(X; Y ). Since H(X) is invariant of the weighing scheme, this is
the same as minimizing H(X|Y ). We can then explicitly compute H(X|Y ) for
different weighing schemes. For example, suppose we choose to weigh m coins
against m coins in the first weighing. Then

H(X|Y) = Pr(Y = L)H(X|Y = L) + Pr(Y = R)H(X|Y = R) + Pr(Y = B)H(X|Y = B)
       = 2 Pr(Y = L)H(X|Y = L) + Pr(Y = B)H(X|Y = B),

where the last step holds due to symmetry. We now compute each of these terms.

Pr(Y = L) = Pr(heavy counterfeit) Pr(Y = L | heavy counterfeit)
            + Pr(light counterfeit) Pr(Y = L | light counterfeit)
          = (12/25)(m/12) + (12/25)(m/12) = 2m/25.

To explain the conditional probabilities, notice that the probability that a heavy coin
falls into the left pan is m/12, and similarly for the probability that a light coin falls
into the right pan.

H(X|Y = L) = log2 (2m) = 1 + log2 (m),


[1] This sort of analysis was also mentioned by student Nagarajan in the TA's office hours.

since it is equally likely given the outcome that the counterfeit coin be any one of
the 2m coins which were weighed.
Pr(Y = B) = Pr(counterfeit exists) Pr(Y = B | counterfeit exists)
            + Pr(no counterfeit) Pr(Y = B | no counterfeit)
          = (24/25) · ((12 − 2m)/12) + (1/25) · 1
          = 1 − 4m/25,

where the (12 − 2m)/12 in the second equality represents the probability that the coun-
terfeit coin does not reside in either of the two pans used in our first weighing.
Lastly,
H(X|Y = B) = log2 (1 + 2(12 − 2m))
(Why?) Combining our results and simplifying,
I(X; Y) = H(X) − H(X|Y)
        = log2(25) − [ (4m/25)(1 + log2 m) + (1 − 4m/25) log2(25 − 4m) ].
Figure 1 and the table directly below evaluate I(X; Y ) for m ∈ {1, 2, 3, 4, 5, 6}.
We see that I(X; Y ) is maximized for m = 4. It should be noted that this maxi-
mization approach is a greedy approach to designing the first weighing, and not
a rigorous argument ... at best, it just supplements the aforementioned analysis.
m 1 2 3 4 5 6
I(X; Y ) 0.79431 1.22438 1.47885 1.58268 1.52193 1.20229
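
The entries in the table (and Figure 1) can be reproduced directly from the expression above; a short Python sketch:

    # I(X;Y) when m coins are placed on each pan in the first weighing.
    from math import log2

    for m in range(1, 7):
        p_tilt = 2 * m / 25                             # Pr(Y = L) = Pr(Y = R)
        p_bal = 1 - 4 * m / 25                          # Pr(Y = B)
        H_X_given_Y = (2 * p_tilt * log2(2 * m)         # tilted: 2m equally likely states
                       + p_bal * log2(1 + 2 * (12 - 2 * m)))  # balanced case
        print(m, round(log2(25) - H_X_given_Y, 5))      # matches the table entries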

Now for some solutions. First, a fun one. Consider the following sentence:
“MA, DO LIKE ME TO FIND FAKE COIN.”
One can imagine this sentence being slowly spoken in a drawl by a “slack-jawed
yokel” like Cletus from the Simpsons. This sentence actually holds a recipe for
solving the problem. If we count the number of distinct letters in the sentence,
there are exactly 12. Thus, rather than numbering the coins 1 through 12, we can
instead assign them distinct letters from the set {M, A, D, O, L, I, K, E, T, F, N, C}.
Now break the sentence into groups of four, moving across from left to right:
MADO LIKE METO FIND FAKE COIN
The plan is then to weigh the first two groups against each other, the second two
groups against each other, and the third two groups against each other. That is,
we want to perform the following weighings, which can be performed in any order
we want:

[Figure 1: Mutual information maximization. Plot of I(X;Y) against the number of coins per pan, as tabulated above; the maximum occurs at m = 4.]

• M, A, D, O vs. L, I, K, E
• M, E, T, O vs. F, I, N, D
• F, A, K, E vs. C, O, I, N
Based on the outcomes of these three weighings, we can deduce which coin is
counterfeit, if any, and also whether it is too heavy or too light. Let’s look at an
example. Suppose “M” is too heavy. Then the weighing results will be:

M, A, D, O vs. L, I, K, E left pan down, right pan up


M, E, T, O vs. F, I, N, D left pan down, right pan up
F, A, K, E vs. C, O, I, N pans are balanced

Now let’s pretend we didn’t know that M is too heavy, and deduce that this is
the case. Firstly, based on the third weighing, we know that coins F, A, K, E,
C, O, I, N must all be normal, so the counterfeit coin must lie outside this set.
We now proceed by process of elimination. First, suppose the counterfeit coin is
too light. Then the counterfeit coin must have been in both of the right pans in
the first two weighings. The only coin common to both of these right pans is “I”.
However, we know from the third weighing that “I” is a normal coin. Thus, the
counterfeit coin must actually be too heavy. So it must be in both of the left pans
in the first two weighings. The only coin common to both of these left pans is
“M”. Hence, we conclude that “M” is too heavy. We can conduct this same kind
of logical reasoning to detect any possible counterfeit coin.
Now we give an alternative solution that is based on the ternary number system.
We may express the numbers {−12, −11, . . . , −1, 0, 1, . . . , 12} in a ternary number
system with alphabet {−1, 0, 1}. For example, the number 8 is (−1, 0, 1), since
−1 × 3^0 + 0 × 3^1 + 1 × 3^2 = 8. We form the matrix with the representation of the
positive numbers as its columns.

        1   2   3   4   5   6   7   8   9  10  11  12
3^0     1  -1   0   1  -1   0   1  -1   0   1  -1   0    Σ1 = 0
3^1     0   1   1   1  -1  -1  -1   0   0   0   1   1    Σ2 = 2
3^2     0   0   0   0   1   1   1   1   1   1   1   1    Σ3 = 8
Note that the row sums are not all zero. We can negate some columns to make
the row sums zero. For example, negating columns 7,9,11 and 12, we obtain
        1   2   3   4   5   6   7   8   9  10  11  12
3^0     1  -1   0   1  -1   0  -1  -1   0   1   1   0    Σ1 = 0
3^1     0   1   1   1  -1  -1   1   0   0   0  -1  -1    Σ2 = 0
3^2     0   0   0   0   1   1  -1   1  -1   1  -1  -1    Σ3 = 0
Now place the coins on the balance according to the following rule. For weighing
#i, letting n_i denote the entry in row i of coin n's column, place coin n
• on the left pan, if n_i = −1;
• aside, if n_i = 0;
• on the right pan, if n_i = 1.
The outcome of the three weighings will find the odd coin if any and tell whether
it is heavy or light. The result of each weighing is 0 if both pans are equal, -1 if
the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings
give the ternary expansion of the index of the odd coin. If the expansion is the
same as the expansion in the matrix, it indicates that the coin is heavier. If
the expansion is of the opposite sign, the coin is lighter. For example, (0, −1, −1)
indicates (0)3^0 + (−1)3^1 + (−1)3^2 = −12, hence coin #12 is heavy; (1, 0, −1) indicates
#8 is light; (0, 0, 0) indicates no odd coin.
Why does this scheme work? It is a single error correcting Hamming code for the
ternary alphabet (discussed in Section 8.11 in the book). Here are some details.
First note a few properties of the matrix above that was used for the scheme. All
the columns are distinct and no two columns add to (0,0,0). Also if any coin is
heavier, it will produce the sequence of weighings that matches its column in the
matrix. If it is lighter, it produces the negative of its column as a sequence of
weighings. Combining all these facts, we can see that any single odd coin will
produce a unique sequence of weighings, and that the coin can be determined
from the sequence.
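
For concreteness, here is a minimal Python sketch of the ternary scheme. It hard-codes the sign-adjusted columns above, simulates the three weighing outcomes for every one of the 25 possible states, and checks that each state decodes correctly.

    # Columns of the sign-adjusted matrix, read top to bottom as the
    # (3^0, 3^1, 3^2) digits; digit -1 = left pan, 0 = aside, +1 = right pan.
    cols = {
        1: (1, 0, 0),   2: (-1, 1, 0),   3: (0, 1, 0),    4: (1, 1, 0),
        5: (-1, -1, 1), 6: (0, -1, 1),   7: (-1, 1, -1),  8: (-1, 0, 1),
        9: (0, 0, -1),  10: (1, 0, 1),   11: (1, -1, -1), 12: (0, -1, -1),
    }

    def weigh(odd_coin, heavy):
        """Three outcomes: -1 left pan heavier, +1 right pan heavier, 0 balanced."""
        results = []
        for i in range(3):
            tilt = 0
            if odd_coin is not None:
                side = cols[odd_coin][i]            # which pan the odd coin is on
                tilt = side if heavy else -side     # a light coin tilts the other way
            results.append(tilt)
        return tuple(results)

    def decode(results):
        if results == (0, 0, 0):
            return None                             # no odd coin
        for coin, col in cols.items():
            if results == col:
                return (coin, "heavy")
            if results == tuple(-d for d in col):
                return (coin, "light")

    assert decode(weigh(None, True)) is None
    for coin in cols:
        assert decode(weigh(coin, True)) == (coin, "heavy")
        assert decode(weigh(coin, False)) == (coin, "light")
    print("all 25 states decode correctly")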
One of the questions we can ask is whether the bound derived in part (a) is
actually achievable. For example, can one distinguish 13 coins in 3 weighings?
Although it is not possible with a scheme like the one above, it is possible under
the assumptions from which the bound was derived. The bound did not prohibit
the division of coins into halves, neither did it disallow the existence of another
coin known to be normal. Under either of these conditions, it is possible to find
the odd coin of 13 coins in 3 weighings.

7. Entropy of time to first success.
A fair coin is flipped until the first head occurs. Let X denote the number of flips
required.

(a) Find the entropy H(X) in bits. The following expressions may be useful:


Σ_{n=1}^∞ r^n = r/(1 − r),        Σ_{n=1}^∞ n r^n = r/(1 − r)^2.

(b) Find an “efficient” sequence of yes-no questions of the form, “Is X contained in
the set S?” Compare H(X) to the expected number of questions required to
determine X.
(c) Let Y denote the number of flips until the second head appears. Thus, for ex-
ample, Y = 5 if the second head appears on the 5th flip. Argue that H(Y ) =
H(X1 + X2 ) < H(X1 , X2 ) = 2H(X), and interpret in words.

Solution: Entropy of time to first success

(a) The distribution of the number X of tosses until the first head appears is a geo-
metric distribution with parameter p = 1/2: P(X = n) = p q^{n−1}, n ∈ {1, 2, . . .},
where q = 1 − p. Then

H(X) = − Σ_{n=1}^∞ p q^{n−1} log(p q^{n−1})
     = − [ Σ_{n=0}^∞ p q^n log p + Σ_{n=0}^∞ n p q^n log q ]
     = −p log p/(1 − q) − p q log q/p^2
     = (−p log p − q log q)/p
     = H(p)/p bits.

If p = 1/2, then H(X) = 2 bits. Note also that this quantity, H(p)/p, makes
perfect sense: the coin tosses are independent, so each coin toss gives us H(p)
bits of entropy. On average, we make 1/p tosses until we get the first head, so
H(p)/p should be the total entropy.
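
A numerical sanity check of H(X) = H(p)/p (a sketch; the infinite sum is truncated, which is harmless since the tail terms are negligible):

    # Compare the direct sum with the closed form H(p)/p for p = 1/2.
    from math import log2

    p = 0.5
    q = 1 - p
    H_direct = -sum(p * q ** (n - 1) * log2(p * q ** (n - 1)) for n in range(1, 201))
    H_formula = (-p * log2(p) - q * log2(q)) / p
    print(H_direct, H_formula)             # both are 2.0 bits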
(b) This problem invites you to use your intuition to guess an answer. A good yes-
or-no question is one whose entropy is as high as possible, that is, as close to one
as possible. We model the question as a random variable Y that can take on two
values: YES and NO, with probabilities p and (1−p). The entropy of the question
is then just H(p). Since I(X; Y ) = H(Y ) − H(Y |X) ≤ H(Y ), and the maximum

entropy of Y is H(1/2) = 1, we cannot learn more than 1 bit of information about
X from any single question Y.
Now consider the sequence of questions, Y1 = “Is X = 1?”, Y2 = “Is X = 2?”, etc.
As soon as we get a yes answer, we are done. On the other hand, given that
the previous k answers have all been NO, the entropy of the next question,
given the current state of knowledge, is precisely 1 (since, conditioned on those
answers, the next flip is heads with probability 1/2). Thus, each question gives us
1 bit of additional information, which is the best we can do.
In the special case of fair coin flips, E[number of questions] = E[X] = 2 =
H(X). In general, E[X] has nothing to do with H(X). On the other hand,
E[number of questions] has a lot to do with H(X). We will see later for any
discrete random variable X, H(X) represents the minimum number of questions
required, on average, to ascertain the value of X.
(c) Intuitively, (X1 , X2 ) has more information than Y = X1 + X2 ; hence H(Y ) <
H(X1 , X2 ).
Since the map (X1, X2) ↦ X1 + X2 is not one-to-one, some states will get
merged with a resultant loss of entropy. Hence, by the same argument as in
Problem 2, H(X1 + X2) < H(X1, X2).
Alternatively, one can observe that by the chain rule,

H(X1 , Y ) = H(X1 ) + H(Y |X1 )


= H(X1 ) + H(X1 + X2 |X1 )
= H(X1 ) + H(X2 |X1 )
= H(X1 , X2 ).

On the other hand, H(X1 , Y ) = H(Y ) + H(X1 |Y ) > H(Y ) since Y does not com-
pletely determine X1 and hence H(X1 |Y ) > 0. Therefore, H(Y ) < H(X1 , X2 ) =
2H(X), where the last equality follows from the fact that X1 and X2 are i.i.d.,
i.e. H(X1 , X2 ) = H(X1 ) + H(X2 |X1 ) = H(X1 ) + H(X2 ) = 2H(X1 ).

8. Example of joint entropy.


Let p(x, y) be given by
                  Y
       X          0       1
       0         1/4     1/4
       1          0      1/2

Find

(a) H(X), H(Y ).

(b) H(X|Y ), H(Y |X).
(c) H(X, Y ).
(d) H(Y ) − H(Y |X).
(e) I(X; Y ).
(f) Draw a Venn diagram for the quantities in (a) through (e).

Solution: Example of joint entropy

[Figure 2: Venn diagram illustrating the relationships among the entropies and the mutual information.]

(a) H(X) = (1/2) log 2 + (1/2) log 2 = 1.0 bit.
    H(Y) = (1/4) log 4 + (3/4) log(4/3) = 0.811 bits.
(b) H(X|Y) = (1/4) H(X|Y = 0) + (3/4) H(X|Y = 1) = 0.689 bits.
    H(Y|X) = (1/2) H(Y|X = 0) + (1/2) H(Y|X = 1) = 0.5 bits.
(c) H(X, Y) = (1/4) log 4 + (1/4) log 4 + (1/2) log 2 = 1.5 bits.
(d) H(Y ) − H(Y |X) = 0.311 bits.
(e) I(X; Y ) = H(Y ) − H(Y |X) = 0.311 bits.
(f) See Figure 2.
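
All of the numbers in parts (a) through (e) can be reproduced directly from the joint table; a short Python sketch:

    # Entropy quantities for the joint pmf of Problem 8.
    from collections import defaultdict
    from math import log2

    p = {(0, 0): 1/4, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/2}

    def entropy(pmf):
        return -sum(v * log2(v) for v in pmf.values() if v > 0)

    p_X, p_Y = defaultdict(float), defaultdict(float)
    for (x, y), v in p.items():
        p_X[x] += v
        p_Y[y] += v

    H_X, H_Y, H_XY = entropy(p_X), entropy(p_Y), entropy(p)
    print("H(X)   =", H_X)                          # 1.0
    print("H(Y)   =", round(H_Y, 3))                # 0.811
    print("H(X,Y) =", H_XY)                         # 1.5
    print("H(X|Y) =", round(H_XY - H_Y, 3))         # 0.689
    print("H(Y|X) =", H_XY - H_X)                   # 0.5
    print("I(X;Y) =", round(H_X + H_Y - H_XY, 3))   # 0.311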

9. Two looks.
Here is a statement about pairwise independence and joint independence. Let X, Y1 ,
and Y2 be binary random variables. If I(X; Y1 ) = 0 and I(X; Y2 ) = 0, does it follow
that I(X; Y1 , Y2 ) = 0?

(a) Yes or no?

(b) Prove or provide a counterexample.
(c) If I(X; Y1 ) = 0 and I(X; Y2 ) = 0 in the above problem, does it follow that
I(Y1 ; Y2 ) = 0? In other words, if Y1 is independent of X, and if Y2 is independent
of X, is it true that Y1 and Y2 are independent?

Solution: Two looks.

(a) The answer is “no”.


(b) At first the conjecture seems reasonable enough – after all, if Y1 gives you
no information about X, and Y2 gives you no information about X, then why
should the two of them together give any information? But remember, it is NOT
the case that I(X; Y1, Y2) = I(X; Y1) + I(X; Y2). The chain rule for information
says instead that I(X; Y1, Y2) = I(X; Y1) + I(X; Y2|Y1), which gives us
reason to be skeptical about the conjecture.
This problem is reminiscent of the well-known fact in probability that pair-wise
independence of three random variables is not sufficient to guarantee that all three
are mutually independent. I(X; Y1 ) = 0 is equivalent to saying that X and Y1 are
independent. Similarly for X and Y2 . But just because X is pairwise independent
with each of Y1 and Y2 , it does not follow that X is independent of the vector
(Y1 , Y2 ).
Here is a simple counterexample. Let Y1 and Y2 be independent fair coin flips.
And let X = Y1 XOR Y2 . X is pairwise independent of both Y1 and Y2 , but
obviously not independent of the vector (Y1 , Y2 ), since X is uniquely determined
once you know (Y1 , Y2 ).
(c) Again the answer is “no”. Y1 and Y2 can be arbitrarily dependent with each
other and both still be independent of X. For example, let Y1 = Y2 be two
observations of the same fair coin flip, and X an independent fair coin flip. Then
I(X; Y1 ) = I(X; Y2 ) = 0 because X is independent of both Y1 and Y2 . However,
I(Y1 ; Y2 ) = H(Y1 ) − H(Y1 |Y2 ) = H(Y1 ) = 1.
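
Both counterexamples are easy to verify numerically; the sketch below computes the relevant mutual informations over the four equally likely outcomes in each case.

    # (b): X = Y1 xor Y2 is independent of Y1 and of Y2, yet I(X;Y1,Y2) = 1.
    # (c): Y1 = Y2 is a single coin, X an independent coin.
    from collections import Counter
    from itertools import product
    from math import log2

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * log2(c / total) for c in counter.values())

    def mutual_info(pairs):
        """I(U;V) in bits from a list of equally likely (u, v) outcomes."""
        return (entropy(Counter(u for u, v in pairs))
                + entropy(Counter(v for u, v in pairs))
                - entropy(Counter(pairs)))

    b = [(y1 ^ y2, y1, y2) for y1, y2 in product((0, 1), repeat=2)]
    print(mutual_info([(x, y1) for x, y1, y2 in b]))         # I(X;Y1) = 0
    print(mutual_info([(x, y2) for x, y1, y2 in b]))         # I(X;Y2) = 0
    print(mutual_info([(x, (y1, y2)) for x, y1, y2 in b]))   # I(X;Y1,Y2) = 1

    c = [(x, y, y) for x, y in product((0, 1), repeat=2)]
    print(mutual_info([(x, y1) for x, y1, y2 in c]))         # I(X;Y1) = 0
    print(mutual_info([(y1, y2) for x, y1, y2 in c]))        # I(Y1;Y2) = 1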

10. A measure of correlation.


Let X1 and X2 be identically distributed, but not necessarily independent. Note that
H(X1 ) = H(X2 ). Let
ρ = 1 − H(X2|X1)/H(X1).

(a) Show ρ = I(X1; X2)/H(X1).
(b) Show 0 ≤ ρ ≤ 1.

(c) When is ρ = 0?
(d) When is ρ = 1?

Solution: A measure of correlation.


X1 and X2 are identically distributed and

ρ = 1 − H(X2|X1)/H(X1).

(a)

ρ = [H(X1) − H(X2|X1)] / H(X1)
  = [H(X2) − H(X2|X1)] / H(X1)      (since H(X1) = H(X2))
  = I(X1; X2) / H(X1).

(b) Since 0 ≤ H(X2|X1) ≤ H(X2) = H(X1), we have

0 ≤ H(X2|X1)/H(X1) ≤ 1,

and hence 0 ≤ ρ ≤ 1.
(c) ρ = 0 iff I(X1 ; X2 ) = 0 iff X1 and X2 are independent.
(d) ρ = 1 iff H(X2 |X1 ) = 0 iff X2 is a function of X1 . By symmetry, X1 is a function
of X2 , i.e., X1 and X2 have a one-to-one correspondence. For example, if X1 = X2
with probability 1 then ρ = 1. Similarly, if the distribution of Xi is symmetric
then X1 = −X2 with probability 1 would also give ρ = 1.
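
As a small numerical illustration (the joint distribution below is an arbitrary choice), let X2 be a noisy copy of a fair bit X1, flipped with probability 0.1; then ρ ≈ 0.53, strictly between the extremes of parts (c) and (d), and it equals I(X1; X2)/H(X1) as in part (a).

    # rho for a fair bit X1 and a noisy copy X2 (flip probability 0.1).
    from collections import defaultdict
    from math import log2

    eps = 0.1
    p = {(0, 0): 0.5 * (1 - eps), (0, 1): 0.5 * eps,
         (1, 0): 0.5 * eps,       (1, 1): 0.5 * (1 - eps)}

    def entropy(pmf):
        return -sum(v * log2(v) for v in pmf.values() if v > 0)

    p1, p2 = defaultdict(float), defaultdict(float)
    for (a, b), v in p.items():
        p1[a] += v
        p2[b] += v

    H1, H2, H12 = entropy(p1), entropy(p2), entropy(p)
    rho = 1 - (H12 - H1) / H1               # 1 - H(X2|X1)/H(X1)
    print(rho, (H1 + H2 - H12) / H1)        # both about 0.531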

