Entropy: Measuring Information

The concept of information entropy is used widely in search and analytics technologies, and despite its wide use the idea is poorly understood. When Claude Shannon developed his idea in the 1940s he was at a loss to know what to call it. He asked John von Neumann, who replied that he should call his new measure entropy, because nobody knows what entropy really is, so in a debate you will always have the advantage. In this short paper we will explore the basic ideas (with a little math included), since the concept of entropy is central to many data mining techniques (decision trees, for example).

The best place to start is with a consideration of the over-used, but very useful, coin flipping example. If we flip a fair coin we expect that it will show a head or a tail around fifty per cent of the time. Assigning the number one to a head (H) and the number zero to a tail (T) means we can record the outcome of a flip in a single bit: a zero or a one. Flipping a coin twice could be recorded in two bits: 00, 01, 10 or 11, representing the four possible outcomes (TT, TH, HT, HH).

What has this got to do with information, you might ask. Well, if you understood the binary coding for the two coin flip event (i.e. 01 represents a tail followed by a head), and the flips were conducted in secret by a third party, there would be four possible outcomes, and until the binary code was revealed your uncertainty would be captured by the four possible codes 00, 01, 10, 11: two bits (binary digits) of information. When the code was revealed, your uncertainty would be reduced from four possibilities to one. As such, the basic idea behind information entropy (or Shannon entropy, as some call it) is reduction of uncertainty, and information is defined by the amount that uncertainty is reduced.

For now we will assume that the probabilities of all possible outcomes in an uncertain situation are equal. Bearing in mind our two coin experiment, it is not too hard to see that the amount of information, $I$, needed to reduce n possibilities to one can be expressed by:

$I = \log_2 n$

So for the two flip experiment the information revealed by the result would be $I = \log_2 4$, which is 2. That is, 2 bits of information are needed.
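If it helps to see the numbers, a couple of lines of Python reproduce them (Python is my choice of illustration here; the paper itself contains no code):

    import math

    # Information needed to reduce n equally likely possibilities to one
    print(math.log2(2))   # single coin flip:  1.0 bit
    print(math.log2(4))   # two coin flips:    2.0 bits
    print(math.log2(8))   # three coin flips:  3.0 bits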

When the probabilities of the various outcomes are not equal, the situation gets a bit more complicated. If the set of outcomes $\{S_1, S_2, \ldots, S_n\}$ has associated probabilities $\{p_1, p_2, \ldots, p_n\}$, then the information (the number of bits) for any particular outcome $S_i$ is defined by:

$I(S_i) = \log_2 \frac{1}{p_i}$

This is really just another reformulation of the first equation, since an outcome with probability $p_i$ corresponds to $1/p_i$ equally likely possibilities. In other words, if the probability of an outcome is 0.25 (i.e. the double coin flip) then the information is $\log_2 \frac{1}{0.25} = \log_2 4$, which is 2 again. Since we are typically interested in the information generated by a system as a whole (our coin flipping experiments, for example), this is defined by:

$H(S) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i}$
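This weighted sum is straightforward to compute directly. Below is a minimal Python sketch; the function name entropy and the choice to skip zero-probability outcomes are illustrative assumptions of mine, not anything specified in the paper:

    import math

    def entropy(probabilities):
        # Shannon entropy in bits: the weighted sum of p * log2(1/p)
        # over all outcomes (outcomes with zero probability contribute nothing).
        return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

    # Double coin flip: four equally likely outcomes of probability 0.25 each
    print(entropy([0.25, 0.25, 0.25, 0.25]))   # -> 2.0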


In other words, we just take the weighted sum of the individual information measurements for each of the outcomes. Just in case you are finding it hard to feel intuitively what this is saying, I should remind you of the words of John von Neumann quoted at the start of this article: information entropy is very hard to understand, but we can get some insights. In our coin flip example (just one flip of a fair coin) the probabilities of heads and tails are both 0.5. Plugging this into the equation, we get an information entropy of (0.5 x 1 + 0.5 x 1) = 1. If however we consider a biased coin such that the probability of a head is 0.875 and the probability of a tail is 0.125 (i.e. one in eight), then the information entropy of our new coin flipping system is (0.125 x 3 + 0.875 x 0.193) = 0.544.

While it may not be immediately apparent, this does make sense. With the fair coin the uncertainty is much more profound: each flip is equally likely to come up with a head or a tail, a 50:50 bet. However, with the biased coin we can be fairly certain that heads will show, and so the information entropy of this system is lower: 0.544 versus 1 for the fair coin. Lower entropy means less uncertainty. When all outcomes are equally likely the uncertainty is greatest, and so is the information entropy.
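The same arithmetic can be checked directly with a short, self-contained Python script (again purely illustrative; the variable names are mine, and heads and tails are assumed to be the only two outcomes):

    import math

    # Fair coin: P(H) = P(T) = 0.5
    h_fair = 0.5 * math.log2(1 / 0.5) + 0.5 * math.log2(1 / 0.5)
    print(h_fair)     # -> 1.0

    # Biased coin: P(H) = 0.875, P(T) = 0.125
    h_biased = 0.875 * math.log2(1 / 0.875) + 0.125 * math.log2(1 / 0.125)
    print(h_biased)   # -> approximately 0.544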

www.modulusfive.com 2011
