1 Maximum-Entropy Probability Distributions: Principles, Formalism and Techniques

Every probability distribution has some "uncertainty" associated with it. The concept of 'entropy' is introduced here to provide a quantitative measure of this uncertainty. One object of the present chapter is to examine the properties which, on intuitive grounds, we expect a measure of uncertainty to have, and then to develop some measures having some or all of these properties.

According to the maximum-entropy principle, given some partial information about a random variate, scalar or vector, we should choose that probability distribution for it which is consistent with the given information, but has otherwise maximum uncertainty associated with it. In other words, we first find the family of probability distributions every member of which is consistent with the given information, and then from this family we choose that distribution whose uncertainty or entropy is greater than that of every other member of the family.

There can be no "proof" for such a principle. We shall only attempt to make it look plausible and natural, and explain in what sense we can call the resulting probability distribution 'most likely', 'most unbiased', 'least prejudiced' or 'most uniform'. Taking the principle as an axiom, we shall find the probability distributions it leads us to. For this purpose, we develop in this chapter a formalism, mainly due to Jaynes [1957, 1963a, b, 1982], and then use it in subsequent chapters to show that it leads, in a very natural and smooth manner, to almost all the known probability distributions used in statistics and to some generalisations of these. This gives some 'a posteriori' justification for the Maximum-Entropy Principle. We also get a systematic, unified and useful way of deriving and characterising probability distributions. There is little 'ad hocism' and there is no need to start 'de novo' for every probability distribution.

1.1 MEASURES OF ENTROPY

1.1.1 Requirements of a Measure of Uncertainty of a Probability Distribution

Consider the $n$ possible outcomes $A_1, A_2, \ldots, A_n$ of an experiment, with probabilities $p_1, p_2, \ldots, p_n$ giving rise to the probability distribution

$$\sum_{i=1}^{n} p_i = 1, \qquad p_i \geq 0, \quad i = 1, 2, \ldots, n. \qquad (1)$$

There is an uncertainty as to the outcome when the experiment is performed. Any measure of this uncertainty should satisfy the following requirements:

(i) It should be a function of $p_1, p_2, \ldots, p_n$, so that we may write it as
$$H = H_n(P) = H_n(p_1, p_2, \ldots, p_n). \qquad (2)$$

(ii) It should be a continuous function of $p_1, p_2, \ldots, p_n$, i.e. small changes in $p_1, p_2, \ldots, p_n$ should cause a small change in $H_n$.

(iii) It should not change when the outcomes are rearranged among themselves, i.e. $H_n$ should be a symmetric function of its arguments.

(iv) It should not change if an impossible outcome is added to the probability scheme, i.e.
$$H_{n+1}(p_1, p_2, \ldots, p_n, 0) = H_n(p_1, p_2, \ldots, p_n). \qquad (3)$$

(v) It should be minimum, and possibly zero, when there is no uncertainty about the outcome. Thus it should vanish when one of the outcomes is certain to happen, so that
$$H_n(p_1, p_2, \ldots, p_n) = 0 \quad \text{when } p_i = 1,\ p_j = 0,\ j \neq i. \qquad (4)$$

(vi) It should be maximum when there is maximum uncertainty, which arises when the outcomes are equally likely, so that $H_n$ should be maximum when
$$p_1 = p_2 = \cdots = p_n = \frac{1}{n}. \qquad (5)$$

(vii) The maximum value of $H_n$ should increase as $n$ increases.

(viii) For two independent probability distributions
$$P = (p_1, p_2, \ldots, p_n), \qquad Q = (q_1, q_2, \ldots, q_m), \qquad (6)$$
the uncertainty of the joint scheme $P \cup Q$ should be the sum of their uncertainties, i.e.
$$H(P \cup Q) = H_n(P) + H_m(Q), \qquad (7)$$
where, if $A_1, A_2, \ldots, A_n$ and $B_1, B_2, \ldots, B_m$ are the outcomes of $P$ and $Q$, then the outcomes of $P \cup Q$ are $A_i B_j$ with probabilities $p_i q_j$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$).

1.1.2 Shannon's Measure of Uncertainty

Shannon [1948] suggested the following measure:
$$H_n(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i. \qquad (8)$$

It is easily seen to be a function of $p_1, p_2, \ldots, p_n$. It is also a continuous function, and it is a symmetric function if we always replace $0 \ln 0$ by $0$. It does not change when an impossible outcome is added to the probability scheme. When one of the probabilities is unity and the others are zero, its value is zero, and this is its minimum value, since $H_n \geq 0$ when $0 \leq p_i \leq 1$. To find its maximum value, we can use Lagrange's method to maximize
$$L = -\sum_{i=1}^{n} p_i \ln p_i - \lambda \Big( \sum_{i=1}^{n} p_i - 1 \Big), \qquad (9)$$
and this gives us (5). Since $x \ln x$ is a convex function, $\sum p_i \ln p_i$ is a convex function, $-\sum p_i \ln p_i$ is a concave function, and its local maximum is a global maximum. The maximum value of $H_n$ is
$$-\sum_{i=1}^{n} \frac{1}{n} \ln \frac{1}{n} = \ln n, \qquad (10)$$
and this goes on increasing as $n$ increases. Alternatively, to prove that $H_n$ is maximum when (5) is satisfied, we can use Jensen's inequality for a convex function $\phi(x)$, which states that
$$E[\phi(x)] \geq \phi(E(x)) \qquad (11)$$
for any random variate $x$. Let $\phi(x) = x \ln x$ and let $x$ take the values $p_1, p_2, \ldots, p_n$, each with probability $1/n$; then
$$\frac{1}{n} \sum_{i=1}^{n} p_i \ln p_i \geq \Big( \frac{1}{n} \sum_{i=1}^{n} p_i \Big) \ln \Big( \frac{1}{n} \sum_{i=1}^{n} p_i \Big) = \frac{1}{n} \ln \frac{1}{n},$$
so that $-\sum p_i \ln p_i \leq \ln n$, with equality if and only if (5) holds.

1.1.3 Renyi's and Kapur's Measures of Entropy

Renyi [1961] generalised Shannon's measure to the family
$$H_\alpha(P) = \frac{1}{1-\alpha} \ln \Big( \sum_{i=1}^{n} p_i^{\alpha} \Big), \qquad \alpha \neq 1, \ \alpha > 0, \qquad (22)$$
which includes Shannon's measure as the limiting case $\alpha \to 1$. Kapur* gave the doubly infinite family of measures
$$H_{\alpha,\beta}(P) = \frac{1}{1-\alpha} \ln \frac{\sum_{i=1}^{n} p_i^{\alpha+\beta-1}}{\sum_{i=1}^{n} p_i^{\beta}}, \qquad \alpha \neq 1, \ \alpha > 0, \ \beta > 0, \ \alpha + \beta - 1 > 0. \qquad (24)$$
This reduces to Renyi's measure when $\beta = 1$, to Shannon's measure when $\beta = 1$, $\alpha \to 1$, and to Hartley's measure $\ln n$ when $\beta = 1$ and $\alpha = 0$. When $\beta = 1$, $\alpha \to \infty$, it gives the measure
$$H(P) = -\ln p_{\max}. \qquad (25)$$

*This measure was earlier obtained by Aczel and Daroczy (1963), but Kapur studied it in detail and solved some problems connected with it.

1.1.4 Non-additive Measures of Entropy

Havrda and Charvat [1967] gave up the requirements (viii) and (ix) of additivity and obtained the first non-additive measure of entropy:
$$H_\alpha(P) = \frac{\sum_{i=1}^{n} p_i^{\alpha} - 1}{2^{1-\alpha} - 1}, \qquad \alpha \neq 1, \ \alpha > 0. \qquad (26)$$
To be consistent with Renyi's measure and for mathematical convenience, we shall use it in the modified form
$$H_\alpha^{*}(P) = \frac{\sum_{i=1}^{n} p_i^{\alpha} - 1}{1 - \alpha}, \qquad \alpha \neq 1, \ \alpha > 0. \qquad (27)$$
From (22) and (27), we get for a complete probability distribution
$$H_\alpha^{*}(P) = \frac{\exp[(1-\alpha) H_\alpha(P)] - 1}{1 - \alpha}, \qquad (28)$$
so that $H_\alpha^{*}(P)$ is a one-one function of $H_\alpha(P)$. Instead of property (viii), we get
$$H_\alpha^{*}(P \cup Q) = H_\alpha^{*}(P) + H_\alpha^{*}(Q) + (1-\alpha) H_\alpha^{*}(P) H_\alpha^{*}(Q), \qquad (29)$$
so that $H_\alpha^{*}(P)$ is not an additive measure of entropy. However, a function of it, viz. $H_\alpha(P)$, is an additive measure.

Behara and Chawla [1974] defined the non-additive $\gamma$-entropy
$$H_\gamma(P) = \frac{1 - \Big( \sum_{i=1}^{n} p_i^{1/\gamma} \Big)^{\gamma}}{1 - 2^{\gamma - 1}}, \qquad \gamma \neq 1, \ \gamma > 0. \qquad (30)$$
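These measures and their interrelations are easy to explore numerically. The following sketch (Python with NumPy; the function names and the sample distribution are ours, not part of the text) computes Shannon's measure (8), Renyi's measure (22) and the modified Havrda-Charvat measure (27), and checks that Renyi's measure tends to Shannon's as $\alpha \to 1$ and that relation (28) holds.

```python
import numpy as np

def shannon(p):
    """Shannon entropy (8): -sum p_i ln p_i, with 0 ln 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi(p, alpha):
    """Renyi entropy (22) of order alpha (alpha > 0, alpha != 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def havrda_charvat_modified(p, alpha):
    """Modified Havrda-Charvat entropy (27): (sum p_i^alpha - 1)/(1 - alpha)."""
    p = np.asarray(p, dtype=float)
    return (np.sum(p ** alpha) - 1.0) / (1.0 - alpha)

P = [0.5, 0.25, 0.15, 0.10]          # an arbitrary sample distribution

# As alpha -> 1, Renyi's measure approaches Shannon's measure.
for alpha in (1.1, 1.01, 1.001):
    print(alpha, renyi(P, alpha), shannon(P))

# Relation (28): H*_alpha = (exp((1 - alpha) H_alpha) - 1)/(1 - alpha).
alpha = 0.7
lhs = havrda_charvat_modified(P, alpha)
rhs = (np.exp((1 - alpha) * renyi(P, alpha)) - 1) / (1 - alpha)
print(abs(lhs - rhs) < 1e-12)        # True up to rounding
```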
Kapur [1981, 1982] gave a number of other similar non-additive measures of entropy, among them the measure
$$\exp\Big( -\sum_{i=1}^{n} p_i \ln p_i \Big) - 1.$$
He called these pseudo-non-additive measures of entropy since, while they are non-additive, suitable functions of them are additive. He also gave genuinely non-additive measures of entropy (Kapur 1986), obtained by adding to Shannon's measure terms of the type $\frac{1}{k} \sum_{i=1}^{n} (1 + k p_i) \ln (1 + k p_i)$, $k > 0$, together with suitable constants. Some of these have been found useful in deriving the Bose-Einstein and Fermi-Dirac distributions of statistical mechanics [cf. Sections 6.2 and 6.3].

While all these measures are interesting and useful in some sense, Shannon's measure is the most useful and the most natural (Aczel et al [1974]), and this is the measure we shall mostly use in our discussions. The characterisation of Renyi's measures of entropy has been discussed in Aczel and Daroczy [1975] and Mathai and Rathie [1975].

EXERCISES 1.1

1. Prove that
(i) $\lim_{p \to 0} p \ln p = 0$;
(ii) $x \ln x$ is a convex function for $x > 0$;
(iii) $f(x) = x \ln x$ for $x > 0$, $f(0) = 0$, is a convex function for $x \geq 0$;
(iv) the sum of a number of convex functions is a convex function;
(v) if $f(x)$ is a convex function, then $-f(x)$ is a concave function;
(vi) $\ln n$ is an increasing function of $n$.

2. Examine which of the properties (i)-(ix) and (ix)' of Sections 1.1.1 and 1.1.2 hold for each of the twelve measures of entropy given in Section 1.1.4. Prepare a table giving the properties which hold for each measure.

3. Show that $\ln (p_1^{\alpha} + p_2^{\alpha} + \cdots + p_n^{\alpha})$ is a concave function of $(p_1, p_2, \ldots, p_n)$ if $\alpha < 1$.

4. Prove that
(i) $\lim_{\alpha \to 1} H_\alpha(P) = -\sum_{i=1}^{n} p_i \ln p_i$;
(ii) $\lim_{\alpha \to 1} H_\alpha^{*}(P) = -\sum_{i=1}^{n} p_i \ln p_i$;
(iii) $\lim_{\alpha \to 0} H_\alpha(P) = \ln n$;
(iv) $\lim_{\alpha \to 0} H_\alpha^{*}(P) = n - 1$;
(v) $\lim_{\alpha \to \infty} H_\alpha(P) = -\ln p_{\max}$.

5. Show that Shannon's measure of entropy has the following properties:
(i) recursivity or branching principle, i.e.
$$H_n(p_1, p_2, \ldots, p_n) = H_{n-1}(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2)\, H_2\Big( \frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2} \Big);$$
(ii) strong additivity, i.e.
$$H_{nm}(p_{11}, \ldots, p_{1m}; \ldots; p_{n1}, \ldots, p_{nm}) = H_n(p_1, \ldots, p_n) + \sum_{i=1}^{n} p_i H_m\Big( \frac{p_{i1}}{p_i}, \ldots, \frac{p_{im}}{p_i} \Big), \qquad p_i = \sum_{j=1}^{m} p_{ij};$$
(iii) additivity, i.e. $H_{nm}(p_1 q_1, \ldots, p_i q_j, \ldots, p_n q_m) = H_n(P) + H_m(Q)$;
(iv) the functional equation
$$f(x) + (1-x)\, f\Big( \frac{y}{1-x} \Big) = f(y) + (1-y)\, f\Big( \frac{x}{1-y} \Big),$$
where $f(x) = H_2(x, 1-x)$, $x, y \in [0, 1)$ and $x + y \in [0, 1]$.

6. Examine whether any of the four properties of Ex. 5 hold for any of the twelve non-additive measures of entropy given in Section 1.1.4.

1.2 MAXIMUM-ENTROPY PRINCIPLE

1.2.1 Bayesian Entropy

Shannon's measure of uncertainty is maximum when all the outcomes are equally likely. This is consistent with Laplace's principle of insufficient reason, that unless there is information to the contrary, all outcomes should be considered equally likely. However, on the basis of intuition or experience, one may have reasons to believe that the a priori probability distribution is given by
$$p_1 = a_1, \ p_2 = a_2, \ \ldots, \ p_n = a_n; \qquad (39)$$
we then define another measure, which we call Bayesian entropy, by
$$B(P) = -\sum_{i=1}^{n} p_i \ln \frac{p_i}{n\, a_i}, \qquad (40)$$
where*
$$\sum_{i=1}^{n} a_i = 1, \qquad a_i > 0, \quad i = 1, 2, \ldots, n. \qquad (41)$$

*Here we have assumed that none of the $a_i$'s is zero, i.e. we are considering only those outcomes which are not a priori impossible.

Now
$$B(P) = -\Big[ \sum_{i=1}^{n} p_i \ln \frac{p_i}{a_i} \Big] + \ln n, \qquad \sum_{i=1}^{n} p_i = 1, \ \sum_{i=1}^{n} a_i = 1, \ a_i > 0. \qquad (42)$$
Under these conditions, by Shannon's inequality [see Section 1.4.2], the expression within the square brackets in (42) is $\geq 0$ and vanishes if and only if $p_i = a_i$ for all $i$. Thus Bayesian entropy is maximum when $p_i = a_i$ for all $i$, and is minimum when the outcome with the minimum a priori probability is certain to occur.

In information theory, we define [Kullback and Leibler (1951)]
$$I(P : A) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{a_i} \qquad (43)$$
as the directed divergence of the probability distribution $P = (p_1, p_2, \ldots, p_n)$ from the a priori distribution $A = (a_1, a_2, \ldots, a_n)$. Thus
$$B(P) = -I(P : A) + \ln n. \qquad (44)$$

Thus maximizing (minimizing) Bayesian entropy is equivalent to minimizing (maximizing) the directed divergence of $P$ from $A$. In the absence of any other information, Bayesian entropy is maximum when the directed divergence of the probability distribution from the a priori distribution is minimum, i.e. when the given distribution $P$ is the same as the a priori distribution $A$. Again, the Bayesian entropy is minimum when the divergence is maximum, i.e. when our probability distribution makes that event certain which has minimum a priori probability.

This entropy measure is called Bayesian because it takes the a priori probability distribution into account. Shannon's measure is a special case of it, arising when the a priori probability distribution is the uniform distribution.

The prior distribution may be given to us on the basis of the experience of the decision-maker, or it may be obtained from theoretical considerations. Thus, suppose we throw a coin $n$ times under identical conditions and we do not know how fair the coin is. We do not know the probability of $i$ successes ($i = 0, 1, 2, \ldots, n$), but we know from experience that these probabilities are not equal. Since in $n$ independent trials we can get $i$ successes in $\binom{n}{i}$ ways, we may take
$$a_i = \binom{n}{i} \Big/ 2^{n}, \qquad i = 0, 1, 2, \ldots, n, \qquad (45)$$
and
$$B(P) = -\sum_{i=0}^{n} p_i \ln \frac{p_i}{(n+1)\, a_i}. \qquad (46)$$

1.2.2 Statement of the Maximum-Entropy Principle

If we know only the prior probability distribution, we should choose $P$ to be the same as this prior probability distribution $A$, i.e. we should minimize the directed divergence of $P$ from $A$. If, however, some other information is available, say, for example, that the mean of the distribution is prescribed,
$$\sum_{i=1}^{n} p_i x_i = m, \qquad (47)$$
then we cannot choose $p_i = a_i$, since this may not satisfy (47). Still, out of all the probability distributions which satisfy (47), we should like to choose the one which minimizes the directed divergence of $P$ from $A$, i.e. we shall choose that $P$ which maximizes the Bayesian entropy subject to (47) being satisfied. If there are many constraints like (47), we shall maximize the Bayesian entropy subject to all these constraints being satisfied. If the prior probability distribution is not known, we shall maximize the Shannon entropy subject to all the constraints being satisfied. This is the Maximum-Entropy Principle, which requires us to maximize the Bayesian (or Shannon) entropy (or some other appropriate measure of entropy) subject to all the given constraints (including $\sum_{i=1}^{n} p_i = 1$) being satisfied.

We may also call this the Minimum-Directed-Divergence Principle, since here we minimize the directed divergence of $P$ from $A$ subject to all constraints being satisfied. Since directed divergence is also the Kullback-Leibler discrimination information measure, the principle is also called the Minimum Discrimination Information (MDI) Principle. It is also called the Minimum Cross-Entropy Principle. We may also call it the Principle of Minimum Bias or the Principle of Minimum Prejudice regarding the information not given.
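A minimal numerical sketch of (43) and (44) follows (Python with NumPy; the function names and the sample prior are ours, not from the text): the directed divergence vanishes, and the Bayesian entropy attains its maximum $\ln n$, exactly when $P$ coincides with the prior $A$.

```python
import numpy as np

def directed_divergence(p, a):
    """Directed divergence (43): sum p_i ln(p_i / a_i), with 0 ln 0 = 0."""
    p, a = np.asarray(p, float), np.asarray(a, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / a[mask]))

def bayesian_entropy(p, a):
    """Bayesian entropy via (44): B(P) = ln n - I(P:A)."""
    return np.log(len(p)) - directed_divergence(p, a)

A = np.array([0.1, 0.2, 0.3, 0.4])            # a priori distribution (illustrative)
candidates = {
    "P = A":   A,
    "uniform": np.full(4, 0.25),
    "skewed":  np.array([0.7, 0.1, 0.1, 0.1]),
}
for name, P in candidates.items():
    print(name, directed_divergence(P, A), bayesian_entropy(P, A))
# Only P = A gives I(P:A) = 0 and B(P) = ln 4; every other P gives I > 0, B < ln 4.
```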
The maximum entropy in the presence of an additional constraint will be less than the maximum entropy in the absence of this constraint; the difference may be considered a measure of the bias due to this additional constraint, and maximizing the entropy implies minimizing this bias.

There are two considerations in the use of the Maximum-Entropy Principle: we should take all given information into account, and we should avoid taking into account any information that is not given, i.e. we should choose as uniform (as unbiased) a distribution as possible subject to the constraints being satisfied.

The probability distribution obtained by using the principle may be called the Maximum-Entropy Probability Distribution (MEPD), or the Most Likely Probability Distribution (MLPD), or the Least Biased Probability Distribution, or the Most Uniform Probability Distribution, or the Distribution Closest to the Prior Distribution, or the Most Random Probability Distribution, or the Most Uncertain Probability Distribution.

To justify calling the maximum-entropy probability distribution the most likely probability distribution, we proceed as follows. Let an experiment be performed $N$ times; then if $N$ is large, the $n$ outcomes will occur approximately $N p_1, N p_2, \ldots, N p_n$ times, and the total number of ways in which this can happen is
$$W = \frac{N!}{(N p_1)!\, (N p_2)! \cdots (N p_n)!}, \qquad (48)$$
so that
$$\ln W = \ln N! - \sum_{i=1}^{n} \ln (N p_i)!.$$
Using Stirling's formula, this gives
$$\ln W \approx N \ln N - N - \sum_{i=1}^{n} [N p_i \ln (N p_i) - N p_i] = -N \sum_{i=1}^{n} p_i \ln p_i, \qquad (49)$$
so that maximizing Shannon's entropy is equivalent to maximizing $\ln W$ or $W$. Thus for large $N$, the MEPD is also the MLPD.

1.2.3 Use of a Number of Prior Probability Distributions

Suppose we are given a number of possible prior probability distributions $Q_j = (q_{1j}, q_{2j}, \ldots, q_{nj})$, $j = 1, 2, \ldots, m$; then we get a number of measures of directed divergence between the true distribution $P$ and the given prior distributions $Q_j$, viz.
$$D_j = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_{ij}}, \qquad j = 1, 2, \ldots, m. \qquad (50)$$
If we use a weighted mean of these, we get
$$D = \sum_{j=1}^{m} w_j D_j, \qquad \sum_{j=1}^{m} w_j = 1, \ w_j \geq 0,$$
so that
$$D = \sum_{i=1}^{n} p_i \ln p_i - \sum_{i=1}^{n} p_i \sum_{j=1}^{m} w_j \ln q_{ij} = \sum_{i=1}^{n} p_i \ln \frac{p_i}{r_i} = \sum_{i=1}^{n} p_i \ln \frac{p_i}{B\, s_i}, \qquad (51)$$
where $r_i = \prod_{j=1}^{m} q_{ij}^{\,w_j}$ is the weighted geometric mean of $q_{i1}, q_{i2}, \ldots, q_{im}$, and
$$B = \sum_{i=1}^{n} r_i, \qquad s_i = \frac{r_i}{B}. \qquad (52)$$
Thus using a weighted mean of the measures of divergence is equivalent to using the directed divergence from the probability distribution obtained from the weighted geometric means of the probabilities of the prior distributions.
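As an illustration of the principle, the MEPD for a six-faced die with a prescribed mean can be computed numerically: the solution has the exponential form $p_i \propto x^i$ (with $x = e^{-\lambda}$), and the single unknown $x$ is fixed by the mean condition. The sketch below (Python with NumPy; the bisection routine and function names are ours) reproduces, for mean 4.5, the values quoted in Exercise 3 below.

```python
import numpy as np

def die_mepd(mean, faces=6, tol=1e-12):
    """Maximum-entropy distribution on {1,...,faces} with prescribed mean.

    The solution is p_i proportional to x**i; x is found by bisection on
    the mean condition, which is monotonically increasing in x.
    """
    i = np.arange(1, faces + 1)

    def mean_of(x):
        w = x ** i
        return np.sum(i * w) / np.sum(w)

    lo, hi = 1e-9, 1e9
    while hi - lo > tol * max(1.0, lo):
        mid = 0.5 * (lo + hi)
        if mean_of(mid) < mean:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    w = x ** i
    return w / np.sum(w), x

p, x = die_mepd(4.5)
print(np.round(p, 5))   # approx [0.05435 0.07877 0.11416 0.16545 0.23977 0.34749]
print(x)                # approx 1.44925, the root quoted in Exercise 3
```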
EXERCISES 1.2

1. Two coins, a dime and a nickel, are tossed. The states of the coins are assigned values as follows: DH (dime, head) = 1, DT = 2, NH = 1, NT = 2. You are told that in a sequence of tosses the average value for the simultaneous tosses of the two coins is 3.1. Show that the least prejudiced assignment of the probability of a head in the toss of the dime is 0.45.

2. (a) Suppose we are given a six-faced die and no other information. Show that the MEPD is given by $p_1 = p_2 = \cdots = p_6 = 1/6$.
(b) Suppose now we are given the information that Mr. A has thrown this die 10,000 times and has obtained the points 1, 2, 3, 4, 5, 6 in 1500, 1700, 1800, 2000, 1600, 1400 throws respectively. What is the MEPD now?
(c) Suppose now we throw this die ourselves 100 times and the mean number of points comes out to be 3. What is the MEPD now?

3. The only information available about a die is that when it was thrown a large number of times, the average number of points came out to be 4.5. Show that the MEPD is given by
$$(p_1, p_2, \ldots, p_6) = (0.05435,\ 0.07877,\ 0.11416,\ 0.16545,\ 0.23977,\ 0.34749).$$
[You are given that a root of $3x^5 + x^4 - x^3 - 3x^2 - 5x - 7 = 0$ is $1.44925$.]

4. Derive the corresponding probability distributions when the average number of points is 2, 4 and 5. Compare them and comment on the results.

5. Let $p_{ij}$ be the probability of an item being in the cell in the $i$th row and $j$th column of an $m \times n$ contingency table. Maximize
$$-\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \ln p_{ij}$$
subject to
$$\sum_{j=1}^{n} p_{ij} = a_i \ (i = 1, \ldots, m), \qquad \sum_{i=1}^{m} p_{ij} = b_j \ (j = 1, \ldots, n),$$
where the $a_i$'s and $b_j$'s are given and $\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j = 1$.

6. Maximize $-\sum_{i=1}^{3} p_i \ln p_i$ subject to
$$\sum_{i=1}^{3} p_i = 1, \qquad \sum_{i=1}^{3} i\, p_i = m.$$
Show that the solution is given by
$$p_1 = \tfrac{1}{2}(3 - m - p_2), \qquad p_3 = \tfrac{1}{2}(m - 1 - p_2), \qquad p_2^2 = p_1 p_3.$$
Draw the graphs of $p_1, p_2, p_3$ against $m$ ($1 \leq m \leq 3$).

7. Maximize another measure of entropy, $1 - \sum_{i=1}^{3} p_i^2$, subject to the constraints of Ex. 6. Show that the solution is now given by
$$p_1 = \frac{8 - 3m}{6}, \qquad p_2 = \frac{1}{3}, \qquad p_3 = \frac{3m - 4}{6}.$$
Draw the graphs of $p_1, p_2, p_3$ against $m$ ($1 \leq m \leq 3$). Note that $p_2 = 1/3$ for all $m$, and $p_3 < 0$ when $m < 4/3$. Why does this not happen for Shannon's measure of entropy?

8. In order to calculate the cumulative performance index of a student, a university calculates the weighted average of his grades, the weights being 10, 8, 6, 4, 2 for grades A, B, C, D and E respectively. For a graduating class the average index is found to be 8.5. What is the most likely value for the proportion of students getting grade A in that university?

9. In Ex. 8, let $p_1, p_2, p_3, p_4, p_5$ be the most likely proportions of students getting grades A, B, C, D and E, and let $\bar{x}$ be the average of the weighted averages of grades for a graduating class. Draw the graphs of $p_1, p_2, p_3, p_4, p_5$ against $\bar{x}$ and comment on the result.

10. Kullback [1959] has defined a measure of symmetric divergence between two probability distributions $P$ and $A$ by
$$J(P, A) = I(P : A) + I(A : P) = \sum_{i=1}^{n} (p_i - a_i) \ln \frac{p_i}{a_i}.$$
Show that, in the absence of other constraints, it is minimized when $p_i = a_i$ for all $i$, i.e. by the same probability distribution for which the directed divergence is minimized.

11. Show that in the presence of other constraints, minimising symmetric divergence does not give the same results as minimising directed divergence. Which of these two measures would you prefer, and why?

12. The centre of mass of six point masses situated at $(a, 0)$, $(2a, 0)$, $(3a, 0)$, $(4a, 0)$, $(5a, 0)$, $(6a, 0)$ is found to be at $(3.5a, 0)$. What is the most unbiased statement you can make about the relative values of the masses?

13. Show that the minimum value of the expression $D$ in (51) occurs when $p_i = r_i / B$ for $i = 1, 2, \ldots, n$. What is the minimum value?

14. Show that $B \leq 1$, where $B$ is defined in (52). When will $B$ be equal to 1?

1.3 MAXIMUM-ENTROPY FORMALISM

1.3.1 Use of Lagrange's Multipliers

We have to maximize
$$S = -\sum_{i=1}^{n} p(x_i) \ln \frac{p(x_i)}{a(x_i)} \qquad (53)$$
subject to
$$\sum_{i=1}^{n} p(x_i) = 1, \qquad \sum_{i=1}^{n} p(x_i)\, g_r(x_i) = \bar{g}_r, \quad r = 1, 2, \ldots, m. \qquad (54)$$
The Lagrangian is
$$L = -\sum_{i=1}^{n} p(x_i) \ln \frac{p(x_i)}{a(x_i)} - (\lambda_0 - 1)\Big( \sum_{i=1}^{n} p(x_i) - 1 \Big) - \sum_{r=1}^{m} \lambda_r \Big( \sum_{i=1}^{n} p(x_i)\, g_r(x_i) - \bar{g}_r \Big). \qquad (55)$$
Equating the derivative with respect to $p(x_i)$ to zero, we get
$$p(x_i) = a(x_i) \exp[-\lambda_0 - \lambda_1 g_1(x_i) - \lambda_2 g_2(x_i) - \cdots - \lambda_m g_m(x_i)], \qquad i = 1, 2, \ldots, n, \qquad (56)$$
where the constants $\lambda_0, \lambda_1, \ldots, \lambda_m$ are to be determined by using (54).
In particular, we get
$$\exp(\lambda_0) = \sum_{i=1}^{n} a(x_i) \exp[-\lambda_1 g_1(x_i) - \lambda_2 g_2(x_i) - \cdots - \lambda_m g_m(x_i)]. \qquad (57)$$
Equation (57) determines $\lambda_0$ as a function of $\lambda_1, \lambda_2, \ldots, \lambda_m$. Differentiating (57) with respect to $\lambda_r$ and using (56), we get
$$\exp(\lambda_0)\, \frac{\partial \lambda_0}{\partial \lambda_r} = -\sum_{i=1}^{n} a(x_i)\, g_r(x_i) \exp[-\lambda_1 g_1(x_i) - \cdots - \lambda_m g_m(x_i)], \qquad (58)$$
or
$$\frac{\partial \lambda_0}{\partial \lambda_r} = -\sum_{i=1}^{n} p(x_i)\, g_r(x_i) = -\bar{g}_r, \qquad r = 1, 2, \ldots, m. \qquad (59)$$
Differentiating (57) twice with respect to $\lambda_r$, we get in the same way
$$\frac{\partial^2 \lambda_0}{\partial \lambda_r^2} + \Big( \frac{\partial \lambda_0}{\partial \lambda_r} \Big)^2 = \sum_{i=1}^{n} p(x_i)\, g_r^2(x_i), \qquad (60)$$
so that
$$\frac{\partial^2 \lambda_0}{\partial \lambda_r^2} = E(g_r^2) - \bar{g}_r^2 = \operatorname{var}(g_r) \geq 0, \qquad (61)$$
and similarly
$$\frac{\partial^2 \lambda_0}{\partial \lambda_r\, \partial \lambda_s} = E(g_r g_s) - \bar{g}_r \bar{g}_s = \operatorname{cov}(g_r, g_s). \qquad (62)$$

$\exp(\lambda_0)$ is usually denoted by $Z$ and is called the partition function, so that
$$Z = \exp(\lambda_0) = \sum_{i=1}^{n} a(x_i) \exp[-\lambda_1 g_1(x_i) - \cdots - \lambda_m g_m(x_i)], \qquad (63)$$
and (59), (61), (62) may also be written as
$$\bar{g}_r = -\frac{\partial \ln Z}{\partial \lambda_r}, \qquad \operatorname{var}(g_r) = \frac{\partial^2 \ln Z}{\partial \lambda_r^2}, \qquad \operatorname{cov}(g_r, g_s) = \frac{\partial^2 \ln Z}{\partial \lambda_r\, \partial \lambda_s}. \qquad (64)$$
Thus, for a maximum-entropy distribution, $\lambda_0$ or $\ln Z$ is a function of $\lambda_1, \lambda_2, \ldots, \lambda_m$, and all moments
$$E(g_r), \quad E[(g_r - \bar{g}_r)^2], \quad E[(g_r - \bar{g}_r)(g_s - \bar{g}_s)], \ \ldots \qquad (65)$$
can be expressed in terms of the partial derivatives of various orders of $\lambda_0$ with respect to $\lambda_1, \lambda_2, \ldots, \lambda_m$.

(b) The expression for the maximum entropy can be found by substituting for $p(x_i)$ from (56) into (53). This gives
$$S_{\max} = \lambda_0 + \lambda_1 \bar{g}_1 + \lambda_2 \bar{g}_2 + \cdots + \lambda_m \bar{g}_m, \qquad (66)$$
or
$$S_{\max} = \ln Z + \sum_{r=1}^{m} \lambda_r \bar{g}_r. \qquad (67)$$
Thus $S_{\max}$ can be regarded as a function of $\lambda_1, \lambda_2, \ldots, \lambda_m$. Again, since $\lambda_0$ is a function of $\lambda_1, \ldots, \lambda_m$ and $\bar{g}_1, \ldots, \bar{g}_m$ are functions of $\lambda_1, \ldots, \lambda_m$, we can, in principle, regard $\lambda_0, \lambda_1, \ldots, \lambda_m$ as functions of $\bar{g}_1, \bar{g}_2, \ldots, \bar{g}_m$. Thus we can write both $\lambda_0$ and $S_{\max}$ as functions of either $\lambda_1, \ldots, \lambda_m$ or of $\bar{g}_1, \ldots, \bar{g}_m$. Let
$$\lambda_0 = \lambda_0(\lambda_1, \ldots, \lambda_m), \qquad S_{\max} = S(\lambda_1, \ldots, \lambda_m), \qquad (68)$$
$$\lambda_0 = \Lambda_0(\bar{g}_1, \ldots, \bar{g}_m), \qquad S_{\max} = S'(\bar{g}_1, \ldots, \bar{g}_m). \qquad (69)$$

(c) From (66),
$$dS_{\max} = d\lambda_0 + \sum_{r=1}^{m} \lambda_r\, d\bar{g}_r + \sum_{r=1}^{m} \bar{g}_r\, d\lambda_r, \qquad (70)$$
but, by (59),
$$d\lambda_0 = \sum_{r=1}^{m} \frac{\partial \lambda_0}{\partial \lambda_r}\, d\lambda_r = -\sum_{r=1}^{m} \bar{g}_r\, d\lambda_r. \qquad (71)$$
From (70) and (71),
$$dS_{\max} = \sum_{r=1}^{m} \lambda_r\, d\bar{g}_r, \qquad (72)$$
so that
$$\frac{\partial S'}{\partial \bar{g}_j} = \lambda_j, \qquad (73)$$
and hence
$$\frac{\partial \lambda_j}{\partial \bar{g}_k} = \frac{\partial^2 S'}{\partial \bar{g}_j\, \partial \bar{g}_k} = \frac{\partial \lambda_k}{\partial \bar{g}_j}. \qquad (74)$$
Equations (73) and (74) give us an interpretation of the Lagrange multipliers. The maximum entropy depends on the prescribed values of $\bar{g}_1, \bar{g}_2, \ldots, \bar{g}_m$. If we change $\bar{g}_j$ by a small amount $\delta \bar{g}_j$, then the corresponding small change in the maximum entropy is given by
$$\delta(S_{\max}) = \lambda_j\, \delta \bar{g}_j. \qquad (75)$$
Again, if we make a small change in $\bar{g}_j$, there will be a corresponding small change in $\lambda_k$; if we make the same small change in $\bar{g}_k$, there will be a corresponding small change in $\lambda_j$. Equation (74) asserts that these two changes would be the same. Also, from (66) and (69),
$$S' = \Lambda_0 + \sum_{r=1}^{m} \lambda_r \bar{g}_r. \qquad (76)$$

(d) If $q(x)$ is any function of $x$, we have
$$E[q(x)] = \bar{q} = \sum_{i=1}^{n} q(x_i)\, p(x_i) = \sum_{i=1}^{n} q(x_i)\, a(x_i) \exp[-\lambda_0 - \lambda_1 g_1(x_i) - \cdots - \lambda_m g_m(x_i)], \qquad (77)$$
so that
$$\frac{\partial \bar{q}}{\partial \lambda_j} = \sum_{i=1}^{n} q(x_i)\, p(x_i) \Big[ -\frac{\partial \lambda_0}{\partial \lambda_j} - g_j(x_i) \Big] = \bar{g}_j \bar{q} - E[q(x) g_j(x)], \qquad (78)$$
or
$$\frac{\partial \bar{q}}{\partial \lambda_j} = -\operatorname{cov}(q(x), g_j(x)). \qquad (79)$$
Thus the covariance of any function $q(x)$ with $g_j(x)$ is the negative of the derivative of $\bar{q}$ with respect to $\lambda_j$. In particular,
$$q(x) = g_j(x) \ \Rightarrow \ \operatorname{var}(g_j(x)) = -\frac{\partial \bar{g}_j}{\partial \lambda_j}, \qquad (80)$$
$$q(x) = g_k(x) \ \Rightarrow \ \operatorname{cov}(g_k(x), g_j(x)) = -\frac{\partial \bar{g}_k}{\partial \lambda_j}. \qquad (81)$$
Equations (80) and (81) are the same as equations (61) and (62).

(e) If $g_1(x), g_2(x), \ldots, g_m(x)$ depend on some parameters $a_1, a_2, \ldots, a_k$, then $\lambda_0, \lambda_1, \ldots, \lambda_m$, $\bar{g}_1, \ldots, \bar{g}_m$ and $S_{\max}$ will also depend on them, and we can discuss the variation of $S_{\max}$ with respect to each one of them. Thus, from (66),
$$\frac{\partial S_{\max}}{\partial a_t} = \frac{\partial \lambda_0}{\partial a_t} + \sum_{r=1}^{m} \Big( \bar{g}_r \frac{\partial \lambda_r}{\partial a_t} + \lambda_r \frac{\partial \bar{g}_r}{\partial a_t} \Big). \qquad (82)$$
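The identities (59) to (64) are easy to verify numerically once $Z$ is computed. The sketch below (Python with NumPy; the outcomes, prior, constraint functions and multiplier values are illustrative choices of ours, not taken from the text) builds the distribution (56) for given multipliers and compares a finite-difference estimate of $\partial \lambda_0 / \partial \lambda_r$ with $-\bar{g}_r$.

```python
import numpy as np

x = np.arange(1, 7)                       # outcomes x_1,...,x_6 (illustrative)
a = np.full(6, 1.0 / 6.0)                 # prior weights a(x_i)
g = [lambda t: t, lambda t: t ** 2]       # constraint functions g_1, g_2 (illustrative)

def lambda0(lam):
    """lambda_0 = ln Z(lambda_1,...,lambda_m), eqs. (57) and (63)."""
    expo = -sum(l * gr(x) for l, gr in zip(lam, g))
    return np.log(np.sum(a * np.exp(expo)))

def mepd(lam):
    """Maximum-entropy probabilities (56) for given multipliers."""
    expo = -lambda0(lam) - sum(l * gr(x) for l, gr in zip(lam, g))
    return a * np.exp(expo)

lam = np.array([0.3, -0.02])              # arbitrary multiplier values
p = mepd(lam)
print(p.sum())                            # 1.0: (56) is properly normalised

for r, gr in enumerate(g):
    gbar = np.sum(p * gr(x))                          # the moment g_r-bar
    h = 1e-6                                          # finite-difference step
    dlam = lam.copy()
    dlam[r] += h
    dlambda0 = (lambda0(dlam) - lambda0(lam)) / h     # numerical d(lambda_0)/d(lambda_r)
    print(r + 1, gbar, -dlambda0)                     # (59): the two agree closely
```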
1.3.2 Case of Countable Infinity or Continuum of Outcomes

For a countable infinity of outcomes, we can define
$$S = -\sum_{i=1}^{\infty} p_i \ln \frac{p_i}{a_i}, \qquad (83)$$
but we cannot speak of the entropy being maximum when all the outcomes are equally likely. If we take $a_i = 1$, we get Shannon's entropy in this case, and this implies the use of an 'improper' prior distribution. We can also minimize the measure of directed divergence
$$D = \sum_{i=1}^{\infty} p_i \ln \frac{p_i}{a_i}, \qquad (84)$$
which by Shannon's inequality is $\geq 0$. The difference between the two measures of divergence for two probability distributions $P$ and $Q$ from $A$ is given by
$$D_1 - D_2 = \sum_{i=1}^{\infty} p_i \ln \frac{p_i}{a_i} - \sum_{i=1}^{\infty} q_i \ln \frac{q_i}{a_i} = \sum_{i=1}^{\infty} p_i \ln p_i - \sum_{i=1}^{\infty} q_i \ln q_i - \sum_{i=1}^{\infty} (p_i - q_i) \ln a_i, \qquad (85)$$
and this is independent of $\sum a_i$. Thus whether we make use of $a_i$ or $k a_i$, we get the same result. However, if $\sum a_i$ is a divergent series, the $a_i$ cannot be normalised to give a proper prior distribution. This gives some justification for the use of Bayesian entropy with an improper prior in the maximum-entropy principle, provided the final maximum-entropy probability distribution is a proper distribution.

Similarly, for the continuous case we define
$$S = -\int f(x) \ln \frac{f(x)}{a(x)}\, dx. \qquad (86)$$
For a finite interval, $a(x) = 1/(b - a)$ gives a proper prior distribution, viz. the uniform distribution. For an infinite interval, taking $a(x) =$ constant means using an improper prior distribution, which will be justified if the posterior maximum-entropy probability distribution is a proper distribution.

EXERCISES 1.3

1. If $Z$ is the partition function, prove that
(i) $\bar{g}_r = -\dfrac{1}{Z} \dfrac{\partial Z}{\partial \lambda_r}$;
(ii) $\operatorname{var}(g_r) = \dfrac{1}{Z} \dfrac{\partial^2 Z}{\partial \lambda_r^2} - \dfrac{1}{Z^2} \Big( \dfrac{\partial Z}{\partial \lambda_r} \Big)^2$;
(iii) $\operatorname{cov}(g_r, g_s) = \dfrac{1}{Z} \dfrac{\partial^2 Z}{\partial \lambda_r\, \partial \lambda_s} - \dfrac{1}{Z^2} \dfrac{\partial Z}{\partial \lambda_r} \dfrac{\partial Z}{\partial \lambda_s}$.

2. If $f(x) = A \exp\Big[ -\sum_{j=1}^{m} \lambda_j g_j(x) \Big]$ is a probability density function, prove that $A = 1/Z$ and that $\bar{g}_j = \dfrac{\partial \ln A}{\partial \lambda_j}$.

3. Find expressions for $\operatorname{var}(g_j)$ and $\operatorname{cov}(g_j, g_k)$ in terms of the partial derivatives of (i) $\lambda_0$, (ii) $Z$, (iii) $A$.

4. Examine which of the following choices give rise to a proper prior probability distribution, and find the distribution when it exists:
(i) $a_i = 1$, $i = 1, 2, 3, \ldots$;
(ii) $a_i = \dbinom{n}{i}$, $i = 0, 1, 2, \ldots, n$.

1.4 PROOFS OF SOME NEEDED MATHEMATICAL RESULTS

1.4.1 Proof that Lagrange's Method Gives a Global Maximum

Lagrange's method gives an extremum, but whether it gives a maximum or a minimum has to be decided in each case. In our case, let
$$S = -\sum_{i=1}^{n} p(x_i) \ln \frac{p(x_i)}{a(x_i)}, \qquad (87)$$
where
$$\sum_{i=1}^{n} p(x_i) = 1, \qquad \sum_{i=1}^{n} p(x_i)\, g_r(x_i) = \bar{g}_r, \quad r = 1, 2, \ldots, m. \qquad (88)$$
This gives, as shown earlier,
$$p(x_i) = a(x_i) \exp[-\lambda_0 - \lambda_1 g_1(x_i) - \cdots - \lambda_m g_m(x_i)]. \qquad (89)$$
Let $F$ be the entropy of any other probability distribution $f(x_1), \ldots, f(x_n)$ satisfying the given constraints (88), so that
$$F = -\sum_{i=1}^{n} f(x_i) \ln \frac{f(x_i)}{a(x_i)}, \qquad \sum_{i=1}^{n} f(x_i) = 1, \qquad \sum_{i=1}^{n} f(x_i)\, g_r(x_i) = \bar{g}_r; \qquad (90)$$
then
$$S_{\max} - F = -\sum_{i=1}^{n} p(x_i) \ln \frac{p(x_i)}{a(x_i)} + \sum_{i=1}^{n} f(x_i) \ln \frac{f(x_i)}{a(x_i)} = \sum_{i=1}^{n} [f(x_i) - p(x_i)] \ln \frac{p(x_i)}{a(x_i)} + \sum_{i=1}^{n} f(x_i) \ln \frac{f(x_i)}{p(x_i)}. \qquad (91)$$
Substituting from (89) in (91), we get
$$S_{\max} - F = \sum_{i=1}^{n} [f(x_i) - p(x_i)] \Big[ -\lambda_0 - \sum_{r=1}^{m} \lambda_r g_r(x_i) \Big] + \sum_{i=1}^{n} f(x_i) \ln \frac{f(x_i)}{p(x_i)}. \qquad (92)$$
Every term in the first summation vanishes because of the constraints (88) and (90), and the second sum is $\geq 0$ because of Shannon's inequality, so that
$$S_{\max} - F \geq 0, \qquad (93)$$
and thus $S_{\max}$ is a global maximum, and $F = S_{\max}$ iff $f(x_i) = p(x_i)$ for all $i$. The argument remains valid if $n \to \infty$ and if we use integrals instead of sums.
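The conclusion (91) to (93) can be illustrated numerically. The following sketch (Python with NumPy; the alternative distribution $f$ is our own choice, and the maximum-entropy values are those quoted in Exercise 3 of Section 1.2) compares the entropy of the maximum-entropy solution for a die with prescribed mean 4.5 against another distribution satisfying the same constraints, and checks that the gap equals $\sum f_i \ln (f_i / p_i)$.

```python
import numpy as np

def entropy(q):
    """Shannon entropy of a strictly positive probability vector."""
    q = np.asarray(q, float)
    return -np.sum(q * np.log(q))

i = np.arange(1, 7)

# Maximum-entropy distribution for a die with mean 4.5 (values from Exercise 3 of 1.2)
p = np.array([0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749])

# Another distribution satisfying the same constraints: sum = 1, mean = 4.5
f = np.array([0.10, 0.05, 0.10, 0.15, 0.20, 0.40])
print(np.isclose(f.sum(), 1.0), np.isclose(np.sum(i * f), 4.5))   # both True

# (92)-(93): S_max - F equals sum f_i ln(f_i / p_i), which is >= 0 by Shannon's inequality
S_max, F = entropy(p), entropy(f)
print(S_max - F, np.sum(f * np.log(f / p)))   # both positive and (up to rounding of p) equal
```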
1.4.2 Shannon's Inequality

We have already required it three times. It states that if
$$P = (p_1, p_2, \ldots, p_n), \ p_i \geq 0, \qquad Q = (q_1, q_2, \ldots, q_n), \ q_i > 0, \qquad (94)$$
are two probability distributions, then
$$\sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i} \geq 0, \qquad (95)$$
and
$$\sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i} = 0 \ \Leftrightarrow \ p_i = q_i \ \text{for all } i. \qquad (96)$$

First Proof. For any continuous convex function $\phi(x)$, Jensen's inequality gives
$$E[\phi(x)] \geq \phi(E(x)) \qquad (97)$$
for any random variate $x$. Let $\phi(x) = x \ln x$ and let $x$ take the values $p_i/q_i$ with probabilities $q_i$, $i = 1, 2, \ldots, n$; then (97) gives
$$\sum_{i=1}^{n} q_i \frac{p_i}{q_i} \ln \frac{p_i}{q_i} \geq \Big( \sum_{i=1}^{n} q_i \frac{p_i}{q_i} \Big) \ln \Big( \sum_{i=1}^{n} q_i \frac{p_i}{q_i} \Big),$$
or
$$\sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i} \geq 1 \cdot \ln 1 = 0. \qquad (98)$$
The equality sign holds only if all the variate values are equal, i.e. if $p_i = q_i$ for all $i$.

Second Proof. We know that the weighted arithmetic mean of $n$ positive numbers is greater than or equal to their weighted geometric mean, and that the equality sign holds if and only if the numbers are equal. Let the numbers be $q_i/p_i$ and let the weights be $p_i$ ($i = 1, 2, \ldots, n$), so that
$$\sum_{i=1}^{n} p_i \frac{q_i}{p_i} \geq \prod_{i=1}^{n} \Big( \frac{q_i}{p_i} \Big)^{p_i}, \qquad (99)$$
or
$$1 \geq \prod_{i=1}^{n} \Big( \frac{q_i}{p_i} \Big)^{p_i}, \qquad (100)$$
or
$$0 \geq \sum_{i=1}^{n} p_i \ln \frac{q_i}{p_i}, \qquad \text{i.e.} \qquad \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i} \geq 0, \qquad (101)$$
and the equality sign holds iff $p_i = q_i$ for each $i$.

Third Proof. Let
$$q_i = p_i (1 + \epsilon_i), \qquad \epsilon_i > -1; \qquad (102)$$
then
$$\sum_{i=1}^{n} q_i = \sum_{i=1}^{n} p_i (1 + \epsilon_i) = 1 \ \Rightarrow \ \sum_{i=1}^{n} p_i \epsilon_i = 0,$$
and
$$\sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i} = -\sum_{i=1}^{n} p_i \ln (1 + \epsilon_i) = \sum_{i=1}^{n} p_i [\epsilon_i - \ln(1 + \epsilon_i)] = \sum_{i=1}^{n} p_i\, \phi(\epsilon_i), \qquad (103)$$
where
$$\phi(\epsilon) = \epsilon - \ln(1 + \epsilon), \qquad (104)$$
so that
$$\phi'(\epsilon) = 1 - \frac{1}{1 + \epsilon} = \frac{\epsilon}{1 + \epsilon}, \qquad (105)$$
and
$$\phi'(\epsilon_i) \geq 0 \ \text{when} \ \epsilon_i \geq 0, \qquad \phi'(\epsilon_i) \leq 0 \ \text{when} \ \epsilon_i \leq 0. \qquad (106)$$

[Fig. 1.1: graph of $\phi(\epsilon) = \epsilon - \ln(1 + \epsilon)$.]

From Figure 1.1 it follows that
$$\phi(\epsilon_i) \geq 0 \quad \text{and} \quad \phi(\epsilon_i) = 0 \ \Leftrightarrow \ \epsilon_i = 0. \qquad (107)$$
From (103) and (107),
$$\sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i} \geq 0, \quad \text{and} \quad \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i} = 0 \ \Leftrightarrow \ p_i = q_i \ \text{for each } i. \qquad (108)$$

Fourth Proof (for a continuous variate). Let $f(x)$ and $g(x)$ be probability density functions, so that
$$\int f(x)\, dx = \int g(x)\, dx = 1, \qquad f(x) \geq 0, \ g(x) > 0, \qquad (109)$$
and let $h(x) = f(x)/g(x)$; then
$$\int f(x) \ln \frac{f(x)}{g(x)}\, dx = \int g(x)\, h(x) \ln h(x)\, dx = \int g(x)\, [h(x) \ln h(x) - h(x) + 1]\, dx, \qquad (110)$$
since $\int g(x) h(x)\, dx = \int f(x)\, dx = 1$ and $\int g(x)\, dx = 1$. Let $\psi(h) = h \ln h - h + 1$; then
$$\psi'(h) = \ln h \ \begin{cases} > 0 & \text{if } h > 1, \\ = 0 & \text{if } h = 1, \\ < 0 & \text{if } h < 1, \end{cases} \qquad (111)$$

[Fig. 1.2: graph of $\psi(h) = h \ln h - h + 1$.]

so that, from Figure 1.2, $\psi(h(x)) \geq 0$ for all $x$, and hence
$$\int f(x) \ln \frac{f(x)}{g(x)}\, dx \geq 0, \qquad (112)$$
and this is zero iff $f(x) = g(x)$ almost everywhere.

In statistical mechanics, this is often called Gibbs' inequality.

Note: Shannon's inequality will continue to hold if some of the $q_i$'s are zero, provided the corresponding $p_i$'s are also zero and we define $0 \ln (0/0)$ as equal to 0.
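The inequality, and the device used in the third proof, can also be checked numerically. The sketch below (Python with NumPy; the random distributions and the seed are our own choices, not from the text) draws random probability vectors and confirms both (95) and the identity (103).

```python
import numpy as np

rng = np.random.default_rng(1)

def directed_divergence(p, q):
    """sum p_i ln(p_i/q_i); non-negative by Shannon's (Gibbs') inequality."""
    return np.sum(p * np.log(p / q))

# (95): the divergence of random P from random Q is never negative,
# and it vanishes when P = Q.
for _ in range(5):
    p = rng.dirichlet(np.ones(8))
    q = rng.dirichlet(np.ones(8))
    assert directed_divergence(p, q) >= 0.0
    assert np.isclose(directed_divergence(p, p), 0.0)

# Third proof: with q_i = p_i (1 + eps_i), the divergence equals sum p_i phi(eps_i),
# where phi(eps) = eps - ln(1 + eps) >= 0.
p = rng.dirichlet(np.ones(8))
q = rng.dirichlet(np.ones(8))
eps = q / p - 1.0
phi = eps - np.log1p(eps)
print(np.allclose(directed_divergence(p, q), np.sum(p * phi)))  # True
print(np.all(phi >= 0))                                         # True
```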
EXERCISES 1.4

1. Is a local maximum a global maximum for every concave function? Discuss.

2. For which of the non-additive measures of entropy will the maximum obtained by Lagrange's method be a global maximum?

3. Use Shannon's inequality to prove that
(i) $\sum_{i=1}^{n} p_i \ln p_i \geq -\ln n$;
(ii) $\sum_{i=1}^{n} (1 + p_i) \ln (1 + p_i) \geq (n + 1) \ln \dfrac{n+1}{n}$;
(iii) $\sum_{i=1}^{n} (1 - p_i) \ln (1 - p_i) \geq (n - 1) \ln \dfrac{n-1}{n}$;
(iv) $\sum_{i=1}^{n} (a + b p_i) \ln (a + b p_i) \geq (an + b) \ln \dfrac{an + b}{n}$, $a, b > 0$.

4. Use Shannon's inequality to prove that, for probability density functions $f(x)$ and $g(x)$ on $(a, b)$,
(i) $\displaystyle\int_a^b f(x) \ln f(x)\, dx \geq -\ln (b - a)$;
(ii) $\displaystyle\int_a^b (f + \lambda g) \ln (f + \lambda g)\, dx \geq (1 + \lambda) \ln \dfrac{1 + \lambda}{b - a}$, $\lambda > 0$;
(iii) $\displaystyle\int_a^b (f + \lambda g) \ln (f + \lambda g)\, dx \geq \displaystyle\int_a^b (f + \lambda g) \ln (g + \lambda f)\, dx$, $\lambda > 0$.
When does the equality hold in each case? Here $\int_a^b f(x)\, dx = \int_a^b g(x)\, dx = 1$.

5. Prove that
$$p_1 \ln \frac{p_1}{q_1} + p_2 \ln \frac{p_2}{q_2} + \cdots + p_n \ln \frac{p_n}{q_n} \geq (p_1 + p_2 + \cdots + p_n) \ln \frac{p_1 + p_2 + \cdots + p_n}{q_1 + q_2 + \cdots + q_n}$$
for $p_i > 0$, $q_i > 0$, $i = 1, 2, \ldots, n$.

6. Obtain special cases of Shannon's inequality when
(i) $p_i = \dbinom{n}{i} a^i (1-a)^{n-i}$, $q_i = \dbinom{n}{i} b^i (1-b)^{n-i}$, $i = 0, 1, \ldots, n$;
(ii) $p_i = e^{-\lambda} \lambda^i / i!$, $q_i = e^{-\mu} \mu^i / i!$, $i = 0, 1, 2, \ldots$.

7. Let $f(x)$ be a probability density with mean $\mu$ and finite variance $\sigma^2$. Show that
$$\int f(x) \ln f(x)\, dx \geq -\ln \big( \sigma \sqrt{2\pi e} \big),$$
with equality sign if and only if $f(x)$ is equal almost everywhere to the normal probability density with mean $\mu$ and variance $\sigma^2$.

8. Generalise the result of Ex. 7 for a bivariate probability distribution.

9. Prove that if $f(x)$ and $g(x)$ are non-negative, then $f(x) \ln \dfrac{f(x)}{g(x)} \geq f(x) - g(x)$.

10. Let $\phi(y) = \displaystyle\int K(y, x) f(x)\, dx$, where
$$\int f(x)\, dx = 1, \qquad \int K(y, x)\, dy = \int K(y, x)\, dx = 1, \qquad K(y, x) \geq 0;$$
then show that
(i) $\displaystyle\int \phi(y) \ln g(y)\, dy = \displaystyle\iint K(y, x) f(x) \ln g(y)\, dy\, dx$ for any function $g(y)$;
(ii) $\displaystyle\int f(x) \ln f(x)\, dx = \displaystyle\iint K(y, x) f(x) \ln f(x)\, dy\, dx$;
(iii) $\displaystyle\int f(x) \ln f(x)\, dx - \displaystyle\int \phi(y) \ln \phi(y)\, dy = \displaystyle\iint K(y, x) f(x) \ln \dfrac{f(x)}{\phi(y)}\, dy\, dx \geq 0$.

11. Let $y = \phi(x)$ be a variate transformation, $f(x)$, $F(y)$ the corresponding density functions, and $S$, $S'$ the corresponding entropies. Prove that
$$S' = S + \int f(x) \ln \Big| \frac{dy}{dx} \Big|\, dx.$$

12. Show that for a multivariate distribution the entropy does not change under a transformation of variates when the transformation is a linear orthogonal transformation.

13. Show that Shannon's inequality holds even if $Q$ is an incomplete probability distribution, i.e. $\sum_{i=1}^{n} q_i \leq 1$, $q_i > 0$, $i = 1, 2, \ldots, n$. Discuss whether it will hold if both $P$ and $Q$ are incomplete probability distributions.

14. Show that, under the conditions of Ex. 13,
$$\sum_{i=1}^{n} (p_i + a) \ln \frac{p_i + a}{q_i + a} \geq 0 \quad \text{for every } a > 0.$$

15. Show that the Hessian matrix of the second-order derivatives of $S'$ with respect to $\bar{g}_1, \bar{g}_2, \ldots, \bar{g}_m$ is the inverse of the negative of the variance-covariance matrix of $g_1, g_2, \ldots, g_m$.

16. Deduce from Ex. 15 that the Hessian is always negative definite and that the maximum value of the entropy is a strictly concave function of $\bar{g}_1, \bar{g}_2, \ldots, \bar{g}_m$.

17. Show that $S_{\max}$ is a function of $G = (\bar{g}_1, \ldots, \bar{g}_m)$ satisfying the inequality
$$S_{\max}(\lambda G_1 + (1 - \lambda) G_2) \geq \lambda S_{\max}(G_1) + (1 - \lambda) S_{\max}(G_2), \qquad 0 \leq \lambda \leq 1.$$

18. Prove that
$$2 S_{\max}\big( \tfrac{1}{2}(G_1 + G_2) \big) \geq S_{\max}(G_1) + S_{\max}(G_2).$$

1.5 BIBLIOGRAPHICAL AND HISTORICAL REMARKS

Shannon's measure of entropy was first derived in Shannon (1948) and Shannon and Weaver (1949). The proof that $-\sum_{i=1}^{n} p_i \ln p_i$ is the only function which satisfies all the properties (i)-(ix) is given in Khinchin (1957) and in all standard textbooks on information theory, e.g. in Reza (1961) and Guiasu (1977). Alternative sets of axioms from which this measure can be derived were given by Fadeev (1956), Lee (1964) and Tverberg (1958). Renyi (1961) gave a measure satisfying properties (i)-(vi) and (ix)'. This measure has the advantage that it depends on a parameter $\alpha$ and as such represents a family of measures which includes Shannon's measure as a limiting case as $\alpha \to 1$. Havrda and Charvat (1967) gave a non-additive measure which is a one-one function of Renyi's measure. Kapur (1967a) gave a doubly infinite family of measures depending on two parameters $\alpha$ and $\beta$ and including Shannon's and Renyi's measures as special or limiting cases. Kapur [1967b, c, 1968a, b, c, 1969, 1972, 1974] and Kapur and Chabra (1969) studied the properties of this general entropy measure. Similar measures were studied by Aczel and Daroczy (1963), Sharma and Taneja (1978), Autar and Taneja (1976), Sharma and Mittal (1975) and Elsayed (1981). Additional non-additive measures have been given by Behara and Nath (1971), Behara and Chawla (1974), Ferrari (1980) and Kapur (1980c, 1983h, 1985b, 1986c). Kapur (1972) derived two non-additive measures from the consideration that maximization subject to an energy constraint should give rise to the Bose-Einstein and Fermi-Dirac distributions of statistical mechanics. This aspect has been further discussed in Kapur and Kesavan (1987).

An exhaustive discussion of measures of entropy has been given in Aczel and Daroczy (1975), Mathai and Rathie (1975), Behara (1983) and Kapur (1984g).

In spite of the large number of entropy measures available, Shannon's measure is the most natural mathematically (cf. Aczel, Forte and Ng (1974)). It is also the most useful for maximum-entropy models, since $-\sum p_i \ln p_i$ is a concave function and its maximization subject to linear constraints, by using Lagrange's method, always leads to positive probabilities and a globally maximum value for the entropy. Other measures of entropy, except the ones proposed by Kapur (1986c), can lead to negative probabilities and even to minimum values for the entropy [Nathanson (1977), Kapur (1981h)].

The extension of this measure to the case when the random variate takes an enumerable set of values, or when the variate is a continuous random variable, presents some logical difficulties, as this may imply the use of 'improper' probability distributions or of negative and even possibly infinite 'uncertainties'. This problem will be discussed in later chapters. In the meantime, it will be useful to study Jaynes (1963), Hobson and Chang (1973), Tribus and Rossi (1973), Georgescu-Roegen (1975) and Kapur (1985f).
Kullback and Leibler (1951) introduced the measure of directed divergence, and Kullback (1959) gave his minimum discrimination information principle, which provides an alternative viewpoint on the principle of maximum entropy. The Bayesian entropy concept introduced here is due to Kapur (1983c) and is obviously closely related to the directed divergence concept.

Jaynes (1957) introduced his elegant formalism, which was later elaborated by Jaynes (1963a, b, 1979) himself and by Tribus (1969). Jaynes (1957, 1963a) also extended it to the case of density matrices for applications in statistical mechanics. This extension will be discussed in a later chapter.

Shannon's inequality has been discussed in detail by Aczel and Daroczy (1975) and Aczel (1974). It can also be extended to density matrices (Levine (1979)). Some generalised Shannon inequalities have been proved by Kapur (1987a).

The concept of 'weighted' entropy was introduced by Guiasu (1971), and the related concept of 'useful' entropy was introduced by Belis and Guiasu (1968). This has been further discussed by Kapur (1985g, 1986d), Philippatos and Wilson (1978) and Nawrocki and Harding (1986).

Measures of entropy which depend not only on the probabilities of the outcomes, but on the outcomes themselves, have been discussed in Aczel (1978a, b, 1980a, b) and Aczel and Kannappan (1978). These are called 'inset entropies'. They are equivalent to the Lagrangian entropies introduced by Kapur (1983).

Some new measures of entropy are given in Kapur and Kesavan (1987).
