(1986) A Note On Prediction and Entropy of Bengali Prose
To cite this article: P.P. Das & Prof. B.N. Chatterji (1986) A Note on Prediction and Entropy of
Bengali Prose, IETE Journal of Research, 32:6, 411-445, DOI: 10.1080/03772063.1986.11436642
to select the starting line and the last for the length of the text. Incomplete sentences at the beginning and end of the text were deleted to avoid stray effects.

Let p1, p2, ..., pn be the proportions of different letters of the alphabet in the texts. The one-gram entropy [13] is given by

$$H_1 = -\sum_{r=1}^{n} p_r \lg p_r$$

where lg stands for logarithm to the base two. The maximum value of entropy, obtained when all the p_r's are equal, is called the zero-order entropy, H0 = lg n.

H1 is the average number of binary digits required per letter of Bengali, if the language is encoded with 100% efficiency, assuming that the occurrence of every letter in a Bengali text is independent of the preceding letters.

The proportions of different letters of the Bengali alphabet are shown in Table 2 for all the five different groups and for the prose as a whole. Huffman's variable-length minimum redundancy prefix code [7] is constructed from these relative frequencies of letters in modern prose (Table 2), which gives,

From Table 2, it can be seen that short vowels (1, 2, 3, 5, 8, 10) occur more frequently than long vowels (4, 6, 9, 11) and most of the consonants. A number of consonants in Bengali can be paired according to phonetic considerations. In every such pair, the consonants produce very similar sounds, but the first in the pair, occurring immediately before the second in the alphabet, can be pronounced easily, whereas the second needs a rather strong release of breath. We find that, except in one case (17 and 18), the first ones of the pairs (12, 14, 19, 22, 24, 27, 29, 32, 34) are more frequent than the corresponding second ones (13, 15, 20, 23, 25, 28, 30, 33, 35). This may be taken as one more illustration of Zipf's principle of least effort [15]. Among the consonants (12 through 50), r (38) is the most frequent, next only to the three most frequent vowels (1, 2 and 8) and blank. This confirms Datta's [16] claim that r behaves as a vowel in forming combinations with other consonants, and the linguistic notion of treating r as a semi-vowel.

The values of unbiased estimates (UBE) of H1, average word size and the size of the samples are tabulated for all the categories in Table 3, along with standard errors of estimates of H1. H1 is asymptotically normally distributed. The values of UBE of H1 and the standard error have been calculated using the formulae given by Basharin [6],
16.  ṅ        21    24    14     8    28     21   011111101
17.  c        86    84    51    66    86     79   0111110
18.  ch      114    83    56    96   112     97   1100110
19.  j       116    76   142   104    79     99   000000
20.  jh        9    26     2     6     7     11   0111111111
21.  ñ         3     4     4     0     3      3   010001010001
22.  ṭ       138    91    72    52   146    111   000011
23.  ṭh       25    25    14    20    17     21   011111110
24.  ḍ        14    16     6    20    22     17   010001011
25.  ḍh        3     0     0     1     0      1   00010100001
26.  ṇ        22    24    23     8    34     24   110010101
27.  t       233   297   321   336   277    283   01110
28.  th       53    45    43    70    51     51   0000010
29.  d       171   161   193   172   150    166   100001
30.  dh       46    49    33    90    43     50   1100111
31.  n       369   331   348   370   356    354   10011
32.  p       153   129   189   114   173    152   011110
33.  ph       14    25     2    12    30     19   011111100
34.  b       251   299   278   222   230    257   01001
35.  bh       67    46    51    34    54     53   0000011
36.  m       205   213   220   296   200    216   00010
37.  j        78   116   129   122   101    104   000010
38.  r       502   489   670   620   557    547   0110
39.  l       291   278   131   158   214    233   00011
40.  ś        81    42    74    66    85     70   010001
41.  ṣ        42    44    62    30    53     46   100011
42.  s       166   183   208   176   192    183   110000
43.  h        89   106   161   146    97    111   010000
44.  ṛ        63    49    41    20    49     48   11001011
45.  ṛh        1     0     0     0     0      0   010001010000
46.  y       162   169   193   104   179    165   100000
47.  ṯ         6     3    29     4    12      9   0100010101
48.  ṃ        11     8    29     4    19     14   010001001
49.  ḥ         5     5     4     8     2      5   11001010000
50.  m̐        65    37    22    68    29     44   10001110
51.  Blank  1582  1648  1475  1866  1511   1600   111
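The one-gram entropy, the zero-order entropy and the Huffman construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the four-symbol `toy` distribution is invented for the example and is not taken from Table 2. It shows that the average Huffman codeword length lies between H1 and H1 + 1 bits, which is the standard guarantee for a Huffman code.

```python
import heapq
import math


def one_gram_entropy(proportions):
    """H1 = -sum(p_r * lg p_r); zero proportions contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


def zero_order_entropy(n):
    """H0 = lg n, the entropy when all n letters are equally likely."""
    return math.log2(n)


def huffman_code_lengths(proportions):
    """Return the Huffman codeword length (in bits) of each symbol."""
    # Heap items: (weight, unique tie-breaker, symbol indices in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(proportions)]
    heapq.heapify(heap)
    lengths = [0] * len(proportions)
    tie = len(proportions)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol below them.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


# Hypothetical toy distribution (NOT data from the paper).
toy = [0.4, 0.3, 0.2, 0.1]
h1 = one_gram_entropy(toy)
lengths = huffman_code_lengths(toy)
avg_len = sum(p * n_bits for p, n_bits in zip(toy, lengths))

# For the 51-symbol alphabet (50 letters plus blank) H0 = lg 51.
h0_bengali = zero_order_entropy(51)
```

Applied to the whole-prose column of Table 2 (all 51 rows, normalised to sum to one), the same functions would give H1 and the average length of the Huffman code for Bengali.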
Shannon's [1] experimental method of prediction. Short passages of about 30 letters each were selected from common Bengali prose. The passages were unknown to the subject who was to predict in the experiment. The predictor was asked to guess the first letter of the passage. If he was correct he was told so; otherwise the correct letter was given. The experiment continued in this way till the end of the passage. Five subjects participated in the experiment, in which, out of a total of 1160 letters tried, the subjects correctly guessed 757 letters, giving a success rate of 65%. Thus, the estimated value of entropy is calculated as [17]

H 0.65 + H 1 *0.35 = 2.18 bits per letter.

The lower and upper bounds on the limiting value of entropy were estimated from an extension of the above experiment [1]. Here, if the predictor is wrong, he is not given the correct letter, but asked to guess again. This is continued till he finds the correct letter. The number of trials required to find every correct letter is noted. The experiment was conducted for 40 passages of 30 letters each. Then, using Shannon's formulae [1], the lower and upper bounds of entropy were calculated for the 1st, 2nd, ..., 30th letter positions of the 30-letter texts. The values are shown graphically in Fig 1. From the results of the above two experiments, we conclude that it is possible to reduce the length of a text in Bengali prose to a considerable extent, if efficient encoding is used. The redundancy of Bengali prose is calculated from the formulae given by [17] and its value is

$$R = 1 - (H/H_0) = 61.55\%.$$

After all, the choice of literary or scientific writings would produce poorer estimates for H.

DISCUSSION

From Table 3, we find that there is a significant difference between the UBE's of H1 amongst different categories. Hence entropy may give some indication of the style of writing. Stories and novels are closer in H1 as they show less difference in style, whereas essays and plays show wide variation in entropy, which is in agreement with their widely different nature of composition. It is to be noted at this point that the sample sizes chosen for stories, novels and 'others' were almost double of what was taken for essays and plays. This has been done in view of the fact that essays and plays occupy a much smaller volume of Bengali literature as compared to the other three categories. After all, this does not affect our conclusions much, as the above observations remain unaltered even if we include the standard errors with the values of the UBE's of H1.

To have a comparison of different Indian languages, the values of H0, H1, R, H and the origin of the languages are tabulated (Table 4)
for four South-Indian, three North-Indian languages and English. The values of different parameters for English differ widely from those for all Indian languages. In some cases, languages of the same group are in close proximity in the value of H1 (Bengali with Hindi in the Indic group, and Telugu, Kannada and Malayalam in the Dravidian group), whereas in some other cases, languages from the same group show wide variations in the values of H1 (Bengali and Hindi from Marathi in the Indic group, and Tamil from the remaining three languages in the Dravidian group). We also note misleading proximity in the value of H1 between languages of different groups (Bengali and Hindi with Tamil). However, it should also be borne in mind that all the estimates of the entropies (Table 4) are based on sample surveys. Hence these results may be subject to sampling errors to some extent. Yet this study provides some guidelines on the comparison of languages and styles of writing. Thus, we re-affirm Siromoney's [6, 8] conclusion that entropy cannot be used as a language characteristic.

ACKNOWLEDGEMENT

The authors wish to thank G Biswas, BN Das, Mrs M Das and Mrs C Das for help with the experimental work.

Fig 1 The lower and upper bounds of entropy of 51-letter Bengali (plot not reproduced; horizontal axis: number of letters)
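As a rough illustration of how bounds such as those plotted in Fig 1 are obtained, the sketch below applies the bound formulae from Shannon's prediction paper [1] to the number of guesses needed at one letter position, and evaluates the redundancy R = 1 - (H/H0). The function names and the `toy_counts` list are hypothetical and do not come from the paper; only the totals 757 and 1160, the estimate H = 2.18 and the 51-symbol alphabet are taken from the text. Shannon's lower bound presupposes that the guess frequencies are non-increasing, which small samples may violate.

```python
import math


def shannon_bounds(guess_counts):
    """Lower and upper entropy bounds (bits per letter) for one letter position.

    guess_counts[i - 1] is how often the correct letter was found
    on the i-th guess.
    """
    total = sum(guess_counts)
    q = [c / total for c in guess_counts]
    # Upper bound: entropy of the guess-number distribution.
    upper = -sum(qi * math.log2(qi) for qi in q if qi > 0)
    # Lower bound: sum over i of i * (q_i - q_{i+1}) * lg i, with q_{n+1} = 0.
    q_next = q[1:] + [0.0]
    lower = sum(
        i * (qi - qn) * math.log2(i)
        for i, (qi, qn) in enumerate(zip(q, q_next), start=1)
    )
    return lower, upper


def redundancy(h, alphabet_size):
    """R = 1 - H / H0, with H0 = lg(alphabet_size)."""
    return 1.0 - h / math.log2(alphabet_size)


# Hypothetical guess counts for one position over 40 passages (NOT paper data):
# 26 found on the first guess, 8 on the second, 4 on the third, 2 on the fourth.
toy_counts = [26, 8, 4, 2]
lower, upper = shannon_bounds(toy_counts)

# Figures quoted in the text.
success_rate = 757 / 1160      # about 0.65
r = redundancy(2.18, 51)       # about 0.616
```

With the rounded value H = 2.18 the redundancy comes out near 61.6%, close to the 61.55% reported in the text; the small gap presumably reflects rounding of H.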
-------
Language   Origin   H0   H1   Rel. efficiency   H   Redundancy
10 K Rajagopalan, A note on entropy of Kannada prose, Inform Contr, vol 8, pp 640-644, 1965
11 TG Chandravadivelu, Predictive entropy of Hindi and Tamil, Journal of IETE, vol 31, pp 41-45, 1985

Prof. B.N. CHATTERJI, FIETE
Department of Electronics and Electrical Communication Engineering,
Indian Institute of Technology,
Kharagpur 721 302 India