(1986) A Note On Prediction and Entropy of Bengali Prose


IETE Journal of Research

ISSN: 0377-2063 (Print) 0974-780X (Online) Journal homepage: https://www.tandfonline.com/loi/tijr20

A Note on Prediction and Entropy of Bengali Prose

P.P. Das & Prof. B.N. Chatterji

To cite this article: P.P. Das & Prof. B.N. Chatterji (1986) A Note on Prediction and Entropy of
Bengali Prose, IETE Journal of Research, 32:6, 441-445, DOI: 10.1080/03772063.1986.11436642

To link to this article: https://doi.org/10.1080/03772063.1986.11436642

Published online: 02 Jun 2015.

LETTERS TO THE EDITOR


A Note on Prediction and Entropy of Bengali Prose


The concept of entropy has long been used to study the statistical structure of various languages. A similar study is made for printed Bengali prose. Using the proportions of letters estimated from a large sample of Bengali prose, an optimum code is constructed for the letters of the Bengali alphabet. The unbiased estimates of one-gram entropies of different forms of prose writings are obtained. An estimate is made, experimentally, of the lower and upper bounds of the entropy of Bengali prose. Different Indian languages and English have been compared for their values of entropy.

Paper No. 1868; copyright © IETE, 1986
Manuscript received 1985 December 10; revised 1986 July 30

THE INFORMATION-THEORETIC approaches have been used for the study of statistical structures of various languages for over three decades. Shannon [1] first applied the concepts of predictive entropy and redundancy to the statistical structure of printed English. A lot of work has been done since then in comparing and contrasting various languages. Barnard [2] has compared the word-letter entropies of printed English, French, German and Spanish. In 1964, Miller and Isard [3] worked on free recall of English sentences with self-embedding phrase structures. Though redundancy was not directly used as an estimator in their work, they have discussed similar concepts relating the ease of retention of sentences and the degrees of self-embedding. Grignetti [4] modified the word-entropy of printed English from 11.82 to 9.8 bits per word in the same year. Jamison and Jamison [5] worked on the determination of predictive entropy for partially known languages. Siromoney [6] was the first to work on Indian languages. He showed that the one-gram entropy for Tamil prose is significantly different from that of its poetical works. He also constructed Huffman's code [7] for Tamil letters. Later Siromoney and Balasubrahmanyam [8], while estimating the entropy of Telugu prose, showed that one-gram entropy fails both as a language characteristic and as a style characteristic. The chi-square test is claimed [9] to be a more reliable tool for deciding on disputed authorship. It is usually performed over the relative frequency of letters of the alphabet as observed in the various compositions of two or more writers. Rajagopalan [10] found a significant difference between the values of entropy of two dialects of Kannada as spoken in South Kanara and in the old Mysore region. He highlighted the difference in entropy for novels, short stories and biographies. Recently, Chandravadivelu [11] has estimated the predictive entropy for Hindi and Tamil. In the present note, we construct Huffman's [7] minimum redundancy binary code for the letters of the Bengali alphabet from the proportions of letters estimated over a large sample of modern Bengali texts. One-gram entropy has also been calculated. An estimate of the lower and upper bounds of the limiting value of entropy has been made experimentally. We also present a comparison of entropy and redundancy for various Indian languages and English.

Bengali is spoken by over 100 million people in India and Bangladesh. It is a language of the Indic group [12] and traces its origin to Sanskrit. The earliest writings of Bengali date back to the 10th Century AD, though its modern form of prose writing is only about 150 years old [12]. In Bengali prose 11 "vowels" (swara-barna) and 39 "consonants" (byanjan-barna) are used. In earlier writings, one more vowel (lee) and one more consonant (vwa) were also used. But these are omitted in the present analysis since they are not used in modern prose. The "word-space" or "blank" is also taken as a separate symbol and thus the entire alphabet has 51 symbols. The punctuation between clauses and sentences was not considered in the analysis. In Bengali, though vowels (v) occur in pure form, the consonants (c) frequently occur in conjunction with another vowel (cv-form) or another consonant and a vowel (ccv-form). In all such occurrences, consonants and vowels are considered as separate symbols.

ZERO AND FIRST ORDER ENTROPY OF BENGALI

A large sample of over 40,000 letters of modern Bengali prose was taken for estimation. The sources of these prose works are classified into five groups, namely, novels, short stories, essays, plays and others. The last group includes science fiction, newspaper reporting on sports and other general events, selected parts from travellers' diaries and comical writing. Texts were chosen from 30 different sources written by 14 different authors. Table 1 lists the sources and the authors. Since some of the passages were selected from everyday compositions such as newspaper reports, magazines etc, the authors of the same were not known. They have been put into the category of 'anonymous'.

For the selection of passages from different sources, stress was given on the popularity of the authors amongst the readers of Bengali. After deciding upon an author, his/her well-known writings were listed. We then generated four random numbers in succession for selecting the text. The first was used to select one of his/her writings, the second to choose a particular page within that, the third
442 J. INSTN. ELECTRONICS & TELECOM. ENGRS., Vol. 32, No. 6, 1986

TABLE 1: Authors and Sources

Authors

 1. Rabindranath Tagore                  8. Saradindu Bandopadhyaya
 2. Bankimchandra Chattopadhyaya         9. Sunil Gangopadhyaya
 3. Saratchandra Chattopadhyaya         10. Sankar
 4. Iswarchandra Vidyasagar             11. Shirsendu Mukhopadhyaya
 5. Tarasankar Bandopadhyaya            12. Manoj Mitra
 6. Bibhutibhusan Bandopadhyaya         13. Badal Sarkar
 7. Manik Bandopadhyaya                 14. Anonymous

Sources

 1. Naukadubi                                  16. Jhi
 2. Kapalkundala                               17. Byaghracharya Brihallangul
 3. Adarsha Hindu Hotel                        18. Bidhaba Bibah
 4. Chidiyakhana                               19. Russiar Chithi
 5. Grihadaha                                  20. Mrityur Chokhe Jal
 6. Tanaya                                     21. Chak Bhanga Madhu
 7. Padma Nadir Majhi                          22. Michhil
 8. Ganadebata                                 23. Abu Hosen
 9. Postmaster                                 24. Ananda Bazar Patrika
10. Bibek                                      25. Aazkal
11. Janma-O-Mrityu                             26. Jugantar
12. Garam Bhat Athaba Nichhak Bhuter Galpa     27. Basumati
13. Sonar Ghorha                               28. Desh
14. Kabuliwala                                 29. Bartaman
15. Kalapahadha                                30. Pratikshan

to select the starting line and the last for the length of the text. Incomplete sentences at the beginning and end of the text were deleted to avoid stray effects.

Let $p_1, p_2, \ldots, p_n$ be the proportions of different letters of the alphabet in the texts. The one-gram entropy [13] is given by,

$$H_1 = -\sum_{r=1}^{n} p_r \lg p_r$$

where lg stands for logarithm to the base two. The maximum value of entropy is obtained when all the $p_r$'s are equal, called the zero-order entropy, $H_0 = \lg n$.

$H_1$ is the average number of binary digits required per letter of Bengali, if the language is encoded with 100% efficiency, assuming that the occurrence of every letter in a Bengali text is independent of the preceding letters.

The proportions of different letters of the Bengali alphabet are shown in Table 2 for all the five different groups and for the prose as a whole. Huffman's variable-length minimum redundancy prefix code [7] is constructed from these relative frequencies of letters in modern prose (Table 2), which gives,

$$H_0 = \lg 51 = 5.67 \text{ bits per letter.}$$

$$H_1 = 4.37 \text{ bits per letter.}$$

$$\text{relative efficiency} = H_1/H_0 = 77.10\%.$$

The average length of the Huffman code is 4.41 binary digits per letter. So, the efficiency of coding [14] is 99.07%. Other codes which are not optimum may be constructed using the given values of the $p_r$'s.

From Table 2, it can be seen that short vowels (1, 2, 3, 5, 8, 10) occur more frequently than long vowels (4, 6, 9, 11) and most of the consonants. A number of consonants in Bengali can be paired according to phonetic considerations. In every such pair, the consonants produce very similar sounds, but the first in the pair, occurring immediately before the second in the alphabet, can be pronounced easily, whereas the second needs a rather strong release of breath. We find that except in one case (17 and 18), the first ones of the pairs (12, 14, 19, 22, 24, 27, 29, 32, 34) are more frequent than the corresponding second ones (13, 15, 20, 23, 25, 28, 30, 33, 35). This may be taken as one more illustration of Zipf's principle of least effort [15]. Among the consonants (12 through 50), r (38) is the most frequent and next only to the three most frequent vowels (1, 2 and 8) and blank. This confirms Datta's [16] claim that r behaves as a vowel in forming combinations with other consonants and the linguistic notion of treating r as a semi-vowel.

The values of unbiased estimates (UBE) of $H_1$, average word size and the size of the samples are tabulated for all the categories in Table 3, along with standard errors of the estimates of $H_1$. $H_1$ is asymptotically normally distributed. The values of UBE of $H_1$ and the standard error have been calculated using the formulae given by Basharin [6], where N is the sample size.

PREDICTIVE ENTROPY OF BENGALI

An estimate of the limiting value of entropy H was made using

TABLE 2: Relative frequency of letters (a)

Sl No.  Letter (b)  Story  Novel  Essay  Play  Others  Total  Huffman code

 1.     a            1507   1504   1610  1398    1660   1546  101
 2.     ā             892    833   1049   934     839    887  001
 3.     i             440    570    611   448     500    509  0101
 4.     ī              43     60     27    26      67     49  11001110
 5.     u             220    196    140   224     159    189  110001
 6.     ū              11      8      6    10      20     12  1100101001
 7.     r̥               5     18     37     2      12     14  010001000
 8.     e             930    851    646   748     765    813  1101
 9.     ai              3      3      4    14       4      4  01000101001
10.     o             147    161    119   228     186    167  100010
11.     au             10      6     15     0       6      7  11001010001
12.     k             334    337    321   360     375    347  10010
13.     kh             90    124     49    84      71     87  1000110
14.     g              97     86     92    36     127     94  1100100
15.     gh             16     15      4     0      10     11  0111111110
16.     ṅ              21     24     14     8      28     21  011111101
17.     c              86     84     51    66      86     79  0111110
18.     ch            114     83     56    96     112     97  1100110
19.     j             116     76    142   104      79     99  000000
20.     jh              9     26      2     6       7     11  0111111111
21.     ñ               3      4      4     0       3      3  010001010001
22.     ṭ             138     91     72    52     146    111  000011
23.     ṭh             25     25     14    20      17     21  011111110
24.     ḍ              14     16      6    20      22     17  010001011
25.     ḍh              3      0      0     0       0      1  0100010100001
26.     ṇ              22     24     23     8      34     24  110010101
27.     t             233    297    321   336     277    283  01110
28.     th             53     45     43    70      51     51  0000010
29.     d             171    161    193   172     150    166  100001
30.     dh             46     49     33    90      43     50  11001111
31.     n             369    331    348   370     356    354  10011
32.     p             153    129    189   114     173    152  011110
33.     ph             14     25      2    12      30     19  011111100
34.     b             251    299    278   222     230    257  01001
35.     bh             67     46     51    34      54     53  0000011
36.     m             205    213    220   296     200    216  00010
37.     j              78    116    129   122     101    104  000010
38.     r             502    489    670   620     557    547  0110
39.     l             291    278    131   158     214    233  00011
40.     ś              81     42     74    66      85     70  0100011
41.     ṣ              42     44     62    30      53     46  10001111
42.     s             166    183    208   176     192    183  110000
43.     h              89    106    161   146      97    111  010000
44.     ṛ              63     49     41    20      49     48  11001011
45.     ṛh              1      0      0     0       0      0  0100010100000
46.     y             162    169    193   104     179    165  100000
47.     t               6      3     29     4      12      9  0100010101
48.     ṁ              11      8     29     4      19     14  010001001
49.     ḥ               5      5      4     8       2      5  11001010000
50.     ~              65     37     22    68      29     44  10001110
51.     Blank        1582   1648   1475  1866    1511   1600  111

a: Values are scaled to a total of 10,000 in every column.
b: For transliteration ODBL [12] is followed.
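
The zero-order entropy, one-gram entropy and Huffman code length quoted in the text can be checked from the "Total" column of Table 2. The following is a minimal sketch, not the authors' program: the names `TOTALS`, `entropy_bits` and `huffman_lengths` are illustrative, and the frequencies are the rounded table values rather than the raw letter counts, so the results agree with the published figures only to about two decimal places.

```python
import heapq
import math

# "Total" column of Table 2 (scaled to 10,000), serial numbers 1..51.
# The last entry is the blank (word-space).
TOTALS = [1546, 887, 509, 49, 189, 12, 14, 813, 4, 167, 7,
          347, 87, 94, 11, 21, 79, 97, 99, 11, 3, 111, 21, 17, 1, 24,
          283, 51, 166, 50, 354, 152, 19, 257, 53, 216, 104, 547, 233,
          70, 46, 183, 111, 48, 0, 165, 9, 14, 5, 44, 1600]


def entropy_bits(freqs):
    """One-gram entropy H1 = -sum p_r lg p_r (zero frequencies contribute 0)."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)


def huffman_lengths(freqs):
    """Code length of every symbol in a binary Huffman code."""
    # Heap items: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    counter = len(freqs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths


h0 = math.log2(len(TOTALS))        # zero-order entropy, lg 51
h1 = entropy_bits(TOTALS)          # one-gram entropy
lengths = huffman_lengths(TOTALS)
avg_len = sum(f * l for f, l in zip(TOTALS, lengths)) / sum(TOTALS)

print(f"H0 = {h0:.2f} bits, H1 = {h1:.2f} bits")
print(f"relative efficiency = {h1 / h0:.2%}")
print(f"average Huffman length = {avg_len:.2f}, coding efficiency = {h1 / avg_len:.2%}")
```

Tie-breaking in the heap may assign different code words than those printed in the table, but the average code length of any Huffman code for the same frequencies is the same.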

Shannon's [1] experimental method of prediction. Short passages of about 30 letters each were selected from common Bengali prose. The passages were unknown to the subject who was to predict them in the experiment. The predictor was asked to guess the first letter of the passage. If he was correct he was told so, else the correct letter was given. The experiment continued in this way till the end of the passage. Five subjects participated in the experiment, in which, out of a total of 1160 letters tried, the subjects could correctly guess 757 letters, giving a success rate of 65%. Thus, the estimated value of entropy is calculated as [17],

$$H = 0.65 + H_1 \times 0.35 = 2.18 \text{ bits per letter.}$$

The lower and upper bounds on the limiting value of entropy were estimated from an extension of the above experiment [1]. Here, if the predictor is wrong, he is not given the correct letter, but asked to guess again. This is continued till he finds the correct letter. The number of trials required to find every correct letter is noted. The experiment was conducted for 40 passages of 30 letters each. Then, using Shannon's formulae [1], the lower and upper bounds of entropy were calculated for the 1st, 2nd, ..., 30th letter position of the 30-letter texts. The values are shown graphically in Fig 1. From the results of the above two experiments, we conclude that it is possible to reduce the length of a text in Bengali prose to a considerable extent, if efficient encoding is used. The redundancy of Bengali prose is calculated from the formulae given by [17] and its value is

$$R = 1 - (H/H_0) = 61.55\%.$$

After all, the choice of literary or scientific writings would produce poorer estimates for H.

[Fig 1: The lower and upper bounds of entropy of 51-letter Bengali. Horizontal axis: number of letters.]

DISCUSSION

From Table 3, we find that there is a significant difference between the UBE's of $H_1$ amongst different categories. Hence entropy may give some indication of the style of writing. Stories and novels are closer in $H_1$ as they show less difference in style, whereas essays and plays show wide variation in entropy, which is in agreement with their widely different nature of composition. It is to be noted at this point that the sample sizes chosen for stories, novels and 'others' were almost double of what was taken for essays and plays. This has been done in view of the fact that essays and plays occupy a much smaller volume of Bengali literature as compared to the other three categories. After all, this does not affect our conclusions much, as the above observations remain unaltered even if we include the standard errors with the values of the UBE's of $H_1$.

To have a comparison of different Indian languages, the values of $H_0$, $H_1$, R, H and the origin of the languages are tabulated (Table 4) for four South-Indian languages, three North-Indian languages and English. The values of different parameters for English differ widely from those for all Indian languages. In some cases, languages of the same group are in close proximity in the value of $H_1$ (Bengali with Hindi in the Indic group and Telugu, Kannada and Malayalam in the Dravidian group), whereas in some other cases, languages from the same group show wide variations in the values of $H_1$ (Bengali and Hindi from Marathi in the Indic group and Tamil from the other three languages in the Dravidian group). We also note misleading proximity in the value of $H_1$ between languages of different groups (Bengali and Hindi with Tamil). However, it should also be borne in mind that all the estimates of the entropies (Table 4) are based on sample surveys. Hence these results may be subject to sampling errors to some extent. Yet this study provides some guidelines on the comparison of languages and styles of writing. Thus, we re-affirm Siromoney's [6, 8] conclusion that entropy cannot be used as a language characteristic.

ACKNOWLEDGEMENT

The authors wish to thank G Biswas, BN Das, Mrs M Das and Mrs C Das for help with the experimental work.
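
The two prediction experiments described under PREDICTIVE ENTROPY can be sketched numerically. The code below is a hedged illustration, not the authors' procedure: `single_guess_estimate` reproduces the arithmetic 0.65 + 0.35 H_1 quoted in the text, and `shannon_bounds` applies the bounds from Shannon [1] for one letter position, where q_i is the fraction of letters found on the i-th guess (upper bound -sum q_i lg q_i, lower bound sum i (q_i - q_{i+1}) lg i). The guess counts in the example are hypothetical, and the bounds assume the q_i are non-increasing.

```python
import math


def single_guess_estimate(success_rate, h1):
    """Entropy estimate from the one-guess experiment, as computed in the text."""
    return success_rate + (1.0 - success_rate) * h1


def shannon_bounds(guess_counts):
    """Lower and upper entropy bounds for one letter position.

    guess_counts[i] is the number of letters identified on the (i+1)-th guess.
    """
    total = sum(guess_counts)
    q = [c / total for c in guess_counts] + [0.0]   # q_{n+1} = 0
    upper = -sum(qi * math.log2(qi) for qi in q if qi > 0)
    lower = sum((i + 1) * (q[i] - q[i + 1]) * math.log2(i + 1)
                for i in range(len(q) - 1))
    return lower, upper


# Figures quoted in the text: 757 correct guesses out of 1160 letters.
rate = 757 / 1160
print(f"success rate = {rate:.2f}")
print(f"H = {single_guess_estimate(0.65, 4.37):.2f} bits per letter")

# Hypothetical guess counts for one position over 40 passages.
example_counts = [26, 6, 3, 2, 1, 1, 1]
low, up = shannon_bounds(example_counts)
print(f"lower bound = {low:.2f}, upper bound = {up:.2f}")
```

Repeating `shannon_bounds` for each of the 30 letter positions would give the pair of curves that Fig 1 plots.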

TABLE 3: UBE and standard error of H_1

Category          Sample size   Average word size (c)   UBE of H_1   Standard error

Story                  11,267            6.32               4.3775       0.013345
Novel                  10,365            6.07               4.3629       0.016197
Essay                   5,137            6.78               4.3048       0.022665
Play                    5,000            5.36               4.2514       0.022974
Others                 11,194            6.62               4.3934       0.015577
Total for Prose        42,963            6.25               4.3727       0.007955

c: Expressed in terms of letters per word. Here, blank is considered as a letter; otherwise the values of average word size would be one less than those shown above.
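
As an illustration of how entries like those in Table 3 might be computed, the sketch below assumes the first-order forms commonly attributed to Basharin: a bias correction of (s - 1) lg e / (2N) added to the plug-in entropy, and a standard error of sqrt((sum p_r lg^2 p_r - H^2) / N), where s is the number of letters actually observed and N the sample size. These forms are an assumption of the sketch, since the text cites the formulae without reproducing them, and the function name `ube_entropy` is hypothetical.

```python
import math


def ube_entropy(counts):
    """Bias-corrected one-gram entropy and its standard error (assumed forms)."""
    n_total = sum(counts)                       # sample size N
    probs = [c / n_total for c in counts if c > 0]
    s = len(probs)                              # letters actually observed
    plug_in = -sum(p * math.log2(p) for p in probs)
    # First-order bias correction: the plug-in estimate is biased downward.
    ube = plug_in + (s - 1) * math.log2(math.e) / (2 * n_total)
    # Asymptotic variance of the plug-in estimate.
    second_moment = sum(p * math.log2(p) ** 2 for p in probs)
    variance = max(second_moment - plug_in ** 2, 0.0) / n_total
    return ube, math.sqrt(variance)


# Hypothetical letter counts for a tiny four-letter alphabet.
ube, se = ube_entropy([50, 25, 15, 10])
print(f"UBE of H1 = {ube:.4f}, standard error = {se:.4f}")
```

With N in the tens of thousands, as in Table 3, the correction term is of the order of 0.001 bits, so the UBE and the plug-in entropy differ only in the third decimal place.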

TABLE 4: Comparison of different languages

Language          Origin          H_0    H_1    Rel. efficiency   H      Redundancy (%)

Bengali           Indic           5.67   4.37   0.77              2.18   61.55
Hindi [11]        Indic           5.52   4.36   0.79              1.97   64.31
Marathi [18]      Indic           5.52   4.50   0.82
Kannada [10]      Dravidian       5.61   4.55   0.81
Malayalam [18]    Dravidian       5.64   4.60   0.82
Tamil [6, 11]     Dravidian       4.91   4.34   0.88              1.82   62.93
Telugu [8]        Dravidian       5.73   4.59   0.80              2.40   58.10
English [1, 11]   Indo-European   4.76   4.03   0.85              1.45   69.54
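
The derived columns of Table 4 follow from the definitions used in the text: relative efficiency H_1/H_0 and redundancy R = 1 - H/H_0. A short sketch that recomputes them from the tabulated H_0, H_1 and H is given below; the dictionary name `ROWS` is illustrative, and only the rows with a reported H are included. Because the inputs are already rounded to two decimals, small differences in the last digit are to be expected.

```python
# (H0, H1, H, reported relative efficiency, reported redundancy in %)
ROWS = {
    "Bengali": (5.67, 4.37, 2.18, 0.77, 61.55),
    "Hindi":   (5.52, 4.36, 1.97, 0.79, 64.31),
    "Tamil":   (4.91, 4.34, 1.82, 0.88, 62.93),
    "Telugu":  (5.73, 4.59, 2.40, 0.80, 58.10),
    "English": (4.76, 4.03, 1.45, 0.85, 69.54),
}

results = {}
for language, (h0, h1, h, rel_reported, red_reported) in ROWS.items():
    rel_eff = h1 / h0                    # relative efficiency H1/H0
    redundancy = 100.0 * (1.0 - h / h0)  # redundancy R = 1 - H/H0, in percent
    results[language] = (rel_eff, redundancy)
    print(f"{language:8s} rel. efficiency {rel_eff:.2f} (table {rel_reported:.2f}), "
          f"redundancy {redundancy:.2f}% (table {red_reported:.2f}%)")
```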

REFERENCES

1 CE Shannon, Prediction and entropy of printed English, BSTJ, vol 30, pp 50-64, 1951
2 GA Barnard, Statistical calculation of word entropies for four western languages, IRE Trans, vol IT-1, pp 49, 1954
3 GA Miller & Stephen Isard, Free recall of self-embedded English sentences, Inform Contr, vol 7, pp 292-303, 1964
4 MC Grignetti, A note on the entropy of words in printed English, ibid, vol 7, pp 304-306, 1964
5 D Jamison & K Jamison, A note on the entropy of partially known languages, ibid, vol 12, pp 164-167, 1968
6 G Siromoney, Entropy of Tamil prose, ibid, vol 6, pp 297-300, 1963
7 DA Huffman, A method for the construction of minimum redundancy codes, Proc IRE, vol 40, pp 1098, 1952
8 P Balasubrahmanyam & G Siromoney, A note on entropy of Telugu prose, Inform Contr, vol 13, pp 281-285, 1968
9 G Herdan, Language as choice and chance, Noordhoff, Groningen, 1956, pp 88-90
10 K Rajagopalan, A note on entropy of Kannada prose, Inform Contr, vol 8, pp 640-644, 1965
11 TG Chandravadivelu, Predictive entropy of Hindi and Tamil, Journal of IETE, vol 31, pp 41-45, 1985
12 SK Chatterji, The origin and development of Bengali language, part I, George Allen and Unwin Ltd., London, 1970, pp 129-135
13 CE Shannon, A mathematical theory of communication, BSTJ, vol 27, pp 379-423, 1948
14 FM Reza, An introduction to information theory, McGraw-Hill, New York, 1961, pp 133
15 GK Zipf, Human behavior and the principle of least effort, Addison-Wesley, Cambridge, Massachusetts, 1949
16 AK Dutta, A generalised formal approach for description and analysis of major Indian scripts, Journal IETE, vol 30, pp 155-161, Nov 1984
17 L Brillouin, Science and information theory, Academic Press, New York, 1956, pp 24-25
18 BS Ramakrishna, et al, Some aspects of relative efficiencies of Indian languages, Department of ECE, IISc, Bangalore, 1960, pp 15-16

P.P. DAS, AMIETE
and
Prof. B.N. CHATTERJI, FIETE
Department of Electronics and Electrical Communication Engineering,
Indian Institute of Technology,
Kharagpur 721 302 India
