VishwaBharat@tdil


ISSN No. 0972-6454

Achievements of the TDIL Resource Centres
• Human Machine Interface Systems (HUMIS)
• Knowledge Resources (KR)
• Knowledge Tools (KT)
• Language Engineering (LE)
• Localisation (L)
• Translation Support Systems (TSS)
• Standardisation
Patron-in-Chief
Arun Shourie, Hon'ble Minister
Ministry of Communications & Information Technology
(Government of India)
ashourie@nic.in

Patron
K.K. Jaswal, Secretary
Department of Information Technology
Ministry of Communications & Information Technology
(Government of India)
6, CGO Complex, New Delhi-110003
secretary@mit.gov.in

Associate Patron
S. Lakshminarayanan, Additional Secretary
sln@mit.gov.in

Advisor
Y.S. Bhave, Joint Secretary & Financial Advisor
ysbhave@mit.gov.in

Editorial Team
Om Vikas omvikas@mit.gov.in
P.K. Chaturvedi pkc@mit.gov.in
Pardeep Chopra pchopra@mit.gov.in
V.K. Sharma vkumar@mit.gov.in
Manoj Jain mjain@mit.gov.in
S. Chandra schandra@mit.gov.in

Contents : July 2003 (Ashadha)
1. Calendar of Events 2
2. Reader's Feedback 4
3. TDIL Vision 5
4. About TDIL Programme 7
5. Achievements of the TDIL Resource Centres
   RCILTS for Hindi & Nepali (IIT, Kanpur) 13
   RCILTS for Gujarati (MS Univ., Baroda) 25
   RCILTS for Marathi & Konkani (IIT, Mumbai) 33
   RCILTS for Punjabi (TIET, Patiala) 43
   RCILTS for Bengali (ISI, Kolkata) 51
   RCILTS for Oriya (Utkal Univ. & OCAC, Bhubaneswar) 69
   RCILTS for Assamese & Manipuri (IIT, Guwahati) 83
   RCILTS for Kannada (IISc, Bangalore) 95
   RCILTS for Telugu (Univ. of Hyderabad, Hyderabad) 109
   RCILTS for Malayalam (C-DAC, Thiruvananthapuram) 135
   RCILTS for Tamil (Anna Univ., Chennai) 149
   RCILTS for Urdu, Sindhi & Kashmiri (C-DAC, Pune) 167
   RCILTS for Sanskrit, Japanese & Chinese (JNU, Delhi) 181
6. TDIL Associates Achieve Honours…Congratulations! 190
7. Resource Centres Technology Index 192

Technology Treading on Multi-lingualism

Over 95% of India's population cannot work in English; hence there is a need to shift from English-centric computing to multilingual computing. Technology development for Indian languages was initiated at the individual level during the early 1960s and '70s, taken up through Government-sponsored R&D projects during the 1980s-1990s, and further accelerated in mission mode since 2000. Innovation is key to the growth of new technologies, and incubation of these technologies is essential for productising them. Incubating innovations in Indian language technology follows an 'S'-curve model with four growth perspectives : Improving Core Capabilities, Collaborating Capabilities, Developing New Competitive Capabilities, and Creating Revolutionary Changes.

Technology development for Indian languages may be categorized in A-B-C Technology phases:

During 1976-1990 (A-Technology Phase), focus was on Adaptation Technologies: abstraction of requisite technological designs and competence building in R&D institutions. [This may correspond to phase-1 of the 'S'-curve model - Improving core capabilities.]

During 1991-2000 (B-Technology Phase), focus was on developing Basic Technologies: generic information processing tools, interface technologies and cross-compatibility conversion utilities. The TDIL (Technology Development for Indian Languages) programme was initiated. [This may correspond to phase-2 of the 'S'-curve model - Collaborating capabilities.]

[Figure : Incubating Innovations - 'S'-curve model : Technology Development for Indian Languages]

During 2001-2005 (C-Technology Phase), focus is on developing Creative Technologies in the context of convergence of computing, communication and content technologies. Collaborative technology development is being encouraged. During this period, Resource Centres for Indian Language Technology Solutions were set up; these create a virtual organizational Language Technology environment. [This may correspond to phase-3 of the 'S'-curve model - Developing new capabilities.]

During 2006-2010, focus may be on Demonstrative Technologies for newer products or services. Futuristic projects such as Intelligent Cognitive Systems and Speech-to-Speech Translation systems are being conceptualized.

Monitoring of the TDIL programme relied on ZOPP (Objectives-Oriented Project Planning) workshops for building consensus on approach and knowledge sharing, on peer review, and on regional clustering for closer interaction and technology sharing. Once a technology is developed by a particular centre, it is shared with the other centres as well. Clustering of Regional Resource Centres enables peer competition and collaboration for innovation and product-oriented development. A Basic Information Processing Kit, OCR, Text-to-Speech and messaging systems will shortly be available for all Indian languages. On-line Machine Aided Translation is available from English to Hindi; other translation pairs are also being developed. The Language Technology Business Meet (LTBM 2001) provided a forum for closer interaction between academia, industry and government to facilitate innovation, incubation and proliferation. Technology handshakes were encouraged; 41 technology handshakes were signed. The international community has also recognised India's growing expertise in multilingual computing: research groups from France, Germany, U.K., Japan etc. have approached Indian scholars for possible R&D collaboration.

This issue of VishwaBharat@tdil consolidates the technologies developed so far, mainly in the Resource Centres under the TDIL programme. We hope it will be highly useful to research groups in academia as well as to the industry for integrating these technologies into newer products and services.

omvikas@mit.gov.in

ISSN No. 0972-6454 has been granted to VishwaBharat@tdil. Google search engine refers to the contents of this journal.
WebSite : http://tdil.mit.gov.in
प्रधान मंत्री
Prime Minister
MESSAGE

Information Technology has brought about the revolution that has transformed commodities, people and their relations. Our sages envisioned the world as a Global Village: "Vasudhaiv Kutumbkam". That vision is poised to become a reality as communities and countries network and join up. What we need is the right mindset to impel us towards the state of "Sarve Bhavantu Sukhinah".

The opportunities that today's knowledge economy offers enable us to make knowledge and innovation available far and wide, without any barriers of nationality, language, culture, caste and creed. IT with such a humanitarian perspective will lead to a society where peace and harmony obtain. I hope that with the success of the Technology Development for Indian Languages Mission programme, India will emerge as a pioneering hub of multilingual computing and provide appropriate technology solutions for overcoming linguistic and cultural barriers, and thereby ensure knowledge for all.

(A.B. Vajpayee)

New Delhi
July 29, 2003
1. Calendar of Events

! Workshop on Spoken Language Processing, Tata Institute of Fundamental Research, Mumbai, January 9-11, 2003.
  Website : http://speech.tifr.res.in
! Indo Wordnet Workshop, CIIL Mysore, Jan. 14-15, 2003.
  Website : http://www.ciil.org
! Second International Workshop on Technology Development in Indian Languages (IWTDIL-2003), Kolkata, Jan. 22-24, 2003.
  Website : http://isical.ac.in/-cvpr/events/iwtdl.htm
• Text Retrieval Conference, conducted by the National Institute of Standards and Technology (NIST) with support from ARDA, U.S.A., February 17-21, November 2003.
  Website : http://trec.nist.gov
! National Workshop on Application of Language Technology in Indian Languages, CALTS, University of Hyderabad (Central University), Hyderabad, March 6-8, 2003.
  E-mail : sapworkshop@yahoo.com
! 13th International Workshop on Research Issues on Data Engineering: Multilingual Information Management (RIDE-MLIM 2003), Hyderabad, India, March 10-11, 2003.
  Website : http://www.iiit.net/conferences/ride2003.html
! First National Indic-Font Workshop, PES Institute of Technology, Bangalore, 26-30 March, 2003.
  E-mail : vijay@ekgaon.com
• EACL 2003 - 11th Conference of the European Chapter of the Association for Computational Linguistics, Agro Hotel, Budapest, Hungary, April 12-17, 2003.
  Website : http://www.coli.uni-sb.de/eacl/eacl03.php3
• EACL 2003 Workshop on Computational Linguistics for South Asian Languages - Expanding Synergies with Europe, Agro Hotel, Budapest, Hungary, April 14, 2003.
• The 12th International Conference on Management of Technology, International Association for Management of Technology (IAMOT), Nancy, May 13-15, 2003.
  Website : http://www.iamot.org/IAMOT2003/cfpl1.html
! WILTS-2003, Utkal University, Bhubaneswar, May 15-17, 2003.
  Website : http://ilts-utkal.org/wilts-2003.htm
• Human Language Technology Conference (HLT-NAACL), Edmonton, Canada, May 27 - June 1, 2003.
  Website : http://www.sims.berkeley.edu/research/conferences/hlt-naacl03/dates.html
• 3rd International LRC 2003 Localisation Summer School, University of Limerick, Ireland, 3-6 June, 2003.
  Website : http://lrc.csis.ul.ie/
• Athabascan Languages Conference, Arcata, California, USA, 5-7 June, 2003.
  Website : http://www.uaf.edu/anlc/alc/
! Summer School in Universal Networking Language, IIT Mumbai, 5-13 June, 2003.
  E-mail : pb@cse.iitb.ernet.in
• 8th Biennial Conference of the International Association for Language Learning Technology (IALLT 2003), University of Michigan, USA; Conference Workshops: June 17-18, 2003; Conference Sessions: June 19-21, 2003.
  Website : http://www.lsa.umich.edu/lrc/iallt/
• Terminology & Localization Conference/Workshops, Terminology Summer Academy, Kent State University, Institute for Applied Linguistics, Kent, Ohio 44242, June 17-19 and June 20-21, 2003.
  Website : http://appling.kent.edu/ResourcePages/TSA-2003/TSAWeb/TerminologyAndLocalizationHome.html
• NLDB 2003, 8th International Conference on Application of Natural Language to Information Systems, June 23-25, 2003, Burg (Spreewald), Germany.
  Website : http://dbis-conference.informatik.tu-cottbus.de/nldb2003/
• Association for Computational Linguistics (ACL 2003), Sapporo Convention Center, Sapporo, Japan, July 7-12, 2003.
  Website : http://www.ec-inc.co.jp/ACL2003/
  ACL 2003 - Associated Conferences :
  • EMNLP 2003: The Eighth Conference on Empirical Methods in Natural Language Processing, Nara Institute of Science and Technology, Japan.
  • 6th International Workshop on Information Retrieval with Asian Languages (IRAL 2003), National Institute of Informatics, Japan.
  ACL 2003 - Associated Workshops :
  • WS-1 Multilingual Summarization and Question Answering - Machine Learning and Beyond.
  • WS-2 Natural Language Processing in Biomedicine.
  • WS-3 The Lexicon and Figurative Language.
  • WS-4 Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models.
  • WS-5 The Second International Workshop on Paraphrasing: Paraphrase Acquisition and Applications.
  • WS-6 Second SIGHAN Workshop on Chinese Language Processing.
  • WS-7 Multiword Expressions: Analysis, Acquisition and Treatment.
  • WS-8 Linguistic Annotation: Getting the Model Right.
  • WS-9 Workshop on Patent Corpus Processing.
  • WS-10 Towards a Resources Information Infrastructure.
• 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP 2003), Sapporo, Japan, July 11-12, 2003.
  Website : http://www.ai.mit.edu/people/mcollins/emnlp03
• JHU Summer Workshop on Language Engineering, Baltimore, Maryland, USA, July 14 - Aug. 22, 2003.
  Website : http://www.clsp.jhu.edu/workshops
! International Conference on Information Processing and Resources on the Internet ("Tamil Internet 2003"), Chennai, Tamil Nadu, India, July 31 - August 3, 2003.
  Website : http://www.infitt.org/ti2003/
• The XVIth International Conference on Historical Linguistics, University of Copenhagen, Denmark, 11-15 August, 2003.
  Website : http://www.hum.ku.dk/ichl2003
• Workshop on "Adaptation of Automatic Learning Methods for Analytical and Inflectional Languages" (ALAF'03), Vienna, Austria, August 18-22, 2003.
  Website : http://www.logic.at/esslli03
• The 4th Celtic Linguistics Conference, Selwyn College, University of Cambridge, UK, 1-3 September, 2003.
  Website : http://privatewww.essex.ac.uk/~louisa/celtic/clc4.html
• Eurospeech 2003 (Interspeech 2003), 8th European Conference on Speech Communication & Technology, September 1-4, 2003, Geneva, Switzerland.
  Website : http://www.symporg.ch/eurospeech/
• Generative Approaches to Language Acquisition, Utrecht, Netherlands, 4-6 September, 2003.
  Website : http://www-uilots.let.uu.nl/conferences/Gala/gala_2003_homepage.htm
• 36th Annual Meeting on Applied Linguistics, University of Leeds, 4-6 September, 2003.
  Website : http://www.baal.org.uk/baal03call.htm
• International Conference on Text, Speech and Dialogue, Czech Republic, September 8-11, 2003.
  Website : http://www.kiv.zcu.cz/events/tsd2003/
• International Conference RANLP-2003 (Recent Advances in Natural Language Processing), Borovets, Bulgaria, 10-12 September, 2003.
  Website : http://www.lml.bas.bg/ranlp2003/
• LISA Workshops in Boston, Boston, U.S.A., 12-18 September, 2003.
  Website : http://www.lisa.org/events/2003boston/index.html
! 3-Day National Seminar on Telugu Language, Centre for ALTS, Univ. of Hyderabad, Hyderabad, 17-19 September, 2003.
  Website : http://www.cil.org/announcement/ltt2.html
• Machine Translation Summit IX, New Orleans, USA, September 23-28, 2003.
  Website : http://www.mt-summit.org
• Conference on Language Policy and Standardization, Iceland, 4th October, 2003.
  Website : http://www.ismal.hi.is/Radstefna2003ENS.html
• 2nd International Semantic Web Conference, Workshop on Human Language Technology for the Semantic Web and Web Services, Sanibel Island, Florida, USA, 20-23 October, 2003.
  Website : http://gate.ac.uk/conferences/iswc2003/
• 8th Annual LRC Conference, Localisation Research Centre, University College Dublin, 17-19 November, 2003.
  Website : http://lrc.csis.ul.ie/
• LangTech 2003 - The European Forum for Language Technology, Paris, France, 24-25 November, 2003.
  Website : http://www.lang-tech.org/
• 6th International Conference of Asian Digital Libraries (ICADL 2003), Kuala Lumpur, Malaysia, 8-11 December, 2003.
  Website : http://www.ftsm.ukm.my/ICADL2003/
! ICON-2003: International Conference on Natural Language Processing, Central Institute of Indian Languages, Manasagangotri, Mysore, Dec. 18-21, 2003.
  Website : http://www.iiit.net/conferences/icon2003.html

Unicode Related Meetings & Events
• INCITS/CT22, Washington, DC, USA, Aug. 18, 2003.
• UTC #96 / L2 #193, Pleasanton, CA, USA, Aug. 25-28, 2003.
• IUC #24, Atlanta, GA, Sept. 3-5, 2003.
• IRG #22, Guilin, China.
• SC22 Plenary, Oslo, Norway, Sept. 15-19, 2003.
• Unicode Workshop, hosted by the Technology Development for Indian Languages (TDIL) Programme of the Department of Information Technology, Govt. of India, at Park Hotel, New Delhi, India, Sept. 24-26, 2003.
• WG20, hosted by Microsoft & The Unicode Consortium, Mountain View, CA, USA, Oct. 15-17, 2003.
• SC2/WG2 #44, hosted by Microsoft & The Unicode Consortium, Mountain View, CA, USA, Oct. 20-23, 2003.
• UTC #97 / L2 #194 Annual Members Meeting, hosted by Johns Hopkins University, Baltimore, MD, Nov. 4-7, 2003.
  Website : http://www.unicode.org/unicode/timesens/calendar.html

Note : ! indicates conferences in India.
2. Reader’s Feedback

 "…Thank you very much for participating in the Universal Library Million Book Project and for hosting our delegation to India. I enjoyed the presentations, discussions, collegiality, and food. I was particularly moved by Dr. Om Vikas's research and spoke with him afterwards. He gave me a copy of one of his papers. That paper and the other publications you provided in the packet are impressive. The brass Shiva you gave me as a gift is in an honored place in my office. I appreciate your generosity and contribution to achieving noble goals. God bless you…"
-Denise A. Troll,
Carnegie Mellon University (U.S.A.)
Website : http://www.library.cmu.edu

 "…Thank you very much for your letter dated March 24, 2003. I am delighted with the information contained therein concerning the Unicode Consortium Committee meeting on March 4-7, 2003 at Microsoft Center, Mountain View, California.
My interest had always been to preserve and develop the diversity that India represents in its culture, languages, traditions and much else. That is why I have always supported efforts that would lead to possibilities in this direction. I am particularly pleased that the initiatives you have taken over the past quarter of a century, since the whole programme relating to Indian languages, computers and information technology was first taken up in the Department of Electronics when I was Secretary, are now flowering. We have come a long way since then, with considerable success, for which a lot of credit needs to be given to you. I am delighted to have had the opportunity to be supportive. My principal interest today is in the UNL and the approach that this would be complementary to a great deal of what has already been done under the TDIL.
Thank you for sending me the "Annals of Indian Language Computing" and Vishwabharat@tdil (Newsletter)…"
-M.G.K. Menon, Eminent Scientist,
Dr. Vikram Sarabhai Distinguished Prof. of ISRO, Dept. of Space
E-mail : mgkmenon@ren02.nic.in

 "…Here in Bolivia, we are only recently reaching sufficient agreement in an alliance of organisations from civil society, universities, and the private sector, to begin to plan a National Strategy for Communications & Education. A vital element in this, of course, is an indigenous presence, which makes up over 70% of the Bolivian population, and the strategy planning of multi-lingual information processing.
In this light, we are interested to know :
(i) if there is some written documentation you could send us on how you began your enormous venture to transform Indian Information Technology?
(ii) and, as we are still at the stage of preparing the broad outlines of a planning strategy, we wonder if you have any articles on your experience in hindsight, particularly concerning the management of priorities at this earliest stage.
At the same time, we wish to thank you for sending us copies of your journal, Digital Unite and Knowledge for All, which inspire us constantly…"
-Juan de Dios Yapita,
Instituto de Lengua y Cultura Aymara
E-mail : ilca@megalink.com

 "…I appreciate your work on "English to Hindi" translation. I have tried using it for the first time and it seems quite impressive. I cannot write Hindi properly, so this would be a helpful tool (especially to write to my parents). It's done very well and throws in exceptions very nicely. I was trying to use your Hindi editor and sometimes totally got stuck while trying to find something. Also some stuff is hard to delete (i.e. if I made a mistake). A few more things that would make life easier would be to have a picture or a text file explaining/showing the keyboard mapping to the Hindi fonts, and having an ability to get a hardcopy of the typed stuff. Thanks again, Cheers…"
-Ashutosh Jha, Osellus Inc., Toronto
E-mail : ashutosh@osellus.com
3. TDIL Vision

TDIL Vision 2010
(TDIL : Technology Development for Indian Languages)

A B C Technology Development Phases
India was long aware of the technological changes and the local constraints, and development of Language Technology in India may be categorized in three phases:
● 1976-1990 : A-Technology Phase : Focus was on Adaptation Technologies; abstraction of requisite technological designs and competence building in R&D institutions.
● 1991-2000 : B-Technology Phase : Focus was on developing Basic Technologies - generic information processing tools, interface technologies and cross-compatibility conversion utilities. The TDIL (Technology Development for Indian Languages) programme was initiated.
● 2001-2010 : C-Technology Phase : Focus is on developing Creative Technologies in the context of convergence of computing, communication and content technologies. Collaborative technology development is being encouraged.

Vision statement
Digital unite and knowledge for all.

Mission statement
Communicating & moving up the knowledge chain, overcoming the language barrier.

Objectives
● To develop information processing tools to facilitate human machine interaction in Indian languages and to create and access multilingual knowledge resources/content.
● To promote the use of information processing tools for language studies and research.
● To consolidate technologies thus developed for Indian languages and integrate these to develop innovative user products and services.

Major Initiatives
● Knowledge Resources (Parallel Corpora, Multilingual Libraries/Dictionaries, Wordnets, Ontologies)
● Knowledge Tools (Portals, Language Processing Tools, Web based Tools)
● Translation Support Systems (Machine Translation, Multilingual Information Access, Cross Lingual Information Retrieval)
● Human Machine Interface Systems (Optical Character Recognition Systems, Voice Recognition Systems, Text-to-Speech Systems)
● Localization (Adapting IT Tools and solutions in Indian Languages)
● Language Technology Human Resource Development (Manpower Development in Natural Language Processing)
● Standardization (ISCII, Unicode, XML, INSFOC, etc.)

TDIL Programme Goals

Short Term Goals
● Standardization of code, font, keyboard etc.
● Fonts and basic software utilities in public domain.
● Corpora creation and analysis.
● Content creation tools.
● Language Technology to be integrated into IT curricula.
● Collaborative development of Indian language lexical resources.
● Writing aids (spell checkers, grammar checkers and text summarization utilities).
● Sharing of standardized lexware & development of lexware tools.
● Training programs on ILT awareness, lexware development, and computational linguistics.

Medium Term Goals
● Indian language speech databases.
● Multilingual, multimedia content development with semantic indexing; classical, multi-font and decorative fonts; off-line/on-line OCR.
● Cross lingual information retrieval (CLIR) tools.
● Human speech encoding.
● Speech engine : speech recognition, specific speech I/O.
● Indian language support on Internet appliances.
● Understanding and acquisition of languages, knowledge representation, gisting and interfacing.
● Distinguished achievement awards at M.Tech/MCA/Ph.D. level in Indian Language Technologies.
● Machine aided translation : English to Indian languages, among Indian languages, and Indian languages to English and other foreign languages.
● On-line rapid translation, gisting and summarization.

Long Term Goals
● Speech-to-speech translation.
● Human Inspiring Systems.
Language Technology Map
Resource Centres & CoIL-Net Centres for Indian Language Technology Solutions

[Map of India showing the locations of the Resource Centres and CoIL-Net Centres; legend below.]

Resource Centres
MSU (B) = MS University, Baroda (Gujarati)
AU (C) = Anna University, Chennai (Tamil)
UU (B) = Utkal University, Bhubaneswar (Oriya)
ISI (K) = Indian Statistical Institute, Kolkata (Bangla)
JNU (D) = Jawaharlal Nehru University, New Delhi (Sanskrit, Japanese, Chinese)
UOH (H) = University of Hyderabad, Hyderabad (Telugu)
IISc (B) = Indian Institute of Science, Bangalore (Kannada)
IIT (K) = Indian Institute of Technology, Kanpur (Hindi & Nepali)
IIT (M) = Indian Institute of Technology, Mumbai (Marathi & Konkani)
IIT (G) = Indian Institute of Technology, Guwahati (Assamese, Manipuri)
CDAC (T) = Electronic Research & Development Centre of India, Thiruvananthapuram (Malayalam)
CDAC (P) = Centre for Development of Advanced Computing, Pune (Urdu, Sindhi, Kashmiri)
OCAC (B) = Orissa Computer Application Centre, Bhubaneswar (Oriya)
TIET (P) = Thapar Institute of Engineering & Technology, Patiala (Punjabi)

CoIL-Net Centres
CDAC (P) = Centre for Development of Advanced Computing, Pune
BIT (R) = Birla Institute of Technology, Ranchi
IIT (K) = Indian Institute of Technology, Kanpur
IIITM (G) = Indian Institute of Information Technology & Management, Gwalior
BV (B) = Banasthali Vidyapeeth, Banasthali
BHU (V) = Banaras Hindu University, Varanasi
IIT (R) = Indian Institute of Technology, Roorkee
IGNCA (ND) = Indira Gandhi National Centre for the Arts, New Delhi
GGU (B) = Guru Ghasidas University, Bilaspur
4. About TDIL Programme

Technology Development for Indian Languages Programme (TDIL) - An Overview

1. Need of the programme
The world is in the midst of a technological revolution nucleated around Information and Communication Technology (ICT). Advances in Human Language Technology will offer nearly universal access to information and services for more and more people in their own language. Today 80% of the content on the Web is in English, which is spoken by only 8% of the world population and only 5% of the Indian population. In a multilingual country like India, with 18 official languages and 10 scripts, it is essential that information processing and translation software be developed in local languages and be available at low cost for wider proliferation of ICT, to benefit the people at large and thus pave the way towards "Digital Unite and Knowledge for All" and arrest the sprawling Digital Divide.

2. Impediments
● Lack of industry involvement due to constrained demand;
● Sub-critical and un-sustained demand in the States;
● Negligible software tools and re-usable components in the public domain;
● Plurality of internal codes: ISCII-88, ISCII-91, UNICODE, and other proprietary codes;
● Content glyph-coded, not (ISCII) character-coded;
● No font standard;
● Non-availability of human resources in the domain of Language Technology, in both Computational Linguistics (CL) and Knowledge Engineering (KE);
● Reluctance of academia to focus on productising the R&D.

3. TDIL Mission
In this context, a number of initiatives have been taken towards development of software, tools and human-machine interface systems in Indian languages under the TDIL programme.

Vision
Digital unite and knowledge for all.

Mission
Communicating without language barrier & moving up the knowledge chain.

Objectives
● To develop information processing tools to facilitate human machine interaction in Indian languages and to create and access multilingual knowledge resources/content.
● To promote the use of information processing tools for language studies and research.
● To consolidate technologies thus developed for Indian languages and integrate these to develop innovative user products and services.

3.1 Focus Areas
Translation Systems
• Machine aided Translation system (MAT)
• Parallel corpora, lexware, multilingual dictionaries, Wordnet, speech databases
• Speech-to-Speech Translation system
Human Machine Interface systems
• Optical Character Recognition
• Speech Recognition
• Text to Speech system
Language processing and Web tools
• Word processors, spell checkers, converters and portals
Localization
• Adapting IT tools and solutions in Indian languages
Standardization
• UNICODE, XML, ISCII (Indian Script Code for Information Interchange), INSFOC (Indian Standard Font Code), INSROT (Indian Script to Roman Transliteration), standard for the Lexware format, etc.
Evaluation and Benchmarking
• Evaluation and testing of the prototype technologies against benchmarks, product compliance testing against specifications, and validation of standards.

4. Technology Development Phases
Development of Language Technology in India may be categorized in three phases:

1976-1990 : A-Technology Phase
Focus was on Adaptation Technologies; abstraction of requisite technological designs and competence building in R&D institutions.

1991-2000 : B-Technology Phase
Focus was on developing Basic Technologies - generic information processing tools, interface technologies and cross-compatibility conversion utilities. The TDIL (Technology Development for Indian Languages) programme was initiated.

2001-2010 : C-Technology Phase
Focus is on developing Creative Technologies in the context of convergence of computing, communication and content technologies. Collaborative technology development is being encouraged.

5. Program Management Structure
All the above activities are being implemented through various Resource Centres and Localization (CoIL-Net) Centres. The programme is technically monitored by the TDIL Working Group, consisting of experts drawn from academia, industry, R&D organizations and Government. The Resource Centres are divided into regional clusters for peer review of the progress of the technologies and solutions developed by them.

Resource Centres
The Department has thirteen Resource Centres for Indian Language Technology Solutions at various educational institutes and R&D organizations, which have developed several important tools and technologies for Indian language support.

CoIL-Net (Content Creation & IT Localisation Network)
The CoIL-Net programme is formulated encompassing a vision of all-pervasive socio-economic development by proliferating the use of language-specific IT-based content, solutions and applications for bringing the benefits of the IT revolution to the common citizen; requisite technology development; and, consequently, bridging the existing digital divide in the Hindi-speaking states of India. The objective is to provide a boost to IT-localization-based socio-economic development and help bridge the existing digital divide by appreciably improving IT penetration and awareness levels, using Hindi as the medium of delivery, in the Hindi-speaking states of MP, Chattisgarh, UP, Uttaranchal, Bihar, Jharkhand, and Rajasthan.

CoIL-Tech
The MAIT Consortium on Innovation & Language Technology (CoIL-Tech), since its inception in September 2001, has been actively coordinating various activities with the industry and the TDIL (Technology Development for Indian Languages) programme of the Department of Information Technology, Ministry of Communications & Information Technology. The consortium today has active participation from both Indian and MNC companies, with a focus on promoting industry participation in collaborative R&D in language technology, coordinating Open Source Software supporting Indian languages, evolving consensus on standards, benchmarks and certification of LT products, collectively interfacing with government and academia, conducting market surveys, organizing technology shows, promoting technology transfers, and expanding the market collectively.

6. Major Achievements of the TDIL Programme

Optical Character Recognition (OCR)
Optical Character Recognition is an indispensable tool for digitizing content and is essential for the development of knowledge networks such as digital libraries. OCR technology offers the facility to scan and store printed text. There are three essential elements to OCR technology - scanning, recognition and then reading the text. Initially, a camera scans a printed document.
OCR software then converts the images into recognized characters and words. The OCRs developed are being tested and benchmarked independently by STQC.

OCR with more than 97% accuracy has been developed for seven Indian languages, viz. Hindi, Marathi, Bangla, Tamil, Telugu, Punjabi and Malayalam. The OCR technologies for the Assamese, Oriya, Malayalam and Gujarati scripts are in advanced stages of development.

Spell Checkers
A spell checker is the language-specific software component used in word processing, OCR post-processing and text processing applications. Spell checkers for Hindi, Marathi, Bangla, Tamil, Telugu, Punjabi and Malayalam have been developed. The spell checkers for other Indian languages are in advanced stages of development.

Machine Aided Translation System (MAT)
A MAT system provides a rough translation from a source language (say English) into a target language (say Hindi). This output may be post-edited with specially designed tools for enhancing the translation accuracy.

The Anglabharati English to Hindi Translation Support System has been developed by IIT Kanpur (Resource Centre for Indian Languages - Hindi and Nepali) and is available in the public domain at http://anglahindi.iitk.ac.in

A text-to-speech synthesis utility has also been integrated with Anglabharati and works on the Linux platform.

Machine Translation systems from Indian languages to Hindi and from Hindi to English are in the process of development.

Indian Language Support on the Linux Operating System (INDIX)
Localization of the Linux operating system at the X Windows level has been done. The INDIX (localized Linux) operating system has been developed by NCST, Mumbai. It is now possible on Linux to give file names and domain names in Devanagari. Indian language support on Linux in the open domain will also ensure Indian language support on all Open Source tool kits. INDIX supports the GNOME desktop environment (of Linux) to enable applications to create, edit and display content in Indian languages. R&D for supporting other Indian languages has been initiated.

Standardization
Unicode standards are widely used by the industry for the development of multilingual software. Unicode Standard 3.0 includes standard code sets for Indian scripts based on the ISCII-1988 document. Some modifications are necessary in the Unicode Standard for adequate representation of Indian scripts.

The Department of Information Technology, Ministry of Communications & IT, is a voting member of the Unicode Consortium. Proposed changes in the existing Unicode standards have been finalized in consultation with the respective State Governments and the Indian IT industry. These have been published in the TDIL newsletter Vishwabharat@tdil.

The Indian Scripts Font Code (INSFOC), which overcomes the problem of interoperability between software products developed by different language vendors, has been completed for Hindi, Malayalam, Gurmukhi and Gujarati. Work is in progress for other Indian languages.

An "Indian Script to Romanization Table (INSROT)" has been worked out to facilitate non-Hindi users working in Hindi. Efforts are also on to standardise the Indian Lexware format.

Parallel Corpora (Gyan Nidhi)
A parallel corpus is an important input for a MAT system. The aim is to develop a parallel corpus of one million pages; so far 450,000 pages of parallel corpora have been developed (as on May 31, 2003). A graphical user interface for viewing the multiple language corpora has also been developed.
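The Standardization paragraphs above note that Unicode's Indic code charts were derived from the ISCII-1988 layout, which is why legacy ISCII content can be migrated to Unicode with a table-driven converter. The Python sketch below illustrates the idea; the byte values shown are only a small excerpt of the IS 13194 table, and a production converter would implement the complete table plus the special cases (nukta combinations, the INV character, ATR/EXT codes) that do not map one-to-one.

```python
# Minimal sketch of ISCII-to-Unicode conversion for Devanagari text.
# The mapping below is an illustrative excerpt, not the full IS 13194 table.
ISCII_TO_UNICODE = {
    0xA1: "\u0901",  # candrabindu
    0xA2: "\u0902",  # anusvara
    0xA3: "\u0903",  # visarga
    0xA4: "\u0905",  # letter A
    0xA5: "\u0906",  # letter AA
    0xB3: "\u0915",  # letter KA
    0xB4: "\u0916",  # letter KHA
    0xB5: "\u0917",  # letter GA
    0xE8: "\u094D",  # halant (virama)
}

def iscii_to_unicode(data: bytes) -> str:
    """Convert ISCII-coded bytes to a Unicode string; ASCII passes through."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))                              # ASCII range is shared
        else:
            out.append(ISCII_TO_UNICODE.get(b, "\uFFFD"))   # unknown byte -> replacement
    return "".join(out)

print(iscii_to_unicode(bytes([0xA4, 0xB3])))  # -> "अक"
```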
Bi-lingual Dictionaries
Bi-lingual electronic dictionaries are the basic linguistic resource for NLP-related research, development and testing, and are useful in the development of word processors in Indian languages. English-Hindi, English-Telugu, English-Tamil, English-Kannada, English-Bangla, Bangla-Bangla, English-Punjabi, English-Oriya and English-Malayalam bilingual dictionaries have been developed by the Resource Centres.

IT Localisation Clinic
Workshops were organised by DIT during November 2002 to accelerate project deliverables such as content development in Hindi, creation of a repository of language technology products, trainers' training programmes, IT Localisation Clinics, IT curricula for schools, and website development in Hindi, along with installation of test beds in the domains of e-governance (land records, Agri-Net, query systems), e-education (schools, ITIs), e-business (small business), e-tourism and e-health.

TDIL Portal
The TDIL website is bi-lingual (English and Hindi) and contains information about the TDIL Programme, its initiatives and achievements. It provides access to Indian scriptures, standards (Indian scripts, keyboard layouts, font layouts, etc.), articles and reviews. The unique services provided by the website include e-mail in Hindi and online machine translation from English to Hindi and vice versa.

The website also provides downloadable software and tools in Indian languages, viz. plug-ins such as Akshar for Windows (a plug-in for MS Word), Shrilipi Bharti (a plug-in with a keyboard driver having Devanagari fonts), Indian language word processors, NLP tools, and NLP resources for Windows/Linux.

Quarterly on Indian Language Technologies
VishwaBharat@tdil is a quarterly newsletter on Indian Language Technologies which consolidates in one place information about products, tools, services, activities, developments and achievements in the area of Indian language software. It serves as a means of sharing ideas among technology developers and creates awareness in society regarding the availability of language technology resources. It is issued quarterly and is widely circulated.

Eight issues (Jan 2001, May 2001, Sept. 2001, Jan 2002, May 2002, July 2002, October 2002, Jan 2003) have been published. All the issues are accessible through the TDIL website, and all the issues are available on a single CD.

7. Targets for 2003-04
● Integration and deployment of OCR, TTS and document processing technologies.
● Initiation of development of Machine Aided Translation systems between English and Indian languages.
● Linux operating system enabled with Indian language support.
● Enforcing UNICODE revision and Vedic code standardization.
● Initiation of the Intelligent Cognitive System (KUNDALINI) project for bringing synergy between traditional Sanskrit Shastras and Information Technology.
● Cross Lingual Information Retrieval for universal access.
● IT Localization Clinics for incubation of IT technologies.
● 16-bit Unicode-compliant Hindi fonts for the public domain.

8. New Projects Initiated

Intelligent Cognitive System
A new project in the domain of Intelligent Cognitive Systems, named Knowledge UNDerstanding & Acquisition of Languages, INferencing and Interpretation (KUNDALINI), is being initiated with the following objectives:
• To develop methodologies and tools for knowledge representation, extraction, mining, gisting, inferencing and interpretation.
• To develop knowledge frameworks and access mechanisms based on Indian tradition.
• To develop a Sanskrit-based networking language as a machine translation interlingua (Concept-based Networking Language : CNL).

Gyanaudyog (ज्ञानोद्योग)
Encouraged by the success of the Gyanaudyog Workshop held on April 7-9, 2003, a new project, Gyanaudyog, is being initiated to promote Small Office & Home entrepreneurship in the area of Information Technology for catalyzing IT-enabled services (ITES) - specifically content creation, content localization & application software localization, remote customer interaction services, and computer aided design - with support for technology mentoring, financial support guidance and market information.

Anusrijan (अनुसृजन)
There is an addition of over 25 million pages of R&D literature in the field of Science & Technology. It is planned to prepare books/monographs in local languages on emerging areas of Information & Communications Technology (ICT). These will be available for translation into other Indian languages so as to create awareness about recent developments and to promote innovation and entrepreneurial aptitude.

SOHE - Ganak Bharti (सोहे-गणक भारती)
A new project named Ganak Bharti is being initiated to promote development and deployment of low-cost PCs with Indian language support and Open Source Software for small businessmen, women at home and children, aiming at catalyzing the growth of the Small Office, Home and Education (SOHE) environment.

Evaluation and Benchmarking of the Indian Language Technology Tools
Testing, evaluation and benchmarking of Indian Language Technology tools is essential for wider acceptance of Indian language technology products. In view of this, the Standardization Testing and Quality Certification (STQC) division of DIT has been designated as the third party for evaluation of the language technology tools and products developed under the TDIL programme as per international standards.

9. Digital Library Initiative of India
Digital libraries are a form of Information Technology in which social impact matters as much as technological advancement. Future knowledge networks will rely on scalable semantics, on automatically indexing community collections so that users can effectively search within the Interspace of a billion repositories. Just as the transmission networks of the Internet are connected via switching machines that switch packets, the knowledge networks of the Interspace will be connected via switching machines that switch concepts. Connectivity and training continue to be the principal barriers to integrating the global network of libraries.

Among a number of major DL initiatives in the USA is the Universal Digital Library (UDL) project, which aims at creating a free-to-read, searchable collection of one million books, primarily in the English language, available to everyone over the Internet. This project involves participation from many countries, including USA, Australia, India, China, Egypt and Sri Lanka. The funding agency is the National Science Foundation (NSF), USA, with a funding of US $4.0 million; the contribution from various industries of the USA is US $6.0 million. The overall coordination of the project is done by Carnegie Mellon University (CMU), USA.

India participates in the UDL project and makes efforts to put e-content of the vast Indian knowledge base in Indian languages as far as possible. The Indian activity of the Digital Library is coordinated by the Indian Institute of Science, Bangalore.
9.1 Objectives
● To digitize and index the heritage knowledge.
● To provide a test bed that will support other research domains such as scanning techniques, optical character recognition, etc.
● Inter-ministerial collaboration for e-Content.
● To promote life-long learning in society (a necessity of the knowledge-based society).
● To promote collaborative creativity and the building up of knowledge teams across borders.
● Involvement of the Resource Centres for Indian Language Technology Solutions and the CoILNet Centres to digitize and web-enable content in Indian languages.

9.2 Participating Institutions of the UDL program
● Indian Institute of Science, Bangalore (overall coordination)
● Anna University, Chennai, Tamil Nadu
● Arulmigu Kalasalingam College of Engineering, Madurai, Tamil Nadu
● Goa University, Goa
● Indian Institute of Information Technology, Allahabad, Uttar Pradesh
● City and State Central Library, Andhra Pradesh
● Indian Institute of Information Technology, Hyderabad, Andhra Pradesh
● Shanmugha Arts, Science, Technology & Research Academy, Thanjavur, Tamil Nadu
● Sri Sri Sharda Peetam, Sringeri Mutt, Sringeri, Karnataka
● Tirumala Tirupati Devasthanam, Tirupati, Andhra Pradesh
● Maharashtra Industrial Development Corporation, Mumbai, Maharashtra
● University of Pune
● Centre for Development of Advanced Computing, Noida
● Kanchi University, Kanchi, Tamil Nadu
● Indian Institute of Astrophysics, Karnataka
● Indira Gandhi National Centre for the Arts, New Delhi
● Rashtrapati Bhavan, New Delhi
● Punjab Technical University, Punjab

9.3 Targets for 2003-04
● Development and deployment of an on-line multilingual Machine Translation system, OCRs in Indian languages and the Universal Dictionary Programme through the TDIL programme.
● Establishment of Regional Mega Centres at institutions with proven capability of scanning at least 5000 pages per day. The Mega Centres may be created with the State Governments' initiative for meeting operational costs.
● Upgradation of the connectivity of the Mega Scanning Centres to 2 Mb/s.
● Creation of a "Digital Library Act" - provision for tax deduction etc. for the purpose of providing a digital version on the web for public benefit.
● Creation of 4C (Consortium for Compensation for Creative Content).
● Evolution of a project management mechanism and productivity measurement strategy for the Digital Library Centres, and criteria for setting up Mega Scanning Centres.
● Setting up of an inter-ministerial committee to integrate Digital Library efforts in India.
● Evolution of a National Mission on Digital Library in consultation with Prof. M.G.K. Menon.
● Participation with a lead role in :
  - WSIS (World Summit on the Information Society), organized by ITU at Geneva in Dec. 2003
  - UN Decade of Literacy (2003-2013)
  - TWA (Third World Academy)
● Promotion of spin-off technologies from the Digital Library programme.

- TDIL Programme Team
Resource Centre for Indian Language Technology Solutions - Hindi, Nepali
Indian Institute of Technology Kanpur : Achievements

Department of Computer Science & Engineering
Indian Institute of Technology, Kanpur - 208016, India
Tel. : 00-91-512-2590762  E-mail : sgd@iitk.ac.in
Website : http://www.gitasupersite.iitk.ac.in
http://www.cse.iitk.ac.in/users/langtech
RCILTS - Hindi, Nepali
Indian Institute of Technology, Kanpur

Introduction
In 1995, the Department of Electronics, Govt. of India, sanctioned a grant-in-aid for implementation of the project titled "Machine Aided Translation from English to Hindi for standard documents (domain of Public Health Campaign) based on the ANGLABHARTI approach", with which ER&DCI (with its office at Lucknow, now moved to NOIDA) was associated for implementation and commercialization of this software on a PC platform in the domain of public health campaign. The ANGLABHARTI software already developed by IITK on a SUN system was used in this project and was re-engineered on PC under Linux jointly by IITK and ER&DCI under the supervision of IITK (R.M.K. Sinha, Ajai Jain). In 1996, IITK also designed and developed an example-based approach to Machine Aided Translation for similar (Indian languages) and dissimilar (English and Indian languages) language pairs under the leadership of Professor R.M.K. Sinha. This approach has been named the ANUBHARTI approach. A system to translate from Hindi to English has been implemented based on the ANUBHARTI approach by IITK (R.M.K. Sinha, Ajai Jain and Renu Jain).

Currently, AnglaHindi, the English to Hindi MAT based on the Anglabharti methodology, which accepts unconstrained text, has already been made available to users and is very well received. AnglaUrdu, which is based on AnglaHindi, has also been demonstrated. HindiAngla, the Hindi to English MAT based on the Anubharti methodology, has been demonstrated for simple sentences, and further work is going on to handle compound and complex sentences. The current research at IITK is focused on the development of more efficient machine translation strategies with user-friendly interfaces for these systems. Another dimension of diversification for the future is to cater to all other Indian languages by implementing AnglaSanskrit, AnglaBangala, AnglaPunjabi, and so on; SanskritAngla, BangalaAngla, PunjabiAngla, and so on; and HindiSanskrit, HindiBangala, and so on, based on hybridization of the Anglabharti and Anubharti methodologies.

1. Machine Translation
Chief Investigator : Dr. R.M.K. Sinha
Co-Investigator : Dr. A. Jain

Our work on machine translation started in the early eighties, when we proposed using Sanskrit as an interlingua for translation to and from Indian languages (see the paper on "Computer processing of Indian languages and scripts - Potentialities and Problems", Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984). This was further elaborated in a CPAL-1 paper presented at Bangkok in 1989.

Later, in 1991, the concept of a Pseudo-Interlingua was developed, which exploited the structural commonality of a group of languages. This concept has been used in the development of the machine-aided translation methodology named ANGLABHARTI for translation from English to Indian languages. Anglabharti is a pattern-directed rule-based system with a context-free-grammar-like structure for English (the source language). It generates a 'pseudo-target' (Pseudo-Interlingua) applicable to a group of Indian languages (target languages), such as the Indo-Aryan family (Hindi, Bangla, Asamiya, Punjabi, Marathi, Oriya, Gujarati etc.), the Dravidian family (Tamil, Telugu, Kannada & Malayalam) and others. A set of rules obtained through corpus analysis is used to identify plausible constituents with respect to which movement rules for the 'pseudo-target' are constructed.
Within each group, the languages exhibit a high degree of structural homogeneity, and we exploit this similarity to a great extent in our system. A language-specific text generator converts the 'pseudo-target' code into target language text. The Paninian framework based on Sanskrit grammar, using Karak (similar to case) relationships, provides a uniform way of designing the Indian language text generators. We also use an example-base to identify noun and verb phrasals and resolve their semantics. An attempt is made to resolve most of the ambiguities using ontology, syntactic & semantic tags and some pragmatic rules; the unresolved ambiguities are left for human post-editing. The major design considerations of Anglabharti have been: providing a practical aid for translation wherein an attempt is made to get 90% of the task done by the machine, with 10% left to human post-editing; a system which could grow incrementally to handle more complex situations; a uniform mechanism by which translation from English to a majority of Indian languages is obtained by attachment of appropriate text generator modules; and a human-engineered man-machine interface to facilitate both its usage and augmentation. The translation system has also been interfaced with a text-to-speech module and OCR input.

This project also received funding from the TDIL programme of Govt. of India during 1995-97 and 2000-2002.

The English to Hindi version of the Anglabharti machine aided translation system, named AnglaHindi, has been web-enabled and is available at URL : http://anglahindi.iitk.ac.in. The technical know-how of this technology has been transferred on a non-exclusive basis to ER&DCI/CDAC Noida for commercialization.

A system for translating English to Urdu, named AnglaUrdu, has also been developed using our AnglaHindi system and the Urdu display software of CDAC, Pune.

In 1995, we developed another approach for MT which was example-based. Here the pre-stored example-base forms the basis for translation: the translation is obtained by matching the input sentence with the minimum-'distance' example sentence. In our approach, we do not store the examples in raw form; the examples are abstracted to contain category/class information to a great extent. This makes the example-base smaller in size, and further partitioning reduces the search space. The creation and growth of the example-base is also done in an interactive way. This methodology, named ANUBHARTI, has been used for Hindi to English translation, and further details of this approach can be seen in the Ph.D. thesis of Renu Jain.

The Anubharti approach works more efficiently for similar languages, such as among Indian languages. In such cases the word order remains the same and one need not have pointers to establish correspondences.

Currently, we are working towards developing an integrated machine-aided translation system (with funding from the TDIL programme of Govt. of India, 2003 onwards), hybridizing the rule-based approach of Anglabharti, the example-based approach of Anubharti, and corpus/statistical approaches to get the best out of these approaches. This is also being explored for use as the translation engine of a speech-to-speech translation system.

In parallel, we are also developing a MAT system for Hindi to English translation, HindiAngla, based on our Anubharti approach, with funding from the COILNET project of Govt. of India (2001 onwards). AnglaHindi and HindiAngla have been used to demonstrate two-way reverse translation for simple sentences.
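To make the example-based idea above concrete, here is a minimal Python sketch of matching an input sentence against an abstracted example-base by minimum edit distance. The lexicon, example patterns and templates are invented for illustration; they are not the actual ANUBHARTI data or code.

```python
# Sketch of example-based matching: abstract the input with category tags,
# retrieve the nearest example pattern, and fill its target template.
from dataclasses import dataclass

# Hypothetical toy lexicon: surface word -> (category tag, target word)
LEXICON = {
    "ram":    ("PERSON", "राम"),
    "sita":   ("PERSON", "सीता"),
    "school": ("PLACE",  "विद्यालय"),
    "market": ("PLACE",  "बाज़ार"),
}

@dataclass
class Example:
    pattern: list      # abstracted source, e.g. ["PERSON", "goes", "to", "PLACE"]
    template: str      # target template with category slots

EXAMPLES = [
    Example(["PERSON", "goes", "to", "PLACE"], "{PERSON} {PLACE} जाता है"),
    Example(["PERSON", "is", "at", "PLACE"],   "{PERSON} {PLACE} पर है"),
]

def abstract(tokens):
    """Replace known words with category tags, remembering slot bindings."""
    tags, slots = [], {}
    for w in tokens:
        if w in LEXICON:
            cat, target = LEXICON[w]
            tags.append(cat)
            slots[cat] = target
        else:
            tags.append(w)
    return tags, slots

def edit_distance(a, b):
    """Plain token-level Levenshtein distance."""
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[len(a)][len(b)]

def translate(sentence):
    tags, slots = abstract(sentence.lower().split())
    best = min(EXAMPLES, key=lambda ex: edit_distance(tags, ex.pattern))
    return best.template.format(**slots)

print(translate("Ram goes to school"))   # -> राम विद्यालय जाता है
```

Because the examples are stored in abstracted form, one pattern covers every sentence that differs only in its lexical fillers, which is what keeps the example-base small, as described above.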
2. Speech to Speech Translation
Speech to speech (S2S) translation requires a tight coupling of the automatic speech recognition (ASR) module, the MT module, and the target language text-to-speech (TTS) module. A mere interfacing of ASR, MT and TTS modules does not yield an acceptable S2S translation; S2S requires an integration of these modules such that the hypotheses are cross-verified and appropriate parameters get generated. In our environment, it has to cater to bi-lingual (Hindi mixed with English) speech with commonly encountered Indian accent variations. The MT also needs to be a chunk translator with multiple translation engines. Our investigations are directed at domain-specific applications in the Indian environment.

Some Relevant Publications
• R.M.K. Sinha, 'Towards Speech to Speech Translation', keynote presentation at Symposium on Translation Support Systems (STRANS2002), March 15-17, 2002, Kanpur, India.

3. Lexical Knowledge-Base Development
The lexical knowledge base is the fuel of the translation engine. It contains various details for each word in the source language, such as its syntactic categories, possible senses, keys to disambiguate the senses, corresponding words in target languages, and ontology and wordnet information/linkages. We are also working towards development of an Indian language wordnet named ShabdKalpTaru, in association with Dr. Om Vikas and Dr. Pushpak Bhattacharyya.

Some Relevant Publications
• Renu Jain and R.M.K. Sinha, 'On Multi-lingual Dictionary Design', Symposium on Machine Aids for Translation and Communication (SMATAC96), New Delhi, 1996.
• R.M.K. Sinha, K. Sivaraman, Aditi Agrawal, T. Suresh and C. Sanyal, 'On logical design of multi-lingual lexicon for machine translation', Technical Report TRCS-93-174, Department of Computer Science and Engineering, IIT Kanpur, 1993.

4. Optical Character Recognition
Work on Devanagari OCR was carried out under a TDIL, Govt. of India, sponsored project named DEVDRISHTI, on recognition of handprinted Devanagari script. The investigations were directed at developing new features and at integrating decision making that takes into account large variations in shape. Further, an automated strategy for training - for construction of prototypes and confusion matrices from true ISCII files - was developed. This had to be very distinct from its Roman counterpart due to the script composition involved in the case of Devanagari script. This work was further expanded by incorporating a blackboard model for knowledge integration in the Ph.D. thesis of Veena Bansal, titled "Integrating Knowledge Sources in Devanagari Text Recognition".

Some work has also been carried out on on-line character recognition for Roman script using handwriting modeling. Investigations on on-line isolated Devanagari characters have also been carried out, and further investigations are in progress on the subject.
(Editorial Comment : This technology was declined for test & evaluation by STQC)
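As a concrete illustration of the kind of record the lexical knowledge base of section 3 holds, here is a minimal Python sketch. The field names, ontology labels and disambiguation rule are illustrative assumptions, not the actual ShabdKalpTaru schema.

```python
# Sketch of a translation-lexicon entry: categories, senses, disambiguation
# keys, ontology links and per-language target equivalents.
from dataclasses import dataclass, field

@dataclass
class Sense:
    gloss: str                  # short description of the sense
    ontology: str               # ontology node (illustrative label)
    keys: list                  # context words that select this sense
    targets: dict               # language code -> target word

@dataclass
class LexEntry:
    word: str
    categories: list            # e.g. ["noun"]
    senses: list = field(default_factory=list)

    def disambiguate(self, context):
        """Pick the sense whose keys overlap most with the context words."""
        return max(self.senses, key=lambda s: len(set(context) & set(s.keys)))

bank = LexEntry("bank", ["noun"], [
    Sense("financial institution", "object/institution",
          ["money", "account", "loan"], {"hi": "बैंक"}),
    Sense("edge of a river", "object/landform",
          ["river", "water", "shore"], {"hi": "किनारा"}),
])

sense = bank.disambiguate({"river", "boat"})
print(sense.targets["hi"])      # -> किनारा
```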
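Section 4 mentions constructing confusion matrices from true ISCII files during OCR training. The short sketch below shows the general bookkeeping involved; it is a generic illustration, since the DEVDRISHTI internals are not published here.

```python
# Build a confusion matrix from aligned (ground truth, recognized) character
# pairs; each row shows how often a true character was read as each output.
from collections import Counter

def confusion_matrix(pairs):
    """pairs: iterable of (truth, predicted) aligned character pairs."""
    counts = Counter(pairs)
    matrix = {}
    for (truth, pred), n in counts.items():
        matrix.setdefault(truth, {})[pred] = n
    return matrix

# Aligned recognizer output against true (ISCII-decoded) text:
aligned = [("क", "क"), ("ख", "र"), ("क", "फ"), ("क", "क")]
cm = confusion_matrix(aligned)
print(cm["क"])   # {'क': 2, 'फ': 1} -> क was misread as फ once
```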
5. Transliteration 6. Spell-Checker Design
Transliteration among Indian scripts is easily For Indian scripts, there is a very loose concept
achieved using ISCII (Indian Script Code for of a spelling. Writing in Indian scripts is a direct
Information Interchange). ISCII has been designed mapping of the inherent phonetics and you write
using the phonetic property of Indian scripts and as you speak. There are geographical variations
caters to the superset of all Indian scripts. By in the spoken form and so the spellings vary. Our
attaching an appropriate script rendering approach to design of a spell checker is to develop
mechanism to ISCII, transliteration from one an user error model for each class of user where
Indian script to another is achieved in a natural way. the source of error may the due to incorrect
phonetics, inaccurate inputting or other
influences. The spell-checker uses this error-model
in making suggestions for the error.
6. Spell-Checker Design

For Indian scripts, there is a very loose concept of a spelling. Writing in Indian scripts is a direct mapping of the inherent phonetics: you write as you speak. There are geographical variations in the spoken form, and so the spellings vary. Our approach to the design of a spell checker is to develop a user error model for each class of user, where the source of error may be due to incorrect phonetics, inaccurate inputting or other influences. The spell-checker uses this error model in making suggestions for the error.
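A minimal sketch of how such a user error model might drive suggestion generation follows; the confusion sets shown (short/long vowel matras, a typical phonetic confusion) and the lexicon lookup are illustrative assumptions rather than the Centre's actual model.

import java.util.*;

public class ErrorModelSpellChecker {
    // Hypothetical confusion sets for one user class, e.g. i/ii and u/uu matras.
    static final Map<Character, char[]> CONFUSIONS = Map.of(
        '\u093F', new char[]{'\u0940'},   // i-matra <-> ii-matra
        '\u0940', new char[]{'\u093F'},
        '\u0941', new char[]{'\u0942'},   // u-matra <-> uu-matra
        '\u0942', new char[]{'\u0941'});

    // Substitute confusable characters one position at a time and keep only
    // those candidates that actually occur in the lexicon.
    static List<String> suggest(String word, Set<String> lexicon) {
        List<String> suggestions = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            char[] alts = CONFUSIONS.get(word.charAt(i));
            if (alts == null) continue;
            for (char alt : alts) {
                String cand = word.substring(0, i) + alt + word.substring(i + 1);
                if (lexicon.contains(cand)) suggestions.add(cand);
            }
        }
        return suggestions;
    }
}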
7. Knowledge Resources

Introduction

The seeds of this work were planted in the pre-Internet days with a project undertaken by Dr. T.V. Prabhakar, Indian Institute of Technology (IIT) Kanpur, funded by the Chinmaya International Foundation (1989). A DOS version of Swami Chinmayananda's book, The Holy Geeta, was hyperised and published as Geeta Vaatika (1992), perhaps the first electronic book in India. After the emergence of Internet standards, Geeta Vaatika was redone in HTML (1996).

As the World Wide Web grew, the Government of India (Department of Electronics) funded a project that continued this work. Work began on the Gita Supersite, which included multiple commentaries and translations of the Bhagavadgita. A website was designed and built, with the programming (business logic) initially all on the client side.

Resource Centres for Indian Language Technology Solutions were established throughout the country. Under one such Resource Centre established at IIT Kanpur, work on the Gita Supersite has continued. The technology was extensively reworked and the content was converted into a database, with all the business logic on the server side. Currently work is going on to convert the data into a font-independent ISCII database, streamline the programs, improve the audio content, add many more commentaries on the Bhagavadgita and provide additional features on the site.
Meanwhile, the idea of building Heritage Websites related to Indian philosophical texts emerged. A series of websites were planned, including the Upanishads (to include 12 major Upanishads with Sankara's commentaries and translations in English & Hindi), the Brahma Sutra, the Complete Works of Sankara, the Ramcharitmanas and the Yoga Sutra.

The experience of building websites in Indian languages was shared with others, and a bi-lingual site was designed and built for the Uttar Pradesh Trade Tax Corporation, Government of India. A site on the life and works of the contemporary sage, Paramhans Rammangaldasji, was also built. Moving in another direction, an all-Hindi site on disease information and health, Bimari-Jankari, was created.

The following sites were developed under the TDIL scheme:

Gitasupersite : http://www.gitasupersite.iitk.ac.in
Brahmasutra : http://www.bramsutra.iitk.ac.in
Yogasutra : http://www.yogasutra.iitk.ac.in
Complete Works of Adi Sankara : http://www.sankara.iitk.ac.in
Ramcharitmanas : http://www.ramcharitmanas.iitk.ac.in
Upanishads : http://www.upanishads.iitk.ac.in
Minor Gitas : http://www.gitasupersite.iitk.ac.in/minigita/index.html
Kavi Sammelan : http://www.kavya.iitk.ac.in
Munshi Premchand : http://www.munsipremchand.iitk.ac.in
Bimari-Jankari : http://www.bimari-jankari.org
U P Trade Tax : http://upgov.up.nic.in/tradetax
Paramhans Ram Mangal Das Ji : http://www.rammangaldasji.org

Short write-ups on some of the above sites are given below:

7.1 Gitasupersite

On the Gita Supersite, one can view the entire Bhagavadgita in its original language (Sanskrit) in any of ten Indian language scripts (Assamese, Bengali, Devanagari, Gujarati, Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu) or in Roman transliteration. The Supersite also contains Classical and Contemporary Commentaries on the Bhagavadgita, with translations in Hindi and English.

The Gita Supersite has been designed to open Multiple Windows, so that one can view multiple translations and/or commentaries on the Bhagavadgita simultaneously. A Two-Book option for comparative study is also available. The Search facility on this Supersite enables a search for the occurrence of any word in the original text of the Bhagavadgita.

The Gita Supersite is available for Windows, Unix/Linux and Mac platforms, with web browsers that support frames, JavaScript and Java (such as Netscape Navigator 4.0 / Internet Explorer 4.0 or higher versions). Users will not need to download fonts because Dynamic Fonts have been used on this website.

The audio of the chanting of the Bhagavadgita shlokas by Swami Brahmananda of Chinmaya Mission Bangalore is also available on this website.

The texts included in the Gitasupersite are:

Mool Slokas [Sanskrit Verses] of the Bhagavadgita in all major Indian Language Scripts
Hindi translation - Swami Ramsukhdas
Hindi translation - Swami Tejomayananda
English translation - Swami Gambhirananda
English translation - Dr. S Sankaranarayan
English translation - Swami Sivananda
Sanskrit Commentary - Sri Abhinavagupta
English translation of Sri Abhinavagupta's Sanskrit Commentary - Dr. S Sankaranarayan
Sanskrit Commentary - Sri Ramanuja
English translation of Sri Ramanuja's Sanskrit Commentary - Swami Adidevananda
Sanskrit Commentary - Sri Sankaracharya
Hindi translation of Sri Sankaracharya's Sanskrit Commentary - Sri Harikrishandas Goenka
English translation of Sri Sankaracharya's Sanskrit Commentary - Swami Gambhirananda
Hindi Commentary - Swami Chinmayananda
Hindi Commentary - Swami Ramsukhdas
English Commentary - Swami Sivananda
Sanskrit Commentary - Sri Anandgiri
Sanskrit Commentary - Sri Jayatirtha
Sanskrit Commentary - Sri Madhvacharya
Sanskrit Commentary - Sri Vallabhacharya
English translation - Swami Adidevananda
Sanskrit Commentary - Sri Madhusudan Saraswati
Sanskrit Commentary - Sridhra Swami
Sanskrit Commentary - Sri Vedantadeshikacharya Venkatanatha
Sanskrit Commentary - Sri Purushottamji
Sanskrit Commentary - Sri Neelkanth
Sanskrit Commentary - Sri Dhanpati

A sample page from the site is shown below.

7.2 Brahma Sutra

The Brahma Sutra of Badrayana is one of the Prasthana Trayi, the three authoritative primary sources of Vedanta Philosophy. No study of Vedanta is considered complete without a close examination of the Brahma Sutra.

It is in this text that the teachings of Vedanta are set forth in a systematic and logical order. The Brahma Sutra consists of 555 aphorisms or sutras, in 4 chapters, each chapter being divided into 4 sections. The first chapter (Samanvaya: harmony) explains that all the Vedantic texts talk of Brahman, the ultimate reality, that is the goal of life. The second chapter (Avirodha: non-conflict) discusses and refutes the possible objections against Vedanta philosophy. The third chapter (Sadhana: the means) describes the process by which ultimate emancipation can be achieved. The fourth chapter (Phala: the fruit) talks of the state that is achieved in final emancipation.

Indian tradition identifies Badrayana, the author of the Brahma Sutra, with Vyasa, the compiler of the Vedas. Many commentaries have been written on this text, the most authoritative being the one by Adi Sankara, which is considered to be an exemplary model of how a commentary should be written.

7.3 Complete Works of Adi Sankara

Adi Sankara, the 9th century philosophical giant of India, was both an intellectual genius and a prolific writer. In his brief life-span of 32 years, he composed over 30 original works on Vedanta, wrote authoritative commentaries on 11 Upanishads, the Brahma Sutra, the Bhagavadgita and other major texts, and also created inspiring devotional hymns to various gods and goddesses.
This website is perhaps the first online repository of the Complete Works of Adi Sankaracharya. The texts can be read in the original Sanskrit in any one of 11 Indian language scripts. The texts can also be downloaded for printing, making this vast, invaluable resource easily accessible to users all over the world.

7.4 Ramcharitmanas

The Ramcharitmanas, the 16th century masterpiece written by Goswami Tulsidas, is the story of Lord Rama. The text is an unparalleled combination of devotion and pure non-dualistic philosophy. This website is an attempt to use contemporary technology to facilitate and enhance the study of this ancient scripture. Some of the features available on this website are:

Read the book : Read Ramcharitmanas with a unique, user-friendly interface. Navigation through the book can be linear, using the 'next' and 'previous' buttons. Or, you can use the Navigation Bar to go directly to the verse — doha (or sortha), chaupai, sloka or chhanda — of your choice.

2-Book View : Open two copies of Ramcharitmanas simultaneously, for a comparative study of different kaandas of the text.

Word Search : Alphabetic search for the occurrence of any word in Ramcharitmanas.

Verse Search : Search for verses in Ramcharitmanas, using the first few words of the verse.

Power Browse : This option is for power users of Ramcharitmanas who wish to get an overview or quickly browse through the dohas and chaupais of any kaanda of the text.

Download : Get printer-friendly chapters of Ramcharitmanas.

Tulsidas : Read about Tulsidas, the author of Ramcharitmanas.

Related Links : Annotated links to related sites.

7.5 Upanishads

The Upanishad site consists of all the major Upanishads given below:

Isavasya, Kena, Katha, Prasna, Mundaka, Mandukya, Mandukya's Karika, Taittiriya, Aitereya, Svetashvatra, Brihadaranyaka, Chandogya

For each Upanishad we have several commentaries and translations. For a detailed list of available translations and commentaries for each Upanishad please see the appendix.
7.6 Kavi Sammelan

This is the first Virtual Kavi Sammelan on the Web. On this site, you can "create" a Hindi kavi sammelan, choosing from a database of around 100 poems. Selections for your kavi sammelan can be made based on the mood [rasa] or metre [vidha] of the poems and the poet(s) whose poems you wish to include. Video and audio recordings of the poems are available, so you can 'see' and 'hear' the poets recite their own poems.

Poems in five moods and five metres are available here. The five moods are: vira rasa, hasya/vyanga, shringar rasa, shant rasa, vividha rasa. The five metres are: geet, ghazal, doha, chhanda mukta, muktak. 13 poets have been featured on this website. These poets are: Gopaldas Neeraj, Govind Vyas, Dharmpal Awasthi, Madhup Pandey, Buddhinath Mishra, Urmilesh, Kailash Gautam, Shiv Om Ambar, Surya Kumar Pandey, Surendra Dube, Vineet Chauhan, Kamal Musaddi, Suresh Awasthi.

Audio recordings of the poets talking about what poetry means to them are also available. A detailed biography and a photo-gallery of each poet have been put up. Other features include book-reviews, interviews and articles.

An introductory article on the History of Hindi Kavi Sammelans has been specially written by Dr. Upendra for this website. Listen to Dr. Upendra summarise the development of kavi sammelans over the years.

7.7 Bimari-Jankari

Bimari-Jankari is a medical website created for the benefit of Hindi-speaking people. Special efforts have been made to simplify both the language and the concepts that are explained on the site. The major idea behind this website is to supplement a doctor's function. By reading about medical conditions and diseases, a patient (or the patient's family and friends) would understand their own situation, and therefore be in a better position to cooperate with the doctor's advice and prescription.

Navigation, or moving within the website, has been simplified to make it easy to search for information on any disease. Under each disease, information has been provided in an easy-to-understand, question-answer format. To enable the user to understand the disease process better, the related functional anatomy and physiology have been briefly explained.

Images/illustrations have been profusely used throughout the site. Most of these images have been prepared specially for this website. Some of the images have been obtained, with permission, from WHO/TDR and CDC.

7.8 Paramhans Ram Mangal Das Ji

Sri Paramhansa Ram Mangal Das ji (1893-1984) was an extraordinary sage of our times. He was blessed with a divine vision by virtue of which he could communicate with saints from the past. Over 2000 saints and gods belonging to all religions of the world visited him and gave him their messages, which he transcribed. These transcribed messages (over 3500 in number) are in the Avadhi dialect and run into four volumes called the Divya Granths.

This website contains these Divya Granths, together with other works of Sri Ram Mangal Das ji.
The messages have been uniquely indexed and can be read chronologically or alphabetically. A topic index is also to be included very soon. A photo gallery, as well as some audio and video recordings of Sri Ram Mangal Das ji, are in the pipeline. A simple Avadhi-Hindi dictionary is also under preparation.

Creation of this website has been an exercise in making a format using which the works of the innumerable great sages of our times can be preserved as part of our rightful heritage.

7.9 Nepali Texts

The Nepali site has the following original texts:

Basai : Leelabhadhur Shatri
Bhanubhaktko Ramayana : Bhanubhakt Acharya
Kunjani : Laxmi Prasad Devkota
Langda ko Sathi : Lain Singh Vadadail
Maaitghar : Lain Singh Vadadail
Moona Madan : Laxmi Prasad Devkota
Ritu Vichar : Lekhnath Paudwal
Tarun Tapsi : Lekhnath Paudwal
Vinod Pushpanjali : Rajeshwar Devkota
Abstract Chintan–Pyaaz : Shankar Lamichhane

8. Technical Issues

Some of the major features of these sites are:

• The server side is written in PHP.
• The content is stored in MySQL in ISCII (as against ISFOC).
• On-the-fly transliteration into ISFOC for any Indian language.
• Search in all Indian languages.
• On-the-fly PDF generation in all Indian languages.
• Chanting of the Gita Shlokas.
• Continuous play of the Gita Shlokas.

Architectural Diagram of the Gitasupersite and its sister sites
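The render path implied by these features, a single ISCII master copy converted on the fly for display, can be sketched as follows. The sites' server side is actually written in PHP; the Java sketch below, with its hypothetical glyph table, only illustrates the flow.

import java.util.HashMap;
import java.util.Map;

public class OnTheFlyRenderer {
    // Hypothetical stand-in for a per-script ISCII-to-ISFOC mapping table.
    static final Map<Integer, String> GLYPH_TABLE = new HashMap<>();

    // Convert an ISCII verse, fetched from the MySQL store, into ISFOC
    // glyph text for the requested script at page-render time.
    static String toIsfoc(byte[] isciiVerse) {
        StringBuilder page = new StringBuilder();
        for (byte b : isciiVerse) {
            int code = b & 0xFF;                        // ISCII codes sit at 0xA1-0xFA
            String glyphs = GLYPH_TABLE.get(code);
            page.append(glyphs != null ? glyphs : String.valueOf((char) code)); // ASCII passes through
        }
        return page.toString();
    }
}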
9. Appendix - Details of Texts/Commentaries for each of the Upanishads

#ISAVASYA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Sivananda
English Commentary - Swami Sivananda
English Translation - Swami Gambhirananda
Hindi Translation - Gita Press Gorakhpur
Hindi Commentary - Harikrishandas Goenka
Sanskrit Commentary - Sri Shankaracharya
Hindi Translation of Sri Shankaracharya's Sanskrit Commentary - Gita Press Gorakhpur
English Translation of Sri Shankaracharya's Sanskrit Commentary - Swami Gambhirananda

#KENA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Sivananda
English Commentary - Swami Sivananda
English Translation - Swami Gambhirananda
Hindi Translation - Gita Press Gorakhpur
Hindi Commentary - Harikrishandas Goenka
Sanskrit Commentary - Sri Shankaracharya
Hindi Translation of Sri Shankaracharya's Sanskrit Commentary - Gita Press Gorakhpur
English Translation of Sri Shankaracharya's Sanskrit Commentary - Swami Gambhirananda

#KATHA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Sivananda
English Commentary - Swami Sivananda
English Translation - Swami Gambhirananda
Hindi Translation - Gita Press Gorakhpur
Hindi Commentary - Harikrishandas Goenka
Sanskrit Commentary - Sri Shankaracharya
Hindi Translation of Sri Shankaracharya's Sanskrit Commentary - Gita Press Gorakhpur
English Translation of Sri Shankaracharya's Sanskrit Commentary - Swami Gambhirananda

#PRASNA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Sivananda
English Commentary - Swami Sivananda
English Translation - Swami Gambhirananda
Hindi Translation - Gita Press Gorakhpur
Hindi Commentary - Harikrishandas Goenka
Sanskrit Commentary - Sri Shankaracharya
Hindi Translation of Sri Shankaracharya's Sanskrit Commentary - Gita Press Gorakhpur
English Translation of Sri Shankaracharya's Sanskrit Commentary - Swami Gambhirananda

#MUNDAKA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Sivananda
English Commentary - Swami Sivananda
English Translation - Swami Gambhirananda
Hindi Translation - Gita Press Gorakhpur
Hindi Commentary - Harikrishandas Goenka
Sanskrit Commentary - Sri Shankaracharya
Hindi Translation of Sri Shankaracharya's Sanskrit Commentary - Gita Press Gorakhpur
English Translation of Sri Shankaracharya's Sanskrit Commentary - Swami Gambhirananda

#MANDUKYA'S KARIKA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Sivananda
English Commentary - Swami Sivananda
English Translation - Swami Gambhirananda
Hindi Translation - Gita Press Gorakhpur
Hindi Commentary - Harikrishandas Goenka
Sanskrit Commentary - Sri Shankaracharya
Hindi Translation of Sri Shankaracharya's Sanskrit Commentary - Gita Press Gorakhpur
English Translation of Sri Shankaracharya's Sanskrit Commentary - Swami Gambhirananda

#TAITTIRIYA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Sivananda
English Commentary - Swami Sivananda
English Translation - Swami Gambhirananda
Hindi Translation - Gita Press Gorakhpur
Hindi Commentary - Harikrishandas Goenka
Sanskrit Commentary - Sri Shankaracharya
Hindi Translation of Sri Shankaracharya's Sanskrit Commentary - Gita Press Gorakhpur
English Translation of Sri Shankaracharya's Sanskrit Commentary - Swami Gambhirananda

#AITEREYA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Sivananda
English Commentary - Swami Sivananda
English Translation - Swami Gambhirananda
Hindi Translation - Gita Press Gorakhpur
Hindi Commentary - Harikrishandas Goenka
Sanskrit Commentary - Sri Shankaracharya
Hindi Translation of Sri Shankaracharya's Sanskrit Commentary - Gita Press Gorakhpur
English Translation of Sri Shankaracharya's Sanskrit Commentary - Swami Gambhirananda

#SVETASHVATRA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Sivananda
English Commentary - Swami Sivananda
Hindi Commentary - Harikrishandas Goenka

#BRIHADARANYAKA

Mool Mantra [Sanskrit Verses] of the Upanishad
English Translation - Swami Madhavananda
Sanskrit Commentary - Sri Shankaracharya
English Translation of Sri Shankaracharya's Sanskrit Commentary - Swami Madhavananda

10. The Team Members

R.M.K. Sinha rmk@iitk.ac.in
T.V. Prabhakar tvp@iitk.ac.in
Harish C. Karnick hk@iitk.ac.in
T. Archna archna@iitk.ac.in
Murat Dhwaj Singh murat@cse.iitk.ac.in
Rajni Moona ajnimoona@cse.iitk.ac.in
Madhu Kumar madhuk@cse.iitk.ac.in
Amit Mishra mamit@iitk.ac.in
Md. Masroor masroor@iitk.ac.in
Mahaluxmi luxmi@iitk.ac.in
Rajeev Bhatia rajeevkb@iitk.ac.in

Others Who Helped

Dr. Vineet Chaitanya vc@iiit.net (VC) was the driving force behind the Geeta Vaatika, as well as the inspiration for the Gita Supersite. Now he watches our activities from IIIT Hyderabad, and is still one of the few who truly understand the spirit behind this work.

Nagaraju Pappu, the first one, wrote 100,000 lines of C code for the initial versions of Geeta Vaatika. His DOS version had more features than the current HTML one!

Apart from the current team, those who have contributed to the growth of these websites in a major way include: K. Anil Kumar, Anvita Bajpai, Ashutosh Sharma, Gita Pathak, K. Ravi Kiran, Rohit Patwardhan, Samudra Gupta, Shrikant Trivedi and Tripti Singh.

Courtesy: Prof Sanjay Dhande
Indian Institute of Technology
Department of Computer Science & Engineering
Kanpur - 208 016
(RCILTS for Hindi & Nepali)
Tel : 00-91-0512-2597174, 2598254
E-mail : sgd@iitk.ac.in

Editorial Comment : Because of the very large number of publications related to the above article by the Resource Centre, and constraints of space, we could not include the publication details here; these have already been listed in the April 2003 issue of VishwaBharat. For the publications please contact Prof. Sanjay Dhande / Prof. R.M.K. Sinha, IIT(K).
Resource Centre For
Indian Language Technology Solutions – Gujarati
The Maharaja Sayajirao University of Baroda, Vadodara
Achievements
Department of Gujarati, Faculty of Arts
The Maharaja Sayajirao University of Baroda, Vadodara-390002, India
Tel. : 00-91-265-2792959 E-mail : rciltg@satyam.net.in
Website : http://msubaroda.ac.in/rciltg/
RCILTS-Gujarati
Maharaja Sayajirao University, Baroda

1. Knowledge Resources
(Corpora, Parallel Corpora, Multi Lingual Digital Dictionaries)

1.1 Gujarati WordNet (Lexical Resource)

Gujarati WordNet, a lexical resource, is in the process of development. The RC at MSU is working in close collaboration with, and under the able guidance of, Professor Pushpak Bhattacharya of the RC at IIT, Mumbai. The Data Entry Interface for entering the synsets has been built. Presently, above 500 synsets have already been developed.

Further work on preparing the synsets and entering them into the database is going on. Basic semantic relationships, like hypernymy and hyponymy, have also been established between the synsets formed and entered in the database. Further work is going on for establishing other semantic relationships between the synsets. A web interface is already under construction to make an on-line lexical database of Gujarati available over the Internet.

2. Knowledge Tools
(Portals, Language Processing Tools, morphs analyzer, spell checker, text editor, basic word processor, code conversion utility, etc.)

2.1 Portal

RC for Gujarati has a web site in English hosted at http://msubaroda.ac.in/rciltg/ . The Gujarati version of the same will be available soon. The web site contains information about the software developments at RCILTG. The full text of the poems and prose of well-known Gujarati authors is planned to be displayed on the web site. A manuscript of the 19th century classic "Saraswatichandra" by Govardhanram Madhavram Tripathi, in the author's own handwriting, has been located and digitized and will soon be put on the web for interested researchers.

2.2 Multilingual Text Editor

A Unicode-based Multilingual Text Editor, long needed for Indian languages, has been developed at the RC for Gujarati.

The important feature of this Text Editor is that it fulfils the necessity of a universal storage format (Unicode) for Indian languages. The Indian languages supported by this Text Editor are Bangla, Gujarati, Gurmukhi, Hindi, Kannada, Malayalam, Marathi, Oriya, Sanskrit, Tamil and Telugu, along with English as an international language.

To make typing easy, assistance is given to the user by providing an onscreen INSCRIPT keyboard layout for all supported languages. The Text Editor supports the basic file features like creating a new file, opening an existing file and saving a file. It is efficient enough to open files of around 64KB. The basic text editing features are also provided, like cut, copy, paste, undo, redo, find, find next, replace and select all.
The user can use any of the system fonts for writing text. The styles supported by the Text Editor are font name, font size, bold, italic and underline. A sorting facility is also available, based on word sort or line sort. Support for three different Look and Feels of Java, viz. Motif, Windows and Metal, is provided to the user. The Gujarati Spell Checker is also bundled with this Text Editor.

Current Focus

• To increase the efficiency of the Text Editor.

Further Plans

• E-mail support for Indian Languages.
• ISCII compatible Import/Export facilities.
• More formatting features to be provided.

2.3 Code Converter

We have programs which can convert the codes between Unicode and ISCII.

2.4 Gujarati Spell Checker

The Gujarati Spell Checker is in its last stage of development. It is based on a Morphological Analysis of the Gujarati language in order to increase the intelligence of the Gujarati Spell Checker. The Morphological Analyzer covers the analysis of Nouns and Verbs of the language.

The suggestion generation is very effective. The number of suggestions generated will not be greater than ten. All non-Gujarati words, along with different symbols, are ignored and are not checked. The Gujarati Spell Checker is integrated with the Multi-Lingual Text Editor for the convenience of the user.

Currently the software is under testing, and very soon the percentage accuracy of the Morphological Analysis will be available.

Current Focus

• Rigorous testing to be done, using a large corpus, to find all the faults in the spell checker and morphological analyzer.
• Improving the algorithms for the correctness and efficiency of the spell checker.

Further Plans

• Increasing the size of the root dictionary, covering the maximum possible words of the language.
• Making an independent spell checker which can be used on Unicode or ISCII compatible text.
• More and more testing to increase the correctness of the spell checker, and also improving the suggestion generation.
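The core of the Unicode/ISCII conversion mentioned in Section 2.3 can be sketched as a fixed offset, since Unicode's Gujarati block mirrors the ISCII layout. A real converter must additionally handle ATR/EXT codes and nukta compositions, which this illustrative sketch ignores.

public class GujaratiCodec {
    static final int OFFSET = 0x0A81 - 0xA1;  // ISCII 0xA1 maps to U+0A81

    // ISCII bytes in the 0xA1-0xFA range shift into the Gujarati block;
    // ASCII bytes pass through unchanged.
    static String isciiToUnicode(byte[] iscii) {
        StringBuilder out = new StringBuilder();
        for (byte b : iscii) {
            int code = b & 0xFF;
            out.append(code >= 0xA1 ? (char) (code + OFFSET) : (char) code);
        }
        return out.toString();
    }

    static byte[] unicodeToIscii(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[i] = (c >= 0x0A81 && c <= 0x0ADA) ? (byte) (c - OFFSET) : (byte) c;
        }
        return out;
    }
}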
3. Translation Support Systems
(Machine Translation, Multi Lingual Information Access, Cross Language Information Retrieval)

3.1 Machine Translation

The work on this front has already started. Initially we have planned to be domain specific, and we have selected weather-related text. A group of two is working on this area. The basic Universal Word Lexicon format is being studied.

Out of all the approaches, we have selected the Interlingua (UNL) based approach for this purpose. The team has started exploring the various resources available on the UNDL Foundation website. The specifications for the enconverter and deconverter are being studied. The UNL proxy has been downloaded and tested with the sample documents.

4. Human Machine Interface Systems
(Optical Character Recognition Systems, Voice Recognition Systems, Text to Speech Systems)

4.1 OCR for Gujarati

A working prototype of Gujarati OCR, capable of doing the end-to-end task, i.e. converting an image into equivalent text, has been produced. The OCR employs the template matching technique for recognition, the same as used by the Telugu OCR developed at the RC for Telugu, and nearest neighbour for classification. The Gujarati OCR differs in many ways from the Telugu one due to special features like the necessity for zone separation.

The program, which can be invoked from the command line, takes a gray image, scanned in color or grayscale, in any of the popular image formats like JPEG, GIF, PNG, BMP and, after processing it, produces the output as a Plain Text file in Unicode format. The input image is assumed to be skew-corrected, without any images and tables, having only one column. We have a converter which converts Unicode to equivalent ISCII and vice-versa ready, so we can produce ISCII output as and when required.

Rigorous testing of the OCR for recognition rate etc. is being initiated and the initial results are quite good. Work is in progress on fine tuning various modules in order to improve the recognition accuracy.

Current Focus

• In the tests we have seen some misrecognition, so we are currently testing the performance of the Discrete Cosine Transform as a feature extractor.
• The programs are getting reorganized in an Object Oriented way.

Future Plans

• Developing/Adapting a skew-correction algorithm.
• Developing the post-processing routines.
• Thorough testing of the OCR and necessary fine tuning to take it to the required accuracy of 97% or above.
• Integrating with the OCRized spell-checker.
• Developing a GUI for the system with an inbuilt Text Editor.

4.2 Text To Speech (TTS)

The work on TTS has already been started at RC-Gujarati. A team of three students has explored the available TTSs such as FreeTTS and IBMJS in English, and Dhvani and Naarad in Hindi. All of these are implementations of the Java Speech API specifications. We have made significant progress in this direction.

While starting this work, we divided our task into the following two major subtasks:

1. Developing a program which generates output that can be input to an existing speech engine. E.g. IBMJS takes a phonetic transcription of the text using IPA codes to produce the sound.

2. Developing a core speech engine with native accent. We are adopting the concatenation based approach.
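A minimal sketch of subtask 1's hand-off to a Java Speech API engine is shown below; the toIpa() method is a hypothetical stand-in for the Gujarati-text-to-IPA converter described in the next subsection, and engine selection details are omitted.

import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

public class GujaratiTts {
    public static void speak(String gujaratiText) throws Exception {
        String ipa = toIpa(gujaratiText);              // hypothetical converter
        Synthesizer synth = Central.createSynthesizer(new SynthesizerModeDesc());
        synth.allocate();
        synth.resume();
        synth.speakPlainText(ipa, null);               // engine renders the sound
        synth.waitEngineState(Synthesizer.QUEUE_EMPTY);
        synth.deallocate();
    }

    static String toIpa(String text) { return text; } // placeholder only
}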
Task Completed So Far

• A program is now available with RC-Gujarati which allows the user to type in Gujarati text using the INSCRIPT keyboard layout; it then converts the string into its equivalent IPA string and feeds it to the IBM speech engine, which in turn produces the sound represented by the given IPA string.

Current Focus

• Exploring the phonological properties of the language, extracting the basic phonemes from continuous speech, and the design of a basic speech engine.

Future Plans

• Development of a preprocessor to handle language specific details.
• Developing the speech engine and integrating it with the preprocessor.

5. Localization
(Adapting IT tools and solutions in Indian language(s) - IT localisation clinic interaction with state government for possible solutions in ILs)

5.1 Language Technology Human Resource Development
(Manpower Development in Natural Language Processing - specialized training programs)

RC for Gujarati is trying to develop manpower by providing resources to final year students from relevant backgrounds like Mathematics, Computer Science, Electrical Engineering, Electronics Engineering, Computer Applications etc. for doing their projects with the RC. Two students (one ME (Electrical) and one M.Sc. (Statistics)) have already completed their projects on the study and development of various techniques for Gujarati OCR.

At present five students are working on their final year projects. Three of them are working on TTS technologies and two are working on the development of a UNL based machine translation support system.

Apart from this, the RC for Gujarati has trained three physically challenged persons in Gujarati data entry using the INSCRIPT keyboard, and they are now doing paid data entry work for the RC.

6. Standardization
(Contribution towards UNICODE, Transliteration Table(s), XML, lexware format, etc)

6.1 UNICODE

UNICODE for Gujarati has been studied and some additions and corrections have been suggested. They are as follows:

• Correction in the shape of AVAGRAH.
• Addition of the Gujarati Currency sign and symbols for ¼, ½, ¾.
• Addition of two rendering rules for the combination of Ja with the dependent vowel signs E and EE.

The collation sequence for Gujarati alphabets is not the same as the sequence of the characters in the UNICODE chart. Hence the correct collation sequence has been finalized.

All the above modifications and suggestions were published in the April 2002 issue of the VishwaBharat magazine of TDIL.

In addition to the above, the RC for Gujarati has worked closely with Mr. Abhijit Dutta of IBM, Delhi and helped him in standardizing the collation sequence for Gujarati and the set of glyphs for developing the fonts.

Apart from the development of basic technology, the RC for Gujarati has, as per the original proposal accepted by the Ministry, developed a strong knowledge base in Gujarati, available on multimedia CDs and on our website. This knowledge base includes the following:
1. Select Gujarati classics in print and manuscripts, including a rare mss of the 19th century classic, Sarasvatichandra, in the author's handwriting, with the many changes he made incorporated in the digitized form on CD.

2. Cultural History and Geography of Gujarat on CD, useful to researchers in one set and to tourists in a second set.

3. Multimedia CDs on Rural Health Problems and Preventive Care.

4. Language self-learning CDs useful to students (learning Sanskrit, English etc.), second generation NRIs and new researchers abroad.

7. Products which can be launched and the services which the Resource Centre can provide to the State Government and the Industry in the region

7.1 Technical Skills

The RC for Gujarati has built technical skills related to Image Processing and Pattern Recognition, Audio Signal Processing, Speech Synthesis, and Unicode based text processing and rendering, and the implementation of these using Java. These can readily be shared, with proper agreement, with researchers and industry in the region. This knowledge can even be transferred to the new generation by teaching courses in these areas at nearby colleges.

RC-Gujarati has developed a Multilingual Text Editor which can be used to type in Unicode. With the help of this, we could propagate the use of standard encoded storage of data in Unicode or ISCII. In turn this would lead to a situation where data can be transferred from one machine to another without transferring the fonts. This could be of much use in Government offices where such data transfer is a daily routine.

7.2 Products Which can be Launched

At present the RC has a multilingual text editor ready to launch, the details of which are given above.

7.3 IT Services

RC for Gujarati has developed the following services which it can provide to the State Government and the Industry in the region:

• Multimedia CDs on Language Self-learning ("Learn Gujarati through English", "Learn Sanskrit through Gujarati", "Learn English through Gujarati" etc.) could be of use to learners in Gujarat and overseas, including the NRIs. In our interaction with him, these CDs have been found of interest to Dr. M.N. Cooper of Infosys Ltd., Pune.

• Multimedia CDs on Rural Health and Adolescent Health, developed in collaboration with the Tribhuvandas Foundation, Karamsad and its Director, Dr. Nikhil Kharod, M.D., could be of use to the Departments of Health and Education, Gujarat Government and to the Panchayats.

• Services on the Library Knowledge-base, giving detailed information on books in print, 19th and 20th century journals and medieval manuscripts available at some libraries in Gujarat and Mumbai, on CDs and our Website, could be of much use to Universities, Colleges and other institutions of research and teaching in India and abroad.
The libraries with which we have entered into service agreements have undertaken to provide information through email, fax and surface mail to users who would have access to this knowledge base through our Website.

8. IT Services - Multimedia CDs

8.1 Learn Gujarati Through English

A multimedia tutorial has been designed that helps an enthusiast learn the Gujarati language through English. The tutorial is based upon the concept of Dr. Jagadish Dave, who teaches Gujarati in Europe. He has used this methodology to teach Gujarati to persons who have no connection to Gujarat, or for that matter India.

The main features of the CD are:

• A novel approach for teaching Gujarati by exploiting similarity of shapes for remembering various characters.
• Starts with a familiar graphic resembling a Gujarati letter and shows the manufacture of related letters from it.
• For example, starting with the symbol 'S', which is a character in English as well as Gujarati, the Gujarati characters with related shapes DA, KA, PHA, HA, THA are derived as morphological modifications of the basic symbol.
• The related audio commentary figuratively explains the modifications of the basic symbol.
• The pronunciation of Gujarati words in which the character is embedded at various positions also accompanies the presentation.
• After introducing a set of alphabets, the learner is led to meaningful word formation involving those alphabets.
• Introduction of grammar through simple sentences instead of complicated rules.

Future Plans

• A practical conversation involving those words, suitable for handling real life situations, is to be provided at regular intervals.
• At the end of the presentation, a list of common words, their meanings and their pronunciation is to be added.
• The less common grammar rules are to be added.
• A video of some practical situations is to be incorporated.

8.2 Adolescent Health

RCILTG organized a two-day workshop on Adolescent Health with the Tribhuvandas Foundation of Anand. About 40 teachers participated in the workshop. The workshop was aimed at adolescent health and how teachers can scientifically disseminate knowledge about the changes occurring in the male and female body during this age.

This CD is a gist of the ideas and suggestions that came out of a few long brainstorming sessions. The information presented here is the work of Dr. Nikhil Kharod (M.D., Pediatrician).

The features of this CD are:

• Introduction of sensitive issues from a scientific viewpoint.
• Accompanying pictures, sketches, diagrams and tables giving medically proven information about the topic.
• A detailed description of nutrients, their sources and the diseases caused by their deficiency.
• Description of the body cycle of male and female.
• A quiz with key answers which gives the user an idea about his/her knowledge in this area.

Our future plans include the production of a series of language learning CDs. Work on the CD "Learn Sanskrit through Gujarati" has already begun. Next to follow would be a set of CDs linking Gujarati with other Indian languages and with English. There is a good marketing possibility for this series.
Initial work has been completed on the production of a set of CDs with multimedia presentations on the Cultural History and Cultural Geography of Gujarat. This series would be useful for Primary and Secondary Schools and also for tourism.

One CD on "Rural Health" has been produced, focusing on Adolescent Health. Similar workshops on other areas of Rural Health would be conducted to produce more CDs on other aspects of health in rural Gujarat.

8.3 Bibliography of Books

In close collaboration with some important research libraries of Gujarat and Mumbai, the RC for Gujarati has worked on preparing an on-line bibliography of books and journals available at these centres. Also, on-line lists of critical and creative items (essays, fiction, poetry, drama, culture-related information) published in 19th and early 20th century Gujarati journals have been digitized. The libraries have agreed to provide photo-copies of the knowledge-base items, at no-profit level, to the users of the RC's website.

8.4 Sarasvatichandra

The original manuscript of the 19th century classic, Sarasvatichandra, by Govardhanram Tripathi, which had a great impact on the cultural life of Gujarat, has been located by our RC. The mss, in the author's handwriting, has now been digitized and made into a CD with appropriate navigation.

Note: RC for Gujarati has trained three physically challenged former students, Kum. Rina Phadia, Shri Ganpat Solanki and Arun Paladgaonkar, in data entry and has used their paid services for its data entry work for Wordnet and CD production.

9. The Team Members

Prof. Sitanshu Mehta pi_rcmsu@satyam.net.in
Prof. S. Rama Mohan srm-cc@msubaroda.ac.in
Mr. Jignesh Dholakia jignesh-rcilt@msubaroda.ac.in
Mr. Mihir Trivedi mihirktrivedi@rediffmail.com
Mr. Irshad Shaikh ishaikh16@yahoo.com

Courtesy: Shri Sitanshu Y. Mehta
M.S. University of Baroda
Department of Gujarati
Faculty of Arts
Baroda – 390 002
(RCILTS for Gujarati)
Tel : 00-91-265-2792959
E-mail : rciltg@satyam.net.in
Resource Centre For
Indian Language Technology Solutions – Marathi & Konkani
I.I.T., Mumbai
Achievements
Computer Science and Engineering Department
Indian Institute of Technology, Mumbai-400076 India
Tel. : 00-91-22-25722545 Extn. 5879 E-mail : pb@cse.iitb.ernet.in
Website : http://www.cfilt.iitb.ac.in
RCILTS-Marathi & Konkani
Indian Institute of Technology, Mumbai

Introduction

The Resource Center for Technology Solutions in Indian Languages was set up at IIT Bombay in the year 2000. The center concentrated on developing niche technologies for Indian languages with special focus on the languages of Western India, in particular Marathi and Konkani. The main theme of the research has been the development of (i) lexical resources like wordnets, ontologies and semantics-rich machine readable dictionaries (MRD), (ii) interlingua based machine translation software for Hindi and Marathi and (iii) a Marathi search engine and spell checker. Over the years the center has built its strength and reputation in its chosen areas and has obtained national and international visibility. In the paragraphs that follow we describe the technologies already developed and in the process of being developed, and also the impact the center has made in the area of information technology localization.

1. Lexicon and Ontology

1.1 Lexicon

The lexicon is being prepared in the context of interlingua based machine translation. The interlingua is the Universal Networking Language (UNL) (http://www.unl.ias.unu.edu), which is a recently proposed interlingua. The heart of the UNL based MT is the universal word (UW) dictionary. UWs are essentially disambiguated English words and stand for unique concepts. Our lexicon contains about 70,000 UWs and approximately 200 morphological, grammatical and semantic attributes. These entries are linked with Hindi headwords. The lexicon is available on the web (http://www.cfilt.iitb.ac.in/dictionary/index.html). The UW-Hindi lexicon is required for the enconversion (analysis) and deconversion (generation) processes. The lexicon can also be used as a language reference for English and Hindi. We are now concentrating on completing the language coverage and standardizing the lexicon (standard restrictions and semantic attributes).

1.2 Ontology

An ontology is a hierarchical organization of concepts. Domain specific ontologies are critical for NLP systems. We have classified Nouns as animate and inanimate. Further sub-classification divides animate nouns into flora and fauna, to cover plants and animals. The inanimate category is sub-classified as object, place, event, abstract entity etc. Concept nouns indicating time, emotion, state, action etc. have also been incorporated. Verbs have been divided into do, be and occur categories, with further sub-classifications: verbs of action, verbs of state, temporal verbs, verbs of volition, etc. Adjectives have been classified as descriptive, indicating weight, colour, quality, quantity, etc., and also as demonstrative, interrogative, relational and so on. Adverbs are categorized according to time, place, manner, quantity, reason, etc.

Statistics of Ontological Categories

We have 22 broad categories for Noun, Verb, Adjective and Adverb. Under these categories, we have 58 sub-categories for all word classes. The noun class has 55 sub-sub and 29 sub-sub-sub categories (vide the table below):

1. No. of Categories of Noun, Verb, Adjective and Adverb : 22
2. No. of Sub-categories of Noun, Verb, Adjective and Adverb : 58
3. No. of Sub-sub Categories of Noun : 55
4. No. of Sub-sub-sub Categories of Noun : 29
Total No. of Categories : 164

2. The Hindi WordNet

The Hindi wordnet is an on-line lexical database. The design closely follows the English wordnet. The synonym sets {घर, गृह} and {घर, परिवार}, for example, can serve as unambiguous differentiators of the two meanings of घर.
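The synset-plus-relations organization just described can be pictured with a minimal sketch; the record shape, field names and the example hypernym below are our illustrative assumptions, not the actual database schema.

import java.util.List;

public class WordnetSketch {
    // A synset groups synonymous words under one gloss and links upward
    // to its hypernym ("is-a") synset.
    record Synset(int id, List<String> words, String gloss, Integer hypernymId) {}

    public static void main(String[] args) {
        Synset dwelling = new Synset(1, List.of("निवास"), "a place where people live", null); // hypothetical parent
        Synset ghar = new Synset(2, List.of("घर", "गृह"), "house; a dwelling", dwelling.id());
        System.out.println(ghar.words() + " is-a synset #" + ghar.hypernymId());
    }
}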
Such synsets are the basic entities in the wordnet. Each sense of a word is mapped to a separate synset in the wordnet. All word sense nodes are linked by a variety of semantic relationships, such as is-a (hypernymy/hyponymy) and part-of (meronymy/holonymy).

The lexical data entered by linguists and lexicographers are stored in a MySQL database. The web interface (www.cfilt.iitb.ac.in) gives the facility to query the data entered and browse the semantic relations for the search word. The interface also provides links to extract the semantic relations that exist in the database for the search word. So far approximately 10,000 synsets have been entered. This corresponds to about 30,000 words in Hindi. There is a morphological processing module as a front end to the system.

Figure 1: Snapshot of data entry interface for entering synset data
Figure 2: Snapshot of data entry interface for entering semantic relations for the noun category
Figure 3: Online web interface of the Hindi WordNet
Figure 4: Snapshot of web interface for the query result for the word "janmadata"

3. The Marathi WordNet

Since Marathi and Hindi share many common features, it has been possible to adopt the Hindi wordnet as the basis for Marathi wordnet creation. Hindi and Marathi originate from Sanskrit and, therefore, the tatsama words (words taken directly from the mother language, e.g. गति, कृति) and the tadbhava words (words which have evolved organically from the mother language, e.g. गृह → घर, कर्म → काम) in both the languages are often the same in meaning. Also, both languages have the same script - Devanagari.

In the construction of the Marathi wordnet, a Hindi synset is chosen and mapped to the
corresponding Marathi synset. For instance, for the Hindi synset {कागद, कागज} - standing for paper - the Marathi synset is {कागद}. A gloss is added, which explains the meaning of the word, and an explanatory sentence is inserted. For the current example, the gloss is "बांबू व गवत यांच्या लगद्यापासून तयार केले जाणारे पातळ पान ज्यावर लिहिले वा छापले जाते" and the example sentence is "त्याने कोऱ्या कागदावर माझी सही घेतली". So far, 3400 synsets have been entered. This corresponds to about 10,000 words in Marathi. Our aim is to cover all the common Marathi words. As the Marathi wordnet is based on the Hindi wordnet, all the semantic relations are automatically inherited. An additional benefit to accrue is the creation of a parallel corpus.

4. Automatic Generation of Concept Dictionary and Word Sense Disambiguation

A Concept (UW) Dictionary is a repository of language independent representations of concepts using special disambiguation constructs. A system has been developed for automatically generating document specific concept dictionaries in the context of language analysis and generation using the Universal Networking Language (UNL). The dictionary entries consist of mappings between head words of a natural language and the universal words of the UNL. The manual effort in constructing such dictionaries is enormous, because the lexicographer has to identify the disambiguation constructs and the semantic attributes of the entries. Our system, which constructs the UW dictionary (semi-)automatically, makes use of the English wordnet and processes part-of-speech tagged and partially sense tagged documents. The sense tagging step is a crucial one and uses verb association information for disambiguating the nouns. The accuracy of the word sense disambiguation system is 70-75% and that of the DDG (document specific dictionary generator) is 90-95%.

The DDG makes use of a rule base for producing the dictionary. It is based on the principle of expert systems, doing inferencing and providing explanations for the choices made with respect to the restrictions and semantic attributes.

Figure 5: Stages in automatic dictionary generation

5. Hindi Analysis and Generation

5.1 Hindi Analysis (Enconversion)

The analysis engine uses the UW-Hindi dictionary and the analysis rule base. The dictionary contains the headword, the universal word and its grammatical and semantic attributes. The process of analysis depends on these semantic and grammatical attributes. There are 41 relations defined in the UNL specification. The system is capable of handling all of them. The analysis system can deal with almost all the morphological phenomena in Hindi. At present, there are 5500 rules dealing with morphosyntactic and semantic phenomena. The system has been tested on corpora from the UN, the MICT and the agricultural domain.

5.2 Hindi Generation (Deconversion)

The generation process also uses the UW-Hindi dictionary and the generation rule base. The dictionary is the same for both analysis and generation purposes; only the rule base is different. The generation rule is formed from the grammatical and semantic attributes as well as the syntactic relations. Some of the features of the system are:

• Matrix-based priority of relations has been designed to get the correct syntax.
• Extensive work has been done on morphology.
• The system has gone through the UNL of corpora from various domains with satisfactory results.
• There are almost 6000 rules for syntax planning and morphology of all parts of speech.
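As an illustration of the interlingua that the enconverter emits and the deconverter consumes, a sentence like "the farmer sows wheat" could be encoded as binary relations over universal words. This example is ours, not from the Centre's corpora; agt and obj are relations from the UNL specification, while the restrictions shown are illustrative:

agt(sow(icl>do).@entry.@present, farmer(icl>person))
obj(sow(icl>do), wheat(icl>grain))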
6. Marathi Analysis

Marathi is a natural language of the Indo-Aryan family. Textual communication uses the Devanagari (देवनागरी) script. The analysis and generation of Marathi make use of the Marathi-UW lexicon and the rules of Marathi grammar. There are about 2100 rules to handle the Marathi verb phenomena listed below:

Sr. No. / Phenomenon / Description
1 / Present tense / Time w.r.t. writer
2 / Past tense / Time w.r.t. writer
3 / Future tense / Time w.r.t. writer
4 / Event in progress / Writer's view of aspect
5 / Complement / Writer's view of reference
6 / Passive voice / Writer's view of topic
7 / Imperative mood / Writer's attitude
8 / Ability of doing something / Writer's viewpoint
9 / Intention / Intention about something or to do something
10 / Should / To do something as a matter of course
11 / Unreality / Unreality that something is true or happens

7. Speech Synthesis for Marathi Language

The aim of this project is to design a speech synthesis system to speak out Marathi text written in the Devanagari script. A concatenative speech synthesis model is used to achieve this goal. Ours is an unlimited vocabulary system. It employs basic units, i.e. vowels and consonants, as the basis for constructing words and sentences. The database of basic speech units required in this approach is a few hundred kilobytes, as against other approaches where the size is almost tens of megabytes. We are able to get intelligible speech with this approach. A graphical user interface for the system has been developed. It provides features for drawing waveforms and for amplifying the output speech signal. It also provides help for displaying Marathi text through English keyboards.

Currently, work on rendering newspapers as speech is taking shape. On-line reading of the Marathi newspaper Maharashtra Times is being carried out as an illustration of the technology.

8. Project Tukaram

The Tukaram Project provides the entire collection of Saint Tukaram's Abhangs in browsable and searchable format. Tukaram's abhangs are a household name in Maharashtra. It is possible to convert the existing version to a stand-alone system, which can be installed on a single machine. The abhangs are available for public viewing on http://www.cfilt.iitb.ac.in (also see the picture below). Browsability is provided at chapter and verse levels. Users can move from one chapter or verse to another by using links at the bottom of the webpage.

The search engine for the Tukaram Project is made compatible with both Linux and Windows. Tukaram's Abhangas were keyed in the Akruti font and were later converted to the ISFOC font. ISCII encoding is used by the Tukaram search engine.

To enable users to use the Tukaram search engine from both Windows and Linux, we chose the XDVNG font for the input technology while keeping ISFOC for display purposes only. Clients can key in their queries through phonetic English designed by Sibal (1996). Clients' input queries typed in phonetic English are displayed at the client's terminal in Marathi using the XDVNG font. Phonetic English to XDVNG mapping is available as JavaScript from http://www.sibal.com/sandeep/jtrans. Since JavaScript is used for the mapping, it is easy to integrate it in a page.

A query in XDVNG is sent to the server at IIT Bombay and is converted into ISCII at the server. For this conversion, an XDVNG to ISCII converter provided by IIIT Hyderabad is used after modifications. The features of Project Tukaram are summarized in the table below:

Font and Package used for Original Script : Akruti in MS Word
Number of converted HTML chapters : 165
Number of converted HTML verses : 4614
Font used in HTML display pages : ISFOC
Total number of words excluding HTML tags : 2,12,442
Number of distinct words used for the Indexing : 34,773
Keyboard mapping at client : Phonetic English
Query Display Technology at Client : XDVNG
Result Display Technology at Client : ISFOC
Encoding used for Indexing : ISCII
Database Server : MySql
Languages used for Converters : Lex and C
Language used in Search Engine : JAVA
Language used in Client Side : JavaScript
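The query path just summarized can be sketched end to end as below; all three helper methods are hypothetical placeholders standing in for the JavaScript phonetic-English-to-XDVNG mapper, the modified IIIT Hyderabad converter, and the ISCII-keyed index lookup.

public class TukaramQueryFlow {
    static String search(String phoneticEnglishQuery) {
        String xdvng = phoneticToXdvng(phoneticEnglishQuery); // done client side in JavaScript
        String iscii = xdvngToIscii(xdvng);                   // done at the server
        return lookupIndex(iscii);                            // ISCII-keyed MySQL index
    }
    static String phoneticToXdvng(String q) { return q; }     // placeholder
    static String xdvngToIscii(String q) { return q; }        // placeholder
    static String lookupIndex(String q) { return "verses matching " + q; } // placeholder
}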
9. Automatic Language Identification of Documents using Devanagari Script

The problem of language identification has been addressed in the past. There are existing methods like unique letter combinations, common words, and the N-grams technique. A subtle issue here is the length of the text available for classification. Many methods require longer texts for language identification. Also, some methods require some a priori linguistic knowledge. Another issue arises when the test set for the classification program contains some errors. The classification technique should be robust in such a way that these errors do not affect the accuracy of the classifier.

The N-grams method has been developed based on a rank order of short letter sequences. A rank ordered list of the most common short character sequences is derived from the training document set. A rank ordered list is prepared for every language under consideration. Each such list is called a profile. Language identification is done by measuring the closeness of the test document to each of these profiles. The profile which is closest to the test document gives the language of the test document. Figure 1 shows the data flow diagram of the classifier. We have extended this approach for Devanagari script, as the conventional approach was not giving the desired accuracy.

Letter Approach: In this approach, each letter in Devanagari is treated as a character in an N-gram. That means an N-gram may contain more than "N" characters. On the other hand, an N-gram will be exactly "N" (possibly half) letters in Devanagari script.

Conjunct Approach: This is a variant of the previous approach. We draw motivation for this approach by describing conjuncts. Indian scripts contain numerous conjuncts, which essentially are clusters of up to four consonants without the intervening implicit vowels. The shape of these conjuncts can differ from those of the constituting consonants. For example, इच्छा has the conjunct च्छ. It comprises two consonants, च and छ. The consonant च is considered a half letter. In the Conjunct approach, a conjunct is considered as a single letter.

Results: From the experimental results, it could be observed that although the common words technique is computationally cheaper, the N-gram techniques are more accurate. Also, it was clear that our extensions to N-grams work much better than the conventional N-grams approach. The N-gram approach was very suitable for another reason: the documents used in the experiments had textual errors, and as the N-grams method is quite robust to textual errors, the results were largely satisfactory. In our approach, although developing the method itself required some linguistic knowledge, the method is completely automatic and does not require any fine-tuning as is required in the case of the common words approach.
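A minimal sketch of the rank-order ("out-of-place") profile comparison described above is given below; the profile contents and sizes are illustrative assumptions.

import java.util.*;

public class NgramClassifier {
    // Sum of rank displacements between the document profile and a language
    // profile; N-grams missing from the language profile get the maximum penalty.
    static int distance(List<String> docProfile, List<String> langProfile) {
        int d = 0;
        for (int i = 0; i < docProfile.size(); i++) {
            int j = langProfile.indexOf(docProfile.get(i));
            d += (j < 0) ? langProfile.size() : Math.abs(i - j);
        }
        return d;
    }

    // The test document is assigned to the language whose profile is closest.
    static String classify(List<String> docProfile, Map<String, List<String>> profiles) {
        return profiles.entrySet().stream()
                .min(Comparator.comparingInt(
                        (Map.Entry<String, List<String>> e) -> distance(docProfile, e.getValue())))
                .map(Map.Entry::getKey).orElse("unknown");
    }
}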
10. Object Oriented Parallel and Distributed Web Crawler

The vastness and dynamic nature of the WWW has led to the need for efficient information retrieval. A crawler's task is to fetch pages from the web. The crawler starts with an initial set of pages, retrieves them, extracts URLs in them and adds them to a queue. It then retrieves URLs from the queue in some specific order and repeats the process. We have designed and implemented an object oriented web crawler. Our crawler has 4 components: the Graph component, UrlManager, Buffer and Document Processor. The crawler has a 4-tier architecture with the storage file system at the lowest layer and crawl-buffers at level 4. The architecture is shown in the figure below. The Graph component is at level 2 and it interacts with the file system. The UrlManager and Document Processor are at level 3. This layer forms the business logic layer. The crawler architecture is parallel and distributed. It uses a CORBA environment and runs on a Linux Cluster.
the business logic layer. The crawler architecture is
functions such as create graph, store and read a
parallel and distributed. It uses a CORBA
webpage, parse the page and get outlinks.
environment and runs on a Linux Cluster.
11. Designing Devanagari Fonts
The following three types of ‘Arambh’ family fonts
are created:
• 8-bit true type font (.ttf )
• 16-bit Unicode bitmap ascii font (.bdf )
• 16-bit Unicode true type font (.ttf ).
11. Designing Devanagari Fonts

The following three types of 'Arambh' family fonts have been created:

• 8-bit true type font (.ttf)
• 16-bit Unicode bitmap ascii font (.bdf)
• 16-bit Unicode true type font (.ttf)

Though there are many Devanagari fonts available, no single font contains all the glyphs required for displaying documents such as Damale's grammar and Saint Tukaram's abhangas. Hence a new font is being created to handle the additional requirements. The font called 8-bit Arambh is a true type font which covers all the required glyphs; in this font we get the scalability feature of true type fonts. The two types of 16-bit Arambh Unicode fonts developed are the ArambhU (bdf) font and the ArambhU (ttf) font. Each glyph in the BDF font is represented as a matrix of dots; these fonts are rendered faster than TrueType fonts. Each glyph in the TrueType font is represented as an outline curve, thus facilitating scalability.
12. Low Level Auto Corrector

We have built a low level syntactic checker and autocorrector based on a classification of low-level syntactic errors. The tool uses a set of rules to detect and correct the errors. It was used in project Tukaram, and over 300 errors were detected and automatically corrected. The kinds of errors handled are visible and invisible errors due to multiplicity, ordering, mutual exclusion and association properties.

Types of Low Level Syntactic Errors

Below, a classification of low-level syntactic errors in a document written in an Indian language is provided with a few examples. These errors may be attributed to violations of the constraints of multiplicity, ordering and composition of ligatures, especially of the matra characters.

Multiplicity

Typically, ligatures such as the matra, halant and anuswar are used only once in a single letter composition. While typing, they may get typed twice while the result is still not visible in the text in an Indian language font. Such errors can be classified as invisible errors, while some others remain visible in the display.

Invisible Errors

For example, in the case of an extra anuswar in the letter अं, the source in ITRANS is given by aM, whereas aMM also produces अं. The two Devanagari characters seen in the previous sentence were actually produced by these two different encodings. While the result of this error is not visible in the display, it can be a cause of concern for an application such as a pattern matching algorithm, an index or a converter.

Invisible errors may also occur in the case of multiple ukar and matra ( ु and े ) characters. For example, in पू an extra ukar may be present but invisible, or overlapped.

Visible Errors

While some ligatures overlap in display and the errors remain undetected by mere human inspection of the display, some errors may be detectable by display inspection. This is possible due to positioning adjustments done on ligature placements.

For example, an extra kana in Marathi, such as in the letter पा, is visibly detectable if it occurs more than once, since the additional ligature is right shifted and either displayed as a kana or taken as a separate character. An example of the latter case is the encoding paaa in ITRANS, which is displayed as पाअ.

Ordering

In some cases, more than one vowel sign can occur in a single letter. For example, an anuswar and a kana in Marathi, as in the letter पां, may be typed in Akruti_Priya_Expanded in two ways, as hebe (where h stands for the pa glyph, e for the kana and b for the anuswar), or as heeb. While the first choice results in an unaesthetic placement of the anuswar on top of the first vertical line, the second choice gives the desired result. With a smaller font size the differences may get overlooked, but with an enlarged font size they are easily detected.

Mutual Exclusion

In Indian languages, the composition rules also specify mutually exclusive vowel signs. For example, an ukar and a kana cannot occur in one character. Similarly, a velanti and an ukar or a kana cannot occur in one letter, whereas it is possible for a kana and a matra to occur together in one letter.

Association

Certain consonants do not accept the rafar (i.e., र्) or specific conjuncts. For example, a combination of ळ and rafar is invalid. Similarly, it is not possible to combine certain consonants such as ज्ञ with other consonants, or ह with प.
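A minimal sketch of how such rule-based detection and autocorrection might operate is given below, shown here over Unicode Devanagari rather than the font encodings discussed above. The three rules mirror the multiplicity and mutual-exclusion constraints described earlier, but the patterns, rule set and function names are illustrative assumptions, not the project's actual rules.

    import re

    # Each rule: (name, compiled pattern, replacement used for auto-correction).
    RULES = [
        ("double matra",   re.compile(r"([\u093E-\u094C])\1"), r"\1"),  # invisible doubling
        ("double anuswar", re.compile(r"(\u0902)\1"),          r"\1"),
        ("ukar with kana", re.compile(r"\u0941\u093E"),        "\u0941"),  # mutually exclusive
    ]

    def autocorrect(text):
        errors = []
        for name, pat, repl in RULES:
            for m in pat.finditer(text):
                errors.append((name, m.start()))
            text = pat.sub(repl, text)
        return text, errors

    fixed, errs = autocorrect("\u092A\u093E\u093E")   # 'paaa': pa + kana + kana
    print(errs)   # [('double matra', 1)]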
13. Font Converters

Following is the list of font converters that have been developed by IIT Bombay:

• IITK to ISCII
• ISCII to IITK
• Akruti_Priya_Expanded to ISCII
• SHUSHA to ISCII
• IITK to SHUSHA
• DV-TTYogesh to ISCII
• ISCII to DV-TTYogesh
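Converters of this kind are essentially table driven: each glyph code of the source font maps to an ISCII byte sequence, with reordering logic for matras layered on top. The sketch below is a toy illustration; the two table entries are invented, and a real table covers the complete glyph repertoire of each font.

    # Hypothetical glyph-code table: source-font byte -> ISCII byte sequence.
    GLYPH_TO_ISCII = {
        0x7B: b"\xC8",        # e.g. the source glyph for 'pa'
        0xC9: b"\xDA",        # e.g. the glyph for the aa-matra (kana)
    }

    def convert(data: bytes) -> bytes:
        out = bytearray()
        for b in data:
            out += GLYPH_TO_ISCII.get(b, bytes([b]))  # pass unknown bytes through
        return bytes(out)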
14. Marathi Spell Checker

Under this ongoing project, a stand-alone spellchecker is being built for Marathi. The spellchecker will be able to spell check documents in a given encoding.

From the CIIL (Central Institute of Indian Languages) corpus, 12,886 distinct words have been listed. Similarly, other Marathi texts at the centre are being used to build a basic dictionary. A morphological analysis is being carried out on the collection of words. For example, an automatic grouping algorithm identified 3,975 groups out of the 12,886 distinct words. The first word of a group is usually the root word; thus, there are approximately 4,000 root words from the Marathi corpus. A manual proof reading will be done on these results, and the morphology will be enriched.

A motivation behind the stand-alone spellchecker is that it can be used without an editor through a packaged interface, or it can be integrated with other compatible applications such as OCR.
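The automatic grouping step can be pictured as clustering inflected forms under a shared stem. The sketch below groups words by longest common prefix and takes the first member of each group as the candidate root; it is an assumed reconstruction of the idea, not the project's actual algorithm, and the sample words are invented.

    def group_by_stem(words, min_stem=3):
        groups = {}
        for w in sorted(words):                    # sorted so prefixes cluster
            for stem in (w[:k] for k in range(len(w), min_stem - 1, -1)):
                if stem in groups:
                    groups[stem].append(w)
                    break
            else:
                groups[w[:min_stem]] = [w]         # start a new group
        # First member of each group is usually the root word.
        return {g[0]: g for g in groups.values()}

    words = ["ghar", "ghari", "gharat", "pani", "panyat"]
    print(group_by_stem(words))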
Content creation tools

The Marathi content placed on the CFILT website, viz. Saint Tukaram's Abhangas, Soundarya Meemansa and Damle's Marathi grammar, has been typed using the Akruti software along with MS Word. Akruti provides three keyboard layouts: Typewriter, English Phonetic and Inscript. MS Word has a facility to convert Word files into web pages, and we use this convert option. However, the web pages created by MS Word contain many Internet Explorer specific style sheet tags and attributes, due to which the appearance of the Devanagari text on the web page differs between Internet Explorer on Windows and Netscape Navigator on Linux. Hence we remove those style sheet tags from the web pages using a Java program. The files are then uploaded to the web server.

For the Marathi search engine, Marathi webpages were crawled from the web. The webpages crawled were in the DV-TTYogesh and DV-TTSurekh fonts. These font codings of the webpage text were converted into ISCII coding. An inverted index is used to store the words and their attributes, such as whether the word appears in bold or italic, whether it appears in the title of the webpage, and other attributes. MySQL is used for storing the inverted index. The user interface uses J-Trans, which is JavaScript code. The user types the query in phonetic English. When the user clicks the "search" button, the query words are looked up in the inverted index. The documents which contain all the query words are noted. The relevancy of a document is calculated according to the attributes of the query words in that document; for example, bold or italic words are considered more important than non-bold or non-italic words. The documents are then sorted in descending order of their relevancy and the results are displayed to the user. Thus the user gets the most relevant Marathi webpages.
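A condensed sketch of this index-and-rank scheme follows, written in Python rather than the MySQL/JavaScript stack described above; the attribute weights and sample pages are invented for illustration.

    from collections import defaultdict

    index = defaultdict(dict)          # word -> {doc_id: attribute set}

    def add_document(doc_id, words_with_attrs):
        for word, attrs in words_with_attrs:   # attrs like {"bold", "title"}
            index[word].setdefault(doc_id, set()).update(attrs)

    WEIGHTS = {"title": 3.0, "bold": 2.0, "italic": 2.0}

    def search(query_words):
        # Only documents containing ALL query words qualify.
        docs = set.intersection(*(set(index[w]) for w in query_words))
        def relevancy(d):
            return sum(1.0 + sum(WEIGHTS.get(a, 0) for a in index[w][d])
                       for w in query_words)
        return sorted(docs, key=relevancy, reverse=True)

    add_document("page1", [("pune", {"title"}), ("marathi", set())])
    add_document("page2", [("pune", set()), ("marathi", {"bold"})])
    print(search(["pune", "marathi"]))    # page1 ranked first (title match)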
15. IT Localization

IT awareness

There was a meet in October 2001 with the media, to apprise them of the technology development efforts going on in India and at the Resource Centre at IIT Bombay on Indian languages. Leading newspapers like the Times of India, Maharashtra Times, Loksatta, Lokmat and Sakaal sent their representatives to this meet. The systems developed at the IITB centre were demonstrated and presented, and the newspapers carried reports on this.

Technology solutions

IIT Bombay is providing its expertise for developing wordnets for Indian languages (please see the details below). We recently taught the fundamentals of wordnet building in the Indo-WordNet workshop at CIIL Mysore. A licensing agreement for transferring the lexical data, the user interface and the API for the Hindi wordnet is being worked out with advice from the Ministry.

Interaction with State Government

The interaction is mainly with the Marathi Rajyabhasha Parishad, which is the authorized body of the Maharashtra state government for developments related to the Marathi language.
Similar interaction has been going on with the Government of Goa for the Konkani language. We conducted a very successful language technology conference in Goa, the International Conference on Universal Knowledge and Language (ICUKL), with generous support from MCIT-TDIL, in which the state of Goa participated very actively and supported the event wholeheartedly.

16. Publications

Dipali B. Choudhary, Sagar A. Tamhane and Rushikesh K. Joshi, "A Survey of Fonts and Encodings for Indian Language Scripts", International Conference on Multimedia and Design (ICMD 2002), Mumbai, September 2002.

Shachi Dave, Jignashu Parikh and Pushpak Bhattacharyya, "Interlingua Based English Hindi Machine Translation and Language Divergence", to appear in Journal of Machine Translation, vol. 17, 2002.

P. Bhattacharyya, "Knowledge Extraction from Texts in Multilingual Contexts", International Conference on Knowledge Engineering (IKE02), Las Vegas, USA, June 2002.

Hrishikesh Bokil and P. Bhattacharyya, "Language Independent Natural Language Generation from Universal Networking Language", Second International Symposium on Translation Support Systems, IIT Kanpur, India, March 2002.

Dipak Narayan, Debasri Chakrabarti, Prabhakar Pande and P. Bhattacharyya, "An Experience in Building the Indo WordNet - a WordNet for Hindi", First International Conference on Global WordNet, Mysore, India, January 2002.

Shachi Dave and P. Bhattacharyya, "Knowledge Extraction from Hindi Texts", Journal of the Institution of Electronics and Telecommunication Engineers, vol. 18, no. 4, July 2001.

P. Bhattacharyya, "Multilingual Information Processing Using Universal Networking Language", Indo-UK Workshop on Language Engineering for South Asian Languages (LESAL), Mumbai, India, April 2001.

17. The Team Members

Arti Sharma arti_sharma80@yahoo.com
Ashish F. Almeida ashishfa@cse.iitb.ac.in
Deepak Jagtap deepak@cse.iitb.ac.in
Gajanan Krishna Rane gkrane@cse.iitb.ac.in
Lata G. Popale lata@cse.iitb.ac.in
Jaya Saraswati jayas@cse.iitb.ac.in
Laxmi Kashyap laxmi_kashyap@hotmail.com
Madhura Bapat madhurasbapat@rediffmail.com
Manish Sinha manish@cse.iitb.ac.in
Prabhakar Pandey pande@cse.iitb.ac.in
Roopali Nikam roopali@cse.iitb.ac.in
Shraddha Kalele s_mahapurush@yahoo.com
Shushant Devlekar suSAMwa@yahoo.com
Sunil Kumar Dubey dubey@cse.iitb.ac.in
Satish A. Dethe satishd@cse.iitb.ac.in
Vasant Zende cukl2002@cse.iitb.ac.in
Veena Dixit veena@cse.iitb.ac.in

M.Tech students
Dipali B. Choudhary dipali@cse.iitb.ac.in
Nitin Verma nitinv@cse.iitb.ac.in
Sagar A. Tamhane sagar@cse.iitb.ac.in

Ph.D. student
Debasri Chakrabarti debasri@cse.iitb.ac.in

Courtesy: Prof. Pushpak Bhattacharyya
Indian Institute of Technology
Department of Computer Science & Engineering
Mumbai - 400 076
(RCILTS for Marathi & Konkani)
Tel: 00-91-22-25767718, 25722545 Extn. 5479, 25721955
E-mail : pb@cse.iitb.ernet.in
Resource Centre For
Indian Language Technology Solutions – Punjabi
Thapar Institute of Engineering & Technology, Patiala
Achievements
School of Mathematics and Computer Applications (SMCA)
Thapar Institute of Engineering & Technology
(Deemed University) Patiala-147001
Tel. : 00-91-175-2393382 E-mail : punjabirc@yahoo.com
Website : http://punjabirc.tiet.ac.in
RCILTS-Punjabi
Thapar Institute of Engineering & Technology, Patiala

Introduction

The Resource Centre for Indian Language Technology Solutions - Punjabi was established in April 2000 at Thapar Institute of Engineering & Technology, working under the TDIL Programme of the Department of Information Technology. The Resource Centre is supported by the Ministry of Communications and Information Technology (MCIT) with the aim of working for the technology development of Punjabi and also of providing access to millions of people across the globe. The main task before the Resource Centre has been the promotion of the Punjabi language so as to extend the benefit of knowledge and awareness across the globe. It has since evolved into a multi-faceted research centre. Before the Punjabi Resource Centre was set up, very little work had been done on the computerization of Punjabi, even though a large number of Punjabis settled in the USA, UK and Canada have been using computers for long. The only work done was the development of Punjabi fonts and some Punjabi websites. There was no Punjabi spell checker, Punjabi sorting utility, electronic dictionary or Gurmukhi OCR. In the span of three years, we have developed lexical resources for Punjabi, content creation tools and a Gurmukhi OCR, and have uploaded Punjabi text on the web. The details of the work done during the last three years are as follows:

1. Products Developed

1.1 Spell Checker : A spell checker is a basic necessity for composing text in any language. Till now, no spell checker was available for Punjabi, as a result of which people had to waste a lot of time proof reading Punjabi text. This problem was more severe for publishers and writers residing abroad who used computers to type Punjabi material, as even Punjabi dictionaries are not easily available to them. To make matters worse, there has been no standardization of Punjabi spellings. In fact there are numerous words which are spelled in more than one way, some being written in 3-4 different ways. After discussions with Punjabi scholars it was decided to retain multiple spellings for the commonly used words, and a spell checker for Punjabi has been developed taking Harkirat Singh's Shabad Jor Kosh as the base, with additional words taken from the dictionaries published by Punjabi University and the State Language Department and from the corpus developed by CIIL, Mysore.

For generating the suggestion list, a study was conducted to discover the most common errors made by Punjabi typists. A list of similar sounding words and consonants was also compiled, and a suggestion list using this knowledge was generated based on reverse edit distance. It was found that the right suggestion is presented as the default suggestion in the majority of cases. In such a case, the user needs only to confirm the default suggestion and proceed to the next error.

Fig. 1 : Spell Checker

Otherwise, the user needs to scroll through a list of
suggestions and pick one as the right one. The spell checker supports text typed in any of the popular Punjabi fonts as well as ISCII encoded files. The Punjabi spell checker is now complete, and an MOU for transfer of technology of the spell checker is soon to be signed with M/S Modular Systems. Output of the spell checker is shown in Fig. 1.
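The suggestion mechanism can be pictured as ranking dictionary words by edit distance from the mistyped word, the closest candidate becoming the default suggestion. The sketch below is a simplified stand-in (plain Levenshtein distance over a toy dictionary) for the centre's reverse-edit-distance method, and the sample words are invented.

    def edit_distance(a, b):
        # Classic dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def suggest(word, dictionary, k=5):
        # Closest dictionary words first; the first one is the default suggestion.
        return sorted(dictionary, key=lambda w: edit_distance(word, w))[:k]

    print(suggest("sprak", ["speak", "spark", "spray", "park"]))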
1.2 Font Converter : The major development in the computerization of Punjabi before the setting up of the Punjabi Resource Centre was the creation of Punjabi fonts. As a lot of Punjabis reside in the USA, Canada and the UK, a large variety of fonts were developed. In the absence of any standardization, each font had its own keyboard layout. The net result is that there is now chaos as far as the Punjabi language in electronic form is concerned. Neither can one exchange notes in Punjabi as conveniently as in the English language, nor can one perform search on texts in Punjabi. This is because the texts are stored in font dependent glyph codes, and the glyph coding schemes are typically different for different fonts. To alleviate this problem, a font conversion utility has been developed at the Resource Centre. The utility supports more than sixty different Punjabi fonts and thirty-two keyboard layouts. A user may type text in any font and, using this utility, later convert it to any other font. This utility will be a great help to publishers, writers and people exchanging text in different fonts.

1.3 Sorting Utility : Sorting and indexing is one of the basic necessities of a database management system, for tasks such as the maintenance of students' records or arranging a dictionary in alphabetic order. But unfortunately there does not exist any software for automatic sorting of Gurmukhi words, and all such work has to be done manually. The collating sequence provided by Unicode or ISCII is not adequate, as it is not compatible with the traditional sorting of Gurmukhi words. Gurmukhi, like other Indian languages, has a unique sorting mechanism. Unlike English, consonants and vowels have different priorities in sorting. Words are sorted by taking the consonant order as the first consideration and then the associated vowel order as the second consideration. In addition there is another complication: properly sorting "characters" in Gurmukhi often requires treating multiple (two or three) code points as a single sorting element. Thus we cannot depend on the character encoding order to get correct sorting; instead we had to develop, using the sorting rules of Gurmukhi, a collation function which converts a word into an intermediate form for sorting. After discussions with linguists and a detailed study of the alphabetic order in Punjabi, we have developed a general sorting algorithm which works on text encoded in any popular Gurmukhi font or ISCII.
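The collation-function idea, mapping each word to an intermediate sort key in which a consonant and its attached vowel sign contribute separate, differently weighted elements, might look like the following sketch. The three-letter alphabet and rank values are invented for illustration and do not reflect the actual Gurmukhi tables.

    CONSONANT_RANK = {"k": 1, "g": 2, "c": 3}     # toy consonant order
    VOWEL_RANK     = {"": 0, "a": 1, "i": 2}      # toy vowel-sign order

    def collation_key(word):
        # Convert a word into (consonant_rank, vowel_rank) pairs so that
        # consonants dominate the comparison and vowels break ties.
        key, i = [], 0
        while i < len(word):
            cons = word[i]; i += 1
            vowel = word[i] if i < len(word) and word[i] in VOWEL_RANK else ""
            i += len(vowel)
            key.append((CONSONANT_RANK[cons], VOWEL_RANK[vowel]))
        return key

    words = ["ki", "ka", "ga", "k"]
    print(sorted(words, key=collation_key))   # ['k', 'ka', 'ki', 'ga']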
1.4 Bilingual Punjabi/English Word Processor : A bilingual Punjabi/English word processor has been developed keeping in view the difficulties faced by users working on the currently available word processors. It was found that no word processor had been developed which addresses the word processing requirements typical for the Punjabi language, such as a Punjabi spell checker, a Punjabi thesaurus and dictionary, virtual Punjabi keyboards, and Punjabi sorting and font conversion utilities. In this direction, a Punjabi word processor, Likhari, was developed. Likhari supports word processing under the Windows environment and allows typing and processing in the Punjabi language through the common typewriter keyboard layout. It has MS-Word compatible features and commands. It provides a number of features that make the use of the Punjabi language on a computer easy, and a number of tools to increase the efficiency of the user. These tools include
a bilingual spell checker with suggestion list, onscreen keyboard layouts with composition reference for Punjabi language typing, bilingual search and replace, sorting as per the language alphabetical order, technical glossaries, and onscreen bilingual dictionaries. The main features of Likhari are :

• Very simple user interface.
• Online active keyboard for users who do not know how to type in Punjabi.
• Choice of Phonetic, Remington and Alphabetic keyboard layouts with composition reference.
• Bilingual spell checker for Punjabi and English.
• Bilingual search and replace.
• Support for sorting the text in English or Punjabi as per the language alphabetical order.
• Support for more than 60 commonly used Punjabi fonts.
• Support for features like tables, numbering, bullets, character & paragraph formatting, page setup, print preview, header and footer, etc.
• Online technical English-Punjabi glossary.
• Support for .ISCII, .TXT, .DOC, .RTF and .HTML file formats.
• Extensive help at various levels to make it easy for the user to learn.

Fig 2 : A Screen Shot of the Punjabi Word Processor Likhari

1.5 Gurmukhi OCR : Optical Character Recognition (OCR) is the process whereby typed or printed pages can be scanned into computer systems and their contents recognized and converted into machine-readable code. Text which the machine can read has great advantages over text merely displayed as an image, since it can be edited, exported to other programs and indexed for retrieval on any word. If one needs to electronically store and manipulate large amounts of text or printed matter, such as newspapers, contracts, letters, faxes and price lists, or for corpus development, OCR programs can save a lot of effort. For the first time, a complete OCR package for the Gurmukhi script has been developed. The OCR has a recognition accuracy of more than 97%. It can automatically detect and correct skewed pages, can detect page orientation, and can recognize multiple Punjabi fonts and sizes. The main features of the Gurmukhi OCR are:

Recognition Accuracy

Recognition accuracy is around 97% for books, photocopies and medium degraded documents, and around 99.5% for laser print outs and good quality documents.

Import Image

B/W images, 24-bit colour images, 256 grayscale images.

Output File Format

ISCII / text file encoded in any one of the popular Punjabi fonts.
Fonts Supported

All non-decorative Gurmukhi fonts.

Proofing

On-screen verifier.

Additional Features

Inbuilt spell checking facility. Automatic skew detection and correction (skew range -5 to +5 degrees). Upside-down image auto detection and correction.

Fig. 3 : A Screen shot of the Gurmukhi OCR

1.6 Gurmukhi to Shahmukhi Transliteration : The Punjabi language is used in both parts of Punjab, in India and Pakistan. In East Punjab (India), Punjabi is written in the Gurmukhi script. This script was invented by Guru Angad Dev ji and is written from left to right. In West Punjab (Pakistan), Punjabi is written in the Persian script, also known as the Shahmukhi script, which is written from right to left like Urdu and Persian. In East Punjab many Sanskrit words have been absorbed into the Punjabi language, and similarly in West Punjab many Persian and Urdu words are used in Punjabi. But the essence of the Punjabi language remains the same. Both Shahmukhi and Gurmukhi have been in use simultaneously. After the Partition, Punjabi on the Indian side of the border was restricted officially to the Gurmukhi script, and on the Pakistani side to the Shahmukhi script. The result is that a script-wall has come up between the two sides of the Punjab which prevents cultural and literary exchanges; there is ignorance on both sides about developments in contemporary prose and poetry on the other side. To break this wall, it is necessary to develop transliteration programs which can automatically convert Gurmukhi text to Shahmukhi and vice versa. In this direction the Punjabi Resource Centre has collaborated with the Urdu Resource Centre established at CDAC, Pune to develop a computer program which automatically converts Gurmukhi text into Shahmukhi. Using this software, a collection of short stories penned by Mr. K. S. Duggal in Gurmukhi has been converted into Shahmukhi (Figs 4-5).

Fig 4 : A Short Story in Gurmukhi

Fig 5 : The Short Story in Gurmukhi of Fig. 4 converted to Shahmukhi
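At its core, such a transliteration program is a character-mapping pass from Gurmukhi code points to the corresponding Shahmukhi (Perso-Arabic) letters, with contextual rules layered on top. Below is a heavily simplified sketch covering only three letters; the mapping table is illustrative, and the real system additionally handles the full alphabet and contextual shaping rules.

    # Toy Gurmukhi -> Shahmukhi letter map (three letters only, for illustration).
    G2S = {
        "\u0A15": "\u06A9",   # Gurmukhi KA -> Arabic KEHEH
        "\u0A2C": "\u0628",   # Gurmukhi BA -> Arabic BEH
        "\u0A30": "\u0631",   # Gurmukhi RA -> Arabic REH
    }

    def gurmukhi_to_shahmukhi(text):
        # Characters without a mapping pass through unchanged; a full system
        # adds contextual rules (vowel carriers, nasalization, word position).
        return "".join(G2S.get(ch, ch) for ch in text)

    print(gurmukhi_to_shahmukhi("\u0A15\u0A30"))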
2. Contents Uploaded on Internet

For the benefit of the common man, Punjabi literature, dictionaries and products for free download have been uploaded on the Resource Centre's website (http://punjabirc.tiet.ac.in). The contents uploaded on the internet are:

2.1 Literature : Punjabi classics such as Bullehshah Dian Kafian, Farid De Salok, Heer, Luna, Chandi di War and Japji Sahib have been uploaded in Punjabi using Punjabi dynamic fonts. The website contains detailed descriptions of these classics. Tool tips for difficult words have also been provided in Punjabi, and audio clips of these classics have been included.

2.2 Bilingual Dictionaries and Glossary : The following bilingual dictionaries and glossary have been uploaded at our site.

• Punjabi-English On-line Dictionary : A Punjabi-English dictionary is made available on the website. The dictionary has about 40,000 Punjabi words, with sample sentences for common words; audio clips of the pronunciation of Punjabi words have also been provided. The dictionary is accessible in both typewriter layout and phonetic layout. One can search for a complete match or a pattern in the dictionary.

• English-Punjabi On-line Dictionary : An English-Punjabi dictionary containing about 45,000 words has been made available on the website.

• Hindi-Punjabi On-line Dictionary : A Hindi-Punjabi dictionary containing about 45,000 words has also been made available on the website.

• Glossary of English-Punjabi administrative terms : A technical glossary of around 17,000 English-Punjabi administrative terms has also been
uploaded. A CD has also been developed, and the glossary has been installed on computers in many offices of the Punjab State Government.

2.3 On-Line Teaching of Punjabi : For the benefit of Punjabis settled abroad and others interested in learning Punjabi, a website for on-line teaching of Punjabi has also been developed. Work is complete for Gurmukhi orthography and Punjabi pronouncing rules, and a limited vocabulary in both text and pictorial format, along with audio effects, is also provided. The alphabets can be learnt in an animation pattern, as we draw them by hand.

2.4 On-Line Font Conversion Utility : An online font conversion utility has been provided. The user can paste Punjabi text encoded in any of the supported sixty Punjabi fonts and convert it to the desired Punjabi font.

2.5 Downloads

• Punjabi Spell Checker – A Punjabi-English spell checker has been uploaded on the home page; any user can download and install the spell checker on his system.

• Punjabi Fonts – Two Punjabi fonts (LIKHARI_P and LIKHARI_R) for phonetic and keyboard layouts have been developed and made available for free download. The fonts are available in both True Type and Dynamic formats.

3. Interaction with Punjab State Government

A five day training programme was organised for the staff of the State Language Department. They were given training in word processing, email and internet. We have been in constant touch with the Secretary IT, Punjab, and a demonstration of the products developed at our RC was given to him. The Punjabi word processor and Gurmukhi OCR were installed on the systems in the office of the Secretary IT, Punjab. We have been providing technical support to the State Language Department. Besides providing training to their staff, we have also set up their website in Punjabi. A CD of English-Punjabi administrative terms has been developed for the State Language Department, and the CD has been installed in the Patiala DC office and the State Revenue office.

4. Publications

1. G.S. Lehal, "State of Computerization of Punjabi", Proceedings Second World Punjabi Conference, Prince George, Canada (2003). (Accepted for publication)

2. G.S. Lehal, "A Gurmukhi Collation Algorithm", Journal of CSI. (Accepted for publication)

3. G.S. Lehal, Chandan Singh and Renu Dhir, "Structural feature-based approach for script identification of Gurmukhi and Roman characters and words", Document Recognition and Retrieval X, Proceedings SPIE, USA, Vol. 5010 (2003). (Accepted for publication)

4. G.S. Lehal and Chandan Singh, "A complete OCR system for Gurmukhi script", Structural, Syntactic and Statistical Pattern Recognition, T. Caelli, A. Amin, R.P.W. Duin, M. Kamel and D. de Ridder (Eds.), Lecture Notes in Computer Science, Vol. 2396, Springer-Verlag, Germany, pp. 344-352 (2002).

5. G.S. Lehal, Chandan Singh and Renu Dhir, "Automatic separation of Gurmukhi and Roman script words", Proceedings Indo-European Conference on Multilingual Communication Technologies, Pune, R.K. Arora, M. Kulkarni and H. Darbari (Editors), Tata McGraw-Hill, pp. 32-38 (2002).

6. G.S. Lehal and Chandan Singh, "A Post Processor for Gurmukhi OCR", SADHANA Academy Proceedings in Engineering Sciences, Vol. 27, Part 1, pp. 99-112 (2002).

7. G.S. Lehal and Chandan Singh, "Text segmentation of machine printed Gurmukhi script", Document Recognition and Retrieval VIII, Paul B. Kantor, Daniel P. Lopresti, Jiangying Zhou (Editors), Proceedings SPIE, USA, Vol. 4307, pp. 223-231 (2001).
8. G.S. Lehal and Chandan Singh, "A technique for segmentation of Gurmukhi text", Computer Analysis of Images and Patterns, W. Skarbek (Ed.), Lecture Notes in Computer Science, Vol. 2124, Springer-Verlag, Germany, pp. 191-200 (2001).

9. G.S. Lehal, Chandan Singh and Ritu Lehal, "A shape based post processor for Gurmukhi OCR", Proceedings of 6th International Conference on Document Analysis and Recognition, Seattle, USA, IEEE Computer Society Press, USA, pp. 1105-1109 (2001).

10. G.S. Lehal and Nivedan Bhatt, "A recognition system for Devnagri and English handwritten numerals", Advances in Multimodal Interfaces - ICMI 2000, T. Tan, Y. Shi and W. Gao (Editors), Lecture Notes in Computer Science, Vol. 1948, Springer-Verlag, Germany, pp. 442-449 (2000).

11. G.S. Lehal and Chandan Singh, "A Gurmukhi script recognition system", Proceedings 15th International Conference on Pattern Recognition, Barcelona, Spain, IEEE Computer Society Press, California, USA, Vol. 2, pp. 557-560 (2000).

5. The Team Members

Dr. R. K. Sharma rksharma@mail.tiet.ac.in
Dr. G. S. Lehal gslehal@mail.tiet.ac.in
Dr. Rajesh Kumar rakumar@mail.tiet.ac.in
Rajeev Kumar rkumar@mail.tiet.ac.in
Ramneet Mavi ramneet_mavi@rediffmail.com
Deepshikha Goyal deepshikha26us@yahoo.com
Nivedan Bhatt nivedanbhatt@rediffmail.com
Sukhwant Kaur sukhwant2010@yahoo.co.in
Ramanpreet Singh shallukalra@yahoo.co.in
Parneet Cheema NA
Pallavi Dixit NA
Rupsi Arora NA
Yoginder Sharma NA
Pooja Dhamija NA
Zameerpal Kaur NA
Dr. Kuljeet Kapoor NA
Dr. Devinder Singh NA
Karamjeet Kaur NA
Jaspal Singh NA
Manpreet Singh NA
Rakesh K. Dawra NA
Shallu Kalra NA
Kuldeep Kumar NA
Baljit Singh NA
Aarti Gupta NA
Sunita Sharma NA
Surjit Singh NA

Courtesy: Prof. R.K. Sharma
Thapar Institute of Engineering & Technology
Department of Computer Science & Engineering
Patiala 147 001
(RCILTS for Gurmukhi)
Tel: 00-91-175-2393137, 393374, 2283502
E-mail : rksharma@mail.tiet.ac.in
Resource Centre For
Indian Language Technology Solutions – Bengali
Indian Statistical Institute, Kolkata
Achievements
Computer Vision and Pattern Recognition Unit
203, Barrackpore Trunk Road
Indian Statistical Institute, Kolkata-700108
Tel. : 00-91-33-25778086 Extn. 2852
E-mail : bbc@isical.ac.in
Website : http://www.isical.ac.in/~rc_bangla
RCILTS-Bengali
Indian Statistical Institute, Kolkata

Introduction

The main objective of this project was resource and technology development for all aspects of language processing for Eastern Indian languages, particularly Bangla. It included corpus development, font generation, website generation, OCR development and information retrieval system development, as well as the training of people in Indian language technology through courses and workshops.

Outcome in Physical Terms

• Corpus of Bangla document images, Bangla text corpus in electronic form, and corpus of Bangla speech data
• Web site on Eastern Indian language technologies, including a Bangla language technology design guide
• Bangla font package
• OCR system for Oriya and Assamese
• Information retrieval system for Bangla electronic documents
• Multi-lingual script line separation system
• Automatic processing system for handprinted table-form documents
• Neural network based tools for printed document (in eastern regional scripts) processing

Production Agencies with which a Memorandum of Understanding/link up has been established

MoUs have been signed with (i) Webel Mediatronics Ltd, Kolkata (speech synthesis technology), (ii) Centre for Development of Advanced Computing, Pune (Devanagari and Bangla OCR technology), (iii) Electronics Research & Development Centre of India, Noida (Devanagari OCR technology), (iv) Orissa Computer Application Centre, Bhubaneswar (Oriya OCR technology), and (v) Indian Institute of Technology, Guwahati (Assamese OCR technology). A link up for data sources has been established with the Bangla Academy.

1. Core Activities

1.1 Web Site Development and The Language Design Guide

• Creation of a web site on Eastern Indian Languages and Language Technologies for the information of people interested in TDIL

The name of this site is Resource Centre for Indian Language Technology Solutions - Bangla, and its URL is http://www.isical.ac.in/~rc_bangla/. At present, the site contains details about this MIT project (Resource Centre for Indian Language Technology Solutions - Bangla) and a brief description of the products developed at the Indian Statistical Institute, Kolkata. A Bangla design guide is also provided on the site so that new technology for Bangla can maintain a common framework. It covers details of the origin and development of the script, the alphabet, character statistics, information related to fonts, presentation and storage considerations, information related to the Unicode encoding of Bangla, etc. The local language academy (Bangla Academy) has been contacted. A prototype of a Web-based front-end to our spell-checker and phonetic dictionary has been developed using an evaluation copy of CDAC's GIST Software Development Kit (SDK). An initiative has been taken so that this front-end can be put on the Internet for public use with the GIST SDK and iPlugin (or Modular Software's Shree Lipi). Issues related to hosting this web site on another server/Web hosting service are currently being explored.

1.2 Training Programmes

• Education through Training Programmes and Workshops

Two international workshops under the banner of "International Workshop on Technology Development in Indian Languages (IWTDIL)" were held during March 26-30, 2001 and January 22-24, 2003, respectively. Some main topics covered in these workshops were: (a) machine translation from English to Indian languages (particularly to Bangla),
(b) text corpus generation, design and annotation, (c) speech synthesis, processing and recognition, (d) OCR technology for Indian scripts, (e) handwritten character recognition, (f) document processing and analysis, (g) some general features of Indian languages (e.g., the phonology and acoustics of Bangla phonemes and diphthongs), and (h) anaphora resolution and ellipsis in Hindi and other Indian languages. In IWTDIL'01, the first three days had introductory talks and tutorials on various areas of language technology, and the last two days featured lectures by international experts on these subjects. A total of twenty-nine speakers presented talks at the workshop and there were seventy-five participants from India and abroad. In IWTDIL'03, four distinguished scientists from abroad and six from India delivered lectures to forty-five participants. A cultural programme consisting of a play by a renowned theatre group of Kolkata was also arranged.

One national workshop entitled "Indian Language Spell-checker Design" was organized on July 18-19, 2002. The main theme of the workshop was to present the work on spell-checkers done by various Resource Centres and groups for different Indian languages. The participants demonstrated the spell-checker software developed by them. It was anticipated that a benchmarking method for spell checkers would evolve out of the workshop. Twenty-five participants attended and presented lectures in the workshop.

2. Services

2.1 Corpus Development

• Printed Bangla Document Images & Ground Truth for OCR and Related Research

Some famous Bangla novels and books have been selected to prepare a Bangla document image corpus along with ground truth. A brief description of the selected novels and books is given below:

Maitreya Jatak – This is a famous Bangla mythological novel written by Bani Basu, a popular Bangla writer. As far as the linguistic aspect goes, the overall language of the book is old Bangla. A polished and chaste form of narration (Sadhu bhasa) is used throughout the book, and some archaic forms (terms mostly derived from Sanskrit and Prakrit sources) are also used, mainly to create an atmosphere of the period of Goutama Buddha, which the book depicts. Ananda Publishers, one of the largest publishers of Kolkata, has published this book. The quality of the printing is excellent. A computer generated font developed by Ananda Publishers has been used in printing.

Pratham Alo – This is also a famous Bangla novel, written by the famous Bangla writer Sunil Gangopadhyaya. The incidents of the novel are nearly one hundred and fifty years old and the theme is related to the Bangla renaissance. Standard Bangla language has been used for this book. The percentage of archaic words is much less than that in Maitreya Jatak. This book is also published by Ananda Publishers, but using offset printing technology. The overall printing quality is good, but worse than Maitreya Jatak.

Bhasa Desh Kal – The book is written by Dr. Pabitra Sarkar, a distinguished author on Bangla linguistics. The language of the book is standard colloquial Bangla. The publisher of the book is Mitra and Ghosh, Kolkata. The printing quality is moderate, worse than Pratham Alo.

Upendra Kishore Rachana Samagra – This book is written by the famous author of juvenile literature Shri Upendra Kishore Roychowdhuri. Old formal and polished Bangla language is used in this book. Here an offset font is used, and the book has been published by Ananda Publishers of Kolkata.

Amar Jibanananda Abiskar o Anyanya – Sunil Gangopadhyaya is the author of this book. The language of the book is standard colloquial Bangla. Ananda Publishers has published this book. The printing quality is good and offset printing technology is used.

Rajarshi – This is one of the classic novels written by Rabindranath Tagore. Old formal and polished Bangla language is used in this novel. Ashoke Book Agency of Kolkata is the publisher of the book. The printing quality is not as good as that of Ananda Publishers.
Bangla Choto Galpa – The book is a collection of Bangla short stories written by different writers of old Bangla literature and edited by Rabindranath Tagore. Old formal language (Sadhu bhasha) is used in most of the stories. The book is published by Model Publishing House. The printing quality is good.

Punarujjiban – It is the Bangla version of a novel written by the famous Russian novelist Lev Tolstoy. The book is published by Raduga Prakashan, Moscow. The printing quality is good and modern colloquial language is used.

Bhraman Amnibus – The book is written by Sri Uma Prasad Mukhopadhyaya. It contains detailed descriptions of different places in the Himalayas. Modern Bangla language is used in this book. It is published by Mitra & Ghosh, Kolkata. Offset printing technology is used.

Parakiya – This book is a collection of Bangla short stories written by both old and modern writers of Bangla literature and edited by Sunil Gangapadhyaya. The publisher of the book is Punashcha. Both old formal and modern colloquial language are used in different stories of the book. The printing quality is not so good.

Amar Debottar Sampatty – The book is an autobiography of Nirod Chandra Chaudhuri, published by Ananda Publishers. Offset printing technology and old formal language are used.

Prabasi Pakhi – It is a Bangla story book written by Sunil Gangapadhaya and published by Ananda Publishers. Offset printing technology and modern colloquial language are used.

Mahabharat Katha – The book is published by the Udbodhan publication of Ramkrishna Mission. Modern Bangla language has been used in this book. (Table 1 is the list of books scanned.)

OCR – An HP ScanJet flatbed scanner of high resolution (300 dpi, with gray level spectral resolution) has been used to get the image documents. The images are saved in uncompressed TIFF format and are generally scanned using the same settings. Minor adjustments of brightness and contrast may be done using the Corel Photopaint software.

Table 1

Name of Book              No. of Pages Scanned    Total No. of Words
Maitreya Jatak                     419                 2,43,020
Pratham Alo                         37                   18,833
Bhasa Desh Kal                      15                    5,760
Upendra Kishore                    330                 1,38,270
Amar Jibanananda                    21                    7,812
Rajarshi                           100                   39,800
Bangla Chotogalpa                  260                 1,11,800
Punarujjiban                        68                   25,704
Bhraman Amnibus                     31                   14,198
Parakiya                           163                   85,412
Amar Debottar Sampatty              29                   10,730
Prabasi Pakhi                       10                    4,020
Mahabharat Katha                     4                    1,304
Total                            1,487                 7,06,663

The page-by-page scanned images are then fed as input to the Bangla module of the existing bilingual (Hindi and Bangla) Optical Character Recognition (OCR) system developed in our department. Each character of the input images is recognized by this OCR system, and the results are stored as text documents in 8-bit ISCII format.

The present OCR system gives an accuracy of 96% to 98% at the character level, depending on the paper quality and font style of the books. The remaining 2% to 4% of errors are corrected manually to prepare the text documents, which serve as ground truth.

A benchmark software has also been developed to give a specified format to all ground truths generated using Indian OCR technologies, for the automatic evaluation of the different OCR technologies. The database formed in this way is shown in tabular form in Table 1.
Figure 1: The image is at the left hand side and the corresponding ground truth is at the right hand side

• Development of a Bangla Text Corpus in Electronic Form, including a Bangla dictionary and several Bangla classics

Several novels have been entered into the computer in ISCII format. The descriptions of the novels, the corresponding authors and the total number of words are given in tabular form (Table 2). More than 34,000 words of a bilingual (Bangla-English) dictionary have also been entered. A comprehensive corpus for an electronic Bangla-Bangla dictionary (with 65,000 words) has been constructed and checked. Using the guidelines of the above two dictionaries, the creation of a trilingual (Bangla-English-Hindi) dictionary has started. Till today, 17,000 words with meaning, parts of speech and other information have been entered. However, the lack of a standard size Bangla-Hindi dictionary in printed form has created some problems (the Bangla-Hindi dictionaries available in the market are small, containing only about 7,000 words). We have also designed a prototype of an electronic thesaurus for Bangla (based on WordNet, a well-known electronic resource for the English language).

Papers Published

• Dash, N.S. and Chaudhuri, B.B. (2001) "A corpus based study of the Bangla language". Indian Journal of Linguistics. 20: 19-40.

• Dash, N.S. and Chaudhuri, B.B. (2002) "Corpus generation and text processing". International Journal of Dravidian Linguistics. 31(1): 25-44.

Table 2

1. Author: Iswarchandra Vidyasagar is known as one of the great social reformers and philanthropists, and as the father of Bengali prose style in Bengal. He played an important role in primary education, widow remarriage and women's education.
Classic & Publisher: Vidyasagar Rachanabali (Tuli – Kalam)
Articles (story, novel, drama, etc.): Akhyanmanjari, Betal Panchabingshati, Kathamala, Mahabharat, Niskritilabh Prayash, Provaboti Sambhashan, Sakuntala, Sanskrit Bhasa O Sankriti, Sitar Banobas, Balyabibaher Dosh, Ramer Rajjyabhishek, Bidhababibaha, Jibancharita, Vrantibilas, Banglar Itihas, Charitabali, Bodhadaya
No. of Words: Sub. Total : 315783

2. Author: Bankim Ch. Chattopadhyay is considered the first and one of the greatest novelists in the Bangla language. He has written 14 novels besides a large number of
essays on various literary and social issues. He wrote the 'Vande Mataram' song, which played a major role in the Indian independence struggle.
Classic & Publisher: Bankim Rachanabali (Patrajo Publication)
Articles (story, novel, drama, etc.): Anandamath, Bishbriksha, Durgeshnandini, Kamalakanta, Bigyanrahosya, Muchiram Gur, Lokrahasya, Samya, Bibidha Prabandha (Vol. 1), Bibidha Prabandha (Vol. 2), Krishnacharita (Vol. 1), Krishnacharita (Vol. 2)
No. of Words: Sub. Total : 320232

3. Author: Sarat Chandra Chattopadhyay is probably the most popular novelist of Bengal. His novels sympathetically depicted the weaker sections of society, including women.
Classic & Publisher: Sarat Rachanabali (Sarat Samiti)
Articles (story, novel, drama, etc.): Debdas, Pallisamaj, Srikanta
No. of Words: Sub. Total : 246032

4. Author: Michael Madhushudan Dutt was a great poet and playwright of Bengal. He wrote a few English sonnets and poems, and introduced the "Amitrakshar" rhyme into Bangla poetry. His famous work is the epic named 'Meghnadbadh'.
Classic & Publisher: Madhushudan Rachanabali (Kallol Prakashani)
Articles (story, novel, drama, etc.): Buroshaliker Ghare Row, Krishnakumerir Natak, Padhyabatir Natak
No. of Words: Sub. Total : 54089

5. Author: B. B. Chaudhuri
Classic & Publisher: Computer Dictionary (Ananda Publishers)
No. of Words: Sub. Total : 95435

6. Author: S. Biswas et al.
Classic & Publisher: Samsad Dictionary (Sahittya Samsad)
No. of Words: Sub. Total : 167941

Grand Total : 883729

• Electronic Corpus of Speech Data

The composition of existing speech databases for English has been studied. Speech data has been categorized into several classes based on criteria such as the sex, age and region of the speaker, the place of data collection, whether the source of the spoken material is a written script, etc. Based on these, the composition of the database has been designed. All India Radio, Calcutta has been designated as a potential source of audio material. However, the speech lab, where the data could be generated in a controlled environment, is yet to be set up due to the delayed sanction of the grant.
2.2 Font Generation and Associated Tools

• Public Domain Bangla Font Generation

The overall font generation and editor development process can be divided into the following modules:

1. Designing the exhaustive glyph set
2. Converter program from font file to ISCII file and vice versa
3. Designing a Bangla text editor
4. Designing a floating keyboard

Background of Designing the Exhaustive Glyph Set : Every language has its own character set, and many of them can be represented within 256 characters using individual 8-bit character sets, including the Indic scripts using the ISCII code. However, for the unification of all these languages, the Unicode consortium recently proposed a 16-bit character encoding scheme, in which characters can be assigned to 2^16 code points. The Government of India is a member of the Unicode Consortium and has been engaged in a dialogue with the UTC about additional characters in the Indic blocks and improvements to the textual descriptions and annotations. Unicode is designed to be a multilingual encoding that requires no escape sequences or switching between scripts. For any given Indic script, the consonant and vowel letter codes of Unicode are based on ISCII, so they correspond directly. For Bangla, Unicode 3.0 provides the code points 0980 to 09FF. One Bangla True Type font (called ISI.ttf) has been developed at this centre; it is compliant with ISCII and can be upgraded to support Unicode. The font was designed and generated with the help of ALTSYS Fontographer 3.0. The following Bangla orthographic characters can be written with this font:

1. Vowels (11)
2. Consonants (38)
3. Vowel matras (10)
4. Halant
5. Conjuncts: Bangla contains numerous conjuncts (250+), which essentially are clusters of up to four consonants without the intervening implicit vowels. The shapes of these conjuncts can differ from those of the constituting consonants.
6. Punctuation (12)
7. Numerals (10)

Converter Program from Font File to ISCII File and Vice-versa : The converter program has two submodules. The first one takes a font encoded string as input and delivers an ISCII encoded string as output. The second one takes an ISCII encoded string as input and gives a font encoded string as output.

Bangla Text Editor : Along with the font, a Bangla editor has been developed. It supports the ISI.ttf font. A web version is generated using Bitstream's web-font wizard to display text in Bangla within a web browser. The editor provides standard editing features such as cut, copy, paste, select, select all, file operations, bold, italic and underlined text, superscript and subscript, left, right and centre alignments, find and replace of specific strings, etc. The editor can save the content in plain text, RTF and ISCII formats. A floating keyboard can be invoked on demand by a new user to get the keying information for writing Bangla in this editor.

Designing a Floating Keyboard : The floating keyboard is illustrated below.

Figure 2: Standard Bangla DTP typewriter layout
• Bangla Spell-Checker

The Bangla spell-checker is a tool for detecting errors in Bangla words and correcting them by providing a set of correct alternatives which includes the intended word. An erroneous word can belong to one of two distinct categories, namely non-word error and real-word error. Let a string of characters separated by spaces or punctuation marks be called a candidate string. A candidate string is a valid word if it carries a meaning; a meaningless string is a non-word. A real-word error means a valid word which is not the intended one in the sentence; it makes the sentence syntactically or semantically ill-formed or incorrect. In both cases, the problem is to detect the erroneous word and either suggest correct alternatives or automatically replace it with the appropriate word. In this spell-checker, only non-word errors are considered.

Word errors can be classified into four major types, namely substitution, deletion, insertion and transposition errors. In Bangla, wrong use of characters which are phonetically similar to the correct ones is observed. A great deal of confusion occurs in the use of long and short vowels, aspirated and unaspirated consonants, and the dental and cerebral nasal consonants, due to phonetic similarity. Another type of error is the typographic error, which is caused by an accidental slip of the fingers onto keys which are neighbours of the intended key.

In this spell-checker, the main technique of error detection is based on matching the candidate string in the normal as well as in the reversed dictionary (the publications below may be referred to). To make the system more powerful, this approach is combined with a phonetic similarity key based approach, where phonetically similar characters are mapped onto a single symbol and a nearly-phonetic dictionary of words is formed. Using this dictionary, phonetic errors can be easily detected and corrected. A candidate string first passes through the phonetic dictionary. If the word is not found in the dictionary and the system also fails to give a suggestion, it tries to divide the word into a root part and a suffix part by verifying each separately. If an error is found, the spell-checker attempts to provide suggestions. If it fails, it checks whether the string is a conjunct word generated by appending two noun words and a suffix. An option for adding new words permanently or temporarily is provided in the spell checker.

For the spell-checker, several files containing root-words and suffix words are maintained. The main dictionary contains about 60,000 root-words and 100,000 inflected words. Noun and verb suffix files are also used. The spell-checker works fast and the non-word errors are all correctly detected, but it produces about 5% false alarms. This is mainly due to conjunct words formed by euphony and assimilation, as well as proper nouns in the corpus.

Figure 3: Standard Bangla DTP typewriter layout with the support of a spell checker

Reference

• B. B. Chaudhuri and T. Pal, "Detection of word error position and correction using reverse word dictionary", Intl. Conf. on Computational Linguistics, Speech and Document Processing (ICCLSDP'98), February 18-20, 1998, pp. C41-C46.

• B. B. Chaudhuri, "A Novel Spell-checker for Bangla Text Based on Reversed-Word Dictionary", Vivek, Vol. 14(4), pp. 3-12, October 2002.
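The reversed-word dictionary idea can be sketched briefly: storing each valid word in reversed form as well lets the checker match the longest correct prefix from the front and the longest correct suffix from the back, bracketing the likely error position between the two. The sketch below is an assumed illustration of this principle over a toy dictionary, not the actual implementation.

    def error_region(word, dictionary):
        # Longest dictionary-consistent prefix, scanning forward.
        fwd = {w[:k] for w in dictionary for k in range(len(w) + 1)}
        # Longest consistent suffix, via the reversed-word dictionary.
        rev = {w[::-1][:k] for w in dictionary for k in range(len(w) + 1)}
        p = max(k for k in range(len(word) + 1) if word[:k] in fwd)
        s = max(k for k in range(len(word) + 1) if word[::-1][:k] in rev)
        # The error most likely lies between the two matched ends.
        return (p, len(word) - s)

    dictionary = {"kolkata", "kolam"}
    print(error_region("kolxata", dictionary))   # (3, 4): the 'x' is suspect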
3. Products

Previously, Bangla and Devanagari OCR systems were developed at the CVPR Unit of the Indian Statistical Institute, and these core technologies were transferred to industry for commercialization. The latest
developments are an Oriya OCR and an Assamese OCR. These are described below.

3.1 OCR System for Oriya

Purpose

The purpose of this system is to recognize printed Oriya script automatically.

Summary of the System

In this recognition system, the document image is first captured using a flatbed scanner. The image is then passed through different preprocessing modules like skew correction, line segmentation, zone detection, and word and character segmentation. Next, individual characters are recognized using a combination of stroke and run-number based features, along with features obtained from the concept of water overflow from a reservoir. These techniques are discussed in greater detail in the following.

System Description

Text Digitization and Noise Cleaning: Text digitization is done using a flatbed scanner (Model: HP ScanJet 660C) at a resolution varying from 200 to 300 dots per inch (dpi). The digitized images are in gray tone, and a histogram-based thresholding approach is used to convert them into two-tone images. For a clear document, the histogram shows two reasonably prominent peaks corresponding to the white and black regions. The threshold value is chosen as the midpoint between the two peaks of the histogram. The two-tone image is converted into 0-1 labels, where 1 and 0 represent object and background, respectively. The digitized image shows protrusions and dents in the characters, as well as isolated black pixels over the background; these are cleaned by a morphological smoothing approach.

Figure 4: Uppermost and lowermost points of components in a skewed text line

Skew Detection and Correction: When a document is fed to the scanner either mechanically or by a human operator, a few degrees of skew (tilt) is unavoidable. The skew angle is the angle that the text lines in the digital image make with the horizontal direction. Skew detection and correction are important preprocessing steps of document layout analysis and OCR approaches. Skew correction can be achieved in two steps, namely (i) estimation of the skew angle, and (ii) rotation of the image by the skew angle in the opposite direction. Here, a Hough transform based technique is used for estimating the skew angle of Oriya documents. It is observed that the uppermost and lowermost points of most of the characters in an Oriya text line lie on the mean line and base line, respectively. The lowermost and uppermost points of characters in a skewed Oriya text are shown in Figure 4. To reduce the amount of data to be processed by the Hough transform, only the uppermost and lowermost pixels of each component are considered. First, the connected components in a given image are identified. For each component, its bounding box (the minimum upright rectangle containing the component) is defined. The mean width of the bounding boxes, bm, is also computed. Next, components having bounding box width greater than or equal to bm are retained. By thresholding at bm, small components like dots, punctuation marks and small modified characters are mostly filtered out. Because of this filtering process, the irrelevant components cannot create errors in the skew estimation. Now, the usual Hough transform technique is used on these points to get the skew angle of the document. The image is then rotated according to the detected skew angle. Font style and size variations do not affect the proposed skew estimation method. Also, the approach is not limited to any range of skew angles.
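A compact sketch of Hough-based skew estimation over the selected extreme points is given below; the accumulator resolution and angle range are illustrative choices, not the system's actual parameters.

    import math
    from collections import Counter

    def estimate_skew(points, angle_step=0.5, angle_range=15.0):
        # Hough voting in (angle, rho) space over the uppermost/lowermost
        # component pixels; collinear points make the true angle win.
        votes = Counter()
        steps = int(angle_range / angle_step)
        for x, y in points:
            for k in range(-steps, steps + 1):
                a = k * angle_step
                t = math.radians(a)
                rho = round(y * math.cos(t) - x * math.sin(t))
                votes[(a, rho)] += 1
        (angle, _), _ = votes.most_common(1)[0]
        return angle   # rotate the image by -angle to deskew

    # Synthetic test: points along a text line skewed by 2 degrees.
    pts = [(x, round(100 + x * math.tan(math.radians(2)))) for x in range(0, 200, 5)]
    print(estimate_skew(pts))   # close to 2.0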
Line, Word and Character Segmentation: For the convenience of recognition, the OCR system should automatically detect the individual text lines, segment the words from each line, and then segment the characters in each word accurately. Since Oriya text lines can be partitioned into three zones (Figure 5), it is convenient to distinguish these zones. Character recognition becomes easier if the zones are
distinguished, because the lower zone contains only modifiers and the halant marker, while the upper zone contains modifiers and portions of some basic characters.

Figure 5: Zones in an Oriya text line

Text Line Detection and Zone Separation: The lines of a text block are segmented by finding the valleys of the projection profile computed by counting the number of black pixels in each row. The trough between two consecutive peaks in this profile denotes the boundary between two text lines. A text line can be found between two consecutive boundary lines. After line segmentation, the zones in each line are detected. From Figure 6 it can be seen that the upper zone is separated from the middle zone of a text line by the mean line, and that the middle zone is separated from the lower zone by the base line. The uppermost and lowermost points of the connected components in a text line are used to detect the mean line and base line, respectively. A set of horizontal lines passing through the uppermost and lowermost points of the components is considered. The horizontal line that passes through the maximum number of uppermost points (lowermost points) is the mean line (base line). It should be noted that the uppermost and lowermost points of the components were already detected during skew detection, so these points do not have to be recalculated during zone detection.

Figure 6: Projection profile of rows in Oriya text lines (dotted lines show line boundaries)

Word and Character Segmentation: After a text line is segmented, it is scanned vertically, column by column. If a column contains two or fewer black pixels, the scan is denoted by 0; otherwise the scan is denoted by the number of black pixels in that column. In this way, a vertical projection profile is constructed. Now, if there exists in the profile a run of at least k consecutive 0s, then the midpoint of that run is considered as the boundary between two words. The value of k is taken as 2/3 of the text line height (the text line height is the normal distance between the mean line and the base line). To segment each word into individual characters, only the middle zone of the word is considered. To find the boundary between characters, the image is scanned in the vertical direction starting from the mean line of the word. If during a scan the base line is reached without encountering any black pixel, then this scan marks the boundary between two characters. However, the gray-tone to two-tone conversion of the image gives rise to some touching characters, which cannot be segmented using this method. To segment these touching characters, the principle of water overflow from a reservoir is used, which is as follows. If water is poured on top of the character, the positions where water will accumulate are considered as reservoirs. Figure 7 shows the location of reservoirs in a single character as well as in a pair of touching characters. The height of the water level in the reservoir, the direction of water overflow from the reservoir, the position of the reservoir with respect to the character bounding box, etc. are noted. A reservoir whose height is small and which lies in the upper part of the middle zone of a line is considered as a candidate reservoir for touching character segmentation. The cusp (lowermost point) of the candidate reservoir is considered as the separation point of the touching characters. In Figure 7, this position is marked by a vertical line. Because of the round shape of most Oriya characters, it is observed that such a reservoir is formed in most cases when two characters touch each other. Sometimes, two or more reservoirs may be formed. In such cases, the reservoir closest to the middle of the bounding box is selected for segmentation.

60
Figure 7: Water reservoirs in a single and touching
Oriya characters
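The word-boundary rule above translates directly into code. The following C sketch assumes the vertical projection profile has already been built with the two-or-fewer-black-pixels rule (so a gap column is stored as 0), and takes k, nominally 2/3 of the text line height, as a parameter.

    #include <stddef.h>

    /* Find word boundaries in one text line from its vertical
     * projection profile: a run of at least k consecutive 0s marks a
     * word gap, and the midpoint of the run is taken as the boundary.
     * Returns the number of boundaries written to `bounds`. */
    size_t word_boundaries(const int *profile, size_t width,
                           size_t k, size_t *bounds, size_t max_bounds)
    {
        size_t i, run_start = 0, n = 0;
        int in_run = 0;

        for (i = 0; i <= width; i++) {
            int zero = (i < width) && (profile[i] == 0);
            if (zero && !in_run) {          /* a gap run begins       */
                in_run = 1;
                run_start = i;
            } else if (!zero && in_run) {   /* a gap run ends         */
                in_run = 0;
                if (i - run_start >= k && n < max_bounds)
                    bounds[n++] = run_start + (i - run_start) / 2;
            }
        }
        return n;
    }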
Feature Selection and Detection: Topological features, stroke-based features, as well as features obtained from the concept of water overflow are considered for character recognition. These features are known as principal features. The features are chosen with the following considerations: (a) robustness, accuracy and simplicity of detection, (b) speed of computation, (c) independence of size and fonts, and (d) the needs of tree classifier design. Stroke-based and topological features are considered for the initial classification of characters. These features are used to design a tree classifier where the decision at each node of the tree is taken on the basis of the presence/absence of a particular feature (see Figure 8). Stroke-based features include the number and position of vertical lines. The topological features used include the existence of holes and their number, the position of holes with respect to the character bounding box, the ratio of hole height to character height, etc. In addition, the concept of water overflow from a reservoir is also used. The reservoirs in a character are identified, and the position of the reservoirs with respect to the character bounding box, the height of each reservoir, the direction of water overflow, etc., are used as features in the recognition scheme.

Figure 8: Portion of the tree classifier for Oriya characters
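The presence/absence decisions of such a tree classifier can be captured in a small data structure. The sketch below is hypothetical: struct character, the feature-test callbacks and the traversal are illustrative assumptions, not the system's actual code.

    struct character;                /* opaque: the candidate glyph   */

    /* One node of a feature-based tree classifier: each internal node
     * tests the presence/absence of one principal feature; leaves
     * carry a (sub)class label.  Internal nodes have both children. */
    struct tree_node {
        int (*has_feature)(const struct character *c);
        struct tree_node *present;   /* branch if feature is present  */
        struct tree_node *absent;    /* branch if feature is absent   */
        int class_label;             /* valid only at leaf nodes      */
    };

    int classify(const struct tree_node *node, const struct character *c)
    {
        while (node->present || node->absent)
            node = node->has_feature(c) ? node->present : node->absent;
        return node->class_label;
    }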
Performance

Line Segmentation: Our system identifies individual text lines with an accuracy of 97.5%.

Word Segmentation: The overall word segmentation accuracy of the system is 97.7%. The error rate for the inferior documents is 4.2%, whereas for the good-quality documents it is only 1.2% (these figures were calculated based on correctly segmented text lines only).

Character Segmentation: The character segmentation accuracy of the system is 97.2%. The proposed method for separating touching characters based on the water reservoir concept is generally successful.

Character Recognition: On average, the system recognizes characters with an accuracy of about 96.3%, i.e., the overall error rate is 3.7%.

Platform

This system is developed in the C language on the UNIX platform. A WINDOWS-based version is also designed.

Portability

The system can run on any UNIX and WINDOWS platform.

3.2 Adaptation of Bangla OCR to Assamese

Background

We have already developed an efficient OCR system for printed documents in Bangla. Since
Assamese and Bangla share the same script, this
OCR system can be successfully used for Assamese
documents after some modifications. The modifications are needed mainly in the post-processing stage, where language-specific OCR error correction is required.
Summary of the System

The segmentation of a document image into lines, words, and characters, and the recognition of segmented characters, depend on the script only. Thus, the modules of the existing OCR system for Bangla can be used for Assamese OCR. However, certain post-processing steps after basic recognition are required in order to improve OCR accuracy. For example, the words in the output of the OCR system may be looked up in a lexicon; a correctly recognized word will be present in the lexicon, whereas an incorrectly recognized word will usually not be found. The incorrect word can then be replaced by the lexicon word nearest to it. This post-processing is obviously language-dependent. In this project, we have made the necessary modifications in our OCR system so that it can be used on Assamese documents.
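The report does not state which distance measure is used to find the nearest lexicon word; a common choice is the Levenshtein edit distance, so the following C sketch should be read as one plausible realization rather than the project's actual post-processor.

    #include <string.h>

    /* Levenshtein edit distance between two words (sketch only;
     * assumes words shorter than 64 characters). */
    static int edit_distance(const char *a, const char *b)
    {
        int la = (int)strlen(a), lb = (int)strlen(b);
        int d[64][64];
        int i, j;

        if (la >= 64 || lb >= 64)
            return la + lb;              /* fall back for long words  */
        for (i = 0; i <= la; i++) d[i][0] = i;
        for (j = 0; j <= lb; j++) d[0][j] = j;
        for (i = 1; i <= la; i++)
            for (j = 1; j <= lb; j++) {
                int cost = (a[i-1] == b[j-1]) ? 0 : 1;
                int best = d[i-1][j] + 1;                   /* delete */
                if (d[i][j-1] + 1 < best) best = d[i][j-1] + 1; /* ins */
                if (d[i-1][j-1] + cost < best) best = d[i-1][j-1] + cost;
                d[i][j] = best;
            }
        return d[la][lb];
    }

    /* Replace an unrecognized word by the nearest lexicon word. */
    const char *nearest_word(const char *word,
                             const char *const *lexicon, int n)
    {
        int i, best = -1, best_d = 1 << 30;
        for (i = 0; i < n; i++) {
            int dist = edit_distance(word, lexicon[i]);
            if (dist < best_d) { best_d = dist; best = i; }
        }
        return (best >= 0) ? lexicon[best] : word;
    }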
System Description

The major components of the system are the same as those of Bangla. Details of these components can be found in the references listed below. Modifications are done in the following modules:

1. Update of the Symbol List: The symbol list for the Bangla script is updated to contain the Assamese "ra" and "wa" and all the conjuncts involving these two characters. Other characters remain unaltered.

2. Formation of the Prototype Library: The prototype library used in Bangla OCR is modified by adding new character shapes found in the Assamese script. Character shapes not appearing in the Assamese script are deleted from the library.

3. Design of the Post-Processing Module: Post-processing in Bangla OCR is done using a lexicon of the Bangla language. The same module can be used to do post-processing for Assamese also. However, a lexicon of the Assamese language is needed for this purpose. This activity has been taken up by IIT, Guwahati.

Figure 9. A sample output from the Assamese OCR System

Performance

Test pages are selected from three Assamese books. Pages are scanned at 300 dpi. In total, 50 pages are used in the testing phase. Analysis of the test results shows a character-level accuracy of about 95%. Since the fonts used for printing Assamese materials are somewhat different from the fonts used in Bangla, generation of a new prototype library considering the major Assamese fonts would improve the overall accuracy of the system.

Technology Transfer

The source code of the system, along with the technical details, has been transferred to IIT, Guwahati.
Technical Reports/Correspondence

An MCA student, Anirban Mukherjee, did work on this project for his MCA dissertation submitted to Indira Gandhi National Open University (IGNOU), New Delhi. The title of his work is "Development of an Optical Character Recognition System for Assamese Script".

References

1. B.B. Chaudhuri and U. Pal, "A complete printed Bangla OCR system", Pattern Recognition, vol. 31, pp. 531-549, 1998.

2. U. Garain and B.B. Chaudhuri, "Segmentation of Touching Characters in Printed Devnagari and Bangla Scripts using Fuzzy Multifactorial Analysis", IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 32, no. 4, pp. 449-459, 2002.

3. A. Ray Chaudhuri, A. Mandal, and B.B. Chaudhuri, "Page Layout Analyzer for Multilingual Indian Documents", Proc. Language Engineering Conference, IEEE CS Press, 2002.

3.3 Information Retrieval System for Bangla Documents

Background

Digital information is available in various forms, such as text, image and speech data, or multimedia content. Among these, text information is considerably abundant and can be easily created. Passage retrieval from text documents has been gaining momentum over document retrieval for the last several years. A document ranker returns whole documents, from which it is often infeasible to search for and extract the necessary information. Passage retrieval, on the other hand, returns fixed or variable sized text chunks from the document(s) where the information is likely to reside. This saves both time and effort when searching a huge text document corpus.
Summary of the System

A prototype n-gram based language identifier for identifying Indian languages has been developed [1]. A prototype natural language text indexer for passage retrieval has been developed for Bangla.

Figure 10. A sample output from the prototype developed. The question "what is Pythagoras theorem" was submitted in Bangla, and the system retrieves and ranks passages as output.

System Description

Note that Indian languages can be grouped into five categories based on their origins: Indo-European (Hindi, Bangla, Marathi, etc.), Dravidian (Tamil, Telugu, etc.), Tibeto-Burmese (e.g., Khasi), Austro-Asiatic (Santhali, Mundari, etc.) and Sino-Tibetan (e.g., Bhutanese). Languages within a group share a number of common elements. For instance, there is a significant overlap between the vocabularies of Bangla and other Indo-European languages, so their profiles are mutually closer than the profiles of a pair of languages from two different groups. We have tested the character-level n-gram algorithms for language identification on a multilingual collection of Indian language documents. Also, a prototype of an "English to Bangla" phonetic transliteration scheme has been designed and implemented for cross-lingual information retrieval. Part of the n-gram distance matrix between every pair of Indian languages is shown in Table 3.
Table 3. The n-gram distances between some major Indian languages

Profile     Bangla   Hindi    Kannada  Kashmiri  Malayalam  Telugu   Urdu
Bangla      0        16.54    19.42    23.66     20.27      19.08    24.01
Hindi       16.54    0        18.40    23.65     19.74      18.58    24.12
Kannada     19.42    18.40    0        23.80     18.11      16.65    24.09
Kashmiri    23.66    23.65    23.80    0         24.02      23.88    19.54
Malayalam   20.27    19.74    18.11    24.02     0          18.07    24.29
Telugu      19.08    18.58    16.65    23.88     18.07      0        24.15
Urdu        24.01    24.12    24.09    19.54     24.29      24.15    0
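A character-level n-gram identifier of the kind described can be sketched in C as follows. The hash-bucketed trigram table and the squared-difference distance are illustrative assumptions; the report does not give the exact measure behind Table 3, and profiles should be frequency-normalized before they are compared.

    #include <string.h>

    #define NGRAM 3
    #define VOCAB 4096               /* hash buckets for n-gram counts */

    /* Accumulate character trigram counts of `text` into `profile`.
     * A simple rolling hash stands in for a real n-gram table here.  */
    void ngram_profile(const char *text, double profile[VOCAB])
    {
        size_t i, len = strlen(text);
        for (i = 0; i + NGRAM <= len; i++) {
            unsigned h = 0;
            int j;
            for (j = 0; j < NGRAM; j++)
                h = h * 31u + (unsigned char)text[i + j];
            profile[h % VOCAB] += 1.0;
        }
    }

    /* Distance between two frequency-normalized profiles; the language
     * of an input text is the one whose stored profile is nearest.    */
    double profile_distance(const double a[VOCAB], const double b[VOCAB])
    {
        double d = 0.0;
        int i;
        for (i = 0; i < VOCAB; i++) {
            double diff = a[i] - b[i];
            d += diff * diff;
        }
        return d;
    }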
A passage detection and ranking algorithm for Bangla text has been designed and implemented. Stop words are common words that are ignored by search engines at the time of searching; these words generally do not carry any information. For constructing Bangla search engines, about 500 stop words were identified by combining statistical and manual methods. The DoE Bangla corpus was used for this purpose.

The indexer generates an indexed file, which keeps each record in the following form:

Document No., Term, Term_weight (f_tD / N_D), Frequency of the term in the whole corpus (f_t), Occurrence positions of the term

Document No.: The identity number of the document in the whole corpus.

Term weight: The ratio of the term frequency for a particular term in a document (f_tD) to the document length in bytes (N_D).

Occurrence positions of the term: The numerical values of the positions of the term, counted from the beginning of the document. The position of the first term of the document is one. So for each term, f_tD occurrence positions are listed.

A prototype of the passage detection algorithm has been developed, and a few standard passage-ranking algorithms have also been tested.
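The index record format above can be mirrored in a small structure; the field names and types below are hypothetical, chosen only to illustrate the record layout.

    /* A hypothetical in-memory form of the index record described
     * above: one entry per (document, term) pair, with positions.   */
    struct index_entry {
        long   doc_no;       /* identity number of the document       */
        char   term[64];     /* the indexed term                      */
        double term_weight;  /* f_tD / N_D: term freq. over doc size  */
        long   corpus_freq;  /* f_t: frequency in the whole corpus    */
        long  *positions;    /* f_tD occurrence positions, 1-based    */
        int    n_positions;  /* equals f_tD                           */
    };

    /* Term weight as defined in the text. */
    double term_weight(long f_tD, long doc_len_bytes)
    {
        return (double)f_tD / (double)doc_len_bytes;
    }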
Technical Reports/Correspondence

1. P. Majumder, M. Mitra, B.B. Chaudhuri, "N-gram: a language independent approach to IR and NLP", Proc. International Conference on Universal Knowledge and Language (ICUKL-2002), Goa, India, 2002.

3.4 Script Identification and Separation from Indian Multi-Script Documents

Background

India is a multi-lingual, multi-script country, where a single document page (e.g., a passport application form, examination question paper, money order form, or bank account opening application form) may contain lines in two or more scripts. For this type of document page, there is a need to separate the different scripts before feeding them to the respective OCR systems. The purpose of this system is to identify the different script regions of the document and hence separate them.

Summary of the System

The system works in two stages. In the first stage, it separates each line in the scanned document page. Line segmentation is based on a horizontal-profile based technique. Secondly, based on the distinguishing features between different Indian scripts, it identifies, for each line, the script in which it is actually written. The identification of a particular script from other scripts is mainly based on water reservoir principle based features, features based on contour tracing, profile features, etc. These features are elaborated below. At present, the system has an overall accuracy of about 97.52%.

System Description

Some of the major distinguishing features used to separate different Indian scripts in a document page from each other are discussed below.

Horizontal projection profile: From Figure 11, it is apparent that there is a distinct difference among some of the scripts in terms of the horizontal projection profile.

Figure 11: Different Indian script lines (from top to bottom: Devnagari, Bangla, Gurmukhi,
Malayalam, Kannada, English, Tamil, Telugu, Urdu, Kashmiri, Gujarati, Oriya) with their row-wise maximum run (left side) and horizontal profile (right side).

Head-line feature: If the longest horizontal run of black pixels over the rows of a text line is taken, such a run length (known as the head-line) can be used to distinguish between scripts with this feature (like Bangla) and those without a head-line (like English).

Water reservoir principle based feature: A top (bottom) reservoir is defined as the reservoir obtained when water is poured from the top (bottom) of the component. (A bottom reservoir of a component is visualized as a top reservoir when water is poured from the top after rotating the component by 180°.) Similarly, if water is poured from the left (right) side of the component, the cavity regions of the component where water will be stored are considered as left (right) reservoirs. This is shown in Figure 12, where the top, bottom, left and right reservoirs are shown for the English character X. The water flow levels of these reservoirs are also shown in the figure. In some scripts, many reservoirs may be obtained from a particular side of the characters, whereas in other scripts many reservoirs may not be obtained from that side. Thus, this feature is useful in distinguishing between scripts.

Figure 12: Top, bottom, left and right reservoirs shown for the character X. The water flow level of a reservoir is shown by a dotted arrow.

Left and right profile: For this feature, each character is located within a rectangular boundary, a frame. The horizontal or vertical distances from any one side of the frame to the character edge form a group of parallel lines, known as the profile. If the left, right and top profiles of the characters in a text line are computed, it is observed that there are some distinct differences among some of the scripts according to these profiles. The left and right profiles of a Malayalam character are shown in Figure 13.

Figure 13: Left and right profile of a character.

Feature based on jump discontinuity: Here, a jump discontinuity (a relatively small white run between two black runs) of a component from a particular side is considered. The characters of some scripts (like Telugu, Kannada, etc.) have prominent jump discontinuities, and the occurrence frequency of this particular feature is successfully used for script identification. The jump discontinuity for a Gujarati character is shown in Figure 14.

Figure 14: Example of the jump discontinuity feature.

Based on the above major features, the system identifies the script of a particular line.

Script Identification Technique:

• The scanned gray-tone document is converted to a two-tone image using a histogram-based automatic thresholding approach.
• Noise removal and skew correction are performed on this two-tone image.
• Line segmentation is performed on the refined image.
• For each line of the image, the particular script in which the line is actually written is identified based on a binary tree classifier.
• Once the script of the line is identified, its region is marked with the name of the particular script.
• The marked lines are shown in the output image.
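As an illustration of the head-line measurement described earlier, the following C sketch finds the longest horizontal black run over the rows of a line image; how large the run must be, relative to the line width, before a head-line is declared is an assumption left as a threshold to the caller.

    /* Longest horizontal run of black (1) pixels over the rows of a
     * line image; a run that is large relative to the line width
     * suggests a head-line.  `img` is row-major, height x width.    */
    int headline_strength(const unsigned char *img, int height, int width)
    {
        int best = 0, r, c;
        for (r = 0; r < height; r++) {
            int run = 0;
            for (c = 0; c < width; c++) {
                if (img[r * width + c]) {
                    run++;
                    if (run > best) best = run;
                } else {
                    run = 0;
                }
            }
        }
        return best;
    }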
A part of the binary tree classifier is shown in Figure 15.

Figure 15: Flow diagram of the script identification scheme. (Here B = Bangla, D = Devnagari, E = English, Gu = Gurumukhi, M = Malayalam, Ta = Tamil, Te = Telugu, G = Gujarati, Ka = Kannada, U = Urdu and O = Oriya.)

Platform

This system is developed in the C language on the UNIX platform. A user interface for this system is built in VC++ 6.0 on the WINDOWS 2000 platform.

Portability

The C version and the VC++ 6.0 version of this system can run on any UNIX and WINDOWS platform, respectively.

Performance

The system is tested on data taken from various sources such as journals, newspapers, synthetic documents, etc. Currently the system has an overall accuracy of 97.52%. The output generated by the system when run on a document page is shown in Figure 16.

Figure 16: The output generated by the system

Papers Published

• U. Pal and B.B. Chaudhuri, "Identification of different script lines from multi-script documents", Image and Vision Computing, vol. 20, no. 13-14, pp. 945-954, 2002.

• U. Pal and B.B. Chaudhuri, "Script line separation from Indian multi-script documents", IETE Journal of Research, vol. 49, no. 1, 2003.

• U. Pal, S. Sinha, B.B. Chaudhuri, "Multiscript Line Identification from Indian Documents", 7th International Conference on Document Analysis & Recognition, ICDAR 2003 (in press).

4. Research & Development

4.1 Automatic Processing of Hand-printed Table-Form Documents

Background

In an office environment, thousands of documents containing tables may be handled while processing application forms. In the Indian context, in most cases these table-form documents contain hand-printed text (like the customer's name, date, item details/quantity/price, etc.) mixed with printed text (invoice no., challan no., etc.). Once these forms are collected from the stockists, customers, or other business centers, the computer operators
manually enter the data into the computer to maintain an electronic version of the same. This manual approach makes the processing time-consuming, tedious and inefficient. Hence, an automatic approach is called for. Moreover, the work may form the basis of handwriting recognition for Indian languages.

Summary of the Project

This project deals with the automatic processing of hand-printed table-form documents. It extracts the different blocks of handwritten information from a filled-in form. Each such block is segmented into lines, words and characters. Identification of each block is followed by tagging (numbering) them according to the order of their physical placement. In the final stage of this information extraction procedure, images of individual hand-printed characters are obtained, which can be passed as inputs to a hand-printed character recognition system.

Description of the Work Done

In our approach, the extraction of handwritten information from a filled-in form does not use the grey values of the pixels; it works on the binarized image of the input form. In the first step, horizontal profiles of the object pixels help to segment the different horizontal lines of information. In each such line, one or more blocks are identified by exploiting the vertical profiles within each horizontal strip. Within a block, words are identified by considering the gap between two consecutive words. Characters in a word are segmented, again by considering the vertical profiles restricted to the image of the word. This information is stored in a block-word-character hierarchy as a three-dimensional list.
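The block-word-character hierarchy and its three-dimensional list can be pictured with the following hypothetical C layout (names and fields are ours, for illustration only).

    /* A hypothetical layout of the block-word-character hierarchy
     * kept as a three-dimensional list, as described above.        */
    struct char_box  { int x, y, w, h; };    /* one character image  */

    struct word_node {
        struct char_box *chars;   /* characters segmented in this word */
        int n_chars;
    };

    struct block_node {
        struct word_node *words;  /* words found inside this block     */
        int n_words;
        int tag;                  /* numbering by physical placement   */
    };

    struct form_page {
        struct block_node *blocks;
        int n_blocks;
    };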
Figure 17: The binarized image (left) and the result after the form segmentation (right).

Testing of the System

We collected several hand-filled-in application forms for a job. In Figure 17, the result of our form document processing system is shown.

Technical Reports/Correspondence

An MCA student, Mr. Saikat Das, did the project work for his MCA dissertation submitted to Indira Gandhi National Open University (IGNOU), New Delhi. The title of his work is "Recognition of Hand-written (touching) Bangla characters from Form Type Documents".

Another student, Prasenjit Mitra, has started his project work towards completing the DOE "B" level: developing a system which can automatically process Indian Money Order Forms. He will take approximately another four months to complete the assignment.

4.2 Research and Development of Neural Network Based Tools for Printed Document (in Eastern Regional Scripts) Processing

Background

Artificial neural network (ANN) based methods for solving various pattern recognition problems have several advantages over conventional/classical approaches. Recently, ANN based classification approaches have gained tremendous popularity. They are commonly used in high-accuracy systems because they perform satisfactorily in the presence of incomplete or noisy data, and they can also learn from examples. Another advantage is the parallel nature of ANN algorithms. On the other hand, we have already
developed an efficient OCR system for printed documents in Bangla. There are obvious justifications to explore suitable supervised/unsupervised neural network/hybrid models for developing software tools for the shape extraction of individual printed characters with a view to character classification; this may in turn make the already developed OCR system more efficient.

Summary of the Project

For the extraction of the shapes of individual printed characters, we may consider self-organizing neural network or vector quantization techniques. Instead of using the input character image for classification purposes, we may obtain its graph representation, consisting of a few nodes and links between them. Useful topological and geometrical features may be easily obtained from such a representation, and those in turn should result in better classification accuracy.

Description of the Work Done

We used the Topology Adaptive Self-Organizing Neural Network (TASONN) model to obtain the graph representation of the input character. We considered a few structural features of this graph describing the topology of the character, along with a hierarchical tree classifier, to classify printed Bangla characters into a few subclasses. To recognize the different characters in each of these resulting subclasses, we considered several geometrical features. Final recognition is performed using a look-up table of these feature values.

Testing of the System

Test pages are selected from different Bangla books. Pages are scanned at 300 dpi. In total, 100 pages are used for training and another 50 pages are used in the testing phase. Analysis of the test results shows a character-level accuracy of about 98%. Error analysis indicates that, since the fonts used for printing different books are somewhat different from each other, generation of a larger training set would improve the overall accuracy of the system.

5. The Team Members

S. K. Parui          swapan@isical.ac.in
A. K. Datta, M. Mitra    mandar@isical.ac.in
U. Pal               umapada@isical.ac.in
S. Palit             sarbani@isical.ac.in
U. Bhattacharyya     ujjwal@isical.ac.in
U. Garain            utpal@isical.ac.in
T. Pal               tama@isical.ac.in
N. S. Dash           niladri@isical.ac.in
A. Datta and D. Sengupta

Courtesy: Prof. B.B. Chaudhuri
Indian Statistical Institute
Computer Vision and Pattern Recognition Unit
203, Barrackpore Trunk Road
Kolkata-700035
(RCILTS for Bengali)
Tel: 00-91-33-25778086 Extn. 2852, 25781832, 25311928
E-mail: bbc@isical.ac.in
Resource Centre For
Indian Language Technology Solutions – Oriya
Utkal University & OCAC, Bhubaneswar
Achievements
RC-ILTS-ORIYA
Department of Computer Science & Application
Utkal University, Bhubaneswar, Orissa – 751004
Tel. : 00-91-674-2585518 / 0216 E-mail : sangham1@rediffmail.com
Website : http://www.ilts-utkal.org
&
Orissa Computer Application Centre
Plot No.-N-1/7-D, Acharya Vihar Square
RRL Post Office, Bhubaneswar-751013
Tel. : 00-91-674-2582490/2582850 E-mail : skt_ocac@yahoo.com
Website : http://www.utkal.ernet.in
RCILTS-Oriya
Utkal University, Bhubaneswar

Introduction

Resource Centres (RC) for Indian Language Technology Solutions (ILTS), under the Technology Development for Indian Languages (TDIL) programme of the Ministry of Communications and Information Technology, Government of India, have been established to provide a platform for disseminating knowledge to the common man through digital means. One such RC is established at Utkal University, Orissa, to handle the issues of the Oriya language, the official language of Orissa. For the effective implementation of this idea, several technologies, such as Image Processing, Speech Processing and Natural Language Processing, are merged together towards the development of hardware and software under the project for the service of the Oriya people, enabling them to become computer literate and thus knowledgeable. RC-ILTS-Oriya works on the development of tools such as:

1. Bilingual E-Dictionary (English↔Oriya)
2. Oriya Spell Checker
3. Oriya WordNet
4. Oriya Machine Translation System (English-Oriya)
5. Oriya Optical Character Recognition System
6. Oriya Text-To-Speech System

All these software packages are copyrighted.

Besides these, the RC team members are also engaged in the development of software such as:

1. Oriya Speech-To-Text System
2. Trilingual (Oriya, English, Hindi) word processor with spell checking capability
3. Sanskrit WordNet
4. Jagannath Philosophy

1. Intelligent Document Processing (OCR) for Oriya

Document processing is needed for a high-level representation of the contents present in a document, so that the content can be analyzed and understood clearly and unambiguously. Different approaches have been taken to the optical recognition of characters for languages like English, Chinese, Japanese and Korean. Very few efforts have been made for the recognition of Indian languages. We have made an attempt at the recognition of the alphabetic characters of the Oriya language using a novel technique, which helps in the efficient processing of text documents. Machine intelligence involves several aspects, among which optical recognition is a tool that can be integrated with text recognition and text-to-speech systems. To make these aspects effective, character recognition with better accuracy is needed.

The process of Optical Character Recognition of a document image mainly involves six phases:

1. Digitization
2. Pre-processing
3. Segmentation
4. Feature Extraction
5. Classification
6. Post-processing

The digitization phase uses a scanner or a digital camera that divides the whole document into a rectangular matrix of dots, taking into consideration the change of light intensity at each dot. The matrix of dots is represented digitally as a two-dimensional array of bits. Each dot can be represented by a single bit for a b/w image (0 = black, 1 = white), while a colour image needs 24 bits per dot. The better the resolution, the better the image.

Preprocessing involves several activities which transform the scanned image into a form suitable for recognition. These activities are noise clearing, filtering and smoothing, thinning, normalization, and skew correction. For the recognition of printed characters, segmentation plays an important role among the preprocessing activities. After the noise clearing phase, the individual characters need to be extracted with good approximation, so that no character loses its important features during the process of extraction. So efficient segmentation
algorithms should be employed, which will lead to better recognition.

During the segmentation phase, the whole image is analyzed and the different logical regions in the image are separated. The logical units consist of text, which is made up of lines, then words, then characters. Errors while isolating characters change their basic shapes, so the characters must be properly extracted to give a better and more accurate representation of the original character. Many algorithms exist for extracting characters from an image. But a problem arises when some characters are connected in the document image: two connected characters may be mistaken for a single character, and such erroneous extraction leads to misrecognition.

In the feature extraction phase, the features of individual characters are analyzed and represented in terms of their specialty or uniqueness. These features help in the classification and identification of the characters. The feature set needs to be small, and the values need to be coherent for objects in the same class, for better classification.

Technology Involved

1. Literature survey done.
2. Gray tone converted to two-tone image by dynamic thresholding on intensity.
3. Skew correction implemented for easy handling of documents.
4. Lines extracted from a document using histogram analysis.
5. Individual characters extracted using region-growing and histogram analysis methods.
6. Matras extracted by region analysis.
7. Skeletonization for efficient processing, storing (less memory space allocation) and searching.
8. Connected characters handled by backward and forward chaining of an appropriate mask.
9. Features extracted from isolated characters as well as composite characters like jukta.
10. Analysis of multicoloured documents by applying the Hue, Saturation and Intensity (HSI) model.
11. The system tested on documents from the book Bigyandiganta and also on some old Oriya books.

The system is also integrated with the text-to-speech system for Oriya.

(An input image in bmp format)

(The output text in our editor)

2. Natural Language Processing (Oriya)

Natural Language Processing is the technique of engineering our language through the machine (the computer), by which we can overcome the language barrier and the difference between man and machine. With that sincere motive, we have taken several steps towards making the computer system behave like a person in exchanging our knowledge. Our efforts have been converted to product form and are mentioned below for the interest of researchers in this field and for the public interest.

In short, our products already developed or under development are Oriya Machine Translation (OMTrans), Oriya Word Processor (OWP) with multilingual support, Oriya Morphological Analyser (OMA), ORI-Spell, the Oriya Spell Checker (OSC), Oriya Grammar Checker (OGC), Oriya Semantic Analyser (OSA), ORI-Dic, the bilingual E-Dictionary (English↔Oriya), OriNet (WordNet
for Oriya), and SanskritNet (WordNet for Sanskrit), which are under development to produce a complete Natural Language Processing System.

2.1 Oriya Machine Translation System (OMTrans)

In OMTrans the source language is English and the target language is Oriya. We have developed a parser, which is an essential part of Machine Translation. Our parser is capable of parsing various types of sentences, including complex sentences such as:

I am going to school everyday for learning.
He said that Ram is a hard working boy.

After the parsing phase, the real translation is done. Our translation system is capable of translating various types of simple as well as complex sentences, such as:

I am a good boy.
I am going to school everyday.
I will eat rice.
Who is the prime minister of India?
India is passing through financial crisis.
He told that Ram is a good boy.

Our system translates sentences having all types of tense structure. In addition, our system is also capable of performing the sense disambiguation task based on an N-gram model, as in:

I am going to bank for money.
I am going to bank everyday.
I am going to bank. I will deposit money there.

In the above examples, the meaning of the underlined word (bank) is decided according to the context of the sentence. Presently it gives very good results for this type of ambiguous sentence.

2.2 Oriya Word Processor (OWP) (Multilingual)

We have developed the OWP, which provides the OSC, OGC and OSA facilities along with multilingual editing. The OWP offers phonetic typing for Oriya characters with a help facility (Fig. 1). All other basic features are also available in the OWP, like any other word processor (Fig. 2). The OWP edits OCR (Oriya Character Recognition) outputs and also the documents of other editors.

Fig. 1: Help option of OWP.

Fig. 2: Multilingual processing of OWP.

2.3 Oriya Morphological Analyser (OMA)

Indian languages are characterised by a very rich system of inflections (VIBHAKTI), derivation and compound formation, for which a standard OMA is needed to deal with any type of text in the Oriya language. A number of words are derived from a given root word by specific syntactic rules. Our OMA deals with the morphology of pronouns, number (nominal, verbal), numberless forms and prefixes (Fig. 3). We have developed and implemented decision trees for each type of morphology, by which our OMA runs successfully. It can help all the applications involved in MT, OriNet, OSC, OGC, etc.
72
2.5 Oriya Grammar Checker (OGC)
The S/W for OGC has been developed to determine
the grammatical mistakes occurring in a sentence
by the help of the OMA and e-Dictionary. It first
parses the sentence according to the rule of Oriya
grammar and checks the grammatical mistakes.
Presently, the OGC functions successfully in our
Editor (fig.-5, fig.-6). The OGC -S/W supports both
the Windows-98/2000/NT and the Linux as well.

Fig-3 : Out put of the OMA for the Oriya derived


word “baHiguDika” in OriNet.
2.4 Oriya Spell Checker (OSC)
The misspelled words are being taken care of
successfully by our OSC. We have developed some
algorithms to perform our OSC in order to find
out more accurate suggestion for a misspelled word.
The searching algorithm of the OSC is so fast that it
processes 170000 Oriya words simultaneously for
each misspelled word. The words are indexed
according to their word length in our word database
Fig.-5 Suggestion of word sequence of OGC in
for effective searching. On the basis of the misspelled
our OWP.
word, it (OSC) matches the number of (i) equal
character, (ii) forward character and (iii) backward
character to give more accurate suggestive words for
the misspelled word. Moreover, it also takes help
from the Oriya Morphological Analyser for
ascertaining the mistakes of derived words. The OSC
functions successfully in our Word Processor. This
S/W supports both the Windows-98/2000/NT and
the Linux as well. Output of the Spell Checker is
shown in Fig- 4.

Fig.-6:- Detection of grammar by OGC


2.6 Oriya Semantic Analyser (OSA)
It deals with the understanding procedure of the
sentence for which it takes immense help from the
KARAKA theory of NAVYA NYAYA philosophy,
one of the most advanced epistemological traditions
of India. The OSA determines the semantic position
of Subject (KARTA), Object (KARMA) etc on the
Fig. 4 : Suggestive words for the misspelled word basis of the verb (KRIYA). In other words, it at first
“bAsitAmAne” of OSC. links to the verb of the Verb Table (VT) and from

73
verb it links to the subject. Other KARAKAs are in progress to include Hindi and Sanskrit words in
determined on the basis of these two linkages (fig.- this system for the benefit of the users.
7). It also takes help from OriNet, OMA and OGC
for better understanding. We have worked out 100
verbs in the VT determining their subjective,
objective, locative categories etc with respect to these
verbs. Presently, with these verbs the OSA functions
successfully in our Editor (fig.- 8). The OSA S/W
supports both the Windows-98/2000/NT and the
Linux as well.

Fig. 9: Overview of the Oriya Word “AkAsha.”

Fig-7:- Semantic Grammatical Model (SGM) and


Semantic Extract Model (SEM)
(Here downward arrows represent the SGM and
upward arrows along with down arrow from verb
represent the SEM where nominative case acts as
chief qualifier.) Fig. 10 : Overview of the English Word “delay”.

Fig. 11 : Search Engine (SE) handles misspelled


word and gives suggestion.

Fig. 8:- Detection of semantic by OSA. 2.8 OriNet (WordNet for Oriya)-Online lexical
dictionary/ thesaurus.
2.7 E-Dictionary (Oriya↔English)
One of the major problems in the implementation
It provides the word, category, synonymy and of Natural Language Processing (NLP) or Machine
corresponding English meaning of Oriya as well as Translation (MT) is to develop a complete lexical
English words. The system is successfully functioning database containing all types of information of
over 27000 Oriya words and 10000 English words. words. There are some difficulties in deciding that
Help option also provides the keystroke for each what information should be stored in a lexicon and
Oriya character in phonetic form. Search engine even greater difficulties in acquiring this information
handles the misspelled word and gives some accurate in proper form. The OriNet system is designed on
suggestive words. This S/W supports both the the basis of multiple lexical database and tools under
Windows-98/2000/NT and the Linux. We are also one consistent functional interface in order to

74
facilitate systems requiring syntactic, semantic and lexical information about the Oriya language. The system is divided into two independent modules. One module is developed to write the source files containing the basic lexical data; these files are taken as the input for the OriNet system. The lexicographer takes care of the major work of this module. The second module is a set of programs which accepts the source files, processes them for display to the user, and also provides different interfaces for use by other applications. The system has been designed using the object-oriented paradigm according to the Oriya language structure, with over 1100 lexical entries, which allows flexibility, reusability and extensibility. It also provides an X-Windows interface to access the data from the OriNet database as per the user's requirements. It can be widely used in different applications, such as (i) Word Sense Disambiguation (WSD) in Oriya Machine Translation, (ii) the Oriya Grammar Checker (OGC) and (iii) the Oriya Semantic Checker (OSC). Moreover, it serves as a lexical resource for Oriya learners and also for expert scholars who are involved in NLP research. The system also includes the Oriya Morphological Analyzer (OMA), which takes care of any type of word, such as a root word or a derived word, and also provides syntactic information about the word. This software supports both Windows 98/2000/NT and Linux. Presently, we are adding more and more lexical entries to the source file and developing different application programs for wider use.

Architecture and Output

The architecture of the OriNet system is divided into five parts (Fig. 12): the Lexical Source File (LSF), OriNet Engine (OE), OriNet Database (OD), Morphological Analyser (MA) and User Application (UA). The LSF is the collection of different files, sorted according to their syntactic categories, which are taken as the inputs to the OriNet system. The OE is a set of programs which compiles the lexical source files into a database format that facilitates machine retrieval of information in a proper manner. It also acts as the CPU of the OriNet system. It is used as a verification tool to ensure the syntactic integrity of the lexical files. All of the lexical files are processed together to build the final OD. It also provides sufficient information as per the user's query or other applications, with the help of the MA.

At the heart of the OriNet system is the OD, which stores the data in simple ASCII format; several easy-to-use interfaces have been devised to cope with varied user requirements, and the raw data extracted from the database are well formatted for display purposes. The MA takes care of the syntactic analysis of any type of lexicon and also provides the root words with other grammatical information. X-Windows interfaces are also designed to access the database with different types of options (Fig. 13).

Fig. 12: Architecture of the OriNet System

Fig. 13: Overview of the word "bhala" (good).

2.9 SanskritNet (WordNet in Sanskrit): Online Lexical Dictionary/Thesaurus

The Sanskrit language is the base for most of the Indian languages. It has links with foreign languages too. To study this language, and also to use it for knowledge enhancement, effective Machine Translation (MT) from one language to another is
necessary, for which online lexical resources are needed. Towards making such resources, we have tried to develop a WordNet for the Sanskrit language using the Navya-Nyāya (नव्यन्याय) philosophy and Paninian grammar. Besides Synonymy, Antonymy, Hypernymy, Hyponymy, Holonymy and Meronymy, we have also introduced Etymology and Analogy separately, as they play important roles in the Navya-Nyāya philosophy; this is the specialty of this Sanskrit WordNet, allowing a better classification of words.

Presently, we have defined and analysed 300 Sanskrit words (200 nominal words and 100 verbal words) in the SanskritNet. On the basis of the analysis and the definitions of these words, we are in the process of designing and implementing the SanskritNet and other allied applications as well. The prototype model of the SanskritNet is displayed here (Fig. 14).

Fig. 14: Output of the nominal word "सूर्य" (Sun) in SanskritNet

3. Speech Processing System for Indian Languages

In the age of fast technology, where information travels at the speed of light, we still rely on feeding our text inputs in a typographical manner. The Speech Processing System is an approach to provide a speech interface between the user and the computer. Our system basically focuses on the Hindi, Oriya and Nepali languages at the Resource Centre.

Broadly, the system is classified into two sections:

• Text To Speech (TTS)
• Speech To Text (STT)

Aim

To design a system/algorithm which works efficiently and naturally and utilizes as little memory as possible.

Progress

We have developed TTS for the Oriya and Hindi languages, and the TTS for the Nepali language is in progress.

3.1 Text to Speech (TTS)

As the name signifies, this system provides an interface through which a user enters some text/document, and the software reads it out as naturally as a human. The basic approach followed here is first to analyse the document (language, font, etc.), then extract words from the text and try to parse the individual words into vowels and consonants. Then, corresponding to these vowels and consonants, existing ".wav" files (previously stored in the database) are concatenated and played.

Technologies Behind:

• Creating the wave file database.

For the creation of such a database, we studied many recorded words and sentences and tried to break them into vowels and consonants by minute hearing. Then we analysed those cut pieces and stored the appropriate, generalised forms in the database.

• Parsing the exact words extracted from a given sentence.

The same word in different sentences carries different stress due to its position in the sentence. The appropriate hidden vowels are detected from the extracted words. For example, the word "ସମୟ" (SAMAYA) in Oriya is parsed as follows:

ସ୍ + ଅ + ମ୍ + ଅ + ୟ୍ + ଅ

The format of the vowel and consonant break points is shown in Fig. 1. Again, the same word in Hindi, i.e., "समय" (SAMAY), is parsed as follows:

स् + अ + म् + अ + य् + अ
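Under the scheme above, each parsed unit maps to a stored waveform, with vowels recorded in several positional variants. The following C sketch is purely illustrative; the db/ file-naming convention and the vowel test are assumptions, not the system's actual layout.

    #include <stdio.h>

    enum position { MATRA_BEGIN, MATRA_MIDDLE, MATRA_END };

    /* Pick the stored waveform for one parsed unit.  Vowels carry
     * three recorded variants because the ma:tra: duration depends
     * on its position in the word, as the measurements below show. */
    void unit_wav(const char *unit, enum position pos,
                  char *path, size_t size)
    {
        static const char *suffix[] = { "begin", "middle", "end" };
        int is_vowel = (unit[0] == 'a' || unit[0] == 'i' ||
                        unit[0] == 'u' || unit[0] == 'e' ||
                        unit[0] == 'o');
        if (is_vowel)
            snprintf(path, size, "db/%s_%s.wav", unit, suffix[pos]);
        else
            snprintf(path, size, "db/%s.wav", unit);
    }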
• Choosing the appropriate '.wav' file from the database.

Considering the above example, the vowels we get after parsing may all be the same 'a' (for Oriya or for Hindi), but it is not exactly the same 'a.wav' that we concatenate in every case. Thus, we analyse vowels broadly in three categories of ma:tra::

• Beginning
• Middle
• End

It is observed that the durations of the ma:tra:s, say 'a' here, vary from each other; i.e., 'a' in the middle is not the same as 'a' in the end. Accordingly, we need to fetch the appropriate ".wav" files from the database.

As observed in the example:

i) The durations of 'a' (Oriya) are as follows:
Starting ma:tra: - 0.065 sec
Middle ma:tra: - 0.105 sec
End ma:tra: - 0.116 sec

ii) The durations of 'a' (Hindi) are as follows:
Starting ma:tra: - 0.063 sec
Middle ma:tra: - 0.078 sec
End ma:tra: - 0.020 sec

• It is observed that the concatenation of the wave files is not as natural as expected. This is due to certain transitions between the characters in the actual pronunciation. Thus, we are developing a robust algorithm for the generation of naturalness in the TTS output.

Applications

# Helpful for the blind and illiterate.
# Airways and Railways announcement systems.
# Initial phase of a Speech-to-Speech conversion system.
# Speech password for security in the banking sector.

3.2 Speech to Text (STT)

Speech processing is a pattern recognition problem. The recognition of speech is defined as an activity whereby a speech sample is attributed to a person on the basis of its phonetic-acoustic or perceptual properties. In our approach, we study the nature of spoken words by different speakers. From a continuous sentence, the word boundaries are detected, and the nature of utterance of the individual consonants and vowels is marked, to study their behaviour for a particular speaker. We use word spotting techniques, and the variations of pitch and intonation are marked. The algorithm makes use of the properties of F0 contours, such as the declination tendency, resetting, and fall-rise patterns in the utterance. Once the patterns are obtained, they are mapped to particular characters and the proper text output is obtained. The boundary and parameter detection is shown in Fig. 2. After obtaining the parameters of an uttered word, they are standardised to maintain the speech database. We also provide some tolerance in each parameter, so that a little variation in the utterance does not affect the recognition of the word for text conversion. From each word the characters are obtained, and when a speaker utters them, the corresponding characters are mapped to the text and the output is obtained.

wah kitna sunder hai !
वह कितना सुन्दर है !

The stress on the vowels (rising and falling) is shown in the pitch curve. The spectrograms of the uttered words show the content frequency and the intensity of the utterance. The word boundary is detected by formant analysis, though F0 plays a major role. The cepstral component of the continuous speech puts major emphasis on the utterance. Hence, taking the word boundary into consideration, the speech database shows around 73% accuracy for Hindi and 90% for Oriya speech recognition.

Application

# Forensic experts' need for speech recognition and identification of criminals.
# For blind and illiterate people, provides an intuitive interface with the machine.
# Final phase of a Speech-to-Speech conversion system.

Fig. 1: Vowel and consonant break points for the parsed units of "samaya": sa + a (starting ma:tra:), ma + a (middle ma:tra:), ya + a (end ma:tra:).

Fig. 2: Word boundaries detected in the utterance "wah kitna sunder hai" (wa ha ki t na sun da r h ai).

4. The Team Members

Khyamanidhi Sahoo        khyamanidhi@yahoo.com
Hemanta Kumar Behera     kumar_hemanta@yahoo.com
Pravat Chandra Satapathy satapathy_pravat@rediffmail.com
Suman Bhattacharya       suman_bh1@yahoo.com
Ajaya Kumar Senapati     ajayasenapati@rediffmail.com
Pravat Kumar Santi       pksanti@rediffmail.com
Gour Prasad Rout         rout_g_prasad@yahoo.com
Kabi Prasad Samal        kabi_prasad@rediffmail.com
Krushna Pada Das Adhikary  krushnapada@rediffmail.com

Courtesy: Prof. (Ms) Sanghmitra Mohanty
Utkal University
Department of Computer Science & Application
Vani Vihar, Bhubaneshwar - 751 004
(RCILTS for Oriya)
Tel: 00-91-674-2585518, 254086
E-mail: sangham1@rediffmail.com

RCILTS-Oriya
OCAC, Bhubaneswar

1. Products Developed at OCAC and the Target Dates of their Commercialization

1.1 Oriya Spell Checker

The Oriya Spellchecker has been developed on the Linux platform. It consists of 60,000 base words. These base words can be manipulated and stored in the dictionary in a scientific manner. Root words can also be manipulated. This Spellchecker is incorporated in our Oriya word processor Sulekha. This project is developed for the smooth use of novice users.

Here we concentrate on suggestions and the checking of words. The checker checks the pattern string and offers the user more accurate and appropriate suggestions. Unlike English, Indian languages are different and complex in nature; for that reason we have designed a checker that checks the pattern string. Words are stored inside the dictionary file line by line. Each line is terminated by the newline escape sequence. All these Oriya words are stored inside a file having the extension .txt.

Features

The Spellchecker provides several facilities, like Prefix, Suffix, Add, Ignore, Change, Spellcheck, Suggestion and Cancel. Suggestions are accurate and match the words closest to the user's requirement. It checks many words within a fraction of a second. The searching time is minimal and the complexity is low. The Spellchecker can be incorporated in any application; for example, it can be incorporated in a word processor, an editor or a browser. It uses ISCII data stored in a file and ISFOC for display: it gets the input ISFOC data and displays the ISFOC data.

Commercialization

The Windows version has been completed, commercialized and incorporated in LEAP Office of CDAC. The Linux version is completed.

1.2 Thesaurus in Oriya

The Oriya Thesaurus has been developed on the Linux platform. It consists of 40,000 base words. These base words can be manipulated and stored in the dictionary in a scientific manner. Root words can
also be manipulated. This Thesaurus is incorporated in our Oriya word processor Sulekha. This project is developed for the smooth use of novice users. It is a tool for correct documentation. Words are stored inside the dictionary file line by line. Each line is terminated by the newline escape sequence. All these Oriya words are stored inside a file having the extension .txt.

Features

The Thesaurus provides several facilities, like Thesaurus and Replace. Suggestions are accurate and match the words closest to the user's requirement. It checks many words within a fraction of a second. The searching time is minimal and the complexity is low. The Thesaurus can be incorporated in any application; for example, it can be incorporated in a word processor, an editor or a browser.

Commercialization

Completed

1.3 Bilingual Electronic Lexicon

The dictionary for administration and official correspondence has been entered for Oriya-English and English-Oriya. It is proposed to bring out an Electronic Dictionary as a commercial product which will have features such as a phonetic keyboard and user-friendly functions for "add", "delete", "view", etc. After the availability of text-to-speech technology in Oriya, the same will be integrated into this product for providing help in the correct pronunciation of words. Lexical inputs from other domains, such as culture, business and science, are being collected.

Commercialization

Words have been entered. As a product with more words, it will be completed in June 2003.

1.4 Corpus in Oriya

A large corpus has been developed, consisting of nearly 8 million words, for analysis and use in the spell checker and thesaurus. Corpus creation has been taken up as a continuing activity as a part of this project.

Commercialization

It will be used for research and in other products and is not for commercialization. (Completed)

1.5 Bilingual Chat Server

Oriya Chat is a standalone chat application in the Oriya language developed in JAVA. After connecting to the Server, one can chat in one's favorite room, send direct messages with emotion icons to other online users, as well as send public messages.

Features

User log-in to the Chat Server with a unique ID; connection to the Server, with or without a firewall/proxy; a phonetic keyboard layout to write in Oriya; sending Oriya text with emoticons (images); private and public chat with friends; domain-specific predefined public rooms.

System Description

Oriya Chat has a client and a server module. ServerSocket and SocksSockets are used for communicating with the remote machine. SocksSocket is a custom socket that extends Socket, to be used with a firewall or proxy.

Commercialization

Completed

1.6 Net Education

A Net Education System (NES) is being developed using the Bilingual Chat Server and Mail Server. Educational content is being created for providing students and the public with education on various subjects through the net.

Commercialization

The Chat Server with interface has been completed. A full-fledged system with content will be completed soon.

1.7 XML Document Creation and Manipulation

XML document creation technology in the English and Oriya languages has been developed. It incorporates user-defined tags and structures the document like a database, providing the real-world meaning of the content. This provides more flexibility: the same document can be presented with different style sheets, and the same style sheet can be applied to different documents. Also, queries can be processed on the contents of the XML document. Content on tourism has been created in XML. Agricultural content useful for farmers is being created.

Commercialization

A few documents have been completed. More work is in progress, to be completed by August 2003.

1.8 Oriya OCR

OCAC is collaborating with ISI Calcutta on the development and commercialization of the Oriya OCR. All the language inputs have been provided to ISI Calcutta. The software developed by ISI Calcutta is being tested by OCAC. OCAC will commercialize the Oriya OCR with know-how from ISI Calcutta.

Commercialization

Work in collaboration with ISI, Calcutta is in progress. It will be completed in June 2003.

1.9 Oriya E-Mail

A majority of the Indian population cannot speak, read or write the English language. Oriya language based tools help the common man to benefit from e-governance and other applications of Information Technology. In order to bridge the digital divide, it is necessary to provide Internet applications such as e-mail and chat in the Oriya language for use by the common man.

Varta is an e-mail solution which enables people to send and receive mail in the Oriya language. It uses a dynamic version of the OR-TTSarala font, created using Microsoft WEFT-III. The use of a dynamic font allows the client to read and compose mail without downloading the Oriya font. The system has been developed using ASP technology, Java, JavaScript and HTML. Varta uses the default SMTP Server on the Internet Information Server 5.0 of Windows 2000 Server.

Features

Most of the basic e-mail features are available in Varta 0.2, which include: sending, replying, forwarding and reading Oriya mail using the dynamic font; online registration of a unique user mail account; password security with online help to retrieve a forgotten password; phonetic keyboard typing in Oriya; and online typing help through a keyboard map to dynamically select the appropriate keys.

Phonetic Keyboard Engine

The Oriya keyboard has been designed as per the Indian phonetic standard. It is implemented as an applet in the Java language. It is compiled with the Java 1.0 JDK so that it runs in MS Internet Explorer 5.0 onwards.

Commercialization

Completed

1.10 Oriya Word Processor with Spell Checker under LINUX

For the first time, an Oriya word processor has been developed on the Linux operating system. It runs on any X-Windows system. Two keyboard engines (the Inscript keyboard engine and the Phonetic Keyboard
1.10 Oriya Word Processor with Spell Checker under LINUX
For the first time an Oriya word processor has been developed on the Linux operating system. It runs on any X-windows system. Two keyboard engines (an Inscript keyboard engine and a Phonetic keyboard engine) have been incorporated for ease of typing. All text is displayed in GLYPHCODE; for display of a stored ISCII file, a converter engine automatically generates the GLYPHCODE equivalent of the file.

Features
Most of the basic editing features available in a standard word processor have been incorporated, which include: opening a new file, saving a file, closing a window; cut, copy and paste; single-level Undo and Redo; a keyboard engine for typing in the Inscript layout and another for the Phonetic layout; ISCII to GLYPHCODE and GLYPHCODE to ISCII conversion; storing the content of a file in plain text as well as in ISCII format; and other features such as Find, Find and Replace, Go To Line and File Information. A Spell Checker is incorporated within the word processor.

Keyboard Engine
Two keyboard engines have been designed, one according to the Indian script (Inscript) standard and another according to the Phonetic standard. All the consonants, vowels, conjuncts and special symbols are displayed in Glyph Code format. Invalid word combinations according to Oriya grammar are disallowed. Pressing the Scroll Lock key activates the keyboard engine.

Converter Engine
An Export file option is provided to convert 8-bit glyph code to the 8-bit ISCII code format. The 8-bit ISCII code can also be converted back to 8-bit glyph code.

Commercialization
Completed
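The converter engine just described is essentially a pair of byte-level mapping passes. The Python sketch below shows the shape of such a converter; the two-entry mapping table is a made-up placeholder, since the real glyph-code and ISCII assignments are defined by the font and the ISCII standard, and a real converter must also handle multi-byte conjunct sequences.

    # Shape of an 8-bit glyph-code <-> ISCII converter. The table below is a
    # hypothetical placeholder; real entries come from the font's glyph table
    # and the ISCII code chart.
    GLYPH_TO_ISCII = {0x41: 0xA4, 0x42: 0xA5}          # glyph byte -> ISCII byte
    ISCII_TO_GLYPH = {v: k for k, v in GLYPH_TO_ISCII.items()}

    def convert(data: bytes, table: dict) -> bytes:
        # Bytes without a mapping (e.g. ASCII punctuation) pass through unchanged.
        return bytes(table.get(b, b) for b in data)

    iscii = convert(b"\x41\x42 \x41", GLYPH_TO_ISCII)   # export: glyph -> ISCII
    glyphs = convert(iscii, ISCII_TO_GLYPH)             # import: ISCII -> glyph
    assert glyphs == b"\x41\x42 \x41"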
1.11 Computer Based Training (in Oriya)
It is well known that multimedia CDs are a highly effective medium of learning in many disciplines. The integration of text with hypertext links, speech, graphics, animation and video in an interactive mode makes the lessons easy to visualize and understand for self-learning. Computer Based Tutorial CDs for school children have been developed to support classroom teaching with an interactive self-learning programme.

Features
Informative content with animated illustrations; interactive quizzes for evaluation of subject understanding.

Description
This project is aimed at the development of tutorial CDs in the Oriya language based on the school curriculum. The content is based on textbooks prescribed by the Board of Secondary Education, Govt. of Orissa. The lessons are illustrated with text, speech and animation, with a facility to navigate through the various chapters using hyperlinks. At the end of a chapter, an online quiz helps the student in self-evaluation. It is based on the Windows 98 operating system; the tools used are Macromedia Authorware 6, Macromedia Flash 5.0, Adobe PageMaker 6.5 and Adobe Illustrator 9.0. The minimum system requirements for installation are a Pentium II or higher, 128 MB or more RAM, and Internet Explorer 5.5 or later.

Commercialization
Two CDs have been completed; a write-up with colour brochure is enclosed. Work on more subjects is in progress.
1.12 Oriya Language Based E-Governance Applications
A number of software modules have been developed for enabling common citizens to access government information and services in the Oriya and English languages from kiosks. A citizen can submit a grievance petition; apply for one or more certificates (such as birth, death, caste, income etc.); download application forms; and find out the rates of consumer commodities. These modules have been used in a pilot project on improving citizens' access to information in the Kalahandi district of Orissa.

Commercialization
Development of a number of applications has been completed and integrated with the project on Citizens' Access to Information.

2. TOT done for various projects so far
1. Spell-checking grammar rules for the Oriya language have been transferred to CDAC.
2. Documents on Oriya language standards and principles have been prepared for UNICODE and sent to MIT.
3. The draft for the preparation of the National Language Design Guide for Oriya has been prepared and submitted to MIT.
4. The Oriya script with grapheme details has been provided to ISI, Calcutta for the Oriya OCR.

3. Training programmes run for the officials of the State Govt.
OCAC had earlier trained 70 persons in various government offices on Oriya language word processing using LEAP Office, in which the Oriya Spell Checker has been incorporated. The last formal training course on "Word processing in Oriya Language using Leap Office and ISM" was conducted during January 17 - February 12, 2002, in which only four officials attended. Again, a course was organized for 106 staff of the Board of Revenue during 29-07-02 to 3-08-02 (first batch) and 5-08-02 to 10-08-02 (second batch). So far 180 officers have been trained. We are developing course material on a CD for training in the Oriya language.

4. Core Activities

4.1 Web hosting of Oriya Language Classics
Three well known Oriya classics:
(1) "Chha Mana Atha Guntha" by Fakir Mohan Senapati,
(2) "Chilika" by Radhanath Ray and
(3) "Tapaswini" by Gangadhar Meher
have been hosted on the Internet at http://www.tdil.utkal.ernet.in

4.2 Hosting of Web Sites of Govt. Colleges in Orissa
Web sites have been hosted for the Khallikote College, Berhampur (www.khallikotecollege.utkal.ernet.in) and Ravenshaw College, Cuttack (www.ravenshawcollege.utkal.ernet.in). The pages include information on the colleges as well as forms for student feedback. More dynamic information will be supported by the web sites in future, and this will be undertaken for many colleges in Orissa.

5. Products proposed to be developed
1. CBT for all classes (std. 4th to 10th)
2. OCR
3. Upgradation of the spell checker
4. Grammar module development
5. Machine Translation System (Oriya-English)
6. Net Education System in the Oriya language

6. The Team Members
Biswa Ranjan Sahoo biswa_litu@yahoo.com
Sambit Kumar Sahu sam_sahu@yahoo.com
Girija Shankar Sarangi _sarangi@rediffmail.com
Sunil Kumar Panda k2_sunil@yahoo.com
Sarita Das sarita_das@yahoo.com
Subhendu Kumar Mohanty ohanty_subbu@yahoo.com
Sanjay Kumar Dey sanjay_k_dey@yahoo.com

Courtesy : Shri S.K. Tripathi
Orissa Computer Application Centre
OCAC Building, Plot No. 1/7-D,
Acharya Vihar Square, RP-O,
Bhubaneswar – 751 013
(RCILTS for Oriya)
Tel: 00-91-674-2582484, 2582490, 2585851, 2554230 (R)
E-mail: saroj@ocac.ernet.in
Resource Centre For
Indian Language Technology Solutions – Assamese & Manipuri
Indian Institute of Technology, Guwahati
Achievements
Indian Institute of Technology, Guwahati
Department of Computer Science & Engineering, Panbazar
North Guwahati, Assam-781031 India
Tel. : 00-91-361-2691086 E-mail : tdil@iitg.ernet.in
Website : http://www.iitg.ernet.in/rcilts
RCILTS-Assamese & Manipuri
Indian Institute of Technology, Guwahati

Introduction
Conceived in the millennium year, the Resource Centre for Indian Language Technology Solutions, Indian Institute of Technology Guwahati, is a project funded by the Ministry of Information Technology of the Government of India. The main objective of the Centre is to make electronic information available in native languages, mainly Assamese and Manipuri, thereby aiding the dissemination of information to the larger masses. This Centre is one among several in the nation and is equipped with all modern systems and language related software. Four investigators from the Indian Institute of Technology Guwahati and one collaborator from Guwahati University are presently involved in research in the areas of Natural Language Processing, Speech Recognition and Optical Character Recognition systems. The Centre is equipped with modern equipment, including several Pentium-III and Pentium-IV machines, a Linux server and a Windows 2000 server, a scanner, web camera, printers, a digital camera and audio recording systems. Eight project personnel, five technical and three linguists, working in joint collaboration with Guwahati University, man the Centre.

The Centre has developed various language related tools and technologies over the last two and odd years. Details of the same follow:

1. Knowledge Resources

1.1 Corpora : A systematic collection of speech or writing in a language or variety of a language forms a language corpus. The two corpora created by the Centre are:

• Assamese Corpora : Seven Assamese novels have been transformed into electronic form to form the base corpus. The salient features are:
1. Number of words : 6,00,000
2. Fonts used : AS-TTDurga (a C-DAC font) and Geetanjalilight (a popular font used in DTP work)
3. Encoding standard : ISCII

• Manipuri Corpora : The Manipuri corpora were received from the Ministry of Information Technology, Govt. of India. This corpora creation started at Manipur University under the aegis of Prof. M.S.Ningomba and Dr. N.Pramodini. The corpora were in LP1 format, which is compatible with Leap Office. The total word count comes to around 3,60,000. The fonts used in the creation of this corpus are BN-TTDurga, BN-TTBidisha and AS-TTBidisha. Further investigation to make it compatible with existing systems is in progress.

1.2 Dictionaries : Derived from the Latin words dictio (the act of speaking) and dictionarius (a collection of words), a dictionary is a reference book that provides lists of words in order along with their meanings. Dictionaries may also provide information about the grammatical forms, syntactic variations, pronunciation, variations in spellings, etymology, etc. of a word. The Centre has developed e-dictionaries, the details of which are shown in Table 1.

Dictionaries        Root Words   Total words
English-Assamese    5000         49,627
Assamese-English    2000         9,015
English-Manipuri    2500         36,778
Manipuri-English    3000         7,228

Table 1. Dictionary details

Each entry in the dictionary contains information on: i) the grammatical form, ii) meanings, iii) synonyms, iv) antonyms, v) pronunciation of the word (in the form of sound files), vi) transliteration in English, vii) Soundex code and viii) semantic category. More words are being added.

Features
a) Provides information about a word.
b) These dictionaries have been structured to support the spell checker and machine translation systems being developed at the Centre.

1.3 Design Guides
A design guide gives a brief overview of a particular language. The main topics covered by the design guide are the consonants, vowels, conjuncts and matras which form the character set of the language, and the numerals, punctuation marks, month names, weekday names, time zones, currency, weights and measurements used in that particular language. Additionally, some linguistic information such as phonological features, grammatical features, the history of the language and the geographical description of the state where the particular language is spoken is also included.

The Resource Centre for Indian Language Technology Solutions, Indian Institute of Technology Guwahati has designed two design guides, for Assamese and Manipuri.

1.4 Phonetic Guides
Pronunciation is also an important part of language learning, for which a Phonetic Guide is provided on the website. This guide describes how a particular alphabet should be pronounced. The International Phonetic Alphabet (IPA) is used in the dictionaries as well as in the descriptions of the languages (of North-East India) available on the web site.

2. Knowledge Tools

2.1 The RCILTS, IIT Guwahati Website (http://www.iitg.ernet.in/rcilts)
As part of its core objectives the Centre hosts a website that offers a wide variety of information, ranging from the Assamese and Manipuri languages to geographical and cultural issues.

Figure 1. Resource Centre's Home page

Information hosted may be categorized as:

(i) Linguistic Information: The Northeast is known for its large diversity in languages. Presently the website holds information on sixty-five of the existing languages.

(a) Northeastern Languages : The website contains information on sixty-five North-Eastern languages like Assamese, Chokri, Chakesang, Zeliang, Pochuri, Lotha, Sangtam, Deuri, Dimasa, Kokborok, Tintekiya, Koch, Hurso, Miji, Chang, Khiamngamn, Konyak, Nocte, Phom, Tangsa, Wanchoo, Hmar, Karbi, Kuki, Lakher, Manipuri, Mizo, Riang, Khasim, Monpa, Takpa, Tsangla, Sherdukpen, Sulung (Puirot), Adi, Apatani, Bori, Mishmi, Mishing, Nishi, Tagin and Liangmai, along with their classification.

(b) Linguistic Map : Preliminary work commenced with the compilation and design of the Linguistic Map of the Northeast. After acquiring published and unpublished material on these languages of the Northeast, and with inputs from the Census Department, surveyors, linguistic fieldwork and the findings of Dr. Dipankar Moral, the Linguistic Map came into being. The website presently has a coloured, updated version of this map. The size of the font connotes the density of speakers of a particular language in that area.
(ii) On-line Dictionary: A transliterated Assamese dictionary has been put up on the website. It carries information on the meanings (in Assamese and English), grammatical categories and the pronunciation.

(iii) Geographic and Demographic Information: Both geographic and demographic information is made available for each of the seven Northeastern states of Assam, Meghalaya, Manipur, Nagaland, Mizoram, Tripura and Arunachal Pradesh.

(iv) Guides: Guides have been put up on the website to aid users in understanding language related issues. The two such guides currently available are:

(a) Design Guides: Design guides for both Asamiya and Manipuri, giving a general idea of the respective language together with certain frequently used sentences, have been hosted to help a novice get a basic idea of the language.

(b) Phonetic Guides: Pronunciation is an integral part of the language learning mechanism. A Phonetic Guide describes how an alphabet should be pronounced. The International Phonetic Alphabet (IPA) has been used in the dictionaries as well as in the descriptions of the languages (of North-East India) to convey the manner in which the words are pronounced; the Phonetic Guide allows the user to interpret the IPA symbols correctly. Figure 2 depicts the manner in which one can correlate the IPA symbols with the normal orthography, especially for the pronunciation of vowels.

Vowels
IPA   English Word   Phonetic Representation
i     pin            pin
e     pen            pen
ε     pat            pæt
a     farm           fam
u     too            tu
ʊ     put            pʊt
o     saw            so
ɔ     got            gɔt

(In the production of vowels the upper surface of the tongue is always convex, hence the resulting differences in tongue height. In this context it may be mentioned that Asamiya /ε/ is slightly higher than English /æ/, as in /pæt/.)

Figure 2. Vowel IPA chart

(v) Web based Dictionary: A transliterated Assamese dictionary has been put up on the website. This dictionary is possibly the first contemporary standard on-line Assamese dictionary. Moreover, it is designed in such a way that it will be helpful for the non-native speaker. It provides information on a) the English word, b) grammatical category, c) meaning, d) pronunciation and e) the Assamese meaning.

(vi) Bookmarks: Links to various newspapers of Assam have been provided. A Manipuri newsletter titled "Manika" has also been put up. Further links to various tourism related sites of Assam and Manipur, news portals and the other Resource Centres are also made available on the website.
2.2 Fonts : The Centre has developed a True Type font "Asamiya" using Fontographer software. This font can be used for typing Assamese text in Microsoft Word. Fine tuning of the existing font set has been done, where the spacing and matras have been adjusted. A Manipuri True Type font has also been created.

Figure 3. Snapshot of the Assamese font "Asamiya"

2.3 Spell Checker
A spell checker forms a vital ingredient of a word processing environment. The basic tasks performed by a spell checker include comparison of the words in a document with those in a lexicon of correct entries, and suggesting the correct ones when required. The two commonly used methods for detection of non-words are dictionary look-up and N-gram analysis. Isolated-word error correction is achieved by the minimum edit distance technique, the similarity key technique, rule-based methods, or N-gram, probabilistic and neural net techniques. Context-dependent error correction methods usually employ Natural Language Processing (NLP) and statistical language processing for correcting real-word errors.

Spell checking strategy for Assamese
The development of a spell checker for Assamese has been undertaken at the Resource Centre, and code has already been developed in Perl. The spell checker exists in the form of separate modules for error detection and correction, as well as a stand-alone system in which all the spell checking routines have been integrated. The strategy used is described below.

Non-Word Detection
Non-words are detected by looking up document words in a dictionary of valid words. The dictionary used is actually a word list: 5000 distinct Assamese words have been extracted from the English-to-Assamese online dictionary developed by the Centre, and the rest of the words were taken from a corpus of about 67,000 words. This means that the look-up dictionary contains around 72,000 words. A hash table has been used as the lexical lookup data structure; the performance of a dictionary using a hash table is quite adequate, even for complex access patterns. Perl has an efficient implementation of the hash table data structure, which has been used in the non-word detection module.

Isolated-Word Error Correction
A three-pronged strategy has been used for generating suggestions, comprising the Soundex, edit-distance and morphological processing methods.

Soundex Method
This method maps every word into a key (code) so that similarly spelled words have similar keys. A Soundex encoding scheme for Assamese has been designed based on the encoding scheme for English; it comprises a set of rules for encoding words and 14 numerical codes. The Soundex code of the misspelt word is computed, and the dictionary is searched for words which have similar codes. An example is given below (Table 2).

Word                                               Soundex code
]»] (meaning affection)                            ]65
Ø'öçWýÝX (meaning independent)                     Ø'öç35
Zgõ×c÷Ì^çã_ãcg÷ãTöX (meaning would have analyzed)  Zõ794735

Table 2. Assamese words and Soundex codes
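As a rough illustration of the Soundex idea, the sketch below implements a simplified version of the classic English scheme in Python. The Assamese scheme mentioned above follows the same pattern but with its own letter groups and 14 (later 21) numerical codes, which are not reproduced here.

    # Simplified classic Soundex: similarly spelled words map to similar
    # keys. The Assamese scheme uses the same idea with its own groups.
    CODES = {c: d for d, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R"}.items() for c in letters}

    def soundex(word: str) -> str:
        word = word.upper()
        key, last = word[0], CODES.get(word[0], "")
        for c in word[1:]:
            code = CODES.get(c, "")          # vowels/H/W/Y give no code
            if code and code != last:
                key += code
            last = code
        return (key + "000")[:4]             # pad/truncate to 4 characters

    print(soundex("Robert"), soundex("Rupert"))   # R163 R163 - similar keys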
Edit-Distance Method
Candidate suggestions are obtained by 'Damerau's error reversal' method. The four well known error-inducing edit actions are the insertion of a superfluous letter, the deletion of a letter, the transposition of two adjacent letters and the substitution of one letter for another. In 'Damerau's error reversal' each edit action is applied to the misspelt string and a set of strings is first generated; these are checked in the dictionary to see which of them are valid words, to finally produce the suggestions.
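A minimal sketch of this error-reversal generator is given below, using a toy English dictionary for readability; the Centre's implementation is in Perl and works over the Assamese alphabet, including the matra and conjunct letter combinations mentioned later.

    # Damerau-style error reversal: apply each single edit to the misspelt
    # word and keep the results found in the dictionary. Toy data only.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    DICTIONARY = {"cat", "cart", "coat", "chat", "cast"}

    def suggestions(word: str) -> set:
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {a + b[1:] for a, b in splits if b}
        transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
        substitutes = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
        inserts = {a + c + b for a, b in splits for c in ALPHABET}
        candidates = deletes | transposes | substitutes | inserts
        return candidates & DICTIONARY       # keep only valid words

    print(suggestions("cta"))   # {'cat'} via transposition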
Morphological Processing
Morphological analysis is performed to extract the root word from the misspelt word, as also a list of valid affixes that can be attached to that root word. By attaching the affixes that closely match those of the misspelt word to the root word, a list of suggestions is generated.

A module for ranking the suggestions, based on minimum edit-distance methods, has also been developed.

A run-time snapshot of the Assamese spell checker is shown in Figure 4.

Figure 4. The Assamese spell checker

Features of the Stand-Alone Spell Checker
• GUI developed using Perl/Tk, with simple text editing facilities.
• Currently supports C-DAC's AS-TTDurga font.
• Assamese text files can be loaded onto the GUI, edited, and saved.
• Misspellings can be marked by clicking the 'Det' ('Detect') and then the 'Show' button.
• Selecting a misspelt word and clicking the 'Sug' ('Suggest') button generates suggestions.
• A facility to add new words to the dictionary exists.
• A 'Select All/Unselect All' option is available in the 'Edit' menu on right-clicking the text, for selecting/deselecting the entire text.
• 'Copy'/'Cut' and 'Paste' operations are possible for text within the GUI window.
• Text can be copied/cut from an I-leap document and pasted onto the GUI window.

The Soundex encoding scheme for Assamese has since been refined into a more fine-grained one, and now comprises 21 numerical codes. Added functionality has been incorporated into the Soundex code generator for handling matras attached to consonants and conjuncts in the first letter position, as also for khandata/chandrabindu/anuswar/bisarga attached to consonants, vowels and conjuncts.

The edit-distance module now accounts for 2234 distinct letter combinations (matras attached to consonants/conjuncts, and khandata/chandrabindu/anuswar/bisarga attached to consonants/vowels/conjuncts).

Tests conducted against a corpus of about 67,000 words reveal that the edit-distance method gives the best results, followed closely by the Soundex method. A non-word detection module for Manipuri has also been developed. Work has also been undertaken to integrate a spell checking facility for Assamese into the Microsoft Word environment.
2.4 Assamese language support for Microsoft Word : Word processors like Leap Office provide the facility of document typing in Indian languages. Content creation can be done with the help of these editors by using the Inscript keyboard layout they provide. The Centre has developed a macro that allows the user to type Assamese text in Microsoft Word without the need for the available Indian language editors.

Technology Description
This Microsoft Word macro maps the input keystrokes to the appropriate glyphs. The macro supports the Inscript keyboard layout. Typing can be done using the font "Asamiya" developed by the Centre. The use of the Inscript keyboard layout facilitates smooth migration from C-DAC to Microsoft technologies.

Features
• Inscript keyboard layout
• Template macro
• Supports the "Asamiya" font and C-DAC Assamese fonts (As-ttDurga, As-ttBidisha, As-ttDevashish, As-ttkali, As-ttAbhijit).
• All the features of MS Word (such as justification, different font styles, different font sizes etc.) can be used.
• Documents are stored in glyph form.

Figure 5. Snapshot of the Assamese macro with different font styles

2.5 Morphological Analyzers : Given any word or a group of words, a morphological analyzer determines the root and all other inflectional forms. Morphological processing plays a vital role in the development of spell checkers and machine translation systems. The Centre has developed morphological analyzers for both Assamese and Manipuri.

a) Assamese Morphological Analyzer : The analyzer has been developed for use with the spell checker and the machine translation systems.

Technology Description
The stemming technique forms the base of the Assamese morphological analyzer. In this technique, affixes are added or deleted according to linguistic rules, and the derived words are verified against the existing corpus/dictionary to be treated as valid words.

Features
a) Currently works with eleven linguistic rules.
b) More rules can be added without alteration of the code.
c) Modules are available in the form of APIs for customization.
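The add/delete-affix-and-verify loop described above can be sketched in a few lines. The suffix rules and word list below are hypothetical English stand-ins, since the real rules operate on Assamese affixes.

    # Stemming by rule: strip a known affix, then verify the candidate root
    # against the dictionary. Rules and words below are toy stand-ins.
    DICTIONARY = {"walk", "play", "happy"}
    SUFFIX_RULES = ["ing", "ed", "s", "ness"]   # ordered longest-first in practice

    def analyze(word: str):
        if word in DICTIONARY:                  # already a root form
            return word, ""
        for suffix in SUFFIX_RULES:
            if word.endswith(suffix):
                root = word[: -len(suffix)]
                if root in DICTIONARY:          # rule applies only if root is valid
                    return root, suffix
        return None                             # no rule produced a valid root

    print(analyze("walking"))    # ('walk', 'ing')
    print(analyze("playparks"))  # None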
b) Manipuri Morphological Analyzer : A design for a morphological analyzer for Manipuri is being investigated. Development of a root dictionary, a morpheme dictionary and an affix table has commenced, as has the identification of linguistic rules for nouns and pronouns. Linguistic rules for the other grammatical categories are being studied.

Technology Description
The Manipuri morphological analyzer will be realized using the same techniques used in the development of the Assamese morphological analyzer.

Features
• Five rules are being used currently.
• Can be easily upgraded for more complex patterns.
• The modules developed so far are available in the form of APIs and can be used as needed by other applications, such as the spell checker.
• A graphical user interface has been developed (see Figure 6).

Figure 6. Graphical User Interface of the Manipuri Morphological Analyzer

3. Translation Support Systems

3.1 Machine Translation System
The term Machine Translation (MT) refers to the process of performing or aiding translation tasks involving more than one human language. An MT system translates natural language from a source language (SL) to a target language (TL).

Technology Description
The MT system developed is basically a rule based one and relies on a bilingual dictionary. It can currently handle translation of simple sentences from English to Assamese. The dictionary contains around 5000 root words. The system translates source language text to the corresponding target language text word-for-word by means of the bilingual dictionary lookup; the resulting target language words are then re-organized according to the target language sentence format. In order to improve the output quality, the system performs morphological analysis before proceeding to the bilingual dictionary lookup.

Currently it can handle general-purpose simple translation, and is being upgraded to handle complex sentences. The efficiency of the system can be improved by selecting a specific domain. The rule based machine translation system currently contains 22 rules, and has been tested on more than 250 frequently used simple sentences. An example of machine translation from English to Assamese is depicted in Figure 7.

Figure 7. English-Assamese Machine Translation System
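The dictionary-lookup-and-reorder pipeline described above can be illustrated with a toy sketch. The three-word lexicon (with romanized stand-ins for the Assamese) and the single SVO-to-SOV reordering rule below are invented for illustration; they are not the Centre's actual rule set, though Assamese, like most Indian languages, does place the verb at the end.

    # Word-for-word translation with a bilingual dictionary, followed by
    # reordering into the target sentence format (SVO -> SOV). Toy data.
    LEXICON = {"ram": "Ram", "eats": "khai", "rice": "bhat"}

    def translate(sentence: str) -> str:
        words = sentence.lower().strip(".").split()
        target = [LEXICON.get(w, w) for w in words]   # dictionary lookup
        if len(target) == 3:                          # simple S-V-O pattern
            s, v, o = target
            target = [s, o, v]                        # target format is S-O-V
        return " ".join(target)

    print(translate("Ram eats rice."))   # Ram bhat khai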
4. Human Machine Interface Systems

4.1 Optical Character Recognition for Assamese and Manipuri : Optical Character Recognition (OCR) is the process of converting scanned images containing text into a computer processable format (such as ASCII, ISCII or UNICODE). An MoU to transfer the OCR technology from the Resource Centre at the Indian Statistical Institute, Kolkata was signed in August 2002, and the transfer was accordingly effected in September 2002.

Technology Description
The system takes a gray level (8-bit) TIFF (Tagged Image File Format) image as input. Images scanned from either Assamese (or Manipuri) books or from paper documents can be processed by the OCR software. The current version of the OCR system produces output in the ISCII format, which can be viewed or edited using any editor supporting ISCII.

Features : Table 3 shows the salient features of the OCR system, while Figure 8 shows the associated GUI.

Features                    Specifications of Assamese & Manipuri OCR
Scanning resolution         300-600 dpi
Input image                 TIFF (8 bit gray level)
Skew detection              +5 to -5 degrees
Skew correction             +5 to -5 degrees
Font name                   Assamese: Gitanjalilight, Luit & AS-TTDurga
                            Manipuri: font used by the publisher,
                            Manipuri Sahitya Parishad, Imphal
Font size                   Assamese: 12-28 points; Manipuri: 12-18 points
Test data size              Assamese: 600 pages from books & printed
                            documents; Manipuri: 100 pages from books
Template size               Assamese: 2600; Manipuri: 1800
Post processing             Morphological analysis
Output file format          ISCII format
Accuracy                    Assamese: 95% (without post processing)
                            Manipuri: 90% (without post processing)
Portability/Expandability   Windows 98/2000/XP & LINUX

Table 3. Specifications of the Assamese/Manipuri OCR System

Figure 8. GUI of the OCR System

4.2 Speech Recognition System
Automatic Speech Recognition (ASR), which is concerned with the problem of recognition of human speech by a machine, is the core of a natural man-machine interface. A speaker-dependent continuous speech recognition system with a vocabulary list has been developed for the Assamese language. The objectives of the system are:

• Assamese and English spoken digit recognition
• Study of noise effects on Assamese recognition
• Phonetic alignment of Assamese digits

A set of acoustic rules has been formulated from the acoustic-phonetic features of Assamese and English that is able to classify a given sound. The rules based on these features show an overall success rate of around 76% for randomly collected test utterances in Assamese. The classification rate is lowest for the sounds falling under the nasal class (65% success rate). The sounds falling under the vowel, stop and fricative classes show better results (80%, 85% and 85% success rates respectively), while the classification results for diphthongs (70% success rate) were not as good.
The unique Assamese sound /x/ is found to be very similar to /s/ acoustically. A comparative study of the acoustic properties of Indian spoken English and Assamese vowels has also been performed.

4.3 Interface for e-Dictionary : An interface for the English-Assamese and Assamese-English dictionaries has been developed that allows users to choose between the two languages, viz. English and Assamese, enter a word, and find its equivalent in the other language. Based on a client-server architecture, the system allows users to access the Dictionary Server using a Java applet opened via a browser.

Technology Description : The basic components that make up the on-line dictionary are:

1. Client: The client is provided with a graphical user interface (GUI) in the form of a Java applet, shown in Figure 9, initiated from the web browser. The user can invoke the URL of the dictionary, choose the source language, type the word in the search window provided in the applet and click on the Search button. The query from the user is sent to the server and the results are displayed at the client end.

Figure 9. Graphical User Interface of the dictionary

2. Web Server: Apart from the main HTML page, this server hosts the Java jar files and serves the applet classes to the client.

3. Dictionary Server: Coded in Visual Prolog, the Dictionary Server has been developed as a menu-based application that assists the administrator in maintaining the dictionaries, stopping and restarting the server, and monitoring incoming requests. The server provides information on the requests and the actions taken during run time. The current version runs on a Windows platform but can be ported to a Linux system with minor changes, making it virtually platform independent. Whenever a connection to the server is established, it checks the incoming request from the client, consults the database, searches for information on that particular word, and then acknowledges the request by serving the information.

4. Database: The e-dictionaries that comprise the database were initially developed using MS Access. These are then converted to Prolog facts for faster access by the search engine. The server provides the administrator an option to open an ODBC (Open DataBase Connectivity) connection to the MS Access database and perform the conversion, thereby uploading the latest versions of the dictionaries. The current version of the system supports two (English-Assamese and vice versa) of the four electronic dictionaries (viz. English-Assamese, Assamese-English, English-Manipuri and Manipuri-English) being created.

Features
a) User friendly GUI available.
b) The dictionaries can be used to aid machine translation and/or spell checking applications.
c) Pronunciations of words are available in the form of wave files.
d) APIs are provided for prospective programmers to allow them to mould the dictionary information to suit their custom applications.
e) Stand-alone and web enabled versions of the e-dictionaries are available.
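The request-consult-serve loop of the Dictionary Server is easy to picture with a minimal TCP sketch. The version below is an illustrative Python stand-in for the Visual Prolog original: the port number and the two dictionary entries are invented for the example.

    # Minimal shape of a dictionary server: accept a word, consult the
    # database, serve the entry back. Port and entries are illustrative.
    import socket

    ENTRIES = {"water": "pani", "fire": "jui"}   # toy English->Assamese facts

    def serve(port: int = 7070) -> None:
        with socket.socket() as srv:
            srv.bind(("localhost", port))
            srv.listen()
            while True:
                conn, _ = srv.accept()
                with conn:
                    word = conn.recv(1024).decode().strip().lower()
                    reply = ENTRIES.get(word, "<not found>")
                    conn.sendall(reply.encode())   # acknowledge by serving info

    def lookup(word: str, port: int = 7070) -> str:
        with socket.socket() as cli:               # the applet's role
            cli.connect(("localhost", port))
            cli.sendall(word.encode())
            return cli.recv(1024).decode()

    # Running serve() in one process and lookup("water") in another
    # returns "pani".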
5. Language Technology Human Resource Development

5.1 Workshops Conducted
Two workshops were conducted to disseminate and make people aware of how to use Indian languages in IT.

(i) A one-day workshop on "Natural Language Processing" was conducted at the Indian Institute of Technology Guwahati on the 31st of March, 2001. The workshop was intended to act as a forum for promoting interaction among interested graduate students and researchers working on natural language processing and allied areas. The topics addressed in the workshop were:

• Introduction to Computing
• Natural Language Processing
• Linguistics
• Demonstration of language technology solutions

(ii) A training programme on "Web Page Design and Office Automation Using Assamese" was held at the Indian Institute of Technology Guwahati on the 15th & 16th of March, 2002. The programme was designed for the benefit of web designers and people who work with Assamese word processors. With State Government officials using Assamese as a medium of communication, the need for office automation in this language is picking up fast; likewise, the demand for putting up information in the native language on the web is also increasing. This training programme was aimed at disseminating techniques to generate e-content in Assamese and to use this language for routine office tasks. It was specifically designed to encourage state government employees to use Assamese for office automation. Participants ranged from employees of Central, State and Public sector undertakings to private entrepreneurs. The course contents of the workshop were:

• Office automation
• Web development in Assamese
• Creation and manipulation of databases and spreadsheets in Assamese
• Hands-on session

6. Standardization
A draft of the Assamese Unicode code-set was prepared after due consultation with linguists and the State Govt., and submitted to the Ministry of Communications & Information Technology, Govt. of India for further evaluation. The Assamese code-set is similar to that of Bangla, the exceptions being:

• The letter Ra » has code 09F0 instead of 09B1, and the letter Wa ¾ has code 09F1 in lieu of 09B5.
• An additional letter Khya lù with code 09BB is introduced.

7. Publications
1. Monisha Das, S.Borgohain, Juli Gogoi, S.B.Nair, Design and Implementation of a Spell Checker for Assamese, Proceedings of the Language Engineering Conference, LEC2002, December 2002, Hyderabad. Published by the IEEE Press.
2. Monisha Das, S.Borgohain, S.B.Nair, Spell Checking in MSWord for Assamese, Proceedings of the ITPC-2003: Information Technology: Prospects and Challenges in the 21st Century, Kathmandu, Nepal, May 23-26, 2003.
8. Manika Newsletter

Snapshot of the Manika e-newsletter in Manipuri

Description
Manika, an e-newsletter in Manipuri, was launched on Republic Day this year (26th January 2003) to mark the beginning of a new electronic information era in Manipur, the Land of Jewels. The newsletter, hosted by the RCILTS at the Indian Institute of Technology Guwahati, will bring news, share knowledge of the heritage and local innovations of Manipur, and help the Manipuris to be well informed in the electronic age and forge ahead.

9. The Team Members
Prof. Gautam Barua gb@iitg.ernet.in
Dr. S.B.Nair sbnair@iitg.ernet.in
Dr. S.V.Rao svrao@iitg.ernet.in
Dr. P.K.Das pkdas@iitg.ernet.in
Dr. Dipankar Moral dipankarmoral@postmark.net
Samir Kr. Borgohain samir@iitg.ernet.in
Sushil Kr. Deka sdeka@iitg.ernet.in
Monisha Das moni@iitg.ernet.in
Nilima Deka nilima_deka@iitg.ernet.in
Juli Gogoi juli@iitg.ernet.in
Dr. L.Sarbajit Singh sarbajit@iitg.ernet.in
Sirajul Chawdhury si_chow@iitg.ernet.in

Courtesy : Prof. Gautam Barua
Indian Institute of Technology
Department of Computer Science & Engineering
Panbazar, North Guwahati,
Guwahati - 781 031, Assam
(RCILTS for Assamese & Manipuri)
Tel: 00-91-361-2690401, 2690325-28 Extn 2001, 2452088
E-mail: gb@iitg.ernet.in, g_barua@yahoo.com
Resource Centre For
Indian Language Technology Solutions – Kannada
Indian Institute of Science, Bangalore
Achievements
Indian Institute of Science
Centre for Electronics Design & Technology, Bangalore - 560012.
Tel. : 00-91-80-2932377, 293 3267
E-mail: njrao@mgmt.iisc.ernet.in
RCILTS-Kannada
Indian Institute of Science, Bangalore

Introduction
The Resource Centre for Indian Language Technology Solutions-Kannada at the Indian Institute of Science chose a very broad spectrum of activities, and undertook several activities related to languages other than Kannada. Participation by a large number of faculty of the Institute enabled the Resource Centre to create knowledge bases, web sites, OCR for Tamil and Kannada, speech synthesis for Tamil and Kannada, resource information on European history in Hindi for high school children, a bilingual interactive German-Hindi course, CDs in support of learning Kannada at primary and high school level, a variety of tools for language processing and a WordNet in Kannada, and to investigate language identification. It was also recognized that the language technology segment of the IT industry has remained very small in India for a variety of reasons. Two important approaches were taken towards improving this: one was to get some jobs done by the industry instead of getting them done through project assistants, and the other was to make many of the basic tools developed available in the public domain. The web site KANNUDI should meet the long felt need of Kannadigas to know about their language and state; a large number of scholars in various disciplines actively contributed to making this web site very rich. An organization like the Kannada Sahitya Parishat felt enthused enough to permit this Resource Centre to create a web site on their activities and have it co-located with KANNUDI. In an attempt to make a difference to an important segment of the population, namely primary and high school students, the Bodhana Bharati series was created, which provides interactive learning of the Kannada language as per the state syllabus. Even the process of creation sensitized a large number of teachers to the usefulness of IT tools in education, and enabled them to actively participate in creating the CDs. It was realized that manpower availability in the area of language technology is very poor. The web site LT-IISC was created in the hope that it would enthuse both faculty and students at engineering colleges to do their mini projects and main projects in language technology related activities.

This phase of the Resource Centre created adequate momentum in creating IT tools for Kannada and made a beginning that would ensure that Kannadigas need not feel alien in their own state.

1. Web Sites And Support To Instruction

1.1 Kannudi
Kannudi is a bilingual web site for the benefit of Kannadigas, to learn about their language and heritage. Non-Kannadigas will also have an opportunity to know about Karnataka. It is mainly aimed at non-specialists, and provides a first-level overview of a large number of topics.

The contents of this site are organized under 11 topics: Language, Epigraphs, Literature, Folklore, Geography, History, Arts, Classics, Personalities, Temples and Festivals, and Cultural Societies and their Activities. Some of the important features of Kannudi are:

• The site has more than 2000 pages of content.
• Most of the articles are written by experts in the respective areas.
• Images have been incorporated at all possible places.
• Brief life sketches of a large number of personalities of Karnataka have been prepared.
• Several classics of Kannada literature are made available on the web site.

The web site also hosts the web site of the Kannada Sahitya Parishad.

The contents of the site are continuously enhanced. The web site address is www.kannudi.org.

1.2 LT-IISC
This is a web site to meet the needs of students, teachers, and developers interested and concerned with language technologies in general, with particular emphasis on the Kannada language. The main sections of the web site are standards, language resources, office automation, web technologies, optical character recognition, speech technologies, machine translation, multi-lingual issues, applications, and open source software. The website will have a download section common to all topics, which would also carry all the products generated at IISc. The website also features a Discussion Forum on each topic, with which interested people can exchange their views regarding technologies, solve problems (if any), etc. Information on events, seminars and workshops in the area of language technologies is also provided, along with annotated links to other websites on language technologies, language products, and standards.

The address of the site is www.ltiisc.org.

Publications:
1. Narmadamba, K.: e-learning in Kannada (Kannadadhalli e Kalike) (Book), 2002, Kannada Sahitya Parishat.
2. Narmadamba, K.: Kannada and computers, in Samyukta Karnataka, April, 2003.
3. Narmadamba, K.: Speech variation in South India languages with respect to dialect, emotion and style, Sadhane, Bangalore University, May, 2003.

1.3 Bodhana Bharathi : Multimedia Educational CDs for 7th, 8th and 10th Standards
It has been observed that the quality of teaching of the Kannada language at primary and high school levels needs considerable improvement. It was, therefore, decided to create interactive multimedia learning material, useful to both teachers and students, in the form of CDs. This CD series is named "Bodhana Bharathi". These CDs are meant to supplement the text books and classroom teaching. The framework within which these CDs are developed is:

1. To improve listening and writing skills.
2. To enable the students to perform better in the examinations.
3. To appreciate the background in which the lessons were created by the original authors.
4. To make them understand and feel proud about their nationality and culture, literature, people and language.
5. To enjoy poems and songs.

Some of the specific aspects of the instructional material include:

1. Introducing the student to the writers of the lessons and their works.
2. Making them understand the important points and the moral of the lessons.
3. Exposing them to new words that they come across in the lessons.
4. Helping them recite poems appropriately.
5. Testing their knowledge of the lessons by giving them unit tests.

The material is divided into three sections: a Teacher's section, a Student's section and a common section. The Teacher's section has information such as the lesson plan, objectives and others. The Student's section contains extra questions, new words, etc. The common section comprises a preamble, a summary of the lesson, the values in the lesson, key points of the lesson supported by visuals, a model question paper with partial interaction and answers, and meanings of words with relations.

1.4 Bilingual Instructional Aid for learning German through Hindi
Both Hindi and German belong to the Indo-European family of languages. There are a lot of similarities between the two in terms of grammar, vocabulary etc., and the two are semantically quite close to each other. As a consequence it is easier to explain an unknown German word, phrase or idiom in Hindi, and vice versa, than by using English. Moreover, with the growing number of joint ventures and offshore projects there has been increasing interaction between Germans and Indians. If visitors from Germany need to stay in India for a longer period, then some knowledge of Hindi would be of great advantage in making professional, social and cultural contacts. Similar would be the case of Indian visitors to Germany. This was the motivation for developing a bilingual instructional aid for Hindi-German in the form of web-based material.

This web site has four sections.
The Learn Hindi section teaches Hindi alphabets, vowels, consonants, counting and methods of introducing oneself.

The Traveling to India section gives information on how to speak at the airport, restaurant, shopping complex, bank, post office, hospital and railway station, and for renting a house.

The Life in India section has information on culture, lifestyles, tradition, food etc.

The Exercises section provides practice exercises.

This information can be accessed at www.mgmt.iisc.ernet.in/~fls/projects and www.mgmt.iisc.ernet.in/~fls/german

1.5 Information Base in Hindi Pertaining to German History
The information base primarily concentrates on bringing information about Germany into perspective in a way that appeals to Indian school children. The aim of this project is also to put all this material into Hindi so as to make the information accessible to as many school children as possible. NCERT has now undertaken to provide all school children access to information materials both in Hindi and English. This project made use of the knowledge of the foreign language experts at the Institute, and evolved materials in Hindi pertaining to German history. This was done in conjunction with the prescribed syllabus.

The web site contains information on the German Anthem, Germany in Europe, German weather, German population, World War II, the construction and fall of the Berlin Wall, German automobiles, German beer and wine, German culture, the German economy, German unification, German literature, political parties, the social security system, the European Union, Indo-European languages, and the German cities Berlin, Brandenburg, Cologne, Hamburg, Frankfurt, Munich and Trier.

2. Knowledge Bases

2.1 Sudarshana : A Web Knowledge Base on Darshana Shastras
The main objective was to create a "knowledge base" containing basic texts, commentaries and free translations of texts on the systems of Indian philosophy, popularly known as Darshanas, in multimedia form, to have them web-enabled, and to make them available in the form of CDs. The target audience is classified into Naïveté, Students, and Scholars.

All the resource material and the experts on all Darshanas have been identified. All the web sites on Sanskrit, and on Darshana-s in particular, have been surveyed comprehensively. A white paper on Shad-darshana-s has been created. A series of lectures by one of the experts on Ishvara in all darshana-s and on "Arthasamgraha" has been recorded; the audio files and transcribed versions of these lectures are made available on the site, with a partial translation in English. Nyayakusumanjali and other Sanskrit texts referred to in the above discourses are also made available on-line. A glossary of technical terms in the Nyaya, Vaisheshika and Mimamsa darshana-s, in English and Sanskrit, has been created. A relevant search engine for both Sanskrit and English is also made available.

For any clarifications Dr. N.R.Srinvasa Raghavan can be contacted at raghavan@mgmt.iisc.ernet.in

2.2 Indian Logic Systems
The website "PRAMITI : PRAMana - IT with Indian logic" is being developed to provide comprehensive information on various aspects of Indian logic and its applicability to Computer Science and Information Technology. The website is designed to be user-friendly, using a concise and objective style. The web address is www.pramiti.org.

2.3 Indian Aesthetics
A website on Indian Aesthetics in Product Design has been created with the following features:

• Identifies the body of knowledge in Indian aesthetics from literature, culture and philosophy.
• A visual database of Indian products, visuals and examples.
• Methods for designers to enable the incorporation of "Indianness" into product design.
• Learning materials for students in design programs regarding Indian aesthetics, and a methodology for its incorporation into product designs.
The contents are broadly classified as "philosophical", "cultural", and "pragmatic".

The philosophical section contains extracts of ancient texts, book reviews, and review and research papers. It has a rich visual database of images from crafts, artifacts and general visuals, together with a comprehensive glossary and rich bibliography.

The cultural section has articles on Indian philosophy and Western philosophy. Case studies containing research in the area of product and visual semantics are also included.

The pragmatic section discusses the learning material for design students.

3. Technologies and Language Resources

3.1 BRAHMI : Kannada Indic Input Method, Word Processor
Some scripts have hundreds of individual characters, and it is not easy to put all these characters on a standard keyboard that is designed for simpler scripts. Input methods are developed to provide a better approach for inputting text in these languages. The Brahmi Kannada Input Method (BKIM) is one such method, which allows the user to give input in Kannada. BKIM is developed using Java, which provides an input method framework that enables the user to enter text directly into a text component. Application developers who wish to have direct Kannada input in their applications can make use of BKIM. As it is developed in Java, it is also platform independent. BKIM uses the KGP keyboard layout (standardized by the Govt. of Karnataka), and each keystroke is mapped to the corresponding Unicode characters. It uses OpenType fonts.
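The heart of such an input method is a keystroke-to-Unicode mapping with composition rules for consonant-vowel combinations. The sketch below is a Python stand-in for the idea only: the three key assignments are invented for illustration and are not the actual KGP layout.

    # Keystroke -> Kannada Unicode mapping with a simple composition rule:
    # a vowel typed after a consonant becomes a dependent vowel sign (matra).
    # The key assignments here are invented, not the actual KGP layout.
    CONSONANTS = {"k": "\u0C95", "g": "\u0C97"}        # KA, GA
    VOWELS = {"a": ("\u0C85", ""),                     # (independent A, inherent)
              "i": ("\u0C87", "\u0CBF")}               # (independent I, I-matra)

    def compose(keys: str) -> str:
        out, prev_consonant = [], False
        for k in keys:
            if k in CONSONANTS:
                out.append(CONSONANTS[k]); prev_consonant = True
            elif k in VOWELS:
                independent, matra = VOWELS[k]
                out.append(matra if prev_consonant else independent)
                prev_consonant = False
        return "".join(out)

    print(compose("ki"))   # KA + I-matra -> "ಕಿ"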
The Brahmi Multilingual Word Processor is a demonstration of the Brahmi Kannada Input Method. The word processor has the option of choosing the input method from among nine Indian languages (at present only Kannada is enabled) and English. Apart from normal features like file open, print, cut, copy, paste, font features, paragraph settings and colour changes, it also has some unique features such as saving a file in different encodings (UTF-8, UTF-16, Unicode Big Endian, etc.), search and replace over Unicode text, and an e-mail facility. It has an easily accessible graphical toolbar and comprehensive help.

3.2 OPENTYPE FONTS : Sampige, Mallige, Kedage
OpenType is a new cross-platform font file format developed jointly by Adobe Systems Incorporated and Microsoft. Based on the Unicode standard, the OpenType format is an extension of the TrueType SFNT format that can now support PostScript font data and new typographic features.

OpenType offers several compelling advantages:

• A single, cross-platform font file that can be used on both Macintosh and Windows platforms.
• An expanded character set based on the international Unicode encoding standard, for rich linguistic support.
• Advanced typographic capabilities related to glyph positioning and glyph substitution that allow for the inclusion of numerous alternate glyphs (such as old-style figures, small capitals and swashes) in one font file.
• A compact font outline data structure for smaller font file sizes.
• OpenType fonts are best suited for meeting the demand for complex script handling and high quality typography in today's global publishing and communication environment.

Three OpenType fonts called "Sampige", "Kedage" and "Mallige" have been developed. Sampige and Kedage are text fonts, whereas Mallige is a handwriting-style font. The fonts have been tested and are distributed along with the Brahmi Multilingual Word Processor and also individually. People are free to use these fonts in their applications and/or for documentation purposes.

3.3 Kannada Wordnet : WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. The nouns, verbs, and adjectives of a language are organized into synonym sets, each representing one underlying lexical concept, and different relations link the synonym sets. Such synchronic organization of lexical knowledge and structured organization of words is helpful for natural language processing.

Although the design of the Kannada WordNet has been inspired by the famous English WordNet, the unique features of the Kannada WordNet are its graded antonymy and meronymy relationships and an efficient underlying database design.
Nominal as well as verbal compounding and complex verb constructions also play a vital role.

There are different organizing principles for different syntactic categories. Two kinds of relations are recognized in WordNet: lexical and semantic. Lexical relations hold between word forms; semantic relations hold between word meanings. The basic categories such as nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Synsets, or synonymy sets, are the basic building blocks: the synonym set serves as an identifying definition of a lexical concept. The semantic relations covered in the Kannada WordNet are synonymy, antonymy, hypernymy-hyponymy, meronymy-holonymy, entailment and troponymy. The WordNet will initially be built for 2000 words.
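The synset-plus-relations organization described above maps naturally onto a small graph structure. The sketch below shows one way to represent it in Python; the two synsets and their gloss strings are invented English examples, standing in for actual Kannada entries.

    # Synsets as the building blocks, with typed semantic relations linking
    # them. The entries below are invented stand-ins for Kannada synsets.
    from dataclasses import dataclass, field

    @dataclass
    class Synset:
        words: list                 # synonym set identifying one concept
        gloss: str
        relations: dict = field(default_factory=dict)   # type -> [Synset]

        def link(self, relation: str, other: "Synset") -> None:
            self.relations.setdefault(relation, []).append(other)

    tree = Synset(["tree"], "a tall woody plant")
    oak = Synset(["oak"], "a tree of the genus Quercus")
    oak.link("hypernymy", tree)     # oak IS-A tree
    tree.link("hyponymy", oak)      # inverse relation

    print(oak.relations["hypernymy"][0].words)   # ['tree']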
3.4 OCR for Tamil
An Optical Character Recognition (OCR) system capable of converting multi-lingual manuscripts to machine readable codes is one of the key steps in working towards the goal of machine translation. There are numerous applications that OCR systems have to offer which are of help in the day-to-day activities of life. These include mass conversion of existing documents and literature into electronic format, reading aids for the blind as part of a text-to-speech converter, automatic sorting of mail in the postal department, processing of bank documents, and machine transliteration/translation of documents and literature of other scripts.

An OCR for the Tamil script has been developed that works in a multi-font and multi-size scenario. The input to the system is a scanned or digitized document and the output is in TAB code. The documents are expected to contain text only. The process sequence is given in the following block diagram.

Preprocessing is the first step in OCR, which involves binarisation and skew detection and correction. Binarisation is the process of converting the input gray scale image, scanned with a resolution of 300 dpi, into a binary image with the foreground as white and the background as black. Suitable techniques have been used to take care of contrast in the images. The skew angle of the document is estimated using a combination of the Hough transform and Principal Component Analysis. Segmentation is done by first detecting lines, then detecting the words in each line, followed by detection of the individual characters in each word. Horizontal and vertical projection profiles are employed for line and word detection, respectively. Connected component analysis is performed to extract the individual characters. The segmented characters are normalized to a predefined size and thinned before the recognition phase.
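Projection-profile segmentation, used here for line and word detection, is simple enough to sketch directly: sum the foreground pixels along each row, and cut wherever the sum drops to zero. The NumPy sketch below is illustrative only, with a tiny synthetic "image" in place of a real scanned page; the same idea applied column-wise (axis=0) within a line band separates the words.

    # Line segmentation by horizontal projection profile: rows whose
    # foreground-pixel count is zero separate successive text lines.
    import numpy as np

    def line_bands(binary_img: np.ndarray):
        profile = binary_img.sum(axis=1)        # foreground pixels per row
        bands, start = [], None
        for y, count in enumerate(profile):
            if count and start is None:
                start = y                       # a text line begins
            elif not count and start is not None:
                bands.append((start, y)); start = None
        if start is not None:
            bands.append((start, len(profile)))
        return bands

    # Tiny synthetic page: two "text lines" separated by a blank row.
    img = np.array([[1, 1, 0], [1, 0, 1], [0, 0, 0], [1, 1, 1]])
    print(line_bands(img))   # [(0, 2), (3, 4)]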
The segmented symbols are fed to the classifier for recognition. The Tamil alphabet set contains 154 different symbols. The characters are divided into clusters based on domain knowledge, to reduce the recognition time and the probability of confusion. This is accomplished by designing a three-level, tree structured classifier to classify the Tamil script symbols.

A line in any Tamil text has three different segments: upper, middle and lower. Depending upon the occupancy of these segments, the symbols are divided into one of four different classes. The number of dots present in each segment contributes a lot to the classification, which is dependent on the scanning resolution. This constitutes the first level of clustering.

The second level of classification, based on matras/extensions, is applied only to symbols which have upward matras and downward extensions. The classes are further divided into groups, depending on the type of ascenders and descenders present in the character. This classification is feature based, i.e. the feature vectors of the test symbol are compared with the feature vectors of the normalized training set. The features used at this level are second order geometric moments, and the classifier employed is the nearest neighbour classifier.

Feature-based recognition is performed at the third level. For each of the groups, the symbol normalization scheme is different. The dimensions of the feature vector are different for different groups, as their normalization sizes are different. Truncated Discrete Cosine Transform (DCT) coefficients are used as features at this level of classification, and the nearest neighbour classifier is used for the classification of the symbols.

The system has been tested on files taken from Tamil magazines and novels, scanned at 300 dpi (dots per inch). It has been ensured that the test files contain almost all the symbols present in the script. An accuracy of over 99% has been achieved on the training set and 98% on other samples. The sample size was 100.

[1] K G Aparna and A G Ramakrishnan, "A Complete Tamil Optical Character Recognition System", Document Analysis Systems V, 5th International Workshop, DAS 2002, Princeton, NJ, USA, August 19-21, 2002, pp. 53-57.

[2] K G Aparna and A G Ramakrishnan, "Tamil Gnani - an OCR on Windows", Proc. Tamil Internet 2001, Kuala Lumpur, August 26-28, 2001, pp. 60-63.
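The final-level classifier, truncated DCT features with a nearest-neighbour decision, can be sketched compactly. The version below uses SciPy's DCT on toy 8x8 "symbol images"; the templates, the 3x3 truncation and the two class labels are illustrative choices, not the system's actual parameters.

    # Truncated-DCT features + nearest-neighbour classification, sketched
    # on toy 8x8 symbol images. Truncation size and labels are illustrative.
    import numpy as np
    from scipy.fft import dctn

    def features(img: np.ndarray, keep: int = 3) -> np.ndarray:
        coeffs = dctn(img, norm="ortho")        # 2-D DCT of the symbol image
        return coeffs[:keep, :keep].ravel()     # keep low-frequency block only

    def nearest(img, templates):                # templates: list of (label, image)
        feat = features(img)
        return min(templates,
                   key=lambda t: np.linalg.norm(features(t[1]) - feat))[0]

    rng = np.random.default_rng(0)
    a, b = rng.random((8, 8)), rng.random((8, 8))
    templates = [("symbol-A", a), ("symbol-B", b)]
    noisy_a = a + 0.01 * rng.random((8, 8))
    print(nearest(noisy_a, templates))          # symbol-A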
recognition of individual aksharas. The final
output of the system is an ASCII file compatible
with kantex typesetting package for Kannada
which is built around the standard Latex system.

The words are first vertically segmented into three zones, as shown in the figure. The next task is to segment the three zones horizontally. The middle zone is the most critical since it contains a major portion of the letter. After segmenting the document, each segment has to be recognized to effect the final recognition.
Features (a set of numbers that capture the salient characteristics of the segment image) are extracted from each segment. The characters in Kannada have a rounded appearance. Therefore, features which capture the distribution of the ON pixels in the radial and the angular directions are effective in capturing the shapes of the characters.
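Such radial/angular distribution features can be sketched as histograms of ON pixels over concentric rings and angular sectors around the segment centroid. The bin counts below are assumed values for illustration, not the system's actual parameters.

```python
import numpy as np

def radial_angular_features(segment, n_rings=4, n_sectors=8):
    """Histogram of ON pixels over concentric rings and angular
    sectors about the centroid, normalized to sum to 1."""
    ys, xs = np.nonzero(segment)
    if ys.size == 0:
        return np.zeros(n_rings * n_sectors)
    cy, cx = ys.mean(), xs.mean()
    r = np.hypot(ys - cy, xs - cx)
    theta = np.arctan2(ys - cy, xs - cx)          # range [-pi, pi]
    r_bin = np.minimum((r / (r.max() + 1e-9) * n_rings).astype(int), n_rings - 1)
    t_bin = ((theta + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    hist = np.zeros((n_rings, n_sectors))
    np.add.at(hist, (r_bin, t_bin), 1)            # accumulate counts per bin
    return (hist / hist.sum()).ravel()
```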
The extracted features are then classified. Here a Support Vector Machine (SVM) classifier is used.

The output after classification is transformed into a format which can be loaded into a Kannada editing package. The input to such a package is usually ASCII text in which the aksharas are encoded as ASCII strings.

The system was tested on pages scanned from Kannada magazines, etc. Currently the system recognizes about 85% of the aksharas correctly.

4. Research

4.1 Automatic Classification of Languages Using Speech Signals

Automatic language identification (LID) is the problem of identifying the language being spoken from a sample of speech from an unknown speaker. Among the various approaches to language identification, phone recognition offers considerable promise, as it incorporates sufficient knowledge of the phonology of the language to be identified, without incurring the significantly higher cost of word based approaches. An approach based on sub-words, which does not require manually labeled data in any of the languages to be recognized, is used. In particular, the focus was on the specific architecture termed Parallel Phone Recognition (PPR), and the system is referred to as a Parallel Sub-word Recognition system.
Research in automatic language identification requires a large corpus of multi-lingual speech data to capture the many sources of variability within and across languages. Among the various Indian languages, six - Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu - were selected; English as spoken by the same Indian speakers is the seventh language. For each language, twenty adult speakers of different ages and genders were selected. Care was taken to ensure that a speaker was chosen for a particular language only if he/she had that language as a native language, particularly during childhood. Speech was collected using a Sennheiser HMD224 noise-canceling microphone and low pass filtered at 7.6 kHz. The recording protocol was designed to obtain digits and days of the week, numbers in English, the English alphabet, commonly used words in the native languages, railway reservation words in the native language and in English, and banking words in the native language and in English, as well as elicited free speech covering personal details (name, age, native language, profession, family) and passage reading in Hindi, English and the native language.

Here three approaches are studied, namely Parallel Sub-Word Recognition (PSWR), SWRLM, and Parallel SWRLM (P-SWRLM). The following tables show the performance of the PSWR system on the OGI-TS and ILDB databases with different scores.

The SWRLM approach uses a single front-end sub-word recognizer (SWR) followed by N back-end language models (LMs) for an N-language LID task. The front-end SWR can be language dependent or language independent. The SWRLM performance improves if we use a language independent, single sub-word unit inventory obtained from all the languages in the LID task.

Results: Table III shows the LID performance of SWRLM on the OGI-TS database for each of the front ends, for both cases of MLC. Table IV shows the performance of SWRLM on ILDB for all six front ends and both cases of MLC. From these tables we note that LID performance on ILDB is much better than on the OGI-TS database for training data, whereas for test data, the OGI-TS database gives better results compared to ILDB.

The sounds in the language to be identified do not always occur in the one language used to train the front-end sub-word recognizer. It therefore seems natural to look for a way to incorporate phones from more than one language into an SWRLM-like system. Alternatively, another approach is simply to run multiple SWRLM systems in parallel, with the single-language SWRs each trained on a different language. P-SWRLM therefore uses multiple front-end SWRs, with each SWR followed by N back-end LMs for an N-language LID task.
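The back-end of such systems can be pictured simply: the front-end recognizer tokenizes speech into sub-word unit labels, and a per-language n-gram model scores the label sequence; the highest-scoring back-end names the language. The following is a minimal add-one-smoothed bigram sketch; the unit labels and training sequences are placeholders, not the actual system.

```python
import math
from collections import defaultdict

def train_bigram_lm(sequences, vocab_size):
    """Add-one-smoothed bigram LM over sub-word unit label sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(["<s>"] + seq, seq):
            counts[a][b] += 1
    def logprob(seq):
        lp = 0.0
        for a, b in zip(["<s>"] + seq, seq):
            total = sum(counts[a].values())
            lp += math.log((counts[a][b] + 1) / (total + vocab_size + 1))
        return lp
    return logprob

def identify_language(subword_seq, lms):
    """lms: dict mapping language name -> scoring function.
    The back-end LM with the highest score names the language."""
    return max(lms, key=lambda lang: lms[lang](subword_seq))

# Usage sketch: lms = {lang: train_bigram_lm(train_seqs[lang], V) for lang in langs}
```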
The database and the preprocessing with parameterization are identical to those in the PSWR approach. Bias removal was done by the Zissman method on Pl. Results of LID accuracy for the six languages Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu with the P-SWRLM system, using two types of classifiers, namely the Maximum Likelihood Classifier (MLC) and the Gaussian Classifier (GC-MD), are given in plots that compare the LID accuracy of the P-SWRLM system for the two classifiers, and across the OGI-TS and ILDB databases.
The following conclusions may be drawn:
• The sub-word approach to LID, developed in our lab, holds good promise for LID among a small set of languages.
• The LID performance is quite good among the south Indian languages, although they share a lot of phonetic structure and even vocabulary.
• The PSWR approach is more promising than the SWRLM and P-SWRLM approaches to LID.
4.2 Algorithms for Kannada Speech Synthesis

The basic units, namely CV, VC, VCV and VCCV, have been identified and recorded. A framework has been standardized for the creation and handling of a database of spoken basic units. A synthesis scheme based on waveform concatenation of the basic units has been attempted. New techniques for pitch detection and modification, and for speech synthesis with emotion, are proposed.
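The concatenation scheme can be sketched as below: a minimal illustration, assuming a prerecorded unit database keyed by unit names (the crossfade length and the database layout are assumptions, not the centre's actual design).

```python
import numpy as np

def synthesize(unit_names, unit_db, crossfade=64):
    """Concatenate prerecorded basic units (CV, VC, VCV, VCCV),
    with a short linear crossfade to soften the joins."""
    out = np.zeros(0)
    for name in unit_names:
        unit = unit_db[name].astype(float)
        if out.size and crossfade:
            n = min(crossfade, out.size, unit.size)
            ramp = np.linspace(0.0, 1.0, n)
            out[-n:] = out[-n:] * (1 - ramp) + unit[:n] * ramp
            unit = unit[n:]
        out = np.concatenate([out, unit])
    return out

# Usage: waveform = synthesize(["ka", "al", "la"], unit_db), where unit_db
# maps unit names to 1-D sample arrays loaded from the recordings.
```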
5. Publications
1. R. Muralishankar, K. Suresh and A. G.
Ramakrishnan, “DCT based approaches to
Pitch Estimation”, submitted to Signal
Processing.
2. K. Suresh and A. G. Ramakrishnan, “A DCT
based approach to Estimation of Pitch”, Proc.
Intern. Conf. on Multimedia Processing and
Systems, Chennai, Aug. 13-15, 2000, pp. 54-
57.
3. R. Murali Shankar and A. G. Ramakrishnan,
“Robust Pitch detection using DCT based
Spectral Autocorrelation”, Proc. Intern.
Conf. on Multimedia Processing and Systems,
Chennai, Aug. 13-15, 2000, pp. 129-132.
4. R. Murali Shankar and A. G. Ramakrishnan,
“Synthesis of Speech with Emotions”, Proc.
Intern. Conf. on Commn., Computers and
Devices, Vol. II, Kharagpur, Dec. 14-16,
2000, pp. 767-770.
5. Anoop Mohan, Harish T, A. G.
Ramakrishnan and R. Muralishankar, “Pitch
modification sans pitch marking”, submitted
to Intern. Conf. Acoustics, Speech and Signal
Processing, May 7-11, Salt Lake City, Utah,
USA, 2001.
Courtesy: Prof. N.J. Rao
Indian Institute of Science
Centre for Electronics Design and Technology
(CEDT)
Bangalore – 560 012
(RCILTS for Kannada)
Tel. : 00-91-80-3466022, 3942378, 3410764
E-Mail: njrao@mgmt.iisc.ernet.in,
chairman@mgmt.iisc.ernet.in
Resource Centre For
Indian Language Technology Solutions – Telugu
University of Hyderabad
Achievements
University of Hyderabad
Department of CIS, Hyderabad-500046
Tel. : 00-91-40-23100500 Extn. : 4017 E-mail : knmcs@uohyd.ernet.in
Website : http://www.languagetechnology.ac.in
RCILTS-Telugu
University of Hyderabad, Hyderabad

Introduction

University of Hyderabad is a premier institute of higher education and research in India. The University Grants Commission has selected University of Hyderabad, among four others in the country, as a "University with Potential for Excellence". The National Assessment and Accreditation Council (NAAC) has awarded it the highest rating of five stars. University of Hyderabad is the only University in India to be included among the top 50 institutions in India under the "High Output - High Impact" category by The National Information System for Science and Technology (NISSAT) of the Department of Scientific and Industrial Research. University of Hyderabad has been rated the "number one" University in India in sciences by the Department of Scientific and Industrial Research.

1. Resource Centre for Indian Language Technology Solutions

This Resource Centre for Indian Language Technology Solutions was established by the Ministry of Communications and Information Technology, Government of India, at the University of Hyderabad with a funding of nearly Rs. 100,00,000 spread over three years - April 2000 to March 2003. The project has since been extended till 30th September 2003 to enable thorough consolidation of all the work done. Two departments, eight members of the faculty, and 20 to 30 students and research staff at any given point of time have put in their very best for the past three years, and several products, services and knowledge bases have been developed. The core competencies, the data, the tools and the other resources developed here during this period will enable this team to scale new heights in future.

2. Products

2.1 DRISHTI: Optical Character Recognition (OCR)

An Optical Character Recognition (OCR) system converts a scanned image of a text document into electronic text, just as if the text matter had been typed in by somebody. Scanned images are much larger in size compared to the corresponding text files. The statement "A picture is worth one thousand words" is literally true here. Texts occupy less storage space and less network bandwidth when sent across a network. Converting images into texts also makes it possible to edit and process the contents as normal text.

DRISHTI: An OCR System for Telugu and other Indian Languages

OCR systems can be used to convert available printed documents into electronic texts without typing. Since an OCR engine can be run day and night on several computers in parallel, we can generate large scale corpora with less time and effort. OCR engines can also be used for a variety of other applications. OCR systems have just started appearing for Indian scripts. Most of the current OCR systems for Indian languages are designed only for printed texts and perform well only on reasonably good quality documents. However, research work on hand-written document recognition is going on.

Drishti is a complete Optical Character Recognition system for the Telugu language. Currently, it handles good quality documents scanned at 300 dpi with a recognition accuracy of approximately 97%. The system has been tested with a number of different fonts provided by C-DAC and Modular Infotech, and on several popular novels, laser and desktop printer generated pages, and books. Preprocessing modules that separate textual and graphic blocks, handle multi-column text inputs, and perform skew correction are also implemented. Drishti is the first comprehensive OCR system for Telugu.
DRISHTI: Truthing Tool
DRISHTI: How it works

A truthing tool with facilities for creating ground-truth information, and for reviewing the ground truth against image data, is also implemented. Such truthing tools are extremely important in the objective and quick evaluation of OCR system performance. Benchmark standards were proposed in collaboration with the Indian Statistical Institute (ISI) Kolkata, to enable uniform and objective evaluation and performance comparison of OCR systems and subsystems. In addition, several other useful modules and library functions that enhance or simplify adding new features were developed. Initial work was also done on touching characters, which led to identifying the major characteristics and issues in addressing the problem. Ours is the only major work in this area apart from that by ISI, Kolkata for Bangla characters [4].

Drishti, although designed for Telugu, has been tested with the Kannada, Malayalam and Gujarati scripts, with recognition accuracies over 90%. Our OCR technology was transferred to the resource centre for Gujarati. Work is underway on extending the system to the Amharic script of Ethiopia.

2.1.1 System Overview

Drishti contains three stages: a preprocessing stage, a recognition stage and a postprocessing stage. Binarization, separation of image regions into textual and graphical regions, multi-column detection and skew correction are the major tasks performed in the preprocessing phase. Separation of text into glyphs, characters, words and lines, and recognition of individual glyphs, are the tasks of the recognition stage. Postprocessing comprises combining the recognized glyphs into valid characters and syllables, and spell-checking.

Preprocessing Stage

Binarization: refers to the conversion of a scanned 256-gray level image into a two-tone or binary (pure black and white) image. A binary image is appropriate for OCR work as the image document contains only two useful classes of data: the background, usually paper, and the foreground, the printed text. It is common to represent the background paper colour by white-coloured pixels and the text by black-coloured pixels. In image processing jargon, the background pixels have a value of 1 and the foreground pixels have a value of 0. Binarization has a significant impact because it provides input to every stage of an OCR system. Drishti provides three options, namely global (the default), percentile based and an iterative method, to achieve the desired performance on different types of scanned documents and scanners.

Skew Detection and Correction: deal with the improper alignment of a document while it is scanned. The normal effect is that the lines of text are no longer horizontal but at an angle, called the skew angle. Documents with skew cause line, word and character breaking routines to fail. Skew also causes a reduction in recognition accuracy. In Drishti, skew detection and correction are done by maximizing the variance in the horizontal projection profile.
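The variance-maximizing search can be sketched as follows. This is a minimal illustration of the idea described above, not Drishti's implementation; the angle range and step size are assumed values.

```python
import numpy as np
from scipy.ndimage import rotate

def horizontal_profile(binary):
    """Count foreground (0-valued) pixels in each row."""
    return (binary == 0).sum(axis=1)

def detect_skew(binary, max_angle=5.0, step=0.25):
    """Return the rotation angle (degrees) that maximizes the
    variance of the horizontal projection profile; text lines
    aligned with the rows produce the sharpest peaks."""
    best_angle, best_var = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(binary, angle, reshape=False, order=0, cval=1)
        var = horizontal_profile(rotated).var()
        if var > best_var:
            best_angle, best_var = angle, var
    return best_angle

# Usage: deskewed = rotate(binary, detect_skew(binary), reshape=False, order=0, cval=1)
```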
Text and Graphics Separation: refers to the process of identifying which regions of the document image contain text and which regions contain pictures and other non-text information that is not processed by an OCR system. Drishti uses horizontal and vertical projection profiles for such separation, as well as for many other preprocessing operations (see below). A horizontal profile is obtained by counting and plotting the number of text or black pixels in each row of the image. A vertical profile is obtained by counting the black pixels in each column of the image. Horizontal profiles show distinct peaks that correspond to lines of text, and valleys that result from inter-line gaps. A line of text is revealed by a peak in the horizontal profile whose width is approximately the font size. A graphic object, in contrast, is much larger. The actual shape of the peak is also different because of the higher density of black pixels in a graphics block. Thus, the profile shapes discriminate between text and graphical blocks.

Multi-column Text Detection: is done using the Recursive X-Y Cuts technique proposed in [5]. It is based on recursively splitting a document into rectangular regions using vertical and horizontal projection profiles alternately. A different method that allows recognition of non-rectangular regions is also implemented but not yet included in Drishti.

The use of horizontal and vertical projection profiles for all the major preprocessing tasks minimizes system complexity and allows faster processing of documents. The preprocessing stages, except binarization, are not enabled in the basic version of Drishti, but are available as add-on options.

Recognition Stage

Line, Word, Character and Glyph Separation: is a very important task, as the recognition engine processes only one glyph at a time. In Drishti, word and glyph separation are the key steps. Word segmentation is done using a combination of the Run-Length Smearing Algorithm (RLSA) [8] and Connected-Component Labelling. Words are combined into lines using simple heuristics based on their locations. The performance of RLSA in accurately segmenting words is very high on good quality text, but drops in the presence of complex layouts and the tightly packed text that is sometimes seen in magazines. However, the difficulty in applying zoning techniques to Telugu, because of its complex orthography, requires further study for improvement.

Words are decomposed into glyphs by running the connected component labelling algorithm again. The glyph separation is extremely accurate, and very few segmentation errors were found in our experimentation.
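The run-length smearing step can be sketched as below; the smearing threshold is an assumed value, and a real system would pick it relative to the font size.

```python
import numpy as np

def rlsa_horizontal(binary, threshold=15):
    """Run-Length Smearing Algorithm along rows: background runs
    (1s) of fewer than `threshold` pixels lying between two
    foreground pixels (0s) are filled with foreground, merging
    nearby glyphs into word-sized blobs."""
    out = binary.copy()
    for row in out:
        fg = np.flatnonzero(row == 0)
        for a, b in zip(fg[:-1], fg[1:]):
            if 1 < b - a <= threshold:
                row[a + 1:b] = 0
    return out

# Connected components of the smeared image (e.g. scipy.ndimage.label
# on out == 0) then give candidate word bounding boxes.
```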
Recognition: is based on template matching. A glyph database containing all the glyphs in the script was created from high-quality laser-printed text. Each glyph is scaled to a size of 32 x 32 pixels, which forms a template for recognition and is stored in the database. When a document is scanned for OCR, the glyph obtained from the glyph separation step is scaled to the same size as the template and matched using fringe distance maps [3] against each of the templates in the database. The template with the best matching score is output as the recognized glyph.

Drishti provides several options and parameters that affect recognition. The default setup scales the glyphs using a linear scaling algorithm, and matching is performed using a fringe distance map. Linear scaling is fast but suffers from problems with complex shaped glyphs at large font sizes and with small glyphs at small font sizes. Non-linear normalization was shown to improve performance [7] by selectively scaling regions of low curvature. Non-linear normalization, provided as a user option, gives better performance on the Hemalata and Harshapriya fonts of C-DAC. Punctuation marks, which are easily distorted because of their small sizes, are handled separately without using template matching. Recognition accuracy is very high for punctuation marks, using a location and stacking based heuristic developed for Drishti [2]. There are also several ways to modify the basic fringe distance measure to reflect the idiosyncrasies of the Telugu script. Experimentation was done on 18 distance measures for matching, including 6 new measures, and the best was chosen for recognition. Details on the overall recognition process and the modifications for improving recognition accuracy may be found in [1,2,6,7].

Output: is written into a file. Information about the location of the glyph with respect to the text baseline, the type of the glyph (i.e., whether it is a base, maatra, vottu or punctuation glyph) and the recognized symbol code are written into the file. There is also a facility to output the k-best matches.
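Fringe-distance matching of the kind described under Recognition can be sketched with a distance transform; each pixel of the fringe map holds its distance to the nearest ink pixel of the template. This is an illustration under assumed 32 x 32 boolean templates, not Drishti's actual code or its modified measures.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fringe_map(template):
    """Distance of each pixel to the nearest foreground (True) pixel."""
    return distance_transform_edt(~template)

def fringe_distance(glyph, template):
    """Symmetric fringe distance: glyph ink scored against the
    template's fringe map, and vice versa."""
    return (fringe_map(template)[glyph].sum()
            + fringe_map(glyph)[template].sum())

def recognize(glyph, templates):
    """templates: dict mapping label -> 32x32 boolean array.
    Returns the label with the smallest fringe distance."""
    return min(templates, key=lambda c: fringe_distance(glyph, templates[c]))
```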
2.1.2 Postprocessing Stage

Assembling Glyphs into Syllables: is one of the most challenging tasks in Drishti. The complex orthography of Telugu permits glyph placement all around the base character, and finding syllable boundaries is a non-trivial task. Currently, Drishti uses the relative positions and types output from the recognition stage in conjunction with a stacking heuristic to identify syllables. The heuristic works correctly except in the case of certain large vottus, with an error rate of about 2%.

Converting Glyph Codes into ISCII: is currently done by combining a simple table look-up method with the type codes output by the recognition stage. Improvements are being made to the conversion code, which currently does not work for a limited number of glyph/punctuation combinations and consonant clusters. Consequently, the accuracy of the ISCII output is approximately 5% - 7% worse than that of the raw OCR output as on date. The improved conversion algorithm being developed is expected to mitigate this problem.

Spell-Checker: was recently added for detecting mistakes in the ISCII output. The current version recognizes nearly 98% of the misspelled words (false positives are 2%).

2.1.3 Salient Features

High Accuracy, Complete OCR System

Drishti is the first and currently the only complete OCR system for Telugu. Currently, several binarization algorithms (to select the best for a given set of documents and print quality), text-graphics separation, multi-column layouts, and skew detection and correction are available as optional plug-ins.

The raw recognition accuracy (i.e., considering the accuracy of the glyph codes) is currently 97%. The accuracy of the ISCII output generated from the glyph code to ISCII conversion process is currently lower, because of the errors in identifying syllable boundaries and assigning ISCII byte codes. It has already improved in internal testing within the resource centre, and the improved algorithms will be included in Drishti very shortly.

The basic system was tested by the STQC unit of the Ministry of Communications and Information Technology, Government of India, and it performed with an accuracy of about 85% - 87% at the ISCII level, implying a raw accuracy of 93% - 95%, without using any of the preprocessing or postprocessing routines. On our scanner, Drishti performed with an accuracy of 96% - 97% on test documents provided at C-DAC, Noida in September 2002. In our tests on a number of documents, after fine-tuning the input images for scanner contrast variations and other effects, Drishti consistently gives higher accuracies. Currently, the preprocessing stage is being tuned to adapt to scanner differences.

A Unique Collection of Powerful Library Routines

Drishti is completely modular and is implemented using a number of highly useful C-callable library functions for each of its tasks. The result is a main() routine that is only about 100 lines in length. The complete OCR system can be created by linking these 100 lines of code with the powerful, pre-compiled library routines. The design permits changing the functionality of the system by calling different library functions as and when needed.

Visual Truthing Tool

The visual truthing tool, based on the proposed benchmark standards, allows easy generation of ground truth data (including bounding boxes) from scanned documents. It is a very powerful addition for the OCR community to test and improve their systems.

2.1.4 First Workshop on Indian Language OCR Technology

The first OCR workshop for Indian scripts was organized by us. All the major groups in India working on OCR technology participated. The underlying technologies were discussed in great detail. The various systems under development were installed and tested during the workshop to identify the strengths and weaknesses of the various approaches. The discussion and debate that followed helped all the centres to make further progress.

2.1.5 Conclusion

It is now possible, using the developed tools and library functions, to implement a working OCR system in less than a day. The development of a working Gujarati OCR system from scratch in under two days, and the subsequent transfer of that technology, is a testimony to the design of Drishti.
The result of the work on the Telugu OCR system at RCILTS (Telugu) is more than a product or a technology. It is a powerful set of research and technology tools, and a platform that facilitates rapid development of OCR technologies and solutions for Telugu and other Indian languages in the future.

2.2 Tel-Spell: Spell Checker

A spell checker consists of a spelling error detection system and a spelling error correction system. An ideal spell checker detects all spelling errors and does not raise false alarms for valid words. It also automatically corrects all misspelled words. Clearly, real spell checkers will not be able to match up to such an ideal system. Real spell checkers may raise false alarms for some valid words and may also fail to catch some wrongly spelled words. Also, practical spell checkers rarely correct misspelled words automatically - they only offer a list of suggestions for the user to choose from. Some spell checkers can handle cases where an extra space has been typed in the middle of a word, or when two or more words have been joined together into one. The performance of a spell checker may therefore be measured in terms of factors such as:

• Percentage of false alarms
• Percentage of missed detections
• Number of suggestions offered
• Whether the intended correct word is included in the list of suggestions or not
• The rank of the intended correct word in the list of suggestions
• Whether split and merged words are handled

Developing good spell checkers for Indian languages has been a challenge. No spell checkers were available at all for Telugu and a few other Indian languages. Tel-Spell is the first ever spell checker for Telugu, and it includes both the spelling error detection and correction components.

2.2.1 How do spell checkers work?

Many spell checkers store a list of valid words in the language. A given word is assumed to be free of spelling errors if that word is found in this stored list. Otherwise it is presumed that the given word is wrongly spelled. Since no dictionary can be perfect, such a dictionary based approach is bound to produce some false alarms and some missed detections. A large dictionary may reduce false alarms, but it is also more likely to increase missed detections, since rarely used words may occur more often because of errors in typing than by intention. The choice of words to be included in the dictionary is thus critical for the best overall performance of the spell checker.

When a misspelled word is detected, other words from the stored list that are similar to the given word in terms of spelling are given out as suggestions for correction. A quantitative measure of closeness of spelling, such as the minimum edit distance, can be used to select the words to be included in the suggestion list.

This is obviously a highly over-simplified description of spell checkers. The techniques used for both detection and correction are many and varied. One may use statistical techniques such as n-grams: sequences of characters that do not occur, or that occur with various frequencies, are obtained from a training dataset to build a model of the language, and this model can then be used to detect spelling errors. See Karen Kukich for a good survey of the various techniques.
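To make the suggestion step concrete, here is a minimal sketch (not Tel-Spell's actual code) of ranking dictionary words by minimum edit distance:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(word, dictionary, k=5):
    """Rank dictionary entries by closeness of spelling to `word`."""
    return sorted(dictionary, key=lambda w: edit_distance(word, w))[:k]
```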
2.2.2 Why is it difficult to build a spell checker for Telugu?

Indian languages in general, and Dravidian languages in particular, are characterized by an extremely rich system of morphology. Words in Dravidian languages like Telugu and Kannada are long and complex, built up from many affixes that combine with one another according to complex rules of saMdhi. An example is nilapeTTukooleekapootunnaaDaa?, which means something like "Is it true that he is finding it difficult to hold on to (his words/something)?"

Telugu is both highly inflectional and agglutinative. Auxiliary verbs are used in various combinations to indicate complex aspects. Clitics, particles and vocatives are all part of the word. Telugu exhibits vowel harmony - vowels deep inside a verb may change due to changes at the boundaries of saMdhi. External saMdhi between whole words and compounds also occurs in the language. See the references below for more on Telugu morphology. Suffice it to say that Telugu is one of the most complex languages of the world as far as morphology is concerned.

It is therefore not practically feasible to store all forms of all words directly in a dictionary for the purposes of spelling error detection and correction. At the same time, building a robust morphological analyzer and generator is an extremely challenging task. Developing a good spell checker for languages such as Telugu is thus a very difficult task. No wonder no spell checkers were available for these languages to date.

2.2.3 Design of Tel-Spell

Perhaps the best and most thoroughly worked out morphological analyzer for Telugu is the one we have developed at University of Hyderabad over the past 10 years or so. The system has been tested on large scale corpora, and enhancements and refinements have been going on for years. During this project, a thorough re-engineering was taken up and a new version was developed. The new version is far simpler, more transparent, portable, well documented, and conforms to standards.

AKSHARA: Spelling Error Detection and Correction

Research work has also been taken up on developing stemming algorithms for Telugu. A pure corpus based statistical stemming algorithm has been developed. The performance of this stemmer for the spell checking application has been studied in various combinations with the dictionary and morphology based approaches. See the thesis by Ravi Mrutyunjaya below for reference.

A lot of empirical work and experimental studies had to be conducted to come out with the best combination of dictionary and morphology to develop the first version of the spell checker for Telugu. A 10 Million word corpus developed by us has been used to build and test the performance. The performance of the system has been found to be satisfactory, both in terms of detection and correction of spelling errors. See the references below for more details. Our Telugu spell checker technology has been transferred to M/S Modular Infotech Ltd. on a non-exclusive and non-preferential basis for commercialization. The Telugu spell checker has also been integrated into our AKSHARA advanced multi-lingual text processing system.

2.2.4 Error Pattern Analysis

Large scale spelling error data has been obtained from our 10 Million word Telugu corpus. The raw corpus, as it was typed, has been compared with the final version obtained after three levels of proof reading and certification by qualified and experienced proof readers. A number of tools have been developed to prepare such data. A quantitative study of spelling error patterns in Telugu is being conducted. This will help us to build better spell checkers in future.

2.2.5 Syllable level statistics for spell checking

Since words are large and complex, and hence too numerous, in Telugu, and proper morphological analysis is difficult, it is useful to perform studies at lower levels of linguistic units. The syllable level is a natural choice, since writing in Indian languages is primarily syllabic in nature. n-gram models have been built at the syllable level. HMM models have also been built. These models can be used to detect spelling errors and to rank the suggestions for correcting a given word. See the references below for more technical details.
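A minimal sketch of syllable-level bigram scoring for error detection follows; it assumes words are first split into syllables, and the syllabifier and the probability floor are placeholders rather than the system's actual components.

```python
import math
from collections import Counter

def train_syllable_bigrams(words, syllabify):
    """Count syllable bigrams over a corpus of correctly spelled words."""
    bigrams, unigrams = Counter(), Counter()
    for w in words:
        syls = ["<w>"] + syllabify(w) + ["</w>"]
        unigrams.update(syls[:-1])
        bigrams.update(zip(syls, syls[1:]))
    return bigrams, unigrams

def word_logprob(word, syllabify, bigrams, unigrams, floor=1e-6):
    """Score a word; improbably low scores flag likely misspellings,
    and the same score can rank correction candidates."""
    syls = ["<w>"] + syllabify(word) + ["</w>"]
    lp = 0.0
    for a, b in zip(syls, syls[1:]):
        p = bigrams[(a, b)] / unigrams[a] if unigrams[a] else floor
        lp += math.log(max(p, floor))
    return lp
```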
2.2.6 Further work
An agreement has been entered into with M/S Modular Infotech Ltd. for the further development and transfer of spell checkers for Telugu and Kannada.

2.3 AKSHARA: Advanced Multi-Lingual Text Processor

2.3.1 Why one more word processor?

A systematic study of the various available word processors for Indian languages was performed in order to choose the best ones for our own use here.
AKSHARA: Advanced Multilingual Text Processor

The study indicated that none of the available software products were satisfactory. They were slow, fragile and unreliable, and they broke down when the data was large. Even very simple operations, such as changing the font size, caused the system to crash when the file was big. There are a number of problems, and a detailed study convinced us that these are not merely implementation level bugs that can be hoped to be removed in future versions. The basic design philosophies are faulty and short sighted. It appears that most of the commercial packages have been designed without thinking beyond the type-compose-print paradigm of using computers as mere typewriters. Most commercial packages work only under Microsoft Windows platforms. The better ones are a bit too costly for most ordinary users. Adherence to standards is poor, and compatibility across fonts, versions and different packages is a big problem. This motivated us to start developing our own advanced multi-lingual text processor, named AKSHARA.

2.3.2 Encoding scheme

AKSHARA encodes texts in a standard character encoding scheme such as ISCII or UNICODE. Many commercial systems use font encoded pages to by-pass the character to font conversion process - itself a complex step for which there does not seem to be any fully satisfactory solution so far. These commercial systems also often use proprietary, non-standard fonts with secret encodings. The documents so created are not texts at all. There will be serious portability constraints and, unfortunate as it is, often the only practical way to get back some text already in electronic form encoded in such fonts is to re-type the whole text! AKSHARA documents are always character encoded. Mapping to fonts is done only for the purposes of display and printing - all other operations are performed on character encodings. Commercial packages use proprietary and secret encodings with non-printable control characters for storing the attributes. In AKSHARA, attributes are included in an open XML style markup language called Extensible Document Definition Language (XDL), developed by us. This makes it easy to convert to and from various other encoding schemes, thereby ensuring the highest levels of portability and platform independence.

2.3.3 Script Grammar

One of the unique features of Indian language writing systems is a script grammar. While the "akshara" or syllable is the basic unit of writing, these aksharas are actually composed of more basic elements such as vowels and consonants. Not all sequences of such basic elements are valid, and the script grammar specifies the legal combinations. Most commercial packages do not seem to respect the script grammar properly. A large percentage of the errors in the corpora developed earlier using other tools has been found to be due to the inability of these tools to check and strictly apply the script grammar. What you see on the screen is not always what is stored in the file, and hence there is no way to check and correct these mistakes by looking at the documents on the screen.

A unique feature of AKSHARA is that it understands the script grammar and warns you if you try to build ungrammatical syllables. AKSHARA has been successfully used to clean up all the corpora at CIIL, Mysore.

2.3.4 AKSHARA is Robust, Reliable and Platform Independent

AKSHARA is platform independent - you can use it on MS Windows, Linux and many other platforms. AKSHARA is also robust and reliable - you can comfortably work with large documents without worrying about silly restrictions such as line lengths. AKSHARA has been successfully used to develop a 10 Million word corpus of Telugu.
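The script-grammar checking described in 2.3.3 can be pictured with a toy finite-state rule over abstract vowel/consonant/matra classes. This is only an illustration of the idea; the actual rules in AKSHARA, at the ISCII level, are far richer.

```python
import re

# Toy script grammar: a syllable is either a lone vowel (V), or a
# consonant cluster (consonants C joined by halant H) followed by
# an optional matra M (dependent vowel sign).
SYLLABLE = re.compile(r"V|C(?:HC)*M?")

def valid_syllable(units, classes):
    """units: list of basic elements (e.g. ISCII codes);
    classes: dict mapping each unit to 'C', 'V', 'M' or 'H'.
    Returns True if the sequence forms a grammatical syllable."""
    tags = "".join(classes.get(u, "?") for u in units)
    return SYLLABLE.fullmatch(tags) is not None
```

An editor built on such a check can refuse (or warn about) a keystroke that would extend the current syllable into an invalid tag sequence, which is essentially the behaviour described above.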
2.3.5 Advanced Text Processing Tools

AKSHARA is an advanced text processing tool - dictionaries, morphological analyzers, spell checkers, OCR systems, TTS systems, and text processing tools including searching, sorting etc. are part of AKSHARA. Several text processing tools, the Telugu spell checker and the Telugu TTS have already been integrated. We would also be happy to integrate any of the other dictionaries, spell checkers, TTS systems etc. that other centres may have developed. Full support for Regular Expressions and Finite State Machines is being integrated.

AKSHARA: Text Processing Tools

2.3.6 AKSHARA as an email client

AKSHARA is unique in providing multilingual email sending as well as receiving facilities. All you need is a public email account somewhere. While many other systems allow you to send emails, receiving mails is not as easy. With AKSHARA there is no longer any need to depend on any third party sites on the Internet.

2.3.7 Developing Interactive web pages in Indian languages

AKSHARA also enables you to develop interactive web pages in Indian languages and English. Just use AKSHARA and any web browser to create, edit, modify and refine your web pages. These web pages will work across platforms and browsers. All your web pages will still be character encoded, not font encoded. Thus you will really be building a long lasting knowledge base. What is more, you can create interactive web pages - pages into which your users can directly type, in Indian languages. All this is made possible through our unique WILIO technology. You will find more on WILIO below. The web pages on our site www.LanguageTechnologies.ac.in have been developed using this technology. For example, you will be able to interactively search bilingual dictionaries from our website.

2.3.8 Availability

Wondering how much AKSHARA may cost? AKSHARA will be freely available, and available for free. Let everybody have the basic Indian language processing capabilities without restrictions.

2.4 email in Indian Languages

We have developed technology for composing, sending as well as receiving emails in any combination of English and other Indian languages. Many other systems support only the sending of mails; receiving mails would not be as straightforward. This technology has been integrated into our AKSHARA system. You will only need a public email account (one that supports the POP3 or IMAP protocols) somewhere. Unlike other technologies, here there will be no dependence on any third party web sites. AKSHARA installed on your local machine will be your email client.

AKSHARA: email client

2.5 WILIO: Interactive Web Pages in Indian Languages

Developing web content in Indian languages has been a challenge. None of the browsers understand the ISCII character encoding scheme. We may hope that UNICODE compatible browsers and the free availability of UNICODE fonts will mitigate the situation to a
large extent in the future. As on date, however, not all browsers support UNICODE, UNICODE fonts are not readily available for all Indian languages, and the UNICODE scheme is itself still not completely satisfactory and revisions are going on. We briefly explore here the various alternatives people have tried, and present our own technology, which we feel is far superior to the others.

2.5.1 Text as Pictures

The simplest way to ensure that every client sees exactly what you want him or her to see is to encode texts as images. This technology will work irrespective of the platform and the particular browser the user is using, and whether or not he or she has the required fonts. However, a picture is worth one thousand words (or a lot more) - both in terms of the storage and the network bandwidth required. Clearly, this cannot even be considered a solution, as there is no text at all.

2.5.2 Font encoded pages

We can have font encoded web pages and expect the users to have the fonts locally available on their machines. Unfortunately, fonts are not yet freely available for Indian languages, and most computers in the country will have no Indian language fonts at all. Much more importantly, font encoded pages are not texts at all. There is no font encoding standard, and encoding texts in proprietary fonts is as good as encrypting them. Web sites must be viewed as knowledge bases - long lasting and easily maintainable.

Unlike in the case of languages like English, where character and glyph have a one to one correspondence, Indian scripts are complex, and the mapping from characters to glyph sequences in a given font is a complex many to many mapping. Therefore font encoded web pages are no solution at all.

Dynamic Font Technology

One part of the problem with font encoded schemes, namely the availability of fonts on the client machine, can be solved by using the so-called dynamic font technology. The basic idea is to send the fonts also, along with the requested documents, to the client. The pages are still font encoded - not much different from the previous method. Further, dynamic font libraries are required - the usual fonts are not sufficient. One needs to buy tools to prepare dynamic font libraries; otherwise you will have to depend upon some other service provider, and requests for web pages will then require connections to the service provider's web site too. Clearly, this is not a good solution.

2.5.3 Plug-in technology

Then there are plug-ins - add-on pieces of software that do the character to font conversion on the client machine. The plug-in has to be downloaded and installed only once by a client, and the pages remain character encoded. This sounds like a good solution, but it has not worked well in practice. Plug-ins are add-ons to browser software, and browsers vary widely in terms of the support and the details of how they take on these add-ons. For each browser, and in fact for each version of a browser, a suitable version of the plug-in will have to be developed. As new browsers keep appearing in the market, new versions of the plug-ins will need to be developed too. Unless the browser developers themselves take on the responsibility of supporting Indian script standards, this technology is unlikely to be accepted as a good and permanent solution.

2.5.4 Forget the scripts, use Roman

Of course, one may forget Indian scripts and encode Indian languages in the Roman script. This will provide complete immunity from platform and browser variations and font dependencies. Literate Indian language users who are not comfortable with the Indian scripts, say, non-resident Indians, will also be able to use this technology. However, this cannot be taken as a solution for Indian language support for web content!

2.5.5 What is WILIO?

We have endeavored to develop a better technology, which we call WILIO. WILIO permits standard character encoded web pages to be viewed on any browser and any operating system. The character encoded pages are received by the client and mapped to the required fonts before being displayed.

WILIO is unique in its ability to permit two-way communication. We can develop interactive web pages wherein the users can also type Indian language content directly into the browser. The required keyboard driver etc. are included in WILIO. Thus one may prepare lessons, ask questions, allow users to type in their responses, receive and validate the answers, and get back to the users accordingly. This will open up a whole new experience with Indian language web content. The web pages on
our site have been developed using WILIO technology. For example, look at the dictionary look-up services at our website www.LanguageTechnologies.ac.in.

2.5.6 How WILIO works

WILIO works through a Java Applet. Browsers must support Java; most browsers do. In case the Java plug-in is not installed, it will be automatically installed after getting the user's confirmation. WILIO also requires that the fonts are locally available. A few fonts are available freely in many countries, and we hope we will not be far from such a day in India too. We are making all out efforts to make a few fonts freely available to everybody for non-commercial use. WILIO itself is fully integrated into AKSHARA, and AKSHARA will be freely available and available for free.

2.5.7 Security

One of the very useful side-effects of this technology is document security - WILIO makes it more difficult for people to download and print the pages.

2.6 Telugu Corpus

A large, representative corpus is the first and most essential resource for language engineering research and development. A corpus is essential for building language models as well as for large scale testing and evaluation. Special emphasis was therefore laid on developing a fairly large corpus of the Telugu language, the language of focus in the current project.

2.6.1 Status before the year 2000

Developing corpora for Indian languages has been more challenging than it may appear. These days most publishers use computers at some stage or the other. Why not simply compile such readily available material? While it is possible to get some material in electronic form directly from publishers, DTP centres and websites, it must be emphasized that there are no free fonts, and the proprietary fonts used by various groups do not stick to any standard. While the ISCII national standard for character encoding has been around for a long time now, most of the documents continue to be developed using proprietary fonts embedded in proprietary commercial software. Thus it is not possible to simply download material and add it to the corpus.

Before the year 2000, corpora of only about 3 Million words were available for the major Indian languages. These corpora were developed with the support of the Ministry of Communications and Information Technology, then known as the Department of Electronics. Even these corpora were not released to researchers for many years, because of legal and technical problems relating to copy rights.

2.6.2 How large is a large corpus?

Given the rich morphological nature of Indian languages, it was felt that a mere 3 Million word corpus would not be sufficient. In order to establish this fact, we conducted a growth rate analysis of the available corpus of Telugu. The corpus is split randomly into equal sized parts, and a type-token analysis is performed. A "type" is a particular word form, and each occurrence of that word form constitutes a "token". For example, "word" is a type and there are two tokens of this type in the previous sentence. Each part of the corpus contributes a set of types, and the cumulative number of types is plotted on the Y-axis against the size of the corpus (measured in terms of the cumulative number of tokens) on the X-axis. The resulting curve depicts the rate at which new types grow as the size of the corpus increases. If the curve shows signs of saturation and tends towards the horizontal, it means that most of the word forms have already been obtained, and adding more corpus will not add too many new word forms. As long as the growth rate curve continues to show a high slope, it indicates that the corpus we have is insufficient and that many of the possible word forms are yet to be seen even once in the corpus. Given below are the growth rate curves for all the major Indian languages for which corpora were available.

2.6.3 Dravidian Languages vs Indo-Aryan Languages

The distinction between Dravidian languages and Indo-Aryan languages is striking in this figure - there are many more word forms (types) in Dravidian languages than in the other Indian languages. While 150,000 to 200,000 word types should give excellent coverage for the northern languages, Dravidian languages such as Telugu, spoken mainly in the southern parts of India, require a much larger number of word forms. And, more importantly, the available corpus is not sufficient even to get a clear idea of how many words there are in the language. The morphology of these languages is so rich that no one so far has an idea of how many different word forms there are in the language. One well known linguist has argued that each verb root in Telugu can give rise to as many as 200,000 different inflected/derived word forms (Ref. Dr. G. Uma Maheshwara Rao, personal communication).
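The type-token growth analysis described above is straightforward to compute; the following minimal sketch assumes whitespace tokenization, which is only an approximation for real corpora.

```python
def growth_curve(tokens, n_parts=10):
    """Cumulative (tokens seen, distinct types seen) after each
    equal-sized part of the corpus - the type-token growth curve."""
    part = max(1, len(tokens) // n_parts)
    seen, curve = set(), []
    for i in range(0, len(tokens), part):
        seen.update(tokens[i:i + part])
        curve.append((min(i + part, len(tokens)), len(seen)))
    return curve

# Usage: curve = growth_curve(open("corpus.txt", encoding="utf-8").read().split())
# A curve that keeps rising steeply means many word forms are still unseen.
```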
What this shows is that techniques which work well for Indo-Aryan languages may not be applicable to Dravidian languages. For example, it would be possible to simply list all forms of all words and use this for a dictionary based spelling error detection and correction system for Hindi, Punjabi or Bengali, but such an approach cannot be expected to produce comparable performance for, say, Telugu or Kannada. It would thus not be proper to make outright comparisons of the performance of language engineering products across these classes of languages. The inherent complexity of the languages must be factored in when making any comparative judgements of performance.

2.6.4 Copy Right Issues

Given the earlier experiences of groups developing corpora, we had to take every care to ensure that we did not get into copy right problems. At the same time, we realized that it was not going to be easy for us to take over the legal copy rights from the authors or publishers. Hence it was decided that we would only ask for the right of electronic reproduction and the rights for hosting selected works on our web-site, without asking for a legal transfer of copy rights. The original copy right holders would continue to hold their copy rights and would be free to sell, distribute or transfer their works to any other party at their will.

It was nevertheless an extraordinary convincing effort to get the best authors to part with their best works for our corpus, without spending a single pie of money. It took a tremendous amount of time and effort to convince the copy right holders that what they and the country at large gain in the long run will be much more than the hypothetical loss incurred in giving us their works free of cost. A variety of strategies and tactics had to be used, but in the end we have been able to obtain the rights for more than 250 of the best works of the best known writers. Add to this the works for which copy rights have expired, and we have a list of more than 500 books. (It will be interesting to note that the expectation of the funding body was 10 good books!)

2.6.5 The Status

A corpus of 225 books, adding up to about 30,000 pages and 9.25 Million words, has been completed. The corpus includes a variety of topics and categories - newspaper articles, short stories, novels, poetry, classical and modern writings etc. Each of these works has been typed in using our AKSHARA advanced multi-lingual text processor and other such tools, subjected to two levels of thorough proof reading by qualified and experienced proof readers, and finally certified free of errors. The entire corpus is encoded in the ISCII/UNICODE character encoding standard, and an XML style annotation scheme is used for meta information.

A growth rate analysis of types against tokens for the 12 Million word total corpus of Telugu available now was conducted recently. The curve shown below makes it clear that even this corpus is not sufficient - there is no sign of saturation, and the growth rate has not reduced significantly. We still do not have even a single occurrence of most of the word forms, although the number of types has already reached 20,000,000.

2.6.6 Tools

A number of tools have been developed to develop, analyze and manage large scale corpora. Some of these tools have also been given to other centres. A comprehensive tool kit is being developed.

We have also developed tools for semi-automatically decoding any unknown font with minimum effort. Using this technique, a mapping scheme can be developed to map the text strings encoded in the unknown font into an equivalent text string in a standard character encoding scheme such as ISCII or UNICODE. In fact, several of the widely used fonts have been decoded, and we can now add more free material, such as newspaper articles, with ease.

2.6.7 Plans

Now that our OCR system for Telugu and the Telugu spell checker have reached a level of performance that makes them suitable for use in content creation, we hope to be able to develop even larger corpora of Telugu very soon. A thorough investigation into the spread across genres and the representativeness of the corpus is also being carried out, so that further work can be fine tuned accordingly, despite the non-availability per se of texts in some of the categories in the Telugu language.

Plans for the future include various levels of annotation. English-Telugu parallel corpus development is also being considered.

2.7 Dictionaries, Thesauri and other Lexical Resources

Dictionaries are the most basic and essential data resource for any language. Accordingly, we have
developed a number of monolingual and bilingual dictionaries, as detailed below. The dictionaries are available in the XML format for data exchange, and are indexed cleverly for efficient search. Look-up services are provided from our website using our unique WILIO technology for OS and browser independent deployment. All dictionaries are encoded in the ISCII/UNICODE standard character encoding schemes. Apart from dictionaries, we have also developed thesauri and, more importantly, a tool by which we can develop a thesaurus of sorts for any language in just a few minutes from a suitable bilingual dictionary. Here is a summary of the dictionaries we have with us. Some of these are already being used by researchers in other centres.

2.7.1 C P Brown's English - Telugu Dictionary
Status: Completed; Size: 31,000 plus; Fields: POS, Meanings, Usage; XML?: Yes; Indexed?: Yes; Web-enabled?: Yes.

2.7.2 C P Brown's Telugu - English Dictionary
Status: Completed; Size: 31,000 plus; Fields: POS, Meanings, Usage, Etymology; XML?: WIP; Indexed?: WIP; Web-enabled?: WIP.

2.7.3 English - Telugu Dictionary suitable for Machine Aided Translation
Status: Completed; Size: 37,500 plus; Fields: POS, Meanings; XML?: Yes; Indexed?: Yes; Web-enabled?: WIP.

2.7.4 Telugu - Hindi Dictionary suitable for Automatic Translation
Status: Completed; Size: 64,000 plus; Fields: POS, Paradigm Class, Meanings; XML?: Yes; Indexed?: Yes; Web-enabled?: WIP.

2.7.5 English - Kannada Dictionary
Status: Completed; Size: 15,000 plus; Fields: POS, Meanings; XML?: Yes; Indexed?: Yes; Web-enabled?: WIP.

2.7.6 Basic Material for English Dictionary
Status: Completed; Size: 6,00,000 plus; Fields: POS, Frequency; XML?: Yes; Indexed?: Yes; Web-enabled?: WIP.

2.7.7 English Dictionary
Status: Completed; Size: 80,000 plus; Fields: POS, Frequency; XML?: Yes; Indexed?: Yes; Web-enabled?: WIP.

2.7.8 Telugu Dictionary
Status: Completed; Size: 64,000 plus; Fields: POS, Paradigm Class; XML?: Yes; Indexed?: Yes; Web-enabled?: WIP.

2.7.9 Kannada Dictionary
Status: Completed; Size: 12,000 plus; Fields: POS; XML?: Yes; Indexed?: Yes; Web-enabled?: WIP.

2.7.10 Kannada Thesaurus
Status: Completed; Size: 12,000 plus; Fields: Synonyms, POS, Sense; XML?: Yes; Indexed?: Yes; Web-enabled?: WIP.

(WIP: Work in progress)

These dictionaries are all closely linked up with the corpora, the morphological analyzers and generators, the spell checkers etc. Cross-validation and refinement continue on a regular basis.

2.7.11 Tools

We have also developed a number of tools for developing electronic dictionaries; for efficient indexing, searching and other such operations on electronic dictionaries; for formatting in XML or other standards; for verification and validation; and for web-enabling and offering web based services. We would be glad to host dictionaries developed by other centres using our platform independent and secure WILIO technology.

2.7.12 Automatic generation of a Thesaurus from a Bilingual Dictionary

We have developed a unique tool that can generate a thesaurus of sorts for any language in just a few minutes, starting from a suitable bilingual dictionary. We would be glad to offer this service to any centre that has a suitable bilingual dictionary. Thesauri being extremely useful resources, yet non-existent for many Indian languages, we believe the contribution of this tool will be appreciated in all quarters.
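The idea behind such a tool can be sketched simply: words of the source language that share a translation in the target language are grouped as candidate near-synonyms. The following is a minimal illustration with an assumed dictionary structure, not the centre's actual tool.

```python
from collections import defaultdict

def thesaurus_from_bilingual(dictionary):
    """dictionary: dict mapping a source-language word to its list
    of target-language meanings. Words sharing a meaning are
    grouped as candidate near-synonyms."""
    by_meaning = defaultdict(set)
    for word, meanings in dictionary.items():
        for m in meanings:
            by_meaning[m].add(word)
    thesaurus = defaultdict(set)
    for words in by_meaning.values():
        for w in words:
            thesaurus[w].update(words - {w})
    return thesaurus

# Illustrative: thesaurus_from_bilingual({"pustakamu": ["book"],
# "grantham": ["book", "text"]}) links pustakamu and grantham via "book".
```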
2.7.13 Technology for hosting dictionaries on the web

We have the unique capability to place dictionaries on the web for efficient, secure, platform and browser independent services. We would be glad to host, through our technology and from our site, any other dictionary developed by any other centre.

2.8 Morphology

2.8.1 What is Morphology?
Morphology deals with the internal structure of words. Morphology makes it possible to treat words such as compute, computer, computers, computing, computed, computation, computerize, computerization, computerizable and computerizability as variants of the same root, rather than as different words unrelated to one another. Morphology makes it possible to store only the root words in the dictionary and derive the other variants through the rules of morphology. It helps us to understand the meaning of related words.

2.8.2 Indian Languages exhibit rich morphology

Morphology plays a much greater role in Indian languages because our languages are highly inflectional. While the English verb eat gives rise to only a few variants such as eats, ate, eaten and eating, the corresponding verb in Telugu can give rise to a very large number of variants. Words in Dravidian languages like Telugu and Kannada are long and complex, built up from many affixes that combine with one another according to complex rules of saMdhi. An example is nilapeTTukooleekapootunnaaDaa?, which means something like "Is it true that he is finding it difficult to hold on to (his words/something)?"

Telugu is both highly inflectional and agglutinative. Auxiliary verbs are used in various combinations to indicate complex aspects. Clitics, particles and vocatives are all part of the word. Telugu exhibits vowel harmony - vowels deep inside a verb may change due to changes at the boundaries of saMdhi. External saMdhi between whole words and compounds also occurs in the language. See the references below for more on Telugu morphology. One linguist puts the number of variants for a single Telugu verb at nearly 200,000! [G. Uma Maheshwara Rao, Personal Communication.] The exact number of different forms that a verb can take in a language like Telugu is not yet clear. The growth rate analysis described in the section on corpora clearly shows that the 12 Million word corpus available at present is not sufficient to give us even a single occurrence of many possible words in the language. While Indian languages in general are morphologically richer than languages like English, the Dravidian languages are a lot more complex. The 12 Million word corpus of Telugu has nearly 20,000,000 different words, and there will be many more, as the growth rate curve indicates. In contrast, the Indo-Aryan languages have only about 1,50,000 to 2,00,000 word forms in all. The Dravidian languages, including Telugu, Kannada, Malayalam and Tamil, are among the most complex languages of the world and can only be placed alongside languages such as Finnish and Turkish. Clearly, there is no way we can hope to list all forms of all words in a dictionary. We cannot build a spell checker, for example, by simply listing all forms of all words. Morphology is not just useful but absolutely essential.

2.8.3 Design of the Telugu Morphological Analyzer

Building a morphological analyzer and generator for a language like Telugu is thus a very challenging task. Perhaps the only large scale system built for Telugu is ours. Our Telugu morphological analyzer has been built, tested against corpora and refined over the past 10 years. This system uses a root word dictionary of 64,000 entries and a suffix list categorized into a number of paradigm classes. The basic methodology is to look for suffixes, remove them taking care of saMdhi changes, and then cross-check with the dictionary. Inflection, derivation and external saMdhi are all handled. See the references below for more technical details. There is also a separate morphological generator that can put together the roots and affixes to construct complete word forms.
is not yet clear. The growth rate analysis described both for analysis and generation. In this model, a
in the section on corpora clearly shows that the 12 complete and detailed analysis is made at the level
Million corpus available at present is not sufficient of each affix.
to give us even a single occurrence of many possible
words in the language. While Indian languages 2.8.5 Tool for developing Morph systems for other
in general are morphologically richer than languages
languages like English, Dravidian languages Morphological analyzers and generators for several
are a lot more complex. The 12 Million word languages including Kannada, Tamil, Oriya etc. have
corpus of Telugu has nearly 20,000,000 different been built using this Network and Process model.
words and there will be many more as the growth That a good Tamil morphological analyzer and
rate curve indicates. In contrast, the Indo-Aryan generator could be built within a week using this
languages have only about 1,50,000 to 2,00,000 system is a testimony to the quality of design and

122
implementation of the system. See references below In English like positional languages, the category of
for more details. a word can be determined in terms of the categories
As we build larger and more representative of the preceding words. As such Hidden Markov
corpora, further refinements to dictionaries as well Models have been widely used. There are several
as morphological analyzers and generators will other techniques too. However, as far as Indian
continue. languages are concerned, many of these sequence
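The suffix-stripping methodology of 2.8.3 can be conveyed with a small sketch in Python. This is only an illustration of the idea, not the centre's implementation: the toy ROOTS and SUFFIXES tables stand in for the 64,000-entry root dictionary and the paradigm-classified suffix list, and the saMdhi changes at the boundary are ignored here.

# Illustrative paradigm-style analysis by suffix stripping (toy data only).
ROOTS = {"illu": "noun-class-2", "pani": "noun-class-1"}   # root -> paradigm class
SUFFIXES = {"lu": "plural", "ni": "accusative", "ki": "dative"}

def analyze(word):
    """Return (root, feature) analyses whose root survives the dictionary check."""
    analyses = [(word, "stem")] if word in ROOTS else []
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        # A real analyzer would first undo saMdhi changes at this boundary.
        if suffix in SUFFIXES and stem in ROOTS:
            analyses.append((stem, SUFFIXES[suffix]))
    return analyses

print(analyze("illulu"))   # -> [('illu', 'plural')]

The same table-driven structure, read in the opposite direction, gives the generator: roots and affixes are put together instead of being taken apart, which is why a single declarative network can serve both purposes.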
2.9 Stemmer

As the above section on Morphology shows, it is very difficult to build a high performance analyzer or generator for Dravidian languages such as Telugu. An alternative short-cut approach that can be used in practice is stemming. Here a complete and detailed morphological analysis is not performed. Instead, the affixes are removed to obtain the root. For example, the common prefix in the words compute, computer and computing is comput and hence all these word forms are reduced by removing the affixes to the common stem comput. Note that comput is not a valid linguistic unit at all. Yet, such stemming techniques are useful and have been used in many areas including Information Retrieval. Stemming can also be used as the second line of defence when morphology fails.

A thorough study of various stemming techniques has been conducted. Ingenious corpus based statistical stemming techniques have been developed for stemming in Telugu. Vowel changes, gemination etc. need to be taken care of in building a stemmer for Telugu. The stemmer has been compared with the full morphological analyzer and various combinations have been tried out for the purpose of spelling error detection and correction in Telugu. See the references below for more technical details.
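As a contrast with full analysis, here is a minimal stemmer sketch of the kind described above, again with an invented suffix list. It strips the longest matching suffixes repeatedly without validating the resulting stem against any dictionary, which is exactly why comput-like non-words are acceptable outputs; the vowel changes and gemination that matter for Telugu are not modelled.

# Minimal longest-suffix-first stemmer (toy suffix list, English examples).
SUFFIXES = sorted(["ization", "ation", "ing", "ed", "er", "s"],
                  key=len, reverse=True)

def stem(word, min_stem=6):
    changed = True
    while changed:                 # strip repeatedly until no suffix fits
        changed = False
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                word = word[: -len(suf)]
                changed = True
                break
    return word

for w in ["computer", "computing", "computerization"]:
    print(w, "->", stem(w))        # all three reduce to 'comput'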
2.10 Part of Speech Tagging

2.10.1 What is POS tagging?

A dictionary lists all possible grammatical categories for a given word. The job of a Part of Speech (POS) tagger is to identify the correct POS for a given word in context. For example, the word thought is a verb in the first of the following sentences and a noun in the second: "I have thought about it from various angles. Suddenly this strange thought came to my mind." POS tagging may be at the level of gross grammatical categories such as verbs and nouns or, more often, at a more fine grained level of sub-categorization.

2.10.2 POS tagging techniques for English and Indian languages

In English-like positional languages, the category of a word can be determined in terms of the categories of the preceding words. As such, Hidden Markov Models have been widely used. There are several other techniques too. However, as far as Indian languages are concerned, many of these sequence oriented techniques are not very applicable. Our languages are characterized by free word order and hence it does not make much sense to depend so much on the previous or following few words. Instead, our languages are characterized by a very rich system of morphological inflection and it is here that we get maximum information about the correct part of speech of a word. The percentage of words that occur in some inflected form rather than in the bare stem form is far higher for Indian languages as compared to English. Morphology holds the key for POS tagging of Indian languages. In fact one may even go a step further and argue that a POS-tagged corpus does not make much sense: whenever you process some text, you will need to perform morphological analysis, and the job of the POS tagger will be done there too. However, developing robust morphological analyzers for Indian languages in general and Dravidian languages in particular has been a difficult challenge. The performance of any POS tagger based on morphology would be limited by the performance of the morphological analyzer itself.

2.10.3 Degree and nature of lexical ambiguities

A systematic study of the degree and nature of lexical ambiguities at the dictionary and corpus levels is being conducted. Appropriate technologies for POS tagging based on morphology are being developed. The percentage of words that occur in some inflected form rather than in the bare stem form is far higher for Indian languages as compared to English. This has serious implications for the degree and nature of lexical ambiguities in running texts.

2.10.4 HMM system for POS tagging

In order to gain a deeper understanding of POS tagging for various languages including English, a Tri-tag based HMM model has also been built and tested on the SUSANNE corpus.
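The tri-tag idea can be sketched as follows: the Viterbi search keeps one best path per pair of preceding tags, so that transition probabilities condition on two tags of history. All the numbers below are invented toy values for a two-tag set; they are not parameters estimated from the SUSANNE corpus.

# Toy tri-tag HMM tagger: states are (previous tag, current tag) pairs.
import math

TAGS = ("N", "V")
TRANS = {("<s>", "<s>", "N"): 0.6, ("<s>", "<s>", "V"): 0.4,
         ("<s>", "N", "V"): 0.7, ("<s>", "N", "N"): 0.3,
         ("N", "V", "N"): 0.5, ("N", "V", "V"): 0.5,
         ("N", "N", "V"): 0.5, ("N", "N", "N"): 0.5}
EMIT = {("N", "I"): 0.3, ("V", "have"): 0.4,
        ("N", "thought"): 0.1, ("V", "thought"): 0.4}

def viterbi(words):
    best = {("<s>", "<s>"): (0.0, [])}          # state -> (log-prob, tags so far)
    for w in words:
        nxt = {}
        for (t1, t2), (lp, hist) in best.items():
            for t3 in TAGS:
                p = TRANS.get((t1, t2, t3), 1e-6) * EMIT.get((t3, w), 1e-6)
                cand = (lp + math.log(p), hist + [t3])
                if (t2, t3) not in nxt or cand[0] > nxt[(t2, t3)][0]:
                    nxt[(t2, t3)] = cand
        best = nxt
    return max(best.values(), key=lambda v: v[0])[1]

print(viterbi(["I", "have", "thought"]))        # -> ['N', 'V', 'V'] here

With these toy numbers the example sentence of 2.10.1 gets thought tagged as a verb, purely because the two-tag history (N, V) and the emission probabilities favour it; a real system estimates both tables from a tagged corpus.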
2.11 VIDYA: Comprehensive Toolkit for Web-Based Education

2.11.1 eLearning

eLearning helps to overcome the barriers of distance and time in learning. The thrust is on learning, not teaching. Students can thus learn whatever they like, in whatever order they please and at a pace that is best for them. Instead of teachers, there will only be facilitators.

There are many tools for eLearning. Only some of them are comprehensive tools that provide for the entire gamut of facilities from pre-registration counseling to maintenance of the alumni database. Good ones are very costly and most educational institutions in India will not be able to afford to buy such tools. Many developments are taking place in this area but the major usage so far has been limited to corporate training in Information Technology and related areas. The focus is on adult, serious learners only. Profit seems to be the main motive in many cases.

2.11.2 Web Based Education - Technology for Quality Education

Our view point is very different. Traditional education is more teacher centric whereas eLearning is learner centric. The idea is not to choose one or the other but the right combination of both. Not all students are mature and serious enough to learn on their own. Teachers are required to guide, instill confidence and inspire. The need is to look at education at all levels in a holistic sense.

The primary issue in question is quality of education. We produce very large numbers of BSc's and MSc's but very few scientists. We produce very large numbers of BE and BTech degree holders but very few engineers. The primary school level is worse. There are several problems and not all of them can be solved through technology. The question is: what is it that technology can do to ensure quality education for everybody?

How do we ensure the highest quality of education at all levels without barriers of distance and time? Good teachers are not always available. Distance and time are not the only barriers. Cost and language are bigger and more serious barriers. The ultimate objective should therefore be to reach out to all interested students and offer the highest quality of education without any kind of barrier - distance, time, cost or language. Here is where technology can bring the services of the best teachers and the best course materials to every student in a cost effective manner. Our technologies must be Indian language enabled.

2.11.3 VIDYA - a comprehensive suite of tools

Given this scenario, we started developing a comprehensive suite of tools for web based education. Our suite is called VIDYA. Indian languages can be supported. Interactive web content can be created using our WILIO and AKSHARA technologies. VIDYA supports inter-student and student-teacher interaction through email, chat, discussion rooms and white boards. It encourages collaborative problem solving and group activities.

VIDYA: Comprehensive suite of Tools for Web Based Education

VIDYA has the unique facility to link with auxiliary servers for extra support, such as for laboratories. For example, you may use VIDYA to do programming in the Java, C++, C and Perl languages without the need for these compilers on your machines. Coding, editing, compiling, executing and archival are all supported. Similarly, science labs and language labs can be developed. Appropriate use of multi-media and learning by doing makes learning a pleasure and has much greater impact than reading text books and listening to class room lectures.

VIDYA supports a wide variety of testing, evaluation and reporting facilities. Adaptive testing, navigation control, timing etc. are supported. A full range of question types including multiple choice, short answer and essay type questions is permitted.

2.11.4 Status and Plans

VIDYA has been installed in several centres including CIIL, Mysore. VIDYA is being regularly used in University of Hyderabad for teaching courses at MCA and MTech levels. It has also been used for offering special courses to reputed industries. A recent study has shown that it is suitable for deployment in our distance education programme.

VIDYA: Interactive Multimedia Content

VIDYA: On-Line LABs

VIDYA could be used by schools, colleges, universities, research laboratories etc. for regular education, continuing education, part-time courses, in-house training etc. Suitable material can be developed and shared with others so as to maximize the impact. In particular, language teaching material already developed or being developed by various centres can be linked with VIDYA to enable various classes of language learners to get maximum benefit. We would also be glad to enter into agreements for further collaborative development of the tool itself.
2.12 Grammars and Syntactic Parsers

2.12.1 Computational Grammars for Indian Languages

There are no large scale computational grammars for any of the Indian languages. Computational grammars and syntactic parsers are very much required for taking Indian languages beyond the type-compose-print paradigm that is holding back the country from growing beyond using computers as some kind of type-writers. All language engineering applications including machine translation, information retrieval, information extraction, automatic categorization and automatic summarization would greatly benefit from syntactic parsers.

2.12.2 UCSG system of Syntax

The UCSG system of syntax was developed by us to place positional languages such as English on an equal footing with Indian languages that are characterized by relatively free word order. A careful study of both the Western grammar formalisms and the paaNinian approach to syntax showed that none of these would be equally suitable for positional and free word order languages and hence a new formalism had to be developed. A computational grammar and parser have been developed for English and demonstration level systems have also been developed for Telugu and Kannada. UCSG uses a combination of Finite State Machines, Context Free Grammars and Constraint Satisfaction to achieve the best overall performance. Grammars become simple and easy to write, parsers become computationally very efficient, and the same basic framework works for English and other Indian languages. UCSG works from whole to part, rather than from left to right. See the references below for more details.

Further development of the UCSG English parser is going on. A much larger and more informative dictionary has been built based on the analysis of large scale corpora. Combinations of linguistic and statistical models are being used to enhance the coverage and robustness of the system. Plans include the development of computational grammars for Telugu and other Indian languages.

2.12.3 Robust Partial Parsing

It is now well recognized that full syntactic parsers are difficult to build. Hence there is increased interest in robust but shallow or partial parsing. An extensive study of parsing technologies has been made and efforts are on to build a large scale robust partial parsing system for English. Efforts are also underway, in collaboration with linguists from CIIL, to develop computational grammars and shallow parsing systems for Indian languages.
2.13 Machine Aided Translation

Automatic or Machine Translation is one of the most widely known applications in language engineering. It has been recognized very well that fully automatic high quality translation in open domains is difficult to achieve. Either restricted domains of application with controlled language usage must be considered, or the translation process has to be semi-automatic, the man and the machine doing what they are good at and seeking help from each other in other areas. Even with such restrictions, ensuring quality of translation is a very challenging task. Language is rich and varied in structure as well as meaning. Since the machine cannot be expected to "understand" the meaning of the given source language text in any real sense, the output of the machine can at best be good, calculated guesses. The situation is further complicated by the fact that expectations of users are very high when it comes to translation.

2.13.1 English to Kannada Machine Aided Translation System

A Machine Aided Translation system was developed here for the Government of Karnataka for translating budget speech texts from English to Kannada. English text is pre-processed and segmented into sentences. Each sentence is syntactically parsed using our UCSG English parser. The parsed sentences are translated to Kannada in a whole-to-part fashion using the bilingual English-Kannada dictionary and Kannada morphological generator developed by us. There is a powerful post processor that is tightly integrated with the dictionary, thesaurus, morphology and the translator. A full 150 page text is parsed and translated in just a couple of minutes on a desktop PC. The output is post-edited and then sent for final proof reading. Although this was a very short project with a very low budget, it has demonstrated the merits of a well designed and well engineered product. There are several unique and powerful features in this system. This system has become a very good technology demonstrator and has inspired a lot of serious work in various directions by several groups across the country. See www.LanguageTechnologies.ac.in for more details.

The success of the MAT system for English to Kannada translation has inspired further work on dictionaries, morphology of Kannada, robust parsing and word sense disambiguation. Research work has also been taken up on a number of other specific topics such as corpus based machine learning techniques for sentence boundary identification. The proposed MAT2 system purports to combine the best of linguistic theories, corpus based machine learning algorithms and human judgement based on world knowledge and commonsense to achieve high quality translations in a semi-automatic setup. A very useful by-product of this exercise will be a high quality POS and sense tagged, parsed, aligned, parallel corpus.

2.14 Tools

We have also developed a number of tools over the past many years for our own use and some of these tools could be useful to other groups as well. In fact some of these tools have already been given to other resource centres. Here we list some of the important tools developed by us. It may be noted that not all these tools were developed within the period of, or with the support of, this specific project.

2.14.1 Font Decoding

Unlike English, Indian scripts are syllabic in nature. The units of writing are aksharas or syllables. The total number of possible syllables is very large. Thus fonts are developed using shape units called glyphs which need to be composed to form complete syllables. The mapping from syllables to glyph sequences is complex. There is a proliferation of non-standard and proprietary fonts. In fact there is no font encoding standard as yet for Indian languages. Thus documents encoded in some unknown font are exactly like coded messages - one needs to decode them before they start making any sense. Only documents encoded in a standard character encoding scheme such as ISCII or UNICODE can be considered as text. Font encoded documents are not texts at all. However, a large number of documents are available only in font encoded forms. Some companies in fact use this as a means of achieving some degree of security for their documents.

We have developed a set of tools through which we can decode any unknown font and map it onto a standard character encoding scheme such as ISCII or UNICODE. This is a semi-automatic and iterative process. With this tool it is now possible to decode any unknown font, and we hope this would encourage commercial companies to become more open and follow standards instead of pursuing myopic, proprietary, and restrictive practices.
Some of the other centres have developed direct mappings from one font to another. We believe that the best way to handle font-to-font variations is to go through a standard character encoding scheme. All documents must be encoded and processed in a character encoding scheme, and mapping to fonts should be used only for the purposes of display and printing.
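Once the glyph inventory of an unknown font has been worked out in the semi-automatic, iterative stage, the conversion step itself reduces to a longest-match substitution of glyph sequences by standard code points. The sketch below is a hedged illustration: the glyph table is entirely hypothetical, whereas real tables map hundreds of glyph sequences to ISCII or UNICODE strings.

# Hypothetical glyph-sequence to standard-encoding mapping (toy table).
GLYPH_MAP = {
    "\x8a": "\u0c15",                  # one glyph -> one code point
    "\x8a\xe1": "\u0c15\u0c3f",        # two glyphs -> consonant + vowel sign
}
KEYS = sorted(GLYPH_MAP, key=len, reverse=True)   # longest match first

def decode(font_text):
    out, i = [], 0
    while i < len(font_text):
        for k in KEYS:
            if font_text.startswith(k, i):
                out.append(GLYPH_MAP[k])
                i += len(k)
                break
        else:                          # unknown glyph: keep it for inspection
            out.append(font_text[i])
            i += 1
    return "".join(out)

print(decode("\x8a\xe1\x8a"))          # -> Unicode text for the two aksharas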
2.14.2 Web Crawler for Search Engine

A web crawler searches the whole web and builds up an index of web pages that is structured and classified in a way that enables search engines to search the web efficiently. We have developed a basic web crawler through which such an index can be built. This tool can also be used to download whole web sites, for archival of web sites etc.

2.14.3 PSA: A Meta Search Engine

A search engine searches the web for the documents users request through a short query. There are several good search engines but there is no single search engine that is ideal in all cases. A Meta Search Engine accepts a user query, fires search engines, obtains the results and presents them to the user. Personal Search Assistant (PSA) is one such meta search engine designed and developed by us here.

PSA accepts user queries, formats them in the manner required for various search engines and fires the search engines accordingly. PSA can currently handle up to eight different search engines simultaneously. It is possible to work in background mode so that users do not need to sit in front of the machine and wait for results. Status can be checked at any given point of time. It is possible to monitor the network load and adjust accordingly. Results are collated, duplicates removed, and stored in a local database where required. Unlike a search engine, a meta search engine can reside on local machines and can be customized to suit individual requirements. Some work has been done on personalization of PSA. PSA has been in use within University of Hyderabad for several years now. Plans include Indian language support for PSA. Queries can then be posed in Indian languages. Specialized versions of the web crawler can also be developed to locate web pages in Indian languages or web pages relating to India.
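The core of any meta search engine is the fan-out/collate loop. The following sketch shows that loop in Python under stated assumptions: the engine names and URL templates are invented, and result parsing and duplicate removal are left as comments. It is not PSA's actual code.

# Fan one query out to several engines in parallel and collect the raw pages.
import concurrent.futures
import urllib.parse
import urllib.request

ENGINES = {                                   # hypothetical query templates
    "engine-a": "https://search-a.example/?q={}",
    "engine-b": "https://search-b.example/find?query={}",
}

def ask(name, template, query):
    url = template.format(urllib.parse.quote(query))
    with urllib.request.urlopen(url, timeout=10) as resp:
        return name, resp.read()

def meta_search(query):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        jobs = [pool.submit(ask, n, t, query) for n, t in ENGINES.items()]
        for job in concurrent.futures.as_completed(jobs):
            try:
                name, page = job.result()
                results[name] = page          # parse, collate, de-duplicate here
            except OSError:
                pass                          # a slow or failing engine is skipped
    return results

Running the engines in a thread pool is what makes background-mode operation natural: the user interface only has to poll the collected results rather than wait on each engine in turn.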
2.14.4 Corpus Analysis Tools

A number of tools have been developed for corpus analysis. Some of these tools are being used by other centres as well. It is planned to organize these tools in the form of a tool kit so that it will become more convenient for others to use them.

2.14.5 Website development Tools

We have a whole range of tools including our AKSHARA and WILIO systems to develop interactive web sites in Indian languages. For example, our History-Society-Culture portal has nearly 500 pages of Indian language content developed and hosted through our technology. Interactive lookup services of our dictionaries also use these tools and technologies.

2.14.6 Character to Font mapping Tools

As has been noted elsewhere in this report, mapping between characters and fonts is a non-trivial process in Indian languages. Some have used the table look-up method while others have used hand crafted rules. None of the systems seem to be satisfactory. Completeness, consistency, robustness, efficiency, transparency, extensibility and ease of development are some of the desirable features of such mapping systems. Given this scenario, we have explored the possibility of developing good mapping systems using Finite State Machine technology. Suitable extensions to the basic technology are proposed for the purpose.

2.14.7 Dictionary to Thesaurus Tool

Give us any bilingual dictionary and we can give you a kind of a thesaurus in a couple of minutes. Our tool can do a clever reverse indexing on any bilingual dictionary to identify closely related words for any given word. While this is not exactly a thesaurus in the technical sense, the basic idea is the same - given one word, to identify other words in the language that are closely related to it in meaning. The Kannada thesaurus so developed has been demonstrated and has been judged to be very useful. Some researchers are already using this system and there are plans to do more work on Kannada and Telugu thesauri. We will be glad to develop thesauri for any other language given suitable bilingual dictionaries.
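The reverse indexing trick is simple enough to show in a few lines: two source-language words that share a target-language gloss in the bilingual dictionary are recorded as related. The three-entry dictionary below is purely illustrative.

# Build a thesaurus-like relation by inverting a bilingual dictionary.
from collections import defaultdict

BILINGUAL = {                       # source word -> target-language glosses
    "begin": ["modalu", "aarambham"],
    "start": ["modalu"],
    "end": ["mugimpu"],
}

def related_words(dictionary):
    by_gloss = defaultdict(set)
    for word, glosses in dictionary.items():
        for g in glosses:
            by_gloss[g].add(word)   # invert: gloss -> words that carry it
    thesaurus = defaultdict(set)
    for words in by_gloss.values():
        for w in words:
            thesaurus[w] |= words - {w}
    return thesaurus

print(related_words(BILINGUAL))     # 'begin' and 'start' point at each other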
2.14.8 Dictionary Indexing Tools

Dictionaries need to be indexed for efficient access. We have developed clever indexing schemes for efficient indexing of large dictionaries on any computer. Combinations of TRIE indexing, Hashing, B-trees, AVL Trees etc. are used.
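Of these, TRIE indexing is the easiest to illustrate: lookup cost depends only on the length of the head word, not on the size of the dictionary. A minimal sketch, with toy entries:

# Minimal TRIE: nested dictionaries, with "$" marking the end of a word.
def insert(trie, word, entry):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = entry

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

trie = {}
insert(trie, "illu", "house (noun)")
insert(trie, "illaalu", "housewife (noun)")
print(lookup(trie, "illu"))        # found in O(len(word)) steps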
2.14.9 Text Processing Tools

A number of text processing tools for working with word lists, dictionaries etc. have been developed over the past many years. These tools have been found to be very useful for linguists and lexicographers too. Some of these tools have been integrated into AKSHARA.

2.14.10 Finite State Technologies Toolkit

Finite State technologies are increasingly being used in language engineering as they are simple yet very efficient. A full toolkit has been developed and tested on large scale data. Now it is possible to work with Regular Expressions, NFAs, DFAs etc. and perform all the usual operations without having to write any program code. It is planned to integrate these tools into AKSHARA.
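As a flavour of what such a toolkit manipulates, here is a hedged, self-contained sketch of NFA simulation in Python; the three-state machine below is a made-up example that accepts strings of the form ab*a.

# Simulate a small NFA: (state, symbol) -> set of possible next states.
NFA = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, "a"): {2},
}
ACCEPTING = {2}

def accepts(text, start=0):
    states = {start}
    for ch in text:
        states = set().union(*[NFA.get((q, ch), set()) for q in states])
    return bool(states & ACCEPTING)

print(accepts("abbba"), accepts("ab"))   # True False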
3. Services and Knowledge Bases

3.1 On-Line Literature

Telugu has a very rich literary tradition dating back to around the 10th Century. In today's busy world where reading habits seem to be depleting, here is an attempt to bring some of the best works of Telugu to your door steps. Making literature available on-line means making it available anywhere, anytime. You will not need to go to a book store or otherwise order and buy a book. This service is absolutely free of charge and so you spend no money.

More than 550 of the best works in Telugu have been enlisted. We have obtained, by a great deal of convincing effort, the right of electronic reproduction and web enabling of these works from the respective copyright holders. Not a single pie was spent to obtain the rights though. About 225 of these books have been converted into electronic form and checked by a three stage proof reading and certification process by qualified, experienced and professional proof readers. A panoramic selection of these works will be made available from our web site through our unique WILIO technology that guarantees platform and browser independence as well as some degree of security. It is planned to add Roman transliterated versions too for the benefit of those who know the Telugu language but are not very comfortable with the Telugu script.

3.2 History-Society-Culture Portal

India is a uniquely pluralistic society that is the home of many religions, traditions and cultures. Here is an attempt to bring to you nearly 500 pages of authentic material on the society and culture of the Telugu people and, indirectly, on its history too. Get to know more about temples, music, dance, folk arts and many more items. Color photographs are included. All the pages are available through our unique WILIO technology that works across operating systems and web browsers. Roman transliterations are also provided for those who would have difficulties in reading the Telugu script. Visit www.LanguageTechnologies.ac.in

History, Society and Culture Portal: Temples

3.3 On-Line Searchable Directory

Given the importance of networking of individuals and organizations with overlapping interests, it was decided to develop an on-line searchable directory of people and organizations interested in various aspects of language technology for Telugu. More than 1200 relevant entries have been developed and cross checked. The directory is available on line. A flexible search facility has been included. Kindly visit www.LanguageTechnologies.ac.in

3.4 Character encoding standards, Roman Transliteration Schemes, Tools

There is quite a bit of confusion in the country about the exact nature of character encoding schemes, fonts, rendering engines, character to font mapping schemes etc. Many hasty decisions are sometimes being taken without a full and in depth understanding of all the issues concerned. Hence a detailed article was written about the issues involved in character encoding schemes and related matters. A version of the article has been published in the Vidyullipi journal.

3.5 Research Portal

Research and development requires time, effort, money and other resources, but as far as Language Engineering in India is concerned, the most important resource required is adequate trained manpower. Language Engineering is a highly multi-disciplinary field - it borrows from such diverse disciplines as Linguistics, Psychology, Philosophy,
Logic, Artificial Intelligence, Cognitive Science, Computer Science, Mathematics, Statistics and Physics. Clearly, there are no experts who know all these areas very well. This multi-disciplinary nature of the subject makes it so much more difficult to create quality training materials, books etc. that are understood by people across so many disciplines. Indian languages are also characterized by certain unique features, compounding the difficulties in developing trained researchers and developers.

With this in mind, we have developed research portals in selected areas of Language Engineering. In one place, you will find basic and introductory material, tutorial and survey papers, a classified and structured collection of a large number of relevant research papers, pointers to people, departments, institutions, conferences and other regular events, and so on. This would substantially reduce the time and effort needed by newcomers to these research areas. We could also develop these portals for news, discussion and debate, collaborative development etc.
3.6 VAANI: A Text to Speech system for Telugu

Text-to-Speech systems convert given text into speech form. Thanks to the maturity of the techniques and the availability of the required tools, it is now possible to develop minimal TTS systems in months. VAANI, our TTS system for Telugu, is one such attempt. A di-phone segmentation based approach has been used. Phonemes in the language are identified and, for each ordered pair of phonemes, called di-phones, example words are recorded in speech form. From this raw data, di-phones are segmented using available tools. It is then possible to process this raw data further and develop a database. To produce speech from any given text, the text is initially parsed. Certain pre-processing steps are essential to handle numerals, homographs etc. Then the text is segmented into di-phones and the corresponding speech units are concatenated to produce speech output. Segmenting at di-phone boundaries gives better continuity since the variations due to co-articulation effects are least in the middle of a phoneme as compared to its ends. Since the number of phonemes in a language is usually a small and closed set, this technology also leads to unlimited vocabulary TTS technology. VAANI is thus an unlimited vocabulary, open domain TTS system for Telugu. The system has been tested for intelligibility both directly and across telephone lines. Prosodic features such as duration, pitch and intonation can be added to make the sounds more natural. A substantial amount of research has also been done on other competing technologies and we would be able to deliver high quality unrestricted TTS systems in future. VAANI has already been integrated into AKSHARA, our Advanced Multi-lingual Text Processing system.
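The di-phone selection step is easy to picture with a sketch. Below, a word's phoneme string is padded with a silence marker, cut into ordered pairs, and each pair indexes a stored waveform fragment. The five-entry database and the sample values are toys, not VAANI's data, and the smoothing of the joins is left as a comment.

# Toy di-phone concatenation: '#' stands for silence at the word edges.
DIPHONE_DB = {
    ("#", "t"): [0.00, 0.10], ("t", "a"): [0.20, 0.30],
    ("a", "l"): [0.10, 0.00], ("l", "a"): [0.00, -0.10],
    ("a", "#"): [-0.10, 0.00],
}

def to_diphones(phonemes):
    seq = ["#"] + phonemes + ["#"]
    return list(zip(seq, seq[1:]))            # ordered pairs of phonemes

def synthesize(phonemes):
    wave = []
    for d in to_diphones(phonemes):
        wave.extend(DIPHONE_DB[d])            # real systems smooth each join
    return wave

print(synthesize(["t", "a", "l", "a"]))       # concatenated toy waveform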
3.7 Manpower Development

Trained manpower is a critical issue in language engineering in India. More than 100 students and staff have worked in the project for periods ranging from 3 months to three years on various research and development activities. In the process they have obtained significant theoretical knowledge as well as practical skills in language technologies.

The LEC-2002 International Conference on Language Engineering included a full day of tutorials by distinguished experts from India and abroad. These tutorials were free for students. Several hundred students could benefit from these.

The first IL-OCR workshop organized by us here attracted many interested students and researchers from across the country. The detailed presentations, demos and discussions were very helpful.

The research portals being set up here will be of value to beginners in language technology research. A number of articles, technical reports and research papers have either been published or are being prepared for wider dissemination of the ideas, techniques and technologies. Our website is intended to serve a similar purpose and is being enhanced and updated accordingly.

A text book on Natural Language Processing with specific emphasis and examples from Indian languages is planned.

With this it should be possible now to organize training programmes on specific topics for identified target groups. Suitable course material can also be developed specifically for the purpose.

4. Epilogue

4.1 Strengths and Opportunities

Our strength lies in our core research competence. Our team has experts from linguistics, statistics, computer science, artificial intelligence and cognitive science. Each member of the team has very rich and varied experience. What binds us all is the common research competence. The tools, techniques and algorithms used in bio-informatics, image processing, speech recognition and language technology have many things in common. Our team emphasizes this common core.

We had a one semester long seminar-cum-discussion series on Markov Models. This semester we have a semester long seminar-cum-discussion series on feature extraction and feature selection. See the annexure for details. Thorough investigations of Word Sense Disambiguation and Shallow Parsing techniques are going on. This in-depth understanding of the technologies will enable us to do world class research in various areas and develop quality products and services.

We have also striven to develop large scale linguistic data resources that are essential for further research and development. Our centre is perhaps unique in having developed a 10 million word corpus and more than half a dozen dictionaries of significant size and quality. Large and representative data and the right kinds of tools will enable us to move much faster.

We have also striven to strike a good balance between pure research and publications on the one hand and product development and technology transfer to meet the needs of the society on the other hand. Many of our results are yet to be published and we hope to bring out several publications soon. We have also striven to strike a good balance between long term and short term goals. Two years ago when we started off almost from scratch on our OCR system, other experts felt that as far as content creation is concerned there is nothing better than simply typing in texts. Today our OCR is one of the successful ones in the country and, combined with our spell checker, we will soon be able to develop much larger corpora for Telugu and other Indian languages than would have been possible otherwise.

We have struggled in this first phase of development to overcome the teething problems and look for long lasting and permanent solutions rather than hop onto short sighted and immediate solutions. This approach has paid off and, with the data and tools we have with us now, we hope to be able to move much faster.

Our future efforts will be in more focused areas of language and speech engineering. We look forward to meaningful collaboration with other leaders in the world for research as well as technology development.

4.2 Outreach

The LEC-2002 Language Engineering conference organized by us was quite successful and we hope to be able to organize quality international conferences in future as well. We also organized the first ever OCR workshop for Indian scripts. This was a trend setter of sorts. We laid bare our OCR system in full detail and others followed. It was perhaps for the first time that different research groups got to know about each other's approach in such great detail. The systems developed by various groups were demonstrated and tested publicly. The discussion and debate that followed helped all the centres to make further progress. We have been the Indian coordinators for an Indo-French research network in computational linguistics and we plan to work more closely with other groups across the world.

Within the country, we have been maintaining close technical links with a number of organizations including the Society for Computer Applications in Indian Languages, Computer Literacy House, Computer Vignannam, AP Press Academy, CMC, Telugu University, Telugu Academy etc., apart from commercial companies such as C-DAC and Modular Infotech.

A large number of very distinguished visitors have visited our labs and offered their appreciation as well as very valuable suggestions.

We have been organizing seminars on a regular basis. So far more than 25 seminars have been organized. Some of the distinguished speakers include Prof. Gerard Huet, Dr. Mark Pedersen and Prof. Rajat Moona.

5. Publications

1. Chakravarthy Bhagvati, Atul Negi, and B. Chandrasekhar, "Weighted Fringe Distances for Improving Accuracy of a Template Matching Telugu OCR System", in Proc. of IEEE TENCON 2003, Bangalore, 2003.

2. Chakravarthy Bhagvati, T. Ravi, S. M. Kumar, and Atul Negi, "Developing High Accuracy OCR Systems for Telugu and other Indian Scripts", in Proc. of Language Engineering Conference, pages 18-23, Hyderabad, 2003, IEEE Computer Society Press.

3. R. L. Brown, "The fringe distance measure: an easily calculated image distance measure with recognition results comparable to Gaussian blurring", IEEE Trans. Systems, Man and Cybernetics, 24(1): 111-116, 1994.

4. U. Garain and B. B. Chaudhuri, "Segmentation of Touching Characters in Printed Devnagari
and Bangla Scripts Using Fuzzy Multifactorial Analysis", in Proc. of Int. Conf. on Document Analysis and Recognition, IEEE Comp. Soc. Press, Los Alamitos (CA), USA, 2001.

5. G. Nagy, S. Seth, and M. Vishwanathan, "A prototype document image analysis system for technical journals", Computer, 25(7), 1992.

6. Atul Negi, Chakravarthy Bhagvati and B. Krishna, "An OCR system for Telugu", in Proc. Int. Conf. on Document Analysis and Recognition, IEEE Comp. Soc. Press, Los Alamitos (CA), USA, 2001.

7. Atul Negi, Chakravarthy Bhagvati, and V. V. Suresh Kumar, "Non-linear Normalization to Improve Telugu OCR", in Proc. of Indo-European Conf. on Multilingual Communication Technologies, pages 45-57, Tata McGraw Hill Book Co., New Delhi, 2002.

8. K. Wong, R. Casey, and F. Wahl, "Document analysis system", IBM J. Research and Development, 26(6), 1982.

9. K. Narayana Murthy, B. B. Chaudhuri (Eds), LEC-2002: Language Engineering Conference, IEEE Computer Society Press, 2003.

10. K. V. K. Kalpana Reddy and Jahnavi A, "Text to Speech system for Telugu", MCA thesis.

11. K. Narayana Murthy, "UNICODE: Issues in Standardization of Character Encoding Schemes", Vidyullipi, April 2002.

12. K. Narayana Murthy, Nandakumar Hegde, "Some Issues relating to a Common Script for Indian Languages", International Conference on Indian Writing Systems and Nagari Script, 6-7 February 1999, Delhi University, Delhi.

13. K. Narayana Murthy, "An Indexing Technique for Efficient Retrieval from Large Dictionaries", National Conference on Information Technology NCIT-97, 21-23 December 1997, Bhubaneswar.

14. P. R. Kaushik, K. Narayana Murthy, "Personal Search Assistant: A Configurable Meta Search Engine", Proceedings of AusWeb99 - The Fifth Australian World Wide Web Conference, 17-20 April 1999, Southern Cross University, Australia.

15. P. R. Kaushik, "PSA - A Meta Search Engine for World Wide Web Searching", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 1998.

16. P. Naga Samba Siva Rao, "Enhancements to the PSA model for Meta Search Engines", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 1999.

17. K. Narayana Murthy, "MAT2: Enhanced Machine Aided Translation System", STRANSS 2002 Symposium on Translation Support Systems, 15-17 March 2002, Indian Institute of Technology, Kanpur.

18. K. Narayana Murthy, "MAT: A Machine Assisted Translation System", Fifth Natural Language Pacific Rim Symposium, NLPRS-99, 5-7 November 1999, Beijing, China.

19. K. Narayana Murthy, "UCSG and Machine Aided Translation from English to Kannada", Indo-French Symposium on Natural Language Processing, University of Hyderabad, 21-26 March 1997.

20. Kasina Vamsi Krishna, "Word Sense Disambiguation: A Study", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.

21. D. Madhusudhana Rao, V. V. Raghuram, "A Generic Approach to Sentence Segmentation", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.

22. K. Narayana Murthy, "Universal Clause Structure Grammar", PhD thesis, Department of Computer and Information Sciences, University of Hyderabad, 1996.

23. K. Narayana Murthy, A. Sivasankara Reddy, "Universal Clause Structure Grammar", Special issue on Natural Language Processing and Machine Learning, Computer Science and Informatics, Vol. 27, No. 1, March 1997, pp 26-38.
24. K. Narayana Murthy, "Universal Clause Structure Grammar and the Syntax of Relatively Free Word Order Languages", South Asia Language Review, Vol. VII, No. 1, Jan. 1997, pp 47-64.

25. K. Narayana Murthy, "Parsing Telugu in the UCSG Formalism", Indian Congress on Knowledge and Language, January 1996, Mysore, Vol. 2, pp 1-6.

26. Sreekanth D, "A Statistical Syntactic Disambiguation Tool", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 1998.

27. Ashish Gupta, "Improvements to the UCSG English Parser", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.

28. D. Srinivasa Rao and M. Suresh Babu, "Design of a Web Based Education Tool", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2000.

29. B. P. V. Prasad and A. Rambabu, "Design of VIRAT: A ProtoPUBLISHER Virtual Authoring Tool for Web-Based Education", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.

30. P. Uday Bhaskar and R. Krishna Kishore, "A Tool for Web Based Education", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.

31. Rajesh V. Patankar and Tammana Dilip, "VIRAT: An Authoring Tool for Publishing on the Web", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.

32. T. S. Vivek and J. Reddeppa Reddy, "Web Based Education Tool - Online Testing & Evaluation Module", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.

33. P. Siva Rama Krishna and G. Nagachandra Sekhar, "Web Based Education Tool (Instructor Module)", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.

34. T. Ramesh and T. Phani Raju, "Web Based Education Tool (Software Lab Module)", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.

35. G. Murali Krishna and P. Ravi Kumar, "Generic Framework for Developing Web-Lab Experiments", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2002.

36. G. Vamsidhar, "White Board Application - A GUI Tool for Web Based Education", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2002.

37. M. Narsimhulu, "An Enhanced Architecture for Web Based Education", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2002.

38. K. Vasuprada, K. Narayana Murthy, "Part-of-Speech Tagging using a Tri-Tag HMM Model", Second National Symposium on Quantitative Linguistics, 28-29 Feb. 2000, Indian Statistical Institute, Kolkata.

39. K. Vasuprada, "Part of Speech Tagging and Syntactic Disambiguation using Stochastic Parsing Techniques", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 1999.

40. Ravi Mrutyunjaya, "Corpus-Based Stemming Algorithm for Indian Languages", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.

41. K. Narayana Murthy, "Theories and Techniques in Computational Morphology", Proc. of the National Seminar on Word Structure of Dravidian Languages, 26-28 November 2001, Dravidian University, Kuppam, pp 365-375.

42. K. Narayana Murthy, "A Network and Process Model for Morphological Analysis/Generation", Second International Conference
on South Asian Languages ICOSAL-2, 9-11 Jan. 1999, Punjabi University, Patiala.

43. Anil Kumar, "Morphological Analysis of Telugu Words", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.

44. K. Narayana Murthy, "Electronic Dictionaries and Computational Tools", Linguistics Today, Vol. 1, No. 1, July 1997, pp 34-50.

45. A. Sivasankara Reddy, K. Narayana Murthy, Vasudev Varma, "Object Oriented Multipurpose Lexicon", International Journal of Communication, Vol. 6, No. 1 and 2, Jan-Dec 1996, pp 69-84.

46. K. Narayana Murthy, "An Indexing Technique for Efficient Retrieval from Large Dictionaries", National Conference on Information Technology NCIT-97, 21-23 December 1997, Bhubaneshwar.

47. Vasudev Varma, K. Narayana Murthy, A. Sivasankara Reddy, "Electronic Dictionaries: A Model Architecture", National Workshop on Electronic Dictionaries for Second Language Learners, Bharatiar University, Coimbatore, 29 Sept. to 7 Oct. 1993.

48. Surya Kiran Mamidi, "Developing Keyboard and Display Drivers for Indian Language Editors", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.

49. Maruti Kumar and Y. N. V. Ganesh Kumar, "Editor for Indian Languages", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.

50. V. S. S. Neelima, "Enhancements to the AKSHARA Text Processing System", M.Sc. Computer Science thesis, K R R Vignan Degree and PG College for Women, Hyderabad, 2003.

51. K. Narayana Murthy, "UNICODE: Issues in Standardization of Character Encoding Schemes", Vidyullipi, April 2002.

52. Karen Kukich, "Techniques for Automatically Correcting Words in Text", ACM Computing Surveys, Vol. 24, No. 4, December 1992.

53. K. Narayana Murthy, "Issues in the Design of a Spell Checker for Morphologically Rich Languages", 3rd International Conference on South Asian Languages - ICOSAL-3, 4-6 January 2001, University of Hyderabad.

54. K. S. RajyaShree and K. Narayana Murthy, "Statistical Spell Checking", Indo-UK Workshop on LESAL - Language Engineering in South Asian Languages, 23-24 April 2001, National Centre for Software Technology, Mumbai.

55. K. Narayana Murthy, "Markov Models of Syllables for Spell Checking", Statistical Techniques in Language Processing, 11-12 June 2001, Central Institute of Indian Languages, Mysore.

56. Naresh Pidatal and Dhanjay Kumar Singh, "Spelling Error Detection and Correction", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.

57. Anil Kumar, "Morphological Analysis of Telugu Words", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.

58. Ravi Mrutyunjaya, "Corpus-Based Stemming Algorithm for Indian Languages", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.

59. K. Narayana Murthy, "Design of a Spelling Error Detection and Correction System for Telugu", National Seminar on Language Technology Tools: Implementation for Telugu, September 17-19, University of Hyderabad.

Theses in the area of Indian Language Technologies:

60. Thejavath Ramdas Naik and Manoj Kumar Pradhan, "Off-Line Printed Devanagari Page Layout Analysis and Script Recognition", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, June 2000.

61. Buska Krishna, "Design and Implementation of a Telugu Script Recognition System", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, December 2000.
62. L. K. Prashant Kumar, "A Study of Binarization Methods for Optical Character Recognition", M.Sc.(Tech) thesis, School of Physics, University of Hyderabad, December 2000.

63. D. Vijaya Bhaskar, "Preliminary Study of GOCR to Implement OCR for Telugu", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 1998-2001.

64. Ajay Koul, "A Study of Gradient Based Feature Extraction for Telugu OCR", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, December 2001.

65. G. Raghava Kumar, "A Study of Nearest Neighbour Computation Techniques and their Applications", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, December 2001.

66. Y. Pradeep Kumar and G. Koteswar Rao, "A Tool-kit for Document Image Processing", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, January 2002.

67. V. V. Suresh Kumar, "Non-Linear Normalization Techniques to Improve OCR", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, January 2002.

68. Tanuk Ravi and S. Mahesh Kumar, "Enhancements and Redesign of DRISHTI, An OCR System for Telugu", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, June 2002.

69. Anand Kumar and Arafath Pasha, "Software for OCR", BE thesis, Department of Computer Science, 2001-2002.

70. D. Bhavani, C. Madhuri and S. Shesha Phani, "GUI for an OCR System", B.Tech. thesis, Department of Computer Science, March 2002.

71. I. N. Lekha, "An Experimental Evaluation of Fringe Distance Measures for Malayalam Printed Text OCR", M.Tech.(CT) thesis, School of Physics, University of Hyderabad, May 2003.

6. The Team Members

Dr. K. Narayana Murthy, Dr. Arun Agarwal, Dr. B. Chakravarthy, Dr. S. Bapi Raju, Dr. Atul Negi, Dr. G. Uma Maheshwara Rao, Dr. P. Mohanthy, Dr. P. R. Dadegoankar, Raman Pillai Rajesh, K. Rajini Reddy, P. Muttaiah, K. Naga Sabita, Ganesh Raju, Maruthi, Naresh, Suresh Kumar, Anand, Kwaja Sirajuddin, B. Navatha, P. RamaKrishna Prasad, O. Bhaskar, Vaishnavi, K. Balaji Rambabu, Suresh Kumar, P. Siva Ramakrishna, M. Surya Kiran, T. Ramesh, Ravi Raj Singh, Venu Gopal

Courtesy: Prof. K. Narayan Murthy
University of Hyderabad, Dept. of CIS, Hyderabad - 500046 (RCILTS for Telugu)
Tel: 00-91-40-23100500, 23100518 Extn. 4017, 23010374
E-mail: knmcs@uohyd.ernet.in

Editorial Comment: Because of the very large number of publications by the Resource Centre and the constraint of space, we could not include all the publication details here. For the publications please contact Prof. K. Narayan Murthy.
Resource Centre For Indian Language Technology Solutions – Malayalam
C-DAC, Thiruvananthapuram
Achievements

Centre for Development of Advanced Computing
(Formerly ER&DCI(T)), Vellayambalam, Thiruvananthapuram, Kerala, India.
Tel. : 00-91-471-2723333 Extn. 243, 303
E-mail : contact@malayalabhasha.com
Website : http://www.erdcitvm.org/hdg/mrc.htm
http://www.malayalabhasha.com
RCILTS-Malayalam
C-DAC, Thiruvananthapuram

Introduction

C-DAC, Thiruvananthapuram, formerly ER&DCI, Thiruvananthapuram, is one of the thirteen Resource Centres (Resource Centres for Indian Language Technology Solutions) set up across the country by the Ministry of Communications and Information Technology, Govt. of India under the TDIL (Technology Development for Indian Languages) programme. These thirteen Resource Centres are aimed at taking IT to the masses in their local languages and cater to all the constitutionally recognised Indian and some foreign languages. The language of focus at C-DAC Thiruvananthapuram is Malayalam, the official language of the state of Kerala.

The main objectives of the "Resource Centre for Indian Language Technology Solutions – Malayalam" (RCILTS-Malayalam) are to build competence and expertise in the proliferation of Information Technology using Malayalam, the regional language of the state of Kerala. Development of Malayalam enabled core technologies and products would give a tremendous fillip to IT enabled services in the state. The comprehensive IT solutions developed would enable the citizens of Kerala to enhance their quality of life using the benefits of modern computer and communications technology through Malayalam. This will help them to better understand their own culture and heritage, and to interact with government departments and local bodies more effectively, besides obtaining a host of other advantages.

Now that the Resource Centre has completed three years of functioning, we are happy to note that we have achieved significant progress and have been able to complete development of all the expected core deliverables of the project within the scheduled time.

We have been successful in developing a variety of tools and technologies for Malayalam computerization and taking IT to the common Malayalee in his local language.

Many of the products developed under the Resource Centre project are the first of their kind and are significant for enabling Malayalam computerisation. They have good market potential in the present scenario of computerisation and conversion of the official language to Malayalam in the state of Kerala. Various Government departments have purchased our "Aksharamaala" software for Malayalam word processing. Ezuthachan, the Malayalam Tutor, also has good demand among non-resident Malayalees. We have already sold 25 copies of the same and are trying for marketing tie-ups with some business houses.

The range of our products includes:

Knowledge Resources like Malayalam Corpora, a Trilingual (English-Hindi-Malayalam) Online Dictionary and Knowledge Bases for the Literature, Art and Culture of Kerala.

Knowledge Tools for Malayalam such as a Portal, Fonts, Morphological Analyser, Spell Checker, Text Editor, Search Engine and Code Converters.

Human Machine Interface Systems comprising Optical Character Recognition and Text to Speech Systems.

Services like an E-Commerce application and E-Mail Server in Malayalam, and Language Tutors for Malayalam and English.

In addition, regular training courses are being conducted as part of Language Technology Human Resource Development. Also, there is regular interaction with the Government of Kerala for providing solutions in the areas of standardization, computerisation of various Government departments and conversion of the official language to Malayalam. We have been providing consultancy to individuals and
organisations regarding Language Technology applications. Given below is a detailed description of the products developed and the achievements of the Malayalam Resource Centre. Our Resource Centre developed the following technologies/products.

1. Human Machine Interface Systems

1.1 NAYANA™ - Optical Character Recognition System for Malayalam

The Malayalam OCR system converts scanned images of printed Malayalam documents to editable text. It is a multi-font system that works across a range of font sizes. The system has a recognition speed of fifty characters per second.

The system consists of a preprocessing module, the OCR engine and a post processing module. The block diagram of the system is given in Figure 1.

Figure 1

The preprocessing tasks performed by the first module include noise removal, conversion of the grey scale image to binary, skew detection and correction, and line, word, and character segmentation. The scanned images in grey tone are converted into two-tone (binary) images using a histogram based thresholding approach (Otsu's algorithm). Skew detection is done using the projection profile based technique. After estimating the skew angle, the skew is corrected by rotating the image against the estimated skew angle.
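Both named preprocessing steps are standard and can be sketched compactly. The code below is a hedged illustration of Otsu's threshold and of a projection profile score for candidate skew angles, for a page held as a list of rows of grey values; it is not NAYANA's implementation, and the shear used in place of a true rotation is a simplification.

import math

def otsu_threshold(pixels):
    # Histogram of grey values, then the threshold that maximises
    # the between-class variance.
    hist = [0] * 256
    for row in pixels:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = sum(hist[:t])
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum(i * hist[i] for i in range(t)) / w0
        m1 = sum(i * hist[i] for i in range(t, 256)) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

def profile_score(binary, angle_deg):
    # Row-wise black pixel counts after shearing by the candidate angle;
    # the true skew angle gives the sharpest (highest-variance) profile.
    h, w = len(binary), len(binary[0])
    rows = [0] * (2 * h)
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                yy = y + int(x * math.tan(math.radians(angle_deg))) + h // 2
                if 0 <= yy < 2 * h:
                    rows[yy] += 1
    mean = sum(rows) / len(rows)
    return sum((r - mean) ** 2 for r in rows)

# skew = max(range(-5, 6), key=lambda a: profile_score(page, a)) for a
# hypothetical binarised page, matching the -5 to +5 degree range quoted below.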
The OCR engine (Character Recognition Module) is based on the feature extraction method of character recognition. Feature extraction can be considered as finding a set of vectors which effectively represent the information content of a character. The features are selected in such a way that they help in discriminating between characters. A multistage classification procedure is used, which reduces the processing time while maintaining the accuracy.

After passing through the different stages of the classifier, the character is identified and the corresponding character code is assigned. A training module is incorporated in the OCR engine to recognize characters which differ from normal characters in their shape and style (for example, decorative fonts).
In the post processing module, linguistic rules are applied to the recognised text to correct classification errors. For example, certain characters never occur at the beginning of a word and, if found so, they are remapped appropriately. Similarly, dependent vowel signs can occur only with consonants or consonant conjuncts; if found along with vowels or soft consonants, they are remapped into consonants/conjuncts similar in shape to the vowel sign. Independent vowels occur only at the beginning of a word and, if found anywhere else, they will be mapped into a consonant or ligature having a similar shape.

Performance of the OCR

Developed on the VC++ platform, our Malayalam OCR runs on Windows 98/2000. It recognises 50 characters per second and gives an accuracy of 97% for good quality printed documents. The specifications and performance of the system are given below.

Skew detection and correction : -5 to +5 degrees
Supported image formats : BMP, TIFF
Image scan resolution : 300dpi and above
Document Type : Single-font, single size
Supporting Fonts : CDAC Fonts (ML-TTKarthika, MLW-TTKarthika), Mathrubhumi Font, Manorama Font, Fonts used by DC Books
Font Size : 12-20
Font Styles : Normal, Bold
Supported code format : ISCII/ISFOC
Supported output format : RTF/HTML/ACI/TXT

Character recognition accuracy (%)

Document Type               Good quality Paper   Bad quality Paper
Computer Printed Document          97%                 94%
Magazine                           92%                 90%
Newspaper                          85%                 82%
Books                              95%                 93%

Table 1

Extensive testing has been done on approximately 500 pages of printed documents of different quality. Table 1 consolidates the results of the testing. The system has undergone certification testing at ETDC Chennai.

Applications

The Malayalam OCR can be integrated with a Malayalam Text to Speech system to get a Text Reading System for the visually challenged. Other application areas include the publishing sector, content creation, digital libraries, corpus development etc.

1.2 SUBHASHINI™ - Malayalam Text to Speech System (TTS)

The Malayalam Text to Speech system SUBHASHINI™ is a Windows based software which converts Malayalam text files into fairly intelligible speech output. The software is integrated with a text editor having both ISCII and ISFOC support. The editor supports the INSCRIPT keyboard layout.

The TTS is based on speech synthesis by diaphonic concatenation and consists of the following four modules:
The concatenation of diaphones corresponding
to the text is done in the Synthesis module and
we get speech output. We are using the MBROLA
speech engine for speech synthesis.

Applications

Text reading systems, announcement Systems, and


systems providing voice interface.

2. Knowledge Tools

2.1 NERPADAM™ - Malayalam Spell Checker

Nerpadam is a software subsystem that can be


integrated with Microsoft word as a macro or the
Malayalam editor stylepad developed by us, to
check the spelling of words in a Malayalam text
file. While running as a macro in word, it
functions as an offline spell checker in the sense
that one can use this software with a previously
typed text file only. Both off line and online

• Text Processing module


• Prosodic Rules Generator
• Speech Synthesiser

The Diaphone database consists of 2500


diaphones segmented from recorded words. All
the commonly used allophones are also
considered.

The text-processing module organizes the input


sentences into manageable lists of words. It also
identifies the punctuation symbols, abbreviation,
acronyms and digits in the input data and tags
the input data. These are then processed and
converted to phonetic language – a language that
the speech engine is able to recognise.

Rules for adding prosody to the speech output


are generated using the Speech corpus. This
includes the pitch and delay informations for
different intonations.

139
checking are possible when it is integrated with
the text editor. It generates suggestions for
wrongly spelt words.

The system adopts a rule-cum-dictionary-based approach for spell checking. It incorporates a fully developed morphological analyser for Malayalam. This module splits the input word into the root word, suffixes, postpositions etc. and checks the validity of each using the rule database. Finally, it checks the dictionary to find whether the root word is present. If anything goes wrong in this checking, the word is flagged as an error and is reprocessed to obtain three to four valid words, which are displayed as suggestions. The user can add new words to a personalised database file, which can be merged into the dictionary if required.
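A compressed sketch of the error-detection and suggestion path, assuming a toy lexicon; difflib's similarity matching stands in here for the centre's own reprocessing logic, which the report does not detail.

    import difflib

    DICTIONARY = ["pusthakam", "veedu", "malayalam", "keralam"]  # toy lexicon

    def suggest(bad_word, limit=4):
        # A word that fails the analyser/dictionary check is reprocessed
        # into three to four close valid words, shown as suggestions.
        return difflib.get_close_matches(bad_word, DICTIONARY,
                                         n=limit, cutoff=0.6)

    print(suggest("keralm"))  # -> ['keralam']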

Applications

All Malayalam word processing jobs.

2.2 AKSHARAMAALA™ - Malayalam Font Package and Script Manager

The "AKSHARAMAALA" software package consists of two sub-packages: "MSM", a Malayalam Script Manager, and "Vahini", a Malayalam font package. The package complies with the standard INSCRIPT keyboard layout, the Phonetic keyboard layout and the ISFOC standard. It enables the use of Windows-based applications for Malayalam data processing. Some of the packages supported by this application are MS Office, PageMaker, Adobe Illustrator, MS FrontPage, Macromedia Dreamweaver, CorelDraw and Lotus SmartSuite. The package is intended for use under Windows 95, 98, NT, 2000, ME and XP.

The "MSM" package consists of a keyboard manager, which supports the INSCRIPT keyboard overlay and a Phonetic keyboard overlay for the entry of Malayalam characters. The manager can optionally switch between the entry of Malayalam and English characters with the help of a switchover key. The appropriate key combinations automatically render the Malayalam characters, such as conjuncts and soft consonants. The keyboard manager is designed to work with Malayalam ISFOC monolingual as well as bilingual fonts. It also supports the use of web fonts for data entry. Different options are provided in the package to turn these features on.

The "Vahini" package is a collection of Malayalam fonts that can be used for word processing, web publishing, data processing etc. in MS Windows applications. There are a number of high-quality TrueType fonts. The fonts comply with the ISFOC standard and include both monolingual and bilingual fonts; the bilingual fonts contain both Malayalam and English characters. A few web fonts are also provided for use in web pages; these are monolingual TTF fonts specially designed for use with browsers.
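The switchover behaviour of the MSM keyboard manager can be sketched as a small lookup with a language toggle; the key-to-glyph entries and the choice of switch key below are invented placeholders, not the actual INSCRIPT/ISFOC tables.

    # Toy INSCRIPT-style entries; real tables map key sequences to
    # ISFOC glyph codes, including conjunct formation.
    INSCRIPT_MAP = {"k": "\u0D15", "i": "\u0D07"}
    SWITCH_KEY = "\x1b"  # assumed switchover key

    def feed(keys):
        malayalam, out = True, []
        for k in keys:
            if k == SWITCH_KEY:
                malayalam = not malayalam           # toggle entry language
            elif malayalam:
                out.append(INSCRIPT_MAP.get(k, k))  # render Malayalam glyph
            else:
                out.append(k)                       # pass English through
        return "".join(out)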
We have also developed dynamic (PFR) fonts, which can be used in web page development so that the user can view the web content without the font actually being installed on the machine. These fonts are platform-independent and work in both Internet Explorer and Netscape. A Unicode font in OpenType format is also available.

The Government of Kerala is considering the AKSHARAMAALA package for teaching Malayalam data entry to the beneficiaries of "Akshaya", a project aimed at providing computer literacy to at least one person per family, and also for use in the E-Governance applications of the Government of Kerala. The AKSHARAMAALA package can be used with standard content creation tools to develop content in Malayalam.

2.3 Text Editor

A basic Malayalam text editor, "STYLEPAD", has been developed. It incorporates all the facilities available in Notepad, together with provision to save Malayalam documents in ISCII format and to read ISCII files.

2.4 Code Converters

Content creation in Malayalam is accomplished using different fonts by different organisations and individuals; most of the online Malayalam newspapers have their own proprietary fonts. This lack of standards in font coding makes data retrieval a difficult task. To alleviate this, we have developed font converters for most of the commonly used Malayalam fonts. Given below is a list of the font converters developed by us:

MATWEB ↔ ISCII
THOOLIKA ↔ ISCII
SHREELIPI ↔ ISCII
AKRUTI ↔ ISCII
TULASI ↔ ISCII
MLBW (ISFOC) ↔ ISCII
MLW (ISFOC) ↔ ISCII
MLTT (ISFOC) ↔ ISCII

2.5 ANWESHANAM™ - Malayalam Web-based Search Engine

"Anweshanam" is a directory-based search engine which searches for Malayalam content and information in web pages. The solution provides a Malayalam interface that helps the user to search for information quickly and easily on the web. It searches for specific Malayalam keywords and generates a list of links to web pages containing the searched information; it searches for specific content in the pages on the server. The solution was developed using Java Server Pages.

The interface is in Malayalam and includes easy input facilities with a keyboard driver and a floating
character map. The keyboard driver supports the INSCRIPT keyboard with automatic formation of conjuncts. The application provides facilities for searching with either English or Malayalam keywords. The result pages contain links to the web pages containing the searched information, along with a brief description in Malayalam. A form is also provided whereby new pages and sites can be added to the database with details of their content.

The application can be expanded to function as a multilingual search engine and also as a meaning-based search engine. It is beneficial to multilingual portal developers and useful for anyone looking for particular Malayalam content on the web. The application is presently being run as a service on our website and portal, www.malayalamresourcecentre.org.

2.6 Malayalam Portal

We have designed and developed a Malayalam web portal, www.malayalamresourcecentre.org, alias www.malayalabhasha.com. Some of the facilities/contents provided in the portal are:

• The Malayalam version of the Constitution of India.

• A newspaper, "Pradeepam", published from Kozhikode.

• A knowledge base of traditional Kerala art forms and culture (in both English and Malayalam).

• A knowledge base of Malayalam literature.

• The Malayalam literary classic "KRISHNAGATHA" by Cherusseri and the grammatical classic "KERALAPANINEEYAM" by A.R. Rajaraja Varma.

• Full text of the Sanskrit Ayurvedic classics Charakasamhita and Susrutasamhita (transliterated in Malayalam and ten other languages), along with their Malayalam interpretations.

• PRAKES (Prakruthi estimate) - an interactive software package for estimating the Prakruti (constitution) of a person based on Ayurvedic concepts; available in both English and Malayalam.

• A database of forms commonly employed by the Govt. of Kerala (City Corporations, Motor Vehicles, Revenue, Civil Supplies Depts., etc.); altogether 76 forms from different government departments have been put on the web.

• SSLC (Mathematics and Science) question papers and answers for the last five years, in Malayalam.

• A knowledge base in Malayalam for rubber cultivators.

• A tourist aid package named "Explore-Kerala". The site http://www.malayalamresourcecentre.org/wap/ can be accessed from WAP-enabled mobile phones.

• Malayalam brochures on cancer awareness.

In addition, some of the tools/technologies developed (Sandesam, Anweshanam, the Dictionary and the e-commerce application) have been integrated in the portal for demonstration.

The contents of the website are continuously being upgraded and enhanced. The number of visitors to the site has exceeded 20,500 since June 2002.
3. Services

3.1 SANDESAM™ - Malayalam E-mail Server

Sandesam is a solution for a web-based mail service in Malayalam. The back-end mail server is based on Qmail, Vmailmgr and Courier-IMAP running on Red Hat Linux. The Qmail server supports the SMTP and POP3 services, the Vmailmgr program performs the management of user accounts, and the Courier-IMAP server supports the IMAP service. The web interface is developed using Java Server Pages and the JavaMail API and does not use Java on the client side. The database server used is PostgreSQL.

This is a lightweight and fast solution that provides an easy-to-use web interface in Malayalam. The storage of mailboxes is based on the maildir structure, and mail is read directly from the disks. The administration of user accounts is done through the Vmailmgr program; authentication is also done via Vmailmgr for both the Qmail and Courier servers. The service gives the user complete access to his POP3 or IMAP mailboxes via an easy-to-use web interface.

Some of the facilities provided in the service are IMAP support with user-manageable folders and extensive MIME support for attachments. The interface is in Malayalam and includes easy input facilities with keyboard drivers and floating character maps. The service provides facilities for sending and receiving mail in Malayalam and for storing addresses with Malayalam names and descriptions in an address book. It also provides facilities for user configuration, such as changing the password and setting quotas for mailboxes.

This solution can be expanded to support any IMAP-client mail server running on a Linux or Windows platform. It is beneficial for small and medium ISPs, business organizations, Government departments and multilingual portal developers. The solution is presently being run as a service on our website and portal, www.malayalamresourcecentre.org, with the domain id sandesam.com.

3.2 Malayalam E-com Application

An e-commerce application in Malayalam has been developed to help computer literates who are not proficient in English to purchase goods online. The solution is developed using Java Server Pages; it uses pure HTML and JSP and is viewable in any browser.

The web interface is in Malayalam and basically contains a window displaying the products for sale, with their descriptions in Malayalam. A shopping cart is provided whereby goods to be purchased can be added and removed at will. The online bill for the cart, with price details, can be viewed during the process, and the purchase can then be finalized by filling up and submitting an order form. Payment can be made by cheque or DD, giving its details in the order form and sending the instrument by post.

The application also handles inventory details by displaying the available stock, automatically updating the stock after each purchase and prompting certain functions on reaching certain limits. The facilities provided in the web interface include easy input facilities, like a keyboard driver and a floating character map for the entry of Malayalam data, and acknowledgement of the purchase order by e-mail. The application can be extended to support secure online payment with credit cards.

The application is useful for small and medium organizations wishing to increase their sales coverage through the Internet. It is useful for multilingual portal developers and is beneficial to computer literates not proficient in English but familiar with Malayalam. It has presently been employed in our website and portal, www.malayalamresourcecentre.org, for the sale of the products developed at the centre.

4. Knowledge Resources

4.1 Trilingual Dictionary

An online trilingual (English-Hindi-Malayalam) dictionary has been developed. It contains over 50,000 words in each language. English-based search and advanced search facilities are implemented; search based on the other two languages is being implemented.

The main features of the Dictionary are:
• Portable in XML format.

• ISCII Based.

• Retrieval based on Parts of Speech (POS).

• Word search can be made in all three languages (implementation in progress).

• Description of every word with example.

• Advanced search facilities.

• Extremely fast processing.

The Dictionary can be integrated with any other


application or web portal. It can also be used as
an aid for translation – both Machine aided and
manual.

It helps the study of the concerned languages with


relative ease.
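Since the dictionary is XML-based and retrievable by part of speech, a lookup can be sketched as below; the element names and attributes are assumptions, as the report does not specify the actual schema.

    import xml.etree.ElementTree as ET

    SAMPLE = """<dictionary>
      <entry pos="noun">
        <english>house</english><hindi>ghar</hindi><malayalam>veedu</malayalam>
        <example>The house is near the river.</example>
      </entry>
    </dictionary>"""

    def lookup(word, pos=None):
        # Scan entries, optionally filtering by part of speech, and
        # match the word against any of the three language fields.
        root = ET.fromstring(SAMPLE)
        for e in root.iter("entry"):
            if pos and e.get("pos") != pos:
                continue
            if word in (e.findtext(l) for l in ("english", "hindi", "malayalam")):
                return {l: e.findtext(l) for l in ("english", "hindi", "malayalam")}

    print(lookup("house", pos="noun"))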

5. Language Tutors

5.1 Ezhuthachan - The Malayalam Tutor

A Malayalam tutor package, "EZHUTHACHAN", aimed at teaching Malayalam to foreigners and second-generation Keralites living abroad, has been developed. It is basically a multimedia package with animations showing the method of writing the letters, and sound giving the pronunciation of characters and words. A writing pad is also provided, where a shaded model of each character appears and the user can practise writing over it using the mouse. The application lists the commonly used expressions in Malayalam (words for daily use, knowing somebody, numbers, days, months, colours, animals, birds etc.), with their pronunciation and English meaning. The contents are formatted into well-structured chapters, and a test module is provided at the end. The package also contains an English-Malayalam dictionary of 2,000 words.

The Honourable Minister of State for Urban Development & Poverty Alleviation, Shri O. Rajagopal (presently Minister of State for Defence), formally released the package on 19th October 2002. The Chief Executive of the Non-Resident Keralites Welfare Association (NORKWA), Shri Satish Namboothiripad, IAS, received a copy of EZHUTHACHAN from the Honourable Minister. NORKWA is considering the "EZHUTHACHAN" package for its programme of online teaching of Malayalam to non-resident Malayalee children. We are receiving a lot of enquiries and have already sold some copies of the software.

5.2 English Tutor

English Tutor is a multimedia-based application intended to help nursery and primary school students, or in general anyone interested in learning English, to learn the basics of English through Malayalam. The contents of the application are organised into different modules comprising interactive learning programmes to develop skills for reading and writing alphabets, words and sentences. The animations, pictures and sounds teach the association of objects and words, along with their spelling and pronunciation.

6. Other Activities

6.1 Providing Technology Solutions

C-DAC, Thiruvananthapuram has entered into a contract with the Govt. of Kerala to be a Total Solution Provider for IT implementation in Government. The Government has already implemented "Project Grameen" for effective dissemination of IT at the grass-roots level. C-DAC, Thiruvananthapuram, together with the Library Council, has set up Community Information Centres in fourteen districts of Kerala, and is the executing agency for the E-Governance projects of the Government of Kerala.

6.2 Interaction with State Government

We have been an active member of the state-level Expert Committee for the standardisation of the Malayalam keyboard and character encoding. The same committee has recommended the modifications to be made in Malayalam Unicode; the Committee has submitted its final report and the Government has approved it.

Recently the Government of Kerala launched a programme called "Akshaya", which is aimed at making at least one person per family in Kerala computer literate. C-DAC, Thiruvananthapuram is a member of the committee set up for evaluating the study material for this programme. C-DAC, Thiruvananthapuram is also a member of the General Council of the IT@School Project of the Government of Kerala.

6.3 Training Activities

C-DAC, Thiruvananthapuram, together with the Centre for Development of Imaging Technology (C-DIT), conducted a two-day workshop on font design in April 2002.

The ongoing computerisation of various state government departments, along with the increased use of Malayalam as the official language, has created a need for a large number of computer-literate personnel proficient in Malayalam data processing. A module on Malayalam word processing tools has been incorporated into the PGDCA course and other public and corporate training programmes offered by C-DAC, Thiruvananthapuram.

6.4 COWMAC

C-DAC, Thiruvananthapuram has formed a COnsortium of industries Working in the area of MAlayalam Computing (COWMAC). The consortium aims at mutual interaction among the participating agencies, so that standardisation in the areas of font coding, IT vocabulary, keyboard overlays, transliteration schemes etc. can be done effectively. It also aims at avoiding duplication of development activities through closer interaction between developers.
7. Publications

1. Malayalam Spell Checker - presented at the International Conference on "Universal Knowledge and Language", Goa, November 2002.

2. Optical Character Recognition System for Printed Malayalam Documents, and

3. Text to Speech System for Malayalam - presented at the SAP workshop at the Centre for Applied Linguistics and Translation Studies (CALTS), University of Hyderabad, March 2003.

4. Text Reading System for Malayalam - selected for presentation at the International Conference "Information Technology: Prospects and Challenges in the 21st Century", Kathmandu, Nepal, May 2003.

We have also prepared the final proposal for the modifications to be made in the next version of Malayalam Unicode, and the Malayalam design guide for TDIL.

8. Expertise Gained

The Malayalam Resource Centre consists of a core team of design engineers, programmers and linguists proficient in natural language processing. We have gained expertise in the areas of image processing, speech synthesis, font design, the phonology of Malayalam and the morphological analysis of Malayalam.

We have also built up technical capability in the use of database packages such as MS Access and PostgreSQL, XML, and the following scripts, languages and tools: HTML, Java Server Pages, Java, JavaScript, VC++, C++, VB, DHTML, Diaphone Studio, Macromedia Flash, C-DAC iPlugin & Leap Office, Macromedia Fontographer and Bitstream WebFont Wizard 1.0.

Expertise in configuring server-side software is also available at the Centre.

Web servers : Linux Apache server, Tomcat server
Mail servers : Qmail, Sendmail

9. Future Plans

We plan to take up development of the following products in the second phase of the Resource Centre project:

• Online Character Recognition System for Malayalam
• Lexical Resources for Machine Translation
• Speech to Text
• Malayalam WordNet, and
• Porting of the various applications developed to the Linux platform.

10. The Team Members

Ravindra Kumar R ravi@erdcitvm.org
Sulochana K.G sulochana@erdcitvm.org
Jithesh K jithesh@erdcitvm.org
Jose Stephen jose_stephen@erdcitvm.org
Santhosh Varghese santhosh@erdcitvm.org
Shaji V.N shaj@erdcitvm.org
Vipin C Das vipincdas@erdcitvm.org
Keith Fernandes keith@erdcitvm.org
Sanker C.S sankercs@rediffmail.com
Sunil Babu sunilbabu_ktr@yahoo.com
Hari Kumar karamanahari@rediffmail.com
Neena M.S neenams@rediffmail.com
Praveen V.L praveenvl@erdcitvm.org
Mithra R.S. shammymithra@rediffmail.com
Praveen Kumar praveenvijay@rediffmail.com
Hari K harichithira@yahoo.com
Dr. V.R Prabhodhachandran Nayar pnayar@email.com
Dr. Usha Nambudripad shanambudripad@yahoo.co.in

Courtesy: Prof. Ravindra Kumar
C-DAC, Vellayambalam
Thiruvananthapuram-695 033
(RCILTS for Malayalam)
Tel: 00-91-471-2723333, 2725897, 2726718
E-mail : ravi@erdcitvm.org
Resource Centre For
Indian Language Technology Solutions – Tamil
Anna University, Chennai
Achievements

School of Computer Science and Engineering


Anna University, Chennai-600025 India
Tel. : 00-91-44-2351265 Extn : 3340
E-mail : rctamil@annauniv.edu
Web Site: http://annauniv.edu/rctamil
Introduction

The Resource Center for Indian Language Technology Solutions - Tamil, Anna University, has been actively working in the area of language technology, with special emphasis on the following areas:

• Linguistic Tools
• Language Technology Products
• Content Development
• Research on Information and Knowledge Extraction

Our website http://annauniv.edu/rctamil showcases the highlights of these activities. Downloadable demos are available for selected products. A detailed report on the activities of the center is presented below.

1. Knowledge Resources

The following activities have been undertaken for the development of knowledge resources:

• Online Dictionary
• Corpus Collection Tool
• Content Authoring Tool
• Tamil Picture Dictionary
• Flash Tutorial
• Handbook of Tamil
• Karpanaikkatchi - Scenes from Sangam literature
• District Information
• Educational resources

Screen shots of Knowledge Resources

1.1 Online Dictionary

Online Dictionary is a language-oriented software for the retrieval of lexical entries from monolingual and bilingual dictionaries by many users simultaneously. It has been designed to interact with various language-processing tools - morphological analyzer, parser, machine translation etc.

The Tamil dictionary contains 20,000 root words. Each entry in the dictionary includes the Tamil root word, its English equivalent, the different senses of the word, and the associated syntactic category. The root words are classified into 15 categories.

The special features of this Online Dictionary are the following:

• It is an inflectional, or derivational, dictionary; that is, the search can be based on an inflected word. Given an inflected word, it calls the morphological analyzer to get the root word and searches the dictionary. In addition to providing information about the root word
from the dictionary, it also provides the analysed information about the inflection.

• It supports search based on syntactic category.

• It also supports search based on parts of a word.

• Tamil equivalents of given English words can also be obtained.

Status : The dictionary is available with 20,000 root words. In effect, since it is an inflectional dictionary, it is possible to find the meaning of over 1 lakh words. It is to be enhanced to 35,000 root words.

1.2 Corpus Collection Tool

A corpus with 12 lakh words has been built from Tamil newspaper articles, using a Corpus Collection tool. This tool automatically collects articles from different web sites and stores them in a corpus database. The corpus database is organized into a word database and a document database. The databases hold the following information: the article, category of the article, keywords, title, date, author, source, and unique words and their frequencies. The tool collects articles available in different fonts and converts them to a standard font before storing them in the database.

Status : Available for use with 12 lakh words. Collection is an on-going process.

1.3 Content Authoring Tool

This is a tool used to create and deliver content, integrating text, sound, video and graphics. The design of the tool is generic, so it can be adapted to any domain. The information can be organized hierarchically with multiple levels, and content can be updated at any time, with new levels being added. There are primarily two modes in which it operates - an authoring mode, used by the content developer to organize and present the content, and a viewing mode, used by the end-user to view the contents. The authoring mode stores the content in a database, and the viewing mode serves the pages using JSP. The user can select the sections that he/she wants to view.

Status : Testing in progress.

1.4 Tamil Picture Dictionary

This is a picture book for children. It has about three hundred words, organized as nouns and verbs. Each word has an associated picture, an explanation, the English equivalent and a small poem to illustrate its meaning.

Status : Available on our website with 300 words. Audio capability is to be added.

1.5 Flash Tutorial

This tutorial, in Tamil, provides an easy way to learn Macromedia Flash, a content development tool. Flash is a popular multimedia tool for enhancing or designing web pages and CBT packages. The tutorial contains a set of technical keywords along with the content. It starts with the basics of Flash and methodically moves up to an advanced level. The components available in the toolbox are explained, as are the creation of movies and animated images. The major concepts, like tweening and symbol conversion, are discussed in detail. Snapshots are provided wherever necessary, plenty of examples are included, and the English equivalents of the keywords are given within brackets.

Status : Available on our website.

1.6 Handbook of Tamil

This e-handbook gives an overview of the special features of Tamil grammar and literature. It is divided into four sections - Introduction, Literature, Grammar and Others. The 'Introduction' section gives the history of the Tamil language and details about the Tamil-speaking people. The second section provides an overview of Tamil literature from the Sangam age up to modern times. The grammar section outlines the classification schemes used in Tamil grammar and highlights important features. The last section provides miscellaneous information such as the names of days, months, measures, traditional art forms etc. in Tamil.

Status : Available on our website.
1.7 Karpanaikkatchi - Scenes from Sangam Literature

This is a visualization of scenes from Sangam literature. Selected poems depicting various moods pertaining to the Sangam period have been chosen. The poems, their interpretation and specially created imagery picturising the mood conveyed by the poems are presented.

Status : Development in progress.

1.8 District Information

• Gives statistical, informative and special features about the districts of Tamil Nadu.
• The user can browse through the information district-wise or subject-wise.

Status : Under development.

1.9 Educational Resources

• Appreciation of Tamil Poetry
• Geography for You
• Chemistry for X Standard

1.9.1 Appreciation of Tamil Poetry

The aim of this package is to introduce the nuances of Tamil poetry to young children through music, pictures, animation and games. The package consists of 25 lessons that take children from the basics of Tamil poetry to an advanced level of appreciation. Each lesson consists of the poem, segmentation of the compound words, word-by-word meaning, interpretation of the poem, related poems, author information, and exercises.

1.9.2 Geography for You

This package is available both in Tamil and in English. Its aim is to systematically introduce the concepts of geography, catering to students from the primary to the higher secondary stages. A total of 33 lessons have been organized accordingly. The topics covered include the atmosphere, the universe, the solar system, the earth, the natural regions of the world, and the continents. Images, audio and animation have been added to make the package appealing to students.

1.9.3 Chemistry for X Standard

This package has been specifically targeted at tenth standard students. Here again the focus is on presenting concepts, and it is presented both in Tamil and English. The package consists of eight topics - periodic classification, atomic structure, chemical bonding, phosphorus, halogens, metals, organic chemistry, and the chemical industry. In addition to innovative use of images and audio, the highlight of this package is the simulation of experiments using animation.

2. Knowledge Tools

The following tools have been developed for language and information processing:

• Language Processing Tools
  • Morphological Analyser
  • Morphological Generator
• Text Editor
• Spell Checker
• Document Visualization Tool
• Utilities
  • Code Conversion Utility
  • Tamil Typing Utility

2.1 Language Processing Tools

Tamil, like many other Indian languages, is a morphologically rich language. Words are formed by agglutination: each word consists of a root word and one or more suffixes. Morphological analysis and morphological generation of a word are therefore primary and challenging tasks. Hence, the core of the language processing tools developed are the Morphological Analyser and the Morphological Generator. They are used to build other language processing tools and products, like the parser,
translation system, spell checker, search engine, dictionary etc.

• Morphological Analyser

The morphological analyser takes a derived word as input and separates it into the root word and the corresponding suffixes; the function of each suffix is indicated. It has two major modules - a noun analyser and a verb analyser. A rule-based approach is used, with approximately 125 rules; heuristic rules are used to deal with ambiguities. A dictionary of 20,000 root words is used, and the analyser is designed to minimize the number of dictionary accesses. The steps involved are as follows:

1. Given an input string, it starts scanning the string from right to left to look for suffixes. A list of suffixes is maintained.

2. It searches for the longest match in the suffix list.

3. It then removes the last suffix, determines its tag and adds it to the word's suffix list.

4. It checks the remaining part of the word in the dictionary and exits if the entry is found.

5. According to the identified suffix, it generates the next possible suffix list.

6. The process repeats from step 2 with the current suffix list.

Screen shot of Morphological Analyser

Status : Currently in the testing phase. A preliminary version with over 70% accuracy is available. It is being enhanced by adding rules to improve the accuracy.
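Steps 1-6 amount to a right-to-left, longest-match suffix-stripping loop, sketched below with a tiny suffix table and root lexicon standing in for the 125-rule system and the 20,000-word dictionary (the romanized forms ignore sandhi changes):

    SUFFIXES = {"kaL": "PLURAL", "il": "LOCATIVE", "ai": "ACCUSATIVE"}
    ROOTS = {"puththakam", "veedu"}   # toy root-word dictionary

    def analyse(word):
        found = []
        while word not in ROOTS:                      # step 4: dictionary check
            match = max((s for s in SUFFIXES if word.endswith(s)),
                        key=len, default=None)        # step 2: longest match
            if match is None:
                return None                           # no analysis possible
            found.insert(0, SUFFIXES[match])          # step 3: record the tag
            word = word[:-len(match)]                 # strip, repeat from step 2
        return word, found

    print(analyse("puththakamkaLil"))  # -> ('puththakam', ['PLURAL', 'LOCATIVE'])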

• Morphological Generator

The morphological generator does the inverse of the morphological analyser: it generates words when Tamil morphs are given as input. It also has two major modules - a noun generator and a verb generator. In the noun section, a root noun, plural marker, oblique form, case marker and postpositions are given as inputs. In the verb section, a root verb, tense markers, relative participle suffix, verbal participle suffix, auxiliary verbs, and number, person and gender markers are given as inputs. Up to four auxiliary verbs can be added to the main verb at a time. In addition, adjectives and adverbs are also handled.

The generator produces the word using the internal sandhi rules and the given inputs, applying 125 morphological rules. For a given verb, nearly 30 forms can be generated with all possible combinations of tense and person-number-gender markers; with auxiliary verbs added to the main verb, about 200 verb forms can be generated. In the same way, for a given noun, nearly 68 forms can be generated with all possible combinations of case markers and postpositions.

Screen shot of Morphological Generator

Using this generator, an interactive sentence generator has been designed. Given the root noun and the associated number, case, adjective,
postposition, etc., and the root verb and the corresponding auxiliary, tense, adverb, postposition, etc., this tool will generate morphologically and syntactically correct sentences. To generate a simple sentence, at least a subject, a verb and the tense should be given.

Status : A preliminary version is available. It is being enhanced by adding rules to take care of special cases.

2.2 Text Editor

The Tamil text editor is aimed at naïve Tamil users. It is platform-independent (it works on both Windows and Linux machines) and provides basic facilities for word processing in both Tamil and English. It has no special file format; files in both rich text and text-only formats can be created or edited.

Status : Available for use. Can be downloaded from the web site.

2.3 Spell Checker

The Tamil Spell Checker is a tool used to check the spelling of Tamil words and to provide possible suggestions for wrong words. It is based on a dictionary of Tamil root words; at present the size of the dictionary is 20,000 words.

The spell checker uses the morphological analyser to split a given Tamil word into the root word and a set of suffixes. If the word is fully split by the morphological analyser, the word is assumed to be correct; otherwise it goes for the correction process. The following types of errors are handled:

1. Errors in case endings and PNG markers : If there are errors in case endings like il, aal, iTam, uTan, uTaiya, aan, aaL, arkaL, etc., they are corrected and given to the suggestion generation phase.

2. Errors due to similar-sounding characters (e.g. manam, maNam, palam, paLam, pazham, etc.) : If any letter in the erroneous word can have similar sounds, it is replaced with the equivalent letters and checked against the dictionary.

3. Adjacent-key errors : Adjacent keys are based on the Tamil 99 standard keyboard. Each letter is replaced with its adjacent keys and checked against the dictionary.

After the correction phase, all the corrected case endings, PNG markers and root words are given to the suggestion generation process. The spell checker makes use of the morphological generator to generate all possible suggestions. The user can select a suggestion from the list, ignore the suggestion, or add the particular word to the dictionary.

Screen shot of Tamil Spell Checker

Status : Version 1.0 is available. Will be enhanced by increasing the root-word dictionary size to 35,000.
2.4 Document Visualization
The spell checker uses the morphological analyser
Document visualization aims at presenting
for splitting the given Tamil word into the root
abstracted information extracted from raw data
word and a set of suffixes. If the word is fully
files visually. Instead of forcing a user to read a
split by the morphological analyser, it is assumed
whole document just to grasp certain information,
that the given word is correct. Otherwise it goes
document visualization offers very important clues
for the correction process. The following types
about the document. It is a tool that is capable of
of errors are handled :
visualizing any document.
1. Errors in case endings and PNG markers: In
The tool has various control points for the user
this case, if there are any errors in the case
to specify his options. The tool takes any tagged
endings like il, aal, iTam, uTan, uTaiya, aan,
document as input and extracts the tag
aaL, arkaL, etc …, they are corrected and given
information and uses that for visualization. The
to the suggestion generation phase.
various views presented are scatter plot view, to
2. Errors due to similar sounding characters, (eg. show relative tag positions in a two dimensional
manam, maNam,palam, paLam, pazham, etc.) plane, document zoom view, to show thumb views
: In this case, if any letter in the erroneous word of pages, statistical information view, to show
can have similar sounds, then it is replaced with statistics about input document and synchronous

154
depth view. These views together make up the entire tool.

Scatter Plot view : Here the document is mapped onto a graph. This view gives the relative position of each tag in a particular page, normalizing the tag positions according to the display. The positions of the tags are marked with circles of predefined size and colour, whose shapes distinguish nested and normal tags. Apart from displaying the tags, it shows the relevant information about a particular tag in a separate screen when the user clicks on it and, if the selected tag is of the nested type, it also drops a line from the starting position to the end of the tag.

Zoom view : The document zoom view prepares an intermediate file with a proper alignment of a certain number of lines per page. Each page is shown as a small thumbnail, and the user can click on the appropriate page to view it. Search can be done in zoom view.

Statistical Information view : This view presents the frequencies of the tags identified in the input file as a bar chart. It is used to show the legend of the tags present in the document.

Synchronous Depth view : This view projects the input document onto a three-dimensional plane, with the whole document considered as an object in that plane. It helps to get a picture of how the tags spread over the document.

Display of Zoom and Scatter Plot views

Status : A demo version is available. Currently being tested.

2.5 Utilities

In order to facilitate language processing in Tamil, a set of common utilities has been developed. These include a code conversion utility and a Tamil typing utility.

• Code Conversion Utility

There are a plethora of fonts in use on various Tamil web sites, hence a generic code conversion utility has been developed. Given a source and a target coding scheme, the utility generates the code for the conversion; this code can be integrated as part of any language processing tool or application. Existing standard coding schemes are already built in; any other coding scheme has to be explicitly specified.

Status : Available for use.
• Tamil Typing Utility

This is an API which indirectly serves as a keyboard driver. It allows the user to type Tamil text both in transliterated mode and using the Tamilnet99 standard keyboard. It can be included as part of any language processing application that requires Tamil input.

Status : Available for use.

3. Translation Support Systems

The following components have been developed to aid machine translation:

• Tamil Parser
• Universal Networking Language (UNL) for Tamil
• Heuristic Rule Based Automatic Tagger

3.1 Tamil Parser

The Tamil parser identifies the syntactic constituents of a Tamil sentence and outputs the parse tree in list form. The free word order of Tamil makes parsing a challenging task,
since there is a need to associate and link syntactic components that are not always adjacent to each other. This is done by associating positional information with the words. The parser tackles both simple sentences, having a verb, a number of noun phrases, and simple adverbs and adjectives, and complex sentences with multiple adjectival, adverbial and noun clausal forms.

The processing uses a phrase-structure grammar together with a limited amount of look-ahead to handle free word order. The words of the sentence are first analyzed using the morphological analyzer, and the root word, along with the part-of-speech tag and suffix information obtained, is input to the parser. However, the analyzer does not unambiguously give the part-of-speech information for all words; in these cases, linguistically based heuristic rules are used to obtain the associated tags. Currently, 15 heuristic rules are used. For sentences with multiple clauses, the parser first identifies the cue words of the clauses and syntactically groups the clauses based on the cue words and phrases.

Screen shot of Tamil Parser

Status : At present it can handle simple sentences, and complex sentences with a noun clause, with multiple adjective clauses, with multiple adverb clauses, and with both multiple adjective and adverb clauses.
3.2 Universal Networking Language (UNL) for Tamil

Universal Networking Language (UNL) is an intermediate representation for interlingua-based machine translation. It is an intermediate language based on a semantic network, used by machines to express and exchange natural language information.

UNL provides a sentence-by-sentence representation of the semantic content. Sentence information is represented as a hypergraph having Universal Words (UWs) as nodes and binary relations as arcs. Binary relations are the building blocks of UNL sentences; they are made up of a relation and two UWs. Each relation is labeled with one of the possible label descriptors. Relations that link UWs are labeled with semantic roles such as agent, object, experiencer, time, place and cause, which characterize the relationships between the concepts participating in the events. This project focuses on developing the Tamil EnConverter (Tamil to UNL format) and the Tamil DeConverter (UNL format to Tamil). Both the EnConverter and the DeConverter use language-dependent rules and a language dictionary that contains Universal Words (UWs) for identifying concepts, word headings and the syntactic behaviour of words.

An EnConverter is software that automatically or interactively enconverts natural language text into UNL. It is a language-dependent parser that synchronously provides a framework for morphological, syntactic and semantic analysis. The EnConverter uses the following sequence of operations:

• Convert or load the rules
• Input a Tamil sentence
• Apply the rules and retrieve the word dictionary
• Output the UNL expressions

The EnConverter analyses a sentence using the word dictionary, knowledge base and enconversion rules. It first uses the morphological analyser to analyze each word. It retrieves relevant dictionary entries from the language dictionary, operates on nodes in the node-list by applying
enconversion rules, and generates semantic networks of UNL by consulting the knowledge base.

A DeConverter is software that automatically or interactively deconverts UNL into natural language text. The Tamil DeConverter uses the morphological generator to generate the Tamil sentence.

The information needed to construct the UNL structure, and to generate a Tamil sentence from a UNL structure, is available at different linguistic levels. Tamil, being a morphologically rich language, allows a large amount of information, including syntactic categorization and thematic case relations, to be extracted at the morphological level. Endings in Tamil words, with syntactic functional grouping and semantic information, are used to identify the correct binary relation. For syntactic functional grouping, information about relating concepts is needed: verbs to thematic cases, adjectival components to nouns and adverbial components to verbs. Syntactic functional grouping has been done by the specially designed parser, taking into consideration the requirements of the UNL structure.

Status : At present, all the relations have been identified and simple sentences can be handled.
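The binary relations described above can be held as simple triples and rendered in UNL's textual form, as in this sketch; the relation labels follow the UNL convention, but the Universal Words and serialization details are illustrative only.

    # (relation, UW1, UW2) triples for a sentence like "the child eats rice".
    relations = [
        ("agt", "eat(icl>do)", "child(icl>person)"),  # agent of the action
        ("obj", "eat(icl>do)", "rice(icl>food)"),     # object acted upon
    ]

    def to_unl(rels):
        # Render each triple as relation(UW1, UW2), one per line.
        return "\n".join(f"{r}({a}, {b})" for r, a, b in rels)

    print("[unl]\n" + to_unl(relations) + "\n[/unl]")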
3.3 Heuristic Rule Based Automatic Tagger

This tagger automatically tags the words of a given document with part-of-speech tags. It uses linguistically oriented heuristic rules for tagging, and does not use the dictionary or the morphological analyser.

The part-of-speech (POS) tagger uses heuristic rules to find the nouns, verbs, adjectives, adverbs and postpositions. This is done by checking the appropriate morphemes of the words. Clitics are removed from a word before it goes through rule verification. Some of the heuristic rules used are as follows:

1. If a word contains PNG markers and tense markers, it is tagged as a verb.

2. If a word contains a case marker, it is tagged as a noun.

3. Standalone words : Some commonly used words are categorized and stored in lists. If a word belongs to a particular list, it is tagged with that category.

4. Fill-in rule : If an unknown word comes in between two nouns, it is tagged as a noun.

5. Verb terminating : If an unknown word comes at the end of a sentence, it is tagged as a verb, since Tamil sentences normally end with a verb.

6. Bigram : An unknown word is identified using the category of the previous word.

Status : About 83% accuracy is obtained when tested on a sample corpus.
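The six rules can be compressed into an ordered cascade, as in this sketch; the romanized marker lists are small illustrative stand-ins for the real morpheme tables.

    VERB_MARKERS = ("aan", "aaL", "aarkaL", "adhu")  # toy PNG/tense endings
    NOUN_MARKERS = ("il", "aal", "iTam", "uTan")     # toy case endings

    def tag_word(word, prev_tag, is_last):
        # Apply the heuristics in order; fall back to the bigram rule.
        if word.endswith(VERB_MARKERS):   # rule 1: PNG/tense marker
            return "VERB"
        if word.endswith(NOUN_MARKERS):   # rule 2: case marker
            return "NOUN"
        if is_last:                       # rule 5: Tamil sentences end in verbs
            return "VERB"
        if prev_tag == "NOUN":            # rules 4 and 6: use local context
            return "NOUN"
        return "UNKNOWN"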

4. Human Machine Interface Systems

Two packages that serve as human-machine interfaces have been developed. They are:

• Text To Speech (Ethiroli)
• Poonguzhali (a chatterbot)

4.1 Text To Speech (Ethiroli)

Ethiroli is a text-to-speech engine for Tamil, designed so that it can be plugged into any application requiring a text-to-speech component. When Tamil text is given as input, it is pre-processed and transliterated, using linguistic rules, to a phonetic representation in order to remove homographic ambiguities. Homographic ambiguity arises in Tamil because the same character can have different sounds depending on the position in which it occurs; for example, the vallinam letters (க, ச, ட, த, ப, ற) have more than one sound based on their position within a word. Each word in the transliterated text is then split into its corresponding phonemes using the cvcc model. Once the phonemes are identified, the sound files corresponding to the phonemes are concatenated and played in sequence to give the speech output.
Currently, 4,000 syllables are used to generate Tamil speech using this method.

Status : A preliminary version is available. Efforts are on to improve the quality of the speech produced.
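The concatenation step itself is simple once the unit inventory exists, as this sketch with the standard-library wave module shows; the file layout (one WAV per phoneme or syllable, all in one sample format) is an assumption.

    import wave

    def synthesize(phonemes, unit_dir="units", out_path="output.wav"):
        # Concatenate pre-recorded unit files into a single utterance.
        frames, params = [], None
        for p in phonemes:
            with wave.open(f"{unit_dir}/{p}.wav", "rb") as w:
                if params is None:
                    params = w.getparams()  # take the format of the first unit
                frames.append(w.readframes(w.getnframes()))
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            for f in frames:
                out.writeframes(f)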

4.2 Poonguzhali (Chatterbot)

Poonguzhali is a generic Tamil chatterbot that allows the user to chat with the system on any technical topic. A chatterbot is an AI program that simulates human conversation and allows for natural language communication between man and machine.

A question or a statement is taken as input from the user, and the function of the system is to generate an appropriate response based on the context of the input. The user can choose any existing topic for conversation and ask the system questions in Tamil; provision for input in transliterated form is also available. The system identifies the minimal context of the input, that is, what the user is trying to ask, no matter how the question is framed. This is done using a set of decomposition rules. The response is then formed using a set of reassembly rules that reside in the knowledge base, and is reframed to match the way in which the user framed the question. The system is also versatile enough to initiate a conversation, based on the earlier sequence of the conversation, when there is no input from the user.

The software offers the facility of an optional alternate explanation, in case the user does not understand the earlier answer given by the system. Pronoun references, like the pronoun "idhu" in a second question related to the first, are identified. Yes/No questions and compound technical terms are also handled.

Currently two domains are available, and any number of domains can be added. For this purpose, the tool provides a separate interface for adding domains and updating the knowledge base, with three separate options: for entering the knowledge base, non-technical words, and technical terms. The input key is entered with its response and alternate response in the knowledge base; the input key can be more than one word, with the '+' separator.

Screen shot of Poonguzhali (Chatterbot)

Status : A version with a knowledge base for two domains is available.
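Decomposition and reassembly rules of this kind go back to ELIZA and can be sketched as pattern/template pairs; the single English rule below is an invented placeholder for the Tamil knowledge-base entries.

    import re, random

    RULES = [
        (re.compile(r"what is (.+)\?", re.I),           # decomposition rule
         ["{0} is explained in the notes.",
          "Shall I describe {0} in detail?"]),          # reassembly templates
    ]

    def respond(utterance):
        for pattern, templates in RULES:
            m = pattern.search(utterance)
            if m:
                # Splice the matched fragment back into a template.
                return random.choice(templates).format(m.group(1))
        return "Could you rephrase that?"

    print(respond("What is a morphological analyser?"))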

5. Localization

A complete office suite for processing Tamil information has been developed; it works on both Windows and Linux platforms. A search engine to search Tamil web sites has also been developed.

• Tamil Office Suite
  • Tamil Word Processor
  • Presentation Tool for Tamil
  • Tamil Database
  • Tamil Spreadsheet
• Tamil Search Engine

5.1 Tamil Office Suite (Aluval Pezhai)

The Tamil office suite consists of four text processing applications, namely:
• Tamil word processor
• Presentation tool for Tamil
• Tamil Database
• Tamil Spreadsheet

This office suite is specifically designed for Tamil users; however, it can easily be localized for any language.

• Tamil Word Processor (Palagai)

The Tamil word processor is aimed at Tamil users. It provides basic facilities for word processing in both Tamil and English. Two versions of the word processor are available: one with a spell checker and grammar checker, and another without them. Both versions include an e-mail facility. The grammar checker checks for person, gender and number compatibility between the subject and the main verb of a sentence.

The word processor has no special file format; files in both rich text and text-only formats can be created or edited, and HTML files can be viewed. E-mails can be sent from the word processor if the current system is connected to an SMTP server; e-mail attachments are also handled.

Status : Version 1.0 is available for use.

• Tamil Presentation Tool (Arangam)

Arangam is a presentation tool for Tamil. It organizes a presentation with slides consisting of text, pictures and images; the presentation is saved in a proprietary format. Once a file is created, one can add, remove or edit slides with common operations like adding text boxes, pictures and general shapes. There are three modes of operation:

Edit Mode (Slide)
Preview Mode (Slide Sorter)
Presentation Mode (Slide Show)

The edit mode allows working on one selected slide. Preview mode gives a preview of a number of slides (with different possibilities: 2x2, 3x3 or 4x4 slides on one page). Presentation mode is the regular slideshow.

Screen shot of Tamil Presentation Tool (Arangam)

Status : Version 1.0 is available for use.

• Tamil Database (ThaenKoodu)

The database tool helps a user to store Tamil data and provides various means of retrieving it using queries, forms and reports. The help document for this tool is provided in bilingual format (both Tamil and English). All data are stored in MS-Access format.

This tool is divided into five major modules: the Database, Table, Query, Forms and Reports modules. In the Database module, a new database can be created or existing databases can be edited. The Table module allows the user to create tables in two different ways: one is the design mode, where the table name, field names, data types and constraints (null, primary key and so on) are entered; the other is through the table creation wizard, where sample table names along with their fields are provided to create the table. Tables can be modified and their structure altered; new fields can be added to an existing table at any time.

The Query module handles queries in three different modes. In the first, the table name and field name should be entered to query the table. The second mode uses the wizard, where the
required table name and the corresponding field name can be selected. In the third, a Tamil query can be entered in the space provided to retrieve the result.

The Forms module generates forms; the forms can be built with the help of a wizard and saved for later use. The Reports module allows the user to generate reports, which can also be saved and printed.

Screen shot of Tamil Database (ThaenKoodu)

Status : Version 1.0 is available for use.

• Tamil Spreadsheet (Chathurangam)

The Tamil Spreadsheet allows one to easily enter or edit data in the desired format, save data, and view it using various charts. Mathematical expressions are also handled. Data is saved in a proprietary format (".tss", the Tamil spreadsheet format).

It consists of three modules: a Worksheet module, an Expression module and a Chart module. The default number of worksheets is three; more worksheets can be added, and worksheets can be deleted or renamed. The Worksheet module allows the user to enter data in the required format. Basic operations like cut, copy, paste, find, replace, delete sheet, print sheet, and insert/delete row or column are available in this tool.

The Expression module evaluates expressions; Addition (Sum), Count, Maximum, Minimum and Average are the functions supported in this module.

The Chart module allows the user to draw charts, which are visually appealing and make it easy to see comparisons, patterns and trends in data. The module represents the data as pie, horizontal bar and vertical bar charts, which can be saved in JPEG (.jpg) format. The chart title and the scale used to draw the charts can also be customized.

Status : Version 1.0 is available for use.

5.2 Tamil Search Engine (Bavani)

A search engine is a software package that searches for documents on the Internet dealing with a particular topic. The Tamil Search Engine aims at looking up Tamil sites for information sought by a user. It searches for Tamil words in Tamil web sites available in popular font-encoding schemes; currently 22 font-encoding schemes are supported. An important feature of the search engine is the integration of the morphological analyser, so that keywords are always reduced to their root form.

The system gathers information from the Internet by crawling; the information gathered is stored in the database. An interface is provided for the user to enter a query. The query is analyzed against the information in the database, and the information that matches the query is returned to the user.

The following are the important modules of the search engine:

Crawler : This is the kernel of the design. Starting from a given URL and domain, a document is downloaded and parsed. The crawler then retrieves the URLs from that document and inserts them into the URL tree; after parsing each document, the next URL in the tree is taken out for further crawling.

Database : A database is maintained to store each word and its corresponding appearances in all the documents. The words are stored in a B-tree for efficient indexing. The position of the
word in various HTML tags is maintained and used for ranking analysis.

Searcher : The searcher accepts queries from the user and splits them into words. The given words are searched in the B-tree, and the results are sent back to the user after ranking. The ranking is performed in the order of number of outgoing links, number of occurrences, description (meta), head and title. The results are displayed in the order of AND, then OR, operations: the document that has the maximum number of query words gets a higher rank.

Status : Available for use.
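The crawler/database/searcher split boils down to a tag-weighted inverted index, sketched here; the weights and the AND-before-OR tie-breaking are assumptions consistent with the ranking order described above.

    from collections import defaultdict

    TAG_WEIGHTS = {"title": 4, "head": 3, "meta": 2, "body": 1}  # assumed
    index = defaultdict(lambda: defaultdict(int))  # word -> url -> score

    def add_document(url, tagged_words):
        # tagged_words: (word, html_tag) pairs emitted by the crawler.
        for word, tag in tagged_words:
            index[word][url] += TAG_WEIGHTS.get(tag, 1)

    def search(query_words):
        # Documents matching every query word outrank partial matches
        # (AND before OR); ties break on the accumulated tag weight.
        scores = defaultdict(lambda: [0, 0])  # url -> [hits, weight]
        for w in query_words:
            for url, wt in index[w].items():
                scores[url][0] += 1
                scores[url][1] += wt
        return sorted(scores, key=lambda u: scores[u], reverse=True)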
6. Language Technology Human Resource Development

Many activities were undertaken to create awareness and develop human resources in the area of language technology. The activities can be grouped as follows:

• Creating awareness among the general public
• Language technology training
• Co-ordination with linguistic and subject experts
• Co-ordination with Government and industry

6.1 Creating Awareness Among The General Public

One of the priorities of the center has been to create awareness about the need for content development among the public. Working towards this goal, we targeted school and college teachers and students, and the following workshops were conducted:

• Workshop on Multimedia Tools for Effective Teaching : Conducted a two-day workshop on multimedia tools for effective teaching for college and school teachers in February 2001; 100 teachers attended the workshop (200 person-days).

• Workshop on Multimedia Tools for Content Creators : Conducted a ten-day hands-on workshop on multimedia tools for content creators in May 2001; 13 teachers attended the workshop (130 person-days).

• Effective Teaching through Computers : A presentation was given by RCILTS-Tamil members at S.B.O.A School, Anna Nagar, Chennai; 20 teachers attended. A presentation was also given by RCILTS-Tamil members at the Teachers Training School, Saidapet, Chennai; 100 teacher trainees attended.

• WWW-2001 : Organized a creative writing competition for school children on Indian cultural aspects, in Tamil and English, titled "WWW 2001, Weave your Words to the Web", in September 2001, to create an awareness about content development; 40 students participated.

• InSight : "InSight - Scientific Thought, the Indian way", an Indian sciences appreciation course for school children, is being conducted every six months (500 person-days).

6.2 Language Technology Training

Language technology training has been imparted in two ways: one by conducting intensive workshops to encourage research and development in language technology, and the other by carrying out development projects in-house.

• Workshop conducted : Organized a workshop titled "Corpus-based Natural Language Processing", targeted at both computer scientists and linguists.

  • Pre-workshop tutorials : Dec 14-16, 2001
  • Date : Workshop Dec 17, 2001 - Jan 2, 2002
  • No. of participants : 40
  • Contents: Paninian Grammar, Dependency Grammar, Machine Translation, Statistical NLP, TAG grammar, etc.

• In-house Training: More than 15 project associates have been actively working in the area of language technology for the past three years and have gained expertise in this area. Specifically, they are well versed in computational morphology, syntax, lexical formats, and font and coding issues.

A lot of enthusiasm and interest has been created among students to work in the area of language technology. Over the last three years, about 30 major student projects have been carried out in language technology. The students are from both the undergraduate (B.E. – 36 students – 12 batches) and postgraduate (M.E. and M.C.A. – 18 students) streams. Some of these projects have been further developed into products.

A number of research scholars are actively pursuing research in this area. Three M.S.-by-research scholars are working in the areas of language translation, speech processing, and information retrieval. Three Ph.D. scholars are working in the areas of information extraction, Indian-logic-based knowledge representation, and automatic software document generation.

6.3 Co-ordination with Linguistic and Subject Experts

The center has been coordinating with various experts and agencies, both for linguistic expertise and for content development. Numerous interactions have been organized for this purpose:

• Seminar conducted by RCILTS-Tamil members with S.B.O.A School, Anna Nagar, Chennai, for content development in Tamil literature.

• Seminar conducted by RCILTS-Tamil members with MOP Vaishnav College, Chennai, for content development in Tamil culture.

• Discussion with Tamil scholars for effective content development in classical Tamil literature.

• Seminar given by Dr. Subaiya Pillai and Dr. D. Ranganathan at Anna University on "Sandhi in Tamil".

• A one-day workshop titled "Language and Computational Issues" for interaction with Tamil linguists. Date: Feb 15, 2002. No. of participants: 30.

• Discussions with experts from C-DAC, ASR Melkote, CIIL Mysore, Annamalai University and Madras University regarding the language processing tools developed.

• Demonstration of the products developed at various forums.

6.4 Co-ordination with Government and Industry

The products developed have been demonstrated to industries working in the area of language technology. The outcome of this interaction is two-fold. One is the signing of memoranda of understanding with industries such as Modular Infotech, Pune; Apple Soft, Bangalore; and Chennai Kavigal, Chennai, for products such as the spell checker, the morphological analyser and generator, and the text-to-speech engine. The other is the proactive interest shown by Kanithamizh Changam, a consortium of industries working on Tamil computing, in working in close coordination with the center.

The office suite will be made available free of cost to interested users, industries and Government organizations. It is proposed to publicize the availability of the products and tools through interaction with the appropriate state government authorities. Core technology such as the morphological analyser, generator and spell checker will be made available after the signing of appropriate MOUs. The educational resources developed will be made available freely to government schools.
7. Standardization

All the software developed caters to both the prevailing Tamil Nadu Govt. standard (the TAM/TAB font encoding scheme) and the ISCII standard. Testing for Unicode support is in progress. The center actively co-ordinates with the Tamil Nadu Govt. in defining the Tamil Unicode scheme. Available standards for transliteration have been adhered to wherever necessary.

All the development work has been carried out in Java, to provide cross-platform operability. All the software has been tested on both Windows and Linux platforms. Code, technical and user documentation have been completed for all the products.

8. Research Activities

The following research activities have been going on in the areas of information retrieval from documents and knowledge representation:

• Text Summarization
• Latent Semantic Indexing
• Knowledge Representation System Based on Nyaya Shastra

8.1 Text Summarization

Document summarization plays a vital role in the use and management of information dissemination. This project investigates a method for the production of summaries from arbitrary Tamil text. The primary goal is to create an effective and efficient tool that is able to summarize large Tamil documents. The summarizer produces a meaningful extract of the original text document using the sentence extraction model. Lexical chains are employed to extract the significant sentences from the text without complete semantic analysis of the original document. Segmentation is done initially, which helps to maintain coherence in the generated summary.

The three important steps involved in summarization are: segmentation of the document, lexical chain creation, and significant sentence extraction for summary generation. The Tamil WordNet is used as the knowledge source for summarization; it helps to find the relations needed to build the lexical chains. It thus becomes possible to incorporate the relationships among words and at the same time account for the various senses of the words.

Segmentation: Text segmentation is a method for partitioning full-length text documents into multi-paragraph units that correspond to a sequence of sub-topical passages. The method is based on multi-paragraph segmentation of expository text, and has now been implemented for Tamil.

Lexical Chain Creation: Lexical chains are formed by disambiguating the senses of the words occurring in the original text and chaining the related terms together. Though repetition of words itself carries informational value, collocation of terms serves to recognize topics precisely. Ranking of the lexical chains and selection of the strong chains are done using the chain length and the homogeneity index (the number of distinct occurrences divided by the length).

Sentence Extraction: Once the chains have been selected, the next step of the summarization algorithm is to extract full sentences from the original text based on the chain distribution. There are methods which involve choosing the sentence that contains the first appearance of a chain member in the text. Depending on the requested summary ratio, the algorithm can dynamically change the block size and token-sequence size in the segmentation step, so that the number of segments recognized is either decreased or increased. The number of segments in turn determines the number of lexical chains, thus either increasing or decreasing the length of the resulting summary. For convenience, the percentage size of the document is expressed in terms of the number of sentences.
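The chain-scoring step can be sketched in a few lines. The Java fragment below combines the two measures quoted above as score = length x homogeneity, and keeps chains scoring above the mean plus one standard deviation; both the multiplicative combination and the cutoff are common choices in the lexical-chain literature and are assumptions here, since the document does not state the centre's exact formula (class and method names are also hypothetical):

```java
import java.util.*;

/** Illustrative scoring of lexical chains as described above (not the actual module). */
public class LexicalChainScorer {

    /** A chain is represented as the ordered list of word occurrences chained together. */
    public static double score(List<String> chain) {
        int length = chain.size();
        if (length == 0) return 0.0;
        long distinct = chain.stream().distinct().count();
        double homogeneity = (double) distinct / length;  // distinct occurrences / length
        return length * homogeneity;                      // "strength" of the chain
    }

    /** Keep only the strong chains: those scoring above mean + one standard deviation. */
    public static List<List<String>> strongChains(List<List<String>> chains) {
        double[] s = chains.stream().mapToDouble(LexicalChainScorer::score).toArray();
        double mean = Arrays.stream(s).average().orElse(0.0);
        double var = Arrays.stream(s).map(x -> (x - mean) * (x - mean)).average().orElse(0.0);
        double cutoff = mean + Math.sqrt(var);
        List<List<String>> strong = new ArrayList<>();
        for (List<String> c : chains) if (score(c) > cutoff) strong.add(c);
        return strong;
    }
}
```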
Status : A demo version for the sports domain is available.

8.2 Latent Semantic Indexing

Latent Semantic Analysis is a fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. Latent Semantic Indexing (LSI) was chosen to index the documents because it uses the higher-order associations between the words in a passage or document. This information is further represented mathematically for easier manipulation. The Tamil language follows free word order, and Latent Semantic Indexing does not consider word order when retrieving the semantic space of the documents; hence the choice of LSI for indexing Tamil documents.

In LSI, the raw text is represented as a matrix in which each row stands for a unique word and each column stands for a text passage or other context. The matrix plays a vital role in LSI: it is considered to be the semantic space. Each cell contains the frequency with which the word of its row appears in the passage of its column. Singular value decomposition is then applied to the matrix.

The following applications have been built using the LSI technique:

Sentence Comparison : The similarity of multiple sentences in Tamil is assessed. A similarity score for each submitted sentence is computed and given to the user.

Essay Assessor : This application compares two essays. The essays are accepted as input, separated, and converted into vectors. The similarity between them is then computed and given as a score to the user.

Document Retrieval : Relevant documents are retrieved using Latent Semantic Indexing. In this application, the query words are the input, and the relevant documents are retrieved based on the query. The results are ranked and presented to the user.

Status : A demo version is available.
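To make the LSI computation above concrete, here is the standard formulation of LSI retrieval, supplied for clarity (the document itself gives no equations). With $A$ the $m \times n$ word-by-passage frequency matrix:

\[
A \;\approx\; A_k \;=\; U_k\,\Sigma_k\,V_k^{\top},
\qquad
\hat{q} \;=\; \Sigma_k^{-1}\,U_k^{\top}\,q,
\qquad
\operatorname{sim}(\hat{q},d_j) \;=\; \frac{\hat{q}\cdot d_j}{\lVert \hat{q}\rVert\,\lVert d_j\rVert},
\]

where $U_k$, $\Sigma_k$ and $V_k$ retain only the $k$ largest singular values of $A$, $q$ is a query (or sentence, or essay) expressed as a word-frequency vector folded into the reduced space, and $d_j$ is the $j$-th passage's coordinate vector in the $k$-dimensional semantic space (the $j$-th row of $V_k\Sigma_k$). Sentence Comparison, the Essay Assessor and Document Retrieval all reduce to this cosine measure between folded-in vectors.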
8.3 Knowledge Representation System Based On Nyaya Shastra

World knowledge representation for the purpose of language understanding is a challenging issue because of the amount of knowledge to be considered and the ease and efficiency of retrieval of the appropriate knowledge. An important issue in the design of an ontology dealing with world knowledge is the philosophy on which the classification scheme is based. In this work, an ontology based on Nyaya Shastra, an Indian logic system, has been designed, and a knowledge representation system called KRIL has been implemented. The methodology is based on the notion of associating qualities and values with concepts and adding different kinds of negation, which brings a new perspective to the interpretation of knowledge. The knowledge hierarchy and the techniques for interpreting the knowledge have been adapted from Nyaya Shastra. The association of qualities and values with concepts, and the addition of new relationships based on these associations and of new relations between the concepts themselves, add a new dimension to the reasoning process.

To adapt the classification scheme and reasoning methodology of Nyaya Shastra, description logic has been extended by adding operators for concept definition with qualities and for new types of negation. Hence, the building blocks of this model are the concept, the inherent association of qualities, values, relationships between concepts, and relationships between concepts and qualities. The representation of knowledge at various levels includes time-variant and time-invariant components. The special feature of this model is the handling of prior and posterior absence of concepts through additional negation operators. This model can be used as a base for various applications like natural language processing, natural language understanding, machine translation, etc.

Screen shot of the KRIL system

Status : An implementation of this model to represent knowledge about the dairy domain has been done.

9. Publications

1. G.V.Uma and T.V.Geetha, "Generation of natural language text using perspective descriptors in frames", IETE Journal of Research, special issue on Knowledge and Data Engineering, January-April 2001, Vol. 47, No. 1&2, pp. 43-56.

2. G.V.Uma and T.V.Geetha, "Softplan – a planner for automatic software documentation", National Conference on Document Analysis and Recognition, Mandya, 2001, pp. 266-271.

3. G.V.Uma and T.V.Geetha, "Automatic software documentation using frames and causal link representation", VIVEK Journal – A Quarterly in Artificial Intelligence, 2001, Vol. 14, No. 3, pp. 3-13.

4. D.Manjula and T.V.Geetha, "Message Optimization by Polling for Text Mining", National Conference on Document Analysis and Recognition, Mandya, 13-14 July 2002, pp. 199-202.

5. D.Manjula, P.Malliga and T.V.Geetha, "Semantic based Text Mining", First International Conference on Global WordNet, Mysore, 21-25 January 2002, pp. 266-270.

6. D.Manjula and T.V.Geetha, "Distributed semantic based text mining", to appear in the proceedings of the Third International Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields, Bologna, Italy, 25-27 Sep 2002.

7. G.Aghila, Ranjani Parthasarathi and T.V.Geetha, "Design of conceptual ontology based on Nyaya theory and Tolkappiam", Third International Conference on South Asian Languages (ICOSAL-3), University of Hyderabad, Hyderabad, Jan 4-6, 2000, p. 2.

8. G.Aghila, Ranjani Parthasarathi and T.V.Geetha, "Indian Logic Based Conceptual Ontology Using Description Logics", National Conference on Data Analysis and Recognition, Mandya, 2001, pp. 279-286.

9. Devi Poonguzhali, P.Kavitha Noel, N.Preeda Lakshmi, T.V.Geetha and A.Manavazhan, "Tamil Wordnet", First International Conference on Global WordNet, 21-25 Jan 2002, pp. 65-71.

10. G.V.Uma and T.V.Geetha, "A knowledge based approach towards automatic software documentation", International Journal of Trends in Software Engineering Process Management, Malaysia, Spring 2002.

11. G.V.Uma, N.S.M.Kanimozhi and T.V.Geetha, "Automatic software documentation in Tamil", presented at the 3rd International Conference on South Asian Languages (ICOSAL-3), University of Hyderabad, 1999.

12. Madhan Karky.V, Sudarshanan.S, Thayagarajan.R, T.V.Geetha, Ranjani Parthasarathy and Manoj Annadurai, "Tamil Voice Engine", presented at Tamil Inayam, Malaysia, 2001.
13. P.Anandan, T.V.Geetha and Ranjani Parathasarathy, "Morphological Generator for Tamil", presented at Tamil Inayam, Malaysia, 2001.

14. D.Manjula, A.Kannan and T.V.Geetha, "Semantic Information Extraction and Query Processing from the World Wide Web", International Conference on Knowledge Based Computer Systems (KBCS 2002), Mumbai, Dec 18-21, 2002.

15. D.Manjula and T.V.Geetha, "Message Optimization using polling for distributed text mining", National Conference on Document Analysis and Recognition, Mandya, 2001.

16. P.Malliga, D.Manjula and T.V.Geetha, "Boostexter for Tamil Document Categorization", International Conference on South Asian Languages.

17. Siva Gurusamy, D.Manjula and T.V.Geetha, "Text Mining in Request For Comments Document Series", LEC 2002 Language Engineering Conference, Hyderabad, Dec 13-15, 2002.

18. T.Dhanabalan and T.V.Geetha, "UNL EnConverter for Tamil", International Conference on South Asian Languages (ICOSAL-4), Annamalai University, Tamil Nadu, Dec 3-5, 2002.

19. T.Dhanabalan, K.Saravanan and T.V.Geetha, "Tamil to UNL EnConverter", International Conference on Universal Knowledge and Language (ICUKL), Goa, India, Nov 25-29, 2002.

20. G.S.Mahalakshmi, G.Aghila and T.V.Geetha, "Multi-level Ontology Representation based on Indian Logic System", International Conference on South Asian Languages (ICOSAL-4), Annamalai University, Tamil Nadu, Dec 3-5, 2002.

21. P.Devi Poongulhali, N.Kavitha Noel, R.Preeda Lakshmi, Manavazhahan and T.V.Geetha, "Tamil Text Summarization", International Conference on Knowledge Based Computer Systems, Mumbai, India, December 18-21, 2002.

10. The Team Members

Dr. T. V. Geetha  rctamil@annauniv.edu
Dr. Ranjani Parthasarathi  rp@annauniv.edu
Dr. K. M. Mehata  mehata@annauniv.edu
Dr. Arul Sironmoney  asiro@annauniv.edu
Ms. V. Uma Maheswari  umam_in@yahoo.com
Ms. N. Anuradha  anuradha_n_2000@yahoo.com
Ms. S. Chithrapoovizhi  scpoovizhi@yahoo.com
Ms. J. Deepa Devi  deepajayaveer@rediffmail.com
Mr. T. Dhanabalan  tdhanabalan@yahoo.com
Mr. T. Dhinakaran  dhina_t@yahoo.com
Ms. T. Kalaiyarasi  kalaivas@yahoo.com
Mr. A. Manavazhahan  a_manavazhahan@hotmail.com
Mr. G. PalaniRajan  palaniworld@rediffmail.com
Mr. R. Purusothaman  r_purush@hotmail.com
Mr. K. Saravanan  ksaranme@yahoo.com
Ms. M. Vinu Krithiga  vinu_krithiga@yahoo.com
Mr. V. Venkateswaran  venkatjee@indiatimes.com
A. Kavitha  avitha_mariyappan@yahoo.com
M. Kavitha  malar_vanam@sify.com

Courtesy: Dr. T.V.Geetha / Ms. Ranjani Parthasarathi
Anna University, School of Computer Science & Engineering, Chennai - 600 025
(RCILTS for Tamil)
Tel: 00-91-44-22351723, 22350397, Extn 3342/3347, 24422620, 2423557
E-Mail: rp@annauniv.edu
Resource Centre For
Indian Language Technology Solutions – Urdu, Sindhi, Kashmiri
C-DAC, Pune
Achievements

Centre for Development of Advanced Computing
Pune University Campus, Ganesh Khind, Pune-411007
Tel. : 00-91-20-5694000-2  E-mail : mdk@cdacindia.com
Website : http://parc.cdacindia.com
RCILTS-Urdu, Sindhi & Kashmiri
Centre for Development of Advanced Computing (GIST), Pune

Introduction

"Digital unite & knowledge for all" – the vision of TDIL.

MC&IT wishes to make a difference to the quality of life of the people of India through its TDIL mission, and is investing a significant amount of resources towards fulfilling this wish. One such effort was the establishment of "Resource Centres" for the design, development & deployment of tools & technologies for Indian languages.

"Dissolving the language barrier" – the vision of GIST.

C-DAC has pioneered the Graphics and Intelligence based Script Technology (GIST), which facilitates the use of Indian languages in IT. GIST is geared for the Internet-enabled world where all activities are gradually going online. In its endeavour to stay abreast of technologies worldwide, GIST has been adopting the latest concepts to be able to stay tuned to the changing IT scenario. C-DAC GIST, through its continuous R&D efforts, has many innovative products to its credit and continues to be a leader in the language technology field.

MC&IT, through its TDIL Programme, has awarded C-DAC GIST the "Resource Centre" project for the development of tools & technologies for the Perso-Arabic languages.

Overall Goal of the Resource Centre

To empower the people of India through the use of Information Technology solutions in Indian languages.

Project Purpose

To improve the quality of life of the people of India by enabling them to use information technology in Indian languages.

To develop new products and services for processing information in Indian languages.

To conduct research in computer processing of Indian languages.

Broad areas of work under the Perso-Arabic Resource Centre

GIST workshops on the usage of GIST tools, for organizations concerned with the development & deployment of Indian language processing systems.

Creation of a network of persons and organizations working on IT applications in the Perso-Arabic (PA) languages, namely Urdu, Sindhi & Kashmiri.

Development of language tools & applications in PA scripts.

Conduct of workshops to discuss the standardization of PA script data & font representations.

Deployment of the developed technology, products & services through a country-wide network.

Services and Knowledge Bases

• Creation of a web site for language tools, solutions and knowledge available in the PA languages.
• Imparting training in the usage of PA language applications in IT.
• Establishment of translation & subtitling services in the PA languages.

Products

• High quality PA fonts.
• Dictionary and spell checker tools in PA languages.
• Word processor package for PA languages.
• Transliteration tool between Hindi & Urdu.
• Prototype of a pocket translator with Urdu support.
• Products for subtitling, character generation, teleprompting and DVD in PA scripts.

Activities Undertaken Under The Project

A GIST workshop was conducted at C-DAC Pune from 17th to 22nd July 2000 on the usage of the GIST series of Indian language technologies, tools and products, for Resource Centres concerned with the development and deployment of Indian language solutions. A team of 23 members from 8 Resource Centres attended:
1. ER&DC, Trivandrum
2. Jawaharlal Nehru University, New Delhi
3. Indian Institute of Technology, Kanpur
4. Orissa Computer Application Centre, Bhubaneswar
5. Thapar Institute of Engineering and Technology, Punjab
6. Anna University, Chennai
7. M. S. University, Baroda
8. Indian Statistical Institute, Calcutta

Spread over five days and a total of 12 interactive sessions, the subjects covered were:

1. Building dictionaries & designing spell checkers
2. Keyboard drivers
3. GIST software development kit
4. Multilingual web tools
5. Script codes, font storage & design

During the workshop, training kits were distributed which contained work manuals on the following topics:

1. Designing Indian Language Spell Checkers
2. Myths about Standards for Indian Languages
3. Enabling Indian Languages on the Web
4. ISCII – Foundations & Future
5. GIST Software Development Kit
6. Using the GIST SDK in VB
7. Internet First Steps: Upgrading an Existing ActiveX Control
8. Inscript Keyboard Layout
9. Typing in the Phonetic Keyboard

Two CDs of GIST software, along with brochures on various language products and a font typeface catalogue, were distributed:

1. GIST based S/W Developer CD
2. GIST Products & Channel Information

A visit was organized to various departments of C-DAC: the GIST Laboratory, the National Multimedia Resource Centre, and the National PARAM Super Computing Facility.

Evolving Storage, Font & Inputting Standards

The main task before C-DAC was to evolve standards for storage, fonts & inputting for the Perso-Arabic languages. Development of USCII (the Urdu Standard for Information Interchange) had been carried out well before the awarding of the project. The GIST terminals designed & developed by C-DAC are based on these standards; support in the GIST terminals was limited to only the Naskh implementation of the fonts.

Accordingly, with considerable effort and valuable inputs from experts all over, C-DAC drafted the PASCII storage standard, along with keyboard and font standards. These proposed standards have been published in the TDIL magazine, Vishwa Bharat, for comments and suggestions from experts all over India.

1. The PASCII Standard – The Perso-Arabic Standard for Information Interchange

There are some peculiarities that can be seen in the Perso-Arabic scripts:

These scripts join letters with each other, and therefore letters have different forms as per their position in a ligature (the positions being beginning, middle and ending).

Some shapes do not have a middle shape, i.e. they do not join at both ends; for example alif, vao, daal, etc. Every letter also has a standalone form.

1.1 Characteristics of the Proposed Standard

It is an 8-bit standard. It supports letters for Urdu, Arabic, Sindhi and Kashmiri.

It defines the PA alphabet in the upper ASCII range, leaving lower ASCII free for English, so bilingual support is possible.

It defines numerals other than the ASCII numbers (48 to 57). (This may help in supporting both the Arabic numerals 0-9 and language-specific numerals.)

It maintains the alphabetical order of the Perso-Arabic languages. Khate-kasheed is given a lower value than the alphabets; this sorts words correctly even when they have khate-kasheed in between.

Alphabets for the different languages are placed in their ascending order. Letters like "bhey" are not provided for Urdu but are kept for languages like Sindhi; Urdu may make use of "be" and "choTi-hai" for that.

Minimal erabs are provided. Tanveen, for example do-zabar, can be formed with the help of two consecutive zabar.

Superscripts: places for superscripts like khaRa-alif are provided; places for the Arabic superscripts are provided; places for superscripts like "re-ze", "ain", etc. are provided.

Numerals are placed after the erabs and superscripts. (This is provided only to support display of language-specific numerals; the standard numerals, i.e. the ASCII numerals, remain available.)

Only a few control characters are defined.
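To make the layout concrete, here is a small illustrative sketch in Java of how an 8-bit encoding of this shape keeps English and a Perso-Arabic alphabet side by side while preserving collation order. The code-point ranges shown are invented placeholders, not the actual PASCII assignments:

```java
/** Illustrative 8-bit layout in the spirit of PASCII (values are placeholders, NOT the real table). */
public class PasciiSketch {
    // Lower half (0x00-0x7F) stays plain ASCII, so English text passes through untouched.
    public static boolean isEnglish(int b)      { return b >= 0x00 && b <= 0x7F; }

    // Hypothetical upper-half regions: khate-kasheed below the letters, then the alphabet
    // in collation order, then erabs/superscripts, then language-specific numerals.
    public static boolean isKhateKasheed(int b) { return b == 0xA0; }               // placeholder
    public static boolean isLetter(int b)       { return b >= 0xA1 && b <= 0xDF; }  // placeholder
    public static boolean isErabOrSuper(int b)  { return b >= 0xE0 && b <= 0xEF; }  // placeholder
    public static boolean isPaNumeral(int b)    { return b >= 0xF0 && b <= 0xF9; }  // placeholder

    /** Because code values follow collation order, a plain byte-wise compare sorts words
     *  correctly, and kasheed (lower value than any letter) never disturbs the order. */
    public static int compare(byte[] w1, byte[] w2) {
        int n = Math.min(w1.length, w2.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(w1[i] & 0xFF, w2[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(w1.length, w2.length);
    }
}
```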
1.2 Standardization of the Urdu Keyboard

Overview

There is no standard keyboard available for the Urdu language. Vendor-specific keyboards are available, all of which differ in their layouts. Many keyboards have been designed by the different companies that provide Perso-Arabic support in their applications; unfortunately, every company has its own keyboard layout that differs entirely from the others'. For example, Microsoft has its own keyboard layouts for the Perso-Arabic languages.

Drawbacks of the available keyboards (in view of the Urdu language):

No standard keyboard is available.

All vendor-specific keyboards differ in their layout.

Most of the keyboards have been designed for Arabic. Although Arabic and Urdu have a similar script, they are different languages, and hence a keyboard designed for Arabic may not be that useful for Urdu. C-DAC has taken this care of the languages in designing the keyboards for Urdu, Sindhi & Kashmiri, and has tried to come up with an optimum solution.

(Details of these standards can be found in the TDIL Vishwa Bharat magazine.)

1.3 Standardisation of Fonts

Characteristics of the Perso-Arabic languages:

Urdu has traditionally been written in the Nastaleeq script. Although the script employs the basic letters of the language, the rendering of these letters in a word is extremely complex. The reason for this complexity is that Urdu text has traditionally been composed through calligraphy, a medium whose precepts are based on the aesthetic sense of the calligrapher rather than on any formula. So great is the variation in calligraphy that many times it is difficult to recognize the letters in a constituent word. This is because, in their calligraphed form, the individual letters partially or completely fuse into each other, thereby losing their identity. A degree of fusion is purposely introduced to make the resulting fused glyph visually appealing.

Another characteristic of Urdu is the existence of diacritics. Diacritics, although sparingly used, help in the proper pronunciation of the constituent word. The diacritics appear above or below a character to define a vowel or emphasize a particular sound. They are essential for the removal of ambiguities, natural language processing and speech synthesis.

A Nastaliq-style font has been designed, to be used for printing and display. The font is based on the traditional Nastaliq style written in India.

Arabic Font – a high quality Arabic font has been designed to complement religious text in Urdu.

Considerations for Digital Font Design

Following is a group-wise list of the characters that were taken into consideration while designing fonts for the Perso-Arabic languages:

1. Alphabet.
2. Numerals.
3. Special Characters.
4. Diacritics.
5. Religious and linguistic Symbols.
6. Control characters.

The 16-bit Naskh & Nastaliq Fonts

The fonts developed by C-DAC are 16-bit and are also defined in the User Area of the Unicode range. The ASCII range is not used, and can be used for other purposes (for example, for English support).

Each font includes:

• all the basic shapes;
• all the starting shapes and variations;
• all the middle shapes and variations;
• all the ending shapes and variations;
• levels for erabs (short vowels);
• complete ligatures;
• beginning ligatures;
• middle ligatures;
• ending ligatures;
• a missing-character glyph.

Glyph Standards decided for Font Development

Urdu & Kashmiri – Naskh & Nastaliq scripts; Sindhi – Naskh script:

• Basic letters (vowels, consonants).
• Full shape letters.
• Beginning.
• Medial.
• Final.
• Ligatures.
• Graphic components.
• Numerals, signs and symbols, etc.

2. Script Core Technology & Rule Engine

Because of the complex nature of the Nastaliq script, which is written right to left and top to bottom diagonally, it was necessary to build certain databases and use them along with a rule engine to make display possible. Following are some of the tools that were developed to create these databases.

2.1 Glyph Property Editor

In Urdu and all other Perso-Arabic languages, each character has many different shapes depending upon its position: 1) standalone, 2) starting position, 3) middle position, 4) ending position, etc. So, according to the position of the character and the letters it joins with, a particular shape is to be displayed. This is an application that allows one to store the data and the compositional information for the various shapes in a font file, for the proper alignment of characters in display. (Urdu requires a larger number of glyphs than other scripts.) This takes care of the horizontal and vertical movement of the Nastaliq script.

2.2 Rule Engine

A module containing rules for each shape. The rules indicate how and which shape will join with another shape or shapes, based on the succeeding and preceding shapes that occur in the Nastaliq text composition. This facilitates more accurate and faster display of the shape composition in Nastaliq text.

2.3 Testing Version of the Urdu Editor

A simple text editor is ready for testing the Rule Engine and checking the various rules formed for displaying the glyphs. This editor allows one to write text left to right, handles all keyboard mappings from ASCII to USCII, and displays the Urdu text.

2.4 Font Glyph Editor

In the Nastaleeq script (for Urdu), the shapes join from right to left, and within a ligature they also shift vertically; so in a ligature, shapes join right to left and top to bottom. The vertical positioning information cannot be put in the font itself, hence it was decided to make a utility which lets one set this information in a separate file.
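The positional-shaping decision that the rule engine encodes can be pictured with a toy Java fragment (hypothetical types and data; real Nastaliq shaping consults far richer context rules, plus the dot and vertical-offset files described in the next section):

```java
/** Illustrative positional shaping in the spirit of the rule engine above (toy example). */
public class ShapeSelector {
    public enum Form { STANDALONE, INITIAL, MEDIAL, FINAL }

    /** Letters such as alif, vao and daal do not join to the letter after them. */
    public static boolean joinsForward(char c) {
        // Placeholder predicate: a real table would list the non-joining letters per script.
        return c != 'A' && c != 'V' && c != 'D';
    }

    /** Pick the positional form of the letter at index i from its neighbours' joining behaviour. */
    public static Form formAt(String word, int i) {
        boolean joinedFromRight = i > 0 && joinsForward(word.charAt(i - 1));       // previous letter reaches us
        boolean joinsToLeft = i < word.length() - 1 && joinsForward(word.charAt(i)); // we reach the next letter
        if (joinedFromRight && joinsToLeft) return Form.MEDIAL;
        if (joinedFromRight)                return Form.FINAL;
        if (joinsToLeft)                    return Form.INITIAL;
        return Form.STANDALONE;
    }
}
```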
Example of Nastaleeq joints: "sehan" => suad + baRi-he + noon.

Purpose: to select a glyph for the given font and character, and define on it the following information:

1. Starting point (SP): the point where a glyph ends. This is generally on the baseline, or zero.

2. Ending point (EP): the point where a glyph joins with another glyph at that glyph's starting point (the SP of the other glyph). This point is less than or equal to the glyph width.

3. Dot points (UP and DP): the points where a dot or diacritic mark will be positioned on the glyph.

2.5 The Font Editor Utility

The Font Editor utility lets us place the dot information for each glyph. The file thus generated is used for retrieving dot information at run time; we call it a dot file.

The accompanying figure depicts a glyph with dot and caret information. As shown in the figure, the utility helps one define the following for a glyph:

1. Font name.
2. Unicode value of the glyph.
3. Glyph index in the dot file (the unique id of the glyph).
4. Start and end points.
5. Dot positions.
6. Caret position.

Fonts: following are samples of the fonts developed – Ghalib, Ghalib Bold, Imaad, Kamran, Maghribi, Makhdoum, Murasilat, Riyadh, Rubai, Safi, Saleem, Shibli, Shuja and Abid.
3. Devanagari/Gurmukhi to Urdu Transliteration – "UTRANS"

Computer-based transliteration allows a user to convert a given text of language 'A' (say, in storage format ISCII) to text in another language 'B' (in some storage format, say ASCII) on the basis of the phonetic rules that govern languages A and B.

A software (automatic) tool providing transliteration from one language to another is an effective tool which reduces the amount of effort required to create large databases of names in multiple languages. Once a database in one language is created, the tool can be used to generate the same data for other languages; this results in considerable savings in time and money.

It is a useful tool for applications which require converting databases of names, addresses, phone numbers or electoral rolls from one language to another. It can also be useful for converting entire databases from one language to another without re-entering the data. The utility can convert data in English from a database directly to ISCII; this will facilitate automated transliteration across databases.

A software tool for transliterating text from Devanagari & Gurmukhi (ISCII) to Urdu has been developed. The tool is available as a separate module, and has also been implemented in the Text Editor Test Application that was developed.

3.1 Following is a sample of Hindi text transliterated into Urdu (UTRANS).

Using this automated tool, UTRANS, we have transliterated the book "Meri Ekyavan Kavitayen", written by our Hon. Prime Minister Shri Atal Bihari Vajpayee.
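The rule-based core of such a transliterator can be pictured as a table lookup over the input code stream. The Java sketch below uses a deliberately tiny, made-up mapping table over Devanagari characters; the real UTRANS works on ISCII input, applies contextual phonetic rules (inherent vowels, conjuncts), and emits PASCII:

```java
import java.util.*;

/** Toy Devanagari-to-Urdu transliterator in the spirit of UTRANS (illustrative table only). */
public class UTransSketch {
    private static final Map<Character, String> MAP = new HashMap<>();
    static {
        MAP.put('क', "ک");  // ka -> kaf
        MAP.put('म', "م");  // ma -> meem
        MAP.put('न', "ن");  // na -> noon
        MAP.put('र', "ر");  // ra -> re
        MAP.put('ल', "ل");  // la -> laam
        MAP.put('ा', "ا");  // aa matra -> alif
    }

    public static String transliterate(String devanagari) {
        StringBuilder urdu = new StringBuilder();
        for (char c : devanagari.toCharArray()) {
            urdu.append(MAP.getOrDefault(c, "?")); // unknowns flagged for manual review
        }
        return urdu.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("कमल")); // prints the kaf-meem-laam letter skeleton
    }
}
```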
3.2 Following is a sample of Punjabi text transliterated into Urdu (UTRANS).

4. Dictionary Development Tool

Natural language processing tools such as lookup dictionaries, thesauri, spell checkers, grammar checkers, part-of-speech taggers and machine-aided translation require language-specific dictionaries. Several other domains, such as speech processing and expert systems, require backend language dictionaries to be able to process and generate valid outputs. Each of these applications requires dictionaries created to cater to its specific requirements.

The Generic Dictionary Development Tool allows users to create dictionaries specific to application domains. As part of this development, the format of a generic dictionary is to be standardized. The tool provides facilities for obtaining views of dictionaries that are subsets of the generic dictionary.

This tool, though a lot of refinement is still required, is in the form of a standalone desktop utility which allows the user to create dictionaries containing words of the Perso-Arabic languages. The user is provided with an interface to define inflection and root-word rules and to add suffixes, prefixes, grammar tags and a domain for a given word entry.

In order to build the spell checker and domain-specific dictionaries, the following dictionary creation work was undertaken and completed:

1. Official Hindi dictionary – 50,000 Hindi words, 50,000 equivalents
2. Urdu to English – 17,000 Urdu words, 17,000 grammar tags
3. Hindi-Urdu Shabdh Kosh – 10,000 Hindi words, 10,000 equivalents
4. Dictionary – 10,000 Urdu words, 10,000 grammar tags
5. Dictionary of Urdu & Hindi names – 3,200 entries, 3,200 equivalents
6. Technical dictionary – 1,500 Hindi words, 1,500 equivalents
7. Muslim names dictionary – 1,500 Hindi entries, 1,500 equivalents
9. Hindi-English-Urdu – 8,000 Hindi words, 8,000 equivalents

Following is the typical GUI of the dictionary development tool for Urdu.
5. Nashir Word Processor – Publishing Made Easy

Nashir is designed to be easy to use for creating documents in the Perso-Arabic languages, and at the same time powerful enough to lay out complete newspapers and magazines in Urdu, Sindhi, Kashmiri, Arabic and Farsi.

Each Nashir document consists of a number of pages. On each page of the document, you can place items like text blocks, graphics, etc. Nashir is a kind of word processor that is also well suited to the publishing segment. Spell checker support has been added; apart from the base dictionary for the spell checker, the addition of various domain-specific dictionaries is planned.

Salient Features of Nashir

1. Supports Nastaliq TrueType fonts (presently 2 fonts).
2. Supports Naskh fonts (presently 12 fonts).
3. Fonts for Sindhi and Kashmiri added.
4. Supports C-DAC & Phonetic keyboards.
5. User-defined keyboard support available.
6. Drawing objects provided.
7. OLE Automation supported.

Other Features

Bilingual Support: supports Urdu, Sindhi & Kashmiri along with English.

Transliteration: the transliteration engine (UTRANS) has been implemented in Nashir. One can insert an ".aci" (ISCII) file into Nashir and see the transliterated version in the Urdu script (Naskh or Nastaliq). Rule-based transliteration has been developed for Hindi & Punjabi.

Save As HTML: the user can save the document as an HTML page, and thus Naskh as well as Nastaliq scripts can be viewed on the Internet.

Table: a table control is provided to enter tabular information in the document.

Keyboards: the phonetic keyboard is supported. A floating keyboard will be supported soon.

Tool tips: tool tips are provided for the user.

Master Page: a master page is provided with each document to give the user header, footer and similar settings. Whatever is designed on the master page appears on the pages.

Kerning: a kerning feature is provided to adjust text manually. Both horizontal and vertical kerning are possible.

Variable Font Sizes & Colours: variable font sizes and colours can be used.

Others: Nashir has a number of other features, like horizontal and vertical rulers in the GUI, dynamic font settings for the Urdu and English fonts, indent and paragraph settings, page settings, etc.

Nashir Screen Shots
The following screen shot shows the "text wrapping around object" feature of Nashir.

Spellchecker Support in Nashir

The full version of Nashir is now equipped with the spell checker facility. The spell checker program works as a word-level error detection and correction (single or multiple corrections) tool. The following considerations were taken into account while designing the spell checker module:

• Percentage of invalid words that pass through it
• Average suggestion rate
• Whether the intended suggestion is given
• Domain the spell checker is serving
• Dictionary structure

Error Detection/Correction Approaches

1. Non-word error detection
2. Isolated word error detection
3. Context-based word correction

Non-Word Error Detection

Errors covered: typographic errors (typing mistakes), cognitive errors (invalid entry by the user), phonetic errors and real-word errors.

1. Detection: block invalid words from passing through.
2. Correction: suggest alternate valid words that can pass through.

Error Detection Techniques

Lookup dictionaries:
• Structure: hash, tries, etc.
• Dictionary size
• Inflection, euphony, assimilation (morphological analyzer)
• Domain

Bi-gram/tri-gram approach (based on conditional probabilities): a two-dimensional matrix holds the conditional probability of letters occurring next to each other. The bi-grams of a word are looked up, and if a bi-gram never occurs in the dictionary, the word is termed incorrect.
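A minimal sketch of that bi-gram test in Java, simplified to a seen/unseen table rather than full conditional probabilities, and with hypothetical names:

```java
import java.util.*;

/** Toy bi-gram non-word detector in the spirit of the approach above. */
public class BigramChecker {
    private final Set<String> seen = new HashSet<>(); // bi-grams observed in the dictionary

    /** Build the bi-gram table from every word of the lookup dictionary. */
    public BigramChecker(Collection<String> dictionary) {
        for (String word : dictionary) {
            for (int i = 0; i + 1 < word.length(); i++) {
                seen.add(word.substring(i, i + 2));
            }
        }
    }

    /** A word is flagged as a probable non-word if any of its bi-grams was never seen. */
    public boolean looksValid(String word) {
        for (int i = 0; i + 1 < word.length(); i++) {
            if (!seen.contains(word.substring(i, i + 2))) return false;
        }
        return true;
    }
}
```

A full implementation would store counts rather than a set, so that rare letter transitions could be penalised probabilistically instead of being rejected outright.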
6. The Urdu SDK Controls

The GIST Urdu SDK is a software development tool for the Urdu language on MS Windows. It is a set of software components that enables software developers to facilitate the use of Urdu scripts in MS Windows applications. Unlike other solutions, the significant feature of the GIST Urdu SDK is that it enables an MS Windows application to process PASCII data directly.

The GIST SDK uses ActiveX technology from Microsoft and provides a seamless, transparent and self-contained Indian language layer for data entry, storage, retrieval and printing in Indian scripts in your MS Windows applications. The controls are bilingual and also support the English language. The package contains editing controls that support complex scripts like Nastaliq. Following is a listing of the controls that are part of the Urdu SDK controls package version 1.0, developed by C-DAC GIST:

• UEdit
• UListbox
• UCombobox
• UStatic
• UButton
• UCheckbox
• URadiobutton

With the above set of controls, one can create all sorts of applications that support the Urdu language.

The UrduEdit OCX is a multi-line text editor control that supports the Perso-Arabic languages. This version (1.0) of the control supports Urdu as the primary language. The control supports Naskh as the default script, but can also be used to depict the Nastaliq script. The control has a number of features providing most of the functionality needed in a text editing control.

Features

1. Read & write in the Naskh/Nastaliq script.
2. Bilingual text support, i.e. English and Naskh.
3. Copy/paste text glyphs into external applications.
4. Use for database applications.
5. Transliterate (import) Hindi (ISCII) documents into Urdu.
6. Use as a simple editor with fixed font and size.
7. Use rich-edit text facilities (e.g. multiple colours, font sizes, multiple fonts).
8. A number of Naskh fonts provided along with the control.

Sample Urdu ActiveX control in a form

Applications using the GIST Urdu SDK

Following are some examples of applications where the UrduEdit OCX can be useful:

Database Applications: the UrduEdit control can be connected to any type of database and can store text as data. You can store (or load) text from the control into (or from) a database. Typical applications that can be created are mail merge, report generators, and all types of business applications where documents are created from information stored in a database.

Desktop Applications: with the UrduEdit control you can easily create applications that need text fields for the Perso-Arabic languages. The control can be used as a single-line or multi-line control. You can also use the rich-edit features of the control to display text in different fonts, colours and sizes.

Web Applications: create user controls with Visual Basic and use the UrduEdit control in web pages.
Sample database application form using the Urdu ActiveX control

7. Pocket Translator

The Multilingual Pocket Translator was developed keeping in mind foreigners travelling to India; it is equally useful for persons travelling within India (inter-state). Similar products are available worldwide, but not for Indian languages.

Features

1. English to Urdu and vice versa translation.
2. Fixed messages in 5 different categories, such as social, travel, emergency, and shop & restaurant, in both Urdu & English; 500 messages in each category.
3. Messages can be browsed in Urdu & English.
4. Speech output for fixed messages only.
5. A single toggle key for selection of the various categories.
6. Arrow keys for scrolling through the messages/dictionary.
7. A script toggle key for changing the script.
8. 4 lines of English message display.
9. 2 lines of Urdu message display.
10. Talk key for speech output.
11. English-Urdu dictionary.
12. Displays the nearest matching word in English if the typed word is not available in the dictionary.
13. Translation of the same to Urdu by pressing the Script key.

Hardware Details

1. Developed on a Motorola 68EC000 (PQFP package) processor running at 10 MHz.
2. 2 MB of EPROM for the program, fonts, fixed messages, dictionary & speech output.
3. 16 KB of SRAM for temporary variables.
4. 64-key matrix QWERTY-type keyboard.
5. 100 x 32 dot-matrix LCD display panel.
6. 4-bit ADPCM decoder for speech output.

Block Diagram of the Pocket Translator
8. Multiprompter

An ideal system for news reading and documentary production, with support for the Naskh & Nastaliq scripts. Read news or your dialogues without tears: you look into the camera and get to see the text rolling at a speed that suits your reading pace, and that too in Indian, Arabic or European languages. It can have multiple mechanical attachments with optical glass for mounting on camera tripods, and allows on-line skipping of stories.

Salient Features

1. Control of the scrolling speed, from 1 (minimum) to 9 (maximum), with 0 for pause.
2. Spell checker support.
3. Supports Indian and European languages.
4. Upgradeable to a News Room Automation System.
5. Works under the Windows environment.
6. Available in two models: desktop, and portable (laptop-based).

Following is a screen shot of the GUI (Graphical User Interface) of the Teleprompter application running on the Windows platform. It is used for the creation of news items and documentaries, which are then prompted on special equipment comprising a monitor, mirrored glass & camera.

9. LIPS Advanced Creation Station & LIPS for DVD

LIPS is a revolutionary multilingual system meant for electronic as well as fused subtitling. It consists of highly productive software and cost-effective hardware to create subtitles in Indian as well as foreign languages, along with the time codes.

LIPS for DVD allows you to create DVDs with multilingual subtitles in various Indian as well as foreign languages. It converts the LIPS subtitling files into a DVD-compatible format. The system supports the Daikin and Sonic formats, and generates the header indicating the background and foreground colours, edge colour, and the in-time and out-time, as well as the subtitling image file.

Salient Features

1. Supports all Indian languages, inclusive of Urdu, Sindhi & Kashmiri.
2. Naskh & Nastaliq support provided.
3. Supports TIF and BMP formats.
4. Supports different fonts and font sizes.
5. Edge colour for subtitles can be specified.
6. Allows positioning of the subtitles on the video.
7. Provision for incrementing and decrementing the in and out times.
8. Compatible with PAL as well as NTSC formats.

Subtitling
10. Perso-Arabic Web Site Development

Under the Perso-Arabic Resource Centre, one of the major deliverables was to create a web site in the PA scripts, and especially in the Nastaliq script. The following method was adopted for the design & development of the website, http://parc.cdacindia.com:

A special ActiveX control (UEditWeb) was designed for the web that could display Nastaliq. This control supports HTML-like tags to set various attributes. USCII data was converted into the UEditWeb format; the UEditWeb data was stored in files which were put in a database, and server-side scripts were written to generate the Urdu content for display.

Ten books will be added to make the site content-rich.

The following screen shots show the main page and the page containing information related to the Resource Centre website.

Courtesy : Shri M. D. Kulkarni
Chief Investigator – Resource Centre &
the Resource Centre Team, GIST
Centre for Development of Advanced Computing (C-DAC),
Pune University Campus, Ganeshkhind, Pune 411 007
Tel : 00-91-20-5694000/01/02, 5694092
Fax : 91-20-5694059
E-mail : mdk@cdacindia.com
Resource Centre For
Indian Language Technology Solutions – Sanskrit, Japanese, Chinese
Jawaharlal Nehru University, New Delhi
Achievements

Jawaharlal Nehru University, New Delhi-110067
Tel. : 00-91-11-26704772
E-mail : gvs10@hotmail.com
Website : http://202.41.10.205/rciltsjnu
RCILTS-Sanskrit, Japanese, Chinese
Jawaharlal Nehru University, New Delhi

Introduction

RCILTS, JNU is the resource centre for the Sanskrit language under DIT, Government of India. At JNU, work started on three languages, viz. Sanskrit, Japanese and Chinese. It took a lot of time and effort to assemble a cohesive team for the development of language technology.

In the beginning, many problems were faced regarding development tools, software professionals, as well as language professionals. Due to the socio-economic nature of the IT industry, attrition among IT professionals was very high, so the work suffered a lot. To this day it remains very difficult to find good linguists (computational linguists) who could carry forward the task of future development.

In the first stage it was decided that we should not reinvent the wheel: we would try to develop language technology and resources not being addressed at the other RCILTS. India has a rich cultural heritage and time-proven scientific knowledge, largely in the form of Sanskrit literature. So it was decided that we would develop a web-based Sanskrit Language Learning System (primarily for Sanskrit as a second language). It would be of great use to scholars who look to our heritage knowledge for designing and developing knowledge-based systems.

Keeping the above goal in mind, at RCILTS, JNU we have designed various modules that teach the Sanskrit language in an asynchronous fashion. In developing these language resources we have kept in mind the various standardization aspects of creating such resources, especially Unicode: all the web content available on the RCILTS JNU site is Unicode compliant.

In the early phases of development we used many tools developed at C-DAC, so a lot of the content was ISFOC/ISCII based and all the processing was done on ISCII data; there was no tool with which we could create Unicode-compliant content. At a later stage, when Unicode tools and techniques had been developed and a few of them bought, we converted all the content to Unicode. During this conversion process we developed the Web Content Unicode Converter, which works with ISFOC/ISCII data.

The RCILTS, JNU website is available at http://202.41.10.205/rciltsjnu

The various modules developed at RCILTS JNU are presently integrated for learning the language. With the development of new modules, this language learning system shall be used for various kinds of language processing (text and speech). The software modules and language resources available are:

Learning materials
• Sanskrit Lessons
• Sanskrit Exercises
• Sanskrit Script Learning System

Lexicons
• Sanskrit – English Lexicon
• Sanskrit – English Lexicon of Nyaya terms
• English – Sanskrit Lexicon

Sanskrit Grammar
• Panini's Ashtadhyayi
• Laghu Sidhyant Kaumudi
• Dhatu Roop to Dhatu
• Dhatu to Dhatu Roop
• Pratipadik Roop
• Shabd Roop
• Upsarg
• Sandhi

Tools
• HTML Content Converter
• ISCII to Unicode Database Converter and vice versa
• Devnagari Unicode Keyboard Driver for Browsers
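The ISCII-to-Unicode direction of such a converter is essentially a byte-to-code-point table walk. The Java fragment below illustrates the idea with a few Devanagari entries (an illustrative subset; the real converter must cover the whole ISCII table, its attribute and extension codes, and the reverse direction):

```java
import java.util.*;

/** Sketch of an ISCII-to-Unicode converter (illustrative subset of the ISCII-91 table). */
public class IsciiToUnicode {
    private static final Map<Integer, Character> TABLE = new HashMap<>();
    static {
        TABLE.put(0xA4, '\u0905'); // ISCII a  -> Devanagari letter A
        TABLE.put(0xB3, '\u0915'); // ISCII ka -> Devanagari letter KA
        TABLE.put(0xDA, '\u093E'); // ISCII aa matra -> Devanagari vowel sign AA
        // ... the full converter covers the complete code chart plus attribute/ext codes
    }

    public static String convert(byte[] iscii) {
        StringBuilder out = new StringBuilder();
        for (byte b : iscii) {
            int v = b & 0xFF;
            if (v < 0x80) out.append((char) v);               // lower half is plain ASCII
            else out.append(TABLE.getOrDefault(v, '\uFFFD')); // unmapped -> replacement char
        }
        return out.toString();
    }
}
```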
Sanskrit Grammar is primarily based on Panini's Ashtadhyayi and the Siddhanta Kaumudi. We have already developed modules for Dhatu Roop, Shabd Roop, and Sandhi. After working through these modules, a learner knows the basics of the Sanskrit language.

We plan to develop a Sandhi Vichhed (sandhi-splitting) system. With the development of the Sandhi Vichhed system as a tool, it shall be much easier for scholars to read and understand the age-old Sanskrit manuscripts, which are treasures of time-proven scientific knowledge. With a proper understanding of these manuscripts, it would be easy for them to interpret the scientific knowledge in its original context and apply it in the present context for the development of knowledge-based software systems. Knowledge-based systems would aid the processes of administration, business, management and education.

Since JNU has a rich language school, work was also carried out in the beginning on Japanese and Chinese. After some time, work on the Chinese language was discontinued due to a paucity of Chinese language professionals, but work on the Japanese language is still on. We have developed a lot of language learning resources for Japanese too.

1. Sanskrit Lessons and Exercises

The Sanskrit Lessons are useful for learning the Sanskrit language. There are 21 lessons, and the number will increase. The lessons are designed so that a person who does not know Sanskrit at all, but knows either English or Hindi, can go through them and learn: the Sanskrit content of the lessons is described in both English and Hindi.

The basics of the Sanskrit language are covered elaborately, and then the grammar of the language is explained. Nearly all topics of grammar are covered, with plenty of examples and explanation. Special care is taken over the pronunciation of the Sanskrit alphabet and words; the pronunciation is based on standard phonetic notations. One can go to the Sanskrit Script Learning module from a lesson to learn the writing process and the pronunciation of the alphabet.

Some extra facilities are provided to make learning and remembering easier. All the Sanskrit words used throughout the lessons are available to the user with their roots and meanings. These words are also connected to the Lexicon and are accessible from any page of any lesson. The user can also see the new words on the particular page being read, with the facilities explained above. The navigation buttons below each page are self-explanatory.

The exercises are based on the lessons; for each lesson, more than two exercises are available. These are objective-type questions, with four options given for each question. As an option is chosen, the user is promptly told whether it is correct or not. The correct answer to a particular question is shown in a table below after the question has been attempted, so the user can see the answer given as well as the actual answer just by choosing an alternative. The exercises are obviously helpful for practising what the user has learnt in that lesson.
2. Devanagari Script Learning and Character Recognition System

Scripts are the basics of a language, and writing the script is very important in learning a language. Using this module, the user learns the Devanagari alphabet and how to write and pronounce it. The writing strokes for a character are shown in stroke order, at four different speeds. The module takes care of all characters, conjuncts, matras and numerals, and there is audio support for listening to the pronunciation of each alphabet. Recognition of a character after hearing its pronunciation is also covered. The module is hyper-linked with the Sanskrit Lessons, so the user can access it from a lesson.

Example: the Sanskrit vowel "Riri" is written in four strokes, shown at four different speeds. Besides the basic alphabet, there are a number of conjunct characters in the Sanskrit language; their formation and writing process, in the proper stroke order, are given. Phonetic notations are given in the top-left corner so that a user can pronounce the alphabet easily. Numerals are also important to learn, given their frequent use in the language; the writing process for the Sanskrit numerals is shown with their stroke order, and if there is more than one style of writing a numeral, the alternatives are provided below it.

Pronunciation of the alphabet is important for speaking a language. Attention has been given to the correct and clear pronunciation of the alphabet, and audio support is provided. There are also some exercises for the user in recognizing alphabets from their pronunciation.

3. Sanskrit-English Lexicon

In this module a user can get the meaning, grammatical category and usages of a Sanskrit word. It also displays the phonetic notation of the word; the grammatical category and usage are displayed where available.

The E-R diagram of the Sanskrit-English Lexicon is given below; all the lexicons follow the same E-R diagram.

The facility for getting input from the user is provided in both ways. The user can click the beginning letter of the word, which displays all the words starting with that letter in the right-side window; on clicking a particular word, the meaning with phonetic notation and the grammatical category are displayed in the left window. Alternatively, the user can type the input word in the text box with the help of the phonetic keyboard; on pressing the Submit button, the meaning with phonetic notation and the grammatical category are displayed for the input word.
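The lookup behind both input paths is a simple keyed retrieval. A minimal sketch in Java (record and method names are hypothetical) of the two queries the lexicon page needs:

```java
import java.util.*;

/** Toy model of the lexicon lookups described above (illustrative names only). */
public class LexiconSketch {
    /** One dictionary entry: headword with phonetic notation, category, meaning, usage. */
    public record Entry(String word, String phonetic, String category,
                        String meaning, String usage) {}

    // TreeMap keeps headwords sorted, which makes listing by first letter cheap.
    private final TreeMap<String, Entry> entries = new TreeMap<>();

    public void add(Entry e) { entries.put(e.word(), e); }

    /** "Click a beginning letter": list every headword starting with that letter. */
    public Collection<Entry> startingWith(String letter) {
        return entries.subMap(letter, letter + '\uFFFF').values();
    }

    /** "Type a word and press Submit": fetch the full entry, or null if absent. */
    public Entry lookup(String word) { return entries.get(word); }
}
```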
4. Dictionary of Nyaya Terms

This dictionary of Nyaya terms is part of an ambitious project of preparing an encyclopedia of Indian logic. The main purpose is to resolve confusion and bring uniformity to the translation of Nyaya terms into English. The stress has been on clarity of concepts rather than on literal translation. The root and meanings, with phonetic notation, of each Sanskrit Nyaya term are provided here.

For giving input, it is necessary to click on the beginning letter of the word in the left window; this displays all the words starting with that letter in the right-side window. On clicking a particular word, the root and meanings, with phonetic notation, are displayed in the left window.

5. Panini Ashtadhyayi

In this module a user can get the Panini Sutra, the Sutra with Anuvrtti, its Sanskrit explanation and its English explanation, after giving either an Ashtadhyayi Sutra number or a Siddhanta Kaumudi Sutra number. The user also gets information on how many times the Ashtadhyayi Sutra occurs in the Siddhanta Kaumudi.

The user can select an Ashtadhyayi Sutra or enter a Siddhanta Kaumudi number (if the user gives both an Ashtadhyayi Sutra and a Siddhanta Kaumudi number, the system accepts the Ashtadhyayi Sutra).

Example: for the input Ashtadhyayi Sutra number 2.2.23, the Sutra occurs in two places (Kaumudi serial numbers 529 and 829). If the user clicks either Kaumudi Sutra number, all the information on that Sutra is displayed; if the user clicks the Kaumudi Prakaran, all the Sutras of that Prakaran are displayed.

6. Dhatu to Dhatu Roop

Using this module a user can get all the information about a dhatu, i.e. the dhatu's meaning, its gana name, its pada name, and the roops after selecting the lakar. For giving input, it is necessary to click on the beginning letter of a word in the left window; this displays all the words starting with that letter in the right-side window.
Sandhi rules and processes. Sutra number in
Astyadhayi and its description is displayed. User can
learn three type of Svara Sandhi, Vyanjan Sandhi,
Hal Sandhi through this Sandhi module Data is in
Unicode. Sandhi exceptions and options are also
incorporated.
This is the Flowchart for this module. These are the
procedures written to make this module user friendly
and more descriptive.

Example- The inputted dhatu ‘ni’ and the


information of the dhatu ‘ni’ is displayed. The 10
lakar name is displayed the user can select any lakar
and get the roops of that lakar. As this dhatu is an
Ubhyapada Dhatu, the user has to select the pad also
in the next page.
7. DhatuRoop to Dhatu

This module is the reverse of the Dhatu to DhatuRoop module: a user can get the dhatu and its attributes from a dhatu roop.

Example: For the given dhatu roop "Bhavati", the user gets all the information related to that roop, such as the Dhatu Name, Dhatu Arth, Dhatu Gana, Dhatu Pada, Dhatu Lakar, Purush and Vachan.
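Both modules can be served from a single forward table keyed by dhatu, with the reverse module implemented as an inverted index over the generated forms. A minimal sketch with one hypothetical entry for the dhatu 'bhu' (the real tables cover all dhatus, ganas, padas, lakars, purushes and vachans):

    # Forward table for Dhatu to DhatuRoop (illustrative entry only).
    DHATUS = {
        "bhu": {
            "arth": "to be, to become",
            "gana": "bhvadi",
            "pada": "parasmaipada",
            "roops": {("lat", "prathama", "eka"): "bhavati"},
        },
    }
    # Inverted index for DhatuRoop to Dhatu: form -> (dhatu, lakar, purush, vachan).
    ROOP_INDEX = {
        roop: (dhatu, *key)
        for dhatu, record in DHATUS.items()
        for key, roop in record["roops"].items()
    }

    print(ROOP_INDEX["bhavati"])  # -> ('bhu', 'lat', 'prathama', 'eka')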
8. Sandhi

Sandhi means the coalescence of two words coming into immediate contact with each other. Using this module the user can get information about Sandhi rules and processes; the Sutra number in the Astadhyayi and its description are displayed. The user can learn three types of Sandhi (Svara Sandhi, Vyanjan Sandhi and Hal Sandhi) through this module. The data is in Unicode, and Sandhi exceptions and options are also incorporated.

A flowchart accompanies this module, and the following procedures were written to make it user friendly and more descriptive.

Input Procedure: This procedure takes input from the user in the form of Unicode data. A phonetic keyboard is used for this purpose, so that everybody can use the module easily.

CheckSandhi Procedure: This procedure finds the Sandhi that applies to the given input.

Panini Ashtadhyayi Sutra Number: This procedure connects with the Ashtadhyayi module. Clicking on a Sutra number displays the sutras used.

The module takes two words as input. The first word cannot be null, but the second word can be. The user inputs the two words and submits the form to get the result:

a. Fill the first word in the textbox captioned 'Input First Word'.
b. Fill the second word in the textbox captioned 'Input Second Word'.
c. Click the Submit button.

First input word + second input word = resultant word

Example: For the input first word "Dyata" and second word "ari", the module shows all the explanations related to this Sandhi: its Sutra definition, the Sutra which applies in the process, and the Sutra number in the Ashtadhyayi. The Sutra number is a hyperlink leading to more details of the sutra.
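At its core, the CheckSandhi step is a rule-table match on the final sound of the first word and the initial sound of the second. A minimal sketch for two well-known svara-sandhi rules; the entries are illustrative, while the module's full rule set also carries the exceptions, options and sutra references described above:

    # Illustrative svara-sandhi rules: (final of word 1, initial of word 2) -> replacement.
    RULES = {
        ("a", "a"): "aa",  # savarna-dirgha sandhi, e.g. deva + asura -> devaasura
        ("a", "i"): "e",   # guna sandhi,           e.g. deva + indra -> devendra
    }

    def join(word1, word2):
        """Apply the matching rule, if any, at the word boundary."""
        if not word2:                      # the second word may be null (see above)
            return word1
        key = (word1[-1], word2[0])
        if key in RULES:
            return word1[:-1] + RULES[key] + word2[1:]
        return word1 + word2               # no sandhi applies

    print(join("deva", "indra"))  # -> devendra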
9. Japanese Lessons

The Japanese Lessons are very useful for introducing the Japanese language. They are mainly for Indians who know Hindi well, as the contents of the lessons are described in Hindi. A few screenshots are given below.

The Japanese language contains a rich set of alphabets, and it is important for the learner to be familiar with them. Keeping this in mind, most of the fourteen lessons describe the basics of the Japanese language with simple examples and explanations. The grammar of the Japanese language is discussed not elaborately but lightly, to give an idea of verbs, tense, nouns, adjectives etc. Care is taken not to make things complicated.

Kanji is another thing that adds a different value to the Japanese language. The language has a rich set of Kanji, and the Kanji are explained with their stroke order.

Many facilities are provided to make the learning process easy and attractive. The user can see all the words of the lessons from any page, and the new words of a section are also listed for the user. From these word lists a user can go to the Japanese-Hindi Lexicon for the details of a word. From the alphabet-learning lessons a user can go to the Japanese script learning module to see how the alphabets are written and pronounced. The script learning covers all alphabets, including Kanji.

10. Japanese Script Learning and Character Recognition System

Scripts are the basis of a language, and writing these scripts is very important for learning a language. Using this module the user learns the Japanese alphabets and how to write and pronounce them. The writing strokes for a character are shown in stroke order at four different speeds. The module covers all characters, including Kanji. Recognizing a character after hearing its pronunciation is also supported. The module is hyperlinked with the Japanese Lessons and can be accessed from them.

There are different types of alphabets in the Japanese language, and all of them are covered here. The writing process is made user friendly: there are four options for the speed at which an alphabet is drawn, and there is audio support for listening to its pronunciation. Emphasis is given to the stroke order according to which the alphabets are correctly written.

Besides the basic alphabets there are numerous Kanji in the Japanese language. Their formation and writing process, in the proper stroke order, are given.

Pronunciation of the alphabet is important for speaking a language. Attention has been given to the right and clear pronunciation of the alphabets, and audio support is provided.

11. Japanese-Hindi Lexicon

In this module a user can get the attributes of a Japanese word in Hindi. It is very helpful for a user who knows Hindi and wants to learn the Japanese language. The facility for getting input from the user is provided in both ways. Users can click the beginning letter of the word; all words which start with that letter are displayed in the right-side window. On clicking a particular word, the meaning with phonetic notation and the grammatical category are displayed in the left window.

12. HTML Content Converter

This converter converts ISCII/ISFOC based HTML content to Unicode based HTML content. It preserves the structure of the HTML content and, to a large extent, also preserves its presentation. It also sets the content-encoding directive to UTF-8, so that the resultant output of the converter is complete in all respects.

High-level design of the converter:
1. The converter reads the input file.
2. It looks for the presence of ISFOC strings inside the content.
3. Whenever it finds an ISFOC string, it converts it to ISCII.
4. Thereafter it converts the ISCII string to Unicode data.
5. It writes the resultant Unicode data to the output file.
6. It also updates the content encoding inside the output data.

A flowchart of the HTML Content Converter accompanies this description.
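A minimal sketch of steps 2 to 5 of this design. The two mapping tables below are hypothetical single-entry placeholders: real ISFOC tables are font-specific, a real ISCII table covers the full ISCII-91 chart, and a production converter would restrict conversion to text nodes so that HTML markup is left untouched:

    # Hypothetical one-entry tables; real converters carry complete, font-specific maps.
    ISFOC_TO_ISCII = {"d": 0xB3}           # placeholder: one ISFOC glyph -> ISCII code
    ISCII_TO_UNICODE = {0xB3: "\u0915"}    # ISCII 'ka' -> Devanagari letter KA

    def convert_text(text):
        out = []
        for ch in text:
            iscii = ISFOC_TO_ISCII.get(ch)                # step 3: ISFOC -> ISCII
            out.append(ISCII_TO_UNICODE.get(iscii, ch))   # step 4: ISCII -> Unicode
        return "".join(out)

    def convert_file(path_in, path_out):
        with open(path_in, encoding="latin-1") as f:      # step 1: read the input file
            html = f.read()
        html = convert_text(html)                         # steps 2-4
        html = html.replace("charset=x-user-defined",
                            "charset=UTF-8")              # step 6 (illustrative directive)
        with open(path_out, "w", encoding="utf-8") as f:
            f.write(html)                                 # step 5: write Unicode output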
Other Converters
1. ISCII Text to Unicode Text Converter
2. Unicode Text to ISCII Text Converter
3. ISCII Database to Unicode Database Converter
4. Unicode Database to ISCII Database Converter
5. Devanagari_Unicode (Phonetic) keyboard driver
6. Devanagari_Unicode (Inscript) keyboard driver
7. Japanese_Unicode keyboard drivers (under development)

13. Current/Future Tasks

1. The utilities developed, being developed and to be developed at RCILTS, JNU will be available on the web.
2. RCILTS, JNU will participate in the development of at least two Devanagari Unicode fonts (one TrueType and one OpenType) locally, by procuring or by outsourcing.
3. The centre will collect and make a repository of related work being done at national and international centres, such as C-DAC, IIT Kanpur, IISc Bangalore and JETRO. The centre will establish active contacts with these national and international bodies for acquisition of public domain material on the subject.
4. The format of the lexicons will be advertised and discussed with the other RCILTS.
5. The centre will collect and place on the web some more texts, such as the Hitopadesh.
6. The centre must enhance existing systems such as the lexicons, and developed systems such as the Sandhi and Sandhi Vicched modules, leading to the development of a translation aid system from Sanskrit to other Indian languages and vice versa. The centre must acquire Hindi/Sanskrit morphological analyzers and then improve upon them.
7. The centre will also enhance OCR performance for Sanskrit.
8. The centre will also enhance spell checkers, basic word processors and text editors for Sanskrit.
9. The centre will procure a message server.
10. The tasks on OCR, fonts, spell checker, text editor, word processor and message server will also be taken up; some of these activities may be taken up right away, depending on the funds currently available with the resource centre.

The following work being done at the centre will be available fully to users on the Internet by September 30, 2003:

Sanskrit Modules

1. Ashtadhyayi of Panini with Siddhanta Kaumudi of Nagesh Bhatta (English Tika) and Prabhakari Tika (Hindi). In addition, Katyayan's Vartik on the Ashtadhyayi will also be available.
2. Dhatu Ratnakar with all forms of all Dhatus.
3. Pratyahar Module.
4. Ten Lessons on Sanskrit Sambhashan.
5. Samgya word forms (Pratipadic) with 1,000 words.
6. Sarvanam Module with all Pronouns.
7. Sanskrit-English Lexicon (word meanings only) with 30,000 words.
8. English-Sanskrit Lexicon (word meanings only) with 30,000 words.
9. 30 Sanskrit Lessons with Exercises.
10. Sandhi Prakaran with exceptions and options.
11. Lists of other Sanskrit words, such as adjectives, conjunctions etc.
12. Devanagari Script Learning Module with Matras and Conjuncts.

Japanese Modules

1. 21 lessons with exercises. These lessons shall constitute a course which is more than equivalent to a one-year diploma course in Japanese.
2. A simple Hindi to Japanese translation exercise module.
3. Japanese-Hindi Lexicon (word meanings only) with 6,000 words.
4. Hindi-Japanese Lexicon (word meanings only) with 6,000 words.
5. Hiragana Script Learning Module.
6. Katakana Script Learning Module.
7. Kanji Script Learning Module.

In addition to adding more lessons, the centre plans to augment the lexicons with audio-visual support, usage of words for their meanings, and phrasal grammatical categories of words. A Sandhi Vicched module will be developed, which will aid the development of simple translation systems (Sanskrit to other Indian languages and vice versa). Simple Hindi to Japanese and Japanese to Hindi translators are also planned. The centre also plans to develop content of the Hitopadesh kind for both Sanskrit and Japanese.

Good expertise in the field of the Chinese language is not available. The centre is therefore of the opinion that it may stop its efforts in developing systems for the Chinese language and concentrate its energy on Sanskrit and Japanese.

Courtesy: Prof. G.V. Singh
Jawaharlal Nehru University
School of Computer and Systems Sciences
New Mehrauli Road
New Delhi - 110 067
(RCILTS for Sanskrit, Chinese & Japanese)
Tel: 00-91-11-26107676, 26101885
E-mail: gvs10@hotmail.com
6. TDIL Associates Achieve Honours… Congratulations!

Congrats Dr. Vikas!
On being awarded the VIGYAN BHUSHAN by the Hon'ble Prime Minister in a ceremony organized by the UP Hindi Sansthan at Lucknow on 21 May, 2003.

About Dr. Om Vikas (omvikas@mit.gov.in) [National Coordinator of the TDIL Mission]

B.Tech, M.Tech & Ph.D from IIT Kanpur. Fellow of IETE and member of IEEE. Recipient of the Fellowship of the Russian Academy of Informatization of Education, and of the Vigyan Bhushan, Atmaram, Vishisht Padak and Indira Gandhi Rajbhasha awards.

After his M.Tech, he served in TCS. He joined NIC in 1977 as a Senior Systems Analyst and rose to the position of Senior Director in 1998. He served on deputation as visiting professor at IIT Kanpur and NCERT, and as Counsellor for Science & Technology in the Indian Embassy at Tokyo, Japan. Under a UNDP fellowship he studied large database designs in the USA, Canada and Europe (1980).

His research interests include computer architecture, database design and language informatics. He has served on the programme committees of various international conferences, and has significantly contributed towards promoting language informatics, computer manpower development, international cooperation and technical Hindi.

Sensitivity to society is evident from his initiatives: publishing the ALOK newsletter in Hindi during his B.Tech; founding the People's Science Society (Lok Vigyan Parishad) in 1986; and launching TDIL (Technology Development for Indian Languages) in the 1990s and coordinating it in mission mode. He is the founder editor of VishwaBharat@tdil.

Currently, he is Senior Director / Scientist 'G' and Head, TDIL (Technology Development for Indian Languages) Mission, Ministry of Communications & IT, New Delhi - 110003.

Congrats Prof. Sinha!
For successfully launching the on-line Machine Translation System (http://anglahindi.iitk.ac.in).

About Prof. R.M.K. Sinha (rmk@iitk.ac.in) [Chief Investigator of MT Projects]

Ph.D (CS), IIT Kanpur; M.Tech in Industrial Electronics (Electronics & Communication Engineering), Indian Institute of Technology, Kharagpur, 1969; M.Sc.Tech in Electronics and Communication, University of Allahabad, 1967.

He has taught at many institutions/universities in India and has been a visiting professor/faculty member at institutions/universities in the USA, Canada and Bangkok. Currently he is Professor in the Department of CS and the Department of EE, IIT Kanpur. He has published extensively, especially in the area of Indian language technologies, in various national and international journals.

He received the "Man of the Year Award" from the American Biographical Institute, USA. He is a Fellow of IETE, India, a Senior Member of IEEE, USA, and serves many other organizations in various capacities.

Development of Indian language technologies is his main area of interest, which includes AI, NLP, MT, SST, OCR, document processing etc. He is credited with the development of the INSCRIPT keyboarding and coding schemes (ISCII), the Integrated Devanagari Computer Terminal, the GIST terminal, transliteration among Indian languages, spell-checker design, OCR, and the Anglabharati machine-aided translation system from English to Hindi and English to Urdu for Windows and Solaris/Linux. HindiAngla, the Hindi to Angla MAT system, is in advanced stages of development. He is also working on developing MAT systems for all Indian languages, a speech-to-speech translation system, and a lexical knowledge base named ShabdKalpTaru.

Congrats Prof. Dhande!
On becoming Director of the Indian Institute of Technology, Kanpur.

About Prof. Sanjay Dhande (sgd@iitk.ac.in) [Chief Investigator of the RC for ILTS]

Ph.D in Mechanical Engineering, IIT Kanpur. He is currently Director of the Indian Institute of Technology, Kanpur, and Professor in the Mechanical Engineering and Computer Science departments, IITK. He has held the posts of Assistant Professor (1979-85), Professor (1985 to date), Visiting Professor at Virginia Tech, Virginia, USA (1992), Head of Mechanical Engineering (1993-95), Dean of Research and Development (1999-2001), and Director (2001 to date).

He has published numerous papers in national and international journals. He has also written books: "Kinematics and Geometry of Planar and Spatial Cam Mechanisms", published by Wiley Eastern Ltd.; "Computer Aided Design and Manufacture", published by the Committee on Science & Technology in Developing Countries (COSTED), Singapore; and "Computer Aided Engineering Graphics and Design", under publication by Wiley Eastern Publishers Ltd.

Congrats Prof. Lehal!
On becoming Professor at Punjabi University, Patiala. [Chief Investigator of RC-ILTS (Punjabi)]

About Prof. Gurpreet Singh Lehal (gslehal@lycos.com)

Ph.D (CS), Punjabi University; M.E (CS), TIET Patiala; M.Sc. Mathematics (Honours), Punjab University.

He has attended, presented and published many technical papers in the area of Indian language technology. He has been working for more than five years on different projects related to the computerization of Punjabi, and is guiding many M.Tech. and Ph.D scholars on topics related to the technological development of Punjabi.

As Chief Investigator of the project "Resource Centre for Indian Language Technology Solutions - Punjabi (April 2000 - June 2003)", his contributions include Gurmukhi OCR, a Punjabi word processor, a Gurmukhi to Shahmukhi transliteration system (in collaboration with CDAC, Pune), a Punjabi spell checker, a Punjabi sorting program, a Punjabi font converter, on-line Punjabi-English, English-Punjabi and Hindi-Punjabi dictionaries, and a web site for on-line teaching of Punjabi.

The Gurmukhi OCR and Punjabi word processor are ready for commercialization, and there is already a big demand for these products in India as well as abroad in the UK, USA and Canada. An MOU for transfer of technology of the Punjabi spell checker is being finalised with M/S Modular Systems, Pune.
7. Resource Centres Technology Index

I. Indian Institute of Technology, Kanpur (Hindi, Nepali)
1. Machine Translation
2. Speech-to-Speech Translation
3. Lexical Knowledge-Base Development
4. Optical Character Recognition
5. Transliteration
6. Spell Checker Design
7. Knowledge Resources
7.1 Gitasupersite
7.2 Brahmasutra
7.3 Complete Works of Adi Sankara
7.4 Ramcharitmanas
7.5 Upanishads
7.6 Kavi Sammelan
7.7 Bimari-Jankari
7.8 Paramhans Ram Mangal Das Ji
7.9 Nepali Texts
8. Technical Issues
9. Details of Texts/Commentaries for each of the Upanishads
10. The Team Members

II. M.S. University of Baroda, Vadodara (Gujarati)
1. Knowledge Resources
1.1 Gujarati WordNet
2. Knowledge Tools
2.1 Portal
2.2 Multilingual Text Editor
2.3 Code Converter
2.4 Gujarati Spell Checker
3. Translation Support Systems
3.1 Machine Translation
4. Human Machine Interface Systems
4.1 OCR for Gujarati
4.2 Text To Speech (TTS)
5. Localization
5.1 Language Technology Human Resource Development
6. Standardization
6.1 Unicode
7. Products which can be launched and the service which the Resource Centre can provide to the State Government and the Industry in the region
7.1 Technical Skills
7.2 Products Which can be Launched
7.3 IT Services
8. IT Services - Multimedia CDs
8.1 Learn Gujarati Through English
8.2 Adolescent Health
8.3 Bibliography of books and journals
8.4 Classic Knowledge-base
9. The Team Members

III. Indian Institute of Technology, Mumbai (Marathi, Konkani)
1. Lexicon and Ontology
1.1 Lexicon
1.2 Ontology
2. The Hindi Wordnet
3. The Marathi Wordnet
4. Automatic Generation of Concept Dictionary & Word Sense Disambiguation
5. Hindi Analysis and Generation
6. Marathi Analysis
7. Speech Synthesis for Marathi Language
8. Project Tukaram
9. Automatic Language Identification of Documents using Devanagari Script
10. Object Oriented Parallel & Distributed Web Crawler
11. Designing Devanagari Fonts (Three Types)
12. Low Level Auto Corrector
13. Font Converters
14. Marathi Spell Checker
15. IT Localisation
16. Publications
17. The Team Members

IV. Thapar Institute of Engineering & Technology, Patiala (Punjabi)
1. Products Developed
1.1 Spell Checker
1.2 Font Converter
1.3 Sorting Utility
1.4 Bilingual Punjabi/English Word Processor
1.5 Gurmukhi OCR
1.6 Gurmukhi to Shahmukhi Transliteration
2. Contents Uploaded on Internet
2.1 Punjabi Classic Literature
2.2 Bilingual Dictionaries and Glossary
• Punjabi-English On-line Dictionary
• English-Punjabi On-line Dictionary
• Hindi-Punjabi On-line Dictionary
• Glossary of English-Punjabi Administrative Terms
2.3 On-Line Teaching of Punjabi
2.4 On-Line Font Conversion Utility
2.5 Downloads
• Punjabi Spell Checker
• Punjabi Fonts
3. Interaction with Punjab State Govt.
4. Publications
5. The Team Members

V. Indian Statistical Institute, Kolkata (Bangla)
1. Core Activities
1.1 Website Development and the Language Design Guide
1.2 Training Programmes
2. Services
2.1 Corpus Development
• Printed Bangla Document Images & Ground Truth for OCR and Related Research
• Development of Bangla Text Corpus in Electronic Form (including a Bangla Dictionary and Several Bangla Classics)
• Electronic Corpus of Speech Data
2.2 Font Generation and Associated Tools
• Public Domain Bangla Font Generation
• Converter - Font File to ISCII File and Vice-Versa
• Bangla Text Editor
• Bangla Spell Checker
3. Products
3.1 OCR System for Oriya
3.2 Adaptation of Bangla OCR to Assamese
3.3 Information Retrieval System for Bangla Documents
3.4 Script Identification and Separation from Indian Multi-Script Documents
4. Research & Development
4.1 Automatic Processing of Hand-Printed Table-Form Documents
4.2 Research and Development of Neural Network Based Tools for Printed Document Processing (in eastern regional scripts)
5. The Team Members

VI. Utkal University (Oriya)
1. Intelligent Document Processing (OCR for Oriya)
2. Natural Language Processing (Oriya)
2.1 Oriya Machine Translation System (OMTrans)
2.2 Oriya Word Processor (OWP) (Multilingual)
2.3 Oriya Morphological Analyser (OMA)
2.4 Oriya Spell Checker (OSC)
2.5 Oriya Grammar Checker (OGC)
2.6 Oriya Semantic Analyzer (OSA)
2.7 E-Dictionary (Oriya-English)
2.8 OriNet (WordNet for Oriya)
2.9 SanskritNet (WordNet in Sanskrit)
3. Speech Processing System for Indian Languages
3.1 Text to Speech (TTS)
3.2 Speech To Text (STT)
4. The Team Members

Orissa Computer Application Centre (Oriya)
1. Products Developed at OCAC
1.1 Oriya Spell Checker
1.2 Thesaurus in Oriya
1.3 Bilingual Electronic Lexicon
1.4 Corpus in Oriya
1.5 Bilingual Chat Server
1.6 Net Education
1.7 XML Document Creation and Manipulation
1.8 Oriya OCR
1.9 Oriya E-Mail
1.10 Oriya Word Processor with Spell Checker under Linux
1.11 Computer Based Training (in Oriya)
1.12 Oriya Language Based E-Governance Applications
2. ToT done for various projects so far
3. Training Programmes run for the officials of the State Govt.
4. Core Activities
4.1 Web Hosting of Oriya Language Classics
4.2 Hosting of Web Sites of Govt. Colleges in Orissa
5. Products Proposed to be Developed
6. The Team Members

VII. Indian Institute of Technology, Guwahati (Assamese & Manipuri)
1. Knowledge Resources
1.1 Corpora
• Assamese Corpora
• Manipuri Corpora
1.2 Dictionaries
1.3 Design Guides
1.4 Phonetic Guides
2. Knowledge Tools
2.1 The RCILTS, IIT Guwahati Website
2.2 Fonts
2.3 Spell Checker
2.4 Assamese Language Support for Microsoft Word
2.5 Morphological Analyzers
3. Translation Support System
4. Human Machine Interface Systems
4.1 Optical Character Recognition for Assamese and Manipuri
4.2 Speech Recognition System
4.3 Interface for e-Dictionary
5. Language Technology Human Resource Development
6. Standardization
7. Publications
8. Manika Newsletter
9. The Team Members

VIII. Indian Institute of Science, Bangalore (Kannada)
1. Web Sites and Support to Instruction
1.1 Kannudi
1.2 LT-IISc
1.3 Bodhana Bharathi: Multimedia Educational CDs for 7th, 8th and 10th Standards
1.4 Bilingual Instructional Aid for Learning German through Hindi
1.5 Information Base in Hindi Pertaining to German History
2. Knowledge Bases
2.1 Sudarshana: a Web Knowledge Base on Darshana Shastras
2.2 Indian Logic Systems
2.3 Indian Aesthetics
3. Technologies and Language Resources
3.1 Brahmi: Kannada Indic Input Method, Word Processor
3.2 Open Type Fonts: Sampige, Mallige, Kedage
3.3 Kannada Wordnet
3.4 OCR for Tamil
3.5 OCR of Printed Text Document in Kannada
4. Research
4.1 Automatic Classification of Languages Using Speech Signals
4.2 Algorithms for Kannada Speech Synthesis
5. Publications

IX. University of Hyderabad (Telugu)
1. RC for ILTS
2. Products
2.1 DRISHTI: OCR
2.2 Tel-Spell: Spell Checker
2.3 AKSHARA: Advanced Multi-Lingual Text Processor
2.4 E-mail in Indian Languages
2.5 WILIO: Interactive Web Page in Indian Languages
2.6 Telugu Corpus
2.7 Dictionaries, Thesauri & other Lexical Resources
2.8 Morphology
2.9 Stemmer
2.10 Part of Speech Tagging
2.11 VIDTA: Comprehensive Toolkit for Web-Based Education
2.12 Grammars & Syntactic Parsers
2.13 Machine Aided Translation
2.14 Tools
2.14.1 Font Decoding
2.14.2 Web Crawler for Search Engine
2.14.3 PSA: A Meta Search Engine
2.14.4 Corpus Analysis Tools
2.14.5 Website Development Tools
2.14.6 Character to Font Mapping Tools
2.14.7 Dictionary to Thesaurus Tools
2.14.8 Dictionary Indexing Tools
2.14.9 Text Processing Tools
2.14.10 Finite State Technologies Toolkit
3. Service & Knowledge Bases
3.1 On-line Literature
3.2 History-Society-Culture Portal
3.3 On-Line Searchable Directory
3.4 Character Encoding Standards, Roman Transliteration Schemes, Tools
3.5 Research Portal
3.6 VAANI: A Text to Speech System for Telugu
3.7 Manpower Development
4. Epilogue
4.1 Strengths and Opportunities
4.2 Outreach
5. Publications
6. The Team Members

X. Centre for Development of Advanced Computing, Thiruvananthapuram (Malayalam)
1. Human Machine Interface Systems
1.1 NAYANA (TM) - Optical Character Recognition System for Malayalam
1.2 SUBHASHINI (TM) - Malayalam Text to Speech System (TTS)
2. Knowledge Tools
2.1 NERPADAM (TM) - Malayalam Spell Checker
2.2 AKSHARAMAALA (TM) - Malayalam Font Package and Script Manager
2.3 Text Editor
2.4 Code Converters
2.5 ANWESHANAM (TM) - Malayalam Web Based Search Engine
2.6 Malayalam Portal
3. Services
3.1 SANDESAM (TM) - Malayalam E-mail Server
3.2 Malayalam E-Com Application
4. Knowledge Resources
4.1 Trilingual Dictionary
5. Language Tutors
5.1 Ezhuthachan - The Malayalam Tutor
5.2 English Tutor
6. Other Activities
6.1 Providing Technology Solutions
6.2 Interaction with State Government
6.3 Training Activities
6.4 COWMAC (Consortium of Industries Working in the Area of Malayalam Computing)
7. Publications
8. Expertise Gained
9. Future Plans
10. The Team Members

XI. School of Computer Science & Engg., Anna University, Chennai (Tamil)
1. Knowledge Resources
1.1 Online Dictionary
1.2 Corpus Collection Tools
1.3 Contents Authoring Tools
1.4 Tamil Picture Dictionary
1.5 Flash Tutorial
1.6 e-Handbook of Tamil
1.7 Karpanaikkatchi - Scenes from Sangam Literature
1.8 District Information
1.9 Educational Resources
2. Knowledge Tools
2.1 Language Processing Tools
• Morphological Analyser
• Morphological Generator
2.2 Text Editor
2.3 Spell Checker
2.4 Document Visualization Tool
2.5 Utilities
• Code Conversion Utility
• Tamil Typing Utility
3. Translation Support System
3.1 Tamil Parser
3.2 Universal Networking Language (UNL) for Tamil
3.3 Heuristic Rule Based Automatic Tagger
4. Human Machine Interface System
4.1 Text-to-Speech (Ethiroli)
4.2 Poonguzhali (A Chatterbot)
5. Localization
5.1 Tamil Office Suite (Aluval Pezhai)
• Tamil Word Processor (Palagai)
• Presentation Tool for Tamil (Arangam)
• Tamil Database (Thaenkoodu)
• Tamil Spreadsheets (Chathurangam)
5.2 Tamil Search Engine (Bavani)
6. Language Technology Human Resources Development
6.1 Creating Awareness among the General Public
6.2 Language Technology Training
6.3 Co-ordination with Linguistic & Subject Experts
6.4 Co-ordination with Govt. & Industry
7. Standardization
8. Research Activities
8.1 Text Summarization
8.2 Latent Semantic Indexing
8.3 Knowledge Representation System based on Nyaya Shastra
9. Publications
10. The Team Members

XII. Perso-Arabic Resources Centre, CDAC GIST, Pune (Urdu, Sindhi, Kashmiri)
1. The PASCII Standard
1.1 Characteristics of Proposed Standard
1.2 Standardization of Urdu Keyboard
1.3 Standardization of Fonts
2. Script Core Technology & Rule Engine
2.1 Glyph Property Editor
2.2 Rule Engine
2.3 Testing Version of Urdu Editor
2.4 Font Glyph Editor
2.5 The Font Editor Utility
3. Devanagari/Gurmukhi to Urdu Transliteration "UTRANS"
3.1 A Sample of Hindi Text Transliterated into Urdu (UTRANS)
3.2 A Sample of Punjabi Text Transliterated into Urdu (UTRANS)
4. Dictionary Development Tools
5. Nashir - Word Processor
6. The Urdu SDK Controls
7. Pocket Translator
8. Multiprompter
9. LIPS Advance Creation Station & LIPS for DVD
10. Perso-Arabic Web Site Development

XIII. Jawaharlal Nehru University, Delhi (Sanskrit, Japanese, Chinese)
1. Sanskrit Lessons and Exercises
2. Devanagari Script Learning and Character Recognition System
3. Sanskrit-English Lexicon
4. Dictionary of Nyaya Term
5. Panini Astadhyayi
6. Dhatu to DhatuRoop
7. DhatuRoop to Dhatu
8. Sandhi
9. Japanese Lessons
10. Japanese Script Learning and Character Recognition System
11. Japanese-Hindi Lexicon
12. HTML Content Converter
12.1 Other Code Converters
13. Current/Future Tasks
List of Resource Centers

◗ Indian Institute of Technology
Department of Computer Science & Engineering
Kanpur - 208 016
(RCILTS for Hindi & Nepali)
Prof. R.M.K. Sinha
Tel : 00-91-512-2597174, 2598254
E-Mail : rmk@iitk.ac.in
Website : http://www.cse.iitk.ac.in/users/langtech

◗ Indian Statistical Institute
Computer Vision and Pattern Recognition Unit
203, Barrackpore Trunk Road, Kolkata - 700035
(RCILTS for Bengali)
Prof. B.B. Chaudhary
Tel : 00-91-33-25778086 Extn 2852, 25781832, 25311928
E-Mail : bbc@isical.ac.in
Website : http://www.isical.ac.in/~rcbangla/

◗ Indian Institute of Technology
Department of Computer Science & Engineering
Mumbai - 400 076
(RCILTS for Marathi & Konkani)
Prof. Pushpak Bhattacharya
Tel : 00-91-22-25767718, 25722545 Extn 5479, 25721955
E-Mail : pb@cse.iitb.ernet.in
Website : www.cfilt.iitb.ac.in

◗ University of Hyderabad
Dept. of CIS
Hyderabad - 500046
(RCILTS for Telugu)
Prof. K. Narayan Murthy
Tel : 00-91-40-23100500, 23100518 Extn 4017, 23010374
E-Mail : nmcs@uohyd.ernet.in
Website : http://www.languagetechnologies.ac.in/

◗ Jawaharlal Nehru University
School of Computer and Systems Sciences
New Mehrauli Road, New Delhi - 110 067
(RCILTS for Sanskrit, Chinese & Japanese)
Prof. G.V. Singh
Tel : 00-91-11-26107676, 26101885
E-Mail : gvs10@hotmail.com
Website : http://www.jnu.ac.in

◗ Thapar Institute of Engineering & Technology
Department of Computer Science & Engineering
Patiala - 147 001
(RCILTS for Gurmukhi)
Prof. R.K. Sharma
Tel : 00-91-175-2393137/393374, 2283502
E-Mail : rksharma@mail.tiet.ac.in
Website : http://punjabirc.tiet.ac.in

◗ ER & DCI, Vellayambalam
Thiruvananthapuram - 695 033
(RCILTS for Malayalam)
Prof. Ravindra Kumar
Tel : 00-91-471-2723333, 2725897, 2726718
E-Mail : ravi@erdcitvm.org
Website : http://www.malayalamresourcecentre.org/index.jsp

◗ Anna University
School of Computer Science & Engineering
Chennai - 600 025
(RCILTS for Tamil)
Dr. T.V. Geetha / Ms. Ranjani Parthasarthi
Tel : 00-91-44-22351723, 22350397 Extn 3342/3347, 24422620, 2423557
E-Mail : rp@annauniv.edu
Website : http://ns.annauniv.edu/rctamil/html/eindex.htm

◗ M.S. University of Baroda
Department of Gujarati, Faculty of Arts
Baroda - 390 002
(RCILTS for Gujarati)
Shri Sitanshu Y. Mehta
Tel : 00-91-265-2792959
E-Mail : rciltg@satyam.net.in
Website : http://msubaroda.ac.in/rciltg/

◗ Orissa Computer Application Centre
OCAC Building, Plot No. 1/7-D,
Acharya Vihar Square, RP-PO, Bhubaneswar - 751 013
(RCILTS for Oriya)
Shri S.K. Tripathi
Tel : 00-91-674-2582484, 2585851, 2554230(R), 2582490
E-Mail : saroj@ocac.ernet.in
Website : http://www.ilts-utkal.org

◗ Utkal University
Department of Computer Science & Application
Vani Vihar, Bhubaneswar - 751 004
(RCILTS for Oriya)
Prof. (Ms) Sanghmitra Mohanty
Tel : 00-91-674-2585518, 254086
E-Mail : sangham1@rediffmail.com
Website : http://www.ilts-utkal.org

◗ Indian Institute of Science
Centre for Electronics Design and Technology (CEDT)
Bangalore - 560 012
(RCILTS for Kannada)
Prof. N.J. Rao
Tel : 00-91-80-3466022, 3942378, 3410764
E-Mail : jrao@mgmt.iisc.ernet.in, chairman@mgmt.iisc.ernet.in

◗ Indian Institute of Technology
Department of Computer Science & Engineering
Panbazar, North Guwahati, Guwahati - 781 031, Assam
(RCILTS for Assamese & Manipuri)
Prof. Gautam Barua
Tel : 00-91-361-2690401, 2690325-28 Extn 2001, 2452088
E-Mail : gb@iitg.ernet.in, g_barua@yahoo.com
Website : http://www.iitg.ernet.in/rcilts

◗ Centre for Development of Advanced Computing
Pune University Campus, Ganesh Khind Road
Pune - 411 007
(RCILTS for Urdu, Sindhi & Kashmiri)
Shri M.D. Kulkarni
Tel : 00-91-20-25694000, 25694002-09
E-Mail : mdk@cdac.ernet.in
Website : http://parc.cdacindia.com
List of CoIL-Net Centers

◗ Indian Institute of Information Technology and Management
Gwalior, Madhya Pradesh
(IT localization solutions for Madhya Pradesh)
Prof. D.P. Agrawal
Tel : 00-91-751-2449701
E-Mail : prof_dpa@hotmail.com

◗ Birla Institute of Technology
Mesra, Ranchi, Jharkhand
(IT localization solutions for Jharkhand)
Dr. P.K. Mahanti
Tel : 00-91-651-2275333
E-Mail : deptcom@bitsmart.com

◗ Banasthali Vidyapith
Banasthali, Rajasthan
(IT localization solutions for Rajasthan)
Dr. Aditya Shastri
Tel : 00-91-1438-228647/28787
E-Mail : shastri@bv.ernet.in

◗ Banaras Hindu University
Institute of Technology, Uttar Pradesh
(IT localization solutions for Uttar Pradesh)
Dr. K.K. Shukla
Tel : 00-91-542-2307055/56
E-Mail : shukla@ieee.org

◗ New Government Polytechnic
Patliputra Colony, Patna, Bihar
(IT localization solutions for Bihar)
Dr. R.S. Singh
Tel : 00-91-612-2262866/700
E-Mail : polypat@bih.nic.in

◗ School of IT, G.G. University
Bilaspur, Chhatisgarh
(IT localization solutions for Chhatisgarh)
Dr. Anurag Shrivastva
Tel : 00-91-7752-272541
E-Mail : sanurag_cmt@yahoo.co.in

◗ Indian Institute of Technology
Roorkee, Uttaranchal
(IT localization solutions for Uttaranchal)
Dr. R.C. Joshi
Tel : 00-91-1332-285650
E-Mail : joshifcc@iitr.ernet.in

◗ Indian Institute of Technology
Kanpur, Uttar Pradesh
(Hindi to English Machine-aided Translation System based on Anubharati Approach)
Prof. R.M.K. Sinha
Tel : 00-91-512-2597174
E-Mail : rmk@iitk.ac.in

◗ Indira Gandhi National Centre for the Arts
New Delhi
(Development of Digital Library for Regional Heritage)
Prof. N.R. Setty
Tel : 00-91-11-23385277
E-Mail : msignca@yahoo.com

◗ Institution of Electronics and Telecommunication Engineering
New Delhi
(IT Learning Material in Hindi)
Wg Cdr (Retd) Dr. M.L. Bala
Tel : 00-91-11-4631810
E-Mail : ietend@giasdl01.vsnl.net.in

◗ Centre for Development of Advanced Computing
Pune
(Core Technology Development for Hindi)
Shri Mahesh D. Kulkarni
Tel : 00-91-20-25694000, 25694002-09
E-Mail : mdk@cdac.ernet.in

Other Centers

◗ Centre for Development of Advanced Computing
"Anusandhan Bhawan", C-56/1, Institutional Area, Sector-62
Noida - 201301
(Development of Parallel Text Corpora for 12 Indian Languages & Annotated Speech Corpora for Hindi, Marathi & Punjabi)
Shri V.N. Shukla
Tel : 00-91-95120-2402551-6
E-Mail : vnshukla@cdacnoida.com

◗ Centre for Development of Advanced Computing
Plot E2/1, Block GP, Sector V, Salt Lake
Kolkata - 700091, West Bengal
(Development of Annotated Speech Corpora for Bengali, Assamese & Manipuri)
Dr. A.B. Saha
Tel : 00-91-33-23579846, 23575989
E-Mail : amiya.saha@erdcical.org
Quick Reference to Previous Issues

Contents Jan. 2001, माघ
1. Message of Hon'ble Minister Sh. Pramod Mahajan 1-L
2. TDIL Programme 1-R
3. Overcoming the Language Barrier 2-L
4. Achievements 2-L
5. Resource Centres for Language Technology Solutions 3-R
6. Potential Products & Services 4-L
7. TDIL Website 4-L
8. Implementation Strategy 4-L
9. International Programmes in Multilingual Computing 4-R
10. MNC Products Supporting Indian Languages 4-R
11. Tentative List of Indian Language Products 5-L
12. Portals Supporting Indian Languages 7-L
13. Other Efforts 7-L
14. Major Events of Year 2000 7-R
15. Indian Language Technology Vision 2010 9-L

Contents May 2001, ज्येष्ठ
1. Calendar of Events 1-L
2. TDIL Website 2-L
3. TDIL Meet 2001 4-L
4. UNESCO Expert Group on Multilingualism in Cyberspace 9-L
5. Lexical Resources for Natural Language Processing 10-L
6. Symposium on Translation Support System 10-L
7. Universal Networking Language 10-R
8. Indo-UK Workshop on Language Engg. for South Asian Languages 11-R
9. New Software Testing Facility 12-R
10. Now Domain Name in Regional Languages 12-R
11. Book Shelf 13-L
12. Resource Center(s) for Indian Language Technology Solutions: 1st Year (2000-2001) Progress 13-L
13. Impediments in IT Localization & Penetration 15-L
14. Feedback on UNICODE Standard 3.0 15-R
15. Indian Language IT Market 21-L
16. MS Office XP with Indian Language Support 21-L
17. Call for Technologies 21-R

Contents Sept. 2001, आश्विन
Special Issue on Language Technology Business Meet
Patron's Message 1-1
Programme Schedule 1-1
I. Machine Aided Translation (MAT) 2-3
II. Operating System (OS) 3-3
III. Human Machine Interface System (HUMIS) 4-16
IV. Tools 16-29
V. e-Content 30-34
VI. Other Milestones 37-38
Quick Reference Guide 39-39

Contents January 2002, माघ
1. Calendar of Events-Year 2002 1L-1R
2. Reports
2.1 TDIL Vision 2010 2L-3R
2.2 LT-Business Meet & TOT 2001 4L-16L
2.3 Intellectual Property Rights (IPR) 16R-16R
2.4 UNESCO Symposium on Language in Cyber Space 17L-19R
2.5 SCALLA 2001 - Sharing Capability in Localisation & Human Language Technologies 20L-20R
2.6 UNESCO Workshop on Medium Term Strategy for Communications & Information 21L-22L
2.7 The Asia Pacific Development Information Programme (APDIP) 22R-23L
2.8 Workshop on Corpus-based Natural Language Processing 23R-23R
2.9 1st Workshop on Indian Language OCR 24L-24R
2.10 1st International Conference on Global WordNet 25L-25R
3. Standardization
3.1 Revision of Unicode Standard 3.0 for Devanagari Script 26L-37R
3.2 Design Guides (Sanskrit, Hindi, Marathi, Konkani, Sindhi, Nepali) 38L-75R
3.3 Indian Standard Font Code (INSFOC) 76L-77R
3.4 Indian Standard Lexware Format 78L-86R
4.1 Reader's Feedback 87L-87R
4.2 Frequently Asked Questions 88L-91R

Contents April 2002, चैत्र
1. tdil@Elitex 2002 1L-2L
1.1 TDIL Programme 2R-3L
2. Readers' Feedback from Abroad 3R-3R
3. Calendar of Events-Year 2002 4L-4L
4. Standardization 4R-4R
4.1 Gujarati Code Chart 6L-24R
(4.1.1 Gujarati Code Chart Details; 4.1.2 Gujarati Script Details; 4.1.3 Typical Colloquial Sentences)
4.2 Malayalam Code Chart 26L-44R
(4.2.1 Malayalam Code Chart Details; 4.2.2 Malayalam Script Details; 4.2.3 Typical Colloquial Sentences)
4.3 Oriya Code Chart 46L-62R
(4.3.1 Oriya Code Chart Details; 4.3.2 Oriya Script Details; 4.3.3 Typical Colloquial Sentences)
4.4 Gurmukhi Code Chart 64L-82R
(4.4.1 Gurmukhi Code Chart Details; 4.4.2 Gurmukhi Script Details; 4.4.3 Typical Colloquial Sentences)
4.5 Telugu Code Chart 84L-101R
(4.5.1 Telugu Code Chart Details; 4.5.2 Telugu Script Details; 4.5.3 Typical Colloquial Sentences)
5. Quick Reference to Previous Issues 102L-102R

Contents July 2002, आषाढ़
1. Calendar of Events-Year 2002 1L-1R
2. TDIL Vision 2L-2R
3. Reader's Feedback 3L-3L
4. Universal Digital Library 3R-15R
5. Indo-European IEMCT Conference 16L-17R
6. Indian Language Spell Checker Design Workshop 18L-24R
7. Indian Standard Font Code (Devanagari, Gujarati, Punjabi, Malayalam) 25L-30R
8. INSROT Revision (Indian Script to Roman Transliteration) 31L-32R
9. Unicode Standardization
9.1 Bangla Code Chart 34L-50R
(9.1.1 Bangla Code Chart Details; 9.1.2 Bangla Script Details; 9.1.3 Typical Colloquial Sentences)
9.2 Assamese Language Details 52L-60R
(9.2.1 Typical Colloquial Sentences)
9.3 Manipuri Language Details 62L-67R
(9.3.1 Typical Colloquial Sentences)
10. Quick Reference to Previous Issues 68L-68R

Contents October 2002, कार्तिक
1. Calendar of Events-Year 2002-03 1L-1L
2. Reader's Feedback 1R-2R
3. TDIL Vision 3L-3R
4. Technology Watch 4L-4L
5. MAT Evaluation & Benchmarking 5L-16R
6. OCR Evaluation & Benchmarking 17L-21R
7. Transliteration Tables (Punjabi to Urdu & Hindi to Urdu) 22L-26L
8. Unicode Standardization
8.1 Kannada Code Chart 28L-40R
(8.1.1 Kannada Code Chart Details; 8.1.2 Kannada Script Details; 8.1.3 Typical Colloquial Sentences)
8.2 Tamil Code Chart 42L-56R
(8.2.1 Tamil Code Chart Details; 8.2.2 Tamil Script Details; 8.2.3 Typical Colloquial Sentences)
8.3 Perso-Arabic Standard for Information Interchange 58L-73R
(8.3.1 Urdu Design Guide; 8.3.2 Typical Colloquial Sentences in Urdu; 8.3.3 Sindhi Design Guide 74L-79R; 8.3.4 Typical Colloquial Sentences in Sindhi; 8.3.5 Kashmiri Design Guide 80L-84R; 8.3.6 Typical Colloquial Sentences in Kashmiri)
8.4 Vedic Code Set 86L-103R
9. Quick Reference to Previous Issues 104L-104R

Contents January 2003, माघ
1. Calendar of Events-Year 2003 1L-1R
2. Reader's Feedback 2L-2R
3. TDIL Vision 3L-3R
4. Conference Reports 4L-12R
5. Language Technology Papers - Abstracts
5.1 Human Machine Interface System (HUMIS) 14L-38R
(a) Integrated I/O Environment; (b) Optical Character Recognition; (c) Speech Recognition; (d) Speech Synthesis; (e) Speech Processing; (f) Typesetting
5.2 Knowledge Resources (KR) 42L-44R
(a) Corpora; (b) Dictionary; (c) Lexical Resources
5.3 Knowledge Tools (KT) 46L-58R
(a) Lexical Tools; (b) Utilities
5.4 Language Engineering (LE) 60L-64R
(a) Grammar; (b) Language & Script Analysis
5.5 Localisation (L) 66L-66R
(a) System Software
5.6 Translation Support Systems (TSS) 68L-72R
(a) Machine Translation; (b) Universal Networking Language; (c) Wordnet
5.7 Standardisation 74L-74R
(a) Existing Standards; (b) Draft Standards
Index 76L-89R
(a) Paper Abstracts; (b) Addresses
Quick Reference to Previous Issues 90L-90R
TDIL PROGRAMME
Ministry of Communications & Information Technology
Department of Information Technology
Electronics Niketan, 6, CGO Complex, New Delhi-110003
Telefax : 011-2436 3076 E-mail : tdilinfo@mit.gov.in Website : http://tdil.mit.gov.in