Scripts, Segmentation and OCR II Nepali OCR and Bangla Collaboration

Bal Krishna Bal
Project Manager
PAN Localization Project
Madan Puraskar Pustakalaya, Nepal
URL : www.madanpuraskar.org
Email: bal@mpp.org.np
Regional Conference on Localized ICT Development and Dissemination across Asia. PAN Localization
Project. 12th-16th January, 2009, Novotel Hotel, Vientiane, Laos
Contents
• Devanagari script and Nepali written language.
• Segmentation problems in Nepali.
• Collaboration between Nepali and Bangla OCR.
• High Level System Architecture of the Nepali OCR.
• Discussion on the current achievements and the problems being faced.
• Future plans
Devanagari script and the written Nepali
• The Devanagari alphabet has its roots in the Brahmi
script.
• Originally developed to write Sanskrit but later adapted
by many other languages.
• The name “Devanagari” consists of two Sanskrit words,
“Deva” and “Nagari”, meaning “God” and “city”
respectively.
• It consists of 11 vowels, 33 consonants and 12
modifiers.
• The direction of writing is from left to right.
• There is no distinction between upper-case and
lower-case characters.
Devanagari script and written Nepali…
• Modifiers attach to the top, bottom, left or right side of other characters.
• All characters of a word are connected by a horizontal line called a
“Dika/Headline/Matra”.
[Figures: Vowels; Modifiers; Modifiers attached to characters; Consonant characters;
An example of combined characters; Three text zones]
Segmentation problems of Devanagari/Nepali characters
• Segmentation problems with printed Devanagari/Nepali characters are mainly due to:
– Variability of character size and inter-character spacing (font and size issue);
– Confusion between inter-character and within-character space;
– Touching between characters.
• Hence segmentation errors for Nepali characters can be broadly categorized as:
– Splitting error: after the removal of the “dika”, a single character can seem to be
two characters rather than one;
– Joining error due to conjuncts and modifiers: after the removal of the “dika”,
two different characters can seem to be a single character.
Nepali – Bangla collaboration
• Similarity of the Devanagari and Bangla scripts.
• Many issues are in common, for instance segmentation.
• Sharing of documentation and source code by the Bangladeshi team with
the Nepali team.
• A two-week consultation visit was made by a Nepal team representative to the
Center for Research in Bangla Language Processing (CRBLP), BRAC
University, in 2008.
• Efforts have been made to see if the Bangla OCR technology can be
replicated for Nepali with some modifications.
High Level System Architecture of the Nepali OCR
Preprocessor
This module enhances the quality of the image.
It comprises sub-modules such as Noise Removal,
Binarization and Skew Correction.
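As an illustration of the binarization step, the sketch below applies Otsu's global threshold to a small grayscale image held as nested lists. The slides do not say which binarization method the Nepali OCR uses, so the choice of Otsu's method, the function names and the toy image are all assumptions made for this example.

```python
def otsu_threshold(gray):
    """Return the threshold t (0-255) that maximises the between-class variance.

    Pixels with value <= t form the dark class (ink); the rest are background.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue          # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map a grayscale image to 1 (ink) / 0 (background) using Otsu's threshold."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]


# Toy 3x3 image (assumed data): dark strokes on a light background.
GRAY = [
    [250, 240, 30],
    [245, 20, 25],
    [255, 10, 235],
]
BINARY = binarize(GRAY)
```

The later segmentation stages can then work on the 0/1 image rather than on raw gray levels.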
Segmentation
The segmentation step is carried out in three
stages:
• Segmentation of text into lines
using interline spacing.
• Segmentation of each line
into words using the vertical
white space between them.
• Segmentation of words into
characters, which is the most
complex stage. For character
segmentation, the headline is
removed first, after which
the characters are segmented on
the basis of their connectivity.
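The three stages above can be sketched with projection profiles on a binary image (1 = ink). This is a minimal sketch, not the project's code: the function names and the toy page are hypothetical, the headline is assumed to be a single pixel row (the row with the most ink), and splitting on blank columns after headline removal stands in for the full connectivity analysis mentioned on the slide.

```python
def runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries; end is exclusive."""
    out, start = [], None
    for i, v in enumerate(profile):
        if v and start is None:
            start = i
        elif not v and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(profile)))
    return out


def column_profile(img, top, bottom, left, right, skip_row=None):
    """Ink count per column inside a box, optionally ignoring one row."""
    return [
        sum(img[r][c] for r in range(top, bottom) if r != skip_row)
        for c in range(left, right)
    ]


def segment_lines(img):
    """Stage 1: text lines are runs of rows that contain ink."""
    return runs([sum(row) for row in img])


def segment_words(img, top, bottom):
    """Stage 2: words are runs of inked columns; the headline joins a word's characters."""
    return runs(column_profile(img, top, bottom, 0, len(img[0])))


def segment_characters(img, top, bottom, left, right):
    """Stage 3: drop the headline row, then split the word on blank columns."""
    headline = max(range(top, bottom), key=lambda r: sum(img[r][left:right]))
    profile = column_profile(img, top, bottom, left, right, skip_row=headline)
    return [(left + a, left + b) for a, b in runs(profile)]


# Toy page (assumed data): one text line holding a 3-character and a 2-character word.
PAGE = [
    "..........",
    "#####.###.",  # headline row
    "#.#.#.#.#.",
    "#.#.#.#.#.",
    "..........",
]
IMG = [[1 if ch == "#" else 0 for ch in row] for row in PAGE]

LINES = segment_lines(IMG)
WORDS = [segment_words(IMG, top, bottom) for top, bottom in LINES]
CHARS = [
    segment_characters(IMG, top, bottom, left, right)
    for (top, bottom), words in zip(LINES, WORDS)
    for left, right in words
]
```

A character made of two strokes joined only by the headline would come out of stage 3 as two pieces, which is the kind of splitting error a real segmenter has to repair.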
High Level System Architecture of the Nepali OCR…
Feature extraction
The extracted features of each character image are stored in a specific file format to
be used later in the training and recognition stages.
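The slides do not name the features or the file format. Purely as an assumption for illustration, the sketch below computes one feature vector per pixel column (the ink fraction in three horizontal bands) and packs the vectors in the parameter-file layout described in the HTK Book: a 12-byte big-endian header (number of samples, sample period, bytes per sample, parameter kind) followed by 4-byte floats, using the user-defined kind code 9. The function names and the sample period value are hypothetical.

```python
import struct

HTK_USER = 9  # HTK parameter-kind code for user-defined features (assumed usage)


def column_features(glyph, zones=3):
    """One vector per pixel column: ink fraction in each of `zones` equal bands.

    Assumes the glyph height is divisible by `zones`.
    """
    height, width = len(glyph), len(glyph[0])
    band = height // zones
    return [
        [
            sum(glyph[r][c] for r in range(z * band, (z + 1) * band)) / band
            for z in range(zones)
        ]
        for c in range(width)
    ]


def to_htk_bytes(frames, samp_period=100000):
    """Pack feature frames as a 12-byte header followed by big-endian floats."""
    dim = len(frames[0])
    header = struct.pack(">iihh", len(frames), samp_period, 4 * dim, HTK_USER)
    body = b"".join(struct.pack(">%df" % dim, *frame) for frame in frames)
    return header + body


# Toy 6x2 glyph (assumed data), 1 = ink.
GLYPH = [
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
    [0, 0],
]
FRAMES = column_features(GLYPH)
DATA = to_htk_bytes(FRAMES)
```

Writing `DATA` to disk with `open(path, "wb")` would give one feature file per character image; the path and naming scheme are left to the caller.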
Training and recognition

• The entire possible character set of the adopted font should be trained, including
the complex characters.
• For the classifier, we have used the Hidden Markov Model (HMM) approach
as implemented in the HTK Toolkit.
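To show how HMM-based recognition picks a character, the sketch below scores an observation sequence against one small HMM per character with the Viterbi algorithm and returns the best-scoring label. It does not call HTK: the discrete two-symbol observations, the two-state left-to-right models, their probabilities and the labels "ka" and "ga" are all assumptions made for this example, whereas a trained system would estimate model parameters from feature files.

```python
import math


def log(x):
    """Natural log that maps 0 to -inf instead of raising an error."""
    return math.log(x) if x > 0 else float("-inf")


def viterbi_score(obs, start, trans, emit):
    """Log-probability of the single best state path for a discrete sequence."""
    n = len(start)
    score = [log(start[s]) + log(emit[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        score = [
            max(score[p] + log(trans[p][s]) for p in range(n)) + log(emit[s][o])
            for s in range(n)
        ]
    return max(score)


def recognise(obs, models):
    """Return the label of the model with the highest Viterbi score."""
    return max(models, key=lambda name: viterbi_score(obs, *models[name]))


# Hypothetical models: (start probabilities, transition matrix, emission matrix).
# Observation symbols: 0 = lightly inked column, 1 = heavily inked column.
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "ka": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.1, 0.9]]),
    "ga": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.9, 0.1]]),
}

LABEL = recognise([0, 0, 1, 1], MODELS)
```

Because a glyph that was never trained has no model of its own, this scheme can only ever return one of the trained labels.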
Discussion on the current achievements and the
problems faced
Multi-factorial analysis
We have attempted to resolve joining errors using the
multi-factorial analysis technique, which uses fuzzy
factors such as degree of similarity, thickness, middleness
and cross counts. Splitting errors are to be handled by
means of hand-crafted rules in the post-processor
module.
The multi-factorial analysis was found to be quite
accurate in segmenting not just basic characters but
also touching characters, but it has the overhead of
having to address the structural ambiguity of the
segmented characters during training.
[Figure: Segmentation results with the multi-factorial analysis method
and after removing the upper modifiers]
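A rough sketch of how fuzzy factors might be combined to choose a cut column inside a blob of touching characters is given below. It is not the project's method: the three factors used here (thinness, middleness and few crossings), their formulas and the weights are assumptions for illustration, and the degree-of-similarity factor named on the slide is left out.

```python
def crossings(column):
    """Number of separate vertical ink runs in a column of 0/1 pixels."""
    return sum(1 for i, v in enumerate(column) if v and (i == 0 or not column[i - 1]))


def cut_scores(blob, weights=(0.5, 0.3, 0.2)):
    """Score every column of a blob as a cut candidate; each factor lies in [0, 1]."""
    height, width = len(blob), len(blob[0])
    centre = (width - 1) / 2
    w_thin, w_mid, w_cross = weights
    scores = []
    for c in range(width):
        column = [blob[r][c] for r in range(height)]
        thinness = 1 - sum(column) / height              # little ink is easy to cut
        middleness = 1 - abs(c - centre) / centre if centre else 1.0
        n = crossings(column)
        few_crossings = 1.0 if n <= 1 else 1 / n         # one stroke crossed is best
        scores.append(w_thin * thinness + w_mid * middleness + w_cross * few_crossings)
    return scores


def best_cut(blob):
    """Column index with the highest combined score."""
    scores = cut_scores(blob)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy blob (assumed data): two box-shaped glyphs joined by a one-pixel bridge at column 3.
ROWS = [
    "###.###",
    "#.#.#.#",
    "#.###.#",
    "###.###",
]
BLOB = [[1 if ch == "#" else 0 for ch in row] for row in ROWS]
CUT = best_cut(BLOB)
```

In this toy case the bridge column scores highest because it is thin, central and crosses a single stroke; with real glyphs the weights would need tuning against segmented samples.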
Discussion on the current achievements and problems
faced…
[Figures: Original image source; Original image source after segmentation; OCRed text]
Some observations:
• The OCR is being tested on
real scanned and filmed documents.
• Currently the output is acceptable for clean
scanned text in the Preeti font at 28 pt font
size.
• Extensive training is required to get
acceptable output.
• The system does not handle untrained glyphs.
• Post-processing modules could enhance the
output text.
Future plans
• Continue further work on the current system.
A beta release is scheduled for March 2009.
• Explore Google’s Tesseract OCR platform
for Nepali.
• Work towards incorporating the OCR system into
useful utilities such as PDF search for Nepali.
Acknowledgment
This work was carried out with the aid of a grant
from the Language Resource Association (GSK)
of Japan and the International Development
Research Centre (IDRC), Ottawa, Canada,
administered through the Centre for Research in
Urdu Language Processing (CRULP), National
University of Computer and Emerging Sciences
(NUCES), Pakistan.
Thank You!!
