
Scripts, Segmentation and OCR II
Nepali OCR and Bangla Collaboration
Bal Krishna Bal
Project Manager
PAN Localization Project
Madan Puraskar Pustakalaya, Nepal
URL : www.madanpuraskar.org
Email: bal@mpp.org.np

Regional Conference on Localized ICT Development and Dissemination across Asia. PAN Localization
Project. 12th-16th January, 2009, Novotel Hotel, Vientiane, Laos
Contents
• Devanagari script and Nepali written language.
• Segmentation problems in Nepali.
• Collaboration between Nepali and Bangla OCR.
• High Level System Architecture of the Nepali OCR.
• Discussion on the current achievements and the problems being faced.
• Future plans

Devanagari script and written Nepali
• The Devanagari alphabet has its roots in the Brahmi script.
• Originally developed to write Sanskrit but later adapted by many other languages.
• The name “Devanagari” consists of two Sanskrit words, “Deva” and “Nagari”, meaning “God” and “city” respectively.
• It consists of 11 vowels, 33 consonants and 12 modifiers.
• The direction of writing is from left to right.
• There is no distinction between upper-case and lower-case characters.

Devanagari script and written Nepali…
• Modifiers attach to the top, bottom, left or right side of other characters.
• All characters of a word are connected by a horizontal line called a “Dika/Headline/Matra”.
[Figures: vowels; modifiers; consonant characters; modifiers attached to characters; an example of combined characters; the three text zones]

Segmentation problems of Devanagari/Nepali characters

• Segmentation problems with printed Devanagari/Nepali characters are mainly due to:
– Variability of character size and inter-character spacing (font and size issue);
– Confusion between inter-character and within-character space;
– Touching between characters.

• Hence segmentation errors for Nepali characters can be broadly categorized as:
– Splitting error: after the removal of the “dika”, the original character can seem to be two characters rather than one;
– Joining error due to conjuncts and modifiers: after the removal of the “dika”, two different characters can seem to be a single character.
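
The two error types can be reproduced on a toy bitmap. The following is a minimal sketch, not the project's code: it assumes the dika is simply the row with the most ink, blanks that row, and counts 4-connected components. The bitmaps and all function names are hypothetical.

```python
from collections import deque

def to_grid(rows):
    """Convert strings of '#' (ink) and '.' (background) into a 0/1 grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def find_headline_row(grid):
    """Assumption: the dika is the row containing the most ink pixels."""
    return max(range(len(grid)), key=lambda r: sum(grid[r]))

def remove_headline(grid):
    """Return a copy of the grid with the headline row blanked out."""
    head = find_headline_row(grid)
    return [[0] * len(row) if r == head else list(row)
            for r, row in enumerate(grid)]

def count_components(grid):
    """Count 4-connected ink components with a breadth-first search."""
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if grid[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# One glyph whose two strokes are joined only by the headline.
SPLIT = ["#######",
         ".#...#.",
         ".#...#.",
         ".#...#."]

# Two glyphs that also touch below the headline.
JOIN = ["#########",
        "..#...#..",
        "..#####..",
        "..#...#.."]

print(count_components(remove_headline(to_grid(SPLIT))))  # 2: splitting error
print(count_components(remove_headline(to_grid(JOIN))))   # 1: joining error
```

In the first bitmap a single character falls apart into two components; in the second, two characters remain one component because they touch below the headline.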

Nepali – Bangla collaboration

• Similarity of the Devanagari and Bangla scripts.
• Many issues are in common, for instance segmentation.
• Sharing of documentation and source code by the Bangladeshi team with the Nepali team.
• A two-week consultation visit was made by a Nepali team representative to the Center for Research in Bangla Language Processing (CRBLP), BRAC University, in 2008.
• Efforts have been made to see whether the Bangla OCR technology can be replicated for Nepali with some modifications.

High Level System Architecture of the Nepali OCR
Preprocessor
This module enhances the quality of the image. It comprises
submodules such as noise removal, binarization and skew
correction.
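
As an illustration of the binarization step, here is a minimal sketch using Otsu's global threshold. The slides do not say which binarization method the Nepali OCR uses, so Otsu's method is an assumption chosen for brevity; the function names and the sample pixel values are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * n for level, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

def binarize(gray):
    """Map a greyscale grid to 1 (ink) / 0 (paper), assuming dark ink on light paper."""
    t = otsu_threshold([p for row in gray for p in row])
    return [[1 if p <= t else 0 for p in row] for row in gray]

# Hypothetical 2x3 greyscale patch: low values are ink, high values are paper.
GRAY = [[220, 30, 200],
        [20, 210, 25]]
print(binarize(GRAY))  # [[0, 1, 0], [1, 0, 1]]
```

A global threshold like this is only reasonable for evenly lit scans; noise removal and skew correction are separate steps not shown here.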
Segmentation
The segmentation step is carried out in three stages:
• Segmentation of the text into lines using interline spacing.
• Segmentation of each text line into words using the vertical
white space between them.
• Segmentation of words into characters, which is the most
complex stage. For character segmentation, the headline is
removed first, after which the characters are segmented on
the basis of their connectivity.
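
The first two stages can be sketched with projection profiles: rows without ink separate lines, and columns without ink separate words (the headline keeps the characters of one word connected). This is a minimal sketch assuming a clean, deskewed binary image; the function names and the toy page are hypothetical, and the third stage is not shown.

```python
def runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

def segment_lines(grid):
    """Split the page into text lines using the horizontal projection profile."""
    return runs([sum(row) for row in grid])

def segment_words(grid, top, bottom):
    """Split one text line (rows top..bottom-1) into words using the vertical profile."""
    width = len(grid[0])
    profile = [sum(grid[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs(profile)

# Toy page: two text lines, the first with two words.
PAGE = ["..........",
        "####..###.",
        ".#.#..#.#.",
        "..........",
        "..........",
        "#####.....",
        "#.#.#.....",
        ".........."]
GRID = [[1 if ch == "#" else 0 for ch in row] for row in PAGE]

for top, bottom in segment_lines(GRID):
    print((top, bottom), segment_words(GRID, top, bottom))
# (1, 3) [(0, 4), (6, 9)]
# (5, 7) [(0, 5)]
```

On real scans a small tolerance (for example, treating near-empty rows or columns as gaps) would probably be needed, since noise rarely leaves profile entries at exactly zero.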

High Level System Architecture of the Nepali OCR…

Feature extraction
The extracted features of each character image are stored in a specific file format to
be used later in the training and recognition stages.
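
The slides name neither the features nor the file format. Since the classifier is built on HTK, the sketch below assumes the features are written in HTK's binary parameter-file layout (a 12-byte big-endian header followed by 4-byte floats, with parameter kind USER = 9). The feature itself, ink density in the upper and lower half of a sliding window of columns, is purely hypothetical, as are the function names.

```python
import io
import struct

HTK_USER = 9  # HTK parameter-kind code for user-defined feature vectors

def frame_features(grid, width=2):
    """Hypothetical feature: [upper, lower] ink fraction per window of `width` columns.

    Assumes the glyph image is at least two rows high.
    """
    height = len(grid)
    half = height // 2
    frames = []
    for left in range(0, len(grid[0]) - width + 1, width):
        cols = range(left, left + width)
        upper = sum(grid[r][c] for r in range(half) for c in cols)
        lower = sum(grid[r][c] for r in range(half, height) for c in cols)
        frames.append([upper / (half * width),
                       lower / ((height - half) * width)])
    return frames

def write_htk(stream, frames, samp_period=100000):
    """Write frames with an HTK header: nSamples, sampPeriod, sampSize, parmKind.

    sampPeriod has no real meaning for images; the value is a placeholder.
    """
    dim = len(frames[0])
    stream.write(struct.pack(">iihh", len(frames), samp_period, 4 * dim, HTK_USER))
    for frame in frames:
        stream.write(struct.pack(">%df" % dim, *frame))

def read_htk(stream):
    """Read back the parameter kind and the frames written by write_htk."""
    n, _period, size, kind = struct.unpack(">iihh", stream.read(12))
    dim = size // 4
    frames = [list(struct.unpack(">%df" % dim, stream.read(size)))
              for _ in range(n)]
    return kind, frames

# Toy 4x4 glyph image (1 = ink).
GLYPH = [[1, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

buffer = io.BytesIO()
write_htk(buffer, frame_features(GLYPH))
buffer.seek(0)
print(read_htk(buffer))  # (9, [[0.75, 0.0], [0.0, 1.0]])
```

Treating each window of columns as one "frame" is what lets a sequence model such as an HMM read a character image from left to right.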

Training and recognition


• The entire possible character set of the adopted font should be trained, including
the complex characters.
• For the classifier, we use the Hidden Markov Model (HMM) approach
as implemented in the HTK Toolkit.
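
To show how HMM-based recognition picks a character, here is a minimal sketch of the forward algorithm over discrete observations: each character has its own model, and the model with the highest likelihood for the observed sequence wins. This is a toy in plain Python, not HTK itself; the two models, their probabilities and the labels "ka" and "ga" are hypothetical.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM (scaled forward algorithm)."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")   # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]
        log_like += math.log(scale)
    return log_like

def classify(obs, models):
    """Pick the model name with the highest likelihood for the observations."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))

# Two-state left-to-right models; observation 0 = light column, 1 = dark column.
START = [1.0, 0.0]
TRANS = [[0.5, 0.5],
         [0.0, 1.0]]
MODELS = {
    "ka": (START, TRANS, [[0.9, 0.1], [0.1, 0.9]]),  # light columns, then dark
    "ga": (START, TRANS, [[0.1, 0.9], [0.9, 0.1]]),  # dark columns, then light
}

print(classify([0, 0, 1, 1], MODELS))  # ka
print(classify([1, 1, 0, 0], MODELS))  # ga
```

This also illustrates why untrained glyphs cannot be recognized: a character without its own trained model can only be mapped to the nearest existing one.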

Discussion on the current achievements and the
problems faced

Multi-factorial analysis
We have attempted to resolve joining errors using the
multi-factorial analysis technique, which uses fuzzy
factors such as degree of similarity, thickness, middleness
and cross counts. Splitting errors would be handled by
hand-crafted rules in the post-processor
module.
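
The general idea of combining fuzzy factors can be sketched as a weighted score for each candidate cut column of a touching-character blob. The slides do not define the factors, so the membership functions, the weights and the function names below are all assumptions; "degree of similarity" is omitted because nothing in the source says how it is computed.

```python
def column_runs(grid, c):
    """Cross count: number of separate vertical ink runs in column c."""
    count, previous = 0, 0
    for row in grid:
        if row[c] and not previous:
            count += 1
        previous = row[c]
    return count

def cut_score(grid, c, weights=(0.5, 0.3, 0.2)):
    """Weighted average of three fuzzy memberships in [0, 1] for cutting at column c."""
    height, width = len(grid), len(grid[0])
    ink = sum(row[c] for row in grid)
    thinness = 1.0 - ink / height                    # thin strokes are easier to cut
    centre = (width - 1) / 2
    middleness = 1.0 - abs(c - centre) / centre if centre else 1.0
    crossings = column_runs(grid, c)
    simplicity = 1.0 / crossings if crossings else 1.0  # fewer crossings is better
    w_thin, w_mid, w_cross = weights                 # hypothetical weights
    return w_thin * thinness + w_mid * middleness + w_cross * simplicity

def best_cut(grid):
    """Return the column with the highest combined score."""
    return max(range(len(grid[0])), key=lambda c: cut_score(grid, c))

# Two thick strokes joined by a one-pixel bridge (headline already removed).
BLOB = ["###.###",
        "###.###",
        "#######",
        "###.###"]
GRID = [[1 if ch == "#" else 0 for ch in row] for row in BLOB]

print(best_cut(GRID))  # 3
```

Here the bridge column wins because it is thin, central and crossed only once; a real system would tune the factors and weights on labelled data.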

The multi-factorial analysis was found to be quite
accurate in segmenting not just basic characters but
also touching characters, but it has the overhead of
having to address the structural ambiguity of the
segmented characters during training.
[Figure: Segmentation results with the multi-factorial analysis method and removal of the upper modifiers]

Discussion on the current achievements and problems
faced…

[Figures: original image source; original image source after segmentation; OCRed text]
Some observations:
• The OCR is being tested on real scanned and filmed documents.
• Currently the output is acceptable for clean scanned text in the Preeti font at 28 pt font
size.
• Extensive training is required to get acceptable output.
• The system does not handle untrained glyphs.
• Post-processing modules could enhance the output text.

Future plans
• Continue further work on the current system.
A beta release is scheduled for March 2009.
• Explore Google’s Tesseract OCR platform
for Nepali.
• Work towards incorporating the OCR system into
useful utilities such as PDF search for Nepali.

Acknowledgment
This work was carried out with the aid of a grant
from the Language Resource Association (GSK)
of Japan and the International Development
Research Centre (IDRC), Ottawa, Canada,
administered through the Centre for Research in
Urdu Language Processing (CRULP), National
University of Computer and Emerging Sciences
(NUCES), Pakistan.

Thank You!!
