Scripts, Segmentation and OCR II Nepali OCR and Bangla Collaboration
II
Nepali OCR and Bangla
Collaboration
Bal Krishna Bal
Project Manager
PAN Localization Project
Madan Puraskar Pustakalaya, Nepal
URL : www.madanpuraskar.org
Email: bal@mpp.org.np
Regional Conference on Localized ICT Development and Dissemination across Asia. PAN Localization
Project. 12th-16th January, 2009, Novotel Hotel, Vientiane, Laos
Contents
• Devanagari script and Nepali written language.
• Segmentation problems in Nepali.
• Collaboration between Nepali and Bangla OCR.
• High Level System Architecture of the Nepali OCR.
• Discussion on the current achievements and the problems being faced.
• Future plans
Devanagari script and the written Nepali
• The Devanagari alphabet has its roots in the Brahmi
script.
• Originally developed to write Sanskrit, it was later
adapted for many other languages.
• The name “Devanagari” consists of two Sanskrit words,
“Deva” and “Nagari”, meaning “God” and “city”
respectively.
• It consists of 11 vowels, 33 consonants and 12
modifiers.
• The direction of writing is from left to right.
• There is no distinction between upper case and
lower case characters.
Devanagari script and written Nepali…
Vowels
Modifiers
Modifiers attach to the top, bottom, left or
right side of other characters.
Segmentation problems of Devanagari/Nepali characters
• Hence segmentation errors for Nepali characters can be broadly categorized as:
– Splitting error: after the removal of the “dika”, the original character can seem to be
two characters rather than one.
– Joining error due to conjuncts and modifiers.
Nepali – Bangla collaboration
University in 2008.
• Efforts have been made to see if the Bangla OCR technology can be
High Level System Architecture of the Nepali OCR
Preprocessor
This module enhances the quality of the image. It
involves sub-modules such as noise removal,
binarization and skew correction.
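The slides do not say which binarization method the preprocessor uses. As an illustration only, the sketch below assumes Otsu's global threshold and a grayscale image held as a list of rows of 0..255 integers; the function names are hypothetical and not part of the Nepali OCR code.

```python
def otsu_threshold(gray):
    """Return the grey level that maximises between-class variance (Otsu).

    Assumption: the slides name no binarization method; Otsu is used here
    purely as a common example.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(gray):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]


# Tiny example: dark strokes (20) on a light background (230).
sample = [
    [230, 230, 230],
    [230, 20, 230],
    [20, 20, 230],
]
binary = binarize(sample)
```

The 0/1 output (1 for ink) is only a convention chosen for this sketch; a real preprocessor would also need noise removal and skew correction before segmentation.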
Segmentation
The segmentation step is carried out in three
stages:
• Segmentation of text into lines
using interline spacing.
• Segmentation of each line of text
into words using the vertical white
space between them.
• Segmentation of words into
characters, which is the most
complex stage. For character
segmentation, the headline is
first removed, after which
the characters are segmented on
the basis of their connectivity.
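The three stages above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes a binarized image stored as rows of 0/1 values (1 for ink), treats the row with the most ink as the headline, and uses 4-connected components for the connectivity step. All function names are hypothetical.

```python
from collections import deque


def ink_runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def split_lines(img):
    """Stage 1: cut the page at blank rows (interline spacing)."""
    return [img[a:b] for a, b in ink_runs([sum(r) for r in img])]


def split_words(line):
    """Stage 2: cut a line at blank columns (white space between words)."""
    cols = [sum(c) for c in zip(*line)]
    return [[r[a:b] for r in line] for a, b in ink_runs(cols)]


def remove_headline(word):
    """Blank the row with the most ink (assumed here to be the headline)."""
    counts = [sum(r) for r in word]
    top = counts.index(max(counts))
    return [[0] * len(r) if y == top else list(r) for y, r in enumerate(word)]


def split_characters(word):
    """Stage 3: remove the headline, then return the column extent
    (start, end exclusive) of every 4-connected component."""
    body = remove_headline(word)
    h, w = len(body), len(body[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if body[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                x0 = x1 = x
                while queue:
                    cy, cx = queue.popleft()
                    x0, x1 = min(x0, cx), max(x1, cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and body[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((x0, x1 + 1))
    return sorted(boxes)


# Toy page: two lines; the first line holds two words joined by headlines.
PAGE = [[int(c) for c in row] for row in (
    "0000000000",
    "1111101110",
    "1000100100",
    "1000100100",
    "0000000000",
    "1110000000",
    "0100000000",
)]
lines = split_lines(PAGE)
words = split_words(lines[0])
chars = split_characters(words[0])
```

Picking the single densest row as the headline is a simplification; a printed headline is usually several pixels thick, and removing it is exactly where the splitting errors discussed earlier can arise.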
High Level System Architecture of the Nepali OCR…
Feature extraction
The extracted features of each character image are stored in a specific file format to
be used later in the training and recognition stages.
Discussion on the current achievements and the
problems faced
Multi-factorial analysis
Attempts have been made to resolve joining errors using the
multi-factorial analysis technique, which uses fuzzy
factors such as degree of similarity, thickness, middleness
and cross counts. Splitting errors would be handled by
means of hand-crafted rules in the post-processor
module.
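To make the idea of combining fuzzy factors concrete, the sketch below scores every column of a joined blob as a possible cut point and picks the best one. It is an assumption-laden illustration: the slides do not define the factors or how they are combined, so the scoring formulas, the weighted-mean combination and the weights are all hypothetical, and the "degree of similarity" factor is left out because it would need reference character shapes.

```python
def column(img, x):
    """Return column x of a 0/1 image stored as a list of rows."""
    return [row[x] for row in img]


def thickness_score(col):
    """Thin strokes are better cut points: 1.0 means no ink in the column."""
    return 1.0 - sum(col) / len(col)


def middleness_score(x, width):
    """Columns near the horizontal centre of the blob score closer to 1.0."""
    centre = (width - 1) / 2
    return 1.0 - abs(x - centre) / centre if centre else 1.0


def crossing_score(col):
    """Fewer separate ink runs crossed by the cut give a higher score."""
    runs = sum(1 for i, v in enumerate(col) if v and (i == 0 or not col[i - 1]))
    return 1.0 / (1 + runs)


def best_cut(img, weights=(0.5, 0.3, 0.2)):
    """Combine the factor scores with a weighted mean (hypothetical weights)
    and return the column index with the highest combined score."""
    width = len(img[0])
    scored = []
    for x in range(width):
        col = column(img, x)
        factors = (
            thickness_score(col),
            middleness_score(x, width),
            crossing_score(col),
        )
        scored.append((sum(w * f for w, f in zip(weights, factors)), x))
    return max(scored)[1]


# Two blocks joined by a thin horizontal stroke in the middle row.
BLOB = [[int(c) for c in row] for row in (
    "1100011",
    "1111111",
    "1100011",
)]
cut = best_cut(BLOB)
```

In this toy blob the thin bridge occupies columns 2 to 4, and the middleness factor breaks the tie in favour of the centre column.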
Discussion on the current achievements and problems
faced…
Future plans
• Continue further work on the current system.
A beta release is scheduled for March 2009.
• Explore Google’s Tesseract OCR platform
for Nepali.
• Work towards incorporating the OCR system into
useful utilities such as PDF search for Nepali.
Acknowledgment
This work was carried out with the aid of a grant
from the Language Resource Association (GSK)
of Japan and the International Development
Research Centre (IDRC), Ottawa, Canada,
administered through the Centre for Research in
Urdu Language Processing (CRULP), National
University of Computer and Emerging Sciences
(NUCES), Pakistan.
Thank You!!