Tesseract Training - For Khmer Language - For Posting


• Download Tesseract from http://code.google.com/p/tesseract-ocr/downloads/list
• Here I chose the precompiled package, Tesseract-2.01.ext.tar.gz. (It is better to use a newer version, but since I do not have a compiler at hand right now, I’ll just use the precompiled one.)
• Extract it to any location.

• Download the English language data.

• Extract it and put it in the tessdata folder of your tesseract folder.

by Kruy Vanna


• Download the Tesseract source package. What I need are the files in the configs and tessconfigs folders under tessdata. (The precompiled tesseract.exe package you downloaded does not include these.)



• Extract it somewhere and copy its tessdata folder into our previous tesseract folder.

Now we can start training:

I train with this image. They say you should train with enough data, so every character should appear many times (I don’t know if I’m right).

Maybe each character should appear many times, but in different fonts?

• Make the box file. Go to the command line and set the current directory to your tesseract folder:

tesseract fontfile.tif fontfile batch.nochop makebox

• Got the file: fontfile.txt

• Renamed it to fontfile.box so that I can open it in Tessboxer (http://sites.google.com/site/spilkaondrej/)

Here I type the character into the “Letter” textbox and the UTF-8 code is filled in automatically.
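To check whether every character really appears often enough, the corrected box file can be counted per character. The following is a minimal Python sketch, assuming each box-file line is a symbol followed by four integer coordinates (left, bottom, right, top) separated by spaces. The sample lines and the names Box, parse_box_line and count_symbols are made up for illustration; they are not part of Tesseract or Tessboxer.

```python
from collections import Counter
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: a symbol followed by four integer coordinates."""
    parts = line.split()
    # Any extra trailing field (newer versions add a page number) is ignored.
    symbol, left, bottom, right, top = parts[:5]
    return Box(symbol, int(left), int(bottom), int(right), int(top))


def count_symbols(lines):
    """Count how many boxes carry each symbol, skipping blank lines."""
    return Counter(parse_box_line(l).symbol for l in lines if l.strip())


# Hypothetical sample lines; real coordinates come from your own fontfile.box.
sample = [
    "ខ 10 300 60 350",
    "ង 70 300 120 350",
    "ច 130 300 180 350",
    "ខ 10 200 60 250",
]
counts = count_symbols(sample)

# List the rarest characters first, since those need more samples.
for symbol, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(symbol, n)
```

For a real file, the lines can be read with open("fontfile.box", encoding="utf-8") and passed to count_symbols in the same way.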
Making feature file ->

tesseract fontfile.tif junk nobatch box.train

Got this log:

read_variables_file:variable not found: textord_no_rejects
Tesseract Open Source OCR Engine
Image has 24 bits per pixel and size (746,387)
Resolution=96
APPLY_BOXES:
Boxes read from boxfile: 19
Initially labelled blobs: 17 in 3 rows
Box failures detected: 2
Duped blobs for rebalance: 2
"ច" has fewest samples: 5
Total unlabelled words: 1
Final labelled words: 19
Generating training data
TRAINING ... Font name = UnknownFont.
Generated training data for 19 blobs
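The counters in this log are worth checking after every run, since box failures mean some characters were not used for training. Below is a small Python sketch that pulls the main numbers out of the log text. It assumes the log wording is exactly as shown in this document (other Tesseract versions may phrase it differently), and the function name summarize_log is made up for illustration.

```python
import re

# Patterns follow the wording of the log in this document; this is an assumption
# and other Tesseract versions may print these lines differently.
PATTERNS = {
    "boxes_read": r"Boxes read from boxfile:\s*(\d+)",
    "box_failures": r"Box failures detected:\s*(\d+)",
    "unlabelled_words": r"unlabelled words:\s*(\d+)",
    # The lookbehind stops this pattern from matching "unlabelled words".
    "labelled_words": r"(?<!un)labelled words:\s*(\d+)",
}


def summarize_log(log_text: str) -> dict:
    """Return the counters found in the log as a dict of integers."""
    summary = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        if match:
            summary[name] = int(match.group(1))
    return summary


log = """Boxes read from boxfile: 19
Initially labelled blobs: 17 in 3 rows
Box failures detected: 2
Total unlabelled words: 1
Final labelled words: 19
"""
summary = summarize_log(log)
print(summary)
if summary.get("box_failures", 0) > 0:
    print("Some boxes failed; check the box file before going on.")
```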

Clustering
(You should change the current directory to “training” to use this command.)

mftraining fontfile.tr

Now I have the files I should have:

• inttemp
This is a binary file -> the human eye can’t understand it.

• pffmtable
It contains lines like these. I don’t know what the numbers mean.

ខ 104
ង 93
ច 85

• I also got the file “Microfeat”, but they say it’s not used.

Another command:

cntraining fontfile.tr

Got this file: normproto

Compute the Character Set


unicharset_extractor fontfile.box
Got this file: unicharset

Dictionary Data

Created the “frequent_words_list” file. They said I must put at least one word in it, so I just put “ខងច” in it using Notepad.
Generate the frequent-words dictionary file using the command:
wordlist2dawg frequent_words_list freq-dawg

Got the file: freq-dawg

Created the “words_list” file with the content “ងចខ”.

Generate the word list dictionary file using the command:
wordlist2dawg words_list word-dawg

Got the file: word-dawg

Created the “user-words” file. They say it’s usually empty -> I keep it empty.
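Instead of typing the word lists in Notepad, they can be written by a short script, which also makes the encoding explicit. This is a minimal Python sketch, assuming the word lists are plain UTF-8 text with one word per line; the helper name write_word_list and the temporary folder are only for illustration.

```python
import tempfile
from pathlib import Path


def write_word_list(path: Path, words) -> None:
    """Write one word per line as UTF-8 (an empty list gives an empty file)."""
    text = "".join(word + "\n" for word in words)
    path.write_text(text, encoding="utf-8")


# A temporary folder is used here for illustration; in practice, write the
# files where you run wordlist2dawg.
work_dir = Path(tempfile.mkdtemp())
write_word_list(work_dir / "frequent_words_list", ["ខងច"])
write_word_list(work_dir / "words_list", ["ងចខ"])
write_word_list(work_dir / "user-words", [])  # kept empty, as in the text

for name in ("frequent_words_list", "words_list", "user-words"):
    print(name, (work_dir / name).stat().st_size, "bytes")
```

The wordlist2dawg commands from the text are then run on frequent_words_list and words_list as before.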
The last file
The file “DangAmbigs” is created manually. Its purpose is to reduce ambiguity. E.g. “m” can easily be confused with “rn” (r+n).

Khmer characters may not have this kind of ambiguity (need to confirm), so I make it an empty file.

Putting it all together


Now I have all the files renamed to have the prefix “khm.” (khm is the ISO 639-2 code for Khmer, the language of Cambodia).

All of these files should be put in the “tessdata” folder:

• khm.DangAmbigs
• khm.freq-dawg
• khm.inttemp
• khm.normproto
• khm.pffmtable
• khm.unicharset
• khm.user-words
• khm.word-dawg
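Renaming and copying eight files by hand is easy to get wrong, so the step can be scripted. Below is a minimal Python sketch, assuming the eight training outputs sit together in one folder under their unprefixed names. The function install_language and the temporary folders are made up for illustration and are not part of Tesseract.

```python
import shutil
import tempfile
from pathlib import Path

# The eight file names listed above, without the language prefix.
TRAINED_FILES = [
    "DangAmbigs", "freq-dawg", "inttemp", "normproto",
    "pffmtable", "unicharset", "user-words", "word-dawg",
]


def install_language(src_dir: Path, tessdata_dir: Path, lang: str = "khm"):
    """Copy each trained file into tessdata as '<lang>.<name>'.

    Raises FileNotFoundError if any expected file is missing from src_dir.
    """
    missing = [n for n in TRAINED_FILES if not (src_dir / n).is_file()]
    if missing:
        raise FileNotFoundError("missing training output: " + ", ".join(missing))
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in TRAINED_FILES:
        target = tessdata_dir / f"{lang}.{name}"
        shutil.copyfile(src_dir / name, target)
        installed.append(target.name)
    return installed


# Demonstration with empty placeholder files in temporary folders
# (hypothetical paths, not a real Tesseract installation).
src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp()) / "tessdata"
for name in TRAINED_FILES:
    (src / name).write_bytes(b"")
print(install_language(src, dst))
```

Copying instead of renaming keeps the original outputs in place, so the step can be repeated if a file turns out to be wrong.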

Now it is time to run the test!!!

I have this image, khmer.tif.

I run it with the command:

tesseract khmer.tif output -l khm

I got output.txt with the content:

ខងច
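The same test can be started from a script, which helps when trying many images. The following Python sketch only builds the argument list for the command used above and runs it when a tesseract executable and the image are actually present. The helper name build_ocr_command is made up; note that the language flag is a plain hyphen followed by the letter l.

```python
import os
import shutil
import subprocess


def build_ocr_command(image: str, output_base: str, lang: str) -> list:
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image, output_base, "-l", lang]


command = build_ocr_command("khmer.tif", "output", "khm")
print(" ".join(command))

# Only attempt the real run when tesseract is on PATH and the image exists.
if shutil.which("tesseract") and os.path.exists("khmer.tif"):
    result = subprocess.run(command, capture_output=True)
    print("exit code:", result.returncode)
```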

Cheers!!!
