Tesseract Training - For Khmer Language - For Posting
com/p/tesseract-ocr/downloads/list
Here I chose the precompiled package, Tesseract-2.01.ext.tar.gz. (It is better to use a newer version, but since I
do not have a compiler at hand right now, I'll just use the compiled one.)
Extract it to any location.
by Kruy Vanna
Download the Tesseract source package. What I need are the files in the configs and tessconfigs folders under tessdata.
(The precompiled tesseract.exe package you downloaded does not include these.)
Extract it somewhere and copy its tessdata folder into the Tesseract folder from the previous step.
I train with this image. They say the training data should be large enough, so every character should appear many
times (I don't know if I'm right).
Maybe each character should also appear many times, but in different fonts?
Make the box file. Go to the command line and set the current directory to your Tesseract folder.
Here I type the character in the “Letter” textbox and the UTF-8 code is filled in automatically.
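To check the worry above about whether every character appears often enough, the finished box file can be counted with a few lines of Python. This is only a sketch: it assumes the Tesseract 2.x box layout of one character per line followed by four integer coordinates (left, bottom, right, top), and the sample lines and coordinates are made up for illustration.

```python
from collections import Counter


def parse_box_line(line):
    """Split one box-file line into (character, left, bottom, right, top)."""
    parts = line.split()
    if len(parts) < 5:
        raise ValueError(f"not a box line: {line!r}")
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return parts[0], left, bottom, right, top


def count_characters(lines):
    """Count how many boxes each character has, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[parse_box_line(line)[0]] += 1
    return counts


# Made-up sample; a real run would read the lines from the .box file (UTF-8).
sample = ["ខ 10 20 40 60", "ង 50 20 80 60", "ខ 90 20 120 60"]
counts = count_characters(sample)
print(counts.most_common())
```

Characters with a very low count are the ones to add more samples for in the training image.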
Making feature file ->
Resolution=96
APPLY_BOXES:
Total unlabelled words: 1
Final labelled words: 19
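For reference, the box-file and feature-file steps can be written down as command lines. The sketch below only builds and prints them; it does not run anything, because the box file has to be corrected by hand between the two steps. The arguments follow the Tesseract 2.x training notes as far as I know them (the batch.nochop, makebox and box.train config files are the ones copied from the source tessdata earlier). The image name fontfile.tif is an assumption, chosen to match the fontfile.tr used in the clustering step.

```python
def box_and_train_commands(base="fontfile", exe="tesseract"):
    """Return the two Tesseract 2.x command lines as argument lists.

    base is a hypothetical image name without extension; the image is
    assumed to be base + ".tif" in the current directory.
    """
    image = base + ".tif"
    return [
        # Step 1: let Tesseract guess the boxes, then fix them by hand.
        [exe, image, base, "batch.nochop", "makebox"],
        # Step 2: apply the corrected boxes and write the .tr feature file.
        [exe, image, "junk", "nobatch", "box.train"],
    ]


commands = box_and_train_commands()
for cmd in commands:
    print(" ".join(cmd))
```

The second command is the one that prints the APPLY_BOXES summary shown above.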
Clustering
(You should change the current directory to “training” to use these commands.)
mftraining fontfile.tr
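• Inttemp
This is a binary file, so it is not human-readable.
• Pffmtable
ខ 104
ង 93
ច 85
I don't know exactly what the numbers mean. (The Tesseract training notes describe this file as the number of expected features for each character.)
• Microfeat
I got this file too, but they say it is not used.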
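Since Pffmtable is plain text, it is easy to read it back and compare the numbers between training runs. A minimal sketch, assuming every line is one character followed by one integer, as in the three lines shown above:

```python
def parse_pffmtable(text):
    """Map each character to the number that follows it on its line."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        table[parts[0]] = int(parts[1])
    return table


# The three entries from my file; a real run would read the file as UTF-8.
sample = "ខ 104\nង 93\nច 85\n"
table = parse_pffmtable(sample)
for char, value in sorted(table.items(), key=lambda item: -item[1]):
    print(char, value)
```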
Another command (it takes the same .tr file and produces the normproto file):
cntraining fontfile.tr
Dictionary Data
I created the “frequent_words_list” file. They say it must contain at least one word, so I just put “ខងច” in it using Notepad.
Generate the frequent-word dictionary file using the command:
wordlist2dawg frequent_words_list freq-dawg
I created the “user-words” file. They say it is usually empty, so I keep it empty.
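The two word-list files can also be written from a script instead of Notepad. The sketch below writes one word per line as UTF-8 without a byte order mark. Older versions of Notepad add one when saving as UTF-8, and whether wordlist2dawg tolerates it is something I have not checked, so avoiding it is only a precaution. The helper name and the temporary folder are made up for illustration.

```python
import tempfile
from pathlib import Path


def write_word_lists(folder, frequent_words, user_words=()):
    """Write frequent_words_list and user-words, one word per line, UTF-8."""
    folder = Path(folder)
    freq = folder / "frequent_words_list"
    user = folder / "user-words"
    # encode() with "utf-8" never emits a byte order mark.
    freq.write_bytes("".join(w + "\n" for w in frequent_words).encode("utf-8"))
    user.write_bytes("".join(w + "\n" for w in user_words).encode("utf-8"))
    return freq, user


# Demo in a temporary folder; in practice this would be the training folder.
with tempfile.TemporaryDirectory() as tmp:
    freq, user = write_word_lists(tmp, ["ខងច"])
    freq_bytes = freq.read_bytes()
    user_size = user.stat().st_size

print(len(freq_bytes), user_size)
```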
The last file
The “DangAmbigs” file is created manually. Its purpose is to reduce ambiguity, e.g. “m” can easily be
confused with “rn” (r+n).
Khmer characters may not have this kind of ambiguity (I need to confirm), so I made it an empty file.
khm.DangAmbigs
khm.freq-dawg
khm.inttemp
khm.normproto
khm.pffmtable
khm.unicharset
khm.user-words
khm.word-dawg
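The list above is the set of eight data files with the khm. language prefix, which is the naming Tesseract 2.x looks for in tessdata. Copying and renaming them can be scripted. This is a rough sketch: it assumes the unprefixed files all sit in one training folder, and the function name and folder layout are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

TRAINED_FILES = [
    "DangAmbigs", "freq-dawg", "inttemp", "normproto",
    "pffmtable", "unicharset", "user-words", "word-dawg",
]


def install_language(training_dir, tessdata_dir, lang="khm"):
    """Copy each trained file into tessdata as <lang>.<name>."""
    training_dir, tessdata_dir = Path(training_dir), Path(tessdata_dir)
    missing = [n for n in TRAINED_FILES if not (training_dir / n).is_file()]
    if missing:
        raise FileNotFoundError("missing: " + ", ".join(missing))
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in TRAINED_FILES:
        target = tessdata_dir / f"{lang}.{name}"
        shutil.copyfile(training_dir / name, target)
        installed.append(target.name)
    return installed


# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    training = Path(tmp) / "training"
    training.mkdir()
    for name in TRAINED_FILES:
        (training / name).write_bytes(b"")
    installed = install_language(training, Path(tmp) / "tessdata")

print(installed)
```

Failing early when a file is missing is deliberate, since a forgotten file is easier to spot here than later when recognition does not work.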
ខងច
Cheers!!!