Indonesian - English Parallel Texts For Statistical Machine Translation
Statistical Machine Translation
Dictionary
Corpus Collection and Processing
Data collection schema:
[Figure: data collection pipeline. Labels recoverable from the diagram: Web Corpus; News; Indonesian-English; Collection & Cleaning; Conversion to Text; Indonesian Text; English Text; Clean Text Corpus; Alignment Sentences.]
Translation of SMT System (1)
• A. Translation model
– SRI Language Modeling Toolkit (SRILM), which builds a 3-gram language model from
the data. Besides the SRILM distribution, building SRILM requires the following freely
available tools: an ANSI-C/C++ compiler (gcc version 3.4.3 or higher), GNU make,
GNU gawk, GNU gzip, and Tcl. On a Microsoft Windows system, the Cygwin porting
layer is also needed.
– Functionalities of SRILM:
• Generate the n-gram count file from the corpus
• Train the language model from the n-gram count file
• Calculate the test data perplexity using the trained language model
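The three SRILM functions listed above can be illustrated with a small Python sketch. This is not SRILM's implementation: it uses simple add-one smoothing rather than SRILM's discounting and backoff, and the function names (count_ngrams, train_model, perplexity) and the toy sentences are hypothetical, chosen only to mirror the three steps.

```python
import math
from collections import Counter

ORDER = 3  # 3-gram model, as in the slides


def pad(sentence, order=ORDER):
    """Tokenize on whitespace and add sentence boundary markers."""
    return ["<s>"] * (order - 1) + sentence.split() + ["</s>"]


def count_ngrams(sentences, order=ORDER):
    """Step 1: count every n-gram of the given order in the corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = pad(sentence, order)
        for i in range(order - 1, len(tokens)):
            counts[tuple(tokens[i - order + 1:i + 1])] += 1
    return counts


def train_model(counts):
    """Step 2: derive context counts and the vocabulary from the n-gram counts."""
    context_counts = Counter()
    vocab = set()
    for ngram, c in counts.items():
        context_counts[ngram[:-1]] += c
        vocab.add(ngram[-1])
    return counts, context_counts, vocab


def perplexity(model, sentences, order=ORDER):
    """Step 3: perplexity of test sentences under the add-one smoothed model."""
    counts, context_counts, vocab = model
    vocab_size = len(vocab) + 1  # one extra slot for unknown words (assumption)
    log_prob, n_tokens = 0.0, 0
    for sentence in sentences:
        tokens = pad(sentence, order)
        for i in range(order - 1, len(tokens)):
            ngram = tuple(tokens[i - order + 1:i + 1])
            p = (counts[ngram] + 1) / (context_counts[ngram[:-1]] + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Toy training data (hypothetical, not from the corpus described here).
train_sentences = ["saya makan nasi", "saya minum air"]
model = train_model(count_ngrams(train_sentences))
seen_ppl = perplexity(model, ["saya makan nasi"])
unseen_ppl = perplexity(model, ["air nasi saya"])
```

A sentence seen in training should receive a lower perplexity than an unseen word order, which is the property the perplexity test is meant to measure.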
– corpus is where the data should be placed when training the translation model.
[Figure: translation model training pipeline. Labels recoverable from the diagram: source; target; Data preparations; Program (SRILM); Compiler; Translation Model Generation; Train Phrase Model; Pharaoh SMT System.]
Testing System Performance (1)
Sentences are translated with the Pharaoh decoder.
Files used for translation are:
– pharaoh (executable)
– pharaoh.ini
– xkalimat.lm
– phrase-table
Example:
Type the following command:
echo 'Can I check in now' | ./pharaoh -f ./pharaoh.ini > OUT
The process writes the file OUT. To see the result, type:
cat OUT
The result displayed is “Dapatkah saya check in sekarang”.
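For translating many sentences, the same command can be driven from a script. The following is a minimal sketch, assuming the pharaoh executable and pharaoh.ini sit in the current directory as in the example above, and that the decoder reads one sentence from standard input and writes the translation to standard output. The helper names build_pharaoh_command and translate are hypothetical.

```python
import subprocess


def build_pharaoh_command(decoder="./pharaoh", config="./pharaoh.ini"):
    """Build the decoder command line used in the slides: ./pharaoh -f ./pharaoh.ini"""
    return [decoder, "-f", config]


def translate(sentence, decoder="./pharaoh", config="./pharaoh.ini"):
    """Pipe one sentence to the decoder and return whatever it prints on stdout.

    Assumption: the decoder writes only the translation to stdout.
    """
    result = subprocess.run(
        build_pharaoh_command(decoder, config),
        input=sentence + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the decoder exits with an error
    )
    return result.stdout.strip()
```

With the files listed above in place, calling translate("Can I check in now") plays the same role as the echo command followed by cat OUT.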
Testing System Performance (2)
Testing Performance
BLEU Score
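As an illustration of what the BLEU score measures, here is a minimal Python sketch of sentence-level BLEU with a single reference: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It is a simplified sketch, not the evaluation script used for the system described here; standard BLEU aggregates counts over a whole test corpus and may use several references. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "Dapatkah saya check in sekarang"
exact = bleu("Dapatkah saya check in sekarang", reference)
shorter = bleu("Dapatkah saya check in", reference)
```

An exact match scores 1, while the shorter candidate keeps perfect n-gram precision but is reduced by the brevity penalty.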