
Indonesian-English Parallel Texts for Statistical Machine Translation

(Hammam Riza, Adiansya Prasetya, Henky Mulyadi)

The Republic of Indonesia:
• An archipelago of 13,000 islands spread over an area of 1,900,000 square
kilometers
• Population of 245,000,000 (July 2006 estimate)
• GDP growth of 7% per year was recorded
• Economic and political conditions are gradually stabilizing
• Indonesia is back on track to become an industrialized nation
• Bahasa Indonesia became the official language of the country, uniting
citizens who speak different languages
• Bahasa Indonesia bridges the language barrier among Indonesians who have
different mother tongues
• The vocabulary of Bahasa Indonesia has been extensively influenced by
other languages, especially Sanskrit, Arabic, Chinese, Dutch, and
English, as well as local languages such as Javanese and Batavian
Research Topics

Application areas:
• Tourism in Asia
• Business in Asia
• Social Service in Asia
• Safety and Security in Asia
• Education in Asia
• Archiving of Asian Language

Technologies:
• Multi-lingual Speech Translation
• Multi-lingual Speech and Language Transcription and Formats
• Multi-lingual Speech Transcription
• Multi-lingual Speech and Text Archive

Parallel Corpus (Synonymous Speech + Text):
• Indonesian Language Speech + Text
• English Language Speech + Text

Parallel Corpus Format

Corpus Collection and Processing
Data collection schema (figure; box labels as recovered from the slide):
• Antara News Agency DB (Oracle DB)
• Selection & Transformation (SQL 2000)
• Alignment Article
• Selected Corpus Indonesian-English
• Web News
• Collection & Alignment
• Corpus Sentences Indonesian-English
• Toggle to Text
• Clean Text Corpus
Translation of SMT System (1)
• A. Language model
– The SRI Language Modeling Toolkit (SRILM) extracts a 3-gram language model
from the data. Besides the SRILM distribution, the following freely available
tools are needed: an ANSI C/C++ compiler (gcc version 3.4.3 or higher), GNU make,
GNU gawk, GNU gzip, and Tcl, plus the Cygwin porting layer to build SRILM on a
Microsoft Windows system.
– Functionalities of SRILM:
• Generate the n-gram count file from the corpus
• Train the language model from the n-gram count file
• Calculate the perplexity of the test data using the trained language model

Training corpus -> ngram-count -> count file

Count file + lexicon -> ngram-count -> LM

LM + test data -> ngram -> ppl
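
The three SRILM functions listed above (count n-grams, train a language model, compute test-set perplexity) can be illustrated with a small Python sketch. This is not SRILM and does not reproduce its smoothing; it assumes whitespace-tokenized sentences and swaps in simple add-one smoothing to keep the example short. All function names here are hypothetical.

```python
import math
from collections import Counter

def count_ngrams(sentences, order=3):
    """Count n-grams of the given order and their (order-1)-word contexts."""
    ngrams, contexts = Counter(), Counter()
    vocab = {"</s>", "<unk>"}  # end-of-sentence and unknown-word symbols
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] * (order - 1) + words + ["</s>"]
        for i in range(order - 1, len(tokens)):
            context = tuple(tokens[i - order + 1:i])
            ngrams[context + (tokens[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts, vocab

def probability(word, context, model):
    """Add-one smoothed P(word | context); an assumption, not SRILM's default."""
    ngrams, contexts, vocab = model
    return (ngrams[context + (word,)] + 1) / (contexts[context] + len(vocab))

def perplexity(sentences, model, order=3):
    """Perplexity of the sentences under the model; unseen words map to <unk>."""
    _, _, vocab = model
    log_prob, n_tokens = 0.0, 0
    for sentence in sentences:
        words = [w if w in vocab else "<unk>" for w in sentence.split()]
        tokens = ["<s>"] * (order - 1) + words + ["</s>"]
        for i in range(order - 1, len(tokens)):
            context = tuple(tokens[i - order + 1:i])
            log_prob += math.log(probability(tokens[i], context, model))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Toy training data (made up for illustration only).
train = ["saya makan nasi", "saya minum air"]
model = count_ngrams(train, order=3)
```

Calling `perplexity(train, model)` on the training sentences should give a lower value than on a scrambled sentence such as `"air nasi minum"`, which is the behaviour the perplexity step is meant to measure.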

Translation of SMT System (2)
• B. Translation model
– The bin directory contains GIZA++, an implementation based on the IBM
models, and mkcls, which divides words into probabilistically based classes.
– To compile GIZA++ you may need:
• a recent version of the GNU compiler (2.95 or higher)
• a recent version of the assembler and linker that does not restrict
the length of symbol names

– The corpus directory is where the data should be placed when training the
translation model.
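
To give a feel for what "based on the IBM models" means, here is a minimal sketch of IBM Model 1, the simplest of those models, trained with expectation-maximization. It is not GIZA++ code and omits everything GIZA++ adds beyond Model 1; the function name, the NULL token convention, and the toy sentence pairs are assumptions made for illustration.

```python
def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(f | e) from sentence pairs (f_sentence, e_sentence)."""
    # Prepend a NULL word to each e-side so f-words may align to nothing.
    corpus = [(f.split(), ["NULL"] + e.split()) for f, e in pairs]
    f_vocab = {w for fs, _ in corpus for w in fs}

    # Uniform initialization over word pairs that co-occur in some sentence pair.
    t = {}
    for fs, es in corpus:
        for f in fs:
            for e in es:
                t[(f, e)] = 1.0 / len(f_vocab)

    for _ in range(iterations):
        count = {}   # expected count of f aligned to e
        total = {}   # expected count of e aligned to anything
        # E-step: distribute each f-word over the e-words of its sentence.
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] = count.get((f, e), 0.0) + frac
                    total[e] = total.get(e, 0.0) + frac
        # M-step: renormalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Toy Indonesian-English pairs (made up for illustration only).
pairs = [
    ("rumah biru", "blue house"),
    ("rumah", "house"),
    ("buku biru", "blue book"),
    ("buku", "book"),
]
t = train_ibm_model1(pairs)
```

After training on these pairs, `t[("rumah", "house")]` ends up larger than `t[("biru", "house")]`, so the word pair that co-occurs consistently receives most of the probability mass.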

[Figure: inputs to the Pharaoh SMT system: translation model, program (SRILM), compiler, data generation, phrase training]

Testing System Performance (1)
Sentences are translated with the Pharaoh decoder.
The files used for translation are:
– pharaoh (executable)
– pharaoh.ini
– xkalimat.lm
– phrase-table

Type the following command:
echo 'Can I check in now' | ./pharaoh -f ./pharaoh.ini > OUT
This writes the translation to the file OUT. To see the result, type:
cat OUT
The result shown is: "Dapatkah saya check in sekarang"
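
For translating more than one sentence, the same command can be driven from a script. The sketch below wraps the invocation shown above in Python. It assumes the decoder reads one sentence per line on standard input and writes one translation per line to standard output, as the `echo` pipeline suggests; the function names are hypothetical.

```python
import subprocess

def pharaoh_command(decoder="./pharaoh", config="./pharaoh.ini"):
    """Build the argument list used in the slide: ./pharaoh -f ./pharaoh.ini"""
    return [decoder, "-f", config]

def translate(sentences, decoder="./pharaoh", config="./pharaoh.ini"):
    """Pipe sentences to the decoder and return its output lines.

    Assumes one input sentence per line and one output line per sentence.
    """
    result = subprocess.run(
        pharaoh_command(decoder, config),
        input="\n".join(sentences) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the decoder exits with an error
    )
    return result.stdout.splitlines()
```

With the four files listed above in the working directory, `translate(["Can I check in now"])` would be expected to return a one-element list holding the decoder's output for that sentence.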
Testing System Performance (2)

BLEU score

• On a sample of 275,000 sentences, the BLEU score is 0.878
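
For readers unfamiliar with the metric, the following sketch shows how a corpus-level BLEU score can be computed: clipped n-gram precisions for n = 1 to 4 are combined by a geometric mean and multiplied by a brevity penalty. The slides do not say which BLEU implementation or how many references were used, so this sketch assumes one reference per sentence, whitespace tokenization, and no smoothing. The function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

A hypothesis identical to its reference scores 1.0, and a hypothesis sharing no words with its reference scores 0.0; real system output falls in between.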

Thank you...

sunset in Kuta, Bali
