
Indonesian-English Parallel Texts for Statistical Machine Translation

(Hammam Riza, Adiansya Prasetya, Henky Mulyadi)


Background
The Republic of Indonesia:
• Is an archipelago of 13,000 islands spread over an area of 1,900,000 square
kilometers
• Has a population of 245,000,000 (July 2006 estimate)
• Has recorded GDP growth of 7% per year
• Has gradually stabilizing economic and political conditions
• Is back on track to become an industrialized nation
• Bahasa Indonesia became the official language of the country, uniting
citizens who speak different languages
• Bahasa Indonesia bridges the language barrier among Indonesians who have
different mother tongues
• The vocabulary of Bahasa Indonesia has been extensively influenced by
other languages, especially Sanskrit, Arabic, Chinese, Dutch, and
English, as well as local languages such as Javanese and Batavian
Research Topics

[Diagram, top to bottom]
• Tourism in Asia; Business in Asia; Social Service in Asia; Safety and
Security in Asia; Education in Asia; Archiving of Asian Language
• Multi-lingual Speech Translation; Multi-lingual Transcription; Speech and
Language Formats
• Multi-lingual Speech Translation; Multi-lingual Speech Transcription;
Multi-lingual Speech and Text Archive
• Parallel Corpus (Synonymous Speech + Text): Indonesian Language
Speech+Text; English Language Speech+Text
• Parallel Corpus Format
• Dictionary
Corpus Collection and Processing
Data collection schema:

[Diagram: data collection schema]
• Antara News Agency (oracle DB) → Selection & Transformation → Selected DB
(SQL 2000) → Alignment → Corpus: Article, Indonesian-English Alignment
Sentences
• Web News → Collection & Cleaning → Alignment → Corpus: Sentences,
Indonesian-English Toggle
• Indonesian Text, English Text → Conversion to Text → Clean Text Corpus
Translation of SMT System (1)
• A. Language model
– The SRI Language Modeling Toolkit (SRILM) extracts a 3-gram language model
from the data. To build SRILM on a Microsoft Windows system, besides the
SRILM distribution you also need the following freely available tools: an
ANSI-C/C++ compiler (gcc version 3.4.3 or higher), GNU make, GNU gawk,
GNU gzip, Tcl, and the CYGWIN porting layer.
– Functionalities of SRILM:
• Generate the n-gram count file from the corpus
• Train the language model from the n-gram count file
• Calculate the test data perplexity using the trained language model

Training corpus → ngram-count → Count file

Count file + Lexicon → ngram-count → LM

Test data + LM → ngram -ppl → perplexity
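
The three SRILM functions (count, train, evaluate perplexity) can be illustrated without SRILM itself. The following is a minimal Python sketch, not SRILM's implementation: it counts 3-grams, turns the counts into an add-one smoothed model (an assumption made for brevity; SRILM provides other smoothing methods), and computes perplexity on test sentences. The function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def pad(sentence, n):
    """Tokenize on whitespace and add sentence-boundary markers."""
    return ["<s>"] * (n - 1) + sentence.split() + ["</s>"]


def count_ngrams(sentences, n=3):
    """Step 1: collect n-gram counts and the counts of their histories."""
    ngrams, contexts = Counter(), Counter()
    for sentence in sentences:
        tokens = pad(sentence, n)
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            ngrams[history + (tokens[i],)] += 1
            contexts[history] += 1
    return ngrams, contexts


def train_lm(ngrams, contexts):
    """Step 2: build P(word | history) with add-one smoothing (assumption)."""
    vocab = {gram[-1] for gram in ngrams}
    v = len(vocab) + 1  # one extra class shared by all unseen words

    def prob(history, word):
        return (ngrams[history + (word,)] + 1) / (contexts[history] + v)

    return prob


def perplexity(prob, sentences, n=3):
    """Step 3: perplexity = exp of the average negative log-probability."""
    log_sum, n_tokens = 0.0, 0
    for sentence in sentences:
        tokens = pad(sentence, n)
        for i in range(n - 1, len(tokens)):
            log_sum += math.log(prob(tuple(tokens[i - n + 1:i]), tokens[i]))
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)


# Toy usage with made-up sentences
train = ["saya makan nasi", "saya minum teh"]
ngrams, contexts = count_ngrams(train)
lm = train_lm(ngrams, contexts)
print(perplexity(lm, ["saya makan nasi"]))
print(perplexity(lm, ["teh makan saya"]))
```

A sentence seen in training gets a lower perplexity than a scrambled one, which is the property the perplexity test on held-out data relies on.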


Translation of SMT System (2)
• B. Translation model
– The bin directory contains GIZA++, an implementation based on the IBM
models, and mkcls, which divides words into probabilistically determined
classes.
– To compile GIZA++ you may need:
• a recent version of the GNU compiler (2.95 or higher)
• a recent version of the assembler and linker that does not restrict the
length of symbol names

– The corpus directory is where the data should be placed when training the
translation model.
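
To make "based on the IBM models" concrete, here is a minimal Python sketch of IBM Model 1, the simplest of those models, trained with expectation-maximization. It is not GIZA++ code and leaves out much of what GIZA++ does (the NULL word, the higher models, word classes from mkcls). The function name and the toy sentence pairs are hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t[(f, e)] = P(f | e).

    pairs: list of (f_tokens, e_tokens) sentence pairs.
    Simplification (assumption): no NULL word on the e side.
    """
    f_vocab = {f for f_tokens, _ in pairs for f in f_tokens}
    # Uniform initialization over the f vocabulary
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked
        # E-step: distribute each f word over the e words of its sentence
        for f_tokens, e_tokens in pairs:
            for f in f_tokens:
                z = sum(t[(f, e)] for e in e_tokens)
                for e in e_tokens:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts into probabilities
        t = defaultdict(float, {(f, e): c / total[e]
                                for (f, e), c in count.items()})
    return t


# Toy usage with made-up English/Indonesian pairs
pairs = [
    ("the house".split(), "rumah itu".split()),
    ("the book".split(), "buku itu".split()),
    ("a book".split(), "sebuah buku".split()),
]
t = train_ibm_model1(pairs)
print(t[("book", "buku")], t[("book", "itu")])
```

Because "buku" co-occurs with "book" in two different contexts, EM shifts probability mass toward that pair over the iterations, which is the effect word alignment training exploits.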

[Diagram: training pipeline. Labels: source, target, Data preparations,
Translation Model Generation (Program (SRILM), Compiler), Train Phrase Model,
Pharaoh SMT System]
Testing System Performance (1)
Sentences are translated with the Pharaoh decoder.
The files used for translation are:
– pharaoh (executable)
– pharaoh.ini
– xkalimat.lm
– phrase-table

Example:
Type the following command:
echo 'Can I check in now' | ./pharaoh -f ./pharaoh.ini > OUT
This writes the result to the file OUT. To view it, type:
cat OUT
The result shown is "Dapatkah saya check in sekarang".
Testing System Performance (2)
Testing Performance

BLEU score

For a sample of 275,000 sentences, the BLEU score is 0.878.
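
For readers unfamiliar with the metric, the following Python sketch shows how a corpus-level BLEU score is commonly computed: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes one reference per sentence and whitespace tokenization, and it is not the evaluation script used to obtain the figure above. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference
            matches[n - 1] += sum(min(c, r_counts[g])
                                  for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no match at all
    log_precision = sum(math.log(m / t)
                        for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


# Toy usage with made-up system output and reference
print(corpus_bleu(["dapatkah saya check in besok"],
                  ["dapatkah saya check in sekarang"]))
```

The score ranges from 0 to 1, and reaches 1 only when every hypothesis matches its reference exactly.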


Thank you...

sunset in Kuta, Bali
