This document discusses the development of parallel corpora and statistical machine translation systems between Indonesian and English. It describes collecting parallel texts from news sources, preprocessing and aligning the data to create corpora. It then explains training translation models using SRILM and GIZA++, and using the Pharaoh decoder to generate translations. Evaluation is done using BLEU scores, with a sample text achieving a score of 0.878.

Indonesian – English Parallel Texts for Statistical Machine Translation

(Hammam Riza, Adiansya Prasetya, Henky Mulyadi)

Background
The Republic of Indonesia:
• An archipelago of 13,000 islands spread over an area of 1,900,000 square kilometers
• Population of 245,000,000 (July 2006 estimate)
• GDP growth of 7% per year was recorded
• Economic and political conditions are gradually stabilizing
• Indonesia is back on track to become an industrialized nation
• Bahasa Indonesia became the formal language of the country, uniting citizens who speak different languages
• Bahasa Indonesia bridges the language barrier among Indonesians who have different mother tongues
• The vocabulary of Bahasa Indonesia has been extensively influenced by outside languages, especially Sanskrit, Arabic, Chinese, Dutch, and English, as well as by local languages such as Javanese and Batavian

Research Topics
[Diagram, layers from top to bottom]
• Application areas: Tourism in Asia; Business in Asia; Social Service in Asia; Safety and Security in Asia; Education in Asia; Archiving of Asian Language
• Multi-lingual speech translation; multi-lingual transcription; speech and language formats
• Multi-lingual Speech Translation; Multi-lingual Speech Transcription; Multi-lingual Speech and Text Archive
• Parallel Corpus (Synonymous Speech + Text): Indonesian Language Speech+Text, English Language Speech+Text
• Parallel Corpus Format
• Dictionary

Corpus Collection and Processing
Data collection schema:
[Diagram with two pipelines; box order is approximate]
• Antara News Agency DB (Oracle DB) → Transformation → Selected Article DB (SQL 2000) → Selection & Alignment → Corpus of Indonesian-English Alignment Sentences
• Web News → Collection & Toggle Cleaning → Conversion to Text → Indonesian Text, English Text → Clean Text Corpus → Alignment → Corpus of Indonesian-English Alignment Sentences

Translation of SMT System (1)
• A. Language model (SRILM)
– The SRI Language Modeling Toolkit (SRILM) extracts a 3-gram language model from the data. Besides the SRILM distribution, building it requires the following freely available tools: an ANSI C/C++ compiler (gcc version 3.4.3 or higher), GNU make, GNU gawk, GNU gzip, and Tcl. On a Microsoft Windows system, the Cygwin porting layer is also needed.
– Functionalities of SRILM:
• Generate the n-gram count file from the corpus
• Train the language model from the n-gram count file
• Calculate the test data perplexity using the trained language model

Training corpus → ngram-count → count file
Lexicon → ngram-count → LM
Test data → ngram → ppl
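
The three SRILM functions listed above (count, train, measure perplexity) can be illustrated with a small Python sketch. This is not SRILM code and does not reproduce its smoothing; it assumes add-one smoothing purely to keep the example short, and all function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Yield every n-gram of a sentence padded with <s> and </s> markers."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])

def count_ngrams(corpus, n=3):
    """Step 1: collect n-gram counts and history counts from the corpus."""
    grams, histories, vocab = Counter(), Counter(), {"</s>"}
    for sentence in corpus:
        tokens = sentence.split()
        vocab.update(tokens)
        for gram in ngrams(tokens, n):
            grams[gram] += 1
            histories[gram[:-1]] += 1
    return grams, histories, vocab

def train_lm(grams, histories, vocab):
    """Step 2: turn counts into an add-one smoothed probability function.

    Add-one smoothing is an assumption made for brevity, not SRILM's method.
    """
    v = len(vocab) + 1  # one extra slot reserves mass for unseen words

    def prob(gram):
        return (grams[gram] + 1) / (histories[gram[:-1]] + v)

    return prob

def perplexity(prob, test_corpus, n=3):
    """Step 3: exp of the negative average log-probability per predicted token."""
    log_sum, count = 0.0, 0
    for sentence in test_corpus:
        for gram in ngrams(sentence.split(), n):
            log_sum += math.log(prob(gram))
            count += 1
    return math.exp(-log_sum / count)

# Toy training data (hypothetical sentences, not from the Antara corpus).
grams, histories, vocab = count_ngrams(["saya makan nasi", "saya minum air"])
lm = train_lm(grams, histories, vocab)
seen = perplexity(lm, ["saya makan nasi"])
unseen = perplexity(lm, ["air nasi minum"])
```

A sentence that appeared in training gets a lower perplexity than a scrambled one, which is the property the test-data perplexity step relies on.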


Translation of SMT System (2)
• B. Translation model (GIZA++)
– The bin directory contains GIZA++, an implementation based on the IBM models, and mkcls, which divides words into probabilistically based classes.
– To compile GIZA++ you may need:
• a recent version of the GNU compiler (2.95 or higher)
• a recent version of the assembler and linker that does not restrict the length of symbol names

– The corpus directory is where the data should be placed when training the translation model.
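
To give a feel for what "based on the IBM models" means, here is a minimal sketch of IBM Model 1, the simplest of those models, trained with expectation-maximization. It is not GIZA++ code: it omits the NULL word and the higher models, and the function name and the toy sentence pairs are hypothetical.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM on tokenized sentence pairs.

    Simplified IBM Model 1: no NULL source word, uniform initialization.
    """
    tgt_vocab = {tw for _, tgt in pairs for tw in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) co-occurrences
        total = defaultdict(float)  # expected occurrences of each source word
        # E-step: spread each target word over the source words in its pair.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t

# Toy Indonesian-English pairs (hypothetical, for illustration only).
pairs = [
    (["rumah", "besar"], ["big", "house"]),
    (["rumah"], ["house"]),
    (["buku", "besar"], ["big", "book"]),
]
t = train_ibm1(pairs)
```

After a few iterations the one-word pair pulls "rumah" toward "house", which in turn frees "besar" to align with "big".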

[Diagram; box order is approximate: source and target texts → data preparations → translation model (program (SRILM), compiler) → generation → train phrase model → Pharaoh SMT system]

Testing System Performance (1)
Sentences are translated with the Pharaoh decoder.
Files used for translation:
– pharaoh (executable)
– pharaoh.ini
– xkalimat.lm
– phrase-table

Example:
Type the following command:
    echo 'Can I check in now' | ./pharaoh -f ./pharaoh.ini > OUT
The process writes its output to the file OUT. To see the result, type:
    cat OUT
The result shown is "Dapatkah saya check in sekarang"
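
The same invocation can be scripted. The sketch below wraps the command shown above in Python; it assumes the pharaoh executable and pharaoh.ini sit in the current directory as in the example, and the function names are hypothetical.

```python
import subprocess

def build_command(ini_path="./pharaoh.ini", decoder="./pharaoh"):
    """Return the argument list for the decoder call shown in the slide."""
    return [decoder, "-f", ini_path]

def translate(sentence, ini_path="./pharaoh.ini", decoder="./pharaoh"):
    """Pipe one sentence to the decoder on stdin and return its stdout.

    Assumes the decoder reads one sentence per line, as the echo example implies.
    """
    result = subprocess.run(
        build_command(ini_path, decoder),
        input=sentence + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the decoder exits with an error
    )
    return result.stdout.strip()
```

Calling translate("Can I check in now") would play the role of the echo and cat steps together, provided the decoder files listed above are in place.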

Testing System Performance (2)
Performance is measured with the BLEU score.

Sample of 275,000 sentences: BLEU score = 0.878
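
For readers unfamiliar with the metric, the following sketch computes a corpus-level BLEU score in Python. The slides do not say which BLEU implementation or how many references were used; this version assumes one reference per sentence, n-grams up to length 4, and no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus BLEU with one reference per candidate and a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c, r = cand.split(), ref.split()
        cand_len += len(c)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            cc, rc = ngram_counts(c, n), ngram_counts(r, n)
            # Clipped matches: a candidate n-gram counts at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(k, rc[g]) for g, k in cc.items())
            totals[n - 1] += sum(cc.values())
    if cand_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)

reference = ["Dapatkah saya check in sekarang"]
exact = corpus_bleu(["Dapatkah saya check in sekarang"], reference)
short = corpus_bleu(["Dapatkah saya check in"], reference)
```

An exact match scores 1.0, while the truncated candidate keeps perfect n-gram precision but is reduced by the brevity penalty.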

Thank you...

sunset in Kuta, Bali
