Multilingual Speech-To-Speech Translation System For Mobile Consumer Devices
IEEE Transactions on Consumer Electronics, Vol. 60, No. 3, August 2014
Abstract — Along with the advancement of speech recognition technology and machine translation technology, in addition to the fast distribution of mobile devices, speech-to-speech translation technology is no longer merely a subject of research, as it has become popularized among many users. To develop a speech-to-speech translation system that can be widely used, however, the system needs to reflect the various characteristics of utterances by the users who will actually use it, beyond improving the basic functions under an experimental environment. Based on a survey of users' demands, this study established a massive language and speech database that is as close as possible to the environment in which a speech-to-speech translation device is actually used, by mobilizing a large number of speakers. Through this work, it was possible to secure excellent basic performance in an environment similar to the real speech-to-speech translation environment, rather than only under experimental conditions. Moreover, a user-friendly speech-to-speech translation UI was designed, and at the same time translation errors were reduced as many measures to enhance user satisfaction were employed. After launching the actual service, the massive database collected through the service was additionally applied to the system, following a filtering process, in order to achieve the best possible robustness to both the content and the environment of users' utterances. By applying these measures, this study presents the procedures through which a multilingual speech-to-speech translation system was successfully developed for mobile devices.¹

Index Terms — Speech-to-speech translation system, speech recognition, machine translation, human-computer interface.

I. INTRODUCTION

Speech recognition technology may be the most notable user-friendly interface, as it has been widely depicted in numerous movies and novels. Because of its user-friendliness, speech recognition technology has been studied since the 1960s; however, the technology only began to be used in practice in the 1990s. Since the 2000s, speech recognition technology has become popularized as the collection of corpora was made possible through the internet, while computing power advanced remarkably [1], [2]. Lately, starting with automobile navigation systems [3], speech recognition technology has been applied to various devices including digital cameras [4], smart robots [5], refrigerators and smart TVs [6], [7]. In particular, as mobile terminals with a built-in microphone and wireless data network, namely smart phones, became rapidly popularized, speech recognition technology is being applied to mobile terminals in a wide variety of forms, including voice search services [8], [9] and personal secretary services. One of the most notable examples is speech-to-speech translation technology, in which speech recognition technology, machine translation technology and speech synthesis technology all come together.

Speech-to-speech translation technology automatically translates one language into another language in order to enable communication between two parties with different native tongues. To translate a voice in one language into a voice in a different language, speech-to-speech translation is comprised of the three core technologies mentioned above: speech recognition technology, which recognizes the utterance of a human and converts it into text; machine translation technology, which translates text in one language into text in another language; and speech synthesis technology, which converts the translated text into speech. Additionally, natural language understanding technology and the user interface-related technology integrated with the UI also play an important role in a speech-to-speech translation system.

Speech-to-speech translation technology has been intensively studied since the 1990s. The technology has been developed mostly through international joint studies such as C-STAR [10], and a number of studies on speech-to-speech translation technology have been conducted through various international joint research projects including NESPOLE! [11], TC-STAR and GALE [12]. However, the technology was not widely used by the general public at that time, and it is only now slowly spreading as the technology has matured in recent years.

Speech-to-speech translation technology has been used by the U.S. Armed Forces in Iraq for military use [13], and the LA Police Department has also employed the speech-to-speech

¹ This work was supported by the Ministry of Science, ICT and Future Planning, Korea, under the ICT R&D Program (Strengthening competitiveness of automatic translation industry for realizing language barrier-free Korea).
Seung Yun is with the Department of Computer Software, University of Science and Technology, Daejeon, Korea. He is also with the Automatic Speech Translation Section, ETRI, Daejeon, 305-700, Korea (e-mail: syun@etri.re.kr).
Young-Jik Lee and Sang-Hun Kim are with the Automatic Speech Translation Section, ETRI, Daejeon, 305-700, Korea (e-mail: ylee@etri.re.kr, ksh@etri.re.kr).
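The three core technologies named in the Introduction form a simple sequential pipeline. The sketch below illustrates only that flow; the function bodies are toy stand-ins (a fake recognizer and a one-entry dictionary), not Genietalk's actual engines.

```python
# Illustrative ASR -> MT -> TTS pipeline. All three "engines" are toy
# placeholders; only the stage ordering reflects the system described here.

def recognize(audio, lang):
    # Stand-in ASR: pretend the audio object carries its own transcript.
    return audio["transcript"]

TOY_DICT = {("ko", "en"): {"안녕하세요": "Hello"}}

def translate(text, src, tgt):
    # Stand-in MT: a one-entry dictionary lookup; unknown text passes through.
    return TOY_DICT[(src, tgt)].get(text, text)

def synthesize(text, lang):
    # Stand-in TTS: tag the text instead of producing a waveform.
    return f"<speech lang={lang}>{text}</speech>"

def speech_to_speech(audio, src, tgt):
    source_text = recognize(audio, src)             # 1) speech -> text
    target_text = translate(source_text, src, tgt)  # 2) text   -> text
    return synthesize(target_text, tgt)             # 3) text   -> speech

print(speech_to_speech({"transcript": "안녕하세요"}, "ko", "en"))
```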
Contributed Paper
Manuscript received 06/30/14
Current version published 09/23/14
Electronic version published 09/23/14. 0098 3063/14/$20.00 © 2014 IEEE
followed by laptop computers and the speech-to-speech translation-only device. The rationale behind this finding is the portability of the mobile device, which is also easy to use once a user becomes accustomed to its operation. At the same time, the urge to lighten the weight of luggage when travelling also seemed to have an impact.

In the most anticipated category, the level of expectation toward the performance of speech-to-speech translation, 6 out of 25 FGI participants expected 100%, followed by 6 people with 95%, 6 with 90% and 4 with below 90%. This figure was higher than first anticipated; however, it should be understood that these levels of expectation concerned not the general translation of all sentences but specific situations such as travelling and tourism. Based on these findings, it can be concluded that a speech-to-speech translation device can be commercialized when its performance exceeds 90% on the utterances that people generally need in travelling/tourism situations.

B. Definition of Speech-to-speech Translation System Configuration Based on the Survey of Users' Demands

According to the users' demands discussed in the previous chapter, the smart phone was selected, since it is not only easy to carry around but also the most popular device among all speech-to-speech translation devices. At the same time, as data communication became widespread and the burden of using the wireless communication network lessened, the system was designed to perform the actual speech-to-speech translation through communication with servers, after embedding the speech-to-speech translation engine in a high-performance server. Also, considering the users' high demand for multi-language service, it was decided to start with the Korean-English speech-to-speech translation service and to expand to Korean-Japanese and Korean-Chinese services. The available languages for speech-to-speech translation will continue to increase with French, Spanish, Russian and German.

The next most important issue is the performance of the speech-to-speech translation engine. To satisfy every participant of the survey, the speech-to-speech translation success rate would have to be 100%; however, that number is impossible to achieve, because speech recognition performance cannot be perfect, and the performance is bound to deteriorate further after going through the machine translation procedure. Thus, after removing the users who expected a 100% speech-to-speech translation success rate, the remaining participants were included as candidate users of a speech-to-speech translation device. If the translation success rate reaches 90%, it is expected to secure 70% of the potential users who are interested in speech-to-speech translation. Therefore, the system was developed with a goal of achieving a 90% translation success rate. To achieve this goal, we focused the development of the speech-to-speech translation system on travelling/daily life, and the system was designed to satisfy the users' expectations by building a massive corpus for the language model in the field of travelling/daily life through various means. The following series of procedures will be thoroughly detailed in the next chapter. The input method for speech-to-speech translation was designed to enable both speech recognition and keypad input, while an advanced search function, for example for the sentences based on the words and sentences most demanded by the users, was included in the scope of development.

III. STRUCTURE OF SPEECH-TO-SPEECH TRANSLATION SYSTEM

A. Genietalk System Overview
Reflecting the users' demands discussed in the previous chapter, we developed "Genietalk," a multi-language speech-to-speech translation system. Genietalk, a network-based application, was designed to work on mobile devices using Android and iPhone OS, and it is available for everyone to download and use. Currently, speech-to-speech translation is available for Korean-English, Korean-Japanese and Korean-Chinese. The vocabulary recognized by each speech recognition engine is 230,000 words for Korean, 80,000 words for English, 130,000 words for Japanese and 330,000 words for Chinese. As for the translation engines, a pattern-based translation methodology was employed for Korean-English and Korean-Chinese, whereas an SMT-based translation methodology [17] was selected for the Korean-Japanese translation engine in consideration of the linguistic similarities between the two languages. For the speech synthesis engines, a commercial engine was chosen after considering the market conditions.

B. Speech Recognition Engine
The speech recognition engine for Genietalk was trained as an HMM-based acoustic model with a tri-gram language model, and the decoder was structured as a wFST (weighted Finite State Transducer) type.

The Korean acoustic model was trained with a 1,820-hour speech database. For the speech signal, features were extracted using a 20 ms window moving in 10 ms frame shifts, and a 39-dimensional MFCC feature vector was employed. The number of GMM (Gaussian Mixture Model) components following triphone tying was determined to be 32. One notable fact is that the speech recognition engine for speech-to-speech translation was built to contain as much dialog-style speech data recorded through the channel input of mobile devices as possible, since the engine is aimed at dialog-style speech recognition. In addition, to be robust against noisy environments, the engine was trained in an environment where actual noise at an SNR (signal-to-noise ratio) of 5 to 15 dB was randomly inserted into most of the training database. The acoustic models for English, Japanese and Chinese were also configured in the same manner; however, these models differed in the quantity of the database used for training. For Korean, all speech logs collected after launching the actual service were also employed for speech recognition engine training after going through a transcription process. Table II shows the amount of speech database used to train each language.
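The noise-robust training described in Section III-B mixes real noise into clean utterances at an SNR drawn from 5 to 15 dB. Below is a minimal sketch of such SNR-controlled mixing, assuming the standard power-ratio scaling; the paper does not give the exact procedure used for Genietalk, and the signals here are synthetic stand-ins for real recordings.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then mix."""
    noise = noise[: len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # SNR(dB) = 10 * log10(p_speech / (scale^2 * p_noise))  =>  solve for scale
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

rng = random.Random(0)
speech = [rng.gauss(0, 1) for _ in range(16000)]   # 1 s of fake 16 kHz audio
noise = [rng.gauss(0, 0.5) for _ in range(16000)]  # fake noise recording
snr = rng.uniform(5, 15)                           # target SNR drawn from 5-15 dB
noisy = mix_at_snr(speech, noise, snr)
```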
D. User Interface Composition
The user interface of Genietalk is composed as shown in Fig. 1.

…of the translated sentence by touching the icon (Fig. 1(e)), as shown in Fig. 2(b). Also, by displaying the pronunciation of the translated sentence in the native language of the user, it gives the user an opportunity to pronounce the sentence in the translation-target language. In particular, when the target language is Chinese, Japanese or Korean, which use Chinese characters, Kana or Hangul, this helps users produce the pronunciation even if they are unable to read the language.
Since the beginning of the Genietalk service, a monthly average of 2.9 million logs is currently being accumulated. Among these logs, speech-to-speech translation logs through speech recognition take up 35%, while the remaining 65% represent machine translation logs through text input. The next chapter explains how the performance of speech-to-speech translation is being improved by using the service logs collected as described above.

VI. PERFORMANCE IMPROVEMENT OF SPEECH-TO-SPEECH TRANSLATION LOG-BASED SPEECH RECOGNITION

A. Design of the Speech-to-speech Translation Log
Speech-to-speech translation logs can be categorized into three types based on their input. One is a speech file input by users for speech recognition; another is a text sentence input through the keyboard; and the third is the result of error reporting through the 'Report' function explained previously. Currently, the speech recognition result and the machine translation result are saved together when saving speech-to-speech translation logs, in order to make the entire translation process available for reference in accordance with the overall flow. At the same time, all speech-to-speech translation logs are saved along with the language locale information of the device, the device model information and a unique speaker ID. The language locale of the device is used as basic information on whether the input is made by a native speaker or a foreign-language speaker. The device model information is used to improve speech recognition performance considering the channel characteristics of each device, and the speaker information is used to improve speech recognition performance through speaker adaptation. In the future, a tailored translation function will be offered through automatic personal history management using the information discussed above.

B. Performance Improvement of the Speech Recognition Engine through Text Logs
Improvement of speech recognition engine performance through text logs was conducted for the Korean speech recognition engine. By adding the massive text data collected from the speech-to-speech translation service to the language model training, it was designed to reflect speech-to-

Refined Group: Refinement work was implemented in order to revise sentences with a 'subject-predicate' format into complete sentences.
- For sentences qualifying for the refined group, spelling was checked and proper word spacing was applied (however, generally accepted expressions were not revised even if they did not comply with the standard spelling system, nor were proper nouns).
- Spontaneous exclamations and filled pauses were eliminated. Among the symbols included in a sentence, emoticons and symbols used only on the internet were removed, while generally accepted symbols were allowed.

Meaningless Group: Incomplete sentences and meaningless expressions were categorized into the meaningless group.
- Sentences lacking a basic sentence format, without final endings, a subject or a predicate.
- Sentences for which, despite a proper sentence format, it is hard to find a generally accepted meaning.

Korean sentences mixed with foreign language, numbers and special characters:
- Sentences containing foreign language, numbers and special characters not generally used along with Korean.

Abbreviations and internet terminology:
- If a sentence can be deemed to belong to the refined group once abbreviations or chatting language are eliminated, the sentence is classified into the refined group following revision.

Lascivious expressions and abusive language.

Following the classification, the refined group took up 72.1% and the meaningless group 22.48%, while the rest accounted for a mere 5.42%.

Second, automatic refinement was conducted on 10.41 million sentences from the Korean logs collected over 6 months. Automatic refinement was performed using an automatic correction tool based on the Korean spelling system and word spacing practice. During this process, sentences with emoticons, special characters, foreign language, numbers, symbols and meaningless repetition of Korean were excluded from refinement. The sentences excluded through this method represented approximately 6% of the total. After adding the 10.29 million sentences extracted through the two methods to the training corpus, the training was conducted again. Based on the
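The log design of Section VI-A can be pictured as one record per utterance. The field names below are hypothetical illustrations; the paper specifies only which pieces of information are stored (recognition and translation results, device locale, device model, speaker ID), not a schema, and the sample values are invented.

```python
import json

def make_log_record(asr_result, mt_result, locale, device_model, speaker_id):
    """Assemble one speech-to-speech translation log entry (illustrative schema)."""
    return {
        "asr_result": asr_result,      # speech recognition output
        "mt_result": mt_result,        # machine translation output
        "locale": locale,              # native- vs. foreign-speaker heuristic
        "device_model": device_model,  # per-device channel compensation
        "speaker_id": speaker_id,      # speaker adaptation / personal history
    }

# Invented sample values for illustration only.
record = make_log_record("공항까지 가 주세요", "Please take me to the airport",
                         "ko_KR", "example-phone-model", "u-102934")
print(json.dumps(record, ensure_ascii=False))
```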
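The automatic refinement step of Section VI-B can be approximated by a character-class filter. The regular expression below is an illustrative stand-in for the authors' correction tool, which additionally checked spelling and word spacing; the exact exclusion rules are not given in the paper.

```python
import re

# Allow Hangul syllables, whitespace, and common sentence punctuation only.
# Jamo-only strings ("ㅋㅋㅋ"), Latin text, digits and emoticons fall outside
# the U+AC00-U+D7A3 syllable range and are therefore excluded.
ALLOWED = re.compile(r"[가-힣\s.,?!]+")

def refine(log_sentences):
    """Split raw text logs into kept (LM-training) and excluded sentences."""
    kept, excluded = [], []
    for sent in log_sentences:
        sent = " ".join(sent.split())  # normalize whitespace
        if sent and ALLOWED.fullmatch(sent):
            kept.append(sent)
        else:
            excluded.append(sent)      # emoticons, digits, foreign words...
    return kept, excluded

logs = ["안녕하세요?", "공항까지 가 주세요.", "ㅋㅋㅋ ^^;;", "wifi 돼요?"]
kept, excluded = refine(logs)
```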
[7] J. Hong, S. Jeong and M. Hahn, "Wiener filter-based echo suppression and beamforming for intelligent TV interface," IEEE Trans. Consumer Electron., vol. 59, no. 4, pp. 825-832, Nov. 2013.
[8] J. Park, G. Jang and J. Kim, "Multistage utterance verification for keyword recognition-based online spoken content retrieval," IEEE Trans. Consumer Electron., vol. 58, no. 3, pp. 1000-1005, Aug. 2012.
[9] D. Yang, Y. Pan and S. Furui, "Vocabulary expansion through automatic abbreviation generation for Chinese voice search," Computer Speech and Language, vol. 26, no. 5, pp. 321-335, Oct. 2012.
[10] L. Levin, D. Gates, A. Lavie and A. Waibel, "An interlingua based on domain actions for machine translation of task-oriented dialogues," in Proc. International Conference on Spoken Language Processing, Sydney, Australia, pp. 1155-1158, Nov. 1998.
[11] A. Lavie, F. Pianesi and L. Levin, "The NESPOLE! system for multilingual speech communication over the internet," IEEE Trans. Speech and Aud. Proc., vol. 14, no. 5, pp. 1664-1673, Sep. 2006.
[12] Y. Al-Onaizan and L. Mangu, "Arabic ASR and MT integration for GALE," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, USA, pp. 1285-1288, Apr. 2007.
[13] M. Frandsen, S. Riehemann and K. Precoda, "IraqComm and FlexTrans: A speech translation system and flexible framework," in Proc. International Speech Communication Association, Antwerp, Belgium, pp. 2324-2327, Aug. 2007.
[14] J. Shin, P. Georgiou and S. Narayanan, "Enabling effective design of multimodal interfaces for speech-to-speech translation system: An empirical study of longitudinal user behaviors over time and user strategies for coping with errors," Computer Speech and Language, vol. 27, no. 2, pp. 554-571, Feb. 2013.
[15] S. Raybaud, D. Langlois and K. Smaili, "Broadcast news speech-to-text translation experiments," in Proc. the Thirteenth Machine Translation Summit, Xiamen, China, pp. 378-381, Sep. 2011.
[16] M. Heck, S. Stuker and A. Waibel, "A hybrid phonotactic language identification system with an SVM back-end for simultaneous lecture translation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, pp. 4857-4860, Mar. 2012.
[17] C. Callison-Burch, P. Koehn, C. Monz and O. Zaidan, "Findings of the 2011 workshop on statistical machine translation," in Proc. the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, pp. 22-64, Jul. 2011.
[18] K. Genichiro, S. Eiichiro, T. Toshiyuki and Y. Seiichi, "Creating corpora for speech-to-speech translation," in Proc. EUROSPEECH, Geneva, Switzerland, pp. 381-384, Sep. 2003.

BIOGRAPHIES

Seung Yun received the M.A. degree in Korean Informatics from Yonsei University, Seoul, Korea in 2001. He is currently a Ph.D. candidate in Computer Software at University of Science and Technology, Daejeon, Korea. Since 2001, he has been working for ETRI. His current research interests include speech-to-speech translation, speech database, speech recognition, and human-machine interface.

Young-Jik Lee received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, the M.S. degree in electrical engineering from Korea Advanced Institute of Science and Technology, Seoul, Korea and the Ph.D. degree in electrical engineering from the Polytechnic University, Brooklyn, New York, USA. From 1981 to 1985 he was with Samsung Electronics Company, where he developed video display terminals. From 1985 to 1988 he worked on sensor array signal processing. Since 1989, he has been with the Electronics and Telecommunications Research Institute, pursuing research in speech recognition, speech synthesis, speech translation, machine translation, information retrieval, multimodal interfaces, digital contents, computer graphics, computer vision, pattern recognition, neural networks, and digital signal processing.

Sang-Hun Kim received the B.S. degree in Electrical Engineering from Yonsei University, Seoul, Korea in 1990, the M.S. degree in Electrical Engineering and Electronic Engineering from KAIST, Daejeon, Korea in 1992 and the Ph.D. degree in the Department of Electrical, Electronic and Information Communication Engineering from the University of Tokyo, Japan in 2003. Since 1992, he has been working for ETRI. His interests are speech translation, spoken language understanding and multi-modal information processing.