Multilingual Speech-To-Speech Translation System For Mobile Consumer Devices
IEEE Transactions on Consumer Electronics, Vol. 60, No. 3, August 2014
Abstract — Along with the advancement of speech recognition technology and machine translation technology, in addition to the fast distribution of mobile devices, speech-to-speech translation technology is no longer merely a subject of research, as it has become popularized among many users. To develop a speech-to-speech translation system that can be widely used, however, the system needs to reflect the various characteristics of utterances by the users who will actually use it, beyond improving the basic functions under an experimental environment. Based on a survey of users' demands, this study established a massive language and speech database that is as close as possible to the environment in which a speech-to-speech translation device is actually used, by mobilizing a large number of speakers. Through this work, it was possible to secure excellent basic performance in an environment similar to the real speech-to-speech translation environment, rather than only under experimental conditions. Moreover, a user-friendly speech-to-speech translation UI was designed, and at the same time translation errors were reduced as many measures to enhance user satisfaction were employed. After launching the actual service, the massive database collected through the service was additionally applied to the system, following a filtering process, in order to achieve the best possible robustness to both the content and the environment of users' utterances. By applying these measures, this study presents the procedures through which a multilingual speech-to-speech translation system was successfully developed for mobile devices.¹

Index Terms — Speech-to-speech translation system, speech recognition, machine translation, human-computer interface.

I. INTRODUCTION

Speech recognition technology may be the most notable user-friendly interface, as it has been widely depicted in numerous movies and novels. Because of its user-friendliness, speech recognition technology has been studied since the 1960s; however, the technology only began to be used in practice in the 1990s. Since the 2000s, speech recognition technology has become popularized as the collection of corpora was made possible through the internet, while computing power advanced remarkably [1], [2]. Lately, starting with automobile navigation systems [3], speech recognition technology has been applied to various devices including digital cameras [4], smart robots [5], refrigerators and smart TVs [6], [7]. In particular, as mobile terminals with a built-in microphone and wireless data network, namely smart phones, became rapidly popularized, speech recognition technology is being applied to mobile terminals in a wide variety of forms, including voice search services [8], [9] and personal secretary services. One of the most notable examples is speech-to-speech translation technology, in which speech recognition technology, machine translation technology and speech synthesis technology all come together.

Speech-to-speech translation technology automatically translates one language into another language in order to enable communication between two parties with different native tongues. To translate a voice in one language into a voice in a different language, speech-to-speech translation is comprised of the three core technologies mentioned above: speech recognition technology, which recognizes the utterance of a human and converts it into text; machine translation technology, which translates text in one language into text in another language; and speech synthesis technology, which converts the translated text into speech. Additionally, natural language understanding technology and the user interface-related technology integrated with the UI also play an important role in a speech-to-speech translation system.

Speech-to-speech translation technology has been intensively studied since the 1990s. The technology has been developed mostly through international joint studies such as C-STAR [10], and a number of studies on speech-to-speech translation technology have been conducted through various international joint research projects including NESPOLE! [11], TC-STAR and GALE [12]. However, the technology was not widely used by the general public at that time, and it is only now slowly spreading as the technology has matured in recent years.

Speech-to-speech translation technology has been used by the U.S. Armed Forces in Iraq for military use [13], and the LA Police Department has also employed the speech-to-speech

¹ This work was supported by the Ministry of Science, ICT and Future Planning, Korea, under the ICT R&D Program (Strengthening competitiveness of automatic translation industry for realizing language barrier-free Korea).
Seung Yun is with the Department of Computer Software, University of Science and Technology, Daejeon, Korea. He is also with the Automatic Speech Translation Section, ETRI, Daejeon, 305-700, Korea (e-mail: syun@etri.re.kr).
Young-Jik Lee and Sang-Hun Kim are with the Automatic Speech Translation Section, ETRI, Daejeon, 305-700, Korea (e-mail: ylee@etri.re.kr, ksh@etri.re.kr).
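The three core technologies named in the Introduction form a simple sequential pipeline. The sketch below illustrates only that flow; the function bodies are toy stand-ins (a fake recognizer and a one-entry dictionary), not Genietalk's actual engines.

```python
# Illustrative ASR -> MT -> TTS pipeline. All three "engines" are toy
# placeholders; only the stage ordering reflects the system described here.

def recognize(audio, lang):
    # Stand-in ASR: pretend the audio object carries its own transcript.
    return audio["transcript"]

TOY_DICT = {("ko", "en"): {"안녕하세요": "Hello"}}

def translate(text, src, tgt):
    # Stand-in MT: a one-entry dictionary lookup; unknown text passes through.
    return TOY_DICT[(src, tgt)].get(text, text)

def synthesize(text, lang):
    # Stand-in TTS: tag the text instead of producing a waveform.
    return f"<speech lang={lang}>{text}</speech>"

def speech_to_speech(audio, src, tgt):
    source_text = recognize(audio, src)             # 1) speech -> text
    target_text = translate(source_text, src, tgt)  # 2) text   -> text
    return synthesize(target_text, tgt)             # 3) text   -> speech

print(speech_to_speech({"transcript": "안녕하세요"}, "ko", "en"))
```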
Contributed Paper
Manuscript received 06/30/14
Current version published 09/23/14
Electronic version published 09/23/14. 0098 3063/14/$20.00 © 2014 IEEE
followed by laptop computers and the speech-to-speech translation-only device. The rationale behind this finding is the portability of the mobile device, which is also easy to use once a user becomes accustomed to its operation. At the same time, the urge to lighten the weight of luggage when travelling also seemed to have an impact.

In the most anticipated category, the level of expectation toward the performance of speech-to-speech translation, 6 out of 25 FGI participants expected 100%, followed by 6 people with 95%, 6 with 90% and 4 with below 90%. This figure was higher than first anticipated; however, it should be understood that these levels of expectation concerned not the general translation of all sentences but specific situations such as travelling and tourism. Based on these findings, it can be concluded that a speech-to-speech translation device can be commercialized when its performance exceeds 90% on the utterances that people generally need in travelling/tourism situations.

B. Definition of Speech-to-speech Translation System Configuration Based on the Survey of Users' Demands

According to the users' demands discussed in the previous chapter, the smart phone was selected, since it is not only easy to carry around but also the most popular device among all speech-to-speech translation devices. At the same time, as data communication became widespread and the burden of using the wireless communication network lessened, the system was designed to perform the actual speech-to-speech translation through communication with servers, after embedding the speech-to-speech translation engine in a high-performance server. Also, considering the users' high demand for multi-language service, it was decided to start with the Korean-English speech-to-speech translation service and to expand to Korean-Japanese and Korean-Chinese services. The available languages for speech-to-speech translation will continue to increase with French, Spanish, Russian and German.

The next most important issue is the performance of the speech-to-speech translation engine. To satisfy every participant of the survey, the speech-to-speech translation success rate would have to be 100%; however, that number is impossible to achieve, because speech recognition performance cannot be perfect, and the performance is bound to deteriorate further after going through the machine translation procedure. Thus, after removing the users who expected a 100% speech-to-speech translation success rate, the remaining participants were included as candidate users of a speech-to-speech translation device. If the translation success rate reaches 90%, it is expected to secure 70% of the potential users who are interested in speech-to-speech translation. Therefore, the system was developed with a goal of achieving a 90% translation success rate. To achieve this goal, we focused the development of the speech-to-speech translation system on travelling/daily life, and the system was designed to satisfy the users' expectations by building a massive corpus for the language model in the field of travelling/daily life through various means. The following series of procedures will be thoroughly detailed in the next chapter. The input method for speech-to-speech translation was designed to enable both speech recognition and keypad input, while an advanced search function, for example for the sentences based on the words and sentences most demanded by the users, was included in the scope of development.

III. STRUCTURE OF SPEECH-TO-SPEECH TRANSLATION SYSTEM

A. Genietalk System Overview
Reflecting the users' demands discussed in the previous chapter, we developed "Genietalk," a multi-language speech-to-speech translation system. Genietalk, a network-based application, was designed to work on mobile devices using Android and iPhone OS, and it is available for everyone to download and use. Currently, speech-to-speech translation is available for Korean-English, Korean-Japanese and Korean-Chinese. The vocabulary recognized by each speech recognition engine is 230,000 words for Korean, 80,000 words for English, 130,000 words for Japanese and 330,000 words for Chinese. As for the translation engines, a pattern-based translation methodology was employed for Korean-English and Korean-Chinese, whereas an SMT-based translation methodology [17] was selected for the Korean-Japanese translation engine in consideration of the linguistic similarities between the two languages. For the speech synthesis engines, a commercial engine was chosen after considering the market conditions.

B. Speech Recognition Engine
The speech recognition engine for Genietalk was trained as an HMM-based acoustic model with a tri-gram language model, and the decoder was structured as a wFST (weighted Finite State Transducer) type.

The Korean acoustic model was trained with a 1,820-hour speech database. For the speech signal, features were extracted using a 20 ms window moving in 10 ms frame shifts, and a 39-dimensional MFCC feature vector was employed. The number of GMM (Gaussian Mixture Model) components following triphone tying was determined to be 32. One notable fact is that the speech recognition engine for speech-to-speech translation was built to contain as much dialog-style speech data recorded through the channel input of mobile devices as possible, since the engine is aimed at dialog-style speech recognition. In addition, to be robust against noisy environments, the engine was trained in an environment where actual noise at an SNR (signal-to-noise ratio) of 5 to 15 dB was randomly inserted into most of the training database. The acoustic models for English, Japanese and Chinese were also configured in the same manner; however, these models differed in the quantity of the database used for training. For Korean, all speech logs collected after launching the actual service were also employed for speech recognition engine training after going through a transcription process. Table II shows the amount of speech database used to train each language.
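The noise-robust training described in Section III-B mixes real noise into clean utterances at an SNR drawn from 5 to 15 dB. Below is a minimal sketch of such SNR-controlled mixing, assuming the standard power-ratio scaling; the paper does not give the exact procedure used for Genietalk, and the signals here are synthetic stand-ins for real recordings.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then mix."""
    noise = noise[: len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # SNR(dB) = 10 * log10(p_speech / (scale^2 * p_noise))  =>  solve for scale
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

rng = random.Random(0)
speech = [rng.gauss(0, 1) for _ in range(16000)]   # 1 s of fake 16 kHz audio
noise = [rng.gauss(0, 0.5) for _ in range(16000)]  # fake noise recording
snr = rng.uniform(5, 15)                           # target SNR drawn from 5-15 dB
noisy = mix_at_snr(speech, noise, snr)
```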
D. User Interface Composition
The user interface of Genietalk is composed as shown in Fig. 1.

…of the translated sentence by touching the icon (Fig. 1(e)), as shown in Fig. 2(b). Also, by displaying the pronunciation of the translated sentence in the native language of the user, it gives the user an opportunity to pronounce the sentence in the translation-target language. In particular, when the target language is Chinese, Japanese or Korean, which use Chinese characters, Kana or Hangul, this helps users produce the pronunciation even if they are unable to read the language.
Since the beginning of the Genietalk service, a monthly average of 2.9 million logs is currently being accumulated. Among these logs, speech-to-speech translation logs through speech recognition take up 35%, while the remaining 65% represent machine translation logs through text input. The next chapter explains how the performance of speech-to-speech translation is being improved by using the service logs collected as described above.

VI. PERFORMANCE IMPROVEMENT OF SPEECH-TO-SPEECH TRANSLATION LOG-BASED SPEECH RECOGNITION

A. Design of the Speech-to-speech Translation Log
Speech-to-speech translation logs can be categorized into three types based on their input. One is a speech file input by users for speech recognition; another is a text sentence input through the keyboard; and the third is the result of error reporting through the 'Report' function explained previously. Currently, the speech recognition result and the machine translation result are saved together when saving speech-to-speech translation logs, in order to make the entire translation process available for reference in accordance with the overall flow. At the same time, all speech-to-speech translation logs are saved along with the language locale information of the device, the device model information and a unique speaker ID. The language locale of the device is used as basic information on whether the input is made by a native speaker or a foreign-language speaker. The device model information is used to improve speech recognition performance considering the channel characteristics of each device, and the speaker information is used to improve speech recognition performance through speaker adaptation. In the future, a tailored translation function will be offered through automatic personal history management using the information discussed above.

B. Performance Improvement of the Speech Recognition Engine through Text Logs
Improvement of speech recognition engine performance through text logs was conducted for the Korean speech recognition engine. By adding the massive text data collected from the speech-to-speech translation service to the language model training, it was designed to reflect speech-to-

Refined Group: Refinement work was implemented in order to revise sentences with a 'subject-predicate' format into complete sentences.
- For sentences qualifying for the refined group, spelling was checked and proper word spacing was applied (however, generally accepted expressions were not revised even if they did not comply with the standard spelling system, nor were proper nouns).
- Spontaneous exclamations and filled pauses were eliminated. Among the symbols included in a sentence, emoticons and symbols used only on the internet were removed, while generally accepted symbols were allowed.

Meaningless Group: Incomplete sentences and meaningless expressions were categorized into the meaningless group.
- Sentences lacking a basic sentence format, without final endings, a subject or a predicate.
- Sentences for which, despite a proper sentence format, it is hard to find a generally accepted meaning.

Korean sentences mixed with foreign language, numbers and special characters:
- Sentences containing foreign language, numbers and special characters not generally used along with Korean.

Abbreviations and internet terminology:
- If a sentence can be deemed to belong to the refined group once abbreviations or chatting language are eliminated, the sentence is classified into the refined group following revision.

Lascivious expressions and abusive language.

Following the classification, the refined group took up 72.1% and the meaningless group 22.48%, while the rest accounted for a mere 5.42%.

Second, automatic refinement was conducted on 10.41 million sentences from the Korean logs collected over 6 months. Automatic refinement was performed using an automatic correction tool based on the Korean spelling system and word spacing practice. During this process, sentences with emoticons, special characters, foreign language, numbers, symbols and meaningless repetition of Korean were excluded from refinement. The sentences excluded through this method represented approximately 6% of the total. After adding the 10.29 million sentences extracted through the two methods to the training corpus, the training was conducted again. Based on the
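The log design of Section VI-A can be pictured as one record per utterance. The field names below are hypothetical illustrations; the paper specifies only which pieces of information are stored (recognition and translation results, device locale, device model, speaker ID), not a schema, and the sample values are invented.

```python
import json

def make_log_record(asr_result, mt_result, locale, device_model, speaker_id):
    """Assemble one speech-to-speech translation log entry (illustrative schema)."""
    return {
        "asr_result": asr_result,      # speech recognition output
        "mt_result": mt_result,        # machine translation output
        "locale": locale,              # native- vs. foreign-speaker heuristic
        "device_model": device_model,  # per-device channel compensation
        "speaker_id": speaker_id,      # speaker adaptation / personal history
    }

# Invented sample values for illustration only.
record = make_log_record("공항까지 가 주세요", "Please take me to the airport",
                         "ko_KR", "example-phone-model", "u-102934")
print(json.dumps(record, ensure_ascii=False))
```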
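The automatic refinement step of Section VI-B can be approximated by a character-class filter. The regular expression below is an illustrative stand-in for the authors' correction tool, which additionally checked spelling and word spacing; the exact exclusion rules are not given in the paper.

```python
import re

# Allow Hangul syllables, whitespace, and common sentence punctuation only.
# Jamo-only strings ("ㅋㅋㅋ"), Latin text, digits and emoticons fall outside
# the U+AC00-U+D7A3 syllable range and are therefore excluded.
ALLOWED = re.compile(r"[가-힣\s.,?!]+")

def refine(log_sentences):
    """Split raw text logs into kept (LM-training) and excluded sentences."""
    kept, excluded = [], []
    for sent in log_sentences:
        sent = " ".join(sent.split())  # normalize whitespace
        if sent and ALLOWED.fullmatch(sent):
            kept.append(sent)
        else:
            excluded.append(sent)      # emoticons, digits, foreign words...
    return kept, excluded

logs = ["안녕하세요?", "공항까지 가 주세요.", "ㅋㅋㅋ ^^;;", "wifi 돼요?"]
kept, excluded = refine(logs)
```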
[7] J. Hong, S. Jeong and M. Hahn, "Wiener filter-based echo suppression and beamforming for intelligent TV interface," IEEE Trans. Consumer Electron., vol. 59, no. 4, pp. 825-832, Nov. 2013.
[8] J. Park, G. Jang and J. Kim, "Multistage utterance verification for keyword recognition-based online spoken content retrieval," IEEE Trans. Consumer Electron., vol. 58, no. 3, pp. 1000-1005, Aug. 2012.
[9] D. Yang, Y. Pan and S. Furui, "Vocabulary expansion through automatic abbreviation generation for Chinese voice search," Computer Speech and Language, vol. 26, no. 5, pp. 321-335, Oct. 2012.
[10] L. Levin, D. Gates, A. Lavie and A. Waibel, "An interlingua based on domain actions for machine translation of task-oriented dialogues," in Proc. International Conference on Spoken Language Processing, Sydney, Australia, pp. 1155-1158, Nov. 1998.
[11] A. Lavie, F. Pianesi and L. Levin, "The NESPOLE! system for multilingual speech communication over the internet," IEEE Trans. Speech and Aud. Proc., vol. 14, no. 5, pp. 1664-1673, Sep. 2006.
[12] Y. Al-Onaizan and L. Mangu, "Arabic ASR and MT integration for GALE," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, USA, pp. 1285-1288, Apr. 2007.
[13] M. Frandsen, S. Riehemann and K. Precoda, "IraqComm and FlexTrans: A speech translation system and flexible framework," in Proc. International Speech Communication Association, Antwerp, Belgium, pp. 2324-2327, Aug. 2007.
[14] J. Shin, P. Georgiou and S. Narayanan, "Enabling effective design of multimodal interfaces for speech-to-speech translation system: An empirical study of longitudinal user behaviors over time and user strategies for coping with errors," Computer Speech and Language, vol. 27, no. 2, pp. 554-571, Feb. 2013.
[15] S. Raybaud, D. Langlois and K. Smaili, "Broadcast news speech-to-text translation experiments," in Proc. the Thirteenth Machine Translation Summit, Xiamen, China, pp. 378-381, Sep. 2011.
[16] M. Heck, S. Stuker and A. Waibel, "A hybrid phonotactic language identification system with an SVM back-end for simultaneous lecture translation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, pp. 4857-4860, Mar. 2012.
[17] C. Callison-Burch, P. Koehn, C. Monz and O. Zaidan, "Findings of the 2011 workshop on statistical machine translation," in Proc. the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, pp. 22-64, Jul. 2011.
[18] K. Genichiro, S. Eiichiro, T. Toshiyuki and Y. Seiichi, "Creating corpora for speech-to-speech translation," in Proc. EUROSPEECH, Geneva, Switzerland, pp. 381-384, Sep. 2003.

BIOGRAPHIES

Seung Yun received the M.A. degree in Korean Informatics from Yonsei University, Seoul, Korea in 2001. He is currently a Ph.D. candidate in Computer Software at University of Science and Technology, Daejeon, Korea. Since 2001, he has been working for ETRI. His current research interests include speech-to-speech translation, speech database, speech recognition, and human-machine interface.

Young-Jik Lee received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, the M.S. degree in electrical engineering from Korea Advanced Institute of Science and Technology, Seoul, Korea and the Ph.D. degree in electrical engineering from the Polytechnic University, Brooklyn, New York, USA. From 1981 to 1985 he was with Samsung Electronics Company, where he developed video display terminals. From 1985 to 1988 he worked on sensor array signal processing. Since 1989, he has been with the Electronics and Telecommunications Research Institute, pursuing research in speech recognition, speech synthesis, speech translation, machine translation, information retrieval, multimodal interfaces, digital contents, computer graphics, computer vision, pattern recognition, neural networks, and digital signal processing.

Sang-Hun Kim received the B.S. degree in Electrical Engineering from Yonsei University, Seoul, Korea in 1990, the M.S. degree in Electrical Engineering and Electronic Engineering from KAIST, Daejeon, Korea in 1992 and the Ph.D. degree in the Department of Electrical, Electronic and Information Communication Engineering from the University of Tokyo, Japan in 2003. Since 1992, he has been working for ETRI. His interests are speech translation, spoken language understanding and multi-modal information processing.