Individual Project - Mason Leary

Mason Leary
Professor Adam Wooten
Introduction to Computer Assisted Translation
14 November 2014
The History of Speech-to-Speech Translation Technology
Speech-to-speech machine translation, also referred to as spoken language translation or
voice translation, is only a recent development in the realm of machine translation. The concept
was first presented at the 1983 ITU Telecom World conference by the Japanese company NEC
Corporation (Nakamura, 2). Researchers knew it would take years or even decades to implement
speech-to-speech translation technology, and as a result, the Advanced Telecommunications
Research Institute International (ATR) was founded in 1986 and began a project on
speech-to-speech translation research. The project included researchers both in Japan and from
around the world. In 1993, the first experiment in speech-to-speech translation occurred. The
experiment linked three sites around the world: the ATR, Carnegie Mellon University and
Siemens. With the start of the ATR project, other projects began to spring up around the world.
One of these projects was the Verbmobil project in Germany. Germany's Federal Ministry of
Research and Technology funded the project with 65 million Marks (approximately
$41 million) and private investors from the industry gave 31 million Marks (about $19 million)
(Der Spiegel 1). The project was headed by Wolfgang Wahlster with a team of around 100
colleagues. The project began in 1993 and ran until 2000. The goal of the project was to create
a machine that understands spoken German or Japanese and correctly translates it into English.
The realm of what was to be translated was limited to reservation systems.

The next large project in speech-to-speech translation technology came in 2003. Carnegie
Mellon University launched the STR-DUST (Speech Translation: Domain-Unlimited,
Spontaneous and Trainable) project. The goal was to explore translation under the added
challenge of domain independence and spontaneous language (Fügen et al. 217). The project
used spontaneous lectures and meetings in order to integrate them into the domain-unlimited
translation system. It also focused on highly disfluent speech, e.g. dialogs and spontaneous
speech without domain restrictions.
In 2004, NESPOLE! was developed as a collaborative project between teams at Italian,
German, French and U.S. universities (Lavie et al. 1). The project was funded by the European
Commission and the U.S. National Science Foundation. The system allowed for an Italian
speaker to communicate with either an English, German or French speaker. This system was
limited to the domain of travel planning and medical assistance, and new groundbreaking aspects
were introduced by this project. The hardware of the system did not sit on the users' computers
but instead on language-specific servers around the world which can be developed and
maintained independently. The system also used a new interlingua which is sufficiently rich to
capture speaker intention, simple enough to be used reliably by developers of different languages
working independently, and flexible enough to support advanced techniques for analysis and
generation. New approaches to language analysis into the interlingua representation were also
used as well as integration of speech translation with multi-modal communication.
The European Commission funded an additional speech-to-speech translation project in
2004, TC-STAR (Technologies and Corpora for Speech-to-Speech-Translation) (Fügen et al.
217). The purpose of this project was to advance the performance of all core technologies
necessary for speech-to-speech translation, e.g. machine translation and text-to-speech technology.
These core technologies needed to achieve high performance in order to enable more efficient
domain-independent speech-to-speech translation systems, and the gap between human and
machine translation performance needed to be reduced. The project was a success after only
three years of research, with usable systems being developed for European English, European
Spanish and Mandarin Chinese.
On March 18th, 2005, the U.S. Defense Advanced Research Projects Agency started a
new project called GALE (Global Autonomous Language Exploitation) (Estival 179). The goal
of this project was to [eliminate] the need for linguists and analysts [and to]
automatically interpret huge volumes of speech and text in multiple languages. The U.S.
Government, particularly the military, needed computer software to absorb, analyze and
interpret huge volumes of speech and text in multiple languages (Fügen et al. 218). This project
was not full speech-to-speech translation but instead speech-to-text translation. The input
speaker spoke in either Arabic or Mandarin Chinese, and the English output was in the form of
text. The output text was also consolidated and easy to understand.
Once smartphones became popular and more ubiquitous in 2007, many speech-to-speech
translation system developers began to focus on this medium.

The Future of Speech-to-Speech Translation Technology

Today, there are many smartphone apps for speech-to-speech translation, such as iTranslate
Voice 2, Jibbigo Translation App, Voice Translator, and Google Translate. One product, called
SIGMO, goes one step further and has the user buy an additional device which works as a
microphone and speaker system. The device is connected to the user's smartphone via Bluetooth
technology, so the speakers do not have to interact directly with the smartphone or
app. The speech-to-speech translation industry is growing rapidly and technologies are being
developed in all kinds of media.
A recent development in speech-to-speech translation technology comes from a
partnering of Microsoft and Skype. On July 21, 2014, Microsoft and Skype gave an early
demonstration of their beta speech-to-speech translation system at the Microsoft Worldwide
Partner Conference (Skype). The demonstration showed two people having a conversation
with one speaker using English and the other speaker using German. It is very clear, though, that
the technology is only in beta mode and has a long way to go before being considered smooth,
fully-automated, high-quality speech-to-speech translation. The speech recognition software
used in this technology had difficulty understanding fully what the speakers were uttering. There
were also issues with the machine translation. The system often stuck with the syntax of the
source utterance when translating into the target language. Segmentation of sentences also
proved to be a problem for the technology.
In order to make this technology work, Microsoft is utilizing two fairly complex
technologies, machine translation and speech recognition software (Enabling). Microsoft
teamed up with designers and engineers from Skype's prototyping department to develop a
natural user experience. By utilizing the data input via Microsoft's Bing translator and speech
recognition technologies, Microsoft is able to fine-tune its model-based training approach. The
engineers are also trying to address disfluency, the difference between the way people talk and write.
At Microsoft's Beijing and Redmond labs, the team has made great advances in their speech
recognition technology.
One of the biggest game changers in Microsoft's speech-to-speech translation system is
the use of deep neural networks. Deep neural networks are more efficient because they are
"deep architectures which have the capacity to learn more complex models than shallow ones [and]
this expressivity and robust training algorithms allow for learning powerful object
representations without the need to hand design features" (Szegedy, 1).
IBM is developing a speech-to-speech translation system, too. The technology consists
of three parts: a speech recognition system, a text-to-text system and a text-to-speech system
(Hyman, 17). The team of researchers is working to make all three components behave well,
both individually and together. For speech recognition, the team needs to reduce the error rate and
enable the system to detect unarticulated utterances. The team has already improved error rates by
40% since last year. For the text-to-text system, the IBM team is trying to improve the handling of
out-of-vocabulary words, i.e. dialect and slang. To this end, the team introduced a dialogue
manager which prompts the speaker to clarify an unknown word. So far, the system is able to
detect its inability to recognize a word 80% of the time. The team is hoping to get that up to 95%
over the next few years.
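As a rough illustration of this three-part design, the following Python sketch chains a speech
recognizer, a text-to-text translator and a speech synthesizer, with a simple dialogue manager that
asks the speaker to clarify any out-of-vocabulary word. It is only a sketch under stated
assumptions: every function name, the toy lexicon and the plain-text stand-ins for audio are
hypothetical and are not taken from IBM's system.

```python
from typing import Callable

# Toy German-to-English lexicon (hypothetical stand-in for a real translation model).
LEXICON = {"der": "the", "hund": "dog", "isst": "eats", "einen": "an", "apfel": "apple"}


def recognize_speech(audio: str) -> list[str]:
    """Stage 1 (stub): treat the 'audio' as a transcript and split it into words."""
    return audio.lower().strip(".?!").split()


def translate_text(tokens: list[str], clarify: Callable[[str], str]) -> str:
    """Stage 2 (stub): word-by-word lookup with a dialogue-manager fallback.

    When a word is out of vocabulary, the clarify callback is asked for a
    replacement instead of the system guessing silently.
    """
    translated = []
    for token in tokens:
        if token in LEXICON:
            translated.append(LEXICON[token])
        else:
            translated.append(clarify(token))
    return " ".join(translated)


def synthesize_speech(text: str) -> str:
    """Stage 3 (stub): wrap the text to mark it as synthesized output."""
    return f"<spoken>{text}</spoken>"


def speech_to_speech(audio: str, clarify: Callable[[str], str]) -> str:
    """Run the three stages in sequence."""
    tokens = recognize_speech(audio)
    translated = translate_text(tokens, clarify)
    return synthesize_speech(translated)


if __name__ == "__main__":
    # The callback here simply returns the unknown word unchanged.
    print(speech_to_speech("Der Hund isst einen Apfel.", lambda word: word))
```

Because each stage is a separate function, an error in one stage can be examined on its own, which
mirrors the goal of making the components behave well both individually and together.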
Google is another player in the speech-to-speech translation system game. They have
been developing their machine translation system for over ten years and now offer
speech-to-speech translation on their smartphone app. The app can translate between 72 languages and
receives over a billion translations a day (18). These large amounts of data being uploaded daily
allow Google to automatically create dictionaries from that data. Google is reluctant,
however, to speculate on a timeline of when they will have fully functioning, smooth
speech-to-speech translation. They do believe, though, that they are really close.
AT&T is also working on a speech-to-speech translation system. This team is hoping to
integrate all aspects of the process into a single step (AT&T). This is possible because the
company already has a high-quality speech recognition system called WATSON ASR, as well as
a natural-sounding text-to-speech system called Natural Voices. The system also uses many
recognition possibilities and is constantly extracting from large datasets in various domains to
increase its corpora. The system is capable of being cloud-based and device-based.
All components of speech-to-speech translation technology are being researched by
hundreds of teams around the world. Each year we are getting better speech-to-speech
translation systems thanks to the combined efforts of these teams.

The Pitfalls of Speech-to-Speech Translation Technology

All three components that make up a speech-to-speech translation system are still lacking
in their abilities to perform with high quality.
The first component of a speech-to-speech translation system is a speech recognizer.
Many issues still need to be addressed with this component. Speech recognizers need to be able to
understand a variety of accents and need to know where to properly segment utterances. They also
need to be able to filter out background noises and voices.
The second component is machine translation. Machine translation still has a hard time
recognizing context clues and syntax. To improve, the technology will need to understand how
syntax and context affect meaning in each language.
The third component is a text-to-speech synthesizer. Many of these technologies produce
very unnatural sounding output which borders on unintelligibility. Many attempts have been
made to synthesize the speaker's voice, but this has a long way to go.
All of these pitfalls need to be overcome if high-quality, fully-automated, smooth
speech-to-speech translation is to be achieved.

Experiments with Speech-to-Speech Translation Systems

I experimented with three free smartphone speech-to-speech translation apps: Jibbigo,
Google Translate and Voice Translator Free. I gave each test sentence a rating of 1, 2 or
3, with 1 being the worst. For each system, I spoke the following German sentences for
translation into English, for the following reasons:
1. Wie geht es dir/Ihnen? (How are you (informal/formal)?) I wanted to see how
well the systems distinguish between the formal and informal singular "you".
2. Ich heiße Dietrich Schröder. (My name is Dietrich Schröder.) I wanted to test
the speech recognition of the systems, particularly their ability to recognize the
phonemes [] and [] when spoken by a non-native German speaker. I also
wanted to see what they do when they do not recognize a word.
3. Der Hund isst einen Apfel. (The dog eats/is eating an apple.) I wanted to see if
the systems can identify context. The third person singular conjugations of the
verbs essen "to eat" and sein "to be" are both pronounced /ɪst/.
4. Sie hat meinem Vater ein blaues Hemd gekauft. This sentence has many
aspects I wanted to test. The first is that the pronoun sie can mean either "she",
"they", or "you" (formal, singular). I wanted to see if the systems could identify that
it is "she" based on the conjugation of haben "to have" in this sentence, hat.
In this sentence, the tense is the Perfect, which is more common in German than the
Simple Past. English, however, should use the Simple Past. The syntax of the
sentence is very different from that of the English equivalent. This sentence also uses
three different cases: nominative, dative, and accusative. I wanted to see if the
systems can properly identify them. Acceptable translations are:
a. She has bought my father/dad a blue shirt.
b. She has bought a blue shirt for my father/dad.
c. She bought my father/dad a blue shirt.
d. She bought a blue shirt for my father/dad.
5. Dann sind wir ja fertig. (So we are done.) I wanted to test the systems'
abilities to detect speech particles. Dann and ja, literally "then" and "yes", are
being used here as speech particles. Dann should be translated into English as
"so", whereas ja should not be translated into English at all.
6. Du bleibst doch hier? (You are staying here, aren't you?) This sentence also
tests speech particles. Doch is a very tricky word to translate into English. Its
meaning is based solely on context and can be translated differently from person
to person.
7. Morgen ist Freitag, oder? (Tomorrow is Friday, right?) In this sentence I am
testing both context and speech particles. Morgen can mean either "tomorrow"
or "morning". The particle oder at the end of a sentence means "right?" or "correct?"
as an interrogative. As a conjunction, it means "or".
For each system, I spoke the following English sentences for translation into German, for the
following reasons:
1. I am taking the train tomorrow at 9 a.m. (Ich fahre morgen um 9 Uhr mit der
Bahn.) In this sentence, present progressive is being used in English to denote
the near future. I wanted to see if the systems would recognize this, since future
tense is used differently in German and English. I also wanted to test the syntax
that would be produced in German.
2. She is going to go to university next fall. (Sie wird nächsten Herbst zur
Universität gehen.) I wanted to test the systems' ability to distinguish between
the two different meanings of "to go" being used in the same sentence, i.e. future
tense and the motion verb. This sentence also tests syntactic output.

Jibbigo

German into English
1. System Output: How are you? My Comments: Speech recognition was bad the first few tries.
2. System Output: My name is Dietrich from both. My Comments: Speech recognizer unable to interpret [...]
3. System Output: The dog is an apple. My Comments: [...] differentiate based on context
4. System Output: She is my father bought a blue shirt. My Comments: Very bad
5. System Output: Yes, then we are ready. My Comments: Correct translation but unidiomatic. Did not recognize the inflection denoting a question.
6. System Output: Do you stay here? My Comments: Did not recognize my speech fully
7. System Output: Tomorrow is Friday, right? My Comments: Perfect translation.

English into German
1. System Output: Ich nehme den Zug Morgen um 9:00. My Comments: Syntax is right but the verb choice is [...] Proper tense used.
2. System Output: Sie geht zur Universität nächsten Herbst. My Comments: Wrong future tense

Google Translate

German into English
1. System Output: How are you?
2. System Output: My name Dietrich Schröder. My Comments: I had to repeat Schröder. Left out verb.
3. System Output: The dog is an apple. My Comments: [...] differentiate based on context
4. System Output: She bought my dad a blue shirt. My Comments: Perfect translation.
5. System Output: Then we are ready. My Comments: Did not recognize the inflection denoting a question. Left out [...]
6. System Output: You'll stay here. My Comments: Did not translate particles or [...]
7. System Output: Tomorrow is Friday, right? My Comments: Perfect translation.

English into German
1. System Output: Ich nehme den Zug Morgen um 9 a.m. My Comments: Syntax is right but the verb choice is [...] Proper tense used. Did not know what was meant by [...]
2. System Output: Sie geht auf die Universität nächsten Herbst My Comments: Could not [...] between the different uses of "to go". Used wrong [...]

Voice Translator Free

German into English
1. System Output: How are you doing?
2. System Output: I hot Dietrich Schröder. My Comments: Recognized what I said, but mistranslated heiße as "hot" (which is an alternate meaning of the word).
3. System Output: The dog is an apple. My Comments: [...] differentiate based on context
4. System Output: She bought a blue shirt my father. My Comments: Bad syntax.
5. System Output: Then we're so done. My Comments: Bad translation.
6. System Output: You stay here you. My Comments: Bad translation. Did not understand meaning of [...]
7. System Output: Tomorrow is Friday, or? My Comments: Did not understand meaning of particle.

English into German
1. System Output: Ich nehme den Zug Morgen um 9:00. My Comments: Syntax is right but the verb choice is [...] Proper tense used.
2. System Output: Sie wird zur Universität gehen nächsten Herbst. My Comments: Right tense but wrong syntax.

Ratings Comparison
[Ratings table; surviving labels: Sentence, Jibbigo, Avg. 1, Avg. 2.]



My Conclusion:
Each of the systems I tested gave an average performance:
Jibbigo's translations were adequate, but it was bad at syntax and context cues. The
speech recognizer had a hard time understanding my non-native, but intelligible, German accent.
The speech synthesizer was also very hard to comprehend. The user interface was not very good.
Google Translate did well when translating from German into English, but did
poorly when translating from English into German. The speech recognizer had very little trouble
understanding my non-native German accent. The speech synthesizer was very clear and
comprehensible. The user interface on this app was the smoothest and easiest to use.
Voice Translator Free's translations were adequate, but it was bad at syntax and context
cues. The speech recognizer worked very well, since it utilized Google's speech recognition
technology. However, the speech synthesizer used was hard to comprehend. The interface was
also full of glitches and not very smooth.
I would conclude that none of these three speech-to-speech apps performed better than
the others. They all need improvements in machine translation, speech recognition and speech
synthesis to some extent.


